English Pages 508 Year 2017
SOCIOMETRICS AND HUMAN RELATIONSHIPS Analyzing Social Networks to Manage Brands, Predict Trends, and Improve Organizational Performance
BY
PETER A. GLOOR MIT Center for Collective Intelligence, Massachusetts Institute of Technology, Cambridge, MA, USA
Emerald Publishing Limited
Howard House, Wagon Lane, Bingley BD16 1WA, UK
First edition 2017
Copyright © 2017 Peter A. Gloor
Reprints and permissions service Contact: [email protected]
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN: 978-1-78714-113-1 (Print) ISBN: 978-1-78714-112-4 (Online) ISBN: 978-1-78714-725-6 (Epub)
ISOQAR certified Management System, awarded to Emerald for adherence to Environmental standard ISO 14001:2004. Certificate Number 1985 ISO 14001
CONTENTS
Acknowledgments
1. Introduction
1.1. Roadmap
1.2. Key Takeaways of This Book
1.3. Study Plan for a One-Semester Course
1.4. Sample Course Syllabus
PART I. TREND PREDICTION BY MEASURING SOCIAL NETWORKS
2. Coolfarming Organizations
2.1. Knowledge Flow Optimization through Organizational Social Network Analysis
2.2. The Coolfarming Data Collection and Analysis Process
3. Coolhunting and Trend Forecasting on the Web
3.1. Measuring Collective Awareness
3.2. The Coolhunting Process — Finding Trends by Finding Trendsetters
4. The Six Honest Signals of Collaboration
4.1. The Honest Signals Have Different Meanings for Different Organizations
4.2. Virtual Mirroring Leads to Change
4.3. Dealing with Privacy Concerns
4.4.–4.7. How to Apply Knowledge Flow Optimization Four Examples Areas of E-Mail-Based SNA Improving Financial Capital through Optimizing Social Capital
5. Essentials of Social Network Analysis and Statistics
5.1. Basics of Social Network Analysis (SNA)
5.2. Basics of Statistics
6. How Ideas Spread in Online Social Networks — Readings
6.1. Theories of Information Diffusion
6.2. Spreading Ideas on Facebook
6.3. Finding Fake Reviews through Machine Learning
6.4. Measuring Financial Performance
6.5. Calculating Demographic Information
6.6. Predicting Election Outcome
PART II. ANALYZING STRUCTURE, DYNAMICS, AND CONTENT OF NETWORKS WITH CONDOR
7. The Four-Step Analysis Process
7.1. Social Media Fetchers
7.2. Social Media Filters
7.3. Social Media Visualizers
7.4. Social Media Exporters
8. Getting Started with Condor
8.1. Analyzing the Facebook Wall with Condor
8.2. Sample Four-Step Analysis with Twitter
8.3. Measuring the Importance of Brands through Betweenness of Actors in Bipartite Graphs
8.4. Pruning the Leaves in a Graph
8.5. Degree-of-Separation Search with Google CSE
8.6. Degree-of-Separation Search with Twitter
8.7. Wikipedia Search
9. Analyzing E-Mail with Condor
9.1. Creating a Virtual Mirror of Your Own Mailbox
9.2. Finding COINs through Community Detection
9.3. Creating a Virtual Mirror of an Organization
9.4. Analyzing Hillary Clinton’s Mail
9.5. Organizational Aspects of E-Mail-Based SNA
9.6. Follow-on Exercises
9.7. (Partial) List of E-Mail Studies Conducted by the Author in Various Organizations
10. Calculating Personality Characteristics from E-Mail
10.1. Calculating Correlations between FFI and E-Mail
10.2. Developing a General Prediction Formula
10.3. Adding Gender, Ethnicity, and Nationality as Control Variables
10.4. Follow-on Exercises
11. Predicting Criminal Intent from E-Mail — Analyzing the Enron E-Mail Archive
11.1. Exploratory Analysis
11.2. Identifying Criminal Actors through Their Honest Signals of Collaboration
11.3. “Tribefinder” — Identifying Criminals through Machine Learning in Condor
11.4. Follow-on Exercises
12. Coolhunting on the Internet with Condor
12.1. Expert Analysis — Websites and Blogs
12.2. Swarm Analysis — Wikipedia
12.3. Analysis of the Crowd — Twitter
12.4. Follow-on Exercises
12.5. (Partial) List of Internet Coolhunting Studies
13. Coolhunting — Francogeddon
13.1. Follow-on Exercises
14. Coolhunting the US Presidential Elections
14.1. Bernie Sanders’ Presidential Campaign — The Perfect COIN
14.2. Coolhunting Bernie Sanders, Hillary Clinton, Jeb Bush, and Donald Trump
14.3. Tribefinder on Twitter (Using Machine Learning)
14.4. Follow-on Exercises
PART III. AUTOMATIC MEDIA INSIGHTS COIN ASSESSMENT (AMICA)
15. Inside Media Individual Collaboration (IMIC)
15.1. IMIC Annotation Process
16. Outside Media Individual Collaboration (OMIC)
16.1. OMIC Annotation Process
17. Inside Media Organizational Collaboration (IMOC)
17.1. IMOC Annotation Process
18. Outside Media Organizational Collaboration (OMOC)
18.1. OMOC Annotation Process
18.2. Follow-on Exercises
19. Survey of Individual and Organizational Collaboration (SIC & SOC)
19.1. Survey of Individual Collaboration (SIC)
19.2. Survey of Organizational Collaboration (SOC)
19.3. Sample Download
PART IV. APPENDIX — USEFUL MACHINE LEARNING AND GRAPH ANALYSIS TOOLS
Appendix A: Identifying Anti-Vaxxers through Machine Learning Using KNIME
Appendix B: Generating Nice Graph Pictures with Gephi
Appendix C: Sample Mid-Term Exam
Appendix D: References
Biography
Index
ACKNOWLEDGMENTS
The tools and methods described in this book have been developed and tested over the last 12 years in the Collaborative Innovation Networks (COINs) seminar. I am deeply grateful to all my instructor colleagues, and of course to the hundreds of students from the United States, Finland, Germany, Switzerland, Chile, Italy, South Korea, and China who have contributed many creative ideas, and have taught me what works, and what does not. The COINs seminar was started at MIT Sloan in spring 2005. In the fall of the same year, the seminar morphed into a virtual distributed course joined by students from Helsinki, supervised by Maria Paasivaara and Casper Lassenius, students from Cologne tutored by Detlef Schoder and Kai Fischbach, and students from Savannah College of Art and Design (SCAD) lectured by Christine Z. Miller. In the meantime, the seminar has also repeatedly been taught at Pontificia Universidad Catolica Santiago de Chile, coached by Cristobal Garcia Herrera, and at the University of Applied Sciences Northwestern Switzerland, where Michael Henninger has been the indispensable instructor. Since 2011, the students from Cologne have been coached first by Johannes Putzke, and since 2014 by Gloria Volkmann, while at the University of Bamberg, students have been instructed by Kai Fischbach and Matthaeus Zylka.
The software tool Condor that is the basis of this course was started in 2003, when the Center for Digital Strategies at Dartmouth College under the leadership of Hans Brechbühl and Eric Johnson agreed to support Yan Zhao’s software development efforts as part of her Master’s thesis supervised by Fillia Makedon. For the next three years, Yan, ably supported by the algorithm genius of her husband Song Ye, built the first two versions of Condor, originally called TecFlow. At the end of 2006, she passed the baton to Renauld Richardet, who added Apache Lucene’s text processing capabilities. In 2008, Condor development continued in Switzerland at galaxyadvisors, funded by the Swiss Commission for Technology and Innovation (CTI). Michael Henninger, Hauke Fuehres, Martin Stangl, Lucas Broennimann, Marton Makai, and Kevin Zogg from the University of Applied Sciences Northwestern Switzerland (FHNW) worked on building a fundamentally revised version of Condor in the team of Manfred Vogel and André Csillaghy at the Institute for 4D technologies (i4ds). Since 2013, Condor development has been done by my colleagues at galaxyadvisors, Marton Makai, Hauke Fuehres, and Joao Marcos Da Oliveira, supported from 2014 to 2015 by Karsten Packmohr.
This book is the product of many people working together over 14 years, building the tools and methods described here. First of all, I am grateful to Ken Riopelle and Michael Henninger, who have been essential in making the social media analysis tool Condor accessible to a wider audience beyond programmers and
statisticians. Ken created the first Condor videos, and wrote a comprehensive manual, the precursor of this book. Michael wrote the first tutorial for Condor in the COINs seminar at the University of Applied Sciences Northwestern Switzerland. Ken Riopelle, Michael Henninger, and Lucas Broennimann provided valuable feedback on earlier versions of this manuscript. Ken also contributed the last section of Chapter 3 of Part II. My sincerest thanks to all of you: without your creative ideas, didactical talent, and Java development and software architecting skills, neither the COINs course nor Condor would exist.
1. INTRODUCTION
Imagine being able to spot if a customer is becoming really unhappy with your product and service — and do something about it before they actually leave you. Imagine finding out what the constituency of a politician or political party really thinks. Imagine finding out what your customers love and hate about your product. Imagine being able to identify your most creative employees, your external innovators, and lead users — and help them become even more creative. Imagine being able to predict who wants to leave your company, your department, or your project team — and not just identify them, but help them become happy and motivated workers again. Imagine identifying potentially fraudulent or risky behavior among your employees before they actually commit anything illegal.
If you are looking for answers to these and similar questions, read on. This book gives you a framework to analyze your organization from the inside, by mining e-mail, Skype, and calendar data, and from the outside, by crunching Twitter, Wikipedia, and blog data.
From your own and your organization’s e-mail, Skype, and calendar data, you can:
• Find out about the happiness of your employees (see Section 9.3).
• Find out about the satisfaction of your customers (see Section 9.3).
• Find out who might be leaving your company (see Section 9.3).
• Find your most creative and motivated employees (see Chapter 10).
• Find out about the willingness of your employees to take unnecessary risks (see Section 11.3).
From Twitter, Wikipedia, and blog interaction data, you can:
• Find out what your customers and prospects really think about your company and your brands (see Chapter 12).
• Measure the strength of your brand (see Chapter 12).
• Find out about the demographic profile of the customers and aficionados of your company and brands (see Section 14.3).
• Forecast the popularity and voter share of a politician (see Section 14.2).
• Find out about the demographic profile of the voters of a politician (see Section 14.3).
These are just a few of the use cases that we will address to study how humans communicate and collaborate inside the organization, through e-mail, chat, videoconferencing, and face-to-face communication, and outside, on online social media. Better communication leads to better collaboration, which leads to more and better innovation!
This book describes algorithms and tools to find and support collaboration within and between organizations. Our approach puts a lens to the organization by mining electronic communications such as e-mail, sociometric badges, telephone, chat, online meetings, Web/videoconferencing, and calendars to make existing communication patterns visible. The Condor software tool, which has been developed over the past decade at the MIT Center for Collective Intelligence and the University of Applied Sciences Northwestern Switzerland, mines these electronic archives and generates a broad range of structural, temporal, and content-based social network metrics, which can be used to calculate and forecast all of the real-world insights mentioned above (Figure 1).
Figure 1. Focus of This Book.
This book provides a practical guide to Coolhunting and Coolfarming on online social media. It explains how to “Coolhunt”: to find cool trends by finding the trendsetters on Twitter, Facebook, Wikipedia, blogs, online forums, and e-mail. It also teaches how to optimize your own communication behavior by creating a personal virtual mirror from your own e-mail, Skype log, online calendar, or chat log. It then extends this approach to “Coolfarming” an organization by improving collaboration and innovation through finding the best communication behavior to reach a certain goal. It mirrors back to the organization its current communication behavior by mining its e-mail, phone, Web conferencing, or online calendars. This virtual mirror of communication deficiencies helps the organization to change its communication behavior for better performance and innovation.
The first part of the book explains the theory behind Coolhunting and Coolfarming, the second part provides a series of in-depth hands-on tutorials to analyze online social networks, and the third part introduces Automatic Media Insights COIN Assessment (AMICA), a specific method that uses Condor to apply the procedures and processes introduced in Part II to measure and increase individual and organizational creativity and performance through virtual
mirroring. After having worked through the examples, you will be able to improve your own and your organization’s communication for better collaboration and better innovation:
First, you will know more about yourself by understanding the social network where you, as an individual, are embedded, through analyzing your mailbox and your Web network.
Second, you will be able to understand and optimize the communication network of your organization by analyzing its e-mail and other communication archives. This analysis might increase an organization’s creativity, its employees’ satisfaction, or its sales success.
Third, you will be able to identify your best customers, your key competitors, and your possible business partners through your communication patterns and position in online social media such as Twitter, blogs, Facebook, and Wikipedia.
This book is geared toward students and practitioners with a background in management, human resources, marketing, design, sociology, psychology, and the humanities. It includes numerous examples with the user-friendly software tool Condor, which analyzes all types of online social networks such as Twitter, Wikipedia, blogs, and Facebook, as well as e-mail. The book is a brief and targeted guide with step-by-step instructions, with the objective of delivering immediately actionable insights for anybody interested in analyzing online social networks. It explains how to visualize, track, and manage brands, products, and topics on the Internet through online social media, and how to analyze organizations through their e-mail networks. The book translates the latest academic research into practical business strategies and techniques. It provides a wealth of examples of how to apply social network analysis (SNA) for the prediction of trends by mining Twitter, Wikipedia, blogs, and Facebook.
It also illustrates how to improve organizational performance by optimizing communication and collaboration using individual and organizational e-mail archives.
The book is based on a course on Collaborative Innovation Networks (COINs) that has been taught for the last 12 years to students forming virtual teams participating from universities in Boston, Savannah, Helsinki, Cologne, Brugg, Bamberg, Rome, and Chile,¹ with majors in business, statistics, education, design, computer science, psychology, and sociology. In this course, students use and analyze social media to answer complex questions impacting society. The course teaches students how to leverage virtual collaborative creativity in the Internet age. It helps them understand and apply the dynamics of online communication using e-mail, social media, Twitter, Wikipedia, and the Web. This is done using online SNA with Condor. The examples in this book have been drawn from class projects from this course. The book includes a free academic license of Condor to analyze dynamic semantic social networks.
¹ MIT, Savannah College of Art and Design, Aalto University Helsinki, University of Cologne, University of Applied Sciences Northwestern Switzerland, University of Bamberg, University Tor Vergata Rome, Pontificia Universidad Catolica Santiago Chile.
1.1. ROADMAP
1.1.1. Part I — Trend Prediction by Analyzing Social Networks
• Chapter 2, Coolfarming Organizations
This chapter describes the key principles of how innovation can be improved by better collaboration and better communication. It shows how analyzing social networks at companies by mining online communication archives, such as e-mail, Skype, calendars, and phone logs, works, and how organizational performance can be optimized through virtual mirroring.
• Chapter 3, Coolhunting and Trend Forecasting on the Web
This chapter gives an introduction to the key principles of Coolhunting. Coolhunting measures global consciousness by analyzing the wisdom (and madness) of the crowd on Twitter, the (paid) wisdom of experts on blogs and online newspapers, and the wisdom of swarms on Wikipedia, Facebook groups, and online forums.
• Chapter 4, The Six Honest Signals of Collaboration
This chapter introduces six social indicators of creative collaboration, “the six honest signals,” developed over the last 12 years by the MIT research group where Condor was created. The indicators are collected and measured through tweets, blog links, Wikipedia entries, e-mail archives, and body signals captured through sensors. These “honest signals” are predictive of the future creativity, performance, and outcomes of teams. Changing individual communication behavior to adhere to these six indicators will lead to better communication, collaboration, and more innovative results. The six indicators are central leadership, rotating leadership, balanced contribution, rapid response, honest language, and shared context.
• Chapter 5, Essentials of Social Network Analysis and Statistics
This chapter gives a short introduction to SNA, which is needed to do a social media analysis. It describes actor-level metrics such as degree and betweenness centrality, contribution index, and path length, as well as group-level metrics such as density, group degree, and group betweenness centrality. It also introduces the basic statistical techniques (t-tests, correlation, regression), illustrated using the KNIME environment described in the appendix, that are needed to understand predictive analytics for forecasting organizational variables such as employee satisfaction, personality characteristics, or sales success based on e-mail communication in the organization. The same statistics are needed to analyze online social media such as Twitter to predict friends and foes of politicians, the outcome of elections, or who will win an Oscar.
• Chapter 6, How Ideas Spread in Online Social Networks — Readings
This chapter briefly presents the insights from 22 key papers that provide the theoretical background for the examples described in Part II. They are structured into theories of information diffusion, how ideas spread on Facebook, how machine learning can be more accurate than human judgment in analyzing online social networks, how stocks and other financial indicators can be predicted from Twitter, Google, and Wikipedia, how demographic information can be mapped to real-world users by geographic and other indicators, and how the outcome of elections can be
predicted from social media. If this book is used for a classroom course, students may be asked to read and present the papers in the classroom as part of the course.
1.1.2. Part II — Analyzing Structure, Dynamics, and Content of Networks with Condor
The second part of the book describes how to use Condor for the Coolhunting and Coolfarming described in Part I.
• Chapter 7, The Four-Step Analysis Process
This chapter describes the key analysis process in Condor, starting with collecting communication data not only from Twitter, Facebook, Wikipedia, and blogs, but also from e-mail and other types of organizational communication archives such as calendars. The collected data is then preprocessed and cleaned using a series of content filters. In the next step, Condor provides a variety of visual analysis tools to explore the social network in many different ways. In the last step, the data is exported as actor-level variables and time series for further statistical analysis in tools like Excel, KNIME, R, or SPSS.
• Chapter 8, Getting Started with Condor
This chapter introduces the basics of Condor on Mac and Windows, including how to install MySQL and Java, which are needed for Condor. It uses precollected datasets from Twitter, Wikipedia, Facebook, blogs, and e-mail to teach the essentials of digital network, sentiment, and content analysis using Condor.
It also introduces degree-of-separation search to measure the importance of influencers, brands, and products on the Web.
• Chapter 9, Analyzing E-Mail with Condor
This chapter teaches how to use Condor to analyze e-mail networks and discover hidden communities. Creating a social network to map a personal mailbox will give unprecedented insights into whom one is working most closely with, who the hidden influencers are, whom one likes the best, and whom one respects the most, and how these measures can change over time as the network of relations changes with new projects, employees, suppliers, and clients. The same SNA can also be extended to teams and entire companies. The second example in this chapter analyzes the network of an entire organization using the e-mail communication of a class of 50 students working in 10 teams. The third example analyzes Hillary Clinton’s e-mails released as part of the controversy about her use of a private e-mail server while she was serving as the US Secretary of State. The network map can be used as the foundation to improve communication within the organization, by identifying bottlenecks, collaborators, hidden influencers, and people bridging structural holes. It can also be used to improve knowledge flow in business processes, and to increase organizational effectiveness by tracking and improving employee satisfaction, customer satisfaction, employee turnover, and salesforce effectiveness. The analysis is based on the “six honest signals of collaboration” introduced in Part I.
• Chapter 10, Calculating Personality Characteristics from E-Mail
In this chapter, we calculate the personality characteristics of individuals based on their e-mailing behavior. We compare the six honest signals of collaboration of individual actors with their personality characteristics measured through the Big Five personality test. This test measures Neuroticism, Extraversion, Openness to experience, Agreeableness, and Conscientiousness through a survey and is commonly used by psychologists to assess personality characteristics.
• Chapter 11, Predicting Criminal Intent from E-Mail — Analyzing the Enron E-Mail Archive
In this chapter, we try to catch criminals based on their e-mailing behavior, by analyzing the e-mail archive of Enron. The Enron e-mail archive documents the spectacular crash of the Texan energy trading firm Enron at the end of 2001. Enron’s downfall has been widely publicized and has also been described in the book The Smartest Guys in the Room and in a movie of the same name. We identify differences in the six honest signals of collaboration between ordinary Enron employees and the convicted criminals, which in theory could be used to identify potential suspects in other e-mail archives. The chapter also introduces “tribefinder,” which uses Condor’s machine learning capability to identify people with communication patterns similar to those of the convicted criminals.
• Chapter 12, Coolhunting on the Internet with Condor
This chapter illustrates how to use Condor for analyzing the importance of a brand on the Internet.
Coolhunting for a brand consists of identifying the context of a brand, in particular its competitors, measuring the relative strength of the brand and its competitors, and identifying the brand’s associated influencers, ranking them by their impact. Thanks to the availability of geotagging, this analysis can be done globally, and can also be restricted by geography, drilling down into different target markets. The Coolhunting process is illustrated using Condor by tracking a brand on the Web, Wikipedia, and Twitter.
• Chapter 13, Coolhunting — Francogeddon
This chapter illustrates Coolhunting on the Web and on Twitter, measuring global awareness during Francogeddon, when on January 15, 2015, the Swiss National Bank unexpectedly removed the link between the Euro and the Swiss Franc, leading to huge global currency fluctuations and the bankruptcy of some hedge funds. Global sentiment about those events is analyzed through tweets about “Swiss Franc,” “Euro,” and “USD.”
• Chapter 14, Coolhunting the US Presidential Elections
This chapter gives a detailed example of Internet Coolhunting by analyzing and predicting the outcome of elections. The 2016 US Presidential election provides an excellent opportunity to study Coolhunting and Coolfarming. Not only are the US Presidential elections fought to a large extent on social media, but the differing styles of the candidates also offer a prime example of the difference between COIN-based and hierarchical leadership styles. Using machine learning in “tribefinder,” Condor identifies members of the “Bernie Sanders tribe” and the “Donald Trump tribe,”
people with Twitter behavior similar to known Donald Trump and Bernie Sanders fans.
1.1.3. Part III — Automatic Media Insights COIN Assessment (AMICA)
AMICA is an assessment of individual and group behavior that measures, compares, and optimizes the collective mindset of an individual, an organization, or a company. AMICA identifies which types of communication patterns are indicative of the most efficient and effective collaboration and helps individuals and organizations to improve their collaborative behavior.
• Chapter 15, Inside Media Individual Collaboration (IMIC)
IMIC measures the collaboration behavior of individuals inside an organization, based on their e-mail, Skype, and calendar archives. It displays results as a comparative radar chart and offers a drill-down with social network charts, scatter plots, and bar charts.
• Chapter 16, Outside Media Individual Collaboration (OMIC)
OMIC measures the collaboration behavior of individuals seen from the outside through online social media such as Twitter, Facebook, Wikipedia, and Google search. It starts with an analysis of an individual’s footprint on Twitter and drills down through a Wikipedia, Facebook wall, and Google Blog search analysis. This chapter also introduces Twitter EgoFetcher, which makes it possible to measure the echo chamber of an individual on Twitter, similar to an individual mailbox analysis.
• Chapter 17, Inside Media Organizational Collaboration (IMOC)
IMOC measures the collaborative performance of an organization based on the organization’s e-mail, Skype, and calendar archives. It compares departments, business units, or companies using the group measures of Condor.
• Chapter 18, Outside Media Organizational Collaboration (OMOC)
OMOC measures the collaboration behavior of companies from the outside through online social media such as Twitter, Facebook, Wikipedia, and Google search. It starts with an analysis of the organization’s footprint on Twitter and drills down with Wikipedia and Google Blog search analysis.
• Chapter 19, Survey of Individual and Organizational Collaboration (SIC & SOC)
The four automated online media-based assessments of AMICA are complemented by two survey-based assessments, Survey of Individual Collaboration (SIC), focusing on the collaborativeness of the individual, and Survey of Organizational Collaboration (SOC), with a focus on the organization.
1.1.4. Part IV — Appendix — Useful Machine Learning and Graph Analysis Tools
The appendix describes KNIME and Gephi, two additional tools besides Condor that are useful for mapping the collective mind on online social media.
• Appendix A — Identifying Anti-Vaxxers through Machine Learning Using KNIME
This appendix describes how to use machine learning to distinguish supporters and opponents of the “Anti-Vaxxer” theory through their online behavior. It analyzes a dataset of tweets that was collected in spring 2015. The resulting tweets, together with information about the tweeters, were used to manually classify two sets of tweets, one belonging to pro-vaxxers and the other belonging to anti-vaxxers. It illustrates the use of KNIME, an open-source text mining and data analytics tool with a visual frontend, to investigate whether pro- and anti-vaxxers use function words in different ways, and thus to develop an automatic way of profiling online users based on their behavior.
• Appendix B — Generating Nice Graph Pictures with Gephi
This appendix discusses how to use the open-source graph drawing tool Gephi to draw and manipulate graphs with additional functionality and layout options not available in Condor, such as clustering, segmenting, and pruning the networks.
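The idea behind Appendix A, that two groups of tweeters may use function words at different rates, can be illustrated in a few lines of plain Python. This is only a sketch under stated assumptions, not the KNIME workflow from the appendix: the word list, the sample tweets, and the function name `function_word_rates` are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words (an assumption for illustration).
FUNCTION_WORDS = ("i", "we", "you", "they", "the", "not", "but", "because")


def function_word_rates(texts):
    """Return each function word's share of all tokens in a collection of texts."""
    tokens = [tok for text in texts for tok in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in FUNCTION_WORDS}


# Made-up example tweets standing in for two manually labeled groups.
group_a = ["We trust the science because the data is clear"]
group_b = ["They do not tell you the truth but I know"]

rates_a = function_word_rates(group_a)
rates_b = function_word_rates(group_b)

# Function words sorted by how strongly their usage rate differs between the groups.
differences = sorted(
    FUNCTION_WORDS, key=lambda w: abs(rates_a[w] - rates_b[w]), reverse=True
)
for word in differences:
    print(f"{word:8s} group A: {rates_a[word]:.3f}  group B: {rates_b[word]:.3f}")
```

In a real study, the per-word rates would be computed per user over many tweets and passed to a classifier as features; the two single-tweet groups here only show the mechanics of the comparison.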
1.2. KEY TAKEAWAYS OF THIS BOOK
The goal of the book is to teach you how to read the collective mind through interpreting honest signals of collaboration. As all of us are part of the collective mind, understanding it will also change our own behavior. It will help you understand who you REALLY are and how you can become who you would like to be.
Applying the principles of social quantum physics introduced in the companion book Swarm Leadership and the Collective Mind: Using Collaborative Innovation Networks to Build a Better Business, this book helps you learn how to build entanglement through empathy, and to reflect and reboot (Figure 2). In particular, after having worked through the examples described in this book, you will know about the following:
• A practical framework for trend prediction based on social media analysis.
• A process description of the “six honest signals of collaboration” as a key mechanism for trend prediction and for increasing organizational creativity, performance, and collaborativeness.
• A tutorial illustrating Coolhunting for trends by finding the trendsetters on online social media.
• Guidelines for Coolfarming through “virtual mirroring” to analyze individual, group, and organizational e-mail archives to increase personal, group, and organizational effectiveness and creativity.
Figure 2. Four Principles of Social Quantum Physics.
• AMICA, a method for conducting virtual mirroring and increasing collaboration through analyzing inside media such as e-mail, and outside media such as Twitter and Wikipedia, on the individual and organizational level.
• Step-by-step tutorials to get started with the user-friendly automated social media analysis and monitoring tool Condor.
• Using machine learning with “tribefinder” to segment online profiles according to sociodemographic criteria based on their different attributes.
• Detailed descriptions to tackle hard business problems through mining communication archives:
  • Track the happiness of your employees
  • Track the satisfaction of your customers
  • Predict which employees consider leaving your company
  • Locate your most creative and motivated employees
  • Forecast the propensity of your employees to take unnecessary risks
  • Track the opinion of customers and prospects about your company and your brands
  • Track the strength of your brand
  • Discover the demographic profile of friends and foes of your company and brands
  • Predict the popularity and voter share of a politician
  • Discover the demographic profile of the voters of a politician and lovers of a brand.
Sociometrics and Human Relationships
1.3. STUDY PLAN FOR A ONE-SEMESTER COURSE
This section describes the syllabus and study plan of the COIN course, which has been taught for the last 12 years at dozens of universities around the world. The course is usually run as a collaboration between different universities, with students from different universities working together as teams. In this course, students learn how to build collective consciousness by becoming “entangled” through building empathy with team members from other countries and cultures (Figure 2). By reflecting on their own communication behavior, they will also get to know themselves better through the eyes of others. This will lead them to reboot, that is, to change their own behavior.
The emergence of online social networks opens up unprecedented opportunities to read the collective mind, discovering emergent trends while they are still being hatched by small groups of creative individuals. Using concepts from psychology and sociology, this course gives students the opportunity to lead and work in a wide range of projects analyzing large corpora of digital traces of human activity. The Web has become a mirror of the real world, allowing course participants to study and better understand why some new ideas change our lives, while others never make it from the drawing board of the innovator.
The aim of the COIN course is to track the emergence of new ideas through SNA. Using concepts from sociology and psychology, students predict what people will be doing next, by analyzing their social interactions on three levels: (1) global, on the Internet, blogs, Twitter, Facebook, and Wikipedia; (2) organizational, through e-mail, phone, and chat; and (3) individual, through collecting body signals. Students measure “honest signals” of communication through tweets, bloglinks, Wikipedia entries, e-mails, chats, and phone archives, and through body signals captured by cameras and other sensors.
The COIN seminar is a demanding course combining skills from many fields:
• SNA
• Psychology, sociology, and management
• Using and building software tools for analyzing online social networks
• Selected statistical methods for data mining and data filtering
• Concept visualization and information modeling.
Learning Goals
• Students learn to work and cooperate in virtual international teams
• Students learn to analyze and visualize chosen topics with the help of an appropriate software tool
• On the global level, students learn to correlate Web sentiment with macroeconomic indicators, and blog buzz with the outcome of political elections
• On the organizational level, students learn to compare performance metrics such as revenue, productivity, peer ratings, and customer satisfaction with e-mail network metrics in a variety of settings.
1.4. SAMPLE COURSE SYLLABUS
Below is a sample syllabus for a one-semester course, with a two-hour class every week. Each numbered entry is one two-hour lesson.

1. Introduction to Swarmcreativity (based on the companion book Swarm Leadership and the Collective Mind). Preparatory reading: Chapter 2, Chapter 3, Chapter 4.

2. Introduction to Condor: Following the introductory Condor chapter of this book, students experiment with Condor in class (Chapter 8). Preparatory reading: Chapter 7.

3. Basics of Social Network Analysis and Statistics: If the students have no previous experience in SNA, a two-hour class explains the basic SNA metrics such as betweenness and degree centralities of actors and networks, t-tests, correlations, and regressions (Chapter 5).

4. Presenting an Analysis of Own E-Mail Network in Class: As a first assignment, each student briefly presents the results of analyzing her/his own e-mailbox, Skype network, or Facebook wall (described in Section 9.1). Preparatory reading: Chapter 9.

5. Presenting Papers 1: In five-minute presentations, students present the first 13 papers (Chapter 6).

6. Presenting Results of Individual Coolhunting: Each student briefly presents the results of an individual Web Coolhunting project about a topic of her/his choice, collecting and analyzing Twitter, blog, and Wikipedia data. Preparatory reading: Chapter 12.

7. Mid-Term Exam: Consists of six to eight questions on SNA, statistics, Coolhunting, and Coolfarming. The second part of the exam consists of a Condor Web Coolhunting task. A sample mid-term exam can be found in the appendix.

8. Team Formation: Students from different sites introduce themselves briefly (one minute per student) using Skype, Hangout, or WebEx. Then the topics for the teamwork are presented by the instructors. Next, students sign up for a project, for example, using Doodle. Each student may choose two topics; there have to be students from at least two locations in each team, and a team may have at most five members. Students will cc all their team-specific e-mail traffic to a dummy e-mail address, for example, [email protected], to be used for the virtual mirror in Lesson 12.

9. Virtual Meeting 1: Each team presents in 5 to 8 minutes, using SCRUM. SCRUM is an agile prototyping-oriented software development method, where developers work in iterations called “sprints,” and show prototypes to each other frequently. Structure of the presentation per team:
• Project goals
• Overall project plan
• Plan for first iteration
• Your way-of-working

10. Presenting Scientific Papers 2: In five-minute presentations, students present the second group of 12 papers (Chapter 6).

11. Virtual Meeting 2. Structure of the presentation per team (five minutes per team):
• Goals of the project
• Progress during the last iteration: explain what was done
• Show the main results
• Goals and plans for the next iteration
• Output from the retrospective
• Problems/questions?

12. Virtual Meeting 3/Virtual Mirror: Students get the e-mail data of their team and of the full class. Each team will present a virtual mirror as described in Section 9.3. Structure of the presentation per team (8 to 10 minutes per team):
• Presentation of the virtual mirror
• Goals of the project
• Progress during the last iteration: explain what was done
• Show the main results
• Goals and plans for the next iteration
• Output from the retrospective
• Problems/questions?

13. Virtual Meeting 4. Structure of the presentation per team:
• Goals of the project
• Progress during the last iteration: explain what was done
• Show the main results
• Goals and plans for the next iteration
• Output from the retrospective
• Problems/questions?

14. Final Presentations (10 minutes/team):
• Goals of the project
• Related work (what others have done in the same area)
• Work process (how did you organize your work)
• Results
• Possible extensions
• What worked well/what could have been done better (both from the team perspective, and advice for the instructors)
Virtual meetings can either be conducted in the local classrooms, with the classrooms at different sites connected by Web conferencing, or students may be allowed to connect from home. We found that a mix of on-site and off-site virtual meetings works best. In on-site virtual meetings, students are asked to participate from the classroom, to increase bonding and knowledge sharing between teams. In the off-site virtual meetings, students are allowed to join remotely from wherever is most convenient for them.
PART I. TREND PREDICTION BY MEASURING SOCIAL NETWORKS
This first theoretical part gives an introduction to the basic concepts of COINs, Coolhunting, and Coolfarming. Coolfarming means using dynamic semantic social network analysis to increase individual and organizational creativity by creating and nurturing COINs. Coolhunting means finding trends by finding the trendsetters by finding the COINs through dynamic semantic social network analysis.
© 2017 Peter A. Gloor
2 COOLFARMING ORGANIZATIONS
CHAPTER CONTENTS
• What is Coolfarming?
• Knowledge Flow Optimization
• The Coolfarming Data Collection and Analysis Process.
When Robinson Crusoe was stuck on a lonely island for years, in spite of plenty of food and a subtropical climate, he had only one wish: to finally meet and connect with other people. Relationships form the core of human existence. The way we communicate in our relationships is key for building private and professional success and happiness. In this book, we will learn how to analyze and measure individual, organizational, and global social networks by mining online communication archives such as e-mail, Twitter, Facebook, and blogs to increase collaboration and creativity through better communication.
In this initial chapter, we look at networks from the perspective of the individual (called ego networks) and from the perspective of the organization (called organizational networks). Our main means to analyze these networks is communication archives: most prominently, e-mail logs, but also chat, online calendars, Web conferencing logs, and phone archives. The graph-theoretical foundation used to analyze these networks is Social Network Analysis (SNA). Classic SNA looks at the structure of networks; in our own work, we have added an analysis of the dynamics of network change over time, and an analysis of the content of the networks, for example, the content of e-mails or tweets.
Dynamic and content-based SNA affords an X-ray into the inner workings of an organization, mapping the informal relationships that transcend organizational hierarchy. It gives an assessment of communication and knowledge flow, resulting in actionable data to optimize outcomes. The Condor software used in our examples provides an interactive dashboard to deep dive into the structures of ego and organizational networks.
Our approach puts a lens to the organization by mining e-mail archives and, as relevant, other electronic communications (e.g., telephone, chat, online meeting, Web/videoconferencing, calendars) to make existing communication patterns visible. The Condor software tool, which has been developed over the past decade at the MIT Center for Collective Intelligence and the University of Applied Sciences Northwestern Switzerland (FHNW), mines electronic archives (e-mail, phone, chat, Web conferencing, and sociometric badges, which are body-worn sensors) and generates a broad range of SNA metrics. In the following description, the term “e-mail” stands for all types of organizational communication archives.
2.1. KNOWLEDGE FLOW OPTIMIZATION THROUGH ORGANIZATIONAL SOCIAL NETWORK ANALYSIS
Just as weather patterns predict sunshine or thunderstorms, communication flows allow us to anticipate positive and negative developments in groups of people. Like a weather forecast, an SNA can serve as an early warning system, revealing the organizational equivalent of sunny days with cool breezes, or of impending storms, in interactions between members of groups. Organizational forecasts of this kind are difficult to obtain by other means.
Business process reengineering forever changed the way companies do business, introducing a process focus and streamlining structured business processes. SNA can do the same for unstructured, knowledge-intensive processes. By visualizing the flow of knowledge (Figure 3), making it transparent, and reengineering its flow, organizations and individuals can become more creative, innovative, and responsive to change.
Figure 3: Knowledge Flow Is the Nervous System of an Organization.
SNA offers companies an opportunity to complement their organizational charts and business process maps with more fluid maps of communication flows. By making the communication flow transparent, organizations can make better use of people by freeing them from being buried in conventional multilayer hierarchies. By establishing flexible ad hoc workflows based on communication flows, people can assume more efficient roles, which also leads to increased individual motivation. Applying these insights to increase organizational creativity is what I call “Coolfarming.”1
1 Gloor (2010).
Coolfarming allows the organization to understand and optimize key parameters of organizational health, for example by identifying its most creative employees and finding the communication patterns of creativity for that particular organization. It can also find the happiest employees and the communication patterns of satisfied employees, as well as identify the communication patterns of dissatisfied employees who are ready to leave the firm. It can identify the honest signals of happy and unhappy customers.
Coolfarming can be done from within the organization, by mining e-mail, calendar, and phone/Skype logs, and by measuring face-to-face interaction using sociometric badges, little devices worn on the body. It can also be done from the outside by mining Twitter, Facebook, blogs, and Wikipedia entries discussing the organization to be analyzed. On the outside, it can locate discussions about the relationship with the company on social media such as Twitter, blogs, and Facebook groups. It can identify productive and less productive collaboration with business partners by tracking e-mail exchange between company executives and their outside business partners. It can find social media linkage between company and business partners. Finally, it can also spot novel business ideas, for instance, by identifying new vocabulary picked up by employees through corporate e-mail in the outside discussion on online social media.

2.2. THE COOLFARMING DATA COLLECTION AND ANALYSIS PROCESS
The Coolfarming data collection process starts with setting up a way to continuously collect the organization’s communication archive (Figure 4). In the next step, outcome metrics such as customer or employee satisfaction need to be defined and measured. In the third step, these outcome metrics are compared against the social networking metrics, in particular, the six honest signals of collaboration introduced in more detail in Chapter 4. In the fourth step, the communication behavior of the organization is continuously tracked and mirrored back to the employees.
Figure 4: Coolfarming Data Collection and Analysis Process.
The Coolfarming process is therefore conducted in four successive steps:
1. Assessing the organization’s existing communication patterns and structures.
2. Benchmarking the organization’s communication patterns (its “honest signals”) against those seen in other organizations doing similar work.
3. Correlating and calibrating communication patterns against performance metrics.
4. Virtual mirroring: Showing individuals how far away they are from optimal communication, which will lead them to change their behavior, which in turn will lead to improved communication, resulting in better collaboration, leading to more innovation.
2.2.1. Assessing the Organization’s Communication Patterns
In the first phase, the social network within the organization is visualized through social network pictures, movies, and other charts and statistics. This way a communication matrix between different business areas can be constructed, and the interactions at the department, role, and employee levels can be analyzed. E-mail-based SNA of the organization on its own can provide insights into the following key points at the divisional, departmental, and role/individual levels:
• Who are key influencers? Who is central in the network?
• How do they behave? Do they contribute to discussions or filter them? Do they assume a collegial/creative work style? Do they respond quickly? What is the sentiment of their conversations?
• Potential bottlenecks and ways to alleviate those.
• Prospective future leaders.
• Hidden innovation teams (COINs).
• Individuals who build strong trust relationships to connect within the organization.
• Differences between in-group and out-group communication behavior.
At the organizational level, e-mail-based SNA can address questions such as how business units interact with the rest of the organization and how outside partners interact with the organization. E-mail-based SNA can reveal otherwise invisible communication behaviors that transcend fixed workflows, revealing patterns present inside an organization and in its interactions with peer groups inside the corporation and with outside organizations. Such analysis can assist senior management in coaching individuals and teams and in redesigning key processes and organizational structures to foster creativity.
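To make the first assessment step more concrete, the following minimal sketch computes two of the quantities mentioned above from a tiny e-mail log: a degree centrality per person (who is central in the network) and a department-to-department communication matrix. It is not how Condor works internally; the names, departments, and messages are invented for illustration, and the helper functions `degree_centrality` and `department_matrix` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical e-mail log as (sender, recipient) pairs; all names are invented.
EMAILS = [
    ("ana", "bob"), ("ana", "cho"), ("bob", "ana"),
    ("cho", "dev"), ("dev", "ana"), ("eva", "ana"),
]
# Hypothetical mapping from person to department.
DEPARTMENT = {"ana": "R&D", "bob": "R&D", "cho": "Sales", "dev": "Sales", "eva": "HR"}


def degree_centrality(emails):
    """Share of the other actors each actor exchanged at least one e-mail with."""
    neighbors = defaultdict(set)
    for sender, recipient in emails:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)
    n = len(neighbors)
    return {actor: len(peers) / (n - 1) for actor, peers in neighbors.items()}


def department_matrix(emails, department):
    """Count messages sent from each department to each department."""
    matrix = Counter()
    for sender, recipient in emails:
        matrix[(department[sender], department[recipient])] += 1
    return matrix


centrality = degree_centrality(EMAILS)
most_central = max(centrality, key=centrality.get)
matrix = department_matrix(EMAILS, DEPARTMENT)
print("most central actor:", most_central)
print("department matrix:", dict(matrix))
```

In a real project the pairs would come from the organization’s mail server rather than a hard-coded list, and degree centrality would be only one of several metrics considered.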
2.2.2. Benchmarking the Organization’s Communication Patterns against Those Seen in Other Organizations
After the initial organizational fingerprint has been revealed, communication patterns within the organization can be compared against those in other organizations. In our projects at MIT and galaxyadvisors, we have studied over 100 different organizations from industries such as automotive, chemical, financial services, health care, management consulting, pharmaceutical, outsourcing, retail, technology, and nonprofit sectors, ranging in size from global top 100 firms to small start-ups, as well as collaborations that occur on the open Internet, for instance, on Wikipedia, Stack Overflow, and other open forums.

2.2.3. Correlating Communication Patterns against Performance Metrics
If the organization has performance metrics that it can share, these can be used to calibrate performance with communication patterns. Performance metrics could, for instance, be customer or employee satisfaction, sales success, completing projects on time and on budget, or the number of patents filed. The correlations between communication behavior and performance variables can then be used to identify which communication patterns are associated with superior performance. These insights can then be taught to the members of the organization to optimize their communication behavior.

2.2.4. Virtual Mirroring
Showing individuals their own communication behavior, and telling them which behavior is desirable (based on Steps 1 to 3 above), will change their behavior toward being more collaborative, and thus more innovative. We were able to demonstrate significant performance improvements in earlier similar projects through virtual mirroring. Figure 5 describes the four steps of virtual mirroring. The last part of this book, “Automatic Media Insights COIN Assessment” (AMICA), introduces a framework to conduct virtual mirroring.
Figure 5: Virtual Mirroring Process.
Before the project starts, key methodological, technical, and legal issues will need to be resolved:
• Agree on the number of actors to be analyzed: will the analysis be of the in-group only (focus exclusively on interactions between people inside the organization); the in-group and corporate peers (includes the above plus interactions between the organization and other units); or in-group, peers, and out-group (all of the above plus interactions with people in outside organizations).
• Agree on the time period to be analyzed.
• Decide if subject line and/or content is to be included in the analysis (without at least one of these, sentiment and innovative influencer analysis cannot be done).
• Decide whether this is a one-time analysis or if the long-term goal is to move toward a continuous collaboration dashboard.
• Resolve privacy/regulatory issues.
• Determine how e-mail can be accessed technically and how the data will be formatted and transmitted.
In the next chapter, we will learn how we can apply the same process of social media analysis and trend prediction to the open Internet through a process we call “Coolhunting.”

MAIN LESSONS LEARNED
• Relationships are at the core of human existence; understanding them allows organizations and individuals to increase their creativity, performance, and happiness.
• Knowledge Flow Optimization streamlines unstructured business processes by constantly monitoring and improving electronic communication.
• Coolfarming nurtures and optimizes COINs through virtual mirroring.
• Comparing knowledge flow in organizational (inside) e-mail, Skype, and calendar networks with outcome variables indicative of performance (e.g., sales success, customer satisfaction, employee creativity) will lead to interventions to increase performance.
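As an illustration of the correlation step in Section 2.2.3, the sketch below computes a Pearson correlation between one communication metric and one performance metric. The data are invented, the choice of average response time and customer satisfaction is an assumption made for the example, and the function `pearson` is a hypothetical helper rather than part of any tool described in this book.

```python
from math import sqrt

# Hypothetical per-team data (all numbers invented for illustration):
# average e-mail response time in hours and customer satisfaction on a 1-5 scale.
RESPONSE_HOURS = [2.0, 4.0, 6.0, 8.0, 10.0]
SATISFACTION = [4.8, 4.1, 3.9, 3.0, 2.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples.

    Assumes neither sample is constant (otherwise the denominator is zero).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(RESPONSE_HOURS, SATISFACTION)
print(f"correlation between response time and satisfaction: {r:.2f}")
```

With this toy data the coefficient is strongly negative, meaning slower responses go together with lower satisfaction. A correlation on so few data points says nothing about causation and would need proper significance testing on real data.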
3 COOLHUNTING AND TREND FORECASTING ON THE WEB
CHAPTER CONTENTS
• Coolhunting measures collective awareness
• Coolhunting combines expert, swarm, and crowd on blogs, Wikipedia and forums, and Twitter
• Coolhunting finds trends by finding the trendsetters.
3.1. MEASURING COLLECTIVE AWARENESS
Does an organization, and thus ultimately humanity, show some sort of consciousness or self-awareness? One might think so, at least in moments such as the day when Princess Diana died, or more recently, that day in April 2013 when I was stuck at home in Cambridge while the Boston Marathon bomber was roaming at large in the neighborhood. In those intense moments, we feel maybe not “collectively intelligent” but certainly “collectively aware” or “collectively conscious.” If we meet a stranger in those moments, we know what they are thinking, namely “it’s so sad Diana died” or “where might the marathon bomber be hiding and hitting next.”
Moments like these motivate an informal definition of “organizational consciousness.” It is analogous to the human body, where the brain is conscious of the toe, and will respond differently depending on whether a person stubs his/her toe on the door or somebody else steps on his/her toe. Extending this metaphor, a “collectively conscious” organization will respond differently if somebody hits a member purposefully or if a member hurts himself/herself. Similarly to the neurons in the brain that communicate through their synapses to create consciousness, humans communicate by interacting with each other verbally, through text, or through other signals, either face-to-face or over long distance by phone or Internet.
To prove the existence of consciousness on the individual level, Descartes famously stated “cogito ergo sum” (“I think, therefore I exist”). Extending this definition to an organization: “if the organization thinks and acts as one cohesive organism, it exists” and thus shows collective consciousness. We define organizational consciousness as the common understanding of an organization’s global context, which allows the members of the organization to implicitly coordinate their activities and behaviors. As an example of a global-level event, in the case of the Boston Marathon bomber, everybody in the Boston area was trying to stay abreast of the most recent developments on Twitter, Facebook, and the news, and looking out for traces of the terrorists. On the organizational level, a well-oiled team of software developers working together closely face-to-face, or using chat or e-mail, to debug a jointly developed application also shows a high level of organizational consciousness, as they are able to coordinate their work with minimal use of words.
In the research by our team described in this book, we aim to make this implicit understanding more measurable, similarly to brain researchers, who measure individual levels of consciousness by attaching probes to individual neurons and tracking the electrical current flowing through the synapses between the neurons. In our work, we measure interaction among people through “Coolhunting” on online media such as e-mail, Twitter, Facebook, and blog posts, applying a framework of “six honest signals of collaboration” to assess the level of global consciousness.
3.2. THE COOLHUNTING PROCESS: FINDING TRENDS BY FINDING TRENDSETTERS
The Coolhunting approach distinguishes between three different sources of information: the crowd, the experts, and the swarm. The difference is explained well through the metaphor of Coolhunting for a restaurant as a tourist in a foreign city. Following all the other tourists will bring us to the places where all the tourists go; these restaurants will be crowded, full of other tourists, expensive, and not particularly good. This is what following the crowd gives us, as the crowd likes to follow well-trodden paths. If we ask the concierge in our hotel for a recommendation, we will end up in a better restaurant, with better food, but it most likely will still be full of tourists, and much more expensive, as the concierge will be sending tourists to a good restaurant, but most likely to the one that pays him a kick-back for sending people there.
This is what following the advice of the expert brings us. The problem with experts is that they take kick-backs from the organizations that they recommend, as they are paid to give advice, just like the rating agencies Moody’s and Standard & Poor’s, which are paid by the same companies and governments that they are supposed to assess, leading to a serious conflict of interest.
As tourists in a foreign city, we will find the best places to eat if we visit the places popular with the local residents. The hard part is trying to identify the locals on the street and in a crowded restaurant, as they are hard to distinguish from the tourists. We might get some hints by looking at their clothing and listening to their language. We call this the swarm, which in our restaurant example leads to the best meal at the lowest price.
When doing Coolhunting on social media (Figure 6), we can make the same distinction between crowd, experts, and swarm, based on the media source. Twitter usually gives us the wisdom (and madness) of the crowd, blogs and online newspapers give us the (paid) wisdom of the experts, whereas the (intrinsically motivated) swarm might be found among Wikipedia editors, in Facebook groups, and on subject-matter-specific online forums. Obviously, the intrinsically motivated swarm will give us the best information quality. Tracking the right hashtags on Twitter might also lead us to the swarm for a certain topic.
Figure 6: Coolhunting on Social Media.
The Coolhunting overview consists of filling in the following 3 by 3 matrix, finding the key topics, key people and organizations, and key websites (Table 1). For example, for the topic “Social Determinants of Health,” doing a Coolhunt (with Condor) would give us the terms shown in Table 2. Once we have the first context by reading the Wikipedia page about “Social Determinants of Health,” we can get a feel for the importance of the term by putting it into context using Google Trends. In our example, we compare “Social Determinants of Health” with “poverty reduction,” “minimum wage,” and “early childhood development.” We find that “minimum wage” is by far the most discussed of the four terms; the other three are all on the same (much lower) level of interest.

Table 1: Generic Coolhunting Overview.
| | Key Topics | Key People and Organizations | Key Websites |
|---|---|---|---|
| Experts (from websites) | | | |
| Swarm (from Wikipedia) | | | |
| Crowd (from Twitter) | | | |
Table 2: Coolhunting Overview of the Example of the Topic “Social Determinants of Health.”
| | Key Topics | Key People and Organizations | Key Websites |
|---|---|---|---|
| Experts (from websites) | Education, graduation rate | Thoraya Ahmed Obaid, Jake Grovum | www.cnn.com, www.forbes.com |
| Swarm (from Wikipedia) | Early childhood development | WHO | http://www.who.int/hia/evidence/doh/ |
| Crowd (from Twitter) | Poverty, minimum wage | Bill Moyers, The Oregonian, Steny Hoyer | http://billmoyers.com/, http://www.oregonlive.com/ |
Image 1.a
a For color pictures see online version of images, available at http://www.ickn.org/sociometrics/
Detailed Coolhunting examples are provided in Chapter 12.
To summarize, Coolhunting consists of the following key steps:
1. Get an overview of the topic using Google, Wikipedia, and other relevant online sources.
2. Find the right search terms for Twitter, blog search, and Wikipedia. This involves repeated “trial and error” experiments with different search terms.
3. Calculate the strength of the brands and key people by constructing the degree-of-separation networks described in Section 8.5, combining the search terms of the Twitter, blog, and Wikipedia networks, and calculating the betweenness of the combined network.
4. Show the resulting networks and label the search terms and key people.
5. Present the key conclusions and unexpected findings of the Coolhunt.
In the next chapter, we will learn about the six honest signals of collaboration, which have been developed for e-mail-based analysis, but are similarly applicable to Twitter, blog, and Wikipedia Coolhunting results.
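The betweenness calculation on a combined network can be illustrated with a small sketch. The code below is not Condor’s implementation and does not build the degree-of-separation networks of Section 8.5; it assumes a tiny, invented undirected network of search terms, people, and websites, and computes unnormalized betweenness centrality with Brandes’ algorithm in a hypothetical helper `betweenness`.

```python
from collections import deque

# Hypothetical combined network: edges link search terms, people, and sites
# found on Twitter, blogs, and Wikipedia. All node names are invented.
EDGES = [
    ("term_a", "person_x"), ("term_b", "person_x"),
    ("person_x", "site_1"), ("site_1", "person_y"),
]


def betweenness(edges):
    """Unnormalized betweenness centrality (Brandes, undirected, unweighted)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths to each node
        dist = dict.fromkeys(adj, -1)   # -1 marks nodes not yet reached
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse breadth-first order.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


scores = betweenness(EDGES)
ranked = sorted(scores, key=scores.get, reverse=True)
print("highest betweenness:", ranked[0], scores[ranked[0]])
```

Nodes with high betweenness sit on many shortest paths between other nodes, which is why the measure is used here as a proxy for the strength of a brand or person in the combined network.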
MAIN LESSONS LEARNED
• Coolhunting means finding signals of collective awareness on online social media such as blogs, Facebook, Twitter, Wikipedia, and forums.
• Coolhunting finds trends by finding the trendsetters.
• Coolhunting finds cool people by finding their COINs and measuring their betweenness in online social networks.
• Coolhunting distinguishes between the knowledge of the experts and the madness and wisdom of the crowds, which are both extrinsically motivated, and the wisdom of the swarm, people motivated intrinsically by their cause.
• We find experts on blogs, the crowd on Twitter, and the swarm on Wikipedia, Facebook, and online forums.
4 THE SIX HONEST SIGNALS OF COLLABORATION
CHAPTER CONTENTS
• The six honest signals of collaboration: strong leadership, balanced contribution, rotating leadership, responsiveness, honest sentiment, shared context
• Different interpretations for highly creative and high-performance settings
• Four-step process of Coolfarming: analyze, predict, mirror, optimize
• E-mail use cases: forecasting customer satisfaction, predicting employee attrition, and improving sales effectiveness and creativity of medical researchers
• Improving financial capital through optimizing social capital.
Just like a satellite image allows the meteorologist to predict the weather of the next few days with surprising accuracy, interpreting an e-mail archive allows the analyst r 2017 Peter A. Gloor
45
46
Sociometrics and Human Relationships
to predict personality attributes of the mailbox owner. On the organizational level, the organization’s effectiveness, happiness, the satisfaction of its customers, the propensity of members to leave the organization, or the sales performance of teams and individuals can be predicted by analyzing its e-mail archive. Over the last 15 years, our research group at the MIT Center for Collective Intelligence, University of Cologne, and University of Applied Sciences Northwestern Switzerland (FHNW) has studied hundreds of organizations through the lens of their social networks extracted from the organization’s e-mail archive. Our goal has been to develop a set of metrics and software tools to make informal communication in organizations as measureable as what SAP does for payroll and accounting. Among many others, we have studied social networks at R&D organizations at car manufacturers, marketing departments at banks, sales teams at high-tech manufacturers, medical researchers and doctors at large hospitals, and service delivery teams at large consulting and service provider firms. In addition, we have also looked at open-source organizations like Eclipse software developers, stackoverflow developers, Wikipedians, storywriters, and online communities of patients of chronic diseases. We studied these groups through their public e-mail archives, their Twitter feeds, their Facebook group pages, and dedicated online forums. I first noticed that collaborative innovators show a highly specific communication behavior when I was working as a post-doc in the Advanced Network Architecture group at the MIT Lab for Computer Science in the early 1990s. This was right before the emergence of the Web. Tim Berners-Lee, the creator of the Web, had just joined
our group as a visiting scientist. I observed some marked differences in Tim’s behavior when compared to others, for instance, answering e-mails in minutes instead of in weeks, as was the habit in those early Internet days. Since then, I have observed these differences many times in the communication networks constructed from the e-mail archives of successful organizations. In particular, I have tested this behavior in the COIN seminar, which I have been teaching with colleagues since 2005 at MIT, Aalto University Helsinki, the University of Cologne, and many other universities around the globe. In this seminar, for the duration of one semester, students form global virtual teams in different time zones, providing an excellent test bed for identifying communication patterns of successful distributed teams. By asking students to share their e-mail archives by cc’ing all communication to dummy e-mail boxes, we can compare communication patterns with teacher and peer ratings of student team performance (Figure 7).

Figure 7: Social Network Picture of the Author’s COINs Seminar Network (Blue Dot in Center is the Author). For color pictures see online version of images, available at http://www.ickn.org/sociometrics/

Our social network research of the last 12 years has identified six “social indicators” of collaborative communication collected through e-mail archives, and also through tweets, blog links, Wikipedia entries, and body signals captured through sensors. These signals are predictive of future performance and outcomes. Changing individual communication behavior to adhere to the six indicators increases individual, team, and organizational performance. The six indicators have been made available to individuals and organizations in “Condor.” The six indicators are the following:

Strong leadership
While one might expect that in collaborative teams everybody is a leader, our research showed the opposite: creative swarms need strong leaders. For example, when Tim Berners-Lee came to MIT, he built up his network with global groups of thought leaders such as the World Economic Forum by connecting with other leaders such as the head of the MIT Lab for Computer Science at the time, Michael Dertouzos. Even Wikipedia, the epitome of creative collaboration, shows this pattern, as we found that articles where a small group is in charge become high quality much faster than articles where a large group of editors is working without clear leaders. Since then, we have seen the same patterns for teams of medical
researchers and shared business process service providers, to name just a few.

Rotating leadership
While strong leaders with the right personality characteristics are essential for successful collaboration, a group of leaders taking turns is even better. We first discovered this by studying the e-mail communication among Eclipse open-source developers, where rotating leadership was the best predictor of the most creative teams. By now, we have verified this behavior in dozens of other organizational networks, for example, in the COIN seminar, where student teams with rotating leadership showed better results. This was also confirmed for medical researchers developing innovations for the care of patients with chronic diseases, where teams with different leaders taking turns were found to be more creative.

Balanced contribution
In a team, we differentiate between information consumers and information producers, whom we call the “contributors.” For e-mails in a group, this means that there are people who send more than they receive and others who receive more than they send. For instance, when we studied the e-mail archive of the World Wide Web Consortium in its early days, Tim Berners-Lee frequently was the most active sender of a community. Over time, others would contribute their ideas, leading to an overall well-balanced contribution index. We found that teams with a low variance in contribution (all members of the core team contributing a similar number of messages) were more creative than
teams where one or a few people were contributing most of the messages. On the other hand, for account management teams catering to customers of a global service provider, a centralized contribution index, where a few central leaders sent a steady stream of messages to customers, led to more satisfied customers than a pattern where customers were bombarded in scattergun fashion by many different employees of the service provider.

Responsiveness
When I was a young post-doc at MIT, the speed with which Tim Berners-Lee responded to e-mails was an eye-opening experience. Since then, I have found this pattern many times. The speed of response and the number of “nudges” or “pings” it takes until a prospective communication partner answers e-mails are excellent predictors of employee and customer satisfaction and mutual respect. For example, in a consulting company we calculated the average e-mail response times of different departments in 2008, before smartphone usage became popular. Six departments took about three days on average to respond to e-mails, whereas one department was considerably slower. Now, in the age of smartphones, this behavior has become more marked, with average response times of well-working groups falling in the two-hour range for employees getting hundreds of e-mails per day.

Honest language
While we are not looking at the specific content of messages, we analyze positive and negative sentiments and emotionality of messages. The Condor system uses a
machine learning approach that can be trained with any large body of classified text, for example, with billions of tweets. We found that if the language is too positive, this might be an indicator of dissatisfied customers. For example, in a project with a global service provider, we found that the more positive language a sales person used in communicating with the customer, the less happy the customer was. On the other hand, in innovation teams we found that using more emotional language, defined as using more positive and negative text at the same time, was a predictor of more creative teams. When looking at employee attrition, we found that employees most likely to terminate their work were becoming less emotional in their language and were also showing less rotating leadership behavior.

Shared context
Highly functioning teams also define their own language. When the World Wide Web was started, new words and acronyms like “HTML,” “HTTP,” “RDF,” and “FOAF” were coined and existing words like “web,” “semantic web,” and “apache” took on a new meaning. In our analysis, we measure and use new word usage in two ways. First, we measure the complexity of text as the frequency of rare words in the entire text collection. For instance, we found that the more complex the language of sales people is, the less satisfied their customers are. Second, we also track the diffusion of new words in a community. If somebody introduces a new word in a group, using it for the first time in a message sent to others, we measure how quickly the word is picked up by others. The more somebody succeeds in introducing new words, the more influential she or he is (Table 3).
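The diffusion of new words described above can be illustrated with a minimal sketch. It is not Condor's implementation: the message format (timestamp, sender, text), the function name `word_adoption`, and the rule that a word counts as "new" the first time it appears in the archive are all assumptions made for illustration.

```python
import re


def word_adoption(messages):
    """Return {word: (introducer, delay)} for words picked up by another sender.

    messages: iterable of (timestamp, sender, text) tuples (assumed format).
    delay is the time between the first use of a word and its first use
    by a different sender, in the same units as the timestamps.
    """
    first_use = {}  # word -> (timestamp, sender) of first appearance
    adoption = {}   # word -> (introducer, delay until someone else uses it)
    for timestamp, sender, text in sorted(messages):
        for word in set(re.findall(r"[a-z]+", text.lower())):
            if word not in first_use:
                first_use[word] = (timestamp, sender)
            elif word not in adoption and sender != first_use[word][1]:
                start, introducer = first_use[word]
                adoption[word] = (introducer, timestamp - start)
    return adoption


# Hypothetical three-message archive; timestamps are hours.
example_messages = [
    (0, "tim", "we need html"),
    (2, "ann", "html works"),
    (5, "tim", "html again"),
]
example_adoption = word_adoption(example_messages)
```

Under these assumptions, a sender who introduces many words with short adoption delays would rank as more influential; how Condor weights the number of adopters against the delay is not specified in the text.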
Table 3: Definition of Six Honest Signals.

Indicator | SNA Term | Definition | How the Variables Are Calculated in Condor
Central leadership | Degree centrality | It is the number of nearest neighbors from an actor who are both senders and receivers in the network | Number of actors each person is directly connected with in a network
Central leadership | Betweenness centrality | It is a measure of the extent to which each actor acts as an information hub and controls the information flow | It is defined as the likelihood to be on the shortest path between any two actors in the network
Rotating leadership | Betweenness centrality oscillation | It is a measure of how frequently actors change their network position in the team, from central to peripheral, and back | Number of local maxima and minima in the betweenness curve of an actor or a group
Balanced contribution | Contribution index | Indicates how balanced a communication is in terms of messages sent and messages received | (msg sent - msg rcvd)/(msg sent + msg rcvd)
Rapid response | Ego ART | Average number of hours the sender takes to respond to e-mails | Time until a frame is closed for the receiver after an e-mail has been sent
Rapid response | Ego nudges | Average number of follow-ups that the sender needs to send in order to receive a response from the receiver | Number of pings until the sender responds
Rapid response | Alter ART | Average number of hours the receiver takes to respond to e-mails | Time until a frame is closed for the sender, after an e-mail has been sent
Rapid response | Alter nudges | Average number of follow-ups that the receiver needs to send in order to receive a response from the sender | Number of pings until the receiver responds
Honest language | Avg. sentiment | Indicates positivity and negativity of communication | Uses automatically generated bag of words, based on a dictionary trained for language/subject area
Honest language | Avg. emotionality | Represents the deviation from neutral sentiment | Standard deviation of sentiment
Shared context | Avg. complexity | It is a measure of complexity of word usage. It is defined as the information distribution, that is, the more diverse words, which are all used evenly, a sender uses, the higher his complexity | Information distribution using TF/IDF, independent of single words
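Three of the calculations in Table 3 are simple enough to sketch directly. The following is an illustration under stated assumptions, not Condor's code: the function names are hypothetical, plateaus in the betweenness curve are not counted as extrema, and the population standard deviation is used for emotionality because the table does not say which variant applies.

```python
from statistics import pstdev


def contribution_index(sent, received):
    """(msg sent - msg rcvd) / (msg sent + msg rcvd); 0.0 for a silent actor."""
    total = sent + received
    if total == 0:
        return 0.0
    return (sent - received) / total


def count_oscillations(betweenness_series):
    """Count strict local maxima and minima in a betweenness time series."""
    series = list(betweenness_series)
    count = 0
    for prev, cur, nxt in zip(series, series[1:], series[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            count += 1
    return count


def emotionality(sentiments):
    """Standard deviation of per-message sentiment scores (assumed population form)."""
    return pstdev(sentiments)


# Hypothetical weekly betweenness values for one actor.
weekly_betweenness = [0.1, 0.5, 0.2, 0.4, 0.3]
oscillations = count_oscillations(weekly_betweenness)
```

With these definitions, an actor who only sends has a contribution index of 1, an actor who only receives has -1, and an actor whose betweenness never changes has zero oscillations.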
4.1. THE HONEST SIGNALS HAVE DIFFERENT MEANINGS FOR DIFFERENT ORGANIZATIONS

The six signals have predictive power for both creative and process-oriented organizations; however, we found that for some indicators the directionality might change, while others find universal applicability. When I started my research in communication in COINs 14 years ago, I initially expected to find democratic leadership patterns, with all members of the core team sharing in an egalitarian communication pattern. However, I found the opposite, with people like Tim Berners-Lee for the Web, Linus Torvalds for Linux, or Jimmy Wales for Wikipedia assuming strong leadership roles. Even when it looked like an egalitarian leadership team, with a small group of people sharing the lead, when adding the temporal dimension, it became clear that one person was in charge at any given point in time.

Through 12 years of research, we found that rotating leadership was the key indicator of creativity. However, for tasks where creativity is not at a premium and reliability is essential, the opposite, steady leadership, is more important. For example, when studying nurses in the Post Anesthesia Care Unit of a large hospital, we found that patients were waking up from anesthesia on average faster if the same senior nurse was in charge for the entire duration of a day. Democratic leadership and taking turns in this case do not seem beneficial for the patients. This was different for teams of medical researchers, whose research output was rated more creative when they were showing more rotating leadership, with different senior and junior people taking turns in occupying the most central network position over time.
We also found that speed in response can be interpreted differently depending on the context. For example, when comparing customer satisfaction of a service provider with the speed of response of account managers, we found no direct influence on customer satisfaction, although intuition might tell us otherwise. However, we found a significant correlation between the speed with which the customer answered the e-mails of the account manager and customer satisfaction. The happier the customers were with the services provided, the faster they would respond to messages of their account manager. This tells us that it is not enough to answer messages of a customer quickly; one also needs to solve their problems, although answering messages slowly is certainly one way to create unhappy customers. In this particular case, however, we had already raised awareness of being responsive with the service provider, such that all account managers were already quite fast in answering, so that speed no longer provided a competitive advantage (Table 4).
Table 4: Directionality of Indicators for High-Performing Teams.

Indicator | Performance Indicator
Central leadership | Higher performing teams have one or more clear leaders
Rotating leadership | Creative teams have rotating leaders; process-oriented teams have steady leaders
Balanced contribution | Creative teams show balanced contribution; process-oriented teams show a few dominant contributors
Rapid response | Higher performing teams show rapid response
Honest language | Higher performing teams use honest language
Shared context | Higher performing teams use their own vocabulary
4.2. VIRTUAL MIRRORING LEADS TO CHANGE

When people know that they are being monitored, they change their behavior. This is called the “Hawthorne effect,” discovered almost a century ago when scientists in the Hawthorne factory outside Chicago experimented with augmenting the work environment of factory workers. They found that whatever they did, whether it was turning the lighting up or down or changing the floor plan, performance of the workers improved because of the attention paid to the workers.

We call the process of showing people their communication behavior “virtual mirroring.” They are shown a “virtual mirror” of their own communication pattern as a social network picture, created from their e-mail archive, plus a comparative ranking of their six indicators. If exposed to a virtual mirror, people will, based on the Hawthorne principle, initially change their behavior. If, in combination with the virtual mirror, we teach them which type of behavior is indicative of future higher organizational performance, this change in behavior will be permanent and will lead to improved outcomes. When monitoring a business process for higher performance, participants will change their behavior to act in a way leading to a better process; when monitoring for creativity, participants in virtual mirroring will become more creative.
4.3. DEALING WITH PRIVACY CONCERNS

One of the first questions that always comes up when we start a new project analyzing e-mails is about privacy and data security. When dealing with sensitive company data, the preferred approach is to host the company’s data within the corporate firewall, to provide access to aggregated information to management, and to give each employee access to their own personal communication insights. People are concerned about the contents becoming known in public. On the technical level, we are addressing this issue on three different levels.

On the strictest level, we commit to doing an anonymized analysis, where insights about individuals are only given to individuals. This means that in the results screens shown in Figures 9 and 10, individuals log in with their own e-mail address, and will only see their own names, with the rest of the people anonymized. Management will only get anonymized results aggregated on the team or business unit level. The problem with this approach is that the insights to be gained are somewhat limited, as it would be quite interesting for people to know, for example, who responds fastest to them.

On the mid-level of privacy, we therefore restrict our analysis to e-mail header information and are not using e-mail content and subject line. This allows us to calculate all the indicators except “honest sentiment” and “shared context.”

On the most open level, the full content is analyzed. If we are doing an individual analysis using an individual’s mailbox, the individual has access to the full mailbox anyway. Using our Condor software, we can then disclose who responds fastest and with the fewest nudges from the mailbox owner, and who is the most honest and the most influential person in the network of the individual based on new word usage. Frequently, we conduct an analysis on this most open level also for organizations, as they own the
contents of their organization’s e-mail archive, similarly to how they own the payroll and accounting data of their SAP system. Organizations have always been keeping salary and sales data protected from individual employees, while using the aggregate information for their own competitive advantage. The same parallel applies to e-mail data, which in aggregated form, and broken down into employee communication benchmarks, provides invaluable information to corporate management.

4.4. HOW TO APPLY KNOWLEDGE FLOW OPTIMIZATION

We have developed a four-step process, which we call “Knowledge Flow Optimization,” to analyze and increase the performance of organizations and to “Coolfarm ideas” (Figure 8). It consists of the four steps: “Analyze, Predict, Mirror, Optimize.” To illustrate our approach, let’s look at the analysis of the sales force of a Fortune 500 high-tech company, where we compared e-mail communication of the organization with sales success of their sales teams in different geographical regions.

Figure 8: Coolfarming through “Knowledge Flow Optimization.”

Step 1: Analyze: Determining social network metrics and communication patterns
In the first step, we analyzed and quantified the communication patterns and social network structure embedded within organizational communication archives such as e-mail, videoconferencing, and instant messaging. We used this data to calculate the six indicators for the different types of communication archives such as e-mail or videoconferencing.

Step 2: Predict: Six honest signals: Comparing structural attributes with business success
In the second step, we compared communication behavior found in Step 1 with the communication patterns identified as indicators of better connectivity, interactivity, and sharing among the individuals in the network. Having calculated the six indicators from the data in the communication archives, we correlate them with quantified success and failure criteria. The success and failure criteria vary significantly depending on the type of organization, the industry, and the individuals being measured. In this example, we measured sales performance of the sales teams in different geographic regions and for different products.
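The correlation in Step 2 can be sketched with a plain Pearson coefficient. This is an illustration only: the text does not say which correlation measure or software was used, and the team-level numbers below are invented solely to show the call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one value per sales team (not taken from the study).
rotating_leadership = [2, 4, 6, 8]   # oscillations in betweenness
sales_performance = [10, 20, 30, 40]  # success criterion, arbitrary units
r = pearson(rotating_leadership, sales_performance)
```

A coefficient close to +1 or -1 would mark the indicator as a candidate predictor of the success criterion; with only a handful of teams, a significance test such as those introduced in Chapter 5 would be needed before drawing conclusions.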
Step 3: Mirror: Virtual mirroring
In the next step, we mirror the communication behavior we have identified for the different parts of the organization back to the teams and individuals. By showing them how they differ from the best practices we found in past projects, we help them to improve their behavior for better performance. Just like with a real mirror, looking at how a team “really” communicates can be an eye-opening experience for the team members, leading to fundamental changes in their behavior for the better (Figure 9).

Figure 9: Overview Screen of “Virtual Mirroring,” Showing How the Individual Does Compared to the Rest of the People in the Group (see also Chapter 15).

In the example with the global sales force for the high-tech manufacturer, we found that the more they showed a rotating leadership behavior within the Web conferencing network, the less e-mail they sent to their customers (the less they “spammed” them), and the less positive and more “honest” language they used, the more they sold. This means that the less they were “talking up” their products, and the more they also admitted upcoming problems, the more willing their customers were to buy their products (Figure 10).

Step 4: Optimize: Devising a plan to optimize communication for greater success
Once we figured out which of the six indicators are correlated with success and failure, we developed a roadmap and recommendations for the company to act on, to change communication behaviors of the sales staff for more successfully closed deals and more satisfied customers.
Figure 10: Drill-down on Virtual Mirror, Showing from the Top to Bottom: Social Network, Communication Frequency (Degree Centrality), and Flexibility and Adaptability (Oscillation in Betweenness Centrality).

4.5. FOUR EXAMPLES

Predicting customer satisfaction of a global service provider
For a large service provider, we tracked 26 accounts for more than two years. In each of these accounts, dozens to hundreds of service provider employees were working on behalf of one Fortune 1000 customer. We collected the e-mail of the account managers of the service provider and calculated the six indicators for every month. We also provided virtual mirroring feedback to the account executives once we had the six variables. When comparing the 26 tracked accounts with a control group of 150 accounts whose e-mail we did not collect, we found that over the duration of our analysis, the satisfaction of the 26 customers, measured by Net Promoter Score, improved by 5%, whereas average satisfaction decreased by 12% for the control group.

Predicting employee attrition of a global service provider
For a large service provider, using two years of e-mail data, we predicted the likelihood of the top 3000 employees leaving the company. We also distinguished between the “inner termination” phase before employees actually handed in their resignations and the last three months working for the company. We found marked differences in communication patterns. For instance, employees who were not happy in the company anymore showed a 4% decrease in emotionality and 20% decrease in rotating leadership. Once they had decided to leave, their emotionality levels and patterns of rotating leadership went back to their normal level.

Improving sales effectiveness of a global high-tech company
For a global high-tech manufacturer, we compared two weeks of the entire e-mail archive, together with videoconferencing networks and message networks, with sales performance of the global sales teams. There were some overall patterns, such as more responsive sales associates generating higher sales or sales people selling collaboration-based products being more successful when showing a rotating leadership pattern. We also found that different types of communication behaviors were advantageous in different geographies. For example, in France, sales people who initiated more video conferences were generating more sales, while sending more e-mails was not leading to increased sales.
Boosting research creativity of medical researchers
In a multi-year project at a leading US research hospital, we analyzed the e-mail traffic of a large project with over 100 members from both outside research organizations and the hospital. We also provided virtual mirroring sessions to the senior leadership team as well as to selected project teams. These mirroring sessions led to increased rotating leadership, to more honest sentiment, and to higher responsiveness. Since then, the research model pioneered in this project has been applied to a variety of other healthcare-related research projects, for instance, to address the needs of chronic care patients, as well as to a large project aiming to reduce infant mortality across the United States.
4.6. AREAS OF E-MAIL-BASED SNA

Using e-mail-based social network analysis gives managers an unprecedented early warning system for the whole organization and allows them to predict flash points before they happen, leading to greatly improved performance and the capability to manage risk in a much better way. Measuring the effectiveness of the communication network is akin to measuring the nervous system of the organization, which until now was unmeasurable. Analyzing e-mail flows permits the CEO to predict mission-critical factors such as the propensity of employees to leave the company, the satisfaction of a company’s customers, and the effectiveness of its sales force well before the events actually happen.
Our e-mail communication analysis system can thus be put to productive use in three ways:
1. Managers can use it as a monitoring and alert system to spot emerging problems before they actually happen.
2. Business units and teams can optimize their performance through better communication. Telling teams through the virtual mirroring process what their strengths and weaknesses in interaction are can be used to give invaluable advice for better collaboration and improved team success.
3. The six indicators of collaboration can also be mirrored back to individuals, so they can improve their individual communication behavior to become more effective and successful team members.
4.7. IMPROVING FINANCIAL CAPITAL THROUGH OPTIMIZING SOCIAL CAPITAL

Just like SAP allows an organization to make better use of its financial resources through continuous monitoring of managerial accounting, Coolfarming through knowledge flow optimization allows an organization to make much better use of its social capital through tracking and optimizing collaboration. While social media monitoring has become quite popular, the social capital-driven approach described in this book is unique in five different ways:

Measuring “True Creativity”: Our framework is based on the notion of COINs. It has been field-tested in over 100 organizations to identify the communication patterns indicative of creativity. This includes far more than counting the number of e-mails of individuals and teams; rather, using the six honest signals of collaboration listed in this chapter, users will be able to identify complex networking patterns of true creativity.

Know cool people, and not just “hotspots” and “spammers”: The semantic social network analysis tool Condor finds trends by finding the trendsetters. When doing a Coolhunt on the Web, Twitter, and Wikipedia, in the first step it works like Google, to identify who is using novel words and ideas, but then it finds the most influential people by applying the six honest signals of collaboration. This way it does not just reward “spammers” with a central position, but also will, for instance, identify influencers by measuring who uses novel words first and how quickly they are picked up by others to grow into new COINs.

Anti-gaming: We use social network analysis metrics such as “betweenness centrality” and time series of e-mail exchange, which are far more robust toward “gaming” by employees than simply counting e-mails sent and received. Measuring the betweenness of Twitter users in a retweet network is also more indicative of popularity than just the number of Twitter followers, which can be gamed quite easily by following back other people.

Measuring organizational trust and satisfaction: We do not just count complex words to measure complexity in dialogue, or count positive and negative words such as “great,” “wonderful,” “horrible,” and “awful”; rather, we use supervised machine learning algorithms to track word distribution and to measure positivity, negativity, and complexity in context.
Understanding communication galaxies: We track the evolution of network positions of people, measuring how individuals change from being “stars” to becoming “galaxies,” applying the slogan “don’t be a star, be a galaxy.” The groups of the most creative people and most highly functioning teams act as communication galaxies embedded into clusters of other teams.
MAIN LESSONS LEARNED
• The six honest signals of collaboration are strong leadership, balanced contribution, rotating leadership, responsiveness, honest language, and shared context.
• Showing the six honest signals to individuals and organizations in a virtual mirroring process, and telling them which behavior is predictive of high performance and creativity, will lead to better organizations and more creative and productive employees.
• Monitoring communication will allow organizations to manage social capital just like they manage financial capital.
• To address privacy concerns, just like SAP stores accounting information and calculates financial performance metrics, virtual mirroring allows organizations to calculate and show collaborative performance metrics.
5 ESSENTIALS OF SOCIAL NETWORK ANALYSIS AND STATISTICS
CHAPTER CONTENTS
• Introduction to social network analysis
• Introduction to statistics (t-tests, correlation, regression).

In this chapter, you will learn just enough about SNA and statistics to understand the theory behind the social media analysis with Condor and the trend predictions described in the subsequent chapters. This will enable you to do your own experiments and predictions with social media data collected using Condor. There are many excellent textbooks on SNA (Tsvetovat & Koutznetsov, 2011; Wassermann & Faust, 1994) and statistics (Urdan, 2010) available to learn more.

SNA has been around for a long time. In the classic example of the puzzle of the “seven bridges of Koenigsberg,” Leonhard Euler laid the foundations of graph theory in 1736. Since then, SNA has come a long way and with the proliferation of the Internet and the Web, most prominently Facebook, it has become a key foundation to understand the structure of social networks.

5.1. BASICS OF SOCIAL NETWORK ANALYSIS (SNA)

You will learn here about actors and ties, about actor-level metrics (degree centrality, betweenness centrality, and contribution index), and about group-level metrics (group degree centrality, group betweenness centrality, and density). Networks consist of nodes and connecting edges. In the language of SNA, nodes are called actors and edges are called ties. Figure 11 shows a simple undirected network with seven actors and nine ties.

Figure 11: Undirected Network.
The same graph can also be shown as an adjacency matrix, where all actors are labeled on both the rows and the columns. In Figure 12, the black square in element $a_{12}$ denotes the tie from actor 401 to actor 402. As element $a_{21}$ is empty, there is no tie from actor 402 to actor 401. On the other hand, elements $a_{16}$ and $a_{61}$ are both black, showing that there is a link from actor 407 to actor 401, as well as a link from actor 401 to 407. In other words, between actors 401 and 407, there is a bidirectional link. The adjacency matrix in Figure 12 therefore shows a directed graph (if it were undirected, the matrix would be symmetric).

Figure 12: Adjacency Matrix of (Directed) Graph from Figure 11.

Figure 13 shows the network view of the matrix from Figure 12. All the links in Figures 12 and 13 are bidirectional except the links from 402 to 404 and from 404 to 406.

Figure 13: Directed Network View of Adjacency Matrix in Figure 12.

Based on the position of an actor in the network, we can calculate actor-level metrics for each actor in the graph. The simplest actor-level metric is degree centrality, which measures the number of direct neighbors of an actor:

\[ C_D(a) = \deg(a) \]

Figure 14 shows the degree centralities of all actors in the network from Figure 13.

Figure 14: Degree Centralities of All Nodes in Network from Figure 13, Nodes Sized by Degree Centrality.
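The relation between an adjacency matrix, its symmetry, and degree centrality can be made concrete with a short sketch. The 4-actor matrix below is hypothetical and is not the network from Figures 11 to 14, which cannot be recovered from the text; counting a neighbor when a tie exists in either direction is likewise an assumption for directed graphs.

```python
def is_symmetric(matrix):
    """True if the adjacency matrix describes an undirected graph."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def degree_centrality(matrix):
    """Number of direct neighbors per actor, ignoring tie direction."""
    n = len(matrix)
    return [
        sum(1 for j in range(n) if j != i and (matrix[i][j] or matrix[j][i]))
        for i in range(n)
    ]


# Hypothetical directed network: row i, column j is 1 if actor i has a tie to actor j.
example_matrix = [
    [0, 1, 1, 0],  # actor 0 -> 1, 0 -> 2
    [0, 0, 1, 0],  # actor 1 -> 2 (no tie back to actor 0)
    [1, 1, 0, 1],  # actor 2 -> 0, 2 -> 1, 2 -> 3
    [0, 0, 1, 0],  # actor 3 -> 2
]
example_degrees = degree_centrality(example_matrix)
```

Because the tie from actor 0 to actor 1 has no counterpart in the other direction, `is_symmetric(example_matrix)` is false, which is the matrix-level signature of a directed graph described above.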
The most frequently used metric is betweenness centrality, which measures information flow among nodes. It counts the number of times a node lies on the shortest path between any two nodes other than itself (corresponding to the likelihood of the node being on the shortest path). Betweenness centrality is commonly taken as a proxy for power and influence, as whoever controls information has power (Figure 15). Formally, the nonnormalized betweenness centrality is
$$C_B(a) = \sum_{s \neq a \neq t} \sigma_{st}(a)$$
where $\sigma_{st}(a)$ is the number of shortest paths between any two actors s and t that pass through a.
Figure 15: Nonnormalized Betweenness Centrality of All Nodes in Network from Figure 13, Nodes Sized by Betweenness Centrality.
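As a rough sketch of how the path counting behind betweenness centrality can be carried out, the following Python code uses breadth-first search on a small made-up network (not the one from Figure 13). It returns the raw count of shortest paths through each actor, as in the formula above, and also the common variant that divides each count by the total number of shortest paths between s and t. The function names and the example graph are assumptions for illustration; pairs (s, t) are treated as ordered, so in an undirected network every pair is counted twice.

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS from source: distance and number of shortest paths to each node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                count[nxt] += count[node]
    return dist, count


def betweenness(graph):
    """Return (raw path counts, counts divided by sigma_st) per actor."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    raw = {a: 0 for a in graph}
    ratio = {a: 0.0 for a in graph}
    for s in graph:
        dist_s, count_s = info[s]
        for t in graph:
            if t == s or t not in dist_s:
                continue
            for a in graph:
                if a in (s, t) or a not in dist_s:
                    continue
                dist_a, count_a = info[a]
                # a lies on a shortest s-t path if the distances add up
                if t in dist_a and dist_s[a] + dist_a[t] == dist_s[t]:
                    through = count_s[a] * count_a[t]
                    raw[a] += through
                    ratio[a] += through / count_s[t]
    return raw, ratio


# Hypothetical undirected network, each tie listed in both directions.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
raw, ratio = betweenness(graph)
print(raw)
print(ratio)
```

In this toy network actor D, who connects E to everyone else, collects by far the highest value, which matches the intuition that betweenness rewards actors who sit on many information paths.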
The normalized betweenness centrality is
$$C_B(a) = \sum_{s \neq a \neq t} \frac{\sigma_{st}(a)}{\sigma_{st}}$$
where $\sigma_{st}$ is the total number of shortest paths from s to t.
Both degree and betweenness centrality are also defined as group metrics. The group metric measures how unevenly the actor-level centralities are distributed. The most centralized graph, a star structure in which one central actor is connected to every other actor by one tie, has a value of 1; a totally egalitarian structure, in which every actor has the same number of connections, has a value of 0. More formally, group degree centrality is defined as
$$C_D = \frac{\sum_{i=1}^{g} \left[ C_D(n^*) - C_D(n_i) \right]}{(g-1)(g-2)}$$
where $C_D(n^*)$ is the maximum degree of any actor in the graph and g is the number of actors. For the example above, this would be ((4 − 3) + (4 − 3) + (4 − 3) + (4 − 2) + (4 − 1) + (4 − 1))/(6 × 5) = 11/30 = 0.3667.
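The worked example for group degree centrality can be checked with a short sketch. The degree list below is read off the terms of the calculation above (one actor with degree 4, three with 3, one with 2, two with 1); the star and ring degree lists are added here as the two extreme cases and are assumptions for illustration, as is the function name.

```python
def group_degree_centrality(degrees):
    """Sum of differences to the maximum degree, divided by (g-1)(g-2)."""
    g = len(degrees)
    highest = max(degrees)
    return sum(highest - d for d in degrees) / ((g - 1) * (g - 2))


example = [4, 3, 3, 3, 2, 1, 1]   # degrees from the worked example
star = [6, 1, 1, 1, 1, 1, 1]      # one hub tied to six others
ring = [2, 2, 2, 2, 2, 2, 2]      # every actor has the same degree

print(round(group_degree_centrality(example), 4))  # 0.3667
print(group_degree_centrality(star))               # 1.0
print(group_degree_centrality(ring))               # 0.0
```

The star and the ring reproduce the two anchor values 1 and 0 described in the text, and the example network lies between them.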
For group betweenness centrality, the formula is
$$C_B = \frac{\sum_{i=1}^{g} \left[ C_B(n^*) - C_B(n_i) \right]}{g-1}$$
where $C_B(n^*)$ is the maximum normalized betweenness of any actor in the graph and g is the number of actors.
Contribution index compares the number of messages an actor has sent with the number the actor has received. Formally, it is defined for each node as
$$\text{contribution index} = \frac{\text{messages sent} - \text{messages received}}{\text{messages sent} + \text{messages received}}$$
which in Figure 13 translates to outgoing ties (messages sent) and incoming ties (messages received). For example, actors 403, 405, and 407 all have 3 incoming and 3 outgoing links, leading to a contribution index of 0. Actor 402 has a total of 3 ties, of which 2 are outgoing and 1 is incoming, leading to a contribution index of (2 − 1)/3 = 0.3333. In the visual representation, each actor is shown as a dot, with the x-axis denoting the total number of messages an actor has sent and received and the y-axis the contribution index (Figure 16).
The final group-level metric we will be using is graph density, which measures how many of all possible connections between any two nodes actually exist. Formally, where E is the number of all edges and g the number of actors, density D is
$$D = \frac{E}{g(g-1)}$$
which would be, in the example from Figure 13, D = 16/(7 × 6) = 0.38095.
Figure 16: Contribution Index for Network from Figure 13.
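The contribution index and density calculations above are simple enough to verify directly. The sketch below uses the tie counts quoted in the text; the function names are hypothetical, and returning 0 for an actor without any messages is an assumption, since the text does not cover that case.

```python
def contribution_index(sent, received):
    """(sent - received) / (sent + received); ranges from -1 to +1."""
    total = sent + received
    if total == 0:
        return 0.0  # assumption: an actor with no messages is treated as balanced
    return (sent - received) / total


def density(num_ties, num_actors):
    """Existing directed ties divided by all possible directed ties."""
    return num_ties / (num_actors * (num_actors - 1))


print(contribution_index(3, 3))            # actors 403, 405, 407: 0.0
print(round(contribution_index(2, 1), 4))  # actor 402: 0.3333
print(round(density(16, 7), 5))            # network from Figure 13: 0.38095
```

An actor who only sends gets +1 and an actor who only receives gets −1, so the y-axis of a plot like Figure 16 runs between these two extremes.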
5.2. BASICS OF STATISTICS
You will learn here about t-tests, which help you to compare two samples and figure out whether there are statistically significant differences between them; about correlations, which help you to decide whether two variables are related; and about linear regression, which helps you to measure how much of an outcome variable is explained by, and might be caused by, a set of input variables. For example, assume that we have a set of data points describing gender, body height, and income for a mixed group of women and men. The t-test will help to check whether men have a higher income than women. The correlation will tell us whether there is a relationship between body height and income, as men are usually taller than women and, on average, still have a higher income. The regression will tell us what fraction of income is explained by body height and gender.
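To make the idea of comparing two samples concrete, here is a minimal sketch of the test statistic behind an independent-samples t-test. It uses Welch's variant, which does not assume equal variances; that choice, the function name, and the two small income samples are assumptions for illustration, not the procedure or data used in this book. Turning t and the degrees of freedom into a p-value requires the t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` does this in one call.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se = sqrt(va / na + vb / nb)                     # standard error of the difference
    t = (mean(sample_a) - mean(sample_b)) / se
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df


# Made-up incomes (in thousands) for two small groups.
group_a = [52, 48, 55, 60, 50]
group_b = [45, 47, 43, 50, 40]

t, df = welch_t(group_a, group_b)
print(round(t, 3), round(df, 2))
```

The statistic grows when the group means are far apart and shrinks when the values within each group are widely scattered, which is the trade-off a t-test formalizes.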
In this chapter, we will look at a social media example to illustrate how t-tests, correlations, and regression can be used to identify patterns in social media data. We analyze the Twitter data described in Section 11.3. It contains 23,484 tweets by 16,948 people tweeting about either “Bernie Sanders” or “Donald Trump” on April 22, 2016 from 6:42 to 15:19. Using insights by James Pennebaker about the “hidden life of pronouns” (Pennebaker, 2013), we count pronouns and other function words such as “I,” “me,” “we,” “us,” “the,” “and,” “or,” “not,” etc. and calculate their frequency in the tweets of each individual person. Condor allows us to do this automatically, calculating the probability that a particular word, for example “the,” appears in a random tweet of the observed person.
We would like to know if Bernie Sanders fans, recognizable through Twitter handles such as “Latinos4Bernie” or “People4Bernie,” differ in their language from other tweeters by using these words differently. To measure this, we use the independent-samples t-test. The t-test measures whether two normally distributed samples are statistically different. It basically checks whether the difference between the two group averages is large relative to the variation within the groups (their standard deviations). Figure 18 shows the results of the independent-samples t-tests as calculated using Condor. For example, Bernie Sanders’ fans use the pronoun “you” 10 times as much as the other tweeters, with a probability of 0.00077 within a tweet instead of 0.000077. What this means is illustrated in Figure 17, which shows the distribution of the tweeters using “you.” While there are also Bernie Sanders fans who
Figure 17: Distribution of Relative Frequency of “you” in Tweets by Bernie Sanders Fans (Orange) and Others (Blue). For color pictures see the online version of the images, available at http://www.ickn.org/sociometrics/
use “you” as rarely as other tweeters, the peak of the distribution curve for Bernie Sanders fans is at 0.00077, whereas most of the other tweeters use “you” in a single tweet with a probability of only 0.000077. Figure 18 shows the results of the t-test for all Pennebaker variables. We see that there are 240 Bernie Sanders fans (Group 0 in Figure 18) and 16,708 others (Group 1). We find that the difference in the usage of “you” is highly statistically significant, as the p-value is reported as 0 (that is, it is smaller than the displayed precision). When looking at the number of Twitter friends (friends_count, the accounts a user follows), we see that while Bernie Sanders fans have more friends than others (1527 instead of 1494), the difference is not statistically significant. The p-value is 0.94, which means that if there were no real difference between the two groups, a difference at least this large would still show up in about 94% of samples.
Let’s now look at the relationship between the different variables. We would like to know if people who are more popular on Twitter by being “favorited” more often
Figure 18: Independent Samples t-Test Result for Pronoun Usage in Tweets for Bernie Sanders Fans (Group = 0) and Others (Group = 1) Calculated by Condor.
use a specific language, for instance using more or fewer words like “the,” “and,” “to,” “with,” and “in.” To check for this we use correlations. The most popular correlation is the Pearson correlation. If two variables have a perfect linear relationship, the Pearson correlation coefficient is 1 (or −1 if one variable falls as the other rises); if there is no linear relationship whatsoever between the two variables, the correlation coefficient is 0. Figure 19 illustrates two variables x and y which are strongly correlated (at left) and which have no correlation (at right).
Figure 19: Strong Correlation between Two Variables x and y (Left) and Uncorrelated Variables x and y (Right). For color pictures see the online version of the images, available at http://www.ickn.org/sociometrics/
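The Pearson correlation coefficient described above can be computed by hand in a few lines. The sketch below is a minimal illustration with made-up numbers and a hypothetical function name; it reproduces the three reference cases (perfect positive, perfect negative, and no linear relationship). It also shows that squaring r gives the share of variance two variables have in common, so a coefficient of 0.12, for instance, corresponds to only about 1.4%.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))    # rises in lockstep: close to 1
print(pearson_r(xs, [8, 6, 4, 2]))    # falls in lockstep: close to -1
print(pearson_r(xs, [1, -1, -1, 1]))  # no linear trend: 0

# Share of variance explained by a correlation of r = 0.12
print(round(0.12 ** 2, 4))            # 0.0144
```

A statistics package additionally reports a significance level for r, which depends on the sample size; with many thousands of observations even very small coefficients can be statistically significant.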
We would now like to see if an actor who has many Twitter friends (friends_count, the accounts a user follows) is also more popular, by being “favorited” more. Figure 20 shows the relationship between the two variables friends_count and favorites_count for all 16,947 actors in our dataset. As we can see in Figure 20, there is a correlation, but it is not particularly strong (there is a straight line rising, but it is quite flat). Correlating friends_count and favorites_count, we find that the correlation is significant with a coefficient of r = 0.120 (Table 5). A correlation that is significant at the 0.01 level means that, if the two variables were in fact unrelated, a correlation at least this strong would arise by chance in less than 1% of samples; at the 0.05 level, the corresponding figure is 5%. The share of variation explained is the square of the correlation coefficient: with r = 0.12, about 1.4% (0.12² = 0.0144) of the variation in being “favorited” can be explained through its relationship with the number of Twitter friends. Table 5 also tells us that there is a weak but significant negative correlation between being “favorited” and
Figure 20: Correlation between friends_count and favorites_count.
the usage of “the,” “and,” “to,” “with,” and “in.” Only a very small part of the variation in being “favorited,” well under 1% for each word (the largest squared coefficient is 0.038² ≈ 0.0014), can be explained through the usage of these function words; using them less goes along with being “favorited” more.
To investigate whether A might be causing B, in this case whether using fewer of these words and having more Twitter friends leads to an actor being “favorited” more, statisticians use a linear regression (Table 6). A regression on observational data like this shows how well the inputs predict the outcome; on its own it does not prove causation. Table 6 shows the results of the regression with the dependent variable favorites_count and the independent variables avg frequency_With, avg frequency_In, avg frequency_To, avg frequency_The, and friends_count. Entering this data into a statistics package, such as R, SPSS, Stata, SAS, or Excel, returns an adjusted R squared
Table 5: Pearson Correlation between Twitter Friends/Favorite Count and Pronoun Usage.

| | | friends_count | favorites_count | avg frequency_The | avg frequency_And | avg frequency_To | avg frequency_With | avg frequency_In |
|---|---|---|---|---|---|---|---|---|
| friends_count | Pearson correlation | 1 | .120 (a) | .006 | .009 | .008 | .002 | .003 |
| | Sig. (two-tailed) | | .000 | .505 | .284 | .328 | .835 | .757 |
| | N | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 |
| favorites_count | Pearson correlation | .120 (a) | 1 | −.038 (a) | −.021 (b) | −.027 (a) | −.021 (b) | −.031 (a) |
| | Sig. (two-tailed) | .000 | | .000 | .012 | .001 | .011 | .000 |
| | N | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 | 14,414 |

(a) Correlation is significant at the 0.01 level (two-tailed).
(b) Correlation is significant at the 0.05 level (two-tailed).
Table 6: Regression Results of Regressing the Dependent Variable favorites_count against the Independent Variables avg frequency_With, avg frequency_In, avg frequency_To, avg frequency_The, and friends_count.

| Model | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 7445.317 | 188.952 | | 39.403 | 0.000 |
| avg frequency_The | −31497.381 | 8162.346 | −.032 | −3.859 | .000 |
| avg frequency_To | −20985.868 | 9297.563 | −.019 | −2.257 | .024 |
| avg frequency_In | −32535.075 | 10200.098 | −.027 | −3.190 | .001 |
| avg frequency_With | −39646.094 | 17694.318 | −.019 | −2.241 | .025 |
| friends_count | .324 | .022 | .120 | 14.528 | .000 |
of 0.017. This means that 1.7% of the variation in favorites_count is explained by this linear model, which is quite a small effect, although significant (all coefficients are significant too). The less “the,” “to,” “in,” and “with” somebody uses in her tweets, and the more friends she has, the more she will be favorited. More precisely, the standardized coefficient of −0.032 for “avg frequency_The” means that when the predictor variable “avg frequency_The” increases by one standard deviation, the dependent variable favorites_count decreases by 0.032 standard deviations. In other words, using a lot of “the’s” in tweets goes along with being “favorited” slightly less.
This concludes our bare-bones introduction to SNA and statistics; readers are encouraged to further study this subject in one of the many excellent books or online courses available elsewhere.
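The quantities discussed above (unstandardized coefficient, standardized coefficient, R squared, and adjusted R squared) can be illustrated with a least-squares fit. The sketch below is deliberately simplified to a single predictor, whereas the regression in Table 6 uses five; the data points and the function name are made up, so the numbers bear no relation to the Twitter dataset.

```python
from statistics import mean, stdev


def simple_regression(xs, ys):
    """Ordinary least squares with one predictor (an illustrative simplification)."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # unstandardized coefficient B
    intercept = mean_y - slope * mean_x    # the constant
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    predictors = 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - predictors - 1)
    beta = slope * stdev(xs) / stdev(ys)   # standardized coefficient
    return {"B": slope, "constant": intercept, "beta": beta,
            "r2": r2, "adj_r2": adj_r2}


# Made-up data: the outcome rises with the predictor, with some noise.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
result = simple_regression(xs, ys)
for name, value in result.items():
    print(name, round(value, 4))
```

The standardized coefficient rescales B into standard-deviation units, which is what makes coefficients of differently scaled predictors comparable. The adjusted R squared is always a little lower than R squared because it penalizes the number of predictors.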
MAIN LESSONS LEARNED
• SNA provides a framework for comparing the structural properties of a social network with its behavior.
• The key actor-level properties in SNA are degree centrality, betweenness centrality, and contribution index.
• The key network-level properties in SNA, used to compare different networks, are group degree centrality, group betweenness centrality, and density.
• The statistical t-test allows you to compare the difference of attributes between two samples.
• The Pearson correlation measures if two variables have a relationship.
• A linear regression tracks the impact of a set of input variables (called “independent variables”) on an outcome variable (called “dependent variable”).
6 HOW IDEAS SPREAD IN ONLINE SOCIAL NETWORKS — READINGS
CHAPTER CONTENTS
• Theories of information diffusion on social networks
• Spreading ideas on Facebook
• Finding fake content through machine learning
• Forecasting financial performance on social media
• Extracting demographic information from social media
• Predicting elections from social media
This chapter covers the theoretical background for the different applications of the tools and methods described in this book. It provides short comments for 25 foundational research papers investigating how ideas are spreading in online social networks and how analyzing online social network structure and content can be used to extract demographic information about the underlying real-world network.
The first section investigates which network structure is most conducive to spreading new ideas and convincing others to accept these new ideas. It also demonstrates that cooperation is a good idea and looks at trustworthiness and uncalculating cooperators. Subsequently, these concepts are tested in Facebook networks, which provide an unbiased platform to verify the algorithms introduced in the first section. The next section demonstrates that machine learning can discover fake text that humans cannot. The next three papers show how financial trends such as stock prices can be predicted using Google Trends, Twitter, and Wikipedia. Similar algorithms are then used to extract demographic information such as personality characteristics, sociodemographic and socioeconomic information, and the most controversial topics in different cultures from Twitter, Wikipedia, blogs, and mobile phone records. Finally, Twitter data is also used to predict the outcome of popular elections (Table 7).

Table 7: Overview of Papers Covered in This Section.

| Basic Concept | Main Insight | From Paper |
|---|---|---|
| How and why ideas spread in social networks | | |
| Weak ties | Dissemination of ideas | Battilana and Casciaro (2012) |
| Strong ties | Acceptance of ideas | Battilana and Casciaro (2012) |
| Advantages of embeddedness | Acceptance of ideas | Centola (2010) |
| Maximum number of close friends | 150 (Dunbar’s number) | Hill and Dunbar (2003) |
| Dark side of social capital | Acceptance of bad ideas comes from friends | Satyanath, Voigtländer, and Voth (2013) |
| Social structure to spread ideas | Structural fold (move people between teams) | Vedres and Stark (2010) |
| Social networks in stone age | Are the same, driven by homophily | Apicella, Marlowe, Fowler, and Christakis (2012) |
| Network structure of Nobel Prize winners is different | Fewer but fundamental papers, form groups in young years | Wagner, Horlings, Whetsell, Mattsson, and Nordqvist (2015) |
| Evolution is linking computers to humans | We think Google is part of our brain | Sparrow, Liu, and Wegner (2011) |
| Is cooperation genetic | Models show that altruistic cooperation benefits the individual | Nowak (2006) |
| Do we prefer to interact with people similar to us | Preference for interaction with similar people makes groups more similar | Fu, Nowak, Christakis, and Fowler (2012) |
| In a lab test, calculating cooperation is tested | Uncalculating cooperators are more popular | Jordan, Hoffman, Nowak, and Rand (2016) |
| Spreading ideas on Facebook | | |
| Do Facebook friends share the same musical taste | Yes, for classical music and jazz, they start liking the same style | Lewis, Gonzalez, and Kaufman (2012) |
| Do people adopt recommendations from Facebook friends | Yes, preferably if the friend is older and male | Aral and Walker (2012) |
| Facebook behavior is correlated with happiness | People who get more direct Facebook messages are more popular | Burke, Kraut, and Marlow (2011) |
| Finding fake content through machine learning | | |
| Machine learning finds fake reviews | Using different classifiers | Ott, Choi, Cardie, and Hancock (2011) |
| Measuring financial performance | | |
| Google Trends predicts stock markets | Search for a stock correlates with stock price change | Preis, Moat, and Stanley (2013) |
| Twitter buzz correlates with the stock market (DJIA) | GPOMS “arousal” predicts rise/drop in stock prices | Bollen, Mao, and Zeng (2011) |
| Do Wikipedia edits and searches correlate with stocks | Searches do, edits do not | Moat et al. (2013) |
| Calculating demographic information | | |
| Most controversial topics in different languages through reverts | Are religion and politics; Israel is also consistently controversial | Yasseri, Spoerri, Graham, and Kertész (2014) |
| The way one tweets predicts one’s personality | Using FFI, Twitterers with many followers are more extroverted | Quercia, Kosinski, Stillwell, and Crowcroft (2011) |
| Twitterers with more friends are happier | Happiness and popularity are not correlated | Bollen, Gonçalves, van de Leemput, and Ruan (2016) |
| Twitter behavior predicts income | Low income Twitterers use foul language | Preoţiuc-Pietro, Volkova, Lampos, Bachrach, and Aletras (2015) |
| Mobile phone records give demographic information | Income and physical activity can be measured from mobile phone records | Aharony, Pan, Ip, Khayal, and Pentland (2011) |
| Blogs will predict personality | FFI (five-factor inventory) predicted based on Pennebaker pronouns | Yarkoni (2010) |
| Predicting elections | | |
| Twitter buzz can predict outcome of elections | Congressional elections correlate with Twitter buzz | DiGrazia, McKelvey, Bollen, and Rojas (2013) |
6.1. THEORIES OF INFORMATION DIFFUSION
Battilana and Casciaro: Change agents, networks, and institutions: A contingency theory of organizational change
Battilana and Casciaro describe the advantages of strong and weak ties. Weak ties, that is, connections with casual acquaintances, are good for information dissemination. Strong ties, which come from being embedded through many connections in a group of good friends, are beneficial for the adoption of new ideas. Drawing on a study in eight healthcare organizations in the United Kingdom, the authors find that “explorative” innovations, which change the status quo, are better supported in “weak tie” networks with “structural holes,” while “exploitative” healthcare innovations that support the status quo flourish in densely connected networks. In particular, the more structural holes the social networks of influencers have, the more likely they are to initiate explorative change.
Centola: The spread of behavior in an online social network experiment
The paper compares the adoption of new ideas in social networks with different network structures. In a random network, the network diameter is lower than in a clustered one, so we would expect a new idea to have a shorter path for spreading through the whole network. However, in experiments Centola found clustered networks to be more efficient at spreading innovative ideas, despite the higher network diameter. A clustered network represents our social networks more
closely, as an individual’s neighbors often are neighbors to each other as well. This means that a person is repeatedly exposed to a new idea through the many neighbors in the same cluster, leading to faster adoption of the idea. As we tend to emulate the behavior of our friends, the likelihood of a person adopting the new idea increases with the number of neighbors that have adopted the idea.
Hill and Dunbar: Social network size in humans
This paper analyzes the size of social networks in Western society. The findings are based on a study in which the number of Christmas cards sent was measured. Forty-three white British households were questioned. The result of the study was that the average network size of a person is 154 people, out of which an average of 125 individuals are contacted explicitly, with the others being included by living in the same household. The relationship with the Christmas card recipients was examined by distance to the sender, relationship with the sender, social status of the recipient, last contact, and emotional closeness. It was found that the likelihood of sending a Christmas card was higher the farther away the other person lived (in particular overseas), the more emotionally close the person was, and if the person was a colleague at work.
Satyanath, Voigtländer, and Voth: Bowling for fascism: Social capital and the rise of the Nazi party in Weimar Germany, 1919–1933
The authors find a dark side of social capital. They measure social capital as the density of associations in a
particular region of Germany at the time when Nazi Germany was emerging, collecting association data from 112 German cities where the records survived the Second World War. Their analysis found that areas with higher association density registered more entries to the NSDAP. It seems that the more of our friends adopt a bad idea, the more likely we are to do the same.
Vedres and Stark: Structural folds: Generative disruption in overlapping groups
In a dataset with personnel ties among the largest 1696 Hungarian enterprises from 1987 to 2001, the researchers identified a distinctive network topology, the structural fold. Structural folds are bridging ties; actors at the structural folds are members of more than one cohesive group who over time change group membership, acting as knowledge transfer agents. The researchers found that groups with more structural folds show higher revenue growth.
Apicella, Marlowe, Fowler, and Christakis: Social networks and cooperation in hunter-gatherers
The researchers conduct a social network analysis among members of the Hadza tribe of Tanzania, a people of Stone Age hunter-gatherers. The authors find the same networking behavior as for people living in Western, Facebook-era societies. Ties between two people are measured through the option of giving honey sticks to each other. Reciprocity (the increased likelihood of an outbound tie to be reciprocated with an inbound tie from the same person), degree assortativity
(the tendency of popular people to befriend other popular people), transitivity (the likelihood that two of a person’s friends are in turn friends), and homophily (the tendency of similar people to form ties) seem to hold for both Stone Age and Internet age people. With respect to homophily, similar generosity, strength of handshake, and body mass index are all predictors of the existence of a tie.
Wagner, Horlings, Whetsell, Mattsson, and Nordqvist: Do Nobel laureates create prize-winning networks?
In this paper, a group of 68 Nobel laureates is compared to a group of similarly accomplished scientists who did not win a Nobel Prize. The goal was to compare productivity, impact, coauthorship, and international collaboration patterns of both networks. One big difference between the networks is that the laureates seem to be more likely to close structural holes by building bridges across networks. Therefore, the laureate network has significantly fewer communities and is more interconnected and less clustered. Nevertheless, nonlaureates seem to be more productive and have similar rates of collaboration. Laureates appear to focus on fewer, higher-quality publications and are more highly cited. Furthermore, more connections are found across the laureate network, providing more opportunity for bridging new ideas, methods, and technologies.
Sparrow, Liu, and Wegner: Google effects on memory: Cognitive consequences of having information at our fingertips
The Internet has become a primary form of external or transactive memory, where information is stored
collectively outside our brains. This paper examines whether people who expect to have access to online information have lower rates of recall of the information itself and enhanced recall of where to access it. In experiments, participants were shown statements that they thought they would have access to later. A control group, who thought they would not have access to the information later, was shown the same statements. The researchers found that when participants thought they would have online access, they spent less effort storing the information. The conclusion is that human memory processes are adapting to ready access to information on the Internet, Google, and Wikipedia, by remembering where to find information rather than the information itself.
Nowak: The evolution of cooperation
Nowak’s premise is that cooperation is needed for evolution to construct more complex organizations. On the other hand, natural selection implies competition and therefore opposes cooperation. Nowak then introduces five mechanisms for the evolution of cooperation: kin selection, direct reciprocity, indirect reciprocity, network reciprocity, and group selection. Kin selection means that a member of a network is more likely to help another member of the network if the two are genetically related. On the highest level is group selection, where the members of a group cooperate with each other to beat other groups. According to Nowak, cooperation promotes biological diversity and is the secret behind the open-endedness of the evolutionary process.
Fu, Nowak, Christakis, and Fowler: The evolution of homophily
This paper describes the evolution of homophily under different kinds of conditions by creating a model with preferences for either homophily (the tendency for individuals to interact with similar others) or heterophily, with different phenotypes (size, color, behavior), payoffs to interactions, evolution from one generation to the next, and overall fitness (ability to survive). The payoffs for homophilic interactions are called synergy, while payoffs from heterophilic interactions help to increase specialization. The model shows that favoring synergy can significantly reduce the total number of phenotypes, making a group more uniform and dominated by a single phenotype. In the long run, the group alternates between different dominant phenotypes. In heterophilic populations, diversity is maintained by privileging rare phenotypes. Homophilic populations prefer common phenotypes and drive alternate phenotypes to extinction.
Jordan, Hoffman, Nowak, and Rand: Uncalculating cooperation as a signal of trustworthiness
The paper describes a series of experiments comparing uncalculating cooperation and trustworthiness. The researchers test the following three hypotheses: (1) People should engage in more uncalculating behavior when their decision process is observable. (2) People should perceive uncalculating cooperators as more trustworthy than calculating cooperators. (3) Uncalculating cooperators should behave in a more trustworthy way than calculating ones. To test their predictions, the researchers conducted two experiments, each structured in two stages.
6.2. SPREADING IDEAS ON FACEBOOK
Lewis, Gonzalez, and Kaufman: Social selection and peer influence in an online social network
The paper addresses the question of whether people influence each other on social networks. This has been examined by collecting data from Facebook over four years from 1600 college students, which was complemented with data from college housing. The question investigated was whether Facebook friends would start picking up each other’s taste in music and reading, based on what they liked on Facebook. The results show relatively little evidence for social selection and social influence on social networks. Only for a few subareas such as “classical music” and “jazz” was a tendency of adopting a friend’s taste found. Musical tastes such as “indie” even show a negative effect, in that if somebody was a fan of indie music, this would deter their Facebook friends from also becoming indie fans.
Aral and Walker: Identifying influential and susceptible members of social networks
The paper studies the adoption of new ideas by tracking a new Facebook app recommending movies. The researchers measure influence and susceptibility of Facebook users based on how many messages they get from their friends about the new app. Users with high influence are less susceptible. Individuals with high susceptibility are mostly noninfluential. They found that older people and men are more influential, and that women are less susceptible to influence than men. Married people are least
susceptible to influence, while people reporting their marital status as “it’s complicated” are most susceptible.
Burke, Kraut, and Marlow: Social capital on Facebook: Differentiating uses and users
The authors measure the creation of social capital on Facebook. They combine longitudinal self-report surveys and Facebook server logs to examine how direct communication with friends, broadcasting status updates to a wide audience, and reading of others’ news can predict changes in users’ social capital, self-esteem, and communication skills. They define three types of social behavior on social networking sites: (1) directed communication with individual friends (messaging, likes, tags in pictures), (2) passive consumption of social news, when one reads others’ updates, and (3) broadcasting, when one writes for others’ consumption without targeting a particular recipient. Results indicated that for people with low communication skills and self-esteem, passive consumption of information increases their self-esteem and communication skills. Only directed one-to-one communication will actively increase social capital.
6.3. FINDING FAKE REVIEWS THROUGH MACHINE LEARNING
Ott, Choi, Cardie, and Hancock: Finding deceptive opinion spam by any stretch of the imagination
This paper shows that machine learning does a better job than humans in detecting fake reviews. The authors
created a dataset of real and fake hotel reviews and automated the detection with genre identification, psycholinguistic methods, and simple text categorization. As a reference, they took a subset of the data categorized by human judges. The results show that automated methods are better at detecting deceptive opinion spam, while human judges perform roughly at chance. Among the automated approaches, the n-gram-based text categorization got the best individual results. When combined with psycholinguistically motivated features, the detection accuracy reached almost 90%. The paper also studied how to best write credible fake reviews, considering the context as well as the underlying motivation to detect a deception.
6.4. MEASURING FINANCIAL PERFORMANCE
Preis, Moat, and Stanley: Quantifying trading behavior in financial markets using Google Trends
The authors look at the Google search behavior for trading and finance-related search terms using Google Trends. They find a correlation between the number of searches for terms relevant for finance and trading and the Dow Jones Industrial Average (DJIA). They find that increases in searches for these terms tend to precede drops in the Dow Jones by a few days. A trading strategy shorting and buying the Dow Jones each Monday based on the most predictive search terms averaged out over the preceding six weeks gives high theoretical returns.
Bollen, Mao, and Zeng: Twitter mood predicts the stock market This paper investigates if information about public mood extracted from Twitter messages has predictive capability regarding stock market prices. The researchers collected 10 million tweets over a time period of 10 months and compared it against daily DJIA closing values. The tweets’ contents were analyzed using the tool OpinionFinder to assess the emotional polarity of each tweet. As a second sentiment analysis tool, “Google-Profile of Mood States” was used to calculate the six mood dimensions: calm, alert, sure, vital, kind, and happy. To measure the relationship between DJIA and Google-Profile of Mood States, Granger causality analysis was used, finding a strong correlation between the calm state of tweets and the DJIA. Further analysis using a Self-Organizing Fuzzy Neural Network confirmed the correlation with an accuracy of 87.6% in predicting daily DJIA values based on the calm emotionality metric. Moat, Curme, Avakian, Kenett, Stanley, and Preis: Quantifying Wikipedia usage patterns before stock market moves This paper analyses the correlation between stock market movements and search behavior on Wikipedia. The authors speculate that before trading decisions are made, the traders might look up information on Wikipedia or even make edits to the Wikipedia page. This means that the amount of views or edits of financially relevant Wikipedia pages may contain early signs
of stock market moves. Based on previous studies on behavioral economics that demonstrated that humans are loss averse, the authors assumed that investors might be willing to invest more efforts in information gathering before making a decision, which they view to be of greater consequence. This would lead to the conclusion that increases in information gathering would precede falls in stock market prices. The authors analyzed Wikipedia metadata generated between 2007 and 2012 and measured changes in page views and page edits from one week to the next. If the number of page views or edits increased, they predicted falling stock prices, otherwise they bet on rising stock prices. The authors found that a hypothetical portfolio trading based on Wikipedia page view changes was highly profitable; they could not detect any signal however from the Wikipedia page edits.
6.5. CALCULATING DEMOGRAPHIC INFORMATION
Yasseri, Spoerri, Graham, and Kertesz: The most controversial topics in Wikipedia: A multilingual and geographical analysis
The authors extracted background information about the most controversial Wikipedia articles by calculating the “controversy” of an article based on how many times it has been edited and reverted. They analyzed Wikipedia articles in 10 different languages with the additional help of geographical tags and found that articles about religion, Israel, and anti-Semitism are
controversial in every language and region, whereas most of the other topics are only controversial in certain regions and languages, for example, “Gipsy Crime” in Hungarian or “Chile” in Spanish. The English Wikipedia is exceptional because English is widely spoken all over the world; therefore, globally disputed topics like Jesus or anarchism are often represented in the English Wikipedia. The most controversial categories are about politics, geographical locations, and religion.
Quercia, Kosinski, Stillwell, and Crowcroft: Our Twitter profiles, our selves: Predicting personality with Twitter
This project investigated the relationship between personality characteristics of tweeters and their tweeting behavior. Based on a dataset collected from Facebook, where users could take a big five personality test, the authors compared the personality profiles of the 335 Twitter users included in this dataset against their tweeting behavior. They tracked three features of Twitter users publicly available on profiles: following (number of profiles a user follows), followers (number of followers), and listed counts (number of times the user has been listed in others’ reading lists). Using these three features, they clustered the 335 Twitter users into four categories: listeners, popular, highly read, and influentials. The study produced two main insights. First, all their Twitter users were emotionally stable and most of them were extrovert. Second, popular users tend to be “imaginative” (high in openness), while influential users tend to be “organized” (high in conscientiousness).
Bollen, Gonçalves, Van de Leemput, and Ruan: The happiness paradox: Your friends are happier than you
This paper investigates the relationship between popularity and happiness on Twitter. The authors examined a group of 40,000 Twitter users. They assessed the popularity and the happiness of these users, calculating happiness through automatic sentiment analysis of tweets, and popularity by counting their number of followers. The results show that a friendship paradox (on average people are less popular than their friends) and a happiness paradox (on average people are less happy than their friends) exist, but there is no correlation between popularity and happiness.
Preoţiuc-Pietro, Volkova, Lampos, Bachrach, and Aletras: Studying user income through language, behaviour and affect in social media
This research analyzed Twitter users. It derived their profession from the self-declaration in their Twitter profiles and assigned them a number of user-level psychodemographic features for later comparison. Results confirmed the impact of gender or race on income; the researchers also found that people with higher income post more neutral content, have more followers, and express more emotions of fear and anger. People with lower income on average send more tweets. The researchers assume that this is because lower income users use Twitter more for social interaction. They are also more emotional in their tweets, but compared to people with higher income they retweet less and
are retweeted less often. The researchers found that users with higher income tweet more about NGOs and corporate topics, whereas people with lower income tend to use more swear words.
Aharony, Pan, Ip, Khayal, and Pentland: Social fMRI, investigating and shaping social mechanism in the real world
Just like medical fMRI measures brain activity, the authors use social fMRI to measure interpersonal interaction through mobile sensors, most prominently mobile phone records. The paper reports three experiments. The first shows that individuals’ social interaction patterns are influenced by their financial status: the lower the income, the less social interaction they will have. The second experiment concludes that social relationships influence decision-making, particularly through face-to-face interaction; the more interaction somebody had, the more likely they were to install an app on their mobile phone. In the third experiment, the goal was to increase physical exercise through the influence of friends. The researchers found that compensating friends was the best strategy to get couch potatoes to exercise more.
Yarkoni: Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers
This paper compares personality characteristics and word usage among 700 bloggers who took the big five personality test, which measures intro/extroversion,
neuroticism, openness to experience, agreeableness, and conscientiousness. Yarkoni found that word usage indeed predicts personality characteristics. For example, a person with high neuroticism may primarily use adjectives to describe events in a negative way (awful, depressing, stressful) rather than nouns connoting actual negative events. A person with high agreeableness might often speak about love (love, hug) but is less likely to use sexual words (porn, gay, fuck). Some personality traits like openness have a high correlation with people’s vocabulary. Others like extraversion or conscientiousness might be more difficult to predict.
6.6. PREDICTING ELECTION OUTCOME DiGrazia, McKelvey, Bollen, and Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior This project finds a strong correlation between the number of Twitter mentions of candidates for the US congressional election in 2010 and their eventual vote tally. Using 500 million tweets, the researchers cross-verified their findings with district partisanship, demographic, and media coverage to control for other outside influences. The content of the tweets other than mentioning the candidate name did not seem to matter for predicting the popularity of a candidate.
MAIN LESSONS LEARNED
• Having a large group of weak-tie friends is good for spreading knowledge of new ideas; having a close-knit group of strong-tie, trustworthy friends is good for the adoption of new ideas.
• Humans have on average 150 real friends and “tribe members” (Dunbar’s number); if they have more, for instance, on Facebook, they are not really friends.
• Collaborative groups with in-group altruism will win against noncooperative groups.
• Facebook provides an excellent testbed for social network research on homophily and influence; older men are most influential on Facebook, while in general humans do not tend to adopt many ideas from their Facebook friends.
• For discovering cheating online behavior, machine learning is more accurate than humans reading and assessing the text online.
• Online social media provides an excellent source for predicting financial performance of assets such as stock; in particular, Google or Wikipedia search patterns are predictive of future performance.
• Online behavior on social media such as Twitter can be used to predict demographic attributes of Twitter profiles such as income or health.
• Twitter can also be used to predict the outcome of elections.
PART II. ANALYZING STRUCTURE, DYNAMICS, AND CONTENT OF NETWORKS WITH CONDOR
The second part of this book describes how to use online social media analysis to Coolhunt (identifying trends by finding the trendsetters) and to Coolfarm (studying team networks and improving their communication for better collaboration and more innovation). Analyzing social media to read the collective mind consists of four different parts: virtual mirroring, trend forecasting, Coolhunting, and Coolfarming. These four social media analysis tasks differ along two dimensions: (1) time and (2) the nature of what we do not know (Figure 21). We can either analyze social media to gain insights about things that we do not know today or to predict the future. Trend forecasting and Coolfarming are concerned with activities in the future, whereas virtual mirroring and Coolhunting are concerned with things that we do not know today. The second dimension distinguishes between two kinds of unknowns. There are things which we know to exist but whose development we do not know; these are the known unknowns. But there are also things which we do not
Figure 21: Four Fundamental Social Media Analysis Tasks to Read the Collective Mind.
know to exist before starting the analysis; these are the unknown unknowns. Mathematician Nassim Taleb calls unknown unknowns “black swans,”1 because in the European Middle Ages it was taken for granted that all swans were white; it was unimaginable that a swan could be black. Only when the first Europeans came to Australia and found black swans there did they have to change their beliefs. After two introductory chapters that explain the basic architecture and concepts of the social media analysis tool Condor, the succeeding six chapters demonstrate all four facets of online social media sense making: virtual mirroring, trend forecasting, Coolhunting, and Coolfarming.
Chapter 9, Analyzing E-Mail with Condor (virtual mirroring)
1 Taleb (2007).
Chapter 10, Calculating Personality Characteristics from E-Mail (trend forecasting)
Chapter 11, Predicting Criminal Intent from E-Mail — Analyzing the Enron E-Mail Archive (trend forecasting)
Chapter 12, Coolhunting on the Internet with Condor (Coolhunting)
Chapter 13, Coolhunting — Francogeddon (Coolhunting)
Section 14.1, Bernie Sanders’ Presidential Campaign — The Perfect COIN (Coolfarming)
Section 14.2, Coolhunting Bernie Sanders, Hillary Clinton, Jeb Bush, and Donald Trump (Coolhunting)
Section 14.3, Tribefinder on Twitter (Using Machine Learning) (trend forecasting)
7 THE FOUR-STEP ANALYSIS PROCESS
CHAPTER CONTENTS • Overview of Condor’s four-part architecture • Condor’s social media fetchers • Condor’s social media filters • Condor’s social media visualizers • Condor’s social media exporters.
To illustrate Coolhunting, Coolfarming, trend forecasting, and virtual mirroring concepts described in Part I of this book, Part II introduces the social media analysis tool Condor, which has been developed over the last 14 years by a team from the University of Applied Sciences Northwestern Switzerland, the MIT Center for Collective Intelligence, and over the last seven years by the software company galaxyadvisors. Condor is a powerful social media analysis tool for collecting all types of social media data, aggregating the data, visualizing it,
and exporting it to other types of analysis tools such as Excel. Condor consists of four parts:
(1) A series of fetchers to directly load data from e-mail (for example, from Gmail or Exchange), calendars, Skype, Twitter, Google, Wikipedia, and Facebook.
(2) Interactive preprocessing functions to modify and reduce the graph, filter by content and by structure, annotate by geocode, merge multiple e-mail addresses, and create modified graphs.
(3) Visualization functions to show the static network, a dynamic movie of the network over time, geographical word maps, and different views for structure, content, and sentiment of actors.
(4) Export functions to export time series of all variables for later longitudinal analysis in statistics packages such as R or SPSS (or Excel).
Figure 22 lists the four components of Condor: the fetchers, filters, visualizers, and exporters. They work together to calculate the six honest signals of collaboration introduced in Chapter 4. These honest signals can be calculated both from the inside of an organization, tracking mostly e-mail, online calendars, and chat, or from the outside on the Internet, tracking tweets, posts on Facebook walls, and the speed with which Wikipedia pages about a certain topic are edited. In step 1, the different fetchers collect the raw communication data in an easy and automated way from e-mail archives, Twitter, Facebook, the Web, and Wikipedia. In step 2, the filters preprocess the data to prune and shape the network in the most meaningful way for
Figure 22: Four Main Parts of Condor: Fetchers, Filters, Visualizers, Exporters.
analysis. To calculate the six honest signals of collaboration, the network first has to be cleaned. This is where the science of social network analysis meets the art of social network interpretation. To take a simple example, a user with multiple e-mail addresses can be merged into one virtual actor for later analysis. Or mailing list addresses can be removed, as they would show up as the most central actors in the network without having any real social meaning. Or in a complex network with many so-called “leaf” nodes, which are only connected to one other node, these peripheral nodes can be pruned from the network. In step 3, the Condor user can look at the different visualizations of the network’s structure, dynamics, and content to develop hypotheses about which honest signals might be most indicative of the outcome metrics the user is trying to measure. For example, looking at the contribution index chart of individual actors within a community will tell the analyst which person is the most popular member of the community — getting the most e-mails or tweets, or who is the most active participant — sending the most e-mails or tweets. In the dynamic view or in time series curves, the development of a community or a discussion about a topic can be tracked over time. In step 4, the numbers of the time series can be calculated and exported to an external analytics or visualization tool such as Excel or SPSS. This allows the user for instance to drill down on the evolution of betweenness over time to see who has been the most central person in a community at any given point in time. A time series of actor level emotionality metrics will tell who has been most positive at a given point in time. A time series of response times from others will tell how somebody is
gaining respect by others answering her successively faster over time. This is where the statistics briefly described in Section 5.2 will come in handy. We will now look at each of the four Condor components (fetcher, filter, visualizer, exporter) in more detail.
7.1. SOCIAL MEDIA FETCHERS
The fetchers (Figure 23) allow Condor to automatically collect large amounts of social network data. In particular, Condor has fetchers for e-mail, online calendars, Skype, Facebook wall, Wikipedia, Google Custom Search, and Twitter. The fetchers get the data from outside sources and store it in the MySQL database. This data can then be taken and preprocessed for later analysis. Note that besides the live social media fetchers, Condor can also directly import MySQL databases and Microsoft CSV files.
Figure 23: List of Condor Fetchers.
7.2. SOCIAL MEDIA FILTERS
The Condor filters prepare the data stored by the fetchers in the MySQL database for later analysis. They allow the user to merge multiple e-mail addresses into one virtual actor; for example, the e-mail addresses [email protected] and [email protected] can be combined into a single virtual actor, that is, one node on the screen. Actors can also be removed by name, or by property, for example, removing all nodes with fewer than three nearest neighbors. One key function is the annotate function, which will calculate the six honest signals of collaboration described in Chapter 4. Note that all these changes are only stored in the MySQL database when they are explicitly saved under a new database name. As annotated values such as betweenness centrality or sentiment are calculated over the whole network, they should be recalculated whenever a node has been removed (Figure 24).
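To make the merge and remove filters more concrete, here is a minimal Python sketch of the idea. It does not show Condor's implementation; the function `merge_actors`, the example addresses, and the edge-list representation are assumptions made for this illustration.

```python
from collections import Counter

def merge_actors(edges, aliases, drop=()):
    """Merge several addresses into one virtual actor and drop unwanted actors.

    edges:   list of (sender, recipient) pairs, one per message
    aliases: dict mapping an address to the name of its virtual actor
    drop:    actors to remove entirely, e.g. mailing list addresses
    Returns a Counter mapping (sender, recipient) to the number of messages.
    """
    merged = Counter()
    for sender, recipient in edges:
        s = aliases.get(sender, sender)
        r = aliases.get(recipient, recipient)
        # Skip removed actors and self-loops created by merging aliases.
        if s in drop or r in drop or s == r:
            continue
        merged[(s, r)] += 1
    return merged

# Hypothetical example: Ann uses two addresses; "all-staff" is a mailing list.
edges = [
    ("ann@work.example", "bob@work.example"),
    ("ann@home.example", "bob@work.example"),
    ("bob@work.example", "all-staff@work.example"),
    ("ann@home.example", "ann@work.example"),
]
aliases = {
    "ann@work.example": "ann",
    "ann@home.example": "ann",
    "bob@work.example": "bob",
}
cleaned = merge_actors(edges, aliases, drop={"all-staff@work.example"})
```

After such a change, any centrality or sentiment value computed on the old network is stale and has to be computed again on the cleaned one.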
7.3. SOCIAL MEDIA VISUALIZERS
At the core of Condor is a list of visualizations (Figure 25). The key views are the static and dynamic network views; the dynamic view shows a movie of the evolution of an e-mail, Twitter, or Wikipedia network over time. The actor scatter plot allows users to quickly compare and visualize
Figure 24: Condor Filters.
Figure 25: Condor Visualizers.
actor level metrics. The values are straightforward to compare if the same variable, for example “total number of messages,” is always chosen for the x-axis. The sentiment views, the word cloud, and the geomap view show the content that Condor finds in tweets and e-mail message bodies. The SNA metrics over time view shows the evolution of group betweenness centrality and other graph-level metrics. The temporal social surface view shows the same information, broken down to the individual actor level.
7.4. SOCIAL MEDIA EXPORTERS The Condor export wizards export actor-level metrics in CSV format that are aggregated over time, as well as
Figure 26: Condor Export Wizards.
longitudinal time series for later analysis in SPSS, Excel, and R. The exported data can be shown directly in Excel or can be further manipulated and computed in statistics packages such as SPSS, Matlab, R, or Stata (Figure 26). After this brief introduction into the architecture of Condor, we will now look at how to get started with Condor.
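As an illustration of what can be done with an exported time series outside of Excel or SPSS, the following Python sketch finds the most central actor in each time window. The column names `actor`, `window_start`, and `betweenness` and the sample rows are hypothetical; the actual layout of a Condor export file may differ and should be checked in the exported CSV.

```python
import csv
import io

# Hypothetical export: one row per actor and time window.
SAMPLE = """actor,window_start,betweenness
alice,2016-03-01,0.40
bob,2016-03-01,0.10
alice,2016-03-08,0.20
bob,2016-03-08,0.35
"""

def most_central_per_window(csv_file):
    """Return {window_start: actor with the highest betweenness}."""
    best = {}
    for row in csv.DictReader(csv_file):
        score = float(row["betweenness"])
        window = row["window_start"]
        if window not in best or score > best[window][1]:
            best[window] = (row["actor"], score)
    return {window: actor for window, (actor, _) in best.items()}

# With a real export, pass open("export.csv", newline="") instead.
leaders = most_central_per_window(io.StringIO(SAMPLE))
```

In this toy data the most central person changes from one week to the next, which is the kind of shift the time series view is meant to reveal.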
MAIN LESSONS LEARNED
• The four-step analysis process in Condor starts with collecting communication data from Twitter, Facebook, Wikipedia, and blogs, and also from e-mail and other types of organizational communication archives such as calendars or Skype.
• The collected data is preprocessed and cleaned using a series of content filters.
• In the next step, Condor provides a variety of visual analysis tools to explore the social network in many different ways.
• In the last step, the data is exported as actor-level variables and time series for further statistical analysis in tools like Excel, KNIME, R, or SPSS.
8 GETTING STARTED WITH CONDOR
CHAPTER CONTENTS • Analyzing the Facebook wall • Analyzing Twitter tweets • Measuring the importance of brands through betweenness in bipartite graphs • Removing the “Nobodies” — pruning the leaves • Degree-of-separation search.
In order to start Condor, you first need to install Java and MySQL on your machine (Condor runs on Windows, Mac, or Linux). The first step is to download the latest version of the Java runtime from the Oracle website https://java.com/en/download/ by clicking on the large red button “Free Java Download.”
If you have problems installing Java, you will find help in the Condor Manual: http://91.250.82.108:8080/condor/Condor%203%20User%20Manual.pdf
The next step is to install MySQL, which you can download from http://dev.mysql.com/downloads/mysql/. Again, if you have problems, you will find tips in the Condor Manual. There is also a series of YouTube videos, linked from the Condor Manual, that will take you step by step through the process.
Now you are ready to download Condor, which you will find on http://guardian.galaxyadvisors.com/. The free academic version will allow you to analyze and visualize up to 10,000 nodes; you can download as many nodes as you want with the different data fetchers. If you want to analyze and visualize larger networks, you first have to sign up with a valid e-mail address. You will subsequently get a download link for a Condor trial version sent to your e-mail address. In the e-mail, click on “verify email address.” This will take you to your account page on the Condor Guardian website; from there you can download a full trial version of Condor.
Once you have downloaded Condor, you can either double-click the Java icon or better start it from the command line to allocate more memory to Condor, by opening a DOS command window or a Mac or Linux terminal window.
When you first try to start Condor on a Mac, you might, depending on your security settings, have to go to the “Security & Privacy” pane and allow Condor to run by clicking on “Open Anyway.”
The next step, once you have allowed Condor to start, will be to enter your license key.
After that, the MySQL login window will pop up. If the fields are all red, you most likely do not have MySQL running (note that installing MySQL does not start it; you have to start it yourself after the installation).
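If you are unsure whether MySQL is running, a quick check is to see whether anything accepts connections on the MySQL port. The following Python sketch assumes a local installation listening on the MySQL default port 3306; the function name `mysql_is_listening` is made up for this example, and a different host or port has to be passed in if your setup deviates from the defaults.

```python
import socket

def mysql_is_listening(host="127.0.0.1", port=3306, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that some server accepts connections on the port;
    it does not verify the MySQL user name or password.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if mysql_is_listening():
        print("A server is listening on port 3306.")
    else:
        print("Nothing is listening on port 3306; start MySQL first.")
```

If the check fails, start the MySQL server and reopen the Condor login window.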
Once your MySQL login window looks like the image below, you can click “ok” in case you have installed MySQL with default settings as user “root” and no password. If you have set a MySQL password, you have to enter it now.
Condor will now bring up the Condor Workspace, and you are ready to start working.
Now you are ready to jump into your first social media analysis.
8.1. ANALYZING THE FACEBOOK WALL WITH CONDOR
We will start by collecting and analyzing your own personal Facebook wall. This only works if you have a Facebook account.
The first step consists of creating a dataset.
Alternatively, you can directly enter a new dataset name and create the new dataset on the fly when starting the Facebook wall fetcher.
This will bring up a window to log into Facebook and authorize Condor to collect the Facebook wall.
After logging in, you will be given a security warning by Facebook, which you can ignore. Click on the “next” button again, to collect the actual data.
Once the wall has been collected, you can create a static view of your network.
To easily find the owner of the wall (myself in this case), I annotate my network by betweenness centrality.
Among the options, I click on “betweenness centrality.” Annotating the graph means calculating the corresponding variable (betweenness centrality, degree centrality, contribution index, etc.) for each actor. Note that these variables will change if a single actor has been removed or added to the network and will have to be
recomputed by rerunning the corresponding annotate command.
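To see why the owner of a wall ends up with the highest betweenness centrality, it helps to compute the measure on a tiny network. The sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python. It illustrates the measure itself, not Condor's code; Condor may scale or normalize the values differently, and the small wall network is invented for this example.

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm.

    adj: dict mapping each node to the set of its neighbors
         (unweighted, undirected graph).
    Returns the unnormalized number of shortest paths through each node.
    """
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}       # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)      # number of shortest paths from s
        sigma[s] = 1
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                       # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair was counted once from either end, so halve the totals.
    return {v: c / 2 for v, c in score.items()}

# Invented wall network: "me" is linked to everyone; only a and b know each other.
wall = {
    "me": {"a", "b", "c", "d"},
    "a": {"me", "b"},
    "b": {"me", "a"},
    "c": {"me"},
    "d": {"me"},
}
centrality = betweenness(wall)
```

Of the six pairs of friends, five can only reach each other through the wall owner, so the owner gets a betweenness of 5 while every friend gets 0.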
When I now size the nodes by betweenness centrality, I see myself as the largest node.
After this simple use of Condor to have a look at your Facebook wall, we will now run a step-by-step analysis “fetcher — filter — visualizer — export” as described in Chapter 7.
8.2. SAMPLE FOUR-STEP ANALYSIS WITH TWITTER
Before running the first Twitter Fetcher query, you will need to obtain your own personal Twitter API (application program interface) keys. For this, you will need to be registered with Twitter. After that, go to https://apps.twitter.com and click on “Create New App.” Fill in the details of the app, for example “measure importance of presidential candidates,” and click on “Create Your Twitter Application” to create your consumer key and consumer secret. You will also have to create your Twitter access token and access token secret, as described here: https://dev.twitter.com/oauth/overview/application-owner-access-tokens (alternatively, you can generate Twitter access tokens on the fly by clicking on “Login with Twitter” in the first Condor dialog). Once you have your Twitter access token, your Twitter access token secret, your Twitter consumer secret, and your Twitter consumer key, you are ready to collect Twitter data with the Condor Twitter Fetcher. For this simple first example, we will compare the US presidential candidates Bernie Sanders, Donald Trump, Hillary Clinton, and Ted Cruz. We start by creating a new MySQL database in Condor, naming it “4candidates.”
8.2.1. Step 1 — Fetch Data
Then we create a dataset for each candidate, starting with Donald Trump.
Next, we run a query collecting the most recent 2000 tweets about the candidate. We need to make sure that the checkbox “Connect nodes with search term” is checked. This will add an additional link from each tweet containing the search term “donald trump,” to the search term “donald trump” which will be shown as an additional actor of type “search term” in the resulting network. Including this link in the graph will allow us to later compare the strength of the brand “donald trump” to the brands “hillary clinton,” “bernie sanders,” and “ted cruz,” which is described in Section 8.3.
In the next step you can either use the credentials from your app, or log in with Twitter. This process is now
repeated for the other three candidates Hillary Clinton, Bernie Sanders, and Ted Cruz. To do a combined analysis, we then merge the four datasets into one combined dataset.
8.2.2. Step 2 — Process
We now calculate the values to be displayed in Step 3:
• Betweenness and degree centrality
• Betweenness oscillation
• Contribution index annotation
• TurnTaking annotation
• Sentiment
8.2.3. Step 3 — Visualize The first parameter to look at is the activity of the combined tweets. We see that collecting the last 2000 tweets of the four candidates collected on March 6, 2016, around 14:30 will just give us the last hour, from 14:41 to 15:28.
Bringing up the actor scatter plot will tell us who the most respected tweeters are, that is, those who are retweeted or responded to fastest by others. For instance, CliffWilkin is a proponent of “Convention of States,” a right-wing initiative that wants to take away power from the federal government, and a follower of Ted Cruz. CliffWilkin is being addressed quickly by other tweeters, within 0.0125 hours on average (0.0125 × 3600 s = 45 seconds, that is, in less than a minute).
Image 2 (for color pictures see the online version of the images, available at http://www.ickn.org/sociometrics/)
8.2.4. Step 4 — Export In the final step, we export the calculated actor values. We can then open the file “4candidates.csv” in Excel for later analysis.
8.3. MEASURING THE IMPORTANCE OF BRANDS THROUGH BETWEENNESS OF ACTORS IN BIPARTITE GRAPHS
Condor offers a unique way of measuring the strength of brands by measuring the betweenness of their search terms in a social network graph created through Twitter, Google Custom Search Engine (CSE) search, or Wikipedia. For instance, in a Twitter graph, a link between two actors is drawn if person A is retweeting person B, or if person A is mentioning person @B in a tweet. The more a particular brand is mentioned in tweets, the more links it will thus have from all actors mentioning it in their tweets. For instance, everybody tweeting about “bernie sanders” will have a link to the search term “bernie sanders.” The more people tweet about “bernie sanders,” the more incoming links the search term “bernie sanders” thus will have. If we extend the graph to include all tweets about “donald trump,” then the more incoming links “donald trump” has compared to “bernie sanders,” and the more central the people tweeting about “donald trump” are compared to the people tweeting about “bernie sanders,” the stronger the brand of “donald trump” is. The image below (Image 3) shows the static view with the nodes sized by betweenness. The static view for Twitter will contain two types of actors. All the people tweeting will be shown as circles, with the tweets as connecting lines from a tweeter to the retweeter or the person mentioning the other person in the tweet. In addition, there will be the original search term, as a special node, shown as a square. In our example, we will
have four squares, “bernie sanders,” “hillary clinton,” “ted cruz,” and “donald trump.” By the definition of betweenness, the size of the square will be a proxy for the importance of the search term. Dragging the mouse over each search term will tell us that in the time from 14:41 to 15:28 on March 6, 2016, Donald Trump had the strongest brand, with betweenness of 1.23 × 10^7, followed by Ted Cruz with betweenness of 1.06 × 10^7, followed by Hillary Clinton with 9.96 × 10^6 and Bernie Sanders with 9.86 × 10^6.
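The construction of such a graph with two types of actors can be sketched as follows. The tweet records, field names, and helper functions are hypothetical and only illustrate how person-to-person links and links to search-term nodes come about; Condor's own data model is not shown here. Counting incoming links per search term is only a first approximation of brand strength, since Condor uses the betweenness of the search-term node, which also takes into account how central the tweeting people are.

```python
from collections import Counter

def build_links(tweets, search_terms):
    """Return a list of (source, target, kind) links.

    tweets: list of dicts with 'author', 'text', and 'mentions'
            (accounts retweeted or mentioned in the tweet).
    A tweet containing a search term adds a link from its author
    to the node representing that search term.
    """
    links = []
    for tweet in tweets:
        for other in tweet["mentions"]:
            links.append((tweet["author"], other, "person"))
        text = tweet["text"].lower()
        for term in search_terms:
            if term in text:
                links.append((tweet["author"], term, "search term"))
    return links

def incoming_per_term(links):
    """Count the incoming links of each search-term node."""
    return Counter(target for _, target, kind in links if kind == "search term")

# Invented sample tweets.
tweets = [
    {"author": "u1", "text": "Donald Trump speaks tonight", "mentions": ["u2"]},
    {"author": "u2", "text": "RT Donald Trump speaks tonight", "mentions": []},
    {"author": "u3", "text": "Bernie Sanders rally", "mentions": ["u1"]},
]
links = build_links(tweets, ["donald trump", "bernie sanders"])
term_counts = incoming_per_term(links)
```

Here the search term “donald trump” receives two incoming links and “bernie sanders” one. A betweenness calculation on the full graph would then weigh these links by the position of the people behind them.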
8.4. PRUNING THE LEAVES IN A GRAPH
The Twitter network shown below (Image 3) about Trump, Cruz, Clinton, and Sanders contains 7435 actors and 17,325 links. We can prune the graph and remove all the peripheral nodes by removing all the nodes with degree centrality 1; these are all the people who did not get retweeted, and whom nobody mentioned in another tweet, or who did not mention anybody else in a tweet. For this we use the “Process dataset->Actor filter” function. Before bringing up the dialog, we need to make sure we have calculated the degree centrality (“Process dataset->Annotate->centrality annotations”).
Image 3a
a
For color pictures see online version of images, available at http://www.ickn.org/sociometrics/
All the “leaf”-nodes at the periphery, that is, all the tweeters that have only tweeted once without being retweeted or being mentioned in another Tweet, are now gone (image 4 on page 140).
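The pruning step can be sketched in a few lines. This is an illustration only, assuming that degree counts distinct neighbours in an undirected view of the graph; Condor's own degree definition and the function name `prune_by_degree` are assumptions, not part of Condor.

```python
def prune_by_degree(edges, min_degree=2):
    """Keep only the edges whose two endpoints both have degree >= min_degree.
    Degree is counted as the number of distinct neighbours."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    keep = {n for n, nb in neighbours.items() if len(nb) >= min_degree}
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Hypothetical example: "a" only ever touched the hub once, so it is a leaf.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("b", "c")]
pruned = prune_by_degree(edges)
print(pruned)
```

Note that this is a single pass: removing leaves can expose new degree-1 nodes, which would only disappear if the filter were applied again.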
Image 4a
8.5. DEGREE-OF-SEPARATION SEARCH WITH GOOGLE CSE
The same four-step analysis process “fetch-filter-visualize-export” also works for other data sources. Repeating the same query on the Web for the two candidates, Trump and Clinton, leads to a “degree-of-separation” search to measure which candidate is more popular, and which websites are most important to boost the candidates’ importance. Web searches in Condor are conducted through the Google CSE API. Before executing the first Condor Google CSE search, you will need to obtain a Google CSE API key. You can get an API key from https://console.developers.google.com/apis/ by clicking on “Custom Search API” in the “Other popular APIs” group. Enter this key in the dialog box in Condor. You will also need to enter the Condor CX Key: 000229616349723713761:mlcaolv1mpw. This process is also described step-by-step in the Condor manual. Google gives you 100 free queries per day; if you want to run more queries, Google will ask for your credit card number and charge you a few cents per query.
Condor’s “degree-of-separation” search provides a powerful mechanism for measuring the importance of a search term on the Web based on the importance of the websites where it is being used, similarly to Google’s PageRank algorithm. Unlike the PageRank algorithm, which returns a single number per website as a proxy for its importance, “degree-of-separation” search returns the betweenness of the website depending on the particular search term. For example, for searches for politicians politico.com will be important, while for searches for actors imdb.com will be prominent.
To illustrate how “degree-of-separation” works, let’s look at a specific example. For instance, for the search for “Donald Trump,” a Google search is run through the Google Custom Search Engine (CSE) API. The resulting 10 or 20 top links (depending on your settings in the CSE fetcher) are then plugged back into the Google search engine, and the top 10 or 20 back links pointing back to each of the top 10 or 20 original sites are taken. In more detail (see Section 12.1 for a step-by-step description in Condor), it works as follows: Step 1: Using “Fetch->Fetch Web” search on Google for “Donald Trump” Get the top 10 results. In the example below, done on October 3, 2016, the search for “Donald Trump” returned the following websites:
The image below shows the same result, visualized in Condor as a network with each website containing the search text “Donald Trump” pointing back to the original search term “Donald Trump.”
Step 2: Get the top 10 results pointing back to top 10 results
Condor will now collect websites such as www.headlinespot.com, which links to www.politico.com, which in turn contains the search text “Donald Trump.” The image below shows the complete backlink network. Blue nodes contain the search term “Donald Trump,” yellow nodes link back to the blue nodes.
Image 5a
In the next step, we repeat the same “degree-of-separation” search for Hillary Clinton, to check which one of the candidates is more popular on the Web. Step 3: “Degree-of-separation” search for “Hillary Clinton” The image below shows the top 10 websites pointing back to the top 10 websites containing the search term “Hillary Clinton.”
Step 4: Merge degree-of-separation search for “Donald Trump” and degree-of-separation search for “Hillary Clinton” and calculate betweenness centrality
The image below illustrates the resulting bipartite (with nodes of two types) website network, with the search terms shown as squares and the websites shown as circles. The size of a node shows its importance, measured as betweenness centrality. In the network below, Hillary Clinton has higher betweenness centrality (her purple square is slightly larger than the one for Donald Trump), based on the higher betweenness centrality of the websites pointing back to her (nytimes.com, nymag.com).
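Steps 1 to 4 can be summarized as a small routine. The sketch below is an illustration under stated assumptions: `search` and `backlinks` are hypothetical stand-ins for the calls Condor makes through the Google CSE API, and the canned results (apart from the politico.com and headlinespot.com pair mentioned in Step 2) are invented placeholder domains.

```python
def degree_of_separation(term, search, backlinks, top_n=10):
    """Return the edges (source, target) of the two-level backlink network."""
    edges = []
    for site in search(term)[:top_n]:
        edges.append((site, term))          # the site contains the search term
        for linker in backlinks(site)[:top_n]:
            edges.append((linker, site))    # the linker points back to the site
    return edges

# Canned, made-up data so that the sketch runs without any web access.
FAKE_RESULTS = {
    "Donald Trump": ["www.politico.com", "www.example-news.test"],
    "Hillary Clinton": ["www.example-news.test", "www.example-blog.test"],
}
FAKE_BACKLINKS = {
    "www.politico.com": ["www.headlinespot.com"],
    "www.example-news.test": ["www.example-portal.test"],
    "www.example-blog.test": [],
}

def search(term):
    return FAKE_RESULTS.get(term, [])

def backlinks(site):
    return FAKE_BACKLINKS.get(site, [])

# Step 4: merging is simply the union of the edges from both searches.
merged = set()
for term in ("Donald Trump", "Hillary Clinton"):
    merged |= set(degree_of_separation(term, search, backlinks))
print(len(merged), "edges in the merged bipartite network")
```

A website that shows up for both search terms becomes a shared node after the merge, and such shared nodes are what connect the two search terms in the bipartite network.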
Image 6a
8.6. DEGREE-OF-SEPARATION SEARCH WITH TWITTER The same degree-of-separation search as for websites can also be run to collect and analyze Twitter data. The trick is to include the search result as an extra node in the network. The other nodes in the network are not websites, but Twitter users, and a link between two nodes — except to the search term — denotes either one user retweeting another user or mentioning her or him in a tweet. We again compare the popularity of Donald Trump with the popularity of Hillary Clinton on Twitter (see Section 12.3 for a detailed step-by-step example in Condor).
Step 1: Collect last 500 tweets about “Donald Trump” In Step 1, we collect 500 Tweets, which will only cover a few minutes of tweets on October 3, 2016, as the topic is frantically tweeted about. We have to make sure to click the checkbox that connects the search term “Donald Trump” with all the Tweets. Visualizing the search results leads to the image below; the grey box in the center is the search term.
On a side note… Removing the search term will show the most important tweeters. The image below shows the network above with the search term “Donald Trump” removed; the tweeters are sized by betweenness centrality. Nytimes and ShazBooty714 are the most important tweeters.
Step 2: Collect last 500 tweets about “Hillary Clinton” We repeat the same search on Twitter for the term “Hillary Clinton,” again connecting the search term with all tweeters. The image below shows the results.
Step 3: Merge Twitter degree-of-separation search results for “Donald Trump” and for “Hillary Clinton” and calculate betweenness centrality
The image below shows the merged datasets; the size of a node denotes its betweenness centrality. The squares for Hillary Clinton and Donald Trump are of comparable size; this means that on Twitter both candidates show approximately the same brand strength.
8.7. WIKIPEDIA SEARCH We can also use Wikipedia to measure the strength of a brand. This is done not by “degree-of-separation” search, but by direct search. For a detailed example of how to use the Wikipedia Evolution fetcher, see Section 12.2. To compare the strength of Hillary Clinton and Donald Trump on Wikipedia, we used Condor’s Wiki Evolution fetcher to collect the last 250 edits on the Wikipedia pages about “Donald Trump” and “Hillary Clinton.” Condor will collect all pages linking from and back to the pages (called “bidirectional links” in Condor) about “Donald Trump” or “Hillary Clinton” on
Wikipedia which have been referenced in the last 250 edits from other Wikipedia pages. The image below shows the resulting Wikipedia link network, each node is a Wikipedia page, each connecting line is a bidirectional link between two pages. The two most central pages are — not surprisingly — the pages about Hillary Clinton and Donald Trump. Each node is sized by betweenness centrality. Hillary Clinton, who, in addition to being the US presidential candidate, is the wife of a former president, a former senator, and a former US secretary of state, is more central than Donald Trump in the Wikipedia link network.
After this general introduction into the four-step analysis process with Condor, we will now look at how to analyze everybody’s own mailbox.
MAIN LESSONS LEARNED
• Condor runs on Mac, Windows, and Linux.
• Before running Condor, Java and MySQL must be installed.
• The Facebook wall fetcher allows you to collect and analyze your personal Facebook wall, if you have a Facebook account.
• The Twitter fetcher allows you to analyze tweets about any topic; you will need to obtain Twitter API keys first.
• The Google CSE fetcher allows you to analyze the most important websites about a certain topic; you will need to obtain Google CSE API keys first.
• Degree-of-separation search measures the importance of brands by constructing a bipartite graph from either websites (using Google CSE) or Twitter.
• Brand importance is measured by calculating the betweenness centrality of the nodes.
9 ANALYZING E-MAIL WITH CONDOR
CHAPTER CONTENTS
• Analyzing your personal social network through your mailbox
• Finding COINs through community detection
• Analyzing the social network of an organization through its e-mail archive
• Analyzing Hillary Clinton’s e-mail
• How to deal with an organization when analyzing its e-mail archive.
9.1. CREATING A VIRTUAL MIRROR OF YOUR OWN MAILBOX
In the first e-mail analysis example, I will be studying four months of e-mail messages of my own personal mailbox. This analysis will allow me to much better understand what worked and what did not work in my collaboration with dozens of teams in a variety of topics. A personal e-mail-box analysis consists of the following steps:
1. Create a new e-mail database and dataset.
2. Fetch e-mail using the options to filter by date and mail folders.
3. Create a static view of the network for an initial look.
4. Use Process Dataset to continue to clean up the network by merging e-mail addresses, removing mailing lists, and other unwanted actors.
5. Annotate the dataset with network measures using Process Dataset.
6. Use the View menu to create a scatter plot of Contribution Index, Average Response Time (ART), and Word Cloud.
7. Remove yourself from the network to understand who else is important in holding your network together and rerun all the annotations.
8. Graph group centrality measures and temporal social surface to examine creativity or performance behavior; use the actor scatter plot to examine sentiment, complexity, betweenness centrality oscillation, and adjacency matrix.
9. Calculate the influence measure; remove noninfluential actors and graph.
10. Summary.
I start by creating a MySQL database “mail_peter_Aug15” in Condor, into which I will load my mailbox data from May to August 2015.
Note: Database and dataset names cannot include any spaces in their names. Use an underscore “_” as a separator.
Once the database is created, I switch to it.
I then create a dataset “mail_May_Aug15,” into which I will be loading the mails of the last four months of my mailbox.
Now I can use Condor like an e-mail client, to download the mail into the MySQL database.
Note: If a dialog window is hidden behind the main Condor window, it can always be brought to the foreground by clicking on .
Condor has the capability to fetch e-mail from an Exchange, IMAP, or POP3 account. In this example, I am collecting my MIT mailbox, which is stored on a Microsoft Exchange server. I include collecting the contents of my mailbox, by checking the box “Fetch content.” For Exchange, if the server is set up well, it is sufficient to enter the e-mail address and password (just like logging into Webmail), then the Exchange Autodiscover server will automatically figure out hostname and username.
If you would like to download an IMAP account, you will also have to enter the name of the IMAP host. For example, for GMAIL the settings in the dialog below would log you into your account with Condor.
Note: If you have enabled Google’s two-step password verification, you will get the following error message and will have to generate an app password (see https://support.google.com/accounts/answer/185833)
Both ways, either using the Google-generated Condor app password or using your own password (if you are not using two-step verification), your GMAIL login dialog will look as follows.
I will now log in to my MIT Exchange mailbox. In the next dialog, I have the option to set the time period for the e-mails I want to download. I only collect my mails from May 1, 2015 to the collection time (August 27, 2015).
The next dialog gives me the option to only collect specific folders. I collect all the folders that I expect to have gotten new content in the specified time period.
Now the e-mails are loaded into the dataset “mail_May_Aug15” as a network with 1381 actors and 13,074 links, and I can create my first visualization (View->Create static view).
The “asteroid belt” outside the large connected cluster in the center comes from e-mails which are not sent to [email protected], but to mailing list addresses.
I now clean up my mailbox by removing the mailing list addresses and some other mails that are not interesting (“Process dataset->Remove specific actors”).
In the next step, I merge people who are using more than one e-mail address, using the “Manual node merging” wizard (“Process dataset->Node merging->Manual node merging”). I can filter names by typing substrings in the box at the top. By shift-clicking multiple actors and then clicking on “Merge actors and/or group()” I can merge multiple e-mail addresses into one node in the graph.
In a further (optional) cleanup step, I reduce graph size without losing key information by removing all nodes that are isolated or have only one connection to the connected component in the center. To do this, I calculate the degree centrality for all nodes (Process dataset ->Annotate->Centrality annotations).
I then use the actor filter dialog (Process dataset->Actor filter) to only keep the nodes which have degree centrality larger than 1.
The network has now been reduced to 516 actors and 10,530 edges.
Note: These changes to the original dataset are not saved in the original database. If you want to save it, you have to right click on the dataset and save it under a new name.
Caution: This might take a lot of space on your hard disk, as Condor databases will get large very quickly!
Now we can calculate the different networking attributes of all actors, using the annotate functions:
Process dataset->annotate->Centrality annotations (Central Leadership)
Process dataset->annotate->Oscillation annotations (Rotating Leadership)
Process dataset->annotate->Contribution Index annotations (Balanced Contribution)
Process dataset->annotate->Turntaking annotations (Responsiveness)
Calculate sentiment (Honest Sentiment).
There are also two annotations on the group level:
Process dataset->annotate->AWVCI annotation
Process dataset->annotate->Group density annotation.
These annotations calculate five of the six honest signals of collaboration, listed above in parentheses and introduced in Chapter 4, plus additional networking metrics. The image below illustrates the settings for calculating betweenness oscillation annotations for e-mail. The sliding time window to aggregate e-mails is set to 7 days, which means that Condor is always taking a week’s worth of e-mail to calculate betweenness of each actor, recalculating betweenness in 1-day increments for the entire duration from May 1 to Aug 27, 2015. The resulting time series of betweenness values is then smoothed over a 3-day time window.
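The sliding-window settings can be made concrete with a short sketch. It assumes inclusive window boundaries, a trailing moving average for the 3-day smoothing, and counts local maxima and minima as one simple reading of “oscillation”; Condor’s exact definitions may differ, and all function names here are illustrative.

```python
from datetime import date, timedelta

def sliding_windows(start, end, window_days=7, step_days=1):
    """Yield (window_start, window_end) pairs; both dates are inclusive."""
    current = start + timedelta(days=window_days - 1)
    while current <= end:
        yield current - timedelta(days=window_days - 1), current
        current += timedelta(days=step_days)

def moving_average(values, width=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - width + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def count_oscillations(values):
    """Count strict local maxima and minima in a time series."""
    turns = 0
    for prev, cur, nxt in zip(values, values[1:], values[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            turns += 1
    return turns

windows = list(sliding_windows(date(2015, 5, 1), date(2015, 8, 27)))
print(len(windows), "windows; first:", windows[0], "last:", windows[-1])

# Made-up betweenness values of one actor over five consecutive windows.
series = [0, 3, 6, 3, 0]
print(moving_average(series), count_oscillations(series))
```

For each window one would compute the betweenness of every actor from the e-mails falling inside it, giving one time series per actor that is then smoothed and inspected for oscillation.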
The contribution index annotations can be run with default settings. Optionally, edges can be deduplicated in the graph if one person sends single e-mails to hundreds of recipients simultaneously. I am not doing this for my own mailbox.
The turntaking annotations are run with a minimum response time of 15 seconds, which means that if somebody just sends back an empty reply to an e-mail within 15 seconds (e.g., to try to game their response time) the reply will be ignored. Also, if a message is not answered within 4 days, it will be ignored to calculate the average response time. This assumes that if an e-mail is not answered within 4 days, it did not need an answer.
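The effect of the two thresholds can be sketched as follows. This is a minimal illustration, assuming that message and reply timestamps are already paired and expressed in seconds, and that both thresholds are inclusive; the function name and the sample data are hypothetical.

```python
FOUR_DAYS = 4 * 24 * 3600  # maximum delay that still counts as a response

def average_response_time(pairs, min_seconds=15, max_seconds=FOUR_DAYS):
    """pairs: list of (sent_timestamp, reply_timestamp) in seconds.
    Returns the mean delay in seconds, or None if no reply qualifies."""
    delays = [reply - sent for sent, reply in pairs
              if min_seconds <= reply - sent <= max_seconds]
    return sum(delays) / len(delays) if delays else None

pairs = [
    (0, 10),                 # 10 s: too fast, ignored as a possible gamed reply
    (0, 3600),               # 1 hour: counted
    (0, 7200),               # 2 hours: counted
    (0, 5 * 24 * 3600),      # 5 days: treated as not needing an answer
]
art = average_response_time(pairs)
print(art / 3600, "hours")
```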
We also calculate sentiment, emotionality, and complexity, using the following settings.
Now we can look at the results, identifying my most active communication partners. We start using the actor scatter plot view (View->Actor scatter plot). Not surprisingly I am the most active participant, sending and receiving a combined total of 6500 e-mails. I am sending slightly more messages than I receive, leading to a contribution index of 0.2. Remember that a contribution index of 1 means that a person only sends e-mail, while a contribution index of -1 means that the person is only receiving messages.
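As a worked example, the sketch below assumes the usual definition of the contribution index, (sent − received) / (sent + received), which matches the two endpoints described above. The split into 3900 sent and 2600 received messages is back-computed for illustration from the combined total of 6500 and the index of 0.2; it is not a figure reported by Condor.

```python
def contribution_index(sent, received):
    """+1 means the person only sends, -1 means the person only receives."""
    total = sent + received
    if total == 0:
        raise ValueError("actor has neither sent nor received any messages")
    return (sent - received) / total

ci = contribution_index(sent=3900, received=2600)
print(round(ci, 3))
```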
Image 7a
Looking at how quickly somebody responds to somebody else (ego ART) is a proxy for how passionate somebody is.
Image 8a
Looking at how quickly everybody else answers somebody (alter ART) is a proxy for how much somebody is respected.
Image 9a
We can now also look at the content of my e-mails. What immediately catches the eye is the bright green of the words, indicating an overall very positive mood. The only negative word is “problem.” My own name is the most popular word. As this is not so interesting, I right click on it to delete it in the view.
Image 10a
Clicking on “problem” in the above view will tell what the “problem” is. It turns out it is mostly about Condor bugs. Also note that the messages can be in English, German, French, Spanish, Portuguese, and Italian to be automatically analyzed by Condor’s sentiment detection algorithm.
9.1.1. Drawing the Term Graph
Condor provides a second way to visualize content, using the “term graph” function. The term graph is a semantic network of “terms.” “Terms” are the most important words in the content of a document. The network is constructed by co-occurrence of keywords; if two words appear in the same document (each document is an e-mail message in this example), they will be connected by an edge. Running the function with the default settings will create a new dataset “mail_May_Aug15term1,” using the keyword vector generated from the content field by the “Calculate Sentiment” function. The “minimal number of co-occurrences per edge” defines a cut-off value for including keywords into the term graph; in the example below two keywords need to appear together in at least four different documents to be included in the term graph.
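The co-occurrence rule can be sketched in a few lines. This is an illustration only, assuming each e-mail has already been reduced to a list of keywords; the function name `term_graph` and the sample keywords are hypothetical and do not reflect Condor’s internal keyword extraction.

```python
from collections import Counter
from itertools import combinations

def term_graph(documents, min_cooccurrences=4):
    """documents: iterable of keyword lists, one per e-mail.
    Returns {(term_a, term_b): number of documents containing both}."""
    counts = Counter()
    for keywords in documents:
        # Count each pair once per document, regardless of repetitions.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_cooccurrences}

# Made-up keyword lists: "team" and "project" share four e-mails,
# "team" and "meetings" only three.
documents = [["team", "project"]] * 4 + [["team", "meetings"]] * 3
graph = term_graph(documents)
print(graph)
```

Raising the cut-off shrinks the graph to the term pairs that recur most reliably, which is why peripheral names drop out first.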
After clicking the “Next” button, I get a dialog showing all keywords fulfilling the selection criteria.
Clicking on the headings “Term,” “Occurrences,” “Type,” and “Language” allows me to sort the terms in different ways, and to choose the ones I want to have included in the final term graph. In the dialog below I have manually chosen 97 words, which I suspect are meaningful in the context of my mailbox from May to August 2015. In the next step, the new dataset “mail_May_Aug15term1” is created. Next, I calculate the betweenness centrality annotation, and call up the static view, showing all labels corresponding to the 97 keywords or terms, sizing the nodes by betweenness. We see that “team,” “work,” “meetings,” and “project” are the most important words by betweenness in the term graph, while the names of my collaborators “Michael,” “Andrea,” and “Ken,” shown on the left, are related (meaning they occasionally show up in the same documents) but more peripheral in the term graph network.
9.1.2. Removing the Mailbox Owner
The next key step in the analysis of an individual mailbox is to remove the owner of the mailbox. As the static view of communication illustrates, I am also by far the most central person in my own network (mailbox owners are usually the most central actors in their own e-mail network, although there are exceptions).
Image 11a
The conclusion is therefore to delete myself from my own network, which I can easily do using the “remove specific actor” function.
This reduces the number of links in the graph from 13,074 to 4945 edges. Note how the network now falls apart and is broken up in different clusters.
Image 12a
As the network metrics were heavily influenced by my own position in the network, I have to rerun all annotations. Note how the betweenness changes considerably after rerunning the annotations.
Image 13a
Next we can look at the evolution of the centrality network metrics over time (View->Group centrality measures). They are pretty oscillating, which, as has been shown in our research, is an indicator of creativity.
Next we look at the temporal social surface, using a sliding time window of 7 days, and unchecking the “with history” option.
This will use a sliding time window approach, taking the last 7 days in 1-day increments to calculate betweenness for each actor, while re-sorting the actors each day by betweenness, and then plotting their betweenness curves in a three-dimensional surface. Note how in the first six weeks a few individuals show very high betweenness (not me, as I removed myself just before), suggesting they were actively collaborating with groups of people and bridging structural holes, while in the second half of the time period, there are no high-betweenness individuals, and the overall activity of people (people above “flatland”) significantly drops.
Image 14a
Using the actor scatter plot will show us who the most positive people are, plotted against the number of messages they send.
Image 15a
And who is using the most complex language in the e-mails they send.
Image 16a
And who are the most creative people, measured through oscillations in betweenness centrality.
Image 17a
Using the adjacency matrix view (View->Adjacency matrix) and sorting actors by degree centrality will show the most gregarious, that is, most connected individuals in the upper right corner.
We can now also look at the evolution of sentiment in my mailbox with and without me. Note the drop in activity during vacation time in July and August. There is also a drop in sentiment at the end of June. Compare this with the activity, sentiment, emotionality, and complexity of my mailbox, including the messages sent and received by me. Notice how the drop in activity is much less marked for the original mailbox including me, which also includes the mails sent exclusively to me, or received exclusively by me. This means that while overall business activity drops, I am still pretty active during the summer break. Also, while the sentiment shows some noticeable drops in the mailbox where I removed myself, the mood is steadier and less oscillating in the original mailbox including myself.
Finally, I would also like to know who the most influential people in my network are. It is better to calculate this with myself taken out of the picture, to not skew the influencer calculation algorithm. I check the box “create new dataset” to get a new network of influential people, based on new word usage being picked up by others.
I get a new dataset “mail_May_Aug15_Influence.” Drawing this network reveals a large group of noninfluential people: the isolated dots in the “asteroid belt” in the image below.
To remove all these noninfluencers, I calculate degree centrality, and remove all the nodes with degree smaller than 1, as we already did before, to prune the network.
The image now looks much different, and my colleagues from galaxyadvisors suddenly become most influential, followed by the COIIN project to reduce infant mortality.
Image 18a
This concludes the analysis of four months’ worth of e-mail data. We have been able to analyze the following:
1. Overall personality characteristics of my friends. Who is most creative (betweenness oscillation)? Who is most passionate (answers the fastest and uses most emotional language)? Who is most open and gregarious (has the highest degree centrality)? Who is using the most positive language?
2. Who has the most influence in my daily life? Who are the most influential people in my work life? Whom do I respect most (whom do I answer the fastest)?
3. What are key determinants of my professional life? Which communities and teams am I working with most (the biggest clusters)? What are the key topics I have worked on over the last four months? What are the most hectic periods; what have been the quiet periods over the last four months?
9.2. FINDING COINS THROUGH COMMUNITY DETECTION
Condor offers an automatic way to find COINs through its community detection algorithm. It uses the Louvain algorithm,1 which assigns each actor to one community based on its social network connections. It finds the “modularity” of a community, which is defined as a value between -1 and 1 by calculating the density of links inside the community compared to the density of all other links outside of the community. In this example to illustrate the detection of COINs, I use my mailbox from January 1, 2009 to December 31, 2015 to locate the key projects I have worked on over the last six years. First, I load the top 6000 actors of a dataset with 15,364 actors, and merge the multiple e-mail addresses that represent the same person into single actors using the “Manual node merging” wizard, as described in the previous section. Then, I run the community detection algorithm.
1
https://en.wikipedia.org/wiki/Louvain_Modularity
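To make the notion of modularity concrete, the sketch below computes the standard Newman modularity score of a given partition for an undirected, unweighted graph. It only evaluates a partition; it does not perform the Louvain optimization itself, and it is not Condor’s implementation. The toy graph (two triangles joined by one link) and the function name are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of a partition.
    edges: list of undirected (a, b) pairs; community: node -> community id."""
    m = len(edges)
    internal = {}      # links with both endpoints inside the same community
    total_degree = {}  # sum of node degrees per community
    for a, b in edges:
        for node in (a, b):
            c = community[node]
            total_degree[c] = total_degree.get(c, 0) + 1
        if community[a] == community[b]:
            c = community[a]
            internal[c] = internal.get(c, 0) + 1
    # Fraction of links inside each community minus the fraction expected
    # if links were placed at random with the same node degrees.
    return sum(internal.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)

edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4), round(q_one, 4))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one community, which is the kind of improvement the Louvain algorithm searches for.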
I then also annotate the actors by betweenness centrality, to be able to draw them in their respective size. The communities come out nicely; they are numbered from 0 for the largest community up to the number of communities (1271 in this example), with community 0 having the most members.
Image 19a
When checking the communities by looking at their members, I find that the community detection algorithm did an excellent job grouping related people together. Community 0 (yellow) is most of the COIN seminar students, described in section 9.3 plus many outside collaborators. Community 1 (green) is the C3N Chronic Collaborative Care Network — a project on improving the lives of patients with Crohn’s disease, 2 (turquoise) is the US government part of the IM-COIIN — a project reducing infant mortality in the United States, community 3 (blue) is the sponsored projects of our MIT research
center, 4 (purple) is the other half of the IM-COIIN, 5 (light blue) is the CFF C3N project (a successor of the C3N project applying the same approach to patients with cystic fibrosis), 6 (gray) is the HV-COIIN (another part of the IM-COIIN project focusing on home visiting), 7 (olive green) is around Technopark Aargau, a startup incubator in Switzerland. As I am part of the initial community, my betweenness centrality is by far the largest, and I am the glue linking the communities, which means that the community detection algorithm will produce one large cluster with me in the core, plus many small communities. I therefore remove myself from the analysis, as described in the previous section using the “remove specific actor” menu function, and rerun the community detection algorithm. Note that I will also have to recalculate betweenness centrality.
Image 20a
I also increase the number of communities to be shown in different colors to the top 12 communities. Note that the clusters are now much smaller, as the big connector in the center (myself) has been removed. Community 0 (yellow), the largest cluster, is now the C3N project focusing on improving the lives of patients with Crohn’s disease, 1 (bright green) is US government part of the IM-COIIN, 2 (turquoise) is my own COINonCOINs, 3 (dark blue) is the MIT sponsored projects, 4 (purple) is my collaborators from University of Cologne, 5 (light blue) is the second part of the IM-COIIN around NICHQ — the contractor running the IM-COIIN project, 6 (gray) is the CFF cystic fibrosis C3N, 7 (olive) is the HV-COIIN, 8 (dark green) is my collaboration with service provider firm Genpact, 9 (blue green) is around Technopark Aargau, a startup incubator, 10 (dark violet) is the MIT Sloan administration, 11 (violet) is a group of particularly active COIN seminar students from 2010. The next step is to drill down into one community at a time. I start by looking at the COINonCOINs, the turquoise cluster with community ID 2. Using the “Actor Filter” dialog with the setting below will remove all actors with a community ID other than 2.
After clicking on the next button, I obtain the following image. It will still show all the links, but only include the actors from community 2, the COINonCOINs.
To obtain a network showing only the links between the members of the COINonCOINs community, I close the static view and recompute the betweenness centrality as well as the community structure, as the network structure has now radically changed. The resulting image shows the network and subcommunity structure of the COINonCOINs community.
Image 21
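The filtering step can be pictured outside Condor as well. The following is only a minimal sketch on made-up toy data (the actor names, community IDs, and the function `filter_by_community` are all hypothetical, not part of Condor): it keeps the actors of one community and then only the links that connect two of those actors.

```python
# Hypothetical toy data: actor -> community ID, plus e-mail links between actors.
community = {"ann": 2, "bob": 2, "cat": 2, "dan": 0, "eve": 5}
links = [("ann", "bob"), ("bob", "cat"), ("ann", "dan"), ("cat", "eve")]


def filter_by_community(community, links, keep_id):
    """Keep actors with the given community ID and only the links among them."""
    actors = {actor for actor, cid in community.items() if cid == keep_id}
    inner = [(s, t) for s, t in links if s in actors and t in actors]
    return actors, inner


actors, inner_links = filter_by_community(community, links, keep_id=2)
print(sorted(actors))   # actors remaining after the filter
print(inner_links)      # links with both endpoints inside the community
```

Because links to outside actors disappear, centrality and community structure have to be recomputed on the reduced network, which is why the recomputation step above is needed.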
Cluster 0 (yellow) is the students from University of Applied Sciences Northwestern Switzerland (FHNW) and from Wayne State, connected through their respective instructors Michael Henninger and Ken Riopelle. Community 6 (gray) is a particularly active team from FHNW, as are communities 5 (purple) and 7 (olive). Community 1 (green) is students from one COINs course mostly from Germany and SCAD connected through instructors Christine Miller and Julia Gluesing, community
2 (turquoise) is students from Universidad Cattolica Santiago di Chile connected through instructor Cristobal Garcia, community 3 (blue) is another COINs course with students from Helsinki and IIT connected through instructor Maria Paasivaara. In the next section, we will extend this e-mail analysis framework from analyzing single-user mailboxes to studying the e-mail network of an entire organization.
9.3. CREATING A VIRTUAL MIRROR OF AN ORGANIZATION
The same process used to analyze an individual’s mailbox (called an “ego network”) can be used to track organizational networks, with the goal of improving organizational performance. Social network variables can be compared with organizational performance metrics such as employee or customer satisfaction, productivity, sales force success, or propensity to leave the organization. Based on these correlations, interventions to optimize the organization can be developed.
In this example, we will analyze the e-mail network of the COINs 2015 spring seminar. Before starting the exploration, it is useful to acquire as much context information as possible about the organization we will be analyzing. The context information about the COINs seminar is as follows: The COINs seminar has been taught as a distributed seminar since 2005. COINs is the acronym for “Collaborative Innovation Networks,” which are the focus of the seminar. Students are participating from
MIT, Illinois Institute of Technology, Aalto University, University of Cologne, and University of Bamberg, collaborating as virtual distributed teams, tackling problems of social media analysis and other COINs-related issues. For most sites, the course consists of a 3-day introductory block course taught on-site, followed by 3 to 4 months of virtual collaboration by distributed student teams. The virtual collaboration projects are divided into iterations of 1 to 3 weeks. At the end of each iteration, each team presents the results of the last iteration and the plans for the next iteration in a virtual meeting. Halfway through the course, students are shown their own communication behavior captured through e-mail as a virtual mirror. At the end of the course, the students deliver a final presentation and submit a final paper reporting on their project. All e-mail communication during the teamwork period is captured by asking the students to cc all their messages to a team-specific dummy GMAIL folder.
We start our analysis by downloading the 10 GMAIL boxes of the 10 teams of the 2015 COINs spring seminar, which are, in addition to students from University of Cologne and University of Bamberg, made up of students from University of Applied Sciences Northwestern Switzerland and University of Rome Tor Vergata. The e-mails are stored in one dedicated MySQL database, putting each mail folder into one dataset. First, we create a dataset for each team.
Then, we load the GMAIL mailbox for the team into the newly created dataset using the e-mail fetcher.
This process is now repeated for teams 2 to 10, leading to 10 different datasets, one for each team.
The next step is to merge the 10 datasets into one combined dataset.
We are now ready to take a first look at the merged dataset, coloring it by original dataset. This will show the communication for each team in a different color. Note, however, that an actor who is in more than one dataset, like the instructor, will only be shown in the color of the first dataset he is in. We also size the nodes by betweenness,
which means that we have to annotate them first by betweenness (Process dataset->Annotate->Centrality annotations).
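The merge and the coloring rule described above can be illustrated with a small sketch. This is not Condor’s implementation; the function `merge_datasets`, the team names, and the actors are hypothetical. It concatenates the per-team link lists and remembers, for every actor, the first dataset in which that actor appears, which is the dataset that would determine the node color.

```python
def merge_datasets(datasets):
    """Merge per-team link lists; record the first dataset each actor appears in.

    datasets: dict mapping dataset name -> list of (sender, receiver) links,
    iterated in insertion order.
    """
    merged_links = []
    first_dataset = {}
    for name, links in datasets.items():
        for sender, receiver in links:
            merged_links.append((sender, receiver, name))
            # setdefault keeps the earliest dataset and ignores later ones
            first_dataset.setdefault(sender, name)
            first_dataset.setdefault(receiver, name)
    return merged_links, first_dataset


# Hypothetical toy data: the instructor "peter" writes to two teams.
datasets = {
    "team1": [("anna", "ben"), ("peter", "anna")],
    "team2": [("carl", "dina"), ("peter", "carl")],
}
merged, color_of = merge_datasets(datasets)
print(color_of["peter"])  # the instructor takes the color of the first dataset
```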
Image 22
We already can see the central role of the main instructor (Peter). We can also very nicely see the different teams, which are each represented as a separate cluster, connected by inter-team ties, and ties to the instructors. We also note that there are more actors than students per team. Further investigation reveals that a lot of the students have been using more than one e-mail address, so these need to be combined. In addition, we also see that for each team we have a virtual actor called coinproject1 to coinproject11. We start by removing these virtual actors. We also discover some spam and mailing list actors, which we are removing for this analysis, as we are interested in human interactions. Note that if we were interested in knowledge flow, we might keep these virtual
actors. We also note some script-generated actors (e.g., [email protected]), which we also remove. To remove specific actors we use the “Process dataset -> Remove actors” function. In the next step, we manually go through the e-mail addresses, using the “Process dataset->Node merging ->manual node merging” function. Condor provides some support for this process. Starting to type a name into the search bar at the top will bring up all names starting with the same characters. Clicking on the heading “Uuid” or “Name” will sort by Uuid or Name. Shift-clicking on more than one actor in a pane and then clicking “Merge actors and/or groups” will merge these actor aliases under the first name.
Note: A company’s e-mail is usually quite clean because employees generally use only one e-mail address for official business correspondence. However, there are exceptions in case a person changes her/his name after a marriage or a divorce. And, a different e-mail address may be assigned to an employee who quits and then returns. Overall, it is recommended to periodically check for people having more than one e-mail address. Once we have done our labor-intensive merging work, we can save the merge file as an Excel CSV file, by clicking the button “Save actor merge CSV.” Clicking on “Load actor merge CSV” allows us to load a previously saved actor merge file.
A line in the actor merge file is of the format “uuid to keep, name to keep, uuid to merge, name to merge,” for example, [email protected], Peter Gloor, [email protected], Peter A Gloor. This will lead to the combined actor being called [email protected].
The network has now been greatly reduced in number of nodes; each node now corresponds to one actor in the COINs seminar.
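To make the effect of such a merge file concrete, here is a minimal sketch that reads lines in the four-field format described above and rewrites link endpoints so that every alias points to the kept identifier. The addresses are placeholders on example.org and example.com, and the helper names `load_alias_map` and `apply_aliases` are hypothetical; Condor performs the merge internally and may handle the file differently.

```python
import csv
import io

# Placeholder merge file: uuid to keep, name to keep, uuid to merge, name to merge
merge_csv = io.StringIO(
    "pg@example.org, Peter Gloor, pgloor@example.com, Peter A Gloor\n"
)


def load_alias_map(handle):
    """Return a dict mapping each merged uuid to the uuid that is kept."""
    alias_to_kept = {}
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        keep_id, _keep_name, merge_id, _merge_name = (field.strip() for field in row)
        alias_to_kept[merge_id] = keep_id
    return alias_to_kept


def apply_aliases(links, alias_to_kept):
    """Replace aliased endpoints of (sender, receiver) links by the kept uuid."""
    return [(alias_to_kept.get(s, s), alias_to_kept.get(t, t)) for s, t in links]


aliases = load_alias_map(merge_csv)
links = [
    ("pgloor@example.com", "student@example.org"),
    ("student@example.org", "pg@example.org"),
]
cleaned = apply_aliases(links, aliases)
print(cleaned)
```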
Image 23
At this point in the analysis, it is a good idea to save the cleaned-up file under a new name, for example “allteams-cleaned.”
We are now closing the original merged dataset, and we open the cleaned-up dataset “allteams-cleaned.” The next step is to annotate the network by calculating the actor-level metrics:
• Process dataset->annotate->Centrality annotations (betweenness, degree)
• Process dataset->annotate->Oscillation annotations
• Process dataset->annotate->Contribution Index annotations
• Process dataset->annotate->Turntaking annotations
• Process dataset->Calculate sentiment
After that, we obtain a first overview by creating a dynamic movie of the social network (View->Create dynamic view). The image below shows four snapshots of the movie over the time period from April 14 to June 16, 2015. Note how on the first picture in the upper left the teams self-organize, without the instructor. In pictures
2 and 3 the instructor becomes increasingly central, communicating intensively with a few teams, while others go their own way. In picture 4 the teams are again communicating among themselves, with the instructor just attached to one team in the center.
Image 24
Next we look at the e-mail activity over time. We can clearly see a spike in communication activity of the students before the end of an iteration, when they frantically prepare their presentations for the virtual meetings. Likewise, in business we see high rates of team member interaction right before a project deadline, management review meeting, or other significant event. Check your calendars to identify these key event dates.
Next we look at sentiment (in blue), emotionality (in green), activity (in red) and complexity (in yellow) over time. We see that sentiment starts quite positive, but goes down in the last third of the course, when laggards are pushed in less friendly words to contribute their share to the final presentation. In prior work we have found that the sentiment in the language of well-functioning teams tends to move down somewhat to get more “honest,” as team members are not just giving praise to each other, but also say in clear words what can be done to improve the product. In the end, sentiment goes up again, with mutual back patting after the job well done. In this analysis of the COINs course teams we find the same pattern.
Image 25
Right-clicking on an actor brings up his individual sentiment, emotionality, activity, and complexity.
Clicking on the instructor (Peter) shows that he is using quite clear language, with very positive sentiment, but also quite negative periods, particularly right before the end of the course.
Now we look at individual actors, starting with the actor scatter plot (View->Actor scatter plot) of the contribution index (see Section 5.1). Coloring individual actors by “original dataset” (which corresponds to their team), we find that members of the same team show similar send/receive ratios. In contrast to the analysis of COINs seminar teams in previous years, we find that there is always a team member who is more active than the rest of the team. In prior years, we had found that teams whose members showed similar contribution indices and numbers of e-mails sent or received were performing better, but in this analysis we did not find such teams. We still notice that members of a team are close together in the overall number of messages sent, but there are usually one or two members who are further toward the upper right of the scatter plot and thus are the main communicators and senders of e-mails.
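As a reminder of what the scatter plot shows, the following sketch computes a contribution index per actor. It assumes the commonly used definition, (sent − received) / (sent + received), which ranges from −1 (only receives) to +1 (only sends); Section 5.1 is the authoritative definition, and the function name and the message counts below are hypothetical.

```python
def contribution_index(sent, received):
    """Assumed definition: (sent - received) / (sent + received), in [-1, 1]."""
    total = sent + received
    if total == 0:
        return 0.0  # actor without any messages: treat as balanced
    return (sent - received) / total


# Hypothetical message counts (sent, received) for three actors
counts = {"main_sender": (30, 10), "balanced": (20, 20), "listener": (15, 25)}
for actor, (sent, received) in counts.items():
    print(actor, sent + received, contribution_index(sent, received))
```

In a scatter plot of total messages against contribution index, an actor such as `main_sender` appears toward the upper right: many messages overall and more sent than received.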
Next we look at the most oscillating (betweenness oscillation), responsive (ego ART), respected (alter ART), connected (degree), positive (sentiment), and emotional students. Note: It is important to keep in mind the time zone differential among team members when reviewing a person’s responsiveness (ego ART), and respect (alter ART). Vacations, sick time, and leaves as well as cross cultural differences in answering e-mails over the weekends may make it difficult to compare these indicators uniformly across team members. Again, it is important to understand the context of team members.
Next we look at the betweenness centrality of individual actors over time, by exporting a time series of betweenness values per actor as a CSV file for Excel (Export->Export time series). We have to uncheck the “with history” checkbox and set the sliding time window to 7 days, so that the extracted actor values always represent only the last 7 days. If we had checked the “with history” box, the social network would have been built up cumulatively, adding all edges over time until the final, most complete graph is reached. For this application of tracking what is happening each week, this would have been the wrong setting.
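The difference between the two settings can be sketched as a choice of which messages enter the network at a given point in time. The code below is an illustration only; the function `links_in_window`, its parameters, and the sample messages are hypothetical, and betweenness would then be computed on the selected links.

```python
from datetime import datetime, timedelta


def links_in_window(messages, end, window_days=7, with_history=False):
    """Select the links that make up the network at time `end`.

    messages: list of (sent_at, sender, receiver).
    with_history=False keeps only the last `window_days` days;
    with_history=True keeps everything sent up to `end`.
    """
    start = end - timedelta(days=window_days)
    selected = []
    for sent_at, sender, receiver in messages:
        if sent_at > end:
            continue  # message lies in the future of this snapshot
        if with_history or sent_at > start:
            selected.append((sender, receiver))
    return selected


messages = [
    (datetime(2015, 4, 14, 9, 0), "peter", "anna"),
    (datetime(2015, 4, 20, 9, 0), "anna", "ben"),
    (datetime(2015, 5, 5, 9, 0), "ben", "carl"),
]
week_end = datetime(2015, 5, 6)
last_week = links_in_window(messages, week_end)
cumulative = links_in_window(messages, week_end, with_history=True)
print(last_week)
print(cumulative)
```

With the sliding window, early central actors drop out of later snapshots, which is what makes week-by-week changes in leadership visible.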
Opening the export file in Excel, we find that Peter is the most central actor in the beginning, but that different students assume leadership roles of high-betweenness centrality over the progress of the course (the orange, green, and black lines in the image below).
Comparing the individual betweenness curves with the temporal social surface (View->Create temporal social surface view) leads to a similar image. See the Condor manual for steps to identify actors in the temporal social surface view.
Initially, there is one person who is highly central in the temporal social surface (Peter). After this initial more centralized structure, all participants are active with similar betweenness centrality, interrupted by small bursts of centrality along the way toward June 16, 2015. This is the pattern of rotating leadership, which is indicative of creative teams.
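One simple way to quantify such ups and downs from an exported time series is to count how often an actor’s betweenness curve changes direction. The sketch below assumes that oscillation can be approximated by the number of local maxima and minima in the weekly series; this is an assumption for illustration, Condor’s own oscillation annotation may be computed differently, and the function name and values are hypothetical.

```python
def count_oscillations(series):
    """Count direction changes (local maxima and minima) in a time series."""
    turns = 0
    previous_direction = 0
    for current, following in zip(series, series[1:]):
        direction = (following > current) - (following < current)  # +1, 0, or -1
        if direction == 0:
            continue  # flat stretch: neither rising nor falling
        if previous_direction != 0 and direction != previous_direction:
            turns += 1
        previous_direction = direction
    return turns


# Hypothetical weekly betweenness values for two actors
rotating = [0.1, 0.4, 0.2, 0.5, 0.5, 0.3]
steady = [0.1, 0.2, 0.3, 0.4]
print(count_oscillations(rotating), count_oscillations(steady))
```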
Next, we look at word usage over time, selecting four words whose usage over time will be shown.
The usage pattern of these four words correlates quite well with the overall activity of the students. The students are planning “skype meetings” for “presentations tomorrow.” In mid-June, there are no “meetings” anymore, while students are still working on the “presentation.” Discussion about “tomorrow” is also going down as the semester progresses. Note: Business communication often contains a boilerplate nondisclosure statement at the bottom of each e-mail that can bias or distort the words appearing in a word cloud. It is recommended to identify those words in these nondisclosure statements and exclude them. This is particularly important when many suppliers may be involved who have different nondisclosure wording.
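The idea of tracking a few words over time while ignoring boilerplate can be sketched as follows. This is not how Condor computes word usage; the function `weekly_word_counts`, the sample messages, and the boilerplate sentence are hypothetical. Known boilerplate phrases are stripped first, then the selected words are counted per ISO calendar week.

```python
import re
from collections import Counter
from datetime import date


def weekly_word_counts(messages, words, boilerplate=()):
    """Count selected words per ISO week after removing boilerplate phrases.

    messages: list of (day, body) with day as datetime.date.
    """
    counts = {}
    for day, body in messages:
        text = body.lower()
        for phrase in boilerplate:
            text = text.replace(phrase.lower(), " ")
        tokens = re.findall(r"[a-z']+", text)
        week = day.isocalendar()[1]
        bucket = counts.setdefault(week, Counter())
        for token in tokens:
            if token in words:
                bucket[token] += 1
    return counts


disclaimer = "This message is confidential."
messages = [
    (date(2015, 4, 14), "Skype meeting tomorrow? " + disclaimer),
    (date(2015, 6, 15), "Presentation draft attached. " + disclaimer),
]
tracked = {"meeting", "presentation", "tomorrow", "confidential"}
usage = weekly_word_counts(messages, tracked, boilerplate=[disclaimer])
for week in sorted(usage):
    print(week, dict(usage[week]))
```

Without the `boilerplate` argument, “confidential” would be counted in every message and would dominate the statistics.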
Image 26
Next we look at the overall word usage by computing the word cloud. The overall sentiment is very positive; there is no keyword with a negative context in the word cloud of the top 40 words. Meetings and Skype, as well as the names of the instructor (Peter) and of students who are particularly active, are the most popular words.
Image 27
We can also check where in the world people involved in the COINs seminar have been active. As we do not have geotagged data in e-mail, we use natural language processing (NLP) to map names of geographical places such as countries and cities to locations on the world map. First, we have to run the location annotation process (Process dataset->Location annotation). Condor has two options to look up the mapping of a location. Either it uses the Google geocoding API (this only works for the first 2500 lookups) or it uses a locally installed geomapping dataset. Note that the first time Condor runs location annotation with the local database, the database has to be installed on the local machine by checking the box “Import local data first.”
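The local lookup option can be pictured as a simple gazetteer match: place names found in the text are mapped to coordinates from a local table. The sketch below is a strong simplification and not Condor’s location annotation; the tiny `GAZETTEER` table (with approximate coordinates) and the function `annotate_locations` are hypothetical.

```python
import re

# Hypothetical mini-gazetteer: lowercase place name -> approximate (lat, lon)
GAZETTEER = {
    "zurich": (47.37, 8.54),
    "cologne": (50.94, 6.96),
    "rome": (41.90, 12.50),
}


def annotate_locations(text, gazetteer=GAZETTEER):
    """Return (place, coordinates) for each known place name, without repeats."""
    found = []
    seen = set()
    for token in re.findall(r"[A-Za-z]+", text):
        key = token.lower()
        if key in gazetteer and key not in seen:
            seen.add(key)
            found.append((key, gazetteer[key]))
    return found


print(annotate_locations("Call between Cologne and Rome, then Rome again"))
```

A real gazetteer also has to deal with multi-word names and ambiguous names, which this sketch ignores.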
The image below shows the results of the geotagged data, running “View->Create geographical view.” Note that while there are no participants from Oceania, there is discussion in Australia and Indonesia, as one of the projects was working on a global health project.
Image 28
Bringing up the European map illustrates the focus on Switzerland, Germany, and Italy, where most of the participating students were coming from.
Image 29
For the aggregate analysis on the team level, we combine all e-mail records of a team into one virtual actor per team, running “Analyze->Create collapsed graph,” thereby collapsing the graph by the original datasets that correspond to the 10 teams.
Creating the static view of the newly created network shows the ties between the 10 teams. Note that each person can only be allocated to one team, namely the team in which the person is first mentioned. In the image below, node size corresponds to the number of team members.
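Collapsing a network by team amounts to replacing each actor by the team it is allocated to and summing up the links between different teams. The following sketch illustrates this on made-up data; the function `collapse_by_team` and all names are hypothetical, and Condor’s collapsed graph may treat direction and weights differently.

```python
from collections import Counter


def collapse_by_team(links, team_of):
    """Aggregate actor-level links into weighted, undirected team-level ties."""
    weights = Counter()
    for sender, receiver in links:
        team_s, team_r = team_of.get(sender), team_of.get(receiver)
        if team_s is None or team_r is None or team_s == team_r:
            continue  # skip unknown actors and links inside a team
        weights[tuple(sorted((team_s, team_r)))] += 1
    return weights


# Hypothetical allocation: each person belongs to exactly one team
team_of = {"anna": "team1", "ben": "team1", "carl": "team2", "dina": "team3"}
links = [("anna", "ben"), ("anna", "carl"), ("carl", "ben"),
         ("dina", "anna"), ("zed", "anna")]

team_ties = collapse_by_team(links, team_of)
team_sizes = Counter(team_of.values())  # node size: number of team members
print(dict(team_ties))
print(dict(team_sizes))
```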
Using the contribution index on the aggregated team level shows the most active teams. Team 4 is most proactive, and team 11 is most active overall.
We can now also check which teams are most responsive (ego ART), most respected (alter ART), most positive (sentiment), and most creative (betweenness centrality oscillation). We find that team 4 is the most creative (highest betweenness centrality oscillation) while also being highly respected (alter ART), that is, others respond to them the fastest; they are also very positive in their sentiment.
Next we calculate the change in responsiveness of the different teams, starting with exporting the ego ART time series. We set the time window to 7 days, the minimum turn delay to one minute (ignoring replies sent within less than a minute), and the maximum turn delay to 4 days (ignoring e-mails that have not been answered within 4 days).
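The effect of the two thresholds can be shown with a small calculation. The sketch below averages response delays while discarding replies faster than the minimum turn delay and slower than the maximum; the function name, parameter names, and timestamps are hypothetical, and Condor’s ART computation may differ in details such as how reply pairs are identified.

```python
from datetime import datetime, timedelta


def average_response_hours(turns,
                           min_delay=timedelta(minutes=1),
                           max_delay=timedelta(days=4)):
    """Average reply delay in hours over turns within [min_delay, max_delay].

    turns: list of (received_at, replied_at) pairs for one actor or team.
    Returns None if no turn qualifies.
    """
    delays = [replied - received for received, replied in turns
              if min_delay <= replied - received <= max_delay]
    if not delays:
        return None
    total = sum(delays, timedelta())
    return total.total_seconds() / 3600 / len(delays)


t0 = datetime(2015, 5, 4, 9, 0)
turns = [
    (t0, t0 + timedelta(seconds=30)),  # too fast, likely an auto-reply: ignored
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=6)),
    (t0, t0 + timedelta(days=5)),      # answered after more than 4 days: ignored
]
art_hours = average_response_hours(turns)
print(art_hours)
```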
The resulting CSV file is then opened in Excel, and shows that most teams start slow, but are getting more responsive as the course progresses, illustrating the effectiveness of the virtual mirroring method: If you tell people what you will be measuring, they will start changing what is being measured!
Image 30
Compared to the first ego-centric network analysis example, where a single mailbox has been explored, this second example illustrates the virtual mirroring process on the organizational level. It can be done for small and large organizations, ranging from teams with a dozen members to companies with hundreds of thousands of employees. Unlike the ego-network analysis, which mostly studies the networking behavior of individuals, the organizational analysis allows virtual actors to be created by aggregating communication records using grouping attributes, for example, on the team, business unit, or geography level. We can study the following:
1. Personality characteristics of individual actors
• Who is most creative (betweenness oscillation)?
• Who is most passionate (answers the fastest and uses most emotional language)?
• Who is most respected (whom others answer the fastest)?
• Who is most open and gregarious (has the highest degree centrality)?
• Who is using the most positive language?
2. Longitudinal analysis of the organization
• Who are the most influential people at different points in time?
• What are the key events at different points in time?
3. Aggregated analysis of the organization
• What are the key topics and sentiment?
• What are the key locations?
4. Organizational unit analysis
• What are the most creative teams (betweenness oscillation)?
• What are the happiest teams (answer the fastest and use most positive language)?
• What are the most respected teams (whom others answer the fastest)?
• What are the most connected teams (highest degree centrality)?
These four steps are building blocks toward developing recommendations for interventions to increase organizational performance, by correlating social network metrics calculated in Steps 1 to 4 with organizational performance metrics such as sales performance, customer satisfaction, or employee turnover.
Next, we will study a publicly available e-mail dataset, Hillary Clinton’s e-mail.
9.4. ANALYZING HILLARY CLINTON’S MAIL
This short example is based on Hillary Clinton’s mailbox from her time as US secretary of state from 2009 to 2013.2 Against official regulations, Hillary Clinton had been using a private e-mail server for her official correspondence. The main criticism focused on Hillary Clinton deleting 32,000 e-mails, which she considered private. As part of a subsequent investigation into potential misbehavior, the US Department of State published a subset of her e-mail from 2009 to 2014 as 7000 individual PDF files. These e-mails have been scanned in and provided as a SQLite database and an Excel spreadsheet on the Kaggle website.3 I loaded this data into Condor as a dataset. It is available at www.ickn.org/sociometrics/.
2 https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy
3 https://www.kaggle.com/kaggle/hillary-clinton-emails
First, we create a new database. In this database, we load the CSV data from file “emails.csv” containing the messages, and “persons.csv” containing the actor names. For emails.csv we delete the redundant columns, only keeping the following columns:
• Id: the id of the link; an e-mail can result in more than one link if it has multiple recipients
• E-mail id: the id given by Kaggle to the e-mail message
• Subject: subject line
• SenderPersonId: the id of the person sending the e-mail, given by Kaggle and listed in persons.csv
• ReceiverId: the id of the person receiving the e-mail, given by Kaggle and listed in persons.csv
• DateSent: the date the message was sent
• ExtractedBodyText: the body text extracted by Kaggle from the PDF file.
The persons.csv file contains three fields. Note that we need to add a mock starttime field, as the Condor CSV importer is expecting it.
• Id: the id of the person, used to link to the sender and receiver ids in the emails.csv file
• Name: the name of the person
• Starttime: a fake start time for Condor.
This leads to the following import CSV dialog.
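The two preparation steps, dropping unneeded columns and adding a mock start time, can also be scripted before the import. The sketch below uses small in-memory samples instead of the real files; the column name `EmailId`, the extra column `DocNumber`, the sample rows, the default start time, and both helper functions are assumptions for illustration, so the actual Kaggle headers should be checked before reusing it.

```python
import csv
import io

# Columns to keep; names follow the list above and are assumed, not verified
KEEP = ["Id", "EmailId", "Subject", "SenderPersonId",
        "ReceiverId", "DateSent", "ExtractedBodyText"]


def keep_columns(src, dst, columns):
    """Copy a CSV stream, writing only the listed columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


def add_mock_starttime(src, dst, starttime="2009-01-01 00:00:00"):
    """Copy a CSV stream and append a constant Starttime column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["Starttime"])
    writer.writeheader()
    for row in reader:
        row["Starttime"] = starttime
        writer.writerow(row)


# Made-up sample rows standing in for emails.csv and persons.csv
emails_csv = ("Id,DocNumber,EmailId,Subject,SenderPersonId,ReceiverId,"
              "DateSent,ExtractedBodyText\n"
              "1,X1,10,Schedule,80,32,2009-09-12,See you tomorrow\n")
persons_csv = "Id,Name\n80,Person A\n32,Person B\n"

pruned = io.StringIO()
keep_columns(io.StringIO(emails_csv), pruned, KEEP)
persons_out = io.StringIO()
add_mock_starttime(io.StringIO(persons_csv), persons_out)
print(pruned.getvalue())
print(persons_out.getvalue())
```

For real files, the `io.StringIO` objects would be replaced by files opened with `newline=""`, as recommended for Python’s `csv` module.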
In the next step we assign dateSent both to Condor’s starttime and endtime.
This leads to a dataset with 514 actors and 9291 links. The static view below shows the resulting social network image. After annotating the actors with betweenness centrality, the most central actors stand out.
Not surprisingly, the most central people are Hillary Clinton herself and her trusted staffers Cheryl Mills, Huma Abedin, Sidney Blumenthal, and Harold Hongju Koh. Creating a tag cloud of the most frequent words does not bring up any “smoking guns.” Rather, words such as “government,” “secretary,” “state,” “president,” “time,” and “tomorrow” are the ones to be expected in the daily communication of the US secretary of state.
Image 31
The activity chart below tells us that the bulk of the released e-mails is from 2009 to 2011, peaking at the end of 2009 with 25 messages per day. The sentiment is quite neutral, confirming Hillary Clinton’s reputation as a well-tempered person using neutral, non-emotional language.
Image 32
The contribution index scatter plot below tells us that Hillary is, not surprisingly, the most active sender and recipient of messages, receiving somewhat more messages than she sends (her contribution index is −0.25). The second most active participants are Huma Abedin and Cheryl Mills, who send significantly more to Hillary than they receive (their contribution index is 0.3 to 0.5).
Huma Abedin is blazingly fast in answering her e-mail, with less than one-hour responses. Hillary takes on average nine hours to answer her mails, Cheryl Mills 32 hours.
The most creative people, measured through their betweenness oscillation shown below, are Hillary Clinton herself, Huma Abedin, Jake Sullivan (another member of Hillary’s inner circle), and Cheryl Mills.
The last image shows the temporal social surface. I already removed Hillary from that image, because she would dominate the image. It shows us that there are very few people, among them Huma Abedin, Cheryl Mills, Jake Sullivan, and Harold Hongju Koh, who dominate the discussion, while the remaining 500 actors play mostly peripheral roles and have few connections among themselves.
9.5. ORGANIZATIONAL ASPECTS OF E-MAIL-BASED SNA
Analyzing the e-mail of individual members of an organization needs to be approached carefully, in particular if e-mail content is included in the analysis. A series of legal and ethical issues have to be addressed when conducting such a project. It has been our experience that analyzing employee e-mail is, and has been, a very sensitive subject. This issue usually brings the company’s legal staff, human resources, information technology, and senior management into a conversation about how such analysis will be conducted. Nevertheless, we have found that with careful consideration, transparency, and clear rules of involvement, e-mail analysis can be conducted
from the level of a team, department, division, or entire enterprise for the benefit of employees, teams, and the enterprise. As an engineering manager said, “We know that our e-mail is monitored for drugs, pornography, gambling etc. Let’s use it for a positive purpose rather than for just punitive purpose.”
There are different ways to collect the e-mails of organizations with Condor:
1. The easiest way is to use Condor’s single-user e-mail fetcher to collect the team members’ e-mail one dataset at a time and then merge the datasets. This means that for each e-mail account the user id and password are needed.
2. Using Microsoft Outlook/Exchange impersonation (an e-mail admin account), multiple e-mail accounts can be loaded in batch mode without the need for individual passwords.
3. Condor is also available in a server version called CondorCore, which allows organizations to download multiple e-mail accounts automatically using “CondorCore annotations;” this option, however, requires all e-mail passwords to be stored on the server.
4. Exporting e-mail from the server as Excel CSV files or MySQL tables, and loading the exported CSV files or MySQL tables directly into Condor, using Condor’s import CSV and import MySQL functions.
5. Extracting e-mail from a person’s Outlook .pst files, importing it into Condor 2.6.6, and converting it to Condor 3.
6. Microsoft Outlook rules can be used to automatically select e-mail by keyword, from historical archives as well as from current activity, and forward it to a dummy mailbox, for example, a Gmail dummy account.
My friends Julia Gluesing and Ken Riopelle, who have done many e-mail analysis projects, once worked with a corporate manager who had over 20 subteams reporting to him. He insisted that all his subteams cc him on all e-mails. In this case, his e-mail folder for the subteams was used for the analysis, and no dummy e-mail box had to be created to be cc’ed on the correspondence.
When doing a group e-mail analysis, how does one know who is on the team to invite for the e-mail analysis? This might seem like an obvious question, but for large global teams it is often difficult to identify team members beyond a visible common core. One way to resolve this issue is to first analyze the project team leader’s e-mail, calculate the degree and betweenness centrality for each person, and rank-order the people as a guide for selecting team members to be included in the project team analysis; this process is also called “snowball sampling” in social network analysis.
This approach can be used to analyze the e-mail of tens of thousands and even hundreds of thousands of employees; some of these projects are described in the scientific papers listed at the end of this chapter. As each of these projects is highly dependent on local requirements and the e-mail architecture employed at the organization, describing a general process is beyond the scope of this book.
The collection of insights below reflects some lessons learned doing hundreds of e-mail analysis projects at organizations.
• Only project-related e-mails should be included. Personal e-mails can be excluded through Outlook rules or by setting up specific folders that will be ignored in the collection process.
• With the same exclusion process, patent-related and other legal e-mails can be excluded because of their privileged or confidential nature.
• Employees can volunteer to opt into the analysis and can opt out at any time.
• The analysis will not be used for any punitive or termination purposes.
• The individual analysis will be shared with team members first before upper management gets to see it.
• In some cases, content is excluded and only the subject line retained.
• In a global study, which involved numerous countries around the world, it was decided to follow German law because it was the most restrictive.
• E-mail analysis might stay inside the company’s firewall, by installing Condor and the MySQL database on a server inside the company’s firewall.
• When e-mail analysis is shared with people outside the company, it is anonymized or deidentified so no individual is identified.
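The last point in the list above can be supported technically by replacing e-mail addresses with stable pseudonyms before results leave the company. The following is a minimal sketch using a keyed hash from Python’s standard library; the function `pseudonymize`, the key, and the addresses are hypothetical. Note that this is pseudonymization rather than full anonymization: network position or message content may still reveal who a person is.

```python
import hashlib
import hmac


def pseudonymize(address, secret_key):
    """Map an e-mail address to a stable pseudonym using a keyed hash (HMAC)."""
    normalized = address.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return "actor_" + digest[:12]


# The key must stay inside the company; whoever holds it can re-link pseudonyms
SECRET_KEY = b"replace-with-a-secret-key"

alias_a = pseudonymize("Jane.Doe@example.org", SECRET_KEY)
alias_b = pseudonymize(" jane.doe@example.org ", SECRET_KEY)
alias_c = pseudonymize("john.smith@example.org", SECRET_KEY)
print(alias_a, alias_c)
```

Because the same address always yields the same pseudonym, network metrics computed on the pseudonymized data remain unchanged.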
9.6. FOLLOW-ON EXERCISES
1. Load your own mailbox into Condor and conduct an analysis similar to the single-mailbox analysis. Whom are you answering fastest, and who is answering you fastest? Who is most creative? How positive is the sentiment? What are the key terms you are using in your everyday language?
2. Using Hillary Clinton’s e-mail box, who are the most central people outside her close collaborators? What is the role of Barack Obama in her network?
3. Can you find out what happened to the key people in Hillary Clinton’s mailbox? Doing a Coolhunting for them on social media, what are their activities on Twitter? What are others saying about them in Wikipedia, blogs, and social media? Who are the most prominent people in the public perception on social media?
9.7. (PARTIAL) LIST OF E-MAIL STUDIES CONDUCTED BY THE AUTHOR IN VARIOUS ORGANIZATIONS
Downloadable from http://www.ickn.org/publications.html
Gloor, P. A. (2015, Winter). What email reveals about your organization. Sloan Management Review.
Gloor, P., Woerner, S., Schoder, D., Fischbach, K., & Fronzetti Colladon, A. (2016). Size does not matter in the virtual world. Comparing online social networking behavior with business success of entrepreneurs. International Journal of Entrepreneurial Venturing. Accepted for publication.
Allen, T., Gloor, P., Woerner, S., Raz, O., & Fronzetti Colladon, A. (2016). The power of reciprocal knowledge sharing relationships for startup success. Journal of Small Business and Enterprise Development. Accepted for publication.
Gloor, P., & Fronzetti, A. (2015). Measuring organizational consciousness through e-mail based social network analysis. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Gloor, P., Paasivaara, M., & Miller, C. (2015). Lessons from the coin seminar. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Maddali, H. T., Gloor, P., & Margolis, P. (2015). Comparing online community structure of patients of chronic diseases. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Hybbeneth, S., Brunnberg, D., & Gloor, P. (2014). Increasing knowledge worker productivity through a “Virtual Mirror” of the social network. International Journal of Organisational Design and Engineering, 3(3/4).
234
Sociometrics and Human Relationships
Gloor, P., & Giacomelli, G. (2014, Spring). Reading global clients’ signals. Sloan Management Review, March 4. Grippa, F., Provost, S., Gloor, P., McKean, M., & Thakkar S. A. (2014). Systematic methodology to characterize communication patterns in chronic care innovation networks. In S. Long, E-H. Ng, & C. Downing (Eds). Proceedings of the American society for engineering management international annual conference. Zhang, X., Gloor, P., & Grippa, F. (2013). Measuring creative performance of teams through dynamic semantic social network analysis. International Journal Organisational Design and Engineering, 4(2). Gloor, P., & Paasivaara, M. (2013). COINs change leaders — Lessons learned from a distributed course. Proceedings fourth international conference on collaborative innovation networks COINs, Santiago de Chile, August 1113. Hybbeneth, S., Brunnberg, D., & Gloor, P. (2013). Increasing knowledge worker efficiency through a “virtual mirror” of the social network. Proceedings of the fourth international conference on collaborative innovation networks COINs, Santiago de Chile, August 1113. Brunnberg, D., Gloor, P., & Giacomell, G. (2013). Predicting customer satisfaction through (e-mail) network analysis: The communication score card. Proceedings fourth International Conference on Collaborative Innovation Networks, Santiago de Chile, August 1113.
Analyzing E-Mail with Condor
235
Gloor, P., Dorsaz, P., Fuehres, H., & Vogel, M. (2012). Choosing the right friends Predicting success of startup entrepreneurs and innovators through their online social network structure. International Journal of Organisational Design and Engineering, 3(2). Grippa, F., Palazzolo, M., Buccuvalas, J., & Gloor, P. (2012). Monitoring changes in the social network structure of clinical care teams resulting from team development efforts. International Journal of Organisational Design and Engineering, 2(4). Gloor, P., Margolis, P., Seid, M., & Dellal, G. (2014). Coolfarming Lessons from the beehive to increase organizational creativity. MIT Sloan School Working Paper No. 5123-14. Gloor, P., Paasivara, M., Lassenius, C., Schoder, D., Fischbach, K., & Miller, C. (2011). Teaching a global project course: Experiences and lessons learned. ICSE International Conference on Software Engineering Collaborative Teaching of Globally Distributed Software Development Community Building Workshop, Honolulu, Hawaii, May 23. Gloor, P., Grippa, F., Borgert, A., Colletti, R., Dellal, G., Margolis, P., & Seid, M. (2011). Toward growing a COIN in a medical research community. Procedia Social and Behavioral Sciences, 26. Proceedings COINs 2010, Collaborative innovations networks conference, Savannah GA, October 79.
236
Sociometrics and Human Relationships
Merten, F., & Gloor, P. (2009). Too much e-mail decreases job satisfaction. Proceedings COINs, Collaborative innovations networks conference, Savannah GA, October 811. Grippa, F., & Gloor, P. (2009). You are who remembers you. Detecting leadership through accuracy of recall. Social networks, August 11. Allen, T., Raz, O., & Gloor, P. (2009). Does geographic clustering still benefit high tech new ventures? The case of the Cambridge/Boston biotech cluster. MIT ESD-WP2009-01 Working Paper. DiMaggio, M., Gloor, P., & Passiante, G. (2009). Collaborative innovation networks, virtual communities, and geographical clustering. International Journal of Innovation and Regional Development, 1(4), 387404. Fischbach, K., Gloor, P., & Schoder, D. (2009, February). Analysis of informal communication networks A case study. Business & Information Systems Engineering, 2 (also in German). Gloor, P., Paasvaara, M., Schoder, D., & Willems, P. (2007, April). Finding collaborative innovation networks through correlating performance with social network structure. Journal of Production Research. Kidane, Y., & Gloor, P. (2007, March). Correlating temporal communication patterns of the Eclipse open source community with performance and creativity. Computational & Mathematical Organization Theory, 13(1).
Analyzing E-Mail with Condor
237
Zilli, A., Grippa, F., Gloor, P., & Laubacher, R. (2006). One in four is enough Strategies for selecting ego mailboxes for a group network view. Proceedings of European conference on complex systems ECCS ’06, Oxford, UK, September 2529. Gloor, P., & Zhao, Y. (2006) Analyzing actors and their discussion topics by semantic social network analysis. Proceedings of 10th IEEE international conference on information visualisation IV06, London, July 57. Grippa, F., Zilli, A., Laubacher, R., & Gloor, P. (2006). E-mail may not reflect the social network. NAACSOS Conference, Notre Dame IN, North American Association for Computational Social and Organizational Science, June 22, 23. Gloor, P. Niepel, S., & Li, Y. (2006, January). Identifying potential suspects by temporal link analysis. MIT CCS Working Paper. Gloor, P., Laubacher, R., Dynes, S., & Zhao, Y. (2003). Visualization of communication patterns in collaborative innovation networks: Analysis of some W3C working groups. ACM CKIM international conference on information and knowledge management, New Orleans, November 38.
238
Sociometrics and Human Relationships
MAIN LESSONS LEARNED
• Creating a social network map of your personal mailbox will give you unprecedented insights into your social network.
• You will find out with whom you are working most closely, who the hidden influencers are, who likes you best, and who respects you most, but also who the bottlenecks are and who is bridging structural holes among your friends.
• Through automatic community detection, you will find your COINs.
• You will also see how these measures can change over time as your network of relationships changes with new projects, collaborators, suppliers, and clients.
• The same social network analysis can be extended to teams and entire companies to improve communication within the organization, by identifying bottlenecks, collaborators, hidden influencers, and people bridging structural holes.
• Even more, the network map can be used to improve knowledge flow in business processes, by tracking and improving employee satisfaction, customer satisfaction, employee turnover, and sales force performance to increase organizational effectiveness.
• The analysis is based on a list of “six honest signals of collaboration” that Condor continuously measures and that are indicative of creative or high-performing communication (see Chapter 4).
• The second example in this chapter analyzes the network of an entire organization using the e-mail communication of a class of 50 students from the COINs seminar working in 10 teams.
• The third example analyzes Hillary Clinton’s e-mails released as part of her e-mail controversy, identifying her closest collaborators.
10 CALCULATING PERSONALITY CHARACTERISTICS FROM E-MAIL
CHAPTER CONTENTS
• Predicting personality characteristics by correlating FFI with e-mail behavior
• Developing a prediction formula through ordinary least squares regression
• Adding gender and nationality as control variables.
10.1. CALCULATING CORRELATIONS BETWEEN FFI AND E-MAIL
People’s e-mail behavior is indicative of their personality characteristics. If we have both an e-mail archive of a person and their personality characteristics, we can correlate e-mail behavior and personality, leading to a general mapping of the honest signals of collaboration to personality characteristics. In this example, we use the e-mails of a group of 50 students from Germany, Finland, and the United States participating in the COINs course. Out of the 50 students collaborating in 11 teams, 34 also took the Neo-FFI test,¹ a shortened version of the Big Five personality test. We first calculate the honest signals of collaboration and then correlate them with the Neo-FFI results.
First, we load the full e-mail archive into Condor. The archive covers the group communication of the 50 students and their instructors over a period of three months. The static view below shows the full network.
¹ https://en.wikipedia.org/wiki/Revised_NEO_Personality_Inventory
The next step is the calculation of the six honest signals of collaboration. We compute them using the annotate functions:
• Process dataset->annotate->Centrality annotations (betweenness and degree) (Central Leadership)
• Process dataset->annotate->Oscillation annotations (Rotating Leadership)
• Process dataset->annotate->Contribution Index annotations (Balanced Contribution)
• Process dataset->annotate->Turntaking annotations (Responsiveness)
• Calculate sentiment (Honest Sentiment)
• Calculate influence (Shared Context).
We then export them to Excel:
Next, we correlate the resulting values with the FFI metrics (Neuroticism, Extroversion, Openness, Agreeability, and Conscientiousness) for the 34 students for whom we have FFI results. Using SPSS, we find the correlations shown in Table 8. Agreeable people are more central by betweenness as well as by degree. This means they have more important and more numerous friends. One explanation for their popularity could be that they are easier to get along with. People who are more open to experience have fewer friends, that is, lower degree centrality. This could be because they are more focused on their projects, or because they have more connections to people outside their course project that are not captured in this analysis. More conscientious people send and receive more messages, and others respond faster to them (lower alter ART). This might be because they act as timekeepers and note takers for their less conscientious peers. More neurotic people send less positive e-mails (lower sentiment), and they also send less complex e-mails. More extroverted people send more positive e-mails.
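The chapter runs this step in SPSS. For readers working from the exported spreadsheet in a script instead, a minimal sketch of the same calculation might look as follows. The function names are illustrative and are not part of Condor or SPSS. The significance check uses the standard t-statistic for a correlation coefficient with N - 2 degrees of freedom, and the critical value of about 2.04 for 32 degrees of freedom comes from a standard t-table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two lists of equal length with at least 3 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r * r))


# With N = 34 students, |t| above roughly 2.04 is significant at the 0.05 level
# (two-tailed). A correlation of .363, as reported for agreeability and
# betweenness centrality, passes that threshold.
t_agreeability = t_statistic(0.363, 34)
```

In practice one would loop over every pair of honest-signal column and FFI column from the export and collect r and t for each pair.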
Table 8: Correlation Results of FFI Metrics with Six Honest Signal SNA Metrics. Each cell shows the Pearson correlation with its two-tailed significance in parentheses; N = 34 in every cell. Signs are omitted; the text gives the direction of the significant correlations.

                                    Neuroticism     Extroversion    Openness        Agreeability    Conscientiousness
Betweenness centrality              .189 (.285)     .236 (.178)     .268 (.125)     .363* (.035)    .103 (.563)
Degree centrality                   .285 (.102)     .259 (.139)     .348* (.044)    .371* (.031)    .014 (.939)
Messages total                      .119 (.503)     .072 (.686)     .169 (.339)     .259 (.139)     .357* (.038)
Alter ART [h]                       .106 (.551)     .198 (.262)     .199 (.258)     .108 (.543)     .370* (.031)
avg sentiment                       .522** (.002)   .465** (.006)   .174 (.326)     .233 (.185)     .308 (.076)
avg complexity                      .371* (.031)    .249 (.156)     .017 (.924)     .085 (.632)     .284 (.103)
Betweenness centrality oscillation  .076 (.669)     .104 (.558)     .060 (.737)     .023 (.895)     .109 (.538)
Alter nudges                        .107 (.547)     .159 (.368)     .104 (.558)     .032 (.859)     .020 (.912)
Messages sent                       .085 (.633)     .119 (.502)     .043 (.810)     .289 (.097)     .322+ (.063)
Total influence                     .087 (.624)     .009 (.960)     .212 (.230)     .261 (.136)     .290+ (.097)
Messages received                   .020 (.912)     .066 (.711)     .227 (.196)     .179 (.312)     .327+ (.059)
Contribution index                  .180 (.308)     .165 (.351)     .267 (.128)     .124 (.486)     .146 (.409)
Ego ART [h]                         .088 (.622)     .023 (.897)     .035 (.846)     .057 (.750)     .198 (.262)
avg emotionality                    .221 (.208)     .175 (.322)     .227 (.196)     .171 (.334)     .007 (.970)
Average influence per message       .213 (.226)     .249 (.155)     .205 (.244)     .164 (.353)     .081 (.649)
Ego nudges                          .014 (.936)     .005 (.979)     .125 (.482)     .111 (.532)     .062 (.727)

*Correlation is significant at the 0.05 level (two-tailed). **Correlation is significant at the 0.01 level (two-tailed). + marginally significant.

10.2. DEVELOPING A GENERAL PREDICTION FORMULA
Based on the correlations above, we develop five regression equations to predict the Big Five personality characteristics from e-mail behavior. Using IBM SPSS or another statistics package, we regress the six honest signals against the Big Five values (Neuroticism, Extroversion, Openness, Agreeability, and Conscientiousness) and calculate the goodness of fit and the coefficients of the regression equations.

10.2.1. Neuroticism
The Adjusted R Square for this regression is 0.39, which means that 39% of the variance in neuroticism can be explained by the structure and content of people’s e-mail (Table 9). In other words, the fewer messages people send, the less positive they are, the less central by betweenness they are, and the more different communication partners they have (degree centrality), the more neurotic they are. The equation is as follows:

N = -0.028*messages sent - 76.135*sentiment - 0.019*betweenness centrality + 0.995*degree centrality + 87.781

Table 9: Regression Coefficients for Regressing Six Honest Signals against Neuroticism.

Model                    B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)               87.781              12.679                           6.924   .000
Messages sent            -.028               .011        -.461                -2.652  .013
avg sentiment            -76.135             20.757      -.518                -3.668  .001
Betweenness centrality   -.019               .010        -.683                -1.932  .063
Degree centrality        .995                .366        1.054                2.718   .011
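To make the use of such an equation concrete, here is a small sketch that applies the neuroticism coefficients to one actor’s exported honest signals. It assumes the unstandardized coefficients of Table 9 (0.028 for messages sent), with signs taken from the verbal description of the regression. The function name and the input values in the example are hypothetical and only illustrate the arithmetic.

```python
def predict_neuroticism(messages_sent, avg_sentiment, betweenness, degree):
    """Predicted Neo-FFI neuroticism score from four e-mail signals.

    Coefficients are the unstandardized B values of Table 9; the signs follow
    the direction of each effect as described in the text (an assumption where
    the printed table shows magnitudes only).
    """
    return (
        87.781                      # constant
        - 0.028 * messages_sent     # fewer messages sent -> more neurotic
        - 76.135 * avg_sentiment    # less positive sentiment -> more neurotic
        - 0.019 * betweenness       # lower betweenness -> more neurotic
        + 0.995 * degree            # more communication partners -> more neurotic
    )


# Hypothetical actor: 100 messages sent, average sentiment 0.5,
# betweenness centrality 200, degree centrality 10.
example_score = predict_neuroticism(100, 0.5, 200, 10)
```

The same pattern applies to the other four characteristics, each with the coefficients from its own table.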
10.2.2. Extroversion
The Adjusted R Square is 0.33, which means that 33% of the variance in extroversion can be explained by the structure and content of people’s e-mail (Table 10). In other words, the more people oscillate in their network position, the more positive they are in their messages, the faster other people answer them (the lower alter ART), and the fewer different communication partners they have (degree centrality), the more extroverted they are.

Table 10: Regression Coefficients for Regressing Six Honest Signals against Extraversion.

Model                               B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)                          16.015              14.669                           1.092   .284
Betweenness centrality oscillation  .466                .203        .363                 2.299   .029
avg sentiment                       73.989              23.723      .465                 3.119   .004
Alter ART [h]                       -.263               .121        -.326                -2.171  .038
Degree centrality                   -.348               .165        -.341                -2.113  .043

10.2.3. Openness
The Adjusted R Square is 0.11, which means that 11% of the variance in openness to experience can be explained by the structure and content of people’s e-mail (Table 11). In other words, the less central people are by betweenness centrality, and the more e-mails they send compared to receiving them, the more open they are to new experiences.

Table 11: Regression Coefficients for Regressing Six Honest Signals against Openness.

Model                    B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)               54.912              1.334                            41.156  .000
Betweenness centrality   -.007               .004        -.310                -1.869  .071
Contribution index       10.960              5.894       .308                 1.860   .072

10.2.4. Agreeability
The Adjusted R Square is 0.21, which means that 21% of the variance in agreeability can be explained by the structure and content of people’s e-mail (Table 12). In other words, the more influential people’s word usage in their e-mails, the less complex their messages, and the more positive their messages, the more agreeable people are.

Table 12: Regression Coefficients for Regressing Six Honest Signals against Agreeability.

Model                    B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)               47.976              14.418                           3.328   .002
avg sentiment            51.502              18.594      .495                 2.770   .010
avg complexity           -5.967              2.673       -.405                -2.232  .033
Total influence          .145                .051        .504                 2.864   .008
10.2.5. Conscientiousness
The Adjusted R Square is 0.57, which means that 57% of the variance in conscientiousness can be explained by the network structure and content of people’s e-mails (Table 13). In other words, the fewer e-mail messages people send, the more messages they receive, the more positive their messages, the more central they are by betweenness, the fewer communication partners they have, the faster others respond to them, the fewer nudges they need until they answer, the higher the total influence of their messages, and the lower the influence of an individual message, the more conscientious people are.

Table 13: Regression Coefficients for Regressing Six Honest Signals against Conscientiousness.

Model                          B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)                     71.128              17.006                           4.183   .000
Messages sent                  -.167               .055        -2.618               -3.047  .006
avg sentiment                  57.422              18.404      .374                 3.120   .005
Betweenness centrality         .039                .011        1.345                3.651   .001
Alter ART [h]                  -.393               .116        -.507                -3.401  .002
Degree centrality              -1.905              .453        -1.933               -4.201  .000
Total influence                1.481               .387        3.498                3.832   .001
Messages received              .023                .012        .335                 1.846   .077
Average influence per message  -195.283            53.294      -.844                -3.664  .001
Ego nudges                     -9.781              5.331       -.316                -1.835  .079
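The regressions in this section were run in SPSS. As a rough illustration of what an ordinary least squares fit and its Adjusted R Square involve, the following sketch solves the normal equations with the standard library only. It is an assumption-laden teaching example, not the procedure SPSS uses internally, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Fit y on the predictors in rows (one list per observation).

    Returns (coefficients, adjusted_r_square); the intercept comes first.
    """
    design = [[1.0] + list(row) for row in rows]  # prepend the constant term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_square = 1 - ss_res / ss_tot
    n, p = len(y), k - 1  # observations, predictors without the constant
    adjusted = 1 - (1 - r_square) * (n - 1) / (n - p - 1)
    return beta, adjusted
```

With the exported data, each row would hold one student’s honest signals and y the matching Neo-FFI score. Adjusted R Square penalizes additional predictors, which matters here because only 34 observations are available.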
10.3. ADDING GENDER, ETHNICITY, AND NATIONALITY AS CONTROL VARIABLES
Research about personality characteristics suggests that they differ by gender, ethnicity, and nationality. We therefore introduce three categorical variables: one for gender (female/male), one for ethnicity (Asian, Caucasian, and Arab, based on the student population), and one for nationality (Finnish, German, US, and other). Besides giving insights about the cultural differences of personality characteristics, adding these control variables also increases the accuracy of some of the predictions. A simple one-way ANOVA in SPSS for gender, ethnicity, and nationality leads to Tables 14 to 16. Table 14 shows the results by gender. Females are more neurotic, more open, more agreeable, and somewhat less conscientious, while there is no difference in extroversion. However, these differences are not statistically significant. The results by ethnicity are shown in Table 15. We again see differences: the Arabs are the most extroverted, open, and agreeable, and they tie with the Asians as most neurotic. The Asians are the least extroverted, open, and conscientious. The Caucasians are the least agreeable and the most conscientious. Note, however, that these results are not significant and are thus of mostly anecdotal value. The results by nationality are shown in Table 16.
Table 14: ANOVA Results by Gender for FFI Characteristics.

Gender   Statistic       Neuroticism  Extroversion  Openness  Agreeability  Conscientiousness
Male     Mean            44.26        54.58         52.84     45.26         51.11
         N               19           19            19        19            19
         Std. deviation  8.608        9.651         6.517     4.965         8.869
Female   Mean            48.60        54.53         55.13     47.73         49.80
         N               15           15            15        15            15
         Std. deviation  9.840        10.901        9.149     8.172         10.936
Total    Mean            46.18        54.56         53.85     46.35         50.53
         N               34           34            34        34            34
         Std. deviation  9.288        10.061        7.746     6.582         9.699

Table 15: ANOVA Results by Ethnicity for FFI Characteristics.

Ethnicity   Statistic       Neuroticism  Extroversion  Openness  Agreeability  Conscientiousness
Asians      Mean            48.50        51.25         51.00     46.00         45.25
            N               4            4             4         4             4
            Std. deviation  11.269       15.196        10.231    5.228         12.148
Caucasians  Mean            45.68        54.21         53.82     45.86         51.32
            N               28           28            28        28            28
            Std. deviation  9.121        9.049         7.354     6.530         9.623
Arabs       Mean            48.50        66.00         60.00     54.00         50.00
            N               2            2             2         2             2
            Std. deviation  13.435       11.314        9.899     8.485         5.657
Total       Mean            46.18        54.56         53.85     46.35         50.53
            N               34           34            34        34            34
            Std. deviation  9.288        10.061        7.746     6.582         9.699

Table 16: ANOVA Results by Nationality for FFI Characteristics.

Nationality        Statistic       Neuroticism  Extroversion  Openness  Agreeability  Conscientiousness
Finnish            Mean            47.44        53.33         58.11     44.11         48.33
                   N               9            9             9         9             9
                   Std. deviation  9.888        7.036         5.183     6.194         11.147
German             Mean            45.33        53.67         50.80     45.20         53.13
                   N               15           15            15        15            15
                   Std. deviation  8.006        10.069        7.720     6.461         8.593
The United States  Mean            44.50        61.63         55.13     50.75         50.25
                   N               8            8             8         8             8
                   Std. deviation  11.662       8.943         7.039     6.042         8.430
Other              Mean            53.50        38.50         52.50     47.50         42.00
                   N               2            2             2         2             2
                   Std. deviation  7.778        4.950         16.263    7.778         16.971
Total              Mean            46.18        54.56         53.85     46.35         50.53
                   N               34           34            34        34            34
                   Std. deviation  9.288        10.061        7.746     6.582         9.699
We find that the Americans are the least neurotic, the most extroverted, and the most agreeable, confirming national stereotypes. The Finns are the most open, while they are as extroverted as the Germans, which is quite surprising. The Germans, again confirming national stereotypes, are the most conscientious. This time, national differences in extroversion are statistically significant (p = 0.018). When rerunning the regressions with the three control variables (gender, ethnicity, and nationality), we find that these do not influence the coefficients for calculating neuroticism, openness, and conscientiousness with the six honest signals of collaboration. However, the accuracy of the predictions for extroversion and agreeability increases.
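The significance of the national differences in extroversion can be checked from the summary statistics alone. The sketch below computes the one-way ANOVA F-statistic from group sizes, means, and standard deviations, using the extroversion values reported for the four nationality groups in Table 16. The function name is illustrative, and the critical value of about 2.92 for F(3, 30) at the 0.05 level is taken from a standard F-table rather than computed.

```python
def anova_f_from_summary(groups):
    """One-way ANOVA F-statistic from (n, mean, std_dev) per group.

    Returns (f_value, df_between, df_within).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Extroversion by nationality: (N, mean, standard deviation)
extroversion_by_nationality = [
    (9, 53.33, 7.036),    # Finnish
    (15, 53.67, 10.069),  # German
    (8, 61.63, 8.943),    # The United States
    (2, 38.50, 4.950),    # Other
]
f_value, df_between, df_within = anova_f_from_summary(extroversion_by_nationality)
# f_value is about 3.92 with (3, 30) degrees of freedom, above the 0.05
# critical value of roughly 2.92, in line with the reported significance.
```

Because the means and standard deviations are rounded in the table, the result can differ slightly from what SPSS reports on the raw data.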
10.3.1. Extroversion
When regressing the six honest signals of collaboration against extroversion using ethnicity as a control variable, the Adjusted R Square goes up from 0.33 to 0.49; this means that 49% of the variance in extroversion can now be explained by the structure and content of people’s e-mail (Table 17). As Table 17 shows, ethnicity explains part of extroversion. Adding ethnicity as a control variable has also made further signals of collaboration significant. When controlling for ethnicity, extroverts still oscillate more and show more positive sentiment. But it also seems that extroverts are less popular: their betweenness centrality is lower, and their messages have less influence. However, they are more responsive, as they need fewer nudges until they respond (ego nudges), and others respond to them faster (alter ART).

Table 17: Regression Coefficients for Regressing Six Honest Signals against Extraversion with Ethnicity as Control Variable.

Model                               B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)                          33.648              16.820                           2.000   .056
Ethnicity                           8.695               3.156       .365                 2.755   .011
Betweenness centrality oscillation  .695                .209        .541                 3.324   .003
avg sentiment                       59.658              21.116      .375                 2.825   .009
Betweenness centrality              -.010               .004        -.331                -2.360  .026
Alter ART [h]                       -.355               .118        -.441                -3.002  .006
Average influence per message       -102.152            34.816      -.425                -2.934  .007
Ego nudges                          -11.257             5.786       -.351                -1.946  .063
10.3.2. Agreeability
When regressing the six honest signals of collaboration against agreeability and adding ethnicity and nationality as control variables, the Adjusted R Square goes up from 0.21 to 0.34; this means that 34% of the variance in agreeability can now be explained by the structure and content of people’s e-mail (Table 18).

Table 18: Regression Coefficients for Regressing Six Honest Signals against Agreeability with Ethnicity and Nationality as Control Variables.

Model            B (unstandardized)  Std. error  Beta (standardized)  t       Sig.
(Constant)       33.236              14.583                           2.279   .030
Ethnicity        5.612               2.614       .360                 2.147   .041
Nationality      2.962               1.320       .390                 2.243   .033
Messages sent    .020                .007        .471                 2.875   .008
avg sentiment    36.241              18.072      .348                 2.005   .055
avg complexity   -4.938              2.566       -.335                -1.924  .065

As Table 18 shows, ethnicity and nationality explain part of agreeability. Adding them as control variables also changes the set of significant variables: the number of messages sent replaces total influence. When controlling for ethnicity and nationality, agreeable people show more positive sentiment and use less complex language. They also send more messages.
10.4. FOLLOW-ON EXERCISES
1. Use the coefficients in the tables above to formulate five equations to calculate the Big Five personality characteristics of the other 15 students and of the instructors, who have not taken the Neo-FFI test. What would they be?
2. Using the personality insights gained, take your own e-mail archive and make an educated guess on the personality characteristics of the people in your mailbox, based on the correlations between personality and the six honest signals of collaboration identified in this chapter.
3. Using personality recognition through word usage, identify the personality characteristics of the people in your mailbox, using the system described in Celli and Poesio (2014). An online version of the system is available here: http://personality.altervista.org/pear.php
MAIN LESSONS LEARNED
• Personality characteristics of individuals can be calculated based on their e-mailing behavior by comparing the six honest signals of collaboration of individual actors with their personality characteristics measured through the Big Five personality test.
• The Big Five personality test measures Neuroticism, Extraversion, Openness to experience, Agreeableness, and Conscientiousness through a survey and is commonly used by psychologists to assess personality characteristics.
11 PREDICTING CRIMINAL INTENT FROM E-MAIL — ANALYZING THE ENRON E-MAIL ARCHIVE
CHAPTER CONTENTS
• The initial phase consists of an exploratory SNA to develop the hypotheses
• The second phase identifies criminals through their six honest signals of collaboration with t-tests
• Finally, machine learning with “tribefinder” finds the “tribe” of suspected criminals.
Enron’s downfall has been widely publicized and is also described in the book The Smartest Guys in the Room¹ and a movie of the same name. Enron was once a high-flying energy trading company located in Houston, TX, and was named “America’s most innovative company” by Fortune for six consecutive years. Enron employed 20,000 people and claimed revenues of $111 billion before it went bankrupt on December 2, 2001. Enron traded in electricity, natural gas, communications, pulp, and paper. Founded by longtime CEO Ken Lay, and led by COO/CEO Jeffrey Skilling and CFO Andrew Fastow, Enron engaged in a sophisticated game of hiding bad assets in offshore vehicles and booking future earnings as profits.
11.1. EXPLORATORY ANALYSIS
For the discovery process during the criminal proceedings against Enron, the prosecution also obtained and screened the e-mails of the 155 indicted Enron employees. After the proceedings, these e-mails were cleaned up by academics and placed in the public domain. They are now widely used to benchmark e-mail-based social network research projects. There are different variants of the Enron e-mail archive available on the Internet. In this analysis, we use a version from 2006 containing 261,852 e-mails and 27,742 actors, collected from the 155 employees of Enron who were indicted in the criminal proceedings that the US government started after Enron’s bankruptcy in December 2001. The dataset in Condor format is available from www.ickn.org/sociometrics.
¹ McLean and Elkind (2013).
The data goes from 1997 to 2002. For this analysis we restrict the dataset to the top 2862 actors active in the period January 1, 2000 to January 4, 2002, the time when most of the criminal activity at Enron was happening. This dataset includes 170,239 links. In the loading dialog of the dataset “enron2000to2002,” we set the start time to “January 1, 2000.”
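In Condor this restriction is done through the loading dialog. For readers who preprocess the archive themselves, the same idea can be sketched as follows. The tuple layout (sender, recipient, timestamp) and the function name are hypothetical, and ranking actors by the number of messages they send or receive within the window is an assumption, since the text does not state how the top actors are chosen.

```python
from collections import Counter
from datetime import datetime


def restrict_to_window(messages, start, end, top_n):
    """Keep messages inside [start, end] between the top_n most active actors.

    messages: iterable of (sender, recipient, timestamp) tuples.
    Activity is counted as messages sent plus messages received (assumption).
    """
    in_window = [m for m in messages if start <= m[2] <= end]
    activity = Counter()
    for sender, recipient, _ in in_window:
        activity[sender] += 1
        activity[recipient] += 1
    keep = {actor for actor, _ in activity.most_common(top_n)}
    return [m for m in in_window if m[0] in keep and m[1] in keep]


# Tiny made-up example with four actors; the real analysis would use
# start = January 1, 2000, end = January 4, 2002, and top_n = 2862.
sample = [
    ("a", "b", datetime(1999, 5, 1)),
    ("a", "b", datetime(2000, 3, 1)),
    ("b", "c", datetime(2001, 6, 1)),
    ("a", "c", datetime(2001, 7, 1)),
    ("c", "d", datetime(2001, 8, 1)),
    ("a", "b", datetime(2002, 2, 1)),
]
links = restrict_to_window(sample, datetime(2000, 1, 1), datetime(2002, 1, 4), 3)
```

Filtering by time first and ranking afterwards means that actors are judged only on their activity in the period of interest.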
The activity chart below shows that there are peaks of activity in October/November 2000, April/May 2001, and October/November 2001.
Coloring the static view of communication by organization shows a cohesive network in which most of the actors are from Enron (shown in green in the graph below). Image 33ᵃ
ᵃ For color pictures see online version of images, available at http://www.ickn.org/sociometrics/
In our analysis, we would like to compare the nonindicted employees with the indicted employees.² Research on the Internet leads to the following list of people convicted in the context of the Enron criminal proceedings.
² http://fuelfix.com/blog/2011/11/28/the-defendants-of-the-enronera-and-their-cases/#1875101=0
Andrew Fastow
Chief financial officer — Pleaded guilty to two conspiracy counts; testified against Lay and Skilling; finishing six-year sentence that ended Dec. 2011 at his Houston home
[email protected]
Ben Glisan
Treasurer — Pleaded guilty to conspiracy; served two-thirds of 5-year sentence
[email protected]
Christopher Calger
Vice president, pleaded guilty to a charge of conspiracy to commit wire fraud
NOT FOUND
Dan Boyle
Vice president global finance group — Convicted in Nigerian barge case; did
NOT FOUND
David Bermingham
not appeal; served three-year 10-month sentence British banker pleaded guilty to misleading their former employer in an Andrew Fastow finance scheme; sentenced to three years [email protected]
David Delainey
Former CEO retail energy division — Pleaded guilty to insider trading; served
[email protected]
nine-month prison term NOT FOUND
David Duncan
Arthur Andersen auditor — Withdrew a guilty plea after the Supreme Court reversed firm’s conviction; settled Securities and Exchange Commission complaint of securities laws violations
NOT FOUND
Gary Mulgrew
British banker pleaded guilty to misleading his former employer in an Andrew Fastow finance scheme; sentenced to three years
NOT FOUND
Giles Darby
British banker pleaded guilty to misleading his former employer in an Andrew Fastow finance scheme; sentenced to three years
James A. Brown
Merrill Lynch banker — Convicted in Nigerian barge case; some charges thrown out on appeal; served 47 months on remaining charges
NOT FOUND
Jeff Skilling
[email protected]
Jeffrey Richter
President — Serving 24-year sentence
[email protected]
Trader Enron Energy Services — Pleaded guilty to manipulating California power markets; served two-year probation
NOT FOUND
Joe Hirko
Co-CEO Enron Broadband Services — Pleaded guilty to charge arising from overstating performance of Broadband division; served 16-month sentence
[email protected]
John M. Forney
Energy trader — Pleaded guilty to manipulating California power markets; served two-year probation
[email protected]
Ken Lay
CEO — Tried with former Enron President Jeff Skilling; conviction thrown out because Lay died before sentencing
Ken Rice
[email protected]
Kevin Hannon
Co-CEO Enron Broadband — Pleaded guilty to securities fraud in Broadband case; served 27-month sentence Chief operating officer Enron Broadband — Pleaded guilty to conspiracy in Broadband case; served two-year sentence
NOT FOUND
Kevin Howard
Finance chief Enron Broadband — Pleaded guilty to one count of falsifying records in Broadband case; served one-year probation
[email protected]
Lawrence Lawyer
Vice president global markets — Pleaded guilty to failing to report income; served two-year probation
NOT FOUND
Lea Fastow
Assistant treasurer, Andrew Fastow’s wife — Pleaded guilty to lying on tax return; served one-year sentence
[email protected]
Mark E. Koenig
Head of investor relations — Pleaded guilty to securities fraud; served 18-month sentence
[email protected]
Michael Kopper
Finance managing director — First Enron executive to enter plea bargain; served less than two-thirds of three-year one-month sentence; released January 2009
[email protected]
Paula Rieker
Managing director of investor relations — Pleaded guilty to insider trading; served two-year probation
[email protected]
Rex T. Shelby
Vice president of engineering operations Enron Broadband — The last of the Enron employees to be sentenced; pleaded guilty to one count of insider trading; sentenced to two-year probation
[email protected]
Richard Causey
Chief accounting officer — Pleaded guilty to securities fraud; completed five-year six-month sentence
[email protected]
Timothy Belden
NOT FOUND
Head of trading Enron Energy Services — Pleaded guilty to manipulating California power markets; served two-year probation
[email protected]
Timothy DeSpain
Assistant treasurer — Pleaded guilty to conspiracy; served four-year probation
Acquitted
NOT FOUND
Robert Furst
Merrill Lynch banker — Tried in Nigerian barge case; conviction thrown out on appeal
NOT FOUND
Daniel Bayly
Former head of investment banking for Merrill Lynch — Tried in Nigerian barge case; conviction thrown out on appeal
NOT FOUND
Sheila Kahanek
Enron in-house accountant — Tried in Nigerian barge case; acquitted
In full dataset
Michael Krautz
Enron Broadband — Tried in Broadband case; acquitted
NOT FOUND
William Fuhs
Merrill Lynch banker — Tried in what prosecutors alleged was a scheme to inflate earnings through transactions involving power generation barges in Nigeria; conviction thrown out on appeal
In full dataset
Scott Yeager
Strategic business executive Enron Broadband — Appeals court ordered Yeager acquitted on all charges after his case went to US Supreme Court
We start by tracking the 16 convicted criminals from the list above who show sufficient e-mail activity in the full network. They are the following:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
We first visually identify these 16 people in the full network in the static view of communication. First, we color all the nodes blue. When coloring the nodes, clicking on the “advanced” button brings up the advanced coloring dialog, which allows us to set the color of a list of people by their UUIDs.
Using this dialog we load a text file (also available at www.ickn.org/sociometrics) containing the e-mail addresses we want to show in red, separated by commas:
[email protected], [email protected], [email protected], david.delainey@enron.com, [email protected], [email protected], [email protected], [email protected], ken [email protected], [email protected], mark. [email protected], [email protected], rex. [email protected], [email protected], tim. [email protected], [email protected]
The image below shows the recolored static view of communication. Red dots are the convicted criminals. As the image shows, they are not the most central actors in the network. In the next phase of our analysis we will use the six honest signals of collaboration, calculating them for all actors and comparing them between the 16 convicted criminals and the rest of the actors.
Image 34 (for color pictures, see the online version of the images, available at http://www.ickn.org/sociometrics/)
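The recoloring step amounts to reading a comma-separated address list and marking the matching nodes. The following is a minimal sketch of that idea outside Condor, not Condor's own implementation; the function names and the example.com addresses are hypothetical.

```python
def parse_address_list(text):
    """Split a comma-separated string into cleaned, lower-case e-mail addresses."""
    addresses = []
    for raw in text.split(","):
        # Line breaks in a text file can leave stray whitespace inside an address.
        cleaned = "".join(raw.split()).lower()
        if cleaned:
            addresses.append(cleaned)
    return addresses


def color_nodes(all_nodes, highlighted, default="blue", highlight="red"):
    """Return a node -> color mapping, marking the highlighted addresses."""
    marked = set(highlighted)
    return {
        node: (highlight if node.lower() in marked else default)
        for node in all_nodes
    }


# Hypothetical addresses, for illustration only.
file_content = "alice@example.com, bob@exam ple.com,\ncarol@example.com"
nodes = ["alice@example.com", "dave@example.com", "Carol@example.com"]
highlighted = parse_address_list(file_content)
colors = color_nodes(nodes, highlighted)
print(colors)
```

Matching is done on lower-cased addresses here, so differences in capitalization between the file and the network do not cause a convict to be missed.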
11.2. IDENTIFYING CRIMINAL ACTORS THROUGH THEIR HONEST SIGNALS OF COLLABORATION
Our goal is to identify the differences in communication behavior between the 16 criminals (the experimental group) and the other 2845 people in the dataset (the control group). We will compare the six honest signals of collaboration between the two groups. We calculate them using the annotate functions:
• Process dataset->annotate->Centrality annotations (betweenness and degree) (Central Leadership)
• Process dataset->annotate->Oscillation annotations (Rotating Leadership)
• Process dataset->annotate->Contribution Index annotations (Balanced Contribution)
• Process dataset->annotate->Turntaking annotations (Responsiveness)
• Calculate sentiment (Honest Sentiment)
• Calculate influence (Shared Context).
We then export them to Excel.
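To make one of these annotations concrete, the sketch below computes a contribution index for a single actor. It assumes the common definition (messages sent minus messages received, divided by their sum), which ranges from -1 for a pure receiver to +1 for a pure sender; Condor's exact computation may differ, and the message counts are made up.

```python
def contribution_index(sent, received):
    """Return (sent - received) / (sent + received), or 0.0 for an inactive actor."""
    total = sent + received
    if total == 0:
        return 0.0
    return (sent - received) / total


# Made-up counts: an actor who receives three times as much mail as they send.
ci_receiver = contribution_index(sent=30, received=90)
# Made-up counts: an actor who sends and receives the same amount.
ci_balanced = contribution_index(sent=50, received=50)
print(ci_receiver, ci_balanced)
```

Under this definition, a more negative value means the actor receives more mail relative to what they send.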
Next, we split the table into two samples: the experimental group of 16 convicts and the control group of 2845 people. In this example, we use the SPSS “compare means/Independent sample” function and store the 2861 records in a single SPSS table, adding a variable “0_inno_1_convict,” coding innocent people with “0” and convicts with “1.” Running the SPSS t-test to compare the two groups leads to Table 19 (alternatively, we could have used Condor’s built-in t-test function).
As we can see, the criminals send and receive more messages; they are also more central in the network by degree and betweenness; and their contribution index is more negative, which means that, relative to what they send, they get more mail than the rest. These indicators might all stem from their high positions in the hierarchy. They also have much higher betweenness centrality oscillation, which means they are more creative than their peers at Enron, and they are on average more responsive, and others answer them faster. They also need fewer nudges until they respond to an e-mail, and others need fewer nudges until they respond to the criminals. The speed of response of others (Alter ART) might again be an indicator of high respect due to the high position of the convicts in the hierarchy. The higher responsiveness of the convicts might indicate that they have “more skin in the game.”

Table 19: t-Tests of Structural SNA Metrics between Convicts and Control Group.
Metric                              0_control_1_convict  N     Mean       Std. Deviation  Std. Error Mean
Messages sent                       0                    2845  59.31      243.01          4.56
                                    1                    16    94.56      201.01          50.25
Messages received                   0                    2845  58.98      164.14          3.08
                                    1                    16    152.06     204.89          51.22
Messages total                      0                    2845  118.28     385.43          7.23
                                    1                    16    246.63     334.27          83.57
Degree centrality                   0                    2845  16.25      25.94           0.49
                                    1                    16    53.13      55.23           13.81
Betweenness centrality              0                    2845  2724.50    12,702.99       238.16
                                    1                    16    13,856.93  23,057.97       5764.49
Contribution index                  0                    2845  -0.13      0.53            0.01
                                    1                    16    -0.31      0.42            0.11
Betweenness centrality oscillation  0                    2845  56.05      54.44           1.02
                                    1                    16    130.38     73.01           18.25
Ego ART [h]                         0                    1604  62.05      54.64           1.36
                                    1                    16    58.74      71.31           17.83
Alter ART [h]                       0                    1465  63.53      56.03           1.46
                                    1                    16    52.76      69.86           17.47
Ego nudges                          0                    1604  1.27       0.64            0.02
                                    1                    16    0.80       0.58            0.15
Alter nudges                        0                    1465  1.40       1.06            0.03
                                    1                    16    0.87       0.57            0.14

Table 20 shows for which of these variables there are statistically significant differences between criminals and their peers at Enron. As Table 20 shows, the differences in messages received, degree and betweenness centrality, betweenness centrality oscillation, and ego and alter nudges are statistically significant at the 5% level in the “equal variances assumed” rows (for messages received and betweenness centrality, the “equal variances not assumed” rows give p = 0.09 and p = 0.07). This means that the convicted criminals indeed behave differently from the rest of the people at Enron.
Note that these differences might be an artifact of data collection. The mailboxes of the 16 convicted criminals are included in the 155 mailboxes that are the basis for the construction of this e-mail dataset. For a cleaner comparison, we would have to restrict our peer group to the other 139 people whose mailboxes were also collected, instead of taking all 2845 nonconvicts as the control group.
We now compare the content of the e-mails, calculating complexity, sentiment, emotionality, and influence using Condor’s “honest signals of collaboration” content-based values for each actor (Tables 21 and 22). Statistical analysis with the same SPSS t-test as before shows that the messages of criminals are more complex and less emotional (both differences are significant when equal variances are not assumed). Their total influence is also slightly higher, which might again be due to their higher rank in the company, but this difference is not statistically significant. There is no difference in average influence per message or in average sentiment between the two groups.
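The “equal variances not assumed” test can be checked by hand from the group summaries alone. The sketch below is a plain-Python rendering of Welch's t-test, not SPSS's or Condor's code; it takes the degree centrality means, standard deviations, and group sizes reported in Table 19 as inputs, and the function name is hypothetical.

```python
import math


def welch_t(mean0, sd0, n0, mean1, sd1, n1):
    """Welch's t statistic, degrees of freedom, and standard error from summaries."""
    v0 = sd0 ** 2 / n0          # squared standard error of group 0
    v1 = sd1 ** 2 / n1          # squared standard error of group 1
    se = math.sqrt(v0 + v1)     # standard error of the difference
    t = (mean0 - mean1) / se    # difference taken as control minus convicts
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    return t, df, se


# Degree centrality summaries from Table 19 (control group, then convicts).
t, df, se = welch_t(16.25, 25.94, 2845, 53.13, 55.23, 16)
print(round(t, 2), round(df, 2), round(se, 2))
```

Because the convict group has only 16 members, its variance dominates the standard error, and the degrees of freedom shrink to roughly the size of the smaller group.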
Table 20: Significances of t-Tests between Convicts and Control Group for Structural SNA Metrics.
(Levene’s Test for Equality of Variances: F, Sig.; t-Test for Equality of Means: t, df, Sig. (two-tailed), Mean difference, Std. error difference, 95% Confidence Interval of the Difference. Differences are computed as control group minus convicts.)
Metric                              Equal variances  F      Sig.  t      df     Sig. (two-tailed)  Mean difference  Std. error difference  Lower     Upper
Messages sent                       assumed          0.48   0.49  -0.58  2859   0.56               -35.26           60.87                  -154.62   84.10
                                    not assumed                   -0.70  15.25  0.50               -35.26           50.46                  -142.65   72.14
Messages received                   assumed          5.22   0.02  -2.26  2859   0.02               -93.09           41.21                  -173.89   -12.28
                                    not assumed                   -1.81  15.11  0.09               -93.09           51.31                  -202.39   16.22
Messages total                      assumed          1.65   0.20  -1.33  2859   0.18               -128.34          96.57                  -317.69   61.00
                                    not assumed                   -1.53  15.23  0.15               -128.34          83.88                  -306.90   50.21
Degree centrality                   assumed          29.94  0.00  -5.62  2859   0.00               -36.87           6.56                   -49.74    -24.01
                                    not assumed                   -2.67  15.04  0.02               -36.87           13.82                  -66.32    -7.43
Betweenness centrality              assumed          23.31  0.00  -3.47  2859   0.00               -11,132          3203                   -17,414   -4850
                                    not assumed                   -1.93  15.05  0.07               -11,132          5769                   -23,426   1161
Contribution index                  assumed          1.79   0.18  1.33   2859   0.18               0.18             0.13                   -0.08     0.44
                                    not assumed                   1.67   15.27  0.11               0.18             0.11                   -0.05     0.40
Betweenness centrality oscillation  assumed          5.58   0.02  -5.43  2859   0.00               -74.32           13.68                  -101.14   -47.51
                                    not assumed                   -4.07  15.09  0.00               -74.32           18.28                  -113.27   -35.38
Ego ART [h]                         assumed          2.92   0.09  0.24   1618   0.81               3.31             13.77                  -23.70    30.32
                                    not assumed                   0.19   15.18  0.86               3.31             17.88                  -34.76    41.38
Alter ART [h]                       assumed          3.00   0.08  0.76   1479   0.45               10.77            14.12                  -16.93    38.47
                                    not assumed                   0.61   15.21  0.55               10.77            17.53                  -26.54    48.08
Ego nudges                          assumed          1.41   0.24  2.92   1618   0.00               0.47             0.16                   0.15      0.78
                                    not assumed                   3.20   15.36  0.01               0.47             0.15                   0.16      0.78
Alter nudges                        assumed          0.03   0.86  2.00   1479   0.05               0.53             0.26                   0.01      1.05
                                    not assumed                   3.65   16.15  0.00               0.53             0.14                   0.22      0.83
Table 21: t-Tests of Content-Based E-Mail Metrics between Convicts and Control Group.
Metric                         0_control_1_convict  N        Mean  Std. Deviation  Std. Error Mean
avg complexity                 0                    2845.00  5.68  1.92            0.04
                               1                    16.00    6.19  0.71            0.18
Total influence                0                    2845.00  4.40  27.56           0.52
                               1                    16.00    4.80  9.08            2.27
avg emotionality               0                    2845.00  0.26  0.08            0.00
                               1                    16.00    0.23  0.02            0.01
Average influence per message  0                    2845.00  0.07  0.10            0.00
                               1                    16.00    0.07  0.07            0.02
avg sentiment                  0                    2845.00  0.53  0.09            0.00
                               1                    16.00    0.54  0.09            0.02
Note that we could have done this analysis entirely in Condor, as it includes the t-test function. If we want to run some regression analysis, however, we will need the exported data loaded into a statistics package such as SPSS (or R, Matlab, SAS, or Stata).
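As a rough illustration of what such a regression does with the exported data, the sketch below fits a one-variable least-squares line to a 0/1 group code. The data are made up, the feature is a placeholder for any exported honest signal, and a statistics package would normally be used instead of this hand-written fit.

```python
def simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up records: (hypothetical feature value, 0 = control, 1 = convict).
records = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
features = [r[0] for r in records]
groups = [r[1] for r in records]
intercept, slope = simple_ols(features, groups)
print(intercept, slope)
```

With several predictors, the same idea extends to a linear combination of all exported signals; for a 0/1 outcome a logistic regression is usually the more appropriate model.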
11.3. “TRIBEFINDER” — IDENTIFYING CRIMINALS THROUGH MACHINE LEARNING IN CONDOR
As a next step, we use Condor to directly identify potential criminals in the Enron dataset. In Table 21 we have found that convicted criminals show a different behavior from nonconvicted people. Using Condor’s machine-learning function, we can identify other people showing the same “suspicious behavior.”
Table 22: Significances of t-Tests between Convicts and Control Group for Content-Based Metrics.
(Levene’s Test for Equality of Variances: F, Sig.; t-Test for Equality of Means: t, df, Sig. (two-tailed), Mean difference, Std. error difference, 95% Confidence Interval of the Difference. Differences are computed as control group minus convicts.)
Metric            Equal variances  F     Sig.  t      df       Sig. (two-tailed)  Mean difference  Std. error difference  Lower   Upper
avg complexity    assumed          3.19  0.07  -1.07  2859.00  0.29               -0.51            0.48                   -1.46   0.43
                  not assumed                  -2.84  16.26    0.01               -0.51            0.18                   -0.90   -0.13
Total influence   assumed          0.00  0.95  -0.06  2859.00  0.95               -0.40            6.89                   -13.91  13.12
                  not assumed                  -0.17  16.59    0.87               -0.40            2.33                   -5.32   4.52
avg emotionality  assumed          3.49  0.06  1.30   2859.00  0.20               0.03             0.02                   -0.01   0.07
                  not assumed                  4.54   17.30    0.00               0.03             0.01                   0.01    0.04
First, we load the Enron dataset.
We load the top 2000 actors out of the 27,742 actors in the dataset, in the time interval from January 1, 1997 to January 4, 2002, which is when most of the activity of the Enron employees happened.
Once we have loaded the data, we need to calculate the different features used to classify the 2000 actors in the dataset into suspects and nonsuspects, using the convicts as our training dataset. As features we use all the honest signals of collaboration:
• Process dataset->annotate->Centrality annotations (betweenness and degree) (Central Leadership)
• Process dataset->annotate->Oscillation annotations (Rotating Leadership)
• Process dataset->annotate->Contribution Index annotations (Balanced Contribution)
• Process dataset->annotate->Turntaking annotations (Responsiveness)
• Calculate sentiment (Honest Sentiment)
• Calculate influence (Shared Context).
After annotating all actors with the social networking and content-based features, we also need to tag each actor as belonging either to the class of convicts or to the class of nonconvicts, so that we can later test the accuracy of our machine-learning algorithm. For this, we load a CSV file of the following structure (available at www.ickn.org/sociometrics):
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
When loading the data, we find that criminal mastermind Andrew Fastow seems to have been a very reluctant user of e-mail, with just 31 messages, which means he does not make it into the top 2000 users, so we end up with 15 “criminals” instead of 16.
Now we are ready to run the machine-learning algorithm. The idea is that “criminals” and “ordinary” people exhibit fundamental differences in their communication patterns. As there are many more “ordinary” people than “criminals,” our sample is highly unbalanced: for the training phase, we have 15 “criminals” and 1985 “ordinary” people. This means that in Phase 1, described below, we will use all 15 “criminals” for training the machine-learning system, but oversample them by 9000% (a factor of 90) to obtain 1350 records and thus make the two samples approximately the same size. Using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm, Condor creates an additional group of 1335 “virtual criminals” whose features differ by small random amounts from those of the original 15, so that their communication patterns are similar to the original ones. (SMOTE generates each synthetic record by interpolating between an original record and one of its nearest neighbors in the same class.)
Condor currently provides two algorithms: decision trees and random forests. Decision trees are simpler and faster, but they tend to overfit the training dataset. This means that while the formula we are developing will be very accurate in modeling our training cases, it will produce false results as soon as the underlying structure of the data changes only slightly. It is better to have a less accurate but more robust prediction algorithm. We therefore use the random forest algorithm, which combines a number of decision trees, each built with small random variations of the original samples, into an “averaged” prediction.
We now start the random forest machine-learning wizard. First, we load an external file with the e-mail addresses of the 16 convicts. Clicking on the “include” button leads to the following dialog, which shows all the actors on the left and the ones that will be used for training (the “convicts” in our example) on the right.
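The oversampling idea can be sketched in a few lines. This is a simplified, standard-library rendering of SMOTE-style interpolation, not Condor's implementation; the two-feature records, the neighbor count `k`, and the function name are assumptions made for illustration.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic records by interpolating between minority records."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority records to the base record (excluding itself),
        # by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the line segment between base and other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Made-up two-feature records standing in for the minority ("convict") class.
convicts = [(53.0, 130.0), (48.0, 120.0), (60.0, 141.0), (55.0, 125.0)]
extra = smote(convicts, n_new=8)
print(len(convicts) + len(extra))
```

In the setting described above, 15 original records oversampled by a factor of 90 give 15 × 90 = 1350 records, so `n_new` would be 1350 − 15 = 1335.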
The next dialog asks which fields we would like to use for training the classifier with the attributes of suspicious behavior. For our experiment, we use all the attributes as training features:
Remember that this dataset is unbalanced, because the two categories of “convicts” and “nonconvicts” are of vastly different size, with the convict class having 15 members and the nonconvict class having 1985 members. In the next dialog, we deal with the imbalance of the two classes. We check the SMOTE box, which applies the SMOTE algorithm mentioned above to create additional records for the smaller class, expanding the original 15 actors by 9000% to 1350 records.
The next image shows the training data on the left and how well each record fits within the random forest training algorithm. On the right, we see a list of people with similar features (i.e., similar communication behavior). [email protected] has the best fit with a communication pattern similar to the “criminals.”
We check the quality of the fit by creating a receiver operating characteristic (ROC) curve. An ROC curve is a graphical way of visualizing the accuracy of a binary classifier: it plots the rate of true positives against the rate of false positives for many different decision thresholds. To compute it, the dataset is split into two parts, one for training and the other for testing. Condor calculates nine different ROC curves, starting with 90% of the data for testing and 10% for training, down to 10% of the data for testing and 90% for training. The bigger the area under the curve (AUC), the better the classifier, that is, the higher the proportion of true positives at a given proportion of false positives. For a random classifier, the ROC curve would be a diagonal. In the image below we see that the AUC is very large, which suggests that our identification of potential suspects is, with high likelihood, correct.
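How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows. The labels and scores are made up, the function names are hypothetical, and this is a generic illustration rather than Condor's procedure.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) points over all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up test set: 1 = convict, 0 = nonconvict, with classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
area = auc(roc_points(labels, scores))
print(round(area, 3))
```

An AUC of 1.0 means every convict is scored above every nonconvict, while 0.5 corresponds to the diagonal of a random classifier.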
Finally, we visualize the result: the convicts, which have been our training dataset, are shown in light blue; the suspects, which are the actors with communication behavior similar to the convicts, in green; and the rest of the people in yellow. Each node is sized by its goodness of fit to the training dataset. The larger the node, the more similar an actor is to a convict.
Image 35 (for color pictures, see the online version of the images, available at http://www.ickn.org/sociometrics/)
As a very last analysis, we look at the t-test and compare the mean for suspects and nonsuspects using the “convicted” attribute we originally loaded.
11.4. FOLLOW-ON EXERCISES
1. As a follow-on exercise, analyze the communication behavior of the two acquitted suspects included in the full “Enron” dataset (Michael Krautz, Scott Yeager) and explore differences in their six honest signals of collaboration from convicted criminals and from the control group of the remaining people in the dataset.
2. As a second exercise, use the same data as input for a regression analysis and develop a formula that predicts if somebody else in the dataset might be a criminal based on a linear combination of their six honest signals of collaboration.
3. As a third exercise, increase the accuracy and predictability of this formula by looking up on the Internet the names of the 155 people whose e-mail has been collected for this analysis. This means that while the experimental group (the 16 criminals) will stay the same, the control group will now be 155 − 16 = 139 instead of 2845 people.
4. Finally, focus on the “California Energy Crisis” and repeat the above analysis for this subtopic of the Enron downfall, by loading only the messages containing these keywords and constructing the network with the resulting messages.
Important: We CANNOT convict anybody based on this formula. There might be perfectly legitimate reasons why somebody shows a certain combination of communication attributes. The only thing we can say is that with certain likelihood the communication behavior of this person resembles the behavior of convicted criminals.
MAIN LESSONS LEARNED
• The six honest signals of collaboration can be used to find people with similar behavior.
• In this example, we try to identify criminals based on their e-mail behavior, analyzing the e-mail archive of Enron. The Enron e-mail archive documents the spectacular crash of Texan energy trading firm Enron at the end of 2001.
• We identify the differences in the six honest signals of collaboration between ordinary Enron employees and the convicted criminals.
• Using machine learning, Condor can be trained with the criminal actors and will then find others with similar (suspicious) behavior.
• This is called “tribefinder,” as it uses Condor’s machine-learning capability to identify people with communication patterns similar to a “tribe,” a homogeneous group of people, in this case the convicted criminals.
12 COOLHUNTING ON THE INTERNET WITH CONDOR
CHAPTER CONTENTS
• The wisdom of the experts: finding trends and trendsetters on the Web and blogs
• The wisdom of the swarm: thought leaders and topics from Wikipedia
• The “wisdom” of the crowd: identifying influencers and trends on Twitter.
As the Web mirrors the real world, it provides an excellent data source for measuring the importance of a brand, product, politician, philosopher, or concept. Condor provides rich functionality for analyzing the importance of a brand, product, or person on the Internet. Coolhunting for a topic consists of
• Identifying the context of a topic or brand, in particular, its competitors
• Measuring the relative strength of the topic or brand and its competitors
• Identifying the topic or brand’s associated influencers, ranking them by influence.
Without any further restrictions this analysis is done globally. Thanks to the availability of geotagging, the analysis can also be restricted by drilling down into different target markets. We distinguish three different information spheres where we track context, strength, and influencers of a brand, namely the
• Crowd
• Experts
• Swarm.
The crowd is defined as the broad and indiscriminating masses, which easily flip between wisdom and madness. The crowd is found mostly on Twitter. Experts might be journalists, movie critics, or professional consultants. Experts are primarily found on the Web, on blogs and on news websites. The swarm is defined as people with “real skin in the game.” They might be medical researchers when Coolhunting for drugs, or plumbers when Coolhunting for ideas for aircraft toilets, or “treehuggers” when Coolhunting for alternative energy sources. The swarm is found in Wikipedia and in domain-specific online forums and Facebook pages.
In this section, the Coolhunting process is illustrated using Condor for a full 360-degree scan on the Web, Wikipedia, and Twitter for “Amity University,” a private university in India. The key preliminary step consists of understanding the context of the search term, using Google and common sense. In a netnographic analysis, the Google search results are interpreted qualitatively. In this initial phase, key people and influencers, products associated with the search term, and competitors are identified. Just by using Google and Wikipedia, we learn that “Amity University” is a private university in India, established by Ashok Chauhan in 2003, with 125,000 students and 4500 faculty and staff, and campuses not only all across India, but also in the United Arab Emirates, China, the United States, and the United Kingdom.
Subsequently, the context of the brand, brand strength, key people and products, and competitors are searched for on the Web, Wikipedia, and Twitter using Condor. Based on the Wikipedia entry we can identify “Amity University” and “Ashok Chauhan” as initial search terms. We also identify some further universities as “competitors” to calibrate the strength of Amity’s brand. Based on my own experience in teaching and working at MIT and the University of Cologne, I will use “University of Cologne” and “MIT” as global brands to measure “Amity University” against, with “MIT” as a top brand that will come out much stronger, and “University of Cologne” as a local brand in Germany, comparable to Amity. I will compare these terms against “IIT,” the Indian Institutes of Technology, as the top Indian brand and “University of Mumbai,” a local competitor of possibly comparable brand strength.
12.1. EXPERT ANALYSIS — WEBSITES AND BLOGS
We start by creating a new database in Condor called “amity.”
We then create a new dataset “amity_uni_web.”
This will lead to a new dataset “amity_Uni_Web” being shown in Condor.
Tip: Database and dataset names cannot have a space. Use an underscore “_” as a separator.
We then start the Web fetcher using the “fetch Web” command, with the following settings. Note that you must previously have obtained your own Google CSE keys (for how to obtain them, see the YouTube video at https://www.youtube.com/watch?v=zME1-j9yPvI). Google will allow you to do 100 free searches per day; after that it charges a few cents per query.
After clicking the “Next >” button, Condor will bring up the top 20 search hits from Google. These can be manually checked to make sure they are really about the “Amity University” we are interested in. This is not a problem for a strong brand like “Amity University,” but might be more problematic if we were measuring the strength of a local politician named “John Smith” in Alabama. In this case, the search term would be “John Smith Alabama,” and we still might have to check some URLs to make sure they are not about John Smith the teacher in Huntsville.
Condor will now conduct a degree-of-separation search, collecting the top 20 web pages pointing back to each of the URLs shown in the image above. We then repeat this process for “Ashok Chauhan,” “Mumbai University,” and “University Cologne,” making sure to store each of the resulting datasets in a separate Condor dataset called “Ashok_Chauhan_Web,” “Uni_Cologne_Web,” and “Uni_Mumbai_Web.” Once all four queries have been completed, we merge the four datasets into one combined dataset which we call “Web_combined.”
Storing each of the Web fetches into a separate dataset will allow us to display the websites belonging to each of the different universities (Amity, Cologne, Mumbai) and Ashok Chauhan in different colors. As a next step we will have to make sure that each domain is shown separately. We do this by collapsing the dataset by domain.
Note that we need to check the box “keep original nodes and link each to its collapsed one” to make sure to keep the web page nodes in addition to the domain nodes, and also check the box “keep nodes with missing ‘collapse by’ value” to keep the original search terms. Now we are ready to open the combined dataset.
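Collapsing by domain can be pictured as grouping page URLs under their host name. The sketch below shows the idea with Python's standard `urllib.parse`; it is an illustration under assumed inputs, not Condor's implementation, and the example URLs are placeholders.

```python
from collections import defaultdict
from urllib.parse import urlparse


def collapse_by_domain(nodes):
    """Group node names by domain; names without a domain go under the empty key."""
    groups = defaultdict(list)
    for node in nodes:
        host = urlparse(node).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org and example.org as one domain
        groups[host].append(node)
    return dict(groups)


# Placeholder page URLs plus one search term that has no domain.
nodes = [
    "https://www.example.org/a",
    "http://example.org/b",
    "https://portal.example.edu/x",
    "Amity University",
]
groups = collapse_by_domain(nodes)
print(groups)
```

The search term ends up under the empty key, which mirrors why nodes with a missing “collapse by” value have to be kept explicitly: they have no domain to collapse into.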
By clicking the “Next >” button we will be using default settings and not filtering or changing anything during the load dataset process.
We will be loading both the “Query” and the “web pages” by just clicking the “Next >” button again.
The next dialog would give us the option to manually remove some of the URLs from the resulting network. By just clicking the “Next >” button again, we include all URLs.
Now we have loaded the full dataset “Web combinedcollapsed.”
By dragging the mouse over the resulting dataset box, we can see that it contains 686 actors and 822 links. Choosing the “View->Create static view” menu, we are now ready to display the network. We also color the network by the original datasets.
Image 36 (for color pictures, see the online version of the images, available at http://www.ickn.org/sociometrics/)
The next step will be to calculate the importance of the four brands “Amity, Cologne, Mumbai, Ashok Chauhan” and of the different websites. We do this by annotating the actors with betweenness centrality, using “Process dataset->Annotate->Centrality annotations,” which will bring up the following dialog. Clicking on the “Next >” button will calculate the betweenness centrality for each node in the graph.
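For readers who want to see what betweenness centrality measures, the sketch below implements Brandes' algorithm for a small undirected, unweighted graph. It is a generic illustration, not Condor's code; Condor may scale or normalize its values differently, and the node names are placeholders.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness for an undirected graph given as node -> neighbours."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                              # nodes in order of discovery
        preds = {v: [] for v in graph}          # predecessors on shortest paths
        sigma = {v: 0 for v in graph}           # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:                            # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while order:                            # accumulate, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Placeholder graph: one search term linked to three pages.
star = {"term": ["p1", "p2", "p3"], "p1": ["term"], "p2": ["term"], "p3": ["term"]}
result = betweenness(star)
print(result)
```

In the star-shaped example, all three shortest paths between pairs of pages run through the search term, so it receives a score of 3 and the pages receive 0.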
We can now change the size of the nodes by selecting “Size by betweenness centrality” in the window pane on the right. Note that this option is available only now, after having calculated betweenness centrality. Keeping the nodes in the graph colored by the original datasets leads to the following image.
Image 37 (for color pictures, see the online version of the images, available at http://www.ickn.org/sociometrics/)
In the image above, we can already see that “Ashok Chauhan” seems to be a stronger brand than “Amity University.” Right-clicking on a node shows its name; dragging the slider “Node labels” increases the font size of the label (see the image below). Squares represent the search terms, with node size indicating the strength of each brand; circles are URLs.
We can more exactly track the strength of the brands on the Web by exporting the values to Excel (Export->Export CSV). There is no need to also export the links for now.
Importing the actor.csv file into Excel leads to the image below. We separate search terms and URLs and draw a separate chart for each of the two ordered lists. The image below shows the search terms sorted by betweenness centrality, which gives a ranked list of the importance of the four brands:
1. Ashok Chauhan
2. Mumbai University
3. University of Cologne
4. Amity University.
One unexpected insight of the analysis is that the brand equity of its founder Ashok Chauhan is stronger than the brand of Amity University itself. This means Ashok Chauhan is more prominent on the Web than Amity University. The second list in the image below shows the most prominent websites, again sorted by betweenness in the degree-of-separation graph shown before (reverse the graph sort for the highest score at the top). The top websites boosting the brands of Ashok Chauhan, Mumbai University, University of Cologne, and Amity University are
1. Wikipedia
2. Facebook
3. Topuniversities.com
4. Indianexpress.com
5. Quora
6. Portal.uni-koeln.de
7. Amazon.
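The same separation and sorting can be done without Excel. The sketch below assumes a CSV export with hypothetical column names (`name`, `type`, `betweenness`) and made-up values; the real actor.csv layout may differ.

```python
import csv
import io


def rank_by_betweenness(csv_text):
    """Return (search terms, URLs), each sorted by descending betweenness."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ranked = sorted(rows, key=lambda r: float(r["betweenness"]), reverse=True)
    terms = [r["name"] for r in ranked if r["type"] == "query"]
    urls = [r["name"] for r in ranked if r["type"] != "query"]
    return terms, urls


# Made-up export with placeholder names and values.
export = (
    "name,type,betweenness\n"
    "term A,query,120.5\n"
    "site-one.example,url,300.0\n"
    "term B,query,410.0\n"
    "site-two.example,url,15.25\n"
)
terms, urls = rank_by_betweenness(export)
print(terms, urls)
```

Sorting in descending order puts the strongest brand or website first, which corresponds to reversing the chart sort mentioned above.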
The presence of the uni-koeln portal is explained by content about a Nobel Prize winning scientist featured on the uni-koeln website. This illustrates the power of star scientists to promote the popularity of brands such as universities.
12.2. SWARM ANALYSIS — WIKIPEDIA
The second analysis compares the presence of the different universities on Wikipedia. As Wikipedia provides access to the world’s knowledge, it is particularly well suited for measuring the strength of universities; Wikipedia editors represent the intrinsically motivated swarm of “knowledge gathering worker bees.” To calculate the link network for “Amity University,” we create a new dataset “Amity Wiki.”
Using the Wiki Evolution Fetcher, we first search for the key Wikipedia pages containing the term “Amity University” in the English Wikipedia.
Out of the returned pages, we extract the Wikipedia network originating from the page “Amity University.”
We collect all the links (not just the bidirectional ones) originating from and pointing to “Amity University” by unchecking the “restrict static fetcher to bidirectional links” check box shown in the image below.
Note: Bidirectionality of a link means that if page A has a link to page B, there is also a link on page B pointing back to page A. Including unidirectional links gives us a broader overview of the network, and as we do not expect “Amity University” to be wildly popular, the number of links found will still be manageable. If we searched for a page like “United States,” collecting all the links either originating on that page or pointing back to it would most likely return a significant part of the entire Wikipedia; for such a page we would collect only the bidirectional links.
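The distinction in the note can be made concrete with a short sketch. It assumes links are available as (source page, target page) pairs; the page pairs below are hypothetical and do not come from the collected dataset.

```python
def bidirectional_links(links):
    """Keep only links (a, b) whose reverse (b, a) is also present."""
    link_set = set(links)
    return {(a, b) for (a, b) in link_set if a != b and (b, a) in link_set}


# Hypothetical page links, given as (source page, target page).
links = [
    ("Amity University", "India"),
    ("Amity University", "Ashok Chauhan"),
    ("Ashok Chauhan", "Amity University"),
    ("Amity School", "Amity University"),
]

mutual = bidirectional_links(links)
print(sorted(mutual))
```

In this toy example the link to “India” is dropped because the “India” page does not link back, which is the effect the bidirectional restriction has on very general pages.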
We are not collecting the dynamic network, as for now we are interested only in the full network at collection time, not the evolution of the links over time. We also collect the snippets, the text before the table of contents of a Wikipedia page, to obtain the most important words describing Amity University on Wikipedia.
Once Wiki Evolution has finished its data collection, we can run “create static view” and calculate betweenness by executing “centrality annotations.” Before drawing the network, we check for nodes that might distort the network picture without adding meaning to the overall graph, by bringing up the dialog for removing nodes (“Process dataset->Remove specific actor”). The page “Geographic_Coordinate_System” and some template pages have nothing to do with the universities, so we remove them.
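The cleaning and centrality steps can be approximated outside Condor. The sketch below uses Brandes' algorithm for betweenness on an unweighted graph; how Condor computes its centrality annotations internally is not described here, and treating the links as undirected is an assumption of this example. The edge list is hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def remove_nodes(adj, unwanted):
    """Return a copy of the graph without the unwanted nodes and their links."""
    return {v: {w for w in nbrs if w not in unwanted}
            for v, nbrs in adj.items() if v not in unwanted}


def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical link list, for illustration only.
edges = [
    ("Amity University", "India"),
    ("Amity University", "Ashok Chauhan"),
    ("Amity University", "Amity School"),
    ("Amity University", "Geographic_Coordinate_System"),
    ("India", "Geographic_Coordinate_System"),
]
cleaned = remove_nodes(build_graph(edges), {"Geographic_Coordinate_System"})
scores = betweenness(cleaned)
print(scores)
```

Removing the coordinate-system page before computing centrality matters because such generic pages create shortcuts between otherwise unrelated nodes and would shift the scores.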
This leads to the following network, with the nodes sized by betweenness.
Image 38 (for color pictures see the online version of the images, available at http://www.ickn.org/sociometrics/)
The above network shows that Ashok Chauhan commands a much less central position in the Wikipedia network, while India is the defining attribute of Amity University. It is also surprising that only other private universities in India, that is, local competitors (the cluster on the lower left), as well as private Indian high schools (the cluster on the top right) dominate the network. The only research topic that shows up somewhat prominently is biotechnology. Rerunning the same analysis, but only including bidirectional links (keeping the box “Restrict static fetcher to bidirectional links” checked as shown in the image below), will identify the most important relationships.
The image below shows the resulting network in the static view of communication. There are far fewer nodes, and general geography nodes like the page about “India” no longer appear. This makes sense, because Amity University is not important enough to be linked from any of the general geography pages.
In the image above, Amity University is clustered together with other local competitors, while the second large cluster on the bottom right shows the importance of Amity School and its local Indian competitors such as Ryan International School.
Next, we look at the key terms around “Amity University” on Wikipedia. First, we need to calculate the sentiment for the Wikipedia snippets (the snippet is the text above the table of contents on a Wikipedia page), which we have collected for the “Amity Wiki” dataset. “Process dataset->calculate sentiment” brings up the following dialog, where we calculate the sentiment for the field “content.”
The menu “View->Create word cloud view” creates the following word cloud. It tells us that for Amity University, Amity School and its associated middle and high schools are a stronger brand than the university proper. The dark red color of “school” tells us that the sentiment around the word “school” is slightly negative.
Image 39
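The coloring of a word in the cloud can be thought of as an aggregate over the snippets that contain it. The sketch below illustrates that idea with a tiny hand-made lexicon; Condor's actual sentiment model is not described in this chapter, so the word lists, the scoring rule, and the example snippets are all assumptions.

```python
import re

# Hypothetical mini-lexicon; not Condor's real sentiment model.
POSITIVE = {"excellent", "leading", "renowned"}
NEGATIVE = {"poor", "strictly", "compulsory", "lower"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(snippet):
    """Label a snippet by the balance of positive and negative lexicon hits."""
    words = tokenize(snippet)
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def tally(keyword, snippets):
    """Count sentiment labels over the snippets that mention the keyword."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for snippet in snippets:
        if keyword in tokenize(snippet):
            counts[classify(snippet)] += 1
    return counts


# Made-up snippets for illustration.
snippets = [
    "The school is renowned for its excellent faculty.",
    "Attendance at the school is strictly compulsory.",
    "The school was founded in 1991.",
    "The university opened a new campus.",
]
counts = tally("school", snippets)
print(counts)
```

A word whose negative count outweighs its positive count would then be drawn in a red shade, which is one plausible reading of how “school” ends up dark red.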
Clicking on “school” in the word cloud view brings up the drill-down view of the word “school” shown below. The view shows that “school” has been found in 50 articles, of which 12 are positive, 13 are neutral, and 25 are negative. The drill-down view also displays the context in which the search word “school” has been used for each of the 50 occurrences. The 25 negative texts in the context of “school” are due to words like “strictly,” “poor,” “compulsory,” and “lower.”
The next step in the analysis is to compare the Wikipedia link structure of “Amity University” with that of its competitors, “IIT,” “University of Mumbai,” “MIT,” and “University of Cologne.” Running the Wiki Evolution Fetcher with these four search terms, while collecting all links, not just bidirectional links, produces the following four network images.
Comparing the information structure of “Amity” with that of its competitors, both those in the same “league” and those well above it, such as IIT and MIT, shows both its strengths and weaknesses. MIT is number 1 in the 2015/16 QS World University Ranking, IIT Mumbai is 222, University of Cologne is 305, and University of Mumbai is in the 551–600 band. In the QS Asia ranking, IIT Delhi is 42, University of Mumbai is 125, and Amity University is in the 251–300 band.
Images 40 to 43
At first glance, we can see that MIT (top right), IIT (bottom right), and University of Mumbai (bottom left) are all much denser and have more nodes in the network than University of Cologne (top left), which has a network similar to that of Amity University. This tells us at a glance that MIT, IIT, and also University of Mumbai all play in a league above Amity University.
As IIT is spread out across multiple campuses, it has different local clusters; IIT Mumbai seems to be the dominant one, while the largest node in the IIT network is about India. The links of IIT to other universities are also surprisingly local, almost exclusively pointing to other Indian universities. Compared to IIT’s 976 actors and 44,448 Wikipedia links, MIT has 977 actors and 474,976 links. The first link of MIT is to Harvard, illustrating how these two top brands in academia boost each other. The first MIT links to non-US institutions are to McGill and University of Toronto; the next one is to Tsinghua.
The first link to a person in the IIT network is to Narendra Modi, the current prime minister of India, followed by some historical kings. Only quite far down in the link network is the link to Ashoke Sen, a highly respected Indian physics professor. The first link in the MIT network to a person is to John F. Kennedy, followed by links to MIT presidents of the late 19th century, then some recent MIT presidents, and then some more recent Nobel Prize winners.
The network of University of Cologne is much sparser, showing fewer links. It has, however, more international links. After a first link to University of Munich, it is next linked to National University of Singapore, University of Vienna, and London School of Economics. After some politicians, it is linked to Heinrich Böll, a famous writer, and then to Peter Grünberg, a Nobel Prize winner in physics teaching at University of Cologne.
The conclusions for Amity University are to build an international alliance network with universities abroad and to establish connections to a few star scientists.
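One way to put a number on “denser” is network density, the share of possible links that are actually present. The sketch below applies it to the actor and link counts quoted above, under the assumption that each counted link is a distinct directed link between two different pages; the text does not say how Condor counts links, so the resulting values are illustrative only.

```python
def directed_density(nodes, links):
    """Share of possible directed links that are present: m / (n * (n - 1))."""
    return links / (nodes * (nodes - 1))


# Actor and link counts as quoted in the text.
iit = directed_density(976, 44_448)
mit = directed_density(977, 474_976)

print(f"IIT density: {iit:.3f}")
print(f"MIT density: {mit:.3f}")
print(f"MIT / IIT:   {mit / iit:.1f}")
```

With almost identical numbers of actors, the difference in density comes entirely from the number of links, which is why the MIT picture looks so much more tightly woven.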
12.3. ANALYSIS OF THE CROWD — TWITTER
In our final analysis, we investigate what the crowd has to say about “Amity University” on Twitter. To collect the tweets about Amity, we create a new dataset “Amity_twi”.
Searching Twitter for “Amity University” tells us that @amityuni is the Twitter account of Amity University. Following only two people, Amity has 3957 followers.
We now also look at the Twitter accounts of University of Cologne and University of Mumbai. While University of Cologne seems to have an official Twitter account, @UniCologne, University of Mumbai seems to be somewhat disorganized, with two accounts: one has tweeted 23 times and has only 228 followers, the other has 1088 followers but only a single tweet.
We now collect the tweets for Amity University into the amity_twi dataset, searching for “amityuni.” We uncheck the box “collect only retweets” to collect all the tweets about “amityuni,” including those that have never been retweeted (a retweet indicates that a tweet has been read at least once). We also increase the maximum number of results to collect from 250 to 1000. Note that the Twitter API will only let us collect at most 18,000 tweets from the last seven days within a 15-minute time window, by running at most 180 queries returning 100 tweets each. This means that we could have entered 18,000 into the “Number of result” box; to collect more, we would have to wait 15 minutes for the next query. As our search in the Twitter search box told us that there is not much active tweeting about amityuni, asking for at most 1000 tweets from the last seven days gives us the 30 tweets that were actually sent in that period.
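The arithmetic behind these limits can be written down in a few lines. The sketch below uses only the figures quoted above (180 queries of 100 tweets per 15-minute window); the function name is hypothetical, and the limits of the live API may differ from the ones described here.

```python
import math

QUERIES_PER_WINDOW = 180   # as stated in the text
TWEETS_PER_QUERY = 100     # as stated in the text
WINDOW_MINUTES = 15        # as stated in the text


def collection_plan(requested):
    """Return (queries needed, number of 15-minute windows spanned)."""
    queries = math.ceil(requested / TWEETS_PER_QUERY)
    windows = math.ceil(queries / QUERIES_PER_WINDOW)
    return queries, windows


for requested in (1000, 18_000, 18_001):
    queries, windows = collection_plan(requested)
    waiting = (windows - 1) * WINDOW_MINUTES
    print(f"{requested} tweets: {queries} queries, at least {waiting} minutes of waiting")
```

Requesting 1000 tweets needs only 10 queries, so it fits comfortably into a single window; only a request above 18,000 forces a wait.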
Condor then brings up the Twitter API key window, where we have already entered our keys. (See the following YouTube movie for how to obtain your Twitter API keys: https://www.youtube.com/watch?v=6zMW7YEKJzw)
To go further into the past, we also collect all the tweets sent by @amityuni using the Twitter people fetcher.
The activity view shows that the tweets about amityuni go back to the beginning of 2015, with most of the tweets sent close to the time of collection at the end of August 2015, when at most 13 tweets were sent per day.
The resulting static communication view shows the importance of the search term “amityuni.”
Image 44
This process is now repeated for “unicologne,” collecting the tweets with the “Twitter fetcher” and the “Twitter accounts fetcher,” and storing the results in a newly created dataset, cologne_twi. Next, the tweets for University of Mumbai are collected. As both “mumbai_uni” and “umumbai” are used, two queries are run, one for “uni mumbai” and another for “mumbai uni.” Afterwards the two search terms are merged with the Twitter account “Uni_Mumbai,” using Condor’s actor merge function (“Process dataset->Node merging->manual node merging,” see image below).
To make this change persistent, the dataset is saved into the MySQL database under the new name “mumbai_twi2” by right-clicking on the dataset box “mumbai_twi.”
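Conceptually, a manual node merge relabels several actors as one and removes links that would then point from the merged actor to itself. The sketch below shows this on a plain edge list; it is an illustration of the idea rather than Condor's implementation, and the user names in the example are hypothetical.

```python
def merge_nodes(edges, aliases, canonical):
    """Relabel every alias as the canonical node and drop self-loops created by the merge."""
    merged = []
    for source, target in edges:
        source = canonical if source in aliases else source
        target = canonical if target in aliases else target
        if source != target:
            merged.append((source, target))
    return merged


# Hypothetical links between tweeters, search terms, and the account.
edges = [
    ("user_a", "uni mumbai"),
    ("user_b", "mumbai uni"),
    ("uni mumbai", "Uni_Mumbai"),
]
merged = merge_nodes(edges, {"uni mumbai", "mumbai uni"}, "Uni_Mumbai")
print(merged)
```

After the merge, both hypothetical users are connected to the single “Uni_Mumbai” node, so its centrality reflects all the attention that was previously split across three labels.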
The three datasets, amity_twi, cologne_twi, and mumbai_twi2, are now merged into a new dataset.
In the combined static view, we see that for the seven days when the data was collected, Amity Uni is the strongest brand in the Twittersphere, while Uni Cologne was the most tweeted about.
Image 45
Now we go back to the original amity_twi dataset, looking at its content by calculating sentiment.
Next, we look at what people are tweeting about Amity University in the week of August 17–24, 2015. It seems the tweets about Amity are universally happy, somewhat emotional, and somewhat complex.
Image 46
Generating the word cloud shows the keywords being used in the tweets; clicking on a word shows it in context.
Image 47
Compared to those about Amity University, tweets about University of Mumbai are very angry: people are complaining about the “outrageous fees” and “harassment of females.” Tweets about University of Cologne are much happier. Condor is capable of automatically calculating the sentiment of English, German, French, Italian, Spanish, and Portuguese text; note that there are some German words in the word cloud about University of Cologne.
Image 48
We can also calculate where in the world people are tweeting about Amity. For this we first need to run “Process dataset->Location annotation.”
The World Map shows that almost all of the tweeting is happening in India, with a few tweets from Indians in the United States or Europe tweeting about relatives graduating from Amity Noida.
Image 49
Finally, we would like to identify the most influential twitterers about Amity University. Toward that goal, we remove the search term “AmityUni.”
Note that the network, which was initially connected in the static view, now breaks up into many small clusters. These represent individual retweet networks, with a central individual in the core being retweeted by the people in the periphery. We now have the option to size the nodes by different Twitter-specific attributes such as the number of followers or the number of times a tweeter has been listed. This helps to identify the most influential twitterers about Amity.
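The break-up into clusters after removing the search term can be illustrated by computing connected components with and without the hub node. This is a minimal sketch on a hypothetical edge list; tweeters that were linked only to the removed hub disappear from the result, since they no longer have any link.

```python
from collections import defaultdict, deque


def components(edges, removed=()):
    """Connected components (as sets) of an undirected graph, ignoring removed nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        if a in removed or b in removed:
            continue  # links touching a removed node vanish with it
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in group:
                    group.add(nbr)
                    queue.append(nbr)
        seen |= group
        result.append(group)
    return result


# Hypothetical network: every tweeter mentions the search term,
# and a few also retweet each other.
edges = [
    ("amityuni", "u1"), ("amityuni", "u2"), ("amityuni", "u3"),
    ("amityuni", "u4"), ("amityuni", "u5"),
    ("u2", "u1"), ("u4", "u3"),
]
with_hub = components(edges)
without_hub = components(edges, removed={"amityuni"})
print(len(with_hub), len(without_hub))
```

With the hub in place everything hangs together through the search term; without it, only the genuine retweet relationships between people remain.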
Image 50
We find that the most influential twitterer by follower count is @Brothers2015, which is the official and verified account of the Bollywood movie “Brothers.” It seems Amity University is a sponsor of this movie. Entering “brothers” in the search box at the top will highlight all the tweets and actors containing the keyword, showing a lot of traction for this association with Amity.
There is also a second tier of influential twitterers such as BeingExample, Nick_Ksg, and 6670G, with a decent number of followers in the 4000–5000 range. However, they follow a similar or even larger number of people, which raises the suspicion that they have “gamed” their follower counts by following each other. This means that they might be tweeting into an “echo chamber” without wider reach.
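The suspicion described above can be turned into a simple screening rule: compare the number of followers with the number of accounts followed. The sketch below is only a heuristic; the account names, the counts, and the threshold of 1.5 are hypothetical and chosen for illustration.

```python
def follower_ratio(followers, following):
    """Followers per followed account; infinite if the account follows nobody."""
    return followers / following if following else float("inf")


# Hypothetical accounts: (followers, following).
accounts = {
    "official_movie_account": (250_000, 40),
    "account_a": (4_600, 4_900),
    "account_b": (4_200, 4_100),
}

THRESHOLD = 1.5  # arbitrary cut-off for this illustration
flagged = [name for name, (followers, following) in accounts.items()
           if follower_ratio(followers, following) < THRESHOLD]
print(flagged)
```

A low ratio does not prove that follower counts were gamed; it merely marks accounts whose reach deserves a closer look.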
12.4. FOLLOW-ON EXERCISES
1. Repeat the analysis for another university, company, or organization, for example, your own university.
2. Repeat the analysis for a product or brand, for example, comparing “Samsung Galaxy” with “iPhone 6s.”
3. Compare the importance of a politician (Hillary Clinton) with the importance of a university.
4. Monitor the twittersphere over time to assess how content and twitterers change.
5. Rerun the Web search but restrict the date range to the most recent four weeks to assess what is new and what links have been added.
6. Rerun the Wikipedia analysis including the dynamic fetcher to assess what has been recently added.
12.5. (PARTIAL) LIST OF INTERNET COOLHUNTING STUDIES
Downloadable from http://www.ickn.org/publications.html
Gloor, P., De Boer, P., Lo, W., Wagner, S., Nemoto, K., & Fuehres, H. (2015). Cultural anthropology through the lens of Wikipedia: A comparison of historical leadership networks in the English, Chinese, and Japanese Wikipedia. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Maddali, H. T., Gloor, P., & Margolis, P. (2015). Comparing online community structure of patients of chronic diseases. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Gloor, P., & Nemoto, K. Who really matters in the world: Leadership networks in different language Wikipedias. Places and Spaces Mapping Science, Map #157.
Frick, K., Guertler, D., & Gloor, P. (2013). Coolhunting for the world’s thought leaders. Proceedings 4th international conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Futterer, T., Gloor, P., Malhotra, T., Mfula, Packmohr, K. H., & Schultheiss, S. (2013). Wikipulse: A newsportal based on Wikipedia. Proceedings 4th international conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Yun, Q., & Gloor, P. (2012). The web mirrors value in the real world: Comparing a firm’s valuation with its web network position. Sloan Technical Report, Cambridge, MA.
Fuehres, H., Gloor, P., Henninger, Kleeb, M., & Nemoto, K. (2012). Galaxysearch: Discovering the knowledge of many by using Wikipedia as a meta-search index. Proceedings collective intelligence, Cambridge, MA, April 18–20.
Garcia, C., Parraguez, P., Barahona, M., & Gloor, P. (2012). Tracking the 2011 student-led collective movement in Chile through social media use. Proceedings collective intelligence 2012, Cambridge, MA, April 18–20.
Kleeb, R., Gloor, P., & Nemoto, K. (2011). Wikimaps: Dynamic maps of knowledge. Proceedings 3rd international conference on collaborative innovation networks COINs 2011, Basel, Switzerland, September 8–10.
Zhang, X., Fuehres, H., & Gloor, P. (2011). Predicting asset value through twitter buzz. In J. Altmann, U. Baumoel, & B. Kraemer (Eds.), Proceedings 2nd symposium on collective intelligence Collin 2011, Seoul, June 9–10, Springer Advances in Intelligent and Soft Computing, vol. 112.
Gloor, P., Grippa, F., Borgert, A., Colletti, R., Dellal, G., Margolis, P., & Seid, M. (2011). Towards growing a COIN in a medical research community. Procedia Social and Behavioral Sciences (Vol. 26). Proceedings COINs 2010, collaborative innovations networks conference, Savannah GA, October 7–9, 2010.
Zhang, X., Fuehres, H., & Gloor, P. (2010). Predicting stock market indicators through twitter: “I hope it is not as bad as I fear.” Procedia Social and Behavioral Sciences, 26, 2011. Collaborative Innovations Networks Conference, Savannah GA, October 7–9, 2010.
Doshi, L., Krauss, J., Nann, S., & Gloor, P. (2009). Predicting movie prices through dynamic social network analysis. Proceedings COINs 2009, Collaborative innovations networks conference, Savannah GA, October 8–11.
Gloor, P., Krauss, J., Nann, S., Fischbach, K., & Schoder, D. (2009). Web Science 2.0: Identifying trends through semantic social network analysis. IEEE conference on social computing (SocialCom-09), Vancouver, August 29–31.
Krauss, J., Nann, S., Simon, D., Fischbach, K., & Gloor, P. (2008). Predicting movie success and academy awards through sentiment and social network analysis. Proceedings of European conference on information systems (ECIS), Galway, Ireland, June 9–11.
Gloor, P. (2007). Coolhunting for trends on the Web (invited paper). Proceedings of IEEE 2007 international symposium on collaborative technologies and systems, Orlando, May 21–25.
MAIN LESSONS LEARNED
• Coolhunting with Condor measures and tracks the importance of a brand on the Internet.
• Coolhunting for a brand consists of identifying the context of a brand, in particular, its competitors.
• It tracks the relative strength of the brand and its competitors through degree-of-separation search, constructing a bipartite graph and measuring betweenness of the search terms.
• It also identifies the brand’s associated influencers, ranking them by their impact.
• This analysis can be done globally or be restricted by geography or by demographic subgroups.
13 COOLHUNTING — FRANCOGEDDON
CHAPTER CONTENTS
• “Francogeddon”: the breaking of the link between Euro and Swiss Franc by the Swiss National Bank on January 15, 2015
• The Web and Twitter reflect the sentiment of the market in response to “Francogeddon”
• The six honest signals of collaboration for “Swiss Franc,” “Euro,” and “USD” on Twitter track the influence of “Francogeddon” on the three currencies
On January 15, 2015, financial markets were in turmoil. In a surprise move, later termed Francogeddon, the Swiss National Bank removed the artificial exchange rate of 1.20 Swiss Francs to the Euro, which it had set and defended by buying massive amounts of Euros and Dollars since September 6, 2011. Within hours the exchange rate between Euro and Swiss Franc swung from 1.20 Francs per Euro to 0.95 Francs per Euro, leading to massive losses at stock markets around the world and forcing some hedge funds into insolvency. Such an unexpected event in the financial markets offers a unique natural experiment to measure global consciousness of financial markets.
Using Condor, we collected the most recent 12,000 tweets containing the string “Swiss Franc,” as well as another 12,000 tweets each containing “Euro” and “USD,” on January 18, when Francogeddon was still a major issue and currencies were still fluctuating wildly. We repeated the data collection at two later points in time, on February 3 and February 6, 2015, when Francogeddon was over and things had settled down. This nine-part dataset allows us to compare a moment of high public consciousness, when Francogeddon was at the top of the minds of everybody involved in currency trading, with a baseline of two later points in time when the event was over and public consciousness for this topic was low again.
The nine charts in Figure 27 illustrate the activity of the tweeters on these three days. While the tweet activity about Euro and USD is about the same on all three sampling days (20–30 tweets per minute), tweet activity for Swiss Franc is about 200 tweets per hour on January 18, dropping to 50 tweets per hour on February 3 and 6.
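Activity rates such as “tweets per hour” can be derived from the tweet timestamps by counting tweets per clock hour. The sketch below assumes timestamps are available as ISO 8601 strings; the three timestamps are made up and only show the mechanics.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count tweets in each clock hour; timestamps are ISO 8601 strings."""
    buckets = Counter()
    for stamp in timestamps:
        moment = datetime.fromisoformat(stamp)
        # Truncate to the start of the hour so tweets in the same hour share a key.
        buckets[moment.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(buckets.items()))


# Made-up timestamps for illustration.
stamps = ["2015-01-18T10:05:00", "2015-01-18T10:40:00", "2015-01-18T11:15:00"]
rates = tweets_per_hour(stamps)
for hour, count in rates.items():
    print(hour.isoformat(), count)
```

Because each collection is capped at 12,000 tweets, a busier search term covers a shorter time span, so rates per hour or per minute are a fairer basis for comparison than raw counts.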
Figure 27: Twitter Activity after January 18, 2015 for Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Figure 28 shows the network structure of the three currency Twitter networks on January 18 and February 6. Each node is a person tweeting; a link is added between two nodes if one person is mentioned in the other person’s tweet or if one person retweets the other. As Figure 28 illustrates, the tweets about Swiss Franc on January 18 form a large connected component. The Euro network (which was more influenced by the Swiss Franc) shows a somewhat smaller connected component, while the USD tweet network is barely connected, which tells us that the tweeters have little to do with each other. On February 6, all three tweet networks have similar structures of mostly unconnected tweets, with the Euro still showing a somewhat larger connected component.
The six word clouds depicted in Figure 29 show what people are tweeting about. While the sentiment about the Swiss Franc on January 18 is overwhelmingly negative (the darker the red of a keyword, the more negative its context), it is somewhat negative for the Euro tweets, and almost exclusively positive for the USD. The Swiss Franc tweets on February 6 are becoming more positive, but are still mostly negative, as a lot of people in Eastern Europe, particularly in Poland, but also in Romania and Austria, complain about having taken out mortgages in Swiss Francs, which have now ballooned against their local currency. A look at the USD tweets on both January 18 and February 6 shows that they mostly consist of retweets of items auctioned on eBay. This illustrates that the US tweeters don’t care much about Francogeddon. Tweets about the Euro are somewhat negative, but the concerns, which are growing on February 6, are more about Draghi and the possible Grexit, that is, the exit of Greece from the Eurozone.
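The network construction described above (a link whenever one tweeter mentions or retweets another) can be sketched from raw tweet text. The example assumes tweets are given as (author, text) pairs and that retweets carry the conventional “RT @user” prefix; the tweets themselves are made up.

```python
import re

MENTION = re.compile(r"@(\w+)")


def tweet_edges(tweets):
    """Return (author, mentioned user) pairs.

    Retweets of the form "RT @user: ..." are covered as well, because the
    retweeted account appears in the text as a mention.
    """
    edges = set()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            if mentioned.lower() != author.lower():  # ignore self-mentions
                edges.add((author, mentioned))
    return edges


# Made-up tweets for illustration.
tweets = [
    ("alice", "RT @bob: SNB drops the cap on the Swiss Franc"),
    ("carol", "Swiss Franc soars, say @bob and @alice"),
    ("dave", "Selling a watch on eBay, price in USD"),
]
edges = tweet_edges(tweets)
print(sorted(edges))
```

The last tweet produces no link at all, which is the pattern behind the barely connected USD network: many tweets, but few people referring to each other.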
Figure 28: Twitter Network Structure on January 18 and February 6, 2015 for Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Figure 29: Word Clouds of Tweets on January 18 and February 6, 2015 for Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Let us now calculate the six honest signals of communication for the nine datasets:
(1) Group betweenness centrality (how centralized the tweet networks are).
(2) Oscillation in group betweenness centrality (how much the centrality of individual tweeters in the network changes over time, measured in 15-minute intervals).
(3) Average weighted variance in contribution index, that is, how much individual tweeters are being retweeted over time.
(4) Average response time (ART) and nudges, which tell how long it takes for a tweet to be retweeted and whether people are mutually retweeting each other.
(5) Sentiment and emotionality, which show how positive or negative the tweets are.
(6) Complexity of language.
The charts below illustrate the changes over the three points in time in emotionality (Figure 30), ART (Figure 31), and the number of nudges per tweeter (Figure 32). For example, the response time (ART) drops considerably for USD from January 18 (day 1) to February 6 (day 3), while it goes up for Swiss Francs. This means things are cooling down for tweets about Swiss Francs, and it takes more time until they are retweeted.
Figure 30: Average Emotionality of Tweets Containing Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Figure 31: Average Response Time (ART) of Tweeters Using Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Figure 32: Average Number of Nudges (Retweets) of Tweeters Using Search Strings, “Swiss Franc,” “USD,” and “Euro.”
Comparing the six honest signals of communication for the three currencies using the Mann-Whitney U-test, we see that even for this small sample, tweeting behavior about Swiss Franc differs from tweeting about Euro and USD with regard to the number of nudges as well as the variance between nudges until one tweeter responds to another tweeter (p = 0.024). To put this in other words: comparing the three Twitter networks about the three currencies over three points in time, there seems to be a higher global consciousness among people tweeting about Swiss Franc than among people tweeting about Euro and USD. Is this perhaps a glimpse of the global consciousness of currency traders related to Francogeddon?
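For readers unfamiliar with the Mann-Whitney U-test, the sketch below computes the U statistic and an exact two-sided p-value by enumerating all relabelings, which is feasible for samples this small. The nudge values are hypothetical. With three observations in one group and six in the other, complete separation of the groups gives an exact p of 2/84, about 0.024; this is consistent with the value reported above, although the text does not state exactly how the test was set up.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj; ties count one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_two_sided_p(x, y):
    """Exact permutation p-value: share of relabelings whose U lies at least as
    far from the null mean n1 * n2 / 2 as the observed U."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    observed = abs(mann_whitney_u(x, y) - mean_u)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(group_x, group_y) - mean_u) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical nudge values: three Swiss Franc datasets versus
# six Euro and USD datasets (one value per currency and sampling day).
swiss_franc = [4.1, 3.8, 4.5]
euro_and_usd = [1.2, 1.9, 2.2, 1.5, 2.8, 2.0]

u = mann_whitney_u(swiss_franc, euro_and_usd)
p = exact_two_sided_p(swiss_franc, euro_and_usd)
print(f"U = {u}, exact two-sided p = {p:.3f}")
```

The example also shows the limit of such a small sample: 0.024 is the smallest two-sided p-value that a three-against-six comparison can produce, so the result should be read as suggestive rather than conclusive.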
13.1. FOLLOW-ON EXERCISES
1. Do a Coolhunting today for USD, EUR, and CHF, collecting 12,000 tweets for each of the three symbols, and compare with the social network and the word cloud from January 2015.
2. Calculate the six honest signals of collaboration for your Coolhunting data for CHF, EUR, and USD and compare with the six honest signals from January 2015.
3. Who are the most influential people and websites for CHF, EUR, and USD in your new data?
4. Collect the Twitter and Wikipedia data for CHF, EUR, and USD for one month and correlate with the exchange rate for these three currencies, comparing the
   (a) number of tweets about each currency;
   (b) sentiment and emotionality of tweets of each currency;
   (c) ART and nudges for each currency.
   Which of the three time series results in the highest correlation with the actual exchange rate?
MAIN LESSONS LEARNED
• Coolhunting on the Web and on Twitter measures global awareness during “Francogeddon.”
• “Francogeddon” happened when, on January 15, 2015, the Swiss National Bank unexpectedly removed the link between Euro and Swiss Franc, leading to huge global currency fluctuations and the bankruptcy of some hedge funds.
• The global sentiment of those events is analyzed through tweets about “Swiss Franc,” “Euro,” and “USD,” using the six honest signals to compare the impact of Francogeddon on the three currencies.
14 COOLHUNTING THE US PRESIDENTIAL ELECTIONS
CHAPTER CONTENTS
• Online social media provides a microscope into the inner workings of the US presidential elections
• The 2015/2016 Bernie Sanders campaign: a prime example of swarm-based COIN leadership
• Comparing sentiment, demographics, and popularity of four candidates: Donald Trump, Jeb Bush, Hillary Clinton, Bernie Sanders
• “Tribefinders” categorizes supporters of Donald Trump and Bernie Sanders through their tweets
The 2016 US Presidential election provides an excellent opportunity to study Coolhunting and Coolfarming. US Presidential elections are fought to a large extent on social media, with the candidates flooding the Internet with their tweets, video messages, and Instagram pictures. The contrasting styles of the candidates also offer a prime example of the difference between COIN-based and hierarchical leadership styles. We start the Coolhunting using Google Trends (the query below was made on March 5, 2016, eight months before the elections).
Image 51
Google Trends tells us that Donald Trump trumps the other candidates in the number of Google searches by an order of magnitude. Bernie Sanders is in second place and Hillary Clinton in third place, each generating significantly fewer Google searches than Trump. Marco Rubio is in fourth place, with the lowest level of interest.
Table 23: Coolhunting Overview Results for US 2016 Presidential Elections.

| | Key Topics | Key People | Key Websites |
|---|---|---|---|
| Experts (Web) | Hillary, Bernie, Trump, Rubio | Jon Stewart, Robert Reich, Paul Krugman | www.huffingtonpost.com, www.politico.com |
| Swarm (Wikipedia) | Democratic Party, Republican Party, US Presidential Election 2016 | Vermonty_Python, IrrationalTsunami | Reddit.com, FeelTheBern.com |
| Crowd (Twitter) | #USElection, #MakeAmericaGreatAgain, #FeelTheBern | RealDonaldTrump, HillaryClinton, BernieSanders, RFSchatten, Libertea2012 | http://t.co/tPiqUzQ0pZ, https://about.me/collaborateforrights, http://t.co/QN6DgkANlr, http://www.zeustechnologies.com |
Table 23 shows the summary of the subsequent Coolhunting with Condor, described in detail in Section 14.2. Key topics on the Web are the Google search terms for the candidates: “Hillary,” “Bernie,” “Trump,” and “Rubio.” The key people talking about the candidates are pundits, commentators, and talk show hosts: Jon Stewart, Robert Reich, and NYT commentator Paul Krugman. The most influential websites are the Huffington Post, owned by AOL, and Politico.
Key topics on Wikipedia are the three Wikipedia pages for the Democratic Party, the Republican Party, and the 2016 US Presidential Election. The main swarm, the intrinsically motivated people, is active on Reddit; the two most active people on the Bernie Sanders forum on Reddit are Vermonty_Python and IrrationalTsunami. The website for Bernie Sanders is another product of the swarm: FeelTheBern.com was created by volunteers without initial financial backing from the Bernie Sanders campaign.
For the crowd, on Twitter, the most important hashtags about the election are the general tag #USElection, Donald Trump’s #MakeAmericaGreatAgain, and Bernie Sanders’ #FeelTheBern. The most central and influential twitterers are Donald Trump, Hillary Clinton, and Bernie Sanders, followed by two politically active volunteers, RFSchatten and Libertea2012. The most central websites on Twitter at the time of the analysis in September 2015 were the personal page of a human rights activist at about.me/collaborateforrights and zeustechnologies.com, the site of a Web marketing agency in the United Kingdom.
In the next section, we look at a Coolfarming example of how a COIN operates: the way the Bernie Sanders campaign is leveraging the Web and intrinsically motivated, self-organizing volunteers to run a highly efficient campaign. The subsequent section will compare the Web activities of two Republican and two Democratic campaigns through Coolhunting.
14.1. BERNIE SANDERS’ PRESIDENTIAL CAMPAIGN: THE PERFECT COIN
The way Bernie Sanders’ campaign to become the next President of the United States unfolded is a great example of COINs at work, very different from the hierarchical style of his opponent at the right end of the spectrum, Donald Trump. For a start, the entire progress of the campaign is documented online, on Reddit at https://www.reddit.com/r/SandersForPresident, from its humble beginnings up to when Bernie Sanders conceded defeat to Hillary Clinton in July 2016. In December 2013, the Reddit forum SandersForPresident was started, and on April 30, 2015, on the same Reddit forum, Sanders announced his candidacy:

Reddit - I am running for President of the United States, and seeking the Democratic nomination. I need you to stand with me and organize an unprecedented grass-roots campaign. Are you in? - B

In true COIN fashion, three people formed the original Reddit COIN: the users with the Reddit screen names Vermonty_Python, IrrationalTsunami, and scriggities created and moderated the SandersForPresident forum. Making excellent use of social media, Sanders became a heavy user of Reddit, Twitter, and Facebook pages. The reason why he has been resonating so much on online social media is that he has been very consistent in his message for the last 30 years. During his campaign he closed in on Hillary Clinton, the frontrunner for the Democratic presidential nomination, as a close second. For instance, in September 2015 Sanders was leading in the critical early voting state New Hampshire and was a close second in Iowa.

The hundreds of thousands of people on Reddit, Facebook, and Twitter form a perfect Collaborative Learning Network (CLN) learning about Sanders’ viewpoint. Some of them even self-organized their own COINs to further Sanders’ cause. For instance, Sanders succeeded in tapping into the Web savvy of young IT professionals, with whom his message of Northern European style social democracy resonated very well.
Jumpstarted by a young IT professional in NYC, hundreds of software developers volunteered their time, energy, and creativity to create all sorts of social media apps, websites, and idea tracking tools. Under the title “A legion of tech volunteers are leading a charge for Bernie Sanders,” the NYT describes[1] how this group created the website “FeelTheBern.org” to showcase Bernie Sanders’ position on key issues. They coordinated their work using the communication tool Slack, moonlighting and contributing their skills for free to create interactive maps, donation collection apps, and grassroots organizing tools. This archetypical COIN is there for all to study through Reddit, Twitter, Facebook, and the Web.

[1] http://www.nytimes.com/2015/09/04/us/politics/bernie-sanders-presidential-campaign-tech-supporters.html
14.2. COOLHUNTING BERNIE SANDERS, HILLARY CLINTON, JEB BUSH, AND DONALD TRUMP

In this section, we compare the social media footprint in fall 2015 of the campaign of Bernie Sanders with that of his counterpart on the right of the political spectrum, Donald Trump, and contrast both with their more established competitors Hillary Clinton and Jeb Bush. In a nutshell, in September 2015, the two outsiders Sanders and Trump were sharing the spotlight, while the two candidates of the establishment, Hillary Clinton and Jeb Bush, were badly trailing not just in the polls, but also on social media. We start by comparing the global Twitter footprint of the four candidates. On September 6, 2015, I collected the last 4000 tweets about “Bernie Sanders,” “Hillary Clinton,” “Donald Trump,” and “Jeb Bush.” I also collected an additional 4000 tweets on their most popular hashtags #feelthebern, #Hillary2016, #makeAmericaGreatAgain, and #jeb2016.
Image 52
(For color pictures see online version of images, available at http://www.ickn.org/sociometrics/)
As the curve in image 52 shows, the messages are quite emotional, and just above the positivity line. There are also quite a few people around the world tweeting about Sanders. And, quite importantly, the tweets are about Sanders. The next image shows the tweets about Hillary.
Image 53
Unlike Bernie Sanders, Hillary, while also having a global presence, is strongly dominated in the United States by topics other than herself. Besides Sanders showing up in tweets about Hillary, “unitedblue,” a grassroots campaign against SuperPACs (organizations sponsored by wealthy individuals circumventing US election sponsoring restrictions), is also quite prominent.
Image 54
In true celebrity fashion, Donald Trump succeeded in making himself the topic of his own campaign. However, all the positivity of his campaign comes from outside the United States, while the sentiment of his US tweets is very negative, strongly influenced by his attacks against the Latino minority and illegal immigrants in the United States.
Image 55
Jeb Bush is in the weakest position of the four candidates. Even in Jeb’s own Twitter feed Donald Trump features prominently, and overall his tweets are scattered and not very positive. (So his early dropping out of the presidential race was no surprise, and it was predicted by these Coolhunting results.) The next image shows the tag clouds of the tweets of the four candidates.
Image 56
Hillary and Jeb are dominated by their opponents. Trump’s feud with the Hispanic immigrants (top left) shows up prominently. There is hope for Hillary (bottom right), because she shows up in all four word clouds. The outlook for Jeb Bush, however, is not good (bottom left). Even in his own tag cloud, his greatest asset is not his own achievements, but his family name. Bernie Sanders’ most prominent tag (top right) is his grassroots campaign “feelthebern.”

Next, we look at the importance of the four candidates on Twitter. The next image shows the tweets for each candidate in a different color. It also measures the betweenness centrality of each candidate, drawing a line to the search terms “Bernie Sanders,” “Hillary Clinton,” “Donald Trump,” and “Jeb Bush” and #feelthebern, #Hillary2016, #makeAmericaGreatAgain, and #jeb2016.

Image 57
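For readers who want to see what the betweenness centrality measure computes, the following is a minimal Python sketch using Brandes’ algorithm on an undirected, unweighted graph. It is not Condor’s implementation; the toy edge list and the tweeter names are hypothetical, and only the two search terms come from the text.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of undirected edges into a node -> neighbours mapping."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def betweenness_centrality(adjacency):
    """Unnormalized betweenness (Brandes) for an undirected, unweighted graph."""
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair of nodes was counted once from each endpoint.
    return {node: value / 2 for node, value in centrality.items()}

# Hypothetical toy network: tweeters linked to the terms they tweeted about.
edges = [
    ("alice", "Donald Trump"),
    ("bob", "Donald Trump"),
    ("carol", "Donald Trump"),
    ("carol", "#feelthebern"),
    ("dave", "#feelthebern"),
]
scores = betweenness_centrality(build_adjacency(edges))
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, score)
```

In this toy network the search term that connects the most tweeters ends up with the highest score, which is the sense in which a candidate can be “most central.”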
Owing to his celebrity status, Donald Trump has the highest centrality in the Twittersphere. However, Hillary’s hashtag #hillary2016 is the most prominent. We repeat the analysis, this time filtering out all negative tweets, only keeping the ones that the automatic sentiment analysis function of Condor categorizes as above 0.5, that is, having positive sentiment. Condor automatically recognizes sentiment in English, Spanish, German, French, Italian, and Portuguese.
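The filtering step can be illustrated with a short Python sketch. It assumes each tweet already carries a sentiment score between 0 and 1, as the 0.5 threshold in the text suggests; the records below are invented, and Condor’s own sentiment function is not reproduced here.

```python
POSITIVE_THRESHOLD = 0.5  # from the text: scores above 0.5 count as positive

# Hypothetical records: (tweeter, search term or hashtag, sentiment in [0, 1]).
tweets = [
    ("alice", "#feelthebern", 0.82),
    ("bob", "Donald Trump", 0.21),
    ("carol", "#Hillary2016", 0.64),
    ("dave", "Jeb Bush", 0.35),
    ("erin", "#feelthebern", 0.71),
    ("bob", "#jeb2016", 0.48),
]

def positive_only(records, threshold=POSITIVE_THRESHOLD):
    """Keep only the tweets whose sentiment is above the threshold."""
    return [record for record in records if record[2] > threshold]

def nodes(records):
    """All network nodes (tweeters and terms) that appear in the records."""
    result = set()
    for tweeter, term, _ in records:
        result.update((tweeter, term))
    return result

positive = positive_only(tweets)
print(len(nodes(tweets)), "nodes before filtering")
print(len(nodes(positive)), "nodes after filtering")
```

Because every removed tweet also removes an edge, nodes that were only reachable through negative or neutral tweets disappear from the filtered network.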
Image 58
The first thing we notice in the new chart is that the network has far fewer nodes (i.e., people tweeting) and the structure falls apart. Jeb Bush drops out, and among the other three candidates, Sanders’ hashtag #FeelTheBern becomes the most central. Next, we use Condor’s influencer function to calculate the most influential twitterers for each candidate. Condor looks at word usage among twitterers to identify how ideas spread from one actor to the next. If somebody introduces a new word that is picked up quickly by others, it makes her or him influential.
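The idea of “introducing a word that others pick up quickly” can be sketched in a few lines of Python. This is only an illustration of the principle, not Condor’s influencer function: the one-hour adoption window, the scoring rule (one point per distinct adopter), and the sample tweets are all assumptions.

```python
from collections import defaultdict

def influence_scores(tweets, window=3600):
    """Score authors by how many others reuse their new words within `window` seconds.

    tweets: iterable of (timestamp_seconds, author, text).
    """
    first_use = {}               # word -> (timestamp, author) of its first use
    adopters = defaultdict(set)  # word -> other authors who reused it in time
    for timestamp, author, text in sorted(tweets):
        for word in set(text.lower().split()):
            if word not in first_use:
                first_use[word] = (timestamp, author)
            else:
                start, introducer = first_use[word]
                if author != introducer and timestamp - start <= window:
                    adopters[word].add(author)
    scores = defaultdict(int)
    for word, authors in adopters.items():
        scores[first_use[word][1]] += len(authors)
    return dict(scores)

# Hypothetical tweets: (seconds since start, author, text).
sample = [
    (0, "ann", "berniacs unite"),
    (600, "ben", "proud berniacs here"),
    (1200, "cat", "berniacs everywhere"),
    (9000, "dan", "unite now"),
]
print(influence_scores(sample))
```

Here “ann” is credited for the word that two others reuse within the hour, while the late reuse of “unite” falls outside the window and is not counted.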
Image 59
We again see that Bush has very few influencers, while the Sanders group is highly creative, coining their own vernacular, and busily retweeting their new words in their own sphere (the turquoise cluster at the lower left). Finally, we repeat the same analysis on the Web, constructing a degree-of-separation search with Condor. This search identifies the most prominent websites for each candidate, and then constructs the Web link structure between these sites.
Image 60
The input parameters for the Google CSE fetcher have been set for this example to collect only websites that had been updated in the four weeks before September 6, 2015. The picture spells good news for Hillary, as she is most central on these blogs, followed by Donald Trump. Huffington Post, Politico, and Twitter are the most important sites boosting the centrality of these candidates. This predicts by seven months the outcome of the primaries, where Hillary Clinton and Donald Trump were chosen as the official candidates of their parties. In sum, based on this analysis both Hillary Clinton and Bernie Sanders have reason to be optimistic, with Sanders leading the creative and optimistic crowd on Twitter, and Hillary being the strongest brand among the establishment. Donald Trump also commands a strong position, but this might come from his prior celebrity status and confrontational communication style, which provokes furious responses from the people he attacks.
14.3. TRIBEFINDER ON TWITTER (USING MACHINE LEARNING)
Condor has a built-in machine-learning function that allows the user to discover “virtual tribes” (see Section 11.3 for an example using the machine-learning function of Condor with e-mail data). Virtual tribes are groups of people who exhibit similar communication behavior in terms of network structure, communication dynamics, and message content and word usage. To find a tribe, Condor is trained with “exemplary tribe members.” In this example, in a collection of tweets about the 2016 US presidential election, we will identify the tribe of Bernie Sanders supporters and the tribe of Donald Trump supporters. Similarly to the supporters of a politician, we can also identify supporters of brands and products, for instance finding people who prefer Pepsi Cola to Coca Cola, or Android to the iPhone.
A more primitive way of achieving the same goal would be to search for tweets containing “Donald Trump” and then check whether the sentiment of the tweet is positive or negative. However, there is no guarantee that a positive tweet containing the string “Donald Trump” is from a Donald Trump supporter; for example, the tweet “I like Bernie Sanders better than Donald Trump” would be categorized as positive by Condor and also contains the string “Donald Trump.” A better way to identify supporters is to use the machine-learning function of Condor.

As a first step we collect 10,000 tweets about Bernie Sanders and 10,000 tweets about Donald Trump. The Twitter data for Sanders generates a network with 23,484 edges among 16,948 actors, covering the period from 6:42 to 15:19 on April 22, 2016. The chart below shows the sentiment and activity for the tweets about Bernie Sanders.
Image 61
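The weakness of the primitive keyword-plus-sentiment approach described above is easy to reproduce. The Python sketch below uses a deliberately tiny, made-up sentiment lexicon (not Condor’s sentiment analysis) and shows that the example tweet from the text is wrongly flagged as coming from a Trump supporter.

```python
# Toy lexicon, invented for this illustration only.
POSITIVE_WORDS = {"like", "love", "great", "better"}
NEGATIVE_WORDS = {"hate", "awful", "worse"}

def toy_sentiment(text):
    """Number of positive words minus number of negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives

def naive_is_supporter(text, candidate):
    """Primitive rule: the candidate is mentioned and the tweet reads positive."""
    return candidate.lower() in text.lower() and toy_sentiment(text) > 0

tweet = "I like Bernie Sanders better than Donald Trump"
print(naive_is_supporter(tweet, "Donald Trump"))
```

The rule fires because it cannot tell which of the two names the positive words refer to, which is the motivation for learning from behavior instead.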
The same query for “Donald Trump” leads to 9540 actors and 21,466 edges, covering the two hours from 11:00 to 13:00 on April 22, 2016. This tells us that people tweeting about Donald Trump are less connected, but more active, than tweeters about Sanders, as the last 10,000 tweets cover only two hours of tweeting about Trump, a much shorter period than in the case of Sanders. The chart below shows activity and sentiment for Donald Trump.

Image 62
The sentiment of the people tweeting about Donald Trump (the blue line above) seems to be somewhat more positive than that of the tweets about Bernie Sanders (blue line in the chart before).

To find a virtual tribe, Condor uses a three-step process:
1. Locate a smaller subset of known tribe members; they will be the training dataset for the subsequent machine learning.
2. Train the classifier in Condor with the honest signals (called “features” in machine learning) of the tribe members; these features can be variables of network structure, network dynamics, and network contents. Currently, Condor includes decision tree and random forest machine-learning classifiers.
3. Apply the same classifier to the other actors in the dataset, identifying actors with similar features and thereby locating the other tribe members that until now have been hidden in the entire dataset.

Let’s now extract the tribe of “Bernie Sanders” supporters, classifying the Sanders fans based on their honest signals. The first step is to identify a group of known supporters. This can be done in different ways. The most direct, although somewhat tedious, way is to read the self-descriptions in their Twitter profiles. The second method consists of looking at their Twitter names, assuming that anybody with a handle like “NH4bernie” or “veteransforbernie” will be a Sanders fan. The third way is to look at who consistently retweets the tweets of Bernie Sanders; this way they declare themselves Bernie Sanders fans.

The next step is to train Condor’s machine-learning functionality with their behavior. We will calculate all individual networking attributes of the six honest signals of collaboration, as well as the Pennebaker pronouns (see Section 5.2), as our features for the machine-learning step. Next, we will run the random forest classifier, which delivers better results than decision trees in this context. The classifier will then find other Twitter users in the same dataset with similar communication behavior, that is, similar combinations of features.
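The second way of locating known supporters, looking at Twitter handles, can be sketched with a regular expression. The pattern below (a handle ending in “4” or “for” followed by the candidate’s name) is an assumption chosen to match the example handles mentioned in this chapter; real handles will need a more forgiving rule and a manual check.

```python
import re

def handle_pattern(name):
    """Match handles such as "NH4bernie" or "veteransforbernie"."""
    return re.compile(r"(?:4|for)" + re.escape(name) + r"$", re.IGNORECASE)

def find_seed_supporters(handles, name):
    """Return the handles that declare support for `name` by their spelling."""
    pattern = handle_pattern(name)
    return [handle for handle in handles if pattern.search(handle)]

handles = ["NH4bernie", "veteransforbernie", "mexicans4bernie", "FL4trump", "nytpolitics"]
print(find_seed_supporters(handles, "bernie"))
print(find_seed_supporters(handles, "trump"))
```

The resulting handles would serve as the “exemplary tribe members” for training; a news feed such as “nytpolitics” is correctly left out.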
To start the process of identifying the Sanders tribe, we merge the two datasets and remove the two search terms “donald trump” and “bernie sanders” as actors.
We only keep the actors who have sent at least five tweets, reducing our dataset to 448 actors and 508 connections.
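A minimal Python sketch of this activity filter is shown below. The threshold of five tweets comes from the text; the data structures (a list of author and text pairs, and a list of edges) and the sample values are hypothetical.

```python
from collections import Counter

MIN_TWEETS = 5  # threshold stated in the text

def active_actors(tweets, min_tweets=MIN_TWEETS):
    """Authors who sent at least `min_tweets` tweets."""
    counts = Counter(author for author, _ in tweets)
    return {author for author, n in counts.items() if n >= min_tweets}

def filter_edges(edges, keep):
    """Keep only the connections between two retained actors."""
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Hypothetical data: ann sends six tweets, ben one, cat five.
tweets = (
    [("ann", f"tweet {i}") for i in range(6)]
    + [("ben", "one tweet")]
    + [("cat", f"tweet {i}") for i in range(5)]
)
edges = [("ann", "cat"), ("ann", "ben")]

keep = active_actors(tweets)
print(sorted(keep), filter_edges(edges, keep))
```

Dropping low-activity actors also drops their connections, which is why both the actor count and the edge count shrink.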
As the image below shows, the Bernie Sanders tweeters (yellow) are much more connected than the Donald Trump tweeters (green).

Image 63

Next, we annotate the actors with social networking attributes, which will be used as features for the machine learning.
We also calculate the usage of words such as “the,” “and,” “or,” etc., for each actor based on James Pennebaker’s insights about the use of pronouns (see Section 5.2 for a discussion of Pennebaker’s insights). Now we are ready to start the machine learning, using the random forest classifier. First, we identify the actors who are Bernie Sanders supporters, by looking at their names, guessing that people with names like “mexicans4bernie” will be Bernie Sanders fans.
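The per-actor word-usage feature mentioned above could be computed along the following lines. This Python sketch uses a short, illustrative word list; it is an assumption and not the word list Condor or Pennebaker’s work actually uses.

```python
# Illustrative list only: a few articles, conjunctions, and pronouns.
FUNCTION_WORDS = {"the", "and", "or", "a", "of", "i", "we", "you", "they"}

def function_word_rate(texts):
    """Share of an actor's words that are function words (0.0 if no words)."""
    words = [w.strip(".,!?\"'").lower() for text in texts for w in text.split()]
    if not words:
        return 0.0
    return sum(w in FUNCTION_WORDS for w in words) / len(words)

# Hypothetical tweets grouped by actor.
tweets_by_actor = {
    "ann": ["We love the movement", "You and I can win"],
    "ben": ["Rally tonight downtown"],
}
features = {actor: function_word_rate(texts) for actor, texts in tweets_by_actor.items()}
print(features)
```

Each actor ends up with one number that can be added to the networking attributes as an additional feature for the classifier.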
Next, we decide on the features to be used for the training. In the first run we keep all the features, including Twitter attributes such as the number of followers.
Now we run the random forest learner. Using SMOTE, we create 10 times more training records than the original eight known Bernie Sanders fans by changing their features by small random increments, and we undersample the other class to two times the number of Bernie Sanders supporters (see Section 11.3 for a discussion of SMOTE). We end up with 80 virtual Bernie Sanders fans and 160 nonfans.
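The balancing step can be sketched in plain Python. The oversampling below follows the basic SMOTE idea of interpolating between a known fan and one of its nearest fellow fans; it is a simplified stand-in, not Condor’s implementation, and the two-dimensional random feature vectors are invented. The factors 10 and 2 and the counts 8, 80, and 160 come from the text; the 440 other actors follow from the 448 actors kept above.

```python
import random

def smote_like(samples, factor, k=3, seed=0):
    """Create factor * len(samples) synthetic rows by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for row in samples:
        # The k nearest other minority rows by squared Euclidean distance.
        neighbours = sorted(
            (s for s in samples if s is not row),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(row, s)),
        )[:k]
        for _ in range(factor):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position between the row and its neighbour
            synthetic.append([a + gap * (b - a) for a, b in zip(row, neighbour)])
    return synthetic

def undersample(samples, target, seed=0):
    """Randomly keep `target` rows of the majority class."""
    return random.Random(seed).sample(samples, target)

rng = random.Random(42)
fans = [[rng.random(), rng.random()] for _ in range(8)]      # eight known fans
others = [[rng.random(), rng.random()] for _ in range(440)]  # remaining actors

virtual_fans = smote_like(fans, factor=10)
nonfans = undersample(others, target=2 * len(virtual_fans))
print(len(virtual_fans), len(nonfans))  # 80 160
```

The resulting 80 and 160 rows would then be passed, with their class labels, to the random forest learner.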
We find 29 possible Bernie Sanders supporters.
A check among the top matches for additional Bernie fans brings up:
+LisaBeliveau is a Bernie Sanders delegate
+Tthomaslew76 is a social activist
+Liberalllatchr is a Bernie supporter
+Barbos2 is for Bernie
+Ronraj777 is for Bernie
+we3fordemocracy is a New Zealand open democracy tweeter
+debdlund is a female black Democrat; she seems to be a Hillary supporter
+tcooper9999 is a female white Hillary supporter
+aroyaldmd seems to be a female Hillary supporter
+denver_rose is for Hillary.

This means that, among the top 10 hits, the first five are US Bernie fans, the next one is a Bernie fan from abroad, and the next four are Democrats, although for Hillary. The image below sizes the nodes in the combined dataset by their similarity to the original known eight Bernie Sanders supporters.

Image 64
As we can see in the t-test, Bernie Sanders fans (the 35 members of the group “0”) are more active, and they mention each other much more; they also oscillate more (are more creative), are less emotional, but use more complex language.
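To make the comparison concrete, here is a small Python sketch of a two-sample t statistic in Welch’s form, which does not assume equal variances. Whether Condor uses this exact variant is an assumption, and the activity values below are invented. Only the statistic is computed; turning it into a p-value requires the t distribution, for example through a library routine such as SciPy’s `ttest_ind` with `equal_var=False`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical tweets-per-actor values for tribe members and everybody else.
tribe_activity = [12, 15, 11, 14, 13]
other_activity = [6, 7, 5, 8, 6]

print(round(welch_t(tribe_activity, other_activity), 2))
```

A large positive value indicates that the first group scores higher on the feature than the second; the same function can be applied to each feature in turn.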
We now repeat this process with Donald Trump supporters. Trump fans seem less prolific in their tweeting. Therefore, to get enough actors, we include in our analysis the top 3000 actors. We then select the 23 declaring themselves through their Twitter handles as Trump fans.
We then run the random forest learner, again using SMOTE, creating a total of 460 artificial Trump fans. We do not undersample this time, as the risk of removing prolific Trump fans in the undersampling step would be too great.
The Random Forest classifier proposes 879 matches. It is less accurate than for Bernie Sanders, as we had to take a larger sample with fewer tweets per actor for our analysis, to get enough confessing Trump fans included.
Among the top 10 Donald Trump fans:
+Slowdownandlove is a confessing Trump fan
+Jamesspivey is a confessing Trump fan
+Mwbrown51358 is a Republican taking swings at Ted Cruz
Parantokristine is a Bernie Sanders fan
BCLaraby is a Canadian Bernie Sanders fan
Liepardestin is a Bernie Sanders fan
+Luimbe’s tweets are all over the spectrum, but seem to support Trump
+Oldgoatsmell seems to be a Trump fan
(Nytpolitics is the newsfeed of the New York Times)
(AFixhold is an SEO optimizer)
+Trumpeterswin is a Trump fan.

As the image below shows, Trump and Bernie fans are scattered between the two datasets collected with the search terms “bernie sanders” and “donald trump.” The second picture below (image 66) shows the connected component in the center of the larger picture (image 65), colored by the original datasets: the blue dots at left are the actors collected with the query “donald trump,” and the purple dots at right are the actors collected with the query “bernie sanders.” In the larger picture on top (image 65), the turquoise dots are the original 23 confessing Donald Trump fans, and the green dots are potential members of the Trump tribe identified through machine learning with our random forest classifier.

Image 65

Image 66
The t-tests show that the Trump fans (the 899 people of group “0”) have fewer friends on Twitter and tweet less per person than the rest of the people, and that they use fewer pronouns. They also oscillate more but, contrary to the Sanders fans, they use less complex language and are more emotional.
To summarize, the Twitter Tribefinder process consists of four steps:
1. Collect as many tweets as possible about “bernie sanders” and “donald trump.”
2. Merge the two datasets.
3. Find the most prominent Bernie or Donald supporters (e.g., from Twitter handles like “vets4bernie” or “FL4trump”).
4. Train with these supporters to find the other hidden Bernie or Donald fans.
14.4. FOLLOW-ON EXERCISES

1. Collect 10,000 tweets for the search term “Apple.” Find the Apple fans by taking the top 20 retweeters of the official Apple accounts of Tim Cook (@tim_cook), Apple’s CEO, “@applenws,” and “@applesupport.” Find their tribe, and extract its features, comparing them with their peers in the same dataset with a t-test.
2. Collect 10,000 tweets for the search term “Google.” Find the Google fans by taking the top 20 retweeters of the official Google accounts “@Google” and “@googleresearch.” Find their tribe, and extract its features, comparing them with their peers in the same dataset with a t-test.
3. What is the difference between the Apple and the Google fans?
MAIN LESSONS LEARNED

• Internet Coolhunting is well-suited for analyzing and predicting the outcome of political elections.
• Today’s US Presidential elections are fought to a large extent on social media.
• The contrasting styles of the two candidates Bernie Sanders and Donald Trump demonstrate the difference between COIN-based and hierarchical leadership styles.
• The popularity of four candidates early in the race for the US 2016 presidential election is analyzed: Bernie Sanders, Donald Trump, Hillary Clinton, Jeb Bush.
• Using machine learning in “tribefinder,” Condor identifies members of the “Bernie Sanders tribe” and the “Donald Trump tribe” and their communication behavior.
PART III. AUTOMATIC MEDIA INSIGHTS COIN ASSESSMENT (AMICA)

The final part of this book illustrates how Web-based Coolhunting and e-mail-based Coolfarming can be combined to measure and optimize social capital. Just like SAP, Oracle Financials, and Microsoft Dynamics provide a financial capital management system, Automatic Media Insights COIN Assessment (AMICA) provides a social capital management system for individuals and organizations. AMICA is an assessment of individual and group behavior that measures, compares, and optimizes the collective mind of an individual, an organization, or a company. Borrowing two metaphors from medicine, AMICA offers both collaboration diagnosis and collaboration therapy. The AMICA diagnosis identifies which types of communication patterns are indicative of the most efficient and effective collaboration. Monitoring collaboration with AMICA allows individuals and organizations to measure the health of their relationships based on their communication style with friends and colleagues. The AMICA therapy helps individuals to improve their collaborative behavior; it suggests interventions to change individual and organizational collaboration based on their communication patterns, resulting in healthier and more satisfactory relationships.

Based on the principles of social quantum physics introduced in the companion book Swarm Leadership and the Collective Mind: Using Collaborative Innovation Networks to Build a Better Business and briefly mentioned in the introduction chapter, AMICA helps individuals and organizations to build entanglement through empathy, and to reboot and reflect, constantly improving collaboration behavior through self-reflection triggered by a virtual mirror of communication patterns. AMICA provides a complete virtual mirror of an individual’s and an organization’s collaborative conduct, applying Durkheim’s concept of “collective consciousness.”[1] AMICA uses Condor to collect a set of comparative data points for individuals and organizations to benchmark their collaborative performance, comparing it against a normalized benchmark based on the six honest signals of collaboration. AMICA also includes a two-part online survey to assess individual and organizational collaboration with a series of qualitative questions. The goal is to assist individuals and organizations in the formation and growth of COINs, emergent ad hoc structures of intrinsically motivated people getting together to create radically new things.

[1] Durkheim and Swain (2008).

Note that these patterns have been identified in our current research and may not exactly fit your team or organization. More work is needed to test and verify them more widely; in their current form they should be treated as experimental and explorative research. We are still at an early discovery period using these metrics and tools, but have enough confidence in the initial results to share the findings with you in this book.

AMICA consists of four automated analysis modules abbreviated as IMIC, OMIC, IMOC, and OMOC (Table 24), measuring communication of individuals and organizations through their patterns on inside communication archives (e-mail, Skype, online calendars) and outside online social media (Twitter, Wikipedia, Blogs, and Facebook). It is complemented by an individual and an organizational online survey (SIC and SOC). The automated part of AMICA comprises a description of how to calculate the four different metrics in Condor, as well as reference benchmarks and recommendations on how to change the behavior for better collaboration.

The three parts of AMICA focusing on the individual are IMIC, OMIC, and SIC. IMIC measures the collective mind of individuals, automatically analyzing the inside media. It specifies a process to collect the six honest signals of individuals using Condor, based on analyzing their e-mail, calendar, and Skype archive. OMIC measures the communication behavior of individuals on the outside media. It specifies a process to track the position of individuals in the social media, looking at their social network extracted from Twitter, Web (through Google CSE), Wikipedia, and Facebook wall. IMIC and OMIC are complemented by SIC, a survey of individual collaboration behavior according to the key principles of social quantum physics. SIC provides a qualitative approach to measuring collaboration; we hope to identify correlation between SIC scores and the six honest signals of collaboration, but this will only be possible once we have collected enough SIC ratings.
Table 24: The Six Different Parts of the AMICA Analysis.

 | Inside Media “Honest Signals” | Outside Media “Honest Signals” | Collaboration Readiness Assessment
Individual Collaboration Assessment | (1) IMIC Inside Media Individual Collaboration. Individual message archive (e-mail, calendar, Skype,…), star or galaxy. Condor | (2) OMIC Outside Media Individual Collaboration. Social Media (Twitter, Web, Wikipedia…) presence & perception of the individual. Condor | (3) SIC Survey of Individual Collaboration. Measuring collective intelligence and collaborative capabilities. Online Survey
Organizational Collaboration Assessment | (4) IMOC Inside Media Organizational Collaboration. Organizational messaging archive (e-mail, calendar, phone,…). Condor | (5) OMOC Outside Media Organizational Collaboration. Social Media awareness and presence of company, brand, and products on Twitter, Wikipedia, Blogs,…. Condor | (6) SOC Survey of Organizational Collaboration. Measuring collective consciousness and collective creativity of organizations. Online Survey
AMICA also includes three corresponding assessments for measuring the collective mind of organizations. IMOC automatically analyzes the inside media from the perspective of the organization. It specifies a process to collect the six honest signals of organizations using Condor, based on analyzing their e-mail, calendar, and Skype archive. It also provides benchmarks to interpret the six honest signals of collaboration of organizations, comparing them against a database of different organizations from different cultures, which is still under development; currently we have results from the United States, Australia, India, Switzerland, Germany, and Finland. OMOC automatically analyzes the outside media from the organizational perspective. OMOC specifies a process in Condor to track the social network position of organizations in social media, looking at the strength of company names, brands, and products, extracted from Twitter, Web (through Google CSE), Wikipedia, and corporate Facebook walls. Finally, AMICA also provides SOC, an organizational online survey that assesses, through a series of questions, the collective consciousness and collaborative creativity of organizations. The survey questions are to be answered on a Likert-type scale. Just like SIC, SOC provides a qualitative approach to measuring collaboration; we hope to identify correlation between SOC scores and the six honest signals of collaboration, but this will only be possible once we have collected enough SOC ratings.
15 INSIDE MEDIA INDIVIDUAL COLLABORATION (IMIC)
IMIC measures collaboration behavior of individuals inside the organization, based on their e-mail, Skype, and calendar archives.
| Individual Collaboration Characteristics | Operationalization in Condor | Diagnosis: Indication of High Collaborators | Therapy: What You Can Do to Improve Collaboration |
| --- | --- | --- | --- |
| Open communication | Degree centrality of e-mail | Connecting to many people can be an indicator of openness | Reach out to new people outside the core team |
| Frequent communication | Number of e-mail messages sent | Sending more messages than receiving can be an indicator of proactive behavior | Be proactive and responsive, but conscious of not flooding others with messages |
| Group flow | AWVCI (average-weighted variance in contribution index) | Having a low variance in sending and receiving among group members means having a shared culture | Integrate all group members into information exchange, encourage passives to participate more actively, get spammers to send less |
| Creativity | Betweenness centrality oscillation | Repeatedly changing individual networking position from central to peripheral may be an indicator of creativity | Empower others by delegating, and by rotating between a central position of responsibility and letting others lead |
| Passion | Ego ART (average response time of individual) to e-mail | A person who is responsive and answers quickly shows more passion | Be responsive to everybody, independent of status and prestige |
| Respect | Alter ART (average response time of everybody else to an individual) to e-mail | An individual who elicits fast response from others is highly respected | Treat others with respect, then they will treat you respectfully too |
| Emotionality | Emotionality is defined as standard deviation of positive and negative sentiments | Saying what is good as well as what could be better is a sign of a high-functioning community | Be more honest, and refrain from using overly positive language |
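The two response-time rows of the table (ego ART and alter ART) can be illustrated with a small sketch. It is not Condor's implementation: the message tuples, the names, and the `reply_to` field are hypothetical, and replies are assumed to be already matched to the message they answer.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical toy e-mail log (an assumption, not Condor's data model):
# (message id, sender, recipient, sent time, id of the message replied to)
messages = [
    ("m1", "andrew", "jacob", datetime(2016, 6, 1, 9, 0), None),
    ("m2", "jacob", "andrew", datetime(2016, 6, 1, 9, 30), "m1"),
    ("m3", "charles", "andrew", datetime(2016, 6, 1, 10, 0), None),
    ("m4", "andrew", "charles", datetime(2016, 6, 1, 12, 0), "m3"),
]


def response_times(log):
    """Return (ego_art, alter_art): average response times in hours per person."""
    by_id = {m[0]: m for m in log}
    ego = defaultdict(list)    # how long a person takes to answer others
    alter = defaultdict(list)  # how long others take to answer a person
    for msg_id, sender, recipient, sent, reply_to in log:
        if reply_to is None or reply_to not in by_id:
            continue  # not a reply, or the original is outside the archive
        original = by_id[reply_to]
        hours = (sent - original[3]).total_seconds() / 3600
        ego[sender].append(hours)
        alter[original[1]].append(hours)
    ego_art = {person: mean(v) for person, v in ego.items()}
    alter_art = {person: mean(v) for person, v in alter.items()}
    return ego_art, alter_art


ego_art, alter_art = response_times(messages)
print("ego ART:", ego_art)
print("alter ART:", alter_art)
```

In this toy log Jacob answers Andrew within half an hour, so Jacob has a low ego ART (passion) and Andrew a low alter ART (respect).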
To calculate the main scores, this example uses e-mail, which can be replaced by Skype or calendar data if e-mail is not readily available. For the drill-down, Skype and calendar networks can easily be integrated if these archives are accessible. Skype data will include content from any chat conducted in parallel to the call; calendar data will include content from the entries and comments in the calendar.
The radar chart below gives a comparative overview of the seven individual collaboration characteristics for three individuals working at the same company, comparing their IMIC metrics against each other. The example illustrates that the scores for the seven characteristics depend very much on the role and function of the individual: the three employees hold very different roles at the company. Andrew, the SVP for products, has a more outside-focused role, which leads to a communication profile much higher in communication frequency than those of IT technician Jacob and software developer Charles. Andrew is very popular with his customers, as he is high in passion for his job. Nevertheless, his customers respond to him much more slowly than the internal IT developers respond to their colleague Charles, who is highly respected inside his department; this gives Andrew a lower respect score. On the other hand, Andrew scores high on flexibility and adaptability.
Image 67 (for color pictures see the online version of the images, available at http://www.ickn.org/sociometrics/)
Drilling down in Condor shows the different social network structures for Jacob and Andrew. Using Condor to filter the top 20 actors by betweenness (proxy for importance) and then sizing them by betweenness oscillation (proxy for creativity) illustrates in the social network image below that while Andrew is communicating as the center of a galaxy, Jacob has a star network. In Andrew’s network, Bob is the most creative person.
Images 68 and 69
The same IMIC representation can be used to track the change of a single person in the seven personality characteristics over time. The image below compares my own communication behavior between April and June 2016. It shows that while in April I was communicating more (measured as frequency of communication), it was always with the same people: only in May (the red line) does the openness of communication with many different people start to increase, reaching its peak in June. On the other hand, my shared vision and emotionality pattern is quite consistent over all three months. Passion is highest in May, while respect peaks in April and June. Flexibility and adaptability show a low point in April, when I was single-mindedly focused on organizing a workshop at MIT.
Image 70
A drill-down of my network shows the top 30 people of my mailbox in April and May 2016. The size of the nodes shows the most respected people, that is, the ones to whom others responded the fastest in April and May. We currently rely on Exchange or IMAP to adjust for time zone differences; note, however, that the automatic time zone translation systems are far from perfect, in particular if legacy e-mail data is directly imported, for example from Novell or Lotus Notes.
Images 71 and 72
I also analyze my Skype network from 2010 to 2015; the first chart shows my overall activity in number of calls per day. There were some group calls where I was conducting meetings with my students; thus, I was in contact with up to 100 participants in June 2011.
The image below shows my full Skype network over the entire duration from 2010 to 2015.
In the next step, removing the star in the center (myself) from the Skype call network creates a much clearer picture, as now my colleagues and collaborators stand out distinctly. The nodes of both networks are sized by betweenness centrality oscillation, which is a proxy for creativity.
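A minimal sketch of how betweenness centrality oscillation could be counted, assuming it is operationalized as the number of turning points (local maxima and minima) in a person's betweenness curve over time. The weekly values and the function name are hypothetical; Condor's own calculation and windowing may differ.

```python
def oscillation(series):
    """Count turning points (local maxima and minima) in a time series."""
    # Differences between consecutive periods, ignoring flat stretches.
    deltas = [b - a for a, b in zip(series, series[1:]) if b != a]
    # A turning point is a change of direction between consecutive differences.
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0))


# Hypothetical weekly betweenness values for two actors.
betweenness = {
    "bob": [0.10, 0.40, 0.05, 0.35, 0.02],    # swings between center and periphery
    "jacob": [0.10, 0.12, 0.15, 0.18, 0.20],  # steadily rising, no swings
}
scores = {name: oscillation(values) for name, values in betweenness.items()}
print(scores)
```

Under this reading, an actor who repeatedly moves between a central and a peripheral position gets a higher score than one whose position changes only in one direction.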
Besides Skype, the online calendar is also a rich source of information. As the activity chart below illustrates, the number of my meetings started to explode in the second half of 2014 and 2015.
The social network chart below shows the reason: it is mostly because of my participation in healthcare projects. With NICHQ and HRSA, the Health Resources and Services Administration, I participated in the Infant Mortality reduction Collaborative Improvement and Innovation Networks (IM CoIIN); with Cincinnati Children’s Hospital Medical Center (CCHMC), I participated in the Type 1 diabetes (T1D) and cystic fibrosis (CF) projects.
Image 73
15.1. IMIC ANNOTATION PROCESS
These values can easily be calculated in Condor. The first step consists of collecting the mailbox data from the server.
The second step consists of loading the dataset for the desired time interval, and calculating the group values. This step is the same for all four automatic assessments of AMICA, not only IMIC, but also OMIC, IMOC, and OMOC. For example, for analyzing my mails of June 2016, I load the dataset with the following parameters.
Next I compute the group values using Condor’s annotate functions. They are calculated using the centrality annotations (group betweenness and degree centrality), group betweenness oscillation, AWVCI, TurnTaking, and graph density, as well as the “calculate sentiment” function, which will calculate the average sentiment, emotionality, and complexity for all the messages in the dataset (footnote 1).
Footnote 1: In principle, these variables could all be calculated automatically; however, their parameters need to be adjusted for the data being analyzed (e-mail, Skype, calendar, monthly/weekly data collection, snowball sampling or corporate archive, etc.). Once the parameters have been defined, the server version of Condor could be used, which can collect and calculate the AMICA values automatically on a server using a RESTful API.
After having calculated the values, I export them to Excel for comparative analysis, using the “Export-> Export dataset properties” menu. The resulting CSV file can then be loaded into Excel and be visualized using Excel’s radar chart function.
Note that to show them as a nice radar chart in Excel, we need to normalize the values into the interval [0,1] by dividing them by their maximum value. For the values of passion and respect (ego ART and alter ART) we have to invert the normalized values, so that 1 stands for the smallest response time and 0 for the highest, as the lowest response time reflects the highest passion and respect. A sample Excel template is provided on the companion book website (www.ickn.org/sociometrics).
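The normalization step can also be sketched outside Excel. The following is one possible reading of the rule above, not the book's Excel template: ordinary metrics are divided by their maximum, while the response-time metrics are rescaled so that the fastest value maps to 1 and the slowest to 0. The metric names and the raw numbers are hypothetical.

```python
def normalize_for_radar(values, inverted=("ego_art", "alter_art")):
    """Scale metrics to [0, 1]; for inverted metrics the smallest value scores 1."""
    metrics = sorted({metric for row in values.values() for metric in row})
    result = {name: {} for name in values}
    for metric in metrics:
        column = [row[metric] for row in values.values()]
        low, high = min(column), max(column)
        for name, row in values.items():
            if metric in inverted:
                # Response times: fastest -> 1, slowest -> 0.
                score = (high - row[metric]) / (high - low) if high > low else 1.0
            else:
                # All other metrics: divide by the maximum value.
                score = row[metric] / high if high else 0.0
            result[name][metric] = score
    return result


# Hypothetical raw values for three months (response times in hours).
raw = {
    "April": {"messages_sent": 400, "ego_art": 10.0},
    "May": {"messages_sent": 200, "ego_art": 2.0},
    "June": {"messages_sent": 100, "ego_art": 6.0},
}
scores = normalize_for_radar(raw)
for month, row in scores.items():
    print(month, row)
```

The sketch assumes every row contains every metric; the resulting dictionary can be written to CSV and plotted with any radar chart tool.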
16 OUTSIDE MEDIA INDIVIDUAL COLLABORATION (OMIC)
OMIC measures collaboration behavior of individuals as seen from the outside through the lens of online social media such as Twitter, Facebook, Wikipedia, and Google search. OMIC starts with an analysis of an individual’s footprint on Twitter and extends the drill-down exploration with Wikipedia, Facebook wall, and Google Blog search analysis.
| Individual Collaboration Characteristics | Operationalization in Condor on Social Media | Diagnosis: Indication of High Popularity and Influence on Social Media | Therapy: What You Can Do to Improve Your Social Media Footprint |
| --- | --- | --- | --- |
| Activity | Number of tweets collected with EgoFetcher | The more people tweet about a person, the more popular the person is | Be selective in tweets, and crosslist them on Facebook and LinkedIn |
| Central leadership | Group degree centrality of the Twitter network | The less centralized the network, the more different people form their own sub-communities | For a Twitter community, a decentralized network might be desirable |
| Creativity | Oscillation in betweenness centrality of actors | The more diverse the network structure of people tweeting about the person, the more creative the person | Reach out to new people outside the core group by including them in tweets, and add blog content and outside links |
| Passion | Ego ART (average response time of individual to tweets from others) | The faster actors respond to tweets, the more passionate they are about the person originally tweeting | Be responsive; however, only tweet when you have something to say |
| Respect | Alter ART (average response time of everybody else to tweets from original actor) | The faster other people respond to tweets from the original actors, the more the original actors are respected | Add substance to your tweets, and cross-reference blogposts and other interesting content on the Web |
| Complexity | Complexity measures word usage by looking at the diversity of the vocabulary, and its distribution among the different tweets | More complex language means that people are having more diverse and thought-provoking discussions on Twitter | Make good use of the 140 characters, and add links to pictures and videos |
| Sentiment | Sentiment measures the positivity and negativity of tweets using the machine-learning function of Condor | Having a more positive sentiment in Twitter is an indicator of long-term success | Use a fundamentally positive tweeting style, and also point out positive things, instead of complaining |
| Emotionality | Emotionality is defined as standard deviation of positive and negative sentiments in tweets | Being more emotional on Twitter is an indicator of being more engaged and committed | Be honest in your own tweets, and refrain from using overly positive language |
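The difference between the sentiment and emotionality rows can be made concrete with a short sketch. It assumes that each tweet has already been given a sentiment score between 0 (negative) and 1 (positive) by some classifier; the scores, the variable names, and the use of the population standard deviation are assumptions, not Condor's documented behavior.

```python
from statistics import mean, pstdev

# Hypothetical per-tweet sentiment scores in [0, 1] (0 = negative, 1 = positive).
calm_swarm = [0.5, 0.5, 0.5, 0.5]
engaged_swarm = [0.9, 0.1, 0.9, 0.1]


def sentiment_and_emotionality(scores):
    """Return (average sentiment, emotionality as standard deviation)."""
    return mean(scores), pstdev(scores)


calm = sentiment_and_emotionality(calm_swarm)
engaged = sentiment_and_emotionality(engaged_swarm)
print("calm swarm:", calm)
print("engaged swarm:", engaged)
```

Both toy swarms have roughly the same average sentiment, but only the second one, which mixes clearly positive and clearly negative tweets, shows high emotionality.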
The radar chart below compares the social media footprint of Bill Gates, Neil deGrasse Tyson, Paul Krugman, and Arianna Huffington collected on July 12, 2016. On this day, Bill Gates had the highest Twitter activity, followed by Neil deGrasse Tyson. Tyson and Gates were also seen as most creative. In that particular week, there was not much activity around Paul Krugman, which is why his network shows centralized leadership, but his Twitter network is using the most complex language. Arianna Huffington’s network is using the most positive language and is most passionate and respectful, meaning that tweets by her swarm are retweeted quickly, while the members of her swarm are also quick to retweet. In this analysis, we are not just looking at the tweets of Bill Gates, Neil deGrasse Tyson, Paul Krugman, and Arianna Huffington, but also including the importance of the people retweeting them, as well as analyzing the behavior of the people tweeting about them. Note that Twitter reflects the fickleness of the crowd; one week later the values might be quite different.
Image 74
Out of the total 56,284 actors in Bill Gates’ Twitter network on July 12, the image below shows the top 20 actors by betweenness. They are his most influential supporters; Bill Gates himself is one of them. RealDonaldTrump also shows up, as he has a dominating presence in August 2016 in the Twittersphere, being cross-referenced numerous times by other tweeters. The most central Twitter ID, though, is YouTube, thanks to the numerous links to science and other videos tweeted by Bill Gates and his supporters.
Image 75
To further explore the social media presence of a person, we can also do a Wikipedia query and look at the Wikipedia network surrounding a person. The image below compares my own Wikipedia network (note that I do not have a page on Wikipedia) against the network of Bill Gates on Wikipedia, illustrating Bill Gates’ position as a global thought leader, tycoon of IT, and leader of his charitable foundation.
My Facebook wall shows the people who left comments on my Facebook wall. Note that only I as the owner of my Facebook page will be able to download the content of my page into Condor. As the activity chart below illustrates, my Facebook wall is not particularly active from September 2013 to January 2016.
Not surprisingly, I am the most central person on my own Facebook wall; my colleague and Facebook friend Yoshiaki also has left some well-connected comments on my page.
To better understand the network, I remove myself from the network. Yoshiaki now becomes the center of his own galaxy, while many other Facebook friends become much more central. My Facebook friends Mattaeus, Sergey, Andreas Pollak, Azadeh, Jun, and Puja suddenly stand out as gatekeepers of information and posts on my Facebook wall.
16.1. OMIC ANNOTATION PROCESS
The main tool for the online social media analysis is Condor’s Twitter EgoFetcher, which simulates an individual’s social reach on Twitter by factoring in the popularity and importance of the people retweeting the individual’s tweets. The EgoFetcher works in four steps:
(1) It takes the last N (e.g., 10,000) tweets about the search term or Twitter handle (e.g., “Bill Gates”). Note that normally, except for an individual’s own tweets, the search API of Twitter only returns last week’s tweets. If the Twitter search API’s limit of 180 tweets is reached, the EgoFetcher will pause for 15 minutes and then continue fetching. There is no limit on the tweets from an individual’s timeline, so you might be able to get much higher numbers of tweets before hitting the rate limit.
(2) It constructs a network with a link from actor B to actor A if B retweets A or B mentions A in a tweet.
(3) It then takes the timelines of the 480 most influential people in the search results; the influence of people is measured through their degree in the retweet network from step (2). For Twitter users, their timeline is all their tweets, sorted from newest to oldest.
(4) It adds for each tweet collected in the previous steps the first 100 retweets.
This leads to a network that shows the impact and reach of a person in the twittersphere, and is a better proxy of their importance than just the number of followers, which can be gamed or bought. For example, to collect the Twitter Ego network of Bill Gates, we would run the Twitter EgoFetcher with the following settings.
The image below shows the 56,284 actors of the Twitter Ego network for Bill Gates; the dark grey dots are the tweets collected in step (1) above, running a search for the string “Bill Gates” on Twitter. The light grey dots come from the timelines of the top 480 users collected in step (1), and also include the top 100 retweets for each of the tweets — these are the scattergun-like funnels in the periphery of the graph.
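Steps (2) and (3) of the EgoFetcher procedure can be sketched in a few lines. This is only an illustration under stated assumptions, not Condor's code: the tweets are hypothetical tuples rather than Twitter API results, influence is taken to be total degree (incoming plus outgoing links), and the cut-off of 2 stands in for the 480 used by the EgoFetcher.

```python
from collections import defaultdict

# Hypothetical tweets: (author, retweeted author or None, mentioned handles).
tweets = [
    ("alice", "billgates", []),
    ("bob", "billgates", ["youtube"]),
    ("carol", None, ["billgates", "alice"]),
    ("dave", "alice", []),
]


def build_network(tweet_list):
    """Step (2): link B -> A whenever B retweets A or mentions A."""
    edges = set()
    for author, retweeted, mentions in tweet_list:
        targets = set(mentions)
        if retweeted:
            targets.add(retweeted)
        for target in targets:
            if target != author:  # ignore self-references
                edges.add((author, target))
    return edges


def top_by_degree(edges, n):
    """Step (3): rank actors by degree and keep the n most connected."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(degree, key=lambda actor: (-degree[actor], actor))
    return ranked[:n]


edges = build_network(tweets)
top = top_by_degree(edges, 2)
print(sorted(edges))
print(top)
```

The timelines of the actors in `top` would then be fetched in a further step, which is omitted here because it depends on access to the Twitter API.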
After the data collection process has been completed, the annotations are calculated using the same process as described above for IMIC, and exported to Excel, where they are then visualized using the radar chart function of Excel.
17 INSIDE MEDIA ORGANIZATIONAL COLLABORATION (IMOC)
IMOC measures collaborative performance of organizations through the aggregated communication behavior of their individual members conversing inside the organization, based on their e-mail, Skype, and calendar archives. It looks at how departments, business units, and companies communicate as collective entities. It is calculated using the group measures of Condor.
| Organizational Collaboration Characteristics | Operationalization in Condor | Diagnosis: Indication of High Collaborators | Therapy: What You Can Do to Improve Collaboration |
| --- | --- | --- | --- |
| Central leadership | Group degree centrality of e-mail | More centralized leadership might lead to more innovation | Encourage qualified group members to assume leadership roles |
| Group creativity | Oscillation of group betweenness centrality over time | Many individuals changing their networking positions from central to peripheral may be an indicator of organizational creativity | Empower others by delegating, and by rotating between a central position of responsibility and letting others step up |
| Group flow | AWVCI (average-weighted variance in contribution index) | Having a low variance in sending and receiving among group members means having a shared culture | Integrate all group members into information exchange, encourage passives to participate more actively, get spammers to send less |
| Empowerment | Graph density | The more directly connected employees are to others, the more they are empowered | Increase connectivity by encouraging inter-organizational and across-hierarchy communication |
| Satisfaction | Ego ART (Average Response Time of individuals to others) | A community where members are responsive is an indicator of high satisfaction | Try to create a respectful work culture |
| Empathy | Alter ART (Average Response Time of everybody else to individuals) | Answering quickly to others is an indicator of respect and empathy | Be more responsive to everybody, independent of status and prestige |
| Emotionality | Emotionality is defined as standard deviation of positive and negative sentiments | Saying what is good as well as what could be better is a sign of a high-functioning community | Teach members of the organization to be more honest, and refrain from using overly positive language |
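Two of the group measures in the table, graph density (empowerment) and group degree centrality (central leadership), can be sketched on a toy network. The sketch assumes an undirected network without duplicate links and uses Freeman's degree centralization as a stand-in for Condor's group degree centrality; the names and both example networks are hypothetical.

```python
def graph_density(nodes, edges):
    """Share of all possible undirected links that are actually present."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0


def degree_centralization(nodes, edges):
    """Freeman's degree centralization: 1 for a star, 0 for a ring."""
    n = len(nodes)
    if n < 3:
        return 0.0
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    top = max(degree.values())
    # Sum of gaps to the best-connected node, divided by the maximum possible sum.
    return sum(top - d for d in degree.values()) / ((n - 1) * (n - 2))


people = ["ann", "ben", "cat", "dan"]
star = [("ann", "ben"), ("ann", "cat"), ("ann", "dan")]  # everything goes through ann
ring = [("ann", "ben"), ("ben", "cat"), ("cat", "dan"), ("dan", "ann")]

for label, edges in (("star", star), ("ring", ring)):
    print(label, graph_density(people, edges), degree_centralization(people, edges))
```

The star network is fully centralized but sparse, while the ring is denser and has no central leader, which mirrors the contrast between the central leadership and empowerment rows.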
The example below compares the collaboration performance of a professional services company with 45,000 employees over three months, plotting the values for the seven IMOC variables in the radar chart. The image illustrates that in June 2016 the company communicated most centrally, most likely because some company-wide campaigns were run. April was the most creative month, where different leaders took turns assuming central roles. In May, employees were most empathic, responding to each other the fastest.
Image 76
The drill-down image below shows the most central 2000 employees of the company in April (in blue), as well as their communication with key customers (shown in red). Note that the customers are scattered in the periphery, while the employees of the company are doing a lot of the talking among themselves.
Image 77
When filtering the company employees for only the account managers of the customers, the position of the customers (marked in red) notably changes (image 78), and the customers move into the core of the network. In order to measure the customer focus of the company, it would therefore be worthwhile to repeat the calculation of the IMOC variables using the network below, to track the change in customer focus of the company’s account managers.
Image 78
17.1. IMOC ANNOTATION PROCESS
The IMOC annotation process is identical to the IMIC annotation process described in Chapter 15. The main difference is that the e-mail, Skype, or calendar archive is not from an individual, but from an entire organization. This means that normally it is much more voluminous, easily containing billions of communication records, which might necessitate calculation of the Condor variables on a cloud server equipped with over 100 GB of RAM. For preprocessing the communication data, a MapReduce or Hadoop-based cluster might be necessary.
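The kind of preprocessing that a MapReduce cluster would take over can be illustrated in a single process. The sketch below is not Hadoop and not part of Condor; it only shows the pattern of counting messages per sender and recipient pair in independent chunks (map) and then merging the partial counts (reduce). The record format and the names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_chunk(records):
    """Map step: count messages per (sender, recipient) pair in one chunk."""
    return Counter((sender, recipient) for sender, recipient in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial edge counts."""
    return left + right


# Hypothetical communication records, already split into two chunks.
chunks = [
    [("ann", "ben"), ("ann", "ben"), ("ben", "cat")],
    [("ann", "ben"), ("cat", "ann")],
]
edge_weights = reduce(reduce_counts, map(map_chunk, chunks), Counter())
print(edge_weights)
```

Because each chunk is counted independently, the map step could be distributed over many machines, and only the much smaller weighted edge list would need to be loaded for the network calculations.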
18 OUTSIDE MEDIA ORGANIZATIONAL COLLABORATION (OMOC)
OMOC measures collaboration behavior of companies as seen from the outside through the lens of online social media such as Twitter, Facebook, Wikipedia, and Google search. OMOC starts with an analysis of the organization’s footprint on Twitter and extends the drill-down exploration with Wikipedia and Google Blog search analysis.
| Organizational Collaboration Characteristics | Operationalization in Condor on Social Media | Diagnosis: Indication of High Popularity and Influence on Social Media | Therapy: What You Can Do to Improve Your Organization’s Social Media Footprint |
| --- | --- | --- | --- |
| Activity | Number of tweets collected with EgoFetcher about company name | The more people tweet about a company, the more popular it is | Be selective when tweeting, and crosslist tweets on Facebook and LinkedIn |
| Central leadership | Group degree centrality of the Twitter network about the company name | The less centralized the network, the more different people form their own sub-communities | For a Twitter community, a decentralized network might be desirable |
| Creativity | Oscillation in betweenness centrality of company name in Twitter network | The more diverse the network structure of people tweeting about the company, the more creative the company’s brand | Reach out to new people outside the company by including them in tweets, and add blog content and outside links |
| Passion | Ego ART (Average Response Time of individuals to tweets from others) | The faster actors respond to tweets, the more passionate they are about the person tweeting about the company | Be responsive yourself, and only tweet about the company when you have something to say |
| Respect | Alter ART (Average Response Time of everybody else to tweets from original actors) | The faster other people respond to tweets from the original actors, the more the original actors tweeting about the company are respected | Add substance to your tweets, and cross-reference blogposts and other interesting content on the Web |
| Complexity | Complexity measures word usage by looking at the diversity of the vocabulary, and its distribution among the different tweets | More complex language means that people are having more diverse and thought-provoking discussions on Twitter about the company | Make good use of the 140 characters of Tweets, and add links to pictures and videos |
| Sentiment | Sentiment measures the positivity and negativity of tweets using the machine-learning function of Condor | Having a more positive sentiment in Twitter is an indicator of a positive attitude toward the brand | Use a fundamentally positive tweeting style, and also point out positive things, instead of complaining |
| Emotionality | Emotionality is defined as standard deviation of positive and negative sentiments in tweets | Being more emotional on Twitter is an indicator of being more engaged and committed toward the brand | Be honest in your own tweets, and refrain from using overly positive language when tweeting about the brand |
The radar chart below compares the social media footprint of Microsoft and Google, collected on July 12, 2016. On this day, Microsoft and Google had comparable Twitter activity; however, Google is seen as more creative. People tweeting about Google are also more passionate and respectful. The text of tweets about Microsoft is slightly more complex. Note that Twitter reflects the fickleness of the crowd; one week later the values might be quite different.
The image below shows a drill-down with the top 30 most central nodes; their node size indicates betweenness centrality oscillation. We find that Prashantrjoshi, MSFTExchange, and sladner show the most creative Twitter networking behavior, which means they are engaged in a rapid exchange of tweets.
Image 79
The image below shows the most central websites boosting the Microsoft brand. Nodes are sized by betweenness. Facebook, Forbes, and the New York Times are most central.
18.1. OMOC ANNOTATION PROCESS
The OMOC annotation process is identical to the OMIC annotation process described above. The main difference is that the Twitter, Wikipedia, and blog archives are not about an individual, but about an organization. Similarly to OMIC, OMOC mostly uses the Twitter EgoFetcher, collecting tweets about a company’s name, the Twitter timelines of the 480 most important people tweeting about the company, and the first 100 retweets of each tweet. Note that the current Condor Facebook fetcher is restricted to the walls of individual people; it does not directly provide walls of Facebook groups and organizations.
18.2. FOLLOW-ON EXERCISES
1. Using your own mailbox, do a radar chart IMIC analysis, comparing your communication behavior of the last three months, and replicate the drill-down as described in Chapter 15.
2. Compare the social media profiles of Pope Francis, Tim Berners-Lee, and Roger Federer using OMIC and replicate the drill-down as described in Chapter 16.
3. Combine the e-mail boxes or Skype archives of you and your three closest friends into one combined team mailbox and analyze your communication behavior over the last three months using IMOC, and replicate the drill-down as described in Chapter 17.
4. Compare the social media profiles of Samsung, Nokia, Apple, and Huawei using OMOC and replicate the drill-down as described in Chapter 18.
19 SURVEY OF INDIVIDUAL AND ORGANIZATIONAL COLLABORATION (SIC & SOC)
The four automated online media-based assessments are complemented by two survey-based assessments: Survey of Individual Collaboration (SIC), focusing on the individual, and Survey of Organizational Collaboration (SOC), with a focus on the organization. The survey questions are grounded in extensive prior research. The survey is also online at http://5.35.249.27/sociometrics/sicsoc
19.1. SURVEY OF INDIVIDUAL COLLABORATION (SIC)
SIC measures the attitude of individuals toward collaboration in the seven dimensions: individual motivation, organizational motivation, transparency, fairness, trust/honesty, forgiveness, and empathy/listening. These dimensions are explained in detail in my companion book Swarm Leadership and the Collective Mind: Using Collaborative Innovation Networks to Build a Better Business.
19.1.1. Individual Motivation
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. I would take a different job paying the same
2. If I could start again, I would not learn my current profession
3. If I got all the money I ever wanted, I would still stay in my current profession
4. In my private time I spend time reading up on professional material
5. I am very personally involved in my job
6. I consider my profession central to my existence
Source: Blau (1985).
19.1.2. Organizational Motivation
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. I am willing to put in extra effort for my organization
2. I talk up my organization as a great place to work
3. I would accept almost any job to stay in my organization
4. In my private time I spend time reading up on professional material
5. I really care about the fate of my organization
6. This is the best organization to work for
Source: Blau (1985).
19.1.3. Transparency
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. The people I work with keep me informed
2. It is important for me to know if a website that collects my information will use it in a way that will identify me
3. I think ordinary citizens should have access to records of government contracts, including the amount and who got the contracts
4. Citizen requests for government documents are just a big distraction for government workers
5. Do you think whistleblowers, anticorruption activists, and journalists should enjoy legal protections that make them feel secure about reporting cases of corruption?
6. My organization wants people like me to know what it is doing and why it is doing it
Sources: Awad and Krishnan (2006); Piotrowski and Van Ryzin (2007); http://www.transparency.org/
19.1.4. Fairness
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. I help others to acquire the skills they might need
2. My organization treats people like me fairly and justly
3. My organization can be relied on to keep its promises
4. This organization does not mislead people like me
5. My organization is interested in the well-being of people like me, not just itself
6. My organization freely admits when it has made mistakes
Source: Rawlins (2008).
19.1.5. Trust/Honesty
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. I give information to the group, even if it might jeopardize my position or job
2. I am not afraid to offend other people if I think I am right
3. When my friends have a problem, they usually ask me for help
4. I believe my organization takes the opinions of people like me into account when making decisions
5. I’m willing to let my organization make decisions for people like me
6. I trust my organization to take care of people like me
Source: Rawlins (2008).
19.1.6. Forgiveness
Each statement is rated on a seven-point scale from Agree (1) to Disagree (7).
1. Your significant other has just broken up with you, leaving you hurt and confused. You learn that the reason for the break up is that your significant other started dating a good friend of yours. You will never forgive her/him
2. I feel hatred whenever I think about the person who wronged me
3. I think my life is ruined because of this person’s wrongful actions
4. If I encountered the person who wronged me I would feel at peace
5. I hope the person who wronged me is treated fairly by others in the future
6. A friend borrows your most valued possession, and then loses it. You will never forgive her/him
Source: Rye et al. (2001).
19.1.7. Empathy/Listening
Each statement is rated on a seven-point scale: Agree (1) (2) (3) (4) (5) (6) (7) Disagree.
• I learn a lot from others in the team
• Most of my expertise has developed as a result of working with others
• When I am in need my colleagues will go out of their way to help me
• We are continuously encouraged to bring new knowledge in our team
• I believe my organization takes the opinions of people like me into account when making decisions
• My organization asks the opinions of people like me before making decisions
Sources: Sveiby and Simons (2002); Narayan and Cassidy (2001); Grootaert (2004).
19.2. SURVEY OF ORGANIZATIONAL COLLABORATION (SOC)
SOC measures the collective attitudes of individual group members toward their organization in four dimensions: collective consciousness, leadership, contribution/sharing, and responsiveness/respect. It focuses on the collective consciousness of group members, their leadership behavior, their attitude toward sharing, and the respect they give to everybody.
19.2.1. Collective Consciousness
Each statement is rated on a seven-point scale: Agree (1) (2) (3) (4) (5) (6) (7) Disagree.
• I am a worthy member of the group I belong to
• I feel good about the group I belong to
• Overall, my group is considered good by others
• We have a strong organizational culture with shared vision, values, norms, systems, symbols, language, assumptions, beliefs, and habits in our group
• The group I belong to is an important reflection of who I am
• I have a strong sense of belonging to my own group
Source: Luhtanen and Crocker (1992); Ashmore, Deaux, and McLaughlin-Volpe (2004).
19.2.2. Leadership
Each statement is rated on a seven-point scale: Agree (1) (2) (3) (4) (5) (6) (7) Disagree.
• My group chooses its own leaders
• If a member of the group has a problem, the group member will collectively help her/him
• My group gives me the power to make important decisions concerning myself by myself
• I feel free of conflict with myself in the context of my group
• My group supports me to become a better person
• My group wants me to stand up for my beliefs, independently of who opposes them
Source: Braithwaite and Law (1985).
19.2.3. Contribution/Sharing
Each statement is rated on a seven-point scale: Agree (1) (2) (3) (4) (5) (6) (7) Disagree.
• It is imperative to lessen the gap between the rich and the poor
• All nations of the earth should work together to help each other
• I prefer to gain new insights even if it comes at a cost to myself
• My personal success depends on aligning my goals with the goals of my group or organization
• My goal is to serve my organization, not myself
• I am willing to invest more into my organization than what I get out
Source: Schwartz (2012); Waldman et al. (2006).
19.2.4. Responsiveness/Respect
Each statement is rated on a seven-point scale: Agree (1) (2) (3) (4) (5) (6) (7) Disagree.
• Each group member needs to be treated as someone of worth, independent of social position and income
• We need to give every group member an equal chance, even if this means I have less
• It is imperative to prevent the destruction of nature, even if this means less income for me
• In the context of my group I am taking responsibility for my own actions
• I like making fun about the members of my group
• Every group member deserves respect
Source: Watson, Newton, and Kim (2003).
19.3. SAMPLE DOWNLOAD
The samples mentioned in this book can be downloaded from the following link: http://www.ickn.org/sociometrics/
There you will find the following:
• Hillary Clinton’s e-mail as provided on Kaggle.
• Enron e-mail archive in Condor format.
• List of Enron convicts to test machine learning with Condor.
• 9000 tweets about Donald Trump on April 22, 2016.
• 9000 tweets about Bernie Sanders on April 22, 2016.
• Antivaxxer Twitter example in a KNIME format.
• Excel spreadsheet to create AMICA output: IMIC, OMIC, IMOC, and OMOC examples.
The SIC and SOC survey is available online at http://5.35.249.27/sociometrics/sicsoc
PART IV. APPENDIX — USEFUL MACHINE LEARNING AND GRAPH ANALYSIS TOOLS The appendix describes KNIME and Gephi, two additional tools besides Condor useful for mapping the collective mind on online social media.
APPENDIX A: IDENTIFYING ANTIVAXXERS THROUGH MACHINE LEARNING USING KNIME
CHAPTER CONTENTS
• KNIME is an open source machine learning tool with a visual front end.
• A training dataset of tweeters is manually classified into pro-vaxxers and anti-vaxxers.
• Machine learning distinguishes supporters and objectors of the “Anti-Vaxxer” theory through their word usage in Twitter.
In this example, we will learn how to use machine learning to identify proponents of the “Anti-Vaxxer” theory through their Twitter behavior. Anti-vaccination, the refusal of parents to vaccinate their infants against common infectious diseases, rests on claims that have been scientifically debunked, but it is still propagated by a small but vocal minority of parents in the United States. They claim that vaccinating infants causes autism. The consequence is that in some parts of the United States more than 10% of children are not vaccinated, thereby becoming potential carriers of infectious diseases, such as measles, for their peers.¹

We will analyze a dataset of tweets that was collected in Spring 2015 by a team of students of the COINs seminar at FHNW Brugg and University of Cologne. They gathered all the Tweets containing the words “vaccination,” “vaccinate,” “vaxxer,” “vaccine,” “anti-vaccination,” and “anti-vaxxers.” The resulting Tweets, together with information about the tweeters, were used to manually classify two sets of Tweets: one belonging to pro-vaxxers and the other belonging to anti-vaxxers. These Tweets were used to categorize pro- and anti-vaxxers based on their word usage. Pennebaker (2013), a social psychology professor at UT Austin, has found that the way people use small function words such as “the,” “and,” “or,” “to,” “in,” “it,” “what,” “I,” “my,” “me,” “you,” etc., has high predictive value. As Condor did not count these words initially, the students wrote an external program that calculates these statistics for each Tweet. (In the meantime this function has been added to Condor.) We then use KNIME, an open source text mining and data analytics tool with a visual front end, to apply Pennebaker’s results and see whether pro- and anti-vaxxers use these function words in different ways.
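The per-Tweet function-word statistics described above could be computed along the following lines. This is only a minimal sketch, not the students' original program: the function name `function_word_frequencies`, the simple tokenizer, and the choice to divide each count by the number of words in the Tweet are assumptions made for illustration. The word list follows the `Frequency_*` fields of the dataset.

```python
import re

# Function words tracked in the dataset (one Frequency_* field each).
FUNCTION_WORDS = ["the", "and", "to", "in", "it", "my", "you",
                  "was", "for", "have", "with", "me", "but"]


def function_word_frequencies(text):
    """Return the relative frequency of each function word in one text.

    Assumption: a frequency is the word count divided by the total number
    of word tokens; the original program may have normalized differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    freqs = {}
    for word in FUNCTION_WORDS:
        count = tokens.count(word)
        freqs["Frequency_" + word.capitalize()] = count / total if total else 0.0
    # Sum over all tracked function words.
    freqs["Sum_smallwords"] = sum(freqs.values())
    return freqs


example = function_word_frequencies(
    "I gave my vaccine record to the nurse and it was fine")
print(round(example["Sum_smallwords"], 3))
```

In the example sentence, six of the twelve tokens are tracked function words, so the summed frequency is one half.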
¹ This example is based on a class project in the COINs 2015 Spring seminar done by Juerg Dietrich (FHNW Brugg) and Matthias Sambale (University of Cologne), together with their team members in Brugg and Tor Vergata University Rome, Yannick Gaugler, Benjamin Schaja, Luca Balestra, and Rosy Innarella.
As an input for KNIME, we use the dataset put together by the student team. The dataset contains the following fields:
• ID
• Username
• isAntivaxxer
• meanNumberOfMentions
• meanNumberOfHashtags
• meanNumberOfHtmls
• meanTextLength
• Frequency_The
• Frequency_And
• Frequency_To
• Frequency_In
• Frequency_It
• Frequency_My
• Frequency_You
• Frequency_Was
• Frequency_For
• Frequency_Have
• Frequency_With
• Frequency_Me
• Frequency_But
• Sum_smallwords
The file “Pro_Anti_Vaxxer_Twitter.csv” contains 171 manually classified profiles of pro-vaxxers and 171 profiles of anti-vaxxers, with their attributes such as meanNumberOfMentions, …, Frequency_But, Sum_smallwords. After downloading KNIME from www.knime.org, we start by creating a new workflow “Antivaxxer2.” From the Node Repository, we first drag a File Reader icon into the workspace, then right-click it to open the configure window. Selecting the “Pro_Anti_Vaxxer_Twitter.csv” file leads to the following configuration window. Note that the ID and Username columns need to be skipped for the analysis: for the purpose of machine learning they are random values and would confuse the results. To skip a column, right-click the column heading and select the box “Don’t include column in output table.”
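Outside KNIME, the same column filtering can be sketched in a few lines. The snippet below is an illustration under assumptions: the two sample rows are invented, the 0/1 encoding of isAntivaxxer is assumed, and `load_profiles` is a hypothetical helper name. To read the real file, an open handle on “Pro_Anti_Vaxxer_Twitter.csv” would replace the in-memory sample.

```python
import csv
import io

# Invented two-row sample with a subset of the dataset's columns.
SAMPLE = """ID,Username,isAntivaxxer,meanTextLength,Frequency_My
17,user_a,1,98.5,0.010
42,user_b,0,121.0,0.034
"""

# Identifier columns that carry no signal for the learner.
SKIP = {"ID", "Username"}


def load_profiles(handle, skip=SKIP):
    """Read profiles from CSV, dropping the identifier columns."""
    rows = []
    for record in csv.DictReader(handle):
        row = {name: float(value)
               for name, value in record.items() if name not in skip}
        # Keep the class label as an integer (assumed to be coded 0 or 1).
        row["isAntivaxxer"] = int(record["isAntivaxxer"])
        rows.append(row)
    return rows


profiles = load_profiles(io.StringIO(SAMPLE))
print(sorted(profiles[0]))
```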
Next we add “Equal Size Sampling” (use exact sampling) and “Partitioning” nodes. Our nominal column has been automatically set to “isAntivaxxer.” The partitioning node will split the input table into a training and a test dataset. We specify equal sampling (relative 50%) and stratified sampling. We do this because our sample contains an equal number of classified anti-vaxxer and pro-vaxxer profiles. After that we drag “Naïve Bayes Learner,” “Naïve Bayes Predictor,” “Decision Tree Learner,” “Decision Tree Predictor,” “Logistic Regression Learner,” and “Regression Predictor” nodes into the workspace and connect their inputs and outputs according to the network plan below. This means that we are applying three different machine learning algorithms to the anti-vaxxer dataset, building three different models that can classify new, unclassified profiles. The predictors test the second half of our dataset against the three models developed by the three learners, to give us an indication of the quality of the three models. To measure the accuracy of the output, we need to add a “Scorer” to the output of each predictor.
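The partition, learn, predict, and score pipeline can be sketched without KNIME as follows. This is a simplified illustration, not the KNIME implementation: it covers only one of the three learners (a Gaussian naive Bayes written from scratch), the data are synthetic stand-ins for the real profiles, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_split(rows, label_key, fraction=0.5, seed=42):
    """Split rows into (train, test), keeping class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        cut = int(len(members) * fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def fit_gaussian_nb(rows, label_key, features):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    model = {}
    for label, members in by_class.items():
        stats = {}
        for f in features:
            values = [m[f] for m in members]
            mean = sum(values) / len(values)
            # Small constant avoids a zero variance.
            var = sum((v - mean) ** 2 for v in values) / len(values) + 1e-9
            stats[f] = (mean, var)
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row, features):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for f in features:
            mean, var = stats[f]
            score += (-0.5 * math.log(2 * math.pi * var)
                      - (row[f] - mean) ** 2 / (2 * var))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Synthetic data: class 0 uses "my"/"me" more often than class 1.
rng = random.Random(0)
features = ["Frequency_My", "Frequency_Me"]
rows = []
for label, centre in ((0, 0.06), (1, 0.02)):
    for _ in range(20):
        rows.append({"isAntivaxxer": label,
                     "Frequency_My": rng.gauss(centre, 0.005),
                     "Frequency_Me": rng.gauss(centre, 0.005)})

train, test = stratified_split(rows, "isAntivaxxer")
model = fit_gaussian_nb(train, "isAntivaxxer", features)
correct = sum(predict(model, r, features) == r["isAntivaxxer"] for r in test)
accuracy = correct / len(test)
print(f"accuracy on held-out half: {accuracy:.2f}")
```

The synthetic classes are deliberately well separated, so the score here says nothing about the real dataset; the point is only the order of the steps.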
The scorer needs to be set up to test the accuracy of the prediction “isAntivaxxer” against the preclassified variable “isAntivaxxer.”
Clicking on the Naïve Bayes Scorer shows the following confusion matrix. The confusion matrix shows that out of 170 test cases classified, 57 were correctly classified as pro-vaxxers and 54 were correctly classified as antivaxxers, leading to an accuracy of 65.3%.
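The accuracy figure follows directly from the counts of correctly classified cases. As a small worked check (the function name `accuracy_from_counts` is just an illustrative choice):

```python
def accuracy_from_counts(correct_per_class, total_cases):
    """Accuracy = correctly classified cases / all classified cases."""
    return sum(correct_per_class) / total_cases


# 57 correct pro-vaxxers and 54 correct anti-vaxxers out of 170 test cases.
naive_bayes_accuracy = accuracy_from_counts([57, 54], 170)
print(f"{naive_bayes_accuracy:.1%}")  # prints 65.3%
```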
Looking at the Naïve Bayes learning view shows that pro-vaxxers use more “me” and “my”; according to Pennebaker this is a sign of humility.
Clicking on the Decision Tree Scorer shows the following confusion matrix, telling us that 18 anti-vaxxers and 44 pro-vaxxers have been misclassified.
Looking at the decision tree shows that pro-vaxxers use more complex language (Sum_smallwords > 0.075).
Clicking on the Logistic Regression Scorer shows the following confusion matrix.
We can now also look at the coefficients of the Logistic Regression. As we see, we get some significant predictors, for instance meanNumberOfMentions, meanNumberOfHtmls, meanTextLength, etc. Putting them into the regression equation would allow us to calculate the probability F(x) for new people to be a pro- or an anti-vaxxer.
\[ F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
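To make the use of the regression equation concrete, the logistic function can be evaluated as in the sketch below. The coefficient values are placeholders chosen for illustration, not the coefficients estimated by KNIME, and the single-predictor form is a simplification of the multi-predictor model.

```python
import math


def logistic(x, beta0, beta1):
    """Probability F(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


# Placeholder coefficients for one predictor (hypothetical values).
beta0, beta1 = -1.0, 0.02
probability = logistic(50.0, beta0, beta1)  # beta0 + beta1 * x = 0 here
print(probability)
```

With these placeholder values the linear term is exactly zero at x = 50, so the function returns 0.5, the point at which both classes are equally likely.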
This concludes a very brief introduction to KNIME; more details can be found online. The book Guide to Intelligent Data Analysis by Borgelt, Höppner, and Klawonn² gives a broad introduction to machine learning with KNIME examples.
MAIN LESSONS LEARNED
• Supervised learning needs a training and a test dataset.
• A dataset of categorized tweets from anti-vaxxers (denying the benefits of vaccinations) and pro-vaxxers (public health officials) helps identify anti-vaxxers based on their word usage in tweets.
• James Pennebaker’s “small words” are used as features in the machine learning.³
• Anti-vaxxers use fewer personal pronouns and less complex language.

² Borgelt, Höppner, and Klawonn (2010).
³ Pennebaker (2013).
APPENDIX B: GENERATING NICE GRAPH PICTURES WITH GEPHI
CHAPTER CONTENTS
• Gephi is an open source graph drawing and manipulation tool for Mac, Windows, and Linux.
• Gephi includes additional functions for graph drawing, filtering, clustering, and manipulation not available in Condor.

The open source graph drawing tool Gephi offers additional functionality for drawing and manipulating graphs, as well as sophisticated layout options, that is not available in Condor. First, you will need to download the most recent version of Gephi (currently 0.9.1) from gephi.org. You might need to adjust the parameters of Gephi in the config file, which on a Mac you reach by right-clicking on the “Gephi” icon and selecting “show package contents.”
In the gephi.conf file, you can decide what version of the Java virtual machine to run (jdkhome) (Gephi still seems to use Java 1.6) and how much memory to allocate (for instance, -Xmx11468m assigns Gephi 11,468 MB):

    # ${HOME} will be replaced by user home directory according to platform
    default_userdir="${HOME}/.${APPNAME}/0.8.2/dev"
    default_mac_userdir="${HOME}/Library/Application Support/${APPNAME}/0.8.2/dev"

    # options used by the launcher by default, can be overridden by explicit
    # command line switches
    default_options="--branding gephi -J-Xms64m -J-Xmx11468m -J-Xverify:none -J-Dsun.java2d.noddraw=true -J-Dsun.awt.noerasebackground=true -J-Dnetbeans.indexing.noFileRefresh=true -J-Dplugin.manager.check.interval=EVERY_DAY"
    # for development purposes you may wish to append: -J-Dnetbeans.logger.console=true -J-ea

    # default location of JDK/JRE, can be overridden by using --jdkhome switch
    jdkhome="/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/"

    # clusters' paths separated by path.separator (semicolon on Windows, colon on Unices)
    #extra_clusters=
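Before editing the memory setting it can help to check what the launcher options currently request. The following sketch reads the maximum heap size out of a `default_options` string; the helper name `max_heap_megabytes` is hypothetical, and the snippet only inspects a string and does not touch Gephi or its config file.

```python
import re


def max_heap_megabytes(default_options):
    """Return the -J-Xmx heap limit in MB, or None if it is not set."""
    match = re.search(r"-J-Xmx(\d+)([mMgG])", default_options)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    # Convert gigabytes to megabytes; megabytes pass through unchanged.
    return value * 1024 if unit == "g" else value


options = "--branding gephi -J-Xms64m -J-Xmx11468m -J-Xverify:none"
print(max_heap_megabytes(options))  # prints 11468
```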
Once Gephi is started, we are ready to load the data. In this example, I will look at my own mailbox data that I have downloaded into Condor previously, exporting it as a MySQL dump which can be directly loaded into Gephi.
The exported file, for example, “peter.sql” needs to be loaded into MySQL, for example, by using Navicat, MySQL Workbench, or by directly using the command line:

    PETERs-MacBook-Pro-2:~ pgloor$ /usr/local/mysql/bin/mysql -u root
    mysql> create database petermail2;
    Query OK, 1 row affected (0.00 sec)
    mysql> use petermail2;
    Database changed
    mysql> source /Users/pgloor/Desktop/peter.sql
Afterwards, the network can be loaded into Gephi using the command “File->import database”
This will lead to the following image.
Calculating statistics, choosing a layout algorithm, coloring the nodes by cluster, and sizing nodes and labels by betweenness leads to the following image (Image 80ᵃ).

ᵃ For color pictures see online version of images, available at http://www.ickn.org/sociometrics/
Filtering only the top nodes by betweenness results in the following graph (Image 81ᵃ).
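To make the betweenness-based sizing and filtering less of a black box, here is a minimal sketch of the computation for a small undirected, unweighted graph, using Brandes' algorithm. It is not Gephi's code: the toy network and the names `betweenness` and `top_nodes` are invented for illustration, and the scores are left unnormalized.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from the source.
        stack = []
        predecessors = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {v: 0.0 for v in adjacency}
        while stack:
            w = stack.pop()
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in scores.items()}


def top_nodes(scores, k):
    """Keep the k nodes with the highest betweenness."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy e-mail network: "hub" links "c" to the connected pair "a" and "b".
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
}
scores = betweenness(graph)
print(scores["hub"], top_nodes(scores, 1))
```

In the toy network only the shortest paths from "c" to "a" and from "c" to "b" pass through another node, so "hub" gets a betweenness of 2 and all other nodes get 0.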
A discussion of the rich feature set of Gephi is beyond the scope of this book; I encourage you to experiment and try it out for yourself.

MAIN LESSONS LEARNED
• Gephi offers rich functionality to manipulate and visualize networks.
• It is particularly useful to produce appealing network pictures for presentations and reports.
• Gephi also supports export of many graph metrics for subsequent analysis.
APPENDIX C: SAMPLE MID-TERM EXAM
This section includes a sample Mid-Term Exam suitable for a one-semester course on digital social network analysis, organizational redesign and engineering, social media-based trend forecasting and prediction, and Coolhunting for trends and trendsetters.

1. Briefly explain the following (in your own words):
a. Social network (2p)
b. Reciprocity (2p)
c. Egocentric network (2p)
d. Draw a network that includes a clique with four nodes (2p)
2. Given the graph G below:
a. What is the betweenness centrality for b? (6p) Use steps to show your thought process
b. What is the degree centrality for b? (2p)

3. Based on graph G from task 2 above, fill in the table below (8p): 0.5p each for multiple choice, 0.5p for explanation

Statement | True | False | Impossible to Say | Why?
G is connected | | | |
G is weighted | | | |
G is directed | | | |
G is complete | | | |
G has a bridge | | | |
G has a gatekeeper | | | |
G has strong ties | | | |
The density of G is more than 0.5 | | | |
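4. Draw a network with at least four nodes, where the group degree centrality is 1. (2p)
5. Draw a graph with five nodes with density 1. (2p)
6. What is the correlation between the two variables visualized below for (a), (b), and (c)? Explain why.
7. Explain the pros and cons of “collaborative competition” and “competitive collaboration,” and give some examples. (4p)
8. Which social network structure is best for spreading ideas quickly? Which network structure would you create to get others to accept your new idea? (4p)
9. Do a Coolhunting with Condor for “Hillary Clinton,” comparing the results against “Donald Trump.” (30p)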
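For checking answers to the density and group degree centrality questions, the two measures can be computed as in the sketch below. It assumes an undirected graph with nodes numbered from 0, and it assumes that "group degree centrality" refers to Freeman's degree centralization; both function names are illustrative.

```python
from itertools import combinations


def density(n_nodes, edges):
    """Share of possible undirected edges that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


def degree_centralization(n_nodes, edges):
    """Freeman's degree centralization (1 for a star, 0 for a regular graph)."""
    degrees = [0] * n_nodes
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n_nodes - 1) * (n_nodes - 2))


complete5 = list(combinations(range(5), 2))   # every pair of five nodes
star4 = [(0, 1), (0, 2), (0, 3)]              # node 0 linked to three others

print(density(5, complete5))              # prints 1.0
print(degree_centralization(4, star4))    # prints 1.0
```

A complete graph reaches density 1 because all ten possible edges among five nodes are present, while a star reaches centralization 1 because one node holds every tie.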
7. Explain the pros and cons of “collaborative competition” and “competitive collaboration,” and give some examples. (4p) 8. Which social network structure is best for spreading ideas quickly? Which network structure would you create to get others to accept your new idea? (4p) 9. Do a Coolhunting with Condor for “Hillary Clinton,” comparing the results against “Donald Trump.” (30p)
APPENDIX D: REFERENCES
Aharony, N., Pan, W., Ip, C., Khayal, I., & Pentland, A. (2011). Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing, 7(6), 643–659.
Allen, T., Gloor, P., Woerner, S., Raz, O., & Fronzetti Colladon, A. (2016). The power of reciprocal knowledge sharing relationships for startup success. Journal of Small Business and Enterprise Development, 23(3), 636–651.
Allen, T., Raz, O., & Gloor, P. (2009). Does geographic clustering still benefit high tech new ventures? The case of the Cambridge/Boston biotech cluster. MIT ESD-WP2009-01 Working Paper #1, 2009.
Apicella, C. L., Marlowe, F. W., Fowler, J. H., & Christakis, N. A. (2012). Social networks and cooperation in hunter-gatherers. Nature, 481(7382), 497–501.
Aral, S., & Walker, D. (2012). Identifying influential and susceptible members of social networks. Science, 337(6092), 337–341.
Ashmore, R. D., Deaux, K., & McLaughlin-Volpe, T. (2004). An organizing framework for collective identity: Articulation and significance of multidimensionality. Psychological Bulletin, 130(1), 80.
Awad, N. F., & Krishnan, M. S. (2006). The personalization privacy paradox: An empirical evaluation of information transparency and the willingness to be profiled online for personalization. MIS Quarterly, 13–28.
Battilana, J., & Casciaro, T. (2012). Change agents, networks, and institutions: A contingency theory of organizational change. Academy of Management Journal, 55(2), 381–398.
Blau, G. J. (1985). The measurement and prediction of career commitment. Journal of Occupational Psychology, 58(4), 277–288.
Bollen, J., Gonçalves, B., van de Leemput, I., & Ruan, G. (2016). The happiness paradox: Your friends are happier than you. arXiv preprint arXiv:1602.02665.
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.
Borgelt, C., Höppner, F., & Klawonn, F. (2010). Guide to intelligent data analysis. London: Springer-Verlag.
Braithwaite, V. A., & Law, H. G. (1985). Structure of human values: Testing the adequacy of the Rokeach Value Survey. Journal of Personality and Social Psychology, 49(1), 250.
Brunnberg, D., Gloor, P., & Giacomell, G. (2013). Predicting customer satisfaction through (e-mail) network analysis: The communication score card. Proceedings 4th international conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Burke, M., Kraut, R., & Marlow, C. (2011, May). Social capital on Facebook: Differentiating uses and users. Proceedings of the SIGCHI conference on human factors in computing systems, ACM (pp. 571–580).
Celli, F., & Poesio, M. (2014). Pr2: A language independent unsupervised tool for personality recognition from text. arXiv preprint arXiv:1402.2796.
Centola, D. (2010). The spread of behavior in an online social network experiment. Science, 329(5996), 1194–1197.
DiGrazia, J., McKelvey, K., Bollen, J., & Rojas, F. (2013). More tweets, more votes: Social media as a quantitative indicator of political behavior. PloS One, 8(11), e79449.
DiMaggio, M., Gloor, P., & Passiante, G. (2009). Collaborative innovation networks, virtual communities, and geographical clustering. International Journal of Innovation and Regional Development, 1(4), 387–404.
Doshi, L., Krauss, J., Nann, S., & Gloor, P. (2009). Predicting movie prices through dynamic social network analysis. Proceedings COINs 2009, collaborative innovations networks conference, Savannah GA, October 8–11.
Durkheim, E., & Swain, J. W. (2008). The elementary forms of the religious life. Courier Corporation.
Fischbach, K., Gloor, P., & Schoder, D. (2009). Analysis of informal communication networks: A case study. Business & Information Systems Engineering, 2 (also in German).
Frick, K., Guertler, D., & Gloor, P. (2009). Coolhunting for the world’s thought leaders. Proceedings 4th International conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Fu, F., Nowak, M. A., Christakis, N. A., & Fowler, J. H. (2012). The evolution of homophily. Scientific Reports, 2.
Fuehres, H., Gloor, P., Henninger, M., Kleeb, R., & Nemoto, K. (2012). Galaxysearch: Discovering the knowledge of many by using Wikipedia as a meta-search index. Proceedings of collective intelligence 2012, Cambridge, MA, April 18–20.
Futterer, T., Gloor, P., Malhotra, T., Mfula, H., Packmohr, K. H., & Schultheiss, S. (2013). WikiPulse: A news-portal based on Wikipedia. Proceedings 4th International conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Garcia, C., Parraguez, P., Barahona, M., & Gloor, P. (2012). Tracking the 2011 Student-led collective movement in Chile through social media use. Proceedings collective intelligence 2012, Cambridge, MA, April 18–20.
Gloor, P. (2007). Coolhunting for trends on the Web (invited paper). Proceedings IEEE 2007 international symposium on collaborative technologies and systems, Orlando, May 21–25.
Gloor, P. (2010). Coolfarming: Turn your great idea into the next big thing. New York, NY: AMACOM.
Gloor, P. (2011). To become a better manager stop being a manager. Ivey Business Journal, March/April 2011. Retrieved from: http://iveybusinessjournal.com/publication/to-become-a-better-manager-stop-being-a-manager/
Gloor, P. (2015). What email reveals about your organization. Sloan Management Review, Winter.
Gloor, P., De Boer, P., Lo, W., Wagner, S., Nemoto, K., & Fuehres, H. (2015). Cultural anthropology through the lens of Wikipedia: A comparison of historical leadership networks in the English, Chinese, and Japanese Wikipedia. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Gloor, P., Dorsaz, P., Fuehres, H., & Vogel, M. (2012). Choosing the right friends: Predicting success of startup entrepreneurs and innovators through their online social network structure. International Journal of Organisational Design and Engineering, 3(2), 68–85.
Gloor, P., & Fronzetti, A. (2015). Measuring organizational consciousness through e-mail based social network analysis. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Gloor, P., & Giacomelli, G. (2014). Reading global clients’ signals. Sloan Management Review, Spring.
Gloor, P., Grippa, F., Borgert, A., Colletti, R., Dellal, G., Margolis, P., & Seid, M. (2011). Towards growing a COIN in a medical research community. Procedia Social and Behavioral Sciences, 26, Proceedings COINs 2010, Collaborative innovations networks conference, Savannah GA, October 7–9, 2010.
Gloor, P., Krauss, J., Nann, S., Fischbach, K., & Schoder, D. (2009). Web Science 2.0: Identifying trends through semantic social network analysis. IEEE conference on social computing (SocialCom-09), Vancouver, August 29–31.
Gloor, P., Laubacher, R., Dynes, S., & Zhao, Y. (2003). Visualization of communication patterns in collaborative innovation networks: Analysis of some W3C working groups. ACM CIKM international conference on information and knowledge management, New Orleans, November 3–8.
Gloor, P., Margolis, P., Seid, M., & Dellal, G. (2014). Coolfarming: Lessons from the beehive to increase organizational creativity. MIT Sloan School Working Paper No. 5123-14.
Gloor, P., & Nemoto, K. (2013). Who really matters in the world: Leadership networks in different language Wikipedias. Places and Spaces: Mapping Science, Map #157.
Gloor, P., Niepel, S., & Li, Y. (2006, January). Identifying potential suspects by temporal link analysis. MIT CCS Working Paper.
Gloor, P., & Paasivaara, M. (2013). COINs change leaders: Lessons learned from a distributed course. Proceedings 4th International conference on collaborative innovation networks COINs 2013, Santiago de Chile, August 11–13.
Gloor, P., Paasivaara, M., Lassenius, C., Schoder, D., Fischbach, K., & Miller, C. (2011). Teaching a global project course: Experiences and lessons learned. ICSE international conference on software engineering, Collaborative teaching of globally distributed software development: Community building workshop, Honolulu, Hawaii, May 23.
Gloor, P., Paasivaara, M., & Miller, C. (2015). Lessons from the coinseminar. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
Gloor, P., Paasivaara, M., Schoder, D., & Willems, P. (2007). Finding collaborative innovation networks through correlating performance with social network structure. Journal of Production Research.
Gloor, P., Woerner, S., Schoder, D., Fischbach, K., & Fronzetti Colladon, A. (2016). Size does not matter: In the virtual world. Comparing online social networking behavior with business success of entrepreneurs. International Journal of Entrepreneurial Venturing. In press.
Gloor, P., & Zhao, Y. (2006). Analyzing actors and their discussion topics by semantic social network analysis. Proceedings of 10th IEEE international conference on information visualisation IV06, London, July 5–7.
Grippa, F., & Gloor, P. (2009). You are who remembers you. Detecting leadership through accuracy of recall. Social Networks, 31, 255–261.
Grippa, F., Palazzolo, M., Buccuvalas, J., & Gloor, P. (2012). Monitoring changes in the social network structure of clinical care teams resulting from team development efforts. International Journal of Organisational Design and Engineering, 2(4), 380–401.
Grippa, F., Provost, S., Gloor, P., McKean, M., & Thakkar, S. A. (2014). Systematic methodology to characterize communication patterns in chronic care innovation networks. In S. Long, E.-H. Ng, & C. Downing (Eds.), Proceedings of the American society for engineering management international annual conference.
Grippa, F., Zilli, A., Laubacher, R., & Gloor, P. (2006). E-mail may not reflect the social network. NAACSOS conference, Notre Dame IN, North American Association for Computational Social and Organizational Science, June 22–23.
Grootaert, C. (Ed.). (2004). Measuring social capital: An integrated questionnaire. No. 18. World Bank Publications.
Hill, R. A., & Dunbar, R. I. (2003). Social network size in humans. Human Nature, 14(1), 53–72.
Hybbeneth, S., Brunberg, D., & Gloor, P. (2014). Increasing knowledge worker productivity through a “Virtual Mirror” of the social network. International Journal of Organisational Design and Engineering, 3(3–4), 302–316.
Jordan, J. J., Hoffman, M., Nowak, M. A., & Rand, D. G. (2016). Uncalculating cooperation as a signal of trustworthiness. Retrieved from SSRN.
Kidane, Y., & Gloor, P. (2007). Correlating temporal communication patterns of the eclipse open source community with performance and creativity. Computational & Mathematical Organization Theory, 13(1), 17–27.
Kleeb, R., Gloor, P., & Nemoto, K. (2011). Wikimaps: Dynamic maps of knowledge. Proceedings 3rd International conference on collaborative innovation networks COINs 2011, Basel, Switzerland, September 8–10.
Krauss, J., Nann, S., Simon, D., Fischbach, K., & Gloor, P. (2008). Predicting movie success and academy awards through sentiment and social network analysis. Proceedings European conference on information systems (ECIS), Galway, Ireland, June 9–11.
Lewis, K., Gonzalez, M., & Kaufman, J. (2012). Social selection and peer influence in an online social network. Proceedings of the National Academy of Sciences, 109(1), 68–72.
Luhtanen, R., & Crocker, J. (1992). A collective self-esteem scale: Self-evaluation of one’s social identity. Personality and Social Psychology Bulletin, 18(3), 302–318.
Maddali, H. T., Gloor, P., & Margolis, P. (2015). Comparing online community structure of patients of chronic diseases. Proceedings of the 5th international conference on collaborative innovation networks COINs15, Tokyo, Japan, March 12–14.
McCrae, R. R., & Costa Jr, P. T. (1997). Personality trait structure as a human universal. American Psychologist, 52(5), 509.
McLean, B., & Elkind, P. (2013). The smartest guys in the room: The amazing rise and scandalous fall of Enron. London: Penguin.
Merten, F., & Gloor, P. (2009). Too much e-mail decreases job satisfaction. Proceedings COINs 2009, collaborative innovations networks conference, Savannah GA, October 8–11.
Moat, H. S., Curme, C., Avakian, A., Kenett, D. Y., Stanley, H. E., & Preis, T. (2013). Quantifying Wikipedia usage patterns before stock market moves. Scientific Reports, 3.
Narayan, D., & Cassidy, M. F. (2001). A dimensional approach to measuring social capital: Development and validation of a social capital inventory. Current Sociology, 49(2), 59–102.
Naveen Farag, A. S., & Krishnan, M. S. (2006). The personalization privacy paradox: An empirical evaluation of information transparency and the willingness to be profiled online for personalization. MIS Quarterly, pp. 13–28.
Nowak, M. A. (2006). Five rules for the evolution of cooperation. Science, 314(5805), 1560–1563.
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011, June). Finding deceptive opinion spam by any stretch of the imagination. Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (Vol. 1, pp. 309–319).
Pennebaker, J. (2013). The secret life of pronouns: What our words say about us. London: Bloomsbury Press.
Piotrowski, S. J., & Van Ryzin, G. G. (2007). Citizen attitudes toward transparency in local government. The American Review of Public Administration, 37(3), 306–323.
Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying trading behavior in financial markets using Google trends. Scientific Reports, 3.
Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., & Aletras, N. (2015). Studying user income through language, behaviour and affect in social media. PloS One, 10(9), e0138717.
Quercia, D., Kosinski, M., Stillwell, D., & Crowcroft, J. (2011). Our twitter profiles, our selves: Predicting personality with Twitter. Privacy, Security, Risk and Trust (PASSAT) and IEEE third international conference on social computing (SocialCom), 2011, pp. 180–185.
Rawlins, B. (2008). Measuring the relationship between organizational transparency and employee trust. Public Relations Journal, 2(2), 1–21.
Rye, M. S., Loiacono, D. M., Folck, C. D., Olszewski, B. T., Heim, T. A., & Madia, B. P. (2001). Evaluation of the psychometric properties of two forgiveness scales. Current Psychology, 20(3), 260–277.
Satyanath, S., Voigtländer, N., & Voth, H. J. (2013). Bowling for fascism: Social capital and the rise of the Nazi Party (No. w19201). National Bureau of Economic Research.
Schwartz, S. H. (2012). An overview of the Schwartz theory of basic values. Online Readings in Psychology and Culture, 2(1). doi:10.9707/2307-0919.1116
Sparrow, B., Liu, J., & Wegner, D. M. (2011). Google effects on memory: Cognitive consequences of having information at our fingertips. Science, 333(6043), 776–778.
Sveiby, K. E., & Simons, R. (2002). Collaborative climate and effectiveness of knowledge work - an empirical study. Journal of Knowledge Management, 6(5), 420–433.
Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York, NY: Random House.
Tsvetovat, M., & Koutznetsov, A. (2011). Social network analysis for startups. O’Reilly.
Urdan, T. (2010). Statistics in plain English. Abingdon: Routledge.
Vedres, B., & Stark, D. (2010). Structural folds: Generative disruption in overlapping groups. American Journal of Sociology, 115(4), 1150–1190.
Wagner, C. S., Horlings, E., Whetsell, T. A., Mattsson, P., & Nordqvist, K. (2015). Do Nobel laureates create prize-winning networks? An analysis of collaborative research in physiology or medicine. PloS One, 10(7), e0134164.
Waldman, D. A., de Luque, M. S., Washburn, N., House, R. J., Adetoun, B., Barrasa, A., & Dorfman, P. (2006). Cultural and leadership predictors of corporate social responsibility values of top management: A GLOBE study of 15 countries. Journal of International Business Studies, 37(6), 823–837.
Wassermann, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
Watson, D. L., Newton, M., & Kim, M. S. (2003). Recognition of values-based constructs in a summer physical activity program. The Urban Review, 35(3), 217–232.
Yarkoni, T. (2010). Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of Research in Personality, 44(3), 363–373.
Yasseri, T., Spoerri, A., Graham, M., & Kertész, J. (2014). The most controversial topics in Wikipedia: A multilingual and geographical analysis.
Yun, Q., & Gloor, P. (2012). The web mirrors value in the real world: Comparing a firm’s valuation with its web network position. Sloan Technical Report, Cambridge, MA.
Zhang, X., Fuehres, H., & Gloor, P. (2011a). Predicting asset value through twitter Buzz. In J. Altmann, U. Baumoel, B. Kraemer (Eds.), Proceedings 2nd symposium on collective intelligence, Collin 2011 (Vol. 112), Seoul, Springer Advances in Intelligent and Soft Computing, June 9–10.
Zhang, X., Fuehres, H., & Gloor, P. (2011b). Predicting stock market indicators through twitter: “I hope it is not as bad as I fear,” Procedia Social and Behavioral Sciences, 26, Collaborative innovations networks conference, Savannah GA, October 7–9, 2010.
Zhang, X., Gloor, P., & Grippa, F. (2013). Measuring creative performance of teams through dynamic semantic social network analysis. International Journal of Organisational Design and Engineering, 4(2), 1–18.
Zilli, A., Grippa, F., Gloor, P., & Laubacher, R. (2006). One in four is enough: Strategies for selecting ego mailboxes for a group network view. Proceedings European conference on complex systems ECCS ’06, Oxford UK, September 25–29.
BIOGRAPHY
Peter A. Gloor is a Research Scientist at the Center for Collective Intelligence at MIT’s Sloan School of Management, where he leads a project exploring Collaborative Innovation Networks. He is also the Founder and Chief Creative Officer of the software company galaxyadvisors, an Honorary Professor at the University of Cologne, a Distinguished Visiting Professor at P. Universidad Católica de Chile, and an Honorary Professor at Jilin University, Changchun, China. Earlier, he was a partner with Deloitte and PwC, and a manager at UBS. He received his Ph.D. in computer science from the University of Zurich and was a Post-Doc at the MIT Lab for Computer Science. In his spare time, Peter likes to work on projects bridging the digital divide, enjoy nature, and play the piano.
© 2017 Peter A. Gloor
INDEX
Actor filter, 163, 189
Actors, in SNA, 70
Actor scatter plot, 133, 167, 179
Adjusted R Square, 249, 250, 258, 259
Agreeability, 250, 259–260
“Allteams-cleaned”, 200
Amity University, India, 297–298, 300, 311, 312, 316–317, 322
Annotate functions, 164, 243
ANOVA results
  by ethnicity for FFI characteristics, 256
  by gender for FFI characteristics, 255
  by nationality for FFI characteristics, 257
Anti-gaming, 66
Anti-vaccination, 447
Antivaxxers identification through machine learning, 447–457
Asteroid belt, 160, 183
Automatic Media Insights COIN Assessment (AMICA), 4, 13, 17, 385–389
Average Response Time (ART), 154, 345, 346, 403
Balanced contribution, 49–50, 52
BeingExample, 334
Bernie Sanders’ presidential campaign, 352, 353–355
Betweenness centrality, 70, 72–73, 188, 306, 308
Betweenness curves, 178
Bidirectional links, 150, 312–313, 315
Bipartite graphs
  measuring the importance of brands through betweenness of actors in, 136–137
Black swans, 108
Blogs, 3, 298–311
Bowling for fascism, 90–91
Brands, calculating the importance of, 305
“Brothers”, 333
Bush, Jeb, 356, 360, 361, 363, 364
“Calculate Sentiment” function, 164, 167, 172, 200, 243, 273, 283, 317, 402
Calendar data, 2
Centrality annotations, 137, 162, 164, 173, 196, 200, 243, 273, 283, 314, 402
Chat, 3, 4
Chauhan, Ashok, 297, 298, 309
Cincinnati Children’s Hospital Medical Center (CCHMC), 400
Classic SNA, 28
Clinton, Hillary, 137, 151, 219–228, 350, 356, 365
Clustered network, 89–90
COIIN project, 184
COINonCOINs community, 189–190
Collaboration
  honest signals of, 45
    balanced contribution, 49–50
    honest language, 50–51
    responsiveness, 50
    rotating leadership, 49
    shared context, 51–55
    strong leadership, 48
  knowledge flow optimization, 58–61
  privacy concerns, dealing with, 56–58
  virtual mirroring, 56
Collaborative Innovation Networks (COINs), 6, 24, 25, 192, 212, 352, 353–354, 386
Collaborative Learning Network (CLN) learning, 354
Collaborative performance of organizations, measuring, 419
Communication galaxies, understanding, 67
Community detection, finding COINs through, 185–192
Community detection algorithm, 185, 186, 187, 188
Condor, 108, 109, 155, 156, 157, 165, 170, 172, 185, 197, 208, 212, 229, 242, 296, 340, 366, 419
  analyzing e-mail with, 108
  bipartite graphs, brands through betweenness of actors in, 136–137
  Coolhunting on Internet with, 11–12
  drilling down in, 394
  Facebook wall with, analyzing, 126–129
  four-step analysis process. See Four-step analysis process
  getting started with, 121
  Google CSE, degree-of-separation search with, 141–146
  graph, 137
  identifying criminals through machine learning in, 280–290
  main parts of, 113
  manual, 122
  sample four-step analysis with Twitter, 130
    export, 134
    fetch data, 130–132
    process, 132
    visualize, 133–134
  started with, 9–10
  Twitter, degree-of-separation search with, 146–150
  Wikipedia search, 150–152
Condor Export Wizards, 118, 119
Condor software tool, 3, 28, 57
Conscientiousness, 103, 244, 253–254, 258
Contribution index, 49, 70, 74, 75, 154, 204, 215
Contribution index annotations, 164, 166, 200, 243, 273, 283
Contribution index scatter plot, 225
Convicts versus nonconvicts, 287
Coolfarming, 3, 4, 6, 9, 12, 24, 107, 108
  data collection and analysis process, 31–32
  organizations, 25
  through knowledge flow optimization, 58–61
Coolhunting, 3, 4, 24, 36, 107, 108, 349
  finding trends by finding trendsetter, 39–44
  Francogeddon, 12, 339–348
  on Internet with Condor, 11–12
  on social media, 40
  and trend forecasting on web, 7, 37
  US Presidential elections, 12
Coolhunting on the Internet with Condor, 295
  analysis of the crowd, 322–334
  expert analysis, 298–311
  swarm analysis, 311–321
Cooperation, evolution of, 93
Cooperation and trustworthiness, uncalculating, 94–95
Correlation, 78–80, 81
Correlation results of FFI metrics with six honest signal SNA metrics, 245–248
Correlations calculation between FFI and e-mail, 242–244
“Create new dataset”, 182
Creativity, 65–66
Criminal actors, identifying through their honest signals of collaboration, 273–280
Criminals, identifying through machine learning in Condor, 280–290
Crowd, 296
  analysis of, 322–334
CSV data, 220
Deceptive opinion spam, finding, 96–97
Degree centrality, 70, 72, 73, 137, 181
Demographic information
  calculating, 99–103
  extracting, 85, 86
Density, 70, 74, 186
Directed graph, 71
Edges, 70
EgoFetcher, 414–416
Ego networks, 25, 192
Election outcome, predicting, 103
Electronic communications, 3, 28
E-mail, 2, 25, 65, 115, 242, 393
  analyzing with, 10
  calculating personality characteristics from, 11, 109
  predicting criminal intent from, 11, 109
  see also Personality characteristics calculation from e-mail
E-mail analysis with Condor, 153
  creating a virtual mirror of an organization, 192–219
  creating virtual mirror of personal e-mailbox, 154
    drawing the term graph, 172–174
    removing the mailbox owner, 174–185
  finding COINs through community detection, 185–191
  Hillary Clinton’s mail, analyzing, 219–228
  organizational aspects of e-mail-based SNA, 228–231
E-mail-based social network analysis, 64–65
Emails.csv, 220
Enron e-mail archive, 11, 109, 263
  exploratory analysis, 264–272
  identifying criminal actors through their honest signals of collaboration, 273–280
  “tribefinder”, 280–290
Exchange Autodiscover server, 157
Expert analysis, 298–311
Experts, 296
Exporters, 113, 118–120
Extroversion, 250, 258–259
Facebook, 3, 25, 112, 115, 425
  spreading ideas on, 95–96
Facebook wall, analyzing, 126–129
Face-to-face communication, 3, 30, 38
FeelTheBern.com, 352
Fetch content, 157
Fetchers, 111, 112, 113, 115–116
“Fetch Web”, 299
Filters, 112, 113, 116
Financial capital, improving through optimizing social capital, 65–67
Financial performance, measuring, 97–99
Four-step analysis process, 111
  social media, 111
    exporters, 118–120
    fetchers, 115–116
    filters, 116
    visualizers, 116–118
Francogeddon, 339–348
Gates, Bill, 408, 409–410
Geotagging, 296
Gephi, generating graph pictures with, 15, 459–464
GMAIL login dialog, 158, 159
GMAIL mailbox, 194
Google, 43, 93, 297, 425, 427
Google Custom Search, 115
Google Custom Search Engine (CSE), 136
  degree-of-separation search with, 141–146
Google Trends, 97, 350
Graph, 28, 137–140
Grexit, 342
Group betweenness centrality, 70, 74, 118, 345
Group degree centrality, 70, 73
Happiness paradox, 101
Hawthorne effect, 56
Hillary Clinton’s mail, analyzing, 219–228
Homophily, evolution of, 94
Honest language, 50–51, 53, 61
Huffington, Arianna, 408
Huffington Post, 352
IIT, 298, 320–321
IMAP account, 158
“Import local data first”, 212
Infant Mortality reduction Collaboration Improvement and Innovation Networks (IM CoIIN), 189, 400
Inside media individual collaboration (IMIC), 13, 391–403
  annotation process, 401–403
Inside media organizational collaboration (IMOC), 14, 419–423
  annotation process, 423
Internet, 38, 92–93, 264, 295–334
Kaggle website, 220
KNIME, 447–458
  environment, 8
  identifying anti-vaxxers through machine learning using, 15
Knowledge flow optimization, 58–61
  analyze, 59
  coolfarming, 58
  mirror, 60–61
  optimize, 61
  through organizational social network analysis, 29–31
  predict, 59
Known unknowns, 107–108
Krugman, Paul, 408
Libertea2012, 352
Linear regression, 80, 82–83
“Load actor merge CSV”, 198
Louvain algorithm, 185–186
Machine learning, 447–458
  finding fake reviews through, 96–97
Mailbox owner, removing, 174–185
Mann-Whitney U-test, 345
“Manual node merging” wizard, 161, 186
Matlab, 120
Microsoft, 427
MIT, 46, 298, 320–321
MSFTExchange, 427
MySQL, 115, 122, 124, 155, 156, 326, 461
Natural language processing (NLP), 212
Neo-FFI test, 242
Neuroticism, 103, 244, 249
Nick_Ksg, 334
“Node labels”, 307
Nodes, 70
“Nonconvicts”, 287
Nudges, 50, 345
One-semester course, 18
Online calendars, 115, 400
Online social media, 3, 349, 354
Online social network
  demographic information, calculating, 99–103
  election outcome, predicting, 103
  Facebook, spreading ideas on, 95–96
  financial performance, measuring, 97–99
  ideas spread in, 8, 85
  machine learning, finding fake reviews through, 96–97
  papers covered in section, overview, 86–88
  social selection and peer influence in, 95
  theories of information diffusion, 89–94
Openness, 250
Organizational networks, 25
Organizational trust and satisfaction, measuring, 66
Organization’s Communications Patterns assessment, 32–33
Oscillation annotations, 164, 165, 200, 243, 273, 283
Outside Media Individual Collaboration (OMIC), 13–14, 405–417
  annotation process, 414–417
Outside Media Organizational Collaboration (OMOC), 14, 425
  annotation process, 429
Pearson correlation, 78–80, 81
Performance metrics
  correlating communication patterns against, 34
Personal e-mailbox analysis, 154
  creating virtual mirror of personal e-mailbox, 154
  drawing the term graph, 172–185
  removing the mailbox owner, 174–185
Personality and word use among bloggers, 102–103
Personality characteristics calculation from e-mail, 241
  adding gender, ethnicity, and nationality as control variables, 254–260
    agreeability, 259–260
    extroversion, 258–259
  calculating correlations between FFI and e-mail, 242–244
  general prediction formula, developing, 244
    agreeability, 250
    conscientiousness, 253–254
    extroversion, 250
    neuroticism, 244
    openness, 250
Persons.csv file, 220
Privacy concerns, dealing with, 56–58
Problem, 170
Process Dataset, 154
Pro-vaxxers, 448
R, statistical package, 120
Receiver operating characteristics (ROC) curve, 288
Reddit, 352, 353
Regression, 80, 82–83
Regression coefficients for regressing six honest signals
  against agreeability, 260
  against agreeability with ethnicity as control variable, 260
  against conscientiousness, 253–254
  against extraversion, 251
  against extraversion with ethnicity as control variable, 259
  against neuroticism, 249
  against openness, 252
“Remove specific actor” function, 175, 188
Responsiveness, 50, 52
RFSchatten, 352
Rotating leadership, 49, 52
Sales effectiveness of a global high-tech company, 63
Sample course syllabus, 20–23
Sample download, 444
Sample mid-term exam, 465–468
Sanders, Bernie, 365, 369–376
Script-generated actors, 197
Shantrjosh, 427
Shared context, 51, 53, 54–55
SIC & SOC (Survey of Individual and Organizational Collaboration), 14
Six honest signals of collaboration, 7
6670G, 334
Skype, 2, 115, 393
Slander, 427
SMOTE, 373, 378
“Snowball sampling”, 230
Social capital on Facebook, 96
Social fMRI, 102
Social media, 30
  Coolhunting on, 40
  exporters, 118–120
  fetchers, 115–116
  filters, 116
  fundamental analysis, 108
  as quantitative indicator of political behavior, 103
  visualizers, 116–118
Social network analysis (SNA), 5–6, 28
  basics of, 70
  E-mail-based, 64–65
  knowledge flow optimization through, 29–31
  and statistics, 8, 69
Social network picture of COINs seminar network, 47
Social networks, 5, 90
  and cooperation in hunter-gatherers, 91–92
  influential and susceptible members of, 95–96
  trend prediction by analyzing, 6
  trend prediction by measuring, 24
Social Quantum Physics, principles of, 16
Spammers, 66
SPSS statistical package, 114, 120
SPSS’ t-test, 274, 276
SQLite database, 220
Stata, 120
Statistical techniques, 8
Statistics
  basics of, 75
  linear regression, 80, 82–83
  Pearson correlation, 78–80, 81
  and SNA, 75
  t-test, 76, 78
Stock market
  Twitter mood predicts, 98
  Wikipedia usage patterns, 98–99
Strong leadership, 48, 54
Strong ties, 89
Survey of individual collaboration (SIC), 431–438
  empathy/listening, 438
  fairness, 435
  forgiveness, 437
  organizational motivation, 433
  transparency, 434
  trust/honesty, 436
Survey of organizational collaboration (SOC), 431, 439–443
  collective consciousness, 440
  contribution/sharing, 442
  leadership, 441
  responsiveness/respect, 443
Swarm analysis, 296, 311–321
Swiss Franc, 340, 342
Swiss National Bank, 340
Synthetic Minority Oversampling Technique (SMOTE) algorithm, 285, 287
Tag cloud, creating, 223
Temporal social surface, 208
“Term graph” function, 172
“Terms”, 172
Theories of information diffusion, 89–94
Ties, 70
Trend forecasting, 107, 108
Trends
  finding by finding trendsetter, 39–43
“Tribefinder”, 280–290, 350, 366–382
Trump, Donald, 350, 365–368, 377–381
t-test, 76, 78, 274, 276
Turntaking annotations, 164, 166, 200, 243, 273, 283
Twitter, 2, 3, 25, 101, 112, 115, 136, 146–150, 296, 322–334, 425, 427
  EgoFetcher, 414–416
  Tribefinder, 382
2015/2016 Bernie Sanders campaign, 349
2016 US Presidential elections, 350
  Bernie Sanders’ presidential campaign, 353–355
  Coolhunting Bernie Sanders, Hillary Clinton, Jeb Bush, and Donald Trump, 356–366
  tribefinder on Twitter, 366–382
Undirected network, 70
Unidirectional links, 313
Unknown unknowns, 108
Videoconferencing, 3
Virtual collaboration projects, 193
Virtual mirror creation of an organization, 192–219
Virtual mirroring, 32, 34–36, 56, 107, 108
Virtual tribes, 366, 368–369
Visualizers, 113, 116, 118
Weak ties, 89
Web, 295
Websites and blogs, 298–311
Wiki Evolution Fetcher, 311, 318
Wikipedia, 2, 3, 42, 93, 112, 115, 136, 150–152, 311–321, 425
  controversial topics in, 99–100
“With history” option, 177, 207
Word Cloud, 154