ACM Journal on Computing and Sustainable Societies Volume 1, Issue 2 [1]



English Pages [240] Year 2023


2023 Volume 1, Number 2

ACM Journal on Computing and Sustainable Societies

Article 9 (21 pages): R. Kays, N. Goagoses, and H. Winschiers-Theophilus. Designing for Educational Resilience to Reduce School Dropout: A Case Study of Namibian San Learners

Vol. 1 • No. 2 • 2023

Article 10 (16 pages): A. Dixon, L. Thengo, E. Kitsao, K. Matiya, M. Barasa, R. Nyirongo, J. Muli, F. Kamanga, C. Kachimanga, F. Munyaneza, P. Ngari, H. Makungwa, J. Chimpukuso, M. Amulele, E. Karari, and S. Mbae. Community and Facility Health Information System Integration in Malawi: A Comparison of Machine Learning and Probabilistic Record Linkage Methods

continued on back cover

Articles 9–17

ACM 1601 Broadway, 10th Floor New York, NY 10019-7434 Tel.: 212-869-7440 Fax: 212-869-0481 https://www.acm.org

Home Page: https://acmjcss.acm.org/

Editor-in-Chief
Lakshminarayanan Subramanian, New York University, United States

Area Editors, AI, ML and Data Science for Sustainable Societies
Afsaneh Doryab, University of Virginia, United States
Fei Fang, Carnegie Mellon University, United States
Vanessa Frias Martinez, University of Maryland, United States
Deshen Moodley, University of Cape Town, South Africa
Daniel Neill, New York University, United States
Maximilian Nickel, Meta, United States
Barry O’Sullivan, University College Cork, Ireland
Rajesh Ranganath, New York University, United States
David Shmoys, Cornell University, United States
Skyler Speakmen, IBM Kenya, Kenya

Area Editors, Development, Economics and Policy
Richard Anderson, University of Washington, United States
Ananth Balashankar, Google AI, United States
Sunandan Chakraborty, Indiana University IUPUI, United States
Priyank Chandra, University of Toronto, Canada
Samuel Fraiberger, World Bank, United States
Rayid Ghani, Carnegie Mellon University, United States
Srikanth Jagabathula, New York University, United States
Anant Sudarshan, University of Chicago, United States

Area Editors, Environment, Sustainability and Climate Change
Engineer Bainomugisha, Makerere University, Uganda
Jay Chen, ICSI Berkeley, United States
Priya Donti, Carnegie Mellon University, United States
Amrita Gupta, Georgia Institute of Technology, United States
Sneha Krishnan, Jindal Global University, India
Akshay Nambi, Microsoft, United States
Aaditeshwar Seth, Indian Institute of Technology Delhi, India
Robert Soden, University of Toronto, Canada
Jay Taneja, University of Massachusetts Amherst, United States

Area Editors, HCI, Design and Critical Perspectives
Ishtiaque Ahmed, University of Toronto, Canada
Nicki Dell, Cornell Tech, United States
Anirudha Joshi, Indian Institute of Technology Bombay, India
Rita Orji, Dalhousie University, Canada
Agha Ali Raza, Lahore University of Management Sciences, Pakistan
Bill Thies, Microsoft Research India, India
Kentaro Toyama, University of Michigan, United States
Delvin Varghese, Monash University, Australia
Aditya Vashishta, Cornell University, United States
Marisol Wong-Villacrés, Escuela Superior Politecnica del Litoral, Ecuador

Area Editors, Systems and IoT for Sustainable Societies
Waylon Brunette, University of Washington, United States
Josiah Chavula, University of Cape Town, South Africa
Assane Gueye, Carnegie Mellon University Africa, Rwanda
Kurtis Heimerl, University of Washington, United States
Veljko Pejovic, University of Ljubljana, Slovenia
Ihsan Ayyub Qazi, Lahore University of Management Sciences, Pakistan
Morgan Vigil-Hayes, Northern Arizona University, United States
Yasir Zaki, New York University Abu Dhabi, United Arab Emirates
Ellen Zegura, Georgia Tech, United States
Mariya Zhelva, SUNY Albany, United States

Area Editors, Technology, Media, and Social Practice
Michael Best, Georgia Institute of Technology, United States
Carleen Maitland, Penn State University, United States
David Nemer, University of Virginia, United States
Balaji Parthasarathy, IIIT Bangalore, India
Nimmi Rangaswamy, IIIT Hyderabad, India
Emrys Shoemaker, Caribou Digital, United States
Rajesh Veeraraghavan, Georgetown University, United States

Journal Administrator
Rebecca Malone, KGL Editorial, United States

Headquarters Staff
Scott Delman, Director of Publications
Sara Kate Heukerott, Associate Director of Publications
Yubing Zhai, Journals Editor
Stacey Schick, Associate Editor
Craig Rodkin, Publications Operation Manager
Barbara Ryan, Intellectual Property Rights Manager
Bernadette Shade, Print Production Manager
Anna Lacson, Content QA Specialist
Darshanie Jattan, Administrative Assistant

The ACM Journal on Computing and Sustainable Societies (ISSN:2834‐5533) is published quarterly in Spring, Summer, Fall, and Winter by the Association for Computing Machinery (ACM), 1601 Broadway, 10th Floor, New York, NY 10019-7434. Winter 2023. Periodicals class postage paid at New York, NY 10001, and at additional mailing offices. Printed in the U.S.A. POSTMASTER: Send address changes to ACM Journal on Computing and Sustainable Societies, ACM, 1601 Broadway, 10th Floor, New York, NY 10019-7434. For manuscript submissions, subscription, and change of address information, see inside backcover. Copyright ©2023 by the Association for Computing Machinery (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: [email protected] or fax Publications Department, ACM, Inc. Fax +1 212-869-0481. For other copying of articles that carry a code at the bottom of the first or last page or screen display, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Article 9

Designing for Educational Resilience to Reduce School Dropout: A Case Study of Namibian San Learners

ROSETHA KAYS, Namibia University of Science and Technology, Namibia
NASKA GOAGOSES, Carl von Ossietzky University of Oldenburg, Germany
HEIKE WINSCHIERS-THEOPHILUS, Namibia University of Science and Technology, Namibia

School dropout amongst minority groups is a serious problem worldwide. Focusing on educational resilience can offer a novel beneficial approach to overcome the challenge. In this paper, we focus on San learners in Namibia, gaining a deeper understanding of the adversities they face resulting in school dropout, and identifying personal and environmental factors that promote the development of educational resilience. Narrative interviews were conducted with ten San who completed secondary school and ten who dropped out of school, to identify challenges and factors. Rich picture sessions with twelve primary learners were used to identify current challenges. Based on an in depth analysis of the local adversities, a mobile digital application was designed, deploying a role model approach with local content. This empirical study provides a contribution to local technology design for social and mental well-being as part of a holistic solution for resilience building among marginalized learners.

CCS Concepts: • Human-centered computing → Empirical studies in HCI;

Additional Key Words and Phrases: Resilience, minority learners, marginalized communities, Namibia

ACM Reference format: Rosetha Kays, Naska Goagoses, and Heike Winschiers-Theophilus. 2023. Designing for Educational Resilience to Reduce School Dropout: A Case Study of Namibian San Learners. ACM J. Comput. Sustain. Soc. 1, 2, Article 9 (December 2023), 21 pages. https://doi.org/10.1145/3616389

1 INTRODUCTION

Learners from minority, marginalized, or indigenous groups have comparatively high dropout rates in primary and secondary schools worldwide [26, 31, 37]. While adversities faced might be similar, the challenges that need to be overcome by learners from marginalized or minority groups cannot simply be generalized, as these are unique for specific communities and bound to their individual micro- and macro-system environment [6]. In order to formulate local recommendations for learners, teachers, school counselors, principals, policy-makers, and technology

We are grateful for the financial support by the National Commission on Research, Science and Technology and MTC Namibia. Authors’ addresses: R. Kays and H. Winschiers-Theophilus, Namibia University of Science and Technology, 13 Jackson Kaujeua Street, Windhoek, Namibia; emails: [email protected], [email protected]; N. Goagoses, Carl von Ossietzky Universität Oldenburg, Ammerländer Heerstraße 114-118, 26129 Oldenburg, Germany; email: [email protected]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2834-5533/2023/12-ART9 $15.00 https://doi.org/10.1145/3616389 ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 9. Publication date: December 2023.


designers, it is necessary to explore the phenomenon of school dropout within a local context. In this paper, we focus on the case of learners from San communities, a minority group in Namibia with a high school drop-out rate. Although some factors leading to school drop-out have been identified in previous related studies (see [24]), adversities and factors promoting resilience have not been examined among this minority group. Recognizing the inspirational value of role models, the //Ana-Djeh San Trust published a booklet of motivational stories to encourage fellow San learners to complete school. Building on these efforts and on theories of educational resilience, we set out to explore the design of technology that promotes resilience building among San learners. While the design and use of technologies for social and mental health have increased, issues of human-computer interaction (HCI) need to be further addressed to ensure the effectiveness of the tools [3]. Bennett-Levy et al. [5] advise shifting the focus from digital mental health to digital social and emotional well-being, and using a community-based participatory approach to engage indigenous communities, to improve the relevance and cultural appropriateness of interventions and digital tools. Following a similar approach, we address the following research questions in this paper: (1) What adversities do San learners in Namibia experience? (2) What individual, family, and environmental factors could promote the development of their educational resilience? (3) How can we present relevant content within an app aiming to build educational resilience?
The consecutive objectives of the current study were (1) to identify challenges and adversities affecting San learners' educational attainment, and (2) to explore factors that could promote the development of their educational resilience, based on the stories and advice of San who have successfully completed secondary school and those who dropped out of school. Based on these findings and the notion of role models, a further objective was (3) to design, develop, evaluate, and refine a mobile application with and for San primary learners, which incorporates the stories about adversities and resilience from the San, as well as additional resources. This paper therefore provides a contribution to local technology design for social and mental well-being, and educational resilience in particular, with an empirical study based on an in-depth analysis of the particular local context of San youth in Namibia. We promote a role-model-based approach with local content, inspired by an indigenous San initiative, for the design of a resilience-building app. We propose the app as part of a holistic solution for resilience building among marginalized learners.

In the following sections, we present the theoretical framing in educational resilience and related work in designing for well-being. Thereafter, we describe the educational context of San learners in Namibia, our community-based research approach, study participants, and methods. We then present the findings, followed by a discussion on the significance of identifying local adversities and the development of a scalable and sustainable resilience app, among other interventions supporting educational resilience building among learners.

2 RELATED WORK

We provide a theoretical framing in educational resilience and related work on designing for social and mental well-being.

2.1 Educational Resilience

Resilience may be viewed as a state, whereby it is seen as an adaptive function of social and cultural contexts that facilitate growth [34]; thus, external resources, such as a supportive family, school, and community must be accessible to trigger individual characteristics toward particular behaviors [34]. This conceptualization is consistent with the Bioecological Theory of Human Development [6], and the Transactional Social-Ecological Perspective (i.e., Social Ecology of Resilience Theory, Ungar [59]), which postulates that resilience is a bi-directional process between


the child and received resources from various social-ecological levels. In the current study, we focus specifically on educational resilience, which is defined as the increased likelihood of achieving educational success, despite environmental adversities and personal vulnerabilities [61]. Recently, a set of international studies have highlighted the importance of educational resilience for minorities and marginalized communities (e.g., [10, 12, 51, 63]). The educational resilience framework has also been used to understand the re-enrollment of students after initial dropout, and the subsequent attainment of degrees [61]. Resilience can be promoted by fostering protective factors, such as individual, family, and environmental resources, that counteract or buffer the adversities faced by learners [36, 39, 61]. Individual resources or attributes associated with educational success include cognitive abilities, self-confidence, positive future aspirations, and the motivation to succeed [36, 53, 61]. Although the family socio-economic background is usually a strong driver for adversity (e.g., social and material disadvantages), there are family and home environment factors that are associated with educational resilience; these include parental support, involvement, and educational expectations/encouragement [36, 53, 61]. The school environment provides another context for building educational resilience, with learners being influenced by teachers’ feelings about their abilities and expectations of their success [53, 61]. Social support from family, community members, teachers, and peers is another important factor which contributes positively to educational resilience in learners [21, 53, 61]. Qualitative studies conducted in South Africa underscore the importance of these resources, demonstrating that factors such as motivation, educational aspirations, presence of role models, and social support foster educational resilience [11, 38]. 
As resilience is in part determined by the availability of external resources [34], and should be supported in contextually-sensitive ways [57], it is important to understand the specific adversities faced by students in a given context. Thus, the adversities and challenges faced by San youth and learners need to be explored by examining the availability or lack of individual, family and environmental resources.

2.2 Designing for Social and Mental Well-being

Technology-based health interventions and Human Computer Interaction (HCI) design for social and mental well-being have gained traction. HCI studies have explored different design and development approaches, as well as the design of specific technologies, while working towards strategies for measuring impact.

2.2.1 Design Approaches. Empirical studies in HCI have proposed, among others, an aspiration-based approach [50, 58] and an object- or material-based approach [15, 29] as means of action for sustainable development, exploring how technology design can be part of psychological resilience, mental well-being, and supporting social change. Diefenbach et al. [15] stated that technology can facilitate well-being, as the object-mediated nature of activities could improve well-being since objects can inspire new healthy behaviours. They go on to emphasize the significance of considering the accurate function or contact with the object in order to create a positive experience. According to Kauhondamwa et al. [29], using accessories to promote well-being is beneficial, since they can stimulate and remind the individual of goals and the prospect of personal progress. On the other hand, Toyama [58] promotes the aspiration-based method, in which individuals are asked what they would like to change about themselves or their lives over the next five years. He further asserted that investigations should ask questions like “What do people persistently desire? Whom do they see as role models? What do local norms and attitudes value?” Toyama [58] advises researchers that they may need to spend a significant amount of time with the community before truly uncovering their desires, and suggests that researchers focus not on scarcity in the individual’s current circumstances, but rather on long-term goals, desires, dreams, and personal


growth. Indeed, a logical review of the use of technology combined with aspiration-based and object-based approaches can be described as beneficial, as it can open new opportunities for mental well-being interventions.

2.2.2 Digital Tools for Social and Mental Well-being. Studies on the use of digital tools for mental and social well-being have expanded significantly in recent years. Naslund et al. [44] suggest that digital technology has the potential to improve access to and the quality of mental health care. In light of this, digital technology has been successfully employed to treat and prevent mental disorders in low- and middle-income nations such as Zimbabwe and the Democratic Republic of the Congo [44]. For example, a mobile short message service (SMS) text-messaging screening tool for depression was utilized to reach refugees living in social service settings [44]. Their findings suggest that online, text-messaging, and phone support treatments could be helpful. It has been shown that apps used as a complement to traditional treatments and developed in collaboration with experts, such as psychologists and psychiatrists, can be used successfully [65]. Povey et al. [52] recognize that e-mental health technologies can deliver therapy that is easily reachable, contemporary, and fulfilling. Their study aimed at exploring community members’ experiences of using e-mental health apps and their approval. Their findings suggest that e-mental health holds a lot of potential and should be promoted, while considering the inclusion of local content and culturally relevant material [52]. Recently, the use of chat bot apps has been investigated for their suitability in promoting well-being. Recognizing users’ ease of conversing with chat bots creates opportunities for extracting users’ mental state [65].
However, while chat bots are able to generate a diagnosis and provide natural language advice using artificial intelligence, the risk of bias and discrimination with fatal consequences in the treatment remains too high [65], especially within marginalized contexts. Based on a literature review, Orji and Moffatt [49] confirm that persuasive technologies are effective at promoting numerous health and wellness-related behaviors. Ndulue and Orji [45] designed and developed an African-centric persuasive game, called STD Pong, to help persuade Africans to change risky sexual behaviours, contextualised within traditional African experiences. What makes this initiative intriguing are the characters and other game features used to engage and convey African culture in the design. The advantage of this style is that it lowers the risk of the tool not being accepted and ultimately supports a change in the users’ behavior, since they can relate to the design [45]. In general, however, further empirical and contextualized research is required in an effort to enhance support to marginalized community members in emotional distress by leveraging digital technologies.

2.2.3 Impact Assessment. The assumption that technology can solve problems directs HCI4D design practices [66]. However, measuring the impact of HCI and design interventions on mental well-being, such as positively changing individual behaviours, has presented a challenge. Numerous commercial apps, promoted on the app stores and through national bodies, promise to increase mental well-being but provide no evidence of having conducted a clinical trial or other long-term evaluations [65]. On the other hand, a number of researchers have pursued the development of stress relief apps, with studies showing improvement over time [65]. Woodward et al. [65] insist that thorough impact assessments of the effectiveness of mental health tools be conducted before wider deployment.
3 RESEARCH CONTEXT AND METHODOLOGY

3.1 San Learners in Namibia

Sub-Saharan Africa has recorded the highest percentages of out-of-school children and adolescents worldwide, with 18.8% at primary and 36.7% at lower secondary school age [13]. High


dropout rates and overall mediocre academic success have been identified as major challenges in the Namibian education system [47, 48]. High drop-out rates occur amongst all ethnic groups in Namibia, yet they affect San learners the most. An examination of the school enrollment of 7-year-olds in 2015 revealed that only 43% of San children were going to school [43]. Dieckmann et al. [14] recorded that as many as 90% of students had dropped out of school, when interviewing 609 San across Namibia. This trend has persisted for San learners even after the local government made an effort to include them in the mainstream education system [25]. After two decades of Namibia’s independence, San people are among the most marginalized and economically disadvantaged [14, 24]. The Namibian government approved a development program, known as the Division of Marginalized Communities, in 2005. The objective of the program is to integrate marginalized communities, by focusing on financial educational support, resettlement land redistribution, livelihood support, and food aid. Furthermore, non-governmental organizations, like The San Council and //Ana-Djeh San Trust, are looking into ways they can support the educational success of San learners [24]. They speculate that a major hindrance is that San learners do not have role models from their community to look up to, unlike other tribes who have successful representatives across all spheres of political and economic sectors. Generally, school drop-out has been attributed to pregnancy, economic factors, system factors, parental involvement and cultural factors across Namibia [46]. Fernando et al. [24] report that San learners, in particular, face more financial hardship and also get discriminated against by fellow learners and teachers from different ethnicities, once they reach secondary schools.
Stereotyping of the San has been a critical issue, with representatives from other ethnicities describing them as drunkards, infantile, and incapable of sustaining themselves, among many other negative associations and naming [24]. Most San live on remote rural resettlement farms, where primary learners attend nearby schools as a group. However, on transitioning to secondary school, and in the absence of secondary schools nearby, San learners are sent to different schools across the country, far away from their home and often separated from their friends. The San learners find it hard to cope, and drop out to return to their families [54]. In an attempt to prevent ethnicity-related drop-outs in Namibia, the //Ana-Djeh San Trust published a booklet with inspirational stories of San female and male role models, to motivate San learners to complete school despite facing adversities. Particularly for underrepresented and stigmatized students, role models offer motivation to set ambitious educational goals and achieve these [41], whilst the absence of role models is believed to negatively impact their education [1]. As a few studies indicated that role models are related to resilience [11, 38], we postulate that focusing on San learners’ educational resilience and developing a fitting mobile app can offer a novel approach, complementing other initiatives focusing on the mitigation of macro-level factors. Developing a mobile app was chosen as it allows for more dynamic content creation and sharing, a wider range of possible media inclusion (e.g., videos), a wider geographical reach (i.e., to include remote communities), and may have more appeal for youth.

3.2 Community-Based Research

In this study, we have used a community-based research-through-design methodology, which builds on community-based co-design. Research through design is a practice-based form of inquiry, whereby new knowledge is created in the process and through the design artefact itself [67].
For example, a mobile application designed within a research context incorporates and represents new insights gained within the development process. Community-based co-design is methodologically derived from action research, in terms of seeking a workable and transferable solution to a real-life challenge, and from participatory design as a politically motivated set of methods ensuring that unheard voices are integrated through a full involvement of the community in the technology design process [64].


Mathikithela and Wood [40] state that youth participatory action research, engaging youth in solution development of their own contextual challenges, can provide youth with confidence in their abilities to become resilient and perceive themselves as agents of change. In this research, the technology development itself was guided by design thinking, a human-centered methodology. Rather than concentrating purely on the functionalities and specifications of the technology, this approach focuses on empathizing with the users, through active listening to their personal expressions of challenges, feelings, and needs. Methods such as narrative interviews, rich pictures, and motivational cards facilitate the empathizing process. The approach enables the development of culturally and contextually appropriate solutions to problems [7]. Thus, through the involvement of the San youth and learners in the development process of a software application and its content, their very own voices are represented.

3.3 Participants

Our case study involved three different San groups in Namibia. We previously collaborated with some of the participants on other research projects, and thus used a snowball sampling technique for recruitment. The first group consisted of ten San youth, who successfully completed secondary school and were registered at various tertiary education institutions in the capital city Windhoek. Only one of the participants is a student at the researchers’ affiliated university. All were receiving government support from the Division of Marginalized Communities. The group consisted of three females, aged 25 to 30, and seven males, aged 19 to 25. The second group consisted of ten San youth, who had dropped out of school and were now residing in a rural San community resettlement farm. The community is also supported by the Division of Marginalized Communities, as well as the Desert Research Foundation Namibia.
A long-term research and development engagement with this specific community has been established. All drop-out youth currently residing in the resettlement farm were invited to participate, of which ten volunteered. The group consisted of five females, three of whom dropped out in primary school and two in secondary school, and five males, one of whom dropped out in primary school and four in secondary school. Their ages ranged between 19 and 30. The third group of participants consisted of learners from grades 4 to 7 attending the local primary school. The school is three kilometers away from the resettlement and has a hostel where the learners stay during the week. All sessions were conducted on the weekend, in the resettlement, with the consent of the parents. All learners who were present at the time could opt to partake in each of the three sessions (namely establishing current challenges and two mobile app evaluations) conducted over the period of a few months. In the first and third sessions, seven boys and five girls participated, and in the second session, only six boys and four girls participated. The participating learners were not able to express themselves fluently in English, but only in Afrikaans or their own San language, although they understood English instructions fairly well. Thus, all sessions were co-facilitated by our late native co-researcher or a local translator.

3.4 Ethical Considerations

The study received ethical clearance at the institutional level, as well as national support under a bilateral grant targeting youth at risk. The University has established a long-term research collaboration with the rural San community and has been running multiple projects that focus on the socio-economic uplifting of the community.
On this project, three of the researchers are academics (two computer scientists and one educational psychologist), and the fourth was a late San activist who had been a research collaborator on multiple projects and facilitated most sessions. The project was undertaken in accordance with community-defined ethical interaction protocols [30]. The project aims and the purpose of individual sessions were explained to the participants, in their preferred language. Participation was voluntary, and all participants gave informed consent, as did parents of


Fig. 1. Research procedure, including the phases and methods, as well as involved researchers and participants.

minors. The participants consented to us taking photos and audio/video recordings of the sessions, and gave their permission for these to be incorporated in the mobile application and in written academic papers. 3.5

Design and Procedure

Aligning with our objectives, the research process consisted of three phases, namely: (1) identifying adversities based on past and current challenges, (2) contextualized exploring of educational resilience, and (3) iterative prototype development and evaluation (see Figure 1).

3.5.1 Phase 1: Identifying Adversities Affecting Educational Attainment. Identifying Past Challenges. Individual narrative interviews were held with San from the first two groups, namely those who completed secondary school and those who dropped out. The purpose of the interviews was to identify challenges they had faced in the past. The participants were prompted with an open-ended question, asking them to tell their personal life story in relation to their school career (i.e., autobiographical stories about their progression through school). When they indicated that they were done, they were asked whether they had any additional information or comments they thought were important for understanding why they successfully completed school or dropped out. Those who had dropped out were also asked what they would do differently if they could go back to their school years. At the end, all participants were asked to give short motivational advice to San learners currently in school. The interviews were (video) recorded and transcribed, and some were translated from Ju/’hoansi to English. To analyse the data, we conducted an inductive qualitative content analysis [19], following a data-driven approach to derive thematic categories [33]. After extensive familiarisation with the data, one researcher extracted salient themes regarding adversities in the participants’ narrative interviews and motivational advice, denoting codes and organizing/structuring these into higher-order categories. A second researcher compared the interview transcripts with the derived categories and the paraphrased findings.


R. Kays et al.

Identifying Current Challenges. A full-day session, divided into four phases, was held with San learners from the third group, namely those currently attending primary school. The purpose was to identify the current challenges faced by San learners in primary school. In the first phase, the learners watched video recordings of the interviews from the San who had completed secondary school; this helped provide a better understanding of the project purpose. In the second phase, we relied on a rich picture method, in which situations are explored and expressed visually to create preliminary mental models that aid open discussions and a shared understanding of the situation [9]. In groups of five, the learners were asked to visually depict their current challenges using the provided stationery and magazines. Thereafter, individual group discussions were held, followed by a short presentation of the rich pictures to the other groups. In the third phase, the learners in subgroups were tasked with finding different solutions for the identified challenges and creating a poster. The solutions were presented to the other groups, and a short discussion ensued. In the fourth phase, the learners individually created motivational cards for fellow San learners, encouraging them not to drop out of school. The learners used different craft materials to decorate their cards. Thereafter, they handed their card to another learner and gave a short inspirational speech. Similar to the interview analysis, one researcher set out to extract salient themes regarding adversities and solutions in the participants’ rich pictures. As there was a variety of solutions that could not be grouped into higher-order categories, a paraphrased descriptive summary of the codes was created. A second researcher compared the rich pictures with the presented findings.

3.5.2 Phase 2: Contextualized Exploring of Educational Resilience.
To explore what individual, family, and environmental factors could promote the development of their educational resilience, we closely examined the previously identified thematic categories and the differential responses towards these adversities. Specifically, we focused on uncovering commonalities and differences in the stories narrated by the San who had completed school and those who had dropped out; these are presented in the identified adversities in the results section. We concentrated on emphasized protective factors or positive individual experiences which enabled educational attainment, and which have been theoretically and empirically linked with educational resilience.

3.5.3 Phase 3: Iterative Prototype Development and Evaluation. The results of the content analysis, as well as the stories, rich pictures, and motivational advice, directly informed the design of the mobile app. A high-fidelity mobile app prototype was developed to enable the participants to understand what a final solution could look like. Thus, the San youth could see how their stories are presented and the learners could see how to look for resources, motivations, and solutions in the mobile app. During the evaluation phases, the youth and learners could give design ideas and specific change requests; these were then incorporated in the next prototype. The first prototype contained sections of the narrative interviews, categorized according to themes identified through the inductive content analysis (see a screenshot in Figure 7 left). The evaluation of the first prototype focused on the current learners’ experiences in exploring the application and their understanding of the key themes (Figure 3). Based on the feedback, a second prototype was developed (see screenshots in Figures 6 and 7) and tested by the participants. The mobile app was developed for Android, as the community owns a few Android smartphones, which were donated as part of prior project involvements.
Interaction patterns, such as navigation and functionalities for the mobile platform, and the visual representation, including color, typography, buttons, screen density, image sizes, and icons, were developed during this phase. Five of the ten previously participating San who completed secondary school examined the app individually and provided feedback on the design of the application, the functionality, and their personal data (i.e., photo and story), by answering an open-ended survey, as an adopted technique for usability evaluations conducted in the field [16]. Nine of the ten previously participating San who dropped out of school evaluated the app in two groups, exploring the functionality as well as reading the stories (Figure 2). Subsequently, they filled in an abridged usability evaluation questionnaire created specifically for the study, consisting of nine questions using a Likert scale ranging from strongly disagree (1) to strongly agree (5), prompting for ease of use and user satisfaction [16]. Twelve current learners from the resettlement farm were split into two groups and were asked to explore and perform tasks in the application. The functionality of the app was explained to the learners, but without giving any directions or suggestions as to how to perform the tasks. The phone was passed to each learner to allow individuals to interact with it directly, while the rest observed, advised, and discussed with each other (Figure 3). This allowed for individual and collective participation by the youth, as promoted by Mathikithela and Wood [40]. The learners then filled in the same usability evaluation questionnaire as the dropout youth. Further alterations were undertaken, and a third prototype was created. The prototype testing provided an opportunity for communicative validation, with participants confirming their stories and providing feedback on the extracted themes.

Fig. 2. Dropout youth during prototype evaluation.

Fig. 3. Learners during prototype evaluation.

4 RESULTS

In the following, we present the analysis of the interviews and motivational advice separately from the rich picture session, as they stem from two different samples who have had different educational experiences (i.e., one consisting of participants who have either dropped out or completed secondary school, versus participants who are currently still in primary school). Lastly, we present the design considerations of the iterative app development.

4.1 Identifying Past Challenges from Narrative Interviews

Educational resilience is fostered through external resources at family, school, and community level, which need to be accessible to trigger individual characteristics toward particular behaviors [34]. Below we present the thematically identified adversities, contrasting which resources were available and which were lacking for San youth who successfully completed school and for those who dropped out.

4.1.1 Financial Difficulties. The participants reported great financial difficulties during their school time. They reported that their parents (or guardians) were unemployed and that they had problems paying school or exam fees. A lack of transportation required them to walk great distances to reach school, and accommodations closer to school were hard to come by. Furthermore, many reported not having clothes (i.e., school uniforms) and shoes, as well as basic toiletries. Another major challenge was the (extreme) lack of food, with participants reporting that they were often hungry or starving. These financial difficulties were reported by both the San who had completed school (n = 8) and those who had dropped out (n = 8). The San who had completed school reported taking initiative to overcome these difficulties, by getting a job (n = 2) and hunting for food (n = 1). Most of these San reported getting assistance from the government for the continuation of their education (n = 8), although noting that their financial difficulties continued throughout this phase (n = 3).

4.1.2 Family Support. Support from parents was not mentioned by any of the San, except for two who specifically mentioned the lack of support. Only four mentioned receiving support from other family members. One San who completed school stated “What motivated me to keep on was the way my parents were suffering scared me and I wouldn’t want to go live like them one day. That’s when I made up my mind to do something that will support me which brought studies to my mind.” Almost all the San who dropped out (n = 8) explained that their parents wanted them to complete school and were not happy about their decision to leave school. One San stated “My parents were not happy and told me to go back but I refused and without an argument they let me stay, they did not force me to go back.”

4.1.3 Education and School. Analyzing the interviews, we found that a major difference between the San who completed school and those who dropped out was the support they got from teachers. Six of the San who completed school recalled that they received support from one or more of their teachers during school time. Furthermore, some stories indicated that the teachers supported the students to an extraordinary extent. For instance, one San stated “When I was in grade 10 one of the teachers from the lower grades offered to take me into her house and care to support me. This teacher was a big motivation as she kept on encouraging me to study on top of providing for my needs.” This is in stark contrast to the San who dropped out, of whom five reported that they did not get any support from teachers; for instance, one San stated “Teachers also never supported us, as they would have promised in the beginning of the semester ...”. Two noted that the principal tried encouraging them, but that this was not effective. Another problem, reported more often by San who dropped out (n = 5) than by those who completed school (n = 2), was a lack of consideration for their background and circumstances at the school. One San noted “Teachers would chase me out of class if I am not wearing school uniform and this was one of the reasons I did not want to go to school anymore.” Examining the school career path of the students who completed school revealed that this was not without interruptions and problems. These San reported failing and repeating grades (n = 6), being expelled from school (n = 2), as well as dropping out and re-enrolling years later (n = 3).

4.1.4 Social Comparison and Bullying. When participants talked about their friends’ school career paths, we found distinctions between those who completed school and those who dropped out. Specifically, two San who dropped out but later completed school noted that seeing their friends continue with (and complete) school motivated them to re-enroll. Amongst the San who dropped out, five noted that their friends also dropped out, with two of these reporting that they dropped out together as a group. Two of the San who completed school stated that at one point they came to the realization that their friends were having a bad influence on them, and distanced themselves. Being bullied, teased, and/or discriminated against was only reported by the San who completed school (n = 5).

4.1.5 Personal and Interpersonal Distractions.
Interpersonal distractions, including relationships and sex, were only mentioned by two San who completed school and one who dropped out. Furthermore, one San who later completed school mentioned that she dropped out because of pregnancy, whilst two San noted that they got pregnant after dropping out. Substance use was only addressed by two male San who later completed school.

4.1.6 Beliefs and Emotional Evaluations Associated with School. Of the San who completed school, three mentioned that their childhood/school time was an emotional time. The notion of inferiority was brought up by two San who completed school; one stated “San are seen inferior to others in general in life and if you don’t have a heart like mine to ignore such you might fail.”, whilst the second mentioned that it caused her great pain as she often believed she was inferior. Two San reported feeling lonely and homesick, two who dropped out reported feeling pressure from their teachers, and another reported fear of corporal punishment. All the San who dropped out expressed regret over their decision (n = 10).

4.2 Identifying Motivational Advice

The content of the motivational advice given to San learners still in school differed between the San who completed school and those who dropped out. The San who dropped out noted rather bluntly and directly that the learners should not drop out (n = 8), such as “My motivation to the San children still in school is not to give up until you complete high school.” Four of these San also told learners to take opportunities that are given to them, and some referred to the current support they can receive from the government. The San who completed school overwhelmingly told learners that they will be facing obstacles but that they should not give up (n = 7), such as “Continue with your studies no matter you are bullied, teased or how hard it is. Life is not always easy but you will reap the fruits at the end which is your success.” Furthermore, they advised learners to keep focus and show self-discipline (n = 6) and to be aware of bad influences (n = 3); “Think about what you want to become in life and look forward to achieve your dreams. Forget about alcohol and girlfriends or boyfriends, as they will distract you from achieving your dreams”. Lastly, they noted that the learners should be themselves and/or believe in themselves (n = 5); “Be yourself and do not take to heart or feel discouraged about what other are saying about”.

4.3 Identifying Current Challenges from Rich Pictures

Most of the challenges identified by the current learners related to the financial difficulties they faced, namely their lack of food, clothes, toiletries, school supplies, and transport (see Figure 4). The learners suggested that donations from the government and other organizations could support them with their basic needs (see Figure 4). Furthermore, they suggested that the teachers and principal could provide them with stationery and a school bus, and that their parents should support them by buying summer and winter school uniforms. They also noted that the community could plant crops to help supply the hostel with food. The current learners shared with us that they often feel lonely and have the urge to talk to their parents, siblings, or friends back at home. They stated that it is challenging to contact their parents while in the hostel. They noted that talking to a teacher, especially the life skills teacher, or a trusted person can be comforting and provide support. The learners stated that they sometimes have to endure emotional abuse from others and noted that a solution could be to talk to the life skills teacher about their feelings. As this was the first session, it was challenging working with the learners, as they were quiet during discussions and hesitant when presenting in front of the others. Towards the end of the session, whilst creating the motivational messages, they started having more fun. As the motivational cards were given to fellow learners, we were not able to conduct a content analysis of these.

Fig. 4. Rich picture depicting an example of current challenges and an example of a proposed solution.

4.4 Iterative Development and Evaluation of the Prototypes

We expanded on the idea of the booklet with inspirational stories of San female and male role models, produced by the //Ana-Djeh San Trust, to motivate San learners to complete school despite facing adversities. Thus, we created a mobile application that allows San youth to share their stories and advice, with a focus on what adversities and challenges they faced, as well as stories of successfully overcoming these. The stories in the mobile app encourage current learners to overcome adversities and challenges (i.e., build educational resilience), by demonstrating that this is possible (i.e., as done by role models) and by offering advice and motivation. The first prototype included the information gained from the narrative interviews, rich pictures, and motivational advice (Figure 5). The content (i.e., interface) was categorized into key themes, as extracted from the narrative interviews, namely financial, social, emotional, personal, family, and education. These then included sections of the transcribed narrative interviews of the San who completed school and those who dropped out. A vocabulary list with a collection of themes, phrases, and other words commonly used in the application was added. Considering that many of the San experience financial difficulties (e.g., lack of finances, transport, and toiletries), we included the contact details of supporting organizations and government offices that provide support to San youth. The feedback about the mobile app from the learners was mostly positive. For instance, learners stated “In our group, we think it will be good and useful to us and we will like to have it.” and “It will motivate us; it will help us so that we must not go out of school; it will help us to stay away from boyfriends and not get pregnant; it will help you when you are far away and you are lonely.” Furthermore, they also provided us with feedback, such as “I want you to put on another movie [referring to the successful San youth’s recordings] because the one we watched is very interesting and I think you must put more about school and the rules; one about helping each other, taking care of each other and to be good friends.”

Fig. 5. First prototype of mobile application interface.
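The content organization described for the first prototype (stories grouped under six key themes, a vocabulary list, and contact details of supporting organizations) can be illustrated with a small data model. The following is only a sketch under stated assumptions: the paper does not describe the app's implementation, so every class, field, and function name here is hypothetical, the language is Python rather than Android code, and the entries are placeholders rather than study data.

```python
from dataclasses import dataclass, field

# The six key themes are taken from the text; everything else is hypothetical.
THEMES = ("financial", "social", "emotional", "personal", "family", "education")


@dataclass
class Story:
    narrator_group: str        # assumed labels: "completed" or "dropped_out"
    theme: str                 # must be one of THEMES
    excerpt: str               # section of a transcribed narrative interview
    language: str = "English"  # assumed default; other languages are possible


@dataclass
class SupportContact:
    organization: str
    details: str


@dataclass
class ResilienceLibrary:
    stories: list = field(default_factory=list)
    glossary: dict = field(default_factory=dict)   # term -> simple explanation
    contacts: list = field(default_factory=list)

    def add_story(self, story: Story) -> None:
        # Reject stories filed under a theme the interface does not offer.
        if story.theme not in THEMES:
            raise ValueError(f"unknown theme: {story.theme}")
        self.stories.append(story)

    def stories_by_theme(self, theme: str) -> list:
        # Return all stories filed under the given theme, in insertion order.
        return [s for s in self.stories if s.theme == theme]


# Placeholder content only; no real participant data is reproduced here.
library = ResilienceLibrary()
library.add_story(Story("completed", "education", "placeholder excerpt"))
library.add_story(Story("dropped_out", "financial", "placeholder excerpt"))
library.glossary["adversity"] = "placeholder explanation"
library.contacts.append(SupportContact("placeholder organization", "placeholder details"))
print(len(library.stories_by_theme("education")))
```

Keeping the theme list as a fixed tuple mirrors the fixed set of categories in the interface, while stories, glossary entries, and contacts can grow over time.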


Fig. 6. Prototype Version 2.

The second prototype was developed as a fully functional Android app, where users can upload new stories, as well as view stories stored in the app, browse categories, use a glossary, and retrieve the contact details of supporting agencies (Figure 6). The San who had completed school reported that they felt proud of their stories, and that these were being used to motivate the next generation of learners. They believe that the app will help learners become more open-minded and see reality from a different perspective. They further expressed that they think it is beneficial as it is based on real-life experiences and will allow learners to be more optimistic about their education. They gave suggestions on how to make the platform more appealing for learners and noted that the pictures needed to be refined. Furthermore, they suggested that audio and video recordings be added, and that these should be in their native language. Generally, they found the app easy to use. Although the San who dropped out were very eager to explore the app, more than half of them did not understand the themes and were not sure where to find information (Figure 7). However, based on their responses to questionnaire items, all agreed that it was easy to use, they liked the design, thought it would help solve challenges, and reported that they felt motivated using it. The majority of the current learners understood the terms and themes, but still suggested using easier words. Based on their responses to questionnaire items, most learners responded that they felt motivated when using the app, and agreed that the app could help with their challenges. They further reported in the questionnaire that the app was easy to use and navigate.

5 DISCUSSION

We recognize that reducing school dropout rates through building resilience among students at risk requires a holistic intervention that carefully considers local adversities. Based on those, a localised resilience app can be designed and deployed to support government interventions strengthening teacher training and counselling services.


Fig. 7. Theme page (left), stories (middle), selected story (right) of Prototype 2.

5.1 Local Understanding of Adversities

It is important to determine what adversities are faced by community members within a specific context, in order to take the appropriate actions in planning interventions and providing specific counselling services. In this study, we gained a deeper understanding of the adversities faced by San learners in Namibia, and identified factors that relate to educational resilience. Comparing the narratives of the San who completed secondary school and those who dropped out revealed that they faced very similar adversities. In both groups, there were financial difficulties and no real mention of family support. Knowing that family support may be lacking for San learners, this could be an important point to address in order to promote their educational resilience. Bryan et al. [8] note that school counselors are in a unique position to organize equity-focused school–family–community collaborations, which allow the promotion of educational resilience. Many San youth reported that there was no consideration for their background or current challenges, which led to further disadvantages. One San recalled, “We had to wake up early to walk a far distance as the school was quite far. This caused you to get to school a bit late and then you would be punished for late coming. Punishment could be in the form of cleaning the school while others are attending classes which cause you to miss out on school work.” While such treatment needs to be addressed by authorities, it is important to provide a platform for San to safely raise their concerns to a responsible organization tasked with tackling this delicate subject. Equally, minority groups elsewhere in the world would benefit from a counselling service which addresses their specific issues and considers their socio-cultural circumstances. Many researchers in HCI have pointed out the significance of understanding local user contexts and considering them in the development of digital tools, especially in designing for social and mental well-being [5, 45]. As digital tools have the capacity to enhance health services [44, 52], numerous features could be implemented addressing specific local needs.

5.2 The Resilience App as a Scalable and Sustainable Solution

We designed a mobile application for San learners in primary school, incorporating the narratives and motivations of San who completed secondary school and of those who dropped out. The app allows current learners to become aware of the two separate educational paths followed by others in their very own community, namely early school leaving versus pursuing higher degrees. It provides older San members an opportunity to share their stories and motivational advice with the current learners. The app provides current learners with an outlook on how their life can unfold and what they can expect along the way. The stories reveal that all San had challenges and setbacks, but that it is important to persevere. While the San who graduated spoke from experience based on overcoming obstacles, the dropouts expressed regret. In their narratives, some of the San reported that they were determined to succeed based on having seen others in their community succeed (e.g., their friends) or fail to achieve their goals (e.g., their parents). Although the mobile application requires San youth to have access to a smartphone, which can be costly or logistically challenging, it provides a platform for expanding on the previously developed booklet by the //Ana-Djeh San Trust. Specifically, the mobile application offers a dynamic solution, allowing San youth to continuously upload and share new stories, as well as resources. We envision the app to be part of a bigger educational resilience program led by educational stakeholders, ranging from government officials to classroom teachers. Reflecting on the app and its usage, we maintain that such an app could also benefit minority groups in other countries, although further improvements are still needed and should be considered in future research. For one, the aspect of social support needs to be strengthened. For instance, the application could allow for the building of networks amongst current learners or connect them with teachers or graduates of the same ethnicity who are willing to provide support. Theron et al. [56] demonstrated noteworthy improvements in children’s assessment of community-based and personal resilience resources after 22 weekly reading sessions of carefully selected indigenous resilience-promoting stories, with African children overcoming challenges. Fernando et al. [24] assume that role-model stories provide coping strategies to fellow San youth suffering similar hardships. Equally, East et al. [18] claim that listening to personal stories can enable resilience-building through reflecting on how the experiences of others can be incorporated in one’s own life.

5.3 Further Initiatives to Foster Resilience

Examining a summer program aimed at preventing the dropout of students (from lower socioeconomic backgrounds) as they transitioned to secondary school, Vinas-Forcade et al. [60] found an impact on social integration, i.e., increasing students’ sense of school commitment, and aiding in the development of relationships with teachers and peers. Offering such a program, as well as implementing other resilience-promoting interventions [23], could be beneficial for young San children. However, future research would be needed to determine the applicability and effectiveness of previously developed programs for this community. Striving for educational inclusion requires a shift in institutional practices reconsidering learners’ social positioning [2, 17]. Thus, we suggest increasing initiatives, such as the youth participatory action research project presented by Mathikithela and Wood [40], which involved learners in reflecting on factors affecting their well-being at school and seeking solutions, thereby gaining agency and developing resilience. The benefit of engaging learners in re-shaping their own learning context has previously been established, particularly the role of children in participatory design [20, 27]. Among other activities, learners should continue shaping technologies, such as the resilience app, to ensure relevance and impact.
5.4 Teacher Support

Teacher support has been linked to student engagement, achievement, and well-being [32, 55]. Johnson [28] found that teachers showing empathy, listening to students, and providing support help at-risk children cope better. Liebenberg et al. [35] argue that being respected by teachers is of great importance to students and contributes to the process of developing resilience. Stark differences emerged in the school-related aspects of the narratives, with the San who completed secondary school recalling that they received teacher support, whilst those who dropped out did not. The San reported that teachers cared for them, encouraged them, and assisted with school work. It is noteworthy that the support provided by teachers was often to an extraordinary extent, with teachers additionally providing accommodation or financial assistance. Nairz-Wirth and Feldmann [42] found that teachers primarily attribute school disengagement and dropout to personal and family factors. School-related causes, such as educators’ “inadequate recognition of the cultural and symbolic capital of underprivileged students, and the differences of fit, are seen as natural, rightful, expectable and legitimate” and are largely ignored [42]. Thus, paying attention to the role of teachers is of utmost significance considering that they can either cause a student’s dropout or contribute to a successful completion of school. It is therefore suggested that, as part of teachers’ lifelong training, empathy and cultural sensitivity should be emphasized further. The resilience app could be used by teachers, browsing through the collected stories of the students to foster empathy and browsing through the teacher stories to guide their positive attitude and behaviours, reemphasizing the importance of teacher support and care.

5.5 Recommendations for Counseling

To address the importance of educating teachers to recognize the unique challenges faced by minority and marginalized students within the community and the school, school counselors could provide training to improve sensitivity towards the problem (e.g., creating awareness, improving understanding, assisting in the reflection of beliefs, improving ethnocultural diversity knowledge, and delineating practices for inclusive education), as well as drawing out possibilities for action based on the teachers’ skills and desired level of involvement (e.g., creating mentoring tandems, coordinating and implementing school-family-community programs). School counselors could thus promote an inclusive school environment for minority and marginalized students, which in turn may reduce dropout rates [22]; interventions targeting social belonging could lay the foundation for this [62]. In schools which are more diversity-friendly, minority students are more likely to form and maintain positive teacher-student relationships [4]. Bryan et al. [8] note that school counselors play an important role in fostering educational resilience and opportunities, which fits well with the statements made by the current learners, who often referred to the help and support they could receive from the life skills teacher. The collected content on the resilience app could serve as relevant training material for teachers across the region.

5.6 Current and Future Work

The overall objective of the study is to design an appropriate intervention to build resilience among primary and secondary San learners in Namibia. We are currently refining the mobile resilience app to implement the recommendations from previous usability and refinement testing sessions. Suggestions included audio and video recordings of the motivations in the San native language, as well as adding cultural proverbs (traditional sayings) for motivation to the app. Since 2020, we have run two more co-design and evaluation sessions and have further investigated resilience building among the learners. We have secured smartphones for the community with access to the resilience app for the learners, enabling continuous data collection. It has to be noted that the design of the app is based on an in-depth analysis of the local context; however, its effectiveness for resilience building will need to be established through a longitudinal study. We have therefore traced the primary learners who were involved in the original study in 2020. As we refine the app, we will further engage parents, teachers, curriculum designers, and policy makers to ensure that an integrated and sustainable approach is established.

5.7 Limitations

Limitations include (1) the app being developed in English (with the stories being uploaded in either English or Ju/’hoansi), which may have influenced the ease of usability, (2) learners not using the app over a longer period of time, owing to a past lack of phones, although longer use may have provided further insight into usability and potential challenges, (3) although learners were also critical of the app in the evaluation, they might have answered in a socially desirable way in their written responses, and (4) although the age range was kept similar in the two groups of interviewed youth, we did not record how much time had passed since they left school, which could have an influence on the recalled memories and retrospective evaluations.

6 CONCLUSION

Minority students face many adversities, which often result in higher dropout rates [26, 31, 37]; however, educational resilience may buffer against adversities and decrease dropout rates [61]. San learners dropping out of school is a very pressing educational challenge in Namibia. We found financial difficulties and lack of family support to be the main adversities, whilst teacher support was identified as a factor relevant for successfully completing secondary school. In order to build resilience towards those adversities, we developed a mobile application prototype with and for San learners in primary school, incorporating the notion of role models. The app allows for the transfer of real-life stories and motivational messages from one generation of learners to the next. A support service that transfers stories and motivational messages from members of the community via a mobile application is adaptable to other minorities; this approach allows potential role models to address the adversities and potential resilience-building factors unique to each minority.
Through this study, we have contributed insights on the possible role and contextualized design of a resilience app promoting social and mental well-being of marginalized learners. We hope that the insights gained from this project will encourage other researchers to build on it and work collaboratively to tackle the increasing school dropout rate amongst minority groups across the globe.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available on request from the first author. The data are not publicly available because they contain information that could compromise the privacy of research participants.

ACKNOWLEDGMENTS

We would like to acknowledge the invaluable and special contributions of our late co-author Helena Afrikaner. We thank all San participants.

REFERENCES

[1] Benta A. Abuya, Elijah O. Onsomu, and Dakysha Moore. 2012. Educational challenges and diminishing family safety net faced by high-school girls in a slum residence, Nairobi, Kenya. International Journal of Educational Development 32, 1 (2012), 81–91.
[2] Eric Daniel Ananga. 2013. Child migration and dropping out of basic school in Ghana: The case of children in a fishing community. Creative Education 4, 06 (2013), 405.
[3] Luke Balcombe and Diego De Leo. 2022. Human-computer interaction in digital mental health. Informatics 9, 1 (2022). https://doi.org/10.3390/informatics9010014
[4] Gülseli Baysu, Jessie Hillekens, Karen Phalet, and Kay Deaux. 2021. How diversity approaches affect ethnic minority and majority adolescents: Teacher–student relationship trajectories and school outcomes. Child Development 92, 1 (2021), 367–387.
[5] James Bennett-Levy, Judy Singer, Darlene Rotumah, Sarah Bernays, and David Edwards. 2021. From digital mental health to digital social and emotional wellbeing: How Indigenous community-based participatory research influenced the Australian government’s digital mental health agenda. International Journal of Environmental Research and Public Health 18, 18 (2021), 9757.
[6] Urie Bronfenbrenner and Pamela A. Morris. 2007. The bioecological model of human development. Handbook of Child Psychology 1 (2007).
[7] Tim Brown and Jocelyn Wyatt. 2010. Design thinking for social innovation. Development Outreach 12, 1 (2010), 29–43.
[8] Julia Bryan, Joseph M. Williams, and Dana Griffin. 2020. Fostering educational resilience and opportunities in urban schools through equity-focused school–family–community partnerships. Professional School Counseling 23, 1_part_2 (2020), 2156759X19899179.
[9] Peter Checkland. 2000. Soft systems methodology: A thirty year retrospective. Systems Research and Behavioral Science 17, S1 (2000), S11–S58.
[10] Dhiman Das. 2019. Academic resilience among children from disadvantaged social groups in India. Social Indicators Research 145, 2 (2019), 719–739.
[11] Priscilla Dass-Brailsford. 2005. Exploring resiliency: Academic achievement among disadvantaged black youth in South Africa. South African Journal of Psychology 35, 3 (2005), 574–591.
[12] Jessica J. De Feyter, Mayra D. Parada, Suzanne C. Hartman, Timothy W. Curby, and Adam Winsler. 2020. The early academic resilience of children from low-income, immigrant families. Early Childhood Research Quarterly 51 (2020), 446–461.
[13] Lydia Deloumeaux. 2019. New Methodology Shows 258 Million Children, Adolescents and Youth are Out of School.
[14] Ute Dieckmann, Maarit Thiem, Eric Dirkx, and Jennifer Hays. 2014. Scraping the Pot: San in Namibia Two Decades after Independence, Produced by the Legal Assistance Center and Desert Research Foundation of Namibia. Meinert, Windhoek, Namibia.
[15] Sarah Diefenbach, Marc Hassenzahl, Kai Eckoldt, Lena Hartung, Eva Lenz, and Matthias Laschke. 2017. Designing for well-being: A case study of keeping small secrets. The Journal of Positive Psychology 12, 2 (2017), 151–158.
[16] Henry Been-Lirn Duh, Gerald C. B. Tan, and Vivian Hsueh-hua Chen. 2006. Usability evaluation for mobile device: A comparison of laboratory and field tests. In Proceedings of the 8th Conference on Human-Computer Interaction with Mobile Devices and Services. ACM, NY, USA, 181–186.
[17] Máiréad Dunne and Eric Daniel Ananga. 2013. Dropping out: Identity conflict in and out of school in Ghana. International Journal of Educational Development 33, 2 (2013), 196–205.
[18] Leah East, Debra Jackson, Louise O’Brien, and Kathleen Peters. 2010. Storytelling: An approach that can help to develop resilience. Nurse Researcher 17, 3 (2010), 17–25.
[19] Satu Elo and Helvi Kyngäs. 2008. The qualitative content analysis process. Journal of Advanced Nursing 62, 1 (2008), 107–115.
[20] Jerry Alan Fails, Mona Leigh Guha, and Allison Druin. 2013. Methods and techniques for involving children in the design of new technology for children. Foundations and Trends® in Human–Computer Interaction 6, 2 (2013), 85–166.
[21] Guangbao Fang, Philip Wing Keung Chan, and Penelope Kalogeropoulos. 2020. Social support and academic achievement of Chinese low-income children: A mediation effect of academic resilience. International Journal of Psychological Research 13, 1 (2020), 19–28.
[22] Muhammad Shahid Farooq. 2013. An inclusive schooling model for the prevention of dropout in primary schools in Pakistan. Bulletin of Education and Research 35, 1 (2013), 47–74.
[23] Amanda Fenwick-Smith, Emma E. Dahlberg, and Sandra C. Thompson. 2018. Systematic review of resilience-enhancing, universal, primary school-based mental health promotion programs. BMC Psychology 6 (2018), 1–17.
[24] Kileni Fernando, Tertu Fernandu, Simpson Kapembe, Kamati Isay, and Japeni Hoffeni. 2018. A Contemporary Expression of the Namibian San Communities’ Past and Present Sufferings Staged as an Interactive Digital Life Performance. Springer, Singapore, 205–221.
[25] Jennifer Hays. 2011. Educational rights for indigenous communities in Botswana and Namibia. The International Journal of Human Rights 15, 1 (2011), 127–153.
[26] Ralph Hippe and Maciej Jakubowski. 2018. Immigrant background and expected early school leaving in Europe: Evidence from PISA. Publications Office of the European Union 10 (2018), 111445.
[27] Ole Sejer Iversen, Rachel Charlotte Smith, and Christian Dindler. 2017. Child as protagonist: Expanding the role of children in participatory design. In Proceedings of the 2017 Conference on Interaction Design and Children. ACM, NY, USA, 27–37.
[28] Bruce Johnson. 2008. Teacher–student relationships which promote resilience at school: A micro-level analysis of students’ views. British Journal of Guidance & Counselling 36, 4 (2008), 385–398.
[29] Maria Kauhondamwa, Heike Winschiers-Theophilus, Simson Kapembe, Hiskia Costa, Jan Guxab, Isay Kamati, and Helena Afrikaner. 2018. Co-creating personal augmented reality accessories to enhance social well-being of urban San youth. In Proceedings of the Second African Conference for Human Computer Interaction: Thriving Communities. ACM, NY, USA, 1–10.


[30] Peter Kaulbach, Helena Afrikaner, Brit Stichel, and Heike Winschiers-Theophilus. 2021. Crafting communication protocols with a San community in Namibia. In 3rd African Human-Computer Interaction Conference: Inclusiveness and Empowerment. ACM, NY, USA, 170–173.
[31] Sunha Kim, Mido Chang, Kusum Singh, and Katherine R. Allen. 2015. Patterns and factors of high school dropout risks of racial and linguistic groups. Journal of Education for Students Placed at Risk (JESPAR) 20, 4 (2015), 336–351.
[32] Adena M. Klem and James P. Connell. 2004. Relationships matter: Linking teacher support to student engagement and achievement. Journal of School Health 74 (2004), 262–273.
[33] Udo Kuckartz. 2019. Qualitative text analysis: A systematic approach. In Compendium for Early Career Researchers in Mathematics Education. Springer International Publishing, Switzerland, 181–197.
[34] Seffetullah Kuldas and Mairéad Foody. 2022. Neither resiliency-trait nor resilience-state: Transactional resiliency/e. Youth & Society 54, 8 (2022), 1352–1376.
[35] Linda Liebenberg, Linda Theron, Jackie Sanders, Robyn Munford, Angelique van Rensburg, Sebastian Rothmann, and Michael Ungar. 2016. Bolstering resilience through teacher-student interaction: Lessons for school psychologists. School Psychology International 37, 2 (2016), 140–154.
[36] Jenny J. W. Liu, Maureen Reed, and Todd A. Girard. 2017. Advancing resilience: An integrative, multi-system model of resilience. Personality and Individual Differences 111 (2017), 111–118.
[37] Meichen Lu, Manlin Cui, Yaojiang Shi, Fang Chang, Di Mo, Scott Rozelle, and Natalie Johnson. 2016. Who drops out from primary schools in China? Evidence from minority-concentrated rural areas. Asia Pacific Education Review 17 (2016), 235–252.
[38] Motlalepule Ruth Mampane. 2014. Factors contributing to the resilience of middle-adolescents in a South African township: Insights from a resilience questionnaire. South African Journal of Education 34, 4 (2014).
[39] Ann S. Masten. 2014. Invited commentary: Resilience and positive youth development frameworks in developmental science. Journal of Youth and Adolescence 43 (2014), 1018–1024.
[40] M. S. Mathikithela and L. Wood. 2019. Youth as participatory action researchers: Exploring how to make school a more enabling space. Educational Research for Social Change 8, 2 (2019), 77–95.
[41] Thekla Morgenroth, Michelle K. Ryan, and Kim Peters. 2015. The motivational theory of role modeling: How role models influence role aspirants’ goals. Review of General Psychology 19, 4 (2015), 465–483.
[42] Erna Nairz-Wirth and Klaus Feldmann. 2017. Teachers’ views on the impact of teacher–student relationships on school dropout: A Bourdieusian analysis of misrecognition. Pedagogy, Culture & Society 25, 1 (2017), 121–136.
[43] UNICEF Namibia. 2015. Global Initiative on Out-of-School. https://www.unicef.org/namibia/resources_13836.htm. Accessed on 12 February, 2023.
[44] John A. Naslund, Kelly A. Aschbrenner, Ricardo Araya, Lisa A. Marsch, Jürgen Unützer, Vikram Patel, and Stephen J. Bartels. 2017. Digital technology for treating and preventing mental disorders in low-income and middle-income countries: A narrative review of the literature. The Lancet Psychiatry 4, 6 (2017), 486–500.
[45] Chinenye Ndulue and Rita Orji. 2018. STD PONG: Changing risky sexual behaviour in Africa through persuasive games. In Proceedings of the Second African Conference for Human Computer Interaction: Thriving Communities. ACM, NY, USA, 1–5.
[46] Haaveshe Nekongo-Nielsen, Nchindo R. Mbukusa, and Emmy Tjiramba. 2018. Investigating factors that lead to school dropout in Namibia. The Namibia CPD Journal for Educators 2, 1 (2018), 99–118.
[47] Ministry of Education Arts and Culture. 2016. EMIS Education Statistics 2016. https://docplayer.net/103137893-Education-management-information-system-ministry-of-education-arts-and-culture-republic-of-namibia.html. Accessed on 12 February, 2023.
[48] Ministry of Education Arts and Culture. 2017. Strategic Plan 2017/18-2021/22. https://planipolis.iiep.unesco.org/en/2017/strategic-plan-201718-202122-6449. Accessed on 12 February, 2023.
[49] Rita Orji and Karyn Moffatt. 2018. Persuasive technology for health and wellness: State-of-the-art and emerging trends. Health Informatics Journal 24, 1 (2018), 66–91.
[50] Sachin R. Pendse, Naveena Karusala, Divya Siddarth, Pattie Gonsalves, Seema Mehrotra, John A. Naslund, Mamta Sood, Neha Kumar, and Amit Sharma. 2019. Mental health in the Global South: Challenges and opportunities in HCI for development. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies. ACM, NY, USA, 22–36.
[51] Laura M. Portnoi and Tiffany M. Kwong. 2019. Employing resistance and resilience in pursuing K-12 schooling and higher education: Lived experiences of successful female first-generation students of color. Urban Education 54, 3 (2019), 430–458.
[52] Josie Povey, Patj Patj Janama Robert Mills, Kylie Maree Dingwall, Anne Lowell, Judy Singer, Darlene Rotumah, James Bennett-Levy, and Tricia Nagel. 2016. Acceptability of mental health apps for Aboriginal and Torres Strait Islander Australians: A qualitative study. Journal of Medical Internet Research 18, 3 (2016), e65.
[53] Ingrid Schoon, Samantha Parsons, and Amanda Sacker. 2004. Socioeconomic adversity, educational resilience, and subsequent levels of adult adaptation. Journal of Adolescent Research 19, 4 (2004), 383–404.


[54] Brit Stichel, Edwin Blake, Donovan Maasz, Colin Stanley, Heike Winschiers-Theophilus, and Helena Afrikaner. 2019. Namibian indigenous communities reflecting on their own digital representations. In Proceedings of the 9th International Conference on Communities & Technologies-Transforming Communities. ACM, NY, USA, 51–59.
[55] Shannon M. Suldo, Allison A. Friedrich, Tiffany White, Jennie Farmer, Devon Minch, and Jessica Michalowski. 2009. Teacher support and adolescents’ subjective well-being: A mixed-methods investigation. School Psychology Review 38, 1 (2009), 67–85.
[56] Linda Theron, Kate Cockcroft, and Lesley Wood. 2017. The resilience-enabling value of African folktales: The read-me-to-resilience intervention. School Psychology International 38, 5 (2017), 491–506.
[57] Linda C. Theron. 2016. The everyday ways that school ecologies facilitate resilience: Implications for school psychologists. School Psychology International 37, 2 (2016), 87–103.
[58] Kentaro Toyama. 2018. From needs to aspirations in information technology for development. Information Technology for Development 24, 1 (2018), 15–36.
[59] Michael Ungar. 2011. Social ecologies and their contribution to resilience. In The Social Ecology of Resilience: A Handbook of Theory and Practice. Springer, NY, USA, 13–31.
[60] Jennifer Vinas-Forcade, Cindy Mels, Martin Valcke, and Ilse Derluyn. 2019. Beyond academics: Dropout prevention summer school programs in the transition to secondary education. International Journal of Educational Development 70 (2019), 102087.
[61] Jeffrey C. Wayman. 2002. The utility of educational resilience for studying degree attainment in school dropouts. The Journal of Educational Research 95, 3 (2002), 167–178.
[62] C. Lee Williams, Quinn Hirschi, Katherine V. Sublett, Chris S. Hulleman, and Timothy D. Wilson. 2020. A brief social belonging intervention improves academic outcomes for minoritized high school students. Motivation Science 6, 4 (2020), 423.
[63] Gabrielle Wills and Heleen Hofmeyr. 2019. Academic resilience in challenging contexts: Evidence from township and rural primary schools in South Africa. International Journal of Educational Research 98 (2019), 192–205.
[64] Heike Winschiers-Theophilus, Nicola J. Bidwell, and Edwin Blake. 2012. Community consensus: Design beyond participation. Design Issues 28, 3 (2012), 89–100.
[65] Kieran Woodward, Eiman Kanjo, David J. Brown, T. Martin McGinnity, Becky Inkster, Donald J. Macintyre, and Athanasios Tsanas. 2020. Beyond mobile apps: A survey of technologies for mental well-being. IEEE Transactions on Affective Computing 13, 3 (2020), 1216–1235.
[66] Susan Wyche. 2022. Reimagining the mobile phone: Investigating speculative approaches to design in human-computer interaction for development (HCI4D). Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–27.
[67] John Zimmerman, Erik Stolterman, and Jodi Forlizzi. 2010. An analysis and critique of research through design: Towards a formalization of a research approach. In Proceedings of the 8th ACM Conference on Designing Interactive Systems. ACM, NY, USA, 310–319.

Received 14 February 2023; accepted 15 February 2023

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 9. Publication date: December 2023.

Community and Facility Health Information System Integration in Malawi: A Comparison of Machine Learning and Probabilistic Record Linkage Methods

ANNA DIXON, Medic, USA
LIMBANI THENGO, Partners In Health, Malawi
EMMANUEL KITSAO, Medic, Kenya
KONDWANI MATIYA, Partners In Health, Malawi
MOURICE BARASA, Medic, Kenya
REVELATION NYIRONGO, Partners In Health, Malawi
JENNIFER MULI, Medic, Kenya
FUNNY KAMANGA, CHIYEMBEKEZO KACHIMANGA, and FABIEN MUNYANEZA, Partners In Health, Malawi
PHILLIP NGARI, Medic, Kenya
HENRY MAKUNGWA and JONES CHIMPUKUSO, Partners In Health, Malawi
MERCY AMULELE, ELIJAH KARARI, and SIMON MBAE, Medic, Kenya

Accurate and efficient record linkage methods are essential to link patients between community health worker digital health apps and an EMR system, facilitating information flow and improving coordination of care. This study presents the eTrace workflow as an illustrative example, highlighting the benefits of enhanced coordination of care for patients in antiretroviral and non-communicable disease programs in rural Neno district, Malawi. This research focuses on the following major contributions: (1) development of a machine learning-based record linkage model for electronic health information systems, (2) comparison between the machine learning-based and probabilistic approaches to record linkage, and (3) a concrete evaluation of our approach on real data for the eTrace workflow. A review of the standard record linkage architecture and its application to health information exchange systems is also presented. An empirical comparison of the logistic regression and Fellegi-Sunter algorithms for this use case reveals comparable results. Both classifiers achieve an average precision of 0.86, while logistic regression achieves the higher recall, 0.74, at a fixed precision of 0.90.

CCS Concepts: • Applied computing → Health care information systems; • Computing methodologies → Machine learning;

Authors’ addresses: A. Dixon, Medic, 3524 19th St Floor 2, San Francisco, CA 94110, USA; e-mail: [email protected]; L. Thengo, K. Matiya, R. Nyirongo, F. Kamanga, C. Kachimanga, F. Munyaneza, H. Makungwa, and J. Chimpukuso, Partners In Health/Abwenzi Pa Za Umoyo, P.O. Box 56, Neno, Malawi; e-mails: {lthengo, kmatiya, rnyirongo, fkamanga, ckachimanga, fmunyaneza, hmakungwa, jchimpukuso}@pih.org; E. Kitsao, M. Barasa, J. Muli, P. Ngari, M. Amulele, E. Karari, and S. Mbae, Medic, Riara Road, Nairobi, Kenya; e-mails: {emmanuel, mourice, jennifer, philip, amulele, elijah, simon}@medic.org. Author’s current address: Dimagi, 585 Massachusetts Ave., Suite 3, Cambridge, MA 02139, USA; e-mail: [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2834-5533/2023/12-ART10 $15.00 https://doi.org/10.1145/3624773

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 10. Publication date: December 2023.


Additional Key Words and Phrases: Record linkage, probabilistic record linkage, logistic regression, health information exchange

ACM Reference format:
Anna Dixon, Limbani Thengo, Emmanuel Kitsao, Kondwani Matiya, Mourice Barasa, Revelation Nyirongo, Jennifer Muli, Funny Kamanga, Chiyembekezo Kachimanga, Fabien Munyaneza, Phillip Ngari, Henry Makungwa, Jones Chimpukuso, Mercy Amulele, Elijah Karari, and Simon Mbae. 2023. Community and Facility Health Information System Integration in Malawi: A Comparison of Machine Learning and Probabilistic Record Linkage Methods. ACM J. Comput. Sustain. Soc. 1, 2, Article 10 (December 2023), 16 pages. https://doi.org/10.1145/3624773

1 INTRODUCTION

Digital health initiatives in the hardest-to-reach areas have more impact when they operate with the wider digital health ecosystem [32]. The importance of Electronic Health Information System (EHIS) integration, and eventual interoperability, is emphasized by the World Health Organization’s recommendations for emerging digital health interventions [46]. Ensuring integration with other systems promotes better patient continuity of care and reduces redundant healthcare actions. For example, in Neno District, Malawi, Partners In Health (PIH), known locally as Abwenzi Pa Za Umoyo (APZU), works with Community Health Workers (CHWs) to provide a critical path to care. The community health app Yendanafe (powered by the Community Health Toolkit) supports CHWs and their patients through digital data collection, decision support workflows, supervisor support, messaging, and task management [34]. However, these patients frequently seek care from multiple providers, with each point of care operating a different EHIS. In this implementation, patients also receive care at their local health facilities, which operate the Electronic Medical Record (EMR) system OpenMRS. The exchange of information between different EHIS deployments, known as a Health Information Exchange, would allow for cases that require patients to be followed up in the community after receiving care in the facility and vice versa [31]. We present how improved integration of community health and facility health information can promote better care activities and reduce overall health worker burden.

In Neno District, Malawi, patients receive care through facility-based treatment programs, including integrated chronic care, where patients are treated at the lowest level of complexity that is safe, timely, efficient, and as close to the patient’s home as possible, as recommended by the World Health Organization [6, 13, 14, 17, 45].
In 2007, PIH/APZU, in collaboration with the Malawi Ministry of Health, developed and now maintains an EMR system, OpenMRS, in primary and secondary care facilities for the integrated care program [3, 9–12, 21, 22, 25]. Facility-based healthcare providers use OpenMRS to capture details of patients and treatment program activities, which has the added benefit of supporting program monitoring and enabling data-driven decision making, thus promoting UN Sustainable Development Goals targets 17.6 through 17.8 [2, 15, 16, 18, 19, 24, 30, 36, 40]. At the same time, CHWs provide care to these patients directly in their community, using the Yendanafe app to register, screen, refer, and follow up on patients in these treatment programs. When a patient in a treatment program misses an appointment, the health team submits an individual TRACE report to follow up on the defaulter in their home. In the proposed application, we explored working with CHWs, who are already integrated in the community, to follow up with treatment program defaulters. The eTrace workflow shown in Figure 1, a joint project between PIH/APZU and Medic, is a workflow for the Yendanafe app whereby every 2 weeks the site supervisors submit a TRACE report for antiretroviral therapy and Non-Communicable Disease (NCD) program defaulters, which triggers a follow-up task to the patient’s CHW. The CHW then visits the patient, reports their condition back to the facility, and refers the patient to the facility for further care.

Fig. 1. eTrace workflow: Yendanafe integration in TRACE workflow.

This work investigates and implements record linkage methods to link patients between Yendanafe and OpenMRS and enable the flow of information between the community health and facility health information systems. We refer to this process as patient matching. Patient matching accuracy is critical to the flow of information between the systems and subsequent patient care. To illustrate, let us consider the example of using patient matching between OpenMRS and Yendanafe for the eTrace workflow. In this example, failing to identify matching records (a false negative) is a missed opportunity to assist the patient matching procedure and further burdens staff, who must then identify TRACE patients manually. In contrast, falsely identifying matched records (a false positive) could result in a CHW receiving a task for the wrong patient, causing unnecessary work for the CHW and introducing delayed or even missed TRACE reporting for that patient.

We first explored using rule-based approaches to match patients. The most direct rule-based method to match patients across different systems is through a unique patient identifier. In an ideal system, we would match patients between OpenMRS and Yendanafe using the EMR ID. However, data quality issues resulting from the EMR ID capture process in Yendanafe did not allow for this. The EMR ID, which originates in the EMR system and is recorded in the health passport, is captured in Yendanafe when a CHW keys the information into the app. Exploratory data analysis indicates that the EMR ID field in Yendanafe is often missing or incorrect. We hypothesize that the missing EMR IDs in Yendanafe are the result of patients misplacing health passports and of EMR ID capture being optional at the onset of Yendanafe patient registration.
The exploratory data analysis also suggests that the EMR IDs present in the Yendanafe data are sometimes incorrect, likely due to transcription errors. Consequently, linking patients using only the EMR ID is very unlikely to yield a successful patient matching outcome. Upon recognizing this obstacle, the team implemented a rule-based patient matching algorithm that addressed the EMR ID data-quality issues. This algorithm recognized that links between patient records can be multivariate: matching patient records between the two systems could be based on several patient data attributes, including EMR ID, name, and sex. The algorithm follows a decision tree model, comparing a patient record from OpenMRS to one from Yendanafe (together called a record pair) and classifying them as a matching record pair based on the answers to a series of questions about closeness in EMR ID, name, and sex. One advantage of this approach is its ability to define matching records using multiple patient attributes in the possible absence of the EMR ID. The algorithm also addresses the data-quality issues inherent to having two data silos. An example of a common data attribute with differences between data sources is the patient name, exemplified by the inspired (not actual) data shown in Table 1. Given the phonetic similarities in this example, it is easy to see how patient name data quality may degrade between the patient verbally telling a health worker their name and the name being recorded in text. Yet, this matching algorithm matches patients despite these record discrepancies by allowing for approximate matches.

Table 1. Example of Inspired (Not Actual) First Name Differences in Yendanafe to OpenMRS Record Matches

First name (Yendanafe)    First name (OpenMRS)
YUNIS                     Eunice
corin                     Coreen
KHilistofa                Christopher
jakleen                   Jackline

In this article, we describe the development of a patient matching module for the eTrace workflow. This research focuses on the following major contributions: (1) development of a Machine Learning (ML)-based record linkage model for EHISs, (2) comparison between the ML-based and probabilistic approaches to record linkage, and (3) a concrete evaluation of our approach on real data for the eTrace workflow. For the probabilistic and ML approaches, we continue to develop the concepts laid out by the rule-based algorithm presented previously, whereby we measure differences between several patient record attributes and use those comparisons to determine whether patients match. This work expands the model by introducing more patient attributes, such as age, patient village, and treatment facility. Our research seeks to leverage the ability of probabilistic and ML methods to learn the priority weights of the different patient attributes for record matching.
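The rule-based idea described above, a short cascade of questions about closeness in EMR ID, sex, and name, can be sketched as follows. This is only an illustrative sketch and not the authors' algorithm: the record field names (`emr_id`, `sex`, `first_name`), the order of the questions, the use of Python's `difflib.SequenceMatcher` as the closeness measure, and the 0.7 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(yendanafe: dict, openmrs: dict, name_threshold: float = 0.7) -> bool:
    """Toy decision cascade over a record pair (hypothetical field names)."""
    # Question 1: do both records carry the same, non-empty EMR ID?
    emr_a, emr_b = yendanafe.get("emr_id"), openmrs.get("emr_id")
    if emr_a and emr_b and emr_a == emr_b:
        return True
    # Question 2: a recorded sex mismatch rules the pair out.
    if yendanafe.get("sex") != openmrs.get("sex"):
        return False
    # Question 3: are the first names approximately equal?
    score = name_similarity(yendanafe["first_name"], openmrs["first_name"])
    return score >= name_threshold


if __name__ == "__main__":
    # Name pairs taken from Table 1 (inspired, not actual, data).
    pairs = [("YUNIS", "Eunice"), ("corin", "Coreen"),
             ("KHilistofa", "Christopher"), ("jakleen", "Jackline")]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {name_similarity(left, right):.2f}")
```

A purely character-based score treats the Table 1 pairs quite unevenly, so a single hand-picked threshold accepts some of them and rejects others. That sensitivity is one practical reason to look for methods that learn how much weight each attribute comparison deserves.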
Further, this work focuses on building the module that contains the patient matching algorithm, using an industry-standard architecture composed of modularized components to enable effective configurability (e.g., changing patient attribute inputs, inserting newly trained classification models to account for concept drift) and testing of future record linkage research ideas.

The rest of the article is organized as follows. Section 2 gives a mathematical introduction to record linkage and discusses related work. Section 3 details the patient matching module architecture, including the probabilistic and ML classification algorithms. Section 4 describes the experimental results for applying the patient matching module to the OpenMRS to Yendanafe case study, including a comparison of Fellegi-Sunter and logistic regression classification models. We present our conclusion in Section 5.

2 BACKGROUND

2.1 Related Work

The global health community widely documents the need to move away from stand-alone digital health intervention “pilotitis” to more sustainable and interoperable products capable of operating in a health information exchange [26, 39]. This is evidenced by OpenHIE, an ongoing collaboration between several digital health organizations releasing evidence-based, implementable, and modular standards for health information exchanges [37]. OpenHIE offers a standardized architecture that ensures that each point of service is interoperable with all other points of service. In the OpenHIE architecture, the point of service would use a client registry, or master patient index, of all possible clients and a map of their respective IDs in all points of service to accomplish our case study’s integration objective. There are a few existing OpenHIE client registry reference technologies, of which the authors reviewed OpenCR and OpenEMPI [27, 28]. The advantages of using the OpenHIE-compliant client registries in this context include robustness, availability of rich features including a user interface, and scalability to patient matching between multiple points of service. OpenCR and OpenEMPI use rule-based and probabilistic record linkage approaches, and to our knowledge there is no reported analysis of their record-matching performance. This work contributes a comparative analysis of probabilistic and ML approaches and may guide client registry tool selection and future work.

Record linkage between two disparate sources is a major challenge for health information system integration, but record linkage processes are an established practice with more than six decades of institutional knowledge [44]. There are many applications for record linkage, including customer linking in customer relationship management systems [47], medical record linking between disparate health databases [38], and U.S. Census research for better government data integration and deduplication [33]. We present related work using a framework similar to that of the survey paper by Elmagarmid et al. [8], which broadly divides record linkage methods into two categories: those that utilize training data to learn their matching relationships and those that do not.

Record linkage models developed with training data include probabilistic matching and supervised ML techniques. Probabilistic matching observes that the distances between record pair data attributes often have different likelihoods depending on matching status and thus treats record linkage as a Bayesian inference problem. Work published by Fellegi and Sunter [23] formalized probabilistic matching and is one of the earliest and most well-established publications in this field.
Training data can be used to calculate the likelihoods used in the model. Similarly, supervised ML record linkage models use training data to classify record pairs into “matching” or “nonmatching” categories. Cochinwala et al. [5] presented successful results using the CART algorithm to match customers between different structured telecommunications databases. In 2016, Konda et al. [20] shared Magellan, an open source project that supports the entire entity matching pipeline and allows users to select from several ML classifiers, including decision tree, random forest, naive Bayes, SVM, and logistic regression. Mudgal et al. [35] compared four distinct deep learning architectures against Magellan and found that for 11 open structured entity matching datasets, the resulting F1 scores were similar. They also showed that deep learning improved entity matching F1 scores for five open source textual-only datasets and six “dirty” datasets where data was seemingly structured but often had missing or incorrectly separated fields.

Record linkage models developed without training data include rule-based, probabilistic, and unsupervised ML algorithms. Rule-based algorithms benefit from not needing training data but usually require more domain knowledge. A common rule-based approach is for a domain expert to craft a set of rules that identify matching entities in a database. In 1989, Wang and Madnick [42] shared a rule-based approach called inter-database instance identification that used a set of rules to generate new database entry keys to join disparate databases. Some probabilistic models also operate without training data. Works by Jaro [29] and Winkler [43] further established probabilistic matching by proposing that the expectation maximization algorithm could be leveraged to implement probabilistic matching with little or no training data. Finally, unsupervised ML models can learn matching relationships without training data, as exemplified by Elfeky et al. [7], who implemented unsupervised clustering for deduplication in a product description database.

2.2 Record Linkage Definition

Given two databases, $A$ and $B$, record linkage is the process of finding the subset $M$ of the cross product $A \times B$ for which the records $a \in A$ and $b \in B$ refer to the same entity.

In record linkage of two different databases, the databases are often disparate but contain some similar data attributes. A subset of fields appropriate for comparing the two databases is transformed to a common comparable feature space $X$. $X_a$ and $X_b$ are the clean, standardized data attribute matrices for databases $A$ and $B$ most suitable for identifying matching entities between them. For each record feature vector pair between the data sources, $(x_a, x_b)$ where $x_a \in X_a$ and $x_b \in X_b$, we calculate a comparison vector $c_{a,b}$. The comparison vector $c_{a,b}$ is a multidimensional normalized similarity/difference measurement for that record pair. We derive each element of the comparison vector, $c^{i}_{a,b} \in [0, 1]$, from a comparison function, $f_i$, designed to measure the normalized similarity/difference of that data attribute. The comparison function selection depends on the data type (e.g., numerical, categorical, string) and is a core component of the record linkage model design. Finally, the record linkage model uses the comparison vectors to classify the matching status of record pairs in $A \times B$. In this work, we explore an ML model for classification and discuss this further in Section 3.4.

Fig. 2. Standard record linkage process.

3 METHODOLOGY AND DESIGN

We developed a new patient matching module to link patient records in OpenMRS and Yendanafe. The patient matching module compares each candidate record to the set of target records and predicts likely matching target records. The module API enables users to request either matches above a certain probability threshold or the top k matches for each candidate record. The candidate and target record sources are abstract, so the module can operate by either finding the top matching records in Yendanafe for each OpenMRS record or vice versa. Figure 2 shows the standard record linkage process and architecture of the patient matching module. The methodology draws heavily on the best practices presented in the work of Christen [4].
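The two request modes described above (matches above a probability threshold, or the top k matches) can be illustrated with a minimal sketch. The function name `select_matches`, its parameters, and the record IDs are hypothetical and are not taken from the module itself; the sketch also assumes that a probability equal to the threshold counts as a match.

```python
from typing import Dict, List, Optional, Tuple


def select_matches(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Rank target records for one candidate record.

    `scores` maps a (hypothetical) target record ID to its predicted
    match probability. Either filter, both, or neither may be applied.
    """
    # Highest probability first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Assumption: a probability equal to the threshold is kept.
        ranked = [(rid, p) for rid, p in ranked if p >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


# Illustrative probabilities for four target records.
example_scores = {"y-001": 0.97, "y-002": 0.40, "y-003": 0.88, "y-004": 0.10}
above_threshold = select_matches(example_scores, threshold=0.85)
top_two = select_matches(example_scores, top_k=2)
```

In this toy example both calls return the same two records, but the modes differ in general: the threshold mode may return no records for a candidate, whereas the top-k mode returns up to k records whenever any targets exist.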
Each step is described in more detail in the following.

3.1 Data Preprocessing

Data preprocessing prepares disparate data sources, such as records from OpenMRS and Yendanafe, for direct comparison. Prior to data preprocessing, these databases contain fields that are related but differ because of different data collection processes. Candidate and target data undergo separate data preprocessing to clean and transform each raw record into a set of standardized attributes. Figure 3 gives an example of target and candidate record preprocessing similar to our final implementation. In this example, the candidate and target record raw data are not directly comparable. For example, the candidate raw record has first and last name fields, whereas the target raw record has a full name field inclusive of a middle initial. Data preprocessing transforms the name fields from both records into separate and comparable first and last name attributes. For the candidate record, the first and last name strings are cast to lowercase. For the target record, the full name is cast to lowercase and tokenized on whitespace, and the middle initial is dropped. As a result, the outputs of the candidate and target record processors are separate all-lowercase first and last name attributes ready for comparison.

Fig. 3. Candidate and target record preprocessing transforms disparate data records into a set of comparable data attributes.

Table 2 summarizes the comparable data attributes we derived from the OpenMRS and Yendanafe patient records through data preprocessing. This specification table guides the data standardization by defining the expected data preprocessing output. Shared data attributes for this patient matching model included name, sex, age, enrollment in different health programs, health program IDs, and geographic location.
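The name preprocessing described in this section can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the field names (`first_name`, `last_name`, `full_name`), the function names, and the surname in the example are hypothetical, and the sketch assumes that the first and last whitespace-separated tokens of a full name are the first and last names.

```python
import re
from typing import Dict


def _letters_only(text: str) -> str:
    """Lowercase the text and keep the letters a-z only (per Table 2).

    Assumption: names are ASCII; accented letters would be dropped.
    """
    return re.sub(r"[^a-z]", "", text.lower())


def preprocess_candidate(raw: Dict[str, str]) -> Dict[str, str]:
    """Candidate record: first and last name already arrive separately."""
    return {
        "first_name": _letters_only(raw["first_name"]),
        "last_name": _letters_only(raw["last_name"]),
    }


def preprocess_target(raw: Dict[str, str]) -> Dict[str, str]:
    """Target record: one full-name field that may hold a middle initial."""
    tokens = raw["full_name"].lower().split()  # tokenize on whitespace
    # Assumption: the first token is the first name and the last token is
    # the last name; anything in between (e.g., a middle initial) is dropped.
    # A non-empty name is assumed; an empty one would raise IndexError.
    return {
        "first_name": _letters_only(tokens[0]),
        "last_name": _letters_only(tokens[-1]),
    }


candidate = preprocess_candidate({"first_name": "Eunice", "last_name": "Banda"})
target = preprocess_target({"full_name": "YUNIS K. Banda"})
```

After preprocessing, both records expose the same two attributes, so the first names (`eunice` and `yunis`) and last names can be compared directly in the record pair comparison step.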

3.2 Indexing

Indexing enables more efficient record linkage by reducing the number of record pairs to be compared. Without indexing, record linkage is a computationally expensive operation because it is necessary to compare every possible record pair between the two databases ($|A| \times |B|$ operations). Fortunately, for any given candidate record, most potential target records are very dissimilar and unlikely to match. Indexing is a method to filter out the highly improbable matches. In doing so, there are far fewer record pairs to compare and classify, which greatly reduces the computational cost. It is common to employ blocking, where record pairs are compared only if a blocking attribute is an exact match (e.g., two patient records are compared only if they report to the same facility). A more comprehensive review of indexing methods is given in other works [1, 41].

Table 2. OpenMRS and Community Health Toolkit Data Preprocessing Specifications

| Comparable data attribute | Data type | Additional data specifications | Comparator function |
| --- | --- | --- | --- |
| First name | String | Lowercase, alphabetical characters only | Normalized Levenshtein |
| Last name | String | Lowercase, alphabetical characters only | Normalized Levenshtein |
| Village | String | Lowercase | Thresholded normalized Levenshtein |
| Facility | String | Lowercase | Thresholded normalized Levenshtein |
| Age (months) | Numerical | Integer | Numerical difference, scaled to [0,1] based on population |
| Sex | Categorical | {‘female’, ‘male’} | Exact match |
| Enrollment in an antiretroviral therapy program | Boolean | | Exact match |
| Enrollment in an NCD program | Boolean | | Exact match |
| ART ID numbers | Set of strings | String | Returns 1 if any intersection, else returns 0 |
| NCD ID numbers | Set of strings | String | Returns 1 if any intersection, else returns 0 |

We chose not to implement indexing for this patient matching module. A drawback of indexing is its potential to cause false negatives by incorrectly filtering out matching record pairs. For this case study, the computational time is mildly prohibitive, with our record linkage operation taking approximately 4 hours on a 2020 MacBook Pro running macOS 12.5.1 with a 2-GHz quad-core Intel Core i5 CPU and 16 GB of RAM. Our use case was a one-shot operation for historic record linkage between the existing large datasets. Future runs will be executed at regular intervals on a much smaller number of records that have been modified or added to the database. We leave indexing for this patient matching module for future work.
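Although indexing was not implemented in this module, the blocking idea described above can be sketched briefly. The function name `blocked_pairs`, the record layout, and the facility labels are hypothetical; the sketch only shows how exact-match blocking reduces the number of record pairs.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def blocked_pairs(
    candidates: List[Record], targets: List[Record], key: str
) -> Iterator[Tuple[Record, Record]]:
    """Yield only the record pairs whose blocking attribute matches exactly."""
    # Group the target records by their blocking-attribute value.
    blocks: Dict[str, List[Record]] = defaultdict(list)
    for target in targets:
        blocks[target[key]].append(target)
    # Pair each candidate only with targets from its own block.
    for candidate in candidates:
        for target in blocks.get(candidate[key], []):
            yield candidate, target


# Hypothetical records; "facility" plays the role of the blocking attribute.
candidates = [
    {"id": "o1", "facility": "facility_a"},
    {"id": "o2", "facility": "facility_b"},
]
targets = [
    {"id": "y1", "facility": "facility_a"},
    {"id": "y2", "facility": "facility_a"},
    {"id": "y3", "facility": "facility_b"},
    {"id": "y4", "facility": "facility_c"},
]
pairs = [(c["id"], t["id"]) for c, t in blocked_pairs(candidates, targets, "facility")]
full_comparisons = len(candidates) * len(targets)
```

Here blocking yields 3 record pairs instead of the 8 pairs of the full cross product. The false-negative risk noted above is also visible: a true match whose facility value was recorded differently in the two systems would never be compared.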

3.3 Record Pair Comparison

Record pair comparison measures the similarity between the standardized data attributes for each record pair. The output of the record pair comparison for one record pair is a comparison vector, $c_{a,b}$, quantifying the closeness of the record pair. Each element in the comparison vector, $c^{i}_{a,b}$, represents a closeness score in [0, 1] for a single data attribute (e.g., last name string similarity) for that record pair. When comparing last names between two patient records, for example, we measured their similarity using the Levenshtein distance, which counts the number of edits needed to transform one name into the other. The comparison functions can be categorized by data type, and their selection depends on multiple factors, including the data collection processes. Table 2 outlines the comparator functions for the OpenMRS to Yendanafe patient matching module. These comparison functions consist of string comparisons (e.g., first name, last name), continuous numerical comparisons (i.e., age), Boolean exact matches (i.e., sex, absence/presence in a health program), and Boolean presence of any intersection between two lists (i.e., EMR IDs).

The probability distribution of comparison values directly impacts their record linkage predictive power. Figure 4 shows the distribution of the last name, age, and sex attribute comparisons for more than 600 known matched records and 600 random nonmatching record pairs. In all instances, a matching record pair’s similarity score was more densely distributed near “1” (i.e., very similar). We can see here how comparing data attributes for record pairs provides some insight into the probability of a match, but also that no single attribute provides definitive predictive power for matching status. For example, a Levenshtein similarity of 0.5 in the patient’s last name could point to either a match or a no-match scenario. The final step, classification, leverages the multidimensional comparison vector for each record pair to make the final patient matching prediction.

Fig. 4. Surname, age, and sex similarity scores for known matching and nonmatching record pairs in the OpenMRS and Yendanafe databases.
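The comparator functions listed in Table 2 can be sketched as follows to show how a comparison vector is assembled. Several details are assumptions not stated in the text: the Levenshtein distance is normalized by the longer string's length, the 0.8 threshold and the 1200-month age scale are hypothetical constants, and the record fields and values are invented for illustration.

```python
from typing import Dict, List, Set


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(s: str, t: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical.

    Assumption: the distance is divided by the longer string's length.
    """
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def thresholded_similarity(s: str, t: str, threshold: float = 0.8) -> float:
    """1.0 if the similarity reaches the (hypothetical) threshold, else 0.0."""
    return 1.0 if name_similarity(s, t) >= threshold else 0.0


def age_similarity(a_months: int, b_months: int, max_diff_months: int = 1200) -> float:
    """Age closeness in [0, 1]; max_diff_months is a hypothetical scale."""
    diff = min(abs(a_months - b_months), max_diff_months)
    return 1.0 - diff / max_diff_months


def any_intersection(a: Set[str], b: Set[str]) -> float:
    """1.0 if the two ID sets share at least one element, else 0.0."""
    return 1.0 if a & b else 0.0


def comparison_vector(a: Dict, b: Dict) -> List[float]:
    """One closeness score per attribute for a single record pair."""
    return [
        name_similarity(a["first_name"], b["first_name"]),
        name_similarity(a["last_name"], b["last_name"]),
        thresholded_similarity(a["village"], b["village"]),
        age_similarity(a["age_months"], b["age_months"]),
        1.0 if a["sex"] == b["sex"] else 0.0,
        any_intersection(a["art_ids"], b["art_ids"]),
    ]


# Invented records for illustration only.
openmrs_record = {"first_name": "eunice", "last_name": "banda", "village": "examplevillage",
                  "age_months": 400, "sex": "female", "art_ids": {"ART-0001"}}
yendanafe_record = {"first_name": "yunis", "last_name": "banda", "village": "examplevilage",
                    "age_months": 406, "sex": "female", "art_ids": {"ART-0007"}}
vector = comparison_vector(openmrs_record, yendanafe_record)
```

With these inputs, the first-name pair from Table 1 ("yunis" and "eunice") has an edit distance of 3 and a similarity of 0.5, which illustrates why a single attribute is rarely decisive and why the ID comparison (0 here, mimicking a transcription error) cannot be relied on alone.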

3.4 Classification

The final step, classification, evaluates the comparison vector for each record pair and determines the probability of the pair being a match. For each record pair comparison, we now know how similar the two records are to one another along several different attributes (e.g., name, age, EMR ID, location). The challenge of the classification stage is to determine how to combine these similarity measurements to predict whether the record pair matches. In Figure 4, for example, we can see that similarity in last names gives more information than similarity in age, but all of these variables are potentially still useful when used in concert.

Our first classification model was the rule-based record linkage model described in Section 1. This model compares a patient record from OpenMRS to one from Yendanafe and classifies them as a matching record pair based on the answers to a series of questions about the comparison scores for EMR ID, patient name, and sex. For example, in this algorithm, a record pair without an EMR ID match is still declared a patient match if the Levenshtein distance between patient names is under a predefined threshold and the sex is the same.

The existence of the matched dataset from our rule-based approach motivated us to explore a supervised ML classifier. The benefit of using an ML model here is that the model is trained to assign the significance weights for each of these similarity scores when predicting the match status. For this work, we adopted a logistic regression model for its explainability and because it does not require a large amount of training data. We applied it to the larger set of comparison attributes outlined in Table 2 and used the standard logistic regression algorithm to predict the probability of a record pair match:

$$\Pr(\text{match}) = \sigma(w^{T} c_{a,b} + b), \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}. \tag{1}$$

For the logistic regression model, the weight, $w$, and bias, $b$, parameters are optimized during the model training process. Our classifier was trained on labeled data obtained and validated from the first model’s approach, effectively bootstrapping this model from our first approach.

Finally, we also investigated a probabilistic model called the Fellegi-Sunter algorithm [23] due to its wide use, including in the existing health information exchange client registries referenced in Section 2.1. The Fellegi-Sunter algorithm is similar to a naive Bayes classifier operating on a strictly Boolean version of the comparison vector. More formally, the Fellegi-Sunter algorithm first thresholds the comparison vector, $c_{a,b}$, such that each element defines a “close enough data attribute match.” Next, an intermediary variable is computed by summing a set of parameters selected by the match/no-match status of each thresholded comparison element:

$$L(c_{a,b}) = \sum_{i} w^{i}_{a,b}, \tag{2}$$

where each of the parameters is determined by the conditional probability of that data attribute’s value given matching status:

$$w^{i}_{a,b} = \begin{cases} \log \dfrac{P(c^{i} = 1 \mid a = b)}{P(c^{i} = 1 \mid a \neq b)} & \text{if } c^{i}_{a,b} = 1, \\[2ex] \log \dfrac{1 - P(c^{i} = 1 \mid a = b)}{1 - P(c^{i} = 1 \mid a \neq b)} & \text{if } c^{i}_{a,b} = 0. \end{cases} \tag{3}$$

Similar to an ML classification problem, we determine an optimal threshold of the intermediary variable for the desired evaluation metric (e.g., accuracy, precision, recall).

4 EXPERIMENTAL STUDY

The experimental study aims to identify matching patients from the EMR system (OpenMRS) to the community-based health information system (Yendanafe) for the Neno District in Malawi. We evaluated the patient matching module outlined in Section 3 for matching all patients in chronic care programs (i.e., antiretroviral and NCD treatment programs) in OpenMRS to all patients in the Yendanafe deployment. The patient matching module data preprocessing and comparison computation were configured according to the specifications in Table 2. A significant contribution of this work is comparing the logistic regression and Fellegi-Sunter algorithms for the module’s classifier. For the health information system in Neno, Malawi, we consider the EMR to be the patient identification gold standard. Consequently, we measure patient matching module performance by how accurately it identifies matching patients in the Yendanafe database.

4.1 Patient Matching Model Training

At the core of this work is using existing knowledge and patient matching data to train a patient matching classifier and optimize the patient matching process.
The classifier’s objective is to classify a record pair as being a match (1) or not a match (0). To accomplish this, we start with an existing dataset of 685 known matched patient records developed from the rule-based record linkage model presented in Section 1, which comprise our labeled dataset’s true positives. We derived a balanced training dataset by sampling an equal number of nonmatching record pairs, pairing the OpenMRS records that were already matched with Yendanafe records drawn uniformly at random from all available records not previously matched by the rule-based record linkage model. We verified the no-match status of these true negatives through manual inspection. This dataset serves as the labeled training data for both models.

Fig. 5. Patient matching logistic regression training curve.

Fig. 6. Precision and recall vs. classifier probability threshold during training for logistic regression (a) and Fellegi-Sunter (b) models.

We first trained the logistic regression model, using a random 80/20 split of the training data for training/validation purposes. Figure 5 shows the training curve for the logistic regression model, exhibiting convergence and indicating that our training data size is appropriate. Experimentation with the presence/absence of L2 regularization showed a negligible performance difference. Figure 6(a) shows the precision-recall tradeoff and reports a recall of 0.98 at 90% precision.
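To make the reported metric concrete, the following sketch evaluates Equation (1) for a few comparison vectors and then computes recall at a precision floor. The weights, bias, comparison vectors, and labels are invented for illustration and are not the trained model or the study data; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def match_probability(c: Sequence[float], w: Sequence[float], b: float) -> float:
    """Equation (1): sigma(w^T c + b) for one comparison vector c."""
    z = sum(w_i * c_i for w_i, c_i in zip(w, c)) + b
    return 1.0 / (1.0 + math.exp(-z))


def recall_at_precision(
    probs: Sequence[float], labels: Sequence[int], min_precision: float = 0.90
) -> float:
    """Best recall over all thresholds whose precision is >= min_precision."""
    total_positives = sum(labels)
    best_recall = 0.0
    for threshold in sorted(set(probs)):
        predicted = [p >= threshold for p in probs]
        tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
        fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
        if tp + fp == 0:
            continue  # nothing predicted positive at this threshold
        if tp / (tp + fp) >= min_precision:
            best_recall = max(best_recall, tp / total_positives)
    return best_recall


# Hypothetical weights for three attributes (e.g., first name, last name, sex).
weights = [2.0, 3.0, 1.0]
bias = -3.0
vectors = [
    [1.0, 1.0, 1.0],
    [0.5, 1.0, 1.0],
    [0.5, 0.5, 1.0],
    [1.0, 0.4, 0.0],
    [0.0, 0.2, 1.0],
]
labels = [1, 1, 0, 1, 0]  # invented match (1) / no-match (0) labels
probabilities = [match_probability(c, weights, bias) for c in vectors]
best_recall = recall_at_precision(probabilities, labels, min_precision=0.90)
```

In this toy example, the only thresholds that keep precision at or above 90% exclude the third true match, so the recall at 90% precision is 2/3. The same computation applied to a validation set yields figures of the kind reported above.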

Fig. 7. Precision vs. recall for logistic regression and Fellegi-Sunter models.

Similarly, we trained the Fellegi-Sunter algorithm using the same methods. Figure 6(b) shows the precision-recall tradeoff and reports a recall of 0.99 at 90% precision. The validation results for both the logistic regression and Fellegi-Sunter algorithms seem promising, but it is important to note that our original dataset sampling is biased by the first rule-based model. Our validation data was selected by our deterministic model, which makes patient matching decisions based on thresholded comparisons of EMR ID, name, and sex similarity. Thus, the validation data only includes records for which we knew there was an existing match with moderately strong EMR ID, name, and sex similarities. This could create a specific, biased (and potentially easier) matching problem that is not representative of our final patient matching problem.
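Estimating the Fellegi-Sunter parameters of Equations (2) and (3) from labeled data can be sketched as follows. This is a minimal illustration, not the authors' implementation: the add-one smoothing is an assumption introduced here to avoid taking the logarithm of zero, and the binary comparison vectors and labels are invented.

```python
import math
from typing import List, Sequence, Tuple


def estimate_m_u(
    vectors: Sequence[Sequence[int]], labels: Sequence[int], smoothing: float = 1.0
) -> Tuple[List[float], List[float]]:
    """Estimate m_i = P(c^i = 1 | match) and u_i = P(c^i = 1 | non-match).

    Assumption: add-one (Laplace) smoothing keeps both strictly in (0, 1).
    """
    n_match = sum(labels)
    n_nonmatch = len(labels) - n_match
    m, u = [], []
    for i in range(len(vectors[0])):
        agree_match = sum(v[i] for v, y in zip(vectors, labels) if y == 1)
        agree_nonmatch = sum(v[i] for v, y in zip(vectors, labels) if y == 0)
        m.append((agree_match + smoothing) / (n_match + 2 * smoothing))
        u.append((agree_nonmatch + smoothing) / (n_nonmatch + 2 * smoothing))
    return m, u


def fellegi_sunter_score(v: Sequence[int], m: Sequence[float], u: Sequence[float]) -> float:
    """Equation (2): sum of the per-attribute weights of Equation (3)."""
    total = 0.0
    for v_i, m_i, u_i in zip(v, m, u):
        if v_i == 1:
            total += math.log(m_i / u_i)              # agreement weight
        else:
            total += math.log((1 - m_i) / (1 - u_i))  # disagreement weight
    return total


# Invented thresholded comparison vectors: [name agrees, sex agrees].
train_vectors = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
train_labels = [1, 1, 1, 0, 0, 0]
m, u = estimate_m_u(train_vectors, train_labels)
score_agree = fellegi_sunter_score([1, 1], m, u)
score_disagree = fellegi_sunter_score([0, 0], m, u)
```

In this toy data, name agreement is far more common among matches than non-matches and therefore carries a large weight, while sex agreement is equally common in both groups and contributes a weight of zero. A decision threshold on the score is then chosen for the desired metric, as described in Section 3.4.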

4.2 End-to-End Lab Validation

To more accurately evaluate the performance of the patient matching module, we employed the module in a manner closer to its actual implementation. For this validation, we started with 400 random OpenMRS patients not previously seen in the model training and ran the patient matching module to identify potential matches in the entire Yendanafe dataset. Unfortunately, reviewing all possible matches is extremely expensive, since comparing 400 OpenMRS patients to 79,776 Yendanafe patients results in 31,910,400 total record pairs. We assume here, based on previous observations of the module’s behavior, that a match will likely not occur when the probability of matching is less than 0.85. Consequently, we ran the patient matching module for the 400 random OpenMRS patients and extracted all record pairs with a probability of matching greater than 0.85 using the logistic regression model. We manually inspected the resulting record pairs and assigned the positive label if we concluded we would answer “yes” (and the negative label otherwise) to the following question: “Would I ask a CHW if this record pair was a match?” Different team members were involved in data labeling to diversify perspectives. The result of this process was the final test dataset consisting of 467 total record pairs, of which 247 had positive labels and 220 had negative labels.


Figure 7 compares the precision-recall curves for the logistic regression and Fellegi-Sunter classifiers on the final test data, which likely provides a more accurate validation. The precision-recall curve visualizes the classifier’s ability to handle the tradeoff between finding all true matches and avoiding false matches. For the final dataset with a 247:220 positive to negative label ratio, a baseline classifier that predicted “1” for all record pairs would present as a horizontal line at approximately 0.53 (247/467), and an ideal classifier would present as a horizontal line at 1.0. A good classifier that maintains high precision and recall therefore bows toward the upper right between the baseline and ideal values. We observe bowing precision-recall curves with an average precision of 0.86 for both classifiers, meaning that the models show positive and comparable predictive power. For this use case, we pursued a high-precision model (90% precision) to minimize undue burden on CHWs and are currently observing a recall of 0.74 at 90% precision for the logistic regression model and a recall of 0.66 at 90% precision for the Fellegi-Sunter algorithm.

5 CONCLUSION AND FUTURE WORK

We built a patient matching module for integrating community health (Yendanafe) and facility health (OpenMRS) datasets for the Neno District in Malawi. The case study is meant to engage CHWs in helping to find patients in treatment programs who have missed appointments and refer them to the facility for better care. Our patient matching module was built according to standard record linkage architecture specifications. We compared probabilistic (Fellegi-Sunter) and ML (logistic regression) classification models for our patient matching module, and our preliminary results suggest that logistic regression performed similarly to, if not better than, the industry-standard Fellegi-Sunter algorithm. We are currently carrying out a physical validation activity for a more accurate report of our experimental results.

We leave a few interesting topics to future work. First, we will continue exploring the performance of the Fellegi-Sunter algorithm. The literature on probabilistic methods states that an advantage of the method is its ability to operate as an unsupervised algorithm, meaning that it would not require training data. In the unsupervised approach, the weights are calculated using expectation maximization. When we experimented with the algorithm, we were unable to get the unsupervised weight calculation to converge for our dataset, and we ultimately opted to calculate the weights using the same training data used for the logistic regression model. For future work, we would like to investigate this further to document the challenges and potential solutions, and repeat the performance comparison with the unsupervised approach. This work may guide selection and improved configuration and adoption of the OpenHIE client registry tools referenced in Section 2.1, since they implement the Fellegi-Sunter algorithm and allow for user-specified weights.

Second, we would like to investigate more comparison algorithms. There are many comparison algorithms beyond those outlined in Table 2, and a comparison and analysis of more of those algorithms may yield better results.

Finally, we would like to expand our record linkage performance metrics. In this work, we measure the performance of the classifier as if it were a standard ML classifier. However, the test data is very large, and it is expensive to develop a truly labeled test dataset. For this reason, we labeled a smaller thresholded subset of the record pairs and reported the performance on it. An alternative approach is to treat the system as a ranking system and explore a new performance metric accordingly. Ranking systems often operate in similar situations where the possible solution set is too large to reasonably evaluate entirely. Consequently, ranking systems typically measure performance by measuring the relevance of the top k results. We would like to re-evaluate our results by treating the patient matching module as a ranking system, returning the top k (e.g., k = 3) matches in the Yendanafe data per OpenMRS record and reporting the mean average precision of the top-ranked matching results.
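The ranking-based evaluation proposed above can be sketched as a mean average precision at k computation. Several conventions exist for average precision at k; this sketch assumes the variant that divides by min(number of true matches, k), and all record IDs and rankings are invented.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top k results for one candidate record.

    Assumption: the sum of precisions at each hit is divided by
    min(len(relevant), k); other conventions exist.
    """
    hits = 0
    precision_sum = 0.0
    for rank, record_id in enumerate(ranked[:k], start=1):
        if record_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision_at_k(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int = 3
) -> float:
    """Mean of the per-candidate average precision at k."""
    scores = [
        average_precision_at_k(ranked, truth.get(candidate, set()), k)
        for candidate, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Invented top-3 rankings of Yendanafe IDs for three OpenMRS records.
rankings = {
    "o1": ["y1", "y7", "y3"],  # true match ranked first
    "o2": ["y9", "y2", "y5"],  # true match ranked second
    "o3": ["y4", "y6", "y8"],  # true match not in the top 3
}
truth = {"o1": {"y1"}, "o2": {"y2"}, "o3": {"y0"}}
map_at_3 = mean_average_precision_at_k(rankings, truth, k=3)
```

The three candidates score 1.0, 0.5, and 0.0, giving a mean average precision of 0.5. Because each OpenMRS record is expected to have at most one true match, the per-candidate score in this setting reduces to the reciprocal rank of that match within the top k.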

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 10. Publication date: December 2023.


REFERENCES [1] R. Baxter, P. Christen, and T. Churches. 2003. A comparison of fast blocking methods for record linkage. In Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation. 25–27. [2] E. P. Bazile. 2016. Electronic Medical Records (EMR): An Empirical Testing of Factors Contributing to Healthcare Professionals’ Resistance to Use ERM Systems. Ph.D. Dissertation. Nova Southeastern University. [3] J. Christophe C. Report and D. Suffrin. 2022. Annals of clinical case reports presumed severe hepatocellular toxicity after initiation on a dolutegravin-based HIV treatment regimen in rural Malawi: A case report. Annals of Clinical Case Reports 7 (2022), 1–5. [4] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications (DCSA). Springer. [5] Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. 2001. Efficient data reconciliation. Information Sciences 137 (2001), 1–15. [6] Health Service Executive. n.d. Clinical Design and Innovation: ICP for Prevention and Management of Chronic Disease. Retrieved October 20, 2023 from https://www.hse.ie/eng/about/who/cspd/icp/chronic-disease/ [7] M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. 2002. TAILOR: A record linkage toolbox. In Proceedings of the IEEE 18th International Conference on Data Engineering. 1–12. [8] Ahmed Elmagarmid, Panagiotis Ipeirotis, and Vassilios Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (Jan. 2007), 1–16. [9] A. Bilinski, E. Birru, M. Peckarsky, M. Herce, N. Kalanga, C. Neumann, G. Bronson, S. Po-Chedley, C. Kachimanga, R. McBain, and J. Keck. 2017. Distance to care, enrollment and loss to follow-up of HIV patients during decentralization of antiretroviral therapy in Neno District, Malawi: A restrospective cohort study. PLoS One 12, 10 (2017), e0185699. [10] C. Kachimanga, K. 
Cundale, E. Wroe, L. Nazimera, A. Jumbe, E. Dunbar, and N. Kalanga. 2017. Novel approaches to screening for noncommunicable diseases: Lessons from Neno, Malawi. Malawi Medical Journal 29, 2 (2017), 78–83. [11] C. Mitambo, S. Khan, B. L. Matanje-Mwagomba, C. Kachimanga, E. Wroe, D. Segula, A. Amberbir, D. Garone, P. R. A. Malik, A. Gondwe, and J. Berman. 2017. Improving the screening and treatment of hypertension in people living with HIV: An evidence-based policy brief by Malawi’s translation platform. Malawi Medical Journal 29, 2 (2017), 224– 228. [12] E. B. Wroe, N. Kalanga, E. L. Dunbar, L. Nazimera, N. F. Price, A. Shah, L. Dullie, B. Mailosi, G. Gonani, E. P. L. Ndarama, G. C. Talama, G. Bukhman, L. Kerr, E. Connolly, and C. Kachimanga. 2020. Expanding access to non-communicable disease care in rural Malawi: Outcomes from a retrospective cohort in an integrated NCD-HIV model. BMJ Open 10, 10 (2020), 1–9. [13] E. B. Wroe, N. Kalanga, B. Mailosi, S. Mwalwanda, C. Kachimanga, K. Nyangulu, E. Dunbar, K. Kerr, L. Nazimera, and L. Dullie. 2015. Leveraging HIV platforms to work toward comprehensive primary care in rural Malawi: The integrated chronic care clinic. Healthcare 3, 4 (2015), 270–276. [14] G. C. Talama, M. Shaw, J. Maloya, T. Chihana, L. Nazimera, E. B. Wroe, and C. Kachimanga. 2020. Improving uptake of cervical cancer screening services for women living with HIV and attending chronic care services in rural Malawi. BMJ Open Quality 9, 3 (2020), e000892. [15] G. P. Douglas, O. J. Gadabu, S. Joukes, S. Mumba, M. V. McKay, A. Ben-Smith, A. Jahn, E. J. Schouten, Z. L. Lewis, J. J. van Oosterhout, T. J. Allain, R. Zachariah, S. D. Berger, A. D. Harries, and F. Chimbwandira. 2010. Using touchscreen electronic medical record systems to support and monitor national scale-up of antiretroviral therapy in Malawi. PLoS Medicine 7, 8 (2010), e1000319. [16] H. Tweya, C. Feldacker, O. J. Gadabu, W. Ng’ambi, S. L. Mumba, D. Phiri, L. Kamvazina, S. Mwakilama, H. 
Kanyerere, O. Keiser, J. Mwafilaso, C. Kamba, M. Egger, A. Jahn, B. Simwaka, and S. Phiri. 2016. Developing a point-of-care electronic medical record system for TB/HIV co-infected patients: Experiences from Lighthouse Trust, Lilongwe, Malawi. BMC Research Notes 9, 1 (2016), 146. [17] K. Goniewicz, E. Carlstrom, A. J. Hertelendy, F. M. Burkle, M. Goniewicz, D. Lasota, J. G. Richmond, and A. Khorram-Manesh. 2021. Integrated healthcare and the dilemma of public health emergencies. Sustain 13, 8 (2021), 4517. [18] M. H. Ahmed, A. D. Bogale, B. Tilahun, M. H. Kalayou, J. Klein, S. A. Mengiste, and B. F. Endehabtu. 2020. Intention to use electronic medical record and its predictors among health care providers at referral hospitals, north-West Ethiopia, 2019: Using unified theory of acceptance and use technology 2(UTAUT2) model. BMC Medical Informatics and Decision Making 20, 1 (2020), 207. [19] O. Ayaad, A. Alloubani, E. A. ALhajaa, M. Farhan, S. Abuseif, A. Al Hroub, and L. Akhu-Zaheya. 2019. The role of electronic medical records in improving the quality of health care services: Comparative study. International Journal of Medical Informatics 127, 12 (2019), 63–67.

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 10. Publication date: December 2023.


[20] Pradap Konda, S. Das, P. Suganthan, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, and V. Raghavendra. 2016. Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment 9, 12 (Aug. 2016), 1197–1208. [21] R. K. McBain, O. Mwale, T. Ruderman, W. Kayira, E. Connolly, M. Chalamanda, C. Kachimanga, B. D. Khongo, J. Wilson, E. Wroe, G. Raviola, S. Smith, S. Coleman, K. Kelly, A. Houde, M. G. Tebeka, S. Watson, K. Kulisewa, M. Udedi, and G. Wagner. 2021. Stepped care for depression at integrated chronic care centers (IC3) in Malawi: Study protocol for a stepped-wedge cluster randomized control trial. Trials 22, 1 (2021), 630. [22] T. Ojo, L. Lester, J. Iwelunmor, J. Gyamfi, C. Obiezu-Umeh, D. Onakomaiya, A. Aifah, S. Nagendra, J. Opeyemi, M. Oluwasanmi, M. Dalton, U. Nwaozuru, D. Vieira, G. Ogedegbe, and B. Boden-Albala. 2019. Feasibility of integrated, multilevel care for cardiovascular diseases (CVD) and HIV in low- and middle-income countries (LMIC): A scoping review. PLoS One 14, 2 (2019), e0212296. [23] Ivan Fellegi and Alan Sunter. 1969. A theory of record linkage. Journal of the American Statistical Association 64, 328 (1969), 1183–1210. [24] A. Gutterres. 2020. The Sustainable Development Goals Report 2020. Retrieved October 20, 2023 from https://unstats. un.org/sdgs/report/2020/ [25] M. E. Herce, N. Kalanga, E. B. Wroe, J. W. Keck, F. Chingoli, L. Tengatenga, S. Gopal, A. Phiri, B. Mailosi, J. Bazile, J. A. Beste, S. N. Elmore, J. T. Crocker, and J. Rigodon. 2015. Excellent clinical outcomes and retention in care for adults with HIV-associated Kaposi sarcoma treated with systemic chemotherapy and integrated antiretroviral therapy in rural Malawi. Journal of the International AIDS Society 18, 1 (2015), 19929. [26] Fei Huang, Sean Blaschke, and Henry Lucas. 2017. Beyond pilotitis: Taking digital health interventions to the national level in China and Uganda. 
Globalization and Health 13, 49 (2017), 49. [27] SYSNET International. 2022. OpenEMPI: An Extensible and Scalable Master Patient Index. Retrieved October 20, 2023 from https://www.openempi.org/ [28] IntraHealth. 2022. OpenCR Documentation. Retrieved October 20, 2023 from https://intrahealth.github.io/clientregistry/ [29] Matthew Jaro. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84, 406 (1989), 414–420. [30] R. B. Johnston. 2016. Arsenic and the 2030 agenda for sustainable development. In Proceedings of the 6th International Congress on Arsenic Research and Global Sustainability. 12–14. [31] Gilad J. Kuperman. 2011. Health-information exchange: Why are we doing it, and what are we doing? Journal of the American Medical Informatics Association 18, 5 (2011), 678–682. [32] Alain B. Labrique, Christina Wadhwani, Koku Awoonor Williams, Peter Lamptey, Cees Hesp, Rowena Luk, and Anna Aerts. 2018. Best practices in scaling digital health in low and middle income countries. Globalization and Health 14 (2018), 103. [33] Lowell Mason. 2018. Comparison of Record Linkage Techniques. Technical Report. Bureau of Labor Statistics. [34] Medic. 2022. Community Health Toolkit. Retrieved October 20, 2023 from https://communityhealthtoolkit.org/ [35] Sidharth Mugdal. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). 19–34. [36] United Nations. 2018. The 2030 Agenda and the Sustainable Goals: An Opportunity for Latin America and the Caribbean. United Nations [37] OpenHIE. 2022. OpenHIE Home Page. Retrieved October 20, 2023 from https://ohie.org/ [38] Palindrome Data. 2019. LINK, DETECT, PREDICT—Using National Centralised Lab Data to Identify and Predict Risk of MTCT Transmission Routes in South Africa. Global Digital Health Forum. [39] PATH. 2014. 
The Journey to Scale: Moving Together Past Digital Health Pilots. PATH. [40] S. Morton, D. Pencheon, and N. Squires. 2017. Sustainable development goals (SDGs), and their implementation: A national global framework for health, development and equity needs a systems approach at every level. British Medical Bulletin 124, 1 (2017), 81–90. [41] Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A comparison of blocking methods for record linkage. In Proceedings of the International Conference on Privacy in Statistical Databases. 253–268. [42] Y. R. Wang and S. E. Madnick. 1989. The inter-database instance identification problem in integrating autonomous systems. In Proceedings of the 5th International Conference on Data Engineering. 46–55. [43] William Winkler. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. Technical Report Statistics Research Report Series 56. U.S. Bureau of the Census. [44] William E. Winkler. 1999. The State of Record Linkage and Current Research Problems. Technical Report. U.S. Bureau of the Census. [45] World Health Organization. 2018. Continuity and Coordination of Care: A Practice Brief to Support Implementation of the WHO Framework on Integrated People-Centered Health Services. World Health Organization.


[46] World Health Organization. 2019. Recommendations on Digital Interventions for Health System Strengthening. World Health Organization. [47] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-End Fuzzy Entity-Matching Using Pre-Trained Deep Models and Transfer Learning. Technical Report. Microsoft Research.

Received 15 February 2023; revised 14 June 2023; accepted 22 June 2023


Flamingo: Environmental Impact Factor Matching for Life Cycle Assessment with Zero-shot Machine Learning

BHARATHAN BALAJI, VENKATA SAI GARGEYA VUNNAVA, NINA DOMINGO, SHIKHAR GUPTA, HARSH GUPTA, and GEOFFREY GUEST, Amazon, USA
ARAVIND SRINIVASAN, University of Maryland and Amazon, USA

Consumer products contribute to more than 75% of global greenhouse gas (GHG) emissions, primarily through indirect contributions from the supply chain. Measurement of GHG emissions associated with products is a crucial step toward quantifying the impact of GHG emission abatement actions. Life cycle assessment (LCA), the scientific discipline for measuring GHG emissions, estimates the environmental impact associated with each stage of a product from raw material extraction to its disposal. Scaling LCA to millions of products is challenging as it requires extensive manual analysis by domain experts. To avoid repetitive analysis, environmental impact factors (EIFs) of common materials and products are published for use by LCA experts. However, finding appropriate EIFs for even a single product under study can require hundreds of hours of manual work, especially for complex products. We present Flamingo, an algorithm that leverages natural language machine learning (ML) models to automatically identify an appropriate EIF given a text description. A key challenge in automation is that EIF databases are incomplete. Flamingo uses industry sector classification as an intermediate layer to identify when there are no good matches in the database. On a dataset of 664 products, our method achieves an EIF matching precision of 75%.

CCS Concepts: • Information systems → Retrieval models and ranking; Information retrieval; • Computing methodologies → Natural language processing; • Applied computing;

Additional Key Words and Phrases: Environmental impact factor, life cycle assessment, carbon footprint, semantic matching, natural language processing, HS codes, Flamingo

ACM Reference format:
Bharathan Balaji, Venkata Sai Gargeya Vunnava, Nina Domingo, Shikhar Gupta, Harsh Gupta, Geoffrey Guest, and Aravind Srinivasan. 2023. Flamingo: Environmental Impact Factor Matching for Life Cycle Assessment with Zero-shot Machine Learning. ACM J. Comput. Sustain. Soc. 1, 2, Article 11 (December 2023), 23 pages. https://doi.org/10.1145/3616385

Authors’ addresses: B. Balaji, V. S. G. Vunnava, N. Domingo, S. Gupta, H. Gupta, and G. Guest, Amazon, Seattle, Washington, USA; e-mails: {bhabalaj, gvunnava, nggdom, gupshik, hrshgup}@amazon.com; A. Srinivasan, University of Maryland and Amazon, USA; e-mail: [email protected].

This work is licensed under a Creative Commons Attribution International 4.0 License. © 2023 Copyright held by the owner/author(s). 2834-5533/2023/12-ART11 https://doi.org/10.1145/3616385
ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 11. Publication date: December 2023.

1 INTRODUCTION

Greenhouse gas (GHG) emissions from human activities are warming the planet and will lead to hazardous climate events in the absence of mitigation efforts [37]. While information on global and country-level emissions is generally available [1], there is a lack of granular emissions data that can inform mitigation. Life cycle assessment (LCA) is a standard environmental accounting method used to estimate higher-granularity GHG emissions associated with an activity or a product. These emissions are often referred to as its carbon footprint and are reported in terms of global warming potential in units of mass of carbon dioxide equivalent (e.g., kgCO2e) [17].

LCAs for consumer products are of special interest as these products account for more than three-fourths of global GHG emissions [32]; thus, data-driven approaches that perform LCA at scale are essential to informing carbon abatement strategies [26]. Moreover, LCA can be used to estimate other environmental impacts such as electricity used, toxic waste, and so forth.

LCA estimates emissions in each stage of a product: raw material extraction, manufacturing, transportation, use, and disposal/reuse. The effort to acquire direct measurements for each aspect of a product can be prohibitively expensive [52], and therefore, domain experts use the outputs of existing LCA studies to estimate emissions of common materials, products, and activities associated with the life cycle of their subject [56]. For example, an LCA on the “production of a cotton t-shirt” might rely on the results of LCAs focused on “cotton production” or “transport by truck.” We refer to the outputs of these reference LCA studies as environmental impact factors (EIFs). Databases such as Ecoinvent [56] and GaBi [49] collate these EIFs for use by LCA practitioners.

There is a lack of automation tools that integrate with EIF datasets. Often, EIF datasets are shared as spreadsheets, and domain experts use string matching or explicit rules to match a product or activity to an appropriate EIF [11, 44]. It can take hundreds of staff hours to match EIFs for LCA of even a single product [33]. To mitigate the manual overhead, we propose using natural language processing (NLP)-based machine learning (ML) methods to perform this matching. To our knowledge, we are the first to study the use of neural ML-based EIF matching.
A simple approach is to use neural language models trained on web data for semantic textual similarity [41], where the EIFs can be matched based on the distance between the embedding of the query text and the EIF description. A major advantage of this approach is that no training data is required, as off-the-shelf pre-trained models can capture synonyms and conceptual relationships. We consider this as our baseline algorithm.

A key characteristic of the EIF matching problem is that the databases are incomplete. Many products such as mushrooms or socks do not have an EIF, because either no LCA study exists or the published LCAs have not been ingested into the database. LCA experts search across multiple databases for a match and use an approximate value with higher uncertainty when no match exists, e.g., average EIF of vegetables as a proxy for red bell peppers [11]. However, pre-trained ML models are not trained to identify when an appropriate match does not exist.

We present a novel zero-shot ML algorithm, called Flamingo,¹ that introduces an intermediate classification layer in semantic search to identify when an EIF is missing and improves the performance of EIF matches. Flamingo classifies the input query text to a standard industry code and uses semantic text matching to identify the closest EIFs within the industry code. Figure 1 provides an overview of Flamingo with an example. When there are no EIFs available for an industry code, Flamingo predicts that no appropriate match exists.

We use an industry sector classification called the Harmonized System (HS) code [9], which is specifically designed for categorizing products based on their material composition and manufacturing complexity. HS codes are hierarchically organized, are used globally for import/export taxes, and are refreshed by the World Customs Organization every 5 years [57]. Flamingo exploits the HS code hierarchy to navigate the precision versus recall tradeoff associated with an EIF match.
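As a rough illustration of the idea described above, and not the authors' implementation, the sketch below restricts candidate EIFs to those that share an HS code prefix with the query and returns no match when that set is empty. The bag-of-words `embed` function is a toy stand-in for a pre-trained sentence embedding model, the HS6 codes are illustrative placeholders, and the query's HS code is assumed to come from some upstream classifier. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical EIF entries: (reference product, illustrative HS6 code).
EIFS = [
    ("cotton fibre", "520100"),
    ("wheat grain", "100199"),
    ("gold", "710812"),
    ("silver", "710691"),
]


def match_eif(query, query_hs6, eifs=EIFS, digits=6):
    """Return the most similar EIF that shares the first `digits` digits of
    the HS code with the query, or None when no EIF falls in that sector."""
    prefix = query_hs6[:digits]
    candidates = [name for name, code in eifs if code[:digits] == prefix]
    if not candidates:
        return None  # "No match": the database has no EIF in this sector
    q = embed(query)
    return max(candidates, key=lambda name: cosine(q, embed(name)))


strict = match_eif("10k gold ring", "711319", digits=6)  # compare full HS6 code
loose = match_eif("10k gold ring", "711319", digits=2)   # compare HS2 chapter only
```

Comparing fewer digits widens the candidate set, which tends to trade precision for recall: with these placeholder codes, the full HS6 comparison finds no candidate for the ring, while the two-digit comparison falls back to the raw-material EIF.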
We evaluate Flamingo on a dataset of 664 products from an e-commerce retailer. We use annotations from crowd workers to identify if an EIF predicted by Flamingo is an appropriate match or if no match exists. Our results show that Flamingo matches EIFs to products with a precision of 75% and outperforms the semantic text similarity baseline by 8.4%. We open-source our code with a permissive license.²

The rest of the article is structured as follows. In Section 2, we briefly introduce LCA and summarize prior work in the literature on finding EIFs for LCA. We also summarize the relevant text-matching works in the literature and their applicability to finding appropriate EIFs. In Section 3, we introduce our problem statement. In Section 4, we describe the datasets used in our experiments. In Section 5, we describe the proposed methods and provide illustrative examples. In Section 6, we describe the process for collecting ground-truth data. In Section 7, we summarize the results of our experiments. Finally, in Section 8, we close with thoughts for future work.

¹ Flamingos use their beaks to filter out algae and crustaceans from their food; our algorithm filters out unrelated EIFs.

Fig. 1. Environmental impact factor (EIF) selection is a key aspect of product carbon footprinting. Flamingo automates selection of an EIF using pre-trained neural language models. It uses industry sector codes to identify when an EIF is not available in the database.

2 BACKGROUND AND RELATED WORK

In this section, we briefly explain how LCA is used for carbon footprint estimation and the role of EIFs in performing LCAs. We summarize the prior works on automation of matching EIFs for a given query in the LCA literature and the key challenges that remain unaddressed. We then summarize the relevant text-matching literature, along with its applicability to the EIF matching problem.

2.1 Life Cycle Assessment

LCA was introduced as a systematic method to compare product design choices in terms of energy use, waste, and other environmental impacts [19, 23]. It has since been adopted as a standard method for carbon footprinting by both the International Organization for Standardization [16] and the GHG Protocol [50]. LCA can be categorized into two types: Economic Input-Output LCA (EIO-LCA) and Process-LCA. EIO-LCA uses transactions across industries in an economy to obtain an approximate impact assessment at an industry sector level [59]. This type of LCA is associated with aggregation issues: for example, different types of paper products are assumed to have the same GHG impact per unit of sale price regardless of how they were manufactured. Process-LCA, on the other hand, produces higher-granularity carbon footprints through detailed tracking of emissions from each life cycle stage of a product. Figure 2 shows a Process-LCA of a paper towel as an example, which requires accounting of how the pulp was sourced, how it was processed across multiple suppliers, how materials were transported between stakeholders, what percentage of products sold was returned, and whether the paper roll was composted, landfilled, or recycled.

Fig. 2. Life cycle assessment of a paper towel made with 75% recycled materials [24].

² Open source code repository: https://github.com/amazon-science/carbon-assessment-with-ml

As it is challenging, and sometimes infeasible, to collect direct emissions data in such high detail, LCA experts use EIFs published in prior studies as an estimate of the GHG emissions associated with a product, material, or activity for which they do not have direct measurements [33, 44]. Ecoinvent [56], GaBi [49], and AGRIBALYSE [12] are some of the common EIF databases used in the industry. Finding the right EIF can be time consuming [33] as exact string match does not capture synonyms (e.g., milk and dairy, maize and corn), abbreviations (e.g., Ni-Cd and Nickel Cadmium), technical terms (e.g., sodium chloride is the same as salt), or category relationships (e.g., basmati is a type of rice).

A few solutions have been proposed to overcome this challenge in the literature [11, 33, 44]. Meinrenken et al. [33] propose a linear regression algorithm, where they categorize the query text to an industry sector and use a regression based on price to estimate the GHG emissions. However, the categorization is done manually; all the EIFs within an industry sector are averaged together, which increases the variance of the estimate; and heuristics are used to remove outliers. Clark et al. [11] match ingredients to EIFs for food products. They also rely on a mapping of EIFs to pre-defined food categories (e.g., berries, cheese). In addition, they manually create search terms that are synonyms or sub-types of a given food category (e.g., pecorino is a cheese) so they can improve exact string matches. To reduce the variance of emission estimates, they use a three-level hierarchical categorization so that specificity of a match increases when possible, e.g., use strawberry instead of berries.
Our algorithm uses a similar hierarchical industry classification but does not require manual specification of search terms. As a result, our solution scales to all EIFs in the database and is not limited to food EIFs. In contrast to these prior works, Flamingo does not require manual steps or heuristics and can match to specific EIFs in the database instead of industry sectors.

ML has been used to automate other aspects of LCA, as covered in a survey by Algren et al. [2]. Specific to EIF estimation, Sousa and Wallace [47] directly estimate carbon dioxide emissions based on product attributes to inform design decisions. They used neural network regression on carefully chosen product descriptors to estimate emissions. They train different models for high- and low-energy-use products. The model was trained on EIFs of 53 products, and predictions on a test set of 6 products gave an error of 40%. However, their product descriptor required details such as mass, percentage contribution from different types of materials (ceramics, fibers, metals, plastics, etc.), energy source, and more. In contrast, we identify if an EIF in the database can be matched to a given text description and do not place any restrictions on the input. We also test our predictions on a much larger dataset of 664 products.

2.2 Text Matching

Text search algorithms were envisioned with the advent of digital computers [7, 45]. Many of the early search algorithms were based on exact string matching [6, 53], which is still prevalent in spreadsheets used for EIF searching by LCA experts. Exact matches work well for small strings but have poor recall with long search queries. BM25 [42, 43] and TF-IDF [40, 48] are probabilistic algorithms that overcome this challenge by weighting words with their relative frequency of occurrence. They support long search queries and rank results based on a relevance score. BM25 is the default search algorithm in modern search engines such as Elasticsearch [18]. To our knowledge, BM25 has not been considered for EIF search in the literature, and we include it as a baseline algorithm in our evaluation. We do not include exact match as a baseline as it is challenging to identify the keyword to use for the query, and use of the entire product description leads to spurious matches with close to 0% precision.

Neural search algorithms improve on exact string match by learning semantics such as synonyms and contextual relationships [34]. However, they require a large dataset of queries with matched results for training the model [20, 22]. No such dataset exists for searching EIFs, and therefore, we focus on zero-shot methods that do not require training data.
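To make the term weighting behind the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 scoring over short EIF descriptions. It assumes whitespace tokenization, a common non-negative IDF variant, and the parameter values k1 = 1.5 and b = 0.75. These choices, the function name, and the example descriptions are illustrative assumptions, not the configuration used in the article's evaluation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (a non-empty list) against `query`."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b normalizes for document length.
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["cotton production", "transport by truck", "wheat grain production"]
scores = bm25_scores("organic cotton t-shirt production", docs)
best = docs[scores.index(max(scores))]
```

Because scoring relies on shared tokens, a query such as "maize" would score zero against an entry described as "corn", which is the limitation that motivates the embedding-based methods discussed next.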
Neural language models such as BERT [28] and GPT [39] reduce the reliance on training data by using a self-supervised objective such as predicting the next word in a sentence. Use of the Transformer architecture enables training on web-scale datasets [55]. Sentence transformers, called SBERT, build on these works and have emerged as a strong zero-shot algorithm for semantic text matching [41, 58]. They are trained on web text to create a vector representation (a.k.a. an embedding) of an input sentence. We can identify if two sentences are similar by measuring the distance between their embeddings. Using SBERT, we can find the closest EIF that matches a query text by identifying the EIF text embedding that is closest to the query embedding. Balaji et al. [3] use SBERT to find EIFs for EIO-LCA, while we focus on finding EIFs for Process-LCA. Unlike Process-LCA EIF databases, EIO-LCA EIF databases are complete by definition as the EIF corresponds to industry sectors defined by the national governments. Therefore, Balaji et al. [3] do not address the problem of identifying when no EIF matches exist in the database. We include SBERT as a baseline in our evaluation.

One of the challenges with neural search solutions is that they are not designed to identify when an appropriate match does not exist [13, 51]. A simple method is to threshold the distance between embeddings beyond which an EIF is not an appropriate match [62]. However, as we show in our evaluation, a threshold-based solution leads to incorrect predictions due to distance miscalibration. The SBERT model can be fine-tuned on a search dataset so that the distances are calibrated [51]. Another method is to train a separate model that determines if the predicted match is correct or incorrect [21]. But these solutions require labeled training data and do not meet our zero-shot objective.

We propose a novel solution, where we first classify the EIFs and the query text to an industry sector. If the industry sector of the query does not match with any entry in the EIF database, we can predict that there is no suitable match for this query. We use the HS codes for our industry sector classification, which are hierarchically organized from 2 digits to up to 12 digits, where the level of hierarchy is indicated with a 2-digit increment [9]. Figure 3 shows an example. A major advantage of HS codes is that they are already used for import/export taxation worldwide, and therefore, datasets that map products to HS codes are readily available. We use up to 6 digits of the HS code (HS6), which translates to ∼5.4K unique codes. We leverage an existing dataset of products mapped to HS6 codes to learn a supervised classifier that can predict the HS code for any text input.

Fig. 3. An example of HS codes, along with hierarchical structure.

Prior works have proposed ML approaches to automate HS code classification as it is used to determine customs tax rate [10, 14, 15]. Chen et al. [10] consider a neural machine translation approach with a hierarchical loss function and obtain an accuracy of 85% on a dataset of ∼1M records. Lee et al. [31] use a supervised, hierarchical sentence retrieval approach to classify 129K electrical products with an accuracy of 89.6%. Du et al. [15] use two parallel neural networks in a Siamese architecture for classifying HS codes; one network uses a hierarchical sequence of LSTM modules on word vectors of text input, and another uses a graph attention network based on word co-occurrence. On a dataset of ∼400K samples, they obtain an average accuracy of 85%.

We treat the HS code classification problem as an example of extreme label classification [4, 5]. We use the PECOS algorithm, which leverages the label hierarchy and semantic similarity of labels to improve classification performance [60]. Flamingo is agnostic to the method used to classify HS codes; we use PECOS in our experiments as it achieves the state of the art in multiple extreme label classification benchmarks. On our dataset of 746K products, we obtain a classification accuracy of 82%, which is commensurate with the results reported in prior works. We also use SBERT to identify the best matching HS codes for a given text description with a zero-shot approach. To our knowledge, zero-shot methods to classify HS codes have not been studied in the literature.

3 PROBLEM STATEMENT

Given a query text $q_i \in Q$, our objective is to find an appropriate EIF $e_i$ from a given set of EIFs $E$. The query text can be a product description or a specific aspect of a product that an LCA expert will attach an EIF to, such as the material a product is made of. All EIF datasets include text metadata that describes each EIF's characteristics, and we assume the text is available for query matching. It is possible that there is no appropriate EIF available for a given query, and the algorithm should output $\emptyset$, i.e., “No match,” for such cases.

We assume there is no training dataset available that matches query text to EIFs. It is difficult to obtain high-quality annotations for a dataset that is large enough to train models that generalize to all queries and EIFs. Although we do have a small dataset, we only use it for validation of methods. Even for few-shot methods, we need to have a few labeled examples per EIF, and the EIFs number in the thousands. It is possible to use EIF text descriptions to reduce the reliance on labeled examples for every class [63]; we leave exploration of such methods for future work.

Table 1. Summary of Datasets Used in the Article

Type                      Size     # Match   # No Match   # No Clear Label   # Used
Product to EIF (small)    100      22        68           10                 90
Product to EIF (large)    967      277       387          303                664
Food ingredient to EIF    272      87        177          8                  264
Product to HS code        746K     (no filter)                               746K
HS code descriptions      6,709    (filter for HS6 codes)                    5,388
Ecoinvent EIFs            19,128   (filter for reference products)           2,770

4 DATASET

In this section, we describe the dataset we use to evaluate the performance of Flamingo in matching a query text to an EIF. Table 1 summarizes the datasets used in the article. The dataset, which we refer to as $D_Q$, is a set of 967 products sold by an e-commerce retailer. It includes a variety of products such as ice makers, electric fans, toasters, and adhesives. Given a product, we concatenate its title, description, and additional attributes into a single string as input to the classifier. For each product, the dataset includes the ground-truth HS6 code as well as a matching EIF from the Ecoinvent dataset [56], if a match exists. We use the 2017 version of HS codes and EIFs from Ecoinvent v3.7. Human annotations are used to validate the EIF matches in the dataset; the annotation process is described in Section 6. All the product descriptions are in English. Although the text-similarity models we use can generalize to any language, we restrict the language because our annotation team is fluent only in English. LCA experts annotate a subset of 100 products, which we use to evaluate the quality of annotations by non-experts. To evaluate generalization beyond products, we introduce another dataset of 272 food ingredients that are matched to EIFs with annotations.

Ecoinvent EIFs include metadata such as impact factor name, reference product, units, data quality, valid years, geographic specificity, and industry classification. We use the attribute “reference product” as the basis for matching EIFs because it provides a non-technical but precise description of the EIF (e.g., wheat, yogurt). Once the EIFs with the reference product are identified, it is easy to add rules to increase specificity by additional attributes such as location. Ecoinvent contains 19K EIFs but only 3.2K unique reference products. The dataset includes EIFs that are not related to the carbon footprint of consumer products, such as those related to construction, operation of equipment, or transportation of goods. We filter these out to get 2.7K unique EIFs.

We refer to an individual HS code as $h_i^\delta \in H$, where $\delta$ refers to the number of digits in the code. We obtain the text description of HS6 codes from https://unstats.un.org/. We use a dataset that maps 746K products to their corresponding HS6 codes for learning a supervised model to predict the HS6 code given text as input; we refer to this dataset as $D_H$. The dataset consists of a variety of products (e.g., shoes, watches, coats) and comprises 2.5K unique HS6 codes. Note that this is a subset of the full set of HS6 codes, $|H| = 5400$. The distribution of the number of products per HS6 code is skewed, with 10% of HS6 codes accounting for 86% of the products. Despite the skew and reduced cardinality, the HS6 codes in $D_H$ contain a representative set of products, and we expect a large overlap with a generic product query.

5 METHODS

In this section, we describe the Flamingo algorithm. Our method relies on a zero-shot matching algorithm called SBERT [41] to map a query text to an EIF and introduces an intermediate industry classification code layer to identify when no good match exists. We summarize the notations used in Table 2.

B. Balaji et al.
Flamingo: EIFM for Life Cycle Assessment with Zero-shot Machine Learning
ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 11. Publication date: December 2023.

Table 2. Summary of Notations Used throughout the Article

| Notation | Description |
| --- | --- |
| $D^Q$ | Query to EIF dataset |
| $h_i^\delta$ | HS code i with δ digits |
| $H$ | Set of all HS codes |
| $\tau_i$ | Text input i to match HS code |
| $D^H$ | Product to HS code dataset |
| $E$ | Set of all EIFs |
| $q_i$ | Query i |
| $e_i$ | EIF i |
| $M^H$ | Model to predict HS6 code |
| $M^{HZ}$ | SBERT model to predict HS6 code |
| $M^E$ | Model to match EIF given a query |
| $M^{HP}$ | PECOS model to predict HS6 code |

Given a query text $q_i$, our objective is both to predict whether a matching EIF exists in the database and to retrieve the matching EIF $e_i$ when one exists. Critically, we do not rely on a supervised dataset of query-to-EIF mappings, which is commonly used in prior works [21, 51]. Zero-shot matching algorithms, such as SBERT [41], can identify the best-matching EIF but are inadequate at identifying, based on a distance threshold, when that EIF is not a good match. We improve on SBERT using industry sector classification as defined through HS codes [9]. We only consider the first 6 digits of the HS codes, called HS6, as they are globally applicable. Higher-resolution HS codes with 8 or more digits are country specific, and our methods can be extended to these if needed. Figure 4 provides an illustration of the Flamingo algorithm with an example.

Fig. 4. Flowchart describing the Flamingo algorithm with a “10k gold ring” as an example query. The matching EIF is found by semantic-similarity-based search on the EIF dataset. The HS code predictor is used to identify the HS code for the matched EIF. The HS code predictor uses either semantic text similarity (FlamingoZero) or the supervised PECOS algorithm (FlamingoPecos). If the HS codes of the EIF and the given query match, the final prediction is the EIF; otherwise, it is “No match.”

We use a model $M^H$ that predicts the HS6 code for a given text input $\tau_i$. Given an HS6 code, we can look up the corresponding HS4 and HS2 codes from the code hierarchy. Therefore, we have $h_i^\delta = M^H(\tau_i, H)$. We propose two types of HS6 classifiers. The first is a zero-shot method that leverages the text description of HS6 codes and finds the best match using SBERT; we refer to the resulting method as FlamingoZero or $M^{HZ}$. The second is a supervised classification method that leverages the product-to-HS6 dataset $D^H$ using the PECOS algorithm [60]; we refer to this method as FlamingoPecos or $M^{HP}$. While we expect the supervised model $M^{HP}$ to outperform the zero-shot model $M^{HZ}$ on the test split of the $D^H$ dataset, $M^{HP}$ can underperform when predicting the HS6 codes for EIFs due to domain shift. For example, chemical EIFs such as “1,1-dimethylcyclopentane” are not available in the labeled HS6 dataset $D^H$ and will be mispredicted by the supervised model. Zero-shot methods are not fit to $D^H$, so they do not suffer from this domain shift, but their performance may suffer due to the lack of any supervision. We study the tradeoffs between the two approaches in our evaluation (Section 7).

ALGORITHM 1: Pseudocode of Flamingo
  Input: $q_i$, $E$, $H$, cosine_threshold
  Output: $e_i$
  $h_q^\delta = M^H(q_i, H)$
  $e_i$, cosine_score $= M^E(q_i, E)$
  $h_e^\delta = M^H(e_i, H)$
  if $h_q^\delta \neq h_e^\delta$ then $e_i = \emptyset$
  if cosine_score < cosine_threshold then $e_i = \emptyset$
  return $e_i$

Algorithm 1 details the steps in Flamingo. For a given query text $q_i$, we first predict its HS code $h_q^\delta$ if it is not provided. Next, we find the best EIF $e_i$ that matches the given query $q_i$ using the SBERT model $M^E$. We then predict the corresponding HS code $h_e^\delta$ for the EIF $e_i$ using the model $M^H$. If the HS codes of the EIF and the query match and the cosine similarity score is not below the threshold, we output the best-match EIF $e_i$ as the final prediction. Otherwise, the output is “No match” ($\emptyset$). We include three variants of the algorithm, where we match EIFs to the query based on their HS2, HS4, and HS6 codes, respectively. Reducing the number of digits in the HS code reduces the specificity of the classification. Therefore, a higher-level HS code (fewer digits) increases the precision of finding an appropriate EIF match at the expense of reducing the precision of predicting “No match.” We append the HS code we use to the name of the algorithm to specify the variant, e.g., FlamingoZeroHS4. Figure 5 shows examples of products being matched to EIFs based on text descriptions using Flamingo.
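The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_hs6` and `match_eif` are hypothetical stand-ins for the models $M^H$ and $M^E$, backed here by small hand-written lookup tables, and the HS codes and scores in those tables are illustrative placeholders rather than verified classifications or model outputs. HS4 and HS2 codes are taken as prefixes of the HS6 code.

```python
from typing import Callable, Optional, Tuple


def flamingo(
    query: str,
    predict_hs6: Callable[[str], str],
    match_eif: Callable[[str], Tuple[str, float]],
    hs_digits: int = 4,
    cosine_threshold: float = 0.0,
) -> Optional[str]:
    """Return the matched EIF name, or None to signal "No match"."""
    if hs_digits not in (2, 4, 6):
        raise ValueError("hs_digits must be 2, 4, or 6")
    # HS code of the query, truncated to the chosen level of the hierarchy.
    query_hs = predict_hs6(query)[:hs_digits]
    # Best semantic match and its cosine similarity score.
    eif, cosine_score = match_eif(query)
    # HS code of the candidate EIF, predicted from the EIF text.
    eif_hs = predict_hs6(eif)[:hs_digits]
    if query_hs != eif_hs:
        return None  # industry sectors disagree
    if cosine_score < cosine_threshold:
        return None  # semantic match is too weak
    return eif


# Hypothetical toy stand-ins for the two models (placeholder values).
TOY_HS6 = {
    "10k gold ring": "711319",
    "jewellery": "711319",
    "dark roast ground coffee": "090121",
    "coffee, green bean": "090111",
    "phone charging cable": "854442",
    "copper wire": "740811",
}
TOY_MATCH = {
    "10k gold ring": ("jewellery", 0.55),
    "dark roast ground coffee": ("coffee, green bean", 0.62),
    "phone charging cable": ("copper wire", 0.48),
}


def toy_predict_hs6(text: str) -> str:
    return TOY_HS6[text]


def toy_match_eif(text: str) -> Tuple[str, float]:
    return TOY_MATCH[text]
```

With these toy tables, the coffee query is rejected at the HS6 level but accepted at the HS4 level, which mirrors the tradeoff between specificity and “Match” precision described above.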
The first example shows the importance of semantic matching, as the given product description does not mention “jewelry” or “precious metal.” The EIF is considered a match as the HS codes of the product and EIF are the same. In the second example, finding a match is easier with the keyword “coffee.” However, notice that the HS codes are slightly different: the product corresponds to roasted coffee, whereas the EIF is for green coffee beans that are not roasted. Our annotators consider these close enough, and therefore the EIF is considered a match. While the 6-digit HS codes do not match in this case, the 4-digit HS codes corresponding to “coffee” do. For the last two examples, there are no good EIFs in the dataset. While the best EIFs found do match the product semantically, they are not appropriate from a manufacturing perspective. The corresponding HS codes for the product and EIF reveal this difference and help predict “No match.”

Fig. 5. Examples of product to EIF matching by Flamingo. The algorithm predicts that the given EIF is a match if the HS codes of the query and EIF match. The HS codes match for the first two examples, whereas they do not for the last two.

We use a pre-trained SBERT model to convert an input text to its corresponding embedding. We use the “all-mpnet-base-v2” model from sbert.net as it has the highest performance in semantic text similarity benchmarks. The model uses the BERT Transformer architecture [28] and is trained on 160GB of text corpora using the MPNet self-supervised objective [46]. BERT uses the masked language model objective for self-supervision, where the model predicts missing words in a given sentence. Training on web-scale data with this objective and fine-tuning on downstream natural language tasks led to state-of-the-art results at its time of release. MPNet improves on BERT by permuting the input sentence in addition to masking. The model is further fine-tuned on 1B sentence pairs from a variety of NLP datasets using cosine similarity as the distance metric and a symmetric cross-entropy loss proposed by Qi et al. [38]. We use the default hyper-parameters throughout, where the input sentence is cut off after 128 tokens (about 100 words). Figure 6 shows an illustrative example of semantic text similarity.

Fig. 6. An example of semantic similarity matching using a neural network model trained on web text data. The model takes text as input and outputs a vector representation, called an embedding, such that closely matched inputs are mapped to embeddings with high cosine similarity.

We use the same model both to predict the HS codes ($M^{HZ}$) and to rank the best-match EIFs ($M^E$). For example, to predict the EIF match for a given query text, the model outputs the EIF whose embedding is closest to the query embedding as measured through cosine similarity. Even after filtering EIFs by HS code, the correct outcome may still be “No match.” We therefore use a threshold on cosine similarity to further filter out unrelated EIFs and show the impact of using both a conservative and an aggressive threshold on the algorithm performance.
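The ranking and thresholding step described above can be illustrated with a small, dependency-free sketch. It assumes the embeddings have already been computed; the three-dimensional vectors below are made-up placeholders, not outputs of the “all-mpnet-base-v2” model, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query_vec: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidate names by decreasing cosine similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def best_match(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the top-ranked candidate, or None if its score is below the threshold."""
    ranked = rank_candidates(query_vec, candidates)
    if not ranked:
        return None
    name, score = ranked[0]
    return name if score >= threshold else None


# Placeholder embeddings for three EIF reference products and one query.
EIF_VECS = {
    "coffee, green bean": [0.9, 0.1, 0.0],
    "cable, unspecified": [0.0, 0.2, 0.9],
    "wheat": [0.5, 0.8, 0.1],
}
QUERY_VEC = [0.8, 0.2, 0.1]
```

Raising the threshold trades “Match” predictions for “No match” predictions: the same top-ranked candidate is returned at a threshold of 0.5 but rejected at a threshold close to 1.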


5.1 FlamingoPecos

We train a classifier $M^{HP}$ that predicts the HS6 code given text as input using dataset $D^H$. We leave out 20% of the data as the test set. We vectorize the product text using TF-IDF [40], and the HS6 classes are one-hot encoded for classification. TF-IDF assigns each word in a text a weight that increases with how often the word occurs in that text and decreases with how many texts in the dataset contain it. We use the “XR-Linear” option of the PECOS algorithm to train the classifier [60], which uses logistic regression and hierarchical clustering of labels for classification. We use the default hyper-parameters.

We briefly describe the PECOS algorithm. PECOS is a tree-based method for solving extreme multi-label classification problems. We focus on the XR-Linear variant of the algorithm; the linear classifiers can be replaced following the same framework. The classification problem can be framed as follows: train a model that can generate a subset of labels from label space $Y$ that are relevant to an input $x$. Given training data $\{(x_i, y_i)\}_{i=1,2,\ldots,N}$, we first construct label representations, which are generated by normalizing the sum of positive sample features. In particular, label $i$'s representation is

$$\frac{\sum_{j \in S_i} x_j}{\left\lVert \sum_{j \in S_i} x_j \right\rVert},$$

where $S_i$ is the set of samples that have a positive label $i$. Once we obtain the label representations, we can construct the label tree using the hierarchical k-means algorithm. The leaves of the tree are the final output labels, which in our case are HS codes.

For every given input $x$, the PECOS XR-Linear tree generates a subset of HS codes along with a score capturing how relevant each HS code is to the input query. More specifically, every node in the tree is associated with a simple linear multi-label classifier, which produces a score measuring the relevance between the child of this node and the input $x$. The score for the $i$th node in the $l$th layer is formulated as follows:

$$f(x, Y_i^{(l)}) = \sigma\left(w_i^{(l)} \cdot x\right), \qquad (1)$$

where $\sigma$ denotes an activation function (e.g., sigmoid) and $w_i^{(l)} \in \mathbb{R}^d$ denotes a sparse vector of weight parameters. This score accumulates along the path from the root to the leaves, similar to the computation of a conditional probability if we view the score at each node as a conditional probability.

Training: Let the weight vector for a node $j$ be $w_j$ and all the training samples be $\{(x_i, y_i)\}_{i=1,2,\ldots,N}$. For each input $x_i$, we can find a path from the root to the leaf $y_i$. When we train $w_j$, we include the sample $x_i$ as a positive if node $j$ is on the path from the root to $y_i$, and as a negative if node $j$ is not on that path but the parent of node $j$ is. Once we collect all the positives $P_j$ and negatives $E_j$ for node $j$, we can train the weight for node $j$ by solving a binary classification sub-problem:

$$w_j = \arg\min_{w} \sum_{i \in P_j} L(1, w^\top x_i) + \sum_{i \in E_j} L(0, w^\top x_i) + \frac{\lambda}{2} \lVert w \rVert_2^2,$$

where $\lambda$ is the regularization coefficient, and $L(\cdot, \cdot)$ is the point-wise loss function (e.g., squared hinge loss).

Inference: PECOS leverages the tree structure to run the beam search algorithm over the tree. In particular, the beam search algorithm chooses a subset of nodes based on the largest scores of the current active nodes and continues going down from these nodes to arrive at the leaf nodes, or HS codes, that are relevant for the query. We pick the HS code with the maximum relevance score as the output prediction.
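The scoring and inference steps can be made concrete with a toy sketch. The code below is not the PECOS library and does not use its API; `Node`, `node_score`, and `beam_search` are hypothetical names, the tree and weights are hand-written, and a sigmoid is assumed for the activation in Equation (1). It scores each node with a sparse linear model, multiplies scores along the path from the root, and keeps only the highest-scoring nodes at each level.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    weights: Dict[str, float]  # sparse weight vector w for this node
    label: str = ""  # HS code, set on leaves only
    children: List["Node"] = field(default_factory=list)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def node_score(node: Node, x: Dict[str, float]) -> float:
    """sigma(w . x) for a sparse feature dictionary x, as in Equation (1)."""
    z = sum(w * x.get(feature, 0.0) for feature, w in node.weights.items())
    return sigmoid(z)


def beam_search(
    top_level: List[Node], x: Dict[str, float], beam_size: int = 2
) -> List[Tuple[str, float]]:
    """Return (label, accumulated score) pairs for the leaves reached by the beam."""
    beam = [(node, node_score(node, x)) for node in top_level]
    leaves: List[Tuple[str, float]] = []
    while beam:
        # Keep only the highest-scoring active nodes at this level.
        beam.sort(key=lambda pair: pair[1], reverse=True)
        beam = beam[:beam_size]
        next_beam = []
        for node, score in beam:
            if not node.children:
                leaves.append((node.label, score))
            else:
                for child in node.children:
                    # Scores multiply along the root-to-leaf path.
                    next_beam.append((child, score * node_score(child, x)))
        beam = next_beam
    return sorted(leaves, key=lambda pair: pair[1], reverse=True)


# Hand-written toy tree: two clusters with HS-code leaves (placeholder weights).
FOOD = Node(
    {"coffee": 2.0, "bean": 1.0},
    children=[
        Node({"roasted": 3.0, "green": -3.0}, label="090121"),
        Node({"green": 3.0, "roasted": -3.0}, label="090111"),
    ],
)
METAL = Node({"gold": 2.0, "ring": 1.0}, children=[Node({"gold": 1.0}, label="711319")])
QUERY = {"coffee": 1.0, "roasted": 1.0}  # toy TF-IDF-style features
```

The first element of the returned list corresponds to the prediction rule above: the HS code with the maximum accumulated relevance score. A smaller `beam_size` explores fewer branches and may miss a leaf whose ancestors scored poorly.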


More details about PECOS can be found in Yu et al. [61].

6 ANNOTATION

In this section, we describe the process of collecting ground-truth data of product-to-EIF mappings via manual annotations. Identifying the best match from thousands of EIFs is a challenging and time-consuming task. To reduce the burden, we rank the top-5 EIFs using FlamingoZero and ask annotators whether any of these is a match. If fewer than five EIFs remain after filtering based on HS codes, we add the top-ranked EIFs from the full set until we have a total of five options. We also include the options “No match” and “Not sure” to capture both missing EIFs in the database and uncertainty in annotation.

We acknowledge that such an annotation system does not measure recall accurately, as there could be EIFs in the database that are an appropriate match but are not captured in the top-ranked EIFs by Flamingo. Nevertheless, it does capture the precision of the algorithm accurately. The recall of the ranked list can be further improved by using a mix of different algorithms, such as BM25 [54] or the universal sentence encoder [8], or by using late interaction models such as ColBERT [29] that re-rank a candidate list. Our contribution and focus here is on improving the precision of identifying whether an EIF is an appropriate match after it has been ranked by a search method. We defer measuring and improving the recall of EIF ranking to future work; we expect our results to improve further if recall improves. All of our baseline methods except for BM25 rely on the same SBERT model for EIF ranking. Therefore, the results still provide a controlled experiment to compare methods.

In our instructions to annotators, we provide multiple examples of product-to-EIF mapping, including those with “No match.” More than 80% of randomly sampled products do not have an appropriate EIF in the database. To balance this skew, our dataset for annotation includes all the products for which we find an EIF and randomly samples an equivalent number of products for which there is no match. We use the top-5 ranked EIFs without an HS code filter for these products for a consistent annotation experience. We use three independent annotations per product and use the majority vote to reduce labeling noise. Our annotators are experienced in labeling tasks. Each annotation took 30 seconds on average. Despite our efforts to reduce the annotation burden, our annotators provided feedback that the task was challenging. They had to constantly look up technical terms, such as “kenaf” and “mercerizing”; some tasks took as much as 5 minutes.

Of the 967 products annotated, 28% had unanimous agreement, 44% had a valid majority vote, and the remaining had split votes with no clear majority. Another 5.6% of the products had “Not sure” as the majority vote. We only consider the 664 products that have a valid majority and use the majority-voted label as the ground truth for our experiments. The final product dataset has 58% “No match” and 48 unique EIFs with a long-tail distribution. Krippendorff’s alpha is a measure of inter-annotator agreement for classification tasks, with 0 representing agreement no better than chance and 1 representing perfect agreement [30]. The alpha for our annotations is 0.28, which is similar to values reported for long-tailed classification tasks in the literature [25].

To further validate our dataset, we asked two LCA experts to annotate a subset of 90 products. Picking one of the experts arbitrarily as the ground truth, we measure the precision of both our expert and non-expert annotations. We find that non-experts have a precision of 78.6% on average, which is comparable to the expert precision of 76.1%. With majority vote, the non-expert agreement with expert annotations increases to 85.9%. Therefore, we consider our non-expert annotations to be of sufficient quality to be considered as ground truth.
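The aggregation of three annotations into one label can be sketched as follows. The exact rule for a “valid majority” is an assumption here: the sketch treats a label as valid when strictly more than half of the annotators chose it and it is not “Not sure”. The function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

NOT_SURE = "Not sure"


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the strict-majority label, or None if there is no valid majority."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if count * 2 <= len(votes):
        return None  # split vote, no clear majority
    if label == NOT_SURE:
        return None  # majority exists but expresses uncertainty
    return label
```

Under this rule, a product with votes such as “wheat”, “wheat”, “No match” keeps the label “wheat”, whereas a three-way split or a “Not sure” majority is dropped from the ground truth.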


Table 3. Performance Comparison of HS6 Code Prediction Given Product Description as Input

| Model | Accuracy (%) | Macro F1 | Weighted F1 |
| --- | --- | --- | --- |
| SBERT (zero-shot) | 11.7 | 9.1 | 11.4 |
| PECOS (supervised) | 81.9 | 44.8 | 81.1 |

There are 610K and 152K products in the train and test set, respectively.
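The three metrics reported in Table 3 differ in how they treat skewed classes. The sketch below is a plain-Python illustration and not the evaluation code used for the table: it computes accuracy, macro F1 (every class weighted equally), and weighted F1 (classes weighted by their number of true examples). It assumes that classes are the union of true and predicted labels, and the labels in the toy data are placeholders.

```python
from collections import Counter
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, float, float]:
    """Return (accuracy, macro F1, weighted F1) for single-label predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    # Per-class F1 = 2TP / (2TP + FP + FN); the denominator is never zero
    # because every label occurs at least once in y_true or y_pred.
    f1 = {c: 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels}
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(tp.values()) / n
    macro_f1 = sum(f1.values()) / len(labels)
    weighted_f1 = sum(f1[c] * support[c] for c in labels) / n
    return accuracy, macro_f1, weighted_f1


# Skewed toy data: one frequent class predicted well, two rare classes confused.
Y_TRUE = ["HS-A", "HS-A", "HS-A", "HS-A", "HS-B", "HS-C"]
Y_PRED = ["HS-A", "HS-A", "HS-A", "HS-A", "HS-C", "HS-B"]
```

On the toy data, the frequent class dominates accuracy and weighted F1, while macro F1 is pulled down by the two rare classes, which is the same pattern as the gap between the macro and weighted F1 columns for the supervised model.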

7 EVALUATION

We start with the results of predicting the HS6 code for a product using the PECOS model and compare them with zero-shot SBERT model predictions. Next, we analyze the results of predicting EIFs for products for all the baselines and variants of the Flamingo algorithm. We analyze the results from the small 90-product dataset annotated by experts and then expand the analysis to the larger 664-product dataset. After this analysis, we assess the use of industry codes other than HS codes as the intermediary layer in Flamingo. We then provide an analysis of the prediction errors and their impact on the carbon footprint values. We end with a micro-benchmark evaluation of the Flamingo algorithm on the food ingredient dataset and compare results with baselines.

7.1 Prediction of HS6 Code

We start with the evaluation of the HS code classifier $M^H$ on the product-to-HS-code dataset $D^H$. We split the dataset in a ratio of 4:1 to get train ($|D^H_{train}| = 610$K) and test ($|D^H_{test}| = 152$K) sets, respectively. We compare the performance of the SBERT zero-shot model $M^{HZ}$ and the supervised PECOS model $M^{HP}$ on $D^H_{test}$. We pick the prediction with the highest score as the HS6 code, where the score is the cosine similarity in the case of $M^{HZ}$ and the confidence score for $M^{HP}$. We report accuracy, macro F1 score, and weighted F1 score. The F1 scores are computed per class rather than over the whole dataset. The macro F1 score gives equal weight to all classes, whereas the weighted F1 score considers the number of products per class. The results are given in Table 3. As we expect, supervision outperforms the zero-shot pre-trained model by a large margin. However, there is a significant distribution shift between products and EIFs. As we show in the next section, the distribution shift impacts the overall performance of the Flamingo algorithm.

7.2 Prediction of EIF

We evaluate EIF predictions by BM25 and SBERT as our baseline algorithms. For BM25, we use the implementation by Trotman et al. [54] with default hyper-parameters. For both BM25 and SBERT, we use a distance threshold that maximizes their overall performance and include the results of a threshold sweep in Table 4. For Flamingo, we include the HS2-, HS4-, and HS6-based predictions of both the FlamingoZero and FlamingoPecos variants. We consider two cosine similarity thresholds for SBERT, 0 (conservative) and 0.5 (aggressive). These choices show the tradeoffs in the design of the Flamingo algorithm, and we avoid tuning these parameters on our dataset. We consider two datasets: a small dataset of 90 products labeled by an LCA expert as well as annotators, and a larger dataset of 664 products labeled only by the non-expert annotation team. As the annotation is performed on a ranked list of EIFs provided by FlamingoZero, recall cannot be accurately measured. Therefore, we report overall Precision@1, Macro Precision, and Weighted Precision scores. Precision@1, or simply precision, indicates that the metric is for the top-ranked candidate and could be generalized to Precision@K for the top-K items. It is a common metric used for text ranking [27]. We also break down the overall precision by “No match” and “Match.” By

Table 4. Impact of Distance Threshold on Performance of Baseline Algorithms BM25 and SBERT

| Model | Distance Threshold | Precision@1 (%) |
| --- | --- | --- |
| BM25 | 20 | 29.7 |
| BM25 | 40 | 54.5 |
| BM25 | 60 | 56.9 |
| BM25 | 80 | 58.2 |
| BM25 | 100 | 58.6 |
| BM25 | 120 | 58.1 |
| SBERT | 0.2 | 25.9 |
| SBERT | 0.3 | 29.1 |
| SBERT | 0.4 | 49.5 |
| SBERT | 0.5 | 66.6 |
| SBERT | 0.6 | 61.5 |

Table 5. Evaluation Results for Flamingo with a Dataset of 90 Products

| Method | Distance Threshold | Precision@1 Overall (%) | Precision@1 No Match (%) | Precision@1 Match (%) | Macro Precision | Weighted Precision |
| --- | --- | --- | --- | --- | --- | --- |
| BM25 | 35 | 65.6 | 77.9 | 27.3 | 15.9 | 65.6 |
| SBERT | 0.5 | 61.1 | 64.7 | 50.0 | 24.4 | 67.9 |
| SBERT + BM25 | 0.5 | 76.7 | 95.6 | 18.2 | 21.8 | 70.0 |
| FlamingoPecos | | | | | | |
| FlamingoPecosHS2 | 0.0 | 58.9 | 66.2 | 36.3 | 21.9 | 74.2 |
| FlamingoPecosHS4 | 0.0 | 75.6 | 89.7 | 31.8 | 23.4 | 76.0 |
| FlamingoPecosHS6 | 0.0 | 72.2 | 91.2 | 13.6 | 15.8 | 68.9 |
| FlamingoPecosHS2 | 0.5 | 76.7 | 89.7 | 31.8 | 28.5 | 75.3 |
| FlamingoPecosHS4 | 0.5 | 81.1 | 97.1 | 31.8 | 29.1 | 73.4 |
| FlamingoPecosHS6 | 0.5 | 77.8 | 98.5 | 13.6 | 18.9 | 68.4 |
| FlamingoZero | | | | | | |
| FlamingoZeroHS2 | 0.0 | 53.3 | 55.9 | 45.5 | 20.4 | 74.9 |
| FlamingoZeroHS4 | 0.0 | 67.8 | 77.9 | 36.4 | 21.7 | 69.2 |
| FlamingoZeroHS6 | 0.0 | 74.4 | 94.1 | 13.6 | 17.1 | 61.6 |
| FlamingoZeroHS2 | 0.5 | 68.9 | 76.5 | 45.5 | 28.8 | 73.9 |
| FlamingoZeroHS4 | 0.5 | 75.6 | 88.2 | 36.4 | 29.7 | 71.0 |
| FlamingoZeroHS6 | 0.5 | 77.8 | 98.5 | 13.6 | 19.9 | 62.2 |
| Human Performance | | | | | | |
| Non-expert (Mean of 3) | – | 78.6 | 82.3 | 68.0 | 46.1 | 85.6 |
| Non-expert (Majority) | – | 85.9 | 88.2 | 79.2 | 60.1 | 88.3 |
| Expert | – | 76.1 | 77.9 | 70.8 | 63.6 | 90.4 |

Ground truth is obtained through annotations by an LCA expert.

design, more than 50% of the products in the dataset do not have a matching EIF, to reflect the importance of predicting “No match” in practice.

Table 5 shows the results on the 90-product dataset. Both BM25 and SBERT perform quite well, giving 65.6% and 61.1% Precision@1, respectively. The choice of threshold determines the tradeoff between increasing the precision of “Match” and “No match.” It is intuitive to pick a threshold of 0.5 for SBERT, as 0 translates to orthogonal vectors and 1 to perfectly aligned vectors in embedding space. On the other hand, a good threshold for BM25 changes with the dataset. Flamingo provides an additional method to navigate the same tradeoff, with HS-code-based classification as an intermediate layer.

In this dataset, Flamingo helps improve performance by increasing the “No match” precision and filtering out erroneous EIFs based on their HS code sector. Flamingo increases the probability of “No match” as we increase the specificity of HS codes from 2 to 6 digits. However, this comes at the cost of decreased “Match” precision. HS4 codes provide a good tradeoff between the two extremes and yield the best overall performance. Adding the cosine similarity threshold on EIF selection further increases the probability of a “No match” prediction. The threshold can be adjusted based on the requirements of downstream applications.

Both FlamingoZero and FlamingoPecos yield similar results and outperform the baseline algorithms. FlamingoPecos gives a small performance improvement over FlamingoZero. This is in contrast to the large difference in the performance of the HS6 classifiers in Table 3. As the PECOS model $M^{HP}$ is trained on the product-to-HS6 dataset $D^H$, we hypothesize that FlamingoPecos has low performance on domain-specific EIFs such as chemicals and industrial processes. However, as we are using $M^{HP}$ only to match EIFs for products, there is still a significant distribution overlap between the datasets $D^H$ and $D^P$. On the other hand, the SBERT model is trained on a large-scale web corpus, and it can generalize to domain-specific text. Use of the zero-shot approach also reduces reliance on external data and the maintenance overhead of training models.

The human performance metrics show that there is significant room for improvement in EIF matching algorithms, especially in the selection of an appropriate EIF when it is available in the database.
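The overall, “No match,” and “Match” Precision@1 figures discussed above can be computed as in the following sketch. It assumes that the breakdown groups products by their ground-truth label, so that “No match” precision is the share of products without a matching EIF for which the algorithm also returns “No match”; this reading is an assumption, and the function name and toy data are hypothetical. `None` stands for “No match.”

```python
from typing import Dict, List, Optional, Sequence, Tuple


def precision_at_1(
    truth: Sequence[Optional[str]], top1: Sequence[Optional[str]]
) -> Dict[str, float]:
    """Percent of correct top-1 predictions overall and per ground-truth group."""
    if len(truth) != len(top1) or len(truth) == 0:
        raise ValueError("truth and top1 must be non-empty and equally long")

    def rate(pairs: List[Tuple[Optional[str], Optional[str]]]) -> float:
        if not pairs:
            return float("nan")  # group is empty, precision is undefined
        return 100.0 * sum(t == p for t, p in pairs) / len(pairs)

    pairs = list(zip(truth, top1))
    return {
        "overall": rate(pairs),
        "no_match": rate([pair for pair in pairs if pair[0] is None]),
        "match": rate([pair for pair in pairs if pair[0] is not None]),
    }


# Toy example: three products without an EIF, two with one.
TRUTH = [None, None, None, "wheat", "yogurt"]
TOP1 = [None, None, "wheat", "wheat", None]
```

Because the overall figure is a weighted average of the two groups, a method can score well overall on a dataset dominated by “No match” products while still having low “Match” precision.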
Even when the cosine similarity threshold is set to 0 and no EIFs are filtered based on HS codes, the best “Match” precision is 50%, far below the human-level precision of ∼70%. This observation points to an opportunity to improve on the SBERT model, perhaps by fine-tuning on sentences used in EIFs.

Table 6 shows the same set of results for the larger 664-product dataset. In this case, the ground truth is obtained through majority vote on three annotations per product. We include the human-level performance metrics by comparing the annotations of an individual worker against the majority vote. The reported numbers are average values across workers who annotated at least 100 products. While this is an approximate measure of human performance, it is still instructive to compare these metrics against model predictions.

The trends observed in the small dataset largely hold true for the larger dataset results. However, an exception is that the performance of FlamingoPecos degrades in comparison to both the baseline SBERT algorithm and FlamingoZero. We hypothesize that the domain shift is more significant in the larger product dataset. However, FlamingoZero still outperforms the baseline across all metrics, with up to 8.4% improvement in overall precision. Both HS2 and HS4 code-based filtering give a good balance between “No match” and “Match” prediction. Similar to the smaller dataset, the largest gap from human performance stems from misprediction of EIFs when there is a match in the database. We use the variant FlamingoZeroHS4 with a cosine similarity threshold of 0 as our default algorithm for the rest of the analysis.

7.3 Alternatives to HS Code Prediction

There are multiple industry classification systems that can be an alternative to HS codes. We look at two choices: International Standard Industrial Classification (ISIC) [35] and Central Product Classification (CPC) [36]. They are pertinent choices as they are already included as part of the EIF metadata in the Ecoinvent database [56]. However, we do not have the ISIC or CPC codes of the products in our dataset, so we use the SBERT model to predict the best ISIC and CPC codes for each product using text descriptions. The United Nations also publishes correspondence tables between CPC and HS codes; we include this as an additional option to identify HS codes for

Table 6. Evaluation Results for Flamingo with a Dataset of 664 Products

| Method | Distance Threshold | Precision@1 Overall (%) | Precision@1 No Match (%) | Precision@1 Match (%) | Macro Precision | Weighted Precision |
| --- | --- | --- | --- | --- | --- | --- |
| BM25 | 35 | 53.0 | 87.1 | 5.4 | 7.8 | 46.0 |
| SBERT | 0.5 | 66.6 | 85.8 | 39.7 | 13.9 | 66.8 |
| SBERT + BM25 | 0.5 | 59.0 | 99.0 | 3.2 | 8.1 | 44.5 |
| FlamingoPecos | | | | | | |
| FlamingoPecosHS2 | 0.0 | 62.7 | 68.7 | 54.1 | 13.8 | 80.5 |
| FlamingoPecosHS4 | 0.0 | 57.4 | 89.4 | 12.6 | 12.4 | 69.5 |
| FlamingoPecosHS6 | 0.0 | 57.2 | 92.0 | 8.7 | 7.3 | 67.2 |
| FlamingoPecosHS2 | 0.5 | 68.4 | 92.2 | 35.0 | 14.8 | 72.5 |
| FlamingoPecosHS4 | 0.5 | 61.1 | 97.4 | 10.5 | 11.4 | 49.9 |
| FlamingoPecosHS6 | 0.5 | 59.8 | 97.7 | 6.9 | 6.5 | 47.8 |
| FlamingoZero | | | | | | |
| FlamingoZeroHS2 | 0.0 | 67.2 | 78.3 | 51.6 | 18.0 | 71.3 |
| FlamingoZeroHS4 | 0.0 | 75.0 | 96.3 | 45.1 | 16.4 | 64.0 |
| FlamingoZeroHS6 | 0.0 | 61.4 | 98.7 | 9.4 | 11.0 | 55.2 |
| FlamingoZeroHS2 | 0.5 | 67.8 | 93.8 | 31.4 | 17.3 | 62.9 |
| FlamingoZeroHS4 | 0.5 | 70.2 | 99.2 | 29.6 | 15.7 | 60.7 |
| FlamingoZeroHS6 | 0.5 | 59.5 | 99.7 | 3.2 | 9.4 | 54.2 |
| Human Performance | | | | | | |
| Non-expert (Mean) | – | 76.4 | 76.1 | 75.5 | 42.0 | 86.2 |

Ground truth is obtained by majority vote on three annotations per product.

Table 7. Performance of EIF Matching Using Different Industry Sector Codes

| Industry Code | Distance Threshold | Precision@1 (%) |
| --- | --- | --- |
| HS4 | 0.0 | 75.0 |
| HS4 | 0.5 | 70.2 |
| ISIC | 0.0 | 63.4 |
| ISIC | 0.5 | 68.7 |
| CPC | 0.0 | 52.7 |
| CPC | 0.5 | 59.9 |
| CPC to HS2 | 0.0 | 50.6 |
| CPC to HS2 | 0.5 | 57.8 |

EIFs instead of predicting with the SBERT model. Table 7 shows the performance metrics on the 664-product dataset. EIF matching based on HS codes outperforms the rest of the options across all metrics. Upon manual inspection, we find that HS6 codes are more granular, with a deeper hierarchy, than ISIC and CPC codes. The assignment of ISIC and CPC codes in the EIF database is also subjective; e.g., some mappings correspond to 3-digit CPC codes, while others correspond to 5 digits. There are also errors in the mapping between CPC and HS codes in the correspondence tables, and directly predicting the HS codes reduces the errors compared to mapping correspondence data across two sources.

7.4 Analysis of Errors

We randomly sampled 50 data points incorrectly predicted by FlamingoZeroHS4 from Table 6 and manually analyzed them to understand the reasons behind the erroneous outputs. Figure 7 shows a few examples of the errors, categorized into five types. The largest share of the errors (40%) corresponded to electronic cables, such as those for charging or connecting components (row 4 in the figure). There is an EIF in Ecoinvent, called “cable, unspecified,” that captures all of these different types of cables. We find that SBERT is poor at matching text such as this, where it needs to understand the set relationship across closely related EIFs or HS codes. Some HS codes contain descriptions of a similar nature, e.g., “not elsewhere classified,” and are a common source of errors in matching. Methods that capture such set relationships or exploit the hierarchy in a different manner could overcome this issue.

Fig. 7. Examples of errors by Flamingo. Most errors can be categorized into the five types shown.

In some cases (22%, row 2 in the figure), the error is due to a mismatch between HS codes even when the EIF predicted is the same as the selected ground truth. For example, a toner module is mapped to the HS code “ink for printing” by the SBERT model, whereas the product maps to the related HS code “printing machinery.” In this case, the two related HS codes are not close to each other in the hierarchy, so using a 2-digit or 4-digit version of the code did not remove the error. A related source of error (8%, row 5 in the figure) is when the EIF predicted is not a match even when the HS codes match. In this case, the HS code category is too broad and does not differentiate between unrelated EIFs. Such errors point to the limitations of HS codes as a classification system. A data-driven knowledge base can potentially address such issues by capturing relationships across concepts in a multi-dimensional manner.

About 20% of the errors were due to semantic matching errors (row 1 in the figure) and point to opportunities for improving on SBERT models. There were a few errors due to human mislabeling as well (10%, row 3 in the figure), which could be reduced by improving the annotation procedure. Finally, we found an intriguing but rare error not shown in Figure 7. In one case, the product was a package of a keyboard and mouse. The annotators labeled the ground-truth EIF as keyboard, whereas Flamingo predicted the EIF as mouse. Such errors can be addressed if we break down composite products into their individual items.

7.5 Impact of Prediction Error on Carbon Footprint

We analyze the impact of an EIF misprediction on downstream use cases using the corresponding greenhouse gas emission values in units of kilograms of carbon dioxide equivalent (kgCO2e). We used the “reference product name” metadata in the Ecoinvent dataset to represent an EIF, but

11:18

B. Balaji et al.

Fig. 8. Impact of EIF misprediction error on downstream carbon footprint estimate. In most cases, the error in prediction does not lead to a large error in carbon emissions; however, there are a few outliers that cause large deviations.

there could be multiple entries for a single reference product based on variations such as region of manufacture or system boundary. For this analysis, we compute the average of these individual values, following the practice of prior work [11, 33]. We use Pearson’s correlation (Pearson R), mean absolute percentage error (MAPE), and median absolute deviation (MAD) to characterize the errors. We use the predictions of the FlamingoZeroHS4 variant of the algorithm for this evaluation.

Our analysis is limited by the known EIFs in the dataset. Of the 664 products in our dataset, we have ground-truth EIFs for only 277 products, as the rest do not have a match. Of these 277 products, 130 are predicted to have a match by FlamingoZeroHS4; 125 of the 130 products are predicted correctly and yield low error: Pearson R: 0.95, MAPE: 18%, MAD: 0 kgCO2e. This statistic indicates that misprediction of “No match” is the major source of errors. To further understand the impact of errors due to semantic matching, we analyze the error after removing the HS-code-based filter. Predictions for 172 of 277 products are correct, and the corresponding error metrics are Pearson R: 0.81, MAPE: 144%, MAD: 3.9 kgCO2e.

Figure 8 compares the carbon emission values of the ground-truth EIFs to those of the predicted ones. We use a log/log plot to cover the wide range of values; trends look similar in linear scale. Overall, there is a high correlation between predicted and ground-truth values; the errors are small in absolute terms but larger in relative terms. A few anomalous predictions cause errors of multiple orders of magnitude; 4 points have >1,000% error and an additional 5 points have >100% error. Examples of errors include predicting a microwave as an HVAC unit (8118%), incorrect battery chemistry (324%), and incorrect paper type (112%).

7.6 Generalization to Food Ingredients

LCA experts often match EIFs to materials and manufacturing processes, not just products. We evaluate whether Flamingo generalizes to use cases beyond our product dataset using a separate dataset of food ingredients. Unlike the product descriptions, the ingredients consist of only a few words, e.g., “organic diced tomatoes,” “purified water,” “organic licorice.” Our dataset consists of 272 ingredients, and we follow the same steps as Section 6 to annotate the ground truth. Due to the labor-intensive nature of annotation, we use only one worker for this micro-benchmark and do not have a consensus-based label. Unlike generic products, basic materials such as food ingredients have a higher probability of finding a match in the EIF dataset. From the annotations, we find 87 ingredients that match to EIFs and 177 that do not have a match.

Table 8. Evaluation Results for Flamingo with a Dataset of 264 Food Ingredients

Method           Distance    Precision@1 (%)               Macro       Weighted
                 Threshold   Overall   No Match   Match    Precision   Precision
BM25             0.1         23.1      19.8       30.0     13.4        56.9
SBERT            0.5         34.8      14.1       77.0     26.1        86.2
FlamingoZeroHS4  0.0         70.0      75.7       57.4     41.0        76.1

Ground truth is obtained through annotation by a non-expert worker.

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 11. Publication date: December 2023.

Table 8 summarizes the results. We report the results of the FlamingoZeroHS4 algorithm and compare it with the SBERT and BM25 baselines. In this case, the HS code for the ingredient is not provided as part of the dataset, and we rely on Flamingo to predict the HS codes for both EIFs and ingredients. The results follow the same trends as those from the product dataset experiments, although the improvement over the baselines is even larger in this case (35.2 percentage points in overall Precision@1 over SBERT). As an example, the best semantic match for “pineapple juice” is “pineapple.” The HS codes for these two strings are different, and Flamingo correctly predicts “No match.”

From an experience point of view, LCA experts find Flamingo to be useful and report significant time savings. A skilled practitioner used Flamingo to complete LCA of 15K food products in just 15 minutes, a task that would previously have taken 40 hours of manually mapping EIFs.

8 CONCLUSION AND FUTURE WORK

Measurement of GHG emissions of consumer products is a critical step toward understanding mitigation decisions. LCA provides a systematic approach to measurement of GHG emissions but is challenging to scale due to manual steps. Identifying an appropriate EIF is a critical aspect of LCA, and current practice requires significant time investment from LCA experts. We have presented an algorithm that automates EIF matching for a given query text.
Our algorithm, Flamingo, requires no training data and exploits industry codes to predict whether an EIF exists in the database and to identify the best-matching EIF when it exists. Evaluation on a dataset of 664 products shows that Flamingo can achieve 75% precision and predicts when no EIF exists with 96% precision. While we have focused on GHG emissions throughout the article, the EIF databases include impact estimates for additional categories such as hazardous wastes, fresh water use, and air pollution. Our algorithms can be easily extended to these impacts.

There are several directions to explore that can improve on our methods. We can extend text-based matching to image-based matching using CLIP [39], which generates an embedding with an image-to-text correspondence. The key to further improvement in EIF matching performance is to improve the prediction of HS codes from EIF metadata. While the supervised approach did not lead to significant improvement in performance, future work can consider zero-shot contrastive learning approaches such as MACLR [58] that can exploit EIF text descriptions to learn correspondence with HS codes.

While automation of LCA has been studied for several decades [2], it is now starting to be adopted on a large scale to comply with regulations, meet customer demand, and combat climate change. Flamingo provides a promising start on a critical aspect of LCA, but a number of challenges remain in scaling to millions of consumer products. As seen from our results, EIFs of most consumer products in the market are not available in the database, and more EIFs are added manually from publicly available documents. Automation of EIF extraction from documents is a promising avenue of future work.

ADDITIONAL REMARKS

Aravind Srinivasan’s contribution to this publication was not part of his University of Maryland duties or responsibilities.

ACKNOWLEDGMENTS

We thank the referees for their several helpful comments and suggestions.

REFERENCES

[1] International Energy Agency. 2021. Global energy review 2021: Assessing the effects of economic recoveries on global energy demand and CO2 emissions in 2021. https://www.iea.org/reports/global-energy-review-2021
[2] Mikaela Algren, Wendy Fisher, and Amy E. Landis. 2021. Machine learning in life cycle assessment. In Data Science Applied to Sustainability Analysis. Elsevier, 167–190.
[3] Bharathan Balaji, Venkata Sai Gargeya Vunnava, Geoffrey Guest, and Jared Kramer. 2023. CaML: Carbon footprinting of household products with zero-shot semantic text similarity. In Proceedings of the ACM Web Conference 2023. 4004–4014.
[4] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. 2016. The extreme classification repository: Multi-label datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html
[5] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’15, Montreal, Canada). MIT Press, Cambridge, MA, 730–738.
[6] Robert S. Boyer and J. Strother Moore. 1977. A fast string searching algorithm. Communications of the ACM 20, 10 (1977), 762–772.
[7] Vannevar Bush. 1945. As we may think. Atlantic Monthly 176, 1 (1945), 101–108. https://doi.org/10.1145/227181.227186
[8] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Brussels, Belgium, 169–174. https://doi.org/10.18653/v1/D18-2029
[9] Peggy Chaplin. 1987. An introduction to the harmonized system. North Carolina Journal of International Law & Commercial Regulation 12 (1987), 417.
[10] Xi Chen, Stefano Bromuri, and Marko Van Eekelen. 2021. Neural machine translation for harmonized system codes prediction. In 2021 6th International Conference on Machine Learning Technologies. 158–163.
[11] Michael Clark, Marco Springmann, Mike Rayner, Peter Scarborough, Jason Hill, David Tilman, Jennie I. Macdiarmid, Jessica Fanzo, Lauren Bandy, and Richard A. Harrington. 2022. Estimating the environmental impacts of 57,000 food products. Proceedings of the National Academy of Sciences 119, 33 (2022), e2120584119.
[12] Vincent Colomb, Samy Ait Amar, Claudine Basset Mens, Armelle Gac, Gérard Gaillard, Peter Koch, Jerome Mousset, Thibault Salou, Aurélie Tailleur, and Hays M. G. van der Werf. 2015. AGRIBALYSE®, the French LCI database for agricultural products: High quality data for producers and environmental labelling. Oilseeds and fats, Crops and Lipids (OCL) 22, 1 (2015), D104.
[13] Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. 2018. Reducing network agnostophobia. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18, Montréal, Canada). Curran Associates Inc., Red Hook, NY, 9175–9186.
[14] Liya Ding, ZhenZhen Fan, and DongLiang Chen. 2015. Auto-categorization of HS code using background net approach. Procedia Computer Science 60 (2015), 1462–1471.
[15] Shaohua Du, Zhihao Wu, Huaiyu Wan, and YouFang Lin. 2021. HScodeNet: Combining hierarchical sequential and global spatial information of text for commodity HS code classification. In Advances in Knowledge Discovery and Data Mining: 25th Pacific-Asia Conference (PAKDD’21, Virtual Event, May 11–14, 2021), Proceedings, Part II. Springer, 676–689.
[16] International Organization for Standardization. 2006. Environmental Management: Life Cycle Assessment; Principles and Framework. ISO.
[17] Tao Gao, Qing Liu, and Jianping Wang. 2014. A comparative study of carbon footprint and assessment standards. International Journal of Low-Carbon Technologies 9, 3 (2014), 237–243.
[18] Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide: A Distributed Real-time Search and Analytics Engine. O’Reilly Media, Inc.
[19] Jeroen B. Guinée, Reinout Heijungs, Gjalt Huppes, Alessandra Zamagni, Paolo Masoni, Roberto Buonamici, Tomas Ekvall, and Tomas Rydberg. 2011. Life cycle assessment: Past, present, and future. Environmental Science and Technology 45, 1 (2011), 90–96.
[20] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’14, Montreal, Canada). MIT Press, Cambridge, MA, 2042–2050.
[21] Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Dongsheng Li. 2019. Read+ verify: Machine reading comprehension with unanswerable questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6529–6537.
[22] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[23] Robert G. Hunt. 1974. Resource and Environmental Profile Analysis of Nine Beverage Container Alternatives. Vol. 91. Environmental Protection Agency.
[24] Wesley Ingwersen, Maria Gausman, Annie Weisbrod, Debalina Sengupta, Seung-Jin Lee, Jane Bare, Ed Zanoli, Gurbakash S. Bhander, and Manuel Ceja. 2016. Detailed life cycle assessment of Bounty® paper towel operations in the United States. Journal of Cleaner Production 131 (2016), 509–522.
[25] Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, and Anne Sabourin. 2020. Heavy-tailed representations, text polarity classification & data augmentation. Advances in Neural Information Processing Systems 33 (2020), 4295–4307.
[26] Lucas Joppa, Amy Luers, Elizabeth Willmott, S. Julio Friedmann, Steven P. Hamburg, and Rafael Broze. 2021. Microsoft’s million-tonne CO2-removal purchase—lessons for net zero. Nature 597 (September 2021), 629–632. https://www.microsoft.com/en-us/research/publication/microsofts-million-tonne-co2-removal-purchase-lessonsfor-net-zero/
[27] Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 427–431.
[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[29] Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48.
[30] Klaus Krippendorff. 2011. Computing Krippendorff’s alpha-reliability. Computing 1 (2011), 25–2011.
[31] Eunji Lee, Sundong Kim, Sihyun Kim, Sungwon Park, Meeyoung Cha, Soyeon Jung, Suyoung Yang, Yeonsoo Choi, Sungdae Ji, Minsoo Song, and Heeja Kim. 2021. Classification of Goods Using Text Descriptions With Sentences Retrieval. CoRR abs/2111.01663 (2021). arXiv:2111.01663. https://arxiv.org/abs/2111.01663
[32] Christoph J. Meinrenken, Daniel Chen, Ricardo A. Esparza, Venkat Iyer, Sally P. Paridis, Aruna Prasad, and Erika Whillas. 2022. The carbon catalogue, carbon footprints of 866 commercial products from 8 industry sectors and 5 continents. Scientific Data 9, 1 (2022), 87.
[33] Christoph J. Meinrenken, Scott M. Kaufman, Siddharth Ramesh, and Klaus S. Lackner. 2012. Fast carbon footprinting for large product portfolios. Journal of Industrial Ecology 16, 5 (2012), 669–679.
[34] Bhaskar Mitra, Nick Craswell, et al. 2018. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval 13, 1 (2018), 1–126.
[35] United Nations. 1969. International Standard Industrial Classification of All Economic Activities. UN.
[36] Peter Pariag. 2009. Classification of services. In Regional Symposium on Services. 15–17.
[37] H.-O. Pörtner, D. C. Roberts, H. Adams, I. Adelekan, C. Adler, R. Adrian, P. Aldunce, E. Ali, R. Ara Begum, B. Bednar Friedl, R. Bezner Kerr, R. Biesbroek, J. Birkmann, K. Bowen, M. A. Caretta, J. Carnicer, E. Castellanos, T. S. Cheong, W. Chow, G. Cissé, and Z. Zaiton Ibrahim. 2022. Climate Change 2022: Impacts, Adaptation and Vulnerability. Cambridge University Press, Cambridge and New York, 37–118.
[38] Jianpeng Qi, Yanwei Yu, Lihong Wang, Jinglei Liu, and Yingjie Wang. 2017. An effective and efficient hierarchical K-means clustering algorithm. International Journal of Distributed Sensor Networks 13, 8 (2017), 1550147717728627.

[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748–8763. https://proceedings.mlr.press/v139/radford21a.html
[40] Juan Ramos. 2003. Using TF-IDF to determine word relevance in document queries. In Proceedings of the 1st Instructional Conference on Machine Learning, Vol. 242. Citeseer, 29–48.
[41] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP ’19). 3982–3992.
[42] Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3, 4 (Apr 2009), 333–389. https://doi.org/10.1561/1500000019
[43] Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3). NIST, Gaithersburg, MD, 109–126. https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/
[44] Davide Rovelli, Carlo Brondi, Michele Andreotti, Elisabetta Abbate, Maurizio Zanforlin, and Andrea Ballarino. 2022. A modular tool to support data management for LCA in industry: Methodology, application and potentialities. Sustainability 14, 7 (2022), 3746.
[45] Amit Singhal. 2001. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin 24, 4 (2001), 35–43. http://dblp.uni-trier.de/db/journals/debu/debu24.html#Singhal01
[46] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33 (2020), 16857–16867.
[47] Ines Sousa and David Wallace. 2006. Product classification to support approximate life-cycle assessment of design concepts. Technological Forecasting and Social Change 73, 3 (2006), 228–249.
[48] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[49] Sphera. 2023. Life cycle assessment product sustainability (GaBi) software. https://sphera.com/life-cycle-assessmentlca-software/
[50] World Resources Institute Wbcsd. 2004. The Greenhouse Gas Protocol. A Corporate Accounting and Reporting Standard. Rev. ed., Conches-Geneva, Washington, DC.
[51] Zequn Sun, Muhao Chen, and Wei Hu. 2021. Knowing the no-match: Entity alignment with dangling cases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3582–3593.
[52] Tomohiro Tasaki, Koichi Shobatake, Kenichi Nakajima, and Carl Dalhammar. 2017. International survey of the costs of assessment for environmental product declarations. Procedia CIRP 61 (2017), 727–731. https://doi.org/10.1016/j.procir.2016.11.158
[53] Ken Thompson. 1968. Programming techniques: Regular expression search algorithm. Communications of the ACM 11, 6 (1968), 419–422.
[54] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. In Proceedings of the 2014 Australasian Document Computing Symposium. 58–65.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17, Long Beach, California, USA). Curran Associates Inc., Red Hook, NY, 6000–6010.
[56] Gregor Wernet, Christian Bauer, Bernhard Steubing, Jürgen Reinhard, Emilia Moreno-Ruiz, and Bo Weidema. 2016. The Ecoinvent database version 3 (part I): Overview and methodology. International Journal of Life Cycle Assessment 21, 9 (2016), 1218–1230.
[57] Hans-Michael Wolffgang and Christopher Dallimore. 2012. The world customs organization and its role in the system of world trade: An overview. European Yearbook of International Economic Law (EYIEL), 3, 2012 (2012), 613–633.
[58] Yuanhao Xiong, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit Dhillon. 2022. Extreme zero-shot learning for extreme text classification. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 5455–5468.
[59] Yi Yang, Wesley W. Ingwersen, Troy R. Hawkins, Michael Srocka, and David E. Meyer. 2017. USEEIO: A new and transparent United States environmentally-extended input-output model. Journal of Cleaner Production 158 (2017), 308–318.
[60] Hsiang-Fu Yu, Jiong Zhang, Wei-Cheng Chang, Jyun-Yu Jiang, Wei Li, and Cho-Jui Hsieh. 2022. Pecos: Prediction for enormous and correlated output spaces. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4848–4849.

[61] Hsiang-Fu Yu, Kai Zhong, and Inderjit S. Dhillon. 2020. PECOS: Prediction for Enormous and Correlated Output Spaces. arXiv:2010.05878 [cs.LG]
[62] Weixin Zeng, Xiang Zhao, Jiuyang Tang, Xinyi Li, Minnan Luo, and Qinghua Zheng. 2021. Towards entity alignment in the open world: An unsupervised approach. In Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11–14, 2021, Proceedings, Part I (Taipei, Taiwan). Springer-Verlag, Berlin, Heidelberg, 272–289. https://doi.org/10.1007/978-3-030-73194-6_19
[63] Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, and Yaohui Jin. 2018. Multi-task label embedding for text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 4545–4553.

Received 16 February 2023; revised 14 June 2023; accepted 24 June 2023


Tracking Socio-Economic Development in Rural India over Two Decades Using Satellite Imagery

ANANT GULGULIA, AMAN GUPTA, AKSHAY P SARASHETTI, AADITYA SINHA, and AADITESHWAR SETH, Indian Institute of Technology Delhi, India

Longitudinal analysis of socio-economic development at sub-national scales can reveal valuable insights about which areas tend to develop faster than others and why. However, such analysis is difficult to conduct with traditional data sources such as censuses and surveys which are not repeated frequently and may require assumptions for imputation of values at non-surveyed locations. Indicators of socio-economic development based on satellite data have emerged as a proxy to track development at fine spatial and temporal scales. We build a model using daytime and nightlights satellite data to estimate an index of socio-economic development at the village level in India. We evaluate our model for temporal robustness and use it to produce estimates at three time points over a two-decade period. We then use these estimates to understand the effect on village-level development of factors such as the geographic distance of a village to hubs of economic activity and the inequality of development in the district. Our findings provide evidence of the possible impact that policy changes during this period have had on village development.

CCS Concepts: • Applied computing → Economics; • Computing methodologies → Machine learning; Computer vision;

Additional Key Words and Phrases: Poverty mapping, satellite data, nightlights, socio-economic development, inequality, census

ACM Reference format:
Anant Gulgulia, Aman Gupta, Akshay P Sarashetti, Aaditya Sinha, and Aaditeshwar Seth. 2023. Tracking Socio-Economic Development in Rural India over Two Decades Using Satellite Imagery. ACM J. Comput. Sustain. Soc. 1, 2, Article 12 (December 2023), 31 pages. https://doi.org/10.1145/3615361

Authors’ address: A. Gulgulia, A. Gupta, A. P. Sarashetti, A. Sinha, and A. Seth, Indian Institute of Technology Delhi, IIT Campus, Hauz Khas, New Delhi, Delhi 110016, India; e-mails: [email protected], [email protected], [email protected], [email protected], [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2834-5533/2023/12-ART12 $15.00 https://doi.org/10.1145/3615361
ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 12. Publication date: December 2023.

1 INTRODUCTION

Longitudinally tracking socio-economic development at sub-national scales can provide important insights into underlying development processes that arise from and lead to diversity in development levels within countries [1–3]. Regularly conducted surveys and censuses are standard tools for this purpose. However, surveys and censuses may not be comparable over successive rounds, either due to changes in sampling strategies or in the parameters that are collected; they may not be conducted regularly; or the data may not be made available in a timely manner. For example, India conducts a household census every 10 years, but the data takes several additional years to be released. The Indian census originally scheduled for 2021 was inordinately delayed due to COVID-19 disruptions. Partly to counter this lack of data availability, and to bring some standardization to development indicators that can be compared over several years, proxy variables are often used, such as nightlights satellite data or machine learning models trained on daytime satellite imagery that produce estimates of wealth indexes and other variables [4–6].

A few gaps, however, remain in related research in this space, especially for studies in the Indian context. First, models to estimate socio-economic indicators using daytime satellite imagery have not been adequately evaluated for temporal robustness, i.e., whether models trained on data from one year can be used to produce estimates for other years. In prior work, we attempted to study this at the district level in India [9]; in this article, we describe our studies at the village level. Second, whereas nightlights time series have been explored to track sub-national scale development over time [7], their low variability for rural areas renders this method unsuitable by itself for village-level tracking. We build a method that uses both daytime satellite imagery and nightlights data to estimate a composite indicator of socio-economic development at the village level.

Some unique aspects of our method to estimate socio-economic development are as follows. Similar to Jean et al. [6], we use pre-trained Convolutional Neural Networks (CNNs) based on a variant of a ResNet architecture to learn a model that can produce first-level estimates of development variables, using data from the 2011 Indian census as ground truth labels. We then improve the estimates by building a model that also takes other features into account, namely the first-level estimates of development variables of neighbouring villages, nightlights-based features for the given village and neighbouring villages, and the distance of a village to the nearest hub of economic activity (also obtained from nightlights data).
We then perform feature selection specifically to ensure temporal robustness, by identifying the sets of features that produce the most accurate estimates for 2001: we train the model on census data from 2011 and evaluate its accuracy on census data from 2001, on those indicators that are available for both census years. Given a satisfactory temporally robust model, we then produce standardized socio-economic development estimates for 2001 and 2019, and go on to test various hypotheses of village development dynamics.

We build a composite indicator that combines variables related to asset ownership, access to water, bathroom facilities, literacy, and so forth, in a manner similar to how the Human Development Index is calculated by giving equal weight to economic, education, and health variables. We term this the Aggregate Development Index (ADI) and use the satellite data based models to produce village-level estimates for the years 2003, 2011, and 2019.¹

We then conduct an econometric analysis to explain village-level development changes over these (approximately) two decades in terms of covariates such as the distance of a village to hubs of economic activity, its development relative to other villages in the district, and the inequality of the district to which the village belongs. We find that villages farther away from economic hubs tend to have a slower pace of development. We also find that less developed villages in general tend to have a faster rate of development, indicating a catch-up phenomenon. At the district level, though, our analysis reveals that villages in more unequal districts were developing faster during the 2003–2011 period, but this changed during the 2011–2019 period. We discuss that a likely reason may be changes in Indian economic policies over these two decades.
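The equal-weight idea behind a composite index such as the ADI can be illustrated with a short sketch. This section does not give the exact formula, so the code below is an assumption-laden example rather than the ADI definition: it assumes each indicator is min-max scaled to [0, 1] across villages and then averaged arithmetically, and the names `min_max_scale`, `aggregate_development_index`, and the indicator keys are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_development_index(indicators):
    """Equal-weight composite of per-village indicators (illustrative only).

    indicators: dict mapping an indicator name to a list of per-village values.
    Returns one composite score per village, each in [0, 1].
    """
    scaled = [min_max_scale(vals) for vals in indicators.values()]
    n_villages = len(scaled[0])
    # Every indicator contributes with the same weight, 1 / number of indicators.
    return [sum(col[i] for col in scaled) / len(scaled) for i in range(n_villages)]


adi = aggregate_development_index(
    {
        "asset_ownership": [0, 5, 10],   # hypothetical indicator values
        "literacy": [50, 75, 100],
        "water_access": [1, 1, 3],
    }
)
```

A geometric mean, or weights grouped by economic, education, and health dimensions, would be equally plausible readings of "equal weight"; the arithmetic mean is used here only because it is the simplest.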
To the best of our knowledge, our work is the first attempt at a methodology using satellite data to study village-level changes over such a long time span, and especially to study the changes in the context of India. We produce ADI estimates for the past two decades for 14 states, comprising more than 77.8% of the population of India, to serve as an important data product for other development economics research.

Our work is not without its limitations. More recent CNN-based architectures such as those using attention models may produce more accurate estimates than our models. An end-to-end neural network architecture instead of the two-stage setup that we developed may also prove to be better. Finally, instead of using models pre-trained on the ImageNet data such as ResNet, models pre-trained on labeled satellite data such as the DeepSat model [46] may perform better, and we invite other researchers to improve upon our work. All data and code from our effort have been made available as open source.²

¹ As we explain later, the data source we used for daytime satellite imagery is available from 2003 onward. We consider this to be sufficiently close to the census year of 2001 and therefore use satellite data from 2003 to evaluate our estimates of socio-economic development against the census data for 2001. Both the census and satellite datasets coincide for 2011. We then choose 2019 as the next year for which we obtain satellite data to analyze socio-economic development over this time span of almost two decades.

2 RELATED WORK

With improvements in the availability of satellite data, and its use in machine learning based methods to obtain estimates of socio-economic variables, several novel applications have emerged in development economics. These include inferences for population density [10], gridded estimates of poverty mapping [11], use in targeting of social welfare support [13], and so on. In this section, we describe some such efforts and the scope of using satellite data based estimation of socio-economic development variables to study some fundamental hypotheses in development research.

2.1 Use of Nightlights Data

Nightlights data is satellite imagery, captured during night hours, of the intensity of light emitted by human-made lighting sources on the earth’s surface. It is captured by the Suomi NPP satellite system at a spatial resolution of ∼500m and a temporal resolution of 12 hours and is made openly available in the form of the NPP-Visible Infrared Imaging Radiometer Suite (VIIRS) series (referred to as VIIRS data henceforth, available from 2013). An earlier Defense Meteorological Satellite Program (DMSP)-OLS series, referred to as DMSP data henceforth, is available from 1992 to 2013. These data have been shown to correlate at the country level with a nation’s GDP [33, 34]. They have also been used in other domains: to assess the impact of war and post-war recovery measures in Syria [35], to study the relationship between electoral cycles in north India and the selective electrification of certain constituencies to influence their voting patterns [37], and to track the impact of international trade sanctions on economic activities in North Korea [7], among others. GDP estimation at the sub-national scale of states in India has also been studied [5]. Nightlights data, however, suffers from over-saturation in urban centers and from light intensities too low to be observed in rural areas, which renders it inadequate by itself to study socio-economic factors at the fine spatial scales of villages and within cities [5, 8]. In our work, we use features from nightlights data in conjunction with daytime imagery to overcome such challenges.

2.2 Use of Daytime Satellite Data

Spectral data of surface reflection from the earth’s surface is captured by satellite systems such as Landsat (at a 30m resolution, with re-visit times of [...]

[...] Mean((Gini Change/ADI Change) for Medium ADI districts)

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 12. Publication date: December 2023.

Tracking Socio-Economic Development in Rural India Using Satellite Imagery


Table 12. Mean Value of Slope and z-Scores for Districts Having Positive ADI Change and Negative Gini Change

Period | Mean slope for low ADI districts | Mean slope for medium ADI districts | Mean slope for high ADI districts | High and low slope diff: z-test | High and medium slope diff: z-test
2003–2011 | –0.018 | –0.028 | –0.066 | –2.756 | –1.899
2011–2019 | –0.183 | –1.414 | –0.061 | 1.857 | 1.606

In Table 11, the absolute values of the z-scores are less than 1.645 (the one-sided critical value at 5% significance) for both periods, which indicates that the Kuznets hypothesis cannot be accepted with confidence, although the relative mean values of the slope during the 2003–2011 period agree with the Kuznets hypothesis: more developed districts have a lower slope. However, the negative sign during the 2011–2019 period is curious and indicates that inequality during this time increased at a faster rate with the ADI in medium and high ADI districts than in low ADI districts. This, when coupled with the analysis in the previous section, seems to suggest that policy changes may have slowed down the development of less developed villages in more developed districts (likely to be those which are more industrialized) and thus not reduced inequality in these districts. Figure 15 shows box plots for the change in the Gini coefficient for districts in their starting phase of the Kuznets curve. The same trends are visible: inequality increased more in less developed districts than in more developed districts during 2003–2011, as hypothesized by Kuznets, but this changed during 2011–2019.

5.2.2 Ending Phase of the Kuznets Curve. We similarly construct hypotheses for the ending phase of the Kuznets curve, where the slope of the Gini (Y axis) against the ADI (X axis) is expected to be higher in magnitude (and negative) for high ADI districts than for medium or low ADI districts, among districts that see a negative Gini change (decrease in inequality) and a positive ADI change (increase in economic development). As before, we exclude districts that showed a negative or no change in the ADI, and conduct the test on the 19.7% of districts during 2003–2011 and the 28.07% during 2011–2019 for which the ADI change is positive and the Gini change is negative.
Null Hypothesis
• Mean((Gini Change/ADI Change) for High ADI districts) = Mean((Gini Change/ADI Change) for Low ADI districts)
• Mean((Gini Change/ADI Change) for High ADI districts) = Mean((Gini Change/ADI Change) for Medium ADI districts)

Alternate Hypothesis
• Mean((Gini Change/ADI Change) for High ADI districts) < Mean((Gini Change/ADI Change) for Low ADI districts)
• Mean((Gini Change/ADI Change) for High ADI districts) < Mean((Gini Change/ADI Change) for Medium ADI districts)

From Table 12, during the 2003–2011 period, there seems to be sufficient evidence to not reject the Kuznets hypothesis: high ADI districts seem to have a greater reduction in inequality than low and medium ADI districts. However, this seems to reverse (reasonably strong statistical significance) during the 2011–2019 period, again hinting towards a possible effect of policy changes resulting in slower development of villages in high ADI districts (more industrialized) than in low and medium ADI districts.
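The one-sided comparison of mean slopes described above can be illustrated with a minimal sketch. It assumes a two-sample z-test with an unequal-variance standard error, which may differ from the exact formulation behind Table 12; the function name `one_sided_z_test` and the slope values are hypothetical and are not taken from the article's data.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_z_test(slopes_high, slopes_other, alpha=0.05):
    """Test H0: mean(high) = mean(other) against H1: mean(high) < mean(other).

    Each input is a list of per-district slopes (Gini change / ADI change).
    The unequal-variance standard error used here is an assumption.
    """
    standard_error = math.sqrt(
        variance(slopes_high) / len(slopes_high)
        + variance(slopes_other) / len(slopes_other)
    )
    z_score = (mean(slopes_high) - mean(slopes_other)) / standard_error
    critical_value = NormalDist().inv_cdf(alpha)  # lower-tail cutoff
    return z_score, critical_value, z_score < critical_value


# Made-up slopes for illustration only; these are not the article's data.
high_adi = [-0.08, -0.07, -0.06, -0.05, -0.07, -0.06]
low_adi = [-0.02, -0.01, -0.03, -0.02, -0.01, -0.02]

z_score, critical_value, reject_null = one_sided_z_test(high_adi, low_adi)
print(f"z = {z_score:.3f}, critical value = {critical_value:.3f}, reject H0: {reject_null}")
```

With alpha = 0.05, the lower-tail critical value is about –1.645, the threshold used with the z-scores above. Under this sign convention, a z-score below the critical value supports the alternate hypothesis, while a positive z-score points in the opposite direction.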


A. Gulgulia et al.

Fig. 16. Box plots of change in Gini for districts in the ending phase of the Kuznets curve during 2003–2011 and 2011–2019.

Figure 16 shows box plots for the change in the Gini coefficient for districts in their ending phase of the Kuznets curve. Inequality reduced more in more developed districts than in less developed districts during 2003–2011, as hypothesized by Kuznets, but this changed during 2011–2019.

6 DISCUSSION AND CONCLUSION
We demonstrated the feasibility of using satellite data to estimate socio-economic development at the village scale in a temporally robust manner. Our method combines daytime satellite imagery with nightlights data. We then showed that this estimate can be used to study development patterns and validated several such observations made in other studies. We showed that villages farther away from hubs of economic activity tend to be less developed, and to develop more slowly, than villages closer to economic activity. We also showed that a catch-up pattern exists wherein villages that are less developed tend to develop faster to catch up with other villages in the district. At the same time, a positive relationship is seen during the 2003–2011 period between village development and the initial inequality in a district: villages in more unequal districts tend to develop faster. This relationship, however, reversed during the 2011–2019 period. We hypothesize that this is likely due to policy changes in India during this time wherein economic activities such as industrialization that would have been more prevalent in unequal districts may have slowed down. This is consistent with our interpretation of testing the Kuznets hypothesis during these two periods. According to Kuznets, inequality is likely to first increase with industrialization-led economic growth and then decrease as wages and social welfare improve. Something similar seems to have happened in India during the 2003–2011 period, where industrialization-dominated economic activity followed by more equitable development would have led to districts following the Kuznets hypothesis.
However, policy changes that disrupted the informal economy during 2011–2019 seem to have led to a pattern where unequal districts saw a slowdown in economic activity and thereby slower development of their villages, and hence also a reversal of the Kuznets curve. With 80% of the non-agricultural employment in India in the informal sector [47], the recent policy push towards formalization using sudden treatments like demonetization and GST, coupled with a change in the social welfare strategy from asset creation in rural areas towards cash transfers, seems to have concentrated growth in already developed areas and thereby increased inequality. Gradual formalization coupled with improvements in rural asset creation would perhaps be a more appropriate policy to balance growth and equality [48].


Through this experience, we believe that the ADI outputs we have produced, and our methodology that can be used to produce similar outputs for other years, can be used to study development patterns at the village level and at higher aggregate geographical scales. Changes in indicators related to health, vaccination coverage, education, welfare expenditure, and so on can be studied with the ADI as a covariate or an outcome variable to understand co-movement and causal patterns. In future work, we plan to study the impact of welfare expenditure on socio-economic development to understand how much and what kind of welfare expenditure is more impactful for socio-economic development.

APPENDIX A DISCRETIZATION OF CENSUS SOCIO-ECONOMIC INDICATORS
Box plots for the clustering of the census indicators FC, BF, LIT, MSW, and ASSET are shown in Figures 17, 18, 19, 20, and 21, respectively.

Fig. 17. Clustering of the FC indicator into level-1/2/3 villages.

Fig. 18. Clustering of the BF indicator into level-1/2/3 villages.


Fig. 19. Clustering of the LIT indicator into level-1/2/3 villages.

Fig. 20. Clustering of the MSW indicator into level-1/2/3 villages.

Fig. 21. Clustering of the ASSET indicator into level-1/2/3 villages.

ACKNOWLEDGMENTS
We are grateful to the High Performance Computing (HPC) infrastructure team of IIT Delhi for their support. We also want to express our gratitude to Chahat Bansal and Badrinath Padmanabhan for support with a few technical aspects, and anonymous reviewers for their feedback in strengthening the article.

REFERENCES
[1] D. J. Peters. 2012. Income inequality across micro and meso geographic scales in the midwestern United States, 1979–2009. Rural Sociology 77, 2 (2012), 171–202. https://doi.org/10.1111/j.1549-0831.2012.00077.x
[2] A. Kalaiyarasan and M. Vijayabaskar. 2021. The Dravidian Model: Interpreting the Political Economy of Tamil Nadu. Cambridge University Press. https://doi.org/10.1017/9781108933506
[3] A. Kohli. 2012. State and redistributive development in India. In Growth, Inequality and Social Development in India: Is Inclusive Growth Possible?, R. Nagaraj (Ed.). Palgrave Macmillan, London, UK. https://doi.org/10.1057/9781137000767_7


[4] C. Hall. 2022. Nighttime lights. Earthdata. Retrieved September 5, 2023 from https://www.earthdata.nasa.gov/learn/backgrounders/nighttime-lights
[5] A. Prakash, A. K. Shukla, C. Bhowmick, R. Carl, and M. Beyer. 2019. Night-time luminosity: Does it brighten understanding of economic activity in India? Reserve Bank of India Occasional Papers 40, 1. https://www.researchgate.net/publication/334811462_Nighttime_Luminosity_Does_it_Brighten_Understanding_of_Economic_Activity_in_India
[6] N. Jean, M. Burke, M. E. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. 2016. Combining satellite imagery and machine learning to predict poverty. Science 353, 6301 (2016), 790–794. https://doi.org/10.1126/science.aaf7894
[7] Y. S. Lee. 2018. International isolation and regional inequality: Evidence from sanctions on North Korea. Journal of Urban Economics 103 (2018), 34–51. https://doi.org/10.1016/j.jue.2017.11.002
[8] F. Bickenbach, E. Bode, P. Nunnenkamp, and M. Soder. 2016. Night lights and regional GDP. Review of World Economics 152 (2016), 425–447. https://doi.org/10.1007/s10290-016-0246-0
[9] C. Bansal, A. Jain, P. Barwaria, A. Choudhary, A. Singh, A. Gupta, and A. Seth. 2020. Temporal prediction of socioeconomic indicators using satellite imagery. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. 73–81. https://doi.org/10.1145/3371158.3371167
[10] W. Hu, J. Patel, Z. Robert, P. Novosad, S. Asher, Z. Tang, M. Burke, D. Lobell, and S. Ermon. 2019. Mapping missing population in rural India: A deep learning approach with satellite imagery. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. https://doi.org/10.1145/3306618.3314263
[11] K. Ayush, B. Uzkent, M. Burke, D. B. Lobell, and S. Ermon. 2020. Generating interpretable poverty maps using object detection in satellite images. In Proceedings of the 29th International Joint Conference on Artificial Intelligence Special Track on AI for CompSust and Human Well-Being. 4410–4416. https://doi.org/10.24963/ijcai.2020/608
[12] S. Asher, T. Lunt, R. Matsuura, and P. Novosad. 2021. Development research at high geographic resolution: An analysis of night-lights, firms, and poverty in India using the SHRUG Open Data platform. World Bank Economic Review 35, 4 (2021), 845–871. https://doi.org/10.1093/wber/lhab003
[13] E. Aiken, S. Bellue, D. Karlan, C. Udry, and J. E. Blumenstock. 2022. Machine learning and phone data can improve targeting of humanitarian aid. Nature 603, 7903 (2022), 864–870. https://doi.org/10.1038/s41586-022-04484-9
[14] Landsat NASA. 2022. Landsat 7 | Landsat Science. Landsat Science | A Joint NASA/USGS Earth Observation Program. Retrieved September 5, 2023 from https://landsat.gsfc.nasa.gov/satellites/landsat-7/
[15] M. Wong. 2023. Visible Infrared Imaging Radiometer Suite (VIIRS). Earthdata. Retrieved September 4, 2023 from https://earthdata.nasa.gov/earth-observation-data/near-real-time/download-nrt-data/viirs-nrt
[16] D. H. Fabini, D. P. De Leon Barido, A. Omu, and J. Taneja. 2014. Mapping induced residential demand for electricity in Kenya. In Proceedings of the 5th ACM Symposium on Computing for Development (ACM DEV-5’14). ACM, New York, NY, 43–52. https://doi.org/10.1145/2674377.2674390
[17] D. Mohan, S. Marri, B. Calcuttawala, M. Kasodekar, A. Bhaskaran, and H. Sharma. 2023. Modi govt’s fiscal policy on welfare: Trends so far and what to expect. The Wire. Retrieved September 4, 2023 from https://thewire.in/economy/modi-govts-fiscal-policy-on-welfare-trends-so-far-and-what-to-expect
[18] G. S. Fields. 2001. Distribution and Development: A New Look at the Developing World. MIT Press, Cambridge, MA. https://doi.org/10.7551/mitpress/2465.001.0001
[19] R. W. Fogel. 1987. Some Notes on the Scientific Methods of Simon Kuznets. NBER Working Paper No. w2461. National Bureau of Economic Research. https://doi.org/10.3386/w2461
[20] C. E. Utazi, J. Thorley, V. A. Alegana, M. J. Ferrari, S. Takahashi, B. Roche, J. Lessler, and A. J. Tatem. 2018. High resolution age-structured mapping of childhood vaccination coverage in low and middle income countries. Vaccine 36, 12 (2018), 1583–1591. https://doi.org/10.1016/j.vaccine.2018.02.020
[21] A. Elmustafa, E. Rozi, Y. He, G. Mai, S. Ermon, M. Burke, and D. B. Lobell. 2022. Understanding economic development in rural Africa using satellite imagery, building footprints and deep models. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems. https://doi.org/10.1145/3557915.3561025
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12). 1097–1105. http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf
[23] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud. 2020. A comprehensive analysis of deep regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 9 (2020), 2065–2081. https://doi.org/10.1109/tpami.2019.2910523
[24] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 5987–5995. https://doi.org/10.1109/cvpr.2017.634
[25] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 9260–9269. https://doi.org/10.1109/cvpr.2019.00949


[26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). 618–626. https://doi.org/10.1109/iccv.2017.74
[27] M. D. Zeiler and R. Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014. Lecture Notes in Computer Science, Vol. 8689. Springer, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
[28] N. Jethani, M. Sudarshan, Y. Aphinyanaphongs, and R. Ranganath. 2021. Have we learned to explain?: How interpretability methods can learn to encode predictions in their interpretations. In Proceedings of the International Conference on Artificial Intelligence and Statistics. 1459–1467. https://doi.org/10.48550/arXiv.2103.01890
[29] N. Otsu. 1979. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (1979), 62–66. https://doi.org/10.1109/tsmc.1979.4310076
[30] S. Kuznets. 1955. Economic growth and income inequality. American Economic Review 45, 1 (1955), 1–28. http://www.jstor.org/stable/1811581
[31] S. Asher, K. Nagpal, and P. Novosad. 2018. The Cost of Distance: Geography and Governance in Rural India. World Bank Working Paper. World Bank. https://thedocs.worldbank.org/en/doc/499571527990962709-0010022018/original/B1administrativeremotenesspaper20180223.pdf
[32] Z. Chen, B. Yu, C. Yang, Y. Zhou, S. Yao, X. Qian, C. Wang, B. Wu, and J. Wu. 2021. An extended time series (2000–2018) of global NPP-VIIRS-like nighttime light data from a cross-sensor calibration. Earth System Science Data 13, 3 (2021), 889–906. https://doi.org/10.5194/essd-13-889-2021
[33] X. Chen and W. D. Nordhaus. 2010. The Value of Luminosity Data as a Proxy for Economic Statistics. Cowles Foundation Discussion Paper No. 1766. Cowles Foundation. https://doi.org/10.2139/ssrn.1666164
[34] C. D. Elvidge, K. E. Baugh, S. B. Anderson, P. J. Sutton, and T. K. Ghosh. 2012. The night light development index (NLDI): A spatially explicit measure of human development from satellite data. Social Geography 7, 1 (2012), 23–35. https://doi.org/10.5194/sg-7-23-2012
[35] G. Giovannetti and E. Perra. 2019. Syria in the Dark: Estimating the Economic Consequences of the Civil War through Satellite-Derived Night Time Lights. Working Papers - Economics wp2019. Universita degli Studi di Firenze. https://ideas.repec.org/p/frz/wpaper/wp2019_05.rdf.html
[36] World Bank. 2023. World Bank Open Data. Retrieved September 5, 2023 from https://data.worldbank.org/indicator/EG.ELC.ACCS.ZS
[37] T. Baskaran, B. Min, and Y. Uppal. 2015. Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections. Journal of Public Economics 126 (2015), 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011
[38] A. Sharma. 2023. Electrification still a challenge in rural India. Frontline. Retrieved September 5, 2023 from https://frontline.thehindu.com/the-nation/electrification-still-remains-a-challenge-in-rural-india/article66493576.ece
[39] A. Perez, C. Yeh, G. Azzari, M. Burke, D. Lobell, and S. Ermon. 2017. Poverty prediction with public Landsat 7 satellite imagery and machine learning. In Proceedings of the NIPS 2017 Workshop on Machine Learning for the Developing World. http://arxiv.org/abs/1711.03654
[40] S. Pandey, T. Agarwal, and N. C. Krishnan. 2018. Multi-task deep learning for predicting poverty from satellite images. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence Conference, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’18/IAAI’18/EAAI’18). Article 957, 6 pages. https://doi.org/10.1609/aaai.v32i1.11416
[41] R. Rustowicz, R. Y. Cheong, L. Wang, S. Ermon, M. Burke, and D. B. Lobell. 2019. Semantic segmentation of crop type in Africa: A novel dataset and analysis of deep learning methods. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 75–82. https://openaccess.thecvf.com/content_CVPRW_2019/papers/cv4gc/Rustowicz_Semantic_Segmentation_of_Crop_Type_in_Africa_A_Novel_Dataset_CVPRW_2019_paper.pdf
[42] R. Goldblatt, A. R. Ballesteros, and J. Burney. 2017. High spatial resolution visual band imagery outperforms medium resolution spectral imagery for ecosystem assessment in the Semi-Arid Brazilian Sertão. Remote Sensing 9, 12 (2017), 1336. https://doi.org/10.3390/rs9121336
[43] P. Helber, B. Bischke, A. Dengel, and D. Borth. 2019. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, 7 (2019), 2217–2226. https://doi.org/10.1109/jstars.2019.2918242
[44] W. Wu, Z. Lin, P. Oghazi, and P. C. Patel. 2022. The impact of demonetization on microfinance institutions. Journal of Business Research 153 (2022), 1–18. https://doi.org/10.1016/j.jbusres.2022.08.009
[45] A. Sen, D. Ghatak, K. Kumar, G. Khanuja, D. Bansal, M. Gupta, K. Rekha, S. Bhogale, P. Trivedi, and A. Seth. 2019. Studying the discourse on economic policies in India using mass media, social media, and the parliamentary question hour data. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS’19). ACM, New York, NY, 234–247. https://doi.org/10.1145/3314344.3332489


[46] S. K. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, and R. R. Nemani. 2015. DeepSat: A learning framework for satellite imagery. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’15). ACM, New York, NY, Article 37, 10 pages. https://doi.org/10.1145/2820783.2820816
[47] International Labour Organization. n.d. Informal Economy in South Asia. Retrieved September 5, 2023 from https://www.ilo.org/newdelhi/areasofwork/informal-economy/lang--en/index.htm
[48] J. Drèze and A. Sen. 2013. An Uncertain Glory: India and Its Contradictions. Princeton University Press, North Oxford, UK. https://press.princeton.edu/books/hardcover/9780691160795/an-uncertain-glory
[49] D. Goswami, S. B. Tripathi, S. Jain, S. Pathak, and A. Seth. 2019. Towards building a district development model for India using census data. In Proceedings of the ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS’19). https://doi.org/10.1145/3314344.3332491

Received 14 February 2023; revised 15 June 2023; accepted 6 July 2023


Characterizing Swiss Alpine Lakes: from Wikipedia to Citizen Science

YUANHUI LIN, EPFL and Idiap Research Institute, Switzerland
DANIEL GATICA-PEREZ, Idiap Research Institute and EPFL, Switzerland

Within the scope of a citizen science project that aims at understanding the ecological impact of climate change on bacteria communities in Swiss alpine lakes, we designed and implemented an interactive information platform using data collected from Wikipedia, project-specific data, and other sources. By presenting information about Swiss alpine lakes in an interactive way, the goal of the platform is to raise awareness among the public about the state of Swiss alpine lakes, and ultimately to contribute to the conservation of these ecosystems by engaging citizens. Volunteers were invited to use and assess the platform, by answering questions about alpine lake facts and platform usability. The results show that users can accurately extract factual information from the platform. User feedback was also used to improve the platform functionalities. Finally, an online crowdsourcing activity for lake polygon drawing was conducted to enrich the Swiss alpine lake database with this information. The results show that users can implement this task with high quality.

CCS Concepts: • Human-centered computing → Empirical studies in collaborative and social computing; Empirical studies in visualization;

Additional Key Words and Phrases: Swiss alpine lakes, climate change, citizen science, interactive platform, web-based learning, crowdsourcing

ACM Reference format:
Yuanhui Lin and Daniel Gatica-Perez. 2023. Characterizing Swiss Alpine Lakes: from Wikipedia to Citizen Science. ACM J. Comput. Sustain. Soc. 1, 2, Article 13 (December 2023), 24 pages. https://doi.org/10.1145/3617128

This work was done in the context of the 2000Lakes project, supported by the UNIL-EPFL CLIMACT Starting Grant program.
Authors’ addresses: Y. Lin, EPFL, Rte Cantonale, 1015 Lausanne, Switzerland and Idiap Research Institute, Rue Marconi 19, 1920 Martigny, Switzerland; D. Gatica-Perez, Idiap Research Institute, Rue Marconi 19, 1920 Martigny, Switzerland and EPFL, Rte Cantonale, 1015 Lausanne, Switzerland.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2834-5533/2023/12-ART13 $15.00 https://doi.org/10.1145/3617128
ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 13. Publication date: December 2023.

1 INTRODUCTION
Climate change is a global issue of concern. Numerous studies have been conducted to understand its causes and impacts, and strategies have also been proposed to mitigate global warming. According to the latest assessment report released by the Intergovernmental Panel on Climate Change (IPCC) [9], the current warmer world is already causing negative effects on natural and human systems. It is important that human societies do not accelerate global warming, and take action to ease the situation.

Lakes play an important role in this issue. They are considered “sentinels of climate change” [6] because of their sensitivity to environmental factors, which is especially true for lakes located at high altitudes [37]. Lakes are at the same time witnesses of the past and current climate change and regulators of the future one [45]. To study this important role, previous work has identified several lake variables as indicators of environmental effects [6]. This analysis showed that even the same variables can have different efficacy for different lakes, and as such, different variables should be chosen depending on the research goals.

Many lake studies have been carried out, including the study of single specific lakes [33, 40, 42], lakes within a region [14, 23], and lakes globally [25, 46]. However, alpine lakes in Switzerland remain understudied, despite their significant number. At the same time, Swiss alpine lakes are not often talked about in the local media. As an illustration of this situation, an analysis of a database of local French-speaking Swiss newspaper articles in 2015–2022, based on keyword matching and close reading, showed that only 253 news articles were about Swiss alpine lakes, out of a total of over 130,000 articles. Discussed topics included activities (132 articles) and accidents (52 articles) where a specific lake was the location, and the environment (69 articles), including issues like energy, climate change, pollution, and biodiversity. In addition, only very few lakes are mentioned by name, compared to the actual number of alpine lakes in Switzerland [5].

The 2000Lakes project¹ aims to better understand the unexplored Swiss alpine lakes and bring the subject to the public. Our partner researchers, with specialized expertise, study the chemistry and biology of these lakes and are particularly interested in analyzing the abundance of bacteria communities as a response variable to monitor climate change. Other parameters, such as water temperature and pH value, are also measured. Their research aims at providing necessary knowledge as part of the larger agenda to transition toward a sustainable world.
A second goal of the project is the involvement of citizens through a number of activities, ranging from being informed about the topic, to participating in the lake sampling process in the Swiss Alps, to becoming more interested in the subject overall and willing to take action to contribute to environmental protection.

As researchers in social computing, we contribute to the citizen science component of the project. In this article, we designed, implemented, and evaluated a digital platform to assist the 2000Lakes project in achieving its citizen science goal. Our design aims at building the platform as a digital representation of Swiss alpine lakes. While visualizing project results is a primary goal, various datasets about Swiss alpine lakes exist in separate sources and disparate formats; integrating them can provide a more complete picture of the lakes under study. It is also necessary for the scientific data to be presented in forms that laypeople can comprehend. One way to make data engaging for users is by implementing interactive features. The evaluation of the platform is an important aspect of our work. For a citizen science project, how well the public can be involved is an essential criterion for assessing the platform. Having citizens use the platform and comment on its features can offer valuable insights on how to improve it.

The article presents an analysis of how Swiss alpine lake data can be effectively gathered and visualized for citizen science purposes. We address three specific research questions:
• RQ1 (Data): What complementary information can be collected and integrated to facilitate the comprehension of scientific research findings by non-experts?
• RQ2 (Visualization): What functionalities can be implemented to make the platform attractive to the public and promote user engagement?
• RQ3 (Evaluation): What are effective ways to gather constructive feedback from users, with the aim of assessing and enhancing the platform’s functionalities?

The rest of the article is organized as follows. Section 2 discusses related work. Section 3 describes our methodology. Section 4 presents the results. Section 5 discusses the results and their implications. Finally, Section 6 provides concluding remarks.

¹ https://www.idiap.ch/project/2000lakes/


2 RELATED WORK The concept of citizen science involves people participating in a wide range of collaborative research activities [10, 24, 39]. As the authors of [39] concluded, there are at least three benefits of conducting citizen science. Firstly, people can provide valuable resources to scientists, either data or suggestions on research design and implementation. Secondly, it engages the public and is good for science education and outreach. Thirdly, it democratizes the research process. Citizens can feel as part of the research especially when the research aligns with their own issues of concern. When citizens participate, some ethical issues such as conflict of interest and intellectual property should also be addressed [39]. 2.1 Motivations for Citizen Science It is important to understand the motivations of people for joining research in order to conduct a successful citizen science project. In [11, 28], researchers identified that citizens’ willingness to help the environment is a strong factor. Other motivations include their desire to learn new things and to support scientific research. The authors of [34] showed that openness-to-change values, such as the desire to pursue pleasure and novelty, are the main drivers for initial participation. On the other hand, for maintaining sustaining participation, values of self-transcendence, such as the desire to protect everyone’s well-being and nature, play a more important role [34]. Many projects use the motivation of learning new concepts and pursuing pleasure. In Galaxy Zoo, volunteers were asked to do morphological classification on galaxy images by answering multiple-choice questions. Researchers then examined the motivations of people taking part in this crowdsourced astronomy project and found that for this specific project, volunteers mostly participated because of the desire to be entertained, to learn new things, to participate in a community, and to discover something unique [38]. 
Similarly, in two other projects, Citizen Sort [35, 36] and TagATune [22], games were implemented to amuse users. In Citizen Sort, researchers addressed the issue of attracting volunteers and keeping them involved through games [35, 36]. By implementing a series of artifacts from “tool-like” to “game-like”, researchers compared their efficacy and found that the general public indeed showed more interest in the case with game implementation. They also suggested that task gamification may be a good tradeoff for entertaining volunteers in scientific research. In TagATune, researchers developed a new game mechanism to let volunteers label audio data [22]. The results showed that the game featuring the new mechanism attracted more participants in comparison to other games that collect audio metadata. As such, effective game design can be considered a critical factor in driving user engagement.

Malone’s inspection of the factors that contribute to the enjoyment of educational games is highly informative. As he discussed, there are three essential characteristics that make an instructional game enjoyable: challenge, fantasy, and curiosity [26]. Later in [27], he extended previous work and introduced more elements such as cooperation and competition. When designing a game, one may take each individual element into careful consideration.

2.2 Map Visualization

Map visualization can be essential when presenting research containing geographical information to citizens. By employing effective visualization methods, the information can be presented in a clear and concise manner, facilitating comprehension. Platforms that embed map implementation for this purpose have been studied in the literature. In an air quality monitoring system [19] and Environmental Health Channel [17], various data sources are presented to empower citizens in discussing environmental health issues with the authorities. Maps are implemented on the platforms to better indicate the location of issued areas. ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 13. Publication date: December 2023.


Y. Lin and D. Gatica-Perez

Interactive elements are overlaid on the map so that users can click and get further information with regard to that region. As another example, Smell Pittsburgh is a mobile application that allows citizens to report pollution odors wherever they are, shows the places where these odors are frequently concentrated, and visualizes odor complaints from citizens in real-time [18]. In this app, the odor reports and the air quality detected by the official sensors are visualized by adding various clickable polygons on top of the map layer. Details are only shown after the corresponding polygon is clicked. A general pattern can be found when embedding a map on a platform: extra information is usually added on an overlay above the basic map and is not shown until the user requests it.

In [32], a relatively complete guideline is proposed for map implementation on the web. The authors of [31] also compiled a list of checkpoints to improve map applications, specifically in citizen science projects. Many useful suggestions can be obtained from previous studies, including making the map application simple and adding fun features and help tools. In order to enhance the usability of web mapping sites, an appropriate evaluation methodology is crucial. It is possible to evaluate the platform by various means, such as analyzing server logs, conducting online and paper surveys, or interviewing users directly. A hands-on approach is proposed in [31], which includes drafting a list of map-related tasks, inviting volunteers to finish the tasks, observing their behaviors, and recording the problems encountered.

2.3 Crowdsourcing for Geographical Data

In the context of citizen science, crowdsourcing refers to the practice of having many citizens join data acquisition. By adopting this method, it is easier and less expensive to get a large amount of data, possibly within a shorter time. Moreover, the collected data can cover a wider spatial range and is especially useful for conducting regional research [24]. However, there is an obvious obstacle to crowdsourcing. Data quality is not guaranteed, as data is usually collected by non-professionals, and there can easily be errors or bias [10, 20]. To validate collected data, researchers in [20] suggested comparing citizen-collected data to researcher-collected data, and detecting outliers in this way. In [10], researchers believed that data quality is directly linked to observer quality; therefore, reliability can be determined by different profiles of participants. They also underlined the importance of training citizens before letting them collect data by themselves.

OpenStreetMap (OSM) is a large-scale crowdsourced map, constructed from volunteered geographic information. Only registered users can modify the map, in order to trace the information source. Although OSM declares that data quality is ensured because users will rectify incorrect data when using the map, this is not always the case, as researchers found that the users who actually contribute to map construction make up only an extremely small percentage [16, 30]. Since OSM does not have systematic, internal data quality assurance procedures [16], it is of interest to investigate the geodata accuracy. Quality assessments on OSM are usually done in a predefined region, where collected geodata is compared with authority data in terms of different features [15, 30]. Researchers in [15] highlighted the importance of contributors following a well-defined specification to collect geodata.
In the case of OSM, this is possible because there are communities that hold mapping workshops, and there are also plenty of online platforms that support contributors.

3 METHODOLOGY

Our work follows an iterative methodology, wherein valuable datasets from different sources are collected, a base platform is constructed, user feedback is obtained, and subsequent improvements are made to the platform. This section provides a detailed explanation of the methodology adopted at each stage, along with the presentation of the base platform for the first iteration. Additionally,


a crowdsourced mapping activity that can enrich the data sources for the platform is introduced at the end of this section.

3.1 Data Sources

The data from the 2000Lakes project makes up the first data source of the platform. This scientific data, collected and shared by our project partners, was stored in a structured form, where the name of each sampled lake, its region, altitude, measured pH value, conductivity, temperature, dissolved oxygen, 16S, 18S, g23, and other information were recorded. These measured values can be used to showcase the influence of climate change in Switzerland, offering a fresh perspective on individual lakes.

Using this data alone, however, has limitations. To begin with, the data at the time of platform design was limited in terms of quantity, as only 24 lakes had been analyzed with respect to all the above parameters. More generally, it was not realistic to expect researchers to be able to provide complete data in a short time, as sampling and analysis are time-consuming, and also because one of the potential purposes of this platform was to attract people to help with future sampling. Therefore, a database of more Swiss mountain lakes was needed. Secondly, the data variables from the 2000Lakes project are generated from a scientific perspective, which includes concepts that may be unfamiliar to the general public. The terminology should therefore be well explained on the platform. At the same time, presenting only such scientific data may overwhelm users and make them lose interest. Therefore, additional information that is closer to people’s common experience should be added.

We addressed RQ1 by considering various sources of data. Public, collectively created resources like Wikipedia can be a valuable starting point for citizen science applications [8, 12]. In our case, the Wikipedia page containing a list of mountain lakes of Switzerland [44] is a suitable source of complementary data.
More specifically, there are 217 lakes in that list, from 15 different cantons. For each lake in the list, the page includes the name, canton, elevation, and surface area. Furthermore, a total of 168 of the lakes have a link to another Wikipedia page, which is about either the lake itself or the glacier or village where the lake is located. From the lake’s own page, some attributes listed in a table were extracted, such as the maximum width, length, depth, and water volume, and were integrated into our database. We also extracted the paragraphs describing the lakes and applied natural language processing (NLP) techniques to them. From an initial assessment, however, named-entity recognition and text summarization did not provide any additional information, so we decided not to use the NLP results. Advanced NLP algorithms were not needed at this initial stage; however, they could be included in the future if they produce additional benefits. All the data were scraped from Wikipedia using Python with the Requests and BeautifulSoup libraries. Out of the 217 lakes, 4 had also been sampled by our research partners and therefore had research results; combining the 217 Wikipedia lakes with the 24 sampled lakes and counting these 4 only once results in 237 lakes in total. The geo-coordinate of each lake was added using the Google Geocoding API, and the accuracy of the coordinates was manually verified.

To obtain more visual information, geo-polygon data of each canton and of each lake was sought. Geojson files of all cantons and lakes were downloaded from Swisstopo (the Swiss Federal Office of Topography) [43], where all the Swiss geodata were made available to the public. 163 out of 237 lakes have such geo-polygon information. Later, an online activity was designed to complete the absent geo-polygon data (see Section 3.4).

Over four thousand images were available from the Wikipedia pages of individual lakes.
However, not all lakes have corresponding image information on Wikipedia, and some images dated from a long time ago, which might not present their current appearance. Therefore, other sources of images were needed. We studied other platforms including Google Maps, where photos are uploaded by users. Because a large population uses Google Maps, it has a more comprehensive collection of photos, meaning that more lakes can be found with photos. Consequently, we decided to utilize Google Maps as our resource. In practice, photos are retrieved using Google Maps Places API, which allows non-commercial usage with attribution properly displayed.

To sum up, multi-sourced data were used on the platform. The initial data comes from the 2000Lakes project and was extended with data from Wikipedia, Swisstopo, and Google API. The final data summary for the recorded number of lakes in each canton is shown in Table 1. However, please note that not all lakes have every attribute and that the final dataset does not include all alpine lakes in Switzerland. A complete lake dataset is still a goal for future work.

Table 1. List of Swiss Alpine Lakes Per Canton

Canton    Number of lakes
AI        3
BE        22 (1)
FR        2 (1)
GL        7
GR        64
TI        35
JU        1
NE        1
NW        2
OW        5
SZ        4
SG        8
UR        9
VS        69 (22)
VD        7

The number of lakes per canton that have been sampled so far is shown in brackets when the number is not zero. Only lakes from three cantons have been sampled and are part of the current data.

3.2 Base Visualization Platform

To address RQ2, it is imperative to use an attractive color composition for the website. Efforts should be directed towards enhancing the clarity of the layout and implementing interactive functions that will effectively engage users.

3.2.1 Aesthetic Design. The aesthetic design of the platform is based entirely on the design of the logo of the 2000Lakes project. The font family is chosen to be Montserrat, and the four colors in the logo, one shade of green (#02C39A) and three different shades of blue (#011A38, #90E0EF, #E4F8FB), are applied in different elements throughout the website. Aside from that, the red color (#DA291C) is used, as it is the color of the Swiss flag in the logo. The goal is to achieve a coherent visual appearance between the platform and the project. There are several justifications for this decision. Primarily, the usage of the same logo and design style serves to emphasize its connection with the project. Such coherence reinforces individuals’ perceptions of the larger research project. Additionally, it is preferable to reuse the colors from the logo and avoid superfluous colors which could make the platform less visually pleasant. Finally, green and blue colors can be seamlessly associated with the natural environment and are suitable for this particular context.

3.2.2 Function Design. There are certain desired features for the platform. First, a progress bar should be displayed to represent the current status of the project’s lake sampling and analysis phases. Showing this to a viewer conveys the message that the project is still at its early stage and volunteer participation in project activities will actually advance its progress further. Ideally, it encourages people to take action and join the research activities. Second, the analysis results of each lake will be shown on the platform. A section of explanation on these measured parameters is required.
Each parameter should be described in detail, including what it is, why it is important for the lake ecosystem, and how human activities can possibly alter it. The influence of human activities needs to be stated explicitly so that users understand their role in protecting lake ecosystems. It is critical that these parameters are not explained in academic terms. Aside from each parameter itself, the actual measured value of each parameter should be explained as well, because a pure number carries no meaning for laypeople. It is not necessary to disclose the exact number to the audience; instead, the measured value can be divided into several ranges and shown on a scale. Third, the location of lakes will be shown on a map with different layers. The basic layer is implemented using the cantons’ geojson file only. Colored polygons will be shown with no other information, and the color indicates the density of lakes in that canton: the darker the color, the higher the density. With this visualization, users get to see roughly the cantonal lake distribution


within Switzerland, free from distractions. Users can also choose to overlay the polygons on top of Google Maps. In this case, they get to know the surrounding environment of the lake, such as the closest mountain or a nearby town. This can be helpful to locate the lake if users have existing knowledge of that geographic area. The map can be zoomed in or out, and the lakes inside the canton will only appear after the map is zoomed in to the canton level. This hierarchical visualization invites users to browse and interact with the map.

Different visualizations will be available for the lakes. Users can choose to visualize lakes based on the coordinates, the elevation, or the surface area depending on their own interest. For example, it is easier to find interesting facts, like the largest mountain lake in a certain canton, if the user chooses to visualize lakes by surface area. When the computer mouse hovers over a canton or a lake on the map, the basic information of the target should be displayed, eliminating the need for the user to click each time. But if users want further information about a lake, they can click on the lake. Additional data will then be displayed. There should be an entry that directs the users to the lake’s Wikipedia page in case they are interested in that particular lake and would like to learn more. For lakes with no Wikipedia page, users will be invited to create a Wikipedia page for that lake based on the information provided. It is hoped that by requesting the help of citizens, the Wikipedia data on Swiss mountain lakes could eventually be completed.

Finally, interesting features should also be implemented in order to attract other citizens. A game similar to the well-known 10-question game is designed for this purpose. The traditional 10-question game requires two players.
One player thinks of a target, and the other asks at most 10 yes-or-no questions regarding the target and guesses what the target is based on the answers given. On our platform, a chatbot will play with the user. The chatbot randomly draws a lake from the dataset. Instead of letting users ask questions, the chatbot generates five multiple-choice questions, lets the user select one answer at a time, and tells the user if the answer is correct or not. All questions will be related to the attributes of a lake, such as its temperature and conductivity value. Users will get to learn about the lake’s condition when searching for answers. Since the purpose of the game is to make users learn from the platform and revise what they have learned through browsing, the correct answer is disclosed after each question. It is essential to show players if they won the game at the end of each round, to boost the player’s engagement. Ideally, this game will arouse users’ interest in learning from the platform, with the hope that the more they become aware of the current state of Swiss lakes, the more likely they will be to eventually engage in protecting the environment and mitigating climate change.

3.2.3 Implementation and Base Platform. Collected data was preprocessed using Python and structured in a way that is more suitable for web usage. HTML, CSS, and JavaScript were used to implement the website. The whole implementation also benefits from several external packages and APIs. Bootstrap was used to refine the layout, d3 was used to visualize data and implement various interactive elements, canvas-confetti was used to support the winning effect in the game, and Google Maps API was used to support map functions and display photos of the lakes.
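As a rough illustration of the game round described in Section 3.2.2, the following sketch draws a random lake and builds five multiple-choice questions, with the first one about the canton. It is written in Python for brevity, whereas the website itself was implemented in JavaScript. The attribute names, the number of answer choices, and the toy lake records are assumptions made for illustration and are not taken from the platform.

```python
import random

# Hypothetical attribute names; the paper only gives examples such as
# temperature, conductivity, and max depth.
QUESTION_POOL = ["elevation", "surface_area", "max_depth",
                 "temperature", "conductivity", "ph"]


def build_round(lakes, rng, n_questions=5, n_choices=3):
    """Draw a target lake and build the multiple-choice questions for one round."""
    target = rng.choice(lakes)
    # First question: the canton; the remaining ones are sampled from the pool.
    attributes = ["canton"] + rng.sample(QUESTION_POOL, n_questions - 1)
    questions = []
    for attr in attributes:
        correct = target.get(attr)  # None when the value is missing
        # Distractors are other distinct values of the same attribute.
        others = sorted({lake.get(attr) for lake in lakes} - {correct, None}, key=str)
        choices = rng.sample(others, min(n_choices - 1, len(others))) + [correct]
        rng.shuffle(choices)
        questions.append({"attribute": attr, "choices": choices, "answer": correct})
    return target, questions


def check_answer(question, selected):
    """Return (is_correct, correct_answer); the answer is disclosed either way."""
    return selected == question["answer"], question["answer"]


# Toy records with made-up values, for illustration only.
TOY_LAKES = [
    {"name": "Lake A", "canton": "VS", "elevation": 2100, "surface_area": 0.5,
     "max_depth": 30, "temperature": 8.0, "conductivity": 40, "ph": 7.5},
    {"name": "Lake B", "canton": "GR", "elevation": 1800, "surface_area": 1.2,
     "max_depth": 45, "temperature": 11.0, "conductivity": 90, "ph": 8.0},
    {"name": "Lake C", "canton": "BE", "elevation": 1600, "surface_area": 0.9,
     "max_depth": 20},  # measured parameters missing, as for unsampled lakes
]

target, questions = build_round(TOY_LAKES, random.Random(0))
```

In this sketch a missing attribute yields a question whose correct answer is empty, which mirrors the situation where some questions carry no useful information for lakes with incomplete data.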
The full set of implemented features is as follows: When users first load the website, a progress bar is animated to show the status of 2000Lakes, and a note stating the number of lakes sampled versus the number of lakes recorded is displayed, as shown at the top of Figure 1(a). Below the progress bar, there is a slider where the scientific parameters are explained. Users can go to different slides to learn about different parameters, including microbial abundance (16S and 18S), temperature, dissolved oxygen, conductivity, and pH value. The explanation of microbial abundance was provided by the domain expert in the 2000Lakes project. Explanations on water temperature, dissolved oxygen, conductivity, and pH value were drafted based on References [1–4], and the correctness was verified by the expert researcher. The explanation of measured values is also shown on each slide.

Fig. 1. Progress bar, slider for parameter explanation, and interactive scale.

Interactive scales were implemented for some parameters in order to make users more engaged in the learning process. For dissolved oxygen, conductivity, and pH value, the division of the scales is based on References [1–3]. By default, the scale is shown as gray and black, with no descriptions, as in Figure 1(a). But if the user clicks on one range, the color changes, and accordingly, a short text describing the meaning of that range appears (Figure 1(b)). Different colors are applied to different ranges in a way that matches the usual color scheme. Red means dangerous: values in this range cause the death of organisms inside the lake. Yellow means warning: values in this range make it stressful for lake organisms to live. Light green means good, and dark green means ideal: the lake ecosystem is in good condition in these cases. It is expected that this implementation could help laypeople better recall the parameters in a visual way so that even if they do not read the explanation carefully, they can still grasp the meaning of the measured value. The temperature scale is divided evenly from 0°C to 20°C, as there are no general criteria on which range of temperature is good or not. The same reason applies to 16S and 18S, which represent the abundance of bacteria and archaea, and the abundance of eukaryotic DNA, respectively. However, since the value of these two parameters varies from zero to billions, it is not possible to visualize the scale linearly and divide it evenly; therefore, these two parameters are displayed in order of magnitude. The exact scales are shown in the lake description so that users can easily remember the meaning (Figure 3, left).

Different layers of the map are implemented for different purposes (Figure 2, upper level).
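The two scales without quality criteria, described above for temperature and for 16S and 18S, can be sketched as follows. This is a minimal illustration in Python rather than the platform's own code. The number of temperature bins is an assumption, since the text only states that the scale is divided evenly from 0°C to 20°C.

```python
import math


def temperature_bin(temp_c, n_bins=4, low=0.0, high=20.0):
    """Index of the evenly spaced bin on the 0-20 °C scale.

    n_bins is a hypothetical choice; values outside the scale are clamped.
    """
    clamped = min(max(temp_c, low), high)
    width = (high - low) / n_bins
    return min(int((clamped - low) / width), n_bins - 1)


def magnitude_bin(abundance):
    """Order of magnitude used to place a 16S or 18S abundance on the scale."""
    if abundance < 1:
        return 0  # zero or sub-unit values fall on the lowest tick
    return math.floor(math.log10(abundance))
```

Binning by order of magnitude keeps values that range from zero to billions on a short scale, at the cost of hiding differences within one power of ten.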
The default map is the one with only colored polygons of all Swiss cantons. Users can also check the Google Maps option. The default Google Maps view is built from satellite tiles because it is more suitable for mountain lakes, which are located in the natural environment. Mountains and glaciers can be easily located using this map. Therefore, it helps users perceive the lakes’ elevation. Users can also switch Google Maps to road map tiles, in which case the main towns, cities, and roads are shown. As citizens are usually more familiar with this information, this visualization gives a better idea of where the lake is located. For each type of map, when the mouse hovers over a canton, the name of the canton and its lake density are shown in a tooltip, moving with the mouse. At the same time, the cursor changes to a zoom-in icon to let the user know it is possible to click and zoom to the canton level.

Users can choose to visualize the lakes by coordinates, by elevation, or by surface area from a drop-down box (Figure 2, lower level). When the mouse hovers over a lake, basic information including its name, elevation, and surface area will be shown in the tooltip. The cursor changes to a zoom-out icon to prompt the user to click to zoom out. Sampled and unsampled lakes are colored differently.

Fig. 2. Upper level: Different map layers. Lower level: Different lake visualizations on the canton level.

When the user clicks on one lake on the map, detailed information about the lake is shown as in Figure 3, left. The name of the lake is displayed with a Wikipedia icon next to it. Users can click on the icon to check its Wikipedia page, but if the page does not exist, a page with the message “Hey you found a lake without Wiki page! Why not create one for it!” will be shown to the user. The lake geojson data and the data scraped from the lake’s own Wiki entry are shown to give the user an overview of the target. Analysis results from the 2000Lakes project are visualized with six scales. Since there are no criteria on how to set the range for 16S, 18S, and temperature, those scales are colored in gray. The rest of the scales are exactly the same as in the explanation section. A small photo gallery of the selected lake, constructed using Google Maps API, is also shown. A maximum of 10 pictures will be shown following Google’s restriction. This part gives a more realistic view of the lakes. It is also intended to attract citizens with lake scenes, so that they may be more willing to join future activities, e.g., a sampling campaign in order to enjoy nature, and more likely to take actions to safeguard the environment.

Fig. 3. Detailed lake information and the chatbot game.

The implemented game is designed like a bot texting with the user in a chat box to make the experience more immersive (Figure 3, right). When the user gets to this part of the platform for the first time, messages pop up in the chat box. Starting with a greeting, followed by an introduction to the game, the bot then asks if the user wants to play this game or not. Only if the user chooses to play does the bot continue the conversation by giving out questions. The first question is always about


the canton where a lake is, and the remaining four questions are selected randomly from the pre-defined question pool, where all questions are about the lake’s attributes, such as max depth and measured conductivity. Missing data for some lakes results in some questions bearing no useful information. In such cases, the location of the target lake can be essential for players to guess the right answer. Once the user makes a guess, the bot answers immediately if the guess is right or not, and replies accordingly. For a wrong guess, the reply bubble is set to a different color, and the right answer is highlighted so that the user can distinguish the results more easily and locate the right attribute. In order to make the dialogue resemble human communication and boost the enjoyment of interacting with the chatbot, the messages sent by the bot are written in a friendly and encouraging way. Several predefined responses were prepared, and the bot randomly picks one to send. The bot is humanized by using a relaxed tone and avoiding fixed replies. If the user correctly identifies the lake the chatbot selected, a confetti effect is triggered; this is to make the player feel satisfied and pleased through this positive feedback and to encourage the player to continue learning on the platform and through the game. If the player guesses wrongly, the correct lake is revealed, and the player can match the given attributes to the target lake so as to learn from the game. After one round of the game is finished, the bot asks if the player wants to play again, and the cycle continues. Following the initial online survey conducted to gather feedback on the base platform, we made improvements to the features, which are discussed in Section 4.2.1.

3.3 Platform Evaluation: Collecting User Feedback

To answer RQ3, the platform is evaluated from different perspectives.
As an information platform, where users will learn concepts about Swiss alpine lakes, it is of interest to know how well citizens with no previous knowledge are able to grasp such information. In addition, the website is evaluated in terms of user interface (UI) and user experience (UX) design, as the goal is to make the platform easy and pleasing to use. Finally, how influential this website can potentially be in the future should also be evaluated.

An online survey is constructed for evaluation purposes. Participants first read a short description of the project and sign a consent form to continue. Later, participants are asked to provide some general information about themselves, such as how concerned they are about the environment, how often they go hiking in Switzerland, and how much they know about Swiss alpine lakes. Having this information before participants interact with the platform is important, as it can be used to assess the influence and usability of the platform. For example, if a person with a low degree of environmental concern became willing to take action to preserve the environment or join the 2000Lakes project after visiting the platform, that would constitute positive feedback. Furthermore, if a person with limited knowledge of Swiss alpine lakes could still find it easy to interact with and find information from the map, that would also represent positive feedback.

The general information part is followed by a task section. In this section, 10 questions are asked, whose answers can be found by exploring the content in the platform. The questions include:
- Q1: How many lakes are recorded in total?
- Q2: For the conductivity parameter, is the higher the better?
- Q3: Which range of pH value is the most suitable for all kinds of fish?
- Q4: Which Canton has the most lakes?
- Q5: Lac de Salanfe in Canton Valais is in a very bad condition for the organisms inside in terms of dissolved oxygen. Is this statement true?

- Q6: The largest mountain lake in Canton Graubünden is ...
- Q7: The highest mountain lake in Canton Vaud is ...
- Q8: The water volume for Limmernsee in Canton Glarus is ...
- Q9: Guggisee in Canton Valais has a lower water temperature than Oeschinensee in Canton Bern. Is this statement true?
- Q10: How much does the game make you want to learn more about mountain lakes in Switzerland?

The questions are designed such that participants have to go through each feature on the platform to find the correct answers. This makes sure the participants actually interact with the platform and acquire knowledge on lakes and the environmental indicators, so that their responses in the usability evaluation part are more reliable. Based on the accuracy of the provided answers, it is possible to assess how effective the platform is at conveying knowledge. In addition, the survey also records the time spent on each question; therefore, it is possible to filter out any response that might have been filled in randomly (i.e., too fast).

After participants finish all the tasks, a set of questions about the platform’s aesthetic design and interaction experience is asked. To evaluate visual aesthetics, participants are asked to which degree they consider that everything goes together on this website, the layout is pleasantly varied, the color composition is attractive, and the layout appears professionally designed. Those evaluation items are in the shortened version of the Visual Aesthetics of Websites inventory, which proved to be reliable and able to provide a close representation of the full-length version [29]. To evaluate users’ subjective impression of their experience with the platform, an optimized version of the user experience questionnaire is adopted, where the 26 items were reduced to 8 [41]. Participants are asked to rate the platform in terms of attributes like supportive vs. obstructive, easy vs. complicated, efficient vs. inefficient, clear vs. confusing, exciting vs.
boring, interesting vs. not interesting, inventive vs. conventional, and innovative vs. usual.

Finally, we are interested in how influential this platform could potentially be. We assess this based on the outcomes and possible actions participants may take after visiting the platform. Participants are asked to rate the degree to which they agree that they have learned something new and would like to learn more on this topic. Participants are also asked if they have become more aware of how people can affect alpine lakes, and more concerned about the need for environmental protection. Finally, participants are asked whether they would like to participate in future activities of the project, by helping sample lake water in real life, creating Wikipedia pages for lakes, or sharing the platform with others. These items are selected to capture the effectiveness of the platform in delivering environmental content, moving from awareness to action. The last three items are concrete activities related to the project that can contribute to climate issues.

We implemented the survey using LimeSurvey. The online survey was approved by Idiap’s Data and Research Ethics Committee. Participants are asked to finish the survey using their own laptops. Therefore, the testing environment varies from case to case. However, this should not be a problem, as the authors of [7] indicated that different combinations of devices and software have only reasonable effects on the accuracy and precision of a web platform’s display and response timing.

3.4 Crowdsourcing the Collection of Lake Polygon Data

One important type of geographic data that remains largely unavailable in digital format is the outline of alpine lakes. Such data can provide valuable information to users, as it allows them to visualize the complete shape of the lake, which may not be fully captured in photos. Moreover, the collected outline data holds significant value for research purposes. Regularly gathering this data enables tracking the evolution of a lake over time, thus facilitating the assessment and monitoring of the impact of climate change.

Participants were asked to collect lake polygon data using the platform geojson.io. This data collection was also approved by our Institute’s ethical committee. The task for the participants was to draw a lake shape following the shore and output the GeoJSON file. Participants are encouraged to preserve the overall shape of the lake when drawing, but the intricate details are deemed less crucial. To ensure data quality, a detailed specification and a demo video were provided, explaining how to do the work. Each participant was asked to collect data for five lakes, including one with known official data from Swisstopo. In the later evaluation, the official data served as ground truth, so it is possible to measure the precision and recall of the crowdsourced geo-data pairwise with the following equations:

\[ \mathrm{precision} = \frac{A_{\mathrm{overlap}}}{A_{\mathrm{labeled}}}, \tag{1} \]

\[ \mathrm{recall} = \frac{A_{\mathrm{overlap}}}{A_{\mathrm{authority}}}, \tag{2} \]

where $A_{\mathrm{overlap}}$ is the overlap area of the two geo polygons, representing the correctly identified lake area, $A_{\mathrm{labeled}}$ is the area of the polygon that volunteers identified, and $A_{\mathrm{authority}}$ is the area of the polygon from Swisstopo. The turfpy Python library was used for the actual calculation.
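Equations (1) and (2) can be illustrated with a short sketch. The authors report using turfpy; the version below deliberately does not reproduce that calculation. Instead, it approximates the three areas by sampling a regular grid and applying a standard ray-casting point-in-polygon test, using only the Python standard library. The function names, the grid resolution `steps`, and the two toy rectangles are assumptions made for illustration, and coordinates are treated as planar, which is a simplification that is only reasonable for small areas such as a single lake.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, ring: List[Point]) -> bool:
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def overlap_precision_recall(
    labeled: List[Point], authority: List[Point], steps: int = 300
) -> Tuple[float, float]:
    """Approximate Eq. (1) and Eq. (2) by counting grid-cell centers.

    labeled   -- polygon drawn by a volunteer (hypothetical input format)
    authority -- reference polygon used as ground truth
    steps     -- grid resolution per axis (illustrative default)
    """
    xs = [p[0] for p in labeled + authority]
    ys = [p[1] for p in labeled + authority]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    dx = (max_x - min_x) / steps
    dy = (max_y - min_y) / steps

    n_labeled = n_authority = n_overlap = 0
    for i in range(steps):
        x = min_x + (i + 0.5) * dx
        for k in range(steps):
            y = min_y + (k + 0.5) * dy
            in_labeled = point_in_polygon(x, y, labeled)
            in_authority = point_in_polygon(x, y, authority)
            n_labeled += in_labeled
            n_authority += in_authority
            n_overlap += in_labeled and in_authority

    if n_labeled == 0 or n_authority == 0:
        raise ValueError("a polygon covers no grid cell; increase steps")
    # Cell counts are proportional to areas, so the cell size cancels out.
    return n_overlap / n_labeled, n_overlap / n_authority


if __name__ == "__main__":
    # Two toy rectangles that share half of their area (not real lake data).
    drawn = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    official = [(1.0, 0.0), (3.0, 0.0), (3.0, 2.0), (1.0, 2.0)]
    precision, recall = overlap_precision_recall(drawn, official)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because both metrics are ratios of areas, the grid approximation only needs the counts to be proportional to the true areas. A geometry library that computes exact polygon intersections would be the more natural choice in practice.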

4 RESULTS

The survey results of the two iterations of the platform are presented in this section. All features of the base platform for the first iteration are presented in Section 4.1. The second iteration is a modified version based on the survey feedback for the base platform; its results are presented in Section 4.2. Finally, the results of the crowdsourcing activity to collect lake polygon data are presented in Section 4.3.

4.1 First Platform Iteration

The majority of the survey’s recipients are university students in Switzerland, who were invited to fill in the survey voluntarily. The first evaluation of the platform is based on 19 completed responses. The survey is expected to take 20 minutes for someone who has never used the platform before. In practice, the time spent on the survey varies across participants: the average is 17 min 23 sec and the median is 14 min 56 sec. All response times seem reasonable; therefore, no response was filtered out.

Among the 19 participants, 10 are women and 9 are men; 13 are in the 18–24 age group and 6 are in the 25–34 age group. Most of the participants declared that they care about the environment, with 3 of them declaring to be very concerned about this issue. Therefore, the platform may be of interest to these participants. Furthermore, most of them go hiking in Switzerland, but fewer than 6 times between April and September, and they do not have much previous knowledge about Swiss alpine lakes. This suggests that the correctness of their answers depends largely on the information provided on the platform.

4.1.1 Task Results. Participants followed the instructions and found related information to finish the 10 tasks. Except for the final task, where participants were asked to rate how much the implemented game made them want to learn more about Swiss alpine lakes, each question had a correct answer. The accuracy of the responses is generally very high: 15 out of 19 participants answered the questions with at most 1 mistake, including participants with very little knowledge on this topic.

Table 2. Results for Each Task

Question   Accuracy   Time (s)
Q1         84.21%     54.05
Q2         94.74%     58.29
Q3         100%       36.92
Q4         94.74%     32.26
Q5         89.47%     100.51
Q6         78.95%     61.67
Q7         94.74%     40.94
Q8         100%       51.23
Q9         68.42%     123.42
Q10        NA         64.37

Q1–Q9 are objective questions about alpine lake facts. Q10 is a subjective question about the perceived value of the game.

To assess how easy the platform is to use, we examine the accuracy and time spent on each question (Table 2). Among the objective questions (Q1–Q9), Q9 is the most time-consuming and has the lowest accuracy. The question is “Guggisee in Canton Valais has a lower water temperature than Oeschinensee in Canton Bern. Is this statement true?”. To answer it, the participant needs to find Guggisee in Canton Valais, get its measured temperature, find Oeschinensee in Canton Bern, get its temperature, and compare the two values. One possible explanation for the low accuracy is that participants become impatient while looking for the target lakes and decide to answer randomly. The second most time-consuming question is Q5, where participants are asked whether the condition of Lac de Salanfe in Canton Valais is bad in terms of dissolved oxygen. Participants need to find Lac de Salanfe and check its dissolved oxygen value. If they are uncertain about the value’s meaning, they need to go back to the section where the parameters are explained. Both questions require looking for one specific lake in Canton Valais, where many lakes have been sampled, so it takes more time to find the target. The time spent on the other questions is reasonable, and participants are able to answer each of them within about one minute. Therefore, we assume that it is mainly the searching part, which is designed specifically for this survey, that causes the time loss.

It is also worth mentioning that Q2 and Q3 are very similar questions. Q2 asks whether increased conductivity is better, and Q3 asks what the most suitable pH value is for all kinds of fish. Both instruct users to read the explanation of the corresponding parameter and engage with the scale. After interacting with the conductivity scale in Q2, users become more familiar with that feature of the platform; accordingly, they spend less time on Q3 and answer it more accurately.
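The per-question accuracy and time analysis described above, together with the screening of responses that were completed too quickly, could be computed along the following lines. This is only a sketch under stated assumptions: the record layout, the function names, the threshold `min_total_seconds`, and the three toy responses are hypothetical and are not taken from the study or from LimeSurvey's export format.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# One participant's record: question id -> (given answer, seconds spent).
# This layout is an assumption made for the example.
Response = Dict[str, Tuple[str, float]]


def filter_too_fast(
    responses: List[Response], min_total_seconds: float
) -> List[Response]:
    """Keep responses whose total time reaches a (hypothetical) threshold."""
    return [
        r for r in responses
        if sum(seconds for _, seconds in r.values()) >= min_total_seconds
    ]


def summarize_tasks(
    responses: List[Response], answer_key: Dict[str, Optional[str]]
) -> Dict[str, Tuple[Optional[float], float]]:
    """Return question -> (accuracy, mean seconds).

    Accuracy is None for subjective questions, marked by a None key entry.
    """
    summary: Dict[str, Tuple[Optional[float], float]] = {}
    for question, correct in answer_key.items():
        answered = [r[question] for r in responses if question in r]
        if not answered:
            continue
        avg_time = mean(seconds for _, seconds in answered)
        if correct is None:
            accuracy = None
        else:
            hits = sum(1 for answer, _ in answered if answer == correct)
            accuracy = hits / len(answered)
        summary[question] = (accuracy, avg_time)
    return summary


if __name__ == "__main__":
    # Made-up toy records with placeholder answers "a"/"b"; not survey data.
    toy = [
        {"Q1": ("a", 120.0), "Q10": ("5", 60.0)},
        {"Q1": ("b", 100.0), "Q10": ("3", 70.0)},
        {"Q1": ("a", 2.0), "Q10": ("4", 1.0)},  # suspiciously fast
    ]
    key = {"Q1": "a", "Q10": None}  # Q10 is subjective, so no correct answer
    kept = filter_too_fast(toy, min_total_seconds=30.0)
    for question, (accuracy, avg_time) in summarize_tasks(kept, key).items():
        print(question, accuracy, avg_time)
```

Screening on total time rather than per question mirrors the description of filtering whole responses; a per-question threshold would be an equally plausible design.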
We also evaluated the game through the last, subjective question, Q10, where participants are asked to rate how much the game makes them want to learn more about mountain lakes. As shown in Figure 4, the answers span a wide range from 1 to 7. Some participants liked the game and considered it a good motivator to learn more about Swiss mountain lakes. Others did not enjoy playing the game and did not agree that it motivates them to learn. Most participants held a moderate opinion of the game, rating it 3–5. In other words, the game did not satisfy everyone, but these divergent opinions are interesting, and we hope to improve the game based on the participants’ feedback.

Fig. 4. Distribution of answers for Q10: How much does the game makes you want to learn more about mountain lakes in Switzerland?

Some positive feedback includes: “I like the small game the most, it is very interactive and can evaluate the outcome of the learning.” “The game is really cool, because you can learn a lot of things, it’s really useful, a very good idea.” The educational goal of the game was recognized by some of the participants. Revealing the correct answers during the game seems to be a good way to help players learn.

There are also some suggestions on how to improve the game. For example, one comment says: “Maybe for the last game, eliminate those lakes do not have known parameters. Then it will be clearer.” Missing data indeed restricts the game. If the bot picks a lake with many missing parameters, the correct answer may not be very informative, and the player can only guess based on the lake’s location. In this case, the player cannot really learn new things about the lake.

Overall, from the task results, we believe the platform is effective in conveying information to the public.

4.1.2 Aesthetic Design Evaluation. After the tasks, participants evaluated the aesthetic design of the platform. They rated each item on a scale from 1 (strongly disagree) to 7 (strongly agree). Almost all participants agreed that everything goes together on this website and that the layout is pleasantly varied. The answers for these two items are all 4 or above, with over half of the participants showing strong agreement (>=6). The results for “the color composition is attractive” and “the layout appears to be professionally designed” are more varied. Even though most participants agreed with these two statements, one and two participants, respectively, disagreed. Some feedback on color usage and layout includes: “The colors were hard to see sometimes. (clicking on Canton Vaud, I can barely distinguish where the lakes are, I don’t know if it is my settings but the color of the blue circles is very light.)” “The layout before the game. Some words overlap with the pictures.”

For future improvement, clearer colors should be chosen to plot the lakes on maps. As there are different map layers, whose background color varies from light to dark, it is important to apply a color that is visible in all cases or to apply different colors for different layers. The layout issue, on the other hand, is caused by an inappropriate technical design. Modifications are needed to make the platform layout more resilient.

4.1.3 User Experience Evaluation. Based on the responses, the user experience provided by the platform is good overall. Most participants agreed that there was enough supportive information for them to finish the tasks and that it was efficient to find information. Also, most thought that the platform was interesting and exciting to play with.
More specifically, some participants said that they favored the interactive map and the section where detailed lake information is displayed. Showing photos seems to be a good way to present a lake, as some participants commented that real photos of lakes are interesting and that they enjoyed seeing all the information and pictures. However, some participants found the platform not particularly inventive or innovative. Furthermore, even though most participants agreed that the platform was clear, two participants still rated it as confusing. The item “easy versus complicated” received more negative responses than the other items; the results are shown in Figure 5(a). To improve the platform, it is desirable to make the operations easier for all kinds of users.

Fig. 5. Result for “evaluate based on easy versus complicated”.

Some participants also gave suggestions on how to improve the platform: “Too much text, it would be nice to have a summary or keywords for each paragraph. Scrolling up and down is inefficient, it could be more comfortable to have detailed information right next to the map.” “On the map, you could put a bar where you can adjust the zoom level on the map because going from one canton to another is sometimes complicated if you have set the zoom.”

Even though the paragraphs have been simplified and made more comprehensible, there are still long texts for users to read. Highlighting keywords in bold is a possible way to shorten the reading time. We may also need to explore a new way to make the interaction with the map more efficient. Currently, if the user clicks on a lake on the map, the page automatically takes the user to the lake section on the same page; to view another lake, the user has to scroll back to the map section and continue from there. To zoom in to another canton, the user needs to click on the target canton directly, or first zoom out from the current canton and then click on the target canton to zoom in. As some users found these operations complicated or inefficient, we may explore an approach that eliminates redundant steps. Other feedback includes: “Scrolling is problematic, also we don’t know we have to scroll down first.” “The points representing for lake by coordinate is a little bit small to catch.” To address these issues, the scrolling function should be made more explicit, for example, by adding a scroll bar or a list of radio buttons on the website. When plotting lakes on the map, the size of the points should be adjusted so that it is easier for users to interact with them.

4.1.4 Platform Influence Evaluation. All participants agreed, to varying extents, that they learned something new about Swiss alpine lakes after the exercise, and 8 out of 19 participants strongly agreed. Around two-thirds of the participants showed interest in learning more on this topic, and the rest held a neutral attitude. Most of them agreed that they became more aware of how people can affect the lakes and more concerned about the need for environmental protection. However, the percentage drops slightly compared to the previous two outcomes.
We may need to add more content explaining the link between the lake ecosystem and the environment to be more convincing to citizens.

We also evaluated the possible next steps for the participants. Over two-thirds of the participants seemed willing to share the platform with friends or family. However, when asked whether they would like to attend a sampling activity and whether they would like to create Wikipedia pages for more lakes, their answers spanned a wide range from 1 (strongly disagree) to 7 (strongly agree) (Figure 6(a)). Even though over half of the participants were interested in sampling lake water, some disagreed. This may be related to personal interest, as the participants who selected 1–3 for this statement are those who seldom go hiking. From Figure 6, we can see that participants were more reluctant to create Wikipedia pages than to sample water, with most of them disagreeing that they would write Wikipedia pages for more lakes. The reason may be that writing a Wikipedia page is more task-oriented and involves no social engagement, compared to hiking and sampling water. Another reason may be that the connection between Wikipedia and the platform is not so obvious. For future work, we can explore other ways to invite users to draft Wikipedia pages for Swiss mountain lakes.

Fig. 6. Result for the first iteration about participants’ interest in further activities.

In the final stage of the analysis, we sought to understand how participants’ demographics influenced their perspectives on the platform. We divided the participants into distinct groups based on gender, age, and their inherent concern for the environment. Subsequently, we calculated the average scores for each item and present the results in Table 3. Only results with a difference exceeding 0.5 (half a point on the Likert scale) are displayed. We conducted t-tests for each of the 3 conditions being tested (gender, age, and degree of concern for the environment). From the table, we observe that women seemed to have higher expectations for UI design, as their scores were lower, whereas men seemed to show stricter preferences regarding UX design. Although women appeared to show slightly more empathy and willingness to engage in activities, scoring higher on these items than men, the differences are not statistically significant, as the p-values are all larger than 0.05. The older demographic group appeared to hold higher standards for the platform’s overall design, likely due to their greater experience. On the other hand, the younger group demonstrated a relatively greater inclination to participate in various activities, although not significantly. The group with an intrinsic interest in the topic assigned higher scores to the platform and exhibited a stronger desire to participate in all kinds of activities. It is worth noting that, despite some observed differences, the lack of statistical significance in certain cases may be attributed to the small sample size. Beyond the initial work presented here, future evaluations conducted with a larger sample could provide an opportunity to reassess these trends.

4.2 Second Platform Iteration

4.2.1 Improvements.
Adjustments were made based on the feedback from the first survey results. Besides minor changes such as adding a scroll bar on the page, adjusting the size of the clickable points, and changing the colors shown on the map, one more interactive feature was added and the layout of the platform was changed.

The change of layout design is intended to address the issue of redundant steps when browsing lakes on the map and when playing the chatbot game. In the first iteration, except for the first explanation section, which has horizontal slides, the sections were joined in a simple vertical order, with the map visualization followed by the lake information and the chatbot game in the last section. When users clicked on a lake on the map, they were redirected to the lake information section and needed to scroll back to the map section in order to browse other lakes. From the feedback we received, this was not as user-friendly as expected. Moreover, it was hard to compare two lakes because there was only one section that could show information about one lake. Another design that could be improved was the game. Since the game invites the user to interact with the map and find information about a certain lake, it is preferable to show the chatbot together with the map to make the information retrieval process easier.

Table 3. Impact of the Demographics of the Users

Item                               men (9)  women (10)  p-value  18–24 (13)  25–34 (6)  p-value  concerned <5 (5)  concerned >=5 (14)  p-value
Everything goes together           /        /           /        5.85        5.17       0.181    5.00              5.86                0.105
color composition is attractive    6.00     5.44        0.310    /           /          /        /                 /                   /
layout is pleasantly varied        /        /           /        /           /          /        5.40              6.14                0.146
layout is professional             5.78     4.33        0.023*   5.69        4.00       0.011*   /                 /                   /
platform is supportive             /        /           /        5.85        4.17       0.007*   /                 /                   /
platform is easy                   /        /           /        5.38        4.50       0.322    /                 /                   /
platform is efficient              5.00     5.56        0.419    5.54        5.00       0.459    5.00              5.50                0.516
platform is clear                  /        /           /        /           /          /        4.80              6.14                0.088
platform is exciting               /        /           /        /           /          /        4.60              5.93                0.039*
platform is interesting            5.11     6.00        0.102    /           /          /        4.80              5.93                0.061
platform is inventive              /        /           /        5.77        4.17       0.020*   /                 /                   /
platform is innovative             /        /           /        5.46        4.67       0.246    4.60              5.43                0.252
willingness to learn more          /        /           /        5.62        4.67       0.169    4.60              5.57                0.183
willingness to sample water        4.00     5.22        0.205    5.23        3.67       0.120    3.80              5.07                0.238
willingness to create Wikipedia    3.56     3.00        0.468    /           /          /        2.80              3.71                0.328
became more aware                  4.56     5.33        0.216    /           /          /        4.00              5.43                0.038*
willingness to share the platform  4.78     5.56        0.273    5.54        4.67       0.245    4.80              5.43                0.432

Cases for which the difference between groups is above 0.5 are included in the table. The sample size for each of the groups is indicated in brackets. All p-values below 0.05 are indicated with “*”.

In the second iteration, the lake information section and the game section are not shown directly on the main page. When the user clicks on a lake on the map, a new tab is opened in the web browser. Instead of scrolling, the user switches between tabs to go back to the map section, which is more common in web design. Multiple lake tabs can be opened, so it is easier to compare lakes. Furthermore, the game was integrated into the map section, and the chatbot is only shown if the user chooses to open it. When the game is opened, the display window of the map becomes smaller and the map shrinks correspondingly to match the window size. The user can then play the game while browsing the map right next to it.

Finally, the goal of the platform is to present information about Swiss mountain lakes and, at the same time, to invite users to create Wikipedia pages for the lakes based on the information given, so as to have more complete data online. In the previous survey results, most participants showed a reluctant attitude toward creating Wikipedia pages for lakes (Figure 6(b)). Therefore, we intended to draw users’ attention to Wikipedia creation in a more interesting way. We created a virtual lake gashapon machine. When the user clicks on the handle of the machine, a ball is dropped. The user can then click to open the ball, which contains information about a random lake from the database that has no Wikipedia page. The user is then invited to create a page for that lake. It is hoped that this more innovative way to learn about lakes in Switzerland could trigger users’ interest and make them more willing to engage in this activity.

4.2.2 Survey Result. A set of 19 volunteers were invited to answer the exact same survey as for the first iteration; all of them were university students in Switzerland. Since the purpose of the second survey is to verify whether the website was indeed improved, the items of interest in the survey were analyzed and compared with the results from the first iteration. One major change is the web layout, which is intended to simplify some interactions on the platform. Therefore, we are interested in whether users think the layout is professionally designed and whether the adjustment indeed makes the platform easier to use. The other change is the gashapon machine, which gives users a way to explore the lake dataset randomly and invites them to create Wikipedia pages. Therefore, we are interested in whether, with this new feature, users find the implementation more inventive and more innovative, and whether their willingness to create a Wikipedia page increases.

We compare the mean values of the items of interest for the two versions of the platform; the results are shown in Table 4. An improvement in the mean score can be seen for the items “Layout professionally designed”, “Complicated (1) vs. Easy (7)”, and “Willingness to create Wikipedia pages”. The distributions for “Complicated (1) vs. Easy (7)” are shown in Figure 5. In the first iteration, participants’ answers tended to polarize, with many participants voting for “not easy”.

Table 4. Mean Value Comparison for Two Iterations

Item                                    First iteration  Second iteration
Layout professionally designed          5.16             5.42
Complicated (1) vs. Easy (7)            4.79             5.11
Conventional (1) vs. Inventive (7)      5.26             5.26
Usual (1) vs. Innovative (7)            5.21             5.16
Willingness to create Wikipedia pages   3.47             3.84
However, the result improved for the second iteration, with most participants voting for a neutral or positive answer to this item. The score for “Willingness to create Wikipedia pages” also increased. This suggests that the modifications improved the platform design, the user experience that the platform provides, and user involvement after the experience. However, the mean score for the item “Conventional (1) vs. Inventive (7)” stays the same for both iterations, and the mean score for “Usual (1) vs. Innovative (7)” drops slightly. While user feedback acknowledged the presence of the gashapon machine, it did not result in the expected improvement. Therefore, it is worthwhile to explore other methods of interaction for future versions.

4.3 Crowdsourced Lake Polygon Data Quality

A total of 60 lakes were targeted to validate the online crowdsourcing task. A total of 12 volunteers participated, and each of them collected the polygon data of five lakes using geojson.io. Even though a detailed specification and a demo video were given, the whole process was unsupervised. Since the data will be shown on a digital platform, its quality should be assessed. We first randomly sampled from the 60 geo-polygon files that volunteers collected and plotted them on geojson.io to check their accuracy manually. The results are representative of the lakes, as the lines match well with the shore shown on the map (Figure 7).

Fig. 7. Visual comparison between official and crowdsourced geo-polygon data.

To measure the data quality, precision and recall were calculated. A set of 12 lakes out of the 60 have official GeoJSON files from Swisstopo, which were used as the ground truth. The collected data reaches an average precision of 95.60% with a standard deviation of 0.061 and an average recall of 97.29% with a standard deviation of 0.016. The crowdsourced data thus shows good quality, which suggests that it is possible to expand the activity and invite more people to join if a larger geo-polygon dataset on Swiss mountain lakes is needed.

5 DISCUSSION

5.1 Simplicity in Platform Design

The simplicity of a platform should be one of the biggest concerns in a citizen science project [31]. Since the potential users of the platform are the general public (who have no obligation to complete any tasks), designing the platform with ease of use in mind is critical. With a simple design, the quality of collected data can be improved on a platform with crowdsourcing purposes, and users can grasp the information better on a platform with informational purposes like ours. Since map interaction plays an essential role in our work, we are interested in assessing whether the design is simple enough for users to interact with the map.

We compare the task results from our survey with the assessment results for the project described in [31], where map visualization is also an essential part and was evaluated in a similar way. Note that the project in [31] aims at supporting volunteers in collecting and submitting data about invasive species, and therefore the tasks for participants are different. In [31], volunteers were asked to perform a series of more complicated tasks on the map, such as editing the map layer, creating a species location map, and so on. Volunteers encountered more problems than the researchers expected, and the completion rates for map tasks varied from 25% to 75%. Participants expressed the need for simplified functions and more comprehensible icons to help with the operations.
In our platform, the interactive functions on the map are more straightforward. From the results, the completion time for each task is reasonable and the accuracy of all questions is relatively high (Table 2). Indeed, simpler interaction features allow participants to perform better.

We also received extensive feedback on various approaches to simplify the platform. Some people mentioned that there was too much text explaining parameters, which made them lose patience. In the improved version, keywords were highlighted in a different color to assist users’ reading. One volunteer also mentioned that the rule of the game was too wordy. Therefore, one way to simplify the platform will be to reduce the text content to a minimum. Additional suggestions were made to reduce redundant scrolling steps while comparing two lakes, as previously described, and to relocate the chatbot game closer to the map interface, given that the game requires the user to search for a lake while playing. Modifying the platform following these suggestions resulted in an improved user experience (Figure 5).

5.2 Innovative Designs Raise User Interest

Our work sought innovative ways to engage users and raise their interest in this topic. The main efforts are the chatbot game and the lake gashapon machine, whose functions are recognized by the participants. The game implemented in the platform is not entertainment-oriented, unlike the “game-like” artifact implemented in the Citizen Sort project [36], but is more similar to a “tool-like” game. The format of the player answering multiple-choice questions is similar to the one implemented in Galaxy Zoo [38], but gaming features were added to make it more enjoyable to play. We emphasized the three characteristics of a successful educational game proposed by Malone [26]. The challenge of this game is ensured by asking the player to guess the randomly selected lake. Fantasy is realized by having the player interact with a chatbot. Even though it is a single-player game, the player is placed in a social situation by having a conversation with the bot. We hoped to raise players’ curiosity by revealing the lake’s attributes one after another, and the final confetti effect was intended to give them a sense of satisfaction. Several participants expressed that the game was their preferred feature on the platform, noting that it served as an engaging and informative tool for learning about lakes.

In the second iteration, the implementation of the gashapon machine captured the attention of some participants. It became apparent that these participants were more interested in browsing lakes through the gashapon machine than through the map. If well chosen and implemented, innovative features have the potential to significantly augment user engagement.
5.3 Reliability of Online Crowdsourced Geodata

When it comes to crowdsourcing activities in a citizen science project, researchers in [10] and [15] emphasized the importance of proper training before collecting data and of following a detailed protocol while collecting data. In the online crowdsourcing activity we conducted, where participants were asked to collect geo-polygon data themselves using an online tool, a demo video and specifications were made accessible to the participants to show how to do the task correctly. As evidenced by the high quality of the collected data, the provision of clear instructions and strict procedural guidelines was effective. Consequently, it seems feasible to scale up the activity to collect data for more lakes, as long as the critical steps are ensured.

5.4 Challenges of Using Heterogeneous Data

The platform visualizes the research results of the 2000Lakes project. Because the platform acts as a bridge connecting the scientific community and the general public, it is important to think about how to present the data in a way that is interesting for laypeople. In addition to interactive features, the use of heterogeneous data presented a viable solution. By extending the research results with relatable data from everyday life, we aim to avoid overwhelming the audience with complex scientific concepts. However, the choice of appropriate data sources and the combination of different sources can be challenging. As there is no existing dataset of Swiss mountain lakes, our starting point was the Wikipedia page of Swiss mountain lakes. Then, a decision needed to be made on which information should be included on the platform. Our goal was to provide as much information as possible that was directly linked to the topic of interest, while avoiding unnecessary information. Therefore, after an initial NLP processing of the paragraphs describing the lakes, we decided not to include the results, as they did not provide additional useful information. When confronted with insufficient information from one particular data source, we sought to incorporate additional data from other sources. We made the decision to include more data from multiple sources after a thorough process of trial and comparison, which enabled us to select the most suitable sources for our needs. Combining different data sources is also tedious work. One way forward is to set up a new dataset that includes every possible attribute of one target. When designing the new dataset, every use case of the data should be taken into consideration, and its scalability should be ensured for the potential addition of new attributes and targets in the future.
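The dataset design suggested above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the schema used by the platform: the names `LakeRecord` and `merge_sources`, the attribute keys, and the rule that earlier sources take priority are all hypothetical, and the values in the usage example are placeholders rather than real lake data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class LakeRecord:
    """One target (a lake) with an open-ended set of attributes (hypothetical schema)."""
    name: str
    # attribute name -> (value, source label); adding an attribute needs no schema change
    attributes: Dict[str, Tuple[Any, str]] = field(default_factory=dict)

    def add(self, key: str, value: Any, source: str, overwrite: bool = False) -> None:
        """Store a value unless one is already present (first source wins by default)."""
        if value is None:
            return  # a missing value in one source must not block another source
        if overwrite or key not in self.attributes:
            self.attributes[key] = (value, source)


def merge_sources(sources: Dict[str, Dict[str, Dict[str, Any]]]) -> Dict[str, LakeRecord]:
    """Combine per-source tables (source -> lake name -> attributes) into one dataset.

    Sources are visited in insertion order, so earlier sources take priority.
    """
    dataset: Dict[str, LakeRecord] = {}
    for source, lakes in sources.items():
        for lake_name, attrs in lakes.items():
            record = dataset.setdefault(lake_name, LakeRecord(lake_name))
            for key, value in attrs.items():
                record.add(key, value, source)
    return dataset


if __name__ == "__main__":
    # Placeholder values only; "Lake A" and "Lake B" are not real lakes.
    demo = merge_sources({
        "wikipedia": {"Lake A": {"elevation_m": 2000, "area_ha": None}},
        "geodata": {"Lake A": {"elevation_m": 2001, "area_ha": 3.5},
                    "Lake B": {"area_ha": 1.0}},
    })
    for record in demo.values():
        print(record.name, record.attributes)
```

Keeping the source label next to each value is one possible way to make later comparison of sources easier; new targets and new attributes can be added without changing the record type.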

5.5 From Citizen Awareness to Action

This study aimed to investigate the effectiveness of a digital platform in engaging participants, helping them learn about Swiss alpine lakes in the climate change context, and motivating them to take action to preserve the environment. The results of the survey indicated an increase in participants’ awareness, which suggests the possibility of using the platform to gather a community of users who could be motivated to take action. These actions encompass a range of possibilities, including political actions such as writing to representatives, participating in demonstrations, or voting; personal actions such as reducing electricity consumption at home; and professional actions such as engaging in initiatives at work aimed at reducing work-related travel. However, the transition from awareness to tangible action remains a significant challenge. One possible explanation for this challenge is the intention-action gap, which has been widely recognized in the field of behavior change [21]. Despite participants’ enhanced knowledge and awareness, converting this newfound understanding into concrete behaviors requires overcoming various barriers and motivational factors. Environmental preservation calls for long-term commitment, lifestyle changes, and collective action, which can be daunting for individuals. As part of future work, we envision that the platform could also integrate historical content to document how the Swiss alpine landscape has changed over time. This could contribute to further raising awareness about the local impact of climate change and motivate people to take specific actions [13].

5.6 Limitations and Future Work

There are improvements that can be made to the lake data this project deals with, to the platform implementation, and to the survey used to evaluate the platform. Firstly, our work encountered an issue with insufficient data on Swiss mountain lakes. An extensive list of lakes is desirable not only for this project but also for other potential uses. As we verified the reliability of online crowdsourced geo-polygon data, once a more complete list is acquired, it would be possible to fill in the missing lake data by joining forces with citizens. A more complete dataset would benefit users who want to learn about lakes in Switzerland. Secondly, in terms of platform implementation, even though most problems that participants encountered in the first iteration were solved in the second one, the compatibility issue with certain web browsers (Safari) still needs to be addressed. Furthermore, there is the open question of how to make the platform more creative, and how to incorporate additional content regarding climate change and lakes, including individual and social activities around lakes, like hiking and other enjoyable activities. We envision that a potential direction would be through brainstorming sessions with the public to better understand what they expect from the platform. This feedback would help define improvements to our current design. Thirdly, the evaluations of the two iterations were based on only 19 survey responses, and most of the participants were students. The evaluation of the platform may therefore be biased, as the participants were relatively few and not diverse enough. In the future, we could expand the sample size and repeat the evaluation with a larger and more diverse audience.

Last but not least, as we envision this platform to be used in the future, it is imperative not only to offer detailed instructions to the public regarding concrete activities, e.g., collecting lake data, but also to provide guidance for calibrating the data, and to leverage the expertise of specialized researchers and experienced users. This approach will maximize the platform’s potential and overall benefit.

6 CONCLUSION

This article presented an interactive platform for a research project with a citizen science component towards analyzing microbial life in Swiss alpine lakes as a response to climate change. The platform aims to spread general knowledge about mountain lakes in Switzerland, to keep citizens informed of the progress and results of the research, to invite people to join the initiative by taking various actions, and to contribute to building a sustainable society with help from the community. A web-based platform was implemented for these purposes with a key design requirement: considering how scientific data can be conveyed to laypeople. Our work addressed this issue in terms of data, visualization, and evaluation. Besides project-specific scientific data, the main data in the platform came from Wikipedia, which includes many attributes that do not need additional explanation. Visual data and geo-data of the lakes were also included for better presentation. The platform was designed to be visually appealing and aimed at attracting users through various interactive functions. As platform implementation usually goes through a development cycle, feedback was sought for the evaluation of the first iteration through a survey. The same survey was delivered a second time for the improved version, and results were compared in order to validate the modifications. The results showed that participants could effectively grasp facts from the platform, that in general terms the platform provided a good user experience, and that the modifications from the iterative cycle indeed improved the platform. The evaluation of the platform’s potential to influence further activities also showed its value in supporting citizen science goals.

ACKNOWLEDGMENTS

We specially thank Dr. Anna Carratalá (EPFL Environmental Chemistry Lab, Principal Investigator of the 2000Lakes project), for providing the analysis data of the sampled lakes, explaining the parameters about microbial abundance, and verifying the scientific concepts displayed on the platform. We also thank the 2000Lakes project team and the Social Computing group at Idiap, who gave valuable feedback and advice for platform development. We thank Victor Bros (Idiap) for sharing and discussing the Swiss news discussed in the article. Finally, we thank all the participants in the platform evaluation and geo-data generation phases.

REFERENCES

[1] 2021. Conductivity. (2021). Retrieved from https://www.enr.gov.nt.ca/sites/enr/files/conductivity.pdf. Accessed April 15, 2022.
[2] 2021. Dissolved Oxygen (DO). (2021). Retrieved from https://www.enr.gov.nt.ca/sites/enr/files/dissolved_oxygen.pdf. Accessed April 15, 2022.
[3] 2021. pH. (2021). Retrieved from https://www.enr.gov.nt.ca/sites/enr/files/ph.pdf. Accessed April 15, 2022.
[4] 2021. Water temperature - gov. (2021). Retrieved from https://www.enr.gov.nt.ca/sites/enr/files/water_temperature.pdf. Accessed April 15, 2022.
[5] Yuanhui Lin and Daniel Gatica-Perez. 2022. Analysis of References to Swiss Alpine Lakes in Local Swiss News. Unpublished manuscript.
[6] Rita Adrian, Catherine M. O’Reilly, Horacio Zagarese, Stephen B. Baines, Dag O. Hessen, Wendel Keller, David M. Livingstone, Ruben Sommaruga, Dietmar Straile, Ellen Van Donk, Gesa A. Weyhenmeyer, and Monika Winder. 2009. Lakes as sentinels of climate change. Limnology and Oceanography 54, 6part2 (2009), 2283–2297.

[7] Alexander Anwyl-Irvine, Edwin S. Dalmaijer, Nick Hodges, and Jo K. Evershed. 2021. Realistic precision and accuracy of online experiment platforms, web browsers, and devices. Behavior Research Methods 53, 4 (2021), 1407–1425.
[8] Alexios Brailas, Konstantinos Koskinas, Manolis Dafermos, and Giorgos Alexias. 2015. Wikipedia in education: Acculturation and learning in virtual communities. Learning, Culture, and Social Interaction 7 (2015), 59–70. DOI:https://doi.org/10.1016/j.lcsi.2015.07.002
[9] Josep G. Canadell, Pedro M. S. Monteiro, Marcos H. Costa, Leticia Cotrim da Cunha, Peter M. Cox, Alexey V. Eliseev, Stephanie Henson, Masao Ishii, Samuel Jaccard, Charles Koven, Annalea Lohila, Prabir K. Patra, Shilong Piao, Joeri Rogelj, Stephen Syampungani, Sönke Zaehle, and Kirsten Zickfeld. 2021. Global carbon and other biogeochemical cycles and feedbacks. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, V. Masson-Delmotte, P. Zhai, A. Pirani, S. L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M. I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J. B. R. Matthews, T. K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (Eds.). Cambridge University Press, Cambridge and New York, NY, 673–816. DOI:10.1017/9781009157896.007
[10] Janis L. Dickinson, Benjamin Zuckerberg, and David N. Bonter. 2010. Citizen science as an ecological research tool: Challenges and benefits. Annual Review of Ecology, Evolution, and Systematics 41, 1 (2010), 149–172.
[11] Margret C. Domroese and Elizabeth A. Johnson. 2017. Why watch bees? Motivations of citizen science volunteers in the great pollinator project. Biological Conservation 208 (2017), 40–47.
[12] Rosta Farzan and Robert E. Kraut. 2013. Wikipedia classroom experiment: Bidirectional benefits of students’ engagement in online production communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13). Association for Computing Machinery, 783–792. DOI:https://doi.org/10.1145/2470654.2470765
[13] Christiana Figueres and Tom Rivett-Carnac. 2020. The Future We Choose: Surviving the Climate Crisis. Manilla Press, London, 240 pages.
[14] Glen George. 2010. The impact of climate change on European lakes. In Proceedings of the Impact of Climate Change on European Lakes. Springer, 1–13.
[15] Jean-François Girres and Guillaume Touya. 2010. Quality assessment of the French OpenStreetMap dataset. Transactions in GIS 14, 4 (2010), 435–459.
[16] Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008), 12–18.
[17] Yen-Chia Hsu, Jennifer Cross, Paul Dille, Illah Nourbakhsh, Leann Leiter, and Ryan Grode. 2018. Visualization tool for environmental sensing and public health data. In Proceedings of the 2018 ACM Conference Companion Publication on Designing Interactive Systems. 99–104.
[18] Yen-Chia Hsu, Jennifer Cross, Paul Dille, Michael Tasota, Beatrice Dias, Randy Sargent, Ting-Hao Huang, and Illah Nourbakhsh. 2019. Smell Pittsburgh: Community-empowered mobile smell reporting system. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 65–79.
[19] Yen-Chia Hsu, Paul Dille, Jennifer Cross, Beatrice Dias, Randy Sargent, and Illah Nourbakhsh. 2017. Community-empowered air quality monitoring system. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 1607–1619.
[20] Ashlee Jollymore, Morgan J. Haines, Terre Satterfield, and Mark S. Johnson. 2017. Citizen science for water quality monitoring: Data implications of citizen perspectives. Journal of Environmental Management 200 (2017), 456–467.
[21] Feng-Yang Kuo and Mei-Lien Young. 2008. A study of the intention–action gap in knowledge sharing practices. Journal of the American Society for Information Science and Technology 59, 8 (2008), 1224–1237.
[22] Edith Law and Luis Von Ahn. 2009. Input-agreement: A new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1197–1206.
[23] Hye Won Lee, Eun Jung Kim, Seok Soon Park, and Jung Hyun Choi. 2012. Effects of climate change on the thermal structure of lakes in the Asian monsoon area. Climatic Change 112, 3 (2012), 859–880.
[24] Kathryn A. Lee, Jonathan R. Lee, and Patrick Bell. 2020. A review of citizen science within the earth sciences: Potential benefits and obstacles. Proceedings of the Geologists’ Association 131, 6 (2020), 605–617.
[25] Stephen C. Maberly, Ruth A. O’Donnell, R. Iestyn Woolway, Mark E. J. Cutler, Mengyi Gong, Ian D. Jones, Christopher J. Merchant, Claire A. Miller, Eirini Politi, E. Marian Scott, Stephen J. Thackeray, and Andrew N. Tyler. 2020. Global lake thermal regions shift under climate change. Nature Communications 11, 1 (2020), 1–9.
[26] Thomas W. Malone. 1980. What makes things fun to learn? Heuristics for designing instructional computer games. In Proceedings of the 3rd ACM SIGSMALL Symposium and the 1st SIGPC Symposium on Small Systems. 162–169.
[27] Thomas W. Malone and Mark R. Lepper. 2021. Making learning fun: A taxonomy of intrinsic motivations for learning. In Proceedings of the Aptitude, Learning, and Instruction. Routledge, 223–254.
[28] Phoebe R. Maund, Katherine N. Irvine, Becki Lawson, Janna Steadman, Kate Risely, Andrew A. Cunningham, and Zoe G. Davies. 2020. What motivates the masses: Understanding why people contribute to conservation citizen science projects. Biological Conservation 246, 9 (2020), 108587.


[29] Morten Moshagen and Meinald Thielsch. 2013. A short version of the visual aesthetics of websites inventory. Behaviour and Information Technology 32, 12 (2013), 1305–1311.
[30] Pascal Neis and Dennis Zielstra. 2014. Recent developments and future trends in volunteered geographic information research: The case of OpenStreetMap. Future Internet 6, 1 (2014), 76–106.
[31] Greg Newman, Don Zimmerman, Alycia Crall, Melinda Laituri, Jim Graham, and Linda Stapel. 2010. User-friendly web mapping: Lessons from a citizen science website. International Journal of Geographical Information Science 24, 12 (2010), 1851–1869.
[32] Annu-Maaria Nivala, Stephen Brewster, and Tiina L. Sarjakoski. 2008. Usability evaluation of web mapping sites. The Cartographic Journal 45, 2 (2008), 129–138.
[33] Charles G. Oviatt. 1997. Lake Bonneville fluctuations and global climate change. Geology 25, 2 (1997), 155–158.
[34] Victoria Palacin, Sarah Gilbert, Shane Orchard, Angela Eaton, Maria Angela Ferrario, and Ari Happonen. 2020. Drivers of participation in digital citizen science: Case Studies on Järviwiki and safecast. Citizen Science: Theory and Practice 5, 1 (2020), 22.
[35] Nathan Prestopnik and Kevin Crowston. 2012. Purposeful gaming and socio-computational systems: A citizen science design case. In Proceedings of the 17th ACM International Conference on Supporting Group Work. 75–84.
[36] Nathan R. Prestopnik and Kevin Crowston. 2011. Gaming for (citizen) science: Exploring motivation and data quality in the context of crowdsourced science through the design and evaluation of a social-computational system. In Proceedings of the 2011 IEEE 7th International Conference on E-Science Workshops. IEEE, 28–33.
[37] Roland Psenner. 2003. Alpine lakes: Extreme ecosystems under the pressure of global change. EAWAG News 55 (2003), 12–14.
[38] Jordan Raddick, Georgia Bracey, Pamela Gay, Chris Lintott, Phil Murray, Kevin Schawinski, Alexander Szalay, and Jan Vandenberg. 2009. Galaxy Zoo: Exploring the motivations of citizen science volunteers. Astronomy Education Review 9 (2009). DOI:10.3847/AER2009036
[39] David B. Resnik, Kevin C. Elliott, and Aubrey K. Miller. 2015. A framework for addressing ethical issues in citizen science. Environmental Science and Policy 54 (2015), 475–481.
[40] Goloka Behari Sahoo, S. Geoffrey Schladow, John E. Reuter, Robert Coats, Michael Dettinger, John Riverson, Brent B. Wolfe, and Mariza Costa-Cabral. 2013. The response of Lake Tahoe to climate change. Climatic Change 116, 1 (2013), 71–95.
[41] Martin Schrepp, Andreas Hinderks, and Jörg Thomaschewski. 2017. Design and evaluation of a short version of the user experience questionnaire (UEQ-S). International Journal of Interactive Multimedia and Artificial Intelligence 4, 6 (2017), 103–108.
[42] Scott Stine and Mary Stine. 1990. A record from Lake Cardiel of climate change in southern South America. Nature 345, 6277 (1990), 705–708.
[43] Swisstopo. 2022. Free basic geodata. (2022). Retrieved March, 2022 from https://www.swisstopo.admin.ch/en/swisstopo/free-geodata.html
[44] Wikipedia. 2022. List of mountain lakes of Switzerland. (2022). Retrieved March, 2022 from https://en.wikipedia.org/wiki/List_of_mountain_lakes_of_Switzerland
[45] Craig E. Williamson, Jasmine E. Saros, Warwick F. Vincent, and John P. Smol. 2009. Lakes and reservoirs as sentinels, integrators, and regulators of climate change. Limnology and Oceanography 54, 6part2 (2009), 2273–2282.
[46] R. Iestyn Woolway and Christopher J. Merchant. 2019. Worldwide alteration of lake mixing regimes in response to climate change. Nature Geoscience 12, 4 (2019), 271–276.

Received 19 February 2023; revised 11 June 2023; accepted 6 July 2023


A Friend in Need Is a Friend Indeed: Investigating the Quality of Training Data from Peers for Auto-generating Empathetic Textual Responses to Non-Sensitive Posts in a Cohort of College Students

RAVI SHARMA, University of South Florida, USA
JAMSHIDBEK MIRZAKHALOV, Monic.AI, USA
PRATOOL BHARTI, Northern Illinois University, USA
RAJ GOYAL and TRINE SCHMIDT, Ajivar LLC, USA
SRIRAM CHELLAPPAN, University of South Florida, USA

Towards providing personalized care, digital mental-wellness apps today ask questions to learn about subjects. However, not all subjects using these apps will have mood problems; thus, they do not need follow-up questions. In this study, we investigate an alternate mechanism to handle such non-sensitive posts (i.e., those not indicating mood problems) in college settings. To do so, we generate and use training data provided by a cohort of peer college students so that responses to non-sensitive posts are contextual, emotionally aware, and empathetic while also being terminal (not asking follow-up questions). Using data from a real mental-wellness app used by students, we identify that AI models trained with our peer-provided dataset generate desirable responses to non-sensitive posts, while models trained with the state-of-the-art (Facebook’s) Empathetic Dataset yield responses that ask many follow-up questions, hence giving a perception of being intrusive. We believe that mental wellness apps today must not assume that any subject using these apps has mood problems. Perceptions of intrusiveness (i.e., apps asking many questions) must be a factor in design. We also believe that peer students can provide rich and reliable training datasets for college mental wellness apps, a topic that is not yet explored.
CCS Concepts: • Human-centered computing → Collaborative and social computing;

Additional Key Words and Phrases: Empathetic responses, college students, teens, natural language processing

ACM Reference format:
Ravi Sharma, Jamshidbek Mirzakhalov, Pratool Bharti, Raj Goyal, Trine Schmidt, and Sriram Chellappan. 2023. A Friend in Need Is a Friend Indeed: Investigating the Quality of Training Data from Peers for Auto-generating Empathetic Textual Responses to Non-Sensitive Posts in a Cohort of College Students. ACM J. Comput. Sustain. Soc. 1, 2, Article 14 (December 2023), 27 pages. https://doi.org/10.1145/3616382

Authors’ addresses: R. Sharma and S. Chellappan, University of South Florida, 4202 E Fowler Ave, Tampa, Florida, USA, 33620; e-mails: [email protected], [email protected]; J. Mirzakhalov, Monic.AI, Tampa, Florida, USA, 33620; e-mail: [email protected]; P. Bharti, Northern Illinois University, 1425 W Lincoln Hwy, DeKalb, IL, USA, 60115; e-mail: [email protected]; R. Goyal and T. Schmidt, Ajivar LLC, PO Box 2087, Tarpon Springs, Florida, USA, 34688; e-mails: {raj, trine}@ajivar.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2834-5533/2023/12-ART14 $15.00 https://doi.org/10.1145/3616382

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 14. Publication date: December 2023.


R. Sharma et al.

1 INTRODUCTION

Mental health is a global concern and has been so for a long time [14, 64, 68, 76, 77, 89, 104, 112]. Among vulnerable populations, young people, especially college students, are of particular concern since issues such as depression, stress, anxiety, and eating disorders can affect the functioning of students at both academic and personal levels [9, 16, 25, 51, 80, 82]. Even though mental health experts, counselors, and professionals at colleges encourage students to reach out to them for help with problems, it is generally true today that student mental wellness needs are unmet due to low budgets, limited staff, lack of 24/7 availability, appointment delays, lack of awareness among students about these services, the inability of younger people to recognize symptoms, stigma [47, 48], and the lack of will, money, and time among students to seek support [84]. We now see shorter counseling sessions, off-campus referrals, and longer waiting times for students [35]. More recently, COVID-19 has severely impacted students’ mental health, which only complicates existing challenges in care delivery at universities [55, 69, 94, 111].

As a result of these trends, there is now a significant interest across the globe in using smartphone apps to deliver mental health services to college students [20, 39, 44, 50, 66, 123]. Our investigations (across many US universities) revealed that college administrators actively recommend that students use mental wellness apps. Traditionally, apps were designed to promote generic relaxation techniques, mindfulness, yoga, and breathing exercises. However, with the advent of artificial intelligence (AI), apps are now geared toward understanding the subject and delivering personalized care. While specifics can vary, at their very core, apps in this space provide an engaging interface for students to post messages freely. A back-end algorithm processes these messages and asks contextual follow-up questions to know more about the subject (giving the feeling that the subject is engaging with an actual human). Based on a few sequences of interactions, the app (with or without human expert involvement) will deliver interventions specific to that subject’s needs. Related apps are Woebot [31], Replika [90], Tess [34], Wysa [43], and Kokobot [71].

Though the impact and reachability of such apps to provide mental wellness are significant, we want to highlight critical challenges that are fundamental from a utility perspective but are overlooked today in these apps. Understanding mood problems or their sources is challenging even for a trained human expert [38, 81, 118]. It takes many back-and-forth conversations (sometimes on a range of topics) in a single session between a counselor and a subject, and sometimes even across multiple sessions, to reasonably diagnose mood issues. For apps to do the same (identify mood problems and their sources), it is much more challenging because, unlike humans, conversational apps do not have access to eyes, hand gestures, body postures, or voices of subjects that are critical to glean mood [83]. Consequently, mental wellness apps today engage in significant exploration with subjects to gauge their mood. While this exploration is necessary for someone with mood problems, a critical question is: what if the subject does not have any mood problems but posts something because the subject can do so via the app?

Let us illustrate why this is an issue worthy of attention. When universities recommend an app to be used, it is impossible to restrict who uses the app. Some students may post for no apparent reason, some may post to check how the app works, and some could be interested in “talking” with an app. In cases in which posts on a mental wellness app do not indicate mood problems: (a) Is further engagement necessary with the subject?
(b) Is more exploration of the subject’s mood (and sources of nonexistent mood problems) still necessary? (c) If not, how do apps deal with posts that do not reflect mood problems?

To investigate this issue, we partnered with Ajivar, a company specializing in smartphone apps for delivering personalized mental wellness on college campuses. The Ajivar app is now approved for and used by students in a four-year college in the United States. The app lets subjects enter free-form texts via an interface and then processes those texts using natural language processing (NLP) algorithms to generate a personalized intervention plan to alleviate mood problems. Note here that students who use the Ajivar app are made aware that their communication is with an automated agent integrated into the application, which is the default option. Our dataset for this study came only from this source. The app also includes a dedicated functionality that allows students to engage with actual human counselors, but data from these interactions were not collected for this study.

In partnership with the app’s psychology team, we looked at student posts (after filtering any identifiable information). We identified that a significant number of posts occurred in typical day-to-day conversations and did not indicate any mood problems. While we will present additional samples later, some related posts on the app from students were as follows: “The sun is shining today”; “My thermostat is not working”; and “I want to exercise today”. It is reasonable to infer that these posts are non-sensitive and may not warrant the standard response of follow-up questions to explore mood. However, the app also cannot choose the other extreme option of ignoring these posts (as in not responding to these posts at all), as that will seem dismissive to students.

Avoiding exploration has more benefits when posts do not indicate mood problems. First, the student need not converse with an AI agent without any definite purpose (hence reducing wasted time on digital devices). Second, available resources (computational/storage) can be better directed to needy subjects. Third, the app need not ask unnecessary questions, minimizing risk to the information being processed. Furthermore, importantly, there is a critical privacy angle here.
We now live in the age of increasing sensitivity to privacy, which means that apps should take all steps to ensure that they are not perceived as being outright intrusive. The slightest hint to the contrary could easily have cascading effects. If privacy concerns are raised, student trust in the app can quickly dwindle to the point where even students with genuine needs may not trust the system, and administrators have little to no power to change such perceptions.

Given these observations, we are broadly concerned in this study with generating appropriate responses to non-sensitive posts (i.e., posts not indicating mood problems) that do not warrant further exploration by mental wellness apps. After comprehensive discussions with multiple stakeholders, including college students, we identified that the ideal option is for the app to process these posts but respond in a manner that is contextual, emotionally aware, empathetic, but also terminal (i.e., not asking follow-up questions, since there is no need to do so). Towards achieving this goal, in this study, we utilize the power of peers and their experiences during college life to provide training data for tuning AI models to generate responses. In this realm, our contributions are below.

(1) Collection of posts from a mental wellness app: We partner with Ajivar, a company specializing in smartphone and AI-assisted mental health services to universities in North America, for data collection. Specifically, we collect 9,090 posts that students (in a four-year college) posted via the Ajivar platform. Counselors curated these posts to ensure that they did not yield private information.¹ Interestingly, a majority of the posts did not reflect mental health concerns, as validated by trained experts. Our problem now is how to respond optimally to non-sensitive posts.
¹ The study was approved by the College’s Institutional Review Board (IRB): 14718648-19-133.

(2) Fine-tuning DialoGPT with Facebook’s Empathetic Dataset: As a first attempt at generating responses, we fine-tuned a state-of-the-art dialogue generation model, DialoGPT [121], using the ParlAI framework [67] with a 2019 dataset generated by Facebook specifically for empathetic conversations (denoted F_ED) [88]. Very briefly, the F_ED dataset contains a series of conversational sessions consisting of posts and responses, wherein the responses were given by volunteers trained in empathy. This dataset was generated from sessions consisting of a minimum of four to a maximum of eight back-and-forth conversations with an average of 15.3 words per utterance. It is essential to know that this dataset does not explicitly focus on conversations for college students but nevertheless focuses on generating empathetic responses, which is relevant to this study. The F_ED dataset is publicly available and is quite popular in the literature [88] for empathetic response generation. Three examples from the resulting DialoGPT model trained with the F_ED dataset to generate responses to non-sensitive posts from our actual student cohort are presented in Table 1.

Table 1. Empathetic Dialogues (F_ED)-Based Response Generation with DialoGPT

| Post | Response |
|---|---|
| I’m glad to live in my dorm. | That’s great! What dorm you live in? Do you have a lot of roommates?? |
| My career gives me great satisfaction. | That’s great! What kind of work do you do for a living? Do you have a job? |
| I could really use some cute puppies photos. | What kind of puppies do you have? Do you have a lot of them or just a few? |

To illustrate this trend further for the first post in Table 1, we present below the outcome across multiple rounds when we asked a college student volunteer to engage in a continuing conversation with the DialoGPT model fine-tuned with F_ED after its first response to the actual student post from our dataset:

Student Volunteer: “I’m glad to live in my dorm.”
DialoGPT fine-tuned on F_ED: “That’s great! What dorm you live in? Do you have a lot of roommates??”
Student Volunteer: “Its a personal room and its awesome and I love it.”
DialoGPT fine-tuned on F_ED: “Its sounds like a great place to live! I hope you have a great time there! What are you studying?”
Student Volunteer: “I am in my junior of my engineering.”
DialoGPT fine-tuned on F_ED: “That’s awesome! I am sure you’ll do great! What kind of engineering are you in?”
Student Volunteer: “I am in mechanical engineering.”
DialoGPT fine-tuned on F_ED: “That’s really cool. I’m glad you have such a great education. I wish you the best of luck!”

We draw readers’ attention to the intrusiveness of the responses generated by the DialoGPT model fine-tuned with the F_ED dataset, which can be deemed as compromising privacy. This observation is the core premise of our study.
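The contrast between responses that keep probing and responses that are terminal can be made concrete with a simple check. The sketch below is an illustrative heuristic only, not the evaluation used in this study: it assumes that a follow-up question can be approximated by a question mark, and the function names `count_questions` and `is_terminal` are hypothetical. The example strings are taken from the responses quoted above.

```python
import re


def count_questions(response: str) -> int:
    """Count question marks, treating a run such as '??' as a single question."""
    return len(re.findall(r"\?+", response))


def is_terminal(response: str) -> bool:
    """A response is treated as terminal when it asks no question at all."""
    return count_questions(response) == 0


if __name__ == "__main__":
    first_reply = "That's great! What dorm you live in? Do you have a lot of roommates??"
    last_reply = ("That's really cool. I'm glad you have such a great education. "
                  "I wish you the best of luck!")
    for reply in (first_reply, last_reply):
        print(count_questions(reply), is_terminal(reply), reply)
```

A check of this kind would miss questions written without a question mark and would count rhetorical questions as follow-ups, so it should be read as a rough proxy for the notion of intrusiveness discussed here.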
(3) Generating training data from peers P ED : As an alternate source of training data for fine-tuning AI models, we recruited three male and three female college students (who recently graduated from the same geographic area as the college where the student posts were collected). From the 9,090 student posts, we identified that 6,392 did not indicate mood problems. From these 6,392 non-sensitive posts, our student volunteers were given a set of 3,000 posts. They were asked to respond to these posts with instructions that their responses be contextual, emotion-aware, and empathetic (that is, leveraging their own recent experiences as college students) while also attempting to be terminal (that is, not asking follow-up questions unless they deemed it necessary). To be clear, we split the 3,000 posts into three groups of 1,000 distinct posts. Each group of 1,000 posts was given to a pair of one male and one female student to respond. In this way, for each of the 3,000 student posts, two independent responses were obtained from one male and one
ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 14. Publication date: December 2023.

A Friend in Need Is a Friend Indeed


Table 2. Sample Training Data for Responses Provided by Peers

Post: I cherish all the new friends I made this year.
Response from Student 1: Friendship is like a treasure. Glad that you have it. Try to spread this happiness.
Response from Student 2: Awesome. It is great that you have friends to talk to.

Post: I am thankful for my family.
Response from Student 1: Being grateful is very important in life, specially in these tough times. Try to spend more time with them. That will increase the love between you all.
Response from Student 2: Me too. I am grateful for my parents who have been supporting me on my work and education.

Post: I feel amazing for right now.
Response from Student 1: Great! that you feel amazing. Call an old friend and share how you feel.
Response from Student 2: Good. Entertain and motivate people around you with your amazing personality.

Table 3. Peers Data (P ED )–Based Response Generation with DialoGPT

Post: I’m glad to live in my dorm.
Response: That is so sweet. I am so happy for you. Keep up with your dorm life.

Post: My career gives me great satisfaction.
Response: That’s right. I am happy that you are enjoying your work. Keep it up.

Post: I could really use some cute puppies photos.
Response: Puppies are adorable and I love them so much. I hope you get to see them soon.

female student volunteer.2 We denote this as our Peer Dataset P ED for training. Three examples from the P ED dataset are presented in Table 2.

(4) Fine-tuning DialoGPT with our peer-generated empathetic dataset: Using only the training data provided by peers (P ED ), we fine-tuned another DialoGPT-based model for responding to non-sensitive posts by college students. Three example responses from the resulting DialoGPT model (for the same posts as in Table 1) are presented in Table 3. Readers can see that the generated responses are terminal and do not intrude with follow-up questions, which seem unwarranted for these posts. Furthermore, the responses generated by the model trained with peer data seem contextual and emotionally aware.

(5) Two points of critical importance to our study: Before we delve further, we draw readers’ attention to two critical points concerning our study. The first point is that we assume the presence of models that can check whether a post indicates mood problems.3 Many sophisticated models do so today. Good surveys of the state of the art are presented in [10, 116]. Existing works focus explicitly on detecting suicidal thoughts [15, 24, 45, 114], expressions of cyber abuse [22, 99, 100], depressive symptoms [13, 30, 85], stress [41, 42, 74], anxiety [28, 91, 103], and more. Given such a sophisticated body of literature in this space, we are confident that these existing models can identify sensitive posts, which, when identified, can be processed separately by mental wellness apps; such processing is out of the scope of this study. Our focus can now return to generating appropriate responses for posts that are deemed non-sensitive.

2 The student volunteers did not interact with each other during training data generation.
3 In which case, the app handles the response differently and is out of the scope of this article.


R. Sharma et al.

The second point is related and stems from false negatives among existing mood detection algorithms: what happens if a post from a subject needing help is identified as non-sensitive? First and foremost, as we show with examples in Table 3 (and as we demonstrate later in more detail), responses from models fine-tuned on our peer dataset (P ED ) are still contextual and emotionally aware. Nevertheless, these responses are terminal and not exploratory. How a subject with mood problems will react to a response tuned for non-sensitive posts is out of the scope of this article. There are many questions to investigate here: (a) Will the subject engage further with the app? (b) If yes, will the additional data help correctly detect the post as sensitive? (c) Will the subject like or dislike the initial terminal response (since they are prone to receiving follow-up questions)? (d) Will a terminal response from a model fine-tuned with peer data (that is still emotionally aware) temporarily alleviate a subject’s mood? These are critical questions that are out of the scope of this article but will be a topic of our future work as we conduct larger-scale studies with sensitive and non-sensitive posts.

The rest of our article is organized as follows. Section 2 provides critical background information and the state of the art. Our formal problem statement, data collection process, and results across multiple metrics evaluated from multiple perspectives are presented in Section 3. Finally, we provide concluding discussions and scope for future work in Section 4.

2 BACKGROUND, MOTIVATION, AND RELATED WORK

In this section, we first provide a quick overview of mental wellness apps in the market to justify the practicality of our study in today’s contexts. We then present related discussions on empathy, the necessities of mental wellness apps, and their tendency to explore more about the subject. We then discuss perceptions of trust among the general public when they see technologies attempting to learn more about them. Subsequently, we discuss the power of peer support in mental wellness today. Finally, we summarize this section while tying the above discussions together.

2.1 State-of-the-Art in Mental Wellness Apps

Wellness apps for mental health can be classified into many types. For this article, we classify them based on how they process user input. The first class of apps gleans the user’s mood from either app-enabled prompts or free-form texts, following which one or more generic intervention techniques are prescribed (e.g., yoga, meditation, exercise, breathing, and music therapy). Such apps include Shine,4 Calm,5 MoodKit,6 and MindShift,7 which are quite popular among college students today. These apps, though, do not engage in any conversation with the user.

The second class of apps, such as Woebot, Wysa, and Youper, does engage in a conversation with the user. These apps typically have an engaging interface where users can enter free-form text. Using state-of-the-art AI models, these apps then engage in a free-flowing conversation with the user to “explore” more so that the app can correctly understand the user’s mood problems, following which the app will ideally offer personalized interventions. The conversational ability of these apps comes from sophisticated NLP techniques fine-tuned with appropriate datasets (both of which are proprietary) so that any response generated by the app is contextual, emotionally aware, and empathetic, and explores the user effectively for superior personalization.

4 https://www.theshineapp.com
5 https://www.calm.com
6 https://www.thriveport.com/products/moodkit/
7 https://www.anxietycanada.com/resources/mindshift-cbt/


2.2 The Importance of Empathy in Conversational Agents and Datasets

It is well established that subjects with mood disorders seeking help need empathy [5, 7, 78, 102, 105]. Naturally, as conversational apps are increasingly used to provide mental wellness for students, the critical issue is how these apps can empathize with student needs while keeping the overall conversation context relevant and emotionally aware. To enable conversational agents to be empathetic, a significant amount of effort has been invested in generating datasets for this purpose. Several works over the years have used the emotion of the input to guide the response generation [57, 65, 101, 119]. In [75], the authors proposed generating responses conditioned on a specified politeness marker (polite, neutral, or rude) using the Stanford Politeness Corpus [17].

Rashkin et al. [88] introduced a novel dataset, Empathetic Dialogues (F ED ), containing 25K conversations grounded in emotional situations labeled via crowd-sourcing. Each conversation has 4 to 8 utterances. The authors investigated the usefulness of the F ED dataset in generating empathetic responses through a combination of generative and retrieval-based models. They used automatic evaluation (BLEU score and perplexity) and human evaluation to evaluate responses. For human evaluation, participants were given at least 100 responses for each model and were asked to rate three dimensions of performance: (1) Empathy/Sympathy, (2) Fluency, and (3) Relevance, on a Likert scale ranging from 1 (not at all) to 5 (very much). In [117], the authors used F ED for fine-tuning and proposed a T5-based empathetic response-generating bot, Empbot, that achieved state-of-the-art performance in both automatic and human evaluation.

In [58], the authors proposed DailyDialog (DD), a multi-turn, human-annotated, multi-topic dialog dataset based on daily life. It consists of around 13K dialogues containing six categories of emotions. The authors of [115] used the DD dataset to implement an emotion-based chatbot, EP-bot, which analyzes the emotions and intentions expressed in a sentence, uses that information to generate the next part of the conversation, and then measures how well it performs using metrics such as accuracy, perplexity, BLEU score [79], and F1-score.

ESConv [60] is a crowd-sourced dataset that includes 1,053 dialogues and 31K utterances covering problems such as ongoing depression, job crisis, breakup, problems with friends, and academic pressure. For curating the data, 854 workers (help-seekers) were employed and trained to rate help-seeker and supporter conversations. They were requested to assess their emotions and the supporter’s performance using a five-point Likert scale for three aspects: (1) emotional intensity after the conversation, (2) the supporter’s empathy and comprehension of the help-seeker’s experiences and feelings, and (3) the appropriateness of the supporter’s responses to the conversation topic. The authors further showed the effectiveness of the dataset: the base models Blenderbot [92] and DialoGPT [121], once fine-tuned on it, provided improved emotional support and mimicked human supporters.

The authors of [98] used peer-supported data from Talklife,8 comprising 18M interactions of seekers and supporters in 6.4M threads, along with publicly accessible data curated from 55 mental health–supported subreddits containing 1.6M threads and 8M interactions, for empathy training. They further annotated a subset of 10K data on empathy for analysis. For annotation, they leveraged eight trained crowd workers for each seeker–response post pair. The workers were asked to label the level of empathy expressed in the response post through three communication mechanisms (Emotional Response, Interpretation, and Exploration), ranging from no communication (0) to strong communication (2). They created a multi-task bi-encoder model based on RoBERTa [61] for empathy response rewriting using empathy identification and rationale extraction. They compared against multiple baselines and found significant improvement.

8 talklife.com


Data extracted from Reddit [98, 122] and Twitter [72], along with generic dumps and motivational resources (e.g., tweets from @DalaiLama, @DailyZen, @mindfulEveryday), has also been used for empathy analysis. The list of empathetic datasets presented here is incomplete, and additional datasets exist in the literature. They differ in many aspects but can be generalized as datasets of single-turn or multi-turn conversations that seek information to understand context, i.e., that ask questions. These questions seek information because they are designed to engage in a personalized and empathetic conversation with users and provide them with emotional support. Unless the context is clear, personalization is not possible.

Note here that all contemporary approaches to building conversational agents using datasets involve the use of pre-trained language models and distributed embedding representations such as GPT-2 [86], GPT-3 [6], BERT [21], DialoGPT [121], and more. The use of these approaches has led tech giants to create and deploy sophisticated, open-domain chat-bots such as Blender (Facebook), Meena (Google), and XiaoIce (Microsoft). While chat-bots span applications such as education, health, and customer support [32], we limit the scope of our discussion to empathetic chat-bots and the respective empathetic datasets.

2.3 Trust and Intrusiveness

We live in an age of extreme awareness (to the point of being paranoid) of privacy compromises. Despite their noblest intentions, mental wellness apps are not immune to this perception, especially when seeking to explore information about a user. We now report recent studies on privacy broadly relevant to the focus of our study. In [95], the authors conducted a study to analyze why users lie when communicating with digital platforms, including chat-bots. Terming such behavior “privacy lies,” they found that the significant contexts and motivations for providing false information included (a) perceptions among users that the platforms were asking for information that they deemed unnecessary or discomforting; (b) unfamiliarity with the platform and its use cases; (c) perceptions that the platforms are monetizing their information; and (d) a firm belief that the information provided will lead to identity theft.

Krasnova and her team conducted another study that asked 199 Facebook users about their privacy concerns regarding the disclosure of 38 types of information, including data about their friends [52]. Their research found that users are more anxious when they perceive that the app asks more than essential questions about them and their friends. These anxieties were amplified when users were made aware of potential online risks. According to another critical study in 2013, the amount of information requested by an app has a clear adverse impact on users’ willingness to install the app [27]. To be precise, the study reveals that if users have privacy-concerning perceptions about an app beforehand, they are unlikely to install it in the first place. If even a few students in a college tag a mental wellness app as intrusive, then students with a genuine need for such apps may not install them in the first place, and, as we know, these perceptions are very hard to change. Hence, there is a clear and tangible need for mental wellness apps to not appear intrusive unless there is a clear need to seek more information when dealing with genuine mood problems.

2.4 The Impact of Peer Support in Mental Wellness

In the realm of mental health, as we noted in the Introduction, there is a severe shortage of trained professionals, especially with the increasing number of younger people suffering from mood problems today. In this realm, some exciting studies have focused on the positive impact of peers in alleviating mental health concerns, which is not complicated to reason out in a college setting. Many challenges a college student faces can be easily relatable to a classmate, roommate, or senior of that student, to the point where the mate can provide sound advice for the student to bounce back. The same is also likely true in other arenas with known mental health challenges, be it the military, sports, health care, or law enforcement.

Many studies in the past decade have validated the above insight, i.e., the positive impact that peers can make on mental health improvement at both the individual and community level [2, 11, 19, 33, 40, 56, 70, 73, 87]. These studies broadly conclude that sharing collective experiences, challenges, stressors, and opportunities among peers promotes improved wellness among individuals while creating more resilient groups that overcome barriers with a positive mindset. Peer support and engagement have also been identified as important for reducing social isolation, stigma, and anxiety and for promoting inclusiveness.

The premise of our article (the significant value of peer support in improving mental health, specifically for college students) is also supported by recent work [12, 106–108, 110]. These studies advocate for training a robust community of empathetic peers to support students facing nonserious mood problems (but certainly not for cases involving severe depression, thoughts of suicide or self-harm, and drug abuse, which need expert counselors). A 2020 study further argues for training college students with basic skills in counseling and advocates for online training programs to do so [4]. The article argues that providing support to peers brings clear mental health benefits. Such approaches would also increase social connection among peers, likely improving academics and retention for all. To provide a comprehensive picture, we refer readers to studies highlighting how orthogonal challenges can also manifest in peer-supported mental wellness programs.
The study in [40] on peer support for veterans with Post-Traumatic Stress Disorder (PTSD) revealed that peer support is very productive but requires a cohesive group of peers (based on trauma type, gender, and the era of service). Other findings highlighted in the case of PTSD included the need for a peer support group with high-quality leadership and interpersonal skills to promote mental wellness effectively. Other studies [36, 53] reveal that peer support for mental wellness may be only a short-term solution as the complexity of subject needs and mood problems increases. This finding is expected since more complex cases must be handled by trained professionals. Studies also revealed that diversities in race, language, gender, and sexual orientation become more prominent when peers provide (face-to-face) support compared with professional counselors [29, 93]; again, a very critical but not unexpected finding.

Our study in this article follows up on the conclusions of all the above studies. Our unique contribution is the creation of a training dataset from college students to respond to non-sensitive posts made by their peers. The overall goals of the training data provided are that the responses should be contextual, emotionally aware, empathetic, and terminal. We are unaware of such datasets or AI models fine-tuned for such scenarios.

3 FORMAL PROBLEM STATEMENT, DATA COLLECTION, EXPERIMENTS AND RESULTS

This section presents our formal problem statement, data collection (student posts and peer training data), experiments, and results.

3.1 Our Formal Problem Statement

Given a post by a student that is deemed to be non-sensitive (i.e., does not indicate mood problems), our goal is to design an automated conversational agent to generate a response for that post that simultaneously (a) is contextual; (b) expresses the right emotion; (c) has a reasonable degree of empathy; and (d) is terminal (as in not asking follow-up questions).

Table 4. Student Posts Sample Data

Posts
I cherish all the new friends I made this year.
I am thankful for my family.
Today I spent my time learning a song.
Today I drove around my neighborhood and it was the highlight of my day.
Saw another stunning sunset today!
People need to start making new board games because I’ve played literally every one.
Online learning has motivated me a lot to try.
I feel amazing for right now.

3.2 Collection of Non-sensitive Posts from Ajivar

Our dataset of non-sensitive student posts was collected from Ajivar, a mobile app and web interface platform that provides coaching and counseling for college students’ mental well-being. It is approved for, and currently used by, students in a four-year college in the United States. The age range of students using the app is from 18 to 24. Approximately 58% of the students on that campus were female, while around 20% identified as a minority. At its core, Ajivar offers students personalized mindfulness techniques through text-based conversational interactions. The primary means for the subjects to engage with the app is a conversational interface where students can post their feelings and thoughts. The app will then ask follow-up questions to further understand the overall context and mood of the subject. Once the app has understood these, it provides an emotional, context-aware, and empathetic response to alleviate the subject’s mood. Depending on the subject’s prior personalized preferences, the app can also suggest relaxing music, breathing exercises, meditation practices, and self-awareness tasks.

For this study, Ajivar provided 9,090 de-identified posts from 1,451 students who used the app’s text entry feature during the three months from February 1, 2020 to April 30, 2020. We emphasize that none of these posts was constrained in any manner, and students freely posted these in their natural settings after the app was officially launched within the university for full use. For these posts, adhering to the college’s Institutional Review Board (IRB) approval, all personally identifiable information (PII) was removed from the dataset, ensuring that no information could be linked to the students. These measures protect the privacy and confidentiality of the users while maintaining ethical standards in data handling.

After scanning through the posts, we identified that 6,392 posts did not indicate mood problems and were, in fact, regular commonplace conversations at first glance (which we were able to determine based on our training in student mental wellness programs at our campuses). These 6,392 posts formed our dataset of non-sensitive posts in this study. Some examples of non-sensitive posts from our dataset (including ones we included in Section 1) are presented in Table 4.

3.3 Generating Peer Responses to Student Posts

This is the most critical aspect of our study, and we paid careful attention to it. We did not opt for crowd-sourcing responses over the Internet. There were multiple reasons for this: the cost was non-trivial; data reliability is a problem since we cannot be sure that the subjects are indeed peers (i.e., college students themselves); and we wanted to get training data from homogeneous peers for superior context awareness, which precluded crowd-sourcing from the Internet.

As a result, we recruited (via an ad) six young people who were either current students or recent graduates from another four-year college in the same US city where the app was launched. The students recruited were 22 to 24 years old and included three males and three females. The student volunteers were instructed to respond to a subset of 3,000 randomly chosen posts from the 6,392 posts, with the request that each response be contextual, empathetic (without being rude or offensive), and terminal (i.e., not attempting to be intrusive by asking follow-up questions). The students were instructed to remind themselves of their collegiate experiences while offering responses, and they were enthusiastic about participating. The students did not interact with each other during this study. In order to avoid bias and increase the diversity in the obtained responses, every student pair (one male, one female) was asked to respond to 1,000 posts. This way, every post received two responses. Some examples of responses provided by our peers are shown in Table 2 in Section 1. These 3,000 student posts with responses constitute our Peer-generated dataset (P ED ). The training and validation datasets for automated response generation were obtained from the P ED dataset.

3.4 Dataset Creation in ParlAI

In order to have a uniform data format to compare and test the models, we used ParlAI [67], an open-source software platform for dialog research implemented in Python and available for public use.9 We incorporated our own dataset P ED into the ParlAI framework by first converting the dataset into the ParlAI Dialog (PD) format, which consists of episodes, and then storing it in a text file. More details are presented in Appendix A. Note that our P ED dataset is a collection of single-turn episodes only. This means that all our data entities consist of terminal responses as labels in PD format.

In order to compare responses for models fine-tuned with our P ED dataset, we selected the recently proposed and widely popular empathetic dialog dataset from Facebook (denoted as F ED ), which is an emotion-annotated dataset with 32 fine-grained labels that are essential for a mental health conversational chatbot needing to display empathy. The dataset contains multi-turn conversations based on a given context. AI models fine-tuned on it to generate empathetic responses have shown state-of-the-art performance close to human responses [88].

To illustrate the popularity of the F ED dataset, we point out that the authors of [63] used F ED to evaluate their proposed method, whose objective is to create empathetic responses that imitate the speaker’s emotions while taking into consideration their emotional intensity (positive or negative). The authors of [109] introduce a taxonomy of empathetic response intents for chatbots to engage in prosocial conversations. They also analyze the F ED corpus based on the most frequent intents from the taxonomy and the emotion categories from F ED . They claimed that the taxonomy could be used to develop an empathetic chatbot, to serve as an annotation scheme for other datasets, and to evaluate their empathetic characteristics. The authors of [37] proposed a computational framework for assessing interview chatbots and introduced a tool named iChatProfile to aid designers in continually evaluating and enhancing such chatbots. The authors extracted empathetic words from F ED by selecting the top 15 content words from each of its 32 emotion categories to gauge the empathy level. They explained that this measurement could assist designers in identifying chatbot replies that lack empathy and making appropriate improvements. Based on our understanding and its widespread recognition in the literature, we consider Facebook’s F ED dataset the closest to compare with our dataset.
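As a rough illustration of the conversion into ParlAI Dialog format described in this section, the sketch below serializes single-turn post/response pairs into tab-separated `text:`, `labels:` and `episode_done:` fields. The function names `to_parlai_line` and `to_parlai_text` are hypothetical, and writing one episode per post/response pair is our assumption; the authors' actual conversion is described in their Appendix A and may differ.

```python
def to_parlai_line(post: str, response: str) -> str:
    """Serialize one single-turn episode as a ParlAI Dialog text line.

    Hypothetical helper: one episode per (post, response) pair is an
    assumption, not necessarily how the authors built P ED.
    """
    def clean(s: str) -> str:
        # Tabs separate fields and newlines separate examples,
        # so fold all whitespace runs into single spaces.
        return " ".join(s.split())

    fields = [
        f"text:{clean(post)}",
        f"labels:{clean(response)}",
        "episode_done:True",  # single-turn: every example ends its episode
    ]
    return "\t".join(fields)


def to_parlai_text(pairs) -> str:
    """Join many (post, response) pairs into the contents of one text file."""
    return "\n".join(to_parlai_line(p, r) for p, r in pairs) + "\n"


# Example pairs taken from Table 2 of the article.
sample = [
    ("I feel amazing for right now.",
     "Great! that you feel amazing. Call an old friend and share how you feel."),
    ("I am thankful for my family.",
     "Me too. I am grateful for my parents who have been supporting me on my work and education."),
]
parlai_text = to_parlai_text(sample)
```

Because every line carries `episode_done:True`, each post/response pair is its own episode, which matches the single-turn, terminal nature of the peer responses.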

9 http://parl.ai


Table 5. Responses Generated from Base DialoGPT

Post: I am open to love on a higher level than I ever expected or wished for.
Response: I am open to love on a higher level than I ever hoped for.
Our Analysis: Repeats the post followed by a reiteration of the same fact.

Post: Every appointment I go on goes perfectly and I enjoy my time with the client.
Response: How often do you go on for your appointments?
Our Analysis: Asks a question.

Post: I see my creativity as the perfect outlet I can use to relax.
Response: I’m super close to the nearest field :P It just gets too hot there.
Our Analysis: No context relevance with the original post.

Post: I really want to spend time with my girlfriend but school comes first.
Response: I have no idea what I want to do with my life, but I know what I’m doing with it.
Our Analysis: Shows confusion like the post but does not provide a response.

Post: My air conditioner stopped working so now it is blazing hot in my room.
Response: How hot is it in your living room right now? I’m going to guess it’s hot.
Our Analysis: Asks a question which has already been answered.

3.5 DialoGPT Model Selection

For generating empathetic responses after fine-tuning with the F ED and P ED datasets, we used DialoGPT [121]. DialoGPT is a large-scale pre-trained dialogue response generation model for multi-turn conversation based on GPT-2 that achieves state-of-the-art performance on various language modeling tasks without task/domain-specific training. Released in 2020 by Microsoft, DialoGPT has three versions, out of which we used the smallest model, trained on 147M conversations from comment chains scraped from Reddit spanning 12 years. In the original paper, DialoGPT achieved performance close to human quality in single-turn dialogue settings [121]. DialoGPT has been extensively used as a baseline model in the literature for conversational chatbots and neural response generation for many tasks, such as emotion prediction, sentiment-aware text generation, detecting incompatible contexts in conversations, and more [8, 18, 23, 26, 46, 96, 113].

3.6 Response Generation of DialoGPT Model for Baseline, Fine-tuning with F ED and P ED

First, we utilized the base DialoGPT model in [121] to generate responses for student posts. This helps us glean insights into the limitations of the base model when it is not fine-tuned for our context of non-sensitive posts by college students. After this, we independently fine-tuned the DialoGPT model with our training datasets, F ED and P ED , to compare performances.
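To make the baseline step concrete, here is a minimal sketch of how a single student post could be fed to the base DialoGPT model. It assumes the Hugging Face `transformers` library and the public `microsoft/DialoGPT-small` checkpoint, neither of which the article names (the authors work within ParlAI), and it uses greedy decoding by default; `build_prompt` and `generate_reply` are hypothetical helper names.

```python
EOS = "<|endoftext|>"  # GPT-2 end-of-text token; DialoGPT uses it to end each turn


def build_prompt(turns, eos=EOS):
    """Concatenate dialogue turns, terminating each one with the EOS token."""
    return "".join(turn + eos for turn in turns)


def generate_reply(post, model_name="microsoft/DialoGPT-small", max_new_tokens=40):
    """Generate one reply to a single post (assumed setup, not the authors' code)."""
    # Imported lazily so that build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = build_prompt([post], tokenizer.eos_token)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    # Keep only the newly generated tokens, dropping the echoed prompt.
    reply_ids = output_ids[0, input_ids.shape[-1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True)
```

Calling `generate_reply("I'm glad to live in my dorm.")` downloads the checkpoint on first use. The same function could be pointed at a fine-tuned checkpoint by passing a different `model_name`, which is how the three settings compared in this section would differ under this assumed setup.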

3.6.1 Response Generation with Baseline DialoGPT Model. Table 5 shows responses that the base DialoGPT generated for randomly selected students’ posts when the model is not fine-tuned on either the F ED or P ED datasets, together with our analysis of those responses. As we can infer, the responses do not meet acceptable levels of empathy, cognition, or emotional awareness, and improvements are needed in these areas.

3.6.2 Response Generation with DialoGPT Model Fine-tuned with F ED and Fine-tuned with P ED . We now explain our DialoGPT models fine-tuned with Facebook’s Empathetic Dataset (F ED ) and our Peer Provided Dataset (P ED ). For both models, the procedures were similar. For our DialoGPT model fine-tuned on F ED , we followed an approach similar to that in [3], which also fine-tuned DialoGPT with the Empathetic Dialogues dataset [88]. We tried batch sizes of 1, 2, 4, and 8 with the number of epochs varying from 1 (as suggested in [3]) to 100 to obtain the fine-tuned model with minimum perplexity.

We implemented the following procedure for our DialoGPT model fine-tuned on P ED . Of the 6,392 non-sensitive student posts, recall that we have two independent responses for 3,000. We split the latter dataset into 3,000 «post-response» pairs for training and validation. The remaining 3,392 posts were used for testing the appropriateness of responses generated from the fine-tuned model. The hyper-parameters for the P ED model are presented in Table 8 in Appendix B.

Table 6 shows some of the responses that DialoGPT generated on the testing dataset when fine-tuned on F ED and when fine-tuned on P ED , for comparison. As can be reasonably inferred, while both models provide contextual and emotionally aware responses, the F ED -based model always asks questions. In contrast, our P ED model provides contextual and emotionally aware responses without questions.

3.7 Automated Evaluation of Responses

We are now ready to present the evaluation results. Traditional NLP offers many standard metrics, such as BLEU (Bilingual Evaluation Understudy) [79], ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [59], METEOR (Metric for Evaluation of Translation with Explicit ORdering) [1], and BERTScore [120]. While these metrics are beneficial and have driven the NLP field forward, they are limited in their ability to evaluate the emotional awareness of conversational agents, which is critical in mental wellness settings. To fill this gap, the authors of [97, 98] have recently attempted to quantify the emotional awareness of conversational agents. Specifically, given a response from an automated conversational agent, they define three novel metrics: Emotional Response (ER), Interpretation (IP), and Exploration (EX). Each metric is scored from 0 to 2, with 2 being the highest, and Empathy overall is defined via a combination of the three. Since their publication, these metrics have been widely used. In [49], the authors used EX and IP to compare different dialogue agents on the F_ED dataset. The authors of [124] use the EX, ER, and IP metrics to generate empathetic responses suitable for health coaching scenarios. In [62], the authors evaluated empathy using the coarse attributes presented in [88] on a Likert scale of 1 to 5 and the empathy-based attributes ER, IP, and EX on a scale of low, medium, and high. The authors of [54] use the public Reddit dataset and all three empathy metrics from [98] to measure toxicity and were able to significantly reduce the size of the fine-tuning data while improving toxicity mitigation. In our evaluation, we also used ER, IP, and EX as defined in [98] to evaluate responses from the models fine-tuned with F_ED and P_ED.

We used the default parameters presented in [98] (footnote 10) to train and validate the ER, IP, and EX models. As a result, three models were generated, one each for emotional response, interpretation, and exploration. Table 7 presents the evaluation results for these three metrics for the models fine-tuned with the F_ED and P_ED datasets on our testing dataset of 3,392 non-sensitive posts. For Emotional Response and Interpretation, fine-tuning with the peer dataset (P_ED) is comparable to fine-tuning with the Facebook dataset (F_ED); the differences are minor. However, the P_ED dataset yields a model that scores very low on Exploration compared with F_ED. As we argued before, further exploration of the subject is unwarranted for non-sensitive posts and can even be seen as intrusive. We present human evaluation results in this context next. What is critical to note, however, is that with only 3,000 responses for training, we can tangibly demonstrate that the model fine-tuned on our P_ED dataset is comparable with the model fine-tuned on Facebook’s empathetic dataset F_ED in our overall goal of generating contextual and emotion-aware responses. Moreover, the model fine-tuned with our P_ED dataset significantly outperforms the F_ED one in avoiding perceptions of intrusiveness, since its responses are terminal, which, as we argued earlier, is the ideal case when posts from students do not indicate mood problems.

Footnote 10: https://github.com/behavioral-data/Empathy-Mental-Health/tree/master/dataset

Table 6. Response Generation on our Testing Dataset with F_ED-Based Model and P_ED-Based Model

| Post | Fine-tuned on F_ED | Fine-tuned on P_ED |
|---|---|---|
| I really want to spend time with my girlfriend but school comes first. | Oh no, I’m sorry to hear that. I hope you have a good relationship with her. How long have you been together? | I understand, I hope you figure this out. Tell her how you feel about this situation and see what she feels. |
| Every appointment I go on goes perfectly and I enjoy my time with the client. | That’s great! What kind of work do you do? What are you most excited about? | That’s good to know that you enjoy your time with clients and enjoy working with them. |
| I see my creativity as the perfect outlet I can use to relax. | That’s great! What do you do with your time that you like to spend with others? | That’s true. Utilise this time well. Share it with your friends and family. |
| I can do anything I set me mind and heart on. | That’s great! What kind of work do you do? What are you most passionate about? | That’s a good thing. You can learn new things daily and apply them to your life. |
| My air conditioner stopped working so now it is blazing hot in my room. | Oh no! I’m so sorry to hear that. I hope you can get it fixed soon. | I am sorry about that, I know it must be tough for you. I hope there is someone for help to fix it. |
| I create love and romance in my life with ease. | That’s great! What kind of love do you have for your love life? Do you have any hobbies or interests? | That’s so good on your part, I am sure you will achieve great things with your time. |
| Every day I am getting closer to my ideal weight. | That’s awesome! How much weight do you want to lose? I’m trying to lose as much weight as I can. | That is amazing, I am sure you will achieve it. You can do it anytime. |
| I adjust effortlessly to change. | That’s great! What kind of changes do you make to make it easier for you to adapt? | That’s good to hear. Keep up with your routine and you should be good to go. |
| I am open to love on a higher level than I ever expected or wished for. | That’s great! What kind of love do you have for someone you don’t know well? | Excellent. Speak about it to your friend and see how they feel about it. See if they share your opinion. |
| I am ready to start the day. | That’s great! What are you going to do today? I’m sure you’ll do great! | That is great, start early. Do chores that you love. This will automatically build spirit and confidence. |
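To make the notion of a terminal response concrete, the following minimal Python sketch flags a response as terminal when it contains no question mark. This is an illustrative assumption, not the evaluation method used in this article: the helper names `is_terminal` and `terminal_rate` are hypothetical, and the check is a crude proxy that misses questions phrased without a question mark. The sample responses are copied from Table 6.

```python
def is_terminal(response: str) -> bool:
    """Return True when the response contains no question mark.

    Crude proxy (an assumption of this sketch): any '?' is treated as a
    follow-up question; questions phrased without '?' are not detected.
    """
    return "?" not in response


def terminal_rate(responses: list[str]) -> float:
    """Fraction of responses that are terminal under the proxy above."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_terminal(r) for r in responses) / len(responses)


# Two responses per model, copied from Table 6.
fed_responses = [
    "That's great! What kind of work do you do? What are you most excited about?",
    "Oh no! I'm so sorry to hear that. I hope you can get it fixed soon.",
]
ped_responses = [
    "That's good to know that you enjoy your time with clients and enjoy working with them.",
    "That is amazing, I am sure you will achieve it. You can do it anytime.",
]

print("F_ED terminal rate:", terminal_rate(fed_responses))
print("P_ED terminal rate:", terminal_rate(ped_responses))
```

A check of this kind could serve only as a quick screen; perceived intrusiveness, as measured in the human evaluation, depends on more than the presence of a question.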


Table 7. Emotional Response (ER), Interpretation (IP) and Exploration (EX) for Models Fine-tuned on F_ED and P_ED

| Fine-Tuned with | ER | IP | EX | Average (ER, IP, EX) | Average (ER, IP) |
|---|---|---|---|---|---|
| F_ED | 0.989 | 0.184 | 1.345 | 0.839 | 0.586 |
| P_ED | 1.027 | 0.072 | 0.049 | 0.382 | 0.549 |

Note that ER, IP, and EX range from 0 (Poor) to 2 (Strong).
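The two average columns of Table 7 can be recomputed from the per-metric scores. The short Python sketch below does so as a plain arithmetic mean, which is an assumption about how the averages were formed; the dictionary layout and the helper name `average` are illustrative only. Under that assumption the recomputed values agree with the table to within 0.001.

```python
from statistics import mean

# Per-metric scores reported in Table 7 (each metric is on a 0-2 scale).
scores = {
    "F_ED": {"ER": 0.989, "IP": 0.184, "EX": 1.345},
    "P_ED": {"ER": 1.027, "IP": 0.072, "EX": 0.049},
}


def average(model: str, metrics: tuple[str, ...]) -> float:
    """Arithmetic mean of the selected metrics for one model."""
    return mean(scores[model][m] for m in metrics)


for model in scores:
    all_three = average(model, ("ER", "IP", "EX"))
    er_ip = average(model, ("ER", "IP"))
    print(f"{model}: avg(ER,IP,EX)={all_three:.4f}  avg(ER,IP)={er_ip:.4f}")
```

Leaving EX out of the average isolates the two metrics on which the models are comparable, which is why the gap between F_ED and P_ED nearly vanishes in the last column.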

Fig. 1. Average scores of all evaluators for DialoGPT_BASE, F_ED, and P_ED for (a) Acceptability, Cognition, and Emotion; and (b) Intrusiveness and Emotional Overload.

3.8 Human Evaluation of Responses

In addition to the automated evaluation, we performed a human evaluation of the generated responses. We did this for 200 randomly selected post–response pairs from our testing dataset for the three models: base DialoGPT, DialoGPT fine-tuned with F_ED, and DialoGPT fine-tuned with P_ED. Four current college students (two male and two female, none of whom had provided training responses) were recruited as evaluators, along with a campus mental wellness expert. The evaluators did not interact with each other during the evaluation phase. All evaluators were asked to give a score of either 0 or 1 on the following five metrics.

(1) Acceptability: If the response is overall acceptable, give 1, else give 0.
(2) Cognition: If the response is contextually relevant, give 1, else give 0.
(3) Emotion: If the response is in sync with the underlying emotion, give 1, else give 0.
(4) Intrusiveness: If the response is deemed intrusive, give 1, else give 0.
(5) Emotional Overload: If the response contains more than the necessary intensity of emotion, give 1, else give 0.

Ideally, an optimal response scores 1 for acceptability, cognition, and emotion and 0 for intrusiveness and emotional overload. The instructions given to evaluators are presented in Appendix C.

3.8.1 Average Scores for all Evaluators. Figures 1(a) and 1(b) compare the aggregated average of the evaluators’ scores and the expert’s scores for DialoGPT_BASE, F_ED, and P_ED. For the DialoGPT_BASE responses, evaluators judged only 15% of responses to be acceptable overall, which is a low number. The cognition and emotion scores were marginally better at 25.38% and 20.5%, respectively. However, since most responses were unacceptable in the first place, the cognition and emotion scores for the base DialoGPT model need not be investigated further.
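As an illustration of how binary scores of this kind turn into the percentages reported in this section, the Python sketch below pools all 0/1 ratings per metric and reports the share of 1s. The pooling rule, the function name `percent_positive`, and the toy ratings are assumptions made for this example; they are not the study's data or its exact aggregation procedure.

```python
from collections import defaultdict


def percent_positive(ratings):
    """Percentage of 1-scores per metric.

    `ratings` is an iterable of (evaluator, metric, score) triples with
    score in {0, 1}. All evaluators and responses are pooled per metric.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _evaluator, metric, score in ratings:
        if score not in (0, 1):
            raise ValueError(f"score must be 0 or 1, got {score!r}")
        totals[metric] += score
        counts[metric] += 1
    return {m: 100.0 * totals[m] / counts[m] for m in counts}


# Hypothetical ratings: two evaluators, two responses, two metrics.
toy_ratings = [
    ("student_1", "acceptability", 1), ("student_1", "acceptability", 0),
    ("expert", "acceptability", 1), ("expert", "acceptability", 1),
    ("student_1", "intrusiveness", 1), ("student_1", "intrusiveness", 0),
    ("expert", "intrusiveness", 0), ("expert", "intrusiveness", 0),
]

result = percent_positive(toy_ratings)
print(result)
```

When every evaluator rates the same number of responses, pooling gives the same figure as averaging the per-evaluator percentages, so the choice between the two does not matter in that setting.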


Fig. 2. Evaluator-wise scores comparison of DialoGPT_BASE, F_ED, and P_ED for (a) Acceptability, Cognition, and Emotion; and (b) Intrusiveness and Emotional Overload.

The human evaluation scores (four student evaluators and the expert) for the models fine-tuned with F_ED and P_ED are largely similar, except for intrusiveness (I): on average, 57.4% of responses from the model fine-tuned with F_ED were deemed intrusive, compared with only 3.9% of responses from the model fine-tuned with P_ED. Emotional overload (EO) was relatively low for both models.

We now discuss the standard deviations in Figures 1(a) and 1(b). Figures 2(a) and 2(b) present the scores from each evaluator separately across all our metrics. For Acceptability, Cognition, and Emotion in Figure 2(a), the scores of the student evaluators were very similar to one another and to those of the expert evaluator. However, we see interesting results for the intrusiveness and emotional overload metrics in Figure 2(b). For the base DialoGPT model, both metrics were rated consistently by peers and the expert. For the F_ED- and P_ED-tuned models, however, the intrusiveness score yields the most striking results, which are potentially significant. The student evaluators consistently found a substantial number of responses from the model fine-tuned with F_ED to be intrusive. Surprisingly, our external evaluator trained in mental wellness for college students found only a handful of responses from the F_ED fine-tuned model to be intrusive. One plausible explanation is that counselors are so used to asking questions in mental health evaluation sessions that they do not see exploration as a problem. The student evaluators, however, clearly felt that exploration was intrusive. This difference of opinion is worth investigating in a future study over a large-scale population of experts and students. For the model fine-tuned with P_ED, a minimal number of responses were deemed intrusive by student evaluators (significantly fewer than for F_ED), and our expert evaluator found none of the responses intrusive. Similarly, in Figure 2(b), while student evaluators found some responses from both P_ED and F_ED to be emotionally overloaded, our expert evaluator found none for either model. Again, we attribute this to mental health experts being conditioned to be kind and sensitive to subjects with mood problems, which makes them more tolerant of emotionally overloaded responses than student evaluators are.

3.8.2 Correlation Among Scores. Figures 3(a) and 3(b) give the Pearson correlation values across the different metrics.
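As a minimal illustration of the statistic behind these figures, the Python sketch below computes the Pearson correlation between two columns of binary scores. The function name `pearson` and the three score lists are hypothetical and are not taken from the study's data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Hypothetical 0/1 scores for six responses from one evaluator.
acceptability = [1, 1, 0, 1, 0, 1]
emotion = [1, 1, 0, 1, 0, 0]
intrusiveness = [0, 0, 1, 0, 1, 0]

print(round(pearson(acceptability, emotion), 3))
print(round(pearson(acceptability, intrusiveness), 3))
```

In this toy data the intrusiveness column is the exact complement of the acceptability column, so their correlation is -1, while acceptability and emotion agree on five of six responses and correlate positively.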
Human evaluations of responses generated via F_ED and P_ED show a positive correlation of acceptability with the emotion and cognition scores. This indicates that evaluators placed significant importance on emotional and cognitive factors when determining the acceptability of a given response. However, certain responses with cognition and/or emotion scores of 1 that also exhibited intrusiveness or emotional overload were occasionally deemed unsuitable and, therefore, not accepted. On further investigation, evaluators disclosed that such responses made them uncomfortable because they were intrusive, prompting them to reject those responses despite their being contextually relevant and emotionally intelligent.

Fig. 3. Pearson Correlation among the metric scores of evaluators for (a) F_ED and (b) P_ED.

4 CONCLUSIONS AND FUTURE WORK

AI-enabled conversational systems that promote people’s mental well-being have already permeated society. Students are a demographic in which these systems are rapidly gaining prominence, which is unsurprising given that younger people are more prone to using mobile devices and more willing to try novel systems and technologies. In fact, with increasing mental wellness awareness and an acute lack of human and economic resources to meet demand, colleges across the United States and many other countries are actively recommending that their students install conversation-based mental wellness apps. Our extensive survey of the landscape, which includes a partnership with a company providing such services, has revealed that mental wellness apps work by asking the user questions to gain a finer-grained understanding of problems, after which the app attempts mood alleviation. This is understandable, since even experienced counselors must understand a subject’s situation before suggesting options to improve their mood. However, this feature of mental wellness apps, unavoidable for people with mood problems, is counterproductive for students who do not have serious mood problems but may still use the app to practice general wellness, out of curiosity, or for fun to test the limits of what AI can do. For such students, our survey indicated that mental wellness apps today are perceived as intrusive because they ask too many follow-up questions for seemingly innocuous posts. If students (even a few) see these apps as privacy intrusive, the apps’ utility can be seriously compromised, and students (with mood problems or otherwise) may refrain from using them.

Our study investigated this particular case, in which the issue is how these apps react when posts are perceived to be non-sensitive (that is, not indicative of mood problems). Discussions with counselors, faculty, and recently graduated students also revealed that in such cases, the best response from an app is an empathetic one that is contextual but also terminal, without follow-up questions. Our core contribution in this article is demonstrating the power of training data provided by peers, based on their own college experiences, to fine-tune AI models to generate optimal responses. We were unaware of any datasets that could accomplish this goal; thus, we focused on curating our peer dataset. First, we gathered de-identified daily journal entries from an actual mental wellness app involving 1,451 college students over three months. Out of 9,090 posts, 6,392 were deemed to be non-sensitive. Out of these, we selected 3,000 random posts, from which 1,000 distinct posts were given to each of three student pairs (each pair consisting of one male and one female student). The students were requested to respond to each post in an empathetic, contextual, and terminal manner. Each pair of students gave responses for 1,000 posts. Our total dataset was hence 3,000 posts and 6,000 responses. We then used this dataset to fine-tune the DialoGPT model to generate responses for non-sensitive student posts. We compared it against another DialoGPT model that we fine-tuned with Facebook’s Empathetic dataset [88]. Based on extensive evaluations using state-of-the-art metrics and human evaluations, we found that the responses from the model fine-tuned on our peer dataset meet the objectives of being context-aware, emotional, empathetic, and terminal. Critically, the responses were non-intrusive, minimizing potential privacy concerns, unlike those from the model fine-tuned on Facebook’s dataset, which seeks to explore by asking questions for nearly every post. Overall, evaluations found the responses from the model fine-tuned on our peer dataset to be acceptable and cognitively and emotionally aware.

We believe that our study closes a critical gap in the usage of mental wellness apps today. Apps that ask more questions than necessary can quickly be labeled as problematic from a privacy perspective, impeding their usage among the general public (including people with mood problems). As such, this issue deserves attention, and we demonstrate the power of peer-generated datasets for such scenarios. We also demonstrate how even very little peer data can go a long way in fine-tuning models specific to handling data from peers (in this case, college students).

Our future work in this space is multi-faceted. First and foremost, we want to increase the scale of our study in terms of the number of posts processed, the number of subjects recruited to provide responses, and the number of independent evaluators, consisting of peers and students. We are confident that the results will remain consistent at increased scale. Secondly, with the advent of ChatGPT (footnote 11) and other tools for automated conversation, it will be a fascinating exercise to see the degree of exploration (in other words, intrusiveness) that ensues before a meaningful empathetic response is given for a concerning student post. Beyond these, our team wishes to formalize the process elaborated on in this article. Specifically, we aim to design formalized mechanisms with which students (across diverse backgrounds but still relevant to college-related scenarios, such as the year of study, major, gender, race, sexual identity, and more) can be recruited to provide responses to posts for which counselors need responses. We are still determining how to compensate students for their time, but one option could be to tie it to course credit (ethics, for example, or another related course). Eventually, such data can be used to fine-tune models, and finer-grained responses can be enabled with today’s state of the art, significantly improving campus mental wellness with the benefit of apps and automation. We are also expanding the scope of the study to other vulnerable populations (for example, veterans, law enforcement officers, domestic violence victims, and older adults), where we believe that the power of peer-supported models will bring critical benefits in terms of improved empathy, context-awareness, less intrusiveness, and faster training. These avenues of activity are part of our future work in this space.

Footnote 11: https://chat.openai.com


APPENDICES

A PARL.AI FORMAT

An example of the PD format for a two-turn single episode is as follows:

text: labels:
text: labels: episode_done:True
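The layout above can be produced programmatically. The Python sketch below serializes post-response pairs into tab-separated `text:` / `labels:` lines and marks the last turn of an episode with `episode_done:True`. The helper name `to_parlai_line` is hypothetical, and treating each student post as its own single-turn episode is an assumption of this sketch rather than a detail stated in the article; the sample pair is taken from Table 6.

```python
def to_parlai_line(post: str, response: str, episode_done: bool = True) -> str:
    """Serialize one post-response turn as a single tab-separated line."""

    def clean(value: str) -> str:
        # Tabs separate fields and newlines separate turns, so collapse
        # all whitespace inside a field into single spaces.
        return " ".join(value.split())

    fields = [f"text:{clean(post)}", f"labels:{clean(response)}"]
    if episode_done:
        # Marks the final turn of an episode.
        fields.append("episode_done:True")
    return "\t".join(fields)


# One pair from Table 6, written as a single-turn episode.
pairs = [
    ("I am ready to start the day.",
     "That is great, start early. Do chores that you love."),
]

lines = [to_parlai_line(post, response) for post, response in pairs]
for line in lines:
    print(line)
```

For a multi-turn episode, every turn except the last would be written with `episode_done=False`, which matches the two-turn example above where only the second line carries the flag.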

An episode is an ordered dialog within the same context in which the participants take alternating turns. The text field holds one person’s utterance, which either starts the conversation or replies to a prior response. To continue the conversation, one person responds to the other’s text, and this process repeats. Once the conversation is over, the episode is considered finished. A dataset can contain multiple episodes, with or without the same context and with the same or different people. However, the PD format assumes that every episode is different and totally dependent on the model or task where it is being used.

To make sure we were provided with terminal responses, we asked our student collaborators to follow these instructions:

(1) Each file contains the students’ posts.
(2) The Peers have to give a response in the adjacent column such that:
    (a) The response must be a genuine response.
    (b) The response must be definitive, meaning it should not ask questions or request any other information from the student in return.
    (c) The responses must not be repeated, though you can slightly paraphrase. The idea of the response is to provide comfort to the student, either expressing something you feel, have experienced, or believe can be helpful.
    (d) The tone of the response must not be robotic; it should be like a friend, a therapist, or a well-wisher.
(3) You can use these for reference:
    (a) Empathize: Put yourself into others’ shoes and grasp the situations or feelings with understanding and acceptance (“feeling in”). For example, I know how that professor grades, he has very strict policies. He doesn’t listen to anyone even if it’s correct. [like a classmate]
    (b) Sympathize: Understand from your perspective and show pity, sorrow, concern (“feeling with”).
    (c) Empathize + Sympathize: [like a friend, therapist, well-wisher]

B FINE TUNING DIALOGPT

Table 8 provides the parameters used in fine-tuning the DialoGPT model for P_ED data. Similar parameters were used for fine-tuning the model with F_ED.

Table 8. Fine-tuning Parameters in ParlAI

| Parameter Name | Value | Description |
|---|---|---|
| -m | hugging_face/dialogpt | Model architecture |
| --add-special-tokens | True | Adds special tokens to the input |
| --delimiter | ‘’ | Specifies the delimiter used in input data |
| --add-start-token | True | Adds a start token to the input |
| --gpt2-size | small | Size of the DialoGPT model |
| --task | fromfile:parlaiformat | Specifies the task and data format |
| --fromfile-datapath | ourData | Path to input data |
| --fromfile-datatype-extension | true | Specifies the data type extension |
| --batchsize | 8 | Number of examples per batch |
| -mf | dialogpt_ours_ft | Path to save the fine-tuned model |
| --dict-tokenizer | bpe | Tokenizer used to preprocess the input |
| --dict-lower | true | Lowercase the input |
| --image-mode | None | Specifies how to handle images |
| --eval-batchsize | 10 | Number of examples per batch during evaluation |
| --warmup_updates | 100 | Number of updates for warmup |
| --lr-scheduler-patience | 0 | Number of epochs to wait before reducing learning rate |
| --lr-scheduler-decay | 0.4 | Learning rate decay rate |
| --lr | 5e-05 | Learning rate |
| --history-size | 20 | Number of previous turns to include in input |
| --label-truncate | 72 | Maximum length of output |
| --num-epochs | 100.0 | Number of epochs to train |
| --max_train_time | 3600 | Maximum training time |
| --validation-metric | loss | Metric used for early stopping |
| -veps | 10 | Early stopping threshold |
| --save-after-valid | True | Save the model after validation |
| --log_every_n_secs | 30 | Logging interval |
| --fp16 | False | Whether to use mixed precision training |
| --optimizer | adamax | Optimizer used during training |
| --dict-endtoken | __start__ | Specifies the end token for the dictionary |
| --skip-generation | true | Whether to skip generation during training |
| --dynamic-batching | full | Specifies the type of dynamic batching used |

C INSTRUCTIONS FOR HUMAN EVALUATION

(1) Acceptability Score: Your overall impression of whether or not the response is acceptable for the post. In other words, if you posted the message in Column A, would you consider the Response in Column B acceptable overall?
    • Give a score of 0 if you feel the response is generally unacceptable (using your own intuition).
    • Give a score of 1 if you feel the response is generally acceptable to you (using your own intuition).
(2) Cognition Score: Assesses context similarity between post and response.
    • Give a score of 0 if you feel the response has not interpreted the post’s context.
    • Give a score of 1 if you feel the response has correctly interpreted the post’s context.
(3) Emotion Score: Assesses correctness of emotion match between post and response.
    • Give a score of 0 if you feel the response does not exhibit the correct emotion you expect for the post.
    • Give a score of 1 if you feel the response exhibits the correct emotion you expect for the post.
(4) Intrusiveness Score:
    • Give a score of 0 if you feel the response is not intrusive or not appearing to pry for information from you.
    • Give a score of 1 if you feel the response appears to engage in exploratory question(s) that appear to pry for information about the subject or, in other words, appears to you to be asking for information that you are not sure you want to provide for one or more reasons (e.g., to protect your privacy or thinking that third parties can abuse information provided, etc.). Essentially, does the information solicited in questions appear intrusive to you? If so, give a score of 1.
(5) Emotional Overload: Recall the Emotion Score you gave. Here, we ask whether there is an overload of emotion as perceived by you.
    • Give a score of 0 if you feel that the response exhibits an expected degree of emotion/sentiment for the given post. A response devoid of discernible emotion also falls in this category.
    • Give a score of 1 if you feel that the response exhibits a more than necessary degree of emotion for the given post. For instance, a response that appears excessively sweet or nice will receive a score of 1.

REFERENCES

[1] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72. https://aclanthology.org/W05-0909
[2] Chyrell Bellamy, Timothy Schmutte, and Larry Davidson. 2017. An update on the growing evidence base for peer support. Mental Health and Social Inclusion (2017).
[3] Jackylyn Beredo, Carlo Migel Bautista, Macario Cordel, and Ethel Ong. 2021. Generating empathetic responses with a pre-trained conversational model. In International Conference on Text, Speech, and Dialogue. Springer, 147–158.
[4] Samantha L. Bernecker, Joseph Jay Williams, Norian A. Caporale-Berkowitz, Akash R. Wasil, and Michael J. Constantino. 2020. Nonprofessional peer support to improve mental health: Randomized trial of a scalable web-based peer counseling course. Journal of Medical Internet Research 22, 9 (2020), e17164.
[5] Kathryn Birnie, Michael Speca, and Linda E. Carlson. 2010. Exploring self-compassion and empathy in the context of mindfulness-based stress reduction (MBSR). Stress and Health 26, 5 (2010), 359–371.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
[7] David D. Burns and Susan Nolen-Hoeksema. 1992. Therapeutic empathy and recovery from depression in cognitive-behavioral therapy: A structural equation model. Journal of Consulting and Clinical Psychology 60, 3 (1992), 441.
[8] Shuai Cao, Yuxiang Jia, Changyong Niu, Hongying Zan, Yutuan Ma, and Shuo Xu. 2022. Generating emotional responses with DialoGPT-based multi-task learning. In Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24–25, 2022, Proceedings, Part I. Springer, 485–496.
[9] Rakesh K. Chadda. 2018. Youth & mental health: Challenges ahead. The Indian Journal of Medical Research 148, 4 (2018), 359.
[10] Iti Chaturvedi, Erik Cambria, Roy E. Welsch, and Francisco Herrera. 2018. Distinguishing between facts and opinions for sentiment analysis: Survey and challenges. Information Fusion 44 (2018), 65–77.
[11] Matthew Chinman, Preethy George, Richard H. Dougherty, Allen S. Daniels, Sushmita Shoma Ghose, Anita Swift, and Miriam E. Delphin-Rittmon. 2014. Peer support services for individuals with serious mental illnesses: Assessing the evidence. Psychiatric Services 65, 4 (2014), 429–441.


[12] M. Dolores Cimini and Estela M. Rivero. 2018. Promoting Behavioral Health and Reducing Risk Among College Students: A Comprehensive Approach. Routledge.
[13] Amanda C. Collins, Damien Lekkas, Matthew David Nemesure, Tess Z. Griffin, George Price, Arvind Pillai, Subigya Nepal, Michael V. Heinz, Andrew T. Campbell, and Nicholas C. Jacobson. 2023. Semantic signals in self-reference: The detection and prediction of depressive symptoms from the daily diary entries of a sample with major depressive disorder. (2023).
[14] Pamela Y. Collins, Vikram Patel, Sarah S. Joestl, Dana March, Thomas R. Insel, Abdallah S. Daar, Isabel A. Bordin, E. Jane Costello, Maureen Durkin, Christopher Fairburn, et al. 2011. Grand challenges in global mental health. Nature 475, 7354 (2011), 27–30.
[15] Benjamin L. Cook, Ana M. Progovac, Pei Chen, Brian Mullin, Sherry Hou, and Enrique Baca-Garcia. 2016. Novel use of natural language processing (NLP) to predict suicidal ideation and psychiatric symptoms in a text-based mental health intervention in Madrid. Computational and Mathematical Methods in Medicine 2016 (2016).
[16] North South Ministerial Council. 2015. 2015 Annual report. (2015).
[17] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. arXiv preprint arXiv:1306.6078 (2013).
[18] Avisha Das, Salih Selek, Alia R. Warner, Xu Zuo, Yan Hu, Vipina Kuttichi Keloth, Jianfu Li, W. Jim Zheng, and Hua Xu. 2022. Conversational bots for psychotherapy: A study of generative Transformer models using domain-specific dialogues. In Proceedings of the 21st Workshop on Biomedical Language Processing. Association for Computational Linguistics, Dublin, Ireland, 285–297. https://doi.org/10.18653/v1/2022.bionlp-1.27
[19] Larry Davidson, Kimberly Guy, et al. 2012. Peer support among persons with severe mental illnesses: A review of evidence and experience. World Psychiatry 11, 2 (2012), 123–128.
[20] Izaak Dekker, Elisabeth M. De Jong, Michaéla C. Schippers, De Bruijn-Smolders, Andreas Alexiou, Bas Giesbers, et al. 2020. Optimizing students’ mental health and academic performance: AI-enhanced life crafting. Frontiers in Psychology 11 (2020), 1063.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[22] Michele Di Capua, Emanuel Di Nardo, and Alfredo Petrosino. 2016. Unsupervised cyber bullying detection in social networks. In 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 432–437.
[23] Isabel Dias, Ricardo Rei, Patrícia Pereira, and Luisa Coheur. 2022. Towards a sentiment-aware conversational agent. In Proceedings of the 22nd ACM International Conference on Intelligent Virtual Agents. 1–3.
[24] Evandro J. S. Diniz, José E. Fontenele, Adonias C. de Oliveira, Victor H. Bastos, Silmar Teixeira, Ricardo L. Rabêlo, Dario B. Calçada, Renato M. Dos Santos, Ana K. de Oliveira, and Ariel S. Teles. 2022. Boamente: A natural language processing-based digital phenotyping tool for smart monitoring of suicidal ideation. In Healthcare, Vol. 10. MDPI, 698.
[25] Louise A. Douce and Richard P. Keeling. 2014. A Strategic Primer on College Student Mental Health. American Council on Education.
[26] Wanyu Du and Yangfeng Ji. 2021. SideControl: Controlled open-domain dialogue generation via additive side networks. arXiv preprint arXiv:2109.01958 (2021).
[27] Nicole Eling, Hanna Krasnova, Thomas Widjaja, and Peter Buxmann. 2013. Will you accept an app? Empirical investigation of the decisional calculus behind the adoption of applications on Facebook. (2013).
[28] Asra Fatima, Ying Li, Thomas Trenholm Hills, and Massimo Stella. 2021. Dasentimental: Detecting depression, anxiety, and stress in texts via emotional recall, cognitive networks, and machine learning. Big Data and Cognitive Computing 5, 4 (2021), 77.
[29] Dana Ferris. 2003. Treatment of error in second language student writing. CATESOL Journal 15 (2003), 183.
[30] Neda Firoz, Olga Grigorievna Beresteneva, Aksyonov Sergey Vladimirovich, Mohammad Sadman Tahsin, and Faiza Tafannum. 2023. Automated text-based depression detection using hybrid ConvLSTM and Bi-LSTM model. In 2023 3rd International Conference on Artificial Intelligence and Smart Energy (ICAIS). IEEE, 734–740.
[31] Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health 4, 2 (2017), e7785.
[32] Asbjørn Følstad, Marita Skjuve, and Petter Bae Brandtzaeg. 2018. Different chatbots for different purposes: Towards a typology of chatbots to understand interaction design. In International Conference on Internet Science. Springer, 145–156.
[33] Daniela C. Fuhr, Tatiana Taylor Salisbury, Mary J. De Silva, Najia Atif, Nadja van Ginneken, Atif Rahman, and Vikram Patel. 2014. Effectiveness of peer-delivered interventions for severe mental illness and depression on clinical and psychosocial outcomes: A systematic review and meta-analysis. Social Psychiatry and Psychiatric Epidemiology 49, 11 (2014), 1691–1702.

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 14. Publication date: December 2023.









Received 15 April 2023; accepted 25 April 2023


“What’s the Point of Having This Conversation?”: From a Telephone Crisis Helpline in Bangladesh to the Decolonization of Mental Health Services

ANANYA BHATTACHARJEE, Computer Science, University of Toronto, Canada
SHARIFA SULTANA, Information Science, Cornell University, USA
MOHAMMAD RUHUL AMIN, Computer and Information Science, Fordham University, USA
YESHIM IQBAL, New York University, USA
SYED ISHTIAQUE AHMED, Computer Science, University of Toronto, Canada

Most of the HCI work on mental health is based on the Western metaphysical definition of mind, which is less applicable outside the West. This article focuses on this issue and critically examines “Kaan Pete Roi” (KPR), a suicide prevention and emotional support helpline in Bangladesh, through an interview study with 20 participants. We find that KPR’s service, grounded in the “befriending” model (which originated in the UK and emphasizes non-judgmental active listening without offering direct advice), often struggles to ensure callers’ safety, provide long-term support, and protect volunteers from harassment and distress. We argue that such failures are often rooted in some foundational ideas of the UK-born “befriending” model that underpins the service. Building on Enrique Dussel’s decolonial philosophy, we argue that the “befriending” model and its underpinning Western metaphysical ideation of mind carry a colonial impulse, and we discuss how community-based approaches may better address mental health problems in the Global South.

CCS Concepts: • Human-centered computing → Human computer interaction (HCI);

Additional Key Words and Phrases: Crisis helpline, suicide prevention, befriending, volunteer, mental health, Bangladesh

ACM Reference format:
Ananya Bhattacharjee, Sharifa Sultana, Mohammad Ruhul Amin, Yeshim Iqbal, and Syed Ishtiaque Ahmed. 2023. “What’s the Point of Having This Conversation?”: From a Telephone Crisis Helpline in Bangladesh to the Decolonization of Mental Health Services. ACM J. Comput. Sustain. Soc. 1, 2, Article 15 (December 2023), 29 pages. https://doi.org/10.1145/3616381

Authors’ addresses: A. Bhattacharjee and S. I. Ahmed, Computer Science, University of Toronto, Toronto, Canada; e-mails: {ananya, ishtiaque}@cs.toronto.edu; S. Sultana, Information Science, Cornell University, Ithaca, New York, USA; e-mail: [email protected]; M. R. Amin, Computer and Information Science, Fordham University, New York, United States; e-mail: [email protected]; Y. Iqbal, New York University, New York, United States; e-mail: [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2834-5533/2023/12-ART15 $15.00
https://doi.org/10.1145/3616381

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 15. Publication date: December 2023.


1 INTRODUCTION

The telephone crisis helpline, a type of technology-mediated mental health support (TMMHS) [102] system, is widely used across the world to support individuals with suicidal ideation and emotional distress [12, 32, 64, 70, 107, 109]. These helpline services are often based on a globally popular strategy called “befriending,” where the mental health support-provider listens to the support-seeker over a private phone call, builds empathy in a friendly manner, and helps them manage emotional distress [126]. This model of mental health support gained popularity as an initial point of access to formal mental healthcare in places with resource constraints [25]. Consequently, many countries in the Global South have set up similar telephone crisis helplines to offer support to vulnerable individuals [1, 13, 20, 21, 28, 31, 58, 72, 85, 126]. However, several studies have questioned the effectiveness of these services in preventing suicide and other types of distress [57, 75].

The theories and operations of telephone crisis helplines and other TMMHS have recently received significant attention from researchers working on mental health, social computing, and human-computer interaction (HCI) [60, 85, 86, 102, 105]. Studies conducted in the Western world have shed light on how such telephone-mediated mental health support was shaped by technological advancements over time and how these contributed to taking on new challenges regarding the suicidal tendencies of support-seekers [33, 85, 86]. Researchers working on TMMHS in the Global South, in turn, have looked into the human infrastructure and the labor distribution in crisis helplines, technical challenges, and callers’ perspectives [60, 92, 95, 102, 105]. Their findings have revealed several challenges that differ from those in the West, including (a) widespread stigma around mental health problems [94], (b) lack of privacy in shared living conditions, which makes it difficult for individuals to make phone calls [6], (c) the resource constraints of the support centers [105], and (d) an overarching expectation that the support centers provide support in a way that is aligned with local cultural values [37, 78, 93, 101].

While some of these challenges could be addressed by developing smarter technologies and upskilling the workforce, we argue that most of them are rooted far deeper in the way this problem is defined and the service is designed. Many scientific communities have traditionally and primarily relied on the knowledge produced by Western European researchers, and have frequently sidelined the knowledge produced elsewhere [43]. Consequently, today’s mental health professionals, clinicians, and crisis helplines follow the support-behavior models that originated in those Western European countries [84]. These models received “universal” status on the assumption that the same approach to giving support would apply to everyone around the world. Argentine-Mexican philosopher Enrique Dussel opposed this universality of European knowledge and proposed a “transmodern world,” where each field of knowledge would enrich itself by acknowledging different cultures and solutions, and would reject the notion of a universal solution [43, 44]. Motivated by this idea of transmodernism, we critically analyze the befriending model.

We situate our study in Bangladesh, where all the authors of this article were born and raised. Mental health problems are generally stigmatized in Bangladesh, and the few relevant studies report that most people hide their mental health issues both within and outside their families because of this widespread stigma [53, 56]. Further, people usually hesitate to seek formal and institutional support and often go to shamans, Kabiraz, and witches to seek help [121]. In this cultural context, Kaan Pete Roi (KPR), a Bengali term that roughly translates as “eager to listen,” started in 2013 as the first suicide prevention helpline in Bangladesh [60]. KPR uses the befriending model, and in this study, we wanted to thoroughly understand how the idea of befriending is perceived, implemented, and received in the Bangladeshi context. This motivated us to set our specific research questions:


RQ1: What challenges do the volunteers face in underpinning their actions to the theoretical assumptions and framework of befriending and providing support to suicidal ideation and emotional distress of the support-seekers? RQ2: When might the volunteers consider deflecting from traditional befriending assumptions and framework, and adapting other contextual strategies to befriend with support-seekers? To find the answers to these research questions, we conducted interviews with KPR volunteers (n = 20). These volunteers represented diverse age groups and hailed from various regions across the country. Prior to offering support to the support-seekers, they underwent comprehensive training. The volunteers informed us that support-seekers are often concerned about their privacy over their suicidal ruminations; they are reluctant to have their families any clue about their seeking support, and frequently cut calls when they sense any such risks. Thus, providing support in such cases could be challenging as the befriending policies do not allow the volunteers call the supportseekers back to continue providing support. Additionally, the policies do not permit volunteers to access support-seekers’ history. Consequently, each call is treated as a new case, even for repetitive support-seekers. Furthermore, support-seekers often subject the volunteers to harassment when the befriending techniques fail to meet their expected outcomes from the calls. Although the volunteers often wish to employ alternative strategies in some of the aforementioned situations, they are unable to do so due to the organization’s rules. This article makes three key contributions to HCI, social computing, and the literature on online mental health support. First, we present a detailed description of the working model of Kaan Pete Roi to demonstrate how the idea of befriending is implemented in a resource-constrained context through various sociotechnical arrangements. 
Second, we report how their service often falls short in ensuring the callers’ safety, providing callers with long-term support, and protecting the volunteers from harassment. Here, we argue that such failures are rooted in the cultural differences in perceiving the idea of befriending and the infrastructural constraints of Bangladesh. Finally, building on the decolonial philosophy of Enrique Dussel [44], we argue how the befriending model represents the body-mind duality of western metaphysics that is less applicable to the closely-knit communities in Bangladesh. 2 RELATED WORK We begin by giving a brief overview of the disconnect between current mental health research and the Global South. Then, we go on to explain the concept of transmodernism in the context of mental health. Finally, we elaborate upon past research on telephone crisis helplines. 2.1 Disconnect between Current Mental Health Research and the Global South In HCI and psychology, numerous strategies have been devised to foster psychological wellbeing [71, 91, 96, 120]. These strategies commonly incorporate supportive apps [18, 46, 59, 83] and social media platforms [71], many of which are grounded in established psychological principles, such as cognitive behavioral therapy (CBT) [129]. The ubiquity of mobile phones has been leveraged for the convenience they offer in delivering personalized, mental health-promoting content [16, 17]. Advanced interventions even employ sensors in smartphones and IoT devices for immediate, personalized support [23, 118]. There is also a growing body of research exploring ways to provide care for individuals with traumatic experiences, with a particular focus on improving their experiences with technology [27, 48, 89]. This body of work spans a variety of contexts, including pregnancy loss [4, 47], Post-traumatic Stress Disorder [47], refugee experiences [119], and the experiences of non-binary or transgender individuals [108, 116] among others. 
Although the importance of family and community is acknowledged in assisting individuals who have experienced
traumatic events, or even in managing general psychological well-being [91, 130], when it comes to providing support to a person who needs help, researchers predominantly prefer “experts” who are academically trained in human psychology over non-expert community members.

A critical limitation of much of the existing research lies in its primary orientation toward the context of the Global North [101]. Many of these studies predominantly utilize controlled lab experiments, which may not fully capture the complex realities of mental health support across diverse environments, cultures, and backgrounds. The generalizability of these findings to the Global South thus remains an open question [8, 15, 49, 100]. Resource constraints can pose major challenges in certain countries of the Global South, particularly for low-income individuals seeking mental healthcare [79, 101]. Conventional mental health interventions in most Global North countries depend on trained counselors, a resource that may be scarce in many regions of the Global South. Therefore, when designing interventions for the Global South, it is essential to account for potential users’ unique perspectives and circumstances, which may often diverge from those of the intervention designers.

In some cases, societal structures and mental health support behaviors in the Global South can differ from those typically seen in the Global North [14, 69, 121]. For example, people in the Indian subcontinent have historically sought support from spiritual healers for mental health problems [121]. Here, many communities associate mental health with religion, supernatural deities, and spirituality; spiritual healers hold a strong position in local societies [67]. Many of them are believed to possess divine powers, which enable them to explain any given topic, ranging from everyday rural life problems to international affairs.
Such divine powers of the healers are believed to be effective in solving many problems, including people’s mental health issues. Healers use their knowledge about local customs and ways of life to treat their clients [67]. The complex process of healing followed by healers relies significantly on informal conversations with the individual seeking treatment [121]. The discussions may cover one’s history of illness as well as their extended family, source of income, and social life. Informal conversations help the healers gain deep insights into one’s problem and solidify trust and respect between them. Additionally, knowledge of local culture and traditions shared with the support seekers enables the healers to form explanations of mental illnesses that are easily understood by their clients. Such practices of mental healthcare, however effective they may be, are extremely different from Europe-born practices like the befriending model (described in Section 2.3).

While differences in mental health support practices between the Global North and South can be noticeable, there are nonetheless instances where these practices converge and exhibit commonalities. For instance, the storytelling aspect, though it may manifest differently in varying contexts, remains a prevalent feature across cultures [5, 17, 121]. In the Global North, storytelling may take the form of a patient sharing their emotional journey with a therapist or over a mental health helpline, fostering empathy and understanding [90, 128]. Meanwhile, in the Global South, traditional healers might rely on a person’s narrative to understand their distress and devise a treatment plan, which emphasizes the value of personal experiences and their social contexts in the healing process [67, 121].

There is increasing recognition in both the Global North and South of the importance of broader societal context in mental health.
This perspective is reflected in research exploring the influence of social determinants such as socioeconomic status, education, and neighborhood characteristics on mental health outcomes [3, 14]. For instance, studies on migrant youth’s mental health have emphasized the crucial role of socio-ecological circumstances [123, 124], highlighting the intersection of these social determinants with individual mental health experiences [27]. However, despite this trend toward more context-sensitive approaches to mental health support, most existing interventions continue to focus primarily on the individual level, without adequately addressing the
larger societal factors that may contribute to or exacerbate mental health issues [3, 14]. Hence, mental health interventions developed in a different setting risk being socially and culturally alienating when adopted elsewhere.

2.2 Understanding Transmodernism in the Context of Mental Health

Cultural conceptions of health and illness can often vary, influenced by unique historical and philosophical traditions. The Western psychological concept, largely influenced by the Cartesian doctrine of body-mind duality, often perceives body and mind as separate entities, giving rise to distinct classifications for physical and mental illnesses [24, 42, 50]. Conversely, non-Western traditions, like those in many Global South societies, often espouse a more holistic approach, recognizing a harmonious interconnection between body and mind [35, 50, 114]. This dichotomy extends into mental health support strategies: in the Global North, individual-based therapies such as CBT predominate, focusing largely on mental processes isolated from physical health [50, 129]. However, in the Global South, there is often a strong emphasis on community involvement in mental wellbeing, reflecting the holistic understanding of health [35, 114]. Indigenous healing practices, seen in some African or Asian communities, underscore this approach, emphasizing collective wellbeing and the interconnectedness of body and mind [29, 55]. Moreover, there are differences in the conceptualization of support and solutions. While the Global North typically gravitates toward solution-oriented, goal-based therapies, mental health support in the Global South can often be more communal, ongoing, and integrated with religious and ethical components [50].

Despite the rich diversity in cultural understanding and practices related to mental health and illness, non-Western perspectives are often underrepresented in contemporary psychology literature.
This situation can be traced back to historical shifts in power and knowledge around the 16th Century, when numerous valuable knowledge traditions outside of Western Europe were lost due to devastating events, such as genocides and epistemicides [54]. This marked a period where knowledge produced outside of Western European contexts began to be viewed as “inferior,” leading to its eventual exclusion from mainstream psychology and health discourse. Since the late 18th Century, Western European male thinkers have largely dominated the field of knowledge production across various disciplines. As a result, there can be a presumption that theories and models developed within these Western contexts are universally applicable, neglecting the specific cultural and contextual nuances of different societies around the world. A case in point is the wide implementation of the befriending model across diverse cultural contexts, often with minimal adaptation. This model, developed from the experiences of individuals in a particular country and time, might not be fully congruent with the experiences and needs of people in other cultural contexts. This is not to suggest that this outcome was an intentional design decision in the development of the model; rather, it might be an unintended consequence of insufficient cross-cultural consideration.

Theorists working at the intersection of mental health and decoloniality have called for the decolonization of mental health infrastructure in the Global South. One prime example is Lambo’s transformative work on the Nigerian healthcare system in the 1950s and ’60s [29, 55]. Recognizing the importance of local contexts and knowledge, Lambo initiated an outpatient experience for mental healthcare involving hospital staff and community members [55]. His approach, a stark contrast to the European model of segregating mental illness patients, fostered a holistic, community-based therapeutic experience [55].
Building on this decolonizing effort, more recent works like those of Mills [84] have critiqued the neurobiological approaches that attribute mental distress solely to neurological dysfunctions, largely ignoring the socio-environmental factors. Drawing from her research in India, Mills highlighted that the root causes of mental distress, like financial instability or
technology adaptation failures, cannot be resolved merely through medication. She advocates integrating indigenous and local healing knowledge with traditional psychiatric procedures, where no single approach assumes dominance. This perspective resonates with Pendse et al.’s recommendation to center individuals’ lived experience over rigid classifications in digital mental health technologies [104].

We build on Enrique Dussel’s concept of a “transmodern world” that challenges the assumptions of Eurocentric knowledge and asks for the revival of traditions and knowledge that were partly destroyed due to genocides and epistemicides [43]. Dussel proposes a decolonial project that does not ignore the traditions of the Global South, but rather extends knowledge in different fields by critically analyzing different traditions. In this way, transmodernism rejects the existence of a ‘universal’ solution, and acknowledges that in different cultures, similar problems will have different solutions. Postcolonial computing, as discussed by Irani et al. [62], similarly advocates for design that acknowledges complex socio-historical contexts and challenges the imposition of Western-centric ideals in technology design. They argue for critical reflection on design practices and for looking beyond universalist design narratives by acknowledging intricate local contexts and heterogeneities.

HCI researchers’ stance on decolonization reflects these perspectives. While discussing the process of decolonization in HCI, Lazem et al. [73] advocated for recognizing the historical specificities of local contexts and adopting cultural sensitivities. Drawing from interaction with ArabHCI [74] and AfriCHI [19], they specified how decolonization is understood as a way to decenter colonial legacies in HCI literature, center local voices, and acknowledge local contexts in technology design.
This framing of decolonization questions the validity of Western ideas and consciously attempts to incorporate indigenous customs and ways of living [74].

2.3 Telephone Crisis Helplines

Telephone crisis helplines have been around since the 1950s, providing a range of mental health services including emotional support and suicide prevention [77]. They were created out of an attempt to shift the power dynamics of the psychodynamic encounter [131]. In usual face-to-face encounters, counselors establish themselves in a position where they can assert technical power and control the therapeutic relationship [131]. Differences between clients’ and counselors’ social status and the clients’ inability to remain anonymous or terminate the relationship may remind clients of their inferior position and produce anxiety and humiliation [76]. Telephone helplines, however, give much more control to the callers; callers can call from anywhere, remain anonymous, and terminate the conversation at any point. This equalization of power and control is believed to have effectively drawn clients to seek counselors’ support over phone calls, making crisis helplines an established and accepted medium of support [64]. These services are accessible at various points along the pathway to suicidal behavior [63].

Most of today’s crisis helplines function in the following way: a caller, who can choose to be anonymous, calls at any time of the day (or during the period when the crisis helpline takes calls), and converses with a professional counselor or a trained volunteer who helps the caller to manage their crisis situation [30, 80]. The services are usually not ongoing and recurrent, i.e., counselors do not usually follow up with the callers and treat each call as a new conversation [30].
Helpers either follow a more “direct” approach with advice-giving and making clear recommendations [36], or an “indirect” approach that includes empathetic listening and allowing the callers to figure out solutions themselves (often termed “Rogerian Active Listening”) [112]. Mishara and Daigle [86] found that the latter approach was more effective in helping people manage depression and in enabling helpers to set up a contract with the callers (i.e., the caller promises that they will not attempt suicide or will engage with follow-up activities). However, they also suggest that repeat callers
can benefit from a more directive approach [86]. In fact, the Rogerian approaches can also have some components of direct questioning, particularly when it is obvious that callers are at very high risk.

Related to this Rogerian approach is the “befriending” model. The history of the befriending model dates back to 1953, when Chad Varah, a Samaritan minister and counselor, set up a counseling service to help lonely and distressed individuals [126]. Since then, this model has gained significant attention as a low-cost way to support individuals struggling with loneliness and lack of peer support [81, 98]. The principles and practices of the volunteers and helplines were determined by the charity organization “Befrienders Worldwide” in 1981 [38]. Here the helpers provide support through listening, empathizing, and acceptance [127]. The relationship between the support seeker and the listener is more informal and takes time to develop [88]. Through a conversation with a befriender, an individual who lacks friends or social support finds someone who is empathetic toward them and with whom they can share their problems. Moreover, the informal nature of the conversation can aid in building a better ‘therapeutic alliance’, the relationship between a counselor and a help-seeker [87]. Prior research has shown that being informal and friendly is a core quality of a therapeutic relationship and is highly appreciated by people seeking mental health support [106]. Additionally, an informal relationship (and the fact that callers can choose to be anonymous) can encourage callers to be more open and spontaneous [7], resulting in helpful conversations for the callers.

Although the term “befriending” has “friend” in it, the relationship is relatively one-sided, because the helper is supposed to listen to people and refrain from judging, advising, and giving personal opinions. A befriender should be available at any time to provide support.
Typically, the befriending relationship involves two or more persons, and is monitored by an organization that clearly identifies the individuals likely to benefit [38]. An organization that provides support following the befriending model is supposed to carefully select volunteers (known as befrienders) and train them. Even if a particular caller admits that they want to kill themselves or have harmed others, a befriender cannot express their own feelings in the conversation and should try to treat the caller as any other human being. Unlike some helplines in the USA, which hold that any possibility of saving a person’s life should be pursued if the person shows any indication of self-harm, the befriending model allows help-seekers to act as they wish [127]. In other words, the volunteers offer “disinterested help” to those who ask for help, while leaving the caller “in their own destiny” [127]. The purpose of the relationship is to provide emotional support and companionship, rather than to solve any particular problem or address any underlying issues. This is seen as a way to create a safe and trusting space for the person seeking help to talk openly and honestly about their feelings, without fear of judgment or criticism.

Based on the seven practices of a befriender [127], callers can receive long-term support from volunteers when deemed appropriate, with some helplines setting limits on the duration of such support. In the Global North, helplines that follow the befriending model and have more resources (e.g., Boston Samaritans) have guidelines for providing support over extended periods, sometimes spanning years [22, 40]. However, this does not necessarily mean that the same volunteer must provide support to a specific caller during subsequent calls. Rather, any volunteer who answers the call can support the caller, even if they spoke to someone else earlier that day or the day before.
To avoid callers having to repeat themselves, these helplines use databases to enable volunteers to pick up from where the previous conversation ended. In our research, we seek to gain a deeper understanding of the perception, implementation, and reception of the befriending model within the cultural context of Bangladesh. Specifically, we aim to achieve this by analyzing the operations and effectiveness of KPR, a major suicide prevention helpline in Bangladesh that utilizes the befriending model.
3 TELEPHONE CRISIS HELPLINES IN BANGLADESH AND KPR

Unlike in the Global North, telephone crisis helplines remain a scarce resource in developing countries like Bangladesh. KPR is one such helpline; it was established in 2013 by a Bangladeshi graduate of Cornell University, USA [113]. The organization has dedicated management staff for coordinating volunteers, the helpline, public relations and communication, and program outreach. The only physical office of the organization is in Dhaka, Bangladesh. Its primary agenda is to provide emotional support to suicide-prone individuals by answering their calls and befriending them. After KPR, several other helpline services were also launched in Bangladesh, including Sajida Foundation [51], Moner Bandhu, Dosh Unisher Mor, and Bandhu Social Welfare Society. While some studies have tried to identify the characteristics of Bangladeshi callers and their problems [60, 61], little work has explored the challenges faced by these organizations and their staff, or identified the gap between theories of providing support and the actual practice in the context of Bangladesh. In our work, we address this gap in the literature by investigating our research questions with KPR.

3.1 Volunteer Recruitment and Training

3.1.1 Qualifications. The volunteers of KPR must be over 18 years old and have at least a high school degree. However, they do not need to have any academic background in psychology or related fields. KPR expects its volunteers to be open-minded, efficient in communication, and able to maintain the confidentiality of the clients, given the sensitive nature of the job.

3.1.2 Process of Recruitment. KPR recruits volunteers every three to six months. They circulate the job opening and the application form, a Google Form, across various social media platforms. After scrutinizing the responses to the form, KPR invites some short-listed candidates to a 30–60 min long interview session.
The candidates answer questions about their approach toward managing their own mental health and describe how they interact with people who follow different ideologies or lifestyles (e.g., their attitude toward the LGBTQ+ community). The candidates who pass this interview are invited to attend two day-long training sessions in the subsequent stage. If their performance during the training sessions is satisfactory, then they have to attend several role-play sessions. Once the training personnel approve a new volunteer’s performance, the volunteer begins taking actual calls. However, the time span between attending the training sessions and starting to take actual calls varies a lot; while some volunteers achieve the expected level of performance in two weeks, others may need to attend role-play sessions for 3–4 months.

3.1.3 Training Process. The aforementioned two-day training sessions are divided into several modules, which cover multiple topics including mental health and suicide prevention in general, suicidal risk assessment, introduction to the befriending model, and several sensitive or taboo topics in the context of Bangladesh (e.g., homosexuality, abortion). These sessions have several components, including Q&A, role-play sessions, and tests to evaluate volunteers’ understanding. In the role-play sessions, the trainees act as the volunteer support providers in calls from KPR training personnel who act as support seekers. Recruitment personnel keenly observe volunteers’ performance in these different sessions, and based on different criteria, they decide whether a certain individual is prepared to be a befriender or not.

3.2 Administrative Duties of Volunteers

Volunteers work their duty hours in person at the office in Dhaka, in shifts, although they temporarily worked from home during the first phase of the COVID-19 pandemic. They attend at least one 3-h shift per week.
The volunteers have to record the summary of their attended
calls. They use a shared Google sheet for documenting the conversation in detail along with the time and nature of the conversation and callers’ suicidal risk assessment. The KPR authorities frequently guide this process and suggest that the volunteers note whether the callers’ mood shifted by the end of the call, or whether they thanked the volunteer and said they would call again. Volunteers are not provided any monetary compensation for their service.

3.3 Suicide Prevention through Befriending

At the time of the study, KPR had five mobile phone numbers for providing their service, and the helplines were open from 3 p.m. to 3 a.m. every day. Incoming calls are directed to the available volunteers. Lately they have also started a support line on WhatsApp. Below, we describe the anatomy of a phone call:
• Greetings: Once the volunteer and the caller are connected, they exchange greetings and their names. However, volunteers are instructed to share only their first names and not any other information about themselves or the organization. The callers, in contrast, are free to provide as much information as they want or to maintain anonymity.
• Befriending: After that, the volunteers ask a question like “How are you doing?” to demonstrate their support for the caller. Once callers start sharing their problems, volunteers should listen empathetically and provide validation (e.g., commenting “That sounds really tough” or “I am glad you called”) when necessary. This way they show their empathy and care for the callers’ problems and their interest in supporting them through the conversation.
• Risk Assessment: Volunteers must conduct a suicidal risk assessment for every caller, even if the caller sounds non-suicidal. Within 2–3 min of the call, the volunteers ask callers if they are feeling suicidal. The question should be direct and simple (i.e., clearly asking “Are you feeling suicidal now?”). From this question and some follow-ups (e.g., “Have you thought of how you may do it?”), volunteers have to determine the risk level of the callers, ranging from no risk to medical emergency, with four different levels of risk in between. However, a caller may initially not be comfortable sharing their true feelings, so the volunteer might conduct the risk assessment multiple times in a conversation.
• Ending the Call: After 10 min of the conversation, volunteers decide on the call’s duration, which may range from 10 min (for a call with no suicide risk) to 30 min (for an acute-risk or medical-emergency call). They give the caller a 1–5 min warning (depending on the suicide risk) before ending the call. The callers may also end the call themselves if they feel they have already shared their problems and found a solution, or if they have found calling KPR useless. Sometimes calls also drop due to network errors, but the volunteers are not permitted to call back, because KPR believes that only callers have the right to initiate a conversation about their problems. Volunteers encourage the callers to call again if they need KPR’s support in the future.

Volunteers may adapt the call’s structure, following the KPR training manual, if the callers (a) seek sexual arousal, (b) make manipulative attempts to push the volunteers past the bounds of befriending, or (c) make abusive comments. Volunteers immediately terminate calls that involve sexual harassment or verbal abuse. However, the volunteers may continue a discussion on sex and related problems of the caller if they feel comfortable and think they can help the caller resolve it. In such cases, they still befriend the callers and assess their suicidal risk.

4 METHOD AND DATA COLLECTION

We conducted 20 semi-structured interviews with 20 volunteer support-providers of KPR between December 2020 and August 2021. Among the participants, 11 were female and 9 were male; their mean age was 26.0 ± 0.83 years (see Table 1 for details). All the participants and researchers in this project were born and raised in Bangladesh and are native Bengali speakers. This shared background helped us engage with and understand the participants better. We describe our method and the process of data collection and analysis below.

Table 1. Information about the Participants

| Interviewee | Age | Gender | Educational Background | Employment |
|---|---|---|---|---|
| P1 | 35 | Female | Law, fashion design | Author |
| P2 | 24 | Female | Industrial and organizational psychology | University student |
| P3 | 24 | Male | Political science | University student |
| P4 | 26 | Female | Public health | University student |
| P5 | 28 | Male | Computer science and engineering | Unemployed |
| P6 | 29 | Female | Business administration | Management work |
| P7 | 25 | Female | Psychology | University student |
| P8 | 25 | Female | Industrial and organizational psychology | Management work |
| P9 | 25 | Female | Economics | University student |
| P10 | 35 | Male | Marketing | Entrepreneur |
| P11 | 24 | Male | English Literature | University student |
| P12 | 25 | Male | Public administration | University student |
| P13 | 25 | Male | Business administration | University student |
| P14 | 23 | Male | Business administration | University student |
| P15 | 30 | Male | Electrical and Electronics Engineering | Business person |
| P16 | 25 | Female | Psychology | University student |
| P17 | 22 | Male | Film and photography, journalism | University student |
| P18 | 22 | Female | Computer science and engineering | Business person |
| P19 | 25 | Female | Medicine and surgery | University student |
| P20 | 23 | Female | Business administration | University student |

4.1 Interviews

To recruit the participants for interviews, we initially consulted with the founder of the helpline and its management committee, conveying our research goals to them. They assisted in shaping the study design based on their experience managing the helpline. We requested them to connect us with the volunteers of their organization, and they helped circulate our invitation to join the interviews among the volunteers through emails. We made clear in our call for interviews that participation was completely voluntary. Twenty volunteers responded and showed interest in attending the interview sessions. We explained the purpose of this research to them again and scheduled the one-on-one interviews based on their convenience. The first author, who conducted the interviews, was in North America at the time of the data collection and conducted the interviews over the Zoom teleconferencing platform.

Upon starting the interview sessions, we thanked the participants for volunteering to help in this research and assured them that their participation would remain anonymous. Then, we sought their oral consent to proceed with the interview questions. All interview sessions were audio-recorded with the permission of the participants. We asked the participants about their experiences of working with the organization as a volunteer, the types of problems they often faced while dealing with callers, and the process they followed for managing callers. We also investigated how they found ways to manage their own mental health, given their job was to listen to suicidal help-seekers.
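As a quick consistency check on the participant summary in Section 4, the ages and genders listed in Table 1 can be tallied directly. The sketch below is ours and not part of the study's analysis; the variable names are illustrative. It suggests that the reported ± 0.83 corresponds to the standard error of the mean rather than the standard deviation, which is our inference from the table and not something the text states explicitly.

```python
import math
import statistics

# Ages and genders transcribed from Table 1, in order P1..P20.
ages = [35, 24, 24, 26, 28, 29, 25, 25, 25, 35,
        24, 25, 25, 23, 30, 25, 22, 22, 25, 23]
genders = ["F", "F", "M", "F", "M", "F", "F", "F", "F", "M",
           "M", "M", "M", "M", "M", "F", "M", "F", "F", "F"]

mean_age = statistics.mean(ages)
sample_sd = statistics.stdev(ages)            # sample SD (n - 1 denominator)
std_error = sample_sd / math.sqrt(len(ages))  # standard error of the mean

print(f"n = {len(ages)}, female = {genders.count('F')}, male = {genders.count('M')}")
print(f"mean = {mean_age:.1f}, SD = {sample_sd:.2f}, SEM = {std_error:.2f}")
```

Under this reading, the spread of individual ages (the sample standard deviation) is considerably larger than 0.83 years, which is consistent with the 22 to 35 age range visible in the table.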
We asked them how KPR
arranged support and training for them and how KPR handled volunteers’ complaints and recommendations about the whole setup. This discussion helped us engage with them in deeper conversations and understand the activities of the organization better. We conducted the interviews in Bengali. The sessions lasted from 30 to 60 min.

4.2 Data Analysis and Coding

The interviews generated around 12 h of recordings in total. We transferred the audio recordings to the researchers’ secured computer. After transcribing and translating the interviews, we had 55 pages of documented data. We removed all the identifiers from the transcriptions. Two team members reviewed them to get familiar with the data. Then they independently developed codes. Through an open coding process, they broke down the data into distinct codes before sharing their codes with each other. Initially, there were around twenty codes, such as regret about not calling back callers, frustration about callers’ rude behavior, skepticism about the befriending model, lack of support from the management, privacy issues, and so on. Then the researchers arranged multiple meetings to form a codebook and group the codes into broader themes through axial coding [117]. We organized our findings around the themes and present them in the next section.

4.3 Ethical Considerations

Our research on such a sensitive topic raises several ethical concerns. We anticipated that the interview questions might induce negative emotions among interviewees when they talked about their past interactions with callers or recalled a bad experience. The researchers were trained to provide an appropriate level of sympathy and support throughout the interviews. The participants also had the option to not answer any question or leave the interview at any stage if they were uncomfortable. In case interviewees expressed emergent suicidal thoughts, interviewers were also trained to conduct the Columbia-Suicide Risk Assessment protocol [97].
However, no such risks emerged during the study. To ensure the privacy and confidentiality of the callers, all researchers signed a Non-disclosure Agreement (NDA) with the helpline organization. During the interviews with volunteers, we made it explicit at the outset that they should avoid revealing any identifiable information about the callers, thereby ensuring the callers’ privacy was maintained. We submitted the research agenda, with detailed plans, interview questions, and data collection processes, to our university’s Research Ethics Board, which thoroughly examined and approved it.

5 FINDINGS

In this section, we elaborate upon the key themes that emerged from our analysis of the qualitative data.

5.1 A “Disinterested” Friendship

In our interviews, volunteers often critiqued the “disinterested” nature of the befriending model. They expressed their concerns by saying that certain components of this helper behavior model could be perceived negatively by callers and may even increase callers’ distress. We detail their opinions below: 5.1.1 “Friendship” without a Memory. All volunteers told us that they were expected to treat each call as a new conversation, and they did not check whether the caller had previously called or not. Some of them mentioned that sometimes they could recall that the caller had called before, but they did not take their past conversation into account while beginning the new conversation. Some volunteers told us that frequent callers sometimes got upset when they had to start over describing their problems each time they call. P15 said, ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 15. Publication date: December 2023.


A. Bhattacharjee et al. “Sometimes I can identify that I have talked with this caller before. The caller can also identify that. But when they refer to a past conversation saying “Remember what I said last Tuesday?”, I have to say that I don’t, although I almost always do. Sometimes people do not like this treatment. They feel that we actually do not care about whatever happens to them.” (P15, Male, 30 years old)

According to some of the volunteers at KPR, repeatedly recounting a stressful experience during short intervals of phone calls may potentially harm the mental health of the callers. Additionally, some volunteers, such as P16, noted that building a connection with the callers and gaining their trust could take time and may not be achieved in a single conversation. Therefore, having the ability to refer back to a previous conversation was deemed necessary to provide effective and personalized support. Overall, volunteers expressed concerns that the protocol of not referring to past conversations may create a sense of distance between callers and volunteers and ultimately lead to a lower quality of service. 5.1.2 One Way Calling. One important characteristic of the befriending model is that it does not allow the volunteers to call the callers back. Several volunteers raised this issue, particularly when they expressed their concerns about KPR’s policy of never calling back its callers. KPR does not allow its volunteers to make a call to a caller in any situation. In our interviews, a few volunteers talked about past calls that dropped in the middle of the conversation due to bad connectivity, but they were not allowed to call back. The caller could call back again, but then that call might be picked up by a different befriender, and the caller would need to start everything from the very beginning. Officially, volunteers are told to treat each call as a new call, but unofficially if a caller calls back within “a few minutes” and the same volunteer picks up the call, then they can start from where they left off. The definition of this “a few minutes” varied among volunteers; some said it is 1 min, some thought it is 5 min, and some went as far as 10 min. Hence, these volunteers felt the rule about strictly not calling back callers should be made a bit more flexible, and volunteers should use their judgment to decide on whether to call back or not. 
P3 gave an example of a caller who had suicidal feelings and made plans to commit suicide that night: “I received a call once, where the caller was on the terrace, and he said he was going to jump. ... As the call went on, I could feel that I was making him feel comfortable, and he was reconsidering his prior decision to kill himself. But suddenly the call dropped, and the caller never called back. I was so upset that I actually wanted to call him back. For the next few days, I regularly checked newspapers, trying to find whether there was any report of suicide that matches with the caller’s description.” (P11, Male, 24 years old)

P9 also mentioned a similar incident, and expressed that whether to call back or not should be decided by the volunteer. Even if that is not possible, she suggested that special situations may be brought to the attention of management to make exceptions to the rules. P17 and P19 expressed similar opinions, stating that special considerations should be made in case of “medical emergency” calls. They proposed that when such calls end for unexpected reasons (e.g., network problems), volunteers should try to call back once to help that individual. They acknowledged that deciding to engage with the conversation is ultimately the callers’ choice, but calling back in “medical emergency” calls can save a human life. Volunteers such as P12 and P14 advocated for making KPR’s phone numbers toll-free, meaning callers would not be charged for calling the helpline. They believed this would prevent call drops caused by zero balance on callers’ phones, and P2 suggested that it would encourage more diverse individuals to call, including those who cannot afford to talk for an extended period. Overall,

What’s the Point of Having This Conversation?


volunteers expressed a desire for a mechanism to call back and prevent losing callers mid-conversation, especially in cases where callers display signs of self-harm or suicidal ideation.

5.1.3 “Sorry, We Can’t Give You Any Solution from Our Side.” The inability to provide explicit solutions to callers was a concern shared by some participants. During their initial training, volunteers are strictly instructed not to provide any kind of solution or recommendation to the callers, but instead to guide the conversation in a way that enables the callers to find their own solutions. This is a core component of the training program, and management personnel always remind volunteers to be careful of this practice when dealing with callers. Our interviewees confirmed that volunteers take this practice seriously and do not provide any sort of explicit solution, even when the caller is explicitly asking for it. Generally, when callers ask for solutions, volunteers ask them whether they have thought of any solution themselves. Then they ask the caller to explain the pros and cons of their different solutions, or whether their problems can be addressed using a different approach they had not thought of before. However, if the caller continuously asks for advice, the volunteers are trained to emphatically communicate that volunteers are unable to provide any advice, as the life situation of an individual cannot be understood from a phone conversation. While this approach works for most of the callers, interviewees said that sometimes callers get extremely frustrated when they get no advice. P4 said, “When it is clear to the caller that we cannot give them any advice, they say, “Why are we calling you then? What’s the point of having this conversation? Why did I waste my mobile balance?” In some cases, callers start cursing too.” (P4, Female, 26 years old)

Some interviewees also questioned the generalizability of this approach of not providing solutions.
They felt that this approach makes them “too professional” (P18) and does not allow them to connect with the caller emotionally. P19 said that sometimes callers ask them questions like “Tell me one reason I won’t do suicide,” to which they cannot give a clear answer, as answering this question can be thought of as another example of providing solutions. In those cases, P19 felt that the callers start to get more distressed thinking that even volunteers are not finding any reason for them to live. Although they recognize that giving advice may be too risky in some conditions, there should be some room for judgment from volunteers to decide when to give some sort of advice. P3 said, “Most of the time when people call us they are in distress. Many of them are not finding any way to recover themselves, and they call us hoping that we will be able to help them in some way. Sometimes what happens is they continuously keep asking for solutions and I try to switch the topic of conversations. I know that the protocol says these things and these protocols are developed by knowledgeable persons, but I feel this should be reconsidered. We call ourselves friends, but friends give advice or solutions.” (P3, Male, 24 years old) P17 raised another issue saying that sometimes callers do not know of an already existing solution. For example, sometimes callers look for help with reporting a case of abuse or harassment, or even testing whether they are COVID-19 positive. He felt that many callers do not know that their problem can be easily solved by calling the national helpline or COVID-19-related helplines, but they are not even aware of the existence of such helplines. He suggested that callers should at least be informed of these helplines, and they can take their own decision to reach out to them. 
In conclusion, volunteers felt that not referring to past conversations, having one-sided control over initiating the conversation, and refusing to provide explicit solutions might lead callers to assume that the volunteer does not care about what happens in their lives.


5.2 Privacy and Stigma Volunteers informed us of several privacy and stigma-related issues from both callers’ and volunteers’ sides that affect the quality of the service. Below, we discuss these issues. 5.2.1 Privacy Concern of Callers Interrupting Conversation. Volunteers informed us that a large number of callers are concerned about their privacy, and they want to be absolutely certain that their issues will only stay between the caller and the volunteer. Some volunteers like P14 and P15 felt that callers’ privacy concerns sometimes take up a significant amount of time in the conversation, leaving little time for actually discussing their issues. Moreover, P14 expressed that many times the callers do not want their families to have any clue about their issues or the fact that they are calling to share their problems. Often they find a private place to share their problems, but whenever they sense that their privacy is compromised (e.g., they sense that family members are around), they tend to drop the call. P2 recounted an especially poignant incident where a young caller was sharing a deeply traumatic experience. The caller was in a highly distressed state due to a violent incident related to local political tensions. During this emotional call, the line abruptly disconnected, possibly due to a perceived threat to the caller’s privacy. Despite the volunteer’s strong desire to reach out and provide further support, they were unable to return the call. The experience left the volunteer with feelings of guilt and concern over the caller’s wellbeing, illustrating the deep emotional impact that these calls can have on volunteers. However, in similar situations, some callers keep the call open, but they try to hide it from the people who entered their room or private space. In these situations, volunteers either experience a large silence in the middle of the conversation or can hear the caller’s conversation with the other member. 
P14 explained their behavior in these situations saying, “Sometimes we get confused when during the middle of the conversation, they start conversing with somebody else. I understand that probably they are trying to hide that they have called to a helpline. In such situations, management has told us to wait for one minute. After that, we disconnect the call.” (P14, Male, 23 years old)

Volunteers, in general, understood that privacy concerns from the callers are natural, but sometimes too many concerns or external factors invading their privacy disrupted the service.

5.2.2 Stigma about Sharing Sensitive Issues. Some volunteers also expressed that callers find it difficult to talk about sensitive topics like homosexuality or sexual problems. They fear that if their secret gets revealed, they would get isolated from their family and society. Existing stigma about their issues affects their problem-sharing process, as they are concerned about their private space and, at the same time, feel uncomfortable talking about these issues. P5 said, “We get a lot of silent calls where people can’t get themselves to talk about their issues. These calls are mostly about taboo topics. Sometimes they pass several minutes without saying a word. They feel shame in sharing their issues, and we have to work a lot to open them up.” (P5, Male, 25 years old)

According to the volunteers, the internalized stigma around sensitive topics often forms a barrier for callers seeking necessary support. This stigma does not exist independently, but intersects with the stigma associated with discussing mental health. Participants noted that the stigma around sensitive topics often magnifies the overall stigma around mental health discussions. This magnification makes the already stigmatized topic of mental health more challenging to address, adding another layer of complexity to the conversation. P12 said that the callers fear revealing sensitive


information due to the risk of further marginalization. This fear amplifies their reticence, compelling volunteers to spend considerable time and effort encouraging these callers to share their experiences and feelings. This intersectionality of stigmas highlights the compounded challenges faced by callers in sharing sensitive topics and discussing mental health concerns. 5.2.3 Privacy of Volunteers. When asked about how the COVID-19 pandemic had affected their work process, volunteers mentioned facing privacy-related problems from their side. A few participants said that they had to face some problems when they wanted a private place in their household. At the beginning of the pandemic, all of these volunteers were living with their parents or family members. Some of these members used to live in hostels or shared places before the pandemic, but either to save cost or have mental support, they shifted to their native places to live with their family members. However, they reported that often they did not have the luxury to arrange a separate place for one individual as several other people, including children, were living in the same house. P10 explained, “We are supposed to take our calls in a private room. When I first started taking calls in my parents’ place, I had troubles making them understand what exactly I do. Once they understood somewhat about the nature of the work, they asked me why I am not getting any money. I am sure they still don’t fully understand my work. Sometimes, they also think that I am talking with girls over the phone. Although they do not enter my room during my office hours, neighbor’s kids or relatives have barged into the room once or twice.” (P10, Male, 35 years old) P4 and P11 also mentioned similar situations. P11 said that his parents were rich and had multiple apartments in the same building; so he often used other apartments for his office hours. 
Still, he was sometimes interrupted by his cousins who jokingly implied that he was probably involved in “phone sex”. Overall, several volunteers felt that the extended family structure of Bangladesh makes it difficult for both callers and volunteers to have a private conversation. The existing stigma about sharing sensitive issues further worsens the situation.

5.3 No Metrics, No Evidence

In our interviews, volunteers speculated on the success of their service in terms of improving people’s mental health because of the absence of concrete metrics to measure progress and no way to do a follow-up. After each conversation, the volunteers report the call in a form and answer some questions that include whether they have observed a shift in a caller’s mood, whether the caller has said they would call again, and whether the caller has thanked the volunteers after the conversation. Some volunteers like P15 thought that those questions might indicate some signs of success, although other volunteers confessed that they were not confident whether the conversation had any positive impact or not. P18 shared such concerns with us: “When people say in the beginning of the call that they are on the terrace and about to jump, and at the end, they go back to their room and explicitly say that they will not do suicide, we can understand that the conversation was successful. But most of the callers are not at that level of risk. Most of the calls are about relationship problems, marital issues, or economic problems. Sure, people call us to have someone who will listen to their problems, but we do not know what happens after these calls or whether our befriending has helped them in any way. I do not know how to measure success in these calls.” (P18, Female, 22 years old)


Possible alternative ways of measuring success were also suggested by the volunteers. P14 and P15 suggested that KPR can measure its success by looking at its webpage or social media pages, where callers can give their feedback about the service they received. They also let us know that sometimes callers call the KPR management team and make complaints. However, they acknowledged that the number of these online comments or feedback calls is much smaller than the sheer number of calls they get every day. Volunteers tried to measure their success by referring to their conversations with the callers; however, there was no way to follow up and know what happened after the conversation. For example, if a caller says that they would not harm themselves, there was no way to know whether the caller would stick to their words. As a result, the absence of any follow-up information made the volunteers question the quality of service they are providing.

5.4 Stress, Trauma, and Harassment for the Volunteers

Although the volunteers are trained through multiple sessions on how to deal with callers, all of our participants mentioned that sometimes they get mentally affected when they hear callers’ stress, frustration, and anxiety. Female volunteers, in particular, talked about how they get affected by sexually inappropriate calls. Below, we discuss these issues. 5.4.1 Induced Stress and Trauma from Dealing with Callers. Volunteers expressed that despite their training, they sometimes get affected by the distress of callers. They shared several incidents when they burst into tears or were emotionally vulnerable for a few days after a conversation. P5 shared an incident: “Once I got a call from a mother who was planning to kill herself along with her three children. She was being abused mentally and physically by her husband. Throughout the call, she talked about her sufferings, and I could not convince her to change her decision. After that particular call ended, I often wondered for the next few days what happened to that family and whether I could have done more to change that woman’s decision.” (P5, Male, 28 years old) Participants also expressed that the inability to share these calls with family members or friends makes it difficult for them to manage their emotions. While they acknowledge that keeping confidentiality is extremely important, they wish to have better ways to manage their mental health. P11 said, “If I were doing a normal job, I could have shared my sufferings with my friends and family. But I can’t do that here. Yes, I can share the calls with other volunteers, but often our shifts don’t match. Now in this pandemic, we don’t even see their faces. This makes the venting out process very difficult.” (P11, Male, 24 years old) When volunteers struggle with these thoughts or get too affected by calls, they share their experience with management and fellow volunteers. 
Before the COVID-19 pandemic, volunteers used to share their experiences in person with one another. However, during the pandemic, as they were working from home in most cases, they shared their experiences through a Facebook Messenger group. If volunteers are too distressed, they also have the option to take a temporary break (within their shift as well as for a longer period of time spanning weeks). However, volunteers expressed the desire for regular consultation with trained professionals.

5.4.2 Sexually Inappropriate Calls Affecting Female Volunteers. All of our female participants talked about their experience with inappropriate callers, mainly about those who made vulgar


comments or asked them to participate in sexual activity. When asked about the nature of these calls, they mentioned that, generally, these callers try to flirt with them almost from the beginning of the call. These flirtations include appraisals about the volunteers’ voices or making assumptions like “As your voice is so appealing, you must be very beautiful in real life” (P7). In such cases, the female volunteers generally give a warning (“I am not interested in talking about me” (P6)) or try to change the topic. If the callers start talking about their problems, then these calls are handled normally. But often, callers still try to make inappropriate comments and even start using vulgar or derogatory words, in which case the call is treated as an inappropriate call. Sometimes, inappropriate and vulgar comments start right from the beginning as well. When a volunteer is confident that the caller is being inappropriate, the volunteer abruptly ends the call saying “I am going to end the call now.” P9 provided a general comment that might explain the nature of these inappropriate callers. She suspected, “These inappropriate callers call at night to check if any female volunteer takes the call. If a male receives the call, they immediately hang up. They keep trying as long as a female does not pick up the call. Once they get a female volunteer, they start flirting and making vulgar comments.” (P9, Female, 25 years old) Some of the female volunteers (P6, P7, P8, P9, P18, and P19) expressed their frustration over dealing with inappropriate callers. P8 said, “I could have spent my time anywhere else for my own amusement. But I work in KPR because I like to help people. Since I am giving voluntary service and not getting paid a single penny, at least what I expect is respect. When these callers call me and ask me to do phone sex, I feel like I am not being respected. Actually, I feel like I am being used. 
I am a girl from a respectable family, and I don’t have to deal with these. Still, I work here because I want to help people. But when I receive these calls, I lose the motivation to be a volunteer.” (P8, Female, 25 years old)

The other female volunteers also expressed similar sentiments. P9 mentioned that if she were harassed in real life, she would at least have known who was harassing her, or she could have done something about it (“at least I could have cursed the person”). But being a volunteer in KPR, she had to stay polite and could not do anything. This made her a bit frustrated. She suggested that the phone numbers of these inappropriate callers should be reported to the police. P18 felt that the overwhelming number of inappropriate calls affects the support given to callers who are genuinely distressed about sexual issues. She said, “When people actually want to share their sexual problems, we still kind of have the suspicion that ‘Okay, this might be an inappropriate call.’ As a result, we spend extra time to make sure it is not and be cautious of interacting with the caller.” (P18, Female, 22 years old)

We did not get such overwhelming frustration over inappropriate calls from male volunteers. One male volunteer (P5) shared his experience with receiving flirtatious calls from women, although these calls were far less aggressive and did not contain vulgar words. Other volunteers (P4 and P10) mentioned receiving calls where the caller started cursing, because they were not giving any solution to the callers’ issues. Overall, regularly dealing with distressed callers can have negative impacts on volunteers’ mental health in various ways. The overwhelming number of inappropriate calls further exacerbates the situation for female volunteers.


5.5 Resource Constraints

Volunteers informed us of a couple of problems that are caused by a lack of human and technical resources. We elaborate upon these issues below.

5.5.1 Security Concerns. At the time of the study, KPR was providing its service from 3 p.m. to 3 a.m. every day. However, some volunteers expressed their concerns about these limited working hours and advocated for keeping the service open for 24 h. P1 said, “Imagine someone is having suicidal thoughts in the morning. Now will that person wait for 8–10 hours to see when KPR will start receiving calls?” (P1, Female, 35 years old)

Volunteers suspected that, by not being open in the morning or early afternoon, they were unable to help people of different backgrounds and economic status. P4 felt that housewives who are alone after early morning would have called them if KPR were open before 3 p.m. In spite of the desire to offer 24/7 service, the volunteers expressed some reasons why it may be challenging. For instance, P19 explained that the late night or early morning shifts were difficult for some volunteers, because they had to attend to their studies or work the following day. Others could only participate in late-night shifts on weekends. Furthermore, before the pandemic, volunteers were hesitant to work in the office for late night shifts due to concerns about their safety while returning home. Although late-night shifts from home are more practical, P10 mentioned that working during these hours could disturb the household as other family members would be asleep. In summary, while the volunteers believed that a 24/7 helpline service is necessary, they also identified personal and security issues that must be resolved before implementing it.

5.5.2 Technical Issues. We received some complaints about volunteers facing technical issues. In the initial phases of the pandemic, volunteers in Dhaka used to provide service through WhatsApp.
However, the volunteers informed us that as KPR had only one WhatsApp number, volunteers had to physically transfer the mobile phone among themselves, because WhatsApp numbers cannot be remotely changed across devices. Multiple volunteers shared a story about one particular volunteer who used to travel by bicycle in the middle of the night during the pandemic, only to give the mobile phone with the WhatsApp number to another volunteer. Some volunteers also said that due to poor internet connection in some rural areas, sometimes volunteers had to wait a few hours to report their daily conversations with the callers in Google Sheets.

6 DISCUSSION

In our discussion, we first summarize the key findings of our work and connect them with prior works in HCI and mental health literature. Then, we go on to argue for the need for decolonizing mental health interventions based on our findings and recommend future directions for research.

6.1 Key Insights

Our work contributes to the HCI and mental health literature on TMMHS and telephone crisis helplines by confirming and extending findings from past works. Our discussion is structured around the original research questions that guided our investigation.

6.1.1 RQ1: Challenges Faced by Volunteers. Our work responds to RQ1 by revealing several challenges volunteers face in supporting Bangladeshi callers. For example, privacy and stigma-related issues in the extended family scenario were reported to slow down the support-giving process. Volunteers reported that many callers were concerned about their privacy and wanted to ensure their issues would only stay between the caller and the volunteer. This privacy concern can sometimes take up a significant amount of time in the conversation, leaving little time to discuss


their actual issues. Some callers even go to lengths to hide their call from others, such as finding a private place to talk or dropping the call when they sense their privacy has been compromised. Volunteers understood that privacy concerns are natural, but too many concerns or external factors invading privacy disrupted their service. Thus, our findings resonate with the insights presented by Reference [105], emphasizing the discrepancy between the intended provision of care (niti) by helpline services and the actual experience of those services (nyaya) by callers. While the befriending model forms the basis of KPR’s service, it is noteworthy that specific aspects of its implementation and the resulting challenges volunteers face may be unique to the organization. For instance, volunteers at KPR find themselves in a quandary regarding recalling previous sessions. The organization’s policies necessitate volunteers to conceal any recollection of past interactions, which might not be a universal practice across all organizations that employ the befriending model. An alternative implementation of an illustrative befriending model can be seen in Boston Samaritans [22, 40] that provides long-term mental health support by engaging with individuals over extended periods, potentially spanning months or even years. This is notably different from KPR’s approach, where there is no continuity of service with the same volunteer. Boston Samaritans identify these frequent callers as “familiar” callers; most are typically not at risk for suicide. Moreover, despite the extended support, the organization ensures that assistance is not restricted to a specific volunteer. Whenever a familiar caller contacts the service, any volunteer can handle the call by referring to the caller’s history stored in a database named iCarol. This method could provide useful insights for KPR in its effort to enhance long-term support. 
Volunteers at KPR, like many support providers, also faced general challenges that were not specific to the befriending model. They reported feeling overwhelmed after conversations and being unable to shake off thoughts about the callers. To maintain confidentiality, they were restricted from sharing their experiences with family and friends and hence struggled to manage their emotions. Additionally, female volunteers experienced inappropriate behavior from callers, including vulgar commentary and requests for sexual activity. However, the mental well-being of crisis helpline volunteers has received little attention and is often disregarded by the helplines themselves [102]. Given the frequent engagement with vulnerable individuals, the mental health of volunteers can be severely impacted. Failure to carefully monitor their mental health may lead to low morale among them [34]. To support their mental health, helplines like KPR can take inspiration from Cyr and Dowrick [34]. For example, volunteers could be trained to recognize the signs of stress and frustration from their work and be frequently reminded that many aspects of callers’ lives are beyond their control. Volunteers’ initial expectations of their work should be evaluated regularly to check whether they are realistic and whether they are being fulfilled. These evaluations can be carried out with peer volunteers or supervisors. One such expectation to consider could be the desire for appreciation. It is crucial to recognize that volunteers may not always receive an acknowledgment from clients or be perceived as useful in every interaction. By evaluating and adjusting expectations, volunteers can develop a more realistic understanding of their role and the potential outcomes of their efforts. To reduce female volunteers’ frustration from inappropriate calls, organizations can consider blocking those phone numbers by leveraging their database of prior conversations.
Additionally, organizations have an obligation to safeguard their employees against sexual harassment, so helplines should develop policies to ensure that female volunteers deal with as few inappropriate calls as possible [52, 122]. 6.1.2 RQ2: Need for Deflection of Befriending Model and Adoption of Other Contextual Strategies. In pursuit of answering RQ2, we revealed several topics of contention, where volunteers thought they should deflect from the traditional assumptions of the befriending model. Our interviewees reported that the ‘disinterested’ nature of befriending support might upset ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 15. Publication date: December 2023.

A. Bhattacharjee et al.

support-seekers by not giving them direct solutions, providing further evidence of the cross-cultural differences in support-seeking behaviors [65, 101, 103]. Cultural norms in Western countries often place a strong emphasis on personal space and privacy. Interactions between strangers typically steer clear of deep dives into personal aspects such as family situations, income levels, or education [9]. The hesitancy in offering direct advice in the befriending model may be rooted in those cultural norms, whereas our findings suggest that the notion of friendship in Bangladesh may entail different expectations compared to the Western norms that the befriending model is built upon. The more open approach to conversation is mirrored in traditional methods of mental health support too. For example, conversations with local religious healers typically involve discussions on intimate topics, often culminating with the healers providing direct advice [121]. In light of these cultural differences, some of our interviewees suggested that Bangladeshi callers may prefer conversations that include more explicit advice. Some also raised concerns about situations in which callers might be unaware of existing solutions or resources, such as national helplines or COVID-19-related helplines, that could help them. This adds a nuanced perspective to the argument of [102], which underscores the importance of designing TMMHS systems in a way that aligns with callers’ expectations and the ways they want to receive and access care. We argue that it is crucial that these helplines do not come across as indifferent to the caller’s problems, as such perceptions might dissuade individuals from seeking help. Many of the volunteers were also worried about the extent of freedom of choice the model gives to the callers, which extends even to the choice to end their own lives. 
One of the early books on the befriending model by Varah [127] says the following: “...we have publicized ourselves as offering disinterested help to those who approach us, whilst still leaving the caller in charge of his own destiny so that he has nothing whatever to lose ringing The Samaritans.” This kind of “freedom” is eerily similar to the idea of “negative liberty” as described by Isaiah Berlin [11]. This notion emphasizes an individual’s freedom from external coercion or interference, positing that one is free to the extent that no person interferes with their actions. However, Berlin himself cautioned that this form of liberty may inadvertently constrain individuals within their own limitations, as it does not necessarily equip them with the resources or support to transcend their personal barriers [10]. Within the context of the befriending model, this notion of “negative liberty” can translate into a non-interventionist, non-directive approach to support, where the caller maintains agency over their choices, including the extreme decision of ending their life. While this approach is intended to respect the caller’s autonomy, it might raise pivotal concerns regarding whether such an approach sufficiently supports individuals who may be unable to see beyond their immediate crises. Furthermore, understandings and expressions of liberty can be nuanced and vary significantly across different cultural contexts. For example, Islamic liberalism values individual liberty but vehemently disapproves of any form of violence, including self-harm [110]. This perspective underscores the critical need for adapting models such as befriending to align with the cultural and social norms of the specific communities they serve. In many societies of the Global South, including Bangladesh, communal values and interconnectedness often supersede individual autonomy. 
In such contexts, the “negative liberty” underpinning the befriending model might clash with local societal values, and its hands-off approach could be misconstrued as indifference rather than respect for individual choice. However, while we have articulated critical perspectives on the current design of mental health technologies, we simultaneously recognize its inherent benefits and strengths. The nonjudgmental stance, a cornerstone of the helpline’s approach, can facilitate open dialogue about the callers’ experiences and feelings. This quality can be especially beneficial to individuals fearing identification, stigmatization, or misunderstanding due to their sexual orientation or gender

What’s the Point of Having This Conversation?

identity. For instance, individuals from marginalized communities, such as the LGBTQ+ community, might be able to safely discuss their struggles and seek assistance without the immediate threat of societal backlash [116]. Moreover, the wide reach of this digital service, coupled with its anonymous nature, can create a supportive environment for those who might find it difficult to access traditional, community-based support structures, either due to inaccessibility or hostility. Acknowledging all the strengths of Kaan Pete Roi, our critique is geared toward envisioning a system that can better integrate these strengths with a more nuanced understanding of different socio-cultural contexts and the specific needs arising from them. In line with a decolonial approach, we acknowledge the challenge of this endeavor: to balance the benefits of a widely accessible and anonymous service with the need for culturally sensitive, locally grounded support.

6.2 Decolonizing Mental Health Interventions

Our work also brings in some broader agendas to discuss within HCI, mental health, and related fields. First, we highlight how mental health problems are defined in a way underpinned by Western metaphysical knowledge about psychology [44]. As Enrique Dussel points out, the demarcation between mind and body that underlies this knowledge stems from Descartes’ work [39], “I think therefore I am,” a pronouncement of a self-made “I” and a renunciation of god. This “I,” as Dussel has argued, allowed European civilization to develop a knowledge that is separated from the historical and cultural context of the body. More importantly, and more relevant to this article, this body/mind separation has made human psychology a subject to be studied and experimented with without engaging with the person’s background. This, in turn, since the Enlightenment, has advanced psychological practices that are often akin to Western medical treatments for a person’s ailing body. 
While there have been some notable objections to such ideation of an isolated mind by several scholars, inter alia, Freud, Heidegger, Lacan, and Derrida (see Chalmers [26] for details), the dominant practices of defining psychological problems and solutions are still shaped by it. What we see today with this “befriending” model embodies that secular and isolated definition of a mind that can be healed by an “expert” with little knowledge of the patient’s historical background. As this model is employed in Bangladesh, it provides emotional support to distressed individuals but often overlooks the distinct elements of their personal, cultural, and historical circumstances. This approach echoes Descartes’ concept by treating mental distress as an isolated issue, detached from the broader lived experiences of individuals in the Bangladeshi context. As we use telecommunications and computing more and more to offer these services in Bangladesh and elsewhere, we risk continuing to view mental health in an isolated way. This approach reinforces a colonial viewpoint where Western methods are seen as suitable everywhere, leading us to overlook the unique cultural, historical, and personal aspects of mental health in Bangladesh [62]. Scholars [82, 125] argue that decolonization must involve systemic change; it should repatriate the indigenous way of life. Our investigation demonstrates that the current practice of providing support is far from the local people’s way of living. We notice that a theory established by Western European society is considered “supreme” and “universal,” and the local values (e.g., ways of conversation) are abolished. The support behavior model does not recognize callers’ identity and ignores their dissatisfaction, making their demands invisible in the so-called ‘mainstream’ healthcare. This continuous disregard of local values demands breaking the relentless structuring of the infrastructure [82]. 
This does not mean switching the position of dominance between local and Western European knowledge, but rather establishing a structure where knowledge produced by anybody from anywhere is acknowledged equally. Thus, we put forth this question: If not using this Western metaphysical definition, how then can we approach the problems of mental health in the Global South? One potential way is to find the

“social determinants” of mental health problems, a growing area within mental health research [3]. Social determinants are defined as the social conditions that cause a person to feel a particular way. Preventive measures are then taken by addressing those conditions. From this perspective, social psychologists have found deep-seated social problems, including racism, sexism, discrimination, and harassment, to be causes of mental health problems for many people [2, 99]. Experts therefore suggest a broader initiative toward an equal and just society to reduce the incidence of mental health problems. Our findings also indicate that many of the callers could connect their mental health problems to broader social problems in Bangladeshi communities, including sexism, poverty, unemployment, and lack of security. This reveals the potential of helplines like KPR to contribute significantly to understanding these social determinants through empathetic listening and the provision of needed support. In this context, we propose an approach aligned with the principles of distributive justice [111]. This approach suggests a shared responsibility in fostering a society conducive to good mental health: a society where resources, support, and burdens are equitably distributed. Crisis helplines like KPR, when mindfully designed and implemented, can play a crucial role in this endeavor. They can provide crucial mental health support access points to those otherwise underserved. By proactively guiding users to suitable resources such as job websites for the unemployed or housing assistance for those in unstable living conditions [66], helplines can contribute to initiatives aimed at understanding and mitigating the social determinants of mental health. This engagement could help society progress toward greater equality and justice. 
However, while such preventive measures are well justified for reducing the number of cases of mental health problems, it is undeniable that we also need support for people who need immediate help. This is where we would still need helplines like KPR. Still, we argue that the helplines need to be changed, both in the theories on which they base their services and in the way they provide them. Throughout the article, we see that the conversation between two Bangladeshi individuals is mediated by a Western framework that was born from Western assumptions. The framework itself negates the local values and expectations of conversations in many ways, for example by offering disinterested help and not providing direct solutions. The callers are forced to take part in culturally alien behavior, giving them little flexibility or control to follow a familiar conversation approach. Thus, helplines are reproducing elements of coloniality, despite the fact that they were originally created as responses to the imbalanced power dynamics of the traditional psychiatric system [131].

6.3 Future Directions for Research

In line with the argument of Pendse et al. [104], this article emphasizes the need to design solutions that are deeply rooted in the local context, giving prominence to community, faith, and local traditions of mental health support, and acknowledging the lived experiences of the potential users. As we envision future initiatives in this sphere, we perceive the historical origins of “befriending,” alongside the pivotal roles of “faith” and “community,” as crucial areas requiring further exploration [27, 35, 48, 50, 55, 84]. The comfort and relief that individuals initially found in confiding their stories to their community priest, Chad Varah, highlight the significance of these often overlooked social elements in mental health support theories and social determinant frameworks. 
Reflecting on the more holistic perspective of mental health, characteristic of many Global South traditions that emphasize the interconnectedness of community, faith, and wellbeing [35, 114], we suggest that future endeavors consciously weave community engagement and faith into the fabric of helpline service design and implementation [27, 48, 68, 104]. This reimagined approach can potentially extend the helplines’ reach to a wider demographic by connecting callers with trusted, respected members from their own communities. To foster hope in callers, we suggest the integration of the theological concept of

“redemption” into these conversations [41]. This might entail discussing notions of personal growth, moral improvement, and liberation from present struggles. Such discussions could resonate deeply with callers, particularly those rooted in religious communities where redemption narratives are prevalent. Thus, we call for a departure from the current, more secular, and remote befriending model, advocating instead for a shift toward a more community-oriented and value-sensitive approach. One promising direction for this line of future work involves collaborating with local religious healers to enrich the operation of crisis helplines. Healers such as Shamans, Kabiraj, and witches use their understanding of local ways of life to offer mental health support [121]. Future research could explore the communicative techniques these healers employ with their clients, and seek ways to incorporate these methods into helpline interactions. Considering faith as an important aspect of people’s lives, interventions could include the incorporation of faith-based support [45, 115]. This could involve training helpline volunteers in religious sensitivity and understanding of various faith-based perspectives on mental health, or collaborating with religious leaders or faith counselors who can provide support within the callers’ belief frameworks. More radically, inviting religious healers to participate directly in providing support to helpline callers could also be explored. If helplines align more closely with local practices, callers might feel more comfortable and open in sharing their problems. However, we clarify that we are not suggesting that the practices of local healers are inherently superior or inferior to those of traditional mental health support models. Rather, we assert that both approaches have their own merits and should be considered equally when developing techniques for helpline support. 
In future work, careful consideration should be given to how these distinct approaches can be effectively integrated and balanced to provide culturally sensitive and effective mental health support.

7 CONCLUSION AND LIMITATIONS

In this article, we investigate the activity of Kaan Pete Roi (KPR), the first suicide prevention helpline in Bangladesh. To understand their practice of providing crisis support and the challenges faced by the service providers in the organization, we interviewed 20 volunteer support providers at KPR. First, we present a detailed description of the working model of KPR to demonstrate how the idea of befriending is implemented in a resource-constrained context through various sociotechnical arrangements. Then, we report how their service often falls short in ensuring the callers’ safety, providing callers with long-term support, and protecting the volunteers from harassment and fear. Here, we argue that such failures are rooted in the cultural differences in perceiving the idea of befriending, and the infrastructural constraints of Bangladesh. Building on the decolonial philosophy of Enrique Dussel, we argue that the befriending model represents a body-mind separation of Western metaphysics that is less applicable to the closely-knit communities in Bangladesh. Finally, we discuss how a community-based model may improve these services by involving the community and matching better with the communal resources available in Bangladesh. Our work has a couple of limitations. First, we could not include any participant who received support from KPR, as mental health issues and seeking support in this regard are stigmatized in Bangladeshi culture, and finding members of the suicidal population who have suffered from mental health problems is extremely difficult. Thus, our work is missing the support seekers’ perspective. Second, we interviewed only the participants who volunteered to join us and analyzed their opinions and experiences. Hence, other volunteers of this organization might hold opinions that differ from the findings of this work. 
Despite these limitations, we believe that the findings of our study will be useful for crisis helpline research and HCI design of tools and technology for mental health support in the context of the Global South.

ACKNOWLEDGMENT This work was supported by Discovery Grant, Natural Sciences & Engineering Research Council (#RGPIN-2018-0). We also extend our gratitude to Ashik Abdullah, Chief of Training and Outreach at Kaan Pete Roi, for his assistance throughout various stages of the research. REFERENCES [1] Aasra. 2021. India Suicide Helpline Directory. Retrieved from http://www.aasra.info/helpline.html [2] Margarita Alegría, Amanda NeMoyer, Irene Falgàs Bagué, Ye Wang, and Kiara Alvarez. 2018. Social determinants of mental health: where we are and where we need to go. Curr. Psych. Rep. 20, 11 (2018), 1–13. [3] Jessica Allen, Reuben Balfour, Ruth Bell, and Michael Marmot. 2014. Social determinants of mental health. Int. Rev. Psych. 26, 4 (2014), 392–407. [4] Nazanin Andalibi. 2020. Disclosure, privacy, and stigma on social media: Examining non-disclosure of distressing experiences. ACM Trans. Comput.-Hum. Interact. 27, 3 (2020), 1–43. [5] Kim Anderson and Beatriz Wallace. 2015. Digital storytelling as a trauma narrative intervention for children exposed to domestic violence. In Video and Filmmaking as Psychotherapy. Routledge, 95–107. [6] Nandita Babu, Ziarat Hossain, Jessica E. Morales, and Shivani Vij. 2017. Grandparents in Bangladesh, India, and Pakistan: A way forward with traditions and changes in south Asia. In Grandparents in Cultural Context. Routledge, 159–186. [7] Chris Barker and Nancy Pistrang. 2002. Psychotherapy and social support: Integrating research on psychological helping. Clin. Psychol. Rev. 22, 3 (2002), 361–379. [8] Judith K. Bass, Paul A. Bolton, and Laura K. Murray. 2007. Do not forget culture when studying mental health. Lancet 370, 9591 (2007), 918–919. [9] Catherine Beaulieu. 2004. Intercultural study of personal space: A case study. J. Appl. Soc. Psychol. 34, 4 (2004), 794– 805. [10] Isaiah Berlin. 2014. Freedom and its Betrayal. Princeton University Press. [11] Isaiah Berlin. 2017. Two Concepts of Liberty. Routledge. 
[12] Paul Best, Una Foye, Brian Taylor, Diane Hazlett, and Roger Manktelow. 2013. Online interactive suicide support services: quality and accessibility. Mental Health Review Journal 18, 4 (2013), 226–239. [13] Ananya Bhattacharjee, Mohammad Ruhul Amin, Yeshim Iqbal, and Syed Ishtiaque Ahmed. 2022. Connecting mental health with sustainable development goals: Insights from call data of a telephone crisis helpline in Bangladesh. In Proceedings of the International Conference on Information and Communication Technologies and Development. 1–5. [14] Ananya Bhattacharjee, S. M. Taiabul Haque, Md Abdul Hady, S. M. Raihanul Alam, Mashfiqui Rabbi, Muhammad Ashad Kabir, and Syed Ishtiaque Ahmed. 2021. Understanding the social determinants of mental health of undergraduate students in Bangladesh: Interview study. JMIR Form. Res. 5, 11 (2021), e27114. [15] Ananya Bhattacharjee, Dana Kulzhabayeva, Mohi Reza, Harsh Kumar, Eunchae Seong, Xuening Wu, Mohammad Rashidujjaman Rifat, Robert Bowman, Rachel Kornfield, Alex Mariakakis, et al. 2023. Integrating individual and social contexts into self-reflection technologies. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–6. [16] Ananya Bhattacharjee, Jiyau Pang, Angelina Liu, Alex Mariakakis, and Joseph Jay Williams. 2023. Design implications for one-way text messaging services that support psychological wellbeing. ACM Trans. Comput.-Hum. Interact. 30, 3 (2023), 1–29. [17] Ananya Bhattacharjee, Joseph Jay Williams, Karrie Chou, Justice Tomlinson, Jonah Meyerhoff, Alex Mariakakis, and Rachel Kornfield. 2022. “I kind of bounce off it”: Translating mental health principles into real life through storybased text messages. Proc. ACM Hum.-Comput. Interact. 6, CSCW2 (2022), 1–31. [18] Ananya Bhattacharjee, Joseph Jay Williams, Jonah Meyerhoff, Harsh Kumar, Alex Mariakakis, and Rachel Kornfield. 2023. Investigating the role of context in the delivery of text messages for supporting psychological wellbeing. 
In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–19. [19] Nicola J. Bidwell. 2016. Decolonising HCI and interaction design discourse: Some considerations in planning AfriCHI. XRDS: Crossroads, ACM Mag. Stud. 22, 4 (2016), 22–27. [20] Befrienders Bloemfontein. 2021. Befrienders Bloemfontein Mental Health Organisation Bloem. Retrieved from https://www.therapyroute.com/therapist/befrienders-bloemfontein-bloemfontein-za [21] Moner Bondhu. 2021. Moner Bondhu—A Psycho Social Mental Health Support Center. Retrieved from https://www.monerbondhu.org/ [22] Samaritans At Boston. 2023. Samaritans At Boston. Retrieved from https://www.samaritans.org/branches/boston/ Last accessed: June 10, 2023.

[23] Mehdi Boukhechba, Anna N. Baglione, and Laura E. Barnes. 2020. Leveraging mobile sensing and machine learning for personalized mental health care. Ergon. Design 28, 4 (2020), 18–23. [24] Fritjof Capra. 1983. The Turning Point: Science, Society, and the Rising Culture. Bantam. [25] Mima Cattan, Nicola Kime, and Anne-Marie Bagnall. 2011. The use of telephone befriending in low level support for socially isolated older people–an evaluation. Health Soc. Care Commun. 19, 2 (2011), 198–206. [26] David J. Chalmers. 2002. Philosophy of mind: Classical and contemporary readings. (2002). [27] Janet X. Chen, Allison McDonald, Yixin Zou, Emily Tseng, Kevin A. Roundy, Acar Tamersoy, Florian Schaub, Thomas Ristenpart, and Nicola Dell. 2022. Trauma-informed computing: Towards safer technology experiences for all. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–20. [28] Cipla. 2021. About Cipla—Cipla South Africa. Retrieved from https://www.cipla.co.za/about-cipla [29] Alex Cohen, Vikram Patel, Harry Minas, V. Patel, H. Minas, A. Cohen, and M. Prince. 2014. A brief history of global mental health. Global Mental Health: Princ. Pract. (2014), 3–26. [30] Gregory J. Coman, Graham D. Burrows, and Barry J. Evans. 2001. Telephone counselling in Australia: Applications and considerations for use. Brit. J. Guid. Counsel. 29, 2 (2001), 247–258. [31] Open Counseling. 2021. Free Telephone Counseling Hotlines in South Africa. Retrieved from https://www.opencounseling.com/hotlines-za [32] Philippe Courtet, Emilie Olié, Christophe Debien, and Guillaume Vaiva. 2020. Keep socially (but not physically) connected and carry on: Preventing suicide in the age of COVID-19. J. Clin. Psych. 81, 3 (2020), 0–0. [33] Catherine M. Coveney, Kristian Pollock, Sarah Armstrong, and John Moore. 2012. Callers’ experiences of contacting a national suicide prevention helpline. Crisis (2012). [34] Carolyn Cyr and Peter W. Dowrick. 1991. Burnout in crisisline volunteers. Admin. 
Policy Mental Health Mental Health Serv. Res. 18, 5 (1991), 343–354. [35] Ajit K. Dalal and Girishwar Misra. 2010. The core and context of Indian psychology. Psychol. Dev. Soc. 22, 1 (2010), 121–155. [36] Anthony R D’Augelli, Marla H. Handis, Leslie Brumbaugh, Virginia Illig, Richard Searer, Donald W. Turner, and Judith Frankel D’Augelli. 1978. The verbal helping behavior of experienced and novice telephone counselors. J. Commun. Psychol. 6, 3 (1978), 222–228. [37] Munmun De Choudhury, Sanket S. Sharma, Tomaz Logar, Wouter Eekhout, and René Clausen Nielsen. 2017. Gender and cross-cultural differences in social media disclosures of mental illness. In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing. 353–369. [38] Jo Dean and Robina Goodlad. 1998. The role and impact of befriending. Joseph Rowntree Foundation; F. Demie, K. Lewis, and C. Mclean. 2008. Raising the Achievement of Somali Pupils: Good Practice in London Schools. Lambeth. [39] René Descartes. 2013. Meditations on First Philosophy. Broadview Press. [40] Monica Dickens. 1996. Befriending: The American Samaritans. Popular Press. [41] William L. Dunlop and Jessica L. Tracy. 2013. Sobering stories: Narratives of self-redemption predict behavioral change and improved health among recovering alcoholics. J. Personal. Soc. Psychol. 104, 3 (2013), 576. [42] Enrique Dussel. 2008. Meditaciones anti-cartesianas: Sobre el origen del anti-discurso filosófico de la Modernidad. Tabula Rasa 09 (2008), 153–197. [43] Enrique D. Dussel and Alessandro Fornazzari. 2002. World-system and “trans”-modernity. Nepantla: Views South 3, 2 (2002), 221–244. [44] Enrique D. Dussel, Javier Krauel, and Virginia C. Tuma. 2000. Europe, modernity, and eurocentrism. Nepantla: Views South 1, 3 (2000), 465–478. [45] Jacqueline E Thurston Dyer and W Bryce Hagedorn. 2013. Navigating bereavement with spirituality-based interventions: Implications for non-faith-based counselors. Counsel. 
Values 58, 1 (2013), 69–84. [46] Sophie Eis, Oriol Solà-Morales, Andrea Duarte-Díaz, Josep Vidal-Alaball, Lilisbeth Perestelo-Pérez, Noemí Robles, and Carme Carrion. 2022. Mobile applications in mood disorders and mental health: Systematic search in apple app store and google play store and review of the literature. Int. J. Environ. Res. Public Health 19, 4 (2022), 2186. [47] Roberto Encarnación Mosquera, Habib M. Fardoun, Daniyal Alghazzawi, Cesar Collazos, and Víctor M. Ruiz Penichet. 2018. Design guidelines for the implementation of an interactive virtual reality application that supports the rehabilitation of amputees of lower limbs patients with post-traumatic stress disorder (PTSD). In Proceedings of the 20th HCI International Conference. Springer, 17–31. [48] Sheena Erete, Yolanda A. Rankin, and Jakita O. Thomas. 2021. I can’t breathe: Reflections from black women in CSCW and HCI. Proc. ACM Hum.-Comput. Interact. 4, CSCW3 (2021), 1–23. [49] Sussie Eshun and Regan A. R. Gurung. 2009. Culture and Mental Health: Sociocultural Influences, Theory, and Practice. John Wiley & Sons. [50] Suman Fernando. 2014. Mental Health Worldwide: Culture, Globalization and Development. Springer.

[51] SAJIDA Foundation. 2021. SAJIDA’s Strategy Paper 2021-26—SAJIDA Foundation. Retrieved from https://sajidafoundation.org/wp-content/uploads/2021/05/SAJIDA-Strategy-Paper-July-2021-June-2026.pdf Last accessed: August 16, 2021. [52] Hilary J. Gettman and Michele J. Gelfand. 2007. When the customer shouldn’t be king: Antecedents and consequences of sexual harassment by clients and customers. J. Appl. Psychol. 92, 3 (2007), 757. [53] Noor Ahmed Giasuddin, Itzhak Levav, and Gilad Gal. 2015. Mental health stigma and attitudes to psychiatry among Bangladeshi medical students. Int. J. Soc. Psych. 61, 2 (2015), 137–147. [54] Ramón Grosfoguel. 2013. The structure of knowledge in westernised universities: Epistemic racism/sexism and the four genocides/epistemicides. Hum. Arch.: J. Sociol. Self-knowl. 1, 1 (2013), 73–90. [55] Matthew M. Heaton. 2013. Black Skin, white Coats: Nigerian Psychiatrists, Decolonization, and the Globalization of Psychiatry. Ohio University Press. [56] Anwar Hossain, Jinnat Rehena, Musammat Sultana Razia et al. 2018. Mental health disorders status in Bangladesh: A systematic review. JOJ Nurs. Health Care 7, 2 (2018), 664–667. [57] Tara Hunt, Coralie J. Wilson, Alan Woodward, Peter Caputi, and Ian Wilson. 2018. Intervention among suicidal men: Future directions for telephone crisis support research. Front. Public Health 6 (2018), 1. [58] iCALL. 2021. iCALL|Free Telephone & Email based Counseling Services. Retrieved from https://icallhelpline.org/ [59] Becky Inkster, Shubhankar Sarda, and Vinod Subramanian. 2018. An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth uHealth 6, 11 (2018), e12106. [60] Yeshim Iqbal, Rubina Jahan, and Muhtasabbib Rumman Matin. 2019. Descriptive characteristics of callers to an emotional support and suicide prevention helpline in Bangladesh (first five years). Asian J. Psych. 45 (2019), 63–65. 
[61] Yeshim Iqbal, Rubina Jahan, Sakila Yesmin, Ashique Selim, and Shaheen Nafisa Siddique. 2021. COVID-19-related issues on tele-counseling helpline in Bangladesh. Asia-Pacific Psych. 13, 2 (2021), e12407. [62] Lilly Irani, Janet Vertesi, Paul Dourish, Kavita Philip, and Rebecca E. Grinter. 2010. Postcolonial computing: A lens on design and development. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1311–1320. [63] Thomas Joiner, John Kalafat, John Draper, Heather Stokes, Marshall Knudson, Alan L. Berman, and Richard McKeon. 2007. Establishing standards for the assessment of suicide risk among callers to the National Suicide Prevention Lifeline. Suicide Life-threat. Behav. 37, 3 (2007), 353–365. [64] John Kalafat, Madelyn S. Gould, Jimmie Lou Harris Munfakh, and Marjorie Kleinman. 2007. An evaluation of crisis hotline outcomes. Part 1: Nonsuicidal crisis callers. Suicide Life-threat. Behav. 37, 3 (2007), 322–337. [65] Naveena Karusala, Azra Ismail, Karthik S. Bhat, Aakash Gautam, Sachin R. Pendse, Neha Kumar, Richard Anderson, Madeline Balaam, Shaowen Bardzell, Nicola J. Bidwell et al. 2021. The future of care work: Towards a radical politics of care in CSCW research and practice. In Proceedings of the Conference on Computer Supported Cooperative Work and Social Computing. 338–342. [66] Nick Kerman, John Sylvestre, Tim Aubry, and Jino Distasio. 2018. The effects of housing stability on service use among homeless adults with mental illness in a randomized controlled trial of housing first. BMC Health Services Res. 18 (2018), 1–14. [67] Sudhir K. Khandelwal, Harsh P. Jhingan, S Ramesh, Rajesh K. Gupta, and Vinay K. Srivastava. 2004. India mental health country profile. Int. Rev. Psychiatry 16, 1-2 (2004), 126–141. [68] Harold G. Koenig. 2009. Faith and Mental Health: Religious Resources for Healing. Templeton Foundation Press. [69] Anne C. Krendl and Bernice A. Pescosolido. 2020. 
Countries and cultural differences in the stigma of mental illness: The east–west divide. J. Cross-Cult. Psychol. 51, 2 (2020), 149–167. [70] Mee Huong Lai, Thambu Maniam, Lai Fong Chan, and Arun V. Ravindran. 2014. Caught in the web: A review of web-based suicide prevention. J. Med. Internet Res. 16, 1 (2014), e2973. [71] Emily G. Lattie, Rachel Kornfield, Kathryn E. Ringland, Renwen Zhang, Nathan Winquist, and Madhu Reddy. 2020. Designing mental health technologies that support the social ecosystem of college students. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–15. [72] Live Love Laugh. 2021. Helplines|Live Love Laugh. Retrieved from https://www.thelivelovelaughfoundation.org/ find-help/helplines [73] Shaimaa Lazem, Danilo Giglitto, Makuochi Samuel Nkwo, Hafeni Mthoko, Jessica Upani, and Anicia Peters. 2021. Challenges and paradoxes in decolonising HCI: A critical discussion. Comput. Supp. Coop. Work (2021), 1–38. [74] Shaimaa Lazem, Mennatallah Saleh, and Ebtisam Alabdulqader. 2021. ArabHCI: Five years and counting. Commun. ACM 64, 4 (2021), 69–71. [75] LS Leach and Helen Christensen. 2006. A systematic review of telephone-based interventions for mental disorders. J. Telemed. Telecare 12, 3 (2006), 122–129. [76] David Lester. 1977. The use of the telephone in counseling and crisis intervention. The social impact of the telephone (1977), 454–472.

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 15. Publication date: December 2023.

What’s the Point of Having This Conversation?


[77] David Ed Lester. 2002. Crisis Intervention and Counseling by Telephone. Charles C. Thomas Publisher. [78] James D. Livingston and Jennifer E. Boyd. 2010. Correlates and consequences of internalized stigma for people living with mental illness: A systematic review and meta-analysis. Soc. Sci. Med. 71, 12 (2010), 2150–2161. [79] Crick Lund, Alison Breen, Alan J. Flisher, Ritsuko Kakuma, Joanne Corrigall, John A. Joska, Leslie Swartz, and Vikram Patel. 2010. Poverty and common mental disorders in low and middle income countries: A systematic review. Soc. Sci. Med. 71, 3 (2010), 517–528. [80] Christina MacKinnon. 1998. Empowered consumers and telephone hotlines. Crisis 19, 1 (1998), 21–23. [81] Farhana Mann, Jessica K. Bone, Brynmor Lloyd-Evans, Johanna Frerichs, Vanessa Pinfold, Ruimin Ma, Jingyi Wang, and Sonia Johnson. 2017. A life less lonely: The state of the art in interventions to reduce loneliness in people with mental health problems. Soc. Psych. Psych. Epidemiol. 52, 6 (2017), 627–638. [82] A. Memmi. 1991. The Colonizer and the Colonized (Exp. ed., H. Greenfield, Trans.). Beacon Press. [83] Jonah Meyerhoff, Theresa Nguyen, Chris J. Karr, Madhu Reddy, Joseph J. Williams, Ananya Bhattacharjee, David C. Mohr, and Rachel Kornfield. 2022. System design of a text messaging program to support the mental health needs of non-treatment seeking young adults. Procedia Comput. Sci. 206 (2022), 68–80. [84] China Mills. 2014. Decolonizing Global Mental Health: The Psychiatrization of the Majority World. Routledge. [85] Brian L. Mishara, François Chagnon, Marc Daigle, Bogdan Balan, Sylvaine Raymond, Isabelle Marcoux, Cécile Bardon, Julie K. Campbell, and Alan Berman. 2007. Comparing models of helper behavior to actual practice in telephone crisis intervention: A silent monitoring study of calls to the US 1-800-SUICIDE network. Suicide Life-Threat. Behav. 37, 3 (2007), 291–307. [86] Brian L. Mishara and Marc S. Daigle. 1997. 
Effects of different telephone intervention styles with suicidal callers at two suicide prevention centers: An empirical investigation. Amer. J. Commun. Psychol. 25, 6 (1997), 861–885. [87] Gemma Mitchell. 2007. Befriending Adults with Severe Mental Health Problems: Processes of Helping Befriending Relationships. University of London, University College London, United Kingdom. [88] Jo Moriarty and Jill Manthorpe. 2017. The diversity of befriending by, and of, older people. Working with Older People 21, 2 (2017), 63–71. [89] Nasim Motalebi and Saeed Abdullah. 2018. Conversational agents to provide couple therapy for patients with PTSD. In Proceedings of the 12th EAI International Conference on Pervasive Computing Technologies for Healthcare. 347–351. [90] Emily Moyer-Gusé. 2008. Toward a theory of entertainment persuasion: Explaining the persuasive effects of entertainment-education messages. Commun. Theory 18, 3 (2008), 407–425. [91] Elizabeth L. Murnane, Tara G. Walker, Beck Tench, Stephen Voida, and Jaime Snyder. 2018. Personal informatics in interpersonal contexts: Towards the design of technology that supports the social ecologies of long-term mental health management. Proc. ACM Hum.-Comput. Interact. 2, CSCW (2018), 1–27. [92] Jennifer A. Newberry, Japsimran Kaur, Shravya Gurrapu, Rasika Behl, Gary L. Darmstadt, Bonnie Halpern-Felsher, GV Ramana Rao, Swaminatha V. Mahadevan, and Matthew C. Strehlow. 2020. “So Why Should I Call Them?”: Survivor support service characteristics as drivers of help-seeking in India. J. Interpers. Viol. (2020), Retrieved from https://journals.sagepub.com/doi/10.1177/0886260520970306 [93] Mark Nichter. 1981. Idioms of distress: Alternatives in the expression of psychosocial distress: A case study from South India. Cult., Med. Psychiatry 5, 4 (1981), 379–408. [94] Nazmun Nahar Nuri, Malabika Sarker, Helal Uddin Ahmed, Mohammad Didar Hossain, Claudia Beiersmann, and Albrecht Jahn. 2018. 
Pathways to care of patients with mental health problems in Bangladesh. Int. J. Mental Health Syst. 12, 1 (2018), 1–12. [95] R. E. Ogbolu, B. O. Oyatokun, K. Ogunsola, T. A. Adegbite, T. Tade, O. Olafisoye, and O. F. Aina. 2020. The pattern of crisis calls to a suicide telephone helpline service in Nigeria. Annals of Health Research 6, 3 (2020), 246–257. [96] Kathleen O’Leary, Stephen M. Schueller, Jacob O. Wobbrock, and Wanda Pratt. 2018. “Suddenly, we got to become therapists for each other” designing peer support chats for mental health. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–14. [97] Maria A. Oquendo and Joel A. Bernanke. 2017. Suicide risk assessment: Tools and challenges. World Psych. 16, 1 (2017), 28. [98] World Health Organization. 2015. World Report on Ageing and Health. World Health Organization. [99] World Health Organization and others. 2014. Social determinants of mental health. (2014). [100] Vikram Patel, Shekhar Saxena, Crick Lund, Graham Thornicroft, Florence Baingana, Paul Bolton, Dan Chisholm, Pamela Y. Collins, Janice L. Cooper, Julian Eaton, et al. 2018. The Lancet Commission on global mental health and sustainable development. Lancet 392, 10157 (2018), 1553–1598. [101] Sachin R. Pendse, Naveena Karusala, Divya Siddarth, Pattie Gonsalves, Seema Mehrotra, John A. Naslund, Mamta Sood, Neha Kumar, and Amit Sharma. 2019. Mental health in the global south: Challenges and opportunities in HCI for development. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies. 22–36.


A. Bhattacharjee et al.

[102] Sachin R. Pendse, Faisal M. Lalani, Munmun De Choudhury, Amit Sharma, and Neha Kumar. 2020. “Like Shock Absorbers”: Understanding the human infrastructures of technology-mediated mental health support. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–14. [103] Sachin R. Pendse, Kate Niederhoffer, and Amit Sharma. 2019. Cross-cultural differences in the use of online mental health support forums. Proc. ACM Hum.-Comput. Interact. 3, CSCW (2019), 1–29. [104] Sachin R. Pendse, Daniel Nkemelu, Nicola J. Bidwell, Sushrut Jadhav, Soumitra Pathare, Munmun De, and Neha Kumar. 2022. From treatment to healing: Envisioning a decolonial digital mental health. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI’22). [105] Sachin R. Pendse, Amit Sharma, Aditya Vashistha, Munmun De Choudhury, and Neha Kumar. 2021. “Can I Not Be Suicidal on a Sunday?”: Understanding technology-mediated pathways to mental health support. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–16. [106] David L. Penn, Kim T. Mueser, Nick Tarrier, Andrew Gloege, Corrine Cather, Daniel Serrano, and Michael W. Otto. 2004. Supportive therapy for schizophrenia: Possible mechanisms and implications for adjunctive psychosocial treatments. Schiz. Bull. 30, 1 (2004), 101–112. [107] Lore Pil, Kirsten Pauwels, Ekke Muijzers, Gwendolyn Portzky, and Lieven Annemans. 2013. Cost-effectiveness of a helpline for suicide prevention. J. Telemed. Telecare 19, 5 (2013), 273–281. [108] Anthony T. Pinter, Morgan Klaus Scheuerman, and Jed R. Brubaker. 2021. Entering doors, evading traps: Benefits and risks of visibility during transgender coming outs. Proc. ACM Hum.-Comput. Interact. 4, CSCW3 (2021), 1–27. [109] Jane Pirkis, Lay San Too, Matthew J. Spittal, Karolina Krysinska, Jo Robinson, and Yee Tak Derek Cheung. 2015. Interventions to reduce suicides at suicide hotspots: A systematic review and meta-analysis. Lancet Psych. 
2, 11 (2015), 994–1001. [110] Shadaab Rahemtulla. 2017. Qur’an of the Oppressed: Liberation Theology and Gender Justice in Islam. Oxford University Press. [111] John E. Roemer. 1996. Theories of Distributive Justice. Harvard University Press. [112] Carl R. Rogers. 1966. Client-centered Therapy. American Psychological Association Washington, DC. [113] Kaan Pete Roi. 2020. Kaan Pete Roi Home. Retrieved from http://shuni.org/. Last accessed: August 16, 2021. [114] Raghunath Safaya. 1975. Indian psychology: A critical and historical analysis of the psychological speculations in Indian philosophical literature. (1975). [115] Elisabeth A. Nesbit Sbanotto, Heather Davediuk Gingrich, and Fred C. Gingrich. 2016. Skills for Effective Counseling: A Faith-based Integration. InterVarsity Press. [116] Morgan Klaus Scheuerman, Stacy M. Branham, and Foad Hamidi. 2018. Safe spaces and safe places: Unpacking technology-mediated experiences of safety and harm with transgender people. Proceedings of the ACM on Humancomputer Interaction 2, CSCW (2018), 1–27. [117] Cliff Scott and Melissa Medaugh. 2017. Axial coding. The International Encyclopedia of Communication Research Methods (2017), 1–2. [118] Mahsa Sheikh, M. Qassem, and Panicos A. Kyriacou. 2021. Wearable, environmental, and smartphone-based passive sensing for mental health monitoring. Front. Dig. Health 3 (2021), 33. [119] Emrys Shoemaker, Gudrun Svava Kristinsdottir, Tanuj Ahuja, Dina Baslan, Bryan Pon, Paul Currion, Pius Gumisizira, and Nicola Dell. 2019. Identity at the margins: Examining refugee experiences with digital identity systems in Lebanon, Jordan, and Uganda. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies. 206–217. [120] Amalie Søgaard Neilsen and Rhonda L. Wilson. 2019. 
Combining e-mental health intervention development with human computer interaction (HCI) design to enhance technology-facilitated recovery for people with depression and/or anxiety conditions: An integrative literature review. Int. J. Mental Health Nurs. 28, 1 (2019), 22–39. [121] Sharifa Sultana and Syed Ishtiaque Ahmed. 2019. Witchcraft and hci: Morality, modernity, and postcolonial computing in rural bangladesh. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–15. [122] Sharifa Sultana, Mitrasree Deb, Ananya Bhattacharjee, Shaid Hasan, S. M. Raihanul Alam, Trishna Chakraborty, Prianka Roy, Samira Fairuz Ahmed, Aparna Moitra, M. Ashraful Amin, et al. 2021. “Unmochon”: A tool to combat online sexual harassment over Facebook messenger. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–18. [123] Franziska Tachtler, Toni Michel, Petr Slovák, and Geraldine Fitzpatrick. 2020. Supporting the supporters of unaccompanied migrant youth: Designing for social-ecological resilience. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–14. [124] Franziska Tachtler, Reem Talhouk, Toni Michel, Petr Slovák, and Geraldine Fitzpatrick. 2021. Unaccompanied migrant youth and mental health technologies: A social-ecological approach to understanding and designing. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–19. [125] Eve Tuck and K. Wayne Yang. 2021. Decolonization is not a metaphor. Tabula Rasa 38 (2021), 61–111.


[126] Chad Varah. 1973. The Samaritans in the’70s: To Befriend the Suicidal and Despairing. Constable, London. [127] Chad Varah. 1987. The Samaritans: Befriending the Suicidal. Constable, London. [128] Lynette P. Vromans and Robert D. Schweitzer. 2011. Narrative therapy for adults with major depressive disorder: Improved symptom and interpersonal outcomes. Psychother. Res. 21, 1 (2011), 4–15. [129] Rob Willson and Rhena Branch. 2019. Cognitive Behavioural Therapy for Dummies. John Wiley & Sons. [130] Naomi Yamashita, Hideaki Kuzuoka, Keiji Hirata, and Takashi Kudo. 2013. Understanding the conflicting demands of family caregivers caring for depressed family members. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2637–2646. [131] Hannah Zeavin. 2021. The Distance Cure: A History of Teletherapy. MIT Press.

Received 15 February 2023; accepted 22 February 2023


Analysis of Performance Improvements and Bias Associated with the Use of Human Mobility Data in COVID-19 Case Prediction Models

SAAD MOHAMMAD ABRAR, NAMAN AWASTHI, DANIEL SMOLYAK, and VANESSA FRIAS-MARTINEZ, University of Maryland, College Park, USA

The COVID-19 pandemic has mainstreamed human mobility data into the public domain, with research focused on understanding the impact of mobility reduction policies as well as on regional COVID-19 case prediction models. Nevertheless, current research on COVID-19 case prediction tends to focus on performance improvements, masking relevant insights about when mobility data does not help, and more importantly, why, so that it can adequately inform local decision making. In this article, we carry out a systematic analysis to reveal the conditions under which human mobility data provides (or not) an enhancement over individual regional COVID-19 case prediction models that do not use mobility as a source of information. Our analysis, focused on U.S. county-based COVID-19 case prediction models, shows that (1) at most, 60% of counties improve their performance after adding mobility data; (2) the performance improvements are modest, with median correlation improvements of approximately 0.13; (3) improvements were lower for counties with higher Black, Hispanic, and other non-White populations as well as low-income and rural populations, pointing to potential bias in the mobility data negatively impacting predictive performance; and (4) different mobility datasets, predictive models, and training approaches bring about diverse performance improvements.

CCS Concepts: • Human-centered computing → Mobile devices; Ubiquitous and mobile devices; • Information systems → Location based services;

Additional Key Words and Phrases: COVID-19 case prediction, mobility data, sampling bias, interpretable models

ACM Reference format:
Saad Mohammad Abrar, Naman Awasthi, Daniel Smolyak, and Vanessa Frias-Martinez. 2023. Analysis of Performance Improvements and Bias Associated with the Use of Human Mobility Data in COVID-19 Case Prediction Models. ACM J. Comput. Sustain. Soc. 1, 2, Article 16 (December 2023), 36 pages. https://doi.org/10.1145/3616380

S. Mohammad Abrar, N. Awasthi, and D. Smolyak contributed equally to this research.
Authors’ address: S. Mohammad Abrar, N. Awasthi, D. Smolyak, and V. Frias-Martinez, University of Maryland, College Park, College Park, 8125 Paint Branch Drive, College Park MD 20740; emails: {sabrar, nawasthi, dsmolyak, vfrias}@umd.edu.

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License. © 2023 Copyright held by the owner/author(s). 2834-5533/2023/12-ART16 $15.00 https://doi.org/10.1145/3616380

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 16. Publication date: December 2023.

1 INTRODUCTION

The COVID-19 pandemic has mainstreamed human mobility data into the public domain and beyond academic networks. During the early stages of the pandemic, the importance of limiting mobility to contain the epidemic became evident, with cities, states, and countries taking various non-pharmacological interventions focused on mobility, such as national lockdowns or work-from-home approaches [10, 36]. To evaluate the effect of these interventions, public health experts, the CDC, city departments, and journalists explored the use of mobility data that, at the time, was made open and freely available. Companies like Apple, Google, SafeGraph, and Descartes shared different types of aggregated mobility datasets to characterize behaviors such as the volume of visits to specific places (e.g., schools, workplaces, or restaurants), the volume of trips between regions (e.g., trips between two counties), or the volume of trips by type of transportation (e.g., driving vs. public transit).

Beyond understanding the impact of mobility reduction policies, the increased access to mobility data sources has also supported research on regional COVID-19 case prediction models, with the assumption that how people moved within a region in the past could potentially provide additional information about how people become infected in the future. COVID-19 case prediction models focus on providing regional-level estimates for future cases in both the short and long term via lookahead analysis, that is, by measuring region-level prediction performance over various temporal windows such as daily, weekly, or monthly [27]. For example, researchers have shown that SafeGraph data can help predict weekly COVID-19 cases at the county level in the United States, providing higher accuracy when compared to non-mobility baselines [35]. There exist a wide variety of models to predict regional COVID-19 cases, including epidemiological [6, 25, 52], machine learning [17, 33, 47], and statistical models [17, 27]. In this article, we focus on statistical models (linear regression and ARIMA) because we are interested in the deployment of models that are interpretable by decision makers rather than implementing black box predictive approaches that are harder to explain [42].
Nevertheless, there are several gaps in the current state of the art in regional COVID-19 case prediction using mobility data. First, performance results—measured as RMSE or correlation between actual and predicted regional COVID-19 cases—are reported as averages across regions, masking individual region-level performance, which is critical to inform local interventions and policies [27]. For example, past research has shown that mobility data enhances COVID-19 case predictions, on average, across counties in the United States; however, that average might be masking counties for which it did not work [9, 33]. Second, performance results are often not compared against non-mobility baselines, making it hard to measure the effectiveness of adding mobility data to the prediction model [9, 33, 47]. Third, prior work has shown that mobility data might suffer from sampling bias whereby certain demographic groups (e.g., Black, elder, and low-income individuals) can be under-represented in the data due to lower smartphone and cell phone ownership rates [7, 43]. Nevertheless, prior work focused on building COVID-19 case prediction models tends to ignore the bias present in the mobility data, which in turn might affect the performance of regional COVID-19 case prediction models depending on the population of that region [27, 33, 34]. Fourth, current approaches tend to provide narrow evaluations, focused on a few models, or on one or a few mobility datasets, with little research broadly looking into the impact of different prediction models, mobility datasets, and training approaches that use more or less data, on model performance. 
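The first gap above, averages masking county-level results, can be illustrated with a minimal sketch. It assumes plain Python lists of actual and predicted cases per county; the county names, the numbers, and the helper names `rmse` and `pearson` are hypothetical and only serve the illustration, they are not taken from the article's data or code.

```python
import math
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equally long series."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def pearson(actual, predicted):
    """Pearson correlation; returns NaN if either series is constant."""
    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_p)


# Hypothetical data per county:
# (actual cases, non-mobility baseline prediction, mobility-based prediction).
counties = {
    "county_A": ([1, 2, 3, 4], [3, 2, 3, 4], [1, 2, 3, 4]),
    "county_B": ([1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 2, 4]),
    "county_C": ([2, 4, 6, 8], [2, 2, 3, 3], [2, 4, 6, 8]),
}

# Correlation gain of the mobility-based model over the baseline, per county.
gains = {
    name: pearson(actual, with_mobility) - pearson(actual, baseline)
    for name, (actual, baseline, with_mobility) in counties.items()
}

# The average gain is positive even though one county gets worse.
average_gain = mean(gains.values())
share_improved = sum(g > 0 for g in gains.values()) / len(gains)
```

In this toy setup the average gain is positive while `county_B` loses correlation once mobility is added, which is the kind of county-level effect that an average across regions would hide.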
Given (1) the high cost of acquiring human mobility data for COVID-19 prediction purposes, now that it is no longer freely accessible, and (2) that COVID-19 case predictions are going to be used to assess non-pharmacological interventions such as mobility reduction, or vaccine distribution at the local level, we posit that it is critical to understand the conditions under which mobility data helps (or not) at the individual regional level so that it can adequately inform local decision making. In this article, we aim to analyze the conditions under which human mobility data provides an enhancement over individual regional COVID-19 case prediction models that do not use mobility as a source of information. Our main objective is to inform regional decision makers about the potential of region-level COVID-19 case prediction models that use mobility data, which we posit should be well understood given the high cost of human mobility data.

The main contributions of this work are the following:

• Focusing on U.S. counties, we evaluate the number of counties that benefit from adding mobility data and quantify the improvements. Our analyses show that, at most, 60% of counties improve their performance over non-mobility baselines, and that those improvements are modest, happening mostly for longer-term predictions. Looking into the counties that benefit from adding mobility data, 50% of those counties show modest correlation improvements of at most 0.1 and 25% show correlation improvements of at most 0.3.
• We present and discuss an approach to assess whether mobility data bias, characterized by demographic and socio-economic characteristics of each county, might explain the differences in the performance of COVID-19 prediction models across counties. We show that correlation improvements were lower for counties with higher Black, Hispanic, and other non-White populations as well as low-income and rural populations, pointing to potential bias in the mobility data negatively impacting predictive performance.
• We analyze whether the differences in the performance of mobility-based models over non-mobility baselines vary depending on the mobility datasets, the predictive model, or the training approach. Our results reveal that the improvements brought about by mobility data are similar across mobility datasets, albeit with slightly better values for Apple and SafeGraph; linear regressions are associated with larger improvements; and the training approach might also affect the scale of the improvements.

2 RELATED WORK

2.1 Human Mobility Data and COVID-19 Case Predictions

Human mobility data has been used in the past to model and characterize human behaviors in the built environment [12, 21, 41, 46, 51], to support decision making for socio-economic development [11, 13, 14, 16, 22], for public safety [49, 50], as well as during epidemics and disasters [3, 19, 23, 24, 28, 48]. During the COVID-19 pandemic, human mobility has also played a central role in driving decision making, for example, with social distancing policies significantly reducing the spread of the virus [2].

Related work has shown that COVID-19 case prediction models can be enhanced using human mobility data when compared to non-mobility baselines [5, 27, 35]. For example, Ilin et al. [27] analyzed the use of mobility data to forecast COVID-19 cases using interpretable statistical models like regressions. Working at various spatial scales (from county to state to country), the authors revealed that adding mobility data significantly helps decrease the mean percentage prediction error, and that the improvements were higher for longer forecasting lengths. Nevertheless, a significant number of papers focused on COVID-19 case prediction using mobility data fail to compare model performance against non-mobility baselines [9, 33]. More importantly, several papers have revealed settings in which mobility data did not help. For example, Curtis et al. [8] and Venter et al. [45] found only a small correlation between COVID-19 cases and mobility in parks and natural areas (blue-green spaces), and Mehrab et al. [34] showed that the performance of mobility-based prediction differed considerably across 50 U.S. counties that correspond to land-grant universities.

Human mobility data is generally collected from smartphones and cell phones; however, due to the differences in access to that technology, not all individuals are equally represented in mobility datasets.
In fact, prior work has shown that, for example, Black and elder individuals were under-represented in SafeGraph’s dataset for the state of North Carolina [7]; wealthier individuals tend to be over-represented in cell phone data from several countries, including Sierra Leone and Iraq [43]; and the relationship between COVID-19 cases and mobility is stronger in urban areas than in rural areas [32]. Given this evidence, we posit that the sampling bias found in mobility data might also affect the performance of regional COVID-19 case prediction models.

In this article, we aim to provide a much needed systematic analysis to evaluate the conditions under which mobility data can enhance county COVID-19 case prediction models, and to quantify by how much when compared to non-mobility baselines. Thus, we will extensively evaluate performance across types of prediction models, temporal prediction windows (lookaheads), mobility datasets, training-testing approaches, and county-level demographic and socio-economic characteristics associated with potential mobility data bias that could affect performance.

2.2 COVID-19 Prediction Models

A wide variety of models exist to predict regional COVID-19 cases, including epidemiological, machine learning, and statistical models. Epidemiological models, such as SEIR and SIR models, have been used to predict infection rates, and in certain cases, related work has shown that the models can be improved with human mobility data characterizing how people (agents) travel and might infect others [6, 25, 52]. However, SEIR/SIR models have a number of pitfalls, such as the large number of parameters that need to be adjusted [54], or the complexities of adding mobility responses of the population as a function of time and space [40]. For example, Roda et al. [40] found simpler SIR models to be more effective in predicting COVID-19 cases than the more complex SEIR models.

A number of machine learning techniques have also been widely applied for regional COVID-19 case prediction with mobility data, including tree-based and K-nearest neighbors models [33].
Furthermore, several works [17, 47] make use of a range of deep learning architectures, including sequential models like long short-term memory (LSTM) [35], gated recurrent units (GRUs), and recurrent neural networks (RNNs) [20], as well as spatio-temporal models like graph neural networks (GNNs) [31]. Some researchers have also incorporated static and dynamic mobility flows (characterizing average and daily mobility patterns between regions) as well as friendship networks to understand the spatio-temporal dependencies between regions that might affect infection rates and which could inform the prediction of regional COVID-19 cases [15, 47].

Statistical models have also been popular for predicting COVID-19 cases because they are transparent and simple to interpret, which is highly important for decision makers; these models can also easily incorporate mobility data along with other features. Statistical models such as autoregressive time series and linear regression have been used for COVID-19 case prediction, showing that, early on during the pandemic, mobility data improved non-mobility baselines across countries, states, and other administrative levels [17, 27]. In this article, we focus on statistical models (linear regression and ARIMAX) because we are interested in the deployment of models that are interpretable by decision makers rather than implementing black box predictive approaches that are harder to explain [42].

3 DATASETS

To assess the effectiveness of mobility data on county-level COVID-19 case prediction models, we evaluate model performance across nine mobility datasets from four different companies: SafeGraph, Google, Descartes, and Apple. We describe each of the COVID-19 and mobility datasets in detail. All datasets used in this work were freely available during the onset of the pandemic. We focus on the period from March 18, 2020 to November 30, 2020 (258 days): the former marks the start of the consistent case data availability in the United States and the latter falls just before vaccines were introduced (the Pfizer-BioNTech COVID-19 vaccine was made available on December 11, 2020). We focus on the pre-vaccine period to prevent immunity levels from acting as a confounder, since the relationship between mobility and infections post-vaccines is less clear in the literature [18].

COVID-19. We use the COVID-19 case data compiled by the New York Times¹ at the county level. To account for peaks in daily COVID-19 case counts due to delayed reporting, we use the 7-day daily rolling average of COVID-19 cases (computed as the average of its current value and the prior 6 days) instead of raw counts. We acknowledge that especially during the early stages of the pandemic, case numbers might not be reflective of the actual spread of COVID-19, in large part due to the lack of testing resources [39].

SafeGraph. SafeGraph² curates its data by tracking the movements of millions of anonymized users via mobile app SDKs, and it open sourced the mobility patterns of app users at the onset of the pandemic. Based on the data available, we use two types of features from SafeGraph datasets: daily O-D flows [30] and daily visits to Points of Interest (POIs). O-D (county-to-county) flows represent daily volumes of trips between pairs of counties across the United States, whereas visits to POIs represent the daily volume of visits to grocery stores, restaurants, religious organizations, and schools in a given county. For O-D flows, we retrieve from SafeGraph both inflows (i.e., incoming flows to county D from county O) and outflows (i.e., outgoing flows from county O to county D). All mobility features are measured as changes in volumes with respect to a baseline of normal behavior computed by SafeGraph using mobility data from February 17, 2020 to March 7, 2020.

Google. Google³ collects mobility data from users who have the location data collection option selected. During the pandemic, Google provided daily county-level mobility scores across different POI categories including parks, residential areas, and transit stations. Mobility scores are calculated as the ratio between the volume of visits on a given day during the pandemic and the volume of visits during a pre-pandemic baseline, with the baseline computed by Google as the median value for each day of week in the 5-week period from January 3, 2020 to February 6, 2020. Among all available POI categories, we selected workplaces, the category with the greatest number of counties with daily data availability in our chosen time period.

Descartes Labs. This mobility data from Descartes Labs⁴ is calculated using geolocation data from mobile devices and captures the median of the maximum distances traveled by individuals in each county each day. As with Google, this median is converted by Descartes to a ratio of pandemic mobility to baseline pre-pandemic mobility.

Apple. Apple's mobility data⁵ is collected from Apple Maps and is divided into categories by transportation method: driving, walking, and transit. We selected the driving category to measure the volume of individuals driving on a daily basis at the county level, because, similar to Google, this category had more consistent daily data availability than other categories.

Across all datasets, we only consider counties that have COVID-19 case data and mobility data available daily throughout the time period of study. Table 1 shows the number of counties that fit these criteria for each dataset; notably, Google has by far the fewest counties, with 990 out of the total 3,143 U.S. counties being represented.
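Two of the preprocessing steps described in this section lend themselves to a short sketch: the trailing 7-day average of daily cases, and a mobility score expressed as a ratio to a per-weekday median baseline (the Google-style definition given above). The function names and the synthetic visit counts are hypothetical; the providers compute their baselines internally, so this is only an approximation of the idea under those assumptions, not their implementation.

```python
from datetime import date, timedelta
from statistics import mean, median


def rolling_mean_7d(daily_counts):
    """Trailing 7-day average: each day averaged with the prior 6 days.

    The first 6 days lack a full window and are returned as None.
    """
    out = []
    for i in range(len(daily_counts)):
        if i < 6:
            out.append(None)
        else:
            out.append(mean(daily_counts[i - 6 : i + 1]))
    return out


def weekday_baseline(dates, visits):
    """Median visit volume for each day of the week over a baseline period."""
    by_weekday = {}
    for d, v in zip(dates, visits):
        by_weekday.setdefault(d.weekday(), []).append(v)
    return {weekday: median(values) for weekday, values in by_weekday.items()}


def mobility_ratio(day, visits, baseline):
    """Visits on a given day relative to the baseline for the same weekday."""
    return visits / baseline[day.weekday()]


# Synthetic 5-week baseline window (January 3 to February 6, 2020).
baseline_dates = [date(2020, 1, 3) + timedelta(days=i) for i in range(35)]
baseline_visits = [100 + 10 * d.weekday() for d in baseline_dates]
baseline = weekday_baseline(baseline_dates, baseline_visits)

# A hypothetical Monday during the pandemic with half the usual visits.
ratio = mobility_ratio(date(2020, 4, 6), 50, baseline)

smoothed = rolling_mean_7d([1, 2, 3, 4, 5, 6, 7, 8])
```

Returning None for the first six days is one possible convention; dropping those days or using a shorter initial window would be equally reasonable choices.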

METHODOLOGY

In this article, we aim to provide a much-needed systematic analysis of the conditions under which mobility data can enhance county COVID-19 case prediction models, and to quantify by how much when compared to non-mobility baselines. Our main objective is to inform regional decision makers about the potential and pitfalls of COVID-19 case prediction models that use mobility data, given the high cost of acquiring such data. Next, we describe the prediction models we evaluate and their parameter adjustment, the training-testing approaches, and the overall evaluation approach.

1 https://github.com/nytimes/COVID-19-data
2 https://www.safegraph.com/
3 https://www.google.com/covid19/mobility/
4 https://github.com/descarteslabs/DL-COVID-19
5 https://covid19.apple.com/mobility

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 16. Publication date: December 2023.

S. Mohammad Abrar et al.

Table 1. Number of Counties Considered for Each Mobility Dataset

Mobility Dataset                           County Count
Apple                                      2,064
Descartes                                  2,551
Google                                     990
SafeGraph Inflows                          3,116
SafeGraph Outflows                         3,116
SafeGraph POI (Grocery Stores)             3,065
SafeGraph POI (Religious Organizations)    3,076
SafeGraph POI (Restaurants)                3,091
SafeGraph POI (Schools)                    3,084

4.1 Prediction Models

As described in Section 2, we consider two types of predictive models commonly used by decision makers due to their interpretability: linear regressions and time series models. In contrast to more complex epidemiological models, which are hard to tune due to their parametric nature, and to deep learning models with black box architectures, linear models and time series models are easier to interpret, providing decision makers with the ability to clearly explain their policies [40, 42, 54]. To evaluate the effectiveness of mobility data in the prediction of county-level COVID-19 cases, we analyze the performance of linear regressions and time series models using (1) only past county-level COVID-19 case data as an independent variable to predict future COVID-19 cases (these are the non-mobility baselines) and (2) both past county-level COVID-19 case data and county-level mobility data as independent variables, under the assumption that how people moved in the past could provide additional information about how people get infected in the future. Both non-mobility baselines and mobility-based models are evaluated across five temporal prediction windows (a.k.a. lookaheads): 1, 7, 14, 21, and 28 days. Next, we provide further details for each predictive model, and in Section 4.4, we discuss the parameter tuning.

4.1.1 Linear Regression. We train one linear regression model per county and lookahead.
For the non-mobility baselines, the number of county-level COVID-19 cases for a given lookahead is predicted using the COVID-19 cases from the previous lookahead value: for example, for lookahead 1, the COVID-19 cases on day x are used to predict cases for day x+1, whereas for lookahead 14, the COVID-19 cases on day x are used to predict cases for day x+14. For mobility-based models, the number of county-level COVID-19 cases for a given lookahead is predicted using (1) the COVID-19 cases from the previous lookahead value as well as (2) the 10-day lagged mobility features; that is, the mobility features from day x are used to predict COVID-19 cases for day x+10, since the infection gap (the period from exposure to being able to spread the virus) has been associated with that lag by many reports in the literature [17, 27, 53]. Lagged mobility data is only included in the mobility-based models that will be compared against the non-mobility baselines.

4.1.2 ARIMAX. We train one ARIMA time series forecasting model per county and use it to predict county-level COVID-19 cases for all five lookaheads. Similarly to linear regression, we train non-mobility baselines exclusively with COVID-19 case time series, whereas mobility-based models incorporate 10-day lagged mobility data series. Mobility data is incorporated using ARIMAX models, which allow exogenous covariates to be added to time series forecasting models. ARIMAX forecasting models have three components that function as parameters: (1) p (autoregressive order, AR) is the number of lagged observations of the dependent variable included as regressors, (2) d (integrated order, I) is the number of times the time series needs to be differenced to achieve stationarity, and (3) q (moving average order, MA) is the number of lagged forecast errors included in the model. We discuss the tuning of these parameters in Section 4.4.

The Use of Human Mobility Data in COVID-19 Predictions

Fig. 1. Training-testing approaches for LTW and STW. The LTW approach for linear regression is not represented because it is the standard 70/30 approach.

4.2 Training-Testing Approaches

To explore the conditions under which mobility helps improve predictive performance, we test two training-testing approaches that have been used in the literature: the Long Training Window (LTW) approach and the Short Training Window (STW) approach. The STW approach focuses on the use of small training datasets to test prediction accuracy for the next 28 days (lookaheads 1, 7, 14, 21, and 28), whereas the LTW approach uses much larger training datasets to predict values for the five lookaheads. Given the high cost of acquiring human mobility data (in fact, none of the companies described earlier offer their data for free anymore), the objective of evaluating these two training approaches is to understand the impact of cost on the performance of COVID-19 prediction models (i.e., buying more mobility data (LTW) versus less (STW)). Next, we explain the implementation of LTW and STW for linear regression and ARIMAX models (also depicted in Figure 1).

4.2.1 Linear Regression. The LTW approach creates one regression model per lookahead per county, trained on the first 70% of days in our dataset and tested on the remaining 30%. This splitting approach is consistent across lookaheads, easing interpretability, but it also means that the split is not exactly 70/30 for each lookahead; that is, at higher lookaheads, the training set becomes smaller while the testing set remains the same size. The STW approach uses a 60-day sliding window for the training set and a sliding 28-day window for the testing set (smaller training sizes produced extremely low-performing results and were not considered). Unlike the LTW approach, the testing set consists of one date for each lookahead (1 day after the end of the training set, 7 days after, etc.). Results are reported by averaging performance metrics across all testing datasets.
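The two splitting schemes for linear regression described above amount to index arithmetic over the days of a county's time series. The following is a minimal sketch, not the authors' implementation: the function names, the zero-based day indices, and the one-day step between consecutive STW windows are assumptions made for illustration (the step size for linear regression is not stated above).

```python
from typing import Iterator, List, Tuple

LOOKAHEADS = (1, 7, 14, 21, 28)


def ltw_split(n_days: int, train_pct: int = 70) -> Tuple[range, range]:
    """LTW: train on the first train_pct% of days, test on the remainder."""
    cut = n_days * train_pct // 100  # integer arithmetic avoids float rounding
    return range(0, cut), range(cut, n_days)


def stw_windows(
    n_days: int, train_len: int = 60, step: int = 1
) -> Iterator[Tuple[range, List[int]]]:
    """STW: slide a fixed-length training window forward through the series.

    Yields (training day indices, test day indices); the test days are the
    dates 1, 7, 14, 21, and 28 days after the last training day.
    """
    horizon = max(LOOKAHEADS)
    start = 0
    # Stop once the furthest lookahead would fall outside the series.
    while start + train_len - 1 + horizon < n_days:
        last_train_day = start + train_len - 1
        test_days = [last_train_day + h for h in LOOKAHEADS]
        yield range(start, start + train_len), test_days
        start += step


if __name__ == "__main__":
    train, test = ltw_split(100)
    print(len(train), len(test))
    first_train, first_test = next(stw_windows(100))
    print(first_train, first_test)
```

Each STW window drops the oldest day as it advances, so every model is fit on exactly 60 days, whereas the LTW split is computed once per county.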


4.2.2 ARIMAX. The ARIMAX LTW approach uses an expanding window protocol [26, 38] that first trains ARIMAX on a data window covering the first 70% of the entire time series and tests it on the next 28 days to evaluate accuracy per lookahead. After that, the training data window is expanded by 1 day at a time, without dropping older data points, and tested on the subsequent 28-day windows to assess accuracy at each lookahead. In contrast, the STW approach adopts a sliding window protocol [38] whereby the training window length remains fixed at 60 days at each train-test step. Each training window is obtained by shifting the prior window forward by one day, effectively discarding the oldest data point. Results are reported by averaging performance metrics across all testing datasets.

4.3 Model Performance

We measure individual county prediction performance as the correlation between the predicted volume of COVID-19 cases for that county and its actual case numbers retrieved from the COVID-19 official report dataset. To assess the effectiveness of enhancing COVID-19 case prediction models with mobility data, we report the correlation improvement (ci) of the mobility-based models over non-mobility baselines. Given that the only difference between mobility-based models and non-mobility baselines is the mobility data, we posit that ci allows us to measure the effectiveness of using mobility data in COVID-19 predictive settings. Specifically, we compute ci as

$ci = pcorr_{Mobility} - pcorr_{Baseline}$

with $pcorr_{Mobility}$ representing model performance (correlation) when mobility data is used and $pcorr_{Baseline}$ measuring the correlation when no mobility data is used in the COVID-19 case prediction model. Given that correlation values lie between −1 and +1, ci will be within the [−2, 2] range.
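As a concrete illustration of the correlation improvement defined above, the sketch below computes ci for one county from three aligned series. It assumes the correlation is the Pearson coefficient (suggested by the name pcorr but not stated explicitly), and the function names and toy numbers are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


def correlation_improvement(
    actual: Sequence[float],
    pred_mobility: Sequence[float],
    pred_baseline: Sequence[float],
) -> float:
    """ci = corr(actual, mobility-based predictions) - corr(actual, baseline predictions)."""
    return pearson(actual, pred_mobility) - pearson(actual, pred_baseline)


if __name__ == "__main__":
    # Made-up case counts and predictions for a single county.
    actual = [10.0, 12.0, 15.0, 20.0, 26.0]
    with_mobility = [11.0, 12.5, 14.0, 21.0, 25.0]
    baseline = [10.0, 10.5, 16.0, 15.0, 22.0]
    print(correlation_improvement(actual, with_mobility, baseline))
```

A positive value means the mobility-based model tracks the actual cases more closely than the baseline; the extreme value of 2 arises only when the baseline is perfectly anti-correlated and the mobility-based model perfectly correlated.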
4.4 Adjusting Predictive Models

To identify the best linear regression implementation, we computed the average performance of Ridge, Lasso, ElasticNet, and OLS trained with and without mobility data for each lookahead and training approach (LTW and STW), and across all mobility datasets. We then selected the implementation with the majority of best performance values: Ridge for LTW and ElasticNet for STW. Appendix Table 7 shows the detailed numbers. For ARIMAX, the optimal p, d, and q values for the LTW and STW models were chosen via a grid search that minimizes the Akaike information criterion. The p, d, and q values were selected across lookaheads, and the same values were used for mobility-based models and non-mobility baseline models for comparison analyses. Appendix Table 14 shows a summary of the p, d, and q values identified across all counties.

4.5 Evaluation Approach

To analyze the conditions under which mobility data does or does not enhance county-level COVID-19 case prediction models that do not use mobility as a source of information, we carry out the following analyses (results are presented in Section 5). In Section 5.1, we analyze the number of counties for which mobility data improved the individual prediction performance of non-mobility baselines, and we quantify the improvements, with a focus on models trained with the LTW approach. In Section 5.2, we delve into the demographic and socio-economic characteristics of these counties to assess whether mobility data bias (whereby certain demographic groups are over- or under-represented in the data) might explain the differences in the performance of COVID-19 prediction models across counties.
We also analyze whether the differences in the performance of mobility-based models over non-mobility baselines across counties vary depending on (1) the training approach (STW versus LTW; Section 5.3), (2) the mobility dataset used (10 different data sources; Section 5.4), and (3) the predictive model (linear regression versus ARIMAX; Section 5.5). As stated earlier, our main objective is to shed light on the use of mobility data in county-level COVID-19 case prediction models, to understand when it works (or not), and why, so as to inform decision makers assessing effectiveness-cost tradeoffs given that mobility data is not freely accessible.

Fig. 2. Percentage of counties for which adding mobility data to the COVID-19 case prediction model improves the prediction performance (ci ≥ 0). Results are for the LTW approach.

5 RESULTS AND ANALYSIS

5.1 Does Mobility Data Help, and by How Much?

Figure 2(a) and (b) show the percentage of counties for which the prediction performance of mobility-based models improves over that of the corresponding non-mobility baselines for ARIMAX and Ridge regression models, respectively, using the LTW approach (a comparison with STW is discussed in Section 5.3). In other words, these plots show the percentage of counties for which the correlation improvement ci > 0. These plots reveal several important insights: (1) incorporating mobility data into county-based COVID-19 case prediction models helps in, at most, 60% of the counties analyzed, leaving at least 40% of the counties with prediction performances lower than their non-mobility counterparts (i.e., mobility data frequently hurts prediction performance), and (2) mobility data appears to help more in longer-term predictions (lookaheads 14, 21, and 28) than in shorter-term predictions; that is, in the short term, COVID-19 statistics are generally informative enough and provide predictions that are more accurate than those obtained when mobility data is added to the model, whereas for longer lookaheads, adding mobility data to predictive COVID-19 case models provides additional information that frequently improves the predictive accuracy of the non-mobility baseline models. This trend is clear for the ARIMAX model, whereas for Ridge, it is more apparent for the Descartes dataset and for the SafeGraph inflow and outflow datasets.

We have shown that adding mobility data to COVID-19 case prediction models improves their performance for, at most, 60% of the counties, across datasets, models (ARIMAX and Ridge), and lookaheads for the LTW approach. Next, we aim to quantify the performance improvement; that is, does mobility data produce small or large prediction improvements when compared to non-mobility baselines?

Fig. 3. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for ARIMAX with the LTW approach using Apple mobility data (a) and SafeGraph Inflow mobility data (b).

Quantifying Improvements. For this analysis, we look into two metrics per dataset and model: (1) the distribution of the county correlation improvements ci for each lookahead, to explore the correlation improvements brought about by adding mobility data to the county prediction models (since we are assessing the improvement over non-mobility baselines, only counties with correlation improvements are considered), and (2) to contextualize these numeric improvements, the distribution of the correlations (between predicted and actual cases) for the non-mobility baseline models, restricted to the counties that showed a correlation improvement so as to match the distributions in the first metric. By analyzing the improvements brought about by adding mobility data, and by comparing these improvements with the actual baseline correlations, we are able to provide nuanced insights about when mobility data aids COVID-19 case prediction models. In this section, we focus on the LTW approach. Differences across training approaches, datasets, and models are covered in Sections 5.3, 5.4, and 5.5. Figures 3 and 4 show an example of the distribution of the county correlation improvement (ci) and the distribution of the non-mobility baseline correlation values for ARIMAX and Ridge, respectively, across lookaheads and using the LTW approach. For clarity, both figures only show distributions for the Apple and SafeGraph Inflow datasets. Plots for the remaining datasets can be found in the appendix (see Appendix Figures 6 and 7).
The correlation improvements (ci) across datasets show that median ci values are between 0.0 and 0.1; that is, for counties where adding mobility data improves the prediction accuracy, it does so by a maximum of 0.1 for 50% of the counties, across models (Ridge/ARIMAX), datasets, and lookaheads (see Figures 3 and 4 for a sample trend, and Appendix Figures 6 and 7 for the full set), with the largest correlation improvements associated with higher lookaheads. These are modest correlation improvements that might not change the strength of the non-mobility baseline correlation; for example, a baseline moderate correlation of 0.5 will still be moderate after a 0.1 improvement [1]. Looking at upper quartile values (Q3), we observe slightly better correlation improvements in the 0.0 to 0.4 range, with the majority of Q3 values under 0.3, across models, datasets, and lookaheads, revealing ci values that could improve correlation strengths. Maximum values (Q4) are in the 0 to 0.9 range across models, datasets, and lookaheads, with the majority of maximum values under 0.5, and outliers can reach correlation improvement values of up to 1.9. These less frequent, extreme correlation improvements point to situations where negative baseline correlations (from the non-mobility models) are changed to positive correlations, revealing counties where adding mobility helps turn badly performing models into good ones, although these large improvements happen only for a handful of counties.

Fig. 4. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for Ridge regression with the LTW approach using Apple Mobility data (a, b) and SafeGraph Inflow Mobility data (c, d).

The correlation improvement plots also show that for higher lookaheads, the median value and the right skewness of the correlation improvement distribution increase across datasets and models. This finding shows that adding mobility data to COVID-19 prediction models produces better-performing models over non-mobility baselines for longer-term predictions, whereas short-term prediction models (next day) barely benefit from adding mobility data, with correlation improvement values close to zero. We posit that this might be because non-mobility baselines achieve high correlations for lower lookaheads, making it very hard to improve on the baselines when adding mobility data. The baseline correlation plots in Figures 3 and 4 (as well as Appendix Figures 6 and 7) show that correlations for non-mobility baselines and lookahead 1 (next-day prediction) have median values of at least 0.9, whereas for higher lookaheads (21 and 28), non-mobility baselines have much lower correlations, with median values in the range (−0.21, 0.21).
In summary, these findings show that mobility data helps, at most, in 60% of the counties in the datasets analyzed, and that correlation improvements range from minimal for next-day predictions to small for higher lookaheads, with 50% of the improving counties showing modest correlation improvements of at most 0.1 and a further 25% showing correlation improvements of at most 0.3. Larger outlier improvements of up to 1.9 at higher lookaheads are found for a handful of counties.
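The two summaries used in this section (the share of counties with ci > 0, and the quartiles of ci among those counties) can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout, the county labels, and the ci values are made up, and the "inclusive" quartile method is an assumption, since the quartile convention used for the plots is not stated.

```python
import statistics
from typing import Dict


def summarize_ci(ci_by_county: Dict[str, float]) -> Dict[str, float]:
    """Share of counties with ci > 0, plus median and Q3 of ci among them."""
    values = list(ci_by_county.values())
    improved = [v for v in values if v > 0]  # counties that benefit from mobility data
    if len(improved) < 2:
        raise ValueError("need at least two improving counties to compute quartiles")
    _q1, median, q3 = statistics.quantiles(improved, n=4, method="inclusive")
    return {
        "share_improved": len(improved) / len(values),
        "median_ci": median,
        "q3_ci": q3,
        "max_ci": max(improved),
    }


if __name__ == "__main__":
    # Hypothetical ci values for eight counties at a single lookahead.
    toy_ci = {"A": 0.1, "B": -0.05, "C": 0.3, "D": 0.2,
              "E": -0.2, "F": 0.5, "G": 0.4, "H": 0.0}
    for name, value in summarize_ci(toy_ci).items():
        print(f"{name}: {value:.3f}")
```

Restricting the quartiles to counties with ci > 0 mirrors the choice described above of considering only counties with correlation improvements when quantifying them.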


5.2 What Types of Counties Benefit from Mobility Data? A Mobility Data Bias Analysis

As stated in Section 1, we hypothesize that a major reason adding mobility data does not improve the accuracy of COVID-19 case prediction models in many more counties might be the bias present in the mobility data; that is, counties with large percentages of racial minority, elder, or low-income populations might be under-represented in the mobility datasets, thus negatively impacting the prediction performance [7, 43]. To assess that hypothesis, we carry out a regression coefficient analysis for each mobility dataset and predictive model using demographic and socio-economic county characteristics as independent variables and the positive correlation improvements across lookaheads as the dependent variable. Our analysis reveals the statistically significant associations of demographic and socio-economic features with correlation improvements, and the direction of each association (positive or negative).

Based on prior work on mobility data bias, we consider the following demographic and socio-economic county variables from the 2019 Census data: age 65+ (percentage of county residents age 65 or older), income (median household income for that county), Black (percentage of county residents that identify as Black), Hispanic (percentage of county residents that identify as Hispanic), race-Other (percentage of county population that identifies as not White, Black, or Hispanic), and urban-rural (National Center for Health Statistics (NCHS) Urban-Rural Classification from 1 to 6, where 1 is Large Central Metro and 6 is Non-core (extreme rural)). We also consider, as independent variables in the regression model, all possible paired interactions between the variables described, obtained by multiplying the values of each pair of variables and standardizing them.
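The construction of the paired interaction terms can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: it assumes that the raw values of each pair are multiplied first and the product is then z-scored (the text does not specify the order), the function and column names are hypothetical, and the county values are invented.

```python
import statistics
from itertools import combinations
from typing import Dict, List


def standardize(values: List[float]) -> List[float]:
    """Z-score a column: subtract the mean, divide by the population std. dev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]


def pairwise_interactions(features: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Build one standardized product column per unordered pair of features."""
    interactions = {}
    for a, b in combinations(features, 2):
        product = [x * y for x, y in zip(features[a], features[b])]
        interactions[f"{a}:{b}"] = standardize(product)
    return interactions


if __name__ == "__main__":
    # Invented values for five counties; one list entry per county.
    county_features = {
        "age65": [12.0, 18.5, 22.1, 15.3, 30.2],
        "income": [65.0, 48.2, 39.9, 72.4, 41.0],
        "black": [5.1, 22.4, 1.3, 9.8, 40.6],
        "hispanic": [18.0, 4.2, 2.5, 30.1, 3.3],
        "race_other": [6.4, 2.0, 1.1, 9.9, 2.7],
        "urban_rural": [1.0, 4.0, 6.0, 2.0, 5.0],
    }
    terms = pairwise_interactions(county_features)
    print(len(terms), sorted(terms)[:3])
```

With six county variables this yields 15 interaction columns, which would enter the regression alongside the six main effects.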
For interaction interpretability, we also ensure that higher values for each feature match with our hypothesis for lower performance. Thus, we change the directionality of the income by negating its values to match our hypothesis that lower income might be associated with worse coverage of mobility data and thus worse performance (all other features stay the same). Finally, we would like to clarify that the COVID-19 case data itself might also suffer from different types of bias due to inaccurate data collection processes [43]. Nevertheless, since that bias affects both mobility-based and non-mobility baseline models, and since we are looking at the correlation improvement differences between the two, we can claim that any performance differences observed can be attributed to the mobility data, which is the only difference between the two models. Next, we discuss the demographic and socio-economic variables that were found to be statistically significantly related to correlation improvements for the Ridge and ARIMAX models with the LTW training approach (results for the STW approach will be discussed in Section 5.3). A detailed presentation of the coefficients and their significance can be checked in Appendix Tables 8 and 10. Findings. For the Ridge model, we observe that the percentage of Black and Hispanic population, rurality, and age above 65 years are significantly and negatively related to correlation improvements across most datasets—that is, counties with higher percentages of minority race and ethnicity, rural, or elder populations benefit less from the addition of mobility data, with increases in these populations related to lower performance improvements over the non-mobility baselines. We posit that this is probably due to potential sampling bias in the mobility datasets, with under-representation of race, ethnicity, old age, and rurality in mobility datasets pointing to worse overall predictive performance. 
Lower income, however, was significantly positively associated with higher correlation improvements when mobility was added to individual county-level COVID-19 case prediction models. At first, this result was counterintuitive, since we were expecting that lower incomes would be associated with lower access to smartphones. Nevertheless, looking at the interaction terms between low income and age 65+ as well as low income and race/ethnicity (Black and Hispanic), we observe significant negative coefficients, pointing to the fact that counties with higher percentages of low-income Black and Hispanic groups, as well as a low-income elder population, are associated with lower correlation improvements, possibly due to these groups not being well represented in the datasets.

For the ARIMAX model, the trends were not pervasive across mobility datasets, with specific coefficients observed for different datasets. Age 65+ and rurality were negatively related to correlation improvements for the Apple and some SafeGraph datasets, pointing to the fact that elder and rural populations might not be fairly represented in the datasets, and confirming prior work on SafeGraph bias analysis [7]. Apple and SafeGraph also show a significant negative relation between race-Other (not Black, White, or Hispanic) and correlation improvements, potentially pointing to the fact that other minority races/ethnicities might not be as well represented in the mobility datasets as Whites, Blacks, and Hispanics. There were two instances with counterintuitive results. First, in the Descartes dataset, race-Other was unexpectedly significantly positively associated with correlation improvements. However, when considering the interaction between race-Other and rurality, the significant coefficients were negative; that is, counties with high percentages of other minority races in rural settings are negatively related to correlation improvements and thus potentially not fairly represented in the Descartes dataset. Second, lower income had a significant positive coefficient for Apple; that is, counties with lower incomes were associated with higher correlation improvements. Nevertheless, the interaction term between low income and race-Other for the Apple dataset has a significant negative coefficient, revealing worse correlation improvements for counties with high percentages of minority races and low income, potentially due to sampling bias.

It is important to clarify that despite the significance of many coefficients, the R-square values for the regressions were low, pointing to the fact that there exist other behavioral or pandemic features that could also explain correlation improvements, such as masking mandates, masking behaviors, or transmission rates, among others. Nevertheless, statistics for these features at the county level and for our period of study were not accessible.

5.3 Do Correlation Improvements Change across Training Approaches?

In this section, we explore whether the improvements brought about by adding mobility data to county-based COVID-19 case prediction models are different depending on the type of training approach. Specifically, we discuss the differences in percentages of counties that benefit from adding mobility data, quantify correlation improvements, and discuss the impact of mobility data bias on county-level COVID-19 case prediction models that have been trained with STW as opposed to LTW approaches. Our main objective is to understand if using short-term training windows (which requires considerably less data and hence reduces data costs) has an impact on how COVID-19 case prediction models benefit from mobility data. Percentage of Counties. Figure 5 shows the percentage of counties whose STW-trained COVID-19 case prediction performance improves when adding mobility data. Compared to Figure 2, we can observe that for lower lookaheads (1, 7, and 14), the percentage of counties that benefit from adding mobility data to STW-trained models is smaller than the percentage for LTW-trained models, whereas the percentage of counties remains similar for higher lookaheads (21 and 28) across both linear and ARIMAX models. To assess the statistical significance of these observations, we run a Mann-Whitney U test between the STW percentages and the LTW percentages for each lookahead, and across the 10 datasets and two predictive models (Table 2 presents the details). The distributions were found to be statistically significantly different (p-value < 0.05) for lookaheads 1 and 7, with STW-trained models having significantly smaller median percentages of counties benefiting from adding mobility data (from 16%–31%) than LTW-trained models (from 40%–43%). Correlation Improvements. To quantify correlation improvement differences between STWand LTW-trained models, we run a Mann-Whitney U test for each lookahead between the ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 
2, Article 16. Publication date: December 2023.

16:14

S. Mohammad Abrar et al.

Fig. 5. Percentage of counties for which adding mobility data to the COVID-19 case prediction model improves the prediction performance (ci ≥ 0). Results are for the STW approach. Table 2. Mann-Whitney U Test between the STW and LTW Distributions of the Percentage of Counties That Benefit from Adding Mobility Data across the 10 Datasets and Two Models for Each Training Approach

Lookahead p-Value Avg. % Benefit STW Avg. % Benefit LTW 1 0.00002 0.16473 0.40074 7 0.00004 0.31053 0.43324 14 0.08103 0.41427 0.46833 21 0.20845 0.4722 0.49895 28 0.3104 0.48988 0.50206 correlation improvement (ci) distributions for STW and LTW across the 10 mobility datasets and two predictive models—that is, we measure whether the improvements brought about by adding mobility data to COVID-19 case prediction models are statistically significantly different across training approaches, and by how much (distribution plots can be checked in Appendix Figures 6–9). The tests—shown in Table 3—reveal that the two distributions are statistically significantly different (p-value < 0.05) across all lookaheads, with lower median correlation improvement values for LTW-trained models than their STW-trained counterpart (0.0002–0.0442 vs. 0.0018–0.0614), and with a slightly lower maximum median ci value of 0.1, as opposed to STW-trained maximum ci = 0.13. In other words, although a higher percentage of LTW-trained counties improved their performance when adding mobility data, the median improvement range is slightly lower than their STW counterpart. Mobility Data Bias. We repeat the regression coefficient analysis discussed in Section 5.2 for the STW training approach, and a comparison with the associations revealed for the LTW training in that section revealed similar findings. For regression models, similarly to LTW, age 65+, income, ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 16. Publication date: December 2023.

The Use of Human Mobility Data in COVID-19 Predictions

16:15

Table 3. Mann-Whitney U Test Per Lookahead between the STW and LTW Distributions of the Correlation Improvements across the 10 Datasets and Two Models for Each Training Approach

Lookahead p-Value Avg. Median ci STW Avg. Median ci LTW 1 0.00334 0.00183 0.00029 7 0.00001 0.01693 0.00738 14 0 0.0321 0.01674 21 0.00009 0.04836 0.02939 28 0.01143 0.06145 0.04425 race Black, race-Other, and rurality are all negatively related to correlation improvement—that is, counties with higher percentages of elder, rural, low income, or minority populations do not benefit as much from the use of mobility data, possibly due to lack of representativity of these groups in the mobility datasets. Nevertheless, the number of significant coefficients across datasets for STW is lower than for LTW—although all datasets except for Google show one or another type of bias. This result highlights that mobility data bias might be more constrained when shorter amounts of training data are used in the model. For ARIMAX models, and similarly to LTW, we found that age 65+, ethnicity Hispanic, and race-Other are negatively associated with correlation improvements—that is, adding mobility data in counties with elder people and other minority race/ethnicity worsens the predictive performance when compared to non-mobility baselines, potentially revealing bias in the data collected for these groups. However, although low income and Black race are positively related to correlation improvements, interaction terms between these two factors and age 65+ showed that counties with a high percentage of elder Black population or elder low-income population were significantly negatively related to correlation improvement—that is, the mobility data associated with these groups might not be representative, thus affecting the quality of the predictions. Finally, the interaction between rurality and age 65+ partially showed that counties with higher percentages of elder rural communities did not always benefit from using SafeGraph mobility data. 
For further details, we encourage the reader to compare Appendix Table 10 with Appendix Table 11 (for regression models) and Appendix Table 8 with Appendix Table 9 (for ARIMAX models). These analyses show that the training-testing approach impacts the number of counties that benefit from adding mobility data and creates a tradeoff: LTW improves performance for more counties, albeit with smaller correlation improvements with respect to their baselines and with more bias across socio-economic and demographic variables, whereas STW requires less data and constitutes a more affordable approach.

5.4 Do Correlation Improvements Change across Mobility Datasets?

In this section, we explore whether the improvements brought about by adding mobility data to county-based COVID-19 case prediction models differ depending on the type of dataset. As in previous sections, we analyze differences in the percentage of counties that benefit from adding mobility data to their predictive models, analyze differences in correlation improvements, and evaluate the role of mobility data bias in those differences. Table 4 summarizes average improvements per mobility dataset across lookaheads, training approaches, and predictive models for (1) percentage of counties and (2) correlation improvements. To evaluate statistically significant differences across datasets, we compute the Friedman test, a non-parametric test that evaluates whether median values across datasets are statistically significantly different.

Percentage of Counties. We run the Friedman test with the distribution of the percentage of counties that benefit from adding mobility data across the five lookaheads, two models, and


Table 4. Improvement Statistics Per Mobility Dataset across Training Approaches and Predictive Models

Dataset                               Avg. % of Counties Improved   Avg. Median ci Improvement
Apple Mobility                        0.38107                       0.0337
Descartes Mobility                    0.42670                       0.02363
Google Mobility                       0.53613                       0.01293
(SafeGraph) Grocery Store Mobility    0.42302                       0.01816
(SafeGraph) Religious Org Mobility    0.44899                       0.02298
(SafeGraph) Restaurants Mobility      0.42864                       0.02375
(SafeGraph) Schools Mobility          0.43075                       0.02123
SafeGraph Inflow Mobility             0.42365                       0.02751
SafeGraph Intraflow Mobility          0.46907                       0.02720
SafeGraph Outflow Mobility            0.43595                       0.02938

The percentage of counties column represents the average percentage of counties whose COVID-19 case prediction models benefit from adding mobility data, whereas the correlation improvement (ci) column quantifies the average median ci improvement over the non-mobility baseline.

two training approaches for each mobility dataset. The null hypothesis was rejected (χ² = 28.29, p-value = 0.0009), thus pointing to statistically significant differences between the average percentages across datasets. The Friedman test does not identify which distribution(s) are different; however, Table 4 shows that the percentage of counties with improvements is considerably higher for the Google dataset when compared to others, possibly due to the smaller number of counties used in the analysis. Thus, we removed the Google dataset from our set of distributions and repeated the Friedman test, which did not find any statistically significant difference between datasets. That is, the percentage of counties that benefit from adding mobility data to COVID-19 prediction models does not significantly change depending on the mobility dataset, except for Google (see Appendix Table 12 for further test details).

Correlation Improvements. We run the Friedman test to evaluate whether the correlation improvement brought about by different mobility datasets was statistically significantly different across datasets. The Friedman test between the median correlation improvements across models, training approaches, and lookaheads for each dataset rejected the null hypothesis (χ² = 182.7, p-value = 0), thus pointing to statistically significant differences across datasets. Removal of individual datasets did not change the result of the tests (null hypotheses rejected), revealing that correlation improvements are significantly different across mobility datasets, with Apple and SafeGraph (which includes O-D flows) having the highest improvements across training approaches and predictive models. Nevertheless, it is important to highlight that, despite being statistically significant, the improvements across datasets were very modest, with a maximum value of 0.033 (see Appendix Table 13 for further test details).

Mobility Data Bias. Finally, in terms of bias, all mobility datasets were associated with bias in race, age 65+, income, and rurality, either as independent features or via feature interactions. The exceptions are the Google dataset and SafeGraph Grocery Stores, which were associated only with race, age, and income bias; rurality did not appear to play a role in their correlation improvements (see Appendix Tables 8, 9, 10, and 11 for further details).

To summarize the analysis presented in this section, Apple and SafeGraph datasets appear to bring about statistically significantly higher, yet very modest, correlation improvements (maximum value of 0.033), but the percentage of counties and the bias identified are similar across mobility datasets.

5.5 Do Correlation Improvements Change across Predictive Models?

In this section, we explore whether the improvements brought about by adding mobility data to county-based COVID-19 case prediction models are different depending on the type of predictive


Table 5. Mann-Whitney U Test between the ARIMAX and Linear Distributions of the Percentage of Counties That Benefit from Adding Mobility Data across the 10 Datasets and Two Training Approaches for Each Predictive Model

Lookahead   p-Value   Average % ARIMAX   Average % Linear Regression
1           0.0002    0.16672            0.39875
7           0.00512   0.32085            0.42292
14          0.00069   0.40169            0.48091
21          0.00004   0.4376             0.53354
28          0.00004   0.45166            0.54029

Table 6. Mann-Whitney U Test Per Lookahead between the ARIMAX and Linear Distributions of the Correlation Improvements across the 10 Datasets and Two Training Approaches for Each Model Type

Lookahead   p-Value   Average Median ci ARIMAX   Average Median ci Regression
1           0.00007   0.00196                    0.00029
7           0.94738   0.01544                    0.0147
14          0.00298   0.02564                    0.03478
21          0.00003   0.03196                    0.0612
28          0.00001   0.0381                     0.08195

model. Revisiting Figures 2 and 5, we can visually observe that the percentage of counties that benefit from adding mobility data to county-level COVID-19 prediction models is larger when the models are linear regressions rather than ARIMAX. To assess the statistical significance of this difference, we run a Mann-Whitney U test for each lookahead between the percentage of counties that benefited from adding mobility data to linear models and the percentage of counties that benefited from adding mobility data to ARIMAX models across the 10 datasets and two training approaches. The test showed that the differences between the two types of models are statistically significant across all five lookaheads, with the percentage of counties that benefit from adding mobility data being smaller for ARIMAX models (between 16% and 45%) than for linear models (39%–54%). Table 5 shows further details of the statistical test.

Similarly, to assess the statistical significance of the differences in the correlation improvements (ci) between linear and ARIMAX models, we run a Mann-Whitney U test for each lookahead between the ci values associated with adding mobility data to linear models and the ci values associated with adding mobility data to ARIMAX models across the 10 datasets and two training approaches. Except for lookahead 7, all lookaheads show significantly different ci distributions, with larger improvements associated with linear regression models at higher lookaheads (0.00029–0.08195 for linear vs. 0.00196–0.0381 for ARIMAX). Full test details are available in Table 6.

Finally, in terms of bias differences across models, we can observe that both models suffer from similar types of bias across datasets and training approaches (i.e., income, race, age 65+, and, less frequently, rurality). Nevertheless, there is one main difference between linear regression and ARIMAX models.
Linear models trained with Google mobility data (with both the STW and LTW training approaches) did not identify any significant socio-economic or demographic features in the bias analysis, meaning that these features do not play a role in the performance improvement of county-level COVID-19 case linear predictions. It is important to clarify that although some bias was identified for ARIMAX models trained with Google mobility data, the number of features identified was also considerably lower than for any other dataset. We posit that this could be related to the smaller number of counties available (990, see Table 1). Detailed bias numbers can be found in Appendix Tables 8 and 9 (ARIMAX) and Appendix Tables 10 and 11 (linear regressions).

Overall, linear regressions appear to be a better choice, with a larger number of counties benefiting from adding mobility data, larger correlation improvements at higher lookaheads, and similar bias.

6 DISCUSSION

Our analysis of the use of mobility data for COVID-19 case prediction has shown the heterogeneity and limitations of the benefits of mobility data inclusion. As discussed in Section 5, at most 60% of the counties improve their performance when adding mobility data to the prediction model (top value for linear regressions with the LTW approach). The median correlation improvements across lookaheads, datasets, and models are minimal for next-day predictions and modest for higher lookaheads, with 50% of the counties showing correlation improvements of at most 0.1 and 25% of the counties showing correlation improvements of at most 0.3. Our results have shown that, even in the best case, 40% of counties would see no improvement when mobility data is added to their COVID-19 case prediction models, making the question of whether or not mobility data would be helpful to county-level decision makers essentially a coin flip.

Although companies made mobility data freely available at the initial height of the COVID-19 pandemic, they have already begun to discontinue these practices (Apple’s reports are no longer available, and Google stopped releasing data as of October 15, 2022). As decision makers face the continued spread of COVID-19 and potential future diseases, they must consider whether it will be worth purchasing mobility data. Our analysis shows that purchasing county-level mobility data will not benefit many counties, and decision makers should proceed with caution accordingly.

It is also concerning that the extent to which mobility data improves predictions is in part a function of the composition of the population in the county. In fact, across most of the mobility datasets, correlation improvements were lower for counties with higher Black, Hispanic, and other non-White populations as well as low-income and rural populations. As older and minority patients have been disproportionately affected by the COVID-19 pandemic, we would hope to provide more and better resources to these groups to ameliorate the disparities. Instead, we see that mobility data could serve to entrench these disparities, providing decision makers in counties with more vulnerable populations with worse-performing models, leading to worse-informed policy decisions.

Our analyses have also demonstrated that linear regression models outperform ARIMAX models in both (1) the percentage of counties that benefit from adding mobility data and (2) the correlation improvement when using the same mobility data. These differences are statistically significant. Based on this, we recommend that decision makers favor linear regression models when interested in using interpretable models.

We have also shown that the performance across datasets is quite similar, with Apple and some SafeGraph datasets having slightly superior correlation improvements, albeit still so small that in many cases they will not lead to a change in the strength of the correlation. This result suggests that decision makers looking to use mobility data as a source of behavioral information should not worry about the dataset they gain access to, since all seem to similarly improve the correlation over their non-mobility baselines.

Finally, we have discussed how the training-testing approach presents a tradeoff for decision makers, with LTW approaches increasing the number of counties that benefit from adding mobility data, whereas STW approaches, which require considerably less data and reduce costs, increase the correlation improvements (albeit with small values) and have lower bias.

7 LIMITATIONS AND FUTURE WORK

Although we explore many angles of the use of mobility data in COVID-19 case prediction, there are several limitations and opportunities for further research. One limitation is the reliability of


our dependent variable: COVID-19 cases. Other research has shown that, especially in the early stages of the pandemic, published case numbers were not reflective of the actual spread of COVID-19, in large part due to the lack of testing resources [39]. Hospitalization and death counts are more reliable than case counts as dependent variables, but they also face issues with under-counting [37] and present further complications with time lags. In future work, we will seek to leverage methods for case count correction [29] to understand the effects of using corrected cases on COVID-19 prediction models’ accuracy and fairness. Future work should also evaluate changes in the findings reported in this article when hospitalization and death counts, instead of cases, are considered.

Our study focuses on two prediction models, linear and ARIMAX, and the findings discussed only apply to these models. Using other modeling and predictive approaches such as compartmental [44] or multi-population models [4] might reveal different results. However, it is important to clarify that, unlike the approach proposed in this article, compartmental and multi-population models rely heavily on the availability of more granular data, either in the form of individual mobility patterns or mobility data segmented by sub-groups, and that such data is not freely accessible and often comes at a significant cost.

Finally, a national-level policy maker might be interested in a unified model that learns COVID-19 trends for all counties. This would allow for inclusion of county-level variables indicative of population vulnerability directly in the model, potentially yielding more accurate results. One could also explore methods to ameliorate bias in mobility data, whether through interventions on the model or development of techniques to identify how mobility data providers can improve their data collection and transformation procedures.

A APPENDIX

Table 7 shows the average performance, measured as the average correlation between predicted and actual COVID-19 cases, across lookaheads and linear regression implementations for the LTW and STW training-testing approaches. We can observe that, on average, Ridge is the best-performing model for three out of the five lookaheads under LTW, and ElasticNet is the best-performing model for three out of the five lookaheads under STW. Given their majority best performance, we select these implementations for the analysis. Tables 8, 9, 10, and 11 contain the values for the regression coefficient analysis for ARIMAX LTW, ARIMAX STW, Ridge Regression LTW, and ElasticNet Regression STW, respectively. Table 12 contains the non-parametric Friedman test analysis, which highlights whether the percentage of counties that benefit from adding mobility data is statistically significantly different across datasets. Table 13 reports whether the ci brought about by different mobility datasets was significantly different across datasets. Table 14 identifies the p, d, q values of the ARIMAX models that were used and the counts of counties that had the stated parameters.

Table 7. Average p_corr Value for Each Regularized Method

            STW                                             LTW
Lookahead   OLS        Ridge      Lasso      ElasticNet     OLS        Ridge      Lasso      ElasticNet
1           0.965098   0.965128*  0.962530   0.963475       0.952180   0.952182*  0.938414   0.945211
7           0.639711   0.640578   0.644599   0.647571*      0.572311   0.572374*  0.548012   0.560634
14          0.453315   0.454412   0.462288   0.465076*      0.366340   0.366426*  0.350444   0.361948
21          0.290965   0.292186   0.308287   0.308882*      0.214946   0.215025   0.212894   0.218513*
28          0.152364   0.153709   0.177438*  0.174648       0.121256   0.121314   0.134021*  0.132422

The best value for each window and each lookahead is marked with an asterisk.
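The choice of Ridge for LTW and ElasticNet for STW can be checked directly against the values in Table 7. The following Python sketch is illustrative only: the dictionary layout and the helper name `count_wins` are assumptions made here, not part of the original analysis. It tallies, for each training window, how many lookaheads each implementation wins on average correlation.

```python
from collections import Counter

LOOKAHEADS = [1, 7, 14, 21, 28]

# Average correlations copied from Table 7, indexed by window and method.
TABLE_7 = {
    "STW": {
        "OLS":        [0.965098, 0.639711, 0.453315, 0.290965, 0.152364],
        "Ridge":      [0.965128, 0.640578, 0.454412, 0.292186, 0.153709],
        "Lasso":      [0.962530, 0.644599, 0.462288, 0.308287, 0.177438],
        "ElasticNet": [0.963475, 0.647571, 0.465076, 0.308882, 0.174648],
    },
    "LTW": {
        "OLS":        [0.952180, 0.572311, 0.366340, 0.214946, 0.121256],
        "Ridge":      [0.952182, 0.572374, 0.366426, 0.215025, 0.121314],
        "Lasso":      [0.938414, 0.548012, 0.350444, 0.212894, 0.134021],
        "ElasticNet": [0.945211, 0.560634, 0.361948, 0.218513, 0.132422],
    },
}


def count_wins(window_scores):
    """Count, per method, the lookaheads where it has the highest score."""
    wins = Counter()
    for i in range(len(LOOKAHEADS)):
        best = max(window_scores, key=lambda method: window_scores[method][i])
        wins[best] += 1
    return wins


stw_wins = count_wins(TABLE_7["STW"])
ltw_wins = count_wins(TABLE_7["LTW"])
print("STW:", dict(stw_wins))
print("LTW:", dict(ltw_wins))
```

With these inputs, ElasticNet wins three of the five lookaheads under STW and Ridge wins three of the five under LTW, which is consistent with the implementations analyzed in Tables 10 and 11.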


Table 8. Regression Coefficient Analysis for ARIMAX LTW

                                                                     (SafeGraph)    (SafeGraph)
Feature                 Apple          Descartes      Google         Grocery Store  Religious Org
r2                      0.0176         0.0157         0.0173         0.0065         0.0094
adjusted_r2             0.013          0.0117         0.0091         0.0032         0.006
const                   0.0585         0.0528         0.0342         0.0411         0.0475
age_65+                 –0.0493**      –0.0032        0.0078         –0.0052        –0.0022
low_income              0.0319*        –0.014         0.0097         –0.0007        0.0049
black                   –0.0102        0.0343**       –0.0063        –0.0028        –0.0134
hispanic                0.0195         0.023          0.0032         0.0129         0.0021
race_other              –0.0935***     0.0313*        0.0038         –0.0184        –0.0044
rurality                –0.042*        0.0137         0.0046         –0.0093        –0.0035
age_65+:low_income      –0.039**       –0.0049        0.002          –0.002         –0.001
age_65+:black           –0.0006        –0.0042        0.017          –0.0022        –0.0062
age_65+:hispanic        –0.0238**      –0.0096        –0.0156*       –0.005         –0.0035
age_65+:race_other      0.0188         0.0003         0.0139         0.0085*        0.0024
age_65+:rurality        0.0396*        0.0026         –0.0214        0.0143         0.0052
low_income:black        0.0012         0.0259***      –0.0026        –0.0043        –0.0222***
low_income:hispanic     0.0156         0.0013         –0.0179*       0.0083         0.0026
low_income:race_other   –0.0381***     0.0129         0.0035         –0.0033        –0.0051
low_income:rurality     –0.0047        0.0053         –0.0147        –0.0011        –0.0006
black:hispanic          –0.0019        –0.0063*       –0.0055        –0.0025        –0.0077***
black:race_other        0.0019         0.0055*        –0.0023        0.0001         –0.0034
black:rurality          0.017*         –0.0022        –0.0059        0.0054         0.0088
hispanic:race_other     0.0109**       –0.0014        –0.0061        0.001          0.0036
hispanic:rurality       0.0155*        –0.0076        0.0003         0.0003         0.0094
race_other:rurality     0.0409***      –0.0223**      –0.0078        0.0075         –0.0025

                        (SafeGraph)    (SafeGraph)    SafeGraph      SafeGraph      SafeGraph
Feature                 Restaurants    Schools        Inflow         Intraflow      Outflow
r2                      0.0112         0.0111         0.0119         0.0083         0.0089
adjusted_r2             0.0079         0.0078         0.0081         0.0051         0.0055
const                   0.0489         0.0558         0.0565         0.0568         0.0582
age_65+                 –0.029*        –0.0075        –0.0298*       –0.0152        –0.0343*
low_income              –0.0106        0.0106         0.0075         0.0017         0.0097
black                   0.0075         –0.0104        –0.0131        0.0036         0.0139
hispanic                0.0045         –0.022         –0.0206        –0.0055        –0.0109
race_other              –0.0201        –0.0289        –0.0373*       –0.0378**      –0.0302*
rurality                –0.0235*       –0.014         –0.0429***     –0.014         –0.0354**
age_65+:low_income      –0.0022        0.0017         –0.0025        –0.0043        –0.0158
age_65+:black           0.0058         0.0056         0.0091         0.0101         –0.0062
age_65+:hispanic        –0.0025        –0.0007        0.0043         0.0028         0.0019
age_65+:race_other      0.0038         0.0103*        0.0102*        0.0093*        0.007
age_65+:rurality        0.0519***      0.0122         0.0519***      0.0231         0.0449**
low_income:black        0.0132*        –0.0019        –0.0086        0.0061         0.0088
low_income:hispanic     0.0065         –0.0025        –0.0138        –0.0033        –0.0057
low_income:race_other   –0.0099        –0.0083        –0.0099        –0.0078        –0.0022
low_income:rurality     0.0106*        –0.0017        –0.0079        0.0029         –0.0043
black:hispanic          –0.0041        –0.0003        –0.0063*       –0.0043        –0.0051*
black:race_other        0.0017         0.0003         –0.0023        0.0019         0.0031
black:rurality          0              0.002          0.0057         –0.0007        0.0075
hispanic:race_other     –0.0029        0.0032         –0.0012        0.0008         0.0037
hispanic:rurality       0.0067         0.022***       0.0089         0.0051         0.0097
race_other:rurality     0.0096         0.0087         0.0214         0.0226*        0.0201*

Values highlighted in green represent statistically significant positive coefficients, whereas values highlighted in red represent statistically significant negative coefficients. Significance level (p-value smaller than): ***, 0.001; **, 0.01; *, 0.05.
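The row labels in Tables 8 through 11 follow a regular pattern: six county-level features plus every pairwise interaction between them, for 21 regressors in total. As a minimal sketch, the following Python snippet rebuilds that list of terms and shows the standard relationship between r2 and adjusted_r2. The function names are hypothetical, and the number of observations `n` is an assumption, because the tables do not state it.

```python
from itertools import combinations

# Main-effect features, in the order used by the table rows.
FEATURES = ["age_65+", "low_income", "black", "hispanic", "race_other", "rurality"]


def regression_terms(features):
    """Return main effects followed by all pairwise interaction terms."""
    interactions = [f"{a}:{b}" for a, b in combinations(features, 2)]
    return list(features) + interactions


def adjusted_r2(r2, n, p):
    """Standard adjusted R^2 for n observations and p regressors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


terms = regression_terms(FEATURES)
print(len(terms))   # 6 main effects + 15 pairwise interactions
print(terms[6:11])  # the five interactions that involve age_65+

# Hypothetical sample size, chosen only to illustrate the formula.
print(round(adjusted_r2(0.0176, n=3000, p=len(terms)), 4))
```

Because adjusted_r2 penalizes the number of regressors relative to the sample size, columns fitted on fewer counties show a larger gap between r2 and adjusted_r2 for the same set of 21 terms.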


Table 9. Regression Coefficient Analysis for ARIMAX STW

                                                                     (SafeGraph)    (SafeGraph)
Feature                 Apple          Descartes      Google         Grocery Store  Religious Org
r2                      0.0444         0.0617         0.0745         0.0693         0.056
adjusted_r2             0.0371         0.0567         0.0601         0.0649         0.0521
const                   0.0743         0.0594         0.039          0.0361         0.0417
age_65+                 –0.0192        –0.0071        –0.029*        –0.0101        0.0113
low_income              0.0093         –0.0115        0.0209*        –0.0002        –0.0175***
black                   0.0449***      –0.0088        0.0179         0.0128*        0.0028
hispanic                –0.0032        –0.0142        0.0123         –0.0032        –0.0138*
race_other              0.0081         0.0089         0.0017         –0.0088        –0.0087
rurality                –0.0124        0.0306**       0.0029         0.0023         0.0188**
age_65+:low_income      –0.0241*       –0.0166*       –0.044***      –0.0116*       0.0038
age_65+:black           –0.0176        –0.0108        –0.0174*       –0.0044        –0.0136***
age_65+:hispanic        –0.0169**      –0.0032        –0.0126*       0.0009         0.0069**
age_65+:race_other      –0.0136        –0.0138*       –0.0161*       –0.0007        0.0048
age_65+:rurality        0.0221         –0.0035        0.0129         0.0094         –0.0132*
low_income:black        0.0157*        –0.0203***     –0.005         –0.0006        –0.0062
low_income:hispanic     –0.001         –0.0058        –0.0011        –0.0016        –0.0014
low_income:race_other   –0.006         –0.0034        –0.0073        –0.0171***     –0.0037
low_income:rurality     0.0084         0.0259***      0.01           0.0088**       0.0142***
black:hispanic          0.0055*        0.0048*        –0.0017        0.0002         0.002
black:race_other        –0.0046*       –0.002         –0.0023        –0.0004        –0.0016
black:rurality          –0.0001        0.0035         0.0041         –0.0008        0.0112**
hispanic:race_other     –0.0008        –0.0024        0.0032         0.0006         0.0002
hispanic:rurality       0.0191**       0.0089*        –0.0004        0.0061*        0.0063*
race_other:rurality     0.0028         0.0054         0.0078         –0.005         0.0009

                        (SafeGraph)    (SafeGraph)    SafeGraph      SafeGraph      SafeGraph
Feature                 Restaurants    Schools        Inflow         Intraflow      Outflow
r2                      0.0727         0.0345         0.0529         0.0323         0.0376
adjusted_r2             0.0679         0.03           0.0482         0.0274         0.033
const                   0.0453         0.0479         0.0514         0.0474         0.0531
age_65+                 0.0085         –0.002         0.0056         –0.0013        –0.0158
low_income              –0.0168**      –0.0049        –0.0114        –0.0033        0.0007
black                   0.0003         –0.0014        0.0201*        0.0083         0.002
hispanic                –0.0058        –0.0274***     –0.0087        0.0008         –0.0211**
race_other              –0.0089        –0.0241**      –0.0273*       0.0001         –0.0104
rurality                0.0124         0.0013         0.0214**       0.0006         0.0019
age_65+:low_income      0.0028         –0.0024        –0.0119        –0.0058        –0.0145*
age_65+:black           0.0128**       –0.0023        –0.0116*       –0.0054        –0.0137**
age_65+:hispanic        0.0062*        0.0049         –0.0082*       –0.0049        –0.0011
age_65+:race_other      0.0035         0.0011         –0.0019        –0.0034        –0.003
age_65+:rurality        –0.0054        0.0024         –0.0091        0.0084         0.0182*
low_income:black        –0.0101**      0.0022         0.009*         0.0019         –0.0077
low_income:hispanic     0.005          –0.0048        –0.0064        –0.0022        –0.0063
low_income:race_other   –0.0037        –0.0079*       –0.0109**      –0.0057        –0.0092*
low_income:rurality     0.0136***      0.009**        0.0215***      0.0078*        0.0169***
black:hispanic          0.0007         0.0056***      –0.0001        –0.0001        0.0017
black:race_other        –0.0012        –0.0018        0.0017         –0.0009        –0.0032*
black:rurality          0.0133***      0.0098**       0.0086*        0.0079         0.0113**
hispanic:race_other     0.0034*        0.0023         0.0012         –0.0001        0.0007
hispanic:rurality       0.0073*        0.0184***      0.0162***      0.0064         0.0187***
race_other:rurality     0.0006         0.0138*        0.0194*        –0.0001        0.0051

Values highlighted in green represent statistically significant positive coefficients, whereas values highlighted in red represent statistically significant negative coefficients. Significance level (p-value smaller than): ***, 0.001; **, 0.01; *, 0.05.


Table 10. Regression Coefficient Analysis for Ridge Regression LTW

                                                                     (SafeGraph)    (SafeGraph)
Feature                 Apple          Descartes      Google         Grocery Store  Religious Org
r2                      0.0297         0.0218         0.0054         0.0195         0.0102
adjusted_r2             0.0229         0.0184         –0.001         0.0164         0.0076
const                   0.101***       0.064***       0.061***       0.071***       0.063***
age_65+                 –0.027         –0.017         0.01           –0.018         –0.012
low_income              0.055*         0.033*         0.024          0.032*         0.015
black                   –0.067         –0.013         –0.046         –0.042*        –0.003
hispanic                –0.072*        –0.017         0.02           –0.038*        –0.017
race_other              –0.049         –0.015         –0.082         –0.041         –0.031
rurality                –0.039         –0.031         –0.003         –0.005         –0.004
age_65+:low_income      –0.033         –0.016         –0.002         –0.02          –0.013
age_65+:black           0.038          –0.028*        0.002          0.028*         0
age_65+:hispanic        –0.029         0.022*         0.001          0.021*         0.007
age_65+:race_other      –0.036         –0.003         0.026          –0.003         –0.002
age_65+:rurality        0.039          0.021          –0.026         0.002          0.012
low_income:black        –0.026         –0.036***      –0.033         –0.015         –0.006
low_income:hispanic     –0.079***      –0.021         0.01           –0.025*        –0.018
low_income:race_other   –0.011         –0.004         –0.032         –0.013         –0.008
low_income:rurality     0.002          –0.011         –0.017         0.001          0.006
black:hispanic          –0.012         –0.009*        –0.006         –0.006         0
black:race_other        0.009          –0.002         0.007          0.004          0.003
black:rurality          0.024          0.031***       0.017          0.017          0.007
hispanic:race_other     0.028***       0.007          0.013          0.004          0
hispanic:rurality       0.035*         –0.008         –0.012         0.001          0.002
race_other:rurality     0.05*          0.013          0.013          0.026          0.022

                        (SafeGraph)    (SafeGraph)    SafeGraph      SafeGraph      SafeGraph
Feature                 Restaurants    Schools        Inflow         Intraflow      Outflow
r2                      0.0208         0.0056         0.0217         0.0154         0.0204
adjusted_r2             0.018          0.0025         0.0192         0.0131         0.0181
const                   0.069***       0.058***       0.07***        0.07***        0.066***
age_65+                 –0.041*        –0.01          –0.035*        –0.038*        –0.043*
low_income              0.047***       0.017          0.035**        0.037**        0.039***
black                   –0.039*        0.01           –0.051**       –0.029         –0.04*
hispanic                –0.038*        –0.048*        –0.035*        –0.031*        –0.041**
race_other              –0.057**       –0.043         –0.023         –0.027         –0.033
rurality                –0.037*        0.006          –0.036*        –0.03          –0.019
age_65+:low_income      –0.04*         –0.021         –0.028*        –0.025*        0.033**
age_65+:black           0.003          –0.003         0.009          –0.011         0.011
age_65+:hispanic        0.011          0.015          0.014*         0.022***       0.014*
age_65+:race_other      –0.005         –0.005         –0.005         0.004          0.003
age_65+:rurality        0.033          –0.005         0.04*          0.03           0.029
low_income:black        –0.03**        –0.003         –0.034***      –0.031***      –0.029***
low_income:hispanic     –0.021*        –0.025*        –0.024*        –0.02*         –0.026**
low_income:race_other   –0.01          –0.004         –0.006         –0.012         –0.013
low_income:rurality     –0.004         0.008          –0.002         –0.007         0.004
black:hispanic          –0.001         –0.004         0              –0.003         0.002
black:race_other        0.001          0              0.001          0.003          0.004
black:rurality          0.024**        –0.003         0.03***        0.027**        0.016
hispanic:race_other     0.02***        0.012**        0.001          –0.005         –0.003
hispanic:rurality       0.013          0.011          0.008          0              0.01
race_other:rurality     0.042**        0.032          0.021          0.012          0.016

Values highlighted in green represent statistically significant positive coefficients, whereas values highlighted in red represent statistically significant negative coefficients. Significance level (p-value smaller than): ***, 0.001; **, 0.01; *, 0.05.
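Comparing bias across datasets and training approaches amounts to counting how many coefficients carry significance stars, and in which direction. The sketch below is illustrative only: `parse_cell` and `tally` are hypothetical helpers introduced here, and the sample column copies the six main-effect rows of the SafeGraph Outflow column of Table 10 purely as input data. Cells written in scientific notation are not handled.

```python
import re

# A cell is an optionally signed decimal followed by zero to three stars.
# The tables use an en dash as the minus sign, so both forms are accepted.
CELL = re.compile(r"^([–-]?)(\d+(?:\.\d+)?)(\*{0,3})$")


def parse_cell(cell):
    """Split a table cell such as '–0.043*' into (value, number_of_stars)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    sign, digits, stars = match.groups()
    value = float(digits)
    return (-value if sign else value), len(stars)


def tally(column):
    """Count significant negative and positive coefficients (at least one star)."""
    counts = {"negative": 0, "positive": 0}
    for cell in column.values():
        value, stars = parse_cell(cell)
        if stars and value < 0:
            counts["negative"] += 1
        elif stars and value > 0:
            counts["positive"] += 1
    return counts


# Main-effect rows of the SafeGraph Outflow column in Table 10.
outflow_ltw = {
    "age_65+": "–0.043*",
    "low_income": "0.039***",
    "black": "–0.04*",
    "hispanic": "–0.041**",
    "race_other": "–0.033",
    "rurality": "–0.019",
}
print(tally(outflow_ltw))
```

Running the same tally over every column of Tables 8 through 11 would give one way to compare the number of significant coefficients between STW and LTW, or between ARIMAX and linear regression models.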


Table 11. Regression Coefficient Analysis for ElasticNet Regression STW

                                                                     (SafeGraph)    (SafeGraph)
Feature                 Apple          Descartes      Google         Grocery Store  Religious Org
r2                      0.012576       0.022346       0.017356       0.008641       0.009302
adjusted_r2             0.008654       0.01869        0.0085         0.00556        0.006092
const                   0.1071***      0.0708***      0.0656***      0.0591***      0.0689***
age_65+                 –0.0306        –0.0208        –0.0126        –0.0064        –0.0031
low_income              0.0077         0.0098         –5.58e–5       –0.0054        0.0024
black                   –0.0443*       0.0088         –0.0147        0.0134         –0.0199
hispanic                0.0115         –0.01          –0.0134        0.0034         –0.0215
race_other              –0.0064        –0.0273*       0.0017         –0.0111        –0.0033
rurality                –0.0684**      0.0094         –0.0231        –0.0005        –0.0218
age_65+:low_income      –0.0119        –0.0181        –0.0033        –0.0075        0.0044
age_65+:black           0.0303*        –0.0025        –0.0014        –0.0009        0.0015
age_65+:hispanic        –0.0252**      0.0056         0.0138         –0.0012        0.0016
age_65+:race_other      –0.0119        0.0117         –0.0215        0.0056         0.0079
age_65+:rurality        0.059**        0.0092         0.0288         0.008          0.0117
low_income:black        –0.0069        –0.0133*       –0.0139        0.0046         –0.0145*
low_income:hispanic     0.013          –0.0193*       –0.0028        0.0018         –0.0096
low_income:race_other   –0.0035        –0.0055        –0.0092        0.0016         0.0009
low_income:rurality     –0.0153        0.0134*        0.0011         0.0005         –0.004
black:hispanic          0.0064         –0.0115***     –0.0022        –0.0023        –0.0035
black:race_other        0.0031         0.0039         0.0025         0.0004         0.0015
black:rurality          0.0133         –0.0083        0.0105         –0.0015        0.0113
hispanic:race_other     0.0085         0.0051         –0.0012        –0.0008        –0.0017
hispanic:rurality       0.0189*        –0.0034        –0.0006        0.0017         0.0125*
race_other:rurality     0.0095         0.0056         0.0115         0.008          –0.0028

                        (SafeGraph)    (SafeGraph)    SafeGraph      SafeGraph      SafeGraph
Feature                 Restaurants    Schools        Inflow         Intraflow      Outflow
r2                      0.011075       0.009231       0.008333       0.004581       0.008229
adjusted_r2             0.007945       0.006016       0.0052         0.001467       0.005037
const                   0.0731***      0.1135***      0.0663***      0.061***       0.0655***
age_65+                 –0.0167        0.029          –0.0089        –0.0232*       0.0038
low_income              0.0111         –0.0298*       0.0034         0.0098         –0.0035
black                   –0.0171        0.0283         0.0005         –0.0036        0.0068
hispanic                –0.003         –0.0331        –0.0076        –0.0082        –0.0014
race_other              –0.0084        –0.0202        –0.0161        –0.0056        0.0259
rurality                –0.0207        0.0155         –0.0174        –0.0147        –0.0044
age_65+:low_income      –0.0154        0.0209         –0.0059        –0.0132        0.003
age_65+:black           0.0046         0.0034         0.0005         –0.0033        –0.0015
age_65+:hispanic        –0.0038        –0.0018        0.0043         0.0018         0.0009
age_65+:race_other      0.003          0.008          –0.0025        –0.0076*       –0.0099*
age_65+:rurality        0.02           –0.0186        0.0166         0.0285*        0.0075
low_income:black        –0.0133        0.0224*        –0.0023        –0.0049        –0.0021
low_income:hispanic     –0.0086        –0.007         –0.0005        –0.004         –0.0039
low_income:race_other   –0.0004        0.0037         –0.0055        –0.0074        –0.0009
low_income:rurality     –0.0047        0.0101         –0.0071        0.0031         –0.0036
black:hispanic          –0.0038        0.0018         –0.0013        –0.0022        –0.002
black:race_other        0.0024         –0.0004        0.0023         0.0013         –0.0001
black:rurality          0.0109         –0.0157        0.0043         0.0051         0.0014
hispanic:race_other     –0.0008        –0.0037        0.0001         –0.0011        –0.0022
hispanic:rurality       0.0044         0.022*         0.0056         0.0028         0.0016
race_other:rurality     0.006          0.0126         0.0133         0.0058         –0.0151

Values highlighted in green represent statistically significant positive coefficients, whereas values highlighted in red represent statistically significant negative coefficients. Significance level (p-value smaller than): ***, 0.001; **, 0.01; *, 0.05.


Table 12. Non-Parametric Friedman Test Analysis with Distributions Containing the Percentage of Counties Where Mobility Data Improved COVID-19 Case Prediction Models without Mobility Data across Datasets, Lookaheads, Training Approaches, and Models

Dataset Considered   All Datasets   Google Excluded
Chi-square           28.2873        9.6000
p-Value              0.0009         0.2942

Table 13. Non-Parametric Friedman Test Analysis with Distributions Containing the Median Correlation Improvement Values of Counties Where Mobility Data Improved COVID-19 Case Prediction Models without Mobility Data across Datasets, Lookaheads, Training Approaches, and Models

Dataset Considered                                      Chi-Square   p-Value
All datasets                                            182.7029     0.0
Restaurants Mobility excluded                           164.0857     0.0
Religious Organization Mobility excluded                164.0603     0.0
Schools Mobility excluded                               166.6444     0.0
Grocery Stores Mobility excluded                        164.6127     0.0
SafeGraph Inflows Mobility excluded                     164.1492     0.0
SafeGraph Outflows Mobility excluded                    164.6000     0.0
SafeGraph Intraflows Mobility excluded                  164.2508     0.0
Apple Mobility excluded                                 164.5937     0.0
Descartes Mobility excluded                             164.0603     0.0
Google Mobility excluded                                164.0794     0.0
Apple, Google, and SafeGraph Inflow Mobility excluded   108.6286     0.0

Each test is run with one or a few mobility datasets excluded.
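The Friedman tests reported in Tables 12 and 13 can be sketched with the Python standard library alone. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the statistic omits the tie correction that statistical packages usually apply, and the reading of each lookahead, model, and training approach combination as one block (row) with each mobility dataset as one treatment (column) is an assumption based on the test description in Section 5.4.

```python
import math


def average_ranks(values):
    """Rank values from 1 to k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square for n blocks (rows) by k treatments (columns), no tie correction."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    total = sum(r * r for r in rank_sums)
    return 12.0 * total / (n * k * (k + 1)) - 3.0 * n * (k + 1)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^r / r!.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite sum over odd double factorials.
    term, total = 1.0, 0.0
    for i in range((df - 1) // 2):
        if i > 0:
            term *= x / (2 * i + 1)
        total += term
    tail = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * total
    return math.erfc(math.sqrt(x / 2.0)) + tail


# Toy input: 3 blocks by 3 datasets, with made-up percentages of improved counties.
toy = [[0.40, 0.45, 0.50], [0.38, 0.42, 0.55], [0.41, 0.43, 0.52]]
stat = friedman_statistic(toy)
print(stat, chi2_sf(stat, df=len(toy[0]) - 1))

# Chi-square values from Table 12; df is the number of datasets minus one.
print(round(chi2_sf(28.2873, 10 - 1), 4))
print(round(chi2_sf(9.6000, 9 - 1), 4))
```

Feeding the chi-square values from Table 12 into `chi2_sf`, with 9 degrees of freedom for all ten datasets and 8 once Google is excluded, should closely match the p-values reported in that table.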


Fig. 6. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for ARIMAX with LTW approach using Apple Mobility data (a, b), Descartes Mobility data (c, d), Google Mobility data (e, f), SafeGraph Inflow Mobility data (g, h), SafeGraph Outflow Mobility data (i, j), SafeGraph Intraflow Mobility data (k, l), Restaurants Mobility data (SafeGraph POI) (m, n), Religious Mobility data (SafeGraph POI) (o, p), Schools Mobility data (SafeGraph POI) (q, r), and Grocery Stores Mobility data (SafeGraph POI) (s, t).


Fig. 7. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for Ridge regression with the LTW approach using Apple Mobility data (a, b), Descartes Mobility data (c, d), Google Mobility data (e, f), SafeGraph Inflow Mobility data (g, h), SafeGraph Outflow Mobility data (i, j), SafeGraph Intraflow Mobility data (k, l), Restaurants Mobility data (SafeGraph POI) (m, n), Religious Mobility data (SafeGraph POI) (o, p), Schools Mobility data (SafeGraph POI) (q, r), and Grocery Stores Mobility data (SafeGraph POI) (s, t).


Fig. 8. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for ARIMAX with STW approach using Apple Mobility data (a, b), Descartes Mobility data (c, d), Google Mobility data (e, f), SafeGraph Inflow Mobility data (g, h), SafeGraph Outflow Mobility data (i, j), SafeGraph Intraflow Mobility data (k, l), Restaurants Mobility data (SafeGraph POI) (m, n), Religious Mobility data (SafeGraph POI) (o, p), Schools Mobility data (SafeGraph POI) (q, r), and Grocery Stores Mobility data (SafeGraph POI) (s, t).


Fig. 9. Correlation improvements (ci) and baseline correlation distributions (non-mobility) across lookaheads for elastic regression with the STW approach using Apple Mobility data (a, b), Descartes Mobility data (c, d), Google Mobility data (e, f), SafeGraph Inflow Mobility data (g, h), SafeGraph Outflow Mobility data (i, j), SafeGraph Intraflow Mobility data (k, l), Restaurants Mobility data (SafeGraph POI) (m, n), Religious Mobility data (SafeGraph POI) (o, p), Schools Mobility data (SafeGraph POI) (q, r), and Grocery Stores Mobility data (SafeGraph POI) (s, t).


Table 14. Summary of the Combinations of p, d, q Values Identified for the ARIMAX Models Using Grid Search and the Akaike Information Criterion

ARIMAX (p,d,q)    # Counties    ARIMAX (p,d,q)    # Counties
(0, 0, 0)         13            (4, 0, 4)         3
(0, 1, 0)         1,263         (4, 1, 0)         19
(0, 1, 1)         189           (4, 1, 1)         10
(0, 1, 2)         18            (4, 1, 2)         18
(0, 1, 3)         4             (4, 1, 3)         11
(0, 1, 4)         8             (4, 1, 4)         3
(0, 1, 5)         12            (4, 1, 5)         1
(0, 2, 0)         1             (4, 2, 0)         1
(0, 2, 1)         25            (4, 2, 1)         1
(0, 2, 2)         5             (4, 2, 2)         1
(0, 2, 3)         3             (4, 2, 3)         1
(1, 0, 0)         108           (4, 2, 4)         1
(1, 0, 1)         25            (5, 0, 0)         2
(1, 0, 2)         4             (5, 0, 1)         4
(1, 0, 3)         3             (5, 0, 2)         7
(1, 0, 4)         3             (5, 1, 0)         4
(1, 0, 5)         1             (5, 1, 1)         3
(1, 1, 0)         201           (5, 1, 2)         10
(1, 1, 1)         263           (5, 1, 3)         9
(1, 1, 2)         44            (5, 1, 4)         5
(1, 1, 3)         22            (5, 2, 2)         2
(1, 1, 4)         9             (5, 2, 3)         3
(1, 1, 5)         9             (5, 2, 5)         1
(1, 2, 1)         2             (6, 0, 3)         1
(1, 2, 2)         6             (6, 1, 3)         2
(1, 2, 3)         3             (6, 1, 4)         1
(1, 2, 4)         2             (6, 1, 5)         3
(2, 0, 0)         31            (7, 0, 2)         1
(2, 0, 1)         73            (7, 0, 3)         1
(2, 0, 2)         18            (7, 1, 0)         17
(2, 0, 3)         5             (7, 1, 1)         3
(2, 0, 4)         3             (7, 1, 2)         5
(2, 1, 0)         100           (7, 1, 3)         2
(2, 1, 1)         59            (7, 1, 4)         2
(2, 1, 2)         103           (7, 1, 5)         1
(2, 1, 3)         19            (7, 2, 1)         2
(2, 1, 4)         15            (7, 2, 3)         2
(2, 1, 5)         15            (8, 0, 0)         3
(2, 2, 1)         7             (8, 0, 2)         1
(2, 2, 3)         2             (8, 1, 0)         5
(2, 2, 4)         1             (8, 1, 1)         4
(2, 2, 5)         1             (8, 1, 2)         2
(3, 0, 0)         9             (8, 1, 3)         2
(3, 0, 1)         6             (8, 1, 4)         2
(3, 0, 2)         24            (9, 0, 0)         3
(3, 0, 3)         4             (9, 0, 1)         1
(3, 0, 4)         5             (9, 0, 3)         1
(3, 0, 5)         1             (9, 1, 1)         3
(3, 1, 0)         32            (9, 1, 2)         1
(3, 1, 1)         19            (9, 1, 4)         1
(3, 1, 2)         36            (10, 0, 0)        2
(3, 1, 3)         23            (10, 0, 4)        1
(3, 1, 4)         14            (10, 1, 1)        1
(3, 1, 5)         12            (10, 1, 2)        2
(3, 2, 0)         2             (10, 1, 3)        1
(3, 2, 1)         1             (11, 0, 0)        1
(3, 2, 2)         2             (11, 1, 0)        1
(3, 2, 5)         1             (12, 0, 3)        1
(4, 0, 0)         7             (12, 1, 0)        1
(4, 0, 1)         6             (12, 1, 1)        1
(4, 0, 2)         5             (14, 1, 0)        1
(4, 0, 3)         3             (15, 1, 0)        1

The table represents the number of counties whose p, d, q values are the ones listed in the left column.
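Selecting a model order by exhaustive search under the Akaike Information Criterion can be sketched in a reduced form. The sketch below is an assumption-laden simplification, not the article's procedure: it searches only over the autoregressive order p (with d = q = 0 and no exogenous mobility regressors), fits each AR(p) by the Levinson-Durbin recursion, and scores it with AIC(p) = n ln(sigma_p^2) + 2p, which equals the Gaussian AIC up to a constant shared by all candidates. All function names and the simulated series are hypothetical.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances gamma(0..max_lag) of a series."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    return [sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def ar_innovation_variances(x, max_p):
    """Levinson-Durbin recursion: innovation variance of AR(p) for p = 0..max_p."""
    gamma = autocovariances(x, max_p)
    variances = [gamma[0]]
    phi = []  # AR coefficients of the current order
    for p in range(1, max_p + 1):
        acc = gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))
        kappa = acc / variances[-1]  # reflection coefficient of order p
        phi = [phi[j] - kappa * phi[p - 2 - j] for j in range(p - 1)] + [kappa]
        variances.append(variances[-1] * (1.0 - kappa * kappa))
    return variances


def select_ar_order(x, max_p):
    """Return the order with the smallest AIC and the list of AIC values."""
    n = len(x)
    variances = ar_innovation_variances(x, max_p)
    aic = [n * math.log(v) + 2 * p for p, v in enumerate(variances)]
    best = min(range(max_p + 1), key=lambda p: aic[p])
    return best, aic


# Simulated AR(1) series, for illustration only (not data from the article).
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))

best_p, aic = select_ar_order(series, max_p=5)
print("selected AR order:", best_p)
```

A full (p, d, q) search with exogenous inputs would replace the recursion with a proper ARIMAX fit for every candidate triple; the selection step, taking the candidate with the smallest AIC, stays the same.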



Received 1 February 2023; revised 9 June 2023; accepted 18 July 2023


17

Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas

SAMUEL C. MAINA, Microsoft Africa Research Institute, Kenya
DORCAS MWIGERERI, Jomo Kenyatta University, Kenya
JONATHAN WEYN, Microsoft Corporation, USA
LESTER MACKEY, Microsoft Research New England, USA
MILLICENT OCHIENG, Microsoft Africa Research Institute, Kenya

In the Global South, the effects of climate change have resulted in more frequent and severe weather events such as droughts, floods, and storms, leading to crop failures, food insecurity, and job loss. These effects are expected to increase in intensity in the future, further disadvantaging already marginalized communities and exacerbating existing inequalities. Hence, the need for prevention and adaptation is urgent, but accurate weather forecasting remains challenging, despite advances in machine learning and numerical modeling, due to the complex interaction of atmospheric and oceanic variables. This research aims to explore the potential of vine copulas in explaining complex relationships of different weather variables in three African locations. Copulas separate marginal distributions from the dependency structure, offering a flexible way to model dependence between random variables for improved risk assessments and simulations. Vine copulas are based on a variety of bivariate copulas, including Gaussian, Student’s t, Clayton, Gumbel, and Frank copulas; they are effective in high-dimensional problems and offer a hierarchy of trees to express conditional dependence. In addition, we propose how this framework can be applied within subseasonal forecasting models to enhance the prediction of different weather events or variables.

CCS Concepts: • Mathematics of computing → Exploratory data analysis; Distribution functions; • Applied computing → Environmental sciences; Mathematics and statistics;

Additional Key Words and Phrases: Climate change, dependency structure, vine copulas, subseasonal forecasting

ACM Reference format:
Samuel C. Maina, Dorcas Mwigereri, Jonathan Weyn, Lester Mackey, and Millicent Ochieng. 2023. Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas. ACM J. Comput. Sustain. Soc. 1, 2, Article 17 (December 2023), 23 pages. https://doi.org/10.1145/3616384

Authors’ addresses: S. C. Maina and M. Ochieng, Microsoft Africa Research Institute, Nairobi, Kenya; e-mails: [email protected], [email protected]; D. Mwigereri, Jomo Kenyatta University, Kenya; e-mail: [email protected]; J. Weyn, Microsoft Corporation, Redmond, Washington, USA; e-mail: jweyn@microsoft.com; L. Mackey, Microsoft Research New England, Cambridge, Massachusetts, USA; e-mail: [email protected].

This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2023 Copyright held by the owner/author(s).
2834-5533/2023/12-ART17
https://doi.org/10.1145/3616384

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 17. Publication date: December 2023.


S. C. Maina et al.

1 INTRODUCTION

Although climate change affects people and ecosystems worldwide, Reference [7] reports that the Global South bears a greater share of its adverse effects despite contributing only marginally to greenhouse gas emissions. This disparity stems from a range of factors such as weaker infrastructure, limited resources, and geographic location. The region is experiencing frequent and severe weather events such as droughts, floods, and storms, which in turn lead to crop failures, food insecurity, and loss of livelihoods among vulnerable populations that have limited resources to cope.

Precise and reliable forecasting of weather conditions is critical for many societal and economic functions, ranging from agriculture and climate adaptation to energy production and disaster management. Such forecasting can improve a country’s capability to anticipate weather patterns with adequate lead time, facilitate proactive planning and mitigation strategies, and decrease the potential adverse impact of unfavorable weather events. Moreover, understanding how different weather predictors interact is essential in supporting these sustainable societal activities. However, predicting weather and climate remains a formidable task, as it involves simulating and comprehending complex atmospheric and oceanic phenomena at various spatial and temporal scales.

The subseasonal timescale, characterized by weather forecasting up to several weeks in advance, has garnered increasing research interest due to the gap in accuracy and proficiency between medium-range and long-range forecasts [13, 33, 42]. The complexities of interactions between the atmosphere, oceans, and other environmental factors make subseasonal forecasting a challenging task, further complicated by the dynamic and rapidly evolving nature of the weather and climate systems [34, 41, 48]. Advancements in machine learning, numerical modeling, and observational systems have led to an improvement in the precision of subseasonal forecasts, affording a greater lead time for decision-making and the implementation of mitigation measures [21, 22]. To maintain and enhance the accuracy of subseasonal forecasts, it is crucial to integrate state-of-the-art research and technology and to effectively communicate and utilize subseasonal forecast products.

The weather system is characterized as chaotic due to the interplay of various factors such as solar radiation, temperature, air masses, pressure systems, ocean currents, topography, humidity, precipitation, and cloud cover [52]. Modeling these weather variables is a major challenge as researchers, policymakers, and practitioners work to understand, minimize, and address the impacts of climate change. According to Baccon and Lunardi [3], an accurate characterization of the statistical distributions of these weather variables is essential for the creation and assessment of stochastic models, including weather models. As stated in Reference [50], accurately assessing the impacts and associated risks of climate change requires a thorough understanding of the complete distributions of weather variables.

Currently, climate models compute the atmospheric dynamics using only a limited set of prognostic variables (temperature, humidity, wind, and pressure), whereas many diagnostic variables are derived through approximate empirical formulae. In addition, many machine learning models use a limited set of variables as inputs. The inclusion of additional variables could improve the ability of these models to capture extreme shifts. Weather centers often generate ensemble forecasts based on multiple numerical weather model simulations, each with varying initial conditions and perturbations in the model parameters. These ensembles depict the probability distribution of various weather variables [26].

Due to this multivariate nature, weather data exhibit complex spatiotemporal interdependence structures that may vary from one pair of variables to another, which complicates modeling and capturing these relationships. As a result, a small change in any of these variables may create a different weather outcome. The use of correlation matrices to define the dependence structure between variables in multivariate statistical analysis often stems from an assumption of Gaussian-distributed data.


This assumption fails in many practical problems that require a multivariate distribution able to handle discrete and continuous distributions that are not Gaussian [43]. The Gaussian mixture model is a commonly used technique for analyzing the intricate dependency structures in multivariate datasets, particularly in cluster analysis and density estimation [16, 46]. However, it has limitations, such as the requirement for Gaussian dependencies between variables and Gaussian marginal distributions, which can result in inaccurate estimates.

Copulas, introduced in Reference [47], offer a solution to these limitations. They are probabilistic models that enable the separation of marginal distributions from the dependency structure of a multivariate distribution. Copulas provide a straightforward way to describe the dependency structure among multiple variables, where each copula represents a distinct dependency. In Reference [9], the authors used copulas to study the relationship between temperature and precipitation in a sample dataset. In Reference [44], a multivariate copula framework was created to post-process probabilistic weather forecasts made by ensemble systems. Another study by Dzupire et al. [14] applied copulas to model the relationship between temperature and rainfall for climate risk management in weather derivatives. Other studies, such as References [12, 29, 39], explored the dependency between temperature and rainfall using copulas. Similarly, Leonard et al. [30] discussed how copulas can be used to understand the conditional interdependencies between different variables in a compound weather event.

Increasing the dimension of the copula model increases its complexity as well as the difficulty of accurately capturing the dependence structure. High-dimensional copulas can be computationally intensive, making it challenging to estimate the copula parameters. The dependence structure between variables may also change over time, which makes it difficult to fit a single copula model that accurately captures the dependence structure for the entire dataset.

Vine copulas [5] are useful for building dependence models in high-dimensional problems due to their flexibility in allowing each pair to have a different strength and type of dependence. For an arbitrary number of variables, vine copulas use a sequence of bivariate copulas as building blocks, producing a hierarchy of trees to express conditional dependence. Vine copulas are increasingly used in climate/weather modeling and sustainability problems. For example, Yu et al. [56] used vine copulas to forecast water pollution risk factors and provide an early warning system with multiple sources of uncertainty in an ecosystem. In addition, Vernieuwe et al. [53] examined the dependence structure between storm variables using vine copulas. Möller et al. [37] applied vine copulas to enhance post-processing forecasts from numerical weather prediction models with ensemble methods. The main variations of vine copulas are the C-vine (canonical vine) [10], D-vine (drawable vine) [17], and R-vine (regular vine) [18] copulas.

In recent years, vine copulas have been applied in the area of machine learning. Sun et al. [49] applied deep neural networks to learn the vine copula structure from different datasets and then generated synthetic data that resembled the original data. For various applications, they demonstrated the potential of combining vine copula models and deep learning. Similar applications have also been explored by Janke et al. [23], Meyer et al. [35], and Tagasovska et al. [51]. Purutçuoğlu and Farnoudkia [40] applied vine copula models (VCMs) and artificial neural networks (ANNs) to model the dependencies among multiple variables related to breast cancer, such as tumor size, age, and cancer stage, and to predict the outcome of the disease.
On comparing the performance of the models, they showed that the VCMs outperform ANNs in modeling the dependencies among the variables while also improving upon the performance of ANN models for predicting breast cancer outcomes. Farrokhi et al. [15] developed a hybrid least-square support vector machine model for modeling multivariate dependence structures of different meteorological drought characteristics based on a combination of four-dimensional vine copulas. In their paper, Oliveira et al. [38] presented a novel approach that combines a Copula Variational Long Short-term Memory ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 17. Publication date: December 2023.

17:4

S. C. Maina et al. Table 1. Description of the Weather Variables Used in the Research Work

Label 1 2 3 4 5 6 7 8 Acronym t_max_24 t_min_24 td_mean_24 ws_mean_24 c_mean_24 rh_mean_day p_max_24 p_min_24

method with regular vine copula-based dependence structures to model variable dependencies in high-dimensional cross-market data. The method utilizes regular vine copula-based dependence structures to capture the complex dependence relationships among multiple financial time series. The authors demonstrated that this proposed method outperforms other state-of-the-art portfolio forecasting methods, including other copula-based approaches.

This article addresses a number of objectives. First, we extend earlier work, for example by Dzupire et al. [14], on bivariate (temperature and precipitation) weather models to include additional variables that affect the weather system. We then verify the hypothesis that, given the non-Gaussian nature of these variables, the classical correlation assumption would lead to a misrepresentation of the dependence structure that may affect the forecasting outcome. Our results demonstrate that copulas can better depict the complex multivariate and heavy-tailed distributions by capturing the bivariate relationships between variables. Second, we show that the pairwise dependency distributions vary across locations. Finally, we demonstrate how we can build high-dimensional dependency relationships on our weather data using vine copulas.

The remainder of the article is organized as follows. In Section 2, we describe the data used in this study and provide a brief introduction to copula theory. In Section 3, we analyse the dependency relationships by fitting the elliptical and Archimedean families of copulas on the data, while Section 4 focuses on the aforementioned vine copula models. In Section 5, we provide a detailed discussion of our results and point to how these can be embedded in different forecasting frameworks, with a particular focus on the subseasonal forecasting horizon. Section 6 gives conclusions and points to potential areas of further research.

2 DATA AND METHODS

2.1 Data Description
The data used for evaluating the models were sourced from Synoptic’s Mesonet API.¹ Access was obtained for observations with different timescales, including hourly, daily, and weekly, from various locations globally. For this study, we used remote-sensed daily data for the Nairobi Wilson Airport (METAR code HKNW), Cape Town International Airport (METAR code FACT), and Kigali International Airport (METAR code HRYR) weather stations. We selected a set of eight variables, as shown in Table 1, and extracted daily data for a period of two years (01/01/2021–31/12/2022).

In the variable names, mean_24 denotes an average over 24 h, max_24 and min_24 represent the highest and lowest value over 24 h, and mean_day captures the average daytime measurement in a given day. The prefix of each name represents the meteorological variable: t is the temperature variable, ws is the wind speed, c is the percentage of the sky covered by clouds, rh is the percent relative humidity, td represents dewpoint temperature, and p represents the pressure variable. For example, td_mean_24 refers to the daily dewpoint temperature averaged over a 24 h time horizon at a particular weather station. Figure 1 gives the histograms for the densities of our variables of interest for the Nairobi weather center. Figure 7 in the Appendix shows the histograms of the weather variables for the other two stations.

¹ Mesonet API (https://developers.synopticdata.com/mesonet/) is Synoptic’s request-based data delivery service.
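The naming convention described above can be made concrete with a short sketch. The helper `decode` and the two lookup dictionaries are hypothetical names introduced here for illustration; they are not part of the Mesonet API or of the study's code.

```python
# Hypothetical helper (not from the article or the Mesonet API): decode a
# variable acronym such as "td_mean_24" into its variable and aggregation.
VARIABLES = {
    "t": "temperature",
    "td": "dewpoint temperature",
    "ws": "wind speed",
    "c": "cloud cover (%)",
    "rh": "relative humidity (%)",
    "p": "pressure",
}
AGGREGATIONS = {
    "mean_24": "mean over 24 h",
    "max_24": "maximum over 24 h",
    "min_24": "minimum over 24 h",
    "mean_day": "daytime mean",
}

# The eight acronyms listed in Table 1.
TABLE_1 = [
    "t_max_24", "t_min_24", "td_mean_24", "ws_mean_24",
    "c_mean_24", "rh_mean_day", "p_max_24", "p_min_24",
]


def decode(acronym: str) -> tuple[str, str]:
    """Split at the first underscore: prefix = variable, rest = aggregation."""
    prefix, _, suffix = acronym.partition("_")
    return VARIABLES[prefix], AGGREGATIONS[suffix]


decoded = {name: decode(name) for name in TABLE_1}
for name, (variable, aggregation) in decoded.items():
    print(f"{name:12s} {variable}, {aggregation}")
```

Splitting at the first underscore is enough here because every prefix in Table 1 is a single token.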

ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 17. Publication date: December 2023.

Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas


Fig. 1. Histograms of each weather variable at the Nairobi weather station.

2.2 Copula Theory
Copulas are a powerful, convenient, and flexible tool for modeling the relationship between multiple variables, including variables whose relationship is not linear. By definition, a copula function is a multivariate distribution function that describes the joint distribution of different variables while preserving the marginal distributions of each variable. Copulas are used to model the dependence structure between variables without making any explicit assumptions about the marginal distributions. In particular:

Definition 2.1 (Distribution and Density Functions). For a scalar random variable $X$, we define the (cumulative) distribution function, $F(x)$, as the probability that $X$ takes on a value less than or equal to a given value $x$: $F(x) = P(X \le x)$. If $X$ is continuous, then we define its density function $f(x)$ as the derivative of the cumulative distribution function: $f(x) = \frac{d}{dx} F(x)$. If $X$ is discrete, then we define $f(x)$ as the probability of taking on the value $x$: $f(x) = P(X = x)$.

Definition 2.2 (Multivariate Distribution and Density Functions). For a vector of random variables $\mathbf{X} = (X_1, \ldots, X_d)$ and a vector $\mathbf{x} = (x_1, \ldots, x_d) \in (-\infty, \infty)^d$, we define the multivariate distribution function $F(\mathbf{x}) = P(X_1 \le x_1, \ldots, X_d \le x_d)$ and the joint density function $f(x_1, \ldots, x_d)$ as a product of conditional univariate density functions:

$$f(x_1, \ldots, x_d) = f(x_d) \cdot f(x_{d-1} \mid x_d) \cdot f(x_{d-2} \mid x_{d-1}, x_d) \cdots f(x_1 \mid x_2, \ldots, x_d). \tag{1}$$

Definition 2.3 (Copula). For every $d \ge 2$, a $d$-dimensional copula $C : [0, 1]^d \to [0, 1]$ is a multivariate distribution function on $[0, 1]^d$ with uniformly distributed marginals.

Sklar’s theorem [47] states that any multivariate distribution function $F$ can be written in terms of its marginal distribution functions, $F_1, \ldots, F_d$, and a copula function $C$ as

$$F(\mathbf{x}) = C\big(F_1(x_1), \ldots, F_d(x_d)\big), \quad \forall x_i \in (-\infty, \infty), \tag{2}$$

and

$$C(\mathbf{u}) = F\big(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d)\big), \quad \forall \mathbf{u} \in [0, 1]^d.$$


S. C. Maina et al.

In addition, if $F_i$ is continuous for all $i = 1, \ldots, d$, then the copula $C$ is unique. The theorem allows for separate modeling of the dependency structure and the marginal distributions of the variables. It is common to define a copula density $c$,

$$c = \frac{\partial^d C}{\partial F_1 \cdots \partial F_d},$$

so that the joint density can be expressed as

$$f(\mathbf{x}) = c\big(F_1(x_1), \ldots, F_d(x_d)\big) \prod_{i=1}^{d} f_i(x_i). \tag{3}$$

The copula densities are used to describe the dependence structure between variables in a multivariate system in a manner that preserves their individual marginal distributions. Applying the log transformation to the density function in Equation (3) enables us to split the density into two components,

$$\log f(\mathbf{x}) = \log c\big(F_1(x_1), \ldots, F_d(x_d)\big) + \sum_{i=1}^{d} \log f_i(x_i), \tag{4}$$

which allows for a two-step estimation process of the marginal and the copula parameters.

Kendall’s tau (τ) and Spearman’s rho (ρ) are widely used non-parametric rank measures of dependence [27, 32, 36], which are suitable for assessing the strength and direction of the association between variables when the linear correlation assumption and normality fail. Kendall’s tau quantifies the difference between the probabilities of concordant and discordant pairs, while Spearman’s rho is equivalent to the Pearson correlation computed on ranked observations. These rank-based measures can be computed directly from the copula parameter for each copula family, thereby providing meaningful insights into the dependence structure between variables.

2.2.1 Bivariate Copulas.
The two most popular families of bivariate copulas are the elliptical copula family and the Archimedean copula family. Elliptical copulas are derived from elliptical distributions such as the multivariate normal distribution; they exhibit elliptical symmetry and are simple to interpret. The most prominent examples are the Gaussian copula, derived from the bivariate normal distribution with correlation ρ, and the Student’s t copula, derived from the bivariate Student’s t distribution with ν degrees of freedom and association parameter ρ. Each is constructed by inverting the respective distribution function. The Gaussian copula parameter reflects linear dependence between variables, with values near +1 (−1) indicating strong positive (negative) linear dependence, and values near 0 indicating weak dependence. The Student’s t copula is characterized by its degrees of freedom parameter ν, which controls how heavy the tails are. A higher ν indicates lighter tails, similar to a Gaussian copula, while a lower value implies heavier tails and therefore greater tail dependence. This flexibility allows the Student’s t copula to model both linear and nonlinear dependence. For a well-defined Student’s t copula, it is necessary to satisfy the lower bound ν > 2.
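As a rough illustration of the rank-based measures introduced in this section, the following sketch computes Kendall's tau and Spearman's rho for a small toy sample using only the Python standard library. The sample values and function names are made up for illustration and are not taken from the study's data; ties are assumed absent.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs (no ties)."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def ranks(values):
    """Ranks 1..n of the observations (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mean_rank = (len(x) + 1) / 2
    cov = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rx, ry))
    var = sum((a - mean_rank) ** 2 for a in rx)  # same for ry without ties
    return cov / var


# Toy sample (illustrative values only).
x = [1.2, 3.4, 2.2, 5.0, 4.1]
y = [0.5, 2.9, 3.1, 4.8, 4.0]

tau = kendall_tau(x, y)
rho = spearman_rho(x, y)

# Both measures depend only on ranks, so a strictly increasing
# transformation of either margin leaves them unchanged.
tau_transformed = kendall_tau(x, [math.exp(v) for v in y])
print(tau, rho, tau_transformed)
```

The invariance under monotone transformations is what makes these measures properties of the copula rather than of the marginal distributions.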
Archimedean copulas, however, are good at capturing dependence in the tails of variables’ distributions, i.e., in the regions of relatively low-probability and more extreme events. They are derived from a univariate generator function, known as the Archimedean generator φ, which is a continuous and decreasing function defined on the interval [0, ∞). This generator determines the shape and characteristics of the copula. Examples of Archimedean copulas include the Clayton [8] and the Gumbel [20] copulas, which have the form

$$C(\mathbf{u}; \theta) = \varphi\left(\sum_{i=1}^{d} \varphi^{-1}(u_i; \theta)\right).$$


Table 2. List of the Bivariate Copulas Used in This Study and Their Parameter Ranges

Copula      Parameter(s)
Gaussian    θ ∈ [−1, 1]
Student     θ ∈ [−1, 1], ν > 2
Clayton     θ ∈ [−1, ∞) \ {0}
Gumbel      θ ∈ [1, ∞)
Frank       θ ∈ R \ {0}

See Reference [31] for description of copula families and their closed-form mathematical description.
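The generator construction of Archimedean copulas can be sketched in a few lines. The generators below are the usual textbook forms for the Clayton (θ > 0) and Gumbel (θ ≥ 1) families, written so that φ(0) = 1 and φ decreases to 0; the function names are illustrative, and the parameter values are the Clayton fit for (t_max_24, ws_mean_24) at Nairobi and the Gumbel fit for (t_max_24, t_min_24) at Kigali reported in Table 4.

```python
import math


def clayton_phi(t, theta):
    """Clayton generator phi(t) = (1 + theta * t)^(-1/theta), theta > 0."""
    return (1.0 + theta * t) ** (-1.0 / theta)


def clayton_phi_inv(u, theta):
    """Inverse Clayton generator, defined for u in (0, 1]."""
    return (u ** (-theta) - 1.0) / theta


def gumbel_phi(t, theta):
    """Gumbel generator phi(t) = exp(-t^(1/theta)), theta >= 1."""
    return math.exp(-(t ** (1.0 / theta)))


def gumbel_phi_inv(u, theta):
    """Inverse Gumbel generator, defined for u in (0, 1)."""
    return (-math.log(u)) ** theta


def archimedean_cdf(u, phi, phi_inv, theta):
    """C(u; theta) = phi(sum_i phi_inv(u_i; theta)) for a vector u."""
    return phi(sum(phi_inv(ui, theta) for ui in u), theta)


def clayton_lower_tail(theta):
    """Lower tail dependence coefficient of the Clayton copula (theta > 0)."""
    return 2.0 ** (-1.0 / theta)


def gumbel_upper_tail(theta):
    """Upper tail dependence coefficient of the Gumbel copula."""
    return 2.0 - 2.0 ** (1.0 / theta)


THETA_CLAYTON = 0.78  # Table 4, Nairobi, (t_max_24, ws_mean_24)
THETA_GUMBEL = 1.09   # Table 4, Kigali, (t_max_24, t_min_24)

u = [0.3, 0.6]
c_clayton = archimedean_cdf(u, clayton_phi, clayton_phi_inv, THETA_CLAYTON)
c_gumbel = archimedean_cdf(u, gumbel_phi, gumbel_phi_inv, THETA_GUMBEL)
print(c_clayton, clayton_lower_tail(THETA_CLAYTON))
print(c_gumbel, gumbel_upper_tail(THETA_GUMBEL))
```

Passing the generator pair as arguments keeps `archimedean_cdf` independent of the family, which mirrors how the single formula above covers the whole Archimedean class.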

The corresponding copula density is

$$c(\mathbf{u}; \theta) = \frac{\varphi^{(d)}\left(\sum_{i=1}^{d} \varphi^{-1}(u_i; \theta)\right)}{\prod_{i=1}^{d} \varphi'\left(\varphi^{-1}(u_i; \theta)\right)},$$

where $\varphi^{(d)}$ represents the $d$th derivative of the generator function φ with respect to its argument $x$, $\varphi'$ is its first derivative, $\mathbf{u}$ is a vector in $[0, 1]^d$, and θ is a parameter vector. For this copula to be well defined, it is required that the generator function φ : [0, ∞) → [0, 1] satisfy the following conditions:
• φ(0) = 1 and $\lim_{x \to \infty} \varphi(x) = 0$,
• φ is strictly monotone.

The Clayton copula models the lower tail dependence between two variables, and its scalar parameter θ determines the strength of the tail dependence, with positive parameters indicating positive tail dependence and negative parameters indicating negative tail dependence.² As a result, it is suitable for modeling situations where there is a greater likelihood of extreme events occurring together in the lower tails of the distribution. In contrast, the Gumbel copula captures upper tail dependence, making it suitable for modeling situations where extreme events are more likely to occur together in the upper tail. Larger values of θ indicate stronger upper tail dependence, while smaller values result in weaker upper tail dependence. The Frank copula is a flexible Archimedean copula that is used to model both positive and negative tail dependencies, depending on the value of the generator parameter. Larger absolute values of θ indicate stronger dependence, while smaller absolute values indicate weaker dependence. Table 2 gives the range of the respective copula parameters.

2.2.2 d-variate Copulas.
Vine copulas are a popular family of d-variate copulas for d > 2 constructed from a cascade of bivariate copulas. Vine copulas were originally proposed by Joe [24, 25] and were further developed by Bedford and Cooke [4, 5], who introduced a convenient graphical representation called the regular (R) vine tree structure for the pair-copula constructions. A regular vine consists of linked trees, where the edges in one tree become the nodes of the next. Figure 2 provides a representative example.
The first tree represents d variables as nodes and the bivariate dependence of (d − 1) pairs of variables as edges. Tree 2 shows conditional dependence between (d − 2) pairs of variables given a third variable, where the nodes of tree 2 are the edges of tree 1, and two tree 2 nodes can be connected if the nodes share a common variable. The third tree shows conditional dependence between (d − 3) pairs of variables given two other variables; the tree 3 nodes are the edges of tree 2, and two tree 3 nodes can be connected if two variables are held in common. The process continues until tree (d − 1), which contains only one edge describing the conditional dependence of two variables given the remaining (d − 2). A tree’s possible edges depend on the edges of the previous trees although they are not uniquely determined by them. Associated with each edge of each tree is a bivariate copula that can come from a specific parametric family (like Gaussian, Student’s t, Gumbel, Clayton, or Frank) and be estimated separately. Aas et al. [1] extended this work to cover two subclasses of the R-vines, namely, the canonical vines (C-vines) and drawable vines (D-vines). The C-vine copulas have star structures in their tree sequence, while D-vines have path structures. Figure 3 provides a sample visualization of an eight-dimensional D-vine copula where each graph level consists of a path.

² Positive tail dependence implies that the variables tend to have joint extreme values in the upper tail of their distributions and therefore when one variable has an extreme (large) value in its distribution, the other variable is more likely to have an extreme value as well. The converse also holds for negative tail dependence and very small values.

Fig. 2. Graphical visualization of an eight-dimensional R-vine copula.

Fig. 3. Graphical representation of an eight-dimensional D-vine copula.

The d-dimensional density decomposition under the R-vine copula has the following representation:

$$f(x_1, \ldots, x_d) = \prod_{k=1}^{d} f(x_k) \prod_{j=1}^{d-1} \prod_{i=1}^{d-j} c_{ij \mid i+1, \ldots, i+j-1}\big(F(x_i \mid x_{i+1}, \ldots, x_{i+j-1}),\, F(x_{i+j} \mid x_{i+1}, \ldots, x_{i+j-1})\big), \tag{5}$$

where the index $j$ identifies the trees, while $i$ runs over the edges in each tree. In this case, no node in tree $T_j$ is connected to more than two edges. Similarly, Reference [4] provided the density of a d-dimensional distribution specialized to a D-vine with the representation

$$f(x_1, \ldots, x_d) = \prod_{k=1}^{d} f(x_k) \prod_{j=1}^{d-1} \prod_{i=1}^{d-j} c_{i, i+j \mid i+1, \ldots, i+j-1}\big(F(x_i \mid x_{i+1}, \ldots, x_{i+j-1}),\, F(x_{i+j} \mid x_{i+1}, \ldots, x_{i+j-1})\big). \tag{6}$$

Finally, the d-dimensional density corresponding to a canonical (C) vine has the general form

$$f(x_1, \ldots, x_d) = \prod_{k=1}^{d} f(x_k) \prod_{j=1}^{d-1} \prod_{i=1}^{d-j} c_{j, j+i \mid 1, \ldots, j-1}\big(F(x_j \mid x_1, \ldots, x_{j-1}),\, F(x_{j+i} \mid x_1, \ldots, x_{j-1})\big). \tag{7}$$

We note that the R-vine copula has similar components to the D-vine copula: the marginal densities of the variables, denoted by $f(x_k)$, and the conditional copulas, denoted by $c_{ij \mid i+1, \ldots, i+j-1}$, which capture the dependence between $x_i$ and $x_{i+j}$ conditioned on $x_{i+1}, \ldots, x_{i+j-1}$. The key difference, however, is in the ordering of the variables and the conditional dependencies. The joint density is expressed in terms of bivariate copula densities, marginal densities, and conditional distribution functions. This is known as the copula-based representation of the joint density. Similar representations can be derived for the R-vine and the C-vine copulas. Although R-vine copulas offer more flexibility and can handle diverse dependence structures, the D-vine copulas are more computationally efficient, especially when the dimension of the data is large. In addition, they can handle stronger dependencies between variables than the R-vine copulas, and their graphical representation is easier to interpret (compare Figures 2 and 3), thereby providing a clearer picture of the dependencies between variables. Whereas D-vine copulas are limited to continuous variables, C-vine copulas can handle both discrete and continuous variables. C-vine copulas are more computationally efficient than general R-vine copulas for high-dimensional data and offer a more intuitive graphical representation of the dependence structure.
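To make the index bookkeeping of the D-vine and C-vine decompositions concrete, the sketch below enumerates the pair copulas tree by tree for d = 8. The D-vine indices follow Equation (6). For the C-vine, the sketch assumes the conditioned pair in tree j is (j, j+i), the usual canonical-vine form; the function names are illustrative.

```python
def dvine_edges(d):
    """D-vine: tree j, edge i gives the pair (i, i+j | i+1, ..., i+j-1)."""
    edges = []
    for j in range(1, d):                 # trees 1 .. d-1
        for i in range(1, d - j + 1):     # edges 1 .. d-j in tree j
            conditioning = tuple(range(i + 1, i + j))
            edges.append((j, (i, i + j), conditioning))
    return edges


def cvine_edges(d):
    """C-vine (assumed form): tree j, edge i gives (j, j+i | 1, ..., j-1)."""
    edges = []
    for j in range(1, d):
        for i in range(1, d - j + 1):
            conditioning = tuple(range(1, j))
            edges.append((j, (j, j + i), conditioning))
    return edges


d = 8  # number of weather variables in Table 1
dvine = dvine_edges(d)
cvine = cvine_edges(d)

print(len(dvine), "pair copulas in", d - 1, "trees")
for tree, pair, given in dvine[:2] + dvine[-1:]:
    print("D-vine tree", tree, "pair", pair, "given", given)
for tree, pair, given in cvine[:2] + cvine[-1:]:
    print("C-vine tree", tree, "pair", pair, "given", given)
```

In the first C-vine tree every pair contains variable 1 (a star), while in the first D-vine tree consecutive variables are linked (a path), which matches the structural description above.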


Table 3. Using a Chi-square Goodness-of-fit Test with Significance Level α = 0.05, We Classify Each of Eight Weather Variables Measured at Three Weather Stations as Either Not Gaussian (if the Test Rejects the Gaussian Null Hypothesis) or Gaussian (if the Test Fails to Reject)

                  Nairobi                       Cape Town                     Kigali
Weather Variable  p-value        Classification p-value        Classification p-value        Classification
t_max_24          3.300056e-08   Not Gaussian   9.004380e-06   Not Gaussian   2.989446e-13   Not Gaussian
t_min_24          4.041689e-02   Not Gaussian   8.342909e-12   Not Gaussian   1.165896e-15   Not Gaussian
td_mean_24        1.729526e-03   Not Gaussian   1.372075e-03   Not Gaussian   1.032565e-26   Not Gaussian
ws_mean_24        1.458512e-04   Not Gaussian   4.208273e-07   Not Gaussian   2.271271e-216  Not Gaussian
c_mean_24         1.205127e-34   Not Gaussian   1.381648e-24   Not Gaussian   5.531471e-05   Not Gaussian
rh_mean_day       0.263          Gaussian       0.532          Gaussian       6.612881e-05   Not Gaussian
p_max_24          1.160284e-02   Not Gaussian   1.160038e-09   Not Gaussian   0.000000e+00   Not Gaussian
p_min_24          0.079          Gaussian       1.424755e-10   Not Gaussian   0.215          Gaussian

3 COPULA-BASED BIVARIATE MODELING OF WEATHER VARIABLES

In this section, we demonstrate the characteristics of the bivariate copulas and their ability to capture different dependency structures in our weather data sets. Copulas make it possible to model distributions that are poorly approximated by multivariate Gaussians. We begin by testing the marginal normality of our eight weather variables from each of the three stations considered in our experiment. Table 3 summarizes the results of a standard chi-square goodness-of-fit test [19]. We see that, with a marginal significance threshold of α = 0.05, there is sufficient evidence to suspect non-Gaussianity for all but 4 of the 24 station-variable combinations. To assess the degree to which bivariate copulas can capture the non-Gaussian variation in our data, we next fit bivariate copulas to each of the 28 unique pairs of variables, separately for each site. For a given pair, we fit each of the copulas listed in Table 2 using the pyvinecopulib³ Python library, and then we use the Bayesian information criterion (BIC) [45] to select a single best copula. The results are summarized in Table 4 with auxiliary details given in Table 8, which also reports two other goodness-of-fit measures, the log likelihood (logLik) and the Akaike information criterion (AIC) [2]. Whereas the logLik is a measure of the absolute goodness of fit, the AIC and BIC are relative measures that provide information on how well a model fits the data, compared to other models. A higher logLik value indicates that a model is a better fit, whereas lower AIC and BIC values indicate a better fit. Moreover, AIC and BIC penalize models that have more parameters, which makes them more appropriate for model selection when the number of parameters differs between models. We observe that the nature of the dependence relationship for each pair of variables can vary with the observation center.
For example, we note that the Frank copula can capture the relationship between the pair (t_max_24, t_min_24) at the Nairobi and Cape Town weather stations, whereas the Gumbel copula gives the best fit of the pair at the Kigali airport weather station. For the Nairobi weather center, we note that there is a positive tail dependence in the data captured by the Frank copula with parameter 2.9. A similar observation is made for the Cape Town weather station, with a parameter value of 4.89 suggesting a higher level of tail dependence in the data. For the Kigali weather station, the pairs are best described by a Gumbel copula parameter of 1.09, which suggests a moderate to strong positive dependence between the variables. Finally, Table 5 reports the BIC-selected bivariate copula fits between different stations’ measurements of the same weather variable. For example, we observe that the spatial dependency of the t_max_24 variable in Kigali and Nairobi can be best described using the Gumbel copula, whereas the same variable between Kigali and Cape Town is best represented using a Frank copula. Meanwhile, the spatial dependency of t_max_24 between Nairobi and Cape Town is best captured using a Gaussian copula.

³ pyvinecopulib is a copula library that provides an interface to bivariate as well as vine copulas. See https://pypi.org/project/pyvinecopulib/.

Table 4. For Each Pair of Weather Variables and Each Weather Station under Consideration, We Report the Name and Parameters of the Best Fitted Bivariate Copula, as Judged by the BIC Model Selection Criterion Applied to the Study Period Data

                            Nairobi               Cape Town               Kigali
Pair                        Copula, parameter(s)  Copula, parameter(s)    Copula, parameter(s)
(t_max_24, t_min_24)        Frank, 2.90           Frank, 4.89             Gumbel, 1.09
(t_max_24, td_mean_24)      Gaussian, −0.18       Frank, 3.80             Gaussian, −0.45
(t_max_24, ws_mean_24)      Clayton, 0.78         Clayton, 0.12           Student, 0.33, 6.09
(t_max_24, c_mean_24)       Gaussian, −0.68       Student, −0.58, 15.72   Frank, −3.88
(t_max_24, rh_mean_day)     Gaussian, −0.77       Gaussian, −0.62         Student, −0.69, 13.36
(t_max_24, p_max_24)        Gaussian, −0.59       Frank, −4.35            Frank, −1.42
(t_max_24, p_min_24)        Gaussian, −0.72       Frank, −3.39            Gaussian, −0.41
(t_min_24, td_mean_24)      Clayton, 0.83         Student, 0.82, 11.75    Clayton, 0.211
(t_min_24, ws_mean_24)      Frank, 3.19           Gaussian, 0.57          Gumbel, 1.08
(t_min_24, c_mean_24)       Frank, −0.54          Clayton, 0.18           Clayton, 0.30
(t_min_24, rh_mean_day)     Student, −0.13, 0.08  Frank, −1.26            Clayton, 0.057
(t_min_24, p_max_24)        Frank, −4.26          Frank, −6.02            Student, −0.11, 7.52
(t_min_24, p_min_24)        Frank, −4.08          Frank, −4.38            Student, −0.026, 5.34
(td_mean_24, ws_mean_24)    Gaussian, −0.033      Clayton, 0.49           Gaussian, −0.35
(td_mean_24, c_mean_24)     Clayton, 0.68         Clayton, 0.27           Clayton, 1.50
(td_mean_24, rh_mean_day)   Gaussian, 0.68        Clayton, 0.38           Clayton, 3.86
(td_mean_24, p_max_24)      Frank, −2.05          Frank, −5.84            Frank, −1.98
(td_mean_24, p_min_24)      Frank, −1.34          Frank, −4.20            Gaussian, −0.27
(ws_mean_24, c_mean_24)     Frank, −2.44          Clayton, 0.38           Gaussian, −0.16
(ws_mean_24, rh_mean_day)   Frank, −2.92          Frank, −0.53            Gaussian, −0.34
(ws_mean_24, p_max_24)      Frank, −2.20          Frank, −2.36            Frank, −0.45
(ws_mean_24, p_min_24)      Frank, −2.70          Frank, −2.14            Gaussian, −0.12
(c_mean_24, rh_mean_day)    Gaussian, 0.72        Gumbel, 1.73            Gaussian, 0.76
(c_mean_24, p_max_24)       Gaussian, 0.38        Student, −0.013, 11.12  Gaussian, −0.13
(c_mean_24, p_min_24)       Gaussian, 0.44        Student, −0.07, 5.24    Gaussian, −0.05
(rh_mean_day, p_max_24)     Clayton, 0.42         Frank, 0.73             Frank, −1.30
(rh_mean_day, p_min_24)     Gaussian, 0.42        Frank, 0.61             Frank, −0.66
(p_max_24, p_min_24)        Student, 0.89, 10.41  Student, 0.90, 4.99     Student, 0.85, 5.04

Table 5. BIC-selected Bivariate Copula Fits for Different Station Pairs and the Respective Weather Variable Measurement Pair

Stations            t_max_24  t_min_24  td_mean_24  ws_mean_24  c_mean_24  rh_mean_day  p_min_24  p_max_24
Kigali, Nairobi     Gumbel    Clayton   Frank       Gumbel      Gumbel     Gaussian     Gumbel    Student
Kigali, Cape Town   Frank     Frank     Frank       Gumbel      Gaussian   Frank        Gumbel    Gaussian
Nairobi, Cape Town  Gaussian  Frank     Frank       Frank       Frank      Frank        Gumbel    Frank

4 VINE COPULAS FOR DEPENDENT WEATHER VARIABLES
To construct a vine copula, bivariate copulas are first selected to model the dependence between each pair of variables using BIC. The pair-copulas are then organized into a pair-copula tree structure that determines the sequential order in which the pair-copulas are applied. In this section, we investigate how vine copulas can be applied to multi-dimensional weather data. As highlighted in Section 2, vine copulas facilitate the modeling of multivariate dependencies by decomposing these into pairwise dependencies, as shown in Equations (5), (6), and (7) for the R-vine, D-vine, and C-vine copulas, respectively. These vine variations have different strengths and weaknesses, and the choice is informed by the application problem at hand. We fitted the three vine copula types to our data using a dedicated R package (vinecop⁴) and assessed each fit using BIC (for completeness, we also report the AIC and log likelihood of each fit). The results, summarized in Table 6, show that the R-vine copula provides the best high-dimensional dependence structure fit for the Kigali and Cape Town weather stations, while the C-vine copula presents the best fit of the dependency structure of the data from the Nairobi weather station.

Table 6. Log-likelihood, AIC, and BIC Goodness-of-fit Statistics for the R-vine, D-vine, and C-vine Copula Models under the Three Weather Stations of Study

           R-Vine                          D-Vine                          C-Vine
Station    logLik   AIC       BIC          logLik   AIC       BIC          logLik   AIC       BIC
Kigali     2117.77  −4177.53  −4045.06*    1962.85  −3865.70  −3728.60     2044.32  −4026.63  −3885.02
Cape Town  2856.09  −5654.17  −5521.17*    2722.83  −5375.67  −5215.15     2863.00  −5658.00  −5502.07
Nairobi    2485.25  −4912.50  −4780.27     2433.17  −4802.34  −4656.44     2506.09  −4946.18  −4795.71*

We use the lowest BIC score (marked with *) to select the best vine fit for each station.

In Table 7, we present a more granular description of these results for the Kigali weather station. This includes the bivariate copula fits for each tree together with the respective copula parameters and the Kendall’s tau statistic, which gives insights into the strength and direction of the dependence relationships between variables captured by the vine copula model. The R-vine consists of seven directed acyclic graphs, or Trees 1 to 7, that depict the interdependence of random variables represented by nodes and edges, with each edge being a copula, which describes the dependence between connected nodes. We observe that the dependencies between variables are complex, with diverse copulas and parameters utilized to model them. In particular, we observe the presence of rotations for some Archimedean copulas. As noted in Reference [6], rotating the copulas by 180° yields the corresponding survival copulas,⁵ while rotation by 90° and 270° allows us to model the negative dependencies, which would otherwise not be captured by the standard non-rotated versions. We provide a visualization of these results in Figure 4. Similarly, we also present the results for the Cape Town and Nairobi weather stations in Appendix Tables 9 and 10, respectively. In addition, the contour figures for these stations are given in Figures 5 and 6. These contour figures display the shape and strength of the dependence between pairs of variables in our datasets and offer valuable insights into the joint distribution of variables modeled for the two stations using vine copulas.

A C-vine, unlike an R-vine, represents the conditional dependence of variables given their parents in the vine structure instead of the partial dependence of variables given their neighbors. The variables are iteratively selected and ordered according to their conditional dependence strength. To model conditional dependence between variables, edges in each tree represent copula models.

⁴ Vinecop (https://www.rdocumentation.org/packages/rvinecopulib/versions/0.6.3.1.1/topics/vinecop) is an automated model for fitting and selecting vine copulas with continuous or discrete data.
⁵ Survival copulas can be used to model the joint behavior of extreme weather events. This allows for a better understanding of their dependencies and the likelihood of simultaneous events. They can also be applied in modeling the time until a specific extreme weather phenomenon occurs.
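The model selection step can be sketched with the standard definitions of AIC and BIC, where k is the number of free parameters and n the number of observations. The selection below simply takes the lowest BIC among the values reported in Table 6; the function and variable names are illustrative and this is not the authors' code.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 logLik (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 logLik (lower is better)."""
    return k * math.log(n) - 2 * loglik


# BIC values transcribed from Table 6.
TABLE_6_BIC = {
    "Kigali": {"R-vine": -4045.06, "D-vine": -3728.60, "C-vine": -3885.02},
    "Cape Town": {"R-vine": -5521.17, "D-vine": -5215.15, "C-vine": -5502.07},
    "Nairobi": {"R-vine": -4780.27, "D-vine": -4656.44, "C-vine": -4795.71},
}

# For each station, keep the vine type with the lowest BIC.
best = {
    station: min(scores, key=scores.get)
    for station, scores in TABLE_6_BIC.items()
}
for station, vine in best.items():
    print(f"{station}: {vine} (BIC = {TABLE_6_BIC[station][vine]})")
```

Because the BIC penalty per parameter is ln(n) rather than 2, BIC favors sparser models than AIC whenever n exceeds roughly 7 observations, which is why the two criteria can disagree, as they do for Cape Town in Table 6.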


Table 7. R-vine Fit for the Kigali International Airport Weather Center

Tree  Edge  Conditioned  Conditioning           Family    Rotation (°)  Parameters     df  tau
1     1     c(5, 6)      –                      Gaussian  0             0.72           1   0.51
1     2     c(3, 6)      –                      Gaussian  0             0.68           1   0.47
1     3     c(4, 2)      –                      Frank     0             3.19           1   0.32
1     4     c(6, 1)      –                      Gaussian  0             −0.77          1   −0.56
1     5     c(2, 7)      –                      Frank     0             −4.26          1   −0.41
1     6     c(1, 8)      –                      Gaussian  0             −0.72          1   −0.51
1     7     c(7, 8)      –                      Frank     0             11.70          1   0.71
2     1     c(5, 1)      6                      Clayton   90            0.34           1   −0.15
2     2     c(3, 1)      6                      Frank     0             6.27           1   0.53
2     3     c(4, 7)      2                      Clayton   90            0.14           1   −0.06
2     4     c(6, 8)      1                      Frank     0             −1.93          1   −0.21
2     5     c(2, 8)      7                      Gumbel    270           1.07           1   −0.07
2     6     c(1, 7)      8                      Frank     0             1.07           1   0.12
3     1     c(5, 3)      c(1, 6)                Clayton   0             0.17           1   0.08
3     2     c(3, 8)      c(1, 6)                Gaussian  0             −0.41          1   −0.27
3     3     c(4, 8)      c(7, 2)                Clayton   270           0.26           1   −0.11
3     4     c(6, 7)      c(8, 1)                Frank     0             −1.02          1   −0.11
3     5     c(2, 1)      c(8, 7)                Clayton   180           0.20           1   0.09
4     1     c(5, 8)      c(3, 1, 6)             Clayton   0             0.12           1   0.05
4     2     c(3, 7)      c(8, 1, 6)             Clayton   90            0.24           1   −0.11
4     3     c(4, 1)      c(8, 7, 2)             Gumbel    180           1.17           1   0.15
4     4     c(6, 2)      c(7, 8, 1)             T         0             c(0.27, 5.98)  2   0.17
5     1     c(5, 7)      c(8, 3, 1, 6)          Gumbel    180           1.11           1   0.10
5     2     c(3, 2)      c(7, 8, 1, 6)          Gaussian  0             0.57           1   0.39
5     3     c(4, 6)      c(1, 8, 7, 2)          Frank     0             −1.67          1   −0.18
6     1     c(5, 2)      c(7, 8, 3, 1, 6)       Clayton   0             0.34           1   0.15
6     2     c(3, 4)      c(2, 7, 8, 1, 6)       Frank     0             0.48           1   0.05
7     1     c(5, 4)      c(2, 7, 8, 3, 1, 6)    Gumbel    270           1.10           1   −0.09
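The rotation and tau columns of Table 7 can be related through standard closed forms. The sketch below converts a copula parameter to Kendall's tau for the Gaussian, Clayton, and Gumbel families, and builds rotated versions of a bivariate copula. The 90° and 270° formulas follow one common convention; software packages differ on which of the two is which, so that labeling is an assumption here, as are the function names.

```python
import math


def tau_from_parameter(family, theta, rotation=0):
    """Kendall's tau implied by the parameter of a one-parameter family."""
    if family == "Gaussian":
        tau = 2.0 / math.pi * math.asin(theta)
    elif family == "Clayton":
        tau = theta / (theta + 2.0)
    elif family == "Gumbel":
        tau = 1.0 - 1.0 / theta
    else:
        raise ValueError(f"no closed form implemented for {family}")
    # Rotations by 90 and 270 degrees flip the sign of the dependence.
    return -tau if rotation in (90, 270) else tau


def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula for theta > 0 and u, v in (0, 1]."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)


def rotated_cdf(cdf, u, v, rotation):
    """Rotate a bivariate copula cdf(u, v); 180 gives the survival copula."""
    if rotation == 0:
        return cdf(u, v)
    if rotation == 90:   # assumed convention
        return v - cdf(1.0 - u, v)
    if rotation == 180:
        return u + v - 1.0 + cdf(1.0 - u, 1.0 - v)
    if rotation == 270:  # assumed convention
        return u - cdf(u, 1.0 - v)
    raise ValueError("rotation must be 0, 90, 180, or 270")


# Tree 2, edge 1 of Table 7: Clayton rotated by 90 degrees, parameter 0.34.
tau_edge = tau_from_parameter("Clayton", 0.34, rotation=90)
c90 = rotated_cdf(lambda a, b: clayton_cdf(a, b, 0.34), 0.3, 0.4, 90)
print(round(tau_edge, 2), c90, 0.3 * 0.4)
```

A rotated copula value below the independence value u·v, as for `c90` here, indicates negative dependence, which is consistent with the negative tau reported for the rotated rows of the table.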

Conditional dependence structures in this way are more flexible than those in R-vines, since they allow for more complex relationships among variables. However, C-vines require more computational resources and are more difficult to estimate compared to R-vines. 5 DISCUSSION AND POTENTIAL APPLICATIONS A key objective of our study was to move beyond linear dependence relationships or Gaussianity assumptions when modeling weather interdependence. Using our sample data, we first investigated the stylistic features of the weather variables as covered in Figures 1 and 7, as well as in Table 3. Our results indeed confirm our hypothesis that most of the weather variables under consideration do not fall under the Gaussian family of distributions and therefore a broader model that covers diverse dependency structures should be considered when investigating the joint distributions. In Section 3, we also demonstrated that the dependency structure between variables exhibits both symmetry and tail dependency and that different copula functions best represent these relationships for our stations of interest. In addition, we observed spatial heterogeneity in the type and degree of dependence between a given pair of variables. This could point to impact ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 17. Publication date: December 2023.

17:14

S. C. Maina et al.

Fig. 4. Contour plots of the R-vine copula fit for Kigali International Airport weather station.

Fig. 5. Contour plots of the R-vine copula fit for Cape Town International Airport weather station.

of local environmental factors or the influence of regional climate systems on the dependency relationship and therefore caution should be exercised when using models calibrated in one region to make inference in another region for the same weather variable. We also investigated the spatial dependency of our variables of interest across the three locations using bivariate copulas, as shown in Table 5. For example: the variables t_max_24, ws_mean_24, c_mean_24 and p_min_24 between Kigali and Nairobi showed upper tail dependency as captured by the Gumbel copula. We note that the t_min_24 variable shows lower tail dependence for these two locations as captured by the Clayton copula. The variable td_mean_24 showed dependencies ACM Journal on Computing and Sustainable Societies, Vol. 1, No. 2, Article 17. Publication date: December 2023.

Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas

17:15

Fig. 6. Contour plots of the C-vine copula fit for Nairobi International Airport weather station.

around the mode across the three location pairs, namely, Kigali-Nairobi, Kigali-Cape Town, and Nairobi-Cape Town. A consistent observation is that for variables with lower variance (see Figures 1 and 7), there is a higher chance of the occurrence of tail dependence in the spatial dependency. These findings provide us with insights into the variability of dependency across various variables and locations, and how it is affected by the stylistic characteristics of each variable. It is well known that in higher-dimensional data, bivariate copulas are ineffective due to their inflexibility and an increase in the computational cost. Vine copulas provide an efficient solution for modeling the dependence structure in high-dimensional data by constructing tree-like representations of the copula dependence through a series of bivariate copulas. Table 6 provides the fit results of the vine copulas for our three locations where we observe that the R-vines best fit Kigali and Cape Town weather stations. In the C-vine copula, there are certain restrictions regarding dependence. The pair-copulas are ordered according to conditional dependence strength, which makes it difficult to capture complex and nonlinear relationships. The R-vine copulas on the other hand are known for their flexibility in modeling various types of dependence structures, including both simple and complex relationships and can therefore capture both linear and nonlinear dependencies effectively. One potential application of meteorological copula modeling is to the understanding of dependencies at subseasonal timescales. Subseasonal forecast systems predict the weather between 2 weeks and 2 months in advance. As discussed in Reference [54], accurate forecasting at this lead time holds great value for disaster mitigation. However, accurate subseasonal forecasting is also notoriously difficult due to complex dependencies both on local weather variables and global climate variables [55]. 
Copula modeling could lead to new insights into which variables and spatial locations at forecast issuance are most relevant for forecasting the weather at a target location in 2 to 6 weeks. Given a prediction of a certain weather pattern, for example El Niño, a copula function can inform how this ocean pattern affects temperature, precipitation, and so on, at an inland observation station. As we have seen in Section 4, vine copulas provide a flexible and efficient way to model complex and asymmetric conditional dependencies among meteorological variables using a hierarchy of trees.
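To make the "hierarchy of trees" concrete: a regular vine on d variables has d − 1 trees, and tree j contains d − j edges, each carrying one bivariate pair-copula. The sketch below counts the pair-copulas per tree and the total for a full or truncated vine. It assumes d = 8, the number of variables in the appendix tables; the function names are illustrative and not taken from the paper.

```python
from typing import List, Optional


def pair_copulas_per_tree(d: int) -> List[int]:
    """Tree j (1-based) of a regular vine on d variables has d - j edges."""
    return [d - j for j in range(1, d)]


def n_pair_copulas(d: int, truncation_level: Optional[int] = None) -> int:
    """Total pair-copulas, keeping only the first `truncation_level` trees if given."""
    k = d - 1 if truncation_level is None else min(truncation_level, d - 1)
    return sum(pair_copulas_per_tree(d)[:k])


d = 8  # assumption: eight weather variables, as in the appendix tables
print(pair_copulas_per_tree(d))   # [7, 6, 5, 4, 3, 2, 1]
print(n_pair_copulas(d))          # 28
print(n_pair_copulas(d, 2))       # 13
```

The full count, d(d − 1)/2 = 28, matches the number of rows in each vine table of the appendix; keeping only the first two trees would leave 13 pair-copulas to estimate.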


6 SUMMARY

Understanding the dependence structure among weather variables can enhance both numerical weather prediction models and machine learning approaches in weather forecasting. Meteorologists can produce better forecasts by using a combination of satellite observations, ground-based measurements, and computer models, particularly in light of climate change, which is causing more frequent and intense weather events. Understanding how different weather predictors interact is therefore essential for supporting sustainability, adjusting to climate change, and promoting sustainable practices in agriculture, transportation, energy, and other industries.

In this article, we have demonstrated that the proposed vine copula methods can capture diverse high-dimensional dependency structures in weather datasets from three different countries by building unique and efficient tree structures, which can enhance performance in practical applications. We have also shown that bivariate copulas, which are the building blocks of the vine copulas, capture the diverse dependency shapes contained in weather data. In addition, we have shown that, for a given weather variable, the dependency shape also changes across regions.

However, our research still focused on relatively few variables that were selected by the researchers. We can extend our current work to include all weather data available for each station. While this can pose computational challenges, because the number of model parameters grows with the dimension, we can investigate the use of the sparse or truncated vine copulas proposed in a recent paper by Czado et al. [11] to mitigate this issue. The resulting framework could provide a richer dependency structure that can enhance the predictive power of the weather models. The framework we developed is also static, because we fitted the copula models to the entire dataset and ignored the seasonality in the weather patterns. Given the cyclical nature of climate data, it would also be important to explore the use of dynamic vine copulas [28, 57], which could allow us to capture the seasonal variations that are a stylized feature of climate time series data. Future work could also explore (a) modeling dependency on lagged data, (b) combining satellite image observations and ground-based measurements, and (c) jointly modeling spatiotemporal dependencies. The resulting insights could enable scientists to identify specific non-linear dependencies of relevance for forecasting.

APPENDIX

A ADDITIONAL RESULTS

In this section, we include additional results from our analysis. In Figure 7, we provide the histograms of the distribution of the weather variables for the Cape Town and Kigali international airports. In Table 8, we provide a more detailed description of the bivariate copula analysis (compare with Table 4) for the weather variables from the three locations. In particular, we include the BIC, log-likelihood, and AIC goodness-of-fit statistics.


Fig. 7. Histograms of the weather variables measured at the Cape Town International Airport (FACT) and Kigali International Airport (HRYR).

Table 9 gives a summary of the R-vine fit results for Cape Town International Airport, providing a more detailed description of the contour plots for the station captured in Figure 5. Finally, we provide the results of the vine copula fit for Nairobi International Airport in Table 10. As noted in Table 6, the BIC goodness-of-fit results show that the C-vine best captures the dependency structure for this particular station.


Table 8. Table of Bivariate Copula Fits for each Pair of Weather Variables Over the Study Period

Nairobi Weather Center
Pair                             BIC   LogLik       AIC   Copula and parameter
(t_max_24, t_min_24)         −132.20    69.38   −136.76   Frank, 2.90
(t_max_24, td_mean_24)        −16.03    11.30    −20.59   Gaussian, −0.18
(t_max_24, ws_mean_24)       −178.81    92.69   −183.37   Clayton, 0.78
(t_max_24, c_mean_24)        −410.86   208.71   −415.42   Gaussian, −0.68
(t_max_24, rh_mean_day)      −607.91   307.24   −612.47   Gaussian, −0.77
(t_max_24, p_max_24)         −270.56   138.56   −275.12   Gaussian, −0.59
(t_max_24, p_min_24)         −478.42   242.49   −482.98   Gaussian, −0.72
(t_min_24, td_mean_24)       −192.88    99.72   −197.44   Clayton, 0.83
(t_min_24, ws_mean_24)       −159.41    82.99   −163.97   Frank, 3.19
(t_min_24, c_mean_24)           1.21     2.68     −3.35   Frank, −0.54
(t_min_24, rh_mean_day)        −4.46     8.79    −13.57   Student, −0.13, 0.08
(t_min_24, p_max_24)         −267.99   137.27   −272.55   Frank, −4.26
(t_min_24, p_min_24)         −250.82   128.69   −255.38   Frank, −4.08
(td_mean_24, ws_mean_24)        5.82     0.37      1.26   Gaussian, −0.033
(td_mean_24, c_mean_24)      −152.27    79.42   −156.83   Clayton, 0.68
(td_mean_24, rh_mean_day)    −420.34   213.45   −424.90   Gaussian, 0.68
(td_mean_24, p_max_24)        −68.01    37.29    −72.57   Frank, −2.05
(td_mean_24, p_min_24)        −28.11    17.34    −32.67   Frank, −1.34
(ws_mean_24, c_mean_24)      −102.21    54.39   −106.77   Frank, −2.44
(ws_mean_24, rh_mean_day)    −140.96    73.76   −145.52   Frank, −2.92
(ws_mean_24, p_max_24)        −80.65    43.60    −85.21   Frank, −2.20
(ws_mean_24, p_min_24)       −119.42    62.99   −123.98   Frank, −2.70
(c_mean_24, rh_mean_day)     −499.04   252.80   −503.60   Gaussian, 0.72
(c_mean_24, p_max_24)         −91.08    48.82    −95.64   Gaussian, 0.38
(c_mean_24, p_min_24)        −136.70    71.63   −141.26   Gaussian, 0.44
(rh_mean_day, p_max_24)       −60.12    33.34    −64.68   Clayton, 0.42
(rh_mean_day, p_min_24)      −120.73    63.64   −125.29   Gaussian, 0.42
(p_max_24, p_min_24)        −1028.71   520.91  −1037.83   Student, 0.89, 10.41

Cape Town Weather Center
Pair                             BIC   LogLik       AIC   Copula and parameter
(t_max_24, t_min_24)         −348.54   177.56   −353.12   Frank, 4.89
(t_max_24, td_mean_24)       −223.95   115.27   −228.54   Frank, 3.80
(t_max_24, ws_mean_24)         −1.22     3.90     −5.81   Clayton, 0.12
(t_max_24, c_mean_24)        −261.31   137.24   −270.48   Student, −0.58, 15.72
(t_max_24, rh_mean_day)      −335.17   170.88   −339.75   Gaussian, −0.62
(t_max_24, p_max_24)         −285.91   146.25   −290.50   Frank, −4.35
(t_max_24, p_min_24)         −184.09    95.34   −188.67   Frank, −3.39
(t_min_24, td_mean_24)       −793.65   403.41   −802.83   Student, 0.82, 11.75
(t_min_24, ws_mean_24)       −267.43   137.01   −272.01   Gaussian, 0.57
(t_min_24, c_mean_24)          −8.54     7.56    −13.12   Clayton, 0.18
(t_min_24, rh_mean_day)       −26.08    16.33    −30.67   Frank, −1.26
(t_min_24, p_max_24)         −484.61   245.60   −489.19   Frank, −6.02
(t_min_24, p_min_24)         −294.27   150.43   −298.86   Frank, −4.38
(td_mean_24, ws_mean_24)      −88.36    47.47    −92.94   Clayton, 0.49
(td_mean_24, c_mean_24)       −23.69    15.14    −28.28   Clayton, 0.27
(td_mean_24, rh_mean_day)     −64.36    35.47    −68.95   Clayton, 0.38
(td_mean_24, p_max_24)       −459.05   232.82   −463.64   Frank, −5.84
(td_mean_24, p_min_24)       −272.63   139.61   −277.21   Frank, −4.20
(ws_mean_24, c_mean_24)       −43.81    25.20    −48.40   Clayton, 0.38
(ws_mean_24, rh_mean_day)       1.19     2.70     −3.40   Frank, −0.53
(ws_mean_24, p_max_24)        −99.62    53.10   −104.21   Frank, −2.36
(ws_mean_24, p_min_24)        −81.85    44.22    −86.43   Frank, −2.14
(c_mean_24, rh_mean_day)     −357.00   181.79   −361.58   Gumbel, 1.73
(c_mean_24, p_max_24)          10.02     1.58      0.85   Student, −0.013, 11.12
(c_mean_24, p_min_24)          −7.18    10.18    −16.35   Student, −0.07, 5.24
(rh_mean_day, p_max_24)        −4.29     5.44     −8.87   Frank, 0.73
(rh_mean_day, p_min_24)        −0.76     3.67     −5.35   Frank, 0.61
(p_max_24, p_min_24)        −1155.02   584.10  −1164.19   Student, 0.90, 4.99

Kigali Weather Center
Pair                             BIC   LogLik       AIC   Copula and parameter
(t_max_24, t_min_24)           −5.64     6.10    −10.21   Gumbel, 1.09
(t_max_24, td_mean_24)       −143.83    75.20   −148.40   Gaussian, −0.45
(t_max_24, ws_mean_24)        −82.41    47.78    −91.55   Student, 0.33, 6.09
(t_max_24, c_mean_24)        −236.44   121.50   −241.01   Frank, −3.88
(t_max_24, rh_mean_day)      −438.31   225.72   −447.44   Student, −0.69, 13.36
(t_max_24, p_max_24)          −29.21    17.89    −33.78   Frank, −1.42
(t_max_24, p_min_24)         −113.12    59.85   −117.69   Gaussian, −0.41
(t_min_24, td_mean_24)        −14.35    10.46    −18.92   Clayton, 0.211
(t_min_24, ws_mean_24)         −2.52     4.54     −7.09   Gumbel, 1.08
(t_min_24, c_mean_24)         −30.93    18.75    −35.50   Clayton, 0.30
(t_min_24, rh_mean_day)         4.66     0.95      0.10   Clayton, 0.057
(t_min_24, p_max_24)            1.51     5.81     −7.62   Student, −0.11, 7.52
(t_min_24, p_min_24)            1.23     5.95     −7.90   Student, −0.026, 5.34
(td_mean_24, ws_mean_24)      −83.67    45.12    −88.24   Gaussian, −0.35
(td_mean_24, c_mean_24)      −445.26   225.91   −449.83   Clayton, 1.50
(td_mean_24, rh_mean_day)   −1107.64   557.10  −1112.20   Clayton, 3.86
(td_mean_24, p_max_24)        −64.09    35.33    −68.66   Frank, −1.98
(td_mean_24, p_min_24)        −44.89    25.73    −49.45   Gaussian, −0.27
(ws_mean_24, c_mean_24)       −11.20     8.88    −15.76   Gaussian, −0.16
(ws_mean_24, rh_mean_day)     −79.60    43.08    −84.17   Gaussian, −0.34
(ws_mean_24, p_max_24)          2.69     1.94     −1.88   Frank, −0.45
(ws_mean_24, p_min_24)         −2.29     4.43     −6.86   Gaussian, −0.12
(c_mean_24, rh_mean_day)     −591.91   299.24   −596.48   Gaussian, 0.76
(c_mean_24, p_max_24)          −3.68     5.13     −8.25   Gaussian, −0.13
(c_mean_24, p_min_24)           5.22     0.67      0.65   Gaussian, −0.05
(rh_mean_day, p_max_24)       −24.43    15.50    −29.00   Frank, −1.30
(rh_mean_day, p_min_24)        −1.63     4.10     −6.20   Frank, −0.66
(p_max_24, p_min_24)         −865.74   439.44   −874.87   Student, 0.849759, 5.04
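The three goodness-of-fit columns in Table 8 are linked by the standard definitions AIC = −2·LogLik + 2k and BIC = −2·LogLik + k·ln(n), where k is the number of copula parameters and n the number of observations. The sketch below checks one reported row against these formulas. It assumes the standard definitions apply (the reported values are consistent with them); n is not stated in this section, so the value backed out from the BIC is only a rough estimate from rounded figures. The helper names are hypothetical.

```python
import math


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion for a model with k parameters."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion for k parameters and n observations."""
    return -2.0 * loglik + k * math.log(n)


def implied_sample_size(loglik: float, k: int, bic_value: float) -> float:
    """Solve the BIC definition for n (an estimate only, given rounded inputs)."""
    return math.exp((bic_value + 2.0 * loglik) / k)


# Nairobi, (t_max_24, t_min_24): Frank copula, one parameter (k = 1).
loglik, reported_bic = 69.38, -132.20
print(round(aic(loglik, 1), 2))  # -136.76, the AIC reported in Table 8

n_hat = implied_sample_size(loglik, 1, reported_bic)
print(round(n_hat))              # rough estimate of n, on the order of 700
print(round(bic(loglik, 1, round(n_hat)), 2))
```

Because lower values are better for both criteria, and the BIC penalty per parameter is larger than the AIC penalty whenever ln(n) > 2, the BIC favors the one-parameter families slightly more strongly than the AIC does.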


Table 9. R-vine Fit for the Cape Town International Airport Weather Center

Tree  Edge  Conditioned  Conditioning          Family    Rotation (°)  Parameters         df  tau
1     1     c(4, 2)      —                     Gumbel    180           1.58               1   0.37
1     2     c(5, 6)      —                     Gumbel    0             1.73               1   0.42
1     3     c(6, 1)      —                     Gaussian  0             −0.62              1   −0.43
1     4     c(1, 2)      —                     Frank     0             4.89               1   0.45
1     5     c(3, 2)      —                     Student   0             c(0.82, 10.74)     2   0.62
1     6     c(2, 7)      —                     Frank     0             −6.02              1   −0.52
1     7     c(7, 8)      —                     Gumbel    0             3.34               1   0.70
2     1     c(4, 1)      2                     Gaussian  0             −0.41              1   −0.27
2     2     c(5, 1)      6                     Clayton   90            0.43               1   −0.18
2     3     c(6, 2)      1                     Gaussian  0             0.32               1   0.21
2     4     c(1, 7)      2                     Gumbel    90            1.23               1   −0.19
2     5     c(3, 7)      2                     Frank     0             −1.88              1   −0.20
2     6     c(2, 8)      7                     Clayton   0             0.13               1   0.06
3     1     c(4, 6)      c(1, 2)               Gaussian  0             −0.33              1   −0.22
3     2     c(5, 2)      c(1, 6)               Gaussian  0             0.54               1   0.37
3     3     c(6, 7)      c(2, 1)               Gumbel    90            1.22               1   −0.18
3     4     c(1, 8)      c(7, 2)               Clayton   0             0.17               1   0.08
3     5     c(3, 8)      c(7, 2)               Gumbel    0             1.09               1   0.09
4     1     c(4, 7)      c(6, 1, 2)            Gumbel    270           1.09               1   −0.08
4     2     c(5, 7)      c(2, 1, 6)            Gaussian  0             −0.23              1   −0.15
4     3     c(6, 8)      c(7, 2, 1)            Gumbel    0             1.08               1   0.07
4     4     c(1, 3)      c(8, 7, 2)            Clayton   90            0.16               1   −0.07
5     1     c(4, 5)      c(7, 6, 1, 2)         Frank     0             −0.27              1   −0.03
5     2     c(5, 8)      c(7, 2, 1, 6)         Clayton   90            0.09               1   −0.05
5     3     c(6, 3)      c(8, 7, 2, 1)         Frank     0             10.19              1   0.67
6     1     c(4, 8)      c(5, 7, 6, 1, 2)      Clayton   270           0.08               1   −0.04
6     2     c(5, 3)      c(8, 7, 2, 1, 6)      Gaussian  0             0.09               1   0.06
7     1     c(4, 3)      c(8, 5, 7, 6, 1, 2)   Gumbel    0             1.01               1   0.01
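The tau column in the vine tables can be cross-checked against the family and parameter columns, because each family has a known relation between its dependence parameter and Kendall's tau. The sketch below applies the textbook relations: tau = (2/π)·arcsin(ρ) for Gaussian and Student, θ/(θ + 2) for Clayton, 1 − 1/θ for Gumbel, and the Debye-function expression for Frank, evaluated here by a simple midpoint rule. It assumes that 90° and 270° rotations flip the sign of tau, which is consistent with the reported values; the function names are hypothetical and this is not the authors' code.

```python
import math


def frank_tau(theta: float, steps: int = 2000) -> float:
    """Kendall's tau for a Frank copula: 1 - (4/theta) * (1 - D1(theta))."""
    if theta == 0.0:
        return 0.0
    a = abs(theta)
    h = a / steps
    # Midpoint rule for the Debye integral of t / (exp(t) - 1) over [0, a].
    integral = h * sum(
        ((i + 0.5) * h) / math.expm1((i + 0.5) * h) for i in range(steps)
    )
    tau = 1.0 - (4.0 / a) * (1.0 - integral / a)
    return math.copysign(tau, theta)  # tau is an odd function of theta


def kendall_tau(family: str, par: float, rotation: int = 0) -> float:
    """Kendall's tau from the first copula parameter and the rotation in degrees."""
    name = family.lower()
    if name in ("gaussian", "student"):
        tau = 2.0 / math.pi * math.asin(par)
    elif name == "clayton":
        tau = par / (par + 2.0)
    elif name == "gumbel":
        tau = 1.0 - 1.0 / par
    elif name == "frank":
        tau = frank_tau(par)
    else:
        raise ValueError(f"unsupported family: {family}")
    # Assumption: 90 and 270 degree rotations model negative dependence.
    return -tau if rotation in (90, 270) else tau


# (family, rotation, parameter, reported tau) for a few rows of Table 9.
rows = [
    ("Gumbel", 180, 1.58, 0.37),
    ("Gaussian", 0, -0.62, -0.43),
    ("Frank", 0, 4.89, 0.45),
    ("Clayton", 90, 0.43, -0.18),
]
for family, rotation, par, reported in rows:
    print(family, rotation, par, round(kendall_tau(family, par, rotation), 3), reported)
```

Small differences between the recomputed and reported values are expected, since the parameters in the tables are rounded to two decimals.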


Table 10. C-vine Fit for the Nairobi Wilson Airport Weather Center

Tree  Edge  Conditioned  Conditioning          Family    Rotation (°)  Parameters         df  tau
1     1     c(1, 8)      —                     Gaussian  0             −0.72              1   −0.51
1     2     c(2, 8)      —                     Frank     0             −4.08              1   −0.39
1     3     c(3, 8)      —                     Frank     0             −1.34              1   −0.15
1     4     c(4, 8)      —                     Frank     0             −2.67              1   −0.28
1     5     c(5, 8)      —                     Gaussian  0             0.44               1   0.29
1     6     c(6, 8)      —                     Gaussian  0             0.42               1   0.27
1     7     c(7, 8)      —                     Student   0             c(0.88, 6.67)      2   0.69
2     1     c(1, 7)      8                     Frank     0             1.05               1   0.11
2     2     c(2, 7)      8                     Gaussian  0             −0.21              1   −0.14
2     3     c(3, 7)      8                     Frank     0             −1.76              1   −0.19
2     4     c(4, 7)      8                     Clayton   0             0.09               1   0.04
2     5     c(5, 7)      8                     Frank     0             −0.20              1   −0.02
2     6     c(6, 7)      8                     Frank     0             −1.42              1   −0.15
3     1     c(1, 6)      c(7, 8)               Student   0             c(−0.73, 14.01)    2   −0.52
3     2     c(2, 6)      c(7, 8)               Clayton   0             0.22               1   0.10
3     3     c(3, 6)      c(7, 8)               Gaussian  0             0.84               1   0.63
3     4     c(4, 6)      c(7, 8)               Frank     0             −2.10              1   −0.22
3     5     c(5, 6)      c(7, 8)               Gaussian  0             0.67               1   0.47
4     1     c(1, 5)      c(6, 7, 8)            Student   0             c(−0.17, 16.697)   2   −0.11
4     2     c(2, 5)      c(6, 7, 8)            Clayton   0             0.33               1   0.14
4     3     c(3, 5)      c(6, 7, 8)            Clayton   0             0.09               1   0.04
4     4     c(4, 5)      c(6, 7, 8)            Clayton   270           0.12               1   −0.05
5     1     c(1, 4)      c(5, 6, 7, 8)         Clayton   0             0.11               1   0.05
5     2     c(2, 4)      c(5, 6, 7, 8)         Frank     0             2.64               1   0.27
5     3     c(3, 4)      c(5, 6, 7, 8)         Frank     0             1.83               1   0.20
6     1     c(1, 3)      c(4, 5, 6, 7, 8)      Student   0             c(0.52, 8.64)      2   0.35
6     2     c(2, 3)      c(4, 5, 6, 7, 8)      Gaussian  0             0.55               1   0.37
7     1     c(1, 2)      c(3, 4, 5, 6, 7, 8)   Student   0             c(0.02, 11.23)     2   0.01
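The Tree, Edge, Conditioned, and Conditioning columns of a C-vine follow mechanically from the order in which variables serve as roots: in tree j, every remaining variable is paired with the j-th root, conditional on all earlier roots. The sketch below enumerates that structure. It assumes the root order 8, 7, 6, ..., as read off Table 10; `cvine_structure` and `root_order` are illustrative names, and the code reproduces only the index layout, not the fitted families or parameters.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int, Tuple[int, int], Tuple[int, ...]]


def cvine_structure(root_order: Sequence[int]) -> List[Row]:
    """Return (tree, edge, conditioned pair, conditioning set) for a C-vine.

    root_order lists all d variables; its first d - 1 entries are the roots
    of trees 1, ..., d - 1 (the last variable never acts as a root).
    """
    d = len(root_order)
    variables = sorted(root_order)
    rows: List[Row] = []
    for tree in range(1, d):
        root = root_order[tree - 1]
        earlier_roots = list(root_order[: tree - 1])
        conditioning = tuple(reversed(earlier_roots))
        used = set(earlier_roots) | {root}
        remaining = [v for v in variables if v not in used]
        for edge, v in enumerate(remaining, start=1):
            rows.append((tree, edge, (v, root), conditioning))
    return rows


rows = cvine_structure([8, 7, 6, 5, 4, 3, 2, 1])
print(len(rows))  # 28
print(rows[0])    # (1, 1, (1, 8), ())
print(rows[7])    # (2, 1, (1, 7), (8,))
print(rows[-1])   # (7, 1, (1, 2), (3, 4, 5, 6, 7, 8))
```

An R-vine such as the one in Table 9 has no single root per tree, so its structure cannot be generated from one ordering like this; that extra freedom is what the discussion above refers to as the flexibility of R-vines.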


REFERENCES

[1] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. 2009. Pair-copula constructions of multiple dependence. Insur.: Math. Econ. 44, 2 (2009), 182–198.
[2] Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Trans. Auto. Control (1974), 716–723. https://doi.org/10.1109/tac.1974.1100705
[3] Ana L. P. Baccon and José T. Lunardi. 2016. Testing the shape of distributions of weather data. J. Phys.: Conf. Ser. 738, 1 (Aug. 2016), 012078.
[4] Tim Bedford and Roger M. Cooke. 2001. Probability density decomposition for conditionally dependent random variables modeled by vines. Ann. Math. Artific. Intell. 32, 1-4 (2001), 245–268. https://doi.org/10.1023/A:1016725902970
[5] Tim Bedford and Roger M. Cooke. 2002. Vines–a new graphical model for dependent random variables. Ann. Stat. 30, 4 (2002), 1031–1068. https://doi.org/10.1214/aos/1031689016
[6] Eike Christian Brechmann and Ulf Schepsmeier. 2013. Modeling dependence with C- and D-vine copulas: The R package CDVine. J. Stat. Softw. 52, 3 (2013), 1–27. https://doi.org/10.18637/jss.v052.i03
[7] CDP. 2020. CDP Africa Report: Benchmarking Progress Towards Climate Safe Cities, States and Regions. Retrieved from https://www.cdp.net/en/research/global-reports/africa-report
[8] D. G. Clayton. 1978. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 1 (04 1978), 141–151. https://doi.org/10.1093/biomet/65.1.141
[9] Rong-Gang Cong and Mark Brady. 2012. The interdependence between rainfall and temperature: Copula analyses. Sci. World J. 2012 (Nov. 2012), 405675. https://doi.org/10.1100/2012/405675
[10] Alfred Czado. 2006. Pair-copula constructions of multiple dependence. Insur.: Math. Econ. 39, 2 (2006), 181–197.
[11] Claudia Czado, Karoline Bax, Özge Sahin, Thomas Nagler, Aleksey Min, and Sandra Paterlini. 2022. Vine copula-based dependence modeling in sustainable finance. The Journal of Finance and Data Science 8 (2022), 309–330. https://www.sciencedirect.com/science/article/pii/S2405918822000162
[12] Jorn Van de Velde, Mathias Demuzere, Bernard De Baets, and Niko Verhoest. 2022. Future multivariate weather generation by combining Bartlett-Lewis and vine copula models. Hydrological Sciences Journal 68, 1 (2023), 1–15. https://doi.org/10.1080/02626667.2022.2144322
[13] Daniela I. V. Domeisen, Christopher J. White, Hilla Afargan-Gerstman, Ángel G. Muñoz, Matthew A. Janiga, Frédéric Vitart, C. Ole Wulff, Salomé Antoine, Constantin Ardilouze, Lauriane Batté, Hannah C. Bloomfield, David J. Brayshaw, Suzana J. Camargo, Andrew Charlton-Pérez, Dan Collins, Tim Cowan, Maria del Mar Chaves, Laura Ferranti, Rosario Gómez, Paula L. M. González, Carmen González Romero, Johnna M. Infanti, Stelios Karozis, Hera Kim, Erik W. Kolstad, Emerson LaJoie, Llorenç Lledó, Linus Magnusson, Piero Malguzzi, Andrea Manrique-Suñèn, Daniele Mastrangelo, Stefano Materia, Hanoi Medina, Lluís Palma, Luis E. Pineda, Athanasios Sfetsos, Seok-Woo Son, Albert Soret, Sarah Strazzo, and Di Tian. 2022. Advances in the subseasonal prediction of extreme events: Relevant case studies across the globe. Bull. Amer. Meteorol. Soc. 103, 6 (2022), E1473–E1501. https://doi.org/10.1175/BAMS-D-20-0221.1
[14] Nelson Christopher Dzupire, Philip Ngare, and Leo Odongo. 2020. A copula-based bi-variate model for temperature and rainfall processes. Sci. African 8 (2020), e00365. https://doi.org/10.1016/j.sciaf.2020.e00365
[15] Alireza Farrokhi, Saeed Farzin, and Sayed-Farhad Mousavi. 2021. Meteorological drought analysis in response to climate change conditions, based on combined four-dimensional vine copulas and data mining (VC-DM). J. Hydrol. 603 (2021), 127135.
[16] Anne-Laure Fouque, Philippe Ciuciu, and Laurent Risser. 2009. Multivariate spatial gaussian mixture modeling for statistical clustering of hemodynamic parameters in functional MRI. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 445–448. https://doi.org/10.1109/ICASSP.2009.4959616
[17] Alexander Friedrich and Matthias Scherer. 2010. D-vines—A new class of vine copulas for high-dimensional probability distributions. J. Multivar. Anal. 101, 9 (2010), 2278–2295.
[18] Alexander Friedrich and Matthias Scherer. 2011. R-vine copulas for modeling multivariate dependence. Comput. Stat. Data Anal. 55, 12 (2011), 2939–2953.
[19] George W. Snedecor and William G. Cochran. 1991. Statistical Methods, 8th ed. Wiley-Blackwell.
[20] E. J. Gumbel. 1960. Bivariate exponential distributions. J. Amer. Stat. Assoc. 55, 292 (1960), 698–707. Retrieved from http://www.jstor.org/stable/2281591
[21] Sijie He, Xinyan Li, Timothy DelSole, Pradeep Ravikumar, and Arindam Banerjee. 2021. Sub-seasonal climate forecasting via machine learning: Challenges, analysis, and advances. Proc. AAAI Conf. Artific. Intell. 35, 1 (May 2021), 169–177. https://doi.org/10.1609/aaai.v35i1.16090
[22] Jessica Hwang, Paulo Orenstein, Judah Cohen, Karl Pfeiffer, and Lester Mackey. 2019. Improving subseasonal forecasting in the western U.S. with machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19). ACM, New York, NY, 2325–2335. https://doi.org/10.1145/3292500.3330674


[23] Tim Janke, Mohamed Ghanmi, and Florian Steinke. 2021. Implicit generative copulas. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, 26028–26039.
[24] H. Joe. 1996. Families of m-variate distributions with given margins and m(m−1)/2 bivariate dependence parameters. In Distributions with Fixed Marginals and Related Topics, L. Rüschendorf, B. Schweizer, and M. D. Taylor (Eds.), Vol. 28. Institute of Mathematical Statistics, Hayward, CA, 120–141. https://doi.org/10.1214/lnms/1215452614
[25] Harry Joe. 1997. Multivariate Models and Dependence Concepts. Chapman & Hall, London. Retrieved from http://www.crcpress.com/product/isbn/9780412073311
[26] Gabriel Jouan, Anne Cuzol, Valérie Monbet, and Goulven Monnier. 2022. Gaussian mixture models for clustering and calibration of ensemble weather forecasts. Discrete Cont. Dynam. Syst. S (2022), 0–0. https://doi.org/10.3934/dcdss.2022037
[27] M. G. Kendall. 1948. Rank Correlation Methods. C. Griffin. Retrieved from https://books.google.co.ke/books?id=hiBMAAAAMAAJ
[28] Rüdiger Kiesel, Magda Mroz, and Ulrich Stadtmüller. 2016. Time-varying copula models for financial time series. Adv. Appl. Probabil. 48, A (2016), 159–180. https://doi.org/10.1017/apr.2016.48
[29] Georgia Lazoglou, Christina Anagnostopoulou, Charalampos Skoulikaris, and Konstantia Tolika. 2019. Bias correction of climate model’s precipitation using the copula method and its application in river basin simulation. Water 11, 3 (2019). Retrieved from https://www.mdpi.com/2073-4441/11/3/600
[30] Michael Leonard, Seth Westra, Aloke Phatak, Martin Lambert, Bart Hurk, Kathleen Mcinnes, James Risbey, Sandra Schuster, Doerte Jakob, and Mark Stafford Smith. 2014. A compound event framework for understanding extreme impacts. Wiley Interdisc. Rev.: Clim. Change 5 (Jan. 2014). https://doi.org/10.1002/wcc.252
[31] Chao Li, Vijay P. Singh, and Ashok K. Mishra. 2013. A bivariate mixed distribution with a heavy-tailed component and its application to single-site daily rainfall simulation. Water Resour. Res. 49, 2 (Feb. 2013), 767–789.
[32] Sandy Lovie and Pat Lovie. 2010. Commentary: Charles Spearman and correlation: A commentary on “The proof and measurement of association between two things.” Int. J. Epidemiol. 39, 5 (09 2010), 1151–1153. https://doi.org/10.1093/ije/dyq183 https://academic.oup.com/ije/article-pdf/39/5/1151/2411913/dyq183.pdf
[33] Andrea Manrique-Suñén, Lluís Palma, Nube Gonzalez-Reviriego, Francisco J. Doblas-Reyes, and Albert Soret. 2023. Subseasonal predictions for climate services, a recipe for operational implementation. Clim. Serv. 30 (2023), 100359. https://doi.org/10.1016/j.cliser.2023.100359
[34] William J. Merryfield, Johanna Baehr, Lauriane Batté, Emily J. Becker, Amy H. Butler, Caio A. S. Coelho, Gokhan Danabasoglu, Paul A. Dirmeyer, Francisco J. Doblas-Reyes, Daniela I. V. Domeisen, Laura Ferranti, Tatiana Ilynia, Arun Kumar, Wolfgang A. Müller, Michel Rixen, Andrew W. Robertson, Doug M. Smith, Yuhei Takaya, Matthias Tuma, Frederic Vitart, Christopher J. White, Mariano S. Alvarez, Constantin Ardilouze, Hannah Attard, Cory Baggett, Magdalena A. Balmaseda, Asmerom F. Beraki, Partha S. Bhattacharjee, Roberto Bilbao, Felipe M. de Andrade, Michael J. DeFlorio, Leandro B. Díaz, Muhammad Azhar Ehsan, Georgios Fragkoulidis, Sam Grainger, Benjamin W. Green, Momme C. Hell, Johnna M. Infanti, Katharina Isensee, Takahito Kataoka, Ben P. Kirtman, Nicholas P. Klingaman, June-Yi Lee, Kirsten Mayer, Roseanna McKay, Jennifer V. Mecking, Douglas E. Miller, Nele Neddermann, Ching Ho Justin Ng, Albert Ossó, Klaus Pankatz, Simon Peatman, Kathy Pegion, Judith Perlwitz, G. Cristina Recalde-Coronel, Annika Reintges, Christoph Renkl, Balakrishnan Solaraju-Murali, Aaron Spring, Cristiana Stan, Y. Qiang Sun, Carly R. Tozer, Nicolas Vigaud, Steven Woolnough, and Stephen Yeager. 2020. Current and emerging developments in subseasonal to decadal prediction. Bull. Amer. Meteorol. Soc. 101, 6 (2020), E869–E896. Retrieved from https://www.jstor.org/stable/27153113
[35] David Meyer, Thomas Nagler, and Robin J. Hogan. 2021. Copula-based synthetic data generation for machine learning emulators in weather and climate: application to a simple radiation model. Geoscientific Model Development Discussions 2021 (2021), 1–21.
[36] D. S. Moore, G. P. McCabe, and B. A. Craig. 2014. Introduction to the Practice of Statistics. W. H. Freeman. Retrieved from https://books.google.co.ke/books?id=pX1_AwAAQBAJ
[37] Annette Möller, Ludovica Spazzini, Daniel Kraus, Thomas Nagler, and Claudia Czado. 2018. Vine copula based postprocessing of ensemble forecasts for temperature. https://arxiv.org/abs/1811.02255
[38] Ricardo T. A. de Oliveira, Thaíze Fernandes O. de Assis, Paulo Renato A. Firmino, Tiago A. E. Ferreira, and Adriano L. I. Oliveira. 2016. Copulas-based ensemble of Artificial Neural Networks for forecasting real world time series. International Joint Conference on Neural Networks (IJCNN’16). IEEE, 4089–4096.
[39] P. K. Pandey, Lakhyajit Das, Deepak Jhajharia, and Vanita Pandey. 2018. Modelling of interdependence between rainfall and temperature using copula. Model. Earth Syst. Environ. 4 (2018), 867–879.
[40] Vilda Purutçuoğlu and Hajar Farnoudkia. 2021. Vine Copula and Artificial Neural Network Models to Analyze Breast Cancer Data. De Gruyter, Berlin, Boston, 287–304. https://doi.org/10.1515/9783110668322-014


[41] Colin Raymond, Radley M. Horton, Jakob Zscheischler, Olivia Martius, Amir AghaKouchak, Jennifer Balch, Steven G. Bowen, Suzana J. Camargo, Jeremy Hess, Kai Kornhuber, Michael Oppenheimer, Alex C. Ruane, Thomas Wahl, and Kathleen White. 2020. Understanding and managing connected extreme events. Nature Clim. Change 10, 7 (July 2020), 611–621. https://doi.org/10.1038/s41558-020-0790-4
[42] Andrew W. Robertson, Arun Kumar, Malaquias Peña, and Frederic Vitart. 2015. Improving and promoting subseasonal to seasonal prediction. Bull. Amer. Meteorol. Soc. 96, 3 (2015), ES49–ES53. https://doi.org/10.1175/BAMS-D-14-00139.1
[43] B. H. Samset, M. T. Lund, and C. Stjern. 2019. How daily temperature and precipitation distributions evolve with surface temperature. In AGU Fall Meeting Abstracts, Vol. 2019. Article GC23D-03.
[44] Roman Schefzik. 2015. Multivariate discrete copulas, with applications in probabilistic weather forecasting. (2015). https://doi.org/10.48550/ARXIV.1512.05629
[45] Gideon Schwarz. 1978. Estimating the dimension of a model. Ann. Statist. 6, 2 (1978), 461–464. Retrieved from http://links.jstor.org/sici?sici=0090-5364(197803)6:22.0.CO;2-5&origin=MSN
[46] Diogo S. F. Silva and Clayton V. Deutsch. 2018. Multivariate data imputation using Gaussian mixture models. Spatial Stat. 27 (2018), 74–90. https://doi.org/10.1016/j.spasta.2016.11.002
[47] M. Sklar. 1959. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris 8 (1959), 229–231.
[48] Wan-Jiao Song and Jin-Yi Yu. 2019. Identifying the complex types of atmosphere-ocean interactions in El Niño. Environ. Res. Lett. 14, 11 (Nov. 2019), 114030. https://doi.org/10.1088/1748-9326/ab4968
[49] Yi Sun, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Learning vine copula models for synthetic data generation. Proc. AAAI Conf. Artific. Intell. 33 (July 2019), 5049–5057. https://doi.org/10.1609/aaai.v33i01.33015049
[50] R. T. Sutton. 2018. ESD ideas: A simple proposal to improve the contribution of IPCC WGI to the assessment and communication of climate change risks. Earth Syst. Dynam. 9, 4 (2018), 1155–1158. https://doi.org/10.5194/esd-9-1155-2018
[51] Natasa Tagasovska, Damien Ackerer, and Thibault Vatter. 2019. Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders. Retrieved from https://arxiv.org/abs/1906.05423
[52] Selim Furkan Tekin, Oguzhan Karaahmetoglu, Fatih Ilhan, Ismail Balaban, and Suleyman Serdar Kozat. 2021. Spatiotemporal weather forecasting and attention mechanism on convolutional LSTMs. Retrieved from https://arxiv.org/abs/2102.00696
[53] Hilde Vernieuwe, Sander Vandenberghe, Bernard De Baets, and Niko Verhoest. 2015. A continuous rainfall model based on vine copulas. Hydrol. Earth Syst. Sci. 19 (06 2015), 2685–2699. https://doi.org/10.5194/hess-19-2685-2015
[54] Frederic Vitart, Constantin Ardilouze, Axel Bonet, Anca Brookshaw, M. Chen, C. Codorean, M. Déqué, L. Ferranti, E. Fucile, M. Fuentes, Harry Hendon, J. Hodgson, H.S. Kang, A. Kumar, Hai Lin, G. Liu, X. Liu, P. Malguzzi, Ioannis Mallas, and L. Zhang. 2016. The sub-seasonal to seasonal prediction (S2S) project database. Bull. Amer. Meteorol. Soc. 98 (06 2016). https://doi.org/10.1175/BAMS-D-16-0017.1
[55] Frederic Vitart, Andrew Robertson, and David Anderson. 2012. Subseasonal to seasonal prediction project: Bridging the gap between weather and climate. WMO Bull. 61 (01 2012).
[56] Ruolan Yu, Rui Yang, Chen Zhang, Maria Špoljar, Natalia Kuczyńska-Kippen, and Guoqing Sang. 2020. A vine copula-based modeling for identification of multivariate water pollution risk in an interconnected river system network. Water 12, 10 (2020). Retrieved from https://www.mdpi.com/2073-4441/12/10/2741
[57] Rui Zhou and Min Ji. 2021. Modelling mortality dependence: An application of dynamic vine copula. Insur.: Math. Econ. 99 (2021), 241–255. https://doi.org/10.1016/j.insmatheco.2021.03.022

Received 15 February 2023; revised 5 June 2023; accepted 18 July 2023



Article 11 (23 pages)
B. Balaji, V. S. G. Vunnava, N. Domingo, S. Gupta, H. Gupta, G. Guest, A. Srinivasan
Flamingo: Environmental Impact Factor Matching for Life Cycle Assessment with Zero-shot Machine Learning

Article 12 (31 pages)
A. Gulgulia, A. Gupta, A. P. Sarashetti, A. Sinha, A. Seth
Tracking Socio-Economic Development in Rural India over Two Decades Using Satellite Imagery

Article 13 (24 pages)
Y. Lin, D. Gatica-Perez
Characterizing Swiss Alpine Lakes: from Wikipedia to Citizen Science

Article 14 (27 pages)
R. Sharma, J. Mirzakhalov, P. Bharti, R. Goyal, T. Schmidt, S. Chellappan
A Friend in Need Is a Friend Indeed: Investigating the Quality of Training Data from Peers for Auto-generating Empathetic Textual Responses to Non-Sensitive Posts in a Cohort of College Students

Article 15 (29 pages)
A. Bhattacharjee, S. Sultana, M. R. Amin, Y. Iqbal, S. I. Ahmed
“What’s the Point of Having This Conversation?”: From a Telephone Crisis Helpline in Bangladesh to the Decolonization of Mental Health Services

Article 16 (36 pages)
S. Mohammad Abrar, N. Awasthi, D. Smolyak, V. Frias-Martinez
Analysis of Performance Improvements and Bias Associated with the Use of Human Mobility Data in COVID-19 Case Prediction Models

Article 17 (23 pages)
S. C. Maina, D. Mwigereri, J. Weyn, L. Mackey, M. Ochieng
Evaluation of Dependency Structure for Multivariate Weather Predictors Using Copulas