Statistical Semantics: Methods and Applications

Sverker Sikström • Danilo Garcia, Editors
Editors

Sverker Sikström
Department of Psychology, Lund University, Lund, Sweden

Danilo Garcia
Department of Behavioral Sciences and Learning, Linköping University, Linköping, Sweden
Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden
Department of Psychology, University of Gothenburg, Gothenburg, Sweden
ISBN 978-3-030-37249-1    ISBN 978-3-030-37250-7 (eBook)
https://doi.org/10.1007/978-3-030-37250-7
To Patricia, for giving meaning to my life beyond words. . .

"When the world ends
Collect your things
You're coming with me
When the world ends
You tuckle up yourself with me
Watch it as the stars disappear to nothing
The day the world is over
We'll be lying in bed"

From the song When the World Ends by Dave Matthews Band

DG
Preface
When I, Sverker Sikström, took my first psychology class and got an assignment in which I and my classmates were asked to collect data, the professor told us not to ask participants for responses in their own words. Use rating scales instead, he said; then you can input the data into your statistical program and do science. So I did, and so did my classmates. However, already at this point I sensed that there was something odd about this advice. Why should we not listen to and try to understand the meaning of the participants' own words? After all, even the very instruction "not to use words" was communicated in words!

It took me a few decades of research experience before I could spell out and approach the answer to this question in more detail. I spent my first decade of research building computational neural network models of the brain in order to mimic experimental behavioral data on memory for words. At some point during this period, our research team needed to control for the semantic content of the stimulus material we were using. Although you can control this by taking words from a specific semantic category (e.g., people, animals), I soon realized that words can easily be classified into several semantic categories at the same time. Take the word "sheep," for instance, which is an animal we can eat and therefore associate with food, but it is also an animal that we can associate with farmers, who can use the sheep to produce wool; at the same time, some people think "sheep" are not smart animals, but rather kind and gentle ones. All these associations with "sheep" might come naturally to us humans. However, they also say something about the challenges we can expect when we do science on this topic. Indeed, semantic properties are high-dimensional and complex, and to straighten these things out we need a lot of data and good algorithms.

Thus, in order to grasp the semantic content of words or narratives, our research team needed a more elaborate model of what meaning is. It was at this point that I came across a theory, and model, of meaning called Latent Semantic Analysis (LSA; Landauer, 1997), a theory that came to change my research interests and the path of my whole career. Before reading this literature, I had been living with the worldview that semantics, or meaning, was something unmeasurable that could not be clearly defined, and that the interpretation of meaning was something that could be done exclusively by humans. This theory of meaning was an eye-opener and made me understand that meaning can be reduced to something very simple, namely, co-occurrences. That is, things that occur together generate meaning.

Lund, Sweden
May 3, 2019
Sverker Sikström
Preface
My own story, that of Danilo Garcia, is related to that of my colleague and friend Sverker Sikström. After becoming intensely curious about human personality as a determinant of people's well-being, I found out how the researchers C. Robert Cloninger, James W. Pennebaker, and Dan P. McAdams were addressing personality as biopsychosocial in nature, that is, composed of temperament, character, and narrative identity. In my quest to understand such a model, I found a good model of temperament and character (i.e., Cloninger's model; see also Chap. 8 in this volume) and the seminal work on narratives using both qualitative approaches and quantitative computerized methods, such as Pennebaker's Linguistic Inquiry and Word Count (LIWC).

Out of fate, or just serendipity, I had included personality, well-being, and a question of my own design in one of the articles I was developing outside of my doctoral dissertation. That question was designed to prime participants to think about a positive or a negative life event by asking them to describe, using their own words, a recent life event. After reading about the possibility of counting specific types of words (e.g., function words) to predict people's health, I started looking for someone who could teach me the world of words beyond the world of numbers. . . To my surprise, not many researchers in Sweden were doing so. But I found one, Sverker Sikström, who instead of working with word counting was working with quantitative semantics. . . For years to come, we and his students have discussed the possibilities, limitations, and groundbreaking applications of these methods. This book aims to give examples of this journey, and that of others, and to push that road even further.

Karlskrona, Sweden
December 4, 2018
Danilo Garcia
Acknowledgment
This work was supported by a grant from the Swedish Research Council (Dnr. 201501229). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Contents

Part I: Methods

1. Introduction to Statistical Semantics
   Sverker Sikström and Danilo Garcia
2. Creating Semantic Representations
   Finn Årup Nielsen and Lars Kai Hansen
3. Software for Creating and Analyzing Semantic Representations
   Finn Årup Nielsen and Lars Kai Hansen
4. Semantic Similarity Scales: Using Semantic Similarity Scales to Measure Depression and Worry
   Oscar N. E. Kjell, Katarina Kjell, Danilo Garcia, and Sverker Sikström
5. Prediction and Semantic Trained Scales: Examining the Relationship Between Semantic Responses to Depression and Worry and the Corresponding Rating Scales
   Oscar N. E. Kjell, Katarina Kjell, Danilo Garcia, and Sverker Sikström
6. SemanticExcel.com: An Online Software for Statistical Analyses of Text Data Based on Natural Language Processing
   Sverker Sikström, Oscar N. E. Kjell, and Katarina Kjell

Part II: Applications

7. Neuroscience: Mapping the Semantic Representation of the Brain
   Larissa Langensee and Johan Mårtensson
8. A Ternary Model of Personality: Temperament, Character, and Identity
   Danilo Garcia, Kevin M. Cloninger, Sverker Sikström, Henrik Anckarsäter, and C. Robert Cloninger
9. Dark Identity: Distinction Between Malevolent Character Traits Through Self-Descriptive Language
   Danilo Garcia, Patricia Rosenberg, and Sverker Sikström
10. The (Mis)measurement of Happiness: Words We Associate to Happiness (Semantic Memory) and Narratives of What Makes Us Happy (Episodic Memory)
    Danilo Garcia, Ali Al Nima, Oscar N. E. Kjell, Alexandre Granjard, and Sverker Sikström
11. Space: The Importance of Language as an Index of Psychosocial States in Future Space Missions
    Sonja M. Schmer-Galunder
12. Social Psychology: Evaluations of Social Groups with Statistical Semantics
    Marie Gustafsson Sendén and Sverker Sikström
13. Implicit Attitudes: Quantitative Semantic Misattribution Procedure
    Niklas Lanbeck, Danilo Garcia, Clara Amato, Andreas Olsson, and Sverker Sikström
14. Linguistic: Application of LSA to Predict Linguistic Maturity and Language Disorder in Children
    Kristina Hansson, Birgitta Sahlén, Rasmus Bååth, and Sverker Sikström
15. Political Science: Moving from Numbers to Words in the Case of Brexit
    Annika Fredén
Contributors
Clara Amato  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden
Henrik Anckarsäter  Center for Ethics, Law and Mental Health, University of Gothenburg, Gothenburg, Sweden
Rasmus Bååth  Department of Cognitive Science, Lund University, Lund, Sweden
C. Robert Cloninger  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden; Department of Psychology, University of Gothenburg, Gothenburg, Sweden; Department of Psychiatry, Center for Well-Being, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
Kevin M. Cloninger  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden; Anthropedia Foundation, St. Louis, MO, USA
Annika Fredén  Karlstad University, Karlstad, Sweden
Danilo Garcia  Department of Behavioral Sciences and Learning, Linköping University, Linköping, Sweden; Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden; Department of Psychology, University of Gothenburg, Gothenburg, Sweden
Alexandre Granjard  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden
Lars Kai Hansen  Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
Kristina Hansson  Department of Logopedics, Phoniatrics and Audiology, Lund University, Lund, Sweden
Katarina Kjell  Department of Psychology, Lund University, Lund, Sweden
Oscar N. E. Kjell  Department of Psychology, Lund University, Lund, Sweden
Niklas Lanbeck  Department of Clinical Neuroscience, Division of Psychology, Karolinska Institute, Stockholm, Sweden
Larissa Langensee  Department of Clinical Sciences, Lund University, Lund, Sweden
Simone Löhndorf  Department of Linguistics, Centre for Languages and Literature, Lund University, Lund, Sweden
Johan Mårtensson  Department of Clinical Sciences, Lund University, Lund, Sweden
Finn Årup Nielsen  Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
Ali Al Nima  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden; Department of Psychology, University of Gothenburg, Gothenburg, Sweden
Andreas Olsson  Department of Clinical Neuroscience, Division of Psychology, Karolinska Institute, Stockholm, Sweden
Patricia Rosenberg  Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden
Birgitta Sahlén  Department of Logopedics, Phoniatrics and Audiology, Lund University, Lund, Sweden
Sonja Schmer-Galunder  Smart Information Flow Technologies, Minneapolis, MN, USA
Marie Gustafsson Sendén  Department of Psychology, Stockholm University, Stockholm, Sweden; Department of Social Sciences, Södertörn University, Huddinge, Sweden
Sverker Sikström  Department of Psychology, Lund University, Lund, Sweden
Part I: Methods

Chapter 1: Introduction to Statistical Semantics

Sverker Sikström and Danilo Garcia
• Behavioral science is currently dominated by numeric rating scales, but people's mental states are best communicated with words.
• The usage of open-ended word responses in behavioral science has been limited by the lack of methods, algorithms, software, and knowledge of how to quantify their meaning.
• Meaning is created by co-occurrences of concepts in the world. These co-occurrences can be used to create semantic representations of concepts.
• Semantic representations can be generated by applying data-compression algorithms to co-occurrences of words in text corpora.
• Throughout this book, researchers detail how the quantification of open-ended word responses using quantitative semantics complements and, in some cases, replaces numeric rating scales, thanks to its ability to measure, describe, and discriminate between similar concepts.

Meaning in the world comes from the fact that concepts tend to occur together in a predictable way. When we see a "dog", we also see a tail, paws, eyes, legs, fur, and so on. In this context, we may also see an owner who goes for a walk in the park with the "dog", which is attached to a leash, and the "dog" might bark. Thus, "dog" is connected to other concepts, and when these concepts reliably co-occur with each other, the meaning of "dog" as a concept is created.
In other words, the meaning of a concept is generated when there are reliable relationships between the different concepts related to the concept we want to define. Thus, the probability of occurrence of one concept is increased by the presence of another concept. We actually live in a world of meaning as a function of co-occurrences. Despite this, most concepts in psychological research are currently quantified using numeric rating scales. The most likely reason for this is the lack of methods and tools that statistically measure the meaning of words. This book lays a foundation for overcoming these methodological problems and for bridging different methodological schools.

The concept of love, for example, is often accompanied by other positive emotional synonyms such as kindness, caring, and friendship. At the same time, love is highly interconnected with its antonyms, such as hate, betrayal, and killing. Thus, concepts that people perceive as opposites (i.e., pairs of synonyms and antonyms) are concepts that actually do co-occur in natural language. In fact, the presence of an antonym is a strong predictor of its synonym, although they typically occur in slightly different contexts (e.g., positive or negative scenarios).

Although the world that we live in is filled with reliable co-occurrences that provide the information needed to create and understand the meaning of concepts, we also need a mental representation that reflects these co-occurrences. This is exactly what happens in the brain when we apprehend our environment. However, as researchers we do not have direct access to the brain's mental representations of concepts. Hence, we need to create models for describing semantic representations. Quantitative or statistical semantics is one powerful way to create such models using people's own words and the meaning of those words in natural language.

This book limits its focus to how semantic representations can be generated from text corpora. However, the reader should acknowledge that meaning, or co-occurrence, in humans is generated from multiple sources, including different modalities such as vision, hearing, smell, touch, etc. This creates a multi-modal representation that is important for semantic representations in humans. These external sensory inputs also co-occur with internally generated representations that are evoked by the sensory inputs and generated from memories of previous experiences. Creating semantic representations solely from text corpora has its limitations, but it also has practical purposes and advantages; for example, huge text corpora are easily accessible, and concepts can be directly mapped to entities such as words, which is not as straightforward in other modalities.

In a majority of the chapters in this book, researchers use Latent Semantic Analysis (LSA) for the creation of semantic representations. The reason for this choice is not necessarily that this algorithm is computationally superior, or newer, than other algorithms. The choice has been governed by the fact that LSA provides a good enough representation to answer the research questions posed in the chapters, rather than by which algorithm provides the optimal semantic representation.

Chapter 2, by Nielsen and Hansen (2020a), describes how such semantic representations can be generated. There are several different ways to accomplish this, but they all have some core features in common. Most importantly, they all rely on the principle that semantic representations can be generated from co-occurrences of concepts. Furthermore, they all require a huge text corpus from which the co-occurrences
of concepts can be generated. Chapter 2 provides an overview of prominent algorithms for creating semantic representations from text, for example, LSA, word embeddings, deep learning, and so on. We call the semantic representation generated by an algorithm based on a text corpus a semantic space.

In Chap. 3, Nielsen and Hansen (2020b) provide an overview of several popular software packages for constructing semantic spaces from a text corpus, including scikit-learn, Gensim, and Keras for Python. These packages come with ready-made semantic representations for specific languages. The tools make semantic methods more accessible for researchers and students.

A semantic space typically consists of a few hundred dimensions, where each dimension describes a semantic feature. Each word is represented by a vector in this representation. A semantic representation can be utilized in several ways; however, the most important information that can be generated from a space is semantic similarity, which is discussed in Chap. 4 by Kjell et al. (2020a). A semantic representation describes the similarity between all pairs of words included in the representation. Thus, the meaning of the representation is fully described by the relationships between the concepts themselves, without any direct mapping to the outside world. An important feature of a semantic space is that it can also represent sentences, paragraphs, or text documents. This can be done by simply aggregating the vectors representing each word into a single vector representing the words in the text at hand.

Semantic similarity scores can be used to test whether two sets of texts are statistically different from each other. This is conducted by measuring the semantic similarity between each text and the texts from the two conditions, and then conducting a t-test on these similarity scores. This is what we call a semantic t-test, due to the fact that the means being compared are not from numeric rating scales, but from semantic similarity scores derived from quantified semantic representations.

Concepts in the semantic representations can also be related to entities outside of the semantic representation. Chapter 5, by Kjell et al. (2020b), shows how machine learning can be used to map words, or texts, to external variables. This is conducted by adjusting weights connected to the semantic representation to fit the external variable. Multiple linear regression is an example of a learning algorithm that can be used for this purpose. Semantic training of a numerical variable can be used to study whether the semantic representation associated with texts is significantly related to the numerical variable. Inference testing is made by testing whether the correlation between the numerical variable and the predicted numerical values is significantly above zero. As such, semantic to numerical correlation is an important tool for scientific investigations of text data.

The semantic t-test and semantic to numerical correlations are statistical tools that are the basic building blocks for scientific investigations of meaning representations.
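To make these two building blocks concrete, the following is a minimal sketch, not the chapters' actual pipeline: random 100-dimensional vectors stand in for texts already mapped into a semantic space, the ratings are hypothetical, and scipy and scikit-learn are assumed to be available. The t-test shown is one simple variant that compares similarities to the mean vector of one condition.

```python
import numpy as np
from scipy.stats import ttest_ind, pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
texts_a = rng.standard_normal((20, 100))   # condition A, one vector per text
texts_b = rng.standard_normal((20, 100))   # condition B

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Semantic t-test: similarity of each text to a prototype vector,
# compared between the two conditions.
prototype = texts_a.mean(axis=0)
sim_a = [cos(t, prototype) for t in texts_a]
sim_b = [cos(t, prototype) for t in texts_b]
print(ttest_ind(sim_a, sim_b))

# Semantic to numerical correlation: regress a rating on the semantic
# vectors and test whether predictions correlate with observed ratings
# (in practice this is done with cross-validation).
ratings = rng.standard_normal(20)
predicted = LinearRegression().fit(texts_a, ratings).predict(texts_a)
print(pearsonr(ratings, predicted))
```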
However, such statistical analyses of text data need appropriate software dedicated to statistical testing of text data. In Chap. 6, Sikström et al. (2020) describe SemanticExcel.com, an online tool that allows the user to quickly compute semantic to numerical correlations, semantic t-tests, predictions, and visualizations of text data. It comes with predefined semantic spaces based on Google N-grams in 15 languages (e.g., English, Spanish, Swedish). This online tool is used in most of the other chapters in this book and is expected to facilitate the usage of open-ended word responses and to provide an alternative to the rating scales that currently dominate as outcome measures in behavioral science research.
Although this book focuses on artificial algorithms and theories for creating, understanding, and measuring semantic representations, analogous representations of semantics are obviously continuously created and modified in our brains. The fact that almost the entire brain is somehow connected to semantic representations is hardly surprising, as meaning is related to the connection, or co-occurrence, of different concepts. Although word comprehension and production are focused on specific areas, such as the polar region of the temporal lobe and Broca's area in the prefrontal lobe, other brain areas are activated depending on the specific semantic content. In Chap. 7, Langensee and Mårtensson (2020) provide an overview of the state of the art of semantic representation in the brain, focusing primarily on fMRI studies.

The other chapters in the second part of the book have a common theme: the research detailed in them provides an alternative to the overwhelming number of scientific publications in behavioral science today that are primarily based on numeric rating scales. Indeed, the use of numeric rating scales to measure people's state of mind is limiting, because both the researchers themselves and the rest of the world rely almost exclusively on language to conceptualize what is in their own and in other people's minds. The second part of the book covers several research areas to which researchers apply the theoretical foundation laid down in the first part of the book, thus showing that quantitative semantics is applicable to widely different areas within the behavioral sciences. These areas include: personality psychology (Chap. 8 by Garcia et al. 2020a; Chap. 9 by Garcia et al. 2020c), clinical, health, and well-being research (Chap. 10 by Garcia et al. 2020b; see also Chaps. 4 and 5 by Kjell et al. 2020a, b), social psychology (Chap. 12 by Gustafsson Sendén and Sikström 2020; and Chap. 11 by Schmer-Galunder 2020), subliminal processes (Chap. 13 by Lanbeck et al. 2020), linguistics (Chap. 14 by Hansson et al. 2020), and political science (Chap. 15 by Fredén 2020).

So why should we measure mental states with words rather than numbers? The most natural answer to this question is that language is the natural way of communicating, the way people prefer to give an answer to a question and to take in information. Words also provide opportunities for open-ended answers. This is important, as it allows people to freely express their mental states, whereas the response alternatives on rating scales may not always allow people to do so, and in many cases these alternatives are not relevant to participants' situations or contexts. Although people do not always think in terms of words, they certainly do so much more than they think in terms of numbers. Eventually these words need to be mapped to numbers for conducting statistics; however, numeric rating scales put this difficult task of translation on the participants taking part in the experiments, whereas it may be more appropriate to move this task to objective tools controlled by the behavioral scientist.

The reader may wonder whether mental states can be measured sufficiently well with words. We hope that this book will provide a convincing answer to this question.
Chapters 4 and 5 (Kjell et al. 2020a, b), for example, provide evidence that posing specific open-ended word-based questions to assess people's well-being and mental health yields information that correlates well with current state-of-the-art measures that use numeric rating scales.
Here the question arises of what is a good-enough correlation to make it meaningful to start using words as outcome measures. This question is complicated by the lack of "gold standards" for measuring how people feel. It is not necessarily the case that a higher correlation with a rating scale means that we measure the concept more accurately; a higher correlation may just mimic problems with measures that use numeric rating scales. In Chaps. 4 and 5 we argue that measures of mental health and well-being that use numeric rating scales tend to measure a general valence of well-/ill-being, and are therefore less likely to make meaningful discriminations between concepts. Along these lines, the results show that semantic measures yield more words that discriminate between intertwined affective disorders, such as depression and anxiety. This finding was particularly clear when using one concept as a covariate for the other concept.

Moreover, findings related to the measurement of personality and happiness are interesting from the point of view of how quantitative semantics might help us to bridge methodological schools. For example, in Chap. 8, Garcia et al. (2020a) show how identity can be quantified as clusters of words that people use for self-presentation and then, using clinically trained eyes, linked to the personality profiles behind these clusters. In Chap. 9, Garcia et al. go further and identify how words used for self-representation allow us to identify nuances in the expression of malevolent personality as measured by common personality instruments that use numeric rating scales. Finally, in Chap. 10, Garcia et al. (2020b) use words that people freely generate to describe what they relate to happiness versus what makes them happy, in order to identify which memory system is active when people rate their level of subjective well-being or happiness.

Finally, we want the reader to reflect upon the fact that semantic measures also have other fundamental and important qualities that numeric rating scales simply lack. One important aspect relates to how the to-be-measured concepts are defined and described. Numeric rating scales are created by the researcher, who generates items with fixed response alternatives that are summed into aggregated scores of the concept. The outcome of numeric rating scales is then mapped back to a label, e.g., "depressed" or "happy", that is communicated as a result in journal publications or directly to the patient in clinical settings. This mapping back and forth between items and concepts is potentially harmful, as people, or groups of researchers and/or participants from different sub-cultures, do not necessarily share the same view of the concept. Although it is possible to adapt rating scales for different groups, such adaptations are not an inherent property of the rating scale methodology. For example, young/old, women/men, etc. may view and prioritize concepts such as "anxiety", "rape", and "well-being" differently, and their view may be radically different from that of the rather narrow group of researchers interested in the specific concepts.
The approaches outlined and suggested in this book overcome this problem by allowing the concepts to be defined by the participants, with semantic measures simultaneously measuring and describing the concept. As the concepts are directly presented to the participant in the semantic questions, participants can write the words that correspond to their idiosyncratic interpretation of the concept.
The definition of the concept is then communicated to the researcher by the words used in the answer. Differences in various subgroups' views of the concept can also be semantically quantified, tested, and/or visualized. We hope that this book will both inspire and provide tools for further research using words as outcome measures.

The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking. (Albert Einstein)

Acknowledgments This research has been supported by grants from Vinnova (2018-02007) and the Kamprad Foundation (ref # 20180281).
References

Fredén, A. (2020). Political science: Moving from numbers to words in the case of Brexit. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Garcia, D., Cloninger, K., Sikström, S., & Cloninger, C. R. (2020a). A ternary model of personality: Temperament, character, and identity. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Garcia, D., Nima, A., Kjell, O. N. E., Granjard, A., & Sikström, S. (2020b). The (mis)measurement of happiness: Words we associate to happiness (semantic memory) and narratives of what makes us happy (episodic memory). In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Garcia, D., Rosenberg, P., & Sikström, S. (2020c). Dark identity: Distinction between malevolent character traits through self-descriptive language. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Gustafsson Sendén, M., & Sikström, S. (2020). Social psychology: Evaluations of social groups with statistical semantics. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Hansson, K., Sahlén, B., Bååth, R., Löhndorf, S., & Sikström, S. (2020). Linguistic: Application of LSA to predict linguistic maturity and language disorder in children. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Kjell, O., Kjell, K., Garcia, D., & Sikström, S. (2020a). Semantic similarity scales: Using semantic similarity scales to measure depression and worry. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Kjell, O., Kjell, K., Garcia, D., & Sikström, S. (2020b). Prediction and semantic trained scales: Examining the relationship between semantic responses to depression and worry and the corresponding rating scales. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Lanbeck, N., Garcia, D., Amato, C., Olsson, A., & Sikström, S. (2020). Implicit attitudes: Quantitative semantic misattribution procedure. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Langensee, L., & Mårtensson, J. (2020). Neuroscience: Mapping the semantic representation of the brain. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Nielsen, F. Å., & Hansen, L. K. (2020a). Creating semantic representations. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Nielsen, F. Å., & Hansen, L. K. (2020b). Software for creating and analyzing semantic representations. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Schmer-Galunder, S. (2020). Space: The importance of language as an index of psychosocial states in future space missions. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Sikström, S., Kjell, O., & Kjell, K. (2020). SemanticExcel.com: An online software for statistical analyses of text data based on natural language processing. In S. Sikström & D. Garcia (Eds.), Statistical semantics: Methods and applications. Cham: Springer.
Chapter 2: Creating Semantic Representations

Finn Årup Nielsen and Lars Kai Hansen
In this chapter, we present the vector space model and some ways to further process such a representation: with feature hashing, random indexing, latent semantic analysis, non-negative matrix factorization, explicit semantic analysis and word embedding, a word or a text may be associated with a distributed semantic representation. Deep learning, explicit semantic networks and auxiliary non-linguistic information provide further means for creating distributed representations from linguistic data. We point to a few of the methods and datasets used to evaluate the many different algorithms that create a semantic representation, and we also point to some of the problems associated with distributed representations.

• Individual words can be represented in a vector representation, but the space spanned by the words does not have any semantic interpretation.
• Both feature hashing and random indexing can reduce the size of the vector space.
• When latent semantic analysis and non-negative matrix factorization are applied to bag-of-words matrices, they create distributed semantic representations where the dimensions may be related to topics present in a given corpus.
• Word embeddings are distributed semantic representations of words efficiently created from context windows.
Vector Space Model

The general idea in vector space representations is that documents are represented by vectors in such a manner that similar documents are geometrically close. A vector space model (VSM) can be created from documents with one or more words by associating each document with a tuple forming the coordinates of a point in the vector space. A simple way to form a vector space representation is to associate each coordinate with a term, e.g., a word, a phrase and/or a sequence of characters or words consisting of n elements, a so-called n-gram. The values of the vector representing a document can then be either the counts of the given term in the document, or simply a 1 or 0 if the given term is present/absent (Salton et al. 1975). If the texts are documents, we can refer to the result as a document-term matrix or bag-of-words matrix. Table 2.1 shows a 3-by-8 document-term matrix formed from a small corpus of three short documents (here single sentences): first: "a dog runs fast", second: "a dolphin swims fast" and third: "a dog bites another dog".

Table 2.1 Example document-term matrix with a bag-of-words representation of the three small documents. Each element indicates the count of how many times a word occurs in a specific document; for the third document, the word "dog" appears twice.

          a   dog   runs   fast   dolphin   swims   bites   another
First     1    1     1      1        0        0       0        0
Second    1    0     0      1        1        1       0        0
Third     1    2     0      0        0        0       1        1
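As a minimal illustration (assuming the scikit-learn package discussed in Chap. 3), a document-term matrix like Table 2.1 can be produced in a few lines of Python. The widened token pattern is needed because scikit-learn's default tokenizer drops one-letter words such as "a".

```python
# Building the bag-of-words matrix of Table 2.1 with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog runs fast",
        "a dolphin swims fast",
        "a dog bites another dog"]

# Keep single-character tokens so the word "a" is counted.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)           # 3-by-8 sparse count matrix
print(vectorizer.get_feature_names_out())    # the eight terms
print(X.toarray())                           # "dog" counted twice in row 3
```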
In this VSM, the documents can be regarded as points in an 8-dimensional space, where each dimension is associated with a word. The simple word vector space defined here does not represent individual words in a distributed way, and as such, any distance measured between words will be equal, i.e., we cannot use this representation to indicate semantic relations between individual words. The bag-of-words vector can be expanded with pairs of (consecutive) words, i.e., bigrams, or longer sequences, trigrams and generally n-grams, but the distances between individual terms will still be equal. However, when the vector space model is used to represent a text with multiple words, a paragraph or a complete document, the text will be represented over multiple elements in the vector, and then distances between two texts may be related to semantics. For a text, vector elements may reflect the presence or the counts of each word in the text.

Several methods have been suggested to improve the vector space representation of a document. Elements associated with so-called stop words may be excluded from the vector. Stop words are common words, such as the, and and its, that seldom carry much information about the topic of the document and may hinder more than help in any further semantic processing or interpretation of the vector representation. If a term only occurs in a single document, then it may often be discarded because it does not directly affect any modeling based on co-occurrence. Apart from n-grams, there are various other methods to modify the terms: the individual word may be stemmed or a lemmatizer can identify its lemma, and multiword expressions may be detected as named entities and aggregated into one term.

The document-term matrix can be scaled both row-wise and column-wise. Long documents have many words, and the associated vector of raw word counts is long,
i.e., its norm is large, and the vector representations of a short and a long document about the same topic may be a great distance from each other in the vector space if no scaling for length is performed. Furthermore, words that occur in most documents across the corpus may not be the words that separate topics well, and such words should have less weight. One common scaling scheme is referred to as tf-idf: term frequency (times) inverse document frequency. The specific form of the scaling varies; as an example, the popular machine learning package scikit-learn uses¹

    tfidf(d, t) = tf(d, t) · idf(d, t)

where d and t are the document row and term column index, respectively, tf(d, t) is the raw word count, and the inverse document frequency is computed as idf(d, t) = log[(1 + n)/(1 + df(t))] + 1, with n the number of documents and df(t) the number of documents containing the term t.
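As a sketch (again assuming scikit-learn), this smoothed-idf scaling is available directly; smooth_idf=True reproduces the formula above.

```python
# tf-idf scaling of a document-term count matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["a dog runs fast",
        "a dolphin swims fast",
        "a dog bites another dog"]
X = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# smooth_idf=True gives idf(t) = log[(1 + n)/(1 + df(t))] + 1.
X_tfidf = TfidfTransformer(smooth_idf=True, norm="l2").fit_transform(X)
print(X_tfidf.toarray().round(2))   # rows are length-normalized documents
```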
Feature Hashing

The representation of words and phrases as unigrams, bigrams or higher-order n-grams produces very large feature spaces, which pose a challenge for resource-constrained systems. The so-called "hashing trick" can limit the number of features by setting up a limited number of buckets and assigning each word to a bucket. The method goes under the names random feature mixing or feature hashing (Ganchev and Dredze 2008). The approach leads to hash collisions, where multiple words share the same bucket. From the further modeling point of view, this will look like a massive homograph problem, and it comes as no surprise that the performance of a model using feature hashing may degrade. However, researchers have made the perhaps surprising observation that the performance does not degrade that much. Ganchev and Dredze found that on a range of supervised text classification tasks across different labeled corpora, with four popular machine learning methods and binary unigram features, a tenfold reduction in size would still yield performance between 96.7% and 97.4% of the original model (Ganchev and Dredze 2008).

An example of a hashing function is Ganchev and Dredze's suggestion: Java's hashCode function followed by a modulo operation using the intended number of buckets as divisor (Ganchev and Dredze 2008). The popular Python package Gensim implements hashing via an adler32 checksum function and a modulo operation.²
¹ http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
² https://radimrehurek.com/gensim/corpora/hashdictionary.html
With a good hash function, we can hope that the words are distributed with equal probability among the buckets. The rate of hash collisions then follows the same statistics as the so-called 'birthday paradox'. Feature hashing is not a distributed representation of words: each word is still represented, as in the normal vector space model, by one single element in the vector, and all distances between words are the same (except for words that collide, bringing their distance to zero).
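A minimal sketch of the idea, using Python's zlib.adler32 in the spirit of the Gensim implementation mentioned above (the bucket count is an arbitrary choice):

```python
# Feature hashing: each token is mapped to one of n_buckets positions;
# distinct words may collide in the same bucket.
from zlib import adler32

def hashed_bow(tokens, n_buckets=1024):
    vec = [0] * n_buckets
    for token in tokens:
        vec[adler32(token.encode("utf-8")) % n_buckets] += 1
    return vec

vec = hashed_bow("a dog bites another dog".split())
print(sum(vec))   # 5 tokens counted, possibly with collisions
```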
Random Indexing

In random indexing, texts are projected using a random matrix (Bingham and Mannila 2001). The advantage of this approach is that no model parameters need to be estimated, and only one pass over the corpus is necessary to project the text into a low-dimensional space. If individual words are projected, they will get a distributed representation. However, the distances between the projected words do not gain any semantic interpretation. Nevertheless, the dimension of the data is reduced, and any subsequent semantic modeling may benefit from that.
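A rough NumPy sketch of the idea (the dimensions are arbitrary illustrations): a fixed random matrix maps a high-dimensional bag-of-words vector to a low-dimensional one in a single pass.

```python
# Random projection: no parameters are estimated from the corpus.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 10_000, 300
R = rng.standard_normal((vocab_size, k)) / np.sqrt(k)  # fixed random matrix

x = np.zeros(vocab_size)   # bag-of-words vector for one document
x[17] = 2.0                # hypothetical word counts
x[42] = 1.0
x_low = x @ R              # k-dimensional projected representation
```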
Latent Semantic Analysis

Latent semantic analysis (LSA) is a form of linear multivariate analysis of texts represented in matrix form (Deerwester et al. 1990). It uses singular value decomposition (SVD) and typically works from a document-term matrix, which may be weighted, e.g., with tf-idf, before the SVD is applied. SVD factorizes a document-term matrix, X,

    X_{d,t} = Σ_k U_{d,k} L_k V′_{t,k},

into the orthogonal matrices U and V, which consist of loadings over documents and terms, respectively. L is a diagonal matrix with positive singular values. It is usually the dimensions associated with the large singular values that are of interest.

An SVD algorithm is available for many programming languages, and these standard implementations can work on small text data. There exist algorithms that can work with the corpus in batches, thus enabling LSA to work over very large datasets that cannot fit in memory (Řehůřek 2011).

SVD may also be used on character n-grams. For instance, Soboroff et al. (1997) extracted character 3-grams from 32 biblical Hebrew chapters, applied SVD on the chapter-by-n-gram matrix and visualized the resulting loadings over chapters for examining the authorship and writing style of the chapters.
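As a sketch with scikit-learn (one of several possible tools, see Chap. 3), a truncated SVD of a tf-idf matrix gives the document and term loadings:

```python
# LSA via truncated SVD on a tf-idf weighted document-term matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["a dog runs fast",
        "a dolphin swims fast",
        "a dog bites another dog"]
X_tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

svd = TruncatedSVD(n_components=2)          # keep the 2 largest singular values
doc_loadings = svd.fit_transform(X_tfidf)   # rows: documents in the latent space
term_loadings = svd.components_             # rows: latent dimensions over terms
```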
Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) factors a non-negative matrix into two non-negative matrices (Lee and Seung 2001),

    X_{d,t} ≈ Σ_k W_{d,k} H_{k,t}.

If the matrix X is a document-term matrix, then W will be a matrix of loadings over documents and H a matrix of loadings over terms. In its basic formulation, the only hyperparameter is the dimension of the factorized space k, i.e., the number of columns of W and the number of rows of H. The dimension is usually chosen to be smaller than the dimensions of X; thus the product WH will not be able to reconstruct the original matrix and a residual remains, i.e., "approximative NMF" in contrast to "exact NMF".

Lee and Seung presented simple iterative algorithms for two cost functions (Lee and Seung 2001). In one case, the algorithm minimizes the residual as the Frobenius norm of the difference between the original matrix and the reconstructed matrix, ‖X − WH‖², with the multiplicative updates

    H_{kt} ← H_{kt} (W′X)_{kt} / (W′WH)_{kt},
    W_{dk} ← W_{dk} (XH′)_{dk} / (WHH′)_{dk},

where the ratios and the multiplications with the previous estimates are elementwise. This algorithm has an inherent non-uniqueness: the columns of W and the rows of H can be permuted, and scalings can be moved between the two matrices. Furthermore, its convergence properties are not straightforward, e.g., if the denominator of the update equation is or becomes zero. A practical implementation may augment the denominator with a small positive constant.

Probabilistic latent semantic analysis (PLSA) considers the probabilities of co-occurrences of documents and terms, P(d, t), as a mixture distribution with k mixtures,

    P(d, t) = Σ_k P(d|k) P(k) P(t|k).

With an extension of the NMF model and suitable normalization of the matrices, NMF can be interpreted in the context of PLSA,

    X ≈ c W L H,

where P(d, t) corresponds to cX, P(d|k) to W, P(t|k) to H, and P(k) to diag(L).

There exist extensions of NMF. For instance, the cost function can be extended with terms for the norms of the matrices W and H. In contrast to LSA, NMF imposes no orthogonality constraints on any of the vector pairs in W or H. And whereas principal component-based algorithms result in "holistic" representations, the non-negativity of NMF results in parts-based representations (Lee and Seung 1999).
NMF has been applied in text mining and discovers semantic features (Lee and Seung 1999; Nielsen et al. 2005). The non-orthogonality and non-negativity will usually make interpretation of the factorization less problematic than for the factorization from LSA. To overcome the issue of selecting the dimension of the factorized space, multiple independent NMFs can be run with different dimensions, and a visualization technique can give an overview of the results (Nielsen et al. 2005).
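A minimal NumPy sketch of the multiplicative updates for the Frobenius cost, with the small positive constant suggested above guarding the denominator (the matrix is random toy data):

```python
# Lee and Seung's multiplicative updates for approximative NMF.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3, 8))            # non-negative document-term matrix
k, eps = 2, 1e-9
W = rng.random((3, k))
H = rng.random((k, 8))

for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)   # elementwise ratio and update
    W *= (X @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(X - W @ H))  # residual of the approximation
```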
Explicit Semantic Analysis

Explicit semantic analysis (ESA) creates a semantic representation of a word or a text with the use of an information retrieval technique. ESA represents the word or the text as a weighting over documents. The common method converts a corpus to a tf-idf-weighted document-term matrix. When a particular word is to be encoded in the ESA distributed representation, the vector space representation of the word is multiplied onto the document-term matrix. The original report used Wikipedia as the corpus (Gabrilovich and Markovitch 2007).

The attractiveness of ESA rests on its good performance in semantic relatedness tasks, e.g., tasks with numerical judgments of how related two words are. Another attractive feature is that the dimensions of the space are directly interpretable, as they represent documents of the corpus used to estimate the ESA representation, e.g., Wikipedia articles. Wikipedias may have millions of articles (the English Wikipedia has over five million articles), so potentially the representation of each word would have a dimension of several million. In practice, the number of documents selected is somewhat lower. The original study (Gabrilovich and Markovitch 2007) used a selection of around 240,000 Wikipedia articles after excluding small articles and articles with few incoming and outgoing hyperlinks. Some research has found that the type of corpus has less relevance for the performance of ESA in document similarity tasks (Anderka and Stein 2009).
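A minimal sketch of the idea (the three-"article" corpus is a hypothetical stand-in for Wikipedia): a word's ESA vector is its column of tf-idf weights over the documents.

```python
# ESA: representing a word as tf-idf weights over the documents of a corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the dog show and the kennel club",        # stand-in "articles"
          "accounting firms audit large companies",
          "the company hired an accounting firm"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)                 # documents by terms

column = vectorizer.vocabulary_["accounting"]
esa_accounting = X[:, column].toarray().ravel()      # weights over documents
print(esa_accounting)
```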
Word Embeddings

Word embedding algorithms create semantic representations of words by scanning large corpora, extracting windows of words and modeling the words in a window as a relation between a word and its context, so that the final algorithm results in a model where each word is represented as a point in a low-dimensional space (a common size is 100 dimensions). For a schematic example in two dimensions, see Illustration 2.1.

Illustration 2.1 Schematic representation of a two-dimensional word embedding. Here three words, accounting, company and bird, are embedded in the two-dimensional space. A good word embedding should place similar words close together, e.g., accounting and company should be closer to each other than to bird.

Researchers have suggested several algorithms. One early algorithm, hyperspace analogue to language (HAL), built the co-occurrence matrix by scanning a corpus with a window size of 10 words, with weighting within the window based on the number of words separating the two words of interest (Lund and Burgess 1996). From a low-dimensional representation, distances in the space of the co-occurrence
matrix may have semantic interpretation and may separate words belonging to categories such as animal names, body parts and geographical locations.

Newer word embedding models build a neural network model between a center word and its context. Mikolov et al. simplified the neural network model to two layers (Mikolov et al. 2013a): a linear layer from the input (with a dimension corresponding to the size of the vocabulary) to the embedding space (with a dimension of 100 or more). From the embedding space, a second layer projects to the output with the size of the vocabulary. In the parlance of Mikolov et al., the model that predicts the center word from the context is called continuous bag of words (CBOW), while the reverse model, where the context is predicted from the center word, is termed skip-gram (SG). The output layer has a softmax function. The optimization of the word embedding model through the softmax layer requires, in its common application, normalization across the vocabulary. To avoid this computationally costly step, the word embedding methods use what is called noise-contrastive estimation or negative sampling, where a few samples of the vocabulary are used as a form of substitute for the normalization.

The word embedding models have a number of hyperparameters. The dimension of the embedding space can have sizes ranging, for instance, from 25 to 300,³ while the window size often has the size (2n + 1) = 5 (Al-Rfou et al. 2014).⁴ An evaluation of the HAL-based model among window sizes of 1, 2, 4, 8 and 10 found that 8 was the size with the largest correlation with a semantic distance measure derived from a human reaction time experiment. An evaluation of word2vec word embeddings for predicting various word analogies found that a short window of 2 words on either side mostly performed best for syntactic analogies, while a wider window of 10 words on either side performed best for semantic tasks (Linzen 2016). Other hyperparameters that may be important are dynamic context windows, subsampling of frequent words, and rare word deletion.
³ https://nlp.stanford.edu/projects/glove/
⁴ Note it is not always clear how the size is counted. One may count the window size as the total number of words or count it based on the number of words on each side of the word of interest.
Generally, the hyperparameters of word embedding models may have considerable influence on the performance of the resulting models, to such an extent that it may be more important to select well-performing hyperparameters than to select between different word embedding models (Levy et al. 2015). The quality of the word embedding usually depends much on the size of the corpus: the bigger the better (Nielsen and Hansen 2017).

Illustration 2.2 shows a projection of a few words from a word analogy dataset of Mikolov et al. (2013a) with the GloVe model trained on a very large corpus. Only the first two principal components are shown, but the second principal component approximately encodes a femaleness/maleness semantics, and family relations, personal pronouns and royalty names seem to cluster: a qualitative evaluation of the word embedding would say that this is a good embedding.

Illustration 2.2 Principal component projection of male/female words embedded with the 300-dimensional GloVe embedding (Pennington et al. 2014). Here the female words tend towards the positive part of the second principal component, while male words are projected in the opposite direction. All words have been taken from the word analogy dataset of Mikolov et al. (2013a).

Some of the word embedding algorithms that model the word-context relationship have been related to the pointwise mutual information (PMI) between the words and the contexts (Levy and Goldberg 2014; Levy et al. 2015). Each element of the word-context PMI matrix is the logarithm of the ratio of the joint probability of a word and a context divided by the product of the marginal probabilities of the word and the context. The skip-gram version of word2vec can be regarded as a factorization of the PMI matrix shifted by a value determined by the number of negative samples.

If word-based embeddings are to be used for a sentence, a paragraph or a longer text, then the individual embedded representations should be aggregated into one representation, a procedure that might be referred to as word embedding composition. Different algorithms have different complexity. In some cases, simple additive composition works well (Scheepers et al. 2018).

Compared to the ESA model, the word embedding models are usually much smaller. The ESA model may represent a word with a vector that has over 100,000 dimensions, while a typical word embedding model has 100 dimensions.
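As a sketch using Gensim (discussed in Chap. 3; the toy corpus and hyperparameter values are illustrative, and a real model needs a large corpus):

```python
# Training a small skip-gram word2vec model with Gensim (4.x API).
from gensim.models import Word2Vec

sentences = [["a", "dog", "runs", "fast"],
             ["a", "dolphin", "swims", "fast"],
             ["a", "dog", "bites", "another", "dog"]]

model = Word2Vec(sentences, vector_size=100, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)

vector = model.wv["dog"]                      # 100-dimensional representation
print(model.wv.similarity("dog", "dolphin"))  # cosine similarity
```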
Deep Learning Representations

Deep learning can be used to train neural network-based language models or supervised classification models. Deep learning neural networks consist of multiple layers of parameters and non-linear units. For neural networks modeling sequential data, special recurrent units are also used, e.g., long short-term memory (LSTM) units (Schmidhuber 2014). Typically, each layer will have a vectorial representation, and layers near the output of the neural network may gain some semantic representation after sufficient training.

Some of the deep learning models train a language model on very large datasets, predicting a word, a character or a byte from previous parts of the text sequence. For instance, provided with a large amount of text data, such as 82 million product reviews from Amazon, a UTF-8 encoded byte-level language model with a recurrent neural network may be trained for 1 month (Radford et al. 2017).
In a transfer learning scheme, the 4096-dimensional distributed representation of the deep learning model may be used in training a supervised machine learning model for various other prediction tasks, e.g., sentiment analysis with a specific dataset. Another related study trained a deep learning model on 1246 million tweets to predict the type of emojis present in each of the tweets (Felbo et al. 2017). This pre-trained model could be adapted with further training to related tasks.
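A rough Keras sketch of the general pattern (the architecture and sizes are illustrative, not those of the cited studies): after (pre-)training, the recurrent layer's state can serve as a distributed representation for transfer learning.

```python
# A language-model-style network whose LSTM state can be reused as a
# distributed text representation.
from tensorflow.keras import layers, models

vocab_size, embed_dim, state_dim = 10_000, 128, 256

inputs = layers.Input(shape=(None,))              # sequence of token indices
h = layers.Embedding(vocab_size, embed_dim)(inputs)
state = layers.LSTM(state_dim)(h)                 # distributed representation
outputs = layers.Dense(vocab_size, activation="softmax")(state)

language_model = models.Model(inputs, outputs)    # trained to predict tokens
encoder = models.Model(inputs, state)             # reused for other tasks
```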
Creating Explicit Semantic Representations

The semantic representation can be explicit, where relationships between words and concepts are stated with links for lexical or conceptual relations (hyponym, hypernym, synonym, antonym) and where the words and concepts can be associated with features. WordNet is such an example (Miller 1995). Specialized graphical user interfaces have been developed to set up these relations, see, e.g., Henrich and Hinrichs (2010).

Wikidata at https://www.wikidata.org is a knowledge graph grown out of the Wikipedia community. It describes the concepts corresponding to Wikipedia articles, but also a range of other topics such as scientific articles, lexemes and word forms. Wikidata is multilingual and identifies its items (concepts, lexemes, forms) not by words but by a non-descriptive integer identifier, e.g., the concept of a dog has the identifier 'Q144' while the Danish lexeme gentagelse has the identifier 'L117'. Wikidata users can collaboratively edit the graph in a specialized online environment featuring revision control, where the users can see changes to the graph. Users may also make explicit links from individual items to external resources such as the linguistic resources WordNet, DanNet, ILI and BabelNet; for the WordNet linkage, see Nielsen (2018).

There are various methods to convert lexical and conceptual items represented in a graph to a dense vectorial representation (Nielsen 2017; Nielsen and Hansen 2018). Specific methods are known as graph embedding or knowledge graph embedding, with names such as node2vec and RDF2vec. An initial process generates 'pseudo-sentences' constructed from random walks in the graph, where the 'words' are nodes, and possibly links. These pseudo-sentences are then submitted to standard word embedding software such as word2vec.
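A minimal sketch of the random walk step (the tiny graph is hypothetical; node2vec and RDF2vec use biased walks over much richer graphs):

```python
# Pseudo-sentences from uniform random walks, fed to standard word2vec.
import random
from gensim.models import Word2Vec

graph = {"dog": ["animal", "pet"], "animal": ["dog", "organism"],
         "pet": ["dog"], "organism": ["animal"]}

def random_walk(start, length=5):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

pseudo_sentences = [random_walk(node) for node in graph for _ in range(100)]
model = Word2Vec(pseudo_sentences, vector_size=16, window=2, min_count=1)
```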
Creating Semantic Representations with Non-linguistic Information

In some cases, we have access to non-linguistic data linked to linguistic data: words, sentences and documents might be associated with a location in space (e.g., geolocation), colors, images and brain activity. Modeling of the joint distribution between the linguistic and non-linguistic data can provide further and perhaps
improved semantic representation. There are several large-scale datasets where images and texts are associated. Visual Genome (http://visualgenome.org/) associates a photo with region, attribute and relationship annotations, e.g., "man playing frisbee", "frisbee is white" and "building behind player". Similarly, the COCO dataset (http://cocodataset.org) associates a photo with a full sentence, e.g., "a man in a red shirt throws an orange frisbee". In a specialized application (Nielsen and Hansen 2002), short neuroanatomical labels associated with 3-dimensional stereotaxic coordinates from human brain mapping studies formed the basis for a statistical model. The resulting model connected the textual label, l, with the physical sites, z, in the brain through a probability density estimate (PDE), p(z | l), z ∈ ℝ³. When this PDE is evaluated in discrete steps in 3-dimensional space, the PDE can be turned into a vector, y_l, for each label; thus the neuroanatomical label has a distributed representation where each element is a non-negative value. In a specific application, the probability density was modeled by kernel density estimation

$$p(z \mid l) = \frac{1}{|N_l|} \sum_{n \in N_l} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z - x_n)^2}{2\sigma^2}\right).$$

With the vector representation of the labels, neuroanatomical labels that partially overlap in brain space can be searched independently of the orthographic representation of the word or its presence in neuroanatomical taxonomies.
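As a sketch of how such a label vector could be computed, the function below evaluates the kernel density estimate on a grid of 3-dimensional points, reading the squared difference in the formula above as a squared Euclidean distance; coordinates, grid and sigma are illustrative placeholders, not values from the original study:

import numpy as np

def label_vector(coordinates, grid, sigma=10.0):
    """Kernel density estimate p(z|l) evaluated on a grid of
    3-dimensional points, returned as a non-negative vector y_l."""
    coordinates = np.asarray(coordinates)  # (N_l, 3) reported sites
    grid = np.asarray(grid)                # (M, 3) evaluation points
    # Squared distances between every grid point and every site
    sq_dist = ((grid[:, None, :] - coordinates[None, :, :]) ** 2
               ).sum(axis=2)
    kernels = np.exp(-sq_dist / (2 * sigma ** 2))
    kernels /= np.sqrt(2 * np.pi * sigma ** 2)
    return kernels.mean(axis=1)  # average over the |N_l| kernels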
Evaluating Semantic Representations

Different methods exist for evaluating semantic representations of words. Some researchers distinguish between two modes of evaluation: intrinsic and extrinsic (Chiu et al. 2016; Faruqui et al. 2016; Wróblewska et al. 2017). Intrinsic evaluation establishes dedicated test sets for the evaluation, while extrinsic evaluation uses the semantic representation as part of a model for a more complicated natural language processing task. Word similarity/relatedness, verbal analogy or word intrusion tasks exemplify the former, while part-of-speech tagging, chunking, named entity extraction, dependency parsing and morphosyntactic disambiguation are examples of the latter (Chiu et al. 2016; Wróblewska et al. 2017). Evaluations may find considerable differences between the two modes (Chiu et al. 2016). In connection with the evaluation of embedding models, intrinsic evaluations will typically use the cosine similarity s_C between the two compared words represented in the embedding space
$$s_C(x, y) = \frac{x^\top y}{\|x\|_2 \, \|y\|_2}.$$
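In code, the cosine similarity is straightforward to compute, e.g., with NumPy; the following is a small sketch with made-up vectors:

import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two embedding vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 3.0])
cosine_similarity(x, y)  # 13/14, approximately 0.93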
There are other forms of similarity measures, e.g., the Minkowski family of distances (Lund and Burgess 1996),

$$d_M(x, y) = \left( \sum_i |x_i - y_i|^r \right)^{1/r},$$
where a Minkowski distance with r = 2 is the common Euclidean distance and r = 1 is the city-block distance. Methods that combine word embedding similarity with graph-based distances on explicit semantic networks compute a common score. In one work (Lee et al. 2016), a semantic relatedness score, rel(w_i, w_j), between two words, w_i and w_j, was defined as

$$\mathrm{rel}(w_i, w_j) = \lambda\, s_C(x_{w_i}, x_{w_j}) + (1 - \lambda) \max_{m,n} \frac{1}{\mathrm{dist}(S_{i,m}, S_{j,n})},$$
where x_w are the word embeddings and dist(S_{i,m}, S_{j,n}) is the path distance in the semantic network (WordNet in their case) between the two senses S_{i,m} and S_{j,n}, i.e., the mth sense of the ith word and the nth sense of the jth word. The λ parameter mixes between the two parts of the combined score, and with the extra flexibility provided by this parameter, the researchers reported state-of-the-art results on a small range of standard similarity datasets (Lee et al. 2016).
Word Similarity/Relatedness

Datasets with human judgments of similarity/relatedness between pairs of words are a popular means of evaluating semantic models. The website wordvectors.org lists 13 word-similarity datasets. WordSim-353 (WS-353) with 353 word pairs (Finkelstein et al. 2002) is probably the most popular and among the oldest of these datasets. Similarity and relatedness are similar concepts but often viewed as different, with similarity more narrowly defined than relatedness. For instance, car and steering wheel are 'related' but would in many contexts not be regarded as 'similar'. Another example is coffee and cup, where (in some contexts) one word points to a drink, the other to a container (Faruqui et al. 2016). Newer word similarity datasets, such as SimLex-999, are larger and may distinguish more stringently between relatedness and similarity (Chiu et al. 2016). The human scoring used in word similarity tests may be confounded by different effects (Bhattacharyya et al. 2017). A rescoring of a word pair list may not yield the same scores, and the humans may differ widely in the similarity scores they attribute to individual word pairs; e.g., scores among 50 humans for the word pair sun and
planet may have used the entire range of possible similarity scores. Furthermore, the human similarity scores may even be affected by the presentation order, so, e.g., the similarity of baby to mother results in different scores than the similarity of mother to baby (Bhattacharyya et al. 2017). With the similarity-scored datasets, the models are typically ranked based on the Spearman correlation between the cosine similarity and the human similarity judgment; a sketch of such an evaluation follows below. The corpus-based ESA model maintained state-of-the-art performance on the WordSim-353 dataset for a few years (see https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)). In 2017, a combined corpus- and knowledge graph-based approach using ConceptNet set a new state-of-the-art (Speer et al. 2016). Most of the datasets are in English, but word similarity datasets also exist for other languages.
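A sketch of such a ranking, reusing the cosine_similarity function from above; the embed function is a hypothetical word-to-vector lookup into a trained model, and the human scores shown are only illustrative:

from scipy.stats import spearmanr

# (word1, word2, human score) triples in the style of WordSim-353
dataset = [('tiger', 'cat', 7.4), ('book', 'paper', 7.5),
           ('king', 'cabbage', 0.2)]
model_scores = [cosine_similarity(embed(w1), embed(w2))
                for w1, w2, _ in dataset]
human_scores = [score for _, _, score in dataset]
# Rank-based agreement between the model and the human judges
correlation, _ = spearmanr(model_scores, human_scores)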
Verbal Analogies

Verbal analogies usually come in the form of four semantically connected words, e.g., Germany, Berlin, France, Paris. The task is to predict the fourth word from the three others. Given that the fourth word is to be selected from the entire vocabulary, the baseline accuracy for pure guessing would be close to zero. Semantic testing with verbal analogies became popular after a demonstration with distributed word representations showed that the analogies could be estimated with simple algebra in the word embedding space and the cosine similarity (Mikolov et al. 2013a, b). This method is sometimes found under the names vector difference or vector offset (Vylomova et al. 2016). The prototypical example is man is to king as woman is to queen. With the analogy of four words a : a′ :: b : b′, the basic "guess" of the fourth word, b′, is the analogy function

$$b' = \operatorname*{arg\,max}_{b''} \; s_C(x_{b''},\, x_{a'} - x_a + x_b),$$
where x is a word embedding and s_C(·) is the cosine similarity. There are other suggested analogy functions. A variation with multiplication and division instead of addition and subtraction may perform slightly better (Levy et al. 2015; Linzen 2016). Datasets with word analogies may focus on specific relations, e.g., agent-goal or object-typical action (Jurgens et al. 2012; the dataset is available at https://sites.google.com/site/semeval2012task2/) or syntactic relations (Mikolov et al. 2013b); see also the 15 relations in Table 2 of Vylomova et al. (2016), e.g., hypernym, meronym, location or time association. As an example, the first few lines of the popular dataset from Mikolov et al. (2013a) display analogies between countries and capitals:

Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
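With pre-trained vectors loaded in Gensim, the vector offset method is available directly; a sketch, where the file name is a placeholder:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('vectors.bin',
                                            binary=True)
# 'man is to king as woman is to ?' via the vector offset method
vectors.most_similar(positive=['king', 'woman'],
                     negative=['man'], topn=1)
# A well-trained model is expected to return 'queen'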
Apart from English, other languages with word analogy datasets include, e.g., Czech (Svoboda and Brychcín 2018) and German (Köper et al. 2015). Evaluations in 2017 with fastText trained for 3 days on either the very large Common Crawl dataset or a combination of the English Wikipedia and news datasets set a new state-of-the-art of 88.5% accuracy on a popular word analogy dataset when the fastText training included sub-word features (Mikolov et al. 2017). Pre-trained fastText models are available at https://fasttext.cc/.
Word Intrusion

Word intrusion tasks present a series of words in which one word is an outlier that should be detected. In work on evaluating distributed semantic representations for Danish (Nielsen and Hansen 2017), we established a small dataset with 100 sets of words, where each set consisted of four words, corresponding to, in English, e.g., apple, pear, cherry and chair as one example and grass, tree, flower and car as another example (the dataset is available at https://github.com/fnielsen/dasem/blob/master/dasem/data/four_words.csv). In this case, we would expect random guessing to produce an accuracy of 25%. Identification of the outlier with a distributed semantic representation projects each of the words in a set to a low-dimensional space, e.g., a word2vec representation, and then identifies the outlier based on the distances in this space. For instance, one may compute the overall average across the words, compute the distance from each word to the average, and select the word furthest away as the outlier; a sketch follows below. If the distributed semantic representation is good, we may expect the distance to represent semantic outlierness well. In an evaluation on the Danish sets of words (Nielsen and Hansen 2017), an ESA model was found to yield an accuracy of 73%, while a word2vec-based model trained on large corpora yielded almost as good a performance.
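A sketch of the outlier identification, where embed is again a hypothetical word-to-vector lookup:

import numpy as np

def find_outlier(words, embed):
    """Return the word whose embedding is furthest from the mean."""
    vectors = np.array([embed(word) for word in words])
    average = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - average, axis=1)
    return words[int(np.argmax(distances))]

# find_outlier(['apple', 'pear', 'cherry', 'chair'], embed)
# is expected to return 'chair' with a good representation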
The Test of English as a Foreign Language (TOEFL) synonym test is related to the word intrusion task. The TOEFL task also selects among four words, but here the algorithm should select the word most similar to a given query word. TOEFL was originally used to test the latent semantic analysis model (Landauer and Dumais 1997).
Sentiment Analysis

Sentiment is one aspect of semantics. As such, distributed semantic representations could be expected to be relevant for sentiment as well. A number of "sentiment lexica" exist, where the sentiment or the emotion of a text has been assigned manually. This may be at the level of words, phrases, sentences or paragraphs up to complete texts. For instance, AFINN assigns an integer between −5 and 5 to each word or small phrase (Nielsen 2011); e.g., the English word excellent is assigned the value 3 (see https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-en-165.txt). A distributed semantic representation pre-trained on an independent resource can represent the sentiment-labeled words, and if this representation is semantically accurate, we may expect that a supervised learning algorithm trained to predict sentiment will generalize well, i.e., be able to predict the sentiment of words that have not been used during the supervised learning; a sketch follows below. For the Danish version of AFINN, we found that the scheme worked well and that the trained machine learning classifier could even point to words where the sentiment label may be less than optimal and should be reconsidered (Nielsen and Hansen 2017). For instance, 'udsigtsløs' (futile) is labeled as positive in AFINN, which can be considered an error. The analysis also pointed to words with 'implicit' negativity, e.g., benådet (pardoned), tilgiver (forgives) and præcisere (clarify), which indicate a change away from something negative.
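A sketch of the scheme, assuming a pre-trained Gensim word vector model; vectors, lexicon_words and lexicon_scores are placeholders for the embedding model and AFINN-style word/score lists:

import numpy as np
from sklearn.linear_model import Ridge

# Represent the sentiment-labeled words with pre-trained embeddings
X = np.array([vectors[word] for word in lexicon_words])
y = np.array(lexicon_scores)
model = Ridge().fit(X, y)

# Predict the sentiment of a word not in the lexicon
model.predict(vectors['marvellous'].reshape(1, -1))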
Challenges: Polysemy, Homographs, Bias and Compounds

Various problems exist with the semantic representations. For distributed semantic representations based on the orthographic representation of words, we meet frequent challenges, e.g., polysemy and homographs: two words with the same orthographic representation can have different semantics, e.g., jaguar (disregarding case) can mean a feline, a car make or an operating system version. Maintaining the case of the word will help slightly with the polysemy/homograph problem, but will expand the vocabulary of the corpus, creating a problem with the estimation of the semantic model. For some of the pre-trained models based on a very large corpus, the case is maintained. Applying part-of-speech (POS) tagging before the word embedding training and using the POS-tag together with its associated word can
disambiguate word classes (e.g., to fly vs. a fly), but does not help with polysemy/homographs within word classes. Another approach uses word sense disambiguation (WSD) before word embedding. A system called SensEmbed uses Babelfy (http://babelfy.org/) for the disambiguation before training a CBOW word2vec model (Iacobacci et al. 2015). Rather than words, it is senses that are embedded; thus a word with multiple senses will be embedded at several different points in the embedding space. WSD may use a semantic network, such as WordNet or BabelNet, so embedded senses correspond to concepts in the semantic network. In this case, the corpus-based distributional sense embedding may be augmented with the information from the explicit semantic network. The SensEmbed researchers argue that the semantic network is particularly helpful for rare words and give as an example the highly related pair orthodontist-dentist, where orthodontist occurred only 70 times in the corpus, making its distributed representation inaccurate and the corpus-based orthodontist-dentist similarity estimate poor. However, in BabelNet, the semantic network they used, the two words were directly linked, and the combined distributional-semantic network similarity model could determine the synonymity of the word pair (Iacobacci et al. 2015). Other methods that handle polysemy/homographs, but without direct WSD, model the word context with shallow or deep learning models (Sun et al. 2017; Peters et al. 2018).

Word embeddings are based on given text corpora and are therefore expected to incorporate values, biases, stereotypes etc. as they are found in the data. Bias may be related to gender, where words for occupations are associated with a specific gender, e.g., as reflected in the title "man is to computer programmer as woman is to homemaker" (Bolukbasi et al. 2016), but see also Nissim et al. (2019). If one is aware of the bias, then it might be possible to correct it.

Compounds, which occur frequently in languages such as German and Danish, may pose a problem as they are often rare words. Decompounding methods exist (Koehn and Knight 2003) and may help if the individual compounded words are common. But it may also be worth considering fastText, which (with its modeling of n-grams) can handle out-of-vocabulary words. For instance, consider the Danish compound noun bogføringsvirksomhed (bookkeeping company), which is out-of-vocabulary in a fastText model trained with Dasem (a Python package for Danish semantics, available at https://github.com/fnielsen/dasem), but fastText is nevertheless able to identify semantically related within-vocabulary words when only the n-gram representation of the word is used:

forretningsvirksomhed (business)
rådgivningsvirksomhed (advisory/consultant business)
revisionsvirksomhed (auditing business)
Here the last component of the compounds, virksomhed, is shared.
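A sketch of this behavior with Gensim's fastText implementation; sentences is a placeholder for a tokenized corpus, and the n-gram sizes are illustrative:

from gensim.models import FastText

model = FastText(sentences, min_n=3, max_n=6)
# An out-of-vocabulary compound still gets an n-gram-based vector
model.wv['bogføringsvirksomhed']
model.wv.most_similar('bogføringsvirksomhed')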
Yet another issue is misspellings and optical character recognition (OCR) errors in the corpus used to train the distributed representation model. Such erroneous words will also get projected into the embedding space. Whether this is a problem depends on the task: if the task is just to project words or to select among a pre-specified number of words, the presence of misspellings may not matter, but when the task is to select the closest word to a given query, misspellings and OCR errors will likely decrease the performance.
What Does It Mean?

It is unclear what form the semantic space should have to best represent semantics. In many cases, the semantic space is estimated as a Euclidean space, but that need not be the best choice. Researchers have argued that other forms of spaces could yield better representations than a Euclidean space for certain semantic structures. In a tree with a fixed branching factor, the number of leaf nodes grows exponentially with the distance to the root node. A continuous space with similar characteristics is hyperbolic geometry. Embedding in this space has been referred to as Poincaré embedding, and it has been shown that this form of embedding may improve on Euclidean embedding for a hierarchical problem, the WordNet noun hierarchy (Nickel et al. 2017).

Due to polysemy/homographs, a semantic word space could be regarded as violating the triangle inequality (Neelakantan et al. 2014). Consider the two word pairs (bee, fly) and (fly, travel). The semantic similarity within each pair is high, and fly is a homograph, so the semantic distance from bee to travel is short when going via fly, but longer when the pair (bee, travel) is viewed in isolation.

Semantic spaces may also need to be of a certain size to faithfully represent some concept relations. Consider a small semantic star network with one superconcept, A, and four subconcepts, B_i, i ∈ {1, 2, 3, 4}, where the semantic distances should be dist(A, B_i) = 1 and dist(B_i, B_j) = c, c > 1. In 2 dimensions, it would not be possible to place the subconcepts B_i in a configuration so that all six distances between them are equal. The addition of a further dimension would enable the four subconcepts to be placed in a 3-dimensional tetrahedral configuration where the subconcepts all have equal distances to each other.
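The placement below is easy to verify numerically; a small check with NumPy:

import numpy as np
from itertools import combinations

A = np.zeros(3)
B = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
# Distances to the superconcept: all sqrt(3)
print([np.linalg.norm(b - A) for b in B])
# Pairwise distances between subconcepts: all 2 * sqrt(2)
print([np.linalg.norm(b1 - b2) for b1, b2 in combinations(B, 2)])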
For instance, we can place the superconcept at A = (0, 0, 0) and the subconcepts at (1, 1, 1), (1, −1, −1), (−1, 1, −1) and (−1, −1, 1). All subconcepts then have the same distance to the superconcept and the same distance to all other subconcepts. On the other hand, very high-dimensional spaces have issues referred to as the "curse of dimensionality": distance concentration and hubness (Radovanović et al. 2010). The latter means that, e.g., for a tf-idf-weighted corpus, some documents will very often appear as nearest neighbors to other documents.
Acknowledgments We would like to thank Innovation Fund Denmark for funding through the DABAI project.
References

Al-Rfou, R., Perozzi, B., & Skiena, S. (2014). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, June 2014 (pp. 183–192). https://arxiv.org/pdf/1307.1662.pdf

Anderka, M., & Stein, B. (2009). The ESA retrieval model revisited. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 670–671). https://doi.org/10.1145/1571941.1572070

Bhattacharyya, M., Suhara, Y., Md Rahman, M., & Krause, M. (2017). Possible confounds in word-based semantic similarity test data. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing – CSCW '17 Companion (pp. 147–150). https://doi.org/10.1145/3022198.3026357

Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2001 (pp. 245–250). https://doi.org/10.1145/502512.502546

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016, July 29). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29. https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf

Chiu, B., Korhonen, A., & Pyysalo, S. (2016). Intrinsic evaluation of word vectors fails to predict extrinsic performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, August 2016. https://sites.google.com/site/repevalacl16/26_Paper.pdf

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, May 2016. https://sites.google.com/site/repevalacl16/11_Paper.pdf

Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, August (pp. 1616–1626). http://aclweb.org/anthology/D17-1169

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20, 116–131. https://doi.org/10.1145/503104.503110

Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, January (pp. 1606–1611). http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf

Ganchev, K., & Dredze, M. (2008). Small statistical models by random feature mixing. In Proceedings of the ACL-2008 Workshop on Mobile Language Processing. Association for Computational Linguistics.

Henrich, V., & Hinrichs, E. (2010). GernEdiT: A graphical tool for GermaNet development. In Proceedings of the ACL 2010 System Demonstrations, July 2010 (pp. 19–24).
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2015). SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, July 2015 (pp. 95–105). https://doi.org/10.3115/V1/P15-1010

Jurgens, D. A., Turney, P. D., Mohammad, S. M., & Holyoak, K. J. (2012). SemEval-2012 task 2: Measuring degrees of relational similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), June 2012 (pp. 356–364). http://www.aclweb.org/anthology/S12-1047

Koehn, P., & Knight, K. (2003). Empirical methods for compound splitting. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics, February (pp. 187–193). https://doi.org/10.3115/1067807.1067833

Köper, M., Scheible, C., & Walde, S. S. (2015). Multilingual reliability and 'semantic' structure of continuous word spaces. In Proceedings of the 11th International Conference on Computational Semantics, April (pp. 40–45). http://www.aclweb.org/anthology/W15-0105

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240. https://doi.org/10.1037/0033-295X.104.2.211

Lee, D. D., & Seung, S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791. https://doi.org/10.1038/44565

Lee, D. D., & Seung, S. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 556–562. http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

Lee, Y.-Y., Ke, H., Huang, H.-H., & Chen, H.-H. (2016). Combining word embedding and lexical database for semantic relatedness measurement. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 73–74). https://doi.org/10.1145/2872518.2889395

Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems, 27, 2177–2185.

Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, August (pp. 13–18). https://doi.org/10.18653/V1/W16-2503

Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28, 203–208. https://doi.org/10.3758/BF03204766

Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013a). Efficient estimation of word representations in vector space, January 2013. https://arxiv.org/pdf/1301.3781v3

Mikolov, T., Yih, W.-T., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2013 (pp. 746–751). http://www.aclweb.org/anthology/N13-1090
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations, December 2017. https://arxiv.org/pdf/1712.09405.pdf

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38, 39–41.

Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2014). Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1059–1069). https://doi.org/10.3115/V1/D14-1113
Nickel, M., & Kiela, D. (2017, May 30). Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems. https://arxiv.org/pdf/1705.08039.pdf

Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages, May 2011 (pp. 93–98). http://ceur-ws.org/Vol-718/paper_16.pdf

Nielsen, F. Å. (2017, October). Wembedder: Wikidata entity embedding web service. https://doi.org/10.5281/ZENODO.1009127

Nielsen, F. Å. (2018). Linking ImageNet WordNet synsets with Wikidata. In WWW '18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon (pp. 1809–1814). https://doi.org/10.1145/3184558.3191645

Nielsen, F. Å., & Hansen, L. K. (2002). Modeling of activation data in the BrainMap database: Detection of outliers. Human Brain Mapping, 15, 146–156. https://doi.org/10.1002/HBM.10012

Nielsen, F. Å., & Hansen, L. K. (2017). Open semantic analysis: The case of word level semantics in Danish. In Human Language Technologies as a Challenge for Computer Science and Linguistics, October 2017 (pp. 415–419). http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf

Nielsen, F. Å., & Hansen, L. K. (2018). Inferring visual semantic similarity with deep learning and Wikidata: Introducing imagesim-353. In Proceedings of the First Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies, April (pp. 56–61). http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7102/pdf/imm7102.pdf

Nielsen, F. Å., Balslev, D., & Hansen, L. K. (2005). Mining the posterior cingulate: Segregation between memory and pain components. NeuroImage, 27, 520–532. https://doi.org/10.1016/J.NEUROIMAGE.2005.04.034

Nissim, M., van Noord, R., & van der Goot, R. (2019). Fair is better than sensational: Man is to doctor as woman is to doctor. arXiv 1905.09866.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). http://www.emnlp2014.org/papers/pdf/EMNLP2014162.pdf

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, March 2018 (pp. 2227–2237). https://arxiv.org/pdf/1802.05365.pdf

Radford, A., Józefowicz, R., & Sutskever, I. (2017). Learning to generate reviews and discovering sentiment, April 2017. https://arxiv.org/pdf/1704.01444.pdf

Radovanović, M., Nanopoulos, A., & Ivanović, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487–2531.

Řehůřek, R. (2011). Fast and faster: A comparison of two streamed matrix decomposition algorithms, February 2011. https://arxiv.org/pdf/1102.5597.pdf

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18, 613–620. https://doi.org/10.1145/361219.361220

Scheepers, T., Kanoulas, E., & Gavves, E. (2018). Improving word embedding compositionality using lexicographic definitions. In Proceedings of the 2018 World Wide Web Conference. https://doi.org/10.1145/3178876.3186007

Schmidhuber, J. (2014). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/J.NEUNET.2014.09.003
Soboroff, I. M., Nicholas, C. K., Kukla, J. M., & Ebert, D. S. (1997). Visualizing document authorship using n-grams and latent semantic indexing. In Proceedings of the 1997 Workshop on New Paradigms in Information Visualization and Manipulation. https://doi.org/10.1145/275519.275529

Speer, R., Chin, J., & Havasi, C. (2016). ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, December 2016 (pp. 4444–4451). https://arxiv.org/pdf/1612.03975.pdf
Sun, Y., Rao, N., & Ding, W. (2017). A simple approach to learn polysemous word embeddings, July 2017. https://arxiv.org/pdf/1707.01793.pdf

Svoboda, L., & Brychcín, T. (2018). New word analogy corpus for exploring embeddings of Czech words. In Computational Linguistics and Intelligent Text Processing (pp. 103–114). https://doi.org/10.1007/978-3-319-75477-2_6

Vylomova, E., Rimell, L., Cohn, T., & Baldwin, T. (2016). Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1671–1682). https://doi.org/10.18653/V1/P16-1158

Wróblewska, A., Krasnowska-Kieraś, K., & Rybak, P. (2017). Towards the evaluation of feature embedding models of the fusional languages. In Human Language Technologies as a Challenge for Computer Science and Linguistics, November 2017 (pp. 420–424).
Chapter 3
Software for Creating and Analyzing Semantic Representations

Finn Årup Nielsen and Lars Kai Hansen
In this chapter, we describe some of the software packages for learning distributed semantic representations in the form of word and graph embeddings. We also describe several Python natural language processing frameworks that can prepare a corpus for the embedding software. We furthermore point to the Wikidata software as a tool for collaborative construction of explicit semantic representations and to tools to embed such data.

• Many natural language processing packages exist, including several written in the Python programming language.
• The popular scikit-learn, Gensim and Keras Python packages provide means for quickly building and training machine learning models on large corpora.
• Several word embedding software packages exist for efficiently building distributed representations of words.
• Many of the software packages have associated pre-trained models in multiple languages trained on very large corpora, and the models are distributed freely for others to use.
• Software for collaboratively building large semantic networks and knowledge graphs exists, and these graphs can be converted to a distributed representation.
Introduction

Word embedding is a powerful technique for converting words or text fragments to feature vectors amenable to multivariate analysis and machine learning. Since the first description and publication of modern efficient word embedding tools around
2013, they have found widespread use in many areas of text mining, providing productive semantic representations. Open source code, convenient setup and often excellent performance on semantic tasks have all played important roles in their general success.

Natural language processing (NLP) software may be necessary for corpus preparation, as dedicated semantic representation software does not handle different forms of textual input. For instance, Wikipedia-derived text should usually be stripped of the special wiki markup present in MediaWiki-based wikis. Word embedding software may expect the input to be tokenized and whitespace-separated, so word tokenization and perhaps sentence segmentation are necessary. For morphologically rich languages, handling of word variations may be important to avoid too sparse data when estimating a semantic model. Stemmers and lemmatizers may be used to convert word variations to a basic form. Languages with many and rare compound nouns (e.g., German) can face an issue with many out-of-vocabulary words, so decompounding can be advantageous.

Datasets and pre-trained models are important dimensions of modern natural language processing, including semantic modeling. The Python packages NLTK and polyglot have built-in facilities for convenient download of their associated data and pre-trained models, making the setup of a working system less of a hassle. These packages also save the data in predefined directories, making subsequent setup simpler. Researchers using word embedding software have also distributed several pre-trained models for others to use.

Many text mining packages are available. Below we focus on a few general-purpose Python natural language processing packages: NLTK, spaCy, Pattern and polyglot. Then we describe word embedding software and other software to create semantic representations.
Natural Language Processing Toolkit, NLTK

NLTK, short for Natural Language Processing Toolkit, is perhaps the most popular Python NLP package. It is documented in a detailed book (Bird et al. 2009) available in print as well as on the Internet at http://www.nltk.org/book/ under a Creative Commons license. NLTK packages a large number of models for a range of NLP tasks and furthermore includes methods for downloading and loading several corpora and other language resources. Pre-trained sentence tokenizers exist for several languages, and different word tokenizers are also implemented. For instance, the word_tokenize function uses the default word tokenizer, which in NLTK version 3.2.5 is a modified version of the regular expression-based TreebankWordTokenizer. An example application
of this default tokenizer for the sentence "I don't think you'll see anything here: http://example.org :)" is:

from nltk.tokenize import word_tokenize

text = ("I don't think you'll see anything here: "
        "http://example.org :)")
token_list = word_tokenize(text)
This tokenization yields a list with the individual words identified, but with the URL and the emoticon not well handled: ['I', 'do', "n't", 'think', 'you', "'ll", 'see', 'anything', 'here', ':', 'http', ':', '//example.org', ':', ')']. Both the URL and the emoticon are split into separate parts. NLTK's TweetTokenizer handles such tokens better. An example application of this method reads:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
token_list = tokenizer.tokenize(text)
The result is ['I', "don't", 'think', "you'll", 'see', 'anything', 'here', ':', 'http://example.org', ':)'], where the URL and the emoticon are detected as tokens.

NLTK has various methods to deal with morphological variations of words. They are available from the nltk.stem submodule. For English and some other languages, NLTK implements several stemmers. Variations of the Snowball stemmer by Martin Porter work for a range of languages; here is an example in French:

from nltk.stem.snowball import FrenchStemmer
from nltk.tokenize import word_tokenize

text = ("La cuisine française fait référence à divers "
        "styles gastronomiques dérivés de la "
        "tradition française.")
stemmer = FrenchStemmer()
[stemmer.stem(word) for word in word_tokenize(text)]
This yields ['la', 'cuisin', 'français', 'fait', 'référent', 'à', 'diver', 'styl', 'gastronom', 'dériv', 'de', 'la', 'tradit', 'français', '.']. For English, a WordNet-based lemmatizer is available.

Part-of-speech taggers are implemented in the nltk.tag submodule. The default tagger, available with the function pos_tag in NLTK 3.2.5, uses the Greedy Averaged Perceptron tagger by Matthew Honnibal (Honnibal 2013). Trained models for English and Russian are included.

Trained models can be saved using Python's pickle format. This format has the unfortunate problem of posing a security issue, as pickle files can contain executable code that gets executed when the file is loaded.
Table 3.1 NLTK WordNet similarities

        Dog   Cat   Chair  Table
Dog     1.00  0.20  0.08   0.07
Cat     0.20  1.00  0.06   0.05
Chair   0.08  0.06  1.00   0.07
Table   0.07  0.05  0.07   1.00
It means that pickle data should only be downloaded from trusted sources. Trusted models distributed by the NLTK developers can fairly easily be downloaded by functions provided by the toolkit. By default, they will be saved under the ~/nltk_data/ directory. For instance, the Danish tokenizer is saved to ~/nltk_data/tokenizers/punkt/danish.pickle. The functions of NLTK that require these models will automatically read such files.

NLTK also has methods for handling semantic representations in the form of wordnets. Access to the Princeton WordNet (PWN) (Miller 1995) is readily available in a submodule imported, e.g., by:
With this at hand, synsets and lemmas can be found based on the word representation. Semantic similarities can be analyzed with one of several similarity methods that transverse the wordnet taxonomy. Below are the similarities between pairs of 4 nouns (dog, cat, chair and table) computed with the path_similarity method which computes the similarity based on the shortest path in the hypernym/hyponym graph: import numpy as np words = 'dog cat chair table'.split() similarities = np.zeros((len(words), len(words))) for n, word1 in enumerate(words): for m, word2 in enumerate(words): synset1 = wn.synsets(word1, pos='n')[0] synset2 = wn.synsets(word2, pos='n')[0] similarities[n, m] = synset1.path_similarity(synset2)
The result is displayed in Table 3.1, where the words cat and dog are found to have the highest similarity, while cat and table the lowest. For chair, the word dog has a higher similarity than table which in usual contexts would not be right. Each word may have multiple synsets, e.g., dog has 7 synsets. In the code above, the most common synset is selected.
spaCy

spaCy (https://spacy.io/) is a relatively new Python package with an initial release in 2015. It offers a range of NLP methods, such as tokenization, POS-tagging, sentence segmentation, dependency parsing, named entity recognition and semantic similarity
computation. A 2015 paper showed that its dependency parser was the fastest among 13 systems evaluated (Choi et al. 2015). spaCy has pre-trained models for several languages, and a command-line method downloads and installs the model packages. The packages with models for a specific language ship in various sizes, and for working with distributional semantics, the largest package should be downloaded for the best result. For instance, one of the large English packages is downloaded with the command (a list of the various models is available at https://spacy.io/models/):
This package provides models for the tagger, parser, named-entity recognizer and distributional semantic vectors trained on OntoNotes Release 5 and the Common Crawl dataset. An even larger package (en_vectors_web_lg) contains 300-dimensional distributional semantic vectors for 1.1 million tokens trained on the Common Crawl dataset with GloVe (Pennington et al. 2014). Once installed, the models can be used from within Python: import spacy nlp = spacy.load('en_core_web_lg') doc = nlp(u'A lion is a large cat. It lives in Africa')
The resulting object, here named doc, has several attributes which analyze the text with lazy evaluation. For instance, doc.sents returns a generator that yields the sentences of the text as so-called Span objects, while doc.ents returns named entities, here ‘Africa’, and doc.noun_chunks returns the Span objects representing ‘A lion’, ‘a large cat’, ‘It’ and ‘Africa’. The object itself iterates over tokens. An embedding of the text is available as doc.vector. This is computed as the simple average of the individual word embedding vectors. It can also be computed by iterating over the tokens and computing the average over the word embedding of each token. The example below compares the overall embedding with the computed average of the individual word embeddings of each token: token_vectors = [token.vector for token in doc] average_vector = sum(token_vectors) / len(token_vectors) sum(doc.vector - average_vector) == 0
The last line yields true. The dimension of the subspace is 300 in this case, while token_vectors is a list of 11 300-dimensional vectors (the punctuation is also embedded). As of 2018, spaCy distributes pre-trained models for full support of English as well as a few European languages, see https://spacy.io/models/. French and Spanish models also support embeddings, while German, Portuguese, Italian and Dutch have less support. A multilingual model can be used for name entity detection. 1
Pattern

The open source Pattern Python package provides methods for processing text data from the web (De Smedt and Daelemans 2012). Exposed in the pattern.web submodule, it has download methods for a range of web services including popular web search engines (some of which are paid services requiring a license key), Twitter, individual Wikipedia articles and news feeds. It also features a web crawler and an object to retrieve e-mail messages via IMAP.

Pattern has natural language processing methods for a few European languages, English, German, French, Italian and Dutch, with varying degrees of implementation. For English, there are, e.g., a part-of-speech tagger, lemmatizer, singularization/pluralization, conjugation and sentiment analysis. It also contains a small set of word lists (academic, profanity, time and a basic list with 1000 words) as well as an interface to WordNet.

The pattern.vector submodule exposes various methods based around the vector space model representation: word count, tf-idf and latent semantic analysis. This module also features K-means and hierarchical document clustering algorithms and supervised classification algorithms.
Polyglot

Polyglot is a multilingual NLP Python package (Al-Rfou et al. 2014). It can be applied both as a Python module and as a command-line script. As shown in the documentation at https://polyglot.readthedocs.io, it has language detection, tokenization, part-of-speech tagging, word embedding, named entity extraction, morphological analysis, transliteration and sentiment analysis for many languages. Some of the methods require the download of files from the web. The command-line polyglot program will automatically download the files and, like NLTK, store them locally in a directory such as ~/polyglot_data. For instance, the command

polyglot download embeddings2.de
will download the German word embeddings file. The word embeddings are generated from the different language versions of Wikipedia and trained with a Theano implementation of a curriculum learning-inspired method (Al-Rfou et al. 2014). Apart from working with its own word embedding format, polyglot is also able to load pre-trained Gensim, Word2vec and GloVe models (see, e.g., https://polyglot.readthedocs.io/en/latest/Embeddings.html). A series of commands that find the four nearest words to the German word 'Buch' (book) may read:
from os.path import expanduser
from polyglot.mapping import Embedding

directory = expanduser('~/polyglot_data/embeddings2/de/')
filename = directory + 'embeddings_pkl.tar.bz2'
embedding = Embedding.load(filename)
words = embedding.nearest_neighbors('Buch', 4)
These commands yield ['Werk', 'Schreiben', 'Foto', 'Archiv']. The embedding vector is available as embedding['Buch']. The German word embedding has a vocabulary of 100,004 tokens, and the dimension of the embedding space is 64. Note that the polyglot embeddings distinguish between upper and lower case letters. Out-of-vocabulary case-variations of words can be handled with polyglot's 'case expansion'.
MediaWiki Processing Software

Widely used as a free, large and multilingual text corpus (Mehdi et al. 2017), Wikipedia poses a special challenge to parse. The entire text is distributed as large compressed XML dump files where the text is embedded with the special MediaWiki markup for the rendering of, e.g., tables, citations, infoboxes and images. There exist a few tools to convert the raw MediaWiki markup to a form suitable for standard NLP tools. The Python package mwparserfromhell parses the MediaWiki markup and can filter the individual parsed components to text in various ways. The Python code below downloads the 'Denmark' article from the English Wikipedia and strips any formatting:

import requests
import mwparserfromhell

url = 'https://en.wikipedia.org/w/index.php'
response = requests.get(url, params={'title': 'Denmark',
                                     'action': 'raw'})
wikicode = mwparserfromhell.parse(response.content)
text = wikicode.strip_code()
The Gensim package (see below) has its own dedicated wiki extractor: the make_wiki script.
Scikit-Learn

Scikit-learn, also called sklearn, is a popular Python package for machine learning (Pedregosa et al. 2011). Its popularity may stem from the implementation of a wide variety of machine learning algorithms and a consistent application programming
interface (API). Though not specifically targeted at text processing, scikit-learn does have several ways to handle text and convert it to a numerical matrix representation. The sklearn.feature_extraction.text submodule defines the CountVectorizer, which tokenizes a list of texts into words or characters and counts the occurrences of the tokens, returning a bag-of-words or bag-of-n-grams representation. TfidfTransformer performs the popular tf-idf normalization of a count matrix, while TfidfVectorizer combines the CountVectorizer and TfidfTransformer into one model. The HashingVectorizer applies hashing to the words and has a default size of 1,048,576 features. This particular hashing will produce hash collisions, e.g., between "wearisome" and "drummers" and between "funk", "wag" and "deserters".

Several of scikit-learn's unsupervised learning models can be used to project high-dimensional representations to a two- or three-dimensional representation useful for visualization. Relevant models are the ones in the sklearn.decomposition submodule, such as principal component analysis (PCA), as well as the models in the sklearn.manifold submodule, where we find t-distributed Stochastic Neighbor Embedding (TSNE). Yet other models in scikit-learn may be used for topic modeling. Apart from principal component analysis, scikit-learn implements non-negative matrix factorization with sklearn.decomposition.NMF, while latent Dirichlet allocation is implemented with sklearn.decomposition.LatentDirichletAllocation.

Andrej Karpathy's Arxiv Sanity web service, with the code available from https://github.com/karpathy/arxiv-sanity-preserver, provides an interesting working example of the use of scikit-learn's TfidfVectorizer. Karpathy downloads bibliographic information and PDFs from the arXiv preprint server at https://arxiv.org/, extracts the text from the PDFs with the pdftotext command-line script, builds a word-bigram tf-idf-weighted model with scikit-learn and computes document similarities based on inner products. The web service at http://www.arxiv-sanity.com can then use such similarities for ranking similar scientific papers given a specific paper; a minimal sketch in the same spirit follows below.
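The document strings in the sketch are made up; with the default L2 normalization of TfidfVectorizer, inner products between the tf-idf vectors are cosine similarities:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["latent semantic analysis of text",
             "word embeddings as semantic representations",
             "convolutional networks for image recognition"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams, bigrams
X = vectorizer.fit_transform(documents)
# Inner products between L2-normalized tf-idf vectors give
# document-document cosine similarities
similarities = X @ X.T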
Word Embedding

Word embeddings represent words in a continuous, dense, low-dimensional space. Although several systems exist for training, Word2vec, GloVe and fastText are probably the most popular. Below we describe these software packages.
Word2vec

The original word2vec (Mikolov et al. 2013) word embedding software is available from https://code.google.com/archive/p/word2vec. The Apache-licensed software is written in C and compiles to a multi-threaded command-line program. A small
Table 3.2 Selection of important pre-trained word embedding models

Model     Corpus                    Tokens  Vocabulary  Dim  Comment
Word2vec  Wikipedia                         10K-50K     300  29 languages
GloVe     Common Crawl              840G    2.2M        300  Cased, English
GloVe     Common Crawl              42G     1.9M        300  Uncased, English
GloVe     Wikipedia + Gigaword      6G      400K        300  Uncased, English. Models with dimension 50, 100 and 200 are also available
GloVe     Twitter                   27G     1.2M        200  25, 50 and 100 dimensional models are also available
fastText  Wikipedia                         Up to 2.5M  300  294 different language versions are available
fastText  Common Crawl + Wikipedia                      300  157 different language versions
demonstration program uses the automatically downloaded 100 MB Wikipedia-derived text8 corpus to train a model in a few minutes on a modern computer. Pre-trained word2vec models for multiple language versions of Wikipedia, trained with Gensim, have been made available by Kyubyong Park at https://github.com/Kyubyong/wordvectors, see also Table 3.2.

The word2vec program has several parameters for the estimation of the model. Apart from parameters for reading the corpus and writing the trained model, some of the parameters are: size, which determines the dimension of the low-dimensional space (the default is 100, and researchers tend to set it to 300 for large corpora); window, which controls the size of the context; cbow, which switches between the skip-gram and the continuous bag of words model; and min-count, which discards words occurring less often than the given threshold. Setting this last parameter to a high value will result in a smaller vocabulary.
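A typical invocation might look as follows; the file names and parameter values are only illustrative:

./word2vec -train corpus.txt -output vectors.bin -size 300 \
    -window 5 -cbow 1 -min-count 5 -threads 8 -binary 1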
GloVe

Another widely used system is GloVe (Pennington et al. 2014), with the reference implementation available from https://nlp.stanford.edu/projects/glove/. Like word2vec, GloVe is distributed under the Apache license and written in C, with a demonstration program using the text8 corpus. Several large pre-trained word embeddings are available from the homepage. In (Pennington et al. 2014), the researchers reported performance for the so-called 6B (trained on Wikipedia and Gigaword corpora with 6 gigatokens) and 42B (trained on Common Crawl with 42 gigatokens) embeddings. On word similarity and word analogy tasks, the model trained on the largest corpus (42B) proved the best, also in comparison with the word2vec variations and other examined approaches.
FastText

FastText is an open embedding library and program from Facebook AI Research. It is available from GitHub at https://github.com/facebookresearch/fastText under a BSD license and documented at https://fasttext.cc/ and in several scientific papers (Bojanowski et al. 2016; Joulin et al. 2016; Mikolov et al. 2017; Grave et al. 2018). The program is implemented in C++ with a Python interface. FastText requires a whitespace-separated tokenized corpus as input; multi-word phrases should be handled in preprocessing. The ability to handle both words and character n-grams sets fastText apart.

The fastText research group distributes open pre-trained models. The so-called "Wiki word vectors" available from https://fasttext.cc/docs/en/pretrained-vectors.html are trained on 294 different language versions of Wikipedia using an embedding dimension of 300 and the skip-gram model. Another line of pre-trained models was initially only available in English but trained on very large datasets (Mikolov et al. 2017). These models have produced strong results on benchmark datasets, including word and phrase analogy datasets, rare word similarity and a question answering system. In 2018, the researchers used language detection on the very large dataset, which enabled them to train separate models on 157 different languages. This improved the performance of fastText on a word analogy benchmark dataset; for some languages the improvement was considerable (Grave et al. 2018).

Apart from the unsupervised learning of a word embedding, fastText can also work in a supervised setting with a labeled corpus. For supervised training, each line in the input should be prefixed with a string indicating the category, as in the sketch below.
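For instance, with fastText's default label prefix, a labeled training file and the supervised training command could look as follows; the file names and the labels themselves are illustrative:

__label__positive This movie was a delight from start to finish
__label__negative Utterly forgettable plot and wooden acting

fasttext supervised -input train.txt -output sentiment_model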
Other Word Embedding Software

The Multivec package implements bilingual word embeddings that should be trained on a parallel corpus (Bérard et al. 2016). It is available from https://github.com/eske/multivec and implements training as well as functions to compute similarity and find the semantically closest words across languages. The documentation shows training with a 182,761-sentence French-English parallel corpus, and with this pre-trained model, an associated Python package can compute similarity values, e.g., between the French and English words for dog:

from multivec import BilingualModel

model = BilingualModel('news-commentary.fr-en.bin')
model.similarity('chien', 'dog')    # 0.7847443222999573
model.similarity('dog', 'chien')    # 0.0
model.similarity('chien', 'chien')  # 0.0
model.similarity('dog', 'dog')      # 0.0
Here the resulting similarities are shown as comments after each method call. Only when the source embedding is French and the target English does the model report a non-zero similarity. In the other cases, the words are out-of-vocabulary in one or both of the languages, and the returned similarity value from the method is zero.
Other Embedding Software

Other forms of embedding software go beyond embedding of words. Of particular interest is graph embedding software that takes a graph (rather than a corpus) and embeds the nodes, or both nodes and typed links between the nodes, in the continuous embedding space. The tools may generate a 'sentence' of nodes (and possibly links) by a random walk on the graph and submit the 'sentence' to conventional (word) embedding software. The node2vec package (https://snap.stanford.edu/node2vec/) embeds nodes via such random graph walks. The simple reference implementation uses the Python NetworkX package and Gensim's implementation of Word2vec. In the accompanying paper (Grover and Leskovec 2016), the node2vec embedding was trained on a Wikipedia-derived word co-occurrence corpus. RDF2vec (http://data.dws.informatik.uni-mannheim.de/rdf2vec/) is a similar tool and also uses Gensim's Word2vec implementation, but works for relational graphs. Wembedder uses a simple RDF2vec-inspired approach to embed a relational graph in the form of the Wikidata knowledge graph. Wembedder embeds both Wikidata items (the nodes in the knowledge graph) and the Wikidata properties (the links), and makes similarity computations based on the trained model available as a web service with an associated API (Nielsen 2017).
Gensim

Gensim is an open source Python package for topic modeling and related text mining methods going back to at least 2010 (Řehůřek and Sojka 2010). Distributed from https://radimrehurek.com/gensim/, the package has various methods to handle large corpora by streamed reading and to convert a corpus to a vector space representation. Gensim implements several popular topic modeling methods: tf-idf (TfidfModel), latent semantic analysis (LsiModel), random projections (RpModel), latent Dirichlet allocation (LdaModel) and the hierarchical Dirichlet process (HdpModel). It provides access to the computed vectors, so similarity computations can be made between the documents or between the documents and a query. Gensim reimplements the word2vec models, including training of the model from a text stream, returning the vector representation of a word and computing similarity
between words. Several other word embedding algorithms are also implemented: doc2vec, fastText and Poincaré embedding. Furthermore, Gensim wraps a few programs so they are accessible from within the Python session, although the programs need to be installed separately to work. They include WordRank (Ji et al. 2016) and VarEmbed (Bhatia et al. 2016) as well as Dynamic Topic Models and LDA models implemented in Mallet and Vowpal Wabbit. FastText was also wrapped, but since the direct implementation in Gensim 3.2 the wrapper is deprecated. Gensim has a simple class to read the POS-tagged Brown Corpus from an NLTK installation, which can be used as an example. Provided that this small dataset is downloaded via NLTK, training a Gensim word2vec model may be done as follows:

from os.path import expanduser, join
from gensim.models.word2vec import BrownCorpus, Word2Vec

dirname = expanduser(join('~', 'nltk_data', 'corpora', 'brown'))
model = Word2Vec(BrownCorpus(dirname))
Here the POS-tags are included as part of the word. The training with this small corpus takes no more than a few seconds. A similarity search on the noun "house" (postfixed with the POS-tag "nn" for nouns) with model.similar_by_word('house/nn') may yield the nouns "back", "room" and "door" as the most similar words.
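The topic modeling side of Gensim mentioned above follows the same pattern. A minimal sketch with a toy corpus (the documents and query here are invented for illustration):

from gensim import corpora, models, similarities

# A toy corpus of already tokenized documents
documents = [['human', 'computer', 'interaction'],
             ['graph', 'minors', 'trees'],
             ['human', 'interface', 'computer']]
dictionary = corpora.Dictionary(documents)
bows = [dictionary.doc2bow(document) for document in documents]

# tf-idf weighting followed by a two-topic latent semantic analysis
tfidf = models.TfidfModel(bows)
lsi = models.LsiModel(tfidf[bows], id2word=dictionary, num_topics=2)

# Similarity between a query and all documents in the LSA vector space
index = similarities.MatrixSimilarity(lsi[tfidf[bows]])
query = dictionary.doc2bow(['human', 'computer'])
print(index[lsi[tfidf[query]]])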
Deep Learning

Several free software packages for deep learning are available: Google's TensorFlow (Abadi et al. 2016), Microsoft's CNTK,6 Theano, PyTorch, Keras, MXNet, Caffe (Jia et al. 2014) and Caffe2. Model architectures and trained models may differ between the frameworks, but there are efforts towards a standardized format for interchange. One such effort is the Open Neural Network Exchange (ONNX) described at https://github.com/onnx. AllenNLP is an open source framework for natural language understanding using deep learning from the Allen Institute for Artificial Intelligence (Gardner et al. 2017). Available from http://allennlp.org/, it is built on top of PyTorch and spaCy and runs with Python 3.
6 https://docs.microsoft.com/en-us/cognitive-toolkit/
Keras

Keras7 is a high-level deep learning library. It runs on top of either TensorFlow, CNTK, or Theano and is among the most popular deep learning frameworks. Keras enables the deep learning programmer to easily build deep neural networks by chaining layers of simple classes representing, e.g., layers of weights (such as Dense or Conv1D) or activation functions (such as ReLU or softmax). It has a range of layers for recurrent neural networks, such as the popular long short-term memory (LSTM) unit. Of special interest for text mining are the text preprocessing functions and an embedding layer. Under the keras.applications submodule, Keras has several pre-trained deep learning models. Currently, they are all trained for image classification with the ImageNet dataset, so they are not of direct relevance in a text mining context, except in cases with combined image and text analysis, such as image captioning. It is possible to load pre-trained embedding models, such as GloVe models, as a Keras embedding layer with the keras.layers.Embedding class. The setup of pre-trained embedding models requires a lengthy procedure,8 but the third-party KerasGlove package makes the setup simpler. In the keras.datasets submodule, a couple of text corpora are readily available, mostly for benchmarking and not particularly relevant for establishing general semantic models: the Large Movie Review Dataset (Maas et al. 2011; referred to by Keras as IMDB movie reviews) with 25,000 movie reviews in the training set labeled by sentiment, and the Reuters newswires dataset with 11,228 newswires labeled over 46 topics. The Keras loading functions format the data as a sequence of indices, where the indices point to words. The indices are offset by 3 to make room for indices representing special 'words': start, out-of-vocabulary and padding. Loading the Large Movie Review Dataset into a training and a test set, as well as translating the index-based data back to text, can be performed with the following code:

from keras.datasets.imdb import get_word_index, load_data

# Read the text corpus as indices
(x_train, y_train), (x_test, y_test) = load_data(num_words=5000)

# Read the index to word mapping
index_to_word = {v: k for k, v in get_word_index().items()}

# Add special translation indices for padding, start
# and out-of-vocabulary indicators
index_to_word.update({-3: 'PAD', -2: 'START', -1: 'OOV'})
# Translate indices to words and concatenate
review = " ".join([index_to_word[index - 3]
                   for index in x_train[1]])
The beginning of the second review in the dataset (the review variable) will read "START big hair big OOV bad music and a giant safety OOV these are the words . . .". Here the case, punctuation and markup have been removed from the original text: the original file9 reads "Big hair, big boobs, bad music and a giant safety pin. . .. . .these are the words . . .". The DeepMoji pre-trained Keras model has been trained to predict emojis based on a very large English Twitter dataset (Felbo et al. 2017).10 The model predicts across 64 common emojis. deepmoji_feature_encoding loads the neural network excluding the last softmax layer, so a prediction generates a 2304-dimensional feature space. The layers of the model can be used in other systems for semantic tasks. A system for emotion classification used the softmax layer and the attention layer together with domain adaptation as the winning entry in a competition for prediction of affect in messages from Twitter (Duppada et al. 2018). Keras.js is a JavaScript library that supports running Keras models in the web browser. This browser version may use the GPU through WebGL, thus running at reasonable speeds.

7 https://keras.io/
8 See the blog post "Using pre-trained word embeddings in a Keras model" at https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
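To make the layer chaining described above concrete, here is a minimal sketch of a sentiment classifier for the movie review data loaded earlier; the hyperparameters are illustrative, not tuned:

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences

# Pad the variable-length index sequences to a fixed length
x_train_padded = pad_sequences(x_train, maxlen=200)

# Chain an embedding layer, an LSTM and a sigmoid output for binary sentiment
model = Sequential([
    Embedding(input_dim=5000, output_dim=32),
    LSTM(32),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train_padded, y_train, epochs=2, validation_split=0.1)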
Explicit Creation of Semantic Representations

Software for the construction of ontologies, i.e., explicit semantic representations, exists, e.g., Protégé (Musen and Team 2015).11 A recent development is Wikibase and its prime instance Wikidata (Vrandečić and Krötzsch 2014), which is a collaborative environment for multilingual structured data, implemented around the Wikipedia software (MediaWiki). Users can describe concepts and, through properties, describe relations between the concepts. A new feature enables users to describe lexemes and their orthographic forms. Users can make properties of different types. In Wikidata, some of the properties are 'instance of', 'subclass of' and 'part of', so users of Wikidata can create multilingual concept hierarchies. A database engine is set up where users can query the knowledge graph with complex queries in the SPARQL query language, e.g., generating semantic networks on-the-fly. The SPARQL listing below generates the semantic hypernym network from the concept chair, see also Fig. 3.1.
9 The original text can be found as a text file in the Large Movie Review Dataset distributed from http://ai.stanford.edu/~amaas/data/sentiment/
10 https://github.com/bfelbo/DeepMoji
11 https://protege.stanford.edu/
Fig. 3.1 Wikidata semantic hypernym network from the concept 'chair'

#defaultView:Graph
SELECT ?child ?childLabel ?parent ?parentLabel WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in wd:Q15026 ;
                gas:traversalDirection "Forward" ;
                gas:out ?parent ;
                gas:out2 ?child ;
                gas:linkType wdt:P279 ;
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
If the semantic graph representation is not used directly, then graph embedding can embed Wikidata items and properties in a low-dimensional space via, e.g., node2vec, RDF2vec or Poincaré embedding (Nickel et al. 2017). Figure 3.2 shows the result of a Poincaré embedding of the furniture hyponym network from Wikidata, where the embedding space has been selected to have just two dimensions. The Python code below constructs this plot, first formulating a SPARQL query for the furniture hyponym network, then downloading the data from the Wikidata Query Service at https://query.wikidata.org/ and converting the results, and lastly training and plotting with Gensim via its PoincareModel.

from gensim.models.poincare import PoincareModel
from gensim.viz.poincare import poincare_2d_visualization
from plotly.offline import plot
import requests

# Furniture graph query
sparql = """
SELECT ?furniture1 ?furniture1Label ?furniture2 ?furniture2Label WHERE {
  ?furniture1 wdt:P279+ wd:Q14745 .
  {
    ?furniture2 wdt:P279+ wd:Q14745 .
    ?furniture1 wdt:P279 ?furniture2 .
  } UNION {
    BIND(wd:Q14745 AS ?furniture2)
    ?furniture1 wdt:P279 ?furniture2 .
  }
  ?furniture1 rdfs:label ?furniture1Label .
  ?furniture2 rdfs:label ?furniture2Label .
  FILTER (lang(?furniture1Label) = 'en')
  FILTER (lang(?furniture2Label) = 'en')
}"""

# Fetch data from Wikidata Query Service and convert
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={'query': sparql, 'format': 'json'})
data = response.json()['results']['bindings']
relations = [
    (row['furniture1Label']['value'],
     row['furniture2Label']['value'])
    for row in data]

# Set up and train Poincare embedding model
model = PoincareModel(relations, size=2, negative=5)
model.train(epochs=100)

# Plot
plot_data = poincare_2d_visualization(model, relations, 'Furniture')
plot(plot_data)

Fig. 3.2 Wikidata semantic network for furniture and its subclasses with concept positions (the blue dots) determined by Gensim's Poincaré embedding where the dimension of the embedding space is two. For this particular trained model, the concept armchair appears at the coordinate (0.04, −0.43) while the root concept furniture is close to the middle of the plot
We can further query the trained Gensim model for similarity; e.g., model.kv.most_similar('armchair') yields furniture similar to the armchair
concept. With a particular trained model, the most similar concepts are recliner, Morris chair and fauteuil, but also gynecological chair and gas-discharge lamp. The latter concept is in the hyponym network of furniture because lamp is regarded as furniture. The selection of a two-dimensional Poincaré embedding space is entirely due to visualization. In the original work (Nickel et al. 2017), the WordNet noun network was modeled with embedding dimensions from 5 to 100. Also note that the resulting embedding depends on the initialization and that the resulting configuration in Fig. 3.2 is not optimal (as some concept groups could be moved closer to their superordinate concepts).

Acknowledgments We would like to thank Innovation Fund Denmark for funding through the DABAI project.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., et al. (2016, March). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Retrieved from https://arxiv.org/pdf/1603.04467.pdf
Al-Rfou, R., Perozzi, B., & Skiena, S. (2014, June). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (pp. 183–192). Retrieved from https://arxiv.org/pdf/1307.1662.pdf
Bérard, A., Servan, C., Pietquin, O., & Besacier, L. (2016). MultiVec: A multilingual and multilevel representation learning toolkit for NLP. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference. Retrieved from http://www.lrec-conf.org/proceedings/lrec2016/pdf/666_Paper.pdf
Bhatia, P., Guthrie, R., & Eisenstein, J. (2016, September). Morphological priors for probabilistic neural word embeddings. Retrieved from https://arxiv.org/pdf/1608.01056.pdf
Bird, S., Klein, E., & Loper, E. (2009, June). Natural language processing with Python. Retrieved from http://www.nltk.org/book_1ed/
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016, July). Enriching word vectors with subword information. Retrieved from https://arxiv.org/pdf/1607.04606.pdf
Choi, J. D., Tetreault, J., & Stent, A. (2015, July). It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (pp. 387–396). Retrieved from http://www.aclweb.org/anthology/P15-1038
De Smedt, T., & Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031–2035. Retrieved from http://www.jmlr.org/papers/volume13/desmedt12a/desmedt12a.pdf
Duppada, V., Jain, R., & Hiray, S. (2018). SeerNet at SemEval-2018 Task 1: Domain adaptation for affect in tweets. Retrieved from https://static1.squarespace.com/static/58e3ecc75016e194dd5125b0/t/5aaabbc02b6a28802e380940/1521138626662/domain-adaptation-affect-tweets.pdf
Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017, August). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 1616–1626). Retrieved from http://aclweb.org/anthology/D17-1169
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., & Zettlemoyer, L. (2017). AllenNLP: A deep semantic natural language processing platform. Retrieved from http://allennlp.org/papers/AllenNLP_white_paper.pdf
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018, February). Learning word vectors for 157 languages. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference. Retrieved from https://arxiv.org/pdf/1802.06893.pdf
Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016 (pp. 855–864). https://doi.org/10.1145/2939672.2939754
Honnibal, M. (2013, September). A good part-of-speech tagger in about 200 lines of Python. Retrieved from https://explosion.ai/blog/part-of-speech-pos-tagger-in-python
Ji, S., Yun, H., Yanardag, P., Matsushima, S., & Vishwanathan, S. V. N. (2016, September). WordRank: Learning word embeddings via robust ranking. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/V1/D16-1063
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014, June). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. https://doi.org/10.1145/2647868.2654889
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016, August). Bag of tricks for efficient text classification. Retrieved from https://arxiv.org/pdf/1607.01759.pdf
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150).
Mehdi, M., Okoli, C., Mesgari, M., Nielsen, F. Å., & Lanamäki, A. (2017). Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus. Information Processing & Management, 53, 505–529. https://doi.org/10.1016/J.IPM.2016.07.003
Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013, January). Efficient estimation of word representations in vector space. Retrieved from https://arxiv.org/pdf/1301.3781v3
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017, December). Advances in pre-training distributed word representations. Retrieved from https://arxiv.org/pdf/1712.09405.pdf
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(November), 39–41. https://doi.org/10.1145/219717.219748
Musen, M. A., & Team, P. (2015). The Protégé project: A look back and a look forward. AI Matters, 1(June), 4–12. https://doi.org/10.1145/2757001.2757003
Nickel, M., & Kiela, D. (2017, May). Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30. Retrieved from https://arxiv.org/pdf/1705.08039.pdf
Nielsen, F. Å. (2017, October). Wembedder: Wikidata entity embedding web service. https://doi.org/10.5281/ZENODO.1009127
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(October), 2825–2830. Retrieved from http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Retrieved from http://www.emnlp2014.org/papers/pdf/EMNLP2014162.pdf
Řehůřek, R., & Sojka, P. (2010, May). Software framework for topic modelling with large corpora. In New Challenges for NLP Frameworks Programme (pp. 45–50). Retrieved from https://radimrehurek.com/gensim/lrec2010_final.pdf
Vrandečić, D., & Krötzsch, M. (2014). Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(October), 78–85. https://doi.org/10.1145/2629489
Chapter 4
Semantic Similarity Scales: Using Semantic Similarity Scales to Measure Depression and Worry

Oscar N. E. Kjell, Katarina Kjell, Danilo Garcia, and Sverker Sikström
Aims and Content

The aims of this chapter include describing:
• how semantic representations may be used to measure the semantic similarity between words.
• the validity of semantic similarity as measured by cosine.
• how semantic similarity scales can be used in research.
• how to apply a t-test to compare two sets of texts using semantic similarity (i.e., a "semantic" t-test).
• how to visualize word responses by plotting words according to semantic similarity scales.
• a research study where depression is measured using semantic similarity scales, independently from traditional rating scales.

This chapter describes how semantic representations based on Latent Semantic Analysis (LSA; Landauer and Dumais 1997) may be used to measure the semantic similarity between two words, sets of words or texts.
This research is supported by a grant from VINNOVA (2018-02007; Sweden’s innovation agency) and Kamprad Family Foundation (20180281). O. N. E. Kjell · K. Kjell (*) · S. Sikström Department of Psychology, Lund University, Lund, Sweden e-mail: [email protected]; [email protected]; [email protected] D. Garcia Department of Behavioral Sciences and Learning, Linköping University, Linköping, Sweden Blekinge Center of Competence, Region Blekinge, Karlskrona, Sweden Department of Psychology, University of Gothenburg, Gothenburg, Sweden e-mail: [email protected] © Springer Nature Switzerland AG 2020 S. Sikström, D. Garcia (eds.), Statistical Semantics, https://doi.org/10.1007/978-3-030-37250-7_4
Whereas Nielsen and Hansen describe how to create semantic representations in Chap. 3, this chapter focuses on describing how these may be used in research to estimate how similar words/texts are in meaning, as well as testing whether two sets of words statistically differ. This approach may, for example, be used to detect between-group differences in an experimental design. First, we describe how single words' semantic representations may be added together to describe the meaning of several words or an entire text. Second, we discuss how to measure semantic similarity using the cosine of the angle between the words' positions in the semantic space. Third, we describe how this procedure of text quantification makes it possible for researchers to use statistical tests (e.g., the semantic t-test) for investigating, for example, differences between freely generated narratives. Lastly, we carry out a research study building on studies by Kjell et al. (2018), which demonstrated that semantic similarity scales may be used to measure, differentiate and describe psychological constructs, including depression and worry, independently from traditional numerical rating scales.
Semantic Analysis Methods

Using the Semantic Representations to Measure Semantic Similarity

Applying High Quality Semantic Representations to Experimental Data

Creating a semantic space comprising high-quality semantic representations requires a lot of text data. The amount of text data that is required is often not feasible to collect in many types of research studies. Therefore, we often use a semantic space constructed from a huge amount of text data and apply its semantic representations to the words collected in a research study (e.g., see Kjell et al. 2016). One approach is to use domain-specific texts when creating the semantic space. For example, if the text data to be analysed come from social media, one creates the semantic space using social media text data. However, collecting such data and creating a semantic space is often cumbersome. Another approach is to use a multi-purpose, general, high-quality semantic space based on a huge amount of diverse text data. We often use a general semantic space based on texts from Google n-grams (see https://books.google.com/ngrams), as this database comprises an enormous amount of text from books. Using the same semantic space has the advantage of being comparable and generalizable across studies, whereas an advantage of a domain-specific semantic space is that words may have semantic representations that are more domain-specific.
Adding Semantic Representations Together to Represent Several Words or a Text

The meaning of several words may be represented by one common semantic representation (see Fig. 4.1).
Fig. 4.1 Semantic representations are applied from the semantic space to the experimental data and then added together (Adapted from Sikström et al. 2018)
That is to say, the semantic representations of individual words may be added together into one semantic representation for a set of words, sentences, paragraphs, documents, websites, books, etc. This is achieved by adding the semantic representations of the words to be summarized, semantic dimension by semantic dimension. So, the values in the first dimension of all the semantic representations are summed up, then the values in the second dimension, and so on for all semantic dimensions in the semantic representations. Lastly, the semantic representation is normalized to the length of one by dividing each semantic dimension value by the length of the semantic representation (i.e., the magnitude of the vector). This normalization is carried out because it simplifies subsequent computations, such as computing the cosine semantic similarity, as sketched below.
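A minimal sketch of this summing and normalization, assuming each word's semantic representation is available as a row of a NumPy array:

import numpy as np

def text_representation(word_vectors):
    # Sum the word vectors, semantic dimension by semantic dimension
    summed = np.sum(word_vectors, axis=0)
    # Normalize to unit length (divide by the magnitude of the vector)
    return summed / np.linalg.norm(summed)

# E.g., three hypothetical 512-dimensional word representations
words = np.random.randn(3, 512)
text_vector = text_representation(words)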
Understanding Semantic Similarity

As discussed in previous chapters, a semantic representation is a vector describing a word's position in a high-dimensional semantic space. That is, the values in a word's semantic representation specify how that particular word relates to all the other words in the semantic space. In other words, a semantic representation may be seen as the coordinates for a position or a point in a high-dimensional space. The closer two points are positioned in this high-dimensional semantic space, the higher the semantic similarity and the closer they tend to be in meaning. This semantic similarity may be quantified and presented numerically by calculating the cosine of the angle between the two points (i.e., the two semantic representations). Semantic similarity as measured by cosine can, as for a correlation, range from −1 to 1, but in practice it very seldom goes much below 0 (Landauer et al. 2007). In fact, the cosine and the correlation between two variables are mathematically related (for an elaboration, see, for example, Wickens 2014). A high positive cosine value denotes high semantic similarity, and a cosine around 0 denotes that the two words are unrelated in meaning; that is, the semantic representations are orthogonal.
Fig. 4.2 Illustration of the semantic similarity between words as measured with cosine
Notice that the density distribution of cosine similarities between randomly selected words has a positive skew: typically, a single word has few words with similarity scores above 0.5 and none below −0.5. Semantic similarity as measured by cosine is visually conceptualised in Fig. 4.2, which shows that the semantic similarity score between Mother and Father is higher than the semantic similarity score between Mother and Fish, as measured using Semantic Excel (see Chap. 5; Sikström, Kjell & Kjell).
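A minimal sketch of the cosine computation; note that for representations already normalized to unit length, as above, the cosine reduces to a plain dot product:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two semantic representations
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical example with random vectors standing in for word representations
mother, father = np.random.randn(512), np.random.randn(512)
print(cosine_similarity(mother, father))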
Semantic t-Tests Computed on Semantic Similarities

The semantic similarity measure can be used to test whether two sets of words/texts differ using a t-test. We call this procedure a semantic t-test (e.g., see Kjell et al. 2018). Testing whether two sets of words differ may be important in several different areas of research, including testing the success of an experimental manipulation in which two conditions involve writing about different scenarios, emotional states, etc. For the purpose of the following description of the analytical procedure of a semantic t-test, consider that we want to test whether word responses to an open-ended question relating to one's worry versus one's depression differ significantly in meaning (which we will do in the research study presented later in this chapter). A semantic t-test may be thought of as procedurally being implemented in three steps. The first step involves creating what we call a semantic comparison representation (e.g., see Sikström et al. 2018; or a semantic difference vector, see Kjell et al. 2018). In this first step, the semantic representations of all words in each group/condition to be compared are summed, and then one semantic representation is subtracted from the other in order to create the semantic comparison representation. So, in our example, all the word responses to the worry question are summed to one semantic representation, and then all word responses to the depression question are summed; then the worry semantic representation is subtracted from the depression semantic representation to create the semantic comparison representation. This semantic comparison representation is then normalized to the length of one.
Fig. 4.3 Conceptual illustration of how semantic representations of worry versus depression responses are measured against the semantic comparison representation for a semantic t-test
The second step involves computing what we call a semantic comparison scale. This is achieved by computing the semantic similarity between the semantic representation of each individual response (for example, an individual's worry word response and then the individual's depression word response) and the semantic comparison representation. To remove the biases that would occur if the to-be-measured semantic representation were also included in the semantic comparison representation, we apply a leave-n-out procedure (typically leave-10%-out) across the first two steps: 10% of the responses are left out in step 1 but used in step 2. This is then repeated for another 10% portion of the responses and continued until all responses have received a semantic comparison score. The third step includes computing an appropriate (e.g., independent or dependent) t-test comparing the semantic scales between the two groups/conditions. Hence, the semantic scale variables ought to be examined for assumptions such as normality and equal variance, and then the appropriate t-test should be used, such as Student's t-test (assuming both normality and equal variances; Student 1908), Welch's t-test (assuming normality, but suitable for unequal variances and unequal sample sizes; e.g., see Welch 1947) or a Wilcoxon–Mann–Whitney test (a robust test that does not assume normality; Wilcox et al. 2013). Subsequently, the analytic procedure produces a p-value and also allows for the calculation of an appropriate effect size such as Cohen's d (Cohen 1988). So, to clarify (Fig. 4.3): A semantic t-test assesses the null hypothesis that there is no mean difference in semantic comparison values (measured as cosine semantic similarity against the semantic comparison representation) between the two sets of words.
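A minimal sketch of the three steps, assuming each row of the input arrays is an individual's length-normalized semantic representation and omitting the leave-10%-out procedure for brevity:

import numpy as np
from scipy import stats

def semantic_t_test(worry, depression):
    # Step 1: semantic comparison representation, normalized to unit length
    comparison = depression.sum(axis=0) - worry.sum(axis=0)
    comparison = comparison / np.linalg.norm(comparison)
    # Step 2: semantic comparison scale, the cosine of each response
    # with the comparison representation (rows are unit length)
    worry_scores = worry @ comparison
    depression_scores = depression @ comparison
    # Step 3: an appropriate t-test, here a dependent (paired) t-test
    return stats.ttest_rel(depression_scores, worry_scores)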
Thus, this procedure requires the calculation of a semantic comparison representation. An alternative method would, for example, be to calculate two semantic similarity scores, i.e., between the to-be-measured representation (e.g., an individual's
worry response) and the worry representation (i.e., the summed semantic representation of all worry responses), and between the to-be-measured representation and the depression representation (i.e., the summed semantic representation of all depression responses), and then use the difference between these semantic similarity scores. However, this measure is no longer bounded by the limits −1 and +1. By using the semantic comparison vector, we ensure that the resulting measure is bounded by −1 and +1, which is a convenient property of the proposed method.
Research Study

Assessing Psychological Constructs Using Semantic Similarity Scales: Measuring, Describing and Differentiating Depression and Worry/Anxiety

Semantic measures enable measuring psychological constructs through word responses that are analysed using natural language processing, including LSA (Kjell et al. 2018). Kjell et al. (2018) demonstrated that semantic measures may be used to measure the degree of psychological constructs, describe the constructs with significant keywords, and differentiate between similar constructs such as depression and worry. The aim of this research study is to further examine the semantic measures of depression and worry. We investigate how unipolar and bipolar semantic similarity scales of depression and worry relate to corresponding numerical rating scales. The data used in this (and the next) chapter are based on Kjell, Kjell and Sikström (in progress). Herein the focus is to replicate the main findings of Kjell et al. (2018), whereas Kjell et al. (in progress) focus on examining how the semantic measures relate to the different items composing the rating scales. This research study extends the research by Kjell et al. (2018) in two important ways: First, in addition to using unipolar (high) scales, we also develop and examine bipolar scales for depression and worry. Second, in addition to using the Patient Health Questionnaire-9 (PHQ-9, measuring depression; Kroenke et al. 2001) and the Generalized Anxiety Disorder Scale-7 (GAD-7; Spitzer et al. 2006), we here also use the Penn State Worry Questionnaire-Abbreviated (PSWQ-A; Hopko et al. 2003).
The Semantic Measures Approach: Semantic Questions and Word Norms

Kjell et al. (2018) developed semantic questions such as "Over the last 2 weeks, have you been depressed or not?" coupled with instructions asking participants to write descriptive words (for more details, see the Method section). The semantic questions and the instructions were constructed with the purpose of generating word responses suitable for analyses with LSA. To measure the degree of psychological constructs, Kjell et al. (2018) developed word norms that represent the to-be-measured construct. This was done by asking approximately one hundred participants to describe the
psychological construct of interest; for example: "Please write 10 words that best describe your view of being depressed". Hence, to measure an individual's degree of depression, they computed the cosine semantic similarity between the individual's word responses to the semantic question of depression and the depression word norm. The higher the semantic similarity score, the higher the degree of depression the respondent was regarded to have.
Measuring Constructs: Unipolar and Bipolar Semantic Similarity Scales

Word norms may be conceptualized as high or low: the high word norm represents the to-be-measured construct (depressed in our example above), and the low word norm represents the opposite meaning of the to-be-measured construct (e.g., "not at all depressed"). Accordingly, we can talk about unipolar and bipolar semantic similarity scales. A semantic similarity score may be said to be unipolar when only one word norm is used (typically the high word norm); a bipolar semantic similarity score is one where the semantic similarity score to the low word norm has been subtracted from the semantic similarity score to the high word norm (see the sketch below). Kjell et al. (2018) found that bipolar, as compared with unipolar, semantic similarity scales for harmony in life and satisfaction with life produced considerably higher correlations with corresponding rating scales, including the Harmony in Life Scale (Kjell et al. 2016) and the Satisfaction with Life Scale (Diener et al. 1985). However, they did not collect low word norms for worry and depression. Therefore, this chapter involves collecting and using the word norms for being "not at all depressed" and "not at all worried" in combination with the high norms to create bipolar semantic similarity scales. In accordance with Kjell and colleagues' (2018) findings, we hypothesize that the words generated in response to the semantic question for depression differ significantly from the words generated in response to the semantic question for worry (H1). Further, it is hypothesised that unipolar scales significantly correlate with corresponding rating scales (H2). We further hypothesize that bipolar scales for worry and depression yield significantly stronger correlations to numerical rating scales than unipolar scales (H3).
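A minimal sketch of the unipolar and bipolar scores, reusing the cosine_similarity function defined earlier and assuming each word norm has been summed into a single representation:

def unipolar_score(response, high_norm):
    # Unipolar: similarity to the high word norm only
    return cosine_similarity(response, high_norm)

def bipolar_score(response, high_norm, low_norm):
    # Bipolar: similarity to the low norm subtracted
    # from similarity to the high norm
    return (cosine_similarity(response, high_norm)
            - cosine_similarity(response, low_norm))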
Describing Constructs Using Plots

Using keyword plots, Kjell et al. (2018) demonstrated that the semantic measures approach may effectively be used to describe the constructs under investigation in ways not previously possible with only rating scales. That is, since the response format in the semantic measures approach is descriptive word responses rather than closed-ended responses, the responses themselves can be used to describe the constructs. To describe the constructs under investigation, Kjell et al. (2018) plotted statistically significant words in figures, where they compared depression and worry word responses on the x-axis, whilst plotting the words according to their correlation with the semantic similarity scales or rating scales on the y-axis. Based on Kjell and colleagues' (2018) findings, we hypothesize that the plots reveal words linked to sad and lonely as
significant words relating to depression; and words linked to anxious and nervous as significant words relating to worry (H4).
Differentiating Between Constructs: Inter-Correlations and Covarying Variables in Plots

By examining the inter-correlations of scales and a plotting methodology using covariates on the y-axes, Kjell et al. (2018) found that semantic similarity scales may differentiate between similar constructs more clearly than the rating scales. It was found that the PHQ-9 and the GAD-7 yielded a very high inter-correlation, whereas the semantic similarity scales for depression and worry yielded a lower inter-correlation. Hence, the high inter-correlation of the rating scales suggests that they measure very similar, overlapping aspects, whereas the semantic similarity scales show higher independence in what they measure. Therefore, it is hypothesized that the unipolar semantic similarity scales of depression and worry yield a lower inter-correlation than the corresponding rating scales, and that the bipolar scales of depression and worry yield a stronger inter-correlation than the unipolar scales, but lower than the rating scales (H5). Kjell et al. (2018) further found that covarying the semantic similarity scale of worry out of that of depression, or vice versa, yielded considerable independence between the constructs whilst maintaining a correlation to the words describing the construct, whereas this was not true when covarying the rating scales. That is to say, covarying the rating scales of the PHQ-9 and the GAD-7 revealed a considerably less unique relationship with the words. This relationship was also demonstrated for semantic similarity scales and rating scales measuring harmony in life and satisfaction with life. In accordance with Kjell et al.'s (2018) findings, we hypothesize that covarying the semantic similarity scales of depression and worry on the y-axes reveals more significant words than covarying the corresponding rating scales (H6).
Method

Participants

Participants were recruited online using Mechanical Turk (http://www.mturk.com). Mechanical Turk is a website that allows researchers (and companies) to pay participants to complete tasks such as online studies. Four hundred and fifty-five participants submitted their survey, of which 44 (9.7%) were not included in the analyses because they failed to answer control items correctly (as described below). The remaining 411 participants comprised 47% females and 53% males. Their age ranged from 18 to 74 years with a mean of 36.15 and a standard deviation of 11.19. Participants reported being from the United States (86%), India (11%) or other countries (3%). They were paid $0.70 to participate. Mechanical Turk was also used to recruit participants to collect words for the word norms for "not at all worried" and "not at all depressed". The "not at all
worried" sample consisted of 97 participants (females = 50.5%; males = 49.5%) with an age ranging from 18 to 64 years (M = 33.93, SD = 9.64). Participants were from the United States (72%), India (11.34%) and various other countries (16.49%). The "not at all depressed" sample comprised 115 participants (females = 53%; males = 47%) with an age ranging from 19 to 63 years (M = 31.44; SD = 13.35). Most participants came from the United States (67%), followed by India (19.13%) and other countries (13.91%). These samples were part of another study where participants were paid $0.50 to participate.
Measures and Material

The Semantic Question of Worry (Kjell et al. 2018) asks participants: "Over the last 2 weeks, have you been worried or not?", with instructions to respond using five descriptive words. In addition, the semantic question is coupled with the instruction that participants should weigh the number and strengths of the response words so that they reflect their overall state of worry.

The Semantic Question of Depression (Kjell et al. 2018) includes the question "Over the last 2 weeks, have you been depressed or not?". This question was coupled with an adapted version of the instructions for the semantic question of worry. Five descriptive words were required as the response format.

The Patient Health Questionnaire-9 (Kroenke et al. 2001) is a 9-item questionnaire comprising items such as "Feeling down, depressed, or hopeless" rated on a scale ranging from 0 = "Not at all" to 3 = "Nearly every day". In the current study, Cronbach's α was .92 and McDonald's ω was .94.

The Generalized Anxiety Disorder Scale-7 (Spitzer et al. 2006) is a 7-item questionnaire comprising items such as "Worrying too much about different things" answered on a scale ranging from 0 = "Not at all" to 3 = "Nearly every day". The GAD-7 showed good internal consistency in the current study, as measured by Cronbach's α = .93 and McDonald's ω = .95.

The Penn State Worry Questionnaire (Meyer et al. 1990), abbreviated to 8 items (PSWQ-8; Hopko et al. 2003), includes items such as "My worries overwhelm me", which are answered on scales ranging from 0 = "Not at all typical of me" to 5 = "Very typical of me". In the current study the internal consistency was good; Cronbach's α was .96 and McDonald's ω was .97.

Control items included: "On this question please answer the alternative 'Several days'"; "On this question please answer the alternative 'More than half the days'"; and "On this question please answer the alternative 'Not at all'". They were randomly spread out within the PHQ-9 and the GAD-7. Respondents who failed to correctly answer these control items were removed from the analysis.

The demographic survey included asking participants about their age, gender, country of origin, first language and their perception of their household income.

The Word Norm for Depression (Kjell et al. 2018) comprises a set of words that describe being depressed. The word norm includes 1172 words generated from 110 participants answering the question about their view of being depressed: "Please write 10 words that best describe your view of being depressed. Write descriptive
62
O. N. E. Kjell et al.
words relating to those aspects that are most important and meaningful to you. Write only one descriptive word in each box." The most frequent participant-generated word was sad (83 times); Kjell et al. (2018) added the word depressed so that it was the most represented word (i.e., 84 times).

The Word Norm for Worry (Kjell et al. 2018) includes 1036 words describing "being worried" generated by 104 participants. The question and instructions were adapted from the question used to collect the depression word norm. The most frequent participant-generated word was anxious (58 times). Because participants were instructed not to include the word worry, Kjell et al. (2018) added the word worry one time more than the most commonly generated word (i.e., 59 times).

The Word Norm for Not at all Depressed includes 1125 words generated from 115 participants. The question was adapted from the question used to collect the depression word norm, but asked about "being not at all depressed". The most frequent word was happy (56 times).

The Word Norm for Not at all Worried comprises 938 words collected from 97 participants. The question was adapted from the question used to collect the depression word norm, and asked about "being not at all worried". The most common word was happy (58 times).
Procedure

Participants were first informed about the study and shown a consent form with information regarding how to get more information, that all their responses were saved anonymously, and that they had the right to withdraw from the study at any time. Second, participants were asked to answer the semantic questions, which were presented in a random order for each participant, followed by the corresponding rating scales, where the order was also randomized. The word questions were given prior to the rating scales to avoid the rating scale items influencing the open-ended word responses. After the rating scales on depression and anxiety/worry, the three social desirability scales were presented in random order, followed by the short demographic survey. Lastly, participants were debriefed. The average time to complete the study was 10 minutes and 5 seconds. The word norms for "not at all depressed" and "not at all worried" were collected as part of a larger study. In that study, the word norm questions were presented first so that no other study content would influence participants' answers. Participants were randomly allocated to answer one of the word norm questions.
Statistical Analyses

Statistical semantics using Semantic Excel (Sikström et al. 2018; Sikström, Kjell & Kjell, Chap. 5) was employed to analyse the word content. The analyses were based on the English space 1 and corrected for artefacts related to words that occur more frequently (see Kjell et al. 2018 for more details). For other analyses R was used, including the following packages: psych (Revelle 2017),
apaTables (Stanley 2017) and Hmisc (Harrell et al. 2017). To test the difference between correlation strengths, we used http://quantpsy.org (Lee and Preacher 2013). Good internal reliability for the rating scales was defined as values above .70 for Cronbach's α and McDonald's ω. To interpret the effect sizes, we used the following conventions, as suggested by Cohen (1988) and Sawilowsky (2009): .01 for very small, .2 for small, .5 for medium, .8 for large, 1.2 for very large, and 2.0 for huge effect sizes. Alpha was set to .05.
Results

Descriptive statistics are presented in detail in Kjell, Kjell and Sikström (in progress). Outliers were always removed rather than changed to the nearest (maximum or minimum) value. However, this only affected three variables: the High and Low unipolar semantic similarity scales for depression and the Low unipolar semantic similarity scale for worry. Many of the variables were not normally distributed; in particular, the GAD-7 and the PHQ-9 demonstrated a positive skew. A probable reason for this is that these rating scales were constructed for clinical symptoms, and although both excessive worry and depression are common psychiatric disorders, a positive skew with more low values is likely to be found in a non-clinical population. Consequently, Spearman's rho, as opposed to Pearson's correlation, is used in the analyses.
Semantic Responses Differ Significantly

A paired semantic t-test reveals that semantic responses to the worry and the depression questions differ significantly from each other (t[822] = 19.47; p < .001) with a very large effect size (Cohen's d = 1.77). This supports H1, indicating that the two semantic questions probe respondents to answer them in different ways; and the effect size indicates that the difference can be considered very large.
Measuring Psychological Constructs

Bipolar Scales Yield Stronger Correlations to Rating Scales than Unipolar Scales

The bipolar, as compared with the unipolar, semantic similarity scales correlate significantly more strongly with the rating scales. The depression unipolar scale significantly correlates (rho = .28, p < .01) with the PHQ-9. The worry unipolar scale significantly correlates with the GAD-7 (rho = .30, p < .01) and the PSWQ-8 (rho = .34, p < .01).
The bipolar scale for depression yields a significantly higher correlation with the PHQ-9 (rho = .60) than the unipolar depression scale (z = 5.79, p < .001, two-tailed). The correlation between the GAD-7 and the worry bipolar scale (rho = .50) is significantly stronger than the correlation with the worry unipolar scale (rho = .30; z = 3.42, p < .001, two-tailed). And the correlation between the PSWQ-8 and the worry bipolar scale (rho = .54) is significantly stronger than that with the unipolar scale (rho = .34; z = 3.57, p