Studies in Computational Intelligence 1101
Jarosław Protasiewicz
Knowledge Recommendation Systems with Machine Intelligence Algorithms: People and Innovations
Studies in Computational Intelligence Volume 1101
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
Jarosław Protasiewicz
Knowledge Recommendation Systems with Machine Intelligence Algorithms: People and Innovations
Jarosław Protasiewicz National Information Processing Institute Warsaw, Poland
ISSN 1860-949X  ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-031-32695-0  ISBN 978-3-031-32696-7 (eBook)
https://doi.org/10.1007/978-3-031-32696-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
As a thank you to the one who recited this poem to me during a walk one night in June through Pole Mokotowskie park in Warsaw:

In Verona
Cyprian Kamil Norwid

Above the house of Capulet and Montague,
Thunder-moved, washed in dew,
Heaven’s gentle eye
Looks on ruins of hostile city-states,
On broken garden gates,
And drops a star from on high.

It is for Juliet, a cypress whispers,
For Romeo that tear
Seeps through the tomb.

But men say knowingly and mock
It was not a tear but a rock
Awaited by none.
Foreword
The research book authored by Dr. Jarosław Protasiewicz brings forward an authoritative treatise on the timely and strategically positioned area of knowledge recommendation and systems of knowledge recommendation. The book covers in a systematic and prudently arranged way the key and far-reaching topics falling under the umbrella of intensive knowledge-based technology. Knowledge recommendation is an urgent and timely topic encountered in research and information services. There is a strongly compelling need: the modern economy badly requires highly skilled professionals, researchers, and innovators, whose availability creates opportunities to gain competitive advantages, assists in managing financial resources and available goods, and allows fundamental and applied research to be carried out more effectively. From the structural perspective, knowledge recommendation comprises three main functional phases: assignment, recommendation, and finding people who are able to deliver knowledge or written artefacts, including articles and innovations. This structural point of view is fully reflected in the organisation of the book. In essence, it focuses on the recommendation of reviewers and experts, and on innovation support. From the design point of view, one concentrates on requirement elicitation, architectural development, detailed design, validation, and verification. The design, development, and implementation of the two representative IT systems discussed in the book, supplemented with content-based recommendation algorithms, illustrate how the paradigm and theory of knowledge recommendation work in practice. This also includes the development and practical application of selected heuristics and machine learning/machine intelligence algorithms that aim to create individuals’ expertise profiles and to deliver ways of evaluating enterprise innovation. The book contains original material and is unique in many ways.
The prudent and thought-out selection and exposure of the topics, the depth of coverage of the subject matter, and original insights are the focal features of the book. New and promising directions and techniques of machine learning applied to knowledge recommendation are original. The critical literature review identifying the state-of-the-art of the area, along with its main achievements, existing limitations, and challenges, will appeal to the reader interested in getting acquainted with the intensive studies reported in the literature. Bringing together the theoretical and application-oriented facets of the area and forming a coherent view at the junction of theory and practice is a genuine asset of the book. The author himself has been actively involved in the design, deployment, and maintenance of the systems; this first-hand experience shared with the readers is another feature that speaks to the uniqueness of the book. The author did an excellent job. The book is timely and insightful. Indisputably, it will fully appeal to researchers and practitioners. It can also serve as a compendium of a thoroughly structured body of knowledge for those entering the area of knowledge recommendation and knowledge-based computing in general.

Edmonton, Canada
November 2022
Witold Pedrycz
Acknowledgements
This book would not have been possible without the support I have received from a large number of individuals. First and foremost, I wish to express my gratitude to Prof. Witold Pedrycz of the University of Alberta, Canada. His relentless support and encouragement spurred me into action, and his advice and constructive criticism has helped me to avoid many mistakes. I consider myself exceptionally fortunate to have been granted a mentor of such notability. I also extend my gratitude to Prof. Janusz Kacprzyk of the Systems Research Institute at the Polish Academy of Sciences for his insightful comments and suggestions, and for his invaluable help in the publication of this book. I am indebted to him for his willingness to share his vast experience with me. His commitment and advice laid the groundwork for this book. I also wish to thank all of my colleagues at the National Information Processing Institute in Warsaw, Poland, with whom I have had the pleasure to work as a project leader on the information systems presented in this book: the reviewer and expert recommendation system and the Inventorum innovation support system. I offer special thanks to Dr. Sławomir Dadas, who led the development team tasked with the implementation of both systems, and to Dr. Marek Kozłowski, who was responsible for the implementation of the reviewer and expert recommendation system at the Polish National Centre for Research and Development (NCBR). I also wish to thank all coauthors of the publications that have ensued from work on knowledge recommendation systems. Several years of collaboration, exchange of diverse views, and mutual support have allowed me to gain a broader view on knowledge recommendation systems, without which this book would not have been possible. Finally, I wish to thank the Social Communication department team at the National Information Processing Institute, led by Anna Pira, for the translation, editing, and graphic design of this book. 
They have made the material considerably more accessible—particularly to readers who are not experts in computer science.
Once again, I wish to express my gratitude to everybody who has supported me in my efforts to publish this book.

Jarosław Protasiewicz
Contents
1 Introduction
  1.1 Why Knowledge Recommendation is Needed
  1.2 What is Knowledge Recommendation?
  1.3 The Road Map of this Book
  References
2 Literature Review
  2.1 A Quantitative Analysis of Knowledge Recommendation
  2.2 Support for the Selection of Reviewers and Experts
  2.3 Support for Innovation
  2.4 Selected Algorithms
  2.5 Summary
  References
3 Recommending Reviewers and Experts
  3.1 Reviewing Problems
    3.1.1 The Purpose of Reviewing
    3.1.2 Reviewing Methods
    3.1.3 Disruptions to the Reviewing Process
    3.1.4 Why Automate the Selection of Reviewers?
    3.1.5 Assumptions of the Recommendation System
  3.2 The IT Reviewer and Expert Recommendation System
    3.2.1 System Architecture and System Processes
    3.2.2 Data Acquisition Module
    3.2.3 Knowledge Retrieval Module
    3.2.4 Recommendation Module
  3.3 Recommendation Algorithm
    3.3.1 Keywords’ Cosine Similarity
    3.3.2 A Full-text Index
    3.3.3 The Combination of Two Measures
  3.4 Validation of the Recommendation System
    3.4.1 A Simple Example of the Recommendation Algorithm
    3.4.2 Implementation of the Complete Algorithm
  3.5 Summary
  References
4 Supporting Innovativeness and Information Sharing
  4.1 Innovativeness
    4.1.1 Innovation
    4.1.2 Open Innovation and Innovativeness Strategies
    4.1.3 An IT System to Support Innovativeness
  4.2 A System that Supports Innovativeness
    4.2.1 An Outline of the System
    4.2.2 Data Acquisition and Information Extraction
    4.2.3 Recommendations
    4.2.4 Recommendations in Practice
    4.2.5 Recommendations Distribution
  4.3 Summary
  References
5 Selected Algorithmic Developments
  5.1 Data Extraction and Crawling
    5.1.1 The Data Extraction Algorithm
    5.1.2 The Crawling Algorithm
    5.1.3 Data Acquisition in Practice
  5.2 Classification of Publications
    5.2.1 Problem Definition
    5.2.2 Classification Algorithms and Procedures
    5.2.3 Flat Versus Hierarchical Classification
    5.2.4 Monolingual Versus Multilingual Classification
    5.2.5 Classification of Publications in Practice
  5.3 Disambiguation of Authors
    5.3.1 Disambiguation Framework
    5.3.2 A Rule-Based Algorithm
    5.3.3 Clustering by Using Heuristic Similarity
    5.3.4 Clustering Using Similarity Estimated by Classifiers
    5.3.5 Disambiguation of Authors in Practice
  5.4 Keyword Extraction
    5.4.1 Polish Keyword Extractor
    5.4.2 Keyword Extraction in Practice
  5.5 Evaluation of Enterprises’ Innovativeness
    5.5.1 A Model of Evaluation of Enterprises’ Innovativeness
    5.5.2 Model Evaluation
  5.6 Summary
  References
6 Knowledge Recommendation in Practice
  6.1 The Reviewer and Expert Recommendation System
    6.1.1 System Architecture and Technology
    6.1.2 System User Interfaces
    6.1.3 Selected Statistics
  6.2 Inventorum, the Innovation Support System
    6.2.1 System Architecture and Technology
    6.2.2 System User Interfaces
    6.2.3 Selected Statistics
  6.3 Summary
  References
7 Conclusions
  7.1 Knowledge Recommendation
  7.2 Novelty and Originality
  7.3 Further Development
Acronyms
API: Application Programming Interface
CF: Collaborative Filtering
CJSH: Journal of Social Sciences and Humanities
CRF: Conditional Random Fields
DBLP: DataBase systems and Logic Programming—Computer Science Bibliography
DSc: Doctor of Science
FP7: Seventh Framework Programme, European Union research and development funding programme
HAC: Hierarchical Agglomerative Clustering
HDFS: Hadoop Distributed File System
HTML: HyperText Markup Language
IEEE: Institute of Electrical and Electronics Engineers
IT: Information Technology
KEA: Keyphrases Extraction Algorithm
MLP: MultiLayer Perceptron
MNB: Multinomial Naive Bayes
MSc: Master of Science
NCBR: National Centre for Research and Development
NoSQL: No Structured Query Language
OSJ: Ontology of Scientific Journals
PhD: Doctor of Philosophy
PKE: Polish Keyword Extractor
R&D: Research and Development
REST: REpresentational State Transfer
SVM: Support Vector Machines
TF-IDF: Term Frequency–Inverse Document Frequency
URL: Uniform Resource Locator
XML: Extensible Markup Language
Chapter 1
Introduction
1.1 Why Knowledge Recommendation is Needed

Nowadays, we are overwhelmed by data. This is caused by the unprecedented influx of news, advertisements, opinions, technical papers, scientific works, and more. This phenomenon results in difficulty in finding the right information, expertise, or people when they are needed. Knowledge recommendation is an urgent and timely subject in research and information services. The issue is becoming increasingly relevant, as the modern economy requires highly skilled professionals, researchers, and innovators. Concurrently, there is an apparent shortage of such individuals on the market. Therefore, various approaches must be analysed, and their most successful applications identified, to utilise such individuals better and to reveal research directions for solutions that better answer the market’s demands. The complexity of the modern economy and the increased volume of available data generate the need for semiautomatic methods of screening and selecting the reviewers, experts, and professionals who are best-suited to particular requirements, as well as methods of promoting innovations. It must be underlined that there is a rapidly growing market for solutions that offer knowledge recommendation tools. Providing appropriate solutions that are able to cope with vast data requires machine intelligence algorithms. This book discusses selected issues of knowledge recommendation, such as expertise retrieval, reviewer and expert selection, supporting innovativeness, and information sharing.
1.2 What is Knowledge Recommendation?

A recommendation system is an application capable of presenting a user with suggestions of objects, obtained on the basis of the user’s profile, previous preferences, and the tastes of communities whose preferences and opinions are similar to the user’s.
Recommendation engines offer new items (products or content) to users based on various data [2]. A taxonomy of recommendation systems considers information sources, methods of data processing, filtering algorithms, and other elements [1]. Information sources may be explicit, like users’ ratings, or implicit, like users’ profiles or historical data that reflects their behaviour. This data is utilised by model-based methods, which use previously constructed models, or by memory-based methods, which primarily use metrics that measure the distance between selected features of items. The most crucial part of any recommendation system is the filtering algorithm, which may be one of the following types [1, 2]:

1. Collaborative filtering algorithm: In this approach, users are represented by N-dimensional vectors of items, and the algorithm searches for users whose rating patterns are similar to the target user’s. It then uses the ratings of those like-minded users to make recommendations for the target user.
2. Demographic filtering algorithm: This is based on the idea that people who demonstrate similar biological or cultural features (age, sex, country, etc.) may have something in common.
3. Content-based filtering algorithm: In this approach, recommendations are based on data provided by users in the past. More specifically, content that comes from objects somehow related to the users is analysed in terms of a chosen similarity measure. For instance, a system may recommend objects similar to a user’s profile, where an algorithm constructs a search query to identify that user’s favourite items (or experts) in the same categories and/or with similar keywords.
4. Hybrid filtering algorithm: This combines other methods, mainly collaborative filtering with demographic filtering, or collaborative filtering with content-based filtering.

This book is about knowledge recommendation, which can be considered a type of content-based approach.
It can be distinguished from typical recommendation in that the subject of a recommendation is not a commodity, but knowledge. Although knowledge can be recommended by typical, well-known algorithms, it requires unique techniques for its acquisition, suitable representation, and presentation to the user. Knowledge recommendation is illustrated by two examples in comparison to typical content-based recommendation (Fig. 1.1). The first is the recommendation of the reviewers or experts who are best-suited to evaluating a particular problem. This incorporates an adequate representation of knowledge and expertise. The second is supporting innovativeness by matching business and academia, as well as information sharing.
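As a rough, self-contained illustration of the content-based approach, keyword profiles of candidate experts can be compared with a query using cosine similarity. This is a hypothetical sketch with made-up profile data, not the algorithm of the systems described later in this book:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_profiles(query_keywords, profiles):
    """Return profile names sorted by decreasing similarity to the query."""
    q = Counter(query_keywords)
    scored = [(cosine(q, Counter(kw)), name) for name, kw in profiles.items()]
    return [name for score, name in sorted(scored, reverse=True)]

# Hypothetical expertise profiles expressed as bags of keywords
profiles = {
    "expert_a": ["neural", "networks", "classification", "networks"],
    "expert_b": ["innovation", "policy", "funding"],
}
print(rank_profiles(["neural", "classification"], profiles))
# → ['expert_a', 'expert_b']
```

The same idea generalises from commodity items to knowledge artefacts: only the construction of the profiles (here, a simple keyword bag) changes, which is precisely where knowledge recommendation departs from typical content-based recommendation.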
1.3 The Road Map of this Book

This book embarks readers on a journey into the world of knowledge recommendation (Fig. 1.2). First, they will learn about the current state of knowledge in the field. Then, they will have an opportunity to familiarise themselves with the designs of a
Fig. 1.1 The difference between typical content-based recommendation of items and knowledge recommendation in the context of recommending reviewers or experts, and supporting innovativeness by matching business and academia
Fig. 1.2 The road map of this book
reviewer and expert recommendation system, an innovation support and information sharing system, and selected information processing algorithms. This knowledge is covered in Chaps. 3, 4 and 5. Chapter 6 discusses the practical implementations of selected knowledge recommendation systems. All deliberations are summarised in Chap. 7.

Chapter 1. Introduction
This chapter briefly describes the content of this book and outlines the benefits available to its readers. It also describes works on knowledge recommendation, narrowed down to the selection of reviewers and experts, innovation, and information sharing.

Chapter 2. Literature Review
This chapter outlines the current state of knowledge on IT systems that support the selection of reviewers and experts, as well as systems that support innovation and information sharing. It also discusses the development of machine learning algorithms used in extracting data in search of specific information, such as the evaluation of enterprise innovativeness, the classification of publications, and author disambiguation.

Chapter 3. Recommending Reviewers and Experts
This chapter depicts a framework for recommending reviewers and experts designed to evaluate research proposals or articles. The recommendation framework is based on a well-rounded methodology. It explores the concepts of data, information, and knowledge, and the relations between them, to support the formation of suitable recommendations. More specifically, the framework includes a data acquisition module, which collects data concerning researchers or professionals, i.e., potential reviewers or experts. Then, an information retrieval module transforms the acquired data into information that comprises people’s profiles covering their whole expertise. The module utilises various machine learning methods for the classification of publications, author disambiguation, keyword extraction, and full-text indexing. Finally, a recommendation module generates a ranking of potential reviewers or experts based on a combination of cosine similarity between keywords and similarity based on a full-text index. The chapter includes not only a comprehensive algorithmic framework for reviewer and expert recommendation, but also experimental case studies that illustrate the functioning of the system. Readers may benefit both from a comprehensive discussion of the recommendation of individuals’ expertise and from experimental verification of theoretical assumptions, exemplified by case studies.

Chapter 4. Supporting Innovativeness and Information Sharing
This chapter is devoted to the issues of supporting innovativeness through recommendation and proper information sharing. It explores open innovation in the context of cooperation between business and academia.
The most typical information categories shared between these parties are innovations, projects, experts, partners, and conferences. The application of recommendation mechanisms to deliver accurate and proper information to the correct receiver may boost cooperation and, consequently, economic competitiveness. In the recommendation mechanism proposed in this chapter, information is served as fast recommendations, full recommendations, or search responses. These ideas form a complete recommendation framework for innovativeness, which: (i) automatically acquires data from the internet, and extracts information on innovative companies to attract more participants; (ii) matches businesses and scientists to enhance cooperation; (iii) recommends innovations, projects, and events that are best-suited to the criteria set by users.

Chapter 5. Selected Algorithmic Developments
This chapter is dedicated to selected machine learning algorithms implemented in the knowledge recommendation systems discussed in Chaps. 3 and 4. They are algorithms used primarily to construct profiles that describe reviewers and experts: data acquisition, classification of publications, author disambiguation, and keyword extraction. An algorithm used for evaluating enterprises’ degrees of innovativeness
based on their websites is also presented. In addition to the details of that algorithm, this chapter also presents the results of experiments conducted to evaluate the performance of the algorithms.

Chapter 6. Knowledge Recommendation in Practice
This chapter presents the implementation of two knowledge recommendation systems. Firstly, a recommendation system is described, which recommends relevant reviewers or experts to evaluate grant proposals and manuscripts [9, 14]. The system’s architecture is modular from the functional perspective and hierarchical from the technical perspective. Each essential part of the system is treated as a separate module, while each layer supports a particular functionality of the system. The system comprises several modules that are responsible for the transformation of data into information and knowledge. The modularity of the architecture facilitates its maintainability. The system is intended to work autonomously, without any manual adjustment. It is available for free on the internet (http://sssr.opi.org.pl). The next example is an information system, Inventorum [11, 12]. It recommends innovations, projects, experts, partners, and conferences to its users based on their profiles. The information is served in three ways: (i) fast recommendations, (ii) full recommendations, and (iii) search responses. The platform is available on the internet for free (http://inventorum.opi.org.pl/en).

Chapter 7. Conclusions
The final chapter summarises all considerations and offers suggestions for further action regarding knowledge recommendation. It must be noted that the content of this book is based primarily on the author’s previous publications [3–22]. However, the information included in those works has been reinterpreted and enriched by the author’s updated perspective.
References

1. Bobadilla J, Ortega F, Hernando A, Gutiérrez A (2013) Recommender systems survey. Knowl Based Syst 46:109–132
2. Rubén GC, Oscar SM, Juan MCL, Cristina Pelayo García-Bustelo B, José ELG, Patricia Ordoñez DP (2011) Recommendation system based on user interaction data applied to intelligent electronic books. Comput Human Behav 27(4):1445–1449
3. Kozłowski M, Protasiewicz J (2014) Automatic extraction of keywords from Polish abstracts. In: 4th young linguists’ meeting in Poznań, Book of Abstracts, pp 56–57
4. Michajłowicz M, Niemczyk M, Protasiewicz J, Mroczkowska K (2018) Pol-on: The information system of science and higher education in Poland. In: EUNIS 2018 congress book of proceedings, pp 1–3
5. Mirończuk M, Perełkiewicz M, Protasiewicz J (2017) Detection of the innovative logotypes on the web pages. In: International conference on artificial intelligence and soft computing. Springer, pp 104–115
6. Mirończuk M, Protasiewicz J (2015) A diversified classification committee for recognition of innovative internet domains. In: Beyond databases, architectures and structures. Advanced technologies for data mining and knowledge discovery. Springer, pp 368–383
7. Mirończuk M, Protasiewicz J (2020) Recognising innovative companies by using a diversified stacked generalisation method for website classification. Appl Intell 50(1):42–60
8. Podwysocki E, Błaszczyk Ł, Niemczyk M, Protasiewicz J, Michajłowicz M, Rosiak S, Kucharska I (2019) Distributed services and a warehouse as an ecosystem on science and higher education. In: EUNIS 2019 congress, pp 139–142
9. Protasiewicz J, Artysiewicz J, Dadas S, Gałęzewska M, Kozłowski M, Kopacz A, Stanisławek T (2012) Procedury recenzowania i doboru recenzentów. Tom 2, vol 2. OPI PIB
10. Protasiewicz J (2014) A support system for selection of reviewers. In: 2014 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 3062–3065
11. Protasiewicz J (2017) Inventorum–a recommendation system connecting business and academia. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 1920–1925
12. Protasiewicz J (2017) Inventorum: A platform for open innovation. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 10–15
13. Protasiewicz J, Dadas S (2016) A hybrid knowledge-based framework for author name disambiguation. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 000594–000600
14. Protasiewicz J, Dadas S, Gałęzewska M, Kłodziński P, Kopacz A, Kotynia M, Langa M, Młodożeniec M, Oborzyński A, Stanisławek T, Stańczyk A, Wieczorek A (2012) Procedury recenzowania i doboru recenzentów. Tom 1, vol 1. OPI PIB
15. Protasiewicz J, Michajłowicz M (2016) A brief overview of the information system for science and higher education in Poland. In: EUNIS 2016 congress
16. Protasiewicz J, Mirończuk M, Dadas S (2017) Categorization of multilingual scientific documents by a compound classification system. In: International conference on artificial intelligence and soft computing. Springer, pp 563–573
17. Protasiewicz J, Pedrycz W, Kozłowski M, Dadas S, Stanisławek T, Kopacz A, Gałęzewska M (2016) A recommender system of reviewers and experts in reviewing problems. Knowl Based Syst 106:164–178
18. Protasiewicz J, Podwysocki E, Ostrowska S, Tomczyńska A (2021) Integrated access to data about science and higher education in the context of general data protection regulation. In: EUNIS 2021. A new era of digital transformation: challenges for higher education
19. Protasiewicz J, Podwysocki E, Ostrowska S, Tomczyńska A (2021) Open access to data about higher education and science. Case study of the RAD-on platform in Poland. In: EUNIS 2021. A new era of digital transformation: challenges for higher education
20. Protasiewicz J, Rosiak S, Błaszczyk Ł, Niemczyk M, Michajłowicz M, Kucharska I, Podwysocki E (2019) RAD-on: An integrated system of services for science; online elections for the council of scientific excellence in Poland. In: EUNIS 2019 congress, pp 157–160
21. Protasiewicz J, Stanisławek T, Dadas S (2015) Multilingual and hierarchical classification of large datasets of scientific publications. In: 2015 IEEE international conference on systems, man, and cybernetics. IEEE, pp 1670–1675
22. Protasiewicz J, Stefańczuk M, Sadłowski A (2017) The national repository of theses: A short Polish case study. In: EUNIS 23rd annual congress book of proceedings
Chapter 2
Literature Review
This chapter reviews the literature on knowledge recommendation. It emphasises reviewer and expert recommendation, innovation support, and selected information extraction algorithms that are used to create individual profiles. The review builds on previous works of the author [62–66, 74, 107]; the information included in those works has been reinterpreted and supplemented with data found in the most recent publications. The key objective of the chapter is to provide insights on knowledge recommendation that help in constructing a recommender system of reviewers and experts and a system that supports innovativeness, as well as in selecting algorithms that transform data into information and then into knowledge, which is then used in information systems. The insights come from a quantitative assessment of the subject of knowledge recommendation and a thoughtful qualitative analysis of selected issues. The chapter is organised as follows. Sect. 2.1 presents a quantitative analysis of broadly understood knowledge recommendation. Sect. 2.2 presents a detailed qualitative analysis of approaches to supporting the selection of reviewers and experts. Sect. 2.3 focuses on supporting innovation, including an analysis of various IT tools and open data. Sect. 2.4 discusses information extraction algorithms: specifically, the classification of publications and innovative enterprises, and the disambiguation of authors of publications. Sect. 2.5 draws a series of conclusions.
2.1 A Quantitative Analysis of Knowledge Recommendation

Many studies focus on assigning individuals to specific tasks. Systematic searches of leading publication databases, such as Scopus (https://www.scopus.com), the IEEE Xplore Digital Library (https://ieeexplore.ieee.org), Web of Science (https://www.webofscience.com), arXiv (https://arxiv.org), and the Association for Computing Machinery Digital Library (https://www.acm.org), covering a period of more than twenty years (January 2000–July 2022), have enabled the author to identify 326 scientific publications on knowledge recommendation. Given that a detailed analysis of all of those publications lies beyond the purview of this book, only a quantitative analysis is presented here, while a qualitative analysis is performed on selected issues.

Knowledge recommendation comprises three main aspects: assignment, recommendation, and finding (Fig. 2.1). Assignment involves discovering the optimal assignment of a finite set of individuals (e.g., reviewers, experts, or employees) to a finite set of tasks (e.g., a manuscript or project review, the realisation of a task, or the filling of a job vacancy). In the case of assignment, we know how many individuals must be assigned to a specified number of tasks, and how many individuals and tasks are contained in each set. This task is deterministic and is typically reduced to an optimisation problem [100, 129, 130]. Recommendation involves proposing the most relevant individuals from a finite set to perform a specific task. Typically, individuals and tasks are described using text, and a recommendation is a ranking of individuals sorted by the degree to which their profiles match the task descriptions. Usually, there is one task and a certain number of individuals recommended for it; that number is determined by a recommendation algorithm operating on a finite set of individuals [6, 99, 134]. Finding involves searching a dataset (typically a forum or a social network) for individuals (typically experts) who possess the best knowledge on a particular subject or who are best suited to answering specific questions. Individuals are usually presented in the form of a ranking of their expertise in particular areas expressed as topics or questions. In this case, the searchable set is not uniquely determined: the number of potential experts included in a dataset for a specific problem is unknown [1, 56, 112, 144]. The reverse is also possible: searching for scholarly publications that are best matched to particular scholars [4, 5, 12].

Fig. 2.1 The key aspects of knowledge recommendation, based on 326 scholarly publications (January 2000–July 2022). The percentage shares and the numbers of publications on specific subjects are provided

In the literature, the recommendation aspect is the most popular: as many as half of the research publications focus on this issue, while assignment and knowledge finding are each the subject of approximately 25% of the publications (Fig. 2.1). The interest of scholars in these aspects over the years is illustrated in Fig. 2.2. A significant number of studies on knowledge recommendation were observed for the first time only in 2008. During the next decade, the interest of scholars in the subject remained stable; since then, however, scholars have published prolifically and the number of publications has grown by approximately 20% each year. It can be inferred that the growing interest in knowledge recommendation has coincided with the increased volume of data and with the development of natural language processing techniques.

Fig. 2.2 The key aspects of knowledge recommendation over the years, based on 326 scholarly publications (January 2000–July 2022). The numbers of publications on specific subjects in given years are provided

Compelling conclusions can also be drawn from an analysis of how the results of knowledge recommendation studies are applied (Fig. 2.3). Approximately one-third of the applications pertain to the review problem, including the selection of reviewers to review manuscripts, reviewers or experts to evaluate project proposals, and experts to evaluate software code. It is noteworthy that software code reviewing has been researched intensely in recent years. Approximately one-third of studies on knowledge recommendation are used to search for expertise, or for individual experts who specialise in particular areas or who can answer specific questions. Fewer than 20% of scholarly works focus on assigning staff (experts, professionals) to job vacancies, projects, or tasks. Fewer than one-fifth pertain to the recommendation of publications to scientists.

Fig. 2.3 An analysis of the applications of knowledge recommendation research, based on 326 scholarly publications (January 2000–July 2022). The percentage shares and the numbers of publications on specific subjects are provided
2.2 Support for the Selection of Reviewers and Experts

The assignment of individuals (reviewers or experts) to tasks (the evaluation of articles, projects, or other works) can be considered a generalised assignment problem, a recommendation problem, or an expert finding problem. Particular attention should be paid to solutions that rely on optimisation algorithms, heuristics, artificial intelligence and machine learning, and decision support systems.
Optimisation approach

Classic optimisation algorithms can be applied to the selection of reviewers only when the problem is defined in such a way that finding the right people amounts to finding the minimum of an objective function. When many measures of matching people to the problem are applied simultaneously, a multicriteria optimisation problem arises. It may be solved by a regularisation-based optimisation algorithm, in which the regularisation enables more or less emphasis to be placed on particular features [137]. Some research suggests that the particle swarm optimisation algorithm is well suited to this problem [140]. Moreover, the binary multiobjective particle swarm optimisation algorithm may be combined with a genetic algorithm to achieve higher-quality solutions to multiobjective optimisation problems [78]. Since multicriteria optimisation is computationally complex, it may be simplified by conversion into a mixed integer programming model and by the use of a two-phase stochastic greedy algorithm [44]. In the case of simple problems, linear programming is sufficient for proposing rankings of reviewers [34]. Some combinations of linear programming are also suitable, including those of simplex and heuristic algorithms [31], or of polynomial-time algorithms that discover the maximum flow through a network when an assignment problem has been reduced to a network flow problem [49]. It is possible to prove that an incremental maximum flow procedure is near-optimally fair [120].

The heuristic approach and artificial intelligence

The selection of reviewers and experts is characterised by the fact that not all of the variables and constraints are unambiguously defined; less strict algorithms are more suitable for such weakly defined problems. Heuristic algorithms and artificial-intelligence-based methods respond well to this requirement and effectively propose optimal combinations of people and tasks.
These include pure heuristic rules [20, 58], a metaheuristic greedy solution [105], tabu search [127], genetic algorithms [78, 101, 127], a greedy randomised adaptive search procedure combined with genetic algorithms [142], ant colony algorithms [117, 127], evolutionary algorithms [27, 72], heuristic knowledge rules combined with genetic algorithms [152], and fuzzy sets [38]. In the last decade, a sharp increase in the application of deep learning methods has been observed in various areas, and reviewer and expert recommendation follows this trend. The examples in the literature are too numerous for this chapter to list in full. The most crucial applications of deep learning methods include long short-term memory [40, 53, 141] and its simplified version, gated recurrent units [147, 148], convolutional neural networks [37, 40, 67, 141], bidirectional encoder representations from transformers [40], attention mechanisms [43, 123, 147, 148], and deep autoencoders [102]. It must be noted that some research aims to achieve the opposite task to that discussed above, i.e., the recommendation of scientific papers to researchers [67, 94, 148].
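In its simplest deterministic form, the assignment problem discussed above reduces to minimising a total mismatch cost over all possible pairings of people and tasks. The sketch below is a toy Python illustration of that formulation, with an invented cost matrix; it is not any of the cited methods, and practical systems replace the exhaustive search with linear programming or network flow algorithms, since enumeration is infeasible for realistic set sizes.

```python
from itertools import permutations

def assign_reviewers(cost):
    """Exhaustively find the reviewer-to-manuscript assignment that
    minimises the total mismatch cost.

    cost[i][j] is the mismatch between reviewer i and manuscript j
    (lower means a better fit). Returns (best_total, mapping), where
    mapping[j] is the index of the reviewer assigned to manuscript j.
    """
    n_reviewers, n_tasks = len(cost), len(cost[0])
    best_total, best_map = float("inf"), None
    # Try every ordered choice of n_tasks reviewers out of n_reviewers.
    for perm in permutations(range(n_reviewers), n_tasks):
        total = sum(cost[r][t] for t, r in enumerate(perm))
        if total < best_total:
            best_total, best_map = total, list(perm)
    return best_total, best_map

# Three candidate reviewers, two manuscripts; the costs are illustrative.
cost = [
    [2, 9],
    [8, 1],
    [5, 6],
]
total, mapping = assign_reviewers(cost)
print(total, mapping)  # prints: 3 [0, 1]
```

The exhaustive loop makes the optimisation view explicit: the objective function is the sum of per-pair costs, and the decision variables are the pairings themselves.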
Semantic data and the probabilistic approach

A substantial number of semantic data models exist that are more useful than raw data in deriving knowledge about people and the relations between them. Coauthor networks, linked data, and the world wide web provide useful information about scientists, experts, and professionals. Coauthor networks and linked data are applied to assignment problems as follows: an expert semantic finder is proposed in order to search for experts and to discover an expert collaboration network using the experts' publications from the DBLP computer science bibliography [39]; a collaboration network may be constructed based on coauthorship and a computer science taxonomy [17]; and a coauthor network built from the references included in publications may help in the retrieval of potential reviewers [92]. Moreover, web data may substantially enrich assignment methods. This is apparent in the following examples: investigation of the interests of reviewers using their homepages [13]; retrieval of information about the connections between authors [10]; utilisation of data from heterogeneous sources and the combination of keyword searches with concept searches [81]; the proposal of a relational and evolutionary graph model that uses relational and textual data to describe and discover experts [135]; and the discussion of the ExpertFinder framework and its enrichment using vocabularies from the world wide web [16]. Deep learning methods are widely utilised in probabilistic models that assist in modelling individuals' expertise. Various embedding models are useful for this purpose [7, 80, 138, 149]. Such models may identify expertise terms and the relations between them, as well as the relations between experts [79, 138]. Collaborative filtering algorithms are used widely in product recommendation systems; unsurprisingly, such models also prove useful in expert and reviewer recommendation [118, 138].
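Many of the approaches above share one underlying idea: represent reviewers and tasks as term vectors and rank reviewers by how closely their profiles match a task description. The following sketch illustrates this with plain TF-IDF and cosine similarity; the reviewer profiles and manuscript text are invented for demonstration, and real systems use far richer representations, such as the embedding models cited above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in tokenised]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical reviewer profiles built from keywords of their past work.
profiles = {
    "reviewer_a": "neural networks deep learning text classification",
    "reviewer_b": "genetic algorithms swarm optimisation scheduling",
    "reviewer_c": "text mining classification of documents",
}
manuscript = "deep learning for text classification"

vecs = tfidf_vectors(list(profiles.values()) + [manuscript])
query = vecs[-1]
ranking = sorted(
    zip(profiles, vecs[:-1]),
    key=lambda nv: cosine(query, nv[1]),
    reverse=True,
)
print([name for name, _ in ranking])  # reviewer_a ranks first
```

The output is exactly the ranking described in Sect. 2.1: a sorted list of individuals whose profiles best match the task description.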
Other well-known probabilistic methods are also utilised for solving this problem, such as simple regression models [36]; singular value decomposition models [102]; latent Dirichlet allocation models [143]; clustering algorithms based on Gaussian–Gamma mixture models [124]; combinations of document similarity, hierarchical clustering, and keyword extraction [51]; algorithms based on the Pearson correlation coefficient [45, 102]; hierarchical information models (word–sentence–document) [147]; and measures of semantic similarity based on the normalised discounted cumulative gain [122].

Decision support mechanisms

Decision support mechanisms are useful tools in reviewer and expert selection, as evidenced by the following studies: [54, 55, 106, 108, 142, 145, 152]. A decision support system may cover organisational issues alongside decision support [109]. Such a tool should also address reviewers' quality; for example, a three-tier reviewer assessment system based on reviewers' quality and reliability helps manage the pools of reviewers who cooperate with particular journals [50]. A decision support system may also be useful in assigning the most appropriate experts to the most relevant R&D projects; in such an application, reviewers may be selected independently for each project proposal [142]. Grouping proposals before assigning reviewers to the groups enhances this approach [145]. A blockchain system has also been developed that performs an open peer-review process based on smart contracts [26].

Deployments

Many studies contain only theoretical models, which, although developed in accordance with the state of the art, seem to be tested insufficiently, for example by being verified only on data from a single scientific conference. Despite this remark, the following practical studies are worthy of attention: an empirical study in the semiconductor industry [24], and a fully automatic system (which was tested only on data from a single conference) [92]. Regardless of the algorithm being used and the type of data being processed, it has been found that the most mature and applicable solutions are those based on the concept of decision support systems. A series of studies [79, 109, 142, 145, 152] have proposed a complete methodology and systems based on that conception. The systems are designed to support the grant-awarding processes of the National Natural Science Foundation of China. The authors noted that applied methodologies that have been tested in real cases can reduce potential political interference and eliminate subjective mistakes, as well as streamline and standardise the process of reviewer assignment. Open systems also exist, including a repository-centric peer-review model [110], which comprises a central repository for storing publications using the Open Archives Initiative protocol and methods for selecting reviewers; an open system for the purposes of the International Conference on Knowledge Discovery and Data Mining [47]; VeTo-web, an open-source, publicly available tool for searching for academic experts [22]; ExpFinder, an open-source framework [41]; and ReviewerNet, an interactive visualisation system [114].
2.3 Support for Innovation

In addition to legal, organisational, and financial solutions, innovation requires reliable IT tools. A brief analysis of such tools and of the open information problem, based on the literature, is presented below.

Dedicated IT systems

The literature presents various approaches to supporting innovation using IT systems [3]. The first proposes a decision support system that relies on fuzzy rules and genetic algorithms to show entrepreneurs how to be innovative [73]. The system works in the following manner: (i) a manager completes a form that describes an enterprise; (ii) the system generates a benchmark that compares the enterprise with other enterprises; and (iii) suggestions are proposed to make the enterprise more innovative. The second approach is a web-based collaboration system whose objective is to boost the collaboration and competitiveness of
small- and medium-sized Korean enterprises by reducing the production costs and delivery times of new products [76]. Larger systems operate internationally, such as NineSigma (https://www.ninesigma.com), an organisation that operates in the United States, Europe, and the Asia–Pacific region [90], and InnoCentive (https://www.innocentive.com) [121, 128]. Other systems implement the concept of open innovation and rely on algorithms that use linked data [32]. In these cases, problems are solved by searching for reliable solutions or for competent experts who are capable of tackling the issues. It is assumed that the best innovations are delivered by individuals who can adapt and extend existing knowledge to other areas.

Social media

To enhance the creativity of their staff, some entrepreneurs have replaced dedicated IT systems with social media platforms [15, 91]. An internal social network can deliver positive results if new ideas presented on the platform meet the required standards, senior staff members are prepared to contribute to it, and employers are aware that employees need additional time during their working hours to use it [61]. Enterprise social media platforms are limited to enterprise employees. Entities that seek external sources of innovation can combine social media platforms for internal employees with dedicated internet platforms for prospective external collaborators. Such platforms are used to publish problems that are expected to be solved online by external collaborators, for example by means of competitions for the best solutions. Internal social media is reserved for the evaluation of proposed solutions and for problems that require contextual knowledge of a particular enterprise that is available to its employees [46]. Some types of software, such as semantic learning-based innovation frameworks, implement the concept of social media for innovation; this helps small- and medium-sized enterprises to select and develop ideas, and to diffuse innovations on the market [18, 75]. It is noteworthy that the implementation of social media platforms to support collaboration and innovation in enterprises necessitates appropriate digital skills among the employees of such enterprises. Some members of staff may feel less comfortable with such tools than so-called 'social digital natives' (the youngest generation of workers) and should receive appropriate training [71].

Open data

IT systems ensure that open data plays a key role in the development of innovation. It is believed that open data access contributes to the establishment of new enterprises and innovations [42, 48]. Big data helps new markets to be discovered for products, human needs, and business knowledge; concerns exist, however, over conclusions that are drawn from inaccurate or unreliable data, and over the use of private data. It must be remembered that much data may prove useless if it is not preprocessed and if appropriate information extraction methods are not applied [19, 60].
2.4 Selected Algorithms

The classification of publications and the disambiguation of authors are used to build the profiles of prospective reviewers and experts. The algorithms that are used in these tasks and that are referenced in the literature are outlined below.

Text classification

The classification of scientific publications or innovative firms relies on multilingual text classification, since the documents in these problems are usually written in various languages. The two most prominent approaches to the classification of multilingual resources are cross-lingual and multilingual classification. Cross-lingual classification involves training one classifier on a primary language, and using the same classifier for the classification of text in other languages translated into the primary language by machine translation algorithms. Its simplest example is bilingual classification [35, 97]; more complex cases, however, consider many languages [8, 136]. Multilingual classification involves designing a single algorithm to process many languages: one model may be applied to many languages simultaneously [25, 30], separate models may be trained for each language [11, 84], or combinations of the above may be utilised. A wide range of algorithms can be applied to the classification or grouping of text documents. Statistical machine learning algorithms, including deep learning techniques, are the most distinctive methods. Since a remarkable number of studies related to text classification exist, it is more expedient to rely on literature reviews. The classical algorithms, such as support vector machines, k-nearest neighbours, decision trees, naive Bayes, regression, k-means, and fuzzy c-means with some variations, remain in wide and successful use [69, 111, 126]. Currently, deep learning methods, such as convolutional neural networks, deep belief networks, recurrent neural networks, attention mechanisms, and transformers [111, 133, 153], are more widely used.
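As a minimal illustration of one of the classical algorithms named above, the sketch below implements a multinomial naive Bayes text classifier with add-one smoothing; the toy corpus of publication titles and its labels are invented for demonstration only.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        # Log-priors from class frequencies in the training labels.
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict(self, text):
        def log_prob(c):
            # Add-one smoothing: every vocabulary term gets one extra count.
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.counts[c][t] + 1) / total)
                for t in text.lower().split()
            )
        return max(self.classes, key=log_prob)

# Toy training set: publication titles labelled with a discipline.
texts = [
    "deep neural networks for image recognition",
    "convolutional networks in computer vision",
    "gene expression in plant cells",
    "protein folding and cell biology",
]
labels = ["cs", "cs", "bio", "bio"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("recurrent neural networks"))  # prints: cs
```

The same bag-of-words formulation underlies many of the classical publication classifiers surveyed above; production systems differ mainly in their feature engineering and in the scale of their training data.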
Many approaches incorporate fusion classification models to achieve higher classification quality [86]. Besides the selection of a suitable algorithm, feature selection in datasets is also crucial [103]. It must be emphasised that some studies concentrate directly on publication classification [33, 70, 107] or on the detection of innovative themes on the internet [14, 83, 85, 87–89], both of which fall within the scope of this monograph.

Author name disambiguation methods

The fast-paced development of electronic libraries of scholarly publications makes author name disambiguation challenging. Information on the authors of publications can be stored in various manners, which renders it difficult to establish their true identities. Manual disambiguation is impracticable due to the massive volumes of data stored in digital libraries [98]. Author name disambiguation falls into the category of entity recognition [59]. There are two cases of author ambiguity:
(i) synonyms, where an individual publishes under different names; and (ii) polysemies, where the same name appearing in documents refers to different individuals [9]. The proper identification of authors may prove particularly difficult in the case of incomplete publication metadata or multiauthored and interdisciplinary articles [98]. Periodically published literature reviews [9, 57, 98, 116, 150] demonstrate high levels of interest in author name disambiguation methods. Overall, these reviews classify the algorithms into two approaches. The first is author grouping (partitional, hierarchical, density-based, or spectral clustering) using a similarity function (predefined, trained, or graph-based). The second is author assignment, realised by classification or clustering. More specifically, the algorithms may be: (i) unsupervised, e.g., hierarchical agglomerative clustering, k-means, or spectral clustering; (ii) supervised, e.g., support vector machines, random forests, or naive Bayes; or (iii) weakly supervised. The unsupervised algorithms use similarity functions to measure the distance between clusters (e.g., Jaccard, Jaro–Winkler, or Levenshtein), in which attributes may be represented by term frequency–inverse document frequency, latent semantic indexing, or latent Dirichlet allocation [98]. Recent studies have brought the application of deep learning methods to author name disambiguation, such as embeddings [29, 115], graph convolutional networks [23, 104], and attentive recurrent neural networks [113]. The development of various heuristic frameworks, whose quality is sufficient and which are less computationally demanding than deep learning methods, has also accelerated in recent years [2, 132].

Hierarchical agglomerative clustering for author name disambiguation

Recent studies indicate interest in hierarchical clustering methods for author name disambiguation [113, 146].
Such algorithms are unsupervised and flexible, so they may utilise various similarity metrics and attributes. Besides the application of pure hierarchical clustering, the algorithm itself and the similarity methods have been improved. Hierarchical clustering algorithms are often redesigned into multiphase processes; the most common is a two-phase approach. Two different algorithms may be applied [139]: first, a batch hierarchical clustering algorithm is executed; then, an incremental process is executed, which (i) adds a new publication to a cluster, (ii) creates a new cluster, or (iii) merges clusters. In [95], the first phase involves using precise rules to produce high-quality data; in the second phase, a logistic regression classifier trained on that data is used to predict the similarities between objects. Each phase may utilise different measures of similarity [28]: very strict metrics generate clusters that, ideally, represent only one person; another measure is then applied to merge the clusters. The work of [68] introduces three consecutive phases: (i) preclustering using a hierarchical clustering algorithm; (ii) locating personal web pages using a search engine and a selected model; and (iii) reclustering the publications that were not found on the internet. Four consecutive phases of clustering appear in [125]; however, each step uses the same similarity function, and they differ only in the attributes used in their comparisons: e-mail, affiliation, coauthors, and venue. The work of [96] proposes a hierarchical clustering algorithm with the analysis of a social network to prevent the potential errors of a textual similarity function.
Similarity measures and attributes play crucial roles in hierarchical clustering algorithms. Attributes may be weighted by numerical values or functions selected according to expert knowledge and the literature [82]. The importance of features may be uncovered by some methods: for example, the Dempster–Shafer theory of evidence enables numerous features to be fused and their usefulness to be tested [52], and a Huber loss function can measure pairwise similarity in hierarchical clustering [131]. Pairwise similarity may rely not only on previously selected attributes, but also on relational information, i.e., a shared neighbourhood of similar clusters in greedy agglomerative clustering [59]. The last problem is the stop criterion of the clustering. Typically, the process stops when the similarity between clusters reaches a fixed degree. The stop conditions may also be adaptive; for example, [77] proposes criteria related to individual author names.
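The agglomerative idea described in this section can be illustrated with a toy sketch: publication records that share an ambiguous author name are represented solely by their coauthor sets, clusters are merged greedily under average-linkage Jaccard similarity, and the process stops at a fixed similarity threshold. The records, names, and threshold below are invented; the cited systems combine many more attributes and similarity measures.

```python
def jaccard(a, b):
    """Jaccard similarity of two coauthor sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_records(records, threshold=0.2):
    """Greedy average-linkage agglomerative clustering of publication
    records sharing an ambiguous author name. Each record is a set of
    coauthor surnames; clusters are lists of record indices. Merging
    stops when no pair of clusters is more similar than the threshold.
    """
    clusters = [[i] for i in range(len(records))]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(
                    jaccard(records[a], records[b])
                    for a in clusters[i] for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:  # adaptive or fixed stop criterion reached
            return clusters
        i, j = pair
        clusters[i] += clusters.pop(j)

# Publications by two distinct "J. Smith"s, described by coauthor sets.
records = [
    {"kowalski", "nowak"},       # 0
    {"kowalski", "wisniewski"},  # 1
    {"chen", "garcia"},          # 2
    {"chen", "mueller"},         # 3
]
print(cluster_records(records))  # prints: [[0, 1], [2, 3]]
```

Each resulting cluster is intended to represent one real person; swapping in a different similarity function (e.g., Jaro–Winkler over affiliations) or an adaptive threshold changes only the two marked lines.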
2.5 Summary

This chapter has provided a literature review on knowledge recommendation for the period of January 2000–July 2022. The abundance of studies indicates the importance and timeliness of the subject. It must be underlined that other reviews exist in the literature [4–6, 12, 21, 56, 93, 99, 112, 119, 129, 130, 134, 144, 151], which offer notable remarks on this subject. Among the many issues of knowledge recommendation discussed in these studies, the following seem to be the most crucial:

1. Algorithms. Ensembles of various methods usually produce better results than any single independent model applied to the task of expert finding. No universal model or method applies to all cases of knowledge recommendation; each case has its own requirements and unique data, which determine the selection of suitable algorithms. Despite the popularity of machine learning, deep learning, and graph-based algorithms, statistical and heuristic approaches continue to play essential roles. Since textual data typically describes individuals' expertise, and modern deep learning algorithms model semantic relations in textual data better than traditional text information retrieval approaches, these algorithms may be well suited to knowledge recommendation.

2. Data. Expert recommendation systems typically utilise only textual data, while multimedia sources, such as images, video, and audio, may offer much information about individuals' expertise; thus, the integration of multimedia data may improve these systems' performance. The cold start problem and the provision of ground truth for algorithm validation remain challenging issues in knowledge recommendation. Business applications may also be challenging due to problems with data integration, privacy, up-to-dateness, and completeness.

3. Applications. Expert finding and recommendation systems are helpful in academia and in a wide range of business domains.
Recommendations should consider not only similarity measures, but also the reputation and authority of the proposed individuals. After the recommendation, practical issues, such as changes in experts' interests, their availability, and their willingness to work, must be resolved.

Compared to these review studies, the originality of this chapter is threefold. First, it includes not only selected aspects, but all aspects of knowledge recommendation: knowledge (people) finding, recommendation, and assignment to specified tasks. Second, the review covers the first twenty-two years of this century, which has enabled a broad analysis of the development of, trends in, and changes to knowledge recommendation. Third, and most importantly, the conclusions drawn in this review confirm previous ones and deliver new insights. They are as follows:

1. Given the plethora of scholarly publications and the extensive scope of their subject matter, only a quantitative analysis of the publications on knowledge recommendation has been conducted for this chapter. Three knowledge recommendation currents were observed. The assignment problem involves discovering the optimal assignment of a set of individuals to a set of tasks. The recommendation problem involves recommending the most relevant individuals to conduct a task. The finding problem involves searching data (usually forums or social networks) for individuals who possess expertise in a specific area.

2. A qualitative analysis was conducted of the assignment of individuals (reviewers and experts) to tasks (the evaluation of articles, projects, and other works), which can be performed as an assignment, a recommendation, or a finding problem. Studies pertaining to the publication classification and author disambiguation methods used to build the profiles of potential reviewers and experts were analysed, as were the IT tools that support innovation.
It was observed that the algorithms utilised in such problems ranged from heuristic solutions, through probabilistic models, statistical machine learning models, and artificial intelligence algorithms, to modern deep learning methods. Although deep learning models typically offer the highest accuracy, they involve considerable computational complexity. That is why simple heuristic or statistical algorithms often prove more useful in practice. 3. Although knowledge recommendation is a subject of theoretical study, it also has a range of practical applications. The significant number of scholarly publications on knowledge recommendation demonstrates the importance of the concept. The growing volume of publications in recent years also suggests better access to algorithmic tools and datasets, which are vital for the proposed algorithms and models (as they are verified experimentally). Given the increasing complexity of the knowledge-based economy, it is expected that knowledge recommendation methods will continue developing. Identifying the most suitable reviewers, experts, and professionals is key for both science and business. The development of these methods is underpinned by the increased availability and quality of datasets.
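To make the assignment problem concrete, the sketch below illustrates it as a score-maximisation task over a small, entirely hypothetical reviewer-suitability matrix. It uses brute-force enumeration, which is feasible only for tiny instances; production systems would instead use a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations

# Hypothetical suitability scores: suitability[i][j] is how well
# reviewer i matches proposal j (all values are illustrative).
suitability = [
    [0.9, 0.1, 0.4],
    [0.3, 0.8, 0.2],
    [0.5, 0.6, 0.7],
]

def best_assignment(scores):
    """Exhaustively find the one-to-one assignment of reviewers to
    proposals that maximises the total suitability score. O(n!), so
    feasible only for small n; the Hungarian algorithm solves the
    same problem in O(n^3)."""
    n = len(scores)
    best, best_total = None, float("-inf")
    for perm in permutations(range(n)):  # perm[i] = proposal for reviewer i
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best, best_total = perm, total
    return best, best_total

assignment, total = best_assignment(suitability)
print(assignment, round(total, 2))  # -> (0, 1, 2) 2.4
```

The same matrix could encode keyword overlap, citation links, or any other expertise signal; only the scoring function changes, not the assignment machinery.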
References

1. Almuhanna AA, Yafooz WM (2021) Expert finding in scholarly data: An overview. In: 2021 IEEE international IOT, electronics and mechatronics conference (IEMTRONICS), pp 1–7
2. Veloso A, Ferreira AA, Goncalves MA, Laender AH, Meira Jr W (2012) Cost-effective on-demand associative author name disambiguation. Inf Process Manag 48(4):680–697
3. Mostafavi A, Abraham DM, DeLaurentis D, Sinfield J (2011) Exploring the dimensions of systems of innovation analysis: A system of systems framework. IEEE Syst J 5(2):256–265
4. Ali Z, Qi G, Kefalas P, Abro WA, Ali B (2020) A graph-based taxonomy of citation recommendation models. Artif Intell Rev 53(7):5217–5260
5. Ali Z, Ullah I, Khan A, Ullah Jan A, Muhammad K (2021) An overview and evaluation of citation recommendation models. Scientometrics 126(5):4083–4119
6. Ali Z, Kefalas P, Muhammad K, Ali B, Imran M (2020) Deep learning in citation recommendation models survey. Expert Syst Appl 162
7. Ali Z, Qi G, Kefalas P, Khusro S, Khan I, Muhammad K (2022) SPR-SMN: scientific paper recommendation employing SPECTER with memory network. Scientometrics 127(11):6763–6785
8. Amini M-R, Goutte C (2010) A co-classification approach to learning from multilingual corpora. Mach Learn 79(1–2):105–121
9. Ferreira AA, Gonçalves MA, Laender AH (2012) A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Rec 41(2):15–26
10. Ryabokon A, Polleres A, Friedrich G, Falkner AA, Haselböck A, Schreiner H (2012) (Re)configuration using web data: A case study on the reviewer assignment problem. In: International conference on web reasoning and rule systems. Springer, pp 258–261
11. Mountassir A, Benbrahim H, Berrada I (2012) An empirical study to address the problem of unbalanced data sets in sentiment classification. In: 2012 IEEE international conference on systems, man, and cybernetics (SMC), pp 3298–3303
12. Bai X, Wang M, Lee I, Yang Z, Kong X, Xia F (2019) Scientific paper recommendation: A survey. IEEE Access 7:9324–9339
13. Basu C, Hirsh H, Cohen WW (2001) Technical paper recommendation: A study in combining multiple information sources. J Artif Intell Res 14:231–252
14. Benaim M (2018) From symbolic values to symbolic innovation: Internet-memes and innovation. Res Policy 47(5):901–910
15. Bhimani H, Mention A-L, Barlatier P-J (2019) Social media and innovation: A systematic literature review and future research directions. Technol Forecast Soc Change 144:251–269
16. Aleman-Meza B, Bojārs U, Boley H, Breslin JG, Mochol M, Nixon LJ, Zhdanova AV (2007) Combining RDF vocabularies for expert finding. In: Proceedings of the 4th European semantic web conference (ESWC2007), number 4519 in Lecture notes in computer science. Springer, pp 235–250
17. Aleman-Meza B, Hakimpour F, Arpinar IB, Sheth AP (2007) SwetoDblp ontology of computer science publications. Web Semant: Sci Serv Agents World Wide Web 5(3):151–155
18. Bogers M, Chesbrough H, Moedas C (2018) Open innovation: Research, practices, and policies. Calif Manag Rev 60(2):5–16
19. Bresciani S, Ciampi F, Meli F, Ferraris A (2021) Using big data for co-innovation processes: Mapping the field of data-driven innovation, proposing theoretical developments and providing a research agenda. Int J Inf Manag 60:102347
20. Cagliero L, Garza P, Pasini A, Baralis E (2021) Additional reviewer assignment by means of weighted association rules. IEEE Trans Emerg Top Comput 9(1):329–341
21. Çetin HA, Doğan E, Tüzün E (2021) A review of code reviewer recommendation studies: Challenges and future directions. Sci Comput Program 208
22. Chatzopoulos S, Vergoulis T, Dalamagas T, Tryfonopoulos C (2021) VeTo-web: A recommendation tool for the expansion of sets of scholars. In: Proceedings of the ACM/IEEE joint conference on digital libraries 2021, pp 334–335
23. Chen Y, Yuan H, Liu T, Ding N (2021) Name disambiguation based on graph convolutional network. Sci Program 2021
24. Chien CF, Chen LF (2008) Data mining to improve personnel selection and enhance human capital: A case study in high-technology industry. Expert Syst Appl 34(1):280–290
25. Wei CP, Yang CC, Lin CM (2008) A latent semantic indexing-based approach to multilingual document clustering. Decis Support Syst 45(3):606–620
26. Choi J, Foster-Pegg B, Hensel J, Schaer O (2021) Using graph algorithms for skills gap analysis. In: IEEE systems and information engineering design symposium, SIEDS 2021
27. Chouchen M, Ouni A, Mkaouer MW, Kula RG, Inoue K (2021) WhoReview: A multi-objective search-based approach for code reviewers recommendation in modern code review. Appl Soft Comput 100
28. Schulz C, Mazloumian A, Petersen AM, Penner O, Helbing D (2014) Exploiting citation networks for large-scale author name disambiguation. EPJ Data Sci 3(1):1–14
29. Chuanming Y, Yunci Z, Aochen L, Lu A (2020) Author name disambiguation with network embedding. Data Anal Knowl Discov 4(2–3):48–59
30. Lee CH, Yang HC (2009) Construction of supervised and unsupervised learning systems for multilingual text categorization. Expert Syst Appl 36(2, Part 1):2400–2410
31. Cook WD, Golany B, Kress M, Penn M, Raviv T (2005) Optimal allocation of proposals to reviewers to facilitate effective ranking. Manag Sci 51(4):655–661
32. Damljanovic D, Stankovic M, Laublet P (2012) Linked data-based concept recommendation: Comparison of different methods in open innovation scenario. In: Extended semantic web conference. Springer, pp 24–38
33. Danilov GV, Zhukov VV, Kulikov AS, Makashova ES, Mitin NA, Orlov YuN (2020) Comparative analysis of statistical methods of scientific publications classification in medicine. Comput Res Model 12(4):921–933
34. Hartvigsen D, Wei JC, Czuchlewski R (1999) The conference paper-reviewer assignment problem. Decis Sci 30(3):865–876
35. Pinto D, Civera J, Barrón-Cedeño A, Juan A, Rosso P (2009) A statistical approach to cross-lingual natural language tasks. J Algorithms 64(1):51–60
36. Dehghan M, Abin AA, Neshati M (2020) An improvement in the quality of expert finding in community question answering networks. Decis Support Syst 139
37. Dehghan M, Rahmani HA, Abin AA, Vu V-V (2020) Mining shape of expertise: A novel approach based on convolutional neural network. Inf Process Manag 57(4)
38. Tayal DK, Saxena PC, Sharma A, Khanna G, Gupta S (2014) New method for solving reviewer assignment problem using type-2 fuzzy sets and fuzzy functions. Appl Intell 40(1):54–73
39. Mishra D, Singh SK (2011) Taxonomy-based discovery of experts and collaboration networks. VSRD Int J Comput Sci Inf Technol I(10):698–710
40. Duan Z, Tan S, Zhao S, Wang Q, Chen J, Zhang Y (2019) Reviewer assignment based on sentence pair modeling. Neurocomputing 366:97–108
41. Du H, Kang YB (2021) An open-source framework for ExpFinder integrating n-gram vector space model and co-HITS. Softw Impacts 8
42. Lakomaa E, Kallberg J (2013) Open data as a foundation for innovation: The enabling effect of free public sector information for entrepreneurs. IEEE Access 1:558–563
43. Fallahnejad Z, Beigy H (2022) Attention-based skill translation models for expert finding. Expert Syst Appl 193
44. Wang F, Zhou S, Shi N (2013) Group-to-group reviewer assignment problem. Comput Oper Res 40(5):1351–1362
45. Feng W, Zhu Q, Zhuang J, Yu S (2019) An expert recommendation algorithm based on Pearson correlation coefficient and FP-growth. Cluster Comput 22:7401–7412
46. Schweitzer FM, Buchinger W, Gassmann O, Obrist M (2012) Crowdsourcing: Leveraging innovation through online idea competitions. Res Technol Manag 55(3):32–38
47. Flach PA, Spiegler S, Golénia B, Price S, Guiver J, Herbrich R, Zaki MJ (2010) Novel tools to streamline the conference review process: Experiences from SIGKDD'09. SIGKDD Explor Newsl 11(2):63–67
48. Huber F, Wainwright T, Rentocchini F (2020) Open data for open innovation: Managing absorptive capacity in SMEs. R&D Manag 50(1):31–46
49. Goldsmith J, Sloan RH (2007) The AI conference paper assignment problem. In: Proceedings of the AAAI workshop on preference handling for artificial intelligence, Vancouver, pp 53–57
50. Green SM, Callaham ML (2011) Implementation of a journal peer reviewer stratification system based on quality and reliability. Ann Emerg Med 57(2):149–152.e4
51. Gündoğan E, Kaya M (2022) A novel hybrid paper recommendation system using deep learning. Scientometrics 127(7):3837–3855
52. Wu H, Li B, Pei Y, He J (2014) Unsupervised author disambiguation using Dempster–Shafer theory. Scientometrics 101(3):1955–1972
53. He T, Guo C, Chu Y, Yang Y, Wang Y (2020) Dynamic user modeling for expert recommendation in community question answering. J Intell Fuzzy Syst 39(5):7281–7292
54. Hoang DT, Nguyen NT, Hwang D (2019) Decision support system for assignment of conference papers to reviewers. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics), vol 11683 LNAI, pp 441–450
55. Hoang DT, Nguyen NT, Collins B, Hwang D (2021) Decision support system for solving reviewer assignment problem. Cybern Syst 52(5):379–397
56. Husain O, Salim N, Alias RA, Abdelsalam S, Hassan A (2019) Expert finding systems: A systematic review. Appl Sci (Switzerland) 9(20)
57. Hussain I, Asghar S (2017) A survey of author name disambiguation techniques: 2010–2016. Knowl Eng Rev 32
58. Immonen E, Putkonen A (2020) An heuristic algorithm for fair strategic personnel assignment in continuous operation. Int J Simul Process Model 15(5):410–424
59. Bhattacharya I, Getoor L (2007) Collective entity resolution in relational data. ACM Trans Knowl Discov Data 1(1):1–36
60. Tien JM (2015) An SMC perspective on big data: A disruptive innovation to embrace. IEEE Syst Man Cybern Mag 1(2):27–29
61. Recker J, Malsbender A, Kohlborn T (2016) Learning how to efficiently use enterprise social networks as innovation platforms. IT Prof 18(2):2–9
62. Protasiewicz J (2014) A support system for selection of reviewers. In: 2014 IEEE international conference on systems, man and cybernetics (SMC). IEEE, pp 3062–3065
63. Protasiewicz J (2017) Inventorum: A platform for open innovation. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 10–15
64. Protasiewicz J (2017) Inventorum—a recommendation system connecting business and academia. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 1920–1925
65. Protasiewicz J, Pedrycz W, Kozłowski M, Dadas S, Stanisławek T, Kopacz A, Gałężewska M (2016) A recommender system of reviewers and experts in reviewing problems. Knowl Based Syst 106:164–178
66. Protasiewicz J, Dadas S (2016) A hybrid knowledge-based framework for author name disambiguation. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 000594–000600
67. Jeong C, Jang S, Park E, Choi S (2020) A context-aware citation recommendation model with BERT and graph convolutional networks. Scientometrics 124(3):1907–1922
68. Zhu J, Yang Y, Xie Q, Wang L, Hassan SU (2014) Robust hybrid name disambiguation framework for large databases. Scientometrics 98(3):2255–2274
69. Jindal R, Malhotra R, Jain A (2015) Techniques for text classification: Literature review and current trends. Webology 12(2)
70. Jing C, Qiu L, Tian X, Hao T (2022) Publication classification prediction via citation attention fusion based on dynamic relations. Knowl Based Syst 239
71. Patroni J, Von Briel F, Recker J (2016) How enterprise social media can facilitate innovation. IT Prof 18(6):34–41
72. Merelo-Guervós JJ, Castillo-Valdivieso P (2004) Conference paper assignment using a combined greedy/evolutionary algorithm. In: International conference on parallel problem solving from nature. Springer, pp 602–611
73. Kilic K, Hamarat C (2010) A decision support system framework for innovation management. In: 2010 IEEE international conference on management of innovation and technology, pp 765–770
74. Kozlowski M, Protasiewicz J (2014) Automatic extraction of keywords from Polish abstracts. In: 4th Young linguists' meeting in Poznań, volume: book of abstracts, pp 56–57
75. Mirkovski K, Briel F, Lowry PB (2016) Social media use for open innovation initiatives: Proposing the semantic learning-based innovation framework (SLBIF). IT Prof 18(6):26–32
76. Ryu K, Shin J, Cho Y, Kim B, Choi H (2010) Web-based collaborative innovation systems for Korean small and medium sized manufacturers. In: 2010 IEEE international technology management conference (ICE). IEEE, pp 1–8
77. Cen L, Dragut EC, Si L, Ouzzani M (2013) Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 741–744
78. Li M, Li Y, Chen Y, Xu Y (2021) Batch recommendation of experts to questions in community-based question-answering with a sailfish optimizer. Expert Syst Appl 169
79. Liu X, Wang X, Zhu D (2022) Reviewer recommendation method for scientific research proposals: A case for NSFC. Scientometrics 127(6):3343–3366
80. Liu J, Deng A, Xie X, Xie Q (2022) ExpRec: Deep knowledge-awared question routing in software question answering community. Appl Intell 53(5):5681–5696
81. Liu P, Dew P (2004) Using semantic web technologies to improve expertise matching within academia. In: Proceedings of I-KNOW, Graz, Austria, pp 70–378
82. Bolikowski Ł, Dendek PJ (2011) Towards a flexible author name disambiguation framework. In: Towards a digital mathematics library. Masaryk Univ. Press, pp 27–37
83. Nakatsuji M, Yoshida M, Ishida T (2009) Detecting innovative topics based on user-interest ontology. J Web Semant 7(2):107–120
84. Suzuki M, Yamagishi N, Tsai YC, Hirasawa S (2008) Multilingual text categorization using character n-gram. In: IEEE conference on soft computing in industrial applications, pp 49–54
85. Nakatsuji M, Miyoshi Y, Otsuka Y (2006) Innovation detection based on user-interest ontology of blog community. In: International semantic web conference. Springer, pp 515–528
86. Mirończuk MM, Protasiewicz J (2018) A recent overview of the state-of-the-art elements of text classification. Expert Syst Appl 106:36–54
87. Mirończuk MM, Protasiewicz J (2020) Recognising innovative companies by using a diversified stacked generalisation method for website classification. Appl Intell 50(1):42–60
88. Mirończuk MM, Protasiewicz J (2015) A diversified classification committee for recognition of innovative internet domains. In: Beyond databases, architectures and structures. Advanced technologies for data mining and knowledge discovery. Springer, pp 368–383
89. Mirończuk MM, Perełkiewicz M, Protasiewicz J (2017) Detection of the innovative logotypes on the web pages. In: International conference on artificial intelligence and soft computing. Springer, pp 104–115
90. Piazza M, Mazzola E, Acur N, Perrone G (2019) Governance considerations for seeker-solver relationships: A knowledge-based perspective in crowdsourcing for innovation contests. Br J Manag 30(4):810–828
91. Muninger MI, Hammedi W, Mahr D (2019) The value of social media for innovation: A capability perspective. J Bus Res 95:116–127
92. Rodriguez MA, Bollen J (2008) An algorithm to determine peer-reviewers. In: Proceedings of the 17th ACM conference on information and knowledge management, CIKM '08. ACM, New York, NY, USA, pp 319–328
93. Ma S, Zhang C, Liu X (2020) A review of citation recommendation: From textual content to enriched context. Scientometrics
94. Mei X, Cai X, Xu S, Li W, Pan S, Yang L (2022) Mutually reinforced network embedding: An integrated approach to research paper recommendation. Expert Syst Appl 204
95. Levin M, Krawczyk S, Bethard S, Jurafsky D (2012) Citation-based bootstrapping for large-scale author disambiguation. J Am Soc Inf Sci Technol 63(5):1030–1047
96. Nadimi MH, Mosakhani M (2015) A more accurate clustering method by using co-author social networks for author name disambiguation. J Comput Secur 1(4):307–317
97. Montalvo S, Martinez R, Casillas A, Fresno V (2007) Multilingual news clustering: Feature translation vs. identification of cognate named entities. Pattern Recogn Lett 28(16):2305–2311
98. Smalheiser NR, Torvik VI (2009) Author name disambiguation. Annu Rev Inf Sci Technol 43(1):1–43
99. Nikzad-Khasmakhi N, Balafar MA, Feizi-Derakhshi MR (2019) The state-of-the-art in expert recommendation systems. Eng Appl Artif Intell 82:126–147
100. Patil AH, Mahalle PN (2019) Reviewer paper assignment problem—A brief review. River Publishers
101. Harper PR, de Senna V, Vieira IT, Shahani AK (2005) A genetic algorithm for the project assignment problem. Comput Oper Res 32(5):1255–1265
102. Zhang P, Xiong F, Leung H, Song W (2021) FunkR-pDAE: Personalized project recommendation using deep learning. IEEE Trans Emerg Top Comput 9(2):886–900
103. Pintas JT, Fernandes LA, Garcia ACB (2021) Feature selection methods for text classification: A systematic literature review. Artif Intell Rev 54(8):6149–6200
104. Pooja K, Mondal S, Chandra J (2022) Exploiting higher order multi-dimensional relationships with self-attention for author name disambiguation. ACM Trans Knowl Discov Data 16(5):1–23
105. Pradhan DK, Chakraborty J, Choudhary P, Nandi S (2020) An automated conflict of interest based greedy approach for conference paper assignment system. J Inf 14(2)
106. Pradhan T, Sahoo S, Singh U, Pal S (2021) A proactive decision support system for reviewer recommendation in academia. Expert Syst Appl 169
107. Protasiewicz J, Stanisławek T, Dadas S (2015) Multilingual and hierarchical classification of large datasets of scientific publications. In: 2015 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 1670–1675
108. Tian Q, Ma J, Liu O (2002) A hybrid knowledge and model system for R&D project selection. Expert Syst Appl 23(3):265–271
109. Tian Q, Ma J, Liang J, Kwok RC, Liu O (2005) An organizational decision support system for effective R&D project selection. Decis Support Syst 39(3):403–413
110. Rodriguez MA, Bollen J, Van de Sompel H (2006) The convergence of digital libraries and the peer-review process. J Inf Sci 32(2):149–159
111. Rogers D, Preece A, Innes M, Spasic I (2021) Real-time text classification of user-generated content on social media: Systematic review. IEEE Trans Comput Soc Syst 9(4):1154–1166
112. Roozbahani Z, Rezaeenour J, Emamgholizadeh H, Jalaly Bidgoly A (2020) A systematic survey on collaborator finding systems in scientific social networks. Knowl Inf Syst 62(10):3837–3879
113. Ruolin W, Zhendong N, Qika L, Yifan Z, Ping Q, Hao L, Donglei L (2021) Disambiguating author names with embedding heterogeneous information and attentive RNN clustering parameters. Data Anal Knowl Discov 5(8):13–24
114. Salinas M, Giorgi D, Ponchio F, Cignoni P (2020) ReviewerNet: A visualization platform for the selection of academic reviewers. Comput Graph (Pergamon) 89:77–87
115. Santini C, Gesese GA, Peroni S, Gangemi A, Sack H, Alam M (2022) A knowledge graph embeddings based approach for author name disambiguation using literals. Scientometrics 127(8):4887–4912
116. Sanyal DK, Bhowmick PK, Das PP (2021) A review of author name disambiguation techniques for the PubMed bibliographic database. J Inf Sci 47(2):227–254
117. Sharifian M, Abdolvand N, Harandi SR (2021) Context-based expert finding in online communities using ant colony algorithm. J Inf Syst Telecommun 8(30):130–139
118. Shen M, Wang J, Liu O, Wang H (2020) Expert detection and recommendation model with user-generated tags in collaborative tagging systems. J Database Manag 31(4):24–45
119. Lin S, Hong W, Wang D, Li T (2017) A survey on expert finding techniques. J Intell Inf Syst 49(2):255–279
120. Stelmakh I, Shah N, Singh A (2021) PeerReview4All: Fair and accurate reviewer assignment in peer review. J Mach Learn Res 22(1):7393–7458
121. Xinbo S, Mingchao Z, Weixin L, Mengqin H (2019) Research on the synergistic incentive mechanism of scientific research crowdsourcing network: Case study of InnoCentive. Manag Rev 31(5):277
122. Tan S, Duan Z, Zhao S, Chen J, Zhang Y (2021) Improved reviewer assignment based on both word and semantic features. Inf Retr J 24(3):175–204
123. Tang W, Lu T, Li D, Gu H, Gu N (2020) Hierarchical attentional factorization machines for expert recommendation in community question answering. IEEE Access 8:35331–35343
124. Tang W, Lu T, Gu H, Zhang P, Gu N (2020) Domain problem-solving expert identification in community question answering. Expert Syst 37(5)
125. Arif T, Ali R, Asger M (2015) A multistage hierarchical method for author name disambiguation. Int J Inf Process 9(3):92–105
126. Thangaraj M, Sivakami M (2018) Text classification techniques: A literature review. Interdiscip J Inf Knowl Manag 13:117–135
127. Kolasa T, Król D (2011) A survey of algorithms for paper-reviewer assignment problem. IETE Tech Rev 28(2):123–134
128. Vignieri V (2021) Crowdsourcing as a mode of open innovation: Exploring drivers of success of a multisided platform through system dynamics modelling. Syst Res Behav Sci 38(1):108–124
129. Wang F, Shi N, Chen B (2010) A comprehensive survey of the reviewer assignment problem. Int J Inf Technol Decis Mak 9(4):645–668
130. Wang F, Chen B, Miao Z (2008) A survey on reviewer assignment problem. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics), vol 5027 LNAI, pp 718–727
131. Liu W, Islamaj Doğan R, Kim S, Comeau DC, Kim W, Yeganova L, Lu Z, Wilbur WJ (2014) Author name disambiguation for PubMed. J Assoc Inf Sci Technol 65(4):765–781
132. Waqas H, Qadir MA (2021) Multilayer heuristics based clustering framework (MHCF) for author name disambiguation. Scientometrics 126(9):7637–7678
133. Wu H, Liu Y, Wang J (2020) Review of text classification methods on deep learning. Comput Mater Contin 63(3):1309–1321
134. Wang X, Huang C, Yao L, Benatallah B, Dong M (2018) A survey on expert recommendation in community question answering. J Comput Sci Technol 33(4):625–653
135. Song X, Tseng BL, Lin CY, Sun MT (2005) ExpertiseNet: Relational and evolutionary expert modeling. In: Liliana A, Paul B, Antonija M (eds) User modeling 2005, vol 3538. Lecture notes in computer science. Springer, Berlin, pp 99–108
136. Hu X, Zhang X, Lu C, Park EK, Zhou X (2009) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 389–396
137. Xu Y, Zhou D, Ma J (2019) Scholar-friend recommendation in online academic communities: An approach based on heterogeneous network. Decis Support Syst 119:1–13
138. Xuefeng JIA, Cunbin LI, Ying Z (2022) An expert recommendation model to electric projects based on KG2E and collaborative filtering. Expert Syst Appl 198
139. Qian Y, Zheng Q, Sakai T, Ye J, Liu J (2015) Dynamic author name disambiguation for growing digital libraries. Inf Retr J 18(5):379–412
140. Yang C, Liu T, Yi W, Chen X, Niu B (2020) Identifying expertise through semantic modeling: A modified BBPSO algorithm for the reviewer assignment problem. Appl Soft Comput 94
141. Ye X, Zheng Y, Aljedaani W, Mkaouer MW (2021) Recommending pull request reviewers based on code changes. Soft Comput 25(7):5619–5632
142. Sun YH, Ma J, Fan ZP, Wang J (2008) A hybrid knowledge and model approach for reviewer assignment. Expert Syst Appl 34(2):817–824
143. Youneng P, Xiuli N (2020) Recommending online medical experts with Labeled-LDA model. Data Anal Knowl Discov 4(4):34–43
144. Yuan S, Zhang Y, Tang J, Hall W, Cabotà JB (2020) Expert finding in community question answering: A review. Artif Intell Rev 53(2):843–874
145. Xu Y, Ma J, Sun Y, Hao G, Xu W, Zhao D (2010) A decision support approach for assigning reviewers to proposals. Expert Syst Appl 37(10):6948–6956
146. Zhang S, Xinhua E, Pan T (2019) A multi-level author name disambiguation algorithm. IEEE Access 7:104250–104257
147. Zhang D, Zhao S, Duan Z, Chen J, Zhang Y, Tang J (2020) A multi-label classification method using a hierarchical and transparent representation for paper-reviewer recommendation. ACM Trans Inf Syst 38(1):1–20
148. Zhao X, Kang H, Feng T, Meng C, Nie Z (2020) A hybrid model based on LFM and Bi-GRU toward research paper recommendation. IEEE Access 8:188628–188640
149. Zhao Y, Anand A, Sharma G (2022) Reviewer recommendations using document vector embeddings and a publisher database: Implementation and evaluation. IEEE Access 10:21798–21811
150. Zhe S, Yi W, Yifan Y, Ying C (2020) Author name disambiguation techniques for academic literature: A review. Data Anal Knowl Discov 4(8):15–27
151. Yang Z, Liu Q, Sun B, Zhao X (2019) Expert recommendation in community question answering: A review and future direction. Int J Crowd Sci 3(3):348–372
152. Fan ZP, Chen Y, Ma J, Zhu Y (2009) Decision support for proposal grouping: A hybrid approach using knowledge rule and genetic algorithm. Expert Syst Appl 36(2, Part 1):1004–1013
153. Zulqarnain M, Ghazali R, Hassim YMM, Rehan M (2020) A comparative review on deep learning models for text classification. Indones J Electr Eng Comput Sci 19(1):325–335
Chapter 3
Recommending Reviewers and Experts
This chapter focuses on the recommendation of reviewers and experts for the evaluation of scholarly articles and research and development projects. Its main objective is to propose a recommendation system of reviewers and experts, specifically its architecture and recommendation algorithm, by presenting case studies concerning the architecture and functions of each of its modules. More specifically, this chapter attempts to address why reviewing is necessary and what might disturb it. The discussion explains the necessity of, and the assumptions involved in, the construction of a recommendation system of reviewers and experts. The system comprises three modules: data acquisition, which involves obtaining data on potential reviewers and experts; information extraction, which involves creating expert profiles of such individuals; and recommendation, which involves transforming information into knowledge, i.e., creating a ranking of the individuals that are best suited for particular reviews and evaluations. Next, the chapter discusses three variants of a contextual recommendation algorithm: the first based on keywords and the cosine measure, the second on a full-text index and the cosine measure, and the third on a combination of the two. The algorithm was evaluated experimentally on a simple hypothetical case study. Selected aspects of the algorithm were implemented and presented at the National Centre for Research and Development: a Polish governmental agency that awards grants for research and development projects. Subsequent sections of this chapter explain why reviewing is necessary and what might disturb it (Sect. 3.1); focus on the architecture of the system and an outline of the algorithms that form its individual modules (Sect. 3.2); outline the details of the recommendation algorithm (Sect. 3.3) and its experimental validation (Sect. 3.4); and discuss the relevant literature (Sect. 3.5). The chapter excludes information extraction algorithms, which are discussed in Chap. 5.
It must be underlined that the content of this chapter is mainly based on this author's former publications [18–21]. However, the information included in those works has been reinterpreted and enriched by the author's new perspective. Moreover, the presented system [21] is an improved version of an earlier reviewer selection support system [18].
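To illustrate the cosine-based contextual recommendation idea outlined above, the sketch below ranks candidate reviewers against a proposal using cosine similarity over sparse keyword-frequency vectors. It corresponds roughly to the keyword-based variant; the profiles, keywords, and frequencies are entirely hypothetical, and the full-text and combined variants would differ only in how the vectors are built.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse keyword-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical expert profiles (keyword -> frequency) and a proposal to review.
profiles = {
    "expert_a": Counter({"neural": 5, "recommendation": 3, "text": 2}),
    "expert_b": Counter({"fuzzy": 4, "assignment": 2}),
}
proposal = Counter({"recommendation": 2, "neural": 1})

# Rank candidate reviewers by similarity to the proposal's keyword vector.
ranking = sorted(profiles, key=lambda e: cosine(profiles[e], proposal), reverse=True)
print(ranking)  # -> ['expert_a', 'expert_b']
```

Because the vectors are sparse, only shared keywords contribute to the dot product; an expert with no keyword overlap receives a similarity of zero and falls to the bottom of the ranking.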
3.1 Reviewing Problems

This section discusses why the reviewing process is necessary and what might disturb it. This leads to the necessity of, and the assumptions involved in, the construction of the recommendation system of reviewers and experts.
3.1.1 The Purpose of Reviewing

Generally, reviewing is associated with everyday products, services, films, or books. It involves consumers, enthusiasts, and, to a lesser extent, professional critics. In the world of science, research, and development, however, reviewing pertains to the evaluation of scholarly articles, project proposals, and project results; it is performed by reviewers and experts in their fields. Manuscript reviews ensure the quality of scholarly articles, which enables scientific journals to build their brands and to be recognised as sources of up-to-date and reliable knowledge. Readers of such journals need not be concerned about unreliable research. The primary purpose of reviews is, on the one hand, to ensure the reliability and quality of newly-published research results and, on the other, to ensure freedom of discussion. Publications influence scientific development, and the implementation of relevant research and development projects is crucial for the economic development of nations, as well as of international and local enterprises. It should be emphasised, however, that innovative activity entails a degree of risk, as projects may not yield their expected returns on investment. For this reason, such projects are usually financed by public entities, foundations, investors, and even private firms that seek new technologies. Typically, sponsors' funding is insufficient to meet the expectations of those who undertake such projects. In consequence, funders are forced to sift through projects by considering their quality and scientific objectives, as well as their economic and social impact. The best evaluations of projects according to specific criteria incorporate reviews that are prepared by both scientists and practitioners. Following the same line of reasoning, it becomes apparent that evaluation pertains not only to prospective projects, but also to the effects of those that have already been implemented.
Reviewers and experts assess whether a project has achieved its goals and yielded the expected results. Enterprises occasionally utilise experts to assess the legitimacy of strategies, investment plans, and completed internal projects. The need to produce high-quality manuscripts, project proposals that fulfil specific criteria, and project deliverables means that it is crucial to source reviewers or experts who are capable of providing credible reviews and opinions. The search for such individuals can be supported by IT systems.
3.1.2 Reviewing Methods

The first mention of the reviewing of a scholarly work dates back to the ninth century, when Isḥāq bin Ali al-Rohawi advised doctors in his book, Ethics of the Physician, to assess treatment methods with a view to increasing their effectiveness. In modern times, the first definition of reviewing can be attributed to the seventeenth century. The principles of reviewing were defined by Henry Oldenburg for the Philosophical Transactions of the Royal Society journal; a group of experts in a given field would assess manuscripts and decide whether to publish them [24]. Oldenburg was a pioneer of reviews in scientific journals; one whose ideas have developed in tandem with modern science. Presently, three basic types of reviewing are used: (i) the single-blind review, (ii) the double-blind review, and (iii) the open peer review. In the single-blind method, the reviewers know the identities of the authors, but the authors do not know the identities of the reviewers. In the double-blind method, neither the reviewers nor the authors know the other's identities. In both cases, the results of a review are known only to the reviewers, authors, and editors. Open peer review is the latest approach to reviewing, in which both the reviewers' and the authors' identities are known to each other, and the results of reviews are accessible to the public. It is assumed that reviews should be effective, productive, honest, and socially complemented—regardless of their method [1].
3.1.3 Disruptions to the Reviewing Process Although the quality of reviews, evaluations, and recommendations depends primarily on evaluators’ competences, key roles are also played by their independence and (lack of) conflicts of interest. Experts and reviewers who evaluate the works of others may have insufficient knowledge or experience to perform such tasks, and their cognitive perspectives may be distorted. These limitations can lead to misinterpretations of authors’ true intentions and result in the rejection of pertinent research or project drafts. Reviewing is not only a technical process and is not based solely on knowledge and experience; it is a complex psychological one that can be disrupted by various factors. Human decisions are fraught with cognitive distortions. These typically incorporate three foundational heuristics: availability, anchoring, and representativeness [27]. Table 3.1 briefly summarises the potential distortions that result from each. Thirteen types of cognitive distortion are identified, which may occur not only during the reviewing process, but also during the selection of reviewers, as well as in the making of final decisions on manuscripts or project proposals. Considering the number of
Table 3.1 A brief summary of the types of cognitive distortion, depending on where they occur: reviewer selection, the reviewing process, and final decisions concerning a project or article. Thirteen potential types have been identified, based on the three foundational heuristics of cognitive distortion: availability, anchoring, and representativeness. The summary is based on the following literature: [3, 4, 6, 8–10, 14, 22]

1. Conflict of interest
2. Tendency towards crude assessment
3. Discrimination against novel and controversial ideas
4. Thematic and ideological biases
5. Focus on frequently or recently appearing names
6. Focus on particular details only
7. Tendency to confirm established opinions
8. Influence of an author's affiliation or position
9. Gender effect
10. Reviewers suggested by authors
11. Groupthink syndrome and the congruence effect
12. Focus on the first information received and a contrast effect
13. Tendency to support positively verified hypotheses
potential distortions presented in Table 3.1, it can be concluded that reviewing is the action that is by far the most susceptible to potential cognitive distortions: all thirteen of them may occur during the evaluation of articles or projects by reviewers. Significantly fewer types of cognitive distortion occur during the making of final decisions on the publication of articles or on the funding of projects; five of the thirteen distortions were identified in such cases. During the selection of reviewers
or of experts tasked with reviewing, only two types of cognitive distortion may appear: conflict of interest and the influence of authors' affiliations or positions. Evidently, automating the selection of reviewers serves to minimise some cognitive distortion heuristics—not only in the selection of reviewers, but also in the processes of reviewing and in the making of final decisions. This assumes that the individuals who appoint reviewers or experts will be replaced by an impartial algorithm, and that editors or grant committees will receive information on the appointment of reviewers in the shortest possible time. Above all, however, the algorithmic selection of reviewers and experts allows large groups of candidates to be searched rapidly, enabling such systems to preselect larger groups of individuals than in the case of analyses conducted by humans. The quality of reviews largely depends on reviewers' knowledge and professionalism, which must be compatible with a given publication or project. Searching the most extensive possible dataset can impact the probability of success positively. Algorithmic selection also means that identical criteria are applied automatically to all candidates; as a consequence, all preferences that might guide humans—due to potential cognitive distortions rather than ill will—can be eliminated. In summary, better and faster selection of reviewers can enhance the quality of reviews and final decisions, which justifies the adoption of an algorithmic approach.
3.1.4 Why Automate the Selection of Reviewers? The above observations suggest the need for development of methods in the recommendation of reviewers and experts. This chapter advances the concept of a reviewer and expert recommendation system, and presents experimental studies that demonstrate the author’s experience, intuitions, and informal observations. Three problems must be addressed with regard to such a recommendation system. First, human knowledge can be represented not only as structured data (such as keywords), but also as unstructured descriptions of competences. Unstructured data requires natural language processing techniques to be utilised by recommendation algorithms. Second, individuals’ areas of expertise typically develop and alter over time. Third, the degree and scope of individuals’ declared knowledge may differ from their actual achievements. To properly address these challenges, an extensive base of reviewers, experts, and professionals must be created. Moreover, appropriate methods of describing their knowledge and experience must be developed. Reliable algorithms for selecting individuals tasked with reviewing material on the issues in which they specialise must also be provided. Willingness to meet these requirements is a prerequisite for the development of a reviewer and expert recommendation system. The literature presented in Chap. 2 demonstrates that the vast majority of existing reviewer selection systems require human assistance to operate properly. The main goal of this work, therefore, is to present the architecture of a contextual recommendation system of reviewers and experts alongside an extensive algorithmic
apparatus for extracting knowledge from data and representing it appropriately. This work proposes a system that will work autonomously, with little or no human support for algorithm tuning. The system will offer an environment that supports decision-making while not fully supplanting humans in the reviewer selection and reviewing processes.
3.1.5 Assumptions of the Recommendation System

In view of the assumptions above, the knowledge recommendation system is based on the concepts of data, information, and knowledge, and the relationships between them. In this methodology, the system first collects data on potential reviewers from various sources, such as public databases and the internet. Information is then extracted from the data, which comprises both structured and unstructured elements. The extracted information is used to construct profiles of scientists and professionals. Finally, the system recommends reviewers and experts, with consideration for their profiles and the problems that are to be reviewed or evaluated. Matching potential reviewers' profiles to problems in need of review is expressed using an appropriate measure of similarity. The system achieves this using the cosine measure. Similar solutions have been deployed in other reviewer selection systems, such as those proposed in [2, 5, 13, 23], and [12]. The system builds on these experiences and develops them. The selection of reviewers is the result of matching keywords (structured data) and a full-text index (unstructured data). In this way, the recommendations consider both the descriptive data of potential reviewers and the texts of the manuscripts or project proposals to be reviewed.
3.2 The IT Reviewer and Expert Recommendation System This section discusses the architecture of the proposed reviewer and expert recommendation system and its core business processes. The architecture is modular from a functional and from a process perspective, and hierarchical from a technical perspective. Each of the system’s key processes is considered as a separate module, while each of its layers is responsible for the technical implementation of its individual functions. The chief purposes of this section are to present and discuss the architecture, and to justify why a modular-hierarchical system structure has been proposed instead of a typical flat one containing a series of interoperating modules [15] or a layered information system [11, 16] that incorporates a decision support system [7, 17, 25, 26, 28].
Fig. 3.1 The main processes of the reviewer and expert recommendation system, which implements the concepts of data, information and knowledge: (i) an autonomous background process that collects data and transforms it into information; and (ii) a process executed at the request of the user, which creates knowledge based on the available information (rankings of reviewers and experts)
3.2.1 System Architecture and System Processes The reviewer and expert recommendation system is based on the concepts of data, information, and knowledge—all of which are interrelated. The autonomous recommendation system relies on the data–information-knowledge process, which facilitates an adaptive and self-expanding knowledge base. The system’s operation can be summarised as follows: the system executes a single main background process, while other processes are executed on demand; the main process collects data from various sources, transforms it into information, and compiles the profiles of candidate reviewers or experts (Fig. 3.1); the on-demand processes are responsible for the formation of knowledge, i.e., for the sourcing of suitable reviewers or experts. From a technical perspective, the system comprises three modules that process data, information, and knowledge. First, the data acquisition module obtains data on reviewers and experts from various public databases and the internet. The information retrieval module then discovers relevant information from relational and unstructured data, and constructs expert profiles of individuals. Finally, the recommendation module recommends the reviewers or experts who are best-suited to reviewing particular articles or evaluating particular projects. All source data, the information extracted from it, and the knowledge generated is stored in a local database and can be accessed via the user interface (Fig. 3.2). In practice, the system should be able to answer the following query: ‘Identify the reviewers or experts that are potentially best suited to evaluating a specific manuscript or project proposal’. A query can be submitted to the system either as a full or abbreviated text of a manuscript or project (a text query), or as keywords that best describe the manuscript or project (a formal query) (Fig. 3.2).
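The data–information–knowledge flow described above can be illustrated with a minimal sketch. The function boundaries, record fields, and sample data below are assumptions made for this example only; they are not the system's actual interfaces.

```python
# A minimal sketch of the data -> information -> knowledge pipeline:
# a background process collects data and turns it into profiles, and an
# on-demand process turns profiles plus a query into a ranking.

def acquire_data():
    """Data acquisition module: collect raw records from external sources.
    The records here are invented stand-ins for database dumps and crawls."""
    return [
        {"name": "Nawojka", "abstracts": ["solar energy and photovoltaics"]},
        {"name": "Carlos", "abstracts": ["data mining and machine learning"]},
    ]

def retrieve_information(raw_records):
    """Information retrieval module: turn raw data into expert profiles.
    Here a profile is simply the set of words found in a person's abstracts."""
    profiles = {}
    for record in raw_records:
        words = set()
        for abstract in record["abstracts"]:
            words.update(abstract.split())
        profiles[record["name"]] = words
    return profiles

def recommend(profiles, query_keywords):
    """Recommendation module: rank profiles by overlap with the query."""
    scores = {name: len(words & set(query_keywords))
              for name, words in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)

profiles = retrieve_information(acquire_data())
ranking = recommend(profiles, ["solar", "energy"])
print(ranking[0])  # the profile with the largest keyword overlap
```

The keyword-overlap score is only a placeholder here; the system itself uses the cosine measure and full-text indexing described later in this chapter.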
Fig. 3.2 Architecture of the reviewer and expert recommendation system. The system comprises three modules (one that acquires data from external sources, one that retrieves information from the data, and one that recommends reviewers and experts based on available information and system queries), a user interface, and a database. Queries to the system are transferred to the information retrieval module if the problem to be assessed is presented in a form that contains text (e.g., an article or a project description), and to the recommendation module if a problem is more clearly defined (e.g., by keywords)
3.2.2 Data Acquisition Module

The primary task of the data acquisition module is to harvest various data sources in order to construct a database that is as complete as possible and that stores information on the scientific and professional achievements of potential reviewers and experts. Three basic categories of data or information sources can be identified: (i) structured data, (ii) unstructured data, and (iii) information provided by users. The module is fitted with purely technical automatic or semiautomatic processes, such as extractors, importers, and web crawlers (Fig. 3.3). Most of the information comprises structured data (in the form of database dumps) or is provided by web services. This type of data is usually well described and structured regularly, which means that it can be retrieved and parsed via dedicated tools, such as extractors and importers. The second type of data comprises diverse sets of unstructured data, such as websites that contain information on publications or projects, and descriptions of individuals' achievements. The data of this category is acquired by crawlers; they must be equipped, however, with information extraction algorithms that serve to identify the expertise of specific individuals. The last source is information entered by users via the user interface. This might include descriptive profiles of users, their publications or projects, and keywords that summarise their experience. The information that is entered by interested users is considered to be the most reliable and complete, and, therefore, does not require further transformation (Fig. 3.3).

Fig. 3.3 The functional architecture of the data acquisition module. Crawlers acquire data from the internet, while extractors and importers do so from open databases. The data is stored in a database. The user interface allows humans (e.g., potential reviewers and experts) to enter and edit information directly into the database
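The importer path for structured data can be sketched as follows. The JSON-lines format, field names, and records are invented for illustration; real dumps and web-service schemas vary by source.

```python
import json

# Hypothetical structured dump: one JSON record per line, as a database
# export or web-service response might provide (field names are assumed).
dump = """\
{"author": "Ruth", "title": "Public health trends", "keywords": ["health care"]}
{"author": "Alan", "title": "Authentication protocols", "keywords": ["cryptography"]}"""

def import_records(lines):
    """Importer: parse well-described structured data into database rows."""
    rows = []
    for line in lines.splitlines():
        record = json.loads(line)
        rows.append((record["author"], record["title"], record["keywords"]))
    return rows

rows = import_records(dump)
print(len(rows))   # two records parsed
print(rows[0][0])  # author of the first record
```

Unstructured sources cannot be parsed this regularly, which is why crawled pages additionally pass through the information extraction algorithms mentioned above.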
3.2.3 Knowledge Retrieval Module

The knowledge retrieval module is responsible for transforming the data into information that describes potential reviewers and experts. The transformation is a complex multistage process that includes data preprocessing, data classification, data disambiguation, keyword extraction from data, and full-text data indexing. Two transformation paths can be traversed, depending on the structure of the relevant data: structured data (mainly publications) is subject to preprocessing, classification, disambiguation, and keyword extraction; unstructured data (documents that describe individuals' experience) consists fully of text that is indexed by a relevant algorithm (Fig. 3.4). A detailed description of the algorithms that the system employs is provided in Chap. 5. The knowledge retrieval module works first by preprocessing structured data (publications) according to the rules of natural language processing; this includes the removal of stopwords, lemmatisation, and transformation to a vector representation, e.g., TF×IDF (term frequency × inverse document frequency). Then, the keyword retrieval algorithm identifies the most relevant phrases in the data, which mainly comprise the titles and abstracts of publications and projects. Based on these keywords, profiles of the authors of those publications and projects can be created. The publications are classified by scientific discipline, which improves the quality of the disambiguation of their authors. Simultaneously, all unstructured data is indexed, which supplements individuals' profiles with the information derived from the descriptive data (Fig. 3.4).
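The preprocessing steps above (stopword removal followed by a TF×IDF representation) can be sketched in a few lines. The stopword list and corpus are toy examples, and lemmatisation is omitted for brevity.

```python
import math
from collections import Counter

# Toy stopword list and corpus; a production pipeline would use a full
# stopword inventory and a lemmatiser before vectorisation.
STOPWORDS = {"the", "of", "and", "in", "a"}

corpus = [
    "the analysis of solar energy in photovoltaics",
    "the analysis of biomass and biodiesel",
]

def tokenize(text):
    """Preprocessing: lowercase, split, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

docs = [tokenize(d) for d in corpus]

def tfidf(doc, docs):
    """TFxIDF: term frequency scaled by inverse document frequency."""
    tf = Counter(doc)
    n = len(docs)
    vector = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)
        vector[word] = (count / len(doc)) * math.log(n / df)
    return vector

v = tfidf(docs[0], docs)
print(v["analysis"])   # 0.0 -- the word appears in every document
print(v["solar"] > 0)  # True -- a discriminative term gets positive weight
```

The effect worth noting is that TF×IDF suppresses words common to the whole corpus, which is exactly what makes the remaining terms useful as profile keywords.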
Fig. 3.4 The functional architecture of the information retrieval module. The module comprises algorithms for data preprocessing, classification, disambiguation, keyword extraction, and fulltext indexing. These interoperate to transform both the structured and unstructured data into information—in this case, the profiles of potential reviewers and experts
3.2.4 Recommendation Module The recommendation module plays the most crucial role in the system—it suggests which reviewers or experts are best-suited to a specific problem, which usually involves reviewing an article or project. The module forms knowledge on the basis of the information at its disposal: individuals’ profiles that are stored in the system database and requests for articles or projects to be evaluated. The result of a recommendation can be saved in the database. Requests for recommendations may be submitted to the system in two ways: as formal queries and as textual queries. The form of the query determines how it is handled at later stages of the process (Fig. 3.5). A formal query comprises keywords that describe the issue to be reviewed. In this case, it is assumed that the user seeks reviewers or experts from a well-defined research area, and is capable of expressing that using keywords. Since the keywords do not require additional processing, they are sent directly to the recommendation algorithms, alongside all profiles of reviewers and experts that are stored in the database. As a result, the algorithms generate a ranking of individuals that best match the keywords provided (Fig. 3.6). Conversely, a text query incorporates the full text of an article or a project to be reviewed (or a summary thereof), for which reviewers are sought (Fig. 3.7). In such cases, the system must employ the information retrieval module to preprocess the text and to extract keywords, or use full-text indexing. This serves as a method of summarising the text. Following this operation, keywords or full-text indices can be submitted to the recommendation algorithms, which, as in the previous case, analyse available individuals’ profiles and recommend the most suitable reviewers or experts who might evaluate specific materials.
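The dispatch of the two query forms can be sketched as below. The summarisation function is a crude stand-in for the information retrieval module, and the profiles are invented for the example.

```python
# Sketch of query handling: formal queries (keyword lists) go straight
# to the recommender; textual queries are first summarised into keywords.

def summarise_text(text):
    """Stand-in for keyword extraction / full-text indexing of a text query."""
    return sorted(set(text.lower().split()))[:5]

def handle_query(query, profiles):
    """Route a query by its form, then rank profiles by keyword overlap."""
    if isinstance(query, list):   # formal query: ready-made keywords
        keywords = query
    else:                         # textual query: summarise first
        keywords = summarise_text(query)
    scores = {name: len(set(keywords) & kws) for name, kws in profiles.items()}
    return max(scores, key=scores.get)

profiles = {"Nawojka": {"solar", "biomass"}, "Ruth": {"cancer", "health"}}
print(handle_query(["solar"], profiles))                         # formal query
print(handle_query("public health and cancer care", profiles))  # textual query
```

The overlap count is again a placeholder; in the system, both routes end in the cosine-based recommendation algorithms described in Sect. 3.3.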
Fig. 3.5 The functional architecture of the recommendation module. Requests for recommendations may be submitted to the system in two ways: as formal queries and as textual queries. A formal query comprises keywords that describe the issue to be reviewed. They are sent directly to the recommendation algorithms. A text query incorporates the full text of an article or a project to be reviewed (or a summary thereof). The information retrieval module must extract keywords or use full-text indexing. The recommendation engine comprises two approaches: cosine similarity on keywords and full-text indexing
Fig. 3.6 A formal query to the system. The input information received by the recommendation algorithm includes keywords that describe the problem to be reviewed and profiles that describe individuals. Based on that, the algorithm generates a ranking of reviewers or experts who best match to the keywords
Fig. 3.7 A textual query to the system. The input information received by the recommendation algorithm includes the text to be reviewed and profiles that describe individuals. The text is summarised as keywords or as a full-text index. Based on this information, the algorithm generates a ranking of reviewers or experts that are best-matched to the input text
Regardless of the form of the query, the system constructs knowledge (a ranking of potential reviewers and experts) based on information (individuals' profiles) and the query (keywords or text), using the cosine similarity measure (Fig. 3.5). Detailed information on the recommendation algorithm is presented below.
3.3 Recommendation Algorithm This section describes the recommendation algorithm used in the expert and reviewer recommendation system, which constitutes its most integral part. The algorithms used to extract information from the data are presented in Chap. 5. The recommendation algorithm is a key element both of the recommendation module and of the wider system. The algorithm generates content-based recommendations elicited from the cosine measure of similarity between the items that are subject to analysis. Items are represented either by keywords or by full-text indices. Because the method of information representation substantively affects the method used to calculate the recommendations, three recommendation algorithms are presented: one based on keywords’ cosine similarity [18], one based on a full-text index [21], and a combination of the two [21].
3.3.1 Keywords' Cosine Similarity

Suppose that the task of knowledge recommendation is to find the reviewers or experts $l = 1, 2, \ldots, L$ who are best-suited to reviewing or evaluating a project proposal or a manuscript of an article which, from a technical perspective, is a text document $d$. Both the documents and the individuals are described using feature vectors. Specifically, document $d$ can be summarised as a vector of words

$$\mathbf{q}_d = [q_1, \ldots, q_i, \ldots, q_I]^T, \tag{3.1}$$

which simultaneously defines the query about experts or reviewers. A specific individual $l$ is described using a feature vector, which is a concatenation of keywords and their weights:

$$\mathbf{f}_l = \left[ k_1, w_1^1, w_1^2, w_1^3,\ k_2, w_2^1, w_2^2, w_2^3,\ \ldots,\ k_M, w_M^1, w_M^2, w_M^3 \right]^T. \tag{3.2}$$

The vector includes the keywords $k_m$, $m = 1, 2, \ldots, M$, and the weights $w_m^n$, $n = 1, 2, 3$. The words express the areas of knowledge and experience of a specific individual. The weights determine the degree of connection between a specific word $k_m$ and a specific individual $l$, i.e., their degree of knowledge and experience.

Similarity

The recommendation of reviewers and experts can be understood as a list of individuals sorted in descending order of the values of their similarity measures, which is the cosine measure expressed as follows:
$$\mathrm{c\_score}_l(\mathbf{q}_d, \mathbf{p}_l) = \frac{\mathbf{q}_d \cdot \mathbf{p}_l}{\|\mathbf{q}_d\| \, \|\mathbf{p}_l\|} = \frac{\sum_{i=1}^{I} q_i p_i}{\sqrt{\sum_{i=1}^{I} q_i^2} \, \sqrt{\sum_{i=1}^{I} p_i^2}} \tag{3.3}$$
The similarity measure is the scalar product of the word vector that describes the document to be evaluated, $\mathbf{q}_d$, and the one that describes an individual, $\mathbf{p}_l$, in relation to the product of the lengths of the vectors, expressed using the Euclidean norm. Both vectors contain numerical values and are of equal size. To each word $q_1, \ldots, q_i, \ldots, q_I$ of query $\mathbf{q}_d$, value 1 is assigned, which reflects the actual occurrence of the words in the query. Vector $\mathbf{p}_l$ is created based on the vector of the individual's features, $\mathbf{f}_l$, according to the relationship

$$\forall_{i=1,2,\ldots,I} \quad p_i = \begin{cases} \max\{w_m^1, w_m^2, w_m^3\} \cdot q_i & \text{if } q_i \in \mathbf{f}_l \wedge \exists m : q_i = k_m \\ 0 & \text{if } q_i \notin \mathbf{f}_l \end{cases} \tag{3.4}$$
which can be described in the following manner: one by one, for each item of the query $\mathbf{q}_d$, it is checked whether word $q_i$ appears in the individual's feature vector $\mathbf{f}_l$; if an element that meets the condition $q_i = k_m$ is discovered in vector $\mathbf{f}_l$, then the $i$-th element of vector $\mathbf{p}_l$ is assigned the product of the weight of word $k_m$ and value 1; otherwise, element $p_i$ is assigned a value of 0. A specific word $k_m$ can have many weights $w_m^n$ simultaneously, due to having multiple sources; in such cases, the maximum value of the weight is considered. The method for determining word weights is presented below.

Word weights

Because the same keyword can be identified from sources of varying credibility, a keyword can have multiple weights that link it to an individual. Three word sources and weighting rules can be distinguished: words provided by a human, words provided as keywords of a publication, and words extracted from publication titles and abstracts. If a word is identified as a self-description of an individual, then it is assigned weight $w_m^1 = 1$, as self-assessment is regarded to be the most reliable source of information. Words provided as index terms in scholarly articles (weights $w_m^2$) and those extracted from the titles and abstracts of publications (weights $w_m^3$) require deeper assessment.

Examples of word weight smoothing

It should be noted that individuals' research and professional interests may evolve over time; keywords extracted from older publications must, therefore, be of less importance than those extracted from more recent ones. This can be achieved using a simple exponential smoothing operation, the rule of which is:

$$w_m^y = e^{-(y - y_m^{pub})/c}. \tag{3.5}$$
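Read together, Eqs. 3.3 and 3.4 amount to a short procedure: for each query word, take the maximum available weight from the individual's profile to build $\mathbf{p}_l$, then compute the cosine of $\mathbf{q}_d$ and $\mathbf{p}_l$. A sketch with an invented profile:

```python
import math

# Sketch of Eqs. 3.3-3.4: build vector p_l from an individual's feature
# vector f_l (keyword -> up to three weights), then compute the cosine
# score against the query q_d. The example profile is invented.
def build_p(query_words, profile):
    # Eq. 3.4: p_i is the maximum weight of the matching keyword, or 0.
    return [max(profile[w]) if w in profile else 0.0 for w in query_words]

def c_score(query_words, profile):
    q = [1.0] * len(query_words)   # each query word occurs once (value 1)
    p = build_p(query_words, profile)
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm = (math.sqrt(sum(qi * qi for qi in q))
            * math.sqrt(sum(pi * pi for pi in p)))
    return dot / norm if norm else 0.0

# Keyword -> tuple of weights from the (up to) three sources.
profile = {"solar": (1.0,), "energy": (1.0, 0.5), "biomass": (0.3,)}
score = c_score(["solar", "energy", "cancer"], profile)
print(round(score, 3))  # 0.816: two of three query words match with weight 1
```

Ranking candidates then reduces to computing this score for every profile and sorting in descending order, as stated above.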
Fig. 3.8 A graph depicting changes in the weight of the publication from which the keyword comes, depending on the time of publication. Exponential smoothing was applied according to Eq. 3.5. The graph presents three examples of weight changes over twenty-five years, depending on three different values of constant c, i.e., 1, 10, 100. The vertical axis presents the publication weight value and the horizontal axis presents the difference between the current year and the year of the publication in which the word was identified
Value $w_m^y$ is the weight of the publication $pub$ from which word $k_m$ comes. This weight depends exponentially on the distance between the current year, $y$, and the year of publication, $y_m^{pub}$; $c$ is a constant that influences the shape of the exponential function. Figure 3.8 presents examples of changes in publication weights. Individuals' experience and knowledge can also be expressed through the frequency of their publishing in a specific area; it should, therefore, be assumed that keywords that appear more frequently in publications must be assigned higher weights than those that appear less frequently. Ultimately, the weight of a word is determined by a recursive rule (Eq. 3.6), which considers both the changes in the weight of a publication over time, $w_m^y$ (Eq. 3.5), and the frequency, $c_m^y$, with which a specific word appears in publications:

$$\forall_{n=2,3} \quad w_m^n = \frac{1}{1 + e^{-b \cdot \sum_{y=1}^{Y} \left( w_m^y \cdot c_m^y \cdot w_m^n \right)}} \tag{3.6}$$

Here, $c_m^y$ is the number of publications in year $y$ that contain the specific word $k_m$, and the constant $b$ influences the shape of the function. The weight $w_m^n$ on the right side of the equation is equal to the value of the probability by which a given expression $k_m$ was identified as a keyword when extracting keywords from an abstract using Bayes' theorem; however, $w_m^n = 1$ if the word $k_m$ was identified from the keywords (index terms) of the publication. Ultimately, the weight $w_m^n$ on the left side of the equation is a smoothed value that depends on the frequency with which the word appears in publications and on the time that has elapsed since their publication.

Fig. 3.9 A graph that presents changes in the weight of an individual's keyword, depending on the time of publication of the material from which the specific word comes and the frequency with which the word appears in publications. Exponential smoothing was applied in accordance with Eq. 3.6. The graph presents some examples of changes in the weight of a word over twenty-five years, with repeatability of 1, 10, 100, and 1,000. The vertical axis represents the weight of the keyword after smoothing; the horizontal axis represents the difference between the current year and the year of publication of the material from which the word comes

Figure 3.9, which presents sample graphs of changes in keyword weights (Eq. 3.6), demonstrates that the more often a word is repeated in an individual's works, the longer it retains a weight approaching the value of 1. The weight stabilises at 0.5 in the case of older publications. This measure means that even the oldest works of an individual retain at least 50% of their original impact on the ranking of reviewers or experts.
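The smoothing rules of Eqs. 3.5 and 3.6 can be combined in a few lines. The constants $b$ and $c$ and the publication histories below are illustrative choices, not values used by the system.

```python
import math

# Sketch of Eqs. 3.5 and 3.6. For one keyword k_m we compute the yearly
# publication weights w_m^y (Eq. 3.5) and squash the weighted frequency
# through the logistic rule (Eq. 3.6). Constants b, c and the histories
# are assumptions made for this example.
def publication_weight(current_year, pub_year, c=10.0):
    """Eq. 3.5: exponential decay with the age of the publication."""
    return math.exp(-(current_year - pub_year) / c)

def keyword_weight(history, current_year, w_mn=1.0, b=1.0, c=10.0):
    """Eq. 3.6: history maps publication year -> number of publications
    containing the keyword in that year (c_m^y); w_mn is the extraction
    probability (1 for index terms)."""
    s = sum(publication_weight(current_year, year, c) * count * w_mn
            for year, count in history.items())
    return 1.0 / (1.0 + math.exp(-b * s))

# A keyword used once a year for the last three years stays close to 1:
w_recent = keyword_weight({2021: 1, 2022: 1, 2023: 1}, current_year=2023)
print(0.5 < w_recent < 1.0)  # True

# A single twenty-five-year-old use decays toward the 0.5 floor:
print(round(keyword_weight({1998: 1}, 2023), 2))  # 0.52
```

The logistic form makes the 0.5 floor explicit: as the exponent's sum approaches zero for very old, rarely repeated words, the weight approaches 1/(1+1) = 0.5, never below it.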
3.3.2 A Full-text Index

Full-text search engines represent and search documents that are reduced to full-text indices. This technique can be used to recommend reviewers and experts. Let us assume that the experience and knowledge of each individual $l = 1, 2, \ldots, L$ is described in a document that comprises many parts $d_l^j$, $j = 1, 2, \ldots, J$. Each of them describes the reviews, projects, career, patents, publications, keywords, and other achievements of a particular individual $l$. The vector of text fields $\mathbf{d}_l = [d_l^1, d_l^2, \ldots, d_l^J]$ represents the profile of individual $l$. Such vectors are represented as a full-text index. The purpose of recommendation is to have a full-text search engine determine the similarity between query $q$ and profiles $\mathbf{d}_l$. The query can consist of an entire document or of its keywords. Similarity is measured using the cosine measure, similarly to the keyword similarity discussed above; in this case, however, more factors are considered. One example is the Apache Lucene scoring formula (Apache Lucene, Similarity Measure: http://lucene.apache.org).
3.3.3 The Combination of Two Measures

A ranking based on the keywords' cosine similarity and one based on the full-text index can be combined in the following manner:

$$\mathrm{score}_l = \alpha \cdot \mathrm{c\_score}_l(q, \mathbf{p}_l) + \beta \cdot \mathrm{l\_score}_l(q, \mathbf{d}_l) \tag{3.7}$$

By altering the values of parameters $\alpha$ and $\beta$, greater emphasis can be placed on the first or on the second method. In all cases, the following condition must be met: $\alpha + \beta = 1$. If $\alpha = 0$, the final ranking is based solely on the full-text index; if $\beta = 0$, it is based solely on the keywords' cosine similarity. The parameter values depend only on the user's intention.
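Equation 3.7 is a convex combination of the two scores. A minimal sketch (the component scores are placeholders for the outputs of the two methods):

```python
# Sketch of Eq. 3.7: a convex combination of the keyword cosine score
# (c_score) and the full-text index score (l_score). The constraint
# alpha + beta = 1 is enforced by deriving beta from alpha.
def combined_score(c_score, l_score, alpha=0.5):
    beta = 1.0 - alpha
    return alpha * c_score + beta * l_score

print(combined_score(0.8, 0.4, alpha=1.0))  # alpha=1: keywords only
print(combined_score(0.8, 0.4, alpha=0.0))  # alpha=0: full-text index only
print(combined_score(0.8, 0.4, alpha=0.5))  # equal emphasis on both
```

Deriving $\beta$ from $\alpha$ inside the function guarantees that the normalisation condition holds for any user-chosen emphasis.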
3.4 Validation of the Recommendation System The recommendation algorithm is the most integral part of the recommendation module, which is the final element of the recommendation system. It is impossible to present detailed results that recommend reviewers or experts tasked with evaluating articles or projects, because such data is not made publicly available—both for data protection reasons and to preserve the confidentiality of authors and project applicants. For this reason, this work will present applications that are substantively operational in a real environment. The case of the implementation of the entire system in the National Centre for Research and Development, a Polish governmental agency, will also be discussed.
3.4.1 A Simple Example of the Recommendation Algorithm

To evaluate the performance of the recommendation algorithm based on the cosine measure, a simple dataset was prepared, which contained the abstracts of two research and development projects and the profiles of five experts. The projects were selected from the database of the FP7 Programme of the European Commission (http://cordis.europa.eu/projects/home_en.html, accessed 2015-11-18). They represent two separate domains: energy and the environment. The keywords were extracted from the project abstracts using the keyword extraction algorithms. The projects' names, domains, keywords, and weights are specified in Table 3.2. Individuals' profiles were created on the basis of data contained in the database of reviewers and experts and their publications (http://recenzenci.opi.org.pl). The profiles represent five disciplines: energy, the environment, health, security, and information and communication technologies. The profiles, which include first names, domains, keywords, and their weights, are presented in Table 3.3.

Table 3.2 The projects selected to validate the system. The project names and their corresponding domains are specified. The last column contains the keywords and their weights, which were extracted from the project descriptions by the information retrieval module; project 1: http://cordis.europa.eu/result/rcn/159946_en.html; project 2: http://cordis.europa.eu/result/rcn/159944_en.html

1. All-oxide photovoltaic cells (domain: energy). Keywords (weights): solar cell (1), solar energy (1), photovoltaics (0.841), semiconductor (0.703), data mining (0.693)
2. Olive oil waste as a fuel (domain: environment). Keywords (weights): olive oil (1), renewable energy (0.814), biomass (0.692), biodiesel (0.664), fossil fuel (0.6)

Table 3.3 Hypothetical profiles of individuals (experts) created on the basis of the data from the database of reviewers and experts (http://recenzenci.opi.org.pl). The experts represent five popular disciplines. The knowledge and experience of each expert is defined by five keywords and their weights

1. Nawojka (domain: energy). Keywords (weights): renewable energy (1), photoluminescence (1), solar energy (1), energy consumption (0.5), biomass (0.3)
2. John (domain: environment). Keywords (weights): environment protection (1), ecology (1), sustainable development (1), renewable energy (0.5), biomass (0.2)
3. Ruth (domain: health). Keywords (weights): health care (1), mental health (1), public health (0.5), cancer (0.2), adolescent (0.1)
4. Alan (domain: security). Keywords (weights): internet (1), security requirement (1), cryptography (0.5), authentication (0.5), data mining (0.2)
5. Carlos (domain: information and communication technologies). Keywords (weights): big data (1), data mining (0.8), machine learning (0.7), information system (0.5), internet (0.5)
For both projects presented in Table 3.2, the algorithm based on the cosine measure proposed reviewers by selecting them from the group of potential reviewers listed in Table 3.3. The reviewer rankings assigned to the projects are presented below.

1. Project "All-oxide photovoltaic cells" (the energy domain):
   Rank 1: Nawojka (energy)
   Rank 2: Carlos (information and communication technologies)
   Rank 3: Alan (security)

2. Project "Olive oil waste as a fuel" (the environment domain):
   Rank 1: Nawojka (energy)
   Rank 2: John (environment)

First, it is worth noting that the expert named Nawojka is the most recommended for the project in the environment domain, although she represents the energy discipline. Only the second-ranked expert, John, represents the environment discipline, which aligns with the domain of the project. It should also be observed that the experts proposed to evaluate the project in the energy domain represent different disciplines: energy (Nawojka), information and communication technologies (Carlos), and security (Alan). Nawojka appears to be the expert best suited to evaluating both projects, while Ruth is appropriate for neither. It is clear that the discipline implied by the keywords may differ from the one an individual declares. Although the above examples are provided solely for illustrative purposes, they demonstrate that the reviewer selection algorithm based on the cosine measure requires both projects and experts to be described by multiple keywords. These keywords must be properly weighted to indicate which of them are the most important.
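The cosine-based ranking above can be reproduced with a short Python sketch over sparse keyword vectors. The profile data is taken from Tables 3.2 and 3.3; keywords absent from both vectors contribute nothing to the dot product:

```python
from math import sqrt

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse keyword -> weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Keyword profile of project 1 (Table 3.2).
project_1 = {"solar cell": 1, "solar energy": 1, "photovoltaics": 0.841,
             "semiconductor": 0.703, "data mining": 0.693}

# Expert profiles (Table 3.3).
experts = {
    "Nawojka": {"renewable energy": 1, "photoluminescence": 1, "solar energy": 1,
                "energy consumption": 0.5, "biomass": 0.3},
    "John": {"environment protection": 1, "ecology": 1, "sustainable development": 1,
             "renewable energy": 0.5, "biomass": 0.2},
    "Ruth": {"health care": 1, "mental health": 1, "public health": 0.5,
             "cancer": 0.2, "adolescent": 0.1},
    "Alan": {"internet": 1, "security requirement": 1, "cryptography": 0.5,
             "authentication": 0.5, "data mining": 0.2},
    "Carlos": {"big data": 1, "data mining": 0.8, "machine learning": 0.7,
               "information system": 0.5, "internet": 0.5},
}

# Rank experts by similarity; drop those with no keyword overlap at all.
scored = sorted(((cosine(project_1, p), name) for name, p in experts.items()),
                reverse=True)
ranking = [name for score, name in scored if score > 0]
# ranking -> ['Nawojka', 'Carlos', 'Alan'], matching the published ranking
```

Nawojka matches project 1 via "solar energy", Carlos and Alan only via "data mining" (with different weights), and John and Ruth share no keywords with the project, which is why they do not appear in the ranking.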
3.4.2 Implementation of the Complete Algorithm

The reviewer and expert recommendation system was implemented at the National Centre for Research and Development, a governmental agency that is responsible for financing research and development in Poland. Since funding is awarded by means of a competition procedure, project applications must be evaluated by appropriate experts and reviewers. The implementation of the system aimed to improve the selection of individuals responsible for project evaluations. An algorithm was implemented that combined the cosine measure between the keywords and the full-text index of the documents under consideration (Eq. 3.7). The Apache Lucene system,4 a highly efficient full-text search engine, was used in the implementation of the recommendation algorithm. The fundamental point of reference of the search engine is a document that contains text fields. In this particular implementation, a 'document' holds the description of an expert or a reviewer.
4 https://lucene.apache.org.
Table 3.4 The weight values assigned to the achievements and academic degrees of potential reviewers by the full-text search engine

Field: Former reviews (weight 3); Publications (weight 1.5); Others except scientific degrees (weight 1)
Scientific degree: Professor (1.00); DSc (0.98); PhD (0.95); MSc (0.70); No degree (0.70)
The text fields of a document describe the achievements of an individual, and might comprise their previous reviews, completed projects, employment history, CVs, patents, publications, keywords, academic titles, and degrees. One document represents one individual. Documents are stored in the Apache Lucene full-text index. In response to a query, the search engine generates a ranking list of documents, i.e., of experts or reviewers.

Text fields represent different aspects of an individual, and Apache Lucene offers a mechanism for weighting text fields (boosting). Following a series of experiments and observations, it was decided that the field containing publications would carry a weight of 1.5, previous reviews a weight of 3.0, and all other fields, excluding academic titles, degrees, and keywords, a weight of 1.0 (Table 3.4). This approach places greater emphasis on the scientific achievements of potential reviewers; another implementation may shift the emphasis to other achievements. It was also determined that the bodies that organise the grant award process place greater trust in reviewers who hold higher academic titles and degrees in the disciplines that align with the subject of a project. Accordingly, the following weights were applied: professor, 1.00; doctor of science (DSc), 0.98; PhD, 0.95; MSc, 0.70; no title, 0.70 (Table 3.4). It was also assumed that in the absence of any link between the keywords related to an individual and the project to be reviewed, the individual is not considered a prospective reviewer, even if all other criteria are met.

The recommendation model was evaluated on the basis of real data from the National Centre for Research and Development. Twenty queries to the recommendation system were formulated; they related to various areas of science, including "machine learning, bio-engineering", "sociology, economic sociology", and "civil engineering, bridges".
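Lucene's actual similarity function (TF-IDF or BM25 combined with boosts) is considerably more involved, but the role of the field boosts, the degree weights, and the keyword-link filter can be illustrated with a toy scoring rule. The field names and raw match scores below are illustrative assumptions, not the system's actual schema:

```python
# Field boosts and degree weights from Table 3.4.
FIELD_BOOSTS = {"former_reviews": 3.0, "publications": 1.5, "other": 1.0}
DEGREE_WEIGHTS = {"professor": 1.00, "dsc": 0.98, "phd": 0.95,
                  "msc": 0.70, None: 0.70}

def score_candidate(field_matches: dict, degree, keyword_match: bool) -> float:
    """Toy scoring rule in the spirit of the implementation described above.

    `field_matches` maps a field name to a raw relevance score for the query.
    The total is boost-weighted and then scaled by the degree weight. A
    candidate with no keyword link to the project is excluded outright.
    """
    if not keyword_match:
        return 0.0  # no keyword link: not a prospective reviewer
    raw = sum(FIELD_BOOSTS.get(field, 1.0) * s
              for field, s in field_matches.items())
    return DEGREE_WEIGHTS.get(degree, 0.70) * raw
```

A match found in the former-reviews field thus counts twice as much as the same match in the publications field, and a professor's document outscores an otherwise identical MSc holder's document by a factor of 1.00/0.70.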
The system generated a ranking of potential reviewers for each query, each time considering the 100 best-matched individuals. Experts who specialise in the relevant areas of science then assessed the rankings and modified them as necessary. As a result, each individual in a ranking was assigned two attributes: a ranking position resulting from the algorithm and another resulting from the expert assessment. Weights were then assigned to each position in the ranking according to the following principles:

• weight = 1 if an individual's position in the ranking has not changed following expert assessment. This is a neutral weight;
• weight = −100 if the experts have removed an individual from the ranking. This is a penalty to the algorithm for an incorrect recommendation;
• weight = −10 if, according to the experts, an individual is in the top thirty, but the algorithm ranked them lower. This is a penalty to the algorithm for undervaluing a potential reviewer;
• weight = −2 if, according to both the experts and the algorithm, an individual is not in the top thirty, but the experts' and the algorithm's ranking positions fail to match. This is a penalty to the algorithm for being slightly imprecise in its ranking.

The above weights were used in the ranking evaluation function, which is the sum of the weights in a ranking. The algorithm was tuned by optimising this function through analysis of the distribution of the emphasis between the various fields of the document. In this manner, the field weights discussed above were determined.
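The weighting principles above can be sketched directly. Note that the text does not specify a weight for every possible mismatch (for example, an expert-confirmed top-thirty individual that the algorithm ranked too high), so the residual case is treated here as the −2 penalty; that fallback is an assumption:

```python
def ranking_weight(algo_pos: int, expert_pos, top: int = 30) -> int:
    """Weight for one individual, given the algorithm's 1-based ranking
    position and the experts' corrected position (None if removed)."""
    if expert_pos is None:
        return -100                      # incorrect recommendation
    if algo_pos == expert_pos:
        return 1                         # neutral: position unchanged
    if expert_pos <= top and algo_pos > expert_pos:
        return -10                       # undervalued a potential reviewer
    if expert_pos > top and algo_pos > top:
        return -2                        # slight imprecision outside top 30
    return -2                            # residual mismatch (assumed)

def evaluation(pairs) -> int:
    """Ranking evaluation function: the sum of weights over
    (algorithm position, expert position) pairs."""
    return sum(ranking_weight(a, e) for a, e in pairs)
```

Tuning the field boosts then amounts to maximising `evaluation` over the twenty assessed rankings.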
3.5 Summary

This chapter outlines the assumptions, architecture, and implementation of the reviewer and expert recommendation system. The modular system architecture ensures flexibility and easy maintenance. The system includes three modules that are responsible for (i) data acquisition, (ii) information extraction, and (iii) recommendation. The data acquisition module enables the collection of data from both structured and unstructured data sources. The second module extracts information that concerns potential reviewers and experts from the collected data. The third module provides a mechanism for recommending potential reviewers and experts to evaluate manuscripts or project drafts. The ranking is constructed on the basis of individuals' profiles, which are created by the information retrieval module. The recommendation mechanism supports decision-making, but does not replace the humans who ultimately select reviewers and experts.

The recommendation algorithm, which is the capstone of the system, requires additional discussion. The system uses two content-based algorithms. The first is based on the cosine measure between the keyword vectors that represent the reviewer and the problem to be reviewed. Keywords reflect individuals' areas of knowledge, while the weights of keywords reflect the degree of that knowledge. This facilitates simple summaries of scientists' knowledge and experience. The problem to be reviewed is represented in a similar manner: as a summary of a document that uses keywords and their weights. This simple representation and the clear cosine measure make it easier to explain the position of an individual in the ranking of potential reviewers tasked with the evaluation of a specific problem. Regrettably, the quality of recommendations can be disrupted by inaccuracies in retrieving information from the data, such as keywords and their weights.
When individuals' profiles and problems to be reviewed are represented by keywords, some information is lost: first, because information granularity depends on the generality or specificity
of the keywords; and second, because of errors arising from missing essential words or from an excessive number of them caused by too many synonyms. To address these issues, a recommendation algorithm based on the full-text index was used. Despite also using the cosine measure, the full-text index reduces information loss by operating on entire indexed documents. Full-text search engines, such as Apache Lucene, are developed as open-source products; this increases the credibility and transparency of the recommendation algorithm. Unlike the simple cosine measure, however, a complicated rule for measuring similarity is used in this case, which is cumbersome to understand and to explain. Moreover, excessively detailed information granularity may lead the algorithm to focus on irrelevant achievements, which may, in turn, result in incorrect recommendations of reviewers.

Considering the advantages and disadvantages of both algorithms, i.e., keyword-based and full-text indexing, the recommendation system of reviewers and experts employs a hybrid solution, in which the algorithm based on the cosine measure and keywords and that based on the full-text index are used simultaneously. The impact of both recommendation algorithms on the final ranking of reviewers is weighted. System users can make informed decisions on the method that is best suited to a specific problem.

The originality of the reviewer and expert recommendation system depicted in this chapter is twofold. First, the proposed content-based recommendation algorithm incorporates two approaches, keyword-based and full-text indexing, which allows both to be utilised effectively: the keyword-based approach is understandable for humans, while full-text indexing retrieves more information from textual data. Second, the system has a flexible and modular architecture, which allows for the encapsulation of business or computational processes and for the flexible composition of the system.
This follows a modern information system architecture that is similar to that of microservices.
References

1. August D, Muraskin L (1999) Strengthening the standards: recommendations for OERI peer review. Summary report prepared for the National Educational Research Policy and Priorities Board, US Department of Education
2. Basu C, Hirsh H, Cohen WW (2001) Technical paper recommendation: a study in combining multiple information sources. J Artif Intell Res 14:231–252
3. Bornmann L, Daniel HD (2009) Reviewer and editor biases in journal peer review: an investigation of manuscript refereeing at Angewandte Chemie International Edition. Res Eval 18(4):262–272
4. Eisenhart M (2002) The paradox of peer review: admitting too much or allowing too little? Res Sci Educ 32(2):241–255
5. Flach PA, Spiegler S, Golénia B, Price S, Guiver J, Herbrich R, Graepel T, Zaki MJ (2010) Novel tools to streamline the conference review process: experiences from SIGKDD'09. ACM SIGKDD Explor Newsl 11(2):63–67
6. Hemlin S (2009) Peer review agreement or peer review disagreement: which is better? J Psychol Sci Technol 2(1):5–12
7. Hoang DT, Nguyen NT, Collins B, Hwang D (1999) Decision support system for solving reviewer assignment problem. 52(5):379–397
8. Hojat M, Rosenzweig S (2004) Journal peer review in integrative medicine discipline. Sem Integr Med 2(1):1–4
9. Jacoby LL, Kelley C, Brown J, Jasechko J (1989) Becoming famous overnight: limits on the ability to avoid unconscious influences of the past. J Pers Soc Psychol 56(3):326–338
10. Langfeldt L (2004) Expert panels evaluating research: decision-making and sources of bias. Res Eval 13(1):51–62
11. Liu P, Dew P (2004) Using semantic web technologies to improve expertise matching within academia. In: Proceedings of I-KNOW, Graz, Austria, pp 70–378
12. Liu X, Wang X, Zhu D (1999) Reviewer recommendation method for scientific research proposals: a case for NSFC. 127(6):3343–3366
13. Mabude CN, Awoyelu IO, Akinyemi BO, Aderounmu GA (1999) An integrated approach to research paper and expertise recommendation in academic research. 13(4):485–495
14. Marsh HW, Jayasinghe UW, Bond NW (2008) Improving the peer-review process for grant applications: reliability, validity, bias, and generalizability. Am Psychol 63(3):160–168
15. Mishra D, Singh SK (2011) Taxonomy-based discovery of experts and collaboration networks. VSRD Int J Comput Sci Inf Technol 1(10):698–710
16. Papagelis M, Plexousakis D, Nikolaou PN (2005) CONFIOUS: managing the electronic submission and reviewing process of scientific conferences. In: Web Information Systems Engineering (WISE 2005). Lecture Notes in Computer Science, vol 3806. Springer, Berlin, pp 711–720
17. Pradhan T, Sahoo S, Singh U, Pal S (1999) A proactive decision support system for reviewer recommendation in academia. 169
18. Protasiewicz J (2014) A support system for selection of reviewers. In: 2014 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, pp 3062–3065
19. Protasiewicz J, Artysiewicz J, Dadas S, Gałężewska M, Kozłowski M, Kopacz A, Stanisławek T (2012) Procedury recenzowania i doboru recenzentów [Procedures for reviewing and selecting reviewers], vol 2. OPI PIB
20. Protasiewicz J, Dadas S, Gałężewska M, Kłodziński P, Kopacz A, Kotynia M, Langa M, Młodożeniec M, Oborzyński A, Stanisławek T, Stańczyk A, Wieczorek A (2012) Procedury recenzowania i doboru recenzentów [Procedures for reviewing and selecting reviewers], vol 1. OPI PIB
21. Protasiewicz J, Pedrycz W, Kozłowski M, Dadas S, Stanisławek T, Kopacz A, Gałężewska M (2016) A recommender system of reviewers and experts in reviewing problems. Knowl Based Syst 106:164–178
22. Rivara FP, Cummings P, Ringold S, Bergman AB, Joffe A, Christakis DA (2007) A comparison of reviewers selected by editors and reviewers suggested by authors. J Pediatr 151(2):202–205
23. Ryabokon A, Polleres A, Friedrich G, Falkner AA, Haselböck A, Schreiner H (2012) (Re)configuration using web data: a case study on the reviewer assignment problem. In: International Conference on Web Reasoning and Rule Systems. Springer, pp 258–261
24. Spier R (2002) The history of the peer-review process. Trends Biotechnol 20(8):357–358
25. Tian Q, Ma J, Liang J, Kwok RC, Liu O (2005) An organizational decision support system for effective R&D project selection. Decis Support Syst 39(3):403–413
26. Tian Q, Ma J, Liu O (2002) A hybrid knowledge and model system for R&D project selection. Expert Syst Appl 39(3):265–271
27. Tversky A, Kahneman D (1974) Judgment under uncertainty: heuristics and biases. Science 185(4157):1124–1131
28. Xu Y, Ma J, Sun Y, Hao G, Xu W, Zhao D (2010) A decision support approach for assigning reviewers to proposals. Expert Syst Appl 37(10):6948–6956
Chapter 4
Supporting Innovativeness and Information Sharing
This chapter focuses on the recommendation and sharing of information on innovations. Its main objective is to propose a recommendation system that supports innovativeness and information sharing. The purpose of such recommendations is to reduce the time necessary for users to find the right innovations, potential business partners, experts, and conferences. More specifically, the proposed system is based on data, information, and knowledge. One interesting implementation involves analysing firms' websites and assessing whether enterprises are innovative. To support innovation, the system must 'understand' six categories of information (objects): (1) innovation, (2) project, (3) conference, (4) scientific unit, (5) enterprise, and (6) expert. The heart of the system is a contextual recommendation algorithm that proposes to users the information that best suits their needs.

The chapter is structured as follows. Sect. 4.1 discusses what innovation means and what innovation strategies are; its goal is to identify the needs and assumptions of an IT system that supports innovation. Sect. 4.2 describes the system, with an emphasis on information processing and the recommendation mechanism, which aims to connect science and business. The chapter ends with a discussion in Sect. 4.3. It must be noted that the content of this chapter is based chiefly on the author's former publications, i.e., [11, 12], as well as covering the author's new perspective on the issue.
4.1 Innovativeness

This section discusses the concept of (open) innovation and innovativeness strategies, as well as identifying the goals of IT systems that support innovation.
4.1.1 Innovation

The overwhelming majority of enterprises, including small and local ones, must compete on the global market with small enterprises and large corporations alike. The development of information technologies has provided easy access to global information on goods, enterprises, experts, technological novelties, and scientific discoveries, all of which are crucial in strengthening the innovativeness of any type of enterprise. Collaboration between business and science is becoming a key aspect of business strategies because it results in innovations that enable enterprises to remain competitive. Since enterprises are developed by humans, we assume that anyone can become an innovator [9, 16, 18].

In modern economies, great emphasis is placed on innovation to increase the competitiveness of products and services and to improve humans' quality of life. According to the Oslo Manual guidelines [10], an innovation is either something new or something significantly improved, which is useful to its users or which yields benefits to enterprises that have implemented it. Innovations include product innovations, process innovations, and marketing tools. As innovations are conceived by creative people, the traditional terms 'engineer' and 'scientist' are often replaced with 'innovator' [14, 18]. Moreover, modern horizontal organisations spend, on average, one day a week conducting research or other activities, implementing programmes, and creating opportunities for their employees to become innovators [16].
4.1.2 Open Innovation and Innovativeness Strategies

Open innovation is growing in popularity [3]. It is based on the free flow of knowledge between organisations and their environments, as well as on cooperation and the establishment of shared intellectual values [4, 17, 19]. Open innovations are conceived in three stages: (i) the creation of different ideas by numerous entities and individuals using open media; (ii) the implementation of selected ideas in an appropriate collaborative environment; and (iii) commercialisation. Although open innovation is gaining in popularity, it is not yet widely used by small- and medium-sized enterprises [7]. Some scientists even believe that it is more advisable to work on long-term projects that involve stable and long-lasting teams than to work in the rapidly changing environment of open innovation in short-term ventures [5].

Four fundamental enterprise innovation strategies can be delineated: explorative, exploitative, ambidextrous, and no-emphasis. In the explorative strategy, enterprises develop their own innovations or knowledge. Conversely, the exploitative strategy involves the 'buying' of knowledge by enterprises. The ambidextrous strategy combines the explorative and exploitative strategies. Enterprises that focus solely on internal knowledge without investing in it rely on the no-emphasis strategy [15]. This approach should be avoided, as protecting know-how without cooperating with the external environment leads to low innovativeness.
Although all of the strategies exercise positive influences on innovativeness, the ambidextrous approach is the most desirable because it ensures the balance between own-generated, bought-in, and codeveloped knowledge [15]. Creating an innovative product or service is an ambitious task; propagating (diffusing) innovation in society may be an even more challenging one. Some individuals embrace new solutions with ease and apply them quickly in their daily lives; others fend off innovation and require incentives to curb their reluctance. Studies demonstrate that innovations are usually initiated by a handful of visionaries who popularise them across societies [1, 6, 20].
4.1.3 An IT System to Support Innovativeness

The abundance of available information and the issues discussed above indicate the need for an IT tool to share innovations; one that implements the open innovation paradigm [4]. Three goals are of crucial importance:

i. strengthening the participation of small- and medium-sized enterprises in creating and using open innovations;
ii. enabling enterprises' employees to become more innovative;
iii. helping scientists and professionals to establish long-term and lasting cooperation.

The IT system developed to implement these goals must focus on enhanced innovation and on cooperation between science and business. Such a system should advance an architecture and business processes that allow the open innovation concept to be implemented seamlessly. Its key functions include the following:

i. information on innovations and potentially innovative firms is obtained automatically from the internet, which encourages people to use the system;
ii. innovations, projects, and events are recommended to system users based on their expectations;
iii. the recommendation algorithm attempts to bring scientists and entrepreneurs together and enhance their cooperation.

The answer to the above requirements is the recommendation system Inventorum, which was developed in 2016 by the National Information Processing Institute. It is available free of charge to all entities and individuals.1 The open innovation paradigm is implemented by recommending innovations, projects, experts, partners, and conferences to scientific institutions and enterprises. Profiled recommendations should reduce the time necessary to obtain the necessary information and improve
1 The Inventorum system is available on the internet free of charge at http://inventorum.opi.org.pl/en.
its accuracy [11, 13]—for example, enterprises may receive innovation proposals from research institutions. The next section depicts the underlying architecture and business processes of the system.
4.2 A System that Supports Innovativeness

This section describes the underlying architecture and processes of Inventorum, the system that supports innovativeness, with an emphasis on information processing and the recommendation mechanism that aims to connect science and business. An example that shows the system at work is also provided.
4.2.1 An Outline of the System

The autonomous IT system that supports innovation is based on data, information, and knowledge. It comprises three main components that transform the input data into increasingly precise information, which ultimately enables knowledge recommendations (Fig. 4.1). First, crawlers acquire data from the internet. Various open databases can also be used. Regardless of its source, it is assumed that the data obtained must be related to innovation in some way. Next, information on innovation is extracted from the data. That information is then used in a variety of services whose main goal is to help individuals achieve their economic and research goals by means of innovation [11, 12].

Fig. 4.1 The main processes of the system that supports innovativeness. The system acquires data, which must be related in some way to innovation, from the internet and open databases. The data is then transformed into information on enterprises, innovators, and innovations. It is then used in innovation support system services, such as recommendations and information exchanges

Business analysis demonstrated that the innovation support system should 'understand' six categories of information (objects): (1) innovation, (2) project, (3) conference, (4) scientific unit, (5) enterprise, and (6) expert. Figure 4.2 depicts the relations between these objects. The focal point is innovation, which can be developed by both research units and enterprises as a result of R&D projects. Projects can be implemented independently, jointly by representatives of the business and science sectors, or by consortia of various entities. Projects are reviewed by experts to assess their scientific, economic, and social merit. Although most experts work at research units, enterprises increasingly offer them job opportunities, particularly in technology and biotechnology. Firms invest in external knowledge to acquire innovations that could increase their competitiveness. Conferences focused on bringing business and science together appear to be the right place to promote knowledge and innovation. They are attended by scientists who have relationships with both scientific units and enterprises. Attendance might prove too costly for some small- and medium-sized enterprises, and experts may have inadequate time to participate in such events. The bringing together of business and science should, therefore, be supported by an IT system that preselects innovations and suggests cooperation opportunities and collaborations on joint projects. This work has resulted in the recommendation system that supports innovation, which is based on the assumptions outlined above. It should be stressed that no information recommendation system can replace personal relationships; such systems can, however, save time and facilitate cooperation.

Fig. 4.2 The diagram presents the six categories of information (objects) distinguished in the system that supports innovation and information exchange, and describes the relationships between them
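The six object categories and their principal relations might be modelled as simple data types. The field names below are illustrative assumptions, not the system's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Expert:
    name: str
    keywords: List[str] = field(default_factory=list)

@dataclass
class Innovation:
    title: str
    developed_by: str = ""          # a scientific unit or an enterprise
    keywords: List[str] = field(default_factory=list)

@dataclass
class Project:
    title: str
    partners: List[str] = field(default_factory=list)    # units and/or firms
    reviewers: List[Expert] = field(default_factory=list)
    results: List[Innovation] = field(default_factory=list)

@dataclass
class Conference:
    name: str
    attendees: List[Expert] = field(default_factory=list)

@dataclass
class ScientificUnit:
    name: str
    employees: List[Expert] = field(default_factory=list)

@dataclass
class Enterprise:
    name: str
    website: str = ""
    employees: List[Expert] = field(default_factory=list)
```

The relations sketched here (projects yield innovations and are reviewed by experts; experts are affiliated with units or enterprises) mirror the dependencies described in the text.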
4.2.2 Data Acquisition and Information Extraction

To attract users, the IT system must be rich in information. To fulfil this requirement, special mechanisms have been developed to acquire as many innovations, projects, and conferences as possible. Figure 4.3 depicts the implementation of the semiautomatic processes designed to perform this task. Information can be acquired by analysing records of open databases or by using crawlers to browse selected areas of the internet. It should be noted that extracting information from web data is an extremely challenging and error-prone process. Important and valuable data on innovations, projects, and conferences acquired in this manner may be incomplete. It is, therefore, imperative that this data be linked to its specific owners, who are able to supplement and validate it; only then should the data be presented to other users of the system (Fig. 4.3).

Fig. 4.3 An outline of the process of acquiring information on innovations, projects, and conferences. Selected areas of the internet are browsed by crawlers; in the case of open databases, data is acquired and analysed in a standard manner. Next, if possible, the data acquired is linked to the individuals to which it pertains. As a result, these individuals (system users) can view, complete, and validate the data

The key to the success of the recommendation system lies in the acquisition of many users who are interested in innovation. Three groups of potential users were identified: scientists, research units, and enterprises.

Scientists
Given that, as a rule, the platform is open and free, any scientist or expert can join it by creating an account and providing information about themselves. Affiliating individuals to research units or enterprises, however, requires the approval of the relevant data administrators. It should also be noted that scientists can upload their data from other databases. The Polish Science database has already been integrated successfully,2 and connection to other systems is also possible. The matter of creating user accounts is strictly technical; for this reason, this chapter will not describe it in further detail.

Scientific units
Data on scientific entities was initially imported from the Polish Science database, which contains information on all higher education institutions and scientific units in Poland. It is possible to enter information on any entity into the system, regardless of its country of origin. Such entities are widely known and are usually established and maintained by state authorities, and their data seldom changes. This explains why data entered manually via user interfaces is highly trusted and why no additional
2 www.nauka-polska.pl.
verification of such data (which would otherwise generate additional system maintenance costs) is required. The matter of entering data on scientific entities is a purely technical one and does not merit further discussion in this chapter. Enterprises Convincing enterprises to join the platform has proved a significant challenge. To save the time of enterprise employees, the platform actively searches for potentially innovative enterprises. The system then automatically creates profiles of these enterprises, enters some of the data, and invites the enterprises to participate in the platform. This process is presented in Fig. 4.4. The key assumption is that it is possible to identify an innovative enterprise on the basis of its website. The stages of the process are as follows [8]: 1. Available databases are used to create an initial set of websites of enterprises that have participated in R&D projects or educational programmes. The set is created manually and serves as a starting point for the searching for innovative companies on the internet (see seed in Fig. 4.4). 2. Selected areas of the internet are browsed based on the initial set of websites and on others that are in any way related to them. As a result, entire websites of potentially innovative companies are acquired (see the crawling block in Fig. 4.4). 3. Each website (which may comprise several web pages) is classified by a set of multinomial naive Bayes classifiers to obtain the answers to two questions: (i) is this an enterprise’s website? and (ii) is the enterprise innovative or not? The result is a set of websites of potentially innovative enterprises (see the classification block in Fig. 4.4). 4. Next, information on enterprises, including their names, addresses, business profiles, and descriptions, as well as the forenames, surnames, and email addresses of key contacts is extracted from the websites. 
As a result, a set of potentially innovative enterprises is created (see the identification block in Fig. 4.4).
5. Each of the firms identified is invited to join the platform. Invitations are personalised and contain a completed company profile. Invitations can be accepted without users entering unnecessary data. The information on an enterprise can be supplemented with any details that better identify the enterprise. If an invitation is accepted, the enterprise to which it was sent becomes a full member of the platform and can use the system without restrictions (see the acceptance block in Fig. 4.4).
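The two-question classification in step 3 can be illustrated with a minimal multinomial naive Bayes classifier written from scratch. This is only a sketch: the toy training texts and class labels are invented for illustration, and the real system uses a committee of classifiers trained on full website content [8].

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            score = self.prior[c]
            for w in doc.lower().split():
                count = self.word_counts[c].get(w, 0)
                # Laplace smoothing avoids zero probabilities for unseen words.
                score += math.log((count + 1) / (self.totals[c] + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

# One classifier of the committee: is the enterprise innovative?
innovative_clf = MultinomialNB().fit(
    ["we develop novel machine learning products and patents",
     "our research team builds prototype sensor technology",
     "we sell used office furniture and stationery",
     "cheap car parts wholesale and retail"],
    ["innovative", "innovative", "other", "other"],
)
print(innovative_clf.predict("patents and prototype research products"))  # -> innovative
```

A second classifier with the same structure would answer question (i), whether the page belongs to an enterprise at all; chaining the two yields the set of potentially innovative enterprise websites.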
4.2.3 Recommendations

It is commonly believed that humans, bombarded with data from various electronic media, are reluctant or unwilling to adopt new information systems. For that reason,
58
4 Supporting Innovativeness and Information Sharing
Fig. 4.4 An outline of the process of acquiring innovative enterprises as participants of the platform. Initial addresses (URLs) identify areas of the internet to be browsed by crawlers that peruse entire websites of potentially innovative enterprises. Next, the classifier assesses the data collected and determines whether it pertains to innovative enterprises. The selected websites are used to extract basic data on potentially innovative enterprises and to create their initial profiles. The system sends invitations to selected enterprises, which then become participants of the platform upon accepting the invitations
the main assumption and objective of the proposed information platform is to provide information that is carefully selected, useful, important, relevant, and tailored to users' needs. To this end, the system analyses user profiles to recommend the information that is likely best-adapted to their expectations. The following categories of information are recommended to system users:

1. Innovations—improvements, products, and services that are created by scientific institutions or enterprises and are offered to entities that are interested in 'buying' knowledge. This may contribute to innovation advantages in the market;
2. Projects—new ideas of ventures and undertakings proposed by various users. Projects are recommended to enterprises or scientific units to find partners or investors. This may lead to innovations in cooperation;
3. Enterprises or scientific units—the recommendation algorithm attempts to match enterprises and scientific units on the basis of similarities in their profiles and their needs. This may result in the establishment of collaboration between such entities;
4. Experts—professionals who are willing to offer their knowledge and skills in the market are recommended to enterprises that seek specialists to solve technological challenges, or to institutions that wish to strengthen their research and development teams or to have their projects evaluated;
5. Conferences—various innovation events that are proposed to platform users based on analysis of their profiles and preferences.

Figure 4.5 outlines the recommendation process in the context of processed information. The recommendation algorithm analyses all available information on companies, scientific units, individuals, related innovations, and project and conference proposals. Each user can describe their expectations and decide what kind of information, of those listed above, should be presented to them.
Based on these premises, the algorithm generates unique information recommendations for each user. Figure 4.6 depicts how the recommendation system supports collaboration between business and science partners. A game of sorts can be observed between
Fig. 4.5 The recommendation system from the perspective of the information processed. The recommendation engine, as an input, receives profiles of people and entities, as well as users’ expectations. As a result, the engine suggests information (innovations, projects, partners, experts, or conferences) that accounts for users’ preferences and other available information
the parties concerned and the recommendation algorithm, which is the information flow centre. Entrepreneurs and scientists provide information about themselves, their enterprises or scientific units, their expectations, and their offers. The algorithm attempts to use this data to best match the parties. Recommendation systems typically use collaborative filtering, content-based filtering, or demographic or hybrid algorithms [2]. Collaborative filtering cannot be applied to the proposed system because innovations, projects, conferences, entities, and individuals are not evaluated; concepts such as innovation or expert are difficult to assess in terms of likes and dislikes. Since individuals’ genders and ages are irrelevant, the demographic approach is also impractical. Issues pertaining to recommendations are characterised by descriptions of innovations, projects, conferences, entities, and individuals. Such objects can be recommended solely on the basis of analyses of description similarities. In the case of the proposed system, content-based filtering seems to be the most appropriate recommendation system. The content-based filtering algorithm determines the similarity of objects using the distance similarity measure. For this purpose, a simple cosine measure can be used, which, for the text objects in this task, will be calculated as follows: It is assumed
60
4 Supporting Innovativeness and Information Sharing
Fig. 4.6 The recommendation system from the perspective of matching entrepreneurs and scientists. The recommendation algorithm serves as a kind of information flow centre between business and science partners. The algorithm attempts to match them based on information provided by both sides
that an individual, p, is described using the text profile, q_p. In practice, the profile comprises text documents that contain descriptions of an individual's achievements, information on their employment and education, and their expectations from the system—if they have been defined (see Fig. 4.5). The entire q_p profile is a query to the recommendation algorithm. There are also objects O_i, i = 1, ..., I, that should be recommended. These include innovations, projects, experts, entities, and conferences. Each can be defined by multiple text documents (see Fig. 4.5). The similarity of each object to the query is calculated as the cosine measure of two vectors that contain the features of these documents:

similarity(q_p, O_i) = (q_p · O_i) / (||q_p|| · ||O_i||),     (4.1)
where the numerator is the dot product of these vectors, and the denominator is the product of their lengths.
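Equation (4.1) can be computed directly from term-frequency vectors. The following is a minimal sketch; the vocabulary and sample texts are invented for illustration, whereas real profiles and objects comprise entire document collections.

```python
import math

def cosine_similarity(q, o):
    """Cosine of the angle between two feature vectors: dot product
    divided by the product of the vector lengths (Eq. 4.1)."""
    dot = sum(qi * oi for qi, oi in zip(q, o))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in o))
    return dot / norm if norm else 0.0

def to_vector(text, vocabulary):
    """Represent a text as raw term frequencies over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["recommendation", "algorithm", "big", "data", "battery"]
profile = to_vector("big data recommendation algorithm for big data", vocab)
offer = to_vector("fast recommendation algorithm for big data", vocab)
print(round(cosine_similarity(profile, offer), 3))  # -> 0.949
```

The measure is insensitive to document length: only the angle between the vectors matters, which is why it suits comparing short expectation statements with long object descriptions.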
Extracting document features, or keywords, can be burdensome. In practice, full-text search engines are used. These index documents and calculate their similarity using an appropriate similarity measure, such as the cosine. This avoids the problem of imprecise keyword extraction. Full-text search engines, such as Apache Lucene,3 enable advanced indexing and searching of text documents. As a result, a list of objects is generated, sorted by their similarity to the query; in the proposed system, it is the similarity of an individual's profile to innovations, projects, experts, entities, and conferences. The list is a recommendation for the individual, p. Details on how the Apache Lucene algorithm works are described in [13].
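A full-text engine such as Lucene weights term frequencies by inverse document frequency before applying a similarity measure such as the cosine. The sketch below imitates that behaviour in plain Python to show how a ranked recommendation list emerges; it is a simplification of Lucene's actual scoring (which also applies length normalisation and other factors), and the example objects are invented.

```python
import math
from collections import Counter

def tfidf_rank(query, documents):
    """Return document ids sorted by TF-IDF cosine similarity to the query."""
    n = len(documents)
    tokenised = {doc_id: Counter(text.lower().split())
                 for doc_id, text in documents.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in tokenised.values():
        df.update(counts.keys())
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}

    def vector(counts):
        return {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vector(Counter(query.lower().split()))
    scores = {doc_id: cosine(q, vector(c)) for doc_id, c in tokenised.items()}
    return sorted(scores, key=scores.get, reverse=True)

objects = {
    "innovation-1": "fast recommendation algorithm for content recommendation",
    "innovation-2": "fast charge algorithm for lithium ion batteries",
    "expert-1": "expert in big data and recommendation algorithms",
}
print(tfidf_rank("fast recommendation algorithm big data", objects))
```

The sorted list is exactly the shape of output described above: every indexed object, ordered by its similarity to the user's profile, with the closest matches first.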
4.2.4 Recommendations in Practice

The operation of the recommendation mechanism can be illustrated using the following example: Suppose there is a firm called UniqueIT, which intends to develop a new product: a quick recommendation system that uses information contained in large datasets (so-called 'big data'). Due to being insufficiently skilled in this area, UniqueIT must consider three options if it wants to achieve its goals: (1) purchase a ready-made innovation; (2) establish collaboration with a scientific unit or an enterprise to develop an innovative product jointly; or (3) recruit knowledgeable experts and develop the product independently. It should be noted that these strategies are not mutually exclusive and can be implemented simultaneously. UniqueIT has specified its expectations for each of the above strategies. They are included in the Sought on the market column in Table 4.1. Recommendations regarding each of the strategies of UniqueIT prepared by the recommendation system discussed in this work (Inventorum) are included in the Recommended by the system column in Table 4.1. The recommendations were compared with the suggestions of the Google search engine (the Selected by Google column in Table 4.1). For the sake of clarity, the recommendations have been limited to two items for each task. Google suggests scholarly articles and books rather than specific innovations, scientific units, enterprises, and experts. This is unsurprising: Google, which is a global search engine, indexes the entire internet and suggests the most popular and semantically relevant documents, but is unable to explore the context of its recommendations. Despite holding no global knowledge, the proposed system offers more precise suggestions, because it incorporates a built-in contextual recommendation mechanism and relies on a dedicated database. Unfortunately, not all of the system's suggestions are correct.
When UniqueIT enquired about a 'fast recommendation algorithm' innovation, it received only one suitable suggestion; the other was irrelevant, as it pertained to a different field of technology. There were also no recommendations regarding partners to implement strategy 2, which can be explained by the absence of relevant information in the database; the recommendations of experts, however, were relevant.
3 https://lucene.apache.org.
Table 4.1 Recommendations for UniqueIT produced by Inventorum and the corresponding search results achieved using Google. There are three strategies: (1) purchase a ready-made innovation; (2) establish collaboration and develop an innovative product jointly; or (3) recruit knowledgeable experts and develop the product independently

Strategy 1
  Sought on the market: Innovations: Fast recommendation algorithm
  Recommended by the system: Innovation: Algorithm and implementation for fast computation of content recommendation; Innovation: Fast charge algorithms for Lithium-Ion batteries
  Selected by Google: Article: A fast recommendation algorithm for social tagging systems: A ...; Article: Fast algorithms to evaluate collaborative filtering recommender systems

Strategy 2
  Sought on the market: Industry partners: A software team that uses agile software development, Java, and big data
  Recommended by the system: No results; No results
  Selected by Google: Article: Making 'big data' projects flexible and timely with Agile software; LinkedIn portal: Software engineering manager (Big data/Java/Agile) job ...

  Sought on the market: Research partners: Supports decision-making processes including technology transfer from science to enterprise
  Recommended by the system: Research unit: Information Processing Institute—National Research Institute; No results
  Selected by Google: Book: Information society: new media, ethics, and postmodernism; Book: Advancing federal sector health care: A model for technology transfer

Strategy 3
  Sought on the market: Experts: Big data, recommendation algorithms
  Recommended by the system: Expert: Protasiewicz J. (PhD); Expert: Romaniuk R. (Prof)
  Selected by Google: Article: How do recommendation systems know what you might like?; Article: Big data and recommender systems
The above example, which illustrates the proposed system’s operation, proves that contextual recommendations can quickly provide relevant information to users if a suitable database is available. It is clear that the system cannot compete with global search engines, such as Google or Bing, in terms of the amount of indexed data; nevertheless, a contextually oriented knowledge base and its algorithm may prove better-suited to specific problems, such as the recommendation of innovations, partners, or experts.
Fig. 4.7 An overview of how the system serves information to its users. There are three information channels: (i) quick recommendations, (ii) full recommendations, and (iii) the search engine. They supply users with news, recommendations, and reports, respectively
4.2.5 Recommendations Distribution

System users receive information via three channels: (i) quick recommendations, (ii) full recommendations, and (iii) the search engine. Each plays a distinctive role in transferring information between users (Fig. 4.7). As soon as users log into the system, they receive messages to their desktops (quick recommendations). These contain the latest personalised information recommended by the system. This is comparable to a customer viewing advertisements on a window display—except that in this case, the advertisements are prepared exclusively for a single user. If a user is interested in the messages, they will likely wish to receive more offers. The system then presents the full scope of recommended information based on the expectations and profiles of individuals or the institutions with which they are affiliated. This case can be compared to the hypothetical scenario of a customer entering a shop and receiving a personalised offer that incorporates all products that are in stock. Users may be uninterested in the information recommended to them, but still want to search for it. Should this be the case, they can use a full-text viewer to ask questions using natural language. They then receive a list of pieces of information that serve as answers to their queries. This is comparable to a customer viewing a shop's merchandise without receiving any assistance.
4.3 Summary

The goal of this work was to propose an IT system that enhances innovation and cooperation between business and science. The system must implement the concept of open innovation through (i) enabling open access to information, (ii) ensuring the diversity of that information, and (iii) supporting cooperation by means of information recommendations. These objectives are fulfilled by the proposed recommendation system, Inventorum, which boasts a dedicated architecture and carefully designed processes. Crawlers collect data on innovation and innovative enterprises; machine learning algorithms extract information on those enterprises, which enables users to receive personalised invitations to join the platform. The heart of the system is the contextual recommendation algorithm that proposes information to users that best suits their needs. The purpose of such recommendations is to reduce the time necessary for users to find the right innovations, potential business partners, experts, and conferences. The originality of the knowledge recommendation system, Inventorum, depicted in this chapter is as follows. First, it is a dedicated knowledge recommendation system for innovativeness. Specifically, the system supports innovation and information sharing across six concepts: innovation, project, conference, scientific unit, enterprise, and expert. In addition, the system supplies a user with news, recommendations, and reports through three concurrent information channels: quick recommendations, full recommendations, and the search engine. Second, the system incorporates a unique algorithm that searches for innovative enterprises on the internet and invites them to participate in the system. Third, the system implements the concept of open innovation and is open in itself. Inventorum is available to the public free of charge at: https://inventorum.opi.org.pl/en. The user interface is presented in the Polish and English languages.
Although the system is complete, it may require further improvement in the future. Users’ comments and suggestions regarding its operation are most welcome.
References

1. Bhimani H, Mention A-L, Barlatier P-J (2019) Social media and innovation: A systematic literature review and future research directions. Technol Forecast Soc Change 144:251–269
2. Bobadilla J, Ortega F, Hernando A, Gutiérrez A (2013) Recommender systems survey. Knowl Based Syst 46:109–132
3. Bogers M, Chesbrough H, Moedas C (2018) Open innovation: Research, practices, and policies. Calif Manag Rev 60(2):5–16
4. Martin C, Bror S (2013) Open innovation 2.0: A new paradigm. OISPG White Paper, pp 1–12
5. Deborah D (2016) Organizing for innovation in complex innovation systems. Innovation, pp 1–5
6. Dominika M, Katarzyna W-J, Joanna Z (2017) Open innovation model in enterprises of the SME sector—sources and barriers. In: Information systems architecture and technology: Proceedings of 37th international conference on information systems architecture and technology—ISAT 2016—Part IV. Springer, pp 97–104
7. Mirkovski K, von Briel F, Lowry PB (2016) Social media use for open innovation initiatives: Proposing the semantic learning-based innovation framework (SLBIF). IT Prof 18(6):26–32
8. Mirończuk M, Protasiewicz J (2015) A diversified classification committee for recognition of innovative internet domains. In: International conference: Beyond databases, architectures and structures. Springer, pp 368–383
9. Muninger MI, Hammedi W, Mahr D (2019) The value of social media for innovation: A capability perspective. J Bus Res 95:116–127
10. OECD, Eurostat (2005) Oslo Manual—Guidelines for collecting and interpreting innovation data, 3rd edn. OECD, Organisation for Economic Cooperation and Development
11. Protasiewicz J (2017) Inventorum—a recommendation system connecting business and academia. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 1920–1925
12. Protasiewicz J (2017) Inventorum: A platform for open innovation. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 10–15
13. Protasiewicz J, Pedrycz W, Kozłowski M, Dadas S, Stanisławek T, Kopacz A, Gałęzewska M (2016) A recommender system of reviewers and experts in reviewing problems. Knowl Based Syst 106:164–178
14. Recker J, Malsbender A, Kohlborn T (2016) Learning how to efficiently use enterprise social networks as innovation platforms. IT Prof 2(18):2–9
15. Revilla E, Rodriguez-Prado B, Cui Z (2016) A knowledge-based framework of innovation strategy: The differential effect of knowledge sources. IEEE Trans Eng Manag 63(4):362–376
16. Copeland P, Savoia A (2011) Entrepreneurial innovation at Google. Computer 44(4):56–61
17. Vignieri V (2021) Crowdsourcing as a mode of open innovation: Exploring drivers of success of a multisided platform through system dynamics modelling. Syst Res Behav Sci 38(1):108–124
18. Wisnioski M (2015) The birth of innovation. IEEE Spectr 52(2):40–61
19. Xinbo S, Mingchao Z, Weixin L, Mengqin H (2019) Research on the synergistic incentive mechanism of scientific research crowdsourcing network: Case study of InnoCentive. Manag Rev 31(5):277
20. Zhang J, Xia F, Ning Z, Bekele TM, Bai X, Su X, Wang J (2016) A hybrid mechanism for innovation diffusion in social networks. IEEE Access 4:5408–5416
Chapter 5
Selected Algorithmic Developments
This chapter presents details on the data acquisition and information extraction algorithms that are used in the reviewer and expert recommendation system1 and the innovation promotion system, Inventorum.2 The chapter's primary objectives are to show these algorithms' underlying details and to justify their selection among other possible solutions. The chapter also proposes a novel approach to the issues discussed. It focuses initially on how to acquire data on potential reviewers and experts—including on the topical crawlers that are used to extract data from scientists' websites using a dedicated Conditional Random Fields (CRF) algorithm—and explains the technical processes of data acquisition from open-access publication databases. The chapter then outlines algorithms that transform data into information to create profiles that describe the knowledge and experience of potential reviewers and experts. It discusses various approaches to the classification of publications by scientific domain and discipline, such as flat and hierarchical organisational structures of classifiers, and multilingualism. As a result, a multilingual classification system is proposed. Another issue related to information extraction from data is the disambiguation of authors of publications. In response, the following approaches are proposed: (i) an algorithm based on heuristic rules; (ii) a hierarchical clustering algorithm with cluster similarity determined by heuristic rules; and (iii) a hierarchical clustering algorithm with cluster similarity estimated by classifiers. The last algorithm is used to create individuals' profiles, and involves extraction of keywords from publications. For Polish-language text, an innovative algorithm called the Polish Keyword Extractor is recommended. English-language text is analysed using standard methods described in the literature. An algorithm that searches for potentially innovative enterprises on the internet
1 https://recenzenci.opi.org.pl/sssr-web/site/home?lang=en.
2 https://inventorum.opi.org.pl/en/.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Protasiewicz, Knowledge Recommendation Systems with Machine Intelligence Algorithms, Studies in Computational Intelligence 1101, https://doi.org/10.1007/978-3-031-32696-7_5
is also used. This classification model assesses whether enterprises are potentially innovative based on their websites. The chapter is structured as follows. Section 5.1 discusses data acquisition algorithms: extractors, importers, and crawlers. Section 5.2 presents document classification algorithms and Sect. 5.3 describes author disambiguation algorithms. Both sections include experimental validation of the algorithms on a publication dataset. Section 5.4 presents keyword extraction algorithms for texts—particularly the algorithm dedicated to the Polish language. Section 5.5 covers a mechanism that assesses whether specific enterprises are potentially innovative, based on their websites. The chapter ends with Sect. 5.6, a discussion of the literature. It must be noted that the information included in this chapter relies on other works on algorithms used to classify publications [16, 21], to disambiguate the authors of publications [20], to extract keywords from text [5], and to evaluate enterprises' innovativeness based on their websites [18, 19] and on summary works [14, 15, 17, 22], as well as incorporating the author's updated perspective on these issues.
5.1 Data Extraction and Crawling

The reviewer and expert recommendation system contains a data acquisition module that gathers raw data on potential reviewers and experts. As the data comes from various sources, dedicated data extraction and crawling algorithms are used. As with the architecture of the wider system, the data acquisition component comprises a modular structure. This allows specific processes to be separated: data extraction and crawling can work independently by using different data sources. In both cases, the data is gathered in a common database (see Fig. 1.3 in Chap. 3).
5.1.1 The Data Extraction Algorithm

Data extractors implement algorithms that acquire data from external open-access publication databases. In view of their diverse structures, each database is matched with an individual extractor, which: (i) parses data from the database according to an individual set of rules; (ii) extracts information on publications from that data; and (iii) transforms the data into a temporary representation in the form of an Extensible Markup Language (XML) file with a standardised structure. Finally, the importer, which is the same for all extractors, transforms the XML files into records and saves them in a local database. This enables the extractors to operate independently and to process data from various data sources simultaneously (Fig. 5.1). For practical use, the scope of data that is gathered by extractors has, until now, been limited to publications; extractors can also, however, be adapted to be used for other data categories. Despite data acquisition being a purely technical process, it presents challenges that require compelling heuristic solutions. For instance, an
Fig. 5.1 A schematic representation of the data extraction and data import to the structured databases that are included in the data acquisition module in the reviewer and expert recommendation system. An extractor parses data from a database into a file; then, an importer transforms the files and stores the data in a database
Fig. 5.2 How the topical crawler works with the CRF model, which is an element of the information extraction module in the reviewer and expert recommendation system. The crawler gathers data from the internet and transforms selected web pages into text; then, the CRF model retrieves desirable information (recognises entities) in the text
importer can detect whether a publication that is being processed is a duplicate of another publication that exists in a database—even in cases of slight differences between the two, for example in the title of a publication. If such an issue is detected, the importer attempts to combine the duplicates into a single record by adding missing attributes to the original publication, such as keywords or abstracts.
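The importer's duplicate handling might be sketched as follows. The similarity threshold, the record fields, and the use of a character-level edit ratio are illustrative assumptions, not the importer's actual rules.

```python
from difflib import SequenceMatcher

def is_duplicate(title_a, title_b, threshold=0.9):
    """Heuristic check for near-identical publication titles:
    normalise whitespace and case, then compare edit similarity."""
    norm = lambda t: " ".join(t.lower().split())
    return SequenceMatcher(None, norm(title_a), norm(title_b)).ratio() >= threshold

def merge_records(existing, incoming):
    """Copy attributes that the stored record is missing from the duplicate."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged

stored = {"title": "A recommender system of reviewers and experts", "abstract": None}
new = {"title": "A Recommender System of Reviewers and Experts.",
       "abstract": "We propose ...",
       "keywords": ["recommendation"]}

if is_duplicate(stored["title"], new["title"]):
    stored = merge_records(stored, new)
print(sorted(stored))  # -> ['abstract', 'keywords', 'title']
```

The threshold trades precision against recall: a lower value merges more aggressively but risks combining genuinely different publications with similar titles.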
5.1.2 The Crawling Algorithm

To obtain additional information on publications from the websites of scholars and professionals, a topical crawler was proposed. Unlike extractors, the crawler works on unstructured data. It has no initial knowledge of the structure of the websites it analyses. For that reason, the algorithm utilises heuristics and data exploration methods to identify specific and relevant data entities. This is a complex and demanding task given how significantly websites can differ and how often they can be updated. To simplify the algorithm, the task of collecting information on the publications was divided into a data gathering and identification phase, and an entity recognition and information extraction phase (Fig. 5.2). The first phase involves the crawling of websites and their pages that may contain information on the publications of potential reviewers and experts. It assumes that
the scientists’ URLs are known. The process of crawling is performed as follows: A document (website) is downloaded for a specific address. All links to other URLs are then identified in the document. Heuristic rules are used to establish whether a particular address leads to websites containing scientists’ publications. The heuristic rules search for keywords, such as ‘publication’, ‘article’, ‘research’, or ‘scientific’ in the analysed URL and its description. Only the websites that are in any way connected with a specific scientist and any publication (by any author) are downloaded as HTML documents. Specific pieces of text are then selected from the document, without altering its original structure: paragraphs, blocks of elements, table rows, and lists are saved in separate lines of text. In the second phase, a model based on CRF scans every line of text and determines whether a document contains information on the publications of a specific individual. It appears that CRF, as a discriminative and general graph-based model [25], is more suitable for this task than generative approaches, such as hidden Markov models. The CRF model was developed using a manually-tagged set of transformed websites. It focuses not only on the words being analysed, but also on the neighbouring ones. The model is also sensitive to additional features, such as the length of text lines, gazetteers, popular surnames in various countries, regular expressions that enable document detection, acronyms, and quotations. Understanding these features allows the CRF model to detect lines of text that are included in a document and that contain information on publications.
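The keyword heuristic used to decide which links are worth following can be sketched as a simple predicate. The hint list follows the keywords named above; the example URLs are invented.

```python
# Keywords that suggest a link leads to a list of publications.
PUBLICATION_HINTS = ("publication", "article", "research", "scientific")

def looks_like_publication_page(url, link_text=""):
    """Heuristic used by the crawler to decide whether a link is worth
    following: the URL or its anchor text mentions a publication-related
    keyword."""
    haystack = (url + " " + link_text).lower()
    return any(hint in haystack for hint in PUBLICATION_HINTS)

print(looks_like_publication_page("https://example.edu/~jsmith/publications.html"))  # -> True
print(looks_like_publication_page("https://example.edu/~jsmith/teaching.html", "Courses"))  # -> False
```

Pages that pass this filter are downloaded and flattened into lines of text, which the CRF model then labels line by line to locate publication entries.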
5.1.3 Data Acquisition in Practice

Extractors and topical crawlers are implemented in the data acquisition module, which also contains administrative tools that are used to configure, run, and monitor individual data acquisition processes. Theoretically, extractors and crawlers are compatible with all data types; in practice, however, such programs must be adapted to specific data structures. To evaluate the concept of the data acquisition module, it suffices to implement these algorithms to work with publications from selected data sources. Extractors have interfaces and parsers that are compatible with open data sources, such as The DBLP Computer Science Bibliography,3 PubMed,4 The Central European Journal of Social Sciences and Humanities (CJSH),5 and other local sources. The steps of the data acquisition process are as follows:

1. Select a unique surname of an individual from the reference database (here: the Polish Science Database (Nauka Polska)6).
3 https://dblp.org.
4 https://pubmed.ncbi.nlm.nih.gov.
5 http://cejsh.icm.edu.pl.
6 www.nauka-polska.pl.
2. All extractors download publications from all supported data sources; the publications must, in some way, be connected with the selected name.
3. Parse the data and save it in a local database. Revisit step 1 if it is necessary to process another name.

Crawlers are designed to extract information on publications from scientists' personal websites. Crawlers incorporate not only data acquisition algorithms, but also the CRF model, which is used to identify and extract information on publications from data. The CRF model relies on the ParsCit open source package [4]. A simplified explanation of how crawlers work is presented below:

1. Select a unique URL of a scientist from the reference database (the Polish Science Database).
2. Locate the website on the internet; crawl the website and its pages; extract information on the publications of a particular individual.
3. Save the data in a local database. Revisit step 1 if it is necessary to process another URL.

During the system evaluation process, approximately 160,000 records on scientists were analysed; their names are included in the reference base (the Polish Science Database). Five million publications of various degrees of completeness were obtained. Most of the information was obtained by extractors. Crawling websites turned out to be highly susceptible to errors. For that reason, the data could only be presented to system users as suggestions that require completion and approval. The crawled data proved useful only after it had been corrected manually via the user interface, rather than being accepted automatically.
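The extraction steps above can be sketched as a small orchestration loop. The class and method names (fetch_publications, save) are illustrative placeholders, not the system's actual interfaces; the stub extractor stands in for a real parser of an external database.

```python
class DBLPStub:
    """Stand-in for a real extractor that queries one external data source."""
    def fetch_publications(self, name):
        # A real extractor would query the source and parse the response
        # into standardised records; here we return a fixed example.
        return [{"source": "DBLP", "author": name, "title": "Example paper"}]

class LocalDB:
    """Stand-in for the common local database shared by all extractors."""
    def __init__(self):
        self.records = []
    def save(self, record):
        self.records.append(record)

def run_extraction(reference_names, extractors, database):
    # Step 1: iterate over unique names from the reference database.
    for name in reference_names:
        # Step 2: every extractor queries the data source it supports.
        for extractor in extractors:
            # Step 3: parse the results and save them locally, then
            # continue with the next name.
            for publication in extractor.fetch_publications(name):
                database.save(publication)

db = LocalDB()
run_extraction(["Kowalski", "Nowak"], [DBLPStub()], db)
print(len(db.records))  # -> 2
```

Because each extractor only needs to implement fetch_publications, new data sources can be added without touching the loop, which mirrors the modular structure described in Sect. 5.1.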
5.2 Classification of Publications

The primary goal of the classification of publications is to arrange them into a hierarchical structure of scientific disciplines. As a result, the disambiguation of publications' authors can be improved. It is also possible to enhance the visual presentation of groups of publications, individuals, and keywords in IT systems. The classification algorithm is a component of the information extraction module in the reviewer and expert recommendation system.
5.2.1 Problem Definition

Generally speaking, the task involves assigning text documents (publications) d_i ∈ D, i = 1, ..., I, to respective classes (scientific disciplines) c_j^l ∈ C. Classes are organised in a hierarchical tree C, which comprises l = 1, ..., L levels, each of which contains j = 1, 2, ..., J_l classes. The number of classes, J_l, differs across the tree's
Fig. 5.3 A graphical illustration of the classification problem. A document is transformed into features, which have different meanings and can be presented in various languages. Then, it is assigned to a class from a hierarchical tree. For example, the document d_i is assigned to the class c_3^3
levels, l. The aim of the classification is to arrange documents (publications) d_i into a hierarchical tree of classes (scientific disciplines) c_j^l. Note that documents d_i may be nonhomogeneous: they may comprise various parts, such as titles, summaries, introductions, chapters, tables, figure descriptions, and references. A document may also contain passages written in different languages; a common layout example is two titles and two abstracts in two different languages followed by core text in English. The components of a document are its features f_i^t(lang), t = 1, 2, ..., T, where t indicates the type of feature (e.g. title, abstract, keyword, or chapter), lang indicates the natural language of the text, and i indicates the document index. The method aims to transform documents into their features and to create document–class pairs in the following manner:

d_i ∈ D ⇒ f_i^t(lang) ⇒ {d_i, c_j^l}.    (5.1)

Figure 5.3 illustrates the problem defined above.
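The transformation in (5.1) can be sketched as follows: a possibly multilingual document is split into typed, language-tagged features. The diacritics-based language detector is a crude, hypothetical stand-in for a real language identifier.

```python
# Sketch of d_i -> f_i^t(lang): split a document into typed features and tag
# each with a detected language. The detector below is a toy heuristic
# (Polish diacritics imply Polish); a real system would use a proper detector.

def detect_lang(text):
    return "pl" if any(ch in "ąćęłńóśźż" for ch in text.lower()) else "en"

def to_features(doc):
    """doc: dict mapping feature type t -> text; returns {(t, lang): text}."""
    return {(t, detect_lang(text)): text for t, text in doc.items()}

doc = {"title": "Głębokie sieci neuronowe",
       "abstract": "Deep neural networks for text classification."}
features = to_features(doc)
print(sorted(features))   # [('abstract', 'en'), ('title', 'pl')]
```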
5.2.2 Classification Algorithms and Procedures

Multiple classification algorithms exist. The question of which of them to select for a particular task has been analysed in numerous studies, such as [9, 11]. A broader discussion of the typical steps of classification model creation, such as data preparation, feature selection, dimensionality reduction, and model training algorithms and their validation, reaches beyond the scope of this chapter. Based on the literature, expert knowledge, and preliminary experiments, the most appropriate algorithms for the classification problem defined above were selected. Multinomial naive Bayes (MNB), a well-known algorithm, was chosen as the primary tool to create the text data classification models. Its undeniable advantage lies in its simplicity. The algorithm is surprisingly effective, despite the rather unrealistic assumption that no relationship exists between the features being analysed (in this case, words in sentences). Another algorithm is the support vector machine (SVM), which, in some cases, can classify more accurately than MNB, but is more computationally complex. The third option is the multilayer perceptron (MLP), an artificial neural network that is a universal approximator and can solve nonlinear classification problems. All of these algorithms are widely known and well described in the literature [3]; they, therefore, merit no further discussion in this chapter. The classification algorithm and training data form a classification model, whose objective is to implement the multiclass or multilabel classification strategy. In the multiclass approach, a classifier assigns only a single class c_j^l to document d_i; in the multilabel approach, a classifier can assign multiple classes c_j^l to document d_i simultaneously [26]. In both cases, classes are selected from a limited set of probable solutions, c_j^l ∈ C.
In practice, a classifier can have one or more outputs that are characterised by the probability of an analysed document belonging to a specific class. A classifier with one output implements the multiclass strategy and can distinguish only two classes; a classifier with multiple outputs, each assigned to a specific class, can implement both the multiclass and multilabel strategies. This architecture is limited by the requirement that a classifier must have exactly as many outputs as there are possible classes, which may prove ineffective when the number of classes is very large. A more flexible classification organisation is the one vs others model, in which multiple classifiers are trained, but a single classifier is capable of recognising only one specific class. Adding a class to the set of solutions then requires the training of an additional model.
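The one vs others organisation can be sketched with scikit-learn, the library named later in Sect. 5.2.5. The toy documents and labels below are invented for illustration; `OneVsRestClassifier` trains one binary MNB model per class, so adding a class means fitting one more model.

```python
# One-vs-others organisation: one binary classifier per class (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["protein folding and cells", "stock markets and inflation",
        "gene expression in cells", "monetary policy and inflation"]
labels = ["biology", "economics", "biology", "economics"]

model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(MultinomialNB()))
model.fit(docs, labels)
print(model.predict(["inflation and markets"]))
```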
5.2.3 Flat Versus Hierarchical Classification

Classification aims to organise scientific publications judiciously. To categorise the documents, the Ontology of Scientific Journals (OSJ) [1] was used, which is a three-level, hierarchical tree of scientific disciplines, domains, and sub-domains. The assignment of publications to specific categories is a complex task, which requires a set of
Fig. 5.4 A flat (a) and hierarchical (b) organisation of classifiers, whose task is to organise publications in a hierarchical structure of scientific disciplines. In the case of the flat structure, each level in the hierarchy of categories is modelled by an independent classifier; in the case of the hierarchical structure, a classifier that models a specific level in the hierarchy uses the classification results of the previous level
classifiers that can model the individual levels of a hierarchical tree of categories. Two types of organisation of such classifiers can be distinguished, flat and hierarchical, which are depicted in Fig. 5.4. In the flat organisation of classifiers, each level of the hierarchy of categories is modelled by a distinct classifier (see Fig. 5.4a). This means that the model assigns publications d_i to classes c_j^l only at the hierarchy level l (e.g. scientific disciplines) and ignores the classification results at the higher level l − 1, which contains more general categories (e.g. scientific domains). Conversely, in the hierarchical organisation of classifiers, classifiers that model a specific level in the hierarchy use the classification results from the previous level (see Fig. 5.4b). Consequently, document d_i is classified ever more accurately; it is assigned from general to more specialised categories at each level of classification. This solution is not without its disadvantages: any error at a higher level is propagated to the lower (more precise) categories.
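The two organisations in Fig. 5.4 can be contrasted with a toy sketch. The keyword "classifiers" below are hypothetical stand-ins for trained models; the point is only where the level-1 result flows.

```python
# Toy contrast of the flat and hierarchical organisations (Fig. 5.4).
# The keyword rules are hypothetical stand-ins for trained classifiers.

def level1(doc):                          # domain classifier (level l = 1)
    return "Natural sciences" if "physics" in doc else "Applied sciences"

def level2_flat(doc):                     # field classifier, ignores level 1
    return "Physics" if "physics" in doc else "Engineering"

def level2_hier(doc, domain):             # field classifier, uses level-1 result
    if domain == "Natural sciences":
        return "Physics" if "physics" in doc else "Biology"
    return "Engineering"                  # a level-1 error propagates here

doc = "quantum physics of solids"
domain = level1(doc)
print(domain, "/", level2_flat(doc))          # flat: levels independent
print(domain, "/", level2_hier(doc, domain))  # hierarchical: level 2 conditioned on level 1
```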
5.2.4 Monolingual Versus Multilingual Classification

Input classification data is represented as vectors of features, such as term frequency–inverse document frequency (TF–IDF), which are created from the available elements included in each publication, such as titles, abstracts, and keywords. Publications may be written in two or more languages; this pertains mainly to titles, abstracts, and keywords. With consideration of this issue, three classification strategies were proposed: monolingual, simple multilingual (the highest probability model), and multilingual.
Fig. 5.5 The multilingual classification system, a component of the information extraction module in the reviewer and expert recommendation system. The system comprises three layers: (i) a data preprocessing module; (ii) a set of monolingual classifiers, which independently classify a document; and (iii) a decision module, which assigns the document to the most probable category
Monolingual classification
In justified cases, it may be assumed that all documents (publications) from a specific source (database) are written in a single language. If a classified document—and, formally speaking, all of its features f_i^t—is written in one language, a classifier dedicated to that language can be used. A multilingual document may be ignored as a data error.

Simple multilingual classification
Although ignoring multilingualism cannot be justified, the issue may cause problems negligible enough that they do not require the development of a complex multilingual classification system. Let us assume that documents exist of which some features (elements) appear in different languages. A midway solution might be the maximum probability approach, according to which a document is classified by a set of monolingual classifiers m = 1, ..., M simultaneously. A classification result is considered valid only if it comes from the classifier that has assigned the document to a class with the highest probability:

d_i ∈ D ⇒ max{p_j^l(m)} ⇒ {d_i, c_j^l}.    (5.2)
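The maximum probability rule (5.2) amounts to running every monolingual classifier and keeping the single most confident class assignment. In this sketch, the classifier outputs are invented numbers.

```python
# Sketch of the maximum probability rule (5.2): keep the decision of the
# classifier that is most confident about its class assignment.

def max_probability(classifiers, doc):
    """classifiers: callables doc -> (class_label, probability)."""
    return max((clf(doc) for clf in classifiers), key=lambda cp: cp[1])

pl_clf = lambda doc: ("Health sciences", 0.31)   # Polish model, low confidence
en_clf = lambda doc: ("Health sciences", 0.88)   # English model, high confidence
print(max_probability([pl_clf, en_clf], "an English abstract ..."))
# ('Health sciences', 0.88)
```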
It is assumed that the monolingual classifier that is linguistically best-suited to a particular document yields the highest probability of a document belonging to a specific class.

Multilingual classification
Multilingual documents are common, and their classification demands a more advanced approach. A complex classification system comprising three layers is proposed: (i) a data preprocessing module; (ii) a set of monolingual classifiers; and (iii) a decision module (Fig. 5.5).
The crux of the system is as follows: a document to be classified, d_i, is transformed by the data preprocessing module into a list of its features f_i^t, which are represented as vectors. The data is analysed simultaneously by the set of monolingual classifiers m = 1, ..., M. As a result, each classifier m indicates the probability p_{i,j}^l of the document d_i belonging to category c_j^l. It must be noted that the degrees of probability are specified for each document d_i, i = 1, ..., I, by each model m = 1, ..., M, for each potential category c_j^l, j = 1, ..., J, at each level in the hierarchy l = 1, ..., L. In the last step, the multilingual decision module determines the final category by analysing the degrees of probability specified by the monolingual models. The decision module comprises classifiers that implement the logistic regression function. As the one vs. others classification rule is applied, J − 1 classifiers are prepared for each level l in the hierarchy of classes. The module then makes the following decisions:

d_i ∈ c_j^l if h(φ_i^l) > h_th,  d_i ∉ c_j^l if h(φ_i^l) ≤ h_th,    (5.3)

where h_th is a decision threshold and h(φ_i^l) is a logistic regression function calculated as follows:

h(φ_i^l) = 1 / (1 + e^(−φ_i^l)).    (5.4)

Its argument, φ_i^l, is a weighted sum of the probabilities that are generated by the monolingual classifiers:

φ_i^l = Σ_{j=1}^{J} Σ_{m=1}^{M} w_{j,m}^l · p_{i,j,m}^l,    (5.5)

where w_{j,m}^l are the weights of a regression model produced using quadratic programming [12].
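Equations (5.3)-(5.5) can be transcribed directly: the decision module forms the weighted sum of the per-classifier, per-class probabilities and thresholds its logistic transform. The probabilities, weights, and threshold below are made-up examples, not values from the book.

```python
import math

def h(phi):
    """Logistic regression function, Eq. (5.4)."""
    return 1.0 / (1.0 + math.exp(-phi))

def decide(probs, weights, h_th=0.5):
    """Eq. (5.3): assign the document to the class iff h(phi) > h_th, where
    phi is the weighted sum (5.5) over classes j and classifiers m."""
    phi = sum(w * p
              for p_row, w_row in zip(probs, weights)
              for p, w in zip(p_row, w_row))
    return h(phi) > h_th

# Made-up example: M = 2 monolingual classifiers, J = 2 classes.
probs = [[0.7, 0.1], [0.6, 0.2]]      # p[m][j]
weights = [[1.5, -0.5], [1.2, -0.4]]  # w[m][j], learned by the regression model
print(decide(probs, weights))         # True (phi = 1.64, h ≈ 0.84 > 0.5)
```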
5.2.5 Classification of Publications in Practice

The algorithms and classification structures presented in the previous sections were verified experimentally. The aim of the classification is to ascribe publications d_i, which have been gathered by the data acquisition module, to the relevant scientific disciplines c_j^l that are defined in the OSJ [1]. The results of the experiments are presented below. They cover aspects including data preprocessing; data representation; flat and hierarchical organisational structures of classifiers; and monolingual and multilingual classification. The classification models were developed on the basis of the MNB, SVM, and MLP algorithms, and were implemented using scikit-learn [13]. The quality of classification is evaluated using the classic F-score (5.6) and tenfold cross-validation. This measure reflects the quality of the models fairly
accurately because it combines the precision (5.7) and recall (5.8) measures, as specified below:

F-score = 2 · (recall · precision) / (recall + precision),    (5.6)

precision = TP / (TP + FP),    (5.7)

recall = TP / (TP + FN),    (5.8)
where TP, FP, and FN indicate the true positive, false positive, and false negative outcomes, respectively, of a model with respect to its expected values.

The dataset
The experiments were conducted using a set of data on publications from an adaptive knowledge base that contains information on potential reviewers [17, 22]7 and from the Inventorum platform, which has been developed to support innovation [18, 19].8 These databases contain almost five million pieces of metadata on publications; not all of it, however, is complete. Although all pieces of data have titles, only two-thirds contain keywords and abstracts in Polish or English. The classification methods are supervised learning algorithms. They require specific training sets that contain 'publication–target class' pairs (labels). Labelling the data manually would require a lot of human effort. For that reason, a semiautomatic approach based on the use of OSJ was implemented. OSJ is a hierarchical classification which has six top-level domains, c_{j1}^{l=1}, j1 = 1, ..., 6; 22 fields, c_{j2}^{l=2}, j2 = 1, ..., 22; and 176 subfields, c_{j3}^{l=3}, j3 = 1, ..., 176. The semiautomatic creation of the training set involved matching the names of the journals in which the publications were published (this information is contained in the publications' metadata) to the names of journals in the OSJ classification. This made it possible to label the publications (domains and fields). The procedure automatically generated a set of marked publications; in some cases, however, human assistance was needed because some journals had not been included in the OSJ. As a result, a set of pairs {d_i, c_j^l}, which contained 2,733,991 publications in English and 119,312 publications in Polish, was created (Table 5.1). Preliminary experiments proved that the third level of the OSJ classification (subfields) is too detailed for the development of a reviewer and expert recommendation system whose elements include the tested classification algorithm.
Consequently, further work was limited to six domains and twenty-two fields of science. Moreover, after preliminary tests, the second-level fields (l = 2) General Science & Technology and General Arts, Humanities, & Social Sciences were
7 https://recenzenci.opi.org.pl/sssr-web/site/home?lang=en.
8 https://inventorum.opi.org.pl/en.
Table 5.1 The number of publications in English (En) and in Polish (Pl) at the first level of the OSJ. This data constitutes the basis for the creation of the learning and test sets that validate the publication classification algorithm

   Level 1 (domain)              Publications (En)   Publications (Pl)
1  Applied sciences                      705,929              52,497
2  Art & humanities                       25,133              28,664
3  Economic & social sciences             54,853               9,839
4  General                                15,984               4,055
5  Health sciences                     1,003,490              12,854
6  Natural sciences                      928,602              11,403
   In total                            2,733,991             119,312
excluded from the dataset. They added no extra value because they were too general and increased the ambiguity of the classification models. As learning classification models requires balanced proportions of documents, the dataset was adjusted in the manner described below. The upper limit of the samples for each scientific area of the highest level (domains, l = 1) was set to 10,000 in the case of publications in Polish and to 25,000 in the case of publications in English.

Document representation
A document, i.e. publication metadata, d_i, may contain features f_i^t, such as a title, f_i^1; an abstract, f_i^2; a list of keywords, f_{i,k}^3, k = 1, 2, ..., K; and a list of publication authors, f_{i,p}^4, p = 1, 2, ..., P. Each individual also has a unique identifier that is assigned in the process of author disambiguation. As this piece of information may distinguish a publication, the last publication feature comprises a list of authors' identifiers, f_i^5. Documents are converted to features in the following manner:
1. The features f_i^t are singled out from document d_i, and the natural language of each feature is detected, which leads to f_i^{t,l}; next, standard text preprocessing is conducted (stop-word removal and lemmatisation);
2. The prefix t_ is added to all words included in a specific feature, f_i^{t,l}, to distinguish the features;
3. A vector representation of the new features is created (e.g. TF–IDF).

As a result of these transformations, each document d_i is represented by a set of features (e.g. TF–IDF vectors) f_i^t, t = 1, ..., 5, where each feature may contain several attributes (words or identifiers). This work relies on this type of text data representation.

Algorithm selection and flat versus hierarchical classification
After the dataset was prepared and its representation selected, experiments were conducted to adopt the best classification model, with consideration for three
Table 5.2 The results of publication classification obtained from the implementation of different combinations of classification models: the Polish and English language models; the MNB, MLP, and SVM algorithms; and the hierarchical and flat models that classify each of the levels individually. The results are expressed as values of F-score, separately for domains (l = 1) and fields (l = 2) of the OSJ classification

                      Flat approach           Hierarchical approach
Language  Algorithm   Level 1     Level 2     Level 1     Level 2
                      (domains)   (fields)    (domains)   (fields)
Polish    MNB         0.92        0.85        0.92        0.84
          SVM         0.90        0.84        0.90        0.81
          MLP         0.91        0.85        0.91        0.84
English   MNB         0.80        0.70        0.80        0.68
          SVM         0.82        0.73        0.82        0.71
          MLP         0.80        0.70        0.80        0.70
algorithms (MNB, SVM, and MLP), and the flat and hierarchical classifications. Tests were conducted at two OSJ levels: domains (l = 1) and fields (l = 2). For each of these levels, separate classification models were trained, in addition to separate models for the Polish and English languages. The quality of operation of the different combinations of classification models, expressed as average values of F-score, is presented in Table 5.2. It was observed that the quality of classification was considerably higher in the case of the Polish dataset than in the English one. This may be because the publications in Polish were sourced from a narrower range of areas; as a result, the words contained in the publications are more discriminatory than those contained in the English-language publications. This may also be the reason why MNB is the best algorithm for the Polish dataset, whereas SVM is the best for the English dataset. Based solely on these results, it is impossible to assess which of the algorithms is superior. For further experiments, the MNB algorithm was used, as it was found to be the least complex one that offered satisfactory classification quality. The main objective of the experiments was to compare the flat and hierarchical approaches to classification. Only the results for the second level could be considered because the top-level classification is always conducted in the same manner, regardless of the approach adopted. The flat approach slightly outperforms the hierarchical one. This confirmed the assumption that although hierarchical classification can become increasingly more precise at subsequent levels, errors made at the highest level propagate to lower levels, which leads to poorer results than those achieved using the flat approach. Since the differences in the approaches' quality are insignificant, both are viable in multilingual experiments.
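The comparison above can be sketched with scikit-learn: the same TF–IDF features are fed to MNB and to a linear SVM, and the cross-validated macro F-scores are compared. The corpus below is an invented toy (the book used tenfold cross-validation on millions of records), so the numbers printed here illustrate only the procedure, not the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: two clearly separable classes, repeated to allow 3-fold CV.
docs = ["gene therapy trial", "protein folding model", "cell biology notes",
        "stock market crash", "inflation and rates", "fiscal policy review"] * 4
labels = (["bio"] * 3 + ["econ"] * 3) * 4

results = {}
for name, clf in [("MNB", MultinomialNB()), ("SVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=3, scoring="f1_macro")
    results[name] = scores.mean()
    print(name, round(results[name], 2))
```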
Monolingual and multilingual classification
Previous experiments focused on monolingual classification; in the following experiment, multilingual publications were allowed. Two multilingual classification
Table 5.3 The results of the publication classification task attained by the monolingual and multilingual models. All of the models were trained using the MNB algorithm. The results are expressed as F-score values separately for domains (l = 1) and fields (l = 2) of the OSJ classification, both for the flat and hierarchical model organisations

                         Flat approach           Hierarchical approach
Model                    Level 1     Level 2     Level 1     Level 2
                         (domains)   (fields)    (domains)   (fields)
Multilingual             0.94        0.87        0.93        0.86
Monolingual (Polish)     0.90        0.77        0.90        0.78
Monolingual (English)    0.79        0.59        0.81        0.63
Maximum probability      0.91        0.78        0.92        0.80
approaches were studied: a maximum probability model and a multilingual classification system, both of which are discussed in Sect. 5.2.4. For this purpose, the monolingual models created for the previous experiment using the MNB algorithm were used. The role of the monolingual models was twofold: first, they were used as monolingual classifiers to make comparisons with the multilingual systems; second, they served as a source of training data for the decision module in the multilingual classification system. The outputs of these models were probability vectors, which constituted a training set for the decision module, i.e. a set of logistic regression models organised in a one-vs-others architecture. The publications that contained text in both Polish and English amounted to 33,000. This data was used for the training and evaluation of the multilingual systems. The results of the experiments are presented in Table 5.3, which contains the average F-score values. While the quality of classification of the monolingual models suggested a slight advantage of the hierarchical approach over the flat one, the multilingual system performed better when the flat configuration was used. This could be attributed to the nature of the dataset used for the experiments, rather than to real differences in the approaches' performance. Expectedly, the approach that involved the selection of the class with the highest probability among the decisions of the monolingual classifiers (the maximum probability model) yielded better results than the approaches that relied on only one monolingual classifier. This is true for both languages. The most compelling findings pertain to the comparison of the multilingual system with the monolingual approaches. It was observed that the multilingual system yielded better results than all other models, regardless of classification level.
Even if the maximum probability model is used, monolingual classifiers struggle with the selection of relevant multilingual document classes. Their performance depends heavily on the balance of the dataset between the languages. This is not an issue in the proposed multilingual system, as it combines the probabilities from the monolingual classifiers into a multilingual decision module. By doing so, it is unaffected by language imbalances in the documents. It can, therefore, be assumed that the proposed
multilingual classification system is recommendable if a document (publication) contains features (parts) written in different languages. Although the system can be recommended specifically to classify multilingual documents, it would also be viable in the classification of monolingual documents.
5.3 Disambiguation of Authors

The disambiguation of authors is an important stage in the process of constructing scholars' profiles. It involves specifying who, precisely, is the author of a publication. It is a nontrivial task: data on the individuals included in the database of potential experts and reviewers must be linked to publications using only the database resources, such as first names, initials, surnames, and additional data. It should be noted that several individuals may exist with the same first names and surnames; some may even share research interests and affiliations. The author disambiguation algorithm is a component of the information extraction module in the reviewer and expert recommendation system.
5.3.1 Disambiguation Framework

Authors' disambiguation can be described as follows. It is assumed that a reference set, P, of unambiguously identifiable individuals p_l ∈ P, l = 1, ..., L, exists. Each of these individuals may be connected with numerous documents (publications) d_i ∈ D, i = 1, ..., i_l. Documents contain references r_{i,j} ∈ R, j = 1, ..., J, to unidentified authors. The disambiguation of authors involves assigning a reference r_{i,j} included in a publication to an individual p_l. Figure 5.6 presents a scheme of unambiguous relations between documents (publications) and individuals, and of ambiguous relations between documents (publications) and the references contained in them.

Fig. 5.6 A scheme of the task aimed at the disambiguation of the authors of publications. On the left, unambiguous relations between documents (publications) and individuals; on the right, ambiguous relations between documents (publications) and the references included in them, i.e. the first names and surnames of the authors

Fig. 5.7 The scheme of the disambiguation framework, which includes two disambiguation phases. The input data comprises unambiguously identified individuals and ambiguous references in publications. In the first phase of the disambiguation, an algorithm based on heuristic rules is applied. In the second phase, a hierarchical clustering algorithm is implemented, which can: (1) use heuristic data cluster similarity measures; or (2) assess these similarities using classifiers. The second phase of the disambiguation is optional. At the output of the framework, unambiguously identified publication authors are generated. The framework is a component of the information extraction module in the reviewer and expert recommendation system

To perform the disambiguation task, a disambiguation framework was proposed, which combines heuristic rules with classification algorithms and hierarchical agglomerative clustering (HAC) (Fig. 5.7). The operation of the framework is outlined below. In the first phase of the disambiguation, an algorithm based on heuristic rules is applied, which iteratively matches publications to existing researchers' profiles. The rules are designed to ensure maximum precision; no errors are expected to occur at this stage. In the second (optional) phase of the disambiguation, hierarchical clustering occurs, which is based on the HAC algorithm. At this point, two approaches are considered. In the first, similarity is measured using heuristic rules; in the second, it is estimated by classifiers that are trained on the data obtained in the first phase of the disambiguation. It should be stressed that the algorithm based on heuristic rules that is used in the first phase is capable of matching publications only with researchers' profiles that are already included in the database. The clustering algorithm used in the second phase, which is based on the HAC algorithm, not only matches documents to existing researchers' profiles, but also creates new researcher profiles if no relevant candidate is found in the database.
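The two-phase framework of Fig. 5.7 can be sketched at a high level as follows. The rule and clustering components here are hypothetical placeholders for the algorithms of Sects. 5.3.2-5.3.4; the demo rule and clustering stub are toys.

```python
# High-level sketch of the two-phase disambiguation framework (Fig. 5.7).

def disambiguate(references, profiles, rules, cluster):
    """Phase 1: high-precision rules; phase 2: clustering the remainder."""
    resolved, unresolved = {}, []
    for ref in references:
        for rule in rules:               # stop at the first rule that matches
            match = rule(ref, profiles)
            if match is not None:
                resolved[ref] = match
                break
        else:
            unresolved.append(ref)
    # Phase 2 may also create new profiles for previously unknown authors.
    resolved.update(cluster(unresolved, profiles))
    return resolved

# Toy demo: exact-name rule plus a clustering stub that mints new profiles.
exact = lambda ref, profiles: ref if ref in profiles else None
stub_cluster = lambda refs, profiles: {r: f"new:{r}" for r in refs}

out = disambiguate(["J. Smith", "A. Nowak"], {"J. Smith"}, [exact], stub_cluster)
print(out)   # J. Smith matched; A. Nowak gets a new profile
```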
5.3.2 A Rule-Based Algorithm

A heuristic algorithm comprises a series of rules that operate on attributes that can be directly extracted from publications d_i. The operation of the algorithm is outlined below (Fig. 5.8): an input piece of data constitutes an entire publication. First, the algorithm extracts attributes, such as authors' first names, initials, surnames, affiliations, and keywords. The heuristic rules, which analyse the attributes and profiles of real (identified) individuals and identify the real authors of publications, are then applied.

Data preprocessing
In reality, the algorithm is more complex, and the data preprocessing stage requires further comment. Data blocks b_n are formed (Fig. 5.9) based on the surnames and first-name initials included in the publications: for example, Richard J. Smith → R Smith, Nawojka Małgorzata Protasiewicz → N Protasiewicz, or Fryderyk Chopin → F Chopin. As a result, graphs are generated of the relevant publications d_i from set D and the individuals p_i from set P:

b_n → {d_1, ..., d_{b_n}}, {p_1, ..., p_{b_n}},    (5.9)
where each document d_i includes references r_{i,j}, at least one of which equals b_n, n = 1, ..., N (Fig. 5.10). It should be noted that the concept of extracting data blocks originates from [6].

Disambiguation
Following the extraction of the attributes and the formation of the data blocks, the heuristic rules—which analyse whether it is possible to connect the authors of a
Fig. 5.8 The operation of the heuristic algorithm used to disambiguate the authors of the publications. The algorithm forms part of the information retrieval module, a component of the reviewer and expert recommendation system. First, the algorithm retrieves selected features from publications; then, the heuristic rules identify publication authors by utilising both the features and already-known person profiles
Fig. 5.9 The operation of the heuristic algorithm used to disambiguate the authors of the publications. Data blocks are formed from references included in the publications, and heuristic rules are applied. This process results in the identification of the publications’ real authors
Fig. 5.10 An example data block formed after data preprocessing in the heuristic algorithm used to disambiguate the publications. It is a graph that matches individuals' profiles p_i with documents (publications) d_i and the references r_{i,j} that are included in them, i.e. the ambiguous authors of the publications. The node in which these objects intersect is the surname and the first-name initial, b_n
specific publication (references) with profiles included in the database—are initiated. Each data block, b_n, is analysed. Three pieces of data in each block (document d_i, references r_{i,j}, and profile p_l) are processed individually by the heuristic rules. If any of the rules manages to unambiguously connect reference r_{i,j} with author p_l, the remaining rules are not initiated. After the rules are applied, the next three pieces of data are processed. After all blocks b_n are processed, the algorithm can be rerun. In this manner, the algorithm iteratively discovers new relations in subsequent iterations, while using the information obtained in previous steps. The heuristic rules were based on expertise, available data, and preliminary experiments. Finally, five rules were developed based on the following attributes: (1) names and surnames, (2) coauthorship, (3) affiliations, (4) scientific fields, and (5) profiles of both researcher and publication.

An example of an implementation of the heuristic rules
It was assumed that three datasets existed: individuals' profiles, P; references in publications, R; and publications, D. These were used to create data blocks b_n in the manner described above and depicted in Fig. 5.10. Let us assume that we want to process document (publication) d_i and reference r_{i,j}. The rules used to disambiguate the authors of the publications work as follows:
(R1) Compare reference r_{i,j} with all profiles of individuals p_l, considering only names and surnames. If only one element p_l ∈ P is discovered, it corresponds to r_{i,j}. Reference r_{i,j} is disambiguated, which means that the author of a publication has been identified.
(R2) Coauthorship: extract the references r_{i,j} from all documents of each profile p_l in block b_n. Next, simplify them to surnames and first-name initials,
p_l → b_{l,1}, b_{l,2}, .... Repeat the process for the references of the document currently being processed, d_i → b_{i,1}, b_{i,2}, .... Calculate the numbers n_1, n_2, ..., n_l of references in each profile that match the references in the document. If max(n_l) > n_max, where n_max is a defined threshold, then profile p_l corresponds to reference r_{i,j}.
(R3) Compare the institutions included in the publication with the researchers' affiliations by following the same steps as in rule R2. Note that in this instance, the number of matching affiliations, instead of the number of references, is calculated.
(R4) Compare the scientific fields of the publications with those of the researchers by following the same steps as in rule R2. Note that the number of matching scientific fields, instead of the number of references, is calculated.
(R5) Compare the profiles (keywords) of publications with scholars' achievements by using the selected similarity measure, and select the most similar profile that maintains a certain degree of dissimilarity from the other scholars' profiles.
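The block key ("first-name initial + surname") and rule R2 can be sketched as follows. Both functions are simplified toys: the real rules operate on richer attributes, and the threshold value here is invented.

```python
# Toy versions of the data-block key b_n and of rule R2 (coauthorship).

def block_key(full_name):
    """Simplify a name to 'first initial + surname', e.g.
    'Richard J. Smith' -> 'R Smith'."""
    parts = full_name.split()
    return f"{parts[0][0]} {parts[-1]}"

def rule_r2(doc_refs, profile_refs, n_max=2):
    """Match if more than n_max simplified references are shared (cf. R2)."""
    shared = ({block_key(r) for r in doc_refs}
              & {block_key(r) for r in profile_refs})
    return len(shared) > n_max

matched = rule_r2(["Anna Maria Kowalska", "Jan Nowak", "Piotr Zieliński"],
                  ["Anna Kowalska", "Jan K. Nowak", "Piotr Zieliński"])
print(block_key("Richard J. Smith"), matched)   # R Smith True
```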
5.3.3 Clustering by Using Heuristic Similarity

The clustering algorithm was based on HAC [3]. In essence, this method creates groups of publications and assigns individuals to them. Initially, each document is treated as a separate cluster. Next, the algorithm iteratively groups documents on the basis of a measure of similarity between clusters. Finally, each set of documents (cluster) corresponds to a single author. As some documents have several authors, the same document may appear in more than one cluster. The HAC algorithm was selected to disambiguate the authors of the publications because it is capable of discovering new profiles of unknown researchers from groups of documents. In this manner, it complements the heuristic algorithm ideally. The algorithm is outlined in Fig. 5.11. The clustering procedure is performed individually for the profiles of individuals p_l, which are selected from the entire set of available individuals, P. Note that at later stages, only the surname of a specific individual, b_l, is used. The set of available publications, D, is used to select the publications d_{l,1}, d_{l,2}, ..., d_{l,i_l} that have at least one reference r_{i,j} that matches name b_l. In this manner, a data block b_l is created, for which a dendrogram is generated. The algorithm calculates the similarity of the selected publications (the similarity between each and every pair of publications). The most similar ones are grouped. During the iterative process, a dendrogram is created, which is terminated at a preset cutoff level. The leaves of the tree are groups of publications written by a particular author. If it is impossible to assign a real author's profile to a particular group, a new profile is created. The similarity of a pair of publications, d_i and d_j, is expressed as a measure of similarity, sim(d_i, d_j), which is the probability that both publications were written by the same author. The probability is modelled by the logistic function:
5 Selected Algorithmic Developments
Fig. 5.11 The operation of the algorithm used to disambiguate the authors of the publications via hierarchical clustering: first, the similarity between the publications is calculated; next, groups of the publications are created; then, a dendrogram is iteratively created, whose leaves form groups of publications that are written by a particular author
    sim(di, dj) = \frac{1}{1 + e^{-\sum_{l=1}^{L} s_l}}    (5.10)
on the set of partial similarities sl, l = 1, ..., L, between documents di and dj. The partial similarities were proposed on the basis of expertise and experiments. The following similarities were applied: (s1) the identified author names or initials; (s2) the normalised difference between the years in which the works under analysis were published; (s3) the journal or conference name; (s4) the coauthors, scientific fields, keywords, and affiliations. Hierarchical clustering utilises all of the attributes of the publications that are used in the heuristic algorithm, as well as additional data that describes the publications, such as the year, journal name, conference name, title, and abstract. The detailed weight values, functions, and their parameters depend on the practical implementation of these similarities.
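Equation (5.10) can be computed directly. A minimal sketch follows; the partial similarities are taken unweighted here, since the weights depend on the practical implementation.

```python
import math

def sim(partial_similarities):
    """Probability that two documents share an author: a logistic function
    of the sum of the partial similarities s_1, ..., s_L (Eq. 5.10)."""
    return 1.0 / (1.0 + math.exp(-sum(partial_similarities)))
```

With all partial similarities at zero, the probability is 0.5; strongly positive evidence pushes it towards one, and strongly negative evidence towards zero.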
5.3.4 Clustering Using Similarity Estimated by Classifiers

It was assumed that classifiers could estimate the similarities between the data clusters. The classifiers can use various attributes to calculate similarities without conducting a formal analysis to establish whether they are significant; their importance is verified automatically during the learning process. Hierarchical clustering that uses similarity estimated by classifiers differs from clustering algorithms that use heuristic measures of similarity in that the former includes a model training stage in its method of estimating similarities and, consequently, in the input data used to estimate the similarity of the data clusters. The classifier input data may comprise the following:
• concordance weights (concordant, partially concordant, or nonconcordant) that are defined individually for the first name, surname, OSJ classification, journal name, and conference name;
• the time that has passed since the year in which a publication was published;
• the number of matching authors or keywords;
• a cosine measure of similarity that is defined individually for titles and abstracts.

The attributes listed above were proposed in the reviewer and expert recommendation system and verified experimentally. It should be noted, however, that the selection of the classifier's input attributes used to estimate the similarity of the publication clusters depends on the inventiveness of the system's designers and on the source data available.

The similarity in HAC can be calculated in various ways, depending on which documents in the clusters are considered. The options include: (i) single link (the closest documents); (ii) complete link (the most distant documents); (iii) a group average over all pairs; or (iv) a weighted average over all pairs. One important aspect is the clustering algorithm's termination condition. Simple criteria are typically used, such as a dendrogram cutoff value, a specified number of clusters, or the distances between the closest clusters; in some cases, however, more refined adaptation strategies are implemented [2]. Based on the literature and on preliminary experiments, it was decided to apply a termination criterion that relies on the c-index of clusters, which is calculated as follows:

    c_{index} = \frac{S - \min(S)}{\max(S) - \min(S)},    (5.11)

where S is the sum of distances between all pairs of documents in a cluster; min(S) and max(S) are the sums of the n shortest and the n longest distances between pairs of documents in the entire set; and n is the number of pairs of documents in the cluster for which the value cindex is calculated. The algorithm is terminated when the value cindex falls below a specified constant.
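The termination criterion of Eq. (5.11) can be sketched as follows, assuming a plain list of documents and a pairwise distance function; both are illustrative stand-ins for the system's actual data structures.

```python
from itertools import combinations

def c_index(cluster, distance, all_distances):
    """Eq. (5.11): S sums the distances over the n document pairs inside
    the cluster; min(S) and max(S) sum the n shortest and n longest
    pairwise distances in the entire set. Clustering stops once the
    value drops below a preset constant."""
    pairs = list(combinations(cluster, 2))
    n = len(pairs)
    s = sum(distance(a, b) for a, b in pairs)
    ordered = sorted(all_distances)
    s_min, s_max = sum(ordered[:n]), sum(ordered[-n:])
    return (s - s_min) / (s_max - s_min)
```

A compact cluster whose internal distances are among the shortest in the set yields a value near zero; a loose cluster yields a value near one.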
5.3.5 Disambiguation of Authors in Practice

The author disambiguation algorithms, including the heuristic algorithm and the hierarchical clustering algorithm, were implemented as part of the information extraction module in the reviewer and expert recommendation system. The quality of the algorithms was verified experimentally. The data acquisition module gathered approximately five million publications, which can be attributed to approximately nineteen million authors. Due to the large size of the dataset, the experiments were conducted on selected data.
First, tests of the heuristic (rule-based) algorithm were performed. A test set, which was required to conduct the experiments, was prepared semiautomatically or manually as follows:

1. For each scientific discipline, one researcher was selected, and the authorship of all of their publications was established. In this manner, sixty-three individuals were verified.
2. The most popular surnames and first-name initials were selected, and the authorship of all publications written by these individuals was established. In this manner, forty-eight individuals were verified.

The heuristic algorithm was verified using the above test set, which contained a total of 111 authors and 2,921 publications. Four heuristic algorithm rules were applied: (1) names and surnames, (2) coauthorship, (3) affiliations, and (4) scientific fields. The algorithm failed thirty-four times to assign authors to their publications correctly, which translated into a precision of 99% and a recall of 65%. Next, the fifth rule, (5) the profiles of both a researcher and a publication, was analysed; this rule was applied separately from all other rules. A precision of 82% was attained. The result confirmed that the fifth rule failed to improve the quality of the algorithm. The reason for its unsatisfactory precision lies in its tendency to favour researchers who boast larger numbers of publications than their colleagues.

Next, the clustering algorithm that uses a heuristic measure of similarity (clustering by using heuristic similarity) was analysed. For this purpose, fifty surnames were selected randomly from the dataset. Each surname corresponded to 100–200 individuals. As a result, the test set contained approximately 20,000 publications. It was decided that the smallest cluster would comprise three publications.
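The precision, recall, and F-score figures used throughout these experiments follow the standard definitions; a short sketch for reference:

```python
def precision_recall_f(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn),
    and F-score is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```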
The results of the experiments are presented in Table 5.4, which includes the values of precision, recall, and F-score with regard to different cutoff levels of the dendrograms constructed. The results suggest that authors are disambiguated with high precision, which is independent of the cutoff level. The recall value decreases considerably as the cutoff level rises, causing an increase in the number of unidentified authors in each of the clusters. F-score is a harmonic mean of both measures that illustrates the balance between them. It is crucial that no publication be misassigned to an incorrect author and that a satisfactory precision level be maintained. Based on these premises, it was established that a cutoff level of 20 ensured both adequate precision and recall.

Finally, a clustering algorithm fitted with the classifiers responsible for estimating cluster similarity was tested. For this purpose, a test dataset was prepared that contained 355,955 articles grouped into 2,450 clusters. A total of 57,625 publications whose authors had been verified were used to train the classifiers. Four methods of sorting the documents into clusters during the calculation of similarity were used in the tests: single link, complete link, average link, and weighted link. Two classification algorithms were used to estimate similarity: logistic regression with L2 regularisation, and SVM with radial basis function kernels. Clustering was stopped when the c-index reached a preset threshold value.
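How a pair of clusters is scored in these tests depends on the link method. The three basic variants can be sketched as follows, where `pair_sim` may be any pairwise similarity, including a classifier's estimated probability; the interface is an assumption for illustration.

```python
def cluster_similarity(cluster_a, cluster_b, pair_sim, link="average"):
    """Score a pair of clusters from pairwise document similarities.
    In similarity terms, single link takes the closest (most similar)
    pair and complete link the most distant (least similar) pair;
    average link averages over all pairs. The weighted variant is
    omitted for brevity."""
    scores = [pair_sim(a, b) for a in cluster_a for b in cluster_b]
    if link == "single":
        return max(scores)
    if link == "complete":
        return min(scores)
    return sum(scores) / len(scores)
```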
Table 5.4 The quality of author disambiguation expressed by the precision, recall, and F-score measures, achieved using the hierarchical clustering algorithm with heuristic (rule-based) estimation of similarity. The results are presented for various dendrogram cutoff levels

Cut-off   Precision   Recall   F-score
5         97.96       54.27    69.84
10        98.87       34.04    50.64
20        99.42        9.98    18.14
30        99.35        2.99     5.81
Table 5.5 The quality of author disambiguation expressed by the F-score measure, achieved using the hierarchical clustering algorithm with estimation of similarity via classifiers. Four methods of sorting the documents into clusters during the calculation of similarity were used: single link, complete link, average link, and weighted link. Two classification algorithms were used to estimate similarity: logistic regression with L2 regularisation and SVM with radial basis function kernels. The criterion to terminate the algorithm was an adaptive number of clusters based on c-index values in the range of 0.10–0.25

Link       Classifier   C-index 0.15   0.20   0.25
Single     LR           0.80           0.81   0.79
Single     SVM          0.73           0.72   0.72
Complete   LR           0.83           0.86   0.86
Complete   SVM          0.79           0.82   0.83
Average    LR           0.89           0.85   0.81
Average    SVM          0.79           0.77   0.77
Weighted   LR           0.87           0.85   0.82
Weighted   SVM          0.80           0.78   0.77
The results are presented in Table 5.5. The quality of clustering, expressed as F-score, ranges from 0.72 to 0.89. A slightly higher quality of author disambiguation was observed when the logistic regression classifier with regularisation was used to estimate similarity instead of the SVM classifier: the simpler, less computationally intensive technique yielded better results. It is worth noting that the algorithm attains similar results when the average link, complete link, and weighted link methods are used; all three ensured satisfactory algorithm quality. The results also confirmed that the single link method is not recommended.

The experiments demonstrated that both the heuristic algorithm and the hierarchical clustering algorithm, whether with heuristic estimation of similarity or estimation via classifiers, were useful. Although the heuristic algorithm requires expertise in the construction of its rules, it works rapidly and accurately; the hierarchical clustering algorithm requires no such expertise, but is slower and less precise. The hierarchical clustering
algorithm can, however, discover unknown authors, whereas the heuristic algorithm is only capable of assigning publications to individuals whose profiles are already known.
5.4 Keyword Extraction

Keyword extraction involves identifying the most important words in text documents. Documents written in Polish were analysed using the Polish Keyword Extractor [5], while those written in English were analysed using the Keyphrases Extraction Algorithm [27]. Extracted keywords were automatically translated from Polish to English (and vice versa) using online dictionaries. In the case of more specialised terminology, both the Polish- and the English-language versions of Wikipedia were used. Translation is a technical process that does not merit further discussion in this section. The same applies to the Keyphrases Extraction Algorithm, which is explained in detail in the literature. More information is provided below on the Polish Keyword Extractor algorithm, which was developed to extract keywords from text in Polish. Keyword extraction algorithms form a part of the information extraction module in the reviewer and expert recommendation system.
5.4.1 Polish Keyword Extractor

The Polish Keyword Extractor is an algorithm that is used to extract keywords from academic documents in the Polish language. It is inspired by Rapid Automatic Keyword Extraction [24] and the Keyphrases Extraction Algorithm [27]. Rapid Automatic Keyword Extraction is an unsupervised algorithm used to extract keywords from documents written in any language and in any discipline. The algorithm is based on the observation that keywords are seldom complex phrases. The Keyphrases Extraction Algorithm is a supervised algorithm that uses MNB to determine how probable it is that a phrase is a keyword.

The proposed algorithm was designed to analyse individual documents rather than entire sets of them. Unlike other solutions, it requires no extra knowledge about the documents that are subject to its analysis and uses no external sources of knowledge. The algorithm is fitted with a Polish lemmatiser; a part-of-speech filter; two keyword candidate selection methods (the Pattern–Recursive selector and the Part-of-Speech selector); and two methods that evaluate the selected keyword candidates, one supervised and one unsupervised. This allows for four possible algorithm configurations.

The Pattern–Recursive selector uses four part-of-speech patterns recursively: noun–adjective, noun–unknown, noun–noun, and noun. The Part-of-Speech selector uses patterns that are based on complex regular expressions. First, a piece of text is divided into sequences of words. Next, combinations of each sequence are created, which results in a larger number of keyword candidates. Potential keywords are selected
using the following regular expression (Part-of-Speech selector):

    (noun)(noun|adjective|unknown)*    (5.12)
The evaluation of keyword candidates in the unsupervised algorithm relies on a scoring function, freq(w)^2. Scores are assigned to individual words. Complex phrases are evaluated using the sum of the scores assigned to the individual words that form them. The scores assigned to the keyword candidates are sorted, and the T highest-scoring phrases are selected as the keywords of the document. The evaluation of keyword candidates by the supervised algorithm involves the binary classification of phrases. In the case of a publication, candidate words are expressed using four features: TF–IDF, first appearance in the abstract, first appearance in the phrase, and size.

To conclude, the Polish Keyword Extractor extracts keywords as follows:

1. Divides a document into phrases and ensures that each of them is a sequence of words.
2. Lemmatises the words and specifies which parts of speech they constitute.
3. Selects a finite number of keyword candidates.
4. Evaluates the candidates and selects keywords from among those with the highest scores.
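The unsupervised configuration, combining the Part-of-Speech selector of Eq. (5.12) with the freq(w)^2 scoring, can be sketched end-to-end. The input format (phrases as lists of lemma/POS pairs) and the POS tag names are assumptions made for illustration; the actual lemmatiser and tagger are separate components.

```python
import re
from collections import Counter

def extract_keywords(tagged_phrases, top_t=5):
    """Select candidates matching (noun)(noun|adjective|unknown)* over
    the POS tags (Eq. 5.12), score each word as freq(w)^2, score a
    phrase as the sum over its words, and return the top T candidates."""
    pattern = re.compile(r"\bnoun(?: (?:noun|adjective|unknown))*")
    candidates = []
    for phrase in tagged_phrases:
        tags = " ".join(pos for _, pos in phrase)
        for match in pattern.finditer(tags):
            start = tags[:match.start()].count(" ")   # token index of match
            length = match.group().count(" ") + 1     # tokens in the match
            candidates.append(tuple(w for w, _ in phrase[start:start + length]))
    freq = Counter(word for phrase in tagged_phrases for word, _ in phrase)
    ranked = sorted(set(candidates),
                    key=lambda cand: sum(freq[w] ** 2 for w in cand),
                    reverse=True)
    return ranked[:top_t]
```

The quadratic score rewards words that recur across the document, so a frequent single noun can outrank a longer phrase of rare words.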
5.4.2 Keyword Extraction in Practice

Keyword extraction was tested on abstracts of publications written in Polish and English. Algorithms and tools for the English language are well described in the literature and have been evaluated in numerous experiments [24, 27]. As far as the Polish language is concerned, the proprietary Polish Keyword Extractor algorithm was developed, and it requires further evaluation.

No easy method exists of evaluating whether extracted keywords are correct; this would require an evaluation performed by an expert, which may be subjective. For that reason, attempts were made to estimate the quality of keyword extraction. It was assumed that a reliable algorithm should extract words or phrases from a publication abstract that are the same as, or similar to, those provided by the authors in the index terms sections of their publications. When the above evaluation criterion was applied, the Polish Keyword Extractor algorithm yielded the best results using the Pattern–Recursive selector and the supervised evaluator. Preliminary experiments also revealed that the Polish Keyword Extractor achieved higher quality than Rapid Automatic Keyword Extraction and the Keyphrases Extraction Algorithm in the case of Polish-language text.

Figure 5.12 presents two examples of keyword extraction from Wikipedia text on quantum mechanics. The left side of the figure presents an example of how the Polish Keyword Extractor algorithm worked with the Polish text, while the right side
Fig. 5.12 An example of the Polish Keyword Extractor being used on Polish-language text (left) and the Keyphrases Extraction Algorithm on English-language text (right). Both texts come from Wikipedia, and they concern quantum mechanics. It was assumed that the maximum keyphrase length equals 3 and the minimum probability of being a keyphrase equals 0.1. The algorithms were set to extract up to five phrases in each case. The keywords (phrases) are highlighted in grey. More words are highlighted because some phrases are repeated two or three times. As a result, the phrases mechanika kwantowa, mechanika klasyczna, atom, cząstki elementarne, and nadprzewodnictwo are selected from the Polish text; and quantum mechanics, quantum chemistry, quantum field theory, classical physics, and macroscopic scale from the English text
presents how the Keyphrases Extraction Algorithm performed with the English text. The extracted keywords and key phrases are highlighted in grey. These examples show clearly that it is difficult to evaluate whether extracted words are keywords without the appropriate expertise. Experts may also be subjective in their evaluations. For example, quantum technology (instead of quantum chemistry) or nadciekłość (instead of nadprzewodnictwo, in Polish) may be considered keywords. For that reason, no detailed results of typical quality measures, such as precision, recall, or F-score, are provided. The tested algorithms can properly summarise text in Polish and in English by extracting keywords; their evaluation was based primarily on experts' opinions and experience, however. It was observed that the Polish Keyword Extractor algorithm could extract keywords from Polish-language text more precisely than Rapid Automatic Keyword Extraction or the Keyphrases Extraction Algorithm.

The Polish Keyword Extractor algorithm is an important and useful component of the information extraction module in the reviewer and expert recommendation system. In practice, however, considering the shortcomings of automatic keyword extraction, the algorithm only proposes keywords for individual profiles. Only after an individual accepts a proposed word does it become a part of their profile (which is used to select reviewers).
5.5 Evaluation of Enterprises' Innovativeness

Chapter 4 presents Inventorum, an innovation support system. To increase the number of the platform's users, it has been fitted with a mechanism that seeks potentially innovative enterprises on the internet and invites them to join the platform. For this purpose, a classification model has been developed to assess whether specific enterprises are potentially innovative, based on their websites [7, 8, 10].
5.5.1 A Model of Evaluation of Enterprises' Innovativeness

The overall structure of the classification model is presented in Fig. 5.13. The input information comprises enterprise websites, which, in turn, comprise several documents and graphic elements. During the data preprocessing stage, aside from the text operations that are typical of natural language processing, the input data is divided into three sets. The first set, Dataset1, includes text extracted from the main pages, which usually contain general descriptions of the enterprises. The second, Dataset2, contains the labels of all links on the main pages; these include the names of all places on the internet to which the enterprises refer. The third, Dataset3, contains documents extracted from the enterprises' websites; this constitutes all publicly available knowledge on the enterprises. These datasets do not include all documents: only the relevant ones, selected by the BM25 Okapi search system [23]. Next, each of the datasets is evaluated by the relevant Naive Bayes classifier: γNB1,
Fig. 5.13 The scheme of the algorithm that estimates the degree of enterprises' innovativeness based on their websites. The algorithm is a component of the Inventorum system. Partial evaluations are formed by three classifiers: γNB1, which analyses text extracted from the main pages of the enterprises' websites; γNB2, which analyses the labels of all links on the main pages; and γNB3, which analyses a combination of documents extracted from the enterprises' websites. The final estimation of the enterprises' innovativeness is made by the decision model γML
Table 5.6 The quality of operation of the enterprise innovativeness estimation model: a comparison of results using the γVoting voting method and the γML classifier, for which the classifier features were selected by a genetic algorithm [10]

Experiment                                  Precision   Recall   F-score
γVoting                                     0.88        0.69     0.77
γML based on SVM                            0.94        0.86     0.90
γML based on SVM with genetic algorithm     0.89        0.77     0.82
γNB2, and γNB3. This results in a determination of the probability of the enterprises' innovativeness, which ranges between zero and one. All scores are stored in the decision matrix, DMLE, which acts as the input for the decision model γML that is ultimately responsible for establishing whether the enterprises are innovative.
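The flow from the three dataset-specific classifiers to the decision model can be sketched with stand-in objects. The `predict_proba`/`predict` interfaces and the field names are assumptions made for illustration, not the system's actual API.

```python
def assess_innovativeness(site, nb_models, meta_model):
    """Build one row of the decision matrix DMLE from the probabilities
    produced by the three Naive Bayes classifiers (main-page text, link
    labels, remaining documents); the meta-model turns the row into the
    final innovativeness decision."""
    row = [nb_models["main_page"].predict_proba(site["main_page"]),
           nb_models["links"].predict_proba(site["links"]),
           nb_models["documents"].predict_proba(site["documents"])]
    return meta_model.predict(row), row
```

The simple voting variant mentioned in the evaluation below corresponds to a meta-model that merely picks the classifier with the highest probability, whereas the trained metamodel learns how to weigh the three partial scores.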
5.5.2 Model Evaluation

Verification of the model that estimates enterprises' innovativeness was conducted experimentally. For this purpose, a dataset was prepared that contained 2,747 websites of various enterprises. After evaluating the websites, experts labelled 509 of them as 'innovative' and 2,238 as 'other'. Based on this dataset, three MNBs were trained: γNB1, γNB2, and γNB3. In the first approach, their results were aggregated by a simple voting algorithm, γVoting, which selected the model with the highest probability. In the second, more advanced approach, a classification metamodel, γML, was applied. This relied on an SVM algorithm, which had been selected experimentally. The classification quality was measured using precision, recall, and F-score with a tenfold cross-validation procedure.

The results are presented in Table 5.6. Unsurprisingly, the metaclassifier γML performed better in aggregating the partial decisions than the γVoting procedure. The number of input features for each of the Naive Bayes classifiers was selected by a genetic algorithm; no considerable improvement in the quality of the classification was achieved in comparison to the no-feature-selection approach. The precision indicator exceeding 0.9 suggested that the proposed system could efficiently identify innovative enterprises on the basis of their websites. A slightly lower recall value implied that some enterprises had been omitted despite being innovative. The F-score value of approximately 0.9 indicated that the model could be trusted and used to identify potentially innovative enterprises.
5.6 Summary

This chapter has presented algorithms used to acquire data and extract information from it. They form a part of the reviewer and expert recommendation system and of Inventorum, the system that supports innovativeness. The quality of these algorithms was verified experimentally. The originality of the proposed algorithms lies in their selection and tuning through experimental assessment on the tasks of constructing the profiles of potential experts or reviewers, as well as selecting innovative entities. These algorithms are used to classify publications, identify (disambiguate) their authors, extract keywords, and assess whether firms are innovative. The detailed advantages of the algorithms are discussed below.

Data acquisition

The data acquisition in the reviewer and expert recommendation system enables the collection of data from both structured and unstructured data sources. Structured data is included in open databases. It is acquired by simple algorithms, which are called extractors and importers. Unstructured data comes from the internet and is acquired by topical crawlers. This work presents a crawler algorithm that uses the CRF model to acquire information on scholars from their websites. For the practical purposes of these algorithms in the reviewer and expert recommendation system, the vast majority of the data was acquired by extractors. The crawling of researchers' websites turned out to be error-prone, and the data collected needed to be corrected manually. It appears that the degree of effort required to create a fully reliable topical crawler is significant. This approach should be adopted only if acquiring data by other means is impossible. For example, the simple algorithms proposed in this work that are used to extract and import data (publication metadata) from open and structured sources proved sufficient in the creation of the reviewer and expert recommendation system.
Classification of publications

The classification algorithm analysed in this book is responsible for the classification of large sets of metadata of academic publications into a hierarchical structure of scientific fields and disciplines. The classification of publications aims to better organise data during the creation of reviewer or expert profiles. It should be noted that publications may be mono- or multilingual; that is, they may contain text in one or several languages. Various algorithms and approaches to the organisation of classification were analysed experimentally. During the experiments, two monolingual classifiers (Polish and English) were trained using algorithms including MNB, SVM, and MLP. It appeared that the classification task was simple and that the MNB algorithm could be used, as it was the least complex of all the algorithms analysed and offered a satisfactory quality of classification. Because the target classes are organised in a hierarchical tree of scientific fields and disciplines, both a hierarchical approach (classifiers at higher levels of the tree are linked to classifiers at lower levels of the tree) and a
flat approach (each level of the tree is considered independent) to the organisation of classification were studied. The classification results demonstrated that the flat approach slightly outperformed the hierarchical one. This advantage appears to stem from hierarchical classification errors made at the highest level being propagated to the lower levels of the tree of scientific fields and disciplines. This problem does not exist in the flat organisation of classifiers because the classifiers that model the individual levels are independent.

The primary objective was to propose and verify a system that classifies multilingual documents. A proprietary system was proposed that comprises a set of monolingual classifiers and a multilingual decision model. The monolingual models rely on the simple MNB algorithm, while the decision module uses a logistic regression algorithm. A series of experiments pertaining to the classification of publication metadata was conducted. The proposed system was compared to two monolingual classifiers (English and Polish) and to the maximum probability model. The experiments were also conducted on multilingual documents that contained text in the same languages. Despite the experiments being limited to two languages, it was concluded that the proposed system could improve the efficiency of the classification of publications that include features (passages) in several languages other than the ones analysed. The proposed system is also capable of classifying large sets of multilingual documents. The development of classification algorithms is noticeable in the literature. Although transformer networks, which have become increasingly popular in recent years, could arguably offer a higher quality of classification, the key benefits of the proposed classification system lie in its simplicity and computational performance.
Identification (disambiguation) of authors of publications

The algorithm that identifies the authors of publications is a key component in the creation of reviewer or expert profiles, as it allows scientific and practical achievements to be assigned to the real individuals included in the database of reviewers and experts. The following approaches to the disambiguation of authors were proposed: (i) an algorithm based on heuristic rules; (ii) a clustering algorithm with cluster similarity measured by heuristic rules; and (iii) a clustering algorithm with similarity estimated by classifiers. The clustering algorithm was developed on the basis of HAC.

During the experimental verification, differences in the operation of the approaches were observed. The algorithm based on heuristic rules attains high precision and satisfactory recall. The algorithm may be implemented iteratively to identify the authors of new publications and of publications that have never before been assigned. It should be noted, however, that the algorithm can match publications only to existing authors. Conversely, the hierarchical clustering algorithm may discover new profiles of unknown researchers from groups of publications. This algorithm is more computationally complex and less precise than the rule-based one. Two approaches to hierarchical clustering were also verified: one in which the similarity of clusters is calculated on the basis of heuristic rules, and another in which it is estimated by classifiers. Heuristic rules ensure high precision, but lower recall. The estimation of cluster similarity by classifiers guarantees fairly reliable results. Moreover,
this approach allows the relations between publications to be utilised, which, in turn, enables the creation of a model of higher recall at the cost of an insignificant decline in precision. Despite the experiments being conducted on a dataset that is unavailable to the public, the results suggest that the proposed algorithms are worthy of further consideration, as they have proved useful in practice. Agglomerative hierarchical clustering with classifiers used to estimate cluster similarity seems a particularly interesting solution in view of increasingly large databases. This algorithm requires no data processing expertise. When compared to the clustering algorithm, the heuristic algorithm is faster and yields more accurate results, but requires expertise in the construction of its rules. Future work may focus on the design of a fully automated algorithm that is capable of identifying authors gradually, without human assistance.

Keyword extraction

Keyword extraction from text contributes to the construction of the profiles of potential reviewers and experts. This process is handled by the proprietary Polish Keyword Extractor algorithm for text written in Polish, and by the Keyphrases Extraction Algorithm for text written in English. The experiments confirmed both algorithms' undeniable usefulness. The Polish Keyword Extractor proved particularly useful in the analysis of Polish-language text. Natural language processing methods are developing constantly, and transformer networks could offer the same or better results; the Polish Keyword Extractor stands out, however, by virtue of its simplicity and speed of operation. These features serve as focal points during the design of the information extraction algorithms that are used in nearly all IT systems.

Evaluation of enterprise innovativeness

The Inventorum system is tasked with supporting innovativeness by bringing science and business together.
This is possible only if a sufficient number of entities is included in the system's database. To ensure this, the system has been fitted with a mechanism that seeks potentially innovative enterprises on the internet and invites them to join the platform. For this purpose, a classification model has been developed to assess whether enterprises are innovative, based on their websites. The model comprises three classifiers, each of which analyses enterprises' websites from a different perspective. Next, the decision module integrates the partial classification results into a final decision on the enterprises' innovativeness, which is expressed as a probability. The experimental verification of the algorithm suggests that the model can be used to seek potentially innovative enterprises. It appears, however, that the model requires adjustments and more advanced solutions, as using enterprises' websites as its sole source of information may prove insufficient.
References
1. Archambault É, Beauchesne OH, Caruso J (2011) Towards a multilingual, comprehensive and open scientific journal ontology. In: Proceedings of the 13th international conference of the International Society for Scientometrics and Informetrics, Durban, South Africa, pp 66–77
2. Guerra L, Robles V, Bielza C, Larrañaga P (2012) A comparison of clustering quality indices using outliers and noise. Intell Data Anal 16(4):703–715
3. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York
4. Luong MT, Nguyen TD, Kan MY (2010) Logical structure recovery in scholarly articles with rich document features. Int J Digit Library Syst 1(4):1–23
5. Kozlowski M, Protasiewicz J (2014) Automatic extraction of keywords from Polish abstracts. In: 4th Young Linguists' Meeting in Poznań, volume: book of abstracts, pp 56–57
6. Levin M, Krawczyk S, Bethard S, Jurafsky D (2012) Citation-based bootstrapping for large-scale author disambiguation. J Am Soc Inf Sci Technol 63(5):1030–1047
7. Mirończuk M, Perełkiewicz M, Protasiewicz J (2017) Detection of the innovative logotypes on the web pages. In: International conference on artificial intelligence and soft computing, Springer, pp 104–115
8. Mirończuk M, Protasiewicz J (2015) A diversified classification committee for recognition of innovative internet domains. In: Beyond databases, architectures and structures. Advanced technologies for data mining and knowledge discovery, Springer, pp 368–383
9. Mirończuk MM, Protasiewicz J (2018) A recent overview of the state-of-the-art elements of text classification. Expert Syst Appl 106:36–54
10. Mirończuk MM, Protasiewicz J (2020) Recognising innovative companies by using a diversified stacked generalisation method for website classification. Appl Intell 50(1):42–60
11. Mirończuk MM, Protasiewicz J, Pedrycz W (2019) Empirical evaluation of feature projection algorithms for multi-view text classification. Expert Syst Appl 130:97–112
12. Nocedal J, Wright SJ (2006) Quadratic programming. In: Numerical optimization, Springer, pp 448–492
13. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
14. Protasiewicz J, Artysiewicz J, Dadas S, Galezewska M, Kozlowski M, Kopacz A, Stanislawek T (2012) Procedures for review and selection of reviewers (in Polish), vol 2. National Information Processing Institute
15. Protasiewicz J, Dadas S, Galezewska M, Klodzinski P, Kopacz A, Kotynia M, Langa M, Mlodozeniec M, Oborzynski A, Stanislawek T, Stanczyk A, Wieczorek A (2012) Procedures for review and selection of reviewers (in Polish), vol 1. National Information Processing Institute
16. Protasiewicz J, Stanisławek T, Dadas S (2015) Multilingual and hierarchical classification of large datasets of scientific publications. In: 2015 IEEE international conference on systems, man, and cybernetics (SMC), IEEE, pp 1670–1675
17. Protasiewicz J (2014) A support system for selection of reviewers. In: 2014 IEEE international conference on systems, man and cybernetics (SMC), IEEE, pp 3062–3065
18. Protasiewicz J (2017) Inventorum - a recommendation system connecting business and academia. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC), IEEE, pp 1920–1925
19. Protasiewicz J (2017) Inventorum: a platform for open innovation. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC), IEEE, pp 10–15
20. Protasiewicz J, Dadas S (2016) A hybrid knowledge-based framework for author name disambiguation. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC), IEEE, pp 000594–000600
21. Protasiewicz J, Mirończuk M, Dadas S (2017) Categorization of multilingual scientific documents by a compound classification system. In: International conference on artificial intelligence and soft computing, Springer, pp 563–573
22. Protasiewicz J, Pedrycz W, Kozłowski M, Dadas S, Stanisławek T, Kopacz A, Gałęzewska M (2016) A recommender system of reviewers and experts in reviewing problems. Knowl Based Syst 106:164–178
23. Robertson S, Zaragoza H, et al (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retrieval 3(4):333–389
24. Rose S, Engel D, Cramer N, Cowley W (2010) Automatic keyword extraction from individual documents. Text Min Appl Theory 1:1–20
25. Sutton C, McCallum A (2011) An introduction to conditional random fields. Found Trends Mach Learn 4(4):267–373
26. Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Warehous Min 3(3):1–13
27. Witten I, Paynter G, Frank E, Gutwin C, Nevill-Manning C (2000) KEA: practical automatic keyphrase extraction. Working Paper 00/5, Department of Computer Science, The University of Waikato
Chapter 6
Knowledge Recommendation in Practice
This chapter describes the technical aspects of the reviewer and expert recommendation system and of Inventorum, the innovation support system, whose methodological assumptions and algorithms are discussed in detail in Chaps. 3, 4, and 5. The primary objective of this chapter is to explain these systems' information architectures, technologies, and user interfaces. Much attention has been paid to user interfaces, the only part of the systems' implementation that is visible to their users. Selected system usage statistics are also discussed. These pertain to the period between mid-2012, when the systems were made available to the public, and mid-2022. The chapter is structured as follows: Sect. 6.1 presents the reviewer and expert recommendation system; Sect. 6.2 presents the innovation support system (Inventorum); Sect. 6.3 concludes the chapter. This chapter is based partially on previous works: [3–6].
6.1 The Reviewer and Expert Recommendation System

This section depicts the technical perspective of the reviewer and expert recommendation system, which was launched in mid-2012. It is available online1 and free of charge in the Polish and English languages. A second version of the system was developed to meet the needs of the National Centre for Research and Development (NCBR),2 an organisation that is responsible for R&D funding in Poland.
1 https://recenzenci.opi.org.pl/sssr-web/site/home?lang=en
2 https://www.gov.pl/web/ncbr
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Protasiewicz, Knowledge Recommendation Systems with Machine Intelligence Algorithms, Studies in Computational Intelligence 1101, https://doi.org/10.1007/978-3-031-32696-7_6
Fig. 6.1 A scheme of the architecture and technology of the reviewer and expert recommendation system. The system comprises four technical layers: a database layer responsible for storing data, an access layer that serves data, a business layer that incorporates business and computational services, and an interface layer that interacts with users and other systems
6.1.1 System Architecture and Technology

In terms of functionality, the system architecture is modular; from a technical perspective, it is multilayered. A module comprises system elements that are logically organised to implement specific business functions, while a system layer comprises logically organised system elements whose task is to implement specific technical functions. The system resembles a cake: the technical layers are its layers and business modules are its slices. It is worth noting that every slice of the cake (module) includes all of its layers (technical layers). The reviewer and expert recommendation system comprises four technical layers: a database layer, an access layer, a business layer, and an interface layer. Their functionalities and interconnections are presented in Fig. 6.1. The database layer consists of a standard Oracle 11g relational database3 and a MongoDB document-oriented database (also referred to as No Structured Query Language: NoSQL).4 The relational database stores structured data, while the NoSQL database stores unstructured data, such as text. The technological solutions of the nonrelational database ensure quick and efficient access to text data, which is difficult to achieve using standard relational databases. The relational database is optimised

3 https://www.oracle.com
4 https://www.mongodb.com
to process structured data rapidly. That is why two distinct databases were used in the database layer of the system. The (data) access layer provides business processes with mechanisms of access to the data stored in the database layer. The access layer is an intermediary layer that incorporates business logic. Its main tasks are to preprocess and to standardise access to data. Three data access mechanisms were developed. The first integrates structured data stored in the relational database with services and processes of the business layer. This is performed, to the exclusion of the NoSQL database, using Java Persistence API5 and Hibernate.6 The second mechanism indexes and integrates text data stored in the NoSQL database and selected relational data with the services and processes of the business layer. The data is indexed and processed by a full-text search engine. This function was implemented using Apache Lucene,7 which ensures quick access to data via text indexes. Some data from the NoSQL database must be available to the services and processes of the business layer in an unprocessed form. To ensure this, native mechanisms of MongoDB that offer direct access to data are used. This constitutes the third data access mechanism of the access layer. The business layer comprises ‘services’ and ‘processes’ components, which lie between the interface layer and the data access layer. The ‘services’ component comprises services that handle users’ requests, which may come from internet interfaces—for example, when individuals access system websites, or from Representational State Transfer (REST) interfaces or other IT systems. This involves not only a simple request–response mechanism, but also specific data processing. The ‘processes’ component includes computational algorithms, such as those tasked with publication classification, disambiguation of authors, keyword extraction, and recommendations. 
Processes can be initiated by the system administrator via a management console, according to a schedule, or by system users. The technologies used to develop the business layer include Java Enterprise Edition, Enterprise JavaBeans and Java Management Extensions,8 and Spring.9 The interface layer provides various means of communication between users and the system. It must be emphasised that technically, the interface layer communicates only with the business layer. Three interfaces were developed: (i) websites that offer various functionalities to system end users, such as potential reviewers and experts or individuals who seek them; (ii) a REST interface that provides web services to communicate with other IT systems; and (iii) a management console that includes interfaces for the management and administration of the system. The technologies used to develop the interface layer include Java Server Pages, JavaScript, Spring, and Apache Tiles.10
5 https://www.oracle.com/java
6 https://hibernate.org
7 https://lucene.apache.org
8 https://www.oracle.com/pl/java
9 https://spring.io
10 https://tiles.apache.org
Indexation and full-text search technology play important roles in the reviewer and expert recommendation system. The system uses Apache Lucene.11 A full-text model can find any word or phrase in any document included in the database. Full-text search differs from those that are based on metadata or original text fragments stored in databases, such as titles, abstracts, selected sections, or bibliographical references. Alternatives, such as semantic indexing, are reserved for highly specific domains, while most text search systems are based on full-text models. The key element of a full-text search system is a text search engine. It analyses all words in every stored document before attempting to match them to search criteria (i.e. text specified by the user). Searching is based on an index that is built on a collection of documents. The index is a structure of data that is designed to speed up searching. Most full-text search engines use so-called “inverted indexes”: lists of all words that appear in a block of text. Each word is appended with a list of all of its positions in a block of text, sorted in ascending order. Searching an inverted index for a word is as simple as searching a dictionary and returning a list of the word’s occurrences. To search for a phrase (a set of words that cannot be omitted), lists of occurrences for each word must be retrieved and cut to identify the places in which each appears in a block of text [2].
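The inverted-index mechanism described above can be sketched in a few lines of Python. The code below is a deliberately minimal illustration of positional postings and phrase search; Lucene's actual index structures are considerably more sophisticated.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a positional inverted index: word -> {doc_id: [positions]}.
    Positions are appended in ascending order, as described above."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents that contain the words of `phrase` adjacently.
    The occurrence lists of each word are intersected after shifting the
    positions, which 'cuts' them to the places where the phrase appears."""
    words = phrase.lower().split()
    if not words or words[0] not in index:
        return set()
    results = set()
    for doc_id, positions in index[words[0]].items():
        for start in positions:
            if all(w in index and start + i in index[w].get(doc_id, [])
                   for i, w in enumerate(words[1:], 1)):
                results.add(doc_id)
                break
    return results

docs = {1: "full text search engine", 2: "text search is full of engines"}
idx = build_inverted_index(docs)
```

With this toy index, `phrase_search(idx, "text search")` matches both documents, whereas `phrase_search(idx, "full text")` matches only document 1, because "full" and "text" are not adjacent in document 2.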
6.1.2 System User Interfaces

This section presents the interfaces that are available to users of the reviewer and expert recommendation system. A closer analysis of the interfaces enables deeper understanding of the system's practical aspects. Users query the ranking of reviewers and experts in two ways. First, users may possess manuscripts of articles, draft projects, or other documents that they wish to be evaluated. In such cases, the interface depicted in Fig. 6.2 is used. Documents to be evaluated can either be pasted in a text field or loaded from a file (the system supports all popular text file formats). Before selecting the 'analyse document' option, which extracts keywords from blocks of text and proposes reviewers, several parameters can be set. Users can select the language of their document manually or have it selected automatically. They can also specify the maximum number of phrases to be extracted, the maximum length of a phrase that forms a keyword, and the minimum probability that allows a phrase to qualify as a keyword. Another way of querying reviewers and experts involves a user skipping the document analysis phase and providing the keywords directly. This method is particularly useful to users who wish to search a space of solutions for a specific area of knowledge. In such cases, the interface presented in Fig. 6.3 is used. Users enter keywords in Polish and English separately. The number of words and the length of the phrases is unlimited. Users can narrow the scope of reviewers and experts that appear in their searches by specifying the scientific discipline in which the proposed individuals specialise.

11 https://lucene.apache.org
Fig. 6.2 A user interface of the reviewer and expert recommendation system: a query about the ranking of reviewers and experts made by uploading a document for review. The system extracts keywords from blocks of text and proposes reviewers when the user selects the ‘analyse document’ option. Before that, the user can set up several parameters, such as the language of their document, the maximum length of a phrase, the maximum number of phrases to be extracted, and the minimum probability that allows a phrase to qualify as a keyword
Users can also specify institutions affiliated with the applicants or contractors of articles and projects. This data is used to identify potential conflicts of interest that might result from affiliations. The ranking of potential reviewers and experts is presented in the form of a list that is sorted in descending order by the degree of compatibility of individuals’ profiles. Queries are made either via the uploading of a document or via the entering of keywords. The range of the degree of compatibility is [0, 1], where 1 indicates ‘full compatibility’ and 0 indicates ‘no compatibility’. An example of such a list presented in a user interface of the reviewer and expert recommendation system is depicted in Fig. 6.4. Each row of the list contains the names and surnames of proposed reviewers or experts, their academic degrees and titles, their affiliations, and the degree of compatibility of their profiles with the query. Affiliations that may lead to conflicts of interest are highlighted in red. To discover more about particular individuals, users can click on their names and surnames. The profiles of those individuals are then displayed. An example of such an interface with a profile displayed is depicted in Fig. 6.5. Users can view detailed information on individuals, including their contact data, keywords, scientific disciplines and competences, publications, affiliations, experience in reviewing, and potential conflicts of interest.
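The degree of compatibility in [0, 1] can be illustrated with a simple keyword-overlap measure. The scoring function below is an illustrative assumption; the system's actual profile-matching algorithms are described in Chap. 3.

```python
def compatibility(query_keywords, profile_keywords):
    """Degree of compatibility in [0, 1]: the fraction of query keywords
    found in a candidate's profile. 1 indicates full compatibility and 0
    indicates no compatibility. An illustrative measure only."""
    query = {k.lower() for k in query_keywords}
    profile = {k.lower() for k in profile_keywords}
    if not query:
        return 0.0
    return len(query & profile) / len(query)

def rank_candidates(query_keywords, profiles):
    """Return (name, score) pairs sorted in descending order of the degree
    of compatibility, as in the ranking list described above."""
    scored = [(name, compatibility(query_keywords, keywords))
              for name, keywords in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical candidate profiles, each a list of keyword phrases.
profiles = {
    "Reviewer A": ["machine learning", "text mining", "clustering"],
    "Reviewer B": ["fluid dynamics", "turbulence"],
}
ranking = rank_candidates(["text mining", "clustering"], profiles)
```

A production system would additionally filter the ranking for conflicts of interest, for example by excluding candidates whose affiliations match the institutions specified in the query.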
Fig. 6.3 A user interface of the reviewer and expert recommendation system: a query on the ranking of reviewers and experts created by a user entering keywords that describe competencies. The system proposes reviewers when the user selects the ‘generate ranking’ option. The scope of reviewers and experts can be narrowed by providing scientific disciplines. Information about institutions helps in the identification of potential conflicts of interest
The system also has interfaces that can be used to search and view databases of individuals, keywords, and publications and their sources. These are standard technical solutions; particular attention should be paid, however, to the tool used to visualise scientific disciplines. An example of visualisation in a user interface is presented in Fig. 6.6. It depicts a hierarchical classification of the Ontology of Scientific Journals (OSJ) [1]. Six other classifications are also available. Visualisations are dynamic: the classification tree can be zoomed in and out; each node can be marked and its details displayed, including superordinate and subordinate categories, and equivalents in other classifications; individuals, publications, and keywords linked to classification nodes can also be viewed. This tool enables users to search for knowledge and individuals by analysing a hierarchical tree of classification of scientific domains and disciplines.
Fig. 6.4 A user interface of the reviewer and expert recommendation system: a list of proposed reviewers and experts. Each row of the list contains the names and surnames of proposed reviewers or experts, their academic degrees and titles, their affiliations, and their degree of relevance
6.1.3 Selected Statistics

The reviewer and expert recommendation system was launched in mid-2012. The first version comprised a ranking of reviewers that was based on the analysis and comparison of keywords. It was available free of charge online.12 The second version was developed for and available solely to NCBR,13 an organisation that is responsible for research and development funding in Poland. That version was supplemented with a ranking of reviewers that was based on full-text indexing, and with other functionalities required by NCBR. At present, the first version of the system is available to the public. The user statistics presented below pertain to the first version of the system. In mid-2022, the system's databases contained over 200,000 profiles of potential reviewers and experts. One source of information on reviewers' and experts' knowledge and experience is their academic publications, of which over five million have been collected. These are not full-text documents, but metadata, which includes titles, affiliations, authors' names, index terms, and abstracts. It should be emphasised that

12 https://recenzenci.opi.org.pl/sssr-web/site/home?lang=en
13 https://www.gov.pl/web/ncbr-en
Fig. 6.5 A user interface of the reviewer and expert recommendation system: a profile of a potential reviewer or expert. A profile presents detailed information on a person, such as contact data, keywords, scientific disciplines and competences, publications, affiliations, and experience in reviewing. It also contains potential conflicts of interest when accessed from a ranking of reviewers and experts
Fig. 6.6 A user interface of the reviewer and expert recommendation system: a visualisation of scientific disciplines. There are six different classifications, which are linked to each other. Each classification node shows details, such as superordinate and subordinate categories, equivalents in other classifications, individuals, publications, and keywords. As an example, the OSJ is presented
Table 6.1 Selected statistics on some data stored in the database of the reviewer and expert recommendation system as of mid-2022

Information category | Quantity
Individuals | 211,798
Publications | 5,145,185
Keywords | 6,779,000
academic publications are not the only source of data used to create individuals' profiles. In aggregate, the system stores almost eight million keyword phrases that are used to create individuals' profiles and to recommend reviewers and experts. Brief information on the amount of selected data stored in the database of the reviewer and expert recommendation system is presented in Table 6.1. The usefulness of the system can only be ascertained in practice by its users, which is demonstrated in Figs. 6.7 and 6.8. Figure 6.7 presents only the data gathered before 2018; due to a system log configuration error, the statistics generated for 2022 cannot be presented in the same manner (see Fig. 6.8). Given the number of distinct users and how many times each has used the system (Fig. 6.7), it can be concluded that the system piqued users' interest during the first five years following its implementation (2013–2017). Figure 6.7 also demonstrates that, on average, users used the system twice a year. When the number of pages viewed (Fig. 6.8) is considered, it can be observed that the system enjoyed great popularity during the first year of its existence and that it has continued to be used, albeit to a lesser extent, until the present (mid-2022). When the number of pages viewed (Fig. 6.8) is compared with the number of distinct users and visits (Fig. 6.7), certain user behaviours become apparent. It can be assumed that following the system's launch, users tested all system functionalities during single visits, which would explain the large number of pages viewed in 2013. After testing all of the functionalities, users used only those that they found practical. This would also explain the relatively steady number of views and visits between 2014 and 2017. In 2017 and 2018, fewer visits were recorded, but the number of views remained stable. It can be assumed that long-time users started to use a larger number of system functionalities during single visits (Figs. 6.7 and 6.8).
Fig. 6.7 Selected statistics pertaining to use of the reviewer and expert recommendation system: the number of distinct users and the number of visits per year. The statistics were calculated using AWStats (https://awstats.sourceforge.io). The 2012 statistics pertain only to six months
Fig. 6.8 Selected statistics pertaining to use of the reviewer and expert recommendation system: the number of system pages viewed per year. The statistics were calculated using AWStats (https://awstats.sourceforge.io). The 2012 and 2022 statistics pertain only to six months
6.2 Inventorum, the Innovation Support System

This section presents the technical perspective of Inventorum, a system designed to support innovation, which was launched in December 2015. It is available free of
Fig. 6.9 A scheme of the architecture and technology of Inventorum, the innovation support system (HDFS stands for Hadoop Distributed File System). The system comprises four technical layers: a data storage layer, a data access layer, a computation layer, and a user interface layer
charge online.14 Anyone who is interested in innovation can use its data and services. The system is presented in Polish- and English-language versions.
6.2.1 System Architecture and Technology

Inventorum has a standard multilayer architecture with a number of unique solutions. It comprises four layers: a data storage layer, a data access layer, a computation layer, and a user interface layer. Figure 6.9 outlines the general architecture of the system. The data storage and data access layers contain two mechanisms used to store the information that is provided to the other layers. All metadata that describes objects is stored in a relational database; the content of those objects is stored in a distributed file system (for example, an innovation is defined in the database records, while its full description resides in text files). When necessary, the data access layer synchronises readings from both stores; users, however, are interested chiefly in objects' metadata, so the system can answer most queries quickly without loading full documents. The distributed file system is highly effective in file management and in making the files available to computation processes, including the recommendation algorithm. The computation layer comprises algorithms responsible for
14 https://inventorum.opi.org.pl/en
data crawling, information extraction, domain classification, entity recognition, and the identification of innovative companies. The computation layer also contains the recommendation algorithm, which acts as the 'heart' of the system. Additionally, the layer is fitted with business components, which synchronise the work of all layers and implement a standard model-view-controller architecture that handles users' queries. The user layer is a typical, responsive view that adapts its layout to the device on which it is displayed. The system was largely implemented using Java Enterprise Edition;15 the full-text search engine, however, is based on Apache Lucene,16 and some of the data processing algorithms rely on Apache Spark17 and Apache Hadoop.18 Text data is stored and processed by the Hadoop Distributed File System.19 Other data is stored in a standard relational database.
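The split between metadata storage and content storage described above can be sketched as follows. The in-memory dictionaries below merely stand in for the real relational database and distributed file system; the class and method names are illustrative assumptions.

```python
class ObjectStore:
    """Sketch of the storage split: object metadata lives in a fast
    'relational' store and is enough to answer most queries, while full
    content lives in a separate 'file system' store and is fetched only
    on demand. Both dicts stand in for the real back ends."""

    def __init__(self):
        self.metadata = {}   # stands in for the relational database
        self.content = {}    # stands in for the distributed file system

    def put(self, obj_id, meta, text):
        self.metadata[obj_id] = meta
        self.content[obj_id] = text

    def query(self, **criteria):
        """Answer metadata queries without touching full documents."""
        return [obj_id for obj_id, meta in self.metadata.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

    def fetch(self, obj_id):
        """Synchronise a full read: metadata plus the document content."""
        return {**self.metadata[obj_id], "text": self.content[obj_id]}

store = ObjectStore()
store.put(1, {"type": "innovation", "year": 2015}, "Full description ...")
store.put(2, {"type": "project", "year": 2016}, "Project description ...")
hits = store.fetch(store.query(type="innovation")[0])
```

The design choice this illustrates is that `query` never reads the content store, which keeps the common case fast; only an explicit `fetch` pays the cost of loading a full document.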
6.2.2 System User Interfaces

The system welcome page (Fig. 6.10) entices potential users by providing clear and precise information on the system's benefits. Users can also use the welcome page to create or log in to their accounts. The welcome page features a prominent search box that allows users to search a multilingual innovation database and projects without having to register in the system. Figure 6.11 presents the interface of that functionality. A full-text search engine enables users to specify their queries with descriptions. The search engine returns a list of projects and innovations that match their expectations. In the results, keywords that match users' queries are marked in bold. This feature makes it clear why specific projects or innovations are presented as query results. Individuals who are not logged in to the system see only the titles and abstracts of projects and innovations; to view more details, they must become users of the system by registering and logging in. This aligns with the original idea of integrating specific individuals with the resources of the system to encourage visitors to create user accounts. After logging in to the system, users will see their personal desktop (Fig. 6.12), which is the central point of system navigation and contains personalised user information. This reflects the idea of the three information channels discussed in detail in Chap. 4: quick recommendations, full recommendations, and active searching. The top part of the desktop includes functionalities that enable full-text searches of databases of innovations, projects, firms, scientific institutions, experts, and conferences. Users use these functionalities when the information recommended to them is insufficient or when they wish to search for additional information independently.

15 https://www.oracle.com/pl/java
16 https://lucene.apache.org
17 https://spark.apache.org
18 https://hadoop.apache.org
19 https://hadoop.apache.org
Fig. 6.10 A user interface of Inventorum, the innovation support system: the welcome page. The page informs users of the system’s benefits, allows users to create or log in to their accounts, and enables the search of a multilingual innovation database and projects without registration in the system
Fig. 6.11 A user interface of Inventorum, the innovation support system: searching a multilingual innovation and project database. The search engine returns a list of projects and innovations. The terms that match users’ queries are marked in bold. Individuals not logged in to the system see only the titles and abstracts of projects and innovations; to view more details, they must become users of the system by registering and logging in
Fig. 6.12 A user interface of Inventorum, the innovation support system: a user panel as seen by a user who is logged in to the system. It comprises quick recommendations, complete recommendations, and active searches
Quick recommendations are displayed on the bottom part of the desktop. They are the most important pieces of information recommended to users by the recommendation algorithm, and the information that users have marked as ‘to be followed’. At the bottom of the list of quick recommendations is a button that allows users to view the full scope of the information recommended. Figure 6.13 presents an interface that displays such recommendations. Users can specify their preferences regarding the information for which they are searching and the scope of its presentation in the interface. These preferences are considered by the recommendation algorithm. The last prominent feature of the system interfaces is user profiles, of which an example is presented in Fig. 6.14. System users may have multiple simultaneous roles: they can be representatives of entities searching for innovation, representatives of scientific institutions offering their knowledge and expertise, experts, and/or
Fig. 6.13 A user interface of Inventorum, the innovation support system: recommendations prepared for a user and recommendation personalisation
reviewers. The user profile interfaces satisfy the needs of all of these roles. General information on individuals is supplemented with information on their achievements, projects, innovations, entities, scientific institutions, and conferences. Users can tag themselves as potential reviewers and specify the keywords that best describe their competences.
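The interplay between quick and full recommendations can be illustrated with a minimal sketch. The tag-overlap scoring rule below is an illustrative assumption; the actual recommendation algorithm is described in Chap. 4.

```python
def recommend(items, preferences, quick_k=3):
    """Sketch of the two recommendation channels described above: each item
    is scored by how many of the user's preference tags it carries, the full
    recommendation list is every matching item sorted by score, and the
    'quick recommendations' shown on the desktop are the top few."""
    scored = []
    for title, tags in items:
        score = len(set(tags) & set(preferences))
        if score > 0:
            scored.append((score, title))
    scored.sort(reverse=True)  # highest-scoring items first
    full = [title for _, title in scored]
    return full[:quick_k], full

# Hypothetical catalogue items, each tagged with topic keywords.
items = [
    ("Sensor innovation", ["electronics", "IoT"]),
    ("Biotech project", ["biotechnology"]),
    ("ML conference", ["machine learning", "IoT"]),
]
quick, full = recommend(items, ["IoT", "machine learning"], quick_k=1)
```

Because user preferences feed directly into the score, editing the preference tags changes both channels at once, which mirrors the personalisation behaviour described above.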
6.2.3 Selected Statistics

Inventorum, the innovation support system, was launched in December 2015. Table 6.2 and Figs. 6.15 and 6.16 present selected usage statistics that have been gathered since the system's launch. Most information on innovations, projects, conferences, and research institutions comes from public databases. This means that providing the system with initial information was, technically, a straightforward task, and the
Fig. 6.14 A user interface of Inventorum, the innovation support system: a profile of a system user acting as a reviewer and an expert. It may contain information such as business cards, personal data, related people, keywords, documents, publications, projects, innovations, entities, scientific institutions, and conferences
system contains a large amount of such data (Table 6.2). A considerably more challenging task was convincing enterprises and users (including reviewers and experts) that the functionalities of the system could work to their advantage. The methods by which users are invited to use the system are described in Chap. 4. Seven years since the launch of the system, it boasts over sixteen thousand registered users, and its database stores information on over five hundred innovative enterprises (Table 6.2). Despite its considerable number of users, a disproportion of roughly one-to-five can be observed between the number of enterprises and the number of scientific institutions. This data demonstrates that bringing business and science together is a significant challenge. Although only approximately sixteen thousand users are registered in the system (Table 6.2), Fig. 6.15 demonstrates that the system is used actively. This is a
Table 6.2 Selected statistics on the data stored in the database of Inventorum, the innovation support system, as of mid-2022
Information category      Quantity
Companies                 553
Research institutions     3,091
Innovations               3,480,336
Projects                  85,957
Conferences               18,920
Users                     16,122
Experts                   2,265
Reviewers                 4,029
considerable achievement, given that the system is not an entertainment platform, as popular social media platforms are, but a tool that experts find useful in their professional lives. Between 2017 and 2021, the system was used daily by approximately forty users, who collectively visited the system fifty to seventy times a day. This means that few individuals used the system more than once every day, which is unsurprising in this type of service. The number of users and visits during the system's first year (2016) was three times lower than the yearly average between 2017 and 2021, when it remained stable. In 2022, the number of users and visits is expected to rise by 25% in comparison to the average from previous years (the 2022 data in Fig. 6.15 pertains to the first half of the year). The spikes in user activity described above may be influenced by the promotional campaigns for the system in 2016 and 2022. The 2016 campaign popularised the system. In subsequent years (2017–2021), no promotional activities were undertaken and no new users joined the system. A dynamic increase in the number of users was observed in 2022 after a promotional campaign was organised (Fig. 6.15). Similar conclusions regarding the correlation between users' activity and promotional campaigns can be drawn from analysis of the number of pages viewed in specific years of the system's existence (see Fig. 6.16). It is also noteworthy that during every visit, the average user views four pages. This means that use of the system is limited to the same, repetitive functionalities, and users seldom explore others. It can be assumed that another invariable exists aside from the constant number of users and visits: the system is used in the same manner by different users. With this in mind, the revamp of functionalities, the new design of the system, and the promotional campaign for 2022 seem entirely justifiable (Fig. 6.16).
Fig. 6.15 Selected statistics pertaining to the use of Inventorum, the innovation support system: the number of distinct users and the number of visits in a year. The statistics were calculated using Google Analytics (https://analytics.google.com). The 2016 statistics pertain to three-quarters of the year and the 2022 statistics pertain to the first half of the year
Fig. 6.16 Selected statistics pertaining to the use of Inventorum, the innovation support system: the number of system pages viewed in a year. The statistics were calculated using Google Analytics (https://analytics.google.com). The 2016 statistics pertain to three-quarters of the year and the 2022 statistics pertain to the first half of the year
6.3 Summary

The presentation of technical aspects of the reviewer and expert recommendation system and of Inventorum, the innovation support system, should be concluded with a discussion of the architecture and technology of the systems and of how they are used. It must be noted that although the systems' architecture and technology were state of the art at the time of their development, the systems' originality lies in their deployment. Most algorithmic proposals on knowledge recommendation reported in the literature are validated on test or limited data. Here, both information systems have been launched in real business cases, which is the crux of their practical originality. Despite relying on a standard multilayer IT architecture, the reviewer and expert recommendation system boasts three compelling solutions. First, a relational database coexists with NoSQL mass storage, which enables time-consuming operations to be accelerated. Second, access to relational data is linked to full-text access; such data can be viewed on user interfaces, searched rapidly, and presented. Third, customer services are separated from computational processes. This eliminates problems with long-term computations affecting the efficiency of services that support the short-term activities of users. Due to its modular architecture, the system is easy to maintain and expand, as its functional components can be approached individually. Inventorum is also based on a standard multilayer architecture. One compelling solution involves the Hadoop Distributed File System, which has been applied in parallel with a standard relational database. This allows long-term computational processes to be separated from the service of users. Both systems were developed approximately ten years ago; it is understandable that different architectural solutions have been implemented in the years since.
In the case of a comprehensive modernisation of the systems, the multilayer architecture would likely be replaced by a microservice one. User interfaces and system usage statistics should not be discussed separately: system services are provided via interfaces, and statistics reveal how such services are used in practice. Despite scant promotion, the reviewer recommendation system attracted considerable interest in the year it was launched. Inventorum required an intensive promotional campaign to pique potential users' interest. It seemed reasonable to assume that the selection of reviewers and experts was not as socially or economically important as innovation, but the systems' usage statistics have proved precisely the reverse. This case demonstrates that intuitive knowledge may be insufficient to properly match IT services to customers, and that effective promotion is key to the popularisation of such services. It should be noted that both systems remained fairly popular over the years. The only variable was system usage patterns: users limited themselves to the same system functionalities without exploring other interface areas. These behaviours could only be altered by promotional campaigns. System interfaces should be designed in consultation with targeted groups of recipients and supported by promotional campaigns. The usefulness of the interfaces should be monitored, and the interfaces should be adapted to the needs of people with disabilities.
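The separation of user-facing services from long-running computational processes, noted above for both systems, can be illustrated with a minimal sketch. The queue-based worker below is a generic pattern under assumed names (`submit`, `worker`), not the systems' actual implementation:

```python
import queue
import threading

jobs = queue.Queue()
results = {}

def worker():
    # Background worker: drains long-running jobs so the user-facing
    # service threads are never blocked by them.
    while True:
        item = jobs.get()
        if item is None:                   # sentinel: shut the worker down
            break
        job_id, numbers = item
        results[job_id] = sum(numbers)     # stand-in for a heavy computation
        jobs.task_done()

def submit(job_id, numbers):
    """Called from the user-facing service: enqueue and return immediately."""
    jobs.put((job_id, numbers))

t = threading.Thread(target=worker, daemon=True)
t.start()
submit("profile-42", [1.0, 2.0, 3.0])
jobs.join()                                # sketch only: wait for the result
jobs.put(None)
t.join()
print(results["profile-42"])               # → 6.0
```

In production, the queue would typically be an external broker and the computation a separate process, so that a slow recalculation never affects the responsiveness of user requests.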
Chapter 7
Conclusions
7.1 Knowledge Recommendation

In the modern world, easy and rapid access to up-to-date data and information that is relevant to specific tasks and studies is crucial. Knowledge can be acquired through recommendation systems, which create opportunities to gain competitive advantages and assist in managing financial resources, available goods, and scientific research more effectively. Knowledge relies heavily on properly structured information, which derives from source data. The main subject of this book is knowledge recommendation. This includes the assignment of reviewers to articles, projects, and software for evaluation; the identification of experts who can answer specific questions or who are competent in specific areas; the assignment of personnel to workplaces and tasks; and the suggestion of scholarly articles to scientists. Each of these tasks can be conducted as knowledge assignment, recommendation, or finding. The differences between these approaches are discussed more deeply in Chap. 2. Due to the extensive nature of knowledge recommendation, only two of its aspects are considered deeply in this book. The first is the recommendation of reviewers and experts who can evaluate manuscripts of scholarly articles or research and development project drafts; the second is innovation support, which involves bringing science and business together by recommending information (such as innovations, projects, prospective partners, experts, and conferences) in a meaningful manner. The reviewer and expert recommendation system project is presented in Chap. 3, and the innovation support system project in Chap. 4. Recommendations in both systems rely on content-based algorithms. Knowledge recommendation can be condensed into the suggestion of individuals who hold a certain degree of expertise. Some exceptions to this general rule are recommendations on innovations, projects, enterprises, and conferences.
Regardless, the bedrock of the knowledge recommendation system is the development of profiles that contain information on the scope and degree of individuals' expertise. Only after
7 Conclusions
such profiles are created can they be presented as recommendations. The profiles can be formed from data that describes individual accomplishments, a process known as information extraction from data. Chapter 5 outlines a series of algorithms that are used for this purpose, such as data acquisition, keyword extraction, publication classification, author disambiguation, and enterprise innovation identification. Both the reviewer and expert recommendation system and the innovation support system have been implemented as IT systems that operate in real business environments. Chapter 6 explains the technical aspects of their implementation, with special regard to the system architectures and the user interface functionalities. Of particular interest are the usage statistics of the systems. They show the importance of matching system functionalities to the needs of users, and of maintaining users' interest in a specific product over time by, for example, launching periodic promotional campaigns. It should be noted that modern users are fatigued by the wide variety of available e-services; whether they use an IT system depends not only on its usability, but also on how effectively it is promoted.
7.2 Novelty and Originality

The systems and algorithms presented in this book refer to contemporary solutions described in the literature and include many compelling aspects concerning novelty and originality. The development of science is an iterative process, and all astute changes are worthy of further study. Novelty and originality can be observed in the following areas:

1. Analysis and synthesis of knowledge recommendation accomplishments since 2000. The literature produced since 2000 is reviewed in a critical, meticulous, and constructive manner, and a synthesis of current knowledge recommendation accomplishments is presented. This includes an extensive analysis of the development, trends, and changes in the research directions of knowledge recommendation. A quantitative analysis of relevant publications highlighted three currents of proposed solutions: recommendation, finding, and assignment. A detailed qualitative analysis demonstrated that the algorithms used for these purposes relied on heuristic solutions, probabilistic models, statistical machine learning models, artificial intelligence models, and modern deep learning methods. Although deep learning models usually offer high accuracy, they also entail significant computational complexity. For that reason, simple heuristic and statistical algorithms often prove more useful in practice. The growing number of publications in recent years suggests improved access to algorithmic tools and datasets, which are key to the proposed algorithms and models (as they are verified experimentally).
2. The design, development, and implementation of two IT systems supplemented with content-based recommendation algorithms to apply knowledge recommendation in practice. Knowledge recommendation is illustrated using examples of two aspects: reviewer and expert recommendation, and innovation support. Two IT systems have been designed, developed, and implemented with special regard given to those aspects. The originality of the systems is demonstrated both in their algorithmic and their architectural solutions. Two content-based recommendation algorithms support the reviewer and expert recommendation system. The first is based on the cosine measure between the keyword vectors representing the reviewer and the problem to be reviewed. It provides human-understandable recommendations, with individuals' expertise described by keywords. The second combines full-text indexation of individuals' accomplishments with the cosine measure. This approach eliminates possible information loss by working on entire indexed documents. The system's architecture is modular and adjustable, which enables specific computational and business processes to be separated and configured flexibly in the system modules. A content-based recommendation algorithm is also used in the innovation support system. It is worth noting that the algorithm 'understands' six types of entity: innovation, project, conference, scientific unit, enterprise, and expert. The system, as a whole, attempts to manage and control information in a manner that brings science and business together. For this purpose, three channels have been developed via which information is provided to users: fast recommendations, full recommendations, and a semantic information search engine. It should also be highlighted that the system supports open innovation.

3.
The development and practical application of selected heuristic and machine learning algorithms to create individuals' expertise profiles and enterprise innovation evaluations. The core novelty aspect is the design of a cohesive algorithmic process for the development of the profiles of possible experts and reviewers. The originality of the proposed algorithms lies in their appropriate selection and experimental adjustment. In brief, the algorithms are used to acquire data, classify publications, identify (disambiguate) their authors, recognise keywords, and evaluate whether enterprises are innovative. Another novelty is the inclusion of an algorithm (classifier) that evaluates innovation on the basis of the content presented on enterprises' websites. Structured data acquisition was achieved using heuristic algorithms, while nonstructured data acquisition relied on topical crawlers and a CRF model that acquires information on scholars from their websites. Despite heuristic algorithms requiring a separate set of rules to be determined for each data source, they proved sufficiently effective. The crawler algorithm was more error-prone, and the data it acquired had to be processed manually. The simple heuristic algorithms used to
extract and import data (publication metadata) from open and structured sources proved sufficient in the creation of the reviewer and expert recommendation system. The problem of classification of large sets of monolingual and multilingual scholarly publication metadata in the hierarchical structure of scientific domains and disciplines was also analysed. Based on a series of experiments, a unique system was proposed that classifies multilingual documents. The system comprised a set of monolingual classifiers developed on the basis of the multinomial naive Bayes algorithm and a multilingual decision model that utilised a logistic regression algorithm. One advantage of the solution is its simplicity, which ensures high computational efficiency without compromising the quality of monolingual and multilingual document classification. The author identification algorithm allows scientific and practical achievements to be assigned to real individuals. An author disambiguation framework is proposed, which is based on heuristic rules and hierarchical agglomerative clustering. The heuristic algorithm is fast and yields accurate results, but requires experience in the creation of its rules. It can match publications only to their existing authors. The clustering algorithm, although less accurate and more computationally complex than the rule-based algorithm, can discover new profiles of unknown researchers from groups of publications. The originality of the solution lies in its combination of these approaches into a single framework that is responsible for the author disambiguation process. The following algorithms have been developed: (i) an algorithm based on heuristic rules; (ii) a clustering algorithm in which cluster similarity is measured using heuristic rules; and (iii) a clustering algorithm in which cluster similarity is estimated by classifiers. The extraction of keywords from text enables the creation of expertise profiles for potential reviewers and experts. 
The algorithms used to extract keywords from text in the English language are explained sufficiently in the literature. The Polish Keyword Extractor is an original algorithm that is designed for text in the Polish language. The algorithm is simple and fast, which is relevant for the design of real information systems. An original mechanism has been developed that searches the internet for potentially innovative enterprises and invites them to be included in the innovation support system. The innovation evaluation algorithm comprises three classifiers, each of which analyses enterprises’ websites from a different perspective. Next, the decision module integrates partial classification results into a final decision on enterprises’ innovation, which is expressed as innovation probability. The algorithm can be used in the preliminary search for potentially innovative enterprises. The full innovation evaluation, however, necessitates more advanced solutions, as the use of enterprises’ websites as the sole source of information to evaluate innovation may prove insufficient. It should be emphasised that the above by no means covers all aspects of the novelty and originality of the algorithms used in systems that implement knowledge recommendation. Detailed information and full descriptions are included in this book.
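The cosine measure between keyword vectors, on which the first reviewer recommendation algorithm described in point 2 above is based, can be sketched as follows. The profiles, keywords, and function names are hypothetical; the production system derives its vectors from full expertise profiles rather than short keyword lists:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_reviewers(problem_keywords, reviewer_profiles):
    """Rank reviewer profiles by similarity to the problem's keyword vector."""
    problem = Counter(problem_keywords)
    scored = [(name, cosine(problem, Counter(keywords)))
              for name, keywords in reviewer_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical keyword profiles built from individuals' publications
profiles = {
    "Reviewer A": ["neural networks", "deep learning", "classification"],
    "Reviewer B": ["databases", "indexing", "information retrieval"],
}
ranking = rank_reviewers(
    ["deep learning", "classification", "neural networks"], profiles)
print(ranking[0][0])  # the closest match
```

Because both the profile and the problem are plain keyword vectors, the resulting recommendation remains human-understandable: the shared keywords themselves explain why a reviewer was matched.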
7.3 Further Development

In view of the growing complexity of the knowledge-based economy, knowledge recommendation methods are expected to become more useful and to be developed further. Identifying the most relevant reviewers, experts, professionals, innovations, and innovative enterprises is key for science and business. The development of these methods will be supported not only by the design of increasingly efficient algorithms and more refined information systems, but also by the improved availability and quality of datasets. The algorithms presented in this book are based on heuristic solutions, as well as probabilistic and statistical machine learning models. This approach proved particularly useful in practice due to its simplicity and operational speed, achieved without compromising quality. New methods of knowledge recommendation and individual expertise profile creation should consider the dynamic development of deep learning methods. Neural language models that are based on Transformer networks and are used to model expertise profiles seem particularly promising. Deep learning could also be utilised in knowledge recommendation algorithms. The development of more complex architectures that comprise multiple models should be expected. It should be noted that, currently, algorithms from the deep learning family usually offer higher accuracy at the cost of significantly higher computational complexity. The development of algorithms such as those outlined above is insufficient to achieve a significant qualitative difference in knowledge recommendation. Algorithms must operate in the context of information systems that offer various e-services to end users. Designers of new systems should consider microservice architectures: ones that focus on e-services that satisfy the needs of users rather than on the systems themselves. User interfaces that provide personalised information of a relevant scope and at a relevant time must be designed.
In an era in which users are bombarded with information, personalised interfaces may determine the success of e-services. Today’s e-services are quickly forgotten and should, therefore, be promoted regularly and adapted constantly to the changing environment. The last, and possibly most important, directions of development for knowledge recommendation methods are the availability of datasets to a broader group of researchers and e-service providers, and the evaluation of their quality before they are used in the design of methods. Due to knowledge recommendation relying chiefly on nonpublic confidential data, it is impossible to verify and compare the quality of published algorithms. The creation of high-quality datasets that are available to the public is crucial to the success of the communities that develop knowledge recommendation methods. Such datasets should include knowledge recommendation aspects, such as recommendations of reviewers, experts, staff members, innovations, innovative enterprises, and research teams. It would be reasonable to consider the launch of a project or the establishment of an (informal) organisation with the goal of creating a European or global data corpus. The data should be provided in multiple natural languages; initially, several of the most prominent ones would suffice. It is also important that the ongoing initiative be open to new participants.
I wish to thank my readers for the time they have spent with this book. I sincerely believe that the material will inspire the further development of knowledge recommendation. I would be honoured to collaborate with all readers who wish to work on new knowledge recommendation algorithms, systems, and data.