English Pages [422] Year 2024
Bing Wang Zuojin Hu Xianwei Jiang Yu-Dong Zhang (Eds.)
535
Multimedia Technology and Enhanced Learning 5th EAI International Conference, ICMTEL 2023 Leicester, UK, April 28–29, 2023 Proceedings, Part IV
Part 4
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Editorial Board Members
Ozgur Akan, Middle East Technical University, Ankara, Türkiye
Paolo Bellavista, University of Bologna, Bologna, Italy
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong, China
Geoffrey Coulson, Lancaster University, Lancaster, UK
Falko Dressler, University of Erlangen, Erlangen, Germany
Domenico Ferrari, Università Cattolica Piacenza, Piacenza, Italy
Mario Gerla, UCLA, Los Angeles, USA
Hisashi Kobayashi, Princeton University, Princeton, USA
Sergio Palazzo, University of Catania, Catania, Italy
Sartaj Sahni, University of Florida, Gainesville, USA
Xuemin Shen, University of Waterloo, Waterloo, Canada
Mircea Stan, University of Virginia, Charlottesville, USA
Xiaohua Jia, City University of Hong Kong, Kowloon, Hong Kong
Albert Y. Zomaya, University of Sydney, Sydney, Australia
The LNICST series publishes ICST’s conferences, symposia and workshops. LNICST reports state-of-the-art results in areas related to the scope of the Institute. The type of material published includes
• Proceedings (published in time for the respective event)
• Other edited monographs (such as project reports or invited volumes)
LNICST topics span the following areas:
• General Computer Science
• E-Economy
• E-Medicine
• Knowledge Management
• Multimedia
• Operations, Management and Policy
• Social Informatics
• Systems
Editors
Bing Wang, Nanjing Normal University of Special Education, Nanjing, China
Zuojin Hu, Nanjing Normal University of Special Education, Nanjing, China
Xianwei Jiang, Nanjing Normal University of Special Education, Nanjing, China
Yu-Dong Zhang, University of Leicester, Leicester, UK
ISSN 1867-8211 ISSN 1867-822X (electronic) Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering ISBN 978-3-031-50579-9 ISBN 978-3-031-50580-5 (eBook) https://doi.org/10.1007/978-3-031-50580-5 © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
We are delighted to introduce the proceedings of the fifth edition of the European Alliance for Innovation (EAI) International Conference on Multimedia Technology and Enhanced Learning (ICMTEL), held in 2023. This conference brought together researchers, developers and practitioners from around the world who are leveraging and developing multimedia technologies and enhanced learning. The theme of ICMTEL 2023 was “Human Education-Related Learning and Machine Learning-Related Technologies”.
The technical program of ICMTEL 2023 consisted of 119 full papers, including 2 invited papers in oral presentation sessions at the main conference tracks. The conference tracks were: Track 1, AI-Based Education and Learning Systems; Track 2, Medical and Healthcare; Track 3, Computer Vision and Image Processing; and Track 4, Data Mining and Machine Learning. Aside from the high-quality technical paper presentations, the technical program also featured three keynote speeches and three technical workshops. The three keynote speeches were by Steven Li from Swansea University, UK, “Using Artificial Intelligence as a Tool to Empower Mechatronic Systems”; Suresh Chandra Satapathy from Kalinga Institute of Industrial Technology, India, “Social Group Optimization: Analysis, Modifications and Applications”; and Shuihua Wang from the University of Leicester, UK, “Multimodal Medical Data Analysis”. The workshops were organized by Xiaoyan Jiang and Xue Han from Nanjing Normal University of Special Education, China, and Yuan Xu and Bin Sun from University of Jinan, China.
Coordination with the Steering Committee Chairs, Imrich Chlamtac, De-Shuang Huang, and Chunming Li, was essential for the success of the conference. We sincerely appreciate their constant support and guidance. It was also a great pleasure to work with such an excellent organizing committee, and we appreciate their hard work in organizing and supporting the conference.
In particular, we thank the Technical Program Committee, led by our TPC Co-chairs, Shuihua Wang and Xin Qi, who completed the peer-review process of technical papers and put together a high-quality technical program. We are also grateful to the Conference Manager, Ivana Bujdakova, for her support, and to all the authors who submitted their papers to the ICMTEL 2023 conference and workshops.
We strongly believe that the ICMTEL conference provides a good forum for researchers, developers and practitioners to discuss all science and technology aspects that are relevant to multimedia technology and enhanced learning. We also expect that future events will be as successful and stimulating as this one, as indicated by the contributions presented in this volume.
October 2023
Bing Wang Zuojin Hu Xianwei Jiang Yu-Dong Zhang
Organization
Steering Committee
Imrich Chlamtac, Bruno Kessler Professor, University of Trento, Italy
De-Shuang Huang, Tongji University, China
Chunming Li, University of Electronic Science and Technology of China (UESTC), China
Lu Liu, University of Leicester, UK
M. Tanveer, Indian Institute of Technology, Indore, India
Huansheng Ning, University of Science and Technology Beijing, China
Wenbin Dai, Shanghai Jiaotong University, China
Wei Liu, University of Sheffield, UK
Zhibo Pang, ABB Corporate Research, Sweden
Suresh Chandra Satapathy, KIIT, India
Yu-Dong Zhang, University of Leicester, UK

General Chair
Yudong Zhang, University of Leicester, UK

General Co-chairs
Ruidan Su, Shanghai Advanced Research Institute, China
Zuojin Hu, Nanjing Normal University of Special Education, China

Technical Program Committee Chairs
Shuihua Wang, University of Leicester, UK
Xin Qi, Hunan Normal University
Technical Program Committee Co-chairs
Bing Wang, Nanjing Normal University of Special Education, China
Yuan Xu, University of Jinan, China
Juan Manuel Górriz, Universidad de Granada, Spain
M. Tanveer, Indian Institute of Technology, Indore, India
Xianwei Jiang, Nanjing Normal University of Special Education, China

Local Chairs
Ziquan Zhu, University of Leicester, UK
Shiting Sun, University of Leicester, UK

Workshops Chair
Yuan Xu, Jinan University, China

Publicity and Social Media Chair
Wei Wang, University of Leicester, UK

Publications Chairs
Xianwei Jiang, Nanjing Normal University of Special Education, China
Dimas Lima, Federal University of Santa Catarina, Brazil

Web Chair
Lijia Deng, University of Leicester, UK
Technical Program Committee
Abdon Atangana, University of the Free State, South Africa
Amin Taheri-Garavand, Lorestan University, Iran
Arifur Nayeem, Saidpur Government Technical School and College, Bangladesh
Arun Kumar Sangaiah, Vellore Institute of Technology, India
Carlo Cattani, University of Tuscia, Italy
Dang Thanh, Hue College of Industry, Vietnam
David Guttery, University of Leicester, UK
Debesh Jha, Chosun University, Korea
Dimas Lima, Federal University of Santa Catarina, Brazil
Frank Vanhoenshoven, University of Hasselt, Belgium
Gautam Srivastava, Brandon University, Canada
Gonzalo Napoles Ruiz, University of Hasselt, Belgium
Hari Mohan Pandey, Edge Hill University, UK
Hong Cheng, First Affiliated Hospital of Nanjing Medical University, China
Jerry Chun-Wei Lin, Western Norway University of Applied Sciences, Bergen, Norway
Juan Manuel Górriz, University of Granada, Spain
Liangxiu Han, Manchester Metropolitan University, UK
Mackenzie Brown, Perdana University, Malaysia
Mingwei Shen, Hohai University, China
Nianyin Zeng, Xiamen University, China
Contents – Part IV

Workshop 1: AI-Based Data Processing, Intelligent Control and Their Applications

Research and Practice of Sample Data Set Collection Platform Based on Deep Learning Campus Question Answering System
Wu Zhixia

An Optimized Eight-Layer Convolutional Neural Network Based on Blocks for Chinese Fingerspelling Sign Language Recognition
Huiwen Chu, Chenlei Jiang, Jingwen Xu, Qisheng Ye, and Xianwei Jiang

Opportunities and Challenges of Education Based on AI – The Case of ChatGPT
Junjie Zhong, Haoxuan Shu, and Xue Han

Visualization Techniques for Analyzing Learning Effects – Taking Python as an Example
Keshuang Zhou, Yuyang Li, and Xue Han

Adversarial Attack on Scene Text Recognition Based on Adversarial Networks
Yanju Liu, Xinhai Yi, Yange Li, Bing Wang, Huiyu Zhang, and Yanzhong Liu

Collaboration of Intelligent Systems to Improve Information Security
Lili Diao and Honglan Xu

"1 + X" Blended Teaching Mode Design in MOOC Environment
Yanling Liu, Liping Wang, and Shengwei Zhang

A CNN-Based Algorithm with an Optimized Attention Mechanism for Sign Language Gesture Recognition
Kai Yang, Zhiwei Yang, Li Liu, Yuqi Liu, Xinyu Zhang, Naihe Wang, and Shengwei Zhang

Research on Application of Deep Learning in Esophageal Cancer Pathological Detection
Xiang Lin, Zhang Juxiao, Yin Lu, and Ji Wenpei

Workshop 2: Intelligent Application in Education

Coke Quality Prediction Based on Blast Furnace Smelting Process Data
ShengWei Zhang, Xiaoting Li, Kai Yang, Zhaosong Zhu, and LiPing Wang

Design and Development on an Accessible Community Website of Online Learning and Communication for the Disabled
Jingwen Xu, Hao Chen, Qisheng Ye, Ting Jiang, Xiaoxiao Zhu, and Xianwei Jiang

Exploration of the Teaching and Learning Model for College Students with Autism Based on Visual Perception – A Case Study in Nanjing Normal University of Special Education
Yan Cui, Xiaoyan Jiang, Yue Dai, and Zuojin Hu

Multi-Modal Characteristics Analysis of Teaching Behaviors in Intelligent Classroom – Take Junior Middle School Mathematics as an Example
Yanqiong Zhang, Xiang Han, Jingyang Lu, Runjia Liu, and Xinyao Liu

Based on the 2010–2022 Review of Domestic and Foreign Educational Evaluation and University Internal Evaluation Methods
Xiaoxiao Zhu, Huiyao Ge, Liping Wang, and Yanling Liu

A Summary of the Research Methods of Artificial Intelligence in Teaching
Huiyao Ge, Xiaoxiao Zhu, and Xiaoyan Jiang

A Sign Language Recognition Based on Optimized Transformer Target Detection Model
Li Liu, Zhiwei Yang, Yuqi Liu, Xinyu Zhang, and Kai Yang

Workshop 3: The Control and Data Fusion for Intelligent Systems

A Review of Electrodes Developed for Electrostimulation
Xinyuan Wang, Mingxu Sun, Hao Liu, Fangyuan Cheng, and Ningning Zhang

Non-invasive Scoliosis Assessment in Adolescents
Fangyuan Cheng, Liang Lu, Mingxu Sun, Xinyuan Wang, and Yongmei Wang

Algorithm of Pedestrian Detection Based on YOLOv4
Qinjun Zhao, Kehua Du, Hang Yu, Shijian Hu, Rongyao Jing, and Xiaoqiang Wen

A Survey of the Effects of Electrical Stimulation on Pain in Patients with Knee Osteoarthritis
Ruiyun Li, Qing Cao, and Mingxu Sun

The Design of Rehabilitation Glove System Based on sEMG Signals Control
Qing Cao, Mingxu Sun, Ruiyun Li, and Yan Yan

Gaussian Mass Function Based Multiple Model Fusion for Apple Classification
Shuhui Bi, Lisha Chen, Xue Li, Xinhua Qu, and Liyao Ma

Research on Lightweight Pedestrian Detection Method Based on YOLO
Kehua Du, Qinjun Zhao, Rongyao Jing, Lei Zhao, Shijian Hu, Shuaibo Song, and Weisong Liu

Research on the Verification Method of Capillary Viscometer Based on Connected Domain
Rongyao Jing, Kun Zhang, Qinjun Zhao, Tao Shen, Kehua Du, Lei Zhao, and Shijian Hu

Research on License Plate Recognition Methods Based on YOLOv5s and LPRNet
Shijian Hu, Qinjun Zhao, Shuo Li, Tao Shen, Xuebin Li, Rongyao Jing, and Kehua Du

Research on Defective Apple Detection Based on Attention Module and ResNet-50 Network
Lei Zhao, Zhenhua Li, Qinjun Zhao, Wenkong Wang, Rongyao Jing, Kehua Du, and Shijian Hu

Understanding the Trend of Internet of Things Data Prediction
Lu Zhang, Lejie Li, Benjie Dong, Yanwei Ma, and Yongchao Liu

Finite Element Simulation of Cutting Temperature Distribution in Coated Tools During Turning Processes
Jingjie Zhang, Guanghui Fan, Liwei Zhang, Lili Fan, Guoqing Zhang, Xiangfei Meng, Yu Qi, and Guangchen Li

Teaching Exploration on Calculation Method Under the Background of Emerging Engineering Education
Shuhui Bi, Liyao Ma, Yuan Xu, and Xuehua Yan

Predicting NOx Emission in Thermal Power Plants Based on Bidirectional Long and Short Term Memory Network
Xiaoqiang Wen and Kaichuang Li

On the Trend and Problems of IoT Data Anomaly Detection
Shuai Li, Lejie Li, Kaining Xu, Jiafeng Yang, and Siying Qu

Power Sequencial Data - Forecasting Trend
Lejie Li, Lu Zhang, Bin Sun, Benjie Dong, and Kaining Xu

Comparison of Machine Learning Algorithms for Sequential Dataset Prediction
Zhuang Ma, Tao Shen, Zhichao Sun, Kaining Xu, and Xingsheng Guo

Trend and Methods of IoT Sequential Data Outlier Detection
Yinuo Wang, Tao Shen, Siying Qu, Youling Wang, and Xingsheng Guo

Optimization of Probabilistic Roadmap Based on Two-Dimensional Static Environment
Binpeng Wang, Houqin Huang, Lin Sun, and Chao Feng

Partition Sampling Strategy for Robot Motion Planning in Narrow Passage Under Uncertainty
Binpeng Wang, Zeqiang Li, Lin Sun, and Chao Feng

IoT Time-Series Missing Value Imputation - Comparison of Machine Learning Methods
Xudong Chen, Bin Sun, Shuhui Bi, Jiafeng Yang, and Youling Wang

Author Index
Workshop 1: AI-Based Data Processing, Intelligent Control and Their Applications
Research and Practice of Sample Data Set Collection Platform Based on Deep Learning Campus Question Answering System
Wu Zhixia
School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing, China
[email protected], [email protected]
Abstract. This paper describes the design and implementation of a question-and-answer sample dataset collection platform built with the Spring+SpringMVC+MyBatis framework and SQL data storage technology. The system provides four main functional modules: the text system file import module, the question and answer sample dataset collection module, the question and answer sample dataset management module, and the question and answer sample dataset output module. This research supports the domain-specific collection and organization of question answering datasets.
Keywords: SSM · question answering system · BERT
1 Introduction
An intelligent question answering system is a human-machine dialogue service that integrates technologies such as knowledge bases, information retrieval, machine learning, and natural language understanding. Depending on the application field, intelligent question answering systems are usually divided into open-domain question answering systems and limited-domain question answering systems. An open-domain Q&A system needs a rich knowledge base, which provides a foundation for answering questions in multiple fields [1]. One example is a chatbot built with DeepPavlov, a powerful open-source AI library for chatbots and dialogue systems developed by the Moscow Institute of Physics and Technology (MIPT), which uses a large amount of article data from Wikipedia as its source of knowledge. However, when answering questions in professional fields, open-domain Q&A systems have difficulty locating answers accurately. Currently, Q&A systems for limited fields such as law, healthcare, and finance are relatively mature, but Q&A systems for universities are still at an early stage [2]. If we can analyze the characteristics of a university, take the campus as a guide, and help universities establish a unified and reliable information acquisition platform that automatically answers students’ questions, such as enrollment consultation, campus rules and regulations, campus library opening hours, and the leave approval process, we can reduce labor costs and make campus service work more automated and convenient. Such a platform can also support dynamic monitoring of students’ situation at school.
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024
Published by Springer Nature Switzerland AG 2024. All Rights Reserved
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 3–10, 2024.
https://doi.org/10.1007/978-3-031-50580-5_1
To build a campus intelligent question answering system, it is necessary to reduce the occurrence of "wrong answers" during the question answering process. This usually involves repeating the following steps: collecting question and answer datasets, selecting a pretrained model, training it on the training set, verifying its effectiveness on the test set, improving the model, retraining, and revalidating. Finally, the best model is selected and applied to the question answering system to achieve intelligent question answering [3]. For a limited-domain campus Q&A system, building a Q&A data collection platform to collect and organize data is a necessary step. This article elaborates on the design and implementation of a sample dataset collection platform for a question answering system using the SSM (Spring+SpringMVC+MyBatis) framework and SQL data storage technology.
2 SSM Framework
The SSM framework is composed of three open-source frameworks: Spring, Spring MVC, and MyBatis. Spring MVC corresponds to the Controller layer at the front of the application and is responsible for separating model, view, and controller. The MyBatis framework provides support for data persistence. As a lightweight IoC (Inversion of Control) container, Spring is responsible for finding, locating, creating, and managing objects and the dependencies between them, which lets Spring MVC and MyBatis work together more smoothly. This project is implemented with Spring+Spring MVC+MyBatis, a lightweight development framework.
3 BERT for Question Answering System
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model proposed by Google in 2018. Pretraining is a concept from transfer learning [4]. For example, if we have a large amount of Wikipedia data, we can use it to train a model with strong generalization ability. When we need to use the model in a specific scenario, we only need to modify some output layers, then use our own data for incremental training and make slight adjustments to the weights. There are many pretrained language models, and BERT is one of them [5]. It uses a new Masked Language Model (MLM) objective to generate deep bidirectional language representations, rather than pretraining with traditional one-way language models or a shallow concatenation of two one-way language models.
When the "question answering system" is mentioned as an application of BERT, it usually refers to using BERT on SQuAD (Stanford Question Answering Dataset) [6]. SQuAD is a reading comprehension dataset based on questions posed on Wikipedia articles. It contains over 100,000 question and answer pairs on over 500 articles, with each answer being a segment of text, or span, from the corresponding reading passage [7]. In BERT applications, it is necessary to input both the question and the text fragment containing the answer, separated by the special symbol [SEP]. BERT outputs the start and end positions of the answer within the text fragment, which means it can highlight the text range of the corresponding answer [8]. In the later stages of this project, we will conduct incremental training on the collected Chinese sample question answering dataset to build an intelligent question answering system suited to the campus.
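The input layout and span output described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in this project: the function names `build_qa_input` and `best_span` are hypothetical, the tokens are a toy English example, and the start and end scores are hand-made stand-ins for the values a trained BERT model would produce.

```python
from typing import List, Tuple


def build_qa_input(question_tokens: List[str],
                   context_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Lay out a question and a context passage in the BERT pair format:
    [CLS] question [SEP] context [SEP], with segment ids 0 and 1."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


def best_span(start_scores: List[float], end_scores: List[float],
              context_offset: int, max_answer_len: int = 30) -> Tuple[int, int]:
    """Pick the (start, end) token indices inside the context that maximise
    start_scores[start] + end_scores[end], with start <= end."""
    last = len(start_scores) - 1  # index of the final [SEP], excluded from spans
    best, best_score = (context_offset, context_offset), float("-inf")
    for s in range(context_offset, last):
        for e in range(s, min(s + max_answer_len, last)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


if __name__ == "__main__":
    question = ["when", "does", "the", "library", "open"]
    context = ["the", "library", "opens", "at", "8", "am", "daily"]
    tokens, segment_ids = build_qa_input(question, context)
    # Hand-made scores (assumption): a real model would output one start score
    # and one end score per token.
    start = [0.0] * len(tokens)
    end = [0.0] * len(tokens)
    start[11], end[12] = 5.0, 4.0
    s, e = best_span(start, end, context_offset=len(question) + 2)
    print(tokens[s:e + 1])
```

Restricting the search to the context segment keeps the predicted span from landing inside the question, which is the usual convention for extractive question answering.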
4 Analysis and Design of Sample Data Set Collection Platform for Question Answering System
4.1 Main Functions of the System
From the perspective of users, the platform has two roles: ordinary users and administrators. In terms of functional modules, it is divided into foreground and background management. The foreground is open to ordinary users and provides the Q&A sample dataset collection module. The background is open to the administrator role and provides the text system file import module, the Q&A sample dataset management module, and the Q&A sample dataset output module.
(1) Text system file import module. This module dynamically reads the imported text and divides the content of the file into file descriptions, chapters, and regulations, which are stored separately in the relevant data tables.
(2) Question and answer sample dataset collection module. This module allows users to select documents, chapters, and regulations to view the content of the regulations. The user enters a question and an answer, and marks the reference information block of the answer in the regulatory content.
(3) Question and answer sample dataset management module. This module lets administrators randomly check the correctness of the collected Q&A pairs and the marked reference information segments. If there are errors in the collected Q&A information, the administrator can correct them directly or send a reminder prompting the collector to correct the sample data.
(4) Question and answer sample dataset output module. This module generates training and test sets in JSON format for subsequent deep learning.
4.2 Data Flow Diagram
The data flow diagram (DFD) is the main tool used to describe the data flow of a system. It uses a set of symbols to describe the flow, processing, and storage of data in the system. The administrator visits the text system file import module and imports a domain-specific file for which Q&A pairs are to be collected. The content of this document is split and stored separately in the file description table named “Subject”, the chapter description table named “Chapter Title”, and the regulatory table named “Rules”. Ordinary users visit the Q&A sample dataset collection module, select rules, input possible Q&A pairs, and mark reference information segments. The collected Q&A pairs are stored in a question and answer table called “Qanswer”. The administrator visits the Q&A sample dataset management module and randomly checks the correctness of the Q&A pairs and the marked reference information segments, reading and updating the data in the Q&A table. After the collection of Q&A pairs is complete, administrators can use the Q&A sample dataset output module to generate JSON-formatted training and test sets for further deep learning. The data flow of the entire project is shown in Fig. 1.
Fig. 1. Data Flow Diagram
4.3 Database Design
Following the principle of designing a data structure with high efficiency and low redundancy, five main tables are designed:
(1) subject. The table of file subjects, mainly used to record the subject name of the imported file. The fields include id, subject name, sequence number, and so on.
(2) chapterTitle. The table of document chapters, mainly used to record the chapter names of the file. The fields are id, chapter name, sort number, subject_id, and create time.
(3) rules. The table of regulations, mainly used to record the name and content of each rule. The fields are id, title, content, sequence number, subject_id, chapterTitle_id, and create time.
(4) qanswer. The table of questions and answers, mainly used to record the collected Q&A pairs. The fields are id, question, answer, referencesTxt, rule_id, subject_id, chapterTitle_id, username, createtime, reviewer 1, review time 1, audit level 1, reviewer 2, review time 2, audit level 2, reviewer 3, review time 3, audit level 3, approved level, and status.
(5) user. The table of users, mainly used to record user information. The fields include id, username, password, sex, tel, address, registerTime, role, and so on.
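The relationships between these tables can be sketched with an in-memory SQLite database. This is an illustrative sketch only: the paper does not give the actual DDL or the database engine, so the column names (for example `subject_name`, `sort_number`, `approved_level`), the column types, and the helper `create_db` are assumptions, and the `user` table and the per-reviewer audit columns are omitted for brevity.

```python
import sqlite3

# Hypothetical schema derived from the field lists above; names and types are assumptions.
SCHEMA = """
CREATE TABLE subject (
    id              INTEGER PRIMARY KEY,
    subject_name    TEXT NOT NULL,
    sequence_number INTEGER
);
CREATE TABLE chapterTitle (
    id           INTEGER PRIMARY KEY,
    chapter_name TEXT NOT NULL,
    sort_number  INTEGER,
    subject_id   INTEGER NOT NULL REFERENCES subject(id),
    create_time  TEXT
);
CREATE TABLE rules (
    id              INTEGER PRIMARY KEY,
    title           TEXT NOT NULL,
    content         TEXT NOT NULL,
    sequence_number INTEGER,
    subject_id      INTEGER NOT NULL REFERENCES subject(id),
    chapterTitle_id INTEGER NOT NULL REFERENCES chapterTitle(id),
    create_time     TEXT
);
CREATE TABLE qanswer (
    id             INTEGER PRIMARY KEY,
    question       TEXT NOT NULL,
    answer         TEXT NOT NULL,
    referencesTxt  TEXT,
    rule_id        INTEGER NOT NULL REFERENCES rules(id),
    username       TEXT,
    createtime     TEXT,
    approved_level REAL,
    status         TEXT
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key checks enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_db()
    conn.execute("INSERT INTO subject (id, subject_name, sequence_number) VALUES (1, 'Library rules', 1)")
    conn.execute("INSERT INTO chapterTitle (id, chapter_name, sort_number, subject_id) VALUES (1, 'Opening hours', 1, 1)")
    conn.execute("INSERT INTO rules (id, title, content, subject_id, chapterTitle_id) "
                 "VALUES (1, 'Article 1', 'The library opens at 8 am.', 1, 1)")
    conn.execute("INSERT INTO qanswer (question, answer, referencesTxt, rule_id) "
                 "VALUES ('When does the library open?', 'At 8 am.', '8 am', 1)")
    # Join a Q&A pair back to the regulation text it was marked in.
    row = conn.execute(
        "SELECT q.question, r.content FROM qanswer q JOIN rules r ON r.id = q.rule_id"
    ).fetchone()
    print(row)
```

Storing only `rule_id` in the sketch, rather than also repeating `subject_id` and `chapterTitle_id` as the paper's `qanswer` table does, is a simplification; the extra keys in the original design avoid joins when filtering Q&A pairs by subject or chapter.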
4.4 System Architecture Design
Following the functional division and the layered architecture of the SSM framework, the system is strictly divided into five layers: the entity layer, the data access layer (DAO layer), the business logic layer, the controller layer, and the view layer. The layers complement each other and together form the framework of the system. The structure of the system is shown in Fig. 2.
(1) Entity layer (Domain Layer): This layer consists of several entity classes.
(2) Data access layer (DAO Layer): This layer consists of DAO interfaces and MyBatis mapping files. The interface names uniformly end with Mapper, and each MyBatis mapping file has the same name as its interface. This layer defines the operations of adding, deleting, modifying, and querying database tables as abstract methods in the DAO interfaces, and provides the concrete implementations of these methods in the MyBatis mapping files.
(3) Business logic layer (Service Layer): This layer consists of service interfaces and implementation classes. In this system, the interface names of the business logic layer uniformly end with Service, and each implementation class is named by appending Impl to the interface name. This layer implements the business logic of the system.
(4) Controller layer (Controller Layer): This layer mainly includes the Controller classes of Spring MVC. A Controller class intercepts user requests, instantiates the corresponding component in the business logic layer, calls the methods provided by that object to process the request, and then returns the processing result to the JSP page.
(5) View layer (View Layer): This layer mainly includes JSP and HTML pages, which provide the interface through which users input or view data.
4.5 Sequence Diagram
(1) User login sequence diagram
The sequence diagram of user login is shown in Fig. 3. The user visits a webpage called login.jsp, enters a username and password on the page, and submits the form, which sends the request to the controller layer. A class named UserController in the controller layer is responsible for processing the request. It encapsulates the received form information into a User entity, instantiates the UserService class of the business logic layer, and then calls the checkUser() method provided by that object. The checkUser() method in UserService is implemented by calling the checkUser() method provided by UserDao. UserDao is a class in the data persistence layer, which handles interaction with the database.
(2) Question and answer dataset collection sequence diagram
Fig. 2. System hierarchy diagram
Fig. 3. User login sequence diagram
The sequence diagram of the Q&A dataset collection is shown in Fig. 4. The user visits the Q&A sample dataset collection page, inputs Q&A pairs on the page, marks the reference information segments, and submits the form, which sends the request to the controller layer. A class named QAnswerController in the controller layer is responsible for processing the request. It encapsulates the received form information into a QAnswer entity, instantiates the QAnswerService class of the business logic layer, and then calls the add() method provided by that object. The add() method in QAnswerService stores the Q&A data in the database tables by calling the add() method provided by QAnswerDao.
Fig. 4. Question and answer dataset collection sequence diagram
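The call chain of Fig. 4 can be illustrated with a short sketch. The platform itself is built on the Java SSM stack; the Python below only mirrors the controller, service, and DAO delegation. The field names, the in-memory list standing in for the database table, and the empty-input check are assumptions made for illustration and are not taken from the platform's code.

```python
from dataclasses import dataclass


@dataclass
class QAnswer:
    """Entity layer: one question-answer pair (field names are assumed)."""
    question: str
    answer: str
    reference: str


class QAnswerDao:
    """Data access layer: an in-memory list stands in for the database table."""

    def __init__(self):
        self.rows = []

    def add(self, qa):
        self.rows.append(qa)
        return len(self.rows)  # row count, used here as a stand-in for an id


class QAnswerService:
    """Business logic layer: checks the pair (assumed rule), then delegates."""

    def __init__(self, dao):
        self.dao = dao

    def add(self, qa):
        if not qa.question.strip() or not qa.answer.strip():
            raise ValueError("question and answer must not be empty")
        return self.dao.add(qa)


class QAnswerController:
    """Controller layer: wraps the form fields in an entity, calls the service."""

    def __init__(self, service):
        self.service = service

    def submit(self, form):
        qa = QAnswer(form["question"], form["answer"], form.get("reference", ""))
        return self.service.add(qa)


dao = QAnswerDao()
controller = QAnswerController(QAnswerService(dao))
row_id = controller.submit({"question": "Q1", "answer": "A1", "reference": "doc 3"})
```

Each layer only talks to the layer directly below it, which is the property the sequence diagram expresses.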
4.6 The Output Can Be Used for the Sample Dataset Based on the BERT Chinese Question Answering System
The platform imported 48 campus rules and regulations documents from a vocational college in Anhui. New students, full-time teachers, and counselors collected and entered Q&A pairs based on the rules and regulations they were familiar with, and verified the correctness of the answers on the platform. The platform collected a total of 4500 samples, of which 3100 have a certification score of 7.0 or higher and are usable. The output module provided by the platform converts the obtained sample dataset into JSON format for pretraining.
4.7 BERT for Question Answering
We divided the collected dataset into a training set and a verification set, applied BERT to the Chinese sample dataset, split the Chinese words according to spaces, rewrote the code in run_squad_chinese.py, and conducted training and verification experiments. The format of the sample dataset generated by the platform meets the experimental requirements.
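As an illustration of the export step, the sketch below filters records by the 7.0 certification score and writes them in a SQuAD-style JSON layout. The paper does not give the platform's actual schema, so the record field names (id, doc_title, context, question, answer, score) and the choice of the SQuAD layout are assumptions, suggested only by the use of run_squad_chinese.py.

```python
import json

THRESHOLD = 7.0  # certification score cut-off stated in the text

# Hypothetical records; the platform's real column names are not given.
records = [
    {"id": "q1", "doc_title": "Dormitory rules",
     "context": "Lights go out at 23:00 on weekdays.",
     "question": "When do lights go out on weekdays?",
     "answer": "23:00", "score": 8.5},
    {"id": "q2", "doc_title": "Dormitory rules",
     "context": "Lights go out at 23:00 on weekdays.",
     "question": "Are visitors allowed?",
     "answer": "unknown", "score": 5.0},
]


def to_squad(rows, threshold=THRESHOLD):
    """Keep rows at or above the threshold and emit SQuAD-style JSON data."""
    data = []
    for r in rows:
        if r["score"] < threshold:
            continue
        start = r["context"].find(r["answer"])
        if start < 0:
            continue  # the answer must be a span of the reference segment
        data.append({
            "title": r["doc_title"],
            "paragraphs": [{
                "context": r["context"],
                "qas": [{
                    "id": r["id"],
                    "question": r["question"],
                    "answers": [{"text": r["answer"], "answer_start": start}],
                }],
            }],
        })
    return {"version": "1.1", "data": data}


squad = to_squad(records)
squad_json = json.dumps(squad, ensure_ascii=False, indent=2)
```

Passing ensure_ascii=False keeps Chinese text readable in the exported file instead of escaping it.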
5 Conclusions This paper elaborates on the design and implementation of a Q&A data collection platform using the SSM (Spring+SpringMVC+MyBatis) framework and data storage technology. We collected and organized 4500 sample data related to school rules and regulations from a vocational college in Anhui. The completion of this work provides a good
foundation for the subsequent construction of campus intelligent question answering systems. Acknowledgment. We thank the students from Ma’anshan Teacher College for assisting in the collection of the question-and-answer data. This work was supported by the Natural Science Research Project of Anhui Universities “Research on Campus FAQ Based on Deep Learning”, grant number KJ2020A0884.
References
1. Lee, J., Seo, M., Hajishirzi, H., Kang, J.: Contextualized sparse representations for real-time open-domain question answering. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 912–919. Association for Computational Linguistics (2020)
2. Li, C., Choi, J.D.: Transformers to learn hierarchical contexts in multiparty dialogue for span-based question answering. arXiv:2004.03561v2 (2020)
3. Baheti, A., Ritter, A., Small, K.: Fluent response generation for conversational question answering. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 191–207 (2020)
4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692v1 (2019)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v2 (2019)
6. Liu, Y., Ott, M.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019)
7. Dibia, V.: NeuralQA: a usable library for question answering (contextual query expansion+BERT) on large datasets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 15–22. Association for Computational Linguistics (2020)
8. Choi, E., et al.: QuAC: question answering in context. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2174–2184. Association for Computational Linguistics (2018)
An Optimized Eight-Layer Convolutional Neural Network Based on Blocks for Chinese Fingerspelling Sign Language Recognition Huiwen Chu, Chenlei Jiang, Jingwen Xu, Qisheng Ye, and Xianwei Jiang(B) School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China [email protected]
Abstract. Sign language plays a significant role in communication for the hearing-impaired and the speech-impaired, and sign language recognition helps lower the communication barriers between people with disabilities and others. However, the task remains difficult for artificial intelligence because complex gestures must be recognized in real time and with high accuracy. Thanks to advances in deep learning, fingerspelling recognition methods based on convolutional neural networks have gained popularity in recent years, and fingerspelling recognition has taken center stage. This study proposes an optimized eight-layer convolutional neural network based on blocks (CNN-BB) for fingerspelling recognition of Chinese sign language. Three different blocks, Conv-BN-ReLU-Pooling, Conv-BN-ReLU, and Conv-BN-ReLU-BN, were adopted, and techniques such as batch normalization, dropout, pooling, and data augmentation were employed. The results show that our CNN-BB achieved an MSD of 93.32 ± 1.42%, which is superior to eight state-of-the-art approaches. Keywords: fingerspelling · sign language recognition · batch normalization · data augmentation · pooling · dropout
1 Introduction
Sign language is one of the primary means of communication for the hearing-impaired and speech-impaired. It uses the shape of the hand as a carrier and conveys meanings or words by changing gestures to simulate images or syllables. As an independent visual language with a complete grammatical system, sign language occupies a core position in communication for deaf people and others with disabilities. In April 2022, the China National Emergency Language Service Team was established in Beijing and put forward the proposal of “doing a solid job in providing emergency language services”. As an important emergency language, sign language is naturally a key link in this work, and our enthusiasm for sign language research is growing as the country attaches great importance to it.
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 11–31, 2024. https://doi.org/10.1007/978-3-031-50580-5_2
The major fields of sign language study include computer science and technology, linguistics, and special education. As the leading research field, computer science and technology focuses mainly on pattern recognition, signal processing, computer vision, visual language, and related topics, among which the most important is sign language recognition. A large number of scholars study the problems of sign language recognition technology and continue to enrich its methods [1]. Sign Language Recognition (SLR) is a technology that uses computers to convert gestures into text or speech [2]. Traditional methods of sign language recognition include template matching, the Hidden Markov Model (HMM), and neural networks (NN). Each of these conventional approaches has drawbacks of its own, but the drawbacks can be mitigated by combining methods, for example by combining HMM with Dynamic Time Warping (DTW) [3], Support Vector Machine (SVM) [4], or NN [5], or by combining fuzzy logic with NN [6]. In recent years, the rapid development of deep learning has brought new vitality to SLR. The major deep learning-based sign language recognition technologies are based on the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), the Graph Neural Network (GNN), and combinations of these methods. CNN is an important form of deep learning dedicated to processing data with a grid-like structure, such as time series and image data. CNN has had a large impact on Chinese sign language recognition because it is extremely effective at image classification problems [7].
Chinese sign language is divided into gesture sign language and fingerspelling sign language. The latter usually includes 30 fingerspelling forms: 26 letters (a–z), three retroflex letters (“ch”, “sh”, and “zh”), and one nasal consonant (“ng”). Because it has only 30 letters and does not rely on facial expressions, fingerspelling is simpler and more precise [8]. Figure 1 shows the letters of the Chinese fingerspelling sign language [9].
Fig. 1. Alphabets in Sign Language
An important scientific advancement is the improvement of Chinese sign language fingerspelling recognition. It can not only aid in the better social integration of the deaf but also advance the study and advancement of computer vision, artificial intelligence, and other related fields. An effective and precise method for recognizing sign language is finger-spelling recognition of Chinese sign language using convolutional neural networks. This method can be used to enhance the precision and effectiveness of artificial
intelligence and natural language processing in speech recognition, handwriting recognition, and other areas. Chinese Sign Language fingerspelling recognition has enormous application potential and will make people’s lives and jobs more convenient. To increase the precision of Chinese fingerspelling sign language recognition, this article builds a tuned eight-layer convolutional neural network and incorporates data augmentation. By integrating pooling, batch normalization, and dropout, we also increase the validity of the test data, overcoming the shortcomings of pre-training and enhancing the usability and accuracy of the CNN. The remaining sections of this article are organized as follows: the data set is described in Sect. 2. Section 3 introduces the Chinese fingerspelling sign language recognition methods used in this study. The experiments are described in Sect. 4, discussion is provided in Sect. 5, and conclusions are provided in the last section.
2 Dataset
We collected several groups of Chinese fingerspelling image samples to build a gesture dataset. Each group covers the 26 letters plus the two-letter signs “zh”, “ch”, “sh”, and “ng”, for a total of 30 images. Using the brush tool of Photoshop in soft light mode, we smeared over the original images to adjust their brightness so that each gesture is clearly visible. In addition, collecting multiple samples takes into account individual differences in how signers form the gestures, which makes the research results more convincing (see Fig. 2).
Fig. 2. The various sign language gestures for the letter “zh”
3 Methodology
The image recognition method in this paper is based on a deep neural network. The convolutional layer is mainly used to extract features from the input data, which includes image feature processing and noise reduction. To improve the performance of the convolutional network, we use the ReLU function (which speeds up data processing), pooling layers (which reduce the number of network parameters and the computing load), dropout (which prevents overfitting and provides an effective way of combining exponentially many neural network architectures), batch normalization (which accelerates model training and improves generalization), and data augmentation (which enlarges the dataset to improve the performance of the learning algorithm). The resulting network is shown in Fig. 3.
Fig. 3. The proposed network architecture
3.1 Convolutional Layer
A neural network, also known as an artificial neural network, is built by connecting a large number of artificial neurons in different ways. It mimics the structure and function of signal transmission between biological neurons, so that it can make simple decisions and judgments in a human-like way. The convolutional neural network (CNN) is a feedforward neural network that emerged as neural network research deepened. Its artificial neurons respond to a subset of surrounding units within a coverage area, which makes it particularly good at large-scale image processing. LeCun (1989) first used the term “convolution” [10] when discussing the network structure, hence the name “convolutional neural network”. The basic structure of a CNN consists of the input layer, the convolutional layer, the max pooling layer (also known as the sampling layer), the fully connected layer, and the output layer. As shown in Fig. 4, features of the image are first extracted by the convolutional layer; the downsampling layer then discards some information to reduce the number of training parameters while preserving the essential features; finally, the fully connected layer performs the classification. In a CNN, each neuron of a convolutional layer is connected only to a local region of the input feature map, and its input value is the weighted sum of that local input plus a bias. This process is equivalent to a convolution [11]. Convolutional layers and subsampling layers are generally not used singly but alternate: a convolutional layer is followed by a downsampling layer, which is followed by another convolutional layer, and so on.
In the convolutional layer, the feature map of the previous layer is convolved with a learnable filter (i.e., a convolution kernel), which slides across the entire input
Fig. 4. CNN operation process
image with a certain step size, and the output feature map is then obtained through an activation function. Each output feature map can combine the convolutions of multiple input feature maps:

$$x_n^p = f\left(y_n^p\right) \tag{1}$$

$$y_n^p = \sum_{m \in M_n} x_m^{p-1} * k_{mn}^p + b_n^p \tag{2}$$

Here $y_n^p$ is the net activation of the $n$th channel of convolutional layer $p$, $x_m^{p-1}$ is the output of the $m$th channel of layer $p-1$, $f(\cdot)$ is the activation function, $M_n$ is the subset of input feature maps used to calculate $y_n^p$, $k_{mn}^p$ is the convolution kernel matrix, $b_n^p$ is the bias of the convolved feature map, and $*$ denotes convolution. The output feature map of each channel in the downsampling layer is computed as follows:

$$x_n^p = f\left(y_n^p\right) \tag{3}$$

$$y_n^p = \gamma_n^p \, \mathrm{down}\left(x_m^{p-1}\right) + b_n^p \tag{4}$$

Here $y_n^p$ is the net activation of the $n$th channel of downsampling layer $p$, $\gamma$ is the weight coefficient of the downsampling layer, $b_n^p$ is the offset term of the downsampling layer, and $\mathrm{down}(\cdot)$ is the downsampling function. Each fully connected layer is a flat structure composed of many neurons, essentially a perceptron that classifies or regresses the input data; its output is obtained by a weighted summation of the input followed by the corresponding activation function:

$$x^p = f\left(y^p\right) \tag{5}$$

$$y^p = \phi^p x^{p-1} + b^p \tag{6}$$

Here $y^p$ is the net activation of fully connected layer $p$, $\phi^p$ is the weight coefficient of the fully connected network, and $b^p$ is the offset term of the fully connected layer [10, 12].
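Equations (1) and (2) can be checked numerically with a small NumPy sketch for a single output channel. It assumes a stride of 1, no padding, ReLU as the activation f, and that all input maps belong to the subset M_n; the function name conv_channel and these choices are illustrative and are not taken from the authors' implementation.

```python
import numpy as np


def relu(y):
    """Activation f(.) assumed for this sketch."""
    return np.maximum(0.0, y)


def conv_channel(prev_maps, kernels, bias, f=relu):
    """One output channel n of a convolutional layer, Eqs. (1)-(2).

    prev_maps: array (M, H, W), the input feature maps x_m of layer p-1
    kernels:   array (M, kh, kw), one kernel k_mn per input map
    bias:      scalar b_n
    """
    num_maps, height, width = prev_maps.shape
    _, kh, kw = kernels.shape
    out_h, out_w = height - kh + 1, width - kw + 1
    y = np.full((out_h, out_w), float(bias))
    for m in range(num_maps):
        # A true convolution flips the kernel; many deep learning
        # libraries skip the flip and compute a cross-correlation.
        flipped = kernels[m, ::-1, ::-1]
        for i in range(out_h):
            for j in range(out_w):
                window = prev_maps[m, i:i + kh, j:j + kw]
                y[i, j] += np.sum(window * flipped)
    return f(y)


x_out = conv_channel(np.ones((2, 3, 3)), np.ones((2, 2, 2)), bias=1.0)
```

With two all-ones 3 × 3 input maps and two all-ones 2 × 2 kernels, every output entry is 2 · 4 + 1 = 9, and the output map is 2 × 2.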
Because the connections between neurons are sparse and neurons in the same layer share connection weights, a CNN differs from other deep neural network models. These properties bring it closer to biological neural networks and further reduce the complexity of the model and the number of weights to be trained. Compared with earlier deep neural networks, overfitting is greatly alleviated and the memory occupied by the model is clearly reduced. At the same time, a CNN is stable and fault tolerant, and it is efficient and accurate in image recognition and image classification tasks [13].
3.2 ReLU Function
In deep neural network learning, the activation function plays an important role in stimulating hidden nodes to produce better output; its main purpose is to introduce nonlinearity into the model. Commonly used activation functions include the logistic sigmoid and tanh. Hahnloser et al. first introduced the ReLU function into a dynamic network in 2000, and in 2011 it was first demonstrated that ReLU can train deeper networks better than earlier activation functions. ReLU is still the most commonly used activation function for most deep learning tasks, and it is widely used in the activation layers of CNNs. The rectified linear unit (ReLU) is a commonly used activation function in artificial neural networks; mathematically, it is the ramp function. It is defined as follows (Fig. 5):

$$\mathrm{ReLU}(x) = \max(0, x) \tag{7}$$

$$y = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{if } x \le 0 \end{cases} \tag{8}$$
Fig. 5. The ReLU function
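The definition in Eqs. (7) and (8) translates directly into NumPy. The sketch below also includes the derivative used in backpropagation; setting the derivative at x = 0 to 0 is a common convention and an assumption here, since the function is not differentiable at that point.

```python
import numpy as np


def relu(x):
    """ReLU as in Eq. (7): element-wise max(0, x)."""
    return np.maximum(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0 and 0 for x < 0.

    The value at x == 0 is set to 0 by convention (an assumption).
    """
    return (x > 0).astype(float)


x = np.array([-2.0, 0.0, 3.0])
y = relu(x)
dy = relu_grad(x)
```

Negative inputs give both a zero output and a zero gradient, which is why a unit that stays in the negative region stops updating its parameters.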
Compared with other activation functions, the ReLU function and its derivative are simple to compute, fast, and light on resources, and they resemble the behavior of biological neurons, so ReLU is basically
the preferred choice when facing unknown data sets. Compared with linear functions, ReLU has stronger expressive power, especially in deep networks; compared with other nonlinear functions, the gradient of ReLU on the positive interval is constant, so the gradient does not vanish there and the convergence rate of the model is stable. Of course, the ReLU function also has its defects. The slope of its negative part is 0, so negative convolution outputs cannot be represented. Neurons in this regime stop updating their parameters, which slows the processing of the data set and wastes the computing resources they occupy [14]. ReLU is an asymmetric piecewise function, so fitting a smooth nonlinear function to the same accuracy requires more neurons or hidden layers. ReLU is also not differentiable at zero [15]. However, non-differentiability at a single point is a special case, and researchers have developed variants such as Softplus, Leaky ReLU, ELU, and SiLU to make up for these deficiencies and improve performance on some tasks. ReLU remains by far the most widely used activation function in deep learning.
3.3 Pooling Layer
Images and convolutional feature maps are large, and in practical applications we do not need to process all of their redundant information; the key is to extract the image features. We therefore adopt a strategy similar to image compression, and the pooling layer plays this role. In general, the pooling layer of a convolutional neural network follows the convolutional layer, and the operations in the pooling layer are called pooling operations. The principle of pooling is similar to that of the convolutional layer.
Convolution is a linear matrix operation that slides a window (the convolution kernel) over the input matrix to extract image features. The pooling layer likewise uses sliding windows: it divides the input data into multiple blocks and applies a pooling operation to each block. Common pooling operations are maximum pooling and average pooling. Maximum pooling takes the maximum value of a local area of the input as the output for that area, and average pooling takes the mean value. The pooling steps are relatively fixed. In the first step, we set the pooling window size and the step size, which are generally set to the same value; for example, the step of an 8 × 8 pooling window is set to 8, and the step of a 4 × 4 window to 4. In the second step, we move the pooling window over the input matrix, and for each region the result of the operation represents the feature element of that region. In the third step, we repeat the second step until the whole input is covered. For a 4 × 4 image, define a 2 × 2 pooling window with a step size of 2 and apply maximum pooling: the window moves 2 positions each time, the subareas do not overlap, and the image is finally compressed to 1/4 of its original size. The 2 × 2 window moves across the input image from left to right and top to bottom, mapping the feature values of each area to one output value. After traversing the original image, we obtain an image that represents the pooling result and is smaller than the original: the number of pixels is reduced by 75% (Fig. 6).
Fig. 6. Maximum pooling with window size 2 and step size 2
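A minimal NumPy sketch of the pooling procedure, assuming a single-channel input and a square window. The function name pool2d and the sample values are illustrative.

```python
import numpy as np


def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2-D array with a size x size window moved by `stride`."""
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    reduce = np.max if mode == "max" else np.mean
    height, width = x.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce(window)  # one value represents the region
    return out


image = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
max_pooled = pool2d(image, mode="max")    # [[6, 8], [3, 4]]
mean_pooled = pool2d(image, mode="mean")  # [[3.75, 5.25], [2, 2]]
```

The 4 × 4 input has 16 values and the 2 × 2 output has 4, which is the 75% reduction described in the text.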
Average pooling is similar, except that the output of each 2 × 2 pooling window is the average of the values in that subarea. Pooling thus compresses the spatial size of the input feature map. As long as the image can still be recognized, the depth dimension remains unchanged and unnecessary redundant information is removed, which reflects feature invariance in image processing. For example, when the resolution of a photo of a tree is lowered, we can still recognize it as a tree, which shows that the photo retains the characteristics of a tree. At the same time, the compressed pixel matrix is much smaller, which reduces the number of neural network parameters and the computing load and improves the computing speed.
3.4 Batch Normalization
Deep neural networks often have many more layers than we might expect. As data passes through each layer and its activation function, Internal Covariate Shift occurs [16]. Adding a Batch Normalization (BN) step to the appropriate network layers keeps the data from fluctuating too much, so that the sample output values of each such layer stay within a given range (Fig. 7).
Fig. 7. Batch normalization principle
Batch normalization (BN) is a data preprocessing tool that brings numerical data to a common scale without distorting its shape. In a deep learning algorithm we generally rescale values to a balanced range, and normalization speeds up model training. The BN layer is generally placed after the convolution, and it is an important technique for dealing with the vanishing gradient and exploding gradient problems in neural network training. Batch normalization normalizes batches of sample data in a specific way: it takes a mini-batch of data at a network layer as the sample input, subtracts the batch mean from each input value, and divides by the batch standard deviation. But why normalize only a subset of the data at the layer? Normalizing over all the data of a layer would incur a huge computing overhead and defeat the purpose of the optimization. If a network layer receives a mini-batch of m inputs, each node of the layer generates m outputs during forward propagation, and batch normalization normalizes these m outputs at each node. The normalization is calculated as follows:

$$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i \tag{9}$$

$$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2 \tag{10}$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \tag{11}$$
However, normalizing the inputs of the nodes in each layer may change what the layer can represent. Because normalization turns one set of data into another set that carries different information, the processing cannot stop there. A linear transformation with a scale ζ and a shift λ must also be applied to the normalized value $\hat{x}_i$ to obtain the final result $y_i$, as in Eq. (12):

$$y_i = \zeta \, \hat{x}_i + \lambda \tag{12}$$
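The four steps of Eqs. (9) to (12) can be written compactly in NumPy. This is a training-time sketch only, assuming a mini-batch stored as an array of shape (m, features); the parameter names scale and shift stand for ζ and λ, and the running statistics that are normally kept for inference are omitted.

```python
import numpy as np


def batch_norm(x, scale, shift, eps=1e-5):
    """Normalize a mini-batch x of shape (m, features) per feature."""
    mu = x.mean(axis=0)                    # Eq. (9): batch mean
    var = ((x - mu) ** 2).mean(axis=0)     # Eq. (10): batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (11): normalized value
    return scale * x_hat + shift, mu, var  # Eq. (12): scale and shift


x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y, mu, var = batch_norm(x, scale=1.0, shift=0.0)

# Choosing scale = sqrt(var + eps) and shift = mu undoes the normalization.
restored, _, _ = batch_norm(x, scale=np.sqrt(var + 1e-5), shift=mu)
```

After normalization both features have approximately zero mean and unit variance, even though the second feature was ten times larger than the first.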
It is worth mentioning that ζ and λ can, to some extent, restore the expressive ability of the data. Suppose a feature x in a certain network layer has mean μ and standard deviation σ computed over the mini-batch, that normalizing x gives x̂, and that the linear transformation gives y. Then there is a special case: when ζ equals σ and λ equals μ, the result y of the linear transformation is exactly equal to the feature x before normalization. This shows the purpose of the formula, namely that the original meaning of the network layer data can be reconstructed. In practice such coincidences are rare, because ζ and λ are learned by the model. BN brings three benefits. First, it speeds up model training and allows a higher learning rate. Second, it has a certain regularization effect: BN improves generalization even without Dropout and reduces the need for L2 regularization. Third, there
is an opportunity to make the model work better. This effect is not absolute, but many models do get better with BN. Although BN has many advantages, it can also cause problems if it is used improperly. For example, when the batch is small, the batch statistics are estimated inaccurately and the normalization error increases rapidly. Therefore, we also need to know its scope of application. First, batch normalization can be added to general neural network training to speed it up. Second, it is suitable for scenarios where each mini-batch is large and the data distribution is similar across batches.
3.5 Dropout
The term “Dropout” [17] was introduced by Hinton et al. in the paper “Improving neural networks by preventing co-adaptation of feature detectors”. Overfitting usually results from training complex feedforward neural networks on small data sets. One solution is to improve the performance of neural networks by preventing the co-adaptation of feature detectors, an approach used by Krizhevsky, Sutskever, and Hinton in their paper “ImageNet Classification with Deep Convolutional Neural Networks”. The Dropout algorithm is effective at avoiding overfitting. Moreover, the AlexNet model introduced in that paper set the trend and made the CNN the core model for image classification [18]. When a deep neural network is trained with Dropout, overfitting is reduced by ignoring a portion of the feature detectors (setting the corresponding hidden layer node values to 0). This reduces interactions between feature detectors (hidden layer nodes), that is, cases where some detectors depend on other detectors to function. Dropout is a regularization method against overfitting that randomly sets the activations of hidden units to zero for each training case at training time.
This breaks the co-adaptation of feature detectors, because the dropped units cannot affect the retained units. In other words, Dropout creates an efficient form of model averaging in which the number of trained models is exponential in the number of units and all of these models share the same parameters. Dropout also inspired other stochastic model averaging methods such as stochastic pooling. Dropout works well in the fully connected layers of convolutional neural networks [19]. Consider training the neural network shown in Fig. 8.
Fig. 8. Standard neural networks and partially temporarily deleted neurons
For the standard network on the left of the figure, training first propagates the input forward through the network and then back-propagates the error to decide how to update the parameters. With Dropout, the procedure becomes:
(1) Randomly delete a portion of the neurons in the hidden layer of the network. This deletion is temporary, and the input and output neurons remain unchanged (the dashed units in Fig. 8 indicate the temporarily deleted neurons).
(2) Propagate the input forward through the modified network, and then propagate the resulting loss backward through the modified network. After this is done for the training samples, the parameters of the neurons that were not deleted are updated by stochastic gradient descent.
(3) Restore the deleted neurons (whose parameters have not been updated, while those of the remaining neurons have). Then select a new random subset of the hidden layer neurons, of the same size as in (1), and temporarily delete them (backing up their parameters). For the training samples, perform forward propagation and backward propagation of the loss, and update the parameters by stochastic gradient descent (the parameters of the deleted neurons remain unchanged and the remaining parameters are updated).
The above process is repeated continuously. The units to discard are selected at random. Each unit is retained with a fixed probability independent of the other units; this probability is either chosen on a validation set or set directly to 0.5 or 0.3. Dropout is heavily used in fully connected networks and less used in the hidden layers of convolutional networks, partly because convolution itself is already sparse.
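The temporary deletion in steps (1) to (3) is usually implemented with a random binary mask rather than by physically removing neurons. The sketch below uses the "inverted dropout" variant, which rescales the kept activations by 1/keep_prob so that nothing needs to change at test time. That rescaling and the function names are assumptions of this sketch and are not described in the text above.

```python
import numpy as np


def dropout_forward(h, keep_prob, rng, training=True):
    """Zero each hidden activation with probability 1 - keep_prob."""
    if not training:
        return h, None  # all units are restored at test time
    mask = rng.random(h.shape) < keep_prob  # True marks a retained unit
    return h * mask / keep_prob, mask


def dropout_backward(grad_out, mask, keep_prob):
    """Dropped units receive no gradient, so their weights stay unchanged."""
    return grad_out * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped, mask = dropout_forward(hidden, keep_prob=0.5, rng=rng)
grad = dropout_backward(np.ones(1000), mask, keep_prob=0.5)
evaluated, _ = dropout_forward(hidden, keep_prob=0.5, rng=rng, training=False)
```

A fresh mask is drawn on every call, which corresponds to choosing a new random subset of neurons in step (3).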
In general, the Dropout rate is a hyper-parameter: it is not a parameter learned inside each layer of the network, but one that must be tuned manually according to the actual network and application area in order to improve the model.
3.6 Data Augmentation
Data augmentation is a set of techniques that artificially enlarge a data set by adding modified copies of existing data or new data synthesized from existing data. It acts as a regularizer and reduces overfitting when training machine learning models. For example, if we have 50 experimental images, we can create new copies by randomly flipping them horizontally and vertically, and thereby double the training set. Data augmentation needs to be used wisely to avoid overfitting and improve the performance of the algorithm. Yu Gao et al. proposed a data preprocessing method based on the convolutional neural network model AlexNet, applying dataset augmentation, background segmentation, and principal component analysis to the dataset. The public dataset Leaves and an apple surface lesion dataset were first tested for classification and recognition. The results show that the recognition accuracy of both the public dataset Leaves and the apple surface lesion dataset on this network
has been improved after data augmentation [20]. For the traditional data-driven transient stability analysis of power systems, Yan Zhou et al. proposed a transient stability prediction method based on data augmentation and a deep residual network, considering the impact of noisy and missing input data on the performance of the prediction model. A deep residual network, a special convolutional neural network from image processing, is used to construct a deep model for transient stability evaluation, with the dynamic data of the generators after a perturbation as input features [21]. Our dataset should cover different conditions such as different orientations, positions, scales, and brightnesses. However, the amount of data we can actually collect is limited. By performing data augmentation to obtain a large number of samples, we can address the shortage of sample data, prevent the neural network from learning irrelevant features, and fundamentally improve overall performance [22]. For image augmentation, one can randomly flip, crop, rotate, scale, resize, stretch, and imitate; in addition, one can change the saturation, brightness, contrast, and sharpness, or even add noise. A robust convolutional neural network can correctly classify objects in different situations, and increasing the amount of training data can improve the performance of the CNN model. Data augmentation helps make the CNN invariant to translation, viewpoint, size, and illumination. In this article, we describe the following six data augmentation methods.
3.6.1 PCA Color Enhancement Method

The algorithm performs principal component analysis on the color channels and perturbs the original image along the principal directions of the color distribution to perform ‘Data Augmentation’. The PCA color augmentation method is mainly used to change the brightness, contrast and saturation of the image. In order to preserve valid information such as relative color differences, major color families and contours in the artificial image, principal component analysis of the training data set is required to recover the principal axes of its color distribution [23]. The artificial image is then created by adjusting the multiples of the principal components that are added to the image.

3.6.2 Noise Injection

Noise can be injected at many points in a neural network, such as the input layer, hidden layers, weights and output layer. The core of input noise injection is to randomly perturb the RGB values of each pixel of an image by adding a matrix of random values sampled from a Gaussian distribution, producing a new noise-contaminated image. This also augments the dataset and improves the ability to fit the true distribution of the data, helping the CNN to learn more robust features [24].
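The two methods above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the perturbation scale `sigma`, the noise level `std`, the stand-in training data and all function names are hypothetical, and the color shift follows the common formulation in which each principal axis is scaled by its eigenvalue and a random Gaussian multiple.

```python
import numpy as np

def fit_color_pca(images):
    """PCA of the RGB values of all pixels in a stack of N x H x W x 3 images."""
    pixels = images.reshape(-1, 3).astype(np.float64) / 255.0
    cov = np.cov(pixels, rowvar=False)          # 3 x 3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigen-decomposition (symmetric)
    return eigvals, eigvecs

def pca_color_augment(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add one random RGB offset along the principal color axes."""
    alphas = rng.normal(0.0, sigma, size=3)     # random multiple per component
    shift = eigvecs @ (alphas * eigvals)        # RGB offset, shape (3,)
    out = np.clip(image.astype(np.float64) / 255.0 + shift, 0.0, 1.0)
    return np.round(out * 255.0).astype(np.uint8)

def add_gaussian_noise(image, rng, std=10.0):
    """Perturb every pixel value with zero-mean Gaussian noise."""
    noise = rng.normal(0.0, std, size=image.shape)
    return np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(10, 16, 16, 3), dtype=np.uint8)  # stand-in data
eigvals, eigvecs = fit_color_pca(train)
image = train[0]
recolored = pca_color_augment(image, eigvals, eigvecs, rng)
noisy = add_gaussian_noise(image, rng)
```

Because the same offset is added to every pixel, the PCA variant keeps relative color differences and contours intact, whereas the noise variant perturbs each pixel independently.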
An Optimized Eight-Layer Convolutional Neural Network
3.6.3 Scaling

Scaling can be performed inward or outward. When scaling inward, the newly generated image becomes smaller; when scaling outward, it becomes larger. In the outward case, the final image frame is a piece of the newly generated image that is equal in size to the original image.

3.6.4 Random Shifting

Random shifting only involves moving the image in the X or Y direction (or both), and when shifting we need to make assumptions about the boundaries. With this augmentation method, objects can be located almost anywhere in the image, so the convolutional neural network learns to recognize them in all regions, including the corners. This method can be very effective in increasing the amount of data if the image has a monochrome or pure black background. Random flipping includes horizontal and vertical flipping. Of these, horizontal flipping is the most commonly used, but depending on the actual target, vertical flipping and flipping at other angles can also be used.

3.6.5 Gamma Correction

The nonlinear photoelectric conversion characteristics of sensors in electronic devices (e.g., camcorders, monitors) require the application of Gamma correction, which edits the Gamma curve of an electronic image and thereby performs nonlinear tonal editing of the image [25]. The basic idea of Gamma correction is to divide each of the color channels R, G and B into segments and to apply a linear function in each segment. This series of linear functions approximates the compensation curve, that is, the curve symmetric to the Gamma curve about the line y = x. Gamma describes the curve relating the output values of the image to its input values, with a gamma value of usually 2.3. Gamma correction has a significant effect on the image, and different Gamma curves can achieve different results.
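As an illustration of Gamma correction, the sketch below builds a 256-entry lookup table for the compensation curve y = x^(1/γ) and, optionally, a piecewise-linear approximation of it, which is one way to read the segment-wise linear functions described above. The number of segments, the function names and the use of 8-bit images are assumptions made for this example, not details from the paper.

```python
import numpy as np

def gamma_lut(gamma, segments=None):
    """Lookup table mapping 8-bit input values to gamma-corrected outputs.

    With segments=None the exact power-law curve x**(1/gamma) is used;
    otherwise the curve is replaced by straight lines between equally
    spaced knots (a piecewise-linear approximation).
    """
    x = np.arange(256) / 255.0
    if segments is None:
        y = x ** (1.0 / gamma)
    else:
        knots = np.linspace(0.0, 1.0, segments + 1)
        y = np.interp(x, knots, knots ** (1.0 / gamma))
    return np.round(y * 255.0).astype(np.uint8)

def gamma_correct(image, gamma, segments=None):
    """Apply the lookup table to every channel of a uint8 image."""
    return gamma_lut(gamma, segments)[image]

exact = gamma_lut(2.3)                 # gamma value quoted in the text
approx = gamma_lut(2.3, segments=16)   # hypothetical number of segments

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
corrected = gamma_correct(image, 2.3)
```

For augmentation, one would typically draw the gamma value at random from a small range so that each copy of an image has slightly different tonal characteristics.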
The contrast of the whole image depends on the Gamma correction, and the higher the contrast, the more pronounced the visual effect to the human eye. The color of the image is also related to the Gamma correction: the higher the contrast, the higher the color saturation of the whole image.

3.6.6 Affine Transformation

An affine transformation translates and rotates the image through a series of geometric transformations while maintaining the straightness and parallelism of the image. Straightness means that a straight line is still a straight line, and an arc is still an arc, after the affine transformation; parallelism means that parallel lines are still parallel after the affine transformation. The affine transform (AFT) used here is a permutation operation that randomly changes the positions of image pixels. For an image f(x, y) of size n × n pixels, the transformed coordinates (x', y') are computed by left-multiplying the original coordinate vector (x, y) by one or more transformation matrices, and multiple transformation matrices can be combined. The function is represented as follows.

$$
\begin{pmatrix} x' \\ y' \end{pmatrix}
= \mathrm{AFT}\{(x, y), n\}
= \begin{pmatrix} a \\ b \end{pmatrix}
+ \begin{pmatrix} i & 0 \\ 0 & j \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix} \pmod{n}
\tag{13}
$$

where “mod” stands for the modulo operation, and a ∈ [1, n] and b ∈ [1, n] can be chosen randomly within this range. The values i and j are chosen to be relatively prime to n, so that the affine transformation maps each original coordinate (x, y) to a unique pixel in the transformed coordinates. If i and j are not relatively prime to n, the affine transform maps different original coordinates to the same pixel in the transformed coordinates. After the AFT, the total energy of the input image remains constant [26].
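The pixel permutation of Eq. (13) can be sketched in plain Python as follows. The sketch assumes zero-based coordinates 0, …, n−1 and stores the image as a list of rows; the function names and the small 8 × 8 test image are hypothetical. It also shows why i and j must be relatively prime to n.

```python
from math import gcd

def aft_coordinates(x, y, n, a, b, i, j):
    """Map (x, y) to (x', y') = (a + i*x, b + j*y) mod n, as in Eq. (13)."""
    return (a + i * x) % n, (b + j * y) % n

def aft_image(image, a, b, i, j):
    """Permute the pixels of an n x n image (list of rows) with the AFT."""
    n = len(image)
    if gcd(i, n) != 1 or gcd(j, n) != 1:
        raise ValueError("i and j must be relatively prime to n")
    out = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            xp, yp = aft_coordinates(x, y, n, a, b, i, j)
            out[xp][yp] = image[x][y]
    return out

n = 8
image = [[x * n + y for y in range(n)] for x in range(n)]  # distinct pixel values
scrambled = aft_image(image, a=3, b=5, i=3, j=5)

# Every pixel value survives exactly once, so the total energy is unchanged.
energy_before = sum(v * v for row in image for v in row)
energy_after = sum(v * v for row in scrambled for v in row)

# With i = 2 (not relatively prime to n = 8), two different inputs collide.
collision = aft_coordinates(0, 0, n, 3, 5, 2, 5) == aft_coordinates(4, 0, n, 3, 5, 2, 5)
```

Since the map is a bijection on the pixel grid whenever gcd(i, n) = gcd(j, n) = 1, it can be inverted with the modular inverses of i and j.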
4 Experiment Results

4.1 Experiment Configuration

The experiment was carried out on a personal computer running Windows 10 with a 2.5 GHz Intel Core i7 CPU and 32 GB of RAM. The tests were repeated several times to reduce the effect of randomness, and the overall accuracy was used to evaluate the results. The main training parameters were set as follows: Maximum Epochs was 30, Initial Learn Rate was 0.01, Mini Batch Size was 256 and Learn Rate Drop Factor was 0.1.

4.2 Structure of Eight-Layer CNN Based on Blocks

In this paper, we introduce an eight-layer convolutional neural network with hybrid modules for Chinese fingerspelling sign language recognition. As shown in Fig. 3, the proposed network contains six convolutional layers with hybrid modules and two fully connected layers. The hybrid module comes in three variants: Block I (Conv-BN-ReLU-Pooling), Block II (Conv-BN-ReLU) and Block III (Conv-BN-ReLU-BN). The first is a commonly used combination, which is employed to verify and test the pooling configuration. The second is a comparison variant, which omits the pooling operation. The third is a bold innovation: the BN operation is applied twice to better standardize the input of the adjacent layer and make its distribution more balanced. Each technique plays its role in its respective module, and the different combinations and configurations improve the overall performance. The hyperparameters of the proposed network are listed in Table 1. In addition, the value of “Padding” is set to “same” and the dropout rate is 0.4.

4.3 Statistical Results

Our method, CNN-BB, which adopts three hybrid blocks, was executed for 10 runs, and the results are shown in Table 2. As can be seen, the bolded portion of the column indicates
Table 1. Parameters of each layer based on blocks

Index  Layer                 Filter Size  Filters  Stride
1      Layer1-Block I        3×3          16       2
2      Layer2-Block I        3×3          32       2
3      Layer3-Block I        3×3          64       2
4      Layer4-Block II       3×3          128      2
5      Layer5-Block II       3×3          128      2
6      Layer6-Block III      3×3          256      2
7      FullyConnectedLayer1
8      FullyConnectedLayer2
Input
Output
that the MSD (mean and standard deviation) is 93.32 ± 1.42%. The maximum accuracy is 96.48% and the minimum is 91.41%, so all the accuracy values exceed 90%. This indicates that our method is both stable and effective.

Table 2. Ten runs of our method

Run   Accuracy of Our Method
1     92.97%
2     92.19%
3     91.41%
4     92.58%
5     92.19%
6     93.75%
7     93.75%
8     93.75%
9     96.48%
10    94.14%
MSD   93.32 ± 1.42%
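The MSD row of Table 2 can be checked directly from the ten run accuracies. The short sketch below uses only the Python standard library; it suggests that the reported ± 1.42% corresponds to the sample standard deviation (divisor n − 1), whereas the population standard deviation (divisor n) would be about 1.34%. Which estimator the authors intended is an inference from the numbers, not something the paper states.

```python
from statistics import mean, pstdev, stdev

# Accuracies (%) of the ten runs listed in Table 2.
accuracies = [92.97, 92.19, 91.41, 92.58, 92.19,
              93.75, 93.75, 93.75, 96.48, 94.14]

avg = mean(accuracies)
sample_sd = stdev(accuracies)        # divisor n - 1
population_sd = pstdev(accuracies)   # divisor n

print(f"MSD: {avg:.2f} ± {sample_sd:.2f}%")          # MSD: 93.32 ± 1.42%
print(f"population std: {population_sd:.2f}%")       # population std: 1.34%
print(f"min {min(accuracies)}%, max {max(accuracies)}%")
```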
5 Discussions

5.1 Comparison of Pooling Method

In this experiment, both maximum pooling (MP) and average pooling (AP) were tested without changing the other parameter settings. As shown in Table 3, the results of the 10 runs with average pooling are as follows: 90.06%, 91.23%, 89.45%, 91.28%, 91.84%, 92.02%, 91.45%, 92.23%, 91.63% and 93.58%. The MSD of average pooling is 91.48 ± 1.14%, which is lower than the 93.32 ± 1.42% of maximum pooling. Figure 9 compares AP and MP visually. The results show that maximum pooling achieves higher recognition accuracy: its best run reached 96.48%, while the best run of average pooling reached only 93.58%. In addition, the accuracy of maximum pooling is higher than that of average pooling in every run.

Table 3. Comparison of pooling method

Run   Average Pooling   Maximum Pooling
1     90.06%            92.97%
2     91.23%            92.19%
3     89.45%            91.41%
4     91.28%            92.58%
5     91.84%            92.19%
6     92.02%            93.75%
7     91.45%            93.75%
8     92.23%            93.75%
9     91.63%            96.48%
10    93.58%            94.14%
MSD   91.48 ± 1.14%     93.32 ± 1.42%
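The run-by-run claim in Section 5.1 can be verified from Table 3 with a short paired comparison. This is only a descriptive check in plain Python, not a significance test; pairing the runs by their index follows the table layout and is otherwise an assumption.

```python
# Accuracies (%) per run, copied from Table 3.
average_pooling = [90.06, 91.23, 89.45, 91.28, 91.84,
                   92.02, 91.45, 92.23, 91.63, 93.58]
maximum_pooling = [92.97, 92.19, 91.41, 92.58, 92.19,
                   93.75, 93.75, 93.75, 96.48, 94.14]

# Per-run advantage of maximum pooling over average pooling.
gaps = [mp - ap for ap, mp in zip(average_pooling, maximum_pooling)]
wins = sum(gap > 0 for gap in gaps)
mean_gap = sum(gaps) / len(gaps)

print(f"maximum pooling is higher in {wins} of {len(gaps)} runs")
print(f"mean gap: {mean_gap:.2f} percentage points")
print(f"smallest gap: {min(gaps):.2f}, largest gap: {max(gaps):.2f}")
```

The gap ranges from well under one percentage point to almost five, so the advantage is consistent in sign but varies considerably in size across runs.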
5.2 Effect of Double Batch Normalization

As an innovation, a double BN structure (Block III) was introduced to test whether the overall performance can be improved. In this experiment, Block III was placed in the last convolutional position, where it played a significant role. The comparison between the network with double BN and the network without it is shown in Fig. 10 and Table 4. The results indicate that double BN makes the layer inputs better distributed and normalized, which further mitigates vanishing gradients and accelerates learning convergence.

5.3 Comparison to State-of-the-Art Methods

In this experiment, eight state-of-the-art methods, HMM [27], CSI [28], HCRF [29], GLCM-PGSVM [30], WE-KSVM [31], 6L CNN-LRELU [32], AlexNet-DAAdam [33] and CNN7-DA [34], were compared with our proposed network CNN-BB. As
Fig. 9. Comparison of average pooling and maximum pooling
Table 4. Comparison of BN

Run   Single BN        Double BN
1     91.41%           92.97%
2     89.06%           92.19%
3     91.80%           91.41%
4     90.63%           92.58%
5     91.41%           92.19%
6     91.41%           93.75%
7     92.19%           93.75%
8     92.58%           93.75%
9     92.58%           96.48%
10    91.41%           94.14%
MSD   91.45 ± 1.03%    93.32 ± 1.42%
can be seen in Fig. 11, our CNN-BB method achieved a superior MSD. Two important factors boost the discriminative ability of our network. Single BN balances the distribution of the layer inputs and mitigates vanishing gradients. Double BN further improves effectiveness and accelerates the convergence of learning. Additionally, the pooling operation reduces computation and overfitting. Meanwhile, dropout and ReLU also play their roles in the architecture.
Fig. 10. Effect of Double Batch Normalization
Fig. 11. Comparison to state-of-the-art approaches
6 Conclusions

In this study, we proposed a block-based, optimized eight-layer convolutional neural network (CNN-BB) for Chinese fingerspelling sign language recognition. The architecture employs three different blocks: Conv-BN-ReLU-Pooling, Conv-BN-ReLU and Conv-BN-ReLU-BN. By adopting dropout, data augmentation, ReLU, batch normalization, pooling and other advanced techniques, the CNN-BB proposed in this paper achieved an MSD of 93.32 ± 1.42%, which is superior to the other advanced methods. In future studies, we will try to build larger data sets, apply more advanced methods and fine-tune the hyperparameters in the hope of obtaining better performance. Transfer learning and applying the optimized network to other fields are further goals we will focus on.

Acknowledgements. This work was supported by National Philosophy and Social Sciences Foundation (20BTQ065), Natural Science Foundation of Jiangsu Higher Education Institutions of China (19KJA310002).
References

1. Yang, X., Lei, J., Sun, K.: Evolution and trend of sign language research in China: a visual analysis based on CiteSpace, vol. 267, no. 09, pp. 21–28+65 (2022)
2. Yu, Z.: Adaptive problems in Chinese sign language recognition, Ph.D. Harbin Institute of Technology (2010)
3. Yao, G., Yao, H., Jiang, F.: A multi-layer classifier sign language recognition method based on DTW/ISODATA algorithm, vol. 08, pp. 45–47+200 (2005)
4. Zhao, W.: Chinese sign language recognition based on HMM_SVM, vol. 21, no. 10, pp. 24–26 (2011)
5. Wu, J., Gao, W.: ANN/HMM based sign language recognition method, no. 10, pp. 63–66 (1999)
6. Zou, W., Yuan, K., Du, Q., Xu, C.: Fuzzy neural network based word recognition in static sign language, no. 04, pp. 616–621 (2003)
7. Ma, C., Shao, J., Qin, B.: Progress in sign language recognition in the teaching of the hearing impaired, vol. 42, no. 10, pp. 23–27 (2022)
8. Jiang, X., Satapathy, S.C., Yang, L., Wang, S.-H., Zhang, Y.-D.: A survey on artificial intelligence in Chinese sign language recognition. Arab. J. Sci. Eng. 45(12), 9859–9894 (2020)
9. Lee, Y., Hua, F.: Principle and realization of conversation from standard Chinese pinyin to international phonetic alphabet, vol. 14, pp. 540–545 (2012)
10. Feng, B., Yang, H., Yuan, G., Li, J., Zhan, C.: A review of the research of neural networks in SAR image target recognition, vol. 42, no. 10, pp. 15–22 (2021)
11. Zhou, F., Jin, L., Dong, J.: Review of convolutional neural networks, vol. 40, no. 06, pp. 1229–1251 (2017). https://kns.cnki.net/kcms/detail/11.1826.TP.20170122.1035.002.html
12. Chang, L., et al.: Convolutional neural networks in image understanding. Acta Autom. Sin. 42(09), 1300–1312 (2016). https://doi.org/10.16383/j.aas.2016.c150800
13. Zhang, Y., Liu, Y., Liu, M., Man, W., Song, T., Li, C.: Fine classification of wetland plant communities based on relief F and convolutional neural networks, no. 02, pp. 58–64 (2023). https://doi.org/10.13474/j.cnki.11-2246.2023.0041
14. Qian, X., Zhang, X., Hao, Z.: Gait recognition based on improved convolutional neural network, vol. 9, no. 02, pp. 91–97 (2022). https://doi.org/10.19306/j.cnki.2095-8110.2022.02.011
15. Hao, T.: Construction of activation function LeafSpring and comparative study of multiple data sets, vol. 49, no. 03, pp. 306–314+322 (2020). https://doi.org/10.13976/j.cnki.xk.2020.9332
16. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. Presented at the Proceedings of the 32nd International Conference on Machine Learning - Volume 37, Lille, France (2015)
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., et al.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012; abs/1207.0580
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2012)
19. Srivastava, N., Hinton, G., Krizhevsky, A., et al.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
20. Gao, Y., Zhou, B., Hu, X.: Research on convolutional neural network image recognition based on data enhancement. Comput. Technol. Dev. 28, 62–65 (2018)
21. Yanzhen, Z., Xiangyu, C., Jian, L., et al.: Transient stability prediction of power systems based on data augmentation and deep residual networks. China Electr. Power 53, 22–31 (2020)
22. Eckert, D., Vesal, S., Ritschl, L., Kappler, S., Maier, A.: Deep learning-based denoising of mammographic images using physics-driven data augmentation. In: Tolxdorff, T., Deserno, T., Handels, H., Maier, A., Maier-Hein, K., Palm, C. (eds.) Bildverarbeitung für die Medizin 2020. Informatik aktuell, pp. 94–100. Springer, Wiesbaden (2020). https://doi.org/10.1007/978-3-658-29267-6_21
23. Vasconcelos, C.N., Vasconcelos, B.N.: Convolutional neural network committees for melanoma classification with classical and expert knowledge based image transforms data augmentation. Comput. Vis. Pattern Recognit. (2017)
24. Igl, M., Ciosek, K., Li, Y., et al.: Generalization in reinforcement learning with selective noise injection and information bottleneck (2019)
25. Wang, S.H., Tang, C., Sun, J., et al.: Multiple sclerosis identification by 14-layer convolutional neural network with batch normalization, dropout, and stochastic pooling. Front. Neurosci. 12 (2018)
26. Singh, P., Yadav, A.K., Singh, K.: Color image encryption using affine transform in fractional Hartley domain. Optica Applicata 47 (2017)
27. Zhao, N., Yang, H.: Realizing speech to gesture conversion by keyword spotting. In: 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5. IEEE (2016)
28. Li, Y., Chen, X., Zhang, X., et al.: A sign-component-based framework for Chinese sign language recognition using accelerometer and sEMG data. IEEE Trans. Biomed. Eng. 59, 2695–2704 (2012)
29. Yang, H.-D., Lee, S.-W.: Robust sign language recognition with hierarchical conditional random fields. In: 2010 20th International Conference on Pattern Recognition, pp. 2202–2205. IEEE (2010)
30. Anguita, D., Ghelardoni, L., Ghio, A., et al.: The ‘K’ in K-fold cross validation. In: ESANN, pp. 441–446 (2012)
31. Zhu, Z., Zhang, M., Jiang, X.: Fingerspelling identification for Chinese sign language via wavelet entropy and kernel support vector machine. In: Satapathy, S., Zhang, YD., Bhateja, V., Majhi, R. (eds.) Intelligent Data Engineering and Analytics. AISC, vol. 1177, pp. 539–549. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-5679-1_52
32. Jiang, X., Zhang, Y.-D.: Chinese sign language fingerspelling via six-layer convolutional neural network with leaky rectified linear units for therapy and rehabilitation. J. Med. Imaging Health Informat. 9, 2031–2090 (2019)
33. Jiang, X., Hu, B., Chandra Satapathy, S., et al.: Fingerspelling identification for Chinese sign language via AlexNet-based transfer learning and Adam optimizer. Sci. Progr. 2020, 1–13 (2020)
34. Gao, Y., Zhu, R., Gao, R., Weng, Y., Jiang, X.: An optimized seven-layer convolutional neural network with data augmentation for classification of Chinese fingerspelling sign language. In: Fu, W., Xu, Y., Wang, SH., Zhang, Y. (eds.) ICMTEL 2021. LNICST, Part II, vol. 388, pp. 21–42. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-82565-2_3
Opportunities and Challenges of Education Based on AI – The Case of ChatGPT

Junjie Zhong, Haoxuan Shu, and Xue Han (✉)

Nanjing Normal University of Special Education, Nanjing 210000, Jiangsu, China
[email protected]
Abstract. Generative artificial intelligence, exemplified by ChatGPT, is growing rapidly and causing multiple controversies in areas such as education. The development of artificial intelligence is highly significant for today’s education and influences it in many ways. How will ChatGPT change education? The application of ChatGPT in education may lead to four types of risk, concerning academic integrity and evaluation mechanisms, excessive dependence and the status of teachers, information transmission and knowledge level, and ethical awareness and ethical risks. Finally, this paper puts forward three perspectives on the application of generative AI represented by ChatGPT in education.

Keywords: ChatGPT · Generative Artificial Intelligence · Educational · Application
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 32–41, 2024. https://doi.org/10.1007/978-3-031-50580-5_3

1 Introduction

In recent years, with the rapid development of information technology and artificial intelligence, the state has attached great importance to promoting education informatization and the digitization of school education. The report of the 20th National Congress of the Communist Party of China explicitly called for “promoting the digitalization of education”. This policy focus was fully reflected at the World Digital Education Congress in February 2023. The digital transformation of education refers to the shift from the traditional classroom teaching mode to a digital teaching mode built on information technology, including artificial intelligence, so as to deliver efficient, fast and repeatable education services. In other words, it is the use of digital technology to change the current processes of student learning, teacher education, teaching and management. Living in an era of highly developed information technology, we are witnessing the rapid development of digital technology represented by artificial intelligence. Most industries, including education, are in urgent need of digital transformation [1].

Around the world, the application of information technology and artificial intelligence in education is reflected not only in multimedia teaching, online classrooms and other new ways of delivering personalized teaching, but also in the direct use of artificial intelligence technology. Examples include automatic grading of homework, virtual laboratories, robot teaching, intelligent precision teaching based on big data analysis, and the generation of personalized teaching materials and content. The capacity of artificial intelligence for innovation and practice in all aspects of school education has been continuously strengthened, and the digital transformation of school education has shown remarkable results. Meanwhile, deep learning-based computer-aided systems have become a widely researched topic, with particular emphasis on their applications in domains such as image recognition, natural language processing and speech recognition. However, there are also concerns about the ethical and social implications of these systems, such as privacy, bias and job displacement, that require careful consideration and regulation [2].

ChatGPT (Chat Generative Pre-trained Transformer) is an artificial intelligence chatbot developed by the US company OpenAI. It was released for public testing on November 30, 2022 and became popular across the Internet as soon as it was launched. ChatGPT generates natural-sounding response text based on what the user types, allowing the user to converse with it about almost anything. Since ChatGPT is close to the everyday life of human beings and can be used by almost everyone, it is bound to change many human behaviors and to promote a corresponding change in learning patterns. Its willingness to question, its willingness to admit ignorance, and its support for continuous dialogue and contextual understanding have attracted the attention of a large number of educators. In the face of the rapid progress of artificial intelligence, how should educators prepare for the emergence of ChatGPT?
2 What AI Development Means for Education Today

In the digital age, using artificial intelligence to promote education reform and innovation is a necessary measure. With the continuous development of information technology represented by artificial intelligence, the influence of AI applications on social production and life is gradually attracting attention. In the field of education, AI-enabled education also has a profound impact on the existing behavioral and cognitive models and on the structure of the educational ecosystem. Accelerating the integration of artificial intelligence with student learning, teacher development, home-school co-education, educational governance and educational evaluation moves education towards higher quality (Fig. 1).

Fig. 1. High-quality development of AI enabling education

High-quality student learning. With the development of AI-related technologies, student learning shows three trends: precision, diversification and simulation. Precision is mainly embodied in the dynamic analysis of students’ learning situation and real-time warnings [3]. Diversification is reflected in adaptive learning support services: knowledge graphs and deep learning algorithms can adaptively deliver teaching resources and learning support, so as to better meet the diverse needs of students. Simulation is reflected in highly interactive and highly realistic teaching situations. The classroom teaching space constructed with ChatGPT, the metaverse, 3D, VR and 5G network technology offers a high transmission rate, low delay and high realism, which enhances students’ enthusiasm and commitment and makes them more immersed in learning [4].

High-quality teacher development. Through data support and artificial intelligence technology, teaching research gains management and evaluation value beyond its traditional scope, which in turn improves the efficiency and quality of teachers’ teaching research.

High-quality home-school co-parenting. With the development of artificial intelligence technology, it becomes possible to solve problems of the traditional home-school cooperation model, such as the separation of education settings, unequal information, and insufficient cooperation and communication. At the same time, the adoption of reasonable forms of communication is also of positive significance to home-school co-parenting [5].

High-quality education governance. Artificial intelligence technology is used to support and improve education governance capacity, to promote the development of education, and to promote cooperation among schools, families, society and other parties.

High-quality educational evaluation. In terms of evaluation methods, artificial intelligence, big data and other technologies have broken through the limitations of traditional paper-and-pencil examinations and made the evaluation of students’ knowledge and ability process-oriented, dynamic and more advanced. They promote the practical exploration of result evaluation, process evaluation, value-added evaluation and comprehensive evaluation, and make evaluation methods more refined and diversified [6].
3 Impact of ChatGPT on Education

In January 2023, Li, a sophomore, used ChatGPT to finish a paper for the course New Media and Advertising and received a high score of more than 90 [7]. There are countless other examples of ChatGPT being used to accomplish tasks. According to incomplete statistics, there were more than 100,000 discussions on the use of ChatGPT on domestic social platforms as of January 2023. Some argue that using ChatGPT encourages dependency and deprives students of the ability to think for themselves, and that relying solely on artificial intelligence will leave students lacking in innovative thinking. Chomsky, a professor emeritus at MIT (Massachusetts Institute of Technology), insists that “ChatGPT is essentially high-tech plagiarism” and “a way to avoid learning.” Educators believe that the development of artificial intelligence also poses many challenges for education: the increase in human-computer dialogue may reduce human interaction; if ChatGPT is widely used, students may use it to cheat in their daily academic life; and the misuse of ChatGPT in scientific research raises ethical questions, as well as the question of whether traditional instruction still has a role to play in teaching writing and essay composition [8]. There were also dissenting voices in the media, arguing that “ChatGPT’s potential as an educational tool outweighs its risks” and even that “schools should thoughtfully use ChatGPT as a teaching aid, a way to unleash student creativity, provide personalized tutoring, and prepare students to work with AI systems as adults.” [9].

Facing the rapid development of generative artificial intelligence technology such as ChatGPT, how should we rationally understand its impact on the way humans learn, live and work [10]? The New Generation of AI Governance Principles focuses on the development of “responsible” AI and puts forward an AI governance framework and action guidelines, which is also a positive response to the call for safe, reliable and controllable AI [11]. Promoting technological development while preventing risks is therefore an important topic for future AI research. In the field of education, AI-enabled education also has a profound impact on the existing behavioral and cognitive models and on the structure of the educational ecosystem. In addition, because the education field tends to lag behind technology, and faces a series of risks brought by the deep integration of artificial intelligence and education, educators need to study and assess the mechanisms and development patterns of this integration. It is necessary to prevent the high uncertainty and potential risks of artificial intelligence from undermining existing educational concepts and systems, and to realize the vision of mutual empowerment and sustainable development of artificial intelligence and education [12].
4 Risks and Challenges of ChatGPT on Education

In the field of education, some scholars have expressed concern about the use of ChatGPT. Alshater [13] believes ChatGPT has certain limitations, including reliance on data quality, limited areas of knowledge, ethical issues, over-reliance on technology, and potential for abuse. Qadir [14] also noted that ChatGPT, like other generative AI systems, can be biased, can even generate and disseminate misinformation, and raises a host of ethical questions. Baidoo-Anu et al. [15] also argue that the use of ChatGPT in education may be affected by issues such as lack of human interaction, limited understanding, biased training data, lack of creativity, lack of contextual understanding, and invasion of privacy. One of the biggest concerns in education today is that ChatGPT undermines educational equity by helping students complete assignments or cheat on exams without thinking for themselves. To sum up, although ChatGPT has great potential in educational applications, it brings risks and challenges such as threats to academic integrity, over-reliance by teachers and students, inaccurate information delivery, and difficulty in dealing with ethical risks [16].

4.1 The Academic Integrity is Questioned and the Evaluation Mechanism is Unbalanced

ChatGPT’s ability to help students complete academic tasks in a variety of disciplines has raised concerns in the academic community about academic integrity. Some researchers have argued that ChatGPT could help students cheat, destabilize the educational evaluation system, and lead to educational inequity [17, 18]. After comparing ChatGPT’s answers with those given by real students in Open University exams, researchers said the launch of ChatGPT could spell the end of academic integrity in online and open exams, as ChatGPT’s answers showed a high level of critical and logical thinking. Because it generates highly logical text from very little input, it makes it possible for students to cheat on exams [19]. In addition, it is reportedly difficult to distinguish between writing produced by students and writing generated by ChatGPT, and difficult to adequately assess students’ true level of understanding when they use ChatGPT to answer questions.
Therefore, this may lead to the failure of the existing educational evaluation mechanism [17]. Keffer-butterfield, director of artificial intelligence at the World Economic Forum in Davos, said students submitting AI-generated content would affect their ability to improve themselves because “it’s like a machine that works” [20]. More importantly, the education sector has been slow to respond to AI tools, requiring many trials and adjustments to guard against risks. What can students learn from their education when they rely on AI products and transfer creative sovereignty to AI? How can we know the true level of students? This presents a great challenge to the academic integrity detection and evaluation mechanism. 4.2 Excessive Dependence on Students and Addiction May Weaken the Status of Teachers ChatGPT does offer interactive learning [15]. A 2014 study also proved that students who talked to an imaginary mentor who mimicked human emotional behavior learned better. However, over the course of using ChatGPT multiple times, dependencies can arise. In particular, people who have used it many times with good results may notice the effect. Some people become lazy because they rely on smart tools. For example, it has been reported in the past that if you rely on test search tools, your grades will drop. Over-reliance on online tools can reduce students’ creativity in development and
Opportunities and Challenges of Education
may divert the saved time to places unrelated to learning [12]. At the same time, powerful artificial intelligence aids also place higher demands on teachers, who need a strong ability to distinguish correct knowledge from incorrect knowledge, especially during lesson preparation, where they must keep the essence and discard the dross. In addition, excessive use of ChatGPT may threaten the teacher’s status in the classroom, causing students to lose concentration and leave after-class problems to ChatGPT [21]. It is true that ChatGPT makes it easier to acquire knowledge, to provide feedback on assignments, to diversify learning patterns, and to dispense with written assignments and marking, but it is unclear how teachers and students can cope with the changes that ChatGPT brings.
4.3 Inaccurate Information Transmission and Limited Knowledge Level
Although ChatGPT is similar to traditional search engines in that it provides a wealth of learning information quickly, its accuracy is not guaranteed. The New York City Department of Education has raised concerns about the information ChatGPT gives students, particularly about the safety and accuracy of the answers. It is also concerned that the use of ChatGPT will lead to pride among young students and a lack of the skills needed to assess information [8]. OpenAI also acknowledges that ChatGPT sometimes gives plausible-sounding but incorrect or absurd answers [14]. In addition, ChatGPT is sensitive to small adjustments in the text that students enter and to repeated entries of the same text. For example, given one phrasing of a question ChatGPT may say it does not know the answer, yet it may answer once the question is slightly modified. Because of this, ChatGPT needs precise guidance when it is used in the classroom, and teachers and students cannot fully trust it. In addition, when OpenAI released ChatGPT, the training data extended only to 2021, so its knowledge of the world after 2021 was limited.
It also showed too little understanding of the problems of specific populations, and almost all researchers pointed out gaps in what it covered, especially in scientific knowledge. In addition, ChatGPT has no connection to the live Internet and no access to information from social media; it relies on a closed data set and can therefore generate incorrect information. If ChatGPT creates fake reading lists on specific search topics, what are the teachable aspects of ChatGPT? [22].
4.4 Ethical Awareness is not Strengthened, and Ethical Risks Are Difficult to Deal with
Artificial intelligence ethics refers to the ethical principles and codes of conduct that should be observed in the development, management and application of artificial intelligence. With the rapid development of artificial intelligence, educators also need to have ethical awareness [23]. When ChatGPT provides personalized feedback to students, educators need to be concerned about three ethical issues: privacy of data, bias, and ownership. The first is data privacy. In order to solve a particular problem with ChatGPT, teachers and students have to enter a lot of data and information related to it. Stored data can be compromised when ChatGPT identifies the data and provides satisfactory answers to the user. There are concerns about how educational institutions, especially schools, can use students’ data without the consent of third parties. The second
J. Zhong et al.
is bias. ChatGPT is a large-scale language model trained on millions of data points drawn from a large number of texts and books, but it can only gain knowledge from the statistical regularities of its training data and has no contextual understanding, so it is different from humans. An inability to understand the world in complex, abstract ways can lead to offensive and biased feedback [15]. The third is ownership. We need to evaluate whether all the content ChatGPT produces for teachers and students can be used directly. ChatGPT’s ability to ghostwrite program code and extend vivid stories became widely recognized in a short time, but the generative model produces its responses from patterns in the training data. This is not innovation or originality [24]. In other words, ChatGPT’s responses are the result of collecting data from the database, but who owns these works? Who will bear the risk afterwards [25]? This is of great concern.
5 Future Prospects of ChatGPT on Education
5.1 Promote the Reform of Educational Concepts
The generative AI represented by ChatGPT has some limitations, but it has the ability to directly influence and inspire educational ideas. At present, education in our country still attaches importance to acquiring knowledge through a large amount of rote memorization, comprehension and practice, but neglects the ability and methods needed to discover knowledge through analytical thinking. By demonstrating that it can effectively accumulate knowledge and use it rationally, generative AI technology is expected to replace low-level mental workers who focus only on acquiring and accumulating knowledge [26]. It can be seen that education should pay more attention to cultivating students’ high-level thinking ability, especially interdisciplinary thinking, critical thinking and creative thinking [27]. Through multi-disciplinary thinking, students can understand and distinguish the complex problems and situations of the real world, and finally complete those realistic tasks that artificial intelligence cannot cope with. Only with good critical thinking ability can students’ knowledge and skills go beyond a deep understanding and analysis of AI models and fully recognize the limitations of AI technology and its nature as a tool. Only with a certain creative thinking ability can students give full play to their innovative potential and role in a specific field and not be easily replaced by artificial intelligence machines. At the same time, in the process of education and teaching, teachers should speed up the change of ideas under the new information technology conditions, and the enthusiasm and creativity of front-line educators should be fully mobilized as social needs and educational ideas change with technology [28].
5.2 Innovate Teaching Methods and Contents
Driven by the educational concept that emphasizes the cultivation of advanced thinking ability, generative AI technology and products will increasingly influence the educational methods and contents, and play different roles. In terms of teaching methods, teachers are encouraged to actively innovate classroom teaching methods, integrate relevant technologies into the teaching process of different subjects, and enrich the content
and interest of classroom activities. For example, by setting up AI assistants with excellent interactive capabilities, providing real-time machine feedback and even an environment for human-computer discussion, students can engage in co-operative learning with machine assistants. In terms of teaching content, we should actively adjust the training objectives and teaching requirements of different disciplines, and pay more attention to teaching content oriented towards the core competencies of each discipline [29]. With current AI content generation technologies, for example, multilingual code generation and debugging capabilities are so good that the social division of labor that currently falls to junior programmers may disappear. Therefore, basic and vocational education should pay more attention to computational thinking, artificial intelligence literacy and algorithmic thinking than to reciting the grammar of programming languages.
5.3 Encourage Mutual Development of Education and Technology
The technology of generative artificial intelligence is advancing rapidly. Take the GPT series as an example. From the early GPT-1 [30] to the current ChatGPT, only four generations have been released, yet the performance of each generation has improved significantly. Since these updates took less than five years, it is expected that more intelligent and humanized AI technologies and products will appear in a short time. Such systems will further enhance natural language processing with better content understanding, generation and generalization capabilities. It can be seen that education must adapt to the rapid development of artificial intelligence technology. We should strive to hold a more open and inclusive attitude towards artificial intelligence, maintain a technology-oriented education concept, and learn and use relevant technologies and tools to jointly complete various educational tasks.
At the same time, we should fully realize that this new technology is no longer a tool for “photo search” or “face swap software,” but an important part of the future of education, with profound implications for the transformation of the education field [27]. In the field of education, we should always pay attention to the potential risks generated by artificial intelligence technology, formulate laws and regulations for its application in education, and form a double helix in which technology and education promote each other and make progress together. As artificial intelligence spreads through human society, education, the cornerstone of human civilization, needs to be flexible and confident in responding to challenges.
6 Conclusion
With the emergence of ChatGPT and its explosive popularity, generative artificial intelligence, a new technology, has gradually entered the education ecosystem and brought great impact and challenge to the current modernization of education. How to turn challenges into opportunities is an urgent question that every educator needs to think about deeply. Technology and education are not in opposition: technological progress and educational development can, and need to, advance together. ChatGPT is a milestone in the development of artificial intelligence technology. It is a great historic opportunity to promote education reform and innovation, but it is also a great challenge to that reform and innovation.
In the era of rapid development of artificial intelligence, no one can separate themselves from this fact, whether by active or passive choice. In the face of the emergence and development of ChatGPT and other new technologies, it is better to turn the crisis into an opportunity and make good use of the advantages of digital technology, so as to realize the integration of human and machine intelligence and use technology to promote the progress and development of education [31]. At the same time, we must always remember that only by fully awakening the consciousness and subjectivity of “man” and stimulating creativity can future human beings get rid of the control of self-created technology, win the competition between man and technology, and become the master of history.
Funding. This paper is supported by the Fund for Philosophy and Social Sciences of Universities in Jiangsu Province, China, “Research on Integrated Education Based on the Decentralization of Blockchain Technology” (2019SJA0543).
References
1. Campolo, A., et al.: AI now 2017 report (2017)
2. Kuang, Y., et al.: Double stimulations during the follicular and luteal phases of poor responders in IVF/ICSI programmes (Shanghai protocol). Reprod. Biomed. Online 29(6), 684–691 (2014)
3. Firat, M.: How chat GPT can transform autodidactic experiences and open education. Department of Distance Education, Open Education Faculty, Anadolu Unive (2023)
4. George, A.S., George, A.H.: A review of ChatGPT AI’s impact on several business sectors. Partn. Univ. Int. Innov. J. 1(1), 9–23 (2023)
5. Buza, V., Hysa, M.: School-family cooperation through different forms of communication in schools during the Covid-19 pandemic. Thesis 9(2), 55–80 (2020)
6. Gilson, A., et al.: How does CHATGPT perform on the united states medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 9(1), e45312 (2023)
7. Hassani, H., Silva, E.S.: The role of ChatGPT in data science: how AI-assisted conversational interfaces are revolutionizing the field. Big Data Cognit. Comput. 7(2), 62 (2023)
8. Hsu, J.: Should Schools Ban AI Chatbots? Elsevier, Amsterdam (2023)
9. Ali, S.R., et al.: Using ChatGPT to write patient clinic letters. Lancet Digit. Health 5(4), e179–e181 (2023)
10. Biswas, S.S.: Role of Chat GPT in public health. Ann. Biomed. Eng., 1–2 (2023)
11. Huang, F., Kwak, H., An, J.: Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv preprint arXiv:2302.07736 (2023)
12. Jiao, W., et al.: Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745 (2023)
13. Alshater, M.M.: Exploring the role of artificial intelligence in enhancing academic performance: a case study of ChatGPT. SSRN (2022)
14. Qadir, J.: Engineering education in the era of ChatGPT: promise and pitfalls of generative AI for education (2022)
15. Baidoo-Anu, D., Owusu Ansah, L.: Education in the era of generative artificial intelligence (AI): understanding the potential benefits of ChatGPT in promoting teaching and learning. SSRN 4337484 (2023)
16. Chen, T.-J.: ChatGPT and other artificial intelligence applications speed up scientific writing. J. Chin. Med. Assoc. 10 (2023)
17. Cotton, D.R., Cotton, P.A., Shipway, J.R.: Chatting and cheating. Ensuring academic integrity in the era of ChatGPT (2023)
18. Ventayen, R.J.M.: OpenAI ChatGPT generated results: similarity index of artificial intelligence-based contents. SSRN 4332664 (2023)
19. Susnjak, T.: ChatGPT: the end of online exam integrity? arXiv preprint arXiv:2212.09292 (2022)
20. Curtis, N.: To ChatGPT or not to ChatGPT? The impact of artificial intelligence on academic publishing. Pediatr. Infect. Dis. J. 42(4), 275 (2023)
21. Khalil, M., Er, E.: Will ChatGPT get you caught? Rethinking of plagiarism detection. arXiv preprint arXiv:2302.04335 (2023)
22. Lund, B.D., Wang, T.: Chatting about ChatGPT: how may AI and GPT impact academia and libraries? Libr. Hi Tech News (2023)
23. Macdonald, C., et al.: Can ChatGPT draft a research article? An example of population-level vaccine effectiveness analysis. J. Glob. Health 13, 01003 (2023)
24. McGee, R.W.: ANNIE CHAN: three short stories written with chat GPT. SSRN 4359403 (2023)
25. Pavlik, J.V.: Collaborating with ChatGPT: considering the implications of generative artificial intelligence for journalism and media education. Journal. Mass Commun. Educ., 10776958221149577 (2023)
26. McGee, R.W.: Who were the 10 best and 10 worst US presidents? The opinion of chat GPT (artificial intelligence). Opin. Chat GPT Artif. Intell. (2023)
27. Naumova, E.N.: A mistake-find exercise: a teacher’s tool to engage with information innovations, ChatGPT, and their analogs. J. Pub. Health Policy, 1–6 (2023)
28. McGee, R.W.: Is ESG a bad idea? The ChatGPT response. Working Paper, 8 April 2023 (2023). https://ssrn.com/abstract=4413431
29. McGee, R.W.: Political philosophy and ChatGPT. Working Paper, 25 March 2023. https://ssrn.com/abstract=4399913. 10.13140
30. Radford, A., et al.: Improving language understanding by generative pre-training (2018)
31. Qin, C., et al.: Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476 (2023)
Visualization Techniques for Analyzing Learning Effects – Taking Python as an Example
Keshuang Zhou, Yuyang Li, and Xue Han(B)
Nanjing Normal University of Special Education, Nanjing 210000, Jiangsu, China
[email protected]
Abstract. With the advent of the information age, data visualization technology has gradually shown its unique value in various information fields, and its importance is increasingly recognized by governments and commercial departments. In teaching in our country, most schools simply use office software to create pie charts, histograms or tables to realize visualization. This kind of teaching produces a single static chart that does not change at all. The content of the data visualization courses in many schools is outdated, and the way of visualization is not in line with the needs of the times; the content has been criticized as abstract, mechanical in language, and formulaic [1]. In order to analyze, evaluate and improve teaching quality, this paper uses Python to process and analyze the data exported from the teaching administration system, mainly in three steps: data acquisition, data processing and data analysis. Firstly, Python crawler technology is used to obtain students’ grades; secondly, the invalid data is processed; and finally, the matplotlib library is used to visualize the processed data, and the learning status of students in the class is analyzed and evaluated on the basis of the resulting images. The data processing in this paper hides students’ names to protect their privacy, and the graphs intuitively reflect the distribution of students’ grades, which makes grade analysis more convenient.
Keywords: Artificial Intelligence · Python · Visualization · Word Cloud
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 42–52, 2024. https://doi.org/10.1007/978-3-031-50580-5_4
1 Introduction
Since ancient times, China has always attached great importance to education. With the rapid development of artificial intelligence in today’s society, the idea that science and technology are the primary productive forces is being fully implemented, with education as its foundation. In students’ daily learning, their performance in class, the study time and grades of online courses, and the score statistics of final exams all involve a large amount of data. Examination is an important link in the teaching process, and also an important means of reflecting the teaching effect. The analysis of examination results can not only reflect students’ learning attitude, learning effect and effort level, but also reflect, to a certain extent, teachers’ teaching attitude, teaching method, teaching effect and teaching management level. Achievement analysis is an important
part of the teaching process [2]. At present, the most commonly used tools are the PivotTable function of Excel and the mode, mean and variance functions of SPSS, which generate visual charts that reflect the strengths and weaknesses of different subjects and the daily learning performance of different students, providing a reference for students to adjust their learning in the future. They also help teachers and parents gain a clearer understanding of students’ learning status. However, when faced with a large amount of data, Excel and SPSS become cumbersome, and a more convenient way to carry out data visualization analysis is still needed. In the face of massive data, how to analyze and mine the data and better present the useful information it contains has become a main object of current research. Data visualization is an important means of data analysis and information presentation, and has also become a main tool of data research [3]. Python is a programming language commonly used for big data. This paper uses the Matplotlib [4] and NumPy [5] libraries in Python [6] and takes the final grades of the authors’ class as an example for data analysis. In order to protect personal privacy, the names of students were masked when the data was collected for this paper.
2 Current Situation of Analysis for Learning Effect
In the current teaching environment, how to give students more time to study independently through teaching reform has become an issue of growing concern. In the face of the increasing amount of data, the traditional paper-based analysis method is obviously unable to meet the demand. So how can data be turned into useful information? With visualization technology, charts, graphs and other visual forms can be used to display performance information. Visual chart analysis of grades is an important reflection of students’ learning results, showing their mastery of knowledge and their learning ability over a period of time. In many cases students’ grades are not ideal, so performance analysis is particularly important for them. It helps students to reflect on themselves, find their own problems in learning and make targeted adjustments. In daily learning, students can learn the differences between their weak subjects and strong subjects through various channels, but much of this information is obtained indirectly from teachers, which obviously cannot meet the needs of students. Visualization gives students a more intuitive understanding of the subjects they are weak in or interested in. It not only provides a reference for where students should place their emphasis next, but also provides feedback on the teaching strategies and methods of teachers and schools, and so plays an extremely important role. Traditional analysis of grades wastes time and its results are not clear enough. For example, when the author was in high school, he relied on hand-drawn line charts of the grades of each subject, which could only reflect the trend of a single subject and also wasted students’ time. Moreover, some teachers use the PivotTable function of Excel and the statistics functions of SPSS to integrate and analyze data and generate visual charts.
The limited efficiency of these two methods in the face of a large amount of data is also evident. Technical means such as Python can be used to visualize the data, so that students can see their own shortcomings intuitively. Through review and consolidation, students can better grasp the content, so that teachers and schools can quickly
grasp students’ understanding of knowledge, adjust the teaching plan in time and make relevant teaching plans, and so that parents can understand the learning effect of students more clearly. In this way, the basic situation of the students can be grasped more comprehensively.
3 Overview of Tools
3.1 Python
Python is a widely used object-oriented, interpreted, cross-platform programming language. It is open source, free, and powerful, and provides a rich standard library as well as a large number of third-party libraries. These third-party libraries cover artificial intelligence, data analysis and processing, machine learning, Web application development and other fields. With the development of these fields in recent years, Python has become more and more popular. It has the advantages of being clear and simple, efficient to develop in, portable, extensible and embeddable. Because it is open source and free, developed under a community-based model, and runs on multiple platforms, the language has gained the support and promotion of many users and has become a new choice for developing data processing systems. Examples of visualization libraries include the 2D and 3D visualization libraries Matplotlib, Seaborn and Pandas; the map visualization libraries Folium, Basemap, MapBox, GeoPlotlib and Pyecharts Map; NetworkX for visualizing social networks; and, for word clouds, Wordcloud and the WordCloud chart from Pyecharts [7].
3.2 Matplotlib
Matplotlib is an open source data visualization toolkit and the most popular plotting library for Python developers; it is comprehensive and versatile. Matplotlib is Python’s 2D plotting library. It generates publication-quality graphics in a variety of hardcopy formats and cross-platform interactive environments, mainly for generating plots and other visualizations of two-dimensional data.
When using this module, only a few lines of code are needed to generate the required curve chart, histogram, scatter chart, line chart, pie chart, bar chart, box chart, spectrum chart or radar chart. It can draw not only 2D graphics but also 3D graphics and animations.
3.3 NumPy
As the most important basic package for numerical computation in Python, NumPy is an open source numerical computation extension of Python and a basic module for data analysis and high-performance scientific computing. It can be used to store and process large matrices. It not only supports multi-dimensional arrays and matrix operations, but also provides many advanced numerical programming tools, such as matrix data types, vector processing and a precise computation library. It is designed for rigorous numerical processing.
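As a minimal sketch of the array operations described above, the example below stores a small grade table as a two-dimensional NumPy array and computes per-subject averages and per-student totals. The values are the first four Chinese and English scores from the sample data in Sect. 5.2; the variable names are illustrative assumptions and are not taken from the paper’s code.

```python
import numpy as np

# Rows are students, columns are subjects (Chinese, English).
# Values come from the sample data in Sect. 5.2; names are illustrative.
scores = np.array([
    [91, 90],
    [90, 95],
    [100, 81],
    [79, 95],
])

subject_mean = scores.mean(axis=0)    # average of each column (one per subject)
student_total = scores.sum(axis=1)    # sum of each row (one per student)

print(subject_mean)
print(student_total)
```

Working along an axis in this way avoids explicit loops, which is the main reason an array library is convenient once a class has many students and subjects.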
4 Data Visualization Analysis
4.1 Concept
Visualization is a theory, method and technology that uses computer graphics and image processing to convert data into graphics or images, display them on the screen, and process them interactively. With the development of digital multimedia technology, visualization technology has been widely used in engineering, medicine and other fields. This technology converts data and other relevant information into graphic images for display, which helps users understand the interrelationships and development trends within a large amount of abstract data, and improves users’ ability to observe things [8]. Visualization tools are widely used in the visualization of graphic images [9], human-computer interaction [10], scientific calculation visualization [11] and other fields. Visualization involves many fields such as computer graphics, image processing and computer vision. For example, line charts, pie charts and bar charts in Excel display data visually to help people better understand the meaning of the data and decide on the next step. This is visualization. The chart below, for example, is the simplest data analysis (Fig. 1).
Fig. 1. Simple line graph
Through data visualization technology, text is converted into graphs. The line chart makes the sales trend over one year intuitively visible. The chart shows that the sales volume fluctuates at first, then rises to its highest value, and finally declines steadily.
4.2 Principle
At present, we are in an era of data and information explosion. No matter when and where we are, we inevitably receive news and feedback, actively or passively. The human eye has powerful pattern recognition capabilities, and more than half of human brain function can be used to process and feed back visual information [12]. Compared with boring words and numbers, the human brain recognizes elements such as graphics, colors and sizes more intuitively and more specifically, and can discover the information contained in the data at first sight from data visualization graphics [13]. Data visualization refers to the presentation of a large amount of relevant
data in the form of images and charts, such as word cloud maps, radar maps and percentage pie charts, which integrate the data and display it as graphic images [14]. These charts convey and communicate data clearly and effectively, and allow data to be analyzed from multiple dimensions to draw deeper conclusions. With graphical means, key contents and features can be conveyed intuitively [15]. A word cloud map consists of frequently appearing words arranged as a cloud-like colored graphic, and is used to display a large amount of text data. For example, in a topic discussion in a general course, a word cloud is generated automatically when students post their discussions. Words that are repeated many times appear in the center in dark colors and a large size [16]. This intuitive visual effect enables teachers to understand the learning effect of students by observing the word cloud, as shown in the picture below (Fig. 2):
Fig. 2. Topic discussion word cloud map
The implementation code is:
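def create_ciyun(path):
    # Input: path of the text file that holds the discussion posts
    with open(path, encoding="utf-8") as f:    # open text
        text = f.read()                        # read text
    # Split the text into a list of words (assumes whitespace-separated words)
    words = text.split()
    final_text = " ".join(words)
    return final_text                          # output: final_text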
Get the data pseudocode
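The word cloud principle described above, in which words that repeat more often are drawn larger, rests on a word-frequency count. The following is a minimal sketch of that counting step only, assuming English text; the function name `word_frequencies` and the sample posts are hypothetical, and Chinese text would first need a word segmentation step that is not shown here.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())   # lower-case alphabetic tokens
    return Counter(words).most_common(top_n)


# Hypothetical discussion posts, for illustration only.
posts = [
    "Python makes data visualization easy",
    "Visualization helps teachers read data",
    "Data analysis with Python",
]
freq = word_frequencies(" ".join(posts), top_n=3)
print(freq)
```

A word cloud renderer would then map each count to a font size, so the most frequent word ends up largest.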
4.3 Steps of Data Visualization Analysis
Any data that can be expressed in charts and images can be analyzed visually. First of all, a large amount of relevant data needs to be acquired. There are many methods of data acquisition, such as questionnaire surveys and interviews, and crawlers and similar methods are generally adopted to acquire open data [16]. Secondly, the data is screened and processed. The data we acquire may be messy, complex and repetitive [17]. Such low-quality data should be discarded to ensure the validity and reliability of the data. The last step is data visualization, which expresses the processed data in the form of graphics and images. At present, people generally use Excel, Python, Matlab and other software to realize data visualization analysis [18].
5 Data Visualization Analysis by Python
First of all, the final scores for each subject of the class are crawled from the teaching administration system. Each student’s personal score is then calculated and aggregated into the class scores, and finally the class scores are analyzed visually.
5.1 Obtain Data
The third-party library requests is used as the web crawler tool to crawl the student achievement information from the educational administration system. Only the website address and the crawling rules need to be given to obtain a large amount of information from the website. Based on the scores of some students in this class, this paper obtains the score data of the basic courses in the final examination of the second semester of the 2022–2023 academic year, and makes a statistical analysis of the students’ course scores. The key indicators mainly include the number of students taking the exam, the passing rate of the paper scores and the passing rate of the overall score [19]. The result information is parsed with BeautifulSoup and saved in a local file, or exported to an Excel workbook for processing [20]. The code is as follows:
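import requests

url = "<address of the educational administration system>"   # input (placeholder)
url_gap = "<score address>"                                    # input (placeholder)
head = {"User-Agent": "Mozilla/5.0"}                           # request header
# Input: the student username and password (the form field names are assumptions)
user = {"username": "<username>", "password": "<password>"}

def get(session, url):
    x = session.get(url, headers=head)
    return x

if __name__ == '__main__':
    session = requests.Session()                     # keeps the cookie between requests
    get(session, url)                                # first visit obtains the cookie
    x = session.post(url, headers=head, data=user)   # log in
    content = x.text
    gap = session.get(url_gap, headers=head)         # request the score page
    print(gap.text)                                  # output: gap.text
Draw bar chart pseudocode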
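The key indicators listed in Sect. 5.1 (the number of students taking the exam and the pass rates) can be computed once the scores are available. The sketch below assumes a pass mark of 60 and represents an absent student as `None`; both conventions, the function name `pass_rate` and the sample scores are assumptions made for illustration and are not details given in the paper.

```python
def pass_rate(scores, pass_mark=60):
    """Return (number of students who took the exam, share who passed)."""
    taken = [s for s in scores if s is not None]   # ignore absent students
    if not taken:
        return 0, 0.0
    passed = sum(1 for s in taken if s >= pass_mark)
    return len(taken), passed / len(taken)


# Hypothetical scores for four enrolled students; None marks an absence.
paper_scores = [55, 80, None, 65]
overall_scores = [62, 84, None, 70]

n_paper, paper_rate = pass_rate(paper_scores)
n_overall, overall_rate = pass_rate(overall_scores)
print(f"{n_paper} students, paper pass rate {paper_rate:.1%}")
print(f"{n_overall} students, overall pass rate {overall_rate:.1%}")
```

Keeping the paper score and the overall score in separate lists lets the same function report both pass rates.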
5.2 Data Processing
The obtained data is checked, and invalid or duplicate data is modified, replaced, or deleted directly [21]. Such programs would also help to reduce error rates by alleviating the burden of manual data processing that hampers the processing of large-scale actigraphy data sets [i]. For example, the data redundancy processing code is as follows.
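import pandas as pd   # assumes pandas for the tabular operations

test = {'id': [1, 2, 3, 4, 5, 6, 6],
        'name': ['Lisa', 'Bob', 'Candy', 'Mike', 'John', 'Gary', 'Gary'],
        'chinese': [91, 90, 100, 79, 98, 94, 94],
        'english': [90, 95, 81, 95, 95, 91, 91]}
df = pd.DataFrame(test)
# View redundant rows
print(df[df.duplicated()])
# Delete redundant rows
df = df.drop_duplicates()
print(df.duplicated().any())   # False once the redundant row has been removed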
Data processing pseudocode
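The paper states that student names are hidden during data processing to protect privacy, but does not show how. One possible approach, sketched below under that assumption, replaces each name with a sequential anonymous ID; the function name `anonymize`, the `S001`-style ID format and the record layout are all hypothetical.

```python
def anonymize(records):
    """Replace the 'name' field of each record with a stable anonymous ID."""
    ids = {}          # maps each real name to its anonymous ID
    masked = []
    for rec in records:
        name = rec["name"]
        if name not in ids:
            ids[name] = f"S{len(ids) + 1:03d}"     # S001, S002, ...
        row = {k: v for k, v in rec.items() if k != "name"}
        row["student"] = ids[name]
        masked.append(row)
    return masked


# A few rows in the shape of the sample data above.
records = [
    {"name": "Lisa", "chinese": 91, "english": 90},
    {"name": "Bob", "chinese": 90, "english": 95},
    {"name": "Gary", "chinese": 94, "english": 91},
    {"name": "Gary", "chinese": 94, "english": 91},
]
masked = anonymize(records)
print(masked)
```

The same name always maps to the same ID, so duplicate rows can still be detected after masking, while the name-to-ID mapping stays inside the function and is not written to the output.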
5.3 Statistical Processing
The results of each subject are counted to obtain the total results of each individual and of the class. A weighted average function can be used. The average score, the variance, the highest score and the number of students who failed are calculated for each subject of the class from the student achievement data [22].
5.4 Visualization
The Matplotlib library is a third-party library for the Python language [23]. The most basic charts in daily work include bar charts, line charts, pie charts, etc., and the Matplotlib module provides corresponding drawing functions for these charts [24]. The data used to draw the chart can be written directly in the code. The bar chart is taken as an example to draw the performance analysis chart below [25]. The code is as follows:
import matplotlib.pyplot as plt
x = ['Lisa', 'Liming', 'Wanghua', 'Amy', 'Tom', 'Anna', 'Michelle', 'Mike']
y = [60, 80, 97, 65, 82, 67, 80, 90]
plt.bar(x, y)
plt.show()
Visualization pseudocode
Lines 2 and 3 give the values for the x and y axes of the chart, line 4 uses the bar() function to draw the bar graph, and line 5 uses the show() function to show the graph (Fig. 3).
Visualization Techniques for Analyzing Learning Effects
Fig. 3. Bar chart of student achievement
After running the code, the distribution of students' grades can be seen intuitively. In practice, a variety of grade charts can be drawn by modifying a few parameters. Teachers can also create charts for different situations according to the specific needs of the analysis, such as the change in an individual subject or in the total score (line chart), or the share of each grade band in the total number of students in the class (pie chart) [26] (Figs. 4 and 5).
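The statistics of Sect. 5.3 and the grade-band shares behind a pie chart could be computed as in the following sketch, which uses only the Python standard library. The scores are those of the bar-chart example; the pass mark of 60 and the 10-point bands are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pvariance

# Scores from the bar-chart example; pass mark and band edges are assumptions.
scores = [60, 80, 97, 65, 82, 67, 80, 90]
PASS_MARK = 60

def band(score):
    """Map a score to a grade band label (hypothetical 10-point bands)."""
    if score >= 90:
        return '90-100'
    if score >= 80:
        return '80-89'
    if score >= 70:
        return '70-79'
    if score >= PASS_MARK:
        return '60-69'
    return 'below 60'

summary = {
    'mean': mean(scores),
    'variance': pvariance(scores),  # population variance of the class
    'highest': max(scores),
    'failed': sum(s < PASS_MARK for s in scores),
}
counts = Counter(band(s) for s in scores)
shares = {label: 100 * n / len(scores) for label, n in counts.items()}  # percent per band

print(summary)
print(shares)
```

The values and keys of shares can then be handed to a pie-chart function as the wedge sizes and labels.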
Fig. 4. Line chart of student achievement
It can be seen that Python visualization technology can process large amounts of data and has a high reuse rate [27]. When the original data changes, the charts can be updated automatically by rerunning the code, which is convenient for those who use and inspect the statistics. Visualization technology based on big data mines existing mass data and extracts valuable information from it. Applying it to class performance analysis and university teaching management has become a popular practice and can play a positive role in improving the teaching effect [28].
Fig. 5. Pie chart of student achievement
6 Conclusion
With the advent of the era of big data, the development of cloud computing and the continuous progress of big data storage technology, data analysis and visualization are becoming more and more important [29]. This paper emphasizes their application in the field of education. Using part of the final examination results of the author's class, it shows that visual analyses of student achievement data can be generated flexibly with Python. The charts can present the data as a whole, show time series, or reveal distribution patterns. They are also convenient to produce, since one piece of code can simply call another, and they help students, schools and parents plan their next steps and make targeted adjustments to learning and teaching. Achievement analysis will therefore continue to be an indispensable part of education [30]. We also hope that researchers and scholars will continue to study performance analysis in depth and apply it to more aspects of education. Education informatization has become mainstream: many schools have their own campus systems and performance analysis systems, and use information management to make school management more scientific and effective. As a result, data analysis and visualization are becoming mainstream as well. In the future, data visualization is expected to become more popular, since it can reduce labor costs and improve the efficiency of analysis, making analysis simpler and more convenient for students. However, Python also has some shortcomings. For example, its relatively simple syntax is reported to make it harder to combine with other languages [31], and its object-oriented basis is said to make some complex models difficult to support. These problems remain to be addressed, so Python is a technology whose range of application can continue to expand.
Funding. This paper is supported by the Fund for Philosophy and Social Sciences of Universities in Jiangsu Province, China, “Research on Integrated Education Based on the Decentralization of Blockchain Technology” (2019SJA0543).
References
1. Fabry, D.L., Higgs, J.R.: Barriers to the effective use of technology in education: current status. J. Educ. Comput. Res. 17(4), 385–395 (1997) 2. Seidel, T., Shavelson, R.J.: Teaching effectiveness research in the past decade: the role of theory and research design in disentangling meta-analysis results. Rev. Educ. Res. 77(4), 454–499 (2007) 3. Few, S.: Eenie, Meenie, Minie, Moe: selecting the right graph for your message (2004) 4. Bisong, E.: Matplotlib and seaborn. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, pp. 151–165 (2019) 5. Bressert, E.: SciPy and NumPy: an overview for developers (2012) 6. Kelly, S.: What Is Python? Python, PyGame and Raspberry Pi Game Development, pp. 3–5 (2016) 7. Diehl, S.: Software Visualization: Visualizing the Structure, Behaviour, and Evolution of Software, pp. 11–18. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-46505-8 8. Telea, A.C.: Data Visualization: Principles and Practice, pp. 91–102. CRC Press, Boca Raton (2007) 9. Shahin, M., Liang, P., Babar, M.A.: A systematic review of software architecture visualization techniques. J. Syst. Softw. 94(5), 161–185 (2014). https://doi.org/10.1016/j.jss.2014.03.071 10. Drevelle, V., Nicola, J.: VIBes: a visualizer for intervals and boxes. Math. Comput. Sci. 8(3–4), 563–572 (2014). https://doi.org/10.1007/s11786-014-0202-0 11. Allen, F., Gale, D.: Limited market participation and volatility of asset prices. J. Am. Econ. Rev. 984 (1994) 12. Cao, S., Zeng, Y., Yang, S., et al.: Research on Python data visualization technology. J. Phys. Conf. Ser. 1757(1), 012122 (2021). IOP Publishing 13. Hammad, G., Reyt, M., Beliy, N., et al.: PyActigraphy: open-source python package for actigraphy data visualization and analysis. PLoS Comput. Biol. 17(10), e1009514 (2021) 14. Dennis, D.R., Meredith, J.R.: An analysis of process industry production and inventory management systems. J. Oper. Manag. (2000) 15. Sambasivam, S., Theodosopoulos, N.: Advanced data clustering methods of mining web documents. Issues Inf. Sci. Inf. Technol. (2006) 16. Freitas, C.M.D.S., et al.: On evaluating information visualization techniques. In: Proceedings of the Working Conference on Advanced Visual Interfaces (2002) 17. Wehrend, S., Lewis, C.: A problem-oriented classification of visualization techniques. In: Proceedings of the First IEEE Conference on Visualization: Visualization90. IEEE (1990) 18. Chi, Ed.H.: A taxonomy of visualization techniques using the data state reference model. In: IEEE Symposium on Information Visualization 2000. INFOVIS 2000. Proceedings. IEEE (2000) 19. Klerkx, J., Verbert, K., Duval, E.: Enhancing learning with visualization techniques. In: Spector, J., Merrill, M., Elen, J., Bishop, M. (eds.) Handbook of Research on Educational Communications and Technology, pp. 791–807. Springer, New York (2014). https://doi.org/10.1007/978-1-4614-3185-5_64
20. Cao, N., Cui, W.: Overview of text visualization techniques. In: Cao, N., Cui, W. (eds.) Introduction to Text Visualization. Atlantis Briefs in Artificial Intelligence, vol. 1, pp. 11–40. Atlantis Press, Paris (2016). https://doi.org/10.2991/978-94-6239-186-4_2 21. Keim, D.A., Kriegel, H.-P.: Visualization techniques for mining large databases: a comparison. IEEE Trans. Knowl. Data Eng. 8(6), 923–938 (1996) 22. Kamat, V.R., et al.: Research in visualization techniques for field construction. J. Constr. Eng. Manag. 137(10), 853–862 (2011) 23. Al-Kodmany, K.: Using visualization techniques for enhancing public participation in planning and design: process, implementation, and evaluation. Landsc. Urban Plann. 45(1), 37–45 (1999) 24. Kucher, K., Kerren, A.: Text visualization techniques: taxonomy, visual survey, and community insights. In: 2015 IEEE Pacific visualization symposium (pacificVis). IEEE (2015) 25. White, S., Feiner, S.: SiteLens: situated visualization techniques for urban site visits. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2009) 26. Tatu, A., et al.: Combining automated analysis and visualization techniques for effective exploration of high-dimensional data. In: 2009 IEEE Symposium on Visual Analytics Science and Technology. IEEE (2009) 27. Zammitto, V.: Visualization techniques in video games. Electron. Vis. Arts (EVA 2008), 267–276 (2008) 28. Vallat, R.: Pingouin: statistics in Python. J. Open Sour. Softw. 3(31), 1026 (2018) 29. Sousa da Silva, A.W., Vranken, W.F.: ACPYPE-Antechamber python parser interface. BMC Res. Notes 5, 1–8 (2012) 30. Millman, K.J., Aivazis, M.: Python for scientists and engineers. Comput. Sci. Eng. 13(2), 9–12 (2011) 31. Ari, N., Ustazhanov, M.: Matplotlib in Python. In: 2014 11th International Conference on Electronics, Computer and Computation (ICECCO). IEEE (2014)
Adversarial Attack on Scene Text Recognition Based on Adversarial Networks
Yanju Liu1, Xinhai Yi2(B), Yange Li2, Bing Wang1, Huiyu Zhang2, and Yanzhong Liu2
1 Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected]
2 Qiqihar University, Qiqihar 161000, China
[email protected]
Abstract. Deep learning has further improved the performance of scene text recognition, but recognition still faces many problems, such as complex lighting and blurring. Deep learning models have been shown to be vulnerable to subtle noise, and these problems are likely to act as adversarial samples that cause a text recognition model to make errors. An effective countermeasure is to add adversarial samples to the training set, so studying adversarial attacks is very meaningful. Current attack models mostly rely on manually designed parameters and require repeated gradient calculations on the original samples when generating adversarial samples. Most of them target non-sequential tasks such as classification, and few target sequential tasks such as scene text recognition. This paper reduces the time complexity of generating an adversarial sample to O(1) by using an adversarial network to mount a semi-white-box attack on the scene text recognition model, and proposes a new objective function for sequence models. The attack success rates of the adversarial samples on the IC03 and IC13 datasets were 85.28% and 86.98% respectively, while ensuring a structural similarity of over 90% between the original samples and the adversarial samples.
Keywords: Deep learning · Text recognition · AdvGAN · Adversarial examples · Natural scene
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 53–63, 2024. https://doi.org/10.1007/978-3-031-50580-5_5
1 Introduction
Writing has long been a tool for human beings to record information and pass it on across civilizations. As time goes by, a large number of texts need to be stored digitally. Optical Character Recognition (OCR) [1] technology meets this need by translating textual information in pictures into computer text. Scene Text Recognition (STR), a subproblem of OCR, has become a hot research problem; its goal is to recognize the text in natural scene pictures and convert it into string form [2]. Deep learning networks have recently taken center stage in the field
of scene text recognition, and as deep learning has advanced, scene text recognition has become more accurate and better able to handle complicated images. However, Szegedy et al. [3] found that in image classification tasks, adding small-magnitude perturbations to the input samples may lead deep learning models to classify them incorrectly. Scene text is more complex than ordinary text images, so while trying to improve recognition accuracy, it is important to investigate whether subtle interference can cause scene text recognition models to make errors [4]. If the generalization of recognition models is not considered, security risks may arise in practical applications. At present, the decoders used in scene text recognition mainly rely on two techniques: the Connectionist Temporal Classification (CTC) [5] mechanism and the attention mechanism [6]. The attention mechanism has been the main focus of research in recent years, so studying methods that attack attention mechanisms allows attack models to generalize better [7]. Most current attack methods are based on optimization equations and simple matrix calculations in pixel space. These methods usually rely on manually designed parameters, which makes the generated interference inflexible, and they are time-consuming because adversarial examples are generated through multiple gradient calculations. Xiao et al. [8] used a Generative Adversarial Network (GAN) [9] to generate adversarial examples (AdvGAN). However, AdvGAN has limited ability to attack scene text. In this study, we restructure the network and create a new objective function to optimize the generator. Once the discriminator and the generator are trained, adversarial examples can be generated quickly with a single query. The research in this article mainly covers the following two aspects.
(1) A generative adversarial network is designed to generate adversarial examples for scene text recognition. The generated adversarial examples are closer to the original samples and have a higher attack success rate than those generated by methods with manually designed parameters.
(2) The scene text recognition model based on an attention decoder is used as the target model, and both the attack ability and the quality of the adversarial samples are optimized as objectives. This makes the loss function of the adversarial network easier to optimize and accelerates the convergence of the generative network.
2 Related Work
2.1 Scene Text Recognition Method
Shi et al. [10] fed the learned feature maps into a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) network [11] and connected a CTC decoder to the end of the Bi-LSTM network to achieve text recognition. Jaderberg et al. [12] first rectify irregular text to the horizontal direction and then perform routine recognition. Shi et al. [13] further extended [12] by using a bidirectional decoder. The "attention drift" problem was identified in [14], which corrects the center of attention with a focusing attention network. Also for this problem, a multi-directional non-local self-attention module
is proposed [15]. Litman et al. [16] combined the CTC and attention mechanisms and designed a cascaded selective attention decoder trained with the aid of CTC.
2.2 Attack Method for Generating Adversarial Examples
The attack problem can be defined as making the model $f$ produce a recognition error when interference $\delta$ is added to the original image, i.e., $f(x + \delta) \neq f(x)$. Building on the findings of Szegedy et al. [3], the Fast Gradient Sign Method (FGSM) takes a single step in the direction that increases the loss fastest under an infinity-norm bound on $\delta$, and obtains the adversarial sample by
\[ x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathrm{Loss}(f(x), l)) \tag{1} \]
where $\epsilon$ is the perturbation magnitude and $l$ is the true label. Kurakin et al. [17] improve the FGSM algorithm by shortening the step size (the Basic Iterative Method, BIM). Smaller steps and more iterations give a better approximation, which makes the attack more aggressive than FGSM. It generates the adversarial sample with the update
\[ x_t = x_{t-1} + \alpha \cdot \mathrm{sign}(\nabla_x \mathrm{Loss}(f(x_{t-1}), l)) \tag{2} \]
Madry et al. [18] proposed Projected Gradient Descent (PGD), which also generates perturbations over multiple iterations. Since the target model is mostly nonlinear, the perturbation direction of the loss is unclear after only one iteration and it is difficult to attack successfully in one calculation. PGD therefore also uses a smaller step size and multiple iterations. In each iteration, the generated perturbation is kept within the specified range, and finally an adversarial sample that can attack successfully is generated. Its update for each iteration is
\[ x_t = \Pi_{x,\epsilon}\big(x_{t-1} + \alpha \cdot \mathrm{sign}(\nabla_x \mathrm{Loss}(f(x_{t-1}), l))\big) \tag{3} \]
where $\Pi_{x,\epsilon}$ projects its argument back into the specified range around $x$.
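The three update rules of Eqs. (1)–(3) can be compared on a toy problem. The sketch below is not the attack on a text recognizer: it assumes a hypothetical three-feature logistic model whose input gradient is available in closed form, and all parameter values are made up for illustration.

```python
import numpy as np

def loss_grad(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted probability of class 1
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """Single-step attack, Eq. (1)."""
    return x + eps * np.sign(loss_grad(x, y, w, b))

def iterative_attack(x, y, w, b, eps, alpha, steps, project=True):
    """Multi-step attack, Eq. (2); with project=True each iterate is clipped
    back into the eps-range around x, as in Eq. (3)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad(x_adv, y, w, b))
        if project:
            x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Hypothetical model and sample (true label y = 1)
w, b = np.array([1.0, -2.0, 0.5]), 0.1
x, y = np.array([0.2, -0.4, 0.3]), 1.0

x_fgsm = fgsm(x, y, w, b, eps=0.5)
x_pgd = iterative_attack(x, y, w, b, eps=0.5, alpha=0.05, steps=20)
print(w @ x + b, w @ x_fgsm + b, w @ x_pgd + b)  # logit before and after each attack
```

For this linear toy model the gradient sign never changes, so the iterative attack ends at the same point as FGSM. The benefit of small steps described in the text only appears for nonlinear models, where the gradient direction is re-evaluated along the way.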
Goodfellow et al. [9] first proposed generating samples with two networks that are trained against each other. Isola et al. [19] further improved the quality of synthetic images. AdvGAN uses the idea of GAN to generate adversarial examples. However, the backbone networks of its generator and discriminator cannot learn the multi-scale features of scene text well. In this study, we modify the network structure of the attack network, design a loss function more suitable for scene text to optimize the generator, and integrate the scene text recognition model into the generative adversarial network so that it can be attacked with adversarial examples of scene text images.
3 AdvGAN-Based Scene Text Recognition Attack Method
3.1 Structure of the Attack Model
Figure 1 depicts the general structure of the attack model, which is made up of three primary components: a discriminator, a generator, and a target network (the scene text recognition model). The target network is a scene text recognition model that has already been trained. During training, the generator constantly perturbs the real samples, and the discriminator separates the perturbed instances from the real data. This pushes the generator's samples closer and closer to the genuine samples until the discriminator is unable to tell them apart. The end result is an adversarial example that can deceive the target network while remaining close to the genuine sample.
Fig. 1. The architecture of Recognition model and GAN
3.2 Structure of the Generative Adversarial Network
In the network design of the generative adversarial network, the stride and kernel size of the 3rd and 4th pooling layers of the discriminator network are set to (1, 2) instead of the traditional stride of 2, so that the width of the feature map is reduced to 1 more quickly and a binary classification result can finally be obtained. The stride of the pooling layers in the middle of the generator network is also adjusted and set to (2, 1), so that a longer feature sequence can be obtained. Since the width of a scene text image is usually much larger than its height, a longer feature sequence reduces the loss of features during the convolution calculation. The structures of the discriminator and the generator are shown in Fig. 2 and Fig. 3, respectively.
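The effect of a non-square pooling stride on the length of the feature sequence can be checked with simple shape arithmetic. The sketch below assumes unpadded pooling, a (height, width) ordering of the tuples, and a hypothetical 32 × 100 input; none of these values are stated in the paper.

```python
def pooled_size(size, kernel, stride):
    """Output length of an unpadded pooling layer: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

def apply_pools(height, width, pools):
    """Apply a list of ((kh, kw), (sh, sw)) pooling layers to a feature-map size."""
    for (kh, kw), (sh, sw) in pools:
        height = pooled_size(height, kh, sh)
        width = pooled_size(width, kw, sw)
    return height, width

# Four conventional 2 x 2 poolings versus two of them followed by two (2, 1) poolings
square = [((2, 2), (2, 2))] * 4
mixed = [((2, 2), (2, 2))] * 2 + [((2, 1), (2, 1))] * 2  # halve the height only

print(apply_pools(32, 100, square))  # feature-map size with square pooling
print(apply_pools(32, 100, mixed))   # same height, but a longer width (sequence) axis
```

Under these assumptions both stacks reduce the height by the same factor, while the (2, 1) layers leave the width untouched, which is what yields the longer feature sequence.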
Fig. 2. The architecture of discriminator
3.3 Scene Text Recognition Model Based on Attention Mechanism
Attentional Scene Text Recognizer (ASTER) is a classical model in the study of scene text recognition techniques, which uses a bidirectional decoding mechanism based on
Fig. 3. The architecture of Generator
Attention to train a left-to-right decoder and a right-to-left decoder. The two decoders output recognition results from two directions, and the one with higher confidence is selected as the final recognition result. The Attention decoder decodes the feature $H$ from the encoder directly into the target sequence $\{l_1, \ldots, l_n\}$, and the maximum number of time steps set by the decoder is $T$. Decoding stops when the terminator "EOS" is encountered, and the output at the $t$-th time step is
\[ p(y_t) = \mathrm{softmax}(W_{out} s_t + b_{out}), \quad y_t \sim p(y_t) \tag{4} \]
where $W_{out}$ and $b_{out}$ are learnable parameters. $\mathrm{softmax}(\cdot)$ guarantees that each entry of $p(y_t)$ lies in $[0, 1]$ and that the entries sum to 1. $s_t$ is the hidden state at time step $t$ and is calculated as follows.
\[ s_t = \mathrm{rnn}(s_{t-1}, (g_t, f(l_{t-1}))) \tag{5} \]
Instead of decoding a result from one particular feature, the Attention mechanism first calculates an attention weight vector $\alpha$, forms the weighted sum of all features according to these weights, and obtains a feature $g_t$ that carries context. $g_t$ and $\alpha$ are given by
\[ g_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \tag{6} \]
\[ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{i'=1}^{n} \exp(e_{t,i'})}, \quad e_{t,i} = \omega^{T} \tanh(W s_{t-1} + V h_i + b) \tag{7} \]
where $\omega$, $W$, $V$ and $b$ are all learnable parameters and $s_{t-1}$ is the hidden state of the previous time step.
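One decoding step of Eqs. (6) and (7) can be written out as a short NumPy sketch. The dimensions and the randomly drawn parameters are hypothetical placeholders, so the code only illustrates the shape of the computation, not the trained ASTER decoder.

```python
import numpy as np

def attention_step(s_prev, H, W, V, b, omega):
    """One decoding step: scores e_{t,i}, weights alpha_{t,i} (Eq. 7) and context g_t (Eq. 6)."""
    e = np.tanh(s_prev @ W.T + H @ V.T + b) @ omega  # one score per encoder feature, shape (n,)
    e = e - e.max()                                  # shift for a numerically stable softmax
    alpha = np.exp(e) / np.exp(e).sum()
    g = alpha @ H                                    # weighted sum of the encoder features
    return alpha, g

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 4, 3, 6   # hypothetical sizes: features, feature dim, state dim, attention dim
H = rng.normal(size=(n, d_h))   # encoder features h_1..h_n
s_prev = rng.normal(size=d_s)   # previous hidden state s_{t-1}
W = rng.normal(size=(d_a, d_s))
V = rng.normal(size=(d_a, d_h))
b = rng.normal(size=d_a)
omega = rng.normal(size=d_a)

alpha, g = attention_step(s_prev, H, W, V, b, omega)
print(alpha, g)
```

Because the weights are non-negative and sum to 1, the context vector is a convex combination of the encoder features, which is why it can be said to carry information from the whole sequence.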
3.4 Loss Function
The process of constructing an adversarial example is a continuous optimization process. After receiving an input image $x$ whose true label sequence is $l = \{l_1, \ldots, l_n\}$, the generator $G$ generates a fine perturbation $G(x)$, and the adversarial example $x' = x + G(x)$ is formed. In a non-targeted attack, the adversarial example $x'$ is fed into the scene text recognition model $f$ to obtain a wrong output sequence $l' = \{l'_1, \ldots, l'_n\}$ with $l' \neq l$. This problem can be expressed as follows.
\[ \min_{\delta} D(x, x') \quad \text{s.t.} \quad f(x) = l, \; f(x') = l', \; x' \in [-1, 1] \tag{8} \]
where $D(\cdot, \cdot)$ denotes the distance between the original image and the adversarial example. Formula (8) targets the quality and the attack ability of adversarial samples simultaneously. However, because the constraint $f(x') = l'$ is highly nonlinear, it is not easy to optimize in practice, so a function $g(\cdot)$ is defined such that $f(x') = l'$ holds if and only if $g(x') \le 0$ [20]. $g(x')$ is expressed as
\[ g(x') = \log\big(1 + \exp\big(Z(x')_{l} - \max_{l' \neq l} Z(x')_{l'}\big)\big) - \log(2) \tag{9} \]
where $Z(x')$ is the output of the recognition model before the softmax operation, with $Z(x) = Wx + b$ (see Eq. (4)). In this way the objective function becomes more linear and easier to optimize. The loss function for generating the adversarial example is thus
\[ \min D(x, x') + c \cdot g(x') \quad \text{s.t.} \quad x' \in [-1, 1] \tag{10} \]
where $c > 0$ is a constant that weighs the relative importance of the two terms in the objective function. Taking $D(\cdot, \cdot)$ as the $\ell_2$ norm, the final loss for attacking the target model $f$ is
\[ L_{adv} = \mathbb{E}_x \lVert x - x' \rVert_2 + c \cdot \mathbb{E}_x \sum_{i=1}^{T} \Big( \log\big(1 + \exp\big(Z(x')_{l,i} - \max_{l' \neq l} Z(x')_{l',i}\big)\big) - \log(2) \Big) \tag{11} \]
In this paper, generative adversarial networks are used to generate adversarial examples, and the adversarial loss proposed by GAN is defined as follows.
\[ L_{GAN} = \mathbb{E}_x \log(D(x)) + \mathbb{E}_x \log(1 - D(x + G(x))) \tag{12} \]
where $D(x)$ is the probability that the discriminator $D$ judges $x$ to be a real picture (since $x$ itself is real, the closer $D(x)$ is to 1, the better for the discriminator), and $D(x + G(x))$ is the probability that the discriminator judges the picture generated by the generator to be real; the closer this value is to 1, the better for the generator. The two networks play against each other, and finally the generator generates the sample closest to the real picture. When attacking the target model, the generator may generate more obvious perturbations in order to attack successfully. In this paper, a soft hinge loss [21] on the $\ell_2$ norm of the perturbation is used to limit the perturbation range.
\[ L_{hinge} = \mathbb{E}_x \max(0, \lVert G(x) \rVert_2 - c) \tag{13} \]
where $c$ is a manually set bound that stabilizes the training of the GAN. As a result, the objective function of the entire generative adversarial network is
\[ L = L_{GAN} + \alpha L_{adv} + \beta L_{hinge} \tag{14} \]
where $\alpha = 1$ and $\beta = 0.2$ are the weights of the sub-objectives. Finally, the discriminator and generator are continuously optimized by solving $\arg\min_G \max_D L$.
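The individual terms of this section can be made concrete with a small NumPy sketch. It assumes that the logits are arranged as a (T, C) array with one row per decoding step, that the hinge is taken on the l2 norm of the perturbation, and that the function names are hypothetical; the GAN term is passed in as a number rather than computed.

```python
import numpy as np

def g_margin(logits, labels):
    """Margin term summed over T decoding steps, as in Eq. (11):
    log(1 + exp(Z_l - max_{l' != l} Z_{l'})) - log 2.
    logits: float array (T, C); labels: int array (T,) of true class indices."""
    T = logits.shape[0]
    true = logits[np.arange(T), labels]
    others = logits.copy()
    others[np.arange(T), labels] = -np.inf  # mask out the true class
    margin = true - others.max(axis=1)
    return np.sum(np.logaddexp(0.0, margin) - np.log(2.0))

def hinge(perturbation, c):
    """Soft hinge on the l2 norm of the perturbation, Eq. (13), for one sample."""
    return max(0.0, float(np.linalg.norm(perturbation)) - c)

def generator_objective(l_gan, l_adv, l_hinge, alpha=1.0, beta=0.2):
    """Weighted sum of Eq. (14) with the weights given in the text."""
    return l_gan + alpha * l_adv + beta * l_hinge

labels = np.array([0, 1])
correct_logits = np.array([[2.0, 0.5, -1.0], [0.0, 3.0, 1.0]])  # true class wins at each step
fooled_logits = np.array([[0.5, 2.0, -1.0], [3.0, 0.0, 1.0]])   # another class wins at each step
print(g_margin(correct_logits, labels), g_margin(fooled_logits, labels))
print(hinge(np.array([3.0, 4.0]), c=1.0))
```

The margin term is positive while the true character still has the largest logit and becomes negative once another character overtakes it, so minimizing it pushes the recognizer toward a wrong output while the hinge term penalizes perturbations larger than the bound c.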
4 Experiments and Analysis
4.1 Datasets
In this paper, the adversarial samples are generated on the ICDAR2003 (IC03) [22] and ICDAR2013 (IC13) [23] datasets, whose images contain horizontal text.
4.2 Scene Text Recognition Model
ASTER is the classical attention-based recognition model in scene text recognition. Since the test datasets contain only horizontal text, the rectification network of ASTER is removed in the experiments and the bidirectional decoder module is retained. The experiments are implemented with CUDA 11.1 and PyTorch 1.9.0. The generative adversarial network is trained on an NVIDIA RTX 3050 graphics card with a batch size of 64, and all networks are optimized with the Adam optimizer. The test results of ASTER on the IC03 and IC13 datasets in this experimental environment are shown in Table 1.
Table 1. Recognition accuracy of ASTER on different datasets.
Model    IC03      IC13
ASTER    0.8941    0.8811
4.3 Generation of Adversarial Examples
A high attack success rate is usually accompanied by a higher distortion of the adversarial samples. Since this paper limits the range of the perturbations generated by the generator, and the generator must balance generating more noise against achieving a higher attack success rate, it is difficult for the model to achieve an attack success rate close to 100%. Nevertheless, Table 2 shows that the attack success rate of the adversarial examples generated by the model on dataset IC03 is higher than that of the FGSM, BIM, and PGD methods by 60.36%, 9.85%, and 5.23%, respectively, and on dataset IC13 by 55.56%, 7.11%, and 4.32%, respectively. At the same attack success rate, the adversarial samples generated by the model have a lower distance and higher SSIM values. This demonstrates that, when the attack success rate is guaranteed, the adversarial examples are harder to tell apart from the originals with the naked eye, i.e., they are more realistic. The attack success rate (ASR) on the scene text recognition model is defined as
\[ ASR = \frac{\mathrm{num}(\mathrm{lower}(f(x')) \neq \mathrm{lower}(f(x)))}{\mathrm{num}(x)} \tag{15} \]
where $\mathrm{lower}(\cdot)$ converts the characters of the model recognition result to lowercase. The ASR can be interpreted as the number of images that make the model misrecognize divided by the total number of images. Structural similarity (SSIM) is one of the important indicators for evaluating the similarity of two images. The higher the value of SSIM, the more similar the two images are.
\[ \begin{aligned} l(X, Y) &= \frac{2\mu_X \mu_Y + A_1}{\mu_X^2 + \mu_Y^2 + A_1} \\ c(X, Y) &= \frac{2\sigma_X \sigma_Y + A_2}{\sigma_X^2 + \sigma_Y^2 + A_2} \\ s(X, Y) &= \frac{\sigma_{XY} + A_3}{\sigma_X \sigma_Y + A_3} \\ \mu_X &= \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X(i, j) \\ \sigma_X^2 &= \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} (X(i, j) - \mu_X)^2 \end{aligned} \tag{16} \]
where $\mu_X$, $\sigma_X^2$ and $\sigma_{XY}$ denote the mean, variance and covariance of the images, respectively, with $\mu_Y$ and $\sigma_Y^2$ defined analogously for $Y$. $A_1$, $A_2$ and $A_3$ are constants, generally taken as 1% to 3% of the maximum pixel value, which stabilize the formula and prevent division by zero. $l(\cdot, \cdot)$, $c(\cdot, \cdot)$ and $s(\cdot, \cdot)$ compare the brightness, contrast and structure of the two images, respectively. The final structural similarity of the two images is obtained by multiplying the three metrics together as follows.
\[ SSIM = l(X, Y) \cdot c(X, Y) \cdot s(X, Y) \tag{17} \]
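Both evaluation metrics can be sketched in a few lines. The version below computes SSIM from whole-image statistics, exactly as Eqs. (16) and (17) are written, whereas library implementations usually average the score over local windows. The constants follow a common convention for A1, A2 and A3 and are an assumption, as are the example strings and images.

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds):
    """Share of samples whose lower-cased prediction changes under attack, Eq. (15)."""
    changed = sum(c.lower() != a.lower() for c, a in zip(clean_preds, adv_preds))
    return changed / len(clean_preds)

def ssim_global(X, Y, max_val=1.0):
    """Single-window SSIM from whole-image statistics, Eqs. (16)-(17)."""
    A1 = (0.01 * max_val) ** 2  # assumed constants (a common choice, not from the paper)
    A2 = (0.03 * max_val) ** 2
    A3 = A2 / 2
    mu_x, mu_y = X.mean(), Y.mean()
    var_x, var_y = X.var(), Y.var()
    cov = ((X - mu_x) * (Y - mu_y)).mean()
    sd_x, sd_y = np.sqrt(var_x), np.sqrt(var_y)
    l = (2 * mu_x * mu_y + A1) / (mu_x ** 2 + mu_y ** 2 + A1)  # brightness
    c = (2 * sd_x * sd_y + A2) / (var_x + var_y + A2)          # contrast
    s = (cov + A3) / (sd_x * sd_y + A3)                        # structure
    return l * c * s

# Hypothetical recognition results before and after the attack
clean = ['Hello', 'World', 'text']
adv = ['hello', 'W0rld', 'next']
print(attack_success_rate(clean, adv))

# Hypothetical image in [0, 1] and a noisy copy of it
rng = np.random.default_rng(0)
X = rng.random((8, 16))
Y = np.clip(X + 0.1 * rng.normal(size=X.shape), 0.0, 1.0)
print(ssim_global(X, X), ssim_global(X, Y))
```

A change that only affects letter case does not count as a successful attack, which is why the first pair of strings in the example is not counted.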
The effect of the $L_{hinge}$ term on the attack success rate and distance on the IC03 dataset is shown in Fig. 4. Comparing the two plots shows that training is more stable when $L_{hinge}$ is used, and stronger adversarial examples can be generated with a smaller distance.
Table 2. Comparison of the quality and attack ability of FGSM, BIM, PGD and the model proposed in this paper
                    IC03                                      IC13
Method  Setting     ASR (%)  ℓ2 Dis  SSIM  Iterations        ASR (%)  ℓ2 Dis  SSIM  Iterations
FGSM    ε = 0.5     50.92    22.52   0.32  -                 31.42    17.84   0.51  -
BIM     ε = 0.5     75.43    2.92    0.71  20                79.87    2.67    0.74  20
PGD     ε = 0.5     80.05    1.64    0.80  20                82.66    1.75    0.82  20
Ours    -           85.28    0.38    0.89  -                 86.98    0.35    0.92  -
Fig. 4. (a) Attack success rate and distance on the IC03 dataset without Lhinge. (b) Attack success rate and distance on the IC03 dataset with Lhinge.
Fig. 5. Comparison of the adversarial examples generated by different attack methods
The adversarial sample instances generated by different attack algorithms are shown in Fig. 5. Due to FGSM only performing one gradient calculation, which is equivalent
to only adding noise to the original sample once, the generated noise is minimal, and to the naked eye there appears to be no significant difference between the original sample and the adversarial sample. The scene text recognition model is relatively complex. If only one iteration is carried out, the perturbation direction for the loss function is not clear, and it is difficult to attack successfully in one calculation. BIM and PGD conduct multiple iterations and repeatedly add noise to the original samples to improve the attack ability of the adversarial samples. It can be seen that as the perturbation coefficient and the number of iterations increase, the generated noise becomes more and more obvious.
5 Summary
In this paper, a generative adversarial network is used to generate adversarial examples, which avoids the limitation of manually set parameters. Once the network has been trained, each sample needs only one calculation to generate an adversarial example, which greatly reduces the generation time. For training, a new objective function for optimizing the generator is proposed; it reduces the number of optimization iterations and makes the generated adversarial samples closer to the real samples and more powerful in attack. In a future study, we will attack the CTC decoder, the other decoder commonly used in scene text recognition, so that the attack model covers more targets. Generating high-quality adversarial samples is only a prerequisite for improving the robustness of the recognition model, and adversarial defense of attention-based scene text recognition models will be investigated in the future to reinforce that robustness.
Funding. This research was funded by Qiqihar University Graduate Innovative Research Project (Grant No. YJSCX2021079), Jiangsu Province College Student Innovation and Entrepreneurship Project (Grant No. 202212048052Y), Jiangsu Higher Education Association Project (Grant No. 2022JDKT133) and Education and Teaching Reform Project of Nanjing Normal University of Special Education (Grant No. 2022XJJG015).
References 1. Radwan, M.A., Khalil, M.I., Abbas, H.M.: Neural networks pipeline for offline machine printed Arabic OCR. Neural Process. Lett. 48(2), 769–787 (2018). http://www.springer.com/ lncs. Accessed 21 Nov 2016 2. Jin, L.W., Zhong, Z.Y., Yang, Z., et al.: Applications of deep learning for handwritten Chinese character recognition: a review. Acta Autom. Sin. 42(8), 1125–1141 (2016) 3. Szegedy, C., Zaremba, W., Sutskever, I., et al.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013) 4. Yuan, X., He, P., Lit, X., et al.: Adaptive adversarial attack on scene text recognition. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 358–363. IEEE (2020) 5. Graves, A., Fernández, S., Gomez, F., et al.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
Collaboration of Intelligent Systems to Improve Information Security

Lili Diao1 and Honglan Xu2(B)

1 Nanjing Normal University of Special Education, Nanjing, China
2 ZTE Corporation, Nanjing, China
[email protected]
Abstract. As computer systems become ever more widespread, industry protects the networks built on them by scanning for malware (malicious software or applications) using generic file properties. This approach is useful but not accurate enough: even a 0.01% shortfall in accuracy can leave millions of malicious programs on the internet free to steal private data or break down computer systems. To address this problem, the paper proposes building independent intelligent systems that each predict the likelihood of malware from a different angle: generic properties, import table properties, opcode properties, and so on. No single intelligent system achieves the highest prediction accuracy on its own; in our experiments, however, the collaboration of independent intelligent systems improved accuracy over the single systems, which is of considerable value for improving computer information security.

Keywords: Information Security · Malware · Intelligent System · Model Collaboration
1 Introduction

Information security is crucial to protecting computer networks. More and more software and applications now help people in daily life. Unfortunately, malicious software, which aims to steal secrets or private data (for example bank account or credit card details) and to break down computer or network systems, is also growing rapidly. Identifying malicious software efficiently has therefore become essential to keeping our digital world running correctly. The traditional approach stores static virus patterns (known blocks of binary code) and compares files against them during scanning. However, this is far from efficient, because the number of such patterns is growing extremely fast. Artificial intelligence, or machine learning, has recently been applied to estimate the probability that a piece of software is malicious from its general properties or features. For instance, researchers collect information such as file size, segment size, text size, and segment offset into a large table built from malicious and benign software samples. A machine learning model trained on this table captures the knowledge needed to discriminate good from bad, and it can then be deployed in a real network environment to predict whether previously unseen software is benign or malicious.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 64–75, 2024. https://doi.org/10.1007/978-3-031-50580-5_6
The problem is that a single machine learning model built on the generic properties of software cannot reach the highest level of accuracy (here meaning a combination of precision and recall). Improving accuracy further remains a hard problem in malware prediction. This paper introduces a mechanism of collaboration among different intelligent systems to further improve the accuracy of predicting malicious software or applications.
2 Related Works

Malware is the well-known name for malicious software or applications, that is, software not intended to do good on computer and/or network systems. In the information security field, multiple research teams use intelligent systems to predict malware, and some combine different kinds of features into a hybrid detection system. The standard way to detect malware is with signature-based methods, which are encountering more and more problems [1]. Many modern malicious applications are designed with multiple polymorphic layers to hide themselves [2]. A set of intelligent systems for detecting malware was introduced in [3]. Boosting algorithms employing n-grams were observed to outperform Bayesian classifiers and support vector machines in [4]. [5] used association rules mined from Windows API execution sequences to distinguish benign from malicious files. Hidden Markov Models, neural networks, and Self-Organizing Maps were used to detect variants of Windows executables in [6, 7] and [8]. [9] proposed a way to cope with ransomware. Active learning was used to obtain a 91.5% accuracy rate on iOS applications in [10]. [11] obtained 91.43% accuracy. In [12], static and dynamic features were used on Android applications. [13] presented a hybrid technique using permission and traffic features, in which the KNN and K-Medoids algorithms together produced 91.98% accuracy. Multiple approaches also exist for detecting malware among Mac OS X applications, as in [14] and [15].

Some of this research obtained good results. However, existing work either relies on a single type of feature to build its intelligent system, or mixes multiple types of features in a naive way. A more specifically designed mechanism is therefore needed to apply malware detection in real network environments.
3 Intelligent Systems

3.1 Windows PE Properties and Intelligent Systems

Windows applications, both malware and goodware, generally use the PE (Portable Executable) file structure. The PE file format is a data structure that tells the Windows OS loader what information is required to manage the wrapped executable code. This includes dynamic library references for linking, API export and import tables, resource management data, and TLS data [16]. The architecture of PE is illustrated in Fig. 1. The PE format starts with an MS-DOS header accompanied by executable code; this header has 64 bytes. The PE header holds information about the entire file. The basic header contains the machine type or architecture, a time stamp, a pointer to symbols, whether the application handles addresses above 2 GB, whether the file needs to be copied to the swap file, and so on [17]. More detailed information is stored in the different sections of the PE file and has to be discovered using domain knowledge.

Fig. 1. An Overview of PE Format

Based on the PE structure and the information it contains, we can design various types of features to describe executables. There are many static properties of PE files, such as file length, file entropy, and file segments, which do not change once the PE file has been formed. We call this type of feature generic features. Another type of feature is the export/import table of the PE file, which lists all linked function calls; we call these IAT features for convenience. Opcode is the machine code in the PE file that is executed when the file runs. Because opcodes are directly tied to execution, we believe they are of great value as a type of feature. The last type of feature is the readable text in all PE sections, which sometimes contains information about what the software is aiming for. We call this type the “String” feature.

The first intelligent system for detecting malware is based on the generic properties. We selected nearly 600 features derived from the properties of PE files to build machine learning models that predict maliciousness. They are the basic information of each software or application, and may (partially) indicate whether it is malicious or benign. We then use various analyses to rank the importance of the features and compress the intelligent system into a much smaller feature set. We call this intelligent system the Generic Model.

The second intelligent system is based on the export/import table of the PE file, that is, all the API calls of the software or application. This set of properties reveals which functions the software or application wants to call. We mapped them into another feature space to build an easy-to-control intelligent system that discriminates good from bad. We call this intelligent system the IAT Model.

The third intelligent system is based on the assembly-level operation codes (opcodes) of the PE file. They are the machine code with which the software manipulates registers, performs jumps and moves, and so on. This set of properties reveals what actions the software or application really wants to perform. The original features are also mapped into another feature space through mapping functions before the intelligent system is built. We call this intelligent system the Opcode Model.

The last independent intelligent system is based on the readable characters within PE files. This set of properties reveals what message the software or application wants to deliver to its direct user. Again for ease of control, we mapped the properties into another feature space to build an intelligent system that identifies malware. We call this intelligent system the String Model.

3.2 Intelligent Algorithms

Given a feature space, we can then choose an appropriate machine learning algorithm as the core of an intelligent system, for instance SVM, deep learning, bagging, or boosting. To obtain accuracy as high as possible, we choose XGBoost, a specific algorithm in the boosting family.
XGBoost (Extreme Gradient Boosting) is a popular machine learning algorithm that has gained significant attention in recent years for its high accuracy and efficiency on a wide range of supervised learning problems, such as classification and regression. XGBoost is an ensemble learning method that combines multiple weak models into a strong model, using decision trees as the base learners. Its underlying principles follow the gradient boosting framework, which iteratively adds new weak models (trees) to the ensemble, each fitted to reduce the loss of the current ensemble. The loss function measures the discrepancy between the predicted values and the actual values, and the goal is to minimize it [18]. The final prediction is obtained by summing the predictions of all the trees. Some key features of the XGBoost algorithm are listed here.

Regularization. XGBoost uses L1 and L2 regularization to prevent overfitting and improve the generalization of the model. By adding penalty terms to the loss function, XGBoost encourages the model to rely only on the most important features.
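For reference, the regularized objective described above is usually written as follows in the XGBoost literature. The notation is not taken from this paper and is introduced here only for illustration: l is a differentiable loss, f_k is the k-th tree, T is its number of leaves, w is its vector of leaf weights, and γ, λ, α are non-negative penalty coefficients.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathcal{L} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

In this form the λ term is the L2 penalty and the α term the L1 penalty on the leaf weights, while γ charges a fixed cost for every leaf a tree adds.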
Tree Pruning. XGBoost prunes unnecessary branches from the decision trees, which helps to reduce overfitting and improve the accuracy of the model. Pruning removes branches that do not contribute significantly to the reduction of the loss function. XGBoost decides greedily which branches to prune, starting from the leaves and working its way up towards the root.

Parallel Processing. XGBoost is designed to be highly scalable and can exploit parallel processing to speed up training. The algorithm can run on a single machine or on a distributed cluster of machines. To reduce computational cost, XGBoost also offers an approximate split-finding algorithm, which evaluates only a set of candidate split points proposed from the quantiles of each feature instead of enumerating every possible split.

Handling Missing Values. XGBoost can handle missing values in the input data by assigning them to the most appropriate branch during tree construction. It uses a technique called sparsity-aware split finding: at each split, the instances whose value for the split feature is missing are sent in a default direction, and the algorithm learns this direction by trying both children and keeping the one that gives the better split.

Feature Importance. XGBoost provides measures of feature importance, which can be used to identify the most important features in the input data. Importance can be computed by counting how many times each feature is used to split in the decision trees, or by accumulating the gain of those splits. The gain of a split is the reduction in the loss function achieved by the split.

XGBoost has been used successfully in a wide range of applications, including image classification, natural language processing, and financial forecasting. In image classification, XGBoost has been used to classify images based on their content, such as identifying objects in a scene.
In natural language processing, XGBoost has been used to classify text based on its content, such as identifying the sentiment of a tweet. In financial forecasting, XGBoost has been used to predict stock prices and other financial variables.

3.3 Working Model of Single Intelligent Systems

As Fig. 2 shows, PE files, after feature generation and mapping, are fed into the different independent intelligent systems, each of which decides whether to block a file as malware. Multiple types of features thus yield multiple independent intelligent systems. Unfortunately, no single independent system achieves the highest accuracy in detecting malware. Each intelligent system has its strengths and weaknesses. Cooperation among the systems offers hope of further improvement, because they can cover one another's weak points. We therefore designed a cooperation mechanism for the independent intelligent systems. Making them collaborate with each other may be a novel way to improve the accuracy of malware detection.
Fig. 2. The Connections of Intelligent Systems and PE files
4 Collaboration of Intelligent Systems

Each single intelligent system has its own ability to detect malware or to let good software pass quickly. Once advanced machine learning algorithms are already in use, it is very difficult to improve accuracy further. Apart from improving each individual intelligent system, collaboration is the remaining route to further improvement. The advantages of model fusion over detection with a single model include (but are not limited to):

1. Small models are easier to optimize, and their best parameters are easier to search for.
2. Logically, different types of features do not affect each other before the final scoring, so each model can concentrate on mining its own specific knowledge.
3. Most importantly, model fusion no longer requires every model to be trained on the same dataset; instead, we can combine models built from various data, various parameters, and various algorithms.
4. …

The concept behind the collaboration design is represented in Fig. 3. Starting from a training dataset, resampling approaches (bootstrapping and so on) are used to build various sample sets. Combining the various sample sets with the various intelligent systems generates multiple classifiers, or predicting models. These are the first-tier classifiers. The first-tier classifiers each make their own (different) predictions on the same training data. These predictions are then used as input features for training the second-tier classifier, the so-called “Meta” classifier. As a second-level classifier, the Meta classifier may bring improvements beyond what the single intelligent systems achieve.
Fig. 3. Concept of Collaboration among Independent Intelligent Systems
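The two-tier concept of Fig. 3 can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the first-tier learners are one-feature threshold rules standing in for the XGBoost models, the resampling is a class-stratified bootstrap, the Meta classifier is a small logistic regression, and all function names and the toy data are hypothetical.

```python
import math
import random

def fit_stump(xs, ys):
    """First-tier learner: threshold t minimising errors of the rule 'x > t => malicious'."""
    vals = sorted(set(xs))
    # Candidate thresholds: below all values, between neighbours, above all values.
    candidates = [vals[0] - 1.0] + [(a + b) / 2 for a, b in zip(vals, vals[1:])] + [vals[-1] + 1.0]
    best_t, best_err = candidates[0], len(xs) + 1
    for t in candidates:
        err = sum((x > t) != (y == 1) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def stratified_bootstrap(y, rng):
    """Resample indices with replacement inside each class, so both classes stay present."""
    idx = []
    for label in (0, 1):
        pool = [i for i, t in enumerate(y) if t == label]
        idx += [rng.choice(pool) for _ in pool]
    return idx

def train_first_tier(X, y, n_views, n_models, rng):
    """Train n_models stumps per feature view, each on its own resampled set."""
    models = []
    for view in range(n_views):
        for _ in range(n_models):
            idx = stratified_bootstrap(y, rng)
            t = fit_stump([X[i][view] for i in idx], [y[i] for i in idx])
            models.append((view, t))
    return models

def first_tier_outputs(models, row):
    """Predictions of all first-tier models for one sample; these are the meta features."""
    return [1.0 if row[view] > t else 0.0 for view, t in models]

def train_meta(Z, y, epochs=500, lr=0.5):
    """Second-tier 'Meta' classifier: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, target in zip(Z, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for j, zj in enumerate(z):
                grad_w[j] += (p - target) * zj
            grad_b += p - target
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return w, b

def meta_predict(w, b, z):
    """Final decision: 1 (malicious) if the meta logit is positive, else 0 (benign)."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0

# Hypothetical toy data: two feature views per sample; label 1 = malicious, 0 = benign.
X = [[0.1, 0.0], [0.2, 0.3], [0.1, 0.3], [0.2, 0.0],
     [0.8, 0.7], [0.9, 1.0], [0.8, 1.0], [0.9, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
models = train_first_tier(X, y, n_views=2, n_models=2, rng=rng)
Z = [first_tier_outputs(models, row) for row in X]  # first-tier predictions on the training data
w, b = train_meta(Z, y)
final = [meta_predict(w, b, z) for z in Z]
```

Following the description above, the Meta classifier here is trained on first-tier predictions made on the same training data. A common variation, not described in the paper, is to use held-out (cross-validated) first-tier predictions so that the second tier does not inherit first-tier overfitting.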
Fig. 4. Architecture of the Collaborated Intelligent Systems to Detect Malware
The architecture of the collaborating intelligent systems is shown in Fig. 4. For input PE files, the system prepares the features and performs the feature mapping operations that each intelligent system needs. During training, each independent intelligent system (Generic, IAT, Opcode, String) generates multiple (1, 2, …, N) machine learning models with the selected machine learning approaches. Each of these models can make its own prediction on whether a PE file is good or bad. The collaboration system collects all the predictions and calls the “Meta” classifier to make the final decision.
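As an illustration of the feature-preparation step for the Generic system, the sketch below derives a handful of generic features from the raw bytes of a file. It is a hedged example rather than the paper's feature extractor: the function names and dictionary keys are hypothetical, only the file length and file entropy features are named in Sect. 3.1, and the header fields read here follow the standard PE layout (a 64-byte MS-DOS header whose field at offset 0x3C points to the `PE\0\0` signature, followed by the COFF file header).

```python
import math
import struct
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def generic_features(data: bytes) -> dict:
    """Extract a few static, generic features from the bytes of a PE file."""
    if len(data) < 64 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MS-DOS header")
    # Offset 0x3C of the MS-DOS header stores the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The COFF header follows the signature: machine type, section count, time stamp.
    machine, num_sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return {
        "file_length": len(data),
        "file_entropy": byte_entropy(data),
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
    }

# Hypothetical minimal buffer: 64-byte DOS header, PE signature, 20-byte COFF header.
dos_header = b"MZ" + bytes(58) + struct.pack("<I", 64)
coff_header = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 0, 0)
sample = dos_header + b"PE\0\0" + coff_header
features = generic_features(sample)
```

A real extractor would go on to parse the optional header, section table, and import table. The synthetic buffer above only exercises the header checks and is not a loadable executable.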
5 Experiments

5.1 Feature Analysis

Our datasets of PE malware and goodware contain 6000+ malware samples, 2 million+ good samples, and 2 million+ pending software applications (not yet known to be good or bad). We partition the labeled data into training and test sets of appropriate sizes to build (train) the intelligent systems. The huge pending dataset is used to estimate generalization errors empirically. After removing logically and mathematically invalid features, as well as features with missing values, 397 features remain. We rank these 397 features in several ways: gap of mean and variance; Chi2 test and mutual information selection; and model ranking. Figure 5 shows the comprehensive ranking results, with the features sorted by the height of their blue bars. On top of this ranking we built an intelligent system with 52 generic features.
Fig. 5. Feature Ranking Results
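One of the ranking criteria named above, the gap of mean and variance, can be sketched as follows. The paper does not give its exact formula, so the score used here (squared gap between the class means divided by the sum of the class variances, in the spirit of a Fisher score) is an assumption, and the feature names and values are hypothetical toy data.

```python
from statistics import mean, pvariance

def mean_gap_score(malware_vals, benign_vals):
    """Squared gap between the class means, divided by the summed class variances."""
    gap = (mean(malware_vals) - mean(benign_vals)) ** 2
    spread = pvariance(malware_vals) + pvariance(benign_vals)
    # A feature that is constant within both classes gets score 0.0 in this sketch.
    return gap / spread if spread > 0 else 0.0

def rank_features(samples, labels):
    """Return feature names sorted from most to least discriminative.

    samples: list of dicts mapping feature name -> numeric value
    labels:  list of 1 (malware) / 0 (benign), aligned with samples
    """
    names = sorted(samples[0])
    scores = {}
    for name in names:
        malware_vals = [s[name] for s, y in zip(samples, labels) if y == 1]
        benign_vals = [s[name] for s, y in zip(samples, labels) if y == 0]
        scores[name] = mean_gap_score(malware_vals, benign_vals)
    return sorted(names, key=lambda n: scores[n], reverse=True)

# Hypothetical toy samples with two generic features.
samples = [
    {"file_entropy": 7.8, "file_length": 120.0},
    {"file_entropy": 7.5, "file_length": 300.0},
    {"file_entropy": 5.1, "file_length": 150.0},
    {"file_entropy": 4.9, "file_length": 280.0},
]
labels = [1, 1, 0, 0]
ranking = rank_features(samples, labels)
```

In this toy data the entropy values separate the two classes far more cleanly than the file lengths, so `file_entropy` is ranked first. Keeping only the top-ranked names is one way a large feature set could be compressed to a small one.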
5.2 Building Intelligent Systems

We compared the prediction errors (or accuracies) of XGBoost models built with different numbers of selected features. A feature count of around 50 yields an almost-best model, with high precision and score at a lower cost (feature size). The test results show that 52-feature models combine a simple form with the best level of accuracy. As the key indicators of accuracy, precision and recall are calculated on the sets of known good (benign) and bad (malicious) software. In the information security context, Precision is the fraction of samples predicted as malicious that really are malicious, while Recall is the fraction of malicious samples that the intelligent system predicts as malicious. FPR (false positive rate) is the fraction of good samples incorrectly classified as malicious. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, and P and R abbreviating Precision and Recall, the indicators are calculated by the following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F1 = \frac{2PR}{P + R} \tag{3}$$

$$TPR = \frac{TP}{TP + FN} \tag{4}$$

$$FPR = \frac{FP}{TN + FP} \tag{5}$$
Besides precision and recall, we also need to estimate the generalization errors of the machine learning models before applying them in real production environments, where all samples are new and unknown (good or bad). For the 52-feature models we obtain three kinds of machine learning models and the corresponding evaluation results: precision, recall, FPR, TP (true positives), TN (true negatives), FP (false positives), FN (false negatives), Pending-Test (the rate of positive detections on the unknown datasets), and so on. The models built from the 52 generic features provide the baseline precision and recall of the Generic intelligent system. As described above, we also built the other intelligent systems with the XGBoost algorithm, using IAT (import table information), Opcode (ASM machine code), and String (readable characters in PE) features respectively. Table 1 shows their performance indicators.

Table 1. Performance Indicators of Various Independent Intelligent Systems

| Intelligent System | Precision | Recall | FPR | Pending-Test | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Generic | 99.72% | 98.04% | 8.9E−6 | 7.5E−4 | 6427 | 2017169 | 18 | 128 |
| IAT | 95.72% | 97.43% | 1.4E−4 | 2.0E−3 | 6387 | 2016902 | 285 | 168 |
| Opcode | 98.14% | 97.77% | 6.0E−5 | 6.4E−4 | 6409 | 2017066 | 121 | 146 |
| String | 99.18% | 97.74% | 2.6E−5 | 4.3E−4 | 6407 | 2017134 | 53 | 148 |
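The indicators in Table 1 can be recomputed from the raw counts with Eqs. (1) to (5). The short sketch below does this for the Generic row; the function name and dictionary keys are hypothetical, and only the four counts come from the table.

```python
def indicators(tp, tn, fp, fn):
    """Compute the indicators of Eqs. (1)-(5) from confusion-matrix counts."""
    precision = tp / (tp + fp)          # Eq. (1)
    recall = tp / (tp + fn)             # Eq. (2); identical to TPR, Eq. (4)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (3)
    fpr = fp / (tn + fp)                # Eq. (5)
    return {"precision": precision, "recall": recall, "f1": f1, "tpr": recall, "fpr": fpr}

# Counts from the Generic row of Table 1.
generic = indicators(tp=6427, tn=2017169, fp=18, fn=128)

for name, value in generic.items():
    print(f"{name}: {value:.6g}")
```

The recomputed precision and FPR agree with the tabulated 99.72% and 8.9E−6, and the recall agrees with the tabulated 98.04% up to rounding in the last digit.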
Comparing the performance indicators of the different intelligent systems, we found that each has its own pros and cons. Collaboration among different intelligent systems may produce a “1 + 1 > 2” effect, as discussed in Sect. 4. With a specifically designed collaboration strategy, we obtained promising experimental results.
5.3 Collaboration Results

We conducted experiments combining multiple intelligent systems via the approaches described in Sect. 4. The performance indicators of the single models are listed in Table 1. Table 2 shows that, as more intelligent systems are combined, the final (Meta) system can improve on the single intelligent systems of Table 1 in precision, recall, and generalization errors (on the pending datasets).

Table 2. Performance Indicators of Multiple Intelligent Systems Combined

| Intelligent System | Precision | Recall | FPR | Pending-Test | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Generic + IAT | 99.95% | 97.71% | 1.5E−6 | 4.4E−4 | 6405 | 2017184 | 3 | 150 |
| Generic + String + IAT | 99.95% | 97.67% | 1.5E−6 | 4.4E−4 | 6402 | 2017184 | 3 | 153 |
| Generic + String + IAT + Opcode | 99.97% | 98.08% | 9.9E−7 | 3.8E−4 | 6429 | 2017189 | 2 | 126 |
From the first row of Table 2, we can see a significant reduction in FP errors (by 15, from 18 to 3) even when only two intelligent systems (Generic and IAT) are combined. With all intelligent systems joined, TP and TN improve by 2 and 20 respectively relative to the Generic system alone. Most importantly, FP errors fall by 16 (from 18 to 2). That is, the collaboration of multiple intelligent systems chiefly reduces the crucial error of false positives, which is key for malware detection products. Precision and recall are also improved to some extent through collaboration. Although the numeric improvements are small, the absolute number of additional correct detections is large, because malware detection has to cope with 10 million+ new software items on the internet to protect users. In the Pending Test, the collaboration of multiple intelligent systems also shows a lower detection rate, which suggests fewer potential false alarms in generalization. This is a precious merit for any information security product.
6 Conclusion

In detecting malicious PE executables, building intelligent systems that make predictions automatically is both useful and necessary for handling the millions of new software items that appear on the internet every single day. However, single intelligent systems are far from sufficient for producing precise detection results with reasonable generalization errors. We analyzed various properties of PE files and built different intelligent systems based on them. With a specific collaboration strategy, the combination of different intelligent systems delivers the best accuracy (precision, recall, etc.). With the power of collaborating intelligent systems, industry can block more malware with higher accuracy, and information exchange on the Internet can thus be better secured. Further research may include exploring more strategies for building intelligent systems, such as CNNs or Transformers, as well as more efficient approaches to collaboration for even higher performance.

Another important aspect of malware detection is collaboration between the static and dynamic features of software. The four intelligent systems we designed are based on static features of PE files. However, these are not enough to discover the real intentions of software. Dynamic features indicate the real behaviors of PE files in operating systems. Multi-layer malware detection is a key direction in information security. We may try to record the behaviors of software in a virtual OS and feed them into a dynamic intelligent system, as the fifth collaborating system, to further improve the capability of information security.
References

1. Santos, I., Penya, Y.K., Devesa, J., Garcia, P.G.: N-grams-based file signatures for malware detection. In: ICEIS 2009 - Proceedings of the 11th International Conference on Enterprise Information Systems, Volume AIDSS, Milan, Italy, pp. 317–320 (2009)
2. Konstantinou, E.: Metamorphic virus: analysis and detection. In: Technical Report RHULMA-2008-2, Search Security Award M.Sc. thesis, 93 p. (2008)
3. Chan, P.K., Lippmann, R.: Machine learning for computer security. J. Mach. Learn. Res. 6, 2669–2672 (2006)
4. Kolter, J.Z., Maloof, M.A.: Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res. 7, 2721–2744 (2006). Special Issue on Machine Learning in Computer Security
5. Ye, Y., Wang, D., Li, T., Ye, D.: IMDS: intelligent malware detection system. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1043–1047 (2007)
6. Chouchane, M.R., Walenstein, A., Lakhotia, A.: Using Markov chains to filter machine-morphed variants of malicious programs. In: Malicious and Unwanted Software, 2008. Proceedings of the 3rd International Conference on MALWARE, pp. 77–84 (2008)
7. Santamarta, R.: Generic detection and classification of polymorphic malware using neural pattern recognition (2006). https://www.semanticscholar.org/paper/GENERIC-DETECTION-AND-CLASSIFICATION-OF-POLYMORPHIC-Santamarta/5cda37f3fe61f1fa156752be27fdb7cc40983e84
8. Yoo, I.: Visualizing windows executable viruses using self-organizing maps. In: VizSEC/DMSEC 2004: Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security, pp. 82–89. ACM (2004)
9. Baldwin, J., Dehghantanha, A.: Leveraging support vector machine for opcode density based detection of crypto-ransomware. In: Dehghantanha, A., Conti, M., Dargahi, T. (eds.) Cyber Threat Intelligence. AIS, vol. 70, pp. 107–136. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73951-9_6
10. Bhatt, A.J., Gupta, C., Mittal, S.: iABC-AL: active learning-based privacy leaks threat detection for iOS applications. J. King Saud Univ. Comput. Inf. Sci. 33(701), 769–786 (2021)
11. Zhang, H., et al.: Classification of ransomware families with machine learning based on N-gram of opcodes. Future Gener. Comput. Syst. 90, 211–221 (2019)
12. Riasat, R., et al.: Onamd: an online android malware detection approach. In: 2018 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 1, pp. 190–196. IEEE (2018)
13. Arora, A., et al.: Poster: hybrid android malware detection by combining supervised and unsupervised learning. In: Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pp. 798–800. ACM (2018)
14. Singh, A., Bist, A.S.: OSX malware detection: challenges and solutions. J. Inf. Optim. Sci. 41(2), 379–385 (2020)
15. Gharghasheh, S.E., Hadayeghparast, S.: Mac OS X malware detection with supervised machine learning algorithms. In: Choo, K.K.R., Dehghantanha, A. (eds.) Handbook of Big Data Analytics and Forensics, pp. 193–208. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-74753-4_13
16. Tech-zealots.com. https://tech-zealots.com/malware-analysis/pe-portable-executable-structure-malware-analysis-part-2/. Accessed 26 May 2023
17. Wiki. https://wiki.osdev.org/PE. Accessed 01 Feb 2023
18. Nielsen, D.: Tree boosting with XGBoost why does XGBoost win “every” machine learning competition? Master’s thesis, NTNU (2016)
“1 + X” Blended Teaching Mode Design in MOOC Environment

Yanling Liu(B), Liping Wang(B), and Shengwei Zhang

School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected], [email protected]
Abstract. With the worldwide popularity of MOOCs, the SPOC (Small Private Online Course), which meets personalized needs and suits small groups, has gradually spread to China. Building on an analysis of the theory and experience of the blended teaching mode, this paper discusses the deep integration of SPOC with other platforms, focuses on the design of SPOC online teaching resources, flipped classroom teaching activities, and a formative curriculum evaluation system, and puts forward the design of a school-based hybrid teaching mode of “SPOC platform + X”.

Keywords: MOOC · SPOC · Blended Teaching Mode · Flipped classroom
1 Introduction

MOOCs are an important development in the teaching reform of higher education and have had a major impact on it. Their advantages, such as rich online resources and an online learning environment, once pushed traditional classroom teaching in colleges and universities to a great breakthrough. For example, an end-to-end multi-view interactive framework (EMIF) has been proposed to predict user dropout in MOOCs [1]. To better meet college students' new needs, such as personalized and in-depth learning, Armando Fox of the University of California, Berkeley began to integrate MOOC resources and actively explore their application in small-scale student groups. In 2013 he formally proposed the new concept of the SPOC (Small Private Online Course). SPOCs were conceived to succeed where Massive Open Online Courses (MOOCs) failed, namely in the high drop-out rate [2]. It has also been reported that the online SPOC teaching mode is conducive to improving students' interest in learning and cultivating their comprehensive ability [3]. SPOCs make good use of advantages such as high-quality teaching resources, formative evaluation, and timely feedback, providing a new path for higher education teaching reform. We try to build a blended teaching mode based on SPOC: building SPOC online curriculum resources that reflect school-based characteristics, and constructing a teaching mode that combines “online + offline”, mobile ubiquitous + cooperative learning, and classroom + extracurricular learning. This teaching mode combines online resources and environments with in-class and extracurricular learning methods, actively creates a diversified, interactive, and all-round environment for knowledge input, breaks through the traditional single learning method, and strengthens the application of knowledge in an integrated online + offline environment, so as to improve students' comprehensive application ability and practical ability.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 76–83, 2024. https://doi.org/10.1007/978-3-031-50580-5_7
2 The Value of Blended Teaching Mode in MOOC Environment

Exploring a hybrid teaching mode that combines online and offline teaching in the MOOC environment benefits students' in-depth learning, improves their high-level thinking ability, optimizes the existing teaching environment, and continuously improves the teaching quality of higher education.

Provide support for college students' deep learning and improve their high-level thinking ability. The hybrid teaching mode combining online and offline teaching not only absorbs the advantages of MOOCs, but also avoids their defects, such as the neglect of personalized needs and the lack of timely teaching feedback, communication, and guidance between teachers and students. The SPOC curriculum team in colleges and universities integrates advanced concepts, such as the online + offline combination and classroom + extracurricular learning, into teaching links such as resource construction, teaching content design, learning method reform, and evaluation system construction. As a result, the traditional teaching process can be flipped, the teaching structure can be optimized, and the teaching design can focus on multiple learning situations, interactive feedback, reflection, and so on. With the help of the Internet and digital technology, and relying on high-quality online teaching resources and teacher teams, we can meet personalized learning needs and support students, in terms of resources, environment, and so on, in moving from shallow learning to deep learning. In addition, online and offline teaching, group cooperation homework, offline course group learning, interactive evaluation, online and offline discussion, and other links are used to mobilize students' enthusiasm for active learning and enhance their active participation in learning.
Through the combination of online and offline hybrid teaching mode, the flipped classroom was created, allowing students to experience learning preparation, new knowledge construction, transfer, application and creation, learning evaluation and other links in turn, cultivating the ability to deeply understand knowledge and solve problems, and effectively cultivating high-level thinking ability. Create an exchange environment to promote the exchange and cooperation of college students’ learning. The SPOC course hybrid teaching mode uses the network platform to carry out distance teaching, which is conducive to connecting with first-class teachers at home and abroad. These teachers include top experts from domestic first-class research universities, who have profound theoretical attainments. These masters have laid a solid foundation for the design, development and implementation of university courses, and also cultivated excellent conditions for cooperative research in scientific research and other fields. Through online and offline multiple interactions, the SPOC curriculum hybrid teaching mode is conducive to creating a truly international and cooperative learning teaching environment, providing students with high-quality interaction and
Y. Liu et al.
communication between students. The exploration and practice of the online and offline hybrid teaching mode is conducive to cultivating a group of excellent teachers who are in line with internationally advanced teaching concepts and methods and who meet the individualized learning needs of Chinese college students. It forms a teaching mode with reference value and drives the construction of high-quality courses in Chinese colleges and universities, so as to improve the internationalization level of disciplines and develop world-class schools and disciplines. It can also strengthen international collaborative innovation and promote interdisciplinary integration, so as to cultivate more first-class talents with family and country feelings, innovation ability and global vision.
3 Theoretical Basis of Blended Teaching Mode
Blended learning theory is the theoretical basis for building a blended teaching model in the MOOC environment. Professors Bonk and Graham put forward a specific definition of blended learning: they regard it as the combination of face-to-face teaching and computer-assisted online learning. Blended learning is first a combination of constructivism, cognitivism and other teaching theories, or a combination of face-to-face teaching and E-Learning methods. Blended learning takes into account the advantages of both traditional learning methods and digital learning, and comprehensively uses classroom learning, digital learning and other learning theories, different technologies and means, and different application methods to implement a teaching strategy. By integrating the advantages of online and offline learning, teachers help students learn more actively and effectively.
Deep learning theory of high-order thinking. Deep learning is a form of learning in which learners use higher-order thinking ability. It requires learners to pay attention to the transfer and comprehensive application of knowledge, creative problem solving, decision-making, etc., and to participate actively in higher-order thinking activities such as “application, analysis, evaluation and creation”. This high-level thinking ability also includes metacognition, teamwork and creative thinking. The blended teaching represented by the flipped classroom reverses the original teaching structure: shallow knowledge learning takes place before class, and the internalization of knowledge is realized in the classroom with the guidance and help of teachers, so as to promote students’ high-level thinking ability.
Mastery learning theory. Bloom put forward the theory of mastery learning. He believed that almost all students could master almost all content as long as they were given sufficient time and appropriate teaching conditions. Information technology has great advantages in meeting the individual learning needs of students. The different learning needs of excellent students and students who fall behind, and of ordinary students and students with special needs, can be expected to be met. Mastery learning theory provides a solid theoretical basis for blended teaching, especially for learning in the pre-class knowledge transfer stage.
X"1 + X" Blended Teaching Mode Design in MOOC Environment
4 SPOC Blended Teaching Process in MOOC Environment
The SPOC teaching mode is an improvement based on the deep learning process model and builds a hybrid learning process of MOOC + offline classroom teaching. It is necessary to develop an online and offline blended teaching mode for the SPOC platform in the post-pandemic era and apply it to real teaching scenarios [4]. The whole process is divided into four progressive and cyclic processes: MOOC, offline classroom teaching or live broadcast, online + offline teaching feedback. The specific process is shown in Fig. 1.
Fig. 1. “MOOC + Offline Class” Blended Teaching Process
5 Design of SPOC Blended Teaching Mode
In order to build a blended teaching model with school-based characteristics, it is necessary to revise and improve the existing blended teaching model in light of the actual situation of colleges and universities and the characteristics and needs of college students’ learning. Teachers’ colleges and universities for special education often need to integrate educational reform. These schools have not only ordinary students but also students with special needs, and these personalized educational needs make the design of the blended teaching mode more important. The design should pay attention to the characteristics of the learning subject, that is, the characteristics and learning needs of college students, and focus on stimulating students’ subjective initiative in the teaching process. Therefore, the design of online courses requires vivid and complete learning resources, and offline learning planning should place reasonable emphasis on differentiation. Based on this principle, a new specific process has been formed.
A (Analysis). The design of the blended teaching mode has a premise: before the design, it is necessary to analyze not only the teaching objectives and content, but also the teaching objects, namely the characteristics of learners, and the teaching environment. The first step is to formulate the overall learning objectives of the curriculum. What is the goal of the SPOC course in this link? What knowledge, ability, attitude and emotion are mainly cultivated? In the SPOC teaching mode, we should not only give the overall learning objectives of the course, but also formulate specific and detailed stage objectives for different stages of learning. In the process of goal formulation, it is necessary to take into account the theoretical nature of professional courses in the lower grades, pay attention to the need to cultivate theoretical application in the higher grades, and set different phased learning goals according to different learning subjects and curriculum requirements. The second step is to analyze the learning content. Teachers should design the teaching content according to the characteristics of the subject and the learning objectives, and determine and divide the content suitable for online self-study, the content suitable for offline interactive classroom learning between teachers and students, and the content that needs practice or practical operation to assist teaching. Specifically, what content is to be taught in SPOC courses, and what are the internal links within it? What are the key points and difficulties of the knowledge taught? The third step is a comprehensive analysis of the learning object and the learning environment. The learning object is the main body of learning in teaching activities, so the design of the blended teaching mode of the SPOC curriculum is inseparable from the analysis of learners. Designers should first understand the readiness of learning objects for the learning content, such as whether they have encountered the knowledge points, how much preparatory knowledge they have, and how many relevant skills they have. In addition to this basis of knowledge and skills, it is also necessary to analyze their learning characteristics and learning preferences, so as to understand their attitude towards the learning content and their cognitive preferences. Students at different levels have different requirements for learning content. Ordinary students and students with special needs also have different needs for learning resources. Some require systematic learning content, while others focus on the practicality and operability of course content. Learning objects often have different learning styles. Designers of the hybrid teaching mode need to understand the differences among learning objects in information reception, information processing, etc. Examples include the time allocated to daily online learning, the cognitive differences between genders, and the cognitive particularity of students with special needs. In addition, we should also analyze the general characteristics of all learners, such as their age and psychological development.
R (Resource). Based on the analysis of SPOC curriculum objectives, contents and learning objects, the SPOC curriculum team needs to build online resources and offline resources at the same time. SPOC hybrid teaching activities are generally carried out in one learning unit, that is, offline classroom teaching and online learning together constitute a unit. According to the needs of learners, the designers of the SPOC hybrid teaching mode can limit the online resources of each unit to no more than 30 min of study, and the offline resources to no more than 60 min because of discussion and interactive teaching. In the MOOC environment, online resources such as high-quality course resources, online shared resources, and video open classes in colleges and universities are very rich. SPOC course teams can apply these online resources to the SPOC hybrid teaching mode to achieve the co-construction and sharing of multiple high-quality resources; teaching videos can also be supplemented according to the needs of application and integrated education. Online resources combined with personalized needs help achieve the objectives of SPOC courses, which is conducive to realizing the professional talent training objectives. Offline resources are teaching cases, research papers, investigation reports and
other materials provided by teachers for offline classroom teaching, as well as resources to help students practice and learn. The SPOC curriculum team also needs to develop online and offline extension resources, such as videos that discuss hot issues in a certain field, instructional designs for research-based learning, and case-based heuristic teaching materials.
E (Environment). The construction and sharing of online and offline platforms should be the focus of the curriculum team. Online platforms such as MOOC of University of China and NetEase Cloud Classroom, selected and used by SPOC, can meet the basic needs of students for a hybrid learning environment. Teachers can also actively use diversified platforms such as the school online disk and the cloud classroom system to build a flexible and diverse online learning system that matches students’ personalized needs. The construction of the offline learning environment is mainly centered on the classroom. The design of the hybrid teaching mode is student-centered and focuses on classroom discussion, group learning, class reporting or sharing, etc., organized around the objectives and content of the SPOC curriculum, to provide a high-quality offline environment in which students carry out personalized learning. At the same time, the existing practice base platform, the college students’ innovation and entrepreneurship platform, and the college students’ discipline competition platform are used flexibly to help students grasp theoretical knowledge deeply and improve their comprehensive practical ability. These platforms are widely used in the SPOC hybrid teaching mode, and the online platform and on-site internship are combined to achieve a learning environment that combines online and offline.
E (Evaluation). The evaluation of MOOC quality is a multiple-criteria decision-making issue [5]. Different from the traditional teaching mode, the SPOC hybrid teaching mode focuses on formative evaluation, that is, the evaluation of both online and offline learning runs through the whole teaching process. This evaluation breaks through the traditional single evaluation method, combines online system evaluation with offline evaluation, and covers online learning, offline classroom discussion and the final examination. The online evaluation adopted by the SPOC hybrid teaching mode is mainly based on the learning platform, which collects and analyzes data on students’ online learning, including the time spent watching videos, the number of online discussions, the number of forum posts, online unit exercises, final tests, students’ self-evaluation and other evaluations. The offline evaluation is mainly completed by teachers, based on offline data such as learners’ participation in classroom discussion, group cooperative learning and examination results. The SPOC model attaches particular importance to creating group assignments that promote interaction between teachers and students and among students. See Fig. 2 for the group work flow. Students are divided into groups of two or three. Group members focus on issues planned according to the SPOC curriculum goals, which can be hot topics in a certain field or common problems in that field. They cooperate to write group analysis reports, mainly to analyze these problems and propose solutions. Two groups then merge into a new group to carry out comparative research on the same topic, mainly reflecting on and evaluating the analysis reports of the other group, so that the members can inspire and learn from each other during this second round of cooperation. Then all members further improve
Fig. 2. Flow chart of group work
their respective analysis reports according to the opinions from mutual evaluation. Finally, a report set completed by the whole class is formed. This formative assessment is also reflected in the timely feedback of the SPOC curriculum team on students’ learning. Self-regulated learning (SRL) is a fundamental skill to succeed in MOOCs [6]. Teachers evaluate students according to the SPOC course objectives in terms of classic reading, video watching, classroom discussion, group work, final examination, etc., so that students can get timely feedback during the learning process and adjust their learning methods to improve the learning effect. Whether in online discussion or offline classroom discussion, teachers guide the topics in a timely manner and give prompt answers, so that students can learn in depth and clear up misunderstandings. For group work, the evaluation focuses on students’ academic standards, content innovation and report quality. For the topics that all students pay attention to, and for the good practices and strategies that students experience in China, teachers use SPOC platform course teaching to implement curriculum ideology and politics, “tell a good story about China”, and “convey the voice of China”.
Acknowledgement. This work was supported by the Educational Reform Research Subject of Nanjing Normal University of Special Education (2022XJJG02, 2021XJJG09), the Jiangsu Disability Research Subject of Disabled Persons’ Federation (2022SC03014), the Educational Science Planning of Jiangsu Province (D/2021/01/23, B/2022/04/05), the Jiangsu University Laboratory Research Association (GS2022BZZ29), and the Universities’ Philosophy and Social Science Researches Project in Jiangsu Province (2020SJA0631, 2019SJA0544). The authors gratefully acknowledge this support and the reviewers who gave valuable suggestions.
References
1. Li, F., Wei, Z.: MOOC dropout prediction based on multi-view learning. Journal of Physics: Conference Series 2010(1) (2021)
2. Julio, R.-P., José-María, F.-L., Enrique, S.-R., Ernesto, C.-M.: The implementation of Small Private Online Courses (SPOC) as a new approach to education. Int. J. Edu. Technol. Higher Edu. 17(1) (2020)
3. Jia, Y., Zhang, L.: Research and application of online SPOC teaching mode in analog circuit course. Int. J. Edu. Technol. Higher Edu. 18(1) (2021)
4. Chen, X., Guo, J., Xu, H.: An empirical study of blended teaching mode based on SPOC in the Postpandemic Era. Discrete Dynamics in Nature and Society (2022)
5. Su, P., Guo, J., Shao, Q.: Construction of the quality evaluation index system of MOOC platforms based on the user perspective. Sustainability 13(20) (2021)
6. Vilkova, K.: The Promises and Pitfalls of Self-regulated Learning Interventions in MOOCs. Technology, Knowledge and Learning (2021)
A CNN-Based Algorithm with an Optimized Attention Mechanism for Sign Language Gesture Recognition
Kai Yang, Zhiwei Yang, Li Liu(B), Yuqi Liu, Xinyu Zhang, Naihe Wang, and Shengwei Zhang
Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected]
Abstract. Sign language is the main method for people with hearing impairment to communicate with others and obtain information from the outside world. It is also an important tool to help them integrate into society. Continuous sign language recognition is a challenging task. Most current models pay insufficient attention to modeling lengthy sequences as a whole, resulting in low accuracy in the recognition and translation of longer sign language videos. This paper proposes a sign language recognition network based on a target detection network model. First, an optimized attention module is introduced into the backbone network of YOLOv4-tiny; it optimizes channel attention and spatial attention and replaces the original feature vectors with weighted feature vectors for residual fusion, which enhances feature representation and reduces the influence of background noise. In addition, to reduce the time consumed by object detection, three identical MobileNet modules replace the three CSPBlock modules in the YOLOv4-tiny network to simplify the network structure. The experimental results show that the enhanced network model improves the mean average precision, precision, and recall, effectively improving the detection accuracy of the sign language recognition network.
Keywords: Sign language · MobileNet · YOLO · gesture recognition
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 84–94, 2024. https://doi.org/10.1007/978-3-031-50580-5_8
1 Introduction
As a silent communication tool, gesture has become an important way of communication in life. Especially on special occasions, such as communication between deaf people or under vacuum conditions, the convenience and significance of gestures are more evident. In recent years, the increasing advancement of the Internet of Things (IoT) and artificial intelligence (AI) technology has facilitated the utilization of human-computer interaction in numerous scenarios [1]. Consequently, gesture recognition technology has emerged as a prominent research area in both academic and industrial circles [2, 3]. People can interact with devices more conveniently using gesture recognition technology, and its related technologies and applications have tremendous potential in the context of the Internet of Everything. For example, gestures can be used in smart homes to control devices such as TVs, air conditioners, and refrigerators without additional controllers, significantly enhancing the user experience and operational efficiency.
There is a large body of research on gesture recognition algorithms. Methods based on hardware devices, such as wearable data gloves, Kinect, and Leap Motion [4, 5], are more mature and widely used, with high recognition accuracy and fast recognition speed, but their equipment is expensive and their human-machine interaction is not smooth. Methods based on machine vision reduce the dependence of gesture recognition algorithms on hardware, but their complicated network models reduce the speed of gesture recognition. Recently, deep learning has made significant progress in target detection and image classification. State-of-the-art algorithms like YOLO, SSD, RCNN, and Faster R-CNN [6–8] have achieved high accuracy rates in detecting and classifying targets. In the realm of gesture recognition, Jin et al. [9] proposed a method based on an enhanced residual network and dynamic learning rate adjustment to enhance the accuracy, robustness, and convergence speed of the recognition process. Redmon et al. [10] introduced a multi-scale convolutional feature fusion method for SSD gesture recognition, which integrated different convolutional layers to improve recognition accuracy for small and medium-sized target gestures. Ren et al. [11] presented a deep learning model for gesture recognition based on an improved YOLOv3 network combined with a Bayesian classifier to address the challenges of data vulnerability and enhance network invariance. Furthermore, Guo et al.
[12] proposed a gesture interaction algorithm based on an enhanced YOLOv4 network, which can recognize gestures in complex scenarios in real time while addressing problems of the YOLOv4 algorithm such as false detections, missed detections, and the limited amount of gesture data available for recognition. Although gesture recognition methods based on deep learning can achieve high recognition accuracy [13], the deepening of the network layers poses a great challenge in terms of hardware cost, computation, and the difficulty of training, running, and storing the neural network model on embedded devices. Therefore, we propose the DRL-PP algorithm, which improves the YOLOv4-tiny algorithm to analyze sign language images. This paper proposes a gesture recognition method that employs a lightweight convolutional neural network to reduce the model size while maintaining high accuracy, thus rendering the model more suitable for deployment on resource-limited mobile or embedded devices. The article is structured as follows: Sect. 2 discusses the related work, Sect. 3 provides a detailed description of the YOLOv4-tiny network utilized in this study, Sect. 4 outlines the DRL-PP algorithm, Sect. 5 presents the experiment results, and finally, Sect. 6 concludes the paper.
2 Related Work
The prevalent approaches to sign language recognition can be categorized into two main types, depending on the media employed: sensor-based methods and computer vision-based methods.
K. Yang et al.
1) Sensor-based approach. Sensors include data gloves, arm bands, smart devices, etc. Wen et al. [14] proposed a sign language recognition method based on a triboelectric smart glove configured with a total of 15 triboelectric sensors. Sentences collected by the smart glove are segmented into word units, and new sentences created by reorganizing these word units are recognized with an average correct rate of 86.67%. Ahmed et al. [15] introduced an innovative real-time sign recognition system that utilizes wearable sensory mittens consisting of 17 sensors and 65 channels. The experiment involved 75 signs performed by five Malaysian sign language (MSL) participants, comprising MSL numbers, letters, and words. The recognition accuracy was 99%, 96%, and 93.4% for numbers, letters, and words, respectively. Although methods using sensors for sign recognition are highly flexible, they require deaf people to wear the requisite sensory equipment, which is an additional burden for them [16].
2) Computer vision-based approach. Boukdir et al. [6] presented a method for recognizing Arabic sign language using a deep learning architecture. Their approach involves the use of a 2D convolutional recurrent neural network (2DCRNN) model to extract features, along with a recurrent network pattern that detects relationships between frames. Additionally, a 3D convolutional neural network (3DCNN) model is used to learn spatial-temporal features from video blocks by quadruple. The cross-validation technique was employed to evaluate the performance of the proposed method, which yielded a horizontal accuracy of 92% for 2DCRNN and 99% for 3DCNN. In a separate study, Guo et al. [17] proposed a hierarchical long short-term memory (LSTM) network for sign language translation, which addresses the problem that traditional hidden Markov models and connectionist temporal classification may fail to resolve the word-order confusion between the visual content and the corresponding sentence during recognition. Yu et al. [18] introduced an end-to-end sign language converter that integrates recognition and translation tasks into a unified architecture, implemented via connectionist temporal classification (CTC). This joint approach does not require temporal information, solves two interdependent sequence learning problems, and performs better on the PHOENIX14T dataset. Ren et al. [19] proposed a machine learning method to classify cancers, as pattern recognition for medical images is widely used in computer-aided cases. Wang et al. [20] used a PSO-guided self-tuning CNN to diagnose COVID-19. Deep learning-based methods can help classify medical images and efficiently improve the accuracy of diagnosis. Computer vision-based sign language recognition is suitable for real-life applications because of its uncomplicated interaction and low device dependency with guaranteed accuracy. Inspired by the above work, this paper proposes a gesture recognition method with a lightweight convolutional neural network.
3 Network Structure
The YOLOv4-tiny detection model utilizes CSPdarknet53-tiny as its backbone network, and its primary structure is presented in Fig. 1. To achieve multi-scale sensing, the (104,104,64) feature map is upsampled and combined with the (52,52,128) feature map to obtain a detection path with a larger receptive field, which is directly output by the backbone network, along with the detection path that has the minimum receptive field.
These two detection paths work collaboratively to accomplish the detection task and ensure effective multi-scale sensing.
Fig. 1. YOLO model.
Although the existing YOLO-based network obtains good detection performance in a narrow space, it also has the following shortcomings: (1) The backbone network is too lightweight, and the contour evolution of the feature map during layer-by-layer transmission is insufficient to learn occluded target features effectively during training. (2) The neck’s traditional feature fusion network (FPN) is too simple and inefficient at fusing feature maps of different scales, and it easily loses edge detail information. (3) The traditional algorithm has limitations in the post-processing phase and is prone to erroneously deleting overlapping prediction frames, leading to missed detections. A gesture recognition model based on MobileNet [14] and an attention mechanism is proposed to resolve the above problems.
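Shortcoming (3) can be illustrated with a minimal sketch of standard greedy non-maximum suppression (NMS). This is the textbook procedure, not necessarily the exact post-processing used in the network discussed here; the boxes, scores, and the 0.5 IoU threshold below are hypothetical values chosen only to show how a heavily overlapping but genuine second detection is discarded.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by standard greedy NMS."""
    # Visit boxes from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical example: two hands that overlap heavily, plus one isolated hand.
boxes = [(10, 10, 110, 110), (30, 10, 130, 110), (300, 300, 380, 380)]
scores = [0.90, 0.85, 0.80]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

In this example the second box overlaps the first with an IoU of about 0.67, so it is suppressed even though it could be a separate hand, which is the missed-detection behaviour described above.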
4 DRL-PP Algorithm Description
In this paper, to improve the detection model’s focus on the gesture region, we incorporate an attention module inspired by [15] into the backbone network. The attention module includes both channel attention and spatial attention, which can be formulated as:

$$F' = M_c(F) \otimes F \tag{1}$$

$$F'' = M_s(F') \otimes F' \tag{2}$$

The channel attention and spatial attention are implemented as element-wise multiplication operations represented by the symbol $\otimes$. In this context, $F$ represents the input feature map, $F'$ represents the refined feature map, and $F''$ represents the final refined output.
Fig. 2. Attention module.
The Channel Attention mechanism, as illustrated in Fig. 2, leverages the inter-channel relationships of feature maps to focus on the most significant portion of all channels. In particular, each convolution kernel can be viewed as a feature detector, producing a feature map that represents one object feature per channel. The first step of channel attention is to reduce the dimensionality of the feature map by passing it through both a max pooling layer and an average pooling layer, which produce two feature descriptors: one that emphasizes important features of the object and another that computes the range of the object efficiently. Next, these descriptors are fed into a shared network comprising an input layer, an output layer, and three hidden layers. As the descriptors pass through the shared network, their output feature vectors are added element-wise. Finally, the sigmoid function is applied to activate the feature vectors, generating the channel attention map. Spatial attention, as shown in the figure, is another component of the attention module that utilizes the spatial relationship between features to generate a spatial attention map. When an image is fed into a convolutional neural network, each pixel in the image is involved in the computation. Similar to channel attention, spatial attention aims to emphasize the regions in the image that are most relevant to the object. First, the channel attention map and the refined feature map obtained from the feature map are passed through the max pooling layer and the average pooling layer, respectively, to generate two feature descriptors. These descriptors are then concatenated and passed through two convolutional layers to accentuate the relevant regions of the descriptors. Finally, the sigmoid function is applied to activate the vector and obtain the spatial attention map. 
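The channel attention steps described above can be sketched in plain Python, without any deep learning framework, so that each stage stays visible. This is only an illustrative sketch under stated assumptions: the feature map is a nested list of shape C x H x W, the shared network is a small bias-free MLP with ReLU between layers, and the weight values in `layers` are hypothetical. The two branch outputs are added element-wise before the sigmoid, as the paragraph describes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_pool(feature_map):
    """Return (max_descriptor, avg_descriptor), one value per channel."""
    max_desc = [max(max(row) for row in ch) for ch in feature_map]
    avg_desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    return max_desc, avg_desc


def shared_mlp(vec, layers):
    """Apply bias-free dense layers with ReLU between them (assumed design)."""
    for idx, weights in enumerate(layers):
        vec = [sum(w * v for w, v in zip(row, vec)) for row in weights]
        if idx < len(layers) - 1:
            vec = [max(0.0, v) for v in vec]
    return vec


def channel_attention(feature_map, layers):
    """One attention weight in (0, 1) per channel."""
    max_desc, avg_desc = global_pool(feature_map)
    out_max = shared_mlp(max_desc, layers)
    out_avg = shared_mlp(avg_desc, layers)
    # Element-wise sum of the two branches, then sigmoid activation.
    return [sigmoid(a + b) for a, b in zip(out_max, out_avg)]


def apply_channel_attention(feature_map, weights):
    """Scale every channel by its attention weight."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]


# Hypothetical 2-channel, 2 x 2 feature map and a 2 -> 1 -> 2 shared network.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[0.0, 0.0], [0.0, 8.0]]]
layers = [[[0.5, 0.5]], [[1.0], [-1.0]]]
weights = channel_attention(feature_map, layers)
refined = apply_channel_attention(feature_map, weights)
print(weights)
```

The same shared weights process both descriptors, so the module adds few parameters relative to the backbone.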
With the channel and spatial attention modules, the weights of the feature map are optimized, resulting in a final feature map that contains more information on the relevant gesture features. Assuming that the average pooling and maximum pooling operations are denoted by Favg and Fmax , respectively, the operation Attavg is effective in removing global background information of the object while preserving the salient features of the gesture.
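For the spatial branch, a comparable plain-Python sketch pools across the channel axis, stacks the two resulting maps, and convolves them into a single-channel response that a sigmoid turns into the attention map. It is a simplification under stated assumptions: a single zero-padded convolution is used instead of two layers, the 3 x 3 kernel values are hypothetical, and, as in most deep learning frameworks, the "convolution" is computed as a cross-correlation.

```python
import math


def channel_pool(feature_map):
    """Max and mean over the channel axis, giving two H x W maps."""
    c = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    max_map = [[max(feature_map[k][i][j] for k in range(c)) for j in range(w)]
               for i in range(h)]
    avg_map = [[sum(feature_map[k][i][j] for k in range(c)) / c for j in range(w)]
               for i in range(h)]
    return max_map, avg_map


def conv2d_same(maps, kernels):
    """Zero-padded cross-correlation: len(maps) input channels, one output."""
    h, w = len(maps[0]), len(maps[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for m, kern in zip(maps, kernels):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            acc += kern[di][dj] * m[y][x]
            out[i][j] = acc
    return out


def spatial_attention(feature_map, kernels):
    """One attention weight in (0, 1) per spatial position."""
    max_map, avg_map = channel_pool(feature_map)
    response = conv2d_same([max_map, avg_map], kernels)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in response]


# Hypothetical 2-channel, 3 x 3 feature map with a strong response at the center.
feature_map = [[[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]],
               [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]]
kernel = [[0.1, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 0.1]]
attention = spatial_attention(feature_map, [kernel, kernel])
print(attention[1][1], attention[0][0])
```

The center position, where both pooled maps peak, receives the largest weight, which is the intended emphasis on the region most relevant to the object.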
Let $x_n$ denote the weight of the $n$th convolutional kernel. Then, the operations $Att_{avg}$ and $Att_{max}$ can be expressed as follows:

$$Att_{avg} = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^{i} R_{t+i+1} \tag{3}$$

$$Att_{max} = E\left[ R_{t+1} \mid s = s_t \right] \tag{4}$$
After the shared network, the channel attention module outputs a vector that recalibrates the importance of each channel in the feature map. This recalibration is expressed as:

$$output_{channel} = \sigma(output_{avg} \times output_{max}) \tag{5}$$
The resulting features are obtained through matrix multiplication with the original feature map, $W = [\omega_1, \omega_2, \ldots, \omega_n]$, which can be expressed as

$$W = (x_n, output_{channel}) = x_n \times output_{channel} \tag{6}$$
To feed $W$ into the spatial attention module, the feature vectors undergo the average and maximum pooling layers, respectively. Then, the resulting features are concatenated along the channel dimension to obtain $C_{conv} \in \mathbb{R}^{1 \times 2C}$. To obtain the feature weight information, a convolution operation is needed. Let $F_{5 \times 5}$ denote a convolution operation with two input channels, only one output channel, and a kernel size of 5 × 5.

$$a_i = \mathrm{softmax}(S_i) = \frac{\exp(S_i)}{\sum_{j=1}^{N} \exp(S_j)} \tag{7}$$
The final output of the attention module is denoted as $output_{cbsp} + X$, which re-weights the importance of different elements in the original input vector. By doing so, the model is able to selectively amplify the features that contain gesture information and suppress the irrelevant or weak features. This mechanism helps the model to focus on the most salient regions in the input image and improves the accuracy of gesture recognition.

The YOLO technique uses the CSPBlock module as a residual module to enhance accuracy, which, however, increases network complexity and slows down object detection. To speed up gesture detection, we use the MobileNet module in place of the three CSPBlock modules used in YOLO. In this study, we use MobileNet-V1 as the backbone extraction network in DRL-PP. The fundamental concept of the MobileNet-V1 model is depthwise separable convolution. Whereas a conventional convolution kernel convolves three channels simultaneously to obtain one number, the depthwise separable convolution first convolves the three channels with three separate kernels to obtain three numbers and then applies a 1 × 1 × 3 convolution kernel to combine them into the final number. As more feature attributes are extracted, the depthwise separable convolution saves more parameters. The final model is shown in Fig. 3.
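The parameter saving of depthwise separable convolution can be checked with a short calculation. The sketch below counts weights only (biases are omitted) for a 3 × 3 kernel over three input channels, as in the description above; the function names and the chosen output-channel counts are illustrative assumptions.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in     # one k x k filter per input channel
    pointwise = c_in * c_out     # one 1 x 1 x c_in kernel per output channel
    return depthwise + pointwise

for c_out in (1, 32, 256):
    std = standard_conv_params(3, 3, c_out)
    sep = depthwise_separable_params(3, 3, c_out)
    print(f"c_out={c_out}: standard={std}, separable={sep}")
```

For a single output channel the separable form is slightly larger (30 weights against 27), but the depthwise cost is paid only once, so the saving grows with the number of extracted features, in line with the statement above.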
Fig. 3. MobileNet-based gesture recognition model under occlusion conditions.
5 Evaluations

5.1 Experiment Settings

In the experiment, we use Roboflow's American Sign Language Letters Dataset [7] for training. The dataset comprises 1728 images of size 608 × 608: 720 real sign language images plus images generated by data warping and oversampling. The dataset was split into a training set and a test set at a 9:1 ratio, giving 1,555 training images and 173 test images. The network hyperparameters were set as follows: the Adam optimizer was used to tune the parameters during training, with a category confidence threshold of 0.5 for the target. The initial learning rate was set to 0.001, and a weight decay coefficient of 0.0005 was employed to avoid overfitting. The batch size was set to 16, and the model was trained for 300 epochs.

The experiments were conducted on the Ubuntu operating system, using PyCharm for programming with CUDA 10.2 and cuDNN 7.6.5. The CPU is an Intel(R) Pentium(R) G3260 @ 3.30 GHz, the GPU is an NVIDIA RTX 3090, and the hard disk is 1 TB. All models in this paper were trained under Linux using the PyTorch framework, with a training configuration of 16 batches of 64 training samples per iteration. The loss was tracked against the number of iterations on the homemade gesture dataset; the model converged at around 10000 iterations with a loss value of around 0.1.
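The 9:1 split reported above can be reproduced with integer arithmetic. The sketch below is only a consistency check on the stated numbers; the last quantity assumes that one iteration consumes one batch of 16 images, which the text does not state explicitly.

```python
import math

total_images = 1728            # dataset size stated in the text
train_ratio, test_ratio = 9, 1

# Integer split at 9:1; the remainder goes to the test set.
train_count = total_images * train_ratio // (train_ratio + test_ratio)
test_count = total_images - train_count

batch_size = 16
# Assumption: one iteration processes one batch of 16 training images.
iterations_per_epoch = math.ceil(train_count / batch_size)

print(train_count, test_count, iterations_per_epoch)
```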
5.2 Experiment Parameters

In the field of target detection, the performance of target detection algorithms is commonly evaluated using precision, recall, and mean average precision (mAP) [20]. The precision rate P, which is the ratio of true positive samples to the total samples predicted as positive by the model, can be calculated as shown in Eq. (8).

$$P = \frac{TP}{TP + FP} \times 100\% \tag{8}$$

The notation TP represents the count of positive samples correctly classified by the model, while FP refers to the number of negative samples wrongly classified as positive. The recall R measures the proportion of true positive samples correctly classified by the model among all the positive samples in the test set, which can be calculated using Eq. (9).

$$R = \frac{TP}{TP + FN} \times 100\% \tag{9}$$

where FN refers to positive samples erroneously identified as negative. An algorithm that performs well keeps the precision rate at a high level while the recall rate increases. A comprehensive parameter is also generally required to test the performance of the network, such as the mAP value, which is calculated as shown in Eq. (10).

$$mAP = \frac{1}{C} \sum_{K=1}^{N} P(K)\, R(K) \tag{10}$$
Here, N represents the total number of samples in the test set, while C is the number of categories involved in the detection task. P(K) signifies the precision rate achieved by the model when it identifies K samples simultaneously, and R(K) denotes the change in the recall rate when the number of samples identified by the model changes from K − 1 to K. These two measures are used to calculate the mean average precision (mAP), which provides an overall assessment of the model's performance.

5.3 Experiment Analysis

To evaluate the impact of various modules on detection performance, only the YOLO backbone extraction network was modified, and the resulting detection results are presented in Table 1. Tests were conducted on YOLO. The results demonstrate that YOLO with MobileNet-V1 improves the algorithm runtime. Although MobileNet-V1 diminishes accuracy, the degradation is only 1% to 2%. The improvement is evident in the algorithm's time cost, which is reduced by about one-third. The loss function plots of the YOLO algorithm for the two structures are shown in Fig. 4. Similarly, to compare the effect of the attention mechanism on detection results, YOLO + CSPBlock, DRL-PP and the attention module were used as comparison models, and the results are shown in Fig. 5.
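The metrics defined in Sect. 5.2 (Eqs. 8 to 10) can be sketched as follows. This is one common reading of Eq. (10), in which the sum of P(K) times the recall change is taken per class over confidence-ranked detections and then averaged over the C classes; that reading, the function names, and the toy detection lists are assumptions made for illustration.

```python
def precision(tp, fp):
    """Eq. (8): TP / (TP + FP), in percent."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Eq. (9): TP / (TP + FN), in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(hits, num_positives):
    """Sum of P(K) * (R(K) - R(K-1)) over ranked detections (as fractions).

    hits[k] is 1 if the (k+1)-th highest-confidence detection is a true
    positive and 0 otherwise; num_positives is the ground-truth count.
    """
    ap, tp, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        p_k = tp / k
        r_k = tp / num_positives
        ap += p_k * (r_k - prev_recall)
        prev_recall = r_k
    return ap

def mean_average_precision(per_class):
    """Average the per-class AP over the C classes."""
    return sum(average_precision(h, n) for h, n in per_class) / len(per_class)

# Toy example with two classes (hypothetical detections).
per_class = [([1, 0, 1], 2), ([1, 1], 2)]
map_value = mean_average_precision(per_class)
print(precision(8, 2), recall(8, 8), map_value)
```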
Table 1. Impact of the upper sampling table module.

| Methods | Accuracy (%) | Time (s) |
|---|---|---|
| Yolo + CSPBlock | 92.7 | 0.8 |
| DRL-PP | 89.5 | 0.5 |
Fig. 4. Yolo-tiny loss function plot.
As depicted in Fig. 6, among the four methods, the DRL-PP approach achieved superior performance. By employing MobileNet as the backbone model, the accuracy, recall, and average precision of the algorithm were improved by 1.2%, 6.5%, and 2.1%, respectively, compared to the original model. The enhanced attention module proposed in the paper has a recall advantage, suggesting that the model pays more attention to the region containing detailed gesture information. To avoid missed detections, the confidence threshold (0.35) is appropriately lowered during inference to ensure that more finger region boundaries can be detected.
Fig. 5. Training process.
Fig. 6. Impact of attention module (%).
6 Conclusion

This paper proposes a solution to the challenge of communication between non-disabled individuals and those who are deaf or hard of hearing: a lightweight convolutional neural network-based gesture recognition and detection algorithm called DRL-PP. To improve gesture recognition accuracy, the proposed algorithm combines two approaches. First, MobileNet is used as the backbone extraction network of YOLO to reduce computation and parameters. Second, a self-attention mechanism is incorporated into the YOLO network to capture richer contextual information. The combination of these approaches compensates for the accuracy loss caused by the lightweight design of the model. Experimental results demonstrate that the DRL-PP algorithm outperforms other methods in sign language gesture recognition, helping to address social isolation among deaf individuals and to bridge communication gaps between hearing and deaf individuals.

Acknowledgements. This work was supported by the Universities' Philosophy and Social Science Researches Project in Jiangsu Province (No. 2020SJA0631 & No. 2019SJA0544) and the Educational Reform Research Project (No. 2018XJJG28) from Nanjing Normal University of Special Education.
References

1. Redmon, J., Divvala, S., Girshick, R., et al.: You only look once: unified, real-time object detection. In: IEEE CVPR 2016 Conference on Computer Vision and Pattern Recognition, pp. 779–788. IEEE Computer Society Press, Washington DC (2016)
2. Wang, P., Huang, H., Wang, M., et al.: YOLOv5s-FCG: an improved YOLOv5 method for inspecting riders' helmet wearing. J. Phys.: Conf. Ser. 2024, 012059 (2021)
3. Woo, S., Park, J., Lee, J.Y., et al.: CBAM: convolutional block attention module. In: Proceedings of the 15th European Conference on Computer Vision, Munich, pp. 3–19 (2018)
4. Zhu, R., Huang, X., Huang, X., Li, D., Yang, Q.: An on-site-based opportunistic routing protocol for scalable and energy-efficient underwater acoustic sensor networks. Appl. Sci. 12(23), 12482 (2022)
5. Berman, M., Triki, A.R., Blaschko, M.B.: The Lovász-Softmax loss: a tractable surrogate for optimizing the intersection-over-union measure in neural networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4413–4421 (2018)
6. Boukdir, A., Benaddy, M., Ellahyani, A., et al.: Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks. Arab. J. Sci. Eng. 47, 2187–2199 (2022)
7. Oz, C., Leu, M.C.: American Sign Language word recognition with a sensory glove using artificial neural networks. Eng. Appl. Artif. Intell. 24(7), 1204–1213 (2011)
8. Camgoz, N.C., Koller, O., Hadfield, S., et al.: Sign language transformers: joint end-to-end sign language recognition and translation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10020–10030 (2020)
9. Jin, X., Lan, C.L., Zeng, W.J., et al.: Style normalization and restitution for generalizable person re-identification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3140–3149. IEEE, Seattle, WA, USA (2020)
10. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv:1804.02767 (2018)
11. Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
12. Guo, X.J., Sui, H.D.: Application of improved YOLOv3 in foreign object debris target detection on airfield pavement. Comput. Eng. Appl. 57(8), 249–255 (2021)
13. Chao, H.Q., He, Y.W., Zhang, J.P., et al.: GaitSet: regarding gait as a set for cross-view gait recognition. Proceedings of the AAAI Conference on Artificial Intelligence 33, 8126–8133 (2019)
14. Zheng, H.L., Wu, Y.J., Deng, L., et al.: Going deeper with directly-trained larger spiking neural networks. Proceedings of the AAAI Conference on Artificial Intelligence 35(12), 11062–11070 (2021)
15. Guo, D., Zhou, W.G., Wang, M., et al.: Hierarchical LSTM for sign language translation. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, pp. 6845–6852 (2018)
16. Yu, S.Q., Tan, D.L., Tan, T.N.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: 18th International Conference on Pattern Recognition (ICPR'06), pp. 441–444. IEEE, Hong Kong, China (2006)
17. Camgoz, N.C., Hadfield, S., Koller, O., et al.: Neural sign language translation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7784–7793. IEEE Computer Society, Piscataway, NJ (2018)
18. Zhang, S.J., Zhang, Q.: Sign language recognition based on global-local attention. J. Vis. Commun. Image Represent. 80(7), 103280 (2021)
19. Ren, Z., Zhang, Y., Wang, S.: A hybrid framework for lung cancer classification. Electronics 11(10), 1614 (2022)
20. Wang, W., Pei, Y., Wang, S.H., Górriz, J.M., Zhang, Y.D.: PSTCNN: explainable COVID-19 diagnosis using PSO-guided self-tuning CNN. Biocell
Research on Application of Deep Learning in Esophageal Cancer Pathological Detection

Xiang Lin(B), Zhang Juxiao, Yin Lu, and Ji Wenpei

Nanjing Normal University of Special Education, Nanjing, Jiangsu, China
[email protected]
Abstract. As the “gold standard” of tumor diagnosis, pathological diagnosis is more reliable than analytical diagnosis, ultrasound, CT, nuclear magnetic resonance, etc. This paper focuses on detecting esophageal cancer from pathological slice images, combined with deep learning, to obtain reliable detection results intelligently. A data set is built by collecting and labeling pathological slices of esophageal cancer at varying stages for model training and verification. By comparing the performance of multiple models, ResNet50 is chosen as the network model. The model is pre-trained on ImageNet and a public breast cancer data set and then transferred to the task of esophageal cancer detection. The original data set is enlarged by data augmentation to improve accuracy and effectively avoid over-fitting. Experimental results show that the test accuracy reaches 0.950, which demonstrates the feasibility of deep learning for esophageal cancer detection with pathological slice images.

Keywords: Esophageal Cancer Detection · Deep Learning · Transfer Learning · Data Augmentation
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 95–105, 2024. https://doi.org/10.1007/978-3-031-50580-5_9

1 Introduction

Statistical data show that the incidence rate of cancer has increased year by year in many cities of China in recent years. Taking Huai'an City in Jiangsu province as an example, cancer accounts for 30% to 40% of deaths from illness every year. Among cancer deaths, esophageal cancer ranks first. Analysis suggests that the rising incidence of esophageal cancer among Huai'an citizens is caused not only by environmental factors and increasingly serious chemical pollution but also by unhealthy eating habits. Citizens of Huai'an tend to eat salty, pickled and spicy food and to drink excessively. The nitrite abundant in pickled food and unbalanced nutrition are important factors in inducing esophageal cancer. In addition, one of the main reasons for the rising mortality rate of esophageal cancer is that most cancers are already in the middle or late stages when discovered, and it is difficult for patients at this stage to obtain fundamentally effective treatment. However, early-stage esophageal cancer is hard to detect accurately because the lesion is small, so some patients cannot get an accurate diagnosis in time
and miss the best treatment opportunities [1]. If the cancer is detected at an early stage, survival rates improve greatly. Pathological diagnosis, in which the pathological characteristics of organ tissue structure and cells are observed under the microscope, has always been seen as the “gold standard” of tumor diagnosis. Pathologists are called “doctors among doctors” [2]. Because pathological diagnosis acts directly on cytopathic conditions, it is more reliable than analytical diagnosis based on medical records and symptoms, and more convincing than clinical diagnosis by means of ultrasound, CT, nuclear magnetic resonance, etc. However, the number of pathologists is seriously insufficient, at fewer than 10,000, and most of them are concentrated in economically developed areas; some remote and poor areas have hardly any pathologists. Tumors have a high incidence, and the incidence rate of cancer has little correlation with region. The number of cases that pathologists need to deal with is close to six times the number of pathologists, and the proportion is still rising. Therefore, it is effective and helpful for pathologists to hand over the pre-screening of tumor pathological images to computers, which can not only reduce the workload of pathologists but also improve the accuracy and efficiency of detection [3].

In this paper, pathological section images, which are more reliable for tumor detection, are used for esophageal cancer detection in combination with deep learning. Given the outstanding performance of deep learning in medical image recognition, esophageal cancer pathological section images are detected in a timely manner by selecting a proper network model and improving its performance according to the specific esophageal cancer data.
The main contributions of this paper are as follows: (1) Select an appropriate deep learning model for esophageal cancer detection; (2) Build and label our own esophageal cancer dataset; (3) Introduce transfer learning to reduce training duration and difficulty and so improve training efficiency; (4) Enlarge the dataset by data augmentation to improve model performance. The rest of the paper is organized as follows. Section 2 reviews related research. Section 3 lists the classical network models and the technology that we use later. Section 4 outlines the esophageal cancer dataset we collected. The verification experiments are given in Sect. 5. Finally, a conclusion is drawn in Sect. 6.
2 Related Research

In recent years, deep learning has made breakthrough achievements in various medical fields, especially in the application of medical images [4–6], including radiation oncology diagnosis [7], classification of skin cancer [8], diabetic retinopathy [9], histological classification of biopsy specimens [10], and description of colorectal diseases with cell endoscope [11]. Research on image diagnosis of esophageal cancer based on deep learning has just emerged at home and abroad [12–15]. The use of esophageal endoscopic images combined with deep learning technology to diagnose esophageal cancer can locate the disease more accurately, providing doctors with
an effective means of auxiliary recognition. Deep learning has also been applied to the detection of esophageal cancer in CT images [16], achieving good recognition results. Owing to the lack of a public pathological image library, research on applying deep learning to the classification and detection of pathological images of esophageal cancer is relatively scarce. At present, the more mature research on cancer detection in pathological images includes detection technology for gastric cancer pathological images [17] and for breast cancer pathological images [18]. With large-scale, long-term training of deep learning models on big pathological image data, the detection and recognition of tumor pathological images have achieved high accuracy. In 2017, Google used deep learning to achieve an accuracy of more than 99% in the early diagnosis of breast cancer [19], a breakthrough in the application of deep learning to medical images.
3 Classical Network Models and Transfer Learning

3.1 Classical Network Models

The convolutional neural network (CNN) was first proposed by LeCun of New York University in 1998. At the same time, the first CNN model, LeNet-5 [20], was designed to classify handwritten digit images, which was the first time that a CNN could be widely used in industrial practice. Its network structure is shown in Fig. 1.
Fig. 1. The structure of LeNet-5 [20]
LeNet-5 consists of convolutional layers and a fully connected layer. Local connection and weight sharing greatly reduce the number of network parameters, and subsampling reduces the data dimension. Convolutional kernels in the convolutional layers extract local features, from which images are recognized in the fully connected layer.

In 2012, AlexNet [21] won the ImageNet championship with an unprecedented top-5 error rate of 15.4%, far lower than the previous best top-5 error rate of 26.2%, creating a great sensation. AlexNet is a network with 8 layers, whose structure is deeper and more complex than that of LeNet. Moreover, a dropout layer is added after the fully connected layers to avoid overfitting, which yields outstanding performance. The success of AlexNet promoted the development of deep learning.

In 2014, Karen Simonyan and Andrew Zisserman of Oxford University built a 16-layer CNN model, which reduced the top-5 error rate to 7.3%. This model is VGG [22]. VGG improves the
network performance by adding more layers to make the network structure deeper. It demonstrates in practice the importance of “depth” to deep learning models. VGG basically follows the design idea of AlexNet and pursues depth to the end, being twice as deep as AlexNet. Unlike AlexNet, VGG uses convolutional kernels of a single size, 3 × 3. Two stacked 3 × 3 kernels have the same receptive field as one 5 × 5 kernel, and three stacked 3 × 3 kernels have the same receptive field as one 7 × 7 kernel, but with far fewer parameters. Comparing three 3 × 3 kernels with one 7 × 7 kernel over the same receptive field, the former has 27 parameters and the latter 49. The former not only has about 45% fewer parameters (equivalently, the 7 × 7 kernel has about 81.5% more) but also has two more nonlinear operations, which helps the network learn and can also accelerate its convergence.

In 2015, Google proposed GoogleNet [23], which uses the Inception module and reduced the top-5 error rate to 6.7%. It broke with the traditional method of increasing network depth and width, and transformed full connections and some convolution modules into sparse connections.

Kaiming He and colleagues at Microsoft Research proposed ResNet [24] (Residual Neural Network). This model won the large-scale image classification competition ILSVRC 2015 with a top-5 error rate of 3.57%, the best at present. At the same time, it has fewer parameters than VGG yet shows outstanding performance. The main idea of ResNet is to add a direct connection channel to the network, that is, the idea of the Highway Network [25].

3.2 Transfer Learning

As is well known, the size of the training data is key to the training effect of neural networks. The esophageal cancer dataset we collected is much smaller than what big-data training requires. Transfer learning is therefore introduced in this paper to enhance network performance. Transfer learning is a machine learning method that takes the model developed for Task A as the starting point and reuses it in developing the model for Task B. The focus of transfer is to bridge the gap in data characteristics and feature distribution between the source domain and the target domain. Since both are cancer pathological images, esophageal cancer and breast cancer data have similar distributions. Before training the network directly on the data we collected, we first train it on the public breast cancer data set and then transfer the training results to the esophageal cancer data to solve the problem of insufficient training data.
4 Dataset

4.1 Data Collection

A total of 1524 pathological images were collected from 720 patients in 2011–2017, involving squamous cell carcinoma and adenocarcinoma, as well as pathological section images of chronic inflammation of esophageal and cardiac mucosa. The collected pathological sections were observed and photographed with Olympus BX 50 optical microscope, collected with HMIAS-2000 high-definition full-automatic color image analysis
system, and the positive staining area and staining intensity were analyzed. The results were interpreted by the double-blind method and evaluated by two pathologists independently. The whole data set is divided into 5 kinds of cell carcinoma, a suspected squamous cell carcinoma and 2 kinds of chronic inflammation. In this paper, we focus on the classification of benign and malignant tumors, so all cell carcinomas, including the suspected ones, are labeled as malignant (M), while the 2 kinds of chronic inflammation are labeled as benign (B) (as listed in Table 1).

Table 1. Data classification labels and the number of images

| Classification Labels | The Number of Images |
|---|---|
| Poorly differentiated squamous cell carcinoma (M) | 123 |
| Moderately and poorly differentiated squamous cell carcinoma (M) | 67 |
| Moderately differentiated squamous cell carcinoma (M) | 263 |
| Moderately well differentiated squamous cell carcinoma (M) | 141 |
| Well differentiated squamous cell carcinoma (M) | 113 |
| Chronic inflammation of esophageal mucosa (B) | 86 |
| Squamous epithelial hyperplasia, keratosis and chronic inflammation (B) | 541 |
| Suspected squamous cell carcinoma (M) | 190 |
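The class balance implied by Table 1 can be tallied directly. In the sketch below, the malignant and benign labels follow the rule stated in the text of Sect. 4.1 (all carcinomas, including the suspected ones, are malignant; the two kinds of chronic inflammation are benign); the dictionary layout is only an illustrative choice.

```python
# (number of images, label) per class; labels follow the rule in Sect. 4.1:
# carcinomas (including suspected) -> "M", chronic inflammation -> "B".
class_counts = {
    "Poorly differentiated squamous cell carcinoma": (123, "M"),
    "Moderately and poorly differentiated squamous cell carcinoma": (67, "M"),
    "Moderately differentiated squamous cell carcinoma": (263, "M"),
    "Moderately well differentiated squamous cell carcinoma": (141, "M"),
    "Well differentiated squamous cell carcinoma": (113, "M"),
    "Chronic inflammation of esophageal mucosa": (86, "B"),
    "Squamous epithelial hyperplasia, keratosis and chronic inflammation": (541, "B"),
    "Suspected squamous cell carcinoma": (190, "M"),
}

totals = {"M": 0, "B": 0}
for count, label in class_counts.values():
    totals[label] += count
total_images = sum(totals.values())

print(totals, total_images)
```

The per-class counts sum to the 1524 images reported in Sect. 4.1, with 897 malignant and 627 benign images under this labelling.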
4.2 Data Augmentation

To meet deep learning's requirement for a large sample size, data augmentation was applied to the digital pathology image library, including rescaling, horizontal/vertical rotation, horizontal/vertical translation, horizontal/vertical flipping, and filling. Table 2 lists the generator parameters and set values of the data augmentation method used in this paper. Figure 2 compares the original image with the augmented images produced by these transformations.

Table 2. List of data augmentation parameters

| Parameters | Values |
|---|---|
| rescale | 1.0/255 |
| rotation_range | 90 |
| width_shift_range | 0.3 |
| height_shift_range | 0.3 |
| vertical_flip | True |
| horizontal_flip | True |
| fill_mode | wrap |
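The parameter names in Table 2 match the keyword arguments of the Keras `ImageDataGenerator` class, so the settings can be written down as a keyword dictionary. That the authors used this class is an assumption; the sketch only records the values from Table 2 and does not import Keras, so it runs without it.

```python
# Values copied from Table 2.
augmentation_params = {
    "rescale": 1.0 / 255,        # map 8-bit pixel values into [0, 1]
    "rotation_range": 90,        # random rotation of up to 90 degrees
    "width_shift_range": 0.3,    # horizontal shift, fraction of width
    "height_shift_range": 0.3,   # vertical shift, fraction of height
    "vertical_flip": True,
    "horizontal_flip": True,
    "fill_mode": "wrap",         # fill exposed pixels by wrapping the image
}

# Assumption: with TensorFlow/Keras installed, these could be passed as
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   datagen = ImageDataGenerator(**augmentation_params)

# Effect of the rescale factor on the brightest 8-bit pixel value.
print(255 * augmentation_params["rescale"])
```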
Fig. 2. Original image and images by data augmentation, panels (a)–(i)
4.3 BreakHis Breast Cancer Pathological Image Dataset

The BreakHis breast cancer pathological image dataset is a widely used open data set released by Spanhol et al. in 2016. It contains 7909 breast tissue pathological images from 82 patients, divided into 5429 malignant tumor images and 2480 benign tumor images (as listed in Table 3).

Table 3. Distribution of images in BreakHis Dataset

| Magnification Factors | Malignant (M) | Benign (B) | Total |
|---|---|---|---|
| 40X | 1370 | 625 | 1995 |
| 100X | 1437 | 644 | 2081 |
| 200X | 1390 | 623 | 2013 |
| 400X | 1232 | 588 | 1820 |
5 Experiments and Results

The experiments were run on a personal computer with an Intel Core i5-8300 CPU, an NVIDIA GTX 1060 with Max-Q Design, and 16 GB of memory, under Windows 10.

5.1 Experiment I: Model Selection

For the best classification results, we compared three deep neural network models that have achieved excellent results in pathological image diagnosis in recent years: the traditional AlexNet, Inception V3 [14] based on GoogleNet, and ResNet. The performance of these three networks is preliminarily tested on the public BreaKHis breast cancer data set, which also consists of pathological slice images. We selected 1995 breast cancer images with magnification of 40 times
and 8 classifications, including 1195 training examples, 400 validation examples and 400 test examples. SGD is used as the optimizer with a learning rate of 0.001, a momentum of 0.5 and Nesterov momentum enabled; the batch size is 32, and the number of epochs is set to 50.
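To make the optimizer setting concrete, the sketch below applies Nesterov-momentum SGD with the stated learning rate (0.001) and momentum (0.5) to a one-dimensional toy objective. The objective f(w) = (w − 3)², the step count, and the function names are assumptions for illustration; the update shown is the standard look-ahead form of Nesterov momentum, which may differ in detail from a given framework's implementation.

```python
def nesterov_sgd_step(w, v, grad_fn, lr=0.001, momentum=0.5):
    """One Nesterov step: the gradient is evaluated at the look-ahead point."""
    lookahead = w + momentum * v
    g = grad_fn(lookahead)
    v_new = momentum * v - lr * g
    return w + v_new, v_new

def toy_gradient(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
for _ in range(5000):
    w, v = nesterov_sgd_step(w, v, toy_gradient)

print(w)
```

With such a small learning rate the iterate approaches the minimum at w = 3 slowly, which is why several thousand steps are used in this toy run.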
Fig. 3. Training results of (a) AlexNet, (b) InceptionV3 and (c) ResNet50 on BreaKHis
Figure 3 shows the training results of AlexNet, InceptionV3 and ResNet50 (from left to right) on the breast cancer data set BreaKHis. The training curves of error and accuracy show that AlexNet is far inferior to the other two networks in both respects. Although Inception V3 reduces the error to a lower level, its accuracy is not as good as that of ResNet50. ResNet50 therefore outperforms the other two networks in terms of training error and accuracy. Because accuracy is the priority, ResNet50 is selected as the main research network in this paper.

5.2 Experiment II: Transfer Learning

Training a network takes a long time, especially a deep neural network with a complex structure. Before turning to the esophageal cancer data, we examine the classification performance of the selected ResNet50 with randomly initialized weights on the breast cancer images at 40× magnification. Figure 4(a) shows the training process of ResNet50. The training loss declines very slowly: after 50 epochs, it has dropped only to about 0.85. The training accuracy (train acc) rises noticeably in the first few epochs but then remains almost unchanged. After training, the test error (loss) was 1.443, and the test accuracy was 0.496. This generalization ability cannot meet the requirements of practical use. Training takes about 40 min.

For comparison, we transfer a ResNet50 model pre-trained on ImageNet. The model is re-trained on the same data set with the same hyperparameters, and the results are shown in Fig. 4(b). With transfer learning, the error declines and the accuracy rises much faster. After training, the test error is 0.808, and the test accuracy is 0.705, which is much better than the result with randomly initialized weights.
When all the convolution layers are frozen, it takes an average of 12 s to train an epoch, and the total time is about 10.5 min. Compared with random initialization, transfer learning greatly reduces the training time. More time can therefore be spent on fine-tuning the network parameters rather than on training the network model, which greatly reduces the training difficulty and improves the training efficiency.

Fig. 4. ResNet50 training with transfer learning: (a) randomly initialized, (b) after transfer learning
Fig. 5. ResNet50 training on breast cancer data set: (a) 100 epochs, (b) 200 epochs

Table 4. Training parameters and accuracy on breast cancer data set

| train_b_s | val_b_s | train_step | val_step | epochs | loss | accuracy |
|---|---|---|---|---|---|---|
| 8 | 2 | 120 | 40 | 100 | 0.791 | 0.856 |
| 8 | 2 | 120 | 50 | 100 | 0.384 | 0.892 |
| 8 | 2 | 120 | 50 | 200 | 0.424 | 0.880 |
| 8 | 4 | 40 | 40 | 200 | 0.492 | 0.846 |
Likewise, the ResNet50 model is first trained on the breast cancer data set, and the trained model is then transferred to the classification of esophageal cancer. Figure 5 shows the training results after 100 and 200 epochs. Although the error curve oscillates over a large range, this does not affect the convergence of training. Table 4 lists the parameters and accuracy of four randomly chosen runs. The average accuracy is higher than 0.85.
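The claim about the average accuracy can be checked against the four runs in Table 4. The sketch below simply averages the reported values; the variable names are illustrative.

```python
from statistics import mean

# Test loss and accuracy of the four runs reported in Table 4.
table4_loss = [0.791, 0.384, 0.424, 0.492]
table4_accuracy = [0.856, 0.892, 0.880, 0.846]

mean_loss = mean(table4_loss)
mean_accuracy = mean(table4_accuracy)

print(f"mean loss = {mean_loss:.4f}, mean accuracy = {mean_accuracy:.4f}")
```

The mean accuracy over the four runs is about 0.869, consistent with the statement that it exceeds 0.85.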
5.3 Experiment III: Data Augmentation

Starting from the ResNet50 model pre-trained on the breast cancer data set, the network is first trained directly on the original esophageal cancer images without data augmentation. Figure 6(a) shows the training results. The training records show very serious over-fitting after about 10 epochs. The training accuracy (train acc) comes arbitrarily close to 1, but the validation accuracy (val acc) stops improving. The training error (train loss) approaches 0 as training proceeds, while the validation error (val loss) starts to rise rapidly after a small decline. After training, the test error is 2.486, and the test accuracy is 0.634. Because of the large test error, the test accuracy is not convincing, and the generalization performance is poor.

Figure 6(b) and (c) show the training process of the transferred network on the augmented esophageal cancer dataset after 50 and 200 epochs, respectively. The over-fitting of the model is clearly suppressed after data augmentation, and the test results are basically consistent with the training results. The test error is 1.582, and the test accuracy is 0.560. However, the test results show that the generalization ability of the model is far from sufficient: a test accuracy of 0.56 does not reach a usable standard. Even after 200 epochs rather than 50, the accuracy is still too low, and over-fitting also occurs in the late stage.
Fig. 6. Training on the original and augmented esophageal cancer datasets: (a) original, (b) 50 epochs, (c) 200 epochs
The main reason for this phenomenon is that the amount of data is too small to meet the deep network’s requirement for large-scale training samples. Data augmentation is used to address this. By setting different scaling coefficients, the esophageal cancer dataset is augmented to 2420 (Data augment I) and 4840 (Data augment II) samples. As shown in Fig. 7(a), the training results after data augmentation are greatly improved compared with the original, un-augmented data set: the training error drops to a lower level, and the accuracy also improves greatly. The performance on the validation set, however, shows that the network still overfits, indicating that the distribution of the dataset is not yet ideal. As shown in Fig. 7(b), overfitting is resolved after large-scale data augmentation, and the results on the validation set are broadly consistent with those on the training data. The test error is 0.189, and the test accuracy is 0.950 (as listed in Table 5). The test results show that data augmentation can effectively improve the performance of the network and greatly increase the classification accuracy.
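As a rough illustration of augmentation by scaling coefficients, the sketch below enlarges a toy data set by adding one rescaled copy of each image per coefficient. The helper names, the nearest-neighbour resampling, the coefficients, and the toy image are all assumptions made for illustration; the paper does not state which coefficients or interpolation were used.

```python
def rescale_nearest(image, scale):
    """Resize a 2-D grayscale image (list of rows) by `scale` with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    return [
        [image[min(height - 1, int(r / scale))][min(width - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def augment_by_scaling(images, scales):
    """Return the original images plus one rescaled copy per scaling coefficient."""
    augmented = list(images)
    for image in images:
        for scale in scales:
            augmented.append(rescale_nearest(image, scale))
    return augmented


# Toy 2x2 "image"; real pathology images would be loaded from disk instead.
toy_images = [[[0, 1], [2, 3]]]
augmented = augment_by_scaling(toy_images, scales=[0.5, 2.0])
```

With two coefficients, every original image yields two extra samples, so the data set triples. In a real pipeline the rescaled images would normally be cropped or padded back to the input size expected by the network.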
Fig. 7. Training after data augmentation: (a) Data augment I, (b) Data augment II

Table 5. Training parameters and test loss with accuracy on esophageal cancer data set

| train_b_s | val_b_s | train_step | val_step | epochs | loss | accuracy |
|---|---|---|---|---|---|---|
| 8 | 2 | 120 | 40 | 100 | 0.337 | 0.901 |
| 8 | 2 | 120 | 50 | 100 | 0.218 | 0.922 |
| 8 | 2 | 120 | 50 | 200 | 0.203 | 0.945 |
| 8 | 4 | 40 | 40 | 200 | 0.189 | 0.950 |
6 Conclusion

This paper studies the classification of esophageal cancer pathological images with deep learning. After comparing three classical models, AlexNet, InceptionV3, and ResNet50, ResNet50 is chosen as the training model because it performs best. Transfer learning is introduced in place of random initialization to reduce training time. Rather than training on the original data, data augmentation is used to improve accuracy and effectively avoid overfitting. The test error on the validation set is 0.189, and the test accuracy is 0.950. The classification results verify the effectiveness of the network model on the augmented dataset. However, the problem of insufficient data remains, and a model that distinguishes only eight classes is somewhat inadequate for practical application. In future work, we will continue collecting esophageal cancer pathological images to enlarge the data set and meet the demand of deep training for large amounts of data. We will also attempt to design suitable deep network structures, or to integrate multiple well-performing networks, to achieve better pathological detection of esophageal cancer.
Workshop 2: Intelligent Application in Education
Coke Quality Prediction Based on Blast Furnace Smelting Process Data

ShengWei Zhang, Xiaoting Li, Kai Yang, Zhaosong Zhu(B), and LiPing Wang

Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected]
Abstract. Coke is the main material of blast furnace smelting. The quality of coke directly determines the quality of the finished products of blast furnace smelting, so coke quality is often evaluated from the quality of those finished products. However, evaluating coke quality only after the finished product is available is impractical. It is therefore of great significance to establish an artificial intelligence model that predicts quality from the indicators of the coke itself. Starting from an actual production case, this paper takes the indicators of coke as the feature vector and the quality of the finished product as the label, and establishes several artificial intelligence models. These models are used to predict coke quality, and the related algorithms are compared and discussed, which lays a foundation for further algorithm improvement.

Keywords: Coke quality · Artificial intelligence algorithm · Prediction model
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024
Published by Springer Nature Switzerland AG 2024. All Rights Reserved
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 109–118, 2024. https://doi.org/10.1007/978-3-031-50580-5_10

1 Introduction

Coke is the main material of blast furnace smelting. The coke used in blast furnace smelting must meet the requirements of low ash, low sulfur, and good thermal stability. Many factors affect the quality of coke: according to the daily production reports provided by a steel company in Anhui Province, there are more than 70 important factors. In addition to coal indicators such as moisture, coal degree, ash, sulfur, fineness, Y, and G, the coking process itself is an important factor, including fire temperature, coking time, coking temperature, coking straight temperature, and flue suction. These parameters affect not only the quality of coke but also the coking energy consumption [1]. Researchers at home and abroad have proposed a variety of prediction methods for coke quality indicators, but the models differ from one another and lack generality. The reason lies not only in differences in coal quality but also in differences in the coking process and operating conditions among coking enterprises. At the same time, there has been little research at home or abroad on reducing coking energy consumption. Xu Shunguo [2] of Meishan Iron and Steel Co., Ltd., China, proposed relying on technological advances to reduce energy consumption in the coking process: increasing the degree of automation and the scale of coke ovens, selecting suitable coke quenching technology, and optimizing the heating system can reduce the energy consumption of the coke-making process. Shu Guang [3] of Wuhan University of Science and Technology proposed the “coking recovery energy flow ordering process”. This process solves the problems of high energy consumption and high pollution in the coking recovery process.

The coking process involves complex physical and chemical reactions. The relationship between the coke quality indexes and the blended-coal indexes cannot be expressed directly by a linear function, and a mechanism model cannot be constructed directly from the technological process. This paper uses the 2019–2021 production data of a coking plant of a steel company in Anhui, China. Feature engineering methods such as the Pearson correlation coefficient are used to screen the influencing factors. The reaction process is modeled with a method that combines mechanism and data, striving to describe the mechanism with a linear model so as to guide production. On this basis, the coke quality indexes are predicted by machine learning. Based on data from the coke production process, this paper analyzes the correlation between the main factors affecting the coke quality indexes and the coking process. The coke quality prediction method based on the production process and linear regression can effectively predict whether the coke meets the requirements. The prediction results provide a basis for improving the coking technology and a reference for field application at a steel plant in Anhui Province, China.
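Screening influencing factors with the Pearson correlation coefficient can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the threshold of 0.5, and the toy values are hypothetical, and the paper does not report which threshold was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def screen_features(columns, target, threshold):
    """Keep the names of features whose |r| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Hypothetical toy values, not taken from the paper's production data.
columns = {
    "coal_ash": [9.1, 9.4, 9.8, 10.1, 10.5],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}
coke_ash = [12.0, 12.3, 12.9, 13.2, 13.8]
selected = screen_features(columns, coke_ash, threshold=0.5)
```

Here the strongly correlated column is kept and the weakly correlated one is dropped. Pearson screening only captures linear dependence, which is one reason to combine it with other feature engineering methods.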
2 Lasso Regression and Machine Learning

Because there are many characteristic factors, some of them are correlated with each other. A linear regression model applied directly offers strong interpretability, but it often suffers from data noise and multicollinearity between variables. This not only reduces the accuracy of the model but also makes the model parameters unstable, so that the coefficients of important features become inconsistent with the theoretical trend and the interpretation fails. Lasso regression, also known as L1-regularized linear regression, is an improved model that handles variable correlation and filters variables. Its objective function is shown in Eq. 1.

$$\underset{W}{\arg\min}\ \|W^{T}X - Y\|_2^2 + \lambda \|W\|_1 \tag{1}$$

This objective function consists of two parts. The first part is the error term, which is the same as in linear regression. The second part is the regularization term, which constrains the model parameters. Because it uses the L1 regularization term, Lasso regression has a strong feature selection ability. By tuning the parameter $\lambda$, the coefficients of unimportant features in the model are driven to 0, which eliminates the correlation introduced by these variables.

2.1 RVFL-NLNB Algorithm

RVFL-NLNB is an important variant of RVFL. It has no direct input-output link and no output bias term. It differs from neural networks trained by gradient descent in that it randomly assigns the weights of the input layer and then calculates the output layer weights using the Moore-Penrose generalized inverse matrix [4]. Given a data set as shown in Eq. 2, the output of the RVFL-NLNB model can be expressed as Eq. 3.

$$Z = \{(x_i, t_i) \mid x_i \in \mathbb{R}^{d_1},\ t_i \in \mathbb{R}^{d_2},\ i = 1, 2, \ldots, N\} \tag{2}$$

$$f_{n,i}(x_i) = \sum_{j=1}^{n} \beta_j g_j(w_j, x_i, b_j), \quad w_j \in \mathbb{R}^{d_1},\ b_j \in \mathbb{R} \tag{3}$$
where $n$ is the number of neurons in the hidden layer, $x_i$ is the $i$th input vector, $t_i$ is the $i$th output vector, $w_j$ and $b_j$ are the weights and bias term of the $j$th hidden neuron, and $\beta_j$ is the output weight of the $j$th hidden neuron. In addition, the activation function $g$ satisfies Eq. 4.

$$\int_{\mathbb{R}} g^{2}(x)\,dx < \infty \quad \text{or} \quad \int_{\mathbb{R}} [g(x)]^{2}\,dx < \infty \tag{4}$$

The learning task of RVFL is to randomly determine $w_j$ and $b_j$ and to find the optimal $\beta_j$ satisfying Eq. 5.

$$f_{n,i}(x_i) = \sum_{j=1}^{n} \beta_j g_j(w_j, x_i, b_j) = t_i \tag{5}$$

The matrix form of this formula is given in Eq. 6,

$$H\beta = T \tag{6}$$

where $H$, $\beta$ and $T$ are given by Eqs. 7, 8, and 9, respectively.

$$H = \begin{bmatrix} g_{11}(\langle w_1, x_1\rangle + b_1) & \cdots & g_{n1}(\langle w_n, x_1\rangle + b_n) \\ \vdots & \ddots & \vdots \\ g_{1N}(\langle w_1, x_N\rangle + b_1) & \cdots & g_{nN}(\langle w_n, x_N\rangle + b_n) \end{bmatrix}_{N \times n} \tag{7}$$

$$\beta = [\beta_1^{T}\ \ldots\ \beta_n^{T}]^{T}_{n \times d_2} \tag{8}$$

$$T = [t_1^{T}\ \ldots\ t_N^{T}]^{T}_{N \times d_2} \tag{9}$$

After $w_j$ and $b_j$ are randomly assigned, the optimal solution $\hat{\beta}$ can be obtained by least squares, as shown in Eq. 10,

$$\hat{\beta} = \underset{\beta}{\arg\min}\ \|H\beta - T\|^{2} = (H^{T}H)^{-1}H^{T}T \tag{10}$$

where $(H^{T}H)^{-1}H^{T}$ is also known as the Moore-Penrose generalized inverse matrix of $H$.
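The training procedure of Eqs. 6 to 10 can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: inputs are scalars (so the inner product of $w_j$ and $x_i$ reduces to a product), the activation is a sigmoid, the target is a toy sine curve, and the helper names are hypothetical. The normal equations are solved by Gaussian elimination rather than by an explicit matrix inverse.

```python
import math
import random


def hidden_matrix(xs, weights, biases):
    """Build H (N x n): entry (i, j) is sigmoid(w_j * x_i + b_j), as in Eq. 7."""
    return [[1.0 / (1.0 + math.exp(-(w * x + b))) for w, b in zip(weights, biases)]
            for x in xs]


def solve_linear(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def fit_output_weights(h, t):
    """Least-squares output weights from the normal equations (H^T H) beta = H^T T."""
    n = len(h[0])
    hth = [[sum(row[j] * row[k] for row in h) for k in range(n)] for j in range(n)]
    htt = [sum(row[j] * ti for row, ti in zip(h, t)) for j in range(n)]
    return solve_linear(hth, htt)


random.seed(0)
xs = [i / 10.0 for i in range(-20, 21)]               # N = 41 scalar inputs
ts = [math.sin(x) for x in xs]                        # toy targets, d2 = 1
weights = [random.uniform(-2, 2) for _ in range(4)]   # random and never trained
biases = [random.uniform(-2, 2) for _ in range(4)]
H = hidden_matrix(xs, weights, biases)
beta = fit_output_weights(H, ts)
predictions = [sum(b * g for b, g in zip(beta, row)) for row in H]
residual = [p - t for p, t in zip(predictions, ts)]
```

Only `beta` is learned; the hidden weights and biases keep their random values. At the least-squares solution the residual is orthogonal to every column of $H$, which is a convenient check on the implementation.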
2.2 I-RVFL-NLNB Algorithm

The traditional I-RVFL-NLNB algorithm can be briefly described as follows [5]. A sample set consisting of $N$ samples is denoted by Eq. 11, and the output of the SLFN network with $n-1$ hidden neurons is denoted by Eq. 12.

$$Z = \{(x_i, t_i) \mid x_i \in \mathbb{R}^{d_1},\ t_i \in \mathbb{R}^{d_2},\ i = 1, 2, \ldots, N\} \tag{11}$$

$$f_{n-1,i}(x_i) = \sum_{j=1}^{n-1} \beta_j g_j(w_j, x_i, b_j), \quad w_j \in \mathbb{R}^{d_1},\ b_j \in \mathbb{R} \tag{12}$$

Let $L_2(X)$ be the space of functions $f$ on a compact set $X$ of $\mathbb{R}^{d}$ that satisfy Eq. 13. For $u, v \in L_2(X)$, the inner product $\langle u, v\rangle$ is defined by Eq. 14.

$$\int_{X} |f(x)|^{2}\,dx < \infty \tag{13}$$

$$\langle u, v\rangle = \int_{X} u(x)\,v(x)\,dx \tag{14}$$

Let $\|\cdot\|$ denote the norm of the $L_2(X)$ space. The distance between $f$ and $f_{n-1}$ in $L_2(X)$ is then given by Eq. 15.

$$\|f_{n-1} - f\| = \left[\int_{X} |f_{n-1}(x) - f(x)|^{2}\,dx\right]^{1/2} \tag{15}$$

According to the theory of the I-RVFL-NLNB algorithm, given any bounded piecewise continuous function $g(w, x, b): \mathbb{R} \to \mathbb{R}$, if Eq. 16 holds, then Eq. 17 holds for any $f_n$ and $f$, where $e_{n-1} = f - f_{n-1}$ is the residual of the network with $n-1$ hidden neurons.

$$\beta_n = \frac{\langle e_{n-1}, g_n\rangle}{\|g_n\|^{2}} \tag{16}$$

$$\lim_{n \to \infty} \|f_n - f\| = 0 \tag{17}$$
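The choice of $\beta_n$ in Eq. 16 can be motivated by a short least-squares argument. The derivation below is a sketch added for illustration; it assumes the residual is defined as $e_{n-1} = f - f_{n-1}$, which matches the update in Eq. 19, and that only the newest output weight is free while all earlier ones stay fixed.

```latex
\begin{aligned}
\|e_{n-1} - \beta g_n\|^{2}
  &= \|e_{n-1}\|^{2} - 2\beta\,\langle e_{n-1}, g_n\rangle + \beta^{2}\,\|g_n\|^{2}, \\
\frac{\partial}{\partial \beta}\,\|e_{n-1} - \beta g_n\|^{2} = 0
  &\;\Longrightarrow\; \beta_n = \frac{\langle e_{n-1}, g_n\rangle}{\|g_n\|^{2}}, \\
\|e_n\|^{2} = \|e_{n-1} - \beta_n g_n\|^{2}
  &= \|e_{n-1}\|^{2} - \frac{\langle e_{n-1}, g_n\rangle^{2}}{\|g_n\|^{2}} \;\le\; \|e_{n-1}\|^{2}.
\end{aligned}
```

Under these assumptions each added neuron can only keep or reduce the residual norm, and a neuron whose output is nearly orthogonal to the current residual reduces it very little.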
For a trained I-RVFL-NLNB, when the $j$th hidden neuron is added, all the existing output weights $\beta_k$, $k = 1, \ldots, j-1$, are fixed and only the new $\beta_j$ is computed.

2.3 I-I-RVFL-NLNB Algorithm

The problem with the traditional I-RVFL-NLNB is that many hidden neurons contribute little to fitting the objective function. These neurons greatly increase the dimensionality and computational complexity of the network; in other words, most neurons of I-RVFL-NLNB do not help to reduce the model error. The output weights of these non-contributing neurons can therefore be adjusted to simplify the calculation and the network structure. In this paper, a modified I-RVFL-NLNB (I-I-RVFL-NLNB) algorithm is proposed that updates the neurons of I-RVFL-NLNB one by one. The I-I-RVFL-NLNB algorithm first fixes the number of hidden neurons $n$ and then randomly assigns the weights $w$ and $b$ from a specific interval $[-\lambda, \lambda]$, $\lambda \in \mathbb{R}$, $\lambda > 0$. Next, the number of hidden neurons is gradually increased from 1 to $n$. Each update of a hidden neuron's output weight is counted as one iteration, and the training process of the algorithm is divided into the following stages.

In the first stage, the number of iterations $m$ does not exceed the number of hidden neurons $n$, so $m = j$, $j = 1, 2, \ldots, n$. The output weight $\beta_j$ is calculated by Eq. 18, the new residual $e_m$ is given by Eq. 19, and the output of the network is given by Eq. 20.

$$\beta_j = \frac{\langle e_{m-1}, g_j\rangle}{\|g_j\|^{2}} \tag{18}$$

$$e_m = e_{m-1} - \beta_j g_j \tag{19}$$

$$f_m = f_{m-1} + \beta_j g_j \tag{20}$$

In the second stage, the number of iterations $m$ exceeds the number of hidden neurons $n$ and the stopping condition has not been reached, so $m = n + j$, $j = 1, 2, \ldots, n$. The update of $\beta_j$ is given by Eq. 21,

$$\beta_j = \beta_j + \Delta\beta_j \tag{21}$$

where $\Delta\beta_j$ is the correction of $\beta_j$, calculated by Eq. 22.

$$\Delta\beta_j = \frac{\langle e_{m-1}, g_j\rangle}{\|g_j\|^{2}}, \quad m = n + j,\ j = 1, 2, \ldots, n \tag{22}$$

The new residual of the network output is given by Eq. 23.

$$e_m = e_{m-1} - \Delta\beta_j g_j, \quad m = n + j,\ j = 1, 2, \ldots, n \tag{23}$$

When $m = n + 1$, $e_{m-1} = e_{(n+1)-1} = e_n$. The output of the network after the $m$th iteration is given by Eq. 24.

$$f_m = f_{m-1} + \Delta\beta_j g_j \tag{24}$$

If the preset stopping condition is still not reached, $\beta_j$ continues to be updated in the next round according to the formulas of the second stage, now with $m = 2n + j$, $j = 1, 2, \ldots, n$; the residual $e_m$ and the output $f_m$ are calculated in the same way. The loop continues until the stopping condition is met.
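The two-stage update of Eqs. 18 to 24 can be sketched as repeated sweeps over a fixed set of hidden neurons. The code below is an illustrative reading of the algorithm, not the authors' implementation: the function name, the tanh activation, the scalar toy inputs, and the stopping rule (a whole sweep reduces the residual norm by less than `tol`) are assumptions, since the text does not specify the stopping condition.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_ii_rvfl(hidden_outputs, targets, max_sweeps=50, tol=1e-6):
    """Two-stage output-weight training following Eqs. 18-24.

    hidden_outputs[j] holds g_j evaluated on all N samples; the hidden weights
    are random and stay fixed. Returns (beta, history of residual norms).
    """
    n = len(hidden_outputs)
    beta = [0.0] * n
    residual = list(targets)                      # e_0 = t, because f_0 = 0
    history = [math.sqrt(dot(residual, residual))]
    for _ in range(max_sweeps):
        for j, g in enumerate(hidden_outputs):
            delta = dot(residual, g) / dot(g, g)  # Eq. 18 in sweep 1, Eq. 22 afterwards
            beta[j] += delta                      # Eq. 21 (beta starts from 0)
            residual = [e - delta * gi for e, gi in zip(residual, g)]  # Eq. 19 / Eq. 23
            history.append(math.sqrt(dot(residual, residual)))
        # Assumed stopping rule: the last full sweep barely reduced the residual.
        if history[-n - 1] - history[-1] < tol:
            break
    return beta, history


random.seed(1)
xs = [i / 10.0 for i in range(-20, 21)]
targets = [math.sin(x) for x in xs]
lam = 2.0                                          # weights drawn from [-lam, lam]
params = [(random.uniform(-lam, lam), random.uniform(-lam, lam)) for _ in range(5)]
hidden_outputs = [[math.tanh(w * x + b) for x in xs] for w, b in params]
beta, history = train_ii_rvfl(hidden_outputs, targets)
```

The first sweep corresponds to the first stage and every later sweep to the second stage. Because each update is the one-dimensional least-squares step for a single neuron, the recorded residual norms never increase.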
3 Determination of Variable Characteristics

The coke quality indicators considered in this paper are coke ash Ad, sulfur St,d, crushing strength M40, and wear strength M10; M40 and M10 are usually called the cold strength of coke. The coke quality indexes have a great influence on the blast furnace smelting process, and coke quality depends not only on the coal quality indexes but also on the production process. The ash Ad, sulfur St,d, and volatile matter Vdaf of the coal affect the ash Ad and sulfur St,d of the coke. The cold strength of coke is mainly related to the volatile matter Vdaf, the maximum thickness of the gelatinous layer Y, the bonding index G, and process conditions such as the straight temperature during coking. Ash Ad is an inert substance mixed between the carbon atoms, whereas the planar network structure formed between carbon atoms determines the strength of coke. Sulfur St,d forms oxides during coking that corrode steel, so the sulfur content should be as low as possible. Volatile matter Vdaf is an important parameter for evaluating coal quality and is used to assess the maturity of coke: if the volatile matter is too high, the coke tends to remain in a semi-coke or raw-coke state; if it is too low, the coke strength tends to drop. The maximum thickness of the gelatinous layer Y reflects the amount of colloid, and the bonding index G reflects the bonding and coking ability of the coking coal. The larger Y is, the higher the bonding index G is, and in general the higher G is, the stronger the coking property.

Commonly used coke prediction models are built on coal blending theory and production data. They select few variable features and do not consider the influence of complex process variables on the prediction results. To ensure the applicability of the model and strengthen its guidance for actual production, the actual production data of a steel company in Anhui Province are used as the input variables of the prediction model. In addition, coking process variables such as coking time and straight temperature are introduced for prediction. Based on the analysis of the coke quality indexes, the actual production data, and the process information, the flow chart of the coke quality prediction model is drawn in Fig. 1.
Fig. 1. Schematic diagram of coke quality prediction model process
Although Lasso adds an L1 regularization term, its fitting term is essentially linear regression. To find and introduce nonlinear features, this paper performs feature enhancement on some of the key features: adding high-order terms of the original features and cross-terms of multiple features increases the fitting ability of the model. Feature enhancement, however, increases the feature dimension. To filter out invalid feature terms, this paper uses the recursive feature elimination (RFE) method, an algorithm for finding an optimal subset of features. It first builds the model on the full set of features, then removes the features with low weights according to the parameters obtained by model training, and repeats this process until the feature set is stable, which yields the final feature subset.
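How the L1 term of Lasso discards an uninformative enhanced feature can be illustrated with a small coordinate-descent sketch. This is a generic textbook solver for the objective of Eq. 1 without an intercept, not the solver used in the paper; the function names, the toy data, and the value of `lam` are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; this is what produces exact zeros in Lasso."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(features, targets, lam, sweeps=200):
    """Minimise ||Xw - y||^2 + lam * ||w||_1 by cyclic coordinate descent.

    features is a list of columns (one list per feature), without intercept.
    """
    weights = [0.0] * len(features)
    residual = list(targets)                       # y - Xw with w = 0
    for _ in range(sweeps):
        for j, column in enumerate(features):
            norm_sq = sum(x * x for x in column)
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(x * (r + weights[j] * x) for x, r in zip(column, residual))
            new_w = soft_threshold(rho, lam / 2.0) / norm_sq
            residual = [r - (new_w - weights[j]) * x for x, r in zip(column, residual)]
            weights[j] = new_w
    return weights


# Hypothetical toy data: y is roughly 3 * x1 plus small noise.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x1_squared = [x * x for x in x1]                   # a high-order "enhanced" feature
y = [-5.9, -3.2, 0.05, 3.1, 5.9]
weights = lasso_coordinate_descent([x1, x1_squared], y, lam=1.0)
```

In this toy case ordinary least squares would assign the squared feature a small nonzero coefficient, whereas the soft-thresholding step sets it to exactly zero. A recursive elimination loop could then drop such zero-weight features and refit on the remainder.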
4 Experimental Results

The prediction model of this project is based on the statistical analysis of historical data, and the established model needs to be verified by experiments. Statistics are therefore calculated per coke-pushing furnace (or batch) on at least 3 months of actual coke-making production data. The accuracy of the model meets the project requirements if the following conditions hold. For Ad and St,d, more than 95% of the predictions have a relative error below 2.5%, and the maximum relative error is below 4%. For M40 and M10, more than 90% of the predictions have a relative error below 3%, and the maximum relative error is below 6%. For CSR and CRI, more than 90% of the predictions have a relative error below 5%, and the maximum relative error is below 10%. In addition, the predicted values of the model should follow the trend of the actual data. The error analysis of the prediction model is shown in Table 1. Table 1 shows that the F values of the CSR and CRI prediction models are 10.143 and 5256.3, respectively, at the 5% significance level, and the P values are all less than 0.01, indicating that the prediction models are highly credible and that the thermal property indexes of coke are highly correlated with the selected parameters. According to the scatter plots of the standardized residuals after regression fitting (Fig. 1 and Fig. 2), the standardized residuals of the two prediction models all fall within (-2, 2) and there are no abnormal points, indicating that the residuals are normally distributed and that all data points are involved in the regression fitting. To verify the accuracy and applicability of the prediction model, this paper compares several commonly used models; the prediction results are shown in Fig. 2.
As can be seen from Fig. 2, which compares the predictions for M40 on the same sample set, the prediction accuracy of the linear regression model proposed in this paper is significantly better than that of the other models, and it has good predictability. Therefore, the prediction model can be used to guide coking and predict coke quality.
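The acceptance criteria stated at the start of this section can be checked mechanically. The sketch below encodes the quoted thresholds and applies them to one index; the function names and the sample measurements are hypothetical and are not taken from the paper's data.

```python
def relative_errors(predicted, actual):
    """Absolute relative error of each prediction with respect to the measured value."""
    return [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]


def meets_requirement(predicted, actual, error_limit, min_share, max_limit):
    """Check one acceptance rule: enough predictions fall within error_limit,
    and no prediction has a relative error of max_limit or more."""
    errors = relative_errors(predicted, actual)
    share_within = sum(e < error_limit for e in errors) / len(errors)
    return share_within > min_share and max(errors) < max_limit


# Thresholds quoted in the text: (error limit, required share, maximum error).
RULES = {
    "Ad": (0.025, 0.95, 0.04), "St,d": (0.025, 0.95, 0.04),
    "M40": (0.03, 0.90, 0.06), "M10": (0.03, 0.90, 0.06),
    "CSR": (0.05, 0.90, 0.10), "CRI": (0.05, 0.90, 0.10),
}

# Hypothetical measurements and predictions, for illustration only.
actual_m40 = [88.0, 87.5, 89.0, 88.4]
predicted_m40 = [88.4, 87.0, 88.1, 88.9]
ok = meets_requirement(predicted_m40, actual_m40, *RULES["M40"])
```

Evaluating every index against its own rule, per furnace batch, gives a single pass or fail result for the model over the verification period.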
Table 1. Error analysis of prediction models

| Thermal properties of coke | Source of deviation | Sum of squares | Degree of freedom | Mean square error | F value | P value |
|---|---|---|---|---|---|---|
| CSR | Regression | 43.875 | 3 | 14.622 | 10.143 | < 0.01 |
| | Residual error | 10.083 | 6 | 1.439 | | |
| | Total deviation | 52.932 | 9 | | | |
| CRI | Regression | 15822.342 | 5 | 3236.322 | 5256.3 | < 0.01 |
| | Residual error | 3.342 | 6 | 0.534 | | |
| | Total deviation | 15834.532 | 10 | | | |
Fig. 2. Comparison of prediction results of different models on M40
5 Conclusion

Coke quality prediction has always been a difficult problem because of the correlation among the indexes of the blended coal. In this paper, the coal resource data, the coal and coke quality data, and the coking process parameters of the Masteel 7.63 m coke oven over the past two years are statistically analyzed. Through model analysis, comparison, and verification, the commonly used prediction models are compared and the best prediction model is determined. The model can predict the coke quality indexes (Ad, St,d, cold and hot strength, average particle size, etc.) from the coking coal blend and the coking process parameters. The prediction model can also provide an optimized coking coal blending ratio for reference according to the required coke quality and cost.

Acknowledgements. This work was supported by the Universities’ Philosophy and Social Science Researches Project in Jiangsu Province (No. 2020SJA0631 & No. 2019SJA0544), the Educational Reform Research Project (No. 2018XJJG28 & No. 2021XJJG09) from Nanjing Normal University of Special Education, the Educational Science Planning of Jiangsu Province (D/2021/01/23), and the Jiangsu University Laboratory Research Association (Grant No. GS2022BZZ29).
References

1. Dunshi, L.: Review of China’s coal economic operation situation and future market outlook in 2018. Coal Econ. Res. 39(2), 4–11 (2019)
2. Shunguo, X.: Restarting practice of large coke oven cold furnace. Baosteel Technology 04, 60–64 (2020)
3. Shuguang: Research on energy conservation and emission reduction process of energy flow orderly in coking recovery system. Wuhan University of Science and Technology (2014)
4. Hate and joy: Study on influence factors of reactivity and post-reaction strength of single coal coking coke. Shanxi Coking Coal Science and Technology 45(12), 14–17 (2021)
5. Zhou, P., Jiang, Y., Wen, C., Dai, X.: Improved incremental RVFL with compact structure and its application in quality prediction of blast furnace. IEEE Transactions on Industrial Informatics. https://doi.org/10.1109/TII.2021.3069869
6. Golovko, M.B., Drozdnik, I.D., Miroshnichenko, D.V., et al.: Predicting the yield of coking by products on the basis of elementary and petrographic analysis of the coal batch. Coke and Chemistry 55(6), 204–214 (2012)
7. Got, A., Moussaoui, A., Zouache, D.: A guided population archive whale optimization algorithm for solving multiobjective optimization problems. Expert Systems with Application 141(Mar.), 112972 (2020). 1-112972.15
8. Xcab, C., Mi, H., Dg, D., et al.: A decomposition-based coevolutionary multiobjective local search for combinatorial multiobjective optimization. Swarm Evol. Comput. 49, 178–193 (2019)
9. Bandyopadhyay, S., Saha, S., Maulik, U., et al.: A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Transactions on Evolutionary Computation 12(3), 269–283 (2008)
10. Zhang, Q., Zhou, A., Zhao, S., et al.: Multiobjective optimization test instances for the CEC 2009 special session and competition. Mechanical Engineering (New York, N.Y. 1919), 1–30 (2008)
11. Deb, K., Jain, H.: An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE Transactions on Evolutionary Computation 18(4), 577–601 (2014)
12. Jain, H., Deb, K.: An evolutionary many-objective optimization algorithm using reference-point based nondominated sorting approach, part II: handling constraints and extending to an adaptive approach. IEEE Trans. Evol. Comput. 18(4), 602–622 (2014)
13. Hu, J., Wu, M., Chen, X., et al.: A multilevel prediction model of carbon efficiency based on the differential evolution algorithm for the iron ore sintering process. IEEE Trans. Industr. Electron. 65(11), 8778–8787 (2018)
14. Chen, H.J., Bai, J.F.: A coke quality prediction model based on support vector machine. Advanced Materials Research 690–693, 3097–3101 (2013)
15. Malyi, E.I.: Modification of poorly clinkering coal for use in coking. Coke and Chemistry, 87–90 (2014)
16. Yan, S., Zhao, H., Liu, L., et al.: Application study of sigmoid regularization method in coke quality prediction. Complexity, 220–224 (2020)
17. Bang, Z., Lu, C., Zhang, S., Song, S.: Research on coke quality prediction model based on TSSA-SVR model. China Mining, 1–8 (2022)
18. Wu, Y., Liu, H., Zhang, D., Zheng, M.: Research and application of multiple linear regression analysis prediction model for coke cold strength. Journal of Wanxi University 31(05), 51–54+60 (2015)
Design and Development on an Accessible Community Website of Online Learning and Communication for the Disabled

Jingwen Xu, Hao Chen, Qisheng Ye, Ting Jiang, Xiaoxiao Zhu, and Xianwei Jiang(B)

School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected]
Abstract. Education for the disabled is a social issue that cannot be ignored. Nowadays, due to the impact of COVID-19 and the rapid development of information and communication technology, online learning has become the mainstream way for people to acquire knowledge. In China, the number and proportion of the disabled receiving higher education are low. To let the disabled receive better online higher education, we should build a perfect accessible online learning environment. In order to realize this idea, we mainly studied and analyzed the current situation, design principles, and development technology of accessible educational websites, and designed and developed “Zhihai”, an accessible community website of online learning and communication for the disabled. “Zhihai” is committed to solving the difficulties of online learning for the disabled and making contributions to special education. It aims to meet the requirements of strong pertinence, complete functions, high quality accessibility, and a good user experience so as to truly make its contribution to the construction of special education and improve the online learning situation for the disabled.

Keywords: Accessibility · website design and development · educational website · online learning
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024
Published by Springer Nature Switzerland AG 2024. All Rights Reserved
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 119–147, 2024. https://doi.org/10.1007/978-3-031-50580-5_11

1 Introduction

1.1 Research Background

Domestic Research Status

In 2019, a novel coronavirus began spreading rapidly across the globe, threatening not only economies but also human lives. In response, highly general computer-aided diagnosis models based on deep learning emerged. Particle swarm optimization was used to create Pso-guided Self-tuning Convolutional Neural Networks (PSTCNN) [1], which allow the model to tune hyperparameters automatically and to select hyperparameter combinations in a targeted manner, so that it stably obtains solutions that are closer to the global optimum. Hyperparameter tuning of models with optimization algorithms is quicker and more efficient than traditional methods, and it can effectively relieve the growing pressure on diagnosis caused by the shortage of medical resources worldwide. Artificial intelligence has other applications in the medical field as well. For example, the part of VGG-16 that extracts high-level abstract features and a purpose-designed fully connected layer together form a preliminary COVID-19 intelligent assisted diagnosis model based on transfer learning [2]. The diagnosis model is trained iteratively on the COVID-19 training set while the parameters of the fully connected layer are continuously optimized, finally yielding a COVID-19 intelligent assisted diagnosis model based on transfer learning with the VGG-16 convolutional neural network. The model trained by transfer learning is highly reliable, can quickly give doctors diagnostic references, and can increase work productivity. Applying a deep learning-based computer-assisted system to online learning under the influence of the epidemic would be beneficial, provided that it also takes into account the accessibility of online learning for people with disabilities and focuses on their educational needs.

Statistics show that there are more than 600 million people with disabilities worldwide. According to research, there is a significant gap between visually impaired people’s demand for software and their actual use of it: their demand for online learning, travel, and job-hunting software is high, but the usage rate is very low. For example, screen-reading software cannot accurately extract the text information from images, and applications are often incompatible with assistive software, among other issues.
It is clear that accessibility on the Internet still has problems and that the level of accessibility achieved is not ideal. According to an online survey, the majority of domestic special education websites have serious shortcomings in four categories: compatibility, operability of interface components, comprehensibility of content and controls, and perceptibility of web content [3]. In general, the websites’ functionality is incomplete; their content is sparse, out of date, and of low professional quality; and their accessibility for special users, such as blind and deaf users, is poor and cannot satisfy their need to access the Internet independently. To provide accessible navigation mechanisms that let these users find the specific information they need quickly and accurately, web developers must understand the needs of particular groups of people and adopt a needs-based approach.
Take “Smart Learning” as an example of a domestic deep learning-based computer-aided learning system: by establishing keywords and language patterns, combining them with artificial intelligence analysis, and applying deep learning and big data analysis to students’ learning, the system realizes personalized, knowledge map-based diagnosis, helps students locate the source of their errors, and pushes corresponding micro-lesson explanations and moderately challenging questions. To categorize data securely and make it accessible, the Educational Clustering Big Data Mining System (ECBDMS) has been proposed to integrate the Cognitive Web Services-based Learning Analytics (CWSLA) system. Compared with other existing methods, its performance gain, prediction rate, clustering error rate, learning rate, and prediction accuracy are reported to be significantly better [4].
Design and Development on an Accessible Community Website
Although some researchers have recently begun to pay attention to this issue, few studies focus specifically on the design and development of accessible websites; most concentrate instead on the value, solutions, and evaluation of web accessibility. Accessibility of educational websites is also an objective of China’s contemporary web education technology standards. However, because the standard system is intended for businesses creating web education platforms, resources, and software, it is too difficult and specialized for the personnel who design, develop, evaluate, and maintain websites in primary and secondary school campus networks. Therefore, more detailed theoretical guidelines and benchmarks appear to be required to direct the creation, advancement, assessment, and upkeep of accessible educational websites [5].
Overseas Research Status
The Section 508 web accessibility rule of the United States government, which has been in force since June 2001, consists of 16 major rules: provide text equivalents for non-text elements; synchronize equivalent alternatives with multimedia presentations; do not rely on color alone to convey information; keep pages readable without style sheets; provide redundant text links for server-side image maps; prefer client-side image maps; identify row and column headers in data tables; use markup to associate cells and headers in complex tables; give frames descriptive titles; avoid screen flicker; provide a text-only mirror page when accessibility cannot be achieved otherwise; make script-driven content accessible; make applets and plug-ins accessible; design accessible forms; provide a way to skip repetitive navigation; and give users sufficient response time [6].
For global deep learning support systems: in contemporary education and e-learning, Smart Web Interaction Modeling for Teaching and Learning (SWISW), based on artificial intelligence, has been proposed abroad to categorize students according to their learning abilities, using machine learning techniques to ensure that students receive appropriate, high-quality learning objects. Local weighting, linear regression, and other methods have also been introduced to predict students’ learning performance on the platform [7]. Foreign experts have also suggested a new distributed framework for context-aware recommender systems called DAE-SR (softmax regression based on a deep autoencoder). The framework emphasizes user-item-based communication and offers personalized recommendations. The proposed DAE-SR classifier is reported to outperform other models and to be more reliable, achieving better accuracy, precision, runtime, and recall [8]. Web developers and major U.S. Internet companies have also studied web accessibility design. In addition to suggesting accessibility design strategies for page structure, images, text, color, links and navigation, forms, and interactions, they have proposed strategies for content, navigation, and data entry in web pages. Web designers have also offered design models and techniques for accessible educational websites for the blind, aiming to improve accessibility from all angles, as well as a list of ten design errors and bad habits that frequently make websites less accessible.
1.2 Research Significance
Barrier-Free Design
A set of actions taken by architects and designers in the planning and construction
processes that take into account the needs of people with various functional impairments, such as the elderly, children, and people with disabilities, is referred to as “barrier-free design”. Barrier-free design is not intended for one specific group; it enables a larger group of people to have full access to the environment, facilities, transportation, and information and communication technologies. The ability of the environment to lessen or prevent new usage-related barriers is referred to as “ease of use”.
Information Accessibility
Information accessibility is a main focus of China’s socialist modernization process and a key component of its information development plan. To advance the humanitarian spirit and safeguard the legitimate rights and interests of people with disabilities, it is essential to actively promote the development of information accessibility so that they can engage in social activities on an equal footing. This is also a concrete example of social development and progress in the modern era.
1.3 Article Frame
The rest of this paper is structured as follows: Sect. 2 details Zhihai’s design concepts and the design principles of an accessible website. Section 3 presents the website’s architecture, including its overall structure, the functions of each module, and its database. Section 4 describes the implementation of the website’s key technologies, such as the video player, navigational shortcuts, speech-to-text technology, and pop-up generation technology. Section 5 concludes the paper and analyzes Zhihai’s social significance.
2 Design Ideas
2.1 Follow the Design Principles of WCAG 2.0
In February 1997, the World Wide Web Consortium (W3C), the most authoritative and influential international neutral technical standards body in the field of Web technology, established the Web Accessibility Initiative (WAI) to promote accessibility implementation. In December 2008, the W3C released the final version of the Web Content Accessibility Guidelines (WCAG) 2.0. WCAG 2.0 provides Web designers and developers with a set of technology-independent guidelines and success criteria designed to ensure that Web content is accessible to and usable by people with disabilities. WCAG 2.0 rests on four principles: perceptibility, operability, comprehensibility, and robustness. For perceptibility, information and user interface components must be presented to the user in a perceptible manner; for operability, user interface components and navigation must be operable; for comprehensibility, information and the operation of the user interface must be understandable; and for robustness, content must be robust enough to be interpreted reliably by a wide variety of user agents, including assistive technologies (see Fig. 1).
Fig. 1. Four principles that provide the foundation for Web accessibility
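As a small illustration of the perceptibility principle, one common check is that every image carries a text alternative. The following Python sketch is not part of Zhihai’s implementation; it only shows how such a check could look, using the standard library’s html.parser. The class and function names are hypothetical. An empty alt="" is treated as acceptable here, on the assumption that it marks a purely decorative image.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # An absent alt attribute leaves screen readers with nothing to read.
        if "alt" not in attr_map:
            self.missing.append(attr_map.get("src", "<no src>"))


def find_images_without_alt(html_text):
    """Return the src values of images lacking an alt attribute (hypothetical helper)."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    return finder.missing
```

A check of this kind covers only one success criterion; it says nothing about whether the alt text that is present is actually meaningful.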
Web accessibility design should consider four important factors: structure, technology, content, and browsing. For the structure factor, the accessibility design of web content structure and board planning should be considered; for the technology factor, the accessibility design of processing web content, document language technology, program language technology, media technology, and input/output device technology should be considered; for the content factor, the accessibility design of text information and multimedia information of web pages should be considered; for the browsing factor, the accessibility design of each web browsing structure should be considered (see Fig. 2).
Fig. 2. Four important factors of web accessibility design
2.2 Design Objective
Main Objective
The goal of this paper is to design and develop “Zhihai”, an accessible community website for online learning and communication for the disabled, in response to the needs of disabled learners who wish to pursue online education. The website adopts a B/S (browser/server) architecture to give users a more convenient access environment. On Zhihai, users can browse comprehensive learning resources in text, audio, and video formats to suit their different needs. Users can also upload and share their own educational materials and use a private personal space to make notes or collect educational resources uploaded by other users. Zhihai tags and categorizes uploaded content and adds content relevant to the user’s interests to the homepage.
During user registration, Zhihai enables a dual-view mode to distinguish between special users and regular users. Regular users can socialize with people with disabilities while also benefiting from the many and varied learning resources available. Zhihai also informs visitors about disability culture, so that special education has a greater cultural impact: more people become interested in, understand, and support disability culture, empathize with it, and let go of old prejudices against people with disabilities. Special users are divided into categories, and the problems of each type of user are addressed specifically with reference to WCAG 2.0, for example by adding text-to-speech for visually impaired users and audio-to-text conversion and video captioning for hearing impaired users. For a user who is blind or visually impaired, the speech-to-text conversion function applies speech and natural language processing technology, and navigation shortcuts let the user quickly reach the desired place; a user who is deaf or hard of hearing can benefit from the audio-to-text conversion service and the video captioning service.
The Objective of Additional Module
At the explicit moral level, college students are relatively accepting of people with disabilities as a group, while at the unconscious implicit level they are more likely to associate people with disabilities with negative terms and hold significantly negative attitudes toward them. When higher education institutions offer special education-related courses that convey knowledge about people with disabilities, college students are indirectly exposed to them through the courses, and their negative attitudes change to some extent.
In addition, college students who have interacted with people with disabilities for longer periods of time also show more positive outward attitudes toward them, which further demonstrates the importance of increasing contact with people with disabilities in changing negative attitudes [9]. Therefore, Zhihai has created “The Bridge of Stars”, a communication community for the disabled. People with disabilities who use this community can share the challenges they face and express their ideas in real time. Caring individuals and people with disabilities can also discuss current events, which aids mutual understanding and helps others solve their own challenges and express their own views. Users can share learning strategies and exchange learning experiences in the community, and they can share details of their daily lives to introduce themselves to more people. “The Bridge of Stars” helps people with disabilities communicate more effectively and increases their learning and communication options and opportunities for equality.
Sign language uses gestures to simulate images or syllables, forming meanings or words through changes in the gestures. It is the hand language with which people who are hearing impaired or unable to speak communicate and exchange ideas, and it is the main communication tool for deaf people. The use of sign language still has a number of problems, though. To begin with, there is a lack of consistency in the categories of sign language and sign language gestures used in deaf education and deaf life in China. While school staff and students use Chinese
sign language, local sign language is used by the community. Information exchange is seriously hampered by this inconsistency. According to the survey, the majority of deaf school teachers, students, and deaf adults in society believe that a universal sign language should be established, and the book “Chinese Sign Language” has served as a tool for this purpose at the national level. To address the issue of divergent sign languages, it is essential to create and promote a well-established national common sign language [10]. Therefore, Zhihai has developed a Sign Language Field module that combines sign language interpretation, a sign language dictionary, sign language recognition, and sign language learning for hearing impaired users and sign language enthusiasts. Users can look up the sign corresponding to a word in the sign language dictionary and get detailed explanations of the gestures in multiple forms, such as text, pictures, and videos, or they can use the sign language translation function to look up the signs corresponding to sentences and paragraphs of text and generate sign videos. The site also supports sign language recognition, using the camera to recognize and translate gestures. Moreover, the Sign Language Field encourages users to create sign language videos, sign language dances, sign language teaching videos, and other interesting content that promotes sign language, encouraging more people to understand and love it and, in turn, helping to reduce sign language differences so that sign language communication becomes more accurate and unrestricted.
An analysis of national employment statistics for people with disabilities between 2016 and 2020 shows that the vast majority of them continue to work primarily in flexible employment and agriculture.
A number of issues affect the employment of people with disabilities, including a low employment rate, unequal pay for the same work, a gap between ideal and actual employment, significant problems with sustainable employment, and a social environment for employment that needs improvement. To address these issues, we should strengthen ideological and political education to lay a solid ideological foundation for the proper employment of college students with disabilities, focus on skills training to cultivate multidisciplinary talents, strengthen employment guidance to improve employment skills, and strengthen employment psychology education. Finally, we should deepen the reform of higher special education and give it full play in addressing the employment of disabled people [11]. Zhihai provides career learning guidance for people with disabilities. Professional assessments such as the MBTI occupational personality test and the Holland occupational interest test can identify a test-taker’s occupational interests and personality tendencies, helping them select a career path that aligns with their interests. In addition to this fundamental function, Zhihai offers disabled people analysis of the employment situation and recommendations of job sites to strengthen employment guidance and provide an effective way to find a job. Vocational education is important to the employment of people with disabilities, and Zhihai will strengthen vocational education and employment psychology education to improve the employment skills and resilience of people with disabilities and contribute to their employment.
People with disabilities are a population prone to poor mental health, which is frequently accompanied by psychological issues like emotional instability, loneliness, low self-esteem, sensitivity and suspicion, and a certain amount of complaining
psychology [12]. Therefore, it is important to encourage the improvement of the mental health of people with disabilities. Through appropriate mental health initiatives for people with disabilities or the development of a psychological service system, it is important to encourage people with disabilities to have a complete and accurate understanding of who they are and to be guided in how to appropriately and timely handle stress and negative emotions in their lives [13]. As a result, the mental health module will take the shape of a tree hole, enabling users to more easily confide in one another and express their emotions. Various relaxation features, such as therapeutic graphics, decompression radio, and calming pure music for the hearing and visually impaired, are also available in the special zone. Mental health services are crucial for fostering a thorough and accurate understanding of oneself in people with disabilities.
3 Architecture Design
3.1 Overall Architecture Design
B/S Architecture
The website uses a shared, cross-platform architecture based on the B/S (browser/server) model. The fundamental concept is to run the client application in a Web browser and to use the services offered by the Web server to exchange data with the back-end database. Because the B/S model does not require users to install any software to operate the system through web pages, there is no need to deal with complex client-side maintenance. Because developers only need to focus on designing and developing the back-end server side and do not need to coordinate between different clients, the B/S architecture is more flexible and easier to extend and maintain than the traditional C/S model. Moreover, because the B/S model is cross-platform, users can use different browsers such as Google Chrome, Firefox, and IE, which lowers the maintenance and operation costs of the entire system. Users can thus quickly access a variety of information from any device through a Web browser. When a task needs to run locally, the browser loads the necessary program from the Web server. The Web server assigns the relevant instructions to the corresponding server for processing and, after execution, passes the results back to the client so that it can continue its work. Throughout the process, the Web server interacts with the back-end database on the user’s behalf.
The detailed working principle of the B/S architecture is shown in Fig. 3. Its workflow can be summarized as follows: First, the user submits a form through the relevant page of the browser in order to obtain the required information. To make sure the form is accurate and
Fig. 3. The detailed working principle of the B/S architecture
complete, the form is checked on the client side, and the user receives the client’s reply before the form is submitted. After receiving the request from the browser, the server processes it, retrieves the requested data, and returns the result to the browser. The browser then renders the HTML page generated for the user’s request, making it easier for the user to complete the operation [14].
Website Deployment Architecture
The system adopts a B/S three-tier architecture with three tiers of servers: the web platform, application services, and the database. The client side has no special requirements; it only needs a network connection (for example, a LAN) and an ordinary browser. Because images, audio, and video consume a lot of bandwidth and system resources, the architecture designates separate servers to manage them. The web platform handles user requests and passes them to the application service layer, which handles business logic and obtains back-end data. The image server and the audio/video server make load balancing and capacity expansion possible while reducing the load on the main system. In this way, the architecture keeps the large number of images, audio, and video resources used for online browsing and playback from occupying system resources and bandwidth, offers users better, more effective, and more convenient information services, and significantly raises the site’s overall performance. The website deployment architecture diagram is shown in Fig. 4:
Fig. 4. The website deployment architecture diagram
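The separation of media traffic from application traffic described above can be sketched as a simple routing rule. The Python fragment below is only an illustration under stated assumptions: the tier names and the file-suffix lists are hypothetical, since the paper states only that images and audio/video are served by dedicated servers, and a production deployment would more likely express this rule in the web server or load balancer configuration.

```python
from pathlib import PurePosixPath

# Hypothetical suffix lists; the paper does not name the supported formats.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}
MEDIA_SUFFIXES = {".mp3", ".wav", ".mp4", ".webm"}


def pick_server(request_path):
    """Return the (hypothetical) server tier that should handle a request path."""
    suffix = PurePosixPath(request_path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image-server"
    if suffix in MEDIA_SUFFIXES:
        return "media-server"
    # Everything else goes through the web platform to the application services.
    return "application-server"
```

Keeping the rule in one place makes it easy to add a server for a new resource type without touching the business logic.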
3.2 Module Function Design
Functional Architecture
Zhihai is a learning website for both regular and disabled users. Its goals are to address the problems that most disabled users face when learning online, including difficulty comprehending information quickly and accurately, and to popularize knowledge of disability culture among regular users. Users can browse the learning materials or access the four additional modules of the website: The Bridge of Stars, Sign Language Field, Career Planning Module, and Mental Health Module. The functional architecture of the website is shown in Fig. 5, and the basic functions of each module are shown in Fig. 6.
Home Page Design
The homepage includes accessibility functions, a search bar, special education news posters, a carousel of the most popular recommendations, functional partitions, a personal center, a creation center, recommendations of excellent works, etc. After a number of processes such as data mining and data screening, only works that are meaningful and adhere to socialist core values are shown on the home page.
Design of Additional Modules
Additional modules include the Bridge of Stars module, the Sign Language Field module,
Fig. 5. The functional architecture
Fig. 6. The basic functions of each module
the Career Planning module, and the Mental Health module. Clicking a module navigates to the corresponding partition.
The Bridge of Stars
The Bridge of Stars community includes the Hotlist, a ranking of the topics discussed in the community by popularity. Through the Hotlist, users can quickly learn about current hot topics that matter to the special education field and quickly grasp the social context of special education; users can also post and discuss topics.
Sign Language Field
The Sign Language Field module was developed to help sign language users communicate and learn, and to help sign language enthusiasts who struggle with the language and are unsure where to start. It has sign language dictionary, sign language translation, sign language recognition, and sign language video functions. When the user enters a sentence in the dialog box for translation, a sign language action breakdown diagram or a sign language video is displayed. The sign language dictionary helps with switching between words and phrases written in Chinese characters and sign language. With the sign language recognition function, users activate the device’s camera and make sign language gestures, and the dialog box displays the corresponding meaning.
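The dictionary and translation functions described above can be pictured as a word-by-word lookup. The following Python sketch is a simplification and not the site’s actual method: the entry fields, file paths, and the two sample entries are hypothetical placeholders, and splitting on whitespace is an assumption that works for the English example only (Chinese text would first need word segmentation, and real sign language grammar does not follow the word order of the written sentence).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignEntry:
    word: str
    description: str  # text explanation of the gesture
    image_path: str   # hypothetical path to an action breakdown diagram
    video_path: str   # hypothetical path to a demonstration video


# Toy dictionary with placeholder entries, for illustration only.
SIGN_DICTIONARY = {
    "hello": SignEntry("hello", "placeholder description", "img/hello.png", "video/hello.mp4"),
    "thanks": SignEntry("thanks", "placeholder description", "img/thanks.png", "video/thanks.mp4"),
}


def translate_sentence(sentence):
    """Look up each word; return (entries found, words without an entry)."""
    found, unknown = [], []
    for word in sentence.lower().split():
        token = word.strip(".,!?")
        entry = SIGN_DICTIONARY.get(token)
        if entry is not None:
            found.append(entry)
        elif token:
            unknown.append(token)
    return found, unknown
```

Returning the unknown words separately lets the page tell the user which parts of the sentence could not be shown, instead of silently dropping them.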
Career Planning Module
The Career Planning Module includes four major functions: career testing, employment situation analysis, recruitment website recommendation, and preferential policy information. Career testing includes well-known tests such as the Holland occupational interest test and the MBTI test, which help users analyze their own personalities and receive reasonable recommendations. The employment situation analysis function provides a reference for users based on statistical data. The recruitment website recommendation function recommends sites according to the user’s preferences and the sites’ recommendation ratings. Preferential policy information is collected from well-known special education platforms and official government documents to help the disabled obtain employment.
Mental Health Module
The Mental Health Module aims to give users who are introverted and lonely a place to talk. It contains four functions: tree hole, drift bottle, soothing music, and anonymous community. The tree hole lets users release psychological pressure by speaking their minds. The drift bottle is a whisper from one star to another. Soothing music helps users relax. The anonymous community lets people communicate freely.
Others
Other links include history, the authoring center, and personal space, which are designed to help users make better use of the site’s features.
Background Administrator Privileges
Zhihai’s background permissions are built on role-based access control (RBAC). The website has three roles: users, administrators, and super administrators. A permission is an action that is permitted on a particular object; for example, an administrator may review a user’s published article, uploaded video, or identity information. An operation is approved or denied depending on whether the operating role holds the corresponding permission (see Fig. 7).
Administrators and super administrators are the two operating roles in the background. Specific tasks, such as work review, video recommendation, and carousel picture setting, are under the administrator’s control. The super administrator has the same rights as an administrator and can also add and remove administrators (see Figs. 8, 9 and 10 and Table 1).
3.3 Database Design
Because the website’s functions are numerous and complex, we design the database by analyzing the functional requirements with ER diagrams. The design has a total of seven entities: user entity, story entity, file entity, article entity, administrator entity, role entity, and operation menu entity. Among them, the “user entity” and the “operation menu entity” do not depend on any other entity, so they are strong entities. The other entities cannot exist on their own, because they all depend on the user entity, so they are weak entities. A
Fig. 7. Flow chart of the audit
Fig. 8. Use case diagram for frontend visitors
straightforward ER model can be created from these entities and the functional relationships between them, as shown in Fig. 11:
Fig. 9. Use case diagram for frontend users
Users and administrators do not interact directly because the frontend and backend are two distinct operational ends. There are 7 entities and 5 relationships in this ER model:
1. Users can post/like/collect/comment on multiple stories.
2. One story can contain only one file or only one article.
3. One administrator can have only one role.
4. One role can have more than one operation menu; at the same time, one operation menu may be included in more than one role.
Fig. 10. Use case diagram for backend administrator
Table 1. Admin Menu
Function | Super Admin | Admin | User
View admin list | √ | × | ×
Add admin | √ | × | ×
Delete admin | √ | × | ×
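The role-based check behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration and not Zhihai’s actual code: the permission names are hypothetical labels for the three rows of Table 1 and for the administrator tasks mentioned in the text (work review, video recommendation, carousel setting), and the role-to-permission mapping is held in memory instead of in the role and role menu tables.

```python
# Hypothetical permission labels; only the grant pattern follows Table 1.
ADMIN_TASKS = {"review_work", "recommend_video", "set_carousel"}

ROLE_PERMISSIONS = {
    # The super administrator has every administrator right plus admin management.
    "super_admin": ADMIN_TASKS | {"view_admin_list", "add_admin", "delete_admin"},
    "admin": set(ADMIN_TASKS),
    "user": set(),
}


def is_allowed(role, permission):
    """Approve an operation only if the role holds the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles fall back to an empty permission set, so the check denies by default.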
The business logic of the entire website can be understood from this straightforward ER model. Next, we examine the attributes of each entity:
1. User entity: includes user ID, account number, password, mobile phone number, nickname, avatar, education background, email, individual resume, address, date of birth, age, gender, and user type.
2. Story entity: includes story number, story type, release time, module number, story title, and audit status.
3. File entity: includes file number, file title, and file contents.
4. Article entity: includes article number, article title, and article content.
5. Administrator entity: includes administrator number, administrator name, administrator account, administrator password, and mobile phone number.
Fig. 11. ER model Diagram of the “user entity” and “operation menu entity”
6. Role entity: includes role number, role name, and status.
7. Operation menu entity: includes menu number and menu name.
After analysis, the ER model is obtained, as shown in Fig. 12. Based on this ER model, the design can be translated into a concrete database. The system uses a MySQL database with a total of 17 tables: user table, story table, file table, article table, message notification table, comment table, like table, collection table, collection classification table, history record table, data statistics table, recommendation classification table, administrator table, role table, role menu table, operation menu table, and audit table.
The user table mainly stores basic information about the user. It contains attributes such as user id, user account, username, nickname, password, and so on. The user table is shown in Table 2:
The story table mainly stores the basic stories published by users. It contains attributes such as story number, story type, release time, module number, story title, audit status, etc. The story table is shown in Table 3:
Fig. 12. Diagram of ER model for front and back end
The file table mainly stores the file information published by users, such as video and audio. It contains attributes such as file number, file title, file contents, and so on. The file table is shown in Table 4:
The article table mainly stores the articles and news information published by users. It contains attributes such as article number, article title, article content, etc. The article table is shown in Table 5:
The comment table mainly stores the comments that users make on other users’ stories. It contains attributes such as comment number, comment time, comment content, etc. The comment table is shown in Table 6:
The like table mainly stores the likes that users give to other users’ stories. It contains attributes such as like number, like time, etc. The like table is shown in Table 7:
The collection table mainly stores the stories of other users that a user has collected. It contains attributes such as collection number, collection time, collection classification, etc. The collection table is shown in Table 8:
The collection classification table mainly stores the categories under which users save their collections. It contains attributes such as collection classification, creation time, classification name, number of collections, etc. The collection classification table is shown in Table 9:
Table 2. The user table

Field                   Type
user id                 int
user account            varchar(20)
password                char
nickname                varchar(20)
avatar                  blob
education background    varchar(8)
email address           varchar(255)
individual resume       varchar(255)
address                 varchar(255)
birth date              date
age                     int
gender                  char(1)
user type               varchar(20)
phone number            varchar(11)
Table 3. The story table

Field              Type
story id           int
user id            int
story type         varchar(20)
release time       datetime
module number      varchar(20)
view permission    varchar(20)
story title        varchar(255)
audit status       int
The history record table stores each user’s browsing history, that is, the content of other users that the user has viewed. It contains attributes such as record number and record time. The history record table is shown in Table 10.
Table 4. The file table

Field           Type
file id         int
story id        int
file title      varchar(255)
file content    blob
Table 5. The article table

Field              Type
article id         int
story id           int
article title      varchar(255)
article content    blob
Table 6. The comment table

Field              Type
comment number     int
story id           int
user id            int
comment time       datetime
comment content    text
Table 7. The like table

Field          Type
like number    int
story id       int
user id        int
like time      datetime
The data statistics table stores statistics on how other users have interacted with a user’s story. It contains attributes such as data statistics number, likes, views, collections, and comments. The data statistics table is shown in Table 11.
Table 8. The collection table

Field                        Type
collection number            int
story id                     int
collection classification    int
user id                      int
collection time              datetime
Table 9. The collection classification table

Field                        Type
collection classification    int
user id                      int
classification name          varchar(255)
creation time                datetime
number of collections        int
Table 10. The history record table

Field            Type
record number    int
story id         int
user id          int
record time      datetime
The recommended classification table stores the basic classification data for the story labels shown on the homepage of the website. It contains attributes such as classification number and classification title. The recommended classification table is shown in Table 12.
Table 11. The data statistics table

Field                     Type
data statistics number    int
story id                  int
user id                   int
likes                     int
views                     int
collections               int
comments                  int
forwarding number         int
Table 12. The recommended classification table

Field                    Type
classification number    int
story id                 int
user id                  int
classification title     varchar(20)
The message notification table stores the messages sent between users. It contains attributes such as message notification number, user number 1, user number 2, message type, and message content. The message notification table is shown in Table 13.

Table 13. The message notification table

Field                          Type
message notification number    int
user number 1                  int
user number 2                  int
message type                   varchar(20)
message content                text
The administrator table stores basic information about the administrator. It contains attributes such as administrator number, administrator name, administrator account number, administrator password, phone number, etc. The administrator table is shown in Table 14:
Table 14. The administrator table

Field                           Type
administrator number            int
administrator name              varchar(20)
administrator account number    varchar(20)
phone number                    varchar(11)
administrator password          varchar(16)
The role table stores information about the roles to which administrators belong. It contains attributes such as role number, role name, and role status. The role table is shown in Table 15.

Table 15. The role table

Field                   Type
role number             int
administrator number    int
role name               varchar(255)
role status             int
The operation menu table stores the operations that can be performed in the system back end. It contains attributes such as menu number and menu name. The operation menu table is shown in Table 16.

Table 16. The operation menu table

Field          Type
menu number    int
menu name      varchar(255)
The role menu table stores the relationships between roles and operation menus. It contains attributes such as role menu number. The role menu table is shown in Table 17.

Table 17. The role menu table

Field               Type
role menu number    int
role number         int
menu number         int

The audit table stores records of the actions that administrators perform on published stories. It contains attributes such as audit number, audit status, audit time, and story number. The audit table is shown in Table 18.

Table 18. The audit table

Field                   Type
audit number            int
administrator number    int
story number            int
audit status            varchar(20)
audit time              datetime

The database design clearly outlines the system database design process based on specific functional requirements and identifies the milestones of each stage. It not only increases the overall design effectiveness of the website system but also offers solid support for the system’s future efficient and stable development [15].
4 Implementation of Key Technologies

4.1 Video Player

Zhihai uses Shaka Player, a popular free and open-source JavaScript library for HTML5 video playback. It supports adaptive-bitrate streaming protocols such as HLS and DASH without any plugins or Flash. Shaka Player plays video using open web standards such as Media Source Extensions (MSE) [16] and Encrypted Media Extensions (EME). It supports on-demand and live streaming, multi-period content, multi-DRM, and subtitles.
4.2 Navigation Shortcut

The project uses asynchronous JavaScript events to respond to client requests [17]. Setting a hotkey combination in JavaScript essentially comes down to reading the keyCode value of the pressed key. To support combinations with Ctrl, Alt, Shift, and other modifier keys, the handler also checks the corresponding ctrlKey, altKey, and shiftKey properties of the event together with the key code. Take the code below as an example:
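As a hedged illustration of this approach (a minimal sketch, not the original listing of the project), the matching logic can be written as a small function that compares the keyCode and modifier flags of a keyboard event with a shortcut description. The shortcut table, the target paths, and the function names `matchesShortcut` and `resolveShortcut` are hypothetical.

```javascript
// Return true when a keyboard event matches a shortcut description.
// The event only needs keyCode, ctrlKey, altKey and shiftKey, as on a DOM KeyboardEvent.
function matchesShortcut(event, shortcut) {
  return event.keyCode === shortcut.keyCode &&
    Boolean(event.ctrlKey) === Boolean(shortcut.ctrlKey) &&
    Boolean(event.altKey) === Boolean(shortcut.altKey) &&
    Boolean(event.shiftKey) === Boolean(shortcut.shiftKey);
}

// Hypothetical shortcut table: the key codes are real (72 = H, 83 = S),
// the target paths are invented for illustration.
const SHORTCUTS = [
  { keyCode: 72, altKey: true, target: '/home' },                     // Alt+H
  { keyCode: 83, ctrlKey: true, shiftKey: true, target: '/search' },  // Ctrl+Shift+S
];

// Look up the navigation target for an event, or null if no shortcut matches.
function resolveShortcut(event) {
  const hit = SHORTCUTS.find((shortcut) => matchesShortcut(event, shortcut));
  return hit ? hit.target : null;
}

// Browser-only wiring; skipped when no DOM is available.
if (typeof document !== 'undefined') {
  document.addEventListener('keydown', (event) => {
    const target = resolveShortcut(event);
    if (target) {
      event.preventDefault();
      window.location.href = target;
    }
  });
}
```

Keeping the matching logic separate from the event listener makes it easy to check without a browser. Note that `keyCode` is deprecated in current browsers in favor of `event.key` and `event.code`; the sketch keeps `keyCode` only because the text describes it.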
4.3 Speech-to-Text

The Web Speech API allows speech data to be integrated into web applications [18]. It provides two distinct kinds of functionality: speech synthesis (Text-to-Speech, TTS) and speech recognition (asynchronous speech recognition).

Text-to-Speech (TTS) is the process of receiving text from an application that requires speech synthesis and then playing it through the device’s speaker or audio output. The Web Speech API has a main control interface for this, called SpeechSynthesis, plus some other interfaces that deal with how to represent the text to be synthesized (also called “utterances”), which voices to use to speak the utterances, and other related tasks. Many operating systems have a speech synthesis system of their own, and for this task we call the available APIs to use that system.

Speech recognition involves three processes: first, the device’s microphone receives the speech; second, the speech recognition service checks the speech against a set of grammars (basically, the words you want to be able to recognize in a particular application); and finally, if a word or phrase is successfully recognized, the result is returned as a text string (there can be more than one result), and further actions can be triggered. The Web Speech API has a primary control interface for this, SpeechRecognition, and several closely related interfaces for representing grammars, results, and so on.
Devices usually have a standard speech recognition system available, and most modern operating systems use this speech recognition system to process voice commands.
Chrome currently supports speech recognition only through prefixed properties, so a line is needed at the top of the code to ensure that the correct object is used both in Chrome, which requires the prefix, and in Firefox, which does not. After getting references to the output div and the html element (used later to output diagnostic messages from speech recognition and to update the background color of the application), we add an onclick event handler that starts the speech recognition service when the screen is clicked. This is done by calling the “SpeechRecognition.start()” method. Inside the forEach() call, each color keyword is given a matching background color, so that it is easy to see which color each keyword refers to.
Once speech recognition has started, a number of event handlers can be used to retrieve the result and other related pieces of information (see the SpeechRecognition event handler list). One of the most commonly used is “SpeechRecognition.onresult,” which is triggered when a successful result is received.
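The steps described above (choosing the prefixed or unprefixed constructor, starting recognition on a click, and reading the text in `onresult`) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the project's code: the helper names `getSpeechRecognition`, `firstTranscript`, and `startRecognition` are hypothetical, and the language tag `en-US` is an arbitrary choice.

```javascript
// Pick the SpeechRecognition constructor, falling back to the webkit-prefixed name.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// Read the top alternative of the first result from an onresult event.
function firstTranscript(event) {
  const result = event.results[0];
  return result && result[0] ? result[0].transcript : '';
}

// Hypothetical helper: start recognition and hand the recognized text to onText.
function startRecognition(globalObj, onText) {
  const Recognition = getSpeechRecognition(globalObj);
  if (!Recognition) {
    return null; // speech recognition is not available in this environment
  }
  const recognition = new Recognition();
  recognition.lang = 'en-US';          // assumed language
  recognition.interimResults = false;  // only final results
  recognition.maxAlternatives = 1;     // one candidate per result
  recognition.onresult = (event) => onText(firstTranscript(event));
  recognition.start();
  return recognition;
}

// Browser-only wiring: start listening when the page is clicked.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  document.body.onclick = () => startRecognition(window, (text) => console.log(text));
}
```

Passing the global object in as a parameter keeps the constructor lookup independent of any particular browser, which also lets the logic be exercised with a stand-in object.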
4.4 Subtitle Generation

For live captioning of web video, the first technique that comes to mind is Ajax polling. Its principle is simple: at regular intervals, the client sends a request to the server to ask whether there is a new message. This approach creates many connections, each carrying one request and one response. In addition, every request carries an HTTP header, which consumes bandwidth and CPU time. We therefore chose Socket.IO, which offers high performance, reliability, speed, and stability. Socket.IO is a library that provides low-latency, bi-directional, event-based communication between clients and servers. It is built on top of the WebSocket protocol [19] and provides additional guarantees such as fallback to HTTP long-polling and automatic reconnection.
When the server-push event is received, the run function initializes the caption and then generates a caption that scrolls across the screen.
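A minimal sketch of the receiving side, assuming a Socket.IO-style client object that exposes `on(eventName, handler)`, is shown below. The event name `server-push` comes from the text; the payload shape `{ text }`, the element id `captions`, and the function names are assumptions made for illustration, and the scrolling effect is reduced to keeping only the most recent lines.

```javascript
// Keep only the most recent `maxLines` caption lines (a simple scrolling window).
function pushCaption(lines, text, maxLines) {
  const next = lines.concat([text]);
  return next.slice(Math.max(0, next.length - maxLines));
}

// Subscribe to the server-push event and re-render the visible captions on each message.
// `socket` is any object with an on(eventName, handler) method, such as a Socket.IO client.
function attachCaptionHandler(socket, render, maxLines = 2) {
  let lines = [];
  socket.on('server-push', (payload) => {
    // Assumed payload shape: { text: '...' }
    lines = pushCaption(lines, String(payload.text), maxLines);
    render(lines);
  });
}

// Browser-only wiring, used when the Socket.IO client script has defined `io`.
if (typeof io !== 'undefined' && typeof document !== 'undefined') {
  const box = document.getElementById('captions'); // hypothetical element id
  attachCaptionHandler(io(), (lines) => {
    box.textContent = lines.join('\n');
  });
}
```

Because the server pushes each caption as it becomes available, the client does not need to issue repeated requests as it would with polling.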
5 Summary

This paper designs and develops an accessible learning and communication community website in response to the difficulties that the disabled face in online learning. In terms of learning needs, website functions, communication, career planning, and psychological needs, Zhihai not only meets the needs of general users but also provides practical help to special users according to their needs. It aims to maximize the choice and equality available to special groups in learning and communication, to promote the integration of disability culture, and to understand, respect, and promote that culture [20]. The research results of this paper are as follows.

First, this paper elaborates on the development of web accessibility and analyzes problems such as the low rate of higher education among the disabled, the communication gap between the disabled and the general public, the low popularity of sign language, difficulties in the employment of the disabled, and frequent mental health problems among the disabled. The design and development of Zhihai was carried out on the basis of thorough research and investigation.

Second, Zhihai not only meets accessibility requirements and the online learning needs of the disabled, but also provides several additional modules that address other needs, such as communication, employment, and psychological support.

As technology advances and evolves, more and more new technologies are being applied to the design and development of accessible websites. It is only through continuous learning and innovation that we can promote the ongoing development of accessible technology and create a more comfortable environment for the disabled. The above research results are only a small step forward, and practice shows that much work remains to be done. For example, some disabled people have limited language and comprehension abilities and find it difficult to grasp the main meaning of the material, so the information needs to be simplified and explained; however, simplification and explanation may lose part of the original information, which affects the effectiveness and reliability of transmission. In addition, knowledge for the disabled must be authoritative and professional, which makes it difficult to update, and the production cycle for quality information is long. Knowledge-sharing sites for the disabled may also contain sensitive information that, if left unprotected, is vulnerable to attack, resulting in information leakage and personal data breaches. In the future, we will continue to explore deeper questions, building on our current findings and expanding our research to a broader and more innovative range of topics.
References

1. Wang, W., Pei, Y., Wang, S.-H., Manuel Gorrz, J., Zhang, Y.-D.: PSTCNN: explainable covid-19 diagnosis using PSO-guided self-tuning CNN. Biocell. 47(2), 373–384 (2023)
2. Jiang, Z., Xu, X.: Intelligent assisted diagnosis of covid-19 based on CT images. 43(02), 264–269 (2020)
3. Ji, Z., Xu, Z., Wang, L.: The status quo and Countermeasures of information Accessibility in Digital Society
4. Li, B.: Cognitive web service-based learning analytics in education systems using big data analytics. Int. J. e-Collab. (IJeC) 19(2), 1–19 (2023)
5. Chen, Z.: Accessibility Design and Development of Educational Websites. Master, Zhejiang Normal University (2009)
6. Thatcher, J., Kirkpatrick, A., Urban, M., Lawson, B., Lauke, P.H.: Web Accessibility: Web Standards and Regulatory Compliance (2006)
7. Yang, H., Shankar, A., Velliangiri, S.: Artificial intelligence-enabled interactive system modeling for teaching and learning based on cognitive web services. Int. J. e-Collab. (IJeC) 19(2), 1–18 (2023)
8. Boppana, V., Sandhya, P.: Distributed focused web crawling for context aware recommender system using machine learning and text mining algorithms. Int. J. Adv. Comput. Sci. Appl. 14(3) (2023)
9. Tu, Q.: Evaluation of College Students’ Explicit and Implicit Attitudes Towards Disabilities. Master, Shenyang Normal University (2017)
10. Liu, Y., Gu, D., Cheng, L., Wei, D.: Survey of sign language use in China. 86(02), 35–41 (2013)
11. Qing, Z.: Employment Problems and Countermeasures of disabled College Students. Master, Nanjing Normal University (2011)
12. Fu, S., Cao, H.: On the psychological problems of disabled persons. 10(06), 116–118 (2009)
13. Lan, J., Liu, T.: A study on the relationship among disability attitude, mental health and subjective wellbeing of persons with disabilities. 30(02), 86–91 (2018)
14. Zhu, B., Chen, G., Li, P., Wang, S.: Design of human resource management system based on B/S mode and MySQL. 44(14), 65–69 (2021)
15. Li, Y., Liu, T.: Database design of home appliance recycling management system based on MySQL. 219(03), 141–143+146 (2023)
16. Li, J.: Real-time audio stream playback implementation based on websocket and MSE technology. 33, 46–49 (2022)
17. Wang, S., Sun, P., Guo, Z., Hu, L.: A method of extending javascript event supporting asynchronous invocation mechanism in webkit brower. 33(01), 226–229 (2016)
18. Zhang, X., Zhang, W., Wang, Y.: Web development-based speech synthesis. 8(29), 6939–6942 (2012)
19. Yang, L.: A Communication Approach and a WebSocket Server. Ed. (2019)
20. Fan, J.: Barrier Free Education Website Design and Realization. Master, Shanghai Jiao Tong University (2012)
Exploration of the Teaching and Learning Model for College Students with Autism Based on Visual Perception—A Case Study in Nanjing Normal University of Special Education Yan Cui(B) , Xiaoyan Jiang, Yue Dai, and Zuojin Hu School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China [email protected]
Abstract. Autism is a neurodevelopmental disorder with clinical diversity and heterogeneous etiology. Its main clinical features are language problems, social communication disorders, and stereotyped behaviors. With social and economic development, more and more autistic children can study in regular classes thanks to early intervention, receiving ordinary compulsory education together with educational intervention and rehabilitation at this stage. This means that more and more autistic teenagers will receive higher education in the future. It is therefore necessary to design effective teaching for the integrated classroom in higher education and to propose teaching and learning models suited to the personality characteristics of autistic college students, so as to stimulate their interest in learning and their potential, help them build self-confidence, encourage them to participate actively in communication, and ensure that they complete their degree courses, which will improve the effectiveness of educational intervention and rehabilitation to a certain degree. This article uses the degree course “Linear Algebra” as an example to discuss and summarize the practice of integrated classrooms with autistic students at Nanjing Normal University of Special Education, and explores a teaching and learning model suited to the personality characteristics of autistic students.

Keywords: Higher Education · Integrated Classroom · Mild Autism · MATLAB Visual Presentation · Cooperative Learning
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 148–157, 2024. https://doi.org/10.1007/978-3-031-50580-5_12

1 Introduction

Autism is a neurodevelopmental disorder with clinical diversity and heterogeneous etiology. The main clinical features are language problems, social communication disorders, and stereotyped behaviors. According to statistics released by the US Centers for Disease Control and Prevention in 2016, the incidence of autism has reached 1 in 45, and it is showing a clear upward trend [1]. In 2006, autism, as a kind of mental disability, was included in the Second National Sampling Survey of Disability. According to the data of that survey, the number of children with mental disabilities was confirmed to be 145,000, or 2.9% of the total number of disabled children aged 0–17, of whom 41,000 were children with autism. In addition, there may also be autistic children among the 1.749 million children with intellectual disabilities and the 1.435 million children with multiple disabilities [2]. However, according to the “Report on the Development Status of China’s Autism Education and Rehabilitation” [3] released in April 2017, the number of people with autism in China may be more than 10 million, and the number of autistic children aged 0–14 may be more than 2 million, showing a sharp upward trend in China [4].

With social and economic development, more and more autistic children can study in regular classes thanks to early intervention, receiving ordinary compulsory education together with educational intervention and rehabilitation at this stage. This means that more and more autistic teenagers will receive higher education in the future [5–7]. Therefore, how to design a teaching program suited to the personality characteristics of college students with autism in the integrated classroom of higher education will affect their learning of degree courses, as well as their educational intervention and rehabilitation [8–10].

“Linear Algebra” mainly covers linear equations and matrices, and is a basic academic course for special education and computer science at Nanjing Normal University of Special Education. Students who are new to Linear Algebra meet its concepts for the first time, and many of these concepts are quite abstract. Designing a teaching program suited to students’ understanding is therefore of paramount importance [11–13].
This article uses Matlab to demonstrate some concepts of Linear Algebra intuitively, to facilitate students’ learning, understanding, and use of them. Matlab can help display Linear Algebra knowledge, present it intuitively, and stimulate students’ interest in learning. At the same time, we explore a cooperative learning model in the integrated classroom that provides social intervention and rehabilitation for college students with autism, helps them integrate into the class and communicate actively, cultivates their capacity for independent learning, and improves their pass rate in degree course assessments [14, 15]. At Nanjing Normal University of Special Education, two autistic students, in grades 2017 and 2018 respectively, are taking Linear Algebra. This article therefore takes Linear Algebra as a case study to explore the design of teaching and learning models for integrated classrooms in higher education. The main contributions are as follows:

(1) A Linear Algebra teaching plan based on Matlab visual presentation is designed to capture students’ attention, especially that of students with autism. Teaching through visual stimulation can improve their understanding and help them better grasp the basic degree courses.

(2) A supervised and mutually-aided learning intervention plan is designed to help autistic students master the course content, guide them to take the lead in communication, and cultivate normal students’ problem-solving skills.
2 Design of Linear Algebra Teaching Planning Based on Matlab Visual Presentation

Matrix operations form a large part of Linear Algebra, and the solution of many problems is ultimately transformed into a matrix problem. This section therefore focuses on the Matlab-based visual presentation of three aspects, namely the introduction of the matrix concept, the calculation of matrices, and the application of matrices, to help autistic college students develop a more intuitive understanding of the concepts, operations, and applications of matrices.

A matrix is a table of $m \times n$ elements arranged in $m$ rows and $n$ columns. The element $a_{ij}$ is located in the $i$th row and the $j$th column. For example, the table

$$A = \begin{bmatrix} 4 & 4 & 1 \\ 3 & 5 & 2 \\ 3 & 2 & 3 \\ 2 & 1 & 3 \\ 1 & 2 & 3 \end{bmatrix}$$

is a $5 \times 3$ matrix. The element 5 is located in the second row and the second column of the matrix. If each row of the matrix is taken as one point, with the first column as the coordinate on the X axis, the second column as the coordinate on the Y axis, and the third column as the coordinate on the Z axis, the three-dimensional scatter plot of matrix $A$ is shown in Fig. 1.

If the matrix

$$B = \begin{bmatrix} 2 & 2 & 2 \\ 2 & 2 & 2 \\ 2 & 2 & 2 \\ 2 & 2 & 2 \\ 2 & 2 & 2 \end{bmatrix}, \quad \text{then} \quad A + B = \begin{bmatrix} 6 & 6 & 3 \\ 5 & 7 & 4 \\ 5 & 4 & 5 \\ 4 & 3 & 5 \\ 3 & 4 & 5 \end{bmatrix},$$

that is, the elements in corresponding positions are added together. The result can be shown in the same space as a scatter plot, as in Fig. 2. Each point is translated along the coordinate axes.
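The element-wise addition above can also be checked with a few lines of code. The following is a small JavaScript sketch (the article itself uses Matlab); the function name `addMatrices` and the `points` representation are illustrative assumptions.

```javascript
// Add two matrices of the same size element by element.
function addMatrices(a, b) {
  const sameShape = a.length === b.length &&
    a.every((row, i) => row.length === b[i].length);
  if (!sameShape) {
    throw new Error('Matrices must have the same dimensions');
  }
  return a.map((row, i) => row.map((value, j) => value + b[i][j]));
}

// Matrix A from the text and a 5 x 3 matrix B whose entries are all 2.
const A = [
  [4, 4, 1],
  [3, 5, 2],
  [3, 2, 3],
  [2, 1, 3],
  [1, 2, 3],
];
const B = A.map((row) => row.map(() => 2));

const sum = addMatrices(A, B);

// Each row read as one 3D point (x, y, z), as in the scatter plots.
const points = sum.map(([x, y, z]) => ({ x, y, z }));

console.log(sum);
console.log(points);
```

Adding B shifts every point by 2 along each of the three axes, which is the translation visible when the two scatter plots are compared.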
Fig. 1. Scatter plot of matrix A (each of the five rows plotted as one point)
Fig. 2. Scatter plot of matrix A+B (each of the five rows plotted as one point)

The above results show scatter plots in which each row of the matrix is drawn as a vector in three-dimensional coordinates. In fact, the pictures we usually see are also stored in the form of a matrix, as shown in Fig. 3. Visually, the image of the puppy is blurry compared with the image of the jellyfish, which reflects the different dimensions of the two image matrices.
Fig. 3. The Byte Size of the Dog and Jellyfish
The gray data matrix of the puppy image is a 93 × 140 matrix, while that of the jellyfish image is a 768 × 1024 matrix, so the puppy image data matrix is much smaller. Transforming the puppy image data matrix into a matrix of the same dimensions as the jellyfish matrix generates a new image, as shown in Fig. 4. The effect of applying matrix addition to the two images is shown in Fig. 5, and the effect of applying matrix subtraction is shown in Fig. 6. In other words, visualization makes it easier to demonstrate matrix operations on data matrices in the integrated classroom. Autistic college students can develop a more intuitive understanding of the concept and operation of the matrix, as well as of the effect of matrix operations on images in image processing. Visual stimulation can arouse autistic students’ interest in learning, attract their attention, improve their understanding, and help them better understand the basic degree courses.
Fig. 4. The data matrix image after transformation
Fig. 5. The image after matrix addition
Fig. 6. The image after matrix subtraction
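The resize, add, and subtract steps behind Figs. 4 to 6 can be illustrated on tiny matrices. This JavaScript sketch assumes nearest-neighbor resizing and clamping of results to the gray range 0 to 255; the article does not state which interpolation or clamping Matlab applied, so both are assumptions, as are the function names and sample values.

```javascript
// Resize a gray image matrix to rows x cols using nearest-neighbor sampling (an assumption).
function resizeNearest(image, rows, cols) {
  const srcRows = image.length;
  const srcCols = image[0].length;
  return Array.from({ length: rows }, (_, i) =>
    Array.from({ length: cols }, (_, j) =>
      image[Math.floor((i * srcRows) / rows)][Math.floor((j * srcCols) / cols)]));
}

// Combine two equally sized gray matrices element by element, clamping to 0..255.
function combine(a, b, op) {
  return a.map((row, i) =>
    row.map((value, j) => Math.min(255, Math.max(0, op(value, b[i][j])))));
}

// Tiny stand-ins for the small (puppy) and large (jellyfish) gray matrices.
const small = [
  [10, 200],
  [60, 120],
];
const large = [
  [100, 100, 100, 100],
  [100, 100, 100, 100],
  [50, 50, 50, 50],
  [50, 50, 50, 50],
];

const enlarged = resizeNearest(small, large.length, large[0].length);
const added = combine(enlarged, large, (x, y) => x + y);
const subtracted = combine(enlarged, large, (x, y) => x - y);

console.log(enlarged);
console.log(added);
console.log(subtracted);
```

Once both matrices have the same dimensions, addition and subtraction are the same element-wise operations as for matrices A and B earlier in this section.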
The problems of linear dependence and of the maximum linearly independent group can also be presented with the help of image visual effects. Each image in an image set can be flattened into a vector, so that the image set can be represented in the form of a matrix. Matlab is used to apply elementary transformations to the image data matrix, identify the maximum linearly independent group of the data matrix, display the images corresponding to that group, and analyze the differences between the images. This helps autistic college students develop a more intuitive understanding of the maximum linearly independent group of a set of vectors. The similarity of linearly dependent images can also be analyzed from the visual effect. Visual stimulation improves the concentration of autistic college students and helps them better learn basic degree courses.
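As a hedged illustration of finding a maximum linearly independent group, the sketch below places the vectors in the columns of a matrix, reduces it to row echelon form with elementary row operations, and reports the pivot columns. It is written in JavaScript rather than Matlab, uses floating-point arithmetic with a small tolerance `eps`, and the function name `pivotColumns` and the sample vectors are assumptions.

```javascript
// Return the indices of the pivot columns of a matrix after row reduction.
// The columns at these indices form a maximum linearly independent group.
function pivotColumns(matrix, eps = 1e-9) {
  const m = matrix.map((row) => row.slice()); // work on a copy
  const rows = m.length;
  const cols = m[0].length;
  const pivots = [];
  let r = 0; // next pivot row
  for (let c = 0; c < cols && r < rows; c++) {
    // Choose the row at or below r with the largest absolute value in column c.
    let best = r;
    for (let i = r + 1; i < rows; i++) {
      if (Math.abs(m[i][c]) > Math.abs(m[best][c])) {
        best = i;
      }
    }
    if (Math.abs(m[best][c]) < eps) {
      continue; // no pivot in this column
    }
    const tmp = m[r];
    m[r] = m[best];
    m[best] = tmp;
    // Eliminate the entries below the pivot.
    for (let i = r + 1; i < rows; i++) {
      const factor = m[i][c] / m[r][c];
      for (let j = c; j < cols; j++) {
        m[i][j] -= factor * m[r][j];
      }
    }
    pivots.push(c);
    r++;
  }
  return pivots;
}

// Columns: v1 = (1,2,3), v2 = (2,4,6) = 2*v1, v3 = (1,0,1), v4 = (2,2,4) = v1 + v3.
const vectorsAsColumns = [
  [1, 2, 1, 2],
  [2, 4, 0, 2],
  [3, 6, 1, 4],
];

console.log(pivotColumns(vectorsAsColumns));
```

For an image set, each flattened image would be one column, and the images at the pivot indices would be the ones to display.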
3 Supervised and Mutually-Aided Learning Intervention

Compared with normal students, autistic students have relatively weak comprehension, relatively slow responses, and relatively short attention spans [5–15]. Therefore, for autistic college students in integrated education, the learning of degree courses needs more attention and intervention from teachers. Taking the study of Linear Algebra as an example, this article proposes a supervised and mutually-aided learning intervention strategy for the degree course study of autistic college students with social dysfunction, explores how to implement educational intervention and rehabilitation for autistic college students at the higher education stage, and presents effective educational intervention methods for their study of degree courses. Autistic students have the typical characteristics of autism, with poor understanding, short attention span, and obvious social dysfunction. In view of these characteristics, we adopt different forms of educational intervention in three settings, namely classroom teaching, after-class Q&A, and group discussion, to improve the efficiency of course learning and achieve a better learning outcome.

3.1 Classroom Teaching

Attention span varies from person to person, and that of autistic students is relatively short. Therefore, for a new course, the teacher adopts a teaching method suited to the class to draw students’ attention to the study of the course. Take matrix calculation in Linear Algebra as an example. Gray images are used to explain the addition, multiplication, and scalar multiplication of matrices, and Matlab displays the calculation results as images, which helps students form a more intuitive understanding of matrix operations and algorithms. Varying the visual stimuli prolongs the attention span of autistic students.
In addition, in the classroom practice session, the teacher can observe the autistic students’ practice and ask them questions to learn about their progress, and then formulate a plan for later reinforcement. All in all, when teaching new lessons, the teacher conducts supervised learning intervention, draws the autistic students’ attention to the newly taught knowledge through different methods, and helps them understand it.

3.2 After-Class Q&A

Thanks to the development of smart phones, tablets, and social software, it is very convenient for autistic students to take part in after-class Q&A. Social software turns everyone’s conversations with autistic students into man-machine conversations through text or voice, which reduces the autistic students’ anxiety in face-to-face communication and makes it easier for them to understand the problem. The teacher can list the key points of each new lesson through social software, which helps autistic students reinforce the knowledge learned in the classroom and report the points they did not understand back to the teacher through the same software. The teacher then gives a further detailed analysis of those points in the form of voice messages and written drafts to strengthen the autistic students’ understanding of new lessons. At the same time, the teacher assigns exercises corresponding to the key points of the new lesson to the autistic students through social software. The autistic students send photos of their completed work to the teacher, and the teacher can learn about their progress from this feedback. The teacher can also select some students, based on classroom responses and observation, and let them take part in the autistic students’ after-class Q&A session, which provides a social intervention that helps the autistic students communicate with others. Meanwhile, communication between normal students and autistic students improves others’ understanding of autistic students, lets autistic students get to know their classmates, and helps autistic students take the first step in active communication. In addition, normal students report on the autistic students’ Q&A sessions and learning progress to the teacher. Based on this feedback, the teacher formulates new Q&A sessions suited to the individual characteristics of the autistic students, so that they can better understand the knowledge.
It will achieve social intervention during the cooperative learning process. This cooperative learning model, on the one hand, cultivate the sense of responsibility of normal students, helping them to learn new knowledge better and be responsible for their activities during the Q&A session; on the other hand, it also allows normal students to have a comprehensive understanding of autistic students, so that they can keep a healthy value in terms of interacting with autistic people when they enter the society, which gives autistic people fair social opportunities and social interactions. Besides, it realizes the learning intervention and social intervention for autistic students and under the supervision of problem tasks guides autistic students take the initiative to communicate. Therefore, in the supervised and mutually-aided learning intervention process, it can help autistic students master the knowledge of the course, guide them to lead communication, and cultivate normal students’ problem-solving skills. 3.3 Group Discussion Group discussion is an effective intervention method for autistic students with social dysfunction. Considering the characteristics of the course and the learning situation of autistic students, we design a group discussion mode that gives the dominant position of autistic students, and help them to establish self-confidence in study. Take the Linear Algebra for example. The study of this course mainly examines the students’ comprehension and calculation ability. The two autistic students in grade 2017 and 2018 have similar calculation ability with other normal students, or even better. They have no difficulties with the calculation of determinant, the rank of matrix, and calculation
of equations. In the final examination, their scoring rate is relatively good. But in calculations that require logical understanding, autistic students are relatively weaker than most normal students. For example, consider finding a maximal linearly independent group of a vector group. The problem is solved in two steps. The first step is to arrange the vectors as the columns of a matrix and reduce it to row echelon form by elementary row transformations. The second step is to find the column containing the first non-zero element of each non-zero row of the echelon matrix; the vectors corresponding to these columns form a maximal linearly independent group. Autistic students can complete the first step but run into difficulties in the second step. Only through intensive exercises on similar problems can autistic students find the solution independently. The effectiveness of learning intervention through intensive exercises also shows that autistic students have the ability to learn. Through this intensive training, autistic students can master the method of finding a maximal linearly independent group of a vector group and extend their thinking longitudinally. This kind of intervention is not only a repetitive intervention but also a natural one. Because the key knowledge points stay the same while the exercises differ, autistic students can better master the method and, while doing the exercises, implicitly complete the training of their thinking. Finding a maximal linearly independent group of a vector group has a two-layer knowledge structure, whereas finding the eigenvalues and eigenvectors of a matrix has a three-layer knowledge structure. The first layer calculates the characteristic polynomial of the matrix, and the second layer calculates the eigenvalues of the matrix from the characteristic polynomial.
The third layer calculates the eigenvectors of the matrix from the eigenvalues. In the integrated classroom, autistic students may only learn how to find the characteristic polynomial of the matrix and find it difficult to carry out the subsequent layers. Therefore, during the after-class Q&A session, the teacher can help autistic students do more exercises to understand the problem-solving approach and train their way of thinking. Practising with both unchanged and varied exercises trains the logical thinking ability of autistic students. In our follow-up integrated classroom design, we will continue to focus on training autistic students’ logical thinking ability based on the Linear Algebra course, and explore the impact of this Linear Algebra-based logical thinking training on other math courses. After the problem-solving training, a group discussion that gives autistic students the dominant status is implemented to realize natural social intervention. For example, a group discussion can be held on how to diagonalize symmetric matrices. An autistic student and several normal students are placed in one group, and the autistic student serves as the group leader who feeds the results of the discussion back to the teacher. In this process, the autistic student needs to communicate with each member of the group in order to complete the task. He or she must collect and summarize the problem-solving ideas, extract the key information to obtain the correct solution, and give feedback to the teacher. Collecting problem-solving ideas is in effect an intervention in active communication: the task guides the autistic student to communicate with normal students and helps him or her learn ordinary human interaction. Summarizing problem-solving ideas is in effect a process in which autistic students present themselves to normal students, so that normal
students can form a more comprehensive understanding of autistic students, which prepares for later group discussions as well as the social communication intervention for autistic students. This helps autistic students adapt to group communication and study, overcome social dysfunction, and accumulate experience in interpersonal communication for the future. It also helps them reduce fear and makes communication a habit of independent living. Individual comment is another form of group discussion, for example when using elementary row transformations to calculate the inverse of a matrix. Autistic students and Tibetan students write the solution process on the blackboard and comment on each other’s results, which helps them master the method of finding the matrix inverse through elementary row transformations. This process involves the two main aspects of the integrated education mode: one is the integration of autistic students and normal students, and the other is the integration of Tibetan students and students from the east. Such tests based on specific problems enable the teacher to understand more precisely how well students have mastered the basic knowledge. The supervised and mutually-aided learning intervention strategy based on group discussion can not only help autistic students overcome communication barriers and study in the integrated classroom, but also give normal students a correct understanding of autistic students, promoting communication between the two groups. In short, we hope autistic students can acquire the main points of the new course in the integrated classroom. Through the after-class Q&A session, autistic students can intensively practice the key points of the new course and extend their training in logical thinking through problem solving.
At the same time, natural social intervention is conducted for autistic students by means of group discussion, in which autistic students are given a dominant position. In this way autistic students can learn ordinary social communication with others and obtain pre-training in social communication before they leave school and enter society.
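The two procedures discussed in this section can be sketched in code. The block below is only an illustrative sketch, not the authors’ teaching material: the function names, the example vectors, and the restriction of the eigenvalue part to a 2×2 symmetric matrix are all assumptions made for brevity. The first function follows the two-step method (row echelon form, then the columns of the leading non-zero elements); the second follows the three layers (characteristic polynomial, eigenvalues, eigenvectors).

```python
from fractions import Fraction
from math import sqrt


def maximal_independent_indices(vectors):
    """Return indices of the vectors forming a maximal linearly independent group.

    Step 1: write the vectors as the columns of a matrix and reduce it to
    row echelon form with elementary row transformations.
    Step 2: the column holding the first non-zero element of each non-zero
    row (a pivot column) identifies a vector that belongs to the group.
    """
    n_rows, n_cols = len(vectors[0]), len(vectors)
    # Exact rational arithmetic avoids floating-point round-off.
    m = [[Fraction(vectors[j][i]) for j in range(n_cols)] for i in range(n_rows)]
    pivots, row = [], 0
    for col in range(n_cols):
        if row == n_rows:
            break
        # Look for a non-zero entry in this column at or below the current row.
        pivot_row = next((r for r in range(row, n_rows) if m[r][col] != 0), None)
        if pivot_row is None:
            continue  # no pivot: this vector depends on the earlier ones
        m[row], m[pivot_row] = m[pivot_row], m[row]
        # Eliminate the entries below the pivot.
        for r in range(row + 1, n_rows):
            factor = m[r][col] / m[row][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[row])]
        pivots.append(col)
        row += 1
    return pivots


def eigen_symmetric_2x2(a, b, d):
    """Three-layer eigen computation for the symmetric matrix [[a, b], [b, d]]."""
    # Layer 1: characteristic polynomial  t^2 - (a + d) * t + (a * d - b * b).
    trace = a + d
    # Layer 2: eigenvalues are the roots of that polynomial. Its discriminant
    # equals (a - d)^2 + 4 * b^2, which is never negative.
    root = sqrt((a - d) ** 2 + 4 * b * b)
    eigenvalues = [(trace - root) / 2, (trace + root) / 2]
    # Layer 3: for each eigenvalue t, solve (A - t * I) v = 0.
    if b != 0:
        eigenvectors = [(b, t - a) for t in eigenvalues]
    else:
        # Diagonal matrix: each standard basis vector pairs with its diagonal entry.
        eigenvectors = [(1, 0), (0, 1)] if a <= d else [(0, 1), (1, 0)]
    return eigenvalues, eigenvectors


# Hypothetical exercise: v2 = 2 * v1 and v4 = v1 + v3, so v1 and v3 remain.
print(maximal_independent_indices([[1, 2, 3], [2, 4, 6], [0, 1, 1], [1, 3, 4]]))  # [0, 2]
print(eigen_symmetric_2x2(2, 1, 2))  # ([1.0, 3.0], [(1, -1.0), (1, 1.0)])
```

The returned indices are zero-based positions in the input list. Varying the input vectors or matrix entries while keeping the functions fixed mirrors the idea of exercises that change while the key knowledge points stay the same.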
4 Conclusion
This article takes the Linear Algebra course of Nanjing Normal University of Special Education as an example to explore the use of visual stimulation in the integrated classroom to extend the attention span of autistic students and help them adapt to the teaching mode of the integrated classroom. The task-driven approach can help autistic students take the initiative to communicate with normal students, achieving natural social intervention for autistic students. It can guide autistic students to actively ask questions, give answers, and seek help. At the same time, normal students can gain a comprehensive understanding of autistic students and establish correct values about social interaction. Learning the different topics of Linear Algebra can train autistic students in knowledge transfer and the longitudinal development of thinking. In follow-up research, we will continue to focus on how autistic students learn the key knowledge of the Linear Algebra course, explore which parts of Linear Algebra are crucial for autistic students to learn the course, and propose a teaching and learning program suitable for autistic students studying Linear Algebra.
Acknowledgment. The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work is supported by: the High-Level Talent Project of the “Six Talent Peaks” of Jiangsu Province (Project No. JY-51); the Excellent Young Backbone Teacher Project of the “Qinglan Project” of Jiangsu Province; the school-level teaching reform project “Research on Supervised and Mutually-Aided Learning Intervention for Autistic College Students”; and “Research on Cultivation of Professional Core Quality of Normal University Students from the perspective of Professional Certification Taking Educational Technology Major as an Example”.
Multi-Modal Characteristics Analysis of Teaching Behaviors in Intelligent Classroom: Take Junior Middle School Mathematics as an Example
Yanqiong Zhang, Xiang Han, Jingyang Lu, Runjia Liu, and Xinyao Liu
Nanjing Normal University of Special Education, Nanjing 210038, Jiangsu, China
Abstract. An intelligent classroom is grounded in constructivist learning theory; it is an intelligent and efficient classroom built on the principle of “teacher-driven, student-centered”. Building on the traditional classroom, the intelligent classroom combines emerging technologies such as big data, the Internet of Things, and artificial intelligence with campus management through hardware carriers, supported by application software services and campus management platforms. In addition to providing teachers with a wealth of teaching tools, the rapid development of intelligent classrooms leads to substantial changes in teaching behavior. In this study, the TEAM Model was used to select representative intelligent classroom examples to evaluate the current state of the intelligent classroom. The NVIVO research tool was used to analyze and code video samples from the selected area frame by frame, in order to study the behavioral characteristics of intelligent classroom teaching and identify potential problems in it. This study finds that analyzing teaching behavior characteristics in intelligent classrooms can help teachers create a positive classroom learning atmosphere, find and resolve students’ pain points faster, and teach students according to their aptitude.
Keywords: Intelligent Classroom · Characterization · Multi-modal · Teaching Behavior
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 158–172, 2024. https://doi.org/10.1007/978-3-031-50580-5_13
1 Introduction
The General Office of the CPC Central Committee and the General Office of the State Council issued the “Opinions on Further Reducing the Burden of Homework and Off-campus Training for Students in the Compulsory Education Stage”. This document clearly states that the education sector must guide schools to improve teaching management procedures, refine teaching methods, and strengthen teaching management in order to raise students’ learning efficiency in school [1]. Multimedia classrooms and the equipment in traditional classrooms need to develop toward digital and technologically intelligent classroom teaching. In this way, schools respond to the call of national policy and strive to improve the quality of classroom teaching. At the same time, they practice the teaching policy of ‘Putting students first’: “teaching” is combined with “learning”, the teacher’s guidance returns the central role in the classroom to the students, students’ creative thinking is stimulated, and attention is paid to their learning process, which also helps reduce teachers’ burden. Intelligent classrooms can help teachers carry out mixed teaching. They not only give teachers rich teaching resources, which increase the flexibility of the classroom for students and teachers, but also raise teaching efficiency. This fully stimulates students’ interest in and enthusiasm for learning, enables every student to complete classroom learning in an interactive, relaxed, and warm atmosphere, supports the way teachers teach, and makes teaching management easier. Therefore, it is vital to analyze and study the teaching behavior characteristics of intelligent classrooms. Through a qualitative analysis of intelligent classroom teaching videos in the TEAM Model, we summarize, to some degree, the teaching behavior characteristics of intelligent classrooms. We believe that this summary can provide experience and promote professional development for novice teachers like us. At present, most research on intelligent classrooms focuses on students’ learning behavior, and less attention is paid to teachers’ teaching behavior. Based on this, this paper analyzes the characteristics of teaching behavior in intelligent classrooms, using NVIVO research tools to comprehensively analyze the valuable information in teacher behavior data in the intelligent classroom environment, discovering teachers’ teaching advantages, disadvantages, and characteristics, and providing new ideas for teachers’ professional development in the era of big data [2].
2 Related Works
2.1 Intelligent Classroom
Pertinence, innovation, and intelligence are the three keywords associated with an intelligent classroom: (1) The pertinence of intelligent classrooms is reflected in the fact that teachers at various levels have corresponding solutions for students of different ages. For instance, the interactive learning approach of the intelligent classroom can help adolescent junior high school students pay attention in class. It can also help less experienced new teachers keep classroom order efficiently. (2) The innovation of intelligent classrooms lies in breaking knowledge into small units and enhancing students’ mastery of subject knowledge. For example, micro-lessons are one form of teaching in the intelligent classroom; most of a micro-lesson is centered on a single knowledge point, example problem, or experiment, which differs greatly from the teaching style of the traditional classroom. (3) The intelligence of the intelligent classroom is reflected not only in the advanced teaching equipment but also in the development of teaching methods. Intelligent teaching has become an essential and central part of the intelligent classroom. At present, research on the intelligent classroom is moving from universal teaching modes to specific discipline-based teaching modes, which also reflects the current development trend of intelligent classroom teaching research in China: gradually changing from technology dependence to technology application, and from teacher
teaching informatization to an intelligent direction of two-way integration, interaction, and harmony between teachers and students [3]. Meanwhile, with the update and development of information technology in recent years, the research and development of the intelligent classroom has also matured. As a result, we hope to further develop our own professionalism by studying the characteristics of teaching behaviors in an intelligent classroom, and to provide ideas and suggestions for building an intelligent classroom for middle school mathematics.
2.2 Mixed Teaching Mode
What is the mixed teaching mode? According to some studies, “The emergence of various intelligent teaching tools has greatly promoted the diversification of education models. Traditional classroom teaching has organically combined with the “Internet+” technology which formed a “product of a new era” in education, named mixed teaching mode” [4]. In general terms, it means combining offline teaching with online teaching and using intelligent teaching equipment in the intelligent classroom, which not only helps teachers exercise their leading role in the classroom but also protects students’ position as the subject of learning. Especially in today’s post-epidemic era, the development of mixed teaching modes ensures that teaching activities can proceed successfully. At the same time, Qianwei Zhang and other researchers regard the mixed teaching mode as the ‘new normal’ of education, with the qualities of flexibility, timeliness, and continuous learning. However, it also poses significant challenges to teachers, such as changing the classroom teaching mode, working out how to use intelligent classrooms to combine teaching objectives with teaching content, and teachers’ concerns about their ability to operate intelligent classroom technology proficiently [5]. In recent years, China’s research results on blended teaching have been varied.
In the CNKI Journal Paper Database, 345 papers in Chinese core journals and above can be retrieved with the theme terms “blended teaching”, “blended teaching” and “hybrid teaching mode”, including 179 papers in journal C. About 60% of these studies are empirical studies on the design and practice of mixed teaching models, as well as research on teachers’ mixed teaching competencies, learning analysis, and the theoretical basis of mixed instruction [6]. Therefore, research on the characteristics of teaching behaviors in the intelligent classroom is crucial, as it can help teachers face the challenges of the mixed teaching mode.
2.3 Foreign Related Research
Foreign research on intelligent classrooms can be traced back to 1988, when Ronald Rescigno introduced the term “Intelligent Classroom” (Rescigno, R.C., 1988). He holds that an intelligent classroom is a classroom that embeds information technology such as personal computers, interactive CD-ROMs, and video programs in traditional classrooms [7]. Meanwhile, the foreign scholar Skipton asserts that an intelligent classroom is a classroom that uses electrical or technological advancements, and he shared his thoughts on the topic imaginatively. The University of Reading in the United Kingdom pays more attention to interactive technology in the intelligent classroom and studies the interactive behavior of students in the intelligent classroom [8]. The intelligent classroom
at Arizona State University uses PDAs and situational awareness middleware to achieve group cooperative learning based on ubiquitous computing and network technologies [9]. In recent years, foreign researchers have studied the efficiency of mixed teaching. Their findings indicate that mixed teaching is significantly more effective than both face-to-face-only and online-only learning. The report also reveals that the traditional teaching mode, known as face-to-face teaching, has low learning efficiency, which implies that the traditional teaching model needs to be reformed and that a mixed teaching model, supported by scientific and technological developments, is needed to enhance learning efficiency. Meanwhile, the data show that the majority of online courses offered by U.S. colleges and universities are taught in a mixed teaching model; 50% of U.K. colleges use it; Singapore colleges and universities have adopted 80% of the mixed teaching model.
2.4 Domestic Related Research
Domestic scholars also define the ‘intelligent classroom’ in terms of the learning environment. Bangqi Liu points out that the so-called ‘intelligent classroom’ transforms and improves classroom teaching with the ‘Internet+’ mindset and the latest information technology tools to create an innovative, efficient, and intelligent classroom teaching environment. Intelligent teaching and learning is used to promote students’ individual growth and intellectual development and to solve long-standing and complex problems in traditional classroom teaching [10]. At present, domestic research on intelligent classroom teaching presents four outstanding characteristics: (1) Research is heating up year by year. (2) The research paradigm is shifting from theoretical speculation to design and application. (3) The research results are produced mainly by research groups and scholars specializing in educational technology. (4) The results of IT enterprise R&D institutions are unique.
From this, we can see that the development of intelligent classrooms shows vigorous vitality, continuously attracting more and more teachers and scholars to participate in practice and research and promoting the teaching reform and development of front-line teachers [11]. As for the current status of mixed learning in China, Professor Zhu Zhiting of East China Normal University first introduced the concept of Mixed Teaching in his “Mixed Learning in Distance Education” in 2003 [12]. Professor Keqiang also proposed the idea of Mixed Teaching at the 7th Global Chinese Computer Education Application Conference and advocated actively introducing the mixed teaching model into course instruction. In 2004, Professor Kedong Li gave a presentation on ‘Mixed Teaching – An Effective Way to Integrate Information Technology and Curriculum’, creatively proposing the eight steps of mixed teaching and learning and providing a deeper discussion of mixed teaching [13]. Then Professor Ronghuai Huang of Beijing Normal University proposed that mixed learning is ‘To learn at the right time and according to apply the fit learning technology and learning styles to deliver the right competencies to the right learners, finally get the learning styles that allows for the most optimal learning results.’, which further advanced the research on mixed teaching [14]. To sum up, foreign research on intelligent classrooms and mixed teaching has developed innovatively and rapidly and has been well received, so studying and learning from it is worthwhile. We should take the best of it and apply it in a practical way so that it can play a
more significant role in the domestic classroom. In this paper, we will analyze and study the characteristics of teaching behaviors in intelligent classrooms and mixed teaching model instruction by borrowing from the Flanders Interaction Analysis Class List to provide experience for novice teachers and promote their professional development.
3 Research Design
3.1 Research Methods and Tools
This research mainly adopts qualitative research methods, which are based on description and analysis and obtain an explanatory understanding of how research subjects construct their behavior and meaning through interaction with them, with special emphasis on the particularity of each subject’s individual experience. Qualitative research methods are superior to other research methods in terms of data collection, theory formation and understanding perspective [15]. The research tool used in this paper is NVIVO. NVIVO is a professional qualitative research software package that is frequently used by researchers when conducting theoretical research. It helps researchers quickly integrate and analyze imported data, such as images, sounds, videos, documents, and questionnaires, and then code them, which improves research efficiency and makes the analysis of the research content more focused [16]. In this paper, the video samples from the selected area were analyzed and coded frame by frame in NVIVO to study the characteristics of teaching behaviors in the intelligent classroom. The analysis process is shown in Fig. 1.
Fig. 1. NVIVO interface diagram
3.2 Research Subjects
We use the intelligent classroom teaching case videos in the learning area of the TEAM Model as research samples. The research objectives of the platform are: studying theories
and technologies of intelligent education, collecting and studying big data on teaching behaviors, studying and refining intelligent models and intelligent classrooms, establishing a comprehensive database of typical intelligent classrooms, promoting teachers’ professional growth, and coaching and tracking the construction and development of TEAM Model Intelligent Schools and TEAM Model Intelligent School Districts, so as to teach students in accordance with their aptitude and achieve a more ideal state of education for cultivating talent [17]. The teaching videos in the TEAM Model are, to a certain extent, a great help for teachers’ teaching, for example in observation, discussion and reflection. Most of the video samples selected in this paper are middle school mathematics lessons, supplemented by two classic integrated courses, in order to observe and analyze the characteristics of teaching behaviors in intelligent classrooms more comprehensively and from different perspectives. The course examples are shown in Table 1.
Table 1. Class name and location
| Serial number | Lesson Name | Location |
|---|---|---|
| 1 | Side lengths of triangles | TEAM Model Intelligent Education Research Institute Public Welfare Lecture Wisteria Station Mathematics Field |
| 2 | Area of a triangle | TEAM Model Wisdom Education Institute Public Service Lecture Zhuhai Station Mathematics Field |
| 3 | Comparison of perimeter and area | TEAM Model Intelligent Education Research Institute Public Welfare Lecture Yixing Station Mathematics Field |
| 4 | Recognize positive numbers | TEAM Model Intelligent Education Research Institute Public Welfare Lecture Cloud and Station Mathematics Field |
| 5 | Recognize the average | Nanzhuang No. 3 Middle School offline classroom + online teaching and research mixed teaching and research |
| 6 | Integrated courses | The first ‘5G Intelligent Education Cup’ Nantes Normal Student Intelligent Teaching Competition |
3.3 Design of Category Analysis Table of Teacher and Student Behavior in Intelligent Classroom Teaching
This paper uses the highly rated Improved Flanders Interaction Analysis System (IFIAS) to analyze teacher-student behavior in middle school mathematics classrooms.
The IFIAS interaction analysis system classifies the behaviors of classroom teacher-student verbal interaction into four major categories: teacher language, student language, silence, and technology use. These behaviors are represented by codes 1–15, where 1–7 denote teacher language, 8–10 denote student language, 11–12 denote silence and confusion, and 13–15 denote technology use [18]. The specific codes are defined in Table 2 below.
Table 2. Flanders Interactive Analysis
| Classify | Impact | Encoding | Elements |
|---|---|---|---|
| Teacher Language | Indirect impact | 1 | Teacher acceptance of emotions |
| Teacher Language | Indirect impact | 2 | Teacher praise or encouragement |
| Teacher Language | Indirect impact | 3 | Teachers adopt students’ views |
| Teacher Language | Indirect impact | 4 | Asking open-ended questions |
| Teacher Language | Indirect impact | | Asking closed questions |
| Teacher Language | Direct impact | 5 | Teachers teaching |
| Teacher Language | Direct impact | 6 | Teacher’s instructions |
| Teacher Language | Direct impact | 7 | Teacher criticism or assertion of teacher authority |
| Student Language | | 8 | Students respond passively |
| Student Language | | 9 | Student-initiated response |
| Student Language | | 10 | Students discuss with their peers |
| Student Language | | | Student-initiated questions |
| Stillness | | 11 | Confusion that does not contribute to teaching |
| Stillness | | 12 | Silence that contributes to teaching |
| Using technology | | 13 | Teacher operates technology |
| Using technology | | 14 | Student operates technology |
| Using technology | | 15 | Technology acts on students |
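A coded observation sequence is typically summarized by the share of each category. The sketch below shows one possible way to do this; it assigns each code to the category of the row under which Table 2 lists it (so code 8, “Students respond passively”, is counted as student language). The function name and the observation sequence are hypothetical and are not data from this study.

```python
from collections import Counter

# Category of each code, read off the rows of Table 2.
CATEGORY_BY_CODE = {}
for category, codes in {
    "teacher language": range(1, 8),    # codes 1-7
    "student language": range(8, 11),   # codes 8-10
    "stillness": range(11, 13),         # codes 11-12
    "using technology": range(13, 16),  # codes 13-15
}.items():
    for code in codes:
        CATEGORY_BY_CODE[code] = category


def category_shares(coded_sequence):
    """Return the share of observations that falls into each category."""
    counts = Counter(CATEGORY_BY_CODE[code] for code in coded_sequence)
    total = len(coded_sequence)
    return {category: counts[category] / total for category in counts}


# Hypothetical observation sequence, one code per sampling interval.
sequence = [5, 5, 4, 8, 3, 13, 14, 11, 6, 9]
print(category_shares(sequence))
```

In this made-up sequence, half of the observations fall under teacher language; a real analysis would use the codes recorded from the lesson videos.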
Based on the video analysis coding on the TEAM Model and our own understanding, we improved the Flanders interaction analysis table to form a new coding system suitable for studying intelligent classroom videos of middle school mathematics. It incorporates the mixed teaching model as well as the unique interactivity of the intelligent classroom, and is shown in Table 3 below.
Table 3. List of categories of teacher and student behavior
| Classify | Encoding | Elements |
|---|---|---|
| Teacher Conduct | 1 | Introduce the new lesson and indicate what to learn in this lesson |
| Teacher Conduct | 2 | Teacher questions, push questions |
| Teacher Conduct | 3 | Teachers share students’ answers |
| Teacher Conduct | 4 | The teacher explains |
| Teacher Conduct | 5 | Issue instructions (what students are asked to do, assigned student responses) |
| Teacher Conduct | 6 | Intelligent selection of people (random selection of people and groups) |
| Teacher Conduct | 12 | Group scoreboard: grading students; students also grade the teacher |
| Teacher Conduct | 14 | Commendation |
| Teacher Conduct | 15 | Point out the error |
| Student behavior | 7 | Students engage in reflection |
| Student behavior | 8 | Read the PPT |
| Student behavior | 9 | Students grab the right to answer |
| Student behavior | 10 | Proactively answer teacher questions and make points |
| Student behavior | 11 | Passive answers to teacher questions |
| Student behavior | 13 | Student discussions: round-table format, face-to-face discussions |
| Using technology | 17 | Statistical graphs: showing student participation and correctness of completion |
| Using technology | 18 | Timer: calculate the time for students to complete the problem |
| Using technology | 19 | Quick question and answer: The teacher quickly throws out questions and the students immediately answer them, testing their knowledge |
3.4 Data Encoding and Acquisition
3.4.1 Data Encoding
The research process for the study subjects was broadly divided into the following three steps. Firstly, the video samples selected on the TEAM Model were recorded using video recording software and imported into NVIVO for analysis. Secondly, the imported video samples were observed second by second, noting the teacher’s behavior and the students’ behavior and recording the important information about each time span and its content. Finally, nodes were created one by one based on the main content and the teaching behavior category analysis table. The video nodes are divided into three main parts: name, material, and reference point. The content in the time span of each node is indicated separately, such as “Students are having a discussion during the time of 29 s–33 s”, which ensures that the coding is consistent, as shown in Figs. 2 and 3 below.
Fig. 2. Coding of teaching behaviors (1)
Fig. 3. Coding of teaching behaviors (2)
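The node structure described in Sect. 3.4.1 (name, material, reference point) can be pictured with a small data structure. The following is only a minimal sketch: the class name `CodedNode`, its field names, the `lesson_A` label, and the second and third sample nodes are hypothetical illustrations and are not taken from the study's NVIVO project. Only the 29 s–33 s discussion segment mirrors the example given in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedNode:
    """One coded video segment (hypothetical structure, for illustration)."""
    name: str       # short description of the observed behavior
    material: str   # which video the node belongs to (made-up label)
    start_s: int    # start of the reference time span, in seconds
    end_s: int      # end of the reference time span, in seconds
    code: int       # behavior code from Table 3

    @property
    def duration_s(self) -> int:
        """Length of the coded time span in seconds."""
        return self.end_s - self.start_s


def tally(nodes):
    """Count how many nodes were assigned to each behavior code."""
    return Counter(node.code for node in nodes)


# Illustrative nodes only; the first mirrors the 29 s-33 s example in the text.
nodes = [
    CodedNode("Students are having a discussion", "lesson_A", 29, 33, 13),
    CodedNode("Teacher issues instructions", "lesson_A", 33, 41, 5),
    CodedNode("Students are having a discussion", "lesson_A", 95, 120, 13),
]

counts = tally(nodes)
print(counts[13], counts[5], nodes[0].duration_s)
```

Tallying nodes per code in this way is what produces a frequency table of teaching behaviors per lesson.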
3.4.2 Data Collection

Integrating the codes and nodes yields Table 4. In Table 4, “Elements” refers to the coded actions in the classroom; “Triangle-based Lessons” refers to the lessons “Side Lengths of Triangles”, “Area of Triangles”, and “Comparison of Perimeter and Area”; “Number-based Lessons” refers to “Recognizing Positive Numbers”, “Recognizing Mean Numbers”, and the Integrated Curriculum; “Integrated Curriculum Teaching Clip 1”, “Integrated Curriculum Teaching Clip 2”, and “Integrated Curriculum Teaching Clip 3” stand for the interspersed activities that appear in the curriculum, that is, the three teaching segments of the integrated curriculum.

Table 4. Frequency collection of teaching behavior characteristics

| Encoding | Elements | Triangle-based Lessons | Number-based Lessons | Integrated Curriculum Teaching Clip 1 | Integrated Curriculum Teaching Clip 2 | Integrated Curriculum Teaching Clip 3 | Aggregate |
|---|---|---|---|---|---|---|---|
| 1 | Introduction of new lesson | 1 | 3 | 1 | 0 | 1 | 6 |
| 2 | Questions from teachers | 4 | 6 | 0 | 2 | 1 | 13 |
| 3 | Share the answer | 4 | 9 | 1 | 3 | 3 | 20 |
| 4 | Teachers explain | 2 | 2 | 1 | 0 | 0 | 5 |
| 5 | Issue instructions | 8 | 8 | 1 | 1 | 1 | 19 |
| 6 | Intelligent selection | 3 | 2 | 1 | 3 | 2 | 11 |
| 7 | Student reflection | 2 | 1 | 1 | 1 | 0 | 5 |
| 8 | Read the PPT | 2 | 0 | 0 | 0 | 0 | 2 |
| 9 | Grabbing the right to answer | 2 | 0 | 0 | 0 | 0 | 2 |
| 10 | Proactive answers | 2 | 1 | 0 | 0 | 0 | 3 |
| 11 | Passive answers | 5 | 3 | 1 | 3 | 2 | 14 |
| 12 | Group scoreboard | 1 | 0 | 0 | 0 | 0 | 1 |
| 13 | Student discussion | 1 | 4 | 1 | 1 | 2 | 9 |
| 14 | Teacher praise | 2 | 0 | 0 | 1 | 0 | 3 |
| 15 | Point out the error | 2 | 0 | 0 | 0 | 0 | 2 |
| 16 | Tablet answer | 2 | 7 | 2 | 1 | 2 | 14 |

A more visual graph of the statistics is shown in Fig. 4 below. Fig. 4 shows that the main teaching behaviors in the intelligent classroom are issuing instructions, sharing answers, tablet responses, and passive responses. The less frequent teaching behaviors are reading the PPT, grabbing the right to answer, and the group scoreboard. It can be seen that the intelligent classroom is a student-centered classroom in which the teacher acts as a guide and facilitator of student learning: posing questions, asking students to answer, prompting them to think, giving encouragement and support based on their answers, sharing answers, guiding students to reflect on their answers, and teaching them how to learn rather than merely transmitting knowledge. From the less frequent behaviors, we can see that teachers no longer use marks as the only criterion for measuring students; the scoreboard is de-emphasized, in line with the direction of quality education. In terms of technology, the usage of the statistical chart, timer, and quick question and answer in the intelligent classroom is shown in Table 5 below.
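As a cross-check on Table 4, the aggregate column can be recomputed from the per-lesson counts and the behaviors ranked by frequency. This is a minimal sketch, not the authors' procedure: the dictionary layout and the function names `mismatched_rows` and `ranked_codes` are assumptions introduced here, and the labels are shortened English glosses of the table rows.

```python
# Table 4 counts keyed by behavior code. Each value is
# (label, [triangle-based, number-based, clip 1, clip 2, clip 3], aggregate).
TABLE_4 = {
    1: ("Introduction of new lesson", [1, 3, 1, 0, 1], 6),
    2: ("Questions from teachers", [4, 6, 0, 2, 1], 13),
    3: ("Share the answer", [4, 9, 1, 3, 3], 20),
    4: ("Teachers explain", [2, 2, 1, 0, 0], 5),
    5: ("Issue instructions", [8, 8, 1, 1, 1], 19),
    6: ("Intelligent selection", [3, 2, 1, 3, 2], 11),
    7: ("Student reflection", [2, 1, 1, 1, 0], 5),
    8: ("Read the PPT", [2, 0, 0, 0, 0], 2),
    9: ("Grab the right to answer", [2, 0, 0, 0, 0], 2),
    10: ("Proactive answers", [2, 1, 0, 0, 0], 3),
    11: ("Passive answers", [5, 3, 1, 3, 2], 14),
    12: ("Group scoreboard", [1, 0, 0, 0, 0], 1),
    13: ("Student discussion", [1, 4, 1, 1, 2], 9),
    14: ("Teacher praise", [2, 0, 0, 1, 0], 3),
    15: ("Point out the error", [2, 0, 0, 0, 0], 2),
    16: ("Tablet answer", [2, 7, 2, 1, 2], 14),
}


def mismatched_rows(table):
    """Return the codes whose per-lesson counts do not sum to the aggregate."""
    return [code for code, (_, counts, total) in table.items()
            if sum(counts) != total]


def ranked_codes(table):
    """Return codes from most to least frequent (ties keep code order)."""
    return sorted(table, key=lambda code: -table[code][2])


if __name__ == "__main__":
    print("rows with inconsistent totals:", mismatched_rows(TABLE_4))
    for code in ranked_codes(TABLE_4)[:4]:
        label, _, total = TABLE_4[code]
        print(f"{code:>2}  {label:<28} {total}")
```

Under these assumptions every row sums to its stated aggregate, and the four most frequent codes are 3, 5, 11, and 16 (sharing answers, issuing instructions, passive answers, and tablet answers), which matches the behaviors the text identifies as dominant.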
Fig. 4. Frequency histogram of teaching behavior characteristics

Table 5. Total usage rate of teaching behavior characteristics in technology

| Number | Elements | Total utilization rate (times) |
|---|---|---|
| 17 | Statistical graphs: showing student correctness and participation | 37 |
| 18 | Timer: calculates the time it takes for students to complete the problem | 2 |
| 19 | Quick questions and answers | 11 |
As can be seen from Table 5, the statistical chart and the quick question and answer are used most in the intelligent classroom, while the timer is used less frequently. This suggests that in the intelligent classroom, teachers pay more attention to students’ classroom participation and enthusiasm and, under appropriate conditions, relax the response time so that students have sufficient time to think.
4 Results of the Feature Analysis of Teaching Behaviors in the Intelligent Classroom

4.1 Analysis of Structure of Teaching Activities and Learning Atmosphere in the Intelligent Classroom

Flanders divided teaching activities into teacher’s language, student’s language, silence, and use of technology. The structure of teaching activities in the intelligent classroom corresponds to this division: the course is completed jointly through the teacher’s teaching and the students’ learning. From Table 4, we can see that the classroom is led by the teacher, who asked questions 13 times and issued instructions 19 times. The teacher guided students to think and enlivened the classroom atmosphere by asking questions and issuing instructions; thus the teacher’s language occupies the main
time of teaching activities, much more than student’s language, silence, and use of technology. Meanwhile, the teacher’s language controls the rhythm and atmosphere of the classroom. When students are unable to keep up with the teaching content or their attention drifts from the class, the teacher uses random selection to improve the situation. As seen in Table 4, students responded proactively only 3 times, while they responded passively 14 times. This method, which is used frequently in class, not only brings students’ attention back but also lets the teacher gauge students’ current level of mastery. The use of technology in the intelligent classroom keeps the classroom atmosphere from becoming dull and reduces silence. For instance, scoreboards, answer machines, and timers mobilize students’ enthusiasm and urge them to stay engaged in class. In Table 4, students answered questions with the tablet 14 times, and the teacher shared answers through the tablet 20 times. It is evident that using technology in the intelligent classroom can create better learning conditions and a positive learning atmosphere, help students activate their thinking, and let all students participate in the class.

4.2 Analysis of Teacher’s Verbal Style Tendencies in the Intelligent Classroom

Teacher’s language can be divided into two main areas: direct and indirect influence. Indirect influence includes encouragement, questioning, adopting students’ ideas, and so on, while direct influence includes lecturing, criticizing, instructing, and so on. Indirect influence accounts for the larger proportion and holds the main position in the teacher’s verbal style in the intelligent classroom.
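The claim that indirect influence outweighs direct influence can be illustrated with the aggregate frequencies in Table 4. The sketch below rests on an assumption that is not stated in the paper: codes 2 (questions), 3 (sharing students’ answers), and 14 (commendation) are treated as indirect influence, and codes 4 (explaining), 5 (issuing instructions), and 15 (pointing out errors) as direct influence. A different grouping would give different totals.

```python
# Aggregate frequencies copied from Table 4, keyed by behavior code.
AGGREGATE = {2: 13, 3: 20, 4: 5, 5: 19, 14: 3, 15: 2}

# Assumed grouping of codes (an illustration, not the paper's stated mapping).
INDIRECT_CODES = (2, 3, 14)  # questions, sharing students' answers, commendation
DIRECT_CODES = (4, 5, 15)    # explaining, issuing instructions, pointing out errors


def total(codes):
    """Sum the aggregate frequency over a group of behavior codes."""
    return sum(AGGREGATE[code] for code in codes)


indirect = total(INDIRECT_CODES)
direct = total(DIRECT_CODES)
indirect_share = indirect / (indirect + direct)

print(f"indirect={indirect}, direct={direct}, indirect share={indirect_share:.1%}")
# prints: indirect=36, direct=26, indirect share=58.1%
```

Under this assumed grouping, indirect behaviors occur 36 times against 26 direct ones, which is consistent with the statement that indirect influence holds the larger proportion.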
There are three reasons for this situation. (1) Asking questions in class is a process of two-way information exchange between teachers and students, and well-handled questioning can stimulate students’ interest, inspire their thinking, and improve classroom efficiency. The usage of quick questions and answers in Table 5 illustrates this point: it reaches 11 times in total, which shows that class questioning is frequent and effective. (2) The new curriculum standard points out that “when we want to evaluate students’ mathematics learning, we should not only pay attention to the understanding and mastering of students’ knowledge and skills, but also pay more attention to the formation and development of their feelings and attitudes; we should not only pay attention to students’ learning results, but also pay more attention to their development and changes in the learning process” [19]. Therefore, when the teacher encourages students promptly after receiving their answers, students take an interest in the course and learn more actively, which accords with the requirements of the new curriculum standard. Adopting students’ views and taking them as an opportunity to encourage students to expand the discussion can enhance their confidence and enthusiasm; it can also help students understand the content, strengthen their impression of the knowledge points, and improve their learning quality. (3) Indirect influence that finds the right direction can significantly improve students’ learning outcomes. For example, asking questions from the growth point of knowledge can broaden students’ thinking, enrich their imagination and creativity, and help them construct a knowledge structure system. To sum up, indirect influence plays an important role in the teacher’s language. The intelligent device technology
in the intelligent classroom can help the teacher better find the entry point for indirect influence. For example, the statistical chart, which has the highest usage in Table 5 (37 times), shows students’ participation and the correctness of their answers, and helps the teacher find the difficulties and doubts in students’ learning and then resolve them and guide the students.

4.3 The Impact of Mixed Learning in the Intelligent Classroom on the Characteristics of Teachers’ Teaching Behaviors

Blended learning in the intelligent classroom has the mission of promoting the classroom revolution in colleges and universities. Although the traditional teaching system is characterized by scale and standardization, the talents it has cultivated show the “a thousand people, one voice” phenomenon, which is out of line with the goal of educating talents in the new era. Blended learning, by contrast, uses the information technology of the intelligent classroom and has brought changes to college classrooms. As shown in Table 5, the use of statistical charts and timers in the intelligent classroom gives the teacher spare time to handle students’ individualized learning, to teach students in accordance with their aptitude, and to educate purposefully instead of dealing with repetitive work. This helps cultivate innovative talents and enables students to acquire the ability of lifelong learning, so that the traditional classroom, born in the industrial era, can gradually change into the intelligent classroom of the intelligent era [20]. Mixed teaching can increase the utilization of intelligent classroom instruction. As shown in Table 4, the number of new lesson introductions is low, only six, but the last three lessons, on recognizing the mean, show that the new lesson introductions use a hybrid teaching device, combining online teaching with offline learning.
It can be seen that the teacher can make good use of the powerful functions of platforms such as MOOC, SPOC, and Wisdom Tree to accomplish the task of teaching reform in the new era and the new technology environment. It can also be seen that blended learning in the intelligent classroom has a positive impact on the characteristics of the teacher’s teaching behaviors. Mixed teaching can address the cognitive aspect of teaching in the intelligent classroom, namely how to transform the teacher’s thinking and teaching behaviors, and the TEAM Model is a good platform for this. By watching videos of intelligent classroom lessons and expert explanations on the platform, teachers can further understand how to use the intelligent classroom and how to make the most of its functions, which can strongly influence their teaching behaviors. According to the above statistics, the 14 tablet responses are more than enough to give the teacher a full understanding of each student’s needs and mastery and to facilitate the teacher’s self-reflection and improvement. The problems of teaching workload and self-regulation can also be addressed with blended learning in the intelligent classroom. The data show that there are 6 new lesson introductions, through which the teacher stimulates students’ interest in learning with flexible and varied openings, enlivens the classroom atmosphere, and enhances the learning effect; the intelligent selector is used 11 times, which can help teachers control classroom order and mobilize students’ enthusiasm; and the quick question and answer system lets the teacher quickly pose questions that students answer immediately, testing the degree of mastery of student
knowledge and so on. A system of intelligent classroom technology equipment can greatly reduce the teacher’s burden and help the teacher carry out blended learning.
5 Conclusion

5.1 Research Conclusion

Through the use of technology such as statistical charts, timers, and quick question and answer software, intelligent classroom activities create good learning conditions and a positive learning atmosphere for students, activate students’ thinking, and allow all students to participate in the class. At the same time, the intelligent classroom also affects the teacher’s language characteristics: the teacher changes from the main speaker to a guide, and teaching changes from the teacher lecturing alone to interactive teaching between teacher and students. The teacher’s language influence also gradually changes from direct influence to indirect influence, mostly to encourage, praise, and motivate students. Mixed learning in the intelligent classroom also greatly affects the characteristics of teachers’ teaching behaviors. Firstly, mixed learning in the intelligent classroom has the technology and the mission to drive the revolution in the college classroom, which indirectly influences the characteristics of teachers’ teaching behaviors and makes their lectures more technical and structured. Secondly, blended learning can increase the utilization of teaching and learning in an intelligent classroom, making teachers’ teaching behaviors more innovative. Moreover, mixed instruction can address the cognitive aspect of intelligent classroom instruction, namely how to transform teachers’ instructional thinking and teaching behaviors, making the characteristics of teachers’ teaching behaviors more rational. Finally, the problems of teaching workload and self-regulation can also be addressed by blended learning in the intelligent classroom, making teachers’ teaching behaviors more convenient. As a result, teaching and learning in the intelligent classroom has the following characteristics: indirectness, technology, structure, innovation, rationality, and convenience.
5.2 Possible Problems in the Intelligent Classroom

Although teaching in the intelligent classroom is rich, there are still some problems: (1) Classroom activities are so rich and the lecture is so fast-paced that some students may not fully grasp the knowledge. (2) When the network or equipment malfunctions during class, improper handling may disrupt normal teaching. (3) Using electronic devices such as tablets for too long may affect students’ eye health. In the current era of information explosion, the development of the intelligent classroom undoubtedly helps teachers complete their teaching tasks while carrying out higher-level teaching activities. At the same time, the seating arrangement in the intelligent classroom deserves attention: 5–6 students sit in groups in a circle, which is helpful for group discussions, and the positive influence of group members allows every student to participate in class discussions.
Acknowledgements. This work was supported by the Key Project of Teaching Reform of Nanjing Normal University of Special Education “Teaching Model Reform and Practice Based on ‘Intelligent Classroom’” (No. 2021JXJG10); the Innovation and Entrepreneurship Projects for University Students in Jiangsu Province “Characteristics Analysis of Teaching Behaviors in Intelligent Classroom: Take the Teaching Video in Team Model as an Example” (No. 202212048053Y); Jiangsu Qinglan Project “Sign Language Translation” Excellent Teaching Team; the Third Level Training Object of the Sixth “333 High-level Talent Training Project” of Jiangsu Province.
References

1. Government of the People’s Republic of China Homepage. http://www.gov.cn//zhengce/202107/24/content_5627132.htm
2. Wang, D., Liu, H., Qiu, M.: Analysis method and application verification on teacher behavior data in smart classroom. J. Electro-Chem. Educ. Res. 400(05), 120–127 (2020)
3. Jiang, C., Fu, S.: A review of the research status of smart classroom in China. Teach. Manag. 799(06), 1–4 (2020)
4. Li, J., Xiong, D.: A study of the factors influencing satisfaction with mixed instruction in offline courses. Comput. Educ. 04, 98–102 (2022)
5. Zhang, Q., Zhang, M., Yang, C.: Current situation, challenges and suggestions of blended teaching readiness of college teachers. Res. Audio-Vis. Educ. 43(01), 46–53 (2022)
6. Zhang, M., Du, H.: Analysis of the implementation status and research trend of blended teaching. China Educ. Inform. 460(01), 82–85 (2020)
7. Rescigno, R.C.: Practical Implementation of Education Technology. Academic Achievement (1988)
8. Tu’lio, T., Edward, F.: The impact of an intelligent classroom on pupils’ interactive behavior. Facilities 23, 262–278 (2005)
9. Pasnik, S., Nudell, H.: PBS K-12 digital classroom pilot evaluation report (2003)
10. Li, K., Zhao, J.: Mixed learning principles and application models. J. Electro-Chem. Educ. Res. (07), 1–6 (2004)
11. Zhang, X., Tian, T., Tian, M.: The evolution and trend of smart teaching and research in China in the past decade. Distance Education in China (2020)
12. Zhu, Z., Meng, Q.: Mixed learning in distance education. Distance Education in China (2003)
13. Ma, D., Zheng, L., Zhang, H.: Mixed learning based course design theory. Electrochem. Educ. Res. (01), 9–14 (2009)
14. Liu, B.: Study on the design and implementation strategies of intelligent classroom teaching in the era of “Internet+”. Electrochem. Educ. Res. (10), 51–56 (2016)
15. Min, P., Dequan, Z.: Chinese teachers’ perception of STEM education: based on the qualitative analysis of 52 STEM teachers by NVIVO11 software. Educ. Dev. Res. 40(10), 60–65 (2020)
16. Huang, T., Li, N.: Progress and prospects of research on new think tanks in China - based on NVIVO qualitative analysis. Theory Pract. Think Tanks (01), 43–50 (2022)
17. Global Institute for Smart Education Homepage. http://www.teammodel.org/index_cn.html
18. Wang, Q.: A Study of Teacher-Student Verbal Interaction Behavior in Classroom Based on IFIAS – An Example of IT Classroom Teaching in Mudanjiang M Primary School. Mudanjiang Normal College (2022)
19. Ministry of Education Homepage. http://www.moe.gov.cn/srcsite/A26/s8001/202204/t20220420_619921.html
20. Wang, X., Wang, H., Zhong, Y., Tang, Y., Huang, Z.: Concept and practice of online and offline hybrid smart teaching model. China Educ. Inform. (10), 84–93 (2022)
Based on the 2010–2022 Review of Domestic and Foreign Educational Evaluation and University Internal Evaluation Methods

Xiaoxiao Zhu, Huiyao Ge, Liping Wang(B), and Yanling Liu(B)

School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected], [email protected]
Abstract. As technology and education systems continue to improve, education evaluation is gradually becoming intelligent and standardized, but there are still deficiencies in the education evaluation system and evaluation methods. In response to the Implementation Plan for the Audit and Evaluation of Undergraduate Education and Teaching in Regular Institutions of Higher Learning published by the Ministry of Education, this paper reviews the problems of the education evaluation system and of internal evaluation in regular institutions of higher learning, at home and abroad, during 2010–2022. The existing problems are summarized briefly to help relevant researchers and educational units carry out research.

Keywords: Educational evaluation · Internet plus · In-school assessment · Evaluation system · Effectiveness evaluation
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 173–186, 2024. https://doi.org/10.1007/978-3-031-50580-5_14

1 Introduction

With the development of science and technology and the continuous improvement of productivity, there is an increasing demand for high-quality talent in advanced, sophisticated, and cutting-edge fields. College graduates enter society after graduation and need to be able to undertake the corresponding jobs. If college graduates are unable to secure jobs that correspond with their education, this is bound to have a certain impact on the development of society. Educational evaluation plays an important role in the education system. It can evaluate colleges and universities in accordance with reasonable rules and regulations, in order to strengthen graduates’ ability to cope with the corresponding jobs upon graduation and to bring more qualified, high-quality talents into society. Educational evaluation started late in China and still has some shortcomings compared with other countries. Evaluation methods are diverse, and deep learning-related fields can also be combined with evaluation. Deep learning also contributes to many fields, such as research on computer-aided systems based on deep learning. For example, “Department of Ultrasound Diagnosis, Beijing Tongren Hospital Affiliated to Capital Medical University: Deep learning (DL) is used to automatically
extract image features through computer algorithms, and the convolutional neural network (CNN) algorithm is used to optimize traditional manual image preprocessing and feature extraction, thereby optimizing breast diagnosis.” In view of the problems faced by school evaluations, it has been proposed that they must correctly address four major issues: “why”, “how to evaluate”, “who should evaluate”, and “how to use” [1]. It is suggested that we should further improve the evaluation methods of colleges and universities and find an evaluation method of our own. In 2021, the Ministry of Education issued the Implementation Plan for the Audit and Evaluation of Undergraduate Education and Teaching in Regular Institutions of Higher Learning (2021–2025). The plan puts forward a complete system of educational evaluation, including guiding ideology, basic principles, evaluation objects, evaluation cycle, classification, evaluation procedures, organization and management, and discipline and supervision. The country is paying more and more attention to education evaluation. We now need to reflect on the achievements of the previous education evaluation system, on whether there are still shortcomings, and on what needs to be improved, in order to develop a relatively complete education evaluation system.
2 Research Status of Educational Evaluation and Internal Evaluation Methods in Colleges and Universities

In the past ten years, guided by the Opinions of the Ministry of Education on Undergraduate Teaching Evaluation in Colleges and Universities issued in 2011, all regions have continuously optimized their evaluation methods, improved the evaluation system, and raised the evaluation level in line with the overall situation of their own education reform and development, which has played a positive role in the development of national education evaluation [2]. In October 2020, the CPC Central Committee and the State Council issued the Overall Plan for Deepening the Reform of Educational Evaluation in the New Era (hereinafter referred to as the “Overall Plan”), putting forward the reform tasks of “improving discipline evaluation” and “highlighting the characteristics, quality and contribution of disciplines”. The government and universities have responded by introducing and adopting new educational evaluation methods, which have achieved remarkable results [3].

2.1 Main Achievements of Educational Evaluation and Internal Evaluation Methods in Universities in Recent Years

At present, all provinces have carried out provincial education evaluation projects covering all kinds of education, have gradually turned the evaluation projects into regular projects, and have gradually expanded the evaluation structure. For example, in 2017, the Zhejiang Educational Modernization Research and Evaluation Center and other evaluation institutions carried out evaluation projects such as monitoring the level of high-quality education modernization and monitoring the ecology of county-level basic education.
In 2021, the Zhejiang Provincial Education Examination Institute established the “one school, one feedback” mechanism, in which the results of each evaluation project are fed back in the form of official documents, further optimizing the evaluation
mechanism of “plan in advance, process supervision and result traceability” [4]. According to the Zhejiang Education Development Statistical Bulletin for 2021, indicators across the education sector improved considerably compared with previous years. For example, the enrollment rate of compulsory education was 99.99%, and the retention rate was 100%. The gross enrollment rate of higher education was 64.8%, an increase of 2.4 percentage points over the previous year. The total number of doctoral and master’s graduates increased by 943, an increase of 5.6%. Compared with other provinces, Zhejiang has improved more. In 2014, Jiangsu Province developed a review and evaluation program for undergraduate teaching work in colleges and universities and started the evaluation, with an evaluation period of 2014–2018. The evaluation practice prompted Jiangsu Province to explore top-level system design, evaluation process optimization, selection of evaluation experts, transparent (“sunshine”) evaluation, and other aspects, and it achieved results. It subsequently formed the province’s own characteristics and played a driving role in the development of education in Jiangsu Province. At the same time, the provinces are constantly improving the education evaluation system to ensure that evaluation is fair and just. The internal evaluation work of colleges and universities has been part of teaching work since the colleges and universities were established, but because the teaching system was unsound in the early stage, it was regarded merely as a teaching task. In 1985, the national undergraduate education evaluation began as a pilot project, with evaluation carried out in terms of the level of colleges and universities, majors, personnel training programs, and courses.
Since 1994, the state has put forward the evaluation principles of “promoting reform by evaluation, promoting construction by evaluation, combining evaluation with construction, and focusing on construction” to carry out the evaluation work [5]. Since then, the internal evaluation of colleges and universities has risen to a new height. Work on the construction of the internal quality education evaluation system of colleges and universities identified the improvement of quality education as part of the internal evaluation of colleges and universities, and further promoted the integration of quality education and traditional education. By analyzing the background, theoretical basis, and implementation process of the National Survey of Student Engagement (NSSE) in the United States, useful experience has been obtained for the internal evaluation of Chinese colleges and universities. For example, the internal evaluation system of universities can be optimized from multiple angles, including university development theory, university influence theory, Tyler’s time on task, quality of effort, participation theory, social and academic integration, the seven principles of good undergraduate education practice, and the change evaluation model [6]. Against the background of the Overall Plan for Deepening the Reform of Educational Evaluation in the New Era, an effect evaluation of “double first-class” construction has been proposed. Its purpose is to improve result evaluation, strengthen process evaluation, explore value-added evaluation, and perfect comprehensive evaluation (the “four evaluations”), aiming at improving the scientific rigor, professionalism, and objectivity of educational evaluation, and focusing on monitoring the development process, quality, and efficiency of the evaluation objects. This demonstrates overall,
systematic, and developmental thinking. Dynamic monitoring is implemented using modern information technology, adhering to the concept of process monitoring [3], and educational evaluation is further improved with modern information technology. The purpose of educational evaluation is to improve educational activities and educational thinking. In recent years, educational evaluation work in various places has generally shown a healthy trend of “promoting reform by evaluation”. First, in order to improve the quality of running schools, colleges and universities are encouraged to respond actively to the initiatives issued by the Ministry of Education (such as teacher certification and undergraduate evaluation). Second, the level of students’ various skills is improved, for example by encouraging students to obtain more professional skills through examinations. Third, the government performs its duties in education, for example by guiding governments and schools at all levels to establish a scientific view of educational quality. This provides a scientific reference for optimizing the quality of college personnel training and adjusting the professional structure.

2.2 The Main Problems of Educational Evaluation and Internal Evaluation Methods in Universities

Measured against the Implementation Plan for the Audit and Evaluation of Undergraduate Education and Teaching in Colleges and Universities (2021–2025) issued by the Ministry of Education in 2021, and compared with the education evaluation systems of developed countries, China’s education evaluation started late, has a weak foundation and an imperfect system, and is not widely covered. The main problems are as follows.

The System of Education Evaluation Needs to be Improved and its Coordination is Low.
There is a lack of legal provisions for education evaluation and a lack of top-level design within the education evaluation system. The evaluation institutions as a whole are “weak, small and scattered” and lack “leaders”. The lack of systematic and overall planning of educational evaluation projects eventually leads to slow and ineffective execution. Educational evaluations are launched with a high profile and end with a low profile [4]. Because the role of an “almighty government”, even a “super government”, has long existed in the school-running and management system of higher education, the process of higher education reform in China is basically driven by education policy [7]. As a result, the evaluation system is made up of a single type of actor, thinking does not diverge, and the problems of education evaluation cannot be solved substantively. The educational evaluation system lacks diversity, rationality, and scientific rigor, and there is still a certain distance between it and an educational system with Chinese characteristics.

Professional Competence Needs to be Improved and Authority is Weak. The evaluation teams in all provinces lack professional ability. The personnel of educational evaluation institutions are homogeneous, the proportion of professionals is not high, and a qualification mark for educational evaluation institutions is lacking. The quality of educational evaluation is uneven, the evaluation process is arbitrary, and its authority is weak [8].

Evaluation System Creates “Five Only” Phenomenon. China’s long-established academic evaluation puts too much emphasis on quantitative indicators such as papers,
projects, monographs, and international journals and on the so-called “grading” criteria, which divide academic journals, projects, and awards into several grades. It looks not at the quality and level of the papers themselves but at which journals and magazines they are published in. As a result, the cart is put before the horse. Academic evaluation in the “Five Only” era pays too much attention to quantification. Instead of praising the good and demoting the bad, rewarding the diligent and punishing the lazy, and driving out the bad to promote the good, the system lets many people exploit its loopholes. The evaluation is one-size-fits-all and over-institutionalized, which leads to some excellent teachers not being promoted and to school ranking evaluation relying only on some simple data. Academic achievements or contributions are reflected entirely by numbers, yet some contributions cannot be reflected by numbers, leading to unscientific and unreasonable evaluation and even to large loopholes in the evaluation system [9].

The Evaluation Method is Single. Currently, data recording is the main method adopted. The ability to analyze data is not yet available, and advanced technologies such as digitalization and intelligence have not been incorporated. Most of the data are kept for recording purposes, and there are still deficiencies in checking and comparing the evaluation data. The evaluation subject is single and the coverage is small. The lack of evaluation data from the perspectives of students, teachers, experts, parents, and others leads to incomplete data acquisition, and the results can explain only a small part of the problem. The analysis of education evaluation data is neither authoritative nor thorough, which slows the progress of education evaluation and school evaluation.
At present, educational evaluation as a whole attaches great importance to top-down external evaluation and little to bottom-up internal evaluation. On the surface the scene looks prosperous, but internally there is a lack of evaluation concepts, methods and core values. The idea of internal education evaluation has not taken deep root in colleges and universities, which do not pay much attention to internal evaluation [1].
3 Solutions to Problems Arising in Educational Evaluation and Internal Evaluation of Colleges and Universities
The unsound education evaluation system makes it difficult to carry out evaluation in an orderly, high-quality manner, and it cannot adequately solve some problems in education evaluation. For example, evaluation standards are connotative, evaluation institutions are mostly set up in departments designated by the Ministry of Education, and openness to and cooperation with the outside world are weak.
3.1 The Main Characteristics of Foreign Educational Evaluation Systems and Internal Evaluation Methods in Colleges and Universities
Evaluation is Diversified. Foreign educational evaluation institutions can be divided into two categories: one is established under the guidance of the national government; the
X. Zhu et al.
other operates independently of the government and is established by associations and non-official organizations. Unofficial appraisal agencies generally establish their legal status through legislation and perform the corresponding duties, so that appraisal work has a legal basis. For example, the Quality Assurance Agency for Higher Education (QAA) is the representative education assessment body in the UK. There are also many third-party evaluation agencies in Japan, which are accredited by the government and have special teams that organize “membership review” and “external evaluation” of schools in line with “setting accreditation evaluation”. German education assessment agencies are likewise intermediaries between the government and universities [3]. Most foreign educational evaluation institutions are not under the direct leadership of the government, which gives them initiative of their own and allows them to broaden evaluation thinking and create new evaluation methods. However, the evaluation institutions still need to comply with the evaluation documents and educational evaluation concepts issued by the government.
Developed Countries Attach Great Importance to Communication, Consultation and Interaction Among Government, Schools, Society and Other Organizations. The United States has been exploring education certification systems since the late 19th century, cooperating with government-appointed certification bodies at the school level [3]. The government is responsible for coordination, while certification bodies carry out unified certification management of school curricula and teaching. The government also allocates educational resources to accredited schools to promote the unification of evaluation ideas. Unified evaluation ideas help orient the development of education and promote its high-quality development.
Evaluation Team Specialization.
Systematic training is carried out before educational assessment is implemented, and assessors must go through professional, strict and meticulous screening and training. For example, the British Bureau of Education Standards supervises the work of education assessment personnel and sets a high threshold for entry, with strict training and standard requirements; after training, candidates must be assessed by the royal inspectors of schools before taking up their posts [4].
A Variety of Assessment Methods. A variety of educational assessment methods is conducive to the efficient implementation of educational assessment. Educational assessment no longer obtains its data simply from paper documents and field investigation, but also adopts advanced theories and Internet technologies to obtain more assessment data, such as the National Survey of Student Engagement (NSSE) adopted by the United States [10]. The idea of combining big data with development evaluation has been put forward with the help of “Internet+” [11], so that the quality of education can be verified across the whole process of student learning. In evaluating professors, the University of Cambridge has a distinctive method of assessing teaching achievements: professors directly receive the evaluations given by each student [12]. Students can also give evaluations online and through questionnaires, which better promotes teaching.
Evaluation Ideas into Teaching. Foreign colleges and universities conduct evaluation not only to cope with the inspection of the evaluation institutions, but also to carry
out the educational evaluation in line with the principle of promoting school reform and development. Institutions of higher learning in developed countries carry out annual evaluations, which show a positive correlation between the achievements of college graduates and their contributions to society. Only in this way can a mutually reinforcing situation be formed that helps improve the level and quality of university education.
3.2 A Practical Analysis of Domestic Educational Evaluation System and University Evaluation Methods
In October 2004, the Higher Education Teaching Evaluation Center was established by the Ministry of Education, and Chinese education evaluation work began to be carried out formally and in an orderly way. With comprehensive reform in the education field and the continuing transformation of the government’s education functions, regions have been accelerating the construction of the organizational system for education evaluation. For example, evaluation institutions tend to be independent. Each province has set up evaluation institutions directly under the municipal education administration department. Moreover, third-party evaluation agencies sponsored by social organizations are actively encouraged and guided to participate in professional evaluation. The cooperation of multiple evaluation organizations is conducive to integrating and improving new educational evaluation ideas and further strengthens the educational evaluation system.
Evaluation Project Management is Becoming More and More Standardized. The scientific and standardized setting and management of evaluation projects promote the separation of government and affairs. For example, the Jiangsu Provincial Department of Education issued the Ten Public Commitments of Jiangsu Provincial Department of Education to ensure the orderly conduct of evaluation work.
Consensus on work regulations is gradually being reached, and a set of universal standards and requirements for work procedures, methods and techniques is being established. For example, Shanghai planned to publish a Study on Educational Evaluation Procedures in 2020, and the Chongqing Institute of Education Evaluation developed the “Education Evaluation Procedures”, which was officially released and implemented as a local standard in 2018.
The Evaluation System is Becoming More and More Perfect. In response to the five problems of “only papers, only hats, only professional titles, only education, only awards”, on June 30, 2020, the Commission for Deepening Reform of the CPC Central Committee adopted the Overall Plan for Deepening the Reform of Educational Evaluation in the New Era [13]. Some Opinions on Improving the Academic Evaluation System, released by Tsinghua University, proposes establishing an evaluation orientation that emphasizes teachers’ ethics, talent and learning, and the quality of contributions. It calls for strengthening the construction of an academic community based on academic belief and cultural identity, improving the peer evaluation system, and returning to the spirit of reward-based incentives. Instead of linking academic evaluation with high material rewards, the main structure of evaluation should be further enriched, and more emphasis should be placed on college self-evaluation, peer evaluation and student participation on the basis of the current government-led evaluation with third-party institution participation [14]. This will further help solve the five problems.
The Internal Evaluation Methods of Colleges and Universities Present a Diverse Situation. The overall thinking within colleges and universities has gradually changed from emphasizing top-down external evaluation and undervaluing bottom-up internal evaluation to valuing bottom-up internal evaluation alongside external evaluation [6]. Colleges and universities are adopting innovative forms to reform and upgrade their internal educational evaluation methods, which promotes the development of educational evaluation. For example, the experience of the Knowledge Exchange Framework in British universities can be absorbed and applied in China [15]. Under the “Internet + education” concept and model of the evaluation system, everyone is both an evaluation subject and an evaluation object [16]. Web-based evaluation methods for university library users allow evaluation data to be visualized so as to optimize evaluation materials [17], and STEAM education evaluation in primary and secondary schools can be applied to help develop China’s education [18]. Against the background of “Internet Plus,” the evaluation system for college students’ innovation and entrepreneurship education is being applied within college education evaluation to strengthen the construction of the evaluation system for innovation and entrepreneurship education [19]. These efforts promote the orderly conduct of internal education evaluation in colleges and universities, move away from a purely top-down situation, and recover the original intention of education.
4 Strategies for Constructing Educational Evaluation System and Evaluation Methods in Colleges and Universities
Modern education cannot do without the modernization of education evaluation. To promote the continuous improvement of China’s education evaluation system and of internal evaluation methods in colleges and universities, and to root the concept of educational evaluation deeply in colleges and universities, this paper puts forward the following suggestions for accelerating the improvement of both.
4.1 To Construct a Theoretical System of Educational Evaluation with Chinese Characteristics
After 20 years of exploration and practice, China has initially set up a higher education evaluation system with Chinese characteristics, but there are still some defects and deficiencies compared with foreign countries. Reform and opening up have brought about not only economic reform but also ideological reform. The new century requires that the theoretical system of higher education evaluation in China develop towards a theoretical system of education evaluation with Chinese characteristics that “makes foreign things serve China” and “brings forth the new from the old” [20]. An education evaluation system with Chinese characteristics should adopt “guided by Xi Jinping Thought on Socialism with Chinese Characteristics for a New Era, fully implement the Party’s policy on education, and ensure that education serves the people, the CPC’s governance,
the consolidation and development of the socialist system with Chinese characteristics, and the reform, opening up and socialist modernization drive” as the guiding ideology. It should follow the five basic principles, classify according to the assessment object, and evaluate in cycles. Scientific and rigorous evaluation procedures should be specified, following the sequence of assessment application, school self-evaluation, expert review, feedback of conclusions, rectification within a time limit, and supervision and review [5]. The idea of effectiveness evaluation should be integrated into the evaluation system to provide ideas and direction for it and to improve it further. National education evaluation work should follow the overall planning of the Ministry of Education and speed up the construction of a theoretical system of educational evaluation with Chinese characteristics.
4.2 Enhance the Competence Level of the Assessment Team and Promote Professional Assessment
The establishment of an education evaluation system cannot do without highly qualified and professional education evaluation teams. It is suggested that the country set up the post of education evaluation teacher to raise the level of education evaluation teams in China. Systems for the qualification, employment, training and development of educational evaluation professionals should be organized and formulated, creating a favorable policy environment for the professional development of talent. Educational appraisers should have to pass an examination at the corresponding level and be hired only after passing it. Moreover, foreign professional educational evaluation institutions should be introduced, and international exchanges and cooperation should be greatly strengthened.
Chinese educational assessment institutions should be actively encouraged to join INQAAHE and the Asia-Pacific Quality Assurance Network established by UNESCO, to strengthen exchanges and cooperation with relevant foreign institutions, and to learn advanced assessment ideas and theories from abroad at close range, so as to facilitate the formulation of China’s own educational assessment system [22].
4.3 The Internet has Helped Diversify Appraisal Work
Educational evaluation needs the support of a large amount of data from multiple angles; only then can the measured data be accurate and “evaluation to promote reform” be carried out in finer detail. Without informatization there is no modernization: information technology supports, promotes and leads innovation in the thinking and techniques of educational evaluation. Education assessment can be organically combined with big data, cloud computing, mobile Internet, artificial intelligence and other technologies to further promote “intelligent assessment”. Introducing Internet technology makes it possible to monitor education work in real time, makes the evaluation work more complete, and continuously improves and supplements data sources, so that more accurate evaluation results are obtained. The Internet helps evaluation become intelligent, accurate and personalized, and it can provide development suggestions that promote the continuous improvement of the evaluation object. The Internet can also be used to analyze the subtleties of educational work and derive strategies for improvement [16], making educational evaluation more scientific, professional and objective. To realize the modernization of
education assessment, evaluation based on valid data is key, and more attention should be paid to the application of digital technology. The organic combination of digital technologies represented by the Internet of Things, cloud computing, virtual reality, blockchain and artificial intelligence is constantly providing people with new perspectives and modes of thinking. In this digital age of both opportunities and challenges, how to strengthen the support of digital technology, optimize the path of education governance modernization, and effectively guarantee the expected results of education evaluation is an extremely important question. Digital technologies such as “Internet Plus”, cloud computing and blockchain are gradually becoming the breakthrough point for educational evaluation methods [21].
4.4 External Education Evaluation and University Internal Evaluation should be Organically Combined
Educational evaluation is the process of assessing the efficacy and working status of the education system. Its purpose is to urge educational work units to train students in accordance with the talent training program, so as to achieve the original intention of running schools. Colleges and universities should fully realize that evaluation is not an end in itself; its purpose is to promote the development of schools. Universities should treat external evaluation as a natural thing rather than something to be coped with perfunctorily. They should focus on developing education work and doing practical work, take the improvement of students’ professional ability as the core, promote students’ comprehensive moral, intellectual, physical, aesthetic and labor development, and enhance the influence of the university. If each educator does his or her own job, educational evaluation will proceed smoothly.
If external evaluation and the internal evaluation of colleges and universities are treated as a single task and combined organically, our educational evaluation work is bound to reach a new stage.
5 Summary and Prospect
How to improve the quality of university operations is a complicated and important question that is always worth studying. This paper analyzes educational evaluation and internal evaluation methods in colleges and universities, introduces the problems of educational evaluation and the defects of evaluation methods, and summarizes the specific solutions that have emerged. Based on the current state of research, further work can be conducted in the following directions:
5.1 Construct Educational Evaluation Theory System with Regional Characteristics
The “Overall Plan for Deepening the Reform of Educational Evaluation in the New Era” clearly states that we should improve the system and mechanism for promoting morality and educating people, reverse the unscientific orientation of educational evaluation, and resolutely overcome the stubborn problems of valuing only grades, higher education,
diplomas, papers, and titles. The excessive pursuit of science ignores the humanistic value of education. In the future, the direction of “breaking the five dimensions” should be to break “one” and establish “many.” Chinese universities should be run well on the foundation of personnel training and scientific research. First, a general evaluation index with Chinese characteristics should be established. According to its own significant characteristics, each institution will look for landmark factors in the evaluation index. Once a university deliberately pursues an index as a goal, the index loses its original significance. If a university simply aims for the target and ignores connotation construction, its development direction will deviate. It may rise in a certain ranking for a while, but the level of the school cannot really be improved. We should establish by breaking, and then continuously improve our country’s educational evaluation system.
5.2 The Improvement of Multi-angle Enriched Research Evaluation System
Promote construction through evaluation, reform through evaluation and management through evaluation. The educational evaluation of higher education is divided into internal evaluation and external evaluation. Internal evaluation is the responsibility of colleges and universities, while external evaluation is the responsibility of educational administrative departments. Educational evaluation is carried out by departments designated by the state. The evaluation team is composed of authoritative education experts and academic members, and carries out the evaluation according to certain rules and regulations. Because of the limits imposed by regulations, current assessment work lacks broad and bold ideas.
If the evaluation work cannot be completed efficiently, the divergent thinking of a third party can be used in the future to compensate for the rigidity of the evaluation work, and integrating that thinking can further improve the evaluation system.
5.3 Combination of Diversified Technical Means and Educational Assessment
The subjects and objects of education evaluation are also becoming more and more diversified: teaching, scientific research, teachers, students, university management, school quality, and so on all need to be evaluated. Educational evaluation is a field highly dependent on the collection and analysis of information and data. It can be said that evaluation cannot be completed without information and data. Today we should make full use of the Internet, data mining, visualization analysis, and other means to obtain more real and reliable data as support. Evaluation can rely on platforms to obtain behavioral data about teachers and students. This not only provides a new perspective on higher education evaluation, but also enhances the objectivity and comprehensiveness of the evaluation. Education evaluation should combine online and offline approaches; the problems identified should be sorted out and discussed, and relevant solutions should be given to continuously optimize education work.
5.4 Effective Combination of Different Assessment Objects in the Assessment Process
Expert evaluation refers to the evaluation of colleges and universities by authoritative education experts and academic committees organized by the state in accordance with rules and regulations. Expert committee members have rich experience in education, teaching and evaluation and a strong ability of identification and judgment, so their judgments are more scientific. Internal evaluation refers to colleges and universities spontaneously organizing teachers with a certain authority to conduct evaluations according to rules and procedures, so as to promote reform and improve the quality of running schools. Expert evaluation should be organically combined with the internal evaluation of colleges and universities. The evaluation groups of colleges and universities should absorb the experience of expert evaluation groups to improve their evaluation abilities. The expert evaluation team should communicate with the university’s internal evaluation team over the long term to understand the facts. The frequency of communication between the two should be increased so that assessment becomes part of the daily routine rather than a matter of going through the motions.
Acknowledgement. This work was supported by Educational science planning of Jiangsu Province (D/2021/01/23, B/2022/04/05), Jiangsu University Laboratory Research Association (GS2022BZZ29), Research topic of Teaching reform of Nanjing Special Education Teachers College (2021XJJG09, 2022XJJG02), Jiangsu Research on development of the disabled (2022SC03014) and Universities’ Philosophy and Social Science Researches Project in Jiangsu Province (2020SJA0631, 2019SJA0544). The authors gratefully acknowledge this support and the reviewers who have given valuable suggestions.
References
1. Ji, C.: Four key points in school evaluation. Educ. Dev. Res. 39(24), 3 (2019). https://doi.org/10.14121/j.cnki.1008-3855.2019.24.002
2. Lu, Y.: Practice and thinking of undergraduate teaching review and evaluation in Jiangsu Province. Shanghai Educ. Eval. Res. (5), 6–10 (2017). https://doi.org/10.13794/j.cnki.shjee.2017.0068
3. Wang, F., Cui, M.: Comparing horizon, our country’s higher education mechanism construction of strategy of the third-party evaluation. J. Heilongjiang Province High. Educ. Res. 33(3), 6:41 and 46 (2021). https://doi.org/10.19903/j.cnki.cn23-1074/g.2021.03.008
4. Huang, L., Zhang, Y., Wang, X., Cheng, L.: Zhejiang education evaluation from the modern system construction of thinking. J. Shanghai Educ. Eval. Res. (5), 49–54 (2022). https://doi.org/10.13794/j.cnki.shjee.2022.0060
5. Xie, L., Wang, J.: Thinking on the collaborative development of educational evaluation institutions and universities. Shanghai Educ. Eval. Res. 1(01), 48–52 (2012)
6. Torres, V., Ribera, A., Gross, J.: NSSE-National Survey of Student Engagement. Paper Presented at the (2002)
7. Chen, D.: “China program” of undergraduate teaching evaluation: context, problems and trend. J. Hunan Normal Univ. Educ. Sci. 12(5), 107–115 (2020). https://doi.org/10.19503/j.cnki.1671-6124.2020.05.015
8. Feng, H.: The connotation of the modernization of education evaluation characteristics and promoting strategies. J. Shanghai Educ. Eval. Res. 8(3), 1–4 (2019). https://doi.org/10.13794/j.cnki.shjee.2019.0034
9. Li, L., et al.: Beyond the “five aspects”: concerns and prospects of higher education evaluation in the new era. Univ. Educ. Sci. (06), 4–15 (2020)
10. Kuh, G.D.: What we’re learning about student engagement from NSSE: benchmarks for effective educational practices. Change: The Mag. High. Learn. 35(2), 24–32 (2003). https://doi.org/10.1080/00091380309604090
11. Raftree, L.: Big data in development evaluation. (23 Nov 2015) [25 Mar 2022]. https://lindaraftree.com/2015/11/23/big-data-in-development-evaluation
12. Zhou, C., Sun, Z.: Review of Higher Education evaluation methods at Home and abroad. Chin. Bus. (Second Half) (11), 268 (2008)
13. t20200701_469492.html
14. Xu, W.: Method a preliminary study of university library user education evaluation. J. Libr. Work Study (11), 71–73 (2012). https://doi.org/10.16384/j.cnki.lwas.2012.11.017
15. Ma, X., Feng, L.: Content and methods of “Knowledge Exchange Framework” University Evaluation Project in UK. Univ. Educ. Sci. (01), 120–127 (2021)
16. metcalf jv. “Internet + education” concept and mode analysis. China’s High. Educ. Res. (02), 70–73 (2016). https://doi.org/10.16298/j.cnki.1004-3667.2016.02.13
17. Chen, J.: Research on Internal Evaluation Mechanism of Colleges and Universities. Fujian Normal University (2009)
18. Song, N., Gao, X.: The connotation, value and theoretical framework of STEAM education evaluation in primary and secondary schools. Res. Educ. Sci. (10), 47–53 (2021)
19. Wang, X.: Construction of college students’ innovation and entrepreneurship education evaluation system under the background of “Internet Plus”. Coll. Logistics Rese. (08), 86–88 (2021)
20. Kang, H.: Evaluation system of higher education in our country: review and outlook. High. Educ. Explor. (04), 20–22+86 (2006). (in Chinese)
21. Feng, X., Fan, X.: Blockchain technology driving education evaluation modernization: realistic prospect, applicable value and implementation approach. Educ. Dev. Res. (19), 69–74 (2022). https://doi.org/10.14121/j.cnki.1008-3855.2022.19.011
22. Sun, C., Du, R.: The reform and new vitality of higher education quality evaluation: an overview of the 2020 annual conference of education evaluation branch of china higher education association. High. Educ. Dev. Eval. 37(05), 55–62 (201)
23. Zheng, J., Wang, Y.: Shallow discuss the construction of university internal quality education evaluation system. J. Inform. Sci. Technol. (18), 184–185 (2013). https://doi.org/10.16661/j.cnki.1672-3791.2013.18.140
24. Pang, C.: The characteristics and enlightenment of British educational evaluation. Shanghai Educational Eval. Res. 10(5), 57–61 (201)
25. Zhu, H.: Research on Teaching Quality Evaluation System of Colleges and Universities. Dalian University of Technology (2004)
26. Tao, J., Li, Y.: The enlightenment of the evaluation process, index system and measurement methods of Indian universities on the comprehensive evaluation of Chinese “double first-class” universities. Sci. Manag. Res. 33(01), 6:155–162 (2021). https://doi.org/10.19445/j.cnki/g3.2021.01.024.15-1103
27. Li, L.: The analysis of American NSSE and its enlightenment to Chinese School Internal Evaluation. Educ. Tribune (8), 93–96 (2015). https://doi.org/10.16215/j.cnki/g4.2015.08.022cn44-1371
28. Zhang, J.: Five difficulties in policy evaluation of higher education evaluation. Acad. Forum 34(09), 203–206 (2011). https://doi.org/10.16524/J.45-1002.2011.09.045
29. Xin, T., Li, G.: Effect and experience of China’s educational evaluation Reform since the 18th National Congress of the CPC. People’s Education (Z2), 6–10 (2022)
30. Chen, T.: Education evaluation method. Shanghai: Shanghai Educ. Eval. Res. 9(02), 17–21 (2020). https://doi.org/10.13794/j.cnki.shjee.2020.0018
31. Tao, L., Wei, D., Xu, G.: Based on combinatorial optimization algorithm is the wisdom of the construction of colleges and universities evaluation method in the research and design. J. Guizhou Univ. (Nat. Sci. Ed.) 38(4), 79–85 (2021). https://doi.org/10.15958/j.cnki.GDXBZRB.2021.04.13
32. Wang, Z., Chang, L., Lin, H.: Effect evaluation and dynamic monitoring of “double first-class” construction. Degree Postgraduate Educ. (11), 47–54 (2022). https://doi.org/10.16750/j.adge.2022.11.006
33. Wang, Y.: Taking educational evaluation as an opportunity to promote the overall development of teacher structure – based on the quantitative analysis of the teacher structure of 14 physical education colleges in China. Guangzhou Sports college J. 29(01), 14–20 (2009). https://doi.org/10.13830/j.cnki/g8.2009.01.006cn44-1129
34. Feng, X., Fang, L., Liu, L.: Education, “tube” and “evaluation” in the governance system of operating mechanism research. J. Shanghai Educ. Eval. Res. 9(02), 7–12 (2020). https://doi.org/10.13794/j.cnki.shjee.2020.0016
35. Zhang, X., Chen, Y., Du, R.: Construction and evaluation of high-quality education system: a review of 2021 annual conference of education evaluation branch of China higher education association. High. Educ. Dev. Eval. 38(05), 34–42 (202)
36. Zhao, R., Chen, W.: Deep learning under the perspective of scientific evaluation method innovation. J. Intell. Sci. 40(11), 3–11+19 (2022). https://doi.org/10.13833/j.issn.1007-7634.2022.11.001
37. Lv, M., Zhou, S., Zhu, Q.: Computer aided diagnosis system based on the deep study of breast ultrasound research progress. China Med. Imag. Technol. 4(11), 1722–1725 (2020). https://doi.org/10.13929/j.issn.1003-3289.2020.11.031
A Summary of the Research Methods of Artificial Intelligence in Teaching
Huiyao Ge, Xiaoxiao Zhu, and Xiaoyan Jiang
School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
Abstract. With modern technology developing rapidly, “Artificial Intelligence” has become a buzzword of the times. The integration of information technology and artificial intelligence provides an opportunity for optimizing education. This article briefly reviews the application of artificial intelligence in teaching from four aspects: learning environment creation, learning data analysis, learning resource matching, and learning path intervention. Learning environment creation can broaden the learning dimension and help students learn immersively. Intelligent analysis such as multimodal data mining and affective-computing-based learning analysis can identify students’ emotional feedback on a given piece of content and help teachers adjust teaching content and pace in a targeted way. Learning resource matching technology helps match learning resources to students’ personality characteristics and appearance differences. Teachers can carry out learning path intervention for different students, helping them adjust their learning paths and consolidate what they have learned. Future research directions are proposed for some of the research methods. This will help researchers grasp the research in this field as a whole and will play an important role in promoting the application and development of artificial intelligence in education.
Keywords: Artificial intelligence · Education · Virtual Reality (VR) · Affective computing · Resource matching · Learning path
1 Introduction

Artificial intelligence (AI) is an emerging science and technology concerned with simulating, extending, and expanding human intelligence [1]. Since its birth, AI has evolved continuously and has been valued and applied in more and more fields [2], such as medical treatment [3], transportation [4], and logistics [5]. As a powerful technology, AI will also deepen curriculum reform in education, promote the continuous optimization of the education structure, and better support the formation of a high-quality education system [6].

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved.
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 187–196, 2024. https://doi.org/10.1007/978-3-031-50580-5_15
2 Application Scenarios of Artificial Intelligence in Education

2.1 Learning Environment Creation

Virtual reality (VR) and augmented reality (AR) technologies have developed rapidly in recent years, have become mature, and are widely applied in teaching. For example, with VR and AR, learners can travel freely through the universe, from the movement of small molecules to vast galaxies. VR and AR teaching can concretely present things that are difficult to observe directly with the human eye. The resulting scenes are vivid and engaging, which stimulates learners’ interest. In experimental teaching, VR and AR can also be used to simulate experiments, which both familiarizes students with the operating procedure and saves experimental costs. The saving is especially clear in flight attendant teaching [7], where emergency-handling training is very costly and the required materials are expensive and cannot be reused. AR and VR cannot completely replace the actual experiment, but they can reduce costs to some extent while reinforcing operating skills and building experience.

Creating a specific teaching situation depends on the scene. Zhong et al. [8] designed a VR-based experiential learning environment that follows the principles of authenticity, convenient guidance, appropriate content, and reflection. In creating specific teaching scenarios, the first-person perspective can enhance students’ immersive experience, while the third-person perspective can reinforce the sense of reflection, dialogue operation, and other learning experiences.

Building on AR, Qian et al. [9] combined 5G with AR and proposed the joint application of 5G+AR for an immersive learning experience. The 5G+AR experience mode provides new features and opportunities for immersive learning: AR brings a “mind-body-environment” learning experience, while 5G offers broader room for development and helps overcome the technical barriers of AR. Low latency, precise positioning, and high-quality communication constitute the basic infrastructure of 5G+AR. They enable sensory interconnectivity, polymorphic simulation interaction, and personalized learning analysis, which facilitates the transformation of 5G+AR.

Relatedly, Guangdong Province launched the “Gigabit Optical Network + Cloud VR Education” program [10] in 2021, covering basic disciplines such as art, aesthetics, history, and culture. The program starts from the 12th school in Guangdong Province, where 40 second-grade students wear VR equipment and, following the teacher’s guidance, immerse themselves in the poetic world described by the poet.

The effectiveness of AR and VR education has been confirmed in various ways. For example, Ruan [11] took environmental education courses for college students as an example and carried out a quantitative comparative analysis through experiments, concluding that AR- and VR-based environmental education courses can enhance college students’ awareness of environmental protection, promote teaching
efficiency, and help shape a positive personality, cultivating college students’ awareness of environmental protection and sustainable development.

The integration of AR, VR, and 5G broadens the spatial dimension of learning, creates a new learning environment for learners, and expands teachers’ teaching methods, providing new ideas for teaching. Although 5G addresses pain points of VR such as slow loading and low resolution, the 5G industry is still small in scale, its cost is high, and its terminals are not yet mature. The popularization of joint 5G and AR applications therefore depends on continued technological development to resolve these 5G bottlenecks.

2.2 Learning Emotional Data Analysis

Learning emotion is important to learners’ efficiency, perception, and thinking ability. Facial expressions are a significant indicator of a person’s emotional state, since emotions are often reflected in the movement of facial muscles. As a result, recent research has focused on facial expression recognition based on dynamic face images. Common research methods include the optical flow method [12], the feature point tracking method [13], the modeling method [14], the difference method [15], and deep learning methods, the last of which have been a research hotspot in recent years. Zhang et al. [16] were the first to use deep learning for micro-expression detection. Compared with traditional machine learning, detection accuracy improved, but only apex frames could be detected. Tran et al. [17] proposed a micro-expression detection method based on a sequence model. Ding et al. [18] used a sliding window to segment long micro-expression videos into several short clips and combined optical flow with a Long Short-Term Memory (LSTM) recurrent neural network.
An improved low-complexity optical flow algorithm extracts a feature curve, from which the LSTM predicts the occurrence of micro-expressions. However, the start and end frames are not precisely determined. Lei et al. [19] proposed a two-stream Graph Temporal Convolutional Network (Graph-TCN) for micro-expression recognition that captures the motion features of local facial muscles. Zhao et al. [20] proposed the Spatio-Temporal AU Graph Convolutional Network (STA-GCN), a deep learning-based emotion recognition method. First, a three-dimensional convolutional neural network (3D CNN) extracts the spatio-temporal action unit (AU) motion information of AU-related regions. Then, a graph convolutional network (GCN) captures the dependencies between AUs. Finally, the activation features are multiplied with full-face features for micro-expression recognition. Wang et al. [21] combined AUs with facial key points to construct four key regions (eyes, nose, cheeks, and mouth), weighted the corresponding regions, and proposed MER-AMRE (an MER framework with attention mechanism and region), whose network extracts features and improves the ability to extract local information.

Although deep learning methods are widely used, their performance degrades when the model overfits. It is worth noting that Ren et al. [22] proposed the Lung Cancer Data Augmented Ensemble (LCDAE) framework, which has successfully overcome the overfitting problem in lung cancer classification tasks and
currently achieves the best results. This study may provide new ideas for related research on emotional data analysis.

Based on deep learning for information fusion and data fusion computation, Zhai et al. [23] established a two-dimensional data model of middle school students’ facial expression and face pose emotions for online education, built learning emotion computation models based on single-source data and on multi-source data fusion, and identified the optimal model by comparison. The main innovation lies in the multi-source data model and the integration of deep learning algorithms with data fusion algorithms. They recruited a sample of 36 students for the experiment. After data collection and screening, data labeling, division and selection of learning emotions, and dataset splitting, the images for analysis were determined and labeled with four emotions: happy, confused, calm, and bored. The four emotions were analyzed with a focus on face pose feature points. The VGG-16 (Visual Geometry Group) and ResNet-50 convolutional neural networks were chosen as the models for deep emotion perception. Using two kinds of weights, they weighted the models’ outputs to simulate the overall output for each emotion, and then integrated the information on facial expression and face pose into the final emotion fusion model. According to their experimental results, both the facial expression model and the face pose model recognized boredom best, followed by happiness. Comparing the fusion results of the single-source and multi-source data models, the multi-source data model was selected as the optimal emotion recognition model.
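The weighted fusion of two model outputs described above can be illustrated with a minimal sketch. This is not the implementation of Zhai et al. [23]: the function name `fuse_emotions`, the example probability vectors, and the weights 0.6 and 0.4 are assumptions chosen only for illustration; only the four emotion labels come from the text.

```python
# Hypothetical sketch of weighted late fusion of two emotion classifiers.
# Only the four labels come from the text; everything else is assumed.
EMOTIONS = ["happy", "confused", "calm", "bored"]


def fuse_emotions(expr_probs, pose_probs, w_expr=0.6, w_pose=0.4):
    """Combine per-class scores from an expression model and a pose model.

    The weights are illustrative assumptions, not values from the cited study.
    Returns the winning label and the renormalized fused scores.
    """
    if len(expr_probs) != len(EMOTIONS) or len(pose_probs) != len(EMOTIONS):
        raise ValueError("each model must output one score per emotion")
    # Weighted sum of the two models' scores for every emotion.
    fused = [w_expr * e + w_pose * p for e, p in zip(expr_probs, pose_probs)]
    total = sum(fused)
    fused = [f / total for f in fused]  # renormalize so the scores sum to 1
    best = max(range(len(EMOTIONS)), key=lambda i: fused[i])
    return EMOTIONS[best], fused


# Example with made-up scores from the two single-source models.
label, scores = fuse_emotions([0.10, 0.20, 0.30, 0.40],
                              [0.05, 0.15, 0.20, 0.60])
print(label, [round(s, 2) for s in scores])
```

In practice the weights would be tuned on validation data; comparing several weightings is one way to select an optimal fusion model of the kind the study describes.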
2.3 Learning Resource Matching

Resource recommendation technology is widely used in fields such as shopping, news, and entertainment. Many recommendation methods have been developed, including collaborative filtering [24], content filtering [25], user experience level checking [26], association rules [27], and user feedback analysis [28], as well as Markov chains [29], resource ontologies [30], hybrid models [31], and Bayesian models [32]. However, personalized recommendation in education still faces challenges: educational resources have their own characteristics, and a recommendation must meet both learners’ needs and learners’ characteristics.

To address this issue, Han et al. [33] proposed a recommendation method for educational resources based on matching effectiveness. Models were constructed from the characteristic information of users and of resources. Because primary and secondary school students make up the majority of users, the user model covers students’ learning background, learning style, and ability. The resource model was built on the learner model, using the background information, type, and difficulty level of teaching resources as construction points, and the relationship between each user and each teaching resource was quantified. Then, the matching degree between students and teaching resources was calculated. The effectiveness of similar application learning resources, as well as the effect of users on teaching resources,
were evaluated, and the matching efficiency of teaching resources and users was comprehensively assessed. Finally, teaching resources were recommended to users according to their numerical matching efficiency. This scheme takes full account of students’ learning backgrounds, individual differences, and the effects of various resources on learners, thus meeting individual requirements and achieving efficient personalized recommendation.

Liang et al. [34] proposed a Top-k learning resource selection algorithm based on content filtering and PageRank semantic similarity replacement. The algorithm uses content filtering to select teaching materials similar to the files that users query and configure. The design can be summarized in two steps:

1. Construct the selection model for teaching materials based on vector filtering of the information content.
2. Use PageRank to match the various resources, establish the Markov convergence matrix of the characteristics, and refine the recommendation results with a Top-k operation.

In addition to the theoretical design, they tested the algorithm on an open learning resource dataset. According to the experimental results, their solution effectively handles polysemy and synonymy in text information processing for network teaching support, greatly improving the accuracy of the algorithm.

2.4 Learning Path Intervention

Learning path intervention is mainly aimed at teachers. With artificial intelligence technology, teachers can fully grasp learners’ cognitive state and intervene in their cognitive behavior through learner portrait research [35], group stratification suggestions [36], learning diagnosis research, and personalized learning path recommendations.
One of the most widely used methods is the “intervention-response” method, known as Response to Intervention (RTI) [37]. The RTI hierarchy consists of three levels of nested interventions. The first level is aimed at all learners; it is the least targeted intervention and is effective for the majority of students in the class. The second level focuses on some students, monitoring and strengthening interventions for those who still fall short after the first-level intervention. The third level is a closely monitored intervention for individual learners. Research by domestic scholars has shown the RTI model to be effective in precision teaching. However, most studies have been conducted in traditional classroom environments that rely primarily on human intervention, which limits the effect, timeliness, and accuracy of the interventions. To overcome these limitations, researchers can examine the model in non-traditional classroom environments and incorporate external interventions to test its effectiveness from different directions.

Yang et al. [38] proposed a cognitive intervention model based on cognitive analysis and supported by knowledge environments. Their model consists of learning data and behavior measurement data, analysis of learning states and characteristics, learning diagnostic behaviors, and assessments of learning process effectiveness. Intervention strategies were proposed for various learning difficulties, such as difficulty understanding knowledge, lack of interest in learning, and lack of concentration. These targeted
strategies can provide finely tailored interventions for students in different situations, which is crucial for teachers to intervene successfully with different types of students. In summary, the RTI model has potential for precision teaching, and considering the model in different learning environments and incorporating external interventions can make learning interventions more effective. The cognitive intervention model proposed by Yang et al. [38] is a good starting point for tailoring interventions to specific types of learning difficulties.
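The three-level RTI structure described in this section can be written as a simple decision rule. The sketch below is only an illustration under stated assumptions: the function name `assign_rti_level`, the use of a single numeric score per level, and the pass mark of 60 are all hypothetical and do not come from the cited studies.

```python
# Hypothetical sketch of three-level RTI assignment.
# The score scale and the pass mark are assumptions for illustration only.
def assign_rti_level(level1_score, level2_score=None, pass_mark=60):
    """Return the highest RTI level (1, 2, or 3) a learner currently needs.

    level1_score: result after the class-wide first-level intervention.
    level2_score: result after the second-level intervention, or None if
                  the learner has not yet received it.
    """
    if level1_score >= pass_mark:
        return 1  # class-wide instruction is sufficient
    if level2_score is None or level2_score >= pass_mark:
        return 2  # strengthened, monitored intervention for some students
    return 3      # closely monitored intervention for the individual learner


# Example: made-up (level-1, level-2) scores for four learners.
learners = {"A": (75, None), "B": (50, None), "C": (50, 65), "D": (50, 40)}
for name, (s1, s2) in learners.items():
    print(name, assign_rti_level(s1, s2))
```

A learning analytics system could feed automatically collected assessment data into such a rule, which is one way to reduce the reliance on manual intervention noted above.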
3 Conclusion

This article discusses the application of artificial intelligence in four areas related to teaching methods, as listed in Table 1. These areas include creating a conducive learning environment, analyzing learning data, matching learning resources to students, and providing personalized learning paths. By broadening students’ learning horizons, recognizing their emotional feedback for specific learning content, assisting teachers with organizing teaching content holistically, and matching resources to their individual characteristics, AI can help teachers and students interact more positively. This, in turn, can improve the overall quality of teaching and foster the development of a high-quality education system.

Table 1. Summary table of main references.

| Reference | Year | Description | Section |
|---|---|---|---|
| [7] | 2022 | In-depth discussion of the positive role of VR/AR technology in teaching | Learning environment creation |
| [8] | 2018 | Build an immersive learning environment with VR technology | Learning environment creation |
| [9] | 2022 | Explore immersive learning using 5G+AR | Learning environment creation |
| [17] | 2019 | Improve the performance of localizing micro-expressions (MEs) | Learning emotional data analysis |
| [18] | 2019 | Previous work on micro-expression detection mainly focused on finding peak frames from video sequences containing micro-expressions, and the computational cost is usually very large | Learning emotional data analysis |
| [19] | 2020 | Address the low-intensity problem of muscle movement associated with facial micro-expressions (MEs) | Learning emotional data analysis |
| [20] | 2021 | An FME recognition framework called spatio-temporal action unit (AU) graph convolutional network (STA-GCN) is proposed | Learning emotional data analysis |
| [21] | 2015 | A new color space model, tensor independent color space (TICS), is proposed to help identify micro-expressions | Learning emotional data analysis |
| [22] | 2022 | A lung cancer data augmented ensemble (LCDAE) framework is proposed to solve the overfitting and low performance problems in lung cancer classification | Learning emotional data analysis |
| [23] | 2022 | Multi-source data fusion provides a model basis for effective learner affective computing, and an effective technical path for learning affective computing in online education environments | Learning emotional data analysis |
| [33] | 2018 | Based on the evaluation of the matching degree between resources and users and the predicted effectiveness of resources for users, a precise recommendation scheme of educational resources based on matching effectiveness is designed | Learning resource matching |
| [34] | 2017 | A Top-k learning resource recommendation algorithm based on content filtering and PageRank semantic similarity replacement is proposed | Learning resource matching |
| [37] | 2014 | Expounds the intervention response model and its application theory in early education, and summarizes its enlightening effect on the development of preschool education in China | Learning path intervention |
| [38] | 2022 | Studies and explores an effective intervention model for solving the learning difficulties of students under the condition of intelligent technology | Learning path intervention |
Artificial intelligence is a strategic, general-purpose technology, and the combination of artificial intelligence algorithms and models with educational scenarios will promote the leapfrog development of educational research [39]. It is believed that by the driving
force of artificial intelligence, education modes and learning styles will gradually change, leading to the realization of educational modernization.

Acknowledgements. This work was supported by the Teaching Reform Research Project of Nanjing Normal University of Special Education in 2019 “Research on Cultivation of Professional Core Quality of Normal University Students from the perspective of Professional Certification – Taking Educational Technology Major as an Example”; Universities’ Philosophy and Social Science Researches Project in Jiangsu Province (No. 2020SJA0631 & No. 2019SJA0544).
References

1. Lang, J.H.: Medicine in the era of big data and artificial intelligence. Chin. J. Matern. Child Health Res. 30(01), 1–3 (2019)
2. Jing, J., Huang, X.C.: Current status and future prospects of artificial intelligence-assisted diagnosis and treatment in laboratory medicine. Int. J. Lab. Med. 43(21), 2669–2673 (2022)
3. Zhang, K., Liu, X., Shen, J., Li, Z., Jiang, Y., et al.: Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell 181(6), 1423-1433.e11 (2020)
4. Han, X., Gao, X.: Research on AI teaching design for primary schools based on EDIPT model. Educ. Equipment Res. 39(03), 35–39 (2023)
5. Li, X.C., Wang, F., Meng, J.T.: Application of artificial intelligence technology in supply chain logistics. China Logistics Purchasing 2023(06), 77–78 (2023)
6. Wang, Y.G., Xu, J.Q., Ding, J.H.: Global framework of education 4.0: future school education and model transformation – interpretation of the world economic forum report “schools of the future: defining new education models for the fourth industrial revolution”. J. Distance Educ. 38(03), 3–14 (2020)
7. Zhang, G.C.: Case analysis of AR/VR technology in experimental course teaching. Electron Technol. 51(08), 145–147 (2022)
8. Zhong, Z., Chen, W.D.: Design strategy and case implementation of experiential learning environment based on VR technology. China Audio-Vis. Educ. 2018(02), 51–58 (2018)
9. Qian, X.L., Song, Z.Y., Cai, Q.: Developing immersive learning in the Metaverse: characteristics, paradigms and practices of immersive learning based on 5G+AR. Educ. Rev. 2022(06), 3–16 (2022)
10. Meng, Y.: Empowering education intelligence transformation with gigabit optical network and cloud VR. Commun. World 2022(18), 34–35 (2022)
11. Ruan, B.: VR-assisted environmental education for undergraduates. Adv. Multimed. 2022, 3721301 (2020)
12. Liu, M., Shan, S., Wang, R., Wu, X., Chen, X.: Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1749–1756. IEEE Computer Society (2014)
13. Yao, S., He, N., Zhang, H., Ji, X.: Micro-expression recognition by feature points tracking. In: Proceedings of the 10th International Conference on Communications, pp. 1–4. IEEE Computer Society (2014)
14. Xu, L.F., Wang, J.Y., Cui, J.N., et al.: Dynamic expression recognition based on dynamic time warping and active appearance model. J. Electron. Inform. Technol. 40(2), 338–345 (2018). https://doi.org/10.11999/JEIT170848
15. Huang, X., Fu, R.D., Jin, W., et al.: Expression recognition based on image difference and convolutional deep belief network. Optoelectron.·Laser 29(11), 1228–1236 (2018)
16. Zhang, Z., Chen, T., Meng, H., et al.: SMEConvNet: a convolutional neural network for spotting spontaneous facial micro-expression from long videos. IEEE Access 6, 71143–71151 (2018). https://doi.org/10.1109/ACCESS.2018.2884349
17. Tran, T.-K., Vo, Q.-N., Hong, X., et al.: Dense prediction for micro-expression spotting based on deep sequence model. Electron. Imag. 2019(8), 1–6 (2019)
18. Ding, J., Tian, Z., Lyu, X., Wang, Q., Zou, B., Xie, H.: Real-time micro-expression detection in unlabeled long videos using optical flow and LSTM neural network. In: Vento, M., Percannella, G. (eds.) CAIP 2019. LNCS, vol. 11678, pp. 622–634. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29888-3_51
19. Lei, L., Li, J., Chen, T., et al.: A novel graph-TCN with a graph structured representation for micro-expression recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2237–2245. ACM (2020)
20. Zhao, X., Ma, H., Wang, R.: STA-GCN: spatio-temporal AU graph convolution network for facial micro-expression recognition. In: Proceedings of Chinese Conference on Pattern Recognition and Computer Vision, pp. 80–91. Springer, Berlin, Germany (2021)
21. Wang, S.-J., Yan, W.-J., Li, X., et al.: Micro-expression recognition using color spaces. IEEE Trans. Image Process. 24(12), 6034–6047 (2015)
22. Ren, Z., et al.: LCDAE: data augmented ensemble framework for lung cancer classification. Technol. Cancer Res. Treat. 21, 153303382211243 (2022). https://doi.org/10.1177/15330338221124372
23. Zhai, X.S., Xu, J.Q., Wang, Y.G.: Research on emotional computing in online education: based on multi-source data fusion. J. East China Normal Univ. (Educ. Sci.) 40(09), 32–44 (2022)
24. Hu, W., Wang, Z.L.: Research on intelligent recommendation evaluation of test questions based on collaborative filtering algorithm. J. Qujing Normal Univ. 41(06), 49–54 (2022)
25. Liu, J.J.: Research on Fusion Recommendation Method Based on Content and Collaborative Filtering. Inner Mongolia Normal University (2019)
26. Beijing University of Posts and Telecommunications: User Experience Research and Usability Testing. Beijing University of Posts and Telecommunications (2018)
27. Xiao, X.: Discussion on the application of association rule algorithm in score analysis – taking the test scores of high school students as an example. New Curriculum 16, 60 (2022)
28. Wang, Y., Zhou, S., Weng, Z., Chen, J.: An intelligent analysis and service design method for user feedback. J. Zhengzhou Univ. (Eng. Edn.) 44(03), 56–61 (2022)
29. Wang, A., Zhao, Y., Chen, Y.: Information search trail recommendation based on markov chain model and case-based reasoning. Data Inform. Manag. 5(1), 228–241 (2021). https://doi.org/10.2478/dim-2020-0047
30. Sun, Z.: Ontology-based Research on Resource Organization of Subject Information Gateway. Jiangsu University of Science and Technology (2010)
31. Ke, H.: Design of real-time data acquisition system for Internet of Things based on hybrid model. Inform. Comput. (Theory Ed.) 34(17), 40–42 (2022)
32. Yu, M., Liu, J., You, Y., Liu, C.: Evaluation of medium and long-term global flood forecasting based on Bayesian model averaging. Geogr. Sci. 42(09), 1646–1653 (2022)
33. Han, Y., Huang, R.: Design of educational resource recommendation scheme based on matching effectiveness. Microcomput. Appl. 34(05), 1–4 (2018)
34. Liang, T., Li, C., Li, H.: Top-k learning resource matching recommendation based on content filtering pagerank. Comput. Eng. 43(02), 220–226 (2017)
35. Huang, W.: Precise ideological and political exploration in colleges and universities based on student portrait analysis. J. Northeast. Univ. (Soc. Sci. Ed.) 23(03), 104–111 (2021)
36. Huang, B.: A preliminary study of “stratified teaching” in class groups. Teach.-Friend (02), 3 (1996)
37. He, L., Zhang, L.: Intervention response mode: a new model of inclusive education in the early years in the United States. J. Suzhou Univ. (Educ. Sci. Ed.) 2(04), 111–118 (2014)
38. Yang, W., Zhong, S., Zhao, X., Fan, J., Yang, L., Zhong, Z.: Research on the construction of elementary school mathematics learning intervention model based on learning analysis. China Dist. Educ. 04, 125–133 (2022)
39. Wu, F.: Entering Artificial Intelligence, pp. 200–214. Higher Education Press, Beijing (2022)
A Sign Language Recognition Based on Optimized Transformer Target Detection Model

Li Liu, Zhiwei Yang, Yuqi Liu, Xinyu Zhang, and Kai Yang(B)

Nanjing Normal University of Special Education, Nanjing 210038, China
[email protected]
Abstract. Sign language is the communication medium between deaf and hearing people and has unique grammatical rules. Compared with isolated word recognition, continuous sign language recognition is more context-dependent, semantically more complex, and harder to segment temporally. Current research still needs to improve recognition accuracy, resistance to background interference, and resistance to overfitting. The encoder-decoder structure of the Transformer model can be used for sign language recognition, but its position encoding method and multi-head self-attention mechanism still need improvement. This paper proposes a sign language recognition algorithm based on an improved Transformer target detection network model (SL-OTT). The method computes each word vector in a continuous sign language sentence over multiple cycles using multiplexed position encoding with parameters, so as to accurately capture the position information between words; it adds learnable memory key-value pairs to the attention module to form a persistent memory module; and it expands the number of attention heads and the embedding dimension in equal proportion through linear high-dimensional mapping. The proposed method achieves competitive recognition results on the most authoritative continuous sign language dataset.

Keywords: Sign Language Recognition · Target Detection Model · Neural Network
1 Introduction

In recent years, sign language recognition technology has used intelligent devices such as computers to convert sign language movements into messages that can be communicated to other social groups [1]. There are more than 5,500 commonly used words in Chinese sign language [2]. However, very few people understand sign language: most hearing people would need to learn it, and few are willing to spend the time and effort to do so, which is one of the reasons for the communication barrier between the deaf community and other social groups. Therefore, the study of sign language recognition technology can not only help deaf people adapt to the social environment but also promote the development of human-computer interaction and provide users with a better interaction experience [3].

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved.
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 197–208, 2024. https://doi.org/10.1007/978-3-031-50580-5_16
Research on sign language recognition in computer vision continues to progress. Kollar et al. [4] proposed an end-to-end recognition method that embeds a convolutional neural network (CNN) into a hidden Markov model (HMM) and trains the CNN using the frame-state alignment generated by the HMM. Continuous sign language recognition can be regarded as a sequence-to-sequence learning problem, and several scholars have proposed methods based on long short-term memory (LSTM) networks to capture temporal dependencies. Two typical strategies are used to align input video sequences with target sentence sequences: connectionist temporal classification (CTC), which models sequences without predefined alignment, and recurrent neural network encoder-decoder frameworks [5, 6]. However, sign language videos inherently contain complex feature information, including feature associations between hands, face, and torso within a single frame and feature variations of individual body parts between frames. Therefore, when recognizing longer sign language videos from natural scenes, traditional methods cannot extract enough useful features in end-to-end training and lack sufficient ability to model the video sequence. In recent years, the Transformer model [7] has gradually been extended from natural language understanding to computer vision owing to its remarkable long-sequence modeling capability. Its global self-attention mechanism enables sequence-to-sequence recognition and translation tasks to be processed in parallel, which has made the model a new framework for many machine translation tasks, including continuous sign language recognition.
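To make the global self-attention mechanism concrete, the following minimal sketch implements standard scaled dot-product attention, softmax(QK^T / sqrt(d)) V, in plain Python. It is a generic illustration and not the SL-OTT model: the learned query, key, and value projections and the multiple heads of a real Transformer are omitted, and the function names and toy inputs are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Each argument is a list of vectors (lists of floats). Every query
    attends to all keys at once, which is what makes the mechanism global.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Toy example: two frame features attending to each other (made-up numbers).
frames = [[1.0, 0.0], [0.0, 1.0]]
context = self_attention(frames, frames, frames)
print(context)
```

Because each output row depends only on its own query and on the shared keys and values, all rows can be computed independently, which is the source of the parallelism mentioned above.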
However, because the position encoding capability of the Transformer model is weak, it is difficult to accurately capture the positions of individual word vectors in a sign language video sequence using only simple sine-cosine position encoding. In addition, when handling continuous sign language recognition, the traditional Transformer model uses only simple self-attention and multi-head attention, which makes it hard to model longer sign language sequences as a whole and results in poor recognition [8]. Considering these problems, we propose the SL-OTT algorithm to extract visual features more effectively and to improve the accuracy and robustness of the model. SL-OTT improves the Transformer model to make it more suitable for continuous sign language recognition. The main contributions of this paper are as follows.

1) A continuous sign language recognition method based on the improved Transformer model, named SL-OTT, is proposed. It accurately captures the position information between words in a continuous sign language sentence by applying multiplexed positional encoding with parameters to each word vector over multiple cycles.
2) To enhance the model’s ability to model long sign language video sequences as a whole, we add learnable memory key-value pairs to the attention module to form a persistent memory module. In addition, the number of attention heads and the embedding dimension are expanded proportionally through linear high-dimensional mapping. The multi-head attention mechanism of the Transformer model is thus used to maximize the overall modeling ability for long sign language sequences and to mine the critical information in each frame of the video.
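For reference, the baseline sine-cosine position encoding that the contributions above set out to improve can be sketched as follows. This shows only the standard fixed encoding of the original Transformer, not the multiplexed, parameterized encoding of SL-OTT; the function name and the small sizes in the example are assumptions for illustration.

```python
import math


def sinusoidal_position_encoding(num_positions, d_model):
    """Standard fixed Transformer position encoding.

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Example: encodings for the first 4 positions of a 6-dimensional embedding.
pe = sinusoidal_position_encoding(num_positions=4, d_model=6)
for row in pe:
    print([round(x, 3) for x in row])
```

The table has no trainable parameters and is simply added to the word vectors, which is one reason a parameterized encoding may capture position information in long sign language sequences more flexibly.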
The rest of this paper is organized as follows: Sect. 2 describes related work; Sect. 3 presents the proposed algorithm in detail, including the overall architecture; Sect. 4 analyzes the experimental results; and Sect. 5 concludes the paper.
2 Related Work
Research on sign language recognition in computer vision has been ongoing for the past three decades [9]. The ultimate goal of sign language recognition is to recognize sign language actions through spatial-temporal modeling and to translate sign language videos into spoken-language sentences. So far, most research has focused on isolated-word sign language recognition on application-specific datasets [3], which limits the applicability of these techniques. In recent years, research on continuous sign language recognition has increased. Yang et al. [10] used a threshold model based on conditional random fields (CRF) to determine whether each frame in an utterance belongs to a sign language word or a transition movement, then used a CRF to recognize the segmented sign language words, and achieved an 87% recognition rate on an American Sign Language utterance database consisting of 48 words. However, because sign language data from different signers varies widely, the threshold model does not detect sign language word boundaries well in practical applications. Cui et al. [3] extracted the spatial features of each frame with a CNN and then extracted the spatial-temporal features of each sign language segment with stacked temporal convolution and temporal pooling layers. Ren et al. [11] proposed a machine learning method to classify cancers, as pattern recognition for medical images is widely used in computer-aided diagnosis. Wang et al. [12] used a PSO-guided self-tuning CNN to diagnose COVID-19. Deep learning-based methods can help classify medical images and improve diagnostic accuracy. Currently, most sign language word boundary detection algorithms are not robust enough to signer-independent data, which affects the recognition of sign language utterances to some extent.
With the continuing development of neural machine translation (NMT), many excellent encoder-decoder networks have been proposed, the most important of which is the Transformer model. In 2017, Vaswani et al. [9] proposed the Transformer, a model based on an attention mechanism that replaces recurrent neural networks (RNNs) with a fully attention-based structure to enable parallel computation, and that uses a multi-head attention mechanism [14] to capture the dependencies between preceding and following text and between widely separated word vectors. The Transformer is applicable not only to machine translation [15] but has also been successful in other challenging tasks such as language modeling, sentence representation learning, and speech recognition. Because the Transformer takes context into account during translation and is trained end to end, it is also well suited to continuous sign language recognition and is widely used for that task. Huang et al. [16] proposed a sign language attention network (SAN) that models context over the entire frame sequence, models hand sequences on cropped hand images, and combines hand features with their corresponding spatiotemporal context features using a self-attention mechanism. Inspired
by the above work, this paper proposes an improved Transformer model for continuous sign language recognition, effectively improving the position encoding and overall modeling ability of the network for longer sign language sequences.
3 Methodology
In this section, we introduce the overall architecture of the proposed SL-OTT and its modules.
3.1 Overall Architecture
The continuous sign language recognition method based on the improved Transformer model proposed in this paper is shown in Fig. 1. The overall framework consists of two main parts, an encoder and a decoder. The model addresses the insufficient position encoding ability and weak long-sequence modeling ability of the original method with three optimized modules: reusable learnable position encoding (LPE), a persistent memory module (SMM), and an attention extension module (AEM). As shown in Fig. 1, the model works as follows. First, the parameterized position encoding (LPE) is reused in front of each multi-head attention module, and the position encoding weight of each word vector is updated continuously according to the training loss, so that the position of each word vector in longer sentences is captured accurately. Second, to address the difficulty of modeling long sign language video sequences with the Transformer, the depth and accuracy of the attention module are extended by adding persistent memory vectors (SMM) to each multi-head attention module. At the same time, the number of attention heads and the number of dimensions assigned to each head are expanded proportionally (AEM) using a high-dimensional linear mapping, which further strengthens the model's ability to model longer sequences as a whole without reducing the receptive field assigned to each head. Finally, the modeled sign language sequences are decoded by the CTC model, which outputs the final recognition results.
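To make the final CTC step concrete, the following minimal sketch shows greedy (best-path) CTC decoding: take the highest-scoring class for each frame, merge consecutive repeats, and drop blanks. The paper does not state which decoding strategy is used, so greedy decoding, the blank index 0, the function name `ctc_greedy_decode`, and the toy scores are all assumptions made for illustration only.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame class scores into a label sequence (best-path decoding)."""
    # 1) Pick the highest-scoring class for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2) Merge consecutive repeats, 3) drop blanks.
    return [label for label, _ in groupby(best_path) if label != blank]


# Toy example: 6 frames, 3 classes (0 = blank, 1 and 2 = hypothetical glosses).
scores = [
    [0.10, 0.80, 0.10],  # frame picks class 1
    [0.20, 0.70, 0.10],  # class 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # class 1 again, kept because a blank separates it
    [0.10, 0.20, 0.70],  # class 2
    [0.60, 0.20, 0.20],  # blank
]
decoded = ctc_greedy_decode(scores)
print(decoded)  # [1, 1, 2]
```

The blank symbol is what lets the decoder emit the same gloss twice in a row: repeats separated by a blank are kept, while repeats on adjacent frames are merged.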
The joint improvement of these modules strengthens the model's overall ability to model long sign language video sequences and effectively improves its recognition accuracy on sign language videos.
3.1.1 Learnable Position Encoding
Because self-attention in the Transformer model is computed without regard to the order of its inputs, the positions of the input sequence must be encoded to prevent loss of order information; this is generally done by adding position information to the input sequence with sine and cosine functions before the encoder. However, the semantic relationships between the individual words of longer continuous sign language video sequences are more complex. Using a simple sine-cosine position encoding method, capturing the position relationships between sign language
Fig. 1. The overall structure of the SL-OTT.
context word vectors in long sequences is difficult. Therefore, in this paper, we adopt a learnable position encoding that is reused at multiple points in the network to better capture the overall semantic information of the sign language video and to make the syntactic relationships between word vectors more reasonable. The position encoding layer is built directly on nn.Embedding, a lookup matrix whose length is the size of the dictionary and whose width is the dimension of the attribute vector that represents each element in the dictionary; it maps words to word vectors. The weight matrix in the embedding is randomly initialized, and the weight values are updated iteratively during training, so the model automatically learns the position information that best matches the current word vector.
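The difference between the fixed sine-cosine encoding and a learnable position table can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: in PyTorch the table would be an nn.Embedding indexed by position, while here the class `LearnablePositionTable`, its method names, the initialization scale, and the toy loss are all hypothetical.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Fixed sine/cosine position table of the original Transformer."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(dim):
            angle = pos / (10000 ** (2 * (k // 2) / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


class LearnablePositionTable:
    """Trainable lookup table with one vector per position (hypothetical helper)."""

    def __init__(self, max_len, dim, seed=0):
        rng = random.Random(seed)
        # Random initialization; training refines these values.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(max_len)]

    def add_to(self, features):
        """Add the vector of position t to the t-th feature vector."""
        return [[x + p for x, p in zip(feat, self.weight[t])]
                for t, feat in enumerate(features)]

    def sgd_step(self, grads, lr=0.1):
        """Gradient step on the rows that were used (one gradient row per position)."""
        for t, grad_row in enumerate(grads):
            self.weight[t] = [w - lr * g for w, g in zip(self.weight[t], grad_row)]


# Toy usage: three frames with 4-dimensional (all-zero) features.
features = [[0.0] * 4 for _ in range(3)]
lpe = LearnablePositionTable(max_len=10, dim=4)
encoded = lpe.add_to(features)

# For the toy loss sum(encoded ** 2), d(loss)/d(position vector) = 2 * encoded,
# because the position vector is simply added to the input.
before = [row[:] for row in lpe.weight[:3]]
lpe.sgd_step([[2.0 * v for v in row] for row in encoded], lr=0.1)
```

The fixed table never changes, whereas the learnable table receives gradients like any other weight, which is what allows it to adapt to the data during training.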
As shown in Fig. 1, SL-OTT incorporates learnable position encoding in each encoder; the input of the first encoder comes from the image features, and the input of each subsequent encoder comes from the output of the previous encoder. Reusing the encoding at multiple points in this way yields an accurate encoding of each key frame in the sentence sequence.
3.2 Deepening Attention with the Persistent Memory Module
To solve long-sequence sign language recognition tasks, it is essential that the attention module mine the feature information of the whole input sequence thoroughly and accurately and capture its long-term dependencies. The literature [17] proposes that adding memory key-value vectors to the multi-head self-attention model can achieve an effect similar to that of the feedforward layer, so that the feedforward network layer can be removed and the number of parameters reduced. To improve the modeling of sign language sequences, this paper adds memory key-value vectors to the multi-head attention module but retains the feedforward layer, which expands the depth and breadth of the self-attention module and makes it more suitable for continuous sign language recognition. This paper refers to the result as the persistent memory module (SMM). Mathematically, for a Transformer module i with input X, the final output Y of the module is:

$$y_j = \sum_{i=0}^{M} w_{ij}\, x_j, \qquad j = 1, 2, \cdots, n \tag{1}$$
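The general idea of persistent memory in attention can be sketched as follows: learnable key and value vectors that do not depend on the input are appended to the context keys and values before the softmax. This single-head, pure-Python sketch follows that general idea and is not necessarily the exact SMM formulation; the function name `attention_with_memory`, the identity projections, and the toy memory slot are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_memory(queries, keys, values, mem_keys, mem_values):
    """Scaled dot-product attention over context keys plus persistent memory slots."""
    all_keys = keys + mem_keys        # context keys followed by memory keys
    all_values = values + mem_values  # same ordering for the values
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in all_keys])
        out = [sum(wi * v[d] for wi, v in zip(w, all_values))
               for d in range(len(all_values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy usage: two tokens, identity projections (queries = keys = values = x),
# and one hypothetical memory slot that would be learned in practice.
x = [[1.0, 0.0], [0.0, 1.0]]
mem_k = [[1.0, 1.0]]
mem_v = [[0.5, 0.5]]
out, attn = attention_with_memory(x, x, x, mem_k, mem_v)
```

Each query now distributes its attention over the two context tokens and the memory slot, so the memory vectors act as input-independent information available at every position.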
In time series prediction, the autoregressive decoding of standard self-attention models inevitably introduces large cumulative errors. Different time series data usually have strong spatial dependence. Using a self-attention model, a multilayer perceptron, and a convolution module together as the encoder eliminates the cumulative error to some extent; such an encoder attends to both global and local information and models the time domain more accurately and efficiently.
3.3 Attention Extension Module
The Transformer model projects the input sequences into different subspaces of the self-attention layer to extract features, and the literature [14] indicates that increasing the number of heads in the multi-head attention module can improve model performance and increase the diversity of the attention maps when training an extended Transformer model. Therefore, this paper carries the idea over to continuous sign language recognition for long sequences and increases the attention depth of the model by increasing the number of attention heads, so that the model can better attend to the overall feature information of the sequence. However, for a model with a fixed embedding dimension, simply increasing the number of attention heads reduces the dimensionality assigned to each head, and this reduction in turn affects the diversity of the attention maps. This paper adopts a direct expansion method to solve this problem.
The attention maps $A = [A_1, \cdots, A_H]$ are mapped to a new set of maps $\tilde{A} = [\tilde{A}_1, \cdots, \tilde{A}_H]$ through a linear transformation matrix $W_A$ that satisfies:

$$\tilde{A}_h = \sum_{i=1}^{H} W_A(h, i) \ast A_i, \qquad h = 1, \cdots, H \tag{2}$$
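The head-mixing step of Eq. (2) can be illustrated with a short sketch in which every new attention map is a weighted sum of the original per-head maps. The function name `mix_attention_heads` and the toy values are assumptions; in a trained model the mixing matrix would be learned rather than set by hand.

```python
def mix_attention_heads(attn_maps, w_a):
    """Mix H attention maps (each n x n) with an H x H matrix w_a, as in Eq. (2)."""
    num_heads = len(attn_maps)
    n_rows, n_cols = len(attn_maps[0]), len(attn_maps[0][0])
    mixed = []
    for h in range(num_heads):
        # New map h = sum over i of w_a[h][i] * original map i (element-wise).
        new_map = [[sum(w_a[h][i] * attn_maps[i][r][c] for i in range(num_heads))
                    for c in range(n_cols)]
                   for r in range(n_rows)]
        mixed.append(new_map)
    return mixed


# Toy usage: two heads over a sequence of length 2.
a1 = [[1.0, 0.0], [0.0, 1.0]]  # head 1 attends to the same position
a2 = [[0.0, 1.0], [1.0, 0.0]]  # head 2 attends to the other position
w_a = [[0.5, 0.5],             # new head 1 averages both heads
       [0.0, 1.0]]             # new head 2 copies head 2
mixed = mix_attention_heads([a1, a2], w_a)
```

With an identity mixing matrix the maps are returned unchanged, so the mixing step can only add expressiveness relative to standard multi-head attention.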
This method linearly maps the multi-head self-attention model into a higher-dimensional space. It keeps the number of dimensions of each head constant while increasing the number of attention heads, so the model gains the benefits of both more attention heads and a high embedding dimension. Since sign language is naturally localized, the detailed movements of the palm and fingers are crucial for semantic information. A convolution module with local feature extraction capability is therefore added after the multi-head self-attention to form the core component of the Transformer. It learns shared position-based kernels over a local window, which maintains translation equivariance and captures features such as edges and shapes, allowing the model to focus more on the local characteristics of the sign language. The convolution module consists of three convolution layers from shallow to deep: a pointwise convolution, a depthwise convolution, and a second pointwise convolution. The module starts with a gating mechanism, a pointwise convolution followed by a gated linear unit, which is followed by a 1D depthwise convolution. After the 1D depthwise convolution, batch normalization and the Swish activation function are applied to help train the deep model. The Swish function is unbounded above, smooth, and non-monotonic, and using it can speed up the convergence of the model.
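The building blocks named above (gated linear unit, 1D depthwise convolution, Swish) can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function names, the zero padding, and the toy kernels are assumptions, and batch normalization and the pointwise convolutions are omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def glu(a, b):
    """Gated linear unit: values a are gated element-wise by sigmoid(b)."""
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]


def depthwise_conv1d(seq, kernels):
    """Depthwise 1D convolution over time.

    seq: T x C list of frames; kernels: one odd-length kernel per channel.
    Zero padding keeps the output length equal to T.
    """
    T, C = len(seq), len(seq[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):  # each channel uses only its own kernel
            acc = 0.0
            for j in range(k):
                src = t + j - pad
                if 0 <= src < T:
                    acc += kernels[c][j] * seq[src][c]
            out[t][c] = acc
    return out


# Toy usage: 3 frames, 2 channels.
seq = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.0, 1.0, 0.0],   # channel 0: identity
           [1.0, 1.0, 1.0]]   # channel 1: sum over a window of 3 frames
conv_out = depthwise_conv1d(seq, kernels)
```

Swish is non-monotonic because it dips below zero for negative inputs before returning towards zero: in this sketch `swish(-1.0)` is smaller than both `swish(-2.0)` and `swish(0.0)`.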
4 Experimental Analysis
This section describes the datasets used in the experiments, the metrics used to evaluate model performance, the experimental details, and the recognition results and their analysis. Experiments are conducted on two large publicly available continuous sign language benchmark datasets, RPW (RWTH-PHOENIX-Weather) 2014 [19] and RPW 2014 T [20]. These two datasets are the classic benchmarks in current sign language video recognition research.
4.1 Experiment Settings
The network model was implemented on Ubuntu 14.04 with the following hardware: an Intel(R) Pentium(R) CPU G3260 @ 3.30 GHz, an NVIDIA RTX 3090 GPU, and a 1 TB hard disk. An 18-layer two-dimensional ResNet [12] with the final fully connected layer removed is used as the CNN module. A two-layer CM-Transformer encoder with four attention heads is used, with a model dimension of 512 and a position-wise feedforward dimension of 2,048. CTC is used as the decoder to generate complete sign language sentences. The complexity of the whole network was calculated: the number of parameters is 25828294
and the number of multiply-accumulate operations is 1.88 GMac; the complexity of the modified Transformer model is 0.06. The weights of the whole network are initialized at the beginning of the training phase. In this paper, all persistent memory vectors are reparametrized according to the number of dimensions and thresholds of the original vectors and embedded at positions shared by all heads, so that the added persistent memory vectors have the same unit variance as the initial context vectors. We use the word error rate (WER), the most commonly used metric for continuous sign language recognition, as the evaluation index. The lower the WER, the better the model and the higher its accuracy. The del/ins values of each experiment were recorded as an additional reference for evaluating the model.

$$\mathrm{WER} = \frac{\#\text{substitutions} + \#\text{insertions} + \#\text{deletions}}{\#\text{glosses in reference}} \tag{3}$$
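The WER of Eq. (3) is usually obtained from a minimum edit-distance alignment between the reference and hypothesis gloss sequences. The sketch below is one straightforward way to compute it and to recover the substitution, insertion, and deletion counts; the function names and the example glosses are hypothetical and not taken from the datasets.

```python
def wer_counts(reference, hypothesis):
    """Return (substitutions, insertions, deletions) of a minimum-cost alignment."""
    R, H = len(reference), len(hypothesis)
    # cost[i][j] = (total edits, subs, ins, dels) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, 0, i)  # delete every reference gloss
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, j, 0)  # insert every hypothesis gloss
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                diag = cost[i - 1][j - 1]  # match, no edit
            else:
                t, s, n, d = cost[i - 1][j - 1]
                diag = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i][j - 1]
            ins = (t + 1, s, n + 1, d)  # insertion
            t, s, n, d = cost[i - 1][j]
            dele = (t + 1, s, n, d + 1)  # deletion
            cost[i][j] = min(diag, ins, dele)  # tuples compare on total edits first
    _, subs, inss, dels = cost[R][H]
    return subs, inss, dels


def wer(reference, hypothesis):
    """Word error rate in percent, following Eq. (3)."""
    subs, inss, dels = wer_counts(reference, hypothesis)
    return 100.0 * (subs + inss + dels) / len(reference)


# Hypothetical gloss sequences: one substitution (RAIN -> SNOW), one deletion (WIND).
ref = ["TOMORROW", "RAIN", "NORTH", "WIND"]
hyp = ["TOMORROW", "SNOW", "NORTH"]
print(wer_counts(ref, hyp))  # (1, 0, 1)
print(wer(ref, hyp))         # 50.0
```

Because insertions are counted against the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.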
4.2 Comparison with Other Algorithms
The proposed SL-OTT is compared with existing methods in terms of word error rate and recognition time, where a lower word error rate indicates better recognition. Among the comparison methods, the sign language Transformer (SLT) method [4] uses a self-attention model for modeling; the stochastic frame dropping + stochastic gradient stopping + stochastic fine-grained labeling (SFD + SGS + SFL) method [5] is a continuous sign language recognition method with a self-attention encoder and a CTC decoder; the iteratively re-aligned (Re-aligned) method [6] processes the provided training labels and dynamically refines the label-to-image alignment in a weakly supervised manner; and the Staged-Opt method [3] solves the mapping of video clips to labels by introducing a recurrent convolutional neural network for spatial-temporal feature extraction and sequence learning. The iterative alignment network (Align-iOpt) method [17] uses a 3D convolutional residual network and an encoder-decoder network with CTC for sequence modeling, optimized in an alternating manner; the CNN-LSTM-HMM method [12] embeds a powerful CNN-LSTM model in each HMM stream following a hybrid approach. The recognition results of the proposed method and the comparison methods on the two datasets are shown in Table 1 and Fig. 2. As shown in Table 1 and Fig. 2, the proposed SL-OTT achieves better accuracy on the RPW 2014 dataset, reducing the word error rate by 1.5% compared with the state-of-the-art SFD + SGS + SFL method, and achieves competitive results on the RPW 2014 T dataset. The proposed method also outperforms the CNN-LSTM-HMM method in terms of recognition time.
Among the recognition methods that use a self-attention model, the proposed method outperforms SFD + SGS + SFL. SLT, in contrast, uses a full self-attention encoder-decoder model with high model complexity, large memory usage, and long training time. Both methods require
50,000 rounds of training before the model converges under the same hardware and software conditions, whereas the proposed method requires only 30 rounds.
Fig. 2. Comparison of identification results of various methods on the RPW 2014 dataset.
Table 1. Comparison of identification results of various methods on RPW 2014 T dataset.

| Methods | Word error rate/% | Time/h |
|---|---|---|
| CNN-LSTM-HMM | 20.1 | 178.1 |
| SLT | 20.2 | 129234.7 |
| SFD + SGS + SFL | 22 | 5.6 |
| SL-OTT | 21.3 | 4.9 |
In addition, the proposed SL-OTT combines the self-attention model with a convolutional model to extract features of local hand gesture changes, which preserves the global modeling of self-attention while keeping the local characteristics of convolution. The proposed model can thus fully exploit the topology and shape of the hand pose and obtain the desired sign language recognition results. To verify the impact of the individual modules of the network structure on overall performance, ablation experiments were conducted on the RWTH-PHOENIX-Weather 2014 dataset; the results are listed in Table 2 and Fig. 3. Figure 3 shows that wide_resnet101_2 extracts the features of sign language video more accurately because of its wider feature maps, larger number of channels, and deeper layers, and obtains the best recognition results.
Fig. 3. Comparison of the effect of different types of CNN.
As shown in Table 2, the LPE module achieves accurate position encoding of continuous sign language video by reusing the learnable position encoding and reduces the model's error rate by 0.2%. The AEM module also proves effective in expanding the depth and breadth of attention: increasing the number of attention heads and the embedding dimension reduces the error rate by 0.3% on long sequences of continuous sign language video. The SMM module, owing to its persistent memory mechanism, reduces the model's error rate by 1.2%, which further demonstrates the importance of the attention mechanism in long-sequence tasks such as continuous sign language recognition; adding the persistent memory module enhances the overall modeling capability of the model.

Table 2. Results of ablation experiments with different modules of LPE, SMM and AEM.

| Experimental methods | Del/ins | WER/% |
|---|---|---|
| Baseline (Transformer) | 5.2/5.0 | 22.2 |
| LPE + SMM | 8.2/2.1 | 21.1 |
| LPE + AEM | 7.1/3.0 | 22 |
| SMM + AEM | 6.2/2.4 | 24.4 |
| SL-OTT | 6.1/3.2 | 21.6 |
5 Conclusion
Research on sign language recognition technology not only helps deaf people adapt better to the social environment but also promotes the development of human-computer interaction and provides a better interaction experience for users. This paper proposes an optimized Transformer method for continuous sign language recognition. The Transformer model is improved with a reusable learnable position encoding, a persistent memory module that deepens attention, and an attention extension module for long sequences, which together address the weak position encoding of the traditional Transformer model and its difficulty in modeling long sequences of sign language video. The proposed method achieves significant improvements in recognition on large benchmark datasets.
References
1. Koller, O., Zargaran, S., Ney, H., et al.: Deep sign: hybrid CNN-HMM for continuous sign language recognition. In: British Machine Vision Conference 2016, pp. 1–2. British Machine Vision Association, York (2016)
2. Zhang, Z., Pu, J., Zhuang, L., Zhou, W., et al.: Continuous sign language recognition via reinforcement learning. In: International Conference on Image Processing (ICIP), pp. 285–289. IEEE Computer Society, Piscataway, NJ (2019)
3. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 7361–7369. IEEE Computer Society, Piscataway, NJ (2017)
4. Camgoz, N.C., Koller, O., Hadfield, S., et al.: Sign language Transformers: joint end-to-end sign language recognition and translation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10023–10033. IEEE Computer Society, Piscataway, NJ (2020)
5. Niu, Z., Mak, B.: Stochastic fine-grained labeling of multi-state sign glosses for continuous sign language recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI, pp. 172–186. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58517-4_11
6. Gulati, A., Chiu, C.C., Qin, J., et al.: Conformer: convolution-augmented Transformer for speech recognition. In: Proceedings of the INTERSPEECH 2020, pp. 5036–5040. International Speech Communication Association (ISCA), Baixas (2020)
7. Koller, O., Zargaran, S., Ney, H.: Re-sign: re-aligned end-to-end sequence modeling with deep recurrent CNN-HMMs. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 4297–4305. IEEE Computer Society, Piscataway, NJ (2017)
8. Graves, A., Fernández, S., Gomez, F., et al.: Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
9. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Adv. Neural Inform. Process. Syst. 5998–6008 (2017)
10. Molchanov, P., Yang, X., Gupta, S., et al.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4207–4215 (2016)
11. Ren, Z., Zhang, Y., Wang, S.: A hybrid framework for lung cancer classification. Electronics 11(10), 1614 (2022)
12. Wang, W., Pei, Y., Wang, S.H., Gorriz, J.M., Zhang, Y.D.: PSTCNN: Explainable COVID-19 diagnosis using PSO-guided self-tuning CNN. Biocell 47, 373–384 (2023)
13. Koller, O., Camgoz, N.C., Ney, H., et al.: Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Trans. Pattern Anal. Mach. Intell. 42(9), 2306–2320 (2019)
14. Hartigan, J.A., Wong, M.A.: A K-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
16. Huang, J., Zhou, W.G., Zhang, Q.L., et al.: Video-based sign language recognition without temporal segmentation. In: AAAI Conference on Artificial Intelligence, pp. 2–7. AAAI, New Orleans (2018)
17. Camgoz, N.C., Hadfield, S., Koller, O., et al.: SubUNets: end-to-end hand shape and continuous sign language recognition. In: IEEE International Conference on Computer Vision (ICCV), pp. 3075–3084. IEEE, Venice (2017)
18. Pu, J., Zhou, W., Li, H.: Iterative alignment network for continuous sign language recognition. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4165–4174. IEEE Computer Society, Piscataway, NJ (2019)
19. Slimane, F., Bouguessa, M.: Context matters: self-attention for sign language recognition. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7884–7891. Milan, Italy (2021)
20. Camgoz, N.C., Hadfield, S., Koller, O., et al.: Neural sign language translation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7784–7793. IEEE Computer Society, Piscataway (2018)
Workshop 3: The Control and Data Fusion for Intelligent Systems
A Review of Electrodes Developed for Electrostimulation
Xinyuan Wang1, Mingxu Sun1, Hao Liu1, Fangyuan Cheng1, and Ningning Zhang2(B)
1 University of Jinan, Jinan 250022, China
2 Association Service Center of Jinan, Jinan 250022, China
Abstract. Surface electrodes are essential devices for functional electrical stimulation therapy and directly affect the effectiveness of electrical stimulation. In this paper, four typical electrode types are selected: metal electrodes, carbon rubber electrodes, hydrogel electrodes, and fabric electrodes. Their materials and characteristics are introduced and compared. Most electrodes currently used for electrical stimulation are hydrogel electrodes, which are generally uncomfortable to wear, poorly washable, and do not fit human skin well. Fabric electrodes alleviate these problems, and their materials and preparation methods are introduced in detail. The paper closes with an outlook on the development of fabric electrodes.
Keywords: Electrodes · Electrical Stimulation · Fabric Electrode
1 Introduction
Patients with nerve impairment, such as stroke survivors, can benefit from functional electrical stimulation (FES) therapy to enhance their range of motion [1]. FES is frequently used to help patients move more easily when they have central nervous system damage resulting from head trauma, spinal cord injury, stroke, or other neurological illnesses. FES uses electrodes to stimulate the nerve tissue that enters the muscle. Although electrodes can be implanted, it is more common and practical to place them on the skin's surface. Because it is noninvasive and easy to use, surface electrical stimulation has a wide range of clinical uses. An electrode is a conductor through which current enters or leaves a conductive medium such as a solid, a gas, or an electrolyte solution. Electrode devices have been developed for bioelectric potential monitoring, transcutaneous electrical nerve stimulation, and functional electrical stimulation [2–4]. Most current research on electrodes revolves around measuring electrocardiographic (ECG) signals to help diagnose heart disease, electromyographic (EMG) monitoring for fitness and rehabilitation applications, electro-oculogram (EOG) signals, or electroencephalography (EEG) to monitor brain activity. Research on electrodes for electrical stimulation is scarce, and most early electrical stimulation studies used simple electrodes made of steel [5] or other metal plates placed on saline-soaked gauze [6–8]. Most existing electrode technologies use hydrogel electrodes and carbon rubber electrodes. The long-term goal today is to produce comfortable electrostimulation electrodes that are breathable, flexible, and washable. The advent of e-textile printing technology has made fabric electrodes possible, and longer-lasting washable electrodes may become a new trend in future medical devices. This review discusses the various electrode types used for electrical stimulation, summarizes their advantages and disadvantages, and concludes by focusing on the potential impact that textile electrodes may have in the field of electrical stimulation.
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 211–220, 2024. https://doi.org/10.1007/978-3-031-50580-5_17
2 Electrode Type
2.1 Fabric Tissue Covered Sheet Metal Electrodes
Most early electrostimulation studies used simple electrodes made of metal plates placed on fabric soaked in saline, as shown in Fig. 1. The metal plates must be made of biocompatible materials; stainless steel or silver chloride electrodes are usually used. The woven fabric can be cotton but is usually a polymeric textile material that has some elasticity and does not wear out quickly. Sponge-like materials are also used and recommended; for example, the HVPC electric pads of Zhou (2016), which are stainless steel electrodes with a sponge interface, are also moistened with saline [9]. Water or electrode gel is used to make the fabric conductive. The fabric distributes the current evenly over the skin to prevent skin burns, and care must be taken that the electrode does not dry out. In the best case, a completely dried-out electrode isolates the metal plate from the skin. While the electrode is drying, however, an unevenly distributed electric field under the electrode can cause severe skin burns. The electrode needs to be secured to the skin with an elastic band or embedded in clothing or a cast, as in the Bioness (formerly known as the Ness Handmaster) neuroprosthesis [10].
Fig. 1. Metal plate electrodes covered by fabric tissue.
2.2 Carbon Rubber Electrodes
Before the advent of self-adhesive hydrogel electrodes, carbon rubber electrodes replaced fabric-covered sheet metal electrodes, as shown in Fig. 2 [11]. These electrodes use carbon to increase the conductivity of a soft rubber pad, producing electrodes that are comfortable, conductive, and conform to the surface on which they are placed. Carbon rubber has a much higher resistance than metal, which prevents large currents from flowing through small areas and removes the risk of skin burns from direct contact between skin and metal. Although carbon rubber itself does not dry out, it is sometimes necessary to moisten the area where the electrode touches the user's skin in order to minimize the electrical impedance at that point, which normally reduces the user's comfort. Because of their chemical stability, carbon rubber electrodes are still used for special applications such as iontophoresis. Hydrogel electrodes, by contrast, are unstable over long periods at temperatures above 40 °C, so their storage and handling in Middle Eastern countries are both expensive and difficult; in these regions, carbon rubber electrodes remain the preferred choice. Gel or water must be used as the skin contact medium when currents through carbon rubber electrodes exceed 10 mA, because the resulting reduction in impedance allows the electric field to penetrate deeper, away from the delicate nerve endings near the surface [12]. Although Horton et al. (2010) reported that replacing the initial self-adhesive electrodes with carbon rubber electrodes decreased redness in one patient, no clinical studies have evaluated the efficacy of the various electrodes [13].
Fig. 2. Carbon rubber electrodes
2.3 Hydrogel Electrodes
Hydrogel electrodes are an alternative to the conventional metal electrodes employed in earlier experiments and are now in frequent use, as shown in Fig. 3. They have a multilayer structure consisting of several hydrogel layers. The skin interface layer consists of a conductive gel with relatively low peel strength for direct contact with the subject's skin; it has a moist feel and is relatively easy to remove from the skin. Conductive gels are usually made from copolymers, such as copolymers of acrylic acid and N-vinylpyrrolidone [14]. To create a very low-resistance contact with the skin, hydrogels must have a gel-like consistency and a high water content. Studies have shown that well-constructed hydrogels provide the best electrical conductivity [15]. However, the water in hydrogels gradually evaporates, this benefit is lost, and the electrodes must then be replaced. Furthermore, because of their natural adhesive qualities, they are prone to becoming covered with dirt and skin debris, and so gradually lose conductance as they become soiled.
Fig. 3. Hydrogel electrodes
2.4 Fabric Electrodes
Electrode design has long been a major focus of the e-textile community, and e-textiles are now used effectively for electrical stimulation therapy. In 2014, Yang et al. described a functional electrical stimulation (FES) system that uses 24 small printed electrodes, as shown in Fig. 4. The device can produce a range of hand postures, such as pointing, squeezing, and extending the hand. The small electrodes (1 cm²) allow a finer choice of stimulation sites. The electrodes are screen printed, with the conductive silver tracks sandwiched between a polyurethane interface layer and a polyurethane encapsulation layer. The electrode pad itself is printed with a carbon-loaded rubber paste. The interface layer provides a smooth surface for printing the conductive tracks, which makes printing more reliable and reduces the amount of paste needed to form a continuous conductive path. The encapsulation layer electrically insulates the conductive tracks and protects them physically. The electrodes were printed on an industry-standard polyester-cotton fabric [16]. In 2017, Stewart et al. compared hydrogel electrodes with electrodes made of conductive fabric. They varied, among other things, the amplitude and rise time of the signal used to stimulate the subject's biceps. In every test the fabric electrodes performed as well as or slightly better than the hydrogel electrodes [1]. Moineau et al. integrated fabric electrodes into clothing: they built and tested a pair of trousers, a shirt, and a jacket with electrodes made from conductive fiber [17]. The garments were connected to an FES stimulator, and a procedure was used to determine the sensory threshold, the motor threshold, the full motor threshold (the point at which stimulation causes complete muscle contraction), and the maximum threshold (the point at which the subject cannot tolerate greater intensity). These values were compared with those obtained with gel electrodes.
Fabric electrodes behaved comparably to gel electrodes, but they had lower sensory thresholds because they conduct current more readily to the skin's surface, which has more nerve endings [12]. In addition, because the current distribution across a fabric electrode is less uniform, "hot spots" that are easier to feel may develop. The maximum stimulation threshold was higher for textile electrodes, while the full motor threshold was lower. The authors identified two main problems with their system: the cables attached to each electrode tend to tangle, which makes the system harder to operate, and the electrodes need wetting after ten to fifteen minutes in order to work effectively and comfortably. Other studies have demonstrated improved performance from dry fabric electrodes, which removes the need for wetting, and further work has improved techniques for integrating power cables into fabric garments [18]. Pain relief is another treatment for which electrical stimulation through fabric electrodes has proved beneficial. E-textile electrodes can deliver electrical stimulation that blocks the body's pain signals. In 2020, Yang et al. examined a device that delivered an interfering signal through two sets of electrodes, placed mainly around the user's leg joints, with a current of 100 mA. The interfering current had frequency components around 1 kHz and 10 kHz. The electrodes used in the study were made by printing carbon-loaded rubber onto a fabric into which copper wire had been woven. This flexible rubber material forms a good electrical connection and conforms closely to the surface on which it is placed. Tests showed consistent current delivery, and users reported that the device was comfortable and simple to operate. These results suggest that outcomes comparable to those seen in this research can be achieved when fabric electrode technology is used for FES.
Fig. 4. Fabric electrodes
3 Materials for Fabric Electrodes
Electrodes have been studied extensively in the context of e-textiles, and electrode devices have been developed for transcutaneous electrical nerve stimulation, biopotential monitoring, and functional electrical stimulation. Conventional silver/silver chloride electrodes use hydrogels that adhere directly to the skin. However, the performance of these gel electrodes degrades over time as water evaporates and contaminants accumulate, making them unsuitable for long-term wear. Textile-based dry electrodes are used in wearable therapies for many applications, including pain relief, stroke rehabilitation, and improved lymphatic function. Metal, which is already widely used for electrode contacts and other kinds of wiring, is an obvious material choice for e-textile devices. Silver is a popular candidate because of its high electrical conductivity, and flexible silver inks with conductivities of up to 3200 S/cm have been produced [19]. Silver is also biocompatible and chemically stable, but because it is a precious metal, its price may be too high. Copper is another popular option for e-textiles: although it corrodes more readily, it is also highly conductive and considerably less expensive than silver [20]. Steel has been used to make conductive threads for e-textiles because of its good combination of mechanical and electrical properties [21]. Conductive polymers are an alternative to metals. Suitable materials include polyaniline, polypyrrole, and poly(3,4-ethylenedioxythiophene) polystyrene sulfonate (PEDOT:PSS). However, the conductivity of these polymers is still at least an order of magnitude lower than that of metallic conductors, which makes them less effective for long conductive paths. Metals and polymers can be combined to produce materials that have the physical characteristics of polymers but substantially lower impedance [22]. Merhi et al. combined PEDOT:PSS with silver nanofibers (tiny silver strands) to produce a stretchable screen-printing ink with a sheet resistance of just 6 Ω/sq. These materials can be made into yarns using a variety of spinning methods, including electrospinning and wet spinning. In electrospinning, a fine polymer fiber is drawn from a solution using an electric field. In wet spinning, the polymer is precipitated from a solution in a liquid bath.
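The conductivity and sheet-resistance figures quoted in this section can be related through the standard thin-film relation R_s = 1/(σt), and the resistance of a printed track then follows as R = R_s · L/W. The sketch below works through that arithmetic. The 10 µm layer thickness and the track dimensions are illustrative assumptions, not values reported in the cited studies, and the function names are hypothetical.

```python
def sheet_resistance_ohm_per_sq(conductivity_s_per_cm, thickness_um):
    """Sheet resistance R_s = 1 / (sigma * t) of a uniform conductive film."""
    thickness_cm = thickness_um * 1e-4  # 1 um = 1e-4 cm
    return 1.0 / (conductivity_s_per_cm * thickness_cm)


def track_resistance_ohm(sheet_resistance, length_mm, width_mm):
    """End-to-end resistance of a rectangular track: R = R_s * (L / W)."""
    return sheet_resistance * (length_mm / width_mm)


# Silver ink at 3200 S/cm (value from the text); 10 um thickness is assumed.
silver_rs = sheet_resistance_ohm_per_sq(3200.0, 10.0)

# Composite ink at 6 ohm/sq (value from the text); a 100 mm x 2 mm track
# is an assumed geometry, i.e. 50 "squares" in series.
composite_track = track_resistance_ohm(6.0, length_mm=100.0, width_mm=2.0)

print(f"silver film: {silver_rs:.3f} ohm/sq")
print(f"composite track: {composite_track:.0f} ohm")
```

Under these assumptions the silver film comes out well below 1 Ω/sq, while the composite track is a few hundred ohms, which illustrates why track geometry matters as much as the ink itself when routing current to an electrode.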
4 Fabric Electrode Preparation Processes
Fabric electrode fabrication methods fall into two broad categories. The first uses conventional textile production methods such as weaving, knitting, or embroidery to incorporate conductive components into the fabric itself. The second prints or deposits conductive components onto the surface of a finished fabric. The first approach requires conductive yarns. These may be yarns produced by any of the methods described above, or simply wires with suitable physical properties that can be woven into the fabric. A wide range of weaving techniques can produce the yarn architecture required for complex circuits, but the conductive paths can only ever be woven as orthogonal lines [23]. Embroidery offers greater geometric freedom, but it is harder to use on an industrial scale because most conductive threads lack the tensile strength and elasticity needed for machine sewing. As part of their research on neuroprosthetic applications, Keller et al. [24] used embroidery to create smart fabric-based electrode pads and electrode wiring on cloth. However, commercial metal-coated yarns embroider inconsistently because the conductive yarn surface deteriorates during the process, which necessitates expensive, high-quality custom silver-coated yarns made by plasma vapor sputtering [25]. Weaving and knitting have been used to manufacture smart fabrics for various wearable electronic applications (e.g., sensing, display, health monitoring, and power generation). However, these methods are not well suited to the fabrication of wearable FES arrays: because the conductive paths must follow the physical position of the threads within the fabric, woven and knitted technologies restrict the design of the arrays.
The uneven spacing between the conductive threads also leads to an uneven distribution of resistance. Many different methods are available for printing conductive materials. Stencil printing is the most basic. A conductive paste is spread over a stencil that has apertures cut out in the required pattern, and the stencil is removed once the paste has fully dried [26]. This technique works best for simple designs, because with very fine features (less than 1000 µm) the stencil becomes too fragile and distorts, producing inaccurate prints. In addition, because every part of the stencil must be connected, isolated concentric shapes cannot be printed. Because a lot of material can be deposited in one pass, stencil printing is well suited to thick deposits (>1000 µm), whereas methods originally developed for graphic printing need many layers to reach the same thickness [27]. Screen printing works on a similar principle to stencil printing. A screen printer uses a mesh screen that is only partly covered with an emulsion layer to form the required mask. A squeegee pushes the printing material through the open mesh and onto the exposed surface of the substrate. A third way to add conductive material to fabric is dispenser printing. Here, a pneumatic or mechanical mechanism pushes the printing material out of a nozzle while a robotic stage moves the nozzle. Dispenser printers can only print lines, typically about one millimeter wide, so they take a long time to cover a large area. They are, however, more adaptable than screen or stencil printers: a dispenser printer can simply be reprogrammed with a new design, whereas changing the design with either of the other techniques means making a new screen or stencil. Dispenser printers can also adjust the height of the nozzle while printing, which makes it easier to print over uneven surfaces or to modify the print density [28]. Because dispenser printing is a non-contact method in which only the paste touches the substrate, it can print on sticky or delicate substrates. Inkjet printing resembles dispenser printing in that the ink is ejected from a nozzle placed close to the surface. However, whereas dispenser printers generally deposit a continuous stream of material, inkjet printers apply the ink as individual droplets. To form such small droplets, inkjet ink has a far lower viscosity than the inks used for screen, stencil, or dispenser printing [29]. Droplets can be generated continuously or only on demand, according to the desired pattern. Inkjet is one of the most widely used processes for printing graphics and has benefited from a great deal of research. Many other printing techniques exist that are less often used in the production of e-textiles.
Aerosol printing is conceptually similar to dispenser and inkjet printing: a gas stream carries the ink droplets from the nozzle to the surface and is kept tightly focused by a surrounding sheath of gas. Aerosol printing offers many of the same benefits as dispenser printing, but it requires a considerably more complicated setup. In several techniques, the ink is first applied to a carrier in the desired pattern before being transferred to the substrate. These include gravure and flexographic printing, in which the pattern is engraved on a cylinder, filled with ink, and then transferred to the substrate [30]. These techniques are not often used for small-volume e-textile manufacture because of the high cost of producing the rolls and the demanding requirements on ink performance. Such printing methods often need a very flat surface for the ink to adhere. If this requirement is not met, the ink may not adhere, or additional deposits may be needed to form a continuous conductive path. Because most textiles do not meet these requirements, an interface layer can be placed between the electrode materials and the fabric. Polyurethane, applied either as a printable paste or as a laminate, is frequently chosen for this layer. The substrate must also meet certain requirements for printing. It must absorb some of the printing ink so that a stronger mechanical bond forms after curing [31]. Most printed electronic inks are cured at temperatures between 110 and 140 °C for five to thirty minutes. Some inks instead require 10 to 70 s of exposure to high-power UV light. Because this post-processing is often repeated for several layers, the substrate must be robust enough to withstand it [32].
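Two of the practical constraints described in this section, the slow area coverage of a dispenser printer that lays down lines about 1 mm wide and the repeated curing of each printed layer, can be made concrete with some rough arithmetic. The sketch below is only an estimate under stated assumptions: the nozzle speed of 10 mm/s, the simple raster fill without overlap, the pad sizes, and the choice of four layers at ten minutes each are all hypothetical values chosen for illustration.

```python
import math


def dispenser_fill_time_s(width_mm, height_mm, line_width_mm=1.0, speed_mm_s=10.0):
    """Time to fill a rectangle with parallel, non-overlapping printed lines."""
    passes = math.ceil(height_mm / line_width_mm)  # one pass per line width
    path_length_mm = passes * width_mm             # ignores travel between passes
    return path_length_mm / speed_mm_s


def total_cure_time_min(layers, minutes_per_layer):
    """Total oven time when every printed layer is cured separately."""
    return layers * minutes_per_layer


small_pad_s = dispenser_fill_time_s(10, 10)   # 1 cm^2 pad
large_pad_s = dispenser_fill_time_s(50, 50)   # 25 cm^2 pad (assumed size)
cure_min = total_cure_time_min(layers=4, minutes_per_layer=10)

print(small_pad_s, large_pad_s, cure_min)
```

With these assumed numbers the fill time grows with the printed area (25 times longer for the larger pad), whereas a screen or stencil print deposits the whole pattern in a single pass, which is the trade-off against design flexibility described above.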
5 Discussion and Conclusion
This paper discusses the use of electrodes in electrical stimulation studies and describes four typical electrode types and their properties: metal electrodes, carbon rubber electrodes, hydrogel electrodes, and fabric electrodes. The electrodes currently used in electrical stimulation studies commonly suffer from problems such as poor wearing comfort, poor washability, and an unsatisfactory fit to the human body. The advantages of fabric electrodes are pointed out, and their materials and preparation methods are described in detail. Although e-textile electrodes have been used in many applications, positioning the electrodes precisely is still a challenge. As fabric electrodes continue to develop, their performance in muscle stimulation is expected to improve, and they are likely to become a major trend in electrodes for electrical stimulation.
References
1. Stewart, A.M., Pretty, C.G., Chen, X.: An evaluation of the effect of stimulation parameters and electrode type on bicep muscle response for a voltage-controlled functional electrical stimulator. IFAC-PapersOnLine 50, 15109–15114 (July 2017)
2. Yang, K., et al.: Development of user-friendly wearable electronic textiles for healthcare applications. Sensors 18, 2410 (Aug. 2018)
3. Yang, K., et al.: Electronic textiles based wearable electrotherapy for pain relief. Sens. Actuators, A 303, 111701 (Mar. 2020)
4. Paul, G., Torah, R., Beeby, S., Tudor, J.: Novel active electrodes for ECG monitoring on woven textiles fabricated by screen and stencil printing. Sens. Actuators, A 221, 60–66 (Jan. 2015)
5. Carley, P.J., Wainapel, S.F.: Electrotherapy for an acceleration of wound healing: low-intensity direct current. Arch. Phys. Med. Rehabil. 66, 443–446 (July 1985)
6. Katelaris, P.M., Fletcher, J.P., Little, J.M., Mcentyre, R.J., Jeffcoate, K.W.: Electrical stimulation in the treatment of chronic venous ulceration. Aust. N. Z. J. Surg. 57(9), 605–607 (1987)
7. Griffin, J.W., et al.: Efficacy of high voltage pulsed current for healing of pressure ulcers in patients with spinal cord injury. Phys. Ther. 71, 433–442 (June 1991)
8. Peters, E.J., Lavery, L.A., Armstrong, D.G., Fleischli, J.G.: Electric stimulation as an adjunct to healing diabetic foot ulcers: a randomized clinical trial. Arch. Phys. Med. Rehabil. 82, 721–725 (June 2001)
9. Zhou, K., Krug, K., Stachura, J., et al.: Silver-collagen dressing and high-voltage, pulsed-current therapy for the treatment of chronic full-thickness wounds: a case series. Ostomy Wound Manage. 62(3), 36–44 (Mar. 2016). PMID: 26978858
10. Ijezerman, M.J., et al.: The NESS Handmaster orthosis: restoration of hand function in C5 and stroke patients by means of electrical stimulation. J Rehab Sci. 9, 86–9 (1996)
11. Baker, L.L., McNeal, D.R., Benton, L.A., Bowman, B.R., Waters, R.L.: Neuromuscular electrical stimulation: a practical guide, 3rd edn. Rehabilitation Engineering Program, Los Amigos Research and Education Institute, Rancho Los Amigos Medical Center, USA (1993)
12. Zhou, H., et al.: Stimulating the comfort of textile electrodes in wearable neuromuscular electrical stimulation. Sensors (Basel, Switzerland) 15, 17241–17257 (July 2015)
13. Houghton, P.E., et al.: Electrical stimulation therapy increases rate of healing of pressure ulcers in community-dwelling people with spinal cord injury. Arch. Phys. Med. Rehabil. 91, 669–678 (May 2010)
14. Ahmed, E.M.: Hydrogel: preparation, characterization, and applications: a review. J. Adv. Res. 6, 105–121 (Mar. 2015)
15. Wang, L., et al.: Enhanced cell proliferation by electrical stimulation based on electroactive regenerated bacterial cellulose hydrogels. Carbohyd. Polym. 249, 116829 (2020)
16. Yang, K., Freeman, C., Torah, R., Beeby, S., Tudor, J.: Screen printed fabric electrode array for wearable functional electrical stimulation. Sens. Actuators, A 213, 108–115 (July 2014)
17. Moineau, A., Marquez-Chin, C., Alizadeh-Meghrazi, M., Popovic, M.R.: Garments for functional electrical stimulation: design and proofs of concept. J. Rehabilit. Assistive Technol. Eng. 6, 2055668319854340 (Jan. 2019)
18. Catchpole, M.: E-textile seam crossing with screen printed circuits and anisotropic conductive film. Proceedings 32(1), 16 (2019)
19. La, T.-G., et al.: Two-layered and stretchable e-textile patches for wearable healthcare electronics. Adv. Healthcare Mater. 7(22), 1801033 (2018)
20. Acar, G., et al.: Wearable and flexible textile electrodes for biopotential signal monitoring: a review. Electronics 8, 479 (May 2019)
21. Perera, T., Mohotti, M., Perera, M.: Stretchable conductive yarn for electronic textiles made using hollow spindle spinning. In: 2018 Moratuwa Engineering Research Conference (MERCon), pp. 544–548 (May 2018)
22. Merhi, Y., Agarwala, S.: Direct write of dry electrodes on healthcare textiles. In: 2021 IEEE International Conference on Flexible and Printable Sensors and Systems (FLEPS), pp. 1–2 (2021)
23. Kuroda, T., Takahashi, H., Masuda, A.: Chapter 3.2 - Woven electronic textiles. In: Sazonov, E., Neuman, M.R. (eds.) Wearable Sensors, pp. 175–198. Academic Press, Oxford (Jan. 2014)
24. Sardo, P.M.G., Guedes, J.A.D., Alvarelhão, J.J.M., Machado, P.A.P., Melo, E.M.O.P.: Pressure ulcer incidence and Braden subscales: retrospective cohort analysis in general wards of a Portuguese hospital. J. Tissue Viability 27, 95–100 (May 2018)
25. Biçer, E.K., et al.: Pressure ulcer prevalence, incidence, risk, clinical features, and outcomes among patients in a Turkish hospital: a cross-sectional, retrospective study. Wound Management & Prevention 65, 20–28 (Feb. 2019)
26. Liu, M., Beeby, S., Yang, K.: Electrode for wearable electrotherapy. Proceedings 32(1), 5 (2019)
27. Yang, K., Torah, R., Wei, Y., Beeby, S., Tudor, J.: Waterproof and durable screen printed silver conductive tracks on textiles. Text. Res. J. 83, 2023–2031 (Nov. 2013)
28. de Vos, M., Torah, R., Tudor, J.: A novel pneumatic dispenser fabrication technique for digitally printing electroluminescent lamps on fabric. In: 2015 Symposium on Design, Test, Integration, and Packaging of MEMS/MOEMS (DTIP), pp. 1–4. IEEE, Montpellier, France (Apr. 2015)
29. Karim, N., et al.: All inkjet-printed graphene-based conductive patterns for wearable e-textile applications. Journal of Materials Chemistry C 5(44), 11640–11648 (2017)
30. Honarvar, M.G., Latifi, M.: Overview of wearable electronics and smart textiles. The Journal of The Textile Institute 108, 631–652 (Apr. 2017)
31. Komolafe, A., Nunes-Matos, H., Glanc-Gostkiewicz, M., Torah, R.: Influence of textile structure on the wearability of printed e-textiles. In: 2020 IEEE International Conference on Flexible and Printable Sensors and Systems (FLEPS), pp. 1–4 (Aug. 2020)
32. Paul, G., Torah, R., Beeby, S., Tudor, J.: The development of screen printed conductive networks on textiles for biopotential monitoring applications. Sens. Actuators, A 206, 35–41 (Feb. 2014)
Non-invasive Scoliosis Assessment in Adolescents
Fangyuan Cheng1, Liang Lu2, Mingxu Sun1, Xinyuan Wang1, and Yongmei Wang3(B)
1 University of Jinan, Jinan 250022, China
2 Jinan Minzu Hospital, Jinan 250000, China
3 Jinan City People's Hospital, Jinan 271100, China
Abstract. This work reviews the non-invasive scoliosis assessment methods proposed for adolescents in recent years. Its purpose is to survey the non-radiological methods studied so far for assessing scoliosis, together with their tools, characteristics, and validity, and to discuss their advantages and disadvantages. A total of 32 articles on non-radiological assessment of scoliosis were compiled, covering camera measurements, 3D body scans, Kinect-based computer vision posture analysis systems, and gait analysis based on marker-based cameras and inertial sensors.
Keywords: Adolescent · Scoliosis · Assessment · Non-invasive
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 221–230, 2024. https://doi.org/10.1007/978-3-031-50580-5_18
1 Introduction
Adolescent idiopathic scoliosis (AIS) is the most common spinal disorder in adolescents [1, 2]. AIS affects the mobility of the spine and the balance of the trunk, leading to abnormal gait. Scoliosis screening for young people is therefore important and urgent. The traditional approach to scoliosis assessment is radiography combined with measurement of the Cobb angle [3]. In recent years, many deep learning and machine learning methods have also been proposed for X-ray images to improve clinical diagnosis [4]. Examples include 3D imaging from X-ray images followed by localisation of key points to generate the 3D spinal curvature; pipelines that take an X-ray image as input, detect the number of vertebrae, identify landmark features, and determine whether scoliosis is present and how severe it is by measuring the Cobb angle; convolutional neural networks (CNNs) that automatically detect and classify scoliosis with an accuracy of 86% [5, 6]; and multi-scale keypoint estimation networks that identify vertebral keypoints in order to build a machine learning training set of 3D spinal curvature [7]. Traditional methods of scoliosis assessment, while more accurate, repeatedly expose children to radiation, which can be harmful to their health. Studies have found that non-radiological assessment tools may reduce the number of X-rays taken in patients with scoliosis, but further research into scoliosis measurement tools is needed to improve their reliability and validity. Non-invasive scoliosis assessment methods have become more numerous and more diverse, and can be divided into two main types: static and dynamic [8]. Static assessments include photogrammetry and 3D reconstruction of the body, while dynamic assessments are mainly gait analyses based on optical markers, Kinect, or inertial sensors. However, no non-invasive method has yet been widely adopted as a routine assessment method, mainly because of its lower accuracy compared with Cobb angle radiography [9].
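Because the Cobb angle is the reference measurement against which the non-invasive methods in this review are compared, a small sketch of how it can be computed may be helpful. In its usual definition, the Cobb angle is the angle between the upper endplate of the most tilted vertebra at the top of the curve and the lower endplate of the most tilted vertebra at the bottom. The code below assumes that each endplate has already been marked as two image points; the landmark coordinates are hypothetical, and the function names are illustrative rather than taken from any of the cited systems.

```python
import math


def line_angle_deg(p1, p2):
    """Inclination of the line through p1 and p2, in degrees."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Acute angle between two endplate lines, each given as ((x1, y1), (x2, y2))."""
    diff = abs(line_angle_deg(*upper_endplate) - line_angle_deg(*lower_endplate)) % 180.0
    # The result does not depend on the order in which the two points of a
    # line were marked; curves above 90 degrees are not handled by this sketch.
    return min(diff, 180.0 - diff)


# Hypothetical landmarks (pixels): upper endplate tilted +10 degrees,
# lower endplate tilted -15 degrees relative to the image x-axis.
upper = ((0.0, 0.0), (100.0, 100.0 * math.tan(math.radians(10.0))))
lower = ((0.0, 200.0), (100.0, 200.0 - 100.0 * math.tan(math.radians(15.0))))

angle = cobb_angle_deg(upper, lower)
print(f"Cobb angle: {angle:.1f} degrees")
```

The automated X-ray methods mentioned above differ mainly in how they locate the endplate landmarks; once the landmarks are available, the angle itself reduces to this kind of geometry.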
2 Scoliosis Assessment Methods
2.1 Static Measurement Method
Photogrammetry. Photogrammetry is an alternative to radiography for examining adolescents with idiopathic scoliosis (AIS): it measures the curvature of the back while avoiding radiographic exposure [10]. Rozilene Maria Cota Aroeira et al. independently compared a new non-radiographic method for quantifying the degree of scoliosis with the Cobb radiographic method [11]. In 2019 they developed a computerised photogrammetry protocol as a non-radiographic method for quantifying scoliosis and compared the lateral spinal curvature angles it produced with those obtained by the Cobb radiographic method [12]. No statistically significant differences were found, and a mathematical relationship between the two methods was established, indicating that the two methods are equivalent. Nurbaity Sabri et al. proposed an LBP-1DCNN-ESNN model to classify photogrammetric images of AIS patients. The model introduces CNN convolutional kernels to reduce the number of features passed to the ESNN model and outperformed the LBP + ESNN model by 3.34%. The experimental results show that this method outperforms traditional machine learning, with an accuracy of 90.01% and a processing time of 0.0111 s, that it compares reliably with current scoliosis classification methods [13], and that it provides better classification results.
3D Body Scanning Method. Surface topography is a 3D morphometry method based on assessment of the external body contour [14]. Using 3D topography for scoliosis assessment or for follow-up of spinal correction surgery avoids the harm that repeated radiological examination may cause and, combined with trunk asymmetry analysis, allows the severity of scoliosis or the progress of surgical treatment to be assessed more accurately. Sato et al. examined 126 patients with idiopathic scoliosis and reported high accuracy and reliability in comparison with radiology, with a sensitivity of 98%, a specificity of 53%, a false positive rate of 47%, an inter-assessor reliability of 73%, and an intra-assessor reliability of 70% [15]. Stephan Rothstock et al. classified scoliosis severity using a machine learning method based on the asymmetry of the back surface. Frontal X-rays and 3D scans of 50 patients were analysed for clinical staging based on the Cobb angle and spinal curve patterns. Patients were classified by degree of scoliosis (mild, moderate, severe) and by ALS classification using a fully connected neural network. The method analyses curvature types according to the ALS classification, which comprises seven different types of spinal curve. A torso scan is analysed with statistical modelling, and a neural network then classifies the degree of scoliosis. For the mild and moderate-to-severe patient groups, the classification success rate for scoliosis severity was 90%, with a sensitivity of 80% and a specificity of 100%. The overall classification success rate for the three combined ALS treatment groups was 75%.
Other Static Assessment Methods. Computerised infrared thermography is a relatively new, non-invasive, and painless method of detecting abnormalities in the body: sensors detect the small amounts of infrared radiation normally emitted from the skin surface, and a computer converts them into images [16, 17]. Body surface temperature is regulated uniformly by the autonomic nervous system on both sides of the body, and blood flow through the skin produces symmetrical temperature patterns [18], which makes temperature asymmetry an important criterion for disease diagnosis. It has been shown that thermography is suitable for complementary screening and that images can be taken anywhere because the device is small, mobile, and lightweight. It is easy for the therapist to handle and gives immediate results, but symmetrical movements performed by children with scoliosis may activate the back muscles asymmetrically if not corrected by a physiotherapist. Infrared body thermography is a safe and reliable examination method with no risk of pain or radiation exposure. It can be used repeatedly and without restriction regardless of the patient's physical condition, and it allows the progression and clinical manifestations of scoliosis to be evaluated easily.
In addition, the colourised images can be used to show patients directly how much their symptoms have improved and what their current status is, which can improve their own understanding of the condition. However, because external factors and the proficiency of the tester can easily influence the results, and because the criteria for distinguishing abnormal from normal are vague, the test conditions must be clearly defined before testing. Ultrasound is a low-cost, non-invasive test that can be performed easily and in real time without exposure to radiation. Ultrasound images can serve as an assessment tool for spinal curvature by capturing multiple parts of the spine in two dimensions for visual display [19]. Using ultrasound as an assessment tool for scoliosis, Zheng et al. reported inter-rater reliability at different levels (ICC > 0.88) that was lower than the intra-rater reliability (ICC > 0.94) [20]. This points to a drawback of ultrasound testing, namely its dependence on the proficiency of the tester, and illustrates the importance of experience in ultrasound measurement, for example in the interpretation of results.
2.2 Dynamic Measurement Method
Computer Vision Based Posture Analysis System Method. Most scoliosis assessments are more accurate for patients with moderate and severe scoliosis and tend to be less accurate for normal spines and mild scoliosis, even though the mild stage is the best time to treat patients during the correction of scoliosis [21]. To improve the accuracy of scoliosis assessment further, the Kinect camera has been combined with a computer vision-based posture analysis system. Kwang Hyeon Kim et al. used a Kinect camera to capture images of the participant's skeleton in a standing position and analysed length and angle differences between body parts (the centrelines of the body, shoulders, pelvis, and ankles). Their study assessed the consistency of a computer vision-based postural analysis system as a scoliosis screening tool for detecting spinal postural deformities that is easy to use in clinical practice [22]. The study included 140 participants for scoliosis screening, and the factors used to determine the presence or absence of scoliosis served as parameters for a commercial computer vision-based postural analysis system used as a clinical decision support system (CDSS) [23]. The scoliosis results of the CDSS showed 94% agreement with radiological scoliosis assessment, and the evaluation was more accurate in participants with normal spines and mild scoliosis. The reference SHD and the CDSS SHD were statistically significant by paired t-test (p < 0.001). In the principal component analysis (PCA), SHD and PHD were the main factors (79.97% for PC1, SHD, and 19.86% for PC2, PHD). Although the majority of patients analysed in this study had minor postural abnormalities, a computer vision-based postural analysis system that uses no ionising radiation is a simple and efficient screening and diagnosis tool in clinical practice [24]. It can therefore be used as a safe, efficient, and convenient CDSS. Because it is more accurate in evaluating normal and mild scoliosis cases, it is suitable for early scoliosis screening.
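The kind of length and angle comparison between paired skeleton joints described above can be sketched in a few lines. This is a minimal illustration only: the joint coordinates are hypothetical frontal-plane positions in millimetres with the y-axis pointing up, the function names are invented for this example, and reading SHD and PHD as shoulder and pelvic height differences is an assumption, since the text does not expand these abbreviations or describe the commercial system's actual computations.

```python
import math


def height_difference(left, right):
    """Vertical offset between a pair of joints (left minus right)."""
    return left[1] - right[1]


def tilt_angle_deg(left, right):
    """Tilt of the line joining two paired joints relative to horizontal."""
    dx = abs(left[0] - right[0])
    dy = left[1] - right[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical frontal-plane joint positions (x, y) in mm, y pointing up.
left_shoulder, right_shoulder = (180.0, 1400.0), (-180.0, 1385.0)
left_hip, right_hip = (120.0, 950.0), (-120.0, 956.0)

shoulder_diff = height_difference(left_shoulder, right_shoulder)
pelvis_diff = height_difference(left_hip, right_hip)
shoulder_tilt = tilt_angle_deg(left_shoulder, right_shoulder)

print(f"shoulder height difference: {shoulder_diff:.1f} mm")
print(f"pelvic height difference: {pelvis_diff:.1f} mm")
print(f"shoulder tilt: {shoulder_tilt:.2f} degrees")
```

A screening system would compare such values against thresholds or feed them into a statistical model; those thresholds are not given in the text and are deliberately left out here.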
Gait Analysis Method Many muscle, nervous system and other diseases can cause alterations to normal gait patterns [25, 26] and therefore gait characteristics in patients with spinal disorders will change accordingly [27, 28]. An objective assessment of the gait pattern of patients with scoliosis will facilitate further research into how scoliosis leads to a decrease in life capacity, impairment and abnormalities in daily activities, or to determine the link between scoliosis and the patient’s abnormal gait pattern [29, 30]. In contrast, cursor camera gait analysis, which is used to screen for scoliosis by fixing a cursor to the patient’s joint site and then taking video of the patient’s gait with a video camera to simulate the patient’s gait on a computer, is immature,based primarily on computer vision or inertial sensors Performing abnormal gait assessment in scoliosis patients. Computer vision-based assessment of abnormal gait in scoliosis. Ram Haddas et al. set out to analyse the gait characteristics of patients and identify aspects of human gait related to preoperative, postoperative, and postoperative function and prognosis. The study used a video camera for ground gait testing to measure patient movement, surface electromyography (EMG) to record muscle activity, and force plates to record ground reaction force (GRF). Surface electromyography (EMG) is used to record muscle activity, and force plates are used to record ground reaction forces (GRF). Gait distance and time parameters, ankle, knee, hip, pelvic and trunk range of motion (ROM), duration of lower extremity EMG activity, and peak vertical GRF were collected. Analyze the gait of different patients collected. The study establishes and details some important kinematic and dynamic variables of gait in patients with spinal disorders. It
is recommended that gait analysis be used as part of the clinical evaluation of scoliosis to improve the accuracy of the assessment. IMU-based assessment of abnormal gait in scoliosis. Computer vision-based gait analysis for scoliosis screening is expensive, time-consuming and laborious, and it is limited in screening scoliosis patients because it compares only kinematic parameters [31]. Inertial measurement units (IMUs) have frequently been used to assess gait characteristics and body movements in healthy and pathological populations, so inertial sensor gait analysis for scoliosis assessment is gradually emerging. Jae-sung Cho et al. performed scoliosis screening through a machine learning-based gait analysis test [32]. They applied a machine learning method, the support vector machine (SVM), to scoliosis-induced gait measurements, namely kinematics based on gait phase segmentation. The gait was recorded and analysed with IMUs, and a total of 72 gait features were extracted to build a gait recognition model. The SVM distinguished the gait patterns of scoliosis patients from those of controls with an accuracy of 90.5%. With optimally selected features, the feature selection algorithm was effective in differentiating age groups, with an accuracy of 95.2%. The SVM classified the degree of scoliosis from gait with an accuracy of 81.0%, and with the best selected features the accuracy was 85.7%, which indicates considerable potential for applying support vector machines to gait classification by scoliosis degree.
3 Discussion
Early diagnosis and assessment of scoliosis progression is clinically important, as adolescents are generally skeletally immature and at high risk of scoliosis. Non-invasive scoliosis assessment can reduce the workload of scoliosis screening in adolescents and thereby the burden on healthcare professionals. Early detection of scoliosis can reduce the pain of the disease and improve treatment outcomes. Radiological screening, although accurate, is costly and carries exposure risks that may harm the adolescent's body. To this end, a literature review was conducted to analyse the latest research on, and the characteristics of, non-radiological assessment tools applicable to patients with scoliosis. A total of 32 international papers were analysed through literature search and keyword screening and were divided into two categories: static assessment and dynamic assessment of scoliosis. The main methods for static assessment are photogrammetry, 3D body scanning and computerised infrared body thermography. Dynamic assessment is mainly based on the combination of Kinect and postural analysis systems, or on gait analysis using computer vision or inertial sensors; gait analysis is the main method of dynamic assessment. Table 1 summarizes the methods, tools, advantages and disadvantages, participants and conclusions of the experiments involving the different scoliosis assessment methods. Photogrammetry is a low-cost and portable non-ionising method, but data acquisition is time-consuming and laborious, and shooting errors may affect the results, so the shooting angle, distance and other experimental conditions need to be fixed in advance. Photogrammetry
Table 1. Different evaluation experiments involving scoliosis.

| Study (year) | Method | Tool | Participants | Characteristics | Conclusion |
|---|---|---|---|---|---|
| Rozilene Maria Cota Aroeira (2019) | Photogrammetry | Camera | Lenke Type 1: 10; non-Lenke Type 1: 20 | Low cost, portable, high accuracy | An LBP-1DCNN-ESNN model is proposed to classify AIS patients; introducing CNN convolution kernels reduces the number of features input into the ESNN model, with an accuracy of 90.01% and a time to produce results of 0.0111 s |
| Stephan Rothstock (2020) | Surface topography | Customised 3D analysis and manipulation software | Scoliosis patients: 50 | High cost; professional operation and evaluation techniques are required; high accuracy | In the groups of mild and moderate-severe patients, scoliosis severity was classified with a success rate of 90%, a sensitivity of 80% and a specificity of 100% |
| Ana-Maria Vutan (2022) | Computerised infrared body thermal imaging | Infrared sensor | Mild scoliosis: 15; without postural deviations: 15 | Small, light and mobile; easy to handle and provides immediate results; requires a physiotherapist to correct the assessment action | Indicated for supplementary scoliosis testing; requires a physiotherapist to correct the assessment action |
| Kwang Hyeon Kim (2022) | Posture analysis system based on Kinect computer vision | Kinect camera, CDSS | Posture imbalance, nonstructural postural deformity, or the presence of scoliosis: 140 | A safe, efficient and convenient CDSS for early screening for spinal deformities | Scoliosis results of the CDSS showed 94% agreement with radiological scoliosis assessment; the assessment is more accurate for patients with normal and mild scoliosis |
| Ram Haddas (2018) | Gait analysis based on computer vision | 10-camera Vicon video system, ground reaction force and electromyography | ADS: 20; CSM: 20; healthy volunteers: 15 | Time-consuming and laborious, high cost, high accuracy; can be used as an auxiliary screening method to improve accuracy | Gait analysis for scoliosis provides an objective measure of functional gait |
| Jae-sung Cho (2018) | Inertial sensor gait analysis | Inertial measurement unit, SVM | Scoliosis patients: 24; normal participants: 18 | Considerable potential for application to the classification of scoliosis prevalence | The SVM classified the degree of scoliosis from gait with an accuracy of 81.0%, and with the best selected features the accuracy was 85.7% |
may be a suitable alternative to radiographic methods, but further research is needed to improve non-ionising techniques in AIS screening and to raise their accuracy and efficiency. Previous studies have reported that the disadvantage of surface topography is that it requires expensive equipment and specialist handling and evaluation techniques to perform three-dimensional asymmetry assessments, and that it is more accurate as a screening tool for moderate and severe patients. Ultrasound is a tool for real-time, easy assessment of spinal curvature and has shown good appropriateness and intra-rater reliability, although some studies have also demonstrated lower intra-rater reliability due to differences in examiner proficiency and interpretation. The combination of computer vision and postural analysis systems is safe, efficient and convenient, and provides more accurate results for patients with normal and mild scoliosis than other scoliosis assessment methods, but at a higher cost. The computer vision-based gait analysis method requires a combination of cameras and a force plate, and markers must be attached at key locations before the assessment. It is time-consuming and costly, but has a high accuracy rate and can be used as an adjunctive screening method to improve accuracy. The inertial sensor-based gait analysis method uses inertial sensors and an SVM classifier; it has a high accuracy rate and considerable potential as an auxiliary screening method to improve accuracy. Further research and innovation are needed to improve the accuracy of scoliosis assessment. This study has several limitations in examining non-radiological scoliosis assessment. First, it only included papers published after 2000, in order to understand the research trends in evaluation scales and to gain the most recent insights into new diagnostic tools.
Second, only a small number of researchers took part in the search and selection of papers, and the search time was short. Third, the research methods covered in this paper may not be comprehensive, and some relevant methods may not have been identified. Fourth, some studies did not report on credibility and feasibility, and even where the same criteria were used or comparative experiments were conducted, the range of credibility and feasibility across studies was wide, making it difficult to draw clear and comprehensive conclusions. Fifth, there was no qualitative evaluation of the studies included in this study. These aspects therefore need to be addressed in future studies through a more systematic and objective examination of the literature, including risk-of-bias assessment and assessment of study quality. Despite these limitations, this study should provide an understanding of the current status, characteristics and feasibility of the non-radiological scoliosis assessment scales currently being investigated nationally and internationally, and of the accuracy and potential of each assessment method. It is therefore proposed that static assessment supplemented by gait assessment may be considered, in future clinical and research studies, as an alternative to radiological methods for the diagnosis and evaluation of scoliosis.
4 Conclusion and Outlook
Adolescents are generally skeletally immature and at higher risk of scoliosis, so early diagnosis and assessment of scoliosis progression is clinically important. Adolescent scoliosis screening therefore needs to be continually refined, and non-radiological assessment tools may reduce the number of X-rays taken in patients with scoliosis. Further research into scoliosis measurement tools is needed to improve their reliability and validity. Both static and dynamic assessments are available. Static assessments include photogrammetry, 3D reconstruction of the torso and computerised infrared body thermography; dynamic assessments are mainly based on Kinect combined with postural analysis systems or on gait analysis. However, the credibility and feasibility of the studies covered here vary widely and few of them are large-scale clinical studies, so more detailed and systematic follow-up clinical studies of scoliosis assessment tools are required. To further improve accuracy, a combination of static and dynamic assessment could be considered in order to achieve the same screening results as radiology and Cobb angle measurement, allowing faster, non-invasive scoliosis screening in adolescents.
References
1. Liu, T., Wang, Y., Yang, Y., et al.: A multi-scale keypoint estimation network with self-supervision for spinal curvature assessment of idiopathic scoliosis from the imperfect dataset. Artif. Intell. Med. 125, 102235 (2022)
2. Sabri, N., Hamed, H.N.A., Ibrahim, Z., et al.: 2D Photogrammetry image of scoliosis Lenke type classification using deep learning. In: 2019 IEEE 9th International Conference on System Engineering and Technology (ICSET), pp. 437–440. IEEE (2019)
3. Moreira, R., et al.: A computer vision-based mobile tool for assessing human posture: a validation study. Comput. Methods Programs Biomed. 214, 106565 (2022)
4. Yellakuor, B.E., Moses, A.A., Zhen, Q., Olaosebikan, O.E., Qin, Z.: A multi-spiking neural network learning model for data classification. IEEE Access 8, 72360–72371 (2020)
5. Amin, A., Abbas, M., Salam, A.A.: Automatic detection and classification of scoliosis from spine X-rays using transfer learning. In: 2022 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2), pp. 1–6. IEEE (2022)
6. Aroeira, R.M.C., Estevam, B., Pertence, A.E.M., et al.: Non-invasive methods of computer vision in the posture evaluation of adolescent idiopathic scoliosis. J. Bodyw. Mov. Ther. 20(4), 832–843 (2016)
7. Penha, P.J., Penha, N.L.J., De Carvalho, B.K.G., et al.: Posture alignment of adolescent idiopathic scoliosis: photogrammetry in scoliosis school screening. J. Manipulative Physiol. Ther. 40(6), 441–451 (2017)
8. Saad, K.R., Colombo, A.S., Ribeiro, A.P., et al.: Reliability of photogrammetry in the evaluation of the postural aspects of individuals with structural scoliosis. J. Bodyw. Mov. Ther. 16(2), 210–216 (2012)
9. Navarro, I.J.R.L., da Rosa, B.N., Candotti, C.T.: Anatomical reference marks, evaluation parameters and reproducibility of surface topography for evaluating the adolescent idiopathic scoliosis: a systematic review with meta-analysis. Gait Posture 69, 112–120 (2019)
10. Bortone, I., et al.: Recognition and severity rating of Parkinson's disease from postural and kinematic features during gait analysis with Microsoft Kinect. In: International Conference on Intelligent Computing, pp. 613–618. Springer (2018)
11. Sabri, N., Hamed, H.N.A., Ibrahim, Z., et al.: The hybrid feature extraction method for classification of adolescence idiopathic scoliosis using Evolving Spiking Neural Network. Journal of King Saud University-Computer and Information Sciences 34(10), 8899–8908 (2022)
12. Aroeira, R.M.C., Leal, J.S., de Melo Pertence, A.E.: New method of scoliosis assessment: preliminary results using computerized photogrammetry. Spine 36(19), 1584–1591 (September 01, 2011). https://doi.org/10.1097/BRS.0b013e3181f7cfaa
13. Aroeira, R.M.C., et al.: Método não ionizante de rastreamento da escoliose idiopática do adolescente em escolares. Ciência & Saúde Coletiva [online] 24(2), 523–534 (2019). Accessed 3 January 2023
14. Kim, D.-J., et al.: Review study on the measurement tools of scoliosis: mainly on nonradiological methods. Journal of Korean Medicine. The Society of Korean Medicine (2021). https://doi.org/10.13048/jkm.21006
15. Rothstock, S., Weiss, H.R., Krueger, D., et al.: Clinical classification of scoliosis patients using machine learning and markerless 3D surface trunk data. Med. Biol. Eng. Comput. 58(12), 2953–2962 (2020)
16. Choi, Y.C.C.L., Kwon, K.R.: Standardization study of thermal imaging using the acupoints in human body. Journal of Pharmacopuncture 11(3), 113–22 (2008)
17. Lubkowska, A., Gajewska, E.: Temperature distribution of selected body surfaces in scoliosis based on static infrared thermography. Int. J. Environ. Res. Public Health 17, 8913 (2020)
18. Vutan, A.-M., et al.: Evaluation of symmetrical exercises in scoliosis by using thermal scanning. Appl. Sci. 12, 721 (2022)
19. Haddas, R., Ju, K.L., Belanger, T., et al.: The use of gait analysis in the assessment of patients afflicted with spinal disorders. Eur. Spine J. 27, 1712–1723 (2018)
20. Zhang, J., Li, H., Yu, B.: Correlation between Cobb angle and spinous process angle measured from ultrasound data. In: Proceedings of the 2017 4th International Conference on Biomedical and Bioinformatics Engineering (2017)
21. Menger, R.P., Sin, A.H.: Adolescent and idiopathic scoliosis. StatPearls. Treasure Island (FL) (2020)
22. Kim, D.S., Park, S.H., Goh, T.S., et al.: A meta-analysis of gait in adolescent idiopathic scoliosis. J. Clin. Neurosci. 81, 196–200 (2020)
23. Sohn, M.J., Kim, K.H.: Conformity assessment of a computer vision-based clinical decision support system for the detection of postural spinal deformity (2022)
24. Furlanetto, T.S., Candotti, C.T., Comerlato, T., et al.: Validating a postural evaluation method developed using a Digital Image-based Postural Assessment (DIPA) software. Comput. Methods Programs Biomed. 108(1), 203–212 (2012)
25. Bortone, I., et al.: Gait analysis and Parkinson's disease: recent trends on main applications in healthcare. In: International Conference on NeuroRehabilitation, pp. 1121–1125. Springer (2018)
26. Bortone, I., et al.: A novel approach in combination of 3D gait analysis data for aiding clinical decision-making in patients with Parkinson's disease. In: International Conference on Intelligent Computing, pp. 504–514. Springer (2017)
27. Pesenti, S., et al.: Characterization of trunk motion in adolescents with right thoracic idiopathic scoliosis. Eur. Spine J. 28(9), 2025–2033 (2019)
28. Bortone, I., Piazzolla, A., Buongiorno, D., et al.: Influence of clinical features of the spine on gait analysis in adolescent with idiopathic scoliosis. In: 2020 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pp. 1–6. IEEE (2020)
29. Kainz, H., et al.: Reliability of four models for clinical gait analysis. Gait Posture 54, 325–331 (2017)
30. Boompelli, S.A., Bhattacharya, S.: Design of a telemetric gait analysis insole and 1-D convolutional neural network to track postoperative fracture rehabilitation. LifeTech 2021 – 2021 IEEE 3rd Glob. Conf. Life Sci. Technol., pp. 484–488 (Mar. 2021)
31. Cabral, S., Resende, R.A., Clansey, A.C., Deluzio, K.J., Selbie, W.S., Veloso, A.P.: A global gait asymmetry index. J. Appl. Biomech. 32(2), 171–177 (2016)
32. Cho, J., Cho, Y.S., Moon, S.B., et al.: Scoliosis screening through a machine learning based gait analysis test. Int. J. Precis. Eng. Manuf. 19(12), 1861–1872 (2018)
Algorithm of Pedestrian Detection Based on YOLOv4
Qinjun Zhao1(B), Kehua Du1, Hang Yu1, Shijian Hu1, Rongyao Jing1, and Xiaoqiang Wen2
1 School of Electrical Engineering, University of Jinan, Jinan 250024, China
[email protected]
2 Department of Automation, Northeast Electric Power University, Jilin 132012, China
Abstract. Pedestrian detection technology is applied in more and more scenes and shows high application value. In recent years, with the development of electronic information technology, the computing speed of computers has grown rapidly, and deep learning technology has improved along with it. This paper studies a pedestrian detection scheme based on YOLOv4: the anchors for the pedestrian data are obtained through the K-Means algorithm, the loss function of the target detection algorithm is optimized, and Soft-NMS is introduced to mitigate the pedestrian occlusion problem in detection. Model verification experiments show that the algorithm in this paper outperforms the traditional target detection algorithm in terms of speed, accuracy and robustness.
Keywords: Deep learning · Pedestrian detection · YOLOv4 · Soft-NMS
1 Introduction
At present, pedestrian detection is applied in many areas, such as assisting the intelligent driving of cars, monitoring the safety of pedestrians in important places, and providing a basis for pedestrian behavior analysis, pedestrian counting or pedestrian tracking in video surveillance. In various practical scenarios, detecting pedestrians by computer has greatly improved work efficiency, reduced labor costs, and promoted social and economic progress and development. Pedestrian detection methods currently fall into two main kinds: the traditional vision method and the deep learning method. The traditional vision method manually analyzes the characteristics of pedestrians, extracts the pedestrian foreground, and then designs a classifier for discriminant analysis. Typical examples include Haar feature detection, HOG feature detection, and DPM feature detection [1–3]. The accuracy, speed and robustness of traditional detection models all have certain defects. With the continuous improvement of the computing power of computer chips, deep learning algorithms using convolutional neural networks have been favored by researchers and have achieved excellent results in various image detection fields. YOLO (You Only Look Once) is the representative one-stage detection network. It realizes an end-to-end design, and after training on specific image data the algorithm shows good detection performance and robustness in various application scenarios [4–7]. This paper studies a set of algorithms for pedestrian detection based on the YOLOv4 [7] algorithm, and optimizes and improves the training strategy.
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 231–239, 2024. https://doi.org/10.1007/978-3-031-50580-5_19
2 YOLOv4
Since the first generation of the YOLO algorithm was introduced, its speed and accuracy have made it popular with researchers, and it has been applied to various detection fields. After years of exploration, many derivative versions of YOLO have been proposed, of which the most classic is YOLOv3 [6]. The YOLOv4 target detection algorithm is an improvement of the YOLOv3 detection algorithm, and the pedestrian detection algorithm proposed in this paper is in turn an improvement of the YOLOv4 target detection algorithm.
Fig. 1. YOLOv4 structure
A complete convolutional neural network mainly has two calculation stages. The first stage is called the 'Backbone'; it convolves the image to obtain image features and is the foundation of the network. The second stage is the 'Neck', which follows the Backbone and serves the downstream tasks (object classification, semantic segmentation, etc.). Figure 1 shows the composition of YOLOv4. To counter the network degradation problem that may occur in deep neural networks, the backbone network uses the CSPDarknet53 structure, which contains multiple CSP residual block structures [8]. The neck contains the PANet structure and the SPP pooling module. To improve network accuracy, the SPP pooling module uses 5 × 5, 9 × 9 and 13 × 13 pooling layers for feature fusion. The PANet module accurately retains spatial information and helps locate the target by fusing upper-layer and lower-layer features. The YOLOv4 network has three output branches, which produce grids of 13 × 13, 26 × 26, and 52 × 52 cells. They provide receptive fields of different sizes for three types of targets: large, medium and small. Each grid cell predicts 3 prediction boxes, and each prediction box contains three types of information: object confidence, classification, and position (x, y, w, h). If only the single class of pedestrian targets is detected, each grid cell therefore outputs a total of 3 × 6 = 18 numbers.
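The output sizes described above can be checked with a little arithmetic. The following is a minimal sketch; the function name and its parameters are illustrative and are not taken from any YOLOv4 code base.

```python
def yolo_output_shapes(num_classes=1, boxes_per_cell=3, grid_sizes=(13, 26, 52)):
    """Return the (rows, cols, channels) shape of each output branch."""
    # One confidence value, one score per class, and four position values (x, y, w, h).
    values_per_box = 1 + num_classes + 4
    channels = boxes_per_cell * values_per_box
    return [(g, g, channels) for g in grid_sizes]


shapes = yolo_output_shapes()
# Total number of candidate boxes produced for one image across the three branches.
total_boxes = sum(rows * cols * 3 for rows, cols, _ in shapes)
```

With a single pedestrian class each branch has 18 channels per grid cell, and the three grids together yield 10647 candidate boxes, which is why a suppression step is needed after prediction.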
3 Loss Function
In CNN training, the loss function plays an important role in the iterative process: its value is calculated and fed back to the network so that the network parameters can be corrected. The loss functions used in this paper are as follows:

$$ l_{box} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj} \left[ 1 - IoU\left(A^{pre}, B^{gt}\right) \right] \tag{1} $$

$$ l_{cls} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj} \, BCE\left(\hat{P}(c_i), P(c_i)\right) \tag{2} $$

$$ l_{obj} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \left( 1_{i,j}^{obj} \, BCE\left(\hat{c}_i, c_i\right) + 1_{i,j}^{noobj} \, BCE\left(\hat{c}_i, c_i\right) \right) \tag{3} $$

Here $S^2$ is the number of grid cells, $B$ is the number of prediction boxes contained in a grid cell, $1_{i,j}^{obj}$ indicates that the grid cell contains a target, and $1_{i,j}^{noobj}$ indicates that it contains no target. A hat marks a predicted value. Since the earliest YOLO algorithm, the loss function has played an important role in optimizing the network parameters. At first, a squared-error loss function was used to calculate the network variables. The loss function now used by the network for bounding boxes is an anchor-based IoU loss, which improves both the accuracy of the model and its training speed compared with the squared-error loss. The other loss function, used for the object and classification losses, is the cross-entropy loss function.
3.1 Bounding Box Loss
The bounding box is corrected by adjusting the preset anchors to fit the target. Before training, the widths and heights of the collected pedestrian boxes are treated as a set of discrete data points and clustered, generally with the K-Means algorithm. The resulting anchors are: (10 × 24), (18 × 46), (27 × 104), (40 × 64), (56 × 148), (85 × 265), (121 × 174), (166 × 312), (303 × 371).
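The anchor clustering step can be sketched as follows. This is a minimal illustration under assumptions that the paper does not state: the distance between a box and a cluster centre is taken as 1 − IoU of the two (w, h) pairs aligned at a common corner, which is a common choice for YOLO anchors, and the initial centres are chosen deterministically. The function names and the toy data are hypothetical.

```python
def wh_iou(box, centroid):
    """IoU of two boxes given as (w, h), both anchored at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (w, h) pairs into k anchors, using 1 - IoU as the distance."""
    # Deterministic start: spread the initial centres over the boxes sorted by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Toy data: three small and three large pedestrian boxes (width, height) in pixels.
toy_boxes = [(10, 24), (12, 26), (11, 25), (100, 200), (104, 210), (96, 190)]
anchors = kmeans_anchors(toy_boxes, k=2)
```

On real data the same routine would be run with k = 9 to obtain three anchors for each of the three output branches.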
Fig. 2. IoU schematic
Figure 2 shows the relationship between prediction box A and target box B. The formula for calculating IoU is:

$$ IoU = \frac{|A \cap B|}{|A \cup B|} \tag{4} $$
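Equation (4) can be written out for axis-aligned boxes as below. This is a minimal sketch that assumes boxes are given as corner coordinates (x1, y1, x2, y2); the paper itself describes boxes by centre, width and height, so the representation is an assumption made for brevity.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


overlapping = iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap area 1, union area 7
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))     # no overlap at all
```

Two disjoint boxes always give an IoU of 0 regardless of how far apart they are, so a loss built on IoU alone carries no information about distance in that case.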
When the two boxes are far apart, the IoU value is 0. In this case the IoU value cannot represent the relationship between the two boxes, so the loss is unable to make effective corrections to the CNN. In response to this problem, paper [9] proposed the GIoU algorithm, which considers the proportion of the smallest enclosing rectangle that is not covered by A and B. Paper [10] proposes the DIoU and CIoU algorithms. The former adds to the IoU loss a penalty given by the squared distance between the centres of the two boxes, normalized by the squared diagonal of the smallest rectangle enclosing both boxes; the latter adds an aspect-ratio term to DIoU. Combining the ideas of the above algorithms, this paper summarizes some calculation formulas that express the relationship between the prediction box, the target box and the enclosing box. The loss calculation formula used is as follows:

$$ \Delta = \left| \frac{h^{pre}}{h^{pre} + w^{pre}} - \frac{h^{gt}}{h^{gt} + w^{gt}} \right| \tag{5} $$

$$ l_{IoU} = 1 - IoU(A, B) + \frac{\rho^2\left(b^{pre}, b^{gt}\right)}{c^2} \cdot e^{\alpha \Delta} \tag{6} $$

where $\Delta$ is the absolute difference between the height ratios $\frac{h}{h+w}$ of the two boxes, each of which lies in the range (0, 1). $\rho^2$ is the squared length of the line connecting the geometric centre of the prediction box and the geometric centre of the target box, and $c^2$ is the squared diagonal length of the smallest rectangle enclosing the two boxes. $e^{\alpha \Delta}$ is added as a correction coefficient: when $\Delta \to 0$, $e^{\alpha \Delta} \to 1$. The choice of $\alpha$ influences the fitting speed of the model; in this algorithm, $\alpha = 10$ is selected to speed up fitting. Figure 3 shows the influence of the different IoU schemes on AP (Average Precision) on the cross-validation set.
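The bounding box loss described in this section can be sketched for a single pair of boxes as follows. This is an illustration under stated assumptions, not the authors' implementation: boxes are given as corners (x1, y1, x2, y2), the normalizing term c is taken as the diagonal of the smallest rectangle enclosing both boxes (as in DIoU), and the function name is hypothetical.

```python
import math


def box_loss(pred, gt, alpha=10.0):
    """1 - IoU plus a centre-distance penalty scaled by exp(alpha * delta)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU of the two boxes.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter)

    # Squared distance between the two box centres.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest enclosing rectangle (assumed definition of c).
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
       + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Difference between the h / (h + w) ratios of the two boxes.
    delta = abs(ph / (ph + pw) - gh / (gh + gw))

    return 1.0 - iou + rho2 / c2 * math.exp(alpha * delta)


perfect = box_loss((0, 0, 2, 2), (0, 0, 2, 2))
shifted = box_loss((0, 0, 2, 2), (1, 1, 3, 3))
```

For two boxes of the same shape the exponential factor is 1 and the loss reduces to the DIoU form; a mismatch in shape inflates the distance penalty, and a larger alpha inflates it faster.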
Fig. 3. Comparison of effects of different IOU schemes
3.2 Object and Class Loss
The BCE loss function is used for the object and class losses:

$$ BCE\left(\hat{y}, y\right) = -\left[ y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right] \tag{7} $$
In the actual training process, YOLOv4 divides the picture into N × N regions, and each region detects whether there is a target. The data set used in this paper contains relatively few pedestrian samples and many more samples containing only background, and the number of pedestrians in a single picture is small. When the object loss of a single image is calculated, the proportion of negative samples is therefore too large, and too much of the loss weight is biased towards background samples. For this reason, a new loss function is adopted in this paper that adjusts the negative samples through a variable coefficient $\gamma$:

$$ l_{obj} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \left( 1_{i,j}^{obj} \, BCE\left(\hat{c}_i, c_i\right) + 1_{i,j}^{noobj} \, BCE\left(\hat{c}_i, c_i\right) \cdot \gamma \right) \tag{8} $$
$\gamma$ depends on the iteration rate $x$: it is small at the beginning and gradually returns to the normal state as the iteration progresses (Fig. 4):

$$ \gamma = \sin\left(\frac{\pi}{2} x\right), \quad x \in (0, 1] \tag{9} $$

By using the $\gamma$ coefficient, the loss converges faster and the curve is smoother, as shown in Fig. 5.
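The effect of the coefficient can be illustrated with a few lines of code. The sketch below applies the BCE loss and the sine schedule to a flat list of confidences; the function names, the clamping constant and the toy values are assumptions for illustration, and a real implementation would operate on tensors.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross entropy for one prediction, clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def gamma(progress):
    """Weight for negative samples; progress is the iteration rate x in (0, 1]."""
    return math.sin(math.pi / 2.0 * progress)


def objectness_loss(preds, targets, progress):
    """Sum of BCE terms with background (target 0) terms scaled by gamma."""
    g = gamma(progress)
    loss = 0.0
    for p, t in zip(preds, targets):
        if t == 1:
            loss += bce(p, 1.0)       # cell contains a pedestrian
        else:
            loss += g * bce(p, 0.0)   # background cell, down-weighted early on
    return loss


# One positive cell and three background cells.
preds = [0.9, 0.2, 0.3, 0.1]
targets = [1, 0, 0, 0]
early = objectness_loss(preds, targets, progress=0.1)
late = objectness_loss(preds, targets, progress=1.0)
```

Early in training the background terms contribute little, so the few positive cells dominate the gradient; by the end of training gamma reaches 1 and the loss equals the unweighted form.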
Fig. 4. Smoother γ graph, iteration rate x from 0 to 1
Fig. 5. Changed loss graph
4 Soft-NMS
The deep learning network generates multiple prediction boxes for each target in the input image. The Non-Maximum Suppression (NMS) algorithm sorts prediction boxes that overlap strongly (high IoU) by their confidence, deletes the prediction boxes with lower confidence, and finally keeps only one prediction box. This algorithm is simple and efficient, but it also has certain drawbacks. When two or more targets are close to each other in the image, using the NMS algorithm may cause some targets to go undetected, which lowers the detection accuracy. This paper adopts the Soft-NMS algorithm proposed in paper [11] to mitigate this problem:

$$ s_i = \begin{cases} s_i, & iou(M, b_i) < N_t \\ s_i \left(1 - iou(M, b_i)\right), & iou(M, b_i) \ge N_t \end{cases} \tag{10} $$

where $M$ is the prediction box with the highest score, $b_i$ is another prediction box with score $s_i$, and $N_t$ is the overlap threshold.
For prediction boxes with high overlap, Soft-NMS reduces the score of the low-confidence prediction boxes, unlike the traditional NMS algorithm, which deletes them directly (Fig. 6). Introducing the Soft-NMS strategy improves the detection performance to some extent; the actual detection results are shown in Fig. 7.
Fig. 6. Pseudo code of Soft-NMS replacing NMS
Fig. 7. Soft-NMS improves results
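The linear rescoring rule of Soft-NMS can be sketched in plain Python as follows. This is a minimal illustration, not the code behind Fig. 6: the overlap threshold of 0.3 and the final score cut-off of 0.001 are assumed values, and boxes are given as corners (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def soft_nms(boxes, scores, iou_threshold=0.3, score_threshold=0.001):
    """Linear Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the box M with the highest remaining score.
        best_index = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_index)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            if overlap >= iou_threshold:
                score *= 1.0 - overlap  # soften rather than suppress
            if score > score_threshold:
                rescored.append((box, score))
        candidates = rescored
    return kept


# Two heavily overlapping pedestrians and one isolated pedestrian.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

Hard NMS with the same threshold would discard the second box outright. Here it survives with a reduced score, so a partly occluded pedestrian can still be reported if a later confidence threshold is set low enough.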
5 Experimental Results and Analysis
The video data in this paper were taken at different locations in the school and cover two environments, day and night. To facilitate processing, the frames are uniformly and proportionally scaled down to 720 × 406. The pictures used for training and testing are sampled every 3 frames from the video, and only the higher-quality data are retained. Some images containing "person" in the PASCAL VOC dataset [12] are added to enrich the dataset samples. Of the 12114 images in the final data set, 10903 are used as the training set of the model and 1211 as the verification set. The model is trained on a machine with an i9-9900X CPU and two RTX 2080 graphics cards with 24G video memory, and tested on a machine with an i5-9300H CPU and one GTX 1660 graphics card with 6G video memory; both machines run the Ubuntu operating system. Training starts from the YOLOv4 pre-training weights; the model is then verified on the verification set, and the model with the best performance is retained. Table 1 summarizes the data from the model test on the test set. The performance indicators in the table show that the detection accuracy and detection speed of the traditional visual algorithm lag behind YOLOv4 both in the day and at night. In summary, the algorithm in this paper can meet the needs of highly real-time applications thanks to its fast detection speed (Fig. 8).

Table 1. Test Results

| Model | Detection accuracy (day) | Detection accuracy (night) | FPS |
|---|---|---|---|
| DPM + SVM | 80.41% | 72.81% | 11.52 |
| YOLOv4 (ours) | 92.18% | 83.46% | 25.18 |
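The data preparation described in this section can be sketched as below. The paper states only the sampling interval and the final set sizes, so the 90% training fraction, the random shuffle with a fixed seed and the function names are assumptions made for illustration.

```python
import random


def sample_frames(num_frames, step=3):
    """Indices of the frames kept when taking one frame out of every `step`."""
    return list(range(0, num_frames, step))


def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle the items reproducibly and split them into training and verification sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


kept_frames = sample_frames(10)
train_set, val_set = split_dataset(range(12114))
```

A 90/10 split of 12114 images reproduces the set sizes reported above, 10903 for training and 1211 for verification.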
Fig. 8. Part of the detection effect
From the experiments on traditional visual algorithms and the deep learning algorithm based on YOLOv4, we can draw the following conclusions: the pedestrian detection algorithm based on deep learning is superior to the traditional visual algorithm in terms of detection accuracy, detection speed and detection robustness. Although deep learning algorithms perform well in all these aspects, they also have shortcomings, such as dependence on data sets and long training time. It is believed that in the future, target detection models based on deep learning will achieve higher accuracy and faster speed, and will play an increasingly important role in various detection tasks.
Acknowledgement. This work is supported by the Shandong Key Technology R&D Program 2019JZZY021005, Natural Science Foundation of Shandong ZR2020MF067, ZR2021MF074 and ZR2022MF296.
References
1. Sumit, S.S., Rambli, D., Mirjalili, S.: Vision-based human detection techniques: a descriptive review. IEEE Access 9, 42724–42761 (2021)
2. Cheng, E.J., Prasad, M., Yang, J., et al.: A fast fused part-based model with new deep feature for pedestrian detection and security monitoring. Measurement 151, 107081 (2020)
3. Khemmar, R., Delong, L., Decoux, B.: Real time pedestrian detection-based Faster HOG/DPM and deep learning approaches. Int. J. Comput. Appl. 176(42), 34–38 (2020)
4. Redmon, J., Divvala, S., Girshick, R., et al.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. IEEE, Las Vegas (2016)
5. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: IEEE Conference on Computer Vision & Pattern Recognition, pp. 6517–6525. IEEE, Hawaii (2017)
6. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
7. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
8. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE, Las Vegas (2016)
9. Rezatofighi, H., Tsoi, N., Gwak, J.Y., et al.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666. IEEE, Long Beach (2019)
10. Zheng, Z., Wang, P., Liu, W., et al.: Distance-IoU loss: faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12993–13000. AAAI, New York (2020)
11. Bodla, N., Singh, B., Chellappa, R., et al.: Soft-NMS--improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569. IEEE, Venice (2017)
12. Everingham, M., Eslami, S., Gool, L.V., et al.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015)
A Survey of the Effects of Electrical Stimulation on Pain in Patients with Knee Osteoarthritis Ruiyun Li, Qing Cao, and Mingxu Sun(B) University of Jinan, Jinan 250022, China [email protected]
Abstract. Objective: To determine whether electrical stimulation therapy is effective at reducing pain in people with knee osteoarthritis. Methods: The literature on the treatment of knee osteoarthritis pain with electrical stimulation was searched. The search records were independently screened by title and abstract, and the author, study design, study population, type of electrical stimulation, evaluation criteria, results and other information were extracted. Results: Ten randomized controlled trials involving 405 patients diagnosed with knee osteoarthritis found that both TENS and NMES had a positive analgesic effect on knee osteoarthritis pain, but further work is still needed to clarify the long-term effect of electrical stimulation on this pain. Conclusion: TENS treatment is more effective than NMES treatment in relieving joint pain in people who have KOA. In future studies, experimental analysis of TENS with the same parameters is needed to determine better methods of pain relief. Keywords: Electrical stimulation · knee osteoarthritis · TENS · NMES · pain
1 Introduction Knee osteoarthritis (KOA) is the most common arthritic disease. Its main symptoms are knee pain and knee stiffness, which restrict movement and, in severe cases, can even lead to functional disability [1]. Globally, the majority of patients are senior citizens [2]. The onset factors of osteoarthritis include damage to the articular cartilage inside the joint, which produces a variety of symptoms including joint pain, knee swelling, bone spur formation and reduced range of motion. These can progress to muscle weakness, reduced mobility and even disability, seriously affecting the normal life of the elderly [3]. The pain and loss of movement caused by knee osteoarthritis bring considerable inconvenience to people's lives, so how to better treat knee osteoarthritis has attracted much attention. In terms of treatment type, drug therapy, non-drug therapy, invasive intervention and physical therapy are all available [4], but their therapeutic effect is not satisfactory, so pain therapy is currently the most studied and most feasible treatment method [5]. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 240–247, 2024. https://doi.org/10.1007/978-3-031-50580-5_20
Electrical stimulation (ES) is a non-invasive treatment that delivers stimuli through electrodes placed on the skin surface to increase muscle strength and reduce joint stiffness and muscle spasm [6, 7], thereby relieving pain. It has been put into use in several fields, such as rehabilitation therapy, and has achieved remarkable results. Several types of ES are used to relieve knee osteoarthritis pain, including TENS, NMES, IFC, PES, NIN, etc. [8]. In practical applications, TENS and NMES are the most common, so this paper takes them as examples to analyze the effectiveness of electrical stimulation in treating pain in patients who have knee osteoarthritis. Transcutaneous electrical nerve stimulation (TENS) was proposed by Melzack and is based on the “Gate Control Theory”. Because of its pain-relieving function, it is often used to treat knee osteoarthritis pain [9–11]. A literature search showed that TENS had a positive effect on knee osteoarthritis when it was used in randomized controlled trials [8, 12, 13]. Neuromuscular electrical stimulation (NMES) works by stimulating nerves and/or muscles with electrodes to induce muscle contraction, which can increase muscle load and improve muscle strength through electrically induced contraction [14], thus relieving pain. Studies have shown that NMES can increase the strength of the quadriceps muscle, resulting in effective pain relief [15, 16]. Previous reviews provided strong evidence that both TENS and NMES have a role in pain relief. However, those reviews also have some limitations, and the specific efficacy of the two methods still needs to be discussed further [8, 17, 18]. Therefore, this systematic review of electrical stimulation therapy aims to analyze the role of TENS and NMES in reducing pain in patients with KOA. Based on a review of the available literature, we hypothesize that TENS is more useful than NMES in treating KOA pain.
2 Methods First, a literature search was conducted on the CNKI, PubMed, ScienceDirect, and Web of Science databases for records up to December 2022. The keywords searched were “electrical stimulation”, “transcutaneous electrical nerve stimulation”, “neuromuscular electrical stimulation”, “knee osteoarthritis” and “pain”. Next, the literature was screened by checking whether the title and abstract of each paper met the requirements. Finally, English-language randomized controlled trials (RCTs) published from 2004 to 2020 were selected. These included studies in which TENS or NMES was combined with other therapies and studies in which TENS or NMES was compared with other treatments. Experiments in which patients had undergone knee replacement surgery were excluded. Studies using other types of electrical stimulation and studies using implantable electrodes were also excluded.
3 Results A total of 10 randomized controlled trials (RCTs) were selected for systematic analysis [19–28], and their study methods are summarized in Table 1.
Table 1. Study methods.

| Author | Study design | Type of electrical stimulation | Outcome measures for pain | Participants / Groups | Treatment duration | Parameters | Conclusion |
|---|---|---|---|---|---|---|---|
| Arslan (2020) | RCT | NMES | VAS, WOMAC | 38 participants. Groups: NMES + physiotherapy: 21, physiotherapy: 17 | Five times a week for two weeks (ten sessions) | Frequency: 100 Hz; Pulse width: 60 ms; Duration: 20 min | In terms of pain, NMES provided no additional benefit to patients |
| Matsuse (2019) | RCT | NMES | KOOS | 20 participants. Groups: NMES + vc: 10, TENS: 10 | Twice a week for twelve weeks (twenty-four sessions) | Frequency: 40 Hz; Pulse width: 2.4–22.6 ms; Duration: 30 min; Burst duty cycle: 10%; Pulse duration: 200 µs | NMES is effective for obese women with knee pain |
| Rabe (2018) | RCT | NMES | KOOS | 35 participants. Groups: NMES + vc: 17, LLRT: 18 | Twice a week for twelve weeks (twenty-four sessions) | Frequency: 40 Hz; Pulse width: 2.4–22.6 ms; Duration: 20 min; Burst duty cycle: 10%; Pulse duration: 200 µs | NMES can effectively relieve pain in women with knee osteoarthritis |
| Palmieri-Smith (2010) | RCT | NMES | WOMAC | 30 participants. Groups: NMES: 16, Standard of care: 14 | Three times a week for four weeks (twelve sessions) | Frequency: 50 Hz; Intensity: vary | Women treated with NMES had less pain and were more able to function |
| Rosemffet (2004) | RCT | NMES | VAS, WOMAC | 26 participants. Groups: FES: 8, exercise program: 10, FES + exercise program: 8 | Three times a week for eight weeks (twenty-four sessions) | Frequency: 25 Hz; Intensity: 60–80 mA; Duration: 30 min | FES may be a useful treatment for patients who have KOA |
| Sajadi (2020) | RCT | TENS | VAS | 105 participants. Groups: TENS: 53, Leech therapy: 52 | Five times a week for three weeks (fifteen sessions) | Frequency: 40–150 Hz; Intensity: vary; Duration: 20 min | Patients receiving TENS had significant analgesic effects |
| Khan (2018) | RCT | TENS | VAS | 60 participants. Groups: TENS: 30, NSAID + ADL: 30 | Three times a week for six weeks (eighteen sessions) | Frequency: 80 Hz; Intensity: 10–30 mA; Duration: 20 min | TENS is more effective at relieving pain than medication |
| Kanako (2018) | RCT | TENS | VAS | 50 participants. Groups: TENS: 25, Sham-TENS: 25 | Not reported | Frequency: 1–250 Hz; Pulse width: 60 µs | TENS effectively reduces pain, particularly in preradiographic knee OA |
| Cherian (2016) | RCT | TENS | VAS | 36 participants. Groups: TENS: 18, physical therapy: 18 | Six weeks | Pulse duration: 48–400 µs | TENS provided more pain relief than standard treatments |
| Chen (2013) | RCT | TENS | VAS | 50 participants. Groups: TENS: 23, HA injection: 27 | Three times a week for four weeks (twelve sessions) | Pulse width: 200 µs; Intensity: vary; Duration: 20 min | TENS treatment provides more pain relief than HA injection |
3.1 Participants Studied In total, 405 patients with KOA participated in the ten studies. All patients enrolled in the randomized trials were recruited from a clinical setting and were voluntary participants. All patients included had imaging evidence of knee osteoarthritis (Kellgren and Lawrence grade > 1) or clinical evidence of knee osteoarthritis. None of the experiments imposed specific gender ratios or age requirements.

3.2 Treatment Approaches Table 1 summarizes the types of electrical stimulation, the parameter settings and the experiment time used in the ten randomized trials. Only one experiment did not report the experiment time, and all the experiments used different parameters. The equipment model and manufacturer, the requirements for the placement of the equipment, and the values set for the same parameter all differ between experiments.

3.3 Outcome Measures In the ten randomized trials, the pain severity of knee OA was measured mostly by VAS and WOMAC, with only two trials using KOOS as a measure.

VAS (visual analogue scale) is a scale that expresses the patient's degree of pain as a number. Using a 10 cm straight line or ruler, the patient is asked to select a number between 0 and 10 to indicate the pain level, with 0 being “no pain” and 10 being “as much pain as possible”.

WOMAC is often used to measure knee pain, stiffness, and physical function in patients with knee OA. It includes 3 subscales and 24 items. Patients are asked to rate the items in relation to their condition, with higher scores indicating more severe symptoms, greater limitation, and poorer health [29, 30].

KOOS (Knee Injury and Osteoarthritis Outcome Score) is a self-administered questionnaire used to evaluate the therapeutic effect in knee injury and osteoarthropathy. It covers pain, symptoms, activities of daily living, sports and recreation ability, and knee-related quality of life. Patients score each item based on their own condition. Each subscale is normalized to a score from 0 to 100, with higher scores indicating fewer knee problems.
4 Discussion Ten studies investigated the effect of electrical stimulation (TENS and NMES) on pain in people with KOA. The analysis of the literature shows that electrical stimulation has a positive effect on pain, and that the effect of TENS on pain is significant. TENS is a common analgesic method. In the randomized trials of TENS studied, TENS was used to treat painful conditions in KOA. The data provided by the five selected randomized trials indicate that TENS was significantly effective in treating pain, whether compared with other forms of electrical stimulation, conventional therapy or medication. TENS, whether delivered through electrode patches or portable wearables, was consistently effective in treating pain in patients who have KOA. NMES is a type of physical therapy commonly used to treat knee osteoarthritis. In the randomized trials of NMES studied, NMES was used to treat pain and restore motion in knee osteoarthritis. Evidence from five randomized trials indicated that NMES alone or in combination with other treatments can reduce the pain level in patients with KOA. However, some experiments have shown that the therapeutic effect of NMES on pain is less pronounced than that of TENS, so long-term experiments should be carried out for observation. TENS, which uses electrical stimulation to relieve pain, has developed into many types since it was first proposed by Melzack and has been widely applied in the field of pain management, including knee pain management [31].
When TENS is in use, the electrodes stimulate the pain-sensing nerve tissue and keep it in a state of continuous excitation, which raises the pain threshold through fatigue (the pain nerves become desensitized). At the same time, TENS stimulation of the nerve can also promote the dilation of blood vessels near the knee joint and accelerate blood circulation, so it has a good analgesic effect that can be seen in a short time. The physical characteristics of NMES, such as current frequency, waveform and wave width, are more suitable for stimulating muscle contraction than for analgesia. Muscle contractions of different strengths can be used to improve muscle strength, muscle facilitation, joint motion and other functions, and can also indirectly activate nerve tissue, which can improve neuromuscular efficiency to some extent. Therefore, NMES is more effective than TENS in the recovery of knee motor ability, whereas TENS works better than NMES for pain relief.
One limitation of this paper is that all included articles may have based their treatment protocols on previously published, similar treatment guidelines, which means that other studies or treatment approaches may have been overlooked. Another limitation is the inconsistency of the electrical stimulation parameters used and the lack of uniformity in the published literature. In the ten experiments studied, the stimulation frequency ranges from 25 to 250 Hz, with several trials using 40–80 Hz, and pulse widths vary from 2 to 60 ms. Pulse duration, duty cycle and experiment time are not uniform either, so it is not possible to group the experiments by equipment parameter settings or to analyze them quantitatively. Therefore, the analysis of electrical stimulation in this paper is qualitative. It is hoped that future studies will use more consistent parameter settings, such as frequency, so that experimental results can be analyzed better and treatment methods determined more reliably. At the same time, future experimental studies should use longer follow-up to better evaluate the treatment effect and safety of electrical stimulation.
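The reporting gap described above can be made concrete with a small sketch. It records only which parameter names each trial reports in Table 1 (the values themselves differ between trials) and checks whether any parameter is reported by all ten trials. The dictionary layout and the function names `coverage` and `shared_by_all` are illustrative assumptions, not part of the reviewed studies.

```python
# Parameter names reported by each trial, transcribed from Table 1.
# Illustrative layout only: the values of the parameters are not used here.
REPORTED = {
    "Arslan (2020)": {"frequency", "pulse width", "duration"},
    "Matsuse (2019)": {"frequency", "pulse width", "duration",
                       "burst duty cycle", "pulse duration"},
    "Rabe (2018)": {"frequency", "pulse width", "duration",
                    "burst duty cycle", "pulse duration"},
    "Palmieri-Smith (2010)": {"frequency", "intensity"},
    "Rosemffet (2004)": {"frequency", "intensity", "duration"},
    "Sajadi (2020)": {"frequency", "intensity", "duration"},
    "Khan (2018)": {"frequency", "intensity", "duration"},
    "Kanako (2018)": {"frequency", "pulse width"},
    "Cherian (2016)": {"pulse duration"},
    "Chen (2013)": {"pulse width", "intensity", "duration"},
}


def coverage(reported):
    """Count how many trials report each parameter name."""
    counts = {}
    for params in reported.values():
        for name in params:
            counts[name] = counts.get(name, 0) + 1
    return counts


def shared_by_all(reported):
    """Return the parameter names that every trial reports."""
    return set.intersection(*reported.values())


if __name__ == "__main__":
    for name, count in sorted(coverage(REPORTED).items()):
        print(f"{name}: reported by {count} of {len(REPORTED)} trials")
    print("reported by all trials:", shared_by_all(REPORTED) or "none")
```

Because no single parameter is reported by every trial, the trials cannot be grouped on a common setting, which is consistent with the qualitative analysis chosen in this paper.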
5 Conclusions In summary, we analyzed the therapeutic effects of different types of electrical stimulation on knee osteoarthritis pain. Studies have shown that both TENS and NMES have positive analgesic effects on knee osteoarthritis, with TENS providing the better analgesic effect. Future studies should include more experimental analysis of TENS with the same parameters to determine better methods of pain relief.
References
1. Richmond, J., Hunter, D., Irrgang, J., et al.: Treatment of osteoarthritis of the knee (nonarthroplasty). J. Am. Acad. Orthop. Surg. 17, 591–600 (2009)
2. Woolf, A.D., Pfleger, B.: Burden of major musculoskeletal conditions. Bull. World Health Organ. 81, 646–656 (2003)
3. Durmuş, D., Alaylı, G., et al.: Effects of quadriceps electrical stimulation program on clinical parameters in the patients with knee osteoarthritis. Clin. Rheumatol. 26, 674–678 (2007)
4. Tok, F., Aydemir, K., et al.: The effects of electrical stimulation combined with continuous passive motion versus isometric exercise on symptoms, functional capacity, quality of life and balance in knee osteoarthritis: randomized clinical trial. Rheumatol. Int. 31, 177–181 (2011)
5. Hochberg, M.C., et al.: Guidelines for the medical management of osteoarthritis. Arthritis Rheum. 38, 1535–1540 (1995)
6. Babault, N., Cometti, C., Maffiuletti, N.A., Deley, G.: Does electrical stimulation enhance post-exercise performance recovery? Eur. J. Appl. Physiol. 111, 2501–2507 (2011)
7. Liberson, W.: Electrotherapy. In: Ruskin, A.P. (ed.) Current Therapy in Physiatry, pp. 161–191. Saunders, Philadelphia (1984)
8. Zeng, C., Li, H., Yang, T., et al.: Electrical stimulation for pain relief in knee osteoarthritis: systematic review and network meta-analysis. Osteoarthritis Cartilage 23, 189–202 (2015)
9. Melzack, R., Wall, P.: Pain mechanisms: a new theory. Science 150, 971–977 (1965)
10. Stemberger, R., Kerschan-Schindl, K.: Osteoarthritis: physical medicine and rehabilitation - nonpharmacological management. Wien. Med. Wochenschr. 163, 228–235 (2013)
11. Bannuru, R.R., Natov, N.S., Dasi, U.R., Schmid, C.H., McAlindon, T.E.: Therapeutic trajectory following intra-articular hyaluronic acid injection in knee osteoarthritis - meta-analysis. Osteoarthr. Cartil. 19, 611–619 (2011)
12. Atamaz, F.C., Durmaz, B., Baydar, M., et al.: Comparison of the efficacy of transcutaneous electrical nerve stimulation, interferential currents, and shortwave diathermy in knee osteoarthritis: a double-blind, randomized, controlled, multicenter study. Arch. Phys. Med. Rehabil. 93, 748–756 (2012)
13. Palmer, S., Domaille, M., Cramp, F., et al.: Transcutaneous electrical nerve stimulation as an adjunct to education and exercise for knee osteoarthritis: a randomized controlled trial. Arthritis Care Res. (Hoboken) 66, 387–394 (2014)
14. Monaghan, B., Caulfield, B., O’Mathuna, D.P.: Surface neuromuscular electrical stimulation for quadriceps strengthening pre and post total knee replacement. Cochrane Database Syst. Rev. (1), CD007177 (2010)
15. Talbot, L.A., Gaines, J.M., Ling, S.M., Metter, E.J.: A home-based protocol of electrical muscle stimulation for quadriceps muscle strength in older adults with osteoarthritis of the knee. J. Rheumatol. 30, 1571–1578 (2003)
16. Gaines, J.M., Metter, E.J., Talbot, L.A.: The effect of neuromuscular electrical stimulation on arthritis knee pain in older adults with osteoarthritis of the knee. Appl. Nurs. Res. 17, 201–206 (2004)
17. Giggins, O.M., Fullen, B.M., Coughlan, G.F.: Neuromuscular electrical stimulation in the treatment of knee osteoarthritis: a systematic review and meta-analysis. Clin. Rehabil. 26(10) (2012)
18. Lee, H., Clark, A., Draper, D.O.: Effects of transcutaneous electrical nerve stimulation on pain control in patients with knee osteoarthritis: a systematic review. Anesth. Pain Res. 4(3), 1–4 (2020)
19. Arslan, S.A., Demirgüç, A., Kocaman, A.A., Keskin, E.D.: The effect of short-term neuromuscular electrical stimulation on pain, physical performance, kinesiophobia, and quality of life in patients with knee osteoarthritis. Physiother. Quart. 28(2), 31–37 (2020)
20. Matsuse, H., Segal, N.A., Rabe, K.G., Shiba, N.: The effect of neuromuscular electrical stimulation during walking on muscle strength and knee pain in obese women with knee pain: a randomized controlled trial. Am. J. Phys. Med. Rehabil. 99(1) (2020)
21. Rabe, K.G., Matsuse, H., Jackson, A., Segal, N.A.: Evaluation of the combined application of neuromuscular electrical stimulation and volitional contractions on thigh muscle strength, knee pain, and physical performance in women at risk for knee osteoarthritis: a randomized controlled trial. PM&R 10(12) (2018)
22. Palmieri-Smith, R.M., Thomas, A.C., Karvonen-Gutierrez, C., Sowers, M.F.: A clinical trial of neuromuscular electrical stimulation in improving quadriceps muscle strength and activation among women with mild and moderate osteoarthritis. Phys. Ther. 90, 1441–1452 (2010)
23. Rosemffet, M.G., Schneeberger, E.E., Citera, G., et al.: Effects of functional electrostimulation on pain, muscular strength, and functional capacity in patients with osteoarthritis of the knee. J. Clin. Rheumatol. 10, 246–249 (2004)
24. Sajadi, S., et al.: Randomized clinical trial comparing of transcranial direct current stimulation (tDCS) and transcutaneous electrical nerve stimulation (TENS) in knee osteoarthritis. Neurophysiologie Clinique/Clinical Neurophysiology (2020). https://doi.org/10.1016/j.neucli.2020.08.005
25. Khan, M.H., et al.: Role of transcutaneous electrical nerve stimulation (TENS) in management of pain in osteoarthritis (OA) of knee. J. Dhaka Med. College 27(1) (2018)
26. Shimoura, K., Iijima, H., Suzuki, Y., Aoyama, T.: Immediate effects of transcutaneous electrical nerve stimulation on pain and physical performance in individuals with pre-radiographic knee osteoarthritis: a randomized controlled trial. Arch. Phys. Med. Rehabil. 100(2) (2018)
27. Chen, W.L., Hsu, W.C., Lin, Y.J., et al.: Comparison of intra-articular hyaluronic acid injections with transcutaneous electric nerve stimulation for the management of knee osteoarthritis: a randomized controlled trial. Arch. Phys. Med. Rehabil. 94, 1482–1489 (2013)
28. Cherian, J.J.: Knee osteoarthritis: does transcutaneous electrical nerve stimulation work? Orthopedics 39(1) (2016)
29. Basaran, S., et al.: Validity, reliability, and comparison of the WOMAC osteoarthritis index and Lequesne algofunctional index in Turkish patients with hip or knee osteoarthritis. Clin. Rheumatol. 29(7), 749–756 (2010)
30. Tuzun, E.H., et al.: Acceptability, reliability, validity and responsiveness of the Turkish version of WOMAC osteoarthritis index. Osteoarthr. Cartil. 13(1), 28–33 (2005)
31. Zhang, W., Moskowitz, R.W., Nuki, G., et al.: OARSI recommendations for the management of hip and knee osteoarthritis, Part II: OARSI evidence-based, expert consensus guidelines. Osteoarthr. Cartil. 16, 137–162 (2008)
The Design of Rehabilitation Glove System Based on sEMG Signals Control Qing Cao1 , Mingxu Sun1 , Ruiyun Li1 , and Yan Yan2(B) 1 University of Jinan, Jinan 250022, China 2 Shandong Guohe Industrial Technology Institute Co., Ltd., Jinan 250098, China
[email protected]
Abstract. Stroke is a sudden disorder in which blood circulation to the brain is impaired, resulting in varying degrees of impairment of the sensory and motor functions of the hand. Rehabilitation gloves are devices that assist in the rehabilitation of the hand. sEMG (surface electromyography) is a bioelectrical signal generated by muscle contraction. It is rich in physiological motor information and reflects the person’s motor intention, which makes sEMG an ideal signal source for a rehabilitation glove system. This paper describes the design of a rehabilitation glove system controlled by sEMG signals. The system controls the movements of the rehabilitation glove by collecting and analyzing sEMG signals in order to deliver rehabilitation training. It comprises a rehabilitation glove system and a host computer. The rehabilitation glove system drives the glove through the rehabilitation movements, performs rehabilitation training for patients and collects sEMG signals. The host computer receives the signals and classifies gestures with a CNN (Convolutional Neural Network) to recognize the movement intention. Keywords: Surface Electromyography (sEMG) · Convolutional Neural Network (CNN) · Pneumatic rehabilitation gloves · Hand rehabilitation
1 Introduction With over 10 million new strokes worldwide each year [1], stroke remains the main cause of death and disability in adults [2]. Stroke is a sudden disorder that impairs the sensory and motor functions of the hand to varying degrees [3]. The clinical manifestations are usually involuntary muscle contraction caused by finger flexor spasm, decreased muscle strength and abnormal muscle tension [4]. Decreased flexibility of the fingers, numbness of the limbs, and a reduced range of thumb movement are also common disabling symptoms after stroke [5]. About 30% of patients will have spasms in the first few days or weeks [6]. The hand has a wide distribution of joint nerves and a large number of blood vessels, so effective nerve stimulation of the hand can promote the rehabilitation of the upper extremity. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 248–256, 2024. https://doi.org/10.1007/978-3-031-50580-5_21
The recognition and classification of bioelectrical signals, including magnetoencephalography (MEG), electroencephalogram (EEG), electrocorticogram (ECoG) and electromyography (EMG), play an important role in a wide range of applications [7]. Nowadays, bioelectrical signal-based recognition technologies have gained widespread attention in various fields such as virtual reality, motion sensing games and medical rehabilitation training [8]. In this paper, EMG is selected as the bioelectrical signal for motion recognition. EMG is a bioelectrical signal generated by muscle action and includes needle electromyography (nEMG) and surface electromyography (sEMG). It contains rich physiological motor information [9]. Compared to nEMG, sEMG has the advantages of being non-invasive and easy to operate. The sEMG-controlled rehabilitation gloves and real human hands belong to the same family of control [10]. By analyzing the sEMG signals, the movement patterns of the hand can be recognized [11]. sEMG is able to sense and decode human muscular activity directly [12], so sEMG signals are an ideal signal source for human-machine interaction systems [13]. For patients, sEMG signals can be collected from either the unaffected hand or the affected hand. Most stroke patients have functional impairment in one arm, and the sEMG signals from the affected hand are very weak or nearly absent [14], so the sEMG signals of the unaffected hand are used as control signals [15]. The movements of the unaffected hand are copied to the affected hand for bimanual training. Bimanual training is a rehabilitation strategy based on natural coordination between limbs [16]. Performing bimanual training on patients improves the efficiency of movements of the affected hand [17]. Machine-assisted hand rehabilitation has been shown to reduce injuries and improve motor function [18].
During early rehabilitation and spontaneous recovery of the patient’s limbs, the patient’s symptoms can be greatly improved by driving the finger joints, thereby assisting the motor rehabilitation of the hand, and by implementing continuous passive movements to compensate for the lack of active movements [19]. At present, the popular hand rehabilitation equipment can be divided into rigid mechanical exoskeleton rehabilitation equipment [20] and flexible pneumatic rehabilitation gloves. A major limitation of rigid mechanical hand exoskeleton rehabilitation robots is that most are based on rigid linkages and are bulky and heavy overall. This makes them unsuitable for patients to wear during activities of daily living [21]. Compared with rigid mechanical exoskeleton rehabilitation robots, flexible pneumatic rehabilitation gloves have several advantages, including good flexibility, small size, portability, safety and reliability [22]. For these reasons, this paper uses a flexible pneumatic rehabilitation glove together with a CNN classification method. The sEMG signals from the patient’s unaffected hand are classified by the CNN to identify hand gestures, which are then used to control the movement of the rehabilitation glove and help the patient’s affected hand recover its function.
2 Design of Rehabilitation Gloves System The pneumatic rehabilitation glove system consists of pneumatic gloves, an air pump, an STM32 microprocessor, an sEMG sensor, a solenoid valve, a gas pressure sensor and a host computer. The block diagram of the rehabilitation glove system is shown in Fig. 1.
Fig. 1. The rehabilitation glove system block diagram.
The rehabilitation training layer includes the pneumatic rehabilitation glove, the air pump, the solenoid valve and the gas pressure sensor. The pneumatic rehabilitation glove fits closely around the patient’s fingers and the back of the hand. The glove is powered by the air pump, which can inflate and apply suction to produce finger flexion and dorsiflexion. The solenoid valve selects which fingers move, so that different rehabilitation exercises can be completed. The collecting layer uses the sEMG sensor to collect the sEMG signals from the patient’s unaffected hand. The host computer processes the sEMG data, and its software is developed in the PyCharm development environment. It performs gesture classification and recognition by analyzing the sEMG signal data, and the classification results are sent to the microcontroller. The workflow is as follows. First, the unaffected hand makes a fist-clenching action, and the sEMG signals are collected through the sEMG sensor. The signals are then sent to the host computer, which identifies and classifies them to obtain the corresponding gesture. After that, the resulting action command is sent down to the rehabilitation glove, where the microcontroller controls the air pump and thereby drives the pneumatic rehabilitation glove. The rehabilitation glove drives the affected hand to perform the rehabilitation movements and complete the rehabilitation training, achieving the purpose of bimanual training.
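The workflow above (collect, classify, send command, drive the glove) can be sketched as a single processing step. This is a minimal illustration under stated assumptions, not the authors’ implementation: the names `Gesture`, `command_for`, `run_cycle` and `threshold_classifier` are hypothetical, the command format is invented for the example, and it is assumed that inflation flexes the fingers while suction extends them. The stand-in classifier only takes the place of the trained CNN.

```python
from enum import Enum


class Gesture(Enum):
    """The two classes used in the preliminary experiment."""
    REST = 0
    FIST = 1


def command_for(gesture, fingers=5):
    """Map a classified gesture to a hypothetical actuator command.

    Assumption: inflating the glove flexes the fingers, suction extends them.
    """
    pump_mode = "inflate" if gesture is Gesture.FIST else "suction"
    # All finger valves are opened here; a multi-gesture system would
    # open only the valves of the fingers that should move.
    return {"pump": pump_mode, "valves_open": [True] * fingers}


def run_cycle(semg_window, classify):
    """One pass of the loop: classify a window of sEMG data, build the command."""
    gesture = classify(semg_window)
    return command_for(gesture)


def threshold_classifier(window, level=0.5):
    """Stand-in for the trained CNN: compares the mean absolute amplitude."""
    mean_abs = sum(abs(x) for x in window) / len(window)
    return Gesture.FIST if mean_abs > level else Gesture.REST


if __name__ == "__main__":
    print(run_cycle([0.9, -1.1, 1.0], threshold_classifier))
    print(run_cycle([0.01, -0.02, 0.0], threshold_classifier))
```

Passing the classifier in as a function keeps the control step independent of the model, so the placeholder can be swapped for the saved CNN without changing the command logic.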
3 Hardware Design of Rehabilitation Glove
Fig. 2. The hardware design of the rehabilitation glove.
The block diagram of the hardware design of the rehabilitation glove is shown in Fig. 2. The rehabilitation glove controller uses an STM32 microprocessor as the main controller of the system. A 12 V power input supplies the whole system, and an internal power conversion circuit converts the 12 V supply into the 3.3 V required by the chip. The air pump interface and the solenoid valve interface are used to connect the external air pump and solenoid valve. The STM32 microprocessor switches the solenoid valve and the air pump through the driver circuit. Controlling the air pump to inflate or apply suction produces finger flexion and dorsiflexion, and controlling different switch combinations of the solenoid valve produces different rehabilitation gestures. The USB interface and the WIFI module connect the controller to the host computer software. The difference is that the USB interface uses a data cable, whereas the WIFI module communicates wirelessly. The gas pressure sensor interface is used to connect the gas pressure sensor, which is built into the air circuit of the glove to monitor the gas pressure in real time and prevent injuries caused by excessive air pressure during inflation and deflation. When the gas pressure exceeds a predetermined value, an alarm is triggered and the solenoid valve opens immediately to release the gas in the air circuit and avoid injury. The dual-channel sEMG sensor is used to collect the original sEMG signals from the patient’s unaffected hand. In the sensor, the sEMG signals are digitally filtered, amplified and rectified, and then transmitted to the host computer. The sensor collects the patient’s sEMG signals at a sampling frequency of 500 Hz. Each sEMG sensor channel has two detection electrodes (yellow and green) and one reference electrode (red).
The sEMG sensor is shown in Fig. 3.
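The over-pressure protection described above amounts to a threshold check on each pressure reading. The sketch below illustrates that logic only. The function name `pressure_guard`, the kPa unit, the example limit and the returned dictionary keys are all assumptions made for illustration, and the controller firmware itself would normally run on the STM32 rather than in Python.

```python
def pressure_guard(pressure_kpa, limit_kpa):
    """Decide the safety actions for one pressure reading.

    Hypothetical interface: the paper states only that an alarm is raised
    and the solenoid valve opens when the pressure exceeds a preset value.
    """
    over_limit = pressure_kpa > limit_kpa
    return {
        "alarm": over_limit,               # trigger the alarm
        "release_valve_open": over_limit,  # vent the air circuit
    }


def monitor(readings_kpa, limit_kpa):
    """Return the index of the first reading that trips the guard, or None."""
    for index, reading in enumerate(readings_kpa):
        if pressure_guard(reading, limit_kpa)["alarm"]:
            return index
    return None


if __name__ == "__main__":
    # Example values are made up; the real limit is not given in the paper.
    print(monitor([20.0, 35.0, 52.0, 30.0], limit_kpa=50.0))
```

Keeping the decision in a pure function makes the threshold behaviour easy to check before it is ported to the controller.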
Fig. 3. sEMG sensor
Subject: A healthy young person was selected to participate in this experiment. The subject was free from upper limb muscle diseases and skeletal disorders. Because the sEMG signal is unstable and susceptible to interference on the skin surface, the skin over the arm muscles was cleaned and disinfected with alcohol before the sEMG signals were collected. Electrode sheet location: electrode sheets are used to collect the sEMG signals. To collect the sEMG signals effectively, the detection electrode sheets are generally placed where the muscle contraction is strong. Here they were placed on the muscle of the subject’s forearm, as shown in Fig. 4 (a). The reference electrode sheet is usually placed at a location with little muscle. In this experiment, the reference electrode sheet was placed at the elbow, as shown in Fig. 4 (b).
Fig. 4. Electrode sheet location
The sEMG signals of the subject’s hand in the calm state and in the clenched fist state were collected as the test data. The signals were collected with the palm of one of the subject’s hands facing upward and resting flat on the table. Each movement lasted 2 s. After finishing a gesture, the subject rested for 4 s, and once the hand had returned to the resting state the gesture was repeated. Thirty sets of sEMG signals were collected in the calm state and thirty sets in the clenched fist state, and these were used as the original data set. One set of calm-state and clenched-fist-state data is shown in Fig. 5.
Fig. 5. The signals collected in the calm and clenched fist states.
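The acquisition protocol fixes the size of each trial: at 500 Hz, a 2 s movement is 1000 samples and the 4 s rest is 2000 samples. The sketch below cuts a recording into gesture windows under the assumption that the recording is continuous and starts exactly at the onset of the first gesture; the paper does not say how trials were segmented, and `split_trials` is a hypothetical helper.

```python
FS_HZ = 500      # sampling frequency stated in the text
GESTURE_S = 2    # duration of each movement
REST_S = 4       # rest after each movement


def split_trials(samples, n_trials):
    """Cut a continuous recording into gesture windows, skipping the rests.

    Assumes the recording starts at the onset of the first gesture and
    that gesture and rest periods alternate with the stated durations.
    """
    window = FS_HZ * GESTURE_S              # 1000 samples per gesture
    period = FS_HZ * (GESTURE_S + REST_S)   # 3000 samples per cycle
    trials = []
    for k in range(n_trials):
        start = k * period
        segment = samples[start:start + window]
        if len(segment) < window:
            break  # recording ended before this trial was complete
        trials.append(segment)
    return trials


if __name__ == "__main__":
    recording = list(range(30 * FS_HZ * (GESTURE_S + REST_S)))
    trials = split_trials(recording, 30)
    print(len(trials), len(trials[0]))
```

If gesture onsets were instead marked by a trigger signal, the fixed `period` would be replaced by the recorded onset indices.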
4 Design of sEMG Classification Recognition
A training model structure is built for real-time recognition of sEMG signals, and the parameters of the model are saved. Following [23], we use a classical CNN-based signal classification method to classify the sEMG signals [24]. Feature extraction is performed on the original data set of each gesture; 80% of the extracted data is randomly selected as the training set and the remaining 20% as the test set, with no overlap between the two. The model is first trained on the training set, and its accuracy is then evaluated on the test set. If the expected accuracy is not achieved, training continues; once it is achieved, training stops and the model parameters are saved. The flow chart for saving the model is shown in Fig. 6. To verify the feasibility of the system, a preliminary binary classification network model was constructed. The model will be optimized further to recognize more hand gestures with higher accuracy. The flow chart of real-time sEMG signal recognition is shown in Fig. 7. After initialization, the host computer receives the sEMG data transmitted in real time from the sEMG sensor and performs feature extraction on it. The pre-saved model classifies the collected sEMG signals, and the classification results are sent to the controller of the rehabilitation glove. The controller outputs the corresponding switch control signals to make the glove complete the action. The confusion matrix of this network on the data set is shown in Fig. 8; both gestures are classified correctly. According to the confusion matrix, this classification method gave 100% validation accuracy for the two gestures. This lays a good foundation for future extension to multi-gesture classification.
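The random 80/20 split with no overlap between training and test sets can be illustrated as follows. This is a minimal sketch using only the Python standard library; the function name `split_dataset`, the fixed seed and the `(label, index)` sample representation are assumptions made for the example, not details from the paper.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Randomly split samples into disjoint training and test sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


if __name__ == "__main__":
    # 30 calm and 30 clenched-fist trials, as in the data set described above.
    data = [("calm", i) for i in range(30)] + [("fist", i) for i in range(30)]
    train, test = split_dataset(data)
    print(len(train), len(test))
```

Because every index is used exactly once, no sample can appear in both sets. A per-class (stratified) split would additionally keep the calm and fist proportions equal in both sets.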
Fig. 6. The flow chart of save model parameters
Fig. 7. The flow chart of real-time sEMG signals recognition
Fig. 8. The confusion matrix for this network for data set
5 Conclusions
In this paper, we collect sEMG signals from patients with an sEMG sensor and use a CNN classification method to recognize hand gestures and control the movement of a rehabilitation glove, helping patients recover hand function. The classifier maintains high accuracy when recognizing hand movements in real time. Experiments on gesture motion control of the rehabilitation glove were designed, and motion control based on the recognition of sEMG signals was verified. The results illustrate the feasibility of the CNN classification method for controlling rehabilitation gloves and lay a foundation for further research on intelligent rehabilitation gloves.
References
1. Pandian, J.D., Gall, S.L., Kate, M.P., et al.: Prevention of stroke: a global perspective. Lancet 392(10154), 1269–1278 (2018)
2. Ferrarin, M., Palazzo, F., Riener, R., Quintern, J.: Model-based control of FES-induced single joint movements. IEEE Trans. Neural Syst. Rehabil. Eng. 9(3), 245–257 (2001)
3. Johnston, S.C., Mendis, S., Mathers, C.D.: Global variation in stroke burden and mortality: estimates from monitoring, surveillance, and modelling. Lancet Neurol. 8(4), 345–354 (2009)
4. Marciniak, C.: Poststroke hypertonicity: upper limb assessment and treatment. Top. Stroke Rehabil. 18(3), 179–194 (2011)
5. Dietz, V., Sinkjaer, T.: Spastic movement disorder: impaired reflex function and altered muscle mechanics. Lancet Neurol. 6(8), 725–733 (2007)
6. Mayer, N.H., Esquenazi, A.: Muscle overactivity and movement dysfunction in the upper motoneuron syndrome. Phys. Med. Rehabil. Clinics 14(4), 855–883 (2003)
7. Feng, Y., et al.: Active triggering control of pneumatic rehabilitation gloves based on surface electromyography sensors. PeerJ Comput. Sci. 7, e448 (2021)
8. Wang, Y., et al.: Deep back propagation–long short-term memory network based upper-limb sEMG signal classification for automated rehabilitation. Biocybern. Biomed. Eng. 40(3), 987–1001 (2020)
9. Qi, J., et al.: Intelligent human-computer interaction based on surface EMG gesture recognition. IEEE Access 7, 61378–61387 (2019)
10. Ahanat, K., Juan, A.C.R., Veronique, P.: Tactile sensing in dexterous robot hands-Review. Rob. Auton. Syst. 74, 195–220 (2015)
11. Li, G., et al.: A novel feature extraction method for machine learning based on surface electromyography from healthy brain. Neural Comput. Appl. 31(12), 9013–9022 (2019)
12. Fang, Y., et al.: A multichannel surface EMG system for hand motion recognition. Inter. J. Humanoid Robot. 12(02), 1550011 (2015)
13. Kamavuako, E.N., et al.: Estimation of grasping force from features of intramuscular EMG signals with mirrored bilateral training. Annals Biomed. Eng. 40(3), 648–656 (2012)
14. Zou, Z., et al.: Analysis of EEG and sEMG during upper limb movement between Hemiplegic and normal people. In: 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), vol. 4. IEEE (2021)
15. Leonardis, D., et al.: An EMG-controlled robotic hand exoskeleton for bilateral rehabilitation. IEEE Trans. Haptics 8(2), 140–151 (2015)
16. Luft, A.R., et al.: Repetitive bilateral arm training and motor cortex activation in chronic stroke: a randomized controlled trial. JAMA 292(15), 1853–1861 (2004)
17. Waller, S.M., et al.: Temporal coordination of the arms during bilateral simultaneous and sequential movements in patients with chronic hemiparesis. Experim. Brain Res. 168(3), 450–454 (2006)
18. Frisoli, A., et al.: Positive effects of robotic exoskeleton training of upper limb reaching movements after stroke. J. Neuroeng. Rehabil. 9(1), 1–16 (2012)
19. Boake, C., et al.: Constraint-induced movement therapy during early stroke rehabilitation. Neurorehabil. Neural Repair 21(1), 14–24 (2007)
20. Agarwal, P., et al.: An index finger exoskeleton with series elastic actuation for rehabilitation: Design, control and performance characterization. Inter. J. Robot. Res. 34(14), 1747–1772 (2015)
21. Laschi, C., Cianchetti, M.: Soft robotics: new perspectives for robot bodyware and control. Front. Bioeng. Biotechnol. 2, 3 (2014)
22. Heung, K.H.L., et al.: Robotic glove with soft-elastic composite actuators for assisting activities of daily living. Soft Robotics 6(2), 289–304 (2019)
23. Atzori, M., Cognolato, M., Müller, H.: Deep learning with convolutional neural networks applied to electromyography data: A resource for the classification of movements for prosthetic hands. Front. Neurorobot. 10, 9 (2016)
24. Bakircioğlu, K., Özkurt, N.: Classification of EMG signals using convolution neural network. Inter. J. Appli. Mathem. Electron. Comput. 8(4), 115–119 (2020)
Gaussian Mass Function Based Multiple Model Fusion for Apple Classification
Shuhui Bi, Lisha Chen, Xue Li, Xinhua Qu, and Liyao Ma(B)
School of Electrical Engineering, University of Jinan, Jinan 250022, Shandong, China
cse [email protected]
Abstract. Near-infrared spectra can be used to predict the internal quality of apples, such as soluble solids content (SSC) and acidity, non-destructively. Doing so requires a prediction model, and pre-processing methods should be adopted to improve the predictive accuracy. In this paper, the SSC of apples is taken as a representative index, and Probabilistic Neural Network (PNN) and Extreme Learning Machine (ELM) models are established. After Multiple Scattering Correction (MSC), which reduces baseline drift, the classification accuracies of the two models are 81.8182% and 77.2727%, respectively. To avoid the limitation of a single classification model and to deal with the uncertainty introduced by hard partition of the instance space, an evidence-theory-based multiple model fusion is proposed, with particular attention to mass function generation. A Gaussian mass function is proposed so that the PNN and ELM models can be fused by combining their mass functions with Dempster’s combination rule of evidence theory. The experimental results show that the accuracy of the fusion model is 86.3636%, which demonstrates that the Gaussian mass function is suitable for multi-model fusion in apple classification.
Keywords: Apple classification · Gaussian mass function · multi-model fusion
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 257–269, 2024. https://doi.org/10.1007/978-3-031-50580-5_22
1 Introduction
Agricultural products and food, which are closely related to people’s daily life, are in great demand and variety, which brings a huge workload to quality inspection. Near-infrared (NIR) spectroscopy is fast, pollution-free, low-cost and non-destructive, and has attracted more and more attention in the field of agricultural product and food quality detection [1]. Hu Jing et al. summarized recent research progress at home and abroad on NIR spectroscopy for detecting kiwifruit hardness, soluble solids content (SSC), acidity, damage and microbes, and discussed the prospects for research and application of NIR analysis in kiwifruit quality detection [2]. Cao Niannian et al. took multiple batches of yellow peach chips as the analysis object, collected raw short-wave and long-wave NIR spectra, and after pre-processing established full-band prediction models using linear partial least squares and a nonlinear support vector machine [3]. Yu Zhihai et al. established a fast, non-destructive dynamic moisture detection model for red jujube in southern Xinjiang during processing [4]. Zhu Jinyan et al. combined NIR spectroscopy with the extreme learning machine (ELM) to establish a quantitative detection model for the storage quality of blueberries, realizing rapid non-destructive detection of the SSC, vitamin C and anthocyanin contents of blueberry fruit [5]. Jie Deng Fei et al. took Qilin watermelon as the research object, used NIR diffuse transmission spectroscopy to detect its SSC, and studied the influence of variable screening methods on the accuracy of the watermelon sugar degree prediction model [6]. However, NIR technology is an indirect technique, so a prediction model must be established. A single prediction model has poor applicability, and hard segmentation according to classification indexes introduces classification uncertainty through modeling errors. Integrating the advantages of several models on top of the current spectral data pre-processing, so as to predict more accurately and improve the reliability of classification models, has therefore become one of the problems that research must solve. The mass function describes the degree to which the evidence trusts a proposition, and evidence theory provides an effective method for expressing and synthesizing uncertain information. When evidence theory is used for fusion, the mass function and the combination rule play a very important role, but their application in fruit classification is rare [7].
Xu Xuefang et al. used the core mass function (CMF) of molecular clouds to study the origin of the stellar initial mass function (IMF) [8]. Di Peng et al. introduced D-S evidence theory based on cloud model evaluation to handle the fuzziness and uncertainty of linguistic value evaluation in multi-attribute decision making [9]. Wang Yan et al. built a safety evaluation model based on D-S evidence theory and multi-source information to evaluate dam safety in combination with the properties of the surrounding goaf, and verified the feasibility and effectiveness of the multi-source evidence index system [10]. Liao Ruijin et al. combined chromatographic data and electrical test data and, using the data fusion principle, integrated neural networks with evidence theory to propose a comprehensive transformer fault diagnosis method based on the fusion of multiple neural networks and evidence theory [11]. Therefore, this paper studies the prediction accuracy for the SSC of apples based on a Gaussian mass function and the fusion of multiple classification models.
2 Data Acquisition and Preprocessing
2.1 Data Acquisition
In this paper, 439 Red Fuji apple samples were selected. Their near-infrared spectra were acquired with a WY-6100 online nondestructive fruit testing system in the range 4000–10000 cm⁻¹, and their soluble solids content (SSC) was also measured [11]. The original spectra of the samples are shown in Fig. 1.
Fig. 1. Original spectra
According to the national standard GB/T12295-90 “Fruit and vegetable products - Determination of soluble solids - Refractometric method”, the SSC of each apple sample was measured three times and the average value was taken as the sample’s SSC. To reduce the impact of the data split on the accuracy of the prediction model, the data set was divided before modeling so that each category is represented evenly: 70% of the fruits of each category were randomly selected as the training set and the remaining 30% as the test set.
2.2 Multiple Scattering Correction
During spectral data acquisition, the spectral information shifts to some extent as the external environment changes, which can affect the accuracy of the predictive model. To reduce this baseline offset, Multiple Scattering Correction (MSC) is used to preprocess the original spectra [14]. It consists of three steps: finding a reasonable “ideal spectrum”, performing unary linear regression of each spectrum against it, and applying the scattering correction. The shifts in the spectral information are thereby reduced, as demonstrated in Fig. 2.
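The three MSC steps can be sketched as below. The sketch assumes that the mean of all spectra serves as the “ideal spectrum”, which is the usual choice but is not stated in the paper; the function name `msc` and the list-of-lists data layout are likewise assumptions for illustration.

```python
def msc(spectra):
    """Multiplicative scatter correction of a list of equal-length spectra."""
    n = len(spectra)
    m = len(spectra[0])
    # Step 1: take the mean spectrum as the "ideal" reference (assumption).
    ref = [sum(s[j] for s in spectra) / n for j in range(m)]
    ref_mean = sum(ref) / m
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / m
        # Step 2: unary least-squares fit  s ~ intercept + slope * ref.
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        # Step 3: remove the fitted offset and scale from the spectrum.
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


if __name__ == "__main__":
    base = [1.0, 2.0, 4.0, 3.0]
    # Two spectra that differ only by an offset and a scale factor.
    spectra = [[0.5 + 2.0 * v for v in base], [-0.5 + 0.5 * v for v in base]]
    for row in msc(spectra):
        print([round(v, 6) for v in row])
```

In the demonstration both corrected spectra collapse onto the reference, because each input differs from it only by an offset and a scale, which is exactly the distortion MSC removes.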
Fig. 2. Spectra after MSC.
Fig. 3. The fusion process of PNN and ELM model.
3 Gaussian Mass Function Based Multi-model Fusion
In [14], the PNN and ELM methods were shown to be suitable for apple classification prediction. Therefore, in this study, PNN and ELM are each used to establish a prediction model, where the model input is the characteristic spectra pre-processed by MSC and the model output is the predicted SSC value. To improve the prediction accuracy, avoid the limitation of a single classification model, and deal with the uncertainty introduced by hard partition of the instance space, an evidence-theory-based multi-model fusion is proposed. The whole process is shown in Fig. 3: the mass functions of the two prediction models are obtained first, and they are then combined using the combination rule.
3.1 Gaussian Mass Function
The degree to which the evidence trusts a proposition is described by a mass function [15]. In this paper, a Gaussian mass function is generated from the distance between the predicted SSC value of an apple and the classification boundary. Classification is unclear when the predicted SSC value is close to the boundary between two classes of apples, so this paper assigns the uncertain part of the mass of a class to the set composed of the adjacent classes. From the SSC ranges and the corresponding apple categories, the relationship between the predicted SSC value x and the apple category C is

$$
C = \begin{cases} \mathrm{I}, & 13 \le x < 16 \\ \mathrm{II}, & 11 \le x < 13 \\ \mathrm{III}, & 8 \le x < 11 \end{cases} \tag{1}
$$

where x is the SSC. When x is 11, the apple may be Class III or Class II, but the trust degree of the apple in Class III is greater than that in Class II. When x is 9.5, the midpoint of 8 and 11, the trust degree in Class III is the highest. Within the range [8, 9.5], the trust degree in Class III increases with x; within the range (9.5, 11), it decreases as x increases. Based on this trust degree, this paper constructs the Gaussian mass function shown in Fig. 4, which is generated from the distance between the SSC and the classification boundary.
Fig. 4. Gaussian mass function
Here a and c are the SSC values at the left and right boundaries of a class, where the mass function value of the corresponding class label is set to 0.9, and b = (a + c)/2 is the middle SSC value of the class, where the mass function value is set to 1. Taking Class II fruit as an example, the SSC range is [11, 13), so a is 11, c is 13, and b is 12. According to Fig. 4, the Gaussian mass function of an apple can be constructed as

$$
m(x) = f\, e^{-\frac{(x-b)^2}{2\sigma^2}}, \quad 8 \le x < 16 \tag{2}
$$
When x is in the range [8, 11) and close to the boundary 8, that is, x is in [8, 9.5), the apple may be Class III or its class may be uncertain among all three classes. From Eq. (1), C = III, and m(III) is obtained from Eq. (2). The remaining degree of trust is assigned to {I, II, III}, that is, m({I, II, III}) = 1 − m(III). When x is in the range [8, 11) and close to the boundary 11, that is, x is in (9.5, 11), the apple may be Class III or Class II. From Eq. (1), C = III, and m(III) is obtained from Eq. (2). The remaining degree of trust is assigned to {II, III}, that is, m({II, III}) = 1 − m(III). When x is in the range [11, 13) and close to the boundary 11, that is, x is in [11, 12), the fruit may be Class III or Class II. From Eq. (1), C = II, and m(II) is obtained from Eq. (2). The remaining degree of trust is assigned to {II, III}, that is, m({II, III}) = 1 − m(II). When x is in the range [11, 13) and close to the boundary 13, that is, x is in (12, 13), the fruit may be Class I or Class II. From Eq. (1), C = II, and m(II) is obtained from Eq. (2). The remaining degree of trust is assigned to {I, II}, that is, m({I, II}) = 1 − m(II). When x is in the range [13, 16) and close to the boundary 13, that is, x is in [13, 14.5), the fruit may be Class I or Class II. From Eq. (1), C = I, and m(I) is obtained from Eq. (2). The remaining degree of trust is assigned to {I, II}, that is, m({I, II}) = 1 − m(I). When x is in the range [13, 16) and close to the boundary 16, that is, x is in (14.5, 16), the fruit may be Class I or another class. From Eq. (1), C = I, and m(I) is obtained from Eq. (2). The remaining degree of trust is assigned to {I, II, III}, that is, m({I, II, III}) = 1 − m(I). Based on the above analysis, the Gaussian values of the class-label mass functions of the three grades of Red Fuji apples are obtained, as shown in Fig. 5. The Gaussian mass function formulas are established as follows.
Fig. 5. Gaussian value of the label mass function of three grades of Red Fuji apples
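$$
m(\mathrm{I}) = \begin{cases} 1, & x = 13 \\ e^{4(x-14.5)^2 \log(0.9/9)}, & x > 13 \end{cases} \tag{3}
$$

$$
m(\mathrm{II}) = \begin{cases} 1, & x = 11 \\ e^{(x-12)^2 \log(0.9)}, & 11 < x < 13 \end{cases} \tag{4}
$$

$$
m(\mathrm{III}) = e^{4(x-9.5)^2 \log(0.9/9)}, \quad x < 11 \tag{5}
$$

$$
m(\{\mathrm{I}, \mathrm{II}\}) = \begin{cases} 1 - m(\mathrm{II}), & 12 \le x < 13 \\ 1 - m(\mathrm{I}), & 13 \le x < 14.5 \end{cases} \tag{6}
$$

$$
m(\{\mathrm{II}, \mathrm{III}\}) = \begin{cases} 1 - m(\mathrm{III}), & 9.5 \le x < 11 \\ 1 - m(\mathrm{II}), & 11 \le x < 12 \end{cases} \tag{7}
$$

$$
m(\{\mathrm{I}, \mathrm{II}, \mathrm{III}\}) = \begin{cases} 1 - m(\mathrm{III}), & x < 9.5 \\ 1 - m(\mathrm{I}), & x \ge 14.5 \end{cases} \tag{8}
$$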
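Equations (3)–(8) can be evaluated with a short function. The sketch below follows the exponents literally as printed rather than deriving σ from the boundary value of 0.9 mentioned with Fig. 4, and it assumes that log denotes the base-10 logarithm; under that assumption it reproduces the values 0.9999 and 0.0052 quoted in Sect. 3.3. The function name `mass_function`, the use of frozensets for class sets and the restriction to 8 ≤ x < 16 are choices made for the example.

```python
import math


def mass_function(x):
    """Return {set of classes: mass} for a predicted SSC value x.

    Follows Eqs. (3)-(8) as printed, with log taken as log10 (assumption).
    """
    log = math.log10
    if 13 <= x < 16:
        label = "I"
        m = 1.0 if x == 13 else math.exp(4 * (x - 14.5) ** 2 * log(0.9 / 9))
        rest = ("I", "II") if x < 14.5 else ("I", "II", "III")
    elif 11 <= x < 13:
        label = "II"
        m = 1.0 if x == 11 else math.exp((x - 12) ** 2 * log(0.9))
        rest = ("II", "III") if x < 12 else ("I", "II")
    elif 8 <= x < 11:
        label = "III"
        m = math.exp(4 * (x - 9.5) ** 2 * log(0.9 / 9))
        rest = ("I", "II", "III") if x < 9.5 else ("II", "III")
    else:
        raise ValueError("SSC value outside the range [8, 16)")
    masses = {frozenset([label]): m}
    if m < 1.0:
        # The remaining trust goes to the set of adjacent classes.
        masses[frozenset(rest)] = 1.0 - m
    return masses


if __name__ == "__main__":
    for x in (12.0557, 13.3537):
        print(x, {tuple(sorted(k)): round(v, 4) for k, v in mass_function(x).items()})
```

Each returned dictionary sums to 1, so it can be passed directly to a Dempster combination step.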
3.2 DS Evidence Theory
In Dempster-Shafer (DS) evidence theory, the set of all possible decision results of a decision problem is denoted by U and is called the recognition framework. If a function m: 2^U → [0, 1] satisfies

$$
\sum_{A \in 2^U} m(A) = 1, \quad m(\emptyset) = 0,
$$

then m is called a mass function on U. A subset A with m(A) > 0 is called a focal element, and m(X), the mass of X, expresses the degree of trust in proposition X. Suppose that two pieces of evidence from different sources, m1 and m2, have the same recognition framework U = {A1, A2, A3, ..., An}, and let ⊕ denote the fusion operator. Dempster’s combination rule of DS evidence theory is

$$
m(D) = (m_1 \oplus m_2)(D) = K^{-1} \sum_{A_i \cap A_j = D} m_1(A_i)\, m_2(A_j) \tag{9}
$$

where 1 ≤ i ≤ n, 1 ≤ j ≤ n, and A_i ∩ A_j = D means that D is the intersection of A_i and A_j. A_i ∩ A_j = ∅ means that A_i has no intersection with A_j, and

$$
K = 1 - \sum_{A_i \cap A_j = \emptyset} m_1(A_i)\, m_2(A_j) \tag{10}
$$
In Eq. (10), K is the normalization factor, which measures the degree of conflict among the pieces of evidence. According to Eqs. (9) and (10), once the mass functions of multiple prediction models are obtained, the fusion model can be obtained through the combination rule.
3.3 Gaussian Mass Function Based Multi-model Fusion
For the apple classification problem, the recognition frame of the two prediction models, PNN and ELM, is set as U = {I, II, III}. When the predicted SSC value is 13.3537, the apple can be judged as Class I, because 13.3537 lies within the range [13, 16). However, owing to the error of a single prediction model and the influence of the classification equipment and other factors, the value may be inaccurate, and it is close to the boundary 13 between Class I and Class II. This sample may therefore also be identified as Class II. Since the class of such an apple cannot be determined exactly, uncertainty analysis is introduced in this study, and the sample is assigned to {I, II}, indicating that it may be Class I or Class II. Taking the spectral data of a sample whose actual SSC value is 13 as the input of PNN and ELM, the predicted SSC values of the two models are 12.0557 and 13.3537, respectively.
(1) Mass function of the PNN prediction model
The output of the PNN model is taken as the first piece of evidence. The predicted value of PNN is x1 = 12.0557, which lies within 11 ≤ x < 13; substituting x = x1 into Eq. (4) gives m1(II) = 0.9999. Since x1 = 12.0557 lies on the side of Class II nearer to the boundary 13 between Class I and Class II, this apple may also be Class I. Substituting into Eq. (6) gives m1({I, II}) = 1 − m1(II) = 0.0001, so the mass function of PNN is

$$
m_1(\mathrm{II}) = 0.9999, \quad m_1(\{\mathrm{I}, \mathrm{II}\}) = 1 - m_1(\mathrm{II}) = 0.0001 \tag{11}
$$

(2) Mass function of the ELM prediction model
The output of the ELM model is taken as the second piece of evidence. The predicted value of ELM is x2 = 13.3537, which lies within 13 ≤ x < 16; substituting x = x2 into Eq. (3) gives m2(I) = 0.0052. Since x2 = 13.3537 is close to the boundary 13 between Class I and Class II, this fruit may also be Class II. Substituting into Eq. (6) gives m2({I, II}) = 1 − m2(I) = 0.9948, so the mass function of ELM is

$$
m_2(\mathrm{I}) = 0.0052, \quad m_2(\{\mathrm{I}, \mathrm{II}\}) = 1 - m_2(\mathrm{I}) = 0.9948 \tag{12}
$$

(3) Fusion and decision making based on DS evidence theory
The mass functions m1 and m2 are fused according to Eqs. (9) and (10), and the fused mass function m is

$$
K = 0.9948, \quad m(\mathrm{I}) = 0, \quad m(\{\mathrm{I}, \mathrm{II}\}) = 0.0001, \quad m(\mathrm{II}) = 0.9999 \tag{13}
$$

The decision is then made according to the magnitude of the mass function. From the fusion results, m(II) > m({I, II}) > m(I), so it can be concluded that this apple belongs to Class II, which is the same as the actual classification result.
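The fusion step can be checked numerically with a small implementation of Dempster's rule, Eqs. (9) and (10). This is a minimal sketch: the function name `dempster_combine` and the representation of class sets as frozensets are assumptions for the example, while the input masses are the values given in Eqs. (11) and (12).

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of classes to its mass.
    Returns the fused mass function and the normalization factor K.
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb      # mass falling on the empty set
    k = 1.0 - conflict                   # Eq. (10)
    if k == 0.0:
        raise ValueError("the two pieces of evidence are in total conflict")
    return {s: w / k for s, w in combined.items()}, k  # Eq. (9)


if __name__ == "__main__":
    m1 = {frozenset({"II"}): 0.9999, frozenset({"I", "II"}): 0.0001}   # PNN, Eq. (11)
    m2 = {frozenset({"I"}): 0.0052, frozenset({"I", "II"}): 0.9948}    # ELM, Eq. (12)
    fused, k = dempster_combine(m1, m2)
    print("K =", round(k, 4))
    for subset, mass in fused.items():
        print(sorted(subset), round(mass, 4))
    print("decision:", sorted(max(fused, key=fused.get)))
```

Rounded to four decimals, the output agrees with Eq. (13), and choosing the subset with the largest fused mass yields Class II.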
4 Simulation
4.1 Prediction by PNN and ELM
In this paper, the confusion matrix is used to evaluate the performance of the predictive models. The confusion matrix is a table for understanding the performance of a classification model; it shows how the test data are distributed among the different classes. It is a two-dimensional table whose row labels are the actual categories and whose column labels are the categories predicted by the model.
The original spectral data and SSC of the training set are taken as the input for training the PNN and ELM models. The original spectral data of the test set are then taken as the input of the prediction models to predict the SSC of the apples. The SSC values predicted by the PNN and ELM models were converted into class labels and compared with the original class labels to obtain the confusion matrices of PNN and ELM, shown in Tables 1 and 2, respectively.

Table 1. Confusion matrix of PNN prediction model.

| Real class mark | PNN: Class I | PNN: Class II | PNN: Class III | Subtotal |
|---|---|---|---|---|
| Class I | 33 | 3 | 0 | 36 |
| Class II | 18 | 75 | 3 | 96 |
| Class III | 0 | 0 | 0 | 0 |
| Subtotal | 51 | 78 | 3 | 132 |
Table 2. Confusion matrix of ELM prediction model.

| Real class mark | ELM: Class I | ELM: Class II | ELM: Class III | Subtotal |
|---|---|---|---|---|
| Class I | 20 | 16 | 0 | 36 |
| Class II | 13 | 82 | 1 | 96 |
| Class III | 0 | 0 | 0 | 0 |
| Subtotal | 33 | 98 | 1 | 132 |
According to Tables 1 and 2, PNN classified 108 samples correctly, comprising 33 Class I and 75 Class II samples. Three Class I fruits were wrongly assigned to Class II, 18 Class II fruits to Class I, and 3 Class II fruits to Class III. With a total of 132 samples, the prediction accuracy was 81.8182%, as shown in Fig. 6. ELM classified 102 samples correctly, comprising 20 Class I and 82 Class II samples; 16 Class I fruits were assigned to Class II, 13 Class II fruits to Class I, and 1 Class II fruit to Class III. With a total of 132 samples, the prediction accuracy was 77.2727%, as shown in Fig. 7.
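The reported accuracies follow directly from the confusion matrices: the correctly classified samples are on the diagonal, and the accuracy is their sum divided by the total. The short check below uses the counts from Tables 1 and 2; the function name `accuracy` and the nested-list layout (rows are real classes, columns are predicted classes) are choices made for the example.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Rows: real Class I, II, III; columns: predicted Class I, II, III.
PNN = [[33, 3, 0], [18, 75, 3], [0, 0, 0]]
ELM = [[20, 16, 0], [13, 82, 1], [0, 0, 0]]

if __name__ == "__main__":
    print("PNN: %.4f%%" % (100 * accuracy(PNN)))   # 108 / 132
    print("ELM: %.4f%%" % (100 * accuracy(ELM)))   # 102 / 132
```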
Fig. 6. Classification accuracy of PNN test set (panel title: PNN Accuracy, accuracy = 81.8182%; x-axis: test set; y-axis: level; legend: actual level, test level)
Fig. 7. Classification accuracy of ELM test set (panel title: ELM Accuracy, accuracy = 77.2727%; x-axis: test set; y-axis: level; legend: actual level, test level)
4.2 Simulation Experiment on Fusion Model
The confusion matrix of the fusion model obtained by combining the PNN and ELM models is shown in Table 3. As can be seen from Table 3, of the 132 samples, 114 were classified correctly after DS fusion, comprising 34 Class I and 80 Class II samples. Two Class I samples were wrongly assigned to Class II, and 16 Class II samples were wrongly assigned to Class I.
Table 3. Confusion matrix of DS fusion prediction model.

| Real class mark | DS: Class I | DS: Class II | DS: Class III | Subtotal |
|---|---|---|---|---|
| Class I | 34 | 2 | 0 | 36 |
| Class II | 16 | 80 | 0 | 96 |
| Class III | 0 | 0 | 0 | 0 |
| Subtotal | 50 | 82 | 0 | 132 |
The classification accuracy of the fusion model is shown in Fig. 8; the prediction accuracy is 86.3636%. The example analysis and the simulation experiment show that the Gaussian mass function proposed in this paper not only handles the uncertainty caused by the discount factor in the classification process and the poor adaptability of a single prediction model, but also improves the prediction accuracy when combined with Dempster’s combination rule.
Fig. 8. Classification accuracy after DS fusion (panel title: DS Accuracy, accuracy = 86.3636%; x-axis: test set; y-axis: level; legend: actual level, test level)
5 Conclusion
To address the limitation of a single classification model in near-infrared spectroscopy and the uncertainty caused by hard partition of the instance space in apple classification, an evidence-theory-based multi-model fusion was proposed. A Gaussian mass function was proposed so that the PNN and ELM models could be fused by combining their mass functions with Dempster’s combination rule. The class with the maximum mass value of the combined mass function was selected as the final decision on the apple grade. The experimental results showed that the proposed mass function improves the prediction accuracy in apple classification and is more reasonable than the hard partition.
Acknowledgment. This paper is supported by the Natural Science Foundation of Shandong Province (ZR2021MF074, ZR2020KF027 and ZR2020MF067).
References
1. Li, M., Han, D., Lu, D., et al.: Research progress of universal model of near infrared spectroscopy in agricultural products and foods detection. Spectro. Spectral Anal. 42(11), 3355–3360 (2022)
2. Hu, J., Huang, J., Liu, X., et al.: Research progress of kiwi fruit quality detection based on near infrared spectroscopy. Res. Dev. 43(2), 196–201 (2022)
3. Cao, N., Liu, Q., Peng, J., et al.: Quantitative determination method of soluble solid and hardness of yellow peach crisp chips based on near infrared spectroscopy. Food Mach. 37(3), 51–57 (2021)
4. Yu, Z., Luo, H., Kong, D., et al.: Study on dynamic moisture model of Junzao in southern Xinjiang based on near infrared spectroscopy. Xinjiang Agric. Mechanization 2(1), 35–38 (2022)
5. Zhu, J., Zhu, Y., Feng, G., et al.: Establishment of quantitative models for blueberry storage quality based on near infrared spectroscopy combined with extreme learning machine. Food Ferment. Ind. 48(16), 270–276 (2022)
6. Jie, D., Xie, L., Rao, X., et al.: Near infrared spectrum variable screening improved the accuracy of watermelon sugar degree prediction model. Trans. Chin. Soc. Agric. Eng. 29(12), 264–270 (2013)
7. Bi, S., Li, X., Shen, T., et al.: Apple classification method based on multi-model evidence fusion. Trans. Chin. Soc. Agric. Eng. 38(13), 141–149 (2022)
8. Xu, X., Li, D., Ren, Z.: Cloud core mass function. Adv. Astron. 32(3), 299–312 (2014)
9. Di, P., Ni, Z., Yin, D., et al.: A multi-attribute decision making optimization algorithm based on cloud model and evidence theory. Syst. Eng. Theory Pract. 41(4), 1061–1070 (2021)
10. Yan, W., Wei, Y., Fuheng, Ma.: Dam safety evaluation based on D-S evidence theory and multi-source information of GOAF. Water Conservancy Hydropower Technol. 51(4), 175–183 (2020)
11. Liao, R., Liao, Y., Yang, L., et al.: Research on transformer fault comprehensive diagnosis based on the fusion of multi-neural network and evidence theory. Chin. J. Electr. Eng. 26(3), 119–124 (2006)
12. Specht, D.F.: Probabilistic neural networks. Neural Netw. 3(1), 109–118 (1990)
13. Zhou, S., Liang, J.: Partial discharge pattern recognition based on moment feature and probabilistic neural network. Prot. Control Power Syst. 44(3), 98–102 (2016)
14. Chen, L., Li, X., Ma, L., Bi, S.: Probabilistic neural network based apple classification prediction. In: Proceedings of 2022 International Conference on Advanced Mechatronic Systems, Japan, pp. 71–75 (2022)
15. Guo, W., Shang, L., Zhu, X., et al.: Nondestructive detection of soluble solids content of apples from dielectric spectra with ANN and chemometric methods. Food Bioprocess Technol. 8(5), 1126–1138 (2015)
Research on Lightweight Pedestrian Detection Method Based on YOLO
Kehua Du, Qinjun Zhao(B), Rongyao Jing, Lei Zhao, Shijian Hu, Shuaibo Song, and Weisong Liu
University of Jinan, Jinan 250022, China
[email protected]
Abstract. To address the large model size, high computational cost and slow detection speed of current pedestrian detection models, this paper proposes a lightweight pedestrian detection algorithm based on YOLO v5. First, the Shufflenet v2 network is introduced to replace the backbone network of YOLO v5. Then a cascade convolution is designed and the convolution kernel size of the backbone is modified to enlarge the receptive field of the backbone feature extraction network, so that more important context features can be captured. Finally, unnecessary structures in the backbone network are pruned to reduce the number of network parameters and improve the inference speed. Experiments are conducted on the INRIA dataset. Compared with the YOLO v5s model, the model size, number of parameters and inference time of the proposed algorithm are reduced to 50.1%, 48.6% and 64.7%, respectively, while the mean average precision (mAP) drops by only 2.1%. The algorithm therefore maintains accuracy while greatly improving inference speed.
Keywords: Deep learning · Pedestrian detection · YOLO v5 · Shufflenet v2
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 270–278, 2024. https://doi.org/10.1007/978-3-031-50580-5_23

1 Introduction

Pedestrian detection [1] uses computer vision to determine whether pedestrians are present in images or videos and to locate them accurately. It is widely used in artificial intelligence applications and has important research value. Research on pedestrian detection began in the mid-1990s and has developed rapidly since 2005, but current detectors still struggle to balance accuracy and speed.

Traditional methods detect pedestrians by manually extracting features and matching classifiers, mainly using static image features. The extracted features include the Haar-like feature [2], the Edgelet feature [3], the HOG feature [4], the Shapelet feature [5], the contour template feature [6, 7], etc. The classifier uses SVM [8] or enhanced learning. The manually designed features only use the gradient or texture information of the image and other relatively simple descriptors to describe pedestrians, which leads to fast detection but poor detection accuracy.

Pedestrian detection algorithms based on deep learning do not require manually designed features. Instead, a convolutional neural network learns appropriate features from many samples to achieve reliable detection. Current mainstream detection methods include:
1) Two-stage detectors, such as R-CNN [9] and Faster R-CNN [10]. The first stage extracts candidate regions, with a region proposal network (RPN) used to locate the areas where a target may exist. The second stage classifies and regresses the candidate regions. Detection algorithms based on region proposals have high accuracy but require a lot of computing power.
2) One-stage detectors, such as the regression-based YOLO [11] series and the SSD algorithm [12]. These do not extract candidate regions. A single network completes classification and regression directly, so no region proposal network is needed, and detection is faster than with two-stage detectors.

Detection algorithms based on deep learning have higher precision than traditional methods, but their computational cost is large and their detection speed is slow. This paper therefore proposes a lightweight pedestrian detection algorithm based on YOLO v5. The work done in this paper is summarized as follows:
1) The Shufflenet v2 network is introduced to replace the backbone feature extraction network of YOLO v5, which completes a preliminary lightweight improvement of the YOLO network.
2) By enlarging the convolution kernel size and designing a cascade convolution, the receptive field of the introduced Shufflenet v2 network is enlarged, so that the feature extraction network of the proposed algorithm can obtain feature information over a larger range.
3) Unnecessary structures of the Shufflenet v2 network are removed, which reduces the parameter scale of the network model and further improves the running speed of the algorithm.
2 Related Work

2.1 Introduction to YOLO v5 Network

YOLO v5 is a widely used target detection algorithm. It makes several improvements over YOLO v4 that increase both speed and accuracy. Ultralytics publicly released it on June 9, 2020. YOLO v5 is mainly composed of a backbone network, a feature fusion part and a prediction head. The backbone adopts the CSPDarknet53 structure, which improves on Darknet53 and uses a residual structure whose optimization can improve the accuracy of the network. The backbone produces three effective feature layers, which are then fused by the PANet structure. After fusion, the feature layers contain both deep semantic information and texture information, and the prediction head uses them to make predictions. The YOLO v5 model is widely used because of its flexibility and high detection accuracy.
2.2 Introduction to Shufflenet v2 Network

Shufflenet v2 is a classic lightweight network suitable for deployment on mobile terminals. By analyzing the execution of Shufflenet v1 and Mobilenet v2, the impact of floating-point computation (FLOPs) and memory access cost (MAC) on the inference speed of a model was examined. From the analysis of how input and output channel widths, group convolution, network fragmentation and element-wise operations behave on different hardware, four lightweight network design guidelines are summarized:
1) The memory access cost depends on the ratio of input to output channels. For a fixed amount of computation, it is lowest when the two are equal.
2) Excessive group convolution increases the memory access cost, so reducing the number of groups reduces it.
3) An overly complex network structure (too many branches and basic units) reduces the degree of parallelism of the network.
4) The cost of element-wise operations cannot be ignored.
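The first guideline can be checked numerically. The sketch below assumes the cost model commonly used in the Shufflenet v2 literature for a 1 × 1 convolution on an h × w feature map, namely FLOPs B = h·w·c_in·c_out and MAC = h·w·(c_in + c_out) + c_in·c_out. This model is not stated in the text above, and the feature-map size and channel pairs are illustrative values only.

```python
def conv1x1_flops(h, w, c_in, c_out):
    """Multiply-accumulate count of a 1x1 convolution (bias ignored)."""
    return h * w * c_in * c_out


def conv1x1_mac(h, w, c_in, c_out):
    """Assumed memory access cost: read input, write output, read weights."""
    return h * w * (c_in + c_out) + c_in * c_out


# Channel pairs with the same product, hence the same FLOPs (illustrative values).
H, W = 56, 56
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
costs = {pair: conv1x1_mac(H, W, *pair) for pair in pairs}

for (c_in, c_out), mac in costs.items():
    flops = conv1x1_flops(H, W, c_in, c_out)
    print(f"c_in={c_in:4d} c_out={c_out:4d} FLOPs={flops} MAC={mac}")
```

Under this model, all four configurations cost the same FLOPs, but the balanced pair has the smallest memory access cost.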
Fig. 1. Improved network structure
According to these design criteria, Shufflenet v2 reconstructs the network structure of Shufflenet v1 and introduces a channel split operation. The purpose is to divide the input channels into two parts that are computed independently. Shufflenet v2 can therefore accommodate more feature channels. After the channel split, each convolution is performed on a separate subset of the feature channels, so the computation and parameters are correspondingly reduced, achieving a balance between speed and accuracy.
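A minimal sketch of the channel split idea follows, with channels represented by string labels instead of tensors. The channel shuffle step after concatenation is an assumption taken from the general Shufflenet design and is not described in the text above. The function names are hypothetical.

```python
def channel_split(channels):
    """Split a list of channels into two halves."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave the groups: view as (groups, n), transpose, flatten."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def stride1_unit(channels, branch2):
    """Keep one half unchanged, process the other half, concatenate, shuffle."""
    left, right = channel_split(channels)
    return channel_shuffle(left + branch2(right))


# Branch 2 only tags the channels it processed so the data flow stays visible.
x = ["c0", "c1", "c2", "c3", "c4", "c5"]
y = stride1_unit(x, branch2=lambda chs: [c + "'" for c in chs])
print(y)
```

Only half of the channels pass through the convolutions of branch 2, which is where the saving in computation and parameters comes from.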
3 The Improved Pedestrian Detection Algorithm

3.1 Modify the Backbone Network

First, the original backbone network of YOLO v5 is removed and the Shufflenet v2 network is inserted as the new backbone, which completes the lightweight improvement of the YOLO v5 network. The network structure after modifying the backbone is shown in Fig. 1.

3.2 Enhance Network Receptive Field

Analysis of the introduced Shufflenet v2 network shows that it consists of two basic structures. The first is the structure with a stride of 1. The input channels are divided into two parts: the part in branch 1 is passed through directly and concatenated with the other part after the convolutions of branch 2, as shown in Fig. 2 (a). The second is the structure with a stride of 2. Because downsampling and an increase in dimension are required, the input is copied into both branches, whose outputs are concatenated after the convolutions of branch 1 and branch 2 so that the number of channels is doubled, as shown in Fig. 2 (b).
Fig. 2. Expanding the network receptive field
To give the network a larger receptive field, the algorithm modifies the two structures in different ways, as shown in Fig. 2 (c) and (d). When the stride is 1, the 3 × 3 depthwise convolution in Fig. 2 (a) is replaced with the cascade convolution shown in Fig. 2 (c). Figure 2 (d) shows the modification when the stride is 2, which directly enlarges the kernel of the depthwise convolution to 5 × 5. The reasons for enlarging the convolution kernel and designing the cascade convolution in Fig. 2 (c) and (d) are as follows:
1) When the stride is 1, a cascade of 3 × 3 convolutions obtains the receptive field of an equivalent 5 × 5 convolution while retaining more features and costing less computation than a single 5 × 5 convolution.
2) When the stride of the convolution changes, the size of the output feature map also changes. If a 3 × 3 convolution with a stride of 2 were applied twice, Formula (1) shows that the width and height of the final output feature matrix would be greatly reduced. To match the feature input size expected by the subsequent network, additional operations would be needed to restore the size of the feature matrix, which would increase the amount of calculation. Therefore, when the stride is 2, a 5 × 5 depthwise convolution directly replaces the 3 × 3 depthwise convolution.

$$
\begin{cases}
W = \dfrac{N_w + 2P - F}{S} + 1 \\[4pt]
H = \dfrac{N_h + 2P - F}{S} + 1
\end{cases}
\tag{1}
$$

Here $N_w$ and $N_h$ are the width and height of the input matrix, $P$ is the padding, $S$ is the convolution stride, $F$ is the convolution kernel size, and $W$ and $H$ are the width and height of the output matrix.
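Both reasons can be illustrated with a short calculation. The sketch below applies Formula (1) with floor division, which is an assumption about how non-integer results are rounded, and uses the standard recurrence for the receptive field of stacked convolutions. The 80-pixel input width and the padding values are illustrative choices, not values from the paper.

```python
def conv_out_size(n, f, p, s):
    """Formula (1), using floor division for non-integer results (assumed)."""
    return (n + 2 * p - f) // s + 1


def stacked_receptive_field(layers):
    """Receptive field of stacked convolutions given (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Stride 1: two cascaded 3x3 depthwise convolutions versus one 5x5.
rf_cascade = stacked_receptive_field([(3, 1), (3, 1)])
rf_single = stacked_receptive_field([(5, 1)])
weights_cascade = 2 * 3 * 3  # depthwise weights per channel
weights_single = 5 * 5

# Stride 2: applying a 3x3 stride-2 convolution twice shrinks the map twice.
twice_3x3 = conv_out_size(conv_out_size(80, f=3, p=1, s=2), f=3, p=1, s=2)
once_5x5 = conv_out_size(80, f=5, p=2, s=2)

print(rf_cascade, rf_single, weights_cascade, weights_single)
print(twice_3x3, once_5x5)
```

With these values the cascade reaches the same receptive field as the 5 × 5 kernel with fewer depthwise weights, and a single stride-2 5 × 5 convolution keeps the intended output size where two stride-2 convolutions would halve it again.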
Fig. 3. Clipping network structure
3.3 Clipping Shufflenet v2 Network

A 1 × 1 convolution is generally used before or after a depthwise convolution for two purposes: to fuse information across channels, compensating for the lack of cross-channel information fusion in depthwise convolution, and to reduce or increase the dimension. Analysis of the Shufflenet v2 network introduced in this paper shows that branch 2 of both structures uses two 1 × 1 convolutions, as shown in Fig. 3 (a) and (b). However, no dimension change is needed here, so a single 1 × 1 convolution can still complete the information fusion while reducing the number of parameters. Only one 1 × 1 convolution is therefore retained. Figure 3 (c) and (d) show the Shufflenet v2 structures after expanding the receptive field and trimming the redundant structures.
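The effect of the pruning on the parameter count can be estimated with a rough sketch. The layer layouts below are assumptions made for illustration: branch 2 is taken to be "1 × 1, depthwise, 1 × 1" before pruning and "depthwise (cascaded or 5 × 5), 1 × 1" after it, with equal input and output width c. Biases and normalization layers are ignored, and c = 58 is an arbitrary example width.

```python
def pointwise_params(c_in, c_out):
    """Weights of a 1x1 convolution without bias."""
    return c_in * c_out


def depthwise_params(channels, kernel):
    """Weights of a depthwise convolution: one kernel per channel."""
    return channels * kernel * kernel


def branch2_original(c):
    # Assumed layout: 1x1 conv, 3x3 depthwise, 1x1 conv
    return pointwise_params(c, c) + depthwise_params(c, 3) + pointwise_params(c, c)


def branch2_stride1_pruned(c):
    # Assumed layout: two cascaded 3x3 depthwise convs, one 1x1 conv
    return 2 * depthwise_params(c, 3) + pointwise_params(c, c)


def branch2_stride2_pruned(c):
    # Assumed layout: one 5x5 depthwise conv, one 1x1 conv
    return depthwise_params(c, 5) + pointwise_params(c, c)


c = 58  # hypothetical branch width
print(branch2_original(c), branch2_stride1_pruned(c), branch2_stride2_pruned(c))
```

Under these assumptions the 1 × 1 convolutions dominate the count, so removing one of them outweighs the extra depthwise weights added for the larger receptive field.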
4 Experiment and Analysis

4.1 Experimental Environment

To verify the effectiveness and rationality of the proposed algorithm, YOLO v5s and the improved algorithm are trained and validated on the same dataset and their detection accuracy is compared. The speed of the two algorithms is also tested on the same experimental platform. To evaluate the detection performance, the experiment measures the precision P, the recall R and the mean average precision (mAP) of each model, calculated as in Formulas (2)–(4). In addition, the model size and number of parameters are compared, and the inference time per image is used as the indicator of detection speed.

$$P = \frac{TP}{TP + FP} \tag{2}$$

$$R = \frac{TP}{TP + FN} \tag{3}$$

$$mAP = \frac{\sum_{i=1}^{k} AP_i}{k} \tag{4}$$

In Formulas (2)–(4), TP (True Positive) is the number of positive samples correctly identified as positive, FP (False Positive) is the number of negative samples incorrectly identified as positive, and FN (False Negative) is the number of positive samples incorrectly identified as negative. In Formula (4), k is the number of categories, and AP (Average Precision) can be expressed as:

$$AP = \int_{0}^{1} P(R)\, dR \tag{5}$$
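Formulas (2)–(5) translate directly into code. The sketch below is a simplified illustration: the integral in Formula (5) is approximated with the trapezoidal rule over sampled points of the precision-recall curve, whereas evaluation toolkits often interpolate the curve first, a step omitted here. The counts and curve points are made-up example values.

```python
def precision(tp, fp):
    """Formula (2)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Formula (3)."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Formula (5), approximated by the trapezoidal rule over sampled points."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Formula (4): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Made-up example: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)
r = recall(90, 30)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
print(p, r, ap, mean_average_precision([ap]))
```

With a single category such as pedestrians, k = 1 and the mAP equals the AP of that category.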
The INRIA pedestrian dataset was used in this experiment. It is a set of labelled images of standing or walking pedestrians. The training set has 614 images with pedestrians and 1218 images without pedestrians, containing 1237 pedestrians in total. The test set has 288 images with pedestrians and 453 images without pedestrians, containing 589 pedestrian targets. The experimental platform runs the Windows 10 operating system with 8 GB of memory and an NVIDIA GTX950M GPU with 2 GB of memory, and uses torch 1.7.0 + cuda110 with cudnn 8.5.0.96 as the learning framework.
Fig. 4. Comparison of verification results of two algorithms
4.2 Experimental Result

The verification results in Fig. 4 show that the detections of the proposed algorithm are very close to those of YOLO v5s. When the image contains many people and some are blocked by obstacles, YOLO v5s accurately detects the clustered crowd but misses the people blocked by obstacles. The proposed algorithm detects the people blocked by obstacles, but produces false detections where the crowd gathers. When people occupy only a small proportion of the image, both algorithms show missed and false detections to different degrees.

Table 1 shows the accuracy of the two algorithms after 200 epochs of training on the INRIA dataset. The precision P of YOLO v5s is 0.7 percentage points lower than that of the proposed algorithm, the recall R of the proposed algorithm is 5.3 percentage points lower than that of YOLO v5s, and the mAP of YOLO v5s is 2.1 percentage points higher than that of the proposed algorithm.

Table 1. Comparison of detection accuracy of different algorithms

Algorithm   P (%)   R (%)   mAP (%)
YOLO v5s    93.7    90.4    95.6
Ours        94.4    85.1    93.5

Table 2 lists the model sizes and speed test results of the two algorithms. YOLO v5s lags behind the proposed algorithm in model size, number of parameters and inference time per image. The YOLO v5s model is 13.7 MB with 7012.9k parameters, whereas the proposed model is only 6.94 MB with 3408.8k parameters. On the GTX950M graphics card, YOLO v5s takes 57.5 ms per image and the proposed algorithm takes 37.2 ms. According to the data in the table, the model size, number of parameters and inference time of the proposed algorithm are only 50.1%, 48.6% and 64.7% of those of YOLO v5s.

Table 2. Comparison of detection speed between different algorithms

Algorithm   Size      Parameters   Inference time
YOLO v5s    13.7 MB   7012.9k      57.5 ms
Ours        6.94 MB   3408.8k      37.2 ms

The data in Table 1 and Table 2 and the detection results in Fig. 4 show that the detection accuracy of YOLO v5s is slightly higher than that of the proposed algorithm, but the proposed algorithm has obvious advantages in model size and detection speed. Considering its faster detection speed, the loss of a small part of mAP is acceptable.
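The relative figures quoted for Table 2 can be recomputed from the table entries, as in the sketch below. The dictionary keys are hypothetical names, and the inputs are the rounded values printed in the table, so small deviations from the percentages quoted in the text are possible. In particular, the size ratio recomputed this way comes out near 50.7% rather than 50.1%, which may reflect rounding of the reported model sizes.

```python
# Values copied from Table 2 (key names are illustrative).
yolo_v5s = {"size_mb": 13.7, "params_k": 7012.9, "time_ms": 57.5}
ours = {"size_mb": 6.94, "params_k": 3408.8, "time_ms": 37.2}

# Proposed model as a percentage of YOLO v5s, one decimal place.
ratios = {key: round(100 * ours[key] / yolo_v5s[key], 1) for key in yolo_v5s}

# Throughput implied by the per-image inference time.
fps = {name: 1000 / model["time_ms"]
       for name, model in (("YOLO v5s", yolo_v5s), ("Ours", ours))}

print(ratios)
print({name: round(value, 1) for name, value in fps.items()})
```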
5 Conclusion

Based on the YOLO v5 target detection algorithm, the improved algorithm in this paper replaces the backbone of YOLO v5 with a Shufflenet v2 network that has an enlarged receptive field and a simplified structure. Experiments on the INRIA dataset show that the algorithm improves the speed of target detection while largely preserving the detection accuracy of the network. Therefore, this algorithm can be applied to pedestrian detection with high real-time requirements.

Acknowledgements. This paper is supported by the Shandong Key Technology R&D Program 2019JZZY021005, Natural Science Foundation of Shandong ZR2020MF067 and Natural Science Foundation of Shandong Province ZR2021MF074.
References
1. Wang, W., Ga, l., Wu, S., Zhao, Y.: A review of pedestrian detection. Motorcycle Technol. (01), 29–32 (2019)
2. Adouani, A., Henia, W.M.B., Lachiri, Z.: Comparison of Haar-like, HOG and LBP approaches for face detection in video sequences. In: 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD), Istanbul, pp. 266–271. IEEE (2019)
3. Ning, C., Menglu, L., Hao, Y., et al.: Survey of pedestrian detection with occlusion. Complex Intell. Syst. 7(1), 577–587 (2021)
4. Zhou, W., Gao, S., Zhang, L., et al.: Histogram of oriented gradients feature extraction from raw Bayer pattern images. IEEE Trans. Circuits Syst. II Express Briefs 67(5), 946–950 (2020)
5. Ji, C., Zou, X., Hu, Y., et al.: XG-SF: an XGBoost classifier based on shapelet features for time series classification. Procedia Comput. Sci. 147, 24–28 (2019)
6. Humeau-Heurtier, A.: Texture feature extraction methods: a survey. IEEE Access 7, 8975–9000 (2019)
7. Zebari, R., Abdulazeez, A., Zeebaree, D., et al.: A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 1(2), 56–70 (2020)
8. Hu, R., Zhu, X., Zhu, Y., et al.: Robust SVM with adaptive graph learning. World Wide Web 23(3), 19455–21968 (2020)
9. Bharati, P., Pramanik, A.: Deep learning techniques—R-CNN to mask R-CNN: a survey. Comput. Intell. Pattern Recognit., 657–668 (2020)
10. Maity, M., Banerjee, S., Chaudhuri, S.S.: Faster R-CNN and yolo based vehicle detection: a survey. In: 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, pp. 1442–1447. IEEE (2021)
11. Ma Zhao, N., Chai, L., Feng, Z.: Overview of target detection algorithms for in-depth learning. Inf. Rec. Mater. 23(10), 1–4 (2022)
12. Fu, M., Deng, Z., Zhang, D.: Overview of image target detection algorithms based on deep neural network. Comput. Syst. Appl. 31(07), 35–45 (2022)
Research on the Verification Method of Capillary Viscometer Based on Connected Domain
Rongyao Jing¹, Kun Zhang², Qinjun Zhao¹(B), Tao Shen¹, Kehua Du¹, Lei Zhao¹, and Shijian Hu¹
¹ University of Jinan, Jinan 250022, China
[email protected]
² The 53rd Research Institute, China North Industries Group, Jinan 250000, China
Abstract. To solve the problem that dust in the insulation cabinet is mistakenly identified as the liquid level during automatic verification of capillary viscometers, this paper studies a verification method for capillary viscometers based on connected domains. A dust recognition method based on connected domains is added to the common computer-vision-based automatic verification system for capillary viscometers. Industrial cameras acquire video images of the viscometer in real time. To make the images clearer and speed up processing, the acquired images are first preprocessed by ROI selection, frame differencing to capture moving targets, binarization, erosion and dilation. The preprocessed image is then labelled with connected domains. By exploiting the differences between the connected domain parameters of the liquid level and those of dust, the problem of misidentifying dust as the liquid level is solved. The experimental results show that the time repeatability and constant reproducibility of the proposed method are better than those of the ordinary verification method based on computer vision, which reduces the probability of misidentifying dust as the liquid level and improves the accuracy and efficiency of capillary viscometer verification.
Keywords: Connected domain · Liquid level · Dust · Capillary viscometer · Verification
© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 279–287, 2024. https://doi.org/10.1007/978-3-031-50580-5_24

1 Introduction

Viscosity is the resistance of a fluid to flow. Accurate measurement of viscosity is of great significance in many scientific research fields [1, 2]. The traditional measurement methods are mainly the capillary method, the rotation method and the vibration method. Among them, the capillary method has become the most widely used for liquid viscosity measurement because of its high accuracy and simple structure. To ensure the measurement accuracy of a capillary viscometer, it must pass the verification process stipulated by the state, which judges whether the capillary viscometer is qualified. Under constant temperature conditions, a certain amount of standard fluid with known kinematic viscosity falls from the upper scale line to the lower scale line under gravity, and the time taken for the liquid level to fall from the upper scale line to the lower scale line is measured. The verification of the viscometer can then be completed according to the requirements of the JJG 155-2016 verification regulation for working capillary viscometers.

At present, the verification of capillary viscometers usually uses a method based on computer vision to detect the liquid level of the standard liquid [3], and an external timer measures the time for the liquid level to fall from the upper scale line to the lower scale line. Because verification must be carried out at constant temperature, the viscometer is placed in an insulation cabinet. However, there is a lot of dust in the insulation cabinet, which affects the detection of the liquid level and can be misidentified as the liquid level. Therefore, a capillary viscometer verification and dust identification system based on connected domains is studied in this paper. Connected domain analysis is added to the traditional computer-vision-based automatic verification system to eliminate the influence of dust on liquid level detection, which improves the efficiency and accuracy of viscometer verification.
2 Liquid Level and Dust Detection Based on Connected Domain

2.1 Basic Principles of Connected Domain

A connected domain is a region of an image composed of adjacent pixels with the same pixel value. A connected domain generally contains only one pixel value, so to prevent fluctuations in pixel values from affecting the extraction of different connected domains, connected domain analysis is usually applied to the binarized image. A binary image can contain multiple connected domains, and any two connected domains neither overlap nor are adjacent. There are two traditional connected domain labelling algorithms: four-neighbourhood labelling and eight-neighbourhood labelling. Four-neighbourhood labelling determines connectivity according to four directions (up, down, left and right). Eight-neighbourhood labelling determines connectivity according to eight directions (up, down, left, right, upper left, lower left, upper right and lower right). As shown in Fig. 1, if any pixel labelled b, d, e or g has the same pixel value as the pixel labelled 1, it is considered connected with 1. As shown in Fig. 2, if any pixel labelled a, b, c, d, e, f, g or h has the same pixel value as the pixel labelled 2, it is considered connected with 2.
Fig. 1. Four neighbourhood connected domain labelling
Fig. 2. Eight neighbourhood connected domain labelling
In the binary image, each region composed of pixels that are connected under four-neighbourhood or eight-neighbourhood labelling is marked with a different label, which is called connected domain labelling. Each target region in the binary image is thus assigned a unique marker, and the relevant parameters of the region corresponding to each marker are then calculated. Figure 3 shows an image before connected domain labelling. There are seven white areas in the image, and they are not connected. Figure 4 shows the image after connected domain labelling. The seven white areas are marked with seven different labels, “1, 2, 3, 4, 5, 6, 7”. With these labels, it is convenient to calculate the characteristics of the seven regions: the area of the connected domain, the width and height of its bounding rectangle, the y coordinate of its upper left corner, and the x and y coordinates of its centre point. By analyzing this information, the size, position and other properties of the objects in the image can be judged.

Fig. 3. Before mark
Fig. 4. After mark
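The labelling procedure described above can be sketched in plain Python with a breadth-first search. This is a generic illustration of connected domain labelling, not the implementation used in the paper. The function name, the statistics dictionary and the small test image are all hypothetical.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def label_components(image, neighbours=N8):
    """Label non-zero pixels; return the label map and per-label statistics."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    stats = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 or labels[r][c] != 0:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] != 0 and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            stats[next_label] = {
                "area": len(pixels),
                "left": min(xs), "top": min(ys),
                "width": max(xs) - min(xs) + 1,
                "height": max(ys) - min(ys) + 1,
                "centre": (sum(xs) / len(xs), sum(ys) / len(ys)),
            }
    return labels, stats


binary = [
    [255, 255, 0, 0, 0],
    [0, 0, 255, 0, 255],
    [0, 0, 0, 0, 255],
    [0, 0, 0, 0, 0],
]
_, stats8 = label_components(binary, N8)
_, stats4 = label_components(binary, N4)
print(len(stats8), len(stats4), stats8[1])
```

In the example image the pixel at row 1, column 2 touches the top-left region only diagonally, so it joins that region under eight-neighbourhood labelling but forms its own region under four-neighbourhood labelling.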
2.2 Image Preprocessing

This system uses a Hikvision industrial camera to collect images. The collected images contain noise, which increases the amount of calculation and reduces the accuracy of image analysis, so they are preprocessed. Preprocessing mainly includes ROI selection, frame differencing to capture moving targets, binarization, erosion and dilation. First, the ROI at the scale line of the viscometer is selected, as shown in Fig. 5, which reduces the amount of calculation and improves the processing speed. Then, the frame difference method [4, 5] is used to capture the moving target, here the liquid level, as shown in Fig. 6. The resulting image is unclear, so it is binarized [6, 7] so that it contains only the pixel values 0 and 255, as shown in Fig. 7. Finally, erosion removes the excess pixels at the edges and dilation expands the edges so that internal defects are filled and the liquid level becomes clearer, as shown in Fig. 8.
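The preprocessing chain can be sketched on small greyscale images stored as nested lists. The sketch assumes the frames are already cropped to the ROI, uses a 3 × 3 square structuring element, and picks a threshold of 30. All of these are illustrative assumptions rather than the settings of the paper. An image-processing library would normally replace these loops.

```python
def frame_difference(prev, curr):
    """Absolute per-pixel difference between two consecutive frames."""
    return [[abs(a - b) for a, b in zip(row_p, row_c)]
            for row_p, row_c in zip(prev, curr)]


def binarize(image, threshold):
    """Map pixels above the threshold to 255 and all others to 0."""
    return [[255 if v > threshold else 0 for v in row] for row in image]


def _window(image, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the border."""
    rows, cols = len(image), len(image[0])
    return [image[y][x]
            for y in range(max(r - 1, 0), min(r + 2, rows))
            for x in range(max(c - 1, 0), min(c + 2, cols))]


def erode(image):
    """Keep a pixel only if its whole neighbourhood is foreground."""
    return [[min(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def dilate(image):
    """Set a pixel if any pixel in its neighbourhood is foreground."""
    return [[max(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def preprocess(prev_roi, curr_roi, threshold=30):
    moving = binarize(frame_difference(prev_roi, curr_roi), threshold)
    return dilate(erode(moving))


# Toy frames: a 3x3 moving block plus one isolated noisy pixel.
prev = [[10] * 6 for _ in range(5)]
curr = [row[:] for row in prev]
for r in range(1, 4):
    for c in range(1, 4):
        curr[r][c] = 200
curr[0][5] = 90
mask = preprocess(prev, curr)
for row in mask:
    print(row)
```

In this toy case the isolated pixel disappears during erosion, while the 3 × 3 block shrinks to its centre and is restored by dilation.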
Fig. 5. The ROI region selection
Fig. 6. Liquid level capture
Fig. 7. Image after binarization
Fig. 8. Image after erosion and dilation

2.3 Use the Connected Domain to Mark the Liquid Level and Dust

The verification of the viscometer must be carried out under constant temperature conditions, so the viscometer is placed in a special insulation cabinet. After long-term use, a large amount of dust accumulates in the insulation cabinet, which affects the verification. With the frame difference method, both the liquid level and dust are detected as moving targets, so dust passing the scale line is mistakenly identified as the liquid level passing the scale line. To eliminate the influence of dust, the system labels the connected domains of the preprocessed image and decides whether a moving target is the liquid level according to the different connected domain parameters of the liquid level and dust. The liquid level after connected domain labelling is shown in Fig. 9.
Fig. 9. Image after connected domain mark
The liquid level and dust are mainly distinguished by the following parameters:
1. The y coordinate of the upper left corner of the connected domain. When the liquid level rises, this coordinate changes from large to small, and when the liquid level falls, it changes from small to large. For dust, it changes irregularly.
2. The x coordinate of the centre point of the connected domain. Near the scale line, the x coordinate of the centre point of the liquid level stays within a narrow range, whereas that of dust changes greatly.
3. The area of the connected domain. The area of the liquid level connected domain is usually relatively large, while that of dust is often relatively small.
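The three criteria above can be combined into a simple rule, sketched below. The track format, the function names and the thresholds `max_x_spread` and `min_area` are hypothetical and would have to be tuned for a real camera setup. The paper does not state the decision rule or thresholds it uses.

```python
def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


def looks_like_liquid_level(track, max_x_spread=5.0, min_area=50):
    """track: per-frame dicts with 'top', 'cx' and 'area' of one connected domain."""
    tops = [t["top"] for t in track]
    xs = [t["cx"] for t in track]
    areas = [t["area"] for t in track]
    return (is_monotonic(tops)                        # criterion 1: steady rise or fall
            and max(xs) - min(xs) <= max_x_spread     # criterion 2: stable x centre
            and min(areas) >= min_area)               # criterion 3: large area


# Made-up tracks over four frames.
liquid = [{"top": 10, "cx": 40.0, "area": 120},
          {"top": 14, "cx": 40.5, "area": 118},
          {"top": 19, "cx": 39.8, "area": 125},
          {"top": 25, "cx": 40.2, "area": 121}]
dust = [{"top": 12, "cx": 20.0, "area": 6},
        {"top": 9, "cx": 31.0, "area": 5},
        {"top": 15, "cx": 44.0, "area": 7},
        {"top": 11, "cx": 52.0, "area": 6}]

print(looks_like_liquid_level(liquid), looks_like_liquid_level(dust))
```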
3 Experimental Results and Analysis

To test the effectiveness of the capillary viscometer verification and dust recognition system based on connected domains, a Pinkevitch capillary viscometer is verified in accordance with the national regulations and standards, and two systems are compared. The experimental requirements are as follows:
1. The water bath temperature is kept stable at 20 °C, with fluctuations within 0.01 °C;
2. The inner diameter of the Pinkevitch capillary viscometer to be verified is appropriate;
3. Repeated experiments are designed and the repeatability is calculated.

According to these requirements, comparison experiments are designed between the ordinary automatic verification system, that is, the automatic verification system based on computer vision, and the same system with the dust identification method of this paper added. A Pinkevitch capillary viscometer with an inner diameter of 0.8 mm was verified using standard fluids with kinematic viscosities of 10.10 mm²·s⁻¹ and 20.94 mm²·s⁻¹. The repeatability of the outflow time and the reproducibility of the viscometer constant were taken as comparison parameters. The verification results are shown in Table 1.

Table 1. Comparison of experimental results

Experimental parameters                             Added dust identification    General automatic verification
                                                    Group 1      Group 2         Group 1      Group 2
Kinematic viscosity of standard liquid (mm²·s⁻¹)    10.10        20.94           10.10        20.94
Outflow time t_i (s)                        1st     282.06       584.72          282.07       584.67
                                            2nd     282.03       584.70          261.05       530.13
                                            3rd     282.00       584.64          282.03       584.63
                                            4th     282.10       584.62          282.01       584.65
Mean value of the outflow time t (s)                282.05       584.67          276.79       571.02
Time repeatability δ_i                              0.04%        0.02%           7.59%        9.55%
Viscometer constant C_i (mm²·s⁻²)                   0.035809     0.035815        0.036490     0.036671
Constant reproducibility δ_C                        0.02%                        0.49%
Viscometer constant C (mm²·s⁻²)                     0.03581                      0.03658
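The derived rows of Table 1 can be recomputed from the outflow times, as in the sketch below. The formulas are assumptions inferred from the numbers in the table rather than quoted from the regulation: time repeatability is taken as (max − min) / mean, the viscometer constant as kinematic viscosity divided by mean outflow time, and constant reproducibility as the difference of the two constants relative to their mean. The results agree with the table to within rounding.

```python
def mean(values):
    return sum(values) / len(values)


def time_repeatability(times):
    """Assumed definition: (max - min) / mean, in percent."""
    return 100 * (max(times) - min(times)) / mean(times)


def viscometer_constant(viscosity, times):
    """Assumed definition: kinematic viscosity / mean outflow time (mm^2/s^2)."""
    return viscosity / mean(times)


def constant_reproducibility(c1, c2):
    """Assumed definition: |C1 - C2| relative to their mean, in percent."""
    return 100 * abs(c1 - c2) / mean([c1, c2])


# Outflow times from Table 1, keyed by the viscosity of the standard liquid.
with_dust_id = {10.10: [282.06, 282.03, 282.00, 282.10],
                20.94: [584.72, 584.70, 584.64, 584.62]}
general = {10.10: [282.07, 261.05, 282.03, 282.01],
           20.94: [584.67, 530.13, 584.63, 584.65]}


def summarise(groups):
    constants = [viscometer_constant(nu, t) for nu, t in groups.items()]
    return {
        "repeatability": [time_repeatability(t) for t in groups.values()],
        "constants": constants,
        "reproducibility": constant_reproducibility(*constants),
        "constant": mean(constants),
    }


for name, groups in (("added dust identification", with_dust_id),
                     ("general automatic verification", general)):
    print(name, summarise(groups))
```

The second measurement in each group of the general system (261.05 s and 530.13 s) is what drives its repeatability above 7%.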
The experimental data show that the time measurements after adding the dust recognition system are uniform and do not fluctuate greatly. In the ordinary automatic verification system, however, dust interferes with liquid level detection: dust passing the scale line is mistakenly identified as the liquid level passing the scale line, which produces large errors in the start and end times. The time repeatability and constant reproducibility of the capillary viscometer verification and dust recognition system based on connected domains are superior to those of the ordinary automatic verification system and meet the requirements of the regulation, which verifies the feasibility and effectiveness of the dust recognition system and shows that it can effectively eliminate the interference of dust with liquid level recognition. In conclusion, the stability and accuracy of the proposed system are better than those of the ordinary automatic verification system, the interference of dust with liquid level recognition is basically eliminated during verification, and the probability of misidentifying dust as the liquid level is greatly reduced.
4 Conclusion

To eliminate the influence of dust on liquid level detection in capillary viscometer verification, this paper studies a capillary viscometer verification and dust recognition system based on the connected domain. The dust recognition system based on the connected domain was added to the ordinary automatic verification method for capillary viscometers based on computer vision. The verification results show that the proposed system is superior to the ordinary automatic verification method, and that the measurement results meet the requirements of the JJG 155-2016 verification regulation for working capillary viscometers. At the same time, the proposed system greatly reduces the probability of misidentifying dust as the liquid level and greatly improves the accuracy and efficiency of verification.

Acknowledgements. This paper is supported by the Shandong Key Technology R&D Program 2019JZZY021005, Natural Science Foundation of Shandong ZR2020MF067, and Natural Science Foundation of Shandong Province ZR2021MF074.
References

1. Akbay, C., Koçak, O.: Vibrational viscosimeter design for biomedical purposes. In: 2018 Medical Technologies National Congress (TIPTEKNO), pp. 1–4. IEEE (2018)
2. Kang, D., Wang, W., Lee, J., et al.: Measurement of viscosity of unadulterated human whole blood using a capillary pressure-driven viscometer. In: 10th IEEE International Conference on Nano/Micro Engineered and Molecular Systems, pp. 1–4. IEEE (2015)
3. Wang, H., Zhang, K., Xun, Q., et al.: Ubbelohde viscometer verification method and system based on computer vision. In: 2021 China Automation Congress (CAC), pp. 5501–5506. IEEE (2021)
4. Husein, A.M., Halim, D., Leo, R.: Motion detect application with frame difference method on a surveillance camera. J. Phys. Conf. Ser. 1230(1), 012017 (2019)
5. Huiying, D., Xuejing, Z.: Detection and removal of rain and snow from videos based on frame difference method. In: The 27th Chinese Control and Decision Conference (2015 CCDC), pp. 5139–5143. IEEE (2015)
6. Haque, M., Afsha, S., Ovi, T.B., et al.: Improving automatic sign language translation with image binarisation and deep learning. In: 2021 5th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), pp. 1–5. IEEE (2021)
7. Bipin Nair, B.J., Unni Govind, S., Nihad Abdulla, V.A., Akhil, A.: A novel binarization method to remove verdigris from ancient metal image. In: 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 884–888. IEEE (2021)
Research on License Plate Recognition Methods Based on YOLOv5s and LPRNet

Shijian Hu1, Qinjun Zhao1(B), Shuo Li2, Tao Shen1, Xuebin Li1, Rongyao Jing1, and Kehua Du1
1 University of Jinan, Jinan 250022, China
[email protected]
2 Audit Bureau of Postal Savings Bank of China, Jinan 250000, China
Abstract. License plate recognition technology is being applied more and more widely. To meet the speed and accuracy requirements of license plate recognition, this paper proposes a license plate recognition method based on the YOLOv5s and LPRNet models. First, the YOLOv5s model is used as the detection module. The detection results are then used as the input of the license plate recognition module, whose main part is the LPRNet model, and finally the license plate recognition results are output. The experimental results show that, compared with three other models for license plate recognition, the method based on YOLOv5s and LPRNet proposed in this paper is superior in both accuracy and speed, and the overall recognition rate of license plates is increased to 93%.

Keywords: Deep learning · License plate detection · License plate recognition · YOLOv5s · LPRNet
1 Introduction

Although automatic license plate recognition (LPR) plays a vital role in many practical applications, traditional algorithms have significant limitations: they typically assume fixed license plate types, limited vehicle speed, and a fixed background. To relax these constraints on the working environment, this paper proposes a new license plate recognition method composed of a license plate detection module and a license plate number recognition module. The former uses the YOLOv5s model to detect license plates in input images, while the latter uses the LPRNet model as its main part and aims to recognize the digits, English letters, and Chinese characters on license plates. Model training and data verification are conducted for each module.
2 Related Work

Kanayama, Lee, Matas et al. did extensive research. They used the color attributes and features of the vehicle itself and found that the Sobel operator could produce distinct features at the edges of the image. With these features, a preliminary license plate recognition effect can be achieved.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 288–297, 2024. https://doi.org/10.1007/978-3-031-50580-5_25
Although these methods help to some extent in identifying deformed and covered license plates, their recognition rate is low. Since the beginning of the 21st century, with the rapid development of machine vision, deep learning, and artificial intelligence, neural networks have shown better performance in practical license plate detection. Because vehicle license plates are often affected by stains, illumination, rust, deformation, etc. [1], defects or opaque marks appear after character segmentation, which seriously affects the accuracy of the feature description. Moreover, Chinese standard vehicle license plates are composed of Chinese characters, English letters, and Arabic numerals, which further increases the technical difficulty.
3 YOLOv5s and LPRNet Model Introduction

To improve the speed and accuracy of license plate recognition, this paper uses the YOLOv5s model from the YOLOv5 family together with the LPRNet model. The method consists of two parts: YOLOv5s performs image segmentation and license plate position detection, while LPRNet recognizes the license plate characters, corrects distortion, and eliminates noise points. YOLOv5s is a single-stage object detection algorithm that balances speed and accuracy. It raises detection speed while preserving accuracy, and it improves the accuracy of the YOLO series on small targets.

3.1 YOLOv5s Basic Network Structure

YOLOv5s is made up of four parts: Input, Backbone, Neck, and Prediction. Earlier algorithms of the YOLO series scale the original picture to a standard size before feeding it into the detection network. In practice, however, many images have different aspect ratios, so the size of the black borders on the two sides differs after scaling and padding. Too much padding introduces redundant information and slows inference. The code in YOLOv5s therefore modifies the letterbox function so that the padding adapts to the original image and the black borders become narrower. The Mosaic method is then used for data augmentation. Mosaic was proposed in 2019 as an improvement on the CutMix augmentation method: instead of two pictures, four pictures are selected and randomly cropped, randomly combined, and randomly arranged. This produces a large amount of data containing small targets, which enlarges the data set and improves the ability to detect small objects [2].

In the Backbone, YOLOv5s mainly adds a Focus structure and a CSP structure. The Focus structure, shown in Fig. 1, is similar to the PassThrough layer of YOLOv2. It moves information from the width and height dimensions into the channel dimension and then extracts different features by convolution. The Focus structure is used for downsampling (in neural networks, downsampling mainly serves to reduce the number of parameters, reduce dimensionality, and increase local sensitivity). Compared with algorithms that downsample with a convolution layer or pooling layer with a stride of 2, the Focus layer can effectively reduce the information loss caused by downsampling and reduce the amount of computation [3].
Fig. 1. Structure of Focus
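The slicing step of the Focus structure can be sketched with NumPy as follows. This is a minimal illustration, not the YOLOv5s source: it shows only the rearrangement of width and height information into channels (the subsequent convolution is omitted), and the order in which the four slices are concatenated is an assumption.

```python
import numpy as np

def focus_slice(x):
    """Rearrange a (C, H, W) array into (4C, H/2, W/2) without discarding pixels."""
    assert x.shape[1] % 2 == 0 and x.shape[2] % 2 == 0, "H and W must be even"
    return np.concatenate(
        [x[:, 0::2, 0::2],   # even rows, even columns
         x[:, 1::2, 0::2],   # odd rows, even columns
         x[:, 0::2, 1::2],   # even rows, odd columns
         x[:, 1::2, 1::2]],  # odd rows, odd columns
        axis=0,
    )

image = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
sliced = focus_slice(image)
print(image.shape, "->", sliced.shape)  # (3, 4, 4) -> (12, 2, 2)
```

Every input pixel is kept in one of the four slices, which is why this form of downsampling loses less information than a stride-2 pooling layer that discards or merges values.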
There are two CSP structures in YOLOv5s. The CSP1_X structure, shown in Fig. 2, is applied in the Backbone, and the CSP2_X structure is applied in the Neck. When the input pictures are 608*608, the feature map size changes as follows: 608, 304, 152, 76, 38, 19. After five downsampling stages, a feature map of size 19*19 is obtained. The CSP structure added to the Backbone network enhances the learning ability of the CNN, maintaining accuracy while keeping the network lightweight, reducing computing bottlenecks, and cutting memory costs [4].
Fig. 2. Structure of CSP1_X
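The sequence of feature map sizes quoted above follows from repeatedly halving the spatial resolution. A short sketch, assuming each of the five stages downsamples by a factor of 2 (the function name is hypothetical):

```python
def feature_map_sizes(input_size, num_downsamples, stride=2):
    """List the spatial size after each stride-2 downsampling stage."""
    sizes = [input_size]
    for _ in range(num_downsamples):
        sizes.append(sizes[-1] // stride)
    return sizes

print(feature_map_sizes(608, 5))  # [608, 304, 152, 76, 38, 19]
```

The overall reduction factor is 2^5 = 32, so a 608*608 input yields a 19*19 feature map.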
The Neck layer of YOLOv5s uses an FPN + PAN structure to fuse multi-scale feature information. The FPN layer passes strong semantic features from top to bottom and fuses them with the features extracted by the Backbone, while the PAN layer passes strong localization features from bottom to top and fuses them with the features produced by the FPN layer. With the FPN + PAN structure, the features extracted by the Backbone and the Neck are combined to improve the feature fusion capability of the network. YOLOv5s also adds a CSP2_X structure, shown in Fig. 3, to the Neck layer to further improve feature fusion.

Fig. 3. Structure of CSP2_X

The Prediction layer of YOLOv5s comprises the bounding box loss function and NMS. IOU, a quantity used in training, is the overlap ratio between the predicted candidate box and the ground-truth box; its value ranges from 0 to 1. YOLOv5s selects CIOU_Loss as its bounding box loss function. Compared with the traditional loss function, CIOU_Loss in essence refines the intersection-over-union formula:

$$\mathrm{CIOU\_Loss} = 1 - \mathrm{CIOU} = 1 - \left(\mathrm{IOU} - \frac{\mathrm{Distance\_2}^{2}}{\mathrm{Distance\_C}^{2}} - \frac{v^{2}}{(1 - \mathrm{IOU}) + v}\right) \qquad (1)$$

In this formula, v measures the consistency of the aspect ratios, Distance_2 is the distance between the centers of the prediction box and the ground-truth box, and Distance_C is the diagonal length of the smallest rectangle enclosing the prediction box and the ground-truth box. Because many candidate boxes are generated, weighted NMS (Non-Maximum Suppression) is used to filter them. NMS is generally applied after model prediction to eliminate redundant boxes, which improves speed and accuracy [5].

3.2 License Plate Character Recognition Model Using LPRNet

License plate recognition faces many difficulties, such as lighting conditions, weather conditions, image deformation, and license plate tilt angle. A robust character recognition system should handle recognition tasks in different environments without losing accuracy. LPRNet does not require license plate character segmentation. It is an end-to-end algorithm that adapts well to the environment. The main network structure is shown in Table 1. Each small basic block includes four convolutional layers, one input layer, and one feature output layer, and its structure is shown in Table 2. An LPRNet model deployed on an embedded system can therefore perform well in recognizing relatively complex Chinese license plates. The LPRNet backbone network takes the raw RGB image as input and computes spatially distributed rich features.
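Returning to the bounding box loss of Eq. (1) in Sect. 3.1, the computation can be sketched in plain Python. This is a sketch under stated assumptions rather than the YOLOv5s implementation: boxes are given as (x1, y1, x2, y2) corners, and v is taken from the usual CIoU definition, v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))², which the text above does not spell out.

```python
import math

def ciou_loss(pred, truth):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = truth

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # Distance_2 squared: squared distance between the two box centers.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Distance_C squared: squared diagonal of the smallest enclosing rectangle.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Aspect-ratio consistency term (assumed standard CIoU form).
    v = 4 / math.pi ** 2 * (math.atan((gx2 - gx1) / (gy2 - gy1))
                            - math.atan((px2 - px1) / (py2 - py1))) ** 2
    denom = (1 - iou) + v
    shape_term = v * v / denom if denom > 0 else 0.0  # identical boxes give 0/0

    return 1 - (iou - d2 / c2 - shape_term)

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap, same aspect ratio
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # no overlap
```

Unlike a plain 1 − IOU loss, the value keeps growing as two non-overlapping boxes move apart, because the center-distance term stays informative when IOU is 0.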
A wide convolution (with a 1*13 kernel) uses the local character context in place of an LSTM-based RNN, which removes the dependence on RNNs. The output of the backbone subnetwork can be viewed as a sequence of character probabilities whose length corresponds to the width of the input image. Because the decoder output does not correspond to the length of the target sequence, the CTC loss function is introduced, which allows end-to-end training without segmentation [6]. The CTC loss function is widely used to handle misalignment between input and output sequences. In summary, the raw RGB image is fed to the CNN backbone to extract image features, a context-dependent 1*13 kernel replaces the LSTM-based RNN, and the output of the backbone subnetwork is a sequence of character probabilities whose length is related to the width of the input image [7]. Because the length of the network output is not equal to the license plate length, this experiment adopts CTC loss for end-to-end training.

Table 1. LPRNet network backbone

| Layer type | Parameters |
|---|---|
| Input | 94*24 pixels RGB image |
| Convolution | #64, 3*3, stride:1 |
| MaxPooling | #64, 3*3, stride:1 |
| Small basic block | #128, 3*3, stride:1 |
| MaxPooling | #64, 3*3, stride:1 |
| Small basic block | #256, 3*3, stride:1 |
| Small basic block | #256, 3*3, stride:1 |
| MaxPooling | #64, 3*3, stride:2*1 |
| Dropout | 0.5 ratio |
| Convolution | #256, 4*1, stride:1 |
| Dropout | 0.5 ratio |
| Convolution | #class_number, 1*13, stride:1 |
Table 2. Small basic block structure

| Layer type | Parameters |
|---|---|
| Input | Cin*H*W feature map |
| Convolution | #Cout/4, 1*1, stride:1 |
| Convolution | #Cout/4, 3*1, strideh:1, padh:1 |
| Convolution | #Cout/4, 1*3, stridew:1, padw:1 |
| Convolution | #Cout, 1*1, stride:1 |
| Output | Cout*H*W feature map |
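To see why the small basic block of Table 2 is lightweight, its parameter count can be compared with that of a single 3*3 convolution with the same input and output channels. The sketch below is a back-of-the-envelope calculation that assumes every convolution has a bias term and ignores any normalization layers; the function names are hypothetical.

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """Number of weights (and optional biases) in one 2-D convolution layer."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)

def small_basic_block_params(c_in, c_out):
    """Parameter count of the four convolutions listed in Table 2."""
    mid = c_out // 4
    return (conv_params(c_in, mid, 1, 1)     # 1*1 channel reduction
            + conv_params(mid, mid, 3, 1)    # 3*1 convolution
            + conv_params(mid, mid, 1, 3)    # 1*3 convolution
            + conv_params(mid, c_out, 1, 1)) # 1*1 channel expansion

block = small_basic_block_params(64, 128)
plain = conv_params(64, 128, 3, 3)
print(block, plain, round(plain / block, 1))  # 12512 73856 5.9
```

For the first small basic block of Table 1 (64 input channels, 128 output channels), the factorized block needs roughly one sixth of the parameters of a plain 3*3 convolution under these assumptions.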
3.3 YOLOv5s-LPRNet License Plate Recognition Model

To cope with complex license plate recognition environments, YOLOv5s and LPRNet are combined into the vehicle license plate recognition system shown in Fig. 4. Before training, the data set is sorted and reasonably partitioned. LPRNet does not need extensive training: a single training run per license plate type is sufficient. If the trained model recognizes plates well, later training time can be reduced. The trained license plate detection network YOLOv5s is connected to the character recognition network LPRNet: an image is input, detection boxes are extracted by the detection network, and the output candidate boxes are used as the input of LPRNet [8]. LPRNet first extracts character information from the detected candidate boxes with its backbone network and then uses CTC with beam search to output the final license plate number.

Feature extraction is carried out by the LPRNet backbone network, and the license plate character sequence is obtained through the convolution kernel. Because the LPRNet output and the target sequence differ in length, CTC is used. With SoftmaxLoss, each character would have to correspond to one output column, that is, the position of every character in every training image would have to be labeled, which is very time-consuming. CTC, by contrast, can handle the misalignment between the labels and the network output [9].

Beam search evaluates each candidate node with a heuristic function. It is often used when the solution space of the graph is very large, because pruning poor-quality nodes as the search depth grows reduces both memory and time consumption. LPRNet uses beam search to obtain the top n most probable sequences and returns the first one that matches a template set based on the motor vehicle license plate standard. The beam-size argument limits n, the number of candidates kept at each step.
Fig. 4. Process of license plate recognition system
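How a CTC output sequence is turned into a plate string can be illustrated with greedy (best-path) decoding, the simplest alternative to the beam search described above. This is a sketch only, not the LPRNet decoder: the alphabet, the blank index, and the frame scores are made up for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Pick the best class per frame, collapse repeats, then drop blanks."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)

# Hypothetical alphabet; index 0 is the CTC blank symbol.
alphabet = ["-", "A", "B", "8"]
# Ten frames whose best path is A A - B - 8 8 8 - 8.
path = [1, 1, 0, 2, 0, 3, 3, 3, 0, 3]
frame_scores = [[1.0 if j == i else 0.0 for j in range(len(alphabet))] for i in path]

print(ctc_greedy_decode(frame_scores, alphabet))  # AB88
```

The blank symbol is what lets CTC emit a repeated character such as "88": consecutive identical frames collapse into one character unless a blank separates them. This is also why no per-character position labels are needed during training.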
4 Experiment and Analysis

The YOLOv5s and LPRNet models are used to first detect the license plate and then recognize the text on it. First, the overall model of YOLOv5s and LPRNet is constructed to achieve end-to-end recognition. Then, the whole image is sent to the neural network for license plate detection and recognition. Finally, the license plate positioning accuracy of YOLOv5s is improved by optimizing the loss function and the data augmentation in the YOLOv5s network.
In this paper, the CCPD [10] license plate data set is used. The data are augmented and the YOLOv5s and LPRNet models are retrained, which both improves the accuracy of the experimental results and prevents overfitting. The data set was first augmented and then annotated with labelme. Finally, the labelme XML annotations were converted into the txt format recognized by YOLOv5, and the training parameters were configured (Table 3).

Table 3. Hyperparameter setting

| Parameter name | Parameter setting |
|---|---|
| img | 640*640 |
| Batch size | 16 |
| Epochs | 300 |
| Optimizer | Adam |
| Initial learning rate | 0.10 |
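The conversion of box annotations into the txt label format used by YOLOv5 can be sketched as follows. The example assumes the annotation provides pixel corner coordinates (xmin, ymin, xmax, ymax); reading the annotation file itself is omitted, and the function name and sample values are hypothetical. A YOLO label line holds the class index followed by the box center, width, and height, all normalized by the image size.

```python
def to_yolo_label(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to one normalized YOLO label line."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A made-up plate box in a 640*640 image, class 0 = license plate.
line = to_yolo_label(0, 160, 400, 480, 500, 640, 640)
print(line)  # 0 0.500000 0.703125 0.500000 0.156250
```

One such line is written per object, in a txt file that shares its base name with the image.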
In Table 3, img is the pixel size of the input image; images fed to YOLOv5s must be resized to this specification so that they enter the neural network correctly for training. Batch size is the number of pictures processed in a single training step, and epochs is the number of training rounds. In the first stage, training YOLOv5 to locate license plates took 4 h. Data augmentation increases the size of the data set and the labeling accuracy of the data, and the model is optimized to raise the overall recognition accuracy. In the experiment, 2500 pictures were randomly selected from the CCPD data set, with 80% used as training data and 20% as test data. The Adam optimizer is used, and the learning rate decreases as training proceeds. Taking 100 epochs as an example, the change in learning rate is shown in Fig. 5.
Fig. 5. Learning rate curve
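A decaying learning rate of this kind can be sketched as below. The text does not state which schedule produced Fig. 5, so a cosine decay is assumed here purely for illustration. The initial value 0.10 is taken from Table 3, while the final fraction of 0.1 is a hypothetical choice.

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.10, final_factor=0.1):
    """Cosine decay from lr0 down to lr0 * final_factor over total_epochs."""
    cosine = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr0 * (final_factor + (1 - final_factor) * cosine)

schedule = [cosine_lr(epoch, 100) for epoch in range(101)]
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(schedule[epoch], 4))
```

The rate falls slowly at first, fastest around the midpoint, and flattens near the end of training.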
Subsequently, the network hyperparameters are adjusted repeatedly. The main hyperparameters adjusted are the weight decay and the learning rate. The learning rate is an indispensable hyperparameter, and adjusting it according to training performance can effectively reduce model overfitting. After 300 epochs of training, the YOLOv5s model has fully converged for license plate detection. During training, accuracy improved steadily. The training and test results are shown in Fig. 6.
Fig. 6. Test results
As Fig. 6 shows, the YOLOv5s model performs well after 300 epochs. The LPRNet model is then added, and the recognition results are shown in Fig. 7. The output license plate numbers are “ AC8775”, “ AU7526”, “ FV8608”, “ Q79867”, “ B911ZB”, “ A1035D”, “ K12345”, “ AUV027”, “ HK1777”. The effects of different models on license plate recognition are shown in Table 4. On the CCPD license plate data set, the model proposed in this paper reaches a recognition frame rate of about 40 frames per second on GPU, and its recognition accuracy is 3.3, 9.1, and 2.1 percentage points higher than that of License_plate, YOLOv4, and YOLOv5s, respectively. The experimental data show that the lightweight YOLOv5s + LPRNet model proposed in this paper greatly improves the speed, and hence the real-time performance, of license plate recognition on both GPU and CPU while maintaining a recognition rate of 93.6%, which has practical application value for license plate recognition in future intelligent transportation.
Fig. 7. Recognition renderings
Table 4. Recognition rate corresponding to each model

| Model | GPU GTX3060 FPS (frames/s) | CPU i7-12700H FPS (frames/s) | Recognition rate (%) |
|---|---|---|---|
| License_plate | 4.41 | 0.54 | 90.3 |
| YOLOv4 | 15.63 | 1.21 | 84.5 |
| YOLOv5s | 29.12 | 4.25 | 91.5 |
| YOLOv5s + LPRNet | 40.65 | 11.65 | 93.6 |
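The accuracy gains and speed-ups quoted in the text can be checked directly against Table 4. The short script below only restates the table values; the dictionary layout and variable names are illustrative.

```python
# (GPU FPS, CPU FPS, recognition rate in %) copied from Table 4.
results = {
    "License_plate": (4.41, 0.54, 90.3),
    "YOLOv4": (15.63, 1.21, 84.5),
    "YOLOv5s": (29.12, 4.25, 91.5),
    "YOLOv5s + LPRNet": (40.65, 11.65, 93.6),
}

proposed_gpu, proposed_cpu, proposed_rate = results["YOLOv5s + LPRNet"]
accuracy_gain = {}
for model, (gpu, cpu, rate) in results.items():
    if model == "YOLOv5s + LPRNet":
        continue
    accuracy_gain[model] = round(proposed_rate - rate, 1)
    print(f"{model}: +{accuracy_gain[model]} points, "
          f"GPU x{proposed_gpu / gpu:.2f}, CPU x{proposed_cpu / cpu:.2f}")
```

Relative to YOLOv5s alone, the combined model is about 1.4 times faster on GPU and about 2.7 times faster on CPU according to the table.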
5 Conclusion

In the continuously developing fields of modern intelligent transportation, traffic management, and traffic monitoring, automated systems can take over repetitive and complex management work, saving considerable labor and material resources. Automatic license plate recognition is one of the key technologies for implementing intelligent traffic management. This paper shows that license plate recognition can be realized with a small convolutional neural network. In complex driving environments, the lightweight LPRNet network selected in this paper shows strong environmental adaptability and good resistance to interference. Combined with the lightweight YOLOv5s network, it can be easily deployed on hardware while maintaining the recognition rate and accuracy, and it can largely meet the requirements of license plate recognition in different environments.
Acknowledgements. This paper is supported by the Shandong Key Technology R&D Program 2019JZZY021005, Natural Science Foundation of Shandong ZR2020MF067, Natural Science Foundation of Shandong Province ZR2021MF074, and Natural Science Foundation of Shandong Province ZR2022MF296.
References

1. Wang, L., Xu, J., Chen, S.: Design of license plate recognition platform in micro traffic environment. In: 2021 40th Chinese Control Conference (CCC), pp. 6668–6672. IEEE (2021)
2. Zhu, X., Lyu, S., Wang, X., et al.: TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2778–2788 (2021)
3. Xiaomeng, L., Jun, F., Peng, C.: Vehicle detection in traffic monitoring scenes based on improved YOLOV5s. In: 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), pp. 467–471. IEEE (2022)
4. Xiao, B., Guo, J., He, Z.: Real-time object detection algorithm of autonomous vehicles based on improved YOLOv5s. In: 2021 5th CAA International Conference on Vehicular Control and Intelligence (CVCI), pp. 1–6. IEEE (2021)
5. Zhang, X., Fan, H., Zhu, H., et al.: Improvement of YOLOV5 model based on the structure of multiscale domain adaptive network for crowdscape. In: 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS), pp. 171–175. IEEE (2021)
6. Lamberti, L., Rusci, M., Fariselli, M., et al.: Low-power license plate detection and recognition on a RISC-V multi-core MCU-based vision system. In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2021)
7. Nguyen, D.-L., Putro, M.D., Vo, X.-T., Jo, K.-H.: Triple detector based on feature pyramid network for license plate detection and recognition system in unusual conditions. In: 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), pp. 1–6. IEEE (2021)
8. Alborzi, Y., Mehraban, T.S., Khoramdel, J., Ardekany, A.N.: Robust real time lightweight automatic license plate recognition system for Iranian license plates. In: 2019 7th International Conference on Robotics and Mechatronics (ICRoM), pp. 352–356. IEEE (2019)
9. Luo, S., Liu, J.: Research on car license plate recognition based on improved YOLOv5m and LPRNet. IEEE Access, vol. 10, pp. 93692–93700 (2022)
10. Xu, Z., Yang, W., Meng, A., et al.: Towards end-to-end license plate detection and recognition: a large dataset and baseline. In: Computer Vision ECCV 2018. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_16
Research on Defective Apple Detection Based on Attention Module and ResNet-50 Network

Lei Zhao1, Zhenhua Li2, Qinjun Zhao1(B), Wenkong Wang1, Rongyao Jing1, Kehua Du1, and Shijian Hu1
1 University of Jinan, Jinan 250022, China
[email protected]
2 Shan Dong Cereals and Oils Detecting Center, Jinan 250012, China
Abstract. In defective apple detection, the stem and calyx are easily confused with defects, so the detection accuracy for defective apples is low. To solve these problems, this paper proposes a defective apple detection algorithm based on an attention module and the ResNet-50 network. The CAM attention module and the LeakyReLU activation function are used to optimize the ResNet-50 network, and the result is named the C-ResNet-50 network. During network training, we use the cosine decay learning rate method, which effectively reduces the oscillation of the training loss and accelerates network convergence. After training and validation of the C-ResNet-50 network, the detection accuracy for defective apples reaches 97.35%, which is 2.33% higher than that of the unimproved ResNet-50 network, 3.16% higher than that of the VGGNet network, and 4.14% higher than that of the AlexNet network. This shows that the C-ResNet-50 network can improve the accuracy of defective apple detection.

Keywords: Defective Apple Detection · ResNet-50 · Attention Module · LeakyReLU Activation Function
1 Introduction

Apples are eaten worldwide because of their pleasant flavor and high nutritional value. China has become the largest producer and consumer of apples in the world, and the apple industry holds an important position in the development of parts of China’s rural economy. Defective apple detection is an important part of apple production and sales. A batch of apples mixed with defective ones yields a lower sales profit for the whole batch, which in turn affects the income of fruit farmers. Traditional defective apple detection relies on manual selection, which consumes much manpower and cannot meet the demand for detection speed. Later, image processing methods were developed [1]. These methods are fast, but the stem and calyx have features similar to apple defects in the image, which reduces the accuracy of defective apple detection.

At present, deep learning is widely used in fruit defect detection because of its advantages in feature extraction [2]. Through model pre-training, automatic feature extraction, and continuous optimization of features, it can quickly process large amounts of data with better performance and higher accuracy. Some researchers have analyzed apple images and spectral images with deep learning to detect apple defects and have made some progress [3]. Xue Yong used a GoogLeNet deep transfer model to detect apple defects, and the test accuracy reached 91.91% [4]. Sovon Chakraborty used a CNN to detect and identify fresh and rotten fruits and obtained an accuracy of 93.72% [5]. Chithra used Kapur’s algorithm to sort apples into defective and normal, and the test accuracy reached 93.33% [6]. Nur Alam used hyperspectral imaging (HSI) for apple skin defect detection, and the accuracy reached 94.28% [7]. Although defective apple detection has been developed for a long time, its accuracy still needs to be improved.

To improve defective apple detection accuracy, this paper fuses the CAM, the LeakyReLU activation function, and ResNet-50 into a new network. The new network is named the C-ResNet-50 network and is used for the defective apple detection task. The C-ResNet-50 network is built on the PyTorch framework. The data set is divided into a training set and a validation set at a ratio of 8:2 and is used for the training and optimization of the network. Finally, the accuracy of C-ResNet-50 is compared with that of AlexNet, VGGNet, and ResNet-50.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024. Published by Springer Nature Switzerland AG 2024. All Rights Reserved. B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 298–307, 2024. https://doi.org/10.1007/978-3-031-50580-5_26
2 Network Design and Optimization

This paper studies defective apple detection, including network improvement and optimization, data set preparation, network training, and result analysis. The ResNet-50 network is optimized with the CAM and the LeakyReLU activation function, and the result is defined as the C-ResNet-50 network. CAM is the Channel Attention Module, which yields a weight for each feature channel. Applying these weights lets the network capture meaningful information more effectively and improves the accuracy of defective apple detection. The activation function increases the nonlinearity of the network.

2.1 ResNet-50 Network

The ResNet-50 network consists of four network layers together with a convolutional layer, pooling layers, and a fully connected layer. The four network layers are composed of 3, 4, 6, and 3 residual modules, respectively. Each residual module is composed of three convolutional layers, which have convolution kernels of different sizes.

Residual Module. The residual module consists of a convolution module and a direct connection module (see Fig. 1). When the output of the convolution module is 0, H(z) = z, which solves the problem of accuracy decreasing as network depth increases. The direct connection is a simple identity connection that introduces no additional parameters, which reduces the computing burden and improves network accuracy [8]. The output of the residual module is as follows.

$$H(z) = F(z) + z \qquad (1)$$

where z is the input of the residual module and F(z) is the output of the convolution module.
Fig. 1. The network structure of the residual module
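The behavior of Eq. (1) can be illustrated with a few lines of Python. This is a toy sketch on plain lists, not the ResNet-50 implementation: conv_branch is a hypothetical stand-in for the convolution module F.

```python
def residual_module(z, conv_branch):
    """Return H(z) = F(z) + z element-wise; conv_branch stands in for F."""
    return [f + x for f, x in zip(conv_branch(z), z)]

z = [1.0, -2.0, 3.0]
zero_branch = lambda v: [0.0 for _ in v]        # F(z) = 0
scaled_branch = lambda v: [0.5 * x for x in v]  # an arbitrary non-zero F

print(residual_module(z, zero_branch))    # [1.0, -2.0, 3.0]
print(residual_module(z, scaled_branch))  # [1.5, -3.0, 4.5]
```

When the convolution branch outputs zero, the module reduces to the identity mapping, so adding such a module cannot make the representation worse than that of the shallower network.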
Convolution Layer. The function of the convolution layer is to extract different features of the input. It contains multiple convolution kernels. Each neuron in the convolutional layer is connected to multiple neurons in the previous layer. When a convolution kernel works, it sweeps regularly over the input features, sums the element-wise products within the receptive field, and adds the bias.

Batch Normalization Layer. The Batch Normalization layer serves the following purposes.

1. It adjusts the data distribution and speeds up network training and convergence.
2. It prevents gradient explosion and gradient vanishing.
3. It prevents overfitting.

The calculation process of the Batch Normalization layer is as follows. First, the mean of the samples is calculated. Then, the variance of the samples is calculated and the samples are standardized. Finally, a linear transformation and an offset are applied. The calculation formulas of the Batch Normalization layer are as follows.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} z_i \qquad (2)$$

$$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} (z_i - \mu)^{2} \qquad (3)$$

$$(z_i)_{norm} = \frac{z_i - \mu}{\sqrt{\sigma^{2} + \varepsilon}} \qquad (4)$$

$$y_i = \alpha (z_i)_{norm} + \beta \qquad (5)$$

where μ is the mean, n is the number of elements, z_i is the value of each element, σ² is the variance, ε is a small constant that keeps the denominator away from zero, (z_i)_norm is the normalized value of each element, α is the scale parameter of the linear transformation, β is the offset, and y_i is the output of the batch normalization layer.

Activation Function. The ReLU activation function is adopted in the ResNet-50 network. When the input x ≤ 0, the output F(x) is 0. When the input x > 0, the output F(x) is x (see Fig. 2).
Fig. 2. ReLU activation function
The formula of the ReLU activation function is as follows.

$$F(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \qquad (6)$$
The ReLU activation function uses simple mathematical operations and gives the network good sparsity, which reduces the amount of computation and reduces overfitting. However, the ReLU activation function has a shortcoming: when the input x < 0, the output F(x) is 0 and so is the gradient. A neuron whose input stays negative therefore always outputs 0 and stops learning in subsequent training, which is the “Dead Neuron” phenomenon.
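The batch normalization of Eqs. (2) to (5) followed by the ReLU of Eq. (6) can be sketched in plain Python. This is a minimal illustration on a one-dimensional list of values, with α = 1, β = 0, and ε = 1e-5 chosen as assumed defaults; it is not the framework implementation.

```python
import math

def batch_norm(z, alpha=1.0, beta=0.0, eps=1e-5):
    """Apply Eqs. (2)-(5): mean, variance, standardization, scale and offset."""
    n = len(z)
    mu = sum(z) / n                                # Eq. (2)
    var = sum((zi - mu) ** 2 for zi in z) / n      # Eq. (3)
    normalized = [(zi - mu) / math.sqrt(var + eps) for zi in z]  # Eq. (4)
    return [alpha * zn + beta for zn in normalized]              # Eq. (5)

def relu(x):
    """Eq. (6): pass positive inputs through, output 0 otherwise."""
    return x if x > 0 else 0.0

z = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(z)
activated = [relu(y) for y in normalized]
print([round(y, 4) for y in normalized])
print([round(y, 4) for y in activated])
```

After normalization the values are centered on zero, so roughly half of them fall on the flat part of ReLU. This illustrates why inputs that stay negative produce no output and no gradient, the situation described above as the dead neuron phenomenon.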
2.2 CAM Attention Module

The CAM attention module (see Fig. 3) consists of a pooling part, a shared MLP (fully connected network) and a sigmoid function. Two sets of channel data are obtained from the input features by global average pooling and global maximum pooling, respectively. Both pass through the shared MLP network, and the two outputs are added to give new channel data. The sigmoid function then maps this data to the range 0 to 1, which gives the weight of each feature channel. The optimized feature data is obtained by multiplying the input feature by the feature channel weight, and the CAM attention module outputs this optimized data. The output formulas of the CAM attention module are as follows.

$$W_C = \sigma\{\mathrm{MLP}[\mathrm{MaxPool}(F)] + \mathrm{MLP}[\mathrm{AvgPool}(F)]\} \tag{7}$$

$$F' = W_C \otimes F \tag{8}$$

where $F$ is the CAM module input, MaxPool is the global max-pooling, AvgPool is the global average-pooling, MLP is the fully connected network, $\sigma$ is the sigmoid function, $W_C$ is the channel weight, and $F'$ is the output of the CAM module.
Fig. 3. CAM attention module
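The channel attention computation of Eqs. (7) and (8) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the two-layer shape of the shared MLP with a ReLU between the layers, the weight matrices `W1` and `W2`, and the tiny 2-channel input are all assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def shared_mlp(vec, w1, w2):
    """Two-layer fully connected network with a ReLU in between (assumed layout)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def cam(feature, w1, w2):
    """feature: list of C channels, each an H x W list of lists."""
    flat = [[v for row in ch for v in row] for ch in feature]
    max_pool = [max(ch) for ch in flat]               # MaxPool(F), one value per channel
    avg_pool = [sum(ch) / len(ch) for ch in flat]     # AvgPool(F), one value per channel
    summed = [a + b for a, b in zip(shared_mlp(max_pool, w1, w2),
                                    shared_mlp(avg_pool, w1, w2))]
    weights = [sigmoid(s) for s in summed]            # Eq. (7): W_C in (0, 1)
    out = [[[wc * v for v in row] for row in ch]      # Eq. (8): F' = W_C (x) F
           for wc, ch in zip(weights, feature)]
    return weights, out

# Hypothetical 2-channel, 2x2 input and hand-picked MLP weights.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 1.0]]]
W1 = [[0.5, -0.5]]        # 2 channels -> 1 hidden unit
W2 = [[1.0], [-1.0]]      # 1 hidden unit -> 2 channels
weights, F_out = cam(F, W1, W2)
```

With these made-up weights the first channel receives a weight close to 1 and the second a weight close to 0, so the first channel is emphasized in the output.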
2.3 Activation Function Improvement

To solve the “Dead Neuron” phenomenon, this paper adopts the LeakyReLU activation function. When x > 0, the output remains x. When x ≤ 0, the output is ax, where a is a user-defined coefficient. With this improvement, there is still a nonzero output when the input is less than 0, so neurons continue to update normally while the amount of computation stays small. The formula of the LeakyReLU activation function is as follows.

$$F(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases} \tag{9}$$

The LeakyReLU activation function is shown in Fig. 4.

Fig. 4. LeakyReLU activation function
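The difference between ReLU, Eq. (6), and LeakyReLU, Eq. (9), is easiest to see from their gradients. The sketch below assumes a slope of `a = 0.01` for negative inputs; the text only says that a is user-defined, so this value is an illustrative choice.

```python
def relu(x):
    """Eq. (6)."""
    return x if x > 0 else 0.0

def leaky_relu(x, a=0.01):
    """Eq. (9); a = 0.01 is an assumed example value."""
    return x if x > 0 else a * x

def relu_grad(x):
    """Derivative of ReLU: zero for non-positive inputs, so no update flows back."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, a=0.01):
    """Derivative of LeakyReLU: small but nonzero for non-positive inputs."""
    return 1.0 if x > 0 else a

for x in (-2.0, 3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

For a negative input the ReLU gradient is 0, while the LeakyReLU gradient equals a, which is why the neuron can still be updated.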
2.4 C-ResNet-50 Network
Fig. 5. CAM Attention module joins ResNet-50 network (diagram labels: Previous block, CAM (Channel Attention Module), +, Next block)
After each residual module, a CAM attention module is added, so a total of sixteen CAM attention modules are inserted. The CAM attention module yields the feature channel weights of the preceding residual module. Applying these weights to the network improves its focus on critical information (see Fig. 5).

$$N_I = B_O + F' \tag{10}$$

where $B_O$ is the output of the previous residual module, $N_I$ is the input of the next residual module, and $F'$ is the CAM attention module output.
3 Experiment and Analysis

3.1 Data-Set Setting

We use 4423 valid images for the defective apple detection task. Apples with puncture wounds, crushing wounds, sunburn, insect injuries or cracks are defined as defective apples, while apples with no damage or only slight damage are defined as normal apples (see Fig. 6).
Fig. 6. Defective apple (left 1,2) and normal apple (left 3,4)
In this experiment, specific training was carried out for the classification of stem, calyx and defect to improve the classification accuracy of defective apples. The data set was divided into four categories:
1. apples without defect, stem and calyx;
2. apples having stem and calyx without defect;
3. apples having defect without stem and calyx;
4. apples having defect, stem and calyx.

According to these four categories, we sorted the data set into 1125 apple pictures without defect, stem and calyx, 1750 apple pictures having stem and calyx without defect, 798 apple pictures having defect without stem and calyx, and 750 apple pictures having defect, stem and calyx. We randomly divided the data into a training set and a validation set at a ratio of about 8:2. The data set for defective apple detection is shown in Table 1.

Table 1. Data set setting

| Apple type | Training set | Validation set | Amount |
|---|---|---|---|
| without defect, stem and calyx | 876 | 249 | 1125 |
| having stem and calyx without defect | 1371 | 379 | 1750 |
| having defect without stem and calyx | 639 | 159 | 798 |
| having defect, stem and calyx | 600 | 150 | 750 |
| amount | 3486 | 937 | 4423 |
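As a quick arithmetic check on Table 1, the snippet below recomputes the totals and the training share of each category from the listed counts. The dictionary keys simply repeat the category names of the table; nothing here goes beyond the published numbers.

```python
# (training, validation) counts per category, copied from Table 1
table1 = {
    "without defect, stem and calyx": (876, 249),
    "having stem and calyx without defect": (1371, 379),
    "having defect without stem and calyx": (639, 159),
    "having defect, stem and calyx": (600, 150),
}

train_total = sum(t for t, _ in table1.values())
val_total = sum(v for _, v in table1.values())
train_share = train_total / (train_total + val_total)

# Training share of each category, to compare against the "about 8:2" split
per_class = {name: t / (t + v) for name, (t, v) in table1.items()}

print(train_total, val_total, round(train_share, 3))
for name, share in per_class.items():
    print(f"{name}: {share:.3f}")
```

The overall training share and every per-category share fall between 0.77 and 0.81, consistent with a split of about 8:2.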
3.2 Training Process

This paper trains the network with a cosine-decay learning rate, and a 3090 GPU was used to accelerate the training of the model. At the beginning of training, a large learning rate is used to accelerate convergence. At the end of training, a small learning rate is used to improve the network’s convergence accuracy (see Fig. 7). The calculation formula for the learning rate is as follows.

$$lr_t = lr_{min} + \frac{1}{2}\left(lr_{max} - lr_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right) \tag{11}$$

where $lr_t$ is the current learning rate, $lr_{min}$ is the minimum learning rate set, $lr_{max}$ is the maximum learning rate set, $T_{cur}$ is the current number of training epochs, and $T_{max}$ is the maximum number of training epochs.
Fig. 7. Learning rate change curve (x-axis: epochs, 0 to 200; y-axis: learning rate, 0.001 to 0.01)
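The schedule of Eq. (11) can be sketched as a small function. The values `lr_max = 0.01`, `lr_min = 0.001` and `t_max = 200` are read from the axis ticks of Fig. 7 and are therefore an assumption about the actual training configuration.

```python
import math

def cosine_lr(t_cur, t_max, lr_min, lr_max):
    """Eq. (11): cosine-decay learning rate at epoch t_cur."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_max))

# Assumed settings, taken from the axis range of Fig. 7
schedule = [cosine_lr(t, 200, 0.001, 0.01) for t in range(201)]

print(schedule[0], schedule[100], schedule[200])
```

The rate starts at the maximum, passes through the midpoint of the two limits halfway through training, and ends at the minimum, which matches the large-then-small behaviour described above.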
To verify the accuracy of the C-ResNet-50 network in defective apple detection, this paper selected AlexNet, VGGNet and ResNet-50 for accuracy comparison.

3.3 Experiment Analysis

In the training process, the cosine-decay learning rate is compared with a constant learning rate. Training with the cosine-decay learning rate converges faster and with smaller oscillation (see Fig. 8).

Fig. 8. Training losses (x-axis: epochs, 0 to 200; y-axis: training loss, 0.00 to 1.00; curves: cosine decay learning rate, constant learning rate)
After training and optimizing the C-ResNet-50 network, we obtained the C-ResNet-50 validation loss curve and the comparison of C-ResNet-50 and ResNet-50 (see Fig. 9). After training and validating several neural networks, we obtained the detection comparison of the AlexNet, VGGNet, ResNet-50 and C-ResNet-50 networks (see Fig. 10) and the detection accuracy of these networks (see Table 2).
Fig. 9. C-ResNet-50 validation loss (left) and comparison of C-ResNet-50 and ResNet-50 (right) (left panel: validation loss versus epochs; right panel: accuracy versus epochs)

Fig. 10. Comparison of C-ResNet-50 and AlexNet (left) and VGGNet (right) (both panels: accuracy versus epochs)
Table 2. Detection accuracy of defective apple

| Number | Network | Accuracy (%) |
|---|---|---|
| 1 | AlexNet | 93.21 |
| 2 | VGGNet | 94.23 |
| 3 | ResNet-50 | 95.02 |
| 4 | C-ResNet-50 | 97.35 |
In this experiment, we improved the ResNet-50 network to obtain the C-ResNet-50 network, which performed well in the task of defective apple detection with an accuracy of 97.35%. According to Table 2, the accuracies of ResNet-50, VGGNet and AlexNet were 95.02%, 94.23% and 93.21%, respectively. The accuracy of the C-ResNet-50 network is therefore 2.33 percentage points higher than that of the unimproved ResNet-50, and 3.12 and 4.14 percentage points higher than VGGNet and AlexNet, respectively. These results indicate that the C-ResNet-50 network outperforms the other convolutional neural networks tested and is competent for the task of defective apple detection.
4 Conclusion

This paper describes a defective apple detection method. To solve the problem of low accuracy in defective apple detection, the C-ResNet-50 network is proposed, which is a fusion model of the CAM attention module, the LeakyReLU activation function and the ResNet-50 network. In network training, the cosine-decay learning rate method is adopted to improve the convergence speed and reduce the oscillation of the loss. After validation, the C-ResNet-50 network achieves 97.35% accuracy in the defective apple detection task. Compared with ResNet-50, VGGNet and AlexNet, the accuracy of C-ResNet-50 increases by 2.33, 3.12 and 4.14 percentage points, respectively (see Table 2). This shows that the C-ResNet-50 network based on the attention module and ResNet-50 is competent for the task of defective apple detection.

Acknowledgements. This paper is supported by the Shandong Key Technology R&D Program 2019JZZY021005, Natural Science Foundation of Shandong ZR2020MF067 and Natural Science Foundation of Shandong Province ZR2021MF074.
References

1. Wang, B., Yin, J., et al.: Extraction and classification of apple defects under uneven illumination based on machine vision. J. Food Process Eng. 45(03), e13976 (2022)
2. Abbaspour-Gilandeh, Y., Aghabara, A., Davari, M., et al.: Feasibility of using computer vision and artificial intelligence techniques in detection of some apple pests and diseases. Appl. Sci. 12(2), 906 (2022)
3. Cheng, C., Zhang, G.T.: Deep learning method based on physics informed neural network with resnet block for solving fluid flow problems. Water 13(4), 423 (2021)
4. Xue, Y., Wang, L., et al.: Defect detection method of apples based on GoogLeNet deep transfer learning. J. Agric. Mach. 51(07), 30–35 (2020)
5. Chithra, P., Henila, M.: Apple fruit sorting using novel thresholding and area calculation algorithms. Soft Comput. 25, 431–445 (2021)
6. Chakraborty, S., Shamrat, F.M.J.M., Billah, M.M., et al.: Implementation of deep learning methods to identify rotten fruits. In: 25th International Conference on Trends in Electronics and Informatics (ICOEI), pp. 1207–1212. IEEE, Tirunelveli (2021)
7. Nur Alam, M.D., Thapamagar, R., Rasaili, T., Olimjonov, O., Al-Absi, A.A.: Apple defects detection based on average principal component using hyperspectral imaging. In: Pattnaik, P.K., Sain, M., Al-Absi, A.A., Kumar, P. (eds.) SMARTCYBER 2020. LNNS, vol. 149, pp. 173–187. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-7990-5_17
8. He, J., Jiang, D.: Fully automatic model based on SE-resnet for bone age assessment. IEEE Access 9, 62460–62466 (2021)
Understanding the Trend of Internet of Things Data Prediction

Lu Zhang1, Lejie Li2(B), Benjie Dong2, Yanwei Ma2, and Yongchao Liu3

1 School of EE, University of Jinan, Jinan 250022, China
2 Jinan Lingsheng Info Tech. Co. Ltd., Huaiyin District, Jinan 250000, China
[email protected]
3 Jinan Yiling Electronic Tech. Co. Ltd., Huaiyin District, Jinan 250000, China
Abstract. With the advancement of science and technology in recent years, the Internet of Things has become another technology hotspot after the Internet. It is widely used in various fields owing to its intelligent processing and reliable transmission. However, rapid development brings both opportunities and challenges. The most prominent is the massive increase in equipment data, which poses huge challenges for data analysis and prediction. Therefore, how to efficiently process and predict the time series data generated by the Internet of Things has become a research hotspot and a difficulty. With the improvement of computer performance over the past ten years, machine learning has developed considerably, and most scholars use machine learning methods when researching time-series data processing and forecasting for the Internet of Things. We therefore provide a preliminary overview of the history and evolution of machine learning-based IoT time-series data analysis and forecasting from a bibliometric perspective.

Keywords: Internet of Things · Time series data · Data analytics · Data prediction · Machine learning

1 Introduction
With the continuous advancement of technology, the Internet of Things has gradually been applied in various fields and has achieved good results [1]. At the same time, a large amount of device data is also generated. These data are usually time series data, and whether they can be effectively analyzed and predicted has become very important in the fields of transportation [2], healthcare, smart home, smart agriculture, and industrial production. Accurate analysis and prediction can not only improve the overall efficiency of the application field but also have far-reaching significance in promoting social stability and ensuring the safety of people’s lives and property. At the same time, the research direction of IoT data analysis and prediction is also closely related to the research content of some other fields.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024
Published by Springer Nature Switzerland AG 2024. All Rights Reserved
B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 308–318, 2024. https://doi.org/10.1007/978-3-031-50580-5_27
2 Background
At present, researchers at home and abroad have explored Internet of Things data prediction in depth across various fields. In the field of medicine, mobile healthcare devices are widely used for diseases such as diabetes, obesity, and heart disease. Device sensors can capture the physical and mental state of patients in a timely manner to inform subsequent medical plans. Mobile phones and some smart bracelets also provide health-monitoring functions [3]. Data prediction is even more important in the field of intelligent transportation. Bin Sun et al. [4] proposed a data-driven approach to anomaly detection. Yisheng Lv et al. [5] were the first to use autoencoders to extract temporal and spatial characteristics from traffic flow data, and the resulting deep framework model performed well on multiple real data sets. Bichen Wu et al. [6] proposed the SqueezeDet algorithm for object detection in autonomous driving. Bin Sun et al. used machine learning to predict emergency traffic [7]. In the industrial field, Shen Yan et al. used deep learning to propose a machine tool anomaly detection method [8]. Mingzhi Chen et al. improved the GAN and significantly improved the accuracy of the fault diagnosis task [9]. Although there has been a lot of research on the fusion of the Internet of Things and artificial intelligence, this is only the beginning, and the Internet of Things based on artificial intelligence still has a lot of room for development.
3 Related Papers and Sources
We selected 759 articles and citations from the core database of Web of Science (WOS) on machine learning-based IoT/IoV time-series data prediction. They come mainly from 167 periodicals and books, were published mainly from 2015 to 2022, and involve a total of 2,758 authors. Each article is cited more than 13 times on average, and together the articles cite more than 20,000 other documents. We imported the article information into the R library bibliometrix for analysis [10]. From Fig. 1, we can see that the prediction of IoT time-series data based on machine learning algorithms has been widely studied by scholars from all over the world in recent years. Especially after 2018, the number of related articles has grown exponentially, which shows that the large amount of data generated by the Internet of Things increasingly relies on machine learning algorithms for processing and prediction. Figure 2 shows the most productive sources of our articles according to Bradford’s law [11]. These articles mainly come from IEEE Access, IEEE Internet of Things Journal, etc. However, a journal that publishes a large number of articles does not necessarily publish articles of high quality. We therefore sorted all the sources by influence (G-index), as shown in Fig. 3. The most authoritative source is IEEE Internet of Things Journal, which published several key papers by the team of Tang, FX. A total of 79 papers from this source appear in the database we collected. For example, an intelligent channel allocation algorithm proposed by Tang, FX [12], which effectively solves the problem of network congestion during data transmission, has been cited 133 times in WOS. Wang, XF [13] solved the problem of access congestion in the network, and this work has been cited 109 times in WOS.

Fig. 1. Annual scientific production

Fig. 2. Analyzing the Literature via Bradford’s Law
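For readers unfamiliar with the G-index used to rank sources, the sketch below implements its usual definition: the largest g such that the g most-cited papers together have at least g² citations. The citation counts in the example are made up, and whether bibliometrix computes the source G-index in exactly this way is an assumption.

```python
def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g

# Hypothetical citation counts for the papers of one source
example = [6, 4, 2, 1, 0, 0]
print(g_index(example))
```

Because the index rewards a few highly cited papers, a source with many rarely cited articles can rank below a smaller source whose articles are cited often, which is the distinction between productivity and influence drawn above.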
Fig. 3. Source impact (Source G-index)

4 Citations and References
Figure 4 shows some of the most cited documents. The most cited one is the survey by the Luong, NC team on the application of reinforcement learning in communication networks [14], which has been cited 585 times in WOS. The second is from the Mahdavinejad, MS team [15], who analyzed the application of machine learning in IoT data analysis; it has been cited 375 times in WOS. It is worth mentioning that the team of Tang, FX has two entries in the top ten. The third-ranked paper, which looks ahead to artificial intelligence approaches for the future 6G vehicular network [16], has been cited 247 times in WOS. The sixth-ranked paper proposes a deep learning-based intelligent channel assignment algorithm [12] and has been cited 133 times in WOS. The rest of the articles come from different affiliated organizations. The fourth-ranked article concerns methods of information interaction and sharing in the Internet of Vehicles and has been cited 144 times. The fifth-ranked article is from the Munir, M team, which proposed the DeepAnT algorithm to address common periodic anomalies in data [17], with 142 citations.
Fig. 4. Most cited documents
According to the literature statistics (Fig. 5), the more cited papers have a greater impact on the research of other scholars. From this, we can also see that the integration of AI and IoT has been a hot topic in recent years and is likely to remain one for the next few years or even decades. Table 1 presents an analysis based on Lotka’s law [18], which describes the scientific productivity of authors and reflects the quantitative relationship between authors and papers. Most of the authors have published three or fewer papers in the database we collected, and the proportion of these authors is 97.8%. Two authors (Zhang Y and Zhang J) produced 10 and 12 papers respectively. It can be seen that a large number of authors conduct research in this field and that some scholars have studied it in depth. Figure 6 shows the relationship between papers, authors, and subject keywords. This three-field plot shows how important papers, authors, and keywords are connected, and thus how important research topics have formed. At the same time, we can see that the Internet, deep learning, predictive models, edge computing, etc. are the most important directions that scholars have studied in recent years.
5 Trending Topics and Theme Evolution

We can learn more from the trend topics. Changes in trend are mainly caused by changes in topic frequency. The trend of topics and the evolution of topics are shown in Fig. 7 and Fig. 8. In this analysis, we can see that modules, Internet, and performance have received continuous attention in recent years.
Fig. 5. References spectroscopy (RPYS)

Table 1. Analyzing Author Productivity According to Lotka’s Law

| Number of articles | Number of authors | Proportion of authors |
|---|---|---|
| 1 | 2357 | 0.855 |
| 2 | 259 | 0.094 |
| 3 | 81 | 0.029 |
| 4 | 26 | 0.009 |
| 5 | 20 | 0.007 |
| 6 | 8 | 0.003 |
| 7 | 2 | 0.001 |
| 8 | 1 | 0.000 |
| 9 | 2 | 0.001 |
| 10 | 1 | 0.000 |
| 12 | 1 | 0.000 |
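The proportions in Table 1, and the 97.8% share of authors with three or fewer papers, follow directly from the author counts. The short check below recomputes them; the dictionary only restates the table.

```python
# number of articles -> number of authors, copied from Table 1
authors_by_count = {1: 2357, 2: 259, 3: 81, 4: 26, 5: 20, 6: 8,
                    7: 2, 8: 1, 9: 2, 10: 1, 12: 1}

total_authors = sum(authors_by_count.values())
proportions = {k: v / total_authors for k, v in authors_by_count.items()}

# Share of authors with at most three papers in the collected database
share_up_to_three = sum(v for k, v in authors_by_count.items() if k <= 3) / total_authors

print(total_authors, round(share_up_to_three, 3))
```

The total matches the 2,758 authors reported for the collected database.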
Fig. 6. Three-Field Plot
After 2021, trend topics and topic evolution have undergone major changes, which indicates that the focus of research shifted to a certain extent in 2021 and that some related issues, such as wireless sensor networks and resource allocation, have been resolved. The Internet and prediction began to shift gradually toward neural networks. Neural-network and time-series related issues have been topics of concern in recent years, and their frequency of occurrence is among the highest.

From another perspective, we divide the trending topics into four different types [19], located in the first, second, third, and fourth quadrants respectively, as shown in Fig. 9 and Fig. 10. The horizontal axis represents centrality, and the vertical axis represents density. The upper right corner (the first quadrant) contains Motor Themes, which are both important and well-developed. The upper left corner (the second quadrant) contains Niche Themes. Although these themes have developed well, they have not been applied to the current field. The lower left corner (the third quadrant) contains Emerging or Declining Themes, which are mainly marginal themes: they may be just emerging, or they may be about to decline. The lower right corner (the fourth quadrant) contains Basic Themes, which are relatively basic but important.
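The quadrant assignment described above can be sketched as a simple rule on centrality and density. The cut points `centrality_cut` and `density_cut` that separate "high" from "low" are hypothetical parameters; the text does not state how the axes are split, and the theme values in the example are invented.

```python
def classify_theme(centrality, density, centrality_cut, density_cut):
    """Map a theme to its quadrant of the thematic map (cut points are assumed)."""
    high_centrality = centrality >= centrality_cut
    high_density = density >= density_cut
    if high_centrality and high_density:
        return "Motor Themes"                   # first quadrant, upper right
    if not high_centrality and high_density:
        return "Niche Themes"                   # second quadrant, upper left
    if not high_centrality and not high_density:
        return "Emerging or Declining Themes"   # third quadrant, lower left
    return "Basic Themes"                       # fourth quadrant, lower right

# Invented (centrality, density) values, with both cut points set to 0.5
themes = {"theme A": (0.9, 0.8), "theme B": (0.2, 0.7),
          "theme C": (0.1, 0.2), "theme D": (0.8, 0.3)}
labels = {name: classify_theme(c, d, 0.5, 0.5) for name, (c, d) in themes.items()}
print(labels)
```

A theme that moves from the first to the fourth quadrant between two periods, for example, keeps its centrality but loses density, that is, it stays important while its internal development slows.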
Fig. 7. Topic trend (Frequency)

Fig. 8. Thematic evolution

Fig. 9. Thematic map until 2021

Fig. 10. Thematic map after 2021
From Fig. 9, we can see that from 2015 to 2021, the topics in the first quadrant are mainly composed of the Internet, and the topics in the fourth quadrant mainly include forecasting, big data, algorithms, models, systems, etc. Before 2021, multivariate and autoencoding are the main research topics in the second quadrant. The third quadrant mainly includes delivery and intrusion detection systems, which were undeveloped and marginal research directions at that time. However, after 2021, the thematic map has undergone major changes, as shown in Fig. 10. The direction of research and the distribution of topics have changed considerably. Most obviously, neural networks and classification have newly emerged in the first quadrant, while the Internet is gradually slipping into the fourth quadrant. The fourth quadrant is mainly composed of algorithms and models, but these are also gradually approaching the first quadrant, which shows that the development of algorithms and models is becoming more and more important and still faces considerable challenges. The third quadrant mainly contains 5G and mobile, indicating that after 2021 these topics have slowly faded from the research field.
6 Conclusions and Future Directions

According to the evolution of topics and changes in topic trends, we can see that the integration of artificial intelligence and the Internet of Things is the general trend [20]. Machine learning has become indispensable in predictive research on IoT time-series data. Neural networks have attracted attention in recent years, and we believe that more and more researchers will devote themselves to neural network research in the future. In particular, the processing of IoT data based on neural networks is likely to be a focus and a difficulty of research in the next few years or even decades.
References

1. Sun, B., Geng, R., Zhang, L., Li, S., Shen, T., Ma, L.: Securing 6G-enabled IoT/IoV networks by machine learning and data fusion. EURASIP J. Wirel. Commun. Network. 2022 (2022)
2. Sun, B., Geng, R., Shen, T., Xu, Y., Bi, S.: Dynamic emergency transit forecasting with IoT sequential data. Mob. Netw. Appl. 1–15 (2022)
3. Chen, M., Li, W., Hao, Y., Qian, Y., Humar, I.: Edge cognitive computing based smart healthcare system. Futur. Gener. Comput. Syst. 86, 403–411 (2018)
4. Sun, B., Ma, L., Shen, T., Geng, R., Zhou, Y., Tian, Y.: A robust data-driven method for multiseasonality and heteroscedasticity in time series preprocessing. Wirel. Commun. Mob. Comput. 2021, 6692390:1–6692390:11 (2021)
5. Lv, Y., Duan, Y., Kang, W., Li, Z., Wang, F.-Y.: Traffic flow prediction with big data: a deep learning approach. IEEE Trans. Intell. Transp. Syst. 16(2), 865–873 (2015)
6. Wu, B., Wan, A., Iandola, F., Jin, P.H., Keutzer, K.: SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 446–454 (2017)
7. Sun, B., Geng, R., Yuan, X., Shen, T.: Prediction of emergency mobility under diverse IoT availability. EAI Endors. Trans. Pervasive Health Technol. 8(4), e2 (2022)
8. Yan, S., Shao, H., Xiao, Y., Liu, B., Wan, J.: Hybrid robust convolutional autoencoder for unsupervised anomaly detection of machine tools under noises. Robot. Comput. Integr. Manuf. 79, 102441 (2023)
9. Chen, M., Shao, H., Dou, H., Li, W., Liu, B.: Data augmentation and intelligent fault diagnosis of planetary gearbox using Ilofgan under extremely limited samples. IEEE Trans. Reliab. 1–9 (2022)
10. Aria, M., Cuccurullo, C.: bibliometrix: An r-tool for comprehensive science mapping analysis. J. Informet. 11(4), 959–975 (2017)
11. Nisonger, T.E.: The “80/20 rule” and core journals. Serials Librarian 55(1-2), 62–84 (2008)
12. Tang, F., Fadlullah, Z.Md., Mao, B., Kato, N.: An intelligent traffic load prediction-based adaptive channel assignment algorithm in SDN-IoT: a deep learning approach. IEEE Internet Things J. 5(6), 5141–5154 (2018)
13. Wang, X., Wang, C., Li, X., Leung, V.C.M., Taleb, T.: Federated deep reinforcement learning for internet of things with decentralized cooperative edge caching. IEEE Internet Things J. 7(10), 9441–9455 (2020)
14. Luong, N.C., et al.: Applications of deep reinforcement learning in communications and networking: a survey. IEEE Commun. Surv. Tutor. 21(4), 3133–3174 (2019)
15. Mahdavinejad, M.S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., Sheth, A.P.: Machine learning for internet of things data analysis: a survey. Digit. Commun. Netw. 4(3), 161–175 (2018)
16. Tang, F., Kawamoto, Y., Kato, N., Liu, J.: Future intelligent and secure vehicular network toward 6g: machine-learning approaches. Proc. IEEE 108(2), 292–307 (2020)
17. Munir, M., Siddiqui, S.A., Dengel, A., Ahmed, S.: DeepAnT: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access 7, 1991–2005 (2019)
18. Lotka, A.J.: The frequency distribution of scientific productivity. J. Washington Acad. Sci. (Baltimore) 19, 317–323 (1926)
19. Sun, B., Ma, L., Geng, R., Xu, Y.: Matrix profile evolution: an initial overview. In: Fu, W., Xu, Y., Wang, S.-H., Zhang, Y. (eds.) ICMTEL 2021. LNICST, vol. 387, pp. 492–501. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-82562-1_48
20. Sun, B., Cheng, W., Bai, G., Goswami, P.: Correcting and complementing freeway traffic accident data using Mahalanobis distance based outlier detection. Tehnicki Vjesnik-Tech. Gazette 24, 10 (2017)
Finite Element Simulation of Cutting Temperature Distribution in Coated Tools During Turning Processes

Jingjie Zhang1,2,3(B), Guanghui Fan1,2, Liwei Zhang4, Lili Fan1,2, Guoqing Zhang1,2, Xiangfei Meng1,2, Yu Qi1,2, and Guangchen Li1,2

1 Faculty of Mechanical Engineering, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
[email protected]
2 Key Laboratory of Equipments Manufacturing and Intelligent Measurement and Control, China National Light Industry, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
3 School of Materials Science and Engineering, Shandong University, Jinan 250000, China
4 School of Civil Engineering, Qingdao University of Technology, Qingdao 266000, China
Abstract. The effect of cutting temperature on the mechanism of the cutting process is a fundamental issue. Cutting tool temperature has significant influences on the wear behavior of the cutting tool, surface finish and surface integrity during the cutting process. Advanced coating materials can be deposited on the carbide substrate to enhance tool performance and prolong tool life. This paper presents the cutting temperature of coated tools based on cutting process simulation with the finite element method (FEM) using Third Wave AdvantEdge software. The influences of coating materials and coating thickness on the temperature distribution in coated cutting tools were investigated. The simulated results showed that the temperature gradually increases in the tool-chip contact area and rapidly decreases after the tool-chip separation point. The TiAlN coating showed a better thermal barrier property than the other coatings under the same conditions. The cutting tool temperature of TiN coated cutting tools with different coating thicknesses was also investigated with FEM. The temperature distribution at the tool rake face and the substrate temperature differed for the various coating thicknesses.

Keywords: cutting temperature · coated tools · coating material · coating thickness · finite element simulation
1 Introduction In machinery manufacturing, although a variety of molding processes for different parts have developed, still more than 90% of the mechanical parts are made by cutting. In metal cutting, heat sources responsible for high temperature rise include plastic deformation © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2024 Published by Springer Nature Switzerland AG 2024. All Rights Reserved B. Wang et al. (Eds.): ICMTEL 2023, LNICST 535, pp. 319–328, 2024. https://doi.org/10.1007/978-3-031-50580-5_28
320
J. Zhang et al.
in the shear zones and friction at the tool-chip interface and tool-workpiece interface. The generated heat mainly diffuses into four parts: chips, cutting tools, workpiece and environment. Recent studies have shown a good correlation between tool temperature and tool life [1–4]. And the temperature has a great influence on the wear mechanism and performance of the tool. In metal cutting, cutting temperatures leads to chemical element diffusion and the oxidation [5, 6]. It will weaken tool strength and reduce the tool service life. In metal cutting, the distribution of temperature fields and heat dissipation are mainly dominated by the thermo-physical properties of tool and workpiece materials [7]. The tool coating has high hardness, temperature and oxidation resistance, and poor friction coefficient [8]. Compared with uncoated tools, coated tools can improve the cutting speed, tool life and cutting efficiency. Coating materials and the number of coating layers have effects on coated tool cutting temperature. In particular, the temperature plays an important role in cutting, so cutting temperature research and measurement has become the focus of the cutting experiment. Among the temperature research, the maximum temperature and the temperature distribution of the rake face is the key to temperature research. Generally, there are mainly two ways to measure the cutting temperature: contact and noncontact. Abukhshim [9] used the digital infrared pyrometer to measure the cutting temperature in turning experiments. The temperature profiles of the cutting tool is obtained by the finite element transient thermal analysis. The results shows that the temperature of the rake face increases with the cutting speed. The temperature variation beneath on the rake face of a cutting tool were analyzed by the Green’s function approach. 
And the temperature variation on the rake face were obtained by a pyrometer consisting of two optical fibers and a fiber coupler during the milling process [10]. Saelzer [11] carried out orthogonal cutting experiments to investigate the contact behavior between the tool and the chip. In experiments, cutting temperature on the rake face of orthogonal cutting was directly measured by a two-color fiber pyrometer Fire III, built by the company En2Aix. The result shows that the temperature of the chip surface is independent of the cutting speed and the uncut thickness of the chip, but the cutting speed has a strong influence on rake face temperature. The temperature of the rake face increases with the cutting speed. In order to test the tool temperature distribution in the processing of machining titanium alloys and considered the fact that carbide inserts are easy to wear when machining titanium, Li [12] designed a cemented carbide tools embedded with thin film thermocouples, and made micro-grooves on the rake face near the tip through laser machining, and installed the thin film sensors in the micro-grooves. Besides, the modeling approach is used to predict the temperature on the rake face, MöHring [13] shown that the numerical iteration method to calculate the cutting temperature, and the new method takes the heat distribution for moving sources into account. In the high-speed milling process, the heat flux and temperature distribution on the tool-chip interface were analyzed by a three-dimensional inverse heat-transfer model [14]. In recent years, the technology of FEM has been widely used for the simulation of metal cutting. In this paper, the commercial Third Wave AdvantEdge software was employed to develop a coupled thermo-mechanical finite element model for cutting temperature simulation of monolayer coating cemented carbide tools in the orthogonal metal cutting process. The cutting process is simulated from the initial to the steady-state
phase. The effects of coating material and coating thickness on the cutting temperature distribution are investigated.
2 Finite Element Modeling of Orthogonal Cutting Process
Komanduri [15] reported that the heat sources responsible for the high temperature rise in metal cutting include (1) heat dissipated by plastic deformation in the shear zones and (2) friction heat generated at the tool-chip and tool-workpiece interfaces. Figure 1 shows these main heat sources.
[Figure 1. Labels: chip, tool, workpiece, shear zone, tool-chip interface, tool-workpiece interface]
Fig. 1. Principal of heat sources in metal machining
2.1 Boundary Conditions
As shown in Fig. 2, the boundary conditions of the FEM model are set as follows. (1) Machining is performed at ambient temperature, which is equal to room temperature (Tr = 20 °C). (2) For the noncontact surfaces of the tool and workpiece, heat loss due to convection (ha = 20 W/(m²·°C)) and thermal radiation to the environment is considered.
[Figure 2. Labels: tool (coated carbide), chip, workpiece (H13), cutting speed Vc, X and Y axes, boundary segments ST, SA, Sc and S0. Legend: adiabatic boundary, constant temperature boundary, convection boundary with air, convection boundary]
Fig. 2. Boundary conditions in metal cutting modeling
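As a rough illustration of boundary condition (2) in Sect. 2.1, the heat loss from a noncontact surface can be estimated as the sum of a convective and a radiative flux. The sketch below is not taken from the AdvantEdge model: the class name, the grey-body radiation law and the emissivity value are assumptions made for illustration only, while ha and Tr are the values stated in the text.

```python
from dataclasses import dataclass

STEFAN_BOLTZMANN = 5.670374419e-8  # W/(m^2 K^4)


@dataclass(frozen=True)
class NoncontactSurface:
    """Heat loss from a tool or workpiece surface exposed to air (illustrative)."""

    h_air: float = 20.0      # convection coefficient ha from the text, W/(m^2 degC)
    t_room_c: float = 20.0   # room temperature Tr from the text, degC
    emissivity: float = 0.5  # ASSUMED placeholder; the source gives no value

    def convective_flux(self, t_surface_c: float) -> float:
        """Newton's law of cooling, q = ha * (Ts - Tr), in W/m^2."""
        return self.h_air * (t_surface_c - self.t_room_c)

    def radiative_flux(self, t_surface_c: float) -> float:
        """Grey-body radiation to surroundings at Tr, in W/m^2."""
        ts_k = t_surface_c + 273.15  # radiation needs absolute temperatures
        tr_k = self.t_room_c + 273.15
        return self.emissivity * STEFAN_BOLTZMANN * (ts_k**4 - tr_k**4)

    def total_flux(self, t_surface_c: float) -> float:
        """Combined convective and radiative loss, in W/m^2."""
        return self.convective_flux(t_surface_c) + self.radiative_flux(t_surface_c)
```

For a surface at room temperature both terms vanish, which is consistent with the constant temperature boundary ST being held at Tr.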
J. Zhang et al.
The thermal boundary conditions are summarized below.
(1) ST is a constant temperature boundary held at room temperature (Tr = 20 °C).
(2) SC is the convection boundary with the chip. This boundary is the tool-chip interface, where frictional heat is generated.
(3) SA is the convection boundary with air (ha = 20 W/(m²·°C)).
(4) S0 is an adiabatic boundary.
The boundaries ST, SC, SA and S0 are indicated in Fig. 2.
2.2 Graphical Representation of the Simulation Model
[Figure 3. Panels: (a) Graphical representation of cutting process (workshop view with tool, workpiece, vc and f); (b) FEM simulation model]
Fig. 3. Graphical representation of FEM simulation model
The simplification of the cutting process and the construction of the simulation model are shown in Fig. 3, where Vc is the cutting speed, f is the feed rate, γ0 is the rake angle, and α is the relief angle. The tool model consists of an adequate number of node planar heat-transfer elements, because heat transfer analysis is carried out on it.
Finite Element Simulation of Cutting Temperature Distribution
2.3 Cutting Conditions
Orthogonal cutting without coolant was simulated. The workpiece material was hardened H13 steel with an ultimate tensile strength (UTS) of 1882 MPa. The cutting speeds were 300 m/min and 500 m/min. The undeformed chip thickness was fixed at 0.2 mm, and the width of cut was 0.25 mm. Machining was performed at room temperature (Tr = 20 °C).
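The cutting conditions of Sect. 2.3 can be collected in a small parameter record, which also makes the derived quantities explicit. This is only a sketch: the class and attribute names are hypothetical, and the cut length passed to `time_for_length_s` is a free parameter because the text does not state one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrthogonalCut:
    """Dry orthogonal cutting conditions (names are illustrative)."""

    cutting_speed_m_min: float            # Vc; 300 or 500 m/min in the text
    uncut_chip_thickness_mm: float = 0.2  # undeformed chip thickness from the text
    width_of_cut_mm: float = 0.25         # width of cut from the text

    @property
    def uncut_chip_area_mm2(self) -> float:
        """Cross-section of the undeformed chip."""
        return self.uncut_chip_thickness_mm * self.width_of_cut_mm

    @property
    def removal_rate_mm3_min(self) -> float:
        """Material removal rate = cutting speed x uncut chip area."""
        return self.cutting_speed_m_min * 1000.0 * self.uncut_chip_area_mm2

    def time_for_length_s(self, length_mm: float) -> float:
        """Time needed to cut a given length (length is not from the text)."""
        speed_mm_s = self.cutting_speed_m_min * 1000.0 / 60.0
        return length_mm / speed_mm_s


# The two simulated cutting speeds
CASES = [OrthogonalCut(300.0), OrthogonalCut(500.0)]
```

With the stated values, the uncut chip cross-section is 0.2 mm × 0.25 mm = 0.05 mm², so the removal rate is 15 000 mm³/min at 300 m/min and 25 000 mm³/min at 500 m/min.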
3 Simulated Results and Discussion
The effects of the coating material and the coating thickness of coated tools on the cutting temperature distribution are investigated by FEM simulation. In addition, the temperature distributions on the tool rake face and in the tool body are simulated and analyzed.
3.1 Simulation of Cutting Temperature of Different Coating Material
Three kinds of monolayer coated tools, with TiN, TiC and TiAlN coatings, respectively, were employed in this investigation. Their substrate material was cemented carbide. The coating thickness of all three tools was 2 μm. As shown in Fig. 4, the temperature field is concentrated mainly around the tool-chip interface. Therefore, the tool-chip interface temperature is important for the coated tools investigated. The temperature distribution perpendicular to the tool rake face is then investigated to study the heat conduction in cutting tools with different coating layers.
[Figure 4. Labels: tool, chip, tool-chip interface, tool rake face, heat conduction direction]
Fig. 4. Schematic of cutting temperature distribution
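To give a feeling for heat conduction perpendicular to the rake face, the coating and substrate can be idealized as plane layers in series. The sketch below assumes steady, one-dimensional conduction with constant conductivities and a prescribed heat flux, which is far simpler than the transient coupled thermo-mechanical simulation used in this paper. The function name is hypothetical, and apart from the 2 μm coating thickness no property value comes from the text; conductivities and heat flux must be supplied by the user.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[float, float]  # (thickness in m, thermal conductivity in W/(m K))


def interface_temperatures(
    heat_flux_w_m2: float, surface_temp_c: float, layers: Sequence[Layer]
) -> List[float]:
    """Temperatures at the surface and behind each layer.

    Assumes steady 1D conduction: the drop across a layer is q * t / k.
    """
    temps = [surface_temp_c]
    for thickness_m, conductivity in layers:
        if thickness_m <= 0 or conductivity <= 0:
            raise ValueError("thickness and conductivity must be positive")
        resistance = thickness_m / conductivity  # m^2 K / W
        temps.append(temps[-1] - heat_flux_w_m2 * resistance)
    return temps
```

A call such as `interface_temperatures(q, t_rake, [(2.0e-6, k_coating), (t_substrate, k_substrate)])` returns the rake face temperature, the temperature at the coating-substrate interface, and the temperature at the back of the substrate, where `q`, `t_rake`, `k_coating`, `t_substrate` and `k_substrate` are user-supplied placeholders.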
3.1.1 Temperature Distribution on Tool Rake Face
The temperatures at the tool-chip interface were obtained by the FEM simulation. The results for the different coatings are shown in Fig. 5. As shown in Fig. 5, the
temperature along the tool-chip interface increases with distance from the tool tip. After the tool-chip separation point, the temperature shows a decreasing trend. This is because the rake face temperature depends mainly on the friction heat of the secondary deformation zone: the friction between the tool and the chip disappears after the tool-chip separation point, which causes the temperature to decrease.
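The rise and fall of the rake face temperature described above can be quantified from a sampled profile by locating its peak and computing a distance-weighted average. The helper functions below are a minimal sketch with hypothetical names; they are not the post-processing used by the authors, and they expect the distance and temperature samples to be exported from the simulation by the user.

```python
from typing import Sequence, Tuple


def peak(distance_mm: Sequence[float], temperature_c: Sequence[float]) -> Tuple[float, float]:
    """Return (distance, temperature) of the hottest sampled point."""
    if len(distance_mm) != len(temperature_c) or not temperature_c:
        raise ValueError("inputs must be non-empty and of equal length")
    i = max(range(len(temperature_c)), key=lambda k: temperature_c[k])
    return distance_mm[i], temperature_c[i]


def mean_temperature(distance_mm: Sequence[float], temperature_c: Sequence[float]) -> float:
    """Distance-weighted mean temperature using the trapezoidal rule."""
    if len(distance_mm) != len(temperature_c) or len(distance_mm) < 2:
        raise ValueError("need at least two samples of equal length")
    area = 0.0
    for k in range(len(distance_mm) - 1):
        dx = distance_mm[k + 1] - distance_mm[k]
        if dx <= 0:
            raise ValueError("distances must be strictly increasing")
        area += 0.5 * (temperature_c[k] + temperature_c[k + 1]) * dx
    return area / (distance_mm[-1] - distance_mm[0])
```

Restricting the samples to the contact length gives an average tool-chip interface temperature, whereas passing the whole sampled rake face gives an average rake face temperature.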
[Figure 5. Annotation: tool-chip interface]
Fig. 5. Cutting temperature distribution on rake face for TiN, TiC and TiAlN coated tools
Figure 5 shows that the cutting temperatures on the rake face of the TiC, TiN and TiAlN coated tools are different. The rake face temperatures of the TiN and TiC coated tools are higher than that of the TiAlN coated tool at the same position on the rake face, and that of the TiC coated tool is the highest. The average temperatures of the rake face and the tool-chip interface for these three coated tools are shown in Table 1. The average rake face temperature and the average tool-chip interface temperature of these three tools are related as follows: TiAlN