2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
An Evaluation of Transformer-Based Models in Personal Health Mention Detection Alvi Aveen Khan∗ , Fida Kamal† , Nuzhat Nower‡ , Tasnim Ahmed§ and Tareque Mohmud Chowdhury¶ ∗†‡§¶
Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh {∗ alviaveen, † fidakamal, ‡ nuzhatnower, § tasnimahmed, ¶ tareque}@iut-dhaka.edu
Abstract—In public health surveillance, the identification of Personal Health Mentions (PHM) is an essential initial step. It involves examining a social media post that mentions an illness and determining whether the context of the post is about an actual person facing the illness or not. When attempting to determine how far a disease has spread, the monitoring of such public posts linked to healthcare is crucial, and numerous datasets have been produced to aid researchers in developing techniques to handle this. Unfortunately, social media posts tend to contain links, emojis, informal phrasing, sarcasm, etc., making them challenging to work with. To handle such issues and detect PHMs directly from social media posts, we propose a few transformer-based models and compare their performances. These models have not undergone a thorough evaluation in this domain, but are known to perform well on other language-related tasks. We trained the models on an imbalanced dataset produced by collecting a large number of public posts from Twitter. The empirical results show that we have achieved state-of-the-art performance on the dataset, with an average F1 score of 94.5% with the RoBERTa-based classifier. The code used in our experiments is publicly available¹.
Index Terms—Health Monitoring, Natural Language Processing, Transformers, Social Media
I. I NTRODUCTION Online health surveillance is a key method by which we can monitor the spread of different diseases [1]. This requires the examination of what people are saying about a certain disease with regard to their personal health. Frequently, this is done using public social media posts. Data scraped from social media has recently become one of the most extensively used resources for a variety of research that requires the use of public data. The ease with which such data can be collected makes it a very attractive source in research related to machine learning, a field that typically requires large amounts of data. Notable applications of this include early detection of epidemics [1], cyberbullying detection [2] and more recently, detection of PHM related to COVID-19 [3]. The work we are presenting here is related to the domain of Personal Health Mention (PHM) detection. PHM detection has previously been defined as detecting positive reports of an illness, given the name of the illness [4]. A positive report is one that talks about the illness in reference to either the author of the post or someone the author personally knows. By contrast, a negative report talks about the illness in generic terms or outside of a medical context. Data gathered from social media platforms is perfect for this use case, as users 1 https://github.com/alvi-khan/PHM-Detection-with-Transformers
frequently share information about life events related to their health that would be challenging to obtain in other settings. Data gathered from the social media site Twitter has already been shown to be an accurate predictor of public health reports [4]. This has made the use of automatic text classification to solve the problem of PHM detection using such data appealing to researchers. There are, however, a range of difficulties that must be faced when working with social media posts. Typically, the posts are short, use colloquial language, contain abbreviations, etc. The language is informal and sometimes ambiguous and sarcastic to the point where it is difficult for even humans to identify the underlying intent. Such issues have made using such data quite difficult [5]. Deep-learning based approaches to PHM detection in the past have relied on word embeddings and Long Short-Term Memory (LSTM) based networks. Although these approaches gave state-of-the-art performance for their time, the word embeddings they used did not take the context of the input text into account. In this study, we use transformer-based language models, which use contextual word embeddings. Further analysis of the advantages of transformer-based models is provided in section II. To the best of our knowledge, transformerbased models have not been extensively used in the field of PHM detection. We approach the PHM detection problem as a text classification problem. By employing transformer-based models to this problem, we are able to achieve state-of-the-art performance. II. L ITERATURE R EVIEW Numerous approaches have been explored in the past with regard to PHM detection. We will be examining a few of these approaches in this section in order to gain an understanding of where the domain currently stands. Conventional methods utilised machine learning, which typically involved feature extraction and classifier training, relying on medical expertise to identify features. The Aspect Subject Ailment Model (ATAM) [6] and the Hidden Flu-State Tweet Model (HFSTM) [7] both employed combinations of keywords and health-related topics to discover health mentions. As machine learning algorithms developed, Jiang et al. [8] was able to utilise decision trees, KNN, and MLP in the training of classifiers while using user mentions and emotion keywords for feature extraction. Other approaches include
unigram and bigram feature extraction [9], semantic feature frames [10] and the use of Bayesian classifiers [11]. Among deep learning-based approaches, word embeddings have become popular. Word embeddings are vectors associated with each word in an input string. The values of the vectors represent different linguistic features of the words. Jiang et al. [12] encoded the tweets from their dataset with pre-trained, non-contextual word embeddings and used them as the input for an LSTM network. Similarly, Wang et al. [13] used word embeddings generated from the Global Vectors for word representation (GloVe) algorithm [14], which they fed into a bidirectional LSTM network. Iyer et al. [15] introduced a CNN-based approach and focused on whether disease-related phrases were being used metaphorically. They came to the conclusion that word embeddings were an essential feature for PHM detection and that the metaphoric features did not increase the performance. In this study, we work with well-known transformer-based models. First introduced by Vaswani et al. [16], these models use context-based word embeddings, which result in better performance compared to their non-contextual counterparts [17]. Work such as that by Ahmed et al. [18] and Ahmed et al. [2] has shown promise when using transformer-based models on data extracted from social media for text classification. Based on this, we demonstrate that employing such models for PHM classification is likewise an effective strategy. Along with introducing the dataset, Karisani et al. [19] also introduced a multi-view active learning model called ContextAware Co-Testing with Bagging (COCOBA). The model uses two contextual representations of user posts and uses a cotesting algorithm to resolve the disagreements between the two representations and come to a conclusion about the correct classification. On top of this, they employed the bagging technique [20] to increase the model’s robustness to noise. Karisani [21] also investigated the dataset with regards to their multiple-source unsupervised model for text classification under domain shift. The model was trained on multiple source domains to learn to minimise the classification error on a provided target domain. To the best of our knowledge, these are the only two works on the Illness Dataset. The fact that neither of them utilised transformer-based models, combined with the exemplary performance of these models on similar datasets, provides justification to investigate their performance on this dataset as well. III. DATASET We used the publicly available Illness Dataset introduced in [19] which contains 22,660 tweets mentioning four diseases: Alzheimer’s, Parkinson’s, Cancer and Diabetes. The data was collected from public posts made on the social media platform, Twitter between 2018 and 2019. It is an imbalanced dataset. The class distribution for the dataset is shown in Fig. 1. The dataset contains the disease names, labels, and documents (tweets). Based on the definition of Personal Health Mention (PHM) [22], the tweets were given either a positive or a negative label. Labels 0 and 1 indicate negative tweets, which
Fig. 1: Class Distribution of the Dataset (Alzheimer's: 643 positive / 2.8%, 3,380 negative / 14.9%; Parkinson's: 995 positive / 4.4%, 5,221 negative / 23.0%; Cancer: 1,226 positive / 5.4%, 4,780 negative / 21.1%; Diabetes: 1,076 positive / 4.7%, 5,339 negative / 23.6%)
either mention the name of the disease outside of a medical context or discuss the disease in general, without talking about a specific person affected by the disease. Labels 2 and 3 indicate positive tweets, which means that either the author of the post or someone the author knows has the disease. Samples for one positive and one negative document for each of the classes are provided in Table I. Hyperlinks have been removed and any personal identifiers, such as usernames, have been hidden to protect the users' privacy.

TABLE I: Sample Documents from the Dataset
Disease Name | Label    | Sample Text
Alzheimer's  | Positive | Elderly woman with Alzheimer's reported missing from Rogers Park
Alzheimer's  | Negative | Fund for Alzheimer's Research Established at UT
Parkinson's  | Positive | Billy Connolly's 'no longer recognises friends' as Parkinson's takes toll
Parkinson's  | Negative | Appendix identified as a potential starting point for Parkinson's disease
Cancer       | Positive | I'm raising money for My Breast Cancer Battle. Click to Donate: via *********
Cancer       | Negative | I heard the entire quote and he sought to minimize the growing cancer of white Supremacy. SHAMEFUL!!!
Diabetes     | Positive | "Nick Jonas talks about his type 1 diabetes, 13 years after being diagnosed"
Diabetes     | Negative | Obesity and type 2 diabetes harm bone health.
IV. P ROPOSED M ETHODOLOGY In this study, the effectiveness of a few renowned transformer-based natural language processing models is compared. In this section, the specifics of how the models were assessed are presented. A. Feature Extractors As feature extractors, three models, BERT [23], RoBERTa [24], and XLNet [25], were used. All of these models are based on the original transformer model proposed by Vaswani et al. [16], which was a Sequence-to-Sequence architecture. This architecture has an encoder section, which extracts features from input sequences, and a decoder section, which predicts the output from the features.
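To make the setup concrete, the sketch below loads the three pre-trained feature extractors and their tokenizers. It assumes the HuggingFace Transformers library with the standard base checkpoints; the paper does not name its implementation or the exact pre-trained weights, so both are assumptions rather than the authors' code.

```python
# Minimal sketch: loading the three pre-trained feature extractors and their
# tokenizers. Assumes the HuggingFace Transformers library; the checkpoint
# names below are the usual base variants and are an assumption, not details
# confirmed by the paper.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "XLNet": "xlnet-base-cased",
}

def load_feature_extractor(name):
    """Return (tokenizer, encoder) for one of the three models."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    encoder = AutoModel.from_pretrained(ckpt)  # encoder only, no task-specific head
    return tokenizer, encoder
```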
Fig. 2: Token Lengths of the Dataset (histogram; axes: Token Length and Number of Samples)
The models we are working with fall into two categories: Autoregressive (XLNet) and Autoencoding (BERT, RoBERTa). Autoregressive models are pre-trained to predict a new sequence from the current one. Autoencoding models are pre-trained by masking parts of the input sequence and making the model reconstruct it. Training in this way allows the models to gain an understanding of the language, which in turn means that the models require a minimal amount of fine-tuning when they are used in downstream tasks.

B. Tokenization
Each of the feature extractors requires that the inputs be of the same shape. To accomplish this, input vectors must be created for the input texts. The corresponding tokenizer for each of the feature extractors was used to handle this vector generation, giving us the input tokens. Each input token represents a word or a word fragment. If a token corresponds to a word fragment, meaning a complete word has multiple tokens, then related words will have some common tokens. If a token corresponds to a complete word, then related words will have token values that are very close. [PAD] tokens are added to the end of shorter token sequences to make the sequences the same shape. Due to computational limitations, a fixed sequence length of 128 was used for our experiments; it was observed that token sequences for sample text from the dataset are unlikely to be longer than this. Fig. 2 shows a histogram of the lengths of the token sequences generated by the RoBERTa model's tokenizer. There are also some special tokens used by the tokenizer, such as the [SEP] token, which denotes the point where one input ends and another one begins, and the [UNK] token, which is used to mark a word the tokenizer has not seen before.
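As an illustration of this step, the sketch below pads and truncates every post to the fixed sequence length of 128 used in the experiments. It relies on the HuggingFace tokenizer from the earlier sketch, which is itself an assumption about the implementation, not the authors' published preprocessing code.

```python
# Sketch: converting raw tweets into fixed-length token ID sequences.
# Shorter sequences are filled with [PAD] tokens and longer ones truncated,
# mirroring (but not guaranteed to match) the authors' preprocessing.
MAX_LEN = 128  # fixed sequence length used in the experiments

def encode_posts(tokenizer, posts):
    return tokenizer(
        posts,
        padding="max_length",  # append [PAD] tokens up to MAX_LEN
        truncation=True,       # cut sequences longer than MAX_LEN
        max_length=MAX_LEN,
        return_tensors="pt",   # PyTorch tensors (framework is an assumption)
    )

# Example:
# batch = encode_posts(tokenizer, ["Obesity and type 2 diabetes harm bone health."])
# batch["input_ids"].shape  ->  torch.Size([1, 128])
```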
C. Classifier Network
The models being used were all pre-trained on a large corpus of text. This makes them effective feature extractors, since they have a good grasp of basic linguistic features. However, the text they were pre-trained on was generic, not specific to the task at hand. Using a classifier network on top of the models for a downstream task yields better performance than using the models as they are [26], as it makes it possible to further adjust the pre-trained weights to the particular data being used. The architecture of the classifier network used in the experiments is shown in Fig. 3. This section was trained from scratch.

Both dropout layers exist to prevent overfitting. The dropout rate being set to 40% means that a random 40% of the output units from the previous layer will be set to 0, which increases the overall robustness of the model. The dropout rate was determined by trial and error. The normalisation layer normalises the activation values of the previous layer, bringing them to a similar scale. Gradient descent subsequently becomes more stable, cutting down on the time needed to train the model. The epsilon value used does not affect the model itself, but rather provides numerical stability for the internal calculations to avoid divisions by zero; as such, the exact value chosen is insignificant. The linear transformation layers change the number of dimensions the model is dealing with. The first linear transformation layer reduces the dimensions with the purpose of reducing computational complexity: it brings the number of dimensions down from 768, which is the output size for all of the pre-trained models used, to 64. The second linear transformation layer brings the 64 dimensions down to the 8 dimensions required as the final output, where each of the 8 values represents a different class. To determine how likely each of these classes is to be the correct one, a probability value must be assigned to them. The softmax layer translates the values from the 8 dimensions into probability values, and the class with the highest probability is chosen as the prediction.
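The head in Fig. 3 can be written down directly. The sketch below assumes PyTorch, which the paper does not explicitly name as its framework; the pooling of the extractor output (the first-token vector) is also a guess rather than a stated detail.

```python
# Sketch of the classifier network in Fig. 3: dropout (40%) -> linear 768->64
# -> normalization (epsilon 1e-8) -> dropout (40%) -> linear 64->8 -> softmax
# over the 8 classes. The pooling of the feature extractor output is a guess.
import torch
import torch.nn as nn

class PHMClassifier(nn.Module):
    def __init__(self, feature_extractor, hidden_size=768, num_classes=8):
        super().__init__()
        self.feature_extractor = feature_extractor  # BERT / RoBERTa / XLNet
        self.head = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(hidden_size, 64),
            nn.LayerNorm(64, eps=1e-8),
            nn.Dropout(0.4),
            nn.Linear(64, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        features = self.feature_extractor(input_ids=input_ids,
                                          attention_mask=attention_mask)
        pooled = features.last_hidden_state[:, 0, :]  # first-token representation
        return torch.softmax(self.head(pooled), dim=-1)  # class-wise probabilities
```

In practice the raw logits would be fed to the categorical cross-entropy loss during training; the explicit softmax is shown only to match the figure.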
Fig. 3: Classifier Network Architecture. Tweet text is passed through the pre-trained feature extractor (BERT / RoBERTa / XLNet) and then through the classifier network: Dropout (40%), Linear Transformation (768 input units, 64 output units), Normalization (epsilon 1e-08), Dropout (40%), Linear Transformation (64 input units, 8 output units), giving class-wise probabilities and the predicted class.

D. Experimental Setup and Hyperparameters
The experiments were carried out on Kaggle (https://www.kaggle.com/) with CUDA 11.4. We used an Intel Xeon CPU with 13 GB of RAM and an NVIDIA Tesla P100 GPU with 15.9 GB of video RAM. The entire dataset was first divided into training, validation, and test sets in a 60:20:20 ratio. After each epoch of training on the training set, the model's performance was assessed using the validation set, and its weights were saved if performance improved. Each classifier underwent 15 epochs of training; training beyond this was found to result in insignificant performance improvements. The test set was used for the final evaluation and remained unseen by the model during the training phase. Due to the increased memory utilisation brought on by larger batch sizes, the batch size had to be limited. Based on the available hardware configuration, it was determined that a batch size of 16 had acceptable memory usage. We used a learning rate of 1e-5. A lower learning rate means the model takes smaller steps during the training phase, which allows it to better adjust the error values from the loss function. The loss function used in this study was categorical cross-entropy. However, such a low learning rate would normally mean that the model would take a larger number of epochs to converge to an acceptable extent. We were able to avoid this issue by using pre-trained models.
The first 20% of the training steps were used as warmup steps, meaning the learning rate was gradually increased from 0 to 1e-5 during this time. The reduced learning rate during the initial stages makes the learning process less volatile, since the model is less likely to become misled [27]. At the end of each step of batch training, the weight values were optimised. This was accomplished by utilising the AdamW optimiser [28] with a weight decay of 0.01.
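The optimisation setup described above (AdamW with a weight decay of 0.01, a peak learning rate of 1e-5 and linear warmup over the first 20% of steps) could be wired up as sketched below. The scheduler helper comes from the HuggingFace Transformers library; its use here is an assumption, as the paper does not name its implementation.

```python
# Sketch: AdamW with weight decay 0.01, a 1e-5 peak learning rate and linear
# warmup over the first 20% of the training steps. The linear decay after
# warmup is the helper's default behaviour, not something stated in the paper.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr=1e-5, weight_decay=0.01):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.2 * total_steps),  # first 20% of steps
        num_training_steps=total_steps,
    )
    return optimizer, scheduler

# Inside the training loop, after each batch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```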
V. RESULT ANALYSIS
We analysed the results obtained by applying each of the models described in the previous section to the dataset described in Section III. The takeaways from this analysis are discussed in this section.

A. Evaluation Metrics
The metrics chosen to evaluate the models were accuracy, precision, recall, F1 score, ROC-AUC and MCC. Accuracy was chosen since it is a widely used evaluation metric; unfortunately, for imbalanced datasets such as the one used in our experiments, this metric does not give an accurate representation of the performance [18]. The precision and recall metrics can help deal with this and are popular evaluation metrics in their own right [29]. However, even these metrics have their share of issues, since they do not penalise incorrect outcomes. The F1 score is capable of giving an accurate representation of how well the models performed, taking all of these issues into account [2]. The Area Under the Receiver Operating Characteristic Curve (ROC-AUC) metric effectively measures how well a model is able to differentiate between classes [29, 30]. The Matthews Correlation Coefficient (MCC) measures how correlated the predicted and target values are. Both of these metrics evaluate the models from a completely different perspective than the other metrics, which is why they are included in the evaluation process [18].
B. Performance Comparisons
Table II shows a comparison of the results obtained by our classifier with each of the three feature extractors, alongside baseline classification models. The best results obtained in each metric are highlighted in bold. The baseline models considered in this comparison come from the two previous works that analysed their performance against the Illness Dataset. Karisani et al. [19] used the COCOBA model, a multi-view active learning model, while Karisani [21] used the CEPC model, a multiple-source unsupervised model.

The results make it clear that the RoBERTa-based classifier outperforms all the others. It is closely followed by the XLNet-based classifier, with the difference in performance between the two being minimal. The possible reasons behind this are further analysed in subsection V-C.

TABLE II: Experimental Results
Paper                | Model   | Accuracy | Precision | Recall | F1 Score | ROC-AUC | MCC
Karisani et al. [19] | COCOBA  | -        | 0.786     | 0.841  | 0.809    | -       | -
Karisani [21]        | CEPC    | -        | -         | -      | 0.811    | -       | -
Ours                 | BERT    | 0.937    | 0.938     | 0.937  | 0.938    | 0.986   | 0.923
Ours                 | RoBERTa | 0.944    | 0.947     | 0.944  | 0.945    | 0.990   | 0.931
Ours                 | XLNet   | 0.941    | 0.943     | 0.941  | 0.942    | 0.986   | 0.928
The ROC curves for the BERT, RoBERTa and XLNet models are shown in Fig. 4a, 4b and 4c respectively. For all three models, we can see that the positive class for each disease has worse performance than the negative class. This phenomenon is further analysed in subsection V-D.

C. Performance Analysis
The results of this study support those of Yang et al. [25] and Liu et al. [24]. Yang et al. [25] confirmed that XLNet outperforms state-of-the-art models in a variety of natural language processing experiments, while Liu et al. [24] demonstrated that RoBERTa outperforms both BERT and XLNet on the GLUE benchmark. The improved pre-training procedure for these models is what accounts for the better performance. XLNet outperforms models like BERT because it is pre-trained on a larger amount of data and creates word permutations during the pre-training phase. On the other hand, RoBERTa was a development over BERT and was pre-trained on longer sequences, used larger batch sizes, and adopted dynamic masking, where the masked word changed dynamically during pre-training.
D. Error Analysis
The three models all performed better on the negative classes than they did on the corresponding positive classes, as shown in Fig. 4. The significant difference in the number of samples in the classes, as depicted in Fig. 1, explains this disparity. The extent of the difference between the sample sizes directly relates to the degree of difference in ROC-AUC scores.
Fig. 4: ROC Curves Produced by Fine-Tuned Classifier Networks; (a) BERT, (b) RoBERTa, (c) XLNet
To enhance performance on imbalanced datasets, previous studies have tested various strategies. This problem has been successfully addressed by randomly oversampling the smaller class [31], [32] (a minimal sketch of which is given after Table III) or undersampling the larger class [33], [34]. The latter solution, however, may lead to overfitting or bias. Recently, few-shot learning has also been effectively used [35].

Table III provides samples of erroneous predictions made by the RoBERTa model. Analysis of the errors reveals that the model usually makes mistakes that involve mislabelled samples, multiple disease mentions and humorous posts using a first-person perspective. For example, the post 'They told you animal proteins are good for you but it causes cancer, diabetes, inflammation, kidney stones, etc.' is correctly predicted as a negative health mention, but the label and prediction disagree on the disease name since multiple diseases are mentioned.

TABLE III: Samples of Erroneous Predictions
Label                   | Prediction              | Sample Text
Parkinson's (Positive)  | Parkinson's (Negative)  | Steps to Better Walking Even With #Parkinson's #Disease [url]
Cancer (Negative)       | Diabetes (Negative)     | They told you animal proteins are good for you but it causes cancer, diabetes, inflammation, kidney stones, etc.
Alzheimer's (Negative)  | Alzheimer's (Positive)  | Old McDonald had alzheimer's Have you any wool....
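As a concrete illustration of the oversampling strategy cited above, and not something applied in this paper's experiments, random oversampling of the minority classes could be implemented as follows.

```python
# Sketch: random oversampling of minority classes, one of the imbalance
# strategies cited above (illustrative only; not used in these experiments).
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # grow every class to the majority size
    by_class = {c: [t for t, l in zip(texts, labels) if l == c] for c in counts}
    out_texts, out_labels = [], []
    for c, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for t in samples + extra:
            out_texts.append(t)
            out_labels.append(c)
    return out_texts, out_labels
```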
VI. CONCLUSION
In this paper, we used transformer-based models in the domain of PHM detection to classify 4 diseases, namely Alzheimer's, Parkinson's, Cancer and Diabetes, using the Illness Dataset as the basis for our experiments. We demonstrated that such models are able to work with social media posts in an effective manner and achieve exemplary performance. There is scope for improvement on the work presented here. The dataset used is limited in the number of diseases it deals with, which leaves room for further work using a more varied dataset. Furthermore, we hope to explore the effectiveness of these models when working with the PHM detection problem in general. Concentrating on specific diseases limits the practical use cases of our findings. This limitation can be removed by training the models to identify PHM regardless of which disease they are dealing with.
R EFERENCES [1] A. Joshi, R. Sparks, S. Karimi, S.-L. J. Yan, A. A. Chughtai, C. Paris, and C. R. MacIntyre, “Automated monitoring of tweets for early detection of the 2014 ebola epidemic,” PloS one, vol. 15, no. 3, p. e0230322, 2020. [2] T. Ahmed, M. Kabir, S. Ivan, H. Mahmud, and K. Hasan, “Am i being bullied on social media? an ensemble approach to categorize cyberbullying,” in 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2021, pp. 2442–2453. [3] L. Luo, Y. Wang, and H. Liu, “Covid-19 personal health mention detection from tweets using dual convolutional neural network,” Expert Systems with Applications, vol. 200, p. 117139, 2022. [4] A. Lamb, M. Paul, and M. Dredze, “Separating fact from fear: Tracking flu infections on twitter,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 789–795. [5] A. Olteanu, E. Kıcıman, and C. Castillo, “A critical review of online social data: Biases, methodological pitfalls, and ethical boundaries,” in Proceedings of the eleventh ACM international conference on web search and data mining, 2018, pp. 785–786. [6] M. Paul and M. Dredze, “You are what you tweet: Analyzing twitter for public health,” in Proceedings of the International AAAI Conference on Web and Social Media, vol. 5, no. 1, 2011, pp. 265–272. [7] L. Chen, K. Tozammel Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash, “Syndromic surveillance of flu on twitter using weakly supervised temporal topic models,” Data mining and knowledge discovery, vol. 30, no. 3, pp. 681–710, 2016. [8] K. Jiang, R. Calix, and M. Gupta, “Construction of a personal experience tweet corpus for health surveillance,” in Proceedings of the 15th workshop on biomedical natural language processing, 2016, pp. 128–135. [9] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: detecting influenza epidemics using twitter,” in Proceedings of the 2011 Conference on empirical methods in natural language processing, 2011, pp. 1568– 1576.
[10] W. W. Chapman, L. M. Christensen, M. M. Wagner, P. J. Haug, O. Ivanov, J. N. Dowling, and R. T. Olszewski, “Classifying free-text triage chief complaints into syndromic categories with natural language processing,” Artificial intelligence in medicine, vol. 33, no. 1, pp. 31– 40, 2005. [11] R. T. Olszewski, “Bayesian classification of triage diagnoses for the early detection of epidemics.” in Flairs conference, 2003, pp. 412–416. [12] K. Jiang, S. Feng, Q. Song, R. A. Calix, M. Gupta, and G. R. Bernard, “Identifying tweets of personal health experience through word embedding and lstm neural network,” BMC bioinformatics, vol. 19, no. 8, pp. 67– 74, 2018. [13] C.-K. Wang, O. Singh, Z.-L. Tang, and H.-J. Dai, “Using a recurrent neural network model for classification of tweets conveyed influenza-related information,” in Proceedings of the International Workshop on Digital Disease Detection Using Social Media 2017 (DDDSM2017), 2017, pp. 33–38. [14] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162 [15] A. Iyer, A. Joshi, S. Karimi, R. Sparks, and C. Paris, “Figurative usage detection of symptom words to improve personal health mention detection,” arXiv preprint arXiv:1906.05466, 2019. [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017. [17] A. Joshi, S. Karimi, R. Sparks, C. Paris, and C. R. MacIntyre, “A comparison of word-based and context-based representations for classification problems in health informatics,” arXiv preprint arXiv:1906.05468, 2019. [18] T. Ahmed, S. Ivan, M. Kabir, H. Mahmud, and K. Hasan, “Performance analysis of transformer-based architectures and their ensembles to detect trait-based cyberbullying,” Social Network Analysis and Mining, vol. 12, no. 1, pp. 1–17, 2022. [19] P. Karisani, N. Karisani, and L. Xiong, “Contextual multi-view query learning for short text classification in user-generated data,” arXiv preprint arXiv:2112.02611, 2021. [20] P. B¨uhlmann and B. Yu, “Analyzing bagging,” The annals of Statistics, vol. 30, no. 4, pp. 927–961, 2002. [21] P. Karisani, “Multiple-source domain adaptation via coordinated domain encoders and paired classifiers,” arXiv e-prints, pp. arXiv–2201, 2022. [22] P. Karisani and E. Agichtein, “Did you really just have a heart attack? towards robust detection of personal health mentions in social media,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 137–146. [23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,
"Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[25] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in neural information processing systems, vol. 32, 2019.
[26] M. Kabir, T. Ahmed, M. B. Hasan, M. T. R. Laskar, T. K. Joarder, H. Mahmud, and K. Hasan, "Deptweet: A typology for social media texts to detect depression severities," Computers in Human Behavior, vol. 139, p. 107503, 2023.
[27] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch sgd: Training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[28] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[29] S. Ahmed, M. B. Hasan, T. Ahmed, M. R. K. Sony, and M. H. Kabir, "Less is more: Lighter and faster deep neural architecture for tomato leaf disease classification," IEEE Access, 2022.
[30] M. S. Morshed, S. Ahmed, T. Ahmed, M. U. Islam, and A. B. M. A. Rahman, "Fruit quality assessment with densely connected convolutional neural network," 2022. [Online]. Available: https://arxiv.org/abs/2212.04255
[31] J. Lun, J. Zhu, Y. Tang, and M. Yang, "Multiple data augmentation strategies for improving performance on automatic short answer scoring," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 09, 2020, pp. 13389–13396.
[32] S. Qiu, B. Xu, J. Zhang, Y. Wang, X. Shen, G. De Melo, C. Long, and X. Li, "Easyaug: An automatic textual data augmentation platform for classification tasks," in Companion Proceedings of the Web Conference 2020, 2020, pp. 249–252.
[33] P. Sobhani, H. Viktor, and S. Matwin, "Learning from imbalanced data using ensemble methods and cluster-based undersampling," in International Workshop on New Frontiers in Mining Complex Patterns. Springer, 2014, pp. 69–83.
[34] A. Anand, G. Pugalenthi, G. B. Fogel, and P. Suganthan, "An approach for classification of highly imbalanced data using weighting and undersampling," Amino acids, vol. 39, no. 5, pp. 1385–1391, 2010.
[35] A. Rios and R. Kavuluru, "Few-shot and zero-shot multilabel learning for structured label spaces," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 2018. NIH Public Access, 2018, p. 3132.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
RNN Variants vs Transformer Variants: Uncertainty in Text Classification with Monte Carlo Dropout
Md. Farhadul Islam, Fardin Bin Rahman, Sarah Zabeen, Md. Azharul Islam, Md Sabbir Hossain, Md Humaion Kabir Mehedi, Meem Arafat Manab and Annajiat Alim Rasel
School of Data and Sciences, Brac University, Dhaka, Bangladesh
Abstract—Language models that can perform linguistic tasks, just like humans, have surpassed all expectations in recent years. Recurrent Neural Networks (RNN) and Transformer architectures have exponentially accelerated the development of Natural Language Processing and have drastically affected how we handle textual data. Understanding the reliability and confidence of these models is crucial for developing machine learning systems that can be successfully applied in real-life situations. Uncertainty-based quantitative and comparative research between these two types of architectures has not yet been conducted. It is vital to identify confident models in text classification tasks, as the modern world seeks safe and dependable intelligent systems. In this work, the uncertainty of Transformer-based models such as BERT and XLNet is compared to that of RNN variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). To measure the uncertainty of these models, we use dropout during the inference phase (Monte Carlo Dropout). Monte Carlo Dropout (MCD) has negligible computation costs and helps separate uncertain samples from the predictions. Based on our thorough experiments, we have determined that BERT surpasses all other models utilized in this study.
Index Terms—Transformer, Uncertainty, Monte Carlo Dropout, RNN
I. INTRODUCTION
Natural language processing (NLP) systems that are dependable, accountable, and trustworthy must quantify the uncertainty of their machine learning models. Obtaining measurements of uncertainty in predictions enables the identification of out-of-domain [1], adversarial, or error-prone occurrences, which ultimately require particular treatment. For instance, such occurrences may be subjected to additional review by human specialists or more advanced technology, or they may be rejected from classification [2]. In addition, uncertainty estimation is a key aspect of a variety of applications, such as active learning [3] and error identification. BERT [4] and ELECTRA [5], for instance, are deep pre-trained models based on the Transformer architecture [4]–[6]. Consequently, obtaining accurate uncertainty estimates for such neural networks (NNs) can directly benefit a wide variety of natural language processing tasks. However, implementing uncertainty estimation in this situation is challenging due to the large number of parameters in these deep learning models. A workable method for assessing the uncertainty of deep models is offered by MCD, which produces estimates approximating Bayesian inference by applying dropout during the testing stage [7]. Because several stochastic forecasts must be performed, such methods are typically accompanied by significant processing overhead; it is important to note that training ensembles of independent models results in even more prohibitive overheads [8]. In this work, our contributions include:
• showing a comparative analysis between RNN variants and Transformer variants, based on uncertainty estimation;
• creating a benchmark for uncertainty-measurement-based models in text classification;
• pinpointing the most efficient model in our study.

II. LITERATURE REVIEW
Uncertainty can be measured in several ways. Three of the most popular ways to estimate uncertainty are dropout as a Bayesian approximation (Monte Carlo Dropout) [7],
ensembling, where the discrepancy between models' predictions is interpreted as a sample variance [8], and Bayesian neural networks [9]. MCD is the most convenient technique for building uncertainty-aware models among the three. It also has other advantages, such as minimizing overfitting, decreasing model complexity, etc.

Shelmanov et al. [3] compare multiple uncertainty estimates in text classification tasks for the cutting-edge Transformer model ELECTRA and the speed-oriented DistilBERT model. They use many stochastic passes with MCD and a dropout based on Determinantal Point Processes to derive uncertainty estimates. Hu et al. [10] also suggest using empirical uncertainty in out-of-distribution identification for tasks involving text classification. They present a low-cost framework that uses auxiliary outliers as well as pseudo off-manifold samples to train the model with prior knowledge of a certain class, which has a high vacuity for out-of-distribution data. Vazhentsev et al. [11] employ Diverse Determinantal Point Process Monte Carlo Dropout to measure uncertainty and provide two optimizations to transformer models that reduce computation time and enhance misclassification identification in named entity recognition and text categorization.

III. UNCERTAINTY ESTIMATION
To estimate uncertainty in this study, we use MCD, and to quantify the uncertainty we use entropy. Each model is a baseline model embedded with MCD layers at a 30% dropout rate. For a fair comparison, the MCD layers are embedded similarly in each model.

A. Monte Carlo Dropout
Model complexity and overfitting are two problems that can be addressed through the implementation of dropout [12]. In Neural Networks (NN), "dropout" refers to the action of omitting certain hidden and visible units. The neuron, along with its incoming and outgoing links, is temporarily disconnected; the neuron which gets "dropped" is chosen at random. Each individual unit is retained with an independent, predetermined probability "p". The probability "p" may be selected using a validation set, or simply set at 0.5, which is close to optimal for many networks and applications; the best probability of retention for input units is almost 1. In summary, during the training stage the output of every neuron is multiplied by a binary mask drawn from a Bernoulli distribution, so that the masked neurons are set to zero, after which the NN is applied at the testing stage. The idea of employing dropout in this way was brought forth by Gal and Ghahramani [7], who used it as an approximation of probabilistic Bayesian models for deep Gaussian processes. An ensemble of predictions showcasing the uncertainty estimates can be generated using MCD. The MCD method involves executing several stochastic forward passes in a model by keeping dropout activated during the testing stage. Suppose we are provided with a trained dropout model $f_{nn}$. To calculate the uncertainty of a single sample $x$, we collect the predictions of $T$ inferences with various dropout masks. Here, $f_{nn}^{d_i}$ is the model with the dropout mask $d_i$. Therefore, we obtain a sample of the potential model outputs for instance $x$ as follows:

$\{ f_{nn}^{d_0}(x), \ldots, f_{nn}^{d_T}(x) \}$    (1)

We obtain an ensemble prediction by computing the mean and the variance of this sample. The mean of the model's posterior distribution for this sample serves as the prediction, and the variance is the estimated uncertainty of the model regarding $x$:

$\text{Predictive Posterior Mean, } p = \frac{1}{T} \sum_{i=0}^{T} f_{nn}^{d_i}(x)$    (2)

$\text{Uncertainty, } c = \frac{1}{T} \sum_{i=0}^{T} \left[ f_{nn}^{d_i}(x) - p \right]^2$    (3)
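Concretely, equations (1)-(3) amount to stacking the outputs of $T$ stochastic forward passes and taking their mean and variance. The sketch below is illustrative; model_with_dropout stands for any of the MCD-embedded models described later and is assumed to return class-wise probabilities as an array.

```python
# Sketch of equations (1)-(3): run T stochastic forward passes with dropout
# kept active and take the mean (predictive posterior mean) and variance
# (uncertainty) of the resulting class-wise probabilities.
import numpy as np

def mc_dropout_predict(model_with_dropout, x, T=500):
    # Each call applies a fresh random dropout mask d_i, i.e. f_nn^{d_i}(x).
    outputs = np.stack([np.asarray(model_with_dropout(x)) for _ in range(T)], axis=0)
    p = outputs.mean(axis=0)  # Eq. (2): predictive posterior mean
    c = outputs.var(axis=0)   # Eq. (3): estimated uncertainty
    return p, c
```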
The dropout model is not modified; only the outcomes of the stochastic forward passes are collected. Through this technique, the predictive mean and model uncertainties are evaluated. As a result, the procedure can be applied to existing dropout-trained models.

IV. EXPERIMENTS
A. Classification Models
1) RNN Variants - LSTM and GRU: In comparison to standard RNNs, LSTM incorporates input gates and forget gates to handle the vanishing and exploding gradient problems. This allows for the collection of long-term information and improved performance on lengthy sequences of text. GRU's internal structure is comparable to LSTM, while its input and output structure is identical to that of a standard RNN [13]. Information cannot be encoded from back to front using either LSTM or GRU. In situations with greater categorisation granularity, such as five-class tasks involving strong and weak commendatory terms, weak disparaging terms, and neutrality, the interplay among degree words, negation words, and emotion words should be considered. This problem is overcome with the bi-directional long short-term memory (Bi-LSTM) and the bi-directional gated recurrent unit (Bi-GRU), in which forward and backward LSTMs or GRUs are stacked in order to capture bidirectional semantic dependence. Bi-LSTM and Bi-GRU are frequently superior to LSTM and GRU, although training time is much longer. In our experiments, we have used stacked Bi-LSTM and stacked Bi-GRU models, since we are comparing them to comparatively powerful transformer models. We use 2 Bi-LSTM layers, and the model starts with the embedding layer. The MCD layer is added right before the classifier layer. Similarly, for the GRU model, we use 2 Bi-GRU layers, and the rest of the model is identical to the LSTM model. Both of them have MCD layers with a 30% dropout rate.
Fig. 1. Monte Carlo Dropout Method
2) Transformer Variants - BERT and XLNet: Simultaneously conditioning both the right and the left context in every layer, BERT [4] architecture is capable of pre-training deep bidirectional representations from unlabeled text. Thus, the BERT model simply requires a final output layer in order to produce popular models which are viable for handling a comprehensive amount of tasks. These tasks range from answering questions to inferencing languages without having to majorly alter the architecture for a specific task. As we know, a directional model reads a given text either from rightto-left or left-to-right. In contrast, encoders in Transformer models scan the full string of words in one go, hence earning the name bidirectional or non-directional. This key attribute enables the model to learn all surrounding-based contexts of a particular word. One major issue with BERT is essentially its pre-training objective on masked sequences i.e the Denoising Autoencoding objective. An autoregressive pretraining technique called XLNet [14] is a rather generic option for enabling the learning of bidirectional contexts. The task is accomplished via maximization of the anticipated likelihood over all permutations of the factorization order. Furthermore, the autoregressive formulation allows the system to triumph over the drawbacks of BERT. It does not apply the denoising to inputs as in the autoencoding objective and removes the unidirectionality from a traditional autoregressive objective. In this study, baseline BERT and XLNet models are utilized with MCD embeddings. Each model of our network ends with an MCD layer (at a 30% rate). This is an improvement since it removes all evidence of dependency between the neurons and allows us to quantify uncertainty. The procedure includes randomly setting neuron outputs to zero at a set rate, simplifying the model even further.
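In Keras, which the authors use, dropout can be kept active at prediction time by calling the Dropout layer with training=True. The sketch below shows such an MCD head on top of a generic encoder output; the layer sizes and activation are illustrative, and only the 30% rate comes from the text.

```python
# Sketch: a classification head whose dropout stays active at inference time
# (Monte Carlo Dropout). The 768-dimensional encoder output is a stand-in for
# the BERT/XLNet (or Bi-LSTM/Bi-GRU) backbone; sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_mcd_head(encoder_dim=768, num_classes=2, rate=0.3):
    inputs = tf.keras.Input(shape=(encoder_dim,))
    x = layers.Dense(64, activation="relu")(inputs)
    # training=True keeps dropout on during model.predict(), which makes every
    # forward pass stochastic and enables Monte Carlo sampling.
    x = layers.Dropout(rate)(x, training=True)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```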
TABLE I: Performance of MCD-Based Models
Model | Accuracy | Precision | Recall | F1 Score
LSTM  | 90.98%   | 89.36%    | 87.06% | 88.19%
GRU   | 88.48%   | 86.42%    | 86.58% | 86.50%
BERT  | 87.05%   | 87.32%    | 87.13% | 87.22%
XLNet | 84.92%   | 85.24%    | 85.05% | 85.13%
B. Experimental Setup and Training Details
The training and testing methods for this experiment are developed using Python libraries such as TensorFlow and Keras. An NVIDIA RTX 3080Ti GPU with 34.1 TeraFLOPS of performance is used to train and assess the models. In our experiment, the dropout rates are set to 30% for comparative purposes, and the models are trained for 5 epochs, with and without the usage of MCD. In every experiment, we set the learning rate at 0.0000006. The number of parameters for LSTM, GRU, BERT, and XLNet is 0.66 million, 0.65 million, 109 million, and 110 million respectively. The batch size for all tests is set to 64.

C. Dataset
This dataset [15] is compiled by interpreting 1 and 2 star ratings as negative polarity and 3 and 4 star ratings as positive polarity. The collection comprises 280,000 training instances and 19,000 test instances per polarity type. In total, 560,000 training instances and 38,000 testing instances are present, where negative and positive polarity are represented by class 1 and class 2 respectively. In our experiment, the positive class is converted to 0 while the negative class remains the same, since binary classification is performed.
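The label convention described above (negative stays class 1, positive class 2 becomes 0) can be applied with a simple remapping. Reading the public header-less CSV release of the dataset with pandas, as sketched below, is an assumption about the loading code rather than the authors' implementation.

```python
# Sketch: loading a Yelp Review Polarity split and remapping the labels as
# described above (negative class 1 stays 1, positive class 2 becomes 0).
# The header-less (label, text) CSV layout matches the public release [15].
import pandas as pd

def load_split(csv_path):
    df = pd.read_csv(csv_path, header=None, names=["label", "text"])
    df["label"] = df["label"].map({1: 1, 2: 0})  # 1 = negative, 0 = positive
    return df["text"].tolist(), df["label"].tolist()
```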
Fig. 2. Accuracy and Loss Curves of BERT and XLNet
Fig. 3. First 10 Samples of the Yelp Review Polarity Dataset

D. Performance Evaluation
The model's quality is evaluated using performance assessment measures once it has completed the classification task. Performance evaluation metrics such as Accuracy, Recall, and Precision are used in quantitative assessments to gauge performance.

E. Measuring Uncertainty
Using the MCD-embedded models with a 30% dropout rate, we utilize 500 test samples and predict each sample 500 times (Monte Carlo Sampling) to determine the distribution of predictions. This is needed to measure the uncertainty from the predicted class-wise softmax score distribution of the 500 test samples. Next, we locate the most uncertain cases, which is beneficial for comprehending our dataset and identifying problematic areas of the model. We select the most uncertain samples from the Monte Carlo predictions using the variance of their softmax scores. The predictive entropy is utilized for evaluating the model uncertainty on a specific sample. Once an uncertain sample is selected, the predictive entropy relays how "surprised" the model is to see that particular sample: the model is said to be sure about its prediction if the value is low, while a high value insinuates that the model is uncertain about the sample. Entropy is computed as

$H \approx -\sum_{c}^{C} \mu_c \log(\mu_c)$    (4)
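Equation (4) can be evaluated directly from the class-wise mean softmax scores of the Monte Carlo samples; the sketch below also ranks the test samples by entropy to surface the most uncertain ones, mirroring the procedure described above.

```python
# Sketch of Eq. (4): predictive entropy from the class-wise mean softmax
# scores of the Monte Carlo samples, plus a helper that ranks samples by
# uncertainty. mc_probs has shape (T, n_samples, n_classes), e.g. T = 500.
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    mu = mc_probs.mean(axis=0)                      # class-wise mean softmax score
    return -np.sum(mu * np.log(mu + eps), axis=-1)  # entropy H per sample

def most_uncertain(mc_probs, k=10):
    h = predictive_entropy(mc_probs)
    return np.argsort(h)[::-1][:k]                  # indices of the k most uncertain samples
```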
Fig. 4. Probability Distribution of the Predictions
We calculate entropy using (4), where $\mu_c = \frac{1}{N}\sum_{n} p_c^n$ is the class-wise mean softmax score. By computing the variance of the predicted softmax scores, we choose the samples. Softmax turns the real-valued outputs into probabilities, therefore the scores we get are simply probabilities. Additionally, the sample indices are sorted to locate an uncertain sample from the test set. We compare the uncertainty by taking a random sample from the test dataset to see how well the model performs with new, meaningful data.

In Fig. 4 we can see four different plots, where the x-axis shows the softmax score and the y-axis reflects the number of samples. If the prediction is correct, we want the distribution to be close to a softmax score of 1 (x-axis) with a high number of samples (y-axis), which gives us a low entropy value. For wrong predictions, we want the opposite, with a high entropy value: this indicates the prediction is very uncertain and should not be relied on. The worst-case scenario in this task is an incorrect prediction with low entropy, which indicates the model's inability to signal its own unreliability. To resolve this issue, we must look at the test accuracy and other metrics such as the F1 score.

In Table II, we see two randomly selected samples which belong to two different classes. From the LSTM and GRU plots of Fig. 4 we see that LSTM and GRU are quite similar in terms of predictive certainty. LSTM has 0.27 and GRU has 0.38 entropy of its softmax score, and both of them predicted correctly. In the second sample, GRU gives a wrong prediction but with higher entropy. In the case of wrong predictions, higher entropy is preferred, as it implies that the prediction is highly uncertain. We then compare this with the performance of BERT and XLNet. In Table I, we see that both LSTM and GRU obtained quite high accuracy comparatively.

From Fig. 2 we can see BERT has a better fit in the training phase. BERT has achieved 87.11% training accuracy and 86.88% validation accuracy. The test accuracy of the BERT model is 87.05%. It has 86.42% precision and 87.13% recall. On the other hand, XLNet did not perform as well as BERT, having only 84.92% test accuracy, 85.24% precision and 85.05% recall. We use the same text from the "Input" column of Table II as in the LSTM and GRU experiments for the out-of-distribution prediction to measure uncertainty. In the BERT section of Fig. 4, we see how well BERT performed compared to XLNet on the first sample from Table II. Even though both of them correctly predicted the first randomly selected test sample, BERT is more certain: BERT has 0.12 entropy and XLNet has 0.73 entropy. In the second case, BERT has 0.46 entropy and XLNet has 0.92 entropy. However, here XLNet has predicted wrong. Regardless, high entropy is a good sign, since the sample will be separated as an uncertain prediction. From the results of the RNN variants and Transformers, we can conclude that BERT outperforms all other models and is followed by LSTM, GRU and XLNet respectively.

F. Discussion
In terms of performance, LSTM outperforms the other models used in our experiments on the Yelp Review Polarity dataset; it is strong in terms of both uncertainty measures and performance. GRU comes very close to LSTM in terms of results due to their structural similarity. Regardless, LSTM still could not outperform BERT in both respects. From our study we find that BERT outperforms XLNet in terms of both performance and uncertainty measurements, while LSTM and GRU outperform it in terms of performance only; their predictions were not as certain as BERT's, despite having the same dropout rate of 30%. BERT also had a good fit during its training phase, unlike XLNet. Even though XLNet has been shown to work better than BERT elsewhere, in this experiment XLNet could not outperform BERT, since the baseline architecture was used and the dataset was small. We can use LSTM (having only 0.66 million parameters) as the classifier model in smaller tasks, or when fewer computational resources are available, since it is more size- and cost-efficient for this particular task.

V. CONCLUSION AND FUTURE WORK
In this study, we assessed various estimations of uncertainty for the state-of-the-art Transformer models BERT and XLNet in text categorization tasks. Here, multiple stochastic passes utilizing MCD are used to derive predictions. We demonstrate that uncertainty can be calculated by using the dropout layers before the classifier layer of the model for stochastic predictions. Moreover, MCD boosts the performance of models by reducing overfitting and increasing test prediction accuracy. Our scheme can separate uncertain samples, reducing risk factors in real-world scenarios. We can conclude that, with smaller amounts of data, baseline BERT performs much better in terms of prediction confidence.
TABLE II: Uncertainty Estimation Report

Input 1: "This is by far the best dentist I have ever been to.She is honest and never trys to sell you a bunch a stuff that you dont need. We are very great full that we discovered her. We adjust our vacations to make sure that we stop into her office in Pennsylvania at least once a year for cleaning and check ups, worth the drive from Florida."
Model | Predicted Class | True Class | Entropy
LSTM  | 0 | 0 | 0.27
GRU   | 0 | 0 | 0.38
BERT  | 0 | 0 | 0.12
XLNet | 0 | 0 | 0.73

Input 2: "After waiting for almost 30 minutes to trade in an old phone part of the buy back program our customer service rep incorrectly processed the transaction. This led to us waiting another 30 minutes for him to correct it. Don't visit this store if you want pleasant or good service."
Model | Predicted Class | True Class | Entropy
LSTM  | 1 | 1 | 0.54
GRU   | 1 | 0 | 0.97
BERT  | 1 | 1 | 0.46
XLNet | 1 | 0 | 0.92
However, LSTM and GRU can be used in smaller text classification tasks, since the performance and confidence difference is almost negligible. Our future work focuses on working with automated uncertainty reasoning in text classification tasks. R EFERENCES [1] A. Malinin and M. Gales, “Predictive uncertainty estimation via prior networks,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/3ea2db50e62ceefceaf70a 9d9a56a6f4-Paper.pdf [2] R. Herbei and M. H. Wegkamp, “Classification with reject option,” The Canadian Journal of Statistics / La Revue Canadienne de Statistique, vol. 34, no. 4, pp. 709–721, 2006. [Online]. Available: http://www.jstor.org/stable/20445230 [3] A. Shelmanov, E. Tsymbalov, D. Puzyrev, K. Fedyanin, A. Panchenko, and M. Panov, “How certain is your Transformer?” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, Apr. 2021, pp. 1833–1840. [Online]. Available: https://aclanthology.org/2021.eacl-main.157 [4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018. [Online]. Available: https://arxiv.org/abs/1810.04805 [5] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” 2020. [Online]. Available: https://arxiv.org/abs/2003.10555 [6] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit, “Tensor2tensor for neural machine translation,” 2018. [Online]. Available: https://arxiv.org/abs/1803.07416 [7] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” ser. ICML’16. JMLR.org, 2016, p. 1050–1059. [8] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/ 9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf [9] M. Teye, H. Azizpour, and K. Smith, “Bayesian uncertainty estimation for batch normalized deep networks,” 2018. [Online]. Available: https://arxiv.org/abs/1802.06455 [10] Y. Hu and L. Khan, “Uncertainty-aware reliable text classification,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery amp; Data Mining, ser. KDD ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 628–636. [Online]. Available: https://doi.org/10.1145/3447548.3467382
[11] A. Vazhentsev, G. Kuzmin, A. Shelmanov, A. Tsvigun, E. Tsymbalov, K. Fedyanin, M. Panov, A. Panchenko, G. Gusev, M. Burtsev, M. Avetisian, and L. Zhukov, “Uncertainty estimation of transformer predictions for misclassification detection,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 8237–8252. [Online]. Available: https://aclanthology.org/2022.acl-long.566 [12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html [13] H. Hettiarachchi and T. Ranasinghe, “Emoji powered capsule network to detect type and target of offensive posts in social media,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 474–480. [14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” 2019. [Online]. Available: https://arxiv.org/abs/1906.08237 [15] X. Zhang, J. Zhao, and Y. LeCun, “Character-level Convolutional Networks for Text Classification ,” arXiv:1509.01626 [cs], Sep. 2015.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
DNN Based Blood Glucose Level Estimation Using PPG Characteristic Features of Smartphone Videos S. M. Taslim Uddin Raju∗ , and M.M.A. Hashem† Department of Computer Science and Engineering Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh [email protected]∗ , [email protected]†
Abstract—Diabetes is a perpetual metabolic issue that can prompt severe complications. Blood glucose level (BGL) is usually monitored by collecting a blood sample and assessing the results. This type of measurement is extremely unpleasant and inconvenient for the patient, who must undergo it frequently. This paper proposes a novel real-time, non-invasive technique for estimating BGL with smartphone photoplethysmogram (PPG) signal extracted from fingertip video and deep neural networks (DNN). Fingertip videos are collected from 93 subjects using a smartphone camera and a lighting source, and subsequently the frames are converted into PPG signal. The PPG signals have been preprocessed with Butterworth bandpass filter to eliminate high frequency noise, and motion artifact. Therefore, there are 34 features that are derived from the PPG signal and its derivatives and Fourier transformed form. In addition, age and gender are also included as features due to their considerable influence on glucose. Maximal information coefficient (MIC) feature selection technique has been applied for selecting the best feature set for obtaining good accuracy. Finally, the DNN model has been established to determine BGL non-invasively. DNN model along with the MIC feature selection technique outperformed in estimating BGL with the coefficient of determination (R2 ) of 0.96, implying a good relationship between glucose level and selected features. The results of the experiments suggest that the proposed method can be used clinically to determine BGL without drawing blood. Index Terms—Smartphone; Blood Glucose Level; Photoplethysmogram; Features Extraction; Features Selection; Deep Neural Network.
I. INTRODUCTION
Worldwide, diabetes is one of the leading causes of death, as well as the most common form of metabolic illness. It is a condition that damages the body's ability to produce or effectively use insulin [1]. Diabetes is a serious health concern that has been proclaimed a worldwide epidemic by the World Health Organization (WHO) because of its rapidly increasing prevalence. Estimates by the International Diabetes Federation suggest that 415 million people had diabetes worldwide in 2015 and project the number to increase to 640 million by 2040 [2]. In addition, patients with chronic diabetes are more likely to suffer from a variety of ailments, including heart disease, kidney damage, and vision loss [3]. Diabetes is classified into two forms: Type I and Type II. Type I, often called juvenile diabetes, is observed mainly in teenagers and accounts for about 10% of cases worldwide. Type II is observed in mature people and is the most common type of diabetes. There are some ways to control diabetes, and one of the easiest and most practical options is self-monitoring. Regular
blood glucose monitoring can minimize complications that occur due to diabetes. Currently, blood glucose level (BGL) is measured by invasive or minimally invasive means [4]. In these processes, a blood sample is usually taken from the human body using a needle, and after examining the sample, the result is calculated. This method of measurement is excruciating, inconvenient, and costly for people who need to measure blood glucose regularly. Non-invasive glucose monitoring, by contrast, is painless, risk-free, convenient, and comfortable for users. Photoplethysmography (PPG) is now extensively used for monitoring vital physiological parameters [5]. PPG is an optical measuring technique used to measure blood volume fluctuations [6]. Typical components of a PPG device include a light source to illuminate the tissue and a detector to sense the reflected light. Periodic variations in the amount of light absorbed occur with blood volume and can be utilized to obtain the PPG signal. Much research has been conducted on monitoring physiological parameters from the PPG signal because of its simplicity, low cost, and comfortable setup. Some examples include hemoglobin level measurement [7], blood pressure estimation [8], and blood glucose level measurement [7]. Generally, PPG signals are obtained through optical approaches such as sensor-based devices, chips, or pulse oximeters [8]. Recently, numerous mobile devices have integrated built-in sensor systems to assess physiological parameters using PPG signals. Patients who need constant health monitoring and health professionals can benefit from these non-invasive approaches. Moreover, technological improvements have enabled the smartphone camera to act as a sensor. For example, in 2015, Devadhasan et al. [9] used a Samsung smartphone camera to predict glucose level. In 2019, Chowdhury et al. [10] developed a non-invasive approach to estimate BGL using the iPhone 7 Plus. Gaussian filter and Asymmetric Least Square methods were applied to reduce the noise of the PPG signal, and the extracted features were then fed to a principal component regression. In [11], the same authors improved video data quality by using Xiaomi Redmi Note 5 Pro, OnePlus 6T, and Samsung Galaxy Note 8 cameras. Zhang et al. [12] developed a non-invasive blood glucose measurement system based on the smartphone PPG signal. The fingertip video was captured using a smartphone camera at 28 fps (sampling rate of 28 Hz) and converted to a PPG signal. Then, 67 features were extracted from the valid PPG
979-8-3503-4602-2/22/$31.00 ©2022 IEEE Page 13
signal and its derivatives, and a subspace K-Nearest Neighbor (KNN) classifier was applied to estimate the blood glucose level. Golap et al. [7] developed a non-invasive approach to measure BGL with smartphone PPG signals and multigene genetic programming (MGGP). This paper introduces a novel non-invasive method to estimate BGL with smartphone PPG signals extracted from fingertip videos and a deep neural network (DNN) model. Near-infrared light-emitting diodes (NIR LEDs) illuminate the finger, and the video of the fingertip is captured on a smartphone. The video data of the fingertip is then preprocessed and converted into the PPG signal. Features are extracted from the selected, preprocessed PPG cycle and its derivatives, as well as its Fourier transform. The Maximal Information Coefficient (MIC) feature selection technique is used to select the best feature set. Finally, a DNN-based model is constructed for estimating the blood glucose level. The main contributions of the paper are summarized as follows:
• Constructing a wearable data collection kit with NIR LEDs to collect fingertip videos.
• Generating the PPG signal from the fingertip video and selecting the best PPG cycle for feature extraction.
• Extracting features from the selected PPG cycle and its derivatives, and selecting the relevant features using the MIC algorithm.
• Developing a DNN model to assess blood glucose levels non-invasively.
The rest of the paper is organized as follows: the methodologies used in this system are presented in Section II. Section III analyses and discusses the results obtained from the proposed method. The conclusion and directions for future work are drawn in Section IV.
II. METHODOLOGY
This section illustrates the proposed methodology concisely. The acquisition of fingertip video data from subjects, generation of the PPG signal, extraction and selection of features, and construction of a model are briefly described. The system's overall architecture is depicted in Fig. 1.
Fig. 1. The overall system architecture for non-invasive BGL measurement.
A. System Configuration
The proposed system for measuring BGL needs a data collection kit to illuminate the finger and a smartphone to capture the video, as illustrated in Fig. 2. The determination of the NIR wavelength range is the initial step of the board design and a vital factor in acquiring a robust and clean PPG signal. Considering availability and financial constraints, 850nm NIR LEDs were found to be optimal for this purpose. The data collection kit consists of a circle of eight NIR LEDs with a white LED in the middle, as shown in Fig. 2(A). The primary function of the white LED is to amplify the intensity of the NIR LEDs. The device's external surface is black, so surface reflectance has minimal impact on the analysis.
B. Data Collection
A smartphone camera (Nexus 6p) with a 30 fps frame rate was used to capture a 15-second video of the right-hand index finger. During the recording phase, the index finger was placed in the data collection kit, as shown in Fig. 2(C), and the fingertip video was recorded while the finger was illuminated, as shown in Fig. 2(D). Care was taken that the finger was cleansed and dried, nail polish was not permitted, and there were no signs of injury. Simultaneously, the gold-standard blood glucose value was determined with a Thermo Scientific Konelab 60i in the clinical laboratory. These two procedures were performed in quick succession, so the blood glucose level did not change appreciably between them. The authorities and medical teams of the Medical Centre Hospital located in Nizam Road, Chattogram, Bangladesh approved the study. 93 volunteering subjects participated in the whole procedure. The age and gender of each subject were also recorded during data collection. Table I presents the statistical information of the dataset.
Fig. 2. Hardware used to capture fingertip videos: A) the NIR LED device/data collection kit turned off, B) the NIR LED device turned on for data collection, C) index finger on the device while the NIR LED device is on, D) Nexus 6p smartphone used to record video while the LEDs illuminate the finger.
TABLE I
PATIENT DEMOGRAPHICS AND CLINICAL LABORATORY DATA
Age (years): 0 to 69 (µ = 32.67, σ = 16.53)
Gender: 59 male (63.5%); 34 female (36.5%)
Glucose (mmol/L): 3.33 to 21.11 (µ = 6.64, σ = 2.97)
µ = mean, σ = standard deviation
C. Generation of PPG Signal and Preprocessing
Blood's absorption of light is related to the variation in finger blood volume, which is reflected and captured in the video. Consequently, the pixel intensity of the same region differs between successive frames. A 15-second video at 30 fps is a series of 450 frames. The first 3 seconds and the last 2 seconds
Fig. 3. Generated PPG signal from fingertip video: (a) raw PPG signal from fingertip video, (b) filtered PPG signal with selected PPG cycle, and (c) single best PPG cycle with highest systolic peak.
of each video are discarded due to unstable frames. The red (225-245), green (0-3) and blue (15-25) channels are extracted from each frame of the video. The intensity of the red channel is the highest among the three channels, so the other channels are discarded. The continuous PPG signal is calculated from the overall pixel intensity variations in each frame. For each video frame, a threshold is set using (1), and the PPG value for the i-th frame is measured by (2) as the mean of the pixels with intensity above that threshold. The PPG signal is acquired by plotting the computed mean pixel value of each frame, as shown in Fig. 3(a).

$$\mathrm{threshold}_i = 0.5 \times \left(\mathrm{intensity}^{i}_{max} + \mathrm{intensity}^{i}_{min}\right) \tag{1}$$

$$PPG[i] = \frac{1}{\mathrm{totalpixels}} \sum_{j=1}^{\mathrm{totalpixels}} \mathrm{intensity}^{i}_{j}, \quad \mathrm{intensity}^{i}_{j} > \mathrm{threshold}_i \tag{2}$$

The threshold value is selected empirically. The PPG signals extracted from the recorded videos in this study were in the transmittance mode; the LED device and the smartphone camera are on opposite sides of the finger. Before feature extraction, the raw PPG signal was preprocessed to minimize noise and motion artifacts. The Butterworth bandpass filter [13] was applied to the calculated PPG signal with fps = 30, minimum blood pulse per minute (BPM_L) = 40, maximum blood pulse per minute (BPM_H) = 500, and order = 4. The preprocessed PPG signal is shown in Fig. 3(b).
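To make Eqs. (1)-(2) and the filtering step concrete, a minimal Python sketch is given below. It is not the authors' code: it assumes the frames are already decoded as RGB arrays (red channel first) and that the unstable leading and trailing seconds have already been dropped.

```python
# Minimal sketch of Eqs. (1)-(2) and the Butterworth bandpass step (assumed,
# not the authors' implementation). Frames are H x W x 3 RGB arrays.
import numpy as np
from scipy.signal import butter, filtfilt

def ppg_from_frames(frames):
    """One PPG sample per frame: mean of red-channel pixels above the threshold."""
    ppg = []
    for frame in frames:
        red = frame[:, :, 0].astype(float)       # red channel (assumed RGB order)
        thresh = 0.5 * (red.max() + red.min())   # Eq. (1)
        bright = red[red > thresh]               # pixels above the threshold
        ppg.append(bright.mean() if bright.size else red.mean())  # Eq. (2)
    return np.asarray(ppg)

def bandpass_ppg(ppg, fps=30, bpm_low=40, bpm_high=500, order=4):
    """4th-order Butterworth bandpass using the cut-offs reported in the paper."""
    nyq = fps / 2.0
    low, high = (bpm_low / 60.0) / nyq, (bpm_high / 60.0) / nyq  # 0.67-8.33 Hz
    b, a = butter(order, [low, high], btype="band")
    return filtfilt(b, a, ppg)
```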
Fig. 4. Features extracted from PPG signal and its derivatives (VPPG and APPG) as well as Fourier Transformed PPG signal.
TABLE II
FEATURES EXTRACTED FROM THE PPG SIGNAL (PPG-34)
f1 (c): Amplitude at maximum slope on the up-rise of the PPG signal
f2 (x): Amplitude of PPG wave
f3 (y): Amplitude of diastolic peak
f4 (z): Amplitude of dicrotic notch
f5 (d): Amplitude of inflection point
f6 (a1): 1st peak of volume change velocity
f7 (b1): 1st valley of volume change velocity
f8 (e1): 2nd peak of volume change velocity
f9 (l1): 2nd valley of volume change velocity
f10 (a2): 1st peak of volume change acceleration
f11 (b2): 1st valley of volume change acceleration
f12 (e2): 2nd peak of volume change acceleration
f13 (t1): Time to maximum slope on the up-rise of the PPG signal
f14 (t2): Elapsed time to systolic peak
f15 (t3): Elapsed time to dicrotic notch
f16 (t4): Elapsed time to inflection point
f17 (t5): Elapsed time to diastolic peak
f18 (tpi): Elapsed time of the entire pulse wave
f19 (ta1): Elapsed time to a1
f20 (tb1): Elapsed time to b1
f21 (te1): Elapsed time to e1
f22 (tl1): Elapsed time to l1
f23 (ta2): Elapsed time to a2
f24 (tb2): Elapsed time to b2
f25 (te2): Elapsed time to e2
f26 (fbase): Fundamental frequency, the reciprocal of tpi
f27 (|sbase|): Fundamental component amplitude acquired from Fast Fourier Transformation (FFT)
f28 (f2nd): Frequency of the second harmonic
f29 (|s2nd|): Second component amplitude acquired from FFT
f30 (f3rd): Frequency of the third harmonic
f31 (|s3rd|): Third component amplitude acquired from FFT
f32 (w): Pulse width at half amplitude
f33 (A3/(A1+A2)): Inflection point area ratio (IPA)
f34 ((A2+A3)/A1): Ratio of the area before and after the dicrotic notch (sVRI)
D. PPG Cycle Selection and Feature Extraction
In this study, a single PPG cycle is needed to extract all the features. PPG signals are continuous, repetitive waveforms that usually carry the same information in each cycle. A peak detection algorithm was applied to detect each systolic peak, and the PPG signal was then segmented into single periods. Each single-period PPG signal might look slightly different for each person, but they all share the same characteristics. From the continuous PPG waveform, the cycle with the largest positive systolic peak was selected based on its maximum intensity variation, as shown in Fig. 3(c). This single PPG cycle was analyzed to extract the characteristic features. After selecting the best PPG cycle, 34 features were extracted from the single PPG cycle, its first derivative (velocity-PPG or VPPG), its second derivative (acceleration-PPG or APPG), and its Fourier transform. Fig. 4 depicts the characteristic features of the PPG signal. The derived features are divided into four categories: amplitude-related features (f1
to f12), time domain features (f13 to f25), frequency domain features (f26 to f31), and other features (f32 to f34). Age (f35) and gender (f36) were also included as features. The details of the PPG-34 features are summarized in Table II.
E. MIC Based Feature Selection
Feature selection increases the performance of models by reducing overfitting, enhancing accuracy, and reducing training time. In this study, the Maximal Information Coefficient (MIC) technique has been applied to determine the optimal feature set. MIC is an information-theoretic measure of mutual dependency that can account for various functional and non-functional dependencies between variables [14]. Using this approach, the relationship between the input features and the target variable is established, and the highest-scoring features are selected as the ones most likely to influence the estimation results. For two discrete vectors, the mutual information $MI_D(F, O)$ is defined as:

$$MI_D(F, O) = \sum_{o \in O}\sum_{f \in F} P(f, o)\log\!\left(\frac{P(f, o)}{P(f)P(o)}\right) \tag{3}$$
where F and O denote the feature set and the reference BGL values, respectively, P(f, o) is the joint probability mass function of f and o, and P(f) and P(o) are the marginal mass functions of f and o. For continuous variables, the mutual information $MI_C(F, O)$ is formulated as:

$$MI_C(F, O) = \int\!\!\int P(f, o)\log\!\left(\frac{P(f, o)}{P(f)P(o)}\right) df\, do \tag{4}$$

where P(f, o) denotes the associated joint probability density and P(f) and P(o) are the corresponding marginal probability densities. Directly calculating the probability mass function can be a useful way to assess the dependence of two continuous variables; however, it is not always straightforward to do so. MIC was therefore created to overcome this problem by providing a maximal mutual information search approach and an optimal data binning method [15]. With MIC, the mutual information can also be normalized to a scale from 0 to 1, making it easier to evaluate the dependencies and correlations between two variables. Consequently, for each pair of collinear features, the feature with the lower MIC value against BGL was eliminated and the other feature was kept for further analysis. Table III shows the features selected using the MIC algorithm.

TABLE III
SELECTED FEATURES USING MIC
f3, f5, f7, f8, f9, f11, f12, f13, f14, f15, f17, f18, f19, f23, f24, f27, f29, f31, f32, f33, f34
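As a concrete illustration of this selection step, the sketch below (an assumption, not the authors' implementation) ranks features by their dependence on the reference BGL values; scikit-learn's mutual-information estimator is used as a stand-in for MIC, for which a dedicated implementation such as the minepy package could be substituted. The feature matrix `X` and target `y` are hypothetical placeholders.

```python
# Dependence-based feature ranking sketch; mutual information stands in for MIC.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def rank_features(X: pd.DataFrame, y: np.ndarray, keep: int = 21):
    """Score each feature against the reference BGL and keep the top `keep` ones."""
    scores = mutual_info_regression(X.values, y, random_state=0)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return ranking.head(keep).index.tolist(), ranking

# Hypothetical usage: X holds the 36 extracted features, y the laboratory BGL.
# selected, ranking = rank_features(X, y, keep=21)
```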
F. Model Construction and Validation
Deep neural networks are artificial feed-forward networks with input, output, and hidden layers. The architecture of the proposed DNN model is shown in Fig. 5.

Fig. 5. The proposed architecture of DNN model for BGL measurement.

As shown in Fig. 5, one neuron makes up the output layer, while the input layer has as many neurons as the number of features. There are a total of four hidden layers: the first has 150 neurons, the second has 200 neurons and a dropout unit of 0.25, the third has 250 neurons, and the fourth has 300 neurons and a dropout unit of 0.5. Suppose the network has H hidden layers and let h ∈ {1, 2, ..., H} index the hidden layers. Let f^(h) specify the input vector for layer h, and let ω^(h) and β^(h) be the weights and biases at layer h. Each hidden-layer neuron's output can be expressed as in Eq. (5):

$$\upsilon_j^{(h+1)} = \sum_j \omega_j^{(h)} f^{(h)} + \beta_j^{(h)} \tag{5}$$

Using the learning rate λ, the following equation is used to iteratively update the weight and bias vectors [16]:

$$\left(\omega^{(h+1)}, \beta^{(h+1)}\right) = \left(\omega^{(h)}, \beta^{(h)}\right) - \lambda\,\frac{\partial E_{Loss}}{\partial\left(\omega^{(h)}, \beta^{(h)}\right)} \tag{6}$$

Using the ReLU activation function, the hidden layer transforms the input above as in Eq. (7):

$$\varphi_{re}(\upsilon) = \max(0, \upsilon) \tag{7}$$

Finally, at the output layer, a linear activation function is employed:

$$\varphi_{li} = \upsilon' \tag{8}$$

where υ′ ∈ (−∞, +∞). The DNN model was trained with 100 epochs, a batch size of 32, and a learning rate of 0.01. The model was trained and evaluated using all features as well as the MIC-based selected features, and was validated using a 10-fold cross-validation technique.

G. Performance Evaluating Criteria
The proposed DNN model's performance is measured using five indices: R² (coefficient of determination), MAE (mean absolute error), MSE (mean squared error), RMSE (root mean square error), and MAPE (mean absolute percentage error). The formulas are as follows [17]:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(o_i - \hat{o}_i)^2}{\sum_{i=1}^{n}(o_i - \bar{o})^2} \tag{9}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|o_i - \hat{o}_i| \tag{10}$$

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(o_i - \hat{o}_i)^2 \tag{11}$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(o_i - \hat{o}_i)^2} \tag{12}$$

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{o_i - \hat{o}_i}{o_i} \tag{13}$$
where $o_i$ is the i-th reference value, $\hat{o}_i$ is the corresponding measured value, and n is the total number of samples.
III. RESULTS AND DISCUSSIONS
A total of 93 subjects (59 male (63.5%) and 34 female (36.5%)) were studied, ranging in age from 0 to 69 years. The reference blood glucose values for this study ranged from 3.33 mmol/L to 21.11 mmol/L, with µ = 6.64 mmol/L and σ = 2.97 mmol/L. The proposed DNN model was first trained and tested with all the features for measuring blood glucose levels. A 10-fold cross-validation technique was used to verify the model, where each fold contains the reference and measured BGL values, and the mean performance of the model was then determined over the folds. Table IV illustrates BGL measurement using the DNN model along with various algorithms. The estimated accuracies of the DNN model using all features (PPG-36) are R² = 0.839 and MAE = 0.566 mmol/L. Furthermore, the MIC algorithm was applied to determine the optimal features, which is essential to reduce the likelihood of overfitting. After applying the MIC algorithm to the feature set, the number of features was reduced from 36 to 21 for BGL. The selected features were then fed to the DNN model to measure the BGL values. According to the results in Table IV, the DNN model with the MIC algorithm provided the highest estimated accuracy of R² = 0.953 and MAE = 0.300 mmol/L. Overall, it is clear that the proposed method (DNN+MIC) provides the best estimated accuracy compared to the other algorithms.

TABLE IV
PERFORMANCE MEASUREMENT OF BGL USING DIFFERENT ALGORITHMS
Algorithm            R²     MAE    MSE    RMSE   MAPE
DNN (PPG-36)         0.839  0.566  1.462  1.209  6.801
MGGP+CFS [7]         0.807  0.324  0.270  0.520  7.269
DNN+CFS [17]         0.902  0.375  0.840  0.917  5.035
Proposed (DNN+MIC)   0.953  0.300  0.411  0.641  3.675
CFS = Correlation-based feature selection
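For illustration, a minimal Keras sketch of the training and validation setup described in Section II-F is shown below. The layer sizes, dropout rates, learning rate, number of epochs, batch size, and 10-fold validation follow the paper; the use of plain SGD and the data arrays `X` and `y` are assumptions.

```python
# Assumed Keras sketch (not the authors' implementation) of the DNN in Section II-F.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def build_dnn(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(250, activation="relu"),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
    return model

def cross_validate(X: np.ndarray, y: np.ndarray, n_splits: int = 10):
    """10-fold CV returning mean MAE and R^2 (Eqs. 9-10)."""
    maes, r2s = [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = build_dnn(X.shape[1])
        model.fit(X[train_idx], y[train_idx], epochs=100, batch_size=32, verbose=0)
        pred = model.predict(X[test_idx], verbose=0).ravel()
        err = y[test_idx] - pred
        maes.append(np.mean(np.abs(err)))
        r2s.append(1 - np.sum(err**2) / np.sum((y[test_idx] - y[test_idx].mean())**2))
    return float(np.mean(maes)), float(np.mean(r2s))
```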
Fig. 6. Correlation and agreement (Bland-Altman) plots of BGL with reference values and estimated values at testing stage for DNN model based on all features, and selected feature using MIC algorithm: (a) relationship, (b) agreement, (c) relationship, and (d) agreement.
Fig. 6 (a) and (c) show a correlation-based comparison between the measured and reference BGL values for all features and for the selected features, respectively. Furthermore, Fig. 6 (b) and (d) depict Bland-Altman plots for determining the distance between the measured value and the reference value. The Bland-Altman analysis establishes limits of agreement to specify the relationship between these values. The plots show that a high percentage of the measured values lie within the limits of agreement (md ± 1.96 × sd). At the 95% confidence interval, the limits of agreement for BGL were [−2.104, 2.201] with all features and [−1.071, 1.372] with the MIC-selected features.
IV. CONCLUSION
Regular BGL monitoring prevents long- and short-term consequences of diabetes. This paper has proposed a novel non-invasive method to estimate BGL with smartphone PPG signals extracted from fingertip videos and a deep neural network model, providing an excellent basis for observing BGL at home in real time. First, a fingertip video is captured using a smartphone camera while the fingertip is illuminated using an NIR LED kit. Second, the frames are converted into a PPG signal and 34 features are extracted from the PPG signal, its derivatives, and its Fourier form; the age and gender of each subject are also added as features. Third, the appropriate features are selected using the MIC technique. Fourth, the DNN model is constructed and validated with a 10-fold cross-validation technique. The DNN model with the MIC feature selection method provided the highest estimated accuracy in measuring BGL. The results indicate that the proposed technique can be used in clinical practice. Future plans include (1) diversifying the dataset to make it more balanced and (2) developing a cloud-based mobile application to make the whole process user-friendly.
ACKNOWLEDGMENT The authors would like to express their appreciation to the study’s volunteers as well as the Medical Centre Hospital’s doctors, nurses, and support personnel in Chittagong, Bangladesh, for their tireless support during the study’s duration. R EFERENCES [1] D. Control, C. T. of Diabetes Interventions, and C. D. S. R. Group, “Intensive diabetes treatment and cardiovascular disease in patients with type 1 diabetes,” New England Journal of Medicine, vol. 353, no. 25, pp. 2643–2653, 2005. [2] K. Papatheodorou, M. Banach, E. Bekiari, M. Rizzo, and M. Edmonds, “Complications of diabetes 2017,” Journal of diabetes research, vol. 2018, 2018. [3] A. D. Deshpande, M. Harris-Hayes, and M. Schootman, “Epidemiology of diabetes and diabetes-related complications,” Physical therapy, vol. 88, no. 11, pp. 1254–1264, 2008. [4] Y. Liu, M. Xia, Z. Nie, J. Li, Y. Zeng, and L. Wang, “In vivo wearable non-invasive glucose monitoring based on dielectric spectroscopy,” in 2016 IEEE 13th International Conference on Signal Processing (ICSP). IEEE, 2016, pp. 1388–1391. [5] A. Hernando et al., “Finger and forehead ppg signal comparison for respiratory rate estimation based on pulse amplitude variability,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 2076–2080. [6] F. Rundo, S. Conoci, A. Ortis, and S. Battiato, “An advanced bioinspired photoplethysmography (ppg) and ecg pattern recognition system for medical assessment,” Sensors, vol. 18, no. 2, p. 405, 2018. [7] M. A.-u. Golap, S. M. T. U. Raju, M. R. Haque, and M. M. A. Hashem, “Hemoglobin and glucose level estimation from ppg characteristics features of fingertip video using mggp-based model,” Biomedical Signal Processing and Control, vol. 67, p. 102478, 2021. [8] L. Wang, W. Zhou, Y. Xing, and X. Zhou, “A novel neural network model for blood pressure estimation using photoplethesmography without electrocardiogram,” Journal of healthcare engineering, vol. 2018, 2018. [9] J. P. Devadhasan, H. Oh, C. S. Choi, and S. Kim, “Whole blood glucose analysis based on smartphone camera module,” Journal of biomedical optics, vol. 20, no. 11, p. 117001, 2015. [10] T. T. Chowdhury, T. Mishma, S. Osman, and T. Rahman, “Estimation of blood glucose level of type-2 diabetes patients using smartphone video through pca-da,” in Proceedings of the 6th International Conference on Networking, Systems and Security, 2019, pp. 104–108. [11] T. T. Islam, M. S. Ahmed, M. Hassanuzzaman, S. A. Bin Amir, and T. Rahman, “Blood glucose level regression for smartphone ppg signals using machine learning,” Applied Sciences, vol. 11, no. 2, p. 618, 2021. [12] Y. Zhang, Y. Zhang, S. A. Siddiqui, and A. Kos, “Non-invasive bloodglucose estimation using smartphone ppg signals and subspace knn classifier,” Elektrotehniski Vestnik, vol. 86, no. 1/2, pp. 68–74, 2019. [13] A. Chatterjee and A. Prinz, “Image analysis on fingertip video to obtain ppg,” Biomedical and Pharmacology Journal, vol. 11, no. 4, pp. 1811– 1827, 2018. [14] D. N. Reshef et al., “Detecting novel associations in large data sets,” science, vol. 334, no. 6062, pp. 1518–1524, 2011. [15] Y. Xing, C. Lv, and D. Cao, “Personalized vehicle trajectory prediction based on joint time-series modeling for connected vehicles,” IEEE Transactions on Vehicular Technology, vol. 69, no. 2, pp. 1341–1352, 2019. [16] B.-K. Lee and J.-H. Chang, “Packet loss concealment based on deep neural networks for digital speech transmission,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 
2, pp. 378–387, 2015. [17] M. R. Haque, S. M. T. U. Raju, M. A.-U. Golap, and M. M. A. Hashem, “A novel technique for non-invasive measurement of human blood component levels from fingertip video using dnn based models,” IEEE Access, vol. 9, pp. 19 025–19 042, 2021.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Calibration of a simplified thermodynamic model for VVER-1200-based nuclear power plants using evolutionary algorithms Sk. Azmaeen Bin Amir Institute of Nuclear Power Engineering (INPE) Bangladesh University of Engineering and Technology (BUET) Dhaka, Bangladesh. [email protected]
Abid Hossain Khan Institute of Nuclear Power Engineering (INPE) Bangladesh University of Engineering and Technology (BUET) Dhaka, Bangladesh. [email protected]
Abstract— A thermal power plant's efficiency and output power are very sensitive to its surrounding weather conditions. Since a nuclear power plant (NPP) usually runs at a lower thermodynamic efficiency compared to other thermal power plants, an additional decrease in output power may challenge the economic viability of the project. Thus, it is very important to establish a sufficiently accurate model that can depict the correlation between NPP output power and condenser pressure. This work attempts to calibrate a simplified thermodynamic model using two evolutionary algorithms, Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). For GA, the initial population is varied in the range of 10-1000, while the mutation and crossover rates are taken as 0.01 and 0.50, respectively. For PSO, the swarm size is varied within the range of 100-1000. Results reveal that the calibrated model has more accurate predictions compared to the original model. The model calibrated with GA is found to perform slightly better than the one calibrated with PSO. Additionally, the calibration process is observed to be insensitive to the reference condenser pressure. Finally, it is estimated that the efficiency of the plant can go down to 33.56% at 15kPa condenser pressure compared to 37.30% at 4kPa.
Keywords—Model calibration, Evolutionary algorithm, Nuclear power, VVER-1200
I. INTRODUCTION The ongoing worldwide fuel crisis has attracted everyone’s attention. The inventory of fossil fuels is depleting rapidly, and soon it will be difficult to meet the energy needs of growing economies. Although humanity has had the solution to this problem ever since the discovery of the nuclear chain reaction, world leaders had to abandon nuclear power due to a series of accidents and subsequent opposition from the mass population in many countries. Now, many developing countries are considering the inclusion of this promising source in their energy mix to support their development projects. The People’s Republic of Bangladesh is one of these countries wishing to establish its very first nuclear power plant (NPP), the Rooppur Nuclear Power Plant (RNPP). The VVER-1200 reactor has been selected for RNPP due to its reliability and enhanced safety features [1]. However, there is a serious concern about the economic benefits of the project, which is estimated to cost around 12.68 billion USD [2].
Another major concern is that the efficiency of RNPP is expected to decrease due to the weather conditions in Bangladesh. This is because of the higher temperature of the available tertiary coolant, i.e., reservoir water, compared to cold-weather countries. As a result, the condenser of the RNPP will have to operate at a higher pressure than the ones in the NPPs located in the USA, Russia, France, China, etc. Unfortunately, the actual thermodynamic cycle of an NPP is too complex to be analyzed without knowing the exact operating parameters. Numerous studies in the literature have proposed different thermodynamic modeling approaches for NPPs. Some considered energy and exergy analysis of the actual cycle using thermodynamic simulators [3-5], while others utilized simplified thermodynamic models [6-8] to reduce computational expenses. Recently, Khan and Islam [7] proposed a simplified thermodynamic model for a VVER-1200-type NPP and showed that their model was accurate enough to estimate most of the plant performance parameters under different condenser pressures. However, the model was based on multiple assumptions that could not be justified from an engineering point of view. The major drawback of the model of Khan and Islam was the selection of the pressures of the feedwater heaters (FWHs) and the value of the isentropic efficiency of the turbines without proper explanation. To overcome these limitations, Khan et al. [8] proposed an optimized thermodynamic model for VVER-1200. The model was optimized for seven parameters: the pressures of six FWHs and the coolant temperature at the steam generator (SG) inlet. The authors used the Genetic Algorithm (GA) to maximize the efficiency of the model for 4kPa condenser pressure, which is a rated design parameter for the NPP. The model was observed to have higher accuracy compared to the previous model [8]. However, the authors considered an arbitrary value of the isentropic efficiency for the turbines, ηT = 0.875, based on engineering knowledge before performing the optimization process. No proper explanation was provided for selecting this particular value. Additionally, the work calibrated the model using only the data for 4kPa condenser pressure and did not consider calibrating the model for condenser pressures other than 4kPa. Therefore, the applicability of the proposed calibration process for other condenser pressures is somewhat unknown. Also, a proper
Fig.1. Schematic diagram of the simplified model (courtesy: Khan et al., 2022) [8]
justification for calibrating the model on the basis of a single known data point was not provided in the work of Khan et. al. Finally, the work utilized only GA; the possibility of utilizing other evolutionary algorithms in the calibration process is yet unexplored. This work attempted to further calibrate the model proposed by Khan et al. [8] using two evolutionary optimization algorithms, GA and Particle Swarm Optimization (PSO). While doing so, an additional decision variable i.e., isentropic efficiency was included. A comparative study was performed to understand which algorithm had greater effectiveness in calibrating the model. The study also explored the sensitivity of the calibration process to the reference data, especially the condenser pressure. The study also investigated the changes in different plant performance parameters such as efficiency, output power, condenser thermal load, etc. with the change in condenser pressure, which is representative of the surrounding weather conditions of a country. II. METHOD A. Simplified thermodynamic model for VVER-1200 This study utilized the same simplified thermodynamic model proposed in the work of Khan et al. [8]. The schematic diagram of the simplified model of Khan et al. is presented in Fig.1. The key difference between the two works, however, is that the present study considered ηT as a decision variable to be optimized, not a constant like the previous work. Thus, there were eight decision variables in this study, as shown in Table I. The ranges within which the optimum values were searched were kept the same as those suggested by Khan et al. [8] for the seven decision variables. The additional decision variable, i.e., ηT was varied within the range of 0.70-1.00. Due to this single change, it was not possible to consider the calibration process as a maximization problem since the efficiency of the cycle will, obviously, always be maximum for ηT = 1.0. Thus, the approach taken by Khan et al. would not be applicable to the calibration process of this study.
To tackle this situation, the calibration process was converted to a minimization problem where the objective of the optimization process was to find the optimum values of the decision variables so that the difference between the rated efficiency and the efficiency predicted by the simplified model was minimized. Thus, the objective function (OF) of the present study is expressed as in (1). The equation for predicting the thermodynamic efficiency from the simplified model is given in (2), in terms of the heat rejected in the steam generator and the total work output WT given by (3). The expressions for the work output from the High-Pressure Turbine (HPT) and Low-Pressure Turbine (LPT) are taken from the work of Khan et al. [8].

Table I. Decision variables (name of the parameter; range of the parameter to be optimized)
P24: Pressure of HPH1; 2.4 MPa ≤ P24 ≤ 3.0 MPa
P25: Pressure of HPH2; 1.0 MPa ≤ P25 ≤ 2.0 MPa
P27: Pressure of LPH1; 300 kPa ≤ P27 ≤ 500 kPa
P28: Pressure of LPH2; 150 kPa ≤ P28 ≤ 250 kPa
P29: Pressure of LPH3; 75 kPa ≤ P29 ≤ 125 kPa
P30: Pressure of LPH4; 30 kPa ≤ P30 ≤ 60 kPa
T23: Secondary coolant inlet temperature; 220 °C ≤ T23 ≤ 230 °C
ηT: Isentropic efficiency of turbines; 0.70 – 1.00
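A minimal sketch of the calibration set-up is given below, under the assumption that the OF in (1) penalizes the gap between the rated efficiency and the efficiency predicted by the simplified model; `simplified_model_efficiency` is a hypothetical placeholder for the model of Khan et al. [8], and the bounds follow Table I.

```python
# Hedged sketch of the calibration objective; the exact forms of Eqs. (1)-(3)
# are assumed, and the simplified cycle model itself is only a placeholder.
BOUNDS = {                   # search ranges from Table I
    "P24": (2.4, 3.0),       # MPa
    "P25": (1.0, 2.0),       # MPa
    "P27": (300.0, 500.0),   # kPa
    "P28": (150.0, 250.0),   # kPa
    "P29": (75.0, 125.0),    # kPa
    "P30": (30.0, 60.0),     # kPa
    "T23": (220.0, 230.0),   # deg C
    "etaT": (0.70, 1.00),    # isentropic efficiency of turbines
}

def simplified_model_efficiency(x: dict, p_cond_kpa: float) -> float:
    """Placeholder for the simplified thermodynamic model of Khan et al. [8]."""
    raise NotImplementedError("insert the simplified-cycle calculation here")

def objective(x: dict, eta_rated: float, p_cond_kpa: float) -> float:
    """OF to be minimised: squared mismatch between rated and predicted efficiency."""
    return (eta_rated - simplified_model_efficiency(x, p_cond_kpa)) ** 2
```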
This study performed calibrations of the simplified model for two condenser pressures, 4kPa and 7kPa since the efficiencies of a VVER-1200-based NPP are known for these
two design pressures. As a result, this work could identify whether calibrating the model with data for a single condenser pressure was justified or not. B. Genetic Algorithm (GA) Genetic Algorithm is an optimization technique based on natural selection for solving both unconstrained and constrained optimization problems. This algorithm works in the same manner as the DNA structures of the parents are transferred to their offspring [9]. In this algorithm, an initial population is taken having a specific number of individuals. These individuals are represented by their chromosomal structure, and they are nothing but potential solutions to the problem. The individuals from the current generation pass on their genes to the next generation just like the parents do to their children [9]. Mimicking the features of natural selection, only the fittest individuals will survive. The process will be repeated until the termination criterion is reached, i.e., the best solution is identified. Fig.2 presents the flow chart of GA.
Fig.3. GA chromosome structure and gene sequence (genes: P24, P25, P27, P28, P29, P30, T23, ηT)
C. Particle Swarm Optimization (PSO) PSO is a strong optimization tool inspired by the birds’ folk or the fish swarm. It is in the class of Meta-heuristic algorithms. It is preferred by many researchers due to its lower computational requirements compared to GA [10]. In PSO, all the potential solutions to the problem are represented by a group of particles behaving like swarms. These particles will have random positions and velocities at the beginning of the optimization process, assuming a uniform distribution. Each particle will know both its best flying outcome and the best outcome for the group and adjust its flight accordingly. Thus, the swarms will approach the global optima of the problem [10]. Fig.4 presents the flow chart of PSO.
Fig.2. Flow chart of GA
In this work, the objective function of the solution, i.e., the optimum set of values of the decision variables, was targeted to be minimized. Since there are eight decision variables, there are eight genes in the chromosomal structure of each individual of the population, as shown in Fig.3. The initial population was varied within the range of 10-1000 in order to observe the sensitivity of the optimization process to the selection of population size. It was assumed that the population remains constant throughout the optimization process. The crossover and mutation rates were taken as 0.50 and 0.01, respectively, following the recommendations of Khan et al. [8]. MATLAB 2018 was used to run the GA-based simulations.
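A minimal real-coded GA sketch with these settings is shown below. It is not the MATLAB implementation used in the study; `objective` is assumed to be a one-argument callable (for example, the OF sketched earlier with the rated efficiency and condenser pressure fixed), and `bounds` follows Table I.

```python
# Simple real-coded GA sketch: truncation selection, uniform crossover (rate
# 0.50), per-gene mutation (rate 0.01), and elitism. Minimises `objective`.
import random

def run_ga(objective, bounds, pop_size=100, generations=200,
           crossover_rate=0.50, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    keys = list(bounds)
    def random_individual():
        return {k: rng.uniform(*bounds[k]) for k in keys}
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=objective)            # lower OF = fitter
        parents = scored[: max(2, pop_size // 2)]      # truncation selection
        children = [dict(scored[0])]                   # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = dict(a)
            if rng.random() < crossover_rate:          # uniform crossover
                child = {k: (a[k] if rng.random() < 0.5 else b[k]) for k in keys}
            for k in keys:                             # per-gene mutation
                if rng.random() < mutation_rate:
                    child[k] = rng.uniform(*bounds[k])
            children.append(child)
        pop = children
    return min(pop, key=objective)
```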
Fig.4. Flow chart of PSO
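For comparison, a minimal global-best PSO sketch over the same bounds is given below; the inertia weight and acceleration coefficients are assumptions, since the paper does not report them.

```python
# Global-best PSO sketch with positions clamped to the Table I ranges.
import random

def run_pso(objective, bounds, swarm_size=100, iterations=200,
            w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    keys = list(bounds)
    pos = [{k: rng.uniform(*bounds[k]) for k in keys} for _ in range(swarm_size)]
    vel = [{k: 0.0 for k in keys} for _ in range(swarm_size)]
    pbest = [dict(p) for p in pos]                      # personal bests
    gbest = dict(min(pbest, key=objective))             # global best
    for _ in range(iterations):
        for i, p in enumerate(pos):
            for k in keys:
                r1, r2 = rng.random(), rng.random()
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - p[k])
                             + c2 * r2 * (gbest[k] - p[k]))
                lo, hi = bounds[k]
                p[k] = min(max(p[k] + vel[i][k], lo), hi)
            if objective(p) < objective(pbest[i]):      # update personal best
                pbest[i] = dict(p)
        gbest = dict(min(pbest, key=objective))         # update global best
    return gbest
```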
In this study, the swarm size varied within the range of 100-1000. The initial swarm locations and swarm velocities were randomly generated. All the simulation runs were performed in MATLAB 2018. The main reason for employing PSO in this study was to compare the results with those of GA
and understand which algorithm performed better in finding the optimum values of the decision variables.
III. RESULT AND DISCUSSION
Tables II to V present the simulation results obtained from GA and PSO for the two condenser pressures, 4kPa and 7kPa. For both GA and PSO, three trial runs were performed for each population size or swarm size, considering the fact that both of these algorithms are probabilistic in nature. From Tables II and IV, it may be observed that there were significant differences in the values of the optimized parameters for each trial run, which was somewhat expected. However, it was observed that the minimized values of the OF were lower for higher population sizes, indicating that increasing the population size facilitated the optimization process. Trial number 2 for the population size of 1000 gave the lowest value of the OF at 4kPa condenser pressure, while Trial number 1 for the population size of 1000 gave the lowest value of the OF at 7kPa condenser pressure. The optimized values of the decision variables obtained from these two trials were considered the best possible combinations.
From Tables III and V, it may be observed that there was a similar trend in the results obtained from the PSO-based simulations: increasing the swarm size resulted in a lower final value of the OF. However, the OFs obtained from the PSO-based simulations were larger than the ones obtained from the GA-based simulations. Thus, it may be opined that GA has better optimization performance compared to PSO. The set of values obtained for the swarm size of 1000 was taken as the calibrated set for further analyses.
Fig.7. Percentage of steam entering condenser vs condenser pressure
Fig.5. Efficiency vs condenser pressure
Fig.8. Condenser thermal load vs condenser pressure
Fig.6. Output power vs condenser pressure
Table VI presents a comparative study among the data from the available literature [1, 11], the predicted values obtained from the original model proposed by Khan et al. [8], and the predicted values from the models calibrated using GA and PSO in the present study. From the table, it may be observed that the calibrated models have better accuracy in predicting the output power of the VVER-1200-based NPP for both 4kPa and 7kPa reference condenser pressures. The difference between the predictions made by the GA- and PSO-calibrated models seemed insignificant, although the GA-calibrated model was observed to perform slightly better. However, the most important observation from Table VI was that the calibration process is mostly insensitive to the condenser pressure for which the model is calibrated. The accuracies of the models calibrated for 4kPa and 7kPa condenser pressures were quite comparable. Thus, it may be opined that a simplified thermodynamic model may be successfully calibrated if suitable data of an NPP is known for a single condenser pressure.
Fig.5 shows the change in the efficiency of the VVER-1200-based NPP with the increase in condenser pressure, as predicted by the GA-calibrated model. The prediction made by the PSO-calibrated model is not shown due to the lack of a significant difference between the two. From Fig.5, it may clearly be observed that the efficiency of the plant goes down drastically with the increase in condenser pressure. The model calibrated for 4kPa condenser pressure predicted that the efficiency may go down to around 34.8% at 10kPa condenser pressure, which means the output power will be somewhere around 1117MWe. It may further go down to 33.56% at 15kPa condenser pressure. Thus, the effect of weather conditions on a VVER-1200-based NPP is quite significant. Fig.6 shows the decrease in output power with the increase in condenser pressure. From the figure, it may clearly be observed that the output power of a VVER-1200 type NPP can be significantly reduced because of the weather conditions of
a country. The predicted output power of RNPP was somewhat above 1150MWe, which is already acknowledged by the manufacturer [11]. However, the condenser pressure may go up further due to global warming. The calibrated model predicted that the output power may be somewhere around 1075.64-1078.07MWe at 15kPa, which is a decrease of more than 100MWe from the rated output.
Table II. Calibrated parameters from GA-based simulations (4kPa condenser pressure)
(Columns, left to right: Population size = 10, Trials #1, #2, #3; Population size = 100, Trials #1, #2, #3; Population size = 1000, Trials #1, #3, #2)
P24: 2474.7209 2467.3890 2400.0016 2407.9657 2400.1254 2451.884 2583.7854 2815.3078 2428.8029
P25: 1881.8403 1609.5481 1300.1620 1589.1426 1204.4883 1979.7737 1718.9499 1522.7583 1664.7149
P27: 451.5229 412.8690 377.6203 425.8513 426.1572 365.7797 327.8356 347.7916 385.5703
P28: 246.6884 225.1401 216.9021 152.0625 160.9704 218.7980 221.1552 245.2861 192.8451
P29: 78.1790 106.5619 100.4443 89.8520 83.7084 78.2297 82.7950 84.7473 105.3640
P30: 57.7767 38.3449 32.8984 31.3686 31.7287 56.3482 37.1669 40.4936 44.7189
T23: 225.6741 221.8881 228.9280 229.4547 228.9129 220.0203 221.6746 225.2528 223.2347
ηT: 0.8771 0.8752 0.8715 0.8715 0.8718 0.8796 0.8759 0.8751 0.8752
OF: 1.74E-08 1.12E-09 4.96E-09 6.73E-11 5.12E-10 3.11E-10 1.57E-11 1.60E-12 2.84E-13

Table III. Calibrated parameters from PSO-based simulations (4kPa condenser pressure)
(Columns, left to right: Swarm size = 100, Trials #1, #2, #3; Swarm size = 500, Trials #1, #2, #3; Swarm size = 1000, Trials #2, #3, #1)
P24: 2754.6129 2754.6129 2754.6129 2930.0610 2930.0610 2930.0610 2647.7764 2647.7764 2647.7764
P25: 1843.5776 1843.5776 1843.5776 1293.0699 1293.0699 1293.0699 1335.2411 1335.2411 1335.2411
P27: 368.5611 368.5611 368.5611 407.3733 407.3733 407.3733 421.5143 421.5143 421.5143
P28: 216.3860 216.3860 216.3860 250.0000 250.0000 250.0000 186.7870 186.7870 186.7870
P29: 75.0000 75.0000 75.0000 112.0445 112.0445 112.0445 75.0000 75.0000 75.0000
P30: 53.0990 53.0990 53.0990 52.8178 52.8178 52.8178 52.7348 52.7348 52.7348
T23: 222.2974 222.2974 222.2974 223.3752 223.3752 223.3752 221.1368 221.1368 221.1368
ηT: 0.8778 0.8778 0.8778 0.8772 0.8772 0.8772 0.8777 0.8777 0.8777
OF: 1.85E-06 1.85E-06 1.85E-06 9.58E-07 9.58E-07 9.58E-07 3.18E-08 3.18E-08 3.18E-08

Table IV. Calibrated parameters from GA-based simulations (7kPa condenser pressure)
(Columns, left to right: Population size = 10, Trials #1, #2, #3; Population size = 100, Trials #1, #2, #3; Population size = 1000, Trials #2, #3, #1)
P24: 2689.9782 2839.8886 2526.6114 2915.3616 2447.1959 2435.4659 2537.1966 2971.6155 2849.7696
P25: 1045.5479 1070.6937 1895.2599 1303.4824 1713.4711 1294.7475 1834.8859 1640.0520 1826.7413
P27: 329.6684 483.6987 337.3784 337.9577 475.5913 467.9364 401.9035 404.0554 371.2532
P28: 231.7469 227.2850 216.4814 163.9912 175.0605 218.1239 243.1071 172.1066 212.4329
P29: 100.4740 96.0304 75.2356 90.02704 93.9971 102.7596 107.1496 88.0583 114.3242
P30: 51.4717 47.7954 58.7411 49.8928 54.9131 48.6735 41.7076 50.7952 31.1552
T23: 223.4388 228.9984 220.0815 220.9293 223.2748 226.8124 225.0174 226.2383 224.4317
ηT: 0.8815 0.8772 0.8783 0.8766 0.8755 0.8727 0.8737 0.8742 0.8739
OF: 1.46E-09 3.66E-10 4.12E-10 1.57E-11 1.42E-10 8.05E-12 7.41E-12 1.21E-11 1.83E-12

Table V. Calibrated parameters from PSO-based simulations (7kPa condenser pressure)
(Columns, left to right: Swarm size = 100, Trials #1, #2, #3; Swarm size = 500, Trials #1, #2, #3; Swarm size = 1000, Trials #2, #3, #1)
P24: 2842.5440 2842.5440 2842.5440 2730.9230 2730.9230 2730.9230 2958.0910 2959.0390 2959.0390
P25: 1111.1000 1111.1000 1111.1000 1532.4730 1532.4730 1532.4730 2000.0000 1034.8260 1034.8260
P27: 466.5223 466.5223 466.5223 431.0854 431.0854 431.0854 379.7654 302.4778 302.4778
P28: 174.2237 174.2237 174.2237 197.8253 197.8253 197.8253 227.9384 245.6381 245.6381
P29: 81.4850 81.4850 81.4850 103.9069 103.9069 103.9069 115.6019 75.0000 75.0000
P30: 58.7672 58.7672 58.7672 32.7113 32.7113 32.7113 42.1705 36.0273 36.02731
T23: 220.0000 223.6968 220.0000 220.0000 220.8855 220.8855 220.8855 226.1751 223.6968
ηT: 0.8793 0.8793 0.8793 0.8747 0.8747 0.8747 0.8666 0.8831 0.8831
OF: 5.18E-09 5.18E-09 5.18E-09 2.52E-08 2.52E-08 2.52E-08 1.51E-07 6.83E-09 6.83E-09
Table VI. A comparative study among available data and predicted values from the calibrated models (predicted value, error %)
Gross Efficiency (%), Rated Data (4kPa condenser pressure) [1], available data = 37.0:
  Original model [4]: 37.34 (+0.92); GA-calibrated (4kPa): 37.30 (+0.805); PSO-calibrated (4kPa): 37.29 (+0.800); GA-calibrated (7kPa): 37.27 (+0.730); PSO-calibrated (7kPa): 37.28 (+0.757)
Output Power (MWe), Rated Data (4kPa condenser pressure) [1], available data = 1198.0:
  Original model [4]: 1199.36 (+0.11); GA-calibrated (4kPa): 1198.00 (-0.000); PSO-calibrated (4kPa): 1197.93 (-0.006); GA-calibrated (7kPa): 1197.13 (-0.073); PSO-calibrated (7kPa): 1197.52 (-0.040)
Output Power (MWe), Rooppur NPP (7kPa condenser pressure) [11], available data = ≥ 1150.0:
  Original model [4]: 1152.16 (+0.19); GA-calibrated (4kPa): 1151.45 (+0.126); PSO-calibrated (4kPa): 1151.50 (+0.130); GA-calibrated (7kPa): 1150.00 (-0.000); PSO-calibrated (7kPa): 1150.00 (-0.000)
Fig.7 presents the change in the percentage of steam entering the condenser with the change in condenser pressure. From Fig.7, it may be observed that the percentage increased with the increase in condenser pressure. Thus, the condenser thermal load will be higher, meaning that a larger condenser will be required for a tropical-region country compared to a cold-region country. The quantitative representation of this
important finding is also presented in Fig.8. From the figure, it may clearly be observed that the condenser thermal load increased with increasing condenser pressure, requiring a larger condenser to remove this excess heat from the secondary coolant.
IV. CONCLUSION
This work aimed at calibrating a simplified thermodynamic model for predicting the thermodynamic performance parameters, i.e. efficiency, output power, etc., of a VVER-1200-based nuclear power plant. To calibrate the model, Genetic Algorithm and Particle Swarm Optimization were employed. Eight decision variables were selected for calibration: the pressures of the two high-pressure heaters, the pressures of the four low-pressure heaters, the temperature of the coolant entering the steam generator, and the isentropic efficiency of the turbines. The model calibrations were performed for two different condenser pressures, 4kPa and 7kPa, in order to know whether the value of the reference condenser pressure had any effect on the calibration process or not. For the GA-based calibration process, the initial population was varied within the range of 10-1000, and the crossover and mutation rates were taken as 0.50 and 0.01, respectively. For the PSO-based calibration process, the swarm size was varied within the range of 100-1000. Three trial runs were performed for each population or swarm size to identify the best possible combination of decision variables. Results revealed that the calibrated models in the present study had better predictive accuracy compared to the model originally proposed by Khan et al. However, no specific advantage to selecting either GA or PSO was identified. Also, the calibration process was found to be unaffected by the selection of the reference condenser pressure, since the results were similar for the models calibrated at 4kPa and 7kPa condenser pressures. The calibrated models were utilized to observe the change in the thermodynamic performance parameters. They predicted that the output power of Rooppur NPP should be near 1150MWe, but it may go down to about 1075MWe if the condenser pressure is increased to 15kPa. Furthermore, the
condenser thermal load was expected to be higher at higher condenser pressure as the percentage of steam entering the condenser increased with increasing condenser pressure. The work utilized a simplified model which is applicable only for VVER-1200-type NPPs. The work may be extended to develop a generalized thermodynamic model that is applicable to both thermal and fast nuclear reactors. The use of machine learning and artificial intelligence may also be explored in the future. REFERENCES [1]
ROSATOM, “The VVER today: evolution, design, safety.” [Online]. Available: https://www.rosatom.ru/. [Accessed: 09-Dec-2022]. [2] A. H. Khan, M. Hasan, M. M. Rahman, and S. M. Anowar, “Estimating the Levelized cost of electricity of the first nuclear power plant in Bangladesh,” Lecture Notes in Electrical Engineering, pp. 1–11, 2022. [3] H. Sayyaadi and T. Sabzaligol, “Various approaches in optimization of a typical pressurized water reactor power plant,” Applied Energy, vol. 86, no. 7-8, pp. 1301–1310, 2009. [4] A. Teyssedou, J. Dipama, W. Hounkonnou, and F. Aubé, “Modeling and optimization of a nuclear power plant secondary loop,” Nuclear Engineering and Design, vol. 240, no. 6, pp. 1403–1416, 2010. [5] L. Lizon-A-Lugrin, A. Teyssedou, and I. Pioro, “Appropriate thermodynamic cycles to be used in future pressure-channel supercritical water-cooled nuclear power plants,” Nuclear Engineering and Design, vol. 246, pp. 2–11, 2012. [6] A. Dragunov, E. Saltanov, I. Pioro, P. Kirillov, and R. Duffey, “Power cycles of generation III and III+ nuclear power plants,” Journal of Nuclear Engineering and Radiation Science, vol. 1, no. 2, 2015. [7] A. H. Khan and M. S. Islam, “Prediction of thermal efficiency loss in nuclear power plants due to weather conditions in tropical region,” Energy Procedia, vol. 160, pp. 84–91, 2019. [8] A. H. Khan, S. Hossain, M. Hasan, M. S. Islam, M. M. Rahman, and J.-H. Kim, “Development of an optimized thermodynamic model for VVER-1200 reactor-based nuclear power plants using genetic algorithm,” Alexandria Engineering Journal, vol. 61, no. 11, pp. 9129– 9148, 2022. [9] C. Liu, “Improved ant colony genetic optimization algorithm and its application,” Journal of Computer Applications, vol. 33, no. 11, pp. 3111–3113, 2013. [10] E. H. Houssein, A. G. Gad, K. Hussain, and P. N. Suganthan, “Major advances in particle swarm optimization: Theory, analysis, and application,” Swarm and Evolutionary Computation, vol. 63, p. 100868, 2021. [11] RNPP, “Main Technical Features of Rooppur NPP. .” [Online]. Available: http://www.rooppurnpp.gov.bd/. [Accessed: 09-Dec-2022].
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Human Speech Emotion Recognition Using CNN MD. Akib Iqbal Majumder , Irfana Arafin , Tajish Farhan , Nusrat Jaman Pretty , Md. Abdullah Al Masum Anas , Md Humaion Kabir Mehedi , MD. Mustakin Alam and Annajiat Alim Rasel Department of Computer Science and Engineering Brac University 66 Mohakhali, Dhaka - 1212, Bangladesh { md.akib.iqbal.majumder ,irfana.arafin, tajish.farhan, nusrat.jaman.pretty, md.abdullah.al.masum.anas, humaion.kabir.mehedi, md.mustakin.alam}@g.bracu.ac.bd [email protected] Abstract—Human speech emotion recognition is known to be the procedure of identifying emotion from the natural speech. It has become necessary to understand and detect the emotions of humans through various ways to provide a better user experience for the consumers. However, individuals have a wide range of diversity in their capacity of expressing emotions. Also, it may differ depending on how those emotions are being identified. Previously, different methodologies and techniques have been used to identify human emotions. An innovative convolutional neural network (CNN) based system for recognizing human speech and emotions is presented in this paper. Using a potent GPU, a model is constructed and fed with an unprocessed speech from a specified dataset for training, classification, and testing. The overall result was 94.38% which is quite better and surpasses many other models. Index Terms—Speech Emotion, CNN, Recognition, Human.
I. INTRODUCTION
With the advancement of smart devices and machine intelligence, the importance of emotion identification in human speech is growing dramatically, as it has a significant impact on understanding the process of human decision-making. It has also been proven in several studies that learning this process can enhance the human-machine communication experience, and it helps to improve the user experience in several cases by automatically acknowledging human emotional states. In recent years, human speech [1] emotion recognition has attained a massive rise in demand for various applications. However, in human SER (speech emotion recognition) [2] it is frequently assumed that audio data received from multiple devices can be merged into a single repository with a centralized approach, which is often unfeasible due to expanding data, bandwidth limitations, communication costs, and the absence of proper privacy safeguards. Many efforts have been made to detect human speech emotions [3] while keeping consumer privacy safe. To resolve this issue, the usage of federated learning has attracted growing attention. Due to its distinctive ability to cooperatively train machine learning models without sharing local data or jeopardizing consumer privacy, federated learning is especially well suited for this purpose. It is a setting for
machine learning where multiple clients are seen to collaborate to train the models combined with a central server while keeping the data decentralized. In this environment, a central server is able to compile model updates from several clients during the training process. Clients can use federated learning to get the fundamental model that the server provides. This model may be seen as an ecosystem in which machine learning models and projects acquire data knowledge. Voice emotion identification is a task that may be used to identify the kind of emotion from human speech data. But the majority of recent research [4] that includes this environment has been on complete supervised learning procedures. However, finding supervised data [5], [6] might be quite challenging for this scenario. In recent years, several improvements have been achieved in the field of study involving the identification of human voice emotion. With the advancements in technology as data collection has been easily attainable, deep learning has seen evolutionary growth as well. This way our speech emotion recognition research area has also made quite a usage of deep learning models like CNN, RNN, DSN, DNN, and so on. However, even though usage of RNN and other versions of RNN such as LSTM had benefits in training data and accuracy but such architectural models were seen to be increasing the computational cost. We intend on studying suitable data while reaching a higher accuracy and lesser computational cost. In this work, we are proposing a CNN (convolutional neural network) [7] approach for better performance of Human Speech Emotion Recognition. II. L ITERATURE R EVIEW In the research, paper [8] in order to achieve better outcomes, the wiener filter, a cascaded PRNN, and K-NN system, and a hybrid MFCC and GLCM system were each used in turn for each of these three phases. The Wiener filter was employed because it is simple to build and controls output error. When used together in a cascade system, PRNN and K-NN a signal’s structure derived from the emotional content is utilised. The PRNN can recognize waves. Also, the nearest pattern of the signal is also to be more likely
established by the K-NN method. This way, a hybrid system was developed that combines both MFCC and GLCM. In the research, paper [9] federated learning was utilized to create the robot model. Following that, any client can train its own model and publish fresh prediction models to the server. When the operation is finished, the server will distribute the revised parameters once more after analyzing them. Each DTbot is a client that uses the same amount of patient data, trains the data locally, and only communicates with the server to update model parameters. The DTbot that was created did not need to send any images, videos, or audio files to the cloud for analysis because that kind of confidential material might include a lot of patients’ in-depth private photos and discussions. It is crucial to safeguard privacy effectively while helping people in this day and age when it is highly valued. All of the hospital’s robots have the capability to train their own personal data using the model provided by the host and communicate the results of any pertinent training results directly to the host for further use. This ensures that the learning model will work and that the medical files will not need to be moved outside of the robot. Guliani and his colleagues [10] trained voice recognition models using the decentralized approach of federated learning. FedAvg (federated averaging algorithm) and RNN-T architecture were utilized. FedAvg was used to execute federated training of RNN-T models on a TensorFlow-based FL simulator running on TPU hardware. In this research study [2], Tsouvalas and his colleagues offer a federated learning-based SER model that protects user privacy. To remove the requirement of extensive labeled data availability on devices, researchers have used a data-efficient federated self-training technique to develop SER models using a small number of labeled data on devices. Based on their examination, they revealed that the accuracy of their models regularly outperforms fully supervised federated settings with the same supply of labeled data. In this research, SemiFedSER framework was introduced by Feng and Narayanan [5] to address the limitations of small labeled sample data. They have implemented federated learning for voice emotion recognition. Semi-FedSER makes use of labeled and unlabeled data samples along with current user pseudo-labeling. In order to address the issue of non-IID data distribution in the FL context, his team also employed the SCAFFOLD approach. Results suggest that the proposed Semi-FedSER architecture gives accurate SER forecasts despite the local label rate l = 20 In this research work [4], Tsouvalas and his colleagues investigate the practical challenge of semi-supervised federated learning for audio identification tasks. The customers have little to no motivation to classify their data, and for a number of crucial jobs, the subject expertise required to complete the annotation process effectively is lacking. On the other hand, vast quantities of unprocessed audio data are easily accessible on client devices. Regardless of its simplicity, they show that their technique, FedSTAR, is extremely practical
for semi-supervised audio recognition training in a variety of federated setups and label availabilities. Comparing FedSTAR's performance with that of its fully-supervised, federated, conventional, and centralized equivalents, they conduct a comprehensive evaluation of FedSTAR on a variety of publicly accessible datasets. The accuracy of the models regularly exceeds that of fully supervised federated setups with the same label availability.

The acoustic modeling of children's ASR is the main topic of the paper [11]. Standard short-term spectral feature extraction for speech recognition frequently uses a speech production model to describe the short-term spectral envelope and gather data about the vocal tract system. The authors show how children's ASR systems can benefit from automated feature learning by investigating one such method. To decode WSJCAM0 utterances, the WSJ corpus' standard 20k trimmed trigram LMs were applied. The PF-STAR language model (LM) was created as follows: Witten-Bell smoothing was used to build one LM using the training set, and Witten-Bell smoothing was used to build another LM using normalized text from the MGB-3 challenge. They also use CNN, GMM-HMM, DNN-HMM, and CNN-HMM systems. In comparison to their GMM/HMM and DNN/HMM counterparts, the CNN-based systems routinely outperform or are on a level with them. The SGMM systems also benefit from multi-pass decoding under data scarcity to produce accurate results. It is important to note that, to the best of the authors' knowledge, the reported performance of 11.99% WER is the best on the PF-STAR corpus.

In the research paper [12], for their proposed technique, the authors created a multi-task framework in which they concurrently train an accent classifier. They also develop a separate network that learns accent embeddings, which may be incorporated into the multi-task design as auxiliary inputs. The Mozilla Common Voice corpus was used for this experiment; Common Voice is a corpus of English read speech compiled from many different English speakers throughout the globe. The authors look at the application of a multi-task design for accented speech recognition where a multi-accent acoustic model and an accent classifier are concurrently trained. In comparison to a multi-accent baseline system, this network performs significantly better, reducing WER by up to 15% on a test set with seen accents and by 10% on unseen accents. Performance is further enhanced by accent embeddings acquired from a separate network.

In the research paper [13], the fundamental method put forward is based on two separate vectors, suggesting that a mood will be produced based on where the two vectors are situated on the emotion planes. Three main sorts of techniques can often be used by modern speech processing systems. The first of them takes into account certain spectral characteristics; in this case, the speaker's general features can be revealed without regard to any particular phoneme characteristics. The second method uses feature vectors for a quick training phase. Regrettably, the amount of training vectors required for real-time emotion recognition is so high that it exceeds the memory and process-
ing power of current computers. As a result, it is required to use some unique solutions, such as vector quantization (VQ) strategies or HMM-based approaches. A VQ codebook is made up of a limited number of straightforward yet very specific feature vectors. By grouping these vectors according to this codebook, it is feasible to reflect certain speaker attributes. The third technique uses speech recognition techniques: since various languages pronounce the same phoneme differently, phoneme templates developed through training processes can be used to identify emotions.

The authors of the article [14] suggest a methodology for SER using a DSCNN (deep stride convolutional neural network) based framework of the kind mostly used as plain nets in computer vision tasks. It has been implemented through scikit-learn packages. They employed transfer learning strategies to train the AlexNet, VGG-16, and ResNet-50 CNN models using the IEMOCAP dataset. Via this model, an accuracy of 81.75% has been achieved.

III. DATASETS
The first dataset is RAVDESS Emotional Speech Audio [15], which consists of 1440 files with randomized speech data collected from songs, audio, and videos. Among the 1440 files, there are 60 trials per actor; 720 files are female voices and the other 720 are male voices. This data consists of seven identifiers: Modality, Vocal channel (01 = speech, 02 = song), Emotion, Emotional intensity, Statement, Repetition, and Actor number. The dataset includes three modalities: full Audio-Video, Audio-only, and Video-only. The speech is retrieved either from an original speech or from a song. It includes 8 different emotions that are categorized and assigned numbers: the neutral state is represented by the number 1, calm by 2, joyful by 3, sad by 4, furious by 5, afraid by 6, disgust by 7, and surprised by 8. However, this dataset is unable to classify the neutral emotional state by intensity: it has only 2 types of emotional intensity, normal and strong. The actors spoke a selection of only 2 sentences.

The CREMA-D dataset [16] focused more on accents and pronunciations by choosing actors of different ethnicities. It consists of 7442 clips from 91 different actors. There was a selection of 12 sentences from which the chosen actors spoke their parts. This dataset consists of 6 different emotions. All these classes can again be classified into 4 different levels through which emotions can be expressed: Low, Medium, High, and Unspecified.

There are 2800 records in the Toronto Emotional Speech Set (TESS) dataset [17], and only 2 female performers were used to record the 200 target words. Seven emotions—anger, disgust, fear, pleasure, pleasant surprise, sorrow, and neutral—are depicted in this dataset.

The SAVEE dataset (Surrey Audio-Visual Expressed Emotion) [18] was initially recorded from 4 English male speakers who were also native English speakers. This dataset portrays 6
different emotions. However, this dataset also adds a neutral category. In the text material contained in this dataset, there are 15 TIMIT sentences per emotion.

IV. METHODOLOGY
A. Data preprocessing:
As we have four different datasets (CREMA-D, RAVDESS, SAVEE, and TESS), we must establish a data frame that stores all the emotion labels of the data together with their file paths. This data frame is used to extract features for our model's training. For each dataset, we need to extract the files for each audio clip according to their emotions, map the encoded integers to actual emotion labels, and then save all the datasets to a single file-path listing. The primary component of a speech emotion recognition system is feature extraction. Most of the time, it is accomplished by converting the voice waveform into a parametric representation at a noticeably lower data rate. In feature extraction, we have used the zero crossing rate and MFCCs (Mel Frequency Cepstral Coefficients).
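The feature-extraction step can be sketched with librosa as below. The window length, offset, and the time-averaging of frame-level features are illustrative assumptions on our part; the paper specifies only that the zero crossing rate and MFCCs are used.

import numpy as np
import librosa

def extract_features(path, duration=2.5, offset=0.6):
    # Load a fixed-length window of the clip; these values are assumptions, not the paper's exact settings.
    y, sr = librosa.load(path, duration=duration, offset=offset)

    # Zero crossing rate: one value per frame, averaged over time.
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=y).T, axis=0)

    # MFCCs: 20 coefficients per frame by default, averaged over time.
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr).T, axis=0)

    # Concatenate into a single 1-D feature vector for the Conv1D model.
    return np.hstack([zcr, mfcc])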
Fig. 1. Waveplot for surprise emotion
B. Audio Augmentation and Splitting:
We can clearly see that the surprise audio has a distinctive waveform in figure 1. Moreover, we can construct synthetic data for audio by injecting noise, altering time, and modifying pitch and tempo. We have applied noise to the particular surprise audio clip shown in figure 2. We have also used stretching, shifting, and pitch modification for audio augmentation. We divided our data into a train set, a validation set, and a test set in order to train our proposed model. The training set has 35026 audio clips, the validation set has 3892, and the test set has 9730.

C. Architecture of Deep Convolutional Network:
Our proposed CNN model has a total of 6 one-dimensional convolutional layers, as can be seen from Table I, each followed by a layer of batch normalization and a layer of max pooling with a pool size of two, except the last Conv1D [19], which has a pool size of 3. The activation function of the first Dense layer is ReLU, and for the second we have used the softmax function. For a baseline comparison, we have also implemented a 4-layer CNN model, shown in Table II, using 4 one-dimensional convolutional layers. We have also used 2 dense layers for this model.
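The four augmentation operations mentioned in subsection B (noise injection, stretching, shifting, and pitch modification) could be implemented roughly as follows; the scaling factors and ranges are assumptions for illustration rather than values reported by the authors.

import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Inject white noise scaled to the signal amplitude.
    return y + noise_factor * np.random.normal(size=y.shape) * np.amax(y)

def time_stretch(y, rate=0.8):
    # Slow down or speed up the clip without changing its pitch.
    return librosa.effects.time_stretch(y=y, rate=rate)

def shift(y, max_shift=5000):
    # Roll the waveform left or right by a random number of samples.
    return np.roll(y, np.random.randint(-max_shift, max_shift))

def pitch_shift(y, sr, n_steps=2):
    # Shift the pitch by a number of semitones while keeping the duration.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)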
Fig. 2. Waveplot for surprise emotion after adding noise

TABLE I
THE SEQUENTIAL MODEL OF THE 6-LAYER CNN MODEL

Layer (type)                                  Output Shape        Param #
conv1d (Conv1D)                               (None, 2376, 512)   3072
batch_normalization (BatchNormalization)      (None, 2376, 512)   2048
max_pooling1d (MaxPooling1D)                  (None, 1188, 512)   0
conv1d_1 (Conv1D)                             (None, 1188, 512)   1311232
batch_normalization_1 (BatchNormalization)    (None, 1188, 512)   2048
max_pooling1d_1 (MaxPooling1D)                (None, 594, 512)    0
conv1d_2 (Conv1D)                             (None, 594, 256)    655616
batch_normalization_2 (BatchNormalization)    (None, 594, 256)    1024
max_pooling1d_2 (MaxPooling1D)                (None, 297, 256)    0
conv1d_3 (Conv1D)                             (None, 297, 256)    196864
batch_normalization_3 (BatchNormalization)    (None, 297, 256)    1024
max_pooling1d_3 (MaxPooling1D)                (None, 149, 256)    0
conv1d_4 (Conv1D)                             (None, 149, 128)    98432
batch_normalization_4 (BatchNormalization)    (None, 149, 128)    512
max_pooling1d_4 (MaxPooling1D)                (None, 75, 128)     0
flatten (Flatten)                             (None, 9600)        0
dense (Dense)                                 (None, 512)         4915712
batch_normalization_5 (BatchNormalization)    (None, 512)         2048
dense_1 (Dense)                               (None, 7)           3591
Total params: 7,193,223
Trainable params: 7,188,871
Non-trainable params: 4,352

TABLE II
THE SEQUENTIAL MODEL OF THE 4-LAYER CNN MODEL

Layer (type)                     Output Shape       Param #
conv1d (Conv1D)                  (None, 162, 256)   1536
max_pooling1d (MaxPooling1D)     (None, 81, 256)    0
conv1d_1 (Conv1D)                (None, 81, 256)    327936
max_pooling1d_1 (MaxPooling1D)   (None, 41, 256)    0
conv1d_2 (Conv1D)                (None, 41, 128)    163968
max_pooling1d_2 (MaxPooling1D)   (None, 21, 128)    0
dropout (Dropout)                (None, 21, 128)    0
conv1d_3 (Conv1D)                (None, 21, 64)     41024
max_pooling1d_3 (MaxPooling1D)   (None, 11, 64)     0
flatten (Flatten)                (None, 704)        0
dense (Dense)                    (None, 32)         22560
dropout_1 (Dropout)              (None, 32)         0
dense_1 (Dense)                  (None, 8)          264
Total params: 557,288
Trainable params: 557,288
Non-trainable params: 0

D. Training CNN model
In natural language processing (NLP), the usage of CNNs has been rising in recent years because of their effective generative and discriminative abilities. We chose a one-dimensional convolutional neural network because the character of our audio changes over time: the linear traversal of the 1D CNN kernels takes advantage of the time-based structure of an audio wave. With an initial learning rate of 0.00001, we used the "adam" optimizer to train the model. A loss function is utilized because it gauges how well the prediction model predicts the anticipated outcome; we have used categorical cross-entropy as the loss function. As this configuration gives us the best performance over 50 epochs, we have chosen to train our model for 50 epochs.
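A hedged Keras sketch of a 1-D CNN consistent with the layer sequence and output shapes of Table I and the training settings above (Adam with a 0.00001 learning rate, categorical cross-entropy, 50 epochs). The kernel sizes are inferred from the parameter counts, and the padding scheme and convolutional activations are our assumptions rather than details stated by the authors.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_length, num_classes=7, lr=1e-5):
    # Stacked Conv1D -> BatchNorm -> MaxPooling blocks, roughly mirroring Table I.
    model = keras.Sequential()
    model.add(layers.InputLayer(input_shape=(input_length, 1)))
    for filters, kernel in [(512, 5), (512, 5), (256, 5), (256, 3), (128, 3)]:
        model.add(layers.Conv1D(filters, kernel, padding='same', activation='relu'))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(pool_size=2, padding='same'))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(num_classes, activation='softmax'))

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# model = build_model(input_length=2376)   # input length taken from Table I
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=64)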
V. RESULTS ANALYSIS
For evaluating our model's performance [20], we have used accuracy, precision, recall, the F1 score, and the confusion matrix. The accuracy and loss graphs of the 4-layer CNN model are illustrated in figure 3 and figure 4. We ran the model for 50 epochs, and with every epoch the training loss declined, as we can see in figure 4. We can also clearly see from figure 3 that the accuracy of the 4-layer CNN model on the test data was around 60.74%.

Fig. 3. Training and Testing accuracy of the 4-layer CNN model
By adding extra data augmentation techniques, using additional feature extraction methods, and adding two more convolutional layers with increased filter sizes, we have achieved an accuracy of 94% on our test data, as can be seen from figure 5. This is respectable but still has room for improvement through the use of additional augmentation approaches and alternate feature extraction strategies. The training and testing accuracy and loss curves of our proposed model are shown in figure 5 and figure 6, respectively.
Fig. 4. Training and Testing loss of the 4-layer CNN model

Fig. 5. Training and Testing accuracy of the proposed model

Fig. 6. Training and Testing loss of the proposed model
Moreover, we can see from the confusion matrix in figure 7 that our model predicts every class quite well. This makes sense given the wide variety of differences between audio files containing expressions of such moods and those containing neutral expressions. We have tested both models on the evaluation metrics. Table III presents the comparative analysis of our proposed model and the 4-layer CNN model.
TABLE III
A COMPARISON BETWEEN THE 6-LAYER AND 4-LAYER CNN MODELS ON PERFORMANCE EVALUATION METRICS

Proposed CNN Model (Acc = 0.94)
Class      Pr     Re     f1-score
angry      0.95   0.95   0.95
disgust    0.96   0.93   0.94
fear       0.94   0.95   0.94
happy      0.93   0.94   0.95
neutral    0.95   0.94   0.95
sad        0.93   0.95   0.94
surprise   0.97   0.97   0.97

4 layers CNN Model (Acc = 0.60)
Class      Pr     Re     f1-score
angry      0.80   0.66   0.73
disgust    0.57   0.44   0.49
fear       0.71   0.46   0.56
happy      0.49   0.65   0.56
neutral    0.55   0.60   0.58
sad        0.54   0.76   0.63
surprise   0.86   0.74   0.80

Here, Pr = precision, Re = recall, and Acc = accuracy.
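The per-class precision, recall, and F1 scores of Table III and the confusion matrix of figure 7 can be produced with scikit-learn; `model`, `x_test`, and `y_test` below are assumed to come from the training sketch shown earlier, not from the authors' code.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_test is assumed one-hot encoded; predictions are class probabilities from model.predict().
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(model.predict(x_test), axis=1)

labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
print(classification_report(y_true, y_pred, target_names=labels))  # per-class Pr/Re/F1 as in Table III
print(confusion_matrix(y_true, y_pred))                            # the matrix visualised in Fig. 7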
Fig. 7. Confusion matrix of the proposed CNN model
terms of accuracy on the test set, scoring 94%, which is 34 percentage points higher than the 4-layer CNN model. In terms of precision (Pr), our proposed model correctly predicts the angry class 95% of the time, disgust 96%, fear 94%, happy 93%, neutral 95%, sad 93%, and, best of all, surprise at 97%. In comparison, the 4-layer CNN model predicted angry 80% of the time, disgust 57%, fear 71%, happy 49%, neutral 55%, sad 54%, and surprise 86% of the time.

VI. CONCLUSION
Human speech emotion recognition is a field where many researchers are working to develop a system that is capable of comprehending the state of a human voice to assess or detect the speaker's emotional state. The human speech emotion recognition literature confronts many difficulties in increasing recognition precision while reducing the computational complexity of the model. To resolve those issues, we have applied a CNN architecture while improving the dataset by merging four different datasets based on their classes. After successfully applying our model and analyzing the achieved results, the model has given better accuracy compared to the results achieved in many of the previous works that we reviewed in earlier sections. However, as we previously discussed the recent growing concerns about keeping user privacy safe in such tasks, our model unfortunately does not take this issue into account. In the future, we would like
to resolve this issue through the usage of federated learning in our work. By utilizing federated learning, we intend to train our deep learning model without our consumers' data ever leaving their devices. We will also try to implement our model for the Bengali language.
R EFERENCES [1] M. A. Pranjol, F. Rahman, E. R. Rhythm, R. A. Shuvo, T. Ahmed, B. Y. Anika, M. A. Al Masum Anas, J. Hasan, S. Arfain, S. Iqbal, M. H. K. Mehedi, and A. A. Rasel, “Bengali speech recognition: An overview,” in 2022 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), 2022, pp. 1–6. [2] V. Tsouvalas, T. Ozcelebi, and N. Meratnia, “Privacy-preserving speech emotion recognition through semi-supervised federated learning,” in 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE, 2022, pp. 359–364. [3] F. R. Rahat, M. Mim, A. Mahmud, I. Islam, M. J. A. Mahbub, M. H. K. Mehedi, A. A. Rasel et al., “Data analysis using nlp to sense human emotions through chatbot,” in International Conference on Advances in Computing and Data Sciences. Springer, 2022, pp. 64–75. [4] V. Tsouvalas, A. Saeed, and T. Ozcelebi, “Federated self-training for semi-supervised audio recognition,” ACM Transactions on Embedded Computing Systems (TECS), 2021. [5] T. Feng and S. Narayanan, “Semi-fedser: Semi-supervised learning for speech emotion recognition on federated learning using multiview pseudo-labeling,” arXiv preprint arXiv:2203.08810, 2022. [6] K. M. Hasib, F. Rahman, R. Hasnat, and M. G. R. Alam, “A machine learning and explainable ai approach for predicting secondary school student performance,” in 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), 2022, pp. 0399– 0405. [7] Q. A. R. Adib, M. H. K. Mehedi, M. S. Sakib, K. K. Patwary, M. S. Hossain, and A. A. Rasel, “A deep hybrid learning approach to detect bangla fake news,” in 2021 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 2021, pp. 442–447. [8] J. Umamaheswari and A. Akila, “An enhanced human speech emotion recognition using hybrid of prnn and knn,” pp. 177–183, 2019. [9] Y. Liu and R. Yang, “Federated learning application on depression treatment robots (dtbot),” in 2021 IEEE 13th International Conference on Computer Research and Development (ICCRD). IEEE, 2021, pp. 121–124. [10] D. Guliani, F. Beaufays, and G. Motta, “Training speech recognition models with federated learning: A quality/cost framework,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3080–3084. [11] S. P. Dubagunta, S. H. Kabil, and M. M. Doss, “Improving children speech recognition through feature learning from raw speech signal,” pp. 5736–5740, 2019. [12] A. Jain, M. Upreti, and P. Jyothi, “Improved accented speech recognition using accent embeddings and multi-task learning.” pp. 2454–2458, 2018. [13] Z. Ciota, “Emotion recognition on the basis of human speech,” pp. 1–4, 2005. [14] S. Kwon, “A cnn-assisted enhanced audio signal processing for speech emotion recognition,” Sensors, vol. 20, no. 1, p. 183, 2019. [15] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, p. e0196391, 2018. [16] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014. [17] K. Dupuis and M. K. Pichora-Fuller, “Toronto emotional speech set (tess)-younger talker happy,” 2010. [18] P. Jackson and S. 
Haq, “Surrey audio-visual expressed emotion (savee) database,” 2014. [19] M. H. K. Mehedi, K. O. Faruk, A. Rahman, I. Nessa, B. Zabin, K. Nahar, S. Iqbal, M. S. Hossain, and A. A. Rasel, “Automatic bangla article content categorization using a hybrid deep learning model,” in 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), 2022, pp. 19–25. [20] M. H. K. Mehedi, A. S. Hosain, S. Ahmed, S. T. Promita, R. K. Muna, M. Hasan, and M. T. Reza, “Plant leaf disease detection using transfer learning and explainable ai,” in 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2022, pp. 0166–0170.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Comparative Analysis of Interpretable Mushroom Classification using Several Machine Learning Models

Md. Sabbir Ahmed, Sadia Afrose, Ashik Adnan, Nazifa Khanom, Md Sabbir Hossain, Md Humaion Kabir Mehedi, Annajiat Alim Rasel
Department of CSE, Brac University, Dhaka, Bangladesh
Abstract—An excellent substitute for red meat, mushrooms are a rich, calorie-efficient source of protein, fiber, and antioxidants. Mushrooms may also be rich sources of potent medications. Therefore, it’s important to classify edible and poisonous mushrooms. An interpretable system for the identification of mushrooms is being developed using machine learning methods and Explainable Artificial Intelligence (XAI) models. The Mushroom dataset from the UC Irvine Machine Learning Repository was the one utilized in this study. Among the six ML models, Decision Tree, Random Forest, and KNN performed flawlessly in this dataset, achieving 100% accuracy. Whereas, SVM had a 98% accuracy rate, compared to 95% for Logistic Regression and 93% for Naive Bayes. The two XAI models SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model Agnostic Explanation) were used to interpret the top three ML models. Index Terms—Decision Tree, Explainable AI, Random Forest, KNN, SHAP, LIME
I. INTRODUCTION
A mushroom is the fleshy body of a fungus, which bears lots of spores. It is also known as toadstool and scientifically known as Agaricus bisporus. Mushrooms belong to the fungi family and can be grown almost anywhere and at any time, from below ground to above the soil, and from tropical to frosty places. Only a few types of mushrooms have been observed so far: out of the estimated 1,500,000 species in the world, less than 69,000 species of mushrooms have been identified to date, and there are fewer than 200,000 species in Indonesia [1]. They can be edible or medicinal, or even poisonous. When consumed, members of the deadly Agaricus and Lepiota families can make a person ill or perhaps kill them. You can consume and even use wild members of the Agaricus and Lepiota families as medications [2]. Based on where they grow, mushrooms can be mild, tropical, or subtropical.

Recently, the term has also been used for a robotic technique in the food sector. This method was used to restrict attributes such as color; the recent mushroom system uses particular qualities to enhance the mushroom selection process. Such a system depends on analyzing and examining features to improve classification based on the recognized characteristics [3]. However, not many research works have discussed the interpretability of how a model identifies an instance of a mushroom as Edible or Poisonous. It is critical that humans can comprehend the conclusion or prediction produced by AI or ML models, since this is a circumstance that might result in life or death. Explainable Artificial Intelligence (XAI) models can help make the behavior of any machine learning model human-understandable [4]. In order to classify mushrooms into those that are harmful to the body and those that are not, this study examines the traits of 23 species of gilled mushrooms to develop the various models, determine whether a mushroom is edible or poisonous, and further show the interpretability of the models. The main focus of this study is to explain the reasons behind every model's behavior in classifying mushrooms based on their features or traits by applying Explainable AI (XAI) models. SHAP (SHapley Additive exPlanations) [5] will be used for global explainability and LIME (Local Interpretable Model-Agnostic Explanations) [6] for local or individual interpretability. This
work will also help to identify the key characteristics for the classification of Edible and Poisonous class. The study is structured as follows: in part II, a quick overview of the related work, section III contains a description of the dataset and models, section IV provides the result and discussion of the proposed work which is followed by section V, which provides a conclusion. II. R ELATED W ORK Based on a CNN model, mushroom is classified to be either edible or not in [1]. With an accuracy of 0.93, the proposed method of deep CNN or DCNN proves to be better in classifying mushroom. In static dataset, deep CNN or DCNN works better. It is also mentioned that since the performance of classification and network depth are unrelated, adding complexity won’t improve the situation. In another paper [7], it is found that the result of K-NN showed 100% accuracy rate. The reason for having the best result is simply for the dataset which was numeric data with discrete value. This kind of dataset is highly suitable for K-NN algorithm, which is why the accuracy rate is this high. Again, based on the results of evaluating the top three data mining classification techniques, comparative classification algorithm testing accuracy in prior data mining has not been conducted. In comparison to the other two widely used classification algorithms, the C4.5 algorithm is the most accurate and has the fastest processing times. This approach produces a decision tree that may be readily used to build applications [2]. In [3], they used a variety of machine-learning classifiers on a dataset called “mushroom data” that was provided, and the dataset is from UCI repository. They discovered that one characteristic, “stalk-root,” has numerous missing values, while another, “veil-type,” has identical values across all rows. In order to avoid their impact on the classifications, these two traits are removed. Whereas, the “odor n” feature is the element that influences decisions the most. Therefore, according to the paper, a mushroom is more likely to be inedible if it has an odor. Due to the data set’s cleanliness, the majority of classifiers appear to perform well on it. Nevertheless, the decision tree, ANN, and SVM classifiers perform better than the other classifiers. Moreover, in order to increase accuracy, they proposed a hybrid model that combines the most effective classifiers. Here in this paper [8], the attempts were taken to improve the results by removing the background from the photographs, but the effort was unsuccessful. The experiment’s findings also indicate that background images have an advantage, particularly when using the KNN algorithm, Eigen features extraction, and the real dimensions of mushrooms, where
accuracy reached 0.944, while the result when the real dimensions are replaced with virtual dimensions is 87%. After removing the background of the photos, the KNN value achieved a maximum of 0.819. In our upcoming work and also according to the paper, certain physical characteristics of mushrooms, such as cup sizes, stem heights, color, and textures can be used for getting better outcome. The accuracy of the four ML models used in the research [9] is 90.99%, 98.82%, 99.98%, and 100% for Naive Bayes, Decision Trees, Support Vector Machines, and AdaBoost algorithms, respectively. However, there was a lack of explanation in the work. Another study [10] applied Naive Bayes, Decision Tree (C4.5), Support Vector Machine (SVM), and Logistic Regression and came to the conclusion that the c4.5 algorithm shows a maximum accuracy of 93.34%. K-Nearest Neighbor and Decision Tree models were compared by Chitayae et al. (2020), who came to the conclusion that Decision Tree performed better [11]. In several articles, deep learning-based methodologies were applied, however they also failed to demonstrate the explainability of their models [12] [13]. III. M ETHODOLOGY A. Dataset Description and Preprocessing The Audubon Society Field Guide to North American Mushrooms’ descriptions of hypothetical samples for 23 species of gilled mushrooms in the Agaricus and Lepiota Family Mushroom are included in this dataset [14]. Every species is labeled as either unquestionably edible, definitely poisonous, or of unknown edibility and not advised. The last two classes were merged as a single class. The dataset has 8124 rows and 23 columns, and there are 4208 instances of the Edible class and 3916 instances of the Poisonous class. The dataset is therefore balanced. Since the information is categorical, we will use LabelEncoder to make it ordinal. The LabelEncoder transforms each value in a column into a number. The category column must be of the data type “category” in order to use this method. A non-numerical column’s datatype by default is “object.” Following the conversion of the columns to the “category” type, LabelEncoder was used to make the ordinal conversion. The column “veil type” was also removed because it has a value of 0 and does not add anything to the data. B. Model Description Decision Tree: It’s a different-supervised regression and classification machine learning approach. Decision trees can be used at classification task, training models to predict the target variable’s class or value from prior data (training data). Decision Trees compare root and record attribute values and predict the next node’s branch from the tree’s root. Support Vector Machine (SVM): SVM is a supervised technique for classification and regression. SVM classifies
future data points by selecting the optimal line or decision boundary, which helps partition n-dimensional space into classes. Random Forest: The random forest consists of multiple decision trees. It employs bagging and feature randomization to produce an uncorrelated forest of trees whose prediction by committee is more accurate than any individual tree. This approach classifies continuous and categorical variables better than it does regression.
Logistic Regression: Logistic regression predicts the likelihood of a target variable using supervised learning classification [15]. A dichotomous target or dependent variable has only two classifications. Logistic regression is categorical, and discrete or categorical results are required here.
K-Nearest Neighbor (KNN): It is a supervised learning classifier which is non-parametric and employs closeness to group data points. It can be used for regression or classification, although it is usually a classification approach that assumes comparable points are close together.
Naive Bayes: It is a Bayes Theorem-based machine learning algorithm used for categorization. A Naive Bayes classifier presupposes that a class's features are unrelated. It has been used for many purposes, but it excels at NLP.

IV. RESULTS AND DISCUSSION
Three of the six ML models that we employed for the classification task performed exceptionally well, and among those three, two of them had a 100% accuracy rate. The other three models did quite well as well. From Table I, we can observe that, in the case of the Support Vector Machine, the Accuracy, Precision, Recall and F1 Score are respectively 98%, 97%, 98% and 98% for the Edible class and 98%, 98%, 97% and 97% for the Poisonous class. In the case of Logistic Regression, the Accuracy for both the Edible and Poisonous classes is 95%, Precision for the Edible class is 96% and for the Poisonous class 94%, Recall is respectively 94% and 96% for the Edible and Poisonous classes, and the F1 Score is 95% for both classes. The worst performing model among the proposed models is Naive Bayes with 93% Accuracy for both the Edible and Poisonous classes. The Precision, Recall and F1 Score for the Edible class are respectively 94%, 92% and 93%. For the Poisonous class, the Precision, Recall and F1 Score are 91%, 94% and 92% respectively. On the other hand, our second best performing model is the K-Nearest Neighbor (KNN) with 100% Accuracy and F1 Score for both classes, 100% Precision and 99% Recall for the Edible class, and 99% Precision and 100% Recall for the Poisonous class. Finally, the best two models are Decision Tree and Random Forest. For both models, we can observe that the Accuracy, Precision, Recall and F1 Score of both the Edible and Poisonous classes are 100%. We will go into depth on the interpretability of the top three models for identifying the edible and poisonous classes using two XAI models, SHAP and LIME. By calculating the Shapley values for the entire dataset and combining them, SHAP can determine the global interpretation.
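A minimal sketch of how such a global SHAP summary could be produced for a tree-based model; `rf_model` and `X_test` are assumed placeholder names, and a model-agnostic shap.KernelExplainer would be needed for the KNN model instead.

import shap

# Tree-based models (Decision Tree, Random Forest) support the fast TreeExplainer;
# X_test is the label-encoded feature DataFrame. This mirrors the summaries in Figs. 1-2.
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Beeswarm-style summary of feature importance. With older SHAP releases shap_values is a
# per-class list (index 1 = poisonous); newer releases return a single array instead.
shap.summary_plot(shap_values[1], X_test)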
Fig. 1. Summary of SHAP values over all features for Decision Tree
Fig. 1 shows a global summary of the SHAP values distributed over the features of the Decision Tree model. For each feature, the distribution of feature importance can be observed. From Fig. 1 it can be seen that gill color, spore print color, gill size, population, stalk shape, and odor have the greatest effects on the prediction over the whole dataset. The high SHAP value are shown on the x-axis. High gill color, spore print color, stalk shape, and odor values impact the prediction negatively as red values are on the left-hand side for these features, while high gill size and population values affect the prediction in a positive way as red values are on the right-hand side. In a similar manner, low gill color, spore print color, stalk shape, and odor values affect the prediction positively, and low gill size and population values affect the prediction negatively. An overview of the SHAP value distribution across all characteristics for the Random Forest model is shown in Fig. 2. The distribution of feature importance is shown for each feature (horizontal rows). As seen in the diagram, odor, gill
color, gill size, spore print color, ring type, and bruises have a significant impact on prediction over the whole dataset. High gill color, ring type, and bruises values have a negative impact on the prediction, as the red values are on the left, while high odor, gill size, and spore print color values have a positive effect, as the red values are on the right-hand side. In a similar manner, low gill color, ring type, and bruises values have a positive impact on the prediction, whereas low odor, gill size, and spore print color values have a negative influence.

Fig. 2. Summary of SHAP values over all features for Random Forest

TABLE I
ACCURACY, PRECISION, RECALL AND F1-SCORE OF EDIBLE AND POISONOUS CLASS FOR THE 6 DIFFERENT ML MODELS

                                          Edible                                    Poisonous
Model Name                     Accuracy  Precision  Recall  F1 Score    Accuracy  Precision  Recall  F1 Score
Decision Tree                  1.00      1.00       1.00    1.00        1.00      1.00       1.00    1.00
Support Vector Machine (SVM)   0.98      0.97       0.98    0.98        0.98      0.98       0.97    0.97
Random Forest                  1.00      1.00       1.00    1.00        1.00      1.00       1.00    1.00
Logistic Regression            0.95      0.96       0.94    0.95        0.95      0.94       0.96    0.95
K-Nearest Neighbour (KNN)      1.00      1.00       0.99    1.00        1.00      0.99       1.00    1.00
Naive Bayes                    0.93      0.94       0.92    0.93        0.93      0.91       0.94    0.92
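A sketch of the preprocessing and evaluation pipeline implied by the methodology (LabelEncoder on every categorical column, removal of the constant veil-type column, and per-class metrics as reported in Table I). The column names, split ratio, and the `df` DataFrame are assumptions for illustration, not details taken from the paper.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# df is the UCI mushroom DataFrame (assumed name); every column is categorical.
df = df.drop(columns=['veil-type'])                       # constant column, removed as described
df = df.apply(lambda col: LabelEncoder().fit_transform(col))

X, y = df.drop(columns=['class']), df['class']            # 'class' holds the edible / poisonous label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), target_names=['edible', 'poisonous']))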
Fig. 3. Summary of SHAP values over all features for K-Nearest Neighbour (KNN)
A global summary of the SHAP value distribution across all characteristics for the K-Nearest Neighbour (KNN) model is shown in Fig. 3. The distribution of feature importance is displayed for each feature (horizontal rows). We can observe from the diagram that gill color, odor, ring type, spore print color, cap color, stalk color above ring, and habitat all have a significant impact on the prediction over the whole dataset. High gill color, odor, ring type, spore print color, and (mostly) stalk color above ring values affect the prediction in a negative way, as the red values are on the left-hand side; high habitat values, on the other hand, have a favorable impact on the prediction since the red values are on the right. Similarly, low gill color, odor, ring type, spore print color, and (mostly) stalk color above ring values affect the prediction positively, and low habitat values affect the prediction negatively. Cap color, however, does not exhibit a pronounced separation of importance from the other top features.
among their most significant features. Therefore, we may infer that these features are the most crucial characteristics for classifying edible and poisonous mushrooms. LIME offers local model interpretability in contrast to SHAP. LIME adjusts one data sample by changing the feature values, then tracks the effect on the output. This frequently relates to the questions that people ask while looking at a model’s results. The next section will discuss LIME on our top three ML models.
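A minimal LIME sketch for a single instance, matching the local explanations discussed next; `X_train`, `X_test`, and `clf` are assumed placeholder names rather than identifiers from the authors' code.

from lime.lime_tabular import LimeTabularExplainer

# X_train / X_test are label-encoded DataFrames and clf is any of the fitted models.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=['edible', 'poisonous'],
    mode='classification')

# Explain a single test instance, as in Figs. 4-6.
exp = explainer.explain_instance(X_test.iloc[0].values, clf.predict_proba, num_features=10)
print(exp.as_list())  # (feature rule, weight) pairs pushing towards edible or poisonous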
Fig. 4. LIME explainability for a single instance of the Decision Tree model

Fig. 5. LIME explainability for a single instance of the Random Forest model

From Fig. 4, it can be observed that in the case of the Decision Tree, the model predicts this particular instance as poisonous with 100% confidence and explains the prediction through the features gill color, gill size, population, stalk root, habitat, cap shape, and cap color. All of these features converge toward class 1, which is the poisonous class. Out of all of these features, gill size and gill color had the most influence on the prediction towards the poisonous class. From Fig. 5, it is clear that for Random Forest, the model accurately predicts that this specific instance is poisonous with 100% confidence and provides an explanation for the prediction based on the features of the sample, including odor, gill size, gill spacing, gill color, ring type, population, stalk color below ring, bruises, stalk color above ring, habitat, ring number, and cap shape. These features all point toward class 1, which is the poisonous class. Odor, gill size, gill color, and gill spacing had the most impact on the prediction of the poisonous class out of all of these features.
It can be inferred from Fig. 6 that the model predicts with 100% confidence that this particular instance is poisonous and offers an explanation for the prediction based on practically all the features for K-Nearest Neighbour (KNN). These features are all indicative of class 1, which is the poisonous class. Out of all of these features, odor, veil color, spore print color, and habitat had the greatest impact on the classification of the poisonous class.

Fig. 6. LIME explainability for a single instance of the K-Nearest Neighbour (KNN) model
V. C ONCLUSION The objective of our proposed research was to demonstrate the interpretability of machine learning models for classifying edible and poisonous mushrooms. In order to do that, the two XAI models, SHAP and LIME, were employed. We demonstrated the global and local interpretability of the top three ML models from our proposed work, and as a result, we obtained some crucial knowledge about the best features for the classification task. The combination of various XAI models with Deep Learning and Machine Learning models may be a potential future research work. R EFERENCES [1] G. Devika and A. G. Karegowda, “Identification of edible and non-edible mushroom through convolution neural network,” in 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021). Atlantis Press, 2021, pp. 312–321.
[2] A. Wibowo, Y. Rahayu, A. Riyanto, and T. Hidayatulloh, “Classification algorithm for edible mushroom identification,” in 2018 International Conference on Information and Communications Technology (ICOIACT). IEEE, 2018, pp. 250–253. [3] O. Tarawneh, M. Tarawneh, Y. Sharrab, and M. Altarawneh, “Mushroom classification using machine-learning techniques,” 07 2022. [4] D. Gunning and D. Aha, “Darpa's explainable artificial intelligence (xai) program,” AI magazine, vol. 40, no. 2, pp. 44–58, 2019. [5] M. Chromik, “reshape: A framework for interactive explanations in xai based on shap,” in Proceedings of 18th European Conference on Computer-Supported Cooperative Work. European Society for Socially Embedded Technologies (EUSSET), 2020. [6] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-agnostic interpretability of machine learning,” arXiv preprint arXiv:1606.05386, 2016. [7] N. Chumuang, K. Sukkanchana, M. Ketcham, W. Yimyam, J. Chalermdit, N. Wittayakhom, and P. Pramkeaw, “Mushroom classification by physical characteristics by technique of k-nearest neighbor,” in 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). IEEE, 2020, pp. 1–6. [8] M. A. Ottom, N. A. Alawad, and K. Nahar, “Classification of mushroom fungi using machine learning techniques,” International Journal of Advanced Trends in Computer Science and Engineering, vol. 8, no. 5, pp. 2378–2385, 2019. [9] K. Tutuncu, I. Cinar, R. Kursun, and M. Koklu, “Edible and poisonous mushrooms classification by machine learning algorithms,” in 2022 11th Mediterranean Conference on Embedded Computing (MECO). IEEE, 2022, pp. 1–4. [10] K. Kousalya, B. Krishnakumar, S. Boomika, N. Dharati, and N. Hemavathy, “Edible mushroom identification using machine learning,” in 2022 International Conference on Computer Communication and Informatics (ICCCI). IEEE, 2022, pp. 1–7. [11] N. Chitayae and A. Sunyoto, “Performance comparison of mushroom types classification using k-nearest neighbor method and decision tree method,” in 2020 3rd International Conference on Information and Communications Technology (ICOIACT). IEEE, 2020, pp. 308–313. [12] N. Zahan, M. Z. Hasan, M. A. Malek, and S. S. Reya, “A deep learning-based approach for edible, inedible and poisonous mushroom
classification,” in 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD). IEEE, 2021, pp. 440–444. [13] N. Kiss and L. Czúni, “Mushroom image classification with cnns: A case-study of different learning strategies,” in 2021 12th International Symposium on Image and Signal Processing and Analysis (ISPA). IEEE, 2021, pp. 165–170. [14] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml [15] F. Hasnat, M. M. Hasan, A. U. Nasib, A. Adnan, N. Khanom, S. M. M. Islam, M. H. K. Mehedi, S. Iqbal, and A. A. Rasel, “Understanding sarcasm from reddit texts using supervised algorithms,” in 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), 2022, pp. 1–6.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Data-driven Forecasting of Weather in Bangladesh Leveraging Transformer Network and Strong Inter-feature Correlation

Zahidul Islam∗, Raisa Fariha∗
∗Dept. of Computer Science and Engineering, Islamic University of Technology, Gazipur-1704, Dhaka, Bangladesh
{zahidulislam, raisafariha}@iut-dhaka.edu
Abstract—Accurate weather forecasting is indispensable for countries like Bangladesh because of their reliance on agriculture and vulnerability to frequently occurring natural disasters such as floods, cyclones, and riverbank erosion. Bangladesh has a fairly regular annual weather pattern where the weather features, such as temperature, humidity, and rainfall, are highly correlated. Leveraging this inter-feature correlation between temperature and rainfall, we propose a flexible transformer based neural network that can forecast monthly temperature or rainfall by analyzing the past few data-points of any one or both of these weather features. We evaluated the proposed method using a public dataset called Bangladesh Weather Dataset which contains 115 years of Bangladesh weather data comprising month-wise average temperature and rainfall measurements. Our method demonstrates substantial improvements over the previously proposed approaches for this task in both metrics- mean squared error and mean absolute error. Our proposed transformer network is also significantly more lightweight and computationally efficient and accurate. This transformer aided method would pave a way to understand and leverage the complex relationship between the weather features and open up possibilities of coming up with even more robust methods of forecasting weather data. Index Terms—Weather forecasting, Transformer, Temperature prediction, Rainfall prediction, Neural Network
I. I NTRODUCTION Being predominantly an agricultural country, a substantial portion of the yearly revenue of Bangladesh comes from the production of crops such as rice, wheat, jute, and other subsectors such as livestock, fisheries, and poultry. The economic growth of Bangladesh is heavily tied to weather features such as temperature, rainfall, and humidity as they heavily influence agricultural production and profits. Therefore, accurate prediction of weather conditions is crucial for countries like Bangladesh. With these forecasts, various stages of farming such as sowing, irrigation, and harvesting can be planned more effectively and farmers can be more accurate in predicting their yield of crops [1]. Robust weather forecasting systems also help to minimize the risks and damages of natural disasters and thereby reduce the loss of fertile lands, forestation, crops, and habitats. Due to being a low-lying delta full of river networks and its proximity to the bay of Bengal, Bangladesh is riddled with natural calamities such as floods, cyclones, droughts, tidal surges, and river erosion. Bangladesh Bureau of Statistics has
Fig. 1. Monthly average temperature and rainfall data from the dataset [5] for the first 200 months since January 1901 normalized in the range -1 to 1.
recently reported that the damages due to natural calamities has increased nearly ten times in 2015-2020 compared to that in the years 2009-2014 [2]. The rapid onset of global warming and climate change might exacerbate the situation. For instance, the devastating flood in the Sylhet region of May 2022 is considered to be one of the worst floods to hit Bangladesh in years. There were thousands of people who were affected by diseases, 75% of which were suffering from Watery Diarrhea and others from various skin problems, eye infections, and other water-borne diseases [3]. According to reports [4], the flood caused damage to almost 105, 000 hectares of land with paddy, maize, and other crops worth Tk 11.13 billion. This caused food production to drop by 2 lakh tons, affecting almost 4.29 lakh farmers. Such setbacks can be controlled and minimized with systems that can track weather patterns and predict future weather conditions based on past data. Weather features such as temperature, rainfall, and humidity often have rhythmic annual patterns and may have strong correlation among one another. Being able to predict future weather conditions is a long sought problem which has been traditionally done using various rule-based approaches, numerical analysis, simulations, and statistical methods [6], [7]. However, due to the complex nature of weather patterns, these methods often do not generalize well in real-world conditions and often are computationally expensive therefore not scalable. Gradually, researchers focused more on data-driven machine learning models [8], [9]. However, these methods often require a lot of pre-processing on the data. In contrast, modern deep learning based methods [10], [11] are more robust and can learn to automatically attend to important features.
Despite the potential socio-economic and agricultural benefits, not much work has been done that focuses particularly on Bangladesh weather data [12], [13]. Even fewer research works have leveraged cutting-edge deep learning methods for more robust predictions. Our contributions in this paper can be summarized as follows:
• We propose to employ an end-to-end transformer-based model to build a robust method for weather forecasting that can learn to capture long-range information by leveraging the attention mechanism.
• We leveraged the strong correlation between different weather features such as temperature and rainfall to come up with models that can learn to predict one weather attribute from another.
• We found significant improvements over existing approaches by evaluating our models on a public dataset containing 115 years of Bangladesh weather data with respect to two performance metrics: Mean Squared Error and Mean Absolute Error.
The rest of our paper is laid out as follows. Section 2 provides a rundown of the related works in weather forecasting. Section 3 describes the proposed methodologies in detail. Section 4 lays out experimentation, training details, and result analyses. Lastly, section 5 recapitulates our works, presents some limitations, and discusses potential future improvements on our work. II. R ELATED W ORK Various statistical methods have been proposed to be used in weather forecasting over the years. Manideep et al. [6] improved an additive variant of the Holt-Winters algorithm for rainfall prediction whereas Graham et al. [7] employed the ARIMA model for the same task. Hidden Markov models were also popular for prediction tasks [14]. Eventually, data-driven machine learning (ML) methods surpassed their performance because of their ability to learn complex patterns within the data. Shafin et al. [8] used Bangladesh weather dataset [5] for temperature prediction task using linear regression, polynomial regression, and Support Vector Regression (SVR). With the widespread adoption of deep neural networks, studies focused more on deep learning approaches for weather forecasting, which can learn complex features within the data with little to no pre-processing. Recurrent neural networks are a common choice for time-series forecasting. That’s why many studies have used RNN, LSTM, and GRU models for weather forecasting. Fente et al. [15] employed an LSTM-based model for predicting different weather features, such as temperature, pressure, and humidity. Chantry et al. [16] outlined various opportunities and challenges of weather modelling using AI methods. Salman et al. [17] predicted weather variables of the Indonesian airport using single-layer and multilayer LSTM. Poornima et al. [18] used 34 years of rainfall data of Hyderabad for rainfall prediction with an intensified LSTM model. Xingjian et al. [19] employed a convolutional LSTM for the task of precipitation nowcasting.
Despite the huge impact of weather forecasting in Bangladesh, only a handful of studies have focused specifically on this region and employed modern deep learning methods. For forecasting monsoon rainfall in Bangladesh, Banik et al. [20] employed three AI algorithms- genetic algorithm (GA), adaptive neuro-fuzzy inference system (ANFIS), and ANN. Using correlation analysis on climate data, Ashraf et al. [21] found homogenous climate zones in Bangladesh. Maria et al. [12] used an ensemble of CNN and LSTM on Bangladesh weather data [5] for predicting monthly average temperature. Khan et al. [13] focused on predicting both monthly temperature and monthly total rainfall using an LSTM-based neural network. Rizvee et al. [22] focused specifically on the north-western region of Bangladesh and used both ANN and Extreme Learning Machine algorithms for the prediction of different weather features such as temperature, rainfall, humidity, and wind. In our work, we implemented a transformer-based neural network architecture inspired by the work of Vaswani et al. [23], where they proposed a sequence transduction network that employs the attention mechanism. Though originally proposed for language modeling, it has been successfully adapted for diverse data such as music [24], images [25], and gene sequences [26]. For temperature forecasting, Bilgin et al. [10] proposed tensorial attention to be used in a transformer encoder. Bojesomo et al. [11] employed a video network called Swin-transformer for short-time weather forecasting. However, to the best of our knowledge, research works on Bangladesh weather have not yet exploited the attention mechanism and transformer networks. III. P ROPOSED M ETHOD For effective weather forecasting, we intend to devise an end-to-end trainable deep learning pipeline that can learn to capture long-range temporal information by focusing particularly on important features. Hence, we implement a neural network architecture based on the transformer model inspired from the work of Vaswani et al. [23]. We choose transformer because it is highly effective in sequence-to-sequence learning and natural language processing tasks due to its ability to handle long-range dependencies by leveraging the attention mechanism [27]. A. Network Architecture As depicted in figure 2, the proposed neural network is composed of the following building blocks- position encoder, input encoder, transformer encoder, and output decoder. The input, i = {i0 . . . in−1 }, is a sequence of n consecutive monthly average temperature or rainfall amounts. In our experiments, the value of n is taken to be 10. The model learns to predict an output, f = {i1 ..in−1 , in }, which is a sequence containing n − 1 consecutive time-step from the input and 1 future timestep, in , representing the average temperature or rainfall of the next month. Sequence to sequence models like RNN, GRU, or LSTM have a recurrent architecture i.e. the time steps in the input sequence are fed into those models one
Fig. 2. Architecture of the proposed transformer [23] based neural network. This sequence to sequence network is composed of a nonlinear input encoder, a position encoder, stacked transformer encoder blocks, and a nonlinear output decoder.
by one sequentially. On the other hand, the transformer model proposed by Vaswani et al. [23] has a non-recurrent design, allowing it to process all the time steps in parallel. This also solves the gradient vanishing problem of recurrent neural nets and allows modern GPUs to be fully utilized during training. As the transformer model is fed all the time steps at once, a position encoder is used to encode the relative position of the time steps in the form of sinusoidal waves, which are passed into the transformer model along with the inputs. Following [23], the positional encoding is calculated using equations (1), where j is the dimension number, p is the position of the time-step, and dim is the dimension of the output of the input encoder. In our case, dim was chosen to be 128.

PosEnc(p, 2j) = \sin\left(p / 10000^{2j/dim}\right), \qquad PosEnc(p, 2j+1) = \cos\left(p / 10000^{2j/dim}\right) \qquad (1)
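A minimal PyTorch sketch of equation (1), following the standard sinusoidal position-encoding implementation; it is not taken from the authors' released code.

import math
import torch

def positional_encoding(n_steps, dim=128):
    # Returns an (n_steps, dim) tensor of sinusoidal encodings that is added
    # to the output of the input encoder, as in equation (1).
    pe = torch.zeros(n_steps, dim)
    position = torch.arange(n_steps, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe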
Outputs of the position and input encoders are added and passed on to the transformer encoder, which is built by stacking three transformer encoder blocks one after another. Each transformer encoder block is composed of one multi-headed self-attention layer and layer-normalization layers, along with some feed-forward layers. The multi-headed self-attention layer is built by stacking H self-attention heads; in our model, 16 attention heads were used in each layer. Each self-attention layer tries to encode the relative importance or weight of each time-step of the previous layer to determine a time-step of the next layer. Each attention layer computes an output comprised of the weighted sum of the values, V, where the weight of each value is computed by measuring the interaction between each input
key, K, with the query, Q, using equation (2). Q, K, and V are derived from the previous layer's output features, and dim is the dimension size of Q [23].

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{dim}}\right) V \qquad (2)
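Equation (2) maps directly to a few tensor operations; a single attention head is shown here for clarity, whereas the model stacks 16 such heads per layer.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Weight each value by the softmax of the query-key interaction,
    # scaled by the square root of the feature dimension, as in equation (2).
    dim = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights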
By estimating the interaction or the relative importance of time-steps of the previous layer’s output amongst each other, the output of the next layer is determined. Skip connections are used in each transformer encoder block which helps reduce vanishing gradients by facilitating flow of gradients through the model. Each time-step in the output of the normalization layer are fed into a feed-forward layer which is comprised of two fully connected layers and a ReLU [28] activation. The output features from the last transformer encoder block goes into the output decoder module which then produces the predicted sequence. B. Nonlinear Encoder Input and Decoder Modules The input encoder learns to encode each time-step of the input sequence data into a feature having a specified dimension, dim. Weather attributes such as temperature, rainfall, or humidity tend to have nuanced nonlinear patterns which are difficult to capture using a linear input encoder. Hence, unlike a generic transformer, we have opted to design the input encoder using two stacked fully connected layers with an activation layer, ReLU [28], in between. The output decoder, on the other hand, takes in the output feature of the transformer encoder with n time-steps and learns to predict the target sequence. Mean Squared Error between the output sequence
TABLE I
COMPARISON OF OUR APPROACH WITH PREVIOUS METHODS

Method           | MSE   | MAE
LSTM [12]        | 0.54  | 0.57
CNN+LSTM [12]    | 0.54  | 0.58
CNN+GRU [12]     | 0.49  | 0.53
LSTM [13]        | --    | 0.38
Ours (Temp2Temp) | 0.012 | 0.085

TABLE II
COMPARISON OF THE DIFFERENT VARIANTS OF THE PROPOSED MODEL

Model                 | MSE   | MAE
Temp2Temp             | 0.012 | 0.085
Rain2Temp             | 0.024 | 0.118
Temp2Temp + Rain2Temp | 0.012 | 0.086
Rain2Rain             | 0.034 | 0.125
Temp2Rain             | 0.030 | 0.116
Rain2Rain + Temp2Rain | 0.032 | 0.119
TempRain2Temp         | 0.013 | 0.089
TempRain2Rain         | 0.030 | 0.117

TABLE III
EVALUATION OF EFFICIENCY OF THE PROPOSED MODEL

Model        | Parameters | Mult-Adds (M) | Param. Size (MB)
LSTM         | 413,057    | 99.24         | 1.65
Transformers | 166,913    | 1.17          | 0.67
and the target sequence is calculated as the loss, which is then back-propagated through the network. The nonlinear nature of the output decoder helps the model learn complex patterns and predict the target sequence accurately.
C. Leveraging Strong Inter-feature Correlation
Different weather features, such as temperature, rainfall, or humidity, are influenced by many interrelated factors and can be strongly correlated with each other. We calculate Pearson's correlation coefficient [29] between the temperature and rainfall data in the Bangladesh weather dataset by evaluating expression (3), where temp and rain are respectively the sequences of temperature and rainfall data.

r = Σ_i (temp_i − mean(temp)) (rain_i − mean(rain)) / √( Σ_i (temp_i − mean(temp))² · Σ_i (rain_i − mean(rain))² )        (3)
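A short NumPy sketch of expression (3) is given below; the variable names are our own, and np.corrcoef would produce the same value.

```python
import numpy as np

def pearson_corr(temp: np.ndarray, rain: np.ndarray) -> float:
    """Expression (3): Pearson's correlation between the two monthly series."""
    dt = temp - temp.mean()
    dr = rain - rain.mean()
    return float((dt * dr).sum() / np.sqrt((dt ** 2).sum() * (dr ** 2).sum()))

# With the monthly temperature and rainfall series from the dataset, this
# returns roughly 0.70, the coefficient reported in the text.
```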
The temperature and rainfall information in the dataset showed a strong correlation, with a coefficient value of 0.70. This indicates that it is possible to train the model to predict one feature from another, which will be especially beneficial in scenarios where the past data of a particular weather feature is missing. This can also help to make predictions stronger by ensembling two different models where, for example, one model predicts temperature from past rainfall data and the other from past temperature data. The variant of our model that is trained to predict future temperature given a sequence of temperature data from the past n months is termed Temp2Temp. On the other hand, the variant that predicts the average rainfall of the next month, given the monthly average temperature of the past n months, is named Temp2Rain. The other variants of the proposed model are named analogously. We have also experimented with ensemble models to increase the diversity of learned representations and improve generalization. In the ensemble model termed Temp2Temp+Rain2Temp, the models Temp2Temp and Rain2Temp are first frozen and their transformer encoders'
outputs are then passed on to a simple classifier network composed of fully connected layers, which provides the final predictions. In the TempRain2Temp model, both the temperature and rainfall data of the past n months are simultaneously fed into a single network to predict the average temperature of the next month. In the following sections, we present and analyze the performance of each of these variants side by side. Our experiments show that it is possible to predict one of these weather attributes given the past information of the other with reasonable accuracy.
IV. EXPERIMENTS AND RESULT ANALYSES
A. Dataset
We utilize a publicly available dataset on Bangladesh weather called the "Bangladesh Weather Dataset" [5], shared on the Kaggle platform [30]. This dataset compiles monthly average temperature and monthly total rainfall measurements over 115 years, ranging from January 1901 up to December 2015, which makes 1380 data-points in total. The dataset is compiled in CSV format. In Fig. 1, we plot the normalized temperature and rainfall data of the first 200 months in the dataset; the red line indicates the temperature data, whereas the blue line depicts rainfall. The annual cyclic pattern of both temperature and rainfall is noticeable. The variability in this annual trend is higher in rainfall than in temperature, with occasional spikes, whereas the monthly average temperature has a fairly consistent annual trend.
B. Training Methodology
To prepare the data for experimentation, out of the 1380 data-points in the dataset, we take the first 950 contiguous data-points for training and leave the other 450 for testing. We traverse through both the training and test data and extract all possible sequences of 10 contiguous months of rainfall or temperature data. This gives us 939 sequences of training data and 419 sequences for testing. We use a batch size of 25 for training the models and do not employ any data augmentation methods. The proposed models are implemented using the PyTorch [31] library in Python. We train the networks on a workstation with an AMD Ryzen 7 5800H CPU and a single Nvidia RTX 3060 GPU. We train each of the models for 200 epochs; on average, 0.72 seconds are needed for training each epoch on our setup. We set the initial learning rate to 0.001 and decrease it with a multiplier of 0.95 after each epoch using a learning rate scheduler. To optimize the models, we employ the AdamW [32] optimizer.
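The following is a minimal PyTorch sketch of the windowing and optimizer settings described above; the random series and the placeholder linear model are stand-ins, not the paper's actual data or network.

```python
import numpy as np
import torch

def make_sequences(series, window=10):
    """Each sample: `window` contiguous months; target: the following month
    (a Temp2Temp-style setup). `series` is a 1-D array of monthly values."""
    xs = [series[i:i + window] for i in range(len(series) - window)]
    ys = [series[i + window] for i in range(len(series) - window)]
    return (torch.tensor(np.array(xs), dtype=torch.float32),
            torch.tensor(np.array(ys), dtype=torch.float32))

series = np.random.rand(1380)                     # stand-in for the normalized monthly data
train_x, train_y = make_sequences(series[:950])   # first 950 contiguous points for training
test_x, test_y = make_sequences(series[950:])     # remaining points for testing

model = torch.nn.Linear(10, 1)                    # placeholder model, not the paper's network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # x0.95 per epoch
```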
Fig. 3. Forecast of different variants of the proposed model on the test data. The red lines indicate the actual target data and the blue lines represent the predicted data.
C. Performance Metrics
For evaluating the effectiveness of the proposed models, we use mean squared error (MSE) and mean absolute error (MAE), as they are standard metrics for determining the likeness between two signals. Mean squared error is the average of the squared errors. It can be expressed using equation (4), where y is the predicted and ȳ is the target data-point, and N represents the number of sequences.

MSE = (1/N) Σ_{i=1}^{N} (y_i − ȳ_i)²        (4)
Similarly, mean absolute error is the average of the absolute errors over all the sequences and can be represented using equation (5).

MAE = (1/N) Σ_{i=1}^{N} |y_i − ȳ_i|        (5)
D. Comparison with Existing Works
In Table I, we compare the Temp2Temp variant of our proposed model with the temperature prediction models proposed in previous works on the Bangladesh weather dataset [12], [13]. Maria et al. [12] used ensembles of convolutional and recurrent neural networks, such as CNN+LSTM. Khan et al. [13] used an LSTM model. Our transformer-based network reduces the errors by a significant margin, which indicates a higher accuracy of predictions. In Table II, we compare the different variants of the proposed model. We can see that the models that predict temperature, such as Temp2Temp, have lower error compared to the rainfall prediction models. This is because the annual rhythmic pattern of monthly average temperature is more consistent and regular than the rainfall data. Because of the higher variability of
the rainfall data, the rainfall prediction models show higher error. The ensemble models do not show any significant improvement because of the high variance and irregularity in the rainfall data. However, the prediction error of the Rain2Rain model does decrease when it is ensembled with the Temp2Rain model. If more informative weather features are added, such as humidity and wind, the ensemble models are likely to be more effective.
E. Evaluation of Efficiency
For assessing the efficiency of our proposed model architecture in terms of memory and computational complexity, we implement an LSTM model which has an architecture similar to the proposed transformer-based model, except that each transformer encoder block is replaced with an LSTM layer. For consistency, the number of features in the hidden state of the LSTM is taken to be equal to the number of features in the transformer encoder layer. In Table III, the two models are compared in terms of parameter count, the number of multiplication-addition operations, and the parameter size in bytes. The proposed transformer-based model has almost three times fewer parameters. Moreover, it uses almost a hundred times fewer Mult-Add operations than the LSTM-based model, which makes it more effective in resource-constrained situations.
F. Qualitative Analysis
In Fig. 3, we demonstrate the qualitative results of our approach by plotting the predictions of different variants of the proposed model against the actual target values. We plot the predictions for the first 100 sequences in the test data. It can be noticed that the temperature prediction model visually appears to be more accurate. In the rainfall prediction models, such as Rain2Rain, the predicted values do not always match the anomalously high values in the target data, whereas the peaks in the temperature data are fairly consistent and regular. This indicates that in the rainy season of some years it rains
significantly more than in others, which makes it challenging for the models to make accurate predictions. However, overall, all the variants could match the annual pattern and trend of both the temperature and rainfall data with good accuracy, which indicates the effectiveness of our approach.
V. CONCLUSIONS
Even though accurate weather prediction is critical for Bangladesh because of its geography and socio-economic conditions, only a handful of research works have focused on weather forecasting in Bangladesh and employed state-of-the-art methods. In this work, we propose a deep learning pipeline for weather forecasting using a transformer-based neural network. Our method has shown superior performance both in terms of accuracy and efficiency. We intend to follow up this work by integrating more weather attributes, such as humidity. By developing models that can understand the complex relationship between multiple weather features, it might be possible to make more accurate predictions.

REFERENCES
[1] M. A. Hossain, M. N. Uddin, M. A. Hossain, and Y. M. Jang, "Predicting rice yield for Bangladesh by exploiting weather conditions," in 2017 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 2017, pp. 589–594.
[2] M. Byron, "Natural disasters 2015-2020: Loss, damage rise tenfold," The Daily Star. [Online]. Available: https://www.thedailystar.net/environment/climate-crisis/naturaldisaster/news/the-poor-always-lost-their-most-bbs-3052221
[3] "Sylhet flash floods: Situation support," Relief Web Report. [Online]. Available: https://reliefweb.int/report/bangladesh/sylhet-flashfloods-situation-support
[4] M. Hossain, "Sylhet farmers lose crops worth over Tk 11.13b in floods," New Age Bangladesh. [Online]. Available: https://www.newagebd.net/article/175291/sylhetfarmers-lose-crops-worth-over-tk-1113b-in-floods
[5] Y. Rubaiyat, "Bangladesh weather dataset." [Online]. Available: https://www.kaggle.com/datasets/yakinrubaiat/bangladeshweather-dataset
[6] K. Manideep and K. Sekar, "Rainfall prediction using different methods of Holt Winters algorithm: A big data approach," Int. J. Pure Appl. Math, vol. 119, no. 15, pp. 379–386, 2018.
[7] A. Graham and E. P. Mishra, "Time series analysis model to forecast rainfall for Allahabad region," Journal of Pharmacognosy and Phytochemistry, vol. 6, no. 5, pp. 1418–1421, 2017.
[8] A. A. Shafin, "Machine learning approach to forecast average weather temperature of Bangladesh," Global Journal of Computer Science and Technology, 2019.
[9] M. A. R. Mia, M. A. Yousuf, and R. Ghosh, "Business forecasting system using machine learning approach," in 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), 2021, pp. 314–318.
[10] O. Bilgin, P. Maka, T. Vergutz, and S. Mehrkanoon, "TENT: Tensorized encoder transformer for temperature forecasting," arXiv preprint arXiv:2106.14742, 2021.
[11] A. Bojesomo, H. Al-Marzouqi, P. Liatsis, G. Cong, and M. Ramanath, "Spatiotemporal Swin-Transformer network for short time weather forecasting," in CIKM Workshops, 2021.
[12] A. S. Maria, S. Afridi, and S. Ahmed, "An ensemble of CNN-LSTMs for temperatures forecasting from Bangladesh weather dataset," in 2021 International Conference on Science & Contemporary Technologies (ICSCT), 2021, pp. 1–6.
[13] M. M. R. Khan, M. A. B. Siddique, S. Sakib, A. Aziz, I. K. Tasawar, and Z. Hossain, "Prediction of temperature and rainfall in Bangladesh using long short term memory recurrent neural networks," in 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 2020, pp. 1–6.
[14] J. P. Hughes, P. Guttorp, and S. P. Charles, "A non-homogeneous hidden Markov model for precipitation occurrence," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 48, no. 1, pp. 15–30, 1999.
[15] D. N. Fente and D. K. Singh, "Weather forecasting using artificial neural network," in 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT). IEEE, 2018, pp. 1757–1761.
[16] M. Chantry, H. Christensen, P. Dueben, and T. Palmer, "Opportunities and challenges for machine learning in weather and climate modelling: hard, medium and soft AI," Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200083, 2021. [Online]. Available: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2020.0083
[17] A. G. Salman, Y. Heryadi, E. Abdurahman, and W. Suparta, "Single layer & multi-layer long short-term memory (LSTM) model with intermediate variables for weather forecasting," Procedia Computer Science, vol. 135, pp. 89–98, 2018.
[18] S. Poornima and M. Pushpalatha, "Prediction of rainfall using intensified LSTM based recurrent neural network with weighted linear units," Atmosphere, vol. 10, no. 11, p. 668, 2019.
[19] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in Neural Information Processing Systems, vol. 28, 2015.
[20] S. Banik, F. H. Chanchary, K. Khan, R. A. Rouf, and M. Anwer, "Neural network and genetic algorithm approaches for forecasting Bangladeshi monsoon rainfall," in 2008 11th International Conference on Computer and Information Technology, 2008, pp. 735–740.
[21] F. B. Ashraf, M. R. Kabir, M. S. R. Shafi, and J. I. M. Rifat, "Finding homogeneous climate zones in Bangladesh from statistical analysis of climate data using machine learning technique," in 2020 23rd International Conference on Computer and Information Technology (ICCIT), 2020, pp. 1–6.
[22] M. A. Rizvee, A. R. Arju, M. Al-Hasan, S. M. Tareque, and M. Z. Hasan, "Weather forecasting for the north-western region of Bangladesh: A machine learning approach," in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–6.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[24] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, "Music transformer," arXiv preprint arXiv:1809.04281, 2018.
[25] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[26] N. Q. K. Le, Q.-T. Ho, T.-T.-D. Nguyen, and Y.-Y. Ou, "A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information," Briefings in Bioinformatics, vol. 22, no. 5, 2021, bbab005. [Online]. Available: https://doi.org/10.1093/bib/bbab005
[27] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[28] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
[29] K. Pearson, "Correlation coefficient," in Royal Society Proceedings, vol. 58, 1895, p. 214.
[30] "Your machine learning and data science community." [Online]. Available: https://www.kaggle.com/
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[32] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
BCI-based Consumers' Preference Prediction using Single Channel Commercial EEG Device
Farhan Ishtiaque1, Fazla Rabbi Mashrur2, Mohammad Touhidul Islam Miya3, Khandoker Mahmudur Rahman3, Ravi Vaidyanathan4, Syed Ferhat Anwar5, Farhana Sarker6, Khondaker A. Mamun1
Abstract— Brain-Computer Interface (BCI) technology is used in neuromarketing to learn how consumers respond to marketing stimuli. This helps evaluate marketing stimuli, which is traditionally done using marketing research procedures. BCI-based neuromarketing promises to replace these traditional marketing research procedures, which are time-consuming and costly. However, BCI-based neuromarketing has its difficulties, as EEG devices are inconvenient for consumer-grade applications. This study is performed to predict consumers' affective attitude (AA) and purchase intention (PI) toward a product using EEG signals. EEG signals are collected using a single-channel consumer-grade EEG device from 4 healthy participants while they are subjected to 3 different types of marketing stimuli: product, promotion, and endorsement. Multi-domain features are extracted from the EEG signals after pre-processing, and 52 features are selected among them using SVM-based Recursive Feature Elimination. The SMOTE algorithm is used to balance the dataset. A Support Vector Machine (SVM) is used to classify positive and negative affective attitude and purchase intention. The model manages to achieve an accuracy of 88.2% for affective attitude and 80.4% for purchase intention, demonstrating the viability of consumer-grade BCI devices in neuromarketing.
Keywords—Brain Computer Interface, Neuromarketing, EEG, Signal Processing, Machine Learning
I. INTRODUCTION
Neuromarketing studies customers' corporeal and emotional reactions to promoted items or services. It combines marketing with neuroscience using Brain-Computer Interface (BCI) technology to learn about consumers' preferences and purchase intentions. Every year, $750 billion is spent on marketing, promotion, and advertisement [1]. A significant amount is used in marketing research, where the primary goal is to express and present a particular advertisement effectively. In traditional marketing research, consumers are asked to fill out surveys, participate in focus groups, or have one-on-one interviews to get information on their thoughts regarding a product [2]. These methods are time-consuming and costly, and most importantly, they have to be repeated for every product. Additionally, because of the inherent limitations of self-reporting questionnaire surveys, these responses often do not accurately depict the customers' actual mental state [3]. Neuromarketing, on the other hand, addresses these problems because it focuses on capturing a person's brain response directly. So, there is a need for an autonomous system that can predict consumer preferences. Several autonomous approaches that can predict consumer preferences have been proposed by researchers in the last couple of years. In these studies, researchers used EEG, eye-tracking, facial expression, magnetoencephalography,
etc. to measure consumers' preferences. In the beginning, researchers were mostly interested in fMRI for neuromarketing research [3]. Researchers later moved on to EEG or MEG because fMRI has poor temporal resolution and also high cost, whereas EEG and MEG have excellent temporal resolution. Between these two, MEG has proven to be a viable technique for neuromarketing research [4], but it is costly compared to EEG, and the device is quite inconvenient and difficult to use. Other techniques such as eye tracking, facial expression, and ECG have been successfully used in neuromarketing research [3], but these techniques are unable to incorporate consumers' subconscious intuition. Since Krugman's first use of EEG technology in 1971, it has drawn interest from the neuromarketing industry as a relatively effective, affordable, well-established, and convenient instrument for neuromarketing research. In EEG-based neuromarketing studies, it has been seen that advertisements and their design can affect consumers' attitudes and purchase intention toward a certain product. [5] analyzed like and dislike feedback from participants using shoe images. [6] used a Hidden Markov model to extract and classify likes and dislikes from 42 images. [7] used product and endorsement stimuli to predict consumer preference using EEG. For EEG-based analysis, the pre-frontal cortex and frontal cortex are the most important regions because these regions are responsible for decision-making [8]. [1] utilizes 6-channel EEG signals collected from the frontal cortex of the brain to predict consumer preferences and purchase intention from different types of marketing stimuli. All the above-mentioned studies use multi-channel research-grade BCI technology. The devices used by these researchers are reliable and offer good-quality data; however, the devices are quite inconvenient for the wearer. Most devices use wet electrodes, which require a long setup time and a dedicated environment [1]. There are several easy-to-use consumer-grade BCI devices available, but these have not yet been tested in neuromarketing research because they are often single-channel dry-electrode devices with mediocre data quality. However, it is necessary to use consumer-grade BCI in neuromarketing to make the system feasible for real-life application. In this study, we used a single-channel dry-electrode consumer-grade BCI device (Focus Calm by Brain Co) to collect data, and we propose an effective method to analyze and classify the collected data. By utilizing a consumer-grade EEG device, this study promises the advancement of neuromarketing for marketing research, demonstrating the feasibility of using the technology on a regular basis.
Fig. 3. Difference between a research-grade EEG device (a) and a consumer-grade EEG device (b).
Fig. 1. Flow diagram of the performed study. First, we preprocess the raw EEG data collected from the Fpz channel. Then we extract several features and select the relevant ones using SVM-RFE. After that, SMOTE is used to balance the dataset. Finally, we classify the dataset using SVM.
II. METHODOLOGY
Fig. 1 shows the working procedure of the proposed framework. In this section, the participants, EEG device, experimental setup and data collection, data pre-processing, feature extraction and selection, and data classification are discussed.
A. Participants
For this study, 4 subjects participated: 2 male and 2 female (age: 24 ± 2 years). The subjects had no history of neurological disorders. In accordance with the Helsinki Declaration and the Neuromarketing Science and Business Association Code of Ethics, all participants provided informed consent prior to enrollment.
B. BCI Device
For this experiment, we used the Focus Calm device by Brain Co. It is a single-channel EEG device that collects data from the frontal Fpz channel. Fig. 2 shows the electrode positions of [1] and the electrode position used in this study. The electrode used in this device is a dry electrode. The device was originally built to perform brain exercises and increase focus; it was built for long-term usage, and for that reason comfort was a priority. Although this type of device is not ideal for research, we chose it intentionally in order to test its feasibility in neuromarketing research and application. Fig. 3 shows the difference between a research-grade device and a
consumer-grade device.
C. Experimental Setup and Data Collection
This study utilized eight different products, each with an associated promotion and endorsement. Customers are more likely to buy a product after seeing a promotion or endorsement. It is common for celebrities to endorse products in a real-world setting; however, in our study we used neutral endorsements to prevent participants from becoming biased. The stimuli are shown in Fig. 4. The data collection process was inspired by [9], [10], and [1] and was divided into three parts. In the first part, the participants sat comfortably in front of a screen, were briefed about the experiment, and had the EEG device placed on their head. In the second part, the participants were shown the images of the product, promotion, and endorsement using PsychoPy v3.0 [11]. Each image was shown for 5 seconds, and between each image a white cross was shown for 3 seconds. The stimuli flow diagram is given in Fig. 5. EEG data were collected simultaneously from the frontal Fpz channel.
Fig. 2. Electrode placement of [1] (red) vs electrode placement of this study (green).
Fig. 4. Stimuli used in our experiment. The first column represents the 8 products, the second column represents the endorsements of those products, and the last column represents the promotions offered on those products.
Fig. 5. The stimuli flow diagram used in this study. At first, the participants are shown a white cross on a black screen, followed by the image of the first product. After that sequentially come the endorsement and promotion of the product, separated by the black screen with the cross. This sequence is repeated for all 8 products.
Fig. 6. Number of responses of the participants for affective attitude (blue) and for purchase intention (orange).
The data collection rate was 256 Hz. In the third part, the participants were given a questionnaire where they were asked two questions: 1. If given the opportunity, how willing are you to have x? 2. If given the opportunity, how willing are you to buy x? The first question measures affective attitude and the second one purchase intention. The participants answered on a 1-7 scale, where 1 is strongly disagree and 7 is strongly agree. The answers were later converted into three classes: negative (1-3), neutral (4), and positive (5-7). The participants' responses are shown in Fig. 6.
D. Data Pre-processing
MATLAB 2019a and EEGLAB [12] are used to process the data. To remove high- and low-frequency noise, the signals are first filtered using a bandpass Butterworth filter
whose frequency range is 0.5-70 Hz. After that, a 50 Hz notch filter is used to remove powerline noise. Lastly, the data is normalized by subtracting the mean from each point and dividing it by the standard deviation. For visualization, the pre-processed positive and negative EEG data are plotted in Fig. 7.
E. Wavelet Packet Transform (WPT)
WPT is a technique used to split time-series data into different frequency bands. In this study, we split our dataset into six frequency bands (δ, θ, α, two β sub-bands, and γ) using the following equations,
W_{m,2n}(k) = Σ_{l=0}^{L−1} h(l) · W_{m−1,n}(2k − l mod N)        (1)

W_{m,2n+1}(k) = Σ_{l=0}^{L−1} g(l) · W_{m−1,n}(2k − l mod N)        (2)
Here, W_{m,n} denotes the WPT coefficient at node n of decomposition level m, N is the signal length, and h(l) and g(l) are the low-pass and high-pass filter coefficients of length L.
F. Feature Extraction, Selection, and Classification
Features are extracted from three domains (time, frequency, and time-frequency), similar to [12]. Several time domain and frequency domain features are extracted first. Time domain features include average and relative power, Hjorth parameters, skewness, arithmetic mean, median, minimum value, absolute value, interquartile range, Renyi entropy, absolute threshold crossing, threshold crossing, zero crossing, etc. Frequency domain features include
Fig. 7. Grand average of EEG data in the time domain for PAA vs NAA (a) and PPI vs NPI (b). The graph shows the difference in EEG pattern for both outcomes, including the difference in the N and P components.
Fig. 8. Confusion matrix of the prediction of our model for affective attitude (a) and purchase intention (b).
spectral centroid, spread, kurtosis, entropy, flatness, crest, slope, decrease, roll-off point, etc. The ratio of the average power and relative power is also calculated as a separate feature. Then the Wavelet Packet Transform (WPT) is used to decompose the EEG signal into six frequency bands [13][14], and those features are again extracted from each of these bands. As a separate feature, the ratio of the average and relative power of each band is also calculated. The total number of features extracted is more than 400. For feature selection, we use Recursive Feature Elimination (RFE) with a linear SVM; 10-fold cross-validation is used during the feature selection procedure. A total of 52 features are found significant for the analysis, and these 52 features are selected for further classification. For binary classification, the neutral class is irrelevant and risks confusing the algorithm, so we remove the neutral class and keep only the positive and negative classes. The dataset is highly imbalanced, as positive affective attitude and purchase intention responses outnumber negative ones, as shown in Fig. 6. To balance out the dataset, the Synthetic Minority Oversampling Technique (SMOTE) [15] is used to create 23 synthetic NAA and NPI samples. After that, the data is trained using a linear SVM, and 5-fold cross-validation is used to evaluate the model's performance. For comparison, the
dataset is also classified using the Random Forest, Decision Tree, and Naïve Bayes algorithms as well.
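As an illustration of this pipeline, a hedged scikit-learn/imbalanced-learn sketch follows; the random arrays stand in for the real multi-domain feature matrix, and the 10-fold cross-validation used inside feature selection in the study is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

# Placeholder data: (n_trials, n_features) feature matrix and binary labels
# (0 = negative, 1 = positive); the real matrix has 400+ features per trial.
X = np.random.randn(60, 470)
y = np.random.randint(0, 2, 60)

# SVM-based Recursive Feature Elimination down to 52 features.
selector = RFE(SVC(kernel="linear"), n_features_to_select=52)
X_sel = selector.fit_transform(X, y)

# Balance the minority class with SMOTE, then evaluate a linear SVM
# with 5-fold cross-validation, mirroring the pipeline described above.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_sel, y)
scores = cross_val_score(SVC(kernel="linear"), X_bal, y_bal, cv=5)
print("Mean 5-fold accuracy:", scores.mean())
```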
III. RESULT AND DISCUSSION
The outcome of the classification is given in Table I.

TABLE I
PERFORMANCE OF THE SVM MODEL COMPARED TO ANOTHER RESEARCH

Paper    | PI (Mean of 5-fold CV)           | AA (Mean of 5-fold CV)
         | Acc.        | Spec.   | Sens.    | Acc.        | Spec.   | Sens.
Our Work | 80.4% ± 3.8 | 79.24%  | 81.6%    | 88.2% ± 7.4 | 83.1%   | 95%
[1]      | 84.0%       | 75.32%  | 89.66%   | 87%         | 74.41%  | 92.98%
The table contains the average accuracy from the 5-fold cross-validation. The specificity and sensitivity are calculated from the combined confusion matrix, which is shown in Fig. 8. The results obtained by [1] are also shown for comparison. It can be seen that the model was quite successful in predicting both AA and PI, and our method achieved results similar to [1]. However, [1] used six-channel EEG data, whereas we used single-channel data from a commercial EEG device. This demonstrates the viability of our method.
Fig. 9. Performance of the proposed system for various classifiers, shown as box plots. Here we can see that SVM outperformed the other classifiers.
Fig. 10. Percentage of the types of selected features from the feature space for affective attitude (a) and purchase intention (b), and percentage of the types of selected time-frequency features from the feature space for affective attitude (c) and purchase intention (d).
The collected dataset was imbalanced, as there were approximately 65% positive data points and 35% negative data points for both affective attitude and purchase intention. The Synthetic Minority Oversampling Technique was used to balance the dataset. Before applying SMOTE, a classification study was performed, and the model had slightly poorer performance compared with the balanced dataset. So, it is important to have a balanced dataset, as it directly affects the model's performance. For comparison, we classified the dataset using other models as well; their performance is shown in Fig. 9. From the box plots, it can be seen that SVM outperformed all the other models. It is worth mentioning here that the feature selection was done using SVM. Also, for purchase intention, the first and third quartiles are very close, indicating the stability of the method. This finding aligns with the findings of [1]. Fig. 7 shows the difference in EEG data between positive and negative AA and PI. Here we can see that for both NAA and NPI the N200 component shows a greater peak than for PAA and PPI, while for PAA and PPI the P300 component has a greater peak than for NAA and NPI. Also, negative responses have higher dispersion than positive responses. The same outcome can also be seen in [1]. Fig. 10 shows the percentage of features selected out of the total number of features extracted. It can be seen that time-frequency features are the most important among all extracted features; the same finding can be seen in [1]. From the time-frequency features, θ-band features are the most relevant for predicting consumers' affective attitudes and purchase intention, which can also be seen in the outcomes of [1], [3], [10]. The second most relevant frequency band is δ, which is also claimed by [1]. This confirms that the findings of this study align with [1]. From the above discussion, it can be seen that a consumer-grade EEG device is capable of performing accurately compared to research-grade EEG devices. We can say this because our findings align with the findings of the other mentioned literature. This demonstrates the viability of neuromarketing using a consumer-grade single-channel EEG device. Although a lot of research has shown the feasibility of neuromarketing, it is yet to be adopted in marketing research due to the inconvenient and difficult data collection process of traditional research-grade BCI devices, whereas consumer-grade BCI devices do not need the sophisticated data collection process usually required in neuromarketing research. This is the first study achieving such competitive results using a consumer-grade single-channel BCI, demonstrating the method's applicability in real-world marketing research. For the experiment, 4 participants were used. The participants gave their written consent prior to the experiment. The whole experiment procedure was reviewed and accepted by the Institutional Research Ethics Board (IREB), United International University, Dhaka.
IV. CONCLUSION AND FUTURE WORK
In this study, a consumer-grade EEG device was used to collect data from 4 participants while they were exposed to different marketing stimuli. Three types of stimuli (product, endorsement, promotion) for a total of 8 products were shown. A dataset was created using the participants'
feedback on the products regarding affective attitude and purchase intention. The collected data was pre-processed. After that, several time, frequency, and time-frequency domain features were extracted from the dataset. From a total of 470 features, 52 features were selected for classification using the 10-fold SVM-based Recursive Feature Elimination method. After that, the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the dataset. Finally, a linear SVM was used to classify the dataset, and 5-fold cross-validation was used to evaluate the model's performance. The model performed better at classifying affective attitude, with an accuracy of 88.2%; for classifying purchase intention, the model achieved an accuracy of 80.4%. This result shows that a consumer-grade single-channel EEG device is viable for neuromarketing research and opens up new possibilities for neuromarketing applications in marketing research and analysis. The main limitation of this study is that it was done on a small dataset. In the future, we plan to create a larger dataset with more participants to perform the same task. We also plan to use research-grade EEG devices alongside our current device to compare the performance of both.
V. REFERENCES
[1] F. R. Mashrur et al., "BCI-Based Consumers' Choice Prediction From EEG Signals: An Intelligent Neuromarketing Framework," Front. Hum. Neurosci., vol. 16, p. 861270, May 2022, doi: 10.3389/FNHUM.2022.861270.
[2] J. Hulland, H. Baumgartner, and K. M. Smith, "Marketing survey research best practices: evidence and recommendations from a review of JAMS articles," J. Acad. Mark. Sci., vol. 46, no. 1, pp. 92–108, Jan. 2018, doi: 10.1007/S11747-017-0532-Y/TABLES/4.
[3] F. S. Rawnaque et al., "Technological advancements and opportunities in Neuromarketing: a systematic review," Brain Informatics, vol. 7, no. 1, pp. 1–19, Dec. 2020, doi: 10.1186/S40708-020-00109-X/TABLES/4.
[4] G. Vecchiato et al., "On the Use of EEG or MEG Brain Imaging Tools in Neuromarketing Research," Comput. Intell. Neurosci., vol. 2011, p. 12, 2011, doi: 10.1155/2011/643489.
[5] B. Yilmaz, S. Korkmaz, D. B. Arslan, E. Güngör, and M. H. Asyali, "Like/dislike analysis using EEG: Determination of most discriminative channels and frequencies," Comput. Methods Programs Biomed., vol. 113, no. 2, pp. 705–713, Feb. 2014, doi: 10.1016/J.CMPB.2013.11.010.
[6] M. Yadava, P. Kumar, R. Saini, P. P. Roy, and D. Prosad Dogra, "Analysis of EEG signals and its application to neuromarketing," Multimed. Tools Appl., vol. 76, no. 18, pp. 19087–19111, Sep. 2017, doi: 10.1007/S11042-017-4580-6/TABLES/4.
[7] F. R. Mashrur et al., "MarketBrain: An EEG based intelligent consumer preference prediction system," Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. EMBS, pp. 808–811, 2021, doi: 10.1109/EMBC46164.2021.9629841.
[8] N. Ahmad et al., "Relative reward preference in primate orbitofrontal cortex," Nature, vol. 398, no. 6729, pp. 704–708, Apr. 1999, doi: 10.1038/19525.
[9] I. Levy, S. C. Lazzaro, R. B. Rutledge, and P. W. Glimcher, "Choice from Non-Choice: Predicting Consumer Preferences from Blood Oxygenation Level-Dependent Signals Obtained during Passive Viewing," J. Neurosci., vol. 31, no. 1, pp. 118–125, Jan. 2011, doi: 10.1523/JNEUROSCI.3214-10.2011.
[10] A. Telpaz, R. Webb, and D. J. Levy, "Using EEG to Predict Consumers' Future Choices," Journal of Marketing Research, vol. 52, no. 4, pp. 511–529, Aug. 2015, doi: 10.1509/JMR.13.0564.
[11] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," J. Neurosci. Methods, vol. 134, no. 1, pp. 9–21, Mar. 2004, doi: 10.1016/J.JNEUMETH.2003.10.009.
[12] K. A. Mamun, C. M. Steele, and T. Chau, "Swallowing accelerometry signal feature variations with sensor displacement," Med. Eng. Phys., vol. 37, no. 7, pp. 665–673, Jul. 2015, doi: 10.1016/J.MEDENGPHY.2015.04.007.
[13] M. K. Wali, M. Murugappan, and B. Ahmmad, "Wavelet packet transform based driver distraction level classification using EEG," Math. Probl. Eng., vol. 2013, 2013, doi: 10.1155/2013/297587.
[14] L. S. Vidyaratne and K. M. Iftekharuddin, "Real-Time Epileptic Seizure Detection Using EEG," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 25, no. 11, pp. 2146–2156, Nov. 2017, doi: 10.1109/TNSRE.2017.2697920.
[15] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/JAIR.953.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Performance Evaluation of Fake News Detection using Predictive Modeling
Md. Razaul Haque Subho and Md. Ridowan Chowdhury
Department of Computer Science and Engineering, School of Engineering and Computer Science,
BRAC University, Dhaka-1212, Bangladesh
[email protected], [email protected]
Abstract— Fake news on the internet is more prevalent than ever because it has the power to create chaos among people. In the digital age, internet usage has increased, and people are now more reliant on online news portals. An intentional spread of fake news can influence people's perception of an issue and cause devastating outcomes. To cope with the huge amount of data produced on the internet every day, a computational model is the only solution. In this paper, we experiment with a predictive modeling approach to classify text documents as real or fake from the linguistic pattern of the document. The task is more difficult than other text classification tasks because there is very little dissimilarity between a real document and a fake document. Three popular predictive models that work well in text classification tasks, multinomial naïve bayes, support vector machine, and long short-term memory, are evaluated with features extracted using bag-of-words and tf-idf methods. Previous works have tried to predict document credibility with these three models; in this paper, however, we compare the performance of these models on the same dataset. Another goal is to achieve a high accuracy and F1 score with a false negative rate of less than 5%.
Keywords- Natural Language Processing; Multinomial Naïve Bayes; Support Vector Machine; Long Short-Term Memory; Fake News
I. INTRODUCTION
With the headway of innovation, everything is accessible, and getting news from around the world is no different. The internet has enabled access to information in a much simpler way, and an increasing part of our lives involves following up on events by interacting with social media or news portals to get insightful information. However, in the context of news creation and spread, there are abundant opportunities for hoaxes. Misrepresentation of news by mass media or social media often plays with public emotion by creating chaos. Fake news is commonly used to mislead people's beliefs and opinions in order to gain advantage financially or politically [1]. Fake news is associated with content that is twisted in a frivolous way; as a result, the veracity of news is often compromised to get the highest impact. News portals and social media underline exaggerated and fallacious headlines to grab readers' attention. They intentionally distribute misleading news, propaganda, and vague information, which eventually drives web traffic and intensifies their impact [2]. These afflictions of fake news pose serious threats to the morality of journalism. In 2016, a man with a rifle walked into a pizza shop and began shooting after he read news from tweets that "this pizza shop was harboring young children as sex slaves as a part of a pedophile sex ring led by Hillary Clinton" [3]. The man was immediately arrested and charged with firing an assault rifle in the restaurant. The broader issue of detecting fake news is
not new to the field of natural language processing. Detecting equivocal news on online portals and social media involves several challenging tasks and research questions. First, fake news is deliberately written to delude readers, which makes it significant to detect it essentially relying on the news content [4]. In other words, fake news is used to trigger people's opinions when interpreting and reacting to that news. Moreover, fake news claims to reveal the truth with a diverse linguistic style while continuously fabricating the real content. Second, fake news, which often contains similar sets of words, also has numerous grammatical mistakes and is rearranged with sentimental text [2]. Fake news detection simply denotes whether the given context of an article or document is fake or not. According to Horne and Adali's observation, fake and honest articles can be distinguished by extracting features in 3 categories, named complexity, psychology, and stylistic [5]. Even though it is easier to understand and trace the intention and the impact of fake news, the intention and impact of creating propaganda by spreading fake news cannot be measured or understood easily [6]. Therefore, hand-crafted and data-specific textual features are not sufficient for fake news detection. The aim of this research is to detect fake news from its linguistic pattern within the scope of predictive modeling. In this case, we only experiment with the linguistic pattern as a feature. For this purpose, three highly used predictive models which work well in text classification tasks, multinomial naïve bayes, support vector machine, and long short-term memory, are evaluated. All three models take different approaches to prediction: the multinomial naïve bayes takes a probabilistic approach, the support vector machine divides the data with an (n−1)-dimensional hyperplane, and the long short-term memory uses a neural network to predict. The objective is to reach a high accuracy and F1 score with a false negative rate of less than 5%. The false negative is given more penalty because we assume that predicting a real document as fake is more problematic than predicting a fake document as real.
II. LITERATURE REVIEW
Identifying the fabrication of news requires several methods and analyses. Based on an individual's perspective, challenges arise in dealing with their own intuition of a particular concept; therefore, each piece of content signifies different aspects, which may contradict the general terms. Research has approached the detection of fake content in online news by automatic identification. A legitimate news dataset and a web dataset were constructed covering seven different news domains. Further, exploratory analyses were done to identify the linguistic
properties of fake content. A fake news detector was built based on several linguistic features such as n-grams, punctuation, psycholinguistic features, readability, and syntax. Following up, a linear SVM classifier was evaluated with five-fold cross-validation, using accuracy, precision, F1 measure, and the five-iteration average to obtain the performance of the classifiers. Additionally, a cross-domain evaluation was done comparing the FakeNewsAMT and Celebrity datasets, describing a significant loss in accuracy. Using an annotation interface, human performance in spotting fake news and judging its credibility was measured, revealing that humans perform better at identifying fake news in the celebrity domain [1]. Given the background of political fact-checking and fake news detection, another research work has been conducted to find the linguistic characteristics of fake content in political quotes and news portals. In parallel, a case study probes the realistic measurement of automatic political fact-checking on a 6-point scale. The ratio of reliable and unreliable news was reported with Bonferroni correction and Welch's t-test to measure statistical significance. The collected articles were then used to predict the reliability of the news based on four separate segments. Consequently, MaxEntropy classifiers with L2 regularization on n-gram TF-IDF feature vectors were trained to achieve a higher F1 score. Regarding the truthfulness of news articles, labeled statements were collected from the PolitiFact site, and quotes were split into train and test sets, returning a rating indicating the reliability of the statement. The model performance showed LSTM outperforming Naïve Bayes, MaxEntropy, and the majority baseline in predicting the PolitiFact ratings [7]. In [6], the research tried a unified approach to automatically detect fake content based on fake reviews and fake news. The proposed model was assessed using 3 different datasets containing honest and fake content. They proposed a detection model which is a combination of n-gram features, a term frequency metric, and machine learning for text analysis. Two distinctive feature extraction techniques and 6 different machine learning models were applied for comparison. The experimental evaluation outperformed other conventional approaches by obtaining 90% accuracy, which is impressively higher contrasted with Horne and Adali's text feature approach (71% accuracy).
III. DATA DESCRIPTION
Since the main targets are to analyze and predict the credibility of a news article from its linguistic pattern, the required features of the dataset are the title text, the inside text, and a label of trueness. For the purpose of experimenting with a dataset that contains both reliable and fake news, we worked on George McIntire's dataset on fake and real news [11]. The dataset contains 6474 instances with title, text, and a label marked as real or fake. It has 3297 instances marked as fake news and 3177 instances marked as real news. A train-test split of 65-35 is taken for the experiment.
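A minimal pandas sketch of loading and splitting the data is shown below; the file name is an assumption on our part, and the 65-35 split follows the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# File name is an assumption; the dataset is expected to have the
# 'title', 'text', and 'label' (REAL/FAKE) columns described above.
df = pd.read_csv("fake_or_real_news.csv")

# 65-35 train-test split on the inside text only, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.35, random_state=42)
```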
Table 1. Sample Dataset

index | title              | text                | label
0     | You Can Smell…     | Daniel Greenfield…  | FAKE
1     | Watch The...       | Google Pinterest…   | FAKE
2     | Kerry to go…       | U.S Secretary …     | REAL
3     | Bernie supporters… | Kaydee King …       | FAKE
4     | The battle of …    | It's primary day…   | REAL
The task of this work is to test popular text classification classifiers in the scope of classifying fake or real news from the linguistic pattern of a text document. For simplicity, the text documents are classified with two labels, fake and real. In a real-life scenario, a news article can be labelled as biased, partially fake, and so on, but identifying a document's credibility only from its linguistic pattern is a hard challenge. In addition, the title texts are not considered in the model, since their word sequences are small, which may lead to overfitting; only the inside text of a news article is used to train the classifiers.
Figure 1. Flowchart of the experimental setup
The text data are first fed into a vectorizer, and the resulting feature vectors are used to train the predictive models, which are then tested on the test data. For vectorization, the bag-of-words and tf-idf methods are used independently. The predictive models used in this task are multinomial naive bayes, support vector machine, and long short-term memory. Finally, accuracy, F1 score, and false negative rate are used as the performance evaluation metrics.
IV. EXPERIMENT WITH MODELS
The documents are vectorized to extract feature vectors from the text. The tf-idf vectorizer and bag of words are the two occurrence-based methods that are applied to vectorize the documents separately to test their impact on the results. Before that, some preprocessing steps are taken to clean the dataset. Term Frequency and Inverse Document Frequency (TF-IDF) are two closely interrelated matrices used in search to extract the most descriptive terms of an article based on the number of occurrences of a word in a given document; it is also used to compute the similarity between two articles. On the other hand, bag-of-words is used in natural language processing to extract features from documents that can be used by a machine learning algorithm. Regardless of the grammar and semantic relationships of the context,
the text of a document is represented as a collection of words in an unordered manner, and the frequency with which each word occurs is listed. The main difference is that the tf-idf vectorizer normalizes the counts, whereas bag of words counts just the occurrence frequency. In a further step, the stop words and digits are removed, and the document-frequency threshold is set to 0.6 to remove all words that appear in at least 60% of the documents.
A. Multinomial Naïve Bayes
The naive bayes is a probabilistic approach to classify data. It applies the Bayes theorem of probability with the assumption of conditional independence between each pair of attributes. Multinomial naive bayes is a variant of naive bayes where the features are presented in a multinomial distribution in a vectorized form. A Laplace or Lidstone smoothing parameter handles the situation where a vocabulary term is not present in the feature vector. It is a simple conditional probabilistic reasoning approach, and yet it works very well in text classification tasks. In our experiment, the feature vectors are represented using both the bag-of-words and the tf-idf vectorizer. Both feature vectors are used to train the multinomial naive bayes model with different values of Laplace and Lidstone smoothing. The aim is to achieve high accuracy and F1 score with a false negative rate below 5%. The experimented smoothing parameter values are alpha = [0, 0.1, 0.4, 0.5, 1, 1.3, 2, 5].
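As a rough scikit-learn sketch of this setup (not the authors' exact code), the two vectorizers and the alpha sweep could look as follows, reusing the X_train/X_test split from the earlier snippet; treating FAKE as the positive class when computing the false negative rate is our assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Stop-word removal and the 0.6 document-frequency cutoff described above.
for Vec in (CountVectorizer, TfidfVectorizer):          # bag of words vs tf-idf
    vec = Vec(stop_words="english", max_df=0.6)
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
    for alpha in [0, 0.1, 0.4, 0.5, 1, 1.3, 2, 5]:      # smoothing values from the text
        clf = MultinomialNB(alpha=alpha).fit(Xtr, y_train)
        pred = clf.predict(Xte)
        tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=["REAL", "FAKE"]).ravel()
        fnr = fn / (fn + tp)                            # false negative rate (FAKE = positive)
        print(Vec.__name__, alpha,
              accuracy_score(y_test, pred),
              f1_score(y_test, pred, pos_label="FAKE"), fnr)
```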
B. Support Vector Machine
The support vector machine divides the training samples in an n-dimensional vector space with an (n−1)-dimensional hyperplane. It is a non-probabilistic classifier and works very well in text classification tasks. Different kernel functions find the appropriate hyperplane by transforming the dimension of the training samples. Kernel functions are often utilized in machine learning algorithms to measure the similarity of inputs. Using a kernel allows working with highly complex problems efficiently by avoiding a vast-dimensional feature vector. Kernel selection often determines the performance of the support vector machine, which separates the data linearly in a higher-dimensional space. Four types of kernels used in SVM are as follows:
SVC with linear kernel: when the data is linearly separable and the number of features is very substantial, the linear kernel can be used, for example in image recognition (2D points). Additionally, linear kernels are good for text content and performance-specific problems.
SVC with RBF kernel: the RBF kernel is used when the data is not linearly separable. It is one of the popular kernels used in machine learning models for high accuracy with low variance. The RBF kernel acts like a low-band pass filter, often used as a signal processing tool to smooth images.
SVC with polynomial (n-degree) kernel: when the dataset has discrete values and no smoothness, polynomial kernels are used. However, the polynomial kernel can cause overfitting by fitting the training set more than expected.
SVC with sigmoid kernel: depending on the level of cross-validation of a particular problem, where the number of dimensions is relatively high or a nonlinear separation in 2D space is needed, the sigmoid kernel can be effective compared to the others.
Two of the parameters that affect the performance of support vector classifiers are the penalty parameter C and the kernel coefficient gamma. A larger value of the penalty parameter C makes the decision function more complex and raises the probability of overfitting, so for this experimental setup C is kept at its default value of 1.0. The gamma value controls the influence of each training sample, and we experimented with different gamma values in the setup.
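A corresponding scikit-learn sketch of the kernel and gamma sweep is given below; the specific gamma values are illustrative (the text sweeps gamma between 0.01 and 1.0), and X_train/y_train are the splits from the earlier snippet.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Same preprocessing as before; C is left at its default of 1.0 as stated above.
vec = TfidfVectorizer(stop_words="english", max_df=0.6)
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

for kernel in ("linear", "rbf", "poly", "sigmoid"):
    for gamma in (0.01, 0.1, 0.5, 1.0):      # illustrative values within the swept range
        clf = SVC(kernel=kernel, C=1.0, gamma=gamma).fit(Xtr, y_train)
        print(kernel, gamma, accuracy_score(y_test, clf.predict(Xte)))
```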
C. Long Short-Term Memory (LSTM)
An LSTM model contains five types of gates, where each node i is represented by five vectors: an input gate, a forget gate, an output gate, a candidate memory cell gate, and a memory cell gate. The forget gate and output gate indicate which values will be updated, i.e., forgotten or kept. The candidate memory cell gate and memory cell gate hold the candidate features and the actually accepted features, respectively.
Figure 2. Architecture of LSTM Model
LSTM is used in order to learn sequential problems and has been successfully used in several fake news detection works []. Hence, we considered using LSTM for the fake news detection task in this article. Each article is processed as follows. In our proposed system, the LSTM model works on a collection of articles; each separate piece of news belongs to the corpus, or set of articles, and includes its text as a sequence of sentences, where a sentence is formed by a sequence of words. The features of a word form a word vector of a specific length. Every word is first associated with its own integer and is then fed into the LSTM to predict real content and fake content with a SoftMax layer. The training set consists of the news descriptions. An embedding layer is used for the word tokens of the articles and is fed through the LSTM to compute the description embedding.
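A minimal Keras sketch of this pipeline follows; the embedding and hidden sizes are assumptions, and which of 200/5000 is the vocabulary size versus the sequence length is our reading of the ambiguous description below.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

MAX_WORDS, MAX_LEN = 5000, 200     # assumed: vocabulary size / sequence length

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(X_train)    # builds the word index by frequency
seq_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=MAX_LEN)
seq_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=MAX_LEN)

model = Sequential([
    Embedding(MAX_WORDS, 64),          # embedding size 64 is an assumption
    LSTM(64),                          # hidden size is an assumption
    Dense(2, activation="softmax"),    # softmax over real / fake
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Labels must be integer-encoded (e.g. FAKE -> 0, REAL -> 1) before fitting:
# model.fit(seq_train, y_train_int, validation_data=(seq_test, y_test_int), epochs=5)
```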
The target embedding is tested to predict original content and fake content. We have used a maximum article word count of 200 and a maximum length of 5000, and each sequence is represented by tokens. Each token is obtained from the word index using fit_on_texts, which returns an individual vocabulary index based on word frequency. Further, texts_to_sequences transforms the text of each article into a sequence of integers. The sequences in a list are ensured to have the same length.
V. RESULTS AND ANALYSIS
Figure 5. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-Bag of Words (Polynomial Kernel) with respect to gamma
Figure 3. Accuracy, F1 score, False Negative Comparison in Multinomial Naive Bayes-Bag of Words with respect to Alpha
Figure 6. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-Bag of Words (Sigmoid Kernel) with respect to gamma
Figure 4. Accuracy, F1 score, False Negative Comparison in Multinomial Naive Bayes TF IDF with respect to Alpha
In multinomial naive Bayes, the model trained on bag-of-words vectors initially achieved an accuracy of 89.6% without smoothing. After applying smoothing, the accuracy, F1 score and false negative rate all decrease as alpha increases. The highest F1 score achieved is 90.5, with the smoothing parameter alpha set to 0.1, while the highest accuracy (87.2%) and F1 score (87.1%) with a false negative rate below 5% (4.13%) are achieved with alpha set to 5. For the TF-IDF approach, the F1 score and accuracy likewise keep decreasing as the smoothing parameter alpha increases. The highest F1 score achieved is 90.4 with alpha set to 0.1, and the highest accuracy (88.0%) and F1 score (87.9%) with a false negative rate below 5% (3.9%) are achieved with alpha set to 0.4. We can therefore see that the smoothing parameter has a greater impact on accuracy and F1 score in the TF-IDF approach. Both the TF-IDF and bag-of-words approaches can reach close to 90% accuracy and F1 score with the right tuning of the smoothing parameter. With a larger corpus the vocabulary will be larger, and the smoothing parameter will have less impact on the accuracy and F1 score.
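As a rough illustration of this tuning, the sketch below (assuming scikit-learn, with both bag-of-words and TF-IDF features) sweeps the smoothing parameter alpha for a multinomial naive Bayes classifier; the alpha grid mirrors the values discussed above, but the corpus is a placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

texts = ["sample real news ...", "sample fake news ..."] * 50   # placeholder corpus
labels = [0, 1] * 50

for name, vec in [("bag-of-words", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    X = vec.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              random_state=42, stratify=labels)
    for alpha in [0.1, 0.4, 0.5, 1.0, 1.3, 2, 5, 100]:
        clf = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
        fnr = fn / (fn + tp)  # false negative rate
        print(f"{name:12s} alpha={alpha:<5} acc={accuracy_score(y_te, pred):.3f} "
              f"f1={f1_score(y_te, pred):.3f} fnr={fnr:.3f}")
```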
Figure 7. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-Bag of Word (RBF Kernel) with respect to gamma
Figure 1. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-TF IDF (Polynomial Kernel) with respect to gamma
For the LSTM model, the accuracy reaches 92.2% and the false negative rate is 4.33%. The confusion matrix is shown in Figure 11. Finally, among the experimented models, the LSTM model performed best because it reaches a moderately good accuracy with a false negative rate below 5%.
Figure 2. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-TF IDF (Sigmoid Kernel) with respect to gamma
Figure 3. Accuracy, F1 score, False Negative Comparison in Support Vector Machine-TF IDF (RBF Kernel) with respect to gamma
In the experimental setup for the support vector machine with the polynomial kernel, the accuracy and F1 score decrease as gamma increases for both feature vectors, and there is no significant change in the false negative rate with the change of gamma. The F1 score ranges from 79.2% to 85.5% and the false negative rate from 16.2% to 19.5% within the gamma range of 0.01 to 1.0; across all cases the false negative rate stays between 15% and 20%. For the RBF kernel, the results are almost the same as for the polynomial kernel. With the linear kernel, the bag-of-words feature vector reaches an accuracy of 88.1% with a 12.56% false negative rate, and the TF-IDF feature vector reaches 92.9% accuracy with a 9.66% false negative rate. The results for the sigmoid kernel are poor, with less than 50% accuracy in all setups. So, all the kernels except the sigmoid kernel reach close to 90% accuracy.
In summary, the experiment is conducted by first splitting the dataset into training and test sets. Bag-of-words and TF-IDF methods are employed to extract features from the documents, and these features are fed into the classifiers to train the models. Different evaluation metrics such as accuracy, precision, recall and F1 score are used to measure the performance of the models. The results of the experiment show that the multinomial naive Bayes model performs with an accuracy of .5%. The support vector machine model also performs well with an accuracy of 89.42%, and the long short-term memory model has an accuracy of 87.99%. The results show that the predictive modeling approach is effective in classifying text documents as real or fake.
VI. CONCLUSION
In the end, identifying fake news from the way it is written is a hard task that requires a mix of methods. By looking for characteristic words and phrases, inconsistencies, inflammatory language, and suspicious sources, we can identify potentially false stories. Through this project, we have tried to demonstrate the best model for detecting fake news from its linguistic pattern. First, the data is cleaned and tokenized using natural language processing (NLP). The tokenized data is then used to create the feature vectors for the predictive models; each feature is a term or word from the text which is used as an input for the predictive models. The models are then trained and tested using a cross-validation technique, and their performance is evaluated based on accuracy and F1 score. To further refine the models, hyperparameter tuning is used to find the optimal parameter values. The results obtained from the experiments are compared and analyzed. The results obtained from this research show that the long short-term memory model is a good model for detecting fake news from its linguistic pattern. For the current challenge, the result is satisfactory and can be improved with more experimentation with feature extraction methods and predictive models. It can also be improved beyond the scope of linguistic patterns through a named entity recognition task. In the era of the internet, news articles can spread very fast and create an immediate impact among people, so an artificial-intelligence-based false news detector is a social need.
Figure 11. Confusion Matrix for LSTM Model

References
[1] Y. Long, Q. Lu, R. Xiang, M. Li and C. Huang, in Proceedings of the 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan, pp. 252-256, November 27 - December 1, 2017.
[2] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), 2017, doi: 10.1109/ukrcon.2017.8100379. [Accessed 25 March 2019].
[3] X. "The victims of fake news," Columbia Journalism Review, 2019. [Online]. Available: https://www.cjr.org/special_report/fake-news-pizzagate-seth-rich-newtown-sandy-hook.php.
[4] K. Shu, A. Sliva, S. Wang, J. Tang and H. Liu, "Fake News Detection on Social Media," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22-36, 2017, doi: 10.1145/3137597.3137600. [Accessed 25 March 2019].
[5] D. W. Engels, Y. S. Kang, and J. Wang, "On security with the new Gen2 RFID security framework," 2013 IEEE International Conference on RFID (RFID), 2013.
[6] H. Ahmed, I. Traore and S. Saad, "Detecting opinion spams and fake news using text classification," Security and Privacy, vol. 1, no. 1, p. e9, 2017, doi: 10.1002/spy2.9. [Accessed 25 March 2019].
[7] J. Corner, "Fake news, post-truth and media-political change," Media, Culture & Society, vol. 39, no. 7, pp. 1100-1107, 2017, doi: 10.1177/0163443717726743.
[8] "Research on the Application of an Improved TFIDF Algorithm in Text Classification," Journal of Convergence Information Technology, vol. 8, no. 7, pp. 639-646, 2013, doi: 10.4156/jcit.vol8.issue7.80.
[9] J. Chen, H. Huang, S. Tian and Y. Qu, "Feature selection for text classification with Naïve Bayes," Expert Systems with Applications, vol. 36, no. 3, pp. 5432-5435, 2009, doi: 10.1016/j.eswa.2008.06.054.
[10] L. Wei, B. Wei and B. Wang, "Text Classification Using Support Vector Machine with Mixture of Kernel," Journal of Software Engineering and Applications, vol. 05, no. 12, pp. 55-58, 2012.
[11] G. McIntire, "fake_or_real_news.csv." [Online]. Available: https://github.com/GeorgeMcIntire/fake_real_news_dataset. [Accessed: 25-Dec-2018].
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
SMOTE Based Credit Card Fraud Detection Using Convolutional Neural Network Md. Nawab Yousuf Ali1, Taniya Kabir2, Noushin Laila Raka3, Sanzida Siddikha Toma4, Md. Lizur Rahman5, Jannatul Ferdaus6 1,2,3,4,5 Department of Computer Science and Engineering, East West University, Dhaka 1212, Bangladesh [6 Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract—Nowadays, fraud correlated with credit cards became very prevalent since a lot of people use credit cards for buying goods and services. Because of e-commerce and technological advancement, most transactions are happening online, which is increasing the risk of fraudulent transactions and resulting in huge losses financially. Therefore, an effective detection technique, as the quickest prediction option, should be developed to deter fraud from propagating. This paper targeted to develop a deep learning (DL)-based model on SMOTE oversampling technique to predict the fraudulent transactions of credit cards. The system used three popular DL algorithms: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Long Short-Term Memory Recurrent Neural Network (LSTM RNN), and measured the best performer in terms of evaluation metrics. However, the results confirm that the CNN algorithm outperformed both ANN and LSTM RNN. Additionally, compared to previous studies, our CNN fraud detection program recorded high rates of accuracy in identifying fraudulent activity. The system achieved an accuracy of 99.97%, precision of 99.94%, recall of 99.99%, and F1-Score of 99.96%. This proposed scheme can help to reduce financial loss by detecting credit card scams or frauds globally.
To detect credit card fraud, various ML and data mining algorithms have been deployed, but they have not yielded noticeable results; hence, it is necessary to develop algorithms that work effectively and efficiently [4]. We attempt to catch credit card fraudsters before the transaction completes by balancing the dataset with the Synthetic Minority Oversampling Technique (SMOTE) and then applying several deep learning algorithms, namely ANN, CNN, and LSTM, and comparing them.
Keywords—Convolutional Neural Network, Credit card, Fraud, Deep Learning, SMOTE
Fraud is a dishonest act performed by an unauthorized person to deceive innocent people. In credit card fraud, the fraudsters steal cardholders' vital information and use it fraudulently. Fraud mostly happens through phone calls or text messages, but it also occurs via software controlled by fraudsters. To detect credit card fraud, the card owner enters the requisite information to complete a transaction, and the transaction is only approved after confirmation that there is no fraudulent activity. To verify a transaction, it is first sent to the verification process to be categorized as fraud or not: transactions categorized as fraud get rejected, and otherwise they are accepted.
I. INTRODUCTION A credit card is an alternative payment card provided to customers to enable them to make purchases and payments to traders by reserving or committing to the issuer their guaranteed payment for the sums as well as any subsequent fees due. Fraudulent credit card transactions refer to unauthorized withdrawals or payments made by an individual not authorized to use the account. Briefly, fraud in credit cards occurs when a person uses another person's credit card to make a purchase for personal use without the original cardholder's consent. Over the past few years, more and more people are utilizing credit cards for their purchases because of the progress of technology, which is gradually leading to an increment in fraud. Nowadays, almost all companies from tiny to large are relying on credit cards for payment. Thus, nearly all industries, including appliances, automobiles, banks, etc. is becoming the victim of fraud. And as the credit card is not required physically, the fraud is increasing promptly day by day [1]. Fig. 1 shows the fraud reports related to credit cards in the US. In 2017, worldwide financial losses totaled $22.8 billion, expecting to rise continuingly to $31 billion [3].
Fig. 1. Fraud reports of United States [2]
ML has become trending for fraud detection in credit cards. Due to its combination of several applications, less time-consuming, and more accurate outcomes, ML has become the most used approach. ML involves various algorithms and models allowing computers to execute tasks without being hard-coded, a model then gets developed using training data, after that, the data get tested using the trained model [4]. As ML has a lot of applications, it is used in multiple sectors. DL is an important part of ML that involves neural networks. And neural networks process data and make decisions in a similar way to the human brain. DL is best for speech recognition, object detection, and processing of
natural language [5]. DL has garnered a tremendous amount of attention among machine learning researchers over the past few years for its notable and promising results [6]. Some DL methods are CNN, ANN, RNN, autoencoders, restricted Boltzmann machines, etc. That is why the proposed system utilizes CNN to detect credit card fraud, alongside ANN and LSTM. Before that, we balanced our dataset using the SMOTE oversampling technique, as the dataset is highly imbalanced. After that, we compared all three algorithms and conclude that CNN predicts fraudulent cases better than ANN and LSTM. The measurement of performance and calculation of accuracy is determined by prediction. In this experiment, we use a dataset of 31 attributes related to name, age, account information, etc., where the final attribute indicates whether the transaction was fraudulent or not with a '1' or '0', respectively. The common types of credit card fraud are summarized in Table I.

TABLE I. TYPES OF FRAUD ON CREDIT CARDS [4]

Types of fraud | Behavior
Application fraud | Fraudsters steal customers' user information and make fraudulent transactions by opening a fake account.
Manual or electronic card imprints | A fraudster skims the data from the card's magnetic strip and then uses it to execute fraudulent transactions.
Card not present fraud | Fraudsters somehow obtain the card's account number and expiry date and use it without its physical presence.
Counterfeit credit card fraud | Fraudsters copy all the information of the real card, make a counterfeit card, and execute transactions.
Stolen or lost card fraud | Fraud occurs when a cardholder loses the credit card or someone steals it.
Theft of card IDs | Fraudsters somehow steal cardholders' card IDs and commit fraudulent transactions.
Non-received mail card fraud | When someone applies for a credit card, the card is sent to the cardholder by mail, which fraudsters intercept through phishing and use under their own names.
Account takeover | Fraudsters somehow access all the relevant information of the original card and the cardholder and gain entire control of the card.
Fake merchant website | Fraudsters create a website and influence customers to buy products using their credit cards; once this happens, they gather all the card data and commit fraud.
Merchant collusion | Cardholder information is disclosed by merchants to a third party or to the fraudster without the cardholder's permission.
The aims of this research are—(1) To identify fraudulent credit card activity. (2) Develop a SMOTE-based CNN model to detect fraudulent transactions. (3) Comparison between the algorithms of deep learning and show the proposed model's effectiveness according to accuracy. II. RELATED WORK Altyeb Altaher Taha and Sharaf Jameel Malebary described that credit card has become one of the popular methods of payment in the field of e-commerce. Hence, fraud is also increasing. They utilized OLightGBM with Bayesian-based hyperparameter optimization to adjust with the parameter of LightGBM. They utilized two real-world datasets, compared them with other techniques, and claimed that their model outperformed, according to accuracy. Their
method offers 98.40% of accuracy, 92.88% of AUC, 97.34% of Precision, and 56.95% of F1-Score [3]. Asha RB and Suresh Kumar KR stated that due to technological advancement, the use of credit cards is rising, thus card frauds. That is why they have proposed a model based on deep learning and machine learning to detect fraudulent credit card activity. They applied ANN, SVM, and KNN to generate their model and got an accuracy of 99.92%, 99.82%, and 93.49%, respectively. They confirmed between all the algorithms and found that ANN outperformed both KNN and SVM [4]. S. L. Marie-Sainte et al. proposed a scheme to detect fraud in credit cards by using LSTM RNN. They used a public dataset and proved the model's efficiency with 99.4% accuracy [5]. S. Makki et al. declared that an unbalanced dataset is a foremost reason for inaccurate results in detecting fraud of credit cards. And they proclaimed that due to this problem, financial losses are increasing rapidly. Therefore, they balanced the dataset and used various ML algorithms, and found that C5.0, LR, Decision tree (DT), SVM, and ANN had the best accuracy, sensitivity, and AUCPR [7]. D. Prusti and S.K. Rath proposed a scheme by applying DT, KNN, ELM, MLP, and SVM to identify frauds in credit cards. They hybridized the DT, KNN, and SVM techniques and used two protocols named SOAP and REST. They compared all the five algorithms and declared that SVM provides 81.63% accuracy, while the hybrid method outperformed with 82.58% of accuracy [8]. Vrushal Shah and Kalpdrum passi presented an innovative approach for detecting fraud transactions of credit cards by employing the CMTNN and SMOTE techniques. They used SMOTE in the minority class and CMTNN in the majority class. They also used some classification algorithms (i.e., SVM, ANN, LR, XGBoost, and RF). They analyzed the results of each algorithm in three forms of SMOTE, CMTNN, and CMTNN+SMOTE. However, they confirmed that CMTNN+SMOTE produces higher AUC comparing with the other two ways [9]. M.S. Kumar et al. proposed a fraud detection model of credit cards employing Algorithm RF. They mainly focused on the detection of fraud in credit cards in the real world. They used DT to classify their dataset. The performance was evaluated through a confusion matrix with a 90% of accuracy [10]. Fayaz Itoo et al. adopted three ML algorithms: LR, NB, and, KNN. They re-sampled their dataset by datapreprocessing to obtain better results. They divided their training data into three proportions (i.e., the ratio A-50:50, ratio B-34:66, and ratio C-25:75). Among the algorithms, LR performed best with an accuracy of 91.2% in ratio A, 92.3% in ratio B, and 95.9% in ratio C [11]. S. R. Lenka et al. proposed an ensemble-based model to detect credit card fraudulent activity. They used AdaBoost, RF, XGBoost, and GBDT in an imbalanced dataset. They measured the performance by analyzing both single and ensemble-based classifiers. They found that combined GBDT and RF with ensemble technique outperformed by gaining an accuracy of 92.23% [12].
P. Caroline Cynthia and S. Thomas George intended to find which ML algorithm performs more accurately between supervised learning and unsupervised learning in detecting scams in credit cards. They used Local Outlier Factor (LOF), Isolation Forest (IF) as supervised learning, and SVM and LR as unsupervised learning. They measured the best by comparing accuracy, recall, precision, F1-score, confusion matrix, and support by averaging with micro-average, macro-average, and weighted-average. The LOF and IF gained an accuracy of 99.7% and 99.6%, respectively, while the SVM and LR reported an accuracy of 97.2% and 99.8%, respectively. Finally, they declared that unsupervised learning performs more suitable in credit card scam or fraud detection [13].
III. METHODOLOGY
This section demonstrates the execution process of our proposed model. It includes an architectural diagram to make the system clear. Additionally, the data collection process and the applied techniques (the SMOTE oversampling technique and the ANN, LSTM, and CNN algorithms) are described in this part.
A. Architectural Design
The architectural design provides a clear idea of the proposed system and explains its implementation. The architectural design of the proposed model has multiple phases, which are depicted in Fig. 2. At first, the data goes through the preprocessing phase, where the system resizes, shuffles, and normalizes the data. The system then balances the data using SMOTE, after which the data is split into two sections for training and testing. Next, the CNN, ANN, and LSTM algorithms are applied to the training data, and during each epoch the accuracy and loss of the trained model are measured. Once all three models are developed, accuracy, precision, recall, and F1-score are calculated to predict whether a transaction is legitimate or fraudulent.

Fig. 2. Architectural Diagram

B. Dataset Collection
The dataset used in this proposed system is from the website www.kaggle.com. It was made from the card transactions of customers of a European bank in 2013. It has 284,807 transactions, of which about 492 cases were fraudulent; the ratio of illegal transactions is 0.172%. It has 31 columns, of which 28 (V1, V2, ..., V28) are PCA-transformed, while 'Time', 'Class', and 'Amount' are not. Class refers to fraud or non-fraud through '1' and '0', respectively.
C. SMOTE Technique
ML algorithms generally intend to maximize accuracy and do not perform well on an imbalanced dataset, because all misclassification errors are treated equally [1]. A dataset has two types of classes: majority and minority. A learning algorithm that attempts to learn the minority class features will find it difficult to do so due to this imbalance in the training data distribution [9], and oversampling approaches solve these types of problems rather effectively [15]. The dataset used in this research is highly imbalanced. Here, the majority class refers to legitimate transactions, and the minority class refers to fraudulent transactions. Due to this high imbalance, it was hard for the ML approach to predict fraudulent transactions accurately. To solve this, we used SMOTE to balance the dataset. The key reason behind choosing SMOTE is its wide usage and efficient results [16].
Fig. 3. SMOTE oversampling technique structure
The over-sampling technique SMOTE increases the number of minority-class samples using interpolation. SMOTE creates new, synthetic data by employing the KNN algorithm to connect instances of the minority class, which is illustrated in Fig. 3. The technique estimates the distance between a feature vector and its nearest neighbors; the difference is then multiplied by a random number between 0 and 1 and added back to the feature vector [9]. In this research, we applied SMOTE to equalize the number of legitimate and fraudulent credit card transactions by boosting the number of minority-class samples.
D. ANN Algorithm
ANN is an algorithm biologically inspired by the human brain: neurons in the human brain are interconnected similarly to the interconnected nodes in an ANN. Fig. 4 illustrates the composition of an ANN with input, hidden, and output layers. There are x1, x2, ..., xn as inputs and y as output, and weights w1, ..., wn are associated with the inputs x1, ..., xn [4]. For our credit card fraud detection system, ReLU is used as the activation function.
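To make the balancing step of subsection C concrete, the following sketch (assuming the imbalanced-learn and pandas libraries and the Kaggle credit card CSV described above; the file name is a placeholder) applies SMOTE to the training split only and prints the class counts before and after.

```python
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Placeholder path to the Kaggle credit card fraud dataset (284,807 rows, 'Class' label).
df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# 80/20 split, as used later in the experimental setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("before SMOTE:", Counter(y_train))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("after SMOTE: ", Counter(y_bal))
```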
Fig. 4. ANN Algorithm Structure
E. CNN Algorithm
CNN is a specific kind of multilayer perceptron; a simple neural network, in contrast with deep learning, is unable to capture complex features. CNNs are designed to extract local features from the inputs at higher layers and pass them down to lower layers to process more complex features. A CNN consists of convolutional layers, pooling layers, and a fully connected (FC) layer. Fig. 5 illustrates a standard CNN structure including these layers.

Fig. 5. CNN Algorithm Structure

The convolutional layer comprises kernels [17] that determine the feature map tensor. The kernels slide over the whole input using a stride, converting the output volume dimensions into integers [18]. Consequently, zero padding is needed to pad the input volume with zeros and preserve its dimensions when low-level features are in use. A convolutional layer operates as follows:

F = I * K    (1)

Here, the input matrix is denoted by I, K represents a 2D filter of size (m × n), and the 2D feature map output is indicated by F; the convolutional layer operation is denoted by I * K. To make the feature maps more nonlinear, a ReLU layer is used [19]. ReLU computes the activation by thresholding the input at zero; in mathematical terms:

f(x) = max(0, x)    (2)

The pooling layer decreases the number of parameters by downsampling the provided input dimension. Max pooling is a popular method that takes the maximum value within a region of the input. The FC layer produces a decision based on the detailed features received from the convolutional and pooling layers [20].

F. LSTM Algorithm
LSTM is an advanced version of the recurrent neural network (RNN). The LSTM model provides an alternative to conventional RNNs by resolving the vanishing gradient problem [20]. A cell state is added to store the long-term state, which distinguishes it from RNNs, and an LSTM structure is capable of remembering prior information and connecting it to the current data [21]. Fig. 6 illustrates a standard LSTM structure. The LSTM is composed of three gates: input, forget, and output. Here, x_t represents the present input, C_t is the new and C_{t-1} the prior cell state, and h_t is the new and h_{t-1} the prior output.

Fig. 6. LSTM Algorithm Structure

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (3)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)    (4)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (5)

In the given equations, W_i, W_C, W_f, W_o are weights and b_i, b_C, b_f, b_o are biases. Eq. (3) determines which part of the information is to be added by passing h_{t-1} and x_t through the sigmoid layer. Eq. (4) is applied to acquire the new candidate information after h_{t-1} and x_t have been passed through the tanh layer. In Eq. (5), i_t and C̃_t are merged into the new cell state C_t, where i_t and C̃_t are the sigmoid output and the tanh output, respectively. After that, the forget gate enables the selective passage of information through a dot product and a sigmoid layer; in Eq. (6), it is decided whether the previous cell's information will be forgotten based on a certain probability.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (6)

Using h_{t-1} and x_t, the output gate defines the required states for continuation in Eq. (7) and Eq. (8): the final output h_t is obtained by multiplying o_t by the tanh of the cell state.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (7)
h_t = o_t ⊙ tanh(C_t)    (8)

IV. EXPERIMENTAL RESULT ANALYSIS
In this section, we describe the experimental setup: the language, libraries, packages, and functions. This part also defines the evaluation metrics used to measure performance and identifies the algorithm that provided the best results among the algorithms used.
A. Experimental Setup
For this experiment, training was done on 80% of the dataset, while testing used 20%. The training data was trained after applying the SMOTE method. To acquire the best results, various parameters, optimizers, and activation functions were used: the system used CNN, ANN, and LSTM modules, batch size and epochs as parameters, Adam as the optimizer, and ReLU and Softmax as activation functions. The maximum number of epochs is 100. Our system was implemented using Python as the main programming language with the Keras and TensorFlow libraries in Jupyter Notebook on an Intel(R) Core(TM) i5 1.6 GHz processor with 8 GB RAM. Additionally, we used NumPy, Pandas, and scikit-learn as Python libraries. A minimal sketch of such a training setup is given below.
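The following sketch, assuming Keras/TensorFlow and the SMOTE-balanced training split from the previous section, shows one possible CNN configuration over the 30 input features; the exact layer sizes are not specified in the paper and are chosen here purely for illustration.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

# X_bal, y_bal would come from the SMOTE step above; placeholders keep the sketch runnable.
X_bal = np.random.rand(1000, 30)          # 30 features per transaction
y_bal = np.random.randint(0, 2, 1000)     # 0 = legitimate, 1 = fraud

X_cnn = X_bal.reshape(-1, 30, 1)          # 1D convolution expects (samples, steps, channels)

model = Sequential([
    Conv1D(32, kernel_size=2, activation="relu", input_shape=(30, 1)),
    MaxPooling1D(pool_size=2),
    Conv1D(64, kernel_size=2, activation="relu"),
    Flatten(),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(2, activation="softmax"),       # softmax output, as described in the setup
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_cnn, y_bal, epochs=5, batch_size=32, validation_split=0.1)
```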
B. Evaluation Metrics
Ultimately, accuracy, precision, recall, and F1-Score determine the outcome. The proposed method is measured using the following quantities: True Positive (TP), where both the predicted and actual values are positive (1); True Negative (TN), where both values are negative (0); False Positive (FP), where the true class is 0 but the predicted class is 1; and False Negative (FN), where the true class is 1 but the predicted class is 0.
Fig. 8. Accuracy and loss of LSTM model
Here, TP denotes the samples of fraudulent transactions, FP denotes the number of legitimate transactions presented as fraudulent, TN denotes the legitimate transactions, and FN denotes the number of fraud transactions presented as legitimate.
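As a small illustration of these quantities (formalized in Eqs. (9)-(12) below), the following sketch, assuming scikit-learn, derives TP, FP, TN, and FN from a confusion matrix and computes the four metrics; the predictions are placeholders.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder ground truth (1 = fraud)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1_score  = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1_score:.3f}")
```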
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (9)
Precision = TP / (TP + FP)    (10)
Recall = TP / (TP + FN)    (11)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)    (12)

Fig. 9. Accuracy and loss of CNN model

C. Results
The proposed method used CNN, ANN, and LSTM after balancing the dataset with the SMOTE technique. We evaluated the metrics of accuracy, precision, recall, and F1-Score. The obtained accuracy of ANN, LSTM, and CNN is 99.95%, 97.30%, and 99.97%, respectively. Table II shows all the evaluated metrics.

TABLE II. ACCURACY, PRECISION, RECALL & F1-SCORE

Algorithms | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
ANN | 99.95 | 99.94 | 99.94 | 99.94
LSTM | 97.30 | 97.28 | 97.36 | 97.29
CNN | 99.97 | 99.94 | 99.99 | 99.96
In comparison with ANN and LSTM, the CNN model accuracy was the highest. The model accuracy and loss of ANN, LSTM, and CNN algorithms are shown in Fig. 7, Fig. 8, and Fig. 9, respectively. Among all three algorithms, CNN worked best in detecting fraudulent card activity according to all the evaluation metrics. Fig. 10 shows the comparison between ANN, LSTM, and CNN based on accuracy, precision, recall, and F1-Score.
Fig. 10. Evaluation metrics using ANN, LSTM, and CNN
V. DISCUSSION

TABLE III. COMPARISON BETWEEN PROPOSED AND EXISTING MODELS

Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
ANN [4] | 99.92 | 81.15 | 76.19 | -
LR [13] | 99.87 | 100 | 100 | 100
OLightGBM [3] | 98.40 | 97.34 | - | 56.95
LR [11] | 95.9 | 99.1 | - | 90.9
LSTM RNN [5] | 99.4 | - | - | -
GBDT + RF [12] | 92.23 | - | - | 91.39
C5.0, SVM, LR, ANN [7] | 96 | - | - | -
Hybrid based model [8] | 82.58 | 96.83 | - | 89.71
RF [10] | 90 | - | - | -
CMTNN+SMOTE based RF [9] | 99.3 | 100 | 99 | 100
LSTM RNN [14] | 99.58 | 99.6 | 80 | -
SMOTE-based CNN [proposed model] | 99.97 | 99.94 | 99.99 | 99.96
Fig. 7. Accuracy and loss of ANN model
By investigating the results, it is found that the SMOTE-based CNN has notable outcomes in the detection of credit card fraud. The proposed scheme could distinguish fraudulent and legitimate card transactions with higher accuracy. A comparison of existing methods and the proposed scheme in terms of accuracy, F1-Score, precision, and recall is given in Table III. From Table III, it is observed that some of the existing models [8], [10], and [12] achieved somewhat lower accuracy, ranging from 82.58% to 92.23%. Medium-high accuracies of 95.9%, 96%, and 98.40% are found in [11], [7], and [3], respectively. Others obtained very good accuracies of 99.3%, 99.4%, 99.58%, and 99.87%, found in [9], [5], [14], and [13], respectively. The system described in [4] obtained the highest accuracy of 99.92% with the ANN algorithm. In our system, the ANN, LSTM, and CNN algorithms achieved 99.95%, 97.30%, and 99.97%, respectively, after applying the SMOTE technique. It can be said that our proposed SMOTE-based CNN model has performed better than ANN and LSTM. Overall, our proposed scheme provides superior results to other current models.
VI. CONCLUSION & FUTURE WORK
As e-commerce and online usage increase rapidly, the use of credit cards is also rising; hence, the fraud associated with credit cards is growing as well. Therefore, it is crucial to detect credit card transactions that are fraudulent. In this paper, we have introduced a SMOTE-based CNN model to detect scams related to credit cards. We balanced our dataset with the SMOTE oversampling technique and then used three deep learning algorithms (i.e., ANN, LSTM, and CNN) and compared them by accuracy. We found that our proposed CNN model outperformed both ANN and LSTM, obtaining an accuracy of 99.97%, recall of 99.99%, precision of 99.94%, and F1-Score of 99.96%. Our experimental outcome shows that the proposed structure can detect credit card frauds or scams effectively. We expect that the proposed scheme can be developed into an efficient tool and help reduce fraud associated with credit cards. In our future work, we would like to explore a combined deep CNN-LSTM structure for the detection of illegal card transactions, where the CNN will work as the feature extractor and the LSTM as the classifier.
REFERENCES
[1] U. Fiore, A. De Santis, F. Perla, P. Zanetti, and F. Palmieri, "Using generative adversarial networks for improving classification effectiveness in credit card fraud detection," Inf. Sci. (Ny)., vol. 479, pp. 448–455, Apr. 2019, doi: 10.1016/j.ins.2017.12.030.
[2] "Credit Card Fraud Statistics [Updated September 2020] - Shift Processing." https://shiftprocessing.com/credit-card-fraud-statistics/ (accessed Aug. 22, 2021).
[3] A. A. Taha and S. J. Malebary, "An Intelligent Approach to Credit Card Fraud Detection Using an Optimized Light Gradient Boosting Machine," IEEE Access, vol. 8, pp. 25579–25587, 2020, doi: 10.1109/ACCESS.2020.2971354.
[4] A. RB and S. K. KR, "Credit card fraud detection using artificial neural network," Glob. Transitions Proc., vol. 2, no. 1, pp. 35–41, Jun. 2021, doi: 10.1016/j.gltp.2021.01.006.
[5] S. L. Marie-Sainte, M. Bin Alamir, D. Alsaleh, G. Albakri, and J. Zouhair, "Enhancing Credit Card Fraud Detection Using Deep Neural Network," in Advances in Intelligent Systems and Computing, vol. 1229 AISC, Springer, 2020, pp. 301–313.
[6] J. Forough and S. Momtazi, "Ensemble of deep sequential models for credit card fraud detection," Appl. Soft Comput., vol. 99, p. 106883, Feb. 2021, doi: 10.1016/j.asoc.2020.106883.
[7] S. Makki, Z. Assaghir, Y. Taher, R. Haque, M.-S. Hacid, and H. Zeineddine, "An Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud Detection," IEEE Access, vol. 7, pp. 93010–93022, 2019, doi: 10.1109/ACCESS.2019.2927266.
[8] D. Prusti and S. K. Rath, "Web service based credit card fraud detection by applying machine learning techniques," in TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), Oct. 2019, pp. 492–497, doi: 10.1109/TENCON.2019.8929372.
[9] V. Shah and K. Passi, "Data Balancing for Credit Card Fraud Detection Using Complementary Neural Networks and SMOTE Algorithm," in Computing Science, Communication and Security, Springer, Cham, 2021, pp. 3–16.
[10] M. S. Kumar, V. Soundarya, S. Kavitha, E. S. Keerthika, and E. Aswini, "Credit Card Fraud Detection Using Random Forest Algorithm," in Proceedings of the 3rd International Conference on Computing and Communications Technologies (ICCCT 2019), Feb. 2019, pp. 149–153, doi: 10.1109/ICCCT2.2019.8824930.
[11] F. Itoo, Meenakshi, and S. Singh, "Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection," Int. J. Inf. Technol., pp. 1–9, Feb. 2020, doi: 10.1007/s41870-020-00430-y.
[12] S. R. Lenka, M. Pant, R. K. Barik, S. S. Patra, and H. Dubey, "Investigation into the Efficacy of Various Machine Learning Techniques for Mitigation in Credit Card Fraud Detection," in Advances in Intelligent Systems and Computing, 2021, vol. 1176, pp. 255–264, doi: 10.1007/978-981-15-5788-0_24.
[13] P. Caroline Cynthia and S. Thomas George, "An outlier detection approach on credit card fraud detection using machine learning: A comparative analysis on supervised and unsupervised learning," in Advances in Intelligent Systems and Computing, 2021, vol. 1167, pp. 125–135, doi: 10.1007/978-981-15-5285-4_12.
[14] O. Owolafe, O. B. Ogunrinde, and A. F.-B. Thompson, "A Long Short Term Memory Model for Credit Card Fraud Detection," in Artificial Intelligence for Cyber Security: Methods, Issues and Possible Horizons or Opportunities, Springer, Cham, 2021, pp. 369–391.
[15] S. Bej, N. Davtyan, M. Wolfien, M. Nassar, and O. Wolkenhauer, "LoRAS: an oversampling approach for imbalanced datasets," Mach. Learn., vol. 110, no. 2, pp. 279–301, Feb. 2021, doi: 10.1007/s10994-020-05913-4.
[16] P. Li, M. Abdel-Aty, and J. Yuan, "Real-time crash risk prediction on arterials based on LSTM-CNN," Accid. Anal. Prev., vol. 135, p. 105371, Feb. 2020, doi: 10.1016/j.aap.2019.105371.
[17] A. M. Hasan, H. A. Jalab, F. Meziane, H. Kahtan, and A. S. Al-Ahmad, "Combining Deep and Handcrafted Image Features for MRI Brain Scan Classification," IEEE Access, vol. 7, pp. 79959–79967, 2019, doi: 10.1109/ACCESS.2019.2922691.
[18] J. Gu et al., "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354–377, May 2018, doi: 10.1016/j.patcog.2017.10.013.
[19] D. Singh, V. Kumar, Vaishali, and M. Kaur, "Classification of COVID-19 patients from chest CT images using multi-objective differential evolution–based convolutional neural networks," Eur. J. Clin. Microbiol. Infect. Dis., vol. 39, no. 7, pp. 1379–1389, Jul. 2020, doi: 10.1007/s10096-020-03901-z.
[20] M. Z. Islam, M. M. Islam, and A. Asraf, "A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images," Informatics Med. Unlocked, vol. 20, p. 100412, Jan. 2020, doi: 10.1016/j.imu.2020.100412.
[21] G. Chen, "A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation," Oct. 2016. [Online]. Available: http://arxiv.org/abs/1610.02583.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Paddy Disease Detection Using Deep Learning Md Fahim Faez Abir 1, Kamrul Hassan Rokon 2, Sk Mohammad Asem 3, Abdullah Abdur Rahman 4 , Nasim Bahadur 5 and Md Nawab Yousuf Ali 6 1,2,3,4,5
Department of Computer Science and Engineering East West University, Dhaka, Bangladesh Email- [email protected] , [email protected] , [email protected] , [email protected], 5 [email protected] , [email protected]
Abstract—The economy depends heavily on agricultural production. This is among the factors that make plant disease detection important in agricultural fields. Rice plant diseases are thought to be a major cause of agricultural, economic, and societal deficits in agricultural growth. This Project develops an automated approach to detect paddy diseases using a Deep Learning model based on the SMOTE-ENN resampling technique. We have used five popular Deep Learning algorithms: Convolutional Neural Network (CNN), VGG16, VGG19, Xception, and ResNet50. We measured the best model performance by their evaluation metrics. Comparing these five algorithms, Xception recorded the highest accuracy to detect paddy diseases. The model achieved an accuracy of 98%, precision of 97%, recall of 96%, and F1-Score of 97%. This proposed system can help our agriculture firms to reduce production loss.
processes, including feature extraction, identification, and classification, are included in the system [2]. A. Types of rice leaf diseases Rice plants are affected by various types of diseases, kits, etc. We have worked on some of the diseases which are given below and the sample of those diseases is shown in Figure 1.
Keywords—Rice leaf diseases, Image sharpening, SMOTEENN, CNN, VGG16, VGG19, RESNET50, and XCEPTION.
I. INTRODUCTION Rice is a significant food crop that is grown all over the world. Nearly half of the population of the globe relies on it as the main food. As food and people continue to rise, so does the demand for rice. Raising rice yield is necessary to keep up with the world's expanding population. To meet the demands of the world's expanding population, rice productivity must be increased. Researchers in Bangladesh have noticed a decrease in rice output ranging from 10 to 15 percent on average due to the country's 10 most important rice diseases [1]. Disease symptoms are often shown in rice leaves. The diseases present in the rice plant may be evaluated by examining the condition of the rice leaves. There are occasions when farmers are unable to make sound pesticide decisions because they are baffled. At the moment, a manual eyeball test is the most common method used to evaluate the state of a plant's health based on its leaves. Since manual evaluation is dependent on the individual screening the leaf, this procedure is not objective, it is laborious, and it is prone to making mistakes [1]. In such instances, the method that was recommended proved to be very useful in the process of evaluating wide fields of crops. It is made simpler as well as less expensive if the diseases can be automatically detected by just seeing the signs on the plant's leaves. This automated solution tackles the difficulties posed by manual processes. A conventional digital camera or a high-resolution mobile phone camera. might be used to capture the picture. The system receives this picture as an input to determine the leaf's characteristics. Several
Fig. 1.
Types of leaf disease
1) Rice Leaf Blast: Magnaporthe Oryzae is a fungus that causes the disease known as rice leaf blast. Due to its global distribution and potential for great destruction in the right conditions, it is typically regarded as the most severe rice disease in the world. Except for the roots, rice blasts may harm the whole rice plant. When it comes to planting infection, the fungus doesn't discriminate between different stages of development [3]. 2) Rice Neck blast: Neck blast is caused by similar symptoms as leaf blast. Grayish brown lesions on the neck may induce girdling, causing the neck and the panicle to collapse over one another. There will be no grain produced if infection occurs before the milky stage, but if the infection happens later, there will be a grain of worse grade quality [4]. 3) Leaf Burn: Rice leaf damage and yield reduction are caused by both nitrogen deficit and over-fertilization. Yellowing of the leaves is a sign of insufficient nitrogen delivery, while leaf burn is a sign of applying fertilizer at concentrations that are too high for the plant. Seeds and
agricultural stubbles are the main carriers of the disease. Conditions favorable to the disease include wet weather and substantial amounts of nitrogenous fertilizer [5]. 4) Potassium (K) Deficiency: Potassium(K) is a nutrient that is needed for plant growth. Both inside the plant and in the soil, potassium may move about pretty easily. The majority of rice fields that are irrigated require an appropriate amount of K fertilizer to be applied. When there is a severe lack of potassium in the soil, the uppermost tips of the leaves will begin to fall off, and this process will continue nearly down to the base of the leaves [6].
•
Comparison of our proposed model with previous works and effectiveness of our proposed model.
C. Proposed system The proposed system is to utilize Deep Learning Algorithms to detect paddy diseases. Firstly, we cleaned our dataset and deleted unnecessary and garbage data. Then we used the image sharpening technique and balanced our dataset using SMOTE-ENN as it is highly imbalanced. After that, we compared the accuracy of Deep Learning Algorithms CNN, VGG16, VGG19, Xception, and ResNet50. The dataset has nine paddy diseases. II. RELATED WORK
5) Phosphorus (P) Deficiency: Plants' key energy storage and transfer capabilities are compromised when they are deficient in phosphorus (P). Tilling, root growth, early blooming, and ripening is all affected by it. Many rice habitats are plagued by phosphorus deficit, particularly in acidic highland soils, where soil P-fixation capability is generally high [7]. 6) leaf Smut: Entyloma Oryzae, which causes the illness known as "leaf smut," is a common but not harmful disease that affects rice. The fungus survives the winter on ailing leaves and stems in the soil and spreads through the air via spores. Leaf smut is a late-season illness that causes only minor harm. High nitrogen settings are favorable to the disease's growth. Dealing with the circumstance is not recommended [8]. 7) Brown spot: A fungus known as the brown spot can harm both adult plants and seedlings. The disease causes blight in seedlings produced from seeds that have been heavily polluted [9]. 8) Nitrogen (N) Deficiency: Rice output is significantly impacted by nitrogen, which is an essential nutrient. Nitrogen is the nutrient that has the biggest influence on the growth, development, and yield of rice; it also plays a diverse role in maintaining and regulating the physiological processes of rice. Among the many nutrients, nitrogen has the most impact. A lack of nitrogen in rice prevents the creation of chlorophyll and proteins, which in turn reduces the rate of photosynthesis and has an effect on the amount of dry matter produced [10]. 9) Foot Rotten: This disease is induced by fungus. Sometimes the soil may be contaminated by the fungus as well. After the irrigation of the contaminated soil, the water will penetrate the paddy stem. The influence will emerge from the root and afterward, the afflicted place will turn brown. Then gradually, the root will be rotten and will then be bent down. B. Objectives • This researcher detects paddy diseases. • By balancing the data using the SMOTE-ENN resampling technique, we can compare deep learning algorithms and find the best-performing algorithm.
Automated technology for on-the-job pesticide application keeps an eye on and regulates environmental factors like irrigation of water, soil moisture, and animal intrusion in the field. The rice crop is afflicted by blast disease everywhere it is grown. Despite the use of pesticides, the fast development of this infectious illness remains a challenge. Chowdhury R. Rahman et. al. demonstrated the deep learning-based algorithms that may be used to identify diseases and pests using photos of rice plants. Compared to VGG16, the proposed architecture is 99 percent smaller yet achieves 93.3% accuracy using KNN and ANN [11]. Anam Islam and his colleagues proposed a method using a segmentation based method deep neural network to detect the disease. Diseases affected area extracted by local threshold segmentation. They achieved the highest level of accuracy 90.91% by DenseNet121 [12]. Prajapati suggested an algorithm that uses image processing and machine learning to find diseases in rice plants. K-means clustering was utilized to segment the image of the rice leaves, and the SVM method was then applied to categorize the diseases. In contrast to a confidence level of 93.33%, only 73% accuracy was obtained when analyzing the split analysis results on the initial data set [13]. According to Wen-Liang Chen and his colleagues, bacterial blast leaf disease is one of the most widespread paddy plagues. They focus mostly on-farm sensors that generate non-image data that can be automatically trained and evaluated in real-time by the AI mechanism. They do this by utilizing technologies such as the Internet of Things and artificial intelligence to identify plant diseases with a high degree of accuracy of 89.4 percent [14]. Chen et al. used the INC-VGGN approach to find diseases in rice and maize leaves. They applied two genesis layers and one global average pooling layer to replace VGG19's final convolutional layer [15]. An application that uses convolutional neural networks and image processing has been proposed by Mique Jr., E. L., and their colleagues. The photos that have been collected undergo preprocessing to train the model, and upon successful implementation, the model provides an accuracy of 90.9 percent [16].
S. Ramesh et al. suggested a method for detecting rice blast leaf disease using KNN and ANN algorithms. They concentrated their efforts mostly on rice crops grown in India, one particular rice leaf disease, and methods for detecting illness in its early stages. They accomplished the highest level of accuracy possible from ANN, which was 90 percent [17]. III. METHODOLOGY This section demonstrates the implementation process of the proposed model with the architectural diagram which will give a clear understanding of the system. The dataset collection process, using the image Sharpening Technique, SMOTE-ENN resampling technique, Image Augmentation technique, CNN, VGG16, VGG19, Xception, and ResNet50 Algorithms description is included in this chapter. A. Architectural Design The architectural design will give a clear vision of our proposed system. The process of the system will be explained in it. The architectural design of our proposed system has more than one phase illustrated in Figure 2. Firstly, In the preprocessing phase, there are resizing, and sharpening techniques. After that, the dataset will be balanced by using SMOTE-ENN Technique. Then the dataset will be split into train data and test data. The train data will be augmented with zooming, shifting, rotating, etc. After augmentation, the algorithms will run with the augmented dataset. Finally, the model will be tested by the test dataset and the accuracy will have appeared.
Fig. 2. Architectural Design

B. Dataset Collection
We used a merged dataset to create the proposed model. The dataset is divided into three sections and was gathered from the www.kaggle.com website. From the village of Kushkhali, close to the Satkhira district, we gathered 100 pictures of Brown Spot. In total, we collected 5622 photos covering nine diseases as well as cleaned paddy. The number of photos in each category is displayed in Table I.

TABLE I. LABELS AND THE NUMBER OF IMAGES

Labels | Category | Number of Images
0 | Brown Spot | 523
1 | Foot Rotten | 500
2 | Healthy | 1487
3 | Leaf Blast | 923
4 | Leaf Burn | 500
5 | Leaf Smut | 40
6 | Nitrogen (N) | 440
7 | Phosphorus (P) | 333
8 | Potassium (K) | 383
9 | Neck Blast Paddy | 493
 | Total | 5622

C. PreProcessing Technique
As we took the data from Kaggle, it contained a lot of noise, so we used image sharpening to remove the noise and labelled the images with respect to the disease names. The number of images per disease was imbalanced, so we used a resampling technique named SMOTE-ENN to balance the dataset.

Image Sharpening Technique: We used the image smoothing technique to remove noise from our dataset. In this approach, a Gaussian kernel is used; we define sigma X and sigma Y for the X and Y directions. If only sigma X is given, sigma Y is taken to be the same, and if both are 0, they are calculated from the kernel size [18]. Using the addWeighted function, we blended the Gaussian-blurred image with the original image. The addWeighted function blends the images through their alpha channels: alpha blending lets a background image show through a foreground image that appears transparent, and the mask that controls how much light passes through is called an alpha mask [19] (see the sketch below).

g(x) = α · f0(x) + β · f1(x) + γ    (1)

Here f0 and f1 are the original and blurred images, α and β are the blending weights, and γ is a scalar offset.

SMOTE-ENN Technique: The SMOTE-ENN approach was developed by Batista et al. (2004) and combines SMOTE with Wilson's Edited Nearest Neighbor rule (ENN) [20]. SMOTE is an oversampling technique whose primary objective is to create new minority-class examples from a small number of closely spaced minority-class examples. ENN is a data-cleaning method that produces better-defined class clusters: any instance whose class label differs from the classes of at least two of its three nearest neighbors is eliminated [21].
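The following sketch, assuming OpenCV (cv2) and NumPy, illustrates the sharpening step described above: a Gaussian-blurred copy of the image is blended with the original via addWeighted, matching Eq. (1); the kernel size, sigma, and blending weights are illustrative rather than the exact values used.

```python
import cv2
import numpy as np

# Placeholder input; in practice this is one of the paddy leaf images from the dataset.
img = cv2.imread("leaf_sample.jpg")
if img is None:  # fall back to a synthetic image so the sketch stays runnable
    img = np.random.randint(0, 256, (180, 180, 3), dtype=np.uint8)

# Gaussian blur: sigmaY defaults to sigmaX when only sigmaX is given;
# with sigmaX=0 both are derived from the kernel size.
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=0)

# Unsharp-style blend following g(x) = alpha*f0(x) + beta*f1(x) + gamma (Eq. 1).
sharpened = cv2.addWeighted(img, 1.5, blurred, -0.5, 0)

cv2.imwrite("leaf_sample_sharpened.jpg", sharpened)
```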
Augmentation Technique: Data augmentation techniques are used to make a dataset several times bigger than the real one; in an ML model, data augmentation works like a regularizer to avoid overfitting [22]. We split our dataset into train and test data, and only the train data is modified by the augmentation technique. We used normal transformations such as rotation, horizontal flip, vertical flip, zoom range, fill mode, shear range, and brightness parameters in the augmentation for our proposed system, taking nine modified images from a single original image (see the sketch below).
D. CNN Algorithm
A convolutional neural network (CNN) is an artificial neural network inspired by nature that incorporates continuous information flow [23]. These modules are followed by a fully connected network, which might consist of a single fully connected layer or several layers combined. The architecture of a deep CNN model is constructed by stacking numerous CNN modules.
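A rough illustration of this augmentation, assuming Keras' ImageDataGenerator, is shown below; the parameter ranges are placeholders, and generating nine augmented variants per image mirrors the description above.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder image batch of shape (1, 180, 180, 3); real inputs are the resized leaf images.
image = np.random.rand(1, 180, 180, 3)

# Brightness shifts are also described in the paper; they are omitted here for brevity.
datagen = ImageDataGenerator(
    rotation_range=30,
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.2,
    shear_range=0.2,
    fill_mode="nearest",
)

# Produce nine augmented variants of the single original image.
augmented = []
for batch in datagen.flow(image, batch_size=1):
    augmented.append(batch[0])
    if len(augmented) == 9:
        break
print(len(augmented), "augmented images of shape", augmented[0].shape)
```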
The most frequent elements of a CNN include fully connected layer (FC), convolutional and pooling layers, and others. Figure 3 below displays the architecture of our CNN model as well as the overall number of parameters.
Fig. 3. CNN Architectural Design & Number of Parameters

E. VGG16 Algorithm
In our modified VGG16 model, the trainable parameter of the base is set to false and the input image dimensions are (180, 180, 3). A Flatten layer and two Dense layers with ReLU activation and unit sizes of 128 and 256 have been added. Additionally, to prevent overfitting, we used L2 regularizers and dropout. In the output layer, we define a class size of 10 with the softmax activation function. There were 1,676,820 trainable parameters [24].
F. VGG19 Algorithm
For our modified VGG19 model, the input image dimensions are (180, 180, 3) and the trainable parameter of the base is set to false. We include a Flatten layer as well as two Dense layers with ReLU activation and unit sizes of 128 and 256. To avoid overfitting, we also employ L2 regularizers and dropout, and we specify a class size of 10 with the softmax activation function in the output layer. There were 1,674,122 trainable parameters [24].
G. Xception Algorithm
Depthwise separable convolutions are the foundation of the deep CNN architecture known as Xception. Extreme Inception, or "Xception", develops the concepts from Inception to their logical conclusion [25]. In contrast to the Inception model, the Xception model comprises two levels, only one of which has more than one layer; this layer divides the result into three parts and sends them to the following series of filters. While spatial and channel information are processed together in an ordinary deep CNN model, the Xception model employs both depthwise and pointwise convolutions [26].
H. ResNet50 Algorithm
Our updated ResNet50 model's trainable parameter is set to false, and the input image dimensions are (180, 180, 3). We include a Flatten layer as well as two Dense layers with ReLU activation and unit sizes of 128 and 256. To prevent overfitting, we also employ L2 regularizers and dropout, with the softmax activation function and a class size of 10 in the output layer. There were 33,007,490 trainable parameters [27].
IV. EXPERIMENTAL RESULT
In this section, we discuss the experimental environment setup, including the language, environment, libraries, packages, functions, and parameters, together with the evaluation metrics used to compare the algorithms. In addition, the best-performing algorithm is discussed.
A. Experimental Setup
We used 80% of the dataset for training and 20% for testing in the suggested model. We balanced our dataset using the SMOTE-ENN technique before dividing it into train and test sets. Our proposed system was executed using the Python programming language with the TensorFlow and Keras libraries in Google Colab Pro, which provides 25 GB of RAM, more than 100 GB of disk space, and a Tesla P100-PCIE GPU. In addition, we used NumPy, scikit-learn, ModelCheckpoint, regularizers, and dropout. We used the CNN, VGG16, VGG19, Xception, and ResNet50 algorithms; the batch size varies from 16 to 20 and the maximum number of epochs is 30. The labels with respect to the diseases are shown in Table II, and a minimal sketch of one such transfer-learning setup follows the table.

TABLE II. LABELS OF THE DISEASES IN OUR DATASET

Labels | Category
0 | Brown Spot
1 | Foot Rotten
2 | Healthy
3 | Leaf Blast
4 | Leaf Burn
5 | Leaf Smut
6 | Nitrogen (N)
7 | Phosphorus (P)
8 | Potassium (K)
9 | Neck Blast Paddy
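As a hedged sketch of the setup above, assuming Keras/TensorFlow, the snippet below freezes a VGG16 base and adds the Flatten, Dense(128)/Dense(256), dropout, L2 regularization, and 10-class softmax head described in Section III; the regularization strength, dropout rate, and data pipeline are placeholders.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2

# Frozen convolutional base; input images are resized to (180, 180, 3) as in the paper.
base = VGG16(weights="imagenet", include_top=False, input_shape=(180, 180, 3))
base.trainable = False

model = Sequential([
    base,
    Flatten(),
    Dense(128, activation="relu", kernel_regularizer=l2(0.01)),
    Dropout(0.3),
    Dense(256, activation="relu", kernel_regularizer=l2(0.01)),
    Dropout(0.3),
    Dense(10, activation="softmax"),   # 10 classes, matching Table II
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Placeholder data; in practice these come from the augmented, SMOTE-ENN-balanced dataset.
X = np.random.rand(32, 180, 180, 3)
y = np.random.randint(0, 10, 32)
model.fit(X, y, epochs=1, batch_size=16)
```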
B. Evaluation Metrics Our ultimate objective is to increase F1-Score, Precision, Recall, and Accuracy. We will talk about the results, including the confusion matrix, in this part. Confusion Matrix: A confusion matrix is a table where we can see the performance of a Machine Learning classification model. The confusion matrix is comparatively simple for understanding, although it has confusing terminology. A simple confusion matrix is shown below in Figure 4.
Fig. 4. Confusion Matrix

From the confusion matrix, we can calculate the accuracy, precision, recall, and F1-score values.

a) Accuracy: The ratio of correct predictions to all predictions made is the accuracy rate. This ratio represents the likelihood that an ML model will produce the right result.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (2)

b) Precision: The proportion of true positive predictions to the total predicted positives is known as precision.

Precision = TP / (TP + FP)   (3)

c) Recall: The proportion of true positive predictions to the total actual positives is known as recall.

Recall = TP / (TP + FN)   (4)

d) F1-Score: The F1-score combines a classifier's precision and recall into a single metric by taking their harmonic mean [28].

F1 = 2TP / (2TP + FP + FN)   (5)

The accuracy and value loss of our CNN model are displayed in Figure 5.

Fig. 5. Proposed CNN Model Loss and Accuracy

Figure 6 displays the accuracy and value loss of our Xception model.

Fig. 6. Xception Model Loss and Accuracy

The Xception model, with a 97.44% accuracy rate, has the greatest performance for identifying paddy diseases among these five algorithms. Figure 7 compares the performance of CNN, VGG16, VGG19, Xception, and ResNet50 in terms of accuracy, precision, recall, and F1-Score.
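The per-class and overall scores reported in the next subsection can be computed directly with the scikit-learn package listed in the experimental setup; a minimal sketch is given below (macro averaging is our assumption, not stated in the paper).

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred):
    # Print the confusion matrix and the four summary metrics defined above.
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average="macro"))
    print("Recall   :", recall_score(y_true, y_pred, average="macro"))
    print("F1-score :", f1_score(y_true, y_pred, average="macro"))

# report(y_test, model.predict(X_test).argmax(axis=1))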
C. Result Analysis
Fig. 7. Comparison Histogram of the Five Models
The balanced dataset was employed with the proposed system, which used CNN, VGG16, VGG19, Xception, and ResNet50. Each model is assessed for accuracy, precision, recall, and F1-Score. The achieved accuracies for CNN, VGG16, VGG19, Xception, and ResNet50 are 94.82%, 95.60%, 96.14%, 97.59%, and 91.69%, respectively. Table III below compares the models on accuracy, precision, recall, and F1-score.

TABLE III. PERFORMANCE COMPARISON FOR FIVE MODELS
Algorithm   Accuracy (%)   Precision (%)   Recall (%)   F1-Score (%)
CNN         94.82          91              91           91
VGG16       95.60          95              94           94
VGG19       96.14          94              92           92
Xception    97.59          97              96           97
ResNet50    91.69          91              88           89
V. DISCUSSION
In our research, we found that the Xception model achieves remarkable results in detecting paddy diseases when combined with the SMOTE-ENN and image sharpening techniques. A short comparison with some existing methods, in terms of accuracy, is given in Table IV. In that table, the existing models of [12], [14], [15], [16] and [17] achieved accuracies in the lower range of 89.40% - 92.00%, while the highest accuracy among existing models, found in [11] and [13], is 93.33%. In our proposed system, the CNN, VGG16, VGG19, Xception, and ResNet50 algorithms achieved 94.82%, 95.60%, 96.14%, 97.59%, and 91.69%, respectively, after applying the image sharpening, SMOTE-ENN, and augmentation techniques. The Xception model performs better than the CNN model, with a precision of 97%, a recall of 96%, and an F1-Score of 97%.
TABLE IV. COMPARISON TABLE BETWEEN EXISTING AND PROPOSED MODELS
Reference        Method        Accuracy (%)
[11]             KNN / ANN     93.33
[12]             DenseNet121   90.91
[13]             SVM           93.33
[14]             IoT / AI      89.40
[15]             VGGNet        92.00
[16]             CNN           90.90
[17]             ANN / KNN     90.00
Proposed Model   CNN           94.80
Proposed Model   XCEPTION      98.00
VI. CONCLUSION AND FUTURE WORK
In this project, we combine image processing and deep learning methods to categorize paddy diseases. To reduce noise in the images, we used image sharpening and Gaussian blur. With the SMOTE-ENN resampling technique, we balanced our dataset, expanded the training set, and applied deep learning algorithms (i.e., CNN, VGG16, VGG19, Xception, and ResNet50). The accuracy, precision, recall, and F1-Score of the Xception model were 98%, 97%, 96%, and 97%, respectively. Our goal is to turn the proposed model into a useful resource for agricultural businesses. In our upcoming work, we hope to extend our model with segmentation and create a practical tool for spotting diseases in rice leaves and roots, which are crucial to our economy.
REFERENCES
[1] R. A. D. Pugoy and V. Y. Mariano, "Automated rice leaf disease detection using color image analysis," Third International Conference on Digital Image Processing, vol. 8009, no. 1, pp. 93-99, 2011.
[2] G. M. Choudhary and V. Gulati, "Advance in image processing for detection of plant diseases," International Journal of Advanced Research in Computer Science and Software Engineering, no. 2, pp. 1090-1093, 2015.
[3] D. Pak, P. Y. Ming, L. Vincent and J. B. Martin, "Management of rice blast (Pyricularia oryzae): implications of alternative hosts," European Journal of Plant Pathology, vol. 161, no. 3, pp. 343-355, 2021.
[9] A. Jain, S. Sarsaiya, Y. L. Qin Wu and j. Shi, "A review of plant leaf fungal diseases and its environment speciation," Bioengineered, vol. 10, no. 1, pp. 409-424, 2019. [10] F. Yu, S. Feng, W. Du and D. Wang, "A study of nitrogen deficiency inversion in rice leaves based on the hyperspectral reflectance differential," Frontiers in Plant Science, vol. 11, p. 573272, 2020. [11] C. R. Rahman and . S. A. Preetom, "Identification and recognition of rice diseases and pests using convolutional neural networks," Biosystems Engineering, vol. 194, pp. 112-120, 2020. [12] A. Islam, R. Islam and S. R. Haque, "Rice Leaf Disease Recognition using Local Threshold Based Segmentation and Deep CNN," I.J. Intelligent Systems and Applications, vol. 5, pp. 35-45, 2021. [13] H. B. Prajapati, J. P. Shah and V. K. Dabhi, "Detection and Classification of Rice Plant Diseases," Intelligent Decision Technologies, vol. 11, pp. 357-373, 2017. [14] W.-L. Chen, Y.-B. Lin and F.-L. Ng, "RiceTalk: Rice blast detection using Internet of Things and artificial intelligence technologies," IEEE Internet of Things Journal, vol. 7, no. 2, pp. 1001-1010, 2019. [15] J. Chen, J. Chen and D. Zhang, "Using deep transfer learning for image-based plant disease identification," Computers and Electronics in Agriculture, vol. 173, p. 105393, 2020. [16] E. L. Mique Jr and T. D. Palaoag, "Rice pest and disease detection using convolutional neural network," In Proceedings of the 2018 international conference on information science and system, pp. 147151, 2018. [17] S. Ramesh and D. Vydeki, "Application of machine learning in detection of blast disease in South Indian rice crops," J. Phytol, vol. 11, no. 1, pp. 31-37, 2019. [18] "OpenCV: Smoothing Images," Docs.opencv.org, 2022. [Online]. Available: https://docs.opencv.org/4.x/d4/d13/tutorial_py_filtering.html. [Accessed 26 JUN 2022]. [19] "OpenCV addWeighted | How does addWeighted Function Work | Example," EDUCBA, 2022. [Online]. Available: https://www.educba.com/opencv-addweighted/. [Accessed 18 JUN 2022]. [20] G. E. Batista, . R. C. Prati and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD explorations newsletter, vol. 6, no. 1, pp. 20-19, 2004. [21] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man, and Cybernetics, vol. 3, pp. 408-421, 1972. [22] "Papers With Code - Data Augmentation," Paperswithcode.Com, 2022. [Online]. Available: https://paperswithcode.com/task/dataaugmentation. [Accessed 18 JUN 2022]. [23] A. A. Maashri, M. Debole and M. Cotter, "Accelerating neuromorphic vision algorithms for recognition," In Proceedings of the 49th annual design automation conference, pp. 579-584, 2012.
[4] Simkhada, K. and R. Thapa, "Rice Blast, A Major Threat to the Rice Production and its Various Management Techniques," Turkish Journal of Agriculture-Food Science and Technology, vol. 10, no. 2, pp. 147-157, 2022.
[24] G. Boesch, "VGG Very Deep Convolutional Networks (VGGNet) What you need to know," viso.ai, 2022. [Online]. Available: https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/. [Accessed 12 JUN 2022].
[5] K. M. N. Abd, A. Wayayok, A. F. Abdullah and A. R. M. Shariff, "Effect of variable rate application on rice leaves burn and chlorosis in system of rice intensification," Malaysian Journal of Sustainable Agriculture, vol. 4, no. 2, pp. 66-70, 2020.
[25] Z. Akhtar, "Xception: Deep Learning With Depth-Wise Separable Convolutions," Opengenus IQ: Computing Expertise & Legacy, 2022. [Online]. Available: https://iq.opengenus.org/xception-model/. [Accessed 20 JUN 2022].
[6] N. A. Slaton, R. Cartwright and C. Wilson Jr, "Potassium deficiency and plant diseases observed in rice fields," Better Crops, vol. 79, no. 4, pp. 12-14, 1995.
[26] K. Srinivasan, L. Garg and . D. Datta, "Performance comparison of deep cnn models for detecting driver’s distraction," CMC-Computers, Materials & Continua, vol. 68, no. 3, pp. 4109-4124, 2021.
[7] "Phosphorus Deficiency - IRRI Rice Knowledge Bank," Knowledgebank.Irri.Org, [Online]. Available: http://www.knowledgebank.irri.org/training/fact-sheets/nutrientmanagement/deficiencies-and-toxicities-factsheet/item/phosphorous-deficiency. [Accessed 2022].
[27] G. Boesch, "Deep Residual Networks (Resnet, Resnet50) - Guide In 2022," Viso.Ai, 2022. [Online]. Available: https://viso.ai/deeplearning/resnet-residual-neural-network. [Accessed 22 JUN 2002].
[8] D. Groth and C. Hollier, "Leaf Smut Of Rice," Lsuagcenter.Com, [Online]. Available: https://www.lsuagcenter.com/NR/rdonlyres/41C7A6F2-E14A-4713BFE1-724F8BE27509/75781/pub3115LeafSmutLOWRES.pdf. [Accessed 2022].
[28] "What is the F1-score?," Educative: Interactive Courses for Software Developers, 2022. [Online]. Available: https://www.educative.io/answers/what-is-the-f1-score. [Accessed 18 JUN 2022].
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Estimating Aerodynamic Data via Supervised Learning
AZIZUL HAQUE, Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh, [email protected]
TANZIM HOSSAIN, Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh, [email protected]
MOHAMMAD N. MURSHED, Department of Mathematics and Physics, North South University, Dhaka, Bangladesh, [email protected]
KIFE INTASAR BIN IQBAL, Department of Mathematics, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, [email protected]
M. MONIR UDDIN, Department of Mathematics & Physics, North South University, Dhaka, Bangladesh, [email protected]
Abstract—Supervised learning extracts a relationship between the input and the output from a training dataset. We consider four models – Support Vector Machine, Random Forest, Gradient Boost, and K-Nearest Neighbor – and employ them on data pertaining to airfoils in two different cases. First, given data about several different airfoil configurations, our objective is to predict the aerodynamic coefficients of a new airfoil at different angles of attack. Second, we seek to investigate how the coefficients can be estimated for a specific airfoil if the Reynolds number dramatically changes. It is our finding that the Random Forest and the Gradient Boost show promising performance in both the scenarios. Index Terms—Supervised Learning, Support Vector Machine, Random Forest, Gradient Boosting, K-Nearest Neighbor, Aerodynamic Coefficients, Airfoil.
I. INTRODUCTION
Machine learning (ML) has been in use to learn patterns from aerodynamic data for quite some time. Hai Chen et al. [1] developed a graphical prediction method for multiple aerodynamic coefficients of airfoils based on a convolutional neural network (CNN). Recently, machine learning-based algorithms have been applied to the estimation of the aerodynamic coefficients of a non-slender delta wing under ground effect using artificial intelligence techniques. Sergen Tumse et al. [2] present machine learning techniques to estimate the aerodynamic coefficients of a 40° swept delta wing under the ground effect. For this purpose, three different approaches including a feed-forward neural network (FNN), an Elman neural network (ENN), and an adaptive neuro-fuzzy inference system
(ANFIS) have been used. The optimal configurations of these models were compared with each other, and the most accurate prediction model was determined. [3] analyzed Random Forest (RF) and Extreme Gradient Boosting (XGBoost) for the prediction of wing aerodynamic coefficients. [4] focuses mostly on the use of neural network models to predict the drag acting on a vehicle in platoon configuration. To date, the use of more convenient supervised learning models is poorly understood. In this project, we aim to analyze the applicability of four accessible supervised learning methods, namely, Support Vector Machine, Random Forest, Gradient Boost, and K-Nearest Neighbor, to make predictions for problems in the field of aerodynamics. In particular, we look at two cases:
• In wing design, aerodynamicists often spend a huge amount of time computing the lift and drag coefficients of a test airfoil since, in principle, it requires them to perform mathematical integration over the surfaces. ML can be used to learn how a set of airfoil shapes at some given flow condition relate to their corresponding outputs, such as the lift and drag coefficients, and can then predict the output for a new airfoil.
• When a plane is in cruise, it can go through turbulence, where the flow condition dramatically changes. Such an effect is well understood in terms of what is called the Reynolds number. Any change in the Reynolds number would have an impact on the lift and drag coefficient
of the wing. In this scenario, ML captures the relationship between the input (Reynolds number) and the output (lift and drag coefficients), so as to predict the new output in case the input suddenly changes in the middle of the flight. This will definitely help in the control of the aircraft. The rest of the paper is organized as follows. In Section II, the theory of the supervised machine learning models is discussed. The methodology is provided in Section III. Section IV shows the numerical results and the paper is summarized in Section V.

II. BACKGROUND
The aim of this section is to introduce the technical ideas of four supervised learning techniques, namely, Support Vector Machine, Random Forest, Gradient Boost, and K-Nearest Neighbor.

A. Support Vector Machine
Support Vector Machine (SVM) [5], [6] is a regression algorithm that supports both linear and non-linear regression. The task of SVM is to approximate the best value within a predetermined margin known as the ϵ-Insensitive Tube. The tube has a width of ϵ, measured vertically along the axis, and any point in the dataset that falls inside the tube is completely disregarded for error. The error to be minimized is measured as the distance between the point and the tube itself,

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*),   (1)
where the constant C is the box constraint, a positive numeric value that controls the penalty imposed on observations that lie outside the ϵ margin and helps to prevent overfitting. ξ_i and ξ_i^* are slack variables which allow regression errors to exist up to the values of ξ_i and ξ_i^*. Points outside the tube are called the support vectors, as they support the structure or formation of the tube. The kernel computes the dot product of two vectors x and y in some (very high dimensional) feature space,

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(X_i^T \cdot X_j)   (2)

s.t.  0 \le \alpha_i \le C \ \forall i   and   \sum_{i=1}^{n} \alpha_i y_i = 0.   (3)

The kernel used in this paper is the Radial Basis Function, k(X_i, X_j) = (X_i \cdot X_j + 1)^d, where d is the degree of the polynomial.

B. Random Forest
Random Forest [7], [8] is a supervised learning algorithm that uses the ensemble learning method for regression. During training, a Random Forest builds several decision trees and outputs the mean of the classes as the prediction of all the trees. At first, it picks i random data points from the training set and then builds a decision tree associated with these i data points. J trees are chosen to build, repeating the previous steps. For a new data point, each one of the J trees is made to predict the value of the target for that data point, and the new data point is assigned the average across all of the predicted target values. For each decision tree, the algorithm calculates the importance of a node using the Gini importance, assuming only two child nodes (binary tree):

ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)},   (4)

where ni_j is the importance of node j, w_j is the weighted number of samples reaching node j, C_j is the impurity value of node j, left(j) is the child node from the left split on node j and right(j) is the child node from the right split on node j. The importance for each feature on a decision tree is then calculated as:

fi_i = \frac{\sum_{j:\,\text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k}   (5)

where fi_i is the importance of feature i and ni_j is the importance of node j. These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature importance values. The final feature importance, at the Random Forest level, is its average over all the trees.

C. Gradient Boost
Gradient Boost (GB) [9], [10] is a machine learning algorithm that works on the ensemble technique called 'Boosting'. The idea behind Boosting is to train weak learners sequentially, each trying to correct its predecessor, that is, the algorithm is always going to learn something which is not completely accurate but a small step in the right direction. As the algorithm moves forward by sequentially correcting the previous errors, it improves the prediction ability. In gradient boosting, a base model is built to predict the observations in the training dataset. Mathematically, it can be written as:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),   (6)

where γ is the predicted value, L is the loss function, and argmin means we have to find a predicted value γ for which the loss function is minimum. For continuous target values, the loss function will be:

L = \frac{1}{n} \sum_{i=0}^{n} (y_i - \gamma_i)^2,   (7)
where y_i is the observed value and γ is the predicted value. The pseudo-residual is computed from the derivative of the loss function with respect to the predicted value,

\frac{dL}{d\gamma} = -\sum_{i=0}^{n} (y_i - \gamma_i) = -(\text{Observed} - \text{Predicted}).   (8)
The predicted value here is the prediction made by the previous model. In this step, the output values are found for each leaf of the decision tree. There might be a case where one leaf gets more than one residual, hence the final output of all the leaves is required; the average of all the numbers in a leaf gives the output. Next, decision trees are constructed based on the residuals, which are, in turn, used to update the predictions of the previous model.

D. K-Nearest Neighbors
The K-Nearest Neighbor (KNN) [11], [12] algorithm is used for classification and regression. In both uses, the input consists of the k closest training examples in the feature space, while the output depends on the case. In K-Nearest Neighbors regression, the output is the property value for the object. KNN assumes that all the data points are geometrically close to each other, or in other words, neighbouring points should be close to each other. In the classification problem, for a given value of k, the KNN algorithm finds the k nearest neighbours of the unseen data point and then assigns to it the class that has the highest number of data points among all classes of the k neighbours. There are many distance metrics to calculate the nearest/neighbourhood points, such as the Euclidean distance, Manhattan distance, and Minkowski distance. In this paper, we worked with the Minkowski distance, \left(\sum_{i=1}^{n} |x_{1i} - x_{2i}|^p\right)^{1/p}, which is a generalization of both the Euclidean distance and the Manhattan distance. If k is small, then only a few points are considered and the model will likely overfit, and vice versa.

III. METHODOLOGY
In this work, we considered NACA 4-digit airfoils. The first digit represents the highest camber (m) in the percentage of the chord (airfoil length), the second denotes the maximum camber position (p) in tenths of the chord, and the last two digits indicate the maximum thickness (t) of the airfoil in the percentage of the chord. We used Javafoil to generate NACA 4-digit airfoils. To create smooth upper and lower surfaces, each airfoil is discretized at 101 cosine-spaced points (normalized to unit chord length) depending on the shape. The related lift (cl) and drag (cd) coefficients for each airfoil are likewise calculated using the same software for various Reynolds numbers (Re), Mach numbers (Ma), and Angles of Attack (AoA). These coordinate points are then interpolated at N positions on the x-axis, which are dispersed using cosine spacing, where yu,k and yl,k are upper and lower surface points, respectively, for all k ∈ [1, N]. This spacing strategy is typically utilized to capture the leading and trailing edge forms by having denser points around these areas than the center. It is also worth noting that the leading and trailing margins are both fixed at (0, 0) and (1, 0), respectively. These two points, on the other hand, are disregarded in our dataset since they are consistent in every training case. We collect data for 560 airfoils for the NACA 4-digit series based on these design criteria and specified parameters. A Python script is designed to import all files, stack them in the appropriate format, and remove any irrelevant data. The resulting dataset is P × (2Q + 5) in size, with P being the number of row-wise samples, the first Q columns containing y-coordinates of the top surface at fixed x locations, and the following Q columns containing y-coordinates of the bottom surface at the same x locations. The Reynolds number, Mach number, and AoA, as well as cl and cd, are in the last five columns. The dataset contained 171431 data samples in the 4-digit datasets. Following that, the samples in each dataset were scrambled at random to improve the overall model quality and data reliability. Once scrambled, the first 2Q+3 columns comprise the y-coordinates of the upper surface at fixed x locations and the next 2 columns consist of y-coordinates of the lower surface at the same x locations. The last five columns consist of the angle of attack, Reynolds number, Mach number, lift coefficient cl and drag coefficient cd, respectively.

After the dataset is fed into the machine learning tool, we use the Mean Absolute Error (MAE) as the performance metric. MAE is a measure of errors between paired observations expressing the same phenomenon. It is calculated as

MAE = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n},   (9)

where y_i is the prediction, x_i the true value, and n the size of the sample.

IV. NUMERICAL RESULTS
The utility of four different machine learning algorithms, namely, Support Vector Regression, Random Forest, Gradient Boosting, and K-Nearest Neighbor, is tested in two different cases. A. Prediction of aerodynamic coefficients for a test (new) airfoil at different angle of attack at fixed Re and M a We set Re = 100000 and M a = 0.1, and create two dataframes: one for all of the feature variables and another one for the target variable. The feature variables are airfoil coordinates, angle of attack and the target variable dataset contains lift and drag coefficients. To partition the dataset, we split the dataframe into a 90 : 10 ratio for training and testing the models and make sure that these datasets are well balanced. To train with SVM, we choose radial basis function as the kernel, kernel coefficient value = 0.1, and the regularization parameter = 100. In case of KNN, k is taken to be 5, with one weight equaling uniform and the other equaling distance. Default parameters are used for RF and GB.
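The model configurations listed above (RBF-kernel SVM with kernel coefficient 0.1 and C = 100, KNN with k = 5, and default RF/GB) translate directly into scikit-learn, which we assume here for illustration; the choice of distance-based KNN weighting and the random seeds are our own, and X and y stand for the feature and target arrays described in the methodology.

from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def evaluate_models(X, y, seed=0):
    # X: airfoil surface coordinates plus angle of attack, y: cl or cd values.
    models = {
        "SVM": SVR(kernel="rbf", gamma=0.1, C=100),
        "RF": RandomForestRegressor(random_state=seed),      # default parameters
        "GB": GradientBoostingRegressor(random_state=seed),  # default parameters
        "KNN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                              random_state=seed)
    for name, reg in models.items():
        reg.fit(X_tr, y_tr)
        print(name,
              "train MAE:", mean_absolute_error(y_tr, reg.predict(X_tr)),
              "test MAE:", mean_absolute_error(y_te, reg.predict(X_te)))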
We calculate the Mean Absolute Error for all four models for the various angles of attack; see Figures 1, 2, 3, and 4. We use 10-fold cross-validation to validate our results. The test samples are almost as accurate as the training samples for SVM, whereas the error in the test samples is higher than that in the train samples in the case of the other three models. Each plot of MAE vs AoA shows the training and testing errors with respect to the various angles of attack. The main purpose of plotting the training and testing MAE is to check for overfitting or underfitting. The fact that the MAE does not differ much indicates that our models generalize the training data well and neither overfit nor underfit.

Fig. 1: Support Vector Machine. (a) MAE vs Alpha for cl; (b) MAE vs Alpha for cd
Fig. 2: Random Forest. (a) MAE vs Alpha for cl; (b) MAE vs Alpha for cd
Fig. 3: Gradient Boost. (a) MAE vs Alpha for cl; (b) MAE vs Alpha for cd
Fig. 4: K-Nearest Neighbor. (a) MAE vs Alpha for cl; (b) MAE vs Alpha for cd

B. Prediction of aerodynamic coefficients at different Re for a given airfoil
In this case, a dataset is created for a single airfoil that contains different Reynolds numbers, Mach number, AoA, lift coefficient, and drag coefficient. We have created four distinctive models, namely SVM, RF, GB, and KNN, to predict the airfoil coefficients cl and cd with regard to the distinct Reynolds numbers 400000 and 500000. The dataset was separated into two segments, with train data corresponding to Re = 100000 to 300000 and test data corresponding to Re = 400000 to 500000. We computed the error in our predictions on the train and test sets. On both train and test data, RF and GB perform well, as evident in Table I. SVM fails to predict these coefficients, whereas KNN worked moderately well for cl and poorly for cd. We suspect that the large error in SVM and KNN may have resulted due to the small size of the dataset.

TABLE I: Error in prediction
        cl (Re = 400000)   cl (Re = 500000)   cd (Re = 400000)   cd (Re = 500000)
SVM     23.64              20.52              1714.54            50.52
RF      0.03               1.37               0.23               6.33
GB      0.65               2.95               0.57               8.49
KNN     1.94               6.42               38.915             215.21

V. CONCLUSION AND FUTURE WORK
In this paper, our goal was to analyze the use of four accessible supervised learning tools - Support Vector Machine, Random Forest, Gradient Boost, and K-Nearest Neighbor. We demonstrate them on airfoil data in two different cases. In the first one, models are trained to predict the aerodynamic coefficients for a new airfoil configuration at a fixed Reynolds number and Mach number. All of them performed well in the sense that the errors in the test samples are of the same order as those in the train samples. In the second case, given an airfoil, the models can roughly predict the aerodynamic data if there is a variation in the Reynolds number; RF and GB are found to perform better than SVM and KNN. In the future, we plan to explore the application of neural networks in these two practically important cases.

ACKNOWLEDGMENT
This research was funded by the NSU Conference & Travel Grant Committee under the project ID No.: CTRG21/SEPS/15.
R EFERENCES [1] H. Chen, L. He, W. Qian, and S. Wang, “Multiple aerodynamic coefficient prediction of airfoils using a convolutional neural network,” Symmetry, vol. 12, no. 4, p. 544, 2020. [2] S. Tumse, M. Bilgili, and B. Sahin, “Estimation of aerodynamic coefficients of a non-slender delta wing under ground effect using artificial intelligence techniques,” Neural Computing and Applications, pp. 1–22, 2022. [3] X. Yan and Y. Ma, “Airfoil aerodynamic coefficient prediction based on ensemble learning,” Forest Chemicals Review, pp. 1110–1120, 2022. [4] F. Jaffar, T. Farid, M. Sajid, Y. Ayaz, and M. J. Khan, “Prediction of drag force on vehicles in a platoon configuration using machine learning,” IEEE Access, vol. 8, pp. 201 823–201 834, 2020. [5] C.-H. Wu, J.-M. Ho, and D.-T. Lee, “Travel-time prediction with support vector regression,” IEEE transactions on intelligent transportation systems, vol. 5, no. 4, pp. 276–281, 2004. [6] M. S. Ahmad, S. M. Adnan, S. Zaidi, and P. Bhargava, “A novel support vector regression (svr) model for the prediction of splice strength of the unconfined beam specimens,” Construction and building materials, vol. 248, p. 118475, 2020. [7] T. F. Cootes, M. C. Ionita, C. Lindner, and P. Sauer, “Robust and accurate shape model fitting using random forest regression voting,” in European conference on computer vision. Springer, 2012, pp. 278–291. [8] X. Zhou, X. Zhu, Z. Dong, W. Guo et al., “Estimation of biomass in wheat using random forest regression algorithm and remote sensing data,” The Crop Journal, vol. 4, no. 3, pp. 212–219, 2016. [9] A. Keprate and R. C. Ratnayake, “Using gradient boosting regressor to predict stress intensity factor of a crack propagating in small bore piping,” in 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). IEEE, 2017, pp. 1331–1336. [10] N. Bagalkot, A. Keprate, and R. Orderløkken, “Combining computational fluid dynamics and gradient boosting regressor for predicting force distribution on horizontal axis wind turbine,” Vibration, vol. 4, no. 1, pp. 248–262, 2021. [11] Y. Li, B. Fang, L. Guo, and Y. Chen, “Network anomaly detection based on tcm-knn algorithm,” in Proceedings of the 2nd ACM symposium on Information, computer and communications security, 2007, pp. 13–19. [12] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “knn algorithm with data-driven k value,” in International Conference on Advanced Data Mining and Applications. Springer, 2014, pp. 499–512.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
A 2D Convolution Neural Network Based Method for Human Emotion Classification from Speech Signal
Rakhi Rani Paul1, Subrata Kumer Paul2, Md. Ekramul Hamid3
1,2Dept. of Computer Science and Engineering, Bangladesh Army University of Engineering & Technology (BAUET), Qadirabad, Dayarampur, Natore-6431, Bangladesh
3Department of Computer Science and Engineering, University of Rajshahi, Rajshahi-6205, Bangladesh
Email: [email protected], [email protected], [email protected]
Abstract—Recognizing emotions from speech signals is one of the active research fields in the area of human information processing as well as man-machine interaction. Different persons have different emotions and altogether different ways of expressing them. In this paper, a 2D Convolutional Neural Network (CNN) based method is presented for human emotion classification. We consider the RAVDESS and SAVEE datasets to evaluate the performance of the model. Initially, Mel-frequency cepstral coefficient (MFCC) features are extracted from the speech signals and used for training. Here, we consider only forty (40) cepstrum coefficients per frame. The proposed 2D CNN model is trained to classify seven different emotional states (neutral, happy, sad, angry, scared, disgusted, surprised). We achieve 89.86% overall accuracy with our proposed model for the RAVDESS dataset and 83.57% for the SAVEE dataset. It is found that the happy class is classified with an accuracy of 96% for the RAVDESS dataset and 92% for the SAVEE dataset. Lastly, the results of our proposed model are compared with other recent existing works, and the proposed model achieves better accuracy than those models. This work has many real-life applications such as man-machine interaction, auto supervision, auxiliary lie detection, the discovery of dissatisfaction with the client's mode, detection of neurologically disordered patients, and so on.
Keywords—Convolutional Neural Network, emotion classification, Mel-frequency cepstral coefficients.
I. INTRODUCTION
Speech is a common form of human communication and interaction. It conveys much information, such as acoustic-phonetic symbols, prosody, gender, age, accent, emotions and health conditions. Speech signals are generated by vibrations of air pressure: breathed-in air is pushed from the lungs through the vibrating vocal cords and the vocal tract, and out through the lips and nose airways [1]. The vibration of the vocal cords is responsible for the generation of speech sound. Emotions are feelings that have both physiological and cognitive elements that influence behavior. There are different types of emotional states in real-world situations, such as sad, angry, scared, neutral, disgusted, and surprised. There are different types of applications, such as man-machine interaction systems, control of safety systems [2], improved teaching quality, proper treatment of neurological disorder patients, auto-detection of the psychological state of criminal suspects and lie detection,
and so on. Speech Emotion Recognition (SER) is a recent and very interesting field in Artificial Intelligence (AI). The paper is organized as follows: Section II discusses the literature review, and Section III describes the feature extraction method. Sections IV and V discuss the speech emotion estimation method and the experimental results and discussion, respectively. Lastly, the concluding remarks are included in Section VI.

II. LITERATURE REVIEW
Much research has been performed to identify emotions from speech statistics over the last years. Ghai et al. [7] sampled the sound signals at 16000 Hz, with a duration of 0.25 seconds for each frame. Support Vector Machine (SVM), Gradient Boosting, and Random Decision Forest were used as machine learning models on the EMO-DB dataset, achieving 55.89% for SVM, 65.23% for Gradient Boosting and 81.01% for Random Decision Forest. Schuller et al. [8] examine a new spectral feature to determine emotions and to characterize groups. Emotions are grouped based on acoustic features and a novel hierarchical classifier. Various classifiers, such as GMM and HMM, are evaluated with totally different configurations and input features to produce novel hierarchical techniques for emotion classification. The proposed method contributes two things: the first is the selection of the best-performing features; the second is the employment of the best class-wise classification performance of the total features with the same classifier. The hierarchical approach performs better than the standard classifier with decoupled cross-validation on the Berlin dataset; the standard HMM method achieved 68.57% and the hierarchical model achieved 71.75%. Chen et al. [9] aimed to enhance speaker-independent speech emotion recognition with a three-level speech emotion recognition technique. Principal Component Analysis (PCA) and an Artificial Neural Network (ANN) were used to classify. In that paper, four comparative experiments were discussed: Fisher + SVM, PCA + SVM, Fisher + ANN and PCA + ANN. The results indicate that for dimension reduction Fisher is better than PCA for classification. These experiments achieved 50.17%, 43.15%,
40.43% and 39.17%, respectively, on the Beihang University database of emotional speech (BHUDES). S. N. Zisad et al. [12] aimed to develop a system that can recognize emotions from the speech of a neurologically disordered person. They used a convolutional neural network (CNN) to develop their system and achieved 87.5% accuracy in classifying emotions. In this study, we use a 2D CNN classification method to classify different human emotional states rather than a conventional CNN classification method. The proposed model can achieve higher classification accuracy and efficiency than traditional methods because the 2D CNN model can be fine-tuned with a large database, achieving higher accuracy and robustness. In a 2D CNN, the kernel moves in two directions, whereas in other CNN models it moves in one direction.
III. FEATURE EXTRACTION
The speech features are extracted from each audio file of each dataset using the Mel-frequency Cepstrum Coefficients (MFCCs) method. Fig. 1 presents a block diagram of the computational steps of MFCC, which are discussed in detail in this section. MFCC applies certain steps to an input speech signal: pre-processing, framing, windowing, discrete Fourier Transformation (DFT), Mel filter bank, logarithm and finally computing the inverse of the DFT (DCT) [4]. Table I records the MFCC parameters and defined values.

Fig. 1. Different stages in MFCC feature extraction method

TABLE I. MFCC PARAMETERS DEFINITION

IV. EMOTION CLASSIFICATION METHOD
A. Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) are one of the most widespread deep learning models. Usually, CNNs have three essential building blocks: the Convolutional Layer (CL), the Pooling Layer (PL), and the Fully Connected Layer (FCL). Fig. 2 shows the internal architecture of a CNN. It is designed in a series of stages: convolutional layers and pooling layers combine to make up the initial few phases of the model architecture, and a fully connected layer comprises the model's last phase [5]. The CLs contain several filters that convolve the input through a set of weights from the previous layer, composing a feature output known as a feature map. Neurons are connected directly to the input data points, multiplying the data by the weights within each filter. Within the same filter, all neurons share their weights, which reduces the time and complexity of the CNN.

Fig. 2. An architecture of Convolutional Neural Networks (CNN)
A. Datasets Description and Data Pre-processing
In this experiment, we take into consideration two datasets, known respectively as the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The RAVDESS dataset has eight emotional states with 1440 files (60 trials per actor x 24 actors = 1440) [6]. SAVEE is a well-cited public British English speech dataset that contains 420 data files [7]. In this experiment, we only considered seven emotional states (neutral, happy, sad, angry, disgusted, scared, and surprised). Pre-processing is a crucial step in data preparation for model correctness and productivity. In this stage, the audio signals are cleaned to eliminate background sounds, silent periods, and other unimportant information from the speech signals. After collecting the data, we very carefully examined the duration of the audio files. We found that most of the audio files of the RAVDESS dataset are three seconds long; some audio files have a four-second duration. These files were examined carefully, and we found that the last second of these files does not carry useful information and mostly appears to be silent, so we ignore the last second. The SAVEE dataset has audio files of two-, three- and four-second duration. We carefully examined that, for the four-second audio files, the last one or two seconds carry useful information, so we cannot ignore them. Zero padding is used to equalize the duration with the other audio files; after padding, the duration of these files is four seconds. The audio files in the RAVDESS and SAVEE audio speech datasets are sampled at 48 kHz and 44.1 kHz respectively. All files are used with a sampling rate of 16 kHz.
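The feature-extraction and resampling steps described above can be sketched with the librosa package, which is an assumption: the paper names the MFCC method, the 40 coefficients and the 16 kHz rate, but not a specific library, and the file name in the usage comment is hypothetical.

import numpy as np
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40, max_seconds=4.0):
    # Load the file at 16 kHz, zero-pad (or trim) it to a fixed duration,
    # and return an (n_mfcc, frames) MFCC matrix.
    signal, _ = librosa.load(path, sr=sr, mono=True)
    target_len = int(sr * max_seconds)
    if len(signal) < target_len:
        signal = np.pad(signal, (0, target_len - len(signal)))  # zero padding
    else:
        signal = signal[:target_len]
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

# features = extract_mfcc("some_savee_clip.wav")   # hypothetical file name
# features.shape -> (40, frames); add a channel axis before feeding the 2D CNN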
B. Proposed 2D CNN Model Diagram
Fig. 3 below presents the work-flow of this experiment. The first step is to prepare the dataset for the model. Using the five-fold cross-validation approach, all the data are divided into training and testing (validation) sets. By dividing the dataset into a training and a test set, cross-validation is performed to assess the classification model: the model is evaluated using the test set after being trained on the training set. The original dataset is randomly divided into five equal subsamples for five-fold cross-validation. Four of the five subsample sets are used for training purposes, and the fifth subsample set is used to evaluate the model. To make sure that each of the five subsamples is used exactly once as a validation set, the cross-validation procedure is repeated five times. The results for each fold are then averaged to obtain a single measurement of the experiments. The main advantage of five-fold cross-validation is that all data samples are used to train and validate the model. The features of the training and testing datasets are extracted using the MFCCs method, and the 2D CNN model is used as the classification model.

In this study, the proposed model shown in Fig. 4 was created using a 2D Convolution Neural Network. This model has four CLs with 16, 32, 64, and 128 filters, respectively, with a 2×2 kernel size for each layer. Each convolution layer uses a Rectified Linear Unit (ReLU) as the activation function, defined as

f(x) = max(0, x).   (1)

Audio data sampled at 16 kHz is given to the model as input. The input shapes for the RAVDESS dataset and the SAVEE dataset are (40, 281, 1) and (40, 375, 1), respectively. Here, 40 denotes the number of MFCC features that were extracted, 281 is the number of frames that were taken into consideration after padding, and 1 denotes that the audio is a mono signal. A max-pooling layer with a pool size of 2×2 follows each convolution layer; the number of parameters is minimized by choosing the highest value from the rectified feature map, reducing the amount of data. ReLU is used in the hidden layers as the activation function, similar to the convolution layers. To prevent overfitting, a dropout layer with a dropout value of 0.2 is also included [8]. One Global Average Pooling layer has been added after the last hidden layer, taking the average, which is appropriate for feeding into the dense output layer. The output layer of this model has seven nodes because there are seven classes; SoftMax is used as its activation function.

Fig. 4. 2D CNN Architecture for the proposed method
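A minimal sketch of the five-fold protocol described above, assuming scikit-learn's KFold splitter (the paper does not name a library for the split):

import numpy as np
from sklearn.model_selection import KFold

def five_fold_indices(n_samples, seed=42):
    # Return (train, validation) index pairs so that each of the five
    # subsamples is used exactly once as the validation set.
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    return list(kfold.split(np.arange(n_samples)))

# folds = five_fold_indices(1440)   # e.g. number of RAVDESS files
# for train_idx, val_idx in folds:
#     train on train_idx, validate on val_idx, then average the fold results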
Fig. 3. Work flow of the proposed method

TABLE II: PROPOSED MODEL PARAMETERS
SN   Contents                    Details
1    First Convolution Layer     16 filters of size 2 × 2, ReLU, input size 40×281×1 for RAVDESS, 40×375×1 for SAVEE dataset
2    First Max Pooling Layer     Pooling size 2 × 2
3    Dropout Layer               Dropout value of 0.2, which randomly deactivates 20% of neurons to avoid overfitting
4    Second Convolution Layer    32 filters of size 2 × 2, ReLU
5    Second Max Pooling Layer    Pooling size 2 × 2
6    Dropout Layer               Dropout value of 0.2, which randomly deactivates 20% of neurons to avoid overfitting
7    Third Convolution Layer     64 filters of size 2 × 2, ReLU
8    Third Max Pooling Layer     Pooling size 2 × 2
9    Dropout Layer               Dropout value of 0.2, which randomly deactivates 20% of neurons to avoid overfitting
10   Fourth Convolution Layer    128 filters of size 2 × 2, ReLU
11   Fourth Max Pooling Layer    Pooling size 2 × 2
12   Dropout Layer               Dropout value of 0.2, which randomly deactivates 20% of neurons to avoid overfitting
13   Output Layer                7 nodes for 7 classes, SoftMax
14   Optimization Function       Adam
15   Callback                    ModelCheckpoint
Adam [10] was applied as a model optimizer. As a loss function, categorical cross entropy has been employed. The model has callbacks for Early Stopping and ModelCheckpoint. If there is no improvement in minimizing
loss value after 5 epochs, Early Stopping will end the training process, whereas ModelCheckpoint will save the best model in local storage. Table II presents the parameters that are invoked for this model.
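Table II maps directly onto a Keras model. The sketch below follows those parameters; the padding choice, the monitored quantity for early stopping, and the checkpoint file name are assumptions not stated in the table.

from tensorflow.keras import callbacks, layers, models

def build_2d_cnn(input_shape=(40, 281, 1), num_classes=7):
    # Four Conv2D blocks (16/32/64/128 filters, 2x2 kernels), each followed by
    # 2x2 max pooling and 20% dropout, then global average pooling and a
    # 7-way softmax output, as listed in Table II.
    model = models.Sequential()
    model.add(layers.Conv2D(16, (2, 2), activation="relu", padding="same",
                            input_shape=input_shape))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.2))
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, (2, 2), activation="relu",
                                padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.2))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_2d_cnn()                     # use input_shape=(40, 375, 1) for SAVEE
stop_early = callbacks.EarlyStopping(monitor="loss", patience=5)
checkpoint = callbacks.ModelCheckpoint("best_model.h5", save_best_only=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           callbacks=[stop_early, checkpoint])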
V. EXPERIMENTAL RESULT
We conduct utterance-based studies on SER using a five-fold cross-validation method. The dataset is divided into three parts: training (80% of the data), validation (10%), and testing (10%). We used the speech features (MFCCs) to train our 2D CNN model and evaluated the model's efficacy and accuracy in making predictions. The model is evaluated in terms of sensitivity, F1-score, precision, and overall accuracy.

The proposed model was applied to the two datasets. Firstly, we consider the RAVDESS dataset for training the model. The confusion matrix of this dataset is shown in Table III.

TABLE III: CONFUSION MATRIX OF 2D CNN MODEL (RAVDESS DATASET)
            Happy   Sad    Angry   Scared   Neutral   Disgusted   Surprised
Happy       0.96    0      0.04    0        0         0           0
Sad         0       0.91   0       0        0.09      0           0
Angry       0       0      0.94    0        0         0.06        0
Scared      0.10    0      0.05    0.80     0.05      0           0
Neutral     0       0.09   0.02    0        0.89      0           0
Disgusted   0.04    0      0       0.06     0         0.90        0
Surprised   0       0      0.06    0.02     0         0.03        0.89

From this analysis, the happy, sad, angry, and disgusted classes obtain good prediction accuracy and the scared, surprised, and neutral classes slightly less, but the overall accuracy of this model, 89.86%, is good for the RAVDESS dataset. It is noted that the model segregates the speech emotional states with 96% accuracy for happy, 91% for sad, 94% for angry, 80% for scared, 89% for neutral, 90% for disgusted, and 89% for surprised. We evaluate the model in terms of precision, sensitivity, F1-score, and overall accuracy for the RAVDESS dataset, which is tabulated in Table IV.

TABLE IV. PERFORMANCE MEASUREMENT FOR 2D CNN MODEL (RAVDESS DATASET)
Emotional state   Precision   Sensitivity   F1-Score
Happy             0.96        0.87          0.91
Sad               0.91        0.91          0.91
Angry             0.94        0.85          0.89
Scared            0.80        0.91          0.85
Neutral           0.89        0.86          0.88
Disgusted         0.90        0.91          0.90
Surprised         0.89        0.10          0.94

Secondly, we consider the SAVEE dataset for training purposes. Table V presents the confusion matrix of the proposed model on the SAVEE dataset. Happy is identified correctly 92% of the time, while the other states are identified correctly between 80% and 86% of the time. The SAVEE dataset has a good overall prediction accuracy of 83.57%. Our best results are seen for the happy class, with an accuracy of 96% for the RAVDESS dataset and 92% for the SAVEE dataset. We evaluate the model in terms of precision, sensitivity, F1-score, and overall accuracy for this dataset, which is given in Table VI.

TABLE V. CONFUSION MATRIX OF 2D CNN MODEL (SAVEE DATASET)
            Happy   Sad    Angry   Scared   Neutral   Disgusted   Surprised
Happy       0.92    0      0.01    0.03     0         0.02        0.02
Sad         0.05    0.83   0.04    0.05     0.03      0           0
Angry       0.04    0.01   0.84    0.04     0.05      0.02        0
Scared      0.01    0.04   0.03    0.80     0.03      0.02        0.07
Neutral     0.04    0.04   0.02    0.01     0.80      0.04        0.05
Disgusted   0       0      0       0.04     0.05      0.80        0.11
Surprised   0       0.02   0.02    0.04     0.03      0.03        0.86

TABLE VI. PERFORMANCE MEASUREMENT FOR 2D CNN MODEL (SAVEE DATASET)
Emotional state   Precision   Sensitivity   F1-Score
Happy             0.96        0.87          0.91
Sad               0.91        0.91          0.91
Angry             0.94        0.85          0.89
Scared            0.80        0.91          0.85
Neutral           0.89        0.86          0.88
Disgusted         0.90        0.91          0.90
Surprised         0.89        0.10          0.94

TABLE VII. CROSS-VALIDATION RESULT OF RAVDESS DATASET
No. of Folds   Training accuracy (%)   Validation accuracy (%)   Testing accuracy (%)
Fold-1         92.40                   89.33                     90.81
Fold-2         91.34                   91.47                     87.80
Fold-3         94.70                   93.12                     92.01
Fold-4         90.22                   89.31                     87.60
Fold-5         94.53                   92.78                     91.06
Average        92.64                   91.20                     89.86
Best           94.70                   93.12                     92.01

Fig. 5. The comparison of Cross-Validation result of RAVDESS dataset

TABLE VIII. CROSS-VALIDATION RESULT OF SAVEE DATASET
No. of Folds   Training accuracy (%)   Validation accuracy (%)   Testing accuracy (%)
Fold-1         91.50                   85.71                     84.71
Fold-2         93.71                   89.23                     86.32
Fold-3         92.14                   82.01                     81.49
Fold-4         89.31                   84.52                     83.16
Fold-5         89.80                   86.14                     82.18
Average        91.29                   85.52                     83.57
Best           93.71                   89.23                     86.32
According to Table VII, which displays the results of five-fold cross-validation for the RAVDESS dataset, the best testing accuracy was reached in Fold-3 and it was 92.01%. The most accurate values for training and validation are 94.70% and 93.12%, respectively. Average training and validation accuracies were 92.64% and 91.20%, respectively, while the testing accuracy averaged 89.86%. Fig. 5 presents the graphical representation of Table VII.

According to the results of the five-fold cross-validation for the SAVEE dataset in Table VIII, the best testing accuracy is attained in Fold-2 and it is 86.32%. The best training and validation accuracies are 93.71% and 89.23%, respectively. The average training and validation accuracies are 91.29% and 85.52%, respectively, whereas the average testing accuracy is 83.57%. Fig. 6 presents the graphical representation of Table VIII.

Fig. 6. The comparison of Cross-Validation result of SAVEE dataset

Finally, we have also compared the results obtained in this work with other existing works. The detailed performance comparison is shown in Table IX.

TABLE IX. EXISTING PERFORMANCE COMPARISON OF SPEECH EMOTION RECOGNITION ACCURACY WITH THE PROPOSED METHOD
SN   Existing works                                                         Used method                                    Accuracy
1    Emotion recognition on speech signals using machine learning [7]       SVM / Gradient Boosting / Random Forest       55.89% / 65.23% / 81.05%
2    Hidden Markov model-based speech emotion recognition [8]               GMM / HMM                                     86.8% / 77.8%
3    Speech emotion recognition: Features and classification models [9]     Fisher+SVM / PCA+SVM / Fisher+ANN / PCA+ANN   50.17% / 43.15% / 40.43% / 39.17%
4    Towards emotion recognition from speech: Definition, problems and the materials of research [13]   ANN / SVM         49.19% / 76.75%
5    Emotion Recognition by Speech Signals [11]                             HMM / GSVM                                    70.1% / 42.3%
6    Speech Emotion Recognition in Neurological Disorders Using Convolutional Neural Network [12]       CNN               87.5%
-    Our model: A 2D Convolution Neural Network Based Method for Human Emotion Classification from Speech Signal   2D CNN   89.86% for RAVDESS and 83.57% for SAVEE

VI. CONCLUSION
The main focus of this study was human speech emotion recognition using a deep learning algorithm. In this paper, we extracted speech features from the audio files using the Mel-frequency Cepstrum Coefficient (MFCC) technique, considering only 40 cepstrum coefficients per frame. A 2D Convolution Neural Network (CNN) was employed to obtain better classification accuracy; developing a 2D CNN model for classifying seven emotions is the major goal of this work. Finally, we have organized our findings into tabular formats that can be used as a resource for future studies on speech emotion recognition. This in-depth investigation into the extraction of emotions from human speech signals provides detailed information and data for future investigations into other crucial features in this area. The contribution of our work is also very useful in the case of neurologically disordered patients, and it brings the research one step closer to a practical speech emotion recognition system. REFERENCES [1]
Titze, I.R. "The physics of small-amplitude oscillation of the vocal folds", Journal of the Acoustical Society of America. 2019, pp. 1536–1552
[2]
Dong Yu and Li Deng. "Automatic Speech Recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp.55-59
[3]
Monita Chatterjee, Danielle J Zion, Mickael L Deroche, Brooke A Burianek, Charles J Limb, Alison P Goren, Aditya M Kulkarni, and Julie A Christensen. "Voice emotion recognition by cochlear-implanted children and their normallyhearing peers", Journal of the Innovative research, 2015, pp.151–162.
[4]
Paul, S.K., Paul, R.R., Nishimura, M., Hamid, M.E. (2021). Throat Microphone Speech Enhancement Using Machine Learning Technique. In: Favorskaya, M.N., Peng, SL., Simic, M., Alhadidi, B., Pal, S. (eds) Intelligent Computing Paradigm and Cutting-edge Technologies. ICICCT 2020. Learning and Analytics in Intelligent Systems, vol 21. Springer, Cham. https://doi.org/10.1007/978-3-030-65407-8_1
[5]
Louis Ten Bosch, "Emotions speech and the ASR framework", Journal of the Speech Communication system", India 2015, pp. 213–225
[6]
https://zenodo.org/record/1188976
[7]
Ghai, Mohan, et al. "Emotion recognition on speech signals using machine learning." 2017 international conference on big data analytics and computational intelligence (ICBDAC). IEEE, 2017.
[8]
Schuller, Björn, Gerhard Rigoll, and Manfred Lang. "Hidden Markov model-based speech emotion recognition." 2003 IEEE
International Conference on Acoustics, Speech, and Signal Processing. Proceedings. (ICASSP'03).. Vol. 2. IEEE, 2003. [9]
Chen, Lijiang, et al. "Speech emotion recognition: Features and classification models." Digital signal processing 22.6 (2012): 1154-1160.
[10] Anagnostopoulos, Christos-Nikolaos, and Theodoros Iliou.
"Towards emotion recognition from speech: definition, problems and the materials of research." Semantics in Adaptive and Personalized Services. Springer, Berlin, Heidelberg, 2010. 127-143.
[11] Kwon, Oh-Wook, et al. "Emotion recognition by speech
signals." Eighth European Conference Communication and Technology. 2003.
on
Speech
[12] Zisad, Sharif Noor, Mohammad Shahadat Hossain, and Karl
Andersson. "Speech emotion recognition in neurological disorders using Convolutional Neural Network." International Conference on Brain Informatics. Springer, Cham, 2020 [13] Anagnostopoulos, Christos-Nikolaos, and Theodoros Iliou.
"Towards emotion recognition from speech: definition, problems and the materials of research." Semantics in Adaptive and Personalized Services. Springer, Berlin, Heidelberg, 2010. 127-143.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
A Secure Medical Record Sharing Scheme Based on Blockchain and Two-fold Encryption Md. Ahsan Habib Department of Computer Science and Engineering (CSE) Khulna University of Engineering & Technology (KUET) Khulna-9203, Bangladesh [email protected]
Kazi Md. Rokibul Alam Department of Computer Science and Engineering (CSE) Khulna University of Engineering & Technology (KUET) Khulna-9203, Bangladesh [email protected]
Yasuhiko Morimoto Graduate School of Advanced Science and Engineering Hiroshima University Higashi-Hiroshima 739-8521, Japan [email protected]
Abstract–Usually, a medical record (MR) contains the patients’ disease-oriented sensitive information. In addition, the MR needs to be shared among different bodies, e.g., diagnostic centres, hospitals, physicians, etc. Hence, retaining the privacy and integrity of MR is crucial. A blockchain based secure MR sharing system can manage these aspects properly. This paper proposes a blockchain based electronic (e-) MR sharing scheme that (i) considers the medical image and the text as the input, (ii) enriches the data privacy through a two-fold encryption mechanism consisting of an asymmetric cryptosystem and the dynamic DNA encoding, (iii) assures data integrity by storing the encrypted e-MR in the distinct block designated for each user in the blockchain, and (iv) eventually, enables authorized entities to regain the e-MR through decryption. Preliminary evaluations, analyses, comparisons with state-of-the-art works, etc., imply the efficacy of the proposed scheme. Keywords–ElGamal cryptosystem, DNA bases, Data privacy, Data integrity, Blockchain
I. INTRODUCTION
Generally, human beings are susceptible to disease, and to be cured, they need to visit health-service-related bodies. As an example, a US citizen visits 18.7 different physicians on average and keeps 19 separate medical records (MRs) over their lifespan [1]. This figure is presumed to be higher in a developing country like Bangladesh. Also, a patient's MR may exist in diverse domains, and to provide better medical treatment, the MR very often needs to be shared among multiple bodies [2, 3]. However, sharing the MR is challenging due to various constraints, e.g., the data format, unreliable diffusion, interoperability, confidentiality violation, data integrity disruption [5], etc. Preserving and exchanging patients' MRs through the conventional system is problematic and faces numerous hassles. Hence, the electronic (e-) MR is rapidly replacing the paper-based MR [6], where the e-MR is usually stored in the local dedicated server of a healthcare service provider. However, a local server usually experiences numerous difficulties, e.g., diverse malicious security attacks, single-point failure [7], etc., which may lead to forgoing data privacy. For instance, millions of e-MRs have been compromised and healthcare databases have lost nearly $30 billion over the last two decades [2]. In addition, when a patient needs to share its e-MR with a healthcare service provider, it undergoes an inefficient and manual consent process [7]. Alongside, hackers have sold patients' data at up to 20 times higher prices than banking data [2]. This paper develops a blockchain based secure e-MR sharing scheme where a patient has sufficient control over its e-MR, including the ability to know who has accessed it. Here, the major contributions are as follows.
i) Consideration of the medical image along with the textual data as the input.
ii) Enrichment of data privacy via a two-fold encryption mechanism consisting of an asymmetric cryptosystem and dynamic DNA encoding: at first, the plain e-MR is encrypted by an asymmetric cryptosystem, and then dynamic DNA encoding is applied over the cipher e-MR to enrich the degree of confusion, i.e., data privacy.
iii) Assurance of data integrity by storing each cipher e-MR in a distinct block having an index number in the blockchain. While storing, the index number of the corresponding block is returned to the concerned user for further usage.
iv) Enablement of authorized entities to regain the e-MR through decryption while maintaining the patients' anonymity.
The remaining parts of the paper are arranged as follows. Section II studies the related works. Section III describes the required building blocks. Section IV illustrates the proposed secure e-MR sharing scheme. Section V explains the preliminary evaluations, and Section VI concludes the paper.
II. RELATED WORKS
As discussed in the previous section, an e-MR contains patients' confidential information which needs to be shared among multiple bodies. In order to retain data privacy, data integrity, availability, interoperability, reliability, etc., many schemes have already been proposed for e-MR sharing systems, e.g., MediChain [5], HDG [8], MediBchain [9], MeDShare [10], MedBlock [11], DPS [12], SEMRES [13], MedChain [14], EHRChain [15], etc. Usually, these works adopt single-fold encryption to retain data privacy and store the plain e-MR metadata or the encrypted e-MR over the blockchain.
The works proposed in [5], [8], [12], and [16] considered both medical images and textual data as input. The scheme in [12] used a symmetric key cryptosystem, namely the Advanced Encryption Standard (AES), to encrypt the e-MR, while an asymmetric key cryptosystem, namely Elliptic Curve Cryptography (ECC), was used to encrypt the AES key. In contrast, the schemes in [5], [8], and [16] did not explicitly mention their adopted cryptosystem. Alongside, the system in [8] used the blockchain to store the e-MR, whereas the scheme in [16] only stored the data-accessing-related information of the e-MR. The systems in [5] and [12] exploited location-related information in the blockchain.
The systems in [9] and [17] considered only textual data as input, where the approach in [9] employed ECC to encrypt the e-MR. To secure the e-MR and the AES key, the scheme in [17] used AES and a threshold cryptosystem, respectively. The system in [9] used a permissioned blockchain to store the e-MR, whereas the work in [17] kept the metadata of the e-MR in the blockchain. A different work proposed in [18]
took into account only the medical image, which was stored on the local server. It employed an asymmetric-key cryptosystem to secure the URL of the individual file, which was stored on the blockchain server. The system developed in [19] utilized a private and a consortium blockchain to preserve the e-MR and its index, respectively. Here, to encrypt the e-MR, it also used a public key cryptosystem. Some other works proposed in [6], [7], and [13] also used the blockchain and employed a symmetric cryptosystem to secure the e-MR data. Here, the schemes in [7] and [13] used an asymmetric cryptosystem to encrypt the symmetric key. The work in [6] stored the e-MR in the blockchain and, to manage each block, it replaced the traditional Merkle tree with an improved convolutional one to reduce the tree layers, nodes, hash calculations, etc. The scheme in [7] kept the e-MR and its metadata in a distributed file system and the blockchain server, respectively. In contrast, the system in [13] used the blockchain to store the hash value of the e-MR. However, none of these works specified their exact input.
The above analyses infer that, to secure the e-MR data, most state-of-the-art works rely on a single-fold symmetric or asymmetric cryptosystem. A symmetric system like AES is vulnerable to the side-channel attack, the co-relation scan attack [20], etc. Likewise, known-plaintext, chosen-plaintext, ciphertext-only, etc., attacks can be mounted on an asymmetric cryptosystem [21]. Hence, to enrich data privacy, this paper adopts a two-fold encryption technique capable of safeguarding the sensitive e-MR data. In addition, rather than solely the textual data, it treats both the visual and the textual data as input. Moreover, it deploys blockchain to assure data integrity and regains the plain e-MR data while retaining the data owner's anonymity. The preliminary assessment shows that the proposed secure e-MR sharing scheme is proficient enough to retain substantial data privacy, data integrity, availability, authenticity, etc.
III. PRELIMINARIES
The major building blocks required to develop the proposed scheme are described below. In addition, it adopts an anonymous authentication mechanism, i.e., it deploys the anonymous credential proposed in [4] for each patient. Thus, while communicating with other entities, the patient can reveal an anonymous identity instead of the real identity.
A. Blockchain Storage
Blockchain is a decentralized, distributed, and tamper-proof data storage that confirms data integrity [5]. Nowadays it is widely used in different applications, e.g., healthcare, financial services, public administration, supply chain management [24], etc. The data stored in the blocks of the blockchain are linearly and cryptographically linked together to form a chain. Each block contains a cryptographic hash pointer to the previous block, a timestamp, and transaction data. The hash pointer linked with the previous block gives the blockchain its immutability property. Here, new blocks are added only when a majority of the nodes consent by validating all transaction data. Since new blocks are appended continuously, the chain keeps growing. Fig. 1 depicts its general architecture. Each block in the blockchain consists of two parts, i.e., a block header and a block body. The block header contains the metadata that typically includes the block ID, the hash of the previous block, the number of transactions, a nonce, the Merkle root, a timestamp, etc. The block body holds the transaction data. As shown in Fig. 1, the Merkle tree, a binary hash tree, is used to create the Merkle root that is stored in the block header. Here, Di represents transaction data and Hi denotes the cryptographic hash of the transaction Di. SHA-256 is a prominent hash function [23] used in the blockchain domain.
Fig. 1. The general architecture of blockchain.
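For readers who prefer running code, the following minimal Python sketch mirrors the block layout described above: SHA-256 hashes of the transactions are folded into a Merkle root, and each block header carries the hash of the previous block so that tampering with any stored datum breaks the chain. The field names, the JSON serialisation of the header, and the duplicated-last-leaf rule for odd Merkle levels are illustrative assumptions, not details taken from the proposed scheme.

import hashlib
import json
import time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(transactions):
    # Hash every transaction, then pairwise-hash upward until one root remains.
    level = [sha256(tx.encode()) for tx in transactions]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last hash on odd levels (assumption)
            level.append(level[-1])
        level = [sha256((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

class Block:
    def __init__(self, block_id, prev_hash, transactions, nonce=0):
        # Header fields follow Fig. 1: block ID, previous hash, number of
        # transactions, nonce, Merkle root, and timestamp.
        self.header = {
            "block_id": block_id,
            "prev_hash": prev_hash,
            "num_tx": len(transactions),
            "nonce": nonce,
            "merkle_root": merkle_root(transactions),
            "timestamp": time.time(),
        }
        self.body = transactions            # e.g., cipher e-MR chunks

    def hash(self):
        return sha256(json.dumps(self.header, sort_keys=True).encode())

# Chain two blocks: altering D1..D4 changes the Merkle root and breaks the link.
genesis = Block(0, "0" * 64, ["D1", "D2", "D3", "D4"])
block_1 = Block(1, genesis.hash(), ["cipher e-MR of patient P1"])
print(block_1.header["prev_hash"] == genesis.hash())   # True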
B. Dynamic DNA Encoding
The goal of dynamic DNA encoding is to increase the confusion level of the ciphertext. For this purpose, each pair of successive cipher chunks is joined together. Different from [22], the DNA bases merge any two consecutive cipher chunks dynamically through the following equations:

R = log_b(x)          (1)
S = (R × N) mod Q     (2)

Here, b and x specify the base and the random integer, respectively, whereas the values of N and Q are 10000 and 100, respectively. The S dummy DNA bases are added in between every two consecutive chunks. The value of x is incremented by 1 when calculating the value of R for the next two consecutive chunks. These DNA bases are picked from either the first chunk or the second chunk of the two consecutive chunks, determined via the following equation:

w = (-1)^S            (3)
If w is positive, then the S dummy bases will be picked from the first chunk. Otherwise, the S dummy DNA bases will be picked from the second chunk.
C. Formation of Two-fold Encryption
The proposed two-fold encryption mechanism comprises the ElGamal cryptosystem [25] and the dynamic DNA encoding. Firstly, the plain data is encrypted through an asymmetric cryptosystem, e.g., ElGamal. Here, ElGamal is chosen because its key generation is based on the discrete logarithm problem, which is difficult to solve. Besides, the ciphertext produced by the ElGamal encryption function is non-deterministic, and it performs better during decryption [22]. Then, dummy DNA bases are picked and placed within the chunks of encrypted data by using the dynamic DNA encoding mechanism to enrich the data privacy. Fig. 2 illustrates the formation of this encryption process.
Fig. 2. The two-fold encryption process.
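The bookkeeping defined by Eqs. (1)-(3) can be sketched as follows. The snippet only decides how many dummy DNA bases S to insert between one pair of consecutive chunks and which of the two chunks they are drawn from; truncating S to an integer before Eq. (3), sampling the dummy bases uniformly from the chosen chunk, and the function names are assumptions the paper leaves implicit. As noted in Section V, the evaluation uses b = 3 and a starting value of x = 10, with x incremented by 1 for each subsequent pair of chunks.

import math
import random

def dummy_base_plan(x: int, b: int = 3, N: int = 10000, Q: int = 100):
    """Return (S, source) for one pair of consecutive chunks, following Eqs. (1)-(3)."""
    R = math.log(x, b)            # Eq. (1): R = log_b(x)
    S = int(R * N) % Q            # Eq. (2): S = (R x N) mod Q; integer truncation is assumed
    w = (-1) ** S                 # Eq. (3): w = (-1)^S
    source = "first" if w > 0 else "second"
    return S, source

def insert_dummy_bases(chunk_a: str, chunk_b: str, x: int, b: int = 3) -> str:
    """Join two DNA-encoded chunks with S dummy bases picked from one of them."""
    S, source = dummy_base_plan(x, b)
    pool = chunk_a if source == "first" else chunk_b
    # Uniform sampling from the chosen chunk is an assumption for illustration.
    dummies = "".join(random.choice(pool) for _ in range(S))
    return chunk_a + dummies + chunk_b

print(dummy_base_plan(10))   # e.g., (59, 'second') for b = 3, x = 10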
IV. PROPOSED E-MR SHARING SCHEME
This section first briefly describes the system model and provides an overview of the proposed e-MR sharing scheme, which is depicted in Fig. 3. Then it describes the encryption, the blockchain storage, and the data query together with the decryption phases.
A. System Model
The proposed scheme consists of five entities, i.e., (i) patients Pi (i ≥ 1), (ii) physicians Phj (j ≥ 1), (iii) diagnostic centers DC (≥ 1), (iv) a blockchain storage, and (v) authorized third parties. Here, every entity possesses a unique identity. The major attributes of every entity are as below.
Patient (Pi): A person who needs medical treatment.
Physician (Phj): An individual who provides medical treatment to cure the patient's disease.
Diagnostic Center (DC): A pathological laboratory where a patient's clinical specimen is examined.
Blockchain Storage (BS): A decentralized, distributed, and tamper-proof data storage that ensures data integrity.
Authorized Entity (AE): An individual who is eligible to access the patients' health-related data to provide better services, analyses, etc.
B. Overview of the Proposed Scheme
Suppose that, for medical services, a patient Pi visits the physician Phj to consult about its disease, and the Pi shares its public encryption key PubP and other required parameters with the Phj. The Phj asks for the required medication and diagnosis, which creates the Pi's e-MR. Now, using PubP, the Phj encrypts the e-MR through the adopted two-fold encryption mechanism. Then the Phj creates a block and forwards it to the blockchain network. Later on, when any authorized entity AE requests the e-MR of Pi, the Pi requests its block from the blockchain storage BS. The BS returns the corresponding block to the Pi, and the Pi decrypts the e-MR using its private key PriP to obtain the plain e-MR. Further, the Pi encrypts its e-MR with the public key PubAE of the AE using an asymmetric cryptosystem, namely ElGamal with different parameters, and sends it to the AE. Then the AE decrypts the cipher e-MR with its private key PriAE to retrieve the plain e-MR.
C. Encryption Phase
The physician Phj encrypts the e-MR of the patient Pi according to the following steps. Fig. 4 represents them.
Step 1: First, scan the textual data and the (color) medical image of the e-MR.
Step 2: Convert them separately into the corresponding ASCII values. Namely, convert each character of the text into a 03-digit ASCII integer. For the image, convert each pixel of its 03 channels into the corresponding 03-digit ASCII value; these 03 channels are then managed serially.
Step 3: Divide the ASCII integer data into equal-length chunks. If necessary, apply '0' padding at the leftmost position of the leftmost chunk.
Step 4: Encrypt each chunk by using an asymmetric cryptosystem, e.g., the ElGamal encryption technique.
Step 5: Convert each encrypted chunk into its corresponding bits.
Step 6: Transform the bits into the corresponding DNA bases (i.e., 00 = A, 01 = C, 10 = G, 11 = T).
Step 7: Add dummy DNA bases between every two consecutive chunks according to Section III-C. Finally, create a block of the blockchain by using the encrypted chunks.
D. Blockchain Storage Phase
After encryption, the Phj creates a block with the encrypted e-MR, a timestamp, and the previous block hash and forwards it to the blockchain network. The BS returns an index number IBS of the block to the Phj. Then the Phj sends IBS to the Pi for further accessing of the block.
E. Data Query and Decryption Phase
When the Pi wants to retrieve its e-MR, it anonymously sends an access request for the block containing the e-MR to the BS, providing the index number IBS. Then, the BS returns the corresponding block to the Pi, and the Pi decrypts the cipher e-MR through the following steps; Fig. 5 shows them.
Step 1: Scan the encrypted chunks of the block from the blockchain.
Step 2: Discard the dummy DNA bases between every two consecutive chunks.
Step 3: Decode the DNA-encoded chunks to retrieve the binary chunks.
Step 4: Convert the binary chunks into the corresponding ASCII integer chunks (still in a partially encrypted format).
Step 5: Decrypt the individual ASCII integer chunks by using the private key of the patient.
Step 6: Discard all '0's from the left side of the left-most chunk, if necessary.
Step 7: Retrieve the e-MR (the textual data and the medical image separately) from the ASCII data.
Fig. 3. An overview of the proposed scheme.
Fig. 4. Proposed encryption process.
Fig. 5. Proposed decryption process.
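A compact sketch of the chunk-level transformations in the encryption and decryption phases is given below (Steps 2-3 and 5-6 of Section IV-C and their inverses from Section IV-E). The ElGamal step is replaced by placeholder functions, the chunk length of 20 digits is an assumed parameter, and the dummy-DNA-base insertion of Section III-B is omitted; the sketch is meant only to show that the ASCII/chunk/DNA conversions round-trip correctly.

# Bit-pair to DNA base mapping from Step 6 (00 = A, 01 = C, 10 = G, 11 = T).
BIT2DNA = {"00": "A", "01": "C", "10": "G", "11": "T"}
DNA2BIT = {base: bits for bits, base in BIT2DNA.items()}
CHUNK_LEN = 20  # chunk length in digits; an assumed parameter, not stated in the paper

def text_to_ascii_digits(text):
    # Step 2 (text): every character becomes a 03-digit ASCII integer.
    return "".join(f"{ord(ch):03d}" for ch in text)

def ascii_digits_to_text(digits):
    return "".join(chr(int(digits[i:i + 3])) for i in range(0, len(digits), 3))

def make_chunks(digits):
    # Step 3: equal-length chunks with '0' padding at the left of the left-most chunk.
    pad = (-len(digits)) % CHUNK_LEN
    padded = "0" * pad + digits
    return [padded[i:i + CHUNK_LEN] for i in range(0, len(padded), CHUNK_LEN)], pad

def int_to_dna(value):
    # Steps 5-6: integer chunk -> bit string (padded to even length) -> DNA bases.
    bits = bin(value)[2:]
    if len(bits) % 2:
        bits = "0" + bits
    return "".join(BIT2DNA[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_int(dna):
    return int("".join(DNA2BIT[base] for base in dna), 2)

# Step 4 (ElGamal) is abstracted: any asymmetric cryptosystem mapping an integer
# chunk to an integer ciphertext (and back) would slot in here.
def placeholder_encrypt(chunk_value):
    return chunk_value

def placeholder_decrypt(cipher_value):
    return cipher_value

# Encryption direction (dummy-base insertion from Section III-B is omitted here).
digits = text_to_ascii_digits("Patient Name: Alice")
chunks, pad = make_chunks(digits)
dna_chunks = [int_to_dna(placeholder_encrypt(int(c))) for c in chunks]

# Decryption direction (Steps 3-7 of Section IV-E).
decoded = "".join(f"{placeholder_decrypt(dna_to_int(d)):0{CHUNK_LEN}d}" for d in dna_chunks)
print(ascii_digits_to_text(decoded[pad:]))  # -> Patient Name: Alice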
V. EVALUATION OF THE SCHEME
A. Experimental Setup
A prototype of the proposed scheme was developed on an Intel(R) Core(TM) i5-10500 CPU @ 3.10 GHz 64-bit processor with 12 GB RAM running the Windows 10 operating system. It was developed in Visual Studio Code 2019 for coding purposes. The ElGamal encryption system was adopted as the asymmetric cryptosystem, and a 1024-bit key was used for the encryption and decryption operations. The Byte Size Matters calculator [26] was used to measure the text size. Here, the parameters related to the dynamic DNA encoding, i.e., those used to form the two-fold encryption phase, were set as b = 3 and x = 10, while the other parameters were already specified in Section III-C.
B. The Output of the Encryption Phase
Considering the plaintext 'Patient Name: Alice' and a color chest X-ray image with a size of 300×250, Table I presents the output of the encryption phase based on the steps of Section IV-C.
C. The Output of the Decryption Phase
Based on the steps of Section IV-E (and for the data of Section V-B), Table II shows the output of the decryption phase.
D. Experimental Results and Comparisons
The proposed e-MR sharing scheme was implemented with different data sizes by employing the developed two-fold encryption mechanism. Here, three sets of e-MR were chosen as the input: (i) 100 KB of textual data and a 200×200 color medical image (29 KB), (ii) 300 KB of textual data and a 300×300 color medical image (263 KB), and (iii) 500 KB of textual data and a 500×500 color medical image (732 KB). The size of the plain e-MR together with the size of the cipher e-MR is shown in Fig. 6. By observing the figure, it is seen that the size of the cipher e-MR is approximately 10 times greater than the size of the plain e-MR.
TABLE I. OUTPUT OF THE ENCRYPTION PHASE

Step 1 (Scan e-MR):
  Text: Patient Name: Alice
  Image: (color chest X-ray shown in the original table)
Step 2 (Convert into ASCII value):
  Text: 080097116105101110116032078097109101058032065108105099101
  Image: Red channel: 160 166 171 139 179 139 166 123 064
         Green channel: 010 016 017 019 029 039 016 013 064
         Blue channel: 240 186 191 239 199 239 206 183 244
  Combined: 080097116105101110116032078097109101058032065108105099101160010240166016186171017191139019239179029199139039239166016206123013183064064244
Step 3 (Divide into equal-length chunks):
  Chunk 1: *080097116105101110116032078097109101058032065108105099101160010240
  Chunk 2: *186171017191139019239179029199139039239166016206123013183064064244
Step 4 (Encrypt chunks using ElGamal):
  Chunk 1: *160196110354652489523396765104697193423769108011543972106147876233
  Chunk 2: *106871287860786913019408022543218433480150823923496926652762894356
Step 5 (Convert cipher chunks into bits):
  Chunk 1: *110010000100000011110000010100101110111000001011101110001111100101
  Chunk 2: *100001001010001111011011110000110001100011101100111111010001001100
Step 6 (Transform bits into DNA bases):
  Chunk 1: *TGCAAGAACTGAAGGCCTCTAACCTCTACTTAGACCGTATGAGAAGAT
  Chunk 2: *TAAGTCTAGCTTTCTTCCACACATGAGCTACCCAGGCGAATAAGCCAC
Step 7 (Add dummy DNA bases)*,**:
  ATCTGTTCTGTTACTCAATCCAACAACTTGGTATCTGCTACGGGCGGCTATTTCT AGTACGATGAAACATTGCGCTTCCCAACCAACAATTGCGCTTGCTATTTCTTTTCT

* denotes a portion of data; ** dummy DNA bases are shown in bold italic in the original table.
TABLE II. OUTPUT OF THE DECRYPTION PHASE

Step 1 (Read final ciphertext)*,**:
  ATCTGTTCTGTTACTCAATCCAACAACTTGGTATCTGCTACGGGCGGCTATTTCT AGTACGATGAAACATTGCGCTTCCCAACCAACAATTGCGCTTGCTATTTCTTTTCT
Step 2 (Discard dummy bases):
  Chunk 1: *TGCAAGAACTGAAGGCCTCTAACCTCTACTTAGACCGTATGAGAAGAT
  Chunk 2: *TAAGTCTAGCTTTCTTCCACACATGAGCTACCCAGGCGAATAAGCCAC
Step 3 (Decode into binary chunks):
  Chunk 1: *110010000100000011110000010100101110111000001011101110001111100101
  Chunk 2: *100001001010001111011011110000110001100011101100111111010001001100
Step 4 (Convert into ASCII integer data):
  Chunk 1: *160196110354652489523396765104697193423769108011543972106147876233
  Chunk 2: *106871287860786913019408022543218433480150823923496926652762894356
Step 5 (Decrypt ASCII integer chunks):
  Chunk 1: *080097116105101110116032078097109101058032065108105099101160010240
  Chunk 2: *186171017191139019239179029199139039239166016206123013183064064244
Step 6 (Discard '0' padding from the leftmost chunk):
  080097116105101110116032078097109101058032065108105099101160010240166016186171017191139019239179029199139039239166016206123013183064064244
  080097116105101110116032078097109101058032077114046032065108105099101
Step 7 (Get plain e-MR (Text & Image)):
  Text: Patient Name: Alice
  Image: Red channel: 160 166 171 139 179 139 166 123 064
         Green channel: 010 016 017 019 029 039 016 013 064
         Blue channel: 240 186 191 239 199 239 206 183 244

* denotes a portion of data; ** dummy DNA bases are shown in bold italic in the original table.
Another experiment was done to exhibit the encryption and decryption times, where the sizes of the input e-MR sets remain the same as in Fig. 6. The output is displayed in Fig. 7. It shows that the time required for encryption is nearly double the time required for decryption. The reason is that the proposed two-fold encryption mechanism first employs the ElGamal cryptosystem, which performs two modular exponentiation operations in the encryption phase but only one such operation in the decryption phase. Alongside, Fig. 8 depicts the time required to create a distinct private block using the same inputs as in Fig. 6; this time excludes the e-MR encryption time.

Fig. 6. A comparison between plain e-MR and cipher e-MR data size for three distinct inputs.

Fig. 7. A comparison between encryption and decryption time for three distinct e-MR inputs.

Fig. 8. Time required for block creation of three different cipher e-MRs.
The proposed scheme is compared with [5]–[9], [12], [13], [16], [17], and [18], as shown in Table III. It shows that all the compared schemes use single-fold encryption, and most of them use a symmetric cryptosystem that is vulnerable to various attacks. Some of the works do not consider image data along with textual data, while most of them do not maintain data integrity. Thus, the table implies that the proposed scheme is better for storing and sharing the e-MR and assures data privacy, data integrity, availability, etc.
E. Security Analyses
This section evaluates the security aspects, i.e., data privacy, data integrity, level of encryption, etc., encompassed by the proposed scheme, which are depicted in Table III. At the same time, the table includes a comparison with other works.
TABLE III. A COMPARISON WITH OTHER SCHEMES

Scheme     Input type: Text   Input type: Image   Encrypt e-MR asymmetrically   Data integrity   Two-fold encryption
[6]        –                  –                   ×                             √                ×
[7]        –                  –                   ×                             ×                ×
[8]        √                  √                   –                             √                ×
[9]        √                  ×                   √                             √                ×
[12]       √                  √                   ×                             ×                ×
[13]       –                  –                   ×                             ×                ×
[5, 16]    √                  √                   –                             ×                ×
[17]       √                  ×                   –                             ×                ×
[18]       ×                  √                   √                             √                ×
Proposed   √                  √                   √                             √                √

– = not mentioned; √ = considered; × = not considered
Data Privacy: The proposed scheme offers privacy via ElGamal encryption and dynamic DNA encoding, whereas, as shown in Table III, most compared schemes use symmetric encryption.
Data Integrity: As the blockchain is an immutable ledger, the data integrity of the cipher e-MR is assured by storing it in a distinct block of the blockchain, whereas, as shown in Table III, almost all other schemes do not satisfy this criterion.
Two-fold Encryption: To enrich the level of confusion, the proposed scheme adopts two-fold encryption, whereas, as shown in Table III, the compared schemes use single-fold encryption.
VI. CONCLUSIONS
In this paper, the proposed blockchain based secure e-MR storing and sharing scheme maintains adequate data privacy, data integrity, availability, etc., of the patient's medical record. It considers both the textual data and the medical image as input. Herein, the two-fold encryption mechanism consisting of the asymmetric cryptosystem and the dynamic DNA encoding enriches data privacy. The storage of the patient's medical record over the blockchain assures data integrity. The preliminary assessment indicates the efficacy of the proposed scheme. An upcoming plan of enhancement is to improve the required building blocks and the encryption and decryption phases, develop and incorporate a consensus mechanism for the blockchain architecture, and implement the prototype of this scheme in a more realistic environment.
REFERENCES
[1] J. Hecht, "The future of electronic health records," Nature, 2019. https://www.nature.com/articles/d41586-019-02876-y (Accessed May 10, 2022).
[2] J. N. Al-Karaki, A. Gawanmeh, M. Ayache, and A. Mashaleh, "DASS-CARE: A Decentralized, Accessible, Scalable, and Secure Healthcare Framework using Blockchain," in 15th Int. Wireless Communications & Mobile Computing Conference (IWCMC), pp. 330–335, Jun. 2019.
[3] S. Tanwar, K. Parekh, and R. Evans, "Blockchain-based electronic healthcare record system for healthcare 4.0 applications," J. Inf. Secur. Appl., Vol. 50, p. 102407, Feb. 2020.
[4] S. Tamura and S. Taniguchi, "Enhancement of anonymous tag based credentials," Inf. Security and Comput. Fraud, Vol. 2, No. 1, pp. 10–20, 2014.
[5] S. Rouhani, L. Butterworth, A. D. Simmons, D. G. Humphery, and R. Deters, "MediChain TM: A Secure Decentralized Medical Data Asset Management System," in IEEE Int. Conf. on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 1533–1538, 2018.
[6] H. Zhu, Y. Guo, and L. Zhang, "An improved convolution Merkle tree-based blockchain electronic medical record secure storage scheme," J. Inf. Secur. Appl., Vol. 61, p. 102952, Sep. 2021.
[7] K. Shuaib, J. Abdella, F. Sallabi, and M. A. Serhani, "Secure decentralized electronic health records sharing system based on blockchains," J. King Saud Univ. - Comput. Inf. Sci., May 2021.
[8] X. Yue, H. Wang, D. Jin, M. Li, and W. Jiang, "Healthcare Data Gateways: Found Healthcare Intelligence on Blockchain with Novel Privacy Risk Control," J. Med. Syst., Vol. 40, No. 10, p. 218, 2016.
[9] A. Al Omar, M. S. Rahman, A. Basu, and S. Kiyomoto, "MediBchain: A Blockchain Based Privacy Preserving Platform for Healthcare Data," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 534–543, 2017.
[10] Q. Xia, E. B. Sifah, K. O. Asamoah, J. Gao, X. Du, and M. Guizani, "MeDShare: Trust-Less Medical Data Sharing Among Cloud Service Providers via Blockchain," IEEE Access, Vol. 5, pp. 14757–14767, 2017.
[11] K. Fan, S. Wang, Y. Ren, H. Li, and Y. Yang, "MedBlock: Efficient and Secure Medical Data Sharing Via Blockchain," J. Med. Syst., Vol. 42, No. 8, p. 136, Aug. 2018.
[12] H. Li, L. Zhu, M. Shen, F. Gao, X. Tao, and S. Liu, "Blockchain-Based Data Preservation System for Medical Data," J. Med. Syst., Vol. 42, No. 8, p. 141, Aug. 2018.
[13] Y.-L. Lee, H.-A. Lee, C.-Y. Hsu, H.-H. Kung, and H.-W. Chiu, "SEMRES - A Triple Security Protected Blockchain Based Medical Record Exchange Structure," Comput. Methods Programs Biomed., Vol. 215, p. 106595, Mar. 2022.
[14] B. Shen, J. Guo, and Y. Yang, "MedChain: Efficient Healthcare Data Sharing via Blockchain," Appl. Sci., Vol. 9, No. 6, p. 1207, 2019.
[15] F. Li, K. Liu, L. Zhang, S. Huang, and Q. Wu, "EHRChain: A Blockchain-based EHR System Using Attribute-Based and Homomorphic Cryptosystem," IEEE Trans. Serv. Comput., 2021.
[16] A. A. Vazirani, O. O'Donoghue, D. Brindley, and E. Meinert, "Blockchain vehicles for efficient Medical Record management," npj Digit. Med., Vol. 3, No. 1, p. 1, Dec. 2020.
[17] X. Zheng, R. R. Mukkamala, R. Vatrapu, and J. Ordieres-Mere, "Blockchain-based Personal Health Data Sharing System Using Cloud Storage," in IEEE 20th Int. Conf. on e-Health Networking, Applications and Services (Healthcom), pp. 1–6, Sep. 2018.
[18] V. Patel, "A framework for secure and decentralized sharing of medical imaging data via blockchain consensus," Health Informatics J., Vol. 25, No. 4, pp. 1398–1411, Dec. 2019.
[19] A. Zhang and X. Lin, "Towards Secure and Privacy-Preserving Data Sharing in e-Health Systems via Consortium Blockchain," J. Med. Syst., Vol. 42, No. 8, p. 140, Aug. 2018.
[20] Y. Sao, S. S. Ali, D. Ray, S. Singh, and S. Biswas, "Co-relation scan attack analysis (COSAA) on AES: A comprehensive approach," Microelectron. Reliab., Vol. 123, p. 114216, Aug. 2021.
[21] A. Mondal, K. M. R. Alam, Nawaz Ali, P. H. J. Chong, and Y. Morimoto, "A Multi-Stage Encryption Technique to Enhance the Secrecy of Image," KSII Trans. Internet Inf. Syst., Vol. 13, No. 5, May 2019.
[22] M. R. Biswas, K. M. R. Alam, S. Tamura, and Y. Morimoto, "A technique for DNA cryptography based on dynamic mechanisms," J. Inf. Secur. Appl., Vol. 48, p. 102363, Oct. 2019.
[23] M. S. R. Tanveer, K. M. R. Alam, and Y. Morimoto, "A multi-stage chaotic encryption technique for medical image," Information Security Journal: A Global Perspective, pp. 1–19, Aug. 2021.
[24] Z. Zheng, S. Xie, H. Dai, X. Chen, and H. Wang, "An Overview of Blockchain Technology: Architecture, Consensus, and Future Trends," in IEEE Int. Congress on Big Data (BigData Congress), pp. 557–564, Jun. 2017.
[25] T. Elgamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Trans. Inf. Theory, Vol. 31, No. 4, pp. 469–472, Jul. 1985.
[26] "Byte Size Matters." http://bytesizematters.com/ (Accessed Aug. 29, 2022).
An Approach to Classify the Shot Selection by Batsmen in Cricket Matches Using Deep Neural Network on Image Data Afsana Khan1 , Fariha Haque Nabila2 , Masud Mohiuddin3 , Mahadi Mollah4 , Md Ashraful Alam,5 and Md Tanzim Reza6 1,2,3,4,5,6
Department of Computer Science and Engineering, BRAC University, 66 Mohakhali, Dhaka 1212, Bangladesh Email: 1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected], 6 [email protected]
Abstract—In recent times, technological advancement has brought a tremendous change to the field of cricket, which is a popular sport in many countries. Technology is being utilized to figure out projected score prediction, wicket prediction, winning probability, run rate, and many other parameters. In this research, our primary goal is to use machine learning in the field of cricket, where we aim to classify the shot played by the batsman, which can help in applications such as automated broadcasting systems or statistical data generation systems. For implementing our proposed model, we have generated our own dataset of cricket shot images by taking real-time photos from various cricket matches. We collected 1000 images of 10 different types of shots being played. For the classification task, we trained the VGG-19 and Inception v3 model architectures, and we got a better result by using VGG-19. Before classification, the images had to go through several pre-processing methods such as background removal through Mask R-CNN, batsman segmentation through YOLO v3, etc. Then we used 83% of the total images to train the models and 17% to test the models. Finally, we achieved an accuracy of 84.71% from VGG-19 and 82.35% from Inception V3. Index Terms—Cricket, Batsman, Shot, Camera, Autonomous, Broadcasting, VGG-19, Inception.
I. INTRODUCTION
Various types of entertainment can bring relief to us in our stressful lives. One form of entertainment is sports. Sports such as cricket currently have a huge fan base around the world; as of now, a total of 106 countries play some form of the game [1]. A cricket shot can be defined as an approach to hitting the ball with the bat in a cricket match. Depending on the bowling type, fielder placement, and various other factors, batsmen can play different variants of cricket shots that send the ball to various places within the field. Different shots result in different poses made by the batsman, and detecting the cricket shot from visual information (images, videos) falls under the paradigm of human pose detection. An automated approach to classifying shots from visual information can provide many useful applications. For example, to make this form of the game more lucrative, live broadcasting systems through electronic mediums have been integral to the game. These broadcasting systems are mostly handled manually by humans through the process of manual
camera control and scene changes, which is difficult, time-consuming, and prone to human error. An automated shot detection system can play an important part in the automation of the broadcasting system, as camera movement or changes are often partially dependent on the shot being played. Additionally, this automation can also help to dynamically generate different cricket shot related statistics. In the context of cricket, deep learning has already shown tremendous success in score prediction, wicket prediction, winning probability, and many others. However, deep learning is not extensively used for applications like shot detection or live broadcast systems, resulting in technologies like drones and Spidercams being controlled by humans right now. If deep learning is applied successfully, it can autonomously sense the shot of the batsman and, if accurate enough, this automation can be utilized in various useful applications. In order to describe the proposed research, the entire paper has been divided into multiple segments. In the next chapters of the paper, we discuss various works related to this field, the proposed method, the data collection and pre-processing procedure, and the result analysis.
II. LITERATURE REVIEW
A. Previous Works
Cricket is among the oldest running sports, with a very rich history and a fairly well-developed global reach. However, very little scientific research has been done in this field despite it being present for a long time [2]. Shot classification in the field of cricket is comparatively new ground for researchers to work on. Only a handful of works have been done in this field because of its complexity in terms of data collection and processing. We can recognize shot classification as a sub-category of human posture detection, and posture detection in general has generated a lot of research work across various domains in the past. Among the research related to cricket that has been done, most used supervised machine learning based models to achieve various kinds of automation. Part of these models includes cricket batting stroke detection with the help of body and batting posture using
a fuzzy set [3], classifying the pose from images and videos using bounding-box based models [4], using motion vectors to classify batting strokes with around 60% accuracy [5], and so on. Applications of machine learning are not limited to shot or pose detection; various works in other domains have also been done, such as using k-nearest neighbors to classify and analyze the legality of bowling actions [6], using VGG-16 to classify the bowling action of a bowler [7], autonomous third umpire decision making using Inception V3 [8], etc. A few works on shot detection based on neural networks have also been done. For example, classification of the shot of a batsman was done using a modified CNN model [9]. However, that paper only covers 6 different shots of a right-handed batsman and uses only one model to classify the shots. Therefore, it is important to do more research on this topic with different models and compare their results. Moreover, it is important to increase the number of shot classes, as there are many unique shot classes that can take the ball to various parts of the field. Furthermore, many papers support the fact that image classification results can be improved if the models are pre-trained with a large number of images from a huge dataset such as ImageNet. Since shot classification or pose detection also falls under image classification, such transfer learning from other datasets can be beneficial. To conclude, there are various scopes for more research in this area to get better results.
B. Algorithms
In our research, we have used various Convolutional Neural Network (CNN) models, because CNN-based models currently dominate the computer vision space for image classification and object detection [10]. Image classification is the process of assigning images to different categories depending on their characteristics. CNN models in general include an input layer, an output layer, and multiple hidden convolution layers. For image classification, CNN models use various features of an image such as pixel intensity, changes in pixel values, edges of various shapes, and so on [11].
VGG, also known as VGGNet, is a well-established CNN architecture that performed well on large-scale datasets like ImageNet. VGG stands for Visual Geometry Group. For VGGNet, the default input shape is 224 × 224 × 3, and the network contains a number of convolution layers with filters of various sizes alongside max-pooling layers [12]. VGG has two variants: VGG-16, containing 16 segments of convolution layers, and VGG-19, which consists of 19 convolution layers [13]. For our work, we used the variant with 19 layers.
Inception is another CNN model used for image inspection and classification of various elements of images. Inception models play a vital role in the domain of image classification. On the ImageNet dataset, Inception v3, a variant of Inception, has been demonstrated to achieve higher than 78.1 percent accuracy. Furthermore, thanks to the 1 × 1 convolutional kernel that reduces the feature size, the training speed of this model is comparatively faster than similarly performing models [14].
Another model we used in our research is the Mask R-CNN model, developed in 2017 by a group of AI researchers for object detection. Alongside object detection, it can also perform pixel-wise instance segmentation. Instance segmentation is a challenging task, as it requires correct detection of every object in an image as well as segmentation of every instance of the object at the pixel level [15]. There are various models for object detection based on instance segmentation, but Mask R-CNN is one of the most successful in terms of accuracy [16].
Finally, in order to segment the batsman in the background-removed image, we used bounding box segmentation through the YOLO v3 model, an improved variant of the original You Only Look Once (YOLO) model [17]. YOLO v3 performs only a single forward propagation through the neural network for object detection, resulting in a very fast object detection model. YOLO v3 provides good performance for bounding box regression. Therefore, we have utilized it to separate the batsman with a bounding box and crop it out of the image, so that the cropped image containing only the batsman can be used for classification.
III. PROPOSED MODEL
Fig. 1. Proposed model
At first, we looked into the highlights of various cricket games on YouTube and built our own dataset of 1000 photos by taking screenshots. Afterward, some pre-processing was done and train-test sets for the dataset were formed. Of the total number of images, 830 were used for training, and the remaining 170 images were used for model testing. We used 4 models for our work: Mask R-CNN and YOLO v3 for the data pre-processing segment, and the VGG-19 and Inception-V3 models for the classification task. Finally, after getting the results, we analyzed and compared them.
IV. DATASET DETAILS AND PROCESSING
A. Dataset details
In order to do the research, we generated our own dataset for both training and testing purposes. At first, we collected cricket shot images from the highlights of various matches and then split them into 10 classes, namely cover drive, cut drive, late cut, off drive, on drive, leg glide, hook, pull, square cut, and straight drive. We considered all formats of cricket, from ODI to T20, as well as all the cricket-playing nations currently playing test cricket under the rules of the ICC. The class distribution of the dataset is intentionally kept unbalanced, as different shots are usually played at different frequencies. For example, the 'Hook' shot is played much more rarely compared to the 'Cover drive', so there are fewer images (37 images) in the 'Hook' class compared to the 'Cover drive' class (184 images). The class distribution for the training and the test set is given in Table I.

TABLE I
CLASS DISTRIBUTION OF THE TRAIN AND THE TEST SET

Shot class       Train Images   Test Images
Cover Drive      154            30
Cut Drive        97             18
Hook             29             8
Late Cut         50             10
Leg Glide        86             18
Off Drive        104            20
On Drive         82             16
Pull             128            29
Square Cut       49             10
Straight Drive   51             11
Total            830            170
From Table I, we can see that a nearly equal train:test distribution was kept for all the classes. In general, the distribution ratio for each class is nearly 5:1 between the train and the test set. However, the ratio varies slightly for different classes. Some sample images from the dataset are given in figure 2.

Fig. 2. Subset of Dataset

B. Dataset processing
Data pre-processing is the method of modifying raw data so that it can be easily processed by the models. Raw data usually comes with lots of noise and extra unnecessary information, making it harder for the models to learn patterns from it. Therefore, pre-processing is necessary to improve
results. During pre-processing, we performed the following processes: scaling, augmentation, background removal, and batsman segmentation. Before scaling, the pixel values were in the range of 0-255, and after rescaling, the pixel values were converted to the range 0-1. Additionally, we performed data augmentation to artificially increase the volume of the data. For augmentation, we used a 10 percent width shift towards both the left and the right direction and a 10 percent height shift towards the up and down direction. Furthermore, we applied horizontal flips to the images. Our original dataset consisted of shots played only by right-handed batsmen; the horizontal flip randomly mirrors the images and artificially converts right-handed shots into left-handed shots. Finally, we used background removal in order to drop unnecessary detail from the data [18]. For shot identification we primarily need the pose of the batsman; therefore, we can drop unnecessary details such as the field in the background and other elements. To achieve this, we applied Mask R-CNN to perform instance segmentation of human character models on the image. Instance segmentation is the process of detecting all the pixels that are the property of a particular object [19]. After instance segmentation, only the resultant pixels were kept and everything else was removed. Consequently, only the major player characters remained in the images while all the other information was removed. Figure 3 visualizes some of the images with the background removed. Table II summarizes the pre-processing details.

TABLE II
PRE-PROCESSING DETAILS

Pre-processing type              Details
Pixel Scaling                    Within 0-1
Width Shift (Augmentation)       10%
Height Shift (Augmentation)      10%
Horizontal Flip (Augmentation)   True
Background Removal               True
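The scaling and augmentation settings of Table II map naturally onto the Keras ImageDataGenerator API, as sketched below. The use of TensorFlow/Keras, the 224 × 224 target size, and the 'train/' and 'test/' directory layout (one sub-folder per shot class) are assumptions for illustration; the paper does not state which framework or folder structure was used.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pixel scaling to [0, 1] plus the augmentations listed in Table II.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # pixel values 0-255 -> 0-1
    width_shift_range=0.1,    # 10% shift left/right
    height_shift_range=0.1,   # 10% shift up/down
    horizontal_flip=True,     # mirrors right-handed shots into left-handed ones
)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

# 'train/' and 'test/' with one sub-folder per shot class are assumed paths.
train_gen = train_datagen.flow_from_directory(
    "train/", target_size=(224, 224), batch_size=32, class_mode="categorical")
test_gen = test_datagen.flow_from_directory(
    "test/", target_size=(224, 224), batch_size=32, class_mode="categorical")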
Fig. 3. Images after background removal
At this stage of pre-processing, we had the images with only the players in them. Finally, we applied the YOLO v3 model to segment the batsman in the images for cropping purposes. After background removal, the images became rather simple, so not many training images were required for batsman segmentation. We picked 250 images and annotated them using Roboflow [20]. After annotation, we applied bounding box segmentation on the rest of the images (600 images) and cropped those according to the bounding box location. Some of the bounding box results are given in figure 4.

Fig. 4. Bounding box detection on the images

For training, all the input images were converted to 416×416 resolution, which is the default input resolution of YOLO v3. 200 images were kept for training and the rest were used for validation. The batch size was 32 and the number of epochs was 70. The overall training improvement details are given in the results and analysis section. After training the model, the rest of the images were cropped using it. The actual cropped area was kept 15 pixels wider on both sides of the X-axis and 5 pixels wider on both sides of the Y-axis so that the extended bat or body parts while playing the shot do not get removed completely; a minimal sketch of this margin-extended cropping is given after Fig. 6. Some cropped results are given in figure 5.

Fig. 5. Cropped images

Finally, all the cropped images were scaled before passing through the classification CNNs for training. The pixel values were divided by 255 so that all the pixel values fall within a range of 0-1.

V. RESULT AND ANALYSIS

Since there was no significant hyper-parameter tweaking while training and our dataset was limited in terms of the number of images, no separate validation set was kept. Instead, the accuracy of the final epoch was termed the final test accuracy. For YOLO v3, trained on 200 images and validated on 50 images, the loss curve from the 6th epoch onward is given in figure 6. As the loss range across all the epochs is extremely wide, the loss of the first 5 epochs has been shown separately in a table for better visualization in the graph.

Fig. 6. YOLO v3 Training-Testing Loss
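The margin-extended cropping step referenced above can be sketched as follows: a predicted bounding box is widened by 15 pixels on both sides along the X-axis and 5 pixels along the Y-axis before the batsman is cut out. The (x_min, y_min, x_max, y_max) box format, the height × width × channel array layout, and the clamping to the image borders are assumptions for illustration.

import numpy as np

def crop_batsman(image: np.ndarray, box, margin_x: int = 15, margin_y: int = 5):
    """Crop a detected batsman from an image with extra margin around the box.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates; the margins keep
    an extended bat or limb from being cut off.
    """
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = box
    x_min = max(0, x_min - margin_x)
    x_max = min(w, x_max + margin_x)
    y_min = max(0, y_min - margin_y)
    y_max = min(h, y_max + margin_y)
    return image[y_min:y_max, x_min:x_max]

# Example: a 720x1280 frame and a box predicted by the detector (values illustrative).
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(crop_batsman(frame, (500, 100, 700, 600)).shape)   # (510, 230, 3)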
Post data pre-processing, we prepared the models for training. For both VGG19 and Inception v3, we kept the convolution layers only and dropped the default fully connected layers. Instead, at the end of the convolution layers, we added our own fully connected layers. There are basically 2 fully connected layers: the first one consists of 1024 neurons, and the second one is the output layer consisting of 10 neurons, corresponding to the number of classes. In order to avoid overfitting, we added a 30% dropout between the flattened convolution features and the
first fully connected layer. Additionally, for the loss function, categorical cross-entropy alongside the Adam optimizer (with a learning rate of 10^-5) was used. For the accuracy metric, we used categorical accuracy. The number of epochs was determined dynamically by observing the tendency of overfitting. We ran 35 epochs of training for VGG19 and 40 epochs of training for Inception v3 before observing obvious signs of overfitting. In figures 7 and 8, the training graphs for VGG19 and Inception v3 are given.
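A hedged Keras sketch of the classification head described above is given next: the VGG-19 convolutional base with its default dense layers dropped, a 30% dropout after flattening, a 1024-neuron fully connected layer, and a 10-way softmax output, compiled with categorical cross-entropy and Adam at a learning rate of 10^-5. The ImageNet weights, the ReLU activation on the 1024-unit layer, the 224 × 224 input size, and leaving the convolutional base trainable are assumptions not confirmed by the paper.

from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Convolutional base only (include_top=False drops VGG19's own dense layers).
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

x = Flatten()(base.output)
x = Dropout(0.3)(x)                            # 30% dropout before the first dense layer
x = Dense(1024, activation="relu")(x)          # first fully connected layer (activation assumed)
outputs = Dense(10, activation="softmax")(x)   # one unit per shot class

model = Model(inputs=base.input, outputs=outputs)
model.compile(
    optimizer=Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["categorical_accuracy"],
)

# model.fit(train_gen, validation_data=test_gen, epochs=35)   # 35 epochs were used for VGG19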
Fig. 7. Model Training Graph of VGG19

Fig. 8. Model Training Graph of Inception V3

Fig. 9. VGG19 vs Inception v3 Accuracy comparison

Fig. 10. Confusion matrix of VGG19
Table III is generated based on our results derived from the VGG-19 and Inception-v3 model implementations. We have shown the comparison in the table below.

TABLE III
COMPARISON TABLE OF VGG19 AND INCEPTION V3

Shots             Precision  Re-Call  F-1 Score  Precision     Re-call       F-1 Score
                  (VGG19)    (VGG19)  (VGG19)    (InceptionV3) (InceptionV3) (InceptionV3)
Cover Drive       0.96       0.87     0.91       0.85          0.93          0.89
Cut Drive         0.83       0.83     0.83       0.77          0.94          0.85
Hook              1.00       0.75     0.86       1.00          0.75          0.86
Late Cut          0.80       0.80     0.80       0.86          0.60          0.71
Leg Glide         1.00       0.89     0.94       1.00          0.83          0.91
Off Drive         0.77       0.85     0.81       0.88          0.70          0.78
On Drive          1.00       0.69     0.81       0.54          0.81          0.65
Pull              0.76       0.97     0.85       0.86          0.86          0.86
Square Cut        0.64       0.70     0.67       0.86          0.60          0.71
Straight Drive    0.83       0.91     0.87       0.91          0.91          0.91
Weighted Average  0.86       0.85     0.85       0.85          0.82          0.83
Comparison Between VGG19 and Inception-v3: We got a final test accuracy of 84.7% from VGG19 and 82.4% from Inception-v3 for our proposed model. We got slightly lower accuracy from Inception-v3 compared to VGG19, which is visualized in figure 9. Additionally, figures 10 and 11 present the confusion matrices of VGG19 and Inception-v3 to give a clearer idea of the accuracy.
Fig. 11. Confusion matrix of Inception-V3
To evaluate the performance of our approach, we have compared our results with a previous paper [9]. An important point to note is that a direct comparison might not be accurate, as the experiments used completely different datasets. In the mentioned research, a custom CNN was used for classifying the shots. They classified a total of 6 classes, and 4 of those classes match ours. For those 4 matching classes, we compared the precision, recall, and F1 scores in Table IV. Again, this comparison does not provide
any direct idea of the better approach, as the datasets used are different. However, a comparable or even better score in some cases, despite having more classes in the proposed approach, shows its viability.

TABLE IV
COMPARISON TABLE OF VGG-19 AND INCEPTION-V3 AND THE OLD PAPER

Shots           Precision  Re-Call   F-1 Score  Precision     Re-call       F-1 Score     Precision   Re-Call     F-1 Score
                (VGG-19)   (VGG-19)  (VGG-19)   (InceptionV3) (InceptionV3) (InceptionV3) (Previous)  (Previous)  (Previous)
Cover Drive     0.96       0.87      0.91       0.85          0.93          0.89          0.74        0.78        0.76
Cut Drive       0.83       0.83      0.83       0.77          0.94          0.85          0.69        0.76        0.72
Straight Drive  0.83       0.91      0.87       0.91          0.91          0.91          0.78        0.83        0.81
Pull Shot       0.76       0.97      0.85       0.91          0.91          0.91          0.89        0.77        0.83
Finally, we examined the images that were classified wrongly. Upon examining the results, a few observations reflected the difficulty of classifying cricket shots from image data:
• A few of the shots in the proposed research, for example, the square cut and the late cut, are fairly similar to each other in terms of pose. In these cases, an image taken at a slightly wrong moment may result in a fairly similar looking pose, and a few mis-classifications happened because of it. Using temporal information, e.g. from a video, may solve the issue.
• In the case of spin and medium-pace bowling, the wicketkeeper often comes close to the batsman. Despite the cropping, parts of the wicketkeeper still stay in the image and act as background noise. This is something to be handled and improved upon.
• Passing images through two different segmentation models might be expensive in different use cases. It is possible to directly segment the batsman semantically using Mask R-CNN only. However, that would require training the model, and it would take a lot of images to train because of all the background and foreground in the image. Labeling the data with polygons for semantic segmentation would also be more time-consuming. In comparison, labeling fewer images with bounding boxes and training YOLO v3 was rather easy because of the background removal.
VI. CONCLUSION AND FUTURE WORKS
Cricket has a huge fan base around the world, and autonomous technology is spreading into every sector of the workplace. Different types of automation technologies are being used in cricket matches too. Accurate and autonomous identification of cricket shots may have extremely useful applications such as autonomous broadcasting or autonomous data generation. However, detecting cricket shots accurately in real time can be challenging, as the cricket field is a busy place with lots of visual information on display. Therefore, proper approaches are required to filter out the excess information and noise. This is an aspect that the proposed research tried to address. However, there is scope to improve upon it. As discussed previously, passing images through two segmentation models can be expensive. Therefore, a single segmentation model can be trained extensively to semantically crop the batsman in a single pass. Additionally, instead of images, shots can
be classified directly on video data. This particular approach should be able to utilize the temporal features, which should improve the shot prediction result. These are the approaches that we plan to explore in the future in order to improve upon the current approach. R EFERENCES [1] ICC welcomes Mongolia, Tajikistan and Switzerland as new Members. Available at https://www.icc-cricket.com/media-releases/2192201. [2] TD Noakes and JJ Durandt. Physiological requirements of cricket. Journal of sports sciences, 18(12):919–929, 2000. [3] MG Kelly, KM Curtis, and MP Craven. Fuzzy recognition of cricket batting strokes based on sequences of body and bat postures. In IEEE SoutheastCon, 2003. Proceedings., pages 140–147. IEEE, 2003. [4] P. Reddy Gurunatha Swamy Ananth Reddy. Human pose estimation in images and videos. International Journal of Engineering Technology, 7(3):27, 2018. [5] D Karmaker, AZME Chowdhury, MSU Miah, MA Imran, and MH Rahman. Cricket shot classification using motion vector. In 2015 Second International Conference on Computing Technology and Information Management (ICCTIM), pages 125–129. IEEE, 2015. [6] Muhammad Salman, Saad Qaisar, and Ali Mustafa Qamar. Classification and legality analysis of bowling action in the game of cricket. Data Mining and Knowledge Discovery, 31(6):1706–1734, 2017. [7] Md Nafee Al Islam, Tanzil Bin Hassan, and Siamul Karim Khan. A cnn-based approach to classify cricket bowlers based on their bowling actions. In 2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON), pages 130–134. IEEE, 2019. [8] Md Kowsher, M Ashraful Alam, Md Jashim Uddin, Faisal Ahmed, Md Wali Ullah, and Md Rafiqul Islam. Detecting third umpire decisions & automated scoring system of cricket. In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), pages 1–8. IEEE, 2019. [9] Md Foysal, Ferdouse Ahmed, Mohammad Shakirul Islam, Asif Karim, and Nafis Neehal. Shot-net: a convolutional neural network for classifying different cricket shots. In International Conference on Recent Trends in Image Processing and Pattern Recognition, pages 111–120. Springer, 2018. [10] Mahbub Hussain, Jordan J Bird, and Diego R Faria. A study on cnn transfer learning for image classification. In UK Workshop on computational Intelligence, pages 191–202. Springer, 2018. [11] Sali Issa and Abdel Rohman Khaled. Knee abnormality diagnosis based on electromyography signals. In International Conference on Soft Computing and Pattern Recognition, pages 146–155. Springer, 2021. [12] Hassan Ali Khan, Wu Jue, Muhammad Mushtaq, and Muhammad Umer Mushtaq. Brain tumor classification in mri image using convolutional neural network. Math. Biosci. Eng, 17(5):6203–6216, 2020. [13] Hussam Qassim, Abhishek Verma, and David Feinzimer. Compressed residual-vgg16 cnn model for big data places image recognition. In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), pages 169–175. IEEE, 2018. [14] Chunmian Lin, Lin Li, Wenting Luo, Kelvin CP Wang, and Jiangang Guo. Transfer learning based traffic sign recognition using inception-v3 model. Periodica Polytechnica Transportation Engineering, 47(3):242– 250, 2019. [15] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. [16] Tianheng Cheng, Xinggang Wang, Lichao Huang, and Wenyu Liu. Boundary-preserving mask r-cnn. 
In European conference on computer vision, pages 660–676. Springer, 2020. [17] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. [18] Kamal Kc, Zhendong Yin, Dasen Li, and Zhilu Wu. Impacts of background removal on convolutional neural networks for plant disease classification in-situ. Agriculture, 11(9):827, 2021. [19] AL-Alimi Dalal, Yuxiang Shao, Ahamed Alalimi, and Ahmed Abdu. Mask r-cnn for geospatial object detection. International Journal of Information Technology and Computer Science (IJITCS), 12(5):63–72, 2020. [20] Roboflow Image Annotation. Available at https://roboflow.com/annotate.
A Blockchain Based Secure Framework for Usercentric Multi-party Skyline Queries Md. Motaleb Hossen Manik Department of Computer Science and Engineering (CSE) Khulna University of Engineering & Technology (KUET) Khulna-9203, Bangladesh [email protected]
Kazi Md. Rokibul Alam Department of Computer Science and Engineering (CSE) Khulna University of Engineering & Technology (KUET) Khulna-9203, Bangladesh [email protected]

Yasuhiko Morimoto Graduate School of Advanced Science and Engineering Hiroshima University Higashi-Hiroshima 739-8521, Japan [email protected]
Abstract—For commercial purposes, business organizations with mutual interests, while competing with one another, usually intend to keep their datasets confidential. Secure multi-party skyline queries enable these organizations to retrieve non-dominated data objects from their sensitive datasets. Here, a distinction among datasets according to priority, retaining data privacy along with integrity, targeted inquiries, etc., are crucial. This paper proposes a secure framework for multi-party skyline queries that incorporates these requirements together. To prioritize datasets, it assigns distinct weights to parties' datasets. To retain adequate privacy, it exploits a cryptosystem that engages a single key for encryption but multiple separate keys for decryption. To attain data anonymity, it re-encrypts and shuffles the encrypted data. To assure data integrity, it deploys a unique block comprising encrypted data designated for every party within the blockchain storage. To enable enlisted users to ask intended queries, it accepts their inquiries. This paper is a preliminary report and evaluation of the proposed framework. Keywords—Skyline queries, Multi-party ElGamal cryptosystem, Blockchain, Data integrity, Data prioritization
I. INTRODUCTION
Commercial organizations maintain an interest in the competitive development of their products, services, etc., and usually hold confidential datasets. Skyline queries [1] enable users to find out the datasets that are not dominated by other organizations' datasets. Here, secure multi-party skyline queries assist them in analyzing those sensitive datasets while identifying the dominant ones. Besides, a relevant scenario may arise as a query where a user can select the best service or product that suits its demands, namely, locating the cheapest but most fascinating hotel nearby. These sorts of queries are referred to as targeted queries.
Existing secure multi-party skyline queries have been developed from various perspectives. For example, in many cases, a single third party is responsible for assessing the security-related operations of the framework [2, 3, 5]. However, concerning security, trusting a single entity is somewhat impractical, since it may be involved in data forgery [4]. Again, many works are not careful to retain data anonymity, i.e., to eliminate the link between the encrypted and the decrypted data [5]. In addition, some other works do not pay proper attention to data integrity [2, 3, 6], which may be violated as a side effect of retaining data privacy. Moreover, many works emphasize conventional skyline queries rather than targeted ones [2, 4, 5]. To overcome these limitations (i.e., reliance on a single entity, improper care for data anonymity and integrity, and only conventional skyline queries), this paper proposes a framework for user-centric multi-party skyline queries over parties' sensitive datasets. The major contributions are listed below.
Yasuhiko Morimoto Graduate School of Advanced Science and Engineering Hiroshima University Higashi-Hiroshima 739-8521, Japan [email protected]
i) As per the impact of the parties’ data, it presents a publicly verifiable weight assignment mechanism that prioritizes parties’ datasets. ii) To attain adequate data privacy, it exploits a data encryption mechanism that involves a single key for encryption but multiple separate keys for decryption. iii) To assure data anonymity, it executes ‘re-encryption and shuffling’ operations over encrypted data. iv) To ensure data integrity, it utilizes the blockchain that allocates a unique block for each party. v) Eventually, it incorporates targeted queries on parties’ combined datasets to enable legitimate users to ask intended queries. The rest of this paper is arranged as follows. Sec. II studies the related works. Sec. III explains the preliminaries and building blocks. Sec. IV illustrates the system model and the role of entities. Sec. V describes the proposed secure multi-party skyline framework. Sec. VI explains the security analyses plus the preliminary evaluations, and Sec. VII is the conclusion of the paper. II. RELATED WORKS Initially, the ‘skyline queries’ were conducted with Block Nested Loop (BNL) method [1]. Following the first article, numerous studies on skyline queries from various perspectives have been conducted. The noticeable categories of skyline queries are: entity-based (e.g., single entity [2, 3, 5], multiple entities [4]), environment-based (e.g., distributed [8], mobile [9], cloud [6, 10, 11, 14]), mechanism-based (e.g., homomorphic encryption [2, 3, 6]), etc. A study was proposed in [10] aiming to secure the outsourced data in the cloud. They used the ElGamal cryptosystem to secure the data. However, the model relied on a single entity called ‘Service Provider’ to perform the entire encryption and decryption operations, rather than distributing the multiple keys among distinct entities which impaired the model’s privacy requirement. Besides, because of ElGamal’s homomorphic feature, the outsourced data can be simply manipulated by an adverse entity. As a result, the data integrity may be compromised. These constraints have deteriorated the model’s acceptability. Another secure skyline study was conducted in [2] using the Paillier cryptosystem to retain data privacy. This framework relies on the additive homomorphic property of the Paillier cryptosystem. Since data may be manipulated via the homomorphic cryptosystem, this model is also feeble for maintaining data integrity. Although this model included the concept of multi-party, only a single entity can access the
private key, which enabled unfair entities to alter the data. Moreover, it incorporated only conventional skyline queries rather than any user-centric targeted queries. These limitations have decreased the practicality of this model. Similar to [2], another work was proposed in [3], which incorporated the symmetric homomorphic encryption (SHE) technique with two atomic protocols, namely a comparison test and an equality test. Though two different protocols had been incorporated herein with the user query processing facility, this framework still lacks data integrity assurance because of the practice of SHE. Moreover, like [10], this framework relies on a single entity to perform the encryption and decryption operations, thus retaining feeble data privacy which limits its applicability. Another work in [11] suggested skyline queries over cloud-enabled databases that employed Data as a Service (DaaS) cloud infrastructure. The entire databases were hosted in the cloud enabling users to ask queries. Although it exploited secure cloud databases, it underestimated the malicious cloud infrastructure by relying on a single entity named ‘Service Provider’. Besides, there were no precautions to determine whether any malicious entity had altered the data or not. Also, the databases were singly encrypted without varying their appearances and positions, which enabled intruders to trace the link between the data owner and the encrypted databases. Again [6] and [9] proposed their models for cloud storage and mobile environment, respectively. Although [9] discussed privacy concerns, it could not integrate them. In contrast, though work [6] included homomorphic encryption, still, it failed to protect against data alteration since an intruder may succeed in modifying the encrypted data. Although the multiparty skyline proposed in [4] reflected on data privacy, data integrity, data anonymity, etc., criteria, due to retaining data integrity, it attached unique tags with parties’ datasets separately, which increased the computation overhead of the framework. Besides, it did not consider data prioritization, users’ targeted queries, blockchain storage, etc., demands. This paper proposes a framework for secure multi-party skyline queries that facilitates (i) prioritization of distinct datasets according to their impact, (ii) adequate data privacy, (iii) data integrity, (iv) data anonymity, and (v) users’ inquiries over parties’ prioritized datasets. To assist data prioritization it offers a publicly verifiable weight embedding mechanism. To enrich data privacy it adopts an ElGamal [12] based multiparty encryption system [7, 13] that employs a single key for data encryption but multiple separate keys for decryption. To assure data integrity it utilizes blockchain, which assigns a unique block to each party. To maintain data anonymity it adopts ‘re-encryption and shuffling’ operations on encrypted data. To grip targeted inquiries it offers user-centric queries from legitimate users. These qualities make the framework more practical and efficient than the existing works. III. PRELIMINARIES AND BUILDING BLOCKS This section discusses the skyline queries, cryptographic tools, weight assignment, etc., required to develop the proposed framework. A. Skyline [6] 1) Skyline Query: Consider a dataset D = [d1, d2, d3, …, dl] with the dimension m (≥ 1) and the number of data l (> 1) where da and db are two different data items in D. Here, da is said to be dominant over db if for all k, da [k] ≤ db [k] and if
there is at least one dimension for which da[k] < db[k] holds, where 1 ≤ k ≤ m. 2) Targeted Skyline Query: Consider the user query q = [q1, q2, …, qm] over the dataset D = [d1, d2, d3, …, dl] along with da and db, where m, l, da, db, etc., are already specified. Then, by applying the Euclidean distance measure, da is said to be dominant over db concerning q if, for all k, (da[k] − q[k])² ≤ (db[k] − q[k])² and there is at least one dimension for which (da[k] − q[k])² < (db[k] − q[k])² holds. B. ElGamal Cryptosystem The ElGamal [12] is a public-key cryptosystem based on discrete logarithms. It consists of three distinct phases, i.e., key generation (KeyGen), encryption, and decryption, which work as below. i) KeyGen: The receiver picks a large prime p with generator g from a cyclic group G of order p. Now, it selects the private decryption key X {X ∈ (1, …, p−1)} and calculates the public encryption key Y = g^X mod p. Then, it reveals {p, g, Y} publicly. ii) Encryption: To encrypt a message m (0 < m ≤ p−1), the sender chooses a random integer r {r ∈ (1, …, p−1)} and computes the ciphertext EY(r, m) = {y1, y2}, where y1 = g^r mod p and y2 = m·Y^r mod p. Then, it delivers EY(r, m) to the receiver. iii) Decryption: To regain the original message m, the receiver computes y2 / y1^X using the private key {X}. C. ElGamal-based Multi-party Encryption Scheme The ElGamal cryptosystem supports the commutative property [13] used to develop the decryption mix-net as well as the re-encryption mix-net [7]. It consists of four phases which work as follows. i) KeyGen: Consider a mix-net with P (≥ 2) mix-servers denoted as M1, …, MP. Here, each Mi {Mi ∈ (M1, …, MP)} holds a private key {Xi} and calculates its public key Yi as Yi = g^Xi mod p. Now their combined encryption key Yc is generated as Yc = Y1·Y2·…·YP = g^X1·g^X2·…·g^XP = g^(X1 + X2 + … + XP) (mod p). ii) Encryption: By using Yc, message m is encrypted as EYc(r, m) = {g^r, m·Yc^r}, as in the ElGamal scheme of Sec. III (B). iii) Re-encryption: During re-encryption, M1, …, MP act as follows and Fig. 1 depicts the procedure. 1. The first mix-server M1 obtains EYc(r, m) = {g^r, m·Yc^r}. 2. Then M1, …, MP calculate EYc((r + Σri), m) = {g^(r + Σri), m·Yc^(r + Σri)}, which is regarded as re-encryption. 3. During re-encryption, each Mi shuffles all re-encrypted messages, which removes the links between incoming and outgoing ones and assures data anonymity. 4. Thus the last mix-server MP outputs EYc(rc, m) = {g^rc, m·Yc^rc}, where rc = r + r1 + … + rP. iv) Decryption: During decryption, M1, …, MP successively decrypt EYc(rc, m) by using their private keys X1, …, XP, respectively, and finally retrieve the original message m. Here, individual decryption by each Mi is identical to the decryption operation of Sec. III (B).
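To make the two dominance notions of Sec. III (A) concrete, the following is a minimal Python sketch (not part of the paper's prototype) of the plain and query-targeted dominance tests and a brute-force skyline; the hotel data at the end is an invented illustration.

```python
def dominates(da, db):
    """da dominates db: da[k] <= db[k] for every dimension k, strictly smaller in at least one."""
    return all(a <= b for a, b in zip(da, db)) and any(a < b for a, b in zip(da, db))

def dominates_wrt_query(da, db, q):
    """Targeted dominance: compare per-dimension squared distances to the query q."""
    sa = [(a - x) ** 2 for a, x in zip(da, q)]
    sb = [(b - x) ** 2 for b, x in zip(db, q)]
    return all(x <= y for x, y in zip(sa, sb)) and any(x < y for x, y in zip(sa, sb))

def skyline(dataset, q=None):
    """Brute-force (BNL-style) skyline: keep items not dominated by any other item."""
    dom = dominates if q is None else (lambda a, b: dominates_wrt_query(a, b, q))
    return [d for d in dataset if not any(dom(other, d) for other in dataset if other is not d)]

# Hypothetical hotels described by (price, distance); lower is better in both dimensions.
hotels = [(120, 2.0), (90, 3.5), (150, 1.0), (100, 3.8)]
print(skyline(hotels))                 # (100, 3.8) is dominated by (90, 3.5)
print(skyline(hotels, q=(100, 1.5)))   # targeted skyline relative to a user query
```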
Mix-server   Input message                                        Output message
M1           {g^r, m·Yc^r}                                        {g^r·g^r1, m·Yc^r·Yc^r1}
M2           {g^r·g^r1, m·Yc^r·Yc^r1}                             {g^r·g^r1·g^r2, m·Yc^r·Yc^r1·Yc^r2}
…            …                                                    …
MP           {g^r·g^r1·…·g^r(P−1), m·Yc^r·Yc^r1·…·Yc^r(P−1)}      {g^r·g^r1·…·g^rP, m·Yc^r·Yc^r1·…·Yc^rP}
Fig. 1. Re-encryption operation by each mix-server.
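The following toy Python sketch illustrates the scheme that Fig. 1 depicts: a combined key assembled from the mix-servers' public keys, encryption under it, re-encryption by each mix-server, and sequential decryption. The tiny prime and generator are placeholders for illustration only and are nowhere near secure parameters.

```python
import random

p = 467          # small prime, for illustration only
g = 2            # element of Z_p* used as the generator

# KeyGen for P mix-servers: private keys X_i, public keys Y_i, combined key Yc.
P = 3
X = [random.randrange(1, p - 1) for _ in range(P)]
Y = [pow(g, x, p) for x in X]
Yc = 1
for y in Y:
    Yc = (Yc * y) % p                                    # Yc = g^(X1+...+XP) mod p

def encrypt(m, r):
    return (pow(g, r, p), (m * pow(Yc, r, p)) % p)       # {g^r, m*Yc^r}

def reencrypt(c, r_i):
    y1, y2 = c
    return ((y1 * pow(g, r_i, p)) % p, (y2 * pow(Yc, r_i, p)) % p)

def partial_decrypt(c, x_i):
    y1, y2 = c
    return (y1, (y2 * pow(y1, p - 1 - x_i, p)) % p)      # divide y2 by y1^x_i (Fermat)

m = 42
c = encrypt(m, random.randrange(1, p - 1))
for _ in range(P):                                       # each mix-server re-encrypts
    c = reencrypt(c, random.randrange(1, p - 1))
for x_i in X:                                            # sequential decryption
    c = partial_decrypt(c, x_i)
print(c[1])                                              # recovers 42
```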
D. Weight Assignment Each dataset is modified by attaching weights from a public weight table to prioritize its values. These weights are allocated for every party Pn through an agreement in advance. Consider a dataset Dn of Pn as specified in Sec. III (A). A specific weight is calculated using the first (m−1) attributes (i.e., data dimensions) of Dn, and it is attached to the m-th attribute because its value varies subject to them. It consists of two steps, i.e., data scaling and weighted value calculation. Here, the initial step unifies every attribute, and the later step calculates the value of the m-th attribute. i) Data Scaling: A row vector named scaling factor Vscale is used from the weight table to scale the attributes, where Vscale = [a1, a2, …, am−1]. The attributes of each Dn can also be represented as a row vector Vdata, where Vdata = [A1, A2, …, Am−1]. Now, a new row vector named scaled attribute vector A′ is calculated by performing the Hadamard product on Vdata and Vscale as follows.
A′ = Vdata ∘ Vscale = [A1·a1, A2·a2, …, Am−1·am−1] = [A1′, A2′, …, Am−1′]   (1)
ii) Weighted Value Calculation: After calculating A′, to calculate the weighted value of the m-th attribute, a sum-of-products (i.e., dot product) operation is performed between A′ and a row weight vector ω, where ω = [w1, w2, …, wm−1] is selected from the public weight table. This operation is illustrated below.
Am = A′ · ω = A1′·w1 + A2′·w2 + … + Am−1′·wm−1   (2)
Finally, a round-up operation on Am eliminates the decimal portion from it. Since weights are publicly verifiable, as will be discussed in Sec. V (6 (i)), each party will be able to verify the weights. E. Blockchain Storage A blockchain [15] is referred to as an immutable ledger. Once data has been stored in it, the data cannot be altered, which assures data integrity. It is used in many applications, e.g., data sharing, e-voting, healthcare, stock exchange [15], etc. The blockchain contains multiple blocks and each block is linked linearly. The header of a typical block (Bn) of the blockchain storage contains the block ID, block data, timestamp, a hash value of the previous block, a root value of the Merkle tree, a nonce, and finally a hash value of the current block, which is generated from the rest of the data. The Merkle root is generated from the Merkle tree that facilitates data integrity. If there is any dispute about data alteration, a new Merkle root is rebuilt using the current data and compared with the previous one to validate the data. Fig. 2 illustrates a sample block of the blockchain. Here, each Di indicates the i-th data item of party Pn and H1,…,i indicates the hash value of its child node(s) generated from D1, …, Di. Finally, a bottom-up approach creates the Merkle tree, and the block header stores the root.
Fig. 2. A typical block of blockchain storage.
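As a concrete illustration of the Merkle root in Fig. 2, the short sketch below hashes a block's data items bottom-up; SHA-256 and the duplicate-last-node rule for odd-sized levels are assumptions, not details fixed by the paper.

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merkle_root(items):
    level = [h(item) for item in items]          # leaf hashes H1..Hn
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

data_items = [b"D1", b"D2", b"D3", b"D4"]        # a party's (encrypted) data items
print(merkle_root(data_items).hex())

# Any later dispute can be settled by recomputing the root from the current data
# and comparing it with the root stored in the block header.
```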
IV. THE SYSTEM MODEL This section briefly introduces the involved entities and the required privacy properties for the proposed framework. Fig. 3 illustrates the entity relationship diagram. A. Involved Entities The framework contains five distinct entities which are introduced below. 1) Party (Pn): Each party Pn {Pn ∈ (P1, …, PN)} owns sensitive datasets regarding its services. Alongside, Pn obtains weights corresponding to its dataset according to the mechanism of Sec. III (D) and attaches them with its dataset to generate a weighted dataset. Now, Pn encrypts both the plain and weighted datasets using the combined encryption key Yc and other public encryption parameters. Then, Pn shares both datasets with the service provider (SP), where the SP stores the encrypted dataset in its memory and sends the encrypted weighted dataset to the blockchain. 2) Service Provider (SP): The SP is an intermediary entity among Pn, query user (Qn), mix-net, and blockchain storage. It conducts registration for every Pn and Qn, allocates weights for Pn, conveys the encrypted datasets and encrypted weighted datasets to M1, …, MP for re-encryption, stores them in blockchain, asks M1, …, MP to retrieve them through decryption while required, and later on, it validates weights of Pn. Finally, according to Qn’s demand, it runs targeted queries over datasets and reveals the result. 3) Mix-server (Mi): Each Mi performs ‘re-encryption and shuffling’ as well as decryption operations over the encrypted datasets of every Pn. Each Mi holds a private decryption key Xi and reveals the related public encryption key Yi, where the combined public encryption key Yc is calculated by using these distinct Yi as in Sec. III (C). Since the adopted ElGamal-based multi-party encryption scheme is commutative, the sequential decryption operation by each Mi regains the data. 4) Blockchain Storage (BS): Each block Bn stores the encrypted weighted dataset of Pn. After successfully storing the dataset, it sends back the block index Idx[Bn] to SP and the SP records the Idx[Bn] of each Pn to regain the encrypted weighted dataset for later uses. Since blockchain is an immutable ledger, the stored dataset cannot be altered. 5) Query User (Qn): A Qn is another registered entity. An authentic Qn can generate target-specific query q and encrypts it to E(k, q) via the encryption mechanism
described in Sec. III (B). Then it sends E(k, q) to SP. Finally, SP retrieves q, generates the results, and sends them back to Qn.
Fig. 3. Relationships among involved entities.
B. Privacy Properties The ultimate goal of the developed framework is to generate the results of the targeted queries. To achieve this, the following privacy properties are required to be retained. i) Data Privacy: Private datasets of each Pn are secure due to encryption. No outsider can know the plain data of Pn, and no Pn can know the encrypted datasets of other ones. ii) Identity Privacy or Data Anonymity: Links between the encrypted dataset and its owner are eliminated through the 're-encryption and shuffling' operations by the mix-servers of the mix-net. iii) Query Privacy: The user's targeted query q is encrypted through another encryption key and no one except the SP can know the plain form of q.
V. THE PROPOSED FRAMEWORK FOR TARGETED SKYLINE QUERIES This section first presents an overview of the proposed framework. Then it demonstrates its steps distinctly.
A. Overview of the Framework Step 1: The SP registers every Pn and Qn. Besides, based on the mechanism of Sec. III (D), every Pn obtains weights for its dataset Dn and generates the weighted dataset Dn'. Now, Pn encrypts both Dn and Dn' to send them to SP. Step 2: SP sends the encrypted Dn and Dn' to the mix-net M1, …, MP to re-encrypt and shuffle them. Then, SP stores the re-encrypted form of Dn in its memory and sends the re-encrypted Dn' to the blockchain storage to create a new block Bn. Finally, the BS returns the block index Idx[Bn] of the stored dataset to SP for tracking it for further usage. Step 3: Qn generates a targeted query q, encrypts it to E(k, q), and sends it to the SP to know the result. Step 4: Upon receiving E(k, q), SP asks M1, …, MP to decrypt all Dn' stored in the BS. Then, to validate the weights, SP asks M1, …, MP to decrypt all encrypted Dn. Now, SP generates the new weighted dataset Dn'' of each Pn (as in Sec. V (B (2))) and compares Dn'' with Dn'. If they are equal, the SP is convinced about the correctness of Pn's operation. Step 5: SP decrypts E(k, q) and executes the targeted queries over Dn' as discussed in Sec. III (A). Finally, SP reveals the result.
B. Distinct Stages of the Framework The distinct stages of the proposed framework proceed as follows. 1) Registration: This stage conducts the registration of two types of entities as follows. (a) Party's Registration i) To prevent illegitimate entities from pretending to be legitimate parties, initially, each legal party {Pn ∈ (P1, …, PN)} physically appears to SP on the occasion of registration along with its unique identity IDn. ii) After validating, SP completes the registration. (b) Query User's Registration i) Same as Pn's registration, each query user Qn also appears to SP with its unique identity IDQn. ii) To check the validity of Qn, the SP checks whether Qn is a valid one or not. If valid, then SP registers it. 2) Weight Assignment: This stage (i) assigns publicly disclosed weights to every party's datasets and (ii) generates the weighted datasets, thus prioritizing them. It works as below. i) First, attributes from the dataset of each Pn are selected to obtain the weights. Hence, L scaled attribute vectors A′ are calculated using (1) from Dn. Here, L is the total number of data items in Dn. Then, a column vector named weight value vector W = [W1, W2, …, WL] is calculated using (2), where each Wm ∈ W represents Am of Sec. III (D (ii)). Table I illustrates the calculation of W. Thus, every Pn generates its Wn corresponding to its Dn. ii) Now, the m-th attribute of Pn, i.e., Am, is replaced by each element of Wn to generate Dn'. Table II illustrates the procedure of generating Dn'.
TABLE I: WEIGHT VALUE VECTOR CALCULATION
A1′       A2′       …    Am−1′      W
a1,1′     a1,2′     …    a1,m−1′    W1
a2,1′     a2,2′     …    a2,m−1′    W2
…         …         …    …          …
aL,1′     aL,2′     …    aL,m−1′    WL
TABLE II: WEIGHTED DATASET GENERATION
Dn                               W     Dn'
a1,1, a1,2, …, a1,m−1, a1,m      W1    a1,1′, a1,2′, …, a1,m−1′, W1
a2,1, a2,2, …, a2,m−1, a2,m      W2    a2,1′, a2,2′, …, a2,m−1′, W2
…                                …     …
aL,1, aL,2, …, aL,m−1, aL,m      WL    aL,1′, aL,2′, …, aL,m−1′, WL
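A small numeric sketch of Eqs. (1) and (2), as tabulated in Tables I and II, is given below; the scaling factors and weights are invented stand-ins for entries of the public weight table.

```python
import math

V_data  = [4.0, 10.0, 2.0]     # first (m-1) attributes of one data item of Dn
V_scale = [0.5, 0.1, 1.0]      # scaling factor row vector from the weight table
omega   = [2.0, 3.0, 1.0]      # weight vector from the public weight table

A_prime = [a * s for a, s in zip(V_data, V_scale)]                  # Hadamard product, Eq. (1)
A_m = math.ceil(sum(a * w for a, w in zip(A_prime, omega)))         # dot product + round-up, Eq. (2)

print(A_prime)   # [2.0, 1.0, 2.0]
print(A_m)       # 2*2 + 1*3 + 2*1 = 9 -> weighted value replacing the m-th attribute
```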
3) Dataset Submission: This stage enables every party to submit its encrypted dataset together with the encrypted weighted dataset to SP and it continues as follows. i) Each Pn encrypts Dn {Dn = (Dn1, Dn2, …, DnL)} and Dn′ {Dn′ = (Dn1′, Dn2′, …, DnL′)} by using the key Yc. Here, any data of Dn and Dn′ is denoted as (Dnj ∈ {Dn1, …, DnL}) and (Dnj′ ∈ { Dn1′, …, DnL′ }), respectively.
ii) Thereby, the plain Dn and Dn′ are transformed into the encrypted {EYc(k1, Dn1), …, EYc(kL, DnL)} and {EYc(k1, Dn1′), …, EYc(kL, DnL′)}, respectively. Namely, the encrypted forms of Dnj and Dnj′ are EYc(kj, Dnj) = {g^kj, Dnj·Yc^kj} and EYc(kj, Dnj′) = {g^kj, Dnj′·Yc^kj}, respectively. 4) Data Anonymization and Storing: This stage performs 're-encryption and shuffling' operations on both the encrypted datasets and the encrypted weighted datasets and stores them. i) Each mix-server Mi re-encrypts every encrypted data item of Dn and Dn′, shuffles them altogether, and passes them to the next mix-server Mi+1 according to the mechanism of Sec. III (C (iii)). As an example, for an encrypted data item EYc(kj, Dnj′), Mi re-encrypts it into EYc{ri, EYc(kj, Dnj′)} = {g^kj·…·g^ri, Dnj′·Yc^kj·…·Yc^ri} and passes it to Mi+1. Lastly, MP conveys these datasets to SP. ii) Now, SP stores every re-encrypted data item of Dn in its storage and sends every re-encrypted data item of Dn′ to BS to create a new block Bn using them. Further, SP keeps the record Idx[Bn] as in Sec. IV (A (4)) to retrieve it later. 5) Data Decryption and Targeted Queries: Upon receiving any encrypted query request E(k, q) from Qn, SP asks M1, …, MP to decrypt the encrypted Dn and Dn′ of each Pn as in Sec. III (C (iv)) while using their private decryption keys {X1, …, XP} sequentially. Now, SP executes the following sub-stages to obtain the targeted skyline queries: (i) weight verification and (ii) result of targeted queries. Through the weight verification sub-stage, SP verifies the assigned weights to confirm whether they are picked from the publicly disclosed weight table or not. Then, through the result of targeted queries sub-stage, it calculates the query results to publish them. These sub-stages are described below. i) Weight Verification: SP gains the plain form of Dn and Dn′ from its memory and BS, respectively. From Dn, SP re-calculates a new weighted dataset Dn'' using the same mechanism as in Sec. V (B (2)). If Dn'' = Dn', then SP is convinced about the correctness of the operations of Pn. Otherwise, Pn's weighted dataset will be discarded from consideration in further operations. ii) Result of Targeted Queries: Qn requests SP to perform a targeted query q (q = [q1, q2, …, qm]) which is composed of m attributes (i.e., dimensions). Initially, Qn encrypts its query q into E(k, q) based on Sec. III (B) and sends it to SP. Upon receiving E(k, q), SP decrypts it and regains q. Then, for each dataset Dn′, SP calculates the N Euclidean distance vectors Vn as in Sec. III (A (2)). Here, Vn is a column vector of size L × 1, N is the total number of parties and L is the highest number of data items for a Pn. Hence, SP calculates a total of {(L1 × 1), …, (LN × 1)} column vectors. Now, Qn can choose any data item Dnj' which maps to the corresponding Vnj {Vnj ∈ (V1,1, …, V1,L), …, (VN,1, …, VN,L)} and j ∈ {L1, …, LN}.
VI. ANALYSES AND EVALUATIONS This section explains that the proposed framework fulfills all the essential security requirements. Besides, it describes the experimental setup, runtime efficiency, etc., used to develop the framework, and compares it with existing works. A. Security Analyses i) Data Privacy: Each Pn encrypts its dataset by using the combined encryption key Yc calculated from the separate public keys of multiple entities. Thereby, no entity, including Pn, can guess the plain data of other parties. Thus, the proposed framework enriches data privacy. ii) Data Anonymity: 'Re-encryption and shuffling' tasks by mix-servers over the encrypted data eliminate links between the encrypted dataset and its owner. iii) Data Integrity: As discussed in Sec. II, the major limitation of a large number of state-of-the-art works is improper concern about data integrity. To overcome this deficiency, the proposed framework adopts the notion of blockchain storage. Thereby, no data can be altered after storing it in the blockchain. iv) Query Privacy: No one other than the SP can decrypt the encrypted q, which ensures protection against disclosing it among other parties or entities. B. Experimental Setup A prototype of the framework was implemented in Python (version 3.8.8) on a single PC with an Intel Core i5-10500 CPU @ 3.10 GHz and 12 GB RAM under the Windows 11 operating system. No parallel processing or multi-threading was used to conduct the experiment. C. Runtime Efficiency Fig. 4 depicts the time required for the encryption and decryption operations of the framework while considering 20 data items for each party and 5 mix-servers. It shows that the encryption time is longer than the decryption time. Besides, the time for conducting the operations increases linearly while the number of dimensions and parties rises in the ranges of [5, 20] and [20, 35], as depicted in Fig. 4 (a) and Fig. 4 (b), respectively. Here, only mix-servers execute the decryption operations consecutively. But during encryption, first the party and then the mix-servers run the encryption and the re-encryption operations, respectively.
Fig. 4. Time requirement for encryption and decryption operations while varying the number of: (a) data dimensions and (b) parties.
In addition, Fig. 5 shows the time required for creating, i.e., storing, the dataset in a block of a private blockchain while considering different numbers of parties in the range of [20, 35]. Here, as in Fig. 4, each party has 20 data items and creates a distinct block using them. The figure shows that the required time rises linearly as the number of parties increases.
Fig. 5. Time requirement to create/store data in a block of the private blockchain for different parties.
D. Comparisons Based on the preliminary evaluation, security, and other aspects, Table III presents a comparison among the proposed framework and the ones in [2] and [4]. Concerning the considered aspects, the table indicates that the proposed framework is more efficient and applicable than the compared ones. Every framework considers multi-party skyline queries on private datasets and retains data privacy, data integrity, data anonymity, etc., entirely or partially.
TABLE III: A COMPARISON BASED ON DIFFERENT ASPECTS
Aspects                          Framework [2]                [4]                               Proposed
Exploited cryptosystem           Single-party Paillier        Multi-party ElGamal               Multi-party ElGamal
How to retain data integrity     Not considered               Attaching unique tags with data   Creating blocks over blockchain
How to retain data anonymity     XOR operation, permutation   Re-encryption and shuffles        Re-encryption and shuffles
How datasets are prioritized     Not considered               Not considered                    Assigning publicly disclosed weights
Calculation of skyline queries   All parties, all data        All parties, all data             Targeted queries
Who conducted skyline queries    Involved participants        Untrusted third party             Untrusted third party
Data storage                     Participant itself           Web Bulletin Board                Blocks of blockchain
Overall skyline computations     (L×M×N)                      (L×M×N)                           (L+M)×N
Computational overhead           Relatively high              Relatively high                   Relatively low
L = Maximum data items of a single party, M = Highest data dimension, N = Number of parties
VII. CONCLUSIONS The developed framework for user-centric multi-party skyline queries prioritizes parties' datasets as per their impact. Then, it enriches data privacy by assigning multiple separate decryption keys among mutual entities. In addition, it runs 're-encryption and shuffling' over encrypted data that ensures data anonymity and removes links between the data and its owner. Also, it stores parties' encrypted datasets on distinct blocks of blockchain that assure data integrity. Besides, the provision of targeted queries enables legitimate users to know the query results. The preliminary assessment implies the efficacy of the proposed framework. A further plan for improvement is to propose and incorporate an efficient public blockchain architecture with a consensus mechanism, improved storage design, efficient searching, etc., and to implement the prototype of the framework over a more realistic environment.
REFERENCES
[1] S. Borzsony, D. Kossmann, and K. Stocker, "The Skyline operator," in 17th Int. Conf. on Data Engineering, pp. 421–430, 2001.
[2] M. Qaosar, K. M. R. Alam, A. Zaman, C. Li, S. Ahmed, M. A. Siddique, and Y. Morimoto, "A Framework for Privacy-Preserving Multi-Party Skyline Query Based on Homomorphic Encryption," IEEE Access, Vol. 7, pp. 167481–167496, 2019.
[3] S. Zhang, S. Ray, R. Lu, Y. Zheng, Y. Guan, and J. Shao, "Achieving Efficient and Privacy-Preserving Dynamic Skyline Query in Online Medical Diagnosis," IEEE Internet of Things J., pp. 1–1, 2021.
[4] D. Das, K. M. R. Alam, and Y. Morimoto, "A Framework for Multi-party Skyline Query Maintaining Privacy and Data Integrity," 24th Int. Conf. on Computer and Information Technology (ICCIT), pp. 16, 2021.
[5] X. Liu, R. Lu, J. Ma, L. Chen, and H. Bao, "Efficient and privacy-preserving skyline computation framework across domains," Future Generation Computer Systems, Vol. 62, pp. 161–174, Sep. 2016.
[6] X. Ding, Z. Wang, P. Zhou, K. K. R. Choo, and H. Jin, "Efficient and Privacy-Preserving Multi-Party Skyline Queries over Encrypted Data," IEEE Transactions on Information Forensics and Security, Vol. 16, pp. 4589–4604, 2021.
[7] N. Islam, K. M. R. Alam, and A. Rahman, "The effectiveness of mixnets – an empirical study," Computer Fraud & Security, Elsevier, Vol. 2013, No. 12, pp. 9–14, Dec. 2013.
[8] C. C. Lai, Z. F. Akbar, C. M. Liu, V. D. Ta, and L. C. Wang, "Distributed continuous range-skyline query monitoring over the Internet of mobile Things," IEEE Internet of Things J., Vol. 6, No. 4, pp. 6652–6667, Aug. 2019.
[9] X. Lin, J. Xu, and H. Hu, "Range-based skyline queries in mobile environments," IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, pp. 835–849, 2013.
[10] X. Zhang, R. Lu, J. Shao, H. Zhu, and A. A. Ghorbani, "Continuous Probabilistic Skyline Query for Secure Worker Selection in Mobile Crowdsensing," IEEE Internet of Things J., Vol. 8, No. 14, pp. 11758–11772, Jul. 2021.
[11] A. Cuzzocrea, P. Karras, and A. Vlachou, "Effective and efficient skyline query processing over attribute-order-preserving-free encrypted data in cloud-enabled databases," Future Generation Computer Systems, Vol. 126, pp. 237–251, Jan. 2022.
[12] T. Elgamal, "A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms," IEEE Transactions on Information Theory, Vol. 31, No. 4, 1985.
[13] N. Islam, K. M. R. Alam, and S. S. Rahman, "Commutative re-encryption techniques: Significance and analysis," Information Security Journal: A Global Perspective, Taylor & Francis, Vol. 24, No. 4-6, pp. 185–193, 2015.
[14] W. Wang, H. Li, Y. Peng, S. S. Bhowmick, P. Chen, X. Chen, and J. Cui, "Scale: An efficient framework for secure dynamic skyline query processing in the cloud," in Int. Conf. on Database Systems for Advanced Applications, pp. 288–305, Springer, Cham, 2020.
[15] A. A. Monrat, O. Schelén, and K. Andersson, "A survey of blockchain from the perspectives of applications, challenges, and opportunities," IEEE Access, Vol. 7, pp. 117134–117151, 2019.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Machine Learning and Deep Learning Based Network Slicing Models for 5G Network Md. Ariful Islam Arif1, Shahriar Kabir2, Md Faruk Hussain Khan3, Samrat Kumar Dey4, and Md. Mahbubur Rahman5 1,2,3,5
Department of Computer Science and Engineering 4 School of Science and Technology 1,2,3,5 Military Institute of Science and Technology, Dhaka, Bangladesh 4 Bangladesh Open University, Gazipur, Bangladesh Email- [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—5G network can provide high speed data transfer with low latency at present days. Network slicing is the prime capability of 5G, where different slices can be utilized for different purposes. Therefore, the network operators can utilize their resources for the users. Machine Learning (ML) or Deep Learning (DL) approach is recently used to address the network issues. Efficient 5G network slicing using ML or DL can provide an effective network. An endeavour has been made to propose an effective 5G network slicing model by applying different ML and DL algorithms. All the methods are adopted in developing the model by data collection, analysis, processing and finally applying the algorithm on the processed dataset. Later the appropriate classifier is determined for the model subjected to accuracy assessment. The dataset collected for use in the research work focuses on type of uses, equipment, technology, day time, duration, guaranteed bit rate (GBR), rate of packet loss, delay budget of packet and slice. The five DL algorithms used are CNN, RNN, LSTM, Bi-LSTM, CNN-LSTM and the four ML algorithms used are XGBoost, RF, NB, SVM. Indeed, among these algorithms, the RNN algorithm has been able to achieve maximum accuracy. The outcome of the research revealed that the suggested model could have an impact on the allocation of precise 5G network slicing. Keywords—5G network, network slicing, 5G slice, deep learning, machine learning, mobile network.
I. INTRODUCTION Fifth-generation (5G) wireless is the most recent iteration of cellular technology, designed to significantly improve the speed and responsiveness of wireless networks. With 5G technology the wireless broadband connections may transfer data at multigigabit speed and the peak rate could reach up to 10 gigabits per second (Gbps) or more. 5G speed is faster than wired network which provides latency of around 5 milliseconds (ms) or less. This is useful for applications that need real-time input. Due to improved bandwidth and antenna technology, 5G is going to transfer more data over the wireless network [1]. Since COVID-19 pandemic, the internet is widely used in various fields and 5G technology is widely used to ensure the speed of the internet [2]. Network slicing is the prime capability of 5G and it enables numerous virtual and independent networks on a shared physical infrastructure. But in 4G and older generations of mobile network were devoid of such facility. Each network slice can contain its particular logical topology, rules of security, and performance characteristics within imposed limit by the underlying physical networks. Various slices can be utilized for different purposes, like confirming priority to specific
application/service or segregating traffic for definite users or device classes. The network operators can utilize their maximum network resources for the users and they also have service flexibility. Slicing technologies on the arena of ethernet networks are as old as virtual local area networks (VLANs). Better network flexibility can be achieved by software-defined networking (SDN) and network functions virtualization (NVF) through partitioning the network architectures into virtual elements [2]. The partitions are dynamically chosen based on requirements and specific purpose. The devoted resource changes according to the change of need. Sometimes, this is done to meet the specific customers’ requirements or requirements of network security. Machine learning (ML) approaches have recently made significant strides in a wide range of application domains, particularly by deep learning (DL), reinforcement learning, and federated learning. However, rapid trend is seen in the networking world toward applying ML or DL approach to address difficult issues with network design, management, and optimization. 5G is a new technology and many countries are going to effectively use this technology for future benefit. Efficient 5G network slicing will allow the network to be used effectively. ML and DL approach can provide better solution for data slicing model in 5G technology. In this study, the authors propose an efficient 5G network slicing model by using different ML and DL approach. Therefore, this paper focuses on the purpose of determining the appropriate network slicing method based on the usage pattern, network type, used technology, time, bit rate, packet loss rate, packet delay budget, etc. Different ML and DL algorithms have been applied to the obtained dataset, the most accurate approach has been chosen, and a slicing model has been produced using that algorithm. The rest of the article layout is structured in the following sequences: Section II offers a brief literature reviews on the subject; Section III describes the system architecture and 5G environment; Section IV discusses the proposed methodology; Section V evaluates the performance; Section VI illustrates the analysis of the result; and finally, Section VII draws conclusion including future work proposal. II. LITERATURE REVIEW Different articles related to 5G network slicing with ML are consulted. Here some of the relevant articles are briefly illustrated. Those may assist to provide some useful information for the preparation of this study. M. H. Abidi et al. [3] intended to develop a hybrid learning algorithm for a
Fig. 1. 5G network slicing model with ML and DL.
successful network slicing. They have hybridized two metaheuristic algorithms and termed the proposed model as glowworm swarm-based deer hunting optimization algorithm (GSDHOA). For each device, a hybrid classifier has classified the precise network slices using DL and neural networks. The GSDHOA optimizes the weight function of both networks. M. E. Morocho-Cayamcela et al. [4] focused on the potential results for 5G from an ML perception. They have established the essential concepts of supervised, unsupervised, and reinforcement learning considering the existing ML concept. They have discussed the likely approaches on how ML can assist to support each targeted 5G network necessities by highlighting its specific use cases and assessing the impact including limitations on the network operation. Z. Ullah et al. [5] described that the UAVs-assisted nextgeneration communications are highly influenced by various techniques like artificial intelligence (AI), ML, deep reinforcement learning (DRL), mobile edge computing (MEC), and SDN. A review is developed to examine the UAVs joint optimization difficulties to improve the efficiency of system. H. Fourati et al. [6] narrated that mobile operators are rethinking their network strategy in order to offer more adaptable, dynamic, economical, and intelligent solutions. They have discussed 5G wireless network background and the challenges including some proposed solutions to manage by ML methods. T. Li et al. [7] discussed that future mobile vehicular social networks (VCNs) are anticipated to be significantly influenced by 5G networks. Higher coverage ratio vehicles are preferred and ML methods are applied to select that. J. Lam and R. Abbas [8] proposed SDS (Software Defined Security) to provide an automated, flexible and scalable network defence system. SDS will connect current advances in ML to design a CNN (Convolutional Neural Network) using NAS (Neural Architecture Search) to detect anomalous network traffic. SDS can be useful to an intrusion detection system to produce a more practical and end-to-end defence for a 5G network. I. Alawe et al. [9] proposed different method to scale 5G core network assets by expected traffic load changes through estimation via ML method. Recurrent Neural Network (RNN), more specifically Long Short-Term Memory (LSTM) Cell and Deep Learning Neural Network (DNN) were used and compared. Their replication results confirmed the efficiency of the RNN-based solution compared to a threshold-based solution.
S. K. Singh et al. [10] suggested their concept, where each logical slice of the network is separated into a virtualized subslice of available resources, to address the problem of network load balancing. By choosing the feature with the Support Vector Machine (SVM) technique, they were able to determine the requirements of various connected device applications. Sub-slice clusters for the similar applications were created by using K-means algorithm. The proposed framework performs better in their comparison analysis than in the current experimental evaluation research. L. Le et al. [11] have developed a useful and effective framework for clustering, predicting, and managing traffic behaviour for a large number of base stations with different statistical traffic characteristics of various types of cells by combining big data, ML, SDN, and NFV technologies. P. Subedi et al. [12] proposed that network slicing enables network design to have flexible and dynamic features. To enable network slicing in 5G, the pre-existing network architecture must undergo domain modification. They enhanced network flexibility and dynamics to suggest a network slicing architecture for 5G. Idris Badmus et al. [13] proposed that, the local 5G microoperator concept will likely be deployed in a variety of ways that involve end-to-end network slicing. The anticipated architecture incorporated with multi-tenancy layer, service layer, slicing management and orchestration layer. S. S. Shinde et al. [14] proposed a different network function allocation issue for multi-service 5G networks. It will be able to set up network functionalities in a distributed computing environment on the service request. In their proposal, the core network (CN) and radio access network (RAN) are both taken into consideration. S. A. Alqahtani and W. A. Alhomiqani [15] proposed an architecture on network slicing for 5G involving cloud radio access network (C-RAN), MEC, and cloud data centre. Their proposed model is created on queueing theory. The results of the investigation and the simulation model proved that the projected model has a significant influence on how many MEC and C-RAN cores are needed to meet the quality-ofservice goals for 5G slices. The background study on various publications were mostly related to network slicing and some publications have described the slicing using ML and DL. ML algorithms used SVM while DL algorithms used RNN and LSTM. In addition,
[Figure: service layer (service provider, virtual mobile operator); network function layer (network functions, network operation, network slice controller) with slices for uRLLC (high reliability and low latency), mMTC (low energy and low bandwidth), and eMBB (high bandwidth); infrastructure layer (radio access network, transport network, core network).]
Fig. 2. Graphical overview on generic 5G network slicing procedure.
various feature extraction methods have been used by various researchers. 5G network is currently expanding its scope. The research and development activities are ongoing in this regard. Therefore, the study is conducted to develop acceptable model for 5-G network slicing by applying different ML and DL algorithms. III. SYSTEM ARCHITECTURE AND 5G ENVIRONMENT To develop the network slicing model, the connected devices in the network including their services and services delivery requirements need to be collected for detail review. Data collection should be based on network type, network strength, bandwidth, latency, etc. Figure 1 shows an expert system approach of the network slicing model with ML and DL. Depending on the type of devices or equipment and method of use, each RAN controller will determine or recommend specific slice sizes for users based on the expert system results. Network slicing is an end-to-end concept that covers all of the prevailing network segments. It significantly converts the entire perception of networking by extracting, isolating, arranging, and separating the logical network components from the original physical network resources, which enhance the principles of network architecture including capabilities [2]. In a network slicing, the operators can allocate the required amount of resource as per the slice. It immensely improves the operational efficiency of the network. Network operators can physically isolate the traffic of different radio networks, slice a network, blend the multiple network capacity and slice the shared resources [10]. This enables 5G network operators to choose the characteristics needed to support their target levels of spectrum efficiency, traffic capacity, and connection density, which is how many devices can connect
from a given space. The generic framework of 5G network slicing and device connectivity with layers is presented in Figure 2. Three categories have been established by the International Telecommunication Union (ITU) for 5G mobile network services [16]: • Enhanced Mobile Broadband (eMBB): It offers mobile data access to: (i) densely populated of users, (ii) immensely mobile users and (iii) users spread over large areas. It depends on structures such as large ranges of multiple input, multiple output (MIMO) antennas and the combination of spectra beginning with 4G conventional wavelengths and extending into the millimetre band. • Massive Machine-Type Communications (mMTC): The facilities are made to serve enormous quantity of devices in a small zone with the belief that they produce little data (about tens of bytes/ second) and can stand high latency (up to ten seconds on a round trip). • Ultrareliable Low-Latency Communications (URLLC): It can provide secure communications with 1 ms latencies and higher reliability with zero or low packet loss. It can be achieved through a combination of: optimization of physical device, instantaneous multiple frequency handling, packet coding with process techniques and optimized signal management. IV. PROPOSED METHODOLOGY The usage of ML and DL algorithms in prior research studies on 5G network slicing has been quite limited. In this paper, ML and DL algorithms have been used based on the
characteristics of the 5G network, i.e., type of uses, equipment, technology, day time, duration, guaranteed bit rate (GBR), packet loss rate, packet delay budget, slice, etc. A 5G slicing dataset is collected from an open source and different ML and DL algorithms are applied to it. Some processing methods are adopted to convert the obtained dataset into a utility set for application in the model. To prepare the data for processing, text data is first converted to a numerical dataset using label encoding. The original dataset is divided into two different parts, with 80% of the original dataset as the training dataset and the remaining 20% of the data as the test dataset. Later, CNN, RNN, LSTM, bidirectional LSTM (Bi-LSTM), CNN-LSTM, eXtreme gradient boosting (XGBoost), random forest (RF), naïve Bayes (NB) and SVM algorithms are applied to the 80% training portion of the dataset and their accuracy is determined. Figure 3 shows the whole process along with the data pre-processing methodology.
Fig. 3. Methodology used for 5G network slicing by using DL, ML models (data collection, data labelling, pre-processing with label encoding and checks for noisy/null values, train–test split, training of the CNN, RNN, LSTM, Bi-LSTM, CNN-LSTM, XGBoost, RF, SVM and NB models, accuracy evaluation, and selection of the slicing model).
CNN: It is a special neural network for processing data with a grid-like topology. A CNN has many layers, and one of them is the convolution layer, which is used to extract various features from the input. A filter is applied to the input matrix and the output is received as a feature map, which is used in the feature extraction stage. The pooling layer is an additional layer that is utilized to reduce the dimensions without sacrificing quality. This layer also reduces overfitting as there are fewer parameters. Finally, the model becomes more tolerant towards variations and distortions.
RNN: It is mostly employed in the field of natural language processing. An RNN can process sequential data, accepting both the present and past input data. RNNs are able to recall previously obtained data, which aids their ability to anticipate what will happen next. The RNN requires a lot of effort to train. Long sequences cannot be handled when the Rectified Linear Unit (ReLU) or Hyperbolic Tangent (tanh) is used as the activation function.
LSTM: The data is modified slightly by the LSTM using additions and multiplications. In an LSTM, information is communicated through a mechanism called cell states. This allows the LSTM to selectively remember or forget information. LSTM networks are well-suited to processing, categorizing, and producing forecasts based on time-series data.
Bi-LSTM: It is an extension of the traditional LSTM that can improve model performance on sequence classification problems. It trains two LSTMs on the input sequence as opposed to only one, where the first one is the original copy and the second one is the reversed copy. This can give the network extra context and lead to a quicker and even more thorough learning process for the problem.
CNN-LSTM: Convolutional layers and maximum pooling or maximum over-time pooling layers are used in CNN models to extract high-level features, although LSTM models are better suited to text categorization because they can identify long-term connections between word sequences.
XGBoost: XGBoost has recently topped Kaggle competitions for structured or tabular data. It is a distributed gradient boosting toolbox with an emphasis on portability, flexibility, and efficiency. It develops machine learning algorithms utilizing the gradient boosting architecture. To swiftly and accurately address a variety of detection and analysis difficulties, XGBoost uses parallel tree boosting.
RF: For the aim of prediction, RF creates a vast collection of de-correlated trees. By adding randomization to the tree-growing process, it lessens tree correlation. It performs split-variable randomization: at each tree split in the RF, the feature search space is reduced [17].
NB: One of the earliest machine learning algorithms is NB. Basic statistics and the Bayes theorem form the foundation of this approach. The NB model employs class probabilities as well as conditional probabilities. To handle continuous characteristics, the Gaussian distribution is used [17]. The Gaussian distribution with mean and standard deviation is described in (1).
p(x = v | C_k) = (1 / √(2πσ_k²)) · exp(−(v − µ_k)² / (2σ_k²))   (1)
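As a quick check of Eq. (1), the following snippet evaluates the Gaussian class-conditional likelihood for arbitrary illustrative values.

```python
import math

def gaussian_likelihood(v, mu_k, sigma_k):
    """Eq. (1): Gaussian likelihood of value v for class k with mean mu_k and std sigma_k."""
    return (1.0 / math.sqrt(2 * math.pi * sigma_k**2)) * math.exp(-((v - mu_k)**2) / (2 * sigma_k**2))

print(gaussian_likelihood(v=1.5, mu_k=1.0, sigma_k=0.5))   # ~0.484
```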
SVM: The SVM is a supervised ML algorithm. It addresses both classification and regression problems. The data items are held in an N-dimensional space, and the values of the features give the specific coordinates. The separator is known as a hyperplane because it produces the most homogeneous points in each subsection [18]. A maximum-margin separator is created by the SVM and is used to create decision boundaries with the greatest feasible distance. Here, W is the weight vector and X is the set of points. By using (2), we can find the separator.
W·X + b = 0   (2)
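A minimal sketch of the preprocessing and evaluation flow described above (label encoding, an 80/20 split, and one of the ML classifiers) is shown below; the file name and target column are hypothetical, and the random forest settings only loosely follow the specifications reported later in Table III.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("5g_slicing.csv")               # hypothetical dataset file
for col in df.select_dtypes(include="object"):   # label-encode the text columns
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns=["slice"]), df["slice"]   # "slice" is the assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```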
V. PERFORMANCE EVALUATION
Five DL algorithms and four ML algorithms are used in this research work. The data slicing model is chosen based on which of the nine applied algorithms offers the highest accuracy. The performance of the different ML and DL methods is displayed in Tables I and II, respectively. Here it is seen that among the five deep learning algorithms, RNN has achieved the highest accuracy with 86.43%, while among the machine learning algorithms, XGBoost has been able to achieve the highest accuracy with 85.28%. The parameters used in the applied algorithms are shown in Table III.
TABLE I. PERFORMANCE OF DL ALGORITHMS
Algorithm     Accuracy (%)
CNN           84.92
RNN           86.43
LSTM          83.91
CNN-LSTM      87.41
Bi-LSTM       78.32
TABLE II. PERFORMANCE OF ML ALGORITHMS
Algorithm       Accuracy (%)
XGBoost         85.26
SVM             82.95
Random forest   81.93
Naïve Bayes     80.27
TABLE III. DETAILED SPECIFICATIONS OF THE ALGORITHMS USED
SVM: C = 1.0, kernel = radial basis function, gamma = scale
XGBoost: distribution measure = Gini index, Gini(D) = 1 − Σ p_i², maximum depth = 0, minimum samples split = 2
Naïve Bayes: distribution = Gaussian (per-class mean µ_y and variance σ_y²)
Random forest: number of estimators = 100, maximum depth = 2, random state = 0
CNN: filters = 176, kernel size = 4, loss function = mean square error, activation = rectified linear unit
RNN: sample RNN = 156, optimizer = Adam, loss function = mean square error, input dim = 1000
LSTM: recurrent dropout = 0.2, SpatialDropout1D = 0.4, loss function = mean square error, activation = SoftMax
CNN-LSTM: CNN activation = sigmoid, optimizer = Adam, LSTM activation = rectified linear unit
Bi-LSTM: recurrent dropout = 0.3, activation = SoftMax, loss function = mean square error
VI. RESULT ANALYSIS
Nowadays, different expert systems, recommendation systems, detection systems, etc., are built using models created by various ML or DL techniques. The performance of the algorithms utilized in the models reveals their variation. DL is composed of neural networks whose layers interact more with each other. Therefore, the accuracy is relatively good in the case of DL or neural networks. Figure 4 shows the relative accuracy of the nine algorithms used in this study.
To determine whether the suggested network slicing system is good or not, it needs to be compared with some current and pertinent studies. The suppositions used by researchers when gathering samples and disclosing the findings of their research activities have a strong bearing on any effort to compare performance evaluations. Since the 5G network has not yet been fully launched in Bangladesh, it has been difficult to simulate with real-time data. The literature review of various research works revealed that only a small number of works had been done specifically for the 5G network slicing model, although a sizable amount of work had been done on the ML scope of 5G networks, 5G network features including advancement, and network intrusion. Nonetheless, an effort has been made in this study to evaluate this proposed model against other studied models using criteria like accuracy and algorithm. A comparison between the proposed work and other works is shown in Table IV. Since very little previous work was found on 5G network slicing, the comparison of the proposed work is done with a few other methodologies.
VII. CONCLUSION AND FUTURE WORK
A proposed 5G network slicing model is developed using ML and DL algorithms. This model has three distinct phases: collection, preparation, and application of data. Initially, the data about 5G network slicing is collected. The dataset included attributes linked with various network devices such as type of uses, equipment, technology, day time, duration, GBR, packet loss rate, packet delay budget, and slice. With the help of nine prominent classifiers, 5G network slicing has been classified. The accuracy of the classifiers has been used
to gauge their merits. By examining the outcomes of subsequent identical works, the relative qualities of the results obtained have been evaluated. The study achieved an accuracy of 86.43% with the RNN classifier, which is good as well as promising. There remains potential future work with a very large set of 5G network data.
Fig. 4. Performance of algorithms used to build models for 5G network slicing.
TABLE IV. COMPARISON OF THE PROPOSED WORK WITH RELATED WORKS
Articles: M. H. Abidi et al. [3] | Problem domain: 5G network slicing | Algorithms: GS-DHOA-NN+DBN | Outcome: classified network slices using a hybrid classifier using DBN | Accuracy: 94.44%
Articles: M. E. Morocho-Cayamcela et al. [4] | Problem domain: discussion on 5G network requirements, emphasizing 5G/B5G mobile and wireless communications | Algorithms: NM | Outcome: stimulate discussion on the role of ML in a wide deployment of 5G/B5G | Accuracy: NM
Articles: I. Alawe et al. [9] | Problem domain: traffic forecasting for 5G core network scalability | Algorithms: RNN, LSTM | Outcome: load forecast on traffic arrival in a mobile network | Accuracy: RNN has better performance
Articles: Idris Badmus et al. [13] | Problem domain: end-to-end network slice architecture and distribution | Algorithms: NM | Outcome: architecture incorporating a multi-tenancy layer, service layer, and slicing management and orchestration layer | Accuracy: NM
Articles: Proposed methodology | Problem domain: 5G network slicing with ML and DL | Algorithms: CNN, Bi-LSTM, CNN-LSTM, RNN, LSTM, XGBoost, SVM, RF, NB | Outcome: 5G network slicing based on user performance | Accuracy: 86.43% (RNN)
REFERENCES
[1] N. Al-Falahy and O. Y. Alani, "Technologies for 5G Networks: Challenges and Opportunities," IT Professional, vol. 19, no. 1, pp. 12-20, February 2017.
[2] X. Li, M. Samaka, H. A. Chan, D. Bhamare, L. Gupta, C. Guo, and R. Jain, "Network Slicing for 5G: Challenges and Opportunities," IEEE Internet Computing, vol. 21, no. 5, pp. 20-27, September 2017.
[3] M. H. Abidi, H. Alkhalefah, K. Moiduddin, M. Alazab, M. K. Mohammed, W. Ameen, and T. R. Gadekallu, "Optimal 5G network slicing using machine learning and deep learning concepts," Computer Standards and Interfaces, vol. 76, pp. 103518, June 2021.
[4] M. E. Morocho-Cayamcela, H. Lee, and W. Lim, "Machine Learning for 5G/B5G Mobile and Wireless Communications: Potential, Limitations, and Future Directions," IEEE Access, vol. 7, pp. 137184–137206, September 2019.
[5] Z. Ullah, F. Al-Turjman, U. Moatasim, L. Mostarda, and R. Gagliardi, "UAVs joint optimization problems and machine learning to improve the 5G and Beyond communication," Computer Networks, vol. 182, pp. 107478, December 2020.
[6] H. Fourati, R. Maaloul, and L. Chaari, "A survey of 5G network systems: challenges and machine learning approaches," International Journal of Machine Learning and Cybernetics, vol. 12, no. 2, pp. 385-431, August 2020.
[7] T. Li, M. Zhao, and K. K. L. Wong, "Machine learning based code dissemination by selection of reliability mobile vehicles in 5G networks," Computer Communications, vol. 152, pp. 109–118, February 2020.
[8] J. Lam and R. Abbas, "Machine Learning based Anomaly Detection for 5G Networks," arXiv, 03474, March 2020.
[9] I. Alawe, A. Ksentini, Y. Hadjadj-Aoul, and P. Bertin, "Improving Traffic Forecasting for 5G Core Network Scalability: A Machine Learning Approach," IEEE Network, vol. 32, no. 6, pp. 42–49, November 2018.
[10] S. K. Singh, M. M. Salim, J. Cha, Y. Pan, and J. H. Park, "Machine Learning-Based Network Sub-Slicing Framework in a Sustainable 5G Environment," Sustainability, vol. 12, no. 15, pp. 6250, January 2020.
[11] L. V. Le, D. Sinh, B. S. P. Lin, and L. P. Tung, "Applying Big Data, Machine Learning, and SDN/NFV to 5G Traffic Clustering, Forecasting, and Management," 2018 4th IEEE Conference on Network Softwarization and Workshops (NetSoft), pp. 168–176, June 2018.
[12] P. Subedi, A. Alsadoon, P. W. C. Prasad, S. Rehman, N. Giweli, M. Imran, and S. Arif, "Network slicing: a next generation 5G perspective," EURASIP Journal on Wireless Communications and Networking, vol. 2021, no. 1, pp. 102, April 2021.
[13] I. Badmus, A. Laghrissi, M. Matinmikko-Blue, and A. Pouttu, "End-to-end network slice architecture and distribution across 5G micro-operator leveraging multi-domain and multi-tenancy," EURASIP Journal on Wireless Communications and Networking, vol. 2021, no. 1, pp. 94, April 2021.
[14] S. S. Shinde, D. Marabissi, and D. Tarchi, "A network operator-biased approach for multi-service network function placement in a 5G network slicing architecture," Computer Networks, vol. 201, pp. 108598, December 2021.
[15] S. A. AlQahtani and W. A. Alhomiqani, "A multi-stage analysis of network slicing architecture for 5G mobile networks," Telecommunication Systems, vol. 73, no. 2, pp. 205–221, February 2020.
[16] P. Popovski, K. F. Trillingsgaard, O. Simeone, and G. Durisi, "5G Wireless Network Slicing for eMBB, URLLC, and mMTC: A Communication-Theoretic View," IEEE Access, vol. 6, pp. 55765–55779, September 2018.
[17] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012, pp. 332-398.
[18] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition, Upper Saddle River, NJ: Prentice Hall, 2010, pp. 725-744.
A Breast Cancer Detection Model using a Tuned SVM Classifier
Partho Ghose¹, Md. Ashraf Uddin¹, Mohammad Manzurul Islam², Manowarul Islam¹, Uzzal Kumar Acharjee
¹ Department of Computer Science and Engineering, Jagannath University, Dhaka
² Department of Computer Science and Engineering, East West University, Dhaka
[email protected], (ashraf, manowar, uzzal)@cse.jnu.ac.bd, [email protected]
Abstract—Breast cancer has become a common disease that affects women all over the world. Early detection and diagnosis of breast cancer are crucial for effective medication and treatment, but detection of breast cancer at the primary stage is challenging due to the ambiguity of mammograms. Many researchers have explored Machine Learning (ML) based models to detect breast cancer, yet most of the developed models have not been clinically effective. To address this, in this paper we propose an optimized SVM-based model for the prediction of breast cancer, where a Bayesian search method is applied to discover the best hyper-parameters of the SVM classifier. The performance of the model with the default hyper-parameters for the SVM is compared to the performance with tuned hyper-parameters. The comparison shows that performance is significantly improved when the tuned hyper-parameters are used for training the SVM classifier. Our findings show that the SVM's performance with default parameters is 96%, whereas a maximum accuracy of 98% is obtained using tuned hyper-parameters. Index Terms—breast cancer detection, machine learning algorithms, SVM, hyper-parameter tuning, fine tuned SVM
I. INTRODUCTION
Cell growth is a normal process in the human body and normally occurs in a controlled way. Sometimes, however, cells develop uncontrollably, and this abnormal growth of cells is termed cancer. Gradually, cancer also attacks the healthy cells in the body. Similarly, the abnormal growth of breast cells is known as breast cancer, which rapidly contaminates other cells around the breast and spreads to other parts of the body. Breast cancer is the most prevalent and unfortunately the most fatal of all cancers detected in the human body [1], [2]. Researchers around the world are trying vigorously to find various kinds of solutions for early detection of the disease for better treatment, and one of the solutions can be the usage of machine learning approaches. Machine learning is one of the artificial intelligence related fields that explores different forms of mathematical, statistical and probabilistic methods to improve performance through training with new data. Recently, diverse kinds of machine learning models are being widely used in diagnosing various human diseases [3], [4], including breast cancer, because of their high performance. There is still a lack of effective medication and treatment for breast cancer, especially if the disease is discovered late. However, this fatal disease is curable if it is diagnosed at an early stage, so accurate and effective identification of breast cancer data is important in medical diagnosis. Appropriate
classification using machine learning approaches can assist physicians in detecting breast cancer in a patient at an early stage. In the process of breast cancer identification, machine learning plays a crucial role in reducing the rate of mortality caused by breast cancer. Motivated by this, researchers [5]–[12] have already developed different kinds of machine learning models to detect breast cancer at the early stage, which reduces the mortality rate of this disease. For example, Assegie et al. [13] proposed an optimized K-Nearest Neighbor (KNN) model to identify breast cancer, where a grid search technique was utilized for searching the optimal hyper-parameter. However, the authors concentrated primarily on accuracy, which is sometimes not enough for a medical prediction system, and the accuracy (94.35%) could also be improved. Jabbar et al. [14] introduced an ensemble model utilizing a Bayesian network and Radial Basis Function to categorize breast cancer data. The experimental results showed that the proposed model could accurately identify 97% of breast cancer cases. However, the accuracy of a machine learning model can be further increased through tuning important parameters of a classifier. Many state-of-the-art works applied SVM to detect breast cancer but failed to discover the proper values of the hyper-parameters needed to boost the accuracy level. In this paper, we adopt SVM, a well-known classification algorithm, to predict breast cancer with higher accuracy. We tune the hyper-parameters of SVM using Bayesian search to improve the accuracy of breast cancer detection. The comparison between the proposed model and state-of-the-art models is done with respect to several metrics. Our contributions include the following. 1) First, we perform pre-processing on the breast cancer dataset, which includes removing null values and detecting and discarding outliers. 2) Second, we develop the breast cancer prediction model using an SVM classifier with hyper-parameters tuned by applying the Bayesian search method. We compare the performance of the model with the existing works. The rest of this paper is organized as follows: a literature review of recent works relevant to this study is given in Section II. In Section III, a description of the proposed system along with the dataset collection and preparation is provided. The experimental results of the proposed system and comparative performance analysis are
presented in Section IV. Finally, we conclude the paper with our future works in Section V.
II. RELATED WORK
This section presents a review of related literature on breast cancer analysis, taking into account the datasets, methodologies and performance measures used, and identifies their limitations. Various research works have been proposed for the identification of breast cancer with machine learning technology, but those works have been conducted using different machine learning algorithms on different breast cancer datasets, and their performance varies based on which algorithms were used and which datasets they were applied to. The researchers in [13] proposed an optimized K-Nearest Neighbor (KNN) model to detect breast cancer. For this purpose, they applied a grid search technique for searching the optimal hyper-parameter to tune the KNN model. The grid search was utilized to determine the best K value, with the aim of boosting the proposal's accuracy. A significant change in performance was found after tuning the hyper-parameters. The accuracy of the proposed optimized model was 94.35%, which was better than the accuracy of the KNN with the default hyper-parameter, which was 90.10%. But the authors did not assess performance criteria such as precision, recall, F1 score, or AUC, which are crucial to evaluate any machine learning model. Shofwatul et al. [7] proposed a decision tree-based classification model for breast cancer detection. An empirical analysis of this model showed that the performance is acceptable, with an accuracy of 80.50% for breast cancer detection. Though the proposed model has an acceptable level of performance, it has plenty of room for improvement. An Artificial Neural Network (ANN) based breast cancer detection approach [9] has been proposed, where the model has only one hidden layer. The authors used the K-means clustering algorithm on the breast cancer dataset from the University of California Irvine (UCI) [11]. According to an experimental analysis, the accuracy of the proposal was 73.70%. Nallamala et al. [15] used three different machine learning algorithms to develop a model for detecting breast cancer: Naive Bayes, Random Forest, and K-Nearest Neighbour (KNN). In this case, the authors [15] used the Wisconsin breast cancer data repository. Among the three algorithms, the K-Nearest Neighbour algorithm performs better than the other two. Kaur et al. [16] utilized the UCI breast cancer dataset to develop a breast cancer identification model based on neural networks. K-Nearest Neighbor and Naïve Bayes were used to compare the model performances, and the results showed that the neural network outperformed the others. The performance of three different algorithms, namely decision tree, naive bayes, and logistic regression, is compared in [17]. Based on the results, the decision tree algorithm is found to be more accurate than the others.
A deep learning-based model is proposed in [18], and the University of California Irvine breast cancer data repository was utilized to train the model. However, the approach does not explain how to choose hyper-parameters during the training phase. The introduced deep learning model was able to reach an accuracy of 90%.
III. RESEARCH METHODOLOGY
Our proposed breast cancer detection model consists of several stages, as shown in Figure 1.
Fig. 1: Workflow of the Proposed Methodology

The first step involves pre-processing of the dataset before feeding it into the model. Second, the dataset is split into a training set and a test set. Four different machine learning algorithms are trained with the dataset; depending on the performance of these classifiers, SVM shows the highest level of accuracy. The performance parameters include accuracy, specificity, recall, precision and F1 score. Finally, the hyper-parameters are tuned to boost the accuracy of the SVM.
A. Dataset
The dataset used for training the ML algorithms to detect cancer is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [19]. The integrated properties of the dataset were calculated from a digitized image of a breast-mass fine needle aspirate (FNA). The dataset contains 569 data points, of which 212 are malignant and 357 benign. The properties of the dataset are as follows: 1) radius (mean of distances from center to points on the perimeter), 2) texture (standard deviation of gray-scale values), 3) area, 4) perimeter, 5) compactness (perimeter²/area − 1.0), 6) smoothness (local variation in radius lengths), 7) concavity (severity of concave portions of the contour), 8) symmetry, 9) concave points (number of concave portions of the contour), 10) fractal dimension ("coastline approximation" − 1). Each property contains three attributes: mean, standard deviation and worst
or largest (mean of the three largest values), thus a total of 30 dataset attributes.
B. Data Pre-processing
Data pre-processing includes cleaning, standardization, removing null values and so on, and is discussed below.
1) Data Cleaning: The dataset used for training the model does not have null values. However, outliers are identified in most of the standard-error attributes and fractal-dimension attributes of the dataset. We apply the following techniques to handle outliers: 1) Elliptic Envelope, 2) DBSCAN, 3) Local Outlier Factor, 4) Isolation Forest. To assess the performance of the outlier detection methods, we use simple regression fitting and compare the R-squared (R²) score on the cleaned data and the original data. R² is a statistical measure used to determine the suitability of a regression model; the closer the value of R² is to 1, the better the model is fitted, where 1 is the ideal value. The mathematical representation of R² is given in equation 1 [20]:

R² = 1 − SS_res / SS_total    (1)

where SS_res is the sum of squares of the residual errors and SS_total indicates the total sum of squares. Figure 2 shows the performance of outlier detection using the above-mentioned methods.

Fig. 2: Outliers and original data

Figure 2 shows that every outlier detection method identifies outliers efficiently and their R² scores are almost equal. Therefore, in order to select the best outlier method, we also observe the skewness and kurtosis values, as shown in Figure 3. Total skews and total kurtosis refer to the sum of skews and the sum of kurtosis over all features, respectively.

Fig. 3: Comparison of total skews and kurtosis

Figure 3 shows the total skews and total kurtosis calculated for the four algorithms. It demonstrates that Isolation Forest, with its default setting of marking 5% of points as outliers, gives the best outcome in reducing both the skewness and kurtosis of the data. Since the overall skewness and kurtosis are significantly reduced and the R² score is improved slightly with Isolation Forest, it is finally applied to detect the outliers in our model. After applying Isolation Forest with a contamination of 10%, we obtained outlier-free data with a corrected shape of (540, 31), down from the original data shape of (569, 31).
2) Standardization: Standardization is the process of rescaling one or more attributes so that their mean is 0 and the standard deviation is 1. Clinical data assembled from different institutions for various purposes might be dissimilar and recorded in various formats, so these records have been normalized to maintain the same format [21]. Thus, to avoid materiality gaps, the data set was normalized using equation 2:

Z = (X − ϑ) / α    (2)

Here, Z is the normalized feature, X is the feature to be standardized, ϑ is the mean value of the feature, and α is the standard deviation of the feature.
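The pre-processing steps described above can be sketched as follows. This is a minimal, illustrative version: it loads scikit-learn's built-in copy of the WDBC data rather than the authors' file, and the contamination value is an assumed setting, not necessarily the one used in the paper.

```python
# Sketch of the pre-processing stage: Isolation Forest outlier removal
# followed by z-score standardization (equation 2).
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
print("original shape:", X.shape)             # (569, 30)

# Flag outliers; 'contamination' is the assumed fraction of outlying points.
mask = IsolationForest(contamination=0.05, random_state=42).fit_predict(X) == 1
X_clean, y_clean = X[mask], y[mask]
print("cleaned shape:", X_clean.shape)

# Standardize each feature: Z = (X - mean) / std, as in equation 2.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_clean),
                        columns=X_clean.columns)
```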
C. Machine Learning Algorithms Applied
1) Support Vector Machine (SVM): SVM classifier is widely applied for diagnosing diverse kinds of diseases including cancer. SVM selects critical samples from all classes known as predictive support vectors. SVM works separately to create a linear function that uses these support vectors as widely as possible. Therefore, a mapping of a highdimensional space within an input vector using a SVM is aimed at discovering the most compatible hyper-plane that splits the data set into classes. The motive of this linear classification is to make the gap as large as possible between the hyper-plane and the nearest data point, called the marginal gap as shown in figure 4, to find the most appropriate hyperplane [22].
Fig. 4: SVM Overview

2) K-Nearest Neighbour (KNN): KNN can directly predict the outcome for a new instance (x) based on the training dataset. KNN searches the complete training set for the K most similar instances (neighbours) and summarizes the output variable for these K instances. For regression this can be the mean output variable; in classification it can be the mode (most common) class value. For real-valued input variables, distance measurements are used to determine which K instances of the training dataset are most similar to the new instance. The most common distance measurement is the Euclidean distance, as illustrated in equation 3: the square root of the sum of the squared differences between a new point x and an existing point x_i over all input attributes j.

EuclideanDistance(x, x_i) = sqrt( Σ_j (x_j − x_ij)² )    (3)

At the classification stage, k is a user-specified constant, and unlabelled vectors (query and test points) are classified by assigning the label that occurs most frequently among the k training samples nearest to the query point [13].
3) Classification and Regression Trees (CART): A CART is a predictive model that predicts an outcome for an instance based on the values of its other attributes. The CART output is a decision tree in which each internal node corresponds to a split on a predictor and each leaf node contains a prediction for the outcome variable. The CART model is illustrated as a binary tree, where each internal node represents a single input variable x and a split point on that variable (assuming the variable is numeric). The leaf node holds the output variable (y) that is used for prediction.
4) Gaussian Naive Bayes classifier (GNB): Gaussian Naive Bayes is a variant of Naive Bayes that follows a Gaussian (normal) distribution and supports continuous data. It handles continuous-valued features and models them according to a Gaussian (normal) distribution. One method of creating a simple model is to assume that the data is described by a Gaussian distribution with no covariance (independent dimensions) between the dimensions. This model can be fitted simply by obtaining the mean and standard deviation of the points in each label, which is all that is required to define such a distribution.
D. Performance Evaluation Metrics
To observe the performance of the proposed system, four metrics are evaluated: accuracy, precision, sensitivity (recall) and F1-score. For the metric calculations, the following variables are used:
• TP (True Positive) indicates a correct prediction of a cancerous cell.
• FP (False Positive) indicates normal cases incorrectly classified as cancerous.
• TN (True Negative) is the correctly classified benign cases.
• FN (False Negative) is the malignant or cancerous cases that are misclassified as normal cases.
Using the above variables, the performance evaluation metrics can be calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Recall = TP / (TP + FN)    (5)
Precision = TP / (TP + FP)    (6)
F1-Score = 2TP / (2TP + FP + FN)    (7)
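As a quick illustration of equations (4)–(7), the snippet below computes the four metrics from the entries of a confusion matrix. The counts are illustrative; the true-negative value in particular is a made-up number, while the 60/4 malignant split roughly follows the confusion matrix discussed later in the paper.

```python
# Compute accuracy, recall, precision and F1-score from TP, TN, FP, FN.
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # eq. (4)
    recall = tp / (tp + fn)                      # eq. (5)
    precision = tp / (tp + fp)                   # eq. (6)
    f1 = 2 * tp / (2 * tp + fp + fn)             # eq. (7)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Illustrative counts (tn is hypothetical).
print(evaluate(tp=60, tn=50, fp=0, fn=4))
```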
IV. EXPERIMENTAL EVALUATION AND RESULT
Before feeding the data to the ML classifiers, we first divided the dataset into two parts: 1) a train set consisting of 80% of the total data, and 2) a test set with the remaining 20%. We applied 10-fold cross-validation on the train set to evaluate the performance of the different classifiers (described in Section III-C) and identified the best classifier based on accuracy (in our experiment, SVM) for further fine-tuning. Note that all our experiments were implemented using the Python programming language. Figure 5 shows that SVM outperforms the other ML algorithms with 2%-6% higher accuracy. Therefore, we selected SVM as the classifier and fine-tuned it to further improve the performance. Figure 6 depicts the overall process for tuning the hyper-parameters of SVM. We experimented with four different SVM kernels (linear, polynomial, RBF and sigmoid) for hyper-parameter tuning and found RBF to be the best performing kernel. The Bayesian search method was utilised to find the best hyper-parameters (the regularisation parameter, C = 67.95, and the variance of the RBF kernel, γ = 0.0002992). We evaluated the optimised SVM model using the test data and report the performance using accuracy, recall, precision and F1 score in Table I.
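The paper does not name the library used for the Bayesian search, so the sketch below uses scikit-optimize's BayesSearchCV as one plausible way to tune C and γ of an RBF-kernel SVM; the search ranges, iteration count and data loading are assumptions, not the authors' exact setup.

```python
# Bayesian hyper-parameter search for an RBF-kernel SVM (sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = BayesSearchCV(
    SVC(kernel="rbf"),
    {"C": Real(1e-2, 1e3, prior="log-uniform"),
     "gamma": Real(1e-5, 1e0, prior="log-uniform")},
    n_iter=50, cv=10, scoring="accuracy", random_state=42,
)
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)       # the paper reports C ≈ 67.95, gamma ≈ 3e-4
print("test accuracy:", search.score(X_te, y_te))
```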
Again, Table II depicts the comparative performance of the tuned SVM with the default SVM from which we can see that after tuning, SVM provided better results than traditional SVM.
TABLE II: Performances of Tuned SVM and default-setting SVM
Class | Accuracy | Precision | Recall | F1-Score
Tuned SVM | 98% | 98% | 98% | 98%
SVM | 97% | 96% | 98% | 95%

[Figure: bar chart of the accuracy of the different ML algorithms (CART, KNN, SVM, GaussianNB); SVM is highest at 97%, the others range from 91% to 95%.]
Fig. 5: Performances of different ML algorithms

Figure 7 depicts the ROC curve of the proposed model. From the ROC curve, we found that the Area Under the Curve (AUC) for the breast cancer detection system is 99.3%.
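For completeness, the ROC/AUC evaluation mentioned above can be reproduced along the following lines. The data split and classifier construction are assumptions; only the C and γ values are the ones reported by the paper.

```python
# Sketch: ROC curve and AUC for the tuned SVM (decision scores used as ranking).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="rbf", C=67.95, gamma=0.0002992).fit(X_tr, y_tr)  # values reported in the paper
scores = clf.decision_function(X_te)
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```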
Fig. 7: ROC curve generated by the Proposed Model
Fig. 6: Overall process for tuning the hyper-parameters of SVM
The confusion matrix, as depicted in Figure 8, is also calculated for a better understanding of the optimized SVM model. From the confusion matrix we found that 60 out of 64 malignant patients are correctly classified, and no misclassification is found for the benign class, which shows better recall for this class. Only 4 patients who belonged to the malignant class are classified as benign.
TABLE I: Performances of the Proposed Model
Class | Accuracy | Precision | Recall | F1-Score
Benign | 97.69% | 96% | 100% | 98%
Malignant | 98.72% | 100% | 94% | 97%
Average | 98% | 98% | 98% | 98%
Fig. 8: Confusion matrix created by the Proposed Model
A. Comparative Analysis
Table III compares the proposed optimized SVM model with existing works in the literature on breast cancer detection. The investigation revealed that the proposed method outperformed existing machine learning based breast cancer identification systems.

TABLE III: Comparative analysis of the proposed and existing systems in terms of performance metrics
Methods | Accuracy | Precision | Recall | F-Score
Shofwatul et al. [7] | 93.18% | 88% | 92% | 87.5%
T. Padhi et al. [11] | 73% | – | – | –
Jabbar et al. [14] | 97.42% | 96.72% | 99.32% | 98.00%
S. Nallamala et al. [15] | 95% | 98.5% | – | –
S. Sharma et al. [17] | 94.74% | 92.18% | 93.65% | 92.90%
Proposed Model | 98.72% | 100% | 94% | 97%
V. CONCLUSION
ML techniques have been widely used in the medical field and serve as a useful diagnostic tool that helps physicians analyze available data as well as design medical expert systems. In this study, an optimized SVM model has been proposed for the prediction of breast cancer, using a Bayesian search method for the best hyper-parameter search aimed at improving performance. The obtained simulation results prove that the performance differs depending on the method chosen, including hyper-parameter tuning of the model. The results showed that the fine-tuned SVM has the highest performance in terms of accuracy for detecting breast cancer.
ACKNOWLEDGMENT
This research was supported by Jagannath University Research Grant (JnU/Research/gapro/2021-2022/Science/38 and JnU/Research/gapro/2022-2023/Science/22).
REFERENCES
[1] P. Sathiyanarayanan, S. Pavithra, M. S. Saranya, and M. Makeswari, "Identification of breast cancer using the decision tree algorithm," in 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN). IEEE, 2019, pp. 1–6.
[2] S. Goli, H. Mahjub, J. Faradmal, H. Mashayekhi, and A.-R. Soltanian, "Survival prediction and feature selection in patients with breast cancer using support vector regression," Computational and mathematical methods in medicine, vol. 2016, 2016.
[3] P. Ghose, M. Alavi, M. Tabassum, M. A. Uddin, M. Biswas, K. Mahbub, L. Gaur, S. Mallik, and Z. Zhao, "Detecting covid-19 infection status from chest x-ray and ct scan via single transfer learning-driven approach," Frontiers in Genetics, vol. 13, 2022.
[4] P. Ghose, M. A. Uddin, U. K. Acharjee, and S. Sharmin, "Deep viewing for the identification of covid-19 infection status from chest x-ray image using cnn based architecture," Intelligent Systems with Applications, vol. 16, p. 200130, 2022.
[5] T. A. Assegie and P. S. Nair, "The performance of different machine learning models on diabetes prediction," Int. Journal Of Scientific & Tech. Research, vol. 9, no. 01, 2020.
[6] A. K. Dubey, U. Gupta, and S. Jain, “Comparative study of k-means and fuzzy c-means algorithms on the breast cancer data,” International Journal on Advanced Science, Engineering and Information Technology, vol. 8, no. 1, pp. 18–29, 2018. [7] S. ‘Uyun and L. Choridah, “Feature selection mammogram based on breast cancer mining,” International Journal of Electrical and Computer Engineering, vol. 8, no. 1, pp. 60–69, 2018. [8] P. Ghose, U. K. Acharjee, M. A. Islam, S. Sharmin, and M. A. Uddin, “Deep viewing for covid-19 detection from x-ray using cnn based architecture,” in 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI). IEEE, 2021, pp. 283–287. [9] M. Nemissi, H. Salah, and H. Seridi, “Breast cancer diagnosis using an enhanced extreme learning machine based-neural network,” in 2018 International Conference on Signal, Image, Vision and their Applications (SIVA). IEEE, 2018, pp. 1–4. [10] M. K. Mahbub, M. Z. H. Zamil, M. A. M. Miah, P. Ghose, M. Biswas, and K. Santosh, “Mobapp4infectiousdisease: Classify covid-19, pneumonia, and tuberculosis,” in 2022 ieee 35th international symposium on computer-based medical systems (cbms). IEEE, 2022, pp. 119–124. [11] T. Padhi and P. Kumar, “Breast cancer analysis using weka,” in 2019 9th International Conference on Cloud Computing, Data Science & Engineering. IEEE, 2019, pp. 229–232. [12] T. A. Assegie, S. Sushma, and S. Prasanna Kumar, “Weighted decision tree model for breast cancer detection,” Technology reports of Kansai university, vol. 62, no. 03, 2020. [13] T. A. Assegie, “An optimized k-nearest neighbor based breast cancer detection,” Journal of Robotics and Control (JRC), vol. 2, no. 3, pp. 115–118, 2021. [14] M. A. Jabbar, “Breast cancer data classification using ensemble machine learning,” Engineering and Applied Science Research, vol. 48, no. 1, pp. 65–72, 2021. [15] S. H. Nallamala, P. Mishra, and S. V. Koneru, “Breast cancer detection using machine learning way,” Int J Recent Technol Eng, vol. 8, pp. 1402–1405, 2019. [16] A. Kaur and P. Kaur, “Breast cancer detection and classification using analysis and gene-back proportional neural network algorithm,” International Journal of Innovative Technology and Exploring Engineering, 2019. [17] S. Sharma, A. Aggarwal, and T. Choudhury, “Breast cancer detection using machine learning algorithms,” in 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS). IEEE, 2018, pp. 114–118. [18] A. M. Abdel-Zaher and A. M. Eldeib, “Breast cancer classification using deep belief networks,” Expert Systems with Applications, vol. 46, pp. 139–144, 2016. [19] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml [20] C. Silver, “Geeksforgeeks,” 2021. [Online]. Available: https://bit.ly/2WwQEmu [21] I. Batyrshin, “Constructing time series shape association measures: Minkowski distance and data standardization,” in 2013 BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence. IEEE, 2013, pp. 204–212. [22] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis, “Machine learning applications in cancer prognosis and prediction,” Computational and structural biotechnology journal, vol. 13, pp. 8–17, 2015.
A Deep Learning-Based Bengali Visual Question Answering System Mahamudul Hasan Rafi Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh [email protected]
S. M. Hasan Imtiaz Labib Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh [email protected]
Faisal Muhammad Shah Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh [email protected]
Abstract—Visual Question Answering (VQA) is a challenging task in Artificial Intelligence (AI), where an AI agent answers questions regarding visual content based on images provided. Therefore, to implement a VQA system, a computer system requires complex reasoning over visual aspects of images and textual parts of the questions to anticipate the correct answer. Although there is a good deal of VQA research in English, Bengali still needs to thoroughly explore this area of artificial intelligence. To address this, we have constructed a Bengali VQA dataset by preparing human-annotated question-answers using a small portion of the images from the VQA v2.0 dataset. To overcome high linguistic priors that hide the importance of precise visual information in visual question answering, we have used real-life scenarios to construct a balanced Bengali VQA dataset. This is the first human-annotated dataset of this kind in Bengali. We have proposed a Top-Down Attention-based approach in this study and conducted several studies to assess our model’s performance. Index Terms—Bengali Visual Question Answering, VQA v2.0, Image Attention, Deep learning
I. I NTRODUCTION Visual Question Answering (VQA) has become one of the most active study areas utilizing recent advancements in computer vision and natural language processing (NLP). Visual question answering focuses on specific areas of an image, such as visual reasoning and underlying context and tries to provide an accurate answer based on the context of the image. In recent years, many methods for increasing the VQA model’s performance have been proposed. In finding the fine-grained regions of an image related to the answer, the most general method was to retrieve
Shifat Islam Department of Computer Science and Engineering Bangladesh University of Engineering and Technology Dhaka, Bangladesh [email protected]
SM Sajid Hasan Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh [email protected]
Sifat Ahmed Senior Engineer, Robotics & Artificial Intelligence Environmental Intelligence & Innovation Co., Ltd. Tokyo, Japan [email protected]
the image feature vector using a CNN model and encode the associated question as a text feature vector using a long short-term memory network (LSTM) and then merge them to extrapolate the answer [1]. These models provide good results, but when it comes to more specific regions in an image, these models frequently fail to provide precise answers. The visual attention mechanism [2]–[6] is used in some existing approaches to collect important visual information related to the topic. Although there have been many studies on VQA, the existing literature only addresses English. Because of the significant number of Bengali speakers worldwide, it has also become crucial to build VQA systems for native Bengali speakers. However, it is challenging to develop such systems due to a lack of linguistic resources. A few works were also carried out in languages like Hindi [7], Chinese [8], Japanese [4], and Bengali [9] as well. Even so, VQA experiments still have limitations in Bengali. In this paper, we have proposed an attention mechanism that focuses on the specific regions of the image related to the answer. This attention mechanism is depicted in Fig. 2 However, the lack of datasets is the main issue with Bengali VQA. As a result, we used a small fragment of the existing VQA v2.0 dataset to generate our human-annotated Bengali VQA dataset. The VQA v2.0 dataset is balanced overall, but because we utilized a small portion of it for human annotation in Bengali, the dataset was initially quite unbalanced. We made a great effort to balance our dataset when creating it. The following are some of our major contributions:
Fig. 1: Examples from our Balanced Bengali VQA dataset.
• Firstly, we have created a dataset for Bengali VQA. We manually annotated our dataset without utilizing Google API translation, in order to introduce the system to a natural Bengali linguistic style.
• Secondly, we have balanced our dataset by creating similar questions for images with similar contexts but different answers [10].
• Thirdly, we have implemented our approach in practice by applying attention to the combined features extracted from CNN and BiGRU.
• Finally, we have assessed our proposed model on our human-annotated Bengali dataset by doing several ablation studies, and we have shown our model's correct and incorrect answers.

II. RELATED WORKS
Visual Question Answering (VQA) has been a major research area for the past seven years and is undoubtedly the most popular visual reasoning problem [11]. Several datasets [1], [12]–[14] and strategies for visual question answering research have been proposed in recent work. A typical VQA dataset includes an image, a question about the image and the answer. In recent years, various datasets for the VQA task have been constructed. DAQUAR [12] was the first dataset to use real-life photos to introduce image question answering. The CLEVR [13] dataset includes questions that involve spatial and relational reasoning on visual features. The VQA dataset [1], which contains images from the MS-COCO [15] dataset, is one of the largest and merged open-ended questions with answers about the images. VQA v2.0 [10] is a balanced version of the previous VQA dataset, built by amassing additional complementary images for each query. The research of Zhang et al. [16] is perhaps the most applicable to our work because it focuses on balancing binary (yes/no) VQA questions on abstract clipart scenes from the VQA abstract scenes dataset. The most generalized way to approach a VQA task is to use fusion-based methods [17]: the image and question are represented as global features and merged into a unified representation to anticipate the correct answer. For visual information, image features are extracted using pre-trained CNN models [1], [3], [10], [18], and in recent years pre-trained Faster RCNN has been used [2], [11]. For textual information, questions are extracted using LSTM [1]–[3], [5], [17], [18] and GRU [11]. To fully understand the semantics of the questions based on images, attention-based methods [2], [5], [18] have been used recently, which produce outstanding outcomes.

III. DATASET
In addition to the VQA dataset introduced by [10], we have built a Bengali VQA Dataset. There are 204K MS-COCO images, as well as over 11 million questions with 110 million answers (10 answers per question), among which yes or no, number and other question types are included in this original VQA dataset. We have used a small portion of this enormous dataset to create our Bengali VQA dataset. Our research has been limited to a single domain: we have only worked with the binary classification setting, in which questions are answered with yes or no. First, we collected data from the VQA v2.0 dataset consisting of 3280 images and about 4750 questions with yes or no answers. We translated the questions into Bengali manually. During the translation process, we cross-checked the translations with some native speakers. After completing the translation process, our dataset was highly unbalanced, leading to linguistic bias and overfitting issues [10]. To overcome the linguistic bias issue and improve the standing of image understanding, we had to come up with a Balanced Bengali VQA dataset.
We have implemented the balancing strategy in our dataset as described in [10]. According to the procedure, we identified similar images that were asked the same question but with different answers. This is a well-known balancing methodology in the field of VQA, and using this method the highly imbalanced VQA v1.0 dataset was balanced to produce VQA v2.0. Initially, the number of questions in our unbalanced dataset was 4750 for 3280 images.
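The balancing step just described can be sketched as a simple grouping operation over the annotation table: find the questions that currently have only one answer value, since these are the ones for which complementary images (same question, opposite answer) still need to be annotated. The file name and column names (image_id, question, answer) below are assumptions, not the authors' actual schema.

```python
# Sketch: locate questions with only one answer value, mirroring the
# VQA v2.0-style balancing procedure described above.
import pandas as pd

qa = pd.read_csv("bengali_vqa_unbalanced.csv")   # assumed columns: image_id, question, answer

answer_counts = qa.groupby("question")["answer"].nunique()
one_sided = answer_counts[answer_counts == 1].index

todo = qa[qa["question"].isin(one_sided)]
print(f"{len(todo)} question-image pairs still need a complementary annotation")
```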
Fig. 2: Our Proposed Model Architecture
After applying the balancing approach, the number of questions becomes 13046, up from 4750, for 3280 images. Fig. 1 shows some examples from our human-annotated Balanced VQA dataset. In our dataset, all of the questions and answers have been human-annotated. Finally, we have created our own Balanced Binary VQA dataset, in which each image has an average of four 'yes' or 'no' type questions and their answers.
A. Data Preprocessing
1) Image Processing: We have used various image preprocessing techniques on our image dataset, such as Gaussian Blur (blurring an image with a Gaussian function to reduce image noise), RandomGridShuffle (random permutation of shuffled grids) and CLAHE (contrast limited adaptive histogram equalization, a technique for enhancing the visibility of foggy images). We have also applied some augmentation techniques to our images, among them image flipping and RGB Shift (which moves the colour channels independently for an edgy, moving look). We have applied these preprocessing and augmentation techniques to provide variety to the model and make the model easier to generalize.
2) Text Processing: We have used tokenization to distinguish words from spaces in the questions of our Bengali VQA dataset. In our human-annotated Bengali VQA dataset, we have discovered that there are 1940 distinct Bengali words in the questions. Furthermore, we have zero-padded all question vector sequences to make them the same length as the longest. In our Bengali Visual Question Answering dataset, the longest question is ten words long, whereas most questions are 3, 4, or 5 words long.

IV. PROPOSED METHODOLOGY
The proposed method is based on applying attention to the extracted image on the basis of question features. Fig. 2 depicts the overall architecture of our approach. The three components of our proposed approach are as follows: (1) extraction of visual features, (2) extraction of language features, and (3) the attention mechanism. In this section, we go through these three major elements.
A. Image Representation
A Convolutional Neural Network (CNN), VGG19, is used in the image model to extract the image feature vector vI over each of the three colour channels of all input images. We first resize our images to 448x448 pixels and then use our image model, VGG19, to extract image features. We have trained our CNN model, VGG19, from scratch while keeping up with the Bengali question-answer sequence. After passing the images through the CNN model, we get a 100-dimensional feature vector for each of the images. Therefore, each input image of 448x448 pixels is represented by a 100-dimensional image feature vector.
B. Question Representation
We have utilized a Bidirectional GRU (BiGRU) to capture the contextual and semantic content of questions in our study, where one GRU takes the input text from one direction, while the other GRU takes the same input from the reverse direction. The input question is represented by our vocabulary-based vector representation of the words for each time stamp, with a length of 10. Each word is embedded in a 100-dimensional vector space using an embedding matrix. After that, we feed our Bidirectional GRU model the 100-dimensional embedding vectors of the words in the question for each time stamp. The representation vector for the question is extracted from the last hidden layer. In both of these cases, we get a 100-dimensional feature vector vQ representation for each question in our study.
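A minimal PyTorch sketch of the two encoders described above is given below. The linear projection to 100 dimensions, the 50-unit hidden size per GRU direction and the classifier removal are assumptions made only to match the stated 100-dimensional outputs; they are not the authors' published configuration.

```python
# Sketch of the image and question encoders (assumed shapes).
import torch
import torch.nn as nn
from torchvision.models import vgg19

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=100):
        super().__init__()
        self.cnn = vgg19(weights=None)            # trained from scratch, as in the paper
        self.cnn.classifier = nn.Identity()       # keep the flattened 25088-d conv features
        self.proj = nn.Linear(25088, feat_dim)

    def forward(self, images):                    # images: (B, 3, 448, 448)
        return self.proj(self.cnn(images))        # -> (B, 100)

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size=1940, emb_dim=100, feat_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bigru = nn.GRU(emb_dim, feat_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens):                    # tokens: (B, 10) zero-padded word ids
        _, h = self.bigru(self.embed(tokens))     # h: (2, B, 50), one per direction
        return torch.cat([h[0], h[1]], dim=1)     # -> (B, 100)
```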
C. Attention Mechanism
We have employed a two-layer attention model which extracts regions that are highly relevant to the answer. In our attention model, we first feed our extracted image feature vector vI and extracted question feature vector vQ through a single-layer neural network followed by a softmax function to create the attention distribution over the image regions:

h_A^1 = tanh(W_I,att vI + (W_Q,att vQ + b_att))    (1)
att_I^1 = softmax(W_h h_A^1 + b_h)    (2)

Here W_I,att, W_Q,att and W_h denote the attention layer weight matrices, which are updated based on the given input image and question. We next combine vI with the question vector vQ to generate a query vector u^1 by calculating the weighted sum of the image feature vectors vI using the first-layer attention output. Here, we apply attention to our image vector using the question vector:

u^1 = Σ (att_I^1 · vI) + vQ    (3)

Again, we iterate the above query-attention process with our second attention layer, which uses the following equations:

h_A^2 = tanh(W_I,att vI + (W_Q,att u^1 + b_att))    (4)
att_I^2 = softmax(W_h h_A^2 + b_h)    (5)
u^2 = Σ (att_I^2 · vI) + u^1    (6)

After getting the query vector from the second attention layer, we pass it through a single-layer neural network to predict the final answer:

prediction = softmax(W_u u^2 + b_u)    (7)

Finally, the model determines precisely the areas that are related to the potential answer in the attention layer.
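The two-layer attention of equations (1)–(7) can be sketched in PyTorch as below. This version treats the image as a set of regional feature vectors so that the softmax of (2) and (5) runs over regions; the number of regions, hidden sizes and the binary output head are assumptions, and the softmax of (7) is left to the loss function.

```python
# Sketch of the stacked two-layer attention (eqs. 1-7), with assumed dimensions.
import torch
import torch.nn as nn

class StackedAttention(nn.Module):
    def __init__(self, feat_dim=100, att_dim=100, num_answers=2, layers=2):
        super().__init__()
        self.w_i = nn.ModuleList(nn.Linear(feat_dim, att_dim, bias=False) for _ in range(layers))
        self.w_q = nn.ModuleList(nn.Linear(feat_dim, att_dim) for _ in range(layers))
        self.w_h = nn.ModuleList(nn.Linear(att_dim, 1) for _ in range(layers))
        self.classifier = nn.Linear(feat_dim, num_answers)

    def forward(self, v_i, v_q):
        # v_i: (B, R, feat_dim) regional image features; v_q: (B, feat_dim) question feature
        u = v_q
        for w_i, w_q, w_h in zip(self.w_i, self.w_q, self.w_h):
            h = torch.tanh(w_i(v_i) + w_q(u).unsqueeze(1))   # eq. (1)/(4)
            att = torch.softmax(w_h(h), dim=1)               # eq. (2)/(5), over regions
            u = (att * v_i).sum(dim=1) + u                   # eq. (3)/(6)
        return self.classifier(u)                            # logits; softmax of eq. (7) in the loss

# Example with 196 hypothetical regions and a batch of 4
model = StackedAttention()
logits = model(torch.randn(4, 196, 100), torch.randn(4, 100))
```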
V. EXPERIMENTS
In this section, we present the implementation details and compare our model's performance for various combinations of modules and performance metrics.
A. Experimental Setup
1) Dataset: In our study, we have analyzed the effectiveness of the proposed model using our human-annotated Bengali VQA dataset. The images in our dataset are from the MS-COCO dataset [15], and we use our Bengali VQA dataset to minimize strong language biases. In our experiment, we have used eighty percent of the 13046 questions and their corresponding 3280 images as the training set to train our model, and twenty percent of the dataset as the validation set to evaluate the performance of our model. Of the overall 13046 question-answer pairs in our human-annotated Bengali VQA dataset, 7657 questions are answered with 'yes' and 5389 with 'no'. Following an 80:20 split of our whole dataset, this becomes 6125 'yes' answers and 4311 'no' answers in the train split, and 1532 'yes' answers and 1078 'no' answers in the test split. We evaluate our model on the twenty percent test dataset after each epoch to analyze the performance of the model and to identify the point from which the model starts to overfit.
2) Implementation Details: In our experiment, we employed hyperparameters to regulate how the proposed model learns during the training process. We applied a trial-and-error approach when determining the optimal value for each hyper-parameter. We trained our model for 30 epochs with a batch size of 16 and discovered that after 15 epochs the model begins to overfit, as the test loss increases with further training. In addition, we used the Adam optimizer with a 0.0001 learning rate during training and BCEWithLogitsLoss to calculate the loss, with a 0.5 dropout value. We implemented our framework with PyTorch, and all of our experimental approaches have been trained on a Google Colab-provided NVIDIA Tesla K80 GPU.

Fig. 3: Accuracy-Loss plot on Unbalanced Bengali VQA dataset
Fig. 4: Accuracy-Loss plot on Balanced Bengali VQA dataset

B. Result Analysis
The most important step after training our model is to assess how well it can anticipate the answer. We have utilized the evaluation metric "Accuracy" to validate our model's performance after each iteration. We have conducted ablation tests on the proposed method and assessed it for different combinations of modules. We have compared our CNN model, VGG19, with the most recent top CNN model, ResNet152, using a variety of combinations to conduct the ablation test.
TABLE I: Performance comparison of different modular combinations of our proposed model
Methods | Accuracy
ResNet152 + BiGRU | 62.2
ResNet152 + BiLSTM | 62.4
VGG19 + BiGRU (Best) | 63.3
VGG19 + BiLSTM | 62.0
In contrast to ResNet, a complex architecture that implements a residual network, VGG-19 is a straightforward model that implements a sequential architecture. Since our dataset was small, there was little room to draw conclusions from the images. Because of this, simpler models work best in this situation, whereas complicated models tend to overfit. Table I shows that the VGG-19 based model with BiGRU outperformed the ResNet-based models on our proposed dataset.

TABLE II: Performance evaluation for different states of attention
Methods | Accuracy
VGG19+BiGRU (Without Attention) | 54.5
VGG19+BiGRU+Attention (1 layer) | 63.1
VGG19+BiGRU+Attention (2 layer) (Best) | 63.3
VGG19+BiGRU+Attention (3 layer) | 62.1
We have also conducted numerous additional ablation studies on our model based on the attention layer, reported in Table II. In Fig. 3 and Fig. 4, we have demonstrated how our model performed on our balanced and unbalanced datasets. Qualitative results from our model on sample images are shown in Fig. 5a and Fig. 5b.
VI. CONCLUSION
We developed a balanced binary dataset using real-life scenarios and different contextual Bengali questions to highlight the importance of visual information in visual question answering. This dataset will allow native Bengali researchers to investigate VQA in a new light. We introduced an attention network-based model in this paper that imposes attention on an image numerous times to pinpoint the relevant visual region and eventually deduce the answer. With our balanced binary VQA dataset, our model produces a benchmark result in the domain of Bengali VQA. We had to put a lot of effort and time into creating this dataset and balancing it to fix the language bias problem; therefore, we have limited our annotation process to just one domain (yes/no). We intend to expand our human-annotated Bengali VQA dataset into more domains in the future and investigate other deep learning-based architectures on that expanded dataset.
R EFERENCES [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433. [2] W. Guo, Y. Zhang, J. Yang, and X. Yuan, “Re-attention for visual question answering,” IEEE Transactions on Image Processing, vol. 30, pp. 6730–6743, 2021. [3] V. Kazemi and A. Elqursh, “Show, ask, attend, and answer: A strong baseline for visual question answering,” arXiv preprint arXiv:1704.03162, 2017. [4] N. Shimizu, N. Rong, and T. Miyazaki, “Visual question answering dataset for bilingual image understanding: A study of cross-lingual transfer using attention maps,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1918–1928. [5] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 21–29. [6] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-attention networks for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6281–6290. [7] D. Gupta, P. Lenka, A. Ekbal, and P. Bhattacharyya, “A unified framework for multilingual and code-mixed visual question answering,” in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 900–913. [8] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are you talking to a machine? dataset and methods for multilingual image question,” Advances in neural information processing systems, vol. 28, 2015. [9] S. S. Islam, R. A. Auntor, M. Islam, M. Y. H. Anik, A. A. A. Islam, and J. Noor, “Note: Towards devising an efficient vqa in the bengali language,” in ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS), 2022, pp. 632–637. [10] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913. [11] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, “Murel: Multimodal relational reasoning for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1989–1998. [12] M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” Advances in neural information processing systems, vol. 27, 2014. [13] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910. [14] M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” Advances in neural information processing systems, vol. 28, 2015. [15] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, C. Zitnick et al., “Microsoft coco: Common objects in context. ineuropean conference on computer vision 2014 sep 6 (pp. 740–755).” [16] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. 
Parikh, “Yin and yang: Balancing and answering binary visual questions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5014–5022. [17] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical questionimage co-attention for visual question answering,” Advances in neural information processing systems, vol. 29, 2016. [18] I. Schwartz, A. Schwing, and T. Hazan, “High-order attention models for visual question answering,” Advances in Neural Information Processing Systems, vol. 30, 2017.
APPENDIX
(a) Qualitative results of correct predictions on sample images
(b) Qualitative results of mistakes on sample images.
SlotFinder: A Spatio-temporal based Car Parking System Mebin Rahman Fateha, Md. Saddam Hossain Mukta, Md. Abir Hossain, Mahmud Al Islam, Salekul Islam Department of CSE, United International University (UIU) Plot-2, United City, Madani Avenue, Badda, Dhaka-1212, Bangladesh Email: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—Nowadays, the increasing number of vehicles and the shortage of parking spaces have become an inescapable condition in big cities across the world. The car parking problem is not a new phenomenon, especially in a crowded city such as Dhaka, Bangladesh. A shortage of parking spaces leads to several problems such as road congestion, illegal parking on the streets, and fuel waste while searching for a free parking space. In order to overcome the parking problem, we develop a spatio-temporal based car parking system named SlotFinder. We collect data on 408 buildings that have parking slots from seven different locations. We then cluster these data based on time and location. Later, we train location-wise vacant parking space prediction using a stacked Long Short-Term Memory (LSTM) network based on their temporal patterns. We also compare our technique with baseline models and conduct an ablation analysis; our approach outperforms the previous approaches, with a lower RMSE and MAE of 0.29 and 0.24, respectively. Index Terms—Stacked LSTM, Spatio-temporal, Car parking system, Machine Learning
I. I NTRODUCTION Car parking is an emerging problem with the increasing number of vehicles in large cities world wide. Dhaka city is an unplanned city where roads, housing and offices are established without foreseeing the inconvenient future. Therefore, finding a parking space is a common problem created by the increased number of vehicles. Searching for a parking space requires time, effort and extra fuel. A global parking survey by IBM in 2011 shows that 20 minutes is spent on average in searching for a perfect parking space [7]. To solve the problem, we develop a spatio-temporal based car parking model, namely SlotFinder. Dhaka city is one of the most densely populated areas in the world. In 2010, the population of Dhaka city was 14.7 million and in 2021, it became 21.7 million, which reflects a significant increase of population from 2010 to 2021 [3]. With the increasing number of population, the number of registered vehicles is also rising proportionally. Parking poses a serious challenge to the vehicle owners since there is a serious shortage of parking facilities. The parking problem leads to illegal parking on city streets, increasing intense traffic jam and worsening the annoyance of the moving traffic. Thus, the overall parking problem has motivated us to come up with a solution. Towards this direction, we develop a machine learning based spatio-temporal car parking system. We find
several location aware applications in the literature [6], [11], [12], [14], [15], which largely apply RNN models to predict in terms of both space and time. In this study, we collect data from 408 buildings from seven different areas of Dhaka city. Our dataset contains six spatiotemporal features. First, we apply k-Means clustering based on longitude and latitude to group up the regions with similar parking trend. Then, we use the clustering technique with k=7 to make seven clusters and each cluster consists of time intervals of vacant parking spaces in a region. Then, to learn our model the pattern of the parking vacancies, we apply RNN based stacked LSTM model on each cluster to predict vacant parking spaces (departure and arrival time). In short, in this paper, we have following contribution: • We build a dataset with available car parking time (i.e, departure and arrival time) and spatial features (i.e., latitude and longitude). • We develop an efficient spatio-temporal based stacked LSTM for predicting the vacant parking spaces. • We demonstrate that our model outperforms the baseline approaches. II. L ITERATURE R EVIEW In this section, we present several studies related to car parking systems. Kotb et al. [5] introduced a smart car parking system with static resource scheduling, dynamic resource allocation and pricing models, to optimize the parking system for drivers and parking owners. They combined real time reservation (RTR) with share time reservation (STR). Gandhi et al. [4] introduced a system that directs information about open and full parking spaces via mobile or web application. This IoT system includes micro-controller and sensor devices with Electric Vehicle (EV)–charging points, which is situated in respective car parking space. Zacepins et al. [21] proposed a smart parking management based on video processing and analysis. In this paper they made a python application for real time parking lot monitoring. For the occupancy detection in parking, they used five classifiers (Logistic Regression, Linear Support Vector Machine, Radial Basis Function Support Vector Machine, Decision Tree and Random Forest). Shao et al. [17] introduced a range based kNN algorithm which is named as Range-kNN. Their proposed
Fig. 1. Methodology of spatio-temporal based car parking system.
algorithm consists of two parts: expansion of the query range and a search algorithm. We observe that the majority of the studies consider space only, where temporal parameters are not taken into consideration, while time is a vital factor. Towards this direction, we propose a novel stacked LSTM based car parking solution which considers both space and time.
III. METHODOLOGY
In this paper, we mainly work with a spatio-temporal dataset. Figure 1 shows the different steps of our proposed car parking system. First, we select a location which is ideally a crowded area and where most of the people have their own vehicles. Then, we annotate these areas with starting and departure times. Later, we find different patterns for the starting and departure times based on locations. By observing these patterns, we apply our spatio-temporal based machine learning technique. We briefly present the pipeline of our study below.
A. Data Collection
We first select seven populated areas in Dhaka city for collecting data. In these areas, a higher percentage of people (around 45%) [18] have their own vehicles. We also choose areas where the parking problem is a common issue due to the increased crowd and limited parking facilities. After area selection, the next step is to annotate the buildings to identify the parking spaces by latitude and longitude. For collecting data, we conduct field visits in seven locations inside Dhaka such as Dhanmondi, Gulshan, Uttara, Mirpur, etc.¹ We visit several residential buildings in order to get the parking information. We take some relevant information about parking spaces, such as whether there is any free parking space, for how much time in a day it remains vacant, and whether a specific parking space remains occupied or free. We take this information from 408 buildings in seven areas through face-to-face surveys with building managers. A summary of the collected data is shown in Table I. The most frequent departure time of vacant parking spaces is 8:00 am to 9:00 am and the arrival time is 5:00 pm to 6:00 pm. The average duration of the vacant parking spaces is 9-10 hours. Each building has 5-10 vacant parking spaces on average. For further research, we share the dataset for public use². Table II shows the number of instances from each area based on their average departure and arrival times.

TABLE I: SUMMARY OF COLLECTED DATA
Dimension | 408x6
Number of areas | 7
Average departure time | 8:00 AM - 9:00 AM
Average arrival time | 5:00 PM - 6:00 PM
Duration of empty parking | 9-10 hours
Average number of empty parking | 5-10
Average parking spaces | 20-25
Number of instances | 408

TABLE II: NUMBER OF INSTANCES PER AREA BASED ON THEIR AVERAGE DEPARTURE AND ARRIVAL TIME
Area Name | Departure time | Arrival time | Size
Dhanmondi | 8:00 AM - 9:00 AM | 5:00 PM - 6:00 PM | 74
Gulshan | 9:00 AM - 10:00 AM | 5:00 PM - 6:00 PM | 67
Uttara | 8:00 AM - 9:00 AM | 6:00 PM - 7:00 PM | 63
Mirpur | 7:00 AM - 8:00 AM | 5:00 PM - 6:00 PM | 71
Kallyanpur | 7:00 AM - 8:00 AM | 5:00 PM - 6:00 PM | 48
Shyamoli | 8:00 AM - 9:00 AM | 5:00 PM - 6:00 PM | 42
Mohammadpur | 8:00 AM - 9:00 AM | 5:00 PM - 6:00 PM | 43
Total | | | 408

¹ Google map: shorturl.at/lruTZ
² https://bit.ly/3yiNvFY
B. Building Model

In this section, we first describe our system architecture, shown in Figure 2. Then, we present our clustering approach and discuss data preparation techniques. Next, we discuss how we apply our proposed stacked LSTM model to our dataset. We record, for each parking space, the time intervals during which it is vacant and store this information in the dataset. After that, we apply k-Means clustering to the dataset based on location. The availability of free parking spaces over time differs for each area of Dhaka city. During our field study, we see significant differences in parking trends among the seven areas. For example, the average departure time for the Dhanmondi area is 8:00-9:00 AM, whereas for Gulshan it is 9:00-10:00 AM. Dhanmondi has become the biggest hub of educational institutes, and the majority of the schools start lectures around 8:00-8:30 AM; therefore, most parking spaces in this area become vacant around 8:00-9:00 AM. On the other hand, many private corporations have offices in Gulshan, and most people who live in Gulshan have an office or business nearby, so they leave home around 9:00-10:00 AM. Before applying our machine learning algorithm, we group similar parking spaces together: we apply the k-Means clustering method with k=7, as we collect data from seven different areas, to group the 408 parking spaces. We use the longitude and latitude values of each parking space as the input features for clustering. Figure 3 shows the seven clusters with respect to their centroids. Data preparation produces a structured dataset for the machine learning algorithms and helps to obtain efficient results with less error. Both numerical and categorical features are present in our dataset. Categorical data cannot be directly interpreted by machines, so we transform it into numerical data. In our dataset, departure time and arrival time are categorical; we use label encoding to convert these two features into numerical form, as shown in Figure 4.
Fig. 3. Clusters visualization based on latitude and longitude.
Fig. 4. Label encoding of departure time and arrival time.
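The clustering and encoding steps described above can be reproduced with standard Python tooling. The sketch below is illustrative rather than the authors' code: the CSV file name and column names are assumptions, while k=7, the latitude/longitude features and the label encoding of the two time features follow the text.

```python
# Illustrative sketch (not the authors' code): cluster the parking spaces by
# location and label-encode the categorical time features. Assumes a CSV with
# hypothetical columns 'latitude', 'longitude', 'departure_time', 'arrival_time'.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("parking_spaces.csv")        # 408 buildings, 6 features (assumed file name)

# Group the 408 parking spaces into 7 clusters using only the spatial features.
kmeans = KMeans(n_clusters=7, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(df[["latitude", "longitude"]])

# Convert the categorical hourly intervals (e.g. "8:00 AM - 9:00 AM")
# into integer codes, as in Fig. 4.
for col in ["departure_time", "arrival_time"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.groupby("cluster").size())           # instances per cluster
```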
Data scaling is a preprocessing technique for numerical features and is necessary to obtain improved performance from many machine learning algorithms, including RNNs. Several scaling schemes exist; we use MinMax scaling [13] to scale our numerical features to the range 0 to 1. Then, we split our dataset into training and test parts of 65% and 35%, respectively, and train for 10 iterations with 10-fold cross validation. Parking events form a sequence, and we also observe that the buildings in a specific region follow a similar parking trend. For this reason, we apply a stacked LSTM to each cluster to predict departure time and arrival time. The basic LSTM model consists of a single hidden LSTM layer followed by a conventional feedforward output layer. The stacked LSTM is a model extension that contains multiple hidden LSTM layers, each of which contains multiple memory cells [8], [16]. Since our desired output predicts both the departure and the arrival time, we incorporate the stacked LSTM in our problem. In a stacked LSTM architecture, the output of the first LSTM layer is fed as input to the next LSTM layer. In our model, the first LSTM layer has 64 hidden units and the second has 32. We use a dropout of 0.2 followed by a dense layer. In our experiment, we set the number of time steps to 3; the time steps denote how many previous observations are used to predict the next parking time, and 3 gives us an optimal result. We take the previous 3 parking times to predict the 4th. We split the data into X and Y: in the 0-th iteration, the first 3 values are in X and the 4th value is in Y, and so on, as shown in Figure 5. In this manner, we re-arrange both our training and test datasets.

Fig. 5. Re-arranging of trainX and trainY.

Figure 6 (a) shows the training loss and validation loss of a single cluster and Figure 6 (b) shows the training loss and validation loss of all seven clusters. We see from the figures that both the training loss and the validation loss decrease and then remain stable.
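For concreteness, the following is a minimal Keras sketch of the windowing and the stacked LSTM described above. The 64- and 32-unit LSTM layers, the 0.2 dropout and the window of 3 previous values follow the text; the single-output head and all other details are assumptions (the paper predicts both departure and arrival times).

```python
# Minimal sketch of the stacked LSTM described in the text, assuming TensorFlow/Keras.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def make_windows(series, time_steps=3):
    """Re-arrange a 1-D sequence into (X, y): 3 previous values -> next value."""
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:i + time_steps])
        y.append(series[i + time_steps])
    return np.array(X)[..., np.newaxis], np.array(y)

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(3, 1)),  # first LSTM layer, 64 units
    LSTM(32),                                              # second LSTM layer, 32 units
    Dropout(0.2),
    Dense(1),   # one target here; the paper's model predicts both departure and arrival times
])
model.compile(optimizer="adam", loss="mse")
```

With windows built by make_windows from each cluster's scaled series, calling model.fit(X, y) would train one predictor per cluster, mirroring the per-cluster setup described above.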
Fig. 6. Training Loss and Validation Loss.

C. Ablation Study

An ablation study is an experiment in which components of a deep learning network are deleted or replaced to determine how these changes affect the overall performance of the system. The performance of a model may remain stable, improve, or degrade when these components are changed. Accuracy can be improved primarily by experimenting with various hyper-parameters such as the optimizer, learning rate, loss function and batch size; altering the architecture of the model also has an effect on overall performance. In this study, we present six case studies, altering different system components and observing the response of the system to each change.

Evaluation Metrics: We apply three metrics to evaluate our proposed model: mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE). Six experiments are conducted as an ablation study, each changing a different component of the proposed stacked LSTM model: batch size, hidden layers, loss function, optimizer, learning rate, activation function and dropout. A more reliable architecture with improved performance can be achieved by tuning these components.

1) Ablation Study 1 (Changing Hidden Layers): Our stacked LSTM has two LSTM layers, with 64 hidden units in the first and 32 in the second. To observe the model's performance, we change the number of hidden units in both LSTM layers. The RMSE, MAE and validation loss scores are presented in Table III.

TABLE III: Ablation study by changing the hidden layers (Case Study 1).
  Hidden units              RMSE   MAE    Val Loss
  LSTM1: 64,  LSTM2: 32     0.30   0.25   0.03
  LSTM1: 128, LSTM2: 64     0.30   0.25   0.01
  LSTM1: 50,  LSTM2: 50     0.30   0.25   0.02
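For reference, the three evaluation metrics used throughout the case studies can be computed as in the short sketch below (toy values on the 0-1 MinMax scale, not the paper's data).

```python
# Sketch of the evaluation metrics (MAE, MSE, RMSE) used in the ablation studies.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([0.2, 0.5, 0.7])   # toy scaled targets
y_pred = np.array([0.3, 0.4, 0.9])   # toy model outputs

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```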
2) Ablation Study 2 (Changing Batch Size): The term "batch size" refers to the number of training samples used in a single iteration. To determine the ideal batch size for our proposed model, we experiment with different batch sizes. Changing the batch size from 64 to 16 reduces the RMSE from 0.30 to 0.29. The results are shown in Table IV.

TABLE IV: Ablation study by changing the batch size (Case Study 2).
  Batch Size   RMSE   MAE    Val Loss
  16           0.29   0.25   0.01
  32           0.30   0.26   0.03
  64           0.30   0.25   0.01
3) Ablation Study 3 (Changing Optimizer): We use Adam [22] as the optimizer in our model, which gives RMSE, MAE and validation loss scores of 0.29, 0.25 and 0.01, respectively. Changing the optimizer to SGD or Nadam increases these scores, as shown in Table V. SGD performs each iteration using a single sample, i.e., a batch size of one; the sample is chosen and randomly shuffled in order to carry out the iteration. Nadam [19] is a variant of Adam that incorporates Nesterov momentum.

TABLE V: Ablation study by changing the optimizer (Case Study 3).
  Optimizer   RMSE   MAE    Val Loss
  SGD         0.35   0.31   0.04
  Adam        0.29   0.25   0.01
  Nadam       0.30   0.25   0.01
4) Ablation Study 4 (Changing Learning Rate): We use a learning rate of 0.01, which gives RMSE, MAE and validation loss scores of 0.29, 0.24 and 0.01, respectively. If we replace the learning rate with 0.001 or 0.0001, the RMSE, MAE and validation loss scores increase. The results are shown in Table VI.

TABLE VI: Ablation study by changing the learning rate (Case Study 4).
  Learning Rate   RMSE   MAE    Val Loss
  0.01            0.29   0.24   0.01
  0.001           0.30   0.24   0.03
  0.0001          0.57   0.50   0.04
5) Ablation Study 5 (Changing Activation Functions): We initially use the ReLU activation function, which gives an RMSE of 0.294 and an MAE of 0.246. Replacing the activation function with softmax gives an RMSE of 0.290 and an MAE of 0.242. We also try the tanh activation function. The results are shown in Table VII.

TABLE VII: Ablation study by changing the activation function (Case Study 5).
  Activation Function   RMSE    MAE     Val Loss
  ReLU                  0.294   0.246   0.01
  Softmax               0.290   0.242   0.01
  Tanh                  0.292   0.245   0.01
6) Ablation Study 6 (Changing Dropout): We use a dropout of 0.2, which gives RMSE and MAE scores of 0.290 and 0.242, respectively. We apply dropouts of 0.5 and 0.7 as well. The results are shown in Table VIII.

TABLE VIII: Ablation study by changing the dropout (Case Study 6).
  Dropout   RMSE    MAE     Val Loss
  0.2       0.290   0.242   0.01
  0.5       0.291   0.243   0.02
  0.7       0.291   0.243   0.01
D. Performance Analysis of the Best Model

After analyzing all of the case studies, we obtain the model with the lowest error rate when the optimal batch size, learning rate, optimizer and number of hidden units are used. Table IX shows the final configuration of our stacked LSTM model for predicting vacant car parking slots in different regions.

TABLE IX: Configuration of the proposed architecture after the ablation study.
  Configuration            Value
  Data set size            408x6
  Epochs                   100
  Optimization function    Adam
  Learning rate            0.01
  Batch size               16
  Activation function      Softmax
  Dropout                  0.2

IV. COMPARISON WITH BASELINE MODELS

We also apply Recurrent Neural Network (RNN) [20], Autoregressive Integrated Moving Average (ARIMA) [2] and LSTM [23] models as baselines to predict vacant car parking time. Our stacked LSTM model outperforms all of these models. Table X shows the performance of the different baseline models along with our proposed stacked LSTM model.

TABLE X: Performance of different models.
  Models         RMSE   MAE    Val Loss
  LSTM           0.35   0.35   0.04
  RNN            0.45   0.48   0.03
  ARIMA          0.40   0.30   0.04
  Stacked LSTM   0.29   0.25   0.01

V. RESULTS AND DISCUSSION
We develop an efficient spatio-temporal machine learning model for predicting vacant parking spaces. First, we apply a clustering method based on location to group regions with similar parking trends. Later, we apply a stacked LSTM to predict the departure and arrival times of vacant parking spaces. During data collection, we observe that different regions follow different parking trends, and that the departure and arrival times of every area maintain a particular pattern. Thus, we apply a stacked LSTM to each cluster to predict departure and arrival times. As described in Section III, the stacked LSTM extends the basic LSTM with multiple hidden LSTM layers; in the final model the first LSTM layer has 128 hidden units and the second has 64, followed by a dropout of 0.2 and a dense layer. Our ablation study
finds the best configuration for our model. For the hidden layers, we use 128 hidden units in the first LSTM layer and 64 in the second. We apply batch sizes of 64, 32 and 16; a batch size of 16 gives the lowest RMSE and MAE. We compare the ReLU, softmax and tanh activation functions, and softmax performs better than the others. We initially apply the Adam optimizer and then also try SGD and Nadam, but Adam gives the optimal result. We keep the initial learning rate of 0.01, because changing it to 0.001 or 0.0001 increases the RMSE and MAE scores. Lastly, we find that a dropout value of 0.2 gives the optimal result. We considered applying a plain RNN, but although we work with only 408 buildings for now, in reality there are thousands of buildings; in that case an RNN does not give optimal results because of its long-term dependency problem caused by vanishing gradients, which the LSTM overcomes. After training the model, the predictions for the Dhanmondi, Gulshan and Mirpur clusters are better than for the rest of the clusters because the number of instances in these areas is large compared to the other clusters. A few studies [1], [9], [10] show that weighted LSTM and HMM models also provide satisfactory results in predicting future events. However, our model has several limitations which can be an avenue for future research. Our model cannot find the nearest parking spaces and needs the longitude and latitude values of a specific building to predict a vacant parking space. The parking behaviour we model might sound unrealistically regular, as if people were robotic entities; however, we observe similar parking patterns because office hours are generally 9:00 AM to 5:00 PM and educational institutes mostly run from 8:00 AM to 4:00 PM. In addition, we mainly work with departure and arrival times and therefore do not consider any spatial or managerial factors, so we do not need to apply nearest neighbour or distance searching algorithms over spatial data structures.

VI. CONCLUSION

In this paper, we have worked with a real-world spatio-temporal dataset to predict free parking times (departure and arrival times). First, we have applied a clustering method with k=7 over seven different locations (longitude and latitude) to group regions with similar parking trends. Later, we have applied a stacked LSTM to each cluster to predict the departure and arrival times of those parking spaces. We have conducted an ablation study to see the impact of the components within the architecture and to select their optimal values. We have also compared our technique with the baseline techniques and found that our system outperforms the previous approaches.

REFERENCES
[1] Al Rafi, A.S., Rahman, T., Al Abir, A.R., Rajib, T.A., Islam, M., Mukta, M.S.H.: A new classification technique: random weighted lstm (rwl). In: 2020 IEEE Region 10 Symposium (TENSYMP). pp. 262–265. IEEE (2020)
[2] Benvenuto, D., Giovanetti, M., Vassallo, L., Angeletti, S., Ciccozzi, M.: Application of the arima model on the covid-2019 epidemic dataset. Data in Brief 29, 105340 (2020)
[3] BRTA: Number of registered motor vehicles in bangladesh (yearwise). Retrieved from shorturl.at/kqD78 (July 7, 2020)
[4] Gandhi, R., Nagarajan, S., Chandramohan, J., Parimala, A., Arulmurugan, V.: Iot based automatic smart parking system with ev-charging point in crowd sensing area. Annals of the Romanian Society for Cell Biology 25(6), 6398–6409 (2021)
[5] Kotb, A.O., Shen, Y.C., Zhu, X., Huang, Y.: iParker—a new smart car-parking system based on dynamic resource allocation and pricing. IEEE Transactions on Intelligent Transportation Systems 17(9), 2637–2647 (2016)
[6] Leon, M.I., Iqbal, M.I., Meem, S., Alahi, F., Ahmed, M., Shatabda, S., Mukta, M.S.H.: Dengue outbreak prediction from weather aware data. In: International Conference on Bangabandhu and Digital Bangladesh. pp. 1–11. Springer (2022)
[7] Liang, J.K., Eccarius, T., Lu, C.C.: Investigating factors that affect the intention to use shared parking: A case study of taipei city. Transportation Research Part A: Policy and Practice 130, 799–812 (2019)
[8] Malhotra, P., Vig, L., Shroff, G., Agarwal, P., et al.: Long short term memory networks for anomaly detection in time series. In: Proceedings. vol. 89, pp. 89–94 (2015)
[9] Mukta, M.S.H., Ali, M.E., Mahmud, J.: Identifying and predicting temporal change of basic human values from social network usage. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. pp. 619–620 (2017)
[10] Mukta, M.S.H., Ali, M.E., Mahmud, J.: Temporal modeling of basic human values from social network usage. Journal of the Association for Information Science and Technology 70(2), 151–163 (2019)
[11] Nawshin, S., Mukta, M.S.H., Ali, M.E., Islam, A.N.: Modeling weather-aware prediction of user activities and future visits. IEEE Access 8, 105127–105138 (2020)
[12] Rahman, M.M., Majumder, M.T.H., Mukta, M.S.H., Ali, M.E., Mahmud, J.: Can we predict eat-out preference of a person from tweets? In: Proceedings of the 8th ACM Conference on Web Science. pp. 350–351 (2016)
[13] Raju, V.G., Lakshmi, K.P., Jain, V.M., Kalidindi, A., Padma, V.: Study the influence of normalization/transformation process on the accuracy of supervised classification. In: ICSSIT. pp. 729–735. IEEE (2020)
[14] Riad, S., Ahmed, M., Himel, M.H., Mim, A.H., Zaman, A., Islam, S., Mukta, M.S.H.: Prediction of soil nutrients using hyperspectral satellite imaging. In: Proceedings of International Conference on Fourth Industrial Revolution and Beyond 2021. pp. 183–198. Springer (2022)
[15] Rupai, A.A.A., Mukta, M.S.H., Islam, A.N.: Predicting bowling performance in cricket from publicly available data. In: Proceedings of the International Conference on Computing Advancements. pp. 1–6 (2020)
[16] Sagheer, A., Kotb, M.: Unsupervised pre-training of a deep lstm-based stacked autoencoder for multivariate time series forecasting problems. Scientific Reports 9(1), 1–16 (2019)
[17] Shao, Z., Taniar, D.: Range-based nearest neighbour search in a mobile environment. In: Proceedings of the 12th International Conference on Advances in Mobile Computing and Multimedia. pp. 215–224 (2014)
[18] Sharmeen, N., Houston, D.: Urban form, socio-demographics, attitude and activity spaces: Using household-based travel diary approach to understand travel and activity space behaviors. Urban Science 4(4), 69 (2020)
[19] Tato, A., Nkambou, R.: Improving adam optimizer (2018)
[20] Tomasiello, S., Loia, V., Khaliq, A.: A granular recurrent neural network for multiple time series prediction. Neural Computing and Applications 33(16), 10293–10310 (2021)
[21] Zacepins, A., Komasilovs, V., Kviesis, A.: Implementation of smart parking solution by image analysis. In: VEHITS. pp. 666–669 (2018)
[22] Zhang, Z.: Improved adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). pp. 1–2. IEEE (2018)
[23] Zhao, Z., Chen, W., Wu, X., Chen, P.C., Liu, J.: Lstm network: a deep learning approach for short-term traffic forecast. IET Intelligent Transport Systems 11(2), 68–75 (2017)
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Intelligent Door Controller Using Deep Learning-Based Network Pruned Face Recognition Prangon Das∗ , Nurul Amin Asif∗ , Md. Mehedi Hasan∗ , Sarafat Hussain Abhi∗ , Mehtar Jahin Tatha∗ , Swarnali Deb Bristi∗ ∗ Department
of Mechatronics Engineering, Rajshahi University of Engineering & Technology, Bangladesh Email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—Nowadays, homes are equipped with various technologies that increase living comfort and offer more flexibility. Installing such technologies in a home makes it a smart home, and this installation process is called home automation. The popularity of home automation systems is increasing rapidly and improves the quality of living. Home automation offers automatic control of lights, fans, temperature, etc., as well as automatic alarm systems to alert people. Various techniques have already been used for implementing home automation. In this paper, an intelligent door controller, an application of home automation, is presented using deep learning techniques. An intelligent door opens automatically, depending on the person standing in front of it, and closes after a predefined time. If the person is known, the door opens and closes automatically after his/her entrance; if the person is not known, the door remains closed. To identify the person, the person's face is recognized using deep learning. In addition, an Arduino and a servo motor are used to control the door opening and closing.
Fig. 1: Generic diagram of facial recognition system
Index Terms—Automatic door control, Face recognition, Arcface, Servo motor, Home automation
I. I NTRODUCTION Current artificial intelligence-based technologies should not only suit people’s incredibly diverse functional needs, but also their increasingly individualized experience and interactive wants. Artificial intelligence is a depiction of the iterative growth of computer information technology. As a result, in order to continuously enhance the user experience, contemporary artificial intelligence-based products need to include a particular feedback system [1]. There is a stronger emphasis on experience design in user experience design since it considers if individuals can have an emotional response to using the device. In terms of the major development path for humancomputer interaction, the facial recognition-based feedback system that employs artificial intelligence and addresses the human emotion component as the design goal has progressively taken the lead from the standpoint of the user experience [2]. It is a critical stage in the creation of the facial recognition feedback system that uses artificial intelligence to improve social interaction [3]. Secondly, as a result of the iterative development of artificial intelligence technology, the continuous updating of the level of intelligent chip technology, and
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
the noticeably improved level of artificial intelligence software and hardware development, human-computer interaction has developed into an essential component of intelligent design. Additionally, the demand for artificial intelligence is rising in different applications, such as the biomedical sector [4], [5], autonomous vehicles [6], and so on. Different application scenarios and feedback mechanisms have been created and are currently in use. Besides that, face recognition [7] is progressing toward becoming more sophisticated and human-like. In the case of identity and information security, which has emerged as a major obstacle to be overcome, face recognition has developed into a substantial component of human-computer feedback systems. Biometrics is a friendly and useful way to gather data. Face recognition analyzes a face's feature data and identifies the provided face image using an existing face database. Figure 1 illustrates the variety of topics and characteristics it encompasses. It is widely utilized in various fields because combining the benefits of different fields can determine identity, capacity, and function. Studying face recognition feedback systems based on AI applied to intelligent chips is therefore very useful in real-world applications [8].
Fig. 2: Overall methodology of the system
II. LITERATURE REVIEW

This section's main emphasis is on the research initiatives that preceded this work, including research on how to create artificial intelligence-based door automation systems [9]. It is important to note that few studies have attempted to use the aforementioned artificial intelligence tools to solve these issues; academics have nonetheless proposed a range of automatic door solutions. In one study, [10] proposed a fuzzy logic-based automatic door controller, in which fuzzy logic was used to adjust the opening distance and speed of the door. According to the article, 25 rules were used to construct a heuristic fuzzy logic controller for the system. However, applying fuzzy logic in this setting does not make the system intelligent, because it is designed in an "if A then B" manner; the system is unable to reason before acting. A facial detection and recognition-based automated door access system was recommended in [11]. Using Principal Component Analysis (PCA), pertinent features from facial photographs were identified and passed to a microcontroller for authentication; the output of the Matlab software, which served as the detection and recognition system, is sent to the microcontroller for further processing. Even though PCA successfully reduces the feature dimensions, given that the facial data must be transformed into grayscale images, convolutional neural networks (CNNs) can do this task better. A person identification and intention analysis-based automatic door system was recommended by [12]. This project aims to eliminate the inappropriate behaviors of the door system through the implementation of a behavior analysis system that enhances
task accuracy. Before using trajectory tracking and statistical analysis to identify an object's intent, the researchers first use contour detection to determine whether it is a human. After the intention was identified and confirmed, the system was able to respond in about 2 seconds with a low false activation rate and a high rate of accurate activation.

III. METHODOLOGY

This section illustrates how an end-to-end, network-pruned ArcFace face recognition model is used by an intelligent door controller. An overview of the system is given in Fig. 2: an ArcFace + network pruning deep learning model is trained on the LFW dataset and then connected to a controller that drives the motor for intelligent door locking. There are two parts to the embedded system we propose: 1) an ArcFace + network-pruned face recognition model, and 2) a connection mechanism between the Arduino and the face recognition API, which are discussed in the following subsections.

A. ArcFace

ArcFace, or Additive Angular Margin Loss [14], is a loss function used in facial recognition applications. Traditionally, the softmax loss is employed for these tasks. Because the softmax loss does not explicitly optimize the feature embedding to impose more similarity for intra-class samples and diversity for inter-class samples, deep face recognition suffers from a performance gap under significant levels of intra-class variation. The ArcFace loss for face recognition is illustrated in Fig. 3. As the embedding features are scattered around each feature center on the hypersphere, we apply an additive angular margin
Fig. 3: Training a DCNN under the supervision of the ArcFace loss for face recognition [13].
penalty m between x_i and W_yi to simultaneously increase the intra-class compactness and inter-class discrepancy. Because the additive angular margin penalty is equal to the geodesic distance margin penalty on the normalized hypersphere, the method is named ArcFace. The pseudo code is given in Algorithm 1 [13].
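For illustration, a compact PyTorch sketch of the additive angular margin loss is given below. It follows the standard ArcFace formulation rather than the authors' implementation; the scale s, margin m, embedding size and class count are placeholder values.

```python
# A compact sketch of the additive angular margin (ArcFace) loss [13], [14],
# written in PyTorch for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    def __init__(self, embedding_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, embedding_dim))  # class centres
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class centres.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.W)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin m only to the target-class angle, then rescale by s.
        target = F.one_hot(labels, self.W.size(0)).bool()
        cos_m = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * cos_m, labels)

loss_fn = ArcFaceLoss(embedding_dim=512, num_classes=10)
loss = loss_fn(torch.randn(4, 512), torch.randint(0, 10, (4,)))
```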
B. Network Pruning Neural network pruning is a technique that is based on the logical notion of removing unnecessary components from a network that functions well but consumes a lot of resources. Large neural networks have in fact repeatedly demonstrated their ability to learn, but it turns out that not all of their components remain effective after training is complete. The goal is to get rid of these components while maintaining network performance [15]. The overall learning process of the iterative pruning algorithm has been summarized in Algorithm 2 [16].
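As a hedged illustration of the idea (not the configuration used in this paper), iterative magnitude pruning can be expressed with PyTorch's pruning utilities as follows; the toy network, the 20% per-round amount and the three rounds are assumptions.

```python
# Minimal sketch of iterative magnitude pruning with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

for round_idx in range(3):                 # several prune / fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Zero out the 20% smallest-magnitude weights of this layer.
            prune.l1_unstructured(module, name="weight", amount=0.2)
    # ... fine-tune the pruned model here to recover accuracy ...

# Make the pruning permanent: drop the masks, keep the zeroed weights.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```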
C. API Connection Mechanism of Arduino with DC Servo Motor

A DC servo motor operated in an open-loop configuration does not provide a feedback reference to the external system while the motor is running. As a result, the external system has no way to detect faults or confirm that the motor is working correctly. Connecting a DC servo motor to an external controller in this way yields open-loop control of the servo system: the servo motor does not offer any feedback reference of its own to the controller. The motor itself, however, is not purely open loop; its internal servo controller receives position feedback from the motor's "internal feedback" system. A potentiometer receives position feedback from the servo gear system, and the reading from the potentiometer is translated into voltage values that correspond to the positions of the servo shaft [17]. The drawback of this "internal feedback" is that the position feedback is available only to the motor's internal servo controller and cannot be used externally. The block design of an open-loop DC servo motor with "internal" voltage feedback is shown in Fig. 4.
Fig. 5: DC servo motor block schematic with closed-loop voltage feedback [17]
A closed-loop DC servo motor, on the other hand, has an operating system with a built-in feedback mechanism, so the system can compare the expected and actual output conditions. The presence of closed-loop feedback acting as the primary controller for a DC servo motor is very beneficial [17]: the motor's current position can be determined, and any errors can be located and corrected. In some DC servo motors currently on the market, closed-loop feedback takes the form of a voltage (potentiometer) or pulses (encoder). These feedback signals operate in a manner quite similar to the "internal" feedback, being proportional to the angular position of the motor shaft. This paper focuses on the use of a DC servo motor with voltage position feedback. The voltage range of the potentiometer of the closed-loop voltage feedback type, which is located inside the servo motor casing, is most probably the same as that of the "internal" voltage feedback. Figure 5 depicts the block diagram of a DC servo motor with closed-loop voltage feedback.

Fig. 4: The block design of an open-loop DC servo motor with "Internal" voltage feedback [17]
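How the recognition result reaches the Arduino is not spelled out here, so the following host-side sketch is only one plausible wiring: a single byte is written over a serial connection, and the Arduino sketch maps it to a servo position. The port name, baud rate and one-byte protocol are assumptions.

```python
# Illustrative host-side glue (not the authors' code): send the recognition
# decision to the Arduino, which drives the servo to open or close the door.
import serial
import time

arduino = serial.Serial("/dev/ttyUSB0", 9600, timeout=1)  # assumed port and baud rate
time.sleep(2)   # give the board time to reset after the port is opened

def actuate_door(face_recognized: bool):
    # 'O' -> rotate the servo to the open position, 'C' -> keep/return it closed.
    arduino.write(b"O" if face_recognized else b"C")

actuate_door(True)
```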
IV. DATASETS & RESULTS

In this section, we describe the datasets used to train the ArcFace + network pruning face recognition model, the experimental setup, and the quantitative findings.

A. Experimental Settings & Datasets

For this experiment, we have utilized PyTorch to fine-tune a facial recognition model. The SGD optimizer is utilized across functions. Due to a computational limitation, the batch size is fixed to 64. Primarily, we have used the LFW (Labeled Faces in the Wild) dataset [18]. Labeled Faces in the Wild is an image dataset containing face photographs, collected especially for studying the problem of unconstrained face recognition. It has over 13 thousand images collected from the world wide web.

TABLE I: Comparison of testing accuracy between the proposed model and other models on our dataset.
  Face Recognition Model      Accuracy   Inference speed
  CNN + Softmax [19]          80%        Slow
  FaceNet [20]                85%        Fast
  RetinaFace [21]             87%        Medium
  ArcFace [13]                90%        Slow
  ArcFace + Network Pruning   90%        Fast
B. Result

Initially, we employed only the ArcFace face recognition model on our dataset. The disadvantage of employing ArcFace alone was its slow inference. To address this problem, we employed ArcFace and network pruning together to achieve 90% accuracy with fast inference. Table I shows the comparison of testing accuracy and inference speed between our proposed model and the other face
Fig. 6: A graphical representation of training result on different face recognition models
recognition models. Figure 6 depicts a graphical comparison of the training results of our proposed approach and the other face recognition models. Among them, our model exhibits exceptional accuracy.

C. Conclusion

In the proposed door access system employing face recognition, images are maintained in a database. This technique can be employed for door lock access in both residential and business settings. This study offers an analysis of a real-time facial recognition system utilizing the ArcFace algorithm, motivated by the current issues. We have also employed network pruning to achieve a real-time facial recognition-based door locking system and to speed up inference. As a result, this research implements a real-time face identification method based on ArcFace and network pruning that can ignore the impact of various expressions, angles, and sizes. The system can detect human faces with high recognition speed and accuracy, with a recognition efficiency of 90%. We demonstrate that our method frequently outperforms the state of the art under rigorous examination.

REFERENCES

[1] L. Feng, J. Wang, C. Ding, Y. Chen, and T. Xie, "Research on the feedback system of face recognition based on artificial intelligence applied to intelligent chip," in Journal of Physics: Conference Series, vol. 1744, no. 3. IOP Publishing, 2021, p. 032162.
[2] B. Zhu, D. Zhang, Y. Chu, X. Zhao, L. Zhang, and L. Zhao, "Face-computer interface (fci): Intent recognition based on facial electromyography (femg) and online human-computer interface with audiovisual feedback," Frontiers in Neurorobotics, vol. 15, 2021.
[3] C. Marechal, D. Mikolajewski, K. Tyburek, P. Prokopowicz, L. Bougueroua, C. Ancourt, and K. Wegrzyn-Wolska, “Survey on ai-based multimodal methods for emotion detection.” High-performance modelling and simulation for big data applications, vol. 11400, pp. 307–324, 2019. [4] M. Mashiata, T. Ali, P. Das, Z. Tasneem, F. R. Badal, S. K. Sarker, M. Hasan, S. H. Abhi, M. R. Islam, F. Ali et al., “Towards assisting visually impaired individuals: A review on current status and future prospects,” Biosensors and Bioelectronics: X, p. 100265, 2022. [5] C. Das, A. A. Mumu, M. F. Ali, S. K. Sarker, S. M. Muyeen, S. K. Das, P. Das, M. M. Hasan, Z. Tasneem, M. M. Islam, M. R. Islam, M. F. R. Badal, M. H. Ahamed, and S. H. Abhi, “Towards iort collaborative digital twin technology enabled future surgical sector: Technical innovations, opportunities and challenges,” IEEE Access, pp. 1–1, 2022. [6] A. Biswas, M. O. Reon, P. Das, Z. Tasneem, S. Muyeen, S. K. Das, F. R. Badal, S. K. Sarker, M. M. Hassan, S. H. Abhi et al., “State-ofthe-art review on recent advancements on lateral control of autonomous vehicles,” IEEE Access, 2022. [7] T. Ahmed, P. Das, M. F. Ali, and M.-F. Mahmud, “A comparative study on convolutional neural network based face recognition,” in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–5. [8] P. Das, T. Ahmed, and M. F. Ali, “Static hand gesture recognition for american sign language using deep convolutional neural network,” in 2020 IEEE Region 10 Symposium (TENSYMP). IEEE, 2020, pp. 1762– 1765. [9] S. Sepasgozar, R. Karimi, L. Farahzadi, F. Moezzi, S. Shirowzhan, S. M. Ebrahimzadeh, F. Hui, and L. Aye, “A systematic content review of artificial intelligence and the internet of things applications in smart home,” Applied Sciences, vol. 10, no. 9, p. 3074, 2020. [10] H. S¨umb¨ul, A. Cos¸kun, and M. Tas¸demir, “The control of an automatic door using fuzzy logic,” in 2011 International Symposium on Innovations in Intelligent Systems and Applications. IEEE, 2011, pp. 432–435. [11] H. H. Lwin, A. S. Khaing, and H. M. Tun, “Automatic door access system using face recognition,” international Journal of scientific & technology research, vol. 4, no. 6, pp. 294–299, 2015. [12] J.-C. Yang, C.-L. Lai, H.-T. Sheu, and J.-J. Chen, “An intelligent automated door control system based on a smart camera,” Sensors, vol. 13, no. 5, pp. 5923–5936, 2013. [13] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690– 4699. [14] “MS Windows NT kernel description,” https://paperswithcode.com/method/arcface, accessed: 09-12-2022. [15] H. Tessier, “Neural network pruning 101,” https://towardsdatascience.com/neural-network-pruning-101af816aaea61, 2021. [16] B. Geng, M. Yang, F. Yuan, S. Wang, X. Ao, and R. Xu, “Iterative network pruning with uncertainty regularization for lifelong sentiment classification,” in Proceedings of the 44th International ACM SIGIR conference on Research and Development in Information Retrieval, 2021, pp. 1229–1238. [17] A. S. Sadun, J. Jalani, J. A. Sukor, and B. Pahat, “A comparative study on the position control method of dc servo motor with position feedback by using arduino,” in Proceedings of Engineering Technology International Conference (ETIC 2015), 2015, pp. 10–11. [18] A. 
ANAND, “Lfw people (face recognition),” https://www.kaggle.com/datasets/atulanandjha/lfwpeople, 2019. [19] S. Sharma, K. Shanmugasundaram, and S. K. Ramasamy, “Farec—cnn based efficient face recognition technique using dlib,” in 2016 international conference on advanced communication control and computing technologies (ICACCCT). IEEE, 2016, pp. 192–195. [20] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815– 823. [21] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5203–5212.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
A Robust Vision Based Lane Scenario Detection and Classification Using Machine Learning for SelfDriving Vehicles Sheikh Fardin Hossen Araf, Tasfia Zaman Raisa, Akfa Sultana Mithika, Nusrat Fateha Shahira, Fakir Sharif Hossain Department of Electrical and Electronic Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—Among the five levels of autonomous vehicles, fully autonomous driving is the ultimate requirement, and lane detection is one of its critical challenges. With zero human interaction, lane detection must be performed with a high safety margin, which requires complex decision-making algorithms; machine learning algorithms are a highly trusted option in this case. In this paper, a simplified camera-based lane detection and scene classification method is proposed that accounts for different road conditions, so that autonomous vehicles can operate in various challenging scenes with a single camera and accurately recognize lanes under any conditions. Our proposed instance segmentation-based lane detection and classification algorithm can identify different road conditions by analyzing different scenes in complex situations. The training process is enriched with an in-field data set, which delivers significant accuracy.

Keywords—Lane detection; MASK R-CNN; Machine Learning; Instance Segmentation.
I. INTRODUCTION

Nowadays, autonomous vehicles have attracted the attention of many researchers globally. An artificially intelligent vehicle can convey people to the proper destination, and artificial intelligence in autonomous vehicles helps assess road conditions accurately. Lane detection is pivotal to realizing an autonomous vehicle: it is the process through which each lane's exact features and position become known in detail. Several methods have previously been proposed for lane detection, including instance segmentation, the Hough transform, FPGA (Field Programmable Gate Array) based systems, CNNs (Convolutional Neural Networks), artificial neural networks used for image processing and recognition [1], R-CNN (Region Based CNN), Faster R-CNN [2], etc. The Hough transform can accurately detect a straight line on a road; however, it fails on curvy roads and provides wrong statistics in shaded areas. CNNs are commonly applied to image process-
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
ing and analysis, but a straightforward CNN architecture is not the best option for an image that contains several objects [3]. Another process for lane detection is R-CNN. The purpose of R-CNN architecture is to handle image detection problems. Additionally, Mask R-CNN is built on the R-CNN architecture, which is then augmented to create Faster R-CNN [4]. The studies showed that most methods identify lanes under certain conditions such as at night, in the rain, in foggy conditions, etc but not in all conditions. In this paper, we propose instance segmentation-based lane detection algorithms to categorize daylight, night, rainy, foggy and shadowed all types of circumstances to detect lanes in an efficient way. The proposed method can identify lanes under any circumstances by analyzing lane images. Three classes are proposed to categorize the lane and scenarios for an autonomous vehicle for recognizing lanes. These are i) daylight conditions, or marked and unmarked lanes during the daytime, ii) night conditions, where lanes can be detected under low-light conditions and nighttime, and iii) challenging condition refers to any difficult circumstance such as shadow, rain, dust, or any other scenarios. Lanes are detected using a low-featured camera that is very cost effective compared to costly sensors like Lidar. The MASK R-CNN network can resolve lane line recognition in challenging circumstances in road scenes. Real-world data sets are used to classify video clips into road images to evaluate the proposed method’s effectiveness. The proposed method can detect lanes even in challenging conditions. The main significance of the proposed model is that it can detect whether the road is in a day, night, or other situation based on a random road image. As a result, our model can classify any unknown situation. The remaining parts are organized as follow. Section II discusses machine learning-based lane detection, lane identification and classification systems for self-driving cars, and
region-based convolutional neural networks. The preprocessing of the incoming data and the model architecture are covered in Section III. The experimental lane detection outcomes under various lighting and weather situations are discussed in Section IV. Finally, a concluding remark is drawn in Section V.

II. BACKGROUND

A. Region Based Convolution Neural Network and Machine Learning

Artificial Intelligence (AI) incorporates machine learning, which analyzes data, learns from that experience without being explicitly programmed, and then applies that learning to make more precise decisions [5]. Instance segmentation is an advanced method of image segmentation that deals with locating instances of objects and defining their boundaries, treating multiple objects belonging to the same class as distinct individual instances; if the process identifies several humans, each person is categorized as a separate instance. The MASK R-CNN method in [1] addresses the instance segmentation problem. Instance segmentation, as defined by [6], is a mixture of two sub-problems that deal with locating and delineating each object of interest in an image. The first is object detection, the problem of identifying and categorizing a variable number of objects in an image; the number is variable because the number of objects that can be found varies from image to image. The second is semantic segmentation, presented in [7], which allows the object of interest to span various areas of the image at the pixel level and involves assigning a class label to each pixel in the image; this method accurately finds objects with irregular shapes. A region of interest (ROI) [8] is a portion of an image or data set selected for a specific objective. The bounding box object detection method R-CNN (Region Based Convolutional Neural Network) is used to generate multiple object regions, or regions of interest. In its successor, Faster R-CNN, object detection is carried out in two steps: first the bounding boxes and, ultimately, the regions of interest are determined; second, the class label determines which object belongs to each ROI. In this process, ROI pooling is used, and these duties are included in MASK R-CNN. ROI pooling resolves the object detection network's fixed input size requirement and is mostly used to scale the proposals uniformly: it maps each proposal onto the corresponding region of the feature map, divides the mapped region into pieces of the same size, and runs average or maximum pooling on each segment, as can be seen in Fig. 1. By correctly aligning the extracted features with the input, RoIAlign eliminates the severe quantization of ROI pooling. The entire image is fed into a CNN, an artificial neural network specifically built to process pixel data for image processing and recognition, which can detect ROIs on the feature maps [8].

Fig. 1: A region of interest (ROI) pooling operation.
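Because RoIAlign is central to the discussion above, a tiny torchvision example is shown below; the feature-map size, the box coordinates and the spatial scale are made-up values used only to show the call.

```python
# Illustration of RoIAlign with torchvision.ops (made-up sizes and boxes).
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)            # (batch, channels, H, W) feature map
# One box per row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
boxes = torch.tensor([[0, 10.0, 20.0, 200.0, 120.0]])
# Extract a fixed 7x7 feature for the region; bilinear sampling avoids the
# quantization of classic ROI pooling.
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=50 / 400,
                   sampling_ratio=2, aligned=True)
print(pooled.shape)                               # torch.Size([1, 256, 7, 7])
```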
B. Related Works

In [4], lane line recognition based on the MASK R-CNN algorithm and the TSD-Max data set, including samples of challenging circumstances, results in a total accuracy of 97.9%; it is the only attempt in the literature to identify lanes or roads in specific circumstances. The method proposed in [9] can handle various lane changes instantly, utilizing a bird's-eye view at a speed of 50 fps (frames per second). The method in [10] treats the lane detection challenge as an instance segmentation problem and trains the model end-to-end, but fails to achieve a significant accuracy. A complete data set is prepared for Mask R-CNN training with unmarked road images; the model is able to detect unmarked roads under various background conditions and achieves 80.2% accuracy, yielding an algorithm for a lane departure warning system. An algorithm is presented in [11] for detecting shadowed and illumination-changing lanes using Haar filters, which help to calculate the eigenvalues; the data sets are trained with an improved boosting algorithm, and Fisher discriminant analysis is used to initialize the weights. In [12], a split into three parts is presented: i) lane segmentation, ii) lane discrimination, and iii) mapping in terms of lane segmentation. To segment key frames and non-key frames, a semantic segmentation network and a slim optical flow estimation network are proposed, yielding an improvement of about 3%. In [13], road scene analysis is performed using an instance segmentation method that incorporates a semantic segmentation technique based on evolved FCNs (Fully Convolutional Networks) and a lane mark fitting algorithm; the suggested methodologies increase the integrity and accuracy of edge contour detection for each object and extract drivable areas and lane markers well. Digital image processing in [14] is divided into three levels: at the low level, the sharpness is improved by reducing the input image dimensions, and the region of interest is defined based on the minimum safe distance from the vehicle ahead; the authors designed a feature extractor algorithm for lane edge detection. Hough transform and shape-preserving spline interpolation are also used to
achieve a smooth lane fitting. The authors presented a realtime embedded LDWS powered by Advanced RISC Machines (ARM) [15]. In terms of software design, lane boundaries are successfully detected using an upgraded lane detection method based on peak discovery for feature extraction. The effectiveness of this method is that in a highway environment, the lane detection rate is 99.57% during the day and 98.88% at night. Several FPGA based lane detection algorithms are in the literature. The system in [16] suggests an FPGA-based system that can instantly process photos with high resolution and frame rates resulting in a wide range of view and accuracy of the distance that works together to define the drivable tunnel by identifying the lane and road surface including the tunnel’s forward obstruction. The standard Canny-Hough lane detecting technique is modified for real-time processing and to reduce computational complexity. In [6], authors presented an encoder-decoder deep learning architecture to generate binary segmentation of lanes, and the binary segmentation map is further processed to separate lanes and a sliding window removes each lane to provide the segmentation image for each individual lane. In a tuSimple data set, the method is verified.
Fig. 2: Input image from camera (1280x720).
Fig. 3: Cropped image (1280x260).
In [17], Python-based data acquisition is presented to identify the lane of the road; however, an efficient process that can adaptively change the parameters between day and night needs to be included in this algorithm, as constant parameters can only be used under identical illumination conditions. In [18], a deep convolutional neural network is constructed on the FCN network to extract lane borderline characteristics and lane images at the pixel level; a Hough transformation is also utilized here to determine the suitable intervals, and a least squares method is used to fit the lane markings. They claim an accuracy of 98.74% on the tuSimple data set and 96.29% on the Caltech lanes data set. The paper [19] focused on pixel intensity to detect lanes in various weather conditions; the ROI is divided into non-overlapping blocks to reduce the computation, and two simplified masks are presented to obtain the block gradients and block angles. The technique claims 96.12% and 98.60% accuracy for the average lane detection rate and departure warning, respectively. The lane features along curved highways [20] can be extracted semi-automatically from Mobile Laser Scanning (MLS) [21] point clouds. From the above works, it is evident that many techniques have been presented to date, displaying considerable advancement in efficient and safe lane identification for autonomous vehicles. Still, the accuracy of most of the mentioned approaches falls a bit short. In addition, most of the works detect lanes only under certain conditions and are not capable of detecting lanes in all situations or environmental conditions. Therefore, in this work, we propose an efficient machine learning-based technique that can deliver heightened accuracy using simple, locally available cameras.

III. METHOD

A. Pre-processing Data
After capturing a road image, the proposed method determines the weather condition, whether day, night, rain, or shadow; as a result, the road images are classified into three categories. The proposed approach can also handle augmented or blurred images, allowing it to work with any camera viewpoint. Additionally, 250 sample road images with a resolution of 1280x720 pixels are taken from Google, YouTube, and our in-field captured videos. All samples are collected from different roads and highways in Bangladesh. The ROI is kept at roughly half the height of an image, as can be seen in Fig. 2 and Fig. 3, respectively. This is because the training procedure needs to be accurate and sufficiently rich, with a large set of data covering highly challenging situations. To maintain the same resolution throughout, the cropped images (see Fig. 3) are resized to 600x600 pixels, as shown in Fig. 4. The convolutional neural network ResNet101 is utilized for training the model, and pre-trained sample image data from a collected data set are loaded, which can be found in [22].

B. Proposed Model Architecture

A feature extractor known as a Feature Pyramid Network (FPN) produces proportionally scaled feature maps at several levels in a fully convolutional manner from a single-scale image of any size to detect lanes. Possible object bounding boxes are initially proposed by the RPN (Region Proposal Network), and then ROI pooling extracts the features from these boxes. A small feature map is typically extracted from each RoI using RoI pooling, which first quantizes the RoI and then subdivides it into spatial bins, which are themselves quantized. When pixel-accurate masks are predicted, this quantization causes misalignment.
Fig. 6: Flowchart of the proposed method.
Fig. 4: Stretched image (600x600).
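The cropping and resizing step of Sec. III-A can be sketched with OpenCV as follows; the file name and the exact crop offset are assumptions, while the 1280x260 crop and the 600x600 output follow the text and Figs. 3-4.

```python
# Illustrative pre-processing sketch: crop the road region and stretch to 600x600.
import cv2

frame = cv2.imread("road_frame.png")              # 1280x720 input frame (assumed file)
h, w = frame.shape[:2]

roi = frame[h - 260:h, 0:w]                       # keep a 1280x260 band of the frame (assumed offset)
resized = cv2.resize(roi, (600, 600), interpolation=cv2.INTER_LINEAR)

cv2.imwrite("road_frame_600x600.png", resized)
```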
Fig. 7: (a) Masked out object and (b) masked out background
Fig. 5: Mask RCNN architecture.
The coordinates are therefore calculated using bilinear interpolation rather than quantization. RoIAlign [23] is utilized to solve this issue, as can be seen in Fig. 5. Since there is no quantization and hence no information loss, and every region of interest maintains pixel-to-pixel alignment, RoIAlign is preferable to ROI pooling. Following the feature extraction, the feature data are analyzed by classification and bounding-box regression, and a binary mask is generated for each ROI alongside the classification and bounding-box regression. The multi-task loss is

ξ = ξ_class + ξ_box + ξ_mask    (1)
In (1), ξclass stands for the classification loss , ξbox for bounding box loss, and ξmask for average binary cross-entropy loss. Since classification depends on mask predictions, there is a specific order for classification and regression for most modern network systems. On the other hand, Mask R-CNN performs bounding-box classification and regression simultaneously, effectively streamlining the multi-stage pipeline of the original R-CNN. The system uses the multi-task loss, which is calculated as the sum of the classification loss, bounding-box loss, and average binary cross-entropy loss. One thing to note is that those masks across classes compete with one another for other network systems. However, in this particular scenario, a per-pixel sigmoid and a binary loss render the masks
across classes incompatible, making this formulation essential for successful instance segmentation. The architecture of the proposed approach is depicted in Fig. 6. The lane videos are recorded with a simple camera and frames are taken from the video reels. The images are cropped and resized during pre-processing to obtain the exact resolution. Then, a Tesla T4 12 GB GPU is utilized for training, with a batch size of 5, 100 epochs and 50 iterations. The trained model achieves a maximum average precision of 100%. Fig. 7 shows masking, which hides parts of the scene as if they were behind an invisible object; it is used to make environments that extend backward and are framed by a target image. The probability of accurately detecting lane markers throughout the training process is referred to as the TPR (True Positive Rate):

TPR = TP / (TP + FN)    (2)

where TP is the total number of accurate detections and FN is the total number of false negatives, i.e., cases where none of the conditions or classes could be identified. Twenty random pictures from each class are used to create test samples, and the model's accuracy is then assessed using these test samples.
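As an illustration of how such a pipeline can be evaluated, the sketch below runs a pre-trained Mask R-CNN from torchvision over a handful of test images and computes the TPR of Eq. (2). Note the substitutions: torchvision's ResNet-50 FPN backbone stands in for the paper's Keras/ResNet101 model, the image file names are hypothetical, and "a detection above threshold" stands in for a correct lane detection.

```python
# Illustrative evaluation loop (not the authors' code).
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()   # torchvision >= 0.13

def detects_lane(image_path, score_threshold=0.5):
    img = read_image(image_path).float() / 255.0           # CxHxW tensor in [0, 1]
    with torch.no_grad():
        pred = model([img])[0]                              # dict with boxes/labels/scores/masks
    return bool((pred["scores"] > score_threshold).any())

test_images = ["day_01.png", "night_01.png", "rain_01.png"]  # hypothetical test set
tp = sum(detects_lane(p) for p in test_images)
fn = len(test_images) - tp
tpr = tp / (tp + fn)                                        # Eq. (2)
print(f"TPR = {tpr:.3f}")
```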
IV. RESULTS
A. Training and Validation Loss
During model training and validation, several distinct losses are tracked, including the rpn-class-loss, rpn-bbox-loss, mrcnn-class-loss, mrcnn-bbox-loss, mrcnn-mask-loss, and some general losses.
Fig. 8: Epoch vs loss.
Fig. 9: Epoch vs accuracy.

Fig. 10: Input-output at different scenarios. (a) and (c) are daylight inputs; (b) and (d) are daylight outputs; (e) and (f) are the rainy input-output pair; (g) and (h) are the foggy input-output pair; (i) and (j) are the shadow input-output pair; and (k) and (l) are the night input-output pair.

The mrcnn-class-loss covers almost all classes used to make predictions. The rpn-class-loss, on the other hand, predicts the class of every object, including background and foreground. The bounding-box losses check the distance to the true boxes: the rpn-bbox-loss reflects where the object is located, while the mrcnn-bbox-loss reflects the precision of the bounding box. The mask loss produces a binary mask for each class over the different regions of interest (ROI). The general loss is calculated from the other losses as in [1]. In Fig. 8, it is clear that the model gradually reached its best fit. Two higher losses can be seen in the training loss curve, which correspond to technical training error. Accuracy is the most critical factor when training the model in Keras [24]. In Fig. 9, it can be seen that the accuracy gradually increases with the epochs, verifying that the model training process is almost entirely efficient. The highest accuracy, 98.855%, is reached at the 97th epoch.

B. Testing Results
For testing, random pictures collected from Google in different lighting conditions and scenarios are used. The accuracy is promising, as shown in Fig. 10. Fig. 10 displays input-output pairs under various conditions: daylight, rainy, foggy, and shadowed images, including night mode. The results of testing with sample sets under different conditions, with accuracy measured by (2), are given in Table I. Each sample set contained 20 images of daylight, night, and challenging rainy, foggy, and shadowed situations.
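For reference, the following is a minimal training-configuration sketch assuming the Matterport Mask R-CNN implementation referenced in [22]; the class name, file paths, and configuration values shown are illustrative assumptions rather than the authors' exact settings. Under this setup, Keras reports the rpn-class, rpn-bbox, mrcnn-class, mrcnn-bbox and mrcnn-mask losses discussed in Section IV-A for every epoch.

```python
# Sketch of a Mask R-CNN training configuration using the Matterport repo [22].
from mrcnn.config import Config
from mrcnn import model as modellib

class LaneConfig(Config):
    NAME = "lane"
    IMAGES_PER_GPU = 5        # batch size of 5 used in this work
    NUM_CLASSES = 1 + 3       # background + the three road-scenario classes
    STEPS_PER_EPOCH = 50      # 50 iterations per epoch

config = LaneConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")
# Pre-trained COCO weights are distributed through the releases page in [22].
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
# model.train(dataset_train, dataset_val,
#             learning_rate=config.LEARNING_RATE, epochs=100, layers="heads")
```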
TABLE I: Accuracy results for different test sets

Test set   Daylight condition   Night condition   Challenging condition
1          98.67%               97.65%            96.6%
2          99.25%               98.99%            98.54%
3          99.1%                98.62%            98.65%
Average    99.00%               98.42%            97.93%
V. CONCLUSION
This paper proposed a system for lane detection and classification using Mask R-CNN in various lighting and weather scenarios. The experimental results ensure accurate lane scenario detection based on the surrounding conditions, with a 98.45% detection accuracy for the proposed model. Due to the absence of snowy weather in the data set, the model cannot handle challenging scenarios related to snow environments; however, the method may detect snowy scenarios if data on snow surroundings is added. Accuracy approaching 100% could possibly be achieved by adding more data to the training set. Our future direction is to improve the detection accuracy by using a larger data set.

ACKNOWLEDGMENT
This research work is supported by the Department of Electrical and Electronic Engineering (EEE), Ahsanullah University of Science and Technology (AUST) as part of undergraduate thesis work.

REFERENCES
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, Venice, Italy, 22-29 Oct, 2017.
[2] M. Mduduzi, Tu. Chunling, and P. A. Owolawi. Preprocessed faster R-CNN for vehicle detection. In 2018
International Conference on Intelligent and Innovative Computing Applications (ICONIC), pages 1–4. IEEE, Mon Tresor, Mauritius, 06-07 Dec, 2018.
[3] Y. Pang, M. Sun, X. Jiang, and X. Li. Convolution in convolution for network in network. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1587–1597, 2017.
[4] B. Liu, H. Liu, and J. Yuan. Lane line detection based on mask r-cnn. In 3rd International Conference on Mechatronics Engineering and Information Technology (ICMEIT 2019), pages 696–699. Atlantis Press, Dalian, China, 29-30 Mar, 2019.
[5] S. Hong and D. Park. Runtime virtual lane prediction based on inverse perspective transformation and machine learning for lane departure warning in low-power embedded systems. In 2022 IEEE International Conference on Imaging Systems and Techniques (IST), pages 1–6. IEEE, Kaohsiung, Taiwan, 21-23 Jun, 2022.
[6] G. M. Gad, A. M. Annaby, N. K. Negied, and M. S. Darweesh. Real-time lane instance segmentation using segnet and image processing. In 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES), pages 253–258. IEEE, Giza, Egypt, 24-26 Oct, 2020.
[7] M. Shao, M. A. Haq, D. Gao, P. Chondro, and S. Ruan. Semantic segmentation for free space and lane based on grid-based interest point detection. IEEE Transactions on Intelligent Transportation Systems, 23(7):8498–8512, 2021.
[8] D. Ding, C. Lee, and K. Lee. An adaptive road roi determination algorithm for lane detection. In 2013 IEEE International Conference of IEEE Region 10 (TENCON 2013), pages 1–4. IEEE, Xi'an, China, 22-25 Oct, 2013.
[9] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. V. Gool. Towards end-to-end lane detection: an instance segmentation approach. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 286–291. IEEE, Changshu, China, 26-30 Jun, 2018.
[10] H. J. A. Undit, M. F. A. Hassan, and Z. M. Zin. Vision-based unmarked road detection with semantic segmentation using mask r-cnn for lane departure warning system. In 2021 4th International Symposium on Agents, Multi-Agent Systems and Robotics (ISAMSR), pages 1–6. IEEE, Batu Pahat, Malaysia, 06-08 Sep, 2021.
[11] C. Fan, J. Xu, and S. Di. Lane detection based on machine learning algorithm. TELKOMNIKA Indonesian Journal of Electrical Engineering, 12(2):1403–1409, 2014.
[12] S. Lu, Z. Luo, F. Gao, M. Liu, K. Chang, and C. Piao. A fast and robust lane detection method based on semantic segmentation and optical flow estimation. Sensors, 21(2):400, 2021.
[13] Y. Chan, Y. Lin, and P. Chen. Lane mark and drivable area detection using a novel instance segmentation scheme. In 2019 IEEE/SICE International Symposium on System Integration (SII), pages 502–506. IEEE, Paris, France, 14-16 Jan, 2019.
[14] D. C. Andrade, F. Bueno, F. R. Franco, R. A. Silva, J. H. Z. Neme, E. Margraf, W. T. Omoto, F. A. Farinelli, A. M. Tusset, S. Okida, et al. A novel strategy for road lane detection and tracking based on a vehicle's forward monocular camera. IEEE Transactions on Intelligent Transportation Systems, 20(4):1497–1507, 2018.
[15] P. Hsiao, C. Yeh, S. Huang, and L. Fu. A portable vision-based real-time lane departure warning system: day and night. IEEE Transactions on Vehicular Technology, 58(4):2089–2094, 2008.
[16] H. Iwata and K. Saneyoshi. Forward obstacle detection in a lane by stereo vision. In IECON 2013 - 39th Annual Conference of the IEEE Industrial Electronics Society, pages 2420–2425. IEEE, Vienna, Austria, 10-13 Nov, 2013.
[17] M. V. G. Aziz, A. S. Prihatmanto, and H. Hindersah. Implementation of lane detection algorithm for self-driving car on toll road cipularang using python language. In 2017 4th International Conference on Electric Vehicular Technology (ICEVT), pages 144–148. IEEE, Bali, Indonesia, 02-05 Oct, 2017.
[18] F. Chao, Y. Song, and J. Ya-Jie. Multi-lane detection based on deep convolutional neural network. IEEE Access, 7:150833–150841, 2019.
[19] C. Wu, L. Wang, and K. Wang. Ultra-low complexity block-based lane detection and departure warning system. IEEE Transactions on Circuits and Systems for Video Technology, 29(2):582–593, 2018.
[20] C. Ye, H. Zhao, L. Ma, H. Jiang, H. Li, R. Wang, M. A. Chapman, J. M. Junior, and J. Li. Robust lane extraction from mls point clouds towards hd maps especially in curve road. IEEE Transactions on Intelligent Transportation Systems, 23(2):1505–1518, 2020.
[21] C. Ye, J. Li, H. Jiang, H. Zhao, L. Ma, and M. Chapman. Semi-automated generation of road transition lines using mobile laser scanning data. IEEE Transactions on Intelligent Transportation Systems, 21(5):1877–1890, 2019.
[22] https://github.com/matterport/Mask_RCNN/releases.
[23] L. Su, Y. Wang, and Y. Tian. R-siamnet: Roi-align pooling based siamese network for object tracking. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 19–24. IEEE, Shenzhen, China, 06-08 Aug, 2020.
[24] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
A Federated Learning Based Privacy Preserving Approach for Detecting Parkinson’s Disease Using Deep Learning Sumit Howlader Dipro, Mynul Islam, Md. Abdullah Al Nahian, Moonami Sharmita Azad, Amitabha Chakrabarty, Md Tanzim Reza Department of Computer Science and Engineering Brac University, Dhaka, Bangladesh Email: {sumit.howlader.dipro, mynul.islam, md.abdullah.al.nahian, moonami.sharmita.azad}@g.bracu.ac.bd, [email protected], [email protected]
Abstract—Parkinson's disease (PD) is a degenerative ailment caused by the loss of nerve cells in the brain region known as the substantia nigra, which governs movement. In numerous research papers, traditional machine learning techniques have been utilized for PD detection. However, traditional ML pipelines always put the sensitivity and privacy of patients' data at risk. This research proposes a novel approach to detecting PD that preserves privacy and security through Federated Learning (FL). FL trains a single algorithm across numerous decentralized local servers, exchanging gradient information instead of the underlying patient data. The proposed model has been tested and evaluated using three CNN models (VGG19, VGG16 and InceptionV3), and among these, VGG19 achieves the best accuracy of 97%. The result demonstrates that this model is very accurate for detecting PD while preserving privacy and security using Federated Learning. Index Terms—Parkinson's disease, Federated Learning, Healthcare, Privacy Preserving.
I. INTRODUCTION
Parkinson's disease is one of the most common diseases in the world. The illness typically starts with minor symptoms and gradually worsens. Parkinson's disease symptoms vary from patient to patient; however, the most common symptoms are tremors, slowness of movement (bradykinesia), rigid muscles/stiff limbs, loss of automatic movements, and stooped posture. There is currently no specified cure for the illness. Therefore, detecting the disease at an early stage and classifying the symptoms of individual patients is crucial for battling this disease. Parkinson's disease's early symptoms often get overlooked, resulting in patients not seeking medical attention. To overcome this obstacle, we propose an AI-based detection system that can detect and classify the disease early, all while keeping the patient data secure and private. Federated learning emerged as a clear and practical answer to this challenge. Using federated learning, we can detect Parkinson's disease symptoms utilizing data from a patient's mobile device without the data ever leaving the system. Thus, the patient's information is kept confidential.
When it comes to the medical sector, the privacy of data is as crucial as in any other primary sector. 77.65% of data breaches in 2019 happened in healthcare provider organizations [1]. Also, according to [1], medical data breaches in 2019 increased by 37.47% compared to the previous year, and this number keeps growing year after year. Therefore, traditional machine learning algorithms are perpetually exposed to data security and privacy concerns. Data privacy and security necessitate maintaining data confidentiality, as privacy cannot be guaranteed if data are vulnerable to unwanted access. Existing machine learning solutions cannot guarantee such security. Traditional machine learning algorithms run in a centralized data center to which data owners upload their data; as a result, since the data is private, owners are hesitant to share it [2]. Additionally, data collection is a time-consuming and challenging task that is crucial for machine learning improvement. ML is becoming a commodity service that individuals use regularly. If machine learning algorithms provided by unfaithful parties are applied blindly, the sensitive information included in the training set can be exposed [3]. Furthermore, as noted in [4], additional characteristics might reduce model accuracy because there is additional data that needs to be generalized. To address the issues above, we propose a Federated Learning model. Federated Learning has proven to be a promising paradigm for maintaining the privacy and security of clients' data. Federated Learning is a fundamental idea that enables the development of machine learning models using data sets spread across multiple devices while preventing data leakage. FL enables several participants to cooperate in training a machine learning model without exchanging local data. This paper aims to create a federated learning environment where several medical institutes can collaborate to research Parkinson's Disease using machine learning without having to share their medical data. In this research, we have successfully created such an environment. In our proposed model, there will be four client servers, where clients will compute training gradients on each of their servers through distributed learning algorithms. The authors of [5] demonstrated that gradient updates can leak
a large amount of information about clients' training data. To mitigate this issue, after training, each local server will send its gradients to the central server; in this way, for every communication round there is a transaction, for every transaction one block is added, and a chain is formed with the upcoming transactions. The central server then aggregates those gradients through the FedAvg algorithm, and updated gradients are sent back to the local servers.

II. RELATED WORKS
A. Parkinson's Disease
For diagnosing Parkinson's disease, the paper [6] used Neural Networks (NN), DMneural, Regression, and Decision Trees, and different assessment techniques were used to compare these schemas' overall classifier performance. Neural Network classifiers had the highest accuracy at 92.9% and were shown to be more accurate than kernel SVM. Vocal problems were linked to early Parkinson's Disease symptoms in 90% of individuals [7], which motivates the use of vocal characteristics in computer-assisted diagnosis and remote patient monitoring, yielding greater accuracy with a small set of specified voice features for recognizing Parkinson's disease. In that work, four classifiers achieved 94.7% accuracy, 98.4% sensitivity, 92.68% specificity, and 97.22% precision; accuracy was boosted while computing complexity was reduced by giving the classifiers important, uncorrelated information, where 50 characteristics had been fed to the classifiers. In the research of [8], the authors applied an ML-based computer-aided diagnostic system for the effective prediction of PD. In addition, they used boosting algorithms and found that LightGBM achieves an accuracy of 93.39% for detecting PD, outperforming the other boosting algorithms. According to [9], early detection slows Parkinson's progression. Globally, 10 million people suffer from PD, and early PD diagnosis from gait and speech could be a critical aspect of reducing its progression. That research used random forest, support vector, and neural network models based on acoustic speech analysis. The authors developed a deep learning model which differentiated PD patients from healthy people, and the suggested deep learning model achieved 96.45% accuracy, which proved crucial for early PD diagnosis.

B. Federated Learning
Regarding the impact of data distribution among cooperating institutions on model quality and learning patterns, federated learning (FL) had an advantage over other collaborative methods such as incremental institutional learning (IIL) and cyclic incremental learning (CIIL), against which it was compared; FL was found to perform better than IIL and CIIL. Additionally, FL improves models at the fastest pace among the data-private collaborative learning approaches [10]. That work showed how collaborative data and FL approaches could achieve learning over the complete data without the need to share patient information, support broad multi-institutional collaboration to overcome technology and data ownership issues, and satisfy data protection regulatory
requirements. According to [11], AI has altered imaging, pathology, and genetics, and deep learning models need millions of parameters for clinical-grade accuracy, so high-quality data collection takes time and money, whereas FL trains algorithms without sharing data, addressing data governance and privacy. FL-trained models also outperform single-institution models, and FL has been used for brain tumor and whole-brain segmentation. When training a new ML model, a system is constantly faced with vast amounts of data. For this issue, in [12], R. Kumar, A. A. Khan, J. Kumar, et al. conducted research on a normalization process introduced to organize the massive amount of data generated in the medical industry. In addition, they use a Capsule Network (CN) to get better results than other known models. However, the algorithm is relatively slow due to the inner loop of its dynamic routing.
Fig. 1. Federated Learning Architecture
An end-to-end framework is proposed in another research work [13] for data standardization. Moreover, to avoid any bottleneck in training the model, they used the Alternating Direction Method of Multipliers (ADMM), which requires fewer iterations. Although this framework has high potential, it still needs in-depth research and application to large amounts of data to confirm its accuracy. Using federated learning (FL), decentralized data analysis eliminates the requirement to submit data to a central server. Furthermore, the data retains its utility despite being stored locally [2]; in this federated way, the confidentiality and privacy of the source data are meant to be maintained. According to [14], FL is able to provide a reasonable trade-off between accuracy and utility together with privacy enhancement compared to traditional centralized learning. Additionally, FL training preserves the generalizability of the model at the cost of a minimal accuracy loss; in exchange, FL's distributed learning function can improve the scalability of the smart healthcare system.

C. FedAvg
FedAvg is the averaging method of federated learning, and it takes place on the global server. In FedAvg, the global server
takes the weighted average of the resultant models after each client performs one round of gradient descent on the current model using its local data. By iterating the local update numerous times before performing the averaging on the global server, extra processing can be added to each client. The amount of computation is directed by three parameters mentioned in [15] and indicated by equations (1) and (2).

∀k: w_{t+1}^k ← w_t − η g_k    (1)

w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w_{t+1}^k    (2)
III. DATASET DESCRIPTION
A. PPMI Data
We used FP-CIT SPECT (single-photon emission computed tomography) data received via approved access from the Parkinson's Progression Markers Initiative (PPMI) [16]. Up-to-date data may be obtained from [16] through a data access request. PPMI, founded in 2010, is a major research initiative to find biological markers of the initiation and development of Parkinson's Disease. It provides an open-access dataset and biosample library for PD, including FP-CIT SPECT and MRI; in this study, we employ the SPECT data. The collection comprises 645 subjects' FP-CIT SPECT pictures, with 207 healthy individuals (HC) and 434 people with Parkinson's disease (PD) taking part. Following the collection of raw data, PPMI uses an iterative ordered-subsets expectation-maximization method in their lab to rebuild the center picture on the HERMES workstation (a medical imaging computer). The data in the dataset has two states: the resting state and the BOLD (blood-oxygen-level dependent) state. In the resting state, the brain is kept idle, or task-negative, while mapping. In the BOLD state, brain mapping is done even when an externally prompted task is absent.

B. Data Pre-Processing
Data preprocessing is a challenging aspect, as the right attributes must be used to do a relevant analysis. For early data preprocessing, an anisotropic Gaussian filter with an 18 mm FWHM is applied to every one of the initial 3D PPMI SPECT images to account for their limited resolution. This process smooths the SPECT images and preps the data for image classification later. There are three distinct settings created from the PPMI data:
• Original, unsmoothed images, which include around 438 Parkinson's disease affected patients and 207 healthy patients.
• Smoothed images, including 438 Parkinson's disease patients and 207 healthy patients.
• The mixed setting, which included both the original and all the smoothed pictures, comprising 414 healthy patients and 876 Parkinson's disease patients.
A total of two hundred and ninety-eight patients who had been under regular clinical assistance at the University Medical Center Hamburg-Eppendorf were selected at random from the database [17]. Data preprocessing is required when training a model using vast amounts of data. SPECT acquires FP-CIT images in 2D slices, and the brain slices are merged into a 3D model. We used 2D slices of affected (PD) and non-affected (HC) persons for training. We then divided the data into train and test portions with independent HC and PD parts; the train set got 90% of the data and the test set 10%. Due to lower striatal and normal FP-CIT uptake, certain HC and PD scans were eliminated.
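The following is a minimal sketch, under stated assumptions, of the smoothing, slicing and 90/10 split described above; it is not the authors' exact pipeline, and the voxel size, slice axis, and variable names are illustrative.

```python
# Sketch of the SPECT preprocessing: smooth each 3D volume, take 2D slices, split 90/10.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.model_selection import train_test_split

def smooth_volume(volume, fwhm_mm=18.0, voxel_mm=2.0):
    """Gaussian smoothing with the given FWHM (a single sigma is used here for simplicity)."""
    sigma = (fwhm_mm / voxel_mm) / 2.355   # FWHM -> sigma, in voxels
    return gaussian_filter(volume, sigma=sigma)

def to_slices(volume, axis=0):
    """Split a 3D volume into a list of 2D slices for the CNN."""
    return [np.take(volume, i, axis=axis) for i in range(volume.shape[axis])]

# slices, slice_labels = ...   # 2D slices and their HC/PD labels (placeholders)
# x_train, x_test, y_train, y_test = train_test_split(
#     slices, slice_labels, test_size=0.10, stratify=slice_labels)
```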
Fig. 2. SPECT Imaging
Also, we have implemented image augmentation. Image augmentation modifies training dataset images to create various versions of the same image. This gives us additional photos to train on and exposes our classifier to more lighting and color conditions, strengthening it. We flipped images and equalized histograms. Flipping pictures horizontally is a typical strategy for classifiers. Histogram equalization increases visual contrast by spreading out an image's pixel density distribution: if there are ranges of pixel brightness that aren't being utilized, the histogram is stretched to encompass those ranges before being reprojected onto the picture to improve contrast.

IV. PROPOSED MODEL
Federated learning trains ML models without collecting user data. In the federated learning architecture, the central server sends the machine learning algorithm to the client's local device for training. Every client device in the architecture uses this model, training it on the device using user data. Afterward, the trained model is delivered to the central server, all the while the user data never leaves the device and no private information is exposed during training. The server only receives the model and weights. The central server averages those local models to build a single superior global model. This global model is then delivered to every client device for further training. This cycle repeats until training achieves the required precision. In this way, the clients' data remains confidential even when dealing with large amounts of data. This framework is proposed for large volumes of data. For this model to work, several
medical institutes have to adopt our federated framework. Let us assume that we have 20 medical institutes, each containing several local devices. The federated learning server broadcasts the basic global settings at the start. Then, each medical institute's devices train local models with their local datasets. To generate the local weights on the local devices, the significance factor of each hospital is multiplied by its model parameters. The central server aggregates these local parameters and updates the global parameters. The central server uses FedAvg to calculate the mean weighted parameter value of all local models. Finally, each local device receives the aggregated new model and weights.
Fig. 3. Flow chart of the proposed FL Model

A. Implementation
We have used CNN [18] models for training our dataset. Our proposed research compares the performance of three deep learning [19] models in an FL architecture, and the best model was picked among the three based on their performance. We have trained our dataset using VGG16, VGG19, and InceptionV3 [20], and used TensorFlow Federated for implementing federated learning. The database used here is divided into two categories: HC, which represents healthy patients, and PD, representing the Parkinson-affected patients. After this division, the dataset is shuffled and split into test and train categories. In a real-world implementation of FL, every federated member will have their own data linked with them in isolation. However, in order to conduct this study, we have fabricated five clients and given each of them a random portion of the data shards. The next step is to batch process each individual user's data and import it into a TensorFlow dataset. For the model, we added the convolutional neural network with flattened output, along with a dense layer with ReLU; Softmax is used for the actual classification. For the optimizer, Adam is used with accuracy as the metric. For the FL part, the vanilla algorithm, FedAvg, is used for averaging. The data we utilized is horizontally partitioned, so we have done component-wise parameter averaging, weighted by the proportion of data points contributed by each of the participating clients. Here, we compute a weighting for every client's parameters based on the data points they trained with. There are three parts to this. First, a scaling factor is calculated by comparing the total number of training data points collected from all clients with the number of data points held by a single client; as a consequence, we know how much of the global training data each client holds. The model's weights are scaled by this factor in the second part, and in the third part the scaled weights are summed to produce the new global weights (a minimal sketch of this weighted averaging is given below). This is followed by a comparison with a known test dataset to determine the accuracy of the global model. To begin the actual training session, we first take the weights of the global model, which are used as the starting weights for all local models. Next, we randomize and shuffle the client data. After that, a new local model is built for each client, and that model's weights are set to match the weights of the global model. The local models are then calibrated with the client data, and their scaled weights are added to a list. To obtain the average across all of the local models, we simply sum all of the scaled weights.
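Below is a minimal sketch of this scale-and-sum weighted averaging (Eqs. (1)-(2)); it is not the authors' exact implementation, and names such as `client_data`, `build_model`, and `global_model` are illustrative assumptions.

```python
# Sketch of FedAvg-style weighted averaging of Keras model weights.
import numpy as np

def scaling_factor(client_points, all_client_points):
    """Fraction of the global training data held by one client (n_k / n)."""
    return client_points / sum(all_client_points)

def scale_weights(weights, factor):
    """Scale every weight tensor of a local model by the client's factor."""
    return [factor * w for w in weights]

def fed_avg(scaled_weight_list):
    """Sum the scaled local weights layer-wise to obtain the global weights."""
    return [np.sum(layer, axis=0) for layer in zip(*scaled_weight_list)]

# One communication round (illustrative):
# scaled = []
# sizes = [len(d) for d in client_data.values()]
# for data in client_data.values():
#     local_model = build_model()                       # same architecture as the global model
#     local_model.set_weights(global_model.get_weights())
#     local_model.fit(data, epochs=1, verbose=0)        # local gradient descent
#     scaled.append(scale_weights(local_model.get_weights(),
#                                 scaling_factor(len(data), sizes)))
# global_model.set_weights(fed_avg(scaled))
```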
V. RESULTS AND ANALYSIS

TABLE I: Accuracy between different CNN architectures

Model        Accuracy   Validation Accuracy   Loss   Validation Loss
VGG16        0.96       0.97                  0.11   0.08
VGG19        0.95       0.94                  0.10   0.14
InceptionV3  0.96       0.96                  0.08   0.08
We can see from Table I how the VGG19, VGG16, and InceptionV3 accuracies look side by side for centralized classification. We made the table by taking the best accuracy of the last epoch. Each architecture has its own level of accuracy: in the final epoch, VGG19 gives an accuracy of 95%, VGG16 gives an accuracy of 96%, and InceptionV3 gives an accuracy of 96%. So, we can say that the VGG16 and InceptionV3 architectures did better than VGG19. We have then trained our dataset with these models in a federated setting. With these models, we split the dataset into five clients and performed the training as decentralized local servers, which is the main concept of federated learning.
From each server we trained the gradients with the models found earlier for better accuracy; after that, each local server sent its gradients to the central server, creating a transaction for every communication round. In addition, we have used our augmented data for federated training as well. We can see from Table II that VGG19 showed better performance after testing the global model; as it has more layers, it showed higher accuracy and better performance compared to the other models. After performing training with the augmented data, we see in Table III that VGG16 now also performs well. The InceptionV3 model, however, still showed poor performance, so more data is needed to obtain better results with this model.
Fig. 4. Accuracy For VGG19
TABLE II: Accuracy between different CNN architectures in a federated setting with less data

Model        Accuracy   Precision   Recall   F1
VGG16        0.75       0.88        0.52     0.76
VGG19        0.95       0.94        0.94     0.95
InceptionV3  0.43       0.52        0.51     0.43

TABLE III: Accuracy between different CNN architectures in a federated setting with augmented data

Model        Accuracy   Precision   Recall   F1
VGG16        0.96       0.97        0.97     0.97
VGG19        0.97       0.97        0.97     0.97
InceptionV3  0.62       0.62        0.62     0.62

Fig. 5. Confusion Matrix of VGG19
Additionally, we have created the confusion matrices for these models. A confusion matrix is a table that describes the classification model's performance: the rows are the actual classes, whereas the columns are the predicted classes. Figure 5 shows the confusion matrix for VGG19 on the dataset, and Figure 6 shows the confusion matrix for VGG19 on the augmented dataset. For our Federated Learning technique, each local model was trained multiple times on the complete data set, tweaking parameters as required, which often involved changing data pre-processing functions and learning rates and adding layers to suit the models' complexity. To reduce the exorbitant expense of acquiring thousands of training photographs, image augmentation was used to synthesize training data from the existing dataset. Following that, each global model was created from the local models. We have trained each model for 10 federated rounds. The findings suggest that, when training on our dataset in federated settings, only the VGG19 model benefits, because with more layers it achieves higher accuracy; it also performed better with less data. So, we can say that for decentralized classification the VGG19 model outperforms the VGG16 and InceptionV3 models. The results demonstrate that the VGG19 model can perform well with only 1290 image samples. After image data augmentation we increased our dataset to 4002 image samples, with 2001 image samples for each class.
Fig. 6. Confusion Matrix of VGG19 (Augmented)

TABLE IV: Global model result analysis of VGG19

Category       Precision   Recall   F1     Support
Healthy        0.96        0.98     0.97   199
Affected       0.98        0.96     0.97   202
Accuracy       0.97        0.97     0.97   401
Macro Avg      0.97        0.97     0.97   401
Weighted Avg   0.97        0.97     0.97   401
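A minimal sketch of how the confusion matrices (Figs. 5 and 6) and the per-class metrics of Table IV can be produced with scikit-learn follows; the labels below are illustrative placeholders, not the study's data.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Illustrative labels only; in the paper these come from the held-out test set
# (0 = Healthy, 1 = Affected) and the global model's predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Healthy", "Affected"]))
```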
Only then did the VGG16 model show better accuracy. For getting better accuracy with the InceptionV3 model, we need a far larger dataset. This result shows that VGG19 is the superior model for detecting Parkinson's disease from brain images when the dataset is limited.

VI. CONCLUSION
This research aims to provide a new privacy-protecting architecture for distributed machine learning in healthcare. Using the CNN deep learning architectures VGG19, VGG16,
and InceptionV3 in centralized machine learning settings, we detected PD with 95%, 96%, and 96% accuracy. We effectively detected PD with 97%, 96% and 62% accuracy with VGG19, VGG16, and InceptionV3 while guaranteeing privacy. All CNN deep learning models were trained and evaluated using both traditional machine learning and Federated Learning. VGG19 and VGG16 were sensitive and accurate in federated situations. However, data and parameter quality affect a model's efficacy; data suppliers should update their databases as new data becomes available to improve the model's communication loops. If all these parameters are maintained, the design can match or exceed our reported precision. The real-world implementation of this model is simplified by using image processing and neural networks separately; this approach uses extraordinarily few processing resources in a decentralized way and is extremely effective at eliminating privacy issues, making this framework much faster and more efficient than traditional ML. Furthermore, the functionality of the federated learning model can be disrupted by model poisoning. In future work, in order to address this issue, we plan to extend the model by integrating blockchain-based pipelines with it.
REFERENCES
[1] S. Alder, "Healthcare data breach report," HIPAA Journal. https://www.hipaajournal.com/june-2019-healthcare-data-breach-report, 2019.
[2] Y. Zhao, J. Zhao, L. Jiang, et al., "Privacy-preserving blockchain-based federated learning for iot devices," IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1817–1829, 2020.
[3] C. Song, T. Ristenpart, and V. Shmatikov, "Machine learning models that remember too much," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 587–601.
[4] I. Gupta, V. Sharma, S. Kaur, and A. Singh, "PCA-RF: An efficient parkinson's disease prediction model based on random forest classification," Mar. 2022.
[5] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, "Exploiting unintended feature leakage in collaborative learning," in 2019 IEEE Symposium on Security and Privacy (SP), IEEE, 2019, pp. 691–706.
[6] R. Das, "A comparison of multiple classification methods for diagnosis of parkinson disease," Expert Systems with Applications, vol. 37, no. 2, pp. 1568–1572, 2010.
[7] G. Solana-Lavalle, J.-C. Galán-Hernández, and R. Rosas-Romero, "Automatic parkinson disease detection at early stages as a pre-diagnosis tool by using classifiers and a small set of vocal features," Biocybernetics and Biomedical Engineering, vol. 40, no. 1, pp. 505–516, 2020.
[8] M. M. Nishat, T. Hasan, S. M. Nasrullah, F. Faisal, M. A.-A.-R. Asif, and M. A. Hoque, "Detection of parkinson's disease by employing boosting algorithms," 2021 Joint 10th International Conference on Informatics, Electronics & Vision (ICIEV) and 2021 5th International Conference on Imaging, Vision & Pattern Recognition (icIVPR), 2021. DOI: 10.1109/icievicivpr52578.2021.9564108.
[9] W. Wang, J. Lee, F. Harrou, and Y. Sun, "Early detection of parkinson's disease using deep learning and machine learning," IEEE Access, vol. 8, pp. 147 635–147 646, 2020.
[10] M. J. Sheller, B. Edwards, G. A. Reina, et al., "Federated learning in medicine: Facilitating multi-institutional collaborations without sharing patient data," Scientific Reports, vol. 10, no. 1, pp. 1–12, 2020.
[11] N. Rieke, J. Hancox, W. Li, et al., "The future of digital health with federated learning," NPJ Digital Medicine, vol. 3, no. 1, pp. 1–7, 2020.
[12] R. Kumar, A. A. Khan, J. Kumar, et al., "Blockchain-federated-learning and deep learning models for covid-19 detection using ct imaging," IEEE Sensors Journal, vol. 21, no. 14, pp. 16 301–16 314, 2021.
[13] S. Silva, B. A. Gutman, E. Romero, P. M. Thompson, A. Altmann, and M. Lorenzi, "Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data," in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019, pp. 270–274.
[14] D. C. Nguyen, Q.-V. Pham, P. N. Pathirana, et al., "Federated learning for smart healthcare: A survey," ACM Computing Surveys, vol. 55, no. 3, pp. 1–37, 2023. DOI: 10.1145/3501296.
[15] S. Singh, "PPML series 2 - Federated optimization algorithms - FedSGD and FedAvg," Dec. 2021. [Online]. Available: https://shreyansh26.github.io/post/2021-12-18_federated_optimization_fedavg/ (visited on 05/16/2022).
[16] The study that could change everything. [Online]. Available: https://www.ppmi-info.org/ (visited on 12/25/2021).
[17] G. F. Wilson and C. A. Russell, "Real-time assessment of mental workload using psychophysiological measures and artificial neural networks," Human Factors, vol. 45, no. 4, pp. 635–644, 2003.
[18] A. S. Hosain, M. Islam, M. H. K. Mehedi, I. E. Kabir, and Z. T. Khan, "Gastrointestinal disorder detection with a transformer based approach," in 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2022, pp. 0280–0285. DOI: 10.1109/IEMCON56893.2022.9946531.
[19] M. H. K. Mehedi, A. S. Hosain, S. Ahmed, et al., "Plant leaf disease detection using transfer learning and explainable ai," in 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2022, pp. 0166–0170. DOI: 10.1109/IEMCON56893.2022.9946513.
[20] A. S. Hosain, M. H. K. Mehedi, and I. E. Kabir, "PCONet: A convolutional neural network architecture to detect polycystic ovary syndrome (PCOS) from ovarian ultrasound images," Oct. 2022.
VR Glove: A Virtual Input System for Controlling VR with Enhanced Usability and High Accuracy Ishraq Hasan1 , Muhammad Munswarim Khan2 , Kazi Tasnim Rahman1 , Anika Siddiqui Mayesha1 , Zinia Sultana1 , Muhammad Nazrul Islam1 1
Department of Computer Science and Engineering Military Institute of Science and Technology, Dhaka, Bangladesh Email: [email protected] 2
Institute of Information Technology University of Dhaka, Dhaka, Bangladesh Email: [email protected] Abstract—Virtual Reality (VR) is one of the pioneering technologies of the current decade. An increasing number of users are migrating to the virtual world as time progresses. This has led to users desiring more intuitive and natural control for their inputs in the digital world, surpassing the need for traditional input systems such as mouse and keyboard. VR-based natural input systems are scarce, and the available ones are expensive. Again, very few of them translate normal hand movements into virtual control inputs. Therefore, the objective is to design and develop a wearable input system for controlling VR with enhanced usability and high accuracy. To attain this objective, user requirements were first elicited through semi-structured interviews. Then, a cost-effective and usable wearable system (VR Glove) was developed, based on the revealed requirements, for controlling VR. Finally, the system was evaluated with 20 test participants (novice and expert); it was found that the VR Glove was usable for both the novice and expert users, though the system was more usable to experts than to novices. Participants were comfortable with the working mechanism of the proposed VR Glove system and also found the system very responsive. Keywords—Virtual Reality, Human Computer Interaction, Usability, Virtual Input, User Requirement
I. Introduction With the advent of new technologies that incorporate the use of Virtual and Augmented reality, it is becoming necessary to make the virtual world more accessible and intuitive for humans to interact with [1], [2], [3]. This concern has led to the belief that users of such technologies would require a way to use human interface devices in the virtual space [4], [5], [6]. Previously published literature implies that it would be effective if physical devices could be replaced with virtual input systems that the user can see and interact with [7], [8]. In recent years, various innovative implementations and pioneering breakthroughs have been made in the field of VR [9]. They include, and are not limited to, full-body tracking, object-tracking, VR based physics, and countless video games which utilize these features and incorporate them in gameplay [9]. However, the most crucial field of innovation lies
in the interaction layer between the human and the virtual world [10], [9]. Through positional tracking of each handheld controller, for example, a user can see their hands move in the virtual world, and the incorporation of haptic feedback in these controllers enables the user to simulate a sense of touch in the virtual world [5], [10]. In general, VR controllers that users are supposed to grab onto, in a similar fashion to a tennis racquet, have several shortcomings [5], [10], [11], [12]. Firstly, a controller needs to be held by the user at all times to use its hand-tracking capabilities. Secondly, if a user drops or puts down the controller, the hand-tracking effectively gets disabled. Thirdly, the user needs to operate peripheral devices when dealing with a virtual world where input is necessary. Fourthly, the haptic feedback is not very efficient for tactile perception. In spite of gesture-based controllers being an option, users tend to use the keyboard for faster and more precise input [13]. This leads to the eventual issue where the user needs to put down the VR controller, use some other input device, and pick up the controller again to resume their VR activities. It is to be noted that during this process, hand tracking gets disabled and re-enabled due to the user putting the controller down. This gap in the process is non-intuitive and breaks the immersion of the virtual world. Since the overall objective of VR technologies is to increase immersion, this gap can be treated as a flaw. In this article, the concerns of user input mechanisms used in VR are addressed by designing and developing a wearable device to perform the respective activities using a non-physical device within the virtual environment without breaking immersion. Thus the prime objectives of the research are to model and design a set of gloves equipped with sensors; to implement precise hand-tracking on the gloves; and to model and design a virtual environment for users to operate in and evaluate the glove. II. Related Works A number of articles were selected based on their relevance to the topic, possible applications, and likelihood of user
adoption. For example, Nanjappan et al. [14] focused on commercially available VR controller input systems and highlighted how users engaged in natural interactions with a VR setting using both hands and arms. An interesting trade-off between effort and intuitiveness was observed in the highlighted tests, where users tended to prefer two-handed inputs even though they required more effort. In another study, Masurovsky et al. [15] modified a Leap Motion controller, an existing hands-free VR input system, to reduce its drop rate by a significant margin. Afterwards, they conducted a usability analysis of this input method against a traditional VR input mechanism. The usability study provided quantitative feedback from more than thirty users, with the majority preferring traditional input via controller over the hands-free solution. Similarly, Esmaeili et al. [16] investigated the user perception of scaled hand movements in VR space. The takeaway from this research was that one-to-one mapping was not the ideal course of action in cases involving VR where intuitiveness was to be prioritized; mapping real-world hand movements and gestures to a scaled value was often required. A limited number of studies focused on the design, development and usability evaluation of VR systems. For example, Kumar and Todorov [17] developed a virtual reality system that combined real-time motion capture and physics simulation. It enabled a user wearing a CyberGlove to manipulate virtual objects through contacts with a tele-operated virtual hand, but the problem with this project was the hardware being very expensive and not suitable for common people to use. Wang et al. [18] highlighted the design and development of a VR hand tracking system using a novel approach involving several cameras, a large database of hand pose estimates functioning as a lookup table, and a glove made of simple cloth that had patches of color painted across it. However, the system relied on a dataset that needed to be reconfigured every time the cameras were moved to a new environment. This made the proposed system less versatile, with the trade-off being the very low overall cost. On the other hand, Cameron et al. [19] proposed a wearable system that used motion tracking and flex sensors to determine the hand orientation of a person while maintaining low latency. Findings of their usability study reveal that the users quickly adapted to the system and were able to finish assigned tasks with ease; however, components of the system were moderately expensive. Similarly, Voigt-Antons et al. [20] developed two VR games that involved hand tracking: one involved isolating objects of the same color, while the other was to retype a numeric sequence using a virtual num-pad. It was found that the users were comparatively more comfortable with the first game as they could easily relate the task to their everyday activities. Likewise, Salchow-Hömmen et al. [21] illustrated the development of a hand tracking system involving a special wearable similar to a glove. The work focused on increasing the precision of the finger tracking using a kinematic model, and the experimental results showed a promising maximum deviation of 2 cm for each fingertip's position.
Nowadays, instead of using complex room configurations or state-of-the-art IR tracking, it is possible to achieve accurate hand tracking in VR using a smartphone that has decent AR compatibility. Google's ARCore provides an API that allows position and rotation information to be extracted, and it was used in the proposed system to conduct hand tracking. Using a readily available everyday device instead of a more expensive and complex system for hand tracking in VR is where our approach qualifies as relatively fresh. Even though some priority is given to making the system inexpensive, the main focus was to design and evaluate a system that has a greater degree of usability. III. Designing and Developing the System A. Elicitation of User Requirements A total of eight participants were interviewed one-by-one to understand their interactions with VR systems. The age of participants was 29 ± 5.3 years. All participants were university students, enrolled in either undergraduate or graduate programs. All participants had prior experience with VR systems, but their usage was not frequent; they occasionally used VR systems, mostly for recreation. At first the participants were briefed about the purpose of the interview, and written consent was signed by the participants to address privacy and ethical issues. During the interview, the participants were asked about their experience regarding VR gadgets and about the difficulties in using such devices. The information they provided was essential to understand the requirements that may facilitate easier interactions with VR systems. Interviewees' responses were translated and analyzed following a qualitative data analysis approach. The interview data revealed that the subjects were not satisfied with the existing control mechanisms for VR gadgets, which came with handle-like input devices with buttons and rotational sticks. They also stated that the control mechanisms had no relation to the natural movements of hands and fingers, so they felt unrealistic. For example, one participant responded: "...it's like a gaming joystick without the plate-like base. Playing an arcade with a joystick was one thing. But key-bindings of these VR controllers is far too complicated..." Another participant said "...it takes time, effort and patience to get well-acquainted with such conventional controllers. But still that will not be enough..." They thought that most of the manufacturers were focusing on the visual details and paid little attention to how the users interacted with the system to get virtual experiences. For example, a participant pointed out, "...I really wonder whether the manufacturing companies themselves try their products before releasing them in the market..." A few participants opined that it's not worth their money unless the control mechanism is improved. One responded: "...it is okay for a good gadget to be pricey, but what value does it have if I don't find it useful..." The user requirements revealed for the VR Glove are summarized in Table I.
Table I: FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS

Functional requirements:
1. User's natural finger movements are needed to be taken as input
2. Users need to be able to interact with virtual 3D objects
3. The glove needs to be wireless

Non-functional requirements:
1. The user interface needs to be simple and intuitive
2. The system needs to be cheaper than its market-alternatives
3. The delay between the user's action and the corresponding output needs to be negligible
B. Implementation of Prototype
To develop the prototype of the VR Glove, we first attempted to find a cost-effective and accurate technique for tracking finger orientations. Secondly, different approaches to positional tracking of the hands were explored. Thirdly, the VR Glove system was proposed and a VR game was developed for demonstration. Initially, the general decision was to use flex sensors for precise detection of finger movement, but these expensive sensors greatly affected the manufacturing cost of the gloves, while the available alternatives were not suitable for precise results. Bonet [22] developed an open-source VR glove which was specifically designed for VR games only and did not deal with precision of input; moreover, that design depends largely on existing expensive consumer-grade VR technology for hand movement detection. In the initial phase, customized retractable reels were designed, inspired by Bonet's open-source design, as a cheaper alternative, using cheap badge reels coupled with a rotary potentiometer to track the finger movements. A smartphone was used to interpret and convey the position and rotation data of the arm, in real time, to a connected host computer. However, after several tests, it was observed that the component yielded poor accuracy and proper quantitative analysis could not be conducted, as the mechanical friction of the 3D-printed parts impeded proper movement of the fingers. Thus, the flex sensors were used for the final version of the proposed system, and the finger tracking mechanism was reworked to solve the issue. The latest iteration of the proposed system consists of one flex sensor equipped on each finger, with a smartphone mounted on the arm. The flex sensors are responsible for recognizing finger movements (see Fig. 1). The flex sensors used in our proposed system were manufactured by Spectra Symbol; 4.5" flex sensors were used for the index, middle and ring fingers, while 2.2" flex sensors were used for the thumb and little fingers. Flex-sensor data corresponding to finger orientation were bundled as a packet and sent to a host computer via wireless UDP (a minimal host-side receiver sketch is given after Fig. 1).

Figure 1. Circuit diagram for VR Glove system (embedded unit)
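The following host-side receiver sketch is illustrative only and not the authors' code: the UDP port number and the packet layout (five little-endian 16-bit readings, one per finger) are assumptions made for demonstration.

```python
# Sketch of receiving and unpacking flex-sensor packets sent over wireless UDP.
import socket
import struct

UDP_PORT = 4210          # assumed port
PACKET_FORMAT = "<5H"    # thumb, index, middle, ring, little finger readings
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", UDP_PORT))

while True:
    data, addr = sock.recvfrom(64)
    if len(data) < PACKET_SIZE:
        continue  # ignore malformed packets
    readings = struct.unpack(PACKET_FORMAT, data[:PACKET_SIZE])
    # Map raw ADC values to a 0..1 bend amount before forwarding to the visualizer.
    bend = [r / 1023.0 for r in readings]
    print(addr, bend)
```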
The simplest but costly solution for hand movement tracking in 3D space was to use readily available VR controllers that have 6-degrees-of-freedom movement tracking. Instead, a custom Inertial Measurement Unit (IMU) based positional tracking was designed to keep track of the hand position in 3D space. However, it had a lot of drift, and even using filters such as Mahony [23] or Madgwick [24], the system provided minimal reliability. The use of an available Android phone with proper calibration to keep track of linear acceleration and gyroscope data was pursued next, which also did not provide reliable data. This issue was finally solved by introducing ARCore-based hand tracking: a custom AR-based tracking in which the Android phone is mounted on a person's arm and the movement of the phone is accurately determined by the fusion of camera data and internal sensor data. Fig. 2 shows the final version of the system, and the workflow of the proposed system is shown in Fig. 3. IV. Evaluation A. Participant's Profile To determine the effectiveness of the proposed system, it was evaluated with 20 test subjects. Based on their experience of using computers and VR gadgets, the participants were divided into two groups: experts and novices. There were 10 participants in the expert group; they were graduate students with ages between 23 and 26 who had considerable knowledge of computer usage and prior experience with VR. The remaining 10 subjects, who belonged
to the novice group, were just average computer users with no prior experience of using VR gadgets; 7 of the novice subjects were professionals aged 45 to 65 and the other 3 novice users were high-school students.

Figure 2. Hardware components and assembled prototype

B. Study Procedure
The following three tasks were selected to be accomplished by the subjects for testing the proposed system as a virtual controller:
T1: Wear and enable the VR Glove - Users are supposed to wear the VR Glove and power it on, after which they start the Android app on a phone and finally attach the phone to their arm.
T2: Start the visualization software - Users open the visualizer program on a computer and, when the program starts running, click on the Start button, after which they try to control a hand avatar in VR space with natural hand movement.
T3: Perform three different gestures - After getting acquainted with the VR Glove system, users perform three hand gestures: pointing, victory, and thumbs-up.
A structured interview was carried out to collect the participants' feedback regarding the proposed system as a post-test evaluation. The interview questions were mainly designed to evaluate the comfort and confidence of users, their overall satisfaction, usage complications (ease of use), learn-ability (ease of learning), eagerness of participants for future use, and lastly, willingness to promote the system to others. These were assessed on a 5-point Likert scale. Participants were also asked how much they were willing to pay to purchase the system. The responses of the users were collected through questionnaire forms.

Table II: EFFECTIVENESS MEASUREMENT OF THE PROPOSED SYSTEM

Activity  Criteria          Tech. Exp.  Mean ± SD         t-value   p-value    Diff.
T1        Completion Time   Expert      16.052 ± 0.367    23.6729   < 0.0001   Significant
                            Novice      29.584 ± 1.77
T1        No. of Attempts   Expert      1 ± 0             2.753     0.0131     Not significant
                            Novice      1.8 ± 0.912
T2        Completion Time   Expert      18.63 ± 0.0803    5.9225    < 0.0001   Significant
                            Novice      32.704 ± 7.514
T2        No. of Attempts   Expert      2 ± 0             2.9055    0.0094     Not significant
                            Novice      3.1 ± 1.197
T3        Completion Time   Expert      7.074 ± 0.173     34.6916   < 0.0001   Significant
                            Novice      11.704 ± 0.385
T3        No. of Attempts   Expert      1 ± 0             N/A       N/A        No difference
                            Novice      1 ± 0
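A minimal sketch of the unpaired, two-tailed t-test used to produce the t- and p-values in Table II follows; the sample values below are illustrative stand-ins generated at random, not the recorded completion times.

```python
# Sketch of the unpaired (independent-samples) t-test reported in Table II.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expert_times = rng.normal(16.0, 0.4, size=10)   # illustrative expert-group times
novice_times = rng.normal(29.6, 1.8, size=10)   # illustrative novice-group times

t_value, p_value = stats.ttest_ind(expert_times, novice_times)  # two-tailed by default
print(f"t = {t_value:.4f}, p = {p_value:.4f}")
```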
C. Data Analysis and Findings
The summary of the quantitative analysis of the defined tasks is highlighted in Table II. The completion time and number of attempts were recorded for both groups (expert vs. novice) for comparison. The interview responses were also assessed to measure user satisfaction, which is depicted in Fig. 4; the blue line of the chart indicates the mean value, and the maroon dashed line shows the standard deviation. From the acquired data, it can be observed that expert users were able to complete their assigned tasks without making any mistakes and took comparatively less time. It was also observed that the novice users had to make several attempts to successfully complete their assigned tasks. However, while performing T3, the difference in time and success rate between the two groups gradually declined. Upon analysis of the two-tailed p-values obtained from the unpaired t-tests, the claim can be made that the performance of expert and novice testers varied significantly in all tasks when it came to completion time, but the testers, regardless of their technological experience, performed similarly when the number of attempts was taken into account. The test subjects were able to complete the tasks eventually, with a low failure rate, regardless of technological expertise. The findings on user satisfaction are visualized in Fig. 4. The average score of each satisfaction metric was high, as can be seen. The system was moderately easy to use (mean: 3.87) and easily learnable according to the participants (mean: 4.67). They were also quite enthusiastic about using the system in the future (mean: 4.73) and recommending it to others (mean: 4.5). Finally, the overall satisfaction score was 4.87, indicating that the application was well-received by
Figure 3. Workflow diagram of the proposed system
Figure 4. Usability scores of the VR Glove
the participants. All the test subjects agreed that the proposed system was able to detect natural finger movement as input and translate it into coordinated movements of
a virtual hand avatar, and 80% of the users were comfortable with the working mechanics. 93.3% of users felt that the system was very responsive, but 13.3% noticed the presence of jitter. A minimal UI was prepared, which felt less user-friendly to 26.7% of users. None of the subjects were worried about their privacy being compromised, but they all noted that the virtual environment was not rich enough to interact with. When the participants were asked what price they were willing to pay to purchase the proposed system, they were given 5 choices: less than $105.12; $105.12 - $157.68; $157.68 - $210.25; $210.25 - $262.81; and $262.81 - $315.37. Users were allowed to select multiple options. 60% of users were willing to pay less than $105.12 and 66.7% selected the range of $105.12 - $157.68. However, seeing this price range, 46.7% of users commented that they found the system to be expensive.
V. Conclusion
Through this research, the proposed VR Glove system was successfully designed, which can track hand movements and
allow users to interact with virtual 3D objects. As per the users' opinions, the gloves are easy and intuitive to use. The system can be used as the underlying hardware for VR applications which require accurate hand tracking. This includes video games, as well as simulation software for equipment training. Software platforms involving sign-language communication can also be built on top of this system. However, no haptic feedback was integrated, so users do not get any sense of touch while interacting with the virtual environment. Although an entire smartphone had to be incorporated into the project, only the camera and some sensors of the smartphone were actually needed. Comparatively costly flex sensors were also used, since no cheaper alternative is available in the market. In the future, a sense of touch while gripping or holding an object in the virtual space will be added, along with the design of a cheaper alternative to the flex sensor.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
An Efficient Deep Learning Approach to Detect Brain Tumor Using MRI Images
Annur Tasnim Islam1, Sakib Mashrafi Apu2, Sudipta Sarker3, Syeed Alam Shuvo4, Inzamam M. Hasan5, Dr. Md. Ashraful Alam6, and Shakib Mahmud Dipto7
1,2,3,4,5,6,7
Department of Computer Science and Engineering, BRAC University, 66 Mohakhali, Dhaka 1212, Bangladesh Email: 1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected], 6 [email protected], 7 [email protected]
Abstract—The formation of altered cells in the human brain constitutes a brain tumor. There are numerous varieties of brain tumors in existence today. According to academics and medical professionals, some brain tumors are curable, while others are deadly. In most cases, brain cancer is identified at a late stage, making recovery difficult. This raises the rate of mortality. If this could be identified in its earliest stages, many lives could be saved. Brain cancers are currently identified by automated processes that use AI algorithms and brain imaging data. In this article, we use Magnetic Resonance Imaging (MRI) data and the fusion of learning models to suggest an effective strategy for detecting brain tumors. The suggested system consists of multiple processes, including preprocessing and classification of brain MRI images, performance analysis and optimization of various deep neural networks, and efficient methodologies. The proposed study allows for a more precise classification of brain cancers. We start by collecting the dataset and classifying it with the VGG16, VGG19, ResNet50, ResNet101, and InceptionV3 architectures. We achieved an accuracy rate of 96.72% for VGG16, 96.17% for ResNet50, and 95.55% for InceptionV3 as a result of our analysis. Using the top three classifiers, we created an ensemble model called EBTDM (Ensembled Brain Tumor Detection Model) and achieved an overall accuracy rate of 98.60%. Index Terms—machine learning, deep learning, transformers, rumor detection, natural language processing, xgboost, svm, random forest, BERT, DistilBERT
I. INTRODUCTION
The brain is the most fragile part of the human body. It keeps the entire central nervous system, which controls all bodily processes, functioning properly [1]. A brain tumor is one of the deadliest diseases that can affect humans. The National Brain Tumor Foundation (NBTF) reports that the death rate from brain tumors has increased by as much as 300 percent during the past three decades. Patients with severe brain tumors have an average life expectancy of three years. However, even in the early stages, it is quite challenging for a physician to manually identify the condition and its severity. MRI image analysis is typically the first step in the detection of brain tumors. In this study, we intend to create an effective deep learning model for classifying brain tumor images. By evaluating MRI images of patients' brains, we aim to obtain higher precision than the present methods, and our custom-built ensemble model is designed to achieve a high level of precision. In the majority of cases, the brain tumor is discovered in the
last stages, resulting in a low recovery rate. If the brain tumor can be detected at an early stage, this recovery rate can be drastically enhanced. Therefore, it is vital to establish a rapid procedure that can assist medical professionals in efficiently identifying the tumor. Usually, brain tumor identification begins with MRI image analysis. In our study, we propose an efficient method for classifying brain tumors from MRI images using a fusion of deep learning models. This procedure also assures greater precision.
II. RELATED WORKS
Several studies employing deep neural networks to classify brain tumors have already been conducted. Many of them achieved good outcomes that support computer-aided diagnostic systems. In paper [2], the authors classified images using a trained version of the Inception V3 architecture. The 10-fold cross-validation results indicated a high success rate of 99.82%. They employed not only Inception V3 but also a DCNN with transfer learning (TL). The accuracy of their model was 99.8%, while the implementation took roughly 15 seconds per epoch. In a separate work [3], a pair of pre-trained models, Inception V3 and DenseNet201, was proposed, and brain tumor diagnosis and categorization were tested on two separate patients. With Inception V3 and DenseNet201, the accuracies reached 99.34% and 99.51%. In a different work [4], the accuracy of a CNN model built from scratch was compared with VGG16, Inception V3, and ResNet50, after evaluating the 8-layer CNN model, which showed an accuracy of 96% on training data and 89% on validation. In another study [5], separate convolutional networks, one for ROI and one for non-ROI, were developed, and a hybrid of U-Net and VGG16 (UNet-VGG16) was created. The driving motivation was to reduce the computational effort and shorten the training time, and the model also showed lower loss and higher precision. The performance of Inception V3, ResNet50 and VGG16 was examined with 253 brain MRI images in a study [6]. The accuracy was 63.13%, 82.49% and 94.42% for Inception V3, ResNet50, and VGG16 respectively. ResNet50 and VGG16
showed 100% precision. In paper [7], Xception and Inception V3 were considered. Here, the extracted features were given to AI-based classifiers such as SVM, KNN and RF, and an MRI dataset was utilized for training and testing. The accuracy was 93.79% for the Xception model and 94.34% for the Inception V3 model. A pre-trained VGG16 model with trainable blocks was presented in a research work [8]. GLCM features were also extracted from the images to form GLCM images, and the GLCM energy images were combined with the original images as input for effective contrast testing. This model showed an accuracy of 96.5%. Moreover, the authors in research [9] used AlexNet, VGG16, ResNet50 and GoogleNet and obtained different results; by comparison, AlexNet was the least accurate at 82.7% and ResNet the most accurate at 95.8% in their studies. In a study [10], an SE-ResNet-101 architecture was proposed. With data augmentation, the accuracy of 93.83% was raised to 98.67%, 91.03% and 91.81% for glioma [11], pituitary tumor and meningioma, respectively. Automatic classification of brain MRI images was carried out with 5 pre-trained deep learning models, VGG16, AlexNet, ResNet18, ResNet34 and ResNet50, in the analysis [12], where the best results were reportedly obtained with ResNet50 after different approaches were analyzed and implemented. Some studies on gray-level and RGB (Red, Green, Blue) color channels were also included, with the main purpose of reducing the total operational time complexity and speeding up the detection process using MRI images.
III. RESEARCH METHODOLOGY
Fig. 1: Illustration of our Working Process
A. Details of the Dataset
To carry out this research, we used two existing public datasets, Brain Tumor [13] and Br35H:: Brain Tumor Detection 2020 [14]. Figure 2 presents some sample brain MRI images from our dataset.
Figure 1 illustrates a high-level view of the processes we went through to train and evaluate our models. We began by collecting and preprocessing our data, which included image resizing, normalization, and augmentation. In our research, multiple deep learning models, including ResNet101, ResNet50, Inception V3, VGG19, and VGG16, were implemented. We compared VGG16 with VGG19 and chose VGG16 based on its superior performance. Similarly, we compared ResNet50 and ResNet101 and chose ResNet50 based on its greater performance. Finally, for a better comparison outcome, we ensembled the top three performing architectures. In this study, we used transfer learning, evaluating deep learning models that reuse previously learned weights. We then deployed the model we developed, called EBTDM, which is an ensemble of VGG16, ResNet50, and Inception V3.
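As a minimal sketch of this transfer-learning setup, a pre-trained backbone can be reused as follows. The classification head and hyperparameters below are illustrative assumptions; the paper does not specify them.

```python
# Sketch: reusing an ImageNet-pretrained VGG16 backbone for the binary
# normal/affected classification task. The head layers and hyperparameters
# are illustrative, not the authors' exact configuration.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the previously learned weights fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # tumor vs. normal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```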
Fig. 2: Sample Data of the Dataset B. Data Classification In our research, the dataset was divided into an 8:2 ratio, with 20% of the data being used to validate the models and
Page 144
the remaining 80% being used to train the models. Table I gives the distribution of our dataset.
TABLE I: Classification of our Dataset
                  Normal   Affected   Total
Training Set      2863     2546       5409
Validation Set    716      637        1353
Total             3579     3183       6689
4) Ensemble model: The ensemble model combines the different architectures; each member network receives the same input and outputs probabilities for each label, and an averaging layer aggregates these probabilities to categorize the images. Because the members share identical input and output dimensions, the ensemble model produces better results by reducing individual errors and boosting precision.
IV. IMPLEMENTATION AND RESULT ANALYSIS
C. Data Pre-processing
1) Resize Images: Since we used pre-trained deep learning models to train the architecture, each image was resized to a fixed size of 224 x 224. For this purpose, we employed the scikit-image [15], TensorFlow [16], and Caffe [17] frameworks.
2) Normalizing Images: Normalization was performed using Principal Component Analysis (PCA). Eigen flat fields were produced from the brain MRI images and merged, and the systematic errors in projection intensity normalization were then reduced by scaling our images with dynamic flat fields. This work was accomplished using the ImageDataGenerator class in Keras [18].
3) Data Augmentation: In our study, we used image orientation-based data augmentation techniques to enhance the model's performance. The initial images were augmented using random flips and rotations of 90, 180, and 270 degrees.
D. Model Specification
1) VGG: The Visual Geometry Group (VGG) network consists of a large number of convolutional layers combined with fully connected layers, ReLU activations, and a Softmax output. The input is set up to take 224x224x3 RGB images, which are passed through a number of convolutional layers using 3x3 kernels. The pooling layers have a dimension of 2x2 and a stride of 2 pixels. VGG19 and VGG16 are the two most common VGG architectures.
2) ResNet: ResNet is based on the principle of skip connections, which enables neural networks with more than 150 layers to be trained. We used ResNet50 and ResNet101 in our methodology. ResNet50 is composed of five stages, each with a different combination of convolution layers, while ResNet101 is a CNN of the same pattern with 101 layers.
3) Inception: A primary goal of Inception is to reduce the amount of computing power required by altering the structures used in its earlier versions. Inception is the well-known GoogLeNet family of networks, which has demonstrated strong classification performance in a number of biological applications with transfer learning. The result is lower computational complexity and fewer parameters that need to be trained. Inception V3 uses factorized 7 x 7 convolutions, label smoothing, and an auxiliary classifier as upgrades over earlier Inception family members.
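A minimal sketch of the resizing and orientation-based augmentation described in the pre-processing subsection above, using Keras' ImageDataGenerator; the directory layout, batch size, and exact parameter values are assumptions, not the authors' settings.

```python
# Sketch of the resize + flip/rotation augmentation described above. The
# 90/180/270-degree rotations are approximated with Keras' rotation_range;
# the data directory and batch size are placeholders.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,      # simple intensity scaling
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=90,      # random rotations
)

# Images are resized to 224 x 224 on the fly while streaming from disk
# train_flow = train_gen.flow_from_directory(
#     "dataset/train", target_size=(224, 224), class_mode="binary", batch_size=32)
```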
Results and analysis from our investigation included the construction of a confusion matrix as well as performance metrics for each model, such as validation accuracy, recall, precision, and F1 score, which were used in the performance evaluations. The confusion matrix collects both the correct and incorrect predictions of the algorithm against the actual situation [19]. The elements of the confusion matrix are:
• True Positive (TP): the number of people the model labels as having a brain tumor who actually have one.
• False Negative (FN): the number of people with a brain tumor whom the model labels as healthy.
• False Positive (FP): the number of people the model labels as having a brain tumor despite being, in fact, healthy.
• True Negative (TN): the number of people labeled as healthy by the model who are truly in good health.
Based on this, accuracy, precision, recall, and F1 score are calculated with the following equations:
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)
F1-score = 2·TP / (2·TP + FP + FN)    (4)
A. Performance of the Deep Learning Models
TABLE II: Performance of the Deep Learning (DL) Models
DL Models      Precision   Recall    F1 Score   Accuracy
ResNet50       96.19%      96.15%    96.17%     96.17%
ResNet101      95.97%      95.10%    95.55%     95.55%
VGG16          96.69%      96.77%    96.72%     96.72%
VGG19          96.76%      96.19%    96.47%     96.47%
InceptionV3    95.60%      95.65%    95.55%     95.55%
From Table II, we chose the VGG16, ResNet50, and Inception V3 deep learning architectures to be ensembled. After
comparing results, it was determined that VGG16 performed better than VGG19. Similarly, ResNet50 has achieved greater accuracy than ResNet101. In addition to those, we chose the Inception V3 model.
B. Brain Tumor Analysis by Ensemble Modelling (EBTDM)
In the proposed Ensembled Brain Tumor Detection Model (EBTDM), our three best-performing architectures, VGG16, InceptionV3, and ResNet50, were used for ensemble modeling. In EBTDM, the input and output dimensions are identical to those of the individual models. VGG16, InceptionV3, and ResNet50 each provide probabilities for the labels, and these probabilities are averaged at the ensemble model's averaging layer to classify the image.
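As a rough illustration of this averaging scheme (not the authors' exact implementation), the three trained members can be wrapped into a single Keras model whose output is the mean of their predicted probabilities; the member variables below are placeholders for the trained networks.

```python
# Sketch: combining three trained classifiers by averaging their predicted
# probabilities, as described for EBTDM. `vgg16_model`, `resnet50_model` and
# `inceptionv3_model` are placeholders for the already-trained networks.
from tensorflow.keras import layers, Model

def build_averaging_ensemble(members, input_shape=(224, 224, 3)):
    inp = layers.Input(shape=input_shape)
    member_outputs = [m(inp) for m in members]   # each member outputs label probabilities
    avg = layers.Average()(member_outputs)       # averaging layer combines the predictions
    return Model(inputs=inp, outputs=avg, name="EBTDM_sketch")

# ebtdm = build_averaging_ensemble([vgg16_model, resnet50_model, inceptionv3_model])
# predictions = ebtdm.predict(validation_images)
```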
Fig. 5: Confusion Matrix of Ensemble Model
Fig. 3: Training Curve of EBTDM
C. Comparison Analysis
TABLE III: Result comparison table of other published papers
Other Papers               Model Used     Accuracy
N. Noreen et al. [7]       Inception-v3   94.34%
Belaid and Loudini [8]     GLCM           96.50%
A. A. Pravitasari [20]     UNet-VGG16     96.10%
Our proposed model         EBTDM          98.60%
Table III demonstrates that our proposed EBTDM model has obtained a 98.60% level of accuracy, which surpasses the results obtained from previous research studies.
V. CONCLUSION
Fig. 4: Validation Curve of EBTDM
As a consequence of the result analysis, the EBTDM obtained an accuracy of 98.60%, which is higher than the individual architectures evaluated in this work and the approaches compared in Table III. Additionally, EBTDM achieved a precision of 97.73%, a recall of 99.50%, and an F1 score of 98.62%. Thus, EBTDM outperforms the other trained models, and we considered it for implementation at the next level. According to the confusion matrix of EBTDM (Fig. 5), 646 images were correctly recognized as brain tumors, whereas 15 of the affected images were mistakenly labeled as normal by the algorithm. Additionally, the algorithm properly classified 616 images as normal, while it wrongly classified 3 images as abnormal.
Every year, more and more people are diagnosed with brain tumors, and sadly, more and more people die due to these cancers. Millions of individuals seek treatment at hospitals across the globe. Some patients with life-threatening conditions do not get prompt treatment because of the lengthy time it takes to obtain results. Our model can help greatly by recognizing brain tumors through the analysis of MRI scans and by classifying tumor severity based on MRI images. We conducted training on the dataset utilizing the VGG16, VGG19, InceptionV3, ResNet50, and ResNet101 architectures, and then compared the results to our own EBTDM model. Further data collection is planned for the future so that we can improve our model. Finally, we aim to make our model less time-consuming and to provide an interface that medical professionals may use to discover brain tumors in MRI images.
R EFERENCES [1] A. Filatov, P. Sharma, F. Hindi, and P. S. Espinosa, “Neurological complications of coronavirus disease (covid-19): encephalopathy,” Cureus, vol. 12, no. 3, 2020. [2] A. A. Habiba et al., “A novel hybrid approach of deep learning network along with transfer learning for brain tumor classification,” Turkish Journal of Computer and Mathematics Education (TURCOMAT), vol. 12, no. 9, pp. 1363–1373, 2021. [3] N. Noreen, S. Palaniappan, A. Qayyum, I. Ahmad, M. Imran, and M. Shoaib, “A deep learning model based on concatenation approach for the diagnosis of brain tumor,” IEEE Access, vol. 8, pp. 55 135–55 144, 2020. [4] H. A. Khan, W. Jue, M. Mushtaq, and M. U. Mushtaq, “Brain tumor classification in mri image using convolutional neural network,” Math. Biosci. Eng, vol. 17, p. 6203, 2020. [5] A. B. S. Salamh et al., “Investigation the effect of using gray level and rgb channels on brain tumor image,” Computer Science & Information Technology (CS & IT), pp. 141–148, 2017. [6] O. Sevli, “Performance comparison of different pre-trained deep learning models in classifying brain mri images,” Acta Infologica, vol. 5, no. 1, pp. 141–154, 2021. [7] N. Noreen, S. Palaniappan, A. Qayyum, I. Ahmad, and M. O. Alassafi, “Brain tumor classification based on fine-tuned models and the ensemble method,” CMC-COMPUTERS MATERIALS & CONTINUA, vol. 67, no. 3, pp. 3967–3982, 2021. [8] O. N. Belaid and M. Loudini, “Classification of brain tumor by combination of pre-trained vgg16 cnn,” Journal of Information Technology Management, vol. 12, no. 2, pp. 13–25, 2020. [9] A. A. Abbood, Q. M. Shallal, and M. A. Fadhel, “Automated brain tumor classification using various deep learning models: a comparative study,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 22, no. 1, pp. 252–259, 2021. [10] M. Masood, T. Nazir, M. Nawaz, A. Mehmood, J. Rashid, H.-Y. Kwon, T. Mahmood, and A. Hussain, “A novel deep learning method for recognition and classification of brain tumors from mri images,” Diagnostics, vol. 11, no. 5, p. 744, 2021. [11] P. Ghosal, L. Nandanwar, S. Kanchan, A. Bhadra, J. Chakraborty, and D. Nandi, “Brain tumor classification using resnet-101 based squeeze and excitation deep neural network,” in 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP). IEEE, 2019, pp. 1–6. [12] H. M. Rai and K. Chatterjee, “Detection of brain abnormality by a novel lu-net deep neural cnn model from mr images,” Machine Learning with Applications, vol. 2, p. 100004, 2020. [13] J. Bohaju, “Brain tumor, version 3,” https://www.kaggle.com/ jakeshbohaju/brain-tumor/version/3, 2020. [14] A. Hamada, “Br35h :: Brain tumor detection 2020,” https://www.kaggle. com/datasets/ahmedhamada0/brain-tumor-detection, 2020. [15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011. [16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for largescale machine learning,” in 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), 2016, pp. 265–283. [17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. 
Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675–678. [18] A. Gulli and S. Pal, Deep learning with Keras. Packt Publishing Ltd, 2017. [19] S. M. Dipto, A. Iftekher, T. Ghosh, M. T. Reza, and M. A. Alam, “Suitable crop suggesting system based on n.p.k. values using machine learning models,” in 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), 2021, pp. 1–6. [20] A. A. Pravitasari, N. Iriawan, M. Almuhayar, T. Azmi, K. Fithriasari, S. W. Purnami, W. Ferriastuti et al., “Unet-vgg16 with transfer learning for mri-based brain tumor segmentation,” Telkomnika, vol. 18, no. 3, pp. 1310–1318, 2020.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Design of an Aperture-Coupled Pentad-Polarization Reconfigurable Microstrip Antenna
Md. Naimur Rahman, Muhammad Asad Rahman, Md. Azad Hossain
Faculty of Electrical and Computer Engineering, Chittagong University of Engineering and Technology, Bangladesh
{naim125, asad31, azad}@cuet.ac.bd
Abstract—This paper presents a multi-layer pentad-polarization reconfigurable antenna intended for use in the sub-6 GHz spectrum. By employing two orthogonal aperture-coupled switchable feed networks, the proposed antenna can operate in dual-fed and single-fed modes. In single-fed mode, a linear polarization (LP) sense of horizontal or vertical polarization can be obtained. In dual-fed mode, +45° LP and a circular polarization (CP) sense of right-hand CP (RHCP) or left-hand CP (LHCP) can be achieved by switchable microstrip feed line excitation. Each feed network contains two microstrip feed lines with a quadrature phase difference, and switching between the feed lines is achieved by integrating two PIN diodes. A 10-dB impedance bandwidth of 1.02% for LP and 1.19% for CP is attained. Furthermore, without applying any kind of isolation enhancement technique, an excellent isolation characteristic is obtained.
The number of feature channels increased through the contracting path (32 -> 64 -> 128 -> 256). Upsampling and concatenation were used in the decoder, followed by standard convolution operations. Each decoder level reduced the number of features by a factor of two, and upsampling added pixels around and in between the existing pixels to reach the desired resolution. In the proposed method, the upsampling path was divided into four blocks, each containing two 3 × 3 convolutions followed by 2 × 2 upsampling. The U-Net used skip connections to integrate spatial information from the downsampling path with the upsampling path to generate the concatenated features. In this research, 128 × 128 × 16 features were extracted from the last output layer and then fed as input into the classifier.
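A minimal sketch of one decoder level as described above (2 × 2 upsampling, concatenation with the corresponding encoder feature map, and two 3 × 3 convolutions); the filter counts and tensor names below are illustrative, not taken from the paper.

```python
# Sketch of a single U-Net decoder block: upsampling, skip-connection
# concatenation, then two 3x3 convolutions. Filter counts are illustrative.
from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    x = layers.UpSampling2D(size=(2, 2))(x)                  # add pixels to double the resolution
    x = layers.Concatenate()([x, skip])                      # merge spatial detail from the encoder
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

# Example: four decoder blocks halving the feature count at every level
# x = decoder_block(bottleneck, skip3, 128)
# x = decoder_block(x, skip2, 64)
# x = decoder_block(x, skip1, 32)
# x = decoder_block(x, skip0, 16)   # final 16-channel feature map fed to the classifier
```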
D. Classifier
Implementing CNN feature extraction followed by a simple classifier reduces the automated system's latency [15]. Following this observation, we used a simpler classifier (an MLP) to categorize the patches based on the feature vector extracted from the U-net. It consisted of three layers: an input layer, a hidden layer, and an output layer. The input layer in our study received 16 features per pixel, represented as 16 neurons. The output layer was made up of a single neuron with an output range from 0 to 1, resulting in a binary image with blood vessels identified. The hidden layer placed between the input and output layers is the true computational engine of the MLP. As mentioned in [16], the computations taking place at every neuron in the output and hidden layers are as follows:
o(x) = G(b^(2) + W^(2) h(x))    (4)
h(x) = s(b^(1) + W^(1) x)    (5)
Here, b^(1) and b^(2) are the bias vectors, W^(1) and W^(2) are the weight matrices, and G and s are the activation functions. We used the sigmoid activation function to interpret the output as a posterior probability.
E. Post-Processing
The output target image was generated by reconstructing the segmented patches (shown in Fig. 4(a)). The output image indicated that some vessels might have a few gaps (i.e., vessel pixels that have been misclassified as non-vessels). Additionally, some relatively small non-vessel regions near vessels may be mistakenly categorized as vessels. These misclassifications were corrected by two post-processing steps: 1) filling vessel gaps, and 2) removal of artifacts. The illustrations of the target image and the post-processed image are presented in Fig. 4(b) and Fig. 4(c).
1) Filling Vessel Gaps: The undetected vessel gaps were recovered by performing an iterative filling operation, where each pixel was considered a vessel pixel if at least six of its neighbours (in a 3 × 3 neighbourhood) were vessel pixels. Thereafter, a morphological closing operation was performed to join the disconnected vessel pixels.
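A rough sketch of the two post-processing steps on a binary vessel mask; the neighbour rule and the 25-pixel area threshold follow the descriptions in the text, while the number of filling iterations is an assumption.

```python
# Sketch of the post-processing described in the text: iterative gap filling
# (a pixel becomes vessel if >= 6 of its 8 neighbours are vessel), a
# morphological closing, and removal of isolated regions below 25 pixels.
# The number of filling iterations is an assumption.
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import binary_closing, remove_small_objects, square

def postprocess(seg, iterations=2):
    seg = seg.astype(bool)
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
    for _ in range(iterations):
        neighbours = convolve(seg.astype(int), kernel, mode="constant")
        seg = seg | (neighbours >= 6)                 # fill undetected vessel gaps
    seg = binary_closing(seg, square(3))              # join disconnected vessel pixels
    seg = remove_small_objects(seg, min_size=25)      # drop small isolated artifacts
    return seg
```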
Fig. 4. Reconstruction and Post-processing: (a) Test Result of Patches, (b) Target Segmented Image, (c) Post-processed Image.
2) Removal of Artifacts: In the output, little isolated patches (called artifacts) were misclassified as blood vessels. The associated pixel area was measured to remove these artifacts: to get a more accurate test result, regions with an area below 25 pixels were treated as non-vessel and eliminated.
III. IMPLEMENTATION SETUP
In order to train a deep learning network, a large number of images is necessary, which reduces the possibility of overfitting and enhances model performance. The existing datasets, however, are insufficient. Hence, we used data patching and augmentation to increase the data while reducing the chance of a biased conclusion [17]. As in many prior studies, we patched the preprocessed and ground truth images in the training step. In the testing stage, we employed patching before extracting features and patch reconstruction after classification of the patches to create the segmented target image. The patch size was set at 128 × 128, with a patch step of 128, implying no overlap. According to the study [18], unnecessary sub-images can sometimes skew the classification result and accuracy. Hence, we only considered patches that fit properly within the circular active zone of the image. The patch size was chosen carefully to avoid the possibility of overfitting. Random Crop and Fill (RCF), a data augmentation approach [19], was employed. The patches were randomly rotated by 90, 180, and 270 degrees with 50% probability, and horizontally and vertically flipped with 50% probability. The datasets for training and testing were split into a 3:1 ratio. To guide the training, testing, and evaluation, k-fold evaluation [20] was employed. The dataset was shuffled at random, with one fold being utilized for testing and the others for training. After recording the evaluation score, the steps were repeated with a different group. The overall size of the
training and testing patches were 12,000,000 and 4,000,000, respectively. The features were retrieved from the last output layer of the U-net (FV in Fig. 3) and fed into the MLP for classification of the patches. We trained, tested, and evaluated both the U-net and MLP networks utilizing the features from the U-net, and finally chose the MLP as the classifier since it required less computation time than the U-net. In the U-net, the parameters were trained using backpropagation with cross-entropy as the loss function and the Adam optimizer as the optimization algorithm. Furthermore, a 10% dropout in the contracting path and a 20% dropout in the expansive path were used to avoid overfitting. The number of training epochs was 50 and the total number of trainable parameters was 2,511,361. We used checkpoint validation with early stopping, meaning that training terminates if the network's performance does not improve. The advantage of an MLP network having only one hidden layer was discussed earlier in [21]. The proposed method was implemented in Python using the Keras, TensorFlow, and scikit-learn libraries.
IV. EXPERIMENTAL RESULTS AND OBSERVATIONS
A. Performance Metrics
The performance of the methodology was assessed using three parameters: accuracy, sensitivity, and specificity. The test's sensitivity, or true positive rate (TPR), indicates how well it predicts one category, while its specificity, or true negative rate (TNR), tells how well it predicts the other. Accuracy, on the other hand, measures how well the test predicts both groups.
Accuracy (ACC) = (TP + TN) / (TP + FP + TN + FN)    (6)
Sensitivity (SN) = TP / (TP + FN)    (7)
Specificity (SP) = TN / (TN + FP)    (8)
True Positive (TP) refers to a successfully segmented vessel pixel, while False Positive (FP) denotes a background pixel incorrectly segmented as vessel. False Negative (FN) refers to a vessel pixel incorrectly segmented as background, and True Negative (TN) denotes a correctly segmented background pixel.
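These pixel-level metrics can be computed directly from a confusion matrix; a small sketch using scikit-learn, where the flattened ground-truth and predicted label arrays are assumed inputs:

```python
# Sketch: ACC, SN and SP computed from the pixel-wise confusion matrix.
# y_true and y_pred are flattened arrays of 0 (background) / 1 (vessel).
from sklearn.metrics import confusion_matrix

def vessel_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)   # sensitivity / true positive rate
    sp = tn / (tn + fp)   # specificity / true negative rate
    return acc, sn, sp
```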
B. Proposed Method Evaluation
In Table I, we show the performance of the proposed method using different configurations. Since the training and test sets are fixed in DRIVE, the result is included for each configuration, whereas in STARE and HRF we used images over ten folds of leave-one-out, and thus performance is presented as the average of all outputs. It can be seen from the table that a significant performance improvement occurred when using the preprocessing methods, patching, and data augmentation, specifically on the DRIVE dataset. The accuracy and sensitivity appeared lower on DRIVE than on the other datasets because of its low resolution and less
TABLE I PERFORMANCE RESULTS ON DRIVE, STARE AND HRF DATASETS
                                        DRIVE                      STARE                      HRF
Methods                          ACC(%)  SN(%)  SP(%)       ACC(%)  SN(%)  SP(%)       ACC(%)  SN(%)  SP(%)
Our U-Net                        94.61   74.06  97.08       93.19   77.80  98.23       NA      NA     NA
Ours with Preprocessing          97.84   78.61  98.87       98.63   80.61  98.70       98.28   79.34  99.39
Ours with Preprocessing
and Data Enhancement             98.34   78.89  99.21       99.78   81.44  99.92       98.85   79.73  98.45

TABLE II PERFORMANCE RESULTS OF CROSS-DATABASE TEST
Training Dataset   Test Dataset   Accuracy(%)   Sensitivity(%)   Specificity(%)
STARE              DRIVE          98.34         81.56            98.44
DRIVE              STARE          96.98         76.34            95.92
STARE              HRF            97.44         81.26            98.02
DRIVE              HRF            96.21         80.74            95.56
HRF                STARE          97.89         81.98            97.65
HRF                DRIVE          97.88         80.64            96.67
variety of images. Data enhancement, on the other hand, was required for the HRF dataset's images to run on our hardware due to their high resolution. Overall, our proposed system achieved the best performance on the STARE database, with 99.78% accuracy, 81.44% sensitivity, and 99.92% specificity. We observed that categorizing a batch of 150 patches takes just 0.29 seconds and segmenting an entire image takes only 0.08 seconds, which is acceptable in a medical context. The multiple leave-one-out evaluation scheme eliminates this bias; hence, the results should be deemed more trustworthy. Since small patch sizes make classification more challenging, we set the patch size to 128 × 128 and achieved a decent classification result. In practice, retraining the model every time a new patient's fundus image needs to be evaluated is not feasible, and the acquisition equipment used by different hospitals is generally from different manufacturers, so a dependable approach must be able to correctly evaluate images captured by various equipment. Hence, robustness and generalization are critical criteria for assessing the model's actual application capabilities. Due to the stability and flexibility of our framework, we also performed cross-database validation. Table II displays the results of this test. We can observe that using HRF as the training database and STARE as the testing database yields the best results in terms of sensitivity (81.98%), but delivers subpar results in terms of accuracy and specificity. Analyzing the results, STARE as the training database and DRIVE as the testing database is the best combination, producing the best accuracy of 98.34%.
Fig. 5. Accuracy-Loss Curve.
C. Performance Analysis
The accuracy-loss curve over epochs is used to measure the performance of the model. Fig. 5 exhibits the U-net network's accuracy and loss over epochs during training and validation. Training accuracy approaching 1 is an indication that the
training is progressing well. The validation accuracy line follows the training accuracy line, which also shows that the model is not overfitted. The training and validation loss start high and gradually decrease towards zero.
D. Comparison to Other Methods
The proposed methodology outperformed most of the other methods, as shown in Table III, which compares our proposed methodology to state-of-the-art algorithms published in the last decade. These approaches applied a methodology similar to ours, with a few significant adjustments. The proposed method's use of a U-net to extract features together with a simpler classifier is new compared to past methods and shows better results in terms of accuracy and specificity. Although it is desirable to have both high sensitivity and specificity, there is a trade-off between them in clinical tests. HRF was not used in the performance comparison since the majority of the existing literature did not use it, as it requires large memory.
V. CONCLUSION
In this research, we proposed an improved method for automated blood vessel segmentation, where pre- and postprocessing methods resulted in significant performance gains, along with the usage of the U-net for feature extraction and a simple MLP network to decrease system latency. The STARE,
TABLE III PERFORMANCE COMPARISON WITH THE RELATED STATE-OF-THE-ART WORKS
                                           DRIVE                      STARE
List of Papers               Year    ACC(%)  SN(%)  SP(%)      ACC(%)  SN(%)  SP(%)
Fraz et al. [1]              2012    95.34   75.48  97.63      94.80   74.06  98.07
Liskowski and Krawiec [22]   2016    95.66   78.67  97.54      94.95   77.63  97.68
Mo and Zhang [6]             2017    96.74   81.47  98.44      95.21   77.79  97.8
Jin et al. [7]               2019    96.41   75.95  98.78      95.66   79.63  98.00
Ramos-Soto et al. [23]       2021    96.67   75.78  98.60      95.80   74.74  98.36
Our Proposed Method          -       98.34   78.89  99.21      99.78   81.44  99.92
DRIVE, and HRF datasets were used to train the presented Unet model, which can be used as a pre-trained model in future research in this domain. Our method achieves an accuracy of 99.78%, a sensitivity of 81.44%, and a specificity of 99.92% while using the STARE database. Although the proposed method performed exceptionally well in many areas, there are still a couple of areas that can be enhanced. The method demonstrated a marginally lower sensitivity rate compared to other recent methods. Instead of using publicly accessible datasets, a dataset collected from patients could be utilized to get more robust and generalized results. In order to solve the sensitivity issue, we are considering further research to determine the cause and find out ways to improve it. Moreover, we plan to experiment with more standard datasets with DR as well as other diseases involving the retinal blood vessel, and combine our trained U-net with simpler classifiers to reduce system latency, and develop a more robust and generalized framework for segmenting the retinal blood vessel network. R EFERENCES [1] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman, “An ensemble classification-based approach applied to retinal blood vessel segmentation,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 9, pp. 2538–2548, 2012. [2] N. M. Salem and A. K. Nandi, “Unsupervised segmentation of retinal blood vessels using a single parameter vesselness measure,” in 2008 Sixth Indian Conference on Computer Vision, Graphics Image Processing, 2008, pp. 528–534. [3] U. T. Nguyen, A. Bhuiyan, L. A. Park, and K. Ramamohanarao, “An effective retinal blood vessel segmentation method using multi-scale line detection,” Pattern Recognition, vol. 46, no. 3, pp. 703–715, 2013. [4] W. S. Oliveira, T. I. Ren, and G. D. Cavalcanti, “An unsupervised segmentation method for retinal vessel using combined filters,” in 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, vol. 1, 2012, pp. 750–756.
[5] Z. Yao, Z. Zhang, and L.-Q. Xu, “Convolutional neural network for retinal blood vessel segmentation,” in 2016 9th International Symposium on Computational Intelligence and Design (ISCID), vol. 1, 2016, pp. 406–409. [6] J. Mo and L. Zhang, “Multi-level deep supervised networks for retinal vessel segmentation,” International journal of computer assisted radiology and surgery, vol. 12, no. 12, pp. 2181–2193, 2017. [7] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su, “Dunet: A deformable network for retinal vessel segmentation,” Knowledge-Based Systems, vol. 178, pp. 149–162, 2019. [8] P. M. Samuel and T. Veeramalai, “Vssc net: vessel specific skip chain convolutional network for blood vessel segmentation,” Computer methods and programs in biomedicine, vol. 198, p. 105769, 2021. [9] C. Wan, X. Zhou, Q. You, J. Sun, J. Shen, S. Zhu, Q. Jiang, and W. Yang, “Retinal image enhancement using cycle-constraint adversarial network,” Frontiers in Medicine, vol. 8, 2021. [10] “Retinal diseases symptoms and causes.” [Online]. Available: https://www.mayoclinic.org/diseases-conditions/retinaldiseases/symptoms-causes/syc-20355825. Last accessed 12 Apr 2022 [11] “Structured analysis of the retina.” [Online]. Available: https://cecas.clemson.edu/ ahoover/stare/. Last accessed 28 Mar 2022 [12] J. Staal, M. D. Abr`amoff, M. Niemeijer, M. A. Viergever, and B. Van Ginneken, “Ridge-based vessel segmentation in color images of the retina,” IEEE transactions on medical imaging, vol. 23, no. 4, pp. 501–509, 2004. [13] A. Budai, R. Bock, A. Maier, J. Hornegger, and G. Michelson, “Robust vessel segmentation in fundus images,” International journal of biomedical imaging, vol. 2013, 2013. [14] D. Mar´ın, A. Aquino, M. E. Geg´undez-Arias, and J. M. Bravo, “A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features,” IEEE Transactions on medical imaging, vol. 30, no. 1, pp. 146–158, 2010. [15] O. O. Sule, “A survey of deep learning for retinal blood vessel segmentation methods: Taxonomy, trends, challenges and future directions.” IEEE Access, 2022. [16] S. Abirami and P. Chitra, “Chapter fourteen - energy-efficient edge based real-time healthcare support system,” in The Digital Twin Paradigm for Smarter Systems and Environments: The Industry Use Cases, ser. Advances in Computers, P. Raj and P. Evangeline, Eds. Elsevier, 2020, vol. 117, no. 1, pp. 339–368. [17] S. Gayathri, V. P. Gopi, and P. Palanisamy, “A lightweight cnn for diabetic retinopathy classification from fundus images,” Biomedical Signal Processing and Control, vol. 62, p. 102115, 2020. [18] S. Dhar and L. Shamir, “Evaluation of the benchmark datasets for testing the efficacy of deep convolutional neural networks,” Visual Informatics, vol. 5, no. 3, pp. 92–101, 2021. [19] Y. Jiang, H. Zhang, N. Tan, and L. Chen, “Automatic retinal blood vessel segmentation based on fully convolutional neural networks,” Symmetry, vol. 11, no. 9, p. 1112, 2019. [20] S. Yadav and S. Shukla, “Analysis of k-fold cross-validation over holdout validation on colossal datasets for quality classification,” in 2016 IEEE 6th International Conference on Advanced Computing (IACC), 2016, pp. 78–83. [21] G.-B. Huang, Y.-Q. Chen, and H. A. Babri, “Classification ability of single hidden layer feedforward neural networks,” IEEE transactions on neural networks, vol. 11, no. 3, pp. 799–801, 2000. [22] P. Liskowski and K. 
Krawiec, “Segmenting retinal blood vessels with deep neural networks,” IEEE transactions on medical imaging, vol. 35, no. 11, pp. 2369–2380, 2016. [23] O. Ramos-Soto, E. Rodr´ıguez-Esparza, S. E. Balderas-Mata, D. Oliva, A. E. Hassanien, R. K. Meleppat, and R. J. Zawadzki, “An efficient retinal blood vessel segmentation in eye fundus images by using optimized top-hat and homomorphic filtering,” Computer Methods and Programs in Biomedicine, vol. 201, p. 105949, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169260721000237
2022 25th International Conference on Computer and Information Technology (ICCIT), 17-19 December, 2022, Cox’s Bazar, Bangladesh
An Efficient Deep Learning Technique for Bangla Fake News Detection
Md. Shahriar Rahman, Department of CSE, Brac University, Dhaka, Bangladesh, [email protected]
Faisal Bin Ashraf, Department of CSE, Brac University, Dhaka, Bangladesh, [email protected]
Md. Rayhan Kabir, Department of CS, University of Alberta, Edmonton, Canada, [email protected]
Abstract—People connect with a plethora of information from many online portals due to the availability and ease of access to the internet and electronic communication devices. However, news portals sometimes abuse press freedom by manipulating facts. Most of the time, people are unable to discriminate between true and false news. It is difficult to avoid the detrimental impact of Bangla fake news spreading quickly through online channels and influencing people's judgment. In this work, we investigated many real and false news pieces in Bangla to discover a common pattern for determining whether an article is disseminating incorrect information or not. We developed a deep learning model that was trained and validated on our selected dataset. The dataset used for learning contains 48,678 legitimate news articles and 1,299 fraudulent ones. To deal with the imbalanced data, we used random undersampling and then ensembled the resulting models to obtain a combined output. In terms of Bangla text processing, our proposed model achieved an accuracy of 98.29% and a recall of 99%. Index Terms—Bangla Text, Natural Language Processing, News Classification, Fake News, Bangla Language
I. INTRODUCTION
In the present Web 2.0 era, online news has become more popular than traditional news media such as newspapers, television, magazines, etc. Owing to rapid consumption and its low cost, it spreads very quickly through different micro-blogging websites like Facebook, Reddit, and Twitter [10]. Consequently, checking the authenticity and background of online news has become challenging. As a result, fake news frequently spreads further than well-written original content, wreaking havoc on our society [7]. The potential for harmful events in our society due to online fake news is far from limited. It is continuously threatening many aspects of people's lives [2]. In Bangladesh, due to the spreading of misleading information related to child beheading during the construction of Padma Bridge, eight people wrongly suspected by mobs were killed in July 2019 [12]. Such false news spreading has also harmed socio-economic and political life. During the parliamentary election of Bangladesh in 2018, different fake websites dispersed misinformation by imitating authentic news portals to intentionally affect people's opinions [13]. Furthermore, during the Covid-19 pandemic, fake news articles
1 https://en.wikipedia.org/wiki/Web_2.0
are misleading the general public to such an extent that this current wave has been described as a wave of disinformation, or a 'disinfodemic' [11] [14]. During the Covid-19 crisis, people have become primarily dependent on online services. Sadly, the spread of misleading information has become a significant concern for Bangladesh [15]; such misinformation is generally related to fake death tolls, political concerns, consumerism, religious views, economic aspects, etc. [16]. In the present context, numerous state-of-the-art models exist that can detect fake news for high-resource languages like English, French, and Spanish. On the other hand, in the field of fake news detection, a low-resource language like Bangla has received very little attention from researchers, as Bangla online resources do not occupy a large space on the internet [8]. The state-of-the-art models for detecting fake news in high-resource languages are not efficient in the case of the Bangla language, as most of the models are deep in nature, and deep models require a huge volume of training data for effective results [26]. Though Bangla online resources constitute a small space on the internet, Bangla is the fifth largest language on earth² and the number of Bangla-speaking internet users is rapidly increasing. From some recent events of fake news spreading over the internet, which resulted in crime and social unrest, we understand the importance of designing an efficient model that can identify Bangla fake news and minimize this vulnerability by countering its quick-spreading nature [9]. In this paper, we propose a deep learning-based approach that can be used as an automated tool to check the authenticity of Bangla news. The main contribution of our work is that the proposed model has very competitive performance compared with other existing models. Another contribution is that, though deep in nature, our model performs efficiently when trained with a small volume of training data, which addresses the problem of training deep learning models for low-resource languages. Also, we have deployed the whole project online for practical implementation and ease of access for users. The rest of the paper is organized as follows. Discussion
2 https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
on related works is provided in Section II. A brief discussion of the proposed model is provided in Section III, and a description of the dataset and experimentation in Section IV. In Section V we provide the analysis of our results, and finally, Section VI concludes the paper.
II. LITERATURE REVIEW
A number of state-of-the-art models have been proposed using different datasets to detect fake news or malicious content in languages with large resource spaces, but little work has been done for the Bangla language. A group of researchers used a Logistic Regression algorithm to classify suspicious and non-suspicious text in a corpus [1]. They compared the result with other algorithms like SVM, Naive Bayes, KNN, and Decision Tree for an efficiency test. They took 1500 and 500 text documents from the dataset as training and testing sets. An accuracy of 92% was found for their proposed method, the highest compared to the other classifiers. An increase in the number of training samples and removal of stop words was suggested for further improvement.
In another study, the authors used data scraped from different online news portals and categorized articles as fake or authentic based on the reliability of the portal [2]. The percentage of fake news in the dataset was 39.08%. After preprocessing the data, they used a count vectorizer and a TF-IDF vectorizer for feature extraction and then trained traditional machine learning classifiers on the data. Interestingly, SVM with a linear kernel gave an accuracy of 96.64%, which outperformed Multinomial Naive Bayes (93.32%). Lastly, they suggested reducing the corpus size using a stemmer and enhancing performance using a hybrid classifier.
A collection of 5644 Facebook comments and posts in Bangla was used in another work [3]. They applied MNB, Linear SVC, RBF SVC, and CNN-LSTM algorithms, where the Linear SVM performed better than all other algorithms, with an accuracy of 78%, when the dataset was balanced. However, with less data (3324 samples), the CNN-LSTM model did not perform as well, but after the data was increased to 5644 samples, it had the highest rate of accuracy change, from 73% to 77.5%. Hence, increasing the data can enhance the performance of the CNN-LSTM model, which focuses on local information and long dependencies. Another work [23] used a CNN-LSTM architecture to find cyberbullying in Bangla online content and got 87.91% accuracy.
A novel corpus named the Suspicious Bengali Text Dataset (SBTD), consisting of 7000 Bangla text documents, was proposed in [4]. The authors extracted features using bag of words and TF-IDF after preprocessing. For enhancing the meaning of a sentence, they considered a combination of N-gram features in their proposed model. Furthermore, they used Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Multinomial Naive Bayes (MNB) classification algorithms. Their proposed method, Stochastic Gradient Descent (SGD), outperformed the others when combined with TF-IDF using unigram and bigram features, with an accuracy of 84.57%.
Another work used the Stochastic Gradient Descent (SGD) classifier to categorize Bangla text documents [5]. The authors scraped a collection of 9127 articles from a Bangla news portal named BDNews24 and labeled those documents with different categories. The performance was measured using precision, recall, and F1 score. They selected 350,000 features using TF-IDF during the experiment, where a bigram was considered a term. They performed different experiments keeping a DC (Document Classifier) as a common classifier, combining it with a Ridge classifier, Passive Aggressive, SVM, Naive Bayes, and Logistic Regression, and finally combining the DC with the SGD classifier, which is their proposed method. Their proposed method provided the highest precision and F1 score, 0.9386 and 0.9385.
The authors of another study [6] built a framework to detect social media spam in Bangla text data. They collected the data from different social platforms like YouTube and Facebook. The percentage of spam text in the dataset was 32.87%. They experimented with the dataset using a Multinomial Naive Bayes (MNB) classifier, whereas their proposed model detects spam based on the polarity of each sentence. They achieved an accuracy of 82.44% with their model in identifying Bangla text spam.
News shared on Twitter was used as a dataset to implement a fake news detector framework using five classification methodologies in another work [7]. Before feeding the data into models, the data were preprocessed using different text representations like count vectors, TF-IDF, and word embeddings; RMSprop was used as the optimizer, and the accuracy of these representations was compared with Naive Bayes, LR, and SVM. Finally, they stated that SVM works best for text characterization. As they suggested, including domain information, analyzing the named entities extracted from the headlines, and then comparing their interrelations may enhance the system in the future.
Unlike these studies, we have proposed a deep learning-based approach with a practical implementation that can detect Bangla fake news efficiently after necessary preprocessing and imbalanced data handling with an under-sampling technique.
III. METHODOLOGY
Our proposed methodology is divided into four significant steps: preprocessing, imbalanced data handling, the deep learning model, and an ensemble of the models. In the first step, we prepare the dataset containing Bangla text for computation and then perform random under-sampling to reduce the majority-class data. The third step performs the computation in a deep learning model to classify authentic and fake news. Finally, the models are ensembled to get the final prediction.
A. Preprocessing
Removing unnecessary characters from raw text is necessary before any classification task [2]. We have done the preprocessing in several steps. Different punctuation, emoticons, English letters and digits,
Fig. 1. Workflow diagram.
Bangla digits and extra symbols, pictographs, Latin symbols, etc. are removed from the text. Then, we have tokenized the words into single text units. As Bangla is a grammatically inflected language, word stemming plays an important role before any classification task [18], [19], [20]. Therefore, the stemming of the Bangla words is done using a Bangla stemmer library3. At last, Bangla stop words are eliminated from the texts. We have used the Stopwords ISO list4, as these stop words were found to give better performance in validation [8].
B. Proposed Model
We have designed a deep learning-based classification model to identify authentic and fake Bangla news. Our learning model has multiple layers, starting with an input layer. We have used the first 150 words of each article as input, which goes into an embedding layer. The embedding dimension of our model is 256, which converts the words into vectors of size 256. After the embedding layer, a series of flatten and dense layers of different sizes have been added to the model for extracting distinct features from the inputs. We have considered the 15000 most frequent words in the dataset as our dictionary. On this note, we have determined the different layer sizes based on implementation and the better performance of the model. Moreover, we have reduced the learning rate when the validation loss does not decrease over a period, to get a better model. The layer size distribution of our proposed deep learning model is shown in Table II. 3 https://pypi.org/project/bangla-stemmer/ 4 https://github.com/stopwords-iso
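To make the layer configuration of Table II concrete, the following is a minimal Keras sketch of the classifier with the stated dictionary size (15,000), input length (150), embedding dimension (256) and dense layer sizes. The hidden-layer activations, the optimizer and the learning-rate schedule parameters are assumptions, since the text only states that the learning rate is reduced when the validation loss stops decreasing.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

VOCAB_SIZE = 15000   # most frequent words kept as the dictionary
MAX_LEN = 150        # first 150 words of each article
EMBED_DIM = 256      # embedding dimension

def build_model():
    # Layer sizes follow Table II; the parameter counts match the reported totals.
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # 15,000 x 256 = 3,840,000 params
        layers.Flatten(),                          # (None, 150 * 256) = (None, 38400)
        layers.Dense(128, activation="relu"),      # 4,915,328 params (activation assumed)
        layers.Dense(32, activation="relu"),       # 4,128 params (activation assumed)
        layers.Dense(1, activation="sigmoid"),     # 33 params, fake/authentic output
    ])
    model.compile(optimizer="adam",                # optimizer is an assumption
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Reduce the learning rate when the validation loss plateaus, mirroring the
# behaviour described above (factor and patience are assumptions).
reduce_lr = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)
```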
IV. DATASET AND EXPERIMENTATION
A. Dataset Description
In our experiment, the BanFakeNews dataset [8] is used. It contains 48678 authentic articles and 1299 fake articles. Satire, clickbait, misleading and parody-type news articles are annotated as fake articles. On this note, a human baseline experiment was done to determine the authenticity of the articles. Furthermore, this dataset contains a total of 12 news categories, which are Miscellaneous, Entertainment, Lifestyle, National, International, Politics, Sports, Crime, Education, Technology, Finance and Editorial. We have used the article contents for our proposed model.
B. Experimentation
The experimentation process was done in multiple steps, as shown in Fig. 1. Firstly, the article contents are preprocessed as mentioned in the preprocessing section above. As the fake articles are fewer in number, i.e., the minority class compared to the authentic articles, we have tried to balance the train and test data by segmenting the authentic articles into five different parts. Hence, for training a model, the whole fake data and one segment of the authentic data are taken. In this way, our proposed model has been trained and validated four times sequentially, resulting in four different models. In addition, the whole fake data and the last part of the authentic data are used as unknown data for validating the performance of these four different models. The whole process of imbalanced data handling is depicted in Fig. 2. The data segmentation details for each experiment are shown in Table III.
TABLE I
COMPARISON WITH RELATED WORKS

Article | Classification Task | Database | Results
Hussain et al. [2] | Bangla Fake News | Scraped Bangla News: Real Articles - 1548, Fake Articles - 993 | (Accuracy) MNB - 93.32%, SVM - 96.64%; (Recall Score) MNB - 0.91, SVM - 0.97
Sharma et al. [17] | Satire in Bangla | Scraped Bangla News: Real Articles - 1480, Satire Articles - 1480 | (Accuracy) CNN - 96.4%
Imran et al. [24] | Satire and Fake News | Scraped Bangla News | (Accuracy) DNN - 90%
Benazir et al. [10] | Fake, Real, Ad, Info, Irrelevant, Query, Satire, Unsure Tweets | Scraped about 2000 Health Tweets in Bangla | (Accuracy) CNN (fastText embeddings) - 91%
Sharif et al. [1] | Suspicious and Non-suspicious Bangla Text | Non-Suspicious - Pre-built Bengali Corpus [25]; Suspicious - Online and Offline Resources; Total - 2000 Bangla Text Documents | (Accuracy) LR - 92%
MM Hossain et al. [27] | Bangla Fake News | BanFakeNews Dataset [8]: Real Articles - 48678, Fake Articles - 1299 | Recall = 0.93 (near-miss technique, Random Forest baseline); F1-score = 0.943 (model stacking); Random Forest: Accuracy = 99%, F1-score = 0.791, Precision = 0.846, Recall = 0.742
Proposed Method | Bangla Fake News | BanFakeNews Dataset [8]: Real Articles - 48678, Fake Articles - 1299 | (Accuracy) Deep Learning Approach = 98.29%, Precision = 0.99, Recall = 0.99, F1-score = 0.99

TABLE II
LAYERS OF PROPOSED DEEP LEARNING MODEL

Layer (type) | Output Shape | Param #
InputLayer | [(None, 150)] | 0
Embedding | (None, 150, 256) | 3840000
Flatten | (None, 38400) | 0
Dense1 | (None, 128) | 4915328
Dense2 | (None, 32) | 4128
Dense3 | (None, 1) | 33
Total params: 8,759,489; Trainable params: 8,759,489; Non-trainable params: 0
On this note, 70% of the data is used for training the models and 30% of the data is used to validate the performance of the models.

Fig. 2. Imbalanced data handling.

TABLE III
DATA DISTRIBUTION IN TRAIN, VALIDATION AND TEST SET

Experiment No. | Dataset Size | Training set (Authentic / Fake) | Validation set (Authentic / Fake) | Test set with Unknown Data (Authentic / Fake)
1 | 6601 | 3711 / 909 | 1591 / 390 | 13590 / 1299
2 | 10970 | 6769 / 909 | 2902 / 390 | - / -
3 | 14729 | 9401 / 909 | 4029 / 390 | - / -
4 | 7984 | 4679 / 909 | 2006 / 390 | - / -
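As a rough illustration of the imbalance-handling and ensembling scheme used in this section (segmenting the authentic articles, pairing each segment with the full set of fake articles, and hard-voting over the resulting models, as described in the next paragraph), the sketch below uses equal-sized segments and a 0.5 decision threshold; the actual segments differ in size (see Table III) and the paper's tie-breaking rule is not stated, so both are assumptions.

```python
import numpy as np

def build_experiments(authentic, fake, n_segments=5, seed=42):
    """Split the majority (authentic) articles into segments; the first four
    segments each form one experiment together with all fake articles, and
    the last segment plus all fake articles is kept as the unknown test set."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(authentic)), n_segments)
    experiments = []
    for part in parts[:-1]:
        x = [authentic[i] for i in part] + list(fake)
        y = [0] * len(part) + [1] * len(fake)          # 0 = authentic, 1 = fake
        experiments.append((x, y))
    x_test = [authentic[i] for i in parts[-1]] + list(fake)
    y_test = [0] * len(parts[-1]) + [1] * len(fake)
    return experiments, (x_test, y_test)

def hard_vote(models, x):
    """Majority vote over the thresholded predictions of the trained models
    (ties are broken in favour of the fake class here, an arbitrary choice)."""
    votes = np.stack([(m.predict(x).ravel() > 0.5).astype(int) for m in models])
    return (votes.sum(axis=0) >= len(models) / 2).astype(int)
```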
Lastly, to use the learning from all four of these models, we have incorporated an ensemble technique, hard voting5, on their outputs to get the final verdict. The whole project is deployed online as a web application6 with the heroku7 CLI as 5 https://machinelearningmastery.com/voting-ensembles-with-python/ 6 https://detect-bangla-fake-news.herokuapp.com 7 https://www.heroku.com/
Fig. 4. Deployed web app consisting of trained models.
Fig. 3. Train, validation and test accuracy for different models.

TABLE IV
COMPARISON OF THE MODELS

Model No. | Precision Score | Recall Score | F1-Score
Model 1 | 0.993 | 0.96 | 0.976
Model 2 | 0.988 | 0.989 | 0.989
Model 3 | 0.988 | 0.991 | 0.99
Model 4 | 0.99 | 0.984 | 0.987
Ensemble Model | 0.99 | 0.99 | 0.99
shown in Fig. 4, where two out of the four trained models are assembled due to space constraints. The implementation and results of our proposed model are available here.
V. RESULT ANALYSIS
We have experimented with deep learning models of the same architecture on four different datasets. As a result, four different models have been generated. Lastly, all the models were ensembled to get the final result. We implemented the models and then analyzed the validation accuracy and different performance metrics using Scikit-learn [21] and TensorFlow [22]. Fig. 3 shows the training accuracy, validation accuracy and accuracy of the models on unknown data. It shows that all the models have almost perfect accuracy. However, their validation accuracies vary from each other, despite the fact that all of the models show good validation accuracy. Moreover, even though the models overfit their different training data, they do not underfit the validation data. Additionally, all of the models performed with around 97% accuracy on the unknown data. From the behaviour of these models, we understand that a larger amount of data tends to build a better model. For example, in these experiments, Experiment 3 gives the highest validation accuracy and close to the highest performance on unknown data, and, as Table III shows, this experiment was done on the largest amount of data compared to the others. In the last part of our experiment, the predicted values of the four models on the unknown data are taken. After the application of the hard voting technique, we have got an accuracy
of 98.29%, which is the highest compared to the validation accuracies of the individual models. Therefore, all the models can contribute together to make a better decision. We have further analysed our models and calculated the precision, recall and F1 score to find out how they perform in identifying both authentic and fake news individually. Table IV presents all the scores for each of our models. All the individual models learned in each experiment have yielded good precision and recall. Among these, the model from Experiment 3 outperforms all others in recall, as it gets more data to distinguish the fake news from. However, the ensemble technique gives the best result without any bias toward a specific class. We have compared the performance of our classification model with other existing methods. Although fake news classification models for the Bangla language are not abundant, we have looked for similar types of binary classification models in Bangla and compared against them. The comparison is shown in Table I. To the best of our knowledge, our model outperforms all the available models in the literature.
VI. CONCLUSION
Identifying fake news online efficiently has become a challenging task due to the low resources of Bangla language processing. In this study, we have devised an efficient deep learning model to detect fake and authentic news in the Bangla language. We have used a human-annotated dataset where authentic and fake news articles are labeled. Because of the high amount of authentic data, we have used a random under-sampling technique. As a result, we have created multiple secondary datasets and fit the first four concatenated datasets into the proposed model sequentially, which results in four different trained deep learning models. After that, all these models are tested with an unknown dataset to evaluate the performance of the learned models. The highest validation accuracy we have got using the unknown data for an individual model is 98.06%. Finally, we have ensembled the models and used the hard voting technique to get a final validation accuracy of 98.29% with a recall score of 99%. From the experimentation, we have found that very few news articles are wrongly classified by our model, as we have adopted the ensemble method and have multiple deep learning models trained on different datasets.
However, the work can be extended to develop an optimized model by focusing on more ways to handle imbalanced data and building a more sophisticated model that can be used to classify the authenticity of a Bangla news article. Lastly, Our proposed model can be useful to detect fake news online as it has excellent accuracy and recall score. R EFERENCES [1] Sharif, O., Hoque, M. M. (2019, October). Automatic detection of suspicious Bangla text using logistic regression. In International Conference on Intelligent Computing Optimization (pp. 581-590). Springer, Cham. [2] Hussain, M. G., Hasan, M. R., Rahman, M., Protim, J., Hasan, S. A. (2020). Detection of bangla fake news using mnb and svm classifier. arXiv preprint arXiv:2005.14627. [3] Chakraborty, P., Seddiqui, M. H. (2019, May). Threat and abusive language detection on social media in bengali language. In 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT) (pp. 1-6). IEEE. [4] Sharif, O., Hoque, M. M., Kayes, A. S. M., Nowrozy, R., Sarker, I. H. (2020). Detecting suspicious texts using machine learning techniques. Applied Sciences, 10(18), 6527. [5] Kabir, F., Siddique, S., Kotwal, M. R. A., Huda, M. N. (2015, March). Bangla text document categorization using stochastic gradient descent (sgd) classifier. In 2015 International Conference on Cognitive Computing and Information Processing (CCIP) (pp. 1-4). IEEE. [6] Islam, T., Latif, S., & Ahmed, N. (2019, May). Using Social Networks to Detect Malicious Bangla Text Content. In 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT) (pp. 1-4). IEEE. [7] Mahir, E. M., Akhter, S., & Huq, M. R. (2019, June). Detecting Fake News using Machine Learning and Deep Learning Algorithms. In 2019 7th International Conference on Smart Computing & Communications (ICSCC) (pp. 1-5). IEEE. [8] Hossain, M. Z., Rahman, M. A., Islam, M. S., & Kar, S. (2020). Banfakenews: A dataset for detecting fake news in bangla. arXiv preprint arXiv:2004.08789. [9] P´erez-Rosas, V., Kleinberg, B., Lefevre, A., & Mihalcea, R. (2017). Automatic detection of fake news. arXiv preprint arXiv:1708.07104. [10] Benazir, A., Sharmin, S. (2020, December). Credibility Assessment of User Generated health information of the Bengali language in microblogging sites employing NLP techniques. In 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) (pp. 837-844). IEEE. [11] United Nations. (n.d.). During this coronavirus pandemic, ’fake news’ is putting lives at risk: UNESCO — — UN News. United Nations. https://news.un.org/en/story/2020/04/1061592. [12] Bangladesh lynchings: Eight killed by mobs over false child abduction rumours. BBC News. (2021). Retrieved 6 July 2021, from https://www.bbc.com/news/world-asia-49102074. [13] Niloy Alam, Mahadi Al Hasnat, Arifur Rahman Rabbi, and Shegufta Hasnine Surur. 2018. Fake news hits Bangladeshi news sites before polls. Retrieved September 18, 2019 from https://www.dhakatribune.com/bangladesh/election/2018/11/17/fakenews-hits-bangladeshi-news-sites-before-polls. [14] Chen, E., Chang, H., Rao, A., Lerman, K., Cowan, G., Ferrara, E. (2021). COVID-19 misinformation and the 2020 U.S. presidential election. Harvard Kennedy School Misinformation Review. https://doi.org/10.37016/mr-2020-57 [15] Haque, M. M., Yousuf, M., Alam, A. S., Saha, P., Ahmed, S. I., Hassan, N. (2020). 
Combating Misinformation in Bangladesh: Roles and Responsibilities as Perceived by Journalists, Fact-checkers, and Users. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-32. [16] Al-Zaman, M. S. (2021). COVID-19-related online misinformation in Bangladesh. Journal of Health Research. [17] Sharma, A. S., Mridul, M. A., Islam, M. S. (2019, September). Automatic detection of satire in bangla documents: A cnn approach based on hybrid feature extraction model. In 2019 International Conference on Bangla Speech and Language Processing (ICBSLP) (pp. 1-5). IEEE. [18] M. Islam, M. Uddin, M. Khan et al., “A light weight stemmer for bengali and its use in spelling checker,” 2007.
[19] M. R. Mahmud, M. Afrin, M. A. Razzaque, E. Miller, and J. Iwashige, “A rule based bengali stemmer,” in 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2014, pp. 2750–2756. [20] S. Sarkar and S. Bandyopadhyay, “Design of a rule-based stemmer for natural language text in bengali,” in Proceedings of the IJCNLP-08 workshop on NLP for Less Privileged Languages, 2008. [21] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830. [22] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... Zheng, X. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265-283). [23] Ahmed, M. F., Mahmud, Z., Biash, Z. T., Ryen, A. A. N., Hossain, A., Ashraf, F. B. (2021). Cyberbullying Detection Using Deep Neural Network from Social Media Comments in Bangla Language. arXiv preprint arXiv:2106.04506. [24] Al Imran, A., Wahid, Z., Ahmed, T. (2020, December). BNnet: A Deep Neural Network for the Identification of Satire and Fake Bangla News. In International Conference on Computational Data and Social Networks (pp. 464-475). Springer, Cham. [25] Ratul, M. A. S., Khan, M. Y. A., Islam, M. S. Open Source Autonomous Bengali Corpus. [26] Sarker, I.H. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN COMPUT. SCI. 2, 420 (2021). https://doi.org/10.1007/s42979-021-00815-1 [27] Hossain, M. M., Awosaf, Z., Prottoy, M., Hossan, S., Alvy, A. S. M., Morol, M. (2022). Approaches for Improving the Performance of Fake News Detection in Bangla: Imbalance Handling and Model Stacking. arXiv preprint arXiv:2203.11486.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
An Efficient Deep Learning Approach for Brain Tumor Segmentation using 3D Convolutional Neural Network 1st Syed Muaz Ali
2nd Md. Ashraful Alam
Department of Computer Science and Engineering Brac University Dhaka, Bangladesh [email protected]
Department of Computer Science and Engineering Brac University Dhaka, Bangladesh [email protected]
Abstract—In medical application, deep learning-based biomedical semantic segmentation has provided state-of-the-art results and proven to be more efficient than manual segmentation by human interaction in various cases. One of the most popular architectures for biomedical segmentation is U-Net. In this paper, a convolutional neural architecture based on 3D U-Net but with fewer parameters and lower computational cost is used for the segmentation of brain tumors. The proposed model is able to maintain a very efficient performance and provides better results in some cases compared to conventional U-Net, while reducing memory usage, training time and inference time. The model is trained on the BraTS 2021 dataset and is able to achieve Dice scores of 0.9105, 0.884 and 0.8254 on Whole Tumor, Tumor Core and Enhancing-Tumor on the testing dataset. Index Terms—Convolutional Neural Network, Brain Tumor, Transfer Learning
I. INTRODUCTION
Brain tumors are among the most fatal cancers. Around 83,570 people were expected to be diagnosed with a brain tumor or another type of central nervous system tumor, and 18,600 of them were expected to die due to the illness [24]. Typically, diagnosing a brain tumor involves a neurological exam, a brain scan (CT scan, MRI, PET or an angiogram) and a biopsy. These tests require expert operators to perform and are prone to human error [1]. In order to increase the survival rate of patients, it is important to diagnose brain tumors at a very early phase. CNN architectures based on U-Net have provided great results for detecting brain tumors pixel by pixel from MRI images through semantic segmentation. However, U-Net [7] based architectures can require powerful hardware to run, as the models rely on millions of parameters, and using 3D convolutional neural networks increases the parameter count compared to 2D convolutional neural networks. Thus, we propose an efficient CNN architecture similar to 3D U-Net but computationally less expensive, requiring less than a million parameters to run.
II. RELATED WORKS
Squeeze U-Net [22] is a memory- and computationally-efficient U-Net based architecture based on SqueezeNet [9].
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
SqueezeNet is able to achieve an accuracy similar to AlexNet [3] by implementing Fire modules, and it has a reduced number of parameters, resulting in a model size of less than 1 MB. Following the architecture of SqueezeNet, the Squeeze U-Net architecture reduces the model size by 12 times, provides a 17% increase in speed during inference, and makes training 52% faster compared to U-Net. Wei et al. [18] proposed a U-Net based architecture called S3D-UNet, based on the S3D architecture [17]. The authors of S3D-UNet divided the 3D convolution operations into separate convolution operations to reduce the computational cost. They suggested the use of separable convolutions over 3D convolutions, as they provide good results compared to the usual 3D convolution. The authors demonstrated the performance of the model on the BraTS 2017 dataset.
III. METHODOLOGY
The model was designed to achieve very comparable results while maintaining a structure similar to U-Net, but with a reduced computational cost and memory usage. In order to make the model efficient by reducing the parameters, 3D separable convolutions were implemented. Squeeze-and-excitation [15] blocks and attention gates, which do not require high computational power, were also added. The final model used stem blocks similar to the initial block of ResNet-D [19] to greatly reduce the computational cost. Deep supervision was also added to the models to further improve the results. Finally, our final model was first trained using patches from the training data, where data augmentation was applied, and then trained again on the original un-patched images by loading the weights, which provided better results.
A. Dataset
We used the BraTS 2021 [23][6][10] dataset to evaluate the models. The dataset contains 1251 samples where the tumor regions are labeled. The regions include the following categories: 1. Enhancing Tumor (ET), 2. Non-Enhancing Tumor (NET/NCR), 3. Edema (ED). Each of the scans includes four
channels: T1, Flair, T2 and T1CE. The dimension of each of the channels is 240x240x155.
B. Data Pre-processing and Augmentation
Data pre-processing was divided into two steps. At first, the values of the Nifti images were scaled between 0 and 1, and the images were cropped to 128x128x128 to reduce the number of zero values and to fit into the memory of the GPU. The T1CE, T2 and Flair images were stacked to create a four-dimensional Numpy array, and the images whose mask contains at least 1% tumor data were saved. After saving the images, the dataset was randomly split into an 80% training set, a 10% validation set and a 10% testing set. In the second step of data preprocessing, patches of dimension 64x64x64 with no overlap were extracted from each of the images in the training and validation datasets. Each of the 128x128x128 images resulted in a total of 8 patches of dimension 64x64x64. The patches where the masks contain at least 1% data of either ET, ED or NET/NCR were saved. Data augmentation was applied to the training dataset of 64x64x64 patches, where the patches were randomly rotated and flipped. Through data augmentation, there were a total of 16,536 images for training.
IV. IMPLEMENTATION
A. 3D Separable Convolutions
In separable convolutions, one convolution operation is divided into multiple convolutions to reduce the computational cost. Figure 1 demonstrates the types of separable convolution operations used.
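A minimal Keras sketch of the kind of factorised 3D convolution referred to here is given below; it implements the NxNx1 followed by 1x1xN decomposition mentioned later for the modified convolution blocks. The exact compositions of "Separable Convolution 1" and "Separable Convolution 2" are defined by Fig. 1, so the variant shown, and the placement of ReLU, should be read as assumptions.

```python
from tensorflow.keras import layers

def separable_conv3d(x, filters, n=3, activation="relu"):
    """Factorised 3D convolution: an (n, n, 1) convolution over two spatial
    axes followed by a (1, 1, n) convolution along the remaining axis.
    Compared with a full n x n x n kernel, this roughly cuts the per-layer
    weights from n^3 to n^2 + n per input/output channel pair."""
    x = layers.Conv3D(filters, (n, n, 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv3D(filters, (1, 1, n), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation(activation)(x)
```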
C. Squeeze-and-Excitation Block
Using Squeeze-and-Excitation blocks can further improve the accuracy of a CNN by recalibrating channel-wise feature responses while not adding a high computational cost. The blocks were added to the convolution blocks of the encoders and decoders in the model.
D. Attention Gates
According to Oktay et al. [16], attention gates can help to improve the representation of salient features by suppressing the irrelevant areas of an image. A slight modification was made to the attention gate to reduce the computational cost by not performing any convolution operation on the feature maps from the encoder block or on the gating signal from the decoder. The features from the decoder block are up-sampled and added to the features of the encoder block, followed by an activation function and a convolution of kernel size 1x1x1 with a feature size of 1, and finally followed by a sigmoid activation function. The output is then multiplied with the input features from the encoder block through element-wise multiplication.
E. Stem Blocks
Stem blocks similar to the initial block of ResNet [8] were used as the first stage of the model. Using a convolution operation of kernel size 7x7x7 with the strides set to 2 gave a great improvement in terms of training time and memory by reducing the computational cost. Furthermore, the stem block was modified based on ResNet-D by Tong et al. [19] by adding an additional average pooling layer followed by a convolution operation of kernel size 1x1x1. According to the authors of ResNet-D, due to the strides of 2 in ResNet, some of the information in the images is ignored, and using an average pooling layer followed by a convolution operation can help overcome this issue. However, in the proposed models, an average pooling operation was used with the padding set to "same" and the strides set to 1, and in the next convolution operation of kernel size 1x1x1 the strides were set to 2 for down-sampling to reduce the computational cost. The stem blocks are shown in Figure 2.
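The two lightweight components described in Sections IV-C and IV-D can be sketched as follows; the reduction ratio of the squeeze-and-excitation block and the use of ReLU inside the gate are assumptions, and the gate assumes that the up-sampled decoder features already match the encoder features in channel count, as implied by the element-wise addition.

```python
from tensorflow.keras import layers

def squeeze_excite_3d(x, ratio=8):
    """Squeeze-and-Excitation for 3D feature maps: global average pooling,
    a bottleneck dense layer, and a sigmoid-gated channel re-weighting
    (ratio=8 is an assumption)."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling3D()(x)
    s = layers.Dense(max(channels // ratio, 1), activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, 1, channels))(s)
    return x * s  # broadcast the channel weights over the volume

def light_attention_gate(encoder_feat, decoder_feat):
    """Simplified attention gate: up-sample the gating signal, add it to the
    encoder features, apply an activation, a single 1x1x1 convolution with one
    filter and a sigmoid, then re-weight the encoder features. No convolutions
    are applied to the inputs themselves, as described in Sec. IV-D."""
    g = layers.UpSampling3D(size=2)(decoder_feat)
    a = layers.Activation("relu")(layers.Add()([encoder_feat, g]))
    a = layers.Conv3D(1, kernel_size=1)(a)
    a = layers.Activation("sigmoid")(a)
    return encoder_feat * a  # broadcast the single-channel mask over all channels
```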
Fig. 1: Separable Convolutions
B. Mixed Convolutions The authors in the paper[20] proposed the idea of concatenating separate convolutions of different kernel sizes to improve the accuracy. A similar method to concatenate the outputs of separable convolutions was implemented. However, unlike MixConv[20] where the features are split into separate groups, a convolution operation was performed on the same features but with half the number of filters.
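A sketch of this MixConv-style grouping, reusing the separable_conv3d helper from the earlier sketch, is shown below; the two kernel sizes (3 and 5) are assumptions, since the actual sizes come from the block definition in Fig. 3.

```python
from tensorflow.keras import layers

def mixed_separable_conv3d(x, filters, kernel_sizes=(3, 5)):
    """Apply separable convolutions with different kernel sizes to the same
    input, each with half the filters, and concatenate the outputs (unlike
    MixConv, the input features are not split into groups)."""
    branches = [separable_conv3d(x, filters // len(kernel_sizes), n=k)
                for k in kernel_sizes]
    return layers.Concatenate()(branches)
```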
Fig. 2: Stem Blocks F. Modified Convolution Blocks Figure 3 shows the modified convolution blocks that were used in encoders, decoders and the bottleneck layers of
the proposed model. The block consists of separable convolutions incorporated with mixed convolutions. In the figure, the separable convolution operations of kernel size NxNx1 followed by a convolution operation of kernel size 1x1xN are shown. A convolution operation of kernel size 1x1x1 is performed on the concatenated output from the mixed convolution and added to the original input through element-wise addition, similar to residual blocks, to avoid issues with vanishing gradients and overfitting. A convolution operation is performed on the input layer for identity mapping and dimensionality reduction or increase, in order to match the output for the addition. Finally, the added output is processed through a squeeze-and-excitation block. Batch normalization followed by an activation function was applied after each of the convolution operations.
that was used to compare uses 4 encoder and decoder blocks. Figure 4 shows the architecture of the model.
Fig. 4: Proposed model
H. U-Net Architecture
A U-Net architecture in which the encoders and decoders have 16, 32, 64 and 128 features and the bottleneck has 256 features was used, following a pattern similar to that of the proposed model. In the convolution blocks, batch normalization was used after each convolution operation to avoid overfitting and improve generalization.
I. Deep-Supervision
Deep supervision was originally proposed by Chen-Yu et al. [5] to improve the results of a deep learning model. In order to implement a computationally efficient deeply-supervised model, a transposed convolution operation of kernel size 2x2x2 was performed on the first decoder block and its output was added to the next decoder block; finally, two additional transposed convolution operations were performed on the added output to match the size of the output layer.
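A minimal sketch of this auxiliary output path is given below; the intermediate filter count and the softmax on the auxiliary head are assumptions, and decoder1/decoder2 stand for the outputs of the first two decoder blocks of the proposed model.

```python
from tensorflow.keras import layers

def deep_supervision_head(decoder1, decoder2, n_classes=4):
    """Deep-supervision branch: up-sample the deepest decoder output with a
    2x2x2 transposed convolution, add it to the next decoder output, then use
    two more transposed convolutions to reach the size of the main output."""
    d = layers.Conv3DTranspose(decoder2.shape[-1], 2, strides=2, padding="same")(decoder1)
    d = layers.Add()([d, decoder2])
    d = layers.Conv3DTranspose(16, 2, strides=2, padding="same")(d)        # filters assumed
    d = layers.Conv3DTranspose(n_classes, 2, strides=2, padding="same")(d)
    return layers.Activation("softmax", name="ds_output")(d)
```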
Fig. 3: Modified Convolution Block
G. Proposed model Architecture On the proposed model the modified convolution blocks, stem blocks and attention gates were used. The input size of the model is 128x128x128x3 or 64x64x64x3 during training with the patches and the output is of 128x128x128x4 or 64x64x64x4. On the up-sampling blocks of proposed model, the layers are up-sampled by a size of 2x2x2 followed by a convolution operation of kernel size of 2x2x2, batch normalization and then activation. The encoder and decoder blocks use feature maps starting from 16 to 64 and the bottleneck layer uses filters of 256. However, compared to U-Net, the model has 3 encoder and decoder blocks while the U-Net
J. Transfer Learning and Fine-tuning
Mina et al. [21] have demonstrated the results of freezing the layers of different blocks of the U-Net architecture. According to their work, freezing the bottleneck block's layers provides results similar to training the whole network while reducing the number of trainable parameters by around half. A similar methodology was followed here: during transfer learning, the layers of the bottleneck were frozen, so that the total number of parameters was 627,155 while the trainable parameters were 417,555, thus reducing the computational cost.
K. Activation Functions
The performance of the proposed models was evaluated using different activation functions, or combinations of them, on different layers of the network. The Swish [12], Smish [26] and ReLU [14] activation functions were compared to find the best combination for achieving the highest accuracy while maintaining efficiency. The model also uses the Sigmoid activation function on different layers and the Softmax activation in the final layer to make predictions for multiple classes.
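For reference, the following defines the Swish and Smish activations used in these comparisons as custom Keras-compatible functions; Swish is x·sigmoid(x) and Smish follows the formulation of Wang et al. [26], x·tanh(ln(1 + sigmoid(x))).

```python
import tensorflow as tf

def swish(x):
    # Swish: x * sigmoid(x); also available as tf.keras.activations.swish.
    return x * tf.nn.sigmoid(x)

def smish(x):
    # Smish [26]: x * tanh(ln(1 + sigmoid(x))).
    return x * tf.math.tanh(tf.math.log(1.0 + tf.nn.sigmoid(x)))

# Example usage inside a block:
# layers.Conv3D(32, 3, padding="same", activation=smish)
```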
V. EXPERIMENTS AND RESULTS
The results of all the models were evaluated using the Dice score (F1), IoU, precision, sensitivity and specificity. Dice loss [13] and Focal loss [11] were combined as the loss function [25] to train the models.
A. Training and Results on U-Net
The U-Net model was trained with a learning rate of 0.0001 and the Adam [4] optimizer, with the batch size set to 2. The model was trained on 128x128x128x3 images with the loss function set to the combined Dice and Focal loss. The validation scores did not improve after the 81st epoch, reaching a validation IoU score of 0.7766 and F1 score of 0.8507, and the training was stopped at the 140th epoch. The results of the model on the testing dataset for NET/NCR, ED and ET are shown in Table I. A sample prediction of U-Net is shown in Figure 5, where the left image is the ground truth and the right image is the prediction from U-Net.

Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.6869 | 0.7654 | 0.9986 | 0.8195 | 0.7596
ED | 0.7141 | 0.8142 | 0.9958 | 0.8421 | 0.8202
ET | 0.7484 | 0.8273 | 0.9989 | 0.8432 | 0.8333
TABLE I: U-Net Results On Testing Dataset
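A generic sketch of the combined Dice and Focal loss used for training is given below; the focal-loss parameters (gamma, alpha) and the equal weighting of the two terms are assumptions, since the combination is adopted from [25] without its hyperparameters being listed here.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    # Soft Dice loss averaged over classes; inputs are one-hot / softmax
    # volumes of shape (batch, D, H, W, classes).
    axes = (1, 2, 3)
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    union = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(dice)

def categorical_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    # Multi-class focal loss; gamma and alpha are the common defaults,
    # not values reported in the paper.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    cross_entropy = -y_true * tf.math.log(y_pred)
    weight = alpha * tf.pow(1.0 - y_pred, gamma)
    return tf.reduce_mean(tf.reduce_sum(weight * cross_entropy, axis=-1))

def combined_loss(y_true, y_pred):
    # Equal-weight sum of the two terms (the weighting is an assumption).
    return dice_loss(y_true, y_pred) + categorical_focal_loss(y_true, y_pred)
```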
B. Training and Results on Proposed Models
Firstly, the proposed model was trained with the unmodified stem block from Figure 2, Separable Convolution 1 from Figure 1 and without any deep supervision. How the ReLU and Smish activation functions perform on the proposed model was tested. In order to achieve this, the ReLU activation function was used on the layers that require non-linearity, without replacing the existing Sigmoid or Softmax functions. Secondly, all the ReLU activation functions were replaced with the Smish activation function and the models were trained again. The models were trained on 128x128x128x3 images with the same hyperparameters as U-Net, using L2 regularization [2] with a weight decay of 0.0005. The validation scores of the proposed model with ReLU activation did not improve after the 166th epoch, reaching a validation IoU score of 0.7535 and F1 score of 0.8332, and the model with Smish did not improve after the 188th epoch, reaching a validation IoU score of 0.7606 and F1 score of 0.8377. These models, as well as U-Net, were trained using TensorFlow with mixed precision enabled; however, the Smish function tends to consume more memory. The model with Smish performed better on the ED and ET classes in terms of IoU and F1 scores compared to the proposed model with ReLU. Tables II and III show the results of the two models for NET/NCR, ED and ET on the testing dataset.

Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.663 | 0.7462 | 0.9981 | 0.8135 | 0.7406
ED | 0.6833 | 0.7906 | 0.997 | 0.7658 | 0.866
ET | 0.693 | 0.7818 | 0.9983 | 0.8361 | 0.7596
TABLE II: Proposed model with ReLU results on different classes on testing dataset

Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.658 | 0.7427 | 0.9983 | 0.8005 | 0.7373
ED | 0.7007 | 0.8042 | 0.9963 | 0.8142 | 0.8283
ET | 0.7235 | 0.8104 | 0.9989 | 0.8151 | 0.8364
TABLE III: Proposed model with Smish results on different classes on testing dataset

On the other hand, the model with ReLU activation was also trained with Separable Convolution 2 from Figure 1. This time the model showed better results on the testing dataset in terms of IoU and F1 score compared to using Separable Convolution 1. The model was trained with the same hyperparameters as the proposed model with ReLU, and the validation loss did not improve after the 189th epoch, reaching a validation IoU score of 0.75453 and F1 score of 0.83263.
Fig. 5: Prediction On U-Net (Left Ground Truth, Right Prediction)
Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.6653 | 0.7563 | 0.9984 | 0.8117 | 0.7527
ED | 0.6929 | 0.7958 | 0.9954 | 0.8278 | 0.8039
ET | 0.7181 | 0.8049 | 0.9986 | 0.8296 | 0.8058
TABLE IV: Proposed model with ReLU (Separable Convolution 2) results on different classes on testing dataset

Finally, a modification was done to the model to use Separable Convolution 2 and the modified stem block based on ResNet-D. Deep supervision was also implemented in the model. The model was trained on the dataset of 64x64x64 patches. The Smish activation function was used in the layers of the last encoder, the first decoder and the bottleneck block, and the Swish activation in the other blocks' layers, as the Smish function consumes more memory than Swish. After training the model with the patches, the model was trained on the dataset of 128x128x128 images, using transfer learning to load the weights; for fine-tuning, the layers in the bottleneck block were set to non-trainable. The models were trained using the same hyperparameters as the previous models; however, during training on the patches, the learning rate was initially set to 0.001 and the batch size to 16. The model's validation scores did not improve after the 63rd epoch, reaching a validation IoU score of 0.7575 and a validation F1 score of 0.8482 for the main output layer, so the training was stopped and the best model was saved. The saved model was then trained again at a learning rate of 0.0001. This time the validation scores did not improve after the
32nd epoch, reaching a validation IoU score of 0.77688 and F1 score of 0.86083 for the main output layer. Finally, the best model was saved and trained on the 128x128x128 images by loading the weights from the model trained on patches and freezing the layers of the bottleneck block. The learning rate was set to 0.0001, and the validation scores did not improve after the 11th epoch, reaching a validation IoU score of 0.7801 and F1 score of 0.8511, higher than U-Net. Pre-training the model on the patches and then training it again on the same images from which the patches were extracted can improve the accuracy. The final model obtained through transfer learning showed better results compared to our previous models. Table V shows the results on the testing dataset of the model trained only on patches; to predict 128x128x128 images with this model, patches were extracted from each image, predictions were made on the patches individually, and the predicted patches were then combined back into the original shape to calculate the results. Table VI shows the results on the testing dataset of the model where patch-wise pre-training and transfer learning were used.

Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.6468 | 0.73445 | 0.9983 | 0.788 | 0.7413
ED | 0.674 | 0.7810 | 0.9958 | 0.783 | 0.8197
ET | 0.6574 | 0.7522 | 0.9986 | 0.759 | 0.7889
TABLE V: Results on testing dataset by predicting using patches

Class | IoU | F1 | Spe | Sen | Prc
NET/NCR | 0.6915 | 0.772 | 0.9985 | 0.8429 | 0.7507
ED | 0.725 | 0.819 | 0.9959 | 0.8426 | 0.8256
ET | 0.7444 | 0.8254 | 0.9989 | 0.8362 | 0.8281
TABLE VI: Results on testing dataset through patch-wise pre-training and transfer-learning
C. Comparison of the Model with U-Net
The final model, based on patch-wise pre-training and transfer learning, is compared with U-Net, as it provided the best results among the previous experiments on the testing dataset. Furthermore, the IoU results, including Whole Tumor (WT) and Tumor Core (TC), are compared in Table VII and the F1 scores in Table VIII on the testing dataset.

Model | ET | WT | TC | NET/NCR | ED
U-Net | 0.748 | 0.838 | 0.828 | 0.687 | 0.714
Proposed model | 0.744 | 0.851 | 0.83 | 0.692 | 0.725
TABLE VII: Comparison of IoU on U-Net and proposed model

Model | ET | WT | TC | NET/NCR | ED
U-Net | 0.827 | 0.904 | 0.885 | 0.765 | 0.814
Proposed model | 0.825 | 0.911 | 0.884 | 0.772 | 0.819
TABLE VIII: Comparison of F1 (Dice Score) on U-Net and proposed model

Fig. 6: Sample predictions between U-Net (Middle) and Proposed M. (Right) (Left GT)

Based on the final results, the proposed model has achieved a very comparable performance to U-Net and performed better in some cases, while maintaining around 11% of the total parameters, a reduced computational cost and a lower inference time. During prediction, the hidden output layer that was used for deep supervision as the main output layer produces the best result. Removing the layer reduced the model size from 6.62 MB to 2.87 MB, reduced the parameters from 627,155 to 604,495 and decreased the inference time. The model took around 12 minutes per epoch when training on the patches of the images. Table IX shows the parameters, size on disk, training time per epoch on the training dataset of 128x128x128x3 images, and inference time per image on the testing dataset.

Model | Params | Size | Trn. Time | Inf. Time
U-Net | 5,651,716 | 65.0 MB | 15+ min | 0.355ms
Proposed M. | 604,495 | 2.87 MB | 5+ min | 0.194ms
TABLE IX: Comparison of U-Net and proposed model

VI. CONCLUSION & FUTURE WORK
The model managed to reduce the computational cost and memory usage. However, depth-wise 3D convolutions should be incorporated into the model, once TensorFlow releases a version that supports them, to further reduce the computational cost, and the model should be validated through the BraTS online validation.
REFERENCES
[1] S. T. Chao, J. H. Suh, S. Raja, S.-Y. Lee, and G. Barnett, "The sensitivity and specificity of FDG PET in distinguishing recurrent brain tumor from radionecrosis in patients treated with stereotactic radiosurgery," 2001. DOI: 10.1002/ijc.1016.
[2] C. Cortes, M. Mohri, and A. Rostamizadeh, "L2 regularization for learning kernels," ArXiv, vol. abs/1205.2653, 2009.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25, Curran Associates, Inc., 2012.
[4] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[5] C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," ArXiv, vol. abs/1409.5185, 2015.
[6] B. H. Menze, A. Jakab, S. Bauer, et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Trans. Med. Imaging, vol. 34, no. 10, pp. 1993-2024, Oct. 2015.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," ArXiv, vol. abs/1505.04597, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[9] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size," ArXiv, vol. abs/1602.07360, 2016.
[10] S. Bakas, H. Akbari, A. Sotiras, et al., "Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features," Sci. Data, vol. 4, p. 170117, Sep. 2017.
[11] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999-3007, 2017.
[12] P. Ramachandran, B. Zoph, and Q. V. Le, "Swish: A self-gated activation function," arXiv: Neural and Evolutionary Computing, 2017.
[13] C. H. Sudre, W. Li, T. K. M. Vercauteren, S. Ourselin, and M. J. Cardoso, "Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA 2017 and ML-CDS 2017, held in conjunction with MICCAI 2017), pp. 240-248, 2017.
[14] A. F. Agarap, "Deep learning using rectified linear units (ReLU)," ArXiv, vol. abs/1803.08375, 2018.
[15] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.
[16] O. Oktay, J. Schlemper, L. L. Folgoc, et al., "Attention u-net: Learning where to look for the pancreas," ArXiv, vol. abs/1804.03999, 2018.
[17] S. Xie, C. Sun, J. Huang, Z. Tu, and K. P. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in ECCV, 2018.
[18] W. Chen, B. Liu, S. Peng, J. Sun, and X. Qiao, "S3D-UNet: Separable 3D U-Net for brain tumor segmentation," in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Cham: Springer International Publishing, 2019, pp. 358-368.
[19] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 558-567, 2019.
[20] M. Tan and Q. V. Le, "Mixconv: Mixed depthwise convolutional kernels," ArXiv, vol. abs/1907.09595, 2019.
[21] M. Amiri, R. Brooks, and H. Rivaz, "Fine-tuning U-Net for ultrasound image segmentation: Different layers, different outcomes," IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 67, no. 12, pp. 2510-2518, 2020.
[22] N. Beheshti and L. Johnsson, "Squeeze U-Net: A memory and energy efficient image segmentation network," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 1495-1504.
[23] U. Baid, S. Ghodasara, M. Bilello, et al., "The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification," ArXiv, vol. abs/2107.02314, 2021.
[24] K. D. Miller, Q. T. Ostrom, C. Kruchko, et al., "Brain and other central nervous system tumor statistics, 2021," CA: A Cancer Journal for Clinicians, vol. 71, no. 5, pp. 381-406, 2021.
[25] R. Solovyev, A. A. Kalinin, and T. Gabruseva, "3D convolutional neural networks for stalled brain capillary detection," Computers in Biology and Medicine, vol. 141, p. 105089, 2022.
[26] X. Wang, H. Ren, and A. Wang, "Smish: A novel activation function for deep learning methods," Electronics, vol. 11, no. 4, 2022.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
BloodComm: A Peer-to-Peer Blockchain-based Community for Blood Donation Network Chowdhury Mohammad Abdullah, Minhaz Kamal, Fairuz Shaiara, Abu Raihan Mostofa Kamal, Md Azam Hossain Network and Data Analysis Group (NDAG), Department of Computer Science and Engineering Islamic University of Technology, Gazipur 1704, Bangladesh Email: {abdullah39, minhazkamal, fairuzshaiara, raihan.kamal, azam}@iut-dhaka.edu
Abstract—Blood transfusion is an integral part of the healthcare system that plays an important role in ensuring the quality of care for patients undergoing a variety of medical procedures and treatments. A large portion of this blood comes from voluntary donors. The existing blood donor management systems are unable to offer a reliable audit trail and traceability. Hence, there is a significant risk that patients may get transfusion of blood from unreliable sources. In this paper, we propose a system built on Ethereum with the goal of creating a decentralized, transparent, traceable, and secure network of blood donors. The platform uses smart contracts to facilitate peer-to-peer interactions. To encourage donors to donate blood more regularly, the system also offers rewards in the form of tokens. Our source code is available in a public Github repository1 . Index Terms—Blood donation, blockchain, smart contracts, Ethereum, peer-to-peer.
I. I NTRODUCTION Blood transfusion is one of the most essential and vital issues in public healthcare. It is required in a variety of situations, including the treatment of soft tissue injuries (e.g., serious burn damage, tissue puncture, etc.), as well as medical procedures and operations (e.g., C-section, organ transplant, etc.) that pose a risk of excessive blood loss. Another important consideration is the treatment of various medical disorders such as blood cancer, leukemia, etc [1]. Furthermore, there are people with certain illnesses (e.g., anemia, thalassemia) who need blood transfusions on a regular basis [2]. The demand for blood transfusions is rising far faster than the number of available blood donors. According to the ”American Red Cross”, just 3% of the eligible population donates blood, which is insufficient to fulfill demand during abrupt surges [3]. The situation is unlikely to improve much in the foreseeable future, other than becoming more difficult. The main cause for this is the world’s aging population, and an older population is more likely to be a consumer rather than a supplier [4]. Furthermore, the ”American Cancer Society” forecasts that in 2022, 1.9 million new cancer cases would be identified, with about 609,360 fatalities occurring in the United States [5]. During their various phases of treatment, each newly diagnosed patient will need a large amount of blood transfusion and component transfusion. Similar predictions are being made for leukemia and thalassemia [1], [6]. 1 Github Repository of Implementation: https://github.com/minhazkamal/ Blood-Donation-System-with-Blockchain
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
This combination of an aging population and chronic diseases forecasts a surge in the demand for component-specific transfusions (e.g., red blood cells, plasma, and platelets) and a simultaneous decline in the blood donation rate. It indicates a potential scarcity of blood donations in the future [4]. For a future-proof solution, it has become indispensable to make the overall contribution procedure simplified, participatory, safe, and, if possible, to some degree rewarding for donors. In this aspect, leveraging the latest technologies (e.g., blockchain, machine learning, etc.) can prove to be fruitful. Blood donation begins with finding a donor whose blood group is compatible. Following that, further inquiry into the donor's background and health status is required due to safety and risk concerns for both the patient and the donor. For instance, a patient will not prefer to receive blood from a donor who is addicted to harmful narcotics and intoxicants or infected with STDs [7]. This requires undertaking a series of blood searching, matching, and screening steps, which increases the overall turnaround time. Moreover, there are currently no trustworthy digital solutions for blood donation management. Different organizations attempt to recruit blood donors and provide blood bags to patients based on their requirements. It is possible to build a community of mutually helpful peers who interact and help each other. To overcome the aforementioned obstacles, we propose a private blockchain-based blood donation system. The necessary data will be organized using a combination of a database and a distributed ledger based on the blockchain network. It will ensure the community-wide traceability of blood donors and reduce the possibility of receiving contaminated blood. In the case of anomalies, the system can audit the incident within a few seconds and retrieve the necessary pieces of evidence for investigation. In addition, the system will maintain fungible tokens for donor incentives, provided a successful blood donation occurs [8]. Using these tokens, users will be able to redeem benefits from recognized health organizations. The overall system makes use of smart contracts for error-free, transparent, and near real-time execution of logic in a decentralized fashion.
II. RELATED WORKS
The authors of the article [9] presented a framework for the distribution of blood called KanChain. They also introduced
a cryptocurrency named KanCoin that serves as an incentive for blood donors inside the system. However, neither the design nor the implementation of the proposed work has been presented. In [10], the authors tackled the blood donation supply chain (BDSC) utilizing blockchain. Using blockchain’s traceability, they track a blood bag’s information throughout its life. They capture the information from the time a donor donates blood until it is given to a receiver. Every time a function is executed in the smart contract and the information is encoded in the immutable ledger, record-keeping continues until the blood is transfused into a patient. Here, role-based smart contract access is defined to enable BDSC system traceability. However, this work did not reveal the system architecture and provided no information on the different participants inside the system. The authors of [11] proposed integrating private blockchain technology with existing blood donation management systems and the K-Nearest Neighbour (KNN) algorithm. They employ immutable blockchain ledgers to keep information on blood donations. They rely on the Inter-planetary File System (IPFS) to store large files (such as images of documents) and record their hash value in the blockchain to preserve the file’s integrity. They suggest offering donors credit points usable for medical services to encourage future blood donations. In their article, they presented two smart contracts for these two activities. However, the authors did not include architectural details. The interaction between the conventional management system, the blockchain network, and the KNN algorithm is mostly unexplored. Even though they used a private blockchain, they did not use the feature that allows multiparty cooperation to work together to manage blood donation sites and share information. Hawashin et al. [12] proposed an Ethereum-based system that tracks blood donation sequences from manufacture to consumption. Blood production refers to a donor supplying blood to a blood bank, whereas blood consumption refers to the blood bank delivering blood to a patient. The method of supplying blood relies greatly on the central blood bank administrator. This deviation from the distributed model questions the blockchain’s utility. It doesn’t grasp decentralized and distributed technologies, resulting in unequal privilege allocation. Thus, central administrators have the greatest authority to influence receivers, which may benefit a certain set of individuals. The authors of [13] and [14] employed blockchain to manage the cold blood supply chain. Both approaches explore the multiparty collaboration aspect of blood storage and address it using private blockchain technology. This enables faster and more secure communication between related organizations, as well as improved blood quality and traceability of unsafe blood transfusions. Similarly, the authors of [15] advocated private blockchain to provide organizational-level blood donation management (i.e., B2B). They do not address the interaction between the donor and receiver levels of their blood donation system. A comparative analysis of the related works and the proposed systems is given in section VII.
III. BACKGROUND STUDY: WHY BLOCKCHAIN?
A blockchain is a distributed, immutable ledger that uses cryptography to store records. Data in a blockchain is thought of as transactions between entities, and the transactions are stored inside a block (i.e., a collection of ledger entries). Cryptographic hash links are used to connect a sequence of these blocks. Tampering with the data inside a block effectively breaks these connections from the point of tampering all the way to the end of the chain. Furthermore, the chain instance is replicated across all nodes in the blockchain network. This signifies that an appropriate alteration must be made in all copies. The combination of cryptographic hash links and the consensus protocol provides the blockchain's immutability, provenance, and traceability. This is why incorporating blockchain into our system may assist in tracing down contaminated blood transfusions in a couple of seconds. In a centralized or cloud-based system, such thorough investigations are not as quick or reliable as they are in a blockchain network. Blockchain is classified into three types: public blockchain, private blockchain, and consortium blockchain. In a public blockchain, any new user can join the network by presenting proper identification; however, in a private blockchain, only a certain group of entities can join. A consortium blockchain, on the other hand, is concerned with multiparty cooperation in a group of companies or organizations. The proposed system employs a private blockchain and is implemented on the Ethereum network. Another component of the blockchain network that can facilitate automated transactions in an irreversible, traceable, and secure way is smart contracts. A smart contract is simply a piece of code that executes a predefined set of logic expressed in a predefined programming language. A certain amount of gas must be spent in order to perform mathematical operations inside a smart contract. As a result, smart contracts must be kept as simple as feasible in order to lower the cost of gas consumption.
IV. PROPOSED SYSTEM
The proposed system uses a combination of blockchain technology and a database to compensate for the shortcomings of the traditional approaches. Since the system handles sensitive health information, a private Ethereum network is used to ensure confidentiality. This also results in the elimination of the gas fees (with real ether) for executing smart contracts that apply on the public Ethereum chain. Additionally, it offers better throughput and reduced latency, making the whole blockchain integration process seamless from the user's perspective. Figure 1 depicts the architecture of the proposed BloodComm system at a high level. In this section, the architecture's components are explained. • Users: Users will be registered in the system by providing the required information and will thereby receive a wallet address using which they will be identified inside the blockchain network. Depending on the circumstances, a user may function as both a donor and a receiver of blood. When a user needs blood, they may submit a
Fig. 1: System architecture.
request by providing the appropriate details (i.e., blood group, quantity, the organization where transfusion will take place, etc.). This request will be sent to the closest matching donors through a request feed. Users who are potential donors will be able to react to the request at their convenience. Following a successful transfusion, the system maintains a provision to reward the donor with optional fungible tokens. The system will examine the donors’ and recipients’ locations and provide donors with the closest request for a matching blood group. The coordinates will not be kept on the blockchain due to its high rate of change as the users are mobile. Organizations: Blood banks, hospitals, and health facilities will be registered as organizations. These organizations facilitate blood transfusions. Organizations seldom change their location, unlike users. It will be stored on the blockchain. An organization will act as a meeting point for the two parties in a transfusion. Organization Representatives: Representatives will interact with our system while assuming the identities of their respective organizations. Following a blood transfusion, an organization records the scanned copies of the test results and their related hash in the ledger. This extra layer of verification assures that the blood was tested at a recognized medical facility and that they may be held liable in the event of a contaminated blood transfusion. Application: Front-end Distributed Applications (DApps) will be used, accompanied by an appropriate back end. The back end will communicate with the blockchain ledger through the Application Binary Interface (ABI). Database: Some data will be stored in the database to reduce the burden on the blockchain. For instance, the database will include scanned test results. Their hash values will be stored on the blockchain so that they may be compared afterward. Users of the application will be
presented with data that has been seamlessly pulled from both the database and the blockchain. • Smart Contract: The smart contract is written using the Solidity programming language. This piece of code will be compiled using the Solidity compiler, which will give out two files: Application Binary Interface(ABI) and Bytecode. The ABI is used for interfacing purposes between the application and the ledger while the bytecode is deployed in the ledger. • Ethereum Ledger: The immutable distributed ledger that stores the records and the deployed bytecodes of the smart contracts. Figure 2 depicts a sequence diagram for a better understanding of the system’s above-mentioned actors. We will consider a blood recipient who is in need of blood. The recipient would submit a blood donation request by entering the essential information into the supplied online access point. This will then be sent to the blockchain ledger through APIs and irrevocably stored there. The program will deliver these requests to other users by combining data from the ledger and the database. A donor will now respond to this request, notifying the receiver. At this stage, both parties communicate with one another and meet at the agreed-upon organization to complete the transfusion process. As a sort of reward, the receiver might give the donor a token of appreciation. From the standpoint of the blockchain, the sequence of logic is executed via smart contracts, and the outputs are immutably written in the ledger.
Fig. 2: Sequence diagram of the system.
V. DESIGN AND IMPLEMENTATION

A. Data Structure

Smart contracts make use of data structures to keep track of the entities involved. As stated previously, users provide the relevant information prior to enrolling in the system; Table I displays this information. After completing the registration process, they are assigned a unique ID and recorded on the blockchain as shown in Table II. While posting a request, the user needs to define the patient for whom the blood is needed. This provision is retained to
address the situation in which users post a request for family members or friends who are not registered users in the system.
TABLE V: Data Structure of Request Format
Prefix    Key         Value
REQUEST   Request ID  { id: request id,
                        BASIC PATIENT INFORMATION,
                        BASIC INFORMATION OF BLOOD DONATION CENTER,
                        wallet address: Requester wallet address,
                        isCompleted: 'yes'/'no'/'ongoing' }
TABLE I: Basic information about User
BASIC INFORMATION OF USER
Name: User Name
DOB: Date of Birth
Contact: Contact Number
Blood Group: 'A+'/'A-'/'B+'/'B-'/'AB+'/'AB-'/'O+'/'O-'
Gender: 'male'/'female'/'others'
TABLE II: Data Structure of User
Prefix  Key      Value
USER    User ID  { id: user id,
                   BASIC USER INFORMATION,
                   wallet address: User's wallet address }
A patient will be represented by the information shown in Table III. Similarly, the information needed for organizations is given in Table IV. The records of patients and organizations will not be stored in the ledger directly, because these two records are solely associated with a donation request posted by a user (the data structure given in Table V). The request record contains the previously mentioned entities and eventually embeds them in the blockchain, so storing them on the blockchain beforehand would be redundant.

TABLE III: Basic information about Patient
BASIC INFORMATION OF PATIENT
Name: Patient Name
Contact: Patient's Contact Number
Blood Group: 'A+'/'A-'/'B+'/'B-'/'AB+'/'AB-'/'O+'/'O-'
Gender: 'male'/'female'/'others'
Description: Information about the requirements
TABLE IV: Basic information about Blood Donation Center
BASIC INFORMATION OF BLOOD DONATION CENTER
Name: Center's Name
Contact: Center's Contact Number
Address: Center's Address
Location: Coordinates of the center
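Before moving to the smart contracts, the short Python sketch below shows one possible off-chain representation of the request record of Table V, with the patient and blood donation center information of Tables III and IV embedded in it. This is purely illustrative; the field values are placeholders and the exact encoding used by BloodComm is not specified in the paper.

request_record = {
    "id": "request-0",                       # request id assigned when posted
    "basic_patient_information": {           # Table III
        "name": "Patient Name",
        "contact": "Patient's Contact Number",
        "blood_group": "A+",
        "gender": "female",
        "description": "Information about the requirements",
    },
    "blood_donation_center": {               # Table IV
        "name": "Center's Name",
        "contact": "Center's Contact Number",
        "address": "Center's Address",
        "location": "Coordinates of the center",
    },
    "wallet_address": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",  # requester
    "isCompleted": "no",                     # 'yes' / 'no' / 'ongoing'
}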
B. Smart Contract Design

The smart contracts are encoded with defined sets of logic, and after their execution the output is stored inside the ledger. The front end of the DApp takes input from the user and, through API calls, the back end executes the deployed smart contracts. This subsection presents the pseudo-code of our smart contracts. For registering a user, the basic user information (BUI) is provided along with the user wallet address (UWA), and the user's address is registered using Algorithm 1. In the case of posting a request, the system takes the basic information of the patient (BUP), the blood donation center (BUBDC), and the requester's wallet address (RQWA) and records them
through Algorithm 2. When a user responds to a particular request, Algorithm 3 is invoked, taking as inputs the responder's wallet address (RPWA) and the request list (RL). Finally, for providing tokens to the donors, Algorithm 4 is used with both wallet addresses (RQWA and RPWA) and the list of responders (RPL).

Algorithm 1 Algorithm for User Registration
Inputs: BUI: Basic Information of User; UWA: User Wallet Address
1: if addressRegistered(UWA) == True then
2:     User already registered!
3: else
4:     registerAddress(UWA)
5:     user_id = id
6:     BASIC_USER_INFORMATION = BUI
7:     wallet_address = UWA
8: end if

Algorithm 2 Algorithm for Posting Request
Inputs: BUP: Basic Information of Patient; BUBDC: Basic Information of Blood Donation Center; RQWA: Requester's Wallet Address
1: if addressRegistered(RQWA) == True then
2:     request_id = id
3:     Request.BASIC_INFORMATION_OF_PATIENT = BUP
4:     Request.BASIC_INFORMATION_OF_BDC = BUBDC
5:     Request.wallet_address = RQWA
6: else
7:     User does not have permission
8: end if

Algorithm 3 Algorithm for Responding to Requests
Inputs: RPWA: Responder's Wallet Address; RL: Request List
1: if addressRegistered(RPWA) == True then
2:     for request in RL do
3:         if applicable(RPWA) == True then
4:             Responder[request_id] += RPWA
5:         else
6:             Not applicable for this request
7:         end if
8:     end for
9: else
10:    User does not have permission
11: end if

Algorithm 4 Algorithm for Information Sharing and Tokenization
Inputs: BURQ: Basic Information of Requester; RQWA: Requester's Wallet Address; RPWA: Responder's Wallet Address; RPL: Responders List
1: if addressRegistered(RQWA) == True then
2:     for responder in RPL[request_id] do
3:         if donorApplicable(responder) == True then
4:             sharePersonalInformation(responder)
5:             requestHasBeenCompleted(request_id)
6:             if requestCompleted(request_id) == True AND enoughBalance(RQWA) == True then
7:                 TransferToken(Amount, responder.RPWA)
8:             else
9:                 Process can not be completed
10:            end if
11:        else
12:            Donor is not applicable
13:        end if
14:    end for
15: else
16:    User does not have permission
17: end if
C. Implementation

The smart contracts for the system were implemented using the Solidity programming language (version 0.8.15) [16]. These contracts were then deployed using the Remix online compiler (version 0.26.3) [17].

VI. TESTING AND VALIDATION

This section presents the results of the tests performed on the implemented smart contracts in order to validate their
functionality. The whole process was carried out using the online Remix editor. First, we register a user in the system by providing the necessary inputs and obtain the corresponding event output shown in Listing 1. Here, from indicates the address of the deployed smart contract, topic is the cryptographic hash of the event signature emitted after successful execution of the smart contract, and args lists the inputs of the execution. After registering two users, a request was posted from the address of user 1; the event output is shown in Listing 2. From the address of user 2, a response is given to the previously created request (the event shown in Listing 3). After finishing the donation process, user 1 transferred 100 tokens to user 2 as appreciation (the event shown in Listing 4).

Listing 1: Registering User
[
  {
    "from": "0xd9145CCE52D386f254917e481eB44e9943F39138",
    "topic": "0x54db7a5cb4735e1aac1f53db512d3390390bb6637bd30ad4bf9fc98667d9b9b9",
    "event": "UserRegistered",
    "args": {
      "0": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",
      "user": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4"
    }
  }
]

Listing 2: Posting a Request
[
  {
    "from": "0xf8e81D47203A594245E36C48e151709F0C19fBe8",
    "topic": "0x2d645b9ef46a25e3a0066be653c1f52fe444bb261bde3e6a6257f8def714be15",
    "event": "PostRequest",
    "args": {
      "0": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",
      "RequestPostedBy": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4"
    }
  }
]

Listing 3: Responding to a Request
[
  {
    "from": "0xf8e81D47203A594245E36C48e151709F0C19fBe8",
    "topic": "0x586815d217679ee3e5e849d633165ee32056949ce3138c70297dc67025ad7666",
    "event": "RespondRequest",
    "args": {
      "0": "0xAb8483F64d9C6d1EcF9b849Ae677dD3315835cb2",
      "1": "0",
      "2": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",
      "Responder": "0xAb8483F64d9C6d1EcF9b849Ae677dD3315835cb2",
      "request_id": "0",
      "RequestPostedBy": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4"
    }
  }
]
Listing 4: Transfer Token
[
  {
    "from": "0x417Bf7C9dc415FEEb693B6FE313d1186C692600F",
    "topic": "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
    "event": "Transfer",
    "args": {
      "0": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",
      "1": "0xAb8483F64d9C6d1EcF9b849Ae677dD3315835cb2",
      "2": "100",
      "from": "0x5B38Da6a701c568545dCfcB03FcB875f56beddC4",
      "to": "0xAb8483F64d9C6d1EcF9b849Ae677dD3315835cb2",
      "tokens": "100"
    }
  }
]
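As a rough illustration of how a Python back end could drive the same flow outside Remix, the sketch below uses the third-party web3 library to call the deployed contract through its ABI and decode the PostRequest event of Listing 2. The node URL, contract address, ABI, and the exact Solidity function signature are assumptions; only the event name is taken from the listing above.

from web3 import Web3

NODE_URL = "http://127.0.0.1:8545"      # assumed local private Ethereum node
CONTRACT_ADDRESS = "0x..."              # filled in after deployment in Remix
CONTRACT_ABI = [...]                    # ABI produced by the Solidity compiler

w3 = Web3(Web3.HTTPProvider(NODE_URL))
contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)

# Post a request on behalf of user 1; the function name is hypothetical and
# would have to match the deployed contract.
tx_hash = contract.functions.postRequest("patient-info", "center-info").transact(
    {"from": w3.eth.accounts[0]}
)
receipt = w3.eth.wait_for_transaction_receipt(tx_hash)

# Decode the PostRequest event from the receipt (cf. Listing 2);
# in older web3 versions the method is named processReceipt.
for event in contract.events.PostRequest().process_receipt(receipt):
    print(event["args"])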
VII. DISCUSSION

In this study, we have developed a platform called BloodComm, built on blockchain technology, that makes it easier for blood donors and recipients to connect with one another on a peer-to-peer basis. Our system eliminates the hierarchical procedure traditionally used to discover blood donors. As a result, the turnaround time is reduced while traceability and provenance are ensured through the use of blockchain technology. Compared to the existing literature, our proposed approach focuses on developing a community of donors and enabling their connection without the need for a middleman. Table VI shows a comparison between BloodComm and the existing state-of-the-art approaches for blood transfusion management using blockchain. As shown in Table VI, BloodComm supports peer-to-peer contact and the formation of a community for blood donation. Furthermore, we introduced the notion of a token to patronize the donor community and ensure a better participation rate.

TABLE VI: Comparative overview of related works.
References   Nature  Type of Blockchain  Implementation       Incentives for donor  Location awareness
[9]          P2P     Public              Ethereum             Yes                   No
[10]         B2C     Private             Hyperledger Fabric   No                    No
[11]         B2C     Private             Ethereum             Yes                   Yes
[12]         B2C     Private             Hyperledger Fabric   No                    No
[13]         B2B     Private             Hyperledger Fabric   No                    No
[14]         B2B     Private             Hyperledger Fabric   No                    No
[15]         B2B     Private             --                   No                    No
Our System   P2P     Private             Ethereum             Yes                   Yes
VIII. CONCLUSION AND FUTURE WORKS

This paper proposes BloodComm, a private blockchain-based blood donation system. BloodComm will enable community-wide blood donor traceability and limit the likelihood of receiving tainted blood, while ensuring provenance and traceability. Additionally, following each successful donation, the system will reward donors with fungible tokens that will be valid in the registered health centers and hospitals. Finally, the use of smart contracts will make the system error-free and enable transparent execution in a secure, decentralized manner. In the future, we intend to develop a complete decentralized app consisting of a front end, back end, and server. As for the smart contracts, we intend to deploy them on real networks and evaluate them through comprehensive testing. Finally, we will integrate blood supply chain management into our system to make it a one-stop umbrella platform for blood donation management.

REFERENCES
[1] F. A. Sayani and J. L. Kwiatkowski, "Increasing prevalence of thalassemia in America: Implications for primary care," Annals of Medicine, vol. 47, no. 7, pp. 592–604, 2015. PMID: 26541064.
[2] M. S. H. Jiisun, R. A. Rupa, M. H. Chowdhury, H. Mushrofa, and M. R. Hoque, "Blood Donation Systems in Bangladesh: Problems and Remedy," International Journal of Business and Management, vol. 14, p. 145, July 2021.
[3] American Red Cross, "US Blood Supply Facts." Accessed: Sep. 15, 2022.
[4] E. M. d. Oliveira and I. A. Reis, "What are the perspectives for blood donations and blood component transfusion worldwide? A systematic review of time series studies," Sao Paulo Medical Journal, vol. 138, pp. 54–59, 2020.
[5] American Cancer Society, "Cancer facts & figures 2022." Accessed: Sep. 15, 2022.
[6] The Leukemia & Lymphoma Society, "Lymphoma Survival Rate | Blood Cancer Survival Rates." Accessed: Sep. 15, 2022.
[7] B. H. Shaz, "Chapter 66 - Transfusion transmitted diseases," in Transfusion Medicine and Hemostasis (C. D. Hillyer, B. H. Shaz, J. C. Zimring, and T. C. Abshire, eds.), pp. 361–371, San Diego: Academic Press, 2009.
[8] A. Lisi, A. De Salve, P. Mori, L. Ricci, and S. Fabrizi, "Rewarding reviews with tokens: An Ethereum-based approach," Future Generation Computer Systems, vol. 120, pp. 36–54, 2021.
[9] M. Çağlıyangil, S. Erdem, and G. Özdağoğlu, A Blockchain Based Framework for Blood Distribution. Cham: Springer International Publishing, 2020.
[10] S. Sadri, A. Shahzad, and K. Zhang, "Blockchain traceability in healthcare: Blood donation supply chain," in 2021 23rd International Conference on Advanced Communication Technology (ICACT), pp. 119–126, 2021.
[11] Y. Luo, G. Lu, and Y. Wu, "Design and analysis of blood donation model based on blockchain and KNN," in 2021 3rd Blockchain and Internet of Things Conference, BIOTC 2021, (New York, NY, USA), pp. 32–37, Association for Computing Machinery, 2021.
[12] D. Hawashin, D. A. J. Mahboobeh, K. Salah, R. Jayaraman, I. Yaqoob, M. Debe, and S. Ellahham, "Blockchain-based management of blood donation," IEEE Access, vol. 9, pp. 163016–163032, 2021.
[13] H. T. Le, T. T. L. Nguyen, T. A. Nguyen, X. S. Ha, and N. Duong-Trung, "Bloodchain: A blood donation network managed by blockchain technologies," Network, vol. 2, no. 1, pp. 21–35, 2022.
[14] S. Kim, J. Kim, and D. Kim, "Implementation of a blood cold chain system using blockchain technology," Applied Sciences, vol. 10, no. 9, 2020.
[15] S. Lakshminarayanan, P. N. Kumar, and N. M. Dhanya, "Implementation of blockchain-based blood donation framework," in Computational Intelligence in Data Science (A. Chandrabose, U. Furbach, A. Ghosh, and A. Kumar M., eds.), (Cham), pp. 276–290, Springer International Publishing, 2020.
[16] Solidity, "Solidity - Solidity 0.8.15 documentation." Accessed: Sep. 15, 2022.
[17] Remix, "Remix IDE & Community." Accessed: Sep. 15, 2022.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
A Comparative Study on the Effectiveness of IDM and Card Sorting Method for Autism Specialized School Website
Maria Afnan Pushpo 1, Zareen Tasneem 2, Sheikh Tasfia 3, Anusha Aziz 4, Muhammad Nazrul Islam 5
Department of Computer Science and Engineering Military Institute of Science and Technology Dhaka, Bangladesh [email protected] , [email protected] , [email protected] , [email protected] , [email protected]
Abstract—With the advent of new technologies, software and web applications have become an integral part of modern life. These applications are designed and developed for different themes and purposes. An autism-related web application, being one of the most demanding kinds of application today, requires effective, efficient, and satisfactory services together with portability across all platforms. The success of a product significantly depends on its usability: a clear, concise, and user-interactive design interface is the key to a successful product or application. In the literature reviewed, no particular design technique has so far been recommended for implementing a multi-channel autism application. The objective of our study is to create two websites on the same topic using Card Sorting and the Interactive Dialogue Model (IDM), and to perform a comparative study on several parameters to find out which approach is better suited for building such a website. We have developed an autism application using card sorting and IDM, and an evaluation was carried out with a number of participants. As an outcome, we propose one design technique to produce an effective and easy-to-use web application for autistic children.

Keywords—Autism specialized school, Multi-channel application, IDM, Card Sorting, Usability, Effectiveness.
I. INTRODUCTION

In the era of modern technological advancement, usability has been a key factor for a wide range of software-oriented applications such as web applications, mobile applications, and other software [1]. Web designers are constantly coming up with versatile applications while exploring new challenges every day. Usability is defined as the degree to which an application can be used by the specified end users to achieve targeted objectives with three parameters - effectiveness, efficiency, and satisfaction - which ensures that a piece of software is fit for achieving a specific goal in a specified context of use [14]. Effectiveness refers to the accuracy and completeness with which the software performs its intended functions, efficiency refers to the optimal use of resources, and satisfaction refers to trust and positive attitudes towards the use of the software [2] [3]. In most cases, applications are designed for different channels. Multichannel web applications provide the same content with
a similar interactive user experience across different devices and technologies [4]. The use of multichannel web applications in fields like e-governance, e-health, e-commerce, and e-learning has grown considerably over the last few decades. However, the design of multi-channel applications for autism aid and the schooling of children with special needs is in high demand nowadays. There are a considerable number of autistic children, mostly known as special children, around us, and their education and growth need to be taken care of in a special way [16]. Developing a web application where people no longer have to manually visit institutions and buy necessary things can make life easier for the parents. Besides, such websites can also help them acquire a brief overview before visiting any institution in person. There are a limited number of applications exclusively designed for children with special needs, and these applications may face challenges in achieving a high degree of quality attributes, e.g., user satisfaction, ease-of-use, dependability, reliability, learnability, effectiveness, and efficiency [15]. To ensure proper usability of such information-intensive web applications, it is a must to develop the website using a suitable design technique.

IDM, or the Interactive Dialogue Model, focuses on human-computer interaction and illustrates how the content and navigation will be structured based on the conversation between humans and the particular multi-channel application [4]. On the other hand, card sorting is used to create a number of clusters to organize the contents of a website [5]. To evaluate the usability of the design techniques, one of the most feasible approaches is to conduct an experiment based on various usability factors to assess which design technique is suitable for which type of web application. Therefore, the aim of this paper is to evaluate the usability standard of a website designed using the two most widely used design techniques - IDM and card sorting - through an experiment. Two websites were designed and developed using these two design approaches for people with special needs, providing information about the institution, the services and facilities provided, the admission process, the education program, and rules and
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
regulations, as well as the e-commerce side of the institution. The rest of the paper is organized as follows. Section II provides a brief overview of related work on web application usability and the IDM and card sorting design techniques. The study methodology is described in Section III. Next, Section IV and Section V illustrate the system design and development, respectively. Then, Section VI describes the participant profile, study procedure, data analysis, and results of the conducted experiment. Finally, Section VII discusses the study outcomes, limitations, and future work, and gives a brief concluding remark.

II. RELATED WORKS

The issue at hand has come to light through a number of studies and development works by the research community. The authors of [4] discussed the methodology of the Interactive Dialogue Model (IDM), a dialogue-based design technique. They emphasize that IDM is a very lightweight and cost-effective design technique for developing multichannel applications, and that it not only provides a better user experience but is also easy for developers to implement in multichannel settings. In [6], the authors used the card sorting design technique, a well-known method for knowledge elicitation, to classify requirements changes during the development phase of a software product; their work shows that card sorting played an effective role in exploring the requirements change problem. Paper [7] describes the use of sorting techniques and gives guidelines for choosing a particular sorting technique. In [8], the authors focus on design and discuss how deeply design is related to Human-Computer Interaction, showing the relation between Design-oriented Research and Research-oriented Design. Testing the usability of a website without an HCI lab - by using a proxy server that augments the actual website with additional JavaScript code to monitor the user's mouse movements, keyboard input, and other fine details of user activity without interrupting the user's experience - is demonstrated in [9]. The authors use a case scenario to demonstrate the method and show that the procedure is cost-effective and that a lot of data can be collected with it; however, their approach does not support eye tracking of the user, which is vital for testing the usability of a website. Paper [10] presents standard questionnaires for evaluating a system, especially a website; the authors compared their own set of questionnaires against sets such as the System Usability Scale, the Questionnaire for User Interface Satisfaction, the Computer System Usability Questionnaire, and Words. In sum, research on different design techniques is available to enable better interaction between humans and computers.
III. STUDY METHODOLOGY

A sequential, comparative study has been carried out to attain the objective of the paper. Two design techniques, Card Sorting and the Interactive Dialogue Model (IDM), were selected for designing an informative web application on a compassionate theme. After development of the web applications, an experimental analysis was conducted to evaluate which design technique leads to a more convenient website.

A. Topic Selection

The theme of an autism specialized school portal has been chosen as the experiment topic. The aim of this website is not only to provide details regarding the institution but also to sell relevant products both offline and online. This theme has been selected because special children require special attention, and the educational environment of such schools is a major concern for parents. Autism specialized schools are not very common in Bangladesh, and it is not always possible for parents to search for and visit an autism specialized school, or to buy products for special children, on short notice. Hence, an autism school web application developed with high relevance can help the guardians of special children by providing the required information and supporting their shopping needs.

B. Choosing the Design Technique

After choosing the theme, two popular and widely implemented design techniques were selected to develop the website: IDM and Card Sorting.
• Interactive Dialogue Model (IDM): A dialogue-based human-computer interactive design technique consisting of three parts - C-IDM (Conceptual IDM), L-IDM (Logical IDM) and P-IDM (Page IDM). One group of authors used the IDM technique to design the autism specialized school website.
• Card Sorting: A design technique that helps to design a website by having a group of people gather the contents or topics and categorize them. The technique is carried out using small paper cards or virtual cards in software. The authors who did not perform the IDM technique used the card sorting technique.
C. Web Application Development

At this stage, the website was developed with each of the selected design techniques separately. The group of authors who designed the website using the IDM technique developed that website accordingly, and the authors who used IDM did not participate in developing the website that uses the card sorting design. Both websites, built with IDM and card sorting respectively, use HTML, CSS, and JavaScript for the front end and PHP with a MySQL database for the back end.
D. Conducting an Experimental Analysis

In order to get a clear view of the usability standard of the websites, an experiment was carried out involving a number of participants. After their biographical data was collected, the participants took part in pre- and post-questionnaire surveys. The participants carried out a set of tasks on the websites, and several evaluation parameters were measured. All of these contributions led to a decision on which design technique yields a more usable, interactive website.

IV. WEB INTERFACE DESIGN

Two different prototypes of an autism specialized school website named 'Promitee' were designed using the IDM and Card Sorting techniques independently. For each prototype website we used the same set of stakeholders and features. The detailed design methodology for developing the application is discussed in the subsequent sections.

A. Web Interface Design Using IDM

The IDM design technique has three sequential phases - C-IDM (conceptual IDM), L-IDM (logical IDM) and P-IDM (page IDM) - which are discussed briefly in the context of our website as follows.

1) C-IDM (Conceptual IDM): C-IDM is the first phase and is independent of any type of channel. Conceptual IDM consists of design parameters such as topics, relevant relations, groups of topics and parametric groups of topics [4]. There are two types of topics: a single topic, which has only one instance in the application, and a kind of topic, which has multiple instances in the application. In the autism specialized school website, the single topics are "Admission", "Contact", "Donation", "FAQ", "Career" and "Notice Board". For example, "Contact" is a single instance containing some contact information. The kinds of topics in this website are "About", "Students", "Faculty", "Facilities", "Services", "Shoppers or Viewers", "Education Program", "Account", "Cart", "Wishlist" and "Shop Products". For example, "Services" is a kind of topic, as it has multiple instances containing information on what types of services the institution provides for the special children. A relevant relation is established for a kind of topic, stipulating the change of conversation from one kind of topic to another. For example, someone may want to know about the facilities provided by the institution, so a relation between "Students" and "Facilities" should be established. A group of topics determines a specific list of topics, and a parametric group of topics determines a collection of groups of topics. For example, in this website "Facilities" is a kind of topic, and it can be searched by the user according to the type of facility, such as Lab, Library, Play zone or Meal arrangement; these can be considered as a group of topics. Again, under the "Education Program" topic, "Early Childhood Development Program", "School of Primary Education", "Adult Leisure & Learning Program" and "Vocational School" form a group of topics under the parametric group of topics "Comprehensive Education Program". Other groups of
topics and parametric groups of topics are shown in the attached C-IDM.

2) L-IDM (Logical IDM): L-IDM is the second phase, which depends on the channel type (website, desktop, mobile app, etc.) and consists of design parameters such as i) dialogue acts, ii) transition acts, iii) introductory acts and iv) multiple introductory acts [4]. The content of a topic is divided into a number of units, each known as a dialogue act. In our website, the dialogue acts for the topic Donation are Introduction, Donation Rules and Donation Form. A relevant relation in C-IDM is specified as a transition act with a cardinality in L-IDM. A transition act mainly indicates a list of instances when the topic is multiple, and the navigation of this list can also be structured using different patterns of transitional strategy [12]. In this website, one faculty member can conduct multiple educational programs, and one educational program can be carried out by more than one faculty member; hence, a 1:n cardinality is established between "Faculty" and "Education Program". A group of topics of C-IDM is introduced as an introductory act in L-IDM. Each group of topics is linked to a list of possible instances when the topic is multiple, and the navigation of the list can also be structured using different patterns of subject strategy [12]. In this website, the introductory acts of the topic Facilities are indicated as "Lab", "Library", "Play zone" and "Meal Arrangement"; before admitting a child, facilities can be searched according to someone's specific need. A parametric group of topics of C-IDM is introduced as a multiple introductory act in L-IDM. For example, "Medical Assessment Care", "Audiology Assessment", "Neurological Assessment", "Psychological Assessment" and "Diet Management" are introductory acts under the multiple introductory act "Mediation Clinic" of the topic "Services". The entire schema of L-IDM is demonstrated in the attached L-IDM with the necessary symbols.

3) P-IDM (Page IDM): Page IDM is the final phase, which involves designing the prototypes for each design parameter discussed before. IDM page design (P-IDM) refers to defining the elements with which the users will interact in a single dialogue act. With respect to the L-IDM schema, in order to sustain the dialogue, designers now have to model the actual pages containing the necessary elements. There are a number of ground rules to be followed for transitioning from L-IDM (channel design) to P-IDM (page design):
• Each dialogue act is to be converted to a page type.
• Each introductory act is to be converted to a page type.
• Each transition act is to be converted to a page type.
• Relevant topics are to be converted to landmarks, i.e., links present in (almost) any page. Landmarks are usually either single topics or important groups of topics that are always accessible.
• Relevant groups of topics are to be converted to landmarks.
Different page types can be easily derived from dialogue acts, introductory acts, and transition acts. We have a set of specific guidelines for page derivation, illustrated in Table I.
Fig. 1. P-IDM of Product Details Page.
These guidelines tell the designer which elements to consider when creating a page. Visual communication designers can then make layout and graphic decisions on the basis of this input to create mockup prototypes or the final rendered page.

B. Web Interface Design Using Card Sorting

Card sorting is a technique for assisting in the design and evaluation of information architecture [13]. It is a UX research approach in which users classify subjects into categories, and it may be used to construct an information architecture (IA) that meets the needs of the users. Overall, card sorting can be defined as a method in which users are asked to organize data into logical groupings. Card sorting helps to develop the main information architecture, workflows, menu structures, and website navigation pathways. In this technique, users are given labeled notecards and are asked to sort them into groups based on criteria that make sense to them. This method reveals the structure of the users' domain knowledge, which is how an information architecture that meets their requirements is developed. In our work, card sorting was done in four steps: Step 1 is to choose a set of topics, Step 2 is to omit duplicate topics, Step 3 is to organize the topics into groups, and Step 4 is to name the groups.

V. WEB APPLICATION DEVELOPMENT
Fig. 2. P-IDM of Special Education Page.
TABLE I: Page elements description for a dialogue act [4]
Page element               Description
Content                    The actual content of the dialogue act (e.g., text, graphics, voice, video, audio, or any combination of these)
Structural links (if any)  To pages of the other dialogue acts of the same topic
Transition links (if any)  To pages of the related topic (1:1) or to pages of transition acts (1:n)
Group of topic links       Next-previous (in case of guided tour) or to pages of introductory acts / the introductory act I came from
Orientation info (if any)  Messages communicating "where I am"
Landmarks                  To relevant sections of the site (pages of single topics) or a group of topics
Two websites have been developed on the concept of an autism specialized school using the two design techniques, IDM and card sorting. The name given to the autism specialized school is Promitee. As already mentioned, HTML, CSS, and JavaScript are used for the front end, and PHP is used to make the websites dynamic. We used the Apache server and MySQL database of the XAMPP platform to test the websites and deployed them to a local server when conducting the experiment. Sample views of the websites are provided here; all the information shown is dummy data. Figure 3 displays the shop products page of the website developed using IDM, whereas Figure 4 shows the products page of the website developed using the card sorting design technique.

VI. EXPERIMENT DESIGN

In order to test our belief that "a website developed using the IDM design technique can be better in relevancy, usability and navigation than one developed using the Card Sorting design technique", an experiment was conducted among ten participants to evaluate the usability of the websites. Each website is considered an independent variable of this experiment. Dependent variables such as task completion time, success/failure, number of clicks, number of attempts and number of requests for help while performing the tasks are considered. The following subsections briefly discuss the profile of the participants, how the experiment was conducted and the results of the experiment.
Fig. 3. Promitee Shop Products page using IDM.
Fig. 5. Task completion time.
Fig. 4. Promitee product page using card sorting.
A. Participant Portfolio

Ten participants agreed to take part, of whom six were male and four were female; one female participant is married. The age range of the participants was from 18 to 23 years. The participants included an electrical engineer, a computer science and engineering graduate, a doctor, a lawyer, a mechanical engineer, a businessman and a housewife, and each of them was acquainted with one or more special children. Each participant was assured that the websites, and not the participants, were being judged, so that they could perform the tasks without hesitation.

B. Study Procedure

The study was conducted online from 09 a.m. to 12 p.m. At the beginning, both websites were demonstrated to the participants. Then, the participant information sheet, pre-questionnaires and task sheet were provided to them. First, they filled out their information sheet and pre-questionnaires, and then they were requested to take part in the tasks. The four tasks to be carried out on each website were: finding the admission form for special children; finding details of the educational amenities expected for special children; finding details of the therapeutic support expected to be provided by the institution; and finding the desired products regarding communication and placing an order. While the tasks were being performed, data regarding the dependent variables were collected. Lastly, post-questionnaires were completed by the participants.

C. Data Analysis and Result

After collecting the data, statistical analysis was performed and average values were calculated for task completion time, number of attempts, number of requests for help and number of clicks required to complete a task. From the answers to the post-questionnaires it can be concluded that website 1 allowed most of the tasks to be performed easily in less time and is preferred over website 2, as visible in Figure 5. It is also clearly observed from Figure 6 that website 1 is preferred over website 2 in all aspects.
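As an illustration of the kind of analysis performed, the short Python sketch below averages the dependent variables per website with pandas; the numbers are placeholders rather than the study's actual measurements.

import pandas as pd

# Each row is one (website, task) observation; all values are made up.
records = pd.DataFrame({
    "website":  ["IDM", "IDM", "Card Sorting", "Card Sorting"],
    "task":     [1, 2, 1, 2],
    "time_sec": [42, 55, 61, 70],   # task completion time
    "clicks":   [5, 7, 8, 9],
    "help":     [0, 0, 1, 1],       # number of times help was requested
})

summary = records.groupby("website")[["time_sec", "clicks", "help"]].mean()
print(summary)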
Fig. 6. Graph representation of preferable websites for each task.
VII. DISCUSSION AND CONCLUSION

The study outcomes from the experimental analysis show that the IDM design technique outperforms card sorting in terms of user preference and satisfaction. Participants felt at ease using the website developed with the IDM design technique to perform specific tasks because of its easy, interactive navigation and user-friendly interface design. Though IDM may take more time to design a website than card sorting, from the designers' point of view the sequential steps of IDM - C-IDM, L-IDM and P-IDM - result in comparatively better content organization, navigation and a more consistent interface design with the required functionality than the card sorting design technique. Card sorting mainly focuses on grouping the different contents of a website and indicates the page design of a homepage, but the participants complained that it lacks a navigational strategy. IDM overcomes these bottlenecks of card sorting, coming up with a high usability
standard website. Therefore, it is quite evident that, both as a designer and as a user, the IDM design technique is comparatively better than the card sorting design technique for developing a website like this.

The study has been performed on the theme of an autism specialized school website. Although it appears that IDM is a better design technique than card sorting, some limitations exist. The number of participants would need to be increased to be confident in the outcomes, and only a few parameters could be measured. Since it is an autism specialized school website, guardians of special children should ideally have been involved in the study, but most of the participants did not fit that profile. This study gives an overview, based on experimental analysis, of whether IDM or card sorting is the better technique for developing a website with a high usability standard. The future work of this study would be to conduct the experiment with more of the desired participants and to judge through more parameters. Finally, this study can play a great role in helping the HCI and Multimedia Tools research community and web developers decide on a suitable design technique while developing a web application with a high usability standard.
REFERENCES
[1] T. Zaki, Z. Sultana, S. M. A. Rahman and Md. N. Islam, "Exploring and Comparing the Performance of Design Methods Used for Information Intensive Websites," MIST International Journal of Science and Technology (MIJST), vol. 08, June 2020, pp. 49–60, doi: 10.47981/j.mijst.08.
[2] Md. N. Islam and H. Bouwman, "An assessment of a semiotic framework for evaluating user-intuitive Web interface signs," Universal Access in the Information Society, vol. 14, no. 4, pp. 563–582, Nov 2015.
[3] W. Quesenbery, "Usability Standards: Connecting Practice around the World," in Proceedings of the International Professional Communication Conference, Ireland, 2005, pp. 451–457.
[4] D. Bolchini and P. Paolini, "Interactive dialogue model: A design technique for multichannel applications," IEEE Transactions on Multimedia, vol. 8, no. 3, pp. 529–541, 2006.
[5] International Organization for Standardization, "Guidance on Usability Standards," ISO 9241-11, 1998. [Online]. Available: https://www.iso.org/standard/16883.html. Last accessed 11 Sep 2022.
[6] N. Nurmuliani, D. Zowghi, and S. P. Williams, "Using card sorting technique to classify requirements change," in Requirements Engineering Conference, 2004. Proceedings. 12th IEEE International. IEEE, 2004, pp. 240–248.
[7] G. Rugg and P. McGeorge, "The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts," Expert Systems, vol. 14, no. 2, pp. 80–93, 1997.
[8] D. Fallman, "Design-oriented human-computer interaction," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2003, pp. 225–232.
[9] R. Atterer, M. Wnuk, and A. Schmidt, "Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction," in Proceedings of the 15th International Conference on World Wide Web. ACM, 2006, pp. 203–212.
[10] T. S. Tullis and J. N. Stetson, "A comparison of questionnaires for assessing website usability," in Usability Professionals Association Conference, 2004, pp. 1–12.
[11] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, "Electron spectroscopy studies on magneto-optical media and plastic substrate interface," IEEE Transl. J. Magn. Japan, vol. 2, pp. 740–741, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[12] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.
[13] Usability.gov, "Card sorting." [Online]. Available: https://www.usability.gov/how-to-and-tools/methods/card-sorting.html. Last accessed 11 Sep 2022.
[14] UsabilityNet, "What is usability?" [Online]. Available: https://www.interaction-design.org/literature/topics/usability. Last accessed 11 Sep 2022.
[15] M. N. Islam, S. J. Oishwee, S. Z. Mayem, A. S. M. Nur Mokarrom, M. A. Razzak and A. B. M. H. Kabir, "Developing a multi-channel military application using Interactive Dialogue Model (IDM)," in 2017 3rd International Conference on Electrical Information and Communication Technology (EICT), 2017, pp. 1–6, doi: 10.1109/EICT.2017.8275230.
[16] Nidirect Government Services, "Children with special educational needs." Available: https://www.nidirect.gov.uk/articles/children-special-educational-needs. Last accessed 11 Sep 2022.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Screening Pathological Abnormalities in Gastrointestinal Images Using Deep Ensemble Transfer Learning
Md. Farukuzzaman Faruk∗, Md. Rabiul Islam†, Emrana Kabir Hashi‡
Department of Computer Science & Engineering, Rajshahi University of Engineering & Technology, Rajshahi-6204, Bangladesh
Emails: ∗[email protected], †[email protected], ‡[email protected]

Abstract—Globally, gastrointestinal cancers, including colorectal and esophageal cancers, cause a substantial number of deaths every year. Early detection of these disorders requires an accurate, rapid, and automated diagnosis approach. Almost all prior deep learning-based research on gastrointestinal duct analysis was restricted to polyp identification. However, esophagitis, ulcerative colitis, and numerous other pathological findings of the gastrointestinal organs must be analyzed together with polyps. This study focused on the detection of pathological findings in gastrointestinal images. A deep ensemble transfer learning approach was introduced to screen pathological findings. Initially, the eight EfficientNet members were trained separately on the pathological findings data. The ensemble networks were then constructed by incorporating a number of modified softmax averaging formulae. The efficiency and efficacy of the ensemble over the individual networks were then verified quantitatively. The proposed ensemble network correctly predicted samples that were misclassified by the individual networks. Consequently, it achieves 96.40% accuracy, 96.60% precision, and 96.40% recall for pathological findings. The proposed work surpassed nearly all recent comparable research in terms of accuracy and efficiency.

Index Terms—Gastrointestinal-images, Pathological, Polyps, Compound-scaling, Transfer-learning, Ensemble-networks.
I. INTRODUCTION

Our digestive system is plagued by various abnormalities, the most serious of which are the several forms of cancer. For instance, three out of eight frequent cancers emerge in the gastrointestinal (GI) tract [1]. Digestive organ cancers, also referred to as GI tract cancers, account for a large proportion of all cancer deaths worldwide each year. According to WHO estimates, around 2.8 million individuals are diagnosed every year with such cancers, including stomach, colorectal, and esophageal cancers, and more than 65% of them die from the disease [1]. Colorectal cancers (CRC) are the most lethal and ubiquitous, responsible for around 10% of all newly diagnosed cancers worldwide [2]. A colon polyp is a small cluster of cells that forms on the surface of the colon, and it may develop into cancer over time [3]. These polyps can be removed before they become cancerous if spotted and diagnosed at a preliminary phase. Problems arise due to the limitations of the GI examination process (colonoscopy and gastroscopy), the lack of expert physicians, and the manual interpretation of the GI data. Aside from CRC, the second most common and deadly types of aberration in the GI tract are esophageal and stomach cancers.
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
Gastroscopy and colonoscopy are the two most commonly used testing tool-kits for GI tract investigation [4], [5]. Gastroscopy examines the upper GI tract, including the esophagus, stomach, and first section of the small bowel, whereas colonoscopy examines the large bowel (colon) [4], [5]. In recent years, almost all of the research on GI tract analysis has concentrated on the straightforward prediction of binary polyp identification. However, in addition to these polyp identification techniques, the detection of other pathological findings, such as esophagitis and ulcerative colitis, is critical in screening various disorders across our entire digestive system. Pathological findings include colon polyps, esophagitis, and ulcerative colitis. Deep learning, particularly CNNs, has revolutionized the area of medical image processing in recent decades. Transfer learning, in turn, is a method of leveraging pre-trained CNNs, with or without a few adjustments, that requires fewer resources and may be utilized for domain-specific and cross-domain applications [6]. This research proposes a framework based on deep ensemble transfer learning to classify the major GI tract abnormalities comprising the three pathological findings. Transfer learning models were constructed using the EfficientNet members [7]. Finally, the individual outcomes were ensembled to obtain the final classification results. The suggested deep ensemble transfer learning methodology outperformed nearly all current competing studies on the GI classification process. The findings from this study may be immensely helpful for physicians seeking to assess digestive organs accurately.

II. LITERATURE REVIEW

Several deep learning-based investigations on GI tract datasets have been conducted in recent years. Pogorelov et al. [8] proposed a deep learning-based framework for colon polyp identification, localization, and segmentation. Six distinct open-access datasets were employed in their study: 'Kvasir' [4], four 'CVC' datasets [9] and 'Nerthus' [10]. They primarily concentrated on frame-wise detection and pixel-level segmentation using generative adversarial networks (GANs) [11]. For the histological categorization of colorectal polyps, Korbar et al. [12] developed an automated method using a variety of deep learning approaches with whole-slide
Fig. 1. Pathological findings samples from Kvasir dataset
scans. A real-time screening framework was developed by G. Urban et al. [13] to localize and categorize polyps with several CNN models. Wang et al. [14] suggested enhanced deep learning frameworks for autonomously diagnosing colorectal polyps by applying global average pooling (GAP) to the original VGGNet [15] and ResNet [16] architectures to make the models lighter. Shin et al. [17] presented a comparative analysis of hand-crafted feature-based classification against a deep learning-based approach, adopting hand-crafted features like HOG and hue histograms in colonoscopy images. Yuan et al. [18] presented a deep learning-based automated method for the detection of polyps in colonoscopy video frames; their model produced an accuracy of 0.9147 and a sensitivity of 0.9176 with AlexNet [19] on the ASU-Mayo [20] dataset. Aksenov et al. [21] presented a deep CNN-based ensemble framework to successfully screen GI findings in endoscopy video frames.

III. MATERIALS AND METHODS

A. Dataset

This study used Kvasir [4], a multi-class image dataset for the identification of GI tract disorders. Pogorelov et al. [4] produced this open-access database encompassing 4000 images of the gastrointestinal tract. In Kvasir, several types of findings from the gastrointestinal tract are included in addition to the polyps: anatomical landmarks (Z line, cecum, and pylorus), three pathological findings (polyps, ulcerative colitis and esophagitis), and two sets of gastroscopic findings (dyed and lifted polyps and dyed resection margins). Each image is between 720 × 576 and 1920 × 1072 pixels in resolution. In this study, we utilized only the pathological GI findings data, which includes a total of 1500 samples. Fig. 1 depicts samples of pathological findings from the Kvasir dataset.

B. Deep learning architecture

This paper proposes a deep ensemble transfer learning framework for characterizing prominent GI tract anomalies in the pathological findings data. Initially, transfer learning models were developed using the eight EfficientNet versions (B0-B7), which were fine-tuned individually. Secondly, the outcomes of all eight versions (B0-B7) were merged to create an ensemble model. In summary, the research focused on screening pathological findings using an ensemble transfer learning architecture.

1) Original EfficientNet architecture: EfficientNet is a deep CNN architecture based on the observation of a novel model scaling approach [7]. In their research paper, the authors
Fig. 2. A standard compound scaling architecture
proposed a uniform and adaptive scaling strategy to scale up all three dimensions - depth (d), width (w), and resolution - of a baseline architecture [7]. The efficacy of the scaling process is mainly determined by the architecture of the baseline model to which the scaling is applied. The uniform scaling technique was then applied to the baseline design by carefully balancing the network's depth, width, and resolution. EfficientNet-B0 is the baseline, and all additional scaled models are referred to as B1, B2, ..., B7. The typical architecture of a compound scaling model is depicted in Fig. 2.

2) Proposed transfer learning architecture: In this study, the EfficientNet family (B0-B7) was used to develop a deep ensemble transfer learning framework for screening GI pathological findings. The eight EfficientNet versions (B0-B7) were initially employed as transfer learning models, and they were fine-tuned independently before deploying the deep ensemble architecture. The original EfficientNet versions (B0-B7) are trained on the ImageNet [22] dataset. Our transfer learning approach utilizes the EfficientNet versions with pre-trained ImageNet weights, except for the original top layers. The design of the suggested transfer learning system is illustrated in Fig. 3. It is divided into two parts: the first is a pre-trained, frozen (non-trainable) part compatible with all EfficientNet versions (B0-B7), and the second is our custom trainable classification section. All layers in part 1 are frozen to retain their pre-trained weights and parameters. For instance, if the pre-trained part is built using EfficientNet-B0, it takes images of size 224 × 224 × 3, as illustrated in the exemplary proposed architecture in Fig. 3. Part 1 (EfficientNet-B0) functions as a feature extractor, extracting the most distinguishable features from the GI images using repeated convolutional and mobile inverted bottleneck convolutional operations. In the last operation of part 1, the original GI images are transformed into a feature map of shape 7 × 7 × 1280 (in the case of EfficientNet-B0 as part 1). Part 2 (the customized classification head) is attached to part 1: the output of part 1 is first flattened to create a 1D vector, followed by a stack of four dense layers with 512, 256, 128, and 64 units, respectively, each with ReLU activation and a predefined dropout. Finally, the softmax probabilities are generated using a fully connected layer with softmax activation.
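The paper does not publish its implementation, but the two-part design described above can be sketched with TensorFlow/Keras roughly as follows; the dropout rate and other unstated hyperparameters are assumptions.

import tensorflow as tf

# Part 1: pre-trained EfficientNet-B0 feature extractor, frozen (non-trainable).
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)
base.trainable = False

# Part 2: customized classification head (Flatten + 4 dense layers + softmax).
x = tf.keras.layers.Flatten()(base.output)          # 7 x 7 x 1280 -> 1D vector
for units in (512, 256, 128, 64):
    x = tf.keras.layers.Dense(units, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)              # dropout rate not stated in the paper
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # 3 pathological classes

model = tf.keras.Model(inputs=base.input, outputs=outputs)
model.summary()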
Fig. 3. EfficientNet-based transfer learning architecture (an example with the B0 version)
3) Ensemble of transfer learning: In this study, the suggested ensemble model was developed using the softmax averaging approach. Fig. 4 depicts the overall process of the deep transfer learning-based ensemble architecture. The softmax converts raw neuron outputs into relative class probabilities. Equation (1) calculates the softmax value for an individual neuron output $Z_i$, where $K$ is the total number of classes:

$$\sigma_{Patho}(Z_i) = \frac{e^{Z_i}}{\sum_{j=1}^{K} e^{Z_j}} \quad (1)$$

Then, for a specific data sample $i$, a softmax vector of shape $1 \times 3$ over the three ($K = 3$) classes is generated by (2), where $\sigma_{Patho}(Z_{i1})$ is the softmax value of class 1 for sample $i$:

$$\sigma_{Patho}(S_i) = \begin{bmatrix} \sigma_{Patho}(Z_{i1}) & \sigma_{Patho}(Z_{i2}) & \sigma_{Patho}(Z_{i3}) \end{bmatrix}_{1 \times 3} \quad (2)$$

Furthermore, the softmax outputs of all data samples for a specific EfficientNet version are collected by (3). A total of $S_P \times 3$ softmax values are produced by (3), where $S_P$ is the total number of samples and 3 is the number of pathological-finding classes; $\sigma_{Patho}(Z_{S_P 1})$ is the softmax value of class 1 for sample $S_P$, and so on:

$$\sigma_{Patho}(EfficientNetB_i) = \begin{bmatrix} \sigma_{Patho}(S_1) \\ \sigma_{Patho}(S_2) \\ \vdots \\ \sigma_{Patho}(S_{S_P}) \end{bmatrix}_{S_P \times 1} = \begin{bmatrix} \sigma_{Patho}(Z_{11}) & \sigma_{Patho}(Z_{12}) & \sigma_{Patho}(Z_{13}) \\ \sigma_{Patho}(Z_{21}) & \sigma_{Patho}(Z_{22}) & \sigma_{Patho}(Z_{23}) \\ \vdots & \vdots & \vdots \\ \sigma_{Patho}(Z_{S_P 1}) & \sigma_{Patho}(Z_{S_P 2}) & \sigma_{Patho}(Z_{S_P 3}) \end{bmatrix}_{S_P \times 3} \quad (3)$$

Eventually, the generated softmax outputs from all EfficientNet versions (B0-B7) for all data samples are ensembled by (4), where $N$ is the total number of EfficientNet versions, i.e., $N = 8$ for B0 through B7:

$$\sigma_{Ensemble}(Patho) = \frac{\sum_{i=0}^{N-1} \sigma_{Patho}(EfficientNetB_i)}{N} \quad (4)$$

4) Data preparation: Only the pathological findings data from the Kvasir [4] dataset was investigated in this study. The resolution of the images ranges from 720 × 576 to 1920 × 1072 pixels. All images were resized to a predetermined size compatible with the corresponding EfficientNet version before being fed into the transfer learning network. For example, if the EfficientNet-B0 version was used as part 1 (the pre-trained part) of the transfer learning model, the images were resized to 224 × 224 × 3; the image size for EfficientNet-B1 was 240 × 240 × 3, and so on. The pixel values were then standardized between 0 and 1. Finally, an on-the-fly domain-agnostic data augmentation technique was used.
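Equation (4) amounts to a simple element-wise average of the per-model probability matrices; a minimal NumPy sketch (not the authors' code) is given below, with random probabilities standing in for the real model outputs.

import numpy as np

def ensemble_softmax(softmax_outputs):
    # softmax_outputs: list of N arrays, one per EfficientNet version (B0-B7),
    # each of shape (S_P, 3) as in Eq. (3).  Returns the averaged (S_P, 3)
    # probability matrix of Eq. (4).
    stacked = np.stack(softmax_outputs, axis=0)   # shape (N, S_P, 3)
    return stacked.mean(axis=0)

# Hypothetical usage with 8 models and 225 test samples:
rng = np.random.default_rng(0)
fake_outputs = [rng.dirichlet(np.ones(3), size=225) for _ in range(8)]
ensemble_probs = ensemble_softmax(fake_outputs)   # shape (225, 3)
predictions = ensemble_probs.argmax(axis=1)       # final class index per sample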
C. Training process

The model was run on Colab Pro with an NVIDIA Tesla P100-PCIE-16GB GPU and CUDA version 11.2; the total available RAM was 27 GB and there was a total of 166 GB of disk space. The batch size was fixed to 16 for all training runs. The categorical cross-entropy loss function was utilized for the multi-class dataset. Stochastic gradient descent (SGD) was used as the optimizer with a momentum value of 0.90, and the initial learning rate (Lr) was set to 0.001. The learning rate was reduced by a factor of 0.01 if the monitored value (validation accuracy) had not improved in three successive epochs. There were a total of 100 epochs. The dataset was divided into train, test, and validation subsets based on ratios of 0.68, 0.15, and 0.17 of the whole dataset, respectively.
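The exact framework calls are not given in the paper; assuming a Keras model such as the one sketched earlier, the stated settings map onto common Keras APIs roughly as follows.

import tensorflow as tf

def compile_for_training(model: tf.keras.Model) -> tf.keras.callbacks.Callback:
    # SGD with momentum 0.90 and initial learning rate 0.001, categorical
    # cross-entropy loss, as stated in Section III-C.
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.90),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    # Reduce the learning rate by a factor of 0.01 when validation accuracy
    # has not improved for three successive epochs.
    return tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_accuracy", factor=0.01, patience=3
    )

# Hypothetical usage (train_ds / val_ds would be batched tf.data pipelines
# of resized, normalized images with batch size 16):
# reduce_lr = compile_for_training(model)
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[reduce_lr])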
Fig. 4. Deep ensemble architecture
TABLE I: Classification report from the proposed ensemble transfer learning architecture
                    precision  recall  f1-score  support
esophagitis         0.97       1.00    0.99      75
polyps              0.93       0.99    0.95      75
ulcerative-colitis  1.00       0.91    0.95      75
accuracy                               0.964     225
overall             0.966      0.964   0.965     225
Fig. 5. Training and validation loss for Top-1 classifier
Fig. 6. Training and validation accuracy for Top-1 classifier
IV. RESULTS

In this study, an ensemble transfer learning framework was proposed to successfully screen the pathological findings in gastrointestinal images. In this section, the major individual outcomes from each of the EfficientNet members are discussed, followed by the ensemble outputs and final prediction results, along with a number of comparative analyses. Fig. 5 and Fig. 6 depict the training loss versus validation loss and the training accuracy versus validation accuracy of the Top-1 classifier (EfficientNet-B0) for the pathological findings data. The EfficientNet-B0 version produced the best individual performance in this scenario even though it contains the fewest scaling parameters. The individual outcomes from each of the EfficientNet versions were then ensembled by averaging the softmax probabilities. Table I shows the classification report of the proposed ensemble transfer learning for the pathological findings data, which include three classes: esophagitis, polyps and ulcerative-colitis. In this case, we are most concerned about detecting the polyps class for the early diagnosis of colorectal cancers and other related abnormalities. The ensemble network produces a precision of 93.0%, recall of 99.0% and f1-score of 95.0% for the polyps class. Furthermore, it yielded 96.40%, 96.60%, 96.40% and 96.50% overall accuracy, precision, recall and f1-score, respectively. Fig. 7 shows the prediction results for three random patho-
logical findings samples by the ensemble transfer learning. Fig. 8 demonstrates the confusion matrix generated by the proposed model. It is evident that the proposed ensemble transfer learning is more sensitive to the esophagitis and polyps classes than to the ulcerative-colitis class. There are only eight misclassified pathological samples in Fig. 8. These evaluation metrics, along with the reduced number of misclassified samples, demonstrate the beneficial effect of deep ensemble transfer learning on GI images. Table II shows a comparative analysis among the EfficientNet versions and the proposed ensemble transfer learning model, and demonstrates the effectiveness of ensemble transfer learning over the individual EfficientNet models for pathological findings. Table III compares some recent competing CNN models with the proposed model; all of the mentioned CNN models were run in the same environment and settings with the same dataset, and the proposed ensemble architecture surpassed most of the models. Table IV presents a comparative overview of recent studies on deep learning-based GI image processing, including polyp detection, against the proposed deep ensemble transfer learning framework. According to Table IV, most works use standard datasets, including Kvasir, CVC-ColonDB and ASU-Mayo.
Fig. 7. Three random predictions with the proposed ensemble transfer learning
TABLE III. COMPARISON WITH SOME RECENT CNN MODELS

Models          | Accuracy | Precision | Recall | F1-score
----------------|----------|-----------|--------|---------
AlexNet [19]    | 0.882    | 0.881     | 0.876  | 0.878
VGG-16 [15]     | 0.903    | 0.914     | 0.914  | 0.914
VGG-19 [15]     | 0.908    | 0.912     | 0.912  | 0.912
Resnet-50 [16]  | 0.923    | 0.914     | 0.911  | 0.912
Resnet-101 [16] | 0.933    | 0.935     | 0.938  | 0.936
Proposed Model  | 0.964    | 0.966     | 0.964  | 0.965
Fig. 8. Confusion matrix for the proposed ensemble transfer learning
V. CONCLUSIONS AND OBSERVATIONS
Cancers of the gastrointestinal (GI) organs account for a significant number of deaths worldwide each year. Most prior work on GI duct analysis was restricted to either polyp detection or binary classification. This study employed a deep ensemble transfer learning architecture to screen pathological findings more methodically, automatically, and time-efficiently. First, members of the EfficientNet family were utilized independently as transfer learning backbones. While larger scaled-up networks are supposed to perform better, they cannot
TABLE II. COMPARISON WITH TOP-2 EFFICIENTNET MODELS AND THE PROPOSED ENSEMBLE TRANSFER LEARNING

Architecture            | Accuracy | Precision | Recall | F1-score
------------------------|----------|-----------|--------|---------
EfficientNet-B0 (Top-1) | 0.951    | 0.955     | 0.951  | 0.951
EfficientNet-B7 (Top-2) | 0.938    | 0.946     | 0.938  | 0.937
Proposed Model          | 0.964    | 0.966     | 0.964  | 0.965
do so consistently. However, the larger scaled-up networks converged more stably and handled overfitting and underfitting challenges more effectively. The individual outcomes were then ensembled to create deep ensemble models by incorporating several modified mathematical formulations of the softmax. Numerous quantitative analyses were presented regarding the ensemble networks' success rate and their efficacy in diagnosing GI findings appropriately. The investigation indicated that the ensemble network outperformed all individual networks in detecting each class of pathological findings. It was also observed that several samples that individual EfficientNets classified incorrectly were predicted correctly by the ensemble network. The following observations are listed for pathological findings:
• The top two performers are EfficientNet-B0 and EfficientNet-B7.
• All of the esophagitis class samples are correctly predicted by all of the members.
• The 'ulcerative-colitis' class has the most misclassified samples.
• Out of 75 test samples, the best performer (EfficientNet-B0) incorrectly categorized nine cases of 'ulcerative-colitis' as 'polyps' and one case of 'polyps' as 'ulcerative-colitis'.
Furthermore, the study found that the proposed model surpassed almost all recent deep learning-based GI duct analyses. These observations may assist physicians in screening
TABLE IV. A COMPARATIVE ANALYSIS OF DIFFERENT WORKS WITH THE PROPOSED ENSEMBLE TRANSFER LEARNING. ACC: ACCURACY; PREC: PRECISION; REC: RECALL; SPEC: SPECIFICITY.

Authors              | Datasets                              | Models used                                          | Major classification findings (only best results)
---------------------|---------------------------------------|------------------------------------------------------|---------------------------------------------------
Pogorelov et al. [8] | Kvasir [4]; CVC [9]; and Nerthus [10] | Xception; VGG-19; Resnet and GAN                     | Acc: 0.909; Spec: 0.940
Korbar et al. [12]   | Collection                            | Resnet                                               | Acc: 0.930; Prec: 0.897; Rec: 0.883; F1-score: 0.888
Shin et al. [17]     | ASU-Mayo [20]; CVC-ClinicDB           | CNN and SVM                                          | Acc: 0.9176; Prec: 0.9271; Rec: 0.9082; Spec: 0.9176
Yuan et al. [18]     | ASU-Mayo [20]                         | AlexNet                                              | Acc: 0.9147; Rec: 0.9176
Proposed Model       | Kvasir [4]                            | Ensemble transfer learning with EfficientNet family | Acc: 0.964; Prec: 0.966; Rec: 0.964; F1-score: 0.965
the GI organs more accurately. We intend to concentrate on segmenting and localizing the region of abnormalities in those findings in the future. R EFERENCES [1] W. H. Organization et al., “Estimated cancer incidence, mortality, and prevalence worldwide in 2012,” International agency for research on cancer, 2012. [2] GCO, “International Agency for Research on Cancer.” 2021, ”https://gco.iarc.fr/today/data/factsheets/cancers/10 8 9-Colorectumfact-sheet.pdf [accessed August 10, 2021]. [3] “Colorectal Cancer,” 2021, ”https://www.cancer.org/cancer/colon-rectalcancer/about/what-is-colorectal-cancer.html [accessed August 15, 2021]. [4] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt et al., “Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection,” in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 164–169. [5] D. A. Lieberman, D. K. Rex, S. J. Winawer, F. M. Giardiello, D. A. Johnson, and T. R. Levin, “Guidelines for colonoscopy surveillance after screening and polypectomy: a consensus update by the us multi-society task force on colorectal cancer,” Gastroenterology, vol. 143, no. 3, pp. 844–857, 2012. [6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” arXiv preprint arXiv:1411.1792, 2014. [7] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114. [8] K. Pogorelov, O. Ostroukhova, M. Jeppsson, H. Espeland, C. Griwodz, T. de Lange, D. Johansen, M. Riegler, and P. Halvorsen, “Deep learning and hand-crafted feature based approaches for polyp detection in medical videos,” in 2018 IEEE 31st International Symposium on ComputerBased Medical Systems (CBMS). IEEE, 2018, pp. 381–386. [9] J. Bernal and H. Aymeric, “Miccai endoscopic vision challenge polyp detection and segmentation,” Web-page of the 2017 Endoscopic Vision Challenge, 2017. [10] K. Pogorelov, K. R. Randel, T. de Lange, S. L. Eskeland, C. Griwodz, D. Johansen, C. Spampinato, M. Taschwer, M. Lux, P. T. Schmidt et al., “Nerthus: A bowel preparation quality video dataset,” in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 170–174. [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014. [12] B. Korbar, A. M. Olofson, A. P. Miraflor, C. M. Nicka, M. A. Suriawinata, L. Torresani, A. A. Suriawinata, and S. Hassanpour, “Deep learning for classification of colorectal polyps on whole-slide images,” Journal of pathology informatics, vol. 8, 2017.
[13] G. Urban, P. Tripathi, T. Alkayali, M. Mittal, F. Jalali, W. Karnes, and P. Baldi, “Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy,” Gastroenterology, vol. 155, no. 4, pp. 1069–1078, 2018. [14] W. Wang, J. Tian, C. Zhang, Y. Luo, X. Wang, and J. Li, “An improved deep learning approach and its applications on colonic polyp images detection,” BMC Medical Imaging, vol. 20, no. 1, pp. 1–14, 2020. [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [17] Y. Shin and I. Balasingham, “Comparison of hand-craft feature based svm and cnn based deep learning framework for automatic polyp classification,” in 2017 39th annual international conference of the IEEE engineering in medicine and biology society (EMBC). IEEE, 2017, pp. 3277–3280. [18] Z. Yuan, M. IzadyYazdanabadi, D. Mokkapati, R. Panvalkar, J. Y. Shin, N. Tajbakhsh, S. Gurudu, and J. Liang, “Automatic polyp detection in colonoscopy videos,” in Medical Imaging 2017: Image Processing, vol. 10133. International Society for Optics and Photonics, 2017, p. 101332K. [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012. [20] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE transactions on medical imaging, vol. 35, no. 2, pp. 630–644, 2015. [21] S. Aksenov, K. Kostin, A. Ivanova, J. Liang, and A. Zamyatin, “An ensemble of convolutional neural networks for the use in video endoscopy,” Sovremennye Tehnologii v Medicine, vol. 10, pp. 7–17, 06 2018. [22] ImageNet, “ImageNet,” 2020, ”http://www.image-net.org/ [accessed January 19, 2020].
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Interpretable Garment Workers’ Productivity Prediction in Bangladesh Using Machine Learning Algorithms and Explainable AI Hasibul Hasan Sabuj1 , Nigar Sultana Nuha2 , Paul Richie Gomes3 , Aiman Lameesa4 and Md Ashraful Alam5 1,2,3,4,5
Department of Computer Science and Engineering, 1,2,3,5 BRAC University, 66 Mohakhali, Dhaka 1212, Bangladesh, 4 Asian Institute of Technology, Thailand. Email: 1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected] and 5 [email protected]
Abstract—Bangladesh's garment industry is widely recognized and plays a significant role in the current global market. The nation's per capita income and citizens' living standards have risen significantly owing to the noteworthy hard work performed by the employees in this industry. The garment sector is more efficient when the target production can be achieved without difficulty. However, a frequent issue within this industry is that the actual productivity of the workers often does not reach the previously determined target productivity. The business suffers a significant loss whenever this productivity gap appears. This work seeks to address the issue by predicting the actual productivity of the workers. To attain this goal, a machine learning approach is proposed for employee productivity prediction after experimentation with five machine learning models. The proposed approach displays a reassuring level of prediction accuracy, with a minimal MAE (Mean Absolute Error) of 0.072, which is lower than the 0.086 MAE of the existing Deep Learning model. This indicates that applying this process can play a vital role in setting an accurate production target, which might lead to more profit and production in the sector. In addition, this work uses an explainable AI technique named SHAP to interpret the model and reveal further information within it. Index Terms—Productivity prediction, Regression Problem, Machine Learning Algorithms, Random Forest, Explainable AI
I. INTRODUCTION
The garment industry emerged in the 1980s and has come to contribute to many countries' economies. The garment industry has been a driver of the country's expansion and dominates most other industries in the current era of globalization. It accounts for 80% of Bangladesh's total export-oriented earnings, and the country is the second largest exporter of garments in the world [4]. According to the data published by the EPB (Export Promotion Bureau) in [5], over the last seven years the annual revenue of Bangladesh's garment industry has increased from 19 billion dollars to 34 billion dollars, i.e., a 79% escalation. Fabric inspection machines, CAD (Computer Aided Design), fabric quality reporting machines, etc. are helping a lot in terms of production. Yet, since the production procedure for apparel industries, as described in [3], generally consists of designing, sample selection or confirmation, sourcing materials and merchandising, lay-planning, market-planning, spreading and cutting, sewing, washing, finishing and lastly packaging, despite the help
from technology, a large amount of human resource is required for the production process in the industry, because work in this industry is highly labor-intensive and involves many manual processes.
A. Motivation
The production and delivery procedures require a large amount of human participation. Different employees have diverse working aptitudes and paces at work, so garment industries often do not meet the targeted productivity assigned by the authorities to reach the production goals in the given time. The companies consequently face a tremendous loss in production. To solve this problem, or to achieve the targeted outcome, systematic monitoring and analysis of the production level of employees and keeping a record of their overall working performance are indispensable. Since there are several individuals working in different departments or teams, it becomes a hassle to keep track of these things manually, as revealed in [3]. The model used here is a regression model that assists in generating the percentage of productivity. With the application of this method, the problems faced in production can be minimized and the profits of the industries maximized.
B. Contributions
In this paper, we bring forward a model to directly generate a percentage of productivity of the employees. The key contributions are:
• We suggest a regression approach using Machine Learning for productivity prediction, which performs better than the existing Deep Learning approach.
• We provide a rigorous analysis of five Machine Learning algorithms, such as the SVM algorithm, Decision Tree algorithm and Random Forest algorithm, for this regression approach.
• We interpret the black-box model using an explainable AI method called SHAP.
II. RELATED WORKS
To solve the common problem in garment industries, which is actual productivity prediction of the employees,
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
[1] proposed a DNN (Deep Neural Network) model. The experiment resulted in a promising prediction, showing a minimal Mean Absolute Error of 0.086, which is less than the baseline performance error of 0.15. The MSE performance is about 0.018, combining the training and validation performance. The data preprocessed and prepared for the experiment was collected from a renowned company that manufactures garments in Bangladesh. The model used is a deep learning model whose hyperparameters were optimized and tuned through manual search and practical judgement. Even though the learning procedure is consistent, that model cannot reduce the error rate any further, whereas with the machine learning model that we propose, we can lower the error rate further to 0.072.
In the recent world, data mining and machine learning are in great demand for measuring important aspects. Many people working in the garment industry are female, and [2] is an approach to analyze the working performance of women based on their previous activities using machine learning algorithms. The algorithms used are the Decision Tree classifier, Logistic Regression, Random Forest classifier and Stochastic Gradient Descent. The best result found after these experiments is from the logistic regression model, which is 69%. The dataset used in that procedure was very small, only 512 instances, and was collected by directly asking questions to a specified group of people. The classification models might have achieved high accuracy on the dataset that was experimented on, but the approach presented in this paper differs: we propose not a classification model but a regression model, experimenting on a much larger dataset for a more precise result and higher accuracy.
For solving the productivity prediction problem of garment employees, data mining techniques can also be explored. State-of-the-art data mining can be applied to industrial data analysis, meaningful insight revelation and predicting the productivity performance of garment employees according to [3]. That work applies 8 data mining procedures with 6 evaluation metrics. The datasets used are of two kinds, namely 3CD (3-class dataset) and 2CD (2-class dataset). The techniques were all individual and were compared strictly to figure out the best of them. The class imbalance problem between the two datasets was solved by applying an oversampling technique called SMOTE on the training data; this oversampling technique was more effective on the 2CD classification. The highest accuracy without the oversampling procedure was 83.89%, while oversampling produced 86.39%. The AUC score turned out to be 0.90 in the case of that approach. On the other hand, the machine learning approach we propose is a regression model on the same dataset and performs better in comparison, bringing out more accuracy by reducing the error rate.
III. METHODOLOGY
At the very beginning, the dataset, which includes 1197 data instances, is collected. The collected data was then converted to a CSV file and preprocessed. The preprocessing step included handling missing values, feature selection, outlier detection and removal, feature encoding, feature scaling and then data partitioning, as in the workflow diagram below. 823 instances were kept for training the model and the remaining 353 instances were used to test the machine learning model.
Fig. 1. Top level overview of the proposed model
The machine learning algorithms used in this experiment are five in number: the Support Vector Machine Algorithm, Decision Tree Algorithm, Random Forest Algorithm, Gradient Boost Algorithm and XG-boost Algorithm. The MAE, MSE, RMSE and MAPE values were then determined, saved and summarized for further evaluation. After evaluation and analysis, as stated in Fig. 1, we interpreted the model by using an explainable AI method named SHAP.
2) Department: It is an attribute that states the department associated with the instance.
3) Team no: Numbered team that is associated with the instance.
4) No of workers: Number of workers that a specific team holds.
5) No of style change: Number of changes made in the style of a particular product.
6) Targeted productivity: The target to fulfill for each team, which is set by the authorities.
7) SMV: Standard Minute Value (amount of time fixed for a certain task).
8) WIP: Work in Progress (quantity of unfinished items in a garment).
9) Over time: Minutes of overtime done by a team.
10) Incentive: Financial incentive enabling a particular task (in BDT).
11) Idle time: Time that was idle or unproductive due to several kinds of production interruptions.
12) Idle men: Workers who were idle during the production interruption period.
13) Actual productivity: Productivity value of the workers on the scale of the expected productivity, ranging from 0.0 to 1.0.
2) Feature Selection: It is a process of examining the attributes and deciding whether each feature is needed or not. It is an effective procedure for solving a problem by eliminating unnecessary and redundant information so that the computation time is reduced and the learning accuracy is improved, while also facilitating a better understanding of the model or data, in agreement with [7]. Our data contains an attribute named 'date' which carries temporal information that is irrelevant to our experimentation. So, in order to make our process effective, we dropped this attribute.
3) Outlier Detection and Removal: Applying outlier detection and removal on training datasets is stated to be a key component of machine learning pipelines, according to [8], in order to improve the precision of a model.
B. Data preprocessing
Datasets are very important in experimentation and play a great role in research, as many scientific research activities use scientific datasets at least for exploratory findings [6]. Given their significance, it is necessary that they are processed flawlessly, which is essential to obtaining a favourable result when solving a real-world problem with a machine learning program. In a real-world context, data are often incomplete, noisy, imbalanced or partially missing. Thus, before being fed into a machine learning model, the data need preprocessing to make them suit the model and to minimize the flaws by balancing them beforehand in a precise manner. After obtaining the raw data, we first formed a semi-structured dataset and saved it as a CSV file for further processing and manipulation. The files were converted manually so that the quality of the data would be preserved, and they were re-checked right after to ensure there were no mistakes. The next phase, according to the form and structure of the data, required the following tasks: handling missing values, selection of features, outlier detection and removal, feature encoding, feature scaling and partitioning. We give a short briefing on these tasks below:
1) Handling Missing values: The attribute WIP (work in progress) had missing values. According to our observation, the value of work in progress at the finishing department is supposed to be zero, so we replaced the missing values with '0'.
Fig. 2. Before removing outliers
Outlier detection in the data that we are processing is done after plotting the data in a box plot. We detected outliers in the 'incentive', 'wip' and 'over-time' columns in the graph obtained in Fig. 2. Then comes the removal part. To do so, we firstly
Fig. 3. After removing outliers
find the first and third quartiles from the box plot, then determine the interquartile range (IQR). We compute the lower and upper limits of the data from the IQR and then remove the values falling outside these limits, thereby removing all the outliers.
Page 238
Thus, we plot the data again in Fig. 3 to show that all the outliers were detected and removed.
4) Feature encoding: It is a process in which categorical variables are converted into numerals. Since machine learning models learn from numeric data only, all of our data must be numeric as well. That is why we must encode the categorical data from our dataset, such as 'department', 'Team no', 'month', 'quarter' and 'year', into numeric representations. Here, all the given categorical data are transformed into numbers by fixing a value for each category. We applied a label encoder from scikit-learn [9], which replaces the strings with numerals from zero to (number of classes - 1).
5) Feature Scaling: It is a scaling procedure that plays a huge role in data standardization. It scales the data within a fixed range so that smaller standard deviations dampen the effect of outliers. It is also termed standardization, or mean removal and variance scaling, and it is very important for many machine learning estimators [10]. We used the Standard Scaler from scikit-learn as our feature scaler, which computes the standard score of a sample x by the following formula:

$$z = \frac{x - u}{s} \quad (1)$$
6) Train-Test Split: In our approach, we used a train-then-test procedure. As we are training a machine learning model, we need to train it with a large amount of data for it to perform well. From the total of 1176 remaining instances, we used 70% for training and the remaining 30% for testing. That is, there are 823 instances for training and 353 instances for testing the model.
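A minimal scikit-learn sketch of this preprocessing pipeline (missing-value handling, IQR-based outlier removal, label encoding, the 70/30 split and standard scaling) is shown below; the file name and the column names ('wip', 'over_time', 'incentive', 'date', 'actual_productivity') are assumptions based on the dataset description and may differ from the actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("garments_worker_productivity.csv")  # hypothetical file name

# 1) handle missing values: WIP is taken as zero for the finishing department
df["wip"] = df["wip"].fillna(0)

# 2) drop the temporal 'date' attribute
df = df.drop(columns=["date"])

# 3) IQR-based outlier removal on the columns where outliers were detected
for col in ["incentive", "wip", "over_time"]:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]

# 4) label-encode every remaining categorical column
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# 5) 70/30 train-test split followed by standard scaling
X = df.drop(columns=["actual_productivity"])
y = df["actual_productivity"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```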
C. Model Specification
In our research we have implemented 5 Machine Learning algorithms for experimentation. A brief insight into each of these algorithms is given below.
1) Support Vector Machine Algorithm: The Support Vector Machine is a traditional Machine Learning technique that can still be used to address large-scale data categorization challenges as well as multi-domain operations in a big-data environment. It is a small-sample learning method that outperforms previous methods in many ways since it is based on the idea of structural risk minimization instead of the traditional empirical theory [11].
2) Decision Tree Algorithm: Decision tree classifiers are believed to be among the most well-known techniques for classifier representation. A decision tree is a tree-based strategy where each path leading from the root is characterized by a data-separating sequence until a Boolean result is reached at the
leaf node. It is a hierarchical representation of knowledge relationships containing nodes and connections.
3) Random Forest Algorithm: The random forest algorithm is a meta-learner machine learning algorithm [12]. In order to decide on an overall classification for a given set of inputs, the random forest uses numerous random tree classifications. The approach's significant capacity for learning, its resilience, and the feasibility of the hypothesis space account for its benefits [13]. Assuming the finally clipped linear decision tree has k leaf nodes, and the regression function of the kth leaf node is $h_k(x)$, the prediction function of the complete linear decision tree model is:

$$f(x) = \sum_{k=1}^{k} I(x \in X_k) \cdot h_k(x) \quad (2)$$
4) Gradient Boost Algorithm: It is a regression strategy that resembles boosting [14]. GBMs are a class of sophisticated machine-learning systems that have demonstrated significant effectiveness in a variety of practical applications [15]. The learning mechanism in gradient boosting machines, or simply GBMs, fits new models sequentially to offer a more accurate estimate of the response variable. The basic idea behind this technique is to build new base-learners that are maximally correlated with the negative gradient of the loss function associated with the entire ensemble [16]. In this context, a group of weak learners (basic algorithms) is combined to create a strong learner that can tackle a certain problem. It is calculated using the following formula [14]:

$$F_m(x) = F_{m-1}(x) + \rho_m h_m(x) \quad (3)$$
5) XG-boost Algorithm: XGBoost is a scalable tree boosting machine learning system. The system is freely available as open-source software, and the package is lightweight and reusable. It is accessible in popular languages like Python, R, and Julia, and it interacts well with language-native data science pipelines such as scikit-learn. XGBoost's portability allows it to be used in a variety of ecosystems rather than being limited to a single platform [16]. The Extreme Gradient Boosting (XGBoost) method is also utilized for classification purposes [18]. In particular, XGBoost focuses on minimizing the computational complexity of determining the best split, as this is the most time-consuming element of decision tree construction algorithms [14]. Moreover, its objective is determined in the following manner:

$$L_{xgb} = \sum_{i=1}^{N} L(y_i, F(x_i)) + \sum_{m=1}^{M} \Omega(h_m) \quad (4)$$
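A sketch of how these five regressors could be instantiated with scikit-learn and the separate XGBoost package is shown below; the hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor  # requires the separate xgboost package

# the five regression models compared in this study
models = {
    "Support Vector Machine": SVR(kernel="rbf"),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boost": GradientBoostingRegressor(random_state=42),
    "XG-boost": XGBRegressor(random_state=42),
}
```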
D. Evaluation Metrics
The proposed model is evaluated using 4 metrics from [19], namely MSE (mean squared error), RMSE (root mean square error), MAPE (mean absolute percentage error) and MAE (mean absolute error).
1) MSE: The mean squared error computes the square of the difference between the estimate and the target for each point and averages the results. The worse the model is, the higher this value is expected to be [1]. The formula for MSE is:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \quad (5)$$

2) RMSE: It is a metric used to assess the accuracy of any machine learning algorithm used in a regression problem, and [20] determines it by:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \quad (6)$$

3) MAPE: The absolute error is divided by the target value for each instance to produce the relative error. The outcomes of applying MAPE are easy to interpret, as it is easy to approximate what the predictions will look like once the average range of the prediction is known [1]. The equation for this error is:

$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \quad (7)$$

4) MAE: The average of the absolute discrepancies between the target values and the predictions is used to determine MAE in [1]. Since it is a linear score, the individual differences are weighted equally, so this metric does not disproportionately penalize large errors. The equation for determining it is:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \quad (8)$$
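Equations (5)-(8) translate directly into the following NumPy helpers (a minimal sketch; `y_true` and `y_pred` stand for the target values and model predictions):

```python
import numpy as np

def mse(y_true, y_pred):   # Eq. (5)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):  # Eq. (6)
    return np.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):  # Eq. (7), expressed as a percentage
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def mae(y_true, y_pred):   # Eq. (8)
    return np.mean(np.abs(y_true - y_pred))
```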
Table 1: Summarized values of Evaluation Metrics

Model Names                      | MAE   | MSE   | RMSE  | MAPE
---------------------------------|-------|-------|-------|------
Support Vector Machine Algorithm | 0.084 | 0.014 | 0.121 | 0.144
Decision Tree Algorithm          | 0.083 | 0.022 | 0.148 | 0.137
Random Forest Algorithm          | 0.072 | 0.013 | 0.116 | 0.120
Gradient Boost Algorithm         | 0.075 | 0.012 | 0.112 | 0.126
XG-boost Algorithm               | 0.078 | 0.015 | 0.123 | 0.126
IV. RESULT AND ANALYSIS
In the analysis of the results obtained from the machine learning algorithms, we utilize evaluation metrics and an explainable AI method. The values representing the results are recorded up to three decimal places in Table 1, where we state the summarized values of the metrics obtained from the machine learning algorithms. We found the highest MSE of 0.022 for the Decision Tree algorithm and the lowest of 0.012 for the Gradient Boost algorithm. The highest RMSE was for the Decision Tree algorithm, which is 0.148, and the lowest for the Gradient Boost algorithm, i.e., 0.112. A MAPE value of 0.144 was the highest and 0.120 the lowest, found for the SVM and the Random Forest algorithm respectively. Lastly, the MAE value was highest for the SVM algorithm, 0.084, while the lowest, 0.072, was obtained from the Random Forest algorithm.
Additionally, the Random Forest algorithm did better in the experiment than all the other algorithms. The test performance of this model yielded a value of 0.013 for MSE, 0.116 for RMSE, 0.120 for MAPE and 0.072 for MAE. In the case of regression, we know that the lower the mean squared error (MSE), the better the model. From the evaluation metrics we obtained different results, but the main evaluation metric we consider is MAE, because it is a linear score in which the different values are weighted equally on average. The MAE value for the Random Forest algorithm is the lowest, so it can be considered the best algorithm here, in accordance with our findings. A recent existing study addressing the same problem concluded with a solution based on a deep learning approach [1]. The evaluation metrics used there were MSE, MAE and MAPE, and the experiment was conducted on the same dataset. We compared that approach with ours using similar evaluation metrics.

Table 2: Comparison with existing state-of-the-art method
Method names                  | MAE   | MSE   | MAPE
------------------------------|-------|-------|------
Existing Deep Learning Method | 0.086 | 0.018 | 0.159
Our Machine Learning Method   | 0.072 | 0.013 | 0.120
According to the data in Table 2, taken from [1], the MSE, MAE and MAPE values for our approach are lower than those of the deep learning method. Thus, we can state that our model compares favourably to it while working on the same dataset.
V. INTERPRETING THE MODEL
A Machine Learning model, when trained, generates the output directly from the input, so we have no clue about the steps in between. Since these steps are unknown and unseen, the model can be regarded as a black box that covers everything up and does not let anyone witness what happens inside it. To overcome this difficulty, an interpretation method is used to explain what occurs inside, namely the explainable AI method SHAP. Plotting the outcome after using
SHAP, we obtain Fig. 4.
Fig. 4. Outcome after using explainable AI - SHAP
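A plot of this kind can be produced with the shap library roughly as follows. This is only a sketch: the random stand-in data and the feature names are hypothetical, and a fitted Random Forest regressor is assumed as the tree-based model being explained.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# stand-in data with hypothetical feature names from the dataset description
X = pd.DataFrame(np.random.rand(200, 4),
                 columns=["targeted_productivity", "smv", "over_time", "incentive"])
y = np.random.rand(200)

model = RandomForestRegressor(random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# summary plot: features are ranked by impact; a wider spread of SHAP values
# indicates a larger influence on the predicted productivity
shap.summary_plot(shap_values, X)
```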
Here we can see the impact of the features on our model output: the more spread out the data are in this figure, the more impactful they are in our approach.
VI. CONCLUSION
We aimed to propose a Machine Learning model for generating a percentage of productivity performance of the employees in garment industries. The dataset used has been collected from a well-known garment production corporation in Bangladesh. The data was preprocessed using several sequential procedures, namely handling missing values, feature selection, outlier detection and removal, feature encoding, feature scaling and data partitioning. For our approach we brought together some prominent machine learning models from well-known research documentation. The machine learning models used are five in number, while the performance was evaluated based on 4 metrics called MSE, MAPE, RMSE and MAE. For determining the training and testing results we used the loss metric MAE, the Mean Absolute Error. During training, the machine learning models performed well and learned along consistent curves with very little divergence. The testing performance of the model was also proficient in terms of all 4 metrics that we used. The performance of our approach is reliable; it learns and predicts the performance of the employees with a promising accuracy.
[3] Imran, A. A., Rahim, M. S., Ahmed, T. (2021). Mining the productivity data of the garment industry. International Journal of Business Intelligence and Data Mining, 19(3), 319-342. [4] Mirdha, R. U. (August 11, 2017). Exporters hardly grab orders diverted from China. The Daily Star, Retrieved from https://www.thedailystar.net/business/exporters-hardly-grabordersdiverted-china-1446907, Accessed: 4 January, 2019 [5] Islam, M. S., Rakib, M. A., Adnan, A. T. M. (2016). ReadyMade Garments Sector of Bangladesh: Its Contribution and Challenges towards Development. Stud, 5(2). [6] Dekker, R. (n.d.). The importance of having data-sets. Purdue e-Pubs. Retrieved September 11, 2022, from https://docs.lib.purdue.edu/iatul/2006/papers/16/ [7] Jie Cai, Jiawei Luo, Shulin Wang, Sheng Yang, Feature selection in machine learning: A new perspective, Neurocomputing, Volume 300, 2018, Pages 70-79, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2017.11.077. [8] Weizhi Li, Weirong Mo, Xu Zhang, John J. Squiers, Yang Lu, Eric W. Sellke, Wensheng Fan, J. Michael DiMaio, and Jeffrey E. Thatcher ”Outlier detection and removal improves accuracy of machine learning approach to multispectral burn diagnostic imaging,” Journal of Biomedical Optics 20(12), 121305 (25 August 2015). https://doi.org/10.1117/1.JBO.20.12.121305 [9] Bisong, E. (2019). Introduction to Scikit-learn. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Apress, Berkeley, CA. [10] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. https://jmlr.csail.mit.edu/papers/volume12/pedregosa11a/pedregosa11a.pdf [11] Zhang, Y. (2012). Support Vector Machine Classification Algorithm and Its Application. In: Liu, C., Wang, L., Yang, A. (eds) Information Computing and Applications. ICICA 2012. Communications in Computer and Information Science, vol 308. Springer, Berlin, Heidelberg. [12] Livingston, F. (2005). Implementation of Breiman’s random forest machine learning algorithm. ECE591Q Machine Learning Journal Paper, 1-13. [13] Ao, Y., Li, H., Zhu, L., Ali, S., Yang, Z., The linear random forest algorithm and its advantages in machine learning assisted logging regression modeling, Journal of Petroleum Science and Engineering (2018), doi: https://doi.org/10.1016/j.petrol.2018.11.067. [14] Bent´ejac, C., Cs¨org˝o, A., Mart´ınez-Mu˜noz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54(3), 1937-1967. [15] Natekin, A., Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in neurorobotics, 7, 21. [16] Bikmukhametov, T., J¨aschke, J. (2019). Oil production monitoring using gradient boosting machine learning algorithm. Ifac-Papersonline, 52(1), 514-519. [17] Chen, T., Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794). [18] Torlay, L., Perrone-Bertolotti, M., Thomas, E., Baciu, M. (2017). Machine learning–XGBoost analysis of language networks to classify patients with epilepsy. Brain informatics, 4(3), 159-169. [19] Saigal S. and Mehrotra D. (2012) Performance Comparison of Time Series Data Using Predictive Data Mining Techniques. Advances in Information Mining, ISSN: 0975-3265 E-ISSN: 0975-9093, Volume 4, Issue 1, pp.-57-66. [20] Wang, W., Lu, Y. (2018, March). Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. 
In IOP conference series: materials science and engineering (Vol. 324, No. 1, p. 012049). IOP Publishing.
R EFERENCES [1] Al Imran, A., Amin, M. N., Rifat, M. R. I., Mehreen, S. (2019, April). Deep neural network approach for predicting the productivity of garment employees. In 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT) (pp. 1402-1407). IEEE. [2] Keya, M. S., Emon, M. U., Akter, H., Imran, M. A. M., Hassan, M. K., Mojumdar, M. U. (2021, January). Predicting performance analysis of garments women working status in Bangladesh using machine learning approaches. In 2021 6th International Conference on Inventive Computation Technologies (ICICT) (pp. 602-608). IEEE.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
An Ensemble Learning Approach for Chronic Kidney Disease Prediction Using Different Machine Learning Algorithms with Correlation Based Feature Selection Md Mahedi Hassan, Tanvir Ahamad, and Sunanda Das Department of Computer Science and Engineering Khulna University of Engineering & Technology Khulna-9203, Bangladesh [email protected], [email protected], and [email protected]
Abstract—Chronic Kidney Disease (CKD), also known as Chronic Renal Disease, is considered one of the biggest causes of death in adults all over the globe, and the number is escalating through the years. At its final stages, the treatment of CKD becomes exorbitant. Machine learning algorithms, for their capability to learn from experience, can play a vital role in predicting CKD in its early stages. In this paper, we apply machine learning to predict CKD on the basis of clinical data obtained from the UCI machine learning repository. The dataset has a significant amount of missing values, which is handled using the K-Nearest Neighbours imputer. The imbalanced dataset has been balanced using the Synthetic Minority Oversampling Technique (SMOTE). Correlation Based Feature Selection (CBFS) and Principal Component Analysis (PCA) are used for feature selection. Later, the dataset is divided into 80% for training, 10% for validation and 10% for testing. Five renowned supervised learning algorithms, namely K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Gaussian Naive Bayes, Decision Tree and Logistic Regression, and an ensemble learning algorithm are used to achieve the prediction. Among these, the ensemble learning algorithm proves to be superior to the others on the dataset obtained by CBFS, acquiring an accuracy, precision, recall, and f1-score of 97.41%, 99.52%, 95.27% and 97.33% respectively. Keywords—Chronic kidney disease, Machine learning, Classification, K-fold cross validation, Ensemble learning, SMOTE, PCA, Correlation
I. INTRODUCTION
Chronic Kidney Disease (CKD) has become a life-threatening phenomenon these days. Its mortality rate is very high, especially among adults. In 2016, CKD affected approximately 753 million people globally, among which 44.6% were male and 55.4% were female [1]. It is considered chronic because it begins slowly and eventually hampers the functionality of the urinary system. Because of this, blood filtration is interrupted, which leads to many more health-related issues. Risk factors of CKD include blood pressure, diabetes and cardiovascular diseases [2]. CKD can be divided into five stages [3] based on the glomerular filtration rate (GFR), as shown in Table I. GFR is a measurement of kidney function and is determined based on several factors such as age, body weight, cystatin C and serum creatinine [4]. CKD is curable if it is detected in its early stages. Later stages require expensive procedures to keep the
979-8-3503-4602-2/22/$31.00 ©2022 IEEE
patient alive. So, an early prediction of CKD is required in order to remain healthy. Since the number of patients is ever increasing, and given the lack of experts along with costly and time-consuming diagnosis and treatment, a computer-based automated system to help physicians in the early detection of CKD is urgently needed. Artificially intelligent systems have played a significant role in the medical field, such as in diagnosis, prognosis and treatment. Machine learning and deep learning techniques are well known for their prediction capabilities. This study incorporates:
• A manual Correlation Based Feature Selection (CBFS) method and Principal Component Analysis (PCA) for dimensionality reduction, producing two versions of the dataset
• Several supervised machine learning models applied on both of them
• An ensemble learning method also applied on the datasets
• K-fold cross validation to estimate how well the model generalizes to new data.

TABLE I. FIVE STAGES OF CKD DEVELOPMENT

Stage | Function        | Glomerular Filtration Rate (GFR) (mL/min/1.73 m²)
------|-----------------|---------------------------------------------------
1     | Normal          | ≥ 90
2     | Mild            | 60–89
3     | Moderate damage | 30–59
4     | Severe damage   | 15–29
5     | Kidney failure  | ≤ 15
The succeeding sections of this paper are organized as follows. Section II gives an overview of the related research conducted in this field. Section III illustrates the methodology of this study. Results obtained by the system are analyzed and discussed in Section IV. In the final section, i.e., Section V, we state our final thoughts and conclude this paper, explaining the scope for further improvement.
II. RELATED WORKS
A remarkable amount of research has been conducted on the detection of CKD using artificially intelligent systems. Almansour et al. [5] used Artificial Neural Networks (ANN) and SVM to detect CKD, where ANN outperformed SVM with an accuracy of 99.75% to 97.75%. ANN acquired 100%
precision, 99.6% recall and 99.7% f1 score. Whereas SVM obtained 100.00% precision, 96.40% recall and 98.20% f1 score. Rady and Anwar [6] used probabilistic neural networks (PNN), multilayer perceptron (MLP), SVM, and radial basis function (RBF) algorithms for making the prediction. PNN algorithm outperformed the MLP, SVM, and RBF algorithms in their case, displaying the accuracy, precision, recall and f1score of 99.72%, 99.65%, 100.00% and 99.37% respectively. A classification scheme for detecting the stages of CKD separately was developed by Johnson et al. [7] with the accuracy of 99.70% for each stage and an overall accuracy of 96.70%. This work discards instances with missing values and calculates eGFR considering gender and race as additional attributes. Wibawa et al. [8] used correlation-based feature selection (CFS) to select features. They used Naive Bayes, KNN and SVM as basic classifiers and AdaBoost as an ensemble method.The system performed its best when a hybrid of KNN and AdaBoost was used, achieving an accuracy score of 98.10%. The hybrid approach showed 98.00% precision, recall and f1-score. Chiu et al. [9] built ANN models for detecting CKD. The models included a generalized feed forward neural networks (GFNN), back-propagation network (BPN) and modular neural network (MNN) with a view to detect CKD at an early stage. The authors proposed a hybrid model that was a combination of GA and three-dimensional models. BPN+GA, MNN+GA, GFNN+GA obtained the accuracy of 91.71%, 88.82%, 91.09% accuracy respectively. Gunarathne et al. [10] used 14 attributes, achieving an accuracy of 99.10%. They used ANN and logistic regression, which gave accuracy of 97.50% and 96.00% accuracy overall. The correlation of the selected features were between 0.2 and 0.8. Avci et al. [11] used the software WEKA to predict CKD from the UCI dataset. The methods used here were Naive Bayes, K-Star, SVM, and J48. The J48 classifier outperformed the other algorithms, achieving an accuracy of 99.00%. The classifier displayed a precision and an f-measure of 98.00% and a recall of 99.00%. A. J. Hussain et al. [12] proposed a method of predicting CKD using multilayer perceptron. That work included PCA as the feature selection method. It achieved an accuracy of 99.50%. The discarding of attributes with more than 20% missing values, and the missing value filling approach had a significant role to play on the improvement of performance. Shrivas et al. [13] used the Union Based Feature Selection Technique (UBFST) for feature selection. The selected features were evaluated by several techniques of machine learning. The best accuracy was obtained using a combination of random forest (RF), classification and regression Trees (CART) and SVM, which was 99.75%. Sara et al. [14] extracted features using hybrid wrapper and filter-based FS (HWFFS) and feature selection (FS). Later they combined the features and applied SVM, naive Bayes and ANN for classification. The performance measure of this research was error rate, which was the least (10.00%) for SVM, Elhoseny et al. [15] proposed a healthcare system for
diagnosing CKD by adopting density based feature selection (DFS) and ant colony optimization. DFS was used to remove irrelevant features, who had a weak correlation with the target feature, thus improving the performance. III. M ETHODOLOGY The entire study has been perpetrated in four main steps, namely data preprocessing, feature selection, model development and model evaluation. This has been depicted in Fig. 1. A. Dataset The dataset has been collected from the UCI machine learning repository [16]. The source of this data is from Apollo Hospital, Tamilnadu, India. It contains the data of 400 patients with 25 features which are blood pressure (bp), specific gravity (sg), albumin (al), sugar (sg), rbc, pus cell (pc), pus cell clumps (pcc), bacteria (ba), blood glucose random (bgr), blood urea (bu), serum creatinine (sc), sodium (sod), potassium (pot), hemoglobin (hemo), packed cell volume (pcv), white blood cell count (wc), red blood cell count (rc), hypertension (htn), diabetes mellitus (dm), coronary artery disease (cad), appetite (appet), age, pedal edema (pe) and anemia (ane). Among these, 11 are numeric and 14 are nominal attributes. The one nominal target attribute has two labels namely are ckd and non-ckd. B. Data Preprocessing Data preprocessing stage is divided into two steps such as missing value handling, and handling class imbalance. 1) Missing Value Handling: The dataset contained a significant number of missing values. Initially, features with more than 20% missing values were dropped. Filling the missing values with a constant may result in a drop in the accuracy as there are more CKD instances. This can be seen in [10] and [17]. The presence of categorical variables prevented us from filling those missing values with an average value. So, a KNN imputer algorithm was incorporated to fill up the missing values. 2) Handling Class Imbalance: The dataset was imbalanced with 250 instances of ckd and 150 instances of non-ckd. A Synthetic Minority Oversampling Technique (SMOTE) oversampling was applied to balance the dataset by generating 100 synthetic data points of the minority class. It first randomly selects a data point belonging to the minority class and determines its k-nearest neighbors. It then obtains the feature vector from the selected data point and one of its selected neighbors and multiplies that feature vector with a random number ranging from 0 to 1, producing a new data point. C. Feature Selection Two methods of feature selection have been employed in this study, creating two versions of the dataset. One is the CBFS method, where Pearson correlation is used. Another one is PCA, where the dimensionality of the data has been reduced.
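A sketch of the two preprocessing steps described above (KNN imputation and SMOTE oversampling) with scikit-learn and imbalanced-learn is shown below; the random stand-in data is purely illustrative and only mimics the 250 ckd / 150 non-ckd split.

```python
import numpy as np
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE

# stand-in data: 400 samples, 24 numerically encoded features with missing entries
rng = np.random.default_rng(0)
X = rng.random((400, 24))
X[rng.random(X.shape) < 0.1] = np.nan          # inject missing values
y = np.array([1] * 250 + [0] * 150)            # 250 ckd vs 150 non-ckd labels

# fill missing values from the k nearest neighbours
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# SMOTE generates synthetic minority-class points to balance the classes
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X_imputed, y)
```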
Fig. 1. Workflow pipeline of the proposed system.
1) Correlation Based Feature Selection: This method selects those variables or attributes which are strongly (correlation coefficient, r > 0.5) or moderately (0.3 < r < 0.5) correlated with the target variable. Also, if a strong or moderate correlation exists between two of these selected variables, one of them is dropped. This is done because one of them describes the other, so removing one would not affect the drawing of the boundary between the classes. The formula used here to calculate the bivariate correlation coefficient is Pearson's correlation coefficient, shown in eq. (1), where n is the number of instances, $X_{1i}, X_{2i}$ are the values of the variables $X_1$ and $X_2$, and $\bar{X}_1, \bar{X}_2$ are the means of the values of $X_1$ and $X_2$.

$$r = \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2 \sum_{i=1}^{n} (X_{2i} - \bar{X}_2)^2}} \quad (1)$$

2) Principal Component Analysis: Another approach considered in this stage is PCA, creating another version of the dataset. PCA performs multivariate data analysis using projection methods. It turns data of an n-dimensional space into data of a k-dimensional space, where k ≤ n. For example, in Fig. 2, the data is two dimensional, so the maximum number of principal components (PC), that is 2, are drawn through the mean of the data in such a way that the projection of the data points on the largest principal component, PC1, has maximum spread from the mean. The second component is perpendicular to PC1. Both of them store information as the distance of the projection of the data points from the origin. The quality of information in each PC can be measured by its explained variance (EV), calculated using eq. (2), where VC is the variance of that component and TV is the total variance. The more the explained variance, the easier it is to draw a decision boundary. It is clear that PC1 has more explained variance than PC2. From the projections on PC1, there are four classes (C1 to C4) and the decision boundaries can easily be determined by looking only at PC1. So, PC1 contains information of better quality, which is enough to represent the data. Thus, one dimension is reduced without losing valuable information.

$$EV = \frac{V_C}{T_V} \times 100\% \quad (2)$$
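A sketch of both selection strategies with pandas and scikit-learn is given below; the correlation threshold follows the description above, while the `df` DataFrame and the target column name are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA

def cbfs(df: pd.DataFrame, target: str = "class", threshold: float = 0.3):
    # Eq. (1): absolute Pearson correlation of every attribute with the target;
    # keep attributes that are at least moderately correlated (|r| >= 0.3)
    corr = df.corr(method="pearson")[target].abs()
    return corr[(corr >= threshold) & (corr.index != target)].index.tolist()

def pca_reduce(X, variance: float = 0.95):
    # Eq. (2): keep enough principal components to retain 95% explained variance
    pca = PCA(n_components=variance)
    return pca.fit_transform(X), pca.explained_variance_ratio_
```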
Fig. 2. PCA in action.

D. Classification
Supervised and ensemble learning methods have been incorporated with a view to classifying the data.
1) KNN: The KNN algorithm works on the principle of “birds of the same feather flock together”: it assumes that similar things exist in close proximity. When a new data point is introduced, its k-nearest neighbors are identified using the Minkowski distance depicted in eq. (3), where d is the distance, $X_1$ and $X_2$ are data points, n is the number of samples and P is a constant. Then the class labels of the neighbors are found and the most frequent label is assigned to the new data point.

$$d = \left( \sum_{i=1}^{n} |X_1 - X_2|^P \right)^{\frac{1}{P}} \quad (3)$$

2) SVM: The SVM algorithm finds a hyperplane in n-dimensional space that separates data points of two different classes. Two types of SVMs are available. One is the linear SVM, where the data need to be linearly separable, and the other is the non-linear SVM. In the case of non-linear SVM, the n-dimensional data is transformed into (n+1)-dimensional data by adding an extra dimension using a kernel function. In this study, we used the radial basis function (RBF) kernel shown in eq. (4), where K is the kernel, $||X_1 - X_2||^2$ is the distance between two feature vectors and σ is a constant. A decision boundary is then drawn through the data points.

$$K = e^{\left( -\frac{||X_1 - X_2||^2}{2\sigma^2} \right)} \quad (4)$$

3) Gaussian Naive Bayes: The Naive Bayes classifier is a classification algorithm based on Bayes' theorem, which rests on conditional probability. Let $X(x_1, x_2, \ldots, x_n)$ be an instance of the dataset and $Y(y_1, y_2, \ldots, y_n)$ the set of classes. The naive Bayes classification algorithm finds $P(y_1|X), P(y_2|X), \ldots, P(y_n|X)$ and assigns X the class with the highest conditional probability using eq. (5). The naive Bayes classification algorithm makes two assumptions: firstly, all the features are independent of each other; secondly, each feature is equally important. This transforms eq. (5) into eq. (6). Gaussian naive Bayes considers each attribute as continuous and calculates each probability using eq. (7), where $\mu_c$ and $\sigma_c$ are constants.

$$P(y|X) = \frac{P(X|y)P(y)}{P(X)} \quad (5)$$

$$P(y|X) = \frac{P(x_1|y)P(x_2|y)\cdots P(x_n|y)P(y)}{P(x_1)P(x_2)\cdots P(x_n)} \quad (6)$$

$$P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \times \exp\left( -\frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right) \quad (7)$$

4) Decision Tree: It is a flowchart-like tree structure. Each internal node represents a test on an attribute, each branch denotes the outcome of a test, and the leaf nodes contain the class labels. Tree construction, i.e., attribute selection for splitting, has been done using the Gini index, which is a measure of impurity and follows eq. (8), where $P_i$ denotes probability.

$$Gini = 1 - \sum_{i=1}^{n} P_i^2 \quad (8)$$

5) Logistic Regression: Logistic regression is a classification algorithm that uses the sigmoid function in eq. (9). Despite being a regression algorithm, it produces binary values thanks to the sigmoid function, which maps the output into a range of 0 to 1.

$$f(x) = \frac{1}{1 + e^{-x}} \quad (9)$$

6) Soft Voting: The soft voting classifier is an ensemble classifier. Ensemble learning is quite effective for classification problems [18]. It takes into account many classifiers and, with the help of the prediction probabilities p returned by them, predicts the output using eq. (10), where w is the weight associated with classifier j.

$$y = \underset{i}{\arg\max} \sum_{j=1}^{m} w_j p_{ij} \quad (10)$$

E. k-Fold Cross Validation
As the number of samples in this study is limited, k-fold cross validation is used to evaluate the models. K-fold cross validation divides the dataset into k segments and, in each of k iterations, uses (k-1) segments as the training set and 1 segment as the testing set. We used 5 as the value of k.

F. Performance Metrics
The model performance has been evaluated in terms of accuracy, precision, recall, and f1 scores, which are calculated using eq. (11), (12), (13) and (14), where TP is true positive, TN is true negative, FP is false positive and FN is false negative. The numbers of correctly labeled samples (TP and TN) and incorrectly labeled samples (FP and FN) are obtained from the confusion matrix.

$$Accuracy = \frac{TN + TP}{TN + TP + FN + FP} \quad (11)$$

$$Precision = \frac{TP}{TP + FP} \quad (12)$$

$$Recall = \frac{TP}{TP + FN} \quad (13)$$

$$F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \quad (14)$$

TABLE II. ABSOLUTE CORRELATION COEFFICIENTS OF THE SELECTED ATTRIBUTES AND THE CLASS LABEL

      | sg   | al   | su   | bgr  | bu   | sc   | hemo | htn  | dm   | appet | pe   | class
------|------|------|------|------|------|------|------|------|------|-------|------|------
sg    | 1    |      |      |      |      |      |      |      |      |       |      |
al    | 0.53 | 1    |      |      |      |      |      |      |      |       |      |
su    | 0.34 | 0.31 | 1    |      |      |      |      |      |      |       |      |
bgr   | 0.37 | 0.36 | 0.68 | 1    |      |      |      |      |      |       |      |
bu    | 0.35 | 0.46 | 0.2  | 0.16 | 1    |      |      |      |      |       |      |
sc    | 0.27 | 0.27 | 0.17 | 0.13 | 0.6  | 1    |      |      |      |       |      |
hemo  | 0.58 | 0.61 | 0.25 | 0.32 | 0.55 | 0.36 | 1    |      |      |       |      |
htn   | 0.44 | 0.53 | 0.34 | 0.39 | 0.41 | 0.29 | 0.57 | 1    |      |       |      |
dm    | 0.44 | 0.44 | 0.5  | 0.50 | 0.33 | 0.23 | 0.47 | 0.65 | 1    |       |      |
appet | 0.3  | 0.36 | 0.12 | 0.21 | 0.3  | 0.19 | 0.41 | 0.36 | 0.34 | 1     |      |
pe    | 0.32 | 0.46 | 0.16 | 0.13 | 0.37 | 0.2  | 0.39 | 0.38 | 0.32 | 0.47  | 1    |
class | 0.72 | 0.62 | 0.34 | 0.41 | 0.38 | 0.3  | 0.72 | 0.55 | 0.51 | 0.37  | 0.35 | 1
0.72 0.62 0.34 0.41 0.38 0.3 0.72 0.55 0.51 0.37 0.35 1
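As an illustrative sketch of the soft voting ensemble and the 5-fold evaluation described in the preceding subsections (the data here is synthetic and the base-classifier settings are assumptions, not the tuned values of this study):

# Illustrative sketch: soft voting over the five base classifiers, evaluated with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(kernel="rbf", probability=True)),   # probability=True enables soft voting
        ("dt", DecisionTreeClassifier(criterion="entropy")),
        ("gnb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

scores = cross_validate(voter, X, y, cv=5, scoring=["accuracy", "precision", "recall", "f1"])
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, scores["test_" + metric].mean())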
IV. RESULT ANALYSIS AND DISCUSSION
The attributes selected during CBFS are shown in Table II along with their correlation coefficients obtained using eq. (1). Some of the attributes are highly correlated with each other, but since they are nominal they do not explain each other (Fig. 3), so all of them are selected. In the case of PCA, as depicted in Fig. 4, 13 attributes are required to maintain a 95% explained variance. In the model training phase, the k-value that minimizes the error rate is selected for KNN, and the Gini index is used for the decision tree. The soft voting classifier combines one KNN, one SVM, one decision tree with entropy as the attribute selection method, one Gaussian naive Bayes, and one logistic regression classifier. According to Table III, the soft voting classifier predicts 97.41% of the labels correctly on the CBFS version of the dataset, whereas logistic regression, decision tree, Gaussian naive Bayes, SVM, and KNN make 96.94%, 96.23%, 94.58%, 89.17%, and 87.29% correct predictions on the same dataset, respectively.
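A minimal sketch of selecting the number of principal components that preserves 95% of the variance, as in Fig. 4, is shown below; the data is random stand-in data, not the CKD dataset.

# Illustrative sketch: keep enough principal components for 95% explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))          # stand-in for the preprocessed feature matrix

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)            # fractional n_components = target explained variance
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_)[-1])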
TABLE III. PERFORMANCE COMPARISON AMONG DIFFERENT METHODS USING 5-FOLD CROSS VALIDATION WITH RESPECT TO PCA AND PEARSON CORRELATION.

Principal Component Analysis (PCA)
Methods               Metric (%)   Fold-1   Fold-2   Fold-3   Fold-4   Fold-5   Average
KNN                   Accuracy     89.41    96.47    94.11    96.47    97.64    94.82
                      Precision    92.30    97.61    95.12    100.00   95.45    96.09
                      Recall       85.71    95.34    92.85    93.02    100.00   93.38
                      F1 score     88.88    96.47    93.97    96.38    97.67    94.67
SVM (rbf)             Accuracy     90.58    97.64    94.11    95.29    98.82    95.29
                      Precision    94.37    100.00   97.43    100.00   97.67    97.89
                      Recall       85.71    95.34    90.47    90.69    100.00   92.44
                      F1 score     90.00    97.61    93.82    95.12    98.82    95.07
Gaussian Naive Bayes  Accuracy     95.29    94.11    95.29    97.64    97.64    96.00
                      Precision    93.18    93.18    93.18    100.00   95.45    95.00
                      Recall       97.61    95.34    97.61    95.34    100.00   97.18
                      F1 score     95.34    94.25    95.34    97.61    97.67    96.04
Decision Tree         Accuracy     96.47    92.94    95.29    96.47    100.00   96.23
                      Precision    95.34    89.36    93.18    100.00   100.00   95.57
                      Recall       97.61    97.67    97.61    93.02    100.00   97.18
                      F1 score     96.47    93.33    95.34    96.38    100.00   96.30
Logistic Regression   Accuracy     90.58    97.64    97.64    95.29    98.82    96.00
                      Precision    94.73    100.00   97.61    100.00   97.67    98.00
                      Recall       85.71    95.34    97.61    90.69    100.00   93.87
                      F1 score     90.00    97.61    97.61    95.12    98.82    95.83
Soft Voting           Accuracy     98.82    97.64    92.94    97.64    98.82    97.12
                      Precision    100.00   100.00   100.00   100.00   100.00   100.00
                      Recall       97.61    95.34    85.71    95.34    97.61    94.32
                      F1 score     98.79    97.61    92.30    97.61    98.79    97.02

Pearson Correlation
Methods               Metric (%)   Fold-1   Fold-2   Fold-3   Fold-4   Fold-5   Average
KNN                   Accuracy     88.23    84.70    85.88    88.23    89.41    87.29
                      Precision    90.00    94.11    94.11    90.24    94.59    92.61
                      Recall       85.71    74.41    76.19    86.04    83.33    81.14
                      F1 score     87.80    83.11    84.21    88.09    88.60    86.36
SVM (rbf)             Accuracy     88.23    88.23    88.23    91.76    89.41    89.17
                      Precision    88.09    86.66    92.10    89.13    90.24    89.24
                      Recall       88.09    90.69    83.33    95.34    88.09    89.11
                      F1 score     88.09    88.63    87.50    92.13    89.15    89.10
Gaussian Naive Bayes  Accuracy     94.11    91.17    91.76    98.82    96.47    94.58
                      Precision    100.00   95.00    94.87    97.72    95.34    96.58
                      Recall       88.09    88.09    88.37    100.00   97.61    92.43
                      F1 score     93.67    91.56    91.35    98.85    96.47    94.38
Decision Tree         Accuracy     96.47    94.11    97.64    95.29    97.64    96.23
                      Precision    97.56    91.30    97.61    95.34    95.45    95.45
                      Recall       95.23    97.67    97.61    95.34    100.00   97.17
                      F1 score     96.38    94.38    97.61    95.34    97.67    97.28
Logistic Regression   Accuracy     96.47    97.64    95.29    96.47    98.82    96.94
                      Precision    100.00   100.00   100.00   97.61    97.67    99.05
                      Recall       92.85    95.34    90.47    95.34    100.00   94.80
                      F1 score     96.29    97.61    95.00    96.47    98.82    96.84
Soft Voting           Accuracy     96.47    98.82    96.47    97.64    97.64    97.41
                      Precision    100.00   100.00   100.00   100.00   97.61    99.52
                      Recall       92.85    97.67    92.85    95.34    97.61    95.27
                      F1 score     96.29    98.82    96.29    97.61    97.61    97.33

Fig. 3. Joint plot of highly correlated nominal attributes: (a) bgr vs su, (b) pcv vs sg, (c) hemo vs sg, (d) al vs sg, (e) dm vs su, (f) dm vs bgr.

Fig. 4. Cumulative explained variance vs number of PCs selected (95% cut-off threshold marked).

TABLE IV. COMPARISON OF THE PERFORMANCE OF THE PROPOSED SYSTEM WITH PREVIOUS STUDIES.
Previous studies          Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
Chittora et al. [19]      90.73          83.34           93.00        88.05
Jongbo et al. [20]        89.2           97.72           97.80        -
Hore et al. [21]          92.54          85.71           96.00        90.56
Morales et al. [22]       92.00          93.00           90.00        91.00
Rady et al. [6]           95.84          84.06           93.55        88.55
Elhoseny et al. [15]      85.00          -               88.00        88.00
Harimoorthy et al. [23]   66.30          65.90           65.90        -
Ogunleye et al. [24]      96.80          -               87.00        93.00
Khan et al. [25]          95.75          96.20           95.80        95.80
Proposed model            97.41          99.52           95.27        97.33
Soft voting falls slightly behind on the PCA version of the dataset, making 97.12% error-free predictions, while logistic regression, decision tree, Gaussian naive Bayes, SVM, and KNN get 96.00%, 96.23%, 96.00%, 95.29%, and 94.82% of their predictions right on that dataset, respectively. The positive predictions made by the soft voting classifier are accurate 100.00% and 99.52% of the time for the PCA and CBFS versions of the dataset, respectively. This rate is 99.05%, 95.45%, 96.58%, 89.24%, and 92.61% for logistic regression, decision tree, Gaussian naive Bayes, SVM, and KNN, respectively, on the CBFS version of the dataset, and 98.00%, 95.57%, 95.00%, 97.89%, and 96.09%, respectively, on the other version. The soft voting classifier identifies 94.32% and 95.27% of the CKD cases flawlessly on the PCA and CBFS versions of the dataset, respectively. The error-free identification percentage is 93.87%, 97.18%, 97.18%, 92.44%, and 93.38% for logistic regression, decision tree, Gaussian naive Bayes, SVM, and KNN, respectively, on the PCA version of the dataset, and 94.80%, 97.17%, 92.43%, 89.11%, and 81.14%, respectively, on the other version.
When it comes to F1 score, soft voting worked notably well on both the CBFS and PCA data, obtaining scores of 97.33% and 97.02%, respectively. Logistic regression, decision tree, Gaussian naive Bayes, SVM, and KNN scored 96.84%, 97.28%, 94.38%, 89.10%, and 86.36% on the CBFS data, respectively; on the other data, the scores were 95.83%, 96.30%, 96.04%, 95.07%, and 94.67%, respectively. Soft voting performed well on both datasets in terms of all the performance metrics. Using Pearson correlation at the preprocessing stage gave better accuracy, precision, and F1 score, but it fell behind in terms of recall. Keeping the precision-recall trade-off in mind, the smaller the gap between precision and recall, the better the performance; this gap is 4.25% and 5.68% while using correlation and PCA as the preprocessing method, respectively. The existing works mentioned in Table IV have accuracy, precision, recall, and F1 score in the ranges of 66.30% to 95.84%, 65.90% to 97.72%, 65.90% to 97.80%, and 88.05% to 95.80%, respectively. Our proposed model with correlation-based feature selection outperforms all of them.

V. CONCLUSION AND FUTURE WORK
This study applies a number of machine learning models to the UCI machine learning dataset on chronic kidney disease. The dataset, which had a significant amount of missing values and class imbalance, was preprocessed by filling in the missing values using a KNN imputer and balancing the dataset with SMOTE. CBFS and PCA were used for feature selection, and two versions of the dataset were created. KNN, SVM, Gaussian naive Bayes, decision tree, logistic regression, and soft voting were applied to both of them along with 5-fold cross validation. Soft voting performed comparatively better than the other models when it was combined with CBFS. In the future, more data can be obtained from other sources and appended to the existing dataset; new data may contain different distributions, which would make the models more robust. Other machine learning models can be applied, and further hyperparameter tuning can be added to the existing models. Deep learning methods can be implemented when the dataset grows in size. A time-based analysis can also be carried out.
REFERENCES
[1] B. Bikbov, N. Perico, G. Remuzzi et al., "Disparities in chronic kidney disease prevalence among males and females in 195 countries: analysis of the global burden of disease 2016 study," Nephron, vol. 139, pp. 313-318, 2018.
[2] Z. Chen, X. Zhang, and Z. Zhang, "Clinical risk assessment of patients with chronic kidney disease by using clinical data and multivariate models," International Urology and Nephrology, vol. 48, no. 12, pp. 2069-2075, 2016.
[3] F. E. Murtagh, J. Addington-Hall, P. Edmonds, P. Donohoe, I. Carey, K. Jenkins, and I. J. Higginson, "Symptoms in the month before death for stage 5 chronic kidney disease patients managed without dialysis," Journal of Pain and Symptom Management, vol. 40, no. 3, pp. 342-352, 2010.
[4] G. J. Schwartz and S. L. Furth, "Glomerular filtration rate measurement and estimation in chronic kidney disease," Pediatric Nephrology, vol. 22, no. 11, pp. 1839-1848, 2007.
[5] N. A. Almansour, H. F. Syed, N. R. Khayat, R. K. Altheeb, R. E. Juri, J. Alhiyafi, S. Alrashed, and S. O. Olatunji, "Neural network and support vector machine for the prediction of chronic kidney disease: A comparative study," Computers in Biology and Medicine, vol. 109, pp. 101-111, 2019.
[6] E.-H. A. Rady and A. S. Anwar, "Prediction of kidney disease stages using data mining algorithms," Informatics in Medicine Unlocked, vol. 15, p. 100178, 2019.
[7] C. A. Johnson, A. S. Levey, J. Coresh, A. Levin, and J. G. L. Eknoyan, "Clinical practice guidelines for chronic kidney disease in adults: part 1. definition, disease stages, evaluation, treatment, and risk factors," American Family Physician, vol. 70, no. 5, pp. 869-876, 2004.
[8] M. S. Wibawa, I. M. D. Maysanjaya, and I. M. A. W. Putra, "Boosted classifier and features selection for enhancing chronic kidney disease diagnose," in 2017 5th International Conference on Cyber and IT Service Management (CITSM). IEEE, 2017, pp. 1-6.
[9] R. K. Chiu, R. Y. Chen, S.-A. Wang, Y.-C. Chang, and L.-C. Chen, "Intelligent systems developed for the early detection of chronic kidney disease," Advances in Artificial Neural Systems, vol. 2013, 2013.
[10] W. Gunarathne, K. Perera, and K. Kahandawaarachchi, "Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD)," in 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 2017, pp. 291-296.
[11] E. Avci, S. Karakus, O. Ozmen, and D. Avci, "Performance comparison of some classifiers on chronic kidney disease data," in 2018 6th International Symposium on Digital Forensic and Security (ISDFS). IEEE, 2018, pp. 1-4.
[12] A. J. Aljaaf, D. Al-Jumeily, H. M. Haglan, M. Alloghani, T. Baker, A. J. Hussain, and J. Mustafina, "Early prediction of chronic kidney disease using machine learning supported by predictive analytics," in 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2018, pp. 1-9.
[13] A. Shrivas, S. K. Sahu, and H. Hota, "Classification of chronic kidney disease with proposed union based feature selection technique," in Proceedings of 3rd International Conference on Internet of Things and Connected Technologies (ICIoTCT), 2018, pp. 26-27.
[14] S. Sara and K. Kalaiselvi, "Ensemble swarm behaviour based feature selection and support vector machine classifier for chronic kidney disease prediction," International Journal of Engineering & Technology, vol. 7, no. 2, p. 190, 2018.
[15] M. Elhoseny, K. Shankar, and J. Uthayakumar, "Intelligent diagnostic prediction and classification system for chronic kidney disease," Scientific Reports, vol. 9, no. 1, pp. 1-14, 2019.
[16] D. Dua and C. Graff, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[17] S. Vijayarani, S. Dhayanand et al., "Data mining classification algorithms for kidney disease prediction," Int J Cybernetics Inform, vol. 4, no. 4, pp. 13-25, 2015.
[18] S. Das and D. Biswas, "Prediction of breast cancer using ensemble learning," in 2019 5th International Conference on Advances in Electrical Engineering (ICAEE). IEEE, 2019, pp. 804-808.
[19] P. Chittora, S. Chaurasia, P. Chakrabarti, G. Kumawat, T. Chakrabarti, Z. Leonowicz, M. Jasinski, L. Jasinski, R. Gono, E. Jasinska et al., "Prediction of chronic kidney disease - a machine learning perspective," IEEE Access, vol. 9, pp. 17312-17334, 2021.
[20] O. A. Jongbo, A. O. Adetunmbi, R. B. Ogunrinde, and B. Badeji-Ajisafe, "Development of an ensemble approach to chronic kidney disease diagnosis," Scientific African, vol. 8, p. e00456, 2020.
[21] S. Hore, S. Chatterjee, R. K. Shaw, N. Dey, and J. Virmani, "Detection of chronic kidney disease: A NN-GA-based approach," in Nature Inspired Computing. Springer, 2018, pp. 109-115.
[22] G. R. Vasquez-Morales, S. M. Martinez-Monterrubio, P. Moreno-Ger, and J. A. Recio-Garcia, "Explainable prediction of chronic renal disease in the Colombian population using neural networks and case-based reasoning," IEEE Access, vol. 7, pp. 152900-152910, 2019.
[23] K. Harimoorthy and M. Thangavelu, "Multi-disease prediction model using improved SVM-radial bias technique in healthcare monitoring system," Journal of Ambient Intelligence and Humanized Computing, vol. 12, no. 3, pp. 3715-3723, 2021.
[24] A. Ogunleye and Q.-G. Wang, "XGBoost model for chronic kidney disease diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 6, pp. 2131-2140, 2019.
[25] B. Khan, R. Naseem, F. Muhammad, G. Abbas, and S. Kim, "An empirical evaluation of machine learning techniques for chronic kidney disease prophecy," IEEE Access, vol. 8, pp. 55012-55022, 2020.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal

Rafsan Jany, Islamic University of Technology, [email protected]
Md. Hamjajul Ashmafee, Islamic University of Technology, [email protected]
Iqram Hussain, Seoul National University, [email protected]
Md Azam Hossain, Islamic University of Technology, [email protected]
Abstract—Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephalography (EEG) is regarded as a powerful tool for examining the association between neurological effects and sleep phases since it correctly identifies sleep-related neurological alterations. During the Non-Rapid Eye Movement (NREM) and Rapid Eye Movement (REM) sleep phases, a number of nervous and bodily functions are affected, and these stages therefore play an important functional role. This work aims to classify NREM and REM sleep stages from sleep EEG data and presents SleepExplain, a novel explainable NREM and REM sleep stage classification model that explains its predictions. In this work, sleep stages were classified using Random Forest, XGBoost, and Gradient Boosting ensemble classification models. Overall, we obtained accuracies of 92.54% (Random Forest), 94.25% (Gradient Boosting), and 94.30% (XGBoost). For the explainable classification model, we utilized a game-theoretic approach, SHAP (SHapley Additive exPlanations), to offer a convincing explanation for the predictions.
Index Terms—Machine Learning, Ensemble, XAI, electroencephalography, explainable, Sleep Stage

I. INTRODUCTION
Sleep is one of the basic biological activities required for relieving stress. It is one of the brain's fundamental functions and is crucial for a person's learning ability, performance, and physical activity [2], [3]. Understanding sleep quality in an easier manner is one of the most important and interesting topics in the fields of neuroscience and sleep disorder diagnosis. Sleep stage scoring is the state of the art for analyzing human sleep [4]. The goal of sleep stage scoring is to find the stages of sleep that are important for identifying and treating sleep disorders [5], [6]. The continuous recording of several electrophysiological signals, termed Polysomnographic (PSG) signals, is used for sleep stage scoring purposes [7]. The American Academy of Sleep Medicine (AASM) provides the most widely used standard for sleep stage classification. According to this standard, PSG signal recordings are classified as Non-Rapid Eye Movement (NREM) sleep, Rapid Eye Movement (REM) sleep, and waking (W), and the AASM has published more recent guidelines for this purpose [8], [9]. The AASM rules also specify the distinctive waves for each of the five sleep phases [12].
During the NREM and REM stages, the human body undergoes many functional changes in both the nervous system and the rest of the body [18]. Hormonal changes also occur in these two stages [18]. When NREM sleep deepens, sympathetic nerve functionality decreases, and there is a break in sympathetic nerve activity at some points of NREM sleep due to the short increase in blood pressure and heart rate that follows K-complexes [18]. During REM sleep, respiratory flow and ventilation change and become faster and more erratic [19], [20]. Hypoventilation takes place in NREM sleep [21]. Significant reductions in blood flow and metabolism are linked to NREM sleep, while the overall metabolic rate and blood flow during REM are similar to wakefulness [22]. The alpha, beta, and gamma rhythms are attenuated in NREM sleep while theta and delta rhythms rise, followed by an increase in alpha and beta rhythms in REM sleep [31]. Detecting and understanding the NREM and REM sleep stages can therefore provide many ways to detect nervous system and bodily functional disorders. Except for the eye movements, middle ear ossicles, and respiratory system, the body is paralyzed during Rapid Eye Movement (REM) sleep. Although the brain is less active during Non-Rapid Eye Movement (NREM) sleep, the body can still move. A sleep disorder like narcolepsy is characterized by excessive daytime drowsiness and abnormal REM sleep regulation [23]. NREM sleep is related to parasomnias, or unusual sleep-related behaviors that take place while sleeping, because the body muscles are more active [23]. Predicting NREM and REM sleep is therefore deemed a valuable process for the diagnosis of sleeping disorders. It also evaluates the
body and nervous system characteristics during sleep. In practice, sleep stage scoring is performed through a manual process: an expert measures and monitors the sleep scoring procedure by hand. Notably, with a hand-operated procedure, errors can creep in at any point, and an automated strategy might be more reliable for identifying NREM and REM sleep. Several EEG and biosignal studies have been published to investigate the relationship between EEG biomarkers and neurological prognosis in medicine and healthcare [34]-[39]. The objective of this research is to develop an approach that can automatically classify NREM and REM sleep and explain the prediction model using XAI. It uses multi-channel EEG signals to train machine learning models, and features are retrieved from these signals. We intended to automate this sleep scoring technique using data from three EEG channels at three distinct sites (C4, O2, and F4): F4, C4, and O2 from the frontal, central, and occipital regions were utilized in our investigation. The purpose of this study is to develop better models for forecasting NREM/REM sleep and to explain those models using Explainable AI.
II. ASSOCIATED STUDY
In the majority of research, overnight recorded EEG signals are used to classify the sleep stages [27]. These studies recommend applying a variety of feature reduction strategies in order to identify the relevant features. Various classification methods have been developed to classify the phases of sleep; none of them, however, classifies sleep exclusively as NREM and REM sleep.
Satapathy et al. [24] proposed a method for identifying two stages of sleep, awake and asleep, using the ISRUC-Sleep [29] dataset; the overall accuracy for the awake and sleep stages was 91.67% and 93.8%, respectively. Ellis et al. [25] provided an interpretable taxonomy of sleep stages, and their research classified sleep into five distinct phases using the PhysioNet Sleep-EDF [30] dataset. Santaji et al. [26] used another form of EEG data from sixty participants, collected and preprocessed using an IIR (Infinite Impulse Response) filter; in their study, three sleep classes (REM, NREM1, and NREM2) were classified with an overall accuracy of 95.36%. Santosh et al. [27] presented a machine-learning model with an ensemble approach using the ISRUC-Sleep [29] dataset, in which sleep stages were classified as waking, NREM (as N1, N2, and N3), and REM, achieving an accuracy of 91.10%. Shen et al. [28] proposed an enhanced machine learning model based on the essence of features applied to the ISRUC [30] dataset and classified sleep states into five stages, including waking, NREM (as N1, N2, and N3), and REM, achieving an accuracy of 81.65%. Hussain et al. [31] used the neurological biomarkers of sleep phases to measure delta wave power ratios (DAR, DTR, and DTABR); these measurements were treated as biomarkers because they decrease during NREM sleep and increase during REM sleep.

III. MATERIAL AND METHODOLOGY
This study presents an efficient and reliable automatic sleep stage classification model for the NREM and REM sleep classes on three-channel EEG signals. Alongside, we added an explainable AI (XAI) approach for better understanding of the inner mechanism of the model; the explainable AI shows the features that contribute most to the final outcome produced by the classification model. Notably, we focused on increasing the accuracy of the prediction model for the NREM and REM sleep stages and on explaining this model with a SHAP-based approach.

A. Model Architecture
Figure 1 depicts the general architecture of the proposed investigation. First, we acquired the EEG data from channels F4, C4, and O2 and extracted features from the raw signal using the FFT (Fast Fourier Transform). To eliminate any 60 Hz AC noise from the neighboring electrical grid, the EEG signal was filtered; features were then retrieved from the noise-free signal. Next, the models were trained using three ensemble machine learning models: Random Forest, Gradient Boosting, and XGBoost. Finally, we implemented a SHAP-based explainable classification model to explain the outcomes produced by the prior model.
Fig. 1. Proposed framework of an explainable sleep stage classification model.
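As an illustrative sketch of this pipeline (the sampling rate, filter settings, and band edges below are assumptions for demonstration, not values from the paper), 60 Hz noise removal and FFT band-power feature extraction could look like this:

# Illustrative sketch: notch-filter 60 Hz noise, then compute FFT band powers for one EEG epoch.
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 256                                   # assumed sampling rate (Hz)
epoch = np.random.randn(30 * fs)           # stand-in for one 30-second EEG epoch

# Remove 60 Hz mains interference.
b, a = iirnotch(w0=60.0, Q=30.0, fs=fs)
clean = filtfilt(b, a, epoch)

# Power spectrum via FFT.
spectrum = np.abs(np.fft.rfft(clean)) ** 2
freqs = np.fft.rfftfreq(clean.size, d=1.0 / fs)

bands = {"delta": (0, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 100)}
features = {name: spectrum[(freqs >= lo) & (freqs < hi)].sum() for name, (lo, hi) in bands.items()}
print(features)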
B. Dataset Description
The dataset was collected from a sleep center named Haaglanden Medisch Centrum (HMC, The Netherlands) [32], [33]. The initial data files combine different types of signals; we selected the EEG channels F4 from the frontal lobe, C4 from the central region, and O2 from the occipital lobe. The frontal lobe controls the voluntary movements of the body, thoughts and analytical activities are associated with the central region, and the occipital lobe is responsible for eyesight. To characterize the sleep stages of a cerebral condition, the EEG is divided into five frequency sub-bands: delta wave (0 Hz - 4 Hz), theta wave (4 Hz - 8 Hz), alpha wave (8 Hz - 12 Hz), beta wave (12 Hz - 30 Hz), and gamma wave (above 30 Hz). Our dataset contains 154 sleep recordings with 75 features. There are five classes in the dataset: Wake, N1, N2, N3, and REM. In the preprocessing
step, all rows classified as Wake were removed and the N1, N2, and N3 classes were merged into a single NREM class, so the final preprocessed dataset has two classes: NREM and REM. The final dataset has a total of 89096 rows, of which 72631 are NREM and 16465 are REM. The difference between the sample counts of the NREM and REM classes is significant, so we applied the SMOTE (Synthetic Minority Over-sampling Technique) [13] approach to balance our training dataset. The testing dataset was left unchanged to find the true accuracy of our proposed model.
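A minimal sketch of this balancing step (illustrative; the data, split, and random seed are assumptions), applying SMOTE to the training portion only:

# Illustrative sketch: oversample only the training split with SMOTE, leave the test split untouched.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Stand-in feature matrix / labels with an NREM-vs-REM style imbalance.
X = np.random.randn(1000, 75)
y = np.array([0] * 820 + [1] * 180)        # 0 = NREM, 1 = REM (assumed encoding)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("before:", np.bincount(y_train), "after:", np.bincount(y_train_bal))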
C. Classification Models
We used three popular machine learning models: Random Forest, Gradient Boosting, and XGBoost. The training dataset was balanced using the SMOTE technique. The default values of parameters such as n_estimators and max_depth did not produce a satisfactory result, so we had to tune the parameters of the algorithms. We used the scikit-learn Python library for training and tuning our models.

TABLE I. OPTIMUM PARAMETER VALUES
Parameters     RF    GB     XG
max depth      39    12     29
n estimators   450   1150   4010

Random Forest is one of the most effective machine learning techniques for classification [14]. In this solution, we tuned this model with n_estimators covering the range 3 to 500 and max_depth covering the range 5 to 50, and obtained the best result for n_estimators = 450 and max_depth = 39, with an accuracy of 92.54%. Using an iterative process, boosting algorithms merge weak learners into a strong learner [15], and Gradient Boosting is a regression approach that resembles boosting [16]. The accuracy with the default values of n_estimators and max_depth for the Gradient Boosting classifier was not acceptable, so we tuned this model with n_estimators ranging from 5 to 1200 with an interval of 50 and max_depth ranging from 3 to 30. The most acceptable result was acquired for n_estimators = 1150 and max_depth = 12, with an accuracy of 94.25%. Another ensemble model based on Gradient Boosting with a high degree of scalability is XGBoost [17]. XGBoost constructs a loss function that is minimized when the objective function is expanded additively, similar to Gradient Boosting. The accuracy with the default values of n_estimators and max_depth for the XGBoost classifier was also not satisfactory. This model was tuned with n_estimators covering the range 500 to 5000 with an interval of 50 and max_depth covering the range 3 to 30. The best result of this model was attained for n_estimators = 4010 and max_depth = 29, and it scored 94.3%.

D. Explainability of the Proposed Model using SHAP
It is crucial to be able to appropriately interpret the results of prediction models. It fosters the right level of user trust, offers suggestions for how to make a model better, and aids in comprehending the learning process that is being represented [40]. Explainable artificial intelligence (XAI) helps users to understand and believe the results produced by machine learning algorithms. SHAP (SHapley Additive exPlanations) assigns each feature a relevance value for a specific prediction [1]. In this study, we extracted the most important features and their contributions for a particular prediction by applying SHAP values. SHAP explains the outcome of a model by computing the contribution of each feature associated with it; the Shapley value of an outcome follows a linear model and measures how much each feature in that model contributes, either positively or negatively.

IV. RESULT AND DISCUSSION
In this section we discuss the prediction performances; later, SHAP detects the most influential features for the XGBoost classifier.

TABLE II. PERFORMANCE METRIC RESULTS FOR MODELS
Evaluation Metric   RF       GB       XG
Accuracy            92.54%   94.25%   94.30%
Precision           87.13%   91.25%   91.33%
Recall              89.31%   89.65%   89.70%
Specificity         89.31%   89.51%   89.70%
F1 Score            88.21%   90.44%   90.51%
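For illustration, the Table II style metrics could be reproduced for the three tuned models roughly as follows; only the hyperparameter values come from Table I, while the data here is synthetic and the specificity is derived from the confusion matrix.

# Illustrative sketch: train the three tuned models and compute Table II style metrics.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=75, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

models = {
    "RF": RandomForestClassifier(n_estimators=450, max_depth=39, random_state=1),
    "GB": GradientBoostingClassifier(n_estimators=1150, max_depth=12, random_state=1),
    "XG": XGBClassifier(n_estimators=4010, max_depth=29, random_state=1),
}

for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(name,
          "acc=%.4f" % accuracy_score(y_te, y_pred),
          "prec=%.4f" % precision_score(y_te, y_pred),
          "rec=%.4f" % recall_score(y_te, y_pred),
          "spec=%.4f" % (tn / (tn + fp)),      # specificity from the confusion matrix
          "f1=%.4f" % f1_score(y_te, y_pred))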
A. Model Performance Result
In Table II, the performance metrics of the three models are given. The best accuracy was achieved by the XGBoost classifier at 94.30%, while the other two classifiers, Random Forest and Gradient Boosting, scored 92.54% and 94.25%, respectively. XGBoost and Gradient Boosting scored almost the same, with a very small difference. On the other evaluation metrics XGBoost also performed very satisfactorily, with Gradient Boosting only slightly behind, while Random Forest lagged behind the other two classifiers. Individually, 92.54% is not a bad accuracy, but compared to the other two classifiers, Random Forest was clearly behind.
The ROC curve illustrates the trade-off between specificity and sensitivity; better performance is shown by classifiers whose curves are closer to the top-left corner. Figure 2 presents the training and testing ROC graphs of Random Forest. In the top-left corner, the difference between training and testing is slightly larger than for the other two models. The AUROC for training is 1.00, and for testing the AUROC is 0.968.

Fig. 2. ROC Graph of Random Forest Model

The training-testing ROC graph of Gradient Boosting is shown in Figure 3. The gap in the top-left corner of the graph is reduced, the training and testing curves are closer than for the Random Forest model, and the accuracy is better. The AUROC is 0.975 for testing and 1.00 for training.

Fig. 3. ROC Graph of Gradient Boosting Model

Figure 4 shows the training-testing ROC graphs of XGBoost. XGBoost performed better than the other two models, although the difference between XGBoost and Gradient Boosting is not large, which is why the ROC curves of these two models are very similar. The testing curve is closer to the training curve than in the other two graphs. The AUROC is 0.977, which is close to that of Gradient Boosting (Fig. 3). There are no unusual ups and downs in the ROC graphs of the three models; the training and testing curves are naturally smooth. Although we used SMOTE for balancing the training data, the models performed well on the untouched testing data, so we can be confident that the models are not overfitted.

Fig. 4. ROC of XGBoost Model

The prediction outcomes of a classification problem are compiled in a confusion matrix. In Figure 5, the confusion matrix of XGBoost is shown. The total number of NREM predictions was 14460, of which 14037 were correct and only 423 were wrong. The total number of REM cases was 3350, of which 2766 were correct and 584 were wrong. We did not use SMOTE on the testing data, which is why the imbalance is noticeable here, although our models still performed well.

Fig. 5. Confusion Matrix of XGBoost Model

B. SHAP (XAI) Result
The internal mechanism of a machine learning model is hard to understand and difficult to relate to the practical dataset and scenario, so using Explainable AI (XAI) to understand the inner mechanism of a model is valuable. In this study, we used SHAP for better understanding and for analyzing feature dependencies. We applied SHAP to the XGBoost classifier to determine the most significant features for a particular prediction, considering the first 100 rows of the testing data.
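A minimal sketch of this step follows (model and data are assumed stand-ins; only the use of TreeExplainer on an XGBoost model and the first 100 test rows mirrors the setup described above):

# Illustrative sketch: SHAP values for a fitted XGBoost model on the first 100 test rows.
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(20)])   # hypothetical feature names
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=6).fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te.iloc[:100])

# Mean-|SHAP| feature ranking (as in Fig. 6) and beeswarm summary (as in Fig. 7).
shap.summary_plot(shap_values, X_te.iloc[:100], plot_type="bar")
shap.summary_plot(shap_values, X_te.iloc[:100])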
Fig. 6. Significant Features With SHAP Value

Fig. 6 shows the list of the most important features and their SHAP values for the predictions. MedianF Beta F4 is the feature with the largest contribution, and MeanF Alpha F4 scores almost the same as MedianF Beta F4. Most of the features in the list score about +0.02, and the remaining features contribute +0.41 altogether. Notably, most of the significant features come from the frontal region, while the contribution of the occipital region is smaller than that of the other two.

Fig. 7. SHAP Value Beeswarm

The beeswarm plot (Fig. 7) is made to show a summary of the top features in a dataset and how they affect the model's output in a way that is both information-dense and easy to understand. A single dot is used to indicate each instance of the explanation in each aspect of Fig. 7. Here, MedianF Beta F4 is the most important feature on average. From both figures (Fig. 6 and Fig. 7), we can notice that F4 and C4 play a very significant role, whereas O2 does not contribute as much as the other two channels. The duration of REM sleep was less than that of NREM, and eyesight and eye movement are controlled by the occipital lobe; this may explain the reduced contribution of the O2 channel in the model.

V. CONCLUSION
In this study, NREM and REM sleep phases were classified using ensemble classification models: Random Forest, XGBoost, and Gradient Boosting. Overall, Random Forest obtained 92.54% accuracy, XGBoost earned 94.30% accuracy, and Gradient Boosting achieved 94.25% accuracy. We also applied SHAP to the XGBoost classifier to determine the most significant features for a particular prediction. Most traditional approaches classify sleep into five stages, and very few studies have worked with XAI, so the inside mechanism of a model is usually not revealed and it remains a black-box model. This study unleashed the model explainability with XAI and classified the two most significant sleep stages (NREM and REM) with better accuracy.

ACKNOWLEDGMENT
This research was supported by Islamic University of Technology Research Seed Grants (IUT RSG) (Ref: REASP/IUTRSG/2022/OL/07/012).

REFERENCES
[1] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
[2] Estrada, E., Nava, P., Nazeran, H., Behbehani, K., Burk, J., and Lucas, "Itakura distance: A useful similarity measure between EEG and EOG signals in computer-aided classification of sleep," 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, 2006, pp. 1189-1192.
[3] Li, Y., Yingle, F., Gu, L., and Qinye, T., "Sleep stage classification based on EEG Hilbert-Huang transform," 2009 4th IEEE Conference on Industrial Electronics and Applications, pp. 3676-3681.
[4] Ebrahimi, F., Mikaili, M., Estrada, E., and Nazeran, H., "Assessment of Itakura distance as a valuable feature for computer-aided classification of sleep stages," 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2007, pp. 3300-3303.
[5] Mora, A. M., Fernandes, C. M., Herrera, L. J., Castillo, P. A., Merelo, J. J., Rojas, F., and Rosa, A. C., "Sleeping with ants, SVMs, multilayer perceptrons and SOMs," 2010 10th International Conference on Intelligent Systems Design and Applications, 2010, pp. 126-131.
[6] Vatankhah, M., Akbarzadeh-T, M.-R., and Moghimi, A., "An intelligent system for diagnosing sleep stages using wavelet coefficients," The 2010 International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1-5.
[7] Aboalayon, K. A. I., Faezipour, M., Almuhammadi, W. S., and Moslehpour, S., "Sleep stage classification using EEG signal analysis: a comprehensive survey and new investigation," 2016, p. 272.
[8] Shuyuan, X., Bei, W., Jian, Z., Qunfeng, Z., Junzhong, Z., and Nakamura, M., "Notice of Removal: An improved K-means clustering algorithm for sleep stages classification," 2015 54th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), 2015, pp. 1222-1227.
[9] Lajnef, T., Chaibi, S., Ruby, P., Aguera, P.-E., Eichenlaub, J.-B., Samet, M., Kachouri, A., and Jerbi, K., "Learning machines and sleeping brains: automatic sleep stage classification using decision-tree multi-class support vector machines," Journal of Neuroscience Methods, 2015, pp. 94-105.
[10] Gunes, S., Polat, K., and Yosunkaya, S., "Efficient sleep stage recognition system based on EEG signal using k-means clustering based feature weighting," Expert Systems with Applications, 2010, pp. 7922-7928.
[11] Charbonnier, S., Zoubek, L., Lesecq, S., and Chapotot, F., "Sleeping with ants, SVMs, multilayer perceptrons and SOMs," Computers in Biology and Medicine, 2011, pp. 380-389.
5
Page 252
[12] Berry, Richard B and Brooks, Rita and Gamaldo, Charlene E and Harding, Susan M and Marcus, C and Vaughn, Bradley V and others, “The AASM manual for the scoring of sleep and associated events,”Rules, Terminology and Technical Specifications, Darien, Illinois, American Academy of Sleep Medicine, 2012. [13] Nitesh V. Chawla,Kevin W. Bowyer ,Lawrence O. Hall,W. Philip Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,”Journal of Artificial Intelligence Research 16 (2002) 321–357. [14] Leo Breiman. Random forests. Machine Learning, 2001, 45(1):5–32. [15] Robert E. Schapire Yoav Freund. A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 1999, 14(5):771 – 780. [16] Jerome H. Friedman. Greedy function approximation: a Gradient Boosting machine. The Annals of Statistics, 2001, 29(5):1189 – 1232. [17] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting sys- tem. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, 2016. ACM, pages 785–794. [18] Institute of Medicine (US) Committee on Sleep Medicine and Research; Colten HR, Altevogt BM, editors. Sleep Disorders and Sleep Deprivation: An Unmet Public Health Problem. Washington (DC): National Academies Press (US); 2006. 2, Sleep Physiology. Available from: https://www.ncbi.nlm.nih.gov/books/NBK19956/. [19] Krieger J. Respiratory physiology: Breathing in normal subjects. In: Kryger M, Roth T, Dement WC, editors. Principles and Practice of Sleep Medicine. 4th ed. Philadelphia: Elsevier Saunders; 2000. pp. 229–241. [20] Simon PM, Landry SH, Leifer JC. Respiratory control during sleep. In: Lee-Chiong TK, Sateia MJ, Carskadon MA, editors. Sleep Medicine. Philadelphia: Hanley and Belfus; 2002. pp. 41–51. [21] NLM (National Library of Medicine), NIH (National Institutes of Health). Medline Plus Online Medical Dictionary. [accessed February 6, 2006]. [22] Madsen PL, Schmidt JF, Wildschiodtz G, Friberg L, Holm S, Vorstrup S, Lassen NA. Cerebral O2 metabolism and cerebral blood flow in humans during deep and rapid-eye-movement sleep. Journal of Applied Physiology. 1991b;70(6):2597–2601. [23] Kathryn Lovell, PhD, and Christine Liszewski, MD, Normal Sleep Patterns and Sleep Disorders [24] Satapathy, S.K., Loganathan, D. A Study of Human Sleep Stage Classification Based on Dual ChannelSatapathy, S.K., Loganathan, D. A Study of Human Sleep Stage Classification Based on Dual Channels of EEG Signal Using Machine Learning Techniques. SN COMPUT. SCI. 2, 157 (2021). [25] Ellis CA, Zhang R, Carbajal DA, Miller RL, Calhoun VD, Wang MD. Explainable Sleep Stage Classification with Multimodal Electrophysiology Time-series. Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:2363-2366. doi: 10.1109/EMBC46164.2021.9630506. PMID: 34891757. [26] Santaji, S., Santaji, S., Desai, V. Automatic sleep stage classification with reduced epoch of EEG. Evol. Intel. 15, 2239–2246 (2022) [27] Santosh Kumar Satapathy, Akash Kumar Bhoi, D. Loganathan, Bidita Khandelwal, Paolo Barsocchi, Machine learning with ensemble stacking model for automated sleep staging using dual-channel EEG signal, Biomedical Signal Processing and Control, Volume 69, 2021, 102898, ISSN 1746-8094, https://doi.org/10.1016/j.bspc.2021.102898. [28] Shen H, Ran F, Xu M, Guez A, Li A, Guo A. An Automatic Sleep Stage Classification Algorithm Using Improved Model Based Essence Features. Sensors (Basel). 2020 Aug 19;20(17):4677. doi: 10.3390/s20174677. 
PMID: 32825024; PMCID: PMC7506989. [29] Khalighi Sirvan, Teresa Sousa, Jos´e Moutinho Santos, and Urbano Nunes. “ISRUC-Sleep: A comprehensive public dataset for sleep researchers.“Computer methods and programs in biomedicine 124 (2016): 180-192. [30] B Kemp, AH Zwinderman, B Tuk, HAC Kamphuisen, JJL Obery´e. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE-BME 47(9):1185-1194 (2000). [31] Hussain, I.; Hossain, M.A.;Jany, R.; Bari, M.A.; Uddin, M.; Kamal, A.R.M.; Ku, Y.; Kim, J.-S. Quantitative Evaluation of EEGBiomarkers for Prediction of Sleep Stages. Sensors, 22, 3079 (2022). https://doi.org/10.3390/s22083079 [32] Alvarez-Estevez, D.; Rijsman, R.M. Haaglanden Medisch Centrum Sleep Staging Database (Version 1.0.1); PhysioNet, 2021; Availableonline: https://physionet.org/ (accessed on 22 March 2022).
[33] Alvarez-Estevez, D.; Rijsman, R.M. Inter-Database Validation of a Deep Learning Approach for Automatic Sleep Scoring. PLoSONE 2021,16, e0256111. [CrossRef] [34] Hussain, I. and S. J. Park. ”Big-ecg: Cardiographic predictive cyberphysical system for stroke management.” IEEE Access 9 (2021): 123146-64. [35] Hussain, I. and S. J. Park. ”Healthsos: Real-time health monitoring system for stroke prognostics.” IEEE Access 8 (2020): 213574-86. [36] Hussain, I., M. A. Hossain and S.-J. Park. ”A healthcare digital twin for diagnosis of stroke.” Presented at 2021 IEEE International Conference on Biomedical Engineering, Computer and Information Technology for Health (BECITHCON), 2021. 2022. [37] Hussain, I. and S.-J. Park. ”Quantitative evaluation of task-induced neurological outcome after stroke.” Brain Sciences 11 (2021): 900. [38] Hussain, I., S. Young, C. H. Kim, H. C. M. Benjamin and S. J. Park. ”Quantifying physiological biomarkers of a microwave brain stimulation device.” Sensors 21 (2021): 1896. [39] Hussain, I.; Young, S.; Park, S.-J. Driving-Induced Neurological Biomarkers in an Advanced Driver-Assistance System. Sensors 2021, 21, 6985. [40] M. K. Sumon, M. H. Ashmafee, M. R. Islam, and A. R. Mostofa Kamal, “Explainable NLQ-based visual interactive system: Challenges and objectives,” Proceedings of the 2nd International Conference on Computing Advancements, 2022.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Dual-Core Antiresonant Fiber Based Compact Broadband Polarization Beam Splitter

Kumary Sumi Rani Shaha 1,2,*, Abdul Khaleque 1, and Md. Sarwar Hosen 1
1 Dept. of Electrical & Electronic Engineering, Rajshahi University of Engineering & Technology, Rajshahi-6204, Bangladesh
2 Dept. of Electrical, Electronic and Communication Engineering, Pabna University of Science and Technology, Pabna, Bangladesh
* [email protected]
Abstract—In this article, we suggest a hollow dual-core antiresonant fiber design that acts as a polarization splitter. The proposed splitter is designed with a single-layer cladding structure in which two elliptical elements form two separate symmetrical cores. The designed fiber provides excellent splitting performance, including a very compact length of 2.14 cm and a maximum extinction ratio of 145 dB at 1.55 µm. It supports an ample bandwidth of 445 nm while retaining an extinction ratio greater than 20 dB, covering two popular communication windows. It achieves a higher-order mode extinction ratio of 225, which indicates effective single-mode operation. Therefore, the broad bandwidth and compact length of the suggested polarization splitter make it a potential candidate for optical communication networks.
Index Terms—dual-core fiber, antiresonant, broadband, device length, polarization splitter, extinction ratio, single mode.
I. INTRODUCTION
In recent years, the emerging interest in hollow core fibers (HCFs), owing to unique features such as low transmission loss, low group velocity dispersion, high nonlinearity, etc., has caught researchers' attention [1]-[3]. These remarkable features of HCFs have resulted in extensive applications, for example data communications [4], mid-IR transmission [5], THz guidance [6], and many others. The first type of HCF is the photonic bandgap fiber (PBGF), which transmits light by following the bandgap theory [2], and the second type is the antiresonant fiber (ARF), which guides light through the inhibited coupling mechanism [3]. In addition, the application of HCFs has also been extended to integrated optical devices including polarizers [7], optical fiber couplers [8], [9], etc.
Initially, the implementation of optical fiber devices revolved around the platform of solid dual-core fibers (SDCFs). In 2000, Mangan and colleagues first demonstrated and fabricated SDCFs for the communication spectrum [10]. In the following years, SDCF-based optical fiber coupling was investigated numerically as well as experimentally [10], [11]. However, SDCFs suffer from some fundamental drawbacks, for example material loss, optical nonlinearities, scattering loss, and so forth [7], [12]. These limitations can be addressed by HCFs (PBGFs and ARFs): various studies have been carried out on hollow dual-core PBGF (HDC-PBGF) based optical couplers, both theoretically and experimentally [13], [14]. Moreover, the mentioned HC-ARFs provide a more feasible platform for optical fiber device designs such as couplers
and polarizers due to their broad transmission band, remarkable loss performance, and flexible structure [15]. On the other hand, ARFs support different types of cladding to manage the overall performance of the fiber, such as single-layer [8], multi-layer [16], [17], elliptical elements [18], [19], nested [20], conjoined-tube [21], [22], and ice-cream-cone shaped tubes [23], along with many more complex configurations [24]-[26]. This flexibility in the cladding allows ARFs to be used to design hollow dual-core (HDC) fiber couplers or polarizers [15]. The two hollow cores are generally created by cladding tubes (first-layer or second-layer tubes) that separate the core uniformly [8]. Liu and colleagues proposed an HDC-ARF based coupler theoretically, which attracted considerable attention [27]. After the first HDC-ARF coupler was successfully fabricated, it drew further attention from researchers; it provided a coupling length of 35 cm [28]. In addition, a double-layer arrangement including elliptical tubes was reported by Zhao et al., which provides broad bandwidth but cannot cover the two popular communication windows and also has a large fiber length of 6.75 cm [18]. Furthermore, a splitter based on a triple cladding arrangement was proposed by Stawska et al., which offered a 36.90 cm device length that is comparatively longer [7]. In 2021, some HDC-ARF based splitters were also studied with great interest: one structure has a complex arrangement of two circular nested tubes inset between two elliptical tubes, which provides a higher fiber coupling length [29]. More recently, in 2022, a splitter was introduced based on an HDC-ARF having a nested as well as double cladding layer configuration, which provided a long device length of 6.45 cm [20]. Therefore, according to the literature, optical polarization splitters based on HDC-ARFs form a very promising field due to the superior features of ARFs, but it remains challenging to achieve broad bandwidth and short device length simultaneously with a simple HC-ARF geometry.
This manuscript introduces a relatively simple HDC-ARF geometry that, to the best of our knowledge, provides very good splitting performance. Our suggested HDC-ARF based polarization beam splitter (PBS) is carefully optimized and reveals a compact fiber length of 2.14 cm. It also provides a wide 445 nm bandwidth (1.20 µm to 1.645 µm) over which the extinction ratio (ER) is higher than 20 dB, with the highest ER of 145 dB at 1.55 µm. The nested tubes enhance the single-mode behavior and transfer
almost 98% of the input power by reducing the light leakage into the cladding of the fiber. Therefore, the simplicity and the overall splitting performance (small length, wide band, and good single-mode nature) make it a promising prospect for a PBS in optical communication.

II. FIBER DESIGN
Fig. 1 shows the suggested cross-sectional layout of the HDC-ARF. Silica and air are applied as the background materials, where the properties of silica are described by the Sellmeier formula [30]. The mentioned HDC-ARF has a comparatively simple configuration of eight cladding tubes with two nested tubes: six identical circular tubes, two elliptical tubes, and two circular tubes nested in the horizontal-axis circular tubes. The diameter of the uniform circular tubes is Dt = 12.15 µm, the diameter of the nested elements is Dn = 7 µm, and the radii of the semi-minor and semi-major axes of the elliptical tubes are Rx = 6 µm and Ry = 13 µm, respectively, which realizes a curvature factor of Cf = Ry/Rx. The gaps between neighboring tubes (circular to elliptical, and circular to circular) are g1 = 2.40 µm and g2 = 3.75 µm, respectively. Furthermore, an identical silica strut thickness of 500 nm and a total fiber diameter of D = 57.70 µm have been optimized for the suggested fiber. The two symmetrical cores A and B are formed by placing the two elliptical tubes on the vertical axis, maintaining a separation gap of S = 2.80 µm along the y-axis. The optimized parameter values are tabulated in Table I.
Fig. 1. Suggested cross-sectional layout of the HDC-ARF. Here, the whole fiber diameter is D = 57.70 µm, the circular and nested tube diameters are Dt = 12.15 µm and Dn = 7 µm, respectively, the strut thickness is t = 500 nm, and the elliptical element parameters are Ry = 13 µm and Rx = 6 µm. In addition, the gap between the two elliptical tubes is s = 2.80 µm (y-axis) and the gaps between adjacent tubes are g1 = 2.40 µm and g2 = 3.75 µm, respectively.

TABLE I. THE OPTIMIZED PARAMETERS FOR THE PROPOSED SPLITTER DEVICE
Parameter label                             Symbol   Optimized value
Circular tube diameter                      Dt       12.15 µm
Nested tube diameter                        Dn       7 µm
Elliptical tube semi-major axis radius      Ry       13 µm
Elliptical tube semi-minor axis radius      Rx       6 µm
Gap between circular and elliptical tubes   g1       2.40 µm
Gap between circular tubes                  g2       3.75 µm
Tube thickness                              t        0.5 µm (500 nm)
Separation within elliptical tubes          s        2.80 µm
Fiber total diameter                        D        57.70 µm

III. BASIC THEORY
Following coupled mode theory, the HDC-ARF supports four supermodes: two even and two odd modes for the x- and y-directions [8]. The field strength profiles of the four modes (odd and even, for the x- and y-polarizations) are exhibited in Fig. 2.

Fig. 2. The electric field strength profile of four supermodes such as (a) x-even, (b) x-odd, (c) y-odd, and (d) y-even for the suggested device at 1.55 µm. The electric field strength orientation is indicated by black arrows.

By analyzing the coupling lengths of the x- and y-polarized lights, the suggested HDC-ARF presents the smallest potential guiding length at which the input power is uniformly transferred between the two cores. The coupling length is calculated by the expression [31]

l_c^{x,y} = \frac{\lambda}{2\,(n_e^{x,y} - n_o^{x,y})}    (1)

where l_c^{x,y}, n_e^{x,y}, n_o^{x,y}, and λ identify the coupling length, the refractive indices of the even and odd modes, and the wavelength, respectively, while the x, y superscripts represent the x- and y-polarizations. Furthermore, the coupling length ratio (CLR) of the mentioned fiber can be evaluated [31] as

CLR = \frac{l_c^{y}}{l_c^{x}} = \frac{w}{z}    (2)

where the two polarized modes are divided into the two cores when w\,l_c^{y} = z\,l_c^{x} = l; here w and z are integers and l is the fiber length. The acceptable estimation of the CLR is either 1/2 (l_c^{y} > l_c^{x}) or 2 (l_c^{y} < l_c^{x}), which ensures outstanding splitting efficiency [31]. The input power couples into both cores A and B when the input ray is launched into the fiber [18]. The input power P_in and the normalized output power P_out of core B are related [8] by

P_{out}^{x,y} = P_{in} \cos^2\!\left( \frac{\pi l}{2\,l_c^{x,y}} \right)    (3)

The extinction ratio is a key parameter of a splitter, characterized as the ratio of the normalized power of the x-polarization to that of the y-polarization, or conversely. The ER is evaluated for core B by the formula [18]

ER = 10 \log\!\left( \frac{P_{out}^{y}}{P_{out}^{x}} \right)    (4)
here the polarized rays of two modes is divided into two cores when wlcy = zlcx = l, here w and z are represented as integral numbers and l presents the fiber length. Since, the acceptable estimation of CLR is either 1/2 (lcy > lcx ) or 2 (lcy < lcx ) which ensures outstanding splitting efficiency [31]. The input power signal couples in both cores A and B when input ray is guided into the fiber [18]. The input power Pin along with the normalized output power Pout of core B can be investigated [8] as πl x,y 2 (3) Pout = Pin cos 2lcx,y The extinction ratio is a key parameter of splitter that is characterized as the ratio of normalized power for x to y polarization or conversely. Therefore, ER is investigated for core B by following the formula [18] as y Pout ER = 10 log (4) x Pout
Fig. 3. The device coupling length (left y axis) for x- and y-polarized rays with respect to the fiber diameter (D) variation. Besides, the CLR (right y axis) regarding the fiber diameter (D) is also included, while the operating wavelength of 1.55 µm is maintained.Therefore, the required CLR = 0.5 is found at D = 57.70 µm.
V. R ESULT & D ISCUSSION IV. N UMERICAL ANALYSIS A finite element method based COMSOL Multiphysics has been implied to analyze the characteristics of suggested HDCARF. Hence, a 5 µm perfectly matched layer (PML) thickness is employed to our structure for the numerical modeling. According to literature [8], [18], the mesh elements of 871693 is used for proposed fiber, correspondingly. Basically, ARF faces a resonant area, where coupling happens in between core and cladding modes which goes through high loss [3]. Besides, the designed fiber has t = 500 nm that introduces first resonance at 1.05 µm wavelength and offers low losses as well as wide bandwidth, over the working wavelengths, for our splitter. The effect of total fiber diameter is analyzed by concerning the two essential parameter of splitter such as coupling length and the CLR as depicted in Fig. 3 thus investigated by Eq. 1 and Eq. 2. From Fig. 3, it can be seen that by increasing the fiber diameter, the coupling length for x- and y-polarized lights are also increasing, that happens due to increasing the core size [8]. Because when the core size increases, the elliptical tubes faces more distance to each other that reduces the asymmetry effect. Since this reduced asymmetry decreases the birefringences and increases the coupling length as investigated by Eq. 1, hence, the CLR = 0.5 is found at D = 57.70 µm as indicated by dotted line in Fig. 3. Besides, the CLR = 0.5 provides the coupling length of 1.07 cm for x- polarized ray and 2.14 cm for y- polarized ray that can be noticed at 57.70 µm of diameter which assures the best splitting performance at this point. Therefore, the whole fiber diameter is chosen as D = 57.70 µm. Besides the
A. Performance of the Splitter
The splitting performance of our HDC-ARF splitter with the optimized parameters is investigated. Firstly, the coupling length along with the CLR is analyzed with respect to wavelength, as depicted in Fig. 4. From Fig. 4, it is observed that the coupling length of the x-polarized light is larger than that of the y-polarized light; hence, it is possible to divide the two polarized lights between the two cores at a particular length of our fiber. At the 1.55 µm wavelength, the designed splitter obtains CLR = 0.5, as can be seen in Fig. 4. In addition, the CLR curve is quite flat in nature, which helps to realize the wide-bandwidth behavior of our splitter. Then, the variation of the normalized power is investigated with respect to the propagation distance, as shown in Fig. 5. The x-polarized light delivers its maximum power to core B and its least power to core A, and vice versa for the y-polarized ray. Almost all the power propagates with strong confinement, without noticeable leakage (Fig. 5), owing to the nested tubes, which provide additional negative curvature and thus stronger light confinement. At the 2.14 cm fiber length, the separation of the normalized power is maximal for the x- and y-polarized lights in the two cores; therefore, the incident ray is divided into two polarized rays and the fiber acts as a PBS. Now, the extinction ratio as a function of wavelength is analyzed and depicted in Fig. 6. Here, a reference level of 20 dB is applied for investigating the operating spectrum of the proposed splitter. It is observed from Fig. 6 that a bandwidth of 445 nm, ranging from 1.20 µm to 1.645 µm, is obtained while maintaining an ER greater than 20 dB, with the highest ER of 145 dB at 1.55 µm.
Fig. 4. The coupling length (left y axis) for x- and y-polarized rays and the CLR (right y axis) as a function of wavelength of our proposed device, while wavelength = 1.55 µm maintains CLR = 0.5.
Fig. 6. The extinction ratio versus wavelength for the suggested fiber; the dotted line denotes the 20 dB level considered as the reference.
B. Single Mode Performance
In this part, the single-mode performance of the proposed HDC-ARF splitter is analysed. The ratio of the confinement loss (CL) of the higher-order modes (HOMs) to that of the fundamental supermodes (FSMs) determines the higher-order-mode extinction ratio (HOMER), which indicates the single-modeness of the fiber. Following the literature [32], a HOMER of 10 supports single-mode characteristics. For the suggested fiber, the CL of the FSMs and HOMs is depicted in Fig. 7, with the HOMER on the right y-axis. Fig. 7 shows that the highest CL among the supermodes is provided by the ye mode and the lowest CL among the HOMs is maintained by the LP11-1 mode; the HOMER is therefore the ratio of the lowest CL among the HOMs to the highest CL among the four core supermodes [18], expressed as [18]

\mathrm{HOMER} = \frac{\text{confinement loss of } LP_{11}\text{-}1}{\text{confinement loss of } y_e} \qquad (5)

where the highest HOMER of 225 is achieved at 1.575 µm, and a HOMER of 212 is found at 1.55 µm, considering the CL of the highest FSM and the lowest HOM. Therefore, the proposed HDC-ARF-based PBS can realize single-mode characteristics. Moreover, the upper panel of Fig. 7 shows the surface plots of the field strength and direction of the two HOMs, where the line colors correspond to the frame colors.

Fig. 5. The variation of normalized power with respect to propagation distance for the suggested fiber.
Fig. 7. The confinement loss for the four supermodes and two HOMs (left y-axis) and the HOMER (right y-axis) as a function of wavelength for the proposed HDC-ARF. The electric field profiles of the two HOMs are included in the upper panel, where the frame colors correspond to the line colors of the HOMs.
C. Parameter Variation Tolerance
The effect of strut-thickness variation on the achieved ER spectrum is now examined and shown in Fig. 8. The tube
thickness of the splitter is varied from +2% to -2% while the other optimized parameters remain unchanged. At 1.55 µm the ER decreases because the resonance point shifts, which in turn shifts the optimized point at which CLR = 0.5; the resonance wavelength depends strongly on the tube thickness [2]. The coupling length also shifts slightly, which is why the peak is reduced and displaced under the ±2% strut-thickness variation of the suggested polarization splitter. The overall bandwidth, however, remains the same as claimed. Thus, the working spectrum of the proposed splitter maintains a wide bandwidth under a strut-thickness tolerance of ±2% and beyond.
Fig. 8. The extinction ratio versus wavelength for ±2% strut-thickness variation of the suggested fiber. The fixed coupling length of 2.14 cm for the y-polarized ray is used to investigate the extinction-ratio spectrum under the ±2% strut-thickness deviation.

D. Comparative Analysis
A comparative survey of the suggested splitter and related literature is summarized in Table II. The dominant parameters that convey splitter performance are tabulated; the single-mode analysis (HOMER) applies only to HCF-based splitters, where it supports suppressing their leaky nature [3]. ARF-based splitters are a rapidly growing field in which many numerical and experimental works have been reported [7], [9], [18], [20], [26], [28], [29]. The first work in the table [18], reported in 2019 with a double-layer nested structure, provided a device length of 6.35 cm, a bandwidth of 310 nm, a lowest ER of -55 dB at 1.45 µm, and a HOMER of > 100. The second work [7], reported in 2020 with a triple-layer cladding structure, had a device length of 36.9 cm and retained a HOMER of 65. Another work [29] proposed two adjacent nested rings and reported a length of 8.15 cm, a 370 nm bandwidth, a lowest ER of -65 dB at 1.57 µm, and a HOMER of > 100. In 2022, Ni et al. reported a nested double-layer-cladding PBS that provided a device length of 6.45 cm, a bandwidth of 400 nm, and a HOMER of > 100 [20]. In contrast, the proposed splitter, built from single-layer elements, provides a device length of 2.14 cm and a bandwidth of 445 nm at the 20 dB reference level, with a HOMER greater than 100 over almost the whole band. In addition, the suggested splitter provides a maximum ER of 145 dB at 1.55 µm, where the HOMER is 212. Almost every reported structure has a complex design with multilayer, nested, or adjacent nested cladding [7], [15], [18], [20], [26], [29]; among these studies, the proposed polarization splitter maintains a single-layer structure with the best splitting performance (Table II).

TABLE II
RELATIVE STUDY OF PERFORMANCE OF THE SUGGESTED SPLITTER WITH THE RECENT LITERATURE
Ref. (Year)    | Device Length | Maximum / Minimum ER | Bandwidth (covering 20 dB / -20 dB level) | HOMER
[18] (2019)    | 6.75 cm       | -55 dB at 1.45 µm    | 310 nm (1.41 µm to 1.72 µm)               | >100
[7] (2020)     | 36.9 cm       | 64 dB at 0.56 µm     | Not discussed                             | 65
[29] (2021)    | 8.15 cm       | -65 dB at 1.57 µm    | 370 nm (1.28 µm to 1.65 µm)               | >100
[20] (2022)    | 6.45 cm       | -58.8 dB at 1.52 µm  | 400 nm (1.23 µm to 1.63 µm)               | >100
Prop. Splitter | 2.14 cm       | 150 dB at 1.55 µm    | 445 nm (1.20 µm to 1.645 µm)              | >100
E. Fabrication Feasibility
Several HDC-ARF-based PBS devices have been fabricated successfully using the stack-and-draw technique [2], [28], [33], [34]. Complex cladding shapes have also been fabricated with the stack-and-draw process, such as ice-cream-cone-shaped cladding [23], conjoined-shape cladding [21], and split cladding [35]. Nested tubes as well as elliptical-tube claddings have been reported theoretically or experimentally [2], [4], [18], [33]. Furthermore, Kosolapov and colleagues fabricated a revolver fiber using the stack-and-draw method, demonstrating elliptical elements in practice [33]. Thus, it is expected that the suggested HDC-ARF PBS can be realized with accessible fabrication methods.
VI. CONCLUSION
An HDC-ARF-based PBS with a simple geometry is proposed, which splits the incoming light into two polarized (x and y) rays guided in two hollow air cores. The designed device provides a short splitting length of 2.14 cm with a highest HOMER of 225, realizing single-mode operation. It also maintains a broad bandwidth of 445 nm over which the ER remains above 20 dB, with a highest ER of 145 dB at 1.55 µm.
Therefore, the proposed device achieves better performance than the related HDC-ARF-based PBSs [7], [15], [18], [20], [29]. Overall, the HDC-ARF PBS is a promising candidate for wideband and high-speed fiber-optic communication networks.
ACKNOWLEDGMENT
The author, Kumary Sumi Rani Shaha, expresses her heartfelt thanks to the Information and Communication Technology (ICT) Division of the Government of Bangladesh for the financial support provided by the ICT Fellowship for her MSc Engineering research at the Electrical & Electronic Engineering (EEE) Department of Rajshahi University of Engineering & Technology (RUET), Rajshahi-6204, Bangladesh. Moreover, the resources provided by the Electrical & Electronic Engineering (EEE) Department and Research & Extension (DRE/7/RUET/574(58)/PRO/2022-23/18) of Rajshahi University of Engineering & Technology (RUET), Rajshahi-6204, Bangladesh, are acknowledged by Dr. Abdul Khaleque.
REFERENCES
[1] R. Cregan et al., “Single-mode photonic band gap guidance of light in air,” Science, vol. 285, no. 5433, pp. 1537–1539, 1999.
[2] F. Poletti, “Nested antiresonant nodeless hollow core fiber,” Optics Express, vol. 22, no. 20, pp. 23807–23828, 2014.
[3] K. S. R. Shaha, A. Khaleque, and M. I. Hasan, “Low loss double cladding nested hollow core antiresonant fiber,” OSA Continuum, vol. 3, no. 9, pp. 2512–2524, 2020.
[4] K. S. R. Shaha, A. Khaleque, and M. S. Hosen, “Wideband low loss hollow core fiber with nested hybrid cladding elements,” J. Lightwave Technol., vol. 39, no. 20, pp. 6585–6591, Oct 2021.
[5] K. S. R. Shaha, A. Khaleque, and M. I. Hasan, “Nested antiresonant hollow-core fiber with ultra-low loss,” in 2020 11th International Conference on Electrical and Computer Engineering (ICECE). IEEE, 2020, pp. 29–32.
[6] A. S. Sultana, A. Khaleque, K. S. R. Shaha, M. M. Rahman, and M. S. Hosen, “Nodeless antiresonant hollow core fiber for low loss flatband THz guidance,” Optics Continuum, vol. 1, no. 8, pp. 1652–1667, 2022.
[7] H. I. Stawska and M. A. Popenda, “Fluorescence anisotropy sensor comprising a dual hollow-core antiresonant fiber polarization beam splitter,” Sensors, vol. 20, no. 11, p. 3321, 2020.
[8] K. S. R. Shaha, A. Khaleque, M. T. Rahman, and M. S. Hosen, “Broadband and short-length polarization splitter on dual hollow-core antiresonant fiber,” IEEE Photonics Technology Letters, vol. 34, no. 5, pp. 259–262, 2022.
[9] N. Wheeler, T. Bradley, J. Hayes, G. Jasion, Y. Chen, S. R. Sandoghchi, P. Horak, F. Poletti, M. Petrovich, and D. Richardson, “Dual hollow-core anti-resonant fibres,” in Micro-Structured and Specialty Optical Fibres IV, vol. 9886. International Society for Optics and Photonics, 2016, p. 988617.
[10] B. Mangan, J. Knight, T. Birks, P. S. J. Russell, and A. Greenaway, “Experimental study of dual-core photonic crystal fibre,” Electronics Letters, vol. 36, no. 16, p. 1, 2000.
[11] A. Khaleque, E. G. Mironov, and H. T. Hattori, “Analysis of the properties of a dual-core plasmonic photonic crystal fiber polarization splitter,” Applied Physics B, vol. 121, no. 4, pp. 523–532, 2015.
[12] C. Wei, C. R. Menyuk, and J. Hu, “Polarization-filtering and polarization-maintaining low-loss negative curvature fibers,” Optics Express, vol. 26, no. 8, pp. 9528–9540, 2018.
[13] S. Han, Z. Wang, Y. Liu, H. Li, H. Liang, and Z. Wang, “Mode couplers and converters based on dual-core hollow-core photonic bandgap fiber,” IEEE Photonics Journal, vol. 10, no. 2, pp. 1–8, 2018.
[14] L. Meng, J. Fini, J. Nicholson, R. Windeler, A. DeSantolo, E. Monberg, F.
DiMarcello, M. Hassan, and R. Ortiz, “Bend tunable coupling in dual-hollow-core photonic bandgap fiber,” in OFC/NFOEC. IEEE, 2012, pp. 1–3.
[15] H. I. Stawska and M. A. Popenda, “A dual hollow core antiresonant optical fiber coupler based on a highly birefringent structure-numerical design and analysis,” Fibers, vol. 7, no. 12, p. 109, 2019. [16] R. Nishad, K. S. R. Shaha, A. Khaleque, M. S. Hosen, and M. T. Rahman, “Low loss triple cladding antiresonant hollow core fiber,” in 2021 IEEE International Conference on Telecommunications and Photonics (ICTP). IEEE, 2021, pp. 1–5. [17] K. S. R. Shaha, A. Khaleque, and M. S. Hosen, “Hybrid conjoined tube hollow core antiresonant fiber,” in 2021 IEEE International Conference on Telecommunications and Photonics (ICTP). IEEE, 2021, pp. 1–5. [18] T. Zhao, H. Jia, Z. Lian, T. Benson, and S. Lou, “Ultra-broadband dual hollow-core anti-resonant fiber polarization splitter,” Optical Fiber Technology, vol. 53, p. 102005, 2019. [19] M. S. Hosen, A. Khaleque, K. S. R. Shaha, L. N. Asha, A. S. Sultana, R. Nishad, and M. T. Rahman, “Highly birefringent polarization maintaining low-loss single-mode hollow-core antiresonant fiber,” Opt. Continuum, vol. 1, no. 10, pp. 2167–2184, Oct 2022. [20] Y. Ni, J. Yuan, S. Qiu, G. Zhou, C. Xia, X. Zhou, B. Yan, Q. Wu, K. Wang, X. Sang et al., “Dual hollow-core negative curvature fiber polarization beam splitter covering the o+ e+ s+ c+ l communication band,” JOSA B, vol. 39, no. 9, pp. 2493–2501, 2022. [21] S. f. Gao et al., “Hollow-core conjoined-tube negative-curvature fibre with ultralow loss,” Nature communications, vol. 9, no. 1, pp. 1–6, 2018. [22] K. S. R. Shaha and A. Khaleque, “Low-loss single-mode modified conjoined tube hollow-core fiber,” Applied Optics, vol. 60, no. 21, pp. 6243–6250, Jul 2021. [23] F. Yu et al., “Low loss silica hollow core fibers for 3–4 µm spectral region,” Optics express, vol. 20, no. 10, pp. 11 153–11 158, 2012. [24] R. Nishad, K. S. R. Shaha, A. Khaleque, M. S. Hosen, and M. T. Rahman, “Impact of cladding rectangular bars on the antiresonant hollow core fiber,” in 2021 3rd International Conference on Electrical Electronic Engineering (ICEEE), 2021, pp. 77–80. [25] K. S. R. Shaha, A. Khaleque, and M. T. Rahman, “Low loss anisotropic nested hollow core antiresonant fiber,” in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, 2020, pp. 71–76. [26] H. Jia, X. Wang, T. M. Benson, S. Gu, S. Lou, and X. Sheng, “Ultrawide bandwidth dual sakura hollow-core antiresonant fiber polarization beam splitter,” JOSA B, vol. 38, no. 11, pp. 3395–3402, 2021. [27] X. Liu, Z. Fan, Z. Shi, Y. Ma, J. Yu, and J. Zhang, “Dual-core antiresonant hollow core fibers,” Optics express, vol. 24, no. 15, pp. 17 453–17 458, 2016. [28] X. Huang, J. Ma, D. Tang, and S. Yoo, “Hollow-core air-gap antiresonant fiber couplers,” Optics Express, vol. 25, no. 23, pp. 29 296– 29 306, 2017. [29] H. Jia, X. Wang, T. Zhao, Z. Tang, Z. Lian, S. Lou, and X. Sheng, “Ultrawide bandwidth single-mode polarization beam splitter based on dual-hollow-core antiresonant fiber,” Applied Optics, vol. 60, no. 31, pp. 9781–9789, 2021. [30] M. M. Rahman, A. Khaleque, M. T. Rahman, and F. Rabbi, “Gold-coated photonic crystal fiber based polarization filter for dual communication windows,” Optics Communications, vol. 461, p. 125293, 2020. [31] M. T. Rahman and A. Khaleque, “Ultra-short polarization splitter based on a plasmonic dual-core photonic crystal fiber with an ultra-broad bandwidth,” Applied optics, vol. 58, no. 34, pp. 9426–9433, 2019. [32] Y. Wang, M. I. Hasan, M. R. A. Hassan, and W. 
Chang, “Effect of the second ring of antiresonant tubes in negative-curvature fibers,” Optics Express, vol. 28, no. 2, pp. 1168–1176, 2020. [33] A. F. Kosolapov et al., “Hollow-core revolver fibre with a double-capillary reflective cladding,” Quantum Electronics, vol. 46, no. 3, p. 267, 2021. [34] A. Argyros, S. G. Leon-Saval, and M. A. van Eijkelenborg, “Twin-hollow-core optical fibres,” Optics Communications, vol. 282, no. 9, pp. 1785–1788, 2009. [35] X. Huang, W. Qi, D. Ho, K.-T. Yong, F. Luan, and S. Yoo, “Hollow core anti-resonant fiber with split cladding,” Optics Express, vol. 24, no. 7, pp. 7670–7678, 2016.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December, Cox’s Bazar, Bangladesh
Demystifying Hypothyroidism Detection with Extreme Gradient Boosting and Explainable AI
A.K.M. Salman Hosain and Md. Golam Rabiul Alam
Department of Computer Science and Engineering, Brac University, 66 Mohakhali, Dhaka - 1212, Bangladesh
Email- [email protected], [email protected]
Abstract—Hypothyroidism is a prevalent disease of the thyroid gland in humans. The thyroid is an endocrine gland in vertebrates that controls physiological metabolism by producing specific hormones: thyroxine (T4) and triiodothyronine (T3). Hypothyroidism can be defined as the deficiency of the thyroid hormones T3 and T4 in the bloodstream. The pituitary gland produces a hormone named thyroid-stimulating hormone (TSH), which stimulates the thyroid to produce T3 and T4. Hypothyroidism causes weight gain, exhaustion, infertility, cardiovascular illness, dyslipidaemia, etc. Hypothyroidism affects around 5% of the world's population, with another 5% being undiagnosed. If not treated, hypothyroidism can be life threatening. In this paper, we have addressed this issue by proposing a framework using state-of-the-art machine learning algorithms, XGBoost, AdaBoost, and CatBoost, and have presented a comparative analysis among the algorithms with different quantitative performance evaluation metrics. In our work, we have shown that XGBoost outperformed the other two models with an accuracy of 99.87%, while AdaBoost had 99.73% accuracy and CatBoost showed 99.75% accuracy on our test set. We have further used Explainable Artificial Intelligence (XAI) frameworks, LIME and SHAP, to interpret the model's decisions in a comprehensive manner and to address the 'black box' issue of machine learning algorithms.
Keywords—XGBoost, AdaBoost, CatBoost, LIME, SHAP, Hypothyroidism, Thyroid, machine learning, XAI
I. INTRODUCTION
Hypothyroidism is the failure of the thyroid gland to generate enough thyroid hormone [1]. In 1874, Dr. Gull was the first person to describe the clinical symptoms associated with hypothyroidism [2]. The thyroid is a small butterfly-shaped gland located immediately below the laryngeal prominence of the thyroid cartilage. It produces the thyroid hormones triiodothyronine (T3) and thyroxine (T4). Hypothyroidism is defined as a chronic condition caused by low thyroxine (T4) and triiodothyronine (T3) levels [3]. As mentioned before, the thyroid gland produces the T4 hormone, and T3 is mostly produced from T4. When the thyroid gland fails to produce T3 and T4, the pituitary gland produces thyroid-stimulating hormone (TSH) as a negative feedback mechanism [4]. Hypothyroidism is classified into four classes: primary, secondary, tertiary, and peripheral. Primary hypothyroidism occurs due to a deficit of the thyroid hormones T3 and T4 and is the most prevalent form. Secondary hypothyroidism results from a lack of thyroid-
stimulating hormone (TSH) production by the pituitary gland. Tertiary hypothyroidism happens because of a deficiency in thyrotropin-releasing hormone (TRH). Birth abnormalities in thyroid hormone metabolism cause peripheral hypothyroidism. About 95% of hypothyroidism cases are primary and the remaining 5% are secondary, while central and peripheral hypothyroidism account for less than 1% of cases. Hypothyroidism is the second most prevalent endocrine dysfunction, behind only diabetes mellitus in terms of prevalence [5]. The prevalence of primary hypothyroidism increases with age, with a peak incidence between the ages of 30 and 50 years [4], [6]. Women are around 8-9 times more susceptible to this disease than men. Hypothyroidism can cause infertility, hypertension, cardiovascular illness, dyslipidaemia, and neurological and musculoskeletal dysfunction [1], [4]. It can also result in cognitive impairment [1]. Common symptoms of hypothyroidism include weakness, lethargy, dry or coarse skin, facial edema, slow speech, thick tongue, decreased sweating, eyelid edema, cold sensation, skin pallor, coarse hair, forgetfulness, etc. [5]. Although clinical diagnosis is accurate, it is a time-consuming process, and delays may negatively affect patients if treatment is not started immediately. With the advent of machine learning (ML) algorithms, various diseases of organs such as the breast [7], [8], ovary [9], and intestine [10] can now be diagnosed more efficiently and faster. Hypothyroidism can also be detected efficiently with ML. However, these algorithms sometimes fail to earn physicians' confidence due to the 'black box' issue: such sophisticated ML models often lack explainability. Hence the use of Explainable Artificial Intelligence (XAI), which attempts to make ML models more interpretable to humans. Explainable AI strives to produce more transparent, responsible, and explainable models while preserving their impressive predictive performance. In the healthcare sector, explainability is even more vital: if ML models are applied to detect a disease, physicians will want to know which features or attributes the model emphasizes when making its decision, as not all attributes of a disease are equally important for diagnosis. XAI can address this by making the predictions of ML models more understandable to physicians and thereby gaining their confidence in the predictions of ML
models.
Our contributions in this paper are as follows:
• We have trained three ML algorithms, XGBoost, CatBoost, and AdaBoost classifiers, to detect hypothyroidism from a comma-separated values (csv) dataset collected from the UCI Machine Learning Repository [11].
• We have presented an extensive comparison between these three models using various quantitative performance evaluation parameters and have shown that XGBoost achieved the highest accuracy among the three models.
• We have utilized two XAI methods, LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations), to interpret the predictions of the model.
II. RELATED WORKS
Zhang et al. [12] utilized nine ML classifiers on a US patient dataset containing 4765 records and showed the random forest classifier to have the highest accuracy. Tyagi et al. [13] used Support Vector Machine (SVM), KNN, and Decision Trees to predict a patient's likelihood of having thyroid disease on a dataset from the UCI Machine Learning Repository; in their work, SVM showed the highest accuracy, 99.63%. Chaubey et al. [14] used logistic regression, decision trees, and k-nearest neighbour (kNN) to detect thyroid disease, using a dataset from the UC Irvine knowledge discovery in databases archive; in their work, kNN achieved the highest accuracy, 96.87%. Aversano et al. [15] collected patient data from the "AOU Federico II" Naples hospital and applied various ML algorithms to detect thyroid disease; among the classifiers, the Extra Trees classifier had the highest accuracy of 84%. Salman and Emrullah [16] used support vector machines, random forest, decision tree, naïve Bayes, logistic regression, k-nearest neighbours, multilayer perceptron (MLP), and linear discriminant analysis, and showed that random forest achieved 98.93% accuracy. Ha and Baek [17] described future developmental directions of computer-aided diagnosis (CAD) for the individualized and optimal detection of thyroid nodules, together with an overview of the AI-based CAD systems that are now being employed for thyroid nodules. Vadhiraj et al. [18] demonstrated that an SVM model achieved higher accuracy than an artificial neural network in detecting thyroid nodules from ultrasound images.
III. METHODOLOGY
A brief overview of our workflow is depicted in Fig. 1.
A. Dataset Description
We collected our csv dataset from the UCI Machine Learning Repository [11]. The dataset consisted of 3771 patient records with 29 features and one column of target labels.
Fig. 1: Our workflow to detect hypothyroidism
There were two target labels: positive and negative. The positive class is for patient records in which hypothyroidism was detected, and the negative label is for patients who were not diagnosed with hypothyroidism.
B. Data Cleaning and Preprocessing
1) Redundant Feature Removal: We searched for null values in our dataset and found that the Thyroxine-binding globulin (TBG) feature was absent in all patient records, so we dropped this feature from the dataset. We also removed the 'T3 measured', 'TT4 measured', 'T4U measured', 'FTI measured', and 'TBG measured' features, as they only indicate whether a patient had these hormones tested; the hormone quantities themselves are not in these columns. We also removed the 'referral source' feature. The feature-removal process yielded a dataset of 21 features per patient record.
2) Feature and Label Encoding: We encoded the categorical feature values with the OrdinalEncoder utility and the categorical class labels with the LabelEncoder utility. As there were two classes, we ended up with two labels, 0 and 1, for the negative and positive classes respectively. We replaced the NaN values with the mean values of the respective classes. After all the cleaning and preprocessing steps were carried out, the dataset contained 21 features for its 3420 positive and 291 negative records.
3) Train Test Split: To train and test our models, we split our dataset 8:2, meaning 80% of the data were used as training data and 20% as test data. The split was carried out with the help of scikit-learn. The test set remained unseen by the models throughout training to avoid bias in the reported accuracy. We trained and tested all three of our models with this processed dataset.
C. Data Visualization
Our processed dataset had two classes: Hypothyroid and Non Hypothyroid. Hypothyroid was labeled as '1' and Non Hypothyroid as '0'. The numbers of instances in the Hypothyroid and Non Hypothyroid classes were 3,420 and 291 respectively.
There were a total of 21 independent variables and one dependent variable in our processed dataset. A correlation matrix of these variables is depicted in Fig. 2. In the correlation matrix, darker cells represent higher correlation between variables and vice versa. 'binaryClass' is our dependent variable, which states whether a patient is suffering from hypothyroidism. It is apparent from the correlation matrix that hypothyroidism is highly correlated with the TSH (thyroid-stimulating hormone), T3 (triiodothyronine), TT4 (thyroxine), T4U, and FTI (free T4 index) variables.
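A minimal sketch of the preprocessing, splitting, and correlation steps described above, assuming pandas and scikit-learn; the file name, missing-value marker, and the global-mean imputation are illustrative simplifications (the paper imputes class-wise means):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Load the UCI hypothyroid csv (file name and "?" missing marker are placeholders).
df = pd.read_csv("hypothyroid.csv").replace("?", pd.NA)

# Drop the fully missing TBG column, the "... measured" indicator columns,
# and the referral source, as described above.
df = df.drop(columns=["TBG", "TBG measured", "T3 measured", "TT4 measured",
                      "T4U measured", "FTI measured", "referral source"])

y = LabelEncoder().fit_transform(df["binaryClass"])   # in the paper: 0 = negative, 1 = positive
X = df.drop(columns=["binaryClass"])

# Ordinal-encode the categorical columns, then impute remaining NaNs with column means.
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols].fillna("missing").astype(str))
X = X.apply(pd.to_numeric, errors="coerce")
X = X.fillna(X.mean())

# 80/20 split; the test set is held out from all training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Correlation matrix corresponding to Fig. 2.
corr = X.assign(binaryClass=y).corr()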
Fig. 2: Correlation matrix between the independent and dependent variables of the dataset
D. Model Selection
In this paper we selected three supervised models, XGBoost, CatBoost, and AdaBoost classifiers, to detect hypothyroidism in patients, and we present an extensive comparison between these three models on various quantitative evaluation metrics obtained on our test set. Our dataset is represented as D = {(x_i, y_i)}, i = 1, 2, ..., N, where x_i = [x_{i1}, x_{i2}, x_{i3}, ..., x_{in}] is the feature vector of an instance or record and y_i \in {0, 1}, as our task is a binary classification task. We need our ML function y = f(x) to produce predictions \hat{y}_k = f(x_k) that are as close as possible to the true labels y_k.
1) AdaBoost (Adaptive Boosting): Adaptive boosting combines a number of weak classifiers with a weighted voting approach to form a strong classifier. AdaBoost combines weak classifiers h_m(x) to generate a strong classifier H(x) for data classification or regression tasks [19], [20]:

H(x) = \mathrm{sign}\!\left(\sum_{m=1}^{M} \alpha_m\, h_m(x)\right)

where the \alpha_m are scalar weights and x is the input data. AdaBoost re-weights ineffective learners in order to prioritize incorrectly classified data samples. However, it is extremely sensitive to noise as well as outliers [21]. In our work, the base estimator is a decision tree classifier initialized with a maximum depth of 1, the number of estimators is 50, and the learning rate is 1.
2) XGBoost (eXtreme Gradient Boosting): Chen and Guestrin [22] developed the XGBoost (eXtreme Gradient Boosting) algorithm in 2016. The algorithm works by integrating multiple decision trees and is based on gradient boosting, in which new decision trees are created to predict the residuals of the previous trees and are then added together to make the final prediction [23]. The residual is the difference between the true value and the predicted value. XGBoost exposes various parameters that can be tuned for higher efficiency [24].
3) CatBoost: CatBoost is an updated form of gradient-boosted decision trees. CatBoost is trained as a set of decision trees; each tree learns from the previous tree and influences the succeeding one. CatBoost uses an ordered boosting mechanism and mitigates overfitting by using several permutations of the training batch [25]. The output of CatBoost [26] is given by

Z = H(x_i) = \sum_{j=1}^{J} c_j\, \mathbf{1}_{\{x_i \in R_j\}}

where the R_j are the regions (leaves) induced by the trees and the c_j are the corresponding leaf values.
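A compact sketch of how the three classifiers described above could be instantiated and fitted, continuing the variable names from the preprocessing snippet and assuming the scikit-learn, xgboost, and catboost packages; the AdaBoost hyperparameters follow the values stated in the text, the rest are illustrative:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

models = {
    # Base estimator, number of estimators and learning rate as stated above
    # (use base_estimator= on scikit-learn versions older than 1.2).
    "AdaBoost": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=50, learning_rate=1.0, random_state=42),
    # Settings for XGBoost and CatBoost are illustrative defaults.
    "XGBoost": XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42),
    "CatBoost": CatBoostClassifier(iterations=300, verbose=0, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))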
E. Explainable AI (XAI) Frameworks
1) LIME (Local Interpretable Model-agnostic Explanations): Providing an explanation of a model's prediction improves dependability; in the healthcare sector especially, a physician's confidence in the model's prediction is vital, and explainable AI can play an important role in addressing this issue. LIME, or Local Interpretable Model-agnostic Explanations, is an explainable AI framework originally proposed by Ribeiro et al. [27], and it has since become a popular XAI framework for its simplicity and accessibility. 'Local' means that it illustrates how a model categorizes a specific observation [28]. 'Interpretable' stands for understandable by humans. 'Model-agnostic' means the framework can be used to interpret any ML model, whether it operates on textual or image data and whether the problem is classification or regression. 'Explanations' means the framework establishes a comprehensible relation between the input and the model's prediction. LIME produces its explanation by solving the following optimisation [27]:

\xi(x) = \operatorname*{arg\,min}_{g \in G}\; \mathcal{L}(f, g, \pi_x) + \Omega(g)

where f is the model being explained, g is an interpretable surrogate model from a family G, \pi_x is a locality kernel around the instance x, and \Omega(g) penalises the complexity of g.
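A brief usage sketch with the lime package, carrying over the variable names from the earlier snippets; num_features and the explained record are illustrative:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["Non Hypothyroid", "Hypothyroid"],
    mode="classification")

# Explain one test record, e.g. the patient of Table II.
exp = explainer.explain_instance(
    X_test.iloc[0].values, models["XGBoost"].predict_proba, num_features=10)
print(exp.as_list())   # (feature rule, weight) pairs as visualised in Fig. 6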
2) SHAP (SHapley Additive exPlanations): SHAP, or SHapley Additive exPlanations, is a unified framework for model explanation developed by Lundberg and Lee [29]. It uses the Shapley value of each feature to explain a prediction, illustrating the contribution of each feature to the prediction made by the model; in other words, it tells us how the prediction is fairly distributed among the features. SHAP is a local explanation framework. The Shapley value of a feature i is given by

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \right]

where F is the set of all features and f_S denotes the model evaluated on the feature subset S.
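A minimal sketch using the shap package's tree explainer, which is appropriate for the gradient-boosting models above; the plot calls and the chosen record are illustrative:

import shap

explainer = shap.TreeExplainer(models["XGBoost"])
shap_values = explainer.shap_values(X_test)

# Per-instance explanation (force plot) for one test record,
# and the global beeswarm summary corresponding to Fig. 8.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
shap.summary_plot(shap_values, X_test)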
TABLE I: Comparative table of performance evaluation parameters conducted on the test set

Parameter  | Class           | XGBoost | AdaBoost | CatBoost
Accuracy   |                 | 99.87%  | 99.73%   | 99.75%
Precision  | Non Hypothyroid | 0.98    | 0.98     | 0.97
Precision  | Hypothyroid     | 1       | 1        | 1
Recall     | Non Hypothyroid | 1       | 0.98     | 1
Recall     | Hypothyroid     | 1       | 1        | 1
F1 Score   | Non Hypothyroid | 0.99    | 0.98     | 0.98
F1 Score   | Hypothyroid     | 1       | 1        | 1
Fig. 4: Confusion Matrix of AdaBoost
IV. RESULT ANALYSIS
A. Performance Evaluation on Quantitative Parameters
We trained and evaluated our models on the same training and test sets. The evaluation parameters used to compare the models are precision, recall, F1-score, and accuracy, all quantified on the test set. Our test set contained 62 Non Hypothyroid patient records and 681 Hypothyroid patient records. Table I presents the quantitative comparison between the models. From Table I we can see that, although the accuracies of all three models are almost identical, XGBoost performed better by a small margin, with an accuracy of 99.87%; it was 0.14 percentage points more accurate than AdaBoost and 0.12 percentage points more accurate than CatBoost. In terms of precision, XGBoost and AdaBoost performed the same for both classes, 0.98 for Non Hypothyroid and 1.00 for Hypothyroid, while CatBoost scored 0.01 lower than the other two models on the Non Hypothyroid class and the same on the Hypothyroid class. XGBoost and CatBoost scored 1 for both classes in recall, while AdaBoost scored 0.02 lower on the Non Hypothyroid class. AdaBoost and CatBoost had identical F1 scores for both classes; XGBoost scored 0.01 higher on the Non Hypothyroid class and 1 on the Hypothyroid class.
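For completeness, the reported metrics can be reproduced with scikit-learn's standard utilities; a short sketch continuing the earlier variable names:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred,
                                target_names=["Non Hypothyroid", "Hypothyroid"]))
    print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class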
Fig. 5: Confusion Matrix of CatBoost
Figs. 3, 4, and 5 depict the confusion matrices of the XGBoost, AdaBoost, and CatBoost predictions on the test set. Non Hypothyroid is labeled as 0 and Hypothyroid as 1 in the confusion matrices. XGBoost classified all 62 Non Hypothyroid instances in our test set correctly; AdaBoost incorrectly classified one Non Hypothyroid case, while CatBoost misclassified none. XGBoost correctly classified 680 of the 681 Hypothyroid records, AdaBoost's performance on the Hypothyroid class was identical to XGBoost's, and CatBoost incorrectly classified two of the 681 Hypothyroid records. From the quantitative performance evaluation discussed above, it is apparent that XGBoost slightly outperformed AdaBoost and CatBoost. We therefore applied the LIME and SHAP frameworks to interpret the predictions made by XGBoost on our dataset.
B. Model Interpretation with XAI Frameworks
Fig. 3: Confusion Matrix of XGBoost
1) LIME: We performed a LIME interpretation on a record from our test set. The attributes of the record are shown in Table II. Categorical features with 'no' values are labeled 0.00 and those with 'yes' values are labeled 1.00. The LIME interpretation of XGBoost for the patient of Table II is depicted in Fig. 6. The features are arranged in descending
TABLE II: Attributes of a patient record from the test set used for interpretation with LIME and SHAP

Feature             | Value
TSH                 | 3.00
on thyroxine        | 0.00
thyroid surgery     | 0.00
hypopituitary       | 0.00
tumor               | 0.00
TT4                 | 121.00
I131 Treatment      | 0.00
query on thyroxine  | 0.00
T3                  | 2.5
lithium             | 0.00
order based on their impact on the model's prediction. Features with orange bars in the LIME illustration push the classification towards Hypothyroid, and features with blue bars push it towards Non Hypothyroid. We can see that XGBoost classified the patient with label '1', meaning the patient is affected by hypothyroidism. LIME indicates that the thyroid-stimulating hormone ('TSH') value of the patient is the most impactful feature for the Hypothyroid classification, since the TSH level is greater than 1.6 and less than or equal to 3.4; Table II confirms that the patient's TSH level is 3.00, which lies in this range. Whether the patient was on thyroxine pushed the prediction towards Non Hypothyroid: since the patient's 'on thyroxine' value was less than or equal to 0.00, it contributed to a Non Hypothyroid classification, and Table II shows the patient's 'on thyroxine' value is 0.00, meaning he/she was not on thyroxine. Similarly, hypopituitary ≤ 0.00, thyroid surgery ≤ 0.00, 106.00 < TT4 ≤ 123.00, and goitre ≤ 0.00 pushed the prediction towards Non Hypothyroid. Apart from TSH, lithium ≤ 0.00, pregnant ≤ 0.00, T3 > 2.30, and query on thyroxine ≤ 0.00 pushed the model's prediction towards Hypothyroid.
Fig. 7: SHAP interpretation on individual test record decision of XGBoost
Fig. 7 depicts the SHAP interpretation of the XGBoost decision on the same patient record of Table II. Patient attributes shown in red drive the model's decision towards '1', or Hypothyroid, and attributes shown in blue drive the prediction towards '0', or Non Hypothyroid. In Fig. 7, the red-marked attributes T3 (1), TT4 (75), and TSH (5.4) drive the model's prediction towards Hypothyroid, while the blue-marked attributes on thyroxine (0), age (72), and thyroid surgery (0) drive the prediction towards Non Hypothyroid.
Fig. 8: SHAP interpretation summary of features’ impacts on XGBoost decision
Fig. 6: LIME interpretation of an XGBoost decision
2) SHAP: SHAP can summarize the impact of the features on the model's predictions both overall and for individual instances. It interprets the model's predictions by plotting the Shapley values of the features.
Fig. 8 presents a summary of the features' impacts on the overall predictions of XGBoost. In this figure, the rows are the features of the patient records and the points are the SHAP values of individual instances of each feature. Redder points correspond to higher feature values and bluer points to lower feature values. The x-axis represents the SHAP values of the features. A higher SHAP value means the feature pushes the model's decision towards '1', or Hypothyroid, and a lower SHAP value means the feature pushes the prediction towards
'0', or Non Hypothyroid. The figure depicts that a higher TSH value drives the model's decision towards Non Hypothyroid and a lower TSH value drives it towards Hypothyroid. On the contrary, higher values of the on thyroxine, TT4, T3, FTI, thyroid surgery, age, and T4U features drive the model's decision towards Hypothyroid, and vice versa. Thus, SHAP can be utilized to obtain an overview of each feature's impact on a model's predictions.
V. CONCLUSION
In this work, we have addressed the issue of hypothyroidism in the world population by using three ML algorithms, XGBoost, CatBoost, and AdaBoost, to detect hypothyroidism among patients. We have demonstrated a comparative analysis among the models and shown that XGBoost achieved the highest accuracy of the three, around 99.87%. We have also emphasized the use of XAI in the healthcare sector to gain physicians' confidence in the predictions made by ML models for disease detection. We used two XAI frameworks, SHAP and LIME, to interpret the decisions made by XGBoost in detecting hypothyroidism from the patient records of our dataset. In future, we plan to use a dataset with demographic features and to achieve higher accuracy by further fine-tuning our models.
REFERENCES
[1] D. Y. Gaitonde, K. D. Rowley, and L. B. Sweeney, “Hypothyroidism: an update,” South African Family Practice, vol. 54, no. 5, pp. 384–390, 2012.
[2] S. W. W. Gull, On Cretinoid State Supervening in Adult Life in Women..., 1873.
[3] R. Guglielmi, F. Grimaldi, R. Negro, A. Frasoldati, I. Misischi, F. Graziano, C. Ciprì, E. Guastamacchia, V. Triggiani, and E. Papini, “Shift from levothyroxine tablets to liquid formulation at breakfast improves quality of life of hypothyroid patients,” Endocrine, Metabolic & Immune Disorders-Drug Targets (Formerly Current Drug Targets-Immune, Endocrine & Metabolic Disorders), vol. 18, no. 3, pp. 235–240, 2018.
[4] L. Chiovato, F. Magri, and A. Carlé, “Hypothyroidism in context: where we’ve been and where we’re going,” Advances in Therapy, vol. 36, no. 2, pp. 47–58, 2019.
[5] W. J. Hueston, “Treatment of hypothyroidism,” American Family Physician, vol. 64, no. 10, p. 1717, 2001.
[6] Y. Aoki, R. M. Belin, R. Clickner, R. Jeffries, L. Phillips, and K. R. Mahaffey, “Serum TSH and total T4 in the United States population and their association with participant characteristics: National Health and Nutrition Examination Survey (NHANES 1999–2002),” Thyroid, vol. 17, no. 12, pp. 1211–1223, 2007.
[7] I. E. Kabir, R. Abid, A. S. Ashik, K. K. Islam, and S. K. Alam, “Improved strain estimation using a novel 1.5D approach: Preliminary results,” in 2016 International Conference on Medical Engineering, Health Informatics and Technology (MediTec), 2016, pp. 1–5.
[8] R. A. Mukaddim, J. Shan, I. E. Kabir, A. S. Ashik, R. Abid, Z. Yan, D. N. Metaxas, B. S. Garra, K. K. Islam, and S. K. Alam, “A novel and robust automatic seed point selection method for breast ultrasound images,” in 2016 International Conference on Medical Engineering, Health Informatics and Technology (MediTec), 2016, pp. 1–5.
[9] A. S. Hosain, M. Islam, M. H. K. Mehedi, I. E. Kabir, and Z. T. Khan, “Gastrointestinal disorder detection with a transformer based approach,” in 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2022, pp. 0280–0285.
[10] A. K. M. S. Hosain, M. H. K. Mehedi, and I. E. Kabir, “PCONet: A convolutional neural network architecture to detect polycystic ovary syndrome (PCOS) from ovarian ultrasound images,” 2022. [Online].
Available: https://arxiv.org/abs/2210.00407 [11]
[12] B. Zhang, J. Tian, S. Pei, Y. Chen, X. He, Y. Dong, L. Zhang, X. Mo, W. Huang, S. Cong et al., “Machine learning–assisted system for thyroid nodule diagnosis,” Thyroid, vol. 29, no. 6, pp. 858–867, 2019. [13] A. Tyagi, R. Mehra, and A. Saxena, “Interactive thyroid disease prediction system using machine learning technique,” in 2018 Fifth international conference on parallel, distributed and grid computing (PDGC). IEEE, 2018, pp. 689–693. [14] G. Chaubey, D. Bisen, S. Arjaria, and V. Yadav, “Thyroid disease prediction using machine learning approaches,” National Academy Science Letters, vol. 44, no. 3, pp. 233–238, 2021. [15] L. Aversano, M. L. Bernardi, M. Cimitile, M. Iammarino, P. E. Macchia, I. C. Nettore, and C. Verdone, “Thyroid disease treatment prediction with machine learning approaches,” Procedia Computer Science, vol. 192, pp. 1031–1040, 2021. [16] E. Sonuc¸ et al., “Thyroid disease classification using machine learning algorithms,” in Journal of Physics: Conference Series, vol. 1963, no. 1. IOP Publishing, 2021, p. 012140. [17] E. J. Ha and J. H. Baek, “Applications of machine learning and deep learning to thyroid imaging: where do we stand?” Ultrasonography, vol. 40, no. 1, p. 23, 2021. [18] V. V. Vadhiraj, A. Simpkin, J. O’Connell, N. Singh Ospina, S. Maraka, and D. T. O’Keeffe, “Ultrasound image classification of thyroid nodules using machine learning techniques,” Medicina, vol. 57, no. 6, p. 527, 2021. [19] W. Hu, W. Hu, and S. Maybank, “Adaboost-based algorithm for network intrusion detection,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 2, pp. 577–583, 2008. [20] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch, “Adaboost for text detection in natural scene,” in 2011 International conference on document analysis and recognition. IEEE, 2011, pp. 429–434. [21] K. Randhawa, C. K. Loo, M. Seera, C. P. Lim, and A. K. Nandi, “Credit card fraud detection using adaboost and majority voting,” IEEE access, vol. 6, pp. 14 277–14 284, 2018. [22] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. [23] K. Davagdorj, V. H. Pham, N. Theera-Umpon, and K. H. Ryu, “Xgboostbased framework for smoking-induced noncommunicable disease prediction,” International Journal of Environmental Research and Public Health, vol. 17, no. 18, p. 6513, 2020. [24] A. Ogunleye and Q.-G. Wang, “Xgboost model for chronic kidney disease diagnosis,” IEEE/ACM transactions on computational biology and bioinformatics, vol. 17, no. 6, pp. 2131–2140, 2019. [25] S. Hussain, M. W. Mustafa, T. A. Jumani, S. K. Baloch, H. Alotaibi, I. Khan, and A. Khan, “A novel feature engineered-catboost-based supervised machine learning framework for electricity theft detection,” Energy Reports, vol. 7, pp. 4425–4436, 2021. [26] L. C. Fang, Z. Ayop, S. Anawar, N. F. Othman, N. Harum, and R. S. Abdullah, “Url phishing detection system utilizing catboost machine learning approach,” International Journal of Computer Science & Network Security, vol. 21, no. 9, pp. 297–302, 2021. [27] M. T. Ribeiro, S. Singh, and C. Guestrin, “” why should i trust you?” explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144. [28] J. Dieber and S. Kirrane, “Why model why? 
assessing the strengths and limitations of lime,” arXiv preprint arXiv:2012.00093, 2020. [29] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
2022 25th International Conference on Computer and Information Technology (ICCIT) 17-19 December 2022, Cox’s Bazar, Bangladesh
Novel Memristor-based Energy Efficient Compact 5T4M Ternary Content Addressable Memory
Md Hasan Maruf 1 and Syed Iftekhar Ali 2
1 Department of Electrical and Electronic Engineering, Green University of Bangladesh, Dhaka-1207, Bangladesh
2 Department of Electrical and Electronic Engineering, Islamic University of Technology, Board Bazar, Gazipur 1704, Bangladesh
Email- [email protected], [email protected]
Abstract— Memristor-based ternary content addressable memory (MTCAM) is a form of special memory in which memristors, rather than transistors, control the primary operation. A memristor is a particular kind of two-terminal passive element that retains its data when the power goes down. This paper proposes a novel 5T4M MTCAM that is compact in size, efficient in energy consumption, and capable of restoring data. The proposed design uses the BSIM 32nm CMOS PTM as the transistor model and a modified Biolek model as the memristor model for simulation. A 16x16 MTCAM array has been used with pre-charge-low matchline (ML) sensing. The novel MTCAM offers a search time of 633ps and a search energy of 1.65fJ/digit/search, which are lower than those of other existing designs. In addition, the design can restore its data in successive search cycles even though it performs its write and search operations using the same nodes.
Keywords— Memristor, MTCAM, 5T4M, 32nm CMOS Technology, Pre-Charge Low ML Sensing
I. INTRODUCTION
A semiconductor memory called Ternary Content Addressable Memory (TCAM) allows a large lookup table to be searched in one clock cycle. It can return the address of the matching data by comparing the input data with the table of stored data [1]. This distinct feature makes TCAM suitable for use in associative memory, pattern matching, internet data processing, packet forwarding, and storage of tag bits in processor caches [2]. TCAM differs from binary CAM because it may store three states: logic '0', logic '1', and don't care 'X'. The don't care state allows versatile search operations by matching both logic '0' and '1'. Since it has a don't care state, TCAM can execute the longest prefix match (LPM), and it can use a priority encoder to implement LPM for multiple matches [3]. The conventional TCAM structure consists of two Static Random Access Memories (SRAMs) and a comparison circuit. Conventional SRAM is designed using six CMOS transistors, and the comparison circuit needs four CMOS transistors [4]. As a result, the traditional TCAM needs sixteen CMOS transistors, which require a large chip area. Furthermore, the power density of conventional TCAM is high because of the standby sub-threshold leakage current, which is a significant concern in CMOS-based technology [5, 6]. In light of this, numerous design approaches have been put forth at various points to lower power consumption and also reduce chip area.
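To make the don't-care matching and longest-prefix-match behaviour described above concrete, a small software analogy follows (purely illustrative; a real TCAM performs this comparison in hardware in a single cycle, with a priority encoder selecting among multiple matches):

def tcam_lookup(table, key):
    """table: list of (pattern, payload); patterns use '0', '1', 'X' (don't care).
    Returns the payload of the longest-prefix (most specific) matching entry."""
    def matches(pattern, key):
        return all(p in ("X", k) for p, k in zip(pattern, key))
    # A hardware priority encoder picks one of the matching rows; here the
    # entries are ranked by the number of non-'X' bits (prefix length).
    candidates = [(sum(p != "X" for p in pat), payload)
                  for pat, payload in table if matches(pat, key)]
    return max(candidates)[1] if candidates else None

routing_table = [
    ("10XXXXXX", "port A"),   # 2-bit prefix
    ("1010XXXX", "port B"),   # 4-bit prefix (more specific)
]
print(tcam_lookup(routing_table, "10101111"))   # -> port B
print(tcam_lookup(routing_table, "10011111"))   # -> port A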
Promising emerging nonvolatile memories, such as memristors [7], magneto tunneling junction-based devices [8], spintronic nanodevices [2, 9], carbon nanotube field-effect transistors [10], etc., are being explored to solve the mentioned problems. The memristor is the most promising among the different nonvolatile memories because it offers strong compatibility with CMOS, high-speed performance, and small area [11, 12]. A memristor is a two-terminal passive device, also known as a memory resistor, whose resistance depends on the magnitude, direction, and duration of the applied voltage [13]. It remembers its last value when the power is off, which makes it a nonvolatile device. Figure 1 shows the basic block structure of a memristor. Various authors have proposed different memristor-based TCAM (MTCAM) designs; among them, [14] and [15] are the most interesting because they claim low delay, efficient area, and low power consumption. In [14], the authors designed a 5T2M MTCAM cell in a 180nm CMOS process and obtained a search delay of 2ns and a search energy of 0.99fJ/bit/search for a word width of 128 bits; they failed to achieve higher storage density because they used an older technology node. In [15], the authors designed a 5T2M MTCAM cell using 180nm CMOS technology and simulated it with a small 2x4 array, claiming a cell area at least 27% smaller than that of [14]. Neither design explains data-restoring capability, because the same nodes are used for the write and search operations. In this paper, a novel energy-efficient 5T4M TCAM is proposed, which can restore the data and further improves several performance parameters. Since the overall design uses the Current Race (CR) matchline (ML) scheme for match sensing, it adopts the dummy-word concept, which is excellent for controlling the ML charging duration [16]. The remainder of the paper is organized as follows: Section II describes the novel 5T4M TCAM structure with the WRITE and SEARCH operations.
Figure 1: (a) Basic block structure of memristor where there are four regions: Pt electrode (+Ve), doped, undoped, and Pt electrode (-Ve); (b) Symbol of Memristor
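Since the simulations rely on a modified Biolek memristor model, the standard linear-drift formulation that the Biolek window modifies is recalled below for context; this is the textbook model, not the authors' exact parameterisation:

M(w) = R_{ON}\,\frac{w}{D} + R_{OFF}\left(1 - \frac{w}{D}\right), \qquad
\frac{dw}{dt} = \mu_v\,\frac{R_{ON}}{D}\, i(t)\, f_B(w, i), \qquad
f_B(w, i) = 1 - \left(\frac{w}{D} - \mathrm{stp}(-i)\right)^{2p}

where w is the doped-region width, D the device thickness, \mu_v the dopant mobility, and f_B the Biolek window function with step function stp(·) and positive integer p.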
Section III presents the simulation results from which search time, voltage margin, search energy, etc., can be determined. Section IV concludes this paper.
II. NOVEL 5T4M TCAM
The proposed memristor-based TCAM cell comprises four memristors and five transistors (N-type), as shown in Figure 2(a). Mn1 and Mn2 are the access transistors, and the memristors ME1 and ME2 are connected in series with Mn1 and Mn2, respectively. The drain of the Mn3 transistor is connected to ML; this decision-making transistor therefore indicates whether the data is matched or mismatched. The Mn4 and Mn5 transistors are used for the write operation. Two more memristors, ME3 and ME4, are used to store the same data; this stored data can be read at any time (before or after a search), which is another novel contribution of this TCAM cell.
A. WRITE Operation
The proposed TCAM cell stores three states: low '0', high '1', and don't care 'X'. The ternary encoding table is shown in Figure 2(b). During the write operation, WSL and WR must be high to write the data into the memory cell. According to the table, when DS=0 and DSB=1, the TCAM cell stores a low '0'; in this case ME1 is in the low resistance state (RON) and ME2 is in the high resistance state (ROFF). The opposite situation occurs when DS and DSB receive 1 and 0, respectively; this time the TCAM cell stores a high '1', with ME1 in the high resistance state (ROFF), which means the current is in the forward direction, and ME2 in the low resistance state (RON). When DS and DSB are both 0, the cell stores don't care 'X', which indicates local masking; in this case both memristors, ME1 and ME2, are in the high resistance state (ROFF).
B. SEARCH Operation
In the search operation, ML is pre-charged to low, as the design uses the Current Race (CR) scheme as the Match Line Sense Amplifier (MLSA), shown in Figure 3. The same DS and DSB pins are used for the search operation. To search for a low '0', DS and DSB are set to 0 and 1, respectively. The voltage at VG2 is the result of a resistor divider that, depending on the search result, should turn the Mn3 transistor on or off when both Mn1 and Mn2 are activated [14]. According to equations (1) and (2), if the stored data matches the search data, VG2 is low, below the threshold voltage of Mn3. As a result, ML continues to be charged, and the final output (MLSO) goes high.
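A behavioural sketch of the encoding and match decision described above, written in Python purely as an illustration of the logic; the resistance values, supply voltage, threshold, and the assumed divider topology are illustrative, not the paper's simulation parameters:

R_ON, R_OFF = 1e3, 1e6          # illustrative memristor resistance states (ohms)
VDD = 1.0                       # illustrative search-drive voltage (volts)
VTH_MN3 = 0.6                   # illustrative threshold, chosen between VDD/2 and VDD
                                # so that a stored 'X' (both memristors at R_OFF) matches

ENCODE = {                      # write encoding of Figure 2(b): (ME1, ME2)
    "0": (R_ON, R_OFF),
    "1": (R_OFF, R_ON),
    "X": (R_OFF, R_OFF),        # don't care: both memristors high-resistive
}

SEARCH = {                      # search drive applied to (DS, DSB)
    "0": (0.0, VDD),
    "1": (VDD, 0.0),
    "X": (0.0, 0.0),            # searching don't care always matches
}

def cell_matches(stored, searched):
    """Return True if the cell keeps Mn3 OFF, i.e. the matchline stays charged."""
    me1, me2 = ENCODE[stored]
    v_ds, v_dsb = SEARCH[searched]
    # Voltage divider between DS and DSB through ME1 and ME2 sets V_G2
    # (assumes ME1 sits on the DS side and ME2 on the DSB side of node G2).
    v_g2 = v_ds + (v_dsb - v_ds) * me1 / (me1 + me2)
    return v_g2 < VTH_MN3

def word_matches(stored_word, search_word):
    # A word matches only if every cell leaves the matchline undisturbed.
    return all(cell_matches(s, q) for s, q in zip(stored_word, search_word))

print(word_matches("10X", "101"), word_matches("10X", "100"))  # True True
print(word_matches("10X", "111"))                              # False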
Figure 2: (a) Proposed 5T4M TCAM (b) Ternary encoding table
The above operations and equations are also valid for the match and mismatch searches when the search data is high '1'. If the search data is don't care 'X', then DS and DSB are both 0; in this case the result is always a match and MLSO goes high. The resistor divider is described by equation 6, and equation 7 states that VG2 should be sufficiently low to turn OFF the Mn3 transistor [14].