2022 10th International Conference on Cyber and IT Service Management (CITSM) | 978-1-6654-6074-3/22/$31.00 ©2022 IEEE | DOI: 10.1109/CITSM56380.2022.9935966
Named-Entity Recognition and Optical Character Recognition for Detecting Halal Food Ingredients: Indonesian Case Study

Dewi Khairani Informatics UIN Syarif Hidayatullah Jakarta, Indonesia [email protected]
Dwi Adi Bangkit Informatics UIN Syarif Hidayatullah Jakarta, Indonesia [email protected]
Nurul Faizah Rozi Informatics UIN Syarif Hidayatullah Jakarta, Indonesia [email protected]
Siti Ummi Masruroh Informatics UIN Syarif Hidayatullah Jakarta, Indonesia [email protected]
Shinta Oktaviana Computer Science Nusa Mandiri University Jakarta, Indonesia [email protected]
Tabah Rosyadi Informatics UIN Syarif Hidayatullah Jakarta, Indonesia [email protected]
Abstract— This study offers a solution that uses OCR and NER technology to read and recognize the composition entities listed on packaged products. The purpose of this study is to help Muslim consumers identify the ingredients of the products they consume; we define three food-category entities in NER: Halal, Haram, and Syubhat (doubtful). The motivation is to meet the need of Muslims for halal products, which should be supported by halal guarantees. Because consumers now have easier access to various imported products, it is hoped that they can freely choose the products they like, including imported ones, while maintaining the halal guarantee. The proposed system uses OCR to scan the composition listed on packaged products and processes the result with a trained NER model. The evaluation of the model yields an F-score of 0.967, and in system testing on 24 packaged products the OCR accuracy is 90% and the accuracy of the NER model for food reading is 84%.

Keywords— Named Entity Recognition, Optical Character Recognition, Ingredients Recognition
I. INTRODUCTION

A Muslim's need for halal products should be supported by halal guarantees [1]. Demographically, the Muslim population in Indonesia reaches 209.1 million people, or 87.2 percent of the total population of Indonesia and 13.1 percent of the entire Muslim population in the world, so the need for the consumption and use of halal products is substantial. This coincides with a growing Muslim preference for halal products as a form of compliance with religious law. LPPOM MUI, one of the bodies that organizes halal assurance in Indonesia, provides the halalmui.org website, where users can search for a product's halal certificate by product name or company name [2]. However, consumers nowadays can freely choose the products they like, including imported products and products that have not been certified by the relevant parties. Faridah's research on halal certification in Indonesia shows that, of 727,617 products in the 2011-2018 period, only 69,985 were certified halal by LPPOM MUI. In other words, only 9.6 percent of the products have been certified, while the rest do not yet have a halal certificate. This does not necessarily mean that a product is haram,
but it could be that it has not yet been submitted for halal certification [3]. In order to help Muslims maintain the halalness of the products they consume in the era of globalization, researchers and academics have studied halal reading systems for foodstuffs. According to Askomi, the systems built so far fall into four types: halal geo-locators, halal scanners, halal directories, and halal recipes [4]. Besides the halal geo-locator, the halal scanner is the technology Muslims most often use to determine the halalness of a product; the methods used are barcodes and OCR. Similar research was presented by Widya & Salsabila, who implemented barcodes to look up product registration codes in a database and determine whether the product is halal [5]. Other research on halal information, by Yuniarti, uses OCR to read the names of food products and check whether a product is registered as halal [6]. However, both studies share a limitation: they can only check products that are already registered in the database. A different approach is Kartiwi's Halal Food Ingredients Identification study, a mobile application that uses Optical Character Recognition to read food composition and look up information about it [7].
Fig. 1. NER implementation
Named-Entity Recognition (NER) is part of Natural Language Processing (NLP) research and is used to extract information for text classification from a document or corpus, such as names of people, locations, organizations, dates, times, and so on, as shown in Fig. 1. NER is implemented in many fields, including machine translation, question-answering
systems, indexing for information retrieval, text classification, and automatic summarization. The goal of NER is to extract names and classify them into several categories according to the correct interpretation [8], as well as to recognize and identify named entities and assign them to predetermined categories [9]. In general, NER has three approaches: rule-based methods, learning-based methods, and hybrid approaches. The rule-based approach relies on the rules and patterns of named entities contained in sentences, defined manually with regular expressions based on linguistic knowledge and entity characteristics; linguistic knowledge can include grammar, context, lexicon, and the algorithms that determine each operation involved [9]. This study implements NER to identify and recognize food entities in the composition of packaged products by applying a rule-based approach. The rules are made based on direct observation of the research objects, considering the patterns and forms of composition commonly found on packaged products. Evaluation is performed using the confusion matrix method.

II. METHODS
We collected data by observing pictures of compositions or ingredient lists, both online and directly on products, in order to study the patterns and forms in which compositions are commonly written on packaged products. The data collection phase also included a literature study: the authors looked for references relevant to the object under study, and this information underpins the theoretical basis, the research methodology, and the application development.

As explained in the previous section, the task of NER is to identify words and tag them in context based on possible combinations of those words, for example by determining the minimum length of words that will be identified as entity names, initial words, and so on. In the identification process, entities such as Person, Organization, Position, and Location require features that reflect the properties of an entity name, such as type, occurrence, and various general measures, at both document and corpus scale. One example of such a feature is the first occurrence of a word in a sentence, because the order in which a word appears can indicate its level of importance [10]. NER supports several types of entities, shown in Table I.

TABLE I. EXAMPLE NER ENTITIES
PERSON: People, including fictional.
NORP: Nationality, religious, or political group.
FAC: Buildings, airports, highways, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Country, city, state.
LOC: Non-GPE locations, mountains.
PRODUCT: Objects, vehicles, food, etc.
EVENT: Events, battles, wars, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Legal documents.
LANGUAGE: Language.
DATE: Absolute or relative dates or periods.
TIME: Times.
PERCENT: Percentages, including "%".
MONEY: Monetary values, including units.
QUANTITY: Measurements, such as weight or distance.
ORDINAL: "first", "second", etc.
CARDINAL: Numerals not covered by the other types.

TABLE II. NER ENTITIES USED IN THIS WORK
Halal: Food that may be consumed, produced, and commercialized by Muslims.
Haram: Food that Muslims are not allowed to consume.
Syubhat: Literally vague or unclear; a case or issue that is not clearly halal or haram.

This research employs spaCy to build the information extraction, or Natural Language Processing, system that processes text for deep learning. spaCy features a very fast statistical entity recognition system that assigns labels to contiguous spans of tokens, and it comes with Part-of-Speech tagging and Named-Entity Recognition algorithms without being overloaded with unnecessary features. Among the features provided by spaCy are tokenization, Part-of-Speech (PoS) tagging, text classification, and Named Entity Recognition [11].
Fig. 2. spaCy architecture [11]
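Since the Haram and Syubhat entities are handled with rules rather than additional training (as described below), the following is a minimal sketch of how such rule patterns could be registered with spaCy's EntityRuler; the labels follow Table II, but the pattern list and the sample ingredient text are illustrative assumptions, not the actual rules used in this work.

import spacy

# Blank Indonesian pipeline; the trained statistical NER can later sit alongside the ruler.
nlp = spacy.blank("id")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical rule patterns for the entity types of Table II.
patterns = [
    {"label": "HARAM", "pattern": [{"LOWER": "gelatin"}, {"LOWER": "babi"}]},
    {"label": "SYUBHAT", "pattern": [{"LOWER": "pengemulsi"}]},
    {"label": "HALAL", "pattern": [{"LOWER": "gula"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Komposisi: gula, pengemulsi, gelatin babi")
for ent in doc.ents:
    print(ent.text, ent.label_)  # gula HALAL, pengemulsi SYUBHAT, gelatin babi HARAM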
spaCy is used for the data training, data testing, and model evaluation processes [12]. Here we also apply a rule base to classify the Haram and Syubhat entities: because the data for these two entities is not significant, we only apply rules for them without further training. Optical Character Recognition (OCR) is then used to read the composition from photos or images before it is processed by the NER model. In this study we use AWS Textract; the service was chosen because the objects used in this study vary in shape and color, and this section explains the installation and implementation of AWS Textract in the system to be built. AWS Textract is a service from Amazon that allows users to detect
text in various types of documents, such as reports, images, and forms. Amazon Textract is based on Amazon's highly scalable deep-learning technology, which analyzes billions of images and videos daily; it is constantly learning from new data, which enables its data extraction to deliver high accuracy across a wide range of use cases [13].
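As a sketch of how a composition image could be sent to Textract from Python, the snippet below calls the synchronous detect_document_text API through boto3 and joins the returned LINE blocks into a single composition string; the file name and AWS region are assumptions, and valid AWS credentials are required.

import boto3

# Hypothetical region and image file; credentials come from the usual AWS configuration.
textract = boto3.client("textract", region_name="ap-southeast-1")

with open("composition.jpg", "rb") as f:
    image_bytes = f.read()

response = textract.detect_document_text(Document={"Bytes": image_bytes})

# Keep only LINE blocks and join them into the ingredient text passed to the NER model.
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
composition_text = " ".join(lines)
print(composition_text)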
Fig. 3. General recognition process by OCR [14]

The OCR system makes it possible to take a book or magazine article, feed it directly into an electronic computer file, and then edit that file with a word processor. It enables the machine to classify the optical patterns contained in digital images as alphanumeric or other characters; character recognition is achieved through the steps of segmentation, feature extraction, and classification [15]. The next step is template matching, an image-analysis process that identifies the inherent properties of each character, also called the features of an object in the image. These characteristics are used to describe an object or one of its attributes, and the features possessed by a character can then be used in the recognition process [14]. The application designed in this study is an alternative tool for users to determine, from the composition listed on the packaging, whether a packaged product they want to consume contains non-halal ingredients.

III. RESULT AND DISCUSSION

The application is designed as a web application that applies Optical Character Recognition through the AWS Textract service and Named Entity Recognition with the spaCy library. It can take a picture of a product, read the list of composition printed on it, show the food entities listed on the packaging, and indicate which of them are haram entities. The workflow of the proposed system is shown in Fig. 4.

Fig. 4. Proposed solution

A. Training Process

The first step before training is to combine the food dataset with the revised general-entity dataset; the two datasets are merged and written to train.spacy, the spaCy binary training format, as shown below.
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("id")  # blank pipeline (Indonesian assumed) provides the tokenizer
db = DocBin()  # create DocBin object
TRAIN_DATA = TRAIN_REVISION_DATA + TRAIN_FOOD_DATA  # revised general-entity data + food data

for text, annot in tqdm(TRAIN_DATA):
    doc = nlp.make_doc(text)  # create a Doc object from the training text
    ents = []
    for start, end, label in annot["entities"]:  # annotation indices
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # assign the entities to the Doc
    db.add(doc)

db.to_disk("./train.spacy")  # save to disk in the train.spacy format used by spaCy for training
The training process is carried out for approximately 40 minutes on the Google Colab platform; the progress of the training run is shown in Fig. 5.
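A training run of this kind is typically launched through spaCy's command-line interface; the sketch below does so from Python, assuming a config.cfg generated by spacy init config and a hypothetical held-out dev.spacy file, neither of which is described in the paper.

import subprocess

# Generate a default NER training config for Indonesian (language code "id" assumed).
subprocess.run(
    ["python", "-m", "spacy", "init", "config", "config.cfg",
     "--lang", "id", "--pipeline", "ner"],
    check=True,
)

# Train on train.spacy; dev.spacy is a hypothetical held-out split in the same DocBin format.
subprocess.run(
    ["python", "-m", "spacy", "train", "config.cfg",
     "--output", "./output",
     "--paths.train", "./train.spacy",
     "--paths.dev", "./dev.spacy"],
    check=True,
)

# The best checkpoint is written to ./output/model-best and can be loaded with spacy.load().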
Fig. 5. Training process
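Once training finishes, the resulting pipeline can be loaded and applied to the text returned by the OCR step. The sketch below assumes the output path produced by the command-line training shown above and a hypothetical composition string; it illustrates the intended flow rather than the exact application code.

import spacy

# Load the best checkpoint written by the spaCy training run (path is an assumption).
nlp = spacy.load("output/model-best")

# Hypothetical composition string as it might come back from the OCR step.
composition_text = "Komposisi: gula, perisa sintetik, gelatin, pengemulsi nabati"

doc = nlp(composition_text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # each recognized ingredient with its Halal/Haram/Syubhat label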
B. Testing the application

We tested the performance of the trained model on the previously prepared testing data; the results are shown in Fig. 6.

Fig. 6. Trained entity evaluation score

The model performance evaluation shows satisfactory results for the three food-entity categories: the Halal entity obtains an F1 score of 0.95, the Haram entity 0.97, and the Syubhat entity 0.99. The evaluation of the model on spaCy's general (default) entities is shown in Fig. 7.

Fig. 7. spaCy default entity evaluation

To test the system's performance, we tested 12 food products using two types of devices. One example of the test results is shown in Fig. 8.

Fig. 8. Testing result

TABLE III. TESTING SUMMARY
Devices    OCR result (words)    Model reading result (ingredients)
Device 1   97%                   87%
Device 2   83%                   81%
Average    90%                   84%
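The conclusion notes that these scores were obtained with a confusion matrix and manual calculations; purely as an illustration of that calculation, the sketch below derives per-entity precision, recall, and F1 from such counts. The counts themselves are hypothetical placeholders, not the paper's data.

# Hypothetical per-entity counts from a confusion matrix (not the paper's data).
counts = {
    "Halal":   {"tp": 90, "fp": 5, "fn": 5},
    "Haram":   {"tp": 20, "fp": 1, "fn": 0},
    "Syubhat": {"tp": 15, "fp": 0, "fn": 1},
}

for label, c in counts.items():
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")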
IV. CONCLUSION

Based on the results of the research discussed above, it is concluded that this research has succeeded in using OCR and NER to read and detect halal food ingredients in packaged products by building a NER model and combining it with the OCR service from AWS Textract. Based on tests using the confusion matrix and manual calculations, the Named Entity Recognition model built in this study was able to read Halal,
Haram, and Syubhat entities with an average F-score of 0.967, which indicates that the model's reading accuracy is satisfactory. Furthermore, the OCR performance of AWS Textract can also read the composition of packaged products very well, with an accuracy score of 90%. In the test cases, the combined OCR and NER pipeline can also analyze and read the food entities listed on product packaging with an accuracy score of 79%; although there are still errors in the naming of entities, the application has been able to read most of the expected entities and, most importantly, it can read Haram and Syubhat entities, which is the primary goal of this research.

REFERENCES
[1] A. Feizollah, M. M. Mostafa, A. Sulaiman, Z. Zakaria, and A. Firdaus, "Exploring halal tourism tweets on social media," J. Big Data, 2021, doi: 10.1186/s40537-021-00463-5.
[2] R. Mardiyah, A. U. Ismail, D. Khairani, Y. Durachman, T. Rosyadi, and S. U. Masruroh, "Conceptual Framework on Halal Meat Traceability to Support Indonesian Halal Assurance System (HAS 23000) using Blockchain Technology," 2021, doi: 10.1109/CITSM52892.2021.9588953.
[3] H. D. Faridah, "Halal certification in Indonesia: history, development, and implementation," J. Halal Prod. Res., vol. 2, no. 2, p. 68, 2019, doi: 10.20473/jhpr.vol.2-issue.2.68-78.
[4] A. H. Askomi, F. D. Yusop, and Y. Kamarulzaman, "Combating Halal misconceptions among Muslims and non-Muslims: The potential use of mobile learning application," Int. Halal Conf., pp. 1–10, 2016.
[5] H. Widya and R. Salsabila, "Aplikasi Barcode Scanner Food Halal Pada Produk Makanan Impor Berbasis Android," vol. 1099, pp. 14–17, 2019.
[6] A. Yuniarti, I. Kuswardayan, R. R. Hariadi, S. Arifiani, and E. Mursidah, "Design of integrated latext: Halal detection text using OCR (Optical Character Recognition) and web service," in Proc. 2017 Int. Seminar on Application for Technology of Information and Communication (iSemantic), pp. 137–141, 2018, doi: 10.1109/ISEMANTIC.2017.8251858.
[7] M. Kartiwi, T. S. Gunawan, A. Anwar, and S. S. Fathurohmah, "Mobile Application for Halal Food Ingredients Identification using Optical Character Recognition," in 2018 IEEE 5th Int. Conf. on Smart Instrumentation, Measurement and Application (ICSIMA), pp. 1–4, 2019, doi: 10.1109/ICSIMA.2018.8688756.
[8] M. Y. S. Dirgantara, M. A. Fauzi, and R. S. Perdana, "Penerapan Named Entity Recognition Untuk Mengenali Fitur Produk Pada E-Commerce Menggunakan Rule Template Dan Hidden Markov Model," J. Pengemb. Teknol. Inf. dan Ilmu Komput., vol. 2, no. 10, pp. 3912–3920, 2018.
[9] N. M. Sinta Wahyuni and N. A. Sanjaya ER, "Rule-based Named Entity Recognition (NER) to Determine Time Expression for Balinese Text Document," JELIKU (Jurnal Elektronik Ilmu Komputer Udayana), vol. 9, no. 4, p. 555, 2021, doi: 10.24843/jlk.2021.v09.i04.p14.
[10] T. Pramiyati, I. Supriana, and A. Purwarianti, "Pengenalan Entitas User Profile Pada Twitter," J. INKOM, vol. 8, no. 2, p. 103, 2015, doi: 10.14203/j.inkom.411.
[11] spaCy, "spaCy 101," 2022.
[12] C. Chantrapornchai and A. Tunsakul, "Information extraction on tourism domain using SpaCy and BERT," ECTI Trans. Comput. Inf. Technol., 2021, doi: 10.37936/ecti-cit.2021151.228621.
[13] Amazon Web Services Inc., "Amazon Textract," 2020.
[14] A. F. Mollah, N. Majumder, S. Basu, and M. Nasipuri, "Design of an Optical Character Recognition System for Camera-based Handheld Devices," IJCSI Int. J. Comput. Sci. Issues, vol. 8, no. 4, 2011.
[15] A. Chaudhuri, K. Mandaviya, P. Badelia, and S. K. Ghosh, Optical Character Recognition Systems for Different Languages with Soft Computing, vol. 352, 2017.