PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE ENGINEERING
IKE’18 Editors: Hamid R. Arabnia, Ray Hashemi, Fernando G. Tinetti, Cheng-Ying Yang; Associate Editor: Ashu M. G. Solo
Publication of the 2018 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE’18) July 30 - August 02, 2018 | Las Vegas, Nevada, USA https://americancse.org/events/csce2018
Copyright © 2018 CSREA Press
This volume contains papers presented at the 2018 International Conference on Information & Knowledge Engineering. Their inclusion in this publication does not necessarily constitute endorsements by editors or by the publisher.
Copyright and Reprint Permission Copying without a fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to source is given. Abstracting is permitted with credit to the source. Please contact the publisher for other copying, reprint, or republication permission.
American Council on Science and Education (ACSE)
Copyright © 2018 CSREA Press ISBN: 1-60132-484-7 Printed in the United States of America https://americancse.org/events/csce2018/proceedings
Foreword

It gives us great pleasure to introduce this collection of papers to be presented at the 2018 International Conference on Information & Knowledge Engineering (IKE’18), July 30 – August 2, 2018, at Luxor Hotel (a property of MGM Resorts International), Las Vegas, USA.

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied Computing, CSCE (a federated congress with which this conference is affiliated) includes "Providing a unique platform for a diverse community of constituents composed of scholars, researchers, developers, educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated with diverse entities (such as: universities, institutions, corporations, government agencies, and research centers/labs) from all over the world. The congress also attempts to connect participants from institutions that have teaching as their main mission with those who are affiliated with institutions that have research as their main mission. The congress uses a quota system to achieve its institution and geography diversity objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in the USA. We are proud to report that this federated congress has authors and participants from 67 different nations, representing a variety of personal and scientific experiences that arise from differences in culture and values. As can be seen below, the program committee of this conference, as well as the program committees of all other tracks of the federated congress, is as diverse as its authors and participants.

The program committee would like to thank all those who submitted papers for consideration. About 70% of the submissions were from outside the United States. Each submitted paper was peer-reviewed by two experts in the field for originality, significance, clarity, impact, and soundness. In cases of contradictory recommendations, a member of the conference program committee was charged to make the final decision; often, this involved seeking help from additional referees. In addition, papers whose authors included a member of the conference program committee were evaluated using the double-blind review process. One exception to the above evaluation process was for papers that were submitted directly to chairs/organizers of pre-approved sessions/workshops; in these cases, the chairs/organizers were responsible for the evaluation of such submissions. The overall paper acceptance rate for regular papers was 22%; 19% of the remaining papers were accepted as poster papers (at the time of this writing, we had not yet received the acceptance rate for a couple of individual tracks).

We are very grateful to the many colleagues who offered their services in organizing the conference. In particular, we would like to thank the members of the Program Committee of IKE’18, members of the congress Steering Committee, and members of the committees of federated congress tracks that have topics within the scope of IKE. Many individuals listed below will be requested after the conference to provide their expertise and services for selecting papers for publication (extended versions) in journal special issues as well as for publication in a set of research books (to be prepared for publishers including Springer, Elsevier, BMC journals, and others).
• Prof. Nizar Al-Holou (Congress Steering Committee); Professor and Chair, Electrical and Computer Engineering Department; Vice Chair, IEEE/SEM-Computer Chapter; University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Hamid R. Arabnia (Congress Steering Committee); Graduate Program Director (PhD, MS, MAMS); The University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing (Springer); Fellow, Center of Excellence in Terrorism, Resilience, Intelligence & Organized Crime Research (CENTRIC)
• Dr. Travis Atkison; Director, Digital Forensics and Control Systems Security Lab, Department of Computer Science, College of Engineering, The University of Alabama, Tuscaloosa, Alabama, USA
• Dr. Arianna D'Ulizia; Institute of Research on Population and Social Policies, National Research Council of Italy (IRPPS), Rome, Italy
• Prof. Kevin Daimi (Congress Steering Committee); Director, Computer Science and Software Engineering Programs, Department of Mathematics, Computer Science and Software Engineering, University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Zhangisina Gulnur Davletzhanovna; Vice-rector of the Science, Central-Asian University, Kazakhstan, Almaty, Republic of Kazakhstan; Vice President of International Academy of Informatization, Kazakhstan, Almaty, Republic of Kazakhstan
• Prof. Leonidas Deligiannidis (Congress Steering Committee); Department of Computer Information Systems, Wentworth Institute of Technology, Boston, Massachusetts, USA; Visiting Professor, MIT, USA
• Prof. Mary Mehrnoosh Eshaghian-Wilner (Congress Steering Committee); Professor of Engineering Practice, University of Southern California, California, USA; Adjunct Professor, Electrical Engineering, University of California Los Angeles (UCLA), Los Angeles, California, USA
• Prof. Ray Hashemi (Session Chair, IKE); Professor of Computer Science and Information Technology, Armstrong Atlantic State University, Savannah, Georgia, USA
• Prof. Dr. Abdeldjalil Khelassi; Computer Science Department, Abou beker Belkaid University of Tlemcen, Algeria; Editor-in-Chief, Medical Technologies Journal; Associate Editor, Electronic Physician Journal (EPJ) - PubMed Central
• Prof. Louie Lolong Lacatan; Chairperson, Computer Engineering Department, College of Engineering, Adamson University, Manila, Philippines; Senior Member, International Association of Computer Science and Information Technology (IACSIT), Singapore; Member, International Association of Online Engineering (IAOE), Austria
• Dr. Andrew Marsh (Congress Steering Committee); CEO, HoIP Telecom Ltd (Healthcare over Internet Protocol), UK; Secretary General of World Academy of BioMedical Sciences and Technologies (WABT), a UNESCO NGO, The United Nations
• Dr. Somya D. Mohanty; Department of Computer Science, University of North Carolina - Greensboro, North Carolina, USA
• Dr. Ali Mostafaeipour; Industrial Engineering Department, Yazd University, Yazd, Iran
• Dr. Houssem Eddine Nouri; Informatics Applied in Management, Institut Superieur de Gestion de Tunis, University of Tunis, Tunisia
• Prof. Dr., Eng. Robert Ehimen Okonigene (Congress Steering Committee); Department of Electrical & Electronics Engineering, Faculty of Engineering and Technology, Ambrose Alli University, Nigeria
• Prof. James J. (Jong Hyuk) Park (Congress Steering Committee); Department of Computer Science and Engineering (DCSE), SeoulTech, Korea; President, FTRA; EiC, HCIS Springer, JoC, IJITCC; Head of DCSE, SeoulTech, Korea
• Dr. Prantosh K. Paul; Department of Computer and Information Science, Raiganj University, Raiganj, West Bengal, India
• Dr. Xuewei Qi; Research Faculty & PI, Center for Environmental Research and Technology, University of California, Riverside, California, USA
• Dr. Akash Singh (Congress Steering Committee); IBM Corporation, Sacramento, California, USA; Chartered Scientist, Science Council, UK; Fellow, British Computer Society; Member, Senior IEEE, AACR, AAAS, and AAAI
• Chiranjibi Sitaula; Head, Department of Computer Science and IT, Ambition College, Kathmandu, Nepal
• Ashu M. G. Solo (Publicity); Fellow of British Computer Society; Principal/R&D Engineer, Maverick Technologies America Inc.
• Prof. Fernando G. Tinetti (Congress Steering Committee); School of Computer Science, Universidad Nacional de La Plata, La Plata, Argentina; also at Comision Investigaciones Cientificas de la Prov. de Bs. As., Argentina
• Varun Vohra; Certified Information Security Manager (CISM); Certified Information Systems Auditor (CISA); Associate Director (IT Audit), Merck, New Jersey, USA
• Dr. Haoxiang Harry Wang (CSCE); Cornell University, Ithaca, New York, USA; Founder and Director, GoPerception Laboratory, New York, USA
• Prof. Shiuh-Jeng Wang (Congress Steering Committee); Director of Information Cryptology and Construction Laboratory (ICCL) and Director of Chinese Cryptology and Information Security Association (CCISA); Department of Information Management, Central Police University, Taoyuan, Taiwan; Guest Ed., IEEE Journal on Selected Areas in Communications
• Prof. Layne T. Watson (Congress Steering Committee); Fellow of IEEE; Fellow of The National Institute of Aerospace; Professor of Computer Science, Mathematics, and Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, Virginia, USA
• Prof. Jane You (Congress Steering Committee); Associate Head, Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
We would like to extend our appreciation to the referees and the members of the program committees of individual sessions, tracks, and workshops; their names do not appear in this document, but they are listed on the web sites of the individual tracks.

As sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons) provided help for at least one track of the Congress: Computer Science Research, Education, and Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science & Education & Federated Research Council (http://www.americancse.org/). In addition, a number of university faculty members and their staff (names appear on the cover of the set of proceedings), several publishers of computer science and computer engineering books and journals, chapters and/or task forces of computer science associations/organizations from 3 regions, and developers of high-performance machines and systems provided significant help in organizing the conference as well as providing some resources. We are grateful to them all.

We express our gratitude to keynote, invited, and individual conference/track and tutorial speakers; the list of speakers appears on the conference web site. We would also like to thank the following: UCMSS (Universal Conference Management Systems & Support, California, USA) for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the staff of the Luxor Hotel (Convention department) in Las Vegas for the professional service they provided. Last but not least, we would like to thank the Co-Editors of IKE’18: Dr. Hamid R. Arabnia, Dr. Ray Hashemi, Dr. Fernando G. Tinetti, Dr. Cheng-Ying Yang; and Associate Editor, Ashu M. G. Solo.

We present the proceedings of IKE’18.
Steering Committee, 2018 http://americancse.org/
Contents

SESSION: INFORMATION RETRIEVAL, DATABASES, DATA STRUCTURES, STORAGE METHODS, AND NOVEL APPLICATIONS

The Case for NoSQL on a Single Desktop
Ryan D. L. Engle, Brent T. Langhals, Michael R. Grimaila, Douglas D. Hodson ............ 3

Chaos Game for Data Compression and Encoding
Jonathan Dixon, Christer Karlsson ............ 7

Shed More Light on Bloom Filter's Variants
Ripon Patgiri, Sabuzima Nayak, Samir Kumar Borgohain ............ 14

An Online Interactive Database Platform for Career Searching
Brandon St. Amour, Zizhong John Wang ............ 22

Challenges Archivists Encounter Adopting Cloud Storage for Digital Preservation
Debra Bowen ............ 27

Solar and Home Battery Based Electricity Spot Market for Saudi Arabia
Fathe Jeribi, Sungchul Hong ............ 34

Understanding Kid's Digital Twin
Anahita Mohammadi, Mina Ghanaatian Jahromi, Hamidreza Khademi, Azadeh Alighanbari, Bita Hamzavi, Maliheh Ghanizadeh, Hami Horriat, Mohammad Mehdi Khabiri, Ali Jabbari Jahromi ............ 41

Design and Implementation of a Library Database Using a Prefix Tree in an Undergraduate CS Sophomore Course
Suhair Amer, Adam Thomas, Grant Reid, Andrew Banning ............ 47

SESSION: INFORMATION AND KNOWLEDGE EXTRACTION AND ENGINEERING, PREDICTION METHODS AND DATA MINING, AND NOVEL APPLICATIONS

Human Computer Interaction and Visualization Tools in Health Care Services
Beste Kaysi, Ezgi Kaysi Kesler ............ 55

Knowledge-Based Design Quality in Software Reverse Engineering: An Introductory Review
Chung-Yang Chen, Chih-Ying Yang, Kuo-Wei Wu ............ 62

Extracting Keywords from News Articles using Feature Similarity based K Nearest Neighbor
Taeho Jo ............ 68

Modification of K Nearest Neighbor into String Vector based Version for Classifying Words in Current Affairs
Taeho Jo ............ 72

Predicting Energy Spectrum of Prompt Neutrons Emitted in Fission Using System Reduction with Neurofuzzy Systems
Deok Hee Nam ............ 76

String Vector based AHC Algorithm for Word Clustering from News Articles
Taeho Jo ............ 83

Drivers of Reviews and Ratings on Connected Commerce Websites
Gurpreet Bawa, Kaustav Pakira ............ 87

Algorithms in Law Enforcement: Toward Optimal Patrol and Deployment Algorithms
Terry Elliott, Abigail Payne, Travis Atkison, Randy Smith ............ 93

Intelligent Transportation Systems and Vehicular Sensor Networks: A Transportation Quality Adaptive Algorithmic Approach
Anastasia-Dimitra Lipitakis, Evangelia A.E.C. Lipitakis ............ 100

A Metric To Determine Foundational User Reviews
Sami Alshehri, James P. Buckley ............ 106

SESSION: DATA SCIENCE AND APPLICATIONS

A Data Integration and Analysis System for Safe Route Planning
Reza Sarraf, Michael P. McGuire ............ 111

Data Analytics: The Big Data Analytics Process (BDAP) Architecture
James A. Crowder, John N. Carbone ............ 118

Big Data Architecture Design for Pattern of Social Life Logging
Hana Lee, Youseop Jo, Ayoung Cho, Hyunwoo Lee, Youngho Jo, Mincheol Whang ............ 124

SESSION: INTERNATIONAL WORKSHOP ON ELECTRONICS AND INFORMATION TECHNOLOGY; IWEIT-2018

The Mathematical Implications of the Design Idea of the Runge-Kutta Method
Wei-Liang Wu ............ 133

Plagiarism Cognition, Attitude and Behavioral Intention: A Trade-Off Analysis
Jih-Hsin Tang, Tsai-Yuan Chung, Ming-Chun Chen ............ 140

A Study of the Effect of Block Programming on Students' Logical Reasoning, Computer Attitudes and Programming Achievement
Ah-Fur Lai, Cio-Han Wang ............ 145

Blockchain-based Firmware Update Framework for Internet-of-Things Environment
Alexander Yohan, Nai-Wei Lo, Suttawee Achawapong ............ 151

Performance Evaluation for the E-Learning with On-line Social Support
Chin-En Yen, Tsai-Yuan Chung, Ing-Chau Chang, Cheng-Fong Chang, Cheng-Ying Yang ............ 156

SESSION: POSTER PAPERS AND EXTENDED ABSTRACTS

Optimization of Situation-Adaptive Policies for Container Stacking in an Automated Seaport Container Terminal
Taekwang Kim, Kwang Ryel Ryu ............ 163

Build an Online Homeless Database
Ting Liu, Luis Concepcion-Bido, Caleb Ryor, Travis Brodbeck ............ 165

Symbolic Aggregate Approximation (SAX) Based Customer Baseline Load (CBL) Method
Sae-Hyun Koh, Sangmin Ryu, Juyoung Chong, Young-Min Wi, Sung-Kwan Joo ............ 167

Automating Ethical Advice for Cybersecurity Decision-Making
Mary Ann Hoppa ............ 170

SESSION: LATE BREAKING PAPERS

Volume: Novel Metric for Graph Partitioning
Chayma Sakouhi, Abir Khaldi, Henda Ben Ghezala ............ 174

Prediction of Review Sentiment and Detection of Fake Reviews in Social Media
Amani Karumanchi, Lixin Fu, Jing Deng ............ 181

Theories of Global Optimal and Minimal Solutions to K-means Cluster Analysis
Ruming Li, Huanchang Qin, Xiu-Qing Li, Lun Zhong ............ 187

Randomness as Absence of Symmetry
Gideon Samid ............ 199

Development of a Sentiment Extraction Framework for Social Media Datasets - Case Study of Fan Pages
Li Chen Cheng, Pin-Yi Li ............ 205
SESSION INFORMATION RETRIEVAL, DATABASES, DATA STRUCTURES, STORAGE METHODS, AND NOVEL APPLICATIONS Chair(s) TBA
The Case for NoSQL on a Single Desktop Ryan D. L. Engle, Brent T. Langhals, Michael R. Grimaila, and Douglas D. Hodson
Abstract—In recent years, a variety of non-relational databases (often referred to as NoSQL database systems) have emerged and are becoming increasingly popular. These systems have overcome scaling and flexibility limitations of relational database management systems (RDBMSs). NoSQL systems are often implemented in large-scale distributed environments serving millions of users across thousands of geographically separated servers. However, smaller-scale database applications have continued to rely on RDBMSs to provide transactions enabling Create, Read, Update, and Delete (CRUD) operations. Since NoSQL database systems are typically employed for large-scale applications, little consideration has been given to examining their value in single-box environments. Thus, this paper examines potential merits of using NoSQL systems in these small-scale, single-box environments.

Index Terms—data storage systems, databases, database systems, relational databases, data models, NoSQL
I. INTRODUCTION

Ever since E. F. Codd articulated the relational database model at IBM in 1970, organizations have turned to relational databases as the primary means to store and retrieve data of perceived value [1]. For decades such a model has worked extremely well, especially given that business intelligence applications tended to use data in a format conducive to the storage and retrieval mechanisms afforded by the powerful Structured Query Language (SQL) that comes standard with each relational database implementation, regardless of vendor. However, by the mid-2000s and with the advent of Web 2.0, organizations began to understand that valuable data existed in diverse formats (blob, JSON, XML, etc.) and at such a scale (petabyte or larger) that it did not easily assimilate into precise, pre-defined relational tables. Furthermore, with the globalization of the internet, organizations found that data needed to be accessible independent of geography, and, in many cases, by millions of simultaneous users with extremely low tolerance for latency. Traditional relational database models were suddenly found lacking in terms of scalability, flexibility, and speed. As a natural consequence, a new generation of data storage technologies was developed to address these shortcomings and was somewhat euphemistically given the general moniker of NoSQL, or "Not Only SQL" [5]. While the meaning of the term NoSQL is imprecise and often debated, for the purposes of this paper NoSQL is used to refer generally to all non-relational database types. NoSQL database technologies tended to develop along four general class lines: key-value, columnar, document, or graph [3]. Each one addressed, in its own way, the issues of
consistency, availability, and partition tolerance (i.e., Brewer's CAP theorem) while simultaneously maximizing the usefulness of diverse data types by focusing on the aggregate model rather than the traditional data model used by relational databases [4]. Aggregates, in the NoSQL world, represent a collection of data that is interacted with as a unit, effectively setting the boundaries for ACID transactions [4]. As a concept, aggregates are extremely important because they define at a foundational level how data is stored and retrieved by the NoSQL database, regardless of type or class. In other words, NoSQL databases need to know a good deal about the data intended to be stored and how it will be accessed, while focusing less on the structure in which the data arrives. In doing so, NoSQL developers have created a new generation of database tools that are extremely fast, easily partitioned, and highly available, but at the cost of the ability to create independent complex queries.

Given that NoSQL databases were designed to operate in an enterprise environment, at scale, on large numbers of commodity hardware, while supporting a diverse array of data types, little discussion of their use has been devoted to more traditional deployments (i.e., single desktop solutions). Indeed, in the haste to develop NoSQL tools to meet emergent business use-cases, comparatively little effort has been expended to evaluate in any rigorous way the inherent advantages NoSQL databases may possess over relational databases in any context except a distributed environment. It is conceivable that advantages exist to using NoSQL technology in a more limited deployment. While some inherent NoSQL advantages, such as the ability to easily partition, are minimized, if not lost, in a single-box implementation, other advantages may remain. The balance of this paper illuminates such potential advantages.

II. ADVANTAGES OF NOSQL (ON A SMALL SCALE)

Before deciding to use NoSQL as a single-box solution, it is useful to consider which potential advantages NoSQL retains comparative to relational databases, without regard to any specific deployment strategy. While NoSQL was initially developed to address deficiencies of the relational database model regarding large data sizes and distributed environments, many of the unique attributes which make NoSQL databases desirable exist regardless of scale. The following sections describe some potential advantages, depending on the needs and desires of the application using the data.

A. Minimizing (or Eliminating) ETL Costs

ETL (extract, transform, and load) is a database management technique designed to facilitate sharing of data
between multiple data repositories. Historically these data stores were relational databases designed to serve specific homogeneous purposes. Given that relational databases are designed to follow rigid data models with prescribed schemas and structures, data inserted into or retrieved from the database often must be transformed into a new state before loading into either a new database, a data warehouse, or a desired end-user application. Significant costs in terms of time and effort are dedicated to ETL tasks. NoSQL databases present an opportunity to significantly reduce ETL costs because they store data closer to its native format and instead choose to push data manipulation to the application level. Many NoSQL databases are agnostic to what "data" is actually being stored. The ability to work with data in its native state potentially reduces coding requirements both for creating the database and for managing it down the road. Furthermore, since NoSQL is predicated on the aggregate model, much is known a priori about how the target data will be used upon retrieval. Applications intended to use the data will be aware of the native state and presumably would be ready to accommodate the data in such a format. This leads to the second advantage: the ability to handle heterogeneous data.

B. Acceptance of Heterogeneous Data

Much of the data in existence today arrives in diverse formats, meaning it arrives as a collection of data where one record does not "look like" the other records. To visualize this, consider all the potential content of one employee's human resources folder. Such a folder may include values (name, date of birth, position title), relationships (dependents, multiple supervisor-to-employee relationships, contacts in other organizations/companies), geospatial data (addresses), metadata (travel vouchers, work history, security attributes), images (ID picture, jpeg of educational records), and free text (supervisor notes, meeting minutes). Further complicating the heterogeneous nature of the data is that not all employees' records may contain the same type or amount of data. In a relational database, this can be dealt with by either creating a very wide table with many fields to cover every possible attribute (and accepting a significant number of NULL values) or creating a large number of tables to accept every possible record type (and risking a significant number of complex joins during retrieval). Both solutions are highly unsatisfactory and can cause significant performance issues, especially as the database grows. NoSQL avoids such complications by storing data closer to the format in which it arrives. Additionally, NoSQL's ability to manage such data in its native state offers particularly interesting advantages for decision making. For example, upon loading the data into the database, tools exist that can alert upon key terms of interest (i.e., people, places, or things), which can subsequently enhance categorization and indexing while incorporating the original context accompanying the data when it arrives. The result is richer information for decision making. In contrast, relational databases, in the effort to transform the data to conform to rigid data models, often lose the associated context and the potentially beneficial information it contains. Building upon storing data natively, the next advantage of NoSQL databases is the ability to support multiple data structures.
C. Support for Multiple Data Structures

In many ways, relational databases were initially designed to address two end-user needs: interest in summative reporting of information (i.e., not returning individual data) and eliminating the need for the "human" to explicitly control for concurrency, integrity, consistency, or data type validity. As a result, relational databases contain many constraints to guarantee transactions, schemas, and referential integrity. This approach has worked well as long as the focus of the database was on satisfying the human end user and performance expectations for database size and speed were not excessive. In today's world the end user is often not simply a human sitting at the end of a terminal, but rather software applications and/or analytics tools which value speed over consistency and seek to understand trends and interconnections over information summaries. Suddenly the type, complexity, and interrelatedness of the data store mattered, and thus NoSQL databases evolved to support a wide range of data structures:

Key-value stores, while simplistic in design, offer a powerful way to handle a range of data from simple binary values to lists, maps, and strings at high speed.

Columnar databases allow grouping of related information into column families for rapid storage and retrieval operations.

Document databases offer a means to store highly complex parent-child hierarchical structures.

Graph databases provide a flexible (not to mention exceptionally fast) method to store and search highly interrelated information.

The one constant across each of the database types and the data structures they support is the presupposition that NoSQL is driven by application-specific access patterns (i.e., what types of queries are needed). In effect, NoSQL embraces denormalization to more closely group the logical and physical storage of data of interest, and uses a variety of data structures to optimize this grouping. In doing so, data is stored in the most efficient organization possible to allow rapid storage and querying [6]. The tradeoff, of course, is that if the queries change, NoSQL (in general) lacks the robustness to support the complex joins of SQL-based relational databases. However, it should not be inferred that NoSQL equates to inflexibility.

D. Flexibility to Change

To compare the flexibility of relational vs. NoSQL databases, it is helpful to remember how each structures data. A relational database has a rigid, structured way of storing data, similar to a traditional phone book. A relational database is comprised of tables, where each row represents an entry and each column stores attributes of interest such as a name, address, or phone number. How the tables, attributes, and data types permitted in each field are defined is referred to as the database schema. In a relational database, the schema is always well defined before any data is inserted, because the goal is to minimize data redundancy and prevent inconsistency among tables, a feature essential to many businesses (i.e., financial operations or inventory management). However, such rigidity can produce unintended complications over time. For example,
a column designed to store phone numbers might be designed to hold exactly 10 integers because 10 is the standard for phone numbers in the United States. The obvious advantage is that any data entry operation which attempts to input anything other than 10 whole numbers (i.e., omits an area code or includes decimals) results in rejecting the input, resulting in highly consistent (and more likely accurate) data storage. However, if for any reason the schema needs to change (i.e., your organization expands internationally and you need to accommodate phone number entries with more than 10 integers), then the entire database may need to be modified. For relational databases, the benefits of rigid initial organization come with compromised future flexibility. In comparison, NoSQL databases do not enforce rigid schemas upfront. The schema-agnostic nature of NoSQL allows seamless management of database operations when changes, such as inclusion of international standards for phone numbers, are required. NoSQL systems are also capable of accepting varying data types as they arrive; thus, as previously discussed, the need to rewrite ETL code to accommodate changes in the structure, type, or availability of data is minimized. Some NoSQL databases take this a step further and provide a universal index for the structure, values, and text found in arriving data. Thus, if the data structure changes, these indexes allow organizations to use the information immediately, rather than having to wait months while new code is tested, as is typically the case with relational databases. Of course, this flexibility comes with costs, primarily in terms of consistency and data redundancy issues, but some applications, even on a single desktop, may not be concerned about the attendant problems these issues cause.

E. Independence from SQL

Structured Query Language (SQL) is the powerful, industry-standardized programming language used to create, update, and maintain relational databases. It is also used to retrieve and share data stored in the database with users and external applications. The power of SQL is derived from its ability to enforce integrity constraints and link multiple tables together in order to return information of value. However, using SQL also imposes a certain rigidness on how developers and users alike can interact with the database. Additionally, as the database increases in scale, multi-table joins can become extremely complex; thus the effectiveness of SQL is, to some degree, dependent on the skills of the database administrator. While several databases belonging to the "NoSQL" class have developed a SQL-like interface, they typically do so to maintain compatibility with existing business applications or to accommodate users more comfortable with SQL as an access language [2]. NoSQL databases also support their own access languages, with varying (generally lesser) degrees of functionality than SQL. This trade-off allows independence from SQL and permits a more developer-centric approach to the design of the database. Typically, NoSQL databases offer easy access to application programming interfaces (APIs), which is one of the reasons NoSQL databases are very popular among application developers. Application developers don't need to know the inner workings and vagaries of the existing database before
using them. Instead, NoSQL databases empower developers to work more directly with the required data to support the applications, instead of forcing relational databases to do what is required. The resulting independence from SQL represents just one of many choices offered by NoSQL databases.

F. Application Tailored Choices (Vendor, Open Source)

The NoSQL environment is clearly awash with choice. In 2013, over 200 different NoSQL database options existed [4]. As previously mentioned, key-value, column, document, and graph comprise the broad categories, but each NoSQL implementation offers unique options for developers and users to choose features best suited for a given application. The key to choosing is dependent upon understanding how the relevant data is to be acquired and used. If the application is agnostic to the "value" being stored and can accept limited query capability based on the primary key only, the simplicity and speed of key-value databases may be the answer. Such apps may include session data, storing user preferences, or shopping cart data. Alternatively, if the developer wishes to store the data as a document (often as JSON, XML, or BSON) and requires the ability to search on a primary key plus some stored value, document databases are an excellent choice. Apps that are likely to take advantage of document databases would include content management systems, analytics platforms, or even e-commerce systems. Columnar databases aggregate data in related column families without requiring consistent numbers of columns for each row. Column families facilitate fast, yet flexible, write/read operations, making this database type well suited for content management systems, blogging systems, and services that have expiring usage. Finally, graph databases allow storage of entities (nodes) along with corresponding relationship data (edges) without concern for what data type is stored, while supporting index-free searches. These characteristics make graph databases especially well suited for applications involving social networks, routing information centers, or recommendation engines. These choices emphasize the power of the NoSQL aggregate model over the relational data model. Making the choice even more appealing is that numerous open source options exist for every NoSQL database type. While most NoSQL databases do offer paid support options, nearly all have highly scalable, fully enabled versions with code made freely available. Popular names like MongoDB (document), Redis (key-value, in memory), Neo4j (graph), and HBase (column), among many others, all represent industry-standard options that are unlikely to become unsupported due to neglect, since each enjoys an avid developer community. These low-cost, open source options enable users to experiment with NoSQL databases with minimal risk while allowing successful implementations to become operational (even when intended to produce profit!) without huge upfront costs. While free open source versions of relational databases exist (i.e., MySQL), they are often provided as an introductory or transitional offer to the more powerful, for-profit versions, or have restrictive usage agreements for no-cost versions. The net result is that with NoSQL, users can not only choose the database best suited to their application needs, they can also
do so with low- or no-cost options, with access to the underlying code, in order to remain independent of software vendors. It is very likely that users of single-box or desktop solutions would find this level of choice very appealing.

III. USES FOR NOSQL ON THE DESKTOP

The unanswered question remains: what application, or types of applications, running on a single machine, perhaps even an individual's desktop, might benefit from one or more of these potential advantages? Instead of defining a list of specific applications, it is more useful to approach the problem from the perspective of the characteristics such applications may seek. The following list, while certainly not exhaustive, highlights a few characteristics that would form a starting point for deciding which NoSQL database to choose:
Efficient write performance – i.e. collecting non-transactional data such as logs, archiving applications
Fast, low-latency access – i.e. such as that required for games
Mixed, heterogeneous data-types – i.e. applications that use different media types (such as an expert system containing images, text, videos, etc.)
Easy maintainability, administration, operation – i.e. home-grown applications without professional support
Frequent software changes – e.g. embedded systems
In the end, each database developer must consider the goals of the database and choose the type to match the requirements needed. The key realization is, regardless of scale, there are available choices beyond relational database models.
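As an illustration of how small such a single-box, schema-agnostic store can be, consider the sketch below. It is added here for this discussion and is not code from the paper or from any particular NoSQL product; the class name MiniKVStore and the key strings are hypothetical, and a real embedded store would add persistence and indexing.

    // Illustrative sketch only: a minimal in-process key-value store with
    // heterogeneous (schema-less) values, of the kind a single-desktop
    // application might embed. Names and structure are hypothetical.
    #include <iostream>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <variant>
    #include <vector>

    // A value can be a number, a string, or a list of strings (e.g. tags).
    using Value = std::variant<double, std::string, std::vector<std::string>>;

    class MiniKVStore {
    public:
        void put(const std::string& key, Value value) { data_[key] = std::move(value); }

        std::optional<Value> get(const std::string& key) const {
            auto it = data_.find(key);
            if (it == data_.end()) return std::nullopt;
            return it->second;
        }

        bool erase(const std::string& key) { return data_.erase(key) > 0; }

    private:
        std::unordered_map<std::string, Value> data_;  // aggregate-oriented: whole value per key
    };

    int main() {
        MiniKVStore store;
        store.put("employee:42:name", std::string("Ada Lovelace"));
        store.put("employee:42:dependents", std::vector<std::string>{"Byron", "Anne"});
        store.put("session:17:latency_ms", 3.5);

        if (auto v = store.get("employee:42:name")) {
            std::cout << std::get<std::string>(*v) << "\n";
        }
        return 0;
    }

Because each key maps to a whole aggregate, records with different shapes coexist without any schema change, which is exactly the property the list above asks for.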
IV. CONCLUSIONS

When choosing a database to support a specific application, the choice between relational and NoSQL options should come down to what the database needs to accomplish to support the application. The thesis of this paper challenges the notion that NoSQL is only considered useful for applications requiring big data and distributed support, while applications residing on a single box or desktop remain the domain of relational databases. Instead, the authors believe choosing the database type should be driven by expected usage and performance concerns. It is hoped the paper encourages researchers (and developers) to rigorously define methodologies and advantages related to NoSQL outside enterprise solutions. In doing so, future applications may benefit from a richer, better understood set of choices when selecting an appropriate data storage method.

REFERENCES

[1] Codd, E. F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, vol. 13, pp. 377-387, June 1970.
[2] Hecht, R. and Jablonski, S., "NoSQL Evaluation: A Use Case Oriented Survey," International Conference on Cloud and Service Computing, 12-14 December 2011.
[3] Nayak, A., Poriya, A., and Poojary, D., "Type of NoSQL Databases and its Comparison with Relational Databases," International Journal of Applied Information Systems, vol. 5, pp. 16-19, March 2013.
[4] Sadalage, P. J. and Fowler, M., NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Upper Saddle River: Addison-Wesley, 2013.
[5] Schram, A. and Anderson, K. M., "MySQL to NoSQL: Data Modelling Challenges in Supporting Scalability," Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, pp. 191-202, Tucson, AZ, 2012.
[6] Denning, P. J., "The Locality Principle," Communications of the ACM, vol. 48, no. 7, pp. 19-24, 2005.
Chaos Game for Data Compression and Encoding

J. Dixon and C. Karlsson
Department of Mathematics and Computer Science, South Dakota School of Mines and Technology, Rapid City, South Dakota, USA

Abstract— The chaos game is a type of iterated function system (IFS) that may be used to generate fractals such as the Sierpinski Triangle. An IFS is a method of fractal generation in which some starting shape or point will be transformed by a function repeatedly until a fractal image appears. Chaos games generally begin with a randomly generated point in 2-dimensional space, and apply a function to that point repeatedly until some fractal image appears. The basic premise of the chaos game lends itself well to data encoding, as the chaos game may be set up in such a way that it will generate a reversible pattern, allowing for successful decoding of encoded data. This paper presents a method to encode and decode data using the chaos game, and shows why data compression using strictly the chaos game is not possible.

Keywords: chaos game, encoding, compression, fractals
1. Introduction

The chaos game is a type of recurrent iterated function system (IFS) that was originally proposed as a method to generate fractal images and patterns [1]. An IFS is a method of fractal generation in which some set of operations will be applied to an image repeatedly in order to produce an interesting result. Possible operations include rotation, translation, copying, and scaling. In an IFS, the process is well defined prior to iterations being performed, and will always produce the exact same results given the same input. The chaos game may be used to produce images that are created by an IFS, but does so by randomly selecting transformations, rather than applying the same transformations each iteration [2], [3]. The hypothesis of this research is that the chaos game can be used for data compression; the underlying assumption is that enough bits will be encoded that storing a 2D coordinate will require fewer bits than the total number of bits encoded.
2. The Chaos Game

Originally presented by Barnsley in 1988, the chaos game is a type of iterated function system that can be used to generate fractal images from simple sets of rules and random selection of transformations to be applied to a two-dimensional coordinate. As described by Barnsley, the
chaos game relies on a Markov process to determine the next transformation to apply to the coordinate. The Markov process need not be fully connected, that is to say that there does not need to be a transformation to get from one state to any other state. However, each state of the Markov process should be assigned a linear affine transformation. A state’s transformation should be applied any time the Markov process selects that state[1]. Each state is defined by the four parameters for the affine transformation to apply when the state is selected, the x and y translation values for the transformation, and probabilities for each of the states that may be selected next. For this work, a full Markov process does not need to be used. Rather than randomly generating the next vertex, input from a file will dictate the proper move, and the transformation will always halve the distance between a point and the selected vertex.
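As a concrete illustration of the state description above, the sketch below represents one chaos-game state as a 2x2 linear part plus a translation, applied as w(x, y) = (a*x + b*y + e, c*x + d*y + f). It is written for this review; the struct and function names are hypothetical, and the paper does not prescribe any particular data layout.

    // Illustrative sketch: one chaos-game state as an affine transformation
    // w(x, y) = (a*x + b*y + e, c*x + d*y + f), plus the probability of
    // selecting the state. Names are hypothetical, not from the paper.
    #include <iostream>

    struct Point { double x, y; };

    struct ChaosState {
        double a, b, c, d;   // linear part of the affine map
        double e, f;         // x and y translation
        double probability;  // chance that the Markov process selects this state

        Point apply(Point p) const {
            return { a * p.x + b * p.y + e,
                     c * p.x + d * p.y + f };
        }
    };

    // The "halve the distance to a vertex (vx, vy)" move used in this paper
    // is the special case a = d = 0.5, b = c = 0, e = vx / 2, f = vy / 2.
    ChaosState halveTowards(double vx, double vy, double prob) {
        return ChaosState{0.5, 0.0, 0.0, 0.5, vx / 2.0, vy / 2.0, prob};
    }

    int main() {
        ChaosState s = halveTowards(1.0, 0.0, 0.25);   // pull toward vertex (1, 0)
        Point p = s.apply({0.5, 0.5});
        std::cout << p.x << " " << p.y << "\n";        // prints 0.75 0.25
        return 0;
    }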
2.1 Chaos game generation of Sierpinski triangle

To understand the process behind the chaos game, and to simplify the rules so that an affine transformation is unnecessary, let us go through an example of how to use it to generate the Sierpinski Triangle. The Sierpinski Triangle is a fractal shape that displays the hallmark characteristics of fractal images: self-similarity and infinite detail. To begin, we will generate three vertices of a triangle. In this example, the vertices are (0, 0), (0, 6), (3, 5.196), so that the triangle is equilateral, with side length six. Once the vertices are generated, we randomly generate a point nearby (or within) the triangle. Randomly select a vertex, and move to the midpoint between the current point and that vertex. Plot the point. Repeat this process for the desired number of iterations (Algorithm 1). While it is not immediately obvious that this movement of a point around a two-dimensional area will produce any interesting results, Figures 1 and 2 show that over enough iterations a pattern does begin to emerge. By the time we have reached 100,000 iterations, it becomes clear that the points being plotted have ended up residing on the Sierpinski triangle. Why is this the case? Why does this seemingly random pattern create such a structured result? As explored in the text Chaos and Fractals, the reasoning behind why the chaos game creates the Sierpinski triangle is that, on a base level, it is behaving in a very similar manner to an IFS to generate the Sierpinski
triangle. By moving half the distance towards a vertex, we are effectively transforming the space around the vertex by moving all points half their distance to that vertex. As with an IFS, we are able to tune the output image by modifying the operations that we intend to perform. For example, if the point is moved two-thirds the distance to the randomly selected vertex, the resulting Sierpinski triangle will appear stretched, with the corners of the interior shape disconnected [4].

Algorithm 1: Chaos Game to Generate Sierpinski Triangle
    generate three vertices of triangle;
    generate random point;
    while not done do
        randomly select vertex;
        move point half the distance to selected vertex;
        if number of moves > 10 then
            plot point;
        end
    end

Fig. 1: Chaos game for 10,000 iterations
Fig. 2: Chaos game for 100,000 iterations

2.2 Chaos Game Representation of DNA sequences

Besides generating fractal images, geneticists have found uses of the chaos game for DNA visualization using a method called Chaos Game Representation (CGR). For the CGR, we will examine a chaos game with four vertices, rather than three. A four-vertex chaos game, like the three-vertex game, will create interesting fractal patterns when provided correct rules, but will only produce noise when moves are generated randomly [5]. Instead of generating a random pattern of moves or creating rules that dictate how random vertices may be selected, a CGR is created by moving the coordinate based on input from the DNA sequence. Each of the four nucleotides (A, C, T, G) is assigned to a vertex, and with each nucleotide read from the DNA sequence, the chaos game coordinate is moved half the distance towards the corresponding vertex. Noting that a flawed random number generator will produce unexpected results, Jeffrey supposed that the images produced by the chaos game reveal underlying patterns within DNA sequences [5], [4], [6]. He continues to analyze the results of the CGR, noting that if points are plotted within the same quadrant, those points end in the same nucleotide base. Furthermore, each quadrant may be divided into sub-quadrants, and those sub-quadrants again divided, ad infinitum. For each of these sub-quadrants, if more than one point is plotted within the sub-quadrant, those points share the same nucleotide ending, for as many nucleotides as the quadrant has been divided into sub-quadrants [5], [4]. Figure 3 shows the CGR for the Ebola genome. The quadrants and sub-quadrants are divided, and the sub-quadrant indicated is populated with points whose last three moves were CGT.

Fig. 3: Chaos game representation of Ebola genome
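A minimal sketch of the CGR procedure just described is given below. It assumes the common corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0), which the paper does not fix, and it only computes the trajectory; plotting is omitted.

    // Minimal sketch of Chaos Game Representation (CGR) for a DNA string:
    // each nucleotide pulls the current point halfway toward its corner.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Point { double x, y; };

    std::vector<Point> cgrTrajectory(const std::string& sequence) {
        std::vector<Point> points;
        Point p{0.5, 0.5};                     // start in the centre of the square
        for (char n : sequence) {
            Point v{0.0, 0.0};
            switch (n) {
                case 'A': v = {0.0, 0.0}; break;
                case 'C': v = {0.0, 1.0}; break;
                case 'G': v = {1.0, 1.0}; break;
                case 'T': v = {1.0, 0.0}; break;
                default:  continue;            // skip characters that are not A, C, G, T
            }
            p = { (p.x + v.x) / 2.0, (p.y + v.y) / 2.0 };   // halve the distance
            points.push_back(p);
        }
        return points;
    }

    int main() {
        for (const Point& p : cgrTrajectory("ACGTCGT")) {
            std::cout << p.x << " " << p.y << "\n";
        }
        return 0;
    }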
2.3 Encoding

Rather than generating fractal images using the chaos game, we will examine how to use the chaos game to encode and decode any input data. First, the initial conditions for the chaos game must be defined. A chaos game with four vertices will be used, with the coordinates of each vertex located as follows: V0 = (0, 0), V1 = (1, 0), V2 = (0, 1), V3 = (1, 1). The coordinates of each vertex correspond directly to the bits from the data to be encoded: 00 corresponds to V0, 10 corresponds to V1, 01 corresponds to V2, and 11 corresponds to V3. The chaos game coordinate (Cx0, Cy0) will begin in the center of this square, at (0.5, 0.5). Given these initial conditions and a file opened for binary input, begin reading two bits at a time. From those two bits, select the corresponding vertex Vi. The next coordinate (Cxi, Cyi) is found by calculating the midpoint between (Cxi-1, Cyi-1) and Vi. Bits are read in pairs until no more bits remain to be read. At this time the input data has been fully encoded into a single coordinate, which may be saved to a file to be decoded later. As an example, the word "Chaos" will be encoded, keeping track of the coordinates at each step of the encoding process. Figure 4 shows the path the coordinate follows during the encoding process, with each step labeled.
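The sketch below implements this encoding loop. Reading each byte's bit pairs most-significant-bit first is an assumption (the paper does not state a bit order), and the function names are hypothetical. Note that, as Section 2.6 discusses, a double cannot represent the result exactly beyond roughly fourteen moves.

    // Sketch of the chaos-game encoder of Section 2.3: read the input two
    // bits at a time and move the point halfway toward the matching vertex.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Point { double x = 0.5, y = 0.5; };   // start at the centre

    // Vertex assignment used in the paper: 00 -> (0,0), 10 -> (1,0),
    // 01 -> (0,1), 11 -> (1,1); the first bit of the pair drives x.
    static Point vertexFor(unsigned twoBits) {
        return { (twoBits & 0x2) ? 1.0 : 0.0,
                 (twoBits & 0x1) ? 1.0 : 0.0 };
    }

    Point encodeChaos(const std::vector<uint8_t>& data) {
        Point p;
        for (uint8_t byte : data) {
            for (int shift = 6; shift >= 0; shift -= 2) {      // MSB-first pairs
                unsigned pair = (byte >> shift) & 0x3;
                Point v = vertexFor(pair);
                p = { (p.x + v.x) / 2.0, (p.y + v.y) / 2.0 };  // midpoint move
            }
        }
        return p;
    }

    int main() {
        std::string word = "Chaos";
        std::vector<uint8_t> bytes(word.begin(), word.end());
        Point p = encodeChaos(bytes);
        std::cout << p.x << " " << p.y << "\n";   // final encoded coordinate
        return 0;
    }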
Fig. 4: Plot of all points to encode "Chaos"

2.4 Decoding

To decode the stored point and restore the initial data, the process is simply reversed. Given a 2D point, select the closest vertex from V0 - V3 and double the distance between the point and the vertex. Append the bit sequence corresponding to that vertex to the decoded data string. To reiterate, the coordinates of the vertex directly correspond to the bit sequence that had been encoded. That is to say that the vertex at (0, 0) represents the bit sequence 00, and the vertex at (1, 1) represents the bit sequence 11. However, for the purposes of decoding, we must append the bit sequence backwards. For example, if the coordinates of the closest vertex are (0, 1), we must append 10 to the decoded bit sequence. Since we are decoding in the reverse order, if the bits are not reversed as they are appended, every two bits will be backwards in the final decoded sequence. Repeat the process until the point is at the starting location (0.5, 0.5). Once the coordinate has been moved back to the starting position, the complete decoded bit sequence must be reversed.

2.5 Increasing the number of vertices

What would happen if the number of vertices were to increase? To encode n bits per move, the number of vertices is 2^n. While this would enable us to encode more bits per move, unlike the four-vertex chaos game, if an eight-vertex game is played, the midpoint between the previous coordinate and the vertex representing the incoming bits does not suffice to represent the data, because the encoded coordinate is not guaranteed to be unique. After moving towards a vertex, the coordinate will land in an octagonal region between (0.5, 0.5) and the vertex. This region overlaps with the regions of the two adjacent vertices, so during the decode any point that has been moved to one of these regions will not necessarily be moved to the correct vertex or, in the worst case, will be equidistant between two vertices, and the closest cannot be determined. Figure 5 shows the overlap between these regions, with 100,000 moves randomly generated and the points colored to represent the most recent move made. The overlapping regions are visible, and the only areas that can guarantee a correct decode are the ones closest to their respective vertices, with no overlap.
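A matching sketch of the decoding procedure of Section 2.4 is shown below. It prepends each recovered pair instead of using the append-then-reverse rule described in the text (the two are equivalent), and it assumes the number of encoded bit pairs is known, whereas the paper stops when the point returns to (0.5, 0.5). Names are hypothetical.

    // Sketch of the decoder of Section 2.4: repeatedly pick the nearest
    // vertex, recover its bit pair, and double the distance away from it.
    #include <iostream>
    #include <string>

    struct Point { double x, y; };

    std::string decodeChaos(Point p, int numPairs) {
        std::string bits;
        for (int i = 0; i < numPairs; ++i) {
            int bx = (p.x >= 0.5) ? 1 : 0;           // nearest vertex, per coordinate
            int by = (p.y >= 0.5) ? 1 : 0;
            p.x = 2.0 * p.x - bx;                    // undo the midpoint move
            p.y = 2.0 * p.y - by;
            // Prepend the pair in its original order (x bit, then y bit).
            std::string pair{char('0' + bx), char('0' + by)};
            bits.insert(0, pair);
        }
        return bits;
    }

    int main() {
        // (0.53125, 0.59375) is the coordinate the paper derives for 'C'.
        std::cout << decodeChaos({0.53125, 0.59375}, 4) << "\n";  // prints 01000011
        return 0;
    }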
2.6 Size of chaos game coordinates

The process described in Sections 2.3 and 2.4 works well to encode and decode data, but unless the data is carefully manipulated and stored, the encoded result may end up being many times larger than the original file. This expansion is caused by the limitations of floating-point data types and the nature of the chaos game's division by two with each and every move. In C++, if we use the double data type, we immediately run into problems with this, as a double is only capable of reliably storing fifteen decimal digits of precision. Since the beginning coordinate is (0.5, 0.5), one digit is already taken, meaning that we have fourteen to work with, or fourteen moves that we can make. Examining the previous example where the word "Chaos" was encoded, let us see what happens if we attempt to store
the encoded coordinates in C++ doubles. The total length in binary of "Chaos" is 40 bits. At two bits per move, and fourteen possible moves per double, 28 bits may be encoded per coordinate pair. Since there are forty total bits, two coordinate pairs are required. Each double is 64 bits, so two coordinate pairs cost 256 bits. The size of the word "Chaos" has been increased by 216 bits. Storing the coordinates as doubles is clearly not efficient, so we must look at alternative ways to store the encoded data.

Fig. 5: Plot of chaos game played on octagonal board
Fig. 6: Example of array overlay for four moves
2.7 Overlaying an array on the Chaos game

One potential method of avoiding this expansion is to find a way to represent the final chaos game coordinate as integers. The method presented here is to imagine an array has been placed over the area of the chaos game, and map the final encoded coordinate to array indices. While no array truly exists in memory, it serves as a visual representation of the mapping from floating-point coordinates to integers. This array must be a specific size based on the number of moves that will be performed on the point. Generally, the array needs to be of dimension 2^n x 2^n, where n is the number of moves to be made. Figure 6 shows an example of encoding just the first four moves of the previous encode example, just the character "C". The first three moves end up on intersections of the array, and would not adequately serve as indices, but the fourth and final move places the point within an array position. This will be the case no matter the number of moves made, as long as the array has been defined as the appropriate size.

2.8 Assigning array indices

Now that the coordinate falls within an array position, we must find a method to represent the coordinate as an array index. In order to do this, we must convert the x and y coordinates from the chaos game to x and y indices of the array. Before assigning array indices, let us define a fixed size for the array. In the interest of conserving space, and to prevent the encoded data from growing in size, an unsigned short integer is used to store each of the indices. An unsigned short integer in C++ is 16 bits, and is able to store values in the range 0-65535. Given the size of 16 bits, it is possible to encode 16 chaos game moves at a time without losing precision. This will encode 32 bits of input data into two 16-bit array indices, so the encoded data will remain the same size as the input data. The encode process is as simple as mapping from the chaos game coordinate range (0, 0) to (1, 1) to the range of the indices, 0 to 65535 in each dimension. Floor and ceiling operations are applied in order to map coordinates into the 0 or 65535 indices, respectively. Since the decimal precision of a C++ double restricts the number of moves we can make to fourteen, a C++ long double must be used in this case. Even though the long double takes more space in memory, that is irrelevant since it is not directly stored. To decode, simply reverse the mapping from integer back to decimal point.
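One concrete realization of this mapping is sketched below: multiplying each coordinate by 2^16 and truncating yields an index in 0-65535, which is one way to implement the floor/ceiling mapping described above (the helper names are hypothetical). The truncated value is exactly the integer whose bits are the 16 x-moves (or y-moves), which is why the shortcut of Section 2.9 works.

    // Sketch of Section 2.8: after 16 moves the coordinate is mapped to a
    // pair of 16-bit indices; floor(x * 2^16) recovers the 16 x-move bits.
    #include <cstdint>
    #include <iostream>

    struct Indices { uint16_t ix, iy; };

    Indices toIndices(long double x, long double y) {
        // x and y lie strictly inside (0, 1), so the product is below 65536.
        return { static_cast<uint16_t>(x * 65536.0L),
                 static_cast<uint16_t>(y * 65536.0L) };
    }

    // Inverse mapping used when decoding: (index + 0.5) / 2^16 is exactly the
    // coordinate the encoder finished at (the centre of the array cell).
    void fromIndices(Indices idx, long double& x, long double& y) {
        x = (idx.ix + 0.5L) / 65536.0L;
        y = (idx.iy + 0.5L) / 65536.0L;
    }

    int main() {
        // Four-move example from the paper ('C'): the 2^4 x 2^4 analogue gives
        // floor(0.53125 * 16) = 8 and floor(0.59375 * 16) = 9.
        std::cout << static_cast<int>(0.53125 * 16) << " "
                  << static_cast<int>(0.59375 * 16) << "\n";
        return 0;
    }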
2.9 Quick array index generation

The difficult part of the method of generating array indices is that, using the method presented above, the chaos game still needs to be played in its entirety. While this process is not terribly complex, requiring only two floating point values and simply running the midpoint formula twice per move, we are able to take a bit of a shortcut to calculate the indices for the array. By nature of the chaos game, with any given movement, there are very few possible outcomes. The x component of
the coordinate will either be moved left or right, and the y component of the coordinate will either be moved up or down. Therefore, it is possible to obtain the index by appending either 0 or 1 to a bit sequence for the moves made in the x direction, and 0 or 1 to a bit sequence for the moves made in the y direction. This is equivalent to reading the data to be encoded as a bit string and placing every other bit into these bit sequences, right to left. For example, we can encode the character "C" again. The encoding sequence is shown in Figure 6. The final coordinate is (0.53125, 0.59375), and mapping these values to the range 0-15 gives the array indices 8, 9. The binary value of "C" is 01000011. Filling two integers right to left from every other bit of "C" gives us 1000 and 1001, which are the indices.
Fig. 7: Splitting input bit string to indices
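The shortcut can be written directly as bit manipulation, as in the sketch below: the first bit of each pair is routed to the x index and the second to the y index, filling each index from the least-significant bit upward ("right to left"). Function and variable names are hypothetical; the main function checks the result against the 'C' example above.

    // Sketch of Section 2.9: build the two array indices straight from the
    // input bits, without playing the chaos game.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Indices { uint16_t ix = 0, iy = 0; };

    // Expects exactly four bytes (32 bits, i.e. 16 bit pairs) per index pair.
    Indices splitToIndices(const std::vector<uint8_t>& fourBytes) {
        Indices idx;
        int move = 0;                                        // 0 .. 15
        for (uint8_t byte : fourBytes) {
            for (int shift = 6; shift >= 0; shift -= 2) {    // MSB-first pairs
                unsigned xBit = (byte >> (shift + 1)) & 1u;  // first bit of the pair
                unsigned yBit = (byte >> shift) & 1u;        // second bit of the pair
                idx.ix |= static_cast<uint16_t>(xBit << move);
                idx.iy |= static_cast<uint16_t>(yBit << move);
                ++move;
            }
        }
        return idx;
    }

    int main() {
        // Single-character check against the paper's 'C' example (four moves):
        Indices c = splitToIndices({0x43, 0, 0, 0});
        // The low four bits are 1000 (8) and 1001 (9), matching Section 2.9.
        std::cout << (c.ix & 0xF) << " " << (c.iy & 0xF) << "\n";   // prints 8 9
        return 0;
    }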
2.10 Huffman coding the chaos game

In order to coerce the chaos game into compressing data, we may attempt to leverage Huffman coding to our advantage to find patterns in the data string before it is encoded with the chaos game. Huffman coding was selected since one of the applications of the chaos game has been to identify patterns within DNA sequences [5]. Since any input data type may be encoded by the chaos game, we must determine how to define a "symbol" in terms of bits. For the purposes of this work, the "symbol" will be one byte. The intent is to sum the frequency of the bytes within the input data and represent those that occur more commonly with shorter bit strings, as with any Huffman coding. The success of this method will be limited by the underlying data, i.e., whether certain bytes appear more frequently than others, and how much more frequently those bytes appear. If nearly all bytes within the file appear with the same frequency, that is to say the data is more random than not, the success of Huffman coding will be limited.
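As an illustration of this last point (an addition for this discussion, not something done in the paper), the sketch below estimates, from byte frequencies alone, the smallest output a byte-level Huffman-style code could achieve; the closer the byte distribution is to uniform, the closer the estimate is to the original size.

    // Illustration of the "success is limited by the underlying data" point:
    // the zeroth-order entropy of the byte frequencies bounds what any
    // byte-level Huffman-style code can achieve.
    #include <array>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    double estimatedCompressedBytes(const std::vector<uint8_t>& data) {
        if (data.empty()) return 0.0;
        std::array<uint64_t, 256> freq{};              // frequency of each byte value
        for (uint8_t b : data) ++freq[b];

        double bits = 0.0;
        for (uint64_t f : freq) {
            if (f == 0) continue;
            double p = static_cast<double>(f) / data.size();
            bits += f * -std::log2(p);                 // -log2(p) bits per occurrence
        }
        return bits / 8.0;                             // entropy bound, in bytes
    }

    int main() {
        std::vector<uint8_t> skewed(1000, 'A');        // one symbol dominates
        skewed.resize(1100, 'B');
        std::vector<uint8_t> uniform;
        for (int i = 0; i < 1100; ++i) uniform.push_back(static_cast<uint8_t>(i % 256));

        std::cout << "skewed : ~" << estimatedCompressedBytes(skewed)  << " bytes\n";
        std::cout << "uniform: ~" << estimatedCompressedBytes(uniform) << " bytes\n";
        return 0;
    }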
3. Results
The methods presented have shown a way to encode and decode data using the chaos game. When encoded, the output will always be the same size as the input. Unfortunately, it is not possible to compress data using strictly the chaos game, but it is possible to use other methods in conjunction with the chaos game in order to compress data to some extent. For example, as presented in Section 2.10, Huffman encoding
may be applied either before or after the encode process, and will yield generally similar results either way. Throughout this section, results of the chaos game encode process will be presented, with timings.
3.1 Timings of chaos game encoding
In order to test encoding and decoding, multiple file types and file sizes were used, to confirm that the encode and decode process works regardless of file type.

Table 1: Files used and their sizes

File Name    File Size
alice.txt    170K
ebola.txt    19K
cat.jpg      3.9M
song.wav     14M
video.mp4    200M
These are simply five files that were on hand at the time of writing, and they contain the following information. Alice.txt is a text file containing the complete text of Alice's Adventures in Wonderland, encoded in UTF-8 [7]. The second file is the partial genome of the Ebola virus; the intent here is to show that data known to have underlying patterns will compress well, and that Huffman coding works well on files with a limited number of symbols. The third file, cat.jpg, is a 3.9 MB photo the author took of a cat; the dimensions of the photo are 5184 x 3456. The fourth file, song.wav, is a one minute, twenty-three second long song that the author wrote one evening while learning to use his synthesizer. The song features repetitive patterns, in the hope that some interesting information may come from the chaos encode process or the Huffman code compression. The fifth file, video.mp4, is a one minute, twenty-three second long video with the previous song playing as the audio. Despite the patterns in the audio track, since a video stream has been added, the pattern should be obscured.

3.1.1 Timings of chaos game storing floating point coordinate
The most straightforward way to encode and decode the chaos game is to simply store the coordinate after all input data has been encoded, and reverse the process to decode. The expected result for storing the encoded floating point coordinate is that the encoded file will grow in size: specifically, 32 bits are encoded at a time, and the encoded coordinate is stored in two C++ doubles, each using 64 bits. So 32 bits become 128 bits, which is four times larger than the input. Table 2 presents timings for encode and decode of the coordinate, and the encoded file size. The results confirm that the encoded file is four times larger than the input file.
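A sketch of the coordinate-based encoding described here, under the assumptions that the board is the unit square with vertices at its corners and that the point starts at (0.5, 0.5); these details are not stated in this section, but they reproduce the 4-move coordinate (0.53125, 0.59375) for "C" quoted earlier:

    #include <cstdint>
    #include <utility>

    // Encode one 32-bit block with sixteen 2-bit chaos-game moves and return
    // the final coordinate, to be stored as two 64-bit doubles (hence the 4x
    // growth in output size).
    std::pair<double, double> encodeBlock(std::uint32_t block) {
        double x = 0.5, y = 0.5;                          // assumed starting point
        for (int move = 0; move < 16; ++move) {
            double vx = (block >> (31 - 2 * move)) & 1u;  // vertex x from the first bit of the pair
            double vy = (block >> (30 - 2 * move)) & 1u;  // vertex y from the second bit
            x = (x + vx) / 2.0;                           // midpoint rule toward the chosen corner
            y = (y + vy) / 2.0;
        }
        return {x, y};
    }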
Table 2: Coordinate encode and decode timings

File Name    Encode Time    Decode Time    Output Size
alice.txt    24.408 ms      17.405 ms      678K
ebola.txt    2.125 ms       1.42 ms        75K
cat.jpg      630.364 ms     453.703 ms     16M
song.wav     2140.66 ms     1566.8 ms      56M
video.mp4    31754.3 ms     23706.9 ms     800M
3.1.2 Timings of chaos game storing integer indices
As presented in Sections 2.7-2.9, the process of encoding can be simplified by representing the chaos game coordinate as two integer indices on an array that has been overlaid on the chaos game board. The expected result is to encode data to a file of the same size, and to do so quickly, since the underlying process is simply splitting and recombining the bit string. Table 3 lists timings and output file sizes for the same test files using the integer index encode method.

Table 3: Integer index encode and decode timings

File Name    Encode Time    Decode Time    Output Size
alice.txt    8.325 ms       5.517 ms       170K
ebola.txt    0.681 ms       0.648 ms       19K
cat.jpg      229.374 ms     157.428 ms     3.9M
song.wav     766.686 ms     556.572 ms     14M
video.mp4    12316.7 ms     8905.22 ms     200M
3.2 Huffman coding of the chaos game
The chaos game itself can, at best, only encode data to the same size it was to begin with; compression does not appear to be possible. As discussed previously, increasing the number of vertices while maintaining a 2D coordinate seems at a glance to make it possible to encode more bits for each move of the chaos game point, but upon closer inspection this method is impossible to decode. How, then, can we make the chaos game encoded output smaller? One option is to use Huffman coding to find patterns in the input data and represent them as unique patterns that are smaller than their input counterparts. The most common patterns will be represented as shorter bit strings, while less common patterns may increase in length. Each representation has a unique prefix, so that it can be decoded properly [8]. We expect to see diminishing returns based on file size, since we do not expect to see patterns throughout the general input data, or the input data is so random overall that the returns gained from Huffman coding are negligible. Table 4 shows the resulting file sizes of Huffman coding before chaos game encoding with Huffman code length 8 (bits) and the coordinate-based chaos game. Table 5 shows the resulting file sizes of Huffman coding before chaos game encoding with Huffman code length 8 (bits) and the integer index based chaos game.
Table 4: Coordinate encode, Huffman coded file sizes

File Name    Original Output Size    Huffman Output Size    Percent Size Reduction
alice.txt    678K                    406K                   40.118
ebola.txt    75K                     21K                    72
cat.jpg      16M                     15M                    6.25
song.wav     56M                     52M                    7.142
video.mp4    800M                    800M                   0
Table 5: Integer index encode, Huffman coded file sizes

File Name    Original Output Size    Huffman Output Size    Percent Size Reduction
alice.txt    170K                    102K                   40
ebola.txt    19K                     5.2K                   72.631
cat.jpg      3.9M                    3.9M                   0
song.wav     14M                     13M                    7.142
video.mp4    200M                    200M                   0
While the results of Huffman coding for the Ebola genome look the most impressive, that result must be taken with a grain of salt. Because the input file is a text file composed entirely of the characters A, C, T, and G, the Huffman coding algorithm performs exceptionally well, since a very limited selection of symbols is available. Put another way, the underlying bits have four very distinct patterns: the bytes representing the characters mentioned above. Since the Huffman code length is one byte, it is perfectly tuned for exactly this situation, whereas in the case of the image of a cat, minor variations from pixel to pixel produce a very large number of distinct one-byte values, rendering the Huffman coding less effective. It is also important to note that the percentage saved by adding Huffman coding is the same whether the floating point coordinates or the integer indices are stored from the chaos game. This further shows that the success of this compression method is only as good as the Huffman code compression alone.
4. Future work This work has presented an encode and decode method based on the chaos game. The applications of such an encoding scheme are limited, but there is still potential for future work.
4.1 Future work - Chaos Game Representation of data to determine Huffman code length Applying the CGR to DNA sequences gives a visual representation of densities of sequences of nucleotides within the DNA sequence [5]. Similarly, if the CGR is applied to any data, we can investigate the relationships between bits, and identify patterns within the data. Figure 8 shows the first chapter of Lewis Carroll’s “Alice’s Adventures
in Wonderland" [7]. Certain regions of the plot are more densely populated than others, indicating patterns in the data. For example, the outlined region in the upper left is the region of the plot that contains all points whose last four moves were 01 10 01 01, i.e., the character 'e'. The number of points within this region directly corresponds to the number of times the letter 'e' appears within the text. Digging a little deeper, the second outlined region contains points whose last six moves were 01 10 01 01 01 11, or the character 'e' and the first half of another character.
Fig. 8 First chapter of "Alice's Adventures in Wonderland" plotted as CGR

As presented in the results section, Huffman coding can be used to reduce the size of the chaos game encoded file, but the amount of data compressed in this manner is no more than would be compressed by simply Huffman encoding the input data. It may be possible to use the CGR of the input data to determine the optimal symbol length for Huffman encoding. Since at its core the chaos game is a quadtree representation [9], [10], it could be possible to use a quadtree of histograms from the CGR to determine an optimal number of bits to use for Huffman coding. At each level of the quadtree, the number of points within the region could be summed, and that value stored to a histogram quadtree. Each layer of the quadtree represents another two bits encoded, so the fourth layer would represent one byte, or, as in Figure 8, one character.

This tree of histograms could then be analyzed to select the best depth for Huffman coding. A number of metrics could be used to select the depth, such as the average histogram density compared to the depth in the tree, or the value of a node compared to the value of its parent node.

4.2 Future Work - Increase dimension of chaos game
The chaos game is generally played on a 2D board, but there is no reason why it could not be played in a higher dimension. Section 3.2 describes the complications that arise when the number of vertices is increased for a 2D chaos game, but if the chaos game is played in a higher dimension, the same kind of result is obtained. As shown in Figure ??, a four-vertex chaos game in three dimensions will produce a 3D Sierpinski triangle. This can be encoded and decoded the same way, as long as the number of vertices is selected carefully, that is to say, with no overlap between the regions to which a point can be encoded. So, for any chaos game played in the nth dimension, the number of vertices is 2^n, and the number of bits encoded per move is n. A three-dimensional chaos game will use eight vertices and encode three bits per move. To encode one byte at a time, the chaos game would be eight-dimensional and use 256 vertices. Increasing the dimensionality of the chaos game in this way can be used to identify patterns of different lengths, as the 2D chaos game can only identify patterns that are a multiple of two bits long. This could lead to more optimal Huffman coding possibilities.
References
[1] M. F. Barnsley, Fractals Everywhere, 1988. [2] T. Martyn, “The chaos game revisited: Yet another, but a trivial proof of the algorithm’s correctness,” Applied Mathematics Letters, vol. 25, no. 2, pp. 206 – 208, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893965911003922 [3] ——, “An elementary proof of correctness of the chaos game for ifs and its hierarchical and recurrent generalizations,” Computers & Graphics, vol. 26, no. 3, pp. 505 – 510, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0097849302000924 [4] H.-O. Peitgen, J. H., and D. Saupe, Chaos and fractals: new frontiers of science. Springer-Verlag, 1993. [5] H. J. Jeffrey, “Chaos game representation of gene structure,” Nucleic Acids Research, vol. 18, no. 8, pp. 2163–2170, 1990. [6] R. A. Mata-Toledo and M. A. Willis, “Visualization of random sequences using the chaos game algorithm,” Journal of Systems and Software, vol. 39, no. 1, pp. 3 – 6, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121296001586 [7] “Project gutenberg’s alice’s adventures in wonderland, by lewis carroll.” [Online]. Available: https://www.gutenberg.org/files/11/11h/11-h.htm [8] D. A. Huffman, “A method for the construction of minimumredundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098– 1101, Sept 1952. [9] S. Vinga, A. M. Carvalho, A. P. Francisco, L. M. Russo, and J. S. Almeida, “Pattern matching through chaos game representation: bridging numerical and discrete data structures for biological sequence analysis,” Algorithms for Molecular Biology : AMB, vol. 7, pp. 10–10, 2012. [Online]. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402988/ [10] N. Ichinose, T. Yada, and T. Takagi, “Quadtree representation of dna sequences,” 01 2001.
Shed More Light on Bloom Filter's Variants
Ripon Patgiri, Sabuzima Nayak, and Samir Kumar Borgohain
Department of Computer Science & Engineering, National Institute of Technology Silchar, Assam, India

Abstract— The Bloom Filter is a probabilistic membership data structure, and it is extensively used for membership queries. The Bloom Filter has become the predominant data structure in approximate membership filtering. The Bloom Filter greatly enhances query response time, with O(1) time complexity. A Bloom Filter (BF) is used to detect whether an element belongs to a given set or not. The Bloom Filter returns True Positive (TP), False Positive (FP), or True Negative (TN). The Bloom Filter is widely adopted in numerous areas to enhance the performance of a system. In this paper, we present a) in-depth insight into the Bloom Filter, b) the prominent variants of the Bloom Filter, and c) issues and challenges of the Bloom Filter.

Keywords: Bloom Filter, Scalable Bloom Filter, Variants of Bloom Filter, Membership filter, Data Structure, Algorithm
1. Introduction
The Bloom Filter [1] is an extensively used probabilistic data structure for membership filtering. The query response of the Bloom Filter is extremely fast, with O(1) time complexity and a small space overhead. The Bloom Filter is used to boost query response time by avoiding some unnecessary searching. The Bloom Filter is a small-sized data structure; a query is performed on the Bloom Filter before querying the large database, so the Bloom Filter cumulatively saves immense query response time. However, there are also false positives, which are known as the overhead of the Bloom Filter. Nevertheless, the probability of a false positive is very low, and thus the overhead is also low. Moreover, a careful implementation of the Bloom Filter is required to reduce the probability of false positives. There are various kinds of Bloom Filters available, namely the Blocked Bloom Filter [2], Cuckoo Filter [3], d-Left CBF (dlCBF) [4], Quotient Filter (QF) [5], Scalable Bloom Filter (SBF) [6], Sliding Bloom Filter [7], TinySet [8], Ternary Bloom Filter (TBF) [9], Bloofi [10], OpenFlow [11], BloomFlow [12], Difference Bloom Filter (DBF) [13], and Dynamic Reordering Bloom Filter [14]. The variants of the Bloom Filter are designed based on the requirements of the applications. The Bloom Filter data structure is highly adaptable; therefore, the Bloom Filter has found enormous application. A careful adaptation of the Bloom Filter improves a system, though this depends on the requirements of the application. This potential for improvement is what makes the probabilistic data structure so widely applicable.
The fast query response of the Bloom Filter attracts researchers, developers, and practitioners alike, and there are tremendous applications of the Bloom Filter. For instance, BigTable uses a Bloom Filter to improve disk access time significantly [15]. Moreover, Metadata Servers are drastically enhanced by the Bloom Filter [16], [17], [18], [19]. Network security is also boosted using the Bloom Filter [20], [21]. In addition, duplicate packet filtering is a very time consuming process; duplicate packets can be filtered in O(1) time complexity using the Bloom Filter [22]. Besides, there are diverse applications of the Bloom Filter which significantly improve the performance of a system. The Bloom Filter predominates in filtering systems, and this poses some research questions (RQs), which are listed below:
RQ1: Where should a Bloom Filter not be used?
RQ2: What is the barrier of the Bloom Filter?
RQ3: What are the various kinds of Bloom Filters available?
RQ4: What are the issues and challenges of the Bloom Filter?
These research questions lead the article to a suitable conclusion. RQ1 exposes the reasons for using the Bloom Filter. RQ2 examines the false positives of the Bloom Filter. RQ3 exposes the state-of-the-art developments of the Bloom Filter. Finally, RQ4 poses the issues and challenges of the Bloom Filter.
2. Bloom Filter
The Bloom Filter [1] is a probabilistic data structure used to test the membership of an element in a set [23]. The Bloom Filter uses a small space overhead to store information about the element set. The True Positives and True Negatives enhance the performance of the filtering mechanism. However, there is a false positive overhead in the Bloom Filter and its variants, though the probability of a false positive is negligible. Still, some systems cannot tolerate false positives, because a false positive introduces errors into the system; a duplicate key filtering system is one example. Moreover, the Bloom Filter also guarantees that there are no False Negatives (FN), except in the counting variants of the Bloom Filter. There are many systems where most of the queries are TN. Let K = k1, k2, k3, ..., kn be the elements present in the set S, and let ki be a random element where 1 <= i <= n. The approximate membership query asks whether ki is in S or not. The Bloom Filter returns either positive or negative. A positive is classified into False Positive (FP) and True Positive (TP). An FP means the Bloom Filter reports the existence
of an element in a set even though $k_i \notin S$. However, the TP correctly identifies an element that is in the set. Similarly, a negative is classified into TN and FN. The TN boosts the performance of a system, while the FP degrades it. Therefore, the key challenge of Bloom Filter design is to reduce the probability of FP. Figure 1 depicts the flowchart of the Bloom Filter and clearly exposes the overhead of the Bloom Filter in the case of an FP.

Fig. 1: Flowchart of Bloom Filter. The figure demonstrates the overhead of the Bloom Filter in the case of a false positive.

Fig. 2: The false positive probability of Bloom Filter with m = 64 MB and h = 1, 2, 4, 8, and 16. The X-axis represents the number of entries n.

2.1 Analysis
The FP affects the performance of the Bloom Filter, and this is an overhead on a system, as shown in Figure 1. However, almost all variants of the Bloom Filter reduce the FP probability. Let us assume m is the number of bits available in the array. The probability of a particular bit being 1 is $\frac{1}{m}$, and the probability of a particular bit being 0 is
$$1-\frac{1}{m}$$
Let h be the number of hash functions; the probability that the bit remains 0 after inserting one element is [24], [23]
$$\left(1-\frac{1}{m}\right)^{h}$$
There are a total of n elements inserted into the array; therefore, the probability that the bit is still 0 is
$$\left(1-\frac{1}{m}\right)^{nh}$$
Now, the probability that this particular bit is 1 is
$$1-\left(1-\frac{1}{m}\right)^{nh}$$
What is the optimal number of hash functions h? The probability of a false positive, i.e., of all h probed bits being 1, is
$$\left(1-\left(1-\frac{1}{m}\right)^{nh}\right)^{h} \approx \left(1-e^{-hn/m}\right)^{h}$$
The false positive probability increases with a large number of entries n; however, it is reduced by increasing the value of m. The value of h that minimizes the false positive probability is
$$h=\frac{m}{n}\ln 2$$
Let p be the desired false positive probability; hence,
$$p=\left(1-e^{-\left(\frac{m}{n}\ln 2\right)n/m}\right)^{\frac{m}{n}\ln 2}$$
$$\ln p=-\frac{m}{n}(\ln 2)^{2}$$
$$m=-\frac{n\ln p}{(\ln 2)^{2}}$$
$$\frac{m}{n}=-\frac{\log_{2} p}{\ln 2} \approx -1.44\log_{2} p$$
Therefore, the optimal number of hash functions required is $h=\frac{m}{n}\ln 2=-\log_{2} p$.
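As a quick illustration of these formulas (a sketch, not code from the paper; names are illustrative), the required m and h can be computed from a target n and p:

    #include <cmath>
    #include <cstddef>

    // Sizing helper implementing the formulas above: given the expected number
    // of elements n and a desired false positive probability p, compute the
    // bit array size m and the number of hash functions h.
    struct BloomParams {
        std::size_t m;  // number of bits in the filter
        unsigned h;     // number of hash functions
    };

    BloomParams sizeBloomFilter(std::size_t n, double p) {
        const double ln2 = std::log(2.0);
        const double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);  // m = -n ln p / (ln 2)^2
        const double h = (m / static_cast<double>(n)) * ln2;                   // h = (m/n) ln 2 = -log2 p
        return {static_cast<std::size_t>(std::ceil(m)), static_cast<unsigned>(std::round(h))};
    }
    // Example: n = 1,000,000 and p = 0.01 give m of roughly 9.6 million bits and h = 7.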
Table 1: Parameters description of Figure 3

Name          Description
a             Random strings dataset
b             Combination of strings dataset
MaxFPC        Maximum number of false positives counted in 1000 query rounds
Zero FPC      Total number of query rounds with no false positives (no FP found) in 1000 query rounds
AverageFPC    Total FPC / 1000
AverageFPP    Total FPP / 1000
Filter Size   The Bloom Filter array size
Data Size     Total number of input entries
Fig. 3: Statistics on various data sizes during 1000 query rounds.
Fig. 4: False positive queries found for an input size of 100000 elements during 1000 query rounds. The X-axis represents the query round and the Y-axis represents the number of false positive queries.

Figure 3 depicts experiments on various data sizes with 1000 rounds of queries [25]. We generated a random string dataset of various sizes and a combination-of-strings dataset of various sizes. Table 1 describes the parameters of Figure 3. Figure 3 shows the dynamic scaling of the filter size according to the data size. MaxFPC refers to the maximum number of false positives detected in 1000 query rounds. Zero FPC refers to the total number of rounds with zero false positives in 1000 query rounds. AverageFPC and AverageFPP are the mean false positive count and the mean false positive probability over 1000 query rounds, respectively. Figure 4 depicts a snapshot obtained by keeping the number of inputs at 100000 elements [25]. The experiment is conducted by fixing the number of input elements for the random string and combination-of-alphabets datasets, which are input to study the number of false positive query hits over 1000 query rounds. Dataset a and dataset b consist of random strings and of combinations of the alphabet forming strings of different sizes, respectively. The number of input elements varies from 100 to 100000 to study the behavior of the false positive queries and the probability of false positives.
2.2 Discarding Compressed Bloom Filter
The Compressed Bloom Filter (ComBF) [26] reduces the extra space requirement by mapping an element into a single bit. However, there is an important tradeoff between performance and space complexity, and therefore the ComBF does not exhibit good performance.
Fig. 5: Compression vs. false positives on an input of 100000 random strings

Figure 5 exposes the trade-off between compression and false positives: high compression results in an increase in the false positive probability. In Figure 5, the input strings are generated randomly and fed into the Compressed Bloom Filter. The result shows that a high compression rate increases false positives, which is not desirable for a Bloom Filter. Figure 5 reports the Effective False Positive Probability (EFPP), the Probability of False Positive (PFP), and the difference between the two (Difference) [25]. The difference is calculated as follows:
$$Difference = 100 \times \frac{PFP}{EFPP}$$
3. Variants of Bloom Filter
3.1 Scalable Bloom Filter
Scalable Bloom Filter (SBF) [6] is a Bloom Filter made up of a series of one or more Bloom Filters. In each Bloom Filter, the array is partitioned into k slices, with each hash function assigned to one slice. During an insertion operation, for each element the k hash functions each produce an index in their respective slices, so each element is described using k bits. When one Bloom Filter is full, another Bloom Filter is added. During a query operation, all filters are searched for the presence
of that element. The k-bit description of each element makes this filter more robust, with no element especially sensitive to false positives. In addition, this Bloom Filter has the advantage of scalability: it adapts to set growth by adding a series of classic Bloom Filters, and the error probability can be made tighter as required.
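A minimal sketch of the slice layout described above, where hash function i sets exactly one bit in slice i; the salted std::hash calls stand in for k independent hash functions and are an assumption, not the hashing used by the SBF authors:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // k slices, one per hash function; each element sets exactly one bit per
    // slice, so it is described by k bits in total.
    class SlicedBloomFilter {
    public:
        SlicedBloomFilter(std::size_t sliceBits, unsigned k)
            : sliceBits_(sliceBits), k_(k), bits_(sliceBits * k, false) {}

        void insert(const std::string& key) {
            for (unsigned i = 0; i < k_; ++i)
                bits_[i * sliceBits_ + indexInSlice(key, i)] = true;
        }

        bool mayContain(const std::string& key) const {
            for (unsigned i = 0; i < k_; ++i)
                if (!bits_[i * sliceBits_ + indexInSlice(key, i)]) return false;
            return true;  // "possibly present"; false positives remain possible
        }

    private:
        std::size_t indexInSlice(const std::string& key, unsigned i) const {
            // Salting the key with the slice number simulates k independent hashes.
            return std::hash<std::string>{}(key + "#" + std::to_string(i)) % sliceBits_;
        }

        std::size_t sliceBits_;
        unsigned k_;
        std::vector<bool> bits_;
    };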
3.2 Adaptive Bloom Filter
Adaptive Bloom Filter (ABF) [27] is a Bloom Filter based on the Partial-Address Bloom Filter [28]. ABF is used to track far-misses. Far-misses are misses that would be hits if the core were permitted to use more cache. To each set of each core, a Bloom Filter array (BFA) with 2^k bits is added. When a tag is removed from the cache, the tag's k least significant bits are used to index a bit of the BFA, which is set to 1. During a cache miss, the BFA is looked up for the requested tag using its k least significant bits. A far-miss is detected when the indexed array bit is 1.

3.3 Blocked Bloom filter
Blocked Bloom Filter [2] is a cache-efficient Bloom Filter. It is implemented by fitting a sequence of b standard Bloom Filters into cache blocks/lines. Usually, for better performance, the Bloom Filters are made cache-line-aligned. When an element is added, the first hash function determines which Bloom Filter block to use. The other hash functions map the element to k array slots, but within this block, so an operation leads to only one cache miss. This is further improved by taking a single hash function instead of k hash functions; this single hash function determines the k slots. In addition, the hash operation can be implemented using fewer SIMD instructions. The main disadvantage of using one hash function is that two elements mapped to the same k slots cause a collision, and this leads to an increased false positive rate (FPR).

3.4 Dynamic Bloom Filter
Dynamic Bloom Filter (DBF) [29] is an extension of the Bloom Filter which changes dynamically with changing cardinality. A DBF consists of some number s of CBFs (Counting Bloom Filters). Initially s is 1 and the status of the CBF is active. A CBF is called active when a new element is inserted into it or an element is deleted from it. During an insertion operation, the DBF first checks whether the active CBF is full. If it is full, a new CBF is added and its status is made active; if not, the new element is added to the active CBF. During a query operation, the response is given after searching all CBFs. During a deletion operation, the CBF which contains the element is found first. If a single CBF contains that element, it is deleted. However, if multiple CBFs contain it, the deletion operation is ignored but a deleted response (i.e., the operation is completed) is delivered. Furthermore, if the sum of two CBFs' capacities is less than that of a single CBF, they are united; for that, addition of the counter vectors is done. The time complexity of insertion is the same as for a standard Bloom Filter, whereas the query and deletion operations are O(k x s), where k is the number of hash functions and s is the number of CBFs.

3.5 Deletable Bloom Filter
Deletable Bloom Filter (DlBF) [30] is a Bloom Filter that enables false-negative-free deletions. In this Bloom Filter, the region of deletable bits is encoded compactly and saved in the filter memory. DlBF divides the Bloom Filter array into regions. Each region is marked as deletable or non-deletable using a bitmap whose size equals the number of regions. During an insertion operation, when an element maps to an existing element slot, i.e., a collision, the corresponding region is marked as non-deletable, i.e., the bitmap bit is assigned the value 1. This information is used during deletion: only elements under deletable regions are allowed to be deleted. Insertion and query operations in DlBF are the same as in the traditional Bloom Filter.

3.6 Index-Split Bloom Filters
Index-Split Bloom Filter (ISBF) [31] helps in reducing memory requirements and off-chip memory accesses. It consists of many groups of on-chip parallel CBFs and a list of off-chip items. When a set of items is stored, the index of each item is divided into B groups. Each group contains b bits, where B = ceil((log2 n)/b), so the items are split into 2^b subsets. Each subset is represented by a CBF; thus, a total of 2^b CBFs per group are constructed in on-chip memory. During a query operation, a response is given after matching the query element against the index of an item found by the B groups of on-chip parallel CBFs. For the deletion operation, a lazy algorithm is followed, because deletion of an item requires adjustment of the indexes of the other off-chip items and reconstruction of all on-chip CBFs. The average time complexity of off-chip memory accesses for insertion, query, and deletion is O(1).
3.7 Quotient filter
Quotient Filter (QF) [5] is a Bloom Filter variant in which the elements are represented by a multi-set F. F is an open hash table with a total of m = 2^q buckets, a technique called quotienting [32]. F stores a p-bit fingerprint for each element, which is the element's hash value. In this technique, a fingerprint f is partitioned into its r least significant bits, which store the remainder, and the q = p - r most significant bits, which store the quotient. Both quotient and remainder are used to reconstruct the full fingerprint. During an insertion operation, F stores the hash value. During a query operation, F is searched for the presence of the hash value of the element. And during a deletion operation, the hash value of that element is removed from F. QF has the advantage of dynamic resizing, i.e., it expands and shrinks as elements are added or deleted. However, the QF insertion throughput deteriorates towards maximum occupancy.
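A small sketch of the quotienting step described above (assuming 0 < r < p <= 64; names are illustrative, not from the QF implementation):

    #include <cstdint>
    #include <utility>

    // Split a p-bit fingerprint into its quotient (high q = p - r bits, which
    // selects one of m = 2^q buckets) and remainder (low r bits, stored in the
    // bucket).
    std::pair<std::uint64_t, std::uint64_t>
    splitFingerprint(std::uint64_t fingerprint, unsigned p, unsigned r) {
        const std::uint64_t remainder = fingerprint & ((std::uint64_t{1} << r) - 1);
        const std::uint64_t quotient  = (fingerprint >> r) & ((std::uint64_t{1} << (p - r)) - 1);
        return {quotient, remainder};
    }
    // The full fingerprint is recovered as (quotient << r) | remainder.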
3.8 NameFilter
NameFilter [33] is a two-tier filter which helps in looking up names in Named Data Networking. The first tier determines the length of the name prefix, and the second tier makes use of the prefix determined in the previous stage to search a group of Bloom Filters. In the first stage, the name prefixes are mapped to a Bloom Filter; thereafter, a Counting Bloom Filter is built for the concerned prefix set and then converted into a conventional Boolean Bloom Filter. As a final step, the second stage uses the merged Bloom Filter (MBF). In the first stage, the calibration of the name prefixes to the Bloom Filter is done on the basis of their lengths. It maps the k hash functions into a single word; hence, the Bloom Filter is called a One Memory Access Bloom Filter, as the query access time is O(1) instead of O(k). First, it acquires the hash output of the prefix using the DJB hash method. Then, each later hash value is calculated from the previous hash value. Thus, after k - 1 loops, it obtains a single hash value and stores it in a word. This value is the input for calculating the address in the Bloom Filter, and the rest of the bits are checked with one AND operation; when those k - 1 bits are all 1s, a match is declared. The aim of this stage is to find the longest prefix. In the second stage, the prefixes are divided into groups based on their associated next-hop port(s). All groups are stored in the Bloom Filter, and the desired port is found in this stage. In the MBF, each slot stores a machine-word-aligned bit string. The Nth bit stores the Nth Bloom Filter's hash value and the rest of the bits are padded with 0s. To obtain the forwarding port number, an AND operation is done on the k bit strings with respect to the k hash functions; the location of a 1 in the result gives the port number.
3.9 Cuckoo Filter
A Cuckoo Filter [3] is based on the cuckoo hash table [34]. This filter stores fingerprints instead of key-value pairs, where a fingerprint is the hash value of the element. For insertion, the indices of two candidate buckets are calculated: one is the hash value of the element, and the other is the XOR of the hash value of the element and the hash value of the fingerprint of that element. This is called partial-key cuckoo hashing. This method reduces hash collisions and improves table utilization. After finding the indices, the element is stored in any free bucket; otherwise, the cuckoo hash table's kicking [34] of elements is done. For a query operation, the two candidate buckets are calculated as in the insertion operation; if the element is present in either one of them, true is returned, otherwise false. For a deletion operation, the same procedure as lookup is followed, except that instead of returning true or false, the element is deleted. The advantage of the basic algorithms (i.e., insertion, deletion, and lookup) is that they are independent of the hash table configuration (e.g., the number of entries in each bucket). However,
the disadvantage of using partial-key cuckoo hashing for storing fingerprints is that the fingerprint size must grow as the filter size increases. In addition, if the hash table is very large but stores short fingerprints, hash collisions increase; this raises the chance of insertion failure and also reduces the table occupancy.
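A sketch of partial-key cuckoo hashing as described above, assuming a power-of-two number of buckets so that the XOR step is its own inverse; the hash functions and fingerprint width are placeholders, not those of the original implementation:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>

    struct Candidates {
        std::size_t i1;            // first candidate bucket
        std::size_t i2;            // second candidate bucket
        std::uint16_t fingerprint; // short fingerprint stored in the table
    };

    // numBuckets must be a power of two so that XOR with the masked fingerprint
    // hash is its own inverse: i1 == (i2 ^ (hash(fp) & mask)) & mask.
    Candidates candidateBuckets(const std::string& item, std::size_t numBuckets) {
        const std::size_t mask = numBuckets - 1;
        const std::size_t h = std::hash<std::string>{}(item);
        const auto fp = static_cast<std::uint16_t>((h ^ (h >> 16)) | 1u);  // non-zero fingerprint
        const std::size_t i1 = h & mask;
        const std::size_t i2 = (i1 ^ (std::hash<std::uint16_t>{}(fp) & mask)) & mask;
        return {i1, i2, fp};
    }
    // Given only (i2, fp), the alternate bucket can be recomputed, which is what
    // allows relocation ("kicking") without access to the original item.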
3.10 Multi-dimensional Bloom Filter
Crainiceanu et al. proposed a Bloom Filter index called Bloofi [10]. It is implemented as a tree, constructed as follows: the leaves are Bloom Filters, and a bitwise OR of the leaf Bloom Filters is taken to obtain the parent nodes; this process continues until the root is obtained. During a lookup operation, the element is checked at the root; if it does not match there, false is returned, because an element absent from the root cannot match on any path from the root down to a leaf. If the element matches, the query moves on to the root's children Bloom Filters until it reaches a leaf. During insertion of a new node, a search for the node most similar to the new node is done, since Bloofi aims to keep similar nodes together; when found, the new node is inserted as its sibling. If an overflow occurs, the same procedure is followed as in a B+ tree. During a deletion operation, the parent node deletes the pointer to the node, and when an underflow occurs, the same procedure is followed as in a B+ tree.
3.11 Sliding Bloom Filter
Sliding Bloom Filter [7] is a Bloom Filter with a sliding window and parameters (n, m, ε). The sliding window covers the last n elements, for which the filter answers positively. The m elements that appear before the window elements have no restriction on the returned value, and ε is the maximum permitted false positive probability. This Bloom Filter is dictionary based and uses Backyard Cuckoo hashing [35], to which a lazy deletion method similar to the one used by Thorup [36] is applied. A parameter c is used, which trades off the accuracy of the stored index against the number of elements stored in the dictionary; after optimizing the parameter c, the Sliding Bloom Filter shows good time and space complexity. The algorithm uses a hash function selected from a family of universal hash functions. For each element in the dictionary D, it stores the hash value and the location where the element previously appeared. The stream of data is divided into generations of size n/c each, where c is optimized later: generation 1 is the first n/c elements, generation 2 is the next n/c elements, and so on. The current window contains the last n elements and at most c + 1 generations. Two counters are used, one for the generation number (say g) and one for the current element in the generation (say i). When i completes a generation, g is incremented modulo (c + 1). For insertion, first obtain the
hash value of the element and check whether it is present in D. If it exists, the location of the element is updated with the current generation; otherwise, the hash value and generation number are stored. Finally, the two counters are updated. If g changes, D is scanned and all elements with associated data equal to the new value of g are deleted.
3.12 Bloom Filter Trie
Bloom Filter Trie (BFT) [37] helps to store and compress a set of colored k-mers and to traverse the graph efficiently. It is an implementation of the colored de Bruijn graph (CDBG). It is based on a burst trie which stores k-mers along with their sets of colors. Colors are bit arrays initialized with 0; a slot is assigned the value 1 if the k-mer at that index has that color. Later, this set of colors is compressed. A BFT is defined as t = (Vt, Et) with maximum height k, where the k-mers are split into k substrings. A BFT is a list of compressed containers. An uncompressed container of a vertex V is defined as < s, color_ps >, where s is the suffix and p is the prefix which represents the path from the root to V. Tuples are ordered lexicographically by their suffixes. BFT supports operations for storing, traversing, and searching a pan-genome, and it also helps in extracting relevant information about the contained genomes and subsets. The time complexity for insertion of a k-mer is O(d + 2^λ + 2q), where d is the worst-case lookup time, λ is the number of bits used to represent the prefix, and q is the maximum number of children. The time complexity of the lookup operation is O(2^λ + q).

3.13 Autoscaling Bloom Filter
Autoscaling Bloom Filter (ABF) [38] is a generalization of the CBF which allows adjustment of its capacity based on probabilistic bounds on false positives and true positives. It is constructed by binarization of the CBF. A standard Bloom Filter is obtained by setting all nonzero positions of the CBF to 1. Given a CBF, the ABF is constructed by setting all values that are less than or equal to a threshold value to 0.

3.14 d-left Counting Bloom filter
d-Left CBF (dlCBF) [4] is an improvement of the CBF. It is implemented using a d-left hash table. This hash table consists of buckets, where each bucket has a fixed number of cells. Each cell has a fixed size to hold a fingerprint and a counter, which makes the hash table appear as a big array. Each element has a fingerprint, and each fingerprint has two parts: the first part is a bucket index, which determines where the element is stored, and the second part is the remainder of the fingerprint. The range of the bucket index is [B] and that of the remainder is [R], so the hash function is H: U → [B] × [R]. During element insertion, the element is hashed and the remainder is stored in a cell of the appropriate bucket, and the counter is incremented; during deletion, the counter is decremented. dlCBF solves the problems that arise due to the use of a single hash function. The hashing operation has two phases: in the first phase, a hash function is applied, which gives the true fingerprint; in the second phase, the d locations of the element are found using additional (pseudo-)random permutations. One small disadvantage is that the obtained d locations are not independent and uniform, as they are determined by the choice of permutations.

3.15 Ternary Bloom Filter
Ternary Bloom Filter (TBF) [9] is another improvement of the CBF. This Bloom Filter introduces a parameter v for each hash value, which can take the values 0, 1, or X. During an insertion operation, if an element is mapped to a hash value for the first time, v is assigned the value 1. If another element is mapped to the same hash value, v is assigned the value X. During a lookup operation, if every v value for an element's hash values is X, the result is defined as indeterminable, meaning the element cannot be identified as negative or positive; the value 1 indicates the element is present and the value 0 indicates the element is absent. Similarly, in a deletion operation, if every v value for an element's hash values is X, the element is defined as undeletable, meaning it cannot be deleted from the TBF; if v is 1, it is assigned the value 0. TBF allocates the minimum number of bits to each cell, which saves memory. In addition, it also gives a much lower false positive rate compared to the CBF when the same amount of memory is used by both filters.

3.16 Difference Bloom Filter
Difference Bloom Filter (DBF) [13] is a probabilistic data structure based on the Bloom Filter. It supports multi-set membership queries with higher accuracy and a faster response speed. It is based on two main design principles: first, make the representation of the membership of elements exclusive by writing a different number of 0s and 1s into the same filter; second, use DRAM memory to increase the accuracy of the filter. DBF consists of an SRAM filter and a DRAM chaining hash table. The SRAM filter is an array of m bits with k independent hash functions. During the insertion operation, elements of set i are mapped to k bits of the filter; arbitrarily, k - i + 1 bits are set to the value 1 and the other i - 1 bits are set to the value 0. This is called the < i, k > constraint. If a new element conflicts with another element in the filter, DBF uses a dual-flip strategy to make the conflicting bit shared. A dual-flip changes a series of mapped bits of the filter so that the filter still satisfies the < i, k > constraint. During a lookup operation, if exactly k - i + 1 bits are 1, it returns true. During a deletion operation, for each of the k bits of an element, DBF decides whether to reset it or not with the help of the DRAM table.
3.17 Self-adjustable Bloom Filter
TinySet [8] is a Bloom Filter that has better space efficiency than the standard Bloom Filter. Its structure is
similar to the blocked Bloom Filter, but each block is a chain-based hash table [39]. It uses a single hash function, H → B × L × R, where B is the block number, L is the index of the chain within that block, and R is the remainder (or fingerprint) that is stored in that block. All operations (insertion, deletion, and lookup) initially follow three common steps. First, the hash function is applied to the element to obtain the B, L, and R values. Second, B is used to access the specific block. Third, the Logical Chain Offset (LCO) and Actual Chain Offset (ACO) values are calculated. During an insertion operation, all fingerprints from the offset to the end of the block are shifted right. The first bits in a block contain a fixed-size index (I); an unset I means the chain is empty. If the I bit is unset, it is set and the new element is marked as the last of its chain. During a deletion operation, if I is unset, the operation is terminated, as this indicates the element is absent. Otherwise, all bits from the ACO value to the end of the block are shifted left by a single bit. If the deleted element was marked last, the previous element is marked last, or the entire chain is marked as empty. In a lookup operation, if I is unset, the operation is similarly terminated; otherwise, the chain is searched. TinySet is more flexible due to its ability to dynamically change its configuration as per the actual load. It accesses only a single memory word and partially supports deletion of elements. However, the delete operation gradually degrades its space efficiency over time.
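A sketch of TinySet's single-hash decomposition H → B × L × R described above; the field widths and hash function are assumptions for illustration only:

    #include <cstdint>
    #include <functional>
    #include <string>

    // Decompose a single hash value into (B, L, R): block number, chain index
    // within the block, and the remainder actually stored. Field widths
    // (64 chains per block, 8-bit remainder) are illustrative assumptions.
    struct TinySetSlot {
        std::uint32_t block;      // B: which block (cache line) to access
        std::uint32_t chain;      // L: which chain inside that block
        std::uint32_t remainder;  // R: the fingerprint stored in the block
    };

    TinySetSlot locate(const std::string& key, std::uint32_t numBlocks) {
        const std::uint64_t h = std::hash<std::string>{}(key);
        TinySetSlot s;
        s.block = static_cast<std::uint32_t>(h % numBlocks);
        s.chain = static_cast<std::uint32_t>((h >> 32) & 0x3Fu);     // 6 bits -> 64 chains
        s.remainder = static_cast<std::uint32_t>((h >> 38) & 0xFFu); // 8-bit remainder
        return s;
    }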
3.18 Multi-stage Bloom Filter
BloomFlow [12] is a multi-stage Bloom Filter which is used for multicasting in Software-Defined Networking (SDN). It helps to achieve reductions in forwarding state while avoiding false positive packet delivery. BloomFlow extends the OpenFlow [11] Forward action with a new virtual port called BLOOM_PORTS to implement Bloom Filter forwarding. When a flow specifies an output action to BLOOM_PORTS, the forwarding element (FE) implements the following algorithm. The algorithm first reads, from the start of the IP options field, the Elias gamma encoded filter length b and number of hash functions k fields. It then treats the rest of the bits of the IP options field as a Bloom Filter, and this Bloom Filter is copied to a temporary cache for further processing. The remainder of the IP options field and the IP payload are shifted left to remove the first stage filter from the packet header. The algorithm then iterates through all interfaces and performs a membership test for each interface's Bloom identifier against the cached Bloom Filter. A Bloom identifier is a unique 16-bit integer identifier assigned by the network controller to every interface on the network that participates in multicast forwarding. If the membership test returns true, the packet is forwarded out of the matched interface.
3.19 Dynamic Reordering Bloom Filter
Dynamic Reordering Bloom Filter [14] is another type of Bloom Filter, one that reduces the cost of searching a sequence of Bloom Filters. It dynamically reorders the search sequence of multiple Bloom Filters using a One Memory Access Bloom Filter (OMABF), and the order of checking is saved in a Query Index (QI). This approach considers two factors: first, the policy for changing the query priority of a Bloom Filter; second, reducing the overhead caused by changes to the order. The approach reduces query time by sorting and saving the query data in the Bloom Filters based on popularity. Sorting is done based on the query order, i.e., the popularity of the data, so when a request for that data arrives, the response is given quickly. When the popularity of a data item increases, its query order is moved a level higher in the Bloom Filter sequence. However, this change of query order imposes overheads. To solve this, the Query Index (QI) is used: the QI saves the query priority of each block, and when membership is checked, the Bloom Filters are checked according to the order saved in the QI.
4. Conclusion
The Bloom Filter is a widely used data structure. A Bloom Filter can be associated with a system to improve its performance dramatically, and it does not waste much main memory: the Bloom Filter provides a fast lookup system with a few KB of memory space. The Bloom Filter returns either 0 (False) or 1 (True). However, this Boolean value is classified into four categories, namely TN, TP, FP, and FN. The TN and TP boost the lookup performance of a system. On the contrary, the FP and FN become an overhead to the system; nevertheless, the FN does not occur in most variants of the Bloom Filter. The FP is the key barrier of the Bloom Filter, and therefore there are several kinds of Bloom Filters available. The key objective of modern Bloom Filters is to reduce the probability of FP. In addition, modern Bloom Filters also deal with high scalability, space efficiency, adaptability, and high accuracy. Besides, the Bloom Filter has copious applications, and thus extensive experiments have been done on it. The paper discusses a few selected applications to highlight the efficacy of the Bloom Filter; it is observed that the Bloom Filter is applied extensively in computer networking. Moreover, the efficiency and accuracy of a Bloom Filter depend on the probability of false positives; therefore, reducing the false positive probability is a prominent challenge. Ideally, the Bloom Filter will eventually be able to reduce the false positive probability to approximately zero. In this paper, we presented a theoretical and practical analysis of the Bloom Filter. Moreover, there are abundant Bloom Filter variants, which are discussed in this paper. Furthermore, the issues and challenges of the Bloom Filter are discussed. Also, we have exposed the disadvantages of the compressed
Bloom Filter through an experiment. Moreover, the FP analysis is also shown through an experiment.
References [1] B. H. Bloom, “Space/time trade-o s in hash coding with allowable errors,” Comm. of the ACM, vol. 13, no. 7, pp. 422–426, 1970. [2] F. Putze, P. Sanders, and J. Singler, “Cache-, hash-, and space-efficient bloom filters,” J. Exp. Algorithmics, vol. 14, pp. 4:4.4–4:4.18, Jan. 2010. [3] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher, “Cuckoo filter: Practically better than bloom,” in Proceedings of the 10th ACM Intl. Conf. on Emerging Networking Experiments and Technologies, ser. CoNEXT ’14, 2014, pp. 75–88. [4] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, An Improved Construction for Counting Bloom Filters. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 684–695. [5] M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok, “Don’T Thrash: How to cache your hash on flash,” Proc. VLDB Endow., vol. 5, no. 11, pp. 1627–1637, July 2012. [6] P. S. Almeida, C. Baquero, N. Preguica, and D. Hutchison, “Scalable bloom filters,” Information Processing Letters, vol. 101, no. 6, pp. 255–261, 2007. [7] M. Naor and E. Yogev, “Tight bounds for sliding bloom filters,” Algorithmica, vol. 73, no. 4, pp. 652–672, 2015. [8] G. Einziger and R. Friedman, “Tinyset- an access efficient self adjusting bloom filter construction,” IEEE/ACM Transactions on Networking, vol. 25, no. 4, pp. 2295–2307, 2017. [9] H. Lim, J. Lee, H. Byun, and C. Yim, “Ternary bloom filter replacing counting bloom filter,” IEEE Communications Letters, vol. 21, no. 2, pp. 278–281, 2017. [10] A. Crainiceanu and D. Lemire, “Bloofi: Multidimensional bloom filters,” Information Systems, vol. 54, no. Supplement C, pp. 311 – 324, 2015. [11] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008. [12] A. Craig, B. Nandy, I. Lambadaris, and P. Koutsakis, “Bloomflow: Openflow extensions for memory efficient, scalable multicast with multi-stage bloom filters,” Computer Communications, vol. 110, no. Supplement C, pp. 83 – 102, 2017. [13] D. Yang, D. Tian, J. Gong, S. Gao, T. Yang, and X. Li, “Difference bloom filter: A probabilistic structure for multi-set membership query,” in 2017 IEEE International Conference on Communications (ICC), 2017, pp. 1–6. [14] D. C. Chang, C. Chen, and M. Thanavel, “Dynamic reordering bloom filter,” in 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS), 2017, pp. 288–291. [15] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 4:1–4:26, 2008. [16] R. Anitha and S. Mukherjee, “’maas’: Fast retrieval of data in cloud using metadata as a service,” Arabian Journal for Science and Engineering, vol. 40, no. 8, pp. 2323–2343, 2015. [17] Y. Zhu, H. Jiang, and J. Wang, “Hierarchical bloom filter arrays (hba): A novel, scalable metadata management system for large clusterbased storage,” in CLUSTER ’04, Proceedings of the 2004 IEEE International Conference on Cluster Computing, 2004, pp. 165–174. [18] Y. Zhu, H. Jiang, J. Wang, and F. Xian, “Hba: Distributed metadata management for large cluster-based storage systems,” IEEE transactions on parallel and distributed systems, vol. 19, no. 6, pp. 750 – 763, 2008. [19] Y. Hua, Y. 
Zhu, H. Jiang, D. Feng, and L. Tian, “Supporting scalable and adaptive metadata management in ultralarge-scale file systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 4, pp. 580 – 593, 2011.
[20] D. Zhu and M. Mutka, “Sharing presence information and message notification in an ad hoc network,” in Pervasive Computing and Communications, 2003.(PerCom 2003). Proceedings of the First IEEE International Conference on. IEEE, 2003, pp. 351–358. [21] L. Maccari, R. Fantacci, P. Neira, and R. M. Gasca, “Mesh network firewalling with bloom filters,” in Communications, 2007. ICC’07. IEEE International Conference on. IEEE, 2007, pp. 1546–1551. [22] G. Fernandez-Del-Carpio, D. Larrabeiti, and M. Uruena, “Forwarding of multicast packets with hybrid methods based on bloom filters and shared trees in mpls networks,” in 2017 IEEE 18th International Conference on High Performance Switching and Routing (HPSR), 2017, pp. 1–8. [23] F. Grandi, “On the analysis of bloom filters,” Information Processing Letters, vol. 129, no. Supplement C, pp. 35 – 39, 2018. [24] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz, “Theory and practice of bloom filters for distributed systems,” IEEE Communications Surveys Tutorials, vol. 14, no. 1, pp. 131–155, 2012. [25] A. Partow, “C++ bloom filter library,” Accessed on December, 02, 2017 from http://www.partow.net/programming/bloomfilter/index.html and https://github.com/ArashPartow/bloom. [26] M. Mitzenmacher, “Compressed bloom filters,” IEEE/ACM Transactions on Networking, vol. 10, no. 5, pp. 604–612, 2002. [27] K. Nikas, M. Horsnell, and J. Garside, “An adaptive bloom filter cache partitioning scheme for multicore architectures,” in 2008 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, 2008, pp. 25–32. [28] J.-K. Peir, S.-C. Lai, S.-L. Lu, J. Stark, and K. Lai, “Bloom filtering cache misses for accurate data speculation and prefetching,” in ACM International Conference on Supercomputing 25th Anniversary Volume. ACM, 2014, pp. 347–356. [29] D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo, “The dynamic bloom filters,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 120–133, 2010. [30] C. E. Rothenberg, C. A. B. Macapuna, F. L. Verdi, and M. F. Magalhaes, “The deletable bloom filter: a new member of the bloom family,” IEEE Communications Letters, vol. 14, no. 6, pp. 557–559, 2010. [31] K. Huang and D. Zhang, “An index-split bloom filter for deep packet inspection,” Science China Information Sciences, vol. 54, no. 1, pp. 23–37, Jan 2011. [32] D. E. Knuth, The art of computer programming: sorting and searching. Pearson Education, 1998, vol. 3. [33] Y. Wang, T. Pan, Z. Mi, H. Dai, X. Guo, T. Zhang, B. Liu, and Q. Dong, “Namefilter: Achieving fast name lookup with low memory cost via applying two-stage bloom filters,” in 2013 Proceedings IEEE INFOCOM, 2013, pp. 95–99. [34] R. Pagh and F. F. Rodler, “Cuckoo hashing,” Journal of Algorithms, vol. 51, no. 2, pp. 122–144, 2004. [35] Y. Arbitman, M. Naor, and G. Segev, “Backyard cuckoo hashing: Constant worst-case operations with a succinct representation,” in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 2010, pp. 787–796. [36] M. Thorup, “Timeouts with time-reversed linear probing,” in Infocom, 2011 Proceedings Ieee. IEEE, 2011, pp. 166–170. [37] G. Holley, R. Wittler, and J. Stoye, “Bloom filter trie: an alignmentfree and reference-free data structure for pan-genome storage,” Algorithms for Molecular Biology, vol. 11, 2016. [38] D. Kleyko, A. Rahimi, and E. Osipov, “Autoscaling bloom filter: Controlling trade-off between true and false positives,” CoRR, vol. abs/1705.03934, 2017. [Online]. 
Available: http://arxiv.org/abs/1705.03934 [39] R. L. Rivest and C. E. Leiserson, Introduction to algorithms. McGraw-Hill, Inc., 1990.
An Online Interactive Database Platform For Career Searching
Brandon St. Amour and Zizhong John Wang
Department of Mathematics and Computer Science, Virginia Wesleyan University, Virginia Beach, Virginia 23455, USA
Abstract
The purpose of this research is to create a functional social platform for rising graduates, recent graduates, and veterans of the work force, to help guide those who are about to graduate and begin searching for a career. This program allows these groups of people to connect and interact with each other. The approach focused on security and functionality. The key features are an interactive partners list, an interests list, a commenting function for each interest so users can interact with others who share the same interests, a partner search, and secure login and logout features. The project was implemented with CSS, HTML, PHP, and MySQL.
Keywords: database, career searching, PHP, MySQL

1. Introduction
What are the official steps that someone who has just graduated should take in order to launch their career? How do they know that the countless interviews are with the right company, one that could help propel their career? These were the questions that gave this project life [1]. This project is specifically aimed at college students and those who have just recently graduated, and that is why this project matters. Many students in college, or who have just graduated, are struggling to find their preferred career path, let alone a single job. This website allows these people to create a very basic research template and some references so that they can retrieve information from others who are already established in the fields they are interested in. The paper is organized in the following sections. The logical structure of the system is introduced in the next section. In Section 3, the details of the main modules are presented as a case study, along with code samples. The conclusion is given in Section 4.
2. System Structure
In order to accomplish the goals proposed in the abstract, the first step was to focus on a secure login feature [2]. Next was to allow users to request partners, which lets them view each other's personal information in order to communicate further. The other significant result of this research was the creation of a dynamic page which allows users with the same interest to communicate. First, the user selects their career interests; this information is then sent to a table that stores the user's ID number and the selected interest. Then, for each interest that person selected, they can see the posting page for that interest and have the option of posting a comment. This was done by creating a table for each possible career interest that contains a user's username and the comment they posted. Finally, a logout feature was incorporated to ensure that no session variables are vulnerable when a user is done [3]. When visiting the site, the user is first sent to FirstStepIndex.php, where they have the option to register an account or to log in. Once an account is set up, the user has multiple options, such as visiting their profile information, selecting career interests, and requesting partners as friends. These function pages are all available on a sidebar so the user can view all of their options at the same time. The system flowchart is shown in Figure 1. The first tables created allow users to register an account and log in afterwards. First, the table "FirstStep" was created in order to store a user's password securely. The username and passwords are passed through the md5 function to make sure that no one, not even the administrator, knows anyone's password. In order to utilize the personal information and present it on the web page, a second table had to be created that contains the user's personal information.
[Figure 1 (system flowchart) shows the pages FirstStepIndex.php, OldUser.php, NewUser.php, NewUserAction.php, FirstStepHomePage.php, and the function pages MyProfile.php, MyProfileSettings.php, MyProfileSettingsAction.php, CareerInterests.php, CareerInterestsAction.php, MyInterests.php, MyInterestsAction.php, and other functional modules.]
Fig. 1 The system flowchart
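The secure login and logout flow described in this section might look roughly like the PHP sketch below. The database credentials, the table name "FirstStep", and the column names are assumptions based on the description rather than the authors' code, and PDO is used here simply as one common way to talk to MySQL from PHP.

```php
<?php
// Hypothetical sketch of the login (OldUser.php-style) and logout flow.
// Table/column names and credentials are assumptions, not the authors' code.
session_start();

$pdo = new PDO('mysql:host=localhost;dbname=firststep', 'dbuser', 'dbpass');

function login(PDO $pdo, string $username, string $password): bool {
    // The paper stores md5 hashes of both the username and the password,
    // so the submitted values are hashed before the lookup.
    $stmt = $pdo->prepare('SELECT UserID FROM FirstStep WHERE Username = ? AND Password = ?');
    $stmt->execute([md5($username), md5($password)]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row !== false) {
        $_SESSION['user_id'] = $row['UserID'];  // used by the other function pages
        return true;
    }
    return false;
}

function logout(): void {
    // Logging out clears every session variable so none remain exposed.
    $_SESSION = [];
    session_destroy();
}
```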
Fig. 2 The FirstStep table structure
The table "FirstStep2" contains the user's personal information, such as e-mail address and phone number, but does not include the password; this table can therefore be used safely to output users' information without any risk of printing a password. Another important table is "Partners." Its main purpose is to record the specific status between two users. The table relies on a column named Status, which uses the values 0, 1, 2, and 3: 0 means the user has rejected a partner request, 1 means a user has sent a partner request, 2 means the requested user has accepted the request but the sender has not yet seen the notification, and 3 means the two users are partners.
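One plausible shape for the "Partners" table and its Status codes, matching the description above, is sketched below; the exact column types and names are assumptions.

```php
<?php
// Hypothetical DDL and helper for the "Partners" table described above.
// Column names/types are assumptions consistent with the text.
$pdo = new PDO('mysql:host=localhost;dbname=firststep', 'dbuser', 'dbpass');

$pdo->exec(
    'CREATE TABLE IF NOT EXISTS Partners (
         User1ID INT NOT NULL,      -- the user who sent the request
         User2ID INT NOT NULL,      -- the user who received the request
         Status  TINYINT NOT NULL,  -- 0, 1, 2, or 3 (see below)
         PRIMARY KEY (User1ID, User2ID)
     )'
);

// The Status values used in the paper.
function statusLabel(int $status): string {
    switch ($status) {
        case 0:  return 'partner request rejected';
        case 1:  return 'partner request sent';
        case 2:  return 'request accepted, sender has not yet seen the notification';
        case 3:  return 'partners';
        default: return 'unknown status';
    }
}
```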
3. Case Study
The default page a visitor is first brought to is FirstStepIndex.php. This page allows the user to log in if they already have an account, or links to NewUser.php where they can set one up. The page takes advantage of HTML form programming. To create an account on FirstStep, the user must submit a first name, last name, username, password, e-mail, and phone number, again through an HTML form [3]. Before the submitted information can be stored in a MySQL table, a number of conditions must be checked. First, the input from the form is checked to see whether any required field is empty; if so, the user is prompted to complete it. If not, a
MySQL query is used to search the table "FirstStep" to make sure the desired username is not already taken. Next, the two passwords are checked to make sure they match, so the user has not accidentally mistyped a password they did not intend to use. Finally, if all of these conditions are met, the username and password are hashed with the md5 function so they can be stored securely, and a query inserts the user's first name, last name, and the hashed username and password into "FirstStep." If this query succeeds, another query inserts all of the user's information, except the password, into "FirstStep2," where the username is not hashed [4].
Fig. 3 The action script for new users
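A minimal sketch of the checks described above (empty fields, unique username, matching passwords, md5 hashing, then inserts into "FirstStep" and "FirstStep2") is shown below. The form field names, column names, and the second password field ("Password2") are assumptions, not the authors' exact code.

```php
<?php
// Hypothetical NewUserAction.php-style logic; field and column names are assumed.
$pdo = new PDO('mysql:host=localhost;dbname=firststep', 'dbuser', 'dbpass');

// 1. None of the required fields may be empty.
foreach (['FirstName', 'LastName', 'Username', 'Password', 'Password2', 'Email', 'Phone'] as $field) {
    if (empty($_POST[$field])) {
        exit('Please fill in every field.');
    }
}

// 2. The desired username must not already be taken in "FirstStep".
$hashedUser = md5($_POST['Username']);
$check = $pdo->prepare('SELECT 1 FROM FirstStep WHERE Username = ?');
$check->execute([$hashedUser]);
if ($check->fetch() !== false) {
    exit('That username is already taken.');
}

// 3. The two passwords must match.
if ($_POST['Password'] !== $_POST['Password2']) {
    exit('The two passwords do not match.');
}

// 4. Store the md5-hashed username and password in "FirstStep" ...
$hashedPass = md5($_POST['Password']);
$insert = $pdo->prepare(
    'INSERT INTO FirstStep (FirstName, LastName, Username, Password) VALUES (?, ?, ?, ?)');
$insert->execute([$_POST['FirstName'], $_POST['LastName'], $hashedUser, $hashedPass]);

// ... and the personal information, without the password and with the plain
// username, in "FirstStep2".
$insert2 = $pdo->prepare(
    'INSERT INTO FirstStep2 (Username, FirstName, LastName, Email, Phone) VALUES (?, ?, ?, ?, ?)');
$insert2->execute([$_POST['Username'], $_POST['FirstName'], $_POST['LastName'],
                   $_POST['Email'], $_POST['Phone']]);
```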
If the user at any time wishes to view or edit their profile information, they can click on "My Profile" in the sidebar. Here, they will see all of the personal information that was stored in the table "FirstStep2."
Fig. 4 The screen shot for the profile page
After the user has logged in for the first time, one of the first things they should do is select their career interests so they can interact with others who have the same interests. Each interest is printed using for loops that iterate through each array and add a checkbox to each row so the user can select it as an interest. Each selected interest is then placed into an array for its group of interests, which is passed on to the next page and stored in the table "Interests."
Fig. 5 The script for selecting the career
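The checkbox listing described above could be sketched as follows. The interest groups and the action page name are placeholders, and the insert into the "Interests" table is shown only as a commented query.

```php
<?php
// Hypothetical CareerInterests.php-style listing: every interest in every
// group is printed with a checkbox so the user can mark it as an interest.
$interestGroups = [
    'Technology' => ['Software Development', 'Networking', 'Data Science'],
    'Business'   => ['Accounting', 'Marketing', 'Management'],
];  // illustrative groups only

echo '<form method="post" action="CareerInterestsAction.php">';
foreach ($interestGroups as $group => $interests) {
    echo '<h3>' . htmlspecialchars($group) . '</h3>';
    for ($i = 0; $i < count($interests); $i++) {
        $label = htmlspecialchars($interests[$i]);
        echo "<label><input type=\"checkbox\" name=\"interests[]\" value=\"$label\"> $label</label><br>";
    }
}
echo '<input type="submit" value="Save interests"></form>';

// On the action page, each checked interest would then be stored with the
// user's ID, e.g.:  INSERT INTO Interests (UserID, Interest) VALUES (?, ?)
```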
Another key feature of FirstStep is the ability to send, retrieve, and accept or reject partner requests. The partner feature creates a distinct relationship between two users in which they can view each other's personal information and communicate outside of FirstStep. The user can enter what they know about a potential partner into the HTML form to search for them, and then choose whether to send a partner request.
Fig. 6 The page of search for partners
Before the search is run, the HTML form goes through checks to make sure the fields are not empty [5]. Then, through a series of elseif statements, the user can search by just a first name, just a last name, just a username, or any combination of the three. These elseif branches contain queries that select the first names, last names, and usernames of any users whose personal information matches the input. For each match, an additional column contains a checkbox the user can select to send, or not send, a partner request.
Fig. 7 The action script for search for partners
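The search logic described above might be sketched as follows. The paper builds one elseif branch per field combination; this sketch instead collects whichever fields were filled in, which is an equivalent simplification, and the column names in "FirstStep2" are assumptions.

```php
<?php
// Hypothetical partner-search logic; column names are assumed.
$pdo = new PDO('mysql:host=localhost;dbname=firststep', 'dbuser', 'dbpass');

$first = trim($_POST['FirstName'] ?? '');
$last  = trim($_POST['LastName']  ?? '');
$user  = trim($_POST['Username']  ?? '');

// Keep only the fields the searcher actually filled in.
$conditions = [];
$params     = [];
if ($first !== '') { $conditions[] = 'FirstName = ?'; $params[] = $first; }
if ($last  !== '') { $conditions[] = 'LastName = ?';  $params[] = $last;  }
if ($user  !== '') { $conditions[] = 'Username = ?';  $params[] = $user;  }

if (!$conditions) {
    exit('Please enter at least one search field.');
}

$stmt = $pdo->prepare(
    'SELECT FirstName, LastName, Username FROM FirstStep2 WHERE ' . implode(' AND ', $conditions));
$stmt->execute($params);

// Each match is shown with a checkbox for sending a partner request.
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $name  = htmlspecialchars("{$row['FirstName']} {$row['LastName']} ({$row['Username']})");
    $uname = htmlspecialchars($row['Username']);
    echo "<label><input type=\"checkbox\" name=\"requested[]\" value=\"$uname\"> $name</label><br>";
}
```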
The array of requested users' names is passed from PartnersAction.php to PartnerRequest.php. Since each entry in this array is itself an array, a for loop stores just the names of the people requested so they can be printed. Another for loop iterates through the array of requested users. Inside it, the user ID of each requested user is retrieved with a MySQL query. Two more queries then check the table "Partners" to see whether the two users are already partners, whether one of them has already sent a request, or whether a request has been accepted but the sender has not yet seen the notification. If these two queries return zero rows, a row is inserted into "Partners" with the requester stored under User1ID, the requested user stored under User2ID, and the Status set to 1. If the queries return any rows, a query selects the personal information of the requested user(s) and the user is notified that the person has already been requested, has already sent the user a request, or is already a partner. After that, if the user has any partners, they can be viewed under "My Partners" in the sidebar. A query checks whether the user has partners at that time by selecting all rows from "Partners" where the user's ID appears under User1ID or User2ID and the Status is 3. If the query returns at least one row, a combination of a for loop, a while loop, and a foreach loop prints the list of all of the user's partners. Alongside this list is another column where the user can remove partners if they wish. If the user removes partners, an array containing the ID numbers of the selected people is passed on. If this array is not empty, loops are again used to print which users have been removed and a query deletes the corresponding rows from "Partners." See Figure 9 for the sample script.
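A sketch of the request and partner-listing logic described above, using the Status codes of the "Partners" table, is shown below; the function and variable names are assumptions.

```php
<?php
// Hypothetical sketch of sending a partner request and listing partners.
// Status codes: 1 = request sent, 2 = accepted but not yet seen, 3 = partners.
$pdo = new PDO('mysql:host=localhost;dbname=firststep', 'dbuser', 'dbpass');

function sendPartnerRequest(PDO $pdo, int $senderId, int $receiverId): string {
    // Check whether any relationship already exists in either direction.
    $stmt = $pdo->prepare(
        'SELECT Status FROM Partners
         WHERE (User1ID = ? AND User2ID = ?) OR (User1ID = ? AND User2ID = ?)');
    $stmt->execute([$senderId, $receiverId, $receiverId, $senderId]);
    $existing = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($existing === false) {
        // No prior row: insert a new request with Status = 1.
        $insert = $pdo->prepare('INSERT INTO Partners (User1ID, User2ID, Status) VALUES (?, ?, 1)');
        $insert->execute([$senderId, $receiverId]);
        return 'Request sent.';
    }
    return 'A request or partnership already exists (status ' . $existing['Status'] . ').';
}

function listPartners(PDO $pdo, int $userId): array {
    // Partners are the rows with Status = 3 where the user appears on either side.
    $stmt = $pdo->prepare(
        'SELECT User1ID, User2ID FROM Partners
         WHERE Status = 3 AND (User1ID = ? OR User2ID = ?)');
    $stmt->execute([$userId, $userId]);
    $partners = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $partners[] = ($row['User1ID'] == $userId) ? $row['User2ID'] : $row['User1ID'];
    }
    return $partners;
}
```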
Fig. 8 The page of results for partners
Fig. 9 The script for displaying partners
4. Conclusion
FirstStep has achieved nearly all of the goals set for this project. The career interests, the dynamic comment pages, and the partners function make for a functional and simple website. Functions still to be added include securely uploading a file containing a user's resume for others to view and securely recovering accounts for users who forget their password. All of the other goals created for this project were fulfilled using HTML, PHP, and MySQL.
5. References
[1] "Explore Careers." Career and Occupations Guide: Complete List of Careers. http://www.careerprofiles.info/careers.html
[2] "Manual :: Obtaining Data from Query Results." PEAR - PHP Extension and Application Repository. The PHP Group.
[3] "PHP: How to Get the Current Page URL."
[4] Stack Overflow. Stack Exchange Inc.
[5] W3Schools Online Web Tutorials.
Challenges Archivists Encounter Adopting Cloud Storage for Digital Preservation Debra A. Bowen Department of Computer Science and Information Systems, National University, San Diego, CA, USA
Abstract – This paper provides information to those who design, develop, manage, and provide storage and archival services by answering the question: what are the most challenging aspects of adopting cloud storage for long-term digital preservation? There was a gap in the literature as to whether the commonly cited challenges (cost, obsolescence, security) applied to the archival community specifically, or whether other challenges were unique to the profession. Members of the Society of American Archivists (SAA) were the population for this study. Data were collected using three rounds of surveys in which participants identified, ranked, and described challenges. The 10 most challenging aspects of cloud storage included all three challenges from the literature. The only challenge unique to the preservation of digital information was related to fixity checking.
Keywords: archival services, archivist, cloud storage, digital information, digital preservation
1 Introduction
People and organizations are storing massive amounts of digital information in the cloud without fully understanding the long-term benefits and/or risks. The introduction of the Internet has led to such a rapid increase in digital information that estimating the actual amount is challenging and often inaccurate. Storage technologies continually evolve to store larger amounts of information, faster, more cost effectively, and for longer periods. Additionally, information technologies have altered the way humans produce, store, and retrieve information about themselves and the world around them. As a result, organizations are struggling to select, store, and maintain valuable digital information so that it will be available when needed. However, just storing digital information in a cloud solution is not enough to ensure it will be accessible, usable, and trustworthy in the future. The goal of archiving is to retain information that has a long-term preservation need due to historical, cultural, or legal value. Archivists, who are part of the archival community in charge of preserving our most valuable digital information for future access, are in the process of adopting cloud storage for long-term digital preservation [1][2]. The archival community places its trust in the Information Technology (IT) field to develop and manage the technology that ensures information is archived and protected so that it can be accessible when needed. Organizations can now afford to store unlimited amounts of data. However, most cannot afford data loss due to improper long-term preservation techniques. Preserving digital information for long periods means ensuring that access to authentic, usable, and reliable information endures for centuries, without knowing what storage technologies will exist in the future. In addition, archivists need to do this while maintaining both the authenticity and the context of the information. This paper identifies the top 10 challenges archivists encounter when adopting cloud storage for long-term digital preservation. Although the practice of archiving has been around for thousands of years and an archivist has many responsibilities beyond archiving, the focus of this study was only on the digital preservation aspects of archiving.
1.1 Theoretical frameworks
The theoretical frameworks for this study were archival theory and the Open Archival Information System (OAIS), which is a theory of digital preservation. As IT crosses into almost all other fields, the same can be said for the discipline of archival science: almost all disciplines outside of the archival community have a growing interest in archiving and preservation.
1.1.1 Archival theory
Archival theory is a collection of interconnected ideas that encompass archiving, with the objective of guiding archivists in their work. In a digital environment, an archive is not the same as a backup. Backups are often used for recovery purposes in the event that the original data becomes corrupt or lost. An archive's purpose is the long-term preservation of digital data selected because of its evidential or historical value. Archives, in the context of archival theory, entail everything about a document that is deemed valuable enough to archive, including its relationship to other documents. Maintaining a relationship to other documents is an important responsibility of the archivist and is necessary when proving an archive's authenticity, regardless of format or media. This fundamental principle of archival theory is referred to as provenance. Provenance refers to information about the origin and ownership history of an item or collection of items, with the intent of assuring authenticity; however, provenance does not end when authenticity has been captured. Once the authenticity of an item has been determined, an archivist has the responsibility of ensuring the security of the item so that its authenticity is not damaged or lost. Provenance and technology
are inseparable for archiving digital information [3]. In an article by [4], the authors propose an approach for using secure provenance as a foundation for providing a trustworthy cloud. Provenance is an important construct of archival theory, especially now that archives are moving away from the archivist and into the cloud. The use of provenance as a principle of archival theory underlying this study is not intended as an indication of expertise about provenance or the nature of archival theory. Archival theory encompasses many other important principles, such as appraising, acquiring, preserving, and providing access, that were not included in this study due to its limited scope and focus.
1.1.2 Open Archival Information System (OAIS)
The Open Archival Information System (OAIS) is a theory of digital preservation that has been extensively accepted by the archival community [5]. The idea of an OAIS reference model originated in 1995 from the Consultative Committee for Space Data Systems (CCSDS). The CCSDS' function at the time was to support the study of terrestrial and space environments. The OAIS is a high-level model for archival repository systems that preserve, manage, and maintain access to digital information requiring long-term preservation. However, the OAIS does not specify any hardware, software, database, language, or platform requirements for an OAIS-compliant archive. The model is meant to provide standards for digitally preserving an archive without controlling the method for doing so. The verbiage in the OAIS reference model is identical to the International Organization for Standardization (ISO) 14721:2012, titled Space data and information transfer systems - Open archival information system (OAIS) - Reference model. An OAIS is "an archive, consisting of an organization, which may be part of a larger organization, of people and systems that has accepted the responsibility to preserve information and make it available for a designated community" [6, p. 1-1]. In addition, the word "open" in Open Archival Information System is not a reference to allowing open access into an archival system, but rather to the fact that recommendations and standards for the OAIS have been, and will continue to be, discussed and created in open forums. Efforts in digital preservation have been occurring for almost four decades, in a mostly uncoordinated manner. However, because of its focus on digital information, the OAIS model has become a framework for digital preservation in many non-archival disciplines, such as big science, medicine, and design and engineering. This acceptance across disciplines has made the OAIS model the most widely used framework for digital preservation systems.
2 Challenges from the literature Preserving digital data long-term is becoming more important because of increased cultural and economic dependence on digital data. Archivists must rely on an IT organization to manage cloud storage to ensure proper preservation of data so that it is available when needed.
Exploring the challenges archivists encountered when using cloud storage advances the knowledge base of archival and information technology professionals by providing a clearer understanding of what to expect, so that the archival community can successfully prepare for cloud adoption challenges. In addition, it provides the IT field with information, unique to the archival community, for designing, developing, managing, or offering cloud storage products or services for long-term digital preservation. There was limited research that examined the challenges of adopting cloud storage in the archival community. The research literature indicated that the archival profession relied on IT organizations to manage its digital information, not only to support compliance but also to ensure information is stored and protected [1]. While researching cloud storage for long-term digital preservation, three challenges recurred in the literature: (a) cost [1][7], (b) security [12], and (c) obsolescence [1].
2.1 Cost
From the research, it was apparent that the cost involved in long-term preservation was a consideration that had yet to have a proven model. In an article by [1], the authors identified quantifying the value of long-term digital preservation as a future study area. They justified research in this area so that organizations could have some type of decision criteria to measure their digital preservation IT investments. Turner [7] believed this lack of a specific cost model might be, in part, due to organizations not recognizing the value of digital assets.
2.2 Security
Digital information security, especially for sensitive information, was one of the most cited concerns of information professionals looking to adopt cloud storage [8]. Security threats such as data loss, data breaches, and authentication and access vulnerabilities are particularly unwelcome for long-term digital preservation. The archival community continues to address cloud security in research papers and projects [8][9].
2.3 Obsolescence
One of the more challenging aspects of long-term digital preservation is obsolescence, which happens when hardware, software, storage media, or file formats do not last long enough to provide long-term accessibility of digital information. Shortly after disk storage became popular, the challenges of obsolescence began to appear in archival journals. In an article by [1], the authors concluded with a suggestion for future research that included how to minimize information loss due to obsolescence. Throughout the literature, obsolescence appeared to be an ongoing digital preservation challenge regardless of the IT storage technology utilized. Every day, valuable digital information becomes unreliable or inaccessible because of obsolescent storage technologies [10]. Hardware, software, storage media, and file formats become outdated over time, causing preserved digital information to become unusable if not periodically managed. The IT field
continues to be the driver for storage solutions used by archivists and cloud storage is the next solution that archivists will be utilizing, essentially entrusting digital information to the IT field for safekeeping.
3 Research methodology
3.1 Research question
This study targeted archivists to get their perspective on the unique challenges they encountered using cloud storage for long-term digital preservation. The following research question provided a guide for this study: Q1. From an archivist’s perspective, what are the most challenging aspects when adopting a cloud storage solution for long-term digital preservation?
3.2 Research design
The design for this qualitative study was a modified Delphi technique. The Delphi technique made it possible to elicit challenges experienced by certified archivists so that they could rank and describe their experiences anonymously, furthering an understanding of the challenges of adopting cloud storage for long-term digital preservation. The main aspects of the Delphi technique are that (a) experts respond to the same concept at least twice, (b) panel participants are anonymous to one another, (c) the researcher provides controlled feedback, and (d) a final group response includes all participant opinions. In addition to providing challenges they experienced, panel members had the option of selecting three recurring cloud storage challenges derived from the literature: (a) cost, (b) obsolescence, and (c) security. Because this study utilized preselected challenges during round 1, a modified Delphi model was used; this model has been applied in other studies providing preselected list items [11][12], which is not a characteristic of a classic Delphi study. The Delphi technique was an appropriate method for the proposed question and phenomenon because collecting data from experienced archivists who have already adopted cloud storage was convenient for both participants and researcher. In addition, archivists are concerned with specific principles when archiving digital information, such as provenance and authenticity, and, unlike many cloud storage users, may never use the data they are archiving. A quantitative study measuring factors that influence adoption of cloud storage would not have been as appropriate for archivists as it might be for other groups, because some cloud storage factors for archivists are not quantifiable, but rather are understood in the framework of their professional responsibilities. Subsequent rounds of this modified Delphi study narrowed and ranked the challenges archivists experienced and allowed participants to describe how selected challenges were resolved or managed. Using a convenient online survey tool and collecting data from professionals who directly experienced the phenomena provided richer feedback than would have resulted from other conventional methods [11].
3.3 Target population and sample
Although digital preservation activities can be found in many non-archival disciplines, the population focus of this research study was archivists, because preservation is one of the key responsibilities of their profession. The target population of this study was the members of the Society of American Archivists (SAA). The SAA is a professional not-for-profit organization that provides a certification method for the archival profession. All members must pass a certification exam covering archival knowledge, skills, and responsibilities; in addition, members hold at least a master's degree and have a minimum of one year of archival experience. To maintain a certification, members must provide proof of ongoing education, experience, and participation in the profession. The challenges SAA-certified archivists encountered when adopting cloud storage for long-term digital preservation were the focus of this study. The sample was members of the SAA, from the United States, who were 18 years or older. In addition to members agreeing to the informed consent form, the inclusion criteria focused on members who actively worked in an archival-related role during cloud adoption. Another inclusion criterion was the availability of panel members to participate in multiple rounds of survey questions. Any member who did not agree to the informed consent form was exited from the study, and any member who did not complete the round 1 survey questions was excluded from further rounds. Because panel participants needed specific archival knowledge, skills, and experience adopting cloud storage for long-term digital preservation, a purposive criterion sampling method was used to intentionally select individuals from the group most likely to have experienced the phenomenon personally.
3.4 Data collection
Once a participant clicked the button agreeing to the informed consent, they were linked to the round 1 survey in SurveyMonkey. The data collection procedures from this point were as follows:
1. Demographic questions began the round 1 survey. Other than the question asking about a participant's age range, only questions relating to archival experience and adopting cloud storage were asked. This was intentional, to protect participant identity.
2. In addition to being able to provide open-ended responses identifying up to three challenges from their own experience, panel participants were also given the opportunity to agree or disagree that they had encountered challenges from options derived from the literature.
3. For round 1, a list of three preselected challenges from the literature was provided. The participants were asked if they had encountered any of the three challenges; the response options for each were yes or no. In addition, there were three text boxes for participants to add their own additional challenges.
4. The round 1 survey request was sent from SurveyMonkey. The email allowed potential participants to continue to the survey or to disagree with the informed consent form and exit the study.
5. SurveyMonkey was used for the online data collection of all questions, for all rounds.
6. The round 1 demographic data and all participant-entered challenges were downloaded and analyzed within Microsoft Excel to remove participant-identifiable information. The data were then imported into the NVivo Pro 11 qualitative data analysis software tool to create a final list of challenges for round 2.
7. Round 2 was used to narrow the list of 19 challenges to a more manageable size. This was accomplished by having panel participants select the top 10 challenges from the original list of 19 challenges from round 1. The 10 challenges selected by the highest percentage of participants became the final list of challenges for ranking in round 3.
8. Round 3 had panel participants rank the final list of 10 challenges in order from most challenging. This would be repeated until there was a level of agreement among the ranked answers. A final question gave the participants an opportunity to share their thoughts about the long-term aspect of cloud storage.
9. After rounds two and three, Kendall's Coefficient of Concordance (Kendall's W) was calculated to determine whether a level of agreement had been achieved (W > .5) [13]. This would have been repeated until an agreement of W > .5 was achieved; however, W > .5 was attained in round 3, so data collection was completed.
Once Kendall's W indicated a level of agreement among the ranked answers, the final outcome was a list of 10 challenges, ranked from most to least challenging, that archivists encountered when adopting cloud storage for long-term digital preservation. In addition, how challenges were resolved or managed and opinions about the long-term aspect of cloud storage were documented.
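Kendall's coefficient of concordance, used here as the stopping rule (W > .5), can be computed from a matrix of rankings as in the generic sketch below; this is a textbook formulation of the statistic, not the authors' analysis code.

```php
<?php
// Generic sketch of Kendall's coefficient of concordance (W).
// $rankings has one row per participant and one column per challenge,
// containing ranks 1..n with no ties.
function kendallsW(array $rankings): float {
    $m = count($rankings);     // number of participants (raters)
    $n = count($rankings[0]);  // number of challenges (items)

    // Column sums of ranks for each challenge.
    $columnSums = array_fill(0, $n, 0);
    foreach ($rankings as $row) {
        foreach ($row as $j => $rank) {
            $columnSums[$j] += $rank;
        }
    }

    // S is the sum of squared deviations of the rank sums from their mean.
    $mean = $m * ($n + 1) / 2;
    $s = 0.0;
    foreach ($columnSums as $sum) {
        $s += ($sum - $mean) ** 2;
    }

    // W = 12S / (m^2 (n^3 - n)); W = 1 means perfect agreement.
    return 12 * $s / ($m * $m * ($n ** 3 - $n));
}

// Example: three participants ranking four items identically gives W = 1.0.
echo kendallsW([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]);
```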
4 Results
This study was guided by one research question: "From an archivist's perspective, what are the most challenging aspects when adopting a cloud storage solution for long-term digital preservation?" To answer this question comprehensively, a ranked list of the top 10 challenges was identified and the descriptions of those challenges were analyzed for themes. In addition, how those challenges were resolved or managed was also identified. The participant responses during rounds one and two did not include any phrases related to long-term. To capture this aspect of the research question, a specific open-ended question was added to the round 3 survey to solicit opinions about long-term, in relation to cloud storage, from the experience of participants. Because no participant's long-term response described an actual experience, it was deduced that none of the panel participants had yet encountered cloud storage challenges associated with long-term. Members of the SAA were the sample for this study's panel due to their credentials and experience with the study's topic. To ascertain the top 10 cloud storage adoption challenges, three rounds of surveys were employed to obtain a ranked list of challenges provided by a panel of 23 members. Round 1 produced a list of 19 challenges; three were derived from the literature and 16 were provided by participants. During round 2, 16 participants narrowed the challenges to 10 and ranked them from most challenging. The top five ranked challenges included the three challenges derived from the literature: (a) cost, (b) obsolescence, and (c) security. Also, four themes emerged from the participants' challenge descriptions: (a) staffing, (b) technology, (c) obsolescence, and (d) security. Kendall's W did not meet the target of W > .5 during round 2, so round 3 was conducted. During round 3, 12 participants ranked the 10 challenges (Table 1 shows the final list of challenges, in ranked order).

Table 1 - Final Ranked Challenges
Final Rank  Challenge                                                                         Score
1           Cost (for equipment, resources, education, other activity, etc.)                   27
2           Obsolescence (hardware, software, file format, media, etc.)                        31
3           Lack of ability to easily do fixity checking                                       47
4           Extracting content from the systems in which it was originally used                48
5           Security (data loss, data breach, authentication, etc.)                            55
6           Connectivity - speed & reliability of access to cloud & file uploads/transfers     77
7           Technical knowledge/lack of qualified staff                                        79
8           Understaffed IT department                                                         88
9           Coordination with IT staff                                                        103
10          Trust of the provider                                                             105
Note. 1 = most challenging. The score is the sum of all rankings for the challenge.
Again, the top five included the three challenges derived from the literature. The lack of the ability to easily do fixity checking was also ranked in the top five during round 3. This was the only challenge in the top 10 that was specific to digital preservation, as opposed to data that is simply stored. Fixity checking is a digital preservation term that refers to the process
of checking the integrity of a digital item by verifying that it has not become unexpectedly altered [6]. Fixity ranked two places higher than security in the final ranked list. Noting that there may have been a terminology difference, fixity checking and security could reasonably be the same challenge; however, the security theme that was emerging in round 2 did not develop during round 3 to provide more support for this assumption. Kendall's W was calculated at W = .61, which exceeded the target of W > .5. The analysis of the challenge descriptions indicated that two themes continued from round 2: technology and obsolescence. Cost and human expectation also emerged as new themes. Round 3 also solicited participant opinions about the long-term aspect of cloud storage, because it was not addressed in previous rounds. Participant opinions of long-term, in relation to cloud storage, were equal among the category types of positive, negative, and unsure (see Table 2 for participant responses).

Table 2 - Opinions on the Long-Term Aspect of Cloud Storage
P05 (negative): "There has not been a real longterm digital solution that I know of, just different technologies. Isn't the cloud just a bunch of hard drives over the network - what is longterm about that?"
P14 (negative): "The more data that is moved around the more potential for it to be changed or become corrupt. To be a real long term solution cloud vendors will need to be more aware of integrity issues with archives and how to deal with them"
P26 (negative): "This is problematic to entrust ownership/control of valuable digital materials to a third party. We must find a way of retaining more autonomy for the holding institution."
P02 (positive): "It is key part of our overall strategy. We have no choice due to cost and management considerations. Especially at this institution's rate of growth."
P18 (positive): "being non-profit the cloud really helps us to be part of a bigger community, so I hope it proves to be a long-term solution"
P21 (positive): "I think that as cloud storage becomes increasingly cheaper, it will also become increasingly useful. Ideally, we would store our data using two different storage providers for redundancy and security, but we can't afford to at the time. Open source systems for ingesting content into cloud storage and checking fixity with little to no programming knowledge (such as an open source version of something like Cloudberry Backup, and improved tools from resources like AVPreserve) would also help more institutions adopt cloud storage as a preservation solution."
P08 (unsure): "have not really thought about the cloud in terms of long-term, it is fairly a new process and we are still working out all the issues"
P11 (unsure): "I wonder if the issues we had with the older systems will become the same issues we will have with the cloud in 20 or 30 years"
P15 (unsure): "My thought is that it is too soon to tell…"
As expected from participant challenge descriptions, the technology and obsolescence themes were consistent between rounds. In addition, obsolescence was one of the challenges derived from the literature. Although cost was not a theme in round 2, it was ranked as the fourth most challenging aspect of adopting cloud storage. On the final ranked list from round 3, cost was ranked as the most challenging aspect of adopting to a cloud storage solution. As one participant, P08, stated, “most of the challenges stem from not enough funding.” An unexpected theme during round 3 was human expectation. Participant P05 mentioned this as the reason why extracting content from the systems in which it was originally used was such a challenge. As noted previously, the concept of long-term was not included in any of the participant responses during round one or two. When participants were asked to share their thoughts about the long-term aspect of cloud storage, the responses were not specific to any actual experiences. Because long-term for this study represented a period of 10 years or longer [8][14], it is probable that panel participants had not yet encountered any long-term cloud storage challenges. The results of the study did answer the research question. In addition, the final ranked list of challenges included the three challenges derived from the literature (cost, obsolescence, and security). The long-term aspect of cloud storage had not been experienced by any of the panel participants, but they were able to provide opinions based on knowledge from their overall cloud storage adoption experiences. The three challenges from the literature were also ranked as top challenges by the panel participants of this study because the challenges of cost, obsolescence, and security are still ongoing issues in the adoption of cloud storage. By taking into account the unique requirements of the archival community, this study’s results provide the IT field a perspective of cloud storage challenges by an industry outside of IT. The findings yielded by this study could have also resulted partly because the archival community has been managing the challenges of cost and obsolescence for as long as they have been using digital storage. Those two challenges, not being unique to cloud storage, may have been selected because of an association with previous storage technology experiences.
5 Limitations
A limitation of the modified Delphi design was that there were no clear guidelines for determining sample size. In an article by [15], the authors suggested that for a homogeneous group, such as the one used for this study, 10 to 15 participants might be sufficient. However, the attrition rate made it difficult to determine an initial sample size that would leave 10 to 15 participants completing the final round. Another limitation was that only one professional archivist organization (the SAA) was surveyed for this research study. Even though this group is believed to be a distinguished and knowledgeable group of archivists, there were other organizations whose members would have been able to contribute to this study but were not engaged due to limitations of scope, time, and cost. Several areas of archiving were not addressed in this study, such as appraising, acquiring, arranging, and describing; archiving digital information to a cloud storage solution is only one specific concern of the archival community.
6 Recommendations for further research
A recommendation for future research would be to use the list of ranked challenges from this study as a preselected list for a new modified Delphi study, while also expanding the target population to a broader area of the archival community. This would potentially provide a richer description of challenges, with more diverse approaches to resolving or managing them. In addition, including the IT community in describing and explaining how they would resolve or manage these challenges, in relation to the archival community, may prove beneficial to both communities. Using a modified Delphi method to reach a level of agreement on a list of ranked challenges was convenient and easy to implement; however, the focus on ranking and the multiple rounds needed to reach a level of agreement may have deterred potential participants from taking part. One design recommendation that could strengthen a similar study would be to use a case study approach to gather challenge descriptions and resolutions and to remove the ranking component. This would be more time consuming; however, fewer participants would be needed and they could potentially provide richer details about the challenges encountered when adopting a cloud storage solution.
7 Conclusion
Using a modified Delphi technique, a panel of SAA members identified, ranked, and described the 10 most challenging aspects they encountered when adopting a cloud storage solution for long-term digital preservation. The top five ranked challenges included the three challenges derived from the literature: cost, obsolescence, and security. Ranked third was the lack of ability to easily do fixity checking, which was the only challenge identified as being specific to digital preservation. The survey rounds were concluded when
Kendall's W was calculated at W = .61, exceeding the target of W > .5. The challenges from the literature were confirmed by the panel participants as still being relevant. This indicates that these challenges are ongoing, or that they are so prevalent in the archival community that any storage technology would likely have received similar rankings for cost, obsolescence, and security. The need for the archival and IT communities to work together to resolve preservation issues created by new storage technologies is echoed throughout the participant responses.
8 References
[1] Burda, D., & Teuteberg, F. "Sustaining accessibility of information through digital preservation: A literature review". Journal of Information Science, 39(4), 442-458. doi:10.1177/0165551513480107, 2013.
[2] Fryer, C. "Project to production: Digital preservation at the Houses of Parliament, 2010–2020". International Journal of Digital Curation, 10(2), 12-22. doi:10.2218/ijdc.v10i2.378, 2015.
[3] Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M., & Conrad, M. "Archival Records and Training in the Age of Big Data". In J. Percell, L. C. Sarin, P. T. Jaeger, & J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited, 2018.
[4] Jamil, F., Khan, A., Anjum, A., Ahmed, M., Jabeen, F., & Javaid, N. "Secure provenance using an authenticated data structure approach". Computers & Security, 73, 34-56. doi.org/10.1016/j.cose.2017.10.005, 2018.
[5] Corrado, E. M., & Moulaison, H. L. "Digital preservation for libraries, archives, and museums" [e-book version]. Retrieved from http://www.ebrary.com, 2014.
[6] CCSDS. "Reference model for an open archival information system (OAIS)". Washington, DC: Consultative Committee for Space Data Systems, 2012.
[7] Turner, S. "Capitalizing on big data: Governing information with automated metadata". Journal of Technology Research, 5, 1-12. Retrieved from http://t.www.aabri.com/, 2014.
[8] Beagrie, N., Charlesworth, A., & Miller, P. "How cloud storage can address the needs of public archives in the UK". The National Archives. Retrieved from http://www.nationalarchives.gov.uk/documents/archives/cloud-storage-guidance.pdf, 2014.
[9] Burda, D., & Teuteberg, F. "The role of trust and risk perceptions in cloud archiving—Results from an empirical study". The Journal of High Technology Management Research, 25(2), 172-187. doi:10.1016/j.hitech.2014.07.008, 2014.
[10] Rinehart, A. K., Prud'homme, P. A., & Huot, A. R. "Overwhelmed to action: Digital preservation challenges at the under-resourced institution". OCLC Systems & Services, 30(1), 28-42. doi:10.1108/OCLC-06-2013-0019, 2014.
[11] Gill, F. J., Leslie, G. D., Grech, C., & Latour, J. M. "Using a web-based survey tool to undertake a Delphi study: Application for nurse education research". Nurse Education Today, 33(11), 1322-1328. doi:10.1016/j.nedt.2013.02.016, 2013.
[12] Fletcher, A., & Marchildon, P. "Using the Delphi method for qualitative, participatory action research in health leadership". International Journal of Qualitative Methods, 13, 1-18. Retrieved from https://ejournals.library.ualberta.ca/index.php/IJQM/index, 2014.
[13] Habibi, A., Sarafrazi, A., & Izadyar, S. "Delphi technique theoretical framework in qualitative research". The International Journal of Engineering and Science, 3(4), 8-13. Retrieved from http://www.theijes.com, 2014.
[14] Dollar, C. M., & Ashley, L. J. "Assessing digital preservation capability using a maturity model process improvement approach". Retrieved from https://www.nycarchivists.org/Resources/Documents/DollarAshley_2013_DPCMM%20White%20Paper_NAGARA%20Digital%20Judicial%20Records_8Feb2013-1.pdf, 2013.
[15] Skulmoski, G. J., Hartman, F. T., & Krahn, J. "The Delphi method for graduate research". Journal of Information Technology Education, 6, 1-21. Retrieved from http://jite.org/, 2007.
Solar and Home Battery Based Electricity Spot Market for Saudi Arabia Fathe Jeribi, Sungchul Hong Computer and Information Sciences Department, Towson University, Towson, MD, USA
Abstract - With the rise of technology, the electronic market has recently shown growth potential compared to the traditional market. It is difficult to imagine people's lives without electricity, so its availability plays an important role in modern life. The electricity market discussed in this paper is a proposal for developing Saudi Arabia's electricity market. Due to the gap between electricity supply and demand, hot places such as Jazan, Saudi Arabia, experience interruptions of electricity; the demand for electricity increases in the summer season because of the hot weather [1]. The supply of electricity during peak time could be secured by building a large-scale power plant, but that costs the government a great deal of money. Instead, an electricity market can provide electricity to fill that gap. Furthermore, consortiums of smaller sellers and individual large sellers can form an electricity market more efficiently and provide electricity when it is needed during the peak time. This paper shows, by simulation, the possibility of generating electricity through consortiums of small sellers who have solar panels and home batteries.
Keywords: consortium, electricity market, home battery, Saudi Arabia, solar power, trading.
1 Introduction
1.1 Electricity market
Electricity can be considered a commodity. It is a stream of electrons and is derived from other energy sources such as solar, wind, nuclear energy, natural gas, oil, or coal. Normally, electricity is measured in kilowatts for small usage or megawatts for large usage. An electricity market can be classified as retail or wholesale: in the retail market, electricity is sold directly to consumers, while in the wholesale market electricity is sold to traders by a utility company before it is delivered to consumers [2]. In terms of trading, an electricity market could be a virtual or physical place where buyers and sellers trade electricity. In an electricity market, producers or sellers can be small, medium, or large according to their capacity, and each type of producer can provide electricity to different types of consumers. Small producers, such as solar panels and wind turbines, can provide electricity to a house or small business. Medium producers, such as gas turbines and solar energy farms, can provide electricity to medium-sized businesses or villages. Large producers, such as a power plant, can provide electricity to large
businesses or a city. Overall, there are three types of producers that can contribute to the electricity market. In terms of the benefits of using solar panels with home batteries, there are five advantages. First, people can use the electricity if they need it or sell it to the electricity company if they do not. Second, if there is extra electricity collected from sellers, people who do not wish to use solar panels with home batteries can buy electricity from the electricity company. Third, people's electricity bills decrease. Fourth, the government's cost of building a new power plant decreases. Fifth, electricity blackouts stop.
1.2 U.S. electricity market
Electricity facilities in the U.S. are controlled by state and federal regulatory organizations. The Federal Energy Regulatory Commission (FERC) controls the wholesale electricity market as well as interstate transmission services. States control retail electricity rates as well as distribution services. In the U.S., some states follow a regulated model and some follow a restructured model. Utilities in the regulated states follow vertical integration and create integrated resource strategies to serve the electricity load; in addition, economic regulation establishes rates for distribution and supply. In the restructured states, the market establishes rates and generation is exempt from economic regulation; unlike generation, distribution services are regulated. Compared to the regulated states, the restructured states do not make integrated resource plans; however, they have the right to manage generation as well as demand-side resources [3]. The FERC approved that PJM (the Pennsylvania, New Jersey, and Maryland Power Pool) matched the four essential characteristics of an RTO (Regional Transmission Organization) [4][5]: operational authority, independence, short-term reliability, and scope and regional configuration. Overall, FERC plays an important role in the electricity market.
1.3 Solar electricity in U.S. and battery
According to the Solar Market Insight Report for the 2nd quarter of 2017, 2,044 MW of solar photovoltaic (PV) capacity was installed in the U.S. market in the 1st quarter of 2017. In Q1, solar ranked as the second source of new electricity generation, accounting for 30% of the total new electric generating capacity [6]. Solar power made up 0.5% of overall electricity generation in the U.S. in 2016 [7].
Over the next 5 years, researchers forecast that the total volume of solar PV installed in the U.S. will roughly triple, with over 17 GW of solar PV installed in 2022. In Q1 2017, the capacity of installed residential PV was 563 MW and the capacity of installed non-residential PV was 399 MW [6]. Compared to 2015, the U.S. solar market grew by 95% in 2016: 7,493 MW of solar was installed in 2015 and 14,626 MW in 2016 [8]. Solar power is growing fast in the USA, and it could be an attractive and clean resource for the economy as well as the environment [9].
1.4 Consortium
A consortium in business is an association of individuals or organizations formed for investment or business purposes, and there are three basic steps to initiating one: first, negotiating and establishing the essentials of the collaboration; second, preparing the documents that establish the consortium; and third, describing the consortium's organizational and management framework. The documents used to initiate a consortium are a strategic plan, a letter of objectives, a consortium initiation agreement, and the consortium's bylaws. A consortium can be temporary or permanent: temporary when there is only one specific job, and permanent when a group of members participates together in an ongoing investment. The consortium agreement (CA) is an independent trade instrument, generated in commercial operations for the mutual execution of large investment projects and the like. The CA can govern rights and obligations such as the division of work, execution times, cost sharing, and the consortium's decision making [10].
1.5 Selling electricity to a company from home battery
The structure for selling solar electricity to an electricity company is shown in Figure 1. The figure contains four items: a home battery, a home (i.e., a seller), a smart grid, and a solar panel. Electricity is generated by the solar panels and stored in the home batteries. In this electricity market, the home battery serves as inventory; otherwise, the electricity must be consumed as soon as it is generated. Selling electricity to an electricity company can happen in two different ways. Number 1 in Figure 1 shows the first way, where electricity is sold directly from the solar panel to the smart grid. Number 2 in Figure 1 shows the second way, where electricity is stored in the home battery and then sold to the grid. The advantage of the home battery is that it can store the generated electricity to be sold later. This paper also shows how an electricity market with consortiums can help small sellers sell electricity to a big buyer who wants to deal in a certain amount of electricity, and it demonstrates that consortiums can improve the efficiency of the electricity market.
Fig. 1. Selling solar electricity to a utility company.
1.6 Mechanism of electricity market model with consortium
Trading in the electricity market with a consortium can occur between sellers and a buyer (SEC), as shown below in Figure 2. Sellers can be small or large. Large sellers can participate in the electricity market directly, but small sellers need to form consortiums to be able to participate if there is a threshold for entering the market. In this paper, the objective of the electricity market is to find the minimum price for the amount of electricity needed to cover the gap between peak demand and electricity generation.
Fig. 2. The electricity market structure with a consortium (small sellers grouped into consortiums and a large seller trade through the electricity market with the buyer, SEC)
In the electricity market, the bidding price is set by buyers and the asking price is listed by sellers. A seller's minimum asking price and a buyer's maximum bid price produce the best matching case. There are many examples of electricity markets, such as a real-time balancing market, a day-ahead market, etc. A buyer should consider the following factors: setting the market's bid price, determining the required electricity, buying electricity that covers the peak time, and, if needed, altering the market's bid price. A seller should consider the following factors: determining the time of selling, checking the availability of electricity, determining the asking price, and finding out the index price in the market. Determining the time of selling can be opportunistic and risky at the same time. For example, if a seller waits for a long time, he or she could make an additional profit; however, he or she could lose a deal if the demand is met and the market closes before the electricity is sold. Also, setting the bidding price could be risky. For example, selling electricity will not
happen if the buyer sets the price too low. Overall, a consortium is useful because it allows small sellers to participate in the market.
Electricity market model without consortium
Trading in the electricity market without consortium can occur between sellers and a buyer (SEC), which is shown in Figure 3. If there is a threshold value to participate in the market, small sellers can’t contribute in the electricity market trading because they need enough volume of electricity to participate. However, large sellers can be able to participate in the electricity market directly because they have enough volume electricity. Compared to an electricity market with a consortium, an electricity market without a consortium could be less efficient because it doesn’t allow smaller sellers to participate in the market. Sellers Buyer
(SEC) Electricity Market
Large Seller Fig. 3. The electricity market structure without a consortium
2
According to Saudi Arabia’s Ministry of Economy and Planning, in 2014 in the Jazan region, the peak load was 2510 MW and electricity subscribers was 231,390 [13]. According to MESIA (Middle East Solar Industry Association), in Jazan (2014), the gap between covered peak times and SEC’s power generation capability could be up to 25 percentage i.e. the gap was 627.5 MW [14] (Figure 4). Blackout Gap (K) = 627.5 MW
Peak demand = 2510 MW
Production capacity = 1882.5 MW
Time 1 PM
3 PM
Fig. 4. Blackout occurrence in Jazan during the summer
Small Sellers Small Sellers
Math models
A Proposal of an Electricity Market Model in Jazan, Saudi Arabia
Some regions in Saudi Arabia, such as Jazan, need more electricity for the period of blackouts. Saudi Electricity Company (SEC) can produce more electricity by using small, medium, and large producers such as gas turbines and people’s solar panel with home battery. A gas turbine is an engine that generates electricity in large volumes. It is an internal combustion engine which converts natural gas or any other fuels to mechanical energy then produce electricity [11] [12]. People with solar panels and home batteries could be small, medium, or large producers based on their capacity. If producers are small but they want to deal with a large demand, they can form a consortium to deal with that large demand, but they don’t need to form a consortium if they can produce enough electricity to meet the buyer’s large demand. Among electricity consumers in Jazan, some will have solar power, some will have solar power with home batteries or some do not have solar power. In the current situation, because of the lack of electricity production by the SEC, there are blackouts during the summer due to the high electricity usage of air conditioning. If the collection of solar power with batteries provide enough electricity, then all the electricity consumers in
In a typical electricity market, a buyer (SEC) wants to deal with a large volume. To meet this requirement, small volume sellers may form a consortium. Electricity consortium is important because it help individuals to group together and allow them to do business with large volume. In the proposed model in this paper, sellers in the electricity market can form consortiums to meet the gap (K) amount of electricity between the peak demand and production capacity. Steps to form a consortium are identifying the opportunities, seeking out the possibilities of a relationship, risk assessment and due diligence, creating a relationship, communications, relationship’s resourcing, follow up with a relationship, and exiting a relationship [15]. The proposal model is explained below. A simulation was built using Java programming language. It generates consortiums out of 30 random sellers with volumes and prices then checks the combined consortiums volume that are greater than or equal to K (gap capacity), using the following formula: .
∑_j SVC_j ≥ K    (1)
In equation (1), SVC_j is the combined volume of the jth consortium of sellers and K is the difference between the peak demand and the production capacity. A consortium SVC_j is made of sellers S_i with volumes V_i and prices P_i, where i ranges over the sellers selected for consortium j. The goal of this model is to show the consortiums that have enough electricity to cover K. At the same time, the market algorithm attempts to find the minimum price, calculated by multiplying volumes and prices, using the following formula:
Minimum (Volumes × Prices)    (2)

In equation (2), the sum of the selected volumes must meet K.
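As a small illustration of how the quantities in equations (1) and (2) are computed (a sketch only; the SellerOffer type and function names are ours, and the authors' simulation was written in Java):

#include <vector>

// Hypothetical member of a consortium: an individual seller's volume and price.
struct SellerOffer {
    double volume; // V_i
    double price;  // P_i
};

// Combined volume SVC_j of consortium j, used in the check SVC_j >= K of equation (1).
double consortiumVolume(const std::vector<SellerOffer>& members) {
    double total = 0.0;
    for (const SellerOffer& m : members) total += m.volume;
    return total;
}

// Total cost of the consortium, i.e. the "volumes x prices" quantity minimized in equation (2).
double consortiumCost(const std::vector<SellerOffer>& members) {
    double cost = 0.0;
    for (const SellerOffer& m : members) cost += m.volume * m.price;
    return cost;
}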
From the buyer's (SEC's) view, Figure 5 shows the behavior of consortiums in the market, i.e. how a minimum price is selected from a set of consortiums. The shaded section in Figure 5 shows the excess volume of each group of consortiums. The figure involves a goal and a constraint: in equation (3), the goal is to minimize the sum of consortium volumes multiplied by their prices; in equation (4), the constraint is that the sum of consortium volumes is greater than or equal to K.
Goal:

    Min ∑_{j=1}^{m} (SVC_j × P_j)    (3)

Constraint:

    ∑_{j=1}^{m} SVC_j ≥ K    (4)
In equations (3) and (4), m is the number of consortiums that are selected to trade.
Fig. 5. Selection of consortiums with a minimum price
SVC_j can be formed by the following process: in a consortium, the selected sellers (S_i) agree on a price based on the market index price and each seller's reserved price. If a consortium's combined electricity volume is large enough, the consortium can participate in market bidding. The flowchart in Figure 6 shows how an electricity market with a consortium can be formed and operated.
Fig. 6. The process of the electricity market with a consortium (flowchart: start → find K (gap) → form a consortium → receive the agents' bids, consortium volumes and prices (SVC_j, P_j) → select consortiums with ∑ SVC_j ≥ K → find the minimum of volumes × prices → trade → end).
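To make the selection step of Figure 6 and equations (3)-(4) concrete, here is a minimal sketch, not the authors' Java implementation; the ConsortiumBid type and function name are ours. It exhaustively checks subsets of consortium bids and returns the cheapest total cost whose combined volume covers K.

#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical consortium bid: combined volume SVC_j and unit price P_j.
struct ConsortiumBid {
    double volume; // SVC_j
    double price;  // P_j
};

// Exhaustively search subsets of bids; return the minimum total cost
// sum(SVC_j * P_j) over subsets whose combined volume covers K (equations (3)-(4)).
// Suitable only for a small number of bids, as in the 30-seller simulation.
double minCostCoveringK(const std::vector<ConsortiumBid>& bids, double K) {
    double best = std::numeric_limits<double>::infinity();
    const std::size_t n = bids.size();
    for (std::uint64_t mask = 1; mask < (1ULL << n); ++mask) {
        double volume = 0.0, cost = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (mask & (1ULL << j)) {
                volume += bids[j].volume;
                cost   += bids[j].volume * bids[j].price;
            }
        }
        if (volume >= K && cost < best) best = cost; // feasible and cheaper
    }
    return best; // infinity if no subset covers K
}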
In the electricity market, sellers and the buyer (SEC) must find a matching price. As in other kinds of trading, two pricing issues arise:

1. If the buyer (SEC) sets its price too low relative to the market index price (MP), it cannot find trading partners; no sale happens, there is not enough electricity for peak time, and a blackout follows.
2. If a seller's reserved price is too high, there is no sale and the seller loses the sales opportunity.

For a consortium in this electricity market, there are two cases:

1. S_i ∈ SC_j where SRP_i ≤ MP. In this case, the seller's reserved price is less than or equal to MP.
2. S_i ∈ SC_j where SRP_i > MP. Such sellers are usually not selected, but a consortium might need their volume to meet the buyer's demand if the sellers can still make a profit. Individually the seller's price is greater than MP, but as a group the consortium meets the required volume and price while still generating profit for its sellers.

Each seller S_i has a reserved price SRP_i. After a sale, profit is distributed among the sellers according to the following three cases:
1. SRP_i ≤ MP

In this case, SRP_i is less than or equal to MP (SRP_i ≤ MP), i ∈ l, where l is the set of sellers in a consortium. Each seller S_i receives extra profit from the difference between MP and the consortium's reserved prices. Table 1 shows an example: if MP = $50, the total of (V_i × MP) is $7,000 and the total of (V_i × SRP_i) is $4,650, so there is a gap ($7,000 - $4,650 = $2,350). The gap is divided among the sellers whose reserved price is lower than MP; any seller S_i ∈ SC_j with SRP_i > MP is excluded from this split, since such a seller already earns more than MP.

TABLE 1: CASE 1, WHEN SRP_i ≤ MP.

Sellers | Volume (V_i) | V_i × MP | Reserved Price (SRP_i) | V_i × SRP_i
S1      | 45           | $2,250   | $40                    | $1,800
S2      | 50           | $2,500   | $30                    | $1,500
S3      | 45           | $2,250   | $30                    | $1,350
Total   | 140          | $7,000   |                        | $4,650

2. SRP_i > MP

In this case, SRP_i is greater than MP (SRP_i > MP), i ∈ l, where l is the set of sellers in a consortium. Table 2 shows an example: if MP = $30, the total of (V_i × MP) is $2,700 and the total of (V_i × SRP_i) is $3,700, so no sale will happen. If S1, S2, and S3 do not make a deal, their profit is zero.

TABLE 2: CASE 2, WHEN SRP_i > MP.

Sellers | Volume (V_i) | V_i × MP | Reserved Price (SRP_i) | V_i × SRP_i
S1      | 40           | $1,200   | $40                    | $1,600
S2      | 40           | $1,200   | $45                    | $1,800
S3      | 10           | $300     | $30                    | $300
Total   | 90           | $2,700   |                        | $3,700

3. Mixed case: ∑_i (V_i × SRP_i) ≤ ∑_i (V_i × MP)

In this case, some sellers have SRP_i ≤ MP and some have SRP_i > MP. For example, in Table 3, if MP is $30, the total of (V_i × MP) is $3,000 and the total of (V_i × SRP_i) is $2,800, so there is a gap ($3,000 - $2,800 = $200). Each seller receives V_i × SRP_i, and the value of the gap is divided among the sellers whose reserved price is lower than MP. In Table 4, if MP is $30, the total of (V_i × MP) is $3,900 and the total of (V_i × SRP_i) is $4,000; in this case the consortium cannot take S3 and needs a different seller with a lower price.

TABLE 3: MIXED CASE, WHEN SRP_i ≤ MP AND SRP_i > MP.

Sellers | Volume (V_i) | V_i × MP | Reserved Price (SRP_i) | V_i × SRP_i
S1      | 40           | $1,200   | $20                    | $800
S2      | 40           | $1,200   | $30                    | $1,200
S3      | 20           | $600     | $40                    | $800
Total   | 100          | $3,000   |                        | $2,800

TABLE 4: MIXED CASE, WHEN SRP_i ≤ MP AND SRP_i > MP.

Sellers | Volume (V_i) | V_i × MP | Reserved Price (SRP_i) | V_i × SRP_i
S1      | 40           | $1,200   | $40                    | $1,800
S2      | 40           | $1,200   | $30                    | $1,200
S3      | 50           | $1,500   | $40                    | $2,000
Total   | 130          | $3,900   |                        | $4,000
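The three cases above can be read as a single payout rule, sketched below; this is an illustrative reading, not the paper's exact formulas. Each seller is paid V_i × SRP_i, and when total revenue at MP exceeds the total reserved value, the remaining gap is shared among sellers whose reserved price is below MP; splitting that gap in proportion to volume is our assumption, since the paper only says the gap is divided among those sellers.

#include <vector>

// Hypothetical seller record used for the payout illustration.
struct Seller {
    double volume;        // V_i
    double reservedPrice; // SRP_i
};

// Distribute the consortium's revenue at market index price MP.
// Each seller is paid V_i * SRP_i; any surplus (the "gap") is shared among
// sellers with SRP_i < MP. The proportional-to-volume split is an assumption.
// Returns false when the total reserved value exceeds total revenue (no sale).
bool distributeProfit(const std::vector<Seller>& sellers, double MP,
                      std::vector<double>& payout) {
    double revenue = 0.0, reservedTotal = 0.0, eligibleVolume = 0.0;
    for (const Seller& s : sellers) {
        revenue       += s.volume * MP;
        reservedTotal += s.volume * s.reservedPrice;
        if (s.reservedPrice < MP) eligibleVolume += s.volume;
    }
    if (reservedTotal > revenue) return false;  // cases like Table 2 or Table 4: no deal

    double gap = revenue - reservedTotal;       // extra profit to share
    payout.clear();
    for (const Seller& s : sellers) {
        double share = (s.reservedPrice < MP && eligibleVolume > 0.0)
                           ? gap * (s.volume / eligibleVolume)
                           : 0.0;
        payout.push_back(s.volume * s.reservedPrice + share);
    }
    return true;
}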
3 Electricity Market Simulation System and Results

The main focus of this paper is to provide a model and simulations of possible electricity trading in the electricity market to prevent blackouts. The model is designed for Jazan, Saudi Arabia; however, it can easily be generalized to other markets. The simulation compares the results of trading with a consortium and trading without a consortium. The simulation system and its results are explained below.

3.1 Simulation system
In this paper, the electricity market simulation shows how trading can occur between sellers and a buyer (SEC). The objective of the simulation is to demonstrate the possibility of preventing blackouts during peak time in Jazan, Saudi Arabia. For simplicity, the simulation generates 30 random sellers (S_i) with volumes (SV_i) and prices (SP_i); in reality, Jazan has 231,390 potential sellers [13]. In this simulation, the target volume of electricity is K = 100, the range of volumes is 5-100, and the range of prices is 1-15. This price range assumes that the last known MP (market price) is 7 or 10. The simulation covers two cases of trading: sellers with consortiums and sellers without consortiums. The same sellers' volumes and prices are used in both cases to show the difference. In Table 5, one unit represents 6.275 MW, and stored electricity can be sold by each seller or group of sellers in lots of 5-100 units. For example, based on the Tesla Powerwall's capacity, if every subscriber contributes 10 kWh, about 2,313.9 MWh of electricity would be available (231,390 × 10 kWh). Simulation and real demand are summarized in Table 5.

TABLE 5. SIMULATION AND REAL DEMAND SUMMARY.

                   | Simulation | Real Demand
Gap (K)            | 100        | 627.5 MW
Capacity of Seller | 5~100      | 31.375 MW (5) ~ 627.5 MW (100)
Case 1: Sellers without consortiums. In this case, only sellers whose volume is greater than the threshold value (30) can participate in the market. The reason for excluding sellers below the threshold is that the buyer (SEC) may not be interested in small sellers, only in large ones. Removing small sellers from the market can lead to a shortage of electricity or to inefficient price discovery. In this case, only sellers above the threshold are used to find a minimum price for K. The threshold value was selected at random and could be any number.

Case 2: Sellers with consortiums. In this case, sellers above the threshold value (30) and consortiums of small sellers whose individual trading volumes are below the threshold can participate in the market. The combined volume of a consortium of smaller sellers must exceed the threshold for the consortium to participate. The advantages of consortiums are:

1. Allowing small sellers to participate in the market and increasing the number of market participants. With more participants, it is more likely to obtain the minimum value of ∑_{j=1}^{m} (SVC_j × P_j) subject to ∑_{j=1}^{m} SVC_j ≥ K.
2. Market liquidity will be increased.
3. The total participating volume of small sellers increases.

In this case, the simulation combines individual sellers, i.e. large sellers above the threshold, with consortiums of small sellers below the threshold, and both are used to find a minimum price for K.
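For illustration, here is a rough sketch of the two cases in C++ (the authors' simulation was written in Java, and all names such as Seller, THRESHOLD, and greedyCost are ours): 30 random sellers are generated with volumes in 5-100 and prices in 1-15; case 1 keeps only sellers above the threshold of 30, while case 2 additionally bundles the small sellers into a single consortium priced at their volume-weighted average (an assumption, since the paper lets sellers agree on a price); for brevity, a cheapest-first greedy pick stands in for the exhaustive minimum-price search.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

// Hypothetical seller: random volume in [5,100] units and price in [1,15].
struct Seller { double volume, price; };

// Greedy stand-in for the minimum-price search: take the cheapest offers first
// until the gap K is covered. The paper's simulation finds the true minimum;
// the greedy pick is only a simplification for this sketch.
static double greedyCost(std::vector<Seller> pool, double K) {
    std::sort(pool.begin(), pool.end(),
              [](const Seller& a, const Seller& b) { return a.price < b.price; });
    double covered = 0.0, cost = 0.0;
    for (const Seller& s : pool) {
        if (covered >= K) break;
        double take = std::min(s.volume, K - covered);
        covered += take;
        cost    += take * s.price;
    }
    return covered >= K ? cost : -1.0; // -1 means K cannot be covered
}

int main() {
    const double K = 100.0, THRESHOLD = 30.0;   // gap and participation threshold
    std::vector<Seller> sellers;
    for (int i = 0; i < 30; ++i)                // 30 random sellers
        sellers.push_back({5.0 + std::rand() % 96, 1.0 + std::rand() % 15});

    std::vector<Seller> large;                  // case 1: only large sellers
    std::vector<Seller> withConsortium;         // case 2: large sellers + one consortium
    Seller consortium{0.0, 0.0};
    double smallCost = 0.0;
    for (const Seller& s : sellers) {
        if (s.volume > THRESHOLD) { large.push_back(s); withConsortium.push_back(s); }
        else { consortium.volume += s.volume; smallCost += s.volume * s.price; }
    }
    if (consortium.volume > THRESHOLD) {        // consortium must itself exceed the threshold
        consortium.price = smallCost / consortium.volume; // volume-weighted price (assumption)
        withConsortium.push_back(consortium);
    }

    std::cout << "Case 1 (no consortium): cost = " << greedyCost(large, K) << "\n";
    std::cout << "Case 2 (with consortium): cost = " << greedyCost(withConsortium, K) << "\n";
}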
3.2 Simulation results

Fig. 7. Comparison between trading with a consortium and trading without a consortium based on minimum price.
In this paper, 45 simulations were completed; they compare trading with consortiums and trading without consortiums based on the minimum price and the number of participating sellers. Figure 7 shows that out of the 45 simulations, 12 produced the same minimum price for trading with and without a consortium, while 33 produced a lower price when trading with a consortium. In other words, in 73% of the cases trading with a consortium helped obtain the K volume at the lowest price, which means trading with a consortium is more efficient than trading without one. Figure 8 shows that in all 45 simulations, trading with a consortium had more participating sellers than trading without one. Consortiums thus help avoid excluding small sellers from market activity, and increasing the number of sellers brings more electricity into the market than trading without consortiums.
Fig. 8. Comparison between trading with a consortium and trading without a consortium based on seller numbers.
4 Conclusion
The availability of electricity makes people's daily activities easier and more flexible, while frequent interruptions make day-to-day life more difficult and can damage electronic appliances such as air conditioners and refrigerators. In addition, people can make money by trading in the electricity market. The gap between electricity generation and the peak load causes blackouts in the Jazan region, especially in summer. People can instead obtain electricity using solar panels with home batteries. Using a mathematical model and simulations of an electricity market, a regional electricity market is proposed and tested. In the simulations, consortiums are tested, and a market with consortiums shows improved market performance. It is also demonstrated that if electricity subscribers in Jazan, Saudi Arabia install solar panels and home batteries like the Tesla Powerwall, they can produce enough electricity to cover the shortfall during summer peak hours without building a large, expensive power plant. The 45 simulations demonstrate that adopting consortiums makes the market model more efficient.
5 References
[1] Business Wire. “Demand for Fast Power Heats Up in the Saudi Desert: Saudi Electricity Company Orders Six More GE Gas Turbines.” Business Wire, 9 Feb. 2016, www.businesswire.com/news/home/20160209005628/en/De mand-Fast-Power-Heats-Saudi-Desert-Saudi. [2] Federal Energy Regulatory Commission. 2015, www.ferc.gov/market-oversight/guide/energy-primer.pdf. [3] Nazarian, Doug. “Introduction to US Electricity Markets.” National Association of Regulatory Utility Commissioners, 21 July 2012. [4] PJM. “PJM’s Role as an RTO.” PJM, 16 Mar. 2017. [5] Energy Master Plan | Frequently Asked Questions About Energy, State of New Jersey New Jersey Energy Master Plan, www.nj.gov/emp/energy/faq.html. Accessed 20 March. 2018. [6] GTM Research | U.S. Research Team, and Solar Energy Industries Association | SEIA. Solar Market Insight Report 2017 Q2. SEIA, Solar Market Insight Report 2017 Q2, www.seia.org/research-resources/solar-market-insight-report2017-q2. [7] “Solar.” IER, instituteforenergyresearch.org/topics/encyclopedia/solar/. Accessed 30 November. 2017. [8] Munsell, Mike. US Solar Market Grows 95% in 2016, Smashes Records. 2017, US Solar Market Grows 95% in 2016, Smashes Records, www.greentechmedia.com/articles/read/us-solar-marketgrows-95-in-2016-smashes-records#gs.pp4daxY. [9] Rogers, John, and Laura Wisland. “Solar Power on the Rise: The Technologies and Policies behind a Booming Energy Sector.” Union of Concerned Scientists, Aug. 2014. [10] Ivanovic, Milan, et al. “Establishing A Consortium - Way for Successful Implementation of Investments Projects - An Example Of The Infrastructural Project ‘Slavonian Networks.’” Economy of Eastern Croatia Yesterday, Today, Tommorow, vol. 3, 2014, pp. 28–36. [11] Boyce, Meherwan P. Gas Turbine Engineering Handbook. 2nd ed., Gulf Publishing Company, 2002. pp. 3. [12] “What Is a Gas Turbine and How Does It Work?” GE Power, www.gepower.com/resources/knowledge-base/whatis-a-gas-turbine. Accessed 25 February. 2018. [13] “Nine Development Plan.” Ministry of Economy and Planning, www.mep.gov.sa/index_en.html. Accessed 11 May. 2017. [14] Bkayrat, Raed. “Middle East Solar: Outlook for 2016.” Middle East Solar Industry Association, Jan. 2016. [15] Lipson, Brenda. “Participating in A Consortium Guidance Notes for Field Staff.” Framework Organization, July 2012.
Understanding Kid's Digital Twin

1 Anahita Mohammadi, 2 Mina Ghanaatian Jahromi, Hamidreza Khademi, Azadeh Alighanbari, Bita Hamzavi, Maliheh Ghanizadeh, Hami Horriat, Mohammad Mehdi Khabiri, Ali Jabbari Jahromi

1 Morphotect Design Group, R&D Department, Toronto, ON, Canada, [email protected]
2 The Department of Architecture and Design (DAD), Polytechnic of Turin, Turin, Italy, [email protected]
Abstract- Childhood is the most important part of everyone's life, as it shapes the fundamentals of identity and character: the better the lifestyle during childhood, the better the generation prepared for the future. Nowadays, for every physical asset in the world we could have a virtual, cloud-based copy running that gets richer with every second of operational data. Practically, opportunities uncovered within the virtual environment can be applied to the physical world by applying the "Digital Twin" concept. The purpose of this paper is to investigate virtual reality's potential in kids' development progress by introducing a "Digital Twin" for kids. The optimum approach may be to begin recognizing each child through his or her individual digital twin based on dynamic profiling. We propose that the idea of the Digital Twin, which links the physical and mental issues of the kid with a virtual equivalent, can assist parents in mitigating problematic concerns. We describe the Digital Twin concept and its development and show how it applies to kids' development progress in defining and understanding their behavior. This paper discusses how the Digital Twin relates to kids' character and how it addresses parents' impact on it.
Keywords: Digital Twin, Dynamic Database, Kid's Development Progress, Integrated Digital Identity, Virtual and Real-World Synchronization, Virtual Behavior Analysis, Real-Time Behavioral Data, IoT
1 Introduction

The world of technology has changed massively over the last decade and now runs our daily lives. Each new technology builds on existing technologies to create something better than what came before, and so on. This trend can help parents reach a more unified and integrated understanding of their kid's character and intensity. This paper attempts to construct an integrated discourse on understanding kids by developing and understanding their "Digital Twin". The approach works as follows: first, digital data reflecting the kid's lifestyle are uploaded by parents to gather real-time statuses, which are bound into a unified identity of the kid. These basic data are connected to a cloud-based system that receives and processes everything the parents monitor. This input is analyzed against the best customized and personalized lifestyle to build an individual growth-strategy roadmap for the kid, and it improves parents' skills by raising their awareness of their educational and parental approach. Through learning and the exchange of spatiotemporal data with the parents, enabled through interfaces connecting us to virtual reality such as the Internet of Things (IoT), mobile apps, and smart wares, the Digital Twin of the kid becomes smarter over time, able to provide predictive insights into a better potential lifestyle and growth.
Fig. 1. Digital twin technology is the bridge between the physical and digital world © www.gestion.pe

2 Understanding our true kid, Validity of Resources

Parenting patterns are passed from one generation to the next, and each decade has its own challenges. Children today have a different lifestyle from the one their parents grew up with, and parenting styles must evolve with this change. Parents always endeavor to stay up to date on how to treat their kids and to provide education suited to their kids' age and physical/psychological characteristics. Psychiatrists, child development specialists, and other accessible resources such as the internet and books are all available for parents to consult. But as available as these resources are, they are generic in nature and not necessarily related to each child individually, and so they miss the unique milestones in every kid's physical and behavioral progress. Ignoring the fact that environment, in both its physical and non-material forms, shapes the fundamentals of most parameters affecting a kid's development results in a partial, non-comprehensive set of informative resources for parents. In this case, a smart platform can assist parents with recognizing and profiling their own kid's personal physical and behavioral development progress.

It is necessary to make parents aware of the parameters that could be true representatives of their kid's identity, and why it is important to take these parameters into consideration when shaping a valid, real-time resource for the kid's development. The virtual platform mentioned above would act as a valid resource for kids' development progress and would reflect the very nature of that progress itself: real-time, responsive, and dynamic.

3 Environmental Impact

Kids' development progress is directly and indirectly impacted by environmental elements. These elements can influence and define kids' development patterns, including physical, social, and emotional skills and thinking/mental abilities. Understanding the dynamic, multi-faceted nature of kids' environment is important in creating a proper roadmap that raises awareness of milestones in the development process. Both the physical environment and the non-material environment, including emotional and social environments, are considered in the process.

3.1 Physical Environment

Kids are surrounded by a physical environment, and their development is directly and indirectly impacted by the quality of this physical space. Potentially any change to the physical environment could lead to a change in kids' development progress and should therefore be carefully monitored and recorded in the kids' favor. Monitored elements could include incidents, crime, and physical environmental pollution (i.e. air, water, and noise). The physical environment is even more significant for the development of kids with disabilities (physical and mental), as it also defines the barriers to using physical space to its full potential. Elements such as accessibility for vehicles and pedestrians and building entries become subjects of attention that need to be carefully reviewed and addressed. There are also other issues that affect a kid's physical development. For instance, entertainment technology (TV, the Internet, video games, iPads, and cell phones) has advanced at such an exponential rate that researchers have scarcely noticed its significant impact on development. Technological entertainment affects the physical well-being of kids and can cause obesity, inefficient sleep patterns, repetitive strain injuries, laziness, and a lower health index in general.
3.2 Non-Physical Environment

Humans are among the most complex systems. Every individual has a unique dynamic behavioral pattern and character. "From this perspective, there can be no character 'Types' since every person's array of organizing principles is unique and singular, a product of his or her unique life history." [1] For kids, interests, tendencies, and character are even more dynamic and change rapidly over time. Each kid has his or her own mental status, influenced by "traffic, noise pollution, crime and other hazards which all are issues that affect children's everyday freedoms." [2] These can be considered non-material environmental parameters that have a significant impact on kids' development progress. Such concerns have created a growing tendency for parents and guardians to shelter their kids in safe zones, at home, where fewer environmental challenges arise. Understanding the impact of non-material environmental elements is not as easy as for physical elements, because they may be physical in materiality but non-material in the nature of their impact. Color is one such element: "Color comes with a strong emotionally magnified impact for the kid which will bring out certain emotions in response. Color can define any space for kids and translate the hidden codes within that space in a clear emotional way, codes such as security, danger, dynamism, excitement and etc. Utilizing of proper colors while considering the existing depth of space can emotionally bring out both positive and negative potentials of an urban space for kids. For instance, while green can point out a stress-free environment, red will express a dynamic one, or using yellow and black colors beside each other will point out a potential hazard in an environment." [3] Thus a chaotic color disorder in the environment can be considered an environmental element with a negative impact on kids' development progress. To identify these environmental elements, a dynamic, smart, and personalized platform/database is required that can both receive real-time data about the kid's development and render an equivalent system, dynamic in nature, synced with the kid's development progress and a true representative of it. Defining such a platform/database can improve parents' awareness of the child's physical, psychological, and behavioral health and improve kids' ability to learn and to sustain personal and family relationships. Nevertheless, "success of such transformation requires the ability to understand and manage new challenges that emerge in time and space over time". [4]
4 Digital Twin

Creating real-time virtual identities with complex and meaningful behavioral patterns has become possible thanks to significant advances in machine learning protocols, smart algorithms, and AI, along with cloud-based data collection and cloud processing [5]. The term Digital Twin was brought to the general public for the first time in Shafto et al. (2010) and Shafto et al. (2012) [6], [7]. Digital twin technology is the bridge between the physical and digital worlds. It is a virtual model of a process, product, or service. This pairing of the virtual and physical worlds allows analysis of data and monitoring of systems to head off problems before they occur, prevent downtime, develop new opportunities, and even plan for the future by using real-time and smart simulations. "The concept of the digital twin is mostly associated with the model-driven virtual image of a real system. Through a model, which emulates certain functions of the real system, predictions can be estimated and analyses can be performed" [8]. Such a digital twin synchronizes its status with the real object through sensors and communication interfaces. It can directly affect the real system, for example through the modification of parameters, or it can be used as a communication interface to other systems and to humans, for example in order to observe a certain status [9]. Understanding the digital twin of a system is much easier than understanding the system directly, as one can identify and analyze key parameters one at a time and in relation to each other. Smart and dynamic equations also enable us to predict and visualize the strange attractor of the system's behavioral pattern. Merging the digital twin concept with psychological cognitive science allows us to apply it to kids' development progress. The following topics discuss how the digital twin concept can be utilized and implemented to create a comprehensive roadmap for kids' development progress [10].

4.1 Digital Twin Platform

Due to the nature of kids' development, the digital twin is an integration of various informative virtual models, each covering a certain scope of kids' development progress. The digital twin is defined by the informative interaction of these virtual models on a single platform. These models could include Geographic Information System (GIS) models, virtual physical/psychological development progress models, and environmental data (i.e. police incident, crime... data). The kid's digital twin would be a real-time platform acting as the interface of the above-mentioned models. This platform is a unified Application Programming Interface (API) that runs as an interaction hub for environmental and informational APIs.
4.2 Digital Twin Data Input

Supported by real-time data collection from the kid's behavioral patterns, physical status, and biometrics, a big data set is shaped and computationally analyzed, revealing behavioral and physical development patterns. The input data collected here are interpreted by the interaction of machine learning and AI algorithms through the unified platform.

4.2.1 Database

In order to convert and transfer the physical environment to the virtual environment, two types of database are needed: a customized "kid profiling" database and a "general" database. The kid profiling database contains personal data collected from the parents about their kid, such as age, gender, nationality, lifestyle (i.e. parents' schedule and budget), cultural values, disability (physical and mental), allergies, and so on.
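As a concrete illustration of the customized kid-profiling database described above, here is a minimal data-model sketch; every field name is an assumption drawn from the examples listed in the text, not a schema defined by the authors.

#include <string>
#include <vector>

// Hypothetical record for the customized "kid profiling" database,
// with fields taken from the examples listed above.
struct KidProfile {
    int age;
    std::string gender;
    std::string nationality;
    std::string lifestyle;                  // e.g. parents' schedule and budget
    std::vector<std::string> culturalValues;
    std::vector<std::string> disabilities;  // physical and mental
    std::vector<std::string> allergies;
};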
The general database contains data collected from the context (i.e. authorities), such as GIS, accessibility (vehicles, pedestrians), depth of perceptible space, density, visual disorder, environmental pollution (i.e. air, noise), climate, incidents, crime, green and public spaces, and so on. Access to a real-time database points toward the "Internet of Things", which makes data more accessible and ubiquitous and necessitates the right approach and tools to convert data into useful, actionable information [11].

Fig. 2. Kid's profiling database is a real-time and dynamic database © www.fomentformacio.com

4.3 Digital Twin Interface

The digital twin interface is customized to create an open-ended dialog between its users (kid, parents, and guardians): a Human-Computer Interface (HCI) that is multi-faceted in nature but presents a unified platform. The focus of the interface is the interaction between the human and the virtual identity, which is consistent yet dynamic in nature. This interface connects the kid and the parents (guardians) to the kid's associated digital twin.

5 Methodology

This work aims to introduce a roadmap toward a methodology that allows the construction of the information-exchange platform, the Digital Twin. The main idea of this proposal is to model a digital twin of the kid's physical and mental development at a high level. This platform model is a base for exchanging information with other APIs as well as with users: sophisticated enough for APIs to shape their interaction and user-friendly enough to shape a dialog with users.
Fig. 3. Different dimensions when using simulation technology (Digital Twin) throughout the entire life-cycle of a system [16]. The figure illustrates the schematic of the Kid's Digital Twin based on the current life-prediction process.
The emergence of Digital Twins [14], [15], [16], an endeavor to create intelligent adaptive machines by generating a parallel virtual version of a system, along with the connectivity and analytical capabilities enabled by IoT, constitutes the foundation for the cognitive development of an ideal lifestyle for the kid. An Artificial Intelligence system harvests the informational data produced by the kid's physical and mental character, as well as the experiential context created by other associated users. Over time, an individual growth-strategy roadmap for the kid is formed. The system is analyzed consistently, and the real-time platform shapes the dialog with users through its user-friendly interface.
Fig. 4. Schematic of constructing the digital twin's virtual environment of the kid, integrated with "Dynamic Profiling Data" and "Environment Information Fusion".

The digital twin of every single kid's character, cognizant of the profiling data and basic data and their fluctuations in time and space, is progressively able to anticipate the strange attractor of behavior and character in the system and to predict possible/optimum future behaviors. The predictive conduct of the digital twin character relies on the real-time dynamic profiling of the kid and on the general data. Additionally, irrespective of the current basic character, a digital twin character can simulate what happens if the kid's current lifestyle is not efficient or ideal and how it can be improved toward an optimal one.

6 Conclusion and Further Works

Kid-character-technology interactions, in which dynamic profiling of the kid is integrated into an analytics platform, may enhance the quality of the kid's lifestyle. The Kid Digital Twin represents a first step toward developing a reality-virtuality system at the real-time intersection of these interactions. This material is based upon work supported by Morphotect Design Group's R&D Department, which is focused on protocols and interfaces that have the potential to merge a kid's Digital Twin with his/her physical (real) world. Future research can expand this framework in ways that enable the complex, interdependent visual and numerical analytics that will show how parents can understand all aspects of kid-character-technology interactions to achieve resilience objectives.

7 References
1. Stolorow, R. D. (2011). The world, Affectivity, Trauma: Heidegger and Post-Cartesian Psychoanalysis. New York: Routledge. 2. Global Designing Cities Initiative and National Association of City Transport Officials (2017). Global Street Design Guide. [Online] globaldesigningcities.org [Accessed: 12 Sep. 2017]. 3. Mohammadi, Anahita, Jabbari Jahromi, Ali, Alighanbari, Azadeh,” Kids Friendly Factor in Urban Spaces”, International Conference on Information & Knowledge Engineering (WORLD COMP’15/IKE2015), CSREA Press. 4.N. Mohammadi and J. E. Taylor, Smart city digital twins, Conference: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), November 2017, DOI10.1109/SSCI.2017.8285439 5.Le Duigou J, Gulbrandsen-Dahl S, Vallet F, So¨derberg R, Eynard B, Perry N (2016) Optimization and Lifecycle Engineering for Design and Manufacture of Recycled Aluminium Parts. CIRP Annals-Manufacturing Technology 65(1):149– 152. 6. Shafto et al., 2010, Shafto, M., Conroy, M., Doyle, R., Glaessgen, E., Kemp, C., LeMoigne, J., and Wang, L. (2010). Draft modeling, simulation, information technology & processing roadmap. Technology Area, 11. 7. Shafto et al., 2012, Shafto, M., Conroy, M., Doyle, R., Glaessgen, E., Kemp, C., LeMoigne, J., and Wang, L. (2012). Modeling, simulation, information technology & processing roadmap. National Aeronautics and Space Administration. 8. T. H.-J Uhlemann, C. Lehmann, R. Steinhilper, “The Digital Twin: Realizing the Cyber-Physical Production System for Industry 4.0,” in Procedia CIRP, 2017, Vol.61, pp.335-340 9. Graessler, A. Poehler, "Integration of a digital twin as human representation in a scheduling procedure of a cyber-physical production system", Industrial Engineering and Engineering Management (IEEM), 2017 IEEE International
Conference on Singapore.DOI: 10.1109/IEEM.2017.8289898 10. José Ríos, J. C. Hernandez, Manuel Oliva, Fernando Mas, "Product Avatar as Digital Counterpart of a Physical Individual Product: Literature Review and Implications in an Aircraft", Conference: 22nd ISPE Inc. International Conference on Concurrent Engineering (CE2015) At: TU Delft, Volume: 2 of Advances in Transdisciplinary Engineering. 11. Lee J, Lapira E, Bagheri B, Kao H-A (2013) Recent Advances and Trends in Predictive Manufacturing Systems in Big Data Environment. Manufacturing Letters 1(1):38–41. 12. S. Boschert and R. Rosen, “Digital twin-the simulation aspect,” in Mechatronic Futures: Challenges and Solutions for Mechatronic Systems and Their Designers, 2016, pp. 59–74. 13. M. Grieves, “Digital twin: Manufacturing excellence through virtual factory replication,” 2014. 14. S. P. A. Datta, “Emergence of Digital Twins,” Arxiv, 2016.
Design and Implementation of a Library Database Using a Prefix Tree in an Undergraduate CS Sophomore Course

Suhair Amer, Adam Thomas, Grant Reid, and Andrew Banning
Department of Computer Science, Southeast Missouri State University, Cape Girardeau MO

Abstract – Undergraduate computer science students taking a sophomore C++ programming course test the idea of using a "trie" data structure to catalog books in a library. Each book has an alphanumeric ID of no more than 7 characters and a book title, so the catalog can store 7.8364164096 × 10^10 (36^7) different book titles. The benefit of this system is that it can quickly retrieve books when searching by key. The main drawback is the large file size generated, as each node in the trie must hold an array of 36 node pointers to allow fast traversal.

Keywords: prefix tree, book catalog, trie
1. Introduction

Many computerized library databases use a file system to store and retrieve books. As these libraries often contain thousands to tens of thousands of books, they need a system that can quickly and efficiently retrieve books for the user. One common way to implement such a system is a spelling tree, herein referred to as a trie (pronounced "tree" or "try"). In computer science, a trie is also called a digital tree and sometimes a radix tree or prefix tree (as it can be searched by prefixes). A tree in computer science is a data structure that can be used to store and sort data; there are multiple versions of trees and different, equally valid applications of them. This paper focuses on a prefix/spelling tree, herein referred to as a trie. According to de la Briandais, a trie works by traversing from node to node using a key, where each node is assigned a character of the key (1959). This concept was then expanded by Edward Fredkin, who defined finite-dimensional tries and presented a more concise approach to adding new nodes. Even with this rather strict outline, a trie can still be implemented in several ways. There are many types of tries, such as the HAT-trie, hybrid HAT-trie, and burst-trie (Askitis & Sinha, 2007). In this paper, for simplicity and because of time constraints, the students developed and tested the most basic implementation.
One conceptual difference between a regular tree and a prefix tree is that a prefix tree partitions the key space, while a regular tree partitions the data. This means more space is allocated for a prefix tree, but the time taken to find the information stored in a node is drastically shortened; by partitioning the data, a regular tree lengthens the search time but cuts down on storage considerably (Ramabhadran, Hellerstein, Ratnasamy, & Shenker, 2004). Some of the benefits of tries are "fast retrieval time, quick unsuccessful search determination, and finding the longest match to a given identifier" (Al-Suwaiyel & Horowitz, 1984). Fast retrieval time refers to the total amount of data that must be sifted through to get to the information, and tries excel in this area. The best, worst, and average case for reaching the information in a trie is always the key length, which acts as an address for the information. Every piece of information has an associated key; if the key has a total length of five characters, then even if the trie is completely full, it takes only five node traversals to reach the desired information. According to Comer, "trie implementations have the advantage of being fast, but the disadvantage of achieving that speed at great expense in storage space" (1979). This is because a trie implementation must account for every possible key permutation when laying out the pathways of its nodes. While the creation of trie structures requires great storage space, according to Al-Suwaiyel that space can be reduced by up to 70% through compaction algorithms applied after the trie is created. According to Bia & Nieto, indexing substantial amounts of space inside tries is unnecessary when keys are used (2000); keys keep node traversals to a minimum while allowing the nodes to hold the necessary information. Trie traversals can be further accelerated by using hash codes generated by hashing functions such as those described by Coffman (1970). One way to implement a trie with constant node-traversal time is to use a "full trie", where every key has the same length, no matter the item. Minimizing a trie to its smallest size is an NP-complete problem (Comer, 1981).
2. Analysis

The implemented system provides the following functionality:

• Add a book by key to the catalogue.
• Remove a book by key from the catalogue.
• Search for a book by key and display the title.
• Keys limited to 7 alphanumeric characters, allowing many possibilities for book storage.
• Check a book in or out using a key.
• Overwrite an existing book.
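For illustration only, the functionality above could be exposed through an interface like the following sketch; the class and method names are hypothetical, not those used in the students' implementation (their node class appears in Figure 1).

#include <string>

// Hypothetical catalogue interface matching the functionality list above.
// Names are illustrative; the students' implementation exposes a trie of
// node objects (see Figure 1) rather than this exact wrapper.
class LibraryCatalog {
public:
    bool addBook(const std::string& key, const std::string& title);   // add or overwrite by key
    bool removeBook(const std::string& key);                          // remove by key
    std::string findTitle(const std::string& key) const;              // search and display title
    bool checkOut(const std::string& key);                            // mark as checked out
    bool checkIn(const std::string& key);                             // mark as returned
    // Keys are validated elsewhere: at most 7 characters, a-z and 0-9 only.
};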
3. Design

For this prefix tree, the students decided to traverse and assign child nodes using a key provided by the user or read from a file that previously stored book information. It was also decided to keep the information separate from the nodes until the last key node, in which the string name of a book is stored. The key is limited to the twenty-six characters of the English alphabet and the digits 0-9. Keys entered by the user are automatically checked for valid format and converted to lower case if necessary. If the key entered is invalid, a message pops up indicating "Invalid key in _____", where the blank is replaced with the name of the function that was unable to use the key or unable to access the information in the next node. The trie implemented for this project is similar to a basic prefix tree, with a couple of exceptions. The first exception is that every node can store the critical information, not only the key path. In many trie implementations, to save space, interior nodes carry only the key path and a leaf node holds the final information; that is not done in this project because, as implied by the project requirements, any node must be able to store a book title. Therefore, each node has two data members: a string information, which holds a book title, and an array of thirty-six child node pointers, represented as node* children[36].

The purpose of the thirty-six child pointers is to minimize searching time. Because the key uses only the twenty-six lower-case letters and the digits 0-9, every character is assigned an index that allows easy node traversal. For example, if the key entered is "ab", the program reads 'a', stores it in a character key, and traverses to the next node using the children array as children[key - 'a']; the index is zero, since 'a' always maps to index zero. To get to 'b' the same process is used, and the index ends up as one. Using this indexing method, search time is cut down to the total number of characters in a single key. A static array is used instead of a dynamic array because, even though a dynamic array would save space, the static array allows indexing the children in the manner just explained; a dynamic array would constantly change size and make it impossible to jump immediately to the next key node without scanning an array at every node, eliminating the time efficiency of a prefix tree. The traversal functionality was not implemented, because a full traversal of this tree is impractical and incredibly time consuming: there is no need to display the entire contents of a library all at once, nor is there a reason to copy the entire library from one file to another.
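A minimal sketch of this index mapping is shown below; mapping the digits 0-9 to indices 26-35 after the letters is our assumption, since the paper spells out only the letter case.

#include <cctype>

// Map a key character to a child-array index: 'a'-'z' -> 0..25, '0'-'9' -> 26..35.
// The digit offset of 26 is an assumed convention; the paper only shows key - 'a'.
int childIndex(char key) {
    key = static_cast<char>(std::tolower(static_cast<unsigned char>(key)));
    if (key >= 'a' && key <= 'z') return key - 'a';
    if (key >= '0' && key <= '9') return 26 + (key - '0');
    return -1; // invalid key character
}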
4. Implementation

This project was implemented in C++, which provides the object-oriented design needed to implement the trie, particularly the interaction between the trie structure and its nodes. Unfortunately, running this program requires a surplus of memory; due to the structure of a trie, there is always a large storage overhead. The implemented node data type is shown in Figure 1. Each node contains an array of thirty-six node pointers, a string to hold the possible data, and a flag recording whether the book is checked out.
//node.h
#include <iostream>
#include <string>
using namespace std;

const int NUMCHILDREN = 36; //uses 26 english characters and 0-9

#ifndef NODE_H
#define NODE_H
class node {
private:
    string information;             //book title
    node *children[NUMCHILDREN];    //array of 36 node pointers
    bool checkOut;                  //tells if book is checked out or not
public:
    node();
    node(string info);
    void display();
    string getInfo();
    node* getChild(char key);       //will return child of node calling function
    void setInfo(string info);      //naming a book
    void setCheck(bool state);      //checking out or returning a book
    bool setChild(char key, string information); //setting a child
    bool isChecked();               //checking state of book
    friend ostream &operator<<(ostream &out, node &n); //stream output for a node