
Data Fabric Architectures

Also of interest

Big Data Security. Edited by Shibakali Gupta, Indradip Banerjee, Siddhartha Bhattacharyya. ISBN ----, e-ISBN (PDF) ----. Series: De Gruyter Frontiers in Computational Intelligence, edited by Siddhartha Bhattacharyya. ISSN -, e-ISSN -.

Mathematical Foundations of Data Science Using R. Frank Emmert-Streib, Salissou Moutari, Matthias Dehmer. ISBN ----, e-ISBN (PDF) ----.

Noise Filtering for Big Data Analytics. Edited by Souvik Bhattacharyya, Koushik Ghosh. ISBN ----, e-ISBN (PDF) ----. Series: De Gruyter Series on the Applications of Mathematics in Engineering and Information Sciences, edited by Mangey Ram. ISSN -, e-ISSN -.

Deep Learning for Cognitive Computing Systems. Technological Advancements and Applications. Edited by M.G. Sumithra, Rajesh Kumar Dhanaraj, Celestine Iwendi, Anto Merline Manoharan. ISBN ----, e-ISBN (PDF) ----. Series: Smart Computing Applications, edited by Prasenjit Chatterjee, Dilbagh Panchal, Dragan Pamucar, Sharfaraz Ashemkhani Zolfani. ISSN -, e-ISSN -.

Data Fabric Architectures Web-Driven Applications Edited by Vandana Sharma, Balamurugan Balusamy, J. Joshua Thomas and L. Godlin Atlas

Editors

Dr. Vandana Sharma, Amity Institute of Information Technology, Amity University, Noida Campus, Uttar Pradesh, India, [email protected]
Dr. Balamurugan Balusamy, Shiv Nadar University, Campus Housing Flat No. 754, 5th Floor, Tower 7, NH 91, Greater Noida 201314, Uttar Pradesh, India, [email protected]
Dr. J. Joshua Thomas, UOW Malaysia KDU Penang University College, 12-7-c, Tanjung Pura, Logan Road, George Town 10400, Penang, Malaysia, [email protected]
Dr. L. Godlin Atlas, 6-18 Atlas Cottage, Chenkody 629177, Kanyakumari, Tamil Nadu, India, [email protected]

ISBN 978-3-11-100082-4 e-ISBN (PDF) 978-3-11-100088-6 e-ISBN (EPUB) 978-3-11-100114-2 Library of Congress Control Number: 2023931029 Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de © 2023 Walter de Gruyter GmbH, Berlin/Boston Cover image: enjoynz/DigitalVision Vectors/Getty Images Typesetting: Integra Software Services Pvt. Ltd. Printing and binding: CPI books GmbH, Leck www.degruyter.com

Contents

List of Authors

Seema Rani, Vandana Sharma
1 Demystifying Industrial Trends in Data Fabric

Vijayapriya R., A. Umamageswari, Rohith Bhat, Ruby Dass, Manikandan N.
2 Web-Based Data Manipulation to Improve the Accessibility of Factory Data Using Big Data Analytics: An Industry 4.0 Approach

Shubham Verma, Amit Gupta, Abhishek Prabhakar
3 The Overview of Data Virtualizations and Its Modern Tools in the Domain of Data Fabrics

Jagjit Singh Dhatterwal, Kuldeep Singh Kaswan, Vivek Jaglan
4 Data Fabric Technologies and Their Innovative Applications

Pushpalatha N., Sudha Mohanram, S. Sivaranjani, A. Prasanth
5 Enterprise Data

Rajakumari K., Hamsagsayathri P., Shanmugapriya S.
6 Features, Key Challenges, and Applications of Open-Source Data Fabric Platforms

Pankaj Rahi, Monika Dandotiya, Sanjay P. Sood, Mohit Tiwari, Sayed Sayeed Ahmad
7 An Open-Source Data Fabric Platform: Features, Architecture, Applications, and Key Challenges in Public Healthcare Systems

Kuldeep Singh Kaswan, Jagjit Singh Dhatterwal, Naresh Kumar
8 Simulation Tools for Big Data Fabric

R. Ahila, D. Nithya, P. Sindhuja, K. Divya
9 Simulation Tools for Data Fabrication

Aditya Saini, Vinny Sharma, Manjeet Kumar, Arkapravo Dey, Divya Tripathy
10 Security, Privacy, and Authentication Framework for Web-Driven Data Fabric

Aditya Saini, Amit Yadav, Vinny Sharma, Manjeet Kumar, Dhananjay Kumar, Divya Tripathy
11 Government Compliance Strategies for Web-Driven Data Fabric

Index

List of Authors

Seema Rani, Amity University, Noida, Uttar Pradesh, India, [email protected]
Dr. Vandana Sharma, Amity University, Noida, Uttar Pradesh, India, [email protected]
Vijayapriya R., Vellore Institute of Technology, Vellore, Tamil Nadu, India, [email protected]
Umamageswari A., SRM Institute of Science and Technology, Kattankulathur, Chengalpattu, Tamil Nadu, India, [email protected]
Rohith Bhat, SIMATS School of Engineering, Thandalam, Chennai, Tamil Nadu, India, [email protected]
Ruby Dass, Vellore Institute of Technology, Vellore, Tamil Nadu, India, [email protected]
Manikandan N., Vellore Institute of Technology, Vellore, Tamil Nadu, India, [email protected]
Shubham Verma, AKTU, Lucknow, Uttar Pradesh, India, [email protected]
Dr. Amit Gupta, HBTU, Kanpur, Uttar Pradesh, India, [email protected]
Dr. Abhishek Prabhakar, Department of Computer Science and Engineering, Dr. A.I.T.H., Kanpur, Uttar Pradesh, India, [email protected]
Jagjit Singh Dhatterwal, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India, [email protected]
Kuldeep Singh Kaswan, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Pushpalatha Naveenkumar, Sri Eshwar College of Engineering, Coimbatore, Tamil Nadu, India, [email protected]
Sudha Mohanram, Sri Eshwar College of Engineering, Coimbatore, Tamil Nadu, India, [email protected]
Sivaranjani S., Sri Krishna College of Engineering and Technology, Coimbatore, Tamil Nadu, India, [email protected]
Prasanth A., Sri Venkateswara College of Engineering, Sri Perumbudur, Tamil Nadu, India, [email protected]
Rajakumari K., School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Hamsagsayathri P., School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Shanmugapriya S., School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Pankaj Rahi, Department of Artificial Intelligence and Data Sciences, Poornima Institute of Engineering and Technology, Jaipur, Rajasthan, India, [email protected]
Monika Dandotiya, Department of Computer Science and Engineering, Madhav Institute of Technology and Science, Gwalior, Madhya Pradesh, India, [email protected]
Sanjay P. Sood, Centre for Development of Advanced Computing (C-DAC), Mohali, Punjab, India, [email protected]
Mohit Tiwari, Bharati Vidyapeeth's College of Engineering, Delhi, India, [email protected]
Sayed Sayeed Ahmad, Rochester Institute of Technology, Dubai Campus, Dubai, UAE, [email protected]
Naresh Kumar, GL Bajaj Institute of Technology and Management, Greater Noida, Uttar Pradesh, India, [email protected]
Dr. Ahila R., Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Dr. Nithya D., Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Sindhuja P., Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
M. Divya K., Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, India, [email protected]
Aditya Saini, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Vinny Sharma, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Manjeet Kumar, G. L. Bajaj, [email protected]
Arkapravo Dey, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Divya Tripathy, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Amit Yadav, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]
Dhananjay Kumar, Galgotias University, Greater Noida, Uttar Pradesh, India, [email protected]

https://doi.org/10.1515/9783111000886-203

Seema Rani, Vandana Sharma

1 Demystifying Industrial Trends in Data Fabric

Abstract: Data is the driving force for all industries and business organizations. Its presence can be felt at every moment of everyone's life, and its availability and prominence are continuously growing. Data volume, and awareness of its value, are driving the demand for data readiness and accessibility across the organizational landscape. Industry and business decisions are reinforced by analysis of existing data to confirm optimistic returns on investment. To survive in an industry, a business must stay updated with current and upcoming trends; the world has become more competitive after the pandemic, with every business and industry entering the digital world, so it is crucial to be acquainted with the latest and upcoming trends and to remain competitive by being aware of possible changes. Data integration is essential for innovation, as it allows companies to better connect and use data to their advantage. Today's businesses rely on data integration to deliver a consistent, truthful picture at the enterprise landscape level; the data integration method guarantees that the right people are able to access the right data at the right time. Data fabric framework solutions provide a unified architecture for integrating data from various sources and dealing with data silos. This chapter begins with a discussion of the challenges and issues in the existing scenario: why the data fabric is becoming popular and being adopted by various organizations, how it overcomes the challenges that arise in putting data to its best use, and the set of solutions provided by industrial organizations, along with case studies highlighting how effective those solutions are.

Keywords: Data silos, data fabric, data integration, governance and compliance, data enterprise

1.1 Introduction

With the world going digital in almost every sector, an explosion of data has taken place. In the digital world, individuals connect, communicate, and maintain relationships across all walks of life, irrespective of geographic location, time, demographic differences, disparities in literacy, or socioeconomic status. The ubiquity of social networking platforms such as Facebook, Twitter, YouTube, and Reddit has created an information ecosystem that serves as a significant basis for spreading and sharing information [1]. Data that grows in this way in terms of volume, variety, and velocity is called big data [2]. As data accumulated beyond what conventional systems could accommodate and handle, cloud computing was introduced.

Seema Rani, Amity Institute of Information Technology, Amity University, Uttar Pradesh, e-mail: [email protected]
Vandana Sharma, Amity Institute of Information Technology, Amity University, Uttar Pradesh, e-mail: [email protected]
https://doi.org/10.1515/9783111000886-001

1.1.1 Cloud Computing

With the explosive growth of data, the volume of data to be stored, processed, or returned by programs and queries has grown enormously and can no longer be handled by an ordinary computer. Cloud computing provides the solution on demand by making resources, data storage, and computing power available and distributed over multiple locations, each of which is a data centre; this reduces capital expenditure, although it can raise operational (and sometimes hidden) expenses unexpectedly. The model relies entirely on internet services. The term cloud computing was introduced by the National Institute of Standards and Technology (NIST) [3]. The cloud computing model offers a shared pool of computing resources such as networks, servers, storage, applications, and services. This includes data centres, applications, and service providers for storing, manipulating, and managing enormous volumes of data without the organization having to buy, establish, or scale its own infrastructure. These resources can be swiftly provisioned and made available with minimal management effort or service-provider interaction, and the pool of resources is available to consumers and organizations on demand, paid for according to usage [4]. End-to-end communication in the system relies on the availability of internet access. Clients can access resources such as infrastructure, software, storage, and applications, and use their data, through a thin client or web browser; the client organization does not need to establish infrastructure or buy large storage devices to cope with the explosion of data. The set of services provided by cloud technology is presented in Figure 1.1 [5].

Figure 1.1: Services offered by cloud computing.

The following points briefly discuss how these services benefit the end user [6]:

Pay as you go: There is no capital investment in infrastructure or procurement for installing data centres or servers; payment is based on actual usage only.

24 × 7 availability: Customers can access and work on their data anytime, from anywhere, irrespective of their location.

Flexibility in capacity: The technology offers flexibility in usage by season, time span, or operations; if sales go down in a particular period, capacity can be reduced, and in case of a sudden increase it can accommodate the additional requirement.

Location independence: Cloud services can be used from any place in the world, provided internet connectivity is available, and are not limited to particular devices.

Automated software updates: Regular maintenance and updates are taken care of automatically at the server end; the end user does not need to spare time or schedule such events.

Security: The technology offers security in terms of data availability; even if sensitive data is lost locally, it remains available to the user at any time.

Carbon footprint: Organizations pay for and use services and data only as needed, rather than over-provisioning resources. This leads to efficient resource utilization, avoids waste, saves energy, and reduces the carbon footprint.

Enhanced collaboration: It provides a platform on which users can collaborate with diverse groups, drawing on the different expertise required to develop or improve a product.

Control over documents: All users access the required document from a centralized storage place, that is, the cloud. This improves overall work efficiency because there is no delay in sharing or sending documents from one user to another via mail and so forth, as in a traditional system.

Easily manageable: The technology simplifies and augments resource and infrastructure management. From the user's perspective, it provides a user interface as per the Service Level Agreement (SLA) without the user having to worry about installing applications locally.

Even though the cloud offers 24 × 7 availability of data with a minimal rate of downtime, in real-world scenarios clouds are still susceptible to failure. For an organization on the verge of scaling up and progressing, instead of keeping everything under one umbrella, the power of a multicloud environment plays a vital role. Multicloud means two or more public clouds, redundant data, a larger footprint, fewer hops, reduced latency, high availability, and a split workload. A multicloud platform allows organizations to disperse their jobs among multiple cloud environments [7]. This primarily reduces the risk of relying on any single cloud environment, and this lower risk largely explains the widespread adoption of multicloud infrastructure by progressive organizations [8]. The features attainable by adopting a multicloud environment are as follows (a small illustrative sketch of the latency point appears after this list):

Optimized return on cloud investment: A multicloud environment offers a rich set of cloud options in terms of pricing, policies, features, and functionality with respect to individual business requirements. Users can shuffle among clouds according to the resources required for the current needs of the business.

Better security and economics: This environment allows security-relevant data and workloads to be kept in a private cloud and routine business data in public cloud networks. In this way, multicloud infrastructure combines security with cost savings.

Reduction in latency time: A large organization dispersed across a wide geographical area needs to serve different corporate sites with an integrated end-user experience. A multicloud environment achieves this by serving the required data from the nearest data centre, which involves the minimum number of server hops and therefore reduces latency.

Autonomy: A blend of platforms and vendors ensures that work and data are not locked in to an individual cloud provider. Minimal vendor lock-in and simple, largely automated switching among vendors help improve performance and accommodate the changing and growing requirements of the business.

Lower risk: Distributing the workload among multiple cloud networks decreases the probability of concurrent downtime of all clouds, giving organizations many options for responding to and mitigating risks when needed.
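The latency point above can be made concrete with a minimal sketch. The region names and measured round-trip times below are purely illustrative and are not tied to any particular cloud provider; a real multicloud router would measure latency continuously and also weigh cost and policy.

```python
# Minimal sketch: route a request to the cloud region with the lowest
# measured latency. Region names and latency values are illustrative only.

def pick_region(latencies_ms: dict) -> str:
    """Return the region whose measured round-trip time is smallest."""
    if not latencies_ms:
        raise ValueError("no regions measured")
    return min(latencies_ms, key=latencies_ms.get)

if __name__ == "__main__":
    measured = {"cloud-a-mumbai": 38.0, "cloud-b-singapore": 72.5, "cloud-a-frankfurt": 145.0}
    print("serve from:", pick_region(measured))   # -> cloud-a-mumbai
```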


1.1.2 Challenges

This section briefly describes the challenges that come with deploying a cloud environment in real scenarios [9].

Privacy and security: In a cloud environment, an application runs on servers that are shared by many other businesses at the same time. Any malicious user behaviour can hamper the overall performance of the shared resources, whether application or data storage [10]. In such an environment, hackers have a large window to intervene and access or corrupt sensitive business information.

Internet connectivity: The whole model depends on the availability of good internet connection services. At times, cloud providers offer limited bandwidth, and organizations are required to pay an additional service charge to obtain smooth data transfer and access.

Technical support: A cloud network is established over a large geographical area, so the likelihood of failures, technical faults, power outages, and low internet bandwidth demands high maintenance standards. However, cloud providers often fall short in maintenance and support for business organizations.

Data breaches over the network: In the prevailing digital world, data in transit has grown exponentially. As data in transit traverses multiple clouds and different deployment models, such operations are prone to data breaches, and ensuring data security in transit has become an important issue [11].

Autonomic computing: Automated service provisioning, dynamic resource allocation, and timely switching among clouds according to their specific services and the needs of the organization are further challenging aspects of implementation.

Control/governance/dependency: The cloud providers retain control over data and resource allocation, and the business is fully dependent on the services they offer [12]. Switching between clouds based on the functionality required by the business at a given point is complex and dependent on the providers.


1.1.2.1 Challenges Faced by Cloud Services in Real-Time Applications

A Crowd Research Partners survey reveals that around 90% of cloud security experts were concerned about cloud security issues [13]. They highlighted issues relating to violation of confidentiality and data privacy, along with loss and leakage of data. Cloud migration also encounters challenges due to a shortage of resources and technical expertise: according to a RightScale report, around 75% of respondents consider it a challenge and 23% regard it as actually complex and problematic. The report also found that some business entities waste approximately 30% of their budget on such investments. The continuous increase in the number of users creates further challenges related to compliance with laws and regulations. In practice, more than 60% of respondents found that migration projects were harder than they had estimated, and more than 50% of projects exceeded their budget limit. Similarly, many companies reported challenges in integrating their on-premise applications and tools with the public cloud: a SoftwareOne report reveals that 39% of users find integrating legacy systems a major challenge in multicloud settings. In real scenarios, cloud computing challenges such as scalability, integration, and disaster recovery are magnified in a hybrid cloud environment.

1.1.3 Types of Data

In the prevailing scenario, where digitization has reached one and all across the world, business organizations extend their periphery across states and even countries. This state of affairs leads to a data explosion from the various sources depicted in Figure 1.2.

Figure 1.2: Forms of data (transactional activities; application data; unstructured data such as audio, video, and document images; logs; messages; files; reports; web content; social network data; emails; data warehouse content; metadata; and master data).

These different types of data are produced by various sources, machines, and files, as depicted in Figure 1.3. The way data is accessed, used, manipulated, and interpreted is presented in Table 1.1.

Figure 1.3: Sources of data (communication with customers; digital behaviour; purchases from the market; previous history/patterns; government records such as NPA and health records; documents).

Table 1.1: Data access mechanism.

Existing scenario (data access and manipulation method): Customer → Employee → App → Data
Virtual/digital world (access methodology): Customer → App → Data
Required/necessity (study and interpret the way data is made available): Capture → Integrate → Analyse (deeper customer insight)

Data generated from various sources, whether traditional, machine generated, human generated, or external, resides in or is pushed into databases, multicloud environments, and data centres. Integrating this data is the need of the hour in order to obtain deeper insight from it. The diversity of data sources and the sheer size of the data pose many challenges and make finding, managing, governing, and integrating data intricate [14]. Blind integration, the dangers of self-service data preparation, the ways data is shared, and the availability of data irrespective of location raise big questions about governance. At the same time, the explosion of silos and multiple versions of inconsistent data create a situation in which organizations are unable to extract deeper insight from the data they already have.


This, in turn, lessens the return on investment, an implication of the fundamental principle that garbage in leads to garbage out. Hence the need for better personalization, greater effectiveness, shorter turnaround, a better experience, lower risk, and better insight to drive decisions and recommendations, so that the required data reaches the right people at the right time [15]. The following section introduces data fabric technology, which delivers business-ready data and analytical assets.

1.2 Data Fabric

The data fabric is an emerging building block that aims to address the data challenges arising from a hybrid data landscape. A data fabric is agnostic to deployment platforms, data processes, geographical locations, and architectural approaches. It enables the use of data and analytical assets across the organizational landscape. A data fabric layer guarantees that numerous types of data can be integrated, accessed, manipulated, and governed in an efficient and effective manner, increasing business agility and speed. In real-time scenarios, every user who interacts with the system, directly or indirectly, looks for the right data at the right time, from the right place, and in the right way; this demand, which is becoming a necessity in today's scenario, is satisfied by data fabric technology.

Conceptually, a data fabric is a software layer that introduces a common data management platform with a unified approach. It governs, integrates, and analyzes data and derives insight across a distributed landscape in a timely manner, providing automatic data discovery, data profiling, a data/information catalog, auto-tagging, data virtualization, automation, APIs, role-based user interfaces, and frictionless access to and sharing of data with agility. It supports establishing connectivity among the various data sources and targets across a distributed scenario and serves both business and IT users (IT data engineers, business analysts, data scientists, and IT developers) [16]. Success lies in a business strategy built on trusted data and analytical assets. The data fabric layer between data sources and data consumers is depicted in Figure 1.4.

Figure 1.4: Data fabric.

A successful data fabric architecture is modular and scales well. It efficiently supports distributed multicloud, on-premise, and hybrid deployments at the organizational landscape level, irrespective of time and geographic location [17]. The key capabilities of a data fabric architecture are presented in Figure 1.5, and the reasons why data fabric technology has been adopted by many organizations are presented in Figure 1.6.

Figure 1.5: Data fabric capabilities (knowledge and insights; unified governance and compliance; integration; orchestration).

The following issues prevail in most organizations owing to the diversity of data sources, types of data, distributed locations, data size, inappropriate analysis, and untimely or unsuitable data insight [18]:
– Data silos
– Distinct methods for managing data
– Availability of data only to a particular group
– Data access not available across the company landscape
– Huge volumes of data sitting idle and unused
– Infrastructure underutilization
– Low productivity owing to a lack of data available for useful prediction

This scenario hinders and delays making the right recommendation at the right time to the right customer through the right methodology, which is a vital factor in increasing the return on investment.
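To make the idea of a single access layer over heterogeneous sources concrete, the sketch below registers a relational table and a CSV file in a small catalog and reads both through one interface. It is a toy illustration of the pattern a data fabric automates at enterprise scale; the source names, tags, and file paths are invented, and only the Python standard library is used.

```python
# Minimal sketch of a "fabric-style" access layer: one catalog, one read API,
# heterogeneous sources underneath. Source names, paths, and tags are illustrative.
import csv
import sqlite3

class MiniFabric:
    def __init__(self):
        self.catalog = {}                     # logical name -> {"kind", "location", "tags"}

    def register(self, name, kind, location, tags=()):
        self.catalog[name] = {"kind": kind, "location": location, "tags": list(tags)}

    def read(self, name):
        """Return rows from any registered source as a list of dicts."""
        entry = self.catalog[name]
        if entry["kind"] == "sqlite":
            db, table = entry["location"]
            con = sqlite3.connect(db)
            con.row_factory = sqlite3.Row
            rows = [dict(r) for r in con.execute(f"SELECT * FROM {table}")]
            con.close()
            return rows
        if entry["kind"] == "csv":
            with open(entry["location"], newline="") as f:
                return list(csv.DictReader(f))
        raise ValueError(f"unsupported source kind: {entry['kind']}")

# Usage: consumers ask the fabric by logical name, not by physical location.
fabric = MiniFabric()
fabric.register("orders", "sqlite", ("sales.db", "orders"), tags=["transactional"])
fabric.register("web_logs", "csv", "clickstream.csv", tags=["unstructured", "behaviour"])
# rows = fabric.read("orders")   # works once the underlying sources exist
```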

Figure 1.6: Need of data fabric (secure; future-proof data solution; business: increased number of stakeholders; declining return on investment under the status quo; efficient; unified environment).

These conditions give rise to the need for a data fabric to come to the rescue and meet business requirements with scalability, agility, and speed, using a unified approach to data access across the company landscape irrespective of geographic location [19]. The data fabric capabilities shown in Figure 1.7 facilitate a set of operations that fulfil business needs and improve the return on investment in an efficient manner. Data fabric architecture also complements and adds value in multicloud environments [20]: it can handle dynamic data workloads scattered across geographical boundaries, that is, across the organizational landscape. Nowadays, many organizations have deployed cloud-based enterprise IT solutions [21]; how data fabric architecture adds a value proposition and increases the worth of the organization in this existing scenario is elucidated in Figure 1.8. Table 1.2 provides a brief overview of data fabric solutions from different vendors, along with a few of their customers.

Figure 1.7: Data fabric capability. The figure groups capabilities such as intelligent integration, orchestration, discovery/insights, governance and compliance, automation, and advanced AI, with elements including design, deployment, and utilization; automated flows and pipelines for silos; self-service orchestration of disparate data sources; access to relevant data and to a vast pool of data assets; local management of metadata; continuous analytics; AI for data tracing and route querying; centralized governance and security processes; end-to-end data management visibility; self-service ingestion of new data assets; a comprehensive view across all data environments; future-proof infrastructure; and a unified data life cycle.
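As a small illustration of the auto-tagging and governance capabilities listed in Figure 1.7, the sketch below applies simple rules to flag columns that look like personal data so that centralized access policies can be attached to them. The column names and rules are made up for the example and are not taken from any specific product.

```python
# Minimal sketch: rule-based auto-tagging of columns for governance.
# Column names and rules are illustrative, not from any specific data fabric product.
import re

RULES = {
    "email": re.compile(r"email|e_mail", re.I),
    "phone": re.compile(r"phone|mobile|contact_no", re.I),
    "personal_name": re.compile(r"first_name|last_name|full_name", re.I),
}

def auto_tag(columns):
    """Return {column: [tags]} for columns matching any governance rule."""
    tags = {}
    for col in columns:
        hits = [tag for tag, pattern in RULES.items() if pattern.search(col)]
        if hits:
            tags[col] = hits
    return tags

print(auto_tag(["customer_id", "full_name", "email_address", "load_kwh"]))
# -> {'full_name': ['personal_name'], 'email_address': ['email']}
```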

Figure 1.8: Data fabric in a multicloud environment. True hybrid cloud: technical challenges are overcome, and customers are free to operate mission-critical, data-driven IT services based on requirements that evolve with time. Seamless cloud transition: designed to mitigate disruptions from switching between cloud vendors and compute resources, with a drastic reduction in the time needed to obtain deeper data insights. High performance and optimized data investments: organizations invest significant resources and effort to deliver the best performance for their apps and services; investment in a data fabric can realize this capability and optimize their data investments. Future proofing and the flexibility to evolve: flexibility to adapt infrastructure to changing technology needs; with a unified data management framework across all infrastructure deployments, organizations can future-proof their data investments.


Table 1.2: Data fabric solutions.

IBM Cloud Pak []
Features: Automation of data collection, organization, and analysis; infusion of AI in business; AI at scale in hybrid cloud environments; multicloud data integration; ° view of customer data; data governance and privacy; MLOps and trustworthy AI.
Goal/outcome/achievement: Curated data available to consumers with the optimum balance of cost, performance, and compliance, and with the intelligence to orchestrate and optimize data processing according to workloads, data locality, and data policies; boosts productivity and diminishes complexity.
Customer: A North American energy company, during digital transformation with an IBM data fabric implementation, applied it to data projects across the industry landscape such as eMobility; gas operations document discovery, including handwriting extraction; electric customer segmentation and load forecasting; asset management; COVID-19 load impacts; and return-to-work risk models.

NetApp []
Features: Discover: let nothing hide. Integrate: easy access for the win. Automate: take a hard pass on boring. Optimize: make it all hum. Protect: never lose access. Secure: stand up for data integrity.
Goal/outcome/achievement: Customers build stronger, smarter, and more efficient infrastructures; deliver the right data and applications to the right place, at the right time, and with the right capabilities; designed for simplicity and agility.
Customer: Hannover Medical School (MHH): a data fabric with NetApp has allowed them to find the right answer to any data question in medical care, research, or education.

TIBCO []
Features: Optimized data management and integration; shared data assets; data virtualization; data integration; master data management; metadata management; data streaming; data quality.
Goal/outcome/achievement: Simplify, automate, and accelerate data pipelines; support all users across the organizational landscape; easy to deploy; adaptable to complex, ever-changing technology.
Customer: Bayer cultivates analytics for crops, seeds, and digital farming with complete understanding; Panera Bread cooks with data to deliver on service and satisfied customers; MultiView sees real-time inventory from its CRM; Grupo Xcaret delivers eco-friendly fun with TIBCO.

HPE []
Features: Focuses on automating data ingestion, data curation, and the integration of diverse data sources, simplifying data analytics and insights for business success; reduced complexity through simplified deployment; ° view of the customer; internet of things (IoT) analytics; real-time and advanced analytics. Key factors considered for the solution: resilience, scale, open tools, edge-core-cloud.
Goal/outcome/achievement: Integrates existing files, objects, streams, and databases into a single technology base, security, and management system, reducing data silos; solves critical business needs: AI and analytics, IoT and edge analytics, journey to cloud containers.
Customer: New Work SE accelerates data insights for a better work life, with business growth from data-driven offerings to job seekers and employers.

Denodo []
Features: Focuses on data virtualization; integrates all enterprise data siloed across disparate systems, manages the unified data for centralized security and governance, and delivers it to business users in real time; optimization for analytics use cases.
Goal/outcome/achievement: The catalog serves as a single entry point for enforcing security and governance; broad go-to-market partnerships.
Customer: Autodesk was enabled to interact with disparate sources; the platform facilitated regulatory compliance as well as collaboration between business and IT, and enabled near-real-time payment processing, which was simply not possible before ("The Denodo platform improved our agility, performance, and profitability"). Biogen doubled the number of BI projects completed on time, reduced costs, and enriched BI with new data across internal and external systems.

Talend []
Features: Focus and strength in data integration across multicloud and hybrid ecosystems; unified environment; generates optimized code natively and works in both on-premises and cloud environments; pervasive data quality and governance.
Goal/outcome/achievement: Delivers healthy, clean, complete, and uncompromised data; easy to work with code and when building pipelines; self-service data management.
Customer: Oxfordshire County Council was able to meet its goal of helping and protecting people with Talend, painting a single view of a child or family at risk; Office Depot Viking, with Talend, matured more ways to communicate with customers across channels and drive loyalty.

K2View []
Features: Enables data teams to productize enterprise data, making it instantly accessible across the enterprise; delivers a ° customer view; complies with data privacy laws; pipelines enterprise data into data lakes and warehouses; provisions test data on demand; modernizes legacy systems; secures credit card transactions; predicts churn, detects customer fraud, performs credit scoring, and more.
Goal/outcome/achievement: Scalable, highly available data without delay; support for massive data workloads requiring real-time data integration and movement; full support for both analytical and operational workloads; quick deployment (typically in a few weeks) and easy adaptation, supporting agile development and CI/CD; low total cost of ownership (TCO).
Customer: Vodafone implemented the K2View fabric for its customer data hub and test data management, to deliver strategic business outcomes across its operations; AT&T, with the K2View data fabric, reduced the time to create test data from weeks to minutes, accelerated development speed-to-market by %, and reduced manual testing resources by %.

Cinchy []
Features: Data integration using ETL, APIs, and microservices; lets organizations get control of their data, cut IT delivery costs in half, and unlock data network effects to accelerate their digital transformation; single UI for users to access and manage data.
Goal/outcome/achievement: Reduces the time spent transporting and copying data between systems, permitting the removal of data silos and enhancing data value.
Customer: Concentra Bank delivered a Covid relief program in  days and made loan processing seamless; Cinchy realizes the need for the agility, insight, and speed that data centricity enables.

Stardog []
Features: Yes to all data requests; create a flexible, reusable data layer for answering complex queries across data silos.
Goal/outcome/achievement: Modernize without disrupting legacy systems; maintain multiple data definitions; deploy new use cases faster.
Customer: Boehringer Ingelheim drives faster research through Stardog; they identified and anticipated the need to connect data from disparate parts of the company to augment research and operational efficiency, increase output, and eventually expedite drug research.

1.3 Use Cases

In a sentence, data fabric technology is an adhesive that binds organizational data in a highly cohesive and uniform manner. Organizations can serve any data request and offer self-service access to data irrespective of where the data resides or whether it sits in silos. Table 1.3 reports important use cases of data fabric technology [31].

Table 1.3: Use cases in different segments.

Enterprise innovation: Can deliver a road map and open new windows to revolution by fast-tracking the data analytics life cycle.
Preventive maintenance: Preventive maintenance analysis, helping to reduce downtime.
Slaying silos: Silos hamper productivity; a data fabric truly ends data silos.
Deeper customer insight: Framing policies and strategies to improve a customer's overall experience.
Enhanced regulatory compliance: Mitigation depends on an enterprise risk management strategy that comprises a data governance policy.
Improving data accessibility across healthcare organizations and academic institutions: Data fabrics offer the secure and flexible environment that organizations demand without requiring a massive rework of existing IT infrastructure.


1.4 Conclusion

This chapter has documented the comprehensive evolution of technology adoption from cloud computing to data fabric. In the digital era, with the advancement of technology, the generation and availability of data have become prime concerns: the right data must reach the right people at the right time in a secure manner. Cloud computing was widely adopted to deliver data without on-premise infrastructural support, and multicloud architecture subsequently came into existence to address its shortcomings. However, owing to the extensive growth of data from a variety of sources, issues of data silos, integration, and compliance emerged. The data fabric framework addresses the concerns raised with respect to cloud and multicloud environments, and with continuous evolution and advancement it has become a promising solution adopted by companies worldwide. Data fabric solutions, their features, and the outcomes obtained by various companies, such as IBM Cloud Pak, TIBCO, and HPE, have been presented in this chapter. Important use cases in different segments, including academia, healthcare, enterprise, and regulatory compliance, have also been listed along with their gains. This gives the academic and business communities direction to propose more and better data fabric solutions at affordable cost, which will in turn boost economic growth and customer satisfaction.

References

[1] Chawla, S., Mehrotra, M., & Rani, S. Understanding the dynamics of cyberbully content diffusion using textual emotion mining. In 2022 IEEE 7th International Conference for Convergence in Technology (I2CT) 2022 Apr 7 (pp. 1–6). IEEE.
[2] Elgendy, N., & Elragal, A. Big data analytics: a literature review paper. In Industrial Conference on Data Mining 2014 Jul 16 (pp. 214–227). Springer, Cham.
[3] Mell, P., & Grance, T. The NIST definition of cloud computing.
[4] Salesforce. About Benefits of Cloud Computing. [Internet] [cited 2022 Aug 10]. Available from: https://www.salesforce.com/in/products/platform/best-practices/benefits-of-cloud-computing/.
[5] JavaTpoint. Tutorial on trending technology. [Internet] [cited 2022 Jun 18]. Available from: https://www.javatpoint.com/advantages-and-disadvantages-of-cloud-computing.
[6] Idexcel Technologies. (2017 October 17) [Internet]. [cited 2022 Jun 16]. Available from: https://www.idexcel.com/blog/top-10-advantages-of-cloud-computing/.
[7] Suresh, S. Why you must adopt Data Fabric Architecture in a Hybrid Multi-Cloud world. [Internet] [cited 2022 Jun 9]. Available from: https://blog.aspiresys.com/digital/why-you-must-adopt-datafabric-architecture-in-a-hybrid-multi-cloud-world/.
[8] Moyse, I. (2019 March 19) [Internet]. The top five reasons for a multi-cloud infrastructure. [cited 2022 Jun 18]. Available from: https://www.cloudcomputing-news.net/news/2018/mar/19/top-fivereasons-multi-cloud-infrastructure/.
[9] Peterson, R. Advantages and Disadvantages of Cloud Computing. [Internet] [cited 2022 Aug 06]. Available from: https://www.guru99.com/advantages-disadvantages-cloud-computing.html.
[10] Patel, R. (2020 April 24) [Internet]. Cloud Computing. [cited 2022 Jul 17]. Available from: https://www.mindinventory.com/blog/cloud-computing-challenges/.
[11] Brodkin, J. (2008 July 2) [Internet]. Gartner: Seven cloud-computing security risks. [cited 2022 Jun 19]. Available from: https://www.infoworld.com/article/2652198/gartner–seven-cloud-computing-security-risks.html.
[12] Baciu, I. E. Advantages and disadvantages of cloud computing services, from the employee's point of view. National Strategies Observer No. 2015;2.
[13] Cloud Security Report. (2018 March 26) [Internet]. [cited 2022 Aug 15]. Available from: https://crowdresearchpartners.com/portfolio/cloud-security-report/.
[14] Sarangam, A. (2022 June 16) [Internet]. Top 14 Challenges of cloud computing. [cited 2022 Jun 16]. Available from: https://www.jigsawacademy.com/blogs/cloud-computing/challenges-of-cloudcomputing/.
[15] Woodie, A. (2021 October 25) [Internet]. Data Mesh Vs. Data Fabric: Understanding the Differences. [cited 2022 May 15]. Available from: https://www.datanami.com/2021/10/25/data-mesh-vs-datafabric-understanding-the-differences/.
[16] Raza, M. (2021 November 9) [Internet]. Data Fabric Explained: Concepts, Capabilities & Value Props. [cited 2022 May 17]. Available from: https://www.bmc.com/blogs/data-fabric/.
[17] HPE. About Ezmeral Data Fabric documentation. [Internet] [cited 2022 Jun 10]. Available from: https://docs.datafabric.hpe.com/62/OtherDocs.html.
[18] Ghiran, A. M., & Buchmann, R. A. The model-driven enterprise data fabric: a proposal based on conceptual modelling and knowledge graphs. In International Conference on Knowledge Science, Engineering and Management 2019 Aug 28 (pp. 572–583). Springer, Cham.
[19] Gupta, A. (2021 May 11) [Internet]. D&A leaders should understand the key pillars of data fabric architecture to realize a machine-enabled data integration. [cited 2022 Jun 17]. Available from: https://www.gartner.com/smarterwithgartner/data-fabric-architecture-is-key-to-modernizing-datamanagement-and-integration.
[20] Liu, K., Yang, M., Li, X., Zhang, K., Xia, X., & Yan, H. M-data-fabric: A data fabric system based on metadata. In 2022 IEEE 5th International Conference on Big Data and Artificial Intelligence (BDAI) 2022 Jul 8 (pp. 57–62). IEEE.
[21] Moon, S. J., Kang, S. B., & Park, B. J. (2021). A study on a distributed data fabric-based platform in a multi-cloud environment. International Journal of Advanced Culture Technology, 9(3), 321–326.
[22] IBM Cloud Pak for Data. [Internet] [cited 2022 Jun 13]. Available from: https://www.ibm.com/products/cloud-pak-for-data.
[23] NetApp. What is a data fabric. [Internet] [cited 2022 Jun 13]. Available from: https://www.netapp.com/data-fabric/.
[24] Agile Data Fabric. [Internet] [cited 2022 Jun 13]. Available from: https://www.tibco.com/solutions/data-fabric.
[25] HPE Ezmeral Data Fabric. [Internet] [cited 2022 Jun 14]. Available from: https://www.hpe.com/in/en/software/ezmeral-data-fabric.html.
[26] Denodo. Logical Data Fabric. Democratize access to data and increase speed-to-insight for everyone across your organization 2020. [Internet] [cited 2022 Jul 18]. Available from: https://www.denodo.com/en/solutions/by-use-case/logical-data-fabric.
[27] Talend. What is Data Fabric. [Internet] [cited 2022 Jul 16]. Available from: https://www.talend.com/resources/what-is-data-fabric/.
[28] Data Product Platform. [Internet] [cited 2022 Jun 14]. Available from: https://www.k2view.com/platform/data-product-platform/.
[29] The Autonomous Data Fabric Platform for The Enterprise 2020 August. [Internet] [cited 2022 Jun 16]. Available from: https://ai4.io/wp-content/uploads/2020/08/Cinchy-OnePager.pdf.
[30] Stardog. Data Fabric. [Internet] [cited 2022 Jul 16]. Available from: https://www.stardog.com/usecases/data-fabric/.
[31] Edwards, J. (2022 January 20). Data Fabrics: Six Top Use Cases. [Internet] [cited 2022 Aug 18]. Available from: https://www.informationweek.com/big-data/data-fabrics-six-top-use-cases.

Vijayapriya R., A. Umamageswari, Rohith Bhat, Ruby Dass, Manikandan N.✶

2 Web-Based Data Manipulation to Improve the Accessibility of Factory Data Using Big Data Analytics: An Industry 4.0 Approach

Abstract: Industry 4.0 is one of the emerging revolutions in current digital manufacturing. Every process associated with it involves multiple levels of transactions, and the amount of data generated and manipulated is enormous. The success of the technologies used in Industry 4.0 depends on the way in which data and results are connected with end users, process and product owners, and the various teams involved in product development. In this chapter, we take a smart grid based industrial structure and use data analytics and visualization methods to capture, store, and display the data effectively in a web-based interface. Implementation results show considerable improvements in terms of adaptability, scalability, and usability.

Keywords: Industry 4.0, smart manufacturing, smart grids, big data analytics

✶ Corresponding author: Manikandan N., School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India, e-mail: [email protected]
Vijayapriya R., School of Electrical Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India, e-mail: [email protected]
A. Umamageswari, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Ramapuram, Chennai, India, e-mail: [email protected]
Rohith Bhat, Institute of Computer Science and Engineering, SIMATS School of Engineering, Chennai, India, e-mail: [email protected]
Ruby Dass, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India, e-mail: [email protected]
https://doi.org/10.1515/9783111000886-002

2.1 Introduction to the Technologies

Web-based data is quite significant in today's world because it gives us real-time data sets to work with. Factory data becomes the primitive source from which the next set of data is generated, whether it concerns customer purchasing patterns, raw material mixture proportions, data calibration, or anything else. This next set of data becomes all the more important for the factory, ranging from procuring the next batch of items to deciding what the finished product should look like.

Industry 4.0 is the buzzword today, and it is hard to deny its significance, which can be seen in recent developments that exceed the expectations of engineers. Above all, Industry 4.0 does not stop at producing mere products; rather, it anticipates customer and consumer preferences and helps industries produce exactly the products that are required. Wherever Industry 4.0 appears, essential technologies such as IoT, smart digital technologies, and machine learning are closely involved.

Big data, by and large, deals with huge amounts of data, helping researchers and experts handle the considerable volumes available at hand. Algorithms such as PageRank, Spark GraphX, decision trees, and predictive analysis help deal with the cumbersome data being generated every day. The generated data is correlated according to the need at hand and categorized according to specific requirements. In reality, randomly generated data requires a specific mechanism to sort out the essential data, and big data therefore appears as the most favourable choice for experts.

Smart factories require the Industry 4.0 standard: they may obtain sensor data from a production unit and use it to predict how annual maintenance should be carried out, and operations such as repair can also be analyzed from the collected data. Through such applications, manufacturers are helped with production efficiency, real-time data analysis, prediction of preventive maintenance, and automation of possible areas in the near future. Data plays a pivotal role here, as it corresponds to the need at hand and contributes to delivering the anticipated solutions; choosing the relevant and key data is the important factor. A web application comes in handy because it penetrates deep into the available data and retrieves the relevant data required for the need at hand. A web application not only takes data from remote sources but also segregates it, understands the query, and ensures that only the relevant data ends up on the end user's screen, making it a favourable option to consider at all times.

The key functionality of big data is to analyze the data-driven system and to predict which data is required for further processing; a web-based application helps to transmit and analyze the collected data in the best possible ways to predict industry patterns and the sequence of flows needed thereafter. For instance, the data might be used to optimize the marketing approach, generate additional revenue for the organization, customize customer preferences, and target the best operational factors. The list of benefits is almost endless, as the data can be used for different analyses depending on the need. A web-based application is therefore the next best thing to consider, as the generated data can be manipulated, utilized, secured, and used as required. Big data efficiently categorizes the data we look for and establishes its potential use in the sector concerned. Industry 4.0 is the need of the hour, and the data we require can be generously obtained and shared with the system in general.
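Since decision trees and predictive analysis are named above among the algorithms applied to factory data, a compact example is sketched below: a decision tree trained on made-up vibration and temperature readings to flag machines likely to need maintenance. It assumes scikit-learn is available; the data, features, and outputs are illustrative only and are not drawn from this chapter's implementation.

```python
# Illustrative only: a tiny decision tree that flags likely maintenance needs
# from made-up sensor readings. Real deployments train on historical factory data.
from sklearn.tree import DecisionTreeClassifier

# Features: [vibration (mm/s), temperature (deg C)]; label 1 = maintenance was needed
X = [[1.2, 55], [1.1, 58], [4.8, 83], [5.3, 90], [1.4, 60], [4.5, 86]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Score new readings arriving from the production unit
print(model.predict([[1.3, 57], [5.0, 88]]))   # -> [0 1]
```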


Figure 2.1: Generic process of data collection and refinement according to the user requirement.

Manipulation is the key word here: a huge amount of data is generated, sensed, and captured, and the challenge lies in deciding which data should be considered and which should be eliminated. A threshold value helps the system capture only the data that falls at or above that threshold. To use this effectively, a solid smart grid becomes the need of the hour, one that can do the expected work. The prime challenges for the smart grid are to identify the data capture source, to decide whether the data is generic enough to be stored, and to predict whether it could be useful for end-user analysis. If the answer to all of these is yes, the smart grid is responsible for capturing the sensed data. The generic process of data collection and refinement according to user requirements is shown in Figure 2.1.

Once the data is captured, data analytics and visualization become the next challenging tasks, since they determine how an end user visualizes the captured data. Presenting the data is therefore even more challenging and trickier, as it decides whether the captured data stays in or not. A web-based interface is utilized to store and analyze the captured data, which helps us understand how the captured data can be used effectively thereafter.

Adaptability, scalability, and usability become the key criteria here, as they determine how successful a system can be. Adaptability ensures that data is collected through rapid experimentation; a highly competitive market requires extensive data to predict customer purchasing preferences and the like. Scalability takes the next upper hand, as it targets the increasing number of users and the increasing amount of data, covering both the data being generated and the data being utilized; intense operations produce huge volumes of data, so scalability is a factor that cannot be refused at any moment. Usability is the other key factor, covering entity identity, the accuracy of the data, the integrity of the data, the consistency of the data, and the timeliness involved. The better the prediction, the better the results, which is the key point here. The factors above indicate how data is transformed according to the need at hand; the implication is simply that the data can be applied to a specific application and the best analysis and usage predicted thereafter.

The web-based interface plays an important part here, as it presents the end results to the end user; the interface connects to various components to obtain their inputs and displays them effectively after the necessary scrutiny. The implementation results indicate that the outcomes are fundamentally based on adaptability, scalability, and usability: they constitute the success measures for any study involved, so considering them becomes all the more vital, and further enhancements can also be considered in the near future. Our smart grid based infrastructure categorizes the data we require, followed by our data visualization and analytics methods, to ensure that the correct data is segregated, recorded, and stored for further study.

Industry 4.0 requires extensive analysis of the data being captured at every moment, and big data analyzes how this can be done as effectively as possible. Our web-based interface captures the analysis resulting from the big data strategies and presents the final data to the user as required. The factory setup becomes an essential study because it requires near-perfect data to understand the market requirement, produce matching products, target the respective customers, improve sales accordingly, and stay in the market as long as possible. Our web interface is rightly targeted at showing accurate results from the captured data: any stakeholder who uses the system, whether product owner or process owner, has the necessary data whenever they require it. Most importantly, they have up-to-date data from the smart grid for their analysis and subsequent experimentation. Key activities such as decision-making can be carried out effectively only if the required amount of data is available, and our smart grid structure ensures that it is delivered as accurately as possible, thus making decision-making as convenient as it can be.

The key elements involved here are big data analysis, Industry 4.0, and a web-based smart grid structure. The primary targets are adaptability, scalability, and usability, as they provide a solid backbone to the system by generating the data we require. The intended users are industry personnel and people involved in digital manufacturing who require extensive data for their system of study and who can use it to predict the next user demand in the future. The benefits do not end there: the collected data also ensures that the respective stakeholders are supplied with sufficient data to work with and helps them prepare new products or change anything that exists at present. The benefits are endless, which makes such a smart grid system absolutely essential for the prediction of future requirements.
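A minimal sketch of the capture-and-serve idea described above follows. It keeps only readings at or above a threshold and exposes the retained data through a small web endpoint; the threshold, sensor names, and readings are invented, Flask is assumed to be installed, and this is a sketch rather than the chapter's actual implementation.

```python
# Minimal sketch: keep only readings at or above a threshold and expose the
# retained data over a web interface. Threshold and readings are illustrative.
from flask import Flask, jsonify

THRESHOLD = 50.0          # capture only values at or above this level
captured = []             # in-memory store standing in for the smart grid's storage

def capture(sensor_id: str, value: float) -> bool:
    """Store the reading only if it meets the threshold; return True if kept."""
    if value >= THRESHOLD:
        captured.append({"sensor": sensor_id, "value": value})
        return True
    return False

app = Flask(__name__)

@app.route("/factory-data")
def factory_data():
    # End users and dashboards read the refined data from this endpoint.
    return jsonify(captured)

if __name__ == "__main__":
    for sid, val in [("grid-01", 42.0), ("grid-02", 63.5), ("grid-03", 71.2)]:
        capture(sid, val)
    app.run(port=5000)    # /factory-data then returns the two retained readings
```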


2.2 Survey of Existence

2.2.1 Industry 4.0 and Big Data
The implementation of Industry 4.0 is made possible by cutting-edge technology. The key technologies utilized to successfully implement Industry 4.0 are big data, robotics, artificial intelligence (AI), cloud computing, the internet of things (IoT), and 3D printing [1]. The purpose of this study is to discuss the enormous potential of big data for Industry 4.0. The main goal of these technologies is to gather the appropriate data in order to address the pertinent problems during production and other necessary services. The fourth industrial revolution's developments are greatly influenced by this technology. In conclusion, the automation industry can benefit from big data applications for in-process monitoring and enhanced productivity [2]. Using data gathered by this technology, networks of workers and sophisticated instruments can be readily optimized. By identifying basic problems such as process deviations, quality discrepancies, and energy-efficiency waste in an engineering process, big data is essential to gaining a competitive edge. This chapter discusses big data's important uses in Industry 4.0. Big data is an excellent source for predictive analytics and operations utilizing market intelligence data, since businesses need a good surveillance system in a highly sophisticated or personalized setting [3]. Big data is anticipated to significantly enhance Industry 4.0 in the coming years and to be important in its widespread adoption.

2.2.2 Web-Based Interaction for Factory Data
The IoT is described as the connection of portable devices embedded in everyday objects to the Internet so that they can interconnect and exchange data. A planned or actual factory is modelled in the digital factory, which is utilized for design, planning, and operations. The engineering-developed digital factory, with its real-time data, inferred statistics, and information, should be connected with the industrial IoT in the smart factory. Thus, integrating digital manufacturing with the industrial IoT is a key capability. This skill set includes the capacity to design digital objects' physical-to-digital interactions. The implementation of functions for data transfer from the IT applications of the digital factory to the IoT platform that executes the smart factory, and for giving feedback to the smart factory via IoT services, are additional requirements [17]. To ensure the consistency of the data that is pushed to or withdrawn from the IoT platform, a good connectivity strategy is required because, in the IT context, data and material assets are frequently heterogeneous [4]. There is a need for a method that allows IoT devices to interact with the digital factory, as well as a common language for the display and representation of data. Numerous ontologies and information standards, such as ISO 10303, have been created for communication within a digital factory. The semantic sensor network [5] addresses the need for a context and an end-to-end prototype for sensor applications by trying to merge sensor-focused, observation-focused, and system-focused views [17]. In Industry 4.0, for example, the metadata framework is used to accomplish interoperability. In addition, businesses create vendor-specific IT architectural solutions.

2.2.3 Role of IoT
The IoT has been adopted by many different business sectors. IoT usage has increased in the retail, manufacturing, government, automotive, and healthcare industries. With the help of IoT, sectors including education and the environment are planning significant changes. At the business level, the IoT functions as a highly sophisticated analyzer: it finds gaps in processes, procedures, and corporate policies and provides analytical capabilities for better decisions. Additionally, it establishes a previously unheard-of link between the production floor and the company. All of this translates to higher productivity while lowering expenses and energy use. Unquestionably, all of this has boosted the bottom line of the business [18]. The IoT makes it possible to automate daily tasks that often consume labor and resources; one example is automatically changing settings based on the current situation or usage. This frees up a great deal of resources for the corporation to concentrate on development and the bigger picture of the business. There are several IoT subtypes available in the market right now, including commercial IoT, industrial IoT, infrastructure IoT, and defense IoT [5].

2.2.4 Role of Smart Grid
The term "smart grid" refers to the application of digital technologies to an electrical network. It offers a number of technological tools that can be employed now or in the near future. Digital control equipment, the electric network, and intelligent monitoring systems are all parts of it. Together they have the potential to move energy from generators to consumers, control energy flow, reduce material loss, and improve the reliability and adaptability of the operation of the electric network [6]. In the near future, a smarter grid will work more efficiently, enabling it to deliver the type of service that people have grown accustomed to more affordably in an era of rising expenses, while also offering major societal advantages, such as a reduced effect on the environment [7]. The web has already had a profound impact on how people work, play, study, and live, and the smart grid aims to have a similar impact. There has been a severe shortfall in energy transmission lines as a result of the rapid increase in demand. In comparison to the millions of miles of high-voltage distribution lines that run throughout the USA, just 968 new miles of federal electricity transmission line have been built since 2002 [8]. Therefore, American electric utilities actually pay a price for outages and poor power quality. The smart grid was created in response to the pressing need for an ideal and effective method of "broadcasting" power from a few dominant power providers to a sizable number of customers. The numerous possible economic and environmental advantages of the smart grid include [9]:
1. Boosted transmission and power quality dependability
2. Enhanced energy conservation and distribution efficiency
3. Lower utility costs for electricity
4. Less money spent by homes and companies on electricity
5. Reduced emissions of greenhouse and other gases

2.2.5 Security for Industry 4.0
Industry 4.0's new technologies have made possible a widespread adoption of cyber-physical systems (CPS). These currently provide assistance for the latest electronic factory services and scenarios, including smart manufacturing, smart logistics, and smart maintenance [10]. One of the difficulties facing the modern industrial environment is security. In fact, numerous vulnerabilities present in applications for smart factories have been researched in the literature [19]. In order to support the newly emerging concept known as the CPS, diverse domains such as coding, analysis, optimization, visualization, and networking are combined with physical systems in the direction of a fourth industrial age [11]. CPSs are automated systems that rely on computing and communications infrastructures to control, coordinate, or work together with physical systems. The IoT, the Internet of Services (IoS), and the Internet of Data can be integrated with industry using the framework provided by Industry 4.0 to make it more intelligent, flexible, and adaptive. The Industrial Internet of Things (IIoT) is the incorporation of the IoT across numerous industries, including production, manufacturing, energy, logistics, and others [12]. The IoS is made up of a service-oriented architecture (SoA) that organizes infrastructure and application software into a group of interconnected services [13]. The new industrial setting permits new system features for industrial integration, which takes the form of horizontal and vertical integration [14]. Numerous intelligent applications are made possible by Industry 4.0, which enhances intelligent services such as manufacturing, design, logistics, and storage. It integrates the consumer into the supply chain and production process [15]. Furthermore, it enables additive manufacturing to create products at a low cost. Additionally, it offers a smart storage system that recognizes outdated commodities automatically. The attack surface is increased by the new SoA and IoT, making the smart factory a target for hackers [16].


2.3 Industrial IoT-Enabled Smart Grid
With the second industrial revolution, that is, Industry 2.0, the electrification of nations started. Subsequently, with the third industrial revolution, that is, Industry 3.0, automatic control of devices in the power plant sector was initiated with the aid of information and operational technology. The present fourth industrial revolution, that is, Industry 4.0, necessitates the application of advanced technologies such as IoT; cloud, edge, and fog computation; big data storage, processing, and analytics; communication protocols; cyber security; and so forth. Industry 4.0 enforces the application of these key technologies to provide remote monitoring and control of the devices in industry, to reduce human intervention, and to operate the devices in such a way that machines themselves can make intelligent decisions based on analytics. The Industry 4.0 smart grid system integrates the electrical network with different users such as consumers, prosumers, and utilities for different applications such as substation automation, transmission and distribution line monitoring, demand-response prediction, generation forecasting, and so forth. With these objectives, the smart grid system addresses several issues such as energy generation cost reduction, energy saving, reduction in carbon emissions, green manufacturing, peak-load energy management, energy breakdown prevention, and an overall improved energy balance. The Industry 4.0 smart grid layers are depicted in Figure 2.2.

2.3.1 Perception Layer
The perception layer includes a wide range of things, from the physical assets to be monitored and controlled to sensors, measurement devices, computing devices (processors and microcontrollers), and actuators. It covers different physical things such as generation (thermal, hydro, and nuclear power plants), transmission, substations, green power generation units (solar, wind, bioenergy), smart mobility (electric vehicles), smart buildings (home and commercial), smart factories, and so forth. The measurement devices include smart meters, smart sensors, intelligent electronic devices, and phasor measurement units, and the actuators include digital relays, circuit breakers, and reclosers.

2.3.2 Edge or Fog Computing Layer
This layer allows smart grid systems to meet the speed, security, and scalability requirements of the entire network. It provides preprocessing of raw data and storage of information as near to the data sources as possible, in order to achieve lower latency, faster speeds, and improved device management capability. For instance, the smart meters present at the perception layer collect energy consumption data every 15 min. Many such smart meters spread across different locations may send terabytes of data in real time. Likewise, the phasor measurement units located at different sites collect huge volumes of voltage and current information. All of this data cannot be sent directly to the cloud, as that increases latency and computational burden. One possible solution to this problem is to perform preprocessing and basic analytics closer to the source node, that is, edge computation.
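A minimal sketch of the edge-side preprocessing described above is shown below: 15-minute smart-meter readings are buffered locally, and only an hourly summary is forwarded to the cloud. The buffer size, field names, and the use of print as a stand-in for the actual upload call are assumptions for illustration.

```python
from statistics import mean

# Readings buffered at the edge/fog node (one meter, one hour at a time)
window = []

def on_reading(kwh: float, forward):
    """Buffer raw 15-minute readings locally; forward only an hourly summary upstream."""
    window.append(kwh)
    if len(window) == 4:  # four 15-minute samples make one hour
        forward({"hourly_kwh": round(sum(window), 3), "avg_kwh": round(mean(window), 3)})
        window.clear()

# Hypothetical one-hour burst of 15-minute readings; `print` stands in for a cloud upload
for sample in [0.35, 0.41, 0.38, 0.44]:
    on_reading(sample, forward=print)
```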

Figure 2.2: Industry 4.0 smart grid layers.


2.3.3 Connectivity Layer
Data collected and stored at the perception and edge devices are transferred to the cloud via communication protocols. Perception layer devices use link layer protocols such as Wi-Fi, Zigbee, LoRa, RFID, and Z-Wave to transfer the data to the gateways. The transport layer uses the transmission control protocol and the user datagram protocol. At the application level, protocols such as HTTP, CoAP, and MQTT are employed to provide mobile and web application interfaces.
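As an illustration of the application-level protocols mentioned above, the sketch below publishes one smart-meter reading over MQTT using the paho-mqtt client library. The broker hostname, topic, and payload fields are placeholder assumptions and would be replaced by the actual gateway configuration.

```python
import json
import paho.mqtt.publish as publish

# Hypothetical reading produced at the perception/edge layer
reading = {"meter_id": "mtr-17", "kwh": 1.42, "ts": "2023-01-01T10:15:00Z"}

# Publish a single message to an MQTT broker (hostname and topic are assumptions)
publish.single(
    topic="site-a/meters/mtr-17/energy",
    payload=json.dumps(reading),
    hostname="broker.local",
    port=1883,
    qos=1,
)
```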

2.3.4 Processing Layer
As smart grid technologies involve a huge number of measurement, computation, and communication devices, large volumes of heterogeneous big data are collected from these ubiquitous devices. This leads to the challenge of transforming the heterogeneous big data into actionable outcomes, owing to computational complexity, data security, and the operational integration of large datasets into power system planning and operational frameworks. In this regard, the application of big data analytics such as descriptive, diagnostic, predictive, and preventive analytics gives better situational awareness and paves the way for intelligent decisions. Generally, the big data holds a significant percentage of useful information, and the implementation of advanced computational algorithms such as machine learning and AI on these data delivers insight, finds hidden patterns, and so forth. Machine learning algorithms can be used to forecast renewable energy generation, load demand, peak hours, and so forth.
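A minimal sketch of the descriptive analytics mentioned above is shown below, using pandas to roll 15-minute meter readings up to daily totals and to estimate the typical peak hour. The file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical 15-minute smart-meter readings with columns: timestamp, kwh
df = pd.read_csv("meter_readings.csv", parse_dates=["timestamp"]).set_index("timestamp")

# Descriptive analytics: daily consumption totals, averages, and maxima
daily = df["kwh"].resample("D").agg(["sum", "mean", "max"])

# Simple peak-hour estimate: which hour of the day carries the highest average load
hourly = df["kwh"].resample("H").sum()
peak_hour = hourly.groupby(hourly.index.hour).mean().idxmax()

print(daily.head())
print(f"Typical peak hour: {peak_hour}:00")
```

Predictive or preventive analytics would build on the same prepared data, for example by feeding the hourly series into a forecasting model.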

2.3.5 Application Layer
This layer analyzes the information obtained from the cloud layer. Using application programming interfaces, the final data inferences are visualized in a web or mobile app dashboard. The layer also makes decisions to intelligently manage and control the devices present at the perception layer, based on the insights derived from the data analytics. In the context of the smart grid, a UI/UX can be developed to visualize the power generated by various resources, energy demand, the number of connected devices in the network, energy savings per day, and so forth.

2.4 Results and Validation
In this work, three sectors of the smart grid are considered: green power generation (solar and wind energy resources), smart building (smart home), and smart mobility (electric vehicle). The data collected from these sectors are preprocessed and transferred to the cloud. Based on basic and advanced analytics, web apps are designed for remote monitoring and control of the physical devices present at different locations. Dashboards are created using IoT platforms such as Node-RED and ThingSpeak.

Figure 2.3: Real-time visualization of solar panel voltage, current, and power.

2.4.1 Green Power Generation

2.4.1.1 Solar Energy
The output voltage, current, and power of the solar panel for different irradiation and temperature levels are illustrated in Figure 2.3. With real-time visualization, the solar plant can be monitored from any location.

2.4.1.2 Wind Energy
The wind data collected from the perception layer is processed, and machine learning analytics is performed to predict the wind speed as depicted in Figure 2.4. The red and blue lines represent the original and the predicted values, respectively.

Figure 2.4: Wind speed prediction using machine learning.
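The kind of prediction shown in Figure 2.4 can be approximated, in spirit, by a very small autoregressive model: the sketch below fits a linear regression on lagged wind-speed values and predicts the next sample. The sample values, the number of lags, and the choice of linear regression are illustrative assumptions; the chapter does not specify the exact algorithm used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical hourly wind-speed series (m/s); in practice this comes from the perception layer
speeds = np.array([5.1, 5.4, 6.0, 6.3, 5.9, 5.5, 5.8, 6.4, 6.9, 7.1, 6.8, 6.5])

LAGS = 3  # predict the next value from the previous three readings
X = np.array([speeds[i:i + LAGS] for i in range(len(speeds) - LAGS)])
y = speeds[LAGS:]

model = LinearRegression().fit(X, y)
next_speed = model.predict(speeds[-LAGS:].reshape(1, -1))[0]
print(f"Predicted next wind speed: {next_speed:.2f} m/s")
```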

2.4.2 Smart Building
In this use case, a smart home is considered by taking a single room of the house. The data visualization of the room is depicted in Figure 2.5. The dashboard visualizes the room's light intensity level, temperature, and humidity. Based on the intensity level, the smart device actuates the light present in that room.


Figure 2.5: Data visualization of the room.

2.4.3 Smart Mobility
The electric vehicle speed is visualized in the ThingSpeak platform for remote monitoring of the vehicle (Figure 2.6).
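ThingSpeak exposes a simple REST endpoint for writing channel data, which is typically how a reading such as vehicle speed reaches a dashboard like the one in Figure 2.6. The sketch below pushes one speed sample with the requests library; the write API key is a placeholder, and field1 is assumed to be the channel field configured for speed.

```python
import requests

THINGSPEAK_WRITE_KEY = "XXXXXXXXXXXXXXXX"  # placeholder channel write API key

def push_vehicle_speed(speed_kmh: float) -> int:
    """Write one speed sample to a ThingSpeak channel field and return the entry id."""
    resp = requests.get(
        "https://api.thingspeak.com/update",
        params={"api_key": THINGSPEAK_WRITE_KEY, "field1": speed_kmh},
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.text)  # ThingSpeak returns the new entry id, or 0 on failure

push_vehicle_speed(42.5)
```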

2.5 Conclusion
A smart grid-based industry structure is presented in this work, employing data analytics and visualization methods to effectively capture, store, and display data in a web-based interface. Three sectors are considered in this chapter: green power generation (solar and wind energy resources), smart building (smart home), and smart mobility (electric vehicle). The data collected from these sectors are preprocessed and transferred to the cloud. Based on basic and advanced analytics, web apps are designed for remote monitoring and control of the physical devices present at different locations. Implementation results show considerable improvements in terms of adaptability, scalability, and usability. The experimental results also show a reduced error rate.


Figure 2.6: Data visualization of electric vehicle speed.


References
[1] Aazam, M., Zeadally, S., & Harris, K. A. (2018). Deploying fog computing in industrial internet of things and industry 4.0. IEEE Transactions on Industrial Informatics, 14(10), 4674–4682.
[2] Abbasian, N. S., Salajegheh, A., Gaspar, H., & Brett, P. O. (2018). Improving early OSV design robustness by applying multivariate big data analytics on a ship's life cycle. Journal of Industrial Information Integration, 10, 29–38.
[3] Abdirad, M., Krishnan, K., & Gupta, D. (2021). A two-stage meta-heuristic algorithm for the dynamic vehicle routing problem in industry 4.0 approach. Journal of Management Analytics, 8(1), 69–83.
[4] Aceto, G., Persico, V., & Pescapé, A. (2020). Industry 4.0 and health: Internet of things, big data, and cloud computing for healthcare 4.0. Journal of Industrial Information Integration, 18, 100129.
[5] Aheleroff, S., Xu, X., Lu, Y., Aristizabal, M., Velásquez, J. P., Joa, B., & Valencia, Y. (2020). IoT-enabled smart appliances under industry 4.0: A case study. Advanced Engineering Informatics, 43, 10104.
[6] Manoj, P., Kumar, Y. B., Gowtham, M., Vishwas, D. B., & Ajay, A. V. (2021). Internet of Things for Smart Grid Applications (pp. 159–190). Academic Press, US.
[7] Tsiatsis, V., Karnouskos, S., Höller, J., Boyle, D., & Mulligan, C. (2019). Smart Grid. In Internet of Things (pp. 257–268). 2nd Edition, Academic Press, US.
[8] Abu-Rub, O. H., Fard, A. Y., Umar, M. F., Hosseinzadehtaher, M., & Shadmands, M. B. (2021). Towards intelligent power electronics-dominated grid via machine learning techniques. IEEE Power Electronics Magazine, 8(1), 28–38.
[9] Atif, S., Ahmed, S., Wasim, M., Zeb, B., Pervez, Z., & Quinn, L. (2021). Towards a conceptual development of industry 4.0, servitisation, and circular economy: A systematic literature review. Sustainability, 13(11), 6501.
[10] Baheti, R., & Gill, H. (2011). Cyber-physical systems. The Impact of Control Technology, 12(1), 161–166.
[11] Sadiku, M. N. O., Musa, S. M., & Nelatury, S. R. (2016). Internet of things: An introduction. International Journal of Engineering Research and Advanced Technology, 2(3), 39–43.
[12] AL-Salman, H. I., & Salih, M. H. (2019). A review cyber of industry 4.0 (cyber-physical systems (cps) the (iot) and the internet of services (ios)): Components and security challenges. Journal of Physics: Conference Series, 1424, 1–6.
[13] Zeid, S. S., Moghaddam, M., Kamarthi, S., & Marion, T. (2019). Interoperability in smart manufacturing: Research challenges. Machines, 7(2), 1–17.
[14] Qarabsh, N. A., Sabry, S. S., & Qarabash, H. A. (2020). Smart grid in the context of industry 4.0: An overview of communications technologies and challenges. Indonesian Journal of Electrical Engineering and Computer Science, 18(2), 656–665.
[15] Zhong, R. Y., Xu, X., Klotz, E., & Newman, S. T. (2017). Intelligent manufacturing in the context of industry 4.0: A review. Engineering, 3(5), 616–630.
[16] Sadhu, P. K., Yanambaka, V. P., & Abdelgawad, A. (2022). Internet of things: Security and solutions survey. Sensors, 22(7433), 1–51.
[17] Akturk, H. K., Giordano, D., Champakanath, A., Brackett, S., Garg, S., & Snell-Bergeon, J. (2020). Long-term real-life glycaemic outcomes with a hybrid closed-loop system compared with sensor-augmented pump therapy in patients with type 1 diabetes. Diabetes Obesity and Metabolism, 22(4), 583–589.
[18] Tandon, R., & Tandon, S. (2020). Education 4.0: A new paradigm in transforming the future of education in India. International Journal of Innovative Science, Engineering & Technology, 7(2), 32–54.
[19] Abraham, P., & Lakshminarayanan, S. (2022). In Big Data Applications in Industry 4.0 (pp. 1–38). Auerbach Publications, Data Science and Its Applications, New York.

Shubham Verma✶, Amit Gupta, Abhishek Prabhakar

3 The Overview of Data Virtualizations and Its Modern Tools in the Domain of Data Fabrics

Abstract: There is a considerable research gap between traditional data fabrics and their virtualization tools on one side and today's advanced tools on the other. Motivated by this fact, this chapter throws light on data fabrics and the many data virtualization tools available, and presents research carried out to determine the best tools for small businesses. These tools should be versatile and easy to use and should allow you to visualize data in a variety of ways to suit your business needs. The chapter clearly distinguishes the pros and cons of each tool considered in the study.

Keywords: virtualization, Qlik Sense, Domo, Klipfolio, data fabrics

✶Corresponding author: Shubham Verma, Department of Computer Science and Engineering, Dr. A.P.J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India, e-mail: [email protected]
Amit Gupta, Department of Computer Science and Engineering, Harcourt Butler Technical University, Kanpur 208002, India
Abhishek Prabhakar, Department of Computer Science and Engineering, Dr. A.I.T.H., Kanpur 208024, India

https://doi.org/10.1515/9783111000886-003

3.1 Data Virtualization: An Overview
Digital visualization can be considered the act of presenting information in the form of graphs, maps, and charts. It is basically meant to provide information to humans in the easiest manner possible. Therefore, the information is presented using different shapes, designs, models, and so on. There are different ways to represent a visualization, including a polar graph, line graph, pie chart, bar graph, radar chart, polygon, and so forth. The basic motive behind visualization is to represent information in the simplest way possible; the hard part is working out how to make the data simpler and smoother to analyze [1, 2]. Therefore, a number of tools and features are available that help in exporting this hard-earned data into simple forms. These tools and features involve certain plugins that can be used for keeping the complexity of the data low. Like most types of software, the best data visualization software offers many tiers of security. While selecting this type of software, you want to look for security capabilities such as multifactor authentication (MFA) or two-factor authentication (2FA), documentation that indicates regular patches and security updates, the monitoring of user activities, intrusion detection, privacy protection, and data encryption.

3.1.1 Data Virtualization as Per Gartner and DAMA
According to Gartner, data virtualization must be done in such a manner that it hides, or masks, the physical location and technical details of the data from the consumer. It is a form of data integration that creates a visualization and abstraction layer over the underlying information, which helps in cutting across data silos. It also supports a wide range of applications that need insight into the information environment. DAMA, the Data Management Association International, also characterizes data virtualization. The basic motive behind data virtualization as described by DAMA is to allow databases as well as several heterogeneous data sources to be accessed, according to the requirement, as if they were a single database. In contrast to ETL, where a transformation engine extracts and remodels the data, data virtualization makes the data usable without physically extracting, transforming, or copying it [3, 4]. The basic goal of data virtualization is to present all the information in the best manner possible without moving or copying any of it. Figure 3.1 shows how each data source is analyzed, processed, and accessed across the consumption platforms.

Figure 3.1: The data structure and several aspects of the data framework (Eckerson Group).
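The chapter's focus is on commercial virtualization platforms, but the core idea, querying heterogeneous sources in place rather than copying them into one store, can be illustrated with a small open-source sketch. The example below uses DuckDB over two hypothetical files (orders.parquet and customers.csv); the file names and columns are illustrative assumptions, not part of any tool discussed here.

```python
import duckdb

# Query two heterogeneous sources in place, without copying them into a warehouse.
result = duckdb.sql("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM 'orders.parquet' AS o
    JOIN 'customers.csv'  AS c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
""").df()

print(result)
```

Because the query runs directly against the source files, nothing is replicated into a separate store, which mirrors the "no physical movement" property described above.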


3.2 Use Cases: Data Virtualization
Most businesses have data silos spread across disparate storage systems and the cloud, along with warehouses, data lakes, and data stores [5]. Some of the most widely used cases of data virtualization are as follows:

3.2.1 Virtual Data Warehouse
A virtual data warehouse is known for its quick and simple setup. As far as the data itself is concerned, there is no physical movement.

3.2.2 Virtual Data Lake
A virtual data lake is quite similar to a virtual data warehouse; the main difference is that it is much faster. Moreover, it is more seamless, accurate, and integrated. Data virtualization has two further vital use cases, which facilitate governance projects and permit data democratization and data discovery. These consist of:

3.2.3 Data Catalog
A data catalog is known for organizing metadata and solving problems regarding it [6]. It also has full control over access to the data and over granting that access. The datasets are kept up to date along with the context that is provided. Moreover, the datasets can be reached through a single link, which suits a multi-cloud environment.

3.2.4 Self-Service Analytics
Self-service analytics refers to analyses that business users perform themselves on the data provided. This speeds up the extraction of data and reduces the strain imposed on technologists. Therefore, in order to support self-service, analytics-powered applications have been developed and have grown.


3.3 Data Fabric as per Gartner and Eckerson
Eckerson defines a data fabric as a combination of technology, architecture, and services designed to ease the difficulty of coping with many kinds of data, using multiple database management systems, deployed across numerous platforms. A data fabric is a unified, layered entity that contains a huge amount of information. Therefore, managing a data fabric is quite important as well as essential. Moreover, the sharing of data through the fabric needs to be frictionless [7]. Gartner offers a useful analogy for examining what a data fabric does: consider a self-driving vehicle. As long as the vehicle works conveniently and automatically does everything that is required, there is no intervention. However, if the driving data is not up to the mark, or some criteria in the data are loose, a semiautonomous mode comes up; problems may then occur in the direction analysis or in several other criteria. On that basis, a data fabric initially simply observes the data pipelines; over time it automates repetitive tasks and makes suggestions that may help in improving the outcome [8]. This not only protects time efficiency but also improves the efficiency of operations and the innovation of strategy.

3.3.1 How Does Virtualization Complement a Data Fabric Architecture?
Consistent with Gartner, the data fabric ought to have a sturdy data integration backbone. A data fabric:
– is compatible with many formats and data sources;
– catalogs and consists of all varieties of metadata;
– actively manages the metadata;
– supports several data pipeline workflows;
– automates data orchestration;
– empowers all sorts of data consumers (technologists and business users).
Enabling such abilities calls for powerful technologies for performing analytics and a solid data integration layer to access all data assets – this is where virtualization plays a vital role.


3.3.2 Use Cases: Data Fabric
As data becomes more decentralized, the data fabric is becoming increasingly relevant, and it becomes quite important to focus on its uses. The information surrounding a data fabric is interoperable, and it plays an active role in reducing the overall effort of locating, reading, and understanding data [9]. The basic motive of the data fabric is to put information and technology to work at pace; its applications encompass devices, predictive and prescriptive analytics, democratization, and the data dictionary. Let us look at some use cases further.

3.3.2.1 Data Discovery
Through the virtualized data integration layer within the data fabric, the right person is granted full access to the data they require.

3.3.2.2 Machine Learning
The data fabric requires methods that record the integration and make use of most of the data for analysis. Machine learning is quite effective here, as it is granted quick access to the right records for the proper duration of time.

3.3.2.3 Data Democratization
Data democratization means granting access to information across the organization; speed makes it both important and viable to automate this. Analysis therefore depends on data democratization, and various commercial enterprises use the insights it provides.

3.3.3 How Do Companies Implement a Data Fabric Design?
A data fabric is considered to be a technology-centric framework. That is why the design should be supported by next-generation technology, including knowledge graphs, ML-powered modern data virtualization, data catalogs, and active metadata management [10].


There are several initial steps that need to be taken when designing such a framework. Some of the steps that need to be implemented while constructing the platform are:
– Consolidate all of the metadata in a single repository.
– Support active metadata management.
– Permit governance and granular access control.
– Promote sharing of data and open collaboration.
This is where a modern data catalog and governance platform can assist. A complete guide has also been prepared to help evaluate the current data catalogs available in the marketplace.

Figure 3.2: Flow diagram of technology-centric framework.

3.3.4 Data Fabric Versus Data Virtualization: Summary of Comparison
It is easy to get confused between data virtualization and data fabric, in particular with so many buzzwords shooting up in the data management and governance ecosystem. That is why Table 3.1 was put together to explain the differences between them. Organizations that require the management of decentralized data adopt a data fabric, while various companies use data virtualization to enhance the management and business insight of their data; both help in serving enterprise users. Several comparisons have therefore been made between data fabric and data virtualization [11].


Table 3.1: Differences between data fabric and data virtualization.

Data fabric:
– Data fabric is an end-to-end architecture for modern data management.
– It simplifies metadata management as well as data discovery under governance.
– It is used in organizations or businesses where the management of decentralized data is required.

Data virtualization:
– It helps in creating an abstraction layer over the data, which enables the integration of information without physical movement.
– It fits well where quick access to the data is required.
– Data virtualization is the basic backbone element of a data fabric architecture.

3.4 The Best Data Visualization Software of 2022
To identify the best data visualization tools, the performance of the products available in the market was compared [12]. The capabilities considered were pricing, ease of use, customization, and customer service, along with the ability to share and access the data outside of the tool used. Extra emphasis was also placed on customer reviews.

3.4.1 Microsoft Power BI
Microsoft Power BI is a business intelligence platform that helps customers meet collaborative goals. It not only supports trend evaluation but also provides real-time solutions to various problems. Moreover, Microsoft Power BI integrates with several Microsoft products, which offer additional features and facilities. Several security services are also provided with Microsoft Power BI. The platform is known as one of the most immersive building, analytics, and reporting platforms available in the market, as well as one of the most secure.


3.4.2 Data into Insight to Action with Power BI Desktop

3.4.2.1 Connect to Your Data, Wherever It Is
Get access to data from a wide range of on-premises and cloud sources, including Salesforce, SharePoint, Excel, Dynamics 365, and so on. All of these can be refreshed on an automated, incremental basis. Power BI also plays an important part in maintaining and connecting this data across a huge variety of cases.

3.4.2.2 Prep and Model Your Data with Ease
Preparing the data can consume most of the time, so it is quite important to model the data accordingly and save hours. Power Query, which is already familiar to many Excel users, helps in cleaning and preparing the data. Data modeling in Power BI supports combining, enhancing, transforming, and ingesting the information.

3.4.2.3 Provide Advanced Analytics with the Familiarity of Office
Analysts can dig deeper into the data to find patterns that would otherwise be overlooked and turn them into action. Functions such as grouping, clustering, forecasting, and quick measures are available. For the most advanced analysis, DAX can be used to investigate problems and perform root-cause analysis.

3.4.2.4 Deepen Your Data Insights with AI-Driven Augmented Analytics
It is used for exploring the data and running routines to find and recognize patterns. Moreover, it is used to find exceptions and anticipate business outcomes. Power BI pioneers these capabilities to enhance the skill of managing data.

3.4.2.5 Create Interactive Reports Customized for Your Business
A basic aspect of interactive data visualization is the creation of reports, and it is quite important to maintain reports on how the data has been used. Microsoft Power BI maintains an open-source visuals framework and helps with formatting, layout, and theming.


3.4.2.6 Author for Everyone, Anywhere
When the analytics and reports are visible to everyone, anyone can be an author. This provides full agility and transparency to customers. Power BI also spans devices and the cloud as well as on-premises deployments.

3.4.3 Pros and Cons
– Accessible app
– Best pricing
– Easy to learn
– Heavy usage of CPU
– Not compatible with Mac

3.5 Tableau: Data Visualization and Analytics Capabilities
Tableau is considered to be one of the data visualization platforms that provides the customer with full-fledged information about their data. Its visual exploration is designed to maintain flexibility as well as power. The presentation layer includes a drag-and-drop option, which is used for statistical modeling, and supports natural language interaction. Tableau is not only time-efficient but also cost-efficient, because it provides an extra layer of governance. Moreover, it helps not only with security but also with maintenance, compliance, and support. It also provides analytics capabilities and data visualization at the scale at which they are required.

3.5.1 Pros and Cons
– Intuitive dashboards
– Backed by Salesforce
– Slack integration
– Excellent learning curve
– Expensive in nature


3.5.2 Tableau: Definition and Products It Offers
Tableau is a tool that enables data visualization; it is widely used for business intelligence but is not restricted to it (Table 3.2). It provides not only interactive graphs but also worksheets, dashboards, and charts [13]. This is quite advantageous for enterprises looking into their data, and it also provides several accessibility features, including drag and drop.

Table 3.2: Tableau: features and products it offers.

Tableau Desktop – Key features: creating dashboards and stories locally. Other features: Tableau Personal (limited data sources, no connectivity to Tableau Server); Tableau Professional (full enterprise capabilities). Operating system: Windows, Mac. License: Personal $; Professional $.

Tableau Public – Key features: a massive, public, non-commercial Tableau Server. Other features: all data is published publicly. License: free.

Tableau Online – Key features: creating dashboards and stories on the cloud. Other features: live connections. License: $ per year per user.

Tableau Reader – Key features: view dashboards and sheets locally. Other features: cannot modify workbooks or connect to the server. Operating system: Windows, Mac. License: free.

Tableau Server – Key features: connect to data sources and share dashboards. Other features: users can directly interact with dashboards via the browser. Operating system: Windows. License: core licensing.

3.6 Qlik Sense: Tool
Qlik Sense is a data visualization tool that makes effective use of artificial intelligence for the user, and it is quite efficient statistically. It provides a more interactive experience to users than many other tools and feels more visual. It not only provides high-speed calculation but also combines all the databases in such a manner that the facts are clearly shown. The Qlik Active Intelligence Platform is a pivotal component that acts as the backbone of the product's performance; it not only provides good scalability but also enhances overall performance. Its SaaS offering is a hybrid service that also provides analytics on-premises. With Qlik, you have access to the broadest variety of augmented analytics features currently available, for all users and use cases inside your business.


Qlik Sense comes with an intelligent AI assistant that offers natural language interaction, search-based discovery, conversational analytics, and auto-generation of sophisticated analytics and insights. It also helps with analytics development and data preparation. Some of the capabilities provided by the analytics are as follows:
– AI-generated insights
– Automated creation of analytics
– Natural language interaction
– Machine learning

3.6.1 Go from Passive to Active Analytics
Corporations want a dynamic relationship with data that reflects the current moment; traditional, passive BI falls short. Qlik integrates a real-time data pipeline with action-oriented analytics capabilities, delivering active intelligence that offers in-the-moment insights and drives instant action. It captures each business moment with effective collaboration, intelligent alerting, automatic triggering of actions, and embedded analytics.

3.6.2 High-Performance SaaS and Hybrid Cloud Platform
Qlik Sense on Qlik Cloud® provides unmatched analytics performance and versatility. You can install a complete enterprise SaaS solution or use Qlik Forts™, the hybrid service that extends SaaS analytics to wherever your data resides. Qlik supports any combination of public cloud, on-premises sites, and private cloud for data and analytics, giving you complete control.

3.6.3 Pros and Cons
– Automated action triggers
– AI insights
– Drag-and-drop interface
– Good UI

3.7 Klipfolio: One of the Best Tools for 2022
With Klipfolio, you get full access to your data. The workload can be minimized using pre-built metric dashboards, and several pre-built data connections help in modeling the data. Moreover, everyday decision making is supported: data can be examined, edited, and imported in order to keep it current. A cloud-based web application, Klipfolio can help you succeed with data. By comprehending, visualising, and monitoring the KPIs and metrics that are most important to your company, you can expand it. You can collect, share, display, and learn from your data with Klipfolio in real time. You may also monitor your statistics over time, allowing historical comparisons to watch your development. It is a highly appreciated tool that sets itself apart from its competition through ease of use, excellent pricing, and flexibility. It is the go-to dashboard platform for sales, marketing, IT performance tracking, and accounting.

3.7.1 Five Key Benefits
Klipfolio has five distinct features that we particularly like and use a lot in our service business.

3.7.1.1 Custom Styling
There are many ways to tailor the look and feel of your dashboard, from logo to graph colors. But there is also a CSS option that allows you to fully customize how the dashboard looks.

3.7.1.2 Roles, Groups, and Rights
It is very easy to tailor the "who-sees-what" of every dashboard page. This allows you to make one dashboard for everybody, but show or hide content according to the need to know. This feature makes it easy to keep the dashboard very simple to digest, because each user only sees what needs to be seen.

3.7.1.3 Data Connections
There are many ways to connect data to Klipfolio, so it is very easy to display data from sales, marketing, IT, accounting, and so on from multiple sources. It is even possible to combine data from multiple sources in one graph or table, without a lot of hassle. The data connections range from uploading a file and connecting Dropbox or Google Drive to more advanced connections like Google BigQuery, MySQL, or FTP. Additionally, Cervinodata includes a native connection and, for more demanding customers, Cervinodata for BigQuery for larger accounts.


3.7.1.4 A Powerful Formula Editor
Klipfolio's powerful formula editor allows you to calculate metrics or add functions to your data. This ranges from a simple "sum" or "average" function to "cumulative" values, standard deviations, regression, and so on.

3.7.1.5 The API and SSO
Klipfolio is configurable not only through its interface but also through the Klipfolio API. This allows for large-scale automatic management of all your dashboards, users, data, and access rights. Add the SSO login feature to it (and white labeling if you want) and you can manage and deploy dashboards at scale.

3.7.1.6 Pros and Cons
– Option of a free plan
– Large number of integrations
– Availability of viewer users, metrics, and unlimited dashboards
– No downloadable PDF report service

3.8 Looker: Virtualization Tool
Looker is a powerful tool that permits users to view data in many ways through the plugins available in its marketplace. Users can also discover a number of visualization types, including bars, gauges, charts, calendars, maps, and so on. It also includes several pre-built parameters that provide templates for analysis and statistics. Its augmented analytics helps individualize the tooling and clean up data quickly [15], and it suits organizations that want a variety of data visualization capabilities to scale operations. Looker is a platform that turns data into action: it provides insight from integrated data, maintains daily workflows, and incorporates free-form exploration of the information. Over 1,600 industry-leading and innovative companies, including Amazon, Sony, IBM, The Economist, Spotify, Lyft, Etsy, and Kickstarter, trust Looker to power their data-driven cultures. The company is headquartered in Santa Cruz, California, with offices in New York, San Francisco, Chicago, London, Boulder, Dublin, Tokyo, and Ireland. Investors include Meritech Capital Partners, CapitalG, Kleiner Perkins Caufield & Byers, Redpoint Ventures, and Goldman Sachs.


3.8.1 Benefits

3.8.1.1 Our Data Is Beautiful – Use It
Looker Studio unlocks the power of your data by making it easy to create interactive dashboards and compelling reports from a wide variety of sources, driving smarter business decisions.

3.8.1.2 Connect to Data Without Limits
You can access a huge variety of data sources through more than six hundred partner connectors, which let you connect virtually any kind of data without any coding or software.

3.8.1.3 Share Your Data Story
You can share your compelling reports with your team or with the world, collaborate in real time, or embed your reports on the web.

3.8.2 Key Features

3.8.2.1 An Easy-to-Use Web Interface
Looker Studio is designed to be intuitive and easy to use. The report editor features simple drag-and-drop objects with fully customizable property panels and a snap-to-grid canvas.

3.8.2.2 Report Templates
With a robust library of report templates to choose from, you can visualize your data in minutes, connect your data sources, and personalize the design to fulfil your needs.

3.8.2.3 Data Connectors
Data connectors act as pipes that attach a Looker Studio report to the underlying data. Each source has a unique, prebuilt connector to make sure your data is simple to access and use.


3.8.2.4 Looker Studio API
The Looker Studio API allows Cloud Identity or Google Workspace organizations to automate the management and migration of Looker Studio assets. You can configure an application to use the Looker Studio API quickly and without difficulty.

3.8.2.5 Report Embedding
Embedding lets you include your Looker Studio report in any web page or intranet, making it easier for you to tell your data story to your team or the world.

3.8.3 Types of Looker Data Visualizations
Looker has many visualizations that can be used to explain your data. Every visualization is different, and we are able to customize it in keeping with our needs and styles. Data visualizations make a big impact on understanding data and assist in making clear decisions. Looker data visualizations include pie charts, bar charts, line charts, column histograms, tables, box plots, and heat maps [16]. Each type of visualization has its own settings that you can use to customize its appearance. The sections below provide details about several visualizations and their settings.

3.8.3.1 Sunburst
Sunburst charts are open-source tools used to exhibit hierarchical data structures. They are visually appealing charts, and the data is expressed in an attractive manner in the form of a radial illustration.

3.8.3.2 Collapsible Tree Diagram
A collapsible tree diagram interactively visualizes hierarchical data. It represents a tree consisting of a root node with branches to other nodes, and nodes expand and collapse according to our needs. It is an open-source tool.


3.8.3.3 Liquid Fill Gauge
A liquid fill gauge is an open-source tool used to show progress toward a goal. We can customize the gauge color, the font, and the animation of the curves and waves.

3.8.3.4 Chord Diagram
In a larger dataset, the connection between two items can be efficiently visualized in a chord diagram. It is also an open-source tool. In chord diagrams, we can characterize the flow between two distinct points.

3.8.4 Looker Data Visualizations Use Cases
Thanks to e-commerce data analytics, businesses can now access more data than ever. Looker comes equipped with powerful tools that help discover profitable insights and can create opportunities to grow your business.

3.8.4.1 Looker Data Visualization: E-Commerce
E-commerce data analytics gives businesses access to more data than ever, and Looker's tools help turn it into profitable insights and opportunities for growth.

3.8.4.1.1 E-Commerce Analysis
Looker provides tools to track e-commerce KPIs (key performance indicators) like purchases, conversion rates, revenue, and customer values. These tools help optimize sales performance and grow online sales by identifying customer trends, applying predictive modeling, and updating prices depending on demand and supply.

3.8.4.1.2 Customer Trends and Behavior
Looker data visualization helps create customer profiles from order history and purchasing patterns and learn about customer behavior and interests. It identifies repeat purchase patterns and frames marketing procedures and new promotions that drive the business.


3.8.4.1.3 Category and Brand Management
With category performance data, you can easily find the top performers across different product classes and take advantage of this in purchasing decisions. With BI data, you can grow income by deciding when and where to run promotions. It improves stock control and gives real-time insights about inventory; in that way, you do not run out of stock that is high in demand.

3.8.4.2 Looker Data Visualization: Healthcare
With Looker, you can analyze claims with healthcare stakeholders, doctors, insurance agencies, and patients and increase efficiency across these categories. It supports HIPAA compliance. Looker has gained an advantage in developing better tools to address COVID-19-like diseases.

3.8.4.2.1 Efficient Planning with Qventus
Qventus is an AI-enabled platform that assists hospital teams in making better operational decisions in real time. Looker, together with AI software, has customized the "Post Acute Care Utilisation tool and PPE demand planner" to offer the best planning and patient care.

3.8.4.2.2 Effective and Proactive Monitoring by Commonwealth Care Alliance (CCA)
Commonwealth Care Alliance is a nonprofit, community-based healthcare organization that offers healthcare for individuals with significant needs by maintaining quality and health outcomes while reducing overall costs. CCA uses Looker and Google BigQuery to test and help patients affected by COVID-19. CCA helped its members by providing the latest data and guidance.

3.8.4.2.3 Improved Digital Care with Force Therapeutics
Force Therapeutics, an episode-based patient engagement research network and platform, enhances care by simplifying and strengthening the relationship among physicians, patients, and care teams. By providing specialized and secure insights to surgeons, patients, and administrators, they worked on improving care and made it a more effective process. Looker is equipped with embedded analytics in its products to deliver scalable, secure insights and enhance patient care.

3.8.4.2.4 Transition to Value-Based Care
Alternative payment models (APMs) are transforming healthcare; however, the metrics and their performance have to be analyzed, and Looker's flexible data platform makes this possible. New Wave Telecom and Technology, Inc. delivered metrics and dashboards for new APMs in a short span of time. Doctors and administrators easily understand patient performance and specific statistics with the assistance of the Centers for Medicare and Medicaid Services (CMS).

3.8.4.3 Looker Data Visualization: Gaming
Looker data visualization helps you develop video games and gain accurate insights through game analytics, boosting your revenue [17].

3.8.4.3.1 Grow Your Gameplay Metrics
With gaming analytics, you can track the status of your KPIs, make better choices, and find critical insights. You can optimize campaigns and take a distinctive look at them in an innovative manner. The principal key metric for gaming is ad revenue, and automated bidding will optimize installs and boost your revenue. Discover the balance between retention and monetization that is required for better player engagement.

3.8.4.3.2 Gameplay Experience Optimization
Picking simple retention metrics, such as decreasing churn, will optimize gameplay and reduce quits. Monitor your KPIs for insights that balance difficulty, the game economy, and content, and enhance the user's gameplay experience by analyzing user behavior and frequently updating the games.

3.8.4.3.3 Unlock Customer Insights
Create sustainable growth by standing out in the marketplace and locating your customer base. Combine all the revenue sources and get a bird's-eye view of every player's lifetime value (LTV) at any point in their life cycle. Prepare a cohort analysis that shows updated trends, so that you can make changes in the game to enhance the gameplay.

3.8.5 Pros and Cons
– Large number of plugins
– Attractive visuals
– Drag-and-drop interface
– Fewer options for customization


3.9 Zoho Analytics
Zoho Analytics is a data visualization tool that helps import data from different sources and perform in-depth analysis of it with the help of a drag-and-drop interface. It provides insightful reports, and its range of visualization tools also helps in maintaining dashboards. Users can collaborate on these dashboards and reports to agree on the reporting system, and reports and dashboards can be published via email or on other websites. It integrates with other Zoho applications and offers several plans.

3.9.1 Pros and Cons
– Easy sign-up
– A free plan is available
– Provides a 15-day trial
– Provides premium support

3.9.2 Overview and Features
Zoho Analytics is a self-service business intelligence and data analytics platform that lets businesses visually examine their data, create rich data visualizations, and discover hidden insights. It brings analysis and insight to business data without the need for IT assistance or dedicated data analysts [18]. Users can work with Zoho Analytics partners to blend data, crunch large datasets, visualize the results in graphical formats, and perform a range of analytical tasks to uncover insights. Zoho Analytics enables organizations to make data-informed decisions and comes with flexible on-premises or cloud deployment models.

3.9.2.1 Centralized Data Collection
With its easy-to-use smart assistant, Zia, Zoho Analytics centralizes data collection. Streamlined data collection helps organizations develop a 360-degree view of the business and answer questions about performance, customer behavior, and financial status. In addition, Zoho Analytics simplifies complex data preparation tasks such as formatting, splitting, calculated fields, and merging, ensuring optimal productivity [19].


Figure 3.3: Zoho Analytics benefits.

3.9.2.2 Data Visualization
Zoho Analytics is a one-of-its-kind business intelligence platform that comes with engaging, insightful dashboards and lets you surface vital information. Working with Zoho Analytics experts on data visualization helps business teams make effective decisions that improve the bottom line of the business.

3.9.2.3 Efficient Collaboration
Zoho Analytics enables efficient collaboration among business teams by giving team leaders and executives fine-grained control over what colleagues and customers can view and edit. Depending on the access level, it also lets you add annotations and images and create comment threads to communicate more productively.

3.9.2.4 Scalable Architecture
Zoho Analytics is extensible and can be scaled as the business grows. The platform's scalable architecture allows it to be integrated into deployments of any size, and it can be configured to cope with very large data volumes, making it suitable as a scalable analytics database.

3.10 Domo: Virtualization Tool
Domo provides data visualization tools that play an important role in helping small businesses and in supporting data-driven decision making.


With its easy-to-use interface, Domo allows customers to create advanced charts with only a few clicks and to customize visualizations. Its governance tools help organizations control who has access to data, and reporting can be done easily in either a shared or an individual environment – for example, visualizations can be embedded onto your website using JavaScript and an iFrame [20]. Domo is a cloud-based business intelligence platform that helps organizations manage their information assets. It combines data with visualization to derive actionable insights, providing a large number of visualizations, more than 7,000 mappings, and more than 150 different chart types for users. Domo not only provides the data environment but also offers numerous filters and custom-made templates. It supports role-based security and can be used in a multicloud environment with single sign-on authentication.

3.10.1 Introduction to Domo Data Visualization Workflow
The Domo data visualization workflow broadly consists of the following steps:
– Step 1: Connection
– Step 2: Preparation
– Step 3: Visualization
– Step 4: Engagement
– Step 5: Optimization

3.10.1.1 Domo Data Visualization Workflow Step 1: Connect
Domo is a single platform with a large library of pre-built connectors, making it a preferred destination for enterprise data [21]. It puts an extensive range of data sources and applications at the user's fingertips and can leverage data from CRM systems, ERP tools, and several other categories of applications. An illustrative connector sketch follows.
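As a rough illustration of what a connector does in this step, the following Python sketch pulls records from a hypothetical REST endpoint and lands them in a tabular dataset ready for the Prepare step. The URL, token, and field names are placeholders, not Domo's connector API.

```python
import pandas as pd
import requests

# Placeholder endpoint and credentials for an illustrative CRM-style source.
SOURCE_URL = "https://crm.example.com/api/v1/opportunities"
API_TOKEN = "REPLACE_ME"

def extract_opportunities() -> pd.DataFrame:
    """Pull raw records from the source system and return a tabular dataset."""
    response = requests.get(
        SOURCE_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    records = response.json()          # assumed to be a list of JSON objects
    return pd.DataFrame.from_records(records)

if __name__ == "__main__":
    df = extract_opportunities()
    print(df.head())                   # dataset is now ready for preparation
```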

3.10.1.2 Domo Data Visualization Workflow Step 2: Prepare
Data preparation in Domo is easy and user friendly. It provides useful drag-and-drop functionality and can source information without heavy preprocessing and without altering the underlying data.


In Domo, a card is a visualization that presents source data in card form. Cards are created with the Domo Card Builder, known for its resources and easy-to-use interface [22]. Some information is complex to display, but Domo makes it straightforward: customers can reuse an existing card or create a new one as convenient, and data preparation is further leveraged in the process. Building new cards can take more time when videos and presentations need to be embedded.

3.10.1.3 Domo Data Visualization Workflow Step 3: Visualize
With any dataset, visualization can be done in a simple framework. Choosing a suitable visualization for the data at hand is important, so a default chart is already prepared in the chart menu. Several layouts are available across the different panels; the basic purpose of each is as follows:

3.10.1.3.1 Data Pane
Used for selecting the data asset, picking the available filters and fields, and creating the necessary calculations.

3.10.1.3.2 Filter and Sort Pane
Used for applying filters and sorting the data into the desired order.

3.10.1.3.3 Toolbar Pane or Analyzer Tools
Some of these analyzer tools are as follows:
– Data: Works on the customer's records; measures and dimensions are adjusted according to the calculations provided in the left pane, and it controls how many fields may be used in the visualization.
– Data table: Allows customization according to parameters provided through the tools, as in a pivot table.
– Filters: Shows the filters already applied as well as others that can be used.
– Sorting: Provides drag-and-drop sorting of the data.
– Summary: Lets you customize the summary of the given data.
– Date: Used for adding date filters, date ranges, and other date customizations.


– Color rules: These provide specific colors for charts, lines, pie graphs, and so on.
– YOY: Year-over-year comparison is used for analyzing and comparing time-series data (see the sketch after this list).
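To make the year-over-year comparison concrete, here is a minimal Python sketch of the kind of calculation a YOY tool performs. The column names and monthly frequency are assumptions for illustration and are not Domo's internal implementation.

```python
import pandas as pd

# Hypothetical monthly revenue series spanning two years.
monthly = pd.DataFrame({
    "month": pd.date_range("2022-01-01", periods=24, freq="MS"),
    "revenue": [100 + i * 5 for i in range(24)],
}).set_index("month")

# Year-over-year change: compare each month with the same month one year earlier.
monthly["revenue_last_year"] = monthly["revenue"].shift(12)
monthly["yoy_pct_change"] = (
    (monthly["revenue"] - monthly["revenue_last_year"])
    / monthly["revenue_last_year"] * 100
)

print(monthly.dropna().round(1).tail())
```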

3.10.1.3.4 Chart Picker Pane
The chart picker pane is one of the most useful options on the screen. It provides more than 150 chart layouts that can be used by any organization, team, or business user. Examples in the chart picker include bar charts, line charts, gauge charts, and forecasting charts, along with extensive customization such as vertical lines, sliding lines, curves, sparklines, and so on.

3.10.1.3.5 Auto Preview Toggle Pane
This pane keeps the visualization of the organization's data in sync with a preview that is continuously updated. The automatic preview provided by this toggle is important, particularly when a chart is first generated.

3.10.1.4 Domo Data Visualization Workflow Step 4: Engage
Domo lets users collaborate, engage, and communicate through annotations, so that each person can tell their own data story. It also provides selective sharing; for example, only a few features may be made visible to visitors from outside your organization [23]. Another strong capability is Domo Buzz, which supports discussion around the visualizations and data arranged on a card, attributing solutions to specific problems. Alert notifications keep users informed when a metric changes and provide a seamless review of the underlying services, keeping the platform open to collaboration and conversation. Users can change their notification and Buzz settings according to their needs.

3.10.1.5 Domo Data Visualization Workflow Step 5: Optimize
A range of focused signals and alerts helps optimize business operations, and data-driven decision makers come to depend on these notifications for their activities, along with cost-efficient engagement strategies. Subsets of the data can be segmented into new applications, which increases the productivity and efficiency of the team.


It also helps control which selected views of the data stories are exposed, while retaining a bird's-eye view of the whole.

References
[1] Chatziantoniou, D., & Kantere, V. (2021, June). DataMingler: A novel approach to data virtualization. In Proceedings of the 2021 International Conference on Management of Data (pp. 2681–2685). https://doi.org/10.1145/3448016.3452752.
[2] Karpathiotakis, M., Alagiannis, I., Heinis, T., Branco, M., & Ailamaki, A. (2015). Just-in-time data virtualization: Lightweight data management with ViDa. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR).
[3] Chatziantoniou, D., & Kantere, V. (2021). Data virtual machines: Enabling data virtualization. In Heterogeneous Data Management, Polystores, and Analytics for Healthcare (pp. 3–13). Springer, Cham.
[4] Mathivanan, S., & Jayagopal, P. (2019). A big data virtualization role in agriculture: A comprehensive review. Walailak Journal of Science and Technology (WJST), 16(2), 55–70.
[5] Haihui, Z., Chunjiang, Z., Huarui, W., Feng, Y., & Xiang, S. (2010). Research of cloud computing-based agriculture virtualized information database. In Proceedings of the 2nd APSIPA Annual Summit and Conference, Singapore (pp. 835–838).
[6] Bizarro, P. A., & Garcia, A. (2013). Virtualization: Benefits, risks, and control. Internal Auditing, 28(4), 11–18.
[7] Watt, D., Bebee, B., & Schmidt, M. (2019, October). Business driven insight via a schema-centric data fabric. In ISWC (Satellites) (pp. 305–306).
[8] Liu, K., Yang, M., Li, X., Zhang, K., Xia, X., & Yan, H. (2022, July). M-data-fabric: A data fabric system based on metadata. In 2022 IEEE 5th International Conference on Big Data and Artificial Intelligence (BDAI) (pp. 57–62). IEEE.
[9] Alvord, M. M., Lu, F., Du, B., & Chen, C. A. Big data fabric architecture: How big data and data management frameworks converge to bring a new generation of competitive advantage for enterprises.
[10] Izzi, M., Warrier, S., Leganza, G., & Yuhanna, N. (2016). Big data fabric drives innovation and growth. Next-Generation Big Data Management Enables Self-Service and Agility, 3(1).
[11] Zaidi, E., Thoo, E., De Simoni, G., & Beyer, M. (2019). Data fabrics add augmented intelligence to modernize your data integration. Gartner Group, 17.
[12] Ross, M., & Kimball, R. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.
[13] Tan, R., Chirkova, R., Gadepally, V., & Mattson, T. G. (2017, December). Enabling query processing across heterogeneous data models: A survey. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 3211–3220). IEEE.
[14] Chatziantoniou, D., & Kantere, V. (2020). Data virtual machines: Data-driven conceptual modeling of big data infrastructures. In EDBT/ICDT Workshops.
[15] Zhu, Z., Zhong, S., Chen, L., & Chen, K. (2015). Fully programmable and scalable optical switching fabric for petabyte data center. Optics Express, 23(3), 3563–3580.
[16] Abadi, D., Agrawal, R., Ailamaki, A., Balazinska, M., Bernstein, P. A., Carey, M. J., . . . Widom, J. (2016). The Beckman report on database research. Communications of the ACM, 59(2), 92–99.
[17] Chatziantoniou, D., & Tselai, F. (2014, June). Introducing data connectivity in a big data web. In Proceedings of the Workshop on Data Analytics in the Cloud (pp. 1–4). http://doi.acm.org/10.1145/2627770.2627773.


[18] Gottlieb, M., Shraideh, M., Fuhrmann, I., Böhm, M., & Krcmar, H. (2019). Critical success factors for data virtualization: A literature review. The ISC International Journal of Information Security, 11(3), 131–137.
[19] Shraideh, M., Gottlieb, M., Kienegger, H., Böhm, M., & Krcmar, H. (2019). Decision support for data virtualization based on fifteen critical success factors: A methodology. In MWAIS 2019 Proceedings, 16.
[20] Raheem, N. (2019). Big Data: A Tutorial-Based Approach. Chapman and Hall/CRC, USA.
[21] Bogdanov, A., Degtyarev, A., Shchegoleva, N., Korkhov, V., & Khvatov, V. (2020). Big data virtualization: Why and how? In CEUR Workshop Proceedings (2679) (pp. 11–21).
[22] Dospinescu, O., & Chiuchiu, S. (2019). Physical integration of heterogeneous web based data. Informatica Economica, 23(4), 17–25.
[23] Ylijoki, O., & Porras, J. (2016). Conceptualizing big data: Analysis of case studies. Intelligent Systems in Accounting, Finance and Management, 23(4), 295–310.

Jagjit Singh Dhatterwal, Kuldeep Singh Kaswan, Vivek Jaglan

4 Data Fabric Technologies and Their Innovative Applications
Abstract: Organizational information technology (IT) departments now commonly use hybrid or multicloud deployment strategies, and managing data in such a heterogeneous setting is more difficult than ever. NetApp's ideal data management system is a "data fabric" that allows frictionless integration of various cloud deployment models (private, public, and hybrid). To provide uniformity and authority over data mobility, security, visibility, protection, and access, a data fabric centralizes data management across disparate resources. In this chapter, we define the term "data fabric," outline its design, provide examples of its use, describe its deployment options, and look at the future of NetApp's data fabric.
Keywords: Data fabric, cloud-agnostic, hybrid cloud, service provider, cloud orchestration, Amazon Web Services

4.1 Introduction
Today's IT professionals are looking for ways to speed up innovation by capitalizing on emerging technologies: cloud services, modern data storage facilities, open source, converged infrastructure, configuration management, flash, cold storage, and software-defined storage are just a few examples [1]. Hybrid cloud deployment strategies, which combine private and public cloud resources, have become the standard for business IT departments. The computing, networking, and storage capabilities available in these environments are almost unbounded, and businesses want the flexibility to shift workloads and applications to the best possible environment as their requirements evolve and as new alternatives emerge. Most applications are stateless, meaning they can be deployed in a variety of settings with little effort. Data, on the other hand, is persistent and is often served by external storage systems or databases. Information is a valuable resource upon which modern economies rely; like anything with mass, data is not weightless and requires both time and space to store and transport.

Jagjit Singh Dhatterwal, Department of Artificial Intelligence and Data Science, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India, e-mail: [email protected]
Kuldeep Singh Kaswan, School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India, e-mail: [email protected]
Vivek Jaglan, DPG Institute of Technology and Management, Gurugram, Haryana, India
https://doi.org/10.1515/9783111000886-004


In other words, data has a "temperature": the degree to which it is being accessed at any one moment. This dynamic nature requires careful handling of data's varied attributes [2]. Data management for hybrid cloud applications, which are split across functionally independent organizational data boundaries, brings additional difficulties, and protecting data and fixing security flaws are only two of them. Data security, protection, and governance must be maintained regardless of where a company stores its information; this is IT's responsibility. Data mobility is another problem: it may be difficult or impossible for a company to transfer its data from one cloud service to another once it has been uploaded. Inconsistent data management is a further issue. There is little uniformity in data policies, since each environment has its own collection of tools, application programming interfaces (APIs), and management software, and the IT department must become proficient with each of them. Choices are also limited: incompatible new technologies and services are harder to adopt, so IT is constrained in its ability to exploit the features of both new and old environments. Information is a vital resource for modern businesses and must be safeguarded by IT regardless of its location; cloud storage can be risky for enterprises because of the reduced control over data security and governance [3]. To face these difficulties head-on, IT needs a secure, unified method of managing applications and data across clouds, irrespective of the underlying storage technologies. Connecting clouds lets IT exploit their combined strengths, migrate existing data and applications to emerging cloud services, and distribute work as efficiently as possible.

4.1.1 Discussion of the White Paper
The purpose of this section is to elaborate on how NetApp's data fabric can meet the difficulties of managing data in the dynamic and dispersed context of modern IT. The discussion covers, first, the need for a data fabric and, second, examples of how businesses and service providers (SPs) are already making use of one.

4.1.2 Meaning of Data Fabric
Data fabric is NetApp's vision for the future of data management. By giving easy access to data wherever it is most valuable, it helps customers react and innovate faster, and it lets them exploit the full potential of the hybrid cloud to make the best choices for their operations. NetApp's vision for the data fabric is constantly developing.


The scope of the data fabric will expand over time to include other ecosystems, growing in scope and functionality with each successive weaving [4]. To realize this vision, NetApp has developed the data fabric as the technical framework for its hybrid cloud services. Whether data sits on flash, disk, or in the cloud, NetApp solutions, services, and partnerships make it easy for clients to manage it across all these types of infrastructure. A customer is free to pick the resources it uses to run its applications and to alter those choices at any time [5]. Five core requirements must be met for a data fabric to be considered real:
– Control. Keep your data safe and under your control wherever it resides – on premises, near the cloud, or in the cloud itself.
– Choice. Complete flexibility to choose and change services as requirements evolve: application ecosystem, delivery approach, storage architecture, and deployment pattern.
– Integration. Use the full potential of each layer's components while enabling them to work together as a whole.
– Access. Get data to the right place, at the right time, in the right format, for the right applications [6].
– Consistency. Manage data that exists in more than one environment with the same set of tools and processes.
By adhering to these principles, a data fabric helps clients boost productivity, sharpen IT responsiveness, and speed up innovation.

4.1.3 Who Needs a Data Fabric?
NetApp already offers cloud-agnostic data management, and a company's productivity, IT agility, and pace of innovation can all benefit from its data fabric [7]. Using it, business leaders can create an environment in which creativity and new ideas flourish. Greater adaptability lets businesses make better use of their resources: agility allows a company to operate more like a startup while still meeting the business and regulatory obligations of its sector, with all company data kept safe. CIOs (Chief Information Officers) face a difficult balancing act between embracing promising new technologies and keeping current infrastructure and security in place to meet the demands of the company. NetApp's data fabric gives CIOs more room for strategic choices by ensuring data is available when and where it is required, while also speeding up innovation and reducing costs. As hybrid cloud becomes the standard in IT, CIOs are responsible for safeguarding, managing, and ensuring compliance of all company data, regardless of where it sits [8].


4.1.3.1 Professionals Who Design IT Infrastructure
CIOs and CISOs (Chief Information Security Officers) must balance competing service level objectives (SLOs): different workloads have different needs for accessibility, cost-effectiveness, security, and availability. NetApp's data fabric gives IT architects more options for designing hybrid cloud infrastructures. They can provide access control that combines the security measures the business requires with the ease of access to information that employees want [9].

4.1.3.2 Managers of Software Applications
Application development teams are adopting the hybrid cloud rapidly because they need safe, speedy software development. Cloud infrastructure provisioning is simple, cheap, and well suited to the DevOps paradigm, which gives pilot projects more room to succeed or fail. Application owners can move their test and/or production environments back to their own data centers if the benefits no longer outweigh the costs.

4.1.3.3 Who Designs Storage Systems
The data fabric lets storage architects and administrators work with cloud-based infrastructures. They can take advantage of cutting-edge developments in deployment models while capitalizing on their existing knowledge and expertise, and improved access to modern resources gives them additional ways to encourage productivity and creativity among their users [10].

4.1.3.4 Hosted Computing Companies
For a cloud SP, success means rapidly expanding operations while keeping overhead low and onboarding new clients smoothly. With NetApp's data fabric, SPs can build a solid, scalable infrastructure that handles any number of users at any time. Business operations can be fully automated whether the cloud orchestration framework is open source, commercially available, or purpose-built in house. Offering clients the ability to extend their data fabrics into the SP cloud is one way to onboard them, since it gives them full data sovereignty while making use of the SP's cloud storage [11].


4.1.3.5 Models of Deployment
An organization's data fabric architecture and scope are defined by the mix of cloud deployment types it selects. Among the deployment models available to IT are:
– Private cloud, based in the organization's own or a hosted data center
– Public cloud, with internet-accessible storage and processing offered by various service providers
– Hybrid cloud, in which private and public resources are pooled together
– Multicloud, which typically involves two or more different cloud providers

4.1.3.6 Private Cloud
Private clouds may be located either on premises or in a third-party data center; the choice makes little difference, since the hardware and software architecture options are equally numerous in both cases. Architects decide whether to use a proprietary or open-source hypervisor and cloud orchestration framework, and NetApp solutions integrate data management capabilities tightly with these ecosystems [12]. A variety of storage options, from purpose-built hardware to software-defined storage (SDS), is available.

4.1.3.7 Public Cloud
Public clouds are made available by service providers. Even though the largest cloud providers have the scale to build custom systems, most SPs still choose the same options used by enterprises designing and implementing private clouds. This frees up time and resources for SPs to concentrate on their core competency: creating new services for their customers. By delivering NetApp SnapMirror or NetApp SnapVault services, SPs running on NetApp infrastructure can let their clients extend their data fabrics to the cloud. This paves the way for hybrid cloud architectures, because the SP's clients can efficiently migrate their data to the cloud and start using it with the provider's offerings. Using NetApp ONTAP Cloud to quickly establish an ONTAP endpoint in AWS extends the advantages of ONTAP data management to cloud storage.


4.1.3.8 A Hybrid Public/Private Cloud Architecture
A hybrid cloud uses both on-premises hardware and public cloud computing to provide IT services. NetApp maintains that a true hybrid cloud is one in which users have complete control over the infrastructure; it can connect any mix of on-premises, near-cloud, and cloud-based resources and can move data in any direction [13]. A hybrid cloud deployment may be as simple as an on-premises FAS array in the company's data center combined with ONTAP running in Amazon Web Services (AWS), linked through the SnapMirror transport to synchronize data between the two. Because of the data fabric this straightforward design establishes, application data can be provisioned and handled in the same way in both environments. Colocation, managed, and/or dedicated services can also be integrated into a hybrid cloud. With NetApp Private Storage (NPS), for instance, a business can set up a private FAS cluster in a data center and link it to public cloud compute through a network exchange. This deployment method yields a low-latency hybrid architecture that combines the safety of on-premises storage with the scalability of cloud computing. Data fabric connectivity lets IT departments keep their data, retention rules, and service level objectives (SLOs) under their control regardless of where the data is physically stored [14].

4.2 Data Fabric in Action
The data fabric is not a distant vision; businesses are already reaping its benefits today. Organizations of all sizes are using NetApp's data fabric to transform their daily operations. The following examples describe real-world applications of data fabrics by service providers and enterprises.

4.2.1 Providing IaaS That Is Robust, Scalable, and Secure
To compete better in the expanding cloud services market, a large SP sought to improve its infrastructure. By adding an IaaS model to its dedicated hosting service, the SP could offer clients lower prices and more managed services that could be trusted to remain secure. The SP's business clients needed a rearchitected platform with high capacity, performance, durability, and availability that did not compromise the SP's security, and the SP required a scalable infrastructure that could expand as needed while maintaining a reasonable cost/performance ratio [15].


4.2.2 Perspective on a Way Forward
After considering several possible designs, the SP settled on clustered Data ONTAP as the backbone of its IaaS platform.
Always-available platform. Critical maintenance had traditionally been scheduled during planned downtime, but a single maintenance window does not work in a multitenant setting. Clustered Data ONTAP lets the SP perform platform life cycle management (hardware and software upgrades) without disrupting operations:
– Scale. The cloud market is always evolving. The SP can scale up in line with demand by adding nodes to the clustered design, matching expenditure with revenue.
– High quality of service. Streamlined performance management lets the SP offer tenant-specific SLOs, and because customers pay only for the performance they actually use, the SP is shielded from potential financial losses [16].
– Replication. Enterprise-level service level agreements (SLAs) emphasize regional availability by mandating RPOs in the range of minutes and requiring continental-scale separation of data centers. SnapMirror replication lets the SP provide disaster recovery at a cost its clients can afford.
A single set of building blocks. Customers can access a wide range of services and quality levels at a variety of price points, all from a single underlying architecture (e.g., disk and flash). When all systems are built on the same blueprint, it is easier to control expenses and boost profits:
– Flexibility. The architecture's adaptability enables rapid rollout of new offerings, with no costly rework needed to add support for new protocols, hypervisors, or applications.
– Security. The security footprint and threat environment grow when virtualization is used in an IaaS cloud solution. The SP was well versed in the relevant security concepts – virtual networks, hypervisors, guest operating system rootkits, and colocation – and used logical network segmentation and ONTAP multitenancy to construct the security strategy for the cloud-based IaaS system [17].
APIs with automation. Cloud customers typically use a portal to access their information. Through the API layer, SP development teams can automate the NetApp system directly, without intermediate steps.


4.2.3 Material Value of Data Fabric
Using NetApp's data management capabilities, the SP can ensure that data crucial to its reputation is always both secure and readily accessible. The data fabric lets the SP give more attention to service differentiation, revenue growth, and customer satisfaction, and a unified set of systems lets it quickly and easily offer a variety of services at various price points. With SnapMirror and SnapVault, cloud SPs running on NetApp infrastructure can help their clients bring their data fabrics into the cloud; client data can then be integrated easily, opening the path to a hybrid cloud architecture.

4.2.4 SaaS Distribution for Multiple Clouds
An ERP (enterprise resource planning) software provider decided to switch to a software as a service (SaaS) delivery model after seeing that its clients were shifting toward a "SaaS first" IT strategy. The company's success hinged on meeting the needs of its many customers. Customers needed a secure way to deploy their workloads to the cloud of their choice; the software provider's professional services group needed to spin up development and test environments rapidly for quick client onboarding; and the vendor's development team needed to safeguard the ERP application's primary transactional database [18].

4.2.5 Perspective on a Way Forward
The software company realized it had to rearchitect its application to gain the full financial benefit of hosting it in the cloud, while keeping the application's primary transactional database intact. Initially, the software architects chose Amazon Elastic Block Store (EBS) for the SQL Server databases, but problems with this approach soon became apparent. Scaling costs were unsustainable, and the company was unable to scale its SaaS business because:
– Bringing on new clients was more expensive and time-consuming than the business plan anticipated.
– The SQL Server database needed many dev/test copies, which consumed significant EBS storage space, and the time needed to clone the SQL Server data raised client acquisition costs and delayed onboarding of new customers.
– Performance and availability: not all customers were satisfied with the SLAs.


It was too expensive to rely on AWS alone to meet the performance and availability SLAs required to satisfy their major clients. They needed the flexibility to choose from a pool of clouds where customers could be hosted, and to switch between providers at any time without disrupting service to the running application [19]. In addition, the specified 15 min differential backup intervals could not be sustained for the production server. The company therefore hosted its Microsoft SQL Server databases on iSCSI LUNs in ONTAP Cloud, with the aggregate residing on EBS so that test copies could be made available for development and testing. ONTAP Cloud's deduplication and other storage efficiency capabilities cut EBS storage needs in half, and its storage cloning features let the professional services department onboard customers quickly and efficiently, increasing productivity and enabling a smoother transition to the new SaaS offering [20]. The software firm is now developing a solution that will use NPS in a colocated setting with direct connections to both AWS and Azure, in order to meet the performance and availability SLAs for production users and to allow either AWS or Azure to be used for a given scenario.

4.2.6 Material Value of Data Fabric
The SaaS offering is built on NetApp's data fabric, which serves as a consistent foundation for both HA and DR. Rather than depending on application-specific replication methods and other platform capabilities that differ between AWS and Microsoft Azure, the company uses Clustered Data ONTAP's HA and DR capabilities, such as 15-min backups for multiple databases. A single NPS deployment can support many cloud environments, including AWS, Azure, and SoftLayer, so individual production applications can be moved from one public cloud to another without the time, cost, and complexity of duplicating data.

4.2.7 To Decrease the Overabundance of Content Management Systems
NetApp's internal websites were created using a variety of content management systems (CMSs), including Oracle's CMS platform, Jive, WordPress, Drupal, and Joomla, as well as internal Java/HTML/CSS scripts. These installations started with relatively small footprints but have since grown in both size and capability. Because of a lack of established standards, CMS installations are distributed unevenly across the company, which makes them difficult to oversee and manage [21].


4.2.8 Perspective on a Way Forward
NetApp IT implemented a cloud-based hosting infrastructure, based on NPS and the AWS cloud, to centralize all intranet sites under one unified design approach. It is built on an open-source CMS. Each portal is given a medium-sized Amazon EC2 instance running Linux, Apache, MySQL, and PHP (LAMP), based on an IT blueprint designed for NetApp's CMS. Portal data is stored on NetApp FAS6290 systems operating as NetApp Private Storage, and the EC2 nodes access the FAS via NFS-mounted volumes over a 1 Gbps Direct Connect link.

4.2.9 Material Value of Data Fabric
In this way, IT can keep all company data on its own storage, in line with NetApp's policies on data sovereignty, data integrity, and data governance. NetApp's IT department can use the data fabric to enforce user-friendly, manageable, and financially sustainable web portal standards while giving business users full editorial freedom. NetApp IT has used this framework to launch 15 intranet portals successfully, providing a firm basis for future growth.

4.3 Constructing the Backbone of a Data Network
To construct a data fabric, a company must keep its current security posture and architecture in place while considering its whole IT environment – from endpoints through data storage and caching to data management, ecosystem integrations, applications, and services. The core of the architecture consists of solutions and services that untether information from its original systems, making it possible to exchange and access information from any point. Because the concept of a true data fabric was taken into consideration throughout the design of each NetApp component, IT architects have a broad choice of options for constructing their systems. This section discusses several strategies for maintaining a secure and reliable environment for your NetApp data fabric [22].

4.3.1 Data Protection
Virtualization, shared infrastructure, and cloud computing have all enlarged the attack surface that businesses must defend. IT must take governance and data privacy into account.


To overcome security concerns, the data fabric must be designed to support them appropriately for the delivery models being offered (IaaS, PaaS, and SaaS). Security features must be planned carefully at each level of the data fabric's architecture.

4.3.2 Mastering Foundational Ideas in Safety and Security
Businesses must understand how customers are segmented and isolated inside a cloud architecture, and how this plays out throughout a data fabric. It is also critical to understand the security measures used to protect their data. One example is secure multitenancy, which allows resources to be shared safely between different users: the administrator determines each tenant's access to resources and quality of service (QoS), and a tenant can manage its isolated, provisioned environment independently of other tenants.

4.3.3 Data in Use, in Transit, and at Rest
Any information being processed by a cloud SP is considered "in use" data. Because it is so important to a company's security posture, it must be protected over its whole lifespan, and the shared nature of cloud solutions demands extra care. The security standards enforced inside a company should be applied, and perhaps even strengthened, across the whole data infrastructure. When data is being sent over the data fabric it is said to be "in motion," "in transit," or "in flight," and it is then more susceptible to theft, tampering, and other kinds of unlawful access. Data in motion includes everything from web and database traffic to transfers inside a compute server and use of data housed in the public or hybrid cloud. HTTPS, which uses SSL/TLS to encrypt internet traffic, is the most common protection in use today; virtual private network (VPN) infrastructures (often based on IPsec) are also widespread [23]. "Data at rest," simply put, is any data that is not in transit – the contents of a database, for instance, or copies of it stored elsewhere. The integrity and confidentiality of all data in the fabric must be protected whenever it is not in use; self-encrypting drives and other full-disk encryption (FDE) solutions are common ways of addressing this. A brief illustration of at-rest encryption follows.
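The following Python sketch illustrates the at-rest encryption idea in miniature using the widely available cryptography package's Fernet recipe (AES-based authenticated encryption). It is a conceptual example only, not how NetApp's FDE or self-encrypting drives are implemented.

```python
from cryptography.fernet import Fernet

# Generate a symmetric data-encryption key. In a real deployment this key
# would come from a key management system, never be hard-coded, and never
# be stored next to the data it protects.
key = Fernet.generate_key()
cipher = Fernet(key)

# "Data at rest": encrypt before writing to storage.
record = b"patient_id=123;diagnosis=confidential"
ciphertext = cipher.encrypt(record)
with open("record.bin", "wb") as f:
    f.write(ciphertext)

# Later, read the ciphertext back and decrypt it for use.
with open("record.bin", "rb") as f:
    restored = cipher.decrypt(f.read())

assert restored == record
print("round-trip OK; stored bytes are unreadable without the key")
```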

4.3.4 Key Management
"Key management" is the administration of a secure place where the cryptographic keys protecting sensitive company information can be stored without fear of compromise. Access control and safe storage of keys are two essential features of any key management system.


Such a system must be usable across the whole data fabric, and it is also crucial to ensure that, should a catastrophe occur, recovery methods exist for the entire key management system. Essential key management factors, as defined by the Cloud Security Alliance, include the following: the process should include secure random number generation and a secure method for transmitting cryptographic keys, which should never be exchanged in the open; segmentation of duties across the data fabric and clear assignment of responsibility are also essential. Keep in mind that if you cannot get to the keys, you cannot get to the data [24]. A minimal envelope-encryption sketch follows.
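As a rough sketch of the separation between keys and data discussed above, the following Python example wraps a per-object data-encryption key (DEK) with a master key-encryption key (KEK), a pattern many key management systems use. The class and names are hypothetical illustrations, not a specification of any KMIP server or NetApp product.

```python
from cryptography.fernet import Fernet

class ToyKeyManager:
    """Holds the master key-encryption key (KEK) and wraps/unwraps DEKs."""

    def __init__(self) -> None:
        self._kek = Fernet(Fernet.generate_key())  # never leaves the key manager

    def wrap(self, dek: bytes) -> bytes:
        return self._kek.encrypt(dek)

    def unwrap(self, wrapped_dek: bytes) -> bytes:
        return self._kek.decrypt(wrapped_dek)

kms = ToyKeyManager()

# Encrypt a piece of data with its own DEK, then store only the wrapped DEK
# alongside the ciphertext; the KEK stays inside the key manager.
dek = Fernet.generate_key()
ciphertext = Fernet(dek).encrypt(b"quarterly financials")
wrapped_dek = kms.wrap(dek)

# To read the data back, the wrapped DEK must first be unwrapped by the KMS.
plaintext = Fernet(kms.unwrap(wrapped_dek)).decrypt(ciphertext)
assert plaintext == b"quarterly financials"
```

Losing the KEK makes every wrapped DEK, and therefore all the data, unrecoverable, which is exactly the point made above about key recovery planning.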

4.3.5 Bringing Privacy, Security, and Ownership of Data into Harmony
Data sovereignty, data security, and data ownership must all be aligned. Because digital personal data is subject to the laws of the region in which it resides, it is crucial for businesses to have a firm grasp of their computing environment and the relationships between its various components. It is also important to know how different countries' laws treat the privacy and security of health data.

4.3.6 IaaS
IT departments can use infrastructure as a service to provision the resources needed to install and run software (operating systems and applications). Protecting an IaaS deployment revolves around the virtual environment's hypervisor. IaaS security focuses on three primary domains:

4.3.6.1 Hypervisor and Virtual Machines (VMs)
The proliferation of virtual solutions throughout the data fabric increases exposure to VM-based attacks. Because a compromised VM can compromise other VMs hosted on the same hypervisor or physical server, the effect of these attacks can be exponential. Attacks on one VM can spread to others – for example through a rootkit affecting the VMs, the APIs, and the software installed on the infected VM. Traditional attack vectors, such as distributed denial of service (DDoS) and DNS-based attacks, must also be defended against to keep VMs safe [25].
Virtual infrastructure and network. The virtual network infrastructure is often the target of attacks.


These attacks are typically directed against the fabric's virtual switch or router. Threats to virtual networks include hopping between virtual local area networks (VLANs) and tampering with or erasing entries in address resolution protocol (ARP) tables.

4.3.6.2 Managing Tasks
The main management tasks in IaaS include colocation, tenant management, and network segmentation. "Colocation" means the sharing of physical resources: hardware such as processors, disks, and RAM is shared across several VMs, and sharing these resources opens up more potential entry points for attackers. Tenancy, or multitenancy, assures the many companies or customers using the same cloud-hosted applications or infrastructure that their data is kept separate. Segmenting a network provides logical separation and isolation, reduces the attack surface, and increases visibility at strategic points.

4.3.7 PaaS
Platform as a service helps move applications to the cloud. PaaS security focuses on four important areas:
– Separation of components and resources: A PaaS solution's users/tenants must always operate independently of one another. To prevent configuration or system changes from affecting multiple tenants, administration and monitoring should be separated and protected independently [26].
– Per-user and per-service access control: Integrate with the PaaS solution's data fabric and all its services. To avoid privilege escalation and inheritance problems, each service should be compartmentalized and given only the minimum access or permissions it needs.
– Managed user access: Restricts a user's access to computing resources – printers, telephones, and computer data are typical examples. To streamline authentication, many organizations now use single sign-on throughout their whole data fabric.
– Safeguarding data and preventing data loss: Essential safeguards against hostile activity in the data fabric include anti-malware, anti-backdoor, and anti-trojan software, a tried-and-tested software development life cycle (SDLC), code reviews, training, audits, and controls.
SaaS is a model in which applications are made available to users on a subscription basis and executed on remote servers. Thin clients, web browsers, and the applications' own APIs make it possible to access and integrate these programs [27]. SaaS security focuses on three primary domains:


"Data segregation" is the ability to identify and separate data belonging to different users or customers across all environments. It is crucial in the modern cloud setting because segmentation matters at both the logical and the application level; data from different users must be kept separate, and data fabric technologies should help enable this. "Data access and policies" refers to restricting access so that data is available only to the people or organizations who genuinely require it. The security posture of data fabric solutions, including access restrictions for all cloud solutions, should reflect organizational security rules. "Web SaaS" is a common term for web application security in the context of software as a service. Security features such as URL filtering, virus scanning, peer-to-peer limits, vulnerability intelligence, and centralized policy control can be integrated and are crucial for managing cloud-based applications. Web security as a service (SECaaS) can be deployed in a variety of ways throughout the data fabric. Most deployments use proxy techniques, typically via browser proxy settings or transparent forwarding by a firewall, router, or similar device. Web requests are often sent from on-premises access gateway appliances to the cloud, where the web application security services are hosted. Because an online SECaaS solution is more dynamic and widely reachable, agents installed on endpoints are common for mobile workforces. Combining web SECaaS approaches is an effective way to guarantee comprehensive coverage of web application security services. A tenant-scoping sketch follows.
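To make the data segregation and access policy ideas concrete, here is a minimal Python sketch of a tenant-scoped query filter with a simple role check. The data model and roles are hypothetical illustrations, not the mechanism of any particular SaaS platform.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    tenant_id: str
    role: str  # e.g. "viewer" or "editor"

# Hypothetical multi-tenant record store: every row carries its tenant_id.
RECORDS = [
    {"tenant_id": "acme", "id": 1, "payload": "acme forecast"},
    {"tenant_id": "acme", "id": 2, "payload": "acme invoices"},
    {"tenant_id": "globex", "id": 3, "payload": "globex forecast"},
]

def fetch_records(user: User) -> list[dict]:
    """Data segregation: a user only ever sees rows for their own tenant."""
    return [r for r in RECORDS if r["tenant_id"] == user.tenant_id]

def update_record(user: User, record_id: int, payload: str) -> None:
    """Access policy: least privilege - only editors may modify data."""
    if user.role != "editor":
        raise PermissionError(f"{user.user_id} is not allowed to edit records")
    for r in fetch_records(user):          # still scoped to the user's tenant
        if r["id"] == record_id:
            r["payload"] = payload
            return
    raise KeyError("record not found for this tenant")

viewer = User("alice", "acme", "viewer")
print(fetch_records(viewer))               # only acme rows are visible
```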

4.3.8 Endpoints Layer
The data fabric's endpoints, represented by storage systems, make up the endpoints layer. These endpoints are either purpose-built or software-defined storage systems, and their functions and form factors vary with the workload and type of system. The deployment models of the endpoints determine the reach and structure of the fabric, and the NetApp solution also protects data at this layer [28].

4.3.9 Selection of Measures
NetApp's data fabric is built upon the ONTAP software. It supports a wide variety of scenarios and use cases because of its flexible deployment choices, extensive data and storage management features, and comprehensive ecosystem integration. ONTAP is the built-in data management software for NetApp's AFF and FAS storage systems.


Fabrics built with ONTAP software-defined storage can extend to ONTAP Select on commodity DAS and to ONTAP Cloud in the cloud, so NetApp ONTAP software gives the hybrid cloud consistent data management capabilities. ONTAP Cloud protects data stored in the cloud using AES 256-bit software encryption, while systems in the FAS, SolidFire, and E-Series families use self-encrypting disks for a higher level of hardware-based security. AltaVault encrypts data at rest and in transit to its designated cloud service. Together, the encryption features of ONTAP and AltaVault give customers full control over their data's security at every stage. Users typically set up their own KMIP-compliant key management servers to store and distribute encryption keys [29]. Workload and business needs determine whether and when additional storage systems, such as third-party arrays, are introduced.

4.3.10 The Transport Layer
Establishing communication between the endpoints requires a transport mechanism to move data across the data fabric. The primary transport of the data fabric is NetApp SnapMirror. With the SnapMirror protocol, no data is lost when it is transferred between fabric endpoints; applications can move data to the cloud, across storage tiers, or between clusters without interruption, and the transfers are invisible to the applications [30]. The ONTAP family of solutions (FAS, AFF, ONTAP Cloud, ONTAP Select) all store data in the WAFL file system, while the other fabric endpoints use their own data formats. Whether an endpoint uses SnapMirror replication or SnapVault backup, the SnapMirror transport provides seamless data movement between them. When used across ONTAP endpoints, this transport not only enables interoperability but also preserves deduplication and compression efficiency, so data is not rehydrated. It is the fastest way to transfer large amounts of data over the network, which matters especially when the fabric extends over wide area networks in hybrid cloud systems. A data fabric transport ensures the safe transmission of information between endpoints and makes it possible for each endpoint to acquire and use the data in its native form. A conceptual sketch of this incremental, snapshot-based transfer follows.
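The following Python sketch models, at a very high level, the incremental, snapshot-difference style of transfer that a replication transport such as SnapMirror performs: only blocks that changed since the last common snapshot are shipped. It is a conceptual illustration under assumed data structures, not NetApp's protocol or on-the-wire format.

```python
# A "volume" is modeled as a dict of block_id -> content; a "snapshot" is a
# frozen copy of that mapping taken at a point in time.
def take_snapshot(volume: dict[int, bytes]) -> dict[int, bytes]:
    return dict(volume)

def incremental_transfer(source: dict[int, bytes],
                         last_snapshot: dict[int, bytes],
                         destination: dict[int, bytes]) -> int:
    """Ship only blocks added or changed since the last common snapshot."""
    sent = 0
    for block_id, content in source.items():
        if last_snapshot.get(block_id) != content:
            destination[block_id] = content       # replicate changed block
            sent += 1
    for block_id in set(last_snapshot) - set(source):
        destination.pop(block_id, None)           # propagate deletions
    return sent

src = {1: b"A", 2: b"B", 3: b"C"}
dst = dict(src)                 # baseline transfer already done
snap = take_snapshot(src)

src[2] = b"B2"                  # block modified after the baseline
src[4] = b"D"                   # new block written

changed = incremental_transfer(src, snap, dst)
print(f"blocks shipped: {changed}")   # 2, not the whole volume
assert dst == src
```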

4.3.11 Protected Connections to the Cloud
Through the fabric, endpoints gain cloud connectivity and the ability to exchange data with FAS systems and with one another, both of which are critical for meeting service level objectives (SLOs) and reducing total cost of ownership (TCO). Using the SnapMirror transport, data can easily be synchronized between ONTAP Cloud and FAS systems deployed as NPS.


4.3.12 Security, Privacy, and Fast Data Transfers
AltaVault can back up data from any storage system, and its transfers are streamlined and rapid because they use the SnapVault backup transport. When used alongside data stored in ONTAP, incremental data from a NetApp Snapshot copy can be sent to AltaVault in its native format. The reduced load on the source and target systems allows backup and restore sessions to run simultaneously and recovery point objectives (RPOs) to be tightened. AltaVault's SnapMirror-based data transmission enables an "incremental forever" backup technique, which does away with periodic full backups; this approach requires less destination storage space, speeds up data recovery, and shortens backup windows.

4.4 Developing a Larger Data Network
The more connections IT has in its fabric, the more tools it has at its disposal to solve customer problems.

4.4.1 Data Tiering with Automation
Automated data tiering is the practice of dynamically and transparently moving data across storage tiers that vary in price and performance, according to predetermined policies. Data on the lower tiers remains accessible, but with longer wait times. NetApp ONTAP Flash Pool aggregates, which combine solid-state drives and hard disk drives, are one kind of automatic data tiering: high-capacity HDDs are slower than SSDs but cost less, while SSDs are the fastest and most expensive option. Because frequently accessed data needs higher throughput and lower latency, hybrid configurations are becoming more popular, and ONTAP ensures that the most frequently accessed information is always stored on the SSD tier. Data has a temperature, and that temperature changes, which makes automation essential to the tiering process: what is hot today may be cold tomorrow, and no one could keep up with migrating data between tiers manually. Managing it with bespoke tools would add complexity for the client and waste time in data transfer. Because ONTAP knows how data is being used, it can move data to the optimal tier, based on actual consumption patterns, without manual intervention.


The end user benefits from automated data placement, which exploits both the high speed of SSDs for frequently accessed data and the lower cost per capacity of HDDs. A simplified sketch of such a temperature-based policy follows.
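The following Python sketch illustrates the kind of temperature-based placement policy described above: blocks whose recent access count crosses a threshold are promoted to the fast tier, and cold blocks are demoted. Thresholds, counters, and tier names are assumptions for illustration, not ONTAP's actual algorithm.

```python
from collections import Counter

HOT_THRESHOLD = 3          # accesses per evaluation window (assumed value)

class TieredStore:
    def __init__(self) -> None:
        self.ssd: set[str] = set()     # fast, expensive tier
        self.hdd: set[str] = set()     # slow, cheap capacity tier
        self.access_count: Counter[str] = Counter()

    def write(self, block_id: str) -> None:
        self.hdd.add(block_id)         # new data lands on the capacity tier

    def read(self, block_id: str) -> None:
        self.access_count[block_id] += 1

    def rebalance(self) -> None:
        """Promote hot blocks, demote cold ones, then reset the counters."""
        for block_id in self.ssd | self.hdd:
            hot = self.access_count[block_id] >= HOT_THRESHOLD
            (self.ssd if hot else self.hdd).add(block_id)
            (self.hdd if hot else self.ssd).discard(block_id)
        self.access_count.clear()

store = TieredStore()
for b in ("a", "b", "c"):
    store.write(b)
for _ in range(5):
    store.read("a")                    # "a" becomes hot
store.rebalance()
print("SSD tier:", store.ssd)          # {'a'}
print("HDD tier:", store.hdd)          # {'b', 'c'}
```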

4.4.2 StorageGRID Object Storage
StorageGRID offers reliable, cross-system, software-defined object storage whose services benefit both the data fabric and an object fabric; it can also be used as a standalone object-based fabric. Cloud-native applications can use S3, CDMI, or Swift as their data storage and retrieval protocols. StorageGRID's strong object capabilities – including multisite erasure coding, a global namespace, and policy-based object management and migration – make it a genuine multisite infrastructure. Although the NetApp data fabric was built for NetApp products, third-party devices can participate in it. NetApp's data fabric can be extended in several ways, letting businesses make the most of their storage investments in third-party arrays, commodity direct-attached storage, and open-source software-defined storage. A short example of S3-style access follows.
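Because StorageGRID supports the S3 protocol, a standard S3 client can usually be pointed at it. The Python sketch below uses boto3 with a placeholder endpoint URL, bucket name, and credentials, which are illustrative assumptions; consult the actual endpoint and credentials of the deployment.

```python
import boto3

# Hypothetical endpoint and credentials for an S3-compatible object store
# (e.g., a StorageGRID gateway); replace with real values for your deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal:8082",
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

bucket = "analytics-archive"          # assumed bucket name

# Write an object into the fabric's object repository ...
s3.put_object(
    Bucket=bucket,
    Key="reports/2023-q1.csv",
    Body=b"region,revenue\nEMEA,42\n",
)

# ... and read it back from any endpoint with access to the same namespace.
obj = s3.get_object(Bucket=bucket, Key="reports/2023-q1.csv")
print(obj["Body"].read().decode())
```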

4.4.3 Integration with Preexisting Object-Based Storage

With the ONTAP software feature NetApp FlexArray, businesses can make the most of their investments in third-party arrays by integrating them into their data fabric. In addition to allowing non-NetApp arrays to leverage SnapMirror data transport links, multiprotocol interoperability, cloud connectivity, and systematic information management procedures and tools, FlexArray also improves storage security and effectiveness. SnapMirror allows IT to create a DR fabric in which an array that is not FAS may be connected as an endpoint to a FAS array or ONTAP Cloud. Users may also move their data from non-NetApp systems to native NetApp AFF/FAS storage. Transferring information from a third-party storage system into a pool managed by NetApp uses a process known as foreign logical unit number import (FLI). By using the FLI architecture, SLO management may be optimized via array migrations, cloud platforms, or data tiering.


4.4.4 The Integration of Direct-Attached Storage for Commodities

When connecting to the data fabric, direct-attached storage (DAS) machines may employ ONTAP Select (file or block), SolidFire Element X (block), or StorageGRID (object) to control the underlying physical storage. While ONTAP Select and StorageGRID are VM installations, Element X requires physical access to the host. The software's entire suite of data management features becomes available to programs and ecosystems once it has been installed on a server. With SolidFire Element X, SPs may expand their block offerings to meet rising customer demand. In addition, they may use inexpensive hardware and provide varying QoS options to their customers.

4.4.5 Implementation of Open-Source Software-Defined Storage

Enterprise DAS systems built on E-Series hardware provide advantages over traditional, in-house DAS in terms of performance, scalability, cost, and security. E-Series provides a hardware platform on which open-source SDS solutions like CEPH and OpenStack Cinder may be deployed. E-Series may also serve as an object store for OpenStack Swift thanks to the addition of StorageGRID functionality. The SafeStore encryption feature of E-Series is useful in both of these cases for further safety.

4.4.6 Replication of Data to Cloud-Based Object Repositories

Securely backing up any storage array to any cloud is possible with an AltaVault appliance, and AltaVault supports a broad range of backup programs. It is a modern replacement for tape systems that offers the cost savings and geographic distribution perks of cloud deployments, as well as the added security of data encrypted at all stages of its journey to, and while stored in, the object store. Any storage array inside the data fabric may be used to recover data that has been backed up into the fabric.

4.4.7 Distributed Object Storage

Object stores may be used as a backup target for AltaVault, a destination for direct volume backup and restoration, and as a tiering mechanism for data in the data fabric. In these scenarios, the repository's data is managed by ONTAP, SolidFire, or AltaVault, and the repository's implementation and settings are both customizable. Many organizations choose AWS's S3 object store for their cloud data storage needs, and other public cloud object stores are also supported by several NetApp products. For local installations, the object protocol may be implemented with technologies such as NetApp StorageGRID, or with open-source projects such as OpenStack Swift and CEPH. The repository may use a NetApp storage array or any other commercially available hard disk storage for its back-end physical storage needs.

4.4.8 Adding More Endpoints and Clouds

By using different cloud endpoints, organizations may protect their data in the event that one of their clouds is compromised, maintain consistency in asset management across locations, and avoid cloud vendor lock-in. NetApp's data fabric is backward compatible with previous generations of virtualization technology and forward compatible with emerging cloud architectures.

4.5 Storage Management Layer

Technologies that provide hardware components with high availability, a scalable design, and storage efficiency are all part of the data fabric's storage management layer. Technologies connected with this layer also handle automatic data tiering inside the storage system.

4.5.1 Excellent Reliability and Stability

A high availability level indicates that data is easily accessible. In the case of disk, network, or node failures, a highly available system should seamlessly and automatically adjust to maintain high data service levels. The reliability of a storage device is evaluated by how probable it is that data will be lost if something goes wrong. Storage systems provide data integrity and availability using mechanisms including RAID arrays, SolidFire Helix, erasure coding, replication, and cluster node redundancy with multipathing. It is crucial to know whether the mechanisms chosen for your environment and frequency of usage are sufficient for the applications you run, since availability and durability ratings vary with the storage system configuration.
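A back-of-the-envelope calculation shows why replication and node redundancy improve availability; the 0.99 per-copy figure and the assumption of independent failures below are illustrative, not measured values for any particular system.

```python
# If one copy of the data is reachable with probability `a`, then data
# served from n independent copies is unavailable only when every copy
# is down at the same time. The 0.99 figure is an assumed per-node value.

def availability(per_copy_availability: float, copies: int) -> float:
    unavailable_all = (1 - per_copy_availability) ** copies
    return 1 - unavailable_all

for copies in (1, 2, 3):
    a = availability(0.99, copies)
    print(f"{copies} copies -> {a:.6f} availability "
          f"({(1 - a) * 8760:.2f} expected hours of downtime per year)")
```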


4.5.2 Technology with a Scalable Design

To adapt to shifting storage demands, IT organizations may use a scale-out storage architecture. Upgrading or expanding a system under a conventional design necessitates a scheduled period of downtime.

4.5.3 Managing Information and Data with ONTAP

ONTAP software enables vertical and horizontal scalability for IT organizations without compromising service quality. Scaling up means building a more powerful storage array, whereas scaling horizontally means adding cluster nodes to spread workloads. Adding flash capacity increases performance, while adding high-density drives increases raw capacity.

4.5.3.1 Remote Data Storage Systems

Storage arrays from third parties can also be updated or extended without any downtime when they are connected to the data fabric using ONTAP FlexArray.

4.5.4 SolidFire

SolidFire's scale-out architecture ensures linear, predictable growth when adding nodes to the cluster, with zero downtime. As the size of the cluster increases, data is quietly redistributed across all nodes. Native QoS features are very useful for cloud service companies and organizations building private clouds because they allow the precise implementation of performance guarantees (SLAs) for all applications, workflows, and tenants. The StorageGRID Webscale platform provides a very scalable method of storing objects. Its software-defined architecture is designed to accommodate many objects and a large amount of storage in a variety of places under a single namespace. Increases in capacity and size for StorageGRID Webscale may be implemented without disrupting service by adding new nodes and installing updated software. With QoS features, a shared infrastructure may guarantee a certain degree of performance for individual tenants and applications, regardless of the actions of other users. To provide the constant, predictable, and guaranteed performance required by each deployed application, SolidFire QoS makes it possible to set a permitted minimum, maximum, and burst level of performance. The need for ongoing monitoring is eliminated, and IT processes are streamlined, when applications get the resources they demand in a predictable manner.
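The sketch below captures the minimum/maximum/burst idea in a toy per-second policer, where unused sustained headroom is banked as burst credit; the numbers, credit rule, and class names are assumptions for illustration, not SolidFire's actual scheduler, and the guaranteed minimum is shown only as a reserved floor rather than enforced here.

```python
# Toy min/max/burst IOPS policy: each second a volume may issue up to its
# sustained ceiling, plus any burst credit banked from quieter seconds.

class QosPolicy:
    def __init__(self, min_iops, max_iops, burst_iops):
        self.min_iops = min_iops      # guaranteed floor (reserved by admission control; not modeled here)
        self.max_iops = max_iops      # sustained ceiling
        self.burst_iops = burst_iops  # short-term ceiling
        self.credits = 0              # unused sustained headroom banked for bursts

    def allow(self, requested_iops):
        """Return how many IOPS this volume may issue in the next second."""
        ceiling = min(self.burst_iops, self.max_iops + self.credits)
        granted = min(requested_iops, ceiling)
        # Spend credits when bursting above max, bank them when running below.
        self.credits += self.max_iops - granted
        self.credits = max(0, min(self.credits, self.burst_iops - self.max_iops))
        return granted

policy = QosPolicy(min_iops=500, max_iops=1000, burst_iops=2000)
print(policy.allow(300))    # quiet second: 300 granted, headroom is banked
print(policy.allow(1800))   # spike: 1700 granted, above max thanks to banked credits
print(policy.allow(1800))   # credits spent: capped back at the sustained max (1000)
```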

4.5.4.1 Storage Efficiency

Storage efficiency features built into ONTAP software and the technologies available throughout the data fabric provide more efficient use of storage space, a smaller infrastructure footprint, reduced energy and cooling costs, streamlined data maintenance, and accelerated storage operations.

4.5.5 Data Management Layer

IT needs a data management layer to provide data services that are consistent in all contexts, including:
– saving time and effort via better data management;
– tools for replicating and backing up data;
– access protocols for data;
– effective data management;
– engineering methods for replicating existing systems; and
– replication methods that can prevent data loss of any type.

To save costs and enhance operating procedures while meeting stringent RPO and RTO criteria, businesses employ NetApp SnapMirror. SnapMirror is a unified method of replication that may be implemented at any node in the data fabric. SnapMirror's robust IP-based protocol is well suited for data transmission across WANs, making it an excellent choice for transferring information between regions and clouds. When you start a new SnapMirror session, it duplicates every bit of data from the source volume to the target. Snapshot copies are then used to perform continuous incremental updates, which reduces the amount of data sent by delivering only the modified blocks. Data that has been compressed or deduplicated at the source remains in its compressed or deduplicated state while being transferred between ONTAP nodes.

Keeping a copy and bringing it back to life: NetApp's data fabric provides various choices for handling backup and restoration procedures. For backing up data from storage arrays to cloud object repositories, businesses may use either AltaVault or Amazon Simple Storage Service. In addition to its adaptability in deployment, AltaVault boasts ingest speeds of over 9 TB per hour, encryption using keys managed by the client, and local caching to reduce RTO for recently created data. AltaVault allows tiering across several object storage types, which helps save costs (e.g., AWS S3 and Glacier). Any main storage endpoint with a SnapMirror data transfer connection may use SnapVault to create backups of its Snapshots. With SnapVault, you can create read-only Snapshot copies of your data across many computers and quickly and easily back them up to a secondary, central server. For instance, using ONTAP Select with SnapVault, data can be backed up quickly and simply from any branch office to the central data facility. Native backup and restoration using Snapshot copies is built into SolidFire systems, which are also compatible with S3 and Swift object stores; with this built-in feature, you won't have to install or pay for any additional backup software. Network Data Management Protocol (NDMP) is an open standard for backing up NAS systems over the network and is supported by ONTAP. NDMP provides standardized instructions for backing up and restoring file servers, reducing the amount of custom code required for various applications. The ability to skip backup servers and save information on tape directly improves the speed and efficiency of NAS data protection for NetApp clients.

4.5.6 Methodologies for Obtaining Access to Information

Modern applications need data access methods that scale with the size of the data fabric, and different systems have different protocols for facilitating data access. SnapMirror data fabric transport allows for interoperability between disparate endpoints, allowing data to flow across them while each endpoint serves it through its own protocols.

4.6 Expanding the Capabilities of the Data Fabric Horizontally and Vertically

In many IT departments, managing server, network, and storage infrastructure is becoming a generalist's job: IT jacks of all trades oversee the whole system. The old methods of keeping tabs on the data center's application servers, networks, and storage devices, which usually involved a wide variety of tools from multiple suppliers, are no longer applicable. NetApp has built a technology stack to solve this problem and give businesses a variety of cloud SPs and application ecosystems to pick from. APIs in the base layer of this technology stack make it possible to integrate management and automation systems and processes, both commercial and home-grown. Enterprises may avoid becoming NetApp specialists as a direct consequence of these ecosystem synergies, which allow them to:
– Use a wider variety of tools and services. They need only learn about the products in their own environment.
– Reduce their reliance on custom-built solutions by using NetApp's tools and frameworks.

– Safeguard their databases, SharePoint, and Exchange installations without learning the specifics of each platform's data security architecture.
– Connect their legacy systems to the current IT infrastructure without modifying existing procedures.

4.6.1 Integrating Ecosystems at a Higher Level

NetApp is devoting considerable resources to ensuring that its cutting-edge data management capabilities are seamlessly integrated to provide customers with the virtualization and cloud data management they need. The OpenStack open-source management environment is attractive to many businesses, and NetApp contributes to it to help those businesses. Managing ONTAP Cloud instances in the cloud is straightforward using NetApp OnCommand® Cloud Manager. The same APIs that are exposed to customers and partners are used to construct these management plane connections.

4.6.2 Incorporation of VMware

Self-service is made possible by granular data management for VMs, which is securely exposed to VM and application owners as part of NetApp's ecosystem integration with VMware. Initially, NetApp enhanced the native VMware vSphere UI with features like quick VM cloning. Nowadays, thanks to strong integration with VMware, the data management capabilities of NetApp's data fabric may be used seamlessly by way of vSphere's native administration APIs and graphical user interfaces. There are several built-in connections, such as:
– SnapMirror's compatibility with VMware Site Recovery Manager, which allows for the automated execution of disaster recovery tests and failover/failback scenarios.
– Integration with VMware vStorage APIs for Array Integration (VAAI), which offloads cloning and other storage-intensive operations to the array. VVols allows for the connection of vSphere with NetApp's data fabric, enabling service-aware provisioning and administration of virtual disks and VMs, as well as granular data management of virtual disks.
– The NetApp for ONTAP Storage Systems vRealize Operations (vROps) Management Pack, which adds metrics and analytical insights unique to vROps, and the testing and certification of VMware's vCloud Air public cloud platform for use with NPS solutions.


4.6.2.1 Microsoft's Windows Server and Azure's Compatibility

Hyper-V and System Center Virtual Machine Manager (VMM) are two examples of Microsoft's private cloud solutions that are based on the company's server virtualization and management technologies. These systems are designed to work with Microsoft's Azure cloud platform. In the event of a catastrophe, businesses may use Microsoft's Azure Site Recovery (ASR), a disaster recovery solution that operates in a hybrid cloud by connecting independent private and public cloud networks. Using ASR, VMs may be moved between Hyper-V hosts in on-premises data centers, secondary private cloud data centers, and Azure cloud data centers. In order to make Azure Site Recovery easier, enterprises may copy their data fabric from SAN to NPS. NetApp's data management products and Microsoft's System Center products are compatible with one another since both companies are willing to adopt and implement industry standards. Through System Center Virtual Machine Manager and System Center Operations Manager, the Microsoft System Center administrator is able to:
– track the capacity and availability of servers and storage for Microsoft Windows Server Hyper-V VMs;
– use alerts and health explorer views in System Center Operations Manager to pinpoint the source of issues; and
– automate routine tasks.
An OpsMgr management server resource pool allows for this by spreading the load over many servers to guarantee continuous service. If ONTAP is deployed in a private cloud, leveraging Windows Azure Pack (WAP) might make it easier for SPs to manage the infrastructure.

4.6.3 Compatibility with OpenStack

OpenStack is the most popular open-source option for cloud computing, and it has seen a lot of development in recent years. To facilitate the rapid, smooth, and safe rollout of cloud services, NetApp has integrated OpenStack into its data fabric. Integrations with NetApp's OpenStack block storage (Cinder) drivers include the following:
– Provisioning and administration of storage and data. NetApp storage provisioning and data management features, including full disk encryption (FDE), are made possible using specialized NetApp drivers for ONTAP, SolidFire, and E-/EF-Series systems.
– Capabilities for a catalog of storage services. With the help of the NetApp Cinder drivers, IT can compile a catalog of storage services that caters to the demands of a wide variety of applications and tenants with regard to efficiency, performance, availability, and security (through FDE); a minimal sketch of this catalog idea follows this list.


– Improved copy offload for persistent instance creation. NetApp's cloning technology is used by the ONTAP Cinder drivers, allowing for the rapid and efficient creation of numerous VMs from Glance images.
NetApp is also leading the charge to implement Manila, OpenStack's shared file system service; shared file systems are foundational to much of the total storage delivered globally. The advantages of cloud platforms are being considered in the design of new applications as those platforms become more widely available. This usually entails switching to a SOA or a microservices architecture. Both isolate a subset of an app's functionality into a standalone unit that may be modified, updated, or even removed with little impact on the rest of the app's operations. Containers are the common name for these kinds of parts. To ease the transition to container-based, distributed microservices, Docker provides an abstraction layer over container instantiation, separating containers from the underlying operating system. Persistent data access from several places at once is complicated by the distributed nature of this application architecture. NetApp's Docker volume plug-in solves these problems by making it possible to manage and connect persistent storage devices to containers running on different hosts. This provides enterprise-level storage performance, efficiency, and flexibility while relieving the application of data management responsibilities. Currently, the plug-in is compatible with ONTAP systems using the NFS and iSCSI protocols; support for more endpoints is planned.
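To make the "catalog of storage services" idea above more concrete, here is a hedged sketch that matches workload requirements to advertised backend capabilities; the tier names, capability fields, and cost-based matching rule are invented for the example and are not the Cinder scheduler's actual logic.

```python
# Toy storage service catalog: each entry advertises capabilities, and a
# request is matched to the cheapest tier that satisfies all requirements.
# Names and numbers are made up for illustration.

CATALOG = [
    {"name": "gold",   "protocol": "iscsi", "min_iops": 20000, "encrypted": True,  "cost": 3},
    {"name": "silver", "protocol": "nfs",   "min_iops": 5000,  "encrypted": True,  "cost": 2},
    {"name": "bronze", "protocol": "nfs",   "min_iops": 500,   "encrypted": False, "cost": 1},
]

def pick_service(required_iops, needs_encryption):
    candidates = [s for s in CATALOG
                  if s["min_iops"] >= required_iops
                  and (s["encrypted"] or not needs_encryption)]
    if not candidates:
        raise LookupError("no service tier satisfies the request")
    return min(candidates, key=lambda s: s["cost"])

print(pick_service(required_iops=4000, needs_encryption=True)["name"])   # silver
print(pick_service(required_iops=15000, needs_encryption=True)["name"])  # gold
```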

4.6.4 Incorporating ONTAP Cloud Nodes

Managing data fabric cloud environments requires setting up ONTAP Cloud implementations in cloud infrastructures, and OnCommand Cloud Manager is the main gateway for doing so. As the data fabric grows, customers will be able to manage all of their data in a unified fashion via the single Cloud Manager platform. Currently, both AWS and Microsoft Azure may make use of Cloud Manager and ONTAP Cloud. Common tasks like creating new objects, adding new resources, and duplicating an existing one are all made easier using Cloud Manager's handy wizards. Building a hybrid cloud is as easy as dragging and dropping FAS nodes onto ONTAP Cloud nodes to create a SnapMirror relationship for data redundancy. Additionally, OnCommand Cloud Manager offers a RESTful API set that may be used by other software applications.
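As a rough illustration of driving such a RESTful management endpoint programmatically, the snippet below posts an authenticated request with Python's requests library; the host name, route, headers, and payload fields are hypothetical placeholders rather than documented Cloud Manager APIs, so consult the product documentation for the real routes.

```python
# Hypothetical REST automation against a management endpoint. The URL,
# token header, route, and payload fields are invented for the example.

import requests

BASE_URL = "https://cloud-manager.example.internal/api"   # assumed address
HEADERS = {"Authorization": "Bearer EXAMPLE_TOKEN"}        # assumed auth scheme

def create_replication(source_volume: str, target_volume: str) -> dict:
    payload = {"source": source_volume, "target": target_volume}
    response = requests.post(f"{BASE_URL}/replications",   # hypothetical route
                             json=payload, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    job = create_replication("fas01:/vol/app_data", "ontap-cloud-1:/vol/app_data_dr")
    print("replication job accepted:", job)
```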

4.6.5 Applications and Services Layer

NetApp and its partners' integrated apps and services for the data fabric provide high-value solutions by capitalizing on the fabric's core features.


Applications are hosted and managed by IT as software products that may run locally or in the cloud using infrastructure as a service. Services are resources that are accessed through a network (often through a SaaS delivery model) and may be paid for in several different ways. Data fabric integrated applications and services tackle a wide range of issues, including data analytics, content delivery, fabric measurement and monitoring, information security, data movement among collection edge devices, copy data management, data ethics guidelines, data authentication protocols, and many others. This layer ensures compatibility across nodes, geographies, and ecosystems, which is crucial for effective data fabric administration. Management tools for a data-centric strategy will improve in tandem with the development of the data fabric.

4.6.6 Hybrid Cloud Settings for the Analysis of Unstructured Data

Data analytics may be used to sift through the mountains of unstructured data produced by data centers to find useful insights. IT may run an in-house analytics platform (like Hadoop) on-premises or in the cloud, or it can use a third-party solution like AWS's Elastic MapReduce (EMR). The data fabric now supports both choices.

4.6.7 Hadoop

Using the NetApp NFS connector, businesses may replace or supplement the Hadoop Distributed File System (HDFS) with NFS in their managed Hadoop environments. HBase (a columnar database) and Spark (a processing engine compatible with Hadoop) are two examples of additional Apache projects that the NFS connector works with and supports. The NFS connector therefore works with a wide variety of workloads, including batch, in-memory, streaming, and more. The connector works with ONTAP on-premises, in the cloud with ONTAP Cloud, and near the cloud with NPS.
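As a hedged illustration of the same job reading data either from HDFS or from an NFS export, the PySpark snippet below uses plain hdfs:// and file:// paths; the mount point and file names are assumptions, and the NetApp NFS connector itself plugs in at the Hadoop file system layer rather than through file:// URIs.

```python
# Illustrative PySpark job reading the same dataset from HDFS or from an
# NFS mount that is visible on every worker node. Paths are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nfs-vs-hdfs-demo").getOrCreate()

# Classic HDFS location.
hdfs_df = spark.read.json("hdfs:///warehouse/clickstream/2023-01-01.json")

# The same data served from an NFS export mounted on every worker.
nfs_df = spark.read.json("file:///mnt/nfs_data/clickstream/2023-01-01.json")

print(hdfs_df.count(), nfs_df.count())
spark.stop()
```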

4.6.8 Using Amazon Web Services' Elastic MapReduce

For businesses that would rather not build and maintain their own analytics engine, AWS provides the EMR service. All EMR data is loaded from S3 buckets on AWS, a restriction that creates difficulties for hybrid cloud setups. The sync utility and the custom scripts often used in tandem with it are two examples of ineffective and laborious tools for accomplishing these aims. The question is how businesses can move their unstructured data from on-premises file shares to the cloud, where analytics may be performed and the findings retrieved in a safe and timely manner. What strategy do they use for the conversion of files and objects? How do they handle versioning of files and objects? IT requires a straightforward, automatic method of transferring files to and from AWS S3 buckets so that cloud services may be executed and data can be returned to its proper location. Through AWS and the NetApp data fabric Cloud Sync service, businesses may get insight from previously unusable data stored in unstructured files. Cloud Sync can efficiently move data sets across AWS, convert NFS data sets to S3 objects and vice versa, and initiate a wide range of cloud services, such as Amazon EMR and the Amazon Relational Database Service. You may use Cloud Sync with any network file system that uses the NFS v3 protocol (NetApp or third party). By recursively searching directories and sending files to AWS in parallel, it efficiently handles massive quantities of data. After a baseline copy is made, further changes to the data set are synchronized in near real time using Cloud Sync's continuous sync feature. Cloud Sync sends the findings of an AWS cloud analytics job back to wherever the data was initially stored, whether locally or in the cloud, immediately after the data transfer.
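The recursive, parallel file-to-object pattern described above can be sketched in a few lines of Python; the mount point, bucket name, and worker count are assumptions for illustration, and this is not the Cloud Sync implementation.

```python
# Walk an NFS mount and upload each file to S3 as an object, in parallel.
# Bucket and mount path are invented placeholders for the example.

import pathlib
from concurrent.futures import ThreadPoolExecutor

import boto3

NFS_ROOT = pathlib.Path("/mnt/nfs_share")          # assumed NFS mount point
BUCKET = "analytics-staging-example"               # assumed bucket name
s3 = boto3.client("s3")

def upload(path: pathlib.Path) -> str:
    key = str(path.relative_to(NFS_ROOT))          # file path becomes the object key
    s3.upload_file(str(path), BUCKET, key)
    return key

files = [p for p in NFS_ROOT.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=16) as pool:   # parallel transfers
    for key in pool.map(upload, files):
        print("uploaded", key)
```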

4.6.9 Safeguarding Information in Online Programs and Network Drives

Object stores are being used by IT departments as part of their backup and archiving plans. They gain the cost benefits of the cloud while protecting money already spent on backup hardware. For backing up data, cloud storage may provide many advantages, including reduced per-gigabyte costs, improved speed, and web scalability. Application and data protection features that build on the cloud are made possible by SnapCenter integration with data fabric endpoints. With SnapCenter, businesses have a simple, scalable, and secure solution for creating and managing private cloud archives.

4.6.10 Secure File Sharing Data

With SnapCenter, administrators of Windows and UNIX file shares may centrally manage all Snapshot copy-based data protection processes. Backups created using Snapshot copies may be stored on a secondary ONTAP system or, via AltaVault, in a public or private cloud object store. SnapMirror is the data protocol that integrates the ONTAP and AltaVault endpoints for seamless data replication and transfer. After the first full baseline is created, this integration permits an unlimited number of incremental backups.


4.6.11 Interoperability of Applications

SnapCenter allows database administrators and app developers to handle their data independently. It enables users to automate crucial enterprise application life cycle tasks by combining the ONTAP data management capabilities with the data transport's replication capabilities. These tasks include:
– advanced cataloging, indexing, and searching features that facilitate storage space organization, planning, backup, and restoration;
– cutting data recovery times for application servers from hours to minutes by utilizing NetApp Snapshot, SnapMirror, and SnapVault technologies; and
– speeding up the rollout of new releases and applications by facilitating clone life cycle management, including creation, refresh, expiry, and split.
With SnapMirror and SnapVault, on-premises network topologies, NPS and ONTAP Cloud deployment models, and hybrid cloud solutions can all benefit from the conveyance of database objects across locations, clouds, and deployment types. Oracle, MySQL, Microsoft SQL Server, and IBM DB2 are just a few of the common DBMSs covered by complete application packages. In principle, SnapCenter might pave the way for the incorporation of comparable features in both commercial and bespoke database and application frameworks.

4.6.12 Cloud-Based Security for Your Information

SaaS applications liberate IT from managing applications and IT infrastructure, but IT is still responsible for safeguarding the data stored in and generated by these apps. The onus of ensuring the safety of the client's cloud-stored data often falls on the customer, as stated in the agreement between the client and the cloud SP. Data saved in the cloud is not protected against corruption or data loss by the cloud SP, and if data damage or loss occurs because of application or user access or a natural catastrophe, the cloud provider will not be held responsible. As a result, IT administrators must take the same precautions to protect business data stored in SaaS as they would for data stored locally. Data from Microsoft Office 365 may be backed up using NetApp's data fabric data protection services. This data protection solution automatically backs up OneDrive for Business, SharePoint Online, and Exchange Online every day to ensure data security. The archive retention duration and backup locations may be set by administrators to satisfy specific legal requirements. Restoring Exchange and SharePoint components, down to the item level, safeguards all O365 data without the hassle of the native O365 rollback process. AltaVault-supported ONTAP destination endpoints and object stores may be used to store backups locally or in the cloud, respectively (e.g., StorageGRID, OpenStack Swift, and public cloud). The advantages of such a data protection solution include:
– maintaining oversight of sensitive data throughout the migration of users, folders, and mailboxes to Office 365;
– allowing users to choose their own deployment method, storage duration, and backup timeframe;
– enabling fault-tolerant data preservation for business continuity; and
– streamlining administration.

4.6.13 Transparent Data Fabric-Wide Copy Management

The need for secondary storage space is rising at a far faster rate than that for primary data. Backups, archives, test and development environments, DR sites, analytic data warehouses, and so on all use secondary copies. These duplicates are administered by several programs, which only increases their already considerable complexity. To overcome these obstacles in a hybrid cloud setting, NetApp and its partners provide CDM solutions developed on the data fabric. Cloud computing provides almost unlimited scalability for handling highly intermittent workloads when it has access to data replicas stored in the cloud. The data fabric provides access to several different NetApp technologies, including:
– array-based Snapshot copies, which make it easy to generate virtual copies in a flash;
– SnapMirror, with which data can be quickly and simply cloned to several destinations and retrieved from any of them at any time;
– FlexClone, with which you can quickly and easily create a data copy for use in test/dev environments or for analytics purposes in the event of a catastrophe;
– recent releases of NetApp's SnapCenter software, which include tools for managing data copies, and partnerships that provide technologies that are both storage- and copy-aware, allowing for the highest possible service levels in data protection;
– Commvault IntelliSnap, for example, a backup and disaster recovery application that is completely compatible with NetApp Snapshot procedures for primary data, dedupe-aware replication, and tiering across storage and cloud repositories; and
– Veeam, which delivers fast, flexible backup and disaster recovery operations for all applications in vSphere and Hyper-V environments by integrating AltaVault and FlexPod with NetApp data fabric technologies.
To streamline the production and consumption of data copies (including snapshots, vaults, clones, and replicas), Catalogic has developed a copy management platform called ECX. Enterprise copy management that uses a cataloging system helps businesses keep track of all their copies in one convenient location. Storage administrators may utilize ECX to automate various use cases, including recovery, incident management, test/dev and DevOps, and big data/analytics.

4.6.14 Data Fabric Workload Optimization

Performance indicators for applications, data stores, VMs, and storage infrastructure can all be seen in one place with the help of NetApp OnCommand Insight. Tier assignments are analyzed, and load balancing across the data fabric's many nodes is made possible for the whole suite of applications used by a company. To get the most out of their investments, businesses may boost application performance with OnCommand Insight's aid while simultaneously increasing the efficiency of their current storage infrastructure. It allows IT managers to oversee storage as an integral part of the IT service delivery method. NetApp foreign LUN import technology may help IT organizations streamline the process of migrating data from non-NetApp systems to a native NetApp AFF or FAS managed storage pool, which can be useful for array migrations, cloud solutions, and data tiering.

4.6.15 Integration of Workflow Automation into the Data Fabric

Orchestration solutions are great for automating processes, but they cannot always handle the data storage needs of different kinds of businesses. It has been shown in other domains, such as monitoring, that specialized storage expertise is required to fulfill storage automation requirements [31]. Data center orchestration solutions and client needs for storage automation are brought together by NetApp OnCommand Workflow Automation (WFA). When it comes to storage services, WFA is the automation foundation: customers may take advantage of the data fabric's capabilities with its help, as a variety of administrative tasks, such as provisioning, migration, replication, decommissioning, and cloning, are made easier. WFA supports the following features and capabilities:
– management across environments, including conversions between virtualized environments;
– monitoring across environments of capacity, performance, showbacks, and billbacks; and
– capacity planning across contexts.


It is possible to trigger the WFA-defined automated processes in many ways:
– The native WFA graphical user interface allows operators to perform tasks with a single click.
– Data center orchestration systems such as OpenStack, Microsoft System Center Orchestrator, and VMware vRealize Orchestrator can invoke them.
– Custom-built automation systems can use the WFA RESTful APIs.
In any cloud environment, AutoSupport routinely monitors the status of AutoSupport-enabled NetApp systems. It works hand in hand with NetApp's My AutoSupport, a suite of browser-based tools designed to facilitate troubleshooting. Data from AutoSupport is used by My AutoSupport, which performs a continual health check on the storage infrastructure and alerts the user to any problems it finds.

4.7 Conclusion

New data management difficulties emerge as the enterprise IT environment quickly expands to accommodate more technologies and widely distributed applications and data. NetApp's data fabric is more than the sum of its parts because it is an integrated set of technologies and services that work together to provide high-value data management solutions. In different settings, the data store boundaries are set by the storage systems used, and those endpoints are connected by SnapMirror, a data transmission protocol. It is simple to utilize them in tandem thanks to management tools, APIs, and ecosystem connectivity. The visibility and management of an organization's processes are greatly enhanced by integrating apps and services based on the data fabric. When it comes to where to host their software, businesses now have more options thanks to the data fabric's multicloud capabilities. This adaptability allows a greater variety of services to be selected from, better satisfying the requirements of various applications and businesses. If a cloud service you rely on is compromised, you may still safeguard your assets and keep your users' access levels unchanged. Lock-in with a single cloud provider is not inevitable. All this can be achieved without worrying about the safety of your data, since it is managed securely no matter where you are. NetApp's data fabric enables IT organizations to move data across data centers and clouds without compromising its security or privacy. Adding more nodes and increasing their scope may be easier with the help of ONTAP information management software. There are a variety of deployment options, including the FlexPod converged architecture for data centers, ONTAP Cloud, and NPS for clouds. NetApp is committed to becoming the industry standard in data management, and one of the ways it intends to get there is by leading the development of the data fabric. As the trend toward more product and ecosystem compatibility and endpoint availability continues, you will have more flexibility in application settings and delivery methods. The proliferation of data management services will increase your freedom to access and control your data from any place, at any time. Nobody is trying to solve this problem alone: to speed up the time it takes to bring solutions to market, NetApp is also aggressively partnering to use its partners' capabilities. As a result of its cutting-edge technology, forward-thinking business strategy, and lofty aspirations, NetApp is in the vanguard of companies offering unified data administration across clouds. The ongoing evolution of technologies and services will give consumers the flexibility to expand the scope of their data fabric in line with new competencies and business needs.

References

[1] Abunadi, I. (2019). Enterprise architecture best practices in large corporations. Information, 10(10), 293.
[2] Ansyori, R., Qodarsih, N., & Soewito, B. (2018). A systematic literature review: Critical success factors to implement enterprise architecture. Procedia Computer Science, 135, 43–51.
[3] Baer, T. (2018). The Modern Data Fabric – What It Means to Your Business, viewed 22 May 2020, https://mapr.com/whitepapers/the-modern-data-fabric/assets/MapR-enterprise-fabric-white-paper.pdf.
[4] Begoli, E., & Horey, J. (2012). Design principles for effective knowledge discovery from big data. In Paper presented at the 2012 Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture.
[5] Barton, N. (2019). Guaranteeing Data Integrity in the Age of Big Data, viewed 22 May 2020, https://www.dataversity.net/guaranteeing-data-integrity-in-the-age-of-big-data/#
[6] Boh, W. F., & Yellin, D. (2006). Using enterprise architecture standards in managing information technology. Journal of Management Information Systems, 23(3), 163–207.
[7] Bakshi, K. (2011). Considerations for cloud data centers: Framework, architecture and adoption. In Paper presented at the 2011 Aerospace Conference.
[8] BARC Research. (2017). Data Discovery: A Closer Look at One of 2017's Most Important BI Trends, viewed 22 May 2020, https://bi-survey.com/data-discovery.
[9] Collis, D. J., & Montgomery, C. A. (1995). Competing on resources: Strategy in the 1990s. Harvard Business Review, 73(4), 118.
[10] Dooley, B. (2018). Data Fabrics for Big Data, viewed 22 May 2020, https://tdwi.org/articles/2018/06/20/ta-alldata-fabrics-for-big-data.aspx.
[11] Dilnutt, R. (2005). Enterprise content management: Supporting knowledge management capability. The International Journal of Knowledge, Culture, and Change Management: Annual Review, 5, 73–84.
[12] Davenport Thomas, H., Harris Jeanne, G., & Cantrell, S. (2004). Enterprise systems and ongoing process change. Business Process Management Journal, 10(1), 16–26.
[13] Diederich, T. (2019). Data Orchestration: What Is It, Why Is It Important?, viewed 22 May 2020, https://dzone.com/articles/data-orchestration-its-open-source-but-what-is-it.
[14] Foote, K. D. (2019). Streamlining the Production of Artificial Intelligence, viewed 22 May 2020, https://www.dataversity.net/streamlining-the-production-of-artificial-intelligence/.
[15] Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.
[16] Hassell, J. (2018). How a Big Data Fabric Can Transform Your Data Architecture, viewed 22 May 2020, https://blog.cloudera.com/data-360/how-a-big-data-fabric-can-transform-your-data-architecture/.
[17] John, T., & Misra, P. (2017). Data Lake for Enterprises. Packt Publishing.
[18] Kanjilal, J. (2015). Best Practices of Designing and Implementing a Data Access Layer, viewed 22 May 2020, http://www.databasedev.co.uk/data-access-layer.html.
[19] Jha, M., Jha, S., & Brien, L. O. (2016). Combining big data analytics with business process using reengineering. In Paper presented at the 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS).
[20] Lohr, S. (2016). Data-ism: Inside the Big Data Revolution. Oneworld Publications. Luftman, J. (2000). Assessing business-IT alignment maturity. Communications of the Association for Information Systems, 4(14).
[21] Larno, S., Seppänen, V., & Nurmi, J. (2019). Method framework for developing enterprise architecture security principles. Complex Systems Informatics and Modeling Quarterly, 20, 57–71.
[22] McDaniel, S. (2019). What is Data Fabric?, viewed 22 May 2020, https://www.talend.com/resources/what-is-datafabric/.
[23] Michael, E. P. (2014). Big data and competitive advantage at Nielsen. Management Decision, 52(3), 573–601.
[24] Morrell, J. (2017). What are some of the different components of an enterprise data fabric?, viewed 22 May 2020, https://www.dataversity.net/secrets-utilizing-data-fabric/.
[25] Moxon, P. (2018). Data Virtualization: The Key to Weaving Your Big Data Fabric, viewed 22 May 2020, http://www.datavirtualizationblog.com/key-to-weaving-big-data-fabric/.
[26] McSweeney, A. (2019). Designing an Enterprise Data Fabric, viewed 22 May 2020.
[27] Malik, P. (2013). Governing big data: Principles and practices. IBM Journal of Research and Development, 57(3/4).
[28] Maddodi, S., & K, K. (2019). Netflix big data analytics – the emergence of data driven recommendation. SSRN Electronic Journal.
[29] Mousanif, H., Sabah, H., Douiji, Y., & Oulad, Y. (2014). From big data to big projects: A step-by-step roadmap. In IEEE Computer Society, Paper presented at the 2014 International Conference on Future Internet of Things and Cloud (pp. 373–378).
[30] Nagle, T., Redman, T., & Sammon, D. (2017). Only 3% of Companies' Data Meets Basic Quality Standards, viewed 22 May 2020.
[31] NetApp. (2020). Ducati and NetApp Build a Data Fabric to Accelerate Innovation, Deliver High Performance, and Win Races, viewed 22 May 2020, https://customers.netapp.com/en/ducati-datafabric-case-study. Office of the Deputy Prime Minister 2020, The Principles of Good Data Management [Ebook], 2nd ed., London.

Pushpalatha N., Sudha Mohanram, S. Sivaranjani, A. Prasanth

5 Enterprise Data

Abstract: Enterprise data consists of all the digital information that flows through a company. This comprises both structured and unstructured data, such as entries in spreadsheets and relational databases, photographs, and video content. Enterprise data management (EDM) requires inventorying, managing, and company-wide participation in the process. EDM governs both people and data. Data management guarantees that employees have access to accurate and up-to-date information and satisfies their needs for keeping superior data in a standardized, secure, and managed environment. This quick reference includes answers to common business data management questions and resources for further study. Typically, database administrators, IT administrators, and IT project managers oversee EDM. They manage the data life cycle for the company and control data intake and deletion. This life cycle is described by data lineage, and managing data lineage avoids data breaches, incorrect analysis, and legal difficulties. Storing personal information on-site or in the cloud without adequate safeguards raises a number of legal issues.

Keywords: EDM, data collection, metadata, structured data, unstructured data, web enterprise, data enterprise, data technology, data warehousing, ArchiMate metamodel

Pushpalatha N., Department of Electrical and Electronics Engineering, Sri Eshwar College of Engineering, Coimbatore 641 202, e-mail: [email protected]
Sudha Mohanram, Department of Electronics and Communication Engineering, Sri Eshwar College of Engineering, Coimbatore 641 202, e-mail: [email protected]
S. Sivaranjani, Department of Electrical and Electronics Engineering, Sri Krishna College of Engineering and Technology, Kovaipudur 641 042, Coimbatore, India, e-mail: [email protected]
A. Prasanth, Department of Electronics and Communication Engineering, Sri Venkateswara College of Engineering, Sriperumbudur, e-mail: [email protected]
https://doi.org/10.1515/9783111000886-005

5.1 Introduction

Enterprise architecture helps firms transform digitally by connecting technology with strategic and motivational concerns [1]. Extensive data warehouses that incorporate operational data and data about clients, suppliers, and markets have produced an explosion of information, and competition demands timely and effective data analysis [2]. Business agility is essential, and agile data collection and management require numerous parties. Semantic Web technologies help the Linked Data organisation overcome the main impediments to agility [3]. Compliance with regulatory standards, integrated customer management, and worldwide business process integration necessitate enterprise-wide master data management [4]. Organizations also profit from collecting, analyzing, and exploiting unstructured data sources alongside relational models [5]. Web 2.0 technologies are used to support business activities in enterprises. These technologies promote interpersonal cooperation, knowledge sharing, and information exchange, both inside and outside the organization, using conversational modalities instead of traditional corporate communication. Enterprise 2.0 values social networks outside and inside the firm to foster versatility, adaptation, and creativity among customers, suppliers, coworkers, bosses, and consultants [6, 7]. Big data technologies (BDT) are being used in enterprise data warehouses and business intelligence to drive innovation and improve business insights and decisions [8, 9].

5.2 Enterprise Architecture (EA) Models for Automatic Data Collection

The benefits of automatic data gathering for model instantiation include less time spent on the models and, potentially, higher quality data. A few methods for gathering data for the instantiation of models are suggested in ongoing EA initiatives, but little information on data gathering is provided in the most widely used EA frameworks. A few strategies have seen limited application among EA tool developers, namely importing models from other programs or permitting the use of SQL commands to bring in data from storage. However, both methods rely on the model data already being accessible and current. A substantial amount of human labor is still needed because the primary challenge is generally to acquire this information in the first place, and these methods require it as a precondition. The scientific community has instead been concentrating on providing guidelines and standards for the development and upkeep of models [1].
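The SQL import route mentioned above can be sketched as follows: pull records out of an existing repository and turn each one into a candidate model element. The table, column names, and element type are invented for the example (a throwaway SQLite database stands in for whatever repository a tool would actually query).

```python
# Hedged sketch of "import via SQL": query a repository and map each row
# to a candidate EA model element. Schema and names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cmdb_applications (name TEXT, owner TEXT, hosted_on TEXT)")
conn.executemany(
    "INSERT INTO cmdb_applications VALUES (?, ?, ?)",
    [("Billing", "Finance IT", "vm-billing-01"), ("CRM", "Sales IT", "vm-crm-02")],
)

# Each row becomes a candidate application-layer element in the EA model.
model_elements = [
    {"type": "ApplicationComponent", "name": name, "owner": owner, "node": node}
    for name, owner, node in conn.execute("SELECT name, owner, hosted_on FROM cmdb_applications")
]
print(model_elements)
```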

5.3 The ArchiMate Meta Model

Since ArchiMate is an open, vendor-independent, and general modeling language for enterprise architecture, it is used to instantiate the automatic data collection strategy employing network scanners. The major goal of ArchiMate is to assist stakeholders in addressing challenges within their organization and the enabling IT infrastructure. ArchiMate is described based in part on ANSI/IEEE 1471-2000, Recommended Practice for Architectural Description of Software-Intensive Systems, generally recognized as the IEEE-1471 standard. In 2009, the consortium authorized the formalization of the architectural metamodel as part of the Enterprise Architecture Forum TOGAF. Three layers comprise the architectural metamodel: the business layer, the application layer, and the technology layer. The technology layer supports applications, which in turn benefit the business. Each layer consists of elements and the relationships established between them. The entities within each layer are categorized into three enterprise architecture types [1] (see the sketch after this list):


(i) Informational items are modeled by the passive structure.
(ii) The behavioral structure models an enterprise's activity.
(iii) The active structure – the framework components that execute the behavioral features.
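The following minimal Python sketch represents the three aspects across the business, application, and technology layers; the class names and the tiny example model are illustrative simplifications, not the full ArchiMate specification.

```python
# Minimal model of ArchiMate-style elements: each element belongs to a
# layer and an aspect, and relationships connect elements across layers.

from dataclasses import dataclass

@dataclass
class Element:
    name: str
    layer: str    # "business" | "application" | "technology"
    aspect: str   # "passive" | "behavior" | "active"

@dataclass
class Relationship:
    source: Element
    target: Element
    kind: str     # e.g. "assignment", "access", "serving"

invoice     = Element("Invoice",         "business",    "passive")
billing     = Element("Handle Billing",  "business",    "behavior")
billing_app = Element("Billing System",  "application", "active")
app_server  = Element("App Server Node", "technology",  "active")

model = [
    Relationship(billing_app, billing, "serving"),       # application serves business behavior
    Relationship(billing, invoice, "access"),            # behavior accesses passive information
    Relationship(app_server, billing_app, "assignment"), # technology hosts the application
]

for r in model:
    print(f"{r.source.name} --{r.kind}--> {r.target.name}")
```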

5.4 Business Data Mining

Enterprise data mining functionality includes all analyst functionality. However, the following capabilities should be examined against enterprise needs:
– Diverse data sources. Corporate data mining applications use diverse data sources, like other enterprise applications. For data mining services, information services must provide seamless and rapid access to relational database management systems, data marts, data warehouses, flat files, object-oriented databases, web sites, and IBM hierarchical databases [2].
– Sampling. Sampling means picking a portion of a data source to build, evaluate, or predict with a model. Enterprise data mining algorithms need this capability for two reasons:
  1. Expandability: Building a data mining model over a very big data set is impractical, but a valid method is to build the model using one or more large samples of the source data and then test the model against several other samples using the same or different data sets.
  2. Assistance for the data mining process: Business analysts may initially use subsets of data from various data sources to accelerate the data mining life cycle before including all or most of the information in the study definition. Support for various sampling sizes over many heterogeneous data sources is necessary for enterprise applications. Ideally, this degree of flexibility would be supported directly by the data sets; for now, it belongs to the layer of data mining services.
– Model merge: Model merging lets you combine disparate data mining models. Enterprise applications for data mining need this capability for two reasons:
  1. Scalability: Multiple submodels can be built at the same time from different data sets and then combined using model merge to produce very large models from very large data sets.
  2. Business process support: A company may require separate regional and national models; model merging makes this practical for enterprise data mining.
– Incremental modeling: Incremental modeling is comparable to model merging, with the following distinction: it begins with a single model, which is then updated with additional data, resulting in a single updated model that combines the original model with the details of the extra data (see the sketch after this list). This characteristic is crucial for commercial data mining systems for two reasons:
  1. Scalability: An organization may have extremely large models and may need to update them periodically with incremental data. In this circumstance, it is crucial that the data mining service does not need to rebuild the original model from scratch but can instead begin with the original version and augment it with additional data.
  2. Support for business processes: This is necessary for an organization that maintains data mining models requiring periodic data updates.
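As a hedged illustration of incremental modeling, the sketch below updates an existing classifier with new batches of data instead of rebuilding it; scikit-learn's partial_fit is used as a stand-in for whichever mining engine an enterprise actually deploys, and the data is synthetic.

```python
# Incremental modeling: feed new batches into an existing model rather
# than rebuilding it from scratch. Data and target rule are synthetic.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_batch(n=500):
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy target
    return X, y

model = SGDClassifier()

# Initial model built from the first batch of historical data.
X0, y0 = make_batch()
model.partial_fit(X0, y0, classes=np.array([0, 1]))

# Later, each new period's data augments the same model: no full rebuild.
for _ in range(3):
    X_new, y_new = make_batch()
    model.partial_fit(X_new, y_new)

X_test, y_test = make_batch()
print("accuracy after incremental updates:", model.score(X_test, y_test))
```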

Figure 5.1 represents the various classifications of Data Enterprising.

[Figure 5.1: Data enterprising. The figure shows four classifications: unstructured data enterprise, linked data enterprise, master data enterprise, and semantic web data enterprise.]

5.4.1 Linked Data Enterprise

In today's competitive corporate world, speed is of the essence. The ability to get a product onto the market before the competition has always been a huge boon, but in today's fast-paced economy, consumers have come to demand a flood of brand-new offerings that require intricate production and distribution networks. These goods are shipped to an ever-evolving global market, where regulations are always being updated somewhere in the world. Rather than being an anomaly, the introduction of new product categories has become the norm. Companies merge during precarious economic times, creating the additional difficulty of making the merged entity more successful than its constituent components. In today's business world, it is not surprising that agility, the ability to adapt to change, is seen as the most important way to stand out. Sales, marketing, and business expansion are all information-heavy processes. To a considerable extent, information is the product or service itself for many sectors. Bioinformatics and clinical trial statistics are just two examples of the many types of information management used in pharmaceutical research. Most financial instruments are based on some sort of data. With the rise of digital media over print, the publishing industry has become even more of a business of managing information than it already was. In many cases, businesses cannot function without flexible information systems. This would have been considered excellent news a few decades ago, when the software component of a company was the most adaptable and quickest to change. In contrast to physical systems, software can be easily modified to accommodate novel business models and production procedures. The informational burden on businesses, however, has changed this. A new data-backed system can take anywhere from several months to an entire year to become fully operational. Because new products come out all the time, the time it takes to build the software system needed to run a product has become a major limiting factor [3].

5.4.2 Master Data Enterprise

The data architecture, as the information management sub-architecture, is often described as having two parts: a conceptual master data model, which describes the key business objects and the interrelations among them at a theoretical level, and a system architecture encompassing the totality of a company's application areas that generate, hold, and keep updating occurrences of the data types defined in the conceptual master data model. The enterprise master data architecture is a special kind of information architecture (for master data). It incorporates both the static and dynamic facets of MDM to ensure that data management standards and deployment are consistent across all of an organization's applications. In this model of EMDA components and reference architectures, data flows between programs show how information is passed back and forth between them. We have taken into account methods that meet the following criteria in our analysis of existing ways of creating architectures:
– They are widely distributed among researchers and practitioners.
– They are meticulous in nature.
– They focus on, but are not limited to, the creation of data and information structures.
The frameworks evaluated include the Zachman framework, The Open Group Architecture Framework, Enterprise Architecture Planning, the Federal Enterprise Architecture Framework, and the Enterprise Architecture Cube by Bernard. The analysis also uses the Guide to the Data Management Body of Knowledge, in which the Data Management Association discusses problems with data and information frameworks in the context of information and data quality management. The analysis of the publications used as sources is used to compare the frameworks. Numerous strategies were left out of this list because they do not deal with data or information architectures, are very similar to other strategies, or are merely modeling notations.

The diverse structural frameworks were judged according to criteria that came either from the theoretical side of our research or from the practical problems that companies are facing right now [4]. Among the criteria are:
– Centering attention on the architecture of the enterprise's master data: It is not possible to simultaneously study an organization's architecture in its full breadth and with adequate depth. In light of this, it is recommended that the research and design efforts be confined to a particular sub-architecture in order to be capable of making precise recommendations.
– Taking into account all of the components of an EMDA: they should go beyond mere data storage concerns (software systems) and encapsulate data distribution between applications in addition to the business perspective on data objects.
– Each strategy or structure should take into consideration all of the elements of an enterprise master data architecture. This implies that they should look beyond simple data storage issues and consider the business perspective on data objects.
– Making specific reference to the master data and the characteristics it possesses. The architecture of the enterprise's master data (data sets, systems and applications for master data, and data flows) is conceptualized based on the master data model and software architecture.
– Detailed description of design considerations: It is not enough to just identify models or architectural components and their interactions in order to make practical applicability possible. Instead, the strategy should demonstrate which choices need to be made by a corporation in order to support the continued and sustainable development of the EMDA.
– Offering definite rules and a variety of design choices for each individual design decision.

5.4.3 Unstructured Data Enterprise

Unstructured data has no specific pattern. It typically consists of bitmap images and objects, text, email, and other types of data that are not stored in a database. Even though emails are organized in a database-like manner in clients such as Microsoft Outlook and Lotus Notes, the actual information content is sent as free text and lacks any structure whatsoever. Other examples of unstructured data include company strategy presentations made in PowerPoint, spreadsheets storing lead lists, mail exchanged between colleagues, and customers' social networking contacts. Documents created with word processing software are another type of unstructured data: even though they may have some formatting, their substance is free-form text without structure. There are now more straightforward methods for making use of unstructured data, which predominates in the data of modern businesses.


According to the findings of Dijcks [10], the advent of BDT exposed the fact that approximately 70–80% of an organization's data is underused, which contributed to the rise in interest in unstructured data. Per Feldman, Hanover, Burghard, and Schubmehl, unstructured data amounts to approximately 2.5 quadrillion bytes of information collected every 24 h from a variety of sources, including social media posts, sensors, and digital photographs, demonstrating that structureless data is expanding at an exponential rate. Several kinds of unstructured data can be distinguished. Radar data, for instance, comes in four types: oceanic, seismic, meteorological, and vehicular. Static examples include printable files and PDF files as well as faxes and scanned documents. White papers, processes, guidelines, business documents, and office documents are examples of the dynamic type, which derives its name from the fact that such files can be written, changed, examined, and acknowledged by a large number of individuals or organizations. Digital media includes, for example, sound, moving pictures, drawings, and animations. Communication documents include emails, the content of social networking sites, web documents, and records of instant messaging [5].

5.4.4 Semantic Web Data Enterprise The overarching goal of the framework that has been proposed is to utilize RDF/RDFS/ OWL languages, in addition to their associated techniques, in order to successfully describe, represent, harmonize, and access data that has been generated by a wide variety of applications. The primary advantages of employing this strategy are as follows: (i) using a single RDF simplified data model to retrieval and storage operations; (ii) using RDFS/OWL to define data as well as further infer lexically and acquire new factual information from those already stated; (iii) using RDFS/OWL to acquire new factual information from those already stated; (iv) the ability to link data that was made on the inside to data that is linked on the website. The architecture that present is made up of a collection of pre-existing lexicons from the Semantic Web that are put to use in order to model particular and pertinent elements of Enterprise-2.0 as well as a few architectural solutions that are designed to assist data representation, saving, extraction, and analysis. Regarding the planning of the architecture that was just described in detail, it has been decided that there are two most important things to do: (i) modeling business data using preexisting vocabulary and converting it to RDF and (ii) creating a consistent and scalable model of connectivity across all lexicons that are being utilized. On the other hand, the second difficulty can be handled by systematically reconciling the various vocabularies through the use of defined properties [6].


5.5 Management of Enterprise Resources and Data

The data collected during this research makes it possible to evaluate the influence of the SAP implementation at four distinct levels: corporate, organizational, operational, and personal. At both the plant level and the corporate level, the key information points provided pertain to the company's financial performance, particularly return on sales and stock levels. This information is gathered centrally, and it is comparable because each of the distinct plants reports it in a structure that is similar throughout the organization.

5.5.1 An Increase in Both Productivity and Command

Implementing SAP has reportedly led to a number of gains in operational efficiency. Overall, administration has been reduced, and this trend is expected to continue. Because all plants now use the same standardized company and plant codes instead of developing fresh ones, the process of record production has been significantly simplified. This reduces the administrative burden, and the uniform codes also make it possible to exercise consolidated control and visibility. As a result, the company's ability to make judgments on a pan-European scale has been significantly improved. In addition, centralized billing and invoicing are now possible, so certain activities no longer need to take place locally. The number of employees required can therefore be cut, and relationships with clients and vendors can be better coordinated and managed.

5.5.2 The European Union's Inventory Rationalization Process

The rationalization of inventories and the identification and removal of obsolete stock have both been made possible by the transparency of supplementary inventories and the capacity to analyze deviations. A new capital equipment procurement strategy has been implemented, and stock has been reduced by 10% as a result of improved inventory turnover and age control. Similar positive effects have been observed with the inventory of finished goods. The system has provided more openness about spares stockpiles and the capacity to analyze deviations. It has also made it possible to control inventory turn and age, which has led to a further reduction in stock and has already altered the strategy for purchasing capital equipment. Europe is expected to see a 20% reduction in spares over the next 2 years, and an additional 5% of the total was deemed irrelevant.


5.5.3 Capacity Optimization Across International Borders

The ability to view production plans and stock levels online from a central location enables a rapid response to export requests handled at that location. This includes the ability to act more quickly to transfer manufacturing or inventory from one business segment to another when the need arises. Before the deployment of SAP, this procedure required participants to communicate with one another by fax and email. The capability to query inventory information remotely reduces the need for faxes and emails as well as the data transcription errors that invariably result from these communication methods. In addition, the company is now able to plan both its capacity and its inventory.

5.5.4 On a Scale That Encompasses All of Europe

This lets the company make the most of its production capacity at all of its locations, so it does not have to over-produce at one location while leaving equipment idle at another. A centralized online view of production planning and stock levels makes it possible to respond rapidly to export orders from a central position and to reroute inventory from one venture to another according to the goals of the business.

5.5.5 More Influence and Control Over Suppliers

The centralization of information on inventory and material usage makes it simpler to negotiate deals at a central level, because consolidating this information gives a deeper understanding of future material requirements. This is especially important for this company because all of its plants work with the same suppliers. Centrally negotiating larger and longer-term contracts produces a major increase in buyer power, which in turn enables improved terms to be negotiated. Moreover, the SAP system permits factories to act on a local level and to provide suppliers with the proper supply dates. This dual function would never have been possible without the widespread adoption of SAP across Europe. According to the top European logistics and development manager, the company is now able to "bargain worldwide, but act locally" because of the capabilities that SAP has provided.


5.5.6 An Improvement in the Planning

During the interviews, several participants pointed to planning as an area where the system makes a significant contribution. At a fundamental level, the technology enables "what if" studies that are both more extensive and more intricate. This supports investment decisions and the coordination of European-wide operations. The availability of better details also makes it possible to take more informed decisions about customer orders which, in theory, makes it possible to select the orders that will achieve maximum profitability for the company. This is especially important given the capacity constraints the company must deal with on a pan-European basis. Yet another advantage is that planning can now monitor and hedge against fluctuations in raw material prices at a European level, which is a significant benefit. Consequently, stocks can be adjusted to guarantee that raw materials are acquired at the most cost-effective prices. One senior buying manager described this as "the capacity to protect the organisation from potential surprises" [7].

5.6 Security Evaluation and Analysis

The evaluation of the security system is one of the most essential components in the design of an efficient security system for an enterprise system, and many other factors play a part as well. When evaluating a security system, it is important to take all aspects of the system into account. Alshammari [11] analyzed the layout of the security system, an essential step in determining the comprehensive safety of any application. For evaluating security systems, academics have proposed a number of different security models and have even built a security modeling approach specifically for enterprise systems. Although the security system followed the three-tier architecture concept, there was an additional tier. At a given level of abstraction, various metrics were assigned to each level of the proposed system's security tiers. Many of the structural attributes and metrics connected to the security work of each part at every layer had a direct influence on the other two layers with regard to the possible flow of classified material, and this was the case throughout all three layers. In addition, the comprehensive safety of the whole business network was summed up at the top level, which offered a single security measurement; this made it simple to compare the proposed system to other systems of a similar nature. Shrivastava et al. [12] presented an innovative method, proposing ways to model different facets of system dependencies and to perform impact analysis on them.


The technique was able to represent both the individual components of the system and their interdependencies in the form of an impact graph. This graph, based on a weighted DAG, was regarded as the first key element of the solution. The proposed method, which served as the second essential component of the solution, was used to calculate the system's state of health in the case of a modification. To produce the impact analysis, the impact propagation algorithm traversed the influence graph along its weighted edges and dependency relations. Wang et al. [13] developed a plan for testing and assessing the security flaws of an e-commerce website and proposed a system to do so. As shown in Figure 5.3, the system that was designed had both a module for running security tests and a module for evaluating the results. The system was also designed to protect itself from different security threats, such as SQL injection attacks, through which important data could be stolen from the database.
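The sketch below illustrates the general idea of impact propagation over a weighted DAG of component dependencies, as described above. The components, edge weights, and accumulation rule are invented for illustration; the chapter does not specify the actual algorithm used by Shrivastava et al.

```python
# Minimal sketch: propagating the impact of a change through a weighted DAG of
# component dependencies. The graph, weights, and damping rule are illustrative.
from collections import defaultdict

# edges[a] = [(b, w)] means component a influences component b with weight w (0..1).
edges = {
    "database":   [("app_server", 0.9)],
    "app_server": [("web_ui", 0.8), ("reporting", 0.5)],
    "web_ui":     [],
    "reporting":  [],
}

def propagate_impact(start, initial=1.0):
    """Return the accumulated impact each component receives from a change at `start`."""
    impact = defaultdict(float)
    stack = [(start, initial)]
    while stack:
        node, value = stack.pop()
        for neighbour, weight in edges.get(node, []):
            contribution = value * weight
            impact[neighbour] += contribution
            stack.append((neighbour, contribution))  # DAG: no cycle handling needed
    return dict(impact)

print(propagate_impact("database"))
# {'app_server': 0.9, 'web_ui': 0.72, 'reporting': 0.45}
```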

5.7 Poor Data Quality

The typical business suffers in a variety of ways as a result of poor data quality. At the functional level, inaccurate data directly leads to dissatisfied customers, higher costs, and lower employee job satisfaction. Customers have every reason to expect that their names and addresses will be recorded correctly, that the services and products they order will be supplied as promised, that they will receive accurate bills, and that their accounts will be handled in accordance with industry standards. But basic mistakes get in the way: consumers may find that they are not addressed properly, that they receive the "medium" size rather than the "small," or that they have to spend time correcting invoicing issues. Customers have come to simply expect that the information linked with their order will be accurate, and they are notoriously critical of data inaccuracies. If data quality is poor, more time and other resources must be spent finding and correcting errors, which drives up operational costs. A typical illustration is the expense a firm incurs in customer care to correct customer information, orders, and bills. Early in my career, I worked for an agency whose primary mission was to identify problems in the statements of the company's most significant vendors; its budget totaled 10 million dollars every year. Across the board, a typical business incurs comparable costs: the time the shipping department spends fixing errors from the customer-order department, the time the HR department spends correcting employee information, and the time the supply chain department spends fixing errors about the supplier base all follow this model.
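A small, hypothetical example of the kind of automated check that catches such order and address errors before they reach customers or downstream departments is sketched below. The field names and validation rules are invented; a real enterprise would encode its own rules and route flagged records to a correction workflow.

```python
# Minimal sketch: flagging basic customer-record defects before they propagate.
# Field names and validation rules are illustrative only.
import re

records = [
    {"name": "A. Jones", "postcode": "SW1A 1AA", "size": "small"},
    {"name": "", "postcode": "12345", "size": "medum"},   # two data-quality defects
]

VALID_SIZES = {"small", "medium", "large"}

def quality_issues(record):
    issues = []
    if not record["name"].strip():
        issues.append("missing name")
    if not re.fullmatch(r"[A-Z0-9 ]{4,10}", record["postcode"]):
        issues.append("suspicious postcode")
    if record["size"] not in VALID_SIZES:
        issues.append(f"unknown size '{record['size']}'")
    return issues

for r in records:
    print(r["name"] or "<blank>", quality_issues(r))
```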


5.8 Best Practices in Adopting BDT

Many organizations have already started adopting BDT over the past few decades. This section describes in more detail three key best practices that are already used in the enterprise and have been shown to help solve some of the problems that arise when businesses try to use Big Data technologies.
1. Use a three-legged strategy for the big data environment, in which the three environments work well together. The first is for developers; it could be a small cluster with a modest amount of data, but large enough to hold data sets that are realistically big and varied enough to be useful. The second is for researchers, and the third is for the business community [10], a large group that may grow over time. The developer builds the use case for research. Researchers try out different use case options and choose the best analytics model, which they then hand over to business users within the same environment. The enterprise users try out the use cases and models in the research setting and either accept or reject them. If a use case is accepted, it is put into the business environment; if it is rejected, the researchers either tweak it themselves with the options they have or ask the developer for more options. This approach not only encourages an iterative model in which all major contributors can work together to prove the strategic plan for big data use cases, but it can also help sustain the business case over the long term. Further development proceeds based on the results of the proof-of-concept (PoC) and pilot studies, and a few use cases or applications are then used to set up the core process. The agile development paradigm makes it easy for people to learn quickly and helps them solve skill-related problems effectively. It can also ease management and maintenance problems by letting the administrator see the solution in the PoC and pilot stages of a use case deployment, where they can learn the nuances of implementation. As a whole, this strategy helps considerably with reusability, manageability, scalability of development, and maintainability.
2. The next step for three-legged Big Data environments is to construct a standard abstraction layer for information modeling, integration, visualization, and processing (a minimal sketch of such a layer appears after this list). This hides the complexities of the different data types, keeps security concerns separate, and makes it easier to connect to the remaining enterprise systems. It helps with issues of security, technological change, and compatibility. It also addresses gaps in the skills of developers and modelers, which otherwise lead to problems with development scalability and maintainability. To define the common abstraction layer, it is important to work together with the key enterprise IT stakeholders who will use and contribute to the big data platform. The big data abstraction is then implemented across the three-legged environment. Again, a step-by-step process is needed to build these abstractions; they do not all have to exist at the start of the Big Data journey.


3. The next step is to create an integrated set of Big Data tools and a process for managing the three-legged environments and abstractions. Creating a use case in the development environment, handing it to a researcher, and then putting it into production for enterprise customers should be a smooth, one-click process. This step helps with manageability, development, scaling, and maintenance. When exactly the same collection of tools is used across the three environments to ingest, process, visualize, and manage data, a "what you see is what you get" process is established. This also eases management and monitoring, because use cases can easily be sent back to the RDE to be fixed and troubleshooted. The integrated Big Data toolbox likewise addresses security concerns in a clear way, since it is the only way to access the information and environment. In the same way, it can be used to solve modeling problems, because it frees researchers and business analysts from the technical complexity of the environment and tools. The integrated Big Data workbench can be built from scratch or based on a product already on the market, such as Eclipse with an IBM BigInsights perspective, Talend, or Pentaho [8].
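As a rough illustration of the common abstraction layer recommended in best practice 2, the sketch below defines a single facade through which developers, researchers, and business users read data without knowing whether it comes from a file or an HTTP API. The class names, source names, and facade API are hypothetical; a real layer would also cover security, cataloging, and caching.

```python
# Rough sketch of a common data-access abstraction layer over heterogeneous
# sources. Source names, classes, and the facade API are hypothetical.
import csv
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def read(self, query):
        """Return a list of record dicts for a source-specific query string."""

class CsvSource(DataSource):
    def __init__(self, path):
        self.path = path
    def read(self, query):
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))   # query ignored in this sketch

class RestSource(DataSource):
    def __init__(self, base_url):
        self.base_url = base_url
    def read(self, query):
        # Placeholder: a real implementation would call the HTTP API here.
        return [{"source": self.base_url, "query": query}]

class DataFabricFacade:
    """Single entry point shared by developers, researchers, and business users."""
    def __init__(self):
        self.sources = {}
    def register(self, name, source):
        self.sources[name] = source
    def read(self, name, query=""):
        return self.sources[name].read(query)

fabric = DataFabricFacade()
fabric.register("orders_api", RestSource("https://erp.example.internal/orders"))
print(fabric.read("orders_api", "status=open"))
```

Because every environment goes through the same facade, a use case developed against it can be promoted from development to research to production without rewriting its data access code.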

5.9 Function of Data Warehouse

Figure 5.2 depicts the various functions of data warehousing.

Figure 5.2: Functions of EDW.


5.9.1 Provides the Ultimate Capacity for Storage

The enterprise data warehouse is a place where all of the business information that an organization has ever produced is kept in one location.

5.9.2 Replicates the Source Data

The EDW exactly replicates the data from its sources. The data that the EDW uses comes from primary storage locations such as Google Analytics, customer relationship management systems, and Internet of Things devices. Data that is dispersed across a number of different systems is impossible to manage, so the objective of the EDW is to create a single repository that contains data virtually identical to that in its primary sources. Because new relevant data is constantly being produced inside and outside the company, and can come from anywhere, this data needs to be managed by a dedicated infrastructure before it can be stored in the warehouse.

5.9.3 Stores Organized Data

The information held in an enterprise data warehouse (EDW) is always standardized and organized, which allows end users to access it through BI interfaces and formatted reports. This is what differentiates a data warehouse from a data lake, where unstructured data is stored for analytical purposes. Data scientists and data engineers use data lakes, rather than warehouses, to work with large sets of raw data.

5.9.4 Subject-Specific Data

A warehouse's primary focus is business information that can pertain to various domains. To make clear what the data pertains to, it is always organized around a specific subject, known as a data model. A particular subject can be, for instance, a sales region or the total sales of a product. In addition, metadata is added to explain where each piece of information originated.

5.9.5 Time-Dependent

The collected information is typically historical because it describes past events. Most of the stored data is broken up into time periods to help people determine when and for how long a certain trend occurred.


5.9.6 Nonvolatile

Once data is stored in a warehouse, it is never deleted. As sources change, the data can be analyzed, modified, or updated, but it is never intended to be deleted, at least not by end users. Since we are dealing with historical information, deletions are counterproductive for analytical purposes. Nonetheless, general clean-ups may occur every few years to eliminate irrelevant data.

5.10 Types of EDW

Even when the functions of an EDW are agreed on, there is always room for argument about how to design it technically. Different types of businesses have individual needs and requirements for storing and processing data. There are a variety of options for establishing the system, and those options naturally vary with the amount of data, the analytical complexity, security concerns, and the budget.

Figure 5.3: Types of EDW.

5.10.1 On-Premises Data Warehouse

On-premises data warehouses have dedicated hardware and software for unified data storage. When data is kept on physical servers, no additional information management tools are needed between databases. The EDW can use APIs to continuously source and transform data, so all of this work is done at the staging area or in the warehouse itself. A classic data warehouse can be preferable to a virtual one because there is no abstraction layer, which makes it easier for data engineers to manage the data flow for preprocessing and reporting. For most businesses, the drawbacks of the classic warehouse are:
– costly hardware and software and
– the need to hire data experts and DevOps experts to establish and maintain the data platform.

5.10.2 Virtual Data Warehouse

Virtual data warehouses are an alternative to traditional warehouses. They are built over several databases that can be queried together, while analytical tools pull data directly from the sources. Virtual warehouses can be used if you do not want to change the existing infrastructure or if your data is easy to manage as it is. This strategy has several drawbacks:
– Managing multiple databases entails ongoing software and hardware costs.
– Data in a virtual DW must be transformed before it can be used by end users and reporting tools. Complicated data queries may be time-consuming because the needed data may be spread across two databases.

5.10.3 Cloud Data Warehouse

Cloud data warehouses centralize data hosted in the cloud. A cloud provider manages a database optimized for analytics, scale, and usability. Cloud warehouses have compute, storage, and service layers: the compute layer consists of numerous compute clusters with parallel query processing, the storage layer stores everything, and the client layer manages the data. Cloud warehouse architecture offers the same benefits as other cloud services: its computers, data sets, and tooling are usually maintained for you, and the cost of such a service depends on the querying memory and computing power used. Data security is the main concern with cloud warehouse platforms. Business data is sensitive, so you want to make sure your vendor can prevent breaches; ultimately, you remain in control of data security.

5.11 Data Warehouse Architecture

Although there are numerous architectural methods that can be used to expand warehouse capabilities, only the most fundamental ones are covered here. Without getting too deep into technicalities, the entire data pipeline can be broken down into three distinct stages:
– the unprocessed information layer
– the ecosystem of the warehouse
– the interface to the product


Data Extraction, Transformation, and Loading (ETL) software is a distinct subset of data management software. Data integration tools, which perform operations on data before storing it in a warehouse, also fall under ETL's aegis. They operate in the intermediate space between the raw data layer and the data warehouse. Data can also be transformed after it has been loaded into the warehouse, so the storage facility needs tooling for cleansing, standardization, and dimensionalization. The degree of architectural complexity depends on these and other factors. The EDW design should be analyzed with a focus on accommodating future business expansion.
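The sketch below shows the ETL idea in its simplest form: raw records are extracted from a source, standardized in a transformation step, and loaded into a warehouse table for analysis. SQLite stands in for the warehouse purely for illustration, and the field names are invented; real warehouses and ETL tools differ considerably.

```python
# Minimal ETL sketch: extract raw records, standardize them, load them into a
# warehouse table. SQLite stands in for the warehouse here for illustration.
import sqlite3

raw_orders = [  # extract: rows as they arrive from a source system
    {"id": "17", "amount": " 250.00 ", "region": "emea"},
    {"id": "18", "amount": "99.5",     "region": "EMEA"},
]

def transform(row):
    # cleanse and standardize before storage (the "T" in ETL)
    return int(row["id"]), float(row["amount"].strip()), row["region"].upper()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                 [transform(r) for r in raw_orders])

for row in conn.execute("SELECT region, SUM(amount) FROM fact_orders GROUP BY region"):
    print(row)   # ('EMEA', 349.5)
```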

5.12 Future Directions

– Deep learning for insider threat detection: Data mining and machine-learning techniques will be used more frequently in DLPD in big data setups, where a large amount of information from disparate sources is generated. Anomaly detection has been a common use case for deep learning tools such as deep neural networks. Such methods can be used in DLPD's context and content analyses to detect and respond to threats in a more timely manner while also boosting accuracy and preventing covert data leaks.
– In detecting insider threats, deep learning also has the potential to help bridge the semantic gap that is often encountered. The semantic gap separates higher-level usage intentions from low-level machine events. The most important factor in identifying insiders is the user's intent, but this is not easily quantifiable. Machine events, on the other hand, can be observed directly; they lack semantic value and must be mapped to relevant user goals.
– Other research problems, such as capturing the semantics of an image from its pixels, suffer from similar semantic gaps. Recently, advances in deep learning have shown promise in the challenging problem of translating natural language sequences into other sequences. Using sequences of machine events to train deep learners to deduce user intentions is therefore a promising new area of research.
– Cloud-based DLPD services: with the rise of cloud computing comes a new way to look for data leaks. Data privacy issues arise when businesses consider contracting out their data analysis to outside vendors. The concept of collection intersection is based on the element frequency similarity between two sets (a small sketch of this idea follows this list). If the sensitive information is delegated to a third party and that party has access to sufficient background frequency data for the n-grams, the data may be susceptible to frequency analysis. Privacy-preserving data leakage detection algorithms are therefore required in order to withstand powerful attacks.


– Achieving scalability without sacrificing detection accuracy or incurring major delays when processing huge databases is an important area of research for the cloud provider. Spark can break a large streaming dataset down into manageable chunks and can be used in conjunction with a MapReduce-based detection method. However, if a leak spans several data segments, small content segments might miss the real leak, while enlarging the data segments increases network latency. It is also possible to construct large-scale data leak detection systems using Flink, another data stream processing platform.
– Tracking secure communications: Data that has been transformed, obscured, or encrypted cannot be processed by most of the DLPD methods discussed so far because of their susceptibility to large changes in the original data. Because of the prevalence of encrypted traffic, traditional methods of content-based detection are becoming increasingly ineffective. For future DLPD solutions to detect covert data leaks effectively, it is necessary to monitor the environment, even if deploying monitors externally can alleviate the problem to some extent. Data access tracking or discrepancy analysis could be a step in the right direction. Recent work, for instance, has used differential analysis to achieve obfuscation-resistant privacy leakage detection on smartphone platforms. String matching on encrypted data has been one of the hottest research areas of the last decade, and it could be used in the future to detect the transmission of sensitive data over encrypted channels.
– DLPD standards: Sommer et al. [14] noted that one of the challenges of applying ML to network detection algorithms is the shortage of training data, which also applies to DLPD. Since ML techniques are becoming increasingly prevalent, the lack of standardized testing and evaluation datasets makes it difficult to compare against state-of-the-art solutions and to conduct sound evaluations in DLPD academic research. The academic community must devise a means of rewarding data-sharing and benchmarking activities [9].
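As referenced in the list above, the sketch below illustrates the collection-intersection idea behind many content-based data leak detection methods: the n-gram multiset of a sensitive document is compared with that of outbound content, and a high overlap raises an alert. The n-gram size and threshold are arbitrary choices for the example, not values from the literature.

```python
# Minimal sketch of collection intersection for data-leak detection: compare
# n-gram multisets of sensitive content and outbound traffic.
# The n-gram size and alert threshold are arbitrary example values.
from collections import Counter

def ngrams(text, n=8):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def leak_score(sensitive, outbound, n=8):
    """Fraction of the sensitive n-gram collection found in the outbound stream."""
    s, o = ngrams(sensitive, n), ngrams(outbound, n)
    shared = sum((s & o).values())          # multiset intersection
    return shared / max(sum(s.values()), 1)

sensitive_doc = "customer 4421 card 4111-1111-1111-1111 expiry 09/27"
outgoing_mail = "fyi the card 4111-1111-1111-1111 expiry 09/27 still works"

score = leak_score(sensitive_doc, outgoing_mail)
print(f"leak score: {score:.2f}")
if score > 0.3:                             # high overlap triggers an alert
    print("possible data leak detected")
```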

5.13 Conclusion

Today's firms employ computers, networks, databases, and related technologies to address problems that span multiple information systems, so enterprise data management (EDM) is crucial. Buyers want more value from digital products as the industry grows, and one must maximize data possibilities while keeping performance high. This requires a reliable, durable EDM strategy: investments in corporate data management architecture maximize future advantages. EDM works best when data is available to the right people in the right place and can be analyzed there. Correct data means a convenient, accurate, and complete data library. Data is beneficial if existing staff can answer questions with it, new staff can quickly learn and use analytics, and the systems of record are accurate.


Distributing data improves analytics, but enterprises must reconcile data security and democratization: the two must coexist. Data usability, accessibility, and integrity require a data management team, and naming standards for analytics help maintain the data taxonomy as the database grows. Deciding where data should live requires data analysis and synchronization with the systems of record; organizations should then invest in data-sharing and employee-training systems. Planning also helps product development: stakeholders agree to track data in order to optimize work procedures, participants store analytics insights in a central document, and this tracking eliminates data silos and simplifies data insights.

References

[1] Pérez-Castillo, R., Delgado, A., Ruiz, F., Bacigalupe, V., & Piattini, M. (2022). A method for transforming knowledge discovery metamodel to ArchiMate models. Software and Systems Modeling, 21(1), 311–336. Available from: https://doi.org/10.1007/s10270-021-00912-y.
[2] Kleissner, C. (1998). Data mining for the enterprise. Proceedings of the Hawaii International Conference on System Sciences, 7(c), 295–304.
[3] Wood, D. (2010). Linking Enterprise Data, 1–291.
[4] Otto, B., & Schmidt, A. (2010). Enterprise master data architecture: Design decisions and options. In Proceedings of the 2010 International Conference on Information Quality (ICIQ 2010).
[5] Eberendu, A. C. (2016). Unstructured data: An overview of the data of big data. International Journal of Computer Trends and Technology, 38(1), 46–50.
[6] Capuano, N., Gaeta, M., Orciuoli, F., & Ritrovato, P. (2010). Semantic web fostering enterprise 2.0. In CISIS 2010 – 4th International Conference on Complex, Intelligent and Software Intensive Systems, 1087–1092.
[7] Carton, F., & Adam, F. (2009). Analysing the impact of enterprise resource planning, 103–113, WwwEjiseCom.
[8] Dhar, S., & Mazumdar, S. (2014). Challenges and best practices for enterprise adoption of big data technologies. In 2014 IEEE International Technology Management Conference (ITMC 2014).
[9] Sega, M., & Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 80–87.
[10] Dijcks, J. (June 2012). Oracle: Big data for the enterprise. Oracle White Paper, p. 16. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Oracle+:+Big+Data+for+the+Enterprise#0.
[11] Alshammari, B. M. (2017). Enterprise Architecture Security Assessment Framework (EASAF). Journal of Computer Science, 13(10), 558–571. doi: 10.3844/jcssp.2017.558.571.
[12] Shrivastava, P., & Kennelly, J. J. (2013). Sustainability and place-based enterprise. Organization & Environment, 26(1), 83–101. doi: 10.1177/1086026612475068.
[13] Dai, H. N., Wang, H., Xu, G., Wan, J., & Imran, M. (2020). Big data analytics for manufacturing internet of things: Opportunities, challenges and enabling technologies. Enterprise Information Systems, 14(9–10), 1279–1303. doi: 10.1080/17517575.2019.1633689.
[14] Cheng, L., Liu, F., & Yao, D. D. (2017). Enterprise data breach: Causes, challenges, prevention, and future directions. WIREs Data Mining and Knowledge Discovery, 7(5), e1211. doi: 10.1002/widm.1211.

Rajakumari K., Hamsagsayathri P., Shanmugapriya S.

6 Features, Key Challenges, and Applications of Open-Source Data Fabric Platforms

Open-Source Tools for Data Fabric – Dawn or Dusk?

Abstract: In recent years, "data fabric" has become a new analytics buzzword in data management agility, and it has become a high priority in booming industries whose environments are increasingly complicated, scattered, and diversified. Data analytics experts have begun exploring beyond conventional data management techniques and have shifted toward contemporary solutions such as AI-enabled data integration in order to decrease human error and total costs. A data fabric is a weave stretching across a wide area that connects numerous sites, various kinds of data sources, and different access techniques. As it progresses through the various stages of the fabric, the data collected from the sources can be handled, processed, and stored; it can also be accessed by, or shared with, both internal and external applications for a wide range of uses. The main objectives of data fabric applications are to optimize end-to-end product supply chains, comply with data regulations, and enhance consumer engagement through more sophisticated mobile apps and interactions. Companies can gain a competitive edge with data, but to meet customer demands they must supply data rapidly. Most enterprises have implemented cloud migration and IoT, with increasingly cost-effective data storage and processing. Because of this, data is no longer tied to local centers; it is located in many different places and is very difficult to manage [1]. A data fabric is a strategic solution for the enterprise to run storage operations and to leverage the best of cloud migration. This architecture can support centrally managed public and private clouds, IoT, and other devices. It reduces management tasks through automation, accelerates development and deployment, and protects assets without interruption. It enables changes to be made quickly, problems to be resolved, risk to be managed, IT operations to be reduced, and regulations to be complied with. In this chapter, the best open-source data fabric tools that meet enterprise requirements are listed and highlighted together with their benefits and challenges. The leading data fabric tools are profiled in one location, which makes it easy for researchers to choose a tool during their search.

Rajakumari K., School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, e-mail: [email protected]
Hamsagsayathri P., Senior Associate, Projects, Cognizant Technology Solutions, Coimbatore, e-mail: [email protected]
Shanmugapriya S., School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, e-mail: [email protected]
https://doi.org/10.1515/9783111000886-006


Data categorization and discovery, data quality and profiling, data lineage and governance, and data exploration and integration are the four main functions offered by data fabric technologies. These data collaboration platforms combine data integration with business applications. Atlan, Cinchy, Data.world, Denodo, IBM, and K2 View are a few open-source tools used by enterprises to manage their data and its integration. A wide range of open-source data fabric tools are quick to list their benefits. Instead of using proprietary systems, the majority of firms are interested in open-source solutions because of their lower costs. The capacity to modify the code and build creative solutions on top of it to satisfy business objectives is another crucial advantage cited by open-source proponents. In this chapter, we discuss the key features, benefits, and technical challenges of different open-source data integration tools in detail. The primary challenge in utilizing open-source data tools in the enterprise is the lack of vendor-level support: the IT departments of many businesses rely on vendor support to complement their internal capabilities [2], whereas with open-source tools the enterprise has to face and resolve issues on its own, which is hard. When developing a data management environment, technology teams frequently underestimate the amount of time and expertise required to employ open-source software properly. Organizations also frequently underestimate the amount of work necessary to integrate open source with other subsystems and, as a result, incorrectly evaluate the total cost of ownership of open-source systems. Most businesses meet few significant obstacles in open-source pilot projects, but they may run into problems when attempting to manage and maintain those deployments at scale.

Keywords: data fabric, AI-enabled data integration, key challenges, data fabric tools

6.1 Introduction

A data fabric is a strategic solution that lets the enterprise run its storage operations and leverage the best of cloud migration. The main objectives of data fabric applications are to comply with data regulations, enhance end-to-end supply chain processes, and improve client engagement through more complex mobile interactions and apps. Researchers have started looking beyond standard data management approaches and shifting toward more modern solutions such as AI-enabled data integration in order to decrease human error and the overall cost of data and analytics. A data fabric is a weave that connects various locations, data types, and sources, as well as various access techniques, over a vast area. In the past, businesses have tried to solve data access issues by introducing data hubs or point-to-point connections; both are inappropriate for highly fragmented and isolated data, and with point-to-point integrations every additional endpoint that needs to be linked incurs exponentially higher costs [3]. A new model, the data fabric, tries to address the data issues brought on by a mixed data landscape.


The core idea is to act as the virtual connecting fabric between data endpoints in order to strike a balance between decentralization and globalization. By using data fabric technologies in conjunction with cloud computing and storage, enterprises can ensure that they keep up with the exponential expansion of their data. They can also create data-driven platforms that are scalable, effective, and affordable, streamlining their capacity to process and analyze huge volumes of data [4]. The data fabric approach enables companies to sustain normal growth alongside rapid technological change and to build a data infrastructure that spans the entire enterprise, as shown in Figure 6.1.

Figure 6.1: Data fabric Process.

6.2 Features or Benefits of Data Fabric

According to the Forrester New Technology Total Economic Impact 2020 study, the characteristics of a unified data fabric architecture provide the following business value:
– a 459% rise in return on investment
– $5.8 million in business benefits on average
– 60× faster data delivery
– 20× faster customer affinity analyses [5]
Earlier, organizations tried to overcome data access issues by introducing data hubs or point-to-point integration.


The capabilities of a data fabric are applicable to any business, and the benefits of a multi-data fabric suit entire businesses with multicloud infrastructure and data cloud settings, as shown in Figure 6.2.

Figure 6.2: Key Pillars of a Comprehensive Data Fabric.

6.2.1 AI-Enabled Data Integration







– Making decisions quickly is essential for a firm to maintain its position as a market leader. The company must draw conclusions from its corporate data collection and take the appropriate steps in a timely manner.
– Data are expanding quickly, and enabling this is difficult for the business because nontraditional data sources (machine logs, social media posts, streaming data, etc.) now sit alongside traditional ones (CRM, ERP, RDBMS, file system data, etc.) in the data governance ecosystem. It is therefore becoming increasingly important to integrate data and distill the data flood into usable information for the development of insights.
– Organizations frequently ask how they can devote more time to data analysis rather than data curation. Currently, business users spend more time preparing data than analyzing it, and appropriate DI strategies are essential to helping the business reverse this tendency.
– DI, which has the capacity to use artificial intelligence (AI), is the ideal approach to automating data preparation tasks while also integrating big data analysis into the organization's core competencies.


Human intervention is still possible in the DI-with-AI architecture, but it should only be used when necessary [6]. The role of the data fabric in AI-enabled data integration is shown in Figure 6.3.

Figure 6.3: Data fabric in AI-Enabled Data Integration.



Current DI frameworks experience three levels of context-setting information:
– Complete knowledge: the incoming data's schema structure is already known.
– No prior knowledge: AI is utilized to decode the incoming data's schema by parsing its content.
– Partial knowledge, combining the two ways above: a preexisting schema structure is used, while AI deciphers the dynamic component [7].
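A minimal sketch of the "no prior knowledge" case is shown below: the schema of incoming records is deduced by parsing their content. The heuristics are deliberately simple and illustrative; they are not the mechanism of any particular AI-enabled integration product.

```python
# Minimal sketch of schema inference for the "no prior knowledge" case above:
# the structure of incoming records is deduced by parsing their content.
import json

incoming = [
    '{"order_id": 101, "total": 42.5, "customer": "acme"}',
    '{"order_id": 102, "total": 19.9, "customer": "globex", "priority": true}',
]

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            inferred = type(value).__name__
            # widen the type if records disagree (e.g. int vs float)
            schema[field] = inferred if schema.get(field, inferred) == inferred else "mixed"
    return schema

print(infer_schema(incoming))
# {'order_id': 'int', 'total': 'float', 'customer': 'str', 'priority': 'bool'}
```

The partial-knowledge case would start from a predefined schema and apply the same kind of inference only to the unknown, dynamic fields.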

6.2.2 Key Challenges of Different Open-Source Data Fabric Tools

6.2.2.1 Streamlining Data Management Using Data Fabrics

A data fabric is essentially an integrated data infrastructure for managing the challenges posed by heterogeneous data environments. Infrastructures of this kind aim to strike a balance between decentralization and globalization by acting as virtual connective tissue between various data endpoints. With this strategy, businesses can maintain scattered storage while unifying their data management. Data fabrics achieve this by utilizing cutting-edge technologies such as data virtualization, a semantic layer, metadata management, and automated data cataloging, erasing the barriers that divide different applications, data, clouds, and people [8].
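The sketch below illustrates one of the capabilities just listed, automated data cataloging: registered sources are scanned and descriptive metadata (columns, row counts, scan time) is recorded in a central catalog. The source names, contents, and catalog fields are invented for the example.

```python
# Minimal sketch of automated data cataloguing: scan registered sources and
# record descriptive metadata centrally. Sources and fields are illustrative.
from datetime import datetime, timezone

sources = {
    "crm.customers":   [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}],
    "web.clickstream": [{"session": "a1", "page": "/pricing"}],
}

def build_catalog(sources):
    catalog = {}
    for name, rows in sources.items():
        catalog[name] = {
            "columns": sorted({col for row in rows for col in row}),
            "row_count": len(rows),
            "scanned_at": datetime.now(timezone.utc).isoformat(),
        }
    return catalog

for entry, meta in build_catalog(sources).items():
    print(entry, meta["columns"], meta["row_count"])
```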


6.2.2.2 Change and Improving UX

Data fabrics can automate data discovery, governance, and deployment in addition to providing DI. They also make an abundance of corporate data available for analytics and AI, which improves decision-making and speeds up digital transformation. Another benefit is that data fabrics are independent of specific deployment platforms, data processes, locations, and architectural strategies. Additionally, they offer real-time data access to all authorized staff members, regardless of where they are in the world, as well as a uniform, consistent user experience from the end user's perspective [9].

6.2.2.3 Unused Data to Full Potential

The most effective businesses in the world today are data-driven, so data represents a significant competitive advantage. Despite this, 68% of the data in businesses currently lies inactive. The causes are numerous and varied, ranging from the abundance of data types and sources within the company to a lack of straightforward data access. Putting together data from several sources is a difficult task that brings many challenges, and the demand for real-time connectivity, self-service, and automation cannot be satisfied by traditional approaches to DI. There are various arguments for introducing a data fabric approach: if an organization is currently struggling with disparate infrastructures and data silos, it needs technology that solves its DI problems without having to rebuild the system landscape from the ground up, and if rapid decisions are key to its competitive edge, a data fabric could be the accelerator [10]. The key architectural components of a data fabric are shown in Figure 6.4.

6.2.2.4 Greater Speed, Better Insight



– One major advantage of data fabrics is that they build on top of the existing IT infrastructure. This means they can be implemented quickly and easily using an appropriately designed data platform, allowing existing technologies, applications, and services to remain in use.
– Data fabrics also have the benefit of automating data insight by continuously finding and connecting data from many applications to uncover distinctive, commercially significant links between the available data pieces. Data traceability makes for greater transparency, enabling management to make better business decisions.


Figure 6.4: Key Architectural components of data fabric.

6.2.3 Automated Compliance, Faster Data Access

Data fabrics automate critical tasks. They can also leverage AI to extract content from regulatory documents automatically, creating data-governance rules and definitions and automatically checking whether data complies with them. Last but not least, data fabrics avoid redundancies: on the one hand, by consolidating data management tools; and on the other, by minimizing data duplication. As a result, users enjoy faster access to more complete, higher-quality data that provides deeper insights [11].

6.2.3.1 Use Cases

The following use cases give an indication of how the technology is used in real-world situations.
– When used in supply chains, a data fabric can speed up access to fresh information regarding production and supplier delays, laying the groundwork for wiser decisions.
– In the financial sector, banks can use data fabrics to integrate their data systems. This not only provides access to thorough, reliable data, but also allows real-time data evaluation, enabling quicker responses to urgent situations.


Additionally, the introduction of data fabrics enables banks to quickly integrate automation solutions for jobs such as fraud detection, calculating credit scores, or conducting securities transactions [12]. The data fabric architecture is depicted in Figure 6.5.

Figure 6.5: Data Fabric Architecture.

6.2.4 Key Challenges for Open-Source Data Fabric Tools

6.2.4.1 Reducing the Amount of Resources Required

William McKnight, owner of McKnight Consulting Group, observes that when creating a data management environment, IT teams frequently underestimate the time and expertise required to use open-source software effectively. The amount of work involved is usually misjudged: the fact that a system runs in the cloud does not mean there is little to do, and teams still need to bring considerable experience to the effort. Teams working on open-source projects must have the right balance of coding, architecture, database administration, cloud computing, and security experience, and to deliver a system of value, the initiative must also comprehend the business goals. Technology teams typically underestimate the time and expertise required to develop their data management system with open-source software. McKnight adds that these teams must constantly manage, maintain, upgrade, and develop the code, which calls for considerable resources. Done well, however, open source has a high chance of success and can add a lot of value [13].


6.2.4.2 Lack of Vendor Support

Many companies' IT teams supplement their in-house expertise with vendor help, and some teams struggle to fully accept that open source lacks this support. Open-source communities are active and are frequently used by corporations for help and collaboration, but this is not the same as having a legally enforceable service-level agreement with a vendor of commercial software. IT therefore needs to be prepared to rely on the strength of the many open-source communities.

6.2.4.3 Creating Silos

Noel Yuhanna, vice president and analyst at Forrester Research, asserts that some open-source solutions are so disjointed from other tools that they fail to provide the needed consistency and do not connect and combine easily. Tools that sit in silos and do not interconnect make it more difficult to support the business, he continued. Because businesses have data flowing from many sources, this can be particularly difficult for data programs. Therefore, Yuhanna added, when combining and integrating open-source solutions, technology teams should be ready to put in more time and effort than they would with proprietary software [14].

6.2.4.4 Unsuitable Estimate of the Total Cost of Ownership

Organizations commonly underestimate the total cost of ownership of their open-source systems because they fail to take into account the amount of time and expertise open-source software requires throughout its lifecycle, the amount of integration work that must be done with it, the lack of external support, and other factors [15].

6.2.4.5 Poor Scale Management

Many businesses successfully finish open-source pilot projects without encountering major challenges, but when they seek to manage such deployments at scale, they run into a number of issues. In these circumstances, the systems lack either the level of integration required for them to function effectively, the agility and automation needed to keep up, or the governance and controls needed to safeguard them.


For instance, it was found that the engineering and architectural teams had trouble keeping track of changes to the open-source code they rely on, which can result in operational problems with the systems, such as security flaws [16].

6.2.5 Applications of Open-Source Data Fabric Platforms

6.2.5.1 Platform: Atlan

The four primary features of data quality and profiling, data lineage and governance, data exploration and integration, and data cataloging and discovery make up Atlan's data workspace platform. The software has a searchable business lexicon, an automatic data profiling feature, and a Google-like search interface. Users can control data consumption and adoption with thorough governance and access limits, regardless of where the data goes [17].

6.2.5.2 Platform: Dataware Platform

Cinchy offers a platform for collaborative data analysis that tackles the integration of business applications and data. The application was created as a secure tool for resolving concerns with data access and provides real-time capabilities for data governance and solution delivery. Cinchy works by fusing several data sources through the way its network is built. The company describes "autonomous data," defined as data that is self-describing, self-protecting, self-connecting, and self-managing inside the platform, as data enabled by these architectural styles [18].

6.2.5.3 Tool: Data.world

Data.world is a cloud-native enterprise data catalog that provides companies with a complete framework for understanding their data, wherever it may be. It contains tools for project management, social media collaboration, metadata, dashboards, analysis, code, and documents. Users can investigate relationships using the product's built-in connected web of data and insights, and it also suggests similar assets to enhance analysis. Data.world is distinctive because of its continual release cycle [19].

6.2.5.4 Platform: The Denodo Platform

Denodo is a major player in the market for data virtualization products. Established in 1999 and headquartered in Palo Alto, Denodo offers high-performance DI and abstraction for a range of big data, enterprise, cloud, unstructured, and real-time data services.


Denodo also provides access to unified company data, data analytics, and single-view applications. For data virtualization, the Denodo Platform is offered on the Amazon AWS Marketplace as a virtual image [20].

References

[1] https://www.ibm.com/downloads/cas/V4QYOAPR.
[2] Cadariu, S. Data fabric and cloud computing as enterprise technologies. AI Time Journal. September 7, 2022.
[3] https://www.bmc.com/blogs/data-fabric.
[4] https://www.techtarget.com/searchdatamanagement/tip/5-challenges-IT-faces-using-open-sourcedata-management.
[5] https://www.linkedin.com/pulse/data-fabrics-paradigm-shift-management-dominik-krimpmannphd?trk=pulse-article_more-articles_related-content-card.
[6] https://www.gadgetgram.com/2022/05/25/what-are-the-challenges-of-implementingdatafabric/#:~:text=Deploying%20and%20Configuring%20Services,helpful%20for%20analytics%20and%20reporting.
[7] https://techgenix.com/data-fabric.
[8] https://solutionsreview.com/data-management/the-best-data-fabric-tools-and-software.
[9] https://medium.com/kamu-data/introducing-open-data-fabric-eaf9fdcd3903.
[10] https://www.trustradius.com/data-fabric.
[11] https://www.expressanalytics.com/blog/data-fabric-benefits.
[12] https://www.openaccessgovernment.org/recognising-the-human-side-of-data-fabric/138265.
[13] https://slashdot.org/software/data-fabric.
[14] https://www.datanami.com/2022/08/09/these-15-data-fabrics-made-the-cut-in-forresters-wave.
[15] https://www.striim.com/blog/data-fabric-what-is-it-and-why-do-you-need-it.
[16] https://atlan.com/what-is-data-fabric.
[17] https://www.bmc.com/blogs/data-fabric.
[18] https://nix-united.com/blog/data-fabric-the-future-of-cloud-technologies.
[19] https://www.occtoo.com/blog/data-fabric.
[20] https://www.indiumsoftware.com/blog/why-data-fabric-is-the-key-to-next-gen-data-management.

Pankaj Rahi✶, Monika Dandotiya, Sanjay P. Sood, Mohit Tiwari, Sayed Sayeedi

7 An Open-Source Data Fabric Platform: Features, Architecture, Applications, and Key Challenges in Public Healthcare Systems

Abstract: To turn data into more valuable designs, a data fabric architecture can determine whether data is actually being used, advocate the inclusion of new and diversified data, and enhance the existing model so that more accurate and relevant information and knowledge can be generated. As a result, managerial work is reduced and data value is captured more rapidly, improving the quality of services. In today's information and communication technology environment, data management and integration involve increasingly diverse, remote, large, high-volume, and complicated datasets, and adaptability in information processing and management has become a mission-critical concern for enterprises. Information and analytics professionals have to move beyond conventional data management approaches and progress toward contemporary alternatives, including machine intelligence enabled data integration, in order to decrease human error and overall costs while achieving a higher level of accuracy. This chapter presents a detailed analysis of data fabric platforms in light of healthcare problems and explains why adopting such platforms is essential for improving the quality of healthcare services. It also describes use cases showing how effectively a data fabric benefits the entire eco-friendly system, Healthcare 4.0, and Society 4.0.

Keywords: Open-source data fabric architecture, Smart eHealth system, environmental healthcare, Hospital 4.0, Society 4.0, data fabrication



Corresponding author: Pankaj Rahi, Department of Artificial Intelligence and Data Sciences, Poornima Institute of Engineering and Technology, Jaipur, Rajasthan, India, e-mail: [email protected], 0000-0003-1315-2727 Monika Dandotiya, Department of Computer Science and Engineering, Madhav Institute of Technology and Science (MITS), Gwalior, India, e-mail: [email protected], 0000-0002-5501-4210 Sanjay P. Sood, Health Informatics Division, Centre for Development of Advanced Computing (C-DAC), Mohali, India, e-mail: [email protected], 0000-0002-7358-5228 Mohit Tiwari, Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India, e-mail: [email protected] Sayed Sayeedi, Adjunct Faculty, Rochester Institute of Technology, Dubai Campus, Dubai, UAE, e-mail: [email protected]

https://doi.org/10.1515/9783111000886-007


7.1 Introduction

7.1.1 Overview

Regardless of the location, database association, or structure of the linked data, a data fabric's goal is to establish a uniform view of data in order to simplify application access to information. Used in conjunction with artificial intelligence (AI) and machine learning (ML), it frequently streamlines analysis. As a result, data fabrics are increasingly used as a key tool for transforming unstructured data into business insight. Data fabrics can also speed up application development by establishing a standard model for information access that breaks away from the usual application and database silos. The same harmonization increases operational efficiency and offers greater information access at the line level of the organization.

In terms of disease prevalence in the healthcare sector, noncommunicable diseases (NCDs) continue to pose the biggest danger to global public health. Precision public healthcare (PPH), a rapidly developing field, presents a game-changing opportunity to use digital healthcare data to build an adaptable, data-driven public healthcare system that actively fights NCDs. Drawing on lessons learned from digital health, our goal is to propose a vision for PPH for NCDs across three horizons of digital healthcare transformation [1]: Horizon I, digital public health workflows; Horizon II, population/community health data and analytics; and Horizon III, PPH. This viewpoint offers a high-level strategic pathway toward PPH for NCDs for public healthcare physicians, policymakers, health-system stakeholders, and scholars. To put the roadmap into a practical frame, two international use cases are presented: PopHQ (Australia), a population health informatics tool to track and monitor obesity, and ESP and RiskScape (USA), a mature PPH platform covering several NCDs. Our aim is to offer a strategic framework that directs future health policy, funding, and research so that the public healthcare burden of NCDs can be reduced [2].

NCDs pose a serious threat to world health. They affect individuals of all ages and geographical locations and account for 71% of all worldwide fatalities [2]. A global syndemic of NCDs, social injustice, and coronavirus disease (COVID-19) is now revealing the shortcomings of conventional public healthcare strategies in addressing escalating risk factors and the diseases they cause. For instance, the four main NCDs, in addition to mental ill health, are now linked to malnutrition, which is also the world's greatest risk factor and cause of ill health. Authorities therefore have the chance to rethink public healthcare strategies to combat NCDs in the digital age. The goal of PPH is to develop an adaptable, data-driven public healthcare system that improves on present evidence-based strategies [3]. PPH uses data and digital technologies to inform precise choices, interventions, and policies that enhance population health. From a digital standpoint, the basis for developing a PPH system is gathering "organic" data that is collected automatically, regularly, and in real time via digital health infrastructure. From the standpoint of public healthcare, organic data refers to data that has been repurposed (e.g., from clinical or social systems) in order to advance public healthcare without going through a specialized study design [4].

7.1.1.1 Strategic Momentum for Precision Public Healthcare

For contagious illnesses (such as cholera, influenza, and the Zika virus), the use of organic data for PPH has advanced, and COVID-19 has accelerated this development. Because infectious disease transmission happens in real life and acute danger must be handled quickly, rapid public healthcare action against communicable illness is a necessity [5]. A good instance is the Digital Coronavirus Application, created to collect real-time data on individuals who needed to be quarantined in Queensland, Australia, using digitally equipped public healthcare systems to allow precise tracking and surveillance. This provides a starting point for managing communicable illnesses with the help of digital health; managing NCDs, however, requires special considerations [6, 7]. Since there has traditionally been minimal investment in digital health and analytics compared with the acute healthcare sector, it has been challenging to monitor trends and target treatments to the communities where social, environmental, and behavioral determinants are most prevalent. Personal health records (PHRs), mobile health (mHealth), wearable technology, the Internet of things (IoT), social media, and genomics have all advanced our understanding of community-based NCDs and enhanced our ability to prevent them. The intervention literature increasingly uses digitally enabled public healthcare interventions [8].

7.1.1.2 Digital Public Healthcare Workflows

In their local, state, or federal domains, public healthcare consultants and representatives must discover and assess useful "designed" and "organic" data assets for PPH of NCDs. Both (a) the data asset's availability and (b) its quality must be considered during evaluation. Mapping designed data assets provides a present-state evaluation, and mapping organic data assets provides a future-state vision (i.e., PPH). High-trust alliances should be built among the government sectors that influence the determinants of NCDs, including healthcare, sport, education, agriculture, transportation, urban infrastructure, and academia [9]. Key organizational leaders who will support a broad PPH-for-NCDs vision should be chosen and installed. Collaborations between academics, public healthcare organizations, and the healthcare industry were shown, at the very least, to make population health analytics technologies more effective. Organic health data should be collected from public healthcare organizations as the data base for PPH. In Australia, for instance, local and state electronic medical records (EMRs)/electronic health records might be prioritized as the primary data asset to guarantee appropriate population reach, data granularity, real-time data capacity, and coverage of all NCD factors. The Public Healthcare Data Warehouse in the USA combines a variety of organic sources to monitor the opioid crisis and disparities in maternal and child health.

7.1.2 Practical Challenges and Research Issues

Figure 7.1: RiskScape data visualization platform (USA) [10].

In reality, gathering and analyzing data on infectious diseases is a difficult process that often involves many stages and various stakeholders across organizational boundaries [11]. Because of the nature of epidemics, fast data collection and processing are also essential, and digital health records can assist public healthcare here. The open-source software platform known as ESP categorizes and charts organic EMR data, examines it for conditions relevant to public healthcare (such as communicable and noncommunicable illnesses), and can automatically send individual case reports or aggregate summaries to health agencies.


The present capabilities of RiskScape comprise zip-code-based heat maps, demographic and comorbidity stratifications with trend statistics, continuums of care for diabetes, HIV, and HCV, and care cascades for people at high HIV risk, supporting practices by offering timely information.

Figure 7.2: Population Health Queensland (PopHQ) proof-of-concept (Australia) [12].

It also makes it possible to spot emerging health inequalities, as described in Figure 7.2. By centralizing analytical work in an efficient, enhanced electronic environment, RiskScape relieves clinical and health department professionals of their analytical burden. Additionally, because RiskScape incorporates data from other practices, clinical practices can learn about illness trends and treatment strategies beyond their own patients and their catchment region. Currently, more than 1.2 million people, or 20% of the state's population, are included in the Massachusetts RiskScape instance [13].

Epidemiologists can no longer complete the quantity of data collection and analysis required for every public healthcare epidemic without the use of a computerized system. The inability to integrate existing data collection, analysis, and visualization operations still leaves a considerable gap in the capacity to quickly analyze the data gathered by field investigators, even with the aid of statistical software and geographic information systems. Current infectious disease information systems provide relatively little assistance to experts in data analysis and predictive model development. It is essential to have an integrated ecosystem, an information system with integrated data, offering features such as geocoding, sophisticated spatiotemporal data analysis, predictive modeling, and visualization [14].

In terms of precision monitoring for NCDs, as opposed to full PPH models of care, the present global state is an amalgam of advancements across Horizons 1 and 2. Rapid adoption of EMRs and EHRs in healthcare has already improved public health monitoring for NCDs such as diabetes, smoking-related disease, and asthma. Modern public health informatics methods for NCDs are based on a combination of designed data (surveys, administrative datasets) and organic data. Precision NCD surveillance is the main objective of these tools, and the majority of them use innovative analytics to describe regional or state-level disease burden. These are positive starts, but there are few examples of how to map useful data assets for PPH of NCDs, develop workforce digital public healthcare literacy, and create multisectoral preventive collaborations with the common objective of PPH [15, 16].

The main goal of this research is to see whether the technological advancement of big data fabric layouts, together with an in-depth architecture and its implementation, could give an organization a competitive advantage in its marketplace, especially in healthcare domains. Past research shows that mass acceptance of, and investment in, numerous enterprise systems offers large businesses only a modest, short-term benefit. The resource-based view explains this temporary competitive advantage of enterprise system investments: once all other enterprises adopt the same information technology systems, the differentiating value of those systems erodes [19-21]. This study was also performed to answer the following questions: how can the implementation of an open-source data fabric benefit the highly demanding healthcare service sector, and how can the big data fabric be beneficial for Healthcare 4.0?

Forrester, in the study "Enterprise Data Fabric Enables DataOps," describes six core components of a data fabric; a minimal sketch of how they fit together follows the list. These six levels are as follows:
a) Data management layer: This layer takes care of data and information governance as well as data security.
b) Data ingestion layer: This layer begins to stitch together cloud data, establishing links between structured and unstructured data.
c) Data processing: This computational layer refines the data so that only useful facts emerge for data extraction.
d) Data orchestration: This critical layer carries out some of the most important responsibilities of the data fabric, such as transforming, integrating, and cleansing data to make it usable for teams across the organization.
e) Data discovery: This layer helps to identify ways to combine data from disparate sources.

f) Data access: This layer facilitates data utilization while ensuring access permissions for particular teams, in order to comply with regulatory requirements. It also aids in presenting relevant information through monitoring systems and other visualization techniques.
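As a rough illustration of how these six layers relate to one another, the following minimal Python sketch chains them into a single flow over two toy sources. All class, function, and field names here are illustrative assumptions for this chapter, not part of Forrester's description or of any particular product.

```python
# Minimal sketch of the six data fabric layers as one flow. All names are
# illustrative; a real fabric would back each layer with dedicated platform
# services rather than in-memory Python functions.

RAW_SOURCES = {
    "emr_csv":  [{"patient_id": "p1", "bp": "120/80", "note": "stable"}],
    "iot_feed": [{"patient_id": "p1", "heart_rate": 72}],
}

ACCESS_POLICY = {"analytics_team": {"emr_csv", "iot_feed"}}  # management layer


def ingest(sources):
    """Data ingestion: pull raw records from every registered source."""
    return {name: list(records) for name, records in sources.items()}


def process(batches):
    """Data processing: keep only records that carry a patient identifier."""
    return {name: [r for r in recs if r.get("patient_id")]
            for name, recs in batches.items()}


def orchestrate(batches):
    """Data orchestration: transform and stitch records into one clean view."""
    combined = {}
    for recs in batches.values():
        for r in recs:
            combined.setdefault(r["patient_id"], {}).update(r)
    return combined


def discover(catalog):
    """Data discovery: expose which attributes are available per entity."""
    return {pid: sorted(rec) for pid, rec in catalog.items()}


def access(catalog, team):
    """Data access: serve the unified view only to teams allowed by policy."""
    if team not in ACCESS_POLICY:
        raise PermissionError(f"{team} has no access grant")
    return catalog


if __name__ == "__main__":
    unified = orchestrate(process(ingest(RAW_SOURCES)))
    print(discover(unified))
    print(access(unified, "analytics_team"))
```

The point of the sketch is only the ordering of the layers and the cross-cutting role of the management and access-policy layer, not the mechanics of any particular platform.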

Data fabric is a new method of managing data that uses a network-based infrastructure and architecture rather than point-to-point interconnections. It makes it possible to establish an integrated information layer (the fabric) that combines analytics, proactive data/information generation, orchestration, and applications. Digital transformation agendas now target enterprise-wide efforts rather than focusing solely on eliminating specific complex business problems; the goals are to gain a competitive advantage, provide more value to clients, reduce risk, and respond promptly to organizational requirements. With all of this, the top priority is to identify ways to use information to generate insights and guide decision-making more precisely. This is difficult, however, because overly complex data infrastructures that rely on a fragmented array of technologies for data management, conceptual layers, data pipelines, data integration, and analytics prevent institutions from obtaining data quickly and in an easy-to-understand format. Many organizations also struggle to access and reap the benefits of this era of technological change.

Three core data fabric architectural styles are currently followed in enterprise settings. The first is the decentralized style, a technique for accessing data where it resides without consolidating it in a central repository such as a data lake or data warehouse. The second is a more comprehensive representation of the data fabric that treats such central repositories as just another set of participants in a distributed data architecture: their data is exposed alongside other sources, so centrally managed data is included while decentralized, structured access privileges are still granted. The most recent version sees the fabric as a hybrid data architecture foundation, heavily slanted toward centralized access, that gives data and information architects a way to bridge dispersed data resources while adapting to the shared data requirements of users such as data scientists. Data fabrics improve productivity by establishing a unified layer in which data access is controlled across all resources. The data fabric is a recent development in corporate information management; its most widespread application is simplifying data access, which would otherwise be made difficult by the wide variety of data models, formats, and scattered data assets present in a typical company.


7.2 Core Components of Data Fabrication Architecture

Figure 7.3: Components of data fabrication architecture [17].

Figure 7.3 shows the main components of the data fabrication architecture. This research aims to determine whether the use of the new data fabric architecture technology can give a business a competitive edge in its industry. The need to investigate this arises from studies indicating that broad adoption of, and investment in, different enterprise systems may give major organizations a competitive edge only in the short term.

7.3 Core Phases for Managing the Enterprise-Based Data Fabric

Figure 7.4 shows the core phases of managing an enterprise-based data fabrication. By taking into account the difficulties associated with comparable designs that already exist, the data fabric architecture and its competitive advantage can be shaped. Finally, given the recent advent of this architecture, particular deployment instances and the use of big data fabric in this manner are still in their infancy in terms of industrial practice. Data fabric is a design concept that shifts the emphasis between machine and human tasks rather than simply integrating traditional and contemporary technologies. The information architecture design can only be accomplished with the aid of cutting-edge technology such as embedded ML, interactive metadata management, and semantic knowledge graphs.

Figure 7.4: Core phases of managing the enterprise-based data fabrication [17].

Figure 7.5: Key pillars of a comprehensive data fabric (source: Gartner, © 2021 Gartner, Inc. [42]).


7.4 Big Data Fabric Architecture

Figure 7.6: Building blocks of big data fabric architecture: data management issues, big data fabric, and data fabric design.

As presented in Figure 7.6, the primary driver behind the big data fabric architecture is a company's requirement to handle enormous volumes of data efficiently. Expectations that the architecture can create a competitive edge are high, since it is designed specifically to meet commercial demands. The big data fabric architecture was designed to avoid the complexity of multiplatform sources and the integration problems that are common with existing enterprise design trends, and instead to extract valuable information from the data and transform it into actionable business insight [24]. The term "big data fabric," which Yuhanna et al. [25] coined, combines the technological components of "big data" and "data fabric." The distinction is that data fabric's data management methodology does not mandate big data; the phrase "big data fabric" is built on the intersection of these two existing technologies. Big data fabric is officially defined as "bringing disparate big data sources together automatically, intelligently, and securely and processing them in a big data platform technology using data lakes, Hadoop®, and Apache Spark™ to deliver a unified, trusted, and comprehensive view of customer and business data."


7.4.1 Enhancing the Data Pipeline for Interoperability in Healthcare and Attaining Fast Healthcare Interoperability Resources (FHIR) Standards

A data pipeline is the procedure that data follows during the initial process of knowledge discovery. Usually, an entire cycle runs between a specified source location and a data lake or pool, which in turn supports an organization's decision-making procedures or an algorithm's AI competence. Figure 7.7 displays the core phases followed by the data pipeline process during the initial phases of knowledge discovery.

Figure 7.7: Pipeline process for data during the phases of knowledge discovery [29].

Big data pipelines are the command-and-control flow that determines where and how to support data collection, processing, integration, and delivery for knowledge discovery. The idea is that the greater the quantity of data captured, the smaller the error margin when making intelligent business decisions. A data pipeline architecture defines the framework for collecting, organizing, and routing data so that it can be scrutinized to discover insights and gain business benefits; a small illustrative sketch follows Figure 7.8. Raw data contain an exorbitant number of observations that might or might not be pertinent, and the pipeline architecture assembles data occurrences to enhance disclosure, analysis, and use.

Interoperability and healthcare appear to be synonymous terms in the current scenario, thanks to the 21st Century Cures Act, enacted by the federal United States Government in 2016. The core aim of the Act was to standardize patient data and information management, as well as their accessibility and availability, in order to drive optimal health outcomes within the value-based care model. Healthcare providers, the medical and pharmacy industries, health information networks, and health information exchanges are expected to participate in interoperable patient information sharing across the physical healthcare infrastructure.

Figure 7.8: Streaming data pipelines architecture for healthcare [30].
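To make the collect, organize, and route stages above concrete, the short sketch below pushes a few raw healthcare events through those three steps and lands them in a toy "data lake." The event fields and the simplified FHIR-flavoured record shape are assumptions made for illustration; a production pipeline would validate against the full FHIR specification rather than this minimal mapping.

```python
# Illustrative sketch of a small healthcare data pipeline: collect raw events,
# organize them into a common shape, and route them to a landing zone.
# Field names and the FHIR-like target shape are simplifying assumptions.

RAW_EVENTS = [
    {"src": "hospital_ehr", "mrn": "p1", "obs": "heart_rate", "value": 72},
    {"src": "pharmacy",     "mrn": "p1", "drug": "metformin", "qty": 30},
    {"src": "hospital_ehr", "mrn": None, "obs": "bp", "value": "120/80"},  # bad record
]


def collect(events):
    """Collection step: drop events that fail basic validation."""
    return [e for e in events if e.get("mrn")]


def organize(event):
    """Organization step: map each event onto a minimal, FHIR-flavoured record."""
    if "obs" in event:
        return {"resourceType": "Observation", "subject": event["mrn"],
                "code": event["obs"], "value": event["value"]}
    return {"resourceType": "MedicationDispense", "subject": event["mrn"],
            "medication": event["drug"], "quantity": event["qty"]}


def route(records):
    """Routing step: land each record in a per-resource-type bucket (the 'lake')."""
    lake = {}
    for rec in records:
        lake.setdefault(rec["resourceType"], []).append(rec)
    return lake


if __name__ == "__main__":
    lake = route(organize(e) for e in collect(RAW_EVENTS))
    for bucket, recs in lake.items():
        print(bucket, len(recs))
```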

7.5 Application of Data Fabric

7.5.1 Core Abilities of the Data Fabric

A well-designed data fabric solution will generally include the following features.

7.5.1.1 Straightforward Information System

The ability to keep track of information access, trustworthiness, responsiveness, and threats in a strong, unified working environment.

7.5.1.2 Unified Data Definitions

Having a single working space for all document types simplifies defining business objectives and helps establish a single point of truth for a sustained analytics perspective.


7.5.1.3 Designing Independent Data

Just-in-time query processing based on frequency and utilization inside a unified platform reduces the burden on analytical capabilities and information management [31].

7.5.1.4 Centrally Controlled Strategic Planning and Governance

A carefully constructed security policy for access distribution is applied consistently to all document types, whether on premises, in the cloud, across multiple clouds, or on other computational platforms.

7.5.1.5 App and Framework Agnostic

This refers to the ability of an app or framework to integrate quickly with consumers' and managers' preferred business interfaces and ML applications. Intelligent data virtualization guarantees an end-to-end representation of the information from many divergent sources without having to relocate or copy it [32]; a minimal sketch of this copy-free access pattern follows. A future-ready architecture is one that can integrate big infrastructure deployments and re-engineer legacy solutions without disrupting emerging implementations and the document types they support.
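The sketch below illustrates the virtualized access pattern just described under simplifying assumptions: two small sources (an in-memory SQLite table and a CSV held as text) are merged at query time into one logical record, with nothing relocated into a central store. All table, column, and function names are hypothetical.

```python
# Sketch of intelligent data virtualization: expose one logical view over two
# physically separate sources without copying either into a central store.
import csv
import io
import sqlite3

# Source 1: an operational database table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (patient_id TEXT, name TEXT)")
db.execute("INSERT INTO patients VALUES ('p1', 'Asha'), ('p2', 'Ravi')")

# Source 2: a departmental CSV extract (kept as text to stand in for a file).
CSV_TEXT = "patient_id,last_visit\np1,2023-01-10\np2,2023-02-02\n"


def query_db(patient_id):
    """Fetch the master record from the database on demand."""
    row = db.execute("SELECT patient_id, name FROM patients WHERE patient_id = ?",
                     (patient_id,)).fetchone()
    return {"patient_id": row[0], "name": row[1]} if row else {}


def query_csv(patient_id):
    """Fetch visit data from the CSV source on demand."""
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        if row["patient_id"] == patient_id:
            return row
    return {}


def unified_view(patient_id):
    """Virtual layer: merge both sources at query time, leaving data in place."""
    record = {}
    record.update(query_db(patient_id))
    record.update(query_csv(patient_id))
    return record


if __name__ == "__main__":
    print(unified_view("p1"))  # {'patient_id': 'p1', 'name': 'Asha', 'last_visit': '2023-01-10'}
```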

7.6 Benefits of Big Data Fabric Architecture

7.6.1 Core Benefits of Data Fabric

Speed, robustness, and efficiency are made possible by a data fabric, which leads to lower costs, higher productivity, and a shorter time to value. It optimizes self-service data exploration and analytics by giving consumers permission to access all data, improving reliability and access to trustworthy data. In recent times, the widespread adoption of new technologies by the vast majority of the world's population has resulted in massive amounts of data, including clinical data. Medical organizations gather and interpret this clinical data in order to gain insights and knowledge useful for clinical decision-making, drug recommendations, and better diagnostic procedures, among other purposes. Further core benefits include the following:


– An employee-assistance model that develops skills in automated data engineering tasks and augments data integration to deliver real-time insights.
– Active use of metadata for data quality enhancement, data curation, data categorization, policy enforcement, and other purposes, in order to automate data governance and protection [33]; a small sketch of metadata-driven policy enforcement follows this list.
– Systematized elastic scaling, self-tuning, self-healing, and task scheduling for any environment and data volume, together with automated workload orchestration.
– Automated discovery of data assets, tying them together and adding knowledge and semantics to make it easier for users to access and comprehend the data.
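As a minimal illustration of the metadata-driven governance point above, the sketch below tags datasets with classification metadata and lets a tiny policy check automate the access decision. The catalog entries, roles, and policy rules are assumptions, not a real governance model.

```python
# Sketch of "active metadata" driving governance: each dataset carries tags,
# and a simple policy engine uses those tags to automate access decisions.

CATALOG = {
    "lab_results":   {"classification": "phi",      "owner": "pathology"},
    "bed_occupancy": {"classification": "internal", "owner": "operations"},
}

POLICIES = {
    "phi":      {"data_steward", "clinician"},               # roles allowed to read PHI
    "internal": {"data_steward", "clinician", "analyst"},
}


def can_read(dataset, role):
    """Return True if the role may read the dataset, based on its metadata."""
    meta = CATALOG.get(dataset)
    if meta is None:
        return False
    return role in POLICIES.get(meta["classification"], set())


if __name__ == "__main__":
    print(can_read("lab_results", "analyst"))    # False: analysts cannot read PHI
    print(can_read("bed_occupancy", "analyst"))  # True
```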

7.6.2 When Is the Data Fabric Helpful?

Without having to devise multiple methods for managing data within a single company or organization, a data fabric gives the company the capacity to address these challenges not temporarily or gradually, but entirely and all at once. It maximizes the capacity to reuse data in any of the company's systems.

7.6.2.1 Comparison of Data Mesh with Data Fabric

The data mesh and the data fabric look like correlated terms but differ widely; Table 7.1 describes the core differences.

Table 7.1: Relationship and differences between the data mesh and the data fabric.

Data mesh and related challenges | How the data fabric helps
Interoperability across numerous corporate data sources frequently requires domain-specific data pipelining expertise | Domains are relieved of the responsibility for maintaining the underlying source systems when a data product is a business entity maintained in a virtual data layer
It may be complicated to strike the ideal balance between dependency on centralized data infrastructure and domain independence | Domain-specific teams, in coordination with centralized data teams, create APIs and pipelines for their data consumers, control and govern access rights, and monitor usage
In cooperation with teams in charge of information centralization, domain-specific teams create APIs and pipelines for corresponding data consumers, control access rights, and monitor usage | For both online and offline use cases, the data architecture gathers and analyzes data from the underlying systems to offer data products as needed


Additionally, a data fabric enables businesses to establish a planned data conduit for their digital datasets and information-and-communication-technology projects, cutting down the time it takes to bring new capabilities to market. This means that a digital project can be developed in a matter of months rather than a year or more. Data fabrics are particularly advantageous for the following:
– Integration of native and cloud-based healthcare applications
– Concurrent management of structured and unstructured data
– Management and streamlining of assorted data
– Management of numerous platforms with varying data and data architectures
– Management and control of numerous file types, DBMSs, SaaS applications, and systems
– Higher ROI, quick scaling, and performance management for enterprise information network systems

7.7 Challenges of Data Fabric in Healthcare

– Clinical IoT data validation and interoperability methods for healthcare.
– Quantum healthcare computing and the security of patients' data.
– The biggest barrier to the adoption of this technology is the lack of readiness on the part of its potential consumers. Several business organizations are unfamiliar with the platform, and some companies lack the knowledge required to implement and support it and to train their employees on it [34].
– The security and transmission of data can also pose challenges for businesses, since the use of legacy technologies limits both the performance and the scalability of the data fabric approach. To embrace this new technological concept effectively, some organizations need to update their outdated information transport and security arrangements.
– Data privacy whenever data is transmitted across locations via the data fabric has become a top priority for enterprises. The infrastructure for data transit must have secure firewalls and procedures in order to protect users from system vulnerabilities. As cyberattacks against businesses increase, data protection at every stage of the information life cycle is crucial.

7.8 Limitations of the Data Fabric Architecture

Supporters often present the best-case scenario for data fabric architecture, which stresses streamlined data access through abstraction, independent of interface or location. Advocates also stress the advantages of federated access over centralized access: for instance, an organization does not transfer or duplicate data; instead, it gives business units, teams, and other processes ownership and authority over the information they create. However, the technologies that support the data fabric have their own costs and limitations.

7.8.1 Lack of Data History

Using data virtualization (DV), the data fabric connects directly to the OLTP systems that support finance, sales and marketing, HR, and other crucial business functions. These systems do not save a history of transactional data; instead, they replace old transactions with new ones as they happen. The DV platform must therefore have some kind of persistent store in which to keep and handle previous transaction data, and this eventually starts to look like a DV platform with a data warehouse at its center; a minimal sketch of such history capture follows. The data warehouse itself does not maintain raw transaction data, only a derived subset that it ingests and manages. The issue is that this raw or "detail" data, the chaff that the warehouse does not preserve, can be valuable for business analysts, data scientists, ML developers, and other professional users. Therefore, to collect this data as well, the DV platform must include some kind of data-lake-like repository [35].
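A minimal sketch of the kind of persistent history store described above follows: before an OLTP row is overwritten, the previous version is appended to a history table. The schema and table names are hypothetical and stand in for whatever persistence layer a DV platform would actually use.

```python
# Sketch of history capture for OLTP overwrites: archive the old row version
# before applying the new one, so "detail" data is not lost.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE orders_history (order_id TEXT, status TEXT, changed_at TEXT)")


def upsert_order(order_id, status):
    """Update the OLTP row, but archive the previous version first."""
    old = conn.execute("SELECT order_id, status FROM orders WHERE order_id = ?",
                       (order_id,)).fetchone()
    if old:
        conn.execute("INSERT INTO orders_history VALUES (?, ?, datetime('now'))", old)
    conn.execute(
        "INSERT INTO orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
        (order_id, status))
    conn.commit()


if __name__ == "__main__":
    upsert_order("o1", "placed")
    upsert_order("o1", "shipped")  # the 'placed' version is preserved in history
    print(conn.execute("SELECT * FROM orders").fetchall())
    print(conn.execute("SELECT order_id, status FROM orders_history").fetchall())
```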

7.8.2 The Issue of Location

The physical location of dispersed data sources is concealed by the data fabric. However, data are most helpful when combined with other sorts of useful elements; this is the fundamental purpose of an SQL query. Data warehouse design solves this issue by integrating and consolidating data before transporting it to the warehouse, and the warehouse speeds up searches by using permanent data structures (such as indexes and preaggregated roll-ups), most of which involve data caching. In a data fabric, data must be physically transferred into the DV platform at query time, where they are integrated, even though they remain stored in scattered places. Once again, at least part of the warehouse's duties must be transferred to the DV platform: it preaggregates data, caches data, and builds indexes to speed up the execution of frequent queries (a small sketch of this pattern follows this section). Data cannot be stored or preaggregated for genuinely ad hoc queries or for analytic/ML models that need data from distributed sources (such as sensors at the corporate edge); they must be obtained on demand, regardless of location. This causes at best a noticeable delay; at worst, as when the DV layer has to use a high-latency connection to get edge data, it causes tasks to become unresponsive [36]. The final result is that, when used as a data processing engine, the data fabric often exhibits unpredictable performance compared with the data warehouse.
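The sketch below contrasts the two access paths just described, under simplifying assumptions: a known, frequent query is answered from a preaggregated and cached roll-up, while an ad hoc request falls back to slower on-demand federation. The "remote sources" are simulated with an in-memory dictionary and an artificial delay.

```python
# Sketch of preaggregation plus caching for frequent queries, with on-demand
# federation as the fallback for ad hoc requests. Data and latency are faked.
from functools import lru_cache
import time

REMOTE_SALES = {"north": [120, 95, 180], "south": [60, 240]}  # pretend remote sources


def fetch_remote(region):
    """Simulate a slow, on-demand pull from a distributed source."""
    time.sleep(0.05)                      # stands in for network latency
    return REMOTE_SALES.get(region, [])


# Preaggregated roll-up built ahead of time for the known, frequent query.
ROLLUP_TOTAL_BY_REGION = {region: sum(vals) for region, vals in REMOTE_SALES.items()}


@lru_cache(maxsize=128)
def total_sales(region):
    """Frequent path: answer from the roll-up; ad hoc path: federate on demand."""
    if region in ROLLUP_TOTAL_BY_REGION:
        return ROLLUP_TOTAL_BY_REGION[region]
    return sum(fetch_remote(region))


if __name__ == "__main__":
    print(total_sales("north"))  # served instantly from the roll-up
    print(total_sales("west"))   # unknown region: federated, slower, then cached
```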

7.8.3 Supporting Novice DV Users

In the DV model, IT specialists and experienced users set up many prebuilt connection types for novice users. In addition to designing and maintaining the prebuilt views required to mimic the functionality of reports, dashboards, and other tools, this work includes exposing particular data sources (such as SaaS finance, HR, and sales/marketing applications). It is also necessary to develop and manage sophisticated data engineering pipelines in order to obtain, cleanse, and convert the data used in SQL analytics or ML data processing [37].

7.9 Challenges

7.9.1 Challenge 1: Decision-Making, Management, and Action

Many businesses are unaware that big data fabric's analytical powers go beyond merely enhancing business outcomes. Value is obtained when a data-centric architecture is combined with organizational processes and human decision-making, so these must play crucial roles in value creation if the benefits of big data fabric are to be fully realized. Organizations should comprehend the insights produced by big data analytics, classify potential opportunities, coordinate resources, and turn these insights into actions that help them attain key business objectives. The adoption of AI-integrated software and connected medical devices can give healthcare organizations access to enormous amounts of data that they can use to produce insights [38]. These data may include administrative information, medical records for patients, data from connected devices, transcripts and clinical notes, and patient surveys. However, the majority of healthcare organizations, even the best ones, lack the sophisticated architecture and data management systems needed to handle data gathered from many sources [27, 28]. Their reliance on relational databases, which have trouble managing unstructured data gathered from many sources effectively, makes the information they obtain of questionable value. Healthcare service providers may be able to manage vast and unstructured data more easily if they switch from relational to nonrelational databases; a small sketch of this schema-flexible approach follows. The database design can then be expanded as the business grows and the volume of data rises.
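As a small illustration of the schema flexibility argued for above, the sketch below stores heterogeneous patient-related records as JSON documents in a single table, so new fields can appear without altering a relational schema. SQLite is used only as a convenient stand-in for a document store, and the field names are invented.

```python
# Sketch of schema-flexible, document-style storage for heterogeneous records.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")


def save(doc):
    """Persist any dict-shaped record as a JSON document."""
    conn.execute("INSERT INTO documents (body) VALUES (?)", (json.dumps(doc),))


def find(predicate):
    """Scan documents and return those matching an arbitrary predicate."""
    matches = []
    for (body,) in conn.execute("SELECT body FROM documents"):
        doc = json.loads(body)
        if predicate(doc):
            matches.append(doc)
    return matches


if __name__ == "__main__":
    save({"patient_id": "p1", "note": "routine visit"})
    save({"patient_id": "p2", "device": "glucose-monitor", "readings": [5.4, 6.1]})
    print(find(lambda d: "readings" in d))  # only the device record has readings
```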

7.9.2 Challenge 2: Data Integrity

Malicious hackers target healthcare providers because of connected medical equipment and the increased requirement to retain records of patient information. Statistics on healthcare data breaches show a clear upward trend between 2019 and 2022. Looking at the number of health records exposed annually, there was a significant spike in 2015; since then, the situation has improved, with continuous drops in the number of exposed records. These attacks highlight the need for a strong cybersecurity system in the healthcare industry to prevent data theft, information loss, and erosion of consumer trust. Even in 2014, practically all significant health networks experienced cyberattacks, with 20% of them incurring recovery costs exceeding $1 million. Organizations must maintain the quality of their data, since it is one of their most valuable resources: the consequences of faulty data include lost time, increased costs, decreased productivity, and worse decision-making [39]. Large companies still have difficulty ensuring data integrity when using big data fabric, since a significant fraction of the massive volume of big data may be inaccurate, duplicated, or incomplete, and only 3% of firms currently adhere to data quality requirements. Hence, data integrity for patients' healthcare datasets poses numerous challenges rooted in ethics and privacy, which call for strict data-sharing and access policies for the healthcare industry and its users.

7.9.3 Rising Healthcare Costs

The issue of healthcare costs is not new. A variety of parties influence the cost of healthcare services, including payers, manufacturers and distributors of medical equipment and drugs, and insurance providers. When so many parties are involved, conflict is unavoidable, and reaching agreement requires careful planning and patience. The rising cost of healthcare has a direct impact on healthcare organizations' income, because higher costs discourage patients from completing routine follow-ups after visits and from undergoing lab tests, resulting in poorer patient outcomes [40].


7.9.4 Challenge 3: Continuous Good Data Management Practice

A number of professional associations forecast a shortfall of around 100,000 physicians by 2030. Technology can help in a variety of ways, including through telehealth: even in distant areas, live streaming, store-and-forward imaging, and remote patient diagnostics can increase access to healthcare. According to the Bureau of Labor Statistics' Employment Projections 2016-2026, an additional 203,700 new RNs will be needed each year through 2026 in order to fill newly created roles and replace retiring nurses. Constant professional development and infrastructure upgrades are the key to reducing personnel shortages in the healthcare industry. The introduction of technology into the healthcare sector has fundamentally altered how training and education are delivered [41]. The typical clinical interview, which focuses on acute disease, is changing in favor of patient involvement and improved staff communication, which is encouraging for professional growth and for effective responses to the scarcity of healthcare workers. Although the data fabric architecture offers a solution for data management, the growing volume of data and the absence of effective data management practices remain a serious difficulty for many major firms.

7.10 Conclusions

In this chapter, we have analyzed different aspects of public healthcare informatics management and open-source data fabric architecture. The chapter has shown that, although big data fabric architecture has strong technological foundations, obtaining competitive advantage depends on a number of nontechnical aspects, including ongoing good data management and a well-defined data strategy. Additionally, the businesses that convert data-generated insights and observations into usable decision-making gain the greatest competitive advantage. This research suggests a new outline to ensure that big data fabric architecture's efficacy and competitive benefit can be achieved, while taking into account the precedents set by comparable existing architectures and the difficulties they present. Finally, given the recent advent of this architecture, concrete organizational examples and the implementation of big data fabric design in industrial practice are still in their infancy. There is presently little research on big data fabric design, so there is much to gain from further research on how this architecture may boost competitive advantage in businesses.


7.11 Way Forward

The adoption of corporate data fabrics has arisen as a method of ensuring access to data and information sharing in remote settings. The data fabric uses both machine and human competencies to improve the quality of healthcare services and to reduce the cost of repetitive activities many times over. Because healthcare is a human-centric domain dealing with human ailments, the ethical standards for data integrity and information sharing must be defined in the public interest so that the effective use of technology is ensured in the near future. Predefined standards for clinical IoT data validation and for the interoperability and integrity of healthcare data should be closely monitored and endorsed so that no privacy issues arise in personalized care while open-source software or tools are used. Hence, special routing rules, standards, or data fabrication architectures should be defined for the use of open-source software or tools that address personalized care, healthcare data privacy, and IoT-based modeling. Moreover, the data fabric must address the pressing issues of robustness with respect to the security and privacy of healthcare datasets.

References
[1] Wang, Y., & Wang, J. (2020). Modelling and prediction of global non-communicable diseases. BMC Public Healthcare, 20, 822. doi: 10.1186/s12889-020-08890-4.
[2] Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
[3] Murray, C. J. L., Aravkin, A. Y., Zheng, P., Abbafati, C., Abbas, K. M., Abbasi-Kangevari, M., et al. (2020). Global burden of 87 risk factors in 204 countries and territories, 1990–2019: A systematic analysis for the global burden of disease study 2019. Lancet, 396, 1223–1249. doi: 10.1016/S0140-6736(20)30752-2.
[4] Vos, T., Lim, S. S., Abbafati, C., Abbas, K. M., Abbasi, M., Abbasifard, M., et al. (2020). Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: A systematic analysis for the global burden of disease study 2019. Lancet, 396, 1204–1222. doi: 10.1016/S0140-6736(20)30925-9.
[5] Swinburn, B. A., Kraak, V. I., Allender, S., Atkins, V. J., Baker, P. I., Bogard, J. R., et al. (2019). The global syndemic of obesity, undernutrition, and climate change: The Lancet commission report. Lancet, 393, 791–846. doi: 10.1016/S0140-6736(18)32822-8.
[6] World Health Organization. (2013). Global Action Plan for the Prevention and Control of Noncommunicable Diseases 2013–2020. World Health Organization, Geneva.
[7] Firth, J., Gangwisch, J. E., Borsini, A., Wootton, R. E., & Mayer, E. A. (2020). Food and mood: How do diet and nutrition affect mental wellbeing? BMJ, 369, m2382. doi: 10.1136/bmj.m2382.
[8] Kee, F., & Taylor-Robinson, D. (2020). Scientific challenges for precision public healthcare. J Epidemiol Community Health, 74, 311–314. doi: 10.1136/jech-2019-213311.
[9] Canfell, O., Littlewood, R., Burton-Jones, A., & Sullivan, C. (2021). Digital health and precision prevention: Shifting from disease-centred care to consumer-centred health. Australian Health Review: A Publication of the Australian Hospital Association. doi: 10.1071/AH21063.
[10] Groves, R. (2011). Designed data and organic data. United States Census Bureau. Available online at: https://www.census.gov/newsroom/blogs/director/2011/05/designed-data-and-organic-data.html (accessed January 7, 2022).
[11] Xu, H., Zhang, N., & Zhou, L. (2020). Validity concerns in research using organic data. Journal of Management, 46, 1257–1274. doi: 10.1177/0149206319862027.
[12] Dolley, S. (2018). Big data's role in precision public healthcare. Front Public Healthcare, 6, 68. doi: 10.3389/fpubh.2018.00068.
[13] Rasmussen, S. A., Khoury, M. J., & Del Rio, C. (2020). Precision public healthcare as a key tool in the COVID-19 response. Journal of the American Medical Association, 324, 933–934. doi: 10.1001/jama.2020.14992.
[14] Sullivan, C., Wong, I., Adams, E., Fahim, M., Fraser, J., Ranatunga, G., et al. (2021). Moving faster than the COVID-19 pandemic: The rapid, digital transformation of a public healthcare system. Applied Clinical Informatics, 12, 229–236. doi: 10.1055/s-0041-1725186.
[15] Hekler, E., Tiro, J. A., Hunter, C. M., & Nebeker, C. (2020). Precision health: The role of the social and behavioral sciences in advancing the vision. Annals of Behavioural Medicine, 54, 805–826. doi: 10.1093/abm/kaaa018.
[16] Chen, C., Loh, E. W., Kuo, K. N., et al. (2020). The times they are a-changin' – healthcare 4.0 is coming! Journal of Medical Systems, 44, 40. https://doi.org/10.1007/s10916-019-1513-0.
[17] Pre-Standards Workstream Report: Clinical IoT Data Validation and Interoperability with Blockchain (pp. 1–29). 28 June 2019.
[18] https://www.analyticsinsight.net/role-of-quantum-computing-and-ai-in-healthcare-industry/
[19] Kent, J. (2018). Big data to see explosive growth, challenging healthcare organizations. Health IT Analytics, December 3, 2018. [Online: https://healthitanalytics.com/news/]
[20] Maity, S., & Sarkar, K. (2022). Topic sentiment analysis for twitter data in Indian languages using composite kernel SVM and deep learning. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(5), Article 109 (September 2022), 35 pages. https://doi.org/10.1145/3519297.
[21] Sterpone, L., & Violante, M. (2006). Recom: A new reconfigurable compute fabric architecture for computation-intensive applications. 2006 IEEE Design and Diagnostics of Electronic Circuits and Systems, 52–56.
[22] https://nix-united.com/blog/data-fabric-the-future-of-cloud-technologies/ [Accessed on 3.9.2022].
[23] https://www.spiceworks.com/tech/big-data/articles/what-is-data-fabric/ [Accessed on 3.9.2022].
[24] Türker, C., Stolte, E., Joho, D., & Schlapbach, R. (2007). B-Fabric: A data and application integration framework for life sciences research. In Cohen-Boulakia, S., & Tannen, V. (eds.), Data Integration in the Life Sciences (DILS 2007), Lecture Notes in Computer Science, vol. 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_6.
[25] Yuhanna, N., Leganza, G., Warrier, S., & Izzi, M. (2016). Big Data Fabric Drives Innovation and Growth. Forrester Research, viewed 22 May 2020.
[26] https://www.forrester.com/report/Big+Data+Fabric+Drives+Innovation+And+Growth/-/E-RES129473
[27] Ylijoki, O., & Porras, J. (2016). Conceptualizing big data: Analysis of case studies. Intelligent Systems in Accounting, Finance and Management, 23(4), 295–310.
[28] Zikopoulos, P., & Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media.
[29] https://www.stonebranch.com/blog/healthcare-interoperability-data-pipeline-automation [Accessed on 3.9.2022].
[30] Bahri, S., Zoghlami, N., Abed, M., & Tavares, J. M. R. (2018). Big data for healthcare: A survey. IEEE Access, 7, 7397–7408.
[31] Poeschel, F., Godoy, W. F., Podhorszki, N., Klasky, S., Eisenhauer, G., Davis, P. E., … Huebl, A. (2021, October). Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2. In Smoky Mountains Computational Sciences and Engineering Conference (pp. 99–118). Springer, Cham.
[32] Wonham, M., de Camino-Beck, T., & Lewis, M. (2004). An epidemiological model for West Nile virus: Invasion analysis and control applications. Proceedings of the Royal Society: Biological Sciences, 271(1538), 501–507.
[33] Sonesson, C., & Bock, D. (2003). A review and discussion of prospective statistical surveillance in public healthcare. Journal of the Royal Statistical Society Series A, 166, 5–12.
[34] Kay, B., Timperi, R., Morse, S., Forslund, D., McGowan, J., & O'Brien, T. (1998). Innovative information-sharing strategies. Emerging Infectious Diseases, 4(3), 465–466.
[35] Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6), 1481–1496.
[36] Kulldorff, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A, 166(1), 61–72.
[37] CDC. (2012). Introduction. Challenges and opportunities in public healthcare surveillance: A CDC perspective. MMWR, 61(Suppl; July 27, 2012), 1–2.
[38] Wallenstein, S., & Naus, J. (2004). Scan statistics for temporal surveillance for biologic terrorism. MMWR, (Suppl), 74–78.
[39] Bradley, C. A., Rolka, H., Walker, D., & Loonsk, J. (2005). BioSense: Implementation of a national early event detection and situational awareness system. MMWR, 54(Suppl), 11–19.
[40] American Recovery and Reinvestment Act of 2009. Title XIII – Health Information Technology, Subtitle B – Incentives for the Use of Health Information Technology, Section 3013, State Grants to Promote Health Information Technology. State Health Information Exchange Cooperative Agreement Program. Funding Opportunity Announcement. Available at: http://www.grants.gov/search/search.do?oppId=58990&mode=VIEW.
[41] Khan, A. S., Fleischauer, A. F., Casani, J., & Groseclose, S. L. (2010). The next public healthcare revolution: Public healthcare information fusion and social networks. American Journal of Public Health, 100, 1237–1242.
[42] https://www.gartner.com/smarterwithgartner/data-fabric-architecture-is-key-to-modernizing-datamanagement-and-integration [Accessed on 29.09.2022].

Kuldeep Singh Kaswan, Jagjit Singh Dhatterwal, Naresh Kumar

8 Simulation Tools for Big Data Fabric

Abstract: The big data fabric architecture developed out of numerous iterations of big data and data management institutional arrangements and architectures, bringing together a wide variety of cutting-edge technologies to facilitate the efficient management of data at large corporations. Evolving technology components, such as big data analytics and cloud computing, form the foundation of this most recent iteration of big-data enterprise architecture. Differentiating business systems and using big data analytics may give large companies a leg up in the market. Big data fabric architecture, which is geared toward commercial enterprises, can alleviate data management problems; it also improves the ability to generate value through efficient big data analytics, which yields valuable business insights. Considering existing instances of similar designs and their associated hurdles, this chapter introduces the potential of big data fabric architecture, whose relative standing may be secured by a fresh approach. Finally, because this architecture is relatively young, implementation examples and applications of big data fabric framework engineering processes remain underdeveloped. Research on big data fabric design is still in its infancy; thus, further in-depth studies into how this design might improve a company's competitive edge are required.

8.1 Introduction

This study aims to answer whether a company can gain competitive advantages via the adoption and use of the promising new technology known as big data fabric architecture. Studies have shown that large-scale adoption of, and investment in, different enterprise systems may deliver only a short-term competitive advantage to organizations, prompting the need to examine this further [1]. The resource-based view explains this short-term competitive advantage of enterprise system investments: if all businesses used the same IT systems, they would lose their ability to differentiate themselves through strategic resources, which would in turn reduce their competitiveness [9].

Kuldeep Singh Kaswan, School of Computing Science and Engineering, Galgotias University, Greater Noida, e-mail: [email protected] Jagjit Singh Dhatterwal, Department of Artificial Intelligence and Data Science, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India, e-mail: [email protected] Naresh Kumar, Department of Computer Science and Engineering, GL Bajaj Institute of Technology and Management, Greater Noida, e-mail: [email protected] https://doi.org/10.1515/9783111000886-008


This study therefore sets out to address the question, "How does corporate adoption of big data fabric infrastructure influence a company's capacity to compete?"

8.2 Architectural Framework for Big Data Collection and Analysis

The big data fabric shows how the necessity to handle massive data efficiently in business was the original impetus for the development of the architecture. Because the architecture is designed specifically to meet the requirements of businesses, it is hoped that a company adopting it can achieve a competitive edge.

8.2.1 Big Data Fabric Architecture and Its Benefits

The big data fabric framework was proposed as a way not only to analyze information but also to generate easily retrievable, constructive data and transform it into implementable, actionable knowledge, without the sophistication of multiplatform equipment and the interoperability issues that are typical of modern information systems development [2, 3]. It draws on a variety of new and developing technologies, including big data analytics, cloud-based services, and data fabric architectures, as the backbone of big data architecture; each of these areas is recognized to have its own considerable market share.

8.2.2 Big Data Fabric Technology

Big data fabric is a term that combines the technological concepts of "big data" and "data fabric" and was first used by Yuhanna et al. (2016). While data fabric offers a data management framework, it does not require big data; big data fabric is founded on the merging of these two established technologies. Big data fabric is defined as "automatically, intelligently, and securely bringing together disparate big data sources and processing them in a big data platform technology, leveraging data lakes, Hadoop, and Apache Spark to present a single, trusted, and comprehensive view of customer and business data" [10; Yuhanna and Istok 2017].


8.2.3 Layers of Architecture for Big Data Fabric

The five levels of big data fabric architecture [25] are represented below; these layers consist of big data analysis tools:
– Data ingestion, the first step, which entails adding information to large databases.
– Data administration and intelligence, which includes data governance, security, management, access, and other associated procedures.
– Data orchestration, in which data is merged and translated into a form that is beneficial to consumers.
– Data discovery, or informational visibility [15].
– Data access, an interface by which users may access and acquire data for the purpose of gaining actionable insight.

8.2.3.1 Layer 1: Data Ingestion

Data ingestion refers to acquiring and introducing information into databases and other data processing infrastructure. Data is made usable after passing through this layer, where it is prioritized according to its source, validated, and sent to its final storage destination. Data may be extracted in one huge batch or split up into many smaller ones; depending on the circumstances, the ingestion layer chooses the appropriate technique, preferring approaches that load data quickly in the context of the application [17]. A minimal sketch of this batching choice follows.
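The choice between one large batch and many smaller ones can be made explicit in code; the minimal sketch below validates incoming records and then yields them either as a single bulk batch or in fixed-size chunks. The validation rule and batch size are illustrative assumptions.

```python
# Minimal sketch of the ingestion choice: load everything in one batch, or
# split validated records into smaller batches when a batch_size is given.

def validate(record):
    """Basic ingest-time check: every record must name its source."""
    return bool(record.get("source"))


def batches(records, batch_size=None):
    """Yield all records at once, or in fixed-size chunks when batch_size is set."""
    valid = [r for r in records if validate(r)]
    if batch_size is None:
        yield valid                                   # one huge batch
    else:
        for i in range(0, len(valid), batch_size):    # many smaller ones
            yield valid[i:i + batch_size]


if __name__ == "__main__":
    incoming = [{"source": "pos", "amount": 10}] * 5 + [{"amount": 3}]  # last one invalid
    print(sum(1 for _ in batches(incoming)))                 # 1 bulk batch
    print(sum(1 for _ in batches(incoming, batch_size=2)))   # 3 micro-batches
```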

8.2.3.2 Layer 2: Information Administration and Analysis

Data management and intelligence are woven into information security, records management, systems integration, compliance, corporate procedures, external policies, and information quality across all phases of the data process, independent of the data's computing system or provenance [16, 25].

8.2.3.3 Layer 3: Data Orchestration

"Data orchestration" is a newer term for a set of technologies that virtualize all data, abstract its access across data stores, and make it available to data-driven applications via uniform APIs and a shared namespace [7, p. 4; 13].


8.2.3.4 Layer 4: Information Exploration

Data discovery, the process of looking for anomalies and patterns in data, is an ever-evolving method driven by business users [8]. To aid in data discovery, business analysts must have skills in data modeling, data analysis, and guided advanced analytics operations [8].

8.2.3.5 Layer 5: Data Access

Put simply, the data access layer is the point of interaction between a program and a database, or between a website and its underlying data repository [4]. Creating and closing connections to the database, as well as executing CRUD (create, read, update, delete) operations, all fall within the purview of the data access layer [18]; a minimal sketch follows.
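A minimal sketch of such a data access layer follows, exposing the CRUD operations over a SQLite table. The entity ("notes") and its fields are invented for illustration; the pattern, not the schema, is the point.

```python
# Sketch of a data access layer: it owns the connection and performs
# create/read/update/delete against one table.
import sqlite3


class NoteDAO:
    """Opens and closes the connection and performs CRUD operations."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT)")

    def create(self, text):
        cur = self.conn.execute("INSERT INTO notes (text) VALUES (?)", (text,))
        self.conn.commit()
        return cur.lastrowid

    def read(self, note_id):
        row = self.conn.execute(
            "SELECT id, text FROM notes WHERE id = ?", (note_id,)).fetchone()
        return {"id": row[0], "text": row[1]} if row else None

    def update(self, note_id, text):
        self.conn.execute("UPDATE notes SET text = ? WHERE id = ?", (text, note_id))
        self.conn.commit()

    def delete(self, note_id):
        self.conn.execute("DELETE FROM notes WHERE id = ?", (note_id,))
        self.conn.commit()

    def close(self):
        self.conn.close()


if __name__ == "__main__":
    dao = NoteDAO()
    nid = dao.create("follow-up visit booked")
    dao.update(nid, "follow-up visit completed")
    print(dao.read(nid))
    dao.delete(nid)
    dao.close()
```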

8.3 Analytics for Massive Data Sets and Cloud Computing

Big data analytics is the procedure of amassing and organizing large amounts of data in order to conduct in-depth analyses and draw valuable conclusions for use in business. Big data analytics can realistically be conducted only in the cloud: since local hardware cannot handle data beyond terabyte scale, companies such as Microsoft and Amazon provide infrastructure-as-a-service offerings to produce, gather, and store big data, relying on the processing capacity of cloud platforms rather than local hardware [5, 7]. Big data fabric technology also paves the way for a unified platform that gives a user-friendly interface to data stored in a wide variety of locations and storage media, including the cloud, on-premises servers, and mobile devices. Companies thereby gain an advantage because they can create a single, uniform source of truth for all the data they generate [6].

8.4 Data Fabric Framework and Its Disadvantages

Big data fabric architecture's forerunner, the data fabric architecture, had several constraints that made it a less desirable choice for establishing a competitive edge. Table 8.1 compares the features of big data fabric with those of data fabric and highlights the capabilities the older architecture lacks for deriving value from large datasets. Compared with its predecessor, big data fabric architecture should be prioritized by large companies because of the greater value it can deliver from abstracted data in the form of actionable knowledge.

Table 8.1: Differences between big data fabric and data fabric.
– Big data fabric: capacity to deal with several forms of information. Data fabric: only capable of processing structured information.
– Big data fabric: analytics in real time. Data fabric: most data processing occurs in bulk.
– Big data fabric: the architecture can accommodate the integration of disparate big data analysis technologies, such as Hadoop and RDBMS, despite the tools' inherent heterogeneity. Data fabric: integration of analytical tools is impossible if they are not interoperable.

8.5 Competitive Advantage and Enterprise Systems

Studies have demonstrated that the capacity to address the five competitive forces is directly correlated with an organization's competitive advantage, and IT systems, data, and information are essential factors that have been shown to boost that advantage. One of the most significant recent corporate techniques for obtaining competitive advantage is to gather data and generate useful information from it [20, Porter and Millar, 1985]. Therefore, if a large company wants to gain an edge over its rivals, it must implement an enterprise architecture that sets it apart from the competition [9].

8.5.1 The Enterprise Architectures of Yesterday, Today, and Tomorrow

Data integration, process differentiation, and data automation are three methods that have helped enterprise systems such as enterprise resource planning and customer relationship management systems gain an edge in the market [7]. To get an edge in the market, businesses are increasingly adopting hybrid infrastructures that combine on-premises and cloud resources, which allows for greater efficiency, lower IT costs, and the generation of real-time data for analysis [8]. Because hybrid systems inherently produce large volumes of data, however, businesses have significant difficulties in adequately managing this data [9]. Statistics suggest that just 25% of a company's data is actually being used for analytics (Yuhanna and Istok, 2017). While cloud computing has become increasingly popular, large companies still have a sizable window in which to capitalize on the largely undervalued and under-captured opportunity of generating income from massive


amounts of data and of transforming that data into usable insights; the result is a step forward in the development of innovation capability.

8.6 Gaining a Competitive Edge with Big Data Analytics

The technological aspects of big data are distinct from those of transactional data, so cutting-edge data management and analysis applications are essential. Data-driven businesses, such as Google and Amazon, have an edge over their less data-driven rivals, according to a study by [10]. The key to gaining significant insights from big data is applying the right analytics tools, which may be difficult given the volume, variety, and velocity of big data (Ylijoki and Porras, 2016). Netflix has relied on big data analytics since 2006 on its way to becoming the leader in the internet streaming business. The primary focus of Netflix's big data analytics efforts is on enhancing the company's recommendation features [28]. The recommendation algorithm is tailored to individual tastes by combining two data collection strategies: the subscriber's viewing habits are collected for content-based filtering, and comparable users are pooled together for collaborative filtering. Netflix's recommendation engine combines content-based and collaborative filtering to provide users with personalized suggestions. To further aid in discovering consumer insight and managing customer data, the organization has adopted the Amazon Web Services cloud computing platform; the flexibility and scalability of the cloud let Netflix meet the demands of its millions of subscribers all over the world.
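The sketch below is illustrative only and is not Netflix's actual system: it simply blends a content-based score drawn from a viewer's history with a collaborative score drawn from users pooled as similar, the general combination the paragraph describes. All data and weights are made up.

```python
# Toy hybrid recommendation: content-based + collaborative scores, blended.
from collections import Counter

def content_score(title_tags, user_history_tags):
    """Fraction of the title's tags that also appear in the user's viewing history."""
    history = Counter(user_history_tags)
    return sum(1 for t in title_tags if history[t] > 0) / max(len(title_tags), 1)

def collaborative_score(title, similar_users_ratings):
    """Average rating (scaled to 0-1) given to the title by pooled similar users."""
    ratings = [r[title] for r in similar_users_ratings if title in r]
    return sum(ratings) / len(ratings) / 5.0 if ratings else 0.0

def hybrid_score(title, title_tags, user_history_tags, similar_users_ratings, w=0.5):
    return w * content_score(title_tags, user_history_tags) + \
           (1 - w) * collaborative_score(title, similar_users_ratings)

if __name__ == "__main__":
    history = ["thriller", "crime", "series"]
    neighbours = [{"Dark Waters": 4}, {"Dark Waters": 5}, {"Space Opera": 2}]
    print(round(hybrid_score("Dark Waters", ["thriller", "crime"], history, neighbours), 3))
    # 0.95: liked by similar users and matching the viewer's history
```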

8.6.1 Organizations That Use Data Fabric to Boost Performance

Organizations may hasten their digital transformation through the use of data fabric, and cutting-edge enterprise architecture serves as the engine that drives innovation. With the ability to sense markets, react to consumers, predict cyberthreats, and optimize operations, businesses that are data-driven have a distinct advantage [3]. Data fabric designs are used by service providers such as NetApp and Winshuttle to help businesses handle, manage, analyze, and store data from a variety of sources. Domino's Pizza, an international pizza delivery service, has undergone digital transformation by using data fabric architecture [22]. Domino's plan is to consolidate information from all sources into a unified picture of its business. Domino's is using data fabric to create a centralized data collection system that can monitor activity


across all of its POS terminals, distribution hubs, and marketing platforms. Another company that uses data fabric architecture to differentiate itself is the motorcycle manufacturer Ducati (NetApp 2020). Only 55,000 bikes were manufactured by Ducati in 2018 because of its rigorous production standards. Compared with industry rivals whose manufacturing capacities run into the millions, Ducati saw an opportunity to gain an edge through its rapid innovation capacity. As a result, the corporation views data as an integral component in its quest to rapidly expand its operations. Thanks to a hybrid cloud infrastructure, Ducati is able to gather and analyze data from over 15,000 bikes throughout the globe, dramatically increasing the speed with which it can go from the road to product innovation. Businesses are able to acquire an edge in the market because of the capabilities of the data fabric. Important features include:
– Digital asset management: storing, analyzing, and managing information of many types is made more accessible with the help of the data fabric, and it supports the legitimate and productive use of digital assets such as cryptocurrencies. Redundant storage space and financial costs may be decreased, while the speed with which digital assets can be organized and located is greatly enhanced.
– Dynamic data model: businesses may benefit from adaptable and scalable data models thanks to this feature. Data can be accessed and modified without impacting the original records, making it ideal for use in business settings. As company needs and opportunities change, dynamic data models can adapt quickly to meet those needs.

8.7 Initial Obstacle: Management's Ability to Make and Implement Decisions

The analytical capabilities made possible by big data fabric are often misunderstood and attributed only to financial gains. A data-centric architecture's value is realized when it is applied to organizational processes and informed human judgment (Shanks and Sharma 2011). Consequently, for companies to create value from big data, management must be competent at interpreting the results of big data analysis, spotting new opportunities, gathering the necessary resources, and acting on the findings [11].

8.7.1 Problem 2: Verifying the Accuracy of Data

A company's data is very important; protecting it is therefore a top priority, and it is crucial that businesses take steps to ensure that the data they collect and store is of the highest possible quality. Negative effects of inaccurate information include slowed progress, higher costs, lower output, and worse decision-making [30]. Assuring data integrity during big


data fabric implementation remains a difficulty for enterprises, since much of the vast quantity of big data may be inaccurate, out of date, duplicated, or incomplete [5]. Only 3% of companies have data that meets basic quality standards [30].

8.7.2 Problem 3: Maintaining Solid Data Management Procedures

The lack of adequate data management and the ever-increasing amount of data mean that, even though data fabric architecture provides a solution, many large companies continue to struggle with the problem. A set of practices known as "data management" must be implemented and kept up to date if information is to be used safely and effectively [31]. Organizations continue to struggle with maintaining data quality, security, and privacy, as shown by the study by [12]. Work remains to be done on information protection, decision-making procedures, metadata management, and data processing methods [13].

8.8 New Strategies for Leveraging Big Data Fabric Technology to Gain a Competitive Benefit

To succeed in the modern business world, it is crucial to efficiently extract value from big data and translate it into meaningful business insights [14]. This study proposes a technique that combines four fundamental principles for producing value from big data analysis (Begoli and Horey, 2012) with three components of effective big data management, to guarantee that big data fabric architecture is used to deliver high-quality service [27]. The proposed framework consists of constituent parts that work together to capture crucial organizational objectives. The proposed architecture for using big data to create value gives businesses a leg up on the competition.

8.8.1 Allow for Multiple Analytical Approaches

When undertaking a preliminary investigation of their operations, businesses should use a wide range of tools and analytical methodologies to extract maximum value from their data [15]. Methods for collecting and analyzing massive amounts of data include, but are not limited to, statistics, data mining, pattern recognition, visualization techniques, and documentary analysis (Begoli and Horey, 2012). By using a variety of analyses, a large corporation may gain a more thorough understanding of its data and draw conclusions with more confidence.


8.8.2 One Size Does Not Fit All

Evidence from real-world applications shows that not all big data analytic techniques and organizational structures can be accommodated by a single kind of database structure, such as one massive relational database (Begoli and Horey, 2012). This is why it is important to deploy a data management system designed specifically for gaining insights from data and turning them into useful outcomes. A great benefit to large organizations is the capacity to effectively store, manage, and handle tremendous volumes of big data in a variety of formats, as well as to perform a wide range of analyses on that data [29].

8.8.3 Governance of Information Security

Potential security and privacy risks for businesses stem from the ease with which big data fabric allows massive volumes of data to be stored, accessed, and processed [10]. Businesses should therefore ensure that all of their big data platforms, systems, and infrastructure adhere to a uniform security standard based on applicable rules and privacy policies. This ensures data is gathered, stored, and transported securely and in conformance with regulatory mandates [26]. In addition, access controls and an audit trail for data updates are required if big data is to be used effectively [26].
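The sketch below is a minimal illustration of those two controls only: an access check before any operation and an append-only audit trail of data updates. The roles, permissions, and record names are hypothetical, not part of any particular standard.

```python
# Minimal sketch: role-based access check plus an append-only audit trail.
import datetime

AUDIT_LOG = []          # append-only trail of who changed what, and when
PERMISSIONS = {"analyst": {"read"}, "steward": {"read", "update"}}

def authorize(role: str, action: str) -> None:
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role '{role}' may not perform '{action}'")

def update_record(store: dict, key: str, value, role: str) -> None:
    authorize(role, "update")
    old = store.get(key)
    store[key] = value
    AUDIT_LOG.append({
        "ts": datetime.datetime.utcnow().isoformat(),
        "role": role, "key": key, "old": old, "new": value,
    })

if __name__ == "__main__":
    records = {"customer_42_segment": "retail"}
    update_record(records, "customer_42_segment", "enterprise", role="steward")
    print(AUDIT_LOG[-1]["old"], "->", AUDIT_LOG[-1]["new"])   # retail -> enterprise
    try:
        update_record(records, "customer_42_segment", "smb", role="analyst")
    except PermissionError as err:
        print(err)    # role 'analyst' may not perform 'update'
```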

8.8.4 Open Up Information

It is crucial that data be easily accessible to business users. To keep data simple and turn it into actionable intelligence, making information readily available to individuals is essential (Begoli and Horey, 2012). Users should be able to engage with data systems in a variety of ways, including via web-enabled APIs for visualizing analytics findings and through standardized communication protocols across disparate systems (Begoli and Horey, 2012).

8.8.5 Involve the System, the Steps, and the People

For a company to reap the benefits of big data fabric architecture, it is crucial that all three of its fundamental components – people, processes, and organization – be put into motion [27]. Gaining a competitive edge through the use of big data fabric requires a thorough familiarity with the company's operational strategy and methods. To obtain business benefits from unstructured and semi-structured data, companies need to re-engineer their business requirements in order


to reap the full advantages of big data fabrics, since conventional business processes are best suited to structured data [19]. Jha et al. [19] argue that businesses may better adapt to rapid change by rethinking their key business processes. In this approach, the business's strategy, objectives, and requirements inform the design of the big data fabric architecture, resulting in a more cohesive business-IT relationship and the generation of actionable insights that aid decision-making [20]. Furthermore, to develop a complete big data fabric, organizations and individuals need to be on the same page [2]. Employees and managers need training to reap the full benefits of big data fabric; this training should cover the specifics of big data fabric as well as any applicable standards, processes, and regulations. As a corollary, change management should be implemented to educate staff on the benefits of the new big data fabric and to facilitate a smoother transition for the business [11].

8.8.6 Establish Unambiguous Norms and Guidelines

When it comes to handling massive amounts of data, both internal and external to the company, the big data fabric architecture is invaluable. Considering the vast amounts of data being produced and transferred, it is crucial for the company to involve its relevant stakeholders in ensuring data quality, integrity, and interoperability across all applications, systems, and infrastructures [6]. In addition, interlayer connections in a structured big data fabric architecture allow for more thorough tracking of its progress and more accurate reporting of its success [26]. By adhering to three concepts, a company may set standards and policies [21]: agile policy modification and updating; transparent policy construction; and implementation of appropriate current criteria. The big data fabric architecture may then assist the company in systematically processing big data, minimizing the chances of error and risk.

8.8.7 It Is Important to Stick to Standard Procedures in Your Field

With an emerging technology like big data fabric, there may be many obstacles to overcome before the system can be used in practice. An organization's success in implementing a big data fabric architecture depends on its adherence to established, well-tested best practices in the field [1, 27]. A company's core technological issues may be fixed if it adopts best practices. To guarantee the project's smooth execution, this may lessen risks such as conflicting frameworks and business requirements [1]. It may also aid businesses in focusing their efforts where they will have the most impact, and it can guarantee that the firm as a whole is covered in the design of the big


data fabric. The corporation may establish norms such as integrating business processes to improve information interoperability, using the cloud to share data across departments and foster more cooperation, and drawing on technology expertise to make the most of available resources [1].

8.9 Big Data Tools

It is possible to extract meaningful patterns and trends from the mountain of data gathered; to do so, the data must be archived, analyzed, and manipulated. The following are some of the most popular tools for managing big data.

8.9.1 Hadoop

Commonly used for analyzing large amounts of data, Hadoop is a free and open-source program. It is a way to put MapReduce into practice for analyzing massive data sets. Hadoop uses a user-level distributed file system, the Java-based HDFS, for cluster-wide storage management; it was created to be portable across various computer systems and operating systems. The MapReduce framework is the foundation of Hadoop. A map function and a reduce function carry out the computation: the map function takes a given key/value pair as input and produces a new set of keys and values, and the reduce function is then applied to these intermediate key/value pairs to combine all values corresponding to a single key [5]. A Hadoop cluster is built as follows: the client interacts with the cluster to submit work; the cluster is made up of cluster machines, each of which runs a MapReduce agent and an HDFS node; and the cluster also contains a name node.
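The pure-Python sketch below only illustrates the map, shuffle, and reduce steps just described on a word-count example; Hadoop itself implements this flow in Java over HDFS, so none of these function names belong to Hadoop's API.

```python
# Toy MapReduce flow: map -> group by key (shuffle) -> reduce.
from collections import defaultdict

def map_fn(_, line):
    """Map: take a (key, value) pair and emit intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: combine all values that share a single key."""
    return word, sum(counts)

def map_reduce(records):
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_val in map_fn(key, value):   # map phase
            intermediate[out_key].append(out_val)     # shuffle/group by key
    return [reduce_fn(k, v) for k, v in intermediate.items()]  # reduce phase

if __name__ == "__main__":
    lines = [(0, "big data needs big tools"), (1, "data fabric tools")]
    print(sorted(map_reduce(lines)))
    # [('big', 2), ('data', 2), ('fabric', 1), ('needs', 1), ('tools', 2)]
```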

8.9.2 Google Charts

Google Charts is a back-end application programming interface that is free to use. You may quickly make a chart from your data and add it to a website: from the provided data and the formatted parameters of an HTTP request, Google generates a PNG picture of the required chart [14]. It can produce line, bar, pie, and even radar charts, and there is also support for QR codes, Venn diagrams, scatter plots, maps, and Google-o-meters. Data such as oceanographic measurements, for instance, may be easily visualized in this way.
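As a hedged illustration of the kind of HTTP request the paragraph describes, the sketch below builds a chart URL using the parameters of the legacy Google Image Charts API (cht, chd, chs, chl). That API has since been deprecated, so treat the endpoint and parameter names as historical and the whole example as illustrative rather than a working integration.

```python
# Illustrative only: constructing a chart-request URL with legacy parameters.
from urllib.parse import urlencode

def chart_url(values, labels, size="400x200", chart_type="p"):
    params = {
        "cht": chart_type,                                # "p" = pie chart
        "chd": "t:" + ",".join(str(v) for v in values),   # text-encoded data
        "chs": size,                                      # width x height in px
        "chl": "|".join(labels),                          # slice labels
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)

if __name__ == "__main__":
    print(chart_url([60, 25, 15], ["Structured", "Semi-structured", "Unstructured"]))
```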


8.9.3 SAP's HANA

One of HANA's defining characteristics is SAP's in-memory computing studio. Alongside components such as the SAP Host Agent and Sybase Replication Server 15, the index server handles session management, authorization, transaction management, and command processing. Row store and column store are both available in HANA; either store may be used to create tables, but the column store offers greater flexibility. The index server is also responsible for maintaining consistency across database object caches, log files, and permanent storage files [7].

When it comes to free, open-source software, GridGain is a strong choice. Its memory-based data fabric provides a feature-rich option for processing data in RAM and is a useful tool for bringing cutting-edge processing capability into the real-time business. Using a single, highly scalable data access and processing layer, it offers high-performance transactions, real-time streaming, and ultrafast analytics, so businesses can foresee and respond to market shifts with innovation. To facilitate communication across various data stores, the GridGain In-Memory Data Fabric offers a single API that can be used with the most popular programming languages and frameworks, including Java, .NET, and C++, and with both structured and unstructured information (SQL, NoSQL, and Hadoop). This provides a safe, highly available, and easily managed data environment, letting businesses handle fully ACID transactions and obtain useful insights from real-time, interactive, and batch queries.

8.9.4 Skytree

Skytree provides a set of tools that can execute a wide variety of complex machine-learning algorithms, provided the proper commands are supplied. The Skytree server is designed to efficiently execute a variety of traditional machine-learning algorithms on your data; by using these methods, it may be possible to outpace competing software by a factor of 10,000 or more. It can sift through the information for groups of numerically similar objects and then reverse this process to spot anomalies. A free version of Skytree is available with the same algorithms, repackaged; the free version is restricted to datasets of 100,000 rows, whereas the commercial version has no such restriction. The initial step is to collect data from a number of sources. After this information is gathered, it is turned into the necessary structure and then sent to the Skytree server for further processing.

Splunk uses text strings to look for information in its index. Splunk locates the desired URLs and organizes them into a timeline based on the timestamps it finds in the data. The Splunk tool receives information from web servers and stores it there. Afterward, the


cleaned and organized information is sent to the analytics database. After the data has been examined, it is sent to an OLAP engine for further processing.

8.10 Big Data Applications

The widespread use of big data in recent years is a testament to its value. Some of the most prominent applications of big data are listed below.

8.10.1 Government

Big data analytics has been shown to be quite beneficial in public service. The examination of massive amounts of data was crucial to President Obama's re-election campaign in 2012. In India's 2014 general election, the BJP and its allies won by a landslide, thanks in large part to the use of big data analysis. The Indian government uses a variety of methods to learn how the Indian public reacts to government action and to solicit suggestions for improving existing policies [24].

8.10.1.1 Social Media Analytics

Big data has exploded with the emergence of social media. IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights big data platform, is just one example of the many tools developed to decipher social network conversation. The market's reaction to products and campaigns may be gauged in real time with the use of social media, and companies may use this information to fine-tune their pricing, advertising, and campaign placement. To get useful insights from massive data, some preparation work must be done beforehand. One of a business's primary objectives is to meet customer needs by delivering information or goods that appeal to customers' particular worldview; smart judgments produced from big data are therefore required to understand the customer psyche.

8.10.1.2 Technology

The following organizations are examples of technology-sector uses of big data, since they regularly deal with large volumes of data and use that data in making business decisions. eBay.com employs a 40 PB Hadoop cluster for search, customer recommendations, and merchandising, in addition to two data warehouses of 7.5 PB and 40 PB; in total, eBay's data center holds some 90 petabytes. Every day, Amazon.com


processes millions of administrative tasks and answers questions from over 500,000 independent merchants. Linux is the backbone of Amazon's infrastructure, and in 2005 the company owned the three biggest Linux databases (7.8, 18.5, and 24.7 TB, respectively). Facebook is estimated to store 50 billion images uploaded by its users. Walmart processes more than 1 million customer transactions every hour, with the resulting data imported into databases estimated to hold more than 2.5 petabytes (2,560 terabytes) of information, or about 167 times as much as the entire collection of books in the US Library of Congress. Windermere Real Estate uses the aggregated, anonymized GPS data from over 100 million cars to predict when potential homebuyers will be leaving for and returning from work [10].

8.10.1.3 Science and Research

When the Sloan Digital Sky Survey (SDSS) began gathering astronomical data in 2000, it quickly amassed more information than had ever been gathered in the entire history of astronomy. SDSS collects data at a pace of 200 GB every night and has accumulated more than 140 terabytes in total; its successor is expected to gather that much information every five days. Decoding the human genome used to take 10 years, but today it can be done in a day or less, because advances in DNA sequencing technology have cut the price of the procedure by a factor of 10,000 in the previous decade, making it 100 times more affordable than anticipated by Moore's Law. As for climate data, NASA's Center for Climate Simulation has 32 petabytes stored on the Discover supercomputer cluster [10].

8.10.1.4 Fraud Detection

One of the most compelling uses of big data is fraud detection, which is of great interest to companies whose work involves processing claims or transactions of any kind. In the past, real-time fraud detection has been difficult to achieve: by the time fraud is uncovered, the damage has usually already been done, and all that can be done is to limit the fallout and change procedures to avoid future occurrences. By analyzing claims and transactions in real time, big data solutions can revolutionize fraud detection by spotting broad trends across numerous transactions or unusual behavior from a single user.
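The sketch below illustrates only the "unusual behavior from a single user" part of that idea with a deliberately simple rule; the threshold and the amounts are made up, and real fraud systems combine many such signals.

```python
# Toy per-user anomaly check: flag an amount far outside the user's own history.
import statistics

def is_suspicious(history, amount, z_threshold=3.0):
    """Flag amounts more than z_threshold standard deviations above the user's mean."""
    if len(history) < 5:                 # too little history to judge
        return False
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0
    return (amount - mean) / stdev > z_threshold

if __name__ == "__main__":
    past_amounts = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8]
    print(is_suspicious(past_amounts, 54.0))    # False: in line with history
    print(is_suspicious(past_amounts, 950.0))   # True: far outside the usual range
```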

8.10.1.5 IT Log Analytics

Huge amounts of log and trace data are produced by IT services and IT departments. There is no way for businesses to evaluate all of this data in real time, much less


manually, without a big data solution. However, once a big data solution is implemented, all of that log and trace data may be put to good use. IT log analytics has the most applicability among these uses of big data. The capacity to spot broad trends rapidly would be invaluable to any company with a large IT staff, both for the purposes of issue diagnosis and prevention. Similarly, any business with a sizable IT division would value the ability to spot gaps in performance that may be filled with little effort.

8.10.1.6 Call Center Analytics

Contact centers offer another real-world example of how businesses use big data to better serve their customers. Although call centers are frequently an excellent barometer and shaper of market mood, without a big data solution much of the data a contact center generates would be ignored or discovered too late. Call content captured and processed by big data solutions not only helps make sense of time and quality-of-resolution metrics but also identifies recurring issues or trends in customer and staff behavior in real time.

8.11 Conclusion

This study concludes that although big data fabric architecture's technological foundations are favorable, obtaining a competitive advantage requires the addition of nontechnical factors such as continual, effective data management and a defined data strategy. In addition, businesses may gain a competitive edge by making decisions based on the insights and observations gleaned from data. Based on these findings, the study proposes a framework consisting of components that work together to guarantee that businesses gain a competitive edge through big data fabric architecture.

References

[1] Abunadi, I. (2019). Enterprise architecture best practices in large corporations. Information, 10(10), 293.
[2] Ansyori, R., Qodarsih, N., & Soewito, B. (2018). A systematic literature review: Critical success factors to implement enterprise architecture. Procedia Computer Science, 135, 43–51.
[3] Baer, T. (2018). The Modern Data Fabric – What It Means to Your Business, viewed 22 May 2020, https://mapr.com/whitepapers/the-modern-data-fabric/assets/MapR-enterprise-fabric-white-paper.pdf.
[4] Begoli, E., & Horey, J. (2012). Design principles for effective knowledge discovery from big data. Paper presented at the 2012 Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture.
[5] Barton, N. (2019). Guaranteeing Data Integrity in the Age of Big Data, viewed 22 May 2020, https://www.dataversity.net/guaranteeing-data-integrity-in-the-age-of-big-data/#
[6] Boh, W. F., & Yellin, D. (2006). Using enterprise architecture standards in managing information technology. Journal of Management Information Systems, 23(3), 163–207.
[7] Bakshi, K. (2011). Considerations for cloud data centers: Framework, architecture and adoption. Paper presented at the 2011 IEEE Aerospace Conference.
[8] BARC Research (2017). Data Discovery: A Closer Look at One of 2017's Most Important BI Trends, viewed 22 May 2020, https://bi-survey.com/data-discovery.
[9] Collis, D. J., & Montgomery, C. A. (1995). Competing on resources: Strategy in the 1990s. Harvard Business Review, 73(4), 118.
[10] Dooley, B. (2018). Data Fabrics for Big Data, viewed 22 May 2020, https://tdwi.org/articles/2018/06/20/ta-alldata-fabrics-for-big-data.aspx.
[11] Dilnutt, R. (2005). Enterprise content management: Supporting knowledge management capability. The International Journal of Knowledge, Culture, and Change Management: Annual Review, 5, 73–84.
[12] Davenport, T. H., Harris, J. G., & Cantrell, S. (2004). Enterprise systems and ongoing process change. Business Process Management Journal, 10(1), 16–26.
[13] Diederich, T. (2019). Data Orchestration: What Is It, Why Is It Important?, viewed 22 May 2020, https://dzone.com/articles/data-orchestration-its-open-source-but-what-is-it.
[14] Foote, K. D. (2019). Streamlining the Production of Artificial Intelligence, viewed 22 May 2020, https://www.dataversity.net/streamlining-the-production-of-artificial-intelligence/.
[15] Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.
[16] Hassell, J. (2018). How a Big Data Fabric Can Transform Your Data Architecture, viewed 22 May 2020, https://blog.cloudera.com/data-360/how-a-big-data-fabric-can-transform-your-data-architecture/.
[17] John, T., & Misra, P. (2017). Data Lake for Enterprises. Packt Publishing.
[18] Kanjilal, J. (2015). Best Practices of Designing and Implementing a Data Access Layer, viewed 22 May 2020, http://www.databasedev.co.uk/data-access-layer.html.
[19] Jha, M., Jha, S., & Brien, L. O. (2016). Combining big data analytics with business process using reengineering. Paper presented at the 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS).
[20] Lohr, S. (2016). Data-ism: Inside the Big Data Revolution. Oneworld Publications; Luftman, J. (2000). Assessing business-IT alignment maturity. Communications of the Association for Information Systems, 4(14), 142–150.
[21] Larno, S., Seppänen, V., & Nurmi, J. (2019). Method framework for developing enterprise architecture security principles. Complex Systems Informatics and Modeling Quarterly, 20, 57–71.
[22] McDaniel, S. (2019). What is Data Fabric?, viewed 22 May 2020, https://www.talend.com/resources/what-is-datafabric/.
[23] Michael, E. P. (2014). Big data and competitive advantage at Nielsen. Management Decision, 52(3), 573–601.
[24] Morrell, J. (2017). What Are Some of the Different Components of an Enterprise Data Fabric?, viewed 22 May 2020, https://www.dataversity.net/secrets-utilizing-data-fabric/.
[25] Moxon, P. (2018). Data Virtualization: The Key to Weaving Your Big Data Fabric, viewed 22 May 2020, http://www.datavirtualizationblog.com/key-to-weaving-big-data-fabric/.
[26] McSweeney, A. (2019). Designing an Enterprise Data Fabric, viewed 22 May 2020, https://www.researchgate.net/publication/333485699_Designing_An_Enterprise_Data_Fabric.
[27] Malik, P. (2013). Governing big data: Principles and practices. IBM Journal of Research and Development, 57(3/4).
[28] Maddodi, S., & K, K. (2019). Netflix big data analytics – the emergence of data driven recommendation. SSRN Electronic Journal.
[29] Mousanif, H., Sabah, H., Douiji, Y., & Oulad, Y. (2014). From big data to big projects: A step-by-step roadmap. Paper presented at the 2014 International Conference on Future Internet of Things and Cloud, IEEE Computer Society, 373–378.
[30] Nagle, T., Redman, T., & Sammon, D. (2017). Only 3% of Companies' Data Meets Basic Quality Standards, viewed 22 May 2020, https://hbr.org/2017/09/only-3-of-companies-data-meets-basic-qualitystandards?fbclid=IwAR1hSMsCnA6g1ZLuHG8aupnEbDZ9CCKH-QRwcHVAGCPMgzP_rhYSWKo4edA.
[31] NetApp (2020). Ducati and NetApp Build a Data Fabric to Accelerate Innovation, Deliver High Performance, and Win Races, viewed 22 May 2020, https://customers.netapp.com/en/ducati-datafabric-case-study.
Office of the Deputy Prime Minister (2020). The Principles of Good Data Management [Ebook], 2nd ed., London.

R. Ahila, D. Nithya, P. Sindhuja, K. Divya

9 Simulation Tools for Data Fabrication

Abstract: Big data platform components like Hadoop, data lakes, and NoSQL have made big data architectures more logical, enabling businesses to pursue insight-driven competitive advantage. Moving corporate data to these platforms, especially when dealing with data distributed across data centers, is hampered by security issues, complicated data structures, difficulties in moving historical data, large volumes, latency issues, and variable speeds of ingestion. Rather than a unified platform for insights, we found that the majority of enterprises are developing various separate repositories and platforms. Statistical tools are among the techniques that can assist in identifying probable data fabrication in the social and medical sciences: using statistical techniques, it is possible to identify data sets that mix legitimate and fabricated records. Fabrication of data calls into question the scientific method's quest for knowledge, undermines the reliability of published results, and is hard to detect. Over the past 10 years, automated tools have been employed as screening methods to find image alteration and plagiarism in submitted or accepted articles; data in the social and medical sciences, however, are mostly quantitative and founded on behavioral observations, surveys, (cognitive) tests, and so on. At any point in the research process, incorrect data might be found using computerized systems that automatically screen articles for statistical anomalies. Numerous case studies suggest that these screening procedures are useful and effective in identifying various types of research misconduct. Image detection technologies, for instance, are of little use in the social and medical sciences precisely because the data there are mostly quantitative.

Keywords: Data management, computer-aided design, information exchange, data mining

R. Ahila, Department of CSE, School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, e-mail: [email protected] D. Nithya, Department of CSE, School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, e-mail: [email protected] P. Sindhuja, Department of CSE, School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, e-mail: [email protected] K. Divya, Department of CSE, School of Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, Tamil Nadu, e-mail: [email protected] https://doi.org/10.1515/9783111000886-009


9.1 Introduction

There are instances of scientific misconduct in every discipline of experimental research, whether through fabrication, falsification, or plagiarism. The likelihood of discovering data fabrication, however, is considerably lower, for instance among the hundreds of thousands of Dutch and American researchers. Plagiarism scanners have been around the longest and are used often in practice, not just at journals but also during student assessments. To screen for faked or altered photographs, a number of image detection and manipulation tools have been created, some of which have been developed and used in medical journals. Statistical tools have proven useful both for investigating suspected data fabrication and for determining its extent. We want to examine their diagnostic utility in order to decide whether using statistical approaches to spot data fabrication is ethical and worthwhile. In particular, many of the statistical methods developed for detecting data fabrication are quantifications of case-specific suspicions about a researcher, and their applications do not reveal their diagnostic value outside of those specific circumstances. Because of the in casu nature of these techniques, comparing alternative statistical methods for detecting data fabrication has also proven difficult.

TCAD has traditionally focused on the transistor fabrication steps of the process flow, also known as front-end-of-line manufacturing, which culminates in the production of source and drain contacts. Back-end-of-line manufacturing of dielectric layers and interconnects is not considered. One justification for this demarcation is the availability of powerful analysis instruments such as electron microscopy, scanning microscopy, and transmission electron microscopy, which allow precise measurement of device geometry; there are no comparable techniques for precise, high-resolution measurement of dopant or stress profiles. However, there is growing interest in the interaction of front-end and back-end production procedures. Stress introduced in the transistor region during back-end fabrication, for example, may affect device performance. These interactions will either drive the demand for improved back-end simulation tool interfaces or result in some of those features being incorporated into TCAD products. In addition to the recent expansion of the scope of process simulation, there has always been a need for more accurate simulations, yet simpler physical models are typically used to reduce computing time. As device sizes shrink, so must the precision of dopant and stress profiles, necessitating the addition of new process models for each new generation of devices. Although many of the models were developed by academics prior to their industrial use, there are times when additional effects are only identified and understood after a process problem has occurred: engineers identify a problem, and tests are run. The addition of new physical models and the consideration of more detailed physical effects will continue and, hopefully, accelerate.


9.2 Data Falsification

Data falsification is the manipulation of study results in an effort to mislead. It entails modifying data points, deleting outliers or "inconvenient" results, and editing images (such as micrographs, gels, and radiological images). With regard to image alteration, it is legitimate to technically enhance photographs for readability: adjusting the contrast, brightness, or color balance of the entire digital image (and not parts of the image) is considered proper technical manipulation. Any technical manipulation by the author should be disclosed in the cover letter sent with the submission to the journal editor. The term "improper technical manipulation" describes hiding, enhancing, removing, and/or adding new features to an image. Falsified information has been reported by numerous editors and publishers worldwide. The alteration of research data to present a false impression of a study is known as data falsification; among other unethical acts, it involves editing photographs, removing outliers, modifying findings, and adding or removing data points. Falsification occurs in scientific studies, for instance, when lab assistants want to keep their jobs by producing results that are favorable to the study premise.

9.3 Data Fabrication

Data fabrication is the act of making up data and presenting it as an accurate representation of a research study that was never conducted. Fabrication frequently results when a researcher fills out an experiment with personal, presumptive data; the studies might not have been carried out at all, or they might have been carried out and then padded with invented data.

9.4 Guidelines to Avoid Fabrication

In news publishing, fabrication can take many different forms, such as making up sources and embellishing stories, as well as changing statements so that they sound different from what was actually said. The journalists and educators Geanne Belton, Ruth Hochberger, and Jane Kirtley, who created the Poynter course on avoiding plagiarism and fabrication, have provided several best practices for avoiding fabrication.
– Be a stickler for accuracy. Create and uphold rules and strict standards for the veracity of the information you report.
– Take responsibility for each fact. Verify each fact for yourself using what you have seen, heard, and learned from reliable interviews and authoritative publications. Give credit to your sources for the facts.
– Stick to the facts. Do not inflate or exaggerate in order to present a more dramatic story.
– Be conscious of the legal risks. Falsification hurts your organization's reputation as well as your career, and your deception might expose you to legal culpability if it damages someone's reputation.

9.5 Importance of Data Fabrication

Data fabrication should be discouraged and avoided in all areas of science. Fabrication is recognized as one of the most serious forms of research misconduct, above plagiarism and falsification.
– Estimates show that 2% of researchers have at some point in their careers modified or misrepresented data. Fabrication and falsification are two distinct concepts: the former entails creating and disseminating fake data, while the latter involves modifying data.
– Fabricated data may include made-up study subjects, invented test findings, or recorded observations that never took place.
– Fabrication threatens the veracity of scientific knowledge as well as the authority of science, undermining truth and trust.
– When discovered, offenders may face harsh penalties, and articles may be retracted. In addition to harming professional careers, falsified data that is presented as true and then employed in actual practice might have fatal results.
– In one instance of research misconduct, a single perpetrator's data fabrication was linked to the loss of up to 800,000 lives.

9.6 Detection of Fabricated Data

In some cases, it is simple to spot fraud and fake research. Perhaps, in contrast to what the researcher reports, the research assessor knows that a certain lab lacks the capacity to carry out a specific type of research. Alternatively, data from the control experiment may look too "perfect," raising concerns that the data were faked. Suspected data fabrication or falsification in research is investigated to ascertain whether the purpose was to commit fraud or whether the error or oversight was unintentional. The majority of publishers have very strict guidelines on image editing and seek access to the researcher's data.


9.7 Common Causes of Data Fabrication

Fabrication may be brought on by inadequate funding and compensation for fieldworkers, a lack of institutional moral support, or social and political circumstances in the study location that make it difficult for fieldworkers to collect data. A fieldworker's ability to collect accurate data for a study can be constrained by the social and political circumstances of a given place: there is little security for researchers, particularly those working on projects touching residents' social lives, in turbulent regions such as parts of Eastern Europe and the Middle East.

9.8 Data Fabric Tools

– Atlan: The four main capabilities of Atlan's data workspace platform are data cataloguing and discovery, data quality and profiling, data lineage and governance, and data exploration and integration. The software has a searchable business glossary, an automatic data profiling feature, and a Google-like search interface. No matter where data flows, users can manage data usage and adoption with comprehensive governance and access restrictions.
– Cinchy: Cinchy provides a data collaboration platform that addresses the integration of enterprise applications and data. The software has real-time data governance and solution delivery capabilities and was created as a safe tool for addressing data access concerns. Cinchy works by integrating disparate data sources into its network design. The distinctive architecture enables what the company refers to as "autonomous data": self-description, self-protection, self-connection, and self-management of data controlled by the platform.
– data.world: data.world gives customers a cloud-native enterprise data catalogue with full context so that they can understand their data wherever it is, including project management tools, social collaboration tools, metadata, dashboards, analyses, code, and documents. Users can investigate relationships using the product's built-in connected web of data and insights, and it also suggests similar assets to enhance analysis. data.world is distinctive because of its continual release cycle.
– Denodo: Denodo is a significant player in the market for data virtualization products. Established in 1999 and based in Palo Alto, the company provides high-performance data integration and abstraction for big data, enterprise, cloud, unstructured, and real-time data services, as well as single-view applications, data analytics, and access to unified company data. The Denodo Platform is also available on the Amazon AWS Marketplace as a virtual image for data virtualization.
– IBM Cloud Pak for Data: IBM provides a variety of integration tools for practically all enterprise use cases, in both on-premises and cloud deployments. Its on-premises data integration package covers both classic integration needs, such as replication and batch processing, and modern integration needs, such as synchronization and data virtualization. IBM also offers a variety of prebuilt interfaces and functions. The mega-cloud vendor's integration product is widely regarded as one of the best in the industry, and new features are constantly being added.
– K2View Data Fabric: K2View serves as a centralized platform for the delivery, orchestration, enrichment, and integration of data. The system was designed to provide a comprehensive, real-time view: the fragmented data for each business entity is combined into its own micro-DB. Every instance of a business entity has its own controlled micro-DB, while web services componentize and expose data from the micro-DBs for use by external applications. K2View's distributed architecture allows it to serve hundreds of millions of micro-DBs at the same time.

9.9 Examples of Fabrication

Falsification or fabrication of information entails its unauthorized creation, modification, or reporting in an academic endeavor. The following are some illustrations of fabrication or falsification:
– Generating data artificially when it should be obtained from an actual experiment.
– The unauthorized modification or fabrication of information, records, images, music, works of art, or other works.
– Unauthorized omission of data, information, or outcomes in reports, presentations, and other writing.
– Using improper scaling, magnification, or depiction in charts, graphs, and other forms of representation to conceal data, outcomes, or information.
– Providing false information about the test subjects in an experiment.
– Misrepresenting the goals of research, or failing to obtain institutional approval before enlisting people in an experiment.
– Making up information sources.
– Falsely representing a third party in order to complete academic work.
– Making unauthorized use of another person's login credentials for a computer.
– Unauthorized departure from the established experimental protocol.

9.10 Process Simulation Methods

The dopant profile following processing is one of the most significant outcomes of process simulation. A proper mesh point density must always be maintained during the simulation in order for the profile to be accurate. Because the computational cost


of the solve increases with the number of mesh points and the number of diffusion equations, the density of points should be just enough to resolve all dopant and defect profiles, but no more. A typical full-flow CMOS process simulation can involve more than 50 mesh changes, and if adaptive meshing is used, the number of mesh changes can rapidly increase. For each mesh change, interpolation is used to obtain data values on the new mesh, so it is crucial to control mesh modifications in order to prevent accuracy loss from interpolation error. The simplest approach is to always keep points once they have been added to the mesh; however, this has the disadvantage of creating a lot of mesh points, which can be expensive to compute. For accurate results at minimal computational cost, it is crucial to balance interpolation error against precision and computational cost, as well as to reduce the required user input. This is particularly true when simulating devices in three dimensions: if the mesh is not carefully placed, either the accuracy will suffer unacceptably or the computing cost will be prohibitively expensive. Process simulation systems have had limited success in automating mesh adaptation completely without human intervention, so the cost of tracking mesh changes during the simulation to ensure an adequate mesh forces the user to learn meshing and how it affects simulation accuracy and run time, with run times of 24 h for simulations with the highest fidelity. Because the device can often be treated as uniform along one dimension, the majority of the information required from TCAD simulations can be extracted from a 2D simulation; simulations in three dimensions must be run in order to explore implant shadowing or to include the impact of device shape along the depth.
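The toy one-dimensional sketch below is not an actual TCAD mesher; it only illustrates the trade-off just described by interpolating a steep, dopant-like profile from coarse meshes of increasing density and watching the interpolation error shrink as points are added.

```python
# Toy illustration of interpolation error vs. mesh density for a steep profile.
import math

def profile(x):
    """Steep, Gaussian-like 'dopant' profile in arbitrary units."""
    return 1e20 * math.exp(-(x / 0.05) ** 2)

def linear_interp(xs, ys, x):
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            w = (x - xs[i]) / (xs[i + 1] - xs[i])
            return (1 - w) * ys[i] + w * ys[i + 1]
    return ys[-1]

def max_relative_error(n_points):
    xs = [i * 0.5 / (n_points - 1) for i in range(n_points)]   # coarse mesh on [0, 0.5]
    ys = [profile(x) for x in xs]
    fine = [i * 0.5 / 999 for i in range(1000)]                # fine reference mesh
    return max(abs(linear_interp(xs, ys, x) - profile(x)) / 1e20 for x in fine)

if __name__ == "__main__":
    for n in (11, 41, 161):
        print(n, "mesh points -> max error relative to peak:",
              round(max_relative_error(n), 4))
```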

9.11 Fabrication Quality Decision Support Using Simulation-Based Analytics

Automated, data-driven quality management systems are being sought to improve decision-making processes; such systems facilitate the transformation of data into useful information. Their effective implementation depends on the incorporation of precise, dependable, and simple methods for measuring inspection process uncertainty. This research addresses these needs by investigating and adapting Bayesian statistics-based methods for the


generation of fraction nonconforming posterior distributions. Using these precise and trustworthy inputs, the research further develops analytically based techniques to improve the practical functionality of conventional construction fabrication quality control systems. To support and improve quality-related decision-making processes, numerous descriptive and predictive analytical functionalities have been developed, and multirelational databases from an industrial company in Edmonton, Canada, are examined and mapped in order to implement the proposed system. The work contributes by developing a dynamic simulation environment that uses real-time data to improve simulation predictability, advancing construction management decision-support systems, developing integrated analytical methods for improved modeling in fabrication quality-related decision-making, and developing reliable and understandable decision-support metrics for quality performance measurement.
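The sketch below is only a minimal conjugate-prior example of how a fraction-nonconforming posterior can be produced from inspection counts, not the chapter's full system: with a Beta(a, b) prior and x nonconforming items out of n inspected, the posterior is Beta(a + x, b + n − x). The inspection numbers are hypothetical.

```python
# Minimal Beta-Binomial sketch of a fraction-nonconforming posterior.
import random

def posterior_params(a_prior, b_prior, n_inspected, n_nonconforming):
    """Beta prior + binomial inspection data -> Beta posterior parameters."""
    return a_prior + n_nonconforming, b_prior + n_inspected - n_nonconforming

def credible_interval(a, b, level=0.95, draws=100_000, seed=1):
    """Monte Carlo equal-tailed credible interval from the Beta posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

if __name__ == "__main__":
    # Hypothetical inspection lot: 7 nonconforming welds out of 250 inspected,
    # with a weakly informative Beta(1, 1) prior.
    a, b = posterior_params(1, 1, 250, 7)
    lo, hi = credible_interval(a, b)
    print(f"posterior Beta({a}, {b}); 95% credible interval "
          f"for fraction nonconforming: [{lo:.3f}, {hi:.3f}]")
```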

9.12 Data Fabrication Detection in Summary Statistics: p-Value Analysis

The distribution of a single independent p-value, or of a group of them, is uniform if the null hypothesis is correct; if the alternative hypothesis is correct, it is right-skewed. If the model assumptions of the underlying process are correct, the population effect size, the estimate's precision, and the observed effect size together determine the probability density function of a p-value, and if the p-values are independent, these characteristics apply to a group of p-values as well. When the model assumptions are violated, p-value distributions can take on a variety of shapes. For example, when optional stopping (adding batches of participants until a statistically significant result is obtained) is used and the null hypothesis is true, p-values just below 0.05 are more common. However, a right-skewed distribution of significant p-values can and will still occur when optional stopping occurs under the alternative hypothesis, or when additional researcher degrees of freedom are used to achieve significance. Departures from the theoretically predicted right-skewed or uniform distribution of independent p-values can therefore signal problems. In the Fujii case, for example, baseline data from supposedly randomly assigned groups were later proven to be false. When individuals are randomly assigned to conditions, measures at baseline are expected to be statistically identical (i.e., to have equivalent distributions) between groups, resulting in uniformly distributed p-values. The distribution of baseline p-values observed in the Fujii case deviated strikingly from this expectation, which led to suspicions of data fabrication [1]. When creating statistically nonsignificant data, the effect of randomness may be overlooked, resulting in groups that are overly similar under the null hypothesis of no differences between the groups; this underestimation of the effect of randomness could explain the high p-values. For Table 9.1, we simulated normally distributed measurements for studies and associated


t-test comparisons in statistically similar populations. Furthermore, we generated independent data for comparable groups by averaging the mean and standard deviation across all studies and then adding (inadequate) homogeneous noise to these parameters. The mean p-value of the fabricated data in our example is 0.956, which is well above the expected mean of 0.5 for a uniform p-value distribution.

Table 9.1: Averages and standard deviations, M (SD), of the experimental and control conditions and the associated p-values for Studies 1–10, reported separately for Set 1 and Set 2.

Table 9.1 shows examples of averages and standard deviations for continuous outcomes from genuine and fictitious randomized clinical trials. Set 1 is generated under the null hypothesis of random assignment, while Set 2 is generated with excessive consistency between equal groups. Each trial condition includes 100 participants. The p-values were calculated by comparing the experimental and control conditions within each individual study of a set. We propose using the Fisher method to test whether a set of independent p-values could plausibly have been generated under the expected distribution. The Fisher technique was originally intended as a meta-analytic tool to determine whether there is sufficient evidence to conclude that an effect exists: the statistic χ² = −2 Σ ln(p_i) is computed over the k results, and when the null hypothesis of a zero true effect size underlying all k outcomes is rejected for test statistic values greater than a critical value, usually the 95th percentile of the χ² distribution with 2k degrees of freedom, it is concluded that the true effect size differs from zero for at least one of the k outcomes. Using an adaptation of the Fisher technique, χ² = −2 Σ ln(1 − (p_i − t)/(1 − t)), the same null hypothesis can be tested against the alternative that the outcomes are more in line with expectations than they would be under the null, where the range of p-values entering the procedure is determined by the threshold t. For example, if t = 0, the procedure selects all p-values, but only statistically nonsignificant results if t = 0.05. As with the original Fisher method, the contribution of each result (the term between the brackets) lies in the range (0, 1). Although not identical to Carlisle's method, this inverted Fisher approach checks for excessive homogeneity among baseline measurements in RCTs.
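The sketch below implements the two tests just described, using scipy's chi-square survival function. The function names are mine, and the "reversed" variant follows the general transformation given in the text (rescaling p-values above a threshold t and testing whether they sit suspiciously close to 1), so it may differ in detail from the published procedure. The example p-values are invented for illustration.

```python
# Fisher method and a reversed variant for detecting p-values that are too large.
import math
from scipy import stats

def fisher_method(p_values):
    """Classic Fisher combination: a small combined p suggests a real effect."""
    chi2 = -2 * sum(math.log(p) for p in p_values)
    return stats.chi2.sf(chi2, df=2 * len(p_values))

def reversed_fisher(p_values, t=0.0):
    """Small combined p: the p-values above t are suspiciously close to 1."""
    selected = [p for p in p_values if p > t]
    chi2 = -2 * sum(math.log(1 - (p - t) / (1 - t)) for p in selected)
    return stats.chi2.sf(chi2, df=2 * len(selected))

if __name__ == "__main__":
    genuine = [0.42, 0.08, 0.73, 0.55, 0.31, 0.19, 0.88, 0.64, 0.27, 0.50]
    too_similar = [0.91, 0.95, 0.97, 0.93, 0.99, 0.96, 0.94, 0.98, 0.92, 0.90]
    print(round(fisher_method(genuine), 3))        # no evidence of an effect
    print(round(reversed_fisher(genuine), 3))      # unremarkable consistency
    print(round(reversed_fisher(too_similar), 6))  # tiny: baseline p-values too close to 1
```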


It should be noted that improperly specified one-tailed tests can also lead to an excess of large p-values. If the alternative hypothesis is correct, the p-value distribution for properly specified one-tailed tests is skewed to the right. The p-value distribution is skewed to the left when the alternative hypothesis is correct but the effect is in the opposite direction of the predicted effect (e.g., a negative effect when a one-tailed test for a positive effect is performed). As a result, to avoid drawing incorrect conclusions, any potential data fabrication detected using this method would need to be checked for misspecified one-tailed hypotheses. Misspecification of one-tailed hypothesis testing is not an issue in the studies presented in this paper because the effect and its direction were prespecified to the participants who were requested to fabricate data.

9.13 Analysis of Variance

Most empirical research papers include estimates of the sample variance or standard deviation alongside means to indicate data dispersion. For example, we know that a sample with a reported age of M (SD) = 21.05 (2.11) is both younger and more homogeneous than a sample with a mean age of 42.78 and a larger standard deviation. There is sampling error in the data's estimated variance, just as there is in the data's estimated mean. The sampling error of the estimated variance decreases as the sample size increases. For example, under the assumption of normality, the sampling error of a given standard deviation can be approximated as s/√(2n), where n is the sample size of the group.

Using the theoretical sampling distribution of standardized variances, we bootstrap the expected distribution of variance dispersion. To put it another way, we use the theoretical sampling distribution of the standard deviation to create a null model of variance dispersion that is consistent with probabilistic sampling processes for groups of equal population variances. First, we generate standard deviations at random for all j groups. Second, we compute MSw from these values. Finally, we compute the measure of dispersion across the j groups as the standard deviation of the standardized variances (abbreviated SDz in Simonsohn (2013)) or as the range of the standardized variances (denoted maxz − minz); standardization here means dividing each variance by MSw, so that the weighted average of the standardized variances equals 1. This process is repeated a number of times to generate a parametric bootstrap distribution of variance dispersion based on the null model of equal variances across populations. Potential data fabrication can be tested by comparing the observed variance dispersion to this expected distribution. To that end, we compute the proportion of iterations that exhibit equally or more extreme consistency in variance dispersion to derive a bootstrapped p-value (e.g., p(X ≤ SDobs)). Similar to the Fisher method, such excessive consistency could be due to the fabricator


underestimating the higher-level sampling fluctuations, resulting in too little randomness (i.e., error) in the standard deviations across groups.

As an example, we perform a variance analysis on the illustration from Table 9.1 and on the Smeesters case. The variance analysis is applied to the standard deviations from each set of Table 9.1. For the genuinely probabilistic data (Set 1), the reported mean standard deviation is 9.868 with a standard deviation of 0.595. For the fabricated data (Set 2), the reported mean standard deviation is 10.667, with a standard deviation of 0.456. Such descriptives highlight distinctions but are insufficient for testing them. The previously described procedure can quantify how extreme this difference is by using the standard deviation of the variances as the measure of variance dispersion. The results show that Set 1 exhibits no excessive consistency in the standard deviation dispersion (p = 0.214), whereas Set 2 does (p = 0.006). In other words, out of 100,000 randomly selected samples, 21,400 (2.14 × 10^4) had less standard deviation dispersion than Set 1, whereas only 572 had less standard deviation dispersion than Set 2.

The standard deviations of three independent conditions from a study in the Smeesters case (nj = 15) were reported to be 25.09, 24.58, and 25.65, respectively (Simonsohn, 2013). We can also use the outlined procedure to test whether the reported standard deviations are too consistent, given the theoretical sampling fluctuations of the data's second moment. The standard deviation of these standard deviations is 0.54. Under the theoretical null model, such consistency in standard deviations (or even more) would be observed in only 1.21% of 100,000 randomly selected replications.
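The parametric bootstrap described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions (equal group sizes, normality, dispersion measured as the standard deviation of standardized variances); the numbers it produces need not match the chapter's reported values exactly.

```python
# Illustrative parametric bootstrap for the dispersion of standardized variances.
import numpy as np

def dispersion_p_value(sds, n, iters=100_000, seed=1):
    """Bootstrapped p-value: proportion of simulated samples whose dispersion of
    standardized variances is as small as (or smaller than) the observed one."""
    rng = np.random.default_rng(seed)
    obs_var = np.asarray(sds, dtype=float) ** 2
    # Standardize by MSw (here simply the mean variance, assuming equal group sizes).
    obs_disp = np.std(obs_var / obs_var.mean(), ddof=1)
    # Under the null, (n - 1) * s^2 / sigma^2 ~ chi-square(n - 1); sigma^2 cancels
    # after standardization, so sigma^2 = 1 is used in the simulation.
    sims = rng.chisquare(n - 1, size=(iters, len(obs_var))) / (n - 1)
    sim_disp = (sims / sims.mean(axis=1, keepdims=True)).std(axis=1, ddof=1)
    return (sim_disp <= obs_disp).mean()

# Standard deviations from the Smeesters example in the text (three groups, n = 15)
print(dispersion_p_value([25.09, 24.58, 25.65], n=15))
```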

9.14 Extreme Effect Sizes

There is evidence to suggest that fabricated data can contain (extremely) large effect sizes. Large effect sizes, for example, were used as an indicator of data fabrication in the Stapel misconduct investigations, with some papers showing incredibly large effect sizes that translate to explained variances of up to 95% or larger than the product of the reliabilities of the related measures. Furthermore, Akhtar-Danesh and Dehghan-Kooshkghazi [2] enlisted the help of faculty members from three universities to create data sets, and they discovered that the fabricated data had much larger effect sizes than the genuine data. We discovered anecdotally that large effect sizes (d > 20) raised initial suspicions of data fabrication. In clinical trials, extreme effect sizes are also used to identify potentially fabricated data in multisite trials while the study is still ongoing [3].

In research reports, effect sizes can be reported in a variety of ways. Effect sizes are frequently reported in psychology papers as a standardized mean difference (e.g., d) or as an explained variance. An effect size measure can also be derived from a test statistic. A test result like t(59) = 3.55 corresponds to d = 0.924 and r² = 0.176 in a between-subjects design.


These effect sizes can be easily recalculated from data extracted from thousands of results using statcheck. The observed effect sizes can then be compared to the effect distributions observed in other studies investigating the same effect. For example, if a study on the "foot-in-the-door" technique produces an effect size of r = 0.8, we can gather other studies on the "foot-in-the-door" effect and compare how extreme that r = 0.8 is relative to the other studies. If the largest observed effect size in that distribution is r = 0.2, and a sufficient number of studies on the "foot-in-the-door" effect have been conducted, such an extremely large effect may be considered a red flag for possible data fabrication. This method focuses on situations in which fabricators want to create the appearance of an effect.
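The conversion from a between-subjects t statistic to d and to explained variance can be checked directly. The short sketch below assumes equal group sizes and reproduces the worked example above.

```python
# Check of the effect-size conversions for a between-subjects t-test (equal n assumed).
import math

def t_to_effect_sizes(t, df):
    d = 2 * t / math.sqrt(df)          # Cohen's d
    r_squared = t**2 / (t**2 + df)     # proportion of explained variance
    return d, r_squared

d, r2 = t_to_effect_sizes(3.55, 59)
print(round(d, 3), round(r2, 3))  # ~0.924 and ~0.176, matching the example in the text
```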

9.15 Detecting Data Fabrication in Raw Data

Analysis of Digits

The properties of leading (first) digits (e.g., the 1 in 123.45) and terminal (last) digits (e.g., the 5 in 123.45) can be investigated in raw data. We focus on testing the distribution of leading digits against the Newcomb–Benford law (NBL) and the distribution of terminal digits against the uniform distribution to detect potentially fabricated data.

The NBL [4] states that, under certain conditions, leading digits have a monotonically decreasing probability rather than an equal probability. A leading digit is the left-most digit of a numeric value, where a digit can be any of the nine natural numbers (1, 2, 3, . . ., 9). According to the NBL, the leading digit d occurs with probability p(d) = log10(1 + 1/d), where d is the natural number of the leading digit. Table 9.2 depicts the expected leading digit distribution based on the NBL. To compare this expected distribution to an observed distribution, a χ2 test (df = 9 − 1) is commonly used. According to [5], such a comparison requires at least 45 observations (n ≥ I × J × 5, with I rows and J columns). The NBL has been used to detect financial fraud (e.g., [6]), voting fraud (e.g., [7]), and problems in scientific data [8]. The NBL, however, only applies in limited circumstances that are rarely encountered in the social sciences. As a result, its effectiveness in detecting data fabrication in science is called into question. First and foremost, the NBL only applies to true ratio scale measurements (Berger and Hill, 2011; Hill, 1995). Second, the measure's range must be large enough for the NBL to apply (i.e., spanning at least 1 to 1,000,000, or 1 to 10^6). Third, digit preferences, such as psychological preferences for rounded numbers, should not influence these measures. Fourth, truncation of any kind weakens the NBL. Furthermore, some studies have suggested that humans may be capable of producing data that is consistent with the NBL [9], which immediately undermines the applicability of the NBL for detecting data fabrication.


Table 9.2: Expected leading digit proportions according to the NBL.

Digit: 1, 2, 3, 4, 5, 6, 7, 8, 9
Proportion: 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046

The analysis of terminal digits is based on the principle that the rightmost digit of a number is the most random and thus should be uniformly distributed under certain conditions. A χ2 test (df = 10 − 1) is performed on the digit occurrence counts (including zero), with the observed frequencies compared to the expected uniform frequencies. Agresti [5] states that at least 50 observations are required for a meaningful test of the terminal digit distribution (n ≥ I × J × 5, with I rows and J columns). Terminal digit analysis was developed by Mosimann and Ratnaparkhi (1996) during the decade-long Imanishi-Kari case. The first and second digit distributions are clearly nonuniform, whereas the third digit distribution appears to be only slightly nonuniform. As a result, if enough precision is provided, the rightmost digit should be uniformly distributed. The process that generates the data determines what constitutes sufficient precision. The uniform distribution appears to closely resemble the distribution of the third and subsequent digits in our example with N.
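Both digit tests reduce to chi-square goodness-of-fit tests. The sketch below is illustrative rather than the chapter's own code; the digit-extraction helpers and the simulated data are simplified assumptions.

```python
# Illustrative first-digit (Benford) and terminal-digit chi-square tests.
import numpy as np
from scipy import stats

def first_digit(x):
    s = f"{abs(x):.15g}".replace(".", "").lstrip("0")
    return int(s[0])

def benford_test(values):
    digits = [first_digit(v) for v in values if v != 0]
    observed = np.bincount(digits, minlength=10)[1:10]
    expected = np.log10(1 + 1 / np.arange(1, 10)) * len(digits)
    return stats.chisquare(observed, expected)        # df = 9 - 1

def terminal_digit_test(values, decimals=2):
    last = [int(f"{abs(v):.{decimals}f}"[-1]) for v in values]
    observed = np.bincount(last, minlength=10)
    expected = np.full(10, len(last) / 10)             # uniform; df = 10 - 1
    return stats.chisquare(observed, expected)

data = np.random.default_rng(7).lognormal(mean=3, sigma=2, size=500)
print(benford_test(data))
print(terminal_digit_test(data))
```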

9.16 Associations with Multiple Variables

Variables or measurements included in one study may have multivariate associations unknown to the researchers. As a result, data forgers may overlook such relationships between variables or measurements. Even if they are aware of the multivariate associations, fabricators may be unable to generate data that reflects them. For example, in response time latencies, there is usually a positive relationship between the mean response time and the variance of the response time. Given that genuine multivariate relationships between different variables emerge from stochastic processes and are not easily known in terms of form or size, they may be difficult to account for by someone


attempting to fabricate data. As a result, multivariate associations can be used to distinguish between fabricated and genuine data. Genuine control data can be used to estimate multivariate associations between different variables. For example, if the multivariate relationship between means (Ms) and standard deviations (SDs) is of interest, control data for that measure can be found in the literature. A meta-analysis of these control data yields an overall estimate of the multivariate relationship, which can then be used to validate a set of statistics. The multivariate associations in the genuine data are then used to estimate how extreme an observed multivariate relationship in the investigated data is. Consider the fictitious example of the previously mentioned multivariate association between Ms and SDs for a response latency task shown below. Figure 9.2 depicts a (simulated) population distribution of the literature's relationship (e.g., a correlation) between Ms and SDs (N). Assume we have two papers, each of which was derived from a pool of direct replications with an equal number of Ms and SDs.
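As an illustration of the idea (not an analysis of real data), the sketch below compares the M–SD correlation of a hypothetical suspect set of summary statistics against correlations obtained from simulated genuine control studies; the distributions and parameters are made up for the example.

```python
# Illustrative comparison of an observed M-SD correlation with simulated controls.
import numpy as np

rng = np.random.default_rng(42)

def m_sd_correlation(means, sds):
    return np.corrcoef(means, sds)[0, 1]

# "Genuine" control: many small right-skewed response-time studies, where larger
# means naturally come with larger standard deviations.
control_corrs = []
for _ in range(2000):
    samples = [rng.gamma(shape=2.0, scale=rng.uniform(100, 300), size=30)
               for _ in range(10)]
    control_corrs.append(m_sd_correlation([s.mean() for s in samples],
                                          [s.std(ddof=1) for s in samples]))
control_corrs = np.array(control_corrs)

# Hypothetical suspect summary statistics: SDs unrelated to the means.
suspect_m = rng.uniform(400, 900, size=10)
suspect_sd = rng.uniform(180, 220, size=10)
r_obs = m_sd_correlation(suspect_m, suspect_sd)

# How unusual is such a weak M-SD association among the control studies?
print(r_obs, (control_corrs <= r_obs).mean())
```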

Figure 9.1: Multivariate relationship.


Figure 9.2: Distribution of the literature’s relationship.

References
[1] Carlisle, J. B. (2012). The examination of 168 randomised controlled trials in order to assess data integrity. Anaesthesia, 67(5), 521–537. doi: 10.1111/j.1365-2044.2012.07128.x
[2] Akhtar-Danesh, N., & Dehghan-Kooshkghazi, M. (2003). How do the correlation structures of real and fabricated data differ? BMC Medical Research Methodology, 3, 18. doi: 10.1186/1471-2288-3-18
[3] Bailey, K. R. (1991). Data fabrication detection in a multicenter collaborative animal study. Controlled Clinical Trials, 12, 741–752. doi: 10.1016/0197-2456(91)90037-m
[4] Benford, F. (1938). The law of anomalous numbers. Proceedings of the American Philosophical Society, 78(4), 551–572. Retrieved from http://www.jstor.org/stable/984802
[5] Agresti, A. (2003). Categorical Data Analysis (Vol. 482). John Wiley & Sons, London, United Kingdom. https://mathdept.iut.ac.ir/sites/mathdept.iut.ac.ir/files/AGRESTI.PDF
[6] Durtschi, C., Hillison, W., & Pacini, C. (2004). The effective use of Benford's law to assist in detecting fraud in accounting data. Journal of Forensic Accounting, 5(1), 17–34.
[7] Cho, W. K. T., & Gaines, B. J. (2007). Breaking the (Benford) law: Statistical fraud detection in campaign finance. The American Statistician, 61(3), 218–223. Retrieved from http://www.jstor.org/stable/27643897
[8] Diekmann, A. (2007). Not the first digit! Using Benford's law to detect fraudulent scientific data. Journal of Applied Statistics, 34(3), 321–329. doi: 10.1080/02664760601004940
[9] Bauer, J., & Gross, J. (2011). Having trouble detecting fraud? The application of Benford's law to regression tables. In Artefacts of Method, Data Manipulation, and Fraud in Economics and Social Science. doi: 10.1515/9783110508420-010
Berger, A., & Hill, T. P. (2011). A basic theory of Benford's law. Probability Surveys, 8, 1–126. doi: 10.1214/11-PS175
Cheung, I., Campbell, L., LeBel, E. P., Ackerman, R. A., Aykutoğlu, B., Bahník, Š., et al. (2016). Registered replication report. Perspectives on Psychological Science, 11(5), 750–764. doi: 10.1177/1745691616664694

Aditya Saini, Vinny Sharma, Manjeet Kumar, Arkapravo Dey, Divya Tripathy

10 Security, Privacy, and Authentication Framework for Web-Driven Data Fabric

Abstract: The internet and related information technologies have sparked unheard-of innovation, economic value, and enhanced social services for more than two decades. Many of these advantages are powered by data on individuals that circulate across a complicated ecosystem. As a result, when people interact with systems, goods, and services, they might not be able to comprehend the possible effects on their privacy. Organizations may not fully comprehend the full depth of these effects on people, society, or their businesses, which can have an impact on their branding, financial results, and growth prospects in the future. The National Institute of Standards and Technology has therefore released the Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management (privacy framework), the result of a transparent, consensus-based process that involved both private and public stakeholders. The privacy framework aims to support advanced privacy engineering practices that embody privacy-by-design principles and to help organizations protect the privacy of their customers.

Keywords: Data fabric, security framework, privacy framework, authentication framework

10.1 Introduction

A security framework is a set of guidelines intended to eliminate privacy and security issues from computing. The secrecy, authentication, and integrity of personal data have been questioned since the advent of cloud drives. Data should be accessible from cloud accounts with ease while still being secure. Single-sign-on-based authentication, which enables users to maintain just one authentication credential for accessing several applications, even with different cloud service providers, is one of the outstanding usability characteristics of this framework. To protect privacy, specialized hardware security measures can be combined with outside software.

Aditya Saini, Forensic Science, School of Basic and Applied Sciences, Galgotias University Vinny Sharma, School of Basic and Applied Sciences, Galgotias University Manjeet Kumar, G.L. Bajaj Institute of Technology and Management Arkapravo Dey, Forensic Science, School of Basic and Applied Sciences, Galgotias University Divya Tripathy, School of Basic and Applied Sciences, Galgotias University https://doi.org/10.1515/9783111000886-010


The deployment of robust authentication and access controls is necessary for comprehensive security and for carrying out encrypted communication. Both the hardware and software layers should be taken into account for a robust and trustworthy security framework. By isolating the code execution process and execution area, it is feasible to validate and uphold the integrity of data on cloud networks [1, 16].

The Internet and related information technologies have powered unheard-of innovation, economic value, and access to societal services for more than two decades. Many of these advantages are powered by data on individuals that circulate across a complicated ecosystem. As a result, when people interact with systems, goods, and services, they might not be able to comprehend the possible effects on their privacy. Organizations might also not completely comprehend the effects. Failure to handle privacy risks can have immediate negative repercussions on people and society as well as collateral damage to an organization's reputation, financial health, and growth possibilities. Continuing to enjoy the benefits of data processing while preserving the privacy of individuals is challenging and is not well suited to one-size-fits-all solutions [2].

In addition to being a broad notion that protects vital values like human autonomy and dignity, privacy is challenging to manage because several methods can be used to achieve it [3]. For instance, seclusion, limiting surveillance, or people's control over aspects of their identities can all contribute to privacy [4]. Human autonomy and dignity are also fluid concepts that are filtered by individual variability and cultural diversity rather than being fixed, quantifiable constructs. Communication on privacy issues within and between organizations, and with individuals, is challenging due to the vast and fluid nature of privacy. What has been lacking is a universal language and a useful instrument adaptable enough to handle various privacy concerns [2].

Authentication and authorization are key components in creating a web application. The Cocoon authentication framework is a configurable module for user administration, authentication, and authorization. Any information that is accessible from any source, such as an existing database, LDAP, or the file system, can be used to validate a user. Using an existing user management/authentication system within Cocoon is relatively simple with this technique. The authentication framework's fundamental goal is to secure documents produced by Cocoon. By document, we mean the output of a request to Cocoon, which may come from a reader or a pipeline specified in the sitemap. An authentication handler protects a document. To offer the protection, a document is linked to a certain handler. The handler must indicate that the user has successfully completed authentication before the user may successfully request a document. A handler can be used to safeguard many documents simultaneously. All of these documents are accessible to users who have been authenticated. Different handlers can be used to protect documents in various ways [5].

At its core, a data fabric is an integrated data architecture that is secure, adaptable, and versatile. Unlocking the best of the cloud, core, and edge, the data fabric is in many ways a new strategic approach to your company's storage operations. While


remaining centrally controlled, it can spread everywhere, including on-premises, public and private clouds, as well as edge and internet of things (IoT) devices. Skyscraper-sized data silos and disparate, disconnected infrastructures are a thing of the past. At the heart of the data fabric is a comprehensive set of data management tools that guarantees consistency across your interconnected contexts. By automating difficult administrative tasks, it simplifies development, testing, and deployment while protecting your assets 24/7. Ultimately, a data fabric helps your company unlock the power of data to meet business needs and achieve competitive advantage. It enables IT teams to more effectively leverage hybrid cloud capabilities, create hybrid multicloud environments, and modernize storage through data management [6].

10.2 What Is Data Fabric and Its Architecture?

Thanks to the use of intelligent and automated technologies, the data fabric is an architecture that enables the comprehensive integration of various data channels and cloud settings. The exponential expansion of big data over the past 10 years has been facilitated by advancements in edge computing, artificial intelligence (AI), hybrid clouds, and the IoT. This has increased the difficulty of managing big data for businesses. The resulting significant challenges, including data silos, security threats, and general barriers to decision-making, have elevated the unification and governance of data ecosystems to a top priority. Data fabric solutions are being used by data management teams to tackle these problems head-on. They are using them to consolidate their various data platforms, implement governance, increase security and privacy safeguards, and provide workers – particularly their business users – more access to data [10].

Data fabrics are being used to integrate data, allowing for more comprehensive, data-driven decision-making. In the past, an organization may have utilized various data platforms matched to particular business lines. For instance, despite potential overlaps, you might have a customer data platform, a supply chain data platform, and an HR data platform that all contain data in separate and distinct contexts. A data fabric, on the other hand, enables decision-makers to view this data more coherently in order to better comprehend the customer lifecycle and create links between previously unconnected data. Data fabrics are speeding up digital transformation and automation projects across enterprises by bridging these understanding gaps between customers, goods, and processes.


10.2.1 How Data Fabric Is Different from Data Virtualization?

Data virtualization is one of the technologies that makes a data mesh approach possible. A data virtualization solution connects to multiple sources, integrating only the necessary metadata and generating a virtual data layer, instead of physically moving the data from separate on-premises and cloud sources using the usual extract, transform, and load (ETL) methods. Users can take advantage of the source data in real time in this way.
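Conceptually, a data virtualization layer exposes source data on demand without copying it. The following is a small Python sketch of that idea, not the API of any particular virtualization product; the source names and adapters are hypothetical.

```python
# Conceptual sketch of a virtual data layer: adapters are registered, and data is
# pulled from the underlying systems only when a consumer actually queries the layer.
from typing import Callable, Dict, List

class VirtualDataLayer:
    def __init__(self):
        self._sources: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, name: str, fetch: Callable[[], List[dict]]) -> None:
        """Register a source by name; `fetch` pulls live rows on demand."""
        self._sources[name] = fetch

    def query(self, name: str, predicate=lambda row: True) -> List[dict]:
        """Fetch rows from the source in real time and filter them."""
        return [row for row in self._sources[name]() if predicate(row)]

# Hypothetical adapters standing in for an on-premises database and a cloud API.
layer = VirtualDataLayer()
layer.register("crm", lambda: [{"customer": "A", "region": "EU"},
                               {"customer": "B", "region": "US"}])
layer.register("supply_chain", lambda: [{"order": 17, "customer": "A"}])

print(layer.query("crm", lambda r: r["region"] == "EU"))
```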

10.2.2 Architecture of Data Fabric

Data fabrics combine data from legacy systems, data lakes, data warehouses, SQL databases, and apps to provide a comprehensive view of business performance by utilizing data services and APIs [19]. In contrast to these distinct data storage systems, a data fabric tries to increase fluidity between data environments in an effort to combat the problem of data gravity, the notion that as data grows, moving it becomes harder. All data are made available throughout the organization through a data fabric, which abstracts away the technology challenges involved in data transit, transformation, and integration, as shown in Figure 10.1.

The principle underlying data fabric designs is the loose coupling of data in platforms with the applications that require it. In a multicloud context, one cloud, like AWS, might manage data ingestion while another platform, like Azure, might be in charge of managing data transformation and consumption; this is one illustration of a data fabric architecture. Then, you might have a third provider offering analytical services, like IBM Cloud Pak® for Data. These settings are connected by the data fabric architecture to produce a single view of the data [14]. However, this is only one illustration. Since every organization has a unique set of requirements, there is no one data architecture that can be used for every data fabric. Variation among enterprises is ensured by the several cloud service providers and data infrastructure solutions. Businesses using this kind of data framework, however, show characteristics common to a data fabric throughout their architectural designs. The "Enterprise Data Fabric Enables DataOps" research from Forrester details six core components, which include the following layers:
1. Data management: This layer handles data governance and data security.
2. Data ingestion: By establishing links between structured and unstructured data, this layer starts to piece together cloud data.
3. Data processing: To guarantee that only pertinent data is surfaced for data extraction, the data is refined in the data processing layer.
4. Data orchestration: This vital layer handles some of the most important tasks of the data fabric, converting, integrating, and cleaning data to make it available to teams across the firm.
5. Data discovery: This layer reveals new possibilities for connecting different data sources. For example, by connecting data from the supply chain data mart and the customer relationship management data system, it can open up new possibilities for product offers to customers or strategies to increase customer satisfaction.
6. Data access: This layer makes it possible to consume data while guaranteeing that specific teams have the correct permissions to comply with regulations (a minimal permission check is sketched after this list). Additionally, by utilizing dashboards and other data visualization tools, this layer aids in surfacing pertinent data.
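For the data access layer (item 6), the permission check can be as simple as a role-to-dataset mapping. The sketch below is a hypothetical, minimal example; the roles and dataset names are invented, and real deployments would rely on the platform's governance tooling.

```python
# Minimal, hypothetical permission check for a data access layer.
ROLE_PERMISSIONS = {
    "analyst": {"sales_dashboard", "customer_profiles"},
    "hr_partner": {"hr_records"},
}

def can_access(role: str, dataset: str) -> bool:
    """Return True only if the role is explicitly granted access to the dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

assert can_access("analyst", "customer_profiles")
assert not can_access("analyst", "hr_records")
```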

Figure 10.1: IBM - Data Fabric Architecture.

10.3 Advantages of Data Fabric Architecture

Gartner has noted specific efficiency improvements as data fabric providers increase market adoption, reporting that data fabrics can minimize "time to integration design by 30%, deployment by 30%, and also upkeep by 70%." While it is clear that data fabrics can increase overall productivity, the following benefits have also demonstrated business value for users:
1. Data fabrics use machine learning, metadata management, and semantic knowledge graphs to intelligently integrate data from multiple data types and endpoints. This helps data management teams group together datasets that are linked to one another as well as incorporate brand-new data sources into an organization's data ecosystem. In addition to the productivity improvements noted above, this functionality automates parts of data workload management, which also helps to break down data system silos, consolidate data governance procedures, and enhance overall data quality.
2. Data fabric architectures make self-service applications easier, democratizing access to data beyond more specialized resources such as data engineers, developers, and data analytics teams. By enabling business users to make quicker business choices and by freeing technical users to prioritize tasks that better utilize their skill sets, the elimination of data bottlenecks leads to an increase in productivity [9].
3. Better data protection: Improving data access does not require compromising on privacy and security safeguards. In reality, it necessitates the implementation of further data governance safeguards around access restrictions, certifying that particular data are only reachable to specified roles. By implementing data masking and encryption around sensitive and proprietary data, technical and security teams can reduce the risks associated with data sharing and system hacks.

10.4 Cases Where Data Fabric Can Be Used

Data fabrics are still in their infancy in terms of popularity, but their data integration capabilities help businesses find data, allowing them to handle a variety of use cases. A data fabric is characterized by the scale and scope it can handle because it eliminates data silos, even though the use cases it can handle may not be very different from those handled by other data products. Firms and their data scientists can get a comprehensive picture of their consumers by integrating data from multiple sources, which has proven very effective for banking clients. More specifically, data fabrics have been used for:
– customer profiles,
– fraud detection,
– preventative maintenance analysis,
– return-to-work risk models, and more.

10.5 Security and Privacy Framework Introduction

Companies that use cloud computing and big data have ongoing concerns about security, trust, and privacy. Although there are demands for organizations to centralize data center management and transfer their data to the cloud, services and applications are being created to minimize costs and maximize work efficiency. To guarantee that all data services comply with the latest patches and policies, system design and deployment must be implemented concurrently based on current security practices. All users will be protected, and data confidentiality, integrity, and availability


will be ensured through a risk-based approach to security program development that recognizes appropriate measures (as shown in Figure 10.2) [11, 12].

In addition to being a broad notion that protects vital values like human autonomy and dignity, privacy is challenging to manage because several methods can be used to achieve it. For instance, seclusion, limiting surveillance, or people's control over aspects of their identities (e.g., body, data, and reputation) can all contribute to privacy. Human autonomy and dignity are also fluid concepts that are filtered by individual variability and cultural diversity rather than being fixed, quantifiable constructs. It is challenging to communicate adequately about privacy risks within and between organizations, and with individuals, due to the vast and fluid nature of privacy. What has been lacking is a universal language and a useful instrument adaptable enough to handle various privacy concerns [7].

The three components of the privacy framework are the core, profiles, and implementation tiers. Through the linkage between organizational roles and responsibilities, privacy practices, and business or mission priorities, each component strengthens the way an organization manages privacy risk.
– The core is a set of privacy activities and outcomes that enables priority privacy activities and outcomes to be communicated from the executive level to the implementation/operational level of the organization. The core is further divided into key categories and subcategories defined for each function.
– Current privacy practices or desired outcomes of an organization are represented through a profile. To create a profile, an institution can review all the outcomes of the core and select those that should be given the most attention with regard to its business or mission, its role in the data processing ecosystem, and its types of data processing and privacy requirements. Functions, categories, and subcategories can all be added or created by the organization as needed. By comparing the "Current" profile with the "Target" profile, the profiles can be used to find areas where the privacy posture could be improved. Profiles can be used to communicate how privacy issues are managed within and across businesses as well as to conduct self-assessments.
– Implementation levels (or "tiers") offer a measure of how a company perceives privacy risk and assesses whether it has the necessary systems and personnel in place to handle the risk. The tiers show a shift from informal, reactive responses to agile, risk-informed approaches. When choosing tiers, a company should think about its target profile(s) and how existing risk practices may help or hinder its success, the extent to which privacy risk has been integrated into its enterprise risk management portfolio, relationships within the data processing ecosystem, and the composition and training of its workforce.

According to the privacy framework, privacy events are potential problems that people may encounter as a result of the operations of a system, product, or service involving data – whether in digital or nondigital form – throughout the data lifecycle, from collection through disposal. These data activities are individually referred to as data actions and


collectively as data processing in the privacy framework. The problems that people may face as a result of data processing can manifest in different ways, but NIST describes them as ranging from more tangible harms, such as discrimination, financial loss, or bodily harm, to dignity-related effects such as humiliation or stigmatization.

Figure 10.2: Secure EFSS system architecture in the OpenStack.

The causes of the difficulties that people may encounter can be different. Problems can develop as a negative side effect of the data processing that businesses do to achieve their mission or business goals. To illustrate, consider some localities' reservations about installing "smart meters" as part of the smart grid, a national initiative to improve energy productivity. The capacity of these meters to assemble, store, and disseminate extremely detailed data on household electricity use can offer insights into how people behave in their homes. Although the meters worked as intended, the data processing may make some people feel like they are being watched.
– Access control and firewall (L1) supports secure passwords, access control mechanisms, and firewall protection [15].
– The intrusion detection and prevention system (L2) focuses on the detection of attacks and intrusions, in addition to providing the latest mechanisms to stop attacks such as denial of service, port scanning, pattern-based attacks, spoofing, parameter manipulation, cross-site scripting, known vulnerabilities, SQL injection, and cookie poisoning. Thanks to identity management, only approved users can access sensitive data.


– The encryption and decryption control layer (L3) supports security controls and provides encryption and decryption of files and messages. These functions monitor the system and issue an early warning when a fine-grained object starts to behave strangely. In addition, the layer offers complete round-the-clock assurance that covers investigation and remediation after an anomaly is detected.

10.6 Security and Privacy Framework Basics

The privacy framework gives internal and external stakeholders a single vocabulary for comprehending, managing, and communicating privacy risk. It can be adjusted to fit any organization's function(s) within the data processing ecosystem. It is a device for coordinating policy, business, and technology approaches to handling this risk and can be used to recognize and prioritize actions that reduce privacy risk (as shown in Figure 10.3) [7].

Figure 10.3: Privacy Framework Core Structure.

The core offers a series of more detailed activities and results that facilitate a conversation about managing privacy risk. Functions, categories, and subcategories make up the core. The core components work together:
– At the highest level, functions organize basic actions to protect privacy. By enabling the understanding and management of data processing, supporting risk management decisions, deciding how to interact with individuals, and leveraging knowledge from past experiences, they help companies shape their privacy risk management. They are not intended to form a sequential path or to reach a fixed end state. In order to create or improve an operational culture that responds to the dynamic nature of privacy risk, functions should be performed concurrently and continuously.
– Each function is divided into categories, which are groups of privacy outcomes closely related to certain programmatic needs and activities.


– Subcategories further divide each category into the outcomes of specific managerial and/or technical activities. They offer a collection of outcomes that, while not exhaustive, aid in achieving the objectives in each category.
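To make the profile idea concrete, a "Current" versus "Target" profile comparison can be expressed as a simple gap analysis over subcategory outcomes. The identifiers and ratings in the sketch below are placeholders chosen for illustration, not official NIST content.

```python
# Illustrative gap analysis between a Current and a Target privacy profile.
# Subcategory identifiers and ratings are hypothetical placeholders.
current_profile = {"ID.IM-P1": "partial", "CT.DM-P1": "achieved"}
target_profile = {"ID.IM-P1": "achieved", "CT.DM-P1": "achieved", "GV.PO-P1": "achieved"}

gaps = {sub: (current_profile.get(sub, "not addressed"), wanted)
        for sub, wanted in target_profile.items()
        if current_profile.get(sub) != wanted}

for subcategory, (now, wanted) in gaps.items():
    print(f"{subcategory}: current={now}, target={wanted}")
```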

Let us discuss some elements of the security framework:
1. Penetration testing: Penetration testing is often used in ethical hacking to check for vulnerabilities and confirm the reliability of a proposed security prototype. By the time penetration testing is over, the weakest areas and weaknesses have been identified. After effective penetration, the 2012 viruses and Trojans were brought into action to test the resilience of the multilayered defenses, using a variety of tools and methodologies including Metasploit security tools. Virtual machines were used for all tests. In order to identify any performance issues, comparisons were made with McAfee's single-layer antivirus software, including the number of viruses and Trojans blocked and the blocking percentage. The following data were captured during experimentation [7, 11]:
– the quantity of malware that each layer has spotted and stopped;
– the total number of viruses and Trojan horses that the system has found and stopped;
– the quantity of viruses and Trojans that have been discovered but cannot be eliminated and are being quarantined;
– the entire count of viruses and Trojans, including both those that can be eliminated in quarantine and those that cannot.
2. SQL injection: Two different kinds of databases are used in the multilayer protection: (1) MySQL Server 5.5, which stores database information for users, clients, staff, and text-related data, and (2) 64-bit MongoDB 3.0.4, a NoSQL database. Both databases are tested for durability as part of this ethical hack. This matters for commercial clouds, as an enterprise protection architecture such as CCAF should preserve critical data and protect against unwanted access or server outages caused by attacks. SQL injections are often used in unlicensed attacks and in ethical hacking. Two techniques were used to perform SQL injection. First, SQL injection queries produce "infinite" loops, putting the database in an incorrect state [18]. Second, 10,000 SQL statements per second were issued to ensure that the SQL injection protection cannot be patched in time. The combination of these two techniques causes an SQL injection that takes the system offline. The following three categories of tests were run for MySQL (a minimal illustration of the parameterized-query defense follows this list):
– use of SQL injection when protected by the McAfee 2014 mainstream release, chosen due to product availability;
– use of SQL injection against CCAF multilayer security in general;
– use of SQL injection against CCAF multilayer security protection that completely blocks all ports for MySQL, meaning that every SQL query has to be authorized and authenticated (as shown in Figure 10.4).
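The defensive counterpart to these injection tests is to ensure that user input is never concatenated into SQL. The following sketch is a generic illustration using Python's built-in sqlite3 module, not the CCAF implementation; it contrasts a vulnerable query with a parameterized one.

```python
# Contrast between a vulnerable, concatenated query and a parameterized query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'viewer')")

malicious = "alice' OR '1'='1"

# Vulnerable pattern: string concatenation lets the payload rewrite the query.
unsafe = conn.execute(
    "SELECT role FROM users WHERE name = '" + malicious + "'").fetchall()

# Safer pattern: the driver treats the whole payload as a literal value.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (malicious,)).fetchall()

print(unsafe)  # both rows returned: the injection succeeded
print(safe)    # no rows returned: the payload did not match any name
```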


Figure 10.4: Three-layered CCAF security.

10.7 How to Use the Security, Privacy, and Authentication Framework

The privacy framework, when used as a risk management device, can help a business maximize useful applications of data and develop cutting-edge technologies, products, and services while minimizing adverse impacts on people. It helps businesses answer the essential question: "How do we consider the impacts on individuals when we design our systems, products, and services?" The use of the privacy framework can be customized to take into account the specific requirements of the organization, although it is intended to support existing corporate and system development activities. The implementing organization will determine how to apply it. For example, a company may have effective privacy risk management procedures in


place, but use the five core functions as a quick approach to identifying and explaining any deficiencies. Alternatively, a company wishing to launch a privacy program may refer to the core categories and subcategories. To harmonize privacy risk management priorities across different roles in the data processing ecosystem, other firms may compare profiles or tiers. The fact that organizations can use the privacy framework in many ways should discourage them from treating "privacy compliance" as a standardized or externally verifiable concept. Several alternatives for utilizing the privacy framework are presented in the next sections [7, 11].
– Mapping to informative references: In order to help implementation, informative references are mappings to subcategories that include tools, technical advice, standards, legislation, regulations, and best practices. Organizations may find it useful to use crosswalks that show how standards, laws, and regulations relate to subcategories in order to prioritize which actions or results will help with compliance. The privacy framework is technology agnostic but supports technological innovation, since these mappings can be created by any industry sector or organization as technology and related business requirements change. Tools and techniques for achieving good privacy outcomes can extend across borders and reflect the worldwide nature of privacy issues by relying on consensus-based standards, guidelines, and practices. Leveraging current and developing standards will enable economies of scale and encourage the growth of systems, goods, and services that respond to established market requirements while also taking into account the privacy needs of individuals.
– Utilizing the framework in the data processing ecosystem: A key element in privacy risk management is the entity's position in the data processing ecosystem, which affects both its legal obligations and the privacy risk management measures it may take. The data processing ecosystem, as shown in Figure 10.5, includes a variety of entities and roles that may have intricate, bidirectional interactions with one another, and with people. When an entity is backed by a series of dependent entities, such as several component suppliers for manufacturers or a number of service providers, the complexity can rise. The roles in the diagram are meant to represent hypothetical categories. In reality, entity roles may be formally codified by law – for example, some regulations categorize businesses as data controllers or processors – or roles may derive from industry sector designations [7, 20, 21]. An entity uses the privacy framework to consider how to manage privacy risk in light of its own priorities, as well as how its actions may affect the privacy risk management of other entities in the data processing ecosystem, by creating one or more profiles related to its role(s). For example:
– A data processor that is an external service provider uses its profile to demonstrate the actions it has taken to process data in accordance with predetermined obligations.








– Using a profile, an organization that decides how to gather and use personal information about individuals can inform an external service provider about its privacy needs (e.g., a cloud provider to which it exports data).
– In order to report results or compare against acquisition criteria, a business can express its privacy posture through its current profile.
– An industry can create a standard profile that its members can use to personalize their own profiles.
– A manufacturer can use a target profile to decide what features to include in its goods so that its business customers can meet the privacy requirements of their end users.
– A target profile can be used by a developer to think through how to create an application that offers privacy safeguards when utilized in the system environments of other companies [17].

Figure 10.5: Data Processing Ecosystem Relationships.

With the help of the privacy framework, parties involved in the data processing ecosystem can convey privacy requirements in a common language [13]. When the data processing ecosystem spans international borders, as with international data transfers, the need for this communication becomes especially apparent. Among the organizational procedures that promote communication are:
– determining the needs for privacy;
– establishing formal agreements (such as contracts or multiparty frameworks) to enforce privacy standards;


– describing how those privacy demands will be checked and verified;
– utilizing a range of assessment techniques to confirm that privacy needs are met; and
– directing and controlling the aforementioned actions.

References
[1] What is a security framework? – Definition from Techopedia. Techopedia.com. (n.d.). Retrieved October 27, 2022, from https://www.techopedia.com/definition/30605/security-framework-cloudcomputing#:~:text=A%20security%20framework%20is%20a,personal%20data%20have%20been%20challenged
[2] Boeckl, K. R., & Lefkovitz, N. B. (2020). NIST privacy framework: A tool for improving privacy through enterprise risk management, version 1.0.
[3] National Institute of Standards and Technology (2019). NIST Privacy Risk Assessment Methodology (PRAM). National Institute of Standards and Technology, Gaithersburg, MD.
[4] The Smart Grid Interoperability Panel – Smart Grid Cybersecurity Committee (2014). Guidelines for smart grid cybersecurity: Volume 1 – Smart grid cybersecurity strategy, architecture, and high-level requirements. National Institute of Standards and Technology, Gaithersburg, MD, NIST Interagency or Internal Report (IR) 7628, Rev. 1, Vol. 1.
[5] Authentication Framework. (n.d.). Retrieved October 01, 2022, from https://cocoon.apache.org/2.1/developing/portal/authentication.html
[6] What is a data fabric? NetApp. (n.d.). Retrieved October 27, 2022, from https://www.netapp.com/data-fabric/what-is-data-fabric/
[7] Boeckl, K. R., & Lefkovitz, N. B. (2020). NIST privacy framework: A tool for improving privacy through enterprise risk management, version 1.0.
[8] What is a data fabric? IBM. (n.d.). Retrieved October 27, 2022, from https://www.ibm.com/in-en/topics/data-fabric
[9] Feng, Q., He, D., Zeadally, S., & Liang, K. (2019). BPAS: Blockchain-assisted privacy-preserving authentication system for vehicular ad hoc networks. IEEE Transactions on Industrial Informatics, 16(6), 4146–4155.
[10] Liu, K., Yang, M., Li, X., Zhang, K., Xia, X., & Yan, H. (2022, July). M-data-fabric: A data fabric system based on metadata. In 2022 IEEE 5th International Conference on Big Data and Artificial Intelligence (BDAI) (pp. 57–62). IEEE.
[11] Chang, V., Kuo, Y. H., & Ramachandran, M. (2016). Cloud computing adoption framework: A security framework for business clouds. Future Generation Computer Systems, 57, 24–41.
[12] Takabi, H., Joshi, J. B., & Ahn, G. J. (2010, July). SecureCloud: Towards a comprehensive security framework for cloud computing environments. In 2010 IEEE 34th Annual Conference, Computer Software and Applications Workshops (COMPSACW) (pp. 393–398).
[13] Zia, T., & Zomaya, A. (2006, February). A security framework for wireless sensor networks. In Proceedings of the IEEE Sensors Applications Symposium (pp. 49–53).
[14] Ramachandran, M., & Chang, V. (2014). Cloud security proposed and demonstrated by cloud computing adoption framework. In 2014 Emerging Software as a Service and Analytics Workshop.
[15] Ramachandran, M., Chang, V., & Li, C. S. (2015, May). The improved cloud computing adoption framework to deliver secure services. In Emerging Software as a Service and Analytics 2015 Workshop (ESaaSA 2015), in conjunction with CLOSER 2015, Lisbon, Portugal, 20–22 May 2015.


[16] Ko, R. K., Jagadpramana, P., Mowbray, M., Pearson, S., Kirchberg, M., Liang, Q., & Lee, B. S. (2011, July). TrustCloud: A framework for accountability and trust in cloud computing. In 2011 IEEE World Congress on Services (SERVICES) (pp. 584–588).
[17] Kuo, Y. H., Jeng, Y. L., & Chen, J. N. (2013, July). A hybrid cloud storage architecture for service operational high availability. In Proceedings of the 37th IEEE Annual International Computers, Software & Applications Conference (IEEE COMPSAC 2013) (pp. 487–492).
[18] Kieyzun, A., Guo, P. J., Jayaraman, K., & Ernst, M. D. (2009, May). Automatic creation of SQL injection and cross-site scripting attacks. In IEEE 31st International Conference on Software Engineering (ICSE 2009) (pp. 199–209).
[19] Percival, C. (2009, May). Stronger key derivation via sequential memory-hard functions. In The BSD Conference (BSDCan).
[20] Kuo, Y. H., & Jeng, Y. L. Secure synchronization apparatus, method, and non-transitory computer readable storage medium thereof. U.S. Patent Application (Priority No.: 14/292,901).
[21] Wang, K. Y., Kuo, Y. H., & Jeng, Y. L. Synchronization apparatus, method, and non-transitory computer readable storage medium. U.S. Patent Application (Priority No.: 14/300,955).

Aditya Saini, Amit Yadav, Vinny Sharma, Manjeet Kumar, Dhananjay Kumar, Divya Tripathy

11 Government Compliance Strategies for Web-Driven Data Fabric

Abstract: The chapter starts with the definitions of compliance and the data fabric. Compliance is the act of complying with rules or requirements, or the process of doing so, while a data fabric is a framework that creates several end-to-end data pipelines. The architecture of a data fabric includes several layers. When discussing government compliance, one also needs to consider the strategies and tools involved in compliance. The use of compliance technology, monitoring and auditing, and cloud security is part of the security strategy and tooling of compliance. Government compliance in today's era has both benefits and drawbacks. The basic framework of web-driven data fabric and the application of data fabric on the serverless cloud are discussed, along with the principles of data governance and data compliance and a comparison of their benefits.

Keywords: Compliances, data fabric, governance, cloud architecture, server interaction

11.1 Introduction

Compliance is defined as the act of complying with well-known rules or requirements, or the procedure of doing so. Software, for instance, might be written in accordance with standards body specifications and then deployed by user organizations in accordance with a vendor's licensing terms. The term "compliance" can also refer to actions taken to make sure that businesses abide by both government and industry rules [1].

Data fabric: A data fabric is a framework that makes it possible to automatically and intelligently create several data pipelines and cloud environments from beginning to end. Big data's exponential growth over the past few years – driven by developments in edge computing, artificial intelligence, hybrid clouds for enterprises, and wireless sensor networks – has made handling it more challenging.

Aditya Saini, Forensic Science, School of Basic and Applied Sciences, Galgotias University Amit Yadav, Forensic Science, School of Basic and Applied Sciences, Galgotias University Vinny Sharma, School of Basic and Applied Sciences, Galgotias University Manjeet Kumar, G.L. Bajaj Institute of Technology and Management Dhananjay Kumar, Forensic Science, School of Basic and Applied Sciences, Galgotias University Divya Tripathy, School of Basic and Applied Sciences, Galgotias University https://doi.org/10.1515/9783111000886-011


11.1.1 Architecture of Data Fabric

Data management layer: This layer is in charge of data governance and data security.
Data ingestion layer: This layer begins to put together cloud data by identifying links between unstructured and structured data.
Data processing: The data is refined at the data processing layer to make sure that only relevant data are accessible for data extraction.
Data orchestration: In order to make the data accessible to teams across the organization, this layer transforms, integrates, and cleans the data, among other critical functions for the data fabric.
Data discovery: In this layer, fresh possibilities are revealed for fusing various data sources.
Data access: This layer facilitates data consumption while making sure that certain teams have the correct permissions to follow the law. Additionally, this layer assists in exposing relevant data by using dashboards and other data visualization tools [2].

11.1.2 Security and Compliance Tools and Strategies

Compliance technology: This encompasses the tools necessary to enable and streamline different compliance procedures. In the literal sense, even basic tools like spreadsheets, data storage, and shared drives are included. In this chapter, however, we discuss the potential for automating compliance procedures through the use of smart technology, which may not only take the place of human interaction in routine and time-consuming compliance chores but can also make judgments for you.

Monitoring and auditing: Smart technology can be quite helpful in this area. Both business activities and compliance procedures produce a lot of data. Additionally, compliance hazards and threats come not just from the company but also from business associates and outside parties, such as vendors and middlemen in the sales channel. Manually observing such a vast volume of data is nearly impossible. Large amounts of organizational data from compliance, business, and accounting procedures, as well as online information from the media, may be reviewed by smart technologies like big data analysis, which can also draw conclusions based on the review. The crucial takeaway is that smart technology implementation tools must be adaptable enough to allow compliance officers to establish criteria like keywords, thresholds, and illogical transaction patterns [3].

Cloud security: This should be approached differently from traditional data center solutions, and it is the topmost barrier to cloud adoption. Businesses will be


protected from vulnerabilities if the proper cloud security is implemented at the appropriate time. Here are six tactics and technologies you can use to enhance the security of the cloud:
1. Take an integrated approach to design and compliance with zero trust: Enterprises must create an integrated, zero-trust design strategy for security, GRC, and compliance to reap the benefits of cloud adoption. Therefore, they must change their perspective from "security as an afterthought" to "security by design." The zero-trust paradigm considers a breach to have occurred and verifies each request as if it came from the public network. Zero trust is about three principles:
– verify explicitly,
– use least-privileged access, and
– assume breach.
2. Take a "shift left" approach: Being secure is everyone's responsibility. Addressing security early in the SDLC process can lead to a 50% reduction in effort and associated costs. Combining DevOps and security in the same team and implementing DevSecOps frameworks is one method to achieve this. Being part of the same team allows for better security outcomes than identifying security issues at the end, because security becomes more integrated into the entire process. Businesses should consider investing in cybersecurity and compliance-as-code solutions to get the most out of the "shift left" strategy.
3. Implement cloud asset protection and cloud threat detection: Protecting cloud assets in the public cloud requires a continuum of skills, and that is where cloud access security broker (CASB), cloud security posture management (CSPM), and cloud workload protection platform (CWPP) products come in. A CSPM tool helps detect configuration-related threats and risks and monitors issues including insufficient encryption, mismanagement of encryption keys, and excess account permissions.
4. Extend data protection: Implementing data governance is the best way to protect your data. In addition, businesses need to rethink their data strategy throughout the data lifecycle. Businesses must be open about the data they collect and the uses to which it may be put. Enterprises must use confidential computing to protect extremely sensitive data even during processing, as encryption in transit and at rest is insufficient on its own to protect sensitive data.
5. Use identity as a perimeter: Apps are now accessible on any device 24/7. Legacy access and identity management and privileged access management (PAM) solutions are unfortunately insufficient.


To combat pervasive access and resources spread across the cloud, enterprises must consider digital identity and cloud infrastructure entitlement management solutions that reduce the risk of over-privileged cloud entitlements associated with human and machine identities, including applications, bots, services, and more.
6. Develop a secure digital-fluency program. Businesses should create a program that builds digital fluency around cyber-attack awareness and around the technology used to detect any opening or attack. Digital fluency is the ability to select and use the right digital tools and technologies to achieve a specific goal [4].
7. CASB. CASBs are gateway and stop-gap technologies that sit between customers and cloud service providers; they may be virtual or physical. Their coverage includes some (but not all) IaaS, PaaS, and SaaS setups. In short, a CASB closes security gaps by enabling businesses to build security controls specific to the cloud and to extend existing campus-based security policies to it.
8. CSPM. CSPM products are a respectable answer. Their main goal is to restrict access to the cloud infrastructure technologies deployed and utilized within an enterprise, and they are especially beneficial for companies moving their operations to the cloud. CSPM tools operate by continually scanning for configuration errors and making any required changes automatically. These solutions suit companies that need to identify, evaluate, record, summarize, and automate issue fixes [20].
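As a rough illustration of what a CSPM-style check does, the sketch below scans a declarative inventory of cloud resources for the kinds of misconfigurations mentioned above (missing encryption, stale key rotation, over-broad permissions). The resource model and findings are assumptions made for this example; real CASB/CSPM products work against live vendor APIs and run continuously.

```typescript
// Simplified CSPM-style posture check over a declarative resource inventory (hypothetical model).
interface CloudResource {
  id: string;
  kind: "bucket" | "database" | "role";
  encryptionAtRest?: boolean;
  keyRotationDays?: number;
  permissions?: string[];
}

interface Finding {
  resourceId: string;
  issue: string;
  autoRemediable: boolean;
}

function scan(resources: CloudResource[]): Finding[] {
  const findings: Finding[] = [];
  for (const r of resources) {
    // Data stores should encrypt at rest.
    if ((r.kind === "bucket" || r.kind === "database") && !r.encryptionAtRest) {
      findings.push({ resourceId: r.id, issue: "encryption at rest disabled", autoRemediable: true });
    }
    // Encryption keys left unrotated are a key-management risk.
    if (r.keyRotationDays !== undefined && r.keyRotationDays > 90) {
      findings.push({ resourceId: r.id, issue: "encryption key not rotated within 90 days", autoRemediable: false });
    }
    // Wildcard permissions indicate an over-privileged role.
    if (r.kind === "role" && (r.permissions ?? []).includes("*")) {
      findings.push({ resourceId: r.id, issue: "wildcard (over-privileged) permissions", autoRemediable: false });
    }
  }
  return findings;
}
```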

11.2 Security and Compliance: The Need to Remain Adaptable and Agile

Enterprises must adopt a new perspective on cloud security, concentrating on the platform's particular requirements and applications, if they want to stay safe and derive the full benefit from the cloud. While there is no magic solution for cybersecurity, the "shift left" strategy and zero-trust architecture can simplify matters considerably, and using defensive AI effectively while remaining proactive can help secure better corporate outcomes [4, 17].


Comparing agile and adaptive governance

Origins
– Agile governance: A response to waterfall-style planning in the software engineering industry. Later, the idea of agility was expanded to include governance and organizational research.
– Adaptive governance: Built on the principles of evolutionary theory, but including ideas from other fields such as organizational ecology, political science, ecology, systems theory, and complexity theory.

Scope
– Agile governance: Typically used in initiatives involving innovation and development.
– Adaptive governance: Employed most frequently in public policy and governance.

Lead motive
– Agile governance: Client satisfaction.
– Adaptive governance: Survival.

Main purpose
– Agile governance: Responding quickly and detecting the actions.
– Adaptive governance: Learning and maintaining fit.

Key processes
– Agile governance: Work in interdisciplinary teams, innovate incrementally, get timely feedback, and use it to get better.
– Adaptive governance: Maintaining one's compatibility with the surroundings, both of which are dynamic. No prescriptive key procedures are established since adaptive governance is mostly descriptive [].

11.3 Basic Framework of Web-Driven Data Fabric

Serverless architectures are increasingly in demand. Data-driven applications, however, still require tedious and error-prone development and are regularly used across multiple devices. Data visualization and manipulation are at the heart of data-driven applications, so their behavior depends on the incoming data. These applications are distributed over the web, where the client is accessed through a browser and the server-side code runs in the cloud. The unified modeling language profile for data-driven applications (UMLPDA) models the data-driven requirements of both frontend and backend in serverless cloud computing at a high level of abstraction. The main features of the proposed profile (and thus the research contributions of this chapter) are shown in Figure 11.1 and are summarized as follows:
1. The front end of data-driven applications is modeled at a platform-agnostic level for user interface (UI) design and development.
2. Behavioral concepts are modeled to design a client that communicates with backend cloud data sources and connects the relevant data to the UI.
3. GraphQL API principles are modeled to construct the GraphQL schema and GraphQL translators.
4. Using the model-to-text (M2T) method, an open-source transformation engine, consisting primarily of an application launcher and a code generator, automatically builds both frontend and backend low-level implementations in Angular2 and GraphQL. Angular2, the Google-developed framework often used to create user-friendly web applications for desktop computers and mobile devices, was chosen as the frontend target; GraphQL, a currently popular data query language for web APIs, was targeted for the backend.
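To give a feel for the kind of artifacts such a transformation engine emits, the sketch below pairs a small GraphQL schema with a typed client helper of the sort an Angular2 front end might call. The `Sensor` type, its fields, and the endpoint are invented for illustration; they are not the actual output of the UMLPDA engine.

```typescript
// Hypothetical shape of generated artifacts: a GraphQL schema plus a typed frontend query helper.

// Schema the generated backend would expose (illustrative type and fields).
const typeDefs = /* GraphQL */ `
  type Sensor {
    id: ID!
    label: String!
    lastReading: Float
  }
  type Query {
    sensors: [Sensor!]!
  }
`;

// TypeScript mirror of the schema type, as a generator might emit for the client.
interface Sensor {
  id: string;
  label: string;
  lastReading: number | null;
}

// Generated client helper: POST the query to the serverless GraphQL endpoint.
async function fetchSensors(endpoint: string): Promise<Sensor[]> {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: "{ sensors { id label lastReading } }" }),
  });
  const { data } = (await response.json()) as { data: { sensors: Sensor[] } };
  return data.sensors;
}
```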

Figure 11.1: Architecture of data fabric.

11.4 Data-Driven Application Architecture in Serverless Cloud Computing

11.4.1 Computing Environment

Modern users of web and mobile applications want instant access to data and information whenever it is modified or updated in a cloud database, and they expect the data to appear on their displays immediately, without reloading their client applications. To meet these requirements (Figure 11.2), an architecture for data-driven applications based on GraphQL is a commonly used BaaS configuration. The proposal aims at spreading data between many clients, or pushing real-time data to all clients, from a serverless backend. The key concepts are client, operation, action, schema, proxy, resolver, and data source, as shown in Figure 11.3.

Figure 11.2: Using the UMLPDA-designed model-driven framework for data-driven applications.

Figure 11.3: The architecture for data-driven applications in serverless cloud computing.

– Client: It is designed to make it easy to create a UI that retrieves data from the cloud, and it works with any JavaScript interface. GraphQL operations are specified in the client. Before sending request commands to the proxy for synchronization, the client performs the proper authorization wrapping. Subscriptions are handled through enrollment, and answers are stored in the offline store. The client sends the identity and credential context along with each proxy procedure request; based on the set of credentials that must be provided with each proxy request, this represents the caller identity.
– Operation: It includes (1) requests for read-only data, (2) writes of data tracked by loading, and (3) a long-term connection that receives data in response to events.
– Action: It is a notice of a write operation that is sent to connected subscribers. The client performs a handshake to become a subscriber.
– Schema: It describes the capabilities of an API that the client application may use to interact with it. Every request is checked against the schema.
– Proxy: It is a component that analyzes incoming requests and translates them into logical operations for data triggers or actions on data. Additionally, it oversees the techniques for identifying and resolving conflicts.
– Resolver: It is a function that, with the caller's authorization, transforms the payload to the underlying storage system protocol.
– Data source: It is a system of persistent storage with which the API communicates. The system controls the application state, which is specified in a data source [5].
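A minimal sketch of how these concepts fit together is shown below: an operation reaches the proxy, is checked against the known schema fields, is handed to a resolver backed by a data source, and, if it is a mutation, triggers an action that notifies subscribers. The field names and the in-memory data source are illustrative assumptions, not part of the architecture described in [5].

```typescript
// Hypothetical wiring of operation -> proxy -> resolver -> data source, with subscriber notification.
type OperationKind = "query" | "mutation" | "subscription";

interface Operation {
  kind: OperationKind;
  field: string; // e.g. "getItem" or "putItem" (illustrative field names)
  payload?: unknown;
}

// Data source: the persistent store the API talks to (in-memory here for illustration).
const dataSource = new Map<string, unknown>();

// Resolvers transform the payload into calls on the underlying storage protocol.
const resolvers: Record<string, (payload: any) => unknown> = {
  getItem: (p: { key: string }) => dataSource.get(p.key),
  putItem: (p: { key: string; value: unknown }) => {
    dataSource.set(p.key, p.value);
    return p.value;
  },
};

// Subscribers registered through the subscription handshake.
const subscribers: Array<(field: string, value: unknown) => void> = [];

// Proxy: validates the operation against the known schema fields, resolves it,
// and broadcasts an "action" to subscribers after a successful mutation.
function proxy(op: Operation): unknown {
  const resolver = resolvers[op.field];
  if (!resolver) throw new Error(`Field ${op.field} is not in the schema`);
  const result = resolver(op.payload);
  if (op.kind === "mutation") {
    for (const notify of subscribers) notify(op.field, result);
  }
  return result;
}

// Example: a client subscribes, then another client writes; the subscriber is notified.
subscribers.push((field, value) => console.log(`action: ${field} ->`, value));
proxy({ kind: "mutation", field: "putItem", payload: { key: "status", value: "ok" } });
```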

11.5 Data Governance Versus Compliance

Data governance is the method of handling the usability, security, availability, and quality of organizational data using internally established rules and policies. Data compliance is the practice of ensuring that all data an organization handles is managed and organized in a way that allows it to meet its business rules as well as legal and government regulations. Data compliance relates to the security and privacy of sensitive data stored, retrieved, and used by corporations and organizations; the responsibility for protecting this personal data falls on those organizations, particularly the ones that deal with personal data.
Data governance is concerned with setting up an environment in which data can be used efficiently to provide insightful knowledge that improves company procedures. Any company wishing to derive insights from the analysis of its data is said to require data governance. Without data governance, data cannot meet the rules and the quality requirements necessary to derive actionable insights, and security threats may compromise the integrity of the data. This increases the possibility that the company or organization will violate the rules.


Comparison of data governance and data compliance principles

Data governance principles:
– Data governance choices, procedures, and controls must all be auditable. Along with them, sufficient documentation that complies with audit criteria is required.
– A company's data governance staff must all conduct themselves honestly in their employment. When addressing the possibilities, limitations, and effects of data-related decisions, they must exhibit integrity.
– Transparency is a must for all data governance procedures. All workers must be given a thorough explanation of how, when, and why every data-related decision was made.
– A strong data governance framework must specify who is responsible and answerable for cross-functional data-related choices.

Data compliance principles:
– Legality, equity, and transparency: All procedures involving personal data must adhere to the specifications outlined, particularly those included in the GDPR standard. This covers operations including data collection, storage, and processing.
– Purpose limitation: Information can only be gathered and used for purposes that have been disclosed to data subjects and for which they have given their consent.
– Data minimization: All data, particularly personal data, that is gathered must be adequate, relevant, and restricted to what is absolutely essential in light of the purpose for which it is processed.
– Accuracy: Information should be correct and, where required, kept up to date. Businesses and organizations need to make sure that they delete erroneous data in order to avoid keeping old and out-of-date information.

11.6 Comparing the Benefits

11.6.1 Data Governance
– By ensuring that data is consistent and uniform throughout the company, data governance facilitates better and more thorough decision support.
– Data governance ensures that there are clear guidelines for updating data processes, which makes an organization and its IT operations more nimble and scalable.
– By providing centralized control mechanisms, data governance lowers costs in other aspects of data administration.
– The ability to reuse data and data processes boosts efficiency.
– It increases trust in the accuracy of the data and of the data process documents.


11.6.2 Data Compliance
– Improved data management: It allows you to reduce the quantity of data gathered and held, to organize data storage in a better way, and to enhance your data management processes.
– Boosts loyalty and trust: By helping you establish and sustain relationships of trust and respect with your partners, clients, and the general public, compliance with the rules can strengthen your organization or business.
– Enhances cybersecurity: Ignoring cybersecurity risks may result in high data-breach expenses and system downtime from the theft or loss of essential data. Data compliance also helps develop a security-conscious process [6].

11.7 Demerits of Government Compliances

The main disadvantages or demerits of government compliance structures are:
1. Lack of flexibility: The legislature or parliament sets the laws and rules that government firms must abide by, and the majority of these restrictions are rigid.
2. Lack of motivation: The directors and other executives take the least interest in the operational activities of a government corporation. They receive a set salary without any participation in profits or liability for losses.
3. Political interference: Political parties and political leaders interfere with government corporations, which causes problems. Changes of government alter the constitution of the Board, which has a negative impact on operations.
4. Lack of autonomy: Despite being autonomous organizations, government firms cannot make any decisions without first receiving approval from the government.
5. Delay in decisions: Government agencies must rely on the government for policy choices, which delays those decisions.
6. Low labor productivity: Poor employee productivity is a challenge for government-run businesses. Effectiveness is hindered by several problems, such as:
   – Wrong selection and promotion
   – Lack of training and development
   – Poor performance evaluation
   – Misplacement and forced transfer
7. Resource wastage: There is a huge amount of waste of resources and poor material management; raw materials remain unused. Corrupt officials also place huge orders for raw materials and inputs in exchange for bribes and kickbacks, even when such purchases are not required in large quantities [7, 12].


11.8 Challenges to Data Compliance Strategies

Some of the challenges commonly faced by organizations in their data compliance strategies are:
1. More data: The amount of data that industry produces and ingests is astounding, and a great deal more is being generated. There is reason to suppose that this growth has not yet peaked, because portions of the population in emerging countries are still embracing digital connections. Furthermore, businesses that were not previously processing data are now doing so because of the shift to online channels brought on by the pandemic.
2. More devices: You no longer have just one Achilles heel to watch; as the Internet of things (IoT) weaves itself into the foundation of every organization, you have multiple points of vulnerability to pay attention to. IoT devices encounter more than 5,000 attacks per month, according to one statistic, even though the IoT market is expected to be worth $1.1 trillion by 2026. According to another estimate, six in ten businesses have encountered an IoT security incident. The resulting compliance challenges are manifold and include:
   – Privacy violations
   – Legal complications
   – Vulnerability management
3. Dark data: Dark data is information that you possess but are unaware of. If dark matter is the metaphor to go with (85% of the universe is allegedly dark matter), your organization may be sitting on an iceberg of data, some of it useful, much of it useless, but all of it a risk. Dark data not only creates major compliance concerns but also raises questions about the ethics of data collection. When you are ignorant of the data you own, how can you keep it safe, private, and confidential? The problem is partly one of cost, because you would need stronger mechanisms in place to bring that data into the light.
4. Lack of board foresight: If you have been paying close attention to the previous few challenges, you will notice that they do not always require you to create a list of GDPR standards. They do, however, emphasize the necessity of crystal-clear organizational policy. A good place to start with data governance and compliance is frequently the internal signal rather than the external obligation. It follows that the board must accept accountability for the data you collect, manage, analyze, examine, and even sell. In the past, board meetings may not have given much attention to data privacy and compliance, but in the current digital age, boards must set the standard for data management risks across the firm.
5. Mandates galore: For many boards, however, the biggest problem is that there are simply too many compliance rules to manage at once. You have a host of regulatory yardsticks to meet, including:
   – GDPR
   – HIPAA
   – PIPEDA
   – CCPA
   – PCI DSS
   – FACTA
6. Data lifecycle: To comply with data compliance regulations, you need to keep a close eye on client data from the moment you first obtain it, through how and where you handle it. This is a problem because, first, your data will eventually move from physical servers to the cloud and across borders, and second, data persists. This is related to the possibility of organizational silos inside your structure: when data is lost within silos, the question of compliance becomes murkier.
7. Damage is costly: Data compliance damages are very expensive, because you have to deal with:
   – Fees
   – Penalties
   – Containment
   – Reputation
IBM estimates the average worldwide cost of a data breach in 2020 at $3.86 million, and $8.64 million in the USA. Large companies had already paid out hundreds of millions of dollars for security incidents and data breaches in 2019. IBM also notes that if remote work becomes more common, the cost of a data breach may rise [6, 13].
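One lightweight way to keep the data lifecycle visible, sketched below, is a register that tags each data asset with its applicable regulations, retention limit, and owner, so that stale, unmapped, or ownerless ("dark") data can be flagged for review. The asset fields and checks are illustrative assumptions rather than a prescribed compliance model.

```typescript
// Illustrative data-asset register for lifecycle and retention tracking (hypothetical fields).
interface DataAsset {
  name: string;
  containsPersonalData: boolean;
  regulations: Array<"GDPR" | "HIPAA" | "CCPA" | "PCI DSS">;
  collectedAt: Date;
  retentionDays: number;
  owner?: string; // unowned assets are a warning sign for dark data
}

function lifecycleFindings(assets: DataAsset[], today: Date): string[] {
  const findings: string[] = [];
  for (const a of assets) {
    const ageDays = (today.getTime() - a.collectedAt.getTime()) / 86_400_000;
    // Data kept past its retention limit should be deleted or re-justified.
    if (ageDays > a.retentionDays) {
      findings.push(`${a.name}: past its ${a.retentionDays}-day retention period`);
    }
    // Personal data with no regulation mapped suggests an unassessed obligation.
    if (a.containsPersonalData && a.regulations.length === 0) {
      findings.push(`${a.name}: personal data with no regulation mapped`);
    }
    // Assets without an owner are candidates for dark data.
    if (!a.owner) {
      findings.push(`${a.name}: no owner assigned (potential dark data)`);
    }
  }
  return findings;
}
```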

11.9 Current Situation of Government Compliances

Shri Piyush Goyal, Minister of Commerce and Industry, Consumer Affairs, Food and Public Distribution and Textiles, said at the National Workshop on Reducing Compliance Burden organized by DPIIT: "Central Ministries and States/UTs are undertaking a major exercise to reduce the compliance burden and the objective of this exercise is to simplify, decriminalize and remove redundant laws." The mentality has shifted from "He is unable to understand the complexities" to "Starting a business is so simple." He said that various regulatory compliances had merely perplexed fresh prospects and caused investors to hesitate, but that now an atmosphere is being fostered that is highly supportive of entrepreneurs.
Speaking at the event, the Secretary of DPIIT stated that, so far under the program, union ministries, states, and union territories (UTs) have eliminated more than 22,000 compliances, simplified approximately 13,000 compliances, and digitized more than 1,200 procedures.


It should also be mentioned that 327 unnecessary laws and 327 offences have been decriminalized over the last few years. Some of the iconic reforms that the center has implemented to reduce the compliance burden on citizens and businesses are:
1. Removing the distinction between domestic and foreign OSPs (other service providers), which will boost the voice BPO and ITeS business in India.
2. Liberalized access to geospatial data.
3. Introduction of the mobile application "Mera Ration."
4. Eighteen services related to driving licenses and registration certificates now require only a single step of online Aadhaar certification.
5. Uniform background checks for new investors, minimizing start-up time across businesses.
6. States and UTs have shortened the time it takes to issue permissions and licenses, removed physical contact points, and improved inspection transparency through business process re-engineering.
In July 2020, the Cabinet Secretary wrote to all departments asking them to set up a special committee to review the laws and regulations under their remit and reduce compliance costs for individuals and businesses. DPIIT has been mandated to serve as the nodal department responsible for coordinating these efforts to ease the compliance burden on businesses and citizens. By simplifying, streamlining, digitizing, and decriminalizing interactions between governments, businesses, and citizens across all ministries and states/UTs, this large-scale effort aims to promote ease of living and ease of doing business. The main points of the exercise are:
– Eliminate the compliance burden of all procedures, rules, notices, circulars, office memos, etc. that only add time and cost without any tangible improvement in governance.
– Repeal, amend, or subsume redundant laws.
– Decriminalize technical and minor noncompliance regulations to remove the constant fear of facing legal action for minor breaches, while maintaining tough criminal penalties for serious fraud offenses that threaten and damage the public interest [8].

11.10 Future of Government Compliances

The future of compliance in today's digital business environment entails choosing the best operating model to make the most of your employees and the available technologies, and it is critical to consider all three aspects. To organize, comprehend, and extract true value from your data, you must embrace digital transformation and adopt new technology.


However, you cannot afford to ignore issues like data integrity, ownership, and cybersecurity, nor can you ignore how digital technology is changing the way that businesses operate [9, 14, 16].
Governance, risk, and compliance (GRC) are regarded as essential procedures in businesses all around the world, since they redefine sustainability and take present and future opportunities into account. Because the three components were not properly connected in organizations in the past and did not achieve their GRC aims, future compliance functions need to ensure that they accomplish six key goals:
– Consistency
– Efficiency
– Effectiveness
– Agility
– Transparency
– Accountability
Compliance functions will achieve these objectives by actively engaging in the risk management process. Organizations have started to integrate GRC procedures and execution across their teams. With the aid of technological innovation, new GRC efforts are being put in place to make sure the parties involved are aware of what is occurring, why it is happening, and how they fit into the process. Compliance will also embrace mobile technology, using specialist compliance applications that can report problems more quickly [10].
GRC is an acronym used in many nations. It was established in the world of management consulting a few years ago, and technology firms and others quickly picked it up to describe the services and software solutions available. The term governance is traditionally used in connection with a company's board of directors [15]. On a worldwide average, 32% of respondents had adopted a combined assurance model, and 56% had applied the three lines of defense (TLOD); Africa has the highest degree of adoption (51%), according to the regional distribution of integrated assurance implementation [19].

11.11 Benefits of Integrating Governance, Risk, and Compliance

An integrated approach to compliance ensures that the appropriate individuals have the appropriate information at the appropriate times, that the appropriate goals are set, and that the appropriate measures and controls are put in place to deal with ambiguity and act honorably. The benefits accrue when GRC is carried out correctly:
– Higher-quality data that enables management to make quicker and wiser judgments.
– Process optimization that reduces lag time and unwelcome variance by eliminating non-value-added tasks and streamlining value-added activities.
– Financial and human resources can be distributed more efficiently by identifying areas of duplication and inefficiency.
– The overall result of all the aforementioned activities is that GRC efforts are directed more efficiently to the right individuals and departments.
– Better risk management improves the company's reputation.
– Lower expenses contribute to the total benefits brought about by efficient GRC activities.

11.11.1 An Optimum Compliance Model

Figure 11.4: Model Structure - Optimum Compliance.

In the local government and school district sectors, a compliance model should work to inform and educate where sincere attempts at compliance are being made. Where there is evidence of a major violation of the law or a pattern of repeat offences, the law should be applied with its full force (as in Figure 11.4). As part of a comprehensive approach to compliance, such a model helps develop systems and procedures that allow local government municipalities/councils and school districts to comply, and it regularly identifies those that do not. It gives departments the freedom to employ a range of strategies to learn more about governance standards and/or to respond to criticism of their procedures [11].


References

[1] What is compliance? (techtarget.com).
[2] Anand, J. V. (2022). Digital Transformation by Data Fabric.
[3] The Role of Smart Technologies in Data-Driven Compliance Programs | Corporate Compliance Insights.
[4] Security and Compliance Tools and Strategies for the Cloud (forbes.com).
[5] A model-driven framework for data-driven applications in serverless cloud computing | PLOS ONE.
[6] Challenges to Data Compliance Strategies (v-comply.com).
[7] Disadvantages of Government Companies (indiastudychannel.com).
[8] More than 22,000 compliances reduced in Government (pib.gov.in).
[9] Future of Compliance | PwC Switzerland.
[10] The Future of Governance, Risk Management, and Compliance (cioreview.com).
[11] Compliance – Government Frameworks.
[12] Silverstone, M., Gonzalez, M. V., Rodulfo, R., Halley, S., de Medina, O., & Blanco, A. (2002, March). Complying with industrial effluent regulations in Venezuela: Comparing the advantages and disadvantages of three different technologies for achieving compliance. In SPE International Conference on Health, Safety and Environment in Oil and Gas Exploration and Production. OnePetro.
[13] Wipawayangkool, K. (2009). Information security compliances and knowledge management capabilities in international diversification. AMCIS 2009 Proceedings, 604.
[14] Lin, T. C. (2016). Compliance, technology, and modern finance. Brooklyn Journal of Corporate, Financial & Commercial Law, 11, 159.
[15] Steinberg, R. M. (2011). Governance, Risk Management, and Compliance: It Can't Happen to Us, Avoiding Corporate Disaster While Driving Success. John Wiley & Sons. https://doi.org/10.1002/9781118269190
[16] Conthe, P., Contreras, E. M., Pérez, A. A., García, B. B., de Cano Martín, M. F., Jurado, M. G., . . . Pinto, J. L. (2014). Treatment compliance in chronic illness: Current situation and future perspectives. Revista Clínica Española (English Edition), 214(6), 336–344.
[17] Moyón, F., Méndez, D., Beckers, K., & Klepper, S. (2020, November). How to integrate security compliance requirements with agile software engineering at scale? In International Conference on Product-Focused Software Process Improvement (pp. 69–87). Springer, Cham.
[18] Janssen, M., & Van der Voort, H. (2020). Agile and adaptive governance in crisis response: Lessons from the COVID-19 pandemic. International Journal of Information Management, 55, 102180.
[19] Wibowo, S., Achsani, N. A., Suroso, A. I., & Sasongko, H. (2022). Integrated governance, risk, and compliance (GRC) and combined assurance: A comparative institutional study. Indonesian Journal of Business and Entrepreneurship (IJBE), 8(2), 289–289.
[20] https://www.vtechsolution.com/security-compliance-tools-strategies/.

Index 3D printing 23 agile data collection 95 agile governance 203 AI-driven augmented analytics 42 AI-enabled data 116 AI-integrated software 143 Amazon Web Services (AWS) 66, 69, 87 APIs 62, 72–73, 82–83, 91 ArchiMate metamodel 95 artificial intelligence (AI) 118, 185 Atlan’s data workspace 171 authentication framework 183 authentication framework’s 184 big data 19–20, 22–23 big data fabric 149–153, 156–158, 163 big data fabric architecture 136, 139, 145, 149–152, 156–158, 163 big data technologies (BDT) 106 Big Data tools 107, 159 business criteria 40 carbon emission 26 centralization 103 (CIEM) 202 Cinchy 171 cloud access security brokers (CASB) 201 cloud data warehouse 110 cloud orchestration 61 Cloud Pak 171 cloud security 200 cloud security posture management (CSPM) 201 cloud-agnostic 61 Commonwealth Care Alliance (CCA) 51 computer-aided design 167 content-based filtering 154 core benefits of data fabric 139 coronavirus disease (COVID-19) 128 CSPM 201–202 customer relationship management (CRM) 153 cyber-physical systems 25 DAG 105 DAMA 36 dark data 209 data access mechanism 7 https://doi.org/10.1515/9783111000886-012

data analytics 13, 15 data breaches 5 data democratization 39 data discovery 152 data fabric 1, 8–9, 10, 11, 12, 16, 38–39, 61–63, 66, 68–70, 78, 82, 85–87, 89–91, 199–200 data fabric architecture 122 data falsification 169 data governance 118, 124 data ingestion 151 data integration 1, 12 data integration (DI) 118 data integrity 144 data lifecycle 210 data management 149–150, 154, 156–157, 163, 167 data mesh comparison 140 data mining 156 data mining lifecycle 97 data orchestration 151 data processing 184, 186, 189–191, 194–195 data security 62 data storage grid 77 data strategies 22 data tiering 76 data virtualization 8, 13, 36–37, 40–41 data warehousing 95, 97 Data-catalo 37 data-centric architecture’s 155 decentralized structure 133 defense IoT 24 Denodo platform 13, 124–125 detection of fabricated data 170 diffusion equations 173 digital asset management 155 Digital Coronavirus Application (DCOVA) 129 digital health 128–130 digital manufacturing 19, 22–23 digital transformation 120, 154 direct-attached storage (DAS) 78 distributed denial of service (DDoS) 72 DLPD standards 112 Domo Data Visualization 55–57 DPIIT 210–211 DV platform 142 ecosystem power 183–184 edge computation 27


Elastic MapReduce (EMR) 86 emerging cloud services 62 Enterprise 2.0 96 enterprise data 95, 97, 112 enterprise data management (EDM) 95 enterprise IT solutions 10 enterprise resource planning (ERP) 153 enterprise systems 153 ERP software 68 ETL 14 extraction, transition, and loading (ETL) 111

machine learning algorithms 28 MapReduce 112 massive amounts of data 154, 156, 158, 161 master data enterprise 99 metadata management 119 Microsoft merchandise 41 Microsoft Power BI 41 Microsoft’s Azure Site Recovery (ASR) 84 mobile health (mHealth) 129 multicloud platform 4 multifactor authentication (MFA) 36

falsified information 169 FHIR (Fast Healthcare Interoperability Resources) 137

NetApp 61–70, 74–79, 81–92 NIST 190 NoSQL 167

good data management practice 145 Google Analytics 108 Google charts 159 government compliance 199 GraphQL 203–204 green manufacturing 26

ONTAP 65–67, 69, 74–78, 80–89, 91 ONTAP Cloud’s storage 69 OpenStack 78, 83–84, 89, 91 orchestrate 12

Hadoop 86, 167 hamper productivity 15 HDFS 159 high-performance SaaS 45 hybrid data architecture 133 hybrid ecosystems 14 IaaS 66–67, 71–73 Industry 4.0 19–20, 22–23, 25–27 infrastructures as a service (IaaS) 152 Internet of data (IOD) 25 Internet of services 25 intrusion detection system (IPS) 190 IoT device 185 IT architectural solutions 24 IT infrastructure 83, 88 IT log Analytics 162 IT operations 117 IT stakeholders 106 knowledge discovery 137 KPIs 50, 52 legacy system 6 lifetime cost (LTV) 52

PaaS 71, 73 personal health records (PHRs) 129 poor scale management 123 precision public healthcare (PPH) 128 preventive analytics 28 privacy framework 183, 189, 191, 194–195 privileged access management (PAM) 201 proof-of-concept (PoC) 106 public healthcare (PPH) 128 purchasing pattern 19 Qlik Sense 44 Quality of Service (QoS) 80 renewable energy 28 research data 169 resource-based view (RBV) 132, 149 risk management 183, 189, 191, 193–194 RiskScape 128, 131 robotic process automation 67 SAP system 103 self-service analytics 37 semantic sensor network 23 semantic web data enterprise 101 semistructured data 157 service-level agreement 123

Index

simulation-based analytics 173 single-sign-on (SSO) 183 smart buildings 26 Smart Digital Tech 20 Smart eHealth System 127 smart factories 20 smart grid 19, 21–22, 25–26, 28, 31 smart mobility 26 smart technology 200 Society 4.0 127 solar plant 30 SolidFire 75, 78, 80, 82, 84 SQL injection attack 105 streaming data pipelines 138 streamlining data management 119 supply chains 121 surveillance 23 Tableau 43–44 terrific algorithm 38

TOGAF 96 transformation engine 36 transmission electron microscopy 168 two-factor authentication (2FA) 36 virtual data warehouse 37 virtual local area networks (VLANs) 73 virtual private network (VPN) 71 Web 2.0 technologies 96 Web application 20 web enterprise 95 web-driven data fabric 183 web-enabled APIs 157 Workflow automation 90 Zachman framework 99 Zoho Analytics 53–54
