Advances in Big Data Analytics [1 ed.] 9781683921820

This volume contains the proceedings of the 2017 International Conference on Advances in Big Data Analytics (ABDA'1


English Pages 84 Year 2018


PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON ADVANCES IN BIG DATA ANALYTICS

Editors Hamid R. Arabnia Fernando G. Tinetti Mary Yang

CSCE’17 July 17-20, 2017 Las Vegas Nevada, USA americancse.org ©

CSREA Press

This volume contains papers presented at The 2017 International Conference on Advances in Big Data Analytics (ABDA'17). Their inclusion in this publication does not necessarily constitute endorsements by editors or by the publisher.

Copyright and Reprint Permission Copying without a fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to source is given. Abstracting is permitted with credit to the source. Please contact the publisher for other copying, reprint, or republication permission.

© Copyright 2017 CSREA Press ISBN: 1-60132-448-0 Printed in the United States of America

Foreword

It gives us great pleasure to introduce this collection of papers to be presented at the 2017 International Conference on Advances in Big Data Analytics (ABDA’17), July 17-20, 2017, at Monte Carlo Resort, Las Vegas, USA.

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied Computing, CSCE (a federated congress with which this conference is affiliated), includes "Providing a unique platform for a diverse community of constituents composed of scholars, researchers, developers, educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated with diverse entities (such as: universities, institutions, corporations, government agencies, and research centers/labs) from all over the world. The congress also attempts to connect participants from institutions that have teaching as their main mission with those who are affiliated with institutions that have research as their main mission. The congress uses a quota system to achieve its institution and geography diversity objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in the USA. We are proud to report that this federated congress has authors and participants from 64 different nations, representing a variety of personal and scientific experiences that arise from differences in culture and values. As can be seen below, the program committee of this conference, as well as the program committees of all other tracks of the federated congress, are as diverse as its authors and participants.

The program committee would like to thank all those who submitted papers for consideration. About 65% of the submissions were from outside the United States. Each submitted paper was peer-reviewed by two experts in the field for originality, significance, clarity, impact, and soundness. In cases of contradictory recommendations, a member of the conference program committee was charged to make the final decision; often, this involved seeking help from additional referees. In addition, papers whose authors included a member of the conference program committee were evaluated using the double-blinded review process. One exception to the above evaluation process was for papers that were submitted directly to chairs/organizers of pre-approved sessions/workshops; in these cases, the chairs/organizers were responsible for the evaluation of such submissions. The overall paper acceptance rate for regular papers was 25%; 17% of the remaining papers were accepted as poster papers (at the time of this writing, we had not yet received the acceptance rate for a couple of individual tracks).

We are very grateful to the many colleagues who offered their services in organizing the conference. In particular, we would like to thank the members of the Program Committee of ABDA’17, members of the congress Steering Committee, and members of the committees of federated congress tracks that have topics within the scope of ABDA. Many of the individuals listed below will be asked after the conference to provide their expertise and services for selecting papers for publication (extended versions) in journal special issues as well as for publication in a set of research books (to be prepared for publishers including: Springer, Elsevier, BMC journals, and others).

• Prof. Afrand Agah; Department of Computer Science, West Chester University of Pennsylvania, West Chester, PA, USA
• Prof. Abbas M. Al-Bakry (Congress Steering Committee); University President, University of IT and Communications, Baghdad, Iraq
• Prof. Nizar Al-Holou (Congress Steering Committee); Professor and Chair, Electrical and Computer Engineering Department; Vice Chair, IEEE/SEM-Computer Chapter; University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Hamid R. Arabnia (Congress Steering Committee); Graduate Program Director (PhD, MS, MAMS); The University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing (Springer); Editor-in-Chief, Transactions of Computational Science & Computational Intelligence (Springer); Fellow, Center of Excellence in Terrorism, Resilience, Intelligence & Organized Crime Research (CENTRIC)
• Prof. Dr. Juan-Vicente Capella-Hernandez; Universitat Politecnica de Valencia (UPV), Department of Computer Engineering (DISCA), Valencia, Spain
• Dr. Daniel Bo-Wei Chen; Chair, IEEE Signal Processing Chapter, IEEE Harbin Section; Guest Editor in ACM Transactions in Embedded Computing; School of Information Technology, Monash University Sunway Campus, Australia
• Prof. Kevin Daimi (Congress Steering Committee); Director, Computer Science and Software Engineering Programs, Department of Mathematics, Computer Science and Software Engineering, University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Zhangisina Gulnur Davletzhanovna; Vice-rector of the Science, Central-Asian University, Kazakhstan, Almaty, Republic of Kazakhstan; Vice President of International Academy of Informatization, Kazakhstan, Almaty, Republic of Kazakhstan
• Prof. Leonidas Deligiannidis (Congress Steering Committee); Department of Computer Information Systems, Wentworth Institute of Technology, Boston, Massachusetts, USA; Visiting Professor, MIT, USA
• Dr. Lamia Atma Djoudi (Chair, Doctoral Colloquium & Demos Sessions); Synchrone Technologies, France
• Prof. Mary Mehrnoosh Eshaghian-Wilner (Congress Steering Committee); Professor of Engineering Practice, University of Southern California, California, USA; Adjunct Professor, Electrical Engineering, University of California Los Angeles, Los Angeles (UCLA), California, USA
• Prof. George A. Gravvanis (Congress Steering Committee); Director, Physics Laboratory & Head of Advanced Scientific Computing, Applied Math & Applications Research Group; Professor of Applied Mathematics and Numerical Computing and Department of ECE, School of Engineering, Democritus University of Thrace, Xanthi, Greece
• Prof. George Jandieri (Congress Steering Committee); Georgian Technical University, Tbilisi, Georgia; Chief Scientist, The Institute of Cybernetics, Georgian Academy of Science, Georgia; Ed. Member, International Journal of Microwaves and Optical Technology, The Open Atmospheric Science Journal, American Journal of Remote Sensing, Georgia
• Prof. Byung-Gyu Kim (Congress Steering Committee); Multimedia Processing Communications Lab. (MPCL), Department of Computer Science and Engineering, College of Engineering, SunMoon University, South Korea
• Prof. Louie Lolong Lacatan; Chairperson, Computer Engineering Department, College of Engineering, Adamson University, Manila, Philippines; Senior Member, International Association of Computer Science and Information Technology (IACSIT), Singapore; Member, International Association of Online Engineering (IAOE), Austria
• Dr. Andrew Marsh (Congress Steering Committee); CEO, HoIP Telecom Ltd (Healthcare over Internet Protocol), UK; Secretary General of World Academy of BioMedical Sciences and Technologies (WABT) a UNESCO NGO, The United Nations
• Dr. Somya D. Mohanty; Department of Computer Science, University of North Carolina - Greensboro, North Carolina, USA
• Dr. Ali Mostafaeipour; Industrial Engineering Department, Yazd University, Yazd, Iran
• Prof. Dr., Eng. Robert Ehimen Okonigene (Congress Steering Committee); Department of Electrical & Electronics Engineering, Faculty of Engineering and Technology, Ambrose Alli University, Edo State, Nigeria
• Prof. James J. (Jong Hyuk) Park (Congress Steering Committee); Department of Computer Science and Engineering (DCSE), SeoulTech, Korea; President, FTRA, EiC, HCIS Springer, JoC, IJITCC; Head of DCSE, SeoulTech, Korea
• Dr. Prantosh K. Paul; Department of Computer and Information Science, Raiganj University, Raiganj, West Bengal, India
• Dr. Xuewei Qi; Research Faculty & PI, Center for Environmental Research and Technology, University of California, Riverside, California, USA
• Dr. Manik Sharma; Department of Computer Science and Applications, DAV University, Jalandhar, India
• Dr. Akash Singh (Congress Steering Committee); IBM Corporation, Sacramento, California, USA; Chartered Scientist, Science Council, UK; Fellow, British Computer Society; Member, Senior IEEE, AACR, AAAS, and AAAI; IBM Corporation, USA
• Ashu M. G. Solo (Publicity), Fellow of British Computer Society, Principal/R&D Engineer, Maverick Technologies America Inc.
• Prof. Fernando G. Tinetti (Congress Steering Committee); School of CS, Universidad Nacional de La Plata, La Plata, Argentina; Co-editor, Journal of Computer Science and Technology (JCS&T)
• Prof. Hahanov Vladimir (Congress Steering Committee); Vice Rector, and Dean of the Computer Engineering Faculty, Kharkov National University of Radio Electronics, Ukraine and Professor of Design Automation Department, Computer Engineering Faculty, Kharkov; IEEE Computer Society Golden Core Member; National University of Radio Electronics, Ukraine
• Prof. Shiuh-Jeng Wang (Congress Steering Committee); Director of Information Cryptology and Construction Laboratory (ICCL) and Director of Chinese Cryptology and Information Security Association (CCISA); Department of Information Management, Central Police University, Taoyuan, Taiwan; Guest Ed., IEEE Journal on Selected Areas in Communications
• Dr. Yunlong Wang; Advanced Analytics at QuintilesIMS, Pennsylvania, USA
• Prof. Layne T. Watson (Congress Steering Committee); Fellow of IEEE; Fellow of The National Institute of Aerospace; Professor of Computer Science, Mathematics, and Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, Virginia, USA
• Prof. Mary Yang (Co-Editor, ABDA); Director, Mid-South Bioinformatics Center and Joint Bioinformatics Ph.D. Program, Medical Sciences and George W. Donaghey College of Engineering and Information Technology, University of Arkansas, USA
• Prof. Jane You (Congress Steering Committee); Associate Head, Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
• Dr. Farhana H. Zulkernine; Coordinator of the Cognitive Science Program, School of Computing, Queen's University, Kingston, ON, Canada

We would like to extend our appreciation to the referees and to the members of the program committees of individual sessions, tracks, and workshops. Their names do not appear in this document; they are listed on the web sites of individual tracks.

As Sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons) provided help for at least one track of the Congress: Computer Science Research, Education, and Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science & Education & Federated Research Council (http://www.americancse.org/); HoIP, Health Without Boundaries, Healthcare over Internet Protocol, UK (http://www.hoip.eu); HoIP Telecom, UK (http://www.hoip-telecom.co.uk); and WABT, Human Health Medicine, UNESCO NGOs, Paris, France (http://www.thewabt.com/). In addition, a number of university faculty members and their staff (names appear on the cover of the set of proceedings), several publishers of computer science and computer engineering books and journals, chapters and/or task forces of computer science associations/organizations from 3 regions, and developers of high-performance machines and systems provided significant help in organizing the conference as well as providing some resources. We are grateful to them all.

We express our gratitude to keynote, invited, and individual conference/tracks and tutorial speakers; the list of speakers appears on the conference web site. We would also like to thank the following: UCMSS (Universal Conference Management Systems & Support, California, USA) for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the staff of Monte Carlo Resort (Convention department) at Las Vegas for the professional service they provided. Last but not least, we would like to thank the Co-Editors and Associate Co-Editors of ABDA’17: Prof. Hamid R. Arabnia, Prof. Fernando G. Tinetti, and Prof. Mary Yang.

We present the proceedings of ABDA’17.

Steering Committee, 2017 http://americancse.org/

Contents

SESSION: INTERNET OF THINGS, GAME TECHNOLOGIES, AND SECURITY APPLICATIONS

Visual Analysis of Players' Activities in World of Warcraft Game
Junfeng Qu, Weihu Hong, Yinglei Song

How Big Data Can Improve Cyber Security
Haydar Teymourlouei, Lethia Jackson

An 'Intelli-Fog' Approach to Managing Real Time Actionable Data in IoT Applications
Ekpe Okorafor, Mubarak Ojewale

Enhance Big Data Security in Cloud Using Access Control
Yong Wang, Ping Zhang

Voice Recognition Research
Francisco Capuzo, Lucas Santos, Maria Luiza Reis, Thiago Coutinho

SESSION: BIG DATA ANALYTICS AND APPLICATIONS

Visualization Analysis of Shakespeare Based on Big Data
Ran Congjing, Li Xinlai, Huang Haiying

Table based KNN for Text Summarization
Taeho Jo

Evaluate Impacts of Big Data on Organizational Performance: Using Intellectual Capital as a Proxy
Thuan Nguyen

SESSION: POSITION PAPERS - SHORT RESEARCH PAPERS

Post Processing of Word2vec for Category Classification based on Semantic
SungEn Kim, SinYeong Ahn, GuiHyun Baek, SuKyoung Kim

OLSH: Occurrence-based Locality Sensitive Hashing
Mohammadhossein Toutiaee

SESSION: POSTER PAPERS

Design on Distributed Deep Learning Platform with Big Data
Mikyoung Lee, Sungho Shin, Sa-Kwang Song

Development of Road Traffic Analysis Platform Using Big Data
Hong Ki Sung, Kyu Soo Chong

SESSION: LATE PAPERS - RECOMMENDATION SYSTEMS

Context-Based Collaborative Recommendation System to Recommend Music
Santiago Zapata, Luis Escobar R, Elias Aguila G

Int'l Conf. on Advances in Big Data Analytics | ABDA'17 |

SESSION INTERNET OF THINGS, GAME TECHNOLOGIES, AND SECURITY APPLICATIONS Chair(s) TBA

ISBN: 1-60132-448-0, CSREA Press ©


Visual Analysis of Players’ Activities in World of Warcraft Game

Junfeng Qu
Dept. of Computer Science and Information Technology, Clayton State University, Morrow, Georgia, USA

Weihu Hong
Dept. of Mathematics, Clayton State University, Morrow, Georgia, USA

Yinglei Song
School of Electronics and Information Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China

Abstract: To see a whole picture of players’ interaction with the game world, in terms of their decision making and progress and of whether the overall level design and game mechanics are reasonably organized, we conducted visual data mining on quantitative player data that spans periods from several months to several years. The data contains valuable information on players’ behavior, game design, game planning, game content, and services. Our analysis of the data shows game designers a way to improve existing games as well as to design better games in the future.

Index Terms: Visual mining, big data, game analytics.

I. INTRODUCTION

In recent years, there has been a rise in interest in collecting and analyzing game metrics and in how they can inform the game development process. Game designers rely on game metrics to revise the game design and to adjust levels, difficulty, core mechanics, and content in order to increase gamers’ loyalty and keep players in the game for a long time.

Williams et al. [1] suggested four approaches to game data collection: survey-based studies, online testing, participant observation, and online interviews. All of these methods require some form of in-game participant recruitment, which makes it difficult to collect and analyze game data because the approaches depend on sampling the population. Massively Multiplayer Online Games (MMOGs) offer another opportunity for researchers who want to study player behavior, because MMOG game services publish data for individual characters so that players can visit the site to learn about their achievements and compare themselves with others. However, the data is not directly available. Some researchers [2][3][4] have developed scripts and add-ins on MMOG game services to query the level, class, equipped items, and some performance statistics of players in the World of Warcraft (WOW) game, and have produced some interesting work. The queried data includes quantitative data of players over periods from several months to several years, and contains valuable information on players’ behavior, game design and planning, game content, and services.

With the rise of big data, researchers can explore the data directly to see a whole picture of players’ interaction with the game world, in terms of their decision making and progress and of whether the overall level design and game mechanics are reasonably organized. In particular, visual data mining provides insight into, and a fundamental understanding of, the game design based on whole pictures of players’ behaviors.

This paper is based on data collected by Lee et al. [3], who queried three years of WOW game data with Lua scripts. We mainly focus on the data of 2008, the year in which a new release of WOW, Wrath of the Lich King, came out in November with extended game content, levels, regions, and skills for gamers to explore.

The paper is organized as follows. In Section 2, we discuss related work. In Section 3, we conduct visual data mining to study players’ behavior with regard to the new release, regions, avatars, and level design of the game. Section 4 summarizes the preliminary findings of our case studies.

II. RELATED WORK

From the perspective of game designers, players’ behavior is one of the most important factors to consider when designing a game and justifying decisions on levels, items and equipment, game difficulty, game world rendering, and so on. Although a player-centric design strategy gives game designers a logical way to anticipate what will happen by putting themselves in the players’ shoes, direct evidence from whole-game data is still the best feedback on game design.

To gain a fundamental understanding of the gameplay behavior of online gamers, exploring users’ play time provides a good starting point, since WOW game services allow players to query and compare achievements on the game server. Game designers can observe the data to see how to increase loyalty by providing rich and well-liked avatars, game worlds, and equipment, by adjusting game levels and difficulty, and by offering new releases with attractive new features.

Collean Mackline et al. [5] built game prototypes to illustrate how different kinds of game data can be represented in a game. The Kimomo Color prototype uses a cross-reference table to show the core game mechanics of color material and the manufacture of dye. The Mannahatta Game prototype uses a directed graph that represents living and non-living elements. The third prototype, Trees of Trade, uses two directed graphs, one for the life that depends on the trees and another for the commodities, to represent a rain-forest ecosystem; players are asked to analyze and rebuild the directed graphs, which depend on biological and commercial relationships.


David Kennerly [6] explained how game designers can apply data mining techniques to analyze how players acquire experience in an MMORPG. Ducheneaut et al. [4] studied more than 220,000 World of Warcraft characters over 8 months, statistically analyzing player behaviors such as play time and the time needed to move up a level for different avatars. Weber et al. [7] investigated learning strategies in StarCraft, using over 5000 replays of expert matches to train machine learning algorithms to predict strategy.

Lewis et al. [2] performed quantitative research on WOW with 166047 characters from the US and Europe. They constructed a Naive Bayes classifier that predicts a player’s class from the items the player is wearing, and also quantitatively studied behaviors such as days to level 80, deaths by class on the way to level 80, and popular items.

Mirko Sužnjević et al. [8] studied players’ behavior in World of Warcraft based on the wJournal add-on. Players’ behavior is categorized into six groups: Questing, Trading, Dungeons, Raiding, Player versus Player (PvP) combat, and Uncategorized. The study shows that 43% of players’ time is spent exploring or idle and 36% in PvP combat, which mostly takes place in instanced zones. Players tend to gather at certain hotspots, and some instanced areas for Dungeons and Raiding are rarely used.

Kang et al. [9] analyzed four types of player behavior in WOW (social, combat, movement, and idling) using a trajectory clustering method. Their model automatically generates insights on players’ experience from simple trajectory data, which saves cost in level design.

Zhuang et al. [10] conducted and analyzed a 5-month measurement study of World of Warcraft. They found that the distribution of player session lengths is similar to that of peer-to-peer file sharing sessions, that in-game character level or age is a good predictor of session length, and that changes to a game’s virtual world can cause dramatic shifts in the population densities of in-game locations, which are otherwise relatively stable.

Lee et al. [3] collected three years of World of Warcraft player data. A list of the status of online avatars was collected every 10 minutes. If an avatar logs in and logs out within 10 minutes, the program may not be able to observe the re-login activity in consecutive snapshots. During the monitored period, 91065 avatars and 667032 sessions associated with those avatars were observed; each record holds the player’s id, level, zone, charclass, and guild, and the timestamp at which the snapshot was made.

In this paper we focus on players’ interaction and social behavior in guilds, avatar design and players’ expressive play in WOW, players’ leveling paths by zone and race, play time patterns, and level difficulty design and progression control.

III. OBSERVATIONS

Gamer interaction and social functions are always considered in game design as ways to attract different gamers with an enhanced gameplay experience. A WOW guild is typically an in-game association of player characters. The social atmosphere of a guild makes it easier for gamers to raid and makes the game more rewarding. Banking guilds also offer convenience to individual players as a way to increase the limited bank storage space available in the game. A guild greatly enhances the gameplay experience: players can meet friends, share adventures, and find people to protect them in faction versus faction combat. Typically, players in good guilds can go places and do things that players in poor guilds or in no guild cannot, because guilds offer many benefits, including free items, opportunities for groups, access to trade skill masters, quest items, and readily available trade skill ingredients gathered by guild members. We studied the top ten guilds in WOW by players’ race and class, as shown in Figure 1.
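The guild breakdown by race and class described above can be sketched as a simple counting pass. This is a minimal illustration only, not the authors' tooling: the record layout `(char_id, guild, race, charclass)`, the sample rows, and the function name `guild_composition` are assumptions made for the example; only the convention that guild -1 means "no guild" comes from the paper.

```python
from collections import Counter

# Hypothetical records: (char_id, guild, race, charclass).
# Guild -1 denotes "no guild", following the convention used in the paper.
records = [
    (1, -1, "Orc", "Warrior"),
    (2, -1, "Blood Elf", "Paladin"),
    (3, 103, "Orc", "Warrior"),
    (4, 103, "Troll", "Hunter"),
    (5, 103, "Orc", "Warrior"),
    (6, 282, "Undead", "Mage"),
]

def guild_composition(records, top_n=10):
    """Count avatars per (race, class) inside each of the top_n largest guilds."""
    guild_sizes = Counter(guild for _, guild, _, _ in records)
    top_guilds = [guild for guild, _ in guild_sizes.most_common(top_n)]
    composition = {guild: Counter() for guild in top_guilds}
    for _, guild, race, charclass in records:
        if guild in composition:
            composition[guild][(race, charclass)] += 1
    return composition

# With the toy data, the two largest "guilds" are 103 and -1 (no guild).
composition = guild_composition(records, top_n=2)
```

Each inner counter can then be drawn as one stacked bar per guild, which is one plausible way to arrive at a chart like the one the text refers to.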

Figure 1: Guild composition by race and class

The majority of players belong to no guild when they join the WOW community; among them, the Orc Warrior and the Blood Elf Paladin are the avatars most often chosen. Among gamers who join a guild, the number of Blood Elf Paladins drops greatly, although this avatar still appears in small numbers in the ten guilds with the most players. We further studied the success rate of players who do not belong to any guild (guild value of -1), where success means having completed level 80, and list the top 10 and bottom 10 success rates in Tables 1 and 2.

Table 1: Top 10 success rates by race and class

Race        Class          Num. Players   Players in Level 80   Percentage
Orc         Death Knight   164            13                    7.9%
Blood Elf   Death Knight   956            53                    5.5%
Undead      Death Knight   223            11                    4.9%
Tauren      Death Knight   246            10                    4.1%
Undead      Priest         814            21                    2.6%
Undead      Mage           1077           27                    2.5%
Tauren      Shaman         927            20                    2.2%
Tauren      Warrior        1447           31                    2.1%
Undead      Warlock        1231           24                    1.9%
Tauren      Druid          1760           33                    1.9%

The popularity of a race and class combination is not a good indicator of beating the game, as the bottom success rates show, but race and class do play an important role in being successful. The top ten success rates show that players who chose the Death Knight class, in any of the four races listed, have a higher chance of finishing the game without being killed.

Table 2: Bottom 10 success rates by race and class

Race        Class     Num. Players   Players in Level 80   Percentage
Blood Elf   Paladin   3038           38                    1.3%
Blood Elf   Mage      2122           24                    1.1%
Undead      Rogue     1204           12                    1.0%
Troll       Hunter    1379           12                    0.9%
Orc         Hunter    1064           9                     0.8%
Blood Elf   Hunter    1595           12                    0.8%
Blood Elf   Warlock   2002           15                    0.7%
Troll       Rogue     881            6                     0.7%
Blood Elf   Rogue     1410           6                     0.4%
Orc         Warrior   3147           9                     0.3%
Troll       Shaman    1058           3                     0.3%
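The percentages in the success tables are simply the number of players at level 80 divided by the number of players in the group. A minimal sketch of that computation follows; the input layout `(race, charclass, max_level_reached)` and the function name `success_rates` are assumptions for illustration, and the sample data is constructed only to reproduce the Orc Death Knight row (164 players, 13 at level 80).

```python
from collections import defaultdict

MAX_LEVEL = 80  # level cap after Wrath of the Lich King, as stated in the paper

def success_rates(avatars):
    """avatars: iterable of (race, charclass, max_level_reached).

    Returns {(race, charclass): (num_players, num_at_cap, percent)}.
    """
    totals = defaultdict(int)
    capped = defaultdict(int)
    for race, charclass, max_level in avatars:
        key = (race, charclass)
        totals[key] += 1
        if max_level >= MAX_LEVEL:
            capped[key] += 1
    return {
        key: (n, capped[key], round(100.0 * capped[key] / n, 1))
        for key, n in totals.items()
    }

# Synthetic data shaped like the Orc Death Knight row: 13 of 164 reach level 80.
avatars = [("Orc", "Death Knight", 80)] * 13 + [("Orc", "Death Knight", 60)] * 151
rates = success_rates(avatars)
```

The same function works for per-guild rates if the grouping key is replaced by the guild id.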

There are 421 guilds in total in the whole year of WOW data. To see how a guild helps players grow, we further analyzed which guilds have the highest success rates; the top guilds are listed in Table 3.

Table 3: Top percentage in guild completed 80 levels

Guild Num.   Num. Players   Num. in Level 80   Rate
473          1              1                  100.0%
447          1              1                  100.0%
504          2              1                  50.0%
364          13             5                  38.5%
414          19             7                  36.8%
485          6              2                  33.3%
147          3              1                  33.3%
424          136            43                 31.6%
471          29             8                  27.6%
207          11             3                  27.3%
368          4              1                  25.0%
459          46             11                 23.9%
342          50             10                 20.0%
481          10             2                  20.0%
508          5              1                  20.0%

Without exception, every guild in which 20% or more of the players complete WOW is relatively small. Further study also shows that if the guild is too small, the players’ success rate drops close to 0%.

Table 4: Bottom percentage in guild completed 80 levels

Guild Num.   Num. Players   Num. in Level 80   Rate
-1           34258          454                1.3%
460          1796           0                  0.0%
282          1141           12                 1.1%
103          1073           202                18.8%
251          686            25                 3.6%
101          620            50                 8.1%
161          610            72                 11.8%
189          532            93                 17.5%
104          516            66                 12.8%

The size of a guild gives a mixed signal about how much the guild helps players move up, because the success rate in reaching level 80 ranges from 0% to 18.8%.

3.2 Hot avatars in combined race and class

The player creates an account, chooses a race and a corresponding class, and moves up level by level to overcome the challenges designed by the game designer. Most MMO games provide a variety of pre-built avatar types for players to choose from, and players also have opportunities to change and customize the avatar later in the game. We studied the number of WOW players by race and class. The races speak many different languages, have different homelands and racial traits, and can pursue different classes. A race has basic statistics for strength, agility, stamina, intellect, and spirit. A class is the primary adventuring style of a player character; it determines the type of weapons and armor to be used, as well as the abilities, powers, skills, and spells to be gained throughout the adventures. The player’s choice of class is constrained by the choice of race, since each race has a different group of available classes.

The popularity of races, ranked from high to low, is Blood Elf, Orc, Troll, Undead, and Tauren, while the popularity of classes, ranked from high to low, is Warrior, Hunter, Rogue, Mage, Warlock, Paladin, Shaman, Priest, Druid, and Death Knight. We also studied which combinations of race and class players prefer by counting how many players chose each combination, and plotted the counts as a heat map. The most popular combination is the Orc Warrior, then the Blood Elf Paladin, followed by the Mage and then the Warlock class, as shown in Table 5 and Figure 2.

Table 5: Popular race and class combined

Figure 2: Heat map of combined race and class

A WOW player who is not satisfied with an earlier choice can change class and race during play. We studied all players who changed either race or class during the one-year data collection period. Changing race or class is a rare event: 97.8% of players stick with what they started with and 1.92% change race or class only once, which means that 99.72% of players in total are mostly satisfied with the avatar templates offered by the WOW game designers. We can therefore conclude that players are well satisfied with the rich set of races and avatars on offer.

3.3 Players’ game play time

The design of difficulty in a game generally follows an ‘S’ shape, with three stages of gameplay: a slow beginning, a steep acceleration, and a plateau. There are some variations based on the ‘S’-shaped curve. The absolute difficulty of a game is the skill needed plus the time pressure to solve the challenges in a level.
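To make the 'S' shape concrete, one common way to draw such a curve is the logistic function. The sketch below is only an illustration of the three stages (slow beginning, steep acceleration, plateau); the paper does not state which functional form, if any, designers use, and the `steepness` value, the midpoint at half the level cap, and the function name are all assumptions.

```python
import math

def s_curve_difficulty(level, max_level=80, steepness=0.15):
    """Logistic curve with values in (0, 1): slow start, steep middle, plateau.

    The midpoint (half of max_level) and the steepness are illustrative choices,
    not values taken from the paper.
    """
    midpoint = max_level / 2
    return 1.0 / (1.0 + math.exp(-steepness * (level - midpoint)))

# Difficulty for levels 1..80 under these assumed parameters.
curve = [s_curve_difficulty(level) for level in range(1, 81)]
```

Comparing a measured time-per-level curve against such a reference shape is one way to spot levels where the observed difficulty departs from the intended progression.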


Game designers generally expect players to spend more time at higher levels than at lower levels. We therefore use the average time each player spends at each level as an indicator of game difficulty. The plot of game time on a log scale is shown in Figure 3. The average time spent by players increases gradually up to level 70. At level 70, where the new release of Wrath of the Lich King begins, players spend an abnormal amount of time digesting the new challenges, skills, and game rules. After level 70, the difficulty progresses smoothly again up to level 80, the finale of the game, where players spend more time in order to win the game; however, that time is much less than the time spent at level 70. Overall, based on the observed data, the level and difficulty design of WOW is well considered and balanced.
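The difficulty indicator described above, average time per player at each level, can be estimated directly from periodic snapshots. The sketch below assumes each snapshot row is `(char_id, level)` and that every observation stands for one 10-minute sampling interval, which matches the interval reported for the dataset; the row layout, the function name, and the choice of base-10 logarithm are assumptions, since the paper only says the plot uses a log function.

```python
import math
from collections import defaultdict

SNAPSHOT_MINUTES = 10  # sampling interval reported for the dataset

def avg_minutes_per_level(snapshots):
    """snapshots: iterable of (char_id, level), one row per 10-minute observation.

    Returns {level: average minutes a player was observed at that level}.
    """
    counts = defaultdict(int)  # (char_id, level) -> number of snapshots seen
    for char_id, level in snapshots:
        counts[(char_id, level)] += 1
    total_minutes = defaultdict(int)
    players = defaultdict(int)
    for (char_id, level), n in counts.items():
        total_minutes[level] += n * SNAPSHOT_MINUTES
        players[level] += 1
    return {level: total_minutes[level] / players[level] for level in total_minutes}

# Toy data: player 1 is seen once at 69 and three times at 70; player 2 once at 70 and 71.
snapshots = [(1, 69), (1, 70), (1, 70), (1, 70), (2, 70), (2, 71)]
avg = avg_minutes_per_level(snapshots)
log_avg = {level: math.log10(minutes) for level, minutes in avg.items()}
```

Because sessions shorter than the sampling interval can be missed, the result is an estimate of observed time rather than exact play time.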

Figure 3: Time spent by level of all players 3.4 Player retention and recruiting The number of active players per month is plotted in figure 4. The monthly average number of active player is 8303, only Jan, Feb, Mar., Oct and Nov. exceeds the average. Further statistics shows that 1107 out of 37354 active players who had played every month, which is about 3%, while the average monthly retention rate is 61%. It is interesting that the first half of a year from January to June remains above average retention rate, and second-half of year is below the average retention rate although the new Wrath of the Lich King was released. The time line of Wrath of the Lich King is following: 1. July 3rd 2008, public beta released 2. October 10, 2008, announcement of new release was being manufactured for sale 3. November 13th, released to public

Content highlights of the new WOW include the increase of the level cap from 70 to 80, the introduction of the death knight Hero class, and new PvP/World PvP content. No additional playable races were added, though many new NPC races were featured. After a slight drop in the number of active players from January to March and a sudden drop in April, the number of active players reaches a deep valley in May. It rises a little in July, which matches the timing of the new WOW beta, and then drops a little in August and September. There is a high peak in the number of players in October, which affirms that the new release of Wrath of the Lich King and advertising in the game community do attract more players to the game; however, the numbers for November and December are not promising, since they slide down again.

Figure 5: Monthly retention rate

If we use the players of January as the retention base, then the retention of each later month is calculated as the number of players who stay divided by the total number of active players of that month, as indicated in figure 5. The retention rate is flat from February to September except for a drop in July, when the new WOW was in public beta. The data show a similarly heavy drop in retention in October, when the new WOW came on the market. After that, the retention rate steadily recovers in November and December. We also studied how many months players continue to play WOW after their first login; the data are shown in Figure 6. We can see that 60% of players quit playing after one month, and 9% play for two months before dropping out. Between 2% and 3% of players play for each duration from 3 to 11 months, and 7% of players play for all 12 months.
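The retention definition above (base-month players still present, divided by all active players of the month) can be written down directly. The sketch below is illustrative only; the function name and the dictionary-of-sets input are assumptions, not the authors' data format.

```python
def monthly_retention(active_by_month, base_month):
    """Retention of each later month relative to a base month.

    active_by_month: dict mapping a month label to the set of player ids
    that were active in that month.
    Returns: dict month -> (base-month players still active) / (active players).
    """
    base_players = active_by_month[base_month]
    rates = {}
    for month, players in active_by_month.items():
        if month == base_month or not players:
            continue  # skip the base itself and months with no activity
        stayed = base_players & players
        rates[month] = len(stayed) / len(players)
    return rates
```

Note that the denominator is the month's own active population, as in the text, so a surge of new players lowers the rate even when no base-month player leaves. That is consistent with the drops reported for July and October.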

Figure 6: Retention over time by months

Figure 4: Number of players online by month

3.5 Successful Players’ Moving Paths

We identify players by the combination of char, race, and charclass, and locate all players who successfully completed the game from level 1 to level 80. Their moving paths by level are plotted in figure 7, with race shown in different colors. In total, 74 players who played from level 1 to level 80 are identified in the 2008 data. There are 37354 distinct values of char; combining char and race gives 37875; and if we use char, race, and class to represent a player, there are 38331 distinct players. Some players may play WOW under different chars, and some own characters of different races and classes.

Figure 7: Success moving paths by zones to Level 80

We can see that some zones are never visited by these players at all, such as Black Temple, Blackwing Lair, Gruul’s Lair, Hyjal, The North Sea, and The Veiled Sea, which suggests that the designers need to reconsider the game paths leading to these zones. On the other hand, Eversong Woods, Mulgore, Tirisfal Glades, Durotar, and Orgrimmar are special zones in which players start at level 1 and finish at level 80. As indicated by the race name ‘Undead’, this is the race with which players most commonly complete all levels of the game, although Blood Elf is the most popular race in the game. The second group of levelling-up paths runs from level 70 to 80, the new levels added in the new release of WOW. Players of the ‘Undead’ race consistently completed these levels in a variety of zones, followed by the Blood Elf race. The diagonal path in figure 7 also suggests a possible sequence of zone transitions that players would follow to level up and complete the whole game successfully, and that the best race to choose in order to win the game is Undead.

3.6 Popular Zones and Transition Map of Players

We represent each zone as a node whose size indicates the number of players in the zone. We add a directed edge from one zone to another if a player moves between them, and we count how many players have taken each transition overall, which is represented by the thickness of the edge. The zone transition map, as a directed graph, is shown in figure 8.

Figure 8. Zone Transition Map of all players

It is clear that the most active zones and transitions are among Netherstorm – Shattrath City – Orgrimmar – Karazhan – Nagrand – Terokkar Forest – Arathi Basin – Hellfire Peninsula – Durotar-…-The Barrens. Although successful players did not visit some zones, as indicated in figure 7, all zones are visited by players via some transitions. We further divided the game levels into four groups (level 1 to 19, level 20 to 39, level 40 to 59, and level 60 to 80) to explore more closely how players move among zones to complete tasks and advance.

Figure 9: Zone transitions in Level 1 to 19

Figure 9 shows players’ transitions in levels 1 to 19. The majority of transitions made to complete tasks are among Eversong Woods – Ghostlands – Silvermoon City; the other main group is among The Barrens – Durotar – Orgrimmar.
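A per-level-group transition map of the kind shown in figure 9 can be built by counting, for each directed zone pair, how many distinct players made that move while inside the level range. The following is a rough sketch under stated assumptions: the time-ordered `(player_id, level, zone)` record shape and the function name are hypothetical, and the authors' actual tooling is not described in the text.

```python
from collections import defaultdict


def transition_map(records, min_level, max_level):
    """Build node sizes and edge weights for one level group.

    records: time-ordered iterable of (player_id, level, zone) tuples.
    Returns (nodes, edges): nodes maps zone -> number of distinct players
    seen there, edges maps (from_zone, to_zone) -> number of distinct
    players who made that move.
    """
    last_zone = {}                    # player -> most recent zone in range
    zone_players = defaultdict(set)
    edge_players = defaultdict(set)

    for player, level, zone in records:
        if not (min_level <= level <= max_level):
            # Leaving the level group breaks the player's path.
            last_zone.pop(player, None)
            continue
        zone_players[zone].add(player)
        previous = last_zone.get(player)
        if previous is not None and previous != zone:
            edge_players[(previous, zone)].add(player)
        last_zone[player] = zone

    nodes = {zone: len(players) for zone, players in zone_players.items()}
    edges = {edge: len(players) for edge, players in edge_players.items()}
    return nodes, edges
```

Counting distinct players per edge, rather than raw moves, matches the description in Section 3.6, where edge thickness reflects how many players took a transition. Calling the function with ranges such as (1, 19) or (20, 39) yields one map per level group.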

Figure 10. Zone Transitions in Level 20 to 39

As indicated in figure 10, The Barrens – Orgrimmar transitions still dominate; however, Silvermoon City – Eversong Woods – Ghostlands are no longer hot in these levels. Hillsbrad Foothills – Alterac Mountains – Arathi Highlands and others become the new favored zones in levels 20 to 39.

Figure 11. Zone Transitions in Level 40 to 59

From level 40 to 59, The Barrens – Orgrimmar transitions no longer dominate, as indicated in figure 11. The Barrens – Thunder Bluff – Feralas – Un’Goro Crater – Thousand Needles – Dustwallow Marsh – Tanaris – Orgrimmar – Stranglethorn Vale – Undercity – The Hinterlands are the new popular zones that players move among to complete missions and tasks.

Figure 12. Zone Transitions in Level 60 to 80

In levels 60 to 80, the transitions seen in earlier levels settle down, except for Orgrimmar, as indicated in figure 12. Shattrath City stands out in the center, and most transitions in these levels happen from and to Shattrath City in order to complete tasks. It is clear that the game designers deployed hub-and-spoke level design techniques. In the game world, Shattrath City is a major hub in Outland, situated in the northwestern portion of Terokkar Forest; it is the first capital available to both sides and is populated by ancient heroes and naaru.


IV. CONCLUSION

In this research, we use a whole year of WOW log data to investigate the whole picture of players’ interaction with the game world in terms of their decision making and progress, and to examine, via visual data mining, whether the overall game level design and core game mechanics are reasonably organized. Our analytical work offers game designers a way to improve existing games, as well as to design better games in the future, with real data that can be relied on directly. The data contain no player profiles; therefore, we cannot determine how players’ personal information, such as age, gender, or educational background, affects progress and behavior in WOW.


How Big Data Can Improve Cyber Security

Haydar Teymourlouei1, Lethia Jackson2
1 Department of Computer Science, Lincoln University, PA, USA
2 Department of Computer Science, Bowie State University, Bowie, MD, USA

Abstract - The research presented in this paper shows how Big Data can improve cybersecurity. Cybersecurity concerns affect organizations across virtually all industries, including retail, financial, communications and transportation. Big Data and analytics are some of the most effective defenses against cyber intrusions. Hackers continue to develop powerful offensives to defeat what used to be considered highly effective cyber defenses. With the resources available to hackers today, they can work around a defense-in-depth strategy to breach data systems. Tools and techniques now exist to handle the volume and complexity of today’s cyber-attacks, enabling enterprises to stay ahead of evolving threats. Big Data along with automated analysis brings network activity into clear focus to detect and stop threats, as well as shorten the time to remedy when attacks occur.

Keywords: Big Data, cyber threats, logs, data reduction method, network, security

1 Introduction

The ability to accumulate large amounts of data provides the opportunity to examine, observe, and notice irregularities in order to detect network issues. Better actionable security information reduces the critical time from detection to remediation, enabling cyber specialists to predict and prevent attacks without delay. Data is analyzed using algorithms that give organizations critical insight to help them improve their services. Big Data continues to be used on bigger platforms including financial services, health services, weather, politics, sports, science and research, automobiles, real estate, and now cybersecurity. An important way to monitor your network is to set up a Big Data analysis program. A common response to evolving attacks is to either add more security tools or increase the sensitivity of the security tools already in place. Big Data analysis is the process of examining large data sets to reveal hidden patterns, unknown correlations, market trends, customer preferences and other important information. To ingest all the data, it must first be filtered and aggregated, but it is difficult to decide what to filter out and what to keep. With the help of log monitoring tools and an advanced data reduction method we can defend against cyber threats. To help prevent cyber-attacks it is necessary to monitor our network.

2 What is Big Data?

Data is a collection of facts, such as values or measurements. Information is often the result of combining, comparing, and performing calculations on data. Big Data is high-volume, high-velocity and high-variety information assets [9]. Billions of bytes of data are collected through various mediums. Big Data demands cost-effective and innovative forms of information processing for enhanced insight and decision making. Storing and processing large amounts of data is a persistent challenge. Privacy and security can be compromised while storing, managing and analyzing large quantities of data [2]. When dealing with Big Data [5], it is necessary to maintain a well-balanced approach towards regulations and analytics. Data comes from multiple intrusion detection systems, as well as sensors in high-noise environments. Analytical tools and data management and examination techniques help us detect attacks. Big Data research and development is needed in academic, industrial and government research labs to develop solutions that defend and protect large data sets.

3 Data Management Techniques

Changing an organization’s operation from normal data usage to handling and maneuvering Big Data brings the difficulty of advanced persistent threats. Data privacy, integrity and trust policies should be examined within the context of Big Data security. Cybersecurity for Big Data operates on high volumes. Data can come from multiple intrusion detection systems as well as sensors in high-noise environments. As indicated by break-out group participants, denial of information attacks and the need to deal with malicious adversaries are threats to data privacy. Analytical tools and data management and examination techniques that integrate data from hosts, networks, social networks, bug reports, mobile devices, and the internet of things provide sensors to detect attacks. Hashing techniques and data technologies also play a major role in strengthening security systems. Data management and examination techniques such as biometric authentication [12] defend against cyber-attacks by addressing the security issues of protecting massive amounts of data. Many data communities are developing solutions for efficiently managing and analyzing large sets of data. Continuous Big Data research and development is needed in academic, industrial and government research labs to develop solutions that defend and protect large data sets, including cloud data management. Cloud data management includes malware detection, insider threat detection, intrusion detection, and spam filtering [8].

4 Managing the Cyber Threats


Government and industry authorities are now enforcing regulations on how companies combat cyber-attacks. With the help of the latest technology, companies are able to defend against cyber threats. Network Behavior Analysis (NBA) is an emerging technology that serves as a security tool to improve the current network status. NBA monitors traffic coming into and going out of the network to ensure that nothing is getting into hosts and applications. It was projected that by 2011 approximately twenty-five (25) percent of large security systems would use NBA. A major disadvantage of NBA is that it does not catch a security breach before it becomes a problem. There are many challenges when it comes to enforcing access and security control policies in Big Data storage environments [3], because in some instances each user requires fine-grained access control based on their specific job or responsibility [1]. Companies ensure that their employees are in compliance by keeping logs of user activity. Keeping track of user activity in real time helps control threats, as the data can be analyzed for unusual or suspicious activity and then treated accordingly. In addition, event management technology uses Big Data technologies and methods to combat attacks. While the growth of data has encouraged an increase in cyber threats, it concurrently helps manage and reduce their occurrence.

5 Different Methods to Prevent Cyber Threats

With the help of Big Data log analytics we can prevent cyber threats by monitoring data. When Big Data log analytics is combined with JIT (Just in Time) analysis, it collects information on the machines that have an open connection with locations outside the local network. It also predicts future attacks and gives information about previous attacks that might have occurred on the system. An IBM report shows that forty-six (46) percent of businesses are experiencing security breaches, which shows that the need to protect our information is very high. IBM developed a solution using Big Data that protects data from threats and fraud. IBM’s solution detects risk and intrusion while analyzing structured and unstructured data. QRadar performs real-time correlation, anomaly detection and reporting for immediate threat detection, and also sends enriched security data to IBM Big Data products, such as IBM InfoSphere BigInsights [8]. Moreover, Hadoop is a Java-based programming framework that helps to store, organize, process, and analyze data. The analytical ability of Hadoop allows CounterTack, a provider of advanced persistent threat (APT) detection, to analyze system-level information collected from any kind of data to detect intrusion or malicious behavior. CounterTack uses Cloudera, which is built on Hadoop, to protect against security threats. Hadoop allows us to work through the data, affording us complete access to the information. Apache Storm is a free, open-source real-time streaming analytics system. It is similar to Hadoop; however, it was introduced for real-time analytics. It is fast and scalable, supporting not only real-time analytics but machine learning too. Apache Storm reduces the number of false positives found during security monitoring. Apache Storm is commonly available in cloud solutions that support antivirus programs, in which Big Data is analyzed to uncover threats.

6 How Log Monitoring Can Prevent Cyber Threats

Monitoring data validates access to the systems, tracks logged-in activities, prevents breaches, and manages passwords efficiently. Surveillance is a major function of securing any network and the first deterrent against hacking or infiltration in general. Continuous surveillance deters the hacker from successfully infiltrating the system. Other prevention methods include regular scans, updated virus protection software and firewalls. Software such as Barracuda’s Firewall, McAfee’s Advanced Protection or Patrol is available that monitors and protects against security breaches. All operating systems come equipped with standard logging features, which are often overlooked. Windows-based operating systems have a built-in event logger that may be used to monitor the system for security. This tool gives administrators (with auditing on Windows Server) the ability to view events that range from login actions to attempts to access, delete or create a file, and much more. These log files keep track of events that are produced from hardware and software operations. They range from informational events to warnings and to critical errors. However, not all events are collected by default; important audit settings must be turned on in domain group policies and on the folder or device that contains valuable data in order to receive the events. Log files can alert an administrator that a significant file has been modified or deleted, and an unsuccessful attempt to change a file raises a red flag. Log monitoring usually consists of three tiers: log generation, log analysis, and log monitoring. Log generation involves the hosts making their logs accessible to the servers. Log analysis receives log data and converts it into a single standard readable format.
Log monitoring includes the console or software that handles monitoring and reviewing of log files. With log monitoring we can monitor system traffic to detect and prevent cyber-attacks. Additional log monitoring benefits include tracking suspicious behavior, meeting regulatory compliance, detecting and preventing unauthorized access, and monitoring user activity. There is also a class of tools called Security Information and Event Management (SIEM), which handles the generation of reports, alerts when an abnormality is present, manages incidents, and correlates and analyzes events.
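The three tiers described above can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the two input line formats are invented, and the function names and record fields are hypothetical rather than taken from any particular monitoring product.

```python
from datetime import datetime


def normalize(line):
    """Tier 2 (log analysis): convert either assumed source format
    into one standard record shape.

    Assumed format A: "2017-03-13 10:15:00 host1 WARNING message text"
    Assumed format B: "host2|2017-03-13T10:16:00|ERROR|message text"
    """
    if "|" in line:
        host, stamp, level, message = line.split("|", 3)
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    else:
        date, clock, host, level, message = line.split(" ", 4)
        when = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S")
    return {"time": when, "host": host, "level": level.upper(), "message": message}


def monitor(records, alert_levels=("WARNING", "ERROR")):
    """Tier 3 (log monitoring): keep the records an administrator should review."""
    return [record for record in records if record["level"] in alert_levels]


# Tier 1 (log generation): hosts make their raw lines available.
raw_lines = [
    "2017-03-13 10:15:00 host1 warning Failed login for admin",
    "host2|2017-03-13T10:16:00|ERROR|Audit log cleared",
    "2017-03-13 10:17:00 host1 INFO Backup finished",
]
alerts = monitor(normalize(line) for line in raw_lines)
```

Normalizing before monitoring means the alerting rule is written once against a single record shape, which is the point of the analysis tier.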

6.1 How to Prevent Cyber Attacks using Event Viewer

When a Windows server or machine has problems, the administrator will want to know what activity caused the problem. The Event Viewer allows the administrator to view that activity. Each activity, referred to as an event, is assigned a number. Certain events require more caution than others and warrant more attention to prevent an intrusion. For example, Event 4724 may represent an innocent password reset, or it may be someone trying to wreak havoc on your system. Similarly, Event 1102 represents someone clearing the security log, which may be someone trying to erase their tracks after an attempt to intrude into unauthorized territory. Windows 2008 and later versions contain built-in alerting for certain events. To prepare Windows to give an alert for certain events, perform the following steps.
Step 1: Open the Task Scheduler and create a basic task.
Step 2: Give the task a name, for example monitorEventxxx, and a brief description of the event.
Step 3: Set the trigger to fire when a specific event is logged.
Step 4: Specify which log to look at and the event ID number.
Step 5: Select the action (send an email, display a message, or start a program) to perform when the event is logged. Verify that the message is correct when completing this step.
When such an event is logged, Windows will perform the alert. From this point the administrator can determine whether the event requires more attention. This method helps prevent cyber threats and can provide real-time speed to thwart an intrusion in progress.
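The same watch-list idea can be applied to event records that have already been exported from the log. The sketch below is only an illustration: it does not call any Windows API, and the record shape (a dictionary with an `event_id` key) and the function name are assumptions. The two event IDs and their meanings are the ones given in the text.

```python
# Event IDs named in the text, with the meanings given there.
WATCHED_EVENTS = {
    4724: "password reset attempt",
    1102: "security log cleared",
}


def flag_events(events, watched=WATCHED_EVENTS):
    """Return (event, reason) pairs for events whose ID is on the watch list.

    events: iterable of dicts with at least an "event_id" key
    (a hypothetical export format, not a real Windows structure).
    """
    return [
        (event, watched[event["event_id"]])
        for event in events
        if event["event_id"] in watched
    ]
```

Each flagged pair still needs a human decision, as the text notes: the alert only says that the event happened, not whether it was innocent.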

6.2 How Data Compression and Reduction Method Could Help Against Cyber Threats

In many instances, investigators require only a portion of Big Data; however, data is delivered in voluminous amounts. This leaves researchers to bear the cost of transferring Big Data. The goal of compressing Big Data is achieved by reducing the data size.

Figure 1: Actual Plotted Big Data

It is clear that when attempting to plot large sets of Big Data, researchers’ ability to attain desirable results is hindered. Often times only a small portion of the data sets is of interest to the researcher.

The size of large datasets must be reduced in order to successfully identify anomalies. The algorithm offers an efficient data reduction method that effectively condenses the amount of data. This algorithm enables users to store the desired amount of data in a file and decrease the time in which observations are retrieved for processing by using a reduced standard deviation range that significantly minimizes the original data. View the diagram below for an outlook on how large Big Data is presented.

Figure 2: Condensed Version Plotted from Big Data

The following diagram reveals the detection of a real cyber threat. The data reduction method successfully located the threat and provided a severity range for the threat, from low to high, after reducing a large data set to a reasonable size. The reduction method effectively condensed the amount of data, as seen in Figure 2. The severity range of the cyber threat detected in this example is high. The algorithm enables users to compress Big Data by a percentage value. The program instructs the user to input the following:
• maximum and minimum threats
• desired percentage by which the Big Data should be reduced

The data in Table 2 includes many negative numbers. However, the program ignores numbers that indicate a low severity range, such as negative numbers (-1). Therefore, in this particular illustration, only a few data values are significant, based on the percentage that the program was instructed to apply. This critical reduction makes data easily retrievable and accessible. In addition, it provides researchers with the amount of data they request.
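The text gives the program's inputs (minimum and maximum threat values and a reduction percentage) but not the algorithm itself, so the following is only one plausible reading, offered as a sketch. It drops values outside the threat range, then discards the requested percentage of what remains, removing the values closest to the mean first. Distance from the mean stands in for the standard deviation range that the text mentions but does not define, and the function name and parameters are hypothetical.

```python
import statistics


def reduce_data(values, min_threat, max_threat, reduce_pct):
    """Reduce a data set to the values most likely to matter.

    1. Ignore values outside [min_threat, max_threat], for example the
       low-severity -1 entries mentioned in the text.
    2. Drop reduce_pct percent of the remaining values, discarding those
       closest to the mean first (an assumed ranking rule).
    """
    in_range = [v for v in values if min_threat <= v <= max_threat]
    if not in_range:
        return []
    mean = statistics.fmean(in_range)
    keep = max(1, round(len(in_range) * (100 - reduce_pct) / 100))
    ranked = sorted(in_range, key=lambda v: abs(v - mean), reverse=True)
    return ranked[:keep]
```

For instance, with a range of 0 to 10 and a 50 percent reduction, nine raw values that include three -1 entries shrink to the three in-range values that deviate most from the mean.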

7 Conclusions

Organizations need to stay current with the latest vulnerabilities to prevent known attacks. Big Data will quickly solve the problems of cybersecurity. The reality is that Big Data and analytics will allow companies to identify anomalies and advanced attack vectors. Cybersecurity requires risk management and the actionable intelligence that comes from Big Data analysis. While it is valuable to have tools that can analyze data, the key is to automate tasks so that the data is available more quickly and the analysis is sent to the right people on time. This will allow analysts to classify and categorize cyber threats without the long delays that could make the data irrelevant to the attack at hand. Big Data will also help analysts to visualize cyber attacks by taking the complexity from various data sources and simplifying the patterns into visualizations.

Table 1: Result Big Data Reduced with Cyber Threat degree

Table 1 presents a sample of typical data. The data displayed in Table 1 shows the results of Big Data reduced with the cyber threat degree range.

Nearly one million malware threats are released daily, and the estimated cost of cyber-crime runs into the billions. The importance of continuous vulnerability management should not be overlooked, because exploits and viruses are constantly evolving. Although cyber criminals will always pursue weaknesses in computer systems, the countermeasures put in place today will help deter attacks in the future. Big Data analytics has already produced positive results in its efforts to reduce cyber threats. Eighty-four (84) percent of Big Data users say their agency has successfully used Big Data analytics to thwart cybersecurity attacks, and 90 percent have seen a decline in security breaches [13].

8 References

[1] O’Brien, S. (2016, May 05). Challenges to Cyber Security & How Big Data Analytics Can Help. Retrieved March 13, 2017, from https://datameer-wpproduction-origin.datameer.com/company/datameerblog/challenges-to-cyber-security-and-how-big-dataanalytics-can-help/

[2] Big Data to fight cyber crime. (2015, August 17). Retrieved March 20, 2017, from https://www.promptcloud.com/big-data-to-fight-cybercrime/

[3] Garg, R. (2016, December 13). Proactive Measures Go a Long Way in Timely Prevention of Data Loss. Retrieved March 20, 2017, from http://zecurion.com/blog/

[4] Siegel, J. E. (1970, January 01). Data proxies, the cognitive layer, and application locality: enablers of cloud-connected vehicles and next-generation internet of things. Retrieved March 20, 2017, from https://dspace.mit.edu/handle/1721.1/104456

[5] Thuraisingham, B., "Big Data Analytics and Inference Control", Secure Data Provenance and Inference Control with Semantic Web, 2014.

[6] Gaddam, A. (2012). Securing Your Big Data Environment. Retrieved March 13, 2017, from http://www.blackhat.com/docs/us-15/materials/us-15Gaddam-Securing-Your-Big-Data-Environment-wp.pdf

[7] S. H. Ahn, N. U. Kim and T. M. Chung, "Big data analysis system concept for detecting unknown attacks," 16th International Conference on Advanced Communication Technology, Pyeongchang, 2014. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6778962&isnumber=6778899

[8] Data, data everywhere. (2010, February 27). Retrieved March 24, 2017, from http://www.economist.com/node/15557443

[9] What Is Big Data? - Gartner IT Glossary - Big Data. (2016, December 19). Retrieved March 15, 2017, from http://www.gartner.com/it-glossary/big-data

[10] Teymourlouei, H. (2012, November 30). An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams. Retrieved March 24, 2017, from https://eric.ed.gov/?id=ED553032

[11] Teymourlouei, H. (2015, July 28). Detecting and Preventing Information Security Breaches. Retrieved March 14, 2017, from http://worldcompproceedings.com/proc/p2016/SAM3565.pdf

[12] Bertino, E. (2014). Security with Privacy Opportunities and Challenges. Retrieved from https://www.cs.purdue.edu/homes/bertino/compsacpanel-14.pdf

[13] MeriTalk Study Shows 81 Percent of Federal Government Agencies Using Big Data Analytics to Cut Cybersecurity Breaches. (2016, August 29). Retrieved March 20, 2017, from http://www.businesswire.com/news/home/20160829005121/en/MeriTalk-Study-Shows-81-Percent-FederalGovernment


An “Intelli-Fog” Approach To Managing Real Time Actionable Data In IoT Applications

Ekpe Okorafor1 and Mubarak Adetunji Ojewale2
1, 2 Department of Computer Science, African University of Science and Technology, Abuja, Nigeria

Abstract - The Internet of Things (IoT) has generated a large amount of research interest across a wide variety of technical areas. These include research interests in the physical devices, communications between the devices, and relationships between them. One of the effects of ubiquitous sensors networked together into large ecosystems has been an enormous flow of data supporting a wide variety of applications. Data stream processing frameworks enable us to get real-time insights from the enormous data generated by IoT applications. Fog Computing has been proposed and implemented for IoT applications because it brings decision making closer to the devices and reduces action latency, among many other advantages. The low computation power typical of edge devices, however, makes them unsuitable for complex stream analytics. In this work, we propose a new “Intelli-Fog” approach to IoT data management that leverages mined historical intelligence from Cloud Big Data platforms and combines it with real-time actionable events from IoT devices at the edge of the network. IoT devices export data to Big Data platforms at intervals, and the platforms in turn communicate the information needed for real-time intelligent decision making. We also present use case application scenarios. This approach makes edge devices more intelligent in decision making without increasing action latency.

Keywords: IoT; Fog Computing; Real Time events

1 Introduction

Advances in sensor technology, communication capabilities and data analytics have resulted in a new world of novel opportunities. With improved technology, such as nanotechnology, manufacturers can now make sensors that are not only small enough to fit into almost anything, but also more intelligent. These sensors can now pass their sensing data effectively and in real time thanks to improvements in communication protocols between devices. In the era of Big Data and the Cloud, there are now also emerging tools for storing and processing the increasing amounts of data. These phenomena, combined with the need to gain insight from the data, have made the Internet of Things (IoT) a topic of interest among researchers in recent years. Simply put, IoT is the ability of people and “Things” to connect with anything, anywhere and at any time using any communication medium. “Things” here means connected devices of any form. It is estimated that by 2020 there will be 50 to 100 billion devices connected to the Internet [1]. These devices will generate an incredible amount of massively heterogeneous data. Such data, because of its size, the rate at which it is generated, and its heterogeneity, is referred to as “Big data”. Big data can be defined by three characteristics known as the 3Vs: volume, variety, and velocity, or sometimes by 5Vs, which add Value and Veracity [2], [3]. Big data, if well managed, can give us invaluable insights into the behavior of people and “things”, insights that can have a wide range of applications.

The potential for incorporating insights from IoT data into aspects of our daily lives is becoming a reality at a very fast pace. Acceptance and trust are also growing, as people have expressed willingness to apply IoT data analytics results in important decision-making domains, as important as stock market trading [4]. These developments create the need for efficient approaches to manage and make use of huge, fast-moving data streams in a way that extracts real-time value from them. Distributed processing frameworks such as Hadoop have been developed to manage large data, but not data streams. One major limitation of distributed platforms such as Hadoop is latency. They are still based on the traditional Store-Process-and-Forward approach, which makes them unsuitable for real-time processing and is at odds with the real-time demands of current and emerging application areas [5]. Store-and-forward approaches will also be unable to satisfy the latency requirements of IoT data because of the velocity and unstructured nature of the data. Stream processing frameworks like Apache Storm and Samza have been introduced to solve this problem. In stream processing, data from data sources are processed continuously as they arrive and do not need to be stored first. These stream processing frameworks usually leverage Cloud and Distributed Computing.
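The contrast between store-and-forward and stream processing can be seen in miniature with a running mean that is updated as each reading arrives, so no reading has to be stored first. This is a generic illustration in plain Python, not the API of Storm or Samza; the function name is an assumption made for the example.

```python
def running_mean(stream):
    """Yield the mean of all readings seen so far, one result per reading.

    Only a count and the current mean are kept in memory; the readings
    themselves are never stored.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean
```

Because the function is a generator, it works equally well on an unbounded source such as a socket or a sensor feed, and each result is available as soon as its reading has been seen.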
IoT data typically arrive as streams and so suit stream processing applications. Actionable events from IoT applications, however, have stricter latency requirements than Cloud stream processing frameworks can meet. Combining mined intelligence from sensor data with real-time events from IoT to make intelligent decisions in real time therefore requires a much faster approach, one that takes decision making closer to the sensing devices. To achieve this, we propose Fog Computing to complement Cloud data management infrastructures in capturing and reacting to real-time actionable events. We propose an “Intelli-Fog” layer in which mined intelligence from Big Data platforms is cached at the edge of the network and made available to the devices for real-time intelligent decision making. This information (the mined intelligence) is updated at intervals regular enough to keep it relevant for correct decision making.

2 Fog Computing and IoT

One can think of Cloud computing as a manufacturing company (an application) where all customers must buy goods (access services) from the factory (the cloud) alone. Fog computing would be the company deciding to open retail outlets at locations close to the customers, with the retail stores giving customers a level of access that is, if not identical, close to what they can obtain from the factory. The Fog Computing paradigm extends Cloud computing and services to the edge of the network, close to the devices. It is a hierarchical distributed platform for service delivery, providing computation, communication, and storage at a layer that is much closer to the devices than the cloud. It is also not “just” a trivial extension of Cloud computing: it has some distinctive characteristics that make it more than that. These characteristics, per Bonomi et al. [6], include edge location, location awareness, low latency, geo-distribution, mobility, and real-time interaction, among others. Location awareness, most especially, makes Fog computing more than just an extension of the Cloud. Fog Computing is described as a highly virtualized platform that provides compute, storage, and networking services between end devices and traditional Cloud Computing data centers, typically, but not exclusively, located at the edge of the network [6]. Fog computing is not meant to replace Cloud computing; rather, it is to complement it in applications where traditional Cloud computing may not be suitable. Examples of such applications include geo-distributed applications, large-scale distributed control systems, and applications with very strict latency requirements [7].

The early approach to managing IoT data is for the device to send data directly to the cloud over a communication medium in real time. Big data analytics is also done at the cloud layer, and decisions, where applicable, are communicated from the cloud back to the devices over the same medium. IoT applications are typically highly distributed and the velocity of data is high. This can cause congestion on the network, so communicating actionable events and decisions can be a challenge. An actionable event can get caught in a busy channel and be delayed along with other messages. This is a major contributing factor to action latency. Even when the channel is free, the distance between the cloud layer and the devices is usually too great. Processing actionable events at a layer closer to the devices was therefore suggested, and that introduced the Fog concept to IoT data management. The fog layer enables distributed processing at the edge of the network, very close to the devices. The Fog approach reduced latency significantly and brought processing close to the edge of the network. It is also sometimes used for dynamic real-time load balancing. Processing of geo-sensitive data is also easier, and data is aggregated before transmission to the cloud layer for further processing. However, the existing Fog approach does not provide a mechanism for making intelligent decisions in real time while taking mined intelligence from Big data analytics into consideration. Decisions are typically based on the actionable events alone and/or the real-time state of the system.

3 The Intelli-Fog Approach

The Intelli-Fog layer of the proposed approach differs from existing fog approaches because it can make more intelligent decisions without any significant increase in latency. This layer communicates both with the IoT devices and with the Cloud layer where data analytics is performed. Communication with devices involves taking in raw data and actionable events that are observed or reported by IoT devices, and communicating the decisions or actions taken back to the IoT devices. This is done in parallel with communication with the Cloud layer, which is two-way: aggregated data is sent to the cloud for storage and analytics, and mined intelligence from the historical data is sent back to the Intelli-Fog layer for real-time decision making. Other information necessary for decision making is also available at the Intelli-Fog layer. The proposed Intelli-Fog approach is shown in Figure 1.

Figure 1: The Proposed Intelli-Fog Approach
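The two communication paths of the Intelli-Fog layer can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation from the paper: the class name `IntelliFogLayer`, the per-device threshold used as "mined intelligence", the refresh interval and the mean-based aggregation are all assumptions.

```python
import time
from typing import Callable, Dict, List, Optional


class IntelliFogLayer:
    """Hypothetical edge node: caches mined intelligence and decides locally."""

    def __init__(self, fetch_intelligence: Callable[[], Dict[str, float]],
                 refresh_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_intelligence   # stands in for a call to the Cloud layer
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._cache: Dict[str, float] = {}
        self._fetched_at: Optional[float] = None
        self.outbox: List[dict] = []       # raw events awaiting aggregation

    def _intelligence(self) -> Dict[str, float]:
        """Return cached intelligence, refreshing it only when the interval has elapsed."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._refresh_seconds:
            self._cache = self._fetch()
            self._fetched_at = now
        return self._cache

    def handle_event(self, device_id: str, reading: float) -> str:
        """Decide at the edge using the live reading plus the cached threshold."""
        self.outbox.append({"device": device_id, "reading": reading})
        threshold = self._intelligence().get(device_id, float("inf"))
        return "act" if reading > threshold else "ignore"

    def flush_aggregate(self) -> Dict[str, float]:
        """Summarise buffered readings per device (mean) for upload to the Cloud."""
        grouped: Dict[str, List[float]] = {}
        for item in self.outbox:
            grouped.setdefault(item["device"], []).append(item["reading"])
        self.outbox.clear()
        return {dev: sum(vals) / len(vals) for dev, vals in grouped.items()}
```

In this sketch, the decision path never waits on the Cloud except when the cached intelligence is due for refresh, which mirrors the idea of updating the mined intelligence at regular intervals while reacting to events locally.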


4 Application Scenarios

4.1 Intelligent Patient Monitoring System

A patient monitoring machine will sound an alarm when it senses an abnormal reading in patient data, but the alarm cannot state the likely symptoms the health care officer should look out for immediately. With the proposed system, a patient monitoring system can be made more intelligent: it can analyze the patient's medical history, together with other available relevant medical records, to predict likely complications to watch out for. An Intelli-Fog layer can also go further and communicate the reading to the patient's doctor or a doctor on duty, along with the patient's necessary medical records, for speedy, accurate and efficient health care service.

Figure 2: Intelligent Patient Monitoring System with the Intelli-Fog Approach

4.2 Intelligent Traffic Monitoring

Smart Traffic Lights serve as the Intelli-Fog layer of a traffic management system. A Smart Traffic Light senses both vehicles and pedestrians on the road. It monitors the state of each lane and uses a Machine-to-Machine (M2M) communication protocol with other Smart Traffic Lights to get the state of neighboring lanes. It aggregates the data and then sends it to a central cloud layer for data analytics. The cloud layer sends back information on likely busy lanes, including busy periods and suggestions on how to pass cars in each lane. This information, together with the state of other lanes, is used by a Smart Traffic Light to pass cars on its lane.

5 Related Work

Some work has been done on IoT data management; in this section, we focus on work related to data, analytics and performance. Khan et al. [8] proposed a cloud-based data management and analytics service for a smart city application. They proposed an architecture that provides the basic components needed to build a cloud-based Big Data analytical service for smart city data. This architecture was implemented in both Hadoop and Apache Spark on the Bristol open data. The results show that Hadoop incurs more overhead than Spark, especially in job submission, and that Spark is more appropriate for the Bristol open data. The part of the open data analyzed concerns quality of life, measured by indicators such as crime rate, security, etc. Their work is an important contribution to smart city projects, a very popular application of IoT. It does not, however, consider real-time analytics.

Khodadadi et al. [9] proposed another cloud-based, data-centric framework for the development and deployment of IoT applications. The framework architecture manages data collection, filtering and load balancing from different sources, producing both structured and unstructured data. It was demonstrated with an application that computes tweet sentiments for the six biggest capital cities in Australia, built on top of the Aneka Cloud Application Platform. The framework focuses on data collection and on abstracting it from developers, and it is an important contribution towards a generic approach to IoT data collection management. It does not, however, deal with data analytics.

Zhu et al. [10] proposed a framework to make disparate and incompatible datasets usable, interoperable and valuable across the enterprise. This framework, though not specifically for IoT, makes data from disparate sources more “homogeneous” by proposing a Common Information Model (CIM) standard for information interchange between data sources. This approach requires data sources to use CIM adapters, a condition many IoT devices may not be willing or able to meet. The approach is also not concerned with performance, as its main aim is to make data more homogeneous at the source for easy analytics. It is also designed specifically for the utility industry and is not a general framework for Big Data analytics.

Abu-Elkheir et al. [11], having investigated the issues of IoT data management, introduced a model framework for IoT data management. This model provides layered stages of data management. The work proposed six layers of IoT data management: a “things” layer, a communication layer, a source/data layer, a federation layer, a query layer and an application layer. Their work also does not involve any specific handling of real-time processing and optimization issues. It focused mainly on how data is collected, stored and processed. The model is also focused on batch processing.

Cecchinel et al. [12] also developed architecture support for a data collection framework in IoT. This framework handles IoT Big Data issues such as sensor heterogeneity, data scalability and even reconfiguration capability. Its use was demonstrated on sample projects. The work, however, focuses on data collection and does not consider data processing. Also, the framework handles data aggregation at the bridge so that data arrive at the middleware in a unified format. Their approach will work well if the IoT data management application is location specific and data always arrive through the bridge. It may not work when data sources are widely dispersed and data come in directly through the Internet without passing through the bridge.

The Lambda architecture [13] was proposed by Nathan Marz. The architecture is generic for large-scale stream processing, offering real-time processing and batch processing concurrently. In the Lambda architecture, incoming data is dispatched to both the batch layer and the speed layer for processing. The batch layer manages the master dataset (an immutable, append-only set of raw data) and pre-computes the batch views. The serving layer indexes batch views for low-latency, ad hoc queries, and the speed layer deals with recent data and real-time analytics only. Incoming queries can typically be answered by merging the results from batch views and real-time views. The Lambda architecture has, however, received some criticism, especially for its complexity [14]. Others have criticized the redundancy of implementing almost identical processes in both layers. The Lambda architecture is also criticized for its one-way data flow and the immutability of data. Its shortcomings also include an inability to support responsive, event-oriented applications.

6 Conclusion and Future Work

In this work, we have proposed an “Intelli-Fog” layer that caches mined intelligence from Big data analytics and makes it available for real-time decision making at the edge of the network. This makes intelligent responses to actionable events faster and closer to the devices. Immediate decision making and Big data analytics can also take place concurrently in this new approach, with just a single data entry point and a very simple architecture. In the future, we want to demonstrate the approach by implementing a use case and benchmarking it against existing approaches.

7 References

[1] C. Perera, A. Zaslavsky, P. Christen and D. Georgakopoulos, “Context Aware Computing for The Internet of Things: A Survey,” Communications Surveys Tutorials, IEEE, vol. 16, no. 1, pp. 414-454, 2013.
[2] C. Eaton, D. Deroos, T. Deutsch, G. Lapis and P. Zikopoulos, “Understanding Big Data,” McGraw-Hill Companies, 2012.
[3] Y. Demchenko, “Defining the Big Data Architecture Framework (BDAF),” University of Amsterdam, Amsterdam, 2013.
[4] C. Perera, R. Ranjan, L. Wang, S. Khan and Y. Zomaya, “Big Data Privacy in the Internet of Things Era,” vol. 17, no. 3, May-June 2015, pp. 32-39.
[5] M. Shah, “Big Data and the Internet of Things,” Research and Technology Center - North America, Palo Alto, 2015.
[6] F. Bonomi, R. Milito, J. Zhu and S. Addepalli, “Fog Computing and Its Role in the Internet of Things,” in Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing (New York, NY: ACM), 2012.
[7] M. Pranali, “Review of Implementing Fog Computing,” International Journal of Research in Engineering and Technology (IJRET), vol. 04, no. 06, 2015.
[8] Z. Khan, A. Anjum, K. Soomro and M. Tahir, “Towards cloud based Big Data analytics for smart future cities,” Journal of Cloud Computing: Advances, Systems and Applications, 2015.
[9] F. Khodadadi, R. N. Calheiros and R. Buyya, “A data-centric framework for development and deployment of Internet of Things applications in clouds,” (ISSNIP) 2015 IEEE Tenth International Conference, 2015.
[10] J. Zhu, J. Baranowski and J. Shen, “A CIM-Based Framework for Utility Big Data Analytics,” 2014. Available at: http://www.powerinfo.us/publications/CIM_Based_Framework_for_Big_Data_Analytics.pdf
[11] M. Abu-Elkheir, M. Hayajneh and N. Abu Ali, “Data Management for the Internet of Things: Design Primitives and Solution,” Sensors, ISSN 1424-8220, 2013.
[12] C. Cecchinel, M. Jimenez, S. Mosser and M. Riveill, “An Architecture to Support the Collection of Big Data in the Internet of Things,” IEEE International Workshop on Ubiquitous Mobile Cloud, 2014.
[13] http://lambda-architecture.net/
[14] R. Hackathorn, “The BI Watch: Real-Time to Real Value,” DM Review, 2004.


Enhance Big Data Security in Cloud Using Access Control

Yong Wang
Department of Mathematics and Computer Science, Alcorn State University, Lorman, MS 39096

Ping Zhang
Department of Mathematics and Computer Science, Alcorn State University, Lorman, MS 39096

Abstract: In this paper, we review big data characteristics and challenges in the cloud. We propose to use access control in data encryption, data segregation, authentication and authorization, identity management, encrypted communication, and fine-grained access. We then use a stochastic process model to conduct security analyses of availability, mean time to security failure, and confidentiality failure.

Keywords: big data; security; cloud; access control

I. INTRODUCTION

Cloud computing [11] comprises two sections: the front end and the back end. They connect to each other through a computer network, typically the Internet. The front end is the user, or client. The back end is the “cloud” section of the system. The front end comprises the client's computer and the application required to access the cloud computing system. Not all cloud computing systems have the same user interface. Services such as web-based email programs are reached through web browsers such as Internet Explorer. Other systems use dedicated applications that provide network access to clients.

As the Internet has become popular, big data transactions have become a major concern in modern society. The data come from online business, audio and video, emails, search queries, health data, network traffic, mobile phone data, and many other sources [7]. The data are stored in databases and grow tremendously, becoming difficult to store, retrieve, analyze, and visualize using traditional database software and approaches. In 2012, The Human Face of Big Data was completed as a global project. The project collects, visualizes, and analyzes big data. In social networking, Facebook has 955 million monthly active accounts in various languages and 140 billion photos. Google supports many services, with 7.2 billion pages every day. Over the next decade, the amount of information managed by data centers is estimated to increase by 50 times, while the number of IT professionals will grow by only 1.5 times [8].

There are several kinds of cloud computing, classified by cloud location or by the service offered [9]. Based on the service that the cloud is offering, there are three kinds: IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service). Depending on cloud location, we can classify clouds as public, private, hybrid, and community clouds.

1.1. According to the NIST definition of cloud computing, the service models are [10]:

Software as a Service (SaaS). The service provided to the consumer is the use of the provider's applications running on a cloud infrastructure. The applications are accessible from different client devices through client interfaces such as a web browser or a program interface. The consumer does not manage or control the underlying cloud infrastructure, which includes network, servers, operating systems, and storage.

Platform as a Service (PaaS). The service provided to the consumer is the ability to deploy onto the cloud infrastructure consumer-created or acquired applications built using programming languages, libraries, services and tools. The consumer does not manage the cloud infrastructure, which includes network, servers, operating systems, and storage, but does have control over the deployed applications and their configurations.

Infrastructure as a Service (IaaS). The service provided to the consumer is the provisioning of processing, storage, networks, and other computing resources, on which the consumer can deploy and run software, including operating systems and application programs. The consumer does not manage the cloud infrastructure, but has control over storage, operating systems, and application software, and partial control over some network components.

1.2. Cloud deployment models

Public clouds: the most common kind of cloud. Multiple customers can access web applications and services over the network. Each individual customer has its own resources, which are managed by a third-party vendor. The third-party vendor hosts the cloud infrastructure for multiple customers from various data centers. The customer has no control over how the cloud is managed or what infrastructure is available to them.

Private clouds: cloud computing is conducted on a private network. Users get the benefits of cloud computing while retaining control over data management and the ability to specify security measures. This approach results in more user confidence. The disadvantage is high cost, because users must establish dedicated infrastructure for cloud computing.

Hybrid clouds: combine both public and private approaches within the same infrastructure, so users gain the advantages of both deployments. For instance, an organization can keep private information on its private cloud and use the public cloud for heavy traffic.

1.3. For the cloud, there are five key attributes:

Multitenancy (shared resources): Unlike previous computing paradigms, in which resources were dedicated to a single user, cloud computing relies on a business model in which resources are shared across the network.
Massive scalability: Cloud computing can scale to a large number of systems, and can also scale bandwidth and storage space.
Elasticity: Users can rapidly change their computing resources as needed and release resources for other users when they are no longer needed. This gives users a lot of flexibility.
Pay as you go: Users pay only for the resources they actually use and for the time they consume them.
Self-provisioning of resources: Users provision resources, such as systems and networks, themselves.

II. BIG DATA CHALLENGES

Big data refers to volumes of structured and unstructured data so large that they are difficult to process using conventional database and software techniques. Big data originated with web search companies, which had to query loosely structured, very large distributed data.

A. There are five attributes which describe big data [1].
1) Volume: many factors contribute to increasing volume.
These include credit card transaction data, video and audio data, and sensor data.
2) Variety: data come in different formats: conventional databases, text files, video, email communications, etc.
3) Velocity: data are produced at very high speed and must be processed at high speed.
4) Variability: data flows can be inconsistent, with periodic peaks [44].
5) Complexity: data come from different sources and need to be matched, cleaned and converted into specific formats before processing.

B. Based on Elham Abd AL et al. [2], data security in the cloud has the following objectives:

1) Data integrity: this concerns data correctness. The data can be changed only by authorized persons, and the system always provides the right data.
2) Data confidentiality: only authorized persons can obtain private information. For instance, medical doctors can access patient data.
3) Data availability: the system can provide useful data whenever access is required. Without availability, the system becomes useless.
4) Data authentication: the process of deciding whether someone or something is who or what it is declared to be. In general, authentication is performed before authorization.
5) Data storage and maintenance: users do not know where their data are located in the cloud environment; the data are stored dynamically on cloud servers. Data in the cloud may be exposed to loss or damage because of a bad environment or server failure.
a) Data breaches and hacks: data breach is an important risk in the cloud because of multi-tenancy.
b) Data separation: data must be maintained in isolation. Data security covers not only data encryption but also support for different data sharing policies. Resource allocation and memory management also need to be secured.

C. Data security in the cloud can be summarized into network level, user authentication, data level, and generic issues [44].
1) Network level: challenges at the network level relate to network protocols such as TCP/IP, network security, distributed communication algorithms, and distributed data.
2) Authentication level: the user authentication level handles encryption/decryption. Authentication needs to check administrative rights for the nodes and users, the authentication of different nodes, and application logs.
3) Data level: the challenges at the data level relate to data integrity and availability, specifically data protection and distributed data. Availability is key: without availability, network security loses its meaning.
4) Generic issues: these challenges concern traditional tools and the use of different technologies.

D. Based on a survey, there are ten top challenges in big data. These challenges involve three distinct issues: modeling, analysis, and implementation [12, 13].
Modeling: developing a threat model that covers most cyber security attack scenarios.
Analysis: finding tractable solutions based on the threat model.
Implementation: implementing the solution in the existing infrastructure.

Security and privacy are affected by the three V's of big data: velocity, volume, and variety. These considerations include variables such as large-scale cloud infrastructure, diversity of data sources, the streaming nature of data acquisition, and high-volume inter-cloud migrations.


The top ten challenges are classified into four categories: infrastructure security, data privacy, data management, and integrity and reactive security. Infrastructure security covers secure computation in distributed programming frameworks and security best practices for non-relational data stores. Data privacy covers scalable and composable privacy-preserving data mining and analytics, cryptographically enforced data-centric security, and granular access control. Data management covers secure data storage and transaction logs, granular audits, and data provenance. Integrity and reactive security covers end-point input validation/filtering and real-time security/compliance monitoring.

Distributed programming frameworks use parallelism in computation and storage to process large amounts of data. One popular example is MapReduce, which splits an input file into multiple chunks. In the first phase, a Mapper for each chunk reads the data, performs some computation, and outputs a list of key/value pairs. In the next phase, a Reducer combines the values belonging to each distinct key and produces the output.

Non-relational data stores, popularized by NoSQL databases, are still evolving with respect to security infrastructure. Solutions to NoSQL injection are still not mature. Each NoSQL database was developed to solve a different challenge posed by the analytics world.

Big data can be seen as troubling because it potentially enables invasions of privacy, decreased civil freedoms, invasive marketing, and increased state and corporate control. To protect the most sensitive private data, the data must be secured end to end and be accessible only to authorized persons. Data has to be encrypted based on access control policies. Research in this area, such as attribute-based encryption, has to be made more efficient and scalable.
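To make the Mapper and Reducer phases described earlier in this section concrete, the following is a minimal single-process sketch of the MapReduce pattern in Python, using word counting as the computation. It does not use any real MapReduce framework; the function names and the chunk size are illustrative assumptions, and the mappers run sequentially here rather than in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_into_chunks(lines: List[str], chunk_size: int) -> List[List[str]]:
    """Split the input into fixed-size chunks, one per Mapper."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def mapper(chunk: Iterable[str]) -> List[Tuple[str, int]]:
    """Read one chunk and emit a (word, 1) pair for every word in it."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the values that belong to each distinct key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values of one key into a single result."""
    return key, sum(values)


def map_reduce(lines: List[str], chunk_size: int = 2) -> Dict[str, int]:
    """Run the mappers chunk by chunk, then reduce each key (sequentially here)."""
    mapped = [pair for chunk in split_into_chunks(lines, chunk_size)
              for pair in mapper(chunk)]
    return dict(reducer(key, values) for key, values in group_by_key(mapped).items())
```

From a security point of view, each `mapper` call is the unit that an untrusted or compromised worker could tamper with, which is why secure computation in such frameworks is listed among the challenges.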
To ensure authentication, a cryptographically secure communication approach has to be widely implemented.

Granular access control: Access control is key to preventing access to data by those who do not have the right to it. The problem with coarse-grained access mechanisms is that data that could otherwise be shared is often placed in a more restrictive access category. Granular access control avoids this over-restriction.

Secure data storage and transaction logs: Data and transaction logs are stored in multi-tiered storage media. Manually moving data between tiers gives the IT manager direct control over the data. As data grows exponentially, scalability and availability requirements have driven the use of several storage tiers for big data.

Granular audits: With real-time security monitoring applications, we try to be notified when an attack takes place. To get to the bottom of a missed attack, we need audit information from the systems we observe.

Data provenance: Provenance metadata will grow in complexity because of the large provenance graphs generated by programming environments in big data applications. Analysis of such large provenance graphs can detect dependencies relevant to security and confidentiality.
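The contrast between coarse-grained and granular access control discussed above can be sketched as follows. This is a hypothetical illustration only: the three sensitivity levels, the `Record` type and the rule that a coarse-grained record inherits the label of its most sensitive field are assumptions, not part of any system cited here.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical sensitivity ordering; a higher number is more restricted.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class Record:
    """A record whose fields each carry their own access label."""
    values: Dict[str, str]
    labels: Dict[str, str]  # field name -> level name


def coarse_view(record: Record, clearance: str) -> Dict[str, str]:
    """Coarse-grained: the whole record takes the label of its most sensitive field."""
    record_level = max(LEVELS[label] for label in record.labels.values())
    return dict(record.values) if LEVELS[clearance] >= record_level else {}


def granular_view(record: Record, clearance: str) -> Dict[str, str]:
    """Granular: each field is released or withheld according to its own label."""
    return {name: value for name, value in record.values.items()
            if LEVELS[clearance] >= LEVELS[record.labels[name]]}
```

With a record that mixes a public field and a restricted field, a user with intermediate clearance sees nothing under the coarse-grained rule but still sees the public field under the granular rule, which is the over-restriction the text describes.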

End-point input validation: Many big data use cases require data collection from many sources, i.e., end-point devices. For example, a security information system may collect information from millions of hardware devices and software applications. A key challenge in data collection is input validation.

Real-time security/compliance monitoring: Real-time security monitoring has always been a challenge because of the number of alerts generated by devices. These alerts lead to many false positives, which are mostly ignored or simply discarded.

III. ENHANCE BIG DATA SECURITY USING ACCESS CONTROL

In the following, we propose to use access control to enhance data security in the cloud. Specifically, we use access control in data segregation, authentication and authorization, identity-based access control, data encryption, encrypted communication, and fine-grained access control.

3.1. Data segregation

Protecting data integrity, availability and confidentiality is one of the most challenging tasks in cloud computing. Customers' data are moved from dedicated storage to a shared environment run by different service providers, and may be stored in different countries with different policies [18]. There are several reasons for the security challenges in cloud computing:
1) Because of the cloud's dynamic scalability, it is difficult to isolate a specific resource affected by a security breach.
2) It is difficult to arrange a unified approach, since resources may be owned by various providers.
3) Because of the multi-tenancy of the cloud, in which resources are shared, user data may be accessed by unauthorized users.
4) The cloud deals with large amounts of information.

According to Subashini and Kavitha [5], multi-tenancy allows data of different users to reside at the same location. The risk of one user's data being intruded upon by another user increases in such an environment [6]. In reality, multi-tenancy is a balance between security and cost.
The more sharing, more decrease in cost and more increase in utilization. The share will post security risk dramatically [5]. In general, there are three data management approaches: 1) Separate database; 2) Shared database with separate schemas; 3) Shared database with shared schemes. 3.2. Authentication and Authorization In [26], the authors develop a credential classifications and basic structure for analyzing and providing solutions for credential managements that embrace strategies to evaluate the cloud complexity. The study provides a set of analysess for authentication and authorization for the cloud focusing infrastructural organizations. These organizations include

ISBN: 1-60132-448-0, CSREA Press ©

classification of credentials and adaptation of these credentials to the cloud context. The study also identifies important factors that need to be taken into consideration when presenting a solution for authorization and authentication; for example, the appropriate requirements, categories, and services are identified. In addition, a design model for multi-factor authentication in the cloud is developed in [27]. The model also provides an analysis of the potential security. Another authentication solution is developed in MILAMob [28]. FermiCloud [29] takes a different approach to authentication and authorization. It applies public key infrastructure (PKI) X.509 certificates for user identification and authentication. FermiCloud is built on OpenNebula, and a web interface is used for user management. To avoid the limitations of this approach, access control lists (ACLs) are used for authorization after successful authentication of users. The authors integrate a local credential mapping service [30]. Tang et al. [31] presented collaborative access control properties. These include centralized facilities, agility, homogeneity, and outsourcing trust. They developed an authorization as a service (AaaS) approach using a formalized multi-tenancy authorization system. The approach also supports administrative control over fine-grained trust models. Integrating trust with cryptographic role-based access control (RBAC) is another solution that supports trust in the cloud [32]. The authors use cryptographic RBAC to enforce authorization policies regarding the trustworthiness of roles, which are evaluated by the data owner. Sendor et al. [33] develop a user-centric approach for platform-level authorization of cloud services using the OAuth protocol. They allow services to act on behalf of users when interacting with other services, which avoids sharing usernames and passwords across services.

3.3. Identity management and access control
Identity management systems for access control in clouds are discussed in [34]. The authors also present an authorization system for cloud federation using Shibboleth. Shibboleth is an open-source implementation of the Security Assertion Markup Language (SAML) for single sign-on across different cloud approaches. This solution shows how organizations can outsource authentication and authorization to third-party solutions using identity management. Stihler et al. [35] also propose an integral federated identity management approach for cloud computing. A trust relationship between a given user and the SaaS domain is required so that SaaS users can access applications and resources. In a PaaS domain, an interceptor acts as a proxy that accepts users' requests and executes them. The interceptor interacts with the secure token service and requests the security token using the trust description. IBHMCC [36] is another solution that provides identity-based encryption (IBE) and identity-based signature (IBS) schemes. Relying on the IBE and IBS schemes, an identity-based authentication scheme for cloud computing has been developed. The approach depends on the identity-based

hierarchical model for the cloud with the corresponding encryption and signature schemes, without using certificates. Contrail [37] is another approach that enhances integration in heterogeneous clouds both vertically and horizontally. Vertical integration supports a unified platform for different types of resources, while horizontal integration abstracts the interaction models of different cloud providers. In [29], the researchers suggest a horizontal federation scheme as a requirement for vertical integration. The developed federation architecture contains several layers. E-ID authentication and uniform access to cloud storage service providers [38] is another approach to building identity management systems; it adapts Portuguese national e-identification (e-ID) cards for authentication to cloud storage. In this trial, the OAuth protocol is integrated with authorization for cloud users. The e-ID cards contain PKI certificates that are signed by their government departments. A certification authority is responsible for issuing and verifying e-ID cards. In [39], the authors study inter-cloud federation and the ICEMAN identity management architecture. ICEMAN integrates the identity life cycle, self-service, key management, and provisioning functions that are required in an appropriate inter-cloud identity management system.

3.4. Data encryption
If attackers gain access to the data, they can obtain sensitive information. In general, all data in the cloud should be encrypted, and different data should be encrypted with different keys. Without the specific decryption key, an attacker cannot access the sensitive data. In this way, attacker access to useful data in the cloud can be limited. Soofi, Khan, and Amin [14] surveyed how to enhance data security in the cloud. They found that encryption is the first choice (45%). A digital signature with the RSA algorithm is suggested to protect data security in the cloud. Software reduces the data document to a few lines using a hashing algorithm. These lines are called the message digest. The software then encrypts the message digest with the sender's private key to generate the digital signature. The receiver decrypts the digital signature back into the digest, using its own private key and the sender's public key, to obtain the information [15]. In [16], the RSA algorithm is used to encrypt the data. Bilinear Diffie-Hellman enhances security during key exchange. In the proposed method, a message header is added to the front of each data packet for direct communication between client and cloud without third-party server involvement. When a user sends a request to the cloud server for data storage, the cloud server generates a public key, a private key, and a user identification. Two tasks are performed at the user end before a file is sent to the cloud: first, the message header is added to the data; second, the data, including the message header, is encrypted using a specific secret key. When a user requests data from the cloud server, the server checks the received message header and extracts the Unique Identification for Server (SID). If the SID is found, the server replies to the user's request.
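The hash-then-sign flow summarized above for [15] can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of [15] itself: the hash function (SHA-256), the textbook-sized key (n = 61 × 53) and all function names are chosen here for illustration only, and a key this small offers no security.

```python
import hashlib

# Toy RSA key, for illustration only (assumption, not from [15]):
# n = 61 * 53 = 3233, phi = 3120, e * d = 17 * 2753 = 1 (mod 3120).
N_MODULUS = 3233
PUBLIC_EXPONENT = 17
PRIVATE_EXPONENT = 2753


def message_digest(document: bytes) -> int:
    """Hash the document and reduce it so it fits the toy modulus."""
    digest = hashlib.sha256(document).digest()
    return int.from_bytes(digest, "big") % N_MODULUS


def sign(document: bytes) -> int:
    """Encrypt the message digest with the private key to form the signature."""
    return pow(message_digest(document), PRIVATE_EXPONENT, N_MODULUS)


def verify(document: bytes, signature: int) -> bool:
    """Recover the digest with the public key and compare it with a fresh digest."""
    return pow(signature, PUBLIC_EXPONENT, N_MODULUS) == message_digest(document)


document = b"quarterly report stored in the cloud"
signature = sign(document)
forged_signature = (signature + 1) % N_MODULUS

is_valid = verify(document, signature)          # genuine signature
is_forged_valid = verify(document, forged_signature)  # altered signature
print(is_valid, is_forged_valid)
```

In practice a vetted cryptographic library with a proper padding scheme and a key of at least 2048 bits would replace the toy arithmetic; only the order of operations (digest, sign with the private key, verify with the public key) is the point here.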


Int'l Conf. on Advances in Big Data Analytics | ABDA'17 |

In [17], a technique is introduced to guarantee three security attributes: availability, integrity, and confidentiality. Data in cloud computing uses Secure Socket Layer (SSL) 128-bit encryption, which can be raised to 256-bit encryption. A user who wants to access data in the cloud must pass valid user identity and password checks before access is given to the encrypted data. In [18], the user sends data to the cloud; the cloud service provider then provides a key, encrypts the data using the RSA algorithm, and stores it in the cloud data center. When the user requests the data, the cloud provider checks the authenticity of the user and gives the data to the user, who can decrypt it by computing the private key. In [19], a three-layer data security approach is suggested. Each layer performs a different task to secure data in the cloud: the first layer is responsible for authentication, the second for cloud data encryption, and the third for data recovery when the cloud fails. In [20], the RC5 algorithm is implemented to secure the cloud. Data is transmitted in encrypted form, so even if the data is stolen, there is no corresponding key to decrypt it. In [21], Role Based Encryption (RBE) is developed to secure data in the cloud. A role-based access control (RBAC) cloud architecture is proposed that allows organizations to store data securely in the public cloud while keeping secret information about the organization's structure in a private cloud. In [22], four parties are discussed: data owners, the cloud server, data consumers, and N attribute authorities. The attribute authority sets are divided into N disjoint sets with respect to category. The data owner gets the public key from any one of the authorities and encrypts the data before sending it to the cloud server. When data is requested, the authorities create a private key and send it to the data consumer. The consumer can download the file only after being verified by the cloud server. In [24], a location-based encryption approach is introduced that uses the user's location and geographical position. A geographical encryption algorithm is applied in the cloud. The user's computer and data are labeled with the company name or the person who works in the organization. When data is required, the labels are searched and the information corresponding to the matching label is retrieved. In [25], a technique is proposed that combines a digital signature and Diffie-Hellman key exchange with the Advanced Encryption Standard (AES) encryption algorithm to provide confidentiality for data stored in the cloud.
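The Diffie-Hellman key exchange mentioned for [25] can be illustrated with a short sketch. The parameters below (prime 23, generator 5, private values 6 and 15) are a textbook-sized assumption chosen so the arithmetic can be followed by hand; they are not taken from [25], and the SHA-256 key derivation step is likewise only an assumed way of turning the shared secret into a symmetric key of the size AES would use.

```python
import hashlib

# Toy public parameters (assumption for illustration; far too small to be secure).
PRIME = 23
GENERATOR = 5


def public_value(private_value: int) -> int:
    """Value each party publishes: g^private mod p."""
    return pow(GENERATOR, private_value, PRIME)


def shared_secret(their_public: int, own_private: int) -> int:
    """Secret both parties can compute: (their public)^own_private mod p."""
    return pow(their_public, own_private, PRIME)


def derive_key(secret: int) -> bytes:
    """Stretch the shared secret into a 32-byte symmetric key (assumed KDF)."""
    return hashlib.sha256(str(secret).encode("ascii")).digest()


client_private, cloud_private = 6, 15          # kept secret by each side
client_public = public_value(client_private)   # sent to the cloud
cloud_public = public_value(cloud_private)     # sent to the client

client_secret = shared_secret(cloud_public, client_private)
cloud_secret = shared_secret(client_public, cloud_private)

client_key = derive_key(client_secret)
cloud_key = derive_key(cloud_secret)
print(client_public, cloud_public, client_secret, client_key == cloud_key)
```

Both sides arrive at the same key without ever transmitting it, which is why the exchange pairs naturally with a symmetric cipher for the bulk data.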

3.5. Use encrypted communication
Encrypted communication should be used whenever data is transferred. For instance, data can be compromised when FTP (File Transfer Protocol) is used, because FTP traffic is not encrypted and attackers can capture the information while it moves from one place to another. In contrast, SSL, SSH, or secure copy (scp) can be used to transfer data.

3.6. Fine-grained Access Control
Vormetric provides fine-grained, policy-based access controls that restrict access to encrypted data and allow only authorized access by processes and users who meet the requirements. Privileged users can read plaintext only if they are approved to do so. System update processes and administrators see the encrypted data, not the plaintext [13].

IV. SECURITY ANALYSIS

4.1. Availability and mean time to security failure
Wang et al. (2009) conducted a security analysis using stochastic processes. Specifically, they applied a multidimensional Markov process model [40, 41]. The Markov process has stationary transition probabilities if

$$ \Pr\{Y_{t+s} = j \mid Y_t = i\} = \Pr\{Y_s = j \mid Y_0 = i\}. \tag{1} $$

When a Markov process $Y = \{Y_t;\ t \ge 0\}$ has a finite state space $E$ and jump times $T_0, T_1, \ldots$, and the imbedded process at the jump times is denoted $X_0, X_1, \ldots$, there is a set of scalars $\lambda(i)$ for $i \in E$, called the mean sojourn rates, and a Markov matrix $P$ (the imbedded Markov matrix) that meet the following conditions:

$$ \Pr\{T_{n+1} - T_n \le t \mid X_n = i\} = 1 - e^{-\lambda(i)t}, \tag{2} $$

$$ \Pr\{X_{n+1} = j \mid X_n = i\} = p(i,j). \tag{3} $$

The analysis is summarized as follows:

I. Identify the irreducible sets in the Markov matrix $P$.

II. Reorder the matrix $P$ so that the irreducible, recurrent sets are at the top and the transient states are at the bottom of the reordered matrix $P'$.

III. Perform steady-state analysis for the irreducible sets using the following equations:

$$ \sum_{i} \pi_i P_{ij} = \pi_j \tag{4} $$

and

$$ \sum_{j} \pi_j = 1. \tag{5} $$

IV. For the transient states, compute $N_T = (I - Q)^{-1}$, where $I$ is the identity matrix and $Q$ is the submatrix of $P$ associated with the transient states. $N_T$ gives the number of visits of the Markov chain to a fixed state; its entries are written $N(i,j)$.

V. Compute the first passage probabilities

$$ F_T(i,j) = \begin{cases} 1 - \dfrac{1}{N(j,j)} & \text{if } i = j, \\[2mm] \dfrac{N(i,j)}{N(j,j)} & \text{if } i \neq j, \end{cases} \tag{6} $$

where $F_T(i,j)$ is the probability that the Markov chain eventually reaches state $j$ at least once from initial state $i$.

VI. The probability $f_k$ of moving from a transient state $i$ to the $k$th irreducible set, with sub-matrix $b_k$, can be calculated by

$$ f_k = (I - Q)^{-1} b_k. \tag{7} $$

The Markov process steady-state probability $p_j$ is related to $\pi_j$ (the steady-state probability of the imbedded Markov chain) by

$$ p_j = \frac{\pi_j / \lambda(j)}{\sum_{k \in E} \pi_k / \lambda(k)}. \tag{8} $$
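The transient-state quantities in steps IV to VI and the conversion in equation (8) can be checked numerically on a small chain. The sketch below is an illustration under assumptions: the two transient states, the two absorbing sets, and all probabilities and rates are invented for the example and are not the model of [40, 41]; the function names are likewise hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fundamental_matrix(Q):
    """Step IV: N_T = (I - Q)^-1 for the transient sub-matrix Q."""
    n = len(Q)
    return invert([[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)]
                   for i in range(n)])


def first_passage(N):
    """Step V, equation (6): first passage probabilities among transient states."""
    n = len(N)
    return [[1.0 - 1.0 / N[j][j] if i == j else N[i][j] / N[j][j]
             for j in range(n)] for i in range(n)]


def absorption_probabilities(N, b):
    """Step VI, equation (7): f = N_T * b, one column per irreducible set."""
    return [[sum(N[i][m] * b[m][k] for m in range(len(N)))
             for k in range(len(b[0]))] for i in range(len(N))]


def process_steady_state(pi, rates):
    """Equation (8): weight imbedded-chain probabilities by mean sojourn times."""
    weights = [p / r for p, r in zip(pi, rates)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical chain: two transient states, two absorbing sets (rows sum to 1).
Q = [[0.00, 0.50],
     [0.25, 0.00]]
b = [[0.30, 0.20],
     [0.25, 0.50]]

N = fundamental_matrix(Q)
F = first_passage(N)
f = absorption_probabilities(N, b)
p = process_steady_state([0.5, 0.5], [1.0, 3.0])
print(N, F, f, p)
```

For this example the chain started in the first transient state returns to it with probability 0.125 (it must move to the second state and come back), which is what equation (6) yields from the diagonal of the fundamental matrix.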

When different vulnerability attacks exist in both the web server and the database server, the system availability can be calculated as

$$ A = 1 - p(wsg, sqlf) - p(wsa, sqlf) - p(wsf, sqla) - p(wsf, sqlg) - p(wsf, sqlf), \tag{9} $$

where (wsg, sqlf), (wsa, sqlf), (wsf, sqlf), (wsf, sqla), and (wsf, sqlg) stand for the different security failure states described in the modeled system. As discussed in [40, 41], $P$ is the Markov matrix of one-step transition probabilities of the Markov chain. $P$ can be reorganized as

$$ P' = \begin{pmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ b_1 & b_2 & b_3 & \cdots & Q \end{pmatrix}, $$

where $b_k$ is a sub-matrix of the one-step probabilities of moving from transient state $i$ to the $k$th irreducible set. The sub-matrix $Q$ of the Markov matrix $P$ contains the one-step transition probabilities between the transient states. The mean time to security failure can be calculated using the following operation:

$$ N(i,j) = (I - Q)^{-1}(i,j), \tag{10} $$

where $N(1,j)$ is the average number of times state $j$ ($j \in E_t$) is visited before the Markov chain arrives at one of the absorbing states from the initial state. Given the mean sojourn time $T_j$ in state $j$, the mean time to security failure (MTTSF) can be computed by

$$ \mathrm{MTTSF} = \sum_{j \in E_t} N(1,j)\, T_j. \tag{11} $$

B. Confidentiality analysis
Confidentiality intrusion tolerance failure in DSDSS occurs when any $i \ge n$ distributed systems out of $n + k - 1$ lose their confidentiality [43, 44]. According to B. Madan et al., $P_{CF}(i)$ denotes the confidentiality failure probability of the $i$th system and $\bar{P}_{CF}(i) = 1 - P_{CF}(i)$. Therefore, (12) Since there are $n + k$ replicated copies of a file, loss of confidentiality of just one of them causes loss of confidentiality of the entire system [43, 44]. Therefore, the system confidentiality failure probability is much greater and is given by (13)

ACKNOWLEDGMENT
The authors would like to thank Dr. Bharat Rawal at Penn State University for his help.

REFERENCES

[1] Narasimha Inukollu, Sailaja Arsi, and Srinvasa Rao Ravuri, 2014, “Security Issues Associated With Big Data in Cloud Computing,” International Journal of Network Security & Its Applications, Vol. 6, No. 3, pp. 45-56.
[2] Elham Abd AL Latif AL Badawi and Ahmed Kayed, 2015, “Survey on Enhancing the Data Security of The Cloud Computing Environment By Using Data Segregation Technique,” IJRRAS, 23(2), pp. 136-143.
[3] Kalana, Parsi, and Sudha Singaraju, 2012, “Data security in Cloud Computing Using RSA Algorithm,” IJRCCT, Vol. 1, No. 4, pp. 143-146.
[4] Wang, Cong, Qian Wang, Kui Ren, and Wenjing Lou, 2010, “Privacy-preserving public auditing for data storage security in cloud computing,” in INFOCOM, Proceedings IEEE, pp. 1-9.
[5] Subashini and V. Kavitha, 2011, “A Survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, Vol. 34, No. 1, pp. 1-11.
[6] Almorsy, Mohamed, John Grundy, and Ingo Muller, 2010, An analysis of the cloud computing security problem, in Proceedings of APSEC 2010 Cloud Workshop, Sydney, Australia.
[7] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, 2012, Understanding big data: Analytics for Enterprise Class Hadoop and streaming data, McGraw-Hill Companies.
[8] Big data security, July 2012, Network Security, pp. 5-8.
[9] P. Mell and T. Grance, September 2011, The NIST Definition of cloud computing, National Institute of Standards and Technology, U.S. Department of Commerce, Special Publication 800-145.
[10] S. Carlin and K. Curran, 2011, Cloud computing security, International Journal of Ambient Computing and Intelligence, 3(1), pp. 14-19.
[11] J. Strickland, How cloud computing works, http://computer.howstuffworks.com/cloud-computing/clopudcomputing1.htm (accessed March 2017).
[12] E. Sayed, A. Ahmed, and R. A. Saeed, 2014, A survey of big cloud computing security, International Journal of Computer Science and Software Engineering, Vol. 3, No. 1, pp. 78-85.
[13] CSA Cloud Security Alliance, 2012 (November), Top ten big data security and privacy challenges.
[14] A. A. Soofi, M. I. Khan, and F. E. Amin, 2014, A review on data security in cloud computing, International Journal of Computer Applications, Vol. 94, No. 5, pp. 12-20.
[15] K. Vamsee and R. Sriram, 2011, Data security in cloud computing, Journal of Computer and Mathematical Science, Vol. 2, pp. 1-169.
[16] H. Shuai and X. Jianchuan, 2011, Ensuring data storage security through a novel third party auditor scheme in cloud computing, IEEE Conference on Cloud Computing and Intelligence Systems.
[17] S. K. Sood, 2012, A combined approach to ensure data security in cloud computing, Journal of Network and Computer Applications, 35(6), pp. 1831-1838.
[18] P. Kalpana and S. Singaraju, 2012, Data security in cloud computing using RSA Algorithm, International Journal of Research in Computer and Communication Technology, Vol. 1, Issue 4, pp. 1-7.
[19] E. M. Mohammed, H. S. Abdelkader, and S. El-Etriby, 2012, Enhanced data security model for cloud computing, International Conference on Information and Systems.
[20] J. Singh, B. Kumar, and A. Khatri, 2012, Improving store data security in cloud using RC5 algorithm.
[21] Z. Lan, V. Varadharajan, and M. Hitchens, 2013, Achieving Secure Role-based Access Control on Encrypted Data in Cloud Storage, IEEE Transactions on Information Forensics and Security, 8(12), pp. 1947-1960.
[22] J. Tacho, L. Xiang-Yang, Zhioguo Wang, and W. Meng, 2013, Privacy preserving cloud data access with multiple authorities, INFOCOM, Proceedings of IEEE, pp. 14-19.
[23] Y. Ching-Nung and L. Jia-bin, 2013, Protecting Data Privacy and Security for Cloud Computing Based on Secret Sharing, International Symposium on Biometrics and Security Technologies, pp. 1-7.
[24] M. S. Abolghasemi, M. M. Sefidab, and R. E. Atani, 2013, Using location based encryption to improve the security of data access in cloud computing, International Conference on Computing, Communications, and Informatics.
[25] P. Rewagad and Y. Pawar, 2013, Use of digital signature with Diffie Hellman key exchange and AES algorithm to enhance data security in cloud computing, International Conference on Communication Systems and Network Technologies.
[26] N. Mimura Gonzalez, M. Torrez Rojas, Maciel da Silva, F. Redigolo, T. Melo de Brito Carvalho, C. Miers, M. Naslund, and A. Ahmed, 2013, A Framework for authentication and authorization credential in cloud computing, 12th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 509-516.
[27] R. Banyal, P. Jain, and V. Jain, 2013, Multi-factor authentication framework for cloud computing, in Fifth International Conference on Computational Intelligence, Modeling and Simulation, pp. 105-110.
[28] R. Lomotey and R. Deters, 2013, SaaS authentication middleware for mobile consumers of IaaS cloud, IEEE Ninth World Congress on Services, pp. 448-455.
[29] H. Kim and S. Timm, 2014, X.509 Authentication and Authorization in FermiCloud, IEEE/ACM 7th International Conference on Utility and Cloud Computing, pp. 732-737.
[30] A. Gholami and E. Laure, 2015, Security and Privacy of Sensitive Data in Cloud Computing: A Survey of Recent Development, Computer Science and Information Technology.
[31] B. Tang, R. Sandhu, and Q. Li, 2013, Multi-tenancy Authorization Model for Collaborative Technologies and Systems, pp. 132-138.
[32] L. Zhou, V. Varadharajan, and M. Hitchens, 2013, Integrating trust with cryptographic role based access control for secure data storage, in Trust, Security, and Privacy in Computing and Communications, 12th IEEE International Conference on, pp. 560-569.
[33] J. Sendor, Y. Lehmann, G. Serme, and A. Santana de Oliveira, 2014, Platform level support for authorization in cloud services with OAuth2, IEEE International Conference on Cloud Engineering, pp. 458-465.
[34] M. A. Leandro, T. J. Nascimento, D. R. Dos Santos, and C. M. Westphall, 2012, Multitenancy authorization system with federated identity for cloud-based environments using Shibboleth, the 11th International Conference on Networking, pp. 88-93.
[35] M. Stihler, A. Santin, A. Marcon, and J. Fraga, 2012, Integral federated identity management for cloud computing, in New Technologies, Mobility and Security, 5th International Conference on, pp. 1-5.
[36] H. Li, Y. Dai, L. Tian, and H. Yang, 2009, Identity-based authentication for cloud computing, in Cloud Computing (M. Jaatun, G. Zhao, and C. Rong, eds.), Vol. 5931, Lecture Notes in Computer Science, pp. 157-166.
[37] E. Carlini, M. Coppola, P. Dazzi, L. Ricci, and G. Righetti, 2012, Cloud federation in Contrail, in Euro-Par 2011: Parallel Processing Workshops (M. Alexander, P. D'Ambra, A. Belloum, S. Scott, J. Cannataro, M. Danelutto, B. Di Martino, M. Gerndt, E. Jeannot, R. Namyst, J. Roman, S. Scott, J. Traff, G. Vallee, and J. Weidendorfer, eds.), Vol. 7155, Lecture Notes in Computer Science, pp. 159-168.
[38] J. Gouveia, P. Crocker, S. Melo De Sousa, and R. Azevedo, 2013, E-id authentication and uniform access to cloud storage service providers, in Cloud Computing Technology and Science, IEEE 5th Conference on, Vol. 1, pp. 487-492.
[39] G. Dreo, M. Golling, W. Hommel, and F. Tietze, 2013, ICEMAN: An architecture for secure federated inter-cloud identity management, in Integrated Network Management, IFIP/IEEE International Symposium on, pp. 1207-1210.
[40] Y. Wang, W. M. Lively, and D. B. Simmons, 2009, Software security analysis and assessment for web-based applications, Special Issue of Journal of Computational Methods in Science and Engineering, pp. 179-190.
[41] R. M. Feldman and C. Valdez-Flores, 2006, Applied probability and stochastic processes, 2nd Edition, PWS Publishing Company, St. Paul, MN.
[42] S. Ross, 1996, Stochastic Processes, 2nd edition, John Wiley & Sons Inc.
[43] Wang, Dazhi, Bharat B. Madan, and Kishor S. Trivedi, 2003, Security analysis of SITAR intrusion tolerance system, in Proceedings of the 2003 ACM Workshop on Survivable and Self-regenerative Systems: in association with 10th ACM Conference on Computer and Communications Security, pp. 23-32.
[44] Bharat S. Rawal, Harsha K. Kalutarage, S. Sree Vivek and Kamlendu Pandey, “The Disintegration Protocol: An Ultimate Technique for Cloud Data Security,” 2016 IEEE International Conference on Smart Cloud,


Voice Recognition Research

Francisco Capuzo¹, Lucas Santos², Maria Reis³ and Thiago Coutinho⁴
¹²³⁴ Lab245 Research Center, Lab245, Rio de Janeiro, Rio de Janeiro, Brazil

Abstract - The main objective of the software is to create a database of voice recordings, with an interface that can store and analyze audio data. With it, it is possible to gather and identify a range of information, for example the voice tone and the peak frequency. Our studies were based on data collection and analysis. We therefore built a structure holding the information that was most relevant to the word comparison and identification processes. Our final intention is to expand this database in order to increase the number of users and improve the accuracy of the software.

Keywords - Voice, Frequency, Studies, Samples, Word.

1. Introduction

The voice is a communication tool that is exclusive to humans. Talking is our main way to express feelings and exchange information in everyday life. The voice has characteristics that vary with gender, age, and individual traits, and it can also reflect a person's emotional state. It carries the identity of each individual, much like a fingerprint. Our first idea was to create an app that could transcribe speech into words in real time. Because voice recognition is a very complex process, we started by converting the voice into an amplitude spectrum in a time domain graph, in which the ordinate corresponds to the signal amplitude and the abscissa to time. We then realized that the most important step was to transform the amplitude spectrum into a frequency domain graph, which gave us more information than the time domain one. Its abscissa is built from frequency values, so it is possible to analyze, for example, whether the voice is low or high. We have an interface that performs this conversion, so we can obtain a reliable frequency signature (see Picture-1).

Picture-1-Amplitude Spectrum/Frequency Domain.

The graph construction is another useful application. It shows information about the voice, such as where the largest and the smallest signatures are.

2. Development

The first step in the development process was to write code that converts the recorded voice into the frequency domain. Initially, two clicks were necessary to complete a recording: one to start and another to finish. After some tests, we changed that. Mainly because of human error (timing and misclicks), it is easier to give the user a predetermined time (two seconds, in this case) to make the recording right after clicking the Start button, as shown in Picture 2.1:

Picture-2.1-Recording Interface

For a new recording, the user must reset the recording process by clicking the Reset button. The collected data is inserted into a database. For data analysis, the software has an algorithm that transforms the whole recording into numbers, commas, and hashes. To make clear where each recording starts and ends, a “#” is placed between recordings every time a sample is captured. An example is shown in Picture 2.2.
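The storage format described above, numbers separated by commas with a “#” between recordings, can be sketched as a pair of helper functions. The exact layout used by the software is not given in the text, so the integer values, the absence of a trailing “#”, and the function names below are assumptions made for illustration.

```python
def serialize(recordings):
    """Join each recording's values with commas and separate recordings with '#'."""
    return "#".join(",".join(str(value) for value in recording)
                    for recording in recordings)


def parse(text):
    """Split a stored string back into one list of integer values per recording."""
    return [[int(value) for value in chunk.split(",")]
            for chunk in text.split("#") if chunk]


# Two hypothetical recordings of three frequency values each.
recordings = [[12, 15, 9], [11, 14, 10]]
stored = serialize(recordings)
restored = parse(stored)
print(stored)
print(restored)
```

Skipping empty chunks in `parse` means a string that ends with a “#” is read the same way as one that does not, which keeps the sketch tolerant of either convention.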

Picture-2.2-Samples Examples.

It's important to know that, during these two seconds, the frequency is stored continuously until the recording process ends. The frequency therefore changes along the sample vector, and each vector position has a maximum and a minimum. In this way, we can analyze the entire recording process. In the next step (Picture-2.3), the user must enter a login and password. After that, the frequency is automatically stored in a page interface. To finish the process, the user must fill in all the fields (which gather the data previously mentioned: word, gender, and computer) and click the “cadastrar” (register) button:

Picture-2.3-Fill in the data interface.

After all this data has been stored in a database, it is possible to analyze the vectors of a recording according to the gender of the individual or the computer used, or even to take the data samples per person. When necessary, the database can perform calculations with these signatures, for example calculating the mean of all vectors of the same word. There was also an idea to plot these vectors in the shape of a matrix. The vector number is represented by the column, and the frequency values in the vector represent the rows, three hundred forty rows. Most interestingly, each value range has a different color, as can be seen in Picture 2.4:

Picture-2.4-Voice Graphic.

In this picture, white represents the highest frequency values and black the lowest. A large concentration of white at the beginning indicates a predominantly low voice; when it occurs at the end, it indicates a predominantly high voice.
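How a frequency signature of the kind plotted here might be obtained from a recording can be sketched with a plain discrete Fourier transform. This is a minimal illustration, not the software's actual code: the sample rate, the synthetic 120 Hz tone standing in for a voice, and the function names are assumptions, and the signal is kept to a quarter of a second (rather than the two seconds used in the recordings) so that the naive transform runs quickly.

```python
import cmath
import math


def magnitude_spectrum(samples, sample_rate):
    """Return (frequency in Hz, magnitude) pairs for the non-negative DFT bins."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coefficient = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                          for t, x in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(coefficient)))
    return spectrum


def dominant_frequency(samples, sample_rate):
    """Frequency of the strongest bin, ignoring the constant (0 Hz) component."""
    spectrum = magnitude_spectrum(samples, sample_rate)
    return max(spectrum[1:], key=lambda pair: pair[1])[0]


SAMPLE_RATE = 800   # samples per second (assumed)
DURATION = 0.25     # seconds (shortened for the sketch)
TONE_HZ = 120       # synthetic stand-in for a voice

samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE)
           for t in range(int(SAMPLE_RATE * DURATION))]
peak = dominant_frequency(samples, SAMPLE_RATE)
print(peak)
```

A real recording would normally be processed with a fast Fourier transform routine; the direct sum is used here only to keep the example free of external dependencies.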

3. Conclusion

Throughout the development of this article, we went through several stages, improving at each step, from the initial moment in which we modified the frequency domain graphics to the final step in which we plotted a graphic. Even so, the initial purpose of the software has always been to store frequency signatures in a reliable database. This would be the starting point for other applications that could be used all over the world. One is a word recognizer that would help people who want to speed up a job, or a person with a physical disability or injury who cannot type or otherwise has difficulty writing a text; the reverse procedure would help a visually impaired person to hear a text. Since the database is able to perform mathematical operations, a unique identity can be created for each word and used, for example, as a key to unlock cell phones and computers. When the standard deviation of the words is analyzed, a peculiar variation in some sample sets may be related to an emotional state, making the software a useful tool for psychologists and pedagogues who want to recognize the emotional state of a patient. It could also be useful for energy distribution companies: the software could monitor the frequency in use and its lag, and thereby quickly find technical failures. We believe this study is highly relevant and can provide a structure for many other discoveries. The purpose is therefore to expand this tool and the number of recorded words. As a result, there will be a larger database of samples, giving greater reliability to an application that can be useful to a greater number of people.



SESSION BIG DATA ANALYTICS AND APPLICATIONS Chair(s) TBA


Visualization Analysis of Shakespeare Based on Big Data
Ran Congjing1, Li Xinlai1, and Huang Haiying1
Center for Studies of Information Resources, Wuhan University, Wuhan, China
School of Information Management, Wuhan University, Wuhan, China

Abstract - For more than four centuries, Shakespeare's works have had great influence on literary creation around the world. Analyzing and mining the strength, distribution, and hotspots of Shakespeare research helps to understand the current situation and features of the field and can provide references for related research. Based on the theory of big data, information metrology, and co-word analysis, this paper makes a visual analysis of the Shakespeare research literature in the WOS database. According to the number of documents, the number of citations, and the average number of citations, the core authors, core institutions, distribution of journals, and research hotspots are analyzed. Data and analysis are provided for Shakespeare-related research worldwide from the perspective of knowledge maps.

Keywords: Visualization Analysis; Shakespeare Research; VOSviewer; Co-word Analysis

1 Introduction

Shakespeare is well known throughout the world. As his rival, the contemporary dramatist Ben Jonson, said, Shakespeare does not belong to an era but to all eras. Shakespeare research has continued without interruption, in praise or in criticism, and has shown us a colorful Shakespeare along the way. On the stage, Shakespeare's dramas continue to be interpreted in different forms and languages in different countries and regions. In the second decade of the 21st century, the study of Shakespeare has continued to flourish and shows diverse innovation in academic circles. Shakespeare research has gone through 400 years and experienced changing literary trends such as Neoclassicism, the Enlightenment, Romanticism, Realism, and Postmodernism. It therefore covers a wide range of fields, including history, aesthetics, sociology, ethics, philosophy, anthropology, psychology, ecology, film, and television art. For now, however, scholarly research on Shakespeare is mainly based on qualitative research and comparative analysis, and very few articles use visualization. To further study the dynamics of Shakespeare research in the world, this paper adopts a mixed quantitative and qualitative method, based on the theory and technology of big data analysis, and takes as its object the Shakespeare research literature in the WOS database from 1900 to 2016. Through visualization analysis of the amount of literature, the distribution of documents, the citations, and the keywords of the literature, this paper explores the general situation and research hotspots of Shakespeare research abroad and presents a new research method and references for scholars.

2 Methods

2.1 Data Sources

The data in this paper is taken from the Web of Science (WOS) database. In the WOS default core collections, the word “Shakespeare” and the names of Shakespeare's works are used to build the advanced search strategy (appendix 1), with a time span from 1900 to 2016. The retrieval was made on January 3, 2017 (data update date January 2, 2017), and a total of 35927 records were obtained. Irrelevant records were removed manually according to the elimination standard, leaving 28189 effective records, including full records and cited references.

2.2 Research Methods

Co-word analysis is one of the important methods of bibliometric analysis and content analysis. It counts the number (or intensity) of word pairs that occur together in documents, which reflects how closely the words are related and the relationships between subjects and themes. Based on the close relationships between word pairs in the keyword network, this paper analyzes the research hotspots and research trends of the Shakespeare research field using co-word analysis. Data visualization translates data into static or dynamic images or graphics using the basic principles of computer science and graphics, and allows users to interact with and control data extraction and display, so that hidden knowledge can be dug out and new rules discovered. This article mainly uses VOSviewer's visualization technology to draw the knowledge maps. VOSviewer is a software tool for constructing and visualizing bibliometric networks. These networks may, for instance, include journals, researchers, or individual publications, and they can be constructed based on co-citation, bibliographic coupling, or co-authorship relations. VOSviewer also offers text mining functionality that can be used to construct and visualize co-occurrence networks of important terms extracted from a body of scientific literature.
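The counting step behind co-word analysis can be sketched in a few lines. This is only an illustration of the general technique, not the processing performed inside VOSviewer: the three keyword lists are invented sample data, and the function name and the lower-casing of keywords are assumptions.

```python
from collections import Counter
from itertools import combinations


def coword_counts(keyword_lists):
    """Count how many documents each unordered keyword pair appears in together."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalize and de-duplicate so a pair is counted once per document.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical keyword lists for three documents.
documents = [
    ["Hamlet", "tragedy", "performance"],
    ["Hamlet", "tragedy", "film"],
    ["sonnets", "performance", "Hamlet"],
]

counts = coword_counts(documents)
for pair, frequency in counts.most_common(3):
    print(pair, frequency)
```

Pairs with high counts become the strongly linked nodes of the keyword network; sorting each document's keywords before pairing ensures that ("hamlet", "tragedy") and ("tragedy", "hamlet") are treated as the same pair.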

ISBN: 1-60132-448-0, CSREA Press ©

32

Int'l Conf. on Advances in Big Data Analytics | ABDA'17 |

3 Analysis and Results

3.1 Analysis of Authors’ Influence

Influential authors shape the development trend of a discipline and epitomize its research activities. Analyzing authors makes it easier to grasp the breadth and depth of a subject's development, and it benefits the management, organization and coordination of research activities. An author's academic influence can be measured by the number of documents and the citation frequency. This paper analyzes authors' influence mainly from the perspective of the amount of documents and the average citations. With the relevant parameters set in VOSviewer, we analyze author cooperation in the Shakespeare study literature from 1900 to 2016, which involves a total of 11291 authors, of whom 430 have more than 5 documents and more than 5 citations. The visualization map (the software automatically filters out “Anonymous”) is shown in Fig. 1; the deeper the red, the more the documents. It shows that the top five authors are SMITH, PJ; WELLS, S; BERRY, R; WARREN, R; and DUNCAN-JONES, K. These researchers are more active and have made more prominent contributions to Shakespeare study.

Fig. 1 Literature Volume of Authors in Shakespeare Research 1900 - 2016

The amount of documents is an important measure of an author's academic level and research ability, and the core scholars of Shakespeare study can be determined from their document volume. Authors who have gone deeper into the study of Shakespeare are its academic leaders. This paper determines the core scholars with the formula proposed by Price [2] of Yale University, $N = 0.749\sqrt{n_{\max}}$, where $N$ is the threshold number of papers and $n_{\max}$ is the number of papers of the most productive author in the statistics period. Only an author who has published at least $N$ papers is called a core author. Substituting $n_{\max} = 225$ into the formula gives $N \approx 11$ (articles), so authors who published 11 or more papers belong to the core scholars of Shakespeare study. According to the statistics, there are 413 core scholars of Shakespeare study; the top 20 are listed in Table 1.

Table 1 Documents and Citations of Authors 1900 - 2016 (top 20)

NUMBER  AUTHOR           DOCUMENTS  CITATIONS
1       Smith, PJ        225        17
2       Duncan-Jones, K  145        56
3       WELLS,S          120        51
4       BERRY,R          119        29
5       WARREN,R         109        12
6       PEARCE,GM        95         0
7       MAGUIN,JM        94         4
8       Duncan-Jones, K  92         26
9       MEHL,D           87         2
10      Vickers, B       82         64
11      Vickers, B       82         64
12      JACKSON,R        81         17
13      BERKOWITZ,GM     78         9
14      Hadfield, A      77         41
15      WHITE,RS         77         16
16      HONIGMANN,EAJ    71         21
17      POTTER,L         70         9
18      WILDS,L          70         8
19      JACKSON,MP       68         90
20      HUNT,M           68         119

The quality of a scholar's research papers can be evaluated by average citations. VOSviewer can draw a visualization map (Fig. 2) of the average citations of the above 413 authors by setting citations as weights and average citations as scores. The size of a node represents the amount of citations, the deeper the red the higher the average citations, and the connections between nodes represent cooperation between authors. From the relevant parameters in the graph, the author with the highest average citations is STALLYBRASS, P, with an average of 14.50 citations, followed by BERGER, H; BARTELS, EC; RABKIN, N; and NEWMAN, K (the averages reported for the following authors are 10.11, 10, 9.6, 8.3 and 8). These authors have few documents but high quality; they are comparatively authoritative in the study of Shakespeare and also make a great contribution to Shakespeare research. Fig. 2 shows no obvious cooperative group, which indicates that the study of Shakespeare is mainly carried out by individuals and that there is currently no core group of scholars.


Fig. 2 Average Citations of Authors in Shakespeare Study 1900 - 2016
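As a check on the core-author threshold used in Section 3.1, the sketch below evaluates Price's formula for the most productive author in Table 1 (225 documents). The coefficient 0.749 is the one commonly used with Price's law and is an assumption here, since the formula itself is not legible in the text; the function name `price_core_threshold` is hypothetical.

```python
import math

def price_core_threshold(max_papers, coefficient=0.749):
    """Price's threshold: coefficient * sqrt(papers of the most productive author)."""
    return coefficient * math.sqrt(max_papers)

# 225 is the document count of the top author in Table 1.
threshold = price_core_threshold(225)
core_minimum = round(threshold)

print(f"threshold = {threshold:.3f}, core authors need >= {core_minimum} papers")
```

Under this assumption the threshold rounds to 11 papers, which agrees with the value N = 11 stated in the text.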

3.2 Analysis of Organizations’ Cooperation

VOSviewer can generate network maps of organizations. The data contain a total of 2160 institutions, of which 475 have more than 5 documents. The institutional cooperation of these organizations is visualized in Fig. 3, in which each node stands for an organization and the size of a node indicates the number of documents issued by that organization. An organization's influence can also be measured by its amount of documents and citations. This paper reveals the influential organizations in the field of Shakespeare research from two aspects: published documents and average citations. The statistics in Table 2 show that the University of Toronto issued the most papers in Shakespeare study, 120 in total, followed by the University of Birmingham with 105 documents. Universities dominate this table, which indicates that academic organizations are the core force of Shakespeare research.

Fig. 3 Network Map of Organizations in Shakespeare Research 1900 - 2016

Table 2 Documents and Citations of Organizations in Shakespeare Research 1900 - 2016 (documents > 50)

NUMBER  ORGANIZATION            DOCUMENTS  CITATIONS  TOTAL LINK STRENGTH
1       UNIV TORONTO            120        100        3
2       UNIV BIRMINGHAM         105        62         7
3       UNIV LONDON             85         80         8
4       UNIV N CAROLINA         74         23         64
5       UNIV WARWICK            73         14         3
6       CUNY                    73         43         2
7       KINGS COLL LONDON       68         28         7
8       UNIV CALIF LOS ANGELES  68         182        2
9       UNIV PENN               66         170        1
10      UNIV OXFORD             65         54         11
11      UNIV CALIF BERKELEY     61         114        0
12      UNIV ILLINOIS           60         33         0
13      UNIV CAMBRIDGE          59         52         9
14      SHEFFIELD HALLAM UNIV   56         16         1
15      UNIV GEORGIA            55         39         65
16      UNIV CHICAGO            55         42         7
17      UNIV CAPE TOWN          53         25         5
18      HARVARD UNIV            51         108        5
19      UNIV SUSSEX             51         30         2
20      UNIV LEEDS              51         6          1
21      UNIV MARYLAND           51         35         1

The visualization in Fig. 4 covers the average citations of the 258 organizations with more than 5 documents and more than 5 citations, with citations set as the weight. The size of a node represents the amount of citations, the deeper the red the greater the average citations, and the connections between nodes represent cooperation between organizations. The map shows that the University of Lund has the highest average citations, 20.8. This organization has only 5 documents, but its high average number of citations indicates that the quality of its published literature is very high. It is followed by the University of Bonn and Stanford University, with average citations of 11.67 and 10.38 respectively. The index of average citations thus reveals institutions whose papers are of high quality even though their amount of documents is low. As can be seen in Fig. 4, there is little cooperation between most institutions. There is, however, a cluster of 47 universities, including Stanford University, Boston University, Columbia University, the University of Toronto, George Washington University, the University of Georgia, MIT, Harvard University, the University of Notre Dame, the University of Oxford and the University of California, San Diego. This shows that universities are the main organizations cooperating in Shakespeare research and that a research group is currently forming.

Fig. 4 Average Citations of Organizations in Shakespeare Research 1900 - 2016
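The average-citation index used for Fig. 4 is simply citations divided by documents. The following sketch, under the assumption that this plain ratio is what VOSviewer reports as the score, ranks a few rows taken from Table 2; the variable names are hypothetical.

```python
# (organization, documents, citations) for four rows of Table 2
rows = [
    ("UNIV TORONTO", 120, 100),
    ("UNIV CALIF LOS ANGELES", 68, 182),
    ("UNIV PENN", 66, 170),
    ("HARVARD UNIV", 51, 108),
]

def average_citations(documents, citations):
    """Average number of citations per document."""
    return citations / documents

# Sort by average citations, highest first
ranked = sorted(
    ((name, average_citations(docs, cites)) for name, docs, cites in rows),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Among these four rows the organization with the most documents has the lowest average, which illustrates why the paper treats document count and average citations as separate indicators.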

3.3 Analysis of Journals’ Distribution

Statistical and visual analysis of the sources of the Shakespeare research documents shows that the number of journals is very large: 1808 journals in total, of which 494 have more than 5 documents. Counting these 494 source journals shows that most of them in fact focus on the field of literature. The journals are ranked by the number of published articles in Table 3. The top six journals published a total of 9297 related documents, accounting for 33% of the total amount of documents. According to Bradford's law [3], these six journals form the core zone: SHAKESPEARE QUARTERLY, TLS-THE TIMES LITERARY SUPPLEMENT, CAHIERS ELISABETHAINS, NOTES AND QUERIES, RENAISSANCE QUARTERLY and REVIEW OF ENGLISH STUDIES. This indicates that the journals are widely distributed but the research is relatively concentrated in the field of literature.

Table 3 Periodical Distribution of Shakespeare Research Documents 1900-2016

NUMBER  SOURCE                                                        DOCUMENTS  CITATIONS
1       SHAKESPEARE QUARTERLY                                         3579       4699
2       TLS-THE TIMES LITERARY SUPPLEMENT                             1522       253
3       CAHIERS ELISABETHAINS                                         1493       53
4       NOTES AND QUERIES                                             1380       489
5       RENAISSANCE QUARTERLY                                         736        124
6       REVIEW OF ENGLISH STUDIES                                     587        302
7       THEATER HEUTE                                                 521        13
8       SIXTEENTH CENTURY JOURNAL                                     492        18
9       THEATRE JOURNAL                                               441        191
10      MODERN LANGUAGE REVIEW                                        422        108
11      SHAKESPEARE                                                   411        136
12      COMPARATIVE DRAMA                                             390        230
13      ARCHIV FUR DAS STUDIUM DER NEUEREN SPRACHEN UND LITERATUREN   374        38
14      LIBRARY JOURNAL                                               362        13
15      ENGLISH STUDIES                                               314        146
16      EXPLICATOR                                                    283        68
17      ETUDES ANGLAISES                                              257        22
18      THEATRE RESEARCH INTERNATIONAL                                242        53
19      MODERN PHILOLOGY                                              241        105
20      NEW THEATRE QUARTERLY                                         216        88

3.4 Analysis of Shakespeare Research Hotspots

Keywords condense an author's academic thoughts and perspectives, and they reflect the core content and research topics of an article. The more often keywords appear, the higher their co-occurrence count in the map, and keywords with a higher co-occurrence frequency reflect, to a certain extent, the research hotspots of a field. Through hotspot clustering we can analyze the research directions and research content. Because the records from 1900 to 1989 contain no keywords, and this article analyzes the research hotspots mainly on the basis of keywords, the time span in the software analysis is limited to 1990 to 2016. This gives a total of 4788 keywords, of which 188 co-occur more than 5 times. The clustering visualization of the Shakespeare study hotspots is shown in Fig. 5. In order to show the other main keywords more intuitively, the nodes “Shakespeare” and “William Shakespeare” are hidden because they are too large.

Fig. 5 Clustering of Keywords in Shakespeare Study 1990-2016

According to Fig. 5, the study of Shakespeare is clearly divided into three clusters. The first cluster includes Hamlet, sonnets, King Lear, drama, history and other keywords; the second cluster includes Othello, Macbeth, The Merchant of Venice, sex, love, comedy, England and other keywords; the third cluster includes performance, authorship, adaptation, cooperation and other keywords. From these clusters it can be found that the study of Shakespeare has three major research hotspots. The first studies Shakespeare's famous works from the point of view of history, such as the historical drama genre or the historical background. The second studies Shakespeare's famous works from the perspective of theme, such as the characters' images, the characters' personalities and the themes themselves. The third studies Shakespeare's works from the perspective of textual criticism, including the study of authorship, Shakespeare's doubtful works, Shakespeare's collaborative works and Shakespeare's adapted works.

3.4.1 Studying Shakespeare Works from the Perspective of History

The first Shakespeare research hotspot cluster is the study of Shakespeare's famous works from the perspective of history, such as the historical drama genre and the historical background. Shakespeare shows great talent in writing historical drama. He not only created rich historical figures in his works but also recorded history, so studying Shakespeare from the perspective of history became an obvious research direction. This article chooses Shakespeare's famous tragedies "Hamlet" and "King Lear" as examples.


The word “Hamlet” is both the title of one of Shakespeare's works and the name of its hero, so the largest node in the map shows that Hamlet is the keyword with the most co-occurrences: 84 other keywords co-occur with it, and the study of Hamlet takes a major share. These documents study Hamlet from different angles, not only in respect of characters, humanism and madness; some studies also analyze it from the perspective of religion and postmodernism. Hamlet is a tragedy written by Shakespeare between 1599 and 1602. The keyword “Ophelia” names one of the only two female roles in Hamlet, the lover of Hamlet, so it co-occurs with Hamlet very often. Because Hamlet is one of Shakespeare's four tragedies, both the story itself and Hamlet's character relate it to keywords such as "tragedy", "sadness" and "mourning".

King Lear is a drama based on an ancient British legend. The story is set around the 8th century and was later adapted into many plays in Britain. Besides Shakespeare's play, an earlier anonymous version survives, so it is generally believed that Shakespeare's King Lear is an adaptation of that play. The story takes place in Britain and France around the 8th century, while the play was written against the background of the Renaissance, when humanist ideas continued to develop and spread. Research on and interpretation of King Lear therefore often proceed from humanism and new historicism, and are associated with keywords such as "humanism" and "new historicism". As King Lear is also one of the four tragedies of Shakespeare, many papers study it from the tragic point of view, such as the analysis of the four tragedies, the formation of King Lear's tragedy, and its tragic color from the perspective of literature, so the keyword "tragedy" contributes heavily to the co-occurrence frequency.

3.4.2 Studying Shakespeare Works from the Perspective of Theme

The second Shakespeare study hotspot cluster is the analysis of Shakespeare's famous works from the perspective of theme, such as the famous characters and the thematic diversity. Shakespeare's tragedies, comedies and historical dramas all reflect the diversity of his themes. In his works he also shapes many typical images, such as images of women, kings and statues. Much of the literature therefore excavates and analyzes the themes of Shakespeare's works, such as the "time" and "love" themes in the Sonnets and the "death" and "revenge" themes in the Four Great Tragedies. This paper chooses the two outstanding works "Othello" and "The Merchant of Venice" as examples.

Othello is also one of the four tragedies, written around 1603. This work is studied mainly from the perspective of its themes, such as character, racial discrimination, colonialism and feminism. "Othello" is a multi-themed work, including the theme of love and jealousy, the theme of credulity and treachery, and the theme of intermarriage; racial discrimination is one of the reasons for the tragedy of Othello. Shakespeare succeeded in exposing some features of his time, such as racism, by describing the whole process of the downfall of Othello, a person from a different nation [4]. There is a conflict between the African culture represented by Othello and the European mainstream culture represented by Desdemona, and the study of this theme is helpful for today's multicultural exchange and integration [5]. In addition, some papers present visual analyses of versions of Othello [6] and comparative research between Othello and "Hamlet", "King Lear" and "Macbeth" [7], so Othello is related to the keywords "translation", "Hamlet", "Macbeth" and "King Lear" and co-occurs with them more often.

The Merchant of Venice is a great satirical comedy, written around 1596-1597. Its theme is love and friendship, but it also reflects the contradictions between the early commercial bourgeoisie and usurers in capitalist society, and it expresses humanism through the problems of money, law and religion in bourgeois society. An important literary achievement of this play is the typical image of Shylock [8], the ruthless usurer, and many papers analyze this work from the points of view of characters and racism, so the keyword "The Merchant of Venice" is more often related to the keywords "Shylock" and “Race”. In addition, other common keywords, such as "Macbeth" and "Midsummer Night's Dream", also co-occur with it at high frequency.

3.4.3 Studying Shakespeare Works from the Perspective of Textual Criticism

The content of this cluster is mainly about the authorship of Shakespeare's works. In Shakespeare's age drama scripts belonged to the theater, Shakespeare's early scripts carried no signature, and people had no awareness of copyright, so many works were lost and their authors cannot be determined. At the same time, because scripts were both created and adapted in Shakespeare's era, cooperation between authors was common, which explains why the keywords with a high co-occurrence frequency include “collaboration”. For these reasons, many authors discuss and dispute the attribution of some plays and the circumstances of cooperation, including Shakespeare's lost works, falsely attributed works and suspected works, such as “Love’s Labour’s Won”, “Cardenio”, “The London Prodigal”, “Sir Thomas More” and “Arden of Faversham”.

Another topic of this cluster is the study of adaptations of Shakespeare's works. Some studies examine other forms adapted from Shakespeare's texts, such as films, plays and TV shows: Fortier, Mark studied the adaptation of five of Shakespeare's plays [9], Jones, Keith studied the relationship between the text of Shakespeare's Hamlet and films [10], and Ingham, Mike studied the musical adaptation of Shakespeare's sonnets. Other work studies adaptation for particular audiences, such as adolescents [11], and the constraints of religion, nation and gender on adapting Shakespeare's work for a specific country or region, such as Canada or China [12].

4 Conclusions

With the 400th anniversary of the death of Shakespeare, Shakespeare study is likely to reach another peak. This paper summarizes the research situation and hotspots of global Shakespeare research through statistical analysis and visualization analysis of the research literature, with the following main findings. Shakespeare research is mainly carried out by individual authors, and there is currently no core group of authors. Academic institutions, mostly universities, are the core of the study of Shakespeare. Journals are widely distributed, but the research is relatively concentrated in the field of literature. There are three main hotspots in the Shakespeare research literature: studying Shakespeare's famous works from the point of view of history, such as the historical drama genre or the historical background; studying Shakespeare's famous works from the perspective of theme, such as the characters' images, the characters' personalities or the themes themselves; and studying Shakespeare's works from the perspective of textual criticism, including the study of authorship, Shakespeare's doubtful works, Shakespeare's collaborative works and Shakespeare's adapted works.

Acknowledgment

This work is supported by the project Intellectual Property’s Institution Change, Risk Analysis and Countermeasure Research of Information Consumption in Cloud Environment, a Humanity and Social Science Research Project of the Ministry of Education of China (Grant No. 14YJA870008). Ran Congjing, Li Xinlai and Huang Haiying contributed equally to this paper and should be considered co-first authors. Li Xinlai is the corresponding author of this paper.

References

[1] VOSviewer. Home [EB/OL]. http://www.vosviewer.com/, 2017-01-13.
[2] Zhu Y. The Identification and Evaluation of the Kernel Authors of New Technology of Library and Information Service [J]. New Technology of Library & Information Service, 2004.
[3] Bradford S C. Sources of Information on Specific Subjects [J]. Journal of Information Science, 10(4): 173-180, 1985.
[4] Smith, I. We are Othello: speaking of race in early modern studies. Shakespeare Quarterly, 67(1), 2016.
[5] Nachit M. Shakespeare's Othello and the Challenges of Multiculturalism [J]. Social Science Electronic Publishing, 2016.
[6] Zhao, G., Cheesman, T., Laramee, R. S., Flanagan, K., & Thiel, S. Shakervis: visual analysis of segment variation of German translations of Shakespeare's Othello. Information Visualization, 14(4), 2013.
[7] Davis, A. Shakespearean tragedy as chivalric romance: rethinking 'Macbeth', 'Hamlet', 'Othello', 'King Lear', by Michael L. Hays. Modern Language Review (1), 226-227, 2006.
[8] Bailey, A. Shylock and the slaves: owing and owning in The Merchant of Venice. Shakespeare Quarterly, 62(1), 1-24, 2011.
[9] Johanson, K. Shakespeare adaptations from the early eighteenth century: five plays. Fairleigh Dickinson University Press, 2013.
[10] Nelson L M. Screen Adaptations: Shakespeare's Hamlet: The Relationship between Text and Film by Samuel Crowl (review) [J]. Journal of Dramatic Theory and Criticism, 31(1): 174-176, 2016.
[11] Mike Ingham. “the true concord of well-tuned sounds”: musical adaptations of Shakespeare's sonnets. Shakespeare, 9(2), 220-240, 2013.
[12] Drouin J. Shakespeare in Quebec: Nation, Gender, and Adaptation [M]. University of Toronto Press, 2014.

Appendixes

1 Retrieve formula: TS=(Shakespear* OR Shakesperian OR "William Shakespeare" OR "All's Well That Ends Well" OR "As You Like It" OR “The Comedy of Errors” OR “Love's Labour's Lost” OR “Measure for Measure” OR “The Merchant of Venice” OR “The Merry Wives of Windsor” OR “A Midsummer Night's Dream” OR “Much Ado About Nothing” OR “Pericles, Prince of Tyre” OR “The Taming of the Shrew” OR “The Tempest” OR “Twelfth Night” OR “What You Will” OR “The Two Gentlemen of Verona” OR “The Two Noble Kinsmen” OR “The Winter's Tale” OR “Cymbeline” OR “King John” OR “Edward III” OR “Richard II” OR “Henry IV” OR “Henry V” OR “Henry VI” OR “Richard III” OR “Henry VIII” OR “Romeo and Juliet” OR “Coriolanus” OR “Titus Andronicus” OR “Timon of Athens” OR “Julius Caesar” OR “Macbeth” OR “Hamlet” OR “Troilus and Cressida” OR “King Lear” OR “Othello” OR “Antony and Cleopatra” OR “The Sonnets” OR “Venus and Adonis” OR “The Rape of Lucrece” OR “The Passionate Pilgrim” OR “The Phoenix and the Turtle” OR “A Lover's Complaint” OR “A Funeral Elegy” OR “Sonnets to sundry notes of music” OR “Love’s Labour’s Won” OR “Cardenio” OR “Sir Thomas More” OR “Arden of Faversham”)


Table based KNN for Text Summarization

Taeho Jo
Department of Computer and Information Communication Engineering, Hongik University, Sejong, South Korea

Abstract— In this research, we propose a modified version of KNN (K Nearest Neighbor) as an approach to text summarization. Encoding texts into numerical vectors in order to use the traditional versions for text mining tasks causes three main problems: huge dimensionality, sparse distribution, and poor transparency; this fact motivated this research. The idea of this research is to interpret the text summarization task as a text classification task and to apply the proposed version of KNN to the task, where texts are encoded into tables. The modified version proposed in this research is expected to summarize texts more reliably than the traditional version by solving the three problems. Hence, the goal of this research is to implement a text summarization system using the proposed approach.

Keywords: Text Summarization, Table Similarity, K Nearest Neighbor

1. Introduction

Text summarization refers to the process of automatically selecting some sentences or paragraphs in a given text as its essential part. As a preliminary task, the text is segmented into sentences or paragraphs by the punctuation mark or the carriage return, respectively. In the task, some sentences or paragraphs are selected as the important part. In this research, text summarization is interpreted as a binary classification in which each sentence or paragraph is classified as the essential part or not. Automatic text summarization by a system should be distinguished from summarization by a human being, which refers to the process of rewriting the entire text into a shorter version.

Let us consider some motivations for this research. In encoding texts into numerical vectors as the traditional preprocessing, three main problems may arise: huge dimensionality, sparse distribution, and poor transparency [2][3][4][13][6]. Encoding texts into tables previously showed successful results in other text mining tasks: text categorization and clustering [3][4][7]. Although we previously proposed alternative representations of texts called string vectors, we need to define and characterize them mathematically to lay the foundations for creating and modifying string vector based versions of machine learning algorithms [13][6]. Hence, this research is carried out with these motivations: we attempt to encode texts into tables in the text summarization task, as in the tasks of text categorization and clustering.

This research may be characterized by the following agenda. The text summarization task is interpreted as a binary classification task in which each sentence or paragraph is classified as the essence or not. Each sentence or paragraph is encoded into a table, instead of a numerical vector, to avoid the three problems. A similarity measure between tables, which is always given as a normalized value, is defined and used for modifying the KNN. The modified version will be applied to the binary classification task which is mapped from the text summarization task.

We will mention some benefits of this research. It keeps the process of encoding texts free from the three main problems mentioned above. By solving the problems, we may expect better performance than with the traditional version of KNN. Since the table is a more symbolic representation of each text, we may guess the contents of texts from their representations. However, because the table size is given as an external parameter of the proposed text summarization system, its value must be set carefully to optimize the trade-off between system reliability and speed.

This article is organized into four sections. In Section 2, we survey the relevant previous works. In Section 3, we describe in detail what we propose in this research. In Section 4, we mention the remaining tasks for further research.
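The preliminary segmentation step described in the Introduction can be sketched as follows. This is a minimal illustration, assuming that paragraphs are separated by line breaks and sentences end with ".", "!" or "?"; the function names `split_paragraphs` and `split_sentences` are hypothetical and not part of the proposed system.

```python
import re

def split_paragraphs(text):
    """Segment a text into paragraphs at line breaks (carriage returns)."""
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]

def split_sentences(paragraph):
    """Segment a paragraph into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]

text = "KNN is simple. It is also lazy.\nTables replace vectors."
candidates = [s for p in split_paragraphs(text) for s in split_sentences(p)]

# Each candidate would later be classified as 'essence' or 'not essence'.
for sentence in candidates:
    print(sentence)
```

Each element of `candidates` is one item of the binary classification task; a real system would need more careful handling of abbreviations and quotations than this regular expression provides.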

2. Previous Works

Let us survey the previous cases of encoding texts into structured forms in order to apply machine learning algorithms to text mining tasks. The three main problems, huge dimensionality, sparse distribution, and poor transparency, are inherent in encoding texts into numerical vectors. In previous works, various schemes of preprocessing texts have been proposed in order to solve the problems. In this survey, we focus on the process of encoding texts into structured forms other than numerical vectors. In other words, this section is intended to explore previous works on solutions to the problems.

Let us mention the popularity of encoding texts into numerical vectors, and the proposal and application of string kernels as a solution to the above problems. In 2002, Sebastiani presented numerical vectors as the standard representations of texts in applying machine learning algorithms to text classification [8]. In 2002, Lodhi et al. proposed the string kernel as a kernel function of raw texts in applying the SVM (Support Vector Machine) to text classification [9]. In 2004, Leslie et al. applied the version of SVM proposed by Lodhi et al. to protein classification [10]. In 2004, Kate and Mooney also used this SVM version for classifying sentences by their meanings [11].

Previously, it was proposed that texts should be encoded into string vectors as another structured form. In 2008, Jo modified the k means algorithm into a version which processes string vectors as an approach to text clustering [15]. In 2010, Jo modified two supervised learning algorithms, the KNN and the SVM, into string vector based versions as improved approaches to text classification [16]. In 2010, Jo proposed an unsupervised neural network, called Neural Text Self Organizer, which receives a string vector as its input data [17]. In 2010, Jo applied a supervised neural network, called Neural Text Categorizer, which takes a string vector as its input, as an approach to text classification [18].

It was also proposed that texts be encoded into tables instead of numerical vectors as a solution to the above problems. In 2008, Jo and Cho proposed the table matching algorithm as an approach to text classification [3]. In 2008, Jo also applied his proposed approach to text clustering, as well as text categorization [15]. In 2011, Jo described it as a technique of automatic text classification in his patent document [13]. In 2015, Jo improved the table matching algorithm into a more stable version [14].

The above previous works proposed the string kernel as the kernel function of raw texts in the SVM, and tables and string vectors as representations of texts, in order to solve the problems. Because computing the string kernel takes very much computation time, it was used for processing short strings or sentences rather than texts. In the previous works on encoding texts into tables, only the table matching algorithm was proposed; there was no attempt to modify machine learning algorithms into table based versions. In the previous works on encoding texts into string vectors, only frequency was considered for defining the features of string vectors. In this research, we will modify the machine learning algorithm KNN into a version which processes tables instead of numerical vectors, and use it as the approach to text summarization, which is mapped into a classification task.

3. Proposed Approach

This section is concerned with the table based KNN (K Nearest Neighbor) as the approach to text summarization, and it consists of four sections. In Section 3.1, we describe the process of encoding a text into a table. In Section 3.2, we formally describe the process of computing a similarity between tables as a normalized value between zero and one. In Section 3.3, we present the proposed version of KNN together with its traditional version; that section is intended to describe in detail the proposed version of KNN as the approach to the text summarization task. In Section 3.4, we present the scheme of applying the KNN to the task, viewing it as a binary classification task.

3.1 Text Encoding

This section is concerned with the process of encoding texts into tables as illustrated in Figure 1. In the process, a text is given as the input and a table which consists of entries of words and their weights is generated as the output. A text is indexed into a list of words through the three basic steps illustrated in Figure 2. For each word, its weight is computed and assigned to it. Therefore, in this section, we describe the three steps which are involved in text indexing and the scheme of weighting words.

Fig. 1: The Process of Encoding Text into Table

The process of indexing the corpus into a list of words is illustrated in Figure 2. The first step of the process is to tokenize a string, formed by concatenating the texts in the corpus, into tokens by segmenting it at white spaces or punctuation marks. The second step, called stemming, is to map each token into its root form by grammatical rules. The third step is to remove the stop words, which have only grammatical functions and are irrelevant to the contents, for more efficiency. The words which are generated through the three steps are usually nouns, verbs, and adjectives.

Fig. 2: The Process of Corpus Indexing

The weight of the word, $t_j$, in the text, $d_i$, is denoted by $w_{ij}$, and we mention the three schemes of computing it.

ISBN: 1-60132-448-0, CSREA Press ©

Int'l Conf. on Advances in Big Data Analytics | ABDA'17 |

We may use the relative frequency computed by equation (1),

$$w_{ij} = \frac{TF_{ij}}{TF_{i\max}} \tag{1}$$

where $TF_{ij}$ is the frequency of the word, $t_j$, in the text, $d_i$, and $TF_{i\max}$ is the maximum frequency in the text, $d_i$. We may assign a binary value, zero or one, to the weight by equation (2),

$$w_{ji} = \begin{cases} 1 & \text{if } TF_{ji} > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

We may use the TF-IDF (Term Frequency - Inverse Document Frequency) weights computed by equation (3),

$$w_{ji} = \begin{cases} \log\dfrac{N}{DF_i}\,(1 + \log TF_{ji}) & \text{if } TF_{ji} > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where $N$ is the total number of texts in the corpus, $DF_i$ is the number of texts including the word, $t_{ji}$, and $TF_{ji}$ is the frequency of the word, $t_{ji}$, in the given text, $D_j$. In this research, we adopt the TF-IDF method which is expressed in equation (3) as the scheme of weighting words.

We mention the schemes of trimming the table for more efficiency. The first scheme is to rank the words in the table by their weights and select a fixed number of words with the highest weights. The second one is to set a threshold as an external parameter and select the words whose weights are higher than the threshold, assuming that the weights are always given as normalized values between zero and one. The third one is to cluster words by their weights into two subgroups, the higher weighted group and the lower weighted group; the words in the former group are selected. In this research, we adopted the first scheme because it is the simplest.

Let us consider the differences between the word encoding and the text encoding. In the word encoding, each word is associated with the texts including it, whereas, in the text encoding, each text is associated with the words included in it. Each table representing a word has its own entries of text identifiers and the weights of the word in those texts, whereas one representing a text has entries of words and their weights in the text. Computing the similarity between the tables representing two words is based on the text identifiers which include them, while computing it between the tables representing two texts is based on the words shared by both texts. Therefore, from the comparisons, the reversal of the roles of texts and words is the essential difference between the text and word encoding.
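The encoding process of this section can be sketched in code. The following is a minimal illustration, not the authors' implementation: the stop-word list, the function names `index_text` and `encode_text`, and the parameter `max_size` are assumptions introduced here, stemming is omitted, and words that never occur in the corpus are skipped because no document frequency is available for them. It applies the TF-IDF weights of equation (3) and the first trimming scheme.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "for"}


def index_text(text):
    """Tokenize by white space and punctuation, then drop stop words.

    Stemming is omitted in this sketch; a stemmer could be applied per token.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def encode_text(text, corpus, max_size=2):
    """Encode a text into a table {word: TF-IDF weight}, trimmed to max_size entries."""
    n_texts = len(corpus)
    indexed_corpus = [set(index_text(d)) for d in corpus]
    term_freq = Counter(index_text(text))
    table = {}
    for word, freq in term_freq.items():
        doc_freq = sum(1 for words in indexed_corpus if word in words)
        if doc_freq == 0:
            continue  # assumption: words unseen in the corpus are skipped
        # Equation (3): log(N / DF) * (1 + log TF)
        table[word] = math.log(n_texts / doc_freq) * (1 + math.log(freq))
    # First trimming scheme: keep the max_size highest-weighted words.
    top = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:max_size]
    return dict(top)


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
table = encode_text(corpus[0], corpus)
print(table)
```

In this toy corpus, "cat" occurs in two of the three texts and therefore receives a lower weight than "sat" and "mat", so it is the word removed by trimming.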


3.2 Similarity between Two Tables

This section is concerned with the process of computing a similarity between tables. Texts are encoded into tables by the process which was described in Section 3.1. The two tables are viewed as two sets of words, and the set of shared words is retrieved by applying the intersection to the two sets. The similarity between the two tables is based on the ratio of the shared word weights to the total weights of the two tables. Therefore, we intend this section to describe in detail and formally the process of computing the similarity.

A table which represents a text may be formalized as a set of entries of words and their weights. The text, $D_j$, is represented as a set of entries as follows:

$$D_j = \{(t_{j1}, w_{j1}), (t_{j2}, w_{j2}), \ldots, (t_{jn}, w_{jn})\}$$

where $t_{ji}$ is a word included in the text, $D_j$, and $w_{ji}$ is the weight of the word, $t_{ji}$, in the text, $D_j$. The set of only words is as follows:

$$T(D_j) = \{t_{j1}, t_{j2}, \ldots, t_{jn}\}$$

The TF-IDF (Term Frequency - Inverse Document Frequency) weight, $w_{ji}$, of the word, $t_{ji}$, in the text, $D_j$, is computed by equation (4),

$$w_{ji} = \begin{cases} \log\dfrac{N}{DF_i}\,(1 + \log TF_{ji}) & \text{if } TF_{ji} > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $N$ is the total number of texts in the corpus, $DF_i$ is the number of texts including the word, $t_{ji}$, and $TF_{ji}$ is the frequency of the word, $t_{ji}$, in the given text, $D_j$. Therefore, the table is defined formally as an unordered set of pairs of words and their weights.

Let us describe formally the process of computing the similarity between two tables indicating two texts. The two texts, $D_1$ and $D_2$, are encoded into the two tables as follows:

$$D_1 = \{(t_{11}, w_{11}), (t_{12}, w_{12}), \ldots, (t_{1n}, w_{1n})\}$$

$$D_2 = \{(t_{21}, w_{21}), (t_{22}, w_{22}), \ldots, (t_{2n}, w_{2n})\}$$

The two texts are represented as two sets of words by applying the operator, $T(\cdot)$, as follows:

$$T(D_1) = \{t_{11}, t_{12}, \ldots, t_{1n}\}$$

$$T(D_2) = \{t_{21}, t_{22}, \ldots, t_{2n}\}$$

By applying the intersection to the two sets, a set of shared words is generated as follows:

$$T(D_1) \cap T(D_2) = \{st_1, st_2, \ldots, st_k\}$$

We construct the table of the shared words and their dual weights, of which one is from $D_1$ and the other is from $D_2$, as follows:

$$ST = \{(st_1, w_{11}, w_{21}), (st_2, w_{12}, w_{22}), \ldots, (st_k, w_{1k}, w_{2k})\}$$

The similarity between the two tables is computed as the ratio of the total dual weights of the shared words to the total weights of the words in both tables, by equation (5),

$$Sim(D_1, D_2) = \frac{\sum_{i=1}^{k} (w_{1i} + w_{2i})}{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}} \tag{5}$$

where, in the numerator, $w_{1i}$ and $w_{2i}$ are the dual weights of the shared word, $st_i$, $k$ is the number of shared words, and $m$ is the number of entries in each table (written as $n$ in the definitions above).
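Equation (5) can be sketched in a few lines of code. This is a minimal sketch under the assumption that a table is held as a dictionary from words to weights; the function name `table_similarity` and the weights below are hypothetical and are not the values of the paper's example. Returning 0.0 for two empty tables is also an assumption, since the paper does not cover that case.

```python
def table_similarity(table1, table2):
    """Equation (5): total dual weights of shared words over total weights of both tables."""
    shared = table1.keys() & table2.keys()
    shared_weight = sum(table1[w] + table2[w] for w in shared)
    total_weight = sum(table1.values()) + sum(table2.values())
    if total_weight == 0:
        return 0.0  # assumption: empty or zero-weight tables are treated as dissimilar
    return shared_weight / total_weight


# Hypothetical weights, for illustration only.
d1 = {"artificial": 0.8, "documents": 0.4, "neural": 0.6}
d2 = {"artificial": 0.5, "documents": 0.3, "cluster": 0.9}

sim = table_similarity(d1, d2)  # (0.8 + 0.5 + 0.4 + 0.3) / (1.8 + 1.7)
print(round(sim, 4))
```

The dictionary based intersection is chosen here for brevity; sorting the entries and merging them sequentially would serve equally well.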


The similarity is always given as a normalized value between zero and one; if the two tables, $D_1$ and $D_2$, are the same, $D_1 = D_2$, all words are shared and the similarity becomes 1.0 as follows:

$$Sim(D_1, D_2) = \frac{\sum_{i=1}^{m} (w_{1i} + w_{2i})}{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}} = \frac{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}}{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}} = 1.0$$

If they are exclusive, $T(D_1) \cap T(D_2) = \emptyset$, the similarity becomes 0.0 as follows:

$$Sim(D_1, D_2) = \frac{0}{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}} = 0.0$$

We demonstrate the process of computing the similarity between two tables using the simple example which is presented in Figure 3. The two texts are encoded into the two source tables as shown in Figure 3. In the example, the two words, 'artificial' and 'documents', are shared by the two tables, and each shared word has its dual weights from the two input tables. The similarity between the two tables is computed to be 0.52 as a normalized value by equation (5). Therefore, the similarity is computed by lexical matching between the two tables.

Fig. 3: Example of Two Tables

The similarity computation which is presented above is characterized mathematically. The commutative law applies to the computation as follows:

$$Sim(D_1, D_2) = \frac{\sum_{i=1}^{k} (w_{1i} + w_{2i})}{\sum_{i=1}^{m} w_{1i} + \sum_{i=1}^{m} w_{2i}} = \frac{\sum_{i=1}^{k} (w_{2i} + w_{1i})}{\sum_{i=1}^{m} w_{2i} + \sum_{i=1}^{m} w_{1i}} = Sim(D_2, D_1).$$

The similarity is always given as a normalized value between zero and one as follows:

$$0 \le Sim(D_1, D_2) \le 1.$$

If the weights which are assigned to all words are identical, the similarity between two tables depends on the number of shared words as follows:

$$Sim(D_1, D_2) \le Sim(D_1, D_3) \rightarrow |T(D_1) \cap T(D_2)| \le |T(D_1) \cap T(D_3)|.$$

The complexity of computing the similarity between two tables is O(n log n), since it takes O(n log n) to sort the entries of the two tables using the quick sort or the heap sort, and O(n) to extract the shared elements by cosequential processing [1].

3.3 Proposed Version of KNN

This section is concerned with the proposed KNN version as the approach to the text categorization. Raw texts are encoded into tables by the process which was described in Section 3.1. In this section, we attempt to modify the traditional KNN into the version where a table is given as the input data. The version is intended to improve the classification performance by avoiding the problems of encoding texts into numerical vectors. Therefore, in this section, we describe the proposed KNN version in detail, together with the traditional version.

The traditional KNN version is illustrated in Figure 4. The sample texts which are labeled with the positive class or the negative class are encoded into numerical vectors. The similarities of the numerical vector which represents a novice text with those representing the sample texts are computed using the Euclidean distance or the cosine similarity. The k most similar sample texts are selected as the k nearest neighbors, and the label of the novice entity is decided by voting their labels. However, note that the traditional KNN version is very fragile in computing the similarity between very sparse numerical vectors.

Fig. 4: The Traditional Version of KNN

Separately from the traditional one, we illustrate the classification process by the proposed version in Figure 5. The sample texts labeled with the positive or negative class are encoded into tables. The similarity between two tables is computed by the scheme which was described in Section 3.2. Identically to the traditional version, in the proposed version, the k most similar samples are selected, and the label of the novice one is decided by voting the labels of the sample entities. Because a table, unlike a numerical vector, inherently has no sparse distribution, the poor discrimination caused by sparse distributions is avoided in the proposed version.
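The classification step of the proposed version can be sketched as follows. This is an illustrative sketch rather than the authors' code: tables are assumed to be dictionaries from words to weights, the names `table_similarity` and `knn_classify` are introduced here, and the sample tables and their weights are made up for the demonstration.

```python
from collections import Counter


def table_similarity(table1, table2):
    """Ratio of the dual weights of shared words to the total weights of both tables."""
    shared = table1.keys() & table2.keys()
    shared_weight = sum(table1[w] + table2[w] for w in shared)
    total_weight = sum(table1.values()) + sum(table2.values())
    return shared_weight / total_weight if total_weight else 0.0


def knn_classify(novice, samples, k=3):
    """Label a novice table by voting the labels of its k most similar sample tables.

    samples is a list of (table, label) pairs.
    """
    ranked = sorted(samples, key=lambda s: table_similarity(novice, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sample tables, for illustration only.
samples = [
    ({"stock": 1.0, "market": 0.8}, "positive"),
    ({"market": 0.9, "trade": 0.7}, "positive"),
    ({"soccer": 1.0, "goal": 0.6}, "negative"),
    ({"goal": 0.9, "team": 0.8}, "negative"),
    ({"trade": 0.5, "team": 0.5}, "negative"),
]
novice = {"stock": 0.7, "trade": 0.6}
label = knn_classify(novice, samples, k=3)
print(label)
```

With an odd k and two classes the vote cannot tie; other settings would need an explicit tie-breaking rule, which this sketch does not define.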


Fig. 5: The Proposed Version of KNN

We may derive some variants from the proposed KNN version. We may assign different weights to the selected neighbors instead of identical ones: the highest weight to the first nearest neighbor and the lowest weight to the last one. Instead of a fixed number of nearest neighbors, we may select as neighbors any number of training examples within a hyper-sphere whose center is the given novice example. The categorical scores may be computed proportionally to the similarities with the training examples, instead of selecting nearest neighbors. We may also consider the variants where two or more of these variants are combined with each other.

Because the tables which represent texts are characterized more symbolically than numerical vectors, it is easier to trace the results of classifying items. Let us assume that novice tables are classified by voting the labels of their nearest neighbors. In each category, the words shared between the novice one and its nearest neighbors are extracted. For each category, we make a list of the shared words, their weights, and the categorical score. We present the evidence for classifying an entity into a category by showing them.

3.4 Application to Text Summarization

This section is concerned with the scheme of applying the proposed KNN version which was described in Section 3.3 to the text summarization task. Before doing so, we need to transform the task into one where machine learning algorithms are applicable as flexible and adaptive models. We prepare the paragraphs which are labeled with 'essence' or 'not' as the sample data. The paragraphs are encoded into tables by the scheme which was described in Section 3.1. Therefore, in this section, we describe the process of extracting summaries from texts automatically using the proposed KNN, viewing text summarization as a classification task.

The text summarization is mapped into a binary classification, as shown in Figure 6. A text is given as the input, and it is partitioned into paragraphs by carriage return. Each paragraph is classified into either of the two categories: 'essence' and 'not'. The paragraphs which are classified into 'essence' are selected as the output of the text summarization system. For doing so, we need to collect, in advance, paragraphs which are labeled with one of the two labels as sample examples.

Fig. 6: View of Text Summarization into Binary Classification

As sample examples, we need to collect paragraphs which are labeled with one of the two categories, before summarizing a text. The text collection should be segmented by content, manually or automatically, into sub-collections which are called domains. In each sub-collection, texts are partitioned into paragraphs, and they are manually labeled with one of the two categories. We assign a classifier to each domain and train it with the paragraphs in its corresponding domain. When a text is given as the input, we select the classifier which corresponds to the domain which is most similar to the text.

Let us consider the process of applying the KNN to the text summarization which is mapped into a classification. A text is given as the input, and the classifier which corresponds to the subgroup which is most similar to the given text with respect to its content is selected. The text is partitioned into paragraphs, and each paragraph is classified into 'essence' or 'not' by the classifier. The paragraphs which are classified into 'essence' are extracted as the results of summarizing the text. Note that the text is rejected if all paragraphs are classified into 'not'.

Even if the text summarization is viewed as an instance of text categorization, we need to compare the two tasks with each other. In the text categorization, a text is given as an entity, while in the text summarization, a paragraph is given as an entity. In the text categorization, the topics are predefined manually based on prior knowledge, whereas in the text summarization, the two categories, 'essence' and 'not', are initially given. In the text categorization, the sample texts may span various domains, whereas in the text summarization, the sample paragraphs should be within a domain. Therefore, although the text summarization belongs to the classification task, it should be distinguished from the topic based text categorization.
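The summarization process described above can be sketched as a small pipeline. In this sketch the classifier is passed in as a function, so any paragraph classifier, such as the table based KNN, could be plugged in; the names `summarize` and `toy_classifier` are assumptions, and the keyword rule in `toy_classifier` is a stand-in for demonstration only, not part of the proposed approach.

```python
def summarize(text, classify):
    """Partition a text into paragraphs and keep those classified as 'essence'.

    classify is a callable mapping a paragraph to 'essence' or 'not'.
    Returns None when every paragraph is classified as 'not' (the text is rejected).
    """
    paragraphs = [p.strip() for p in text.splitlines() if p.strip()]
    selected = [p for p in paragraphs if classify(p) == "essence"]
    return selected if selected else None


def toy_classifier(paragraph):
    """Hypothetical stand-in for a trained classifier, used only for this demo."""
    return "essence" if "propose" in paragraph.lower() else "not"


text = (
    "We propose a table based KNN.\n"
    "The weather was nice.\n"
    "\n"
    "We also propose a trimming scheme."
)
summary = summarize(text, toy_classifier)
print(summary)
```

Selecting the domain classifier that matches the input text would be an additional step before calling `summarize`; it is left out to keep the sketch short.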

4. Conclusion

Some tasks remain for further research. We may apply the proposed approach to summarizing texts in specific domains such as medicine, law, and engineering. We may consider the semantic relations among different words in the tables when computing their similarities, but doing so requires a similarity matrix or a word net. We may add the process of optimizing the weights of words as a meta-learning task. We may implement a text summarization system adopting the proposed approach.

5. Acknowledgement

This work was supported by 2017 Hongik University Research Fund.

References

[1] M.J. Folk, B. Zoellick, and G. Riccardi, File Structures: An Object-Oriented Approach with C++, Addison Wesley, 1998.
[2] T. Jo, “The Implementation of Dynamic Document Organization using Text Categorization and Text Clustering”, PhD Dissertation of University of Ottawa, 2006.
[3] T. Jo and D. Cho, “Index Based Approach for Text Categorization”, pp127-132, International Journal of Mathematics and Computers in Simulation, No 2, 2008.
[4] T. Jo, “Table based Matching Algorithm for Soft Categorization of News Articles in Reuter 21578”, pp875-882, Journal of Korea Multimedia Society, No 11, 2008.
[5] T. Jo, “Topic Spotting to News Articles in 20NewsGroups with NTC”, pp50-56, Lecture Notes in Information Technology, No 7, 2011.
[6] T. Jo, “Definition of String Vector based Operations for Training NTSO using Inverted Index”, pp57-63, Lecture Notes in Information Technology, No 7, 2011.
[7] T. Jo, “Definition of Table Similarity for News Article Classification”, pp202-207, The Proceedings of The Fourth International Conference on Data Mining, 2012.
[8] F. Sebastiani, “Machine Learning in Automated Text Categorization”, pp1-47, ACM Computing Surveys, Vol 34, No 1, 2002.
[9] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text Classification with String Kernels”, pp419-444, Journal of Machine Learning Research, Vol 2, No 2, 2002.
[10] C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble, “Mismatch String Kernels for Discriminative Protein Classification”, pp467-476, Bioinformatics, Vol 20, No 4, 2004.
[11] R. J. Kate and R. J. Mooney, “Using String Kernels for Learning Semantic Parsers”, pp913-920, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 2006.
[12] T. Jo, “Single Pass Algorithm for Text Clustering by Encoding Documents into Tables”, pp1749-1757, Journal of Korea Multimedia Society, Vol 11, No 12, 2008.
[13] T. Jo, “Device and Method for Categorizing Electronic Document Automatically”, Patent Document, 10-2009-0041272, 10-1071495, 2011.
[14] T. Jo, “Normalized Table Matching Algorithm as Approach to Text Categorization”, pp839-849, Soft Computing, Vol 19, No 4, 2015.
[15] T. Jo, “Inverted Index based Modified Version of K-Means Algorithm for Text Clustering”, pp67-76, Journal of Information Processing Systems, Vol 4, No 2, 2008.
[16] T. Jo, “Representation of Texts into String Vectors for Text Categorization”, pp110-127, Journal of Computing Science and Engineering, Vol 4, No 2, 2010.
[17] T. Jo, “NTSO (Neural Text Self Organizer): A New Neural Network for Text Clustering”, pp31-43, Journal of Network Technology, Vol 1, No 1, 2010.
[18] T. Jo, “NTC (Neural Text Categorizer): Neural Network for Text Categorization”, pp83-96, International Journal of Information Studies, Vol 2, No 2, 2010.


Evaluate Impacts of Big Data on Organizational Performance: Using Intellectual Capital as a Proxy

Thuan L Nguyen
Graduate School, University of North Texas, Denton, Texas, USA

Abstract - Big data analytics has emerged as an important new field of study for both researchers and practitioners, demonstrating the significant demand for solutions to business problems in a data-driven, knowledge-based economy. Employing the emergent technology successfully is not easy, and assessing the roles of big data in improving firm performance is even harder. Additionally, empirical studies examining the impacts of the nascent technology on organizational performance remain scarce. The present study aimed to fill the gap. This study suggested using firms' intellectual capital as a proxy for the performance of big data implementation in assessing its influence on business performance. The present study employed the Value Added Intellectual Coefficient method to measure corporate intellectual capital via its three main components, human capital efficiency, structural capital efficiency, and capital employed efficiency, and then used the structural equation modeling technique to model the data and test the models. The financial fundamental and market data of 100 randomly selected publicly listed firms in the pharmaceutical, biotechnology, and life sciences sector were collected. The results of the tests showed that only human capital efficiency and capital employed efficiency had a significant positive impact on firm profitability, which highlighted the prominent roles of enterprise employees and financial capital in the impacts of big data technology.

Keywords: Big Data, Big Data Analytics, Organizational Performance, Intellectual Capital, Value Added Intellectual Coefficient (VAIC™).

1 Introduction

Although many companies have made big data one of the top priorities of their business strategy, either having invested or planning to invest heavily in the technology, few of them knew how to generate added value from it [43, 52]. Employing the emergent technology successfully is not easy, and assessing the roles of big data in improving firm performance is even harder [8, 11, 12, 39, 43, 49]. More importantly, the basic question of whether or not big data technology has a significant positive impact on the bottom line of firms had not yet been answered clearly in academia [8, 39]. Additionally, empirical studies examining the impacts of big data analytics on organizational performance remained scarce [8, 39, 51, 54].

This study aimed to fill the gap, making significant contributions to both academic research and industrial practice. The present study contributed to the literature of multiple related fields: information systems, big data and data science, business intelligence, knowledge management, and intellectual capital. The findings of this study added to the accumulated empirical evidence that big data can help firms, regardless of size, improve their business performance and increase profitability, because the technology enables companies to serve customers much better and do business much more efficiently [13, 26, 36, 43].

Tan and Wong [48] suggested that if something could not be measured, it could not be managed. As noted above, it is very difficult to measure the performance of big data in firms, which in turn makes it a daunting task to evaluate the impacts of big data implementation on firm outcomes [8, 11, 12, 39, 43, 49]. To facilitate the assessment of big data effects on organizational performance, this study suggested using firm intellectual capital (IC) as a proxy for big data performance. The present study tried to answer the following research question: How does big data performance, represented by the three core efficiency indicators of IC (HCE, SCE, CEE), impact firm performance?

The remainder of this study is organized as follows. The next section discusses the theoretical background. The third section is a brief literature review that sheds light on the ultimate goals of implementing big data in firms, then introduces the basic concepts of IC viewed as organizational knowledge and discusses the link between big data performance and IC. Research methodology is presented in the fourth section, which is followed by the results. The results are followed by the discussion and implications. Finally, the study closes with the section of implications and conclusions.

2 Theoretical Background

There exist various theories that postulate different views of the firm. Although there may be many differences in what these theories state, the central question all of them try to answer is what makes firms different from each other [1, 14, 31]. Why does this firm compete against its competitors much better than another one [2, 4]? How can a firm achieve much better business performance than others in the same industry [27, 28]?


One of the theories of the firm most often mentioned in the literature is the resource-based view (RBV). In answer to the above questions, the theory holds that some of the organizational resources possessed by a firm, labeled as strategic resources, together with how these resources are managed, enable it to gain competitive advantage and achieve superior performance [2, 4]. This theory argues that strategic resources help a firm compete better and operate more efficiently because they are valuable, rare, inimitable, and nonsubstitutable (VRIN) [4, 15].

3 Literature Review

3.1 Big Data and Its Ultimate Goals to Create Organizational Knowledge

Big data can help organizations generate more value-added in nearly any aspect of their business. In employing big data technology, firms aim to get analytical insights into huge volumes of data, and then leverage the business intelligence extracted from the data to improve business outcomes [10, 26, 43, 54]. In other words, the ultimate goals of implementing big data in firms are to create more organizational knowledge that can be used to gain and sustain competitive advantage, capture more market share, and improve business performance. Therefore, the measurement of corporate organizational knowledge can reflect the performance of implementing big data in firms.

3.2 Intellectual Capital (IC): Another Name of Organizational Knowledge

It is widely recognized that IC consists of three major components: human capital (HC), structural capital (SC), and relational capital (RC) [32, 40, 47]. Human Capital (HC) represents the collective knowledge, skills, creativity, experience, and even enthusiasm of employees of a firm [19, 46]. Structural Capital (SC) indicates the institutionalized experience and codified knowledge generated by an organization as a whole such as corporate structures, processes, technology models and inventions, patents, copyright, business strategy, and information systems [15, 17]. Relational Capital (RC) represents the value generated through the relationship with customers, suppliers, and other external stakeholders [47].

According to Kianto et al. [24] and Kaya et al. [20], IC is the knowledge within an organization, a.k.a. organizational knowledge. According to these authors, IC and organizational knowledge are the same things if both are viewed from the static perspective of corporate assets [24]. Therefore, IC can be viewed as an organization's stock of knowledge at any time [38]. In other words, firm IC is organizational knowledge that has been acquired and formalized to be used in creating value, gaining competitive advantage, and achieving superior performance [24, 33, 38].

3.3 Intellectual Capital (IC): A Proxy for Big Data Performance

As above, the ultimate goals of implementing big data in firms are to create more organizational knowledge that can be used to gain and sustain competitive advantage, capture more market share, and improve business performance [10, 13, 26]. Besides, corporate IC can be viewed as organizational knowledge that can help companies enhance competitiveness and achieve superior performance [20, 24]. Therefore, IC measurement can be used as a proxy for the performance of big data implementation in firms. In other words, it is reasonable to measure firm IC and then employ its measurement in assessing the impacts of big data on organizational performance, which addresses the research question.

3.4 IC, VAIC™, and Organizational Performance

The concept of IC is believed to have been first discussed in detail by the economist John Kenneth Galbraith in 1969 [20]. Since then, the concept of IC in its organizational meaning has become widely known and studied thanks to Thomas Stewart's articles about “brainpower” published by Fortune magazine in 1991 [47]. The IC literature also presents a large variety of methods that can be used to measure IC in firms [38, 47]. Among these approaches, the Value Added Intellectual Coefficient (VAIC™) model is one of the most popular tools to assess IC performance in organizations. Developed by Pulic [37], the VAIC model aims to calculate the set of efficiency indicators (HCE, SCE, and CEE) and the VAIC. The values can be used to represent the measurement of IC in firms [19, 25, 29]. The model provides a simple, but effective, approach to measuring IC and then using the measurement to evaluate the influence of IC on firm performance [23]. According to Khanhossini et al. [22], the VAIC model is much better than other methods of measuring IC.

In a broad perspective, the review of the literature supports the accumulated empirical evidence that IC has a significant positive impact on organizational performance [1]. However, the results varied considerably from one industry to another, or from one country to another, with respect to the influence of the IC components (HC, SC, RC) or the effect of the efficiency elements (HCE, SCE, CEE) on corporate business outcomes.

4 Research Methodology

4.1 Value Added Intellectual Coefficient (VAIC™) Model

The VAIC model is based on the concept of value added that is a measurement reflecting the contribution of employees, management, and other resources of a firm to create value [37]. More importantly, value added normally leads to the creation of wealth in the company [37]. The total value added (VA) can be computed with the following formula:

VA = Op. Profit + Emp. Expenses + D + A    (1)

Where Op. Profit is Operating Profit, Emp. Expenses are normally the total salaries and wages, D is Depreciation, and A is Amortization. Next, the efficiency indicators (HCE, SCE, and CEE) are computed as follows:

HCE = VA / HC (Human Capital)    (2)

Where HC is the employee expenses, normally the total salaries and wages.

SCE = SC (Structural Capital) / VA    (3)

Where SC = VA - HC    (4)

CEE = VA / CE (Capital Employed)    (5)

Where CE = Property, Plant & Equipment + Current Assets - Current Liabilities    (6)

Finally, the VAIC value is the sum of the three efficiency indicators:

VAIC = HCE + SCE + CEE    (7)

Then, the set of efficiency indicators (HCE, SCE, and CEE) or the VAIC value is used straightforwardly as IC measurement in research [1, 29, 41]. VAIC is considered better than other methods for measuring IC because it is simple and transparent [19, 22], and it provides a basis for standard measurement [22]. Additionally, the research data are collected from the annual filing documents reported by firms whose data have been audited by third parties and available on the websites of the companies or governmental agencies that oversee securities markets [19, 22].

4.2 Research Variables

In this study, IC – as a proxy for big data performance – was the central predictor that was represented by its three efficiency indicators: HCE, SCE, and CEE [1, 29, 41]. Then, these efficiency indicators were used as the independent variables [1, 29, 41]. The dependent variables were the three indicators used to measure organizational performance: ROA (return-on-assets) representing profitability, ATO (asset-turnover) indicating productivity, and market value for market performance [18, 50].

4.3 Data Collection

The financial fundamental and market data of 100 randomly selected publicly listed firms in the sector of pharmaceutical, biotechnology, and life sciences were collected, using the online service of financial analytics S&P Capital IQ Platform provided by McGraw Hill Financial.

4.4 Theoretical Model and Research Hypotheses

Based on the reviewed literature, the following theoretical model is proposed:

Figure 1. Theoretical Model for Research

Based on the theories of the firm and the reviewed literature, the following hypotheses were proposed:

H1: HCE has a significant positive impact on ROA.
H2: HCE has a significant positive impact on ATO.
H3: HCE has a significant positive impact on market value.
H4: SCE has a significant positive impact on ROA.
H5: SCE has a significant positive impact on ATO.
H6: SCE has a significant positive impact on market value.
H7: CEE has a significant positive impact on ROA.
H8: CEE has a significant positive impact on ATO.
H9: CEE has a significant positive impact on market value.

4.5 Testing the Model

Structural equation modeling (SEM) has been one of the statistical techniques widely chosen by researchers across disciplines [16]. SEM is frequently employed in the IC literature to study the impact of IC on firm performance [9, 22]. A SEM analysis was performed using the AMOS software to test the models in the study. The estimation of the SEM models was conducted employing maximum likelihood estimation (MLE). MLE is a technique used to reveal the most likely function(s) that can explain, i.e., fit, observed data [30]. MLE has been the most widely used fitting function for structural equation models [6].
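The VAIC calculation of Section 4.1, equations (1) to (7), can be illustrated with a short sketch. The class name `FirmFinancials`, the function name `vaic_indicators`, and all figures below are assumptions made for illustration; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class FirmFinancials:
    """Input figures for one firm-year (hypothetical field names)."""
    operating_profit: float
    employee_expenses: float
    depreciation: float
    amortization: float
    property_plant_equipment: float
    current_assets: float
    current_liabilities: float


def vaic_indicators(f):
    """Compute VA, HCE, SCE, CEE, and VAIC following equations (1) to (7)."""
    va = f.operating_profit + f.employee_expenses + f.depreciation + f.amortization  # (1)
    hc = f.employee_expenses            # human capital: total salaries and wages
    hce = va / hc                       # (2)
    sc = va - hc                        # (4)
    sce = sc / va                       # (3)
    ce = f.property_plant_equipment + f.current_assets - f.current_liabilities  # (6)
    cee = va / ce                       # (5)
    return {"VA": va, "HCE": hce, "SCE": sce, "CEE": cee, "VAIC": hce + sce + cee}  # (7)


# Made-up figures, in arbitrary currency units.
firm = FirmFinancials(120.0, 60.0, 15.0, 5.0, 300.0, 250.0, 150.0)
result = vaic_indicators(firm)
print(result)
```

For these made-up figures, VA is 200, so HCE is 200/60, SCE is 140/200, and CEE is 200/400. The sketch does not guard against zero employee expenses, zero value added, or zero capital employed, which real data would require.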

5 Results

The following fit indices were used for the evaluation of the model fit: Model chi-square (χ2), goodness-of-fit index (GFI), normed-fit-index (NFI), comparative fit index (CFI), and root mean square error of approximation (RMSEA). The chi-square value (χ2) assessed the overall model fit [16, 53]. To indicate a good model fit, the chi-square statistic must be insignificant at 0.05 threshold, i.e. p > 0.05 [16]. The results showed that the model fit the data: chi-square = 3.835, degrees of freedom = 2, and probability level = 0.147 (> 0.05). Table 1 summarizes the goodness of fit values and thresholds for these fit indices:

Table 1: Goodness-of-fit values and thresholds

Goodness-of-Fit Index                              Recommended Values    Values from this study
Comparative Fit Index (CFI)                        >0.90                 0.992
Goodness-of-Fit Index (GFI)                        >0.90                 0.987
Normed Fit Index (NFI)                             >0.90                 0.984
Root mean square error of approximation (RMSEA)

Hypothesis    Estimate    p
H6            0.044       0.626
H7            0.750       ***
H8            -0.038      0.749
H9            0.109       0.525
*p < 0.05, **p < 0.01, ***p < 0.001

Table 2: Summary of results of testing the first nine hypotheses: H1 to H9

Hypothesis    Hypothesized Path      Supported or Rejected
H1            HCE → ROA              Supported
H2            HCE → ATO              Rejected
H3            HCE → Market Value     Rejected
H4            SCE → ROA              Rejected
H5            SCE → ATO              Rejected
H6            SCE → Market Value     Rejected
H7            CEE → ROA              Supported
H8            CEE → ATO              Rejected
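The reported chi-square test (chi-square = 3.835 with 2 degrees of freedom, probability level 0.147) can be checked by hand. The sketch below is an illustration only, not the AMOS procedure used in the study: it assumes the closed-form upper-tail probability of a chi-square variable with an even number of degrees of freedom, and the function name `chi_square_sf_even_df` is hypothetical.

```python
import math


def chi_square_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with even df.

    For even df = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # current term (x/2)^k / k!, starting at k = 0
    total = 1.0  # running sum of the terms
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Values reported in the text: chi-square = 3.835, degrees of freedom = 2.
p_value = chi_square_sf_even_df(3.835, 2)
print(round(p_value, 3))  # 0.147, above the 0.05 threshold
```

Because the probability exceeds 0.05, the chi-square statistic is non-significant, which is the condition the text uses for an acceptable overall fit.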


VI. CONCLUSION
This paper analyzes the similarity of equipment using the Word2Vec model. To measure accuracy, we compared the proposed method with LSA, an algorithm recently used for analyzing latent semantics and word associations. The proposed method uses more computing resources, but it is about 14% more accurate than LSA on 1,000 pieces of equipment. We also found that the more data the Word2Vec model learns from, the better the accuracy of the similarity. This method can save national finances by supporting a service that prevents the purchase of duplicate equipment. In the future, if we reduce the time complexity of the proposed method, we could use it in a service that recommends how new data should be classified when they are entered into the search platform.

ACKNOWLEDGMENT
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No.2016-0-00087, Development of HPC System for Accelerating Large-scale Deep Learning).


OLSH: Occurrence-based Locality Sensitive Hashing
Mohammadhossein Toutiaee
Computer Science Department, The University of Georgia
415 Boyd, 200 D.W. Brooks Drive, Athens, USA
[email protected]

Abstract - Probabilistic data structures are widely used with large amounts of data. The acceptable error or probability of failure can be controlled by statistical inference methods applied in many domains. Locality sensitive hashing (LSH) is an efficient data structure for nearest neighbor search in high-dimensional data, as an alternative to exact nearest neighbor search structures such as the R-tree. The basic idea is to provide probabilistic guarantees for solving approximate nearest neighbor searches in rich feature spaces. Using different buckets in a hash table has been proposed for a main-memory structure (multi-probe LSH); however, this method is not optimized for wide datasets such as streaming data. The proposed approach takes advantage of different random bins and enhances LSH by reducing the failure rate of a similarity search.
Index Terms - NN; LSH; MapReduce; Probability; Theory; "Short Research Paper"

I. INTRODUCTION
The nearest neighbor method can be used in classification and regression problems. Nearest neighbor algorithms search various parts of a high-dimensional dataset by querying similar data points. The nearest neighbor algorithm is applied in many application domains with simple to complex structure; pattern recognition, image classification, text mining and information retrieval, marketing analysis, and DNA sequencing are just a few examples. Complexity is one of the weak points of nearest neighbor search. Searching for the exact nearest neighbor of a point by brute force takes O(N) running time, which is inefficient in a large space. Additionally, searching for the K nearest neighbors takes O(N log K), provided that a priority queue is used as the data structure. In both cases, O(N) is an inevitable part of the running time, which is not practical for large N. However, tree-based data structures enable more efficient pruning of the search space. R-tree [8], K-D-tree [2], SR-tree [11], X-tree [3], and M-tree [5] are indexing structures that return exact query results and take O(N log N) time to construct the tree. The query time of all tree-based structures is between O(log N) and O(N), depending on how well the tree was constructed. Although logarithmic query time is achieved under certain conditions on the distribution of points, the running time is still exponential in the dimension d. Therefore, the

complexity of tree-based methods relies heavily on how points are spread in a low-dimensional space. To avoid exponential running time for nearest neighbor search in a high-dimensional space, near-neighbor search methods are ideal for finding approximate nearest-neighbor pairs without looking at all pairs. Locality sensitive hashing (LSH) is one of the data structures that has recently been used in various research topics. This paper introduces a new enhanced LSH based on a distributed algorithm and shows how efficient this approach is.

II. RELATED WORK
Indyk et al. [9] proposed locality sensitive hashing (LSH) based on the idea that a random hash function $g$ exists on the space $\mathbb{R}^d$ with the following property. Assume $\mathcal{G} = \{g : \mathbb{R}^d \to \mathbb{Z}^k\}$ is a set of hash functions such that $g(v) = (h_1(v), \ldots, h_k(v))$, where the functions $h_i$ for $i \in [1, k]$ are drawn from an LSH function set $H = \{h : \mathbb{R}^d \to \mathbb{Z}\}$ called $(r, cr, p_1, p_2)$-sensitive, i.e., for any $q, v$:

$$\Pr(h(q) = h(v)) \ge p_1, \quad \text{when } \|q - v\| \le r \tag{1}$$

$$\Pr(h(q) = h(v)) < p_2, \quad \text{when } \|q - v\| > cr \tag{2}$$

where $c > 1$, $p_1 > p_2$, and $r$ and $cr$ are the decision and prune boundaries, respectively. Intuitively, the pair $q$ and $v$ is more likely to be hashed to the same value if their distance is within $r$, and less likely if their distance is greater than the prune value $cr$. Charikar [4] introduced "Hyperplane LSH", heavily inspired by Goemans et al. [7]; the method only works for a spherical space. Goemans's method is based on randomly splitting a sphere by a hyperplane. Multi-probe LSH [13] was proposed to maximize the chance of collision between a pair of near data points by binning the space with random buckets. This method merely queries the buckets that are more likely to contain relevant data points. A posteriori multi-probe LSH [10], an extension of Multi-probe LSH, takes advantage of the likelihood of each bucket, turning to the probability of containing similar


points incorporating prior knowledge obtained from training data. Both Multi-probe and a posteriori LSH tend to accept a higher false-negative rate in exchange for low storage overhead. Voronoi-based locality sensitive hashing [12] partitions the space into a Voronoi diagram by random hash functions; however, Voronoi-diagram partitioning of the space becomes computationally inefficient as the dataset grows. Although LSH Forest [1] attempts to enhance the indexing technique for high-dimensional data, it was designed only for Hamming distance [14], and using other distance functions remains challenging. Three approaches addressing the drawbacks of the previous work are presented in this paper. First, the false negatives of searching for similar points are reduced using multiple LSH tables. Second, distance functions are not used in this technique. Last, but not least, a distributed algorithm is introduced to reduce the query time.

III. PROBLEM SETUP
The objective is to search for near-neighbor data points in a given high-dimensional space using a global voting function V(., .) to obtain a similarity between a query q and a point v. Accordingly, the m near-neighbor search returns the m points that are the most similar objects (data points) among all the data in a dataset according to the global voting function V. Locality sensitive hashing, as a real competitor to the tree-based structures, can reduce the search space to a subset of data points binned by random buckets. The random buckets are constructed by random lines h in a coordinate space (or random hyperplanes h^d in a spherical space). A "score" for each point translates each line h (or hyperplane h^d) into a binary index. Finally, a hash table is created using an h-bit binary vector for each point as its bucket index.
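The bucket construction described in the problem setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: random hyperplanes through the origin are drawn with Gaussian normals, the "score" of a point is taken to be its dot product with each normal, and the helper names (`make_hyperplanes`, `bucket_index`, `build_table`) are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def bucket_index(point, hyperplanes):
    """Translate the sign of each score into one bit of the bucket index."""
    bits = []
    for normal in hyperplanes:
        score = sum(w * x for w, x in zip(normal, point))
        bits.append(1 if score >= 0 else 0)
    return tuple(bits)


def build_table(points, hyperplanes):
    """Hash table mapping each bucket index to the ids of the points it holds."""
    table = {}
    for idx, point in enumerate(points):
        table.setdefault(bucket_index(point, hyperplanes), []).append(idx)
    return table


points = [(1.0, 2.0), (1.1, 2.1), (-3.0, -1.0)]
planes = make_hyperplanes(num_bits=3, dim=2)
table = build_table(points, planes)
for bucket, ids in sorted(table.items()):
    print(bucket, ids)
```

A query is hashed with the same hyperplanes, and only the points stored under its bucket index are inspected, which is how the search space is reduced.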

IV. MOTIVATION
First and foremost, the traditional LSH methods mentioned above depend on some distance function, which restricts them to a very limited set of distance functions. Therefore, the ultimate purpose of the new method is to remove any distance function (including all metric distances) and still search for near neighbors in the space. The method is based only on the number of occurrences of each point in a set of multiple LSH tables. In essence, it finds the most repeated point as an approximate nearest neighbor (or, similarly, the most repeated points as the K near neighbors) after counting the global number of occurrences of each point. LSH methods run fast with an acceptable error rate. This work additionally improves the accuracy of near-neighbor search by reducing the probability of not finding a near neighbor, with a proof given below. False negatives can be decreased by constructing more hash tables with random buckets. Multi-table hashing tends to outperform single-table LSH by taking advantage of independent events. The running time increases as a result of building more hash tables; this is mitigated by a distributed multi-hash-table approach using MapReduce.

V. OLSH
Occurrence-based LSH is a fast approach for near-neighbor search in a high-dimensional dataset. The technique provides an acceptable error rate using multiple tables for locality sensitive hashing. In the LSH technique, as the number of buckets increases, the probability of not finding the near neighbor decreases. Accordingly, a smaller error rate is obtained as a result of using more hash tables.

Proof: Assume we search bins 1 bit off from the query.

$$\mathrm{LSH}_{h=1} = 1 - \Pr(\text{same bin}) - \Pr(\text{1 bin off}) = 1 - \Pr(\text{no split}) - \Pr(\text{1 split}) = 1 - (1-\delta)^{h} - h\delta(1-\delta)^{h-1} \tag{3}$$

$$\mathrm{LSH}_{h>1} = 1 - \Pr(\text{same bin})^{h+1} = 1 - \Pr(\text{no split})^{h+1} = 1 - (1-\delta)^{h+1} \tag{4}$$

where $\delta$ is the probability of a split.

VI. EVALUATION
In Fig. 1(c), the comparison between one hash table (blue) and multiple hash tables (red) indicates that the probability of not finding the near neighbor, shown by the red line, is lower than that shown by the blue line (when h = 3). Similarly, in Fig. 1(d), the gap between the blue line and the red line becomes more pronounced; therefore, we have a lower chance of not finding the near neighbor (a higher probability of finding the near neighbor) when h = 10. The higher chance of finding the near neighbor arises because the probability that similar points fall into different bins in every table decreases exponentially with the number of tables.

Algorithm 1 MR-MultiLSH-Tables
INPUT: Set of data points
OUTPUT: Top near-neighbor points

Class RandomBins
    method RndBin({x_1, ..., x_n}, h, b)
        for h = 1 to H
            Set b_i, ..., b_j to be distinct randomly binned inputs from x_1, ..., x_n
        end for

Class Mapper
    method Map(data(key), value)
        write (data(key), value)

Class Reducer
    method ReduceCount(data(key), value)
        write None, (sum(values), data(key))
    method ReduceMax(data(key), value)
        write max(values)
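The occurrence-counting idea behind Algorithm 1 can be sketched on a single machine as follows. This is an illustrative sketch, not the paper's MapReduce job: it assumes random-hyperplane buckets, runs the map and reduce steps as plain loops, and the names `signature` and `count_occurrences` are hypothetical.

```python
import random
from collections import Counter


def signature(point, planes):
    """Bucket index of a point: one bit per random hyperplane."""
    return tuple(
        1 if sum(a * x for a, x in zip(plane, point)) >= 0 else 0
        for plane in planes
    )


def count_occurrences(points, query, num_tables=5, num_bits=4, seed=0):
    """Count, per point, the number of tables in which it shares the query's bucket."""
    rng = random.Random(seed)
    dim = len(query)
    votes = Counter()
    for _ in range(num_tables):
        # Each table uses its own set of random hyperplanes.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
        query_bucket = signature(query, planes)
        for idx, point in enumerate(points):
            # "Map" step: emit (idx, 1) when the point collides with the query.
            if signature(point, planes) == query_bucket:
                votes[idx] += 1  # "ReduceCount" step: sum the emitted ones.
    return votes


points = [(1.0, 1.0), (-1.0, -1.0), (1.0, 0.9)]
votes = count_occurrences(points, query=(1.0, 1.0))
best = votes.most_common(1)  # "ReduceMax" step: the most frequently co-bucketed point
print(best)
```

No distance function is evaluated anywhere: a point is ranked only by how often it lands in the same bucket as the query, which is the ON/OFF style of detection the method relies on.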


(a) h = 1 hash table
(b) h = 3 hash tables
(c) Pr(h = 3)
(d) Pr(h = 10)

Fig. 1: Multiple LSH tables improve the near-neighbor search by reducing false negatives in the result, and MapReduce speeds up the search over the space. (a) Binning a hash table with random bins. (b) Multiple LSH tables for h = 3 tables. (c) Probability of not finding the near neighbor for one hash table (blue) and for multiple hash tables (red) when h = 3; the red curve is lower than the blue curve, indicating the lower error. (d) The higher the number of tables, the lower the error rate. (e) and (d) show the MapReduce schema and the algorithm, respectively [6].

Fig. 2: The MapReduce schema and the algorithm.

VII. CONCLUSION
The expectation of finding near neighbors in high-dimensional data increases exponentially fast when multiple hash tables are used. OLSH, as a new technique, takes advantage of this and reduces its time complexity using the MapReduce approach. Moreover, the limitation of relying on particular distance functions in older techniques is eliminated, because in OLSH near points are detected only in an ON/OFF mode.


REFERENCES
[1] M. Bawa, T. Condie, and P. Ganesan. LSH forest: Self-tuning indexes for similarity search. In Fourteenth International World Wide Web Conference (WWW 2005), 2005.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sept. 1975.
[3] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB ’96, pages 28–39, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 380–388, New York, NY, USA, 2002. ACM.
[5] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[6] E. Fox and C. Guestrin. Clustering and retrieval. In Machine Learning Specialization, Coursera, 2016.
[7] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, Nov. 1995.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. SIGMOD Rec., 14(2):47–57, June 1984.
[9] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM.
[10] A. Joly and O. Buisson. A posteriori multi-probe locality sensitive hashing. In Proceedings of the 16th ACM International Conference on Multimedia, MM ’08, pages 209–218, New York, NY, USA, 2008. ACM.
[11] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. SIGMOD Rec., 26(2):369–380, June 1997.
[12] T. L. Loi, J. P. Heo, J. Lee, and S. e. Yoon. VLSH: Voronoi-based locality sensitive hashing. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5345–5352, Nov 2013.
[13] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 950–961. VLDB Endowment, 2007.
[14] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov. Hamming distance metric learning. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1061–1069. Curran Associates, Inc., 2012.


SESSION POSTER PAPERS Chair(s) TBA


Design on Distributed Deep Learning Platform with Big Data
Mikyoung Lee1, Sungho Shin1, and Sa-Kwang Song1
1 Decision Support Technology Lab, KISTI, Daejeon, KOREA

Abstract - In this paper, we design a distributed deep learning platform for a model that predicts typhoon tracks by analyzing typhoon satellite images. Research on distributed big data processing and deep learning platforms is actively being carried out. As the demand for deep learning increases, a distributed platform is needed that can support large-scale operations on training data when implementing deep learning models with massive data. Our platform uses Docker and Kubernetes to manage the distribution of server resources, and Distributed TensorFlow and TensorFlow Serving to support distributed deep learning. We develop the wrapper libraries and modules needed for the typhoon track prediction model.
Keywords: Distributed Deep Learning Platform, Distributed Deep Learning, Distributed TensorFlow, Big Data Analytics

1 Introduction

In recent years, there has been a surge of interest in deep learning. In particular, using deep learning technology in the big data analysis process improves the accuracy of the analysis results. To increase the accuracy, both the deep learning model and the amount of data to be computed become large, so many computers are required. However, research on distributed deep learning frameworks is at an early stage. In this paper, we introduce a distributed deep learning platform that we designed. The platform is designed for efficient installation, management, and operation of the resources required to develop the typhoon track prediction model. This makes it possible to train a deep learning model using big data in a distributed processing environment. The design also considers the stability, convenience, and portability of the deep learning platform.

2 Related Work

2.1 Deep Learning Framework on Spark

This approach runs distributed deep learning on Apache Spark, a well-established distributed data processing framework. Developed by Seoul National University, DeepSpark allows distributed execution of Caffe deep learning jobs on an Apache Spark™ cluster of machines. It is designed to make large-scale parallel distributed deep learning jobs easy and intuitive for developers and data scientists [1]. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe-running nodes using Spark. SparkNet is a framework for training deep networks in Spark. It includes a convenient interface for reading data from Spark RDDs, a Scala interface to Caffe, and a lightweight multidimensional tensor library [3]. Both systems benefit from Spark, which is excellent for big data processing, and from the Caffe deep learning library, but performance optimization is restricted when Spark and Caffe interwork.

2.2 Distributed Deep Learning Framework

SINGA, from the National University of Singapore, is a distributed deep learning platform that provides different neural net partitioning schemes for training large models [4]. The SINGA architecture supports asynchronous and synchronous training frameworks. Veles [5], developed by Samsung, is a distributed machine learning platform. It supports artificial neural network learning and genetic algorithms as an integrated development platform for model development, training, and application. Microsoft CNTK [6] improves the performance of high-end parallel training using GPGPUs. Petuum is a large-scale distributed machine learning platform [7]. It considers both data and model parallelism, and has a key-value store and dynamic schedulers.

3 Design of Platform

3.1 Overview

We develop a distributed deep learning platform that helps to build deep learning models by training on massive data. The data that we mainly deal with are large-scale typhoon observation satellite images. This job must be performed in a large-scale distributed processing environment, with model-parallel and data-parallel processing, because training performs complex operations on a large amount of input data to generate a model. An overview of the platform is shown in Figure 1. The purposes of our platform are (1) to develop a platform capable of training and inference for large-scale deep learning models,


(2) to support efficient distributed deep learning using Distributed TensorFlow and TensorFlow Serving, and (3) to help efficiently install, utilize, and operate the necessary resources in a distributed computing environment. We aim for data and model parallelism and asynchronous updates of the parameter server, and we design the platform to support the distributed environment efficiently. We make use of open source libraries so that the development environment is easy to build and big data computation is convenient to perform.

Figure 1. Distributed Deep Learning Platform Overview

Figure 2. Platform Software Stack

3.2 Platform Configuration

Our cluster consists of 10 nodes, each with 128 GB of memory and two GPU accelerators. The platform is configured as shown in Figure 2. At the bottom is the system OS (CentOS Linux). The platform supports distributed parallel processing using Docker, a software container platform, and Kubernetes, a Docker container orchestration tool. These help with efficient installation, management, and operation of the distributed processing system. CUDA is a parallel computing platform used to perform deep learning operations on the GPU, and cuDNN is a GPU-accelerated library of primitives for deep neural networks. We use these libraries to make efficient use of the GPU and to enhance single-machine performance. We use a deep learning library, TensorFlow, to perform distributed deep learning. Distributed TensorFlow provides libraries and APIs for model parallelism, data parallelism, and synchronous and asynchronous training in distributed environments. The platform includes TensorFlow Serving to automatically deploy the trained model generated by TensorFlow, and deep learning wrapper libraries (Keras, TFLearn, TF-Slim, etc.) that efficiently support deep neural networks. We also develop a new KISTI wrapper library to support the functions needed to operate the typhoon track prediction model and to improve its performance. The platform uses Distributed TensorFlow to speed up training and inference on large-scale data through data-parallel and model-parallel processing, and uses Docker and Kubernetes to make distributed processing practical.
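The data-parallel training mentioned above can be illustrated with a toy example. The sketch below is a pure-Python illustration of synchronous data-parallel gradient descent with a simulated parameter server; it does not use the TensorFlow API and is not the platform's code. The one-parameter model, the learning rate, and the function names are assumptions chosen for the example.

```python
def shard_gradient(w, shard):
    """Mean-squared-error gradient for the 1-D model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train_data_parallel(shards, steps=50, lr=0.05):
    """Synchronous data-parallel SGD: each worker handles one shard per step."""
    w = 0.0  # parameter held by the (simulated) parameter server
    for _ in range(steps):
        # Each worker computes a gradient on its own shard of the data.
        grads = [shard_gradient(w, shard) for shard in shards]
        # The server averages the worker gradients and updates the parameter.
        w -= lr * sum(grads) / len(grads)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # two equally sized shards, one per worker
w = train_data_parallel(shards)
print(round(w, 6))  # 2.0
```

In an asynchronous variant, each worker would push its gradient to the server as soon as it is ready instead of waiting for the others, trading gradient freshness for throughput.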

4 Conclusions

Big data analysis using deep learning requires distributed processing across many computers, because there are many parameters to be learned and a large amount of training data, which requires considerable computation time. In this paper, we described the design of a big data-based distributed deep learning platform. The platform uses Docker and Kubernetes to efficiently support distributed processing, and supports distributed deep learning using the Distributed TensorFlow and TensorFlow Serving libraries. In the future, we will develop a deep learning wrapper library that supports large-scale CNN and RNN algorithms, which are mainly used in the typhoon track prediction model based on satellite image data, and will support training and inference of the typhoon track prediction model.

5 References

[1] H. Kim, J. Park, J. Jang, S. Yoon, “DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters”, arXiv:1602.08191v3, ACM KDD, 2016.
[2] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding”, MM’14 Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, Nov 2014.
[3] P. Moritz, R. Nishihara, I. Stoica, M. Jordan, “SparkNet: Training Deep Networks in Spark”, arXiv:1511.06051, ICLR, 2016.
[4] W. Wang et al., “SINGA: Putting Deep Learning in the Hands of Multimedia Users,” Proc. 23rd ACM International Conference on Multimedia, 2015, pp. 25-34.
[5] Veles, https://github.com/Samsung/veles/
[6] CNTK, https://github.com/Microsoft/CNTK
[7] E. Xing et al., “Petuum: A New Platform for Distributed Machine Learning on Big Data”, IEEE Transactions on Big Data, vol. 1, pp. 49-67, 2015.


Development of Road Traffic Analysis Platform Using Big Data
Sung, Hong Ki1 and Chong, Kyu Soo2
1,2 Korea Institute of Civil Engineering and Building Technology, Goyang, S. Korea

Abstract - This paper presents the development direction of a driving environment prediction platform based on road traffic big data. To that end, the levels and trends of the development of related big data platform technologies in Korea and overseas were analyzed, and the differentiation and expected effects of the platform developed in this study were examined. The suggested road driving environment prediction platform can provide traffic information services distinct from the existing systems, thereby improving general users’ and experts’ access to traffic data. It is expected to be usable in information-system-related industries and in markets in diverse fields. In addition, the platform can use public, sensor, and unstructured road traffic data to provide detailed road driving environment information, to improve drivers’ safety, and to improve the reliability of traffic prediction information.
Keywords: Road Traffic Big Data, Platform, Driving Environment, Vehicles’ Sensor Data, Real-time Data

1 Introduction

In the field of road traffic as well, data scalability has improved with the development of related technologies. With the development of sensor measurement technology for general personal vehicles as well as business vehicles such as buses, taxis, and trucks, observation has expanded from fixed-type technologies (e.g., loop detectors, image detectors, CCTVs) to mobile technologies (e.g., detection sensors, navigation, black boxes) [1]. Thus, the use of big data in the road traffic field is expected to become more active, and it is essential to develop an intelligent IT environment and platforms that can collect, store, and analyze various types of data, such as structured and unstructured data.

This paper therefore presents the development direction of a driving environment prediction platform based on road traffic big data, which meets the technology development trend and the social demand, as well as the corresponding expected effects. To that end, the levels and trends of the development of related big-data platform technologies in Korea and overseas were analyzed, and the differentiation and expected effects of the platform developed in this study were examined.

2 Differentiation of Technology

In this section, the development method and direction of the big-data platform developed in this study are presented. In particular, the platform is distinguished from the existing platforms and related technologies through a trend analysis of road traffic big-data platforms in Korea and overseas. In the Korean road traffic field, platforms for collecting and analyzing public data and related big data have been developed and are being provided. Such platforms, however, can process only the public data provided by the existing systems. A platform for the convergence and analysis of two or more kinds of data has yet to be developed. Thus, this study aimed to develop a platform capable of collecting, combining, and analyzing heterogeneous big data consisting of structured data (e.g., public data) and unstructured data (e.g., data obtained by the sensors of individual vehicles, social media data, etc.). This study thus sought to improve the reliability of the prediction and analysis of road driving situations through the development of a pertinent technology [2].

In addition, thanks to recent IT advances, the road traffic field is changing from an information collection system based on the existing fixed-type sensors (e.g., loop and image detectors) to one based on mobile-type sensors (e.g., smartphones and navigators). Thus, a platform was developed that is capable of collecting and analyzing real-time data obtained from individual vehicles’ sensors instead of using the fragmentary data provided by the existing systems. Real-time weather and traffic density data are used to achieve faster and more reliable prediction and analysis [3].

3 Development Methods

The development of the road driving environment prediction platform in this study was divided into the development of collection tools, storage tools, and analysis tools. First, to develop a platform distinct from the existing comparable technologies, the data to be gathered for the target platform had to be separated and selected. To that end, the target road traffic data were classified according to the collection methods to be used and their contents.


The big-data sources used for the development of the platform were vehicle sensors collecting road surface temperature, atmospheric temperature, precipitation, traffic density, and traffic speed data in real time. A data collection tool was developed for each type of data. Vehicle sensor data are collected through a tool built on a RESTful interface and Kafka. Public data are collected through an optimized Flume-based data collection agent. As the collected public data are limited in size and are collected in real time at least every 5 minutes, they are separated and stored according to the collection purpose using Flume. Vehicle sensor data are registered in the RESTful way using a REST server and a Kafka broker, because they need to be distributable and processable as the number of vehicles increases. Kafka provides a distributed processing function in an environment where large amounts of data are being input, thereby preventing data loss. The big-data storage tool was developed so that the collected big data can be classified (according to the data type or purpose of use) into HDFS (Hadoop Distributed File System) or NoSQL (Not Only SQL) for storage in the corresponding databases. Fig. 1 shows the conceptual process diagram of the tool developed in this study for collecting and storing traffic big data.

Fig. 1. Process of Data Store Tool

Before being stored in the integrated database, the vehicle sensor data and the public data undergo preprocessing in Spark so that all the collected data have identical temporal and spatial storage units. The road driving environment is then analyzed on this integrated database using the Zeppelin tool, and the analysis results are displayed and visualized on a Web-based GIS map. To process big data of diverse collection types, a phase-matching methodology was developed that converts the data to identical time and space units (5 minutes, standard link). Fig. 2 shows the conceptual development diagram of the traffic big-data analysis tool developed in this study.

Fig. 2. Concept Map of Road Traffic Analysis Platform
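The phase-matching step described above, which brings records of different collection types onto the same 5-minute, per-link grid, can be illustrated with a minimal sketch. The record layout `(link_id, timestamp, field, value)` and the use of averaging within each bucket are assumptions made for illustration; the paper does not state the aggregation rule, and the actual platform performs this step in Spark rather than in plain Python.

```python
from collections import defaultdict
from datetime import datetime

BUCKET_MINUTES = 5  # time unit stated in the text


def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute bucket."""
    floored = ts.minute - ts.minute % BUCKET_MINUTES
    return ts.replace(minute=floored, second=0, microsecond=0)


def phase_match(records):
    """Average each measurement per (link_id, 5-minute bucket).

    `records` is an iterable of (link_id, timestamp, field, value) tuples.
    This layout is hypothetical, not the platform's actual schema, and
    averaging is an assumed aggregation rule.
    """
    sums = defaultdict(lambda: [0.0, 0])  # running total and count
    for link_id, ts, field, value in records:
        cell = sums[(link_id, bucket_start(ts), field)]
        cell[0] += value
        cell[1] += 1
    merged = defaultdict(dict)
    for (link_id, start, field), (total, count) in sums.items():
        merged[(link_id, start)][field] = total / count
    return dict(merged)
```

Each key of the result identifies one standard link and one 5-minute interval, so sensor and public measurements that fall into the same cell end up in the same row.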

4 Conclusion and Expected Effects

The proposed road driving environment prediction platform can provide traffic information services distinct from those of existing systems, thereby improving access to traffic data for both general users and experts. It is expected to be usable in information-system-related industries and in markets in diverse fields. In addition, the platform can combine public, sensor, and unstructured road traffic data to provide detailed road driving environment information, improve driver safety, and increase the reliability of traffic prediction information. In the future, Open API based visualization platforms and related advanced platform technologies should be developed. It will also be necessary to build and apply various algorithms for customized driving environment analysis, and to develop more reliable traffic data platforms through a continuous verification process.

5 References

[1] Korea Embedded Software and System Industry Association, “KESSIA ISSUE REPORT”, Dec. 2014.
[2] S.R. Kim, M.M. Kang, “Today and the Future of Big Data Analytics Technology”, The Korean Institute of Information Scientists and Engineering, vol. 1, pp. 8-17, 2014.
[3] Korea Institute of Civil Engineering and Building Technology, “Development of Driving Environment Prediction Platform based on Big Data”, 2016.


SESSION LATE PAPERS - RECOMMENDATION SYSTEMS Chair(s) TBA


Context-Based Collaborative Recommendation System to Recommend Music

Zapata C. Santiago Ph.D., Escobar R. Luis, Águila G. Elías
Department of Computer Science, Engineering Faculty, Metropolitan Technological University, Santiago, Chile
[email protected], [email protected], [email protected], www.utem.cl

Abstract - A recommender system based on the collaboration of users in the communities of a social network can build collective knowledge that helps to automatically recommend lists of content to the users of a social platform, based on their behavior and preferences. The purpose of this paper is to implement an Android recommender system capable of providing users with songs that, without having been rated by the user, are estimated to be to their taste. To do this, the application is built around a Collaborative Recommendation System, with the system as a whole following a client/server architecture.

Keywords: Fuzzy Logic, Data Mining, Fuzzy Controller, Fuzzy Rules

1. Introduction

Information overload is becoming increasingly apparent as more and more data sources become accessible. On a daily basis, users face the challenge of choosing an application for their smartphone, or a good movie to watch at the weekend, from among thousands of options. This is the main motivating scenario for Recommendation Systems, a set of tools that seek to reduce the user's cognitive effort by studying patterns of behavior that allow predicting the choices a person would make among a set of items with which they have no previous experience.

2. Research Topic

Consider a Recommendation System for music, which is precisely our case study: a user with certain preferences could receive different recommendations depending on their situation, whether they are working, resting, performing physical exercise, etc. These are aspects of the user's context, which a recommendation system can take into account. Context-based (or context-aware) Recommendation Systems address these variable circumstances pertaining to the user's spatial and temporal environment: geographic location, day of the week, season, etc. In particular, the type of activity being carried out (working, resting, exercising, etc.) can be considered part of the context. In this research we focus on the problem of recommending music with a hybrid approach that combines collaborative filtering with a context-based approach, which mainly considers the type of activity that the user is performing: running, walking, sleeping, working, studying, shopping, etc.

3. Theoretical Framework

Recommendation Systems (SR) are techniques and software tools that provide suggestions of items likely to be of use to a user. The suggestions relate to various decision processes, such as what items to buy, what music to listen to, or what news to read online [2]. "Item" is the term used to denote what the system recommends to users. An SR typically focuses on a specific type of item (such as CDs or news) and, accordingly, its design, its GUI, and the technique used to generate recommendations are customized to provide effective and useful suggestions for that specific type of item. SRs are primarily aimed at individuals who lack sufficient personal experience or competence to evaluate the potentially overwhelming number of alternative items that a website, for example, can offer. A specific case is a book recommendation system that assists users in selecting a book to read. On the popular Amazon.com website, an SR is used to personalize the online store for each customer. Because the recommendations are personalized, different users receive different suggestions. There are also non-personalized recommendations. These are simpler to generate and typically appear in magazines or newspapers. Typical examples include top-ten selections of books, artists, etc. While they can be useful and effective in certain situations, non-personalized suggestions of this kind are not commonly the subject of research on recommendation systems.


3.1 Basic Scheme of a Recommendation System

Figure 1 Basic Scheme of a Recommendation System

The basic scheme of an SR consists of the following main elements:
• Database: The quality of the data stored in the database plays a fundamental role in the quality of the recommendations.
• User Profiles: A user shapes their personal profile as they use the system. The profile reflects the user's tastes and preferences, which are fundamental when discriminating between objects during recommendation.
• Predictions: Prediction plays a crucial role in the basic scheme of every SR. The prediction is based on the user's profile and on the information available in the database.

3.2 Data in the Recommendation Systems

When developing an SR, some important decisions must be made, such as the type of feedback used, the type of data to use, and how these data will be analyzed.

Feedback on Recommendation Systems

An SR should not be a static entity; the effectiveness of its suggestions must evolve over time, based on experience and on newly obtained information. This is achieved by applying feedback mechanisms between the system and the user's preferences. There are two such mechanisms: implicit feedback and explicit feedback.

Implicit Feedback

An implicit feedback mechanism provides the SR with information about user preferences without the user being aware of it. This feedback is not given directly but through measures such as the time spent viewing an object, the number of queries, etc. Its drawback is that it depends heavily on context and is quite hypothetical, since assumptions are made (based on the measures mentioned) about the user's tastes that are not necessarily true, which can lead to a wrong suggestion that does not meet the user's requirements.

Explicit Feedback

Explicit feedback is based on the direct and deliberate action of the user to indicate the objects of the system that interest them. This can be done by means of numerical votes or simply by indicating whether or not the object is to the user's liking. This type of feedback also presents problems, such as the user's willingness to provide it, the time it takes, and the veracity of the information the user enters into the system.

Real versus Synthetic Data

Another question is whether to choose a set of real data (compiled from real users on real objects) or a set of synthetic data (with no real basis, created specifically for the Recommendation System). The latter are easier to obtain, since surveys and other methods of collecting real information are avoided, but they are used only in the early stages of system development and are replaced by real data once enough has been accumulated.

Online Analysis versus Offline Analysis

It is important to decide whether to work on data online or offline. In offline analysis, a filtering technique or algorithm is used to make predictions on the data set, and the results of those predictions are evaluated using one or several error metrics. This type of analysis has the advantage of being fast and economical, but it has two important drawbacks: the problem of data scarcity, and the fact that the goodness of the prediction is the only result obtained. Online analysis, by contrast, yields more results, notably the performance of the participating users and their satisfaction. However, it is slower and more expensive than offline analysis.
3.3 Collaborative Recommendation Systems

Collaborative Recommendation Systems are those that make recommendations based solely on similarity among users; that is, they combine the ratings of objects, identify tastes shared among users on the basis of those ratings, and recommend objects liked by other users whose tastes are similar to those of the current user. The techniques used to develop the first collaborative filtering Recommendation Systems were based on data mining methods. They distinguished between a learning phase (offline), in which the model is learned, as in data mining, and a recommendation phase (online), in which the model obtained in the previous phase is applied to a real-life problem, thus producing recommendations for the system's users. Currently, however, this type of technique is not usually used: because users interact with the system continuously, it is more convenient to use a relaxed learning paradigm, in which the model is built and updated during system operation. The theoretical basis of Collaborative Recommendation Systems is simple: users with similar profiles form groups, and a user in a group is recommended objects that they have not yet experienced but that other users in the same group have experienced and rated positively.

Phases of a Collaborative SR

Figure 2 Schematic of a Collaborative Recommendation System

Currently, there are three fundamental stages in the operation of every Collaborative Recommendation System:
1. The system saves a profile of each user, consisting of that user's ratings of objects known to them that belong to the database being used.
2. Based on these profiles, the degree of similarity between users of the system is measured, and groups of users with similar characteristics are created.
3. The system uses the information obtained in the previous two steps to calculate the predictions. Each user is recommended objects that they have not previously rated and that have obtained the highest predicted values.

Measures and Techniques Used in Collaborative Filtering Algorithms

The first step in making these recommendations is to form groups of the users or items most similar to each other. For this purpose, we use similarity measures and a K-nn classification algorithm. A prediction technique is then used to estimate the user's rating of particular items. In this project, the cosine coefficient was chosen as the similarity measure for the implementation of the collaborative filtering algorithm.
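The pipeline outlined above (similarity measure, neighbor selection, prediction) can be sketched end to end for the item-based case. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ratings` layout (a dictionary of user to item to rating), the function names, and the choice to skip neighbors the user has not rated are all hypothetical. The similarity is the cosine coefficient computed only over users who rated both items, and the prediction is the weighted sum, both as defined later in this section.

```python
import math


def cosine_similarity(ratings, x, y):
    """Cosine coefficient between items x and y over users who rated both."""
    common = [u for u, r in ratings.items() if x in r and y in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][x] * ratings[u][y] for u in common)
    norm_x = math.sqrt(sum(ratings[u][x] ** 2 for u in common))
    norm_y = math.sqrt(sum(ratings[u][y] ** 2 for u in common))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def nearest_items(ratings, target, k):
    """Return the k items most similar to `target` as (item, similarity)."""
    items = sorted({i for r in ratings.values() for i in r})
    scored = [(i, cosine_similarity(ratings, target, i))
              for i in items if i != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def predict(ratings, user, target, k=2):
    """Weighted-sum prediction of `user`'s rating for item `target`.

    Neighbors the user has not rated are skipped (an assumption of this
    sketch); returns None when no rated neighbor is available.
    """
    rated = [(i, s) for i, s in nearest_items(ratings, target, k)
             if i in ratings[user]]
    denom = sum(abs(s) for _, s in rated)
    if denom == 0:
        return None
    return sum(s * ratings[user][i] for i, s in rated) / denom
```

With non-negative ratings every similarity is non-negative, so the prediction is a weighted average of the user's own ratings and stays within the rating scale.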


Notation

Before presenting the formulas that follow, it is convenient to make the notation clear, to avoid confusing the reader:
• A user is an element $u \in U = \{u_1, u_2, \ldots, u_n\}$.
• An item is an element $i \in I = \{i_1, i_2, \ldots, i_m\}$.
• The similarity between two items $i_j$ and $i_k$ is denoted $s(i_j, i_k)$.
• The rating given by a user $u_i$ to an item $i_j$ is denoted $r_{u_i, i_j}$.
• The prediction for a user $u_i$ on an item $i_j$ is denoted $P_{u_i, i_j}$.

K-nn Algorithm

A crucial step in building a quality Collaborative SR is the formation of groups of users (if it is a memory-based Collaborative SR) or of items (if it is model-based, as in this project) with similar characteristics. This task is a classification problem, and there are several techniques, called classifiers, for solving it. One of the most widespread classifiers is the K-nn algorithm, which is used in this project to form, for each item in the database, the group of items most similar to it. In more detail, the K-nn algorithm works as follows.

Operation: Given the element $i$ to be classified, select the $k$ elements $K = \{i_1, \ldots, i_k\}$ such that there is no example $i'$ outside $K$ with $d(i, i') < d(i, i_j)$, $j = 1, \ldots, k$. Once the $k$ neighbors are found, the classification can proceed in one of two ways:
• Majority vote: the new object is assigned the predominant class among the objects of $K$.
• Distance-weighted vote: the object is classified by weighting each object of $K$ according to its distance from the new object.

Description of the algorithm:
1. A new object $i_a$ appears.
2. The $k$ objects of the set $E$ closest to $i_a$ are obtained.
3. The object $i_a$ is classified in one of the two ways mentioned above.

Main features:
• It is robust against noise when $k$ is moderate ($k > 1$).
• It is very effective when the number of possible classes is high and when the data are heterogeneous or diffuse.
• It has a complexity of order $O(dn^2)$, where $O(d)$ is the complexity of the distance metric used.
• Because it uses the entire database rather than a model, it is inefficient in memory.
• It serves for both classification and numerical prediction.

Similarity Measures
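Before turning to the similarity measures themselves, the two K-nn voting rules described above can be made concrete with a short sketch. Euclidean distance and inverse-distance weights are assumptions chosen for illustration; the text does not fix either, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def k_nearest(query, examples, k):
    """Return the k (distance, label) pairs closest to `query`.

    `examples` is a list of (vector, label) pairs; Euclidean distance is
    an assumption made for this sketch.
    """
    scored = [(math.dist(query, vec), label) for vec, label in examples]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


def majority_vote(neighbors):
    """Pick the predominant class among the neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def distance_weighted_vote(neighbors, eps=1e-9):
    """Pick the class with the largest sum of inverse-distance weights."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1.0 / (dist + eps)  # eps avoids division by zero
    return max(weights, key=weights.get)
```

The two rules can disagree: one very close neighbor of a minority class may outweigh two distant neighbors of the majority class under the distance-weighted vote.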


To establish the similarity between objects, we must define a measure that allows us to evaluate their degree of similarity. It is important to note that in this project, when calculating the similarity between two items x and y, only those users who have rated both items are taken into account; all other users are ignored. There are many similarity measures in the literature; we review two of the most widely used.

Pearson's Correlation Coefficient: This coefficient is an index that measures the linear relationship between two quantitative variables. It is independent of the scale of measurement of these variables and gives a result in the interval [-1, 1]. It is calculated by the following expression:

$$S(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y, respectively.

Cosine Coefficient: This method assumes that two items are represented by vectors in space, so the similarity between them is given by the cosine of the angle they form. It is calculated by the following expression:

$$S(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where $x_i$ is the rating of item x by user i, $y_i$ is the rating of item y by user i, and n is the number of users who have rated both x and y.

Prediction Algorithms

After calculating the set of neighbors for each item, we must combine the ratings of this set to predict the user's rating of that item. Choosing the right prediction technique is the most crucial step of collaborative filtering. The choice of prediction algorithm depends on the nature of the data set, since each algorithm fits some data sets better than others. In our case, we use the technique called weighted sum. This method calculates the prediction for an item $i_a$ and a user $u_a$ as the sum of the ratings given by $u_a$ to the items similar to $i_a$, each rating being weighted by the corresponding similarity $s(i_a, i_h)$ between items $i_a$ and $i_h$. Its expression is given by:

$$P_{u_a, i_a} = \frac{\sum_{h=1}^{k} s(i_a, i_h)\, r_{u_a, i_h}}{\sum_{h=1}^{k} |s(i_a, i_h)|}$$

where $K = \{i_1, \ldots, i_k\}$ is the set of the k items most similar to $i_a$. This technique tries to capture how the active user rates items similar to the one whose rating is to be predicted. Dividing by the sum of the absolute similarities normalizes the weighted ratings so that the prediction falls within the previously defined rating range.

4. Software Engineering

The project consists of the development of an Internet-based recommendation system application built on a hybrid of a collaborative filtering algorithm and a context-based algorithm, which recommends songs to users and follows a client/server architecture. This music recommendation application consists of three fundamental components, which are the three basic pillars of the development of the project:
• A database, which includes all data related to system users, songs, user ratings of songs, and predictions.
• An application interface, in the form of an Android application, from which users access the system, listen to and rate songs, and receive music recommendations as playlists.
• A collaborative filtering algorithm combined with a context-based algorithm, which is responsible for calculating predictions from the rating, user, and song data in the database.

Functional requirements

Because this is an academic project, no client was available from whom to obtain the requirements, so we based them on other successful music Recommendation Systems on the market. The functional requirements are:
Logging into the system: The system must provide a mechanism for the user to register in the system, so that a personalized service can be offered.
View Personalized Songs Menu: The system must be able to offer a series of songs to the user based on the tastes and preferences indicated at the time of registration.
View Recommended Songs Menu: The system must be able to offer the complete list of recommended songs, whether obtained by collaborative filtering or based on the user's choices.
Mark songs as favorites: The system must allow the user to mark songs as favorites, or to unmark them if they are already favorites.
Updating the recommendation model: The system must update its recommendation model from time to time, incorporating the new ratings that users have given to songs.

Non-Functional Requirements

Since our project is based on a client/server architecture, it is necessary to distinguish between the requirements of the client device and those of the server. In addition, a development computer with characteristics similar to those of the server is needed. The client requirements are quite simple: only a mobile device with the Android 4.1 operating system and an Internet connection (preferably Wi-Fi, although a mobile data connection can also be used) is necessary. Since the server side is supported by cloud services, the server requirement is a basic SQL account in Microsoft's Azure cloud services.

Interface Requirements

The requirements of the graphical interface between the application and the user are closely linked to usability and its principles [10]. The basic principles of usability, which are associated with the non-functional requirements that the graphical interface must meet, are:
Ease of learning: The features of the interface that allow new users to understand how to use it initially and how to reach a maximum degree of productivity.
Flexibility: There must be several ways in which the system and the user can exchange information.
Robustness: The level of reliability of the system, or the degree to which the system is able to tolerate failures during the interaction.

System Analysis

Once the purpose of the software project, its properties, and the constraints to which it is subject are known, it is time to analyze the system and create a model of it that is correct, complete, consistent, clear, and verifiable. For this, the use cases are defined according to the previously obtained requirements and, afterwards, the main scenarios and event flows of these use cases are described [11].

General Model

The Android application connects to the Last.Fm API to get information about the user and their preferences. The API in turn receives information from the application when a new user who does not have a Last.Fm profile is created. In addition, for each registration and recommendation that is created, the application leaves a record in the database stored in the Azure cloud service. The database provides song information and user preferences so that Last.Fm's recommendation algorithms can be used with greater accuracy at the time of recommendation.
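The connection to the Last.Fm API described above amounts to sending a method name and its arguments to the API root URL. The sketch below only builds such a request URL and does not send it. It assumes the public root URL and the `track.getInfo` method documented by Last.fm; the helper name, the placeholder API key, and the example artist and track are hypothetical, and the paper does not show its own request code.

```python
from urllib.parse import urlencode

API_ROOT = "http://ws.audioscrobbler.com/2.0/"  # Last.fm API root URL


def build_request_url(method, api_key, **params):
    """Build the URL for a Last.fm API call (the request is not sent).

    `api_key` must be a key issued by Last.fm; the response format is
    left at the API default.
    """
    query = {"method": method, "api_key": api_key, **params}
    return API_ROOT + "?" + urlencode(query)


url = build_request_url(
    "track.getInfo",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
    artist="Metallica",      # example values chosen for illustration
    track="One",
)
```

Keeping URL construction separate from the network call makes it easy to log each request alongside the record that the application stores in its own database.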


Figure 3 General Model MusicRecommender

Programming languages used

Client side

Java: An object-oriented language whose strength lies in the compilation of applications to intermediate code, or bytecode. This allows an application to run on multiple platforms, independently of the hardware, through a Java virtual machine. Java is the language used to develop the logic of Android applications.
XML: A markup language for documents that allows information to be stored in a structured way. Android uses it as a standard for the development of user interfaces.

Server side

Last.fm API: The Last.fm API allows calling methods that respond in XML, in REST style. In general terms, a method is sent together with specific arguments to the root URL. The API supports multiple transport formats but uses XML by default.
SQL: A database access language that exploits the flexibility and power of relational systems and allows a wide variety of operations. It is a "high-level" or "non-procedural" declarative language that, thanks to its strong theoretical basis and its orientation toward handling sets of records (rather than individual records), allows high productivity in coding and object orientation. In this way, a single statement can be equivalent to one or more programs written in a low-level, record-oriented language.

Development tools

This project has two parts, a client and a server, each of which, because of its language, required different development tools. The client side, an Android application written in Java, was implemented in the Android Studio development environment. Android Studio is an integrated development environment for the Android platform. It is based on JetBrains' IntelliJ IDEA software and is published free of charge under the Apache 2.0 License. On the server side, it was necessary to write the statements that create the database, for which we used Microsoft SQL Server Express 2012.


The development language used (from the command line or through the Management Studio graphical interface) is Transact-SQL (T-SQL), an implementation of the ANSI SQL standard, used to manipulate and retrieve data (DML) and to create tables and define relationships between them (DDL).

Designing the data

The structure of each of the information elements of the system, that is, the structure of the data to be worked on, was determined. These elements are:
• The songs, for which we know the id of the song in Last.Fm, the name of the artist or performer of the song, and the musical genre to which the song belongs.
• The users, for whom we know their identifier, name, password, and e-mail address.

Once the information elements of the system are determined, their representations must be obtained in the form of database tables. To do this, a conceptual design of the database must first be carried out to obtain the required tables. To implement this conceptual design, the Entity-Relationship (E-R) model is used.

Conceptual scheme

We need to turn our information elements into entities or relationships. In our case, Songs, Users, and Last.Fm accounts become entities of our conceptual scheme, and Ratings become a relationship that joins the Songs entity with the Users entity. We can identify a total of three tables in the database, given that each entity of the conceptual scheme is transformed into a table and the attributes of an entity become the fields of the corresponding table. Therefore, according to the conceptual scheme, we obtain the following tables:

USER
A table containing one row per user registered in the service, with the following attributes:
UID: Integer. Primary key. Unique numeric identifier of the user.
LID: Integer. Foreign key. Identifier that relates the user to their Last.Fm account.
NAME: String of 60 characters. User name.
KEY: String of 32 characters. User password, encoded using an encryption algorithm.
MAIL: String of 64 characters. User's e-mail address.

LASTFM
A table containing the following fields:
LID: Integer. Primary key. User ID.
USERNAME: Text string. Name of the user on the Last.Fm platform.
USERPASS: Text string. User password on the Last.Fm platform.

SONG
A table containing one row per song registered in the service, with the following attributes:
CID: Integer. Primary key. Unique numeric identifier of the song.
NAME: Text string. Name of the recommended song.
ARTIST: Text string. Author of the song.
ALBUM: Text string. Album to which the song belongs.
GENDER: Text string. Genre to which the song belongs. It can take the following values: metal, disco, world, poprock, jazz, indie, classical, reggae, techno, industrial, pop, hiphop, lounge, country, latin, rock, electronic, dance.

There is, however, one exception. Since we use the Recommendation library [11], several more tables must be added in order to obtain the recommendations. Specifically, they are:
• Table track_getinfo (song features).
• Table track_getdown (song predictions).
• User_getinfo table (user characteristics).
• User_getfriends table (recommendations for users).

5. Results

For this test, a Google Nexus 5 smartphone (API Level 21) was used. The application, called MusicRecommender, was run and a user named "eluntux" with key "2141" was created. In addition, the Rock and Metal styles were selected for the test, so that the system would deliver an initial recommendation. When the created user logs into the system, a list of recommendations appears, and each song can be viewed in detail.

Figure 4 MusicRecommender Main Screen


Figure 5 Selected Song Display

6. Conclusions

An Android application of a Recommendation System has been developed that is capable of providing users with songs that, without having been rated by the user, are estimated to be to their liking. To this end, the application was developed around a Collaborative Recommendation System, with the system as a whole following a client/server architecture. From the very beginning of the project, the intention was to create a service that would allow any type of application to access the music recommendation system. In this way, different users from different systems register in it, listen to different songs by different artists and from different musical genres, and give a simple rating to the songs. Based on these preferences, the Recommendation System creates a user profile and offers the user a series of recommended songs, according to the tastes of that user and of other users with similar tastes. For the project we compiled a set of music data. We chose to generate a database of real music albums, so that the prototype version of our system has 1652 songs from 5 different musical genres. The server does not host any music files at any time; they are requested directly from the music database. First, the properties that the system had to satisfy, as well as the constraints to which it was subject, were determined. Next, a correct, complete, consistent, clear, and verifiable system model was created. Finally, this model was coded in a prototype version and installed on the server.


7. References

[1] Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (Eds.), Recommender Systems Handbook, Springer, US (2011), pp. 1-38.
[2] F. Moya, Proyecto Fin de Carrera, supervisors: Luis Martínez and Iban Palomares, Universidad de Jaén, Spain, 2014.
[3] Adomavicius, G., Tuzhilin, A.: Context Aware Recommender Systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (Eds.), Recommender Systems Handbook, Springer, US (2011), pp. 217-256.
[4] Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6), 734–749 (2005).
[5] Avila Mejia, O.: "Android", http://www.izt.uam.mx/newpage/contactos/revista/83/pdfs/android.pdf. Last access: July 2015.
[6] Cho, Y.H., Kim, J.K.: "A personalized recommender system based on Web usage mining and decision tree induction", Expert Systems with Applications, Volume 23, Issue 3, 1 October 2002, pp. 329-342.
[7] Arazy, O., Kumar, N., Shapira, B.: Improving social recommender systems. IT Professional 11(4), 38–44 (2009).
[8] Open Handset Alliance, http://www.openhandsetalliance.com/. Last access: July 2015.
[9] Jackson, W.: "Android Apps for Absolute Beginners", Apress, 2012.
[10] Android Dashboards, http://developer.android.com/about/dashboards/index.html#Platform. Last access: July 2015.
[11] Last.FM API documentation, http://www.last.fm/api/intro. Last access: May 2016.
[12] What is Software Engineering, https://otroblogsobretics.wordpress.com/2011/02/20/Of-which-is-the-engineering-of-software. Last access: August 2016.
[13] Design Topics in Human Computer Interaction, http://www.proyectolatin.org/books/Temas_de_Dise%C3%B1o_en_Interacci%C3%B3n_Humano_Computadora_CC_BY-SA_3.0.pdf. Last access: August 2016.


Author Index

Aguila G, Elias - 69
Ahn, SinYeong - 53
Baek, GuiHyun - 53
Capuzo, Francisco - 25
Chong, Kyu Soo - 65
Congjing, Ran - 31
Coutinho, Thiago - 25
Escobar R, Luis - 69
Haiying, Huang - 31
Hong, Weihu - 3
Jackson, Lethia - 9
Jo, Taeho - 37
Kim, SuKyoung - 53
Kim, SungEn - 53
Lee, Mikyoung - 63
Nguyen, Thuan - 43
Ojewale, Mubarak - 14
Okorafor, Ekpe - 14
Qu, Junfeng - 3
Reis, Maria Luiza - 25
Santos, Lucas - 25
Shin, Sungho - 63
Song, Sa-Kwang - 63
Song, Yinglei - 3
Sung, Hong Ki - 65
Teymourlouei, Haydar - 9
Toutiaee, Mohammadhossein - 57
Wang, Yong - 18
Xinlai, Li - 31
Zapata, Santiago - 69
Zhang, Ping - 18