English Pages 315 Year 2019
COMPUTING AND RANDOM NUMBERS
Edited by: Zoran Gacovski
ARCLER Press
www.arclerpress.com
Computing and Random numbers Zoran Gacovski
Arcler Press 2010 Winston Park Drive, 2nd Floor Oakville, ON L6H 5R7 Canada www.arclerpress.com Tel: 001-289-291-7705 001-905-616-2116 Fax: 001-289-291-7601 Email: [email protected] e-book Edition 2019 ISBN: 978-1-77361-625-4 (e-book)
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated. Copyright for individual articles remains with the authors as indicated and published under Creative Commons License. A Wide variety of references are listed. Reasonable efforts have been made to publish reliable data and views articulated in the chapters are those of the individual contributors, and not necessarily those of the editors or publishers. Editors or publishers are not responsible for the accuracy of the information in the published chapters or consequences of their use. The publisher assumes no responsibility for any damage or grievance to the persons or property arising out of the use of any materials, instructions, methods or thoughts in the book. The editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify. Notice: Registered trademark of products or corporate names are used only for explanation and identification without intent of infringement. © 2019 Arcler Press ISBN: 978-1-77361-502-8 (Hardcover) Arcler Press publishes wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
DECLARATION Some content or chapters in this book are open access copyright free published research work, which is published under Creative Commons License and are indicated with the citation. We are thankful to the publishers and authors of the content and chapters as without them this book wouldn’t have been possible.
ABOUT THE EDITOR
Dr. Zoran Gacovski earned his PhD degree at the Faculty of Electrical Engineering, Skopje. His research interests include intelligent systems and software engineering, fuzzy systems, graphical models (Petri, neural, and Bayesian networks), and IT security. He has published over 50 journal and conference papers and has served as a reviewer for renowned journals. He was awarded a Fulbright postdoctoral fellowship (2002) for a research stay at Rutgers University, USA. He has also earned the best-paper award at the Baltic Olympiad for Automation Control (2002), a US NSF grant for research in human-computer interaction at Rutgers University, USA (2003), and a DAAD grant for a research stay at the University of Bremen, Germany (2008). He is currently a professor of Computer Engineering at European University, Skopje, Macedonia.
TABLE OF CONTENTS
List of Contributors
List of Abbreviations
Preface

SECTION I RANDOM NUMBER GENERATORS

Chapter 1 Implementation of Hardware-Accelerated Scalable Parallel Random Number Generators
  Abstract
  Introduction
  Implementation
  Results
  Conclusions
  Acknowledgments
  References

Chapter 2 A Hardware Efficient Random Number Generator for Nonuniform Distributions with Arbitrary Precision
  Abstract
  Introduction
  Related Work
  The Inversion Method
  Generating Floating Point Random Numbers
  Synthesis Results and Quality Test
  Conclusion
  Acknowledgment
  References

Chapter 3 True-Randomness and Pseudo-Randomness in Ring Oscillator-Based True Random Number Generators
  Abstract
  Introduction
  Ring Oscillators and Timing Jitter
  Simulation and Experimental Setup
  Comparison of the Generators’ Behavior in Simulations and in Hardware
  Impact of the Size and Type of the Jitter on the Quality of the Raw Bit-Stream
  Mutual Dependence of Rings and Its Impact on the Quality of the Raw Bit-Stream
  Discussion
  Conclusions
  References

Chapter 4 Random Numbers Generated from Audio and Video Sources
  Abstract
  Introduction
  Related Works
  The Proposed Scheme: AVRNG with Filter
  Conclusions
  References

SECTION II RANDOM VARIABLES TECHNIQUES

Chapter 5 Distribution of the Maximum and Minimum of a Random Number of Bounded Random Variables
  Abstract
  Introduction
  Standard Uniform Geometric (SUG) Model
  The Correlated Standard Uniform Geometric (CSUG) Model
  Parameter Estimation
  A Summary of Some Other Models
  Conclusion
  Acknowledgements
  References

Chapter 6 Random Route and Quota Sampling: Do They Offer any Advantage over Probability Sampling Methods?
  Abstract
  Introduction
  Distribution Comparison
  Discussion
  Conclusion
  References

Chapter 7 Equivalent Conditions of Complete Convergence for Weighted Sums of Sequences of Negatively Dependent Random Variables
  Abstract
  Introduction
  Main Results
  Proofs of the Main Results
  Acknowledgments
  References

Chapter 8 Strong Laws of Large Numbers for Fuzzy Set-Valued Random Variables in Gα Space
  Abstract
  Introduction
  Preliminaries on Set-Valued Random Variables
  Main Results
  Acknowledgements
  References

SECTION III COMPUTING APPLICATIONS WITH RANDOM VARIABLES

Chapter 9 Comparison of Multiple Random Walks Strategies for Searching Networks
  Abstract
  Introduction
  Single Random Walks on Networks
  Multiple Random Walks on Networks
  Simulations and Discussions
  Conclusions
  Acknowledgments
  References

Chapter 10 Fuzzy C-Means and Cluster Ensemble with Random Projection for Big Data Clustering
  Abstract
  Introduction
  Preliminaries
  Random Projection
  FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach
  Experiments
  Conclusion and Future Work
  Acknowledgments
  References

SECTION IV SIMULATIONS WITH RANDOM NUMBERS AND VARIABLES

Chapter 11 Social Emotional Optimization Algorithm with Random Emotional Selection Strategy
  Introduction
  Standard Social Emotional Optimization Algorithm
  Random Emotional Selection Strategy
  Simulation
  Conclusion
  Acknowledgement
  References

Chapter 12 Lymph Diseases Prediction Using Random Forest and Particle Swarm Optimization
  Abstract
  Introduction
  Feature Selection
  Random Forest Ensemble Classification Algorithm
  Simple Random Sampling
  Performance Measures
  Experimental Study
  Conclusion
  Acknowledgements
  References

Chapter 13 Challenges of Internal and External Variables of Consumer Behaviour towards Mobile Commerce
  Abstract
  Introduction
  Mobile Commerce and Challenges of Mobile Commerce
  Research Model Formulation
  Findings
  Hypothesis Testing
  Conclusions & Recommendations
  References

Index
LIST OF CONTRIBUTORS

JunKyu Lee
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA

Gregory D. Peterson
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA

Robert J. Harrison
Department of Chemistry, University of Tennessee, Knoxville, TN 37996, USA

Robert J. Hinde
Department of Chemistry, University of Tennessee, Knoxville, TN 37996, USA

Christian de Schryver
Microelectronic Systems Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Daniel Schmidt
Microelectronic Systems Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Norbert Wehn
Microelectronic Systems Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Elke Korn
Stochastic Control and Financial Mathematics Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Henning Marxen
Stochastic Control and Financial Mathematics Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Anton Kostiuk
Stochastic Control and Financial Mathematics Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Ralf Korn
Stochastic Control and Financial Mathematics Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany

Nathalie Bochard
CNRS, UMR5516, Laboratoire Hubert Curien, Université de Lyon, 42000 Saint-Etienne, France

Florent Bernard
CNRS, UMR5516, Laboratoire Hubert Curien, Université de Lyon, 42000 Saint-Etienne, France

Viktor Fischer
CNRS, UMR5516, Laboratoire Hubert Curien, Université de Lyon, 42000 Saint-Etienne, France

Boyan Valtchanov
CNRS, UMR5516, Laboratoire Hubert Curien, Université de Lyon, 42000 Saint-Etienne, France

I-Te Chen
Department of Healthcare Administration & Medical Informatics, Kaohsiung Medical University, 100 Shih-Chuan 1st Road, Kaohsiung 80708, Taiwan

Jie Hao
Department of Statistics and Analytical Sciences, Kennesaw State University, Kennesaw, USA

Anant Godbole
Department of Mathematics and Statistics, East Tennessee State University, Johnson City, USA

Vidal Díaz de Rada
Departamento de Sociología, Public University of Navarre, Pamplona, Spain

Valentín Martínez Martín
Centro de Investigaciones Sociológicas, Madrid, Spain

Mingle Guo
School of Mathematics and Computer Science, Anhui Normal University, Wuhu 241003, China

Lamei Shen
College of Applied Sciences, Beijing University of Technology, Beijing, China

Li Guan
College of Applied Sciences, Beijing University of Technology, Beijing, China

Zhongtuan Zheng
Shanghai University of Engineering Science, Shanghai 201620, China
Nanyang Technological University, Singapore 639798

Hanxing Wang
Shanghai Lixin University of Commerce, Shanghai 201620, China
Shanghai University, Shanghai 200444, China

Shengguo Gao
Shanghai University of Engineering Science, Shanghai 201620, China

Guoqiang Wang
Shanghai University of Engineering Science, Shanghai 201620, China

Mao Ye
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China

Wenfen Liu
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China

Jianghong Wei
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China

Xuexian Hu
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China

Zhihua Cui
Complex System and Computational Intelligence Laboratory, Taiyuan University of Science and Technology, China
State Key Laboratory of Novel Software Technology, Nanjing University, China

Yuechun Xu
Complex System and Computational Intelligence Laboratory, Taiyuan University of Science and Technology, China

Jianchao Zeng
Complex System and Computational Intelligence Laboratory, Taiyuan University of Science and Technology, China

Waheeda Almayyan
Department of Computer Science and Information Systems, College of Business Studies, Public Authority for Applied Education and Training, Kuwait City, Kuwait

Arif Sari
Department of Management Information Systems, Girne American University, Kyrenia, Cyprus

Pelin Bayram
Department of Business Management, Girne American University, Kyrenia, Cyprus
LIST OF ABBREVIATIONS

ACO         Ant colony optimizer
API         Application Programming Interface
CMRG        Combined Multiple Recursive Generator
CAD         Computer-Aided Diagnosis
CSUG        Correlated Standard Uniform Geometric
CSPs        Critical security parameters
DVRNG       Dual-Video Random Number Generator
ESS         European Social Survey
EO          Extremal Optimization
FPGAs       Field Programmable Gate Arrays
FPT         First passage time
FCM         Fuzzy c-means
FRI         Fuzzy Rand Index
GBN         General Bayesian network
VHDL        VHSIC (Very High-Speed Integrated Circuit) Hardware Description Language
HASPRNGs    Hardware-Accelerated Scalable Parallel Random Number Generators library
HPRC        High-Performance Reconfigurable Computing
HTTP        Hyper Text Transfer Protocol
IGR         Information Gain Ratio attribute evaluation
ICDF        Inverse cumulative distribution function
LLNs        Laws of Large Numbers
LAB         Logic array block
LUT         Lookup tables
MCC         Matthews Correlation Coefficient
MFPT        Mean first passage time
MPRW        Multiple preferential random walks
MRW         Multiple random walks
MSRW        Multiple simple random walks
MLFG        Multiplicative Lagged Fibonacci Generator
INE         National Institute of Statistics
NQD         Negatively quadrant dependent
PPRNGs      Parallel Pseudorandom Number Generators
PSO         Particle swarm optimization
PDAs        Personal digital assistants
PRW         Preferential random walks
PRNGs       Pseudorandom Number Generators
RNG         Random number generator
ROC         Receiver Operator Characteristic
RC          Reconfigurable Computing
RC MC       Reconfigurable Computing Monte Carlo
SPRNGs      Scalable Parallel Random Number Generators
SRW         Simple random walks
SFQ         Single-flux-quantum
SVD         Singular value decomposition
SIDS        Small Island Developing States
SEOA        Social emotional optimization algorithm
STS         Statistical Test Suite
SAV         Sum of absolute values
SI          Swarm intelligence
SU          Symmetric uncertainty correlation based measure
TRNGs       True random number generators
TL          Turkish Liras
VRNG        Video Random Number Generator
WEKA        Waikato Environment for Knowledge Analysis
WAP         Wireless Application Protocol
PREFACE
Random numbers are of great importance in today’s interconnected world. Before turning to the algorithms that generate them, we should first discuss where they are applied.

• Simulations - When we simulate natural processes with a computer, random numbers are needed to make the simulation more realistic. Simulations are used in many areas, from nuclear physics (where the decay of a radioactive substance occurs randomly) to operational research (where the time until the next customer arrives in a shop is unpredictable).
• Choosing a sample test - It is often impractical to investigate all possible cases that may occur; a random choice of a smaller number of cases provides a typical sample that represents the entire set.
• Numerical analysis - Some hard mathematical problems can be solved relatively easily and with good precision with the help of random numbers. One simple example is the approximate calculation of the area of a circle without using the area formula: we generate random point coordinates and count how many of the points fall inside the circle.
• Programming - Randomly selected data is a good source of input for efficient testing of a program.
• Making decisions - When one has to choose among several equally good decisions, the choice is usually made at random. For example, a chess program that always plays the same opening, and always plays the same moves in the same position, would not be considered good. Where several different moves are possible, it is recommended to select one of them at random.
• Cryptography - In cryptography, random numbers play a crucial role.
• Fun - In the gaming world, randomness has always been present, from gambling and throwing dice to shuffling cards and playing computer games.
• Judicial system - In the judicial system of the United States, jury members play an important part in court proceedings. They are chosen at random from among the citizens who have the right to vote, and their decision affects or determines the judgment. Because they are chosen randomly, a good source of random numbers is necessary.

This edition covers different topics from computing with random numbers, including random number generators, random variables techniques, computing applications of random numbers, and modeling and simulation applications of random variables.

Section 1 focuses on random number generators, describing hardware-accelerated scalable parallel random number generators, a hardware-efficient random number generator for non-uniform distributions with arbitrary precision, true randomness and pseudo-randomness in ring oscillator-based true random number generators, and random numbers generated from audio and video sources.

Section 2 focuses on random variables techniques, describing the distribution of the maximum and minimum of a random number of bounded random variables, whether random route and quota sampling offer any advantage over probability sampling methods, equivalent conditions of complete convergence for weighted sums of sequences of negatively dependent random variables, and strong laws of large numbers for fuzzy set-valued random variables in Gα space.

Section 3 focuses on computing applications of random numbers, describing a comparison of multiple random walks strategies for searching networks, and fuzzy c-means and cluster ensemble with random projection for big data clustering.

Section 4 focuses on simulation applications of random variables, describing a social emotional optimization algorithm with random emotional selection strategy, lymph diseases prediction using random forest and particle swarm optimization, and challenges of internal and external variables of consumer behaviour towards mobile commerce.
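The numerical-analysis example mentioned in this preface (estimating the area of a circle by counting random points that fall inside it) can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_circle_area`, its parameters, and the use of Python's standard `random` module are assumptions made here and do not come from any chapter of the book.

```python
import random


def estimate_circle_area(radius, n_points, seed=0):
    """Estimate the area of a circle by sampling points in its bounding square.

    Hypothetical helper for illustration; not taken from the book.
    """
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    inside = 0
    for _ in range(n_points):
        # Draw a point uniformly from the square [-radius, radius]^2.
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2.0 * radius) ** 2
    # The fraction of points inside the circle approximates the area ratio.
    return square_area * inside / n_points


if __name__ == "__main__":
    print(estimate_circle_area(1.0, 100_000))
```

For a unit circle the estimate approaches π as the number of points grows; the error shrinks roughly in proportion to the inverse square root of the number of points, which is why such methods need large quantities of random numbers.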
SECTION I RANDOM NUMBER GENERATORS
CHAPTER 1
IMPLEMENTATION OF HARDWARE-ACCELERATED SCALABLE PARALLEL RANDOM NUMBER GENERATORS
JunKyu Lee1, Gregory D. Peterson1, Robert J. Harrison2, and Robert J. Hinde2
1 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
2 Department of Chemistry, University of Tennessee, Knoxville, TN 37996, USA
ABSTRACT
The Scalable Parallel Random Number Generators (SPRNGs) library is widely used in computational science applications such as Monte Carlo simulations since SPRNG supports fast, parallel, and scalable random number generation with good statistical properties. In order to accelerate SPRNG, we develop a Hardware-Accelerated version of SPRNG (HASPRNG) on the Xilinx XC2VP50 Field Programmable Gate Arrays (FPGAs) in the Cray XD1 that produces identical results. HASPRNG includes the reconfigurable logic for FPGAs along with a programming interface which performs integer random number generation. To demonstrate HASPRNG for Reconfigurable Computing (RC) applications, we also develop a Monte Carlo π-estimator for the Cray XD1. The RC Monte Carlo π-estimator shows a 19.1× speedup over the 2.2 GHz AMD Opteron processor in the Cray XD1. In this paper we describe the FPGA implementation for HASPRNG and a π-estimator example application exploiting the fine-grained parallelism and mathematical properties of the SPRNG algorithm.

Citation: JunKyu Lee, Gregory D. Peterson, Robert J. Harrison, and Robert J. Hinde, “Implementation of Hardware-Accelerated Scalable Parallel Random Number Generators,” VLSI Design, vol. 2010, Article ID 930821, 11 pages, 2010. doi:10.1155/2010/930821.

Copyright: © 2010 JunKyu Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
Random numbers are required in a wide variety of applications such as circuit testing, system simulation, game-playing, cryptography, evaluation of multiple integrals, and computational science Monte Carlo (MC) applications [1]. In particular, MC applications require a huge quantity of high-quality random numbers in order to obtain a high-quality solution [2–4].

To support MC applications effectively, a random number generator should have certain characteristics [5]. First, the random numbers must maintain good statistical properties (e.g., no biases) to guarantee valid results. Second, the generator should have a long period. Third, the random numbers should be reproducible. Fourth, the random number generation should be fast, since generating a huge quantity of random numbers requires substantial execution time. Finally, the generator should require little storage, leaving the rest of the storage resources to the MC application.

Many MC applications are embarrassingly parallel [6]. To exploit this parallelism, such applications require Parallel Pseudorandom Number Generators (PPRNGs) to achieve fast random number generation [2, 6, 7]. Random numbers from a PPRNG should be statistically independent of each other to guarantee a high-quality solution. Many common Pseudorandom Number Generators (PRNGs) fail statistical tests of their randomness [8, 9], so computational scientists are cautious in selecting PRNG algorithms.

The Scalable Parallel Random Number Generators (SPRNGs) library is one of the best candidates for parallel random number generation satisfying the five characteristics, since it supports fast, scalable, and parallel random number generation with good randomness [2, 7]. SPRNG consists of six types of random number generators: Modified Lagged Fibonacci Generator (Modified LFG), 48-bit Linear Congruential Generator with prime addend (48-bit LCG), 64-bit Linear Congruential Generator with prime addend (64-bit LCG), Combined Multiple Recursive Generator (CMRG), Multiplicative Lagged Fibonacci Generator (MLFG), and Prime Modulus Linear Congruential Generator (PMLCG).

We aim to improve the generation speed by implementing a hardware version of the random number generators for simulation applications, since generating random numbers takes a considerable share of the execution time in applications that require huge quantities of random numbers [10]. FPGAs have several advantages in terms of speedup, energy, power, and flexibility for implementing random number generators [11, 12]. High-Performance Reconfigurable Computing (HPRC) platforms employ FPGAs to execute the computationally intensive portion of an application [13, 14]. The Cray XD1 is an HPRC platform providing a flexible interface between the microprocessors and FPGAs (Xilinx XC2VP50 or XC4VLX160). The FPGAs are able to communicate with a microprocessor directly through this interface [15]. Therefore, we explore the use of reconfigurable computing to achieve faster random number generation.

In order to combine the high-quality, scalable random number generation associated with SPRNG with the capabilities of HPRC, we developed the Hardware-Accelerated Scalable Parallel Random Number Generators library (HASPRNGs), which operates on a coprocessor FPGA and provides results bit-equivalent to SPRNG [16–18]. HASPRNG can be used to target computational science applications on arbitrarily large supercomputing systems (e.g., Cray XD-1, XT-5h), subject to FPGA resource availability [15]. Although the computational science application could be executed on the node microprocessors with the FPGAs accelerating the PPRNGs, the HASPRNG implementation could also be colocated with the MC application on the FPGA. The latter approach avoids internal bandwidth constraints and enables more aggressive parallel processing.

Tightly coupling HASPRNG with Reconfigurable Computing Monte Carlo (RC MC) applications therefore offers significant potential benefit. For example, the RC MC π-estimation application can consume a huge quantity of random numbers using as many parallel generators as the hardware resources can support. In this paper we describe the implementation of the HASPRNG library for the Cray XD1 for the full set of integer random number generators in SPRNG and demonstrate the potential of HASPRNG to accelerate RC MC applications by exploring a π-estimator on the Cray XD1.
IMPLEMENTATION
SPRNG includes six types of generators and a number of default parameters [2, 7]. Each of the SPRNG library random number generators and its associated parameter sets is implemented in HASPRNG. The VHSIC (Very High-Speed Integrated Circuit) Hardware Description Language (VHDL) is used to design the reconfigurable logic of HASPRNG and the RC MC 𝜋-estimator. To provide a flexible interface to RC MC applications, a one-bit control input starts and stops HASPRNG operation. When HASPRNG is paused, all of its internal state is retained so that random number generation can resume when needed. Similarly, a one-bit control output signals the availability of a valid random number. These two ports are sufficient as the control interface for an RC MC application developer. To provide high performance, eight generators are implemented for HASPRNG on the Cray XD1. The modified lagged Fibonacci and multiplicative lagged Fibonacci generators each have two implementations to exploit the potential concurrency associated with different parameter sets. We name the eight generators in HASPRNG as follows: Hardware-Accelerated Modified Lagged Fibonacci Generator for Odd-Odd type seed (1: HALFGOO), Hardware-Accelerated Modified Lagged Fibonacci Generator for Odd-Even type seed (2: HALFGOE), Hardware-Accelerated Linear Congruential Generator for 48 bits (3: HALCG48), Hardware-Accelerated Linear Congruential Generator for 64 bits (4: HALCG64), Hardware-Accelerated Combined Multiple Recursive Generator (5: HACMRG), Hardware-Accelerated Multiplicative Lagged Fibonacci Generator for Short lag seed (6: HAMLFGS), Hardware-Accelerated Multiplicative Lagged Fibonacci Generator for Long lag seed (7: HAMLFGL), and Hardware-Accelerated Prime Modulus Linear Congruential Generator (8: HAPMLCG).
Since HASPRNG implements the same integer random number generation as SPRNG, every HASPRNG generator returns 32-bit positive integer values (the most significant bit equals “0”). Consequently, HASPRNG converts the random numbers of different bit widths produced by the different generators (e.g., 32 bits for LFG, 48 bits for LCG48, 64 bits for LCG64) to positive 32-bit integer random numbers in the range 0 to 2³¹−1. For the conversion, HASPRNG selects the upper 31 bits of each random number and prepends “0” as the most significant bit of the 31-bit data, producing positive 32-bit random numbers for all generators in HASPRNG. The techniques to produce 32-bit random numbers are the same as in SPRNG [2, 7]. The random numbers before the
conversion are fed back to the generators to produce future random numbers in HASPRNG. HASPRNG can be used not only to speed up software applications through a programming interface but also to accelerate RC MC applications by colocating the application and the hardware-accelerated generator(s) on the FPGAs (see Sections 2.6, 2.7, 3.4, and 4).
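The conversion described above can be illustrated with a minimal software sketch. The function name `to_positive_int32` and its arguments are hypothetical and are not part of SPRNG or HASPRNG; the sketch only assumes that the raw generator value is a `width`-bit unsigned integer.

```python
def to_positive_int32(raw: int, width: int) -> int:
    """Keep the upper 31 bits of a `width`-bit value; the result has MSB 0."""
    if width < 31:
        raise ValueError("width must be at least 31 bits")
    raw &= (1 << width) - 1      # confine the input to `width` bits
    return raw >> (width - 31)   # upper 31 bits, so 0 <= result <= 2**31 - 1


# Illustrative inputs: a 48-bit state and a 64-bit state (hypothetical values).
print(to_positive_int32(0xFFFFFFFFFFFF, 48))
print(to_positive_int32(1 << 63, 64))
```

Because the result never exceeds 31 bits, storing it in a 32-bit word leaves the most significant bit at zero, which matches the "positive 32-bit integer" convention in the text. The full-width raw value, not the converted one, would be fed back as generator state.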
Hardware-Accelerated Modified Lagged Fibonacci Generator (HALFG)
The Modified LFG in SPRNG is computed as the bitwise XOR of the outputs of two additive LFGs [2, 7]. The Modified LFG is expressed by (1), (2), and (3):
$$Z(n) = X'(n) \oplus Y'(n) \qquad (1)$$
$$X(n) = X(n-k) + X(n-l) \pmod{2^{32}} \qquad (2)$$
$$Y(n) = Y(n-k) + Y(n-l) \pmod{2^{32}} \qquad (3)$$
where 𝑙 and 𝑘 are called the lags of the generator. The generator follows the convention 𝑙>𝑘. 𝑋(𝑛) and 𝑌(𝑛) are 32-bit random numbers generated by the two additive LFGs. 𝑋’(𝑛) is obtained by setting the least significant bit of 𝑋(𝑛) to 0. 𝑌’(𝑛) is obtained by shifting 𝑌(𝑛) right by one bit. 𝑍(𝑛) is the (𝑛/2)th random number of the generator [2, 7]. In SPRNG the generator produces a random number every two step operations. To provide bit-equivalent results and improve performance, two types of designs are used depending on whether 𝑘 is odd or even [19, 20]. We employ the design for HALFG as in [19]. The two types of designs are HALFGOO (both 𝑙 and 𝑘 are odd numbers in (1), (2), and (3)) and HALFGOE (𝑙 is odd and 𝑘 is even). HALFGOO and HALFGOE employ block memory modules to store the initial seeds and state. Note that small values of 𝑘 result in data hazards that complicate pipelining. Hence, these designs are optimized for the specific memory access patterns dictated by 𝑙 and 𝑘. Figure 1 shows the HALFGOO and HALFGOE architectures. The block memories are represented by 𝑋𝐴, 𝑋𝐵, and 𝑋𝐶, and the results of the two additions are represented by 𝐴1 and 𝐴2 in Figure 1. HALFG requires an initialization process to store the 𝑙 previous values in the three block memories, since 𝑋(𝑛) and 𝑌(𝑛) in (2) and (3) require the 𝑙 previous values. Therefore, HALFG requires separate buffers of previous values to generate the different random number sequences for 𝑋(𝑛) and 𝑌(𝑛).
Figure 1: HALFGOO (a) and HALFGOE (b) architectures [19].
After the initialization, the seeds are accessed to produce random numbers. For example, 𝐴1 represents a random number, 𝑋(𝑛) or 𝑌(𝑛), and at the same time 𝐴2 is stored to 𝑋𝐴 and 𝑋𝐵 to produce future random numbers. This method generates a random number every two step operations. The HALFGOO and HALFGOE generators produce a random number every clock cycle.
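As a software illustration of the recurrence described in this section (two additive lagged Fibonacci streams whose outputs are combined by XOR after clearing the least significant bit of one and shifting the other right by one), the following is a minimal sketch. It is not the HALFG hardware design and does not reproduce SPRNG's seeding or its output decimation; the function name, the lags (17, 5), and the seed values are illustrative assumptions.

```python
from collections import deque

MASK32 = 0xFFFFFFFF


def modified_lfg(x_seed, y_seed, l=17, k=5):
    """Yield X'(n) XOR Y'(n) from two additive lagged Fibonacci streams."""
    if not (l > k > 0) or len(x_seed) != l or len(y_seed) != l:
        raise ValueError("need l > k > 0 and l seed words per stream")
    xs = deque(x_seed, maxlen=l)   # xs[-j] holds X(n - j)
    ys = deque(y_seed, maxlen=l)   # ys[-j] holds Y(n - j)
    while True:
        x = (xs[-k] + xs[-l]) & MASK32   # X(n) = X(n-k) + X(n-l) mod 2^32
        y = (ys[-k] + ys[-l]) & MASK32   # Y(n) = Y(n-k) + Y(n-l) mod 2^32
        xs.append(x)                     # the oldest value drops out
        ys.append(y)
        # Clear the LSB of X(n), shift Y(n) right by one bit, then XOR.
        yield (x & ~1 & MASK32) ^ (y >> 1)


# Hypothetical seeds, chosen only to make the arithmetic easy to follow.
gen = modified_lfg(list(range(1, 18)), list(range(101, 118)))
print([next(gen) for _ in range(4)])
```

The two deques play the role of the separate buffers of previous values mentioned above: each stream needs its own history of 𝑙 values.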
Hardware-Accelerated 48-bit and 64-bit Linear Congruential Generators (HALCGs)
The two LCGs use the same algorithm with different data sizes, 48 bits and 64 bits. The LCG characteristic equation is
$$Z(n) = \alpha Z(n-1) + p \pmod{M} \qquad (4)$$
where 𝑝 is a prime number, 𝛼 is a multiplier, 𝑀 is 2⁴⁸ for the 48-bit LCG and 2⁶⁴ for the 64-bit LCG, and 𝑍(𝑛) is the 𝑛th random number [2, 7]. Each new value in (4) depends on the previous random number. Consequently, the HALCGs face a data feedback hazard in the multiplication, which reduces performance. Because the recursion in (4) is simple, we can generate a future random number by unrolling the recursion according to the pipeline depth. Equations (5) and (6) are the modified equations, which produce results identical to (4) in the HALCGs. HALCG implementations employing (5) and (6) generate two random numbers every clock cycle using two seven-stage pipelined multipliers:
$$Z(n) = \beta Z(n-16) + \gamma \pmod{M} \qquad (5)$$
$$\beta = \alpha^{16} \bmod M, \qquad \gamma = p\,(\alpha^{15} + \alpha^{14} + \cdots + \alpha + 1) \bmod M \qquad (6)$$
The architecture for the HALCGs is shown in Figure 2. Two generation engines are employed to produce two random numbers per clock cycle. One generation engine (Generator 1 in Figure 2) produces the odd-indexed random numbers (e.g., 𝑍(17), 𝑍(19), 𝑍(21), …) and the other (Generator 2 in Figure 2) produces the even-indexed random numbers (e.g., 𝑍(16), 𝑍(18), 𝑍(20), …). For the multiplications in (5), we employ built-in 18×18 multipliers inside the generators (Generators 1 and 2 in Figure 2) (see Table 3).
Figure 2: HALCGs architecture.
Instead of using internal logic modules to obtain 𝛽 and 𝛾 in (6), the coefficients are precalculated in software to save hardware resources. Software also calculates 15 initial random numbers (𝑍(1)–𝑍(15)) to provide the initial state to the HALCGs. The 15 pregenerated random numbers and the initial seed are stored during the initialization process in the register file (Register File in Figure 2: 𝑅[0]=𝑍[0] (initial seed), 𝑅[1]=𝑍[1], …, 𝑅[15]=𝑍[15]), which has 16 48/64-bit registers. The HALCGs therefore hold 15 random numbers from initialization before they generate random number outputs. Consequently, the generator produces two 32-bit random numbers every clock cycle. We provide two random numbers every clock cycle to exploit the Cray XD1 bandwidth, since the Cray XD1 can support a 64-bit data transfer between SDRAM and FPGAs every clock cycle. A microprocessor can access the SDRAM directly (see Section 2.6).
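The unrolling idea can be checked in software. The sketch below precomputes a coefficient pair (𝛽, 𝛾) for a chosen unrolling depth and confirms that jumping ahead by that many steps reproduces the step-by-step recurrence. The depth of 16 is an assumption taken from the 16-entry register file described above; the multiplier, addend, and seed are illustrative values, not SPRNG parameters, and the function names are hypothetical.

```python
def lcg_step(z, a, p, m):
    """One step of the basic recurrence Z(n) = a*Z(n-1) + p (mod m)."""
    return (a * z + p) % m


def unrolled_coefficients(a, p, m, depth):
    """Return (beta, gamma) such that Z(n) = beta*Z(n-depth) + gamma (mod m)."""
    beta, gamma = 1, 0
    for _ in range(depth):
        # Compose one more step of z -> a*z + p onto z -> beta*z + gamma.
        beta, gamma = (a * beta) % m, (a * gamma + p) % m
    return beta, gamma


M = 1 << 48
A, P, SEED = 0x2875A2E7B175, 11, 12345   # illustrative constants (assumed)
DEPTH = 16                               # assumed from the 16-register file

# Reference sequence, one step at a time.
reference = [SEED]
for _ in range(63):
    reference.append(lcg_step(reference[-1], A, P, M))

# "Register file" holds Z(0)..Z(15); every later value depends only on the
# value DEPTH positions back, so consecutive outputs are independent.
beta, gamma = unrolled_coefficients(A, P, M, DEPTH)
unrolled = reference[:DEPTH]
for n in range(DEPTH, 64):
    unrolled.append((beta * unrolled[n - DEPTH] + gamma) % M)

print(unrolled == reference)
```

Since each output depends only on a value produced many steps earlier, a deep multiplier pipeline can stay full, and odd- and even-indexed outputs can be computed by separate engines.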
Hardware-Accelerated Combined Multiple Recursive Generator (HACMRG)
The SPRNG CMRG employs two generators. One is a 64-bit LCG and the other is a recursive generator [7]. The recursive generator is expressed by (8). The CMRG combines the two generators as follows:
$$Z(n) = X(n) + Y(n) \times 2^{32} \pmod{2^{64}} \qquad (7)$$
$$Y(n) = 107374182\,Y(n-1) + 104480\,Y(n-5) \pmod{2^{31}-1} \qquad (8)$$
where 𝑋(𝑛) is generated by a 64-bit LCG, 𝑌(𝑛) represents a 31-bit random number, and 𝑍(𝑛) is the resulting random number [2, 7]. The implementation equation is
$$Z(n) = X'(n) + Y(n) \pmod{2^{32}} \qquad (9)$$
where 𝑋’(𝑛) is the upper 32 bits of 𝑋(𝑛). Equation (9) produces results identical to (7). Figure 3 shows the HACMRG hardware architecture. The architecture has two parts, each generating a partial result. The first part is an HALCG64, and the second part is a generator having two lag factors as in (8).
Figure 3: HACMRG architecture.
The left part in Figure 3 is the HALCG64 employing a two-stage multiplier. The HACMRG HALCG64 produces one 𝑋(𝑛) random number every other clock cycle to synchronize with the 𝑌(𝑛) generator, which is composed of two two-stage multipliers, four-deep FIFO registers, and some combinational logic. In the 𝑌(𝑛) generator, the multiplexer controlling the left multiplier inputs selects an input value of “1” for the initial random number and the previous value 𝑌(𝑛−1) thereafter. To implement the modulo 2³¹−1 operator, the 62-bit sum of the two multiplier outputs is shifted right by 31 bits and added to the lower 31 bits of the unshifted value. The shifted value is the residue of the upper 31 bits of the 62-bit value, and the lower 31 bits are their own residue. Their sum is the total modulo value, which is then checked to obtain the final modulo value. If the total is greater than or equal to 2³¹, the final modulo value is obtained by adding “1” to the lower 31 bits of the total, since 2³¹ modulo (2³¹−1) is “1”. The value obtained from the modulus operation fans out three ways. The first copy goes to the left multiplier input port in Figure 3 to save one clock cycle of latency, the second goes to the FIFO, and the third goes to the final adder, where it is added to the result generated by the HALCG64. The upper 31 bits of the result represent the next random number. The HACMRG generates a random number every other clock cycle.
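The shift-and-add reduction described above relies on the fact that 2³¹ is congruent to 1 modulo 2³¹−1, so the high and low 31-bit halves of a 62-bit value can simply be added. The following minimal sketch shows the idea in software and checks it against ordinary integer division. The function name is hypothetical, and the final test that maps the value 2³¹−1 to 0 is an addition made here to return a canonical residue; the text does not describe that step. The two multipliers used in the recurrence are stated as an assumption based on SPRNG's published CMRG definition.

```python
M31 = (1 << 31) - 1


def mod_m31(v):
    """Reduce 0 <= v < 2**62 modulo 2**31 - 1 without a division."""
    s = (v >> 31) + (v & M31)      # fold the upper half onto the lower half
    if s >= (1 << 31):             # carry out of bit 31: 2**31 counts as 1
        s = (s & M31) + 1
    return 0 if s == M31 else s    # 2**31 - 1 itself is congruent to 0


# One step of the two-lag recurrence Y(n) = a1*Y(n-1) + a5*Y(n-5) mod (2^31 - 1).
A1, A5 = 107374182, 104480         # assumed coefficients
history = [1, 2, 3, 4, 5]          # hypothetical Y(n-5) .. Y(n-1)
y_next = mod_m31(A1 * history[-1] + A5 * history[0])
print(y_next == (A1 * history[-1] + A5 * history[0]) % M31)
```

In hardware the two comparisons cost only a few gates, which is why a Mersenne modulus is attractive compared with a general divider.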
Hardware-Accelerated Multiplicative Lagged Fibonacci Generator (HAMLFG)
The SPRNG MLFG characteristic equation is given by
$$Z(n) = Z(n-k) \times Z(n-l) \pmod{2^{64}} \qquad (10)$$
where 𝑘 and 𝑙 are time lags and 𝑍(𝑛) is the resulting random number [2, 7].
The HAMLFGS and HAMLFGL employ two generators to produce two random numbers every clock cycle based on (10). The HAMLFGS and HAMLFGL require three-port RAMs to consistently keep previous random numbers since two ports are required to store two random numbers and one port is required to read a random number at the same time. However, only two ports are supported for the DPRAMs. Fortunately, (10) reveals data access patterns such that one port is enough for storing two random numbers. Tables 1 and 2 show the data accessing patterns in the case of the {17,5}
and {31,6} parameter sets. In these tables, we reference the “odd-odd” case when the longer lag factor 𝑙 is odd and the shorter lag factor 𝑘 is odd and the “odd-even” case when the longer lag factor is odd and the shorter lag factor is even. Note that 𝑙 is always odd for the SPRNG parameter sets [7].

Table 1: Data access patterns in the odd-odd case {17, 5} (two random numbers every clock cycle)

| Cycle | Odd random number | Even random number |
|---|---|---|
| 1st | Z(17) = Z(12) × Z(0) | Z(18) = Z(13) × Z(1) |
| 2nd | Z(19) = Z(14) × Z(2) | Z(20) = Z(15) × Z(3) |
| 3rd | Z(21) = Z(16) × Z(4) | Z(22) = Z(17) × Z(5) |
| 4th | Z(23) = Z(18) × Z(6) | Z(24) = Z(19) × Z(7) |
| ⋯ | ⋯ | ⋯ |
Table 2: Data access patterns in the odd-even case {31, 6} (two random numbers every clock cycle)

| Cycle | Odd random number | Even random number |
|---|---|---|
| 1st | Z(31) = Z(25) × Z(0) | Z(32) = Z(26) × Z(1) |
| 2nd | Z(33) = Z(27) × Z(2) | Z(34) = Z(28) × Z(3) |
| 3rd | Z(35) = Z(29) × Z(4) | Z(36) = Z(30) × Z(5) |
| 4th | Z(37) = Z(31) × Z(6) | Z(38) = Z(32) × Z(7) |
| ⋯ | ⋯ | ⋯ |
Table 3: HASPRNG XC2VP50 hardware usage

| Generator | Slices (23616) | Multipliers (232) | BRAM (232) | Maximum number of copies |
|---|---|---|---|---|
| HALFGOO | 1022 (4%) | 0 (0%) | 40 (17%) | 5 |
| HALFGOE | 1015 (4%) | 0 (0%) | 40 (17%) | 5 |
| HALCG48 | 1662 (7%) | 12 (5%) | 0 (0%) | 14 |
| HALCG64 | 2474 (10%) | 20 (8%) | 0 (0%) | 9 |
| HACMRG | 683 (2%) | 18 (7%) | 0 (0%) | 14 |
| HAMLFGS | 1123 (4%) | 20 (8%) | 32 (13%) | 7 |
| HAMLFGL | 1670 (7%) | 20 (8%) | 40 (17%) | 5 |
| HAPMLCG | 659 (2%) | 16 (6%) | 0 (0%) | 16 |
| Average LCG type | 5% | 7% | 0 (0%) | 13 |
| Average LFG type | 5% | 8% | 16% | 5 |
In the case of odd-odd parameter sets, the odd part always accesses the even part and the even part always accesses the odd part in Table 1. In the case of odd-even parameter sets, the odd part always accesses the odd part and the even part always accesses the even part in Table 2. The data access patterns are described in bold face in Table 2. Figure 4 shows the HAMLFGS and HAMLFGL architecture exploiting these data access patterns. HAMLFGS and HAMLFGL employ four DPRAMs to store random numbers and two multipliers to produce two random numbers every clock cycle, since the generator can access four values every clock cycle (see Tables 1 and 2). The two upper DPRAMs (DPRAM 1 and DPRAM 2 in Figure 4) are used to generate odd indexed random numbers and the two lower DPRAMs (DPRAM 3 and DPRAM 4) are used to generate even indexed random numbers.
Figure 4: HAMLFGS and HAMLFGL architecture.
HAMLFGS and HAMLFGL also employ one read controller to read data from the DPRAMs and one write controller to write data from the multipliers to the DPRAMs. The two multiplexers (MUX in Figure 4) exploit the data access patterns. They select the appropriate values for the odd-odd and odd-even parameter sets. Two-stage multipliers are employed for HAMLFGS for the {17,5} and {31,6} parameter sets and seven-stage pipelined multipliers are employed for HAMLFGL for the other nine parameter sets
to avoid data hazards (SPRNG provides eleven parameter sets for MLFG) [7]. Even though two-stage multipliers are employed for the HAMLFGS implementation, data hazards still exist in Tables 1 and 2, since the two-stage multipliers require a two-clock-cycle latency and the DPRAM access requires two additional clock cycles, one for writing and one for reading. To avoid the two-clock-cycle delay from the DPRAM access, we employ forwarding techniques when the parameter set is {17,5} or {31,6}. Instead of being stored in the DPRAMs, the data is fed directly from the multiplier outputs back into the multipliers through the multiplexers. The forwarding is shown as dotted lines in Figure 4. If the required latency is three clock cycles, as for the odd part in {17,5} and both parts in {31,6}, one stall is inserted to synchronize the data access. HAMLFGS and HAMLFGL produce two random numbers every clock cycle.
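The access patterns in Tables 1 and 2 can be reproduced with a short software model. The sketch below is an assumption-laden illustration rather than the HAMLFG design: it computes the multiplicative lagged Fibonacci recurrence two values per simulated cycle and records, for each output, whether the operand at lag 𝑘 has an odd or even index. The function names and the all-odd seed values are hypothetical.

```python
MASK64 = (1 << 64) - 1


def mlfg_reference(seed, l, k, count):
    """Z(n) = Z(n-k) * Z(n-l) mod 2^64, computed one value per step."""
    z = list(seed)
    for n in range(l, l + count):
        z.append((z[n - k] * z[n - l]) & MASK64)
    return z


def mlfg_two_per_cycle(seed, l, k, cycles):
    """Compute two values per cycle and log (output parity, Z(n-k) parity)."""
    z = list(seed)
    parities = []
    for c in range(cycles):
        first = l + 2 * c                 # odd index, because l is odd
        pair = []
        for n in (first, first + 1):
            # With k >= 2 both operands were produced in earlier cycles.
            pair.append((z[n - k] * z[n - l]) & MASK64)
            parities.append((n % 2, (n - k) % 2))
        z.extend(pair)
    return z, parities


seed17 = [2 * i + 3 for i in range(17)]   # hypothetical odd seed words
seed31 = [2 * i + 3 for i in range(31)]
z_oo, par_oo = mlfg_two_per_cycle(seed17, 17, 5, 8)
z_oe, par_oe = mlfg_two_per_cycle(seed31, 31, 6, 8)
print(all(out != src for out, src in par_oo))   # odd-odd: opposite parity
print(all(out == src for out, src in par_oe))   # odd-even: same parity
```

Because the parity of the lag-𝑘 operand is fixed for a given parameter set, the odd-indexed and even-indexed values can be kept in separate memories, which is the property the multiplexers in Figure 4 exploit.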
Hardware-Accelerated Prime Modulus Linear Congruential Generator (HAPMLCG)
The SPRNG PMLCG generates random numbers as follows:
$$Z(n) = \alpha Z(n-1) \pmod{2^{61}-1} \qquad (11)$$
where 𝛼 is a multiplier and 𝑍(𝑛) is the resulting random number [2, 7]. Figure 5 shows the HAPMLCG architecture.
Figure 5: HAPMLCG architecture.
HAPMLCG employs four two-stage 32-bit multipliers that execute in parallel. The initial seed and the initial multiplier coefficient are each split into the lower 31 bits and the upper 30 bits, which are stored in the gray registers in Figure 5. To extend these data to 32 bits for the 32-bit multiplications, “0” is prepended to the lower 31-bit data and “00” is prepended to the upper 30-bit data at the most significant bit position. All the shift and mask operations before the IF-condition black box are needed to multiply two 61-bit values (one is the multiplier coefficient and the other is a random number) as in (11). The last part of the implementation is the IF-condition black box in Figure 5, which performs the modulus and data check operations. When the 62nd bit of the data is “1”, the data is modified by adding “1” after the 61-bit mask operation. If the modified data is 2⁶¹, the data is changed to the value “1” to prevent the generator from producing “0” as a random number [7, 19, 20]. The final data from the IF-condition black box fans out in two directions. One copy is fed into one of the inputs of the multiplexer to generate the next random number. The other is the resulting random number. After the first operation, the multiplexer always selects the resulting random number instead of the initial seed. The HAPMLCG generates a random number every other clock cycle. The HAPMLCG does not use the HALCG technique for generating future random numbers, since it was extremely hard to derive an equation for future random numbers given the complicated modulus (2⁶¹−1) while also preventing the generation of a zero random number.
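The split into 31-bit and 30-bit halves can be mirrored in software. The sketch below forms a 61-bit by 61-bit product from four partial products, one per multiplier, and then reduces it modulo 2⁶¹−1 by folding. It is a generic illustration and not the exact datapath of Figure 5: the reduction here uses two folds and a final subtraction instead of the IF-condition logic described above, and the function name and test operands are hypothetical.

```python
M61 = (1 << 61) - 1
LOW31 = (1 << 31) - 1


def mul_mod_m61(a, z):
    """Return a*z mod (2**61 - 1) for 0 <= a, z < 2**61 using 4 partial products."""
    a_lo, a_hi = a & LOW31, a >> 31    # lower 31 bits, upper 30 bits
    z_lo, z_hi = z & LOW31, z >> 31
    # Four products of operands that each fit in 32 bits.
    ll, lh = a_lo * z_lo, a_lo * z_hi
    hl, hh = a_hi * z_lo, a_hi * z_hi
    v = ll + ((lh + hl) << 31) + (hh << 62)
    # 2**61 is congruent to 1, so fold the bits above position 61 onto the rest.
    v = (v >> 61) + (v & M61)
    v = (v >> 61) + (v & M61)
    return v - M61 if v >= M61 else v


alpha, z = 0x123456789ABCDEF, 0x0FEDCBA98765432   # hypothetical 61-bit operands
print(mul_mod_m61(alpha, z) == (alpha * z) % M61)
```

Since 2⁶¹−1 is prime, a nonzero state multiplied by a nonzero multiplier stays nonzero under exact reduction, which is consistent with the requirement in the text that the generator never outputs zero.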
Programming Interface for HASPRNG
The programming interface for HASPRNG employs the C language and allows users to use HASPRNG the same way as SPRNG 2.0. The programming interface requires two buffers in main memory on the Cray XD1 to transfer data between the FPGAs and microprocessors. The main memory acts as a bridge between the microprocessor and the FPGA, which communicate directly through HyperTransport (HT)/RapidArrayTransport (RT) [15]. HASPRNG provides an initialization function, a generation function, and a free function for the programming interface, as in SPRNG 2.0 [7]. To the programmer, the functions for HASPRNG are identical to those for SPRNG, but with “sprng” replaced by “hasprng.” The initialization function
initializes HASPRNG. The generation function returns a random number per function call. The free function releases the memory resources when the random number generation is completely done. The three functions are implemented using Application Programming Interface (API) commands in C. Figure 6 describes the hardware architecture to support the programming interface of the Cray XD1.
Figure 6: Programming interface architecture.
The HASPRNG initialization function is responsible for reconfiguring the FPGA logic (FPGA in Figure 6) and registering buffers (Buffer 1 and Buffer 2 in Figure 6) in the Cray XD1 main memory. The initialization function requires five integer parameters to initialize HASPRNG. The five parameters represent the generator type, the stream identification number, the number of streams, an initial seed, and an initial parameter, as with the initialization of SPRNG generators [7]. Refer to [7] for further explanation of the five initialization parameters. The logic for the specified generator type is used to program the FPGAs. Once the generator core (Generator Core in Figure 6) is programmed in the FPGAs, the initialization function lets the generator core generate random numbers based on the five integer parameters and send them to a buffer until the two buffers are full of random numbers. A FIFO is employed to hide the latency of random number transfer and to control the generator core operation. When the FIFO is full, the full flag signal is asserted and stops the generator core from generating random numbers. When the full flag signal is released, the generator core starts generating random numbers again. The FIFO sends data directly to a buffer in memory every clock cycle through HT/RT unless the buffer is full.
When the initialization function is complete, the random numbers are stored in two buffers so that a microprocessor can read a random number with a generation function call. The initialization function is called only once. Hence, the initialization time, including function call overhead, has a negligible effect on overall performance. The generation function allows a microprocessor to read random numbers from one buffer while the FPGA fills the other buffer. Once the buffer currently used by the processor is empty, the other buffer becomes active and the FPGA fills the empty buffer. Consequently, the latency of random number generation can be hidden. The free function stops random number generation and releases hardware resources. It makes the generator core inactive by sending a signal to the FPGAs and releases the two buffers and the pointer containing HASPRNG information such as the generator type and the buffer size. The performance of random number generation depends on the buffer size. A larger buffer reduces the overhead of swapping buffers, but the buffer must still fit in cache. We optimized the HASPRNG buffer size empirically, choosing 1 MB as the most appropriate.
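The two-buffer scheme can be modeled in a few lines of software. The following is a sequential sketch only: in the real system the FPGA refills the idle buffer concurrently while the processor drains the active one, whereas here the refill happens inline at the moment of the swap. The class name, method names, and the use of `itertools.count` as a stand-in for the generator core are all hypothetical and do not correspond to the HASPRNG C interface.

```python
import itertools


class DoubleBufferedStream:
    """Serve values from one buffer while the other one holds the next batch."""

    def __init__(self, producer, buffer_len):
        self._producer = producer          # stands in for the FPGA generator core
        self._len = buffer_len
        # Initialization fills both buffers before any value is consumed.
        self._buffers = [self._fill(), self._fill()]
        self._active = 0
        self._pos = 0

    def _fill(self):
        return [next(self._producer) for _ in range(self._len)]

    def next_value(self):
        if self._pos == self._len:
            # Active buffer drained: hand it back for refilling and swap.
            self._buffers[self._active] = self._fill()
            self._active ^= 1
            self._pos = 0
        value = self._buffers[self._active][self._pos]
        self._pos += 1
        return value


stream = DoubleBufferedStream(itertools.count(), buffer_len=4)
print([stream.next_value() for _ in range(10)])
```

The buffer length controls how often the swap occurs, which is the trade-off the text describes: fewer swaps with a larger buffer, at the cost of a larger memory footprint.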
FPGA 𝜋-Estimator Using HASPRNG
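We demonstrate an FPGA 𝜋-estimator implementation for the Cray XD1. The implementation of the 𝜋-estimation can be described by the following formula:
$$\int_0^1\!\!\int_0^1 f(x,y)\,dx\,dy = \frac{\pi}{4} \qquad (12)$$
$$f(x,y) = \begin{cases} 1, & x^2 + y^2 \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$
Based on the Law of Large Numbers (LLN), the sample mean converges to the expected value of 𝑓(𝑥,𝑦) as the number of samples increases [21]. Equation (14) describes the relation between the LLN and the MC integration for 𝜋 estimation:
$$\frac{\pi}{4} = E[f(x,y)] = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i) \qquad (14)$$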
where 𝑥𝑖 and 𝑦𝑖 represent samples and 𝑁 is the number of sample trials of 𝑥𝑖 and 𝑦𝑖. For example, if the 𝜋-estimation consumes two random numbers, one for 𝑥𝑖 and one for 𝑦𝑖, then the value of 𝑁 is “1”.
The FPGA 𝜋-estimator implementation employs eight HALCG48 generators. Hence, the FPGA 𝜋-estimator is able to consume 16 random numbers per cycle (8 samples). Figure 7 describes the architecture of the FPGA 𝜋-estimator. Each “A” represents a HALCG48, each “B” represents a 32-bit multiplier, each “C” is an accumulator module, and “D” is logic that produces the complete signal when the required number of iterations is done. A random number from HASPRNG ranges from “0” to “2³¹−1”. Each pair of random numbers is interpreted as a point in the unit square.
Figure 7: 𝜋-estimator for Cray XD1.
The multipliers compute the square of these numbers which are added as follows:
(15)
where 𝑥 and 𝑦 are random numbers generated by HALCG48s in the FPGA 𝜋-estimator. If the resulting point (𝑥, 𝑦) falls inside the unit circle, the accumulator adds “1” to the count value. When done, the FPGA 𝜋-estimator returns the count value and the complete signal. When the complete signal is “1”, the
microprocessor reads the count values and computes the value of 𝜋 based on (18):
(16)
where 𝑁 = (Number of random numbers)/2 = (16 × Number of Iterations (NI))/2 = 8 × NI,
(17)
(18)
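The overall computation can be sketched in software using the standard Monte Carlo estimate, in which 𝜋 is approximated by four times the fraction of points that fall inside the quarter circle. This sketch is not a reconstruction of the equations above; it only assumes the quantities the text states, namely 31-bit integer samples in the range 0 to 2³¹−1, eight sample pairs per iteration, and 𝑁 = 8 × NI. Python's `random.Random` stands in for the eight HALCG48 generators, and the function name and seed are hypothetical.

```python
import math
import random

R_MAX = (1 << 31) - 1   # largest value a generator returns


def estimate_pi(iterations, lanes=8, rng=None):
    """Estimate pi from `lanes` (x, y) pairs per iteration of 31-bit integers."""
    rng = rng or random.Random(2019)
    count = 0
    for _ in range(iterations):
        for _ in range(lanes):
            x = rng.randint(0, R_MAX)
            y = rng.randint(0, R_MAX)
            # Integer comparison of x^2 + y^2 against the squared radius.
            if x * x + y * y <= R_MAX * R_MAX:
                count += 1
    n = lanes * iterations              # N = 8 * NI sample pairs
    return 4.0 * count / n


estimate = estimate_pi(10_000)
print(estimate, abs(estimate - math.pi))
```

Comparing the integer sum of squares with the squared radius avoids any floating-point scaling of the samples, which is one plausible reason a hardware design would use multipliers and an accumulator only.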
RESULTS
Each HASPRNG generator occupies a small portion of the FPGA, leaving plenty of room for RC MC candidate applications. Each of the HASPRNG generators is verified with over 1 million random numbers. Through the 𝜋-estimator results we demonstrate that HASPRNG is able to improve the performance significantly for RC MC applications. We use Xilinx 8.1 ISE Place and Route (PAR) tools to get hardware resource usage.
HASPRNG Hardware Resources
Table 3 shows the hardware resource usage for the XC2VP50 FPGA in each Cray XD1 node and the maximum allowable number of generators on each FPGA. Each HASPRNG LFG consumes an average of about 5% of the slices, 8% of the built-in 18×18 multipliers, and 16% of the BRAMs. Multipliers are used when the SPRNG characteristic equations contain multiplication operations, and BRAMs are used when the initial seed data and random numbers must be stored to generate random numbers. Consequently, the LCG type generators (HALCG48/64, HACMRG, and HAPMLCG) do not need BRAMs and the HALFGOO/OE do not need multipliers. HASPRNG performance can be improved linearly by adding extra random number generator copies.
HASPRNG Performance
Each generator has a different clock rate. Table 4 shows applied clock rates and performance. The difference between actual performance and theoretical performance mainly comes from the data transfer overhead
caused by transferring data from FPGAs to the main memory because the bus is saturated.

Table 4: HASPRNG performance

| Generator | Clock rate | Actual (theoretical) speed in millions of random numbers/second |
|---|---|---|
| HALFGOO | 150 MHz | 133 (150) MRN/s |
| HALFGOE | 150 MHz | 133 (150) MRN/s |
| HALCG48 | 199 MHz | 348 (398) MRN/s |
| HALCG64 | 199 MHz | 349 (398) MRN/s |
| HACMRG | 80 MHz | 35 (40) MRN/s |
| HAMLFGS | 80 MHz | 142 (160) MRN/s |
| HAMLFGL | 180 MHz | 318 (360) MRN/s |
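The theoretical figures in Table 4 follow from the clock rate multiplied by the number of random numbers each design produces per clock cycle, as stated in the implementation sections (one per cycle for the HALFGs, two per cycle for the HALCGs and HAMLFGs, one every other cycle for the HACMRG). The short sketch below redoes that arithmetic and reports how close the measured rates come to the peak; the dictionary layout and function name are illustrative choices.

```python
# name: (clock rate in MHz, random numbers per clock cycle, measured MRN/s)
TABLE_4 = {
    "HALFGOO": (150, 1.0, 133),
    "HALFGOE": (150, 1.0, 133),
    "HALCG48": (199, 2.0, 348),
    "HALCG64": (199, 2.0, 349),
    "HACMRG":  (80, 0.5, 35),
    "HAMLFGS": (80, 2.0, 142),
    "HAMLFGL": (180, 2.0, 318),
}


def theoretical_rate(clock_mhz, numbers_per_cycle):
    """Peak rate in millions of random numbers per second."""
    return clock_mhz * numbers_per_cycle


for name, (clock, per_cycle, measured) in TABLE_4.items():
    peak = theoretical_rate(clock, per_cycle)
    print(f"{name}: peak {peak:.0f} MRN/s, measured {measured} MRN/s, "
          f"{100 * measured / peak:.0f}% of peak")
```

Every generator reaches a similar fraction of its peak, which is consistent with the explanation that the shortfall comes from transfer overhead on a saturated bus and not from the generator cores themselves.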
Table 5 shows the performance evaluation for a single generator from HASPRNG compared to SPRNG. The gcc optimization level is set to O2. Because there are eleven parameter sets for the MLFG in SPRNG, we compute the HASPRNG and SPRNG performance in Table 5 using the average speed of each parameter set.

Table 5: Generator performance comparison

| Generator | SPRNG (MRN/s) | HASPRNG (MRN/s) | Speedup | Speedup for max # of copies in FPGA |
|---|---|---|---|---|
| HALFGOO | 77 | 133 | 1.7× | 8.5× |
| HALFGOE | 77 | 133 | 1.7× | 8.5× |
| HALCG48 | 167 | 348 | 2.1× | 29.4× |
| HALCG64 | 200 | 349 | 1.7× | 15.3× |
| HACMRG | 91 | 35 | 0.4× | 5.6× |
| HAMLFGS | 111 | 142 | 2.6× (MLFG average) | 18.2× |
| HAMLFGL | 111 | 318 | 2.6× (MLFG average) | 13.0× |
| HAPMLCG | 67 | 35 | 0.5× | 8.0× |
| Average | | | 1.5× | 13.7× |
HASPRNG shows 1.5× overall performance improvement for a single HASPRNG hardware generator over SPRNG running on a 2.2 GHz AMD Opteron processor on the Cray XD1. Note that this speedup is limited by the bandwidth between the FPGA and microprocessor. Because SPRNG is designed to support additional random streams, we can easily add HASPRNG generators in the FPGA (up to the maximum copies in the last
column of Table 3). This will not help applications on the Opteron processor because the link between the FPGA and Opteron’s main memory is already saturated. Other reconfigurable computing systems with faster links will obtain higher performance (with a speedup of up to 13.7× as shown in Table 5), or applications can be partially or entirely mapped to the FPGA for additional performance as seen in Section 2.7 with the 𝜋-estimator. Moreover, the Opteron microprocessor is now free to perform other portions of the computational science application.
HASPRNG Verification
On the verification platform, each generator in HASPRNG was compared to its SPRNG counterpart to ensure bit-equivalent behavior [16, 18]. SPRNG was installed on the Cray XD1 to compare its results with those from HASPRNG. We observe that HASPRNG produces results identical to those of SPRNG for each type of generator and for each parameter set given in SPRNG. We verified over 1 million random numbers on the verification platform for each of these configurations.
Reconfigurable Computing 𝜋-Estimator
HASPRNG can improve the performance of RC MC applications, as shown with the 𝜋-estimator. The 𝜋-estimator runs at 150 MHz and can consume 2.4 billion random numbers per second. The HASPRNG 𝜋-estimator shows a 19.1× speedup over the software 𝜋-estimator employing SPRNG when the number of random samples is sufficiently large for Monte Carlo applications, as shown in Table 6. We would expect a significant speedup for such computational science applications. It is worth noting that these results are for eight pairs of points generated per clock cycle. It takes less than 7 minutes to estimate 𝜋 generating 1 trillion random numbers for the 𝜋-estimator on a single FPGA node.

Table 6: 𝜋-estimator runtime comparison

| Number of random samples N | Software π-estimator (seconds) | RC MC π-estimator (seconds) | Speedup |
|---|---|---|---|
| 8×10⁴ | 5.6×10⁻⁴ | 3.8×10⁻⁵ | 14.7× |
| 8×10⁵ | 5.4×10⁻³ | 3.4×10⁻⁴ | 15.8× |
| 8×10⁶ | 5.4×10⁻² | 3.3×10⁻³ | 16.3× |
| 8×10⁷ | 5.3×10⁻¹ | 3.3×10⁻² | 16.1× |
| 8×10⁸ | 5.5 | 3.3×10⁻¹ | 16.7× |
| 8×10⁹ | 5.1×10 | 3.3 | 15.5× |
| 8×10¹⁰ | 5.9×10² | 3.3×10 | 17.9× |
| 8×10¹¹ | 6.3×10³ | 3.3×10² | 19.1× |
Table 7 shows the numerical results for the absolute errors between the true value of 𝜋 and the estimated 𝜋 based on five different experiments for each sample size. Mean values and standard deviations are computed assuming that the five sample errors follow a normal distribution [5, 21]. We seek 95% confidence intervals for the different random sample sizes. The error in the estimate of 𝜋 decreases as 𝑁 increases, in proportion to 1/𝑁^(1/2) [3]. When similar MC integration applications consume 100 times more random samples, they are able to obtain a solution with one more decimal digit of accuracy. Table 8 shows the hardware resource usage for the FPGA 𝜋-estimator.
Table 7: Numerical results for the errors for 𝜋 estimation

| Number of random samples N | Mean error | Standard deviation of error | 95% confidence interval upper limit of error |
|---|---|---|---|
| 8×10⁴ | 2.4×10⁻³ | 2.1×10⁻³ | 6.5×10⁻³ |
| 8×10⁵ | 3.1×10⁻³ | 1.4×10⁻³ | 5.5×10⁻³ |
| 8×10⁶ | 2.1×10⁻⁴ | 1.2×10⁻⁴ | 4.5×10⁻⁴ |
| 8×10⁷ | 1.2×10⁻⁴ | 1.5×10⁻⁴ | 4.1×10⁻⁴ |
| 8×10⁸ | 1.2×10⁻⁴ | 1.1×10⁻⁴ | 3.4×10⁻⁴ |
| 8×10⁹ | 3.0×10⁻⁵ | 2.8×10⁻⁵ | 8.5×10⁻⁵ |
| 8×10¹⁰ | 5.6×10⁻⁶ | 3.2×10⁻⁶ | 1.2×10⁻⁵ |
| 8×10¹¹ | 1.2×10⁻⁶ | 8.1×10⁻⁷ | 2.8×10⁻⁶ |
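The statement that 100 times more samples buys one more decimal digit follows directly from the inverse square-root error scaling. As a short worked step, writing the error as ε(N) with an unspecified constant c (both symbols are introduced here for illustration only):

```latex
\varepsilon(N) \approx \frac{c}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\varepsilon(100N)}{\varepsilon(N)}
  = \frac{\sqrt{N}}{\sqrt{100N}}
  = \frac{1}{10}
```

The mean errors in Table 7 are based on only five experiments per sample size, so they follow this trend only roughly: between 8×10⁴ and 8×10⁶ samples, for example, the mean error falls from 2.4×10⁻³ to 2.1×10⁻⁴, a factor of about 11.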
Table 8: FPGA 𝜋-estimator XC2VP50 hardware usage

| Hardware usage | Slices (23616) | Multipliers (232) | BRAM (232) |
|---|---|---|---|
| FPGA π-estimator | 18161 (77%) | 160 (69%) | 0 (0%) |
| Eight HALCG48s | 13296 (56%) | 96 (41%) | 0 (0%) |
| FPGA π-estimator minus eight HALCG48s | 4865 (21%) | 64 (28%) | 0 (0%) |
| FPGA π-estimator hardware resources per RNG | 608 (3%) | 8 (3%) | 0 (0%) |
CONCLUSIONS
Random number generation for Monte Carlo methods requires high performance, scalability, and good statistical properties. HASPRNG satisfies these requirements by providing a high-performance implementation of random number generators using FPGAs that produce bit-equivalent results to those provided by SPRNG. The bandwidth between the processor and FPGA is saturated with random number values, which limits the speedup on the Cray XD1. The reconfigurable computing Monte Carlo 𝜋 estimation application using HASPRNG shows good performance and numerical results with a speedup of 19.1× over the software 𝜋-estimator employing SPRNG. Hence, HASPRNG promises to help computational scientists to accelerate their applications.
ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation, Grant NSF CHE-0625598. The authors would like to thank Oak Ridge National Laboratory for access to the Cray XD1. They thank the reviewers for their helpful comments.
REFERENCES 1. 2.
3.
4. 5. 6. 7. 8.
9. 10.
11.
12.
13.
T. Warnock, “Random-number generators,” Los Alamos Science, vol. 15, pp. 137–141, 1987. M. Mascagni and A. Srinivasan, “Algorithm 806: SPRNG: a scalable library for pseudorandom number generation,” ACM Transactions on Mathematical Software, vol. 26, no. 3, pp. 436–461, 2000. S. Weinzierl, “Introduction to Monte Carlo methods,” Tech. Rep. NIKHEF-00-012, NIKHEF Theory Group, Amsterdam, The Netherlands, June 2000. N. Metropolis and S. Ulam, “The Monte Carlo method,” Journal of the American Statistica Association, vol. 44, no. 247, pp. 335–341, 1949. R. L. Scheaffer, Introduction to Probability and Its Applications, Duxbury Press, Belmont, Calif, USA, 1995. M. Mascagni, “Parallel pseudorandom number generation,” SIAM News, vol. 32, no. 5, pp. 221–251, 1999. Scalable parallel pseudo random number generators library, June 2007, http://sprng.fsu.edu. P. L’Ecuyer and R. Simard, “TestU01: a C library for empirical testing of random number generators,” ACM Transactions on Mathematical Software, vol. 33, no. 4, article 22, 2007. G. Marsaglia, “DIEHARD: a battery of tests of randomness,” 1996, http://stat.fsu.edu/pub/diehard/l. J. M. McCollum, J. M. Lancaster, D. W. Bouldin, and G. D. Peterson, “Hardware acceleration of pseudo-random number generation for simulation applications,” in Proceedings of the 35th IEEE Southeastern Symposium on System Theory (SSST ‘03), 2003. K. Underwood, “FPGAs vs. CPUs: trends in peak floating-point performance,” in Proceedings of the ACM International Symposium on Field Programmable Gate Arrays (FPGA ‘04), vol. 12, pp. 171–180, Monterrey, Calif, USA, February 2004. P. P. Chu and R. E. Jones, “Design techniques of FPGA based random number generator (extended abstract),” in Proceedings of the Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD ‘99), 1999. D. Buell, T. El-Ghazawi, K. Gaj, and V. Kindratenko, “Highperformance reconfigurable computing,” Computer, vol. 40, no. 3, pp. 23–27, 2007.
14. M. C. Smith and G. D. Peterson, “Parallel application performance on shared high performance reconfigurable computing resources,” Performance Evaluation, vol. 60, no. 1–4, pp. 107–125, 2005.
15. Cray Inc., “Cray XD1 FPGA development,” 2005.
16. J. Lee, G. D. Peterson, and R. J. Harrison, “Hardware accelerated scalable parallel random number generators,” in Proceedings of the 3rd Annual Reconfigurable Systems Summer Institute, July 2007.
17. J. Lee, Y. Bi, G. D. Peterson, R. J. Hinde, and R. J. Harrison, “HASPRNG: hardware accelerated scalable parallel random number generators,” Computer Physics Communications, vol. 180, no. 12, pp. 2574–2581, 2009.
18. J. Lee, G. D. Peterson, R. J. Harrison, and R. J. Hinde, “Hardware accelerated scalable parallel random number generators for Monte Carlo methods,” in Proceedings of the Midwest Symposium on Circuits and Systems, pp. 177–180, August 2008.
19. Y. Bi, A reconfigurable supercomputing library for accelerated parallel lagged-Fibonacci pseudorandom number generation, M.S. thesis, Computer Engineering, University of Tennessee, December 2006.
20. Y. Bi, G. D. Peterson, G. L. Warren, and R. J. Harrison, “Hardware acceleration of parallel lagged Fibonacci pseudo random number generation,” in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2006.
21. J. Jacob and P. Protter, Probability Essentials, Springer, Berlin, Germany, 2004.
CHAPTER 2
A HARDWARE EFFICIENT RANDOM NUMBER GENERATOR FOR NONUNIFORM DISTRIBUTIONS WITH ARBITRARY PRECISION
Christian de Schryver1, Daniel Schmidt1, Norbert Wehn1, Elke Korn2, Henning Marxen2, Anton Kostiuk2, and Ralf Korn2
1 Microelectronic Systems Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany
2 Stochastic Control and Financial Mathematics Group, University of Kaiserslautern, Erwin-Schroedinger-Straße, 67663 Kaiserslautern, Germany
Citation: Christian de Schryver, Daniel Schmidt, Norbert Wehn, et al., “A Hardware Efficient Random Number Generator for Non-uniform Distributions with Arbitrary Precision,” International Journal of Reconfigurable Computing, vol. 2012, Article ID 675130, 11 pages, 2012. doi:10.1155/2012/675130.
Copyright: © 2012 Christian de Schryver et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
Nonuniform random numbers are key for many technical applications, and designing efficient hardware implementations of nonuniform random number generators is a very active research field. However, most state-of-the-art architectures are either tailored to specific distributions or use up a lot of hardware resources. At ReConFig 2010, we presented a new design that saves up to 48% of area compared to state-of-the-art inversion-based implementations and is usable for arbitrary distributions and precision. In this paper, we introduce a more flexible version together with a refined segmentation scheme that significantly reduces the approximation error further. We provide a free software tool that allows users to implement their own distributions easily, and we have tested our random number generator thoroughly by statistical analysis and two application tests.
INTRODUCTION
The fast generation of random numbers is essential for many tasks. One of the major fields of application is Monte Carlo simulation, which is widely used, for example, in financial mathematics and communication technology. Although many simulations are still performed on high-performance CPU or general-purpose graphics processing unit (GPGPU) clusters, reconfigurable hardware accelerators based on field programmable gate arrays (FPGAs) can reduce the power consumption by at least one order of magnitude if the random number generator (RNG) is located on the accelerator. As an example, we have implemented the generation of normally distributed random numbers on the three mentioned architectures. The results for the achieved throughput and the consumed energy are given in Table 1. Since one single instance of our proposed hardware design (together with a uniform random number generator) consumes less than 1% of the area on the used Xilinx Virtex-5 FPGA, we have introduced a line with the extrapolated values for 100 instances to highlight the enormous potential of hardware accelerators with respect to the achievable throughput per energy.
Table 1: Normal random number generator architecture comparison

| Implementation | Architecture | Power consumption | Throughput [M samples/s] | Energy per sample |
|---|---|---|---|---|
| Fast Mersenne Twister, optimized for SIMD | Intel Core 2 Duo PC, 2.0 GHz, 3 GB RAM, one core only | ~100 W | 600 | 166.67 nJ |
| Nvidia Mersenne Twister + Box-Muller CUDA | Nvidia GeForce 9800 GT | ~105 W | 1510 | 69.54 nJ |
| Nvidia Mersenne Twister + Box-Muller OpenCL | Nvidia GeForce 9800 GT | ~105 W | 1463 | 71.77 nJ |
| Proposed architecture, only one instance [1] | Xilinx FPGA Virtex-5FX70T-3, 380 MHz | ~1.3 W | 397 | 3.43 nJ |
| Proposed architecture, 100 instances | Xilinx FPGA Virtex-5FX70T-3, 380 MHz | ~1.9 W | 39700 | 0.05 nJ |
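As an illustration of how the energy column of Table 1 relates to the other two columns, the following Python sketch divides the power by the throughput. It uses the rounded power values from the table; the row labels and the helper name `energy_per_sample_nj` are chosen here for illustration only. With power in watts and throughput in samples per second, the quotient is in joules per sample, which the sketch reports in nanojoules.

```python
# Energy per sample = power / throughput, using the rounded values of Table 1.
# Row labels and function name are illustrative assumptions.

def energy_per_sample_nj(power_w: float, samples_per_s: float) -> float:
    """Return the energy per generated sample in nanojoules."""
    return power_w / samples_per_s * 1e9

rows = [
    ("CPU, Fast Mersenne Twister (SIMD)", 100.0, 600e6),
    ("GPU, Mersenne Twister + Box-Muller (CUDA)", 105.0, 1510e6),
    ("GPU, Mersenne Twister + Box-Muller (OpenCL)", 105.0, 1463e6),
    ("FPGA, one instance", 1.3, 397e6),
    ("FPGA, 100 instances", 1.9, 39700e6),
]

for label, power_w, throughput in rows:
    print(f"{label:45s} {energy_per_sample_nj(power_w, throughput):8.2f} nJ")
```

The CPU and GPU rows reproduce the tabulated figures. The FPGA rows come out slightly lower than the tabulated ones, presumably because the table lists rounded power values.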
In this paper, we present a refined version of the floating point-based nonuniform random number generator already shown at ReConFig 2010 [1]. The modifications allow a higher precision while having an even lower area consumption compared to the previous results. This is due to a refined synthesis. The main benefits of the proposed hardware architecture are the following:
(i) The area saving is even higher than the formerly presented 48% compared to the state-of-the-art FPGA implementation of Cheung et al. from 2007 [2].
(ii) The precision of the random number generator can be adjusted and is largely independent of the output resolution of the auxiliary uniform RNG.
(iii) Our design is exhaustively tested by statistical and application tests to ensure the high quality of our implementation.
(iv) For the convenience of the user, we provide a free tool that creates the lookup table (LUT) entries for any desired nonuniform distribution with a user-defined precision.
The rest of the paper is organized as follows. In Section 2, we give an overview of current techniques to obtain uniform (pseudo-)random numbers and to transform them into nonuniform random numbers. Section 3 shows state-of-the-art inversion-based FPGA nonuniform random number generators, as well as a detailed description of the newly introduced implementation. It also presents the LUT creator tool needed for creating the lookup table entries. How a floating point representation can help to reduce hardware complexity is explained in Section 4. Section 5 shows detailed synthesis results of the original and the improved implementation and elaborates on the extensive quality tests that we have applied. Finally, Section 6 concludes the paper.
RELATED WORK
The efficient implementation of random number generators in hardware has been a very active research field for many years now. Basically, the available implementations can be divided into two main groups: (i) random number generators for uniform distributions and (ii) circuits that transform uniformly distributed random numbers into different target distributions. Both areas of research can, however, be treated as nearly distinct. We give an overview of available solutions from both groups.
Uniform Random Number Generators
Many highly elaborate implementations of uniform RNGs have been published over the last decades. Their main common characteristic is that they produce a bit vector of n bits that represents (if interpreted as an unsigned binary-coded integer and divided by 2^n − 1) a value between 0 and 1. The set of all results that the generator produces should be distributed as uniformly as possible over the range (0, 1). A lot of fundamental research on uniform random number generation was already done before 1994. A comprehensive overview of the work done until then has been given by L’Ecuyer [3], who summarized the main concepts of uniform RNG construction and their mathematical backgrounds. He also highlights the difficulties of evaluating the quality of a uniform RNG, since in the vast majority of cases we are dealing not with truly random sequences (as, e.g., Bochard et al. [4]) but with pseudorandom or quasirandom sequences. The latter two are based on deterministic algorithms. Pseudorandomness means that the output of the RNG looks to an observer like a truly random number sequence if only a limited period of time is considered. Quasirandom sequences, however, do not aim to look random at all, but rather try to cover a certain bounded range as evenly as possible. One major field of application for quasirandom numbers is the generation of a suitable test point set for Monte Carlo simulations, in order to increase the performance compared to pseudorandom number input [5, 6]. One of the best investigated high-quality uniform RNGs is the Mersenne Twister as presented by Matsumoto and Nishimura in 1998 [7]. It is used in many technical applications and commercial products, as well as in the RNG research domain. Well-evaluated and optimized software programs are available on their website [8]. Nvidia adapted the Mersenne Twister to their GPUs in 2007 [9].
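The difference between pseudorandom and quasirandom input described above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the pseudorandom words come from Python's `random` module (whose core generator is a Mersenne Twister MT19937), and the quasirandom points use the base-2 van der Corput sequence, a textbook low-discrepancy sequence that is not taken from the works cited here. The function names are hypothetical.

```python
import random

def to_unit_interval(word: int, n: int) -> float:
    """Interpret an n-bit unsigned integer as a value between 0 and 1.

    Dividing by 2**n - 1 maps the all-zero word to 0.0 and the all-one word to 1.0.
    """
    return word / (2**n - 1)

def van_der_corput(i: int) -> float:
    """Base-2 radical inverse: mirror the bits of i at the binary point."""
    x, weight = 0.0, 0.5
    while i:
        if i & 1:
            x += weight
        i >>= 1
        weight *= 0.5
    return x

rng = random.Random(42)  # seeded Mersenne Twister, for reproducibility
pseudo = [to_unit_interval(rng.getrandbits(32), 32) for _ in range(8)]
quasi = [van_der_corput(i) for i in range(1, 9)]

print("pseudorandom:", [f"{u:.4f}" for u in pseudo])
print("quasirandom: ", quasi)
```

The quasirandom points fill the unit interval in a regular refinement pattern (0.5, 0.25, 0.75, 0.125, ...), whereas the pseudorandom points show no such structure.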
A high-performance hardware architecture for the Mersenne Twister was presented in 2008 by Chandrasekaran and Amira [10]. It produces 22 million samples per second, running at 24 MHz. Banks et al. compared their Mersenne Twister FPGA design to two multiplier pseudo-RNGs in 2008 [11], especially for use in financial mathematics computations. They also clearly show that the random number quality can be directly traded off against the consumed hardware resources. Tian and Benkrid presented an optimized hardware implementation of the Mersenne Twister in 2009 [12], where they showed that an FPGA implementation can outperform a state-of-the-art multicore CPU by a factor of about 25, and a GPU by a factor of about 9, with respect to the throughput. The benefit for energy saving is even higher. We will not go into further detail here since we concentrate on obtaining nonuniform distributions. Nevertheless, it is worth mentioning that quality testing has been a big issue for uniform RNG designs right from the beginning [3]. L’Ecuyer and Simard developed a comprehensive test suite named TestU01 [13] that is written in C (the most recent version is 1.2.3 from August 2009). This suite combines a large number of tests in one single program, aimed at ensuring the quality of specific RNGs. For users without detailed knowledge about the meaning of each single test, the TestU01 suite contains three test batteries that are predefined selections of several tests:
(i) Small Crush: 10 tests,
(ii) Crush: 96 tests,
(iii) Big Crush: 106 tests.
TestU01 includes and is based on the tests from other test suites that have been used before, for example, the Diehard Test Suite by Marsaglia from 1995 [14] or the fundamental considerations made by Knuth in 1997 [15]. For the application field of financial mathematics (which is also our main area of research), McCullough strongly recommended the use of TestU01 in 2006 [16]. He comments on the importance of random number quality and the need for extensive testing of RNGs in general. More recent test suites are the very comprehensive Statistical Test Suite (STS) from the US National Institute of Standards and Technology (NIST) [17], revised in August 2010, and the Dieharder suite by Robert G. Brown, which was updated in March 2011 [18].
Obtaining Nonuniform Distributions
In general, nonuniform distributions are generated from uniformly distributed random numbers by applying appropriate conversion methods. A very good overview of the state-of-the-art approaches has been given by Thomas et al. in 2007 [19]. Although they mainly concentrate on the normal distribution, they show that all applied conversion methods are based on one of four underlying mechanisms: (i) transformation, (ii) rejection sampling, (iii) inversion, (iv) recursion.
Transformation uses mathematical functions that provide a relation between the uniform and the desired target distribution. A very popular example for normally distributed random numbers is the Box-Muller method from 1958 [20]. It is based on trigonometric functions and transforms a pair of uniformly distributed random numbers into a pair of normally distributed random numbers. Its advantage is that it provides a pair of random numbers for each call deterministically. The Box-Muller method is prevalent nowadays and mainly used for CPU and GPU implementations. A drawback for hardware implementations is the high demand of resources needed to accurately evaluate the trigonometric functions [21, 22].
Rejection sampling can provide a very high accuracy for arbitrary distributions. It only accepts input values if they are within specific predefined ranges and discards others. This behavior may lead to problems if quasirandom number input sequences are used, and (especially important for hardware implementations) unpredictable stalling might be necessary. For the normal distribution, the Ziggurat method [23] is the most common example of rejection sampling and is implemented in many software products nowadays. Some optimized high-throughput FPGA implementations exist, for example, by Zhang et al. from 2005 [24], who generated 169 million samples per second on a Xilinx Virtex-2 device running at 170 MHz. Edrees et al. proposed a scalable architecture in 2009 [25] that achieves up to 240 million samples per second on a Virtex-4 at 240 MHz. By increasing the parallelism of their architecture, they predicted to achieve even 400 million samples per second for a clock frequency of around 200 MHz.
The inversion method applies the inverse cumulative distribution function (ICDF) of the target distribution to uniformly distributed random numbers. The ICDF converts a uniformly distributed random number x ∈ (0, 1) to one output y = icdf(x) with the desired distribution.
Since our proposed architecture is based on the inversion method, we go into more detail in Section 3. The hardware implementations of inversion-based converters published so far are based on piecewise polynomial approximation of the ICDF. They use lookup tables (LUTs) to store the coefficients for various sampling points. Woods and Court presented an ICDF-based random number generator in 2008 [26] that is used to perform Monte Carlo simulations in financial mathematics. They use a nonequidistant hierarchical segmentation scheme with smaller segments in the steeper parts of the ICDF, which reduces the LUT storage requirements significantly without losing precision. Cheung et al. showed a very elaborate multilevel segmentation approach in 2007 [2].
The recursion method introduced by Wallace in 1996 [27] uses linear combinations of originally normally distributed random numbers to obtain further ones. He provides the source code of his implementation for free [28]. Lee et al. showed a hardware implementation in 2005 [29] that produces 155 million samples per second on a Xilinx Virtex-2 FPGA running at 155 MHz.
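Of the four mechanisms surveyed in this subsection, the transformation approach is the easiest to illustrate in software. The following Python sketch shows the textbook form of the Box-Muller transform mentioned above; it is a minimal illustration, not one of the cited CPU, GPU, or FPGA implementations, and the function name and seed are arbitrary. The logarithm, square root, sine, and cosine calls are the operations that make this method costly in hardware.

```python
import math
import random

def box_muller(u1: float, u2: float) -> tuple:
    """Map two uniform numbers (u1 in (0, 1], u2 in [0, 1)) to two
    independent standard normal numbers."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(1958)
samples = []
for _ in range(10000):
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is defined
    z0, z1 = box_muller(u1, rng.random())
    samples += [z0, z1]      # every call yields a pair of normal numbers

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

For a standard normal distribution, the sample mean should be close to 0 and the sample variance close to 1.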
THE INVERSION METHOD
The most genuine way to obtain nonuniform random numbers is the inversion method, as it preserves the properties of the originally sampled sequence [30]. It uses the ICDF of the desired distribution to transform every input x ∈ (0, 1) from a uniform distribution into the output sample y = icdf(x) of the desired one. In the case of a continuous and strictly monotone cumulative distribution function (CDF) F and a random variable U that is uniformly distributed on (0, 1), we have

$$P\left(F^{-1}(U) \le y\right) = P\left(U \le F(y)\right) = F(y). \qquad (1)$$

Identical CDFs always imply the equality of the corresponding distributions. For further details, we refer to the works of Korn et al. [30] or Devroye [31]. Due to the above mechanism, the inversion method can also be applied to transform quasirandom sequences. In addition, it is compatible with variance reduction techniques, for example, antithetic variates [26]. Inversion-based methods in general can be used to obtain any desired distribution using memory-based lookup tables. This is especially advantageous for hardware implementations, since for many distributions no closed-form expression for the ICDF exists, and approximations have to be used. The most common approximations for the Gaussian ICDF (see Peter [32] and Moro [33]) are, however, based on higher-degree rational polynomials and, for that reason, cannot be used efficiently in a hardware implementation.
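The following Python sketch illustrates the inversion method and its interplay with antithetic variates for the standard normal distribution. It relies on `statistics.NormalDist.inv_cdf` from the Python standard library as a software stand-in for the ICDF; this is an assumption made for illustration and unrelated to the hardware approximations discussed in this chapter. The helper `uniform_open` is hypothetical and merely keeps the inputs strictly inside (0, 1).

```python
import random
from statistics import NormalDist

icdf = NormalDist(mu=0.0, sigma=1.0).inv_cdf  # software ICDF of the standard normal

def uniform_open(rng: random.Random, bits: int = 32) -> float:
    """Uniform value strictly inside (0, 1), so that icdf never sees 0 or 1."""
    return (rng.getrandbits(bits) + 0.5) / 2**bits

rng = random.Random(2012)
u = [uniform_open(rng) for _ in range(5)]

y = [icdf(x) for x in u]             # inversion: y = icdf(x)
y_anti = [icdf(1.0 - x) for x in u]  # antithetic partner of each sample

for x, a, b in zip(u, y, y_anti):
    print(f"x = {x:.6f}  icdf(x) = {a:+.6f}  icdf(1 - x) = {b:+.6f}")
```

Because the ICDF is monotone, the ordering of the inputs is preserved in the outputs, which is why the method also works for quasirandom inputs. For a symmetric distribution such as the normal one, the antithetic sample is simply the negated original sample.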
State-of-the-Art Architectures
In 2007, Cheung et al. proposed to implement the inversion using piecewise polynomial approximation [2]. It is based on a fixed point representation and uses a hierarchical segmentation scheme that provides a good trade-off between hardware resources and accuracy. For the normal distribution (as well as any other symmetric distribution), it is also common to use the following simplification: due to the symmetry of the normal ICDF around x = 0.5, its approximation is implemented only for values x ∈ (0, 0.5), and one additional random bit is used to cover the full range. For the Gaussian ICDF, Cheung et al. suggest dividing the range (0, 0.5) into nonequidistant segments with doubling segment sizes from the beginning to the end of the interval. Each of these segments should then be subdivided into inner segments of equal size. Thus, the steeper regions of the ICDF close to 0 are covered by more and smaller segments than the regions close to 0.5, where the ICDF is almost linear. This segmentation of the Gaussian ICDF is shown in Figure 1. By using a polynomial approximation of a fixed degree within each segment, this approach yields an almost constant maximal absolute error over all segments. The inversion algorithm first determines in which segment the input x is contained, then retrieves the coefficients c_i of the polynomial for this segment from a LUT, and finally evaluates the output as

$$y = \sum_{i} c_i \cdot x^{i}.$$
Figure 1: Segmentation of the first half of the Gaussian ICDF.
Figure 2 explains how, for a given fixed point input x, the coefficients of the polynomial are retrieved from the lookup table (that is, how the address of the corresponding segment in the LUT is generated). It starts by counting the number of leading zeros (LZ) in the binary representation of x. This amounts to a bisection technique for locating the segment of the first level: numbers with the most significant bit (MSB) 1 lie in the segment [0.25, 0.5) and those with MSB 0 in (0, 0.25); numbers with second MSB 1 (i.e., x = 01 ...) lie in the segment [0.125, 0.25) and those with 0 (i.e., x = 00 ...) in (0, 0.125), and so forth. Then the input x is shifted left by LZ + 1 bits, such that x_sig is the bit sequence following the most significant 1-bit in x. The k MSBs of x_sig determine the subsegments of the second level (the equally sized ones). Thus, the LUT address is the concatenation of LZ and MSB_k(x_sig). The inverted value equals the approximating polynomial for the ICDF in that segment, evaluated on the remaining bits of x_sig. The architecture for the case of linear interpolation [2] is presented in Figure 2. It approximates the inversion with a maximum absolute error of 0.3 · 2^−11.
Figure 2: State-of-the-art architecture.
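The address generation described above can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions, not the hardware design of [2]: the input is taken to be an n-bit word whose MSB distinguishes [0.25, 0.5) from (0, 0.25), and the function name `segment_address` as well as the small bit widths in the usage lines are hypothetical.

```python
def segment_address(word: int, n: int, k: int):
    """Model of the fixed point LUT addressing.

    word: n-bit fixed point input representing a value in (0, 0.5)
    k:    number of bits that select the equally sized subsegment
    Returns (address, lz, subsegment, remaining_bits).
    """
    if not 0 < word < (1 << n):
        raise ValueError("word must be a nonzero n-bit value")
    lz = n - word.bit_length()                   # number of leading zeros
    # Shift left by lz + 1 inside an n-bit register: drops the leading 1-bit.
    x_sig = (word << (lz + 1)) & ((1 << n) - 1)
    subsegment = x_sig >> (n - k)                # k MSBs of x_sig
    remaining = x_sig & ((1 << (n - k)) - 1)     # argument of the polynomial
    address = (lz << k) | subsegment             # concatenation of LZ and MSB_k
    return address, lz, subsegment, remaining

# 8-bit toy example with k = 2 subsegment bits
print(segment_address(0b10110000, n=8, k=2))  # input in [0.25, 0.5)
print(segment_address(0b00010100, n=8, k=2))  # three leading zeros
```

In the second call, the three leading zeros select the fourth first-level segment, and the two bits after the leading one select its subsegment.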
The works of Lee et al. [34, 35] are also based on this segmentation/LUT approach. They use the same technique to create generators for the log-normal and the exponential distributions, with only slight changes in the segmentation scheme. For the exponential distribution, the largest segment starts near 0, followed sequentially by segments halving in size towards 1. For the log-normal distribution, neighboring segments double in size starting from 0 until 0.5 and halve in size towards 1. However, this approach has a number of drawbacks:
(i) Two uniform RNGs are needed for a large output range: due to the fixed point implementation, the output range is limited by the number of input bits. The smallest positive value that can be represented by an m-bit fixed point number is 2^−m, which in the case of a 32-bit input value leads to the largest inverted value of icdf(2^−32) = 6.33σ. To obtain a larger range of the normal random variable of up to 8.21σ, the authors of [2] concatenate the outputs of two 32-bit uniform RNGs and pass a 53-bit fixed point number into the inversion unit, at the cost of one additional uniform RNG. The large number of input bits results in an increased size of the LZ counter and the shifter unit, which dominate the hardware usage of the design.
(ii) A large number of input bits is wasted: as a multiplier with a 53-bit input requires a large amount of hardware resources, the input is quantized to 20 significant bits before the polynomial evaluation. Thus, in the region close to 0.5, a large number of the generated input bits is wasted.
(iii) Low resolution in the tail region: for the tail region (close to 0), there are far fewer than 20 significant bits left after shifting over the LZ. This limits the resolution in the tail of the desired distribution. In addition, as there are no values between 2^−53 and 2^−52 in this fixed point representation, the proposed RNG does not generate output samples between icdf(2^−52) = 8.13σ and icdf(2^−53) = 8.21σ.
Floating Point-Based Inversion
The drawbacks mentioned before result from the fixed point interpretation of the input random numbers. We therefore propose to use a floating point representation. Note that we do not use any floating point arithmetic in our implementation. Our design does not contain any arithmetic components like full adders or multipliers that usually blow up a hardware architecture. We just exploit the representation of a floating point number consisting of an exponent and a mantissa part. We also do not use IEEE 754 [36] compliant representations, but have introduced our own optimized interpretation of the floating point encoded bit vector.
Hardware Architecture
We have enhanced our former architecture, presented at ReConFig 2010 [1], with an additional part bit that is used to split the encoded half of the ICDF into two parts. The additionally necessary hardware is just one multiplexer and an adder with one constant input, namely the offset for the address range of the LUT memory where the coefficients for the second half are located. Figure 3 shows the structure of our proposed ICDF lookup unit. Compared to our former design, we have renamed the sign_half bit to symmetry bit. This term is more appropriate now since we use this bit to identify in which half of a symmetrical ICDF the output value is located. In this case, too, we encode only one half and use the symmetry bit to generate a symmetrical coverage of the range (0, 1) (see Section 3.1).
Figure 3: ICDF lookup structure for linear approximation.
Each part itself is divided further into octaves (formerly segments) that are halved in size when moving towards the outer borders of the parts (compare with Section 3.1). The one exception is that the two smallest octaves are equally sized. In general, the number of octaves can be different for each part. As an example, Figure 4 shows the left half of the Gaussian ICDF with a nonequal number of octaves in both parts.
Figure 4: Double segmentation refinement for the normal ICDF.
Each octave is again divided into 2^k equally sized subsections, where k is the number of bits taken from the mantissa part in Figure 3. k therefore has the same value for both parts, but is not necessarily limited to powers of 2. The input address for the coefficient ROM is now generated in the following way.
(i) The offset is exactly the number of subsections in part 0, that is, all subsections in the range from 0 to 0.25 for a symmetric ICDF:

$$\text{offset} = 2^{k} \cdot (\text{number of octaves in part 0}). \qquad (2)$$

(ii) In part 0, the address is the concatenation of the exponent (giving the number of the octave) and the k dedicated mantissa bits (for the subsection).
(iii) In part 1, the address is the concatenation of (exponent + offset) and the k mantissa bits.
This floating point-based addressing scheme efficiently exploits the LUT memory in a hardware-friendly way since no additional logic for the address generation is needed compared to other state-of-the-art implementations (see Sections 2.2 and 3.1). The necessary LUT entries can easily be generated with our freely available tool presented in Section 3.2.2.
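The addressing rules (i) to (iii) can be modeled in software as follows. This Python sketch reflects one consistent reading of the scheme and rests on assumptions: the offset is counted in subsections and added to the concatenated address of part 1, so that the entries of part 1 follow directly after those of part 0 in the coefficient ROM. The function name, the parameter names, and the small octave counts in the usage lines are hypothetical.

```python
def lut_address(part: int, exponent: int, mantissa: int,
                mant_bw: int, k: int, octaves_part0: int) -> int:
    """Model of the floating point-based coefficient ROM addressing.

    part:          part bit (0 or 1)
    exponent:      octave number inside the part
    mantissa:      mant_bw-bit mantissa, of which the k MSBs pick the subsection
    octaves_part0: number of octaves in part 0 (fixes the offset)
    """
    subsection = mantissa >> (mant_bw - k)  # k dedicated mantissa bits
    address = (exponent << k) | subsection  # concatenation of exponent and bits
    offset = octaves_part0 << k             # number of subsections in part 0
    return address + offset if part else address

# Toy configuration: k = 3, four octaves in part 0, six octaves in part 1
MANT_BW, K, OCT0, OCT1 = 20, 3, 4, 6
addresses = [
    lut_address(part, octave, sub << (MANT_BW - K), MANT_BW, K, OCT0)
    for part, octaves in ((0, OCT0), (1, OCT1))
    for octave in range(octaves)
    for sub in range(1 << K)
]
print(len(addresses), "entries, contiguous:",
      addresses == list(range(len(addresses))))
```

Under these assumptions, every combination of part, octave, and subsection maps to a distinct ROM entry, and the table has no unused gaps.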
The LUT Creator Tool
For the convenience of users who would like to make use of our proposed architecture, we have developed a flexible C++ class package that creates the LUT entries for any desired distribution function. The tool has been rewritten from scratch, compared to the one presented at ReConFig 2010 [1]. It is freely available for download on our website (http://ems.eit.uni-kl.de/). Most of the detailed documentation is included in the tool package itself. It uses Chebyshev approximation, as provided by the GNU Scientific Library (GSL) [37]. The main characteristics of the new tool are as follows.
(i) It allows any function defined on the range (0, 1) to be approximated. However, the GSL already provides a large number of ICDFs that may be used conveniently.
(ii) It provides configurable segmentation schemes with respect to
(a) symmetry,
(b) one or two parts,
(c) independently configurable number of octaves per part,
(d) number of subsections per octave.
(iii) The output quantization is configurable by the user.
(iv) The degree of the polynomial approximation is arbitrary.
Our LUT creator tool also has a built-in error estimation that directly calculates the maximum errors between the provided optimal function and the approximated version. For linear approximation and the configuration shown in Table 2, we present a selection of maximum errors in Table 3. For optimized parameter sets that take the specific characteristics of the distributions into account, we expect even lower errors.
Table 2: Selected tool configuration for provided error values
Table 3: Maximum approximation errors for different distributions
GENERATING FLOATING POINT RANDOM NUMBERS
Our proposed LUT-based inversion unit shown in Section 3.2.1 requires dedicated floating point encoded numbers as inputs. In this section, we present an efficient hardware architecture for generating these numbers. Our design consumes an arbitrary-sized bit vector from any uniform random number generator and transforms it into the floating point representation with adjustable precision. In Section 5.2, we show that our floating point converter maintains the properties of the uniform random numbers provided by the input RNG. Figure 5 shows the structure of our unit and how it maps the incoming bit vector to the floating point parts. Compared to our architecture presented at ReConFig 2010 [1], we have enhanced our converter unit with an additional part bit. It indicates whether we use the first or the second segmentation refinement of the ICDF approximation (see Section 3.2).
Figure 5: Architecture of the proposed floating point RNG.
For each floating point random number that is to be generated, we extract the symmetry and the part bit in the first clock cycle, as well as the mantissa part, which is just mapped to the output one to one. The mantissa part in our case is encoded with a hidden bit; for a bit width of mant_bw bits, it can therefore represent the values 1, 1 + (1/2^mant_bw), 1 + (2/2^mant_bw), ..., 2 − (1/2^mant_bw). The exponent in our floating point encoding represents the number of leading zeros (LZs) that we count from the exponent part of the incoming random number bit vector. We can use this exponent value directly as the segment address in our ICDF lookup unit described in Section 3.2.1. In the hardware architecture, the leading zeros computation is, for efficiency reasons, implemented as a comparator tree. However, if we considered only one random number at the input of our converter, the maximum value for the floating point exponent would be m − mant_bw − 2, with all the bits in the input exponent part being zero. To overcome this issue, we have introduced a parameter determining the maximum value of the output floating point exponent, max_exp. If all bits in the input exponent part are detected to be zero, we store the number of leading zeros counted so far and consume a second random number, where we continue counting. If we again find only zeros, we consume a third number and continue until either a one is detected in the input part or the predefined maximum of the floating point exponent, max_exp, is reached. In this case, we set the data valid signal to 1 and continue with generating the next floating point random number. Since we have to wait for further input random numbers to generate one floating point result, we need a stalling mechanism for all subsequent units of the converter. Nevertheless, since the size of the exponent part in the input bit vector can be chosen arbitrarily, the probability of stalling can be decreased significantly. A second random number is needed with the probability P2 = 1/2^(m−mant_bw−2), a third with P3 = 1/2^(2·(m−mant_bw−2)), and so on. For an input exponent part with a size of 10 bits, for example, P2 = 1/2^10 = 0.976 · 10^−3, which means that on average one additional input random number has to be consumed for generating about 1,000 floating point results. We already presented pseudocode for our converter unit at ReConFig 2010 [1], which we have now enhanced for our modified design by storing two sign bits. The modified version is shown in Algorithm 1.
Algorithm 1: Floating point generation algorithm.
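The following Python sketch mimics the converter behavior described in the text above. It is a software illustration written for this purpose, not the authors' Algorithm 1, and it rests on assumptions: the bit layout of the input word (symmetry bit as MSB, then the part bit, then the exponent part, then the mantissa) is hypothetical, as are the names `float_convert`, `pack`, and `extra_word_probability`. The default parameters follow the values m = 32, mant_bw = 20, and max_exp = 54 used in this chapter.

```python
from itertools import repeat

def float_convert(words, m=32, mant_bw=20, max_exp=54):
    """Turn m-bit uniform words into (symmetry, part, exponent, mantissa).

    words must be an iterator; further words are consumed only while the
    exponent part of every word seen so far is all zero.
    """
    exp_bw = m - mant_bw - 2                 # width of the exponent part
    exp_mask = (1 << exp_bw) - 1
    word = next(words)
    symmetry = (word >> (m - 1)) & 1         # assumed position: MSB
    part = (word >> (m - 2)) & 1             # assumed position: second MSB
    exp_part = (word >> mant_bw) & exp_mask
    mantissa = word & ((1 << mant_bw) - 1)   # passed through one to one
    exponent = 0
    while exp_part == 0:                     # only zeros: keep counting
        exponent += exp_bw
        if exponent >= max_exp:
            return symmetry, part, max_exp, mantissa
        exp_part = (next(words) >> mant_bw) & exp_mask
    exponent += exp_bw - exp_part.bit_length()  # leading zeros before the one
    return symmetry, part, min(exponent, max_exp), mantissa

def extra_word_probability(j, m=32, mant_bw=20):
    """Probability that a j-th input word is needed (j >= 2)."""
    return 2.0 ** (-(j - 1) * (m - mant_bw - 2))

def pack(symmetry, part, exp_part, mantissa, mant_bw=20, exp_bw=10):
    """Assemble a test word in the assumed bit layout."""
    return (((symmetry << 1 | part) << exp_bw | exp_part) << mant_bw) | mantissa

print(float_convert(iter([pack(1, 0, 0b0000000001, 0xABCDE)])))
print(float_convert(iter([pack(0, 1, 0, 5), pack(1, 1, 0b1000000000, 7)])))
print(float_convert(repeat(0)))              # saturates at max_exp
print(extra_word_probability(2))             # 2**-10, about 0.976e-3
```

The second call shows the stalling case: the first word contributes ten zeros, and the count continues in the exponent part of the second word, while the symmetry bit, the part bit, and the mantissa are still taken from the first word.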
SYNTHESIS RESULTS AND QUALITY TEST
In addition to our conference paper presented at ReConFig 2010 [1], we provide detailed synthesis results in this section on a Xilinx Virtex-5 device, for both speed and area optimization. Furthermore, we show quality tests for the normal distribution.
Synthesis Results
As for the architecture from ReConFig 2010, we have optimized the bit widths to exploit the full potential of the Virtex-5 DSP48E slice, which supports an 18 × 25 bit + 48 bit MAC operation. We therefore selected the same parameter values, which are as follows: input bit width m = 32, mant_bw = 20, max_exp = 54, and k = 3 for subsegment addressing. The coefficient c0 is quantized to 46 bits, and c1 has 23 bits. We have synthesized our proposed design and the architecture presented at ReConFig with the newer Xilinx ISE 12.4, allowing a fair comparison of the enhancement impacts. Both implementations have been optimized for area and speed, respectively. The target device is a Xilinx Virtex-5 XC5FX70T-3. All provided results are post place and route.
From Tables 4 and 5, we see that just by using the newer ISE version, we already save area in the whole nonuniform random number converter compared to the ReConFig result of 44 slices (also optimized for speed) [1]. The maximum clock frequency is now 393 MHz compared to formerly 381 MHz.
Table 4: ReConFig 2010 [1]: optimized for speed
Table 5: Proposed design: optimized for speed
Even with the ICDF lookup unit extension described in Section 3.2.1, the new design is two slices smaller than in the former version and can run at 398 MHz. We still consume one 36 Kb BRAM and one DSP48E slice. The synthesis results for area optimization are given in Tables 6 and 7. The whole design now only occupies 31 slices on a Virtex-5 and still runs at 286 MHz instead of 259 MHz formerly. Compared to the ReConFig 2010 architecture, we therefore consume about 20% more area by achieving a speedup of about 10% at a higher precision.
Table 6: ReConFig 2010 [1]: optimized for area
Table 7: Proposed design: optimized for area
Quality Tests Quality testing is an important part in the creation of a random number generator. Unfortunately, there are no standardized tests for nonuniform random number generators. Thus, for checking the quality of our design, we proceed in three steps: in the first step, we test the floating point uniform random number converter, and then we check the nonuniform random numbers (with a special focus on the normal distribution here). Finally, the random numbers are tested in two typical applications: an option pricing calculation with the Heston model [38] and the simulation of the bit error rate and frame error rate of a duo-binary turbo code from the WiMax standard.
Uniform Floating Point Generator We have already elaborated on the widely used TestU01 suite for uniform random number generators in Section 2.1. TestU01 needs an equivalent fixed point precision of at least 30 bits, and for the big crush tests even 32 bits. The uniformly distributed floating point random numbers have been created as described in Section 4 with a mantissa of 31 bits from the output of a Mersenne Twister MT19937 [7]. The three test batteries small crush, crush, and big crush have been used to test the quality of the floating point random number generator. The Mersenne Twister itself is known to successfully complete all except two tests. These two tests are linear complexity tests that all random number generators based on linear feedback shift registers and generalized feedback shift registers fail (see [13] for more details). Our floating point transform of Mersenne Twister random numbers also completes all but these two tests successfully. Thus, we conclude that our floating point uniform random number generator preserves the properties of the input generator and shows the same excellent structural properties. For reasons of computational complexity, we have restricted the output bit width of the software implementation of the floating point converter to 23 bits for the following tests. The resolution is lower than that of the fixed point input in some regions, whereas in other regions a higher resolution is achieved. Due to the floating point representation, the regions with higher resolution are located close to zero. Figure 6 shows a zoomed two-dimensional plot of random vectors produced by our design close to zero. It is important to notice that no patterns, clusters, or big holes are visible here.
Figure 6: Detail of uniform 2D vectors around 0.
Besides the TestU01 suite, the equidistribution of our random numbers has also been tested with several variants of the frequency test mentioned by Knuth [15]. While checking the uniform distribution of the random numbers up to 12 bits, no extreme P value could be observed.
Nonuniform Random Number Generator For the nonuniform random number generator, we have selected a specific set of commonly applied tests to examine and ensure the quality of the produced random numbers. In this paper, we focus on the tests performed for normally distributed random numbers, since those are most commonly used in many different fields of application. The application tests presented below also use normally distributed random numbers. As a first step, we have run various $\chi^2$-tests. In these tests, the empirical number of observations in several groups is compared with the theoretical number of observations. Test results that would only occur with a very low probability indicate a poor quality of the random numbers. This may be the case if either the structure of the random numbers does not fit the normal distribution or if the numbers show more regularity than expected from a random sequence. The batch of random numbers in Figure 7 shows that the distribution is well approximated. The corresponding $\chi^2$-test with 100 categories had a P value of 0.4.
Figure 7: Histogram of Gaussian random numbers.
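To make the $\chi^2$-test with 100 categories more concrete, the following Python sketch bins samples into 100 equiprobable categories of the standard normal distribution and compares the observed counts with the expected ones. It is a minimal illustration, not the test code used by the authors: the function name `chi_square_normal` is hypothetical, the equiprobable binning is an assumption, and the P value uses the Wilson-Hilferty normal approximation of the $\chi^2$ tail rather than an exact distribution function.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def chi_square_normal(samples, bins=100):
    """Return (chi-square statistic, approximate upper-tail P value) against N(0, 1)."""
    std = NormalDist()
    # Interior edges that split N(0, 1) into `bins` equiprobable categories.
    edges = [std.inv_cdf(i / bins) for i in range(1, bins)]
    counts = [0] * bins
    for x in samples:
        counts[bisect_right(edges, x)] += 1

    expected = len(samples) / bins
    stat = sum((c - expected) ** 2 / expected for c in counts)

    # Wilson-Hilferty: (stat/dof)^(1/3) is approximately normal.
    dof = bins - 1
    z = ((stat / dof) ** (1 / 3) - (1 - 2 / (9 * dof))) / (2 / (9 * dof)) ** 0.5
    return stat, 1 - std.cdf(z)
```

For a well-behaved generator the statistic should be close to the 99 degrees of freedom; both very small and very large P values are suspicious, which matches the remark above that too much regularity is also a sign of poor quality.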
The Kolmogorov-Smirnov test compares the empirical and the theoretical cumulative distribution function. Nearly all tests with different batch sizes were passed. Those not passed did not reveal an extraordinary P value. A refined version of the test, as described in Knuth [15] on page 51, sometimes had low P values. This is likely to be attributed to the lower precision in some regions of our random numbers, as the continuous CDF cannot be perfectly approximated with random numbers that have fixed gaps. Other normality tests were passed as well, including the Shapiro-Wilk [39] test. Stephens [40] argues that the latter is more suitable for testing normality than the Kolmogorov-Smirnov test. The test showed no deviation from normality. We compared our random numbers not only with the theoretical properties, but also with those taken from the well-established normal random number generator of the R language, which is based on a Mersenne Twister as well. Again, we used the Kolmogorov-Smirnov test, but no difference in distribution could be seen. Comparing the mean with the t-test and the variance with the F-test gave no suspicious results. The random numbers of our generator seem to have the same distribution as standard random numbers, with the exception of the reduced precision in the central region and an improved precision at the extreme values. This difference can be seen in Figures 8 and 9. Both depict the empirical results of a draw of $2^{20}$ random numbers, the first with the presented algorithm and the second with the RNG of R.
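The comparison of empirical and theoretical cumulative distribution functions can be sketched as follows. This is a minimal one-sample Kolmogorov-Smirnov statistic against the standard normal distribution, with a P value from Kolmogorov's limiting series; it is intended as an illustration for large samples and is not the refined variant from Knuth mentioned above. The function name `ks_normal` is an assumption of this sketch.

```python
import math
import random
from statistics import NormalDist


def ks_normal(samples):
    """Return (D, asymptotic P value) for a one-sample KS test against N(0, 1)."""
    cdf = NormalDist().cdf
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)

    lam = math.sqrt(n) * d
    # Kolmogorov's limiting distribution: P(sqrt(n) * D > lam).
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

Random numbers with fixed gaps, as discussed above, produce small steps in the empirical CDF that a continuous reference CDF cannot follow exactly, which is one plausible reason why stricter variants of the test occasionally report low P values.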
Figure 8: Tail of the empirical distribution function.
Figure 9: Tail of the empirical distribution function for the R RNG.
The tail distribution of the random numbers of the presented algorithm seems to be better in the employed test set. The area of extreme values is fitted without large gaps, in contrast to the R random numbers. The smallest value from our floating point-based random number generator is $1 \cdot 2^{-54}$, compared to $1 \cdot 2^{-32}$ in standard RNGs; thus values of $-8.37\sigma$ and $8.37\sigma$ can be produced. Our approximation of the inverse cumulative distribution function has an absolute error of less than $0.4 \cdot 2^{-11}$ in the achievable interval. Thus, the good structural properties of the uniform random numbers can be preserved. Due to the good properties of our random number generator, we expect it to perform well in long and detailed simulations where rare extreme events can have a huge impact (e.g., risk simulations for insurance).
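The link between the smallest uniform value and the largest reachable normal value can be checked numerically. The sketch below uses the inverse normal CDF from Python's standard library and assumes that a separate sign bit selects the half of the symmetric distribution, so that a smallest magnitude $u_{\min}$ corresponds to a tail probability of $u_{\min}/2$. That assumption, and the helper name `tail_reach`, are ours and are not stated in the text.

```python
from statistics import NormalDist

_STD = NormalDist()


def tail_reach(u_min):
    """Largest |x| (in units of sigma) reachable by inversion.

    Assumes the sign is drawn separately, so the smallest uniform
    magnitude u_min maps to the tail probability u_min / 2.
    """
    return -_STD.inv_cdf(u_min / 2)


if __name__ == "__main__":
    for bits in (32, 54):
        print(f"smallest value 2^-{bits}: about {tail_reach(2.0 ** -bits):.2f} sigma")
```

Under this assumption, extending the smallest representable value from $2^{-32}$ to $2^{-54}$ moves the reachable tail outward by roughly two standard deviations.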
Application Tests Random number generators are always embedded in a strongly connected application environment. We have tested the applicability of our normal RNG in two scenarios. First, we have calculated an option price with the Heston model [38]. This calculation was done using a Monte Carlo simulation written in Octave. The RNG provided by Octave, randn(), has been replaced by a bit-true model of our presented hardware design. For the whole benchmark set, we could not observe any peculiarities with respect to the calculated results and the convergence behavior of the Monte Carlo simulation. For the second application, we have produced a vast set of simulations of a wireless communications system. For comparison with our RNG, a Mersenne Twister with inversion using the Moro approximation [33] has been used. In this test as well, no significant differences between the results from both generators could be observed.
CONCLUSION In this paper, we present a new refined hardware architecture of a nonuniform random number generator for arbitrary distributions and precision. As input, a freely selectable uniform random number generator can be used. Our unit transforms the input bit vector into a floating point notation before converting it with an inversion-based method to the desired distribution. This refined method provides more accurate random numbers than the previous implementation presented at ReConFig 2010 [1], while occupying roughly the same amount of hardware resources. This approach has several benefits. Our new implementation now saves more than 48% of the area on an FPGA compared to state-of-the-art implementations, while even achieving a higher output precision. The design can run at up to 398 MHz on a Xilinx Virtex-5 FPGA. The precision itself can be adjusted to the users’ needs and is largely independent of the output resolution of the uniform RNG. We provide a free tool that creates the necessary look-up table entries for any desired distribution and precision. For both components, the floating point converter and the ICDF lookup unit, we have presented our hardware architecture in detail. Furthermore, we have provided exhaustive synthesis results for a Xilinx Virtex-5 FPGA. The high quality of the random numbers generated by our design has been ensured by applying extensive mathematical and application tests.
ACKNOWLEDGMENT The authors gratefully acknowledge the partial financial support from the Center for Mathematical and Computational Modeling (CM)² of the University of Kaiserslautern.
REFERENCES
1. C. de Schryver, D. Schmidt, N. Wehn et al., “A new hardware efficient inversion based random number generator for non-uniform distributions,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig ’10), pp. 190–195, December 2010.
2. R. C. C. Cheung, D.-U. Lee, W. Luk, and J. D. Villasenor, “Hardware generation of arbitrary random number distributions from uniform distributions via the inversion method,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 8, pp. 952–962, 2007.
3. P. L’Ecuyer, “Uniform random number generation,” Annals of Operations Research, vol. 53, no. 1, pp. 77–120, 1994.
4. N. Bochard, F. Bernard, V. Fischer, and B. Valtchanov, “True-randomness and pseudo-randomness in ring oscillator-based true random number generators,” International Journal of Reconfigurable Computing, vol. 2010, article 879281, 2010.
5. H. Niederreiter, “Quasi-Monte Carlo methods and pseudo-random numbers,” American Mathematical Society, vol. 84, no. 6, p. 957, 1978.
6. H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial Mathematics, 1992.
7. M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Transactions on Modeling and Computer Simulation, vol. 8, no. 1, pp. 3–30, 1998.
8. M. Matsumoto, Mersenne Twister, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html, 2007.
9. V. Podlozhnyuk, Parallel Mersenne Twister, http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/MersenneTwister/doc/MersenneTwister.pdf, 2007.
10. S. Chandrasekaran and A. Amira, “High performance FPGA implementation of the Mersenne Twister,” in Proceedings of the 4th IEEE International Symposium on Electronic Design, Test and Applications (DELTA ’08), pp. 482–485, January 2008.
11. S. Banks, P. Beadling, and A. Ferencz, “FPGA implementation of Pseudo Random Number generators for Monte Carlo methods in quantitative finance,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig ’08), pp. 271–276, December 2008.
12. X. Tian and K. Benkrid, “Mersenne Twister random number generation on FPGA, CPU and GPU,” in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS ’09), pp. 460–464, August 2009.
13. P. L’Ecuyer and R. Simard, “TestU01: a C library for empirical testing of random number generators,” ACM Transactions on Mathematical Software, vol. 33, no. 4, 22 pages, 2007.
14. G. Marsaglia, Diehard Battery of Tests of Randomness, http://stat.fsu.edu/pub/diehard/, 1995.
15. D. E. Knuth, Seminumerical Algorithms, Volume 2 of The Art of Computer Programming, Addison-Wesley, Reading, Mass, USA, 3rd edition, 1997.
16. B. D. McCullough, “A review of TESTU01,” Journal of Applied Econometrics, vol. 21, no. 5, pp. 677–682, 2006.
17. A. Rukhin, J. Soto, J. Nechvatal et al., “A statistical test suite for random and pseudorandom number generators for cryptographic applications,” http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf, Special Publication 800-22, Revision 1a, 2010.
18. G. B. Robert, Dieharder: A Random Number Test Suite, http://www.phy.duke.edu/~rgb/General/dieharder.php, Version 3.31.0, 2011.
19. D. B. Thomas, W. Luk, P. H. W. Leong, and J. D. Villasenor, “Gaussian random number generators,” ACM Computing Surveys, vol. 39, no. 4, article 11, 2007.
20. G. E. P. Box and M. E. Muller, “A note on the generation of random normal deviates,” The Annals of Mathematical Statistics, vol. 29, no. 2, pp. 610–611, 1958.
21. A. Ghazel, E. Boutillon, J. L. Danger et al., “Design and performance analysis of a high speed AWGN communication channel emulator,” in Proceedings of the IEEE Pacific Rim Conference, pp. 374–377, Citeseer, Victoria, BC, Canada, 2001.
22. D.-U. Lee, J. D. Villasenor, W. Luk, and P. H. W. Leong, “A hardware Gaussian noise generator using the box-muller method and its error analysis,” IEEE Transactions on Computers, vol. 55, no. 6, pp. 659–671, 2006.
23. G. Marsaglia and W. W. Tsang, “The ziggurat method for generating random variables,” Journal of Statistical Software, vol. 5, pp. 1–7, 2000.
24. G. Zhang, P. H. W. Leong, D.-U. Lee, J. D. Villasenor, R. C. C. Cheung, and W. Luk, “Ziggurat-based hardware gaussian random number generator,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL ’05), pp. 275–280, August 2005.
25. H. Edrees, B. Cheung, M. Sandora et al., “Hardware-optimized ziggurat algorithm for high-speed gaussian random number generators,” in Proceedings of the International Conference on Engineering of Reconfigurable Systems & Algorithms (ERSA ’09), pp. 254–260, July 2009.
26. N. A. Woods and T. Court, “FPGA acceleration of quasi-monte carlo in finance,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL ’08), pp. 335–340, September 2008.
27. C. S. Wallace, “Fast pseudorandom generators for normal and exponential variates,” ACM Transactions on Mathematical Software, vol. 22, no. 1, pp. 119–127, 1996.
28. C. S. Wallace, MDMC Software - Random Number Generators, http://www.datamining.monash.edu.au/software/random/index.shtml, 2003.
29. D. U. Lee, W. Luk, J. D. Villasenor, G. Zhang, and P. H. W. Leong, “A hardware Gaussian noise generator using the Wallace method,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 8, pp. 911–920, 2005.
30. R. Korn, E. Korn, and G. Kroisandt, Monte Carlo Methods and Models in Finance and Insurance, Financial Mathematics Series, Chapman & Hall/CRC, Boca Raton, Fla, USA, 2010.
31. L. Devroye, Non-Uniform Random Variate Generation, Springer, New York, NY, USA, 1986.
32. J. A. Peter, An algorithm for computing the inverse normal cumulative distribution function, 2010.
33. B. Moro, “The full Monte,” Risk Magazine, vol. 8, no. 2, pp. 57–58, 1995.
34. D.-U. Lee, R. C. C. Cheung, W. Luk, and J. D. Villasenor, “Hierarchical segmentation for hardware function evaluation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 103–116, 2009.
35. D.-U. Lee, W. Luk, J. Villasenor, and P. Y. K. Cheung, “Hierarchical Segmentation Schemes for Function Evaluation,” in Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT ’03), pp. 92–99, 2003.
36. IEEE-SA Standards Board, IEEE 754-2008 Standard for Floating-Point Arithmetic, August 2008.
37. Free Software Foundation Inc., GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/, 2011.
38. S. L. Heston, “A closed-form solution for options with stochastic volatility with applications to bond and currency options,” Review of Financial Studies, vol. 6, no. 2, p. 327, 1993.
39. S. S. Shapiro and M. B. Wilk, “An analysis-of-variance test for normality (complete samples),” Biometrika, vol. 52, pp. 591–611, 1965.
40. M. A. Stephens, “EDF statistics for goodness of fit and some comparisons,” Journal of the American Statistical Association, vol. 69, no. 347, pp. 730–737, 1974.
CHAPTER 3
TRUE-RANDOMNESS AND PSEUDO-RANDOMNESS IN RING OSCILLATOR-BASED TRUE RANDOM NUMBER GENERATORS
Nathalie Bochard, Florent Bernard, Viktor Fischer, and Boyan Valtchanov
CNRS, UMR5516, Laboratoire Hubert Curien, Université de Lyon, 42000 Saint-Etienne, France
ABSTRACT The paper deals with true random number generators employing oscillator rings, namely, with the one proposed by Sunar et al. in 2007 and enhanced by Wold and Tan in 2009. Our mathematical analysis shows that both architectures behave identically when composed of the same number of rings and ideal logic components. However, the reduction of the number of rings, as proposed by Wold and Tan, would inevitably cause a loss of entropy. Unfortunately, this entropy insufficiency is masked by the pseudo-randomness caused by XOR-ing clock signals having different frequencies. Our simulation model shows that a generator using more than 18 ideal jitter-free rings with slightly different frequencies, and thus producing only pseudo-randomness, will let the statistical tests pass. We conclude that a smaller number of rings reduces the security if the entropy reduction is not taken into account in post-processing. Moreover, the designer cannot prevent some of the rings from having the same frequency, which causes a further loss of entropy. In order to confirm this, we show how the attacker can reach a state where over 25% of the rings are locked and thus completely dependent. This effect can have disastrous consequences for the system security.
Citation: Nathalie Bochard, Florent Bernard, Viktor Fischer, and Boyan Valtchanov, “True-Randomness and Pseudo-Randomness in Ring Oscillator-Based True Random Number Generators,” International Journal of Reconfigurable Computing, vol. 2010, Article ID 879281, 13 pages, 2010. doi:10.1155/2010/879281
Copyright: © 2010 Nathalie Bochard et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION True Random Number Generators (TRNGs) are used to generate confidential keys and other critical security parameters (CSPs) in cryptographic modules [1]. Generating a high-rate, high-quality random bit-stream inside logic devices is difficult because these devices are intended for implementing deterministic data processing algorithms, whereas generating true randomness requires some physical nondeterministic process. The quality of the generated bit-streams is evaluated using dedicated statistical tests such as FIPS 140-2 [1], NIST 800-22 [2], and Diehard [3]. However, the statistical tests cannot give a mathematical proof that the generator produces true random numbers and not only pseudo-random numbers that can be exploited in attacks [4]. For this reason, Killmann and Schindler [5] propose to characterize the source of randomness from the raw binary signal in order to estimate the entropy in the generator output bit-stream. The TRNGs implemented in reconfigurable devices usually use metastability [6–8] or the clock jitter [9–12] as the source of randomness. Many of them employ ring oscillators (ROs) as a source of a jittery clock [13–15]. One of the principles most frequently employed in reconfigurable devices is that proposed by Sunar et al. in [15]. This principle was later exploited and modified in [16] and enhanced in [17, 18]. Sources of randomness and randomness extraction in RO-based TRNGs were analyzed in [19–21].
In order to increase the entropy of the generated binary raw signal and to make the generator “provably secure”, Sunar et al. employ a huge number of ROs [15]. The outputs of 114 supposedly independent ROs are XORed and sampled using a reference clock with a fixed frequency in order to obtain a raw binary signal. This binary signal is then postprocessed using a resilient function depending on the size of the jitter and the number of ROs employed. The main advantages of the generator of Sunar et al. are (i) the claimed security level based on a security proof and (ii) easy (almost “push button”) implementation in FPGAs. Without the security proof, the generator of Sunar et al. can be considered as just one of many existing TRNGs that pass the statistical tests. This security approach is essential for TRNG evaluation according to AIS31 [5], which is accepted as a de facto standard in the field. Unfortunately, Sunar’s security proof is based on at least two assumptions that are impossible or difficult to achieve and/or validate in practice [11]: (i) the XOR gate is supposed to be infinitely fast in order to maintain the entropy generated in the rings; (ii) the rings are supposed to be independent. Wold and Tan show in [18] that by adding a flip-flop to the output of each oscillator (before the XOR gate), the generated raw bit-stream will have better statistical properties: the NIST [2] and Diehard [3] tests will pass without postprocessing and with a significantly reduced number of ring oscillators. It is commonly accepted that, contrary to the original design of Sunar et al., the modified architecture proposed by Wold and Tan maintains the entropy of the raw binary signal after the XOR gate if the number of rings is unchanged. However, we believe that several other questions are worthy of investigation.
The aim of our paper is to find answers to the following questions and to discuss related problems: (i) Is the security proof of Sunar also valid for the generator of Wold and Tan? (ii) What is the entropy of the generated bit-stream after the reduction of the number of rings? (iii) How does the security enhancement proposed by Fischer et al. in [17] modify the quality of the generated binary raw signal? (iv) How should the relationship between the rings be taken into account in entropy estimation? The paper is organized as follows: Section 2 analyzes the composition of the timing jitter of the clock signal generated in ring oscillators. Section 3 deals with the simulation and experimental background of our research. Section 4 compares the behavior of the two generators in simulations and in hardware. Section 5 discusses the impact of the size and type of the jitter on the quality of the raw bit-stream. Section 6 evaluates the dependence between the rings inside the device and its impact on the generation of the random bit-stream. Section 7 discusses the obtained results and answers the questions given in the previous paragraph. Section 8 concludes the paper.
RING OSCILLATORS AND TIMING JITTER Ring oscillators are free-running oscillators using logic gates. They are easy to implement in logic devices, in particular in Field Programmable Gate Arrays (FPGAs). The oscillator consists of a set of delay elements that are chained into a ring. The set of delay elements can be composed of inverting and noninverting elements, provided that the number of inverting elements is odd. The period of the signal generated in the RO using ideal components is given by

$$T = 2\sum_{i=1}^{k} d_i, \qquad (1)$$

where $k \in \{3, 4, 5, \dots, n\}$ is the number of delay elements and $d_i$ is the delay of the $i$-th delay element. This expression is simplified in two ways:
• the delay $d_i$ is supposed to be constant in time;
• the delays of interconnections are ignored.
In physical devices, the delay $d_i$ varies between two half-periods (i.e., between two instances $i$ and $i+k$) and expression (1) gets the form

$$T = \sum_{i=1}^{2k} d_i. \qquad (2)$$
In the case of ring oscillators, the variation of the clock period is observed as the clock timing jitter, which can be seen as a composition of the jitter caused by local sources and the jitter coming from global sources, usually from the power supply and/or the global device environment [19]. When observing rings implemented in real devices, and in FPGAs in particular, the delays of interconnections cannot be ignored anymore. For simplicity, we propose to merge them with the gate delays. This approach was validated in [21]. The delay $d_i$ of gate $i$, including the interconnection delay between two consecutive gates in the ring oscillator, can then be expressed as

$$d_i = D_i + \Delta d_{L_i} + \Delta d_{G_i}, \qquad (3)$$

where $D_i$ is a constant delay of gate $i$ plus the interconnection between gates $i$ and $i+1$ (mod $k$), corresponding to the nominal supply voltage level and nominal temperature of the logic device, $\Delta d_{L_i}$ is the delay variation introduced by local physical events, and $\Delta d_{G_i}$ is the variation of the delay caused by global physical sources such as substrate noise, power supply noise, power supply drifting, and temperature variation. Locally, the gate delay is dynamically modified by some amount of a random signal $\Delta d_{LG_i}$ (LG: local Gaussian jitter component) and by some local cross-talk from the neighboring circuitry $\Delta d_{LD_i}$ (LD: local deterministic jitter component). The jitter from local sources used in (3) can thus be expressed as

$$\Delta d_{L_i} = \Delta d_{LG_i} + \Delta d_{LD_i}. \qquad (4)$$
The local Gaussian jitter components $\Delta d_{LG_i}$ coming from individual gates and interconnections are characterized by the normal probability distribution $N(\mu_i, \sigma_i^2)$ with the mean value $\mu_i = 0$ and the standard deviation $\sigma_i$. We can suppose that these sources are independent. On the other hand, the local deterministic components can feature some mutual dependency, for example, from cross-talk. Besides being influenced locally, the delays of all logic gates in the device are modified both slowly and dynamically by global jitter sources. The slow changes of the gate delay $\Delta D$ (the drift) can be caused by a slow variation of the power supply and/or temperature. The power source noise and some deterministic signal, which can be superposed on the supply voltage, can cause a dynamic gate delay modification composed of a Gaussian global jitter component $\Delta d_{GG}$ and a deterministic global jitter component $\Delta d_{GD}$. The overall global jitter from (3) can therefore be expressed as

$$\Delta d_{G_i} = K_i\,(\Delta D + \Delta d_{GG} + \Delta d_{GD}), \qquad (5)$$

where $K_i \in [0; 1]$ corresponds to the proportion of the global jitter sources on the given gate delay. This comes from the fact that the amount of the global jitter included in the delays of individual logic gates is not necessarily the same for all gates. It is important to note that $K_i$ depends on the power supply voltage, but this dependence may differ for individual gates. In real physical systems, the switching current of each gate modifies locally and/or globally the voltage level of the power supply, which in turn modifies (again locally and/or globally) the gate delay. In this way, the delays of individual gates are not completely independent. We will discuss this phenomenon in Section 6.
SIMULATION AND EXPERIMENTAL SETUP The aim of the first part of our work was to compare the behavior of two RO-based TRNGs: the original architecture depicted in Figure 1(a), proposed by Sunar et al. in [15], and its modified version presented in Figure 1(b), proposed by Wold and Tan in [18]. Contrary to the strategy adopted in [18], where the behavior of the two generators was compared only in hardware, we propose to compare it on the simulation level, too. This approach has two advantages: (i) the functional simulation results correspond to an ideal behavior of the generator, so the two underlying mathematical models can be compared; (ii) in contrast with real hardware, simulation lets us modify the parameters of the injected jitter and evaluate the impact of each type of jitter on the quality of the generated bit-stream.
Figure 1: Original TRNG architecture of Sunar et al. (a) and modified architecture of Wold and Tan (b).
The principle of our simulation platform and experimental platform is depicted in Figure 2. For both platforms, the two generators were described in VHDL language and their architectures differed only in the use of flip-flops on the rings outputs (dashed blocks in Figure 2). The bitstreams obtained at the output of the final sampling flip-flop (before the post-processing) were tested and evaluated for different types and sizes of jitter in simulations and for different numbers of ring oscillators in both simulations and hardware experiments. The output of the TRNG was written into a binary file that was used as an input file in statistical tests.
Figure 2: Simulation and experimental platforms.
We avoided the post-processing in the generator of Sunar et al. for two reasons: (i) the post-processing function can hide imperfections in the generated signal; (ii) using the same structures, we wanted to compare the two generators more fairly. At this first level of investigation, we used the statistical tests of FIPS 140-2 [1] in order to evaluate the quality of the generated raw bit-streams. We preferred the FIPS tests to the NIST test suite [2] because of the speed and the size of the files needed for testing. While using significantly smaller files, the FIPS tests give a good estimation of the quality of the generated raw signal. If these tests do not pass, it is not necessary to go further. As our aim was to test a big set of TRNG configurations, the time and data size constraints were important. However, when the FIPS tests pass, we cannot conclude that the quality of the sequence produced by the TRNG is good. In this case, the generator should be thoroughly inspected.
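To give an idea of how lightweight such tests are, the sketch below implements two of the FIPS 140-2 statistical tests on a block of 20,000 bits: the monobit test and the poker test. The acceptance intervals are quoted from our recollection of the standard and should be checked against the FIPS 140-2 text before being relied upon; the function names are hypothetical, and the runs and long-run tests are omitted for brevity.

```python
import random

BLOCK = 20000  # block length used by the FIPS 140-2 statistical tests


def monobit_ok(bits):
    """Monobit test: the number of ones must lie strictly between the bounds."""
    ones = sum(bits[:BLOCK])
    return 9725 < ones < 10275


def poker_ok(bits):
    """Poker test on 5,000 consecutive 4-bit segments."""
    counts = [0] * 16
    for i in range(0, BLOCK, 4):
        nibble = bits[i] << 3 | bits[i + 1] << 2 | bits[i + 2] << 1 | bits[i + 3]
        counts[nibble] += 1
    x = 16 / 5000 * sum(c * c for c in counts) - 5000
    return 2.16 < x < 46.17


if __name__ == "__main__":
    rng = random.Random(3)
    block = [rng.getrandbits(1) for _ in range(BLOCK)]
    print("monobit:", monobit_ok(block), "poker:", poker_ok(block))
```

A strictly alternating sequence 0101... illustrates the remark that passing a test says little on its own: it passes the monobit test with exactly 10,000 ones, yet it is fully predictable and fails the poker test.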
Simulation Methodology In order to compare the two generators on the functional simulation level (i.e., using ideal components), the behavior of ring oscillators was modeled in VHDL by delay elements with dynamically varying delays. The jittered half-period, generated in MatLab Ver. R2008b, is based on (2) to (5). However, in our simulations we take into account only the local Gaussian jitter $\Delta d_{LG_i}$ (the source of true-randomness) and the global deterministic jitter $\Delta d_{GD}$ (the source of pseudo-randomness, which can easily be manipulated). This approach was explained in [19]. Both sources of jitter depend a priori on the time $t$. Thus the simplified equation used in MatLab for generating jittery half-periods over the time $t$, denoted by $h(t)$, is expressed as

$$h(t) = \sum_{i=1}^{k} d_i(t) = \sum_{i=1}^{k} \bigl( D_i(t) + \Delta d_{LG_i}(t) + K_i\,\Delta d_{GD}(t) \bigr). \qquad (6)$$
The fix part of the generated delay that determines the mean half-period of the ring oscillator is defined as a sum of 𝑘 delay elements featuring constant delay 𝐷𝑖(𝑡)=𝐷𝑖 for each gate 𝑖. The variable delay that is added to the mean half-period is composed of a Gaussian component generated for each gate individually and of a deterministic component generated by the same generator for all gates and rings. The Gaussian delay component Δ𝑑LG(𝑡) can be seen as a stationary process (i.e., mean and variance of Δ𝑑LG𝑖(𝑡) do not change over the time 𝑡). Thus Δ𝑑LG(𝑡) can be generated using
62
Computing and Random Numbers
the normrnd function in MatLab with mean 0 and standard deviation 𝜎𝑖. The deterministic component Δ𝑑GD(𝑡) applied at time 𝑡 is calculated in MatLab in the following way:
(7)
where 𝐴GD and 𝐹GD represent the amplitude and frequency of the deterministic signal. Note that the sin function can be replaced with a square, sawtooth, or other deterministic function. Coefficients 𝐾𝑖 are used to simulate the varying influence of Δ𝑑GD(𝑡) on each gate 𝑖.
Once the parameters 𝑘, 𝐷𝑖, 𝐾𝑖, 𝜎𝑖, 𝐴GD, and 𝐹GD were set, a separate file containing a stream of half-period values was generated for each ring oscillator. These files were read during the VHDL behavioral simulations performed using the ModelSim SE 6.4 software, as presented in Figure 3(a).
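The half-period generation described above can be sketched in a few lines. The sketch below is written in Python rather than MatLab, and it assumes a simple additive model: each half-period is the sum over the 𝑘 gates of a constant delay 𝐷𝑖, a zero-mean Gaussian term with standard deviation 𝜎𝑖, and a sinusoidal deterministic term scaled by 𝐾𝑖. The function name half_periods, the argument names, and the numerical values are illustrative assumptions, not the authors' script.

```python
import math
import random


def half_periods(n, delays, sigmas, coeffs, a_gd, f_gd, seed=0):
    """Return n jittered half-periods (in seconds) for one ring oscillator.

    Assumed model (not taken verbatim from the text): each half-period is
    the sum over gates i of D_i + N(0, sigma_i) + K_i * A_GD * sin(2*pi*F_GD*t).
    """
    rng = random.Random(seed)
    t = 0.0
    out = []
    for _ in range(n):
        # Global deterministic jitter, shared by all gates at time t.
        det = a_gd * math.sin(2.0 * math.pi * f_gd * t)
        h = 0.0
        for d, s, k in zip(delays, sigmas, coeffs):
            # Constant delay + local Gaussian jitter + weighted deterministic jitter.
            h += d + rng.gauss(0.0, s) + k * det
        out.append(h)
        t += h  # advance simulated time by the half-period just produced
    return out


# Illustrative values: 9 gates, 278 ps mean gate delay, 30 ps Gaussian jitter,
# no deterministic component.
k = 9
hp = half_periods(
    n=10000,
    delays=[278e-12] * k,
    sigmas=[30e-12] * k,
    coeffs=[1.0] * k,
    a_gd=0.0,
    f_gd=0.0,
)
mean_hp = sum(hp) / len(hp)
```

Writing each list to its own file, one per ring oscillator, would mirror the workflow described in the text; the file format expected by the VHDL test bench is not specified here and is left out.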
Figure 3: Implementation of ring oscillators in simulations and in hardware.
The output signals were sampled using a D flip-flop at the sampling frequency 𝐹𝑠=32 MHz, and the obtained 20,000 samples were written during the simulation to a binary file. Finally, generated sequences were tested using FIPS 140-2 tests.
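For readers who want to reproduce this kind of check, here is a minimal Python sketch of three of the FIPS 140-2 tests applied to a 20,000-bit block. The acceptance bounds are those of FIPS 140-2 as amended by Change Notice 1 as far as we recall them, and should be verified against the standard before use. The function names are our own, and the "Runs" test is omitted for brevity.

```python
def monobit(bits):
    """Count the ones; the block passes if 9725 < ones < 10275."""
    ones = sum(bits)
    return ones, 9725 < ones < 10275


def poker(bits):
    """Poker statistic over 5000 consecutive 4-bit nibbles.

    The block passes if 2.16 < X < 46.17.
    """
    counts = [0] * 16
    for i in range(0, 20000, 4):
        v = bits[i] << 3 | bits[i + 1] << 2 | bits[i + 2] << 1 | bits[i + 3]
        counts[v] += 1
    x = 16.0 / 5000.0 * sum(c * c for c in counts) - 5000.0
    return x, 2.16 < x < 46.17


def long_run(bits):
    """Longest run of identical bits; a run of 26 or more fails."""
    longest = cur = 1
    for a, b in zip(bits, bits[1:]):
        cur = cur + 1 if a == b else 1
        longest = max(longest, cur)
    return longest, longest < 26


# A strictly alternating block has a perfect ones count but fails the poker test,
# which illustrates why a single passing test says little about quality.
alternating = [i % 2 for i in range(20000)]
results = (monobit(alternating), poker(alternating), long_run(alternating))
```

Each function returns the raw statistic together with a pass flag, so the statistic can be plotted against the acceptance interval in the same way as the gray regions of the figures.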
Methodology of Testing in Hardware
Enhancements of the generator architecture brought by Wold and Tan were related to the behavior of the XOR gate. In order to compare the generators’ behavior in two different technologies, we employed one from Altera (the same that was used in [18]) based on look-up tables (LUTs) and one from Actel based on multiplexers. We developed two different modules dedicated to TRNG testing. Each module contained a selected FPGA device, a 16 MHz quartz oscillator and low-noise linear voltage regulators. The modules were plugged into a motherboard containing a Cypress USB interface device CY7C68013A-100AXC powered by an isolated power supply. The Altera module contained the Cyclone III EP3C25F256C8N device. The non-inverting delay elements and one inverter were mapped to LUT-based logic cells (LCELL) from the Altera library (see Figure 3(b)). This way, either an odd or an even number of delay elements could be used in order to tune the frequency in smaller steps. We used Quartus II software version 9.0 from Altera for mapping the rings into the device. All delay elements were preferably placed in the same logic array block (LAB) in order to minimize the dispersion of parameters. The Actel Fusion module featured the M7AFS600FGG256 FPGA device. The non-inverting delay elements were implemented using AND2 gates from the Actel library with their two inputs connected together, and one inverter (again from the Actel library) was added to close the loop (see Figure 3(c)). On both hardware platforms, an internal PLL was used to generate the 32 MHz sampling clock 𝐹𝑠. The generated bit-streams were sent to the PC using the USB interface. A 16-bit interface communicating with the Cypress USB controller was implemented inside the FPGA. A Visual C++ application running on the PC read the USB peripheral and wrote data into a binary file that was used by the FIPS 140-2 tests software.
COMPARISON OF THE GENERATORS’ BEHAVIOR IN SIMULATIONS AND IN HARDWARE
First, we compared the behavior of both generators in VHDL simulations. The generators used 1 to 20 ROs consisting of 𝑘=9 inverters each. To take into account the differences related to the placement and routing of ROs, we supposed that the mean delay 𝐷𝑖 of individual gates was slightly different from one ring to another (between 275 ps and 281 ps). The additional Gaussian jitter Δ𝑑LG𝑖 has mean 0 and standard deviation 𝜎𝑖=30ps. No deterministic component was added in this experiment (i.e., Δ𝑑GD(𝑡)=0). It is important to note that the same random data files were used for both Sunar’s and Wold’s generators.
Simulation Results and Mathematical Analysis of the Generator Behavior
The simulation results for both evaluated generators are presented in Figure 4. Note that the “Monobit” and “Poker” tests succeed if the test result lies in the gray area. The “Run” test succeeds if no run fails. The “Long run” test is not presented in the graphs because it always succeeded.
Figure 4: Results of the FIPS 140-2 tests in simulation with 30 ps of Gaussian jitter for Sunar’s and Wold’s architecture (excluding long runs that always passed), the tests pass if the results are in the gray region or are equal to zero for the “Runs” test.
It can be seen that in all configurations the two versions of the generator gave very similar (almost identical) results. Next, we will explain the reason for this behavior. Let 𝑏𝑗(𝑡) be the bit sampled at time 𝑡≥0 at the output of the ring oscillator RO𝑗. This bit depends on the time 𝑡, the period 𝑇𝑗>0 generated by RO𝑗, and the initial phase 𝜑𝑗 of RO𝑗 at time 𝑡=0. Note that 𝑇𝑗, 𝑡, and 𝜑𝑗 are expressed in the time domain. This dependency is given by the following relation:
(8)
Equation (8) ensures that 𝑏𝑗(𝑡) is a bit (i.e., 𝑏𝑗(𝑡)∈{0,1}). Thus Indeed, by definition of operation mod 𝑇𝑗,
(9)
Then
(10)
and by definition of the floor operation,
(11)
(12)
(13)
that means
thus
Equation (8) holds if 𝑇𝑗 is constant. Then if (𝜑𝑗+𝑡) mod 𝑇𝑗 𝑡ℎ, then we discard this frame, move to the next frame, and run step (d) again. (e) Get one random bit: bit [𝑖] = 1& (𝑅 ⊕𝐺 ⊕ 𝐵 ⊕𝑅1 ⊕𝐺1 ⊕ 𝐵1 ⊕ 𝑅2 ⊕ 𝐺2 ⊕ 𝐵2 ⊕ SN1 ⊕ SN2 ⊕⋅⋅⋅⊕ SN𝑛). And let 𝑅1 = 𝑅, 𝐺1 = 𝐺, 𝐵1 = 𝐵. New coordinate will be
(5)
The “≪4” operation applied to both 𝑥 and 𝑦 is very important. If we omit this operation, the pass rate of the statistical tests drops rapidly to between 10% and 50%.
(f) Then back to step (c) to get another random bit bit [𝑖] until we get a random byte.
(g) Let 𝑅2 = 𝑅, 𝐺2 = 𝐺, and 𝐵2 = 𝐵; back to step (c) to get another random byte.
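Steps (e) to (g) can be sketched as follows. This is a minimal Python illustration of the XOR-and-mask rule only: the frame selection, the threshold test of step (d), and the coordinate update of equation (5) are left out, and the function names, the argument layout, and the most-significant-bit-first ordering are our own assumptions.

```python
from functools import reduce
from operator import xor


def avrng_bit(pixel, prev_bit_pixel, prev_byte_pixel, sound_samples):
    """Least significant bit of the XOR of all inputs, as in step (e).

    pixel            -- (R, G, B) of the current point
    prev_bit_pixel   -- (R1, G1, B1), the pixel used for the previous bit
    prev_byte_pixel  -- (R2, G2, B2), the pixel remembered from the previous byte
    sound_samples    -- SN1 ... SNn, integer sound samples (may be empty)
    """
    values = (list(pixel) + list(prev_bit_pixel)
              + list(prev_byte_pixel) + list(sound_samples))
    return 1 & reduce(xor, values, 0)


def avrng_byte(pixels, sound, prev_bit_pixel=(0, 0, 0), prev_byte_pixel=(0, 0, 0)):
    """Assemble one byte from 8 (pixel, sound-sample-list) pairs.

    Returns the byte and the last pixel used, which the caller stores as
    (R2, G2, B2) before producing the next byte (step (g)).
    """
    byte = 0
    for pixel, samples in zip(pixels, sound):
        bit = avrng_bit(pixel, prev_bit_pixel, prev_byte_pixel, samples)
        byte = (byte << 1) | bit  # assumed ordering: first bit is the MSB
        prev_bit_pixel = pixel    # step (e): R1, G1, B1 = R, G, B
    return byte, prev_bit_pixel


# Eight identical pixels and no sound: only the first bit differs from zero,
# because every later pixel cancels against the remembered previous pixel.
demo_byte, last_pixel = avrng_byte([(1, 0, 0)] * 8, [[]] * 8)
```

The demonstration shows why a static image alone is a poor source: consecutive identical pixels cancel under XOR, so the sound samples and the moving coordinates have to supply the variation.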
Figure 1: AVRNG with filter flowchart.
The next step is to show that the random numbers generated by AVRNG are of sufficient quality. To verify the randomness of the proposed AVRNG, we apply the 15 statistical tests of NIST SP 800-22 Rev.1a [3], published in April 2010.
Result of AVRNG with Filter
First of all, we briefly describe the materials that are used for experiments. The XiangSheng is a kind of comic dialogue to entertain the audience with ridiculous stories we mentioned in Section 1. And the cartoon, Spirited Away, is a Japanese animation produced in 2001. It is definitely difficult to produce qualified random numbers from a cartoon image because the original colors have already been artificial. The pictures and result are shown in Figures 2, 3, and 4 and Tables 2 and 3.

Table 2: AVRNG with filter (XiangSheng).
AVRNG with filter | XiangSheng0 | XiangSheng1 | XiangSheng2 | XiangSheng3 | XiangSheng4 | XiangSheng5
Frequency | 0.81 | 1 | 0.98 | 0.98 | 0.98 | 0.98
Block frequency | 0.98 | 0.98 | 0.99 | 1 | 1 | 0.98
Cumulative sums | 0.82 | 0.995 | 0.98 | 0.98 | 0.985 | 0.98
Runs | 0.39 | 1 | 1 | 0.99 | 0.97 | 1
Longest-runs-of-ones | 0.88 | 1 | 0.99 | 1 | 0.97 | 1
Binary matrix rank | 0.98 | 0.99 | 0.99 | 1 | 0.98 | 0.99
FFT (Fourier) | 1 | 1 | 1 | 1 | 1 | 1
Nonoverlapping template | 0.9801 | 0.989 | 0.9899 | 0.9896 | 0.9881 | 0.9893
Overlapping template | 0.98 | 0.98 | 0.99 | 1 | 0.98 | 1
Universal statistical | 1 | 1 | 1 | 1 | 1 | 1
Approximate entropy | 0.99 | 1 | 0.96 | 1 | 0.99 | 0.99
The random excursions | 1 | 0.9886 | 0.9688 | 1 | 0.9861 | 0.9772
The random excursions variant | 1 | 1 | 0.9931 | 1 | 1 | 1
The serial | 0.995 | 0.995 | 0.99 | 0.996 | 0.985 | 0.985
The linear complexity | 0.98 | 1 | 0.98 | 0.99 | 0.98 | 0.98
Average of above | 0.9190 | 0.9945 | 0.9868 | 0.9950 | 0.9863 | 0.9901
Minimum of above | 0.39 | 0.98 | 0.96 | 0.98 | 0.97 | 0.9772

Table 3: AVRNG with filter (Cartoon).
AVRNG with filter | Cartoon0 | Cartoon1 | Cartoon2 | Cartoon3 | Cartoon4 | Cartoon5
Frequency | 0.3 | 1 | 0.99 | 1 | 0.98 | 0.98
Block frequency | 0.86 | 1 | 1 | 1 | 0.98 | 0.98
Cumulative sums | 0.275 | 1 | 0.985 | 1 | 0.985 | 0.99
Runs | 0.57 | 0.99 | 1 | 1 | 0.98 | 0.99
Longest-run-of-ones | 0.42 | 0.99 | 1 | 0.98 | 1 | 0.97
Binary matrix rank | 1 | 1 | 0.98 | 0.98 | 0.98 | 0.98
FFT (Fourier) | 1 | 1 | 1 | 1 | 1 | 1
Nonoverlapping template | 0.9666 | 0.9905 | 0.9893 | 0.9901 | 0.9886 | 0.989
Overlapping template | 0.59 | 1 | 0.99 | 1 | 0.99 | 0.98
Universal statistical | 0 | 1 | 1 | 1 | 1 | 1
Approximate entropy | 0.68 | 0.99 | 0.98 | 0.98 | 0.98 | 0.99
The random excursions | 0 | 0.975 | 0.9722 | 1 | 1 | 0.9922
The random excursions variant | 0 | 1 | 1 | 1 | 1 | 0.9969
The serial | 0.915 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995
The linear complexity | 0.94 | 0.99 | 1 | 1 | 1 | 1
Average of above | 0.5678 | 0.9947 | 0.9921 | 0.9950 | 0.9906 | 0.9889
Minimum of above | 0 | 0.975 | 0.9722 | 0.98 | 0.98 | 0.97

Figure 2: WEBCAM image.
Figure 3: XiangSheng image.
Figure 4: Cartoon image.
In the XiangSheng part, the minimum value of XiangSheng3 is 0.98, which is the best result of all XiangSheng test samples. We can see that, when adopting AVRNG with filter, both XiangSheng and cartoon become good enough compared with WEBCAM with microphone. The labels XiangSheng0 to XiangSheng5 denote the XiangSheng film XORed with 0 to 5 sound points, respectively. Similarly, Cartoon0 to Cartoon5 denote the cartoon film XORed with 0 to 5 sound points, respectively. The complete results are shown in Tables 2 and 3. In the cartoon part, the minimum value of Cartoon3 is 0.98, which is the best result of all cartoon test samples. The result shows that we can generate qualified random numbers from AVRNG with filter, whether the source is WEBCAM, XiangSheng, or cartoon. The experimental equipment is the microphone of a Logitech ClearChat Stereo headset (100–10,000 Hz, sensitivity −62 dBV/uBAR, −42 dBV/Pascal ±3 dB) [26], and the video devices are the 1.3-megapixel WEBCAM Logitech QuickCam Pro 4000 [27] and BlueEyes’s IPCAM BE-1200 [28]. The computer used for the experiments is an Asus U45J with an Intel i5-460M CPU (2.53 GHz) and 4 GB of RAM, running Fedora 18 x64.
CONCLUSIONS
In this work, the proposed AVRNG with filter could generate random numbers from WEBCAM with microphone (as TRNG) or from a video file’s frames with sound (as PRNG). Furthermore, when the AVRNG adopts the filter, 98% of the random numbers generated from both XiangSheng and cartoon film could pass the 15 statistical tests. Moreover, AVRNG with or without filter takes almost the same time to generate 100,000 random bits. The result is shown in Table 4.

Table 4: Efficiency of AVRNG with/without filter (unit: second).
AVRNG | XiangSheng | Cartoon
Without filter | 46.0636 | 52.5758
With filter | 47.1470 | 52.5658
With/Without filter (ratio) | 1.023 | 0.999

The proposed method requires merely a WEBCAM and a microphone to generate true random numbers, instead of complex equipment such as an electronic circuit or oscillator. The results were obtained on a personal computer; consequently, the next stage will be to port the proposed algorithm to personal devices such as tablet PCs or smartphones, so as to show that it can be widely applied. The random numbers generated from this work could be used in evolutionary algorithms [29, 30]. Hopefully, this work will evolve into an effective adjunctive decision-making tool.
Acknowledgments
This work was supported in part by the National Science Council under the Grants NSC 99-2221-E-037-004 and the NSYSU-KMU under the Grants NSYSUKMU102-P001.
REFERENCES
1. D. R. Stinson, Cryptography: Theory and Practice, Chapman & Hall/CRC, New York, NY, USA, 3rd edition, 2005.
2. J. M. Tsai, J. Tzeng, and I. T. Chen, “Random number generated from white noise of video,” ICIC Express Letter, vol. 6, no. 7, pp. 1827–1832, 2012.
3. NIST, “A statistical test suite for the validation of random number generators and pseudo random number generators for cryptographic applications,” FIPS Special Publication 800-22 Rev.1a, 2010.
4. J. B. Plumstead, “Inferring a sequence generated by a linear congruence,” in Proceedings of the 23rd IEEE Symposium on the Foundations of Computer Science, pp. 153–159, IEEE, New York, NY, USA, 1982.
5. J. A. Reeds, “Cracking random number generator,” Cryptologia, vol. 1, no. 1, pp. 20–26, 1997.
6. J. Boyar, “Inferring sequences produced by pseudo-random number generators,” Journal of the Association for Computing Machinery, vol. 36, no. 1, pp. 129–141, 1989.
7. P. Alfke, “Efficient shift registers, LFSR, counters, and long pseudorandom sequence generators,” XAPP 052, (Version 1.1), 1996, http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf.
8. H. Debiao, C. Jianhua, and H. Jin, “A random number generator based on isogenies operations,” 2010, http://eprint.iacr.org/2010/094.
9. X. Y. Wang and Q. Yu, “A block encryption algorithm based on dynamic sequences of multiple chaotic systems,” Communications in Nonlinear Science and Numerical Simulation, vol. 14, no. 2, pp. 574–581, 2009.
10. X. Y. Wang, X. Qin, and Y. X. Xie, “Pseudo-random sequences generated by a class of one-dimensional smooth map,” Chinese Physics Letters, vol. 28, no. 8, Article ID 080501, 2011.
11. X. Y. Wang and X. Qin, “A new pseudo-random number generator based on CML and chaotic iteration,” Nonlinear Dynamics, vol. 70, no. 2, pp. 1589–1592, 2012.
12. X. Y. Wang and Y. X. Xie, “A design of pseudo-random bit generator based on single chaotic system,” International Journal of Modern Physics C, vol. 23, no. 3, Article ID 1250024, 11 pages, 2012.
13. X. Y. Wang and L. Yang, “Design of pseudo-random bit generator based on chaotic maps,” International Journal of Modern Physics B, vol. 26, no. 32, Article ID 12502080, 9 pages, 2012.
14. G. Taylor and G. Cox, “Behind intel’s new random number generator,” Semiconductors Processors, 2011, http://spectrum.ieee.org/computing/hardware/behind-intels-new-randomnumber-generator/0.
15. L. C. Noll, S. Cooper, and M. Pleasant, LavaRnd, 1996, http://www.lavarnd.org.
16. S. K. Tawfeeq, “A random number generator based on single-photon avalanche photodiode dark counts,” Journal of Lightwave Technology, vol. 27, no. 24, pp. 5665–5667, 2009.
17. Y. Yamanashi and N. Yoshikawa, “Superconductive random number generator using thermal noises in SFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 19, no. 3, pp. 630–633, 2009.
18. X. Y. Wang, X. Qin, and L. Teng, “A novel true random number generator based on mouse movement and a one-dimensional chaotic map,” Mathematical Problems in Engineering, vol. 2012, Article ID 931802, 9 pages, 2012.
19. Y. A. Alsultanny, “Random-bit sequence generation from image data,” Image and Vision Computing, vol. 26, no. 4, pp. 592–601, 2008.
20. A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography, CRC Press, New York, NY, USA, 1996.
21. J. Tzeng, I. T. Chen, and J. M. Tsai, “Random number generator designed by the divergence of scaling functions,” in Proceedings of the 5th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP ’09), pp. 1038–1041, Kyoto, Japan, September 2009.
22. NIST, FIPS PUB 140-2, “Derived test requirements for FIPS PUB 140-2, security requirements for cryptographic modules,” Federal Information Processing Standards Publication, 2004.
23. NIST, Revised draft FIPS 140-3, 2009, http://csrc.nist.gov/publications/drafts/fips140-3/revised-draft-fips140-3_PDF-zip_document-annexAto-annexG.zip.
24. NIST, Revised draft FIPS 800-90a, 2012, http://csrc.nist.gov/publications/nistpubs/800-90A/SP800-90A.pdf.
25. G. Marsaglia, “DIEHARD: a battery of tests of randomness,” The preceding description of the DIEHARD executable program that explains the significance of the results, 1995, http://stat.fsu.edu/pub/diehard.
26. “Logitech ClearChat Stereo Headset,” http://www.logitech.com/engb/speakers-audio/headphones/devices/349.
27. Logitech QuickCam Pro 4000, http://www.logitech.com/en-gb/support/269?crid=405.
28. BlueEyes’s IPCAM BE-1200, http://www.blueeyes.com.tw/EN/brochure/BE1200.pdf.
29. W. H. Ho, J. H. Chou, and C. Y. Guo, “Parameter identification of chaotic systems using improved differential evolution algorithm,” Nonlinear Dynamics, vol. 61, no. 1-2, pp. 29–41, 2010.
30. W. H. Ho and C. S. Chang, “Genetic-algorithm-based artificial neural network modeling for platelet transfusion requirements on acute myeloblastic leukemia patients,” Expert Systems with Applications, vol. 38, no. 5, pp. 6319–6323, 2011.
SECTION II RANDOM VARIABLES TECHNIQUES
CHAPTER
5
DISTRIBUTION OF THE MAXIMUM AND MINIMUM OF A RANDOM NUMBER OF BOUNDED RANDOM VARIABLES
Jie Hao1, Anant Godbole2
1 Department of Statistics and Analytical Sciences, Kennesaw State University, Kennesaw, USA
2 Department of Mathematics and Statistics, East Tennessee State University, Johnson City, USA

ABSTRACT
We study a new family of random variables that each arise as the distribution of the maximum or minimum of a random number N of i.i.d. random variables X1, X2, ∙∙∙, XN, each distributed as a variable X with support on [0, 1]. The general scheme is first outlined, and several special cases are studied in detail. Wherever appropriate, we find estimates of the parameter θ in the one-parameter family in question.
Keywords: Maximum and Minimum, Random Number of i.i.d. Variables, Statistical Inference

Citation: Hao, J. and Godbole, A. (2016), “Distribution of the Maximum and Minimum of a Random Number of Bounded Random Variables”. Open Journal of Statistics, 6, 274-285. doi: 10.4236/ojs.2016.62023.
Copyright: © 2016 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
INTRODUCTION
Consider a sequence $X_1, X_2, \ldots$ of i.i.d. random variables with support on $[0, 1]$ and having distribution function F. For any fixed n, the distributions of $\max_{1 \le i \le n} X_i$ and $\min_{1 \le i \le n} X_i$ have been well studied; in fact it is shown in elementary texts that $P(\max_{1 \le i \le n} X_i \le x) = F(x)^n$ and $P(\min_{1 \le i \le n} X_i \le x) = 1 - (1 - F(x))^n$. But what if we have a situation where the number N of Xi’s is random, and we are instead considering the extrema

$$Y = \max_{1 \le i \le N} X_i \qquad (1)$$

and

$$Z = \min_{1 \le i \le N} X_i \qquad (2)$$

of a random number of i.i.d. random variables? Now the sum S of a random number of i.i.d. variables, defined as

$$S = \sum_{i=1}^{N} X_i,$$

satisfies, according to Wald’s Lemma [1], the equation $E(S) = E(N)E(X)$ provided that N is independent of the sequence $\{X_i\}$ and assuming that the means of X and N exist. The purpose of this paper is to show that the distributions in (1) and (2) can be studied in many canonical cases, even if N and $\{X_i\}$ are correlated. The main deviation from the papers [2] [3] and [4], where similar questions are studied, is that the variable X is concentrated on the interval $[0, 1]$, unlike the above references, where X has lifetime-like distributions on $[0, \infty)$. Even then, we find that many new and interesting distributions arise, none of them
to be found, e.g., in [5] or [6] via the “extreme values of a random number of i.i.d. variables” connection. See, however, Remarks 1 and 2 in Section 3. In another deviation from the theory of extremes of random sequences (see, e.g., [7] ), we find that the tail behavior of the extreme distributions is not relevant due to the fact that the distributions have compact support. We next cite three examples where our methods might be useful. First, we might be interested in the strongest earthquake in a given region in a given year. The number of earthquakes in a year, N, is usually modeled using a Poisson distribution, and, ignoring aftershocks and similarly correlated events, the intensities of the earthquakes can be considered to be i.i.d. random variables in whose distribution can be modeled using, e.g., the data set maintained by Caltech at [8] . Second, many “small world” phenomena have recently been modeled by power law distributions, also sometimes termed discrete Pareto or Zipf distributions. See, for example, the body of work by Chung and her co-authors [9] [10] , and the references therein, where vertex degrees in “internet-like graphs” G (e.g., the vertices of G are individual webpages, and there is an edge between v1 and v2 if one of the webpages has a link to the other) are shown to be modeled by
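$$P(\deg(v) = n) = \frac{n^{-\alpha}}{\zeta(\alpha)}, \quad n = 1, 2, \ldots,$$
for some constant $\alpha > 1$, where $\zeta(\cdot)$ is the Riemann Zeta function
$$\zeta(\alpha) = \sum_{n=1}^{\infty} n^{-\alpha}.$$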
Thus if the vertices v in a large internet graph have some bounded i.i.d. property Xi, then the maximum and minimum values of Xi for the neighbors of a randomly chosen vertex can be modeled using the methods of this paper. Third, we note that N and the Xi may be correlated, as in the CSUG example (studied systematically in Section 3), where the $X_i$ are standard uniform, trials continue until some $X_i$ exceeds $1 - \theta$, and the stopping time N follows the geometric distribution (7). This is an example of a situation where we might be modeling the maximum load that a device might have carried before it breaks down due to an excessive weight or current. It is also feasible in this case that the parameter θ might be unknown.
Here is our general set-up: Suppose $X_1, X_2, \ldots$ are i.i.d. random variables following a continuous distribution on $[0, 1]$ with probability density and distribution functions given by $f(x)$ and $F(x)$ respectively. N is a random variable following a discrete distribution on $\{1, 2, \ldots\}$ with probability mass function given by $P(N = n)$, $n = 1, 2, \ldots$. Let Y and Z be given by (1) and (2) respectively. Then the p.d.f.’s g of Y and Z are derived as follows: Since $P(Y \le y \mid N = n) = F(y)^n$, we see that $g(y \mid N = n) = n f(y) F(y)^{n-1}$ and consequently, the marginal p.d.f. of Y is

$$g(y) = \sum_{n=1}^{\infty} n f(y) F(y)^{n-1} P(N = n). \qquad (3)$$

In a similar fashion, the p.d.f. of Z can be shown to be

$$g(z) = \sum_{n=1}^{\infty} n f(z) (1 - F(z))^{n-1} P(N = n). \qquad (4)$$

What is remarkable is that the sums in (3) and (4) will be shown to assume simple tractable forms in a variety of cases. We want to point out that some of our distributions have been studied before but not using this motivation. For example, the Marshall-Olkin distributions [11] give a new method of adding a parameter to a distribution. Also, other distributions such as the beta and Kumaraswamy [12] distributions can be used to model continuous bounded data, but these do not apply to our set-up. See also Remark 2 in Section 3. Our paper is organized as follows. Section 1 provided a summary and motivation for studying the distributions in the fashion we do. In Section 2, we study the case of $X \sim U[0, 1]$ and $N \sim \mathrm{Geo}(\theta)$. We call this the Standard Uniform Geometric model. The graphs of g(y) and g(z) can be seen in Figure 1 and Figure 2 respectively. The CSUG (Correlated Standard Uniform Geometric) model is studied in Section 3. The graphs of g(y) and g(z) in the CSUG model are plotted in Figure 3 and Figure 4 respectively. Parameter estimation is done in Section 4. Section 5 is devoted to a summary of a variety of other models.
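STANDARD UNIFORM GEOMETRIC (SUG) MODEL
Since $f(y) = 1$, $F(y) = y$ and $P(N = n) = \theta (1 - \theta)^{n-1}$, $n = 1, 2, \ldots$, for some $\theta \in (0, 1)$, we have from (3) that the p.d.f. of Y in the SUG model is given by

$$g(y) = \sum_{n=1}^{\infty} n y^{n-1} \theta (1 - \theta)^{n-1} = \frac{\theta}{[1 - (1 - \theta) y]^2}, \quad 0 \le y \le 1. \qquad (5)$$

Similarly, (4) gives that

$$g(z) = \sum_{n=1}^{\infty} n (1 - z)^{n-1} \theta (1 - \theta)^{n-1} = \frac{\theta}{[1 - (1 - \theta)(1 - z)]^2} = \frac{\theta}{[\theta + (1 - \theta) z]^2}, \quad 0 \le z \le 1. \qquad (6)$$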
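A quick simulation is a useful sanity check on the SUG maximum. The Python sketch below assumes that X is standard uniform on [0, 1] and that N is geometric on {1, 2, ...} with P(N = n) = θ(1 − θ)^(n−1). Under those assumptions, summing y^n P(N = n) gives the closed-form c.d.f. G(y) = θy / (1 − (1 − θ)y), which is our own derivation and is compared here with an empirical frequency. All function names are illustrative.

```python
import random


def geometric(theta, rng):
    """Number of Bernoulli(theta) trials up to and including the first success."""
    n = 1
    while rng.random() >= theta:
        n += 1
    return n


def sug_max(theta, rng):
    """Maximum of N standard uniforms, with N geometric and independent of them."""
    n = geometric(theta, rng)
    return max(rng.random() for _ in range(n))


def sug_max_cdf(y, theta):
    """Closed-form c.d.f. of the maximum under the assumptions stated above."""
    return theta * y / (1.0 - (1.0 - theta) * y)


rng = random.Random(1)
theta = 0.3
sample = [sug_max(theta, rng) for _ in range(100000)]
empirical = sum(v <= 0.5 for v in sample) / len(sample)
exact = sug_max_cdf(0.5, theta)  # 0.15 / 0.65, roughly 0.2308
```

The same check applies to the minimum by replacing max with min and comparing with 1 − G(1 − z).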
Proposition 2.1. If the random variable Y has the “SUG maximum , then distribution” (5) and Proof.
Figure 1: Plot of the SUG maximum density for some values of θ (see Equation (5)).
Figure 2: Plot of the SUG minimum density for some values of θ (see Equation (6)).
Figure 3: Plot of the CSUG maximum density for some values of θ.
Figure 4: Plot of CSUG minimum density for some values of θ.
as claimed. Note. Even though we take the distributions to have support on, this may be done by changing the survival function in [3] , where the same compounding method is used. Specifically we can use the transformation in the proofs of [3] .
Proposition 2.2. The random variable Y has mean and variance given, respectively, by
Proof. Using Proposition 2.1, we can directly compute the mean and variance by setting , and using the fact that for any random variable W. (This proof could equally well have been based on calculating the moments of and then recovering the values of and . The same is true of other proofs in the paper.) Proposition 2.3. If the random variable Z has the “SUG minimum , then distribution” and
Proof.
as asserted. Proposition 2.4. The random variable Z has mean and variance given, respectively, by
Proof. Using Proposition 2.3, it is easy to compute the mean and variance by setting k = 1, k = 2. The m.g.f.’s of Y, Z are easy to calculate too. Notice that the logarithmic terms above arise due to the contributions of the j = 1 and terms, and it is precisely these logarithmic terms that make, e.g., method of moments estimates for θ intractable in a closed (i.e., non-numerical) form. Similar difficulties arise when analyzing the likelihood function and likelihood ratios.
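THE CORRELATED STANDARD UNIFORM GEOMETRIC (CSUG) MODEL
The Correlated Standard Uniform Geometric (CSUG) model is related to the SUG model, as the name suggests, but X and N are correlated as indicated in Section 1. The CSUG problems arise in two cases. One case is that we conduct standard uniform trials until a variable Xi exceeds $1 - \theta$, where θ is the parameter of the correlated geometric variable, and the maximum of $\{X_i\}_{i=1}^{N-1}$ is what we seek. The maximum is between 0 and $1 - \theta$. The other case is where standard uniform trials are conducted until Xi is less than θ, and we are looking for the minimum of $\{X_i\}_{i=1}^{N-1}$. The minimum is between θ and 1. Specifically, let $X_1, X_2, \ldots$ be a sequence of standard uniform variables and define

$$N = \inf\{n : X_n > 1 - \theta\} \quad \text{or} \quad N = \inf\{n : X_n < \theta\}.$$

In either case N has probability mass function given by

$$P(N = n) = \theta (1 - \theta)^{n-2}, \quad n = 2, 3, \ldots; \qquad (7)$$

note that this is simply a geometric random variable conditional on the success having occurred at trial 2 or later. Clearly N is dependent on the X sequence.
Proposition 3.1. Under the CSUG model, the p.d.f. of Y, defined by (1), is given by

$$g(y) = \frac{\theta}{(1 - \theta)(1 - y)^2}, \quad 0 \le y \le 1 - \theta.$$

Proof. The conditional c.d.f. of Y given that $N = n$ is given by

$$P(Y \le y \mid N = n) = \left( \frac{y}{1 - \theta} \right)^{n-1}, \quad 0 \le y \le 1 - \theta.$$

Taking the derivative, we see that the conditional density function is given by

$$g(y \mid N = n) = \frac{(n - 1) y^{n-2}}{(1 - \theta)^{n-1}}.$$

Consequently, the p.d.f. of Y in the CSUG model is given by

$$g(y) = \sum_{n=2}^{\infty} \frac{(n - 1) y^{n-2}}{(1 - \theta)^{n-1}} \, \theta (1 - \theta)^{n-2} = \frac{\theta}{1 - \theta} \sum_{n=2}^{\infty} (n - 1) y^{n-2} = \frac{\theta}{(1 - \theta)(1 - y)^2}.$$

This completes the proof.
Proposition 3.2. The p.d.f. of Z under the CSUG model is given by

$$g(z) = \frac{\theta}{(1 - \theta) z^2}, \quad \theta \le z \le 1.$$

Proof. The conditional cumulative distribution function of Z given that $N = n$ is given by

$$P(Z \le z \mid N = n) = 1 - \left( \frac{1 - z}{1 - \theta} \right)^{n-1}, \quad \theta \le z \le 1.$$

Thus, the conditional density function is given by

$$g(z \mid N = n) = \frac{(n - 1)(1 - z)^{n-2}}{(1 - \theta)^{n-1}},$$

which yields the p.d.f. of Z under the CSUG model as

$$g(z) = \sum_{n=2}^{\infty} \frac{(n - 1)(1 - z)^{n-2}}{(1 - \theta)^{n-1}} \, \theta (1 - \theta)^{n-2} = \frac{\theta}{1 - \theta} \sum_{n=2}^{\infty} (n - 1)(1 - z)^{n-2} = \frac{\theta}{(1 - \theta) z^2},$$
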
which finishes the proof. Proposition 3.3. If the random variable Y has the “CSUG maximum distribution” and , then
Proof.
as claimed. Proposition 3.4. The random variable Y has mean and variance given, respectively, by
and
Proof. Using Proposition 3.3, we can directly compute the mean and variance by setting k = 1, 2. For example with k = 1 we get
Notice that the variance of Y is smaller than that of Y under the SUG model, with an identical numerator term. Also, the expected value is smaller under the CSUG model than in the SUG case. This can be best seen by the inequalities
and
valid for
.
Proposition 3.5. If the random variable Z has the “CSUG Minimum distribution” and , then
Proof. Routine, as before. Proposition 3.6. The random variable Z has mean and variance given, respectively, by
and
Proof. A special case of Proposition 3.3; note that as in the SUG model, . Remark 1. The four distributions of Y and Z under the SUG and the CSUG models can be shown to be affine transformations of the same distribution as seen by the following results (proofs omitted):
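Proposition 3.7. Changing the variable Y of (5) as $Z = 1 - Y$ yields (6). Thus the SUG maximum and SUG minimum variables are related by the fact that $Z \overset{d}{=} 1 - Y$.
Proposition 3.8. Changing the variable Y of the CSUG model (in Proposition 3.1) as $W = Y / (1 - \theta)$ yields $g(w) = \theta / [1 - (1 - \theta) w]^2$, which equals the pdf of (5). Hence $Y_{\mathrm{CSUG}} \overset{d}{=} (1 - \theta) Y_{\mathrm{SUG}}$.
Proposition 3.9. Changing the variable Z of the CSUG model (in Proposition 3.2) as $W = (Z - \theta) / (1 - \theta)$ yields $g(w) = \theta / [\theta + (1 - \theta) w]^2$, which equals the pdf of (6). Thus $Z_{\mathrm{CSUG}} \overset{d}{=} \theta + (1 - \theta) Z_{\mathrm{SUG}}$.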
As a result of these affine transformations, the moment equations (Propositions from 2.1 to 2.4 and from 3.3 to 3.6) can be derived in an easier fashion, though these facts are easier to observe post facto. Remark 2. As stated earlier the distributions of this paper are related to other distributions in the literature, but these do not exploit the extreme value connection as we do. For example, when , (5) reduces to
which is a special case, with k = 1, of the generalized half-logistic distribution [5] , eq. 23.83. Second, the distribution of Z under the CSUG model is a special case of a truncated Pareto distribution, which, for positive a, is defined by
Putting and , we obtain the pdf of Proposition 3.2. This special case of appears in the 2nd type of Zipf’s Law; see Urzúa [13] . The truncated Pareto distribution appears, e.g., in Aban et al. [14] and the references therein.
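PARAMETER ESTIMATION
The intermingling of polynomial and logarithmic terms makes method of moments estimation difficult in closed form, as in the SUG case. However, if θ is unknown, the maximum likelihood estimate of θ can be found in a satisfying form, both in the CSUG maximum and CSUG minimum cases. Suppose that $Y_1, \ldots, Y_n$ form a random sample from the CSUG Maximum distribution with unknown θ. Since the pdf of each observation has the following form:

$$g(y \mid \theta) = \frac{\theta}{(1 - \theta)(1 - y)^2} \quad \text{for } 0 \le y \le 1 - \theta, \text{ and } 0 \text{ otherwise},$$

the likelihood function is given by

$$L(\theta) = \frac{\theta^n}{(1 - \theta)^n \prod_{i=1}^{n} (1 - y_i)^2} \quad \text{for } 0 \le y_i \le 1 - \theta \ (i = 1, \ldots, n), \text{ and } 0 \text{ otherwise}.$$

The MLE of θ is a value of θ, where $y_i \le 1 - \theta$ for $i = 1, \ldots, n$, which maximizes $\theta^n / (1 - \theta)^n$. Let $h(\theta) = \theta / (1 - \theta)$. Since $h'(\theta) = 1 / (1 - \theta)^2 > 0$, it follows that $h$ is an increasing function, which means the MLE is the largest possible value of θ such that $y_i \le 1 - \theta$ for $i = 1, \ldots, n$. Thus, this value should be $1 - \max(y_1, \ldots, y_n)$, i.e., $\hat{\theta} = 1 - \max_{1 \le i \le n} y_i$. Suppose next that $Z_1, \ldots, Z_n$ form a random sample from the CSUG minimum distribution. Since the pdf of each observation has the following form:

$$g(z \mid \theta) = \frac{\theta}{(1 - \theta) z^2} \quad \text{for } \theta \le z \le 1, \text{ and } 0 \text{ otherwise},$$

it follows that the likelihood function is given by

$$L(\theta) = \frac{\theta^n}{(1 - \theta)^n \prod_{i=1}^{n} z_i^2} \quad \text{for } \theta \le z_i \le 1 \ (i = 1, \ldots, n), \text{ and } 0 \text{ otherwise}.$$

As above, it now follows that $\hat{\theta} = \min_{1 \le i \le n} z_i$. It is not too hard to write down the distribution of the MLE’s but we do not do so here.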
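The estimation argument for the CSUG maximum can be tried out numerically. The Python sketch below makes two assumptions that are ours rather than quoted from the text: the sampling scheme draws standard uniforms until one exceeds 1 − θ and keeps the maximum of the earlier draws, redrawing any run whose first value already exceeds 1 − θ; and the estimate takes the form 1 − max(y_i), which is what the monotonicity argument suggests for that scheme. The function names are illustrative.

```python
import random


def csug_max(theta, rng):
    """Maximum of the uniforms seen before the first one exceeding 1 - theta.

    Runs in which the very first draw already exceeds 1 - theta are redrawn,
    which conditions the stopping time on N >= 2.
    """
    while True:
        seen = []
        while True:
            x = rng.random()
            if x > 1.0 - theta:
                break
            seen.append(x)
        if seen:
            return max(seen)


def csug_max_mle(ys):
    """Assumed form of the estimate: largest theta with every y_i <= 1 - theta."""
    return 1.0 - max(ys)


rng = random.Random(7)
theta = 0.3
ys = [csug_max(theta, rng) for _ in range(2000)]
theta_hat = csug_max_mle(ys)
```

Because every observation lies at or below 1 − θ, the estimate can never fall below the true θ; it approaches θ from above as the sample grows.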
A SUMMARY OF SOME OTHER MODELS The general scheme given by (3) and (4) is quite powerful. As another example, suppose (using the example from Section 1) that
and
. Then it is easy to show that
and that
. (The expected value of Y can also be calculated by using the identity . In this section, we collect some more results of this type, without proof:
UNIFORM-POISSON MODEL. Here we let and , so that N follows a left-truncated Poisson distribution.
Proposition 5.1. Under the Uniform-Poisson model,
In some sense, the primary motivation of this paper was to produce extreme value distributions that did not fall into the Beta family (such as for the maximum of n i.i.d. variables). A wide variety of non-Beta-based distributions may be found in [6] . Can we add extreme value distributions to that collection? In what follows, we use both the Beta families and , the arcsine distribution, and a “Beyond Beta” distribution, the Topp-Leone distribution [15] , as “input variables” to make further progress in this direction.
GEOMETRIC-BETA(2, 2) MODEL. Here this case we get
and
. In
POISSON-BETA(2, 2) MODEL. Here and Poisson (q) distribution left- truncated at 0. In this case we get
, the
and
and
GEOMETRIC-ARCSINE MODEL. Here In this case we get
and
.
and
POISSON-ARCSINE MODEL. Here we have
and
and
. Here
GEOMETRIC-TOPP-LEONE MODEL. Here
and
:
and
:
and
POISSON-TOPP-LEONE MODEL.
and
CONCLUSION
In this paper we studied a general scheme for the distribution of the maximum or minimum of a random number of i.i.d. random variables with compact support. While some of the distributions obtained through this process have appeared before in the literature, they have not been studied using this approach. Our biggest open problem is to find data sets for which these new distributions are appropriate.
ACKNOWLEDGEMENTS
The research of AG was supported by NSF Grants 1004624 and 1040928. We thank the referees for their insightful suggestions for improvement.
REFERENCES
1. Durrett, R. (1991) Probability: Theory and Examples. Wadsworth and Brooks/Cole, Pacific Grove. http://dx.doi.org/10.4236/am.2012.34054
2. Louzada, F., Roman, M. and Cancho, V. (2011) The Complementary Exponential Geometric Distribution: Model, Properties, and a Comparison with Its Counterpart. Computational Statistics and Data Analysis, 55, 2516-2524. http://dx.doi.org/10.1016/j.csda.2010.09.030
3. Louzada, F., Bereta, E. and Franco, M. (2012) On the Distribution of the Minimum or Maximum of a Random Number of i.i.d. Lifetime Random Variables. Applied Mathematics, 3, 350-353.
4. Morais, A. and Barreto-Souza, W. (2011) A Compound Class of Weibull and Power Series Distributions. Computational Statistics and Data Analysis, 55, 1410-1425.
5. Johnson, N., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions. Vol. 2, Wiley, New York.
6. Kotz, S. and van Dorp, J. (2004) Beyond Beta: Other Continuous Families of Distributions with Bounded Support and Applications. World Scientific Publishing Co., Singapore. http://dx.doi.org/10.1142/5720
7. Leadbetter, M., Lindgren, G. and Rootzén, H. (1983) Extremes and Related Properties of Random Sequences and Processes. Springer Verlag, New York. http://dx.doi.org/10.1007/978-1-4612-5449-2
8. Earthquake Data Set. http://www.data.scec.org/eq-catalogs/date_mag_loc.php
9. Chung, F., Lu, L. and Vu, V. (2003) Eigenvalues of Random Power Law Graphs. Annals of Combinatorics, 7, 21-33. http://dx.doi.org/10.1007/s000260300002
10. Aiello, B., Chung, F. and Lu, L. (2001) A Random Graph Model for Power Law Graphs. Experimental Mathematics, 10, 53-66. http://dx.doi.org/10.1080/10586458.2001.10504428
11. Marshall, A. and Olkin, I. (1997) A New Method of Adding a Parameter to a Family of Distributions with Applications to the Exponential and Weibull Families. Biometrika, 84, 641-652. http://dx.doi.org/10.1093/biomet/84.3.641
12. Kumaraswamy, P. (1980) A Generalized Probability Density Function for Double-Bounded Random Processes. Journal of Hydrology, 46, 79-88. http://dx.doi.org/10.1016/0022-1694(80)90036-0
13. Urzúa, C.M. (2000) A Simple and Efficient Test for Zipf’s Law. Economics Letters, 66, 257-260. http://dx.doi.org/10.1016/S0165-1765(99)00215-3
14. Aban, I.B., Meerschaert, M.M. and Panorska, A.K. (2006) Parameter Estimation for the Truncated Pareto Distribution. Journal of the American Statistical Association, 101, 270-277. http://dx.doi.org/10.1198/016214505000000411
15. Topp, C.W. and Leone, F.C. (1955) A Family of J-Shaped Frequency Functions. Journal of the American Statistical Association, 50, 209-219. http://dx.doi.org/10.1080/01621459.1955.10501259
CHAPTER 6
RANDOM ROUTE AND QUOTA SAMPLING: DO THEY OFFER ANY ADVANTAGE OVER PROBABLY SAMPLING METHODS?
Vidal Díaz de Rada1, Valentín Martínez Martín2
1 Departamento de Sociología, Public University of Navarre, Pamplona, Spain
2 Centro de Investigaciones Sociológicas, Madrid, Spain
ABSTRACT
The aim of this paper is to compare sample quality across two probability samples and one that uses probabilistic cluster sampling combined with random route and quota sampling within the selected clusters in order to define the ultimate survey units. All of them use the face-to-face interview as the survey procedure. The hypothesis to be tested is that a combination of random route sampling and quota sampling (with substitution) can achieve the same degree of representativeness as household sampling (without substitution) based on the municipal register of inhabitants. We found marked differences in the age and gender distributions of the probability samples, where the deviations exceed 6%. A different picture emerges when comparing the employment variables, where the quota sampling overestimates the economic activity rate (2.5%) and the unemployment rate (8%) and underestimates the employment rate (3.46%).
Keywords: Sampling Methods, Random Sampling, Multistage Cluster Sampling, Random Route Method, Quota Sampling
Citation: Rada, V. and Martín, V. (2014), “Random Route and Quota Sampling: Do They Offer Any Advantage over Probably Sampling Methods?” Open Journal of Statistics, 4, 391-401. doi: 10.4236/ojs.2014.45038. Copyright: © 2014 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
INTRODUCTION
Since the earliest applications of quota sampling in the early twentieth century, there has been a wealth of references to its unsuitability for some purposes, such as obtaining population-representative samples [1]-[6]. Most of the criticism focuses on its non-probability nature (which precludes the calculation of sampling error) and the heavy influence of the interviewer on the choice of ultimate respondent. Nevertheless, non-probability sampling methods remain in common use among the majority of private opinion poll and market research companies [7] [8], whose predominant approach is quota sampling within households previously selected by random methods [9]. Without denying the issues raised, the research sector has responded to this criticism, which emanates mainly from academics and statisticians [3] [10], by claiming that these samples “work” and, in fact, sometimes deliver better results than those obtained via strictly random sampling methods [5] [8].
The aim of this paper is to compare sample quality and response to the survey across two probability samples based on the municipal register of inhabitants and one that uses probabilistic cluster sampling combined with random route and quota sampling within the selected clusters in order to define the ultimate survey units. All of them use the face-to-face interview as the survey procedure. The hypothesis to be tested is that a combination of random route sampling and quota sampling (with substitution) can achieve the same degree of representativeness as household sampling (without substitution) based on the municipal register of inhabitants. According to Bauer [9], few studies have focused on assessing the quality of random route samples, and this one adds the use of quota selection within the household.
The paper is organized into 5 sections. The introduction of the research topic is followed by the presentation of the data sources, with a brief indication of their differences and similarities. The three surveys are then compared, taking into account the distribution by age, sex, education and four employment variables (economic activity ratio, rate of unemployment, rate of employment and employment status). The discussion and conclusion sections are followed by the references used.
Data Sources
The two probability samples considered are the European Social Survey (ESS) round 5, conducted in 2011, and the ISSP II Environment Survey, carried out in 2010 by the Centro de Investigaciones Sociológicas (CIS), the Spanish Sociological Research Centre, under an agreement signed with the ISSP (CIS survey number 2837). These two are compared with the sample used for the 2010 Health Barometer (henceforth HB) (CIS survey number 2850), in which households are randomly drawn from census districts and respondents are selected within households using gender and age quotas. The following subsections present the methodological details of these three samples.
European Social Survey (ESS)
The target universe is the population aged 15 years and over, resident in a main home anywhere in Spain (including Ceuta and Melilla), irrespective of nationality [11] [12]. The sampling frame is the municipal register. The universe is stratified by 17 autonomous regions and four municipal size categories: fewer than 10,001 inhabitants, 10,001 to 50,000 inhabitants, 50,001 to 100,000 inhabitants, and over 100,000 inhabitants [13]. Since 2004 (the year of the second round) this survey has used two-stage cluster sampling from census districts with probability proportional to the size of the universe (resident population over the age of 15), taking 6 or 7 individuals from each district; 6 from smaller municipalities (50,000 residents) [11]. Respondents are drawn from lists taken from the May 2010 Municipal Register of inhabitants and supplied by the National Institute of Statistics (INE).
The fieldwork was undertaken between April 11 and July 31, 2011 by 66 interviewers and 22 provincial coordinators. Because respondents who refused to participate could not be substituted [11], the design required up to seven household visits following non-contact (at least two of them during evening or weekend hours), refusal-conversion techniques, and other resources to secure the participation of the more reluctant sample members (Cuxart & Riba, 2009). Selected respondents were sent two letters of presentation prior to the interviewer’s visit (three after a refusal), were promised a payment of 12 Euros, and were given a leaflet showing the results of previous rounds of the survey. With this range of resources, 1885 respondents were successfully recruited from an original sample of 2865 persons, which represents a participation rate of 68.5% after the removal of 114 non-eligible respondents [15].
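The participation rate quoted above can be reproduced as completed interviews divided by the eligible sample (the issued sample minus non-eligible cases). The following is only an illustrative sketch of that arithmetic; the function and argument names are our own and do not come from the ESS documentation.

```python
def participation_rate(completed, issued, ineligible):
    """Return completed interviews as a percentage of the eligible sample."""
    eligible = issued - ineligible  # non-eligible cases are removed from the base
    return 100.0 * completed / eligible

# ESS round 5 figures quoted in the text
ess_rate = participation_rate(completed=1885, issued=2865, ineligible=114)
print(f"ESS participation rate: {ess_rate:.1f}%")  # 68.5%
```

Removing the non-eligible cases from the denominator matters here: dividing by the full issued sample of 2865 would give a visibly lower figure.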
ISSP (II) Environment Survey, CIS, Survey Number 2837
The universe for this survey is all Spanish residents, of either sex, aged 18 and over, based on the 2009 municipal register as sampling frame. It has the same stratification as the European Social Survey, except that it has 7 municipal size categories: 1,000,000 inhabitants. As in the ESS, it uses two-stage cluster sampling drawn from municipal registers with probability proportional to size, followed by a systematic selection of individuals from households in each section. This survey is based on face-to-face interviews conducted in participants’ homes over the period May 13 to July 24, 2010. Of 4000 sample persons, 2560 interviews were achieved, giving a participation rate of 64.5% after excluding 31 non-eligible subjects [16]. For a confidence level of 95.5% (two sigmas), and P = Q, the sampling error is ±1.98% for the sample as a whole. As in the ESS, refusals that persist after four visits to the household are not replaced.
Despite the similarity of the participation rates (66% for the ESS and 64% for the ISSP Environment Survey), the use of fewer techniques to secure the participation of the more reluctant sample members in the second of these surveys explains the marked difference in the number of non-contacts: 84 and 811 (3% and 20% of all attempted contacts), respectively.
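The sampling error quoted for this survey follows from the usual formula for a proportion, e = z·sqrt(P·Q/n), with z = 2 (two sigmas) and P = Q = 0.5. The sketch below is a minimal check of that figure, assuming simple random sampling; it ignores any design effect from clustering, and the function name is ours.

```python
import math

def sampling_error(n, p=0.5, z=2.0):
    """Margin of error, in percentage points, for a proportion p
    under simple random sampling with n completed interviews."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

issp_error = sampling_error(2560)  # 2560 achieved interviews
print(f"ISSP sampling error: ±{issp_error:.2f}%")  # ±1.98%
```

With P = Q and z = 2 the formula reduces to 100/sqrt(n), which is why the error depends only on the number of completed interviews.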
Health Barometer 2010, CIS Survey 2850
The universe for the Health Barometer is the resident population aged 18 and over, stratified by autonomous regions and 7 municipal size categories (1,000,000 inhabitants). Three nationally representative subsamples of 2600 respondents were taken and interviewed in March, June, and October to November, using multi-stage cluster sampling in which the primary units (municipalities) and secondary units (census districts) were selected with probability proportional to size (number of residents aged 18 and over) [17]. The aggregate sample of 7750 questionnaires has a sampling error of ±1.13% for the sample as a whole, assuming simple random sampling, for a confidence level of 95.5% (two sigmas), and P = Q [18].
Households were randomly drawn from every other apartment building in each census district, using gender and age quotas for the selection of the ultimate respondent (household member) [17]. When the household contained more than one eligible interviewee, the youngest of these was selected for interview [19]. Whenever the interviewer was unable to conduct the interview in the appointed place (non-contact, refusal, etc.), the sample unit was replaced according to the standards laid out in the document General standards for correct sampling, which states that, “when an interview is not possible at the first contact, the interviewer can try next door” [19], that is, in the first household of the following segment (group of six households). In the case of apartment buildings where no interview can be achieved, the address is replaced with the one next to it [19]. These procedures explain why it was necessary to contact 161,395 households in order to achieve 7750 interviews, which represents an average of 20.8 contacts per interview.
Summing up, all three surveys stratify by autonomous region and municipal size, the only difference being that the ESS uses fewer strata (see Table 1). The data are collected by means of face-to-face interviews, using multi-stage cluster samples (two-stage in the probability samples, three-stage in the HB) drawn from municipal register data stratified by municipal size. Beyond this point the three surveys differ. The HB uses a combined random route and quota sampling procedure, whereas the ESS and the ISSP take random samples from lists of individuals. In addition, the ESS uses a wide range of resources to encourage participation (letters of presentation, re-contacts, refusal conversion techniques, productivity-linked remuneration of interviewers, etc.), while the ISSP uses a letter of presentation and four re-contacts. Another key difference is that the HB replaces non-contacts, while the other two do not (see Table 1).
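All three designs select their clusters with probability proportional to size (PPS). As a rough illustration of what that involves, the sketch below implements one common variant, systematic PPS selection on cumulated sizes. It is not the procedure documented by any of the three surveys: the district sizes are hypothetical, and the random start is passed in explicitly so that the example is reproducible.

```python
from itertools import accumulate

def pps_systematic(sizes, n_select, start):
    """Systematic selection with probability proportional to size.

    sizes    : population size of each cluster (e.g. a census district)
    n_select : number of clusters to draw
    start    : random start in [0, sum(sizes) / n_select)
    Returns the indices of the selected clusters.
    """
    total = sum(sizes)
    step = total / n_select              # sampling interval on the size scale
    cumulative = list(accumulate(sizes))
    selected, idx = [], 0
    for k in range(n_select):
        point = start + k * step
        # advance to the first cluster whose cumulated size exceeds the point
        while idx < len(cumulative) - 1 and cumulative[idx] <= point:
            idx += 1
        selected.append(idx)
    return selected

# Hypothetical district sizes (residents aged 18 and over)
district_sizes = [1200, 800, 1500, 500, 2000]
chosen = pps_systematic(district_sizes, n_select=3, start=500)
print(chosen)  # [0, 2, 4]
```

In practice the start would be drawn at random, for example with `random.uniform(0, step)`. A cluster larger than the sampling interval can be hit more than once, which is one reason real designs often treat very large units as separate strata.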
DISTRIBUTION COMPARISON
Representativeness will be examined by comparing the distribution obtained by each survey against the reference population [17]. This comparison can only be conducted for those variables for which information is available for the population as a whole; namely, the separate gender and age distributions, and the joint age and gender distribution. The data in question are those available from the annual municipal register.
Table 1. Differences between the three sampling methods.
| | ESS | ISSP | HB |
| --- | --- | --- | --- |
| Universe | 15 years and over | 18 years and over | 18 years and over |
| Sampling | Two stage | Two stage | Three stage |
| Sampling units | Census blocks | Census blocks | Municipalities and census blocks |
| Sampling units selection | PPS | PPS | PPS |
| Final units selection | Systematic | Systematic | Random routes and quotas |
| Sample size | 2,865 | 4,000 | 7,800 |
| Final respondents | 1,885 | 2,560 | 7,750 |
Source: [13] [16] [18].
It would be interesting to extend the comparison to other highly relevant variables, such as educational attainment and the economic activity ratio, for which no current data are available for the survey universe. There is, however, a source of data that is highly representative of the Spanish population: the Economically Active Population Survey (henceforth EAPS), which gathers and publishes statistics that are essential for an understanding of the human side of the nation’s economic activity, with the primary objective of assessing the degree of economic activity and other related issues among the population [20]. It is the largest household survey in terms of sample size, workforce size, and cost.
As a result of the large size of the sample used for the EAPS, and in view of its particular design characteristics, it can be considered a source of data that are very close to the real Spanish population data. The EAPS also provides data on the economic activity ratio, most of them broken down by gender. For a discussion of the comparability of the two surveys, see Díaz de Rada and Núñez Villuendas [21]. This enables the comparison to include four of the most relevant socio-political research variables: age by gender, educational attainment by gender, and employment rate by gender.
Our analysis looks at the joint distributions by gender rather than the marginal distributions of the variables of interest. The use of joint distributions enables us to detect differences in some variable frequencies that would go undetected if we were to use marginal distributions, where there is a possibility of cross-segment compensation. (In the first part of Table 2, for example, if we were working with the marginal age distribution, the 55 to 64 age bracket would show a 0.45% negative deviation from the municipal register data, while the gender breakdown shows males to be under-represented by 0.67% and females to be over-represented by 0.21%. This enables us to attribute the observed deviation in that age bracket to males.) These variables are examined separately in the following sections.
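The compensation effect described above can be seen with the two differences quoted for the 55 to 64 bracket. The short sketch below simply contrasts the signed sum, which is what a marginal distribution would show, with the sum of absolute values used later in the tables; the variable names are ours.

```python
# Differences (sample minus universe, in points) for the 55 - 64 bracket,
# ESS panel of Table 2
men_diff = -0.67
women_diff = 0.21

marginal_diff = men_diff + women_diff            # signed differences partly cancel
sum_abs_diff = abs(men_diff) + abs(women_diff)   # absolute differences do not

print(f"marginal: {marginal_diff:+.2f}, sum of absolute differences: {sum_abs_diff:.2f}")
# marginal: -0.46, sum of absolute differences: 0.88
```

The marginal figure obtained from the rounded entries (−0.46) differs by one hundredth from the −0.45 reported in the text, presumably because the published differences were computed before rounding.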
Age and Gender Comparisons
Table 2 shows a comparison between the age distribution data from each survey and the sampling frame. Since the ESS used the 2010 municipal register and the fieldwork was conducted in 2011, the displayed age distribution is for the population in 2011. The discrepancy between the number of cases shown in Table 2 and the number mentioned in the sample design described above is due partly to the removal of under-eighteen year olds and Ceuta and Melilla residents and partly to the weighting of the data. These points aside, the difference with the universe is just over 8 points, due mainly to the under-estimation by 2.33% of the population over the age of 64 years. It is also possible to observe over-estimation of the youngest population group and of those in the 45 - 54 age bracket, as has been observed in every round of the survey up to the present [11]. The gender-differentiated analysis reveals slightly higher male representation. Males between the ages of 45 and 54 are over-represented by 1.58%, while those in the 55 - 64 and 25 - 44 age brackets are under-represented. Females in the youngest age bracket are over-represented by 1.6%, and those in the over 65 age bracket are under-represented by more than 2%.
By dividing each magnitude in Table 2 by the total sum of absolute differences (SAV, 8.11 points), we are able to detect which subgroup (or subgroups) contributes most to the observed deviation. In this case, close to 30% (29% to be exact) of the deviation appears in the subgroup of women over the age of 64, and 18.89% of it appears in the youngest female age bracket. A further 19.21% appears in the male 45 - 54 age bracket.
Table 2. Sample vs. universe comparison of age and gender distributions. Vertical percentages and differences between magnitudes (sample estimate minus universe).
European Social Survey (5th Round)
| Age | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 18 - 24 | 5.00 | 0.33 | 6.10 | 1.61 | 11.10 | 1.95 | 1.95 |
| 25 - 34 | 9.10 | −0.56 | 9.20 | −0.02 | 18.30 | −0.57 | 0.57 |
| 35 - 44 | 10.20 | −0.30 | 9.80 | −0.16 | 20.00 | −0.46 | 0.46 |
| 45 - 54 | 10.30 | 1.58 | 9.00 | 0.29 | 19.30 | 1.87 | 1.87 |
| 55 - 64 | 5.80 | −0.67 | 7.00 | 0.21 | 12.80 | −0.45 | 0.88 |
| Over 65 years old | 8.90 | 0.02 | 9.60 | −2.35 | 18.50 | −2.33 | 2.37 |
| TOTAL | 49.30 | 0.41 | 50.80 | −0.31 | 100.00 | 0.10 | |
| SAV | | 3.45 | | 4.75 | | | 8.11 |
ISSP (II) Environment Survey, CIS Survey Number 2837
| Age | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 18 - 24 | 4.30 | −0.50 | 4.80 | 0.19 | 9.10 | −0.31 | 0.69 |
| 25 - 34 | 9.10 | −0.97 | 9.20 | −0.32 | 18.30 | −1.29 | 1.29 |
| 35 - 44 | 10.00 | −0.42 | 9.90 | 0.01 | 19.90 | −0.41 | 0.44 |
| 45 - 54 | 9.70 | 1.15 | 9.00 | 0.48 | 18.70 | 1.63 | 1.63 |
| 55 - 64 | 6.80 | 0.39 | 7.10 | 0.37 | 13.90 | 0.76 | 0.76 |
| Over 65 years old | 9.60 | 0.89 | 10.50 | −1.27 | 20.10 | −0.38 | 2.16 |
| TOTAL | 49.50 | 0.54 | 50.50 | −0.54 | 100.00 | 0.0 | |
| SAV | | 4.33 | | 2.64 | | | 6.97 |
Routes and Quotas (2010 Health Barometer), CIS Survey Number 2850
| Age | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 18 - 24 | 4.90 | 0.10 | 4.60 | −0.02 | 9.60 | 0.09 | 0.11 |
| 25 - 34 | 10.50 | 0.43 | 9.70 | 0.18 | 20.20 | 0.61 | 0.61 |
| 35 - 44 | 10.50 | 0.08 | 9.90 | 0.01 | 20.40 | 0.09 | 0.09 |
| 45 - 54 | 8.40 | −0.15 | 8.40 | −0.12 | 16.80 | −0.27 | 0.27 |
| 55 - 64 | 6.30 | −0.11 | 6.70 | −0.03 | 13.00 | −0.14 | 0.14 |
| Over 65 years old | 8.70 | −0.01 | 11.40 | −0.37 | 20.10 | −0.38 | 0.38 |
| TOTAL | 49.30 | 0.34 | 50.80 | −0.34 | 100.10 | 0.0 | |
| SAV | | 0.87 | | 0.73 | | | 1.60 |
Note: positive values to be interpreted as sample over-representation and negative values as sample under-representation. Source: [13] [16] [18]. Data for the universe taken from the NSI [14] [22].
The three ESS subgroups identified above (women over 64, women aged 18 - 24 and men aged 45 - 54) account for 68.4% of the age- and gender-related deviation.
The “ISSP II Environment Survey” uses the Municipal Register for 2009 (CIS, 2010) as its sampling frame, with fieldwork completed in July 2010, which produces a similar situation to that of the preceding case. That is, the age distribution shown in Table 2 is for the 2010 population. The differences with respect to the Municipal Register decrease to 6.97 points, and the under-representation of the over-65s (especially females) accounts for most of this deviation. The next most noteworthy feature is the over-estimation of the 45 - 54 age bracket. The distribution by gender shows that the sample does not adequately represent males. The 45 - 54 male age bracket is overestimated by 1.15%, and younger men are underestimated overall, to a slightly higher degree in the 25 - 34 age bracket (by 0.50%, 0.97% and 0.42% in the 18 - 24, 25 - 34 and 35 - 44 brackets, respectively). The main differences in the case of females lie in the under-estimation of the over 65 age group and the over-estimation of the 45 - 54 age group. The subgroups that contribute most to the observed deviations are older women and men between the ages of 45 and 54. These two subgroups account for 34.7% of the observed differences, with the percentage of deviation explained increasing to 48.6% when the male 25 - 34 age bracket is included.
When compared, the HB and the January 1, 2010 revision of the Municipal Register show strong similarity in all age brackets except the 25 to 34 age bracket, which is over-represented by 0.61%. From that point onwards, we find slight but increasing under-representation, which reaches 0.38% in the oldest age bracket. The 35 to 44 age bracket is the best represented. Differentiating by gender, we find the most poorly represented group to be males between the ages of 25 and 34, followed by older females. In both cases, the difference is less than 1%, and is therefore attributable to sampling error (±1.13%). Taken together, these two groups explain half of the total deviation, which, it should be noted, is only 1.60 points.
It appears logical that the quota-based sample should display fewer differences in the age and gender distributions than samples obtained by other procedures, which might make this comparison appear spurious. However, we consider it justified by the fact that interviewers are authorized to replace a respondent from one quota with another (of the same sex) from the adjacent quota when contact proves difficult. Although detailed analysis of the fieldwork records shows that contact difficulties involved mainly young people and were slightly more frequent in the case of females, these replacements have no significant impact on the representativeness of the age and gender distributions.
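The subgroup shares discussed in this section are obtained by dividing absolute differences by the sum of absolute values (SAV). The sketch below recomputes them for the ISSP panel of Table 2 as a rough check; the dictionary layout and function names are our own. Because it works from the rounded table entries, its results can differ from the published figures by about a tenth of a point.

```python
# Differences (sample minus universe, in points), ISSP panel of Table 2
issp_diffs = {
    ("men", "18-24"): -0.50, ("men", "25-34"): -0.97, ("men", "35-44"): -0.42,
    ("men", "45-54"): 1.15, ("men", "55-64"): 0.39, ("men", "65+"): 0.89,
    ("women", "18-24"): 0.19, ("women", "25-34"): -0.32, ("women", "35-44"): 0.01,
    ("women", "45-54"): 0.48, ("women", "55-64"): 0.37, ("women", "65+"): -1.27,
}

def sav(diffs):
    """Sum of the absolute values of all differences."""
    return sum(abs(d) for d in diffs.values())

def share(diffs, cells):
    """Percentage of the SAV contributed by the given cells."""
    return 100.0 * sum(abs(diffs[c]) for c in cells) / sav(diffs)

total = sav(issp_diffs)
top_two = share(issp_diffs, [("women", "65+"), ("men", "45-54")])
top_three = share(issp_diffs, [("women", "65+"), ("men", "45-54"), ("men", "25-34")])
print(f"SAV = {total:.2f}; top two = {top_two:.1f}%; top three = {top_three:.1f}%")
```

Working with absolute values prevents over- and under-representation in different cells from cancelling out, which is the point of reporting the SAV alongside the signed totals.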
Differences in the Educational Attainment Distribution
The responses to the educational attainment questions were recoded to align them with the EAPS categorization. As can be seen in the first part of Table 3, the deviation in the ESS arises mainly from under-estimation in the secondary education category (−9.05%), strong over-representation in the semi-higher qualification category (8.27%), and somewhat less marked over-representation in the incomplete primary and less category (2.15%). The absolute differences of the male and female secondary education categories, divided by the total sum of absolute differences (21.48 points), show that they account for 42.12% of all the deviation in the table.
The ISSP II Environment Survey over-estimates the semi-higher qualification category (10.32%) and underestimates the incomplete primary and less category (−6.36%), the deviation being greater in the female subcategory (−3.97%). The subcategories that contribute most to the total deviation are the semi-higher qualification category and the female incomplete primary and less category, which together account for 60.7% (and 70.9% when the male incomplete primary and less category is included). Note that this survey provides a better representation of men but a poorer representation of women, in complete contrast to the ESS. In sum, the marked differences in educational attainment found in the two probability samples amount to sums of absolute differences of 21.48 and 23.54 points, respectively.
The data compiled by the HB (using random route and quotas) are somewhat better, since none of the differences across the table enter double figures. There is 7.61% over-estimation in the semi-higher qualification category, and an under-estimation of 6.44% in the highest educational attainment category (university degree). The deviation is notably lower in the primary education category, where we find an over-estimation of 2.43%, which is slightly greater than the under-estimation found in the incomplete primary and less category. No major differences emerge when these results are broken down by gender, since they continue to show the same over-representation in the semi-higher qualification category and under-representation in the university education category as found in the overall results. The differences in these two subcategories account for 69.7% of all the differences in the table.
Table 3. Sample vs. universe comparison of the education distribution by gender. Vertical percentages and differences between magnitudes (sample estimate minus universe).
European Social Survey (5th Round)
| Education | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Incomplete primary and less | 5.50 | 1.42 | 7.00 | 0.72 | 12.50 | 2.15 | 2.15 |
| Primary | 9.30 | 0.27 | 10.10 | −0.15 | 19.40 | 0.12 | 0.42 |
| Secondary | 15.00 | −5.11 | 14.30 | −3.94 | 29.30 | −9.05 | 9.05 |
| Semi-higher qualification | 8.00 | 4.52 | 7.40 | 3.75 | 15.40 | 8.27 | 8.27 |
| University | 11.40 | −0.75 | 11.90 | −0.84 | 23.30 | −1.59 | 1.59 |
| TOTAL | 49.20 | 0.36 | 50.70 | −0.46 | 99.90 | −0.10 | |
| SAV | | 12.07 | | 9.41 | | | 21.48 |
ISSP (II) Environment Survey, CIS Survey Number 2837
| Education | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Incomplete primary and less | 1.90 | −2.39 | 2.50 | −3.97 | 4.40 | −6.36 | 6.36 |
| Primary | 10.60 | 1.09 | 10.80 | 0.17 | 21.40 | 1.26 | 1.26 |
| Secondary | 18.40 | −1.62 | 15.40 | −2.75 | 33.80 | −4.37 | 4.37 |
| Semi-higher qualification | 8.50 | 5.17 | 8.60 | 5.16 | 17.10 | 10.32 | 10.32 |
| University | 10.70 | −1.09 | 12.50 | 0.14 | 23.20 | −0.95 | 1.23 |
| TOTAL | 50.10 | 1.16 | 49.80 | −1.26 | 99.90 | −0.10 | 0.00 |
| SAV | | 11.36 | | 12.18 | | | 23.54 |
Routes and Quotas (2010 Health Barometer), CIS Survey Number 2850
| Education | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Incomplete primary and less | 3.40 | −0.80 | 5.20 | −1.19 | 8.60 | −2.00 | 2.00 |
| Primary | 10.50 | 1.09 | 11.90 | 1.34 | 22.40 | 2.43 | 2.43 |
| Secondary | 18.90 | −1.18 | 17.70 | −0.52 | 36.60 | −1.70 | 1.70 |
| Semi-higher qualification | 7.40 | 4.05 | 7.00 | 3.56 | 14.40 | 7.61 | 7.61 |
| University | 8.90 | −2.99 | 9.00 | −3.45 | 17.90 | −6.44 | 6.44 |
| TOTAL | 49.10 | 0.17 | 50.80 | −0.27 | 99.90 | −0.10 | 0.00 |
| SAV | | 10.12 | | 10.06 | | | 20.18 |
Source: [13] [16] [18]. Data for the universe taken from the NSI [23] [24].
Sample vs. Universe Comparison of Employment Variables
The data collected from the employment status question were used to estimate economic activity, employment and unemployment rates, which we consider more useful than the raw responses. It should be noted that, in the questions about economic activity in all three surveys, interviewees are assigned to the employment or unemployment category according to their own responses. This contrasts with the EAPS questionnaire design, which distinguishes the economically active from the unemployed through the deliberate wording of its questions. In other words, in the surveys considered in this paper, “unemployed” is an ascription that covers respondents who may, in fact, be non-active, that is, not part of the economically active population. This point will be discussed further at a later stage.
Table 4 shows that the ESS over-estimates both the economic activity rate (particularly the male estimate) and the employment rate. The ESS unemployment rate estimate is, nevertheless, 0.49% lower than that reported by the EAPS, with female unemployment estimated less accurately than male unemployment. It should be noted, however, that this is the lowest deviation found in any of the surveys considered, and that there is no observable change in the results when the population aged 16 to 17 and Ceuta and Melilla residents are included.
Wider deviation can be observed in all three estimates given by the ISSP II Environment Survey, whose economic activity and employment rates deviate from the EAPS by more than 5%. Differentiation by gender reveals the widest deviations to be the over-estimation of the female economic activity rate (7.41%) and the over-estimation of the male employment rate (5.81%). It is also possible to observe an under-estimation of the overall unemployment rate by 1.49%, and by more than double that amount in the male unemployment subgroup. The sum of absolute values (SAV) reveals deviations of more than 10 points in the economic activity and employment rates.
The HB over-estimates the economic activity and unemployment rates, the second of these by 8.64%, and the female unemployment subgroup by even more (9.32%). The sum of absolute values (SAV) of the differences in the unemployment rate is 17.5 points, the widest deviation observed in this study. The HB also under-estimates the overall employment rate by 3.46%, and male employment by 3.65%.
Table 4. Sample vs. universe comparison of economic activity, unemployment and employment rates by gender. Vertical percentages and differences between magnitudes (sample estimate minus universe).
European Social Survey (5th Round)
| Rates of… | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Activity | 71.60 | 4.16 | 56.20 | 3.06 | 63.80 | 3.68 | 7.22 |
| Unemployment | 20.80 | 0.22 | 20.00 | −1.27 | 20.40 | −0.49 | 1.49 |
| Employment | 56.70 | 3.14 | 45.00 | 3.16 | 50.80 | 3.24 | 6.30 |
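ISSP (II) Environment Survey, CIS Survey Number 2837
| Rates of… | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Activity | 72.10 | 3.73 | 59.60 | 7.41 | 65.80 | 5.69 | 11.14 |
| Unemployment | 15.90 | −3.82 | 21.90 | 1.34 | 18.60 | −1.49 | 5.16 |
| Employment | 60.70 | 5.81 | 46.50 | 5.04 | 53.60 | 5.57 | 10.85 |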
Routes and Quotas (2010 Health Barometer), CIS Survey Number 2850
| Rates of… | Men: Score | Men: Difference | Women: Score | Women: Difference | Total: Score | Total: Difference | SAV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Activity | 70.60 | 2.52 | 54.50 | 2.24 | 62.50 | 2.50 | 4.76 |
| Unemployment | 27.90 | 8.17 | 29.80 | 9.32 | 28.70 | 8.64 | 17.49 |
| Employment | 51.00 | −3.65 | 38.30 | −3.26 | 44.50 | −3.46 | 6.91 |
Source: samples as in Table 3. Data for the universe taken from the NSI [25] [26].
DISCUSSION
It comes as a surprise to find such marked differences in the age and gender distributions of the probability samples, where the joint deviation exceeds 6 points and can, therefore, in no way be explained by sampling error. The most noteworthy discrepancies are those that appear between the two probability samples, which, as already noted, use a similar sampling design; these can, therefore, only be attributed to the techniques used by the ESS to increase survey participation. Re-contacts and refusal conversion techniques enabled the ESS, in the third round (2006), to retrieve 18.3% and 6.3% of the sample, respectively. Differences in the successful representation of the universe can be explained by the specific features of these two respondent groups [27].
The quality of the HB data is superior to that achieved by the two probability sampling-based surveys, due to the use of age and gender quotas. These require the interviewers to seek out individuals with given characteristics, with the result that the sample matches the universe because of the nature of the quota selection policy. In light of these considerations, quota replacements, by which interviewers are permitted to substitute a non-contact with an individual from an adjacent quota in either direction, are offset by replacements made by other interviewers in the opposite direction, and can, therefore, be classed as errors that cancel each other out. Given that the sample is defined by the age and gender distribution in the universe, age and gender quotas serve little purpose, except to confirm that age and gender are not strongly associated with other variables, such as the economic activity rate. This casts some doubt on the usefulness of age and gender data for quota sampling.
The differences that appear in the probability samples range from 10.32 points in the estimation of the semi-higher qualification category in the ISSP II Environment Survey to 0.12 points in the primary education figures obtained by the ESS. These two surveys present deviations of 23.54 and 21.48 points, respectively. The HB, in contrast, reflects educational attainment in the population slightly better (20.18 points). This could be due to lower age dispersion, since there is a very close link between age and educational attainment.
A completely different picture emerges when it comes to comparing the employment variables, particularly with respect to the employment, economic activity, and unemployment rates. The ESS over-estimates the economic activity and employment rates by more than 3 points, and by more than 4 points in the male subcategory, while the Environment Survey reports even higher rates, particularly for female activity (7.41%) and male employment, where the deviation reaches 5.81%. The HB over-estimates the economic activity rate by 2.5% and the unemployment rate by 8.64% (9.3% in the case of women), and under-estimates the employment rate by 3.46%.
Two important issues underlie these differences. The first has to do with the different universes under consideration: the population over the age of 15 in the EAPS and the population over the age of 18 in the HB. More significantly, from our point of view, the survey design of the HB, as noted previously, instructs interviewers to replace addresses where there is no reply with the one next door. Calling at one household after another (an average of 20.8 contacts per interview) results in the replacement both of households where there is no reply, because the occupants are at work or absent for other reasons, and of those that are empty, because they are second homes or (to a lesser degree) false entries in the municipal register, with households whose occupants are at home when the interviewer calls. Given that the employed spend less time at home than the unemployed, the probability of not being at home when the interviewer calls is higher among the employed than among the unemployed. In our opinion, this is the reason for the higher proportion of unemployed found by the HB (and similar surveys), and it could account for the marked over-estimation of the unemployment rate. The large number of unemployed also explains the higher economic activity ratio, which is calculated as the active population (employed and unemployed) over the population as a whole. Another possible interpretation of these findings could be that different methods were used to measure the economic activity rate.
Some use the number of self-declared unemployed, making no distinction between the unemployed (those who have previously been employed) and the non-active population (those who do not form part of the labour market). Depending on the wording of the question, individuals who were employed in the past but have ceased working for a period in order to care for a relative might select the “unemployed” response in the questionnaire. The EAPS questionnaire, however, is worded in such a way as to enable a distinction between the true unemployed and members of the non-active population. Similar confusion arises with people who have worked in the underground economy and subsequently ceased to do so. The predominance of females in the social care workforce [28] and in the underground economy accounts for the over-estimation of the unemployment rate.
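The at-home mechanism described above can be illustrated with a small sketch. It assumes the usual definitions (activity rate = active population over total population, employment rate = employed over total population, unemployment rate = unemployed over active population) and that a substituted respondent is reached with probability proportional to being at home. The population shares and at-home probabilities below are hypothetical, chosen only for illustration; they are not taken from the HB, the ESS or the EAPS.

```python
def rates(employed, unemployed, inactive):
    """Return (activity, employment, unemployment) rates as fractions."""
    population = employed + unemployed + inactive
    active = employed + unemployed
    return active / population, employed / population, unemployed / active


def expected_sample(shares, p_home):
    """Expected respondent shares when selection is proportional to being at home."""
    weighted = {k: shares[k] * p_home[k] for k in shares}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


# Hypothetical population shares and at-home probabilities (not survey data).
population = {"employed": 0.48, "unemployed": 0.12, "inactive": 0.40}
p_home = {"employed": 0.5, "unemployed": 0.8, "inactive": 0.8}

sample = expected_sample(population, p_home)
pop_rates = rates(**population)
sample_rates = rates(**sample)
print("population unemployment rate:", round(pop_rates[2], 3))
print("sample unemployment rate:    ", round(sample_rates[2], 3))
```

In this toy setting the unemployment rate rises and the employment rate falls, which matches the direction of the HB deviations for those two indicators. The activity rate depends on how often the non-active population is at home, which the sketch fixes arbitrarily.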
CONCLUSION
In summary, random route and quota samples provide a better representation of the age and educational attainment distributions, but over-estimate the unemployment figures, which the probability samples estimate more accurately. The CIS, however, which uses random route and quota sampling with substitution methods, obtains more representative samples than other surveys of its type conducted in Spain [29].
Authors and Affiliations
Vidal Díaz de Rada Igúzquiza, Public University of Navarre, Dpto de Sociología, 31006 Pamplona, Spain
Valentín Martínez Martín, Centro de Investigaciones Sociológicas, Montalban 8, 28014 Madrid, Spain
Acknowledgements
Vidal Díaz de Rada, Valentín Martínez Martín (Government of Spain) (grant number CSO2012-34257).
REFERENCES
1. Marsh, C. and Scarbrough, E. (1990) Testing Nine Hypotheses about Quota Sampling. Journal of the Market Research Society, 32, 485-506.
2. Kish, L. (1965) Survey Sampling. Wiley, New York.
3. Kish, L. (1998) On Quota Sampling. Working Paper, Universidad de Michigan. http://www.websm.org/uploadi/editor/1132389572kish.doc
4. Kish, L. (1998) Quota Sampling: Old plus New Thought. Working Paper, Universidad de Michigan, Unpublished.
5. Sudman, S. and Blair, E. (1999) Sampling in the Twenty-First Century. Journal of the Academy of Marketing Science, 27, 269-277.
6. Armate, M. (2004) La introducción en Francia de los métodos de sondeo aleatorio. Empiria, 8, 70-80.
7. Rothman, J. and Dawn, M. (1989) Statisticians Can Be Creative Too. Journal of the Market Research Society, 31, 456-466.
8. Taylor, H., Harris, L. and Associated (1995) Horses for Courses: How Survey Firms in Different Countries Measure Public Opinion with Very Different Methods. Journal of the Market Research Society, 37, 211-219.
9. Bauer, J.J. (2014) Selection Errors of Random Route Samples. Sociological Methods & Research. http://smr.sagepub.com/content/early/2014/02/24/0049124114521150
10. Burton, D. (2000) Research Training for Social Scientist. Sage, London.
11. Cuxart, A. and Riba, C. (2009) Mejorando a partir de la experiencia: La implementación de la tercera ola de la ESE en España. Revista Española de Investigaciones Sociológicas, 125, 147-165.
12. Stoop, I., Billiet, J., Koch, A. and Fitzgerald, R. (2010) Improving Survey Response: Lessons Learned from the European Social Survey. Wiley, Chichester.
13. Spanish National Team European Social Survey (2011) Documentation of the Spanish Sampling Procedure 2010. http://www.upf.edu/ess/datos/quinta-ed.html
14. National Institute of Statistics (INE) (2010) Continuous Municipal Register Statistics, Definitive Results 2010. http://www.ine.es/jaxi/menu.do?type=pcaxis&path=%2Ft20%2Fe245&file=inebase&L=1
15. Spanish National Team European Social Survey and Metroscopia (2011) Final Field Report of the 4th Round of ESS. http://www.upf.edu/ess/datos/quinta-ed.html#infadicional
16. Centro de Investigaciones Sociológicas (CIS) (2010) Medio ambiente (II) ISSP. CIS Survey No. 2837.
17. Martín, V.C.M. (2004) Diseño de encuestas de opinión. Rama, Madrid.
18. CIS (2010) Barómetro Sanitario. CIS Survey No. 2850.
19. CIS (2010) Normas generales para la correcta aplicación de la muestra. CIS, Madrid.
20. National Institute of Statistics (INE) Economically Active Population Survey, Methodology. http://www.ine.es/en/inebaseDYN/epa30308/epa_metodologia_en.htm
21. De Rada, V.D. and Villuendas, A.N. (2008) Estudio de las incidencias en la investigación con encuesta. El caso de los barómetros del CIS. CIS, Madrid.
22. National Institute of Statistics (INE) (2011) Continuous Municipal Register Statistics, Definitive Results 2011.
23. National Institute of Statistics (INE) (2010) Economically Active Population Survey, National Results 2010: Activity and Unemployment Rate, Sex and Age Group.
24. National Institute of Statistics (INE) (2011) Economically Active Population Survey, National Results 2011: Activity and Unemployment Rate, Sex and Age Group.
25. National Institute of Statistics (INE) (2010) Economically Active Population Survey, National Results 2010: Population Aged 16 Years Old and Over by Sector of the Level of Education Attained, Sex and Age Group.
26. National Institute of Statistics (INE) (2011) Economically Active Population Survey, National Results 2011: Population Aged 16 Years Old and Over by Sector of the Level of Education Attained, Sex and Age Group.
27. Riba, C., Torcal, M. and Morales, L. (2010) Estrategias para aumentar la tasa de respuesta y los resultados de la Encuesta Social Europea en España. Revista Internacional de Sociología, 68, 603-635. http://dx.doi.org/10.3989/ris.2008.12.17
28. Durán Heras, M.Á. (2012) El Trabajo no remunerado en la economía global. Fundación BBVA, Bilbao.
29. Murgui, S., Muro, J. and Uriel, E. (1992) Influencia de las sustituciones en la calidad de los datos en la encuesta de condiciones de vida y trabajo en España. Estadística Española, 34, 137-149.
CHAPTER
7
EQUIVALENT CONDITIONS OF COMPLETE CONVERGENCE FOR WEIGHTED SUMS OF SEQUENCES OF NEGATIVELY DEPENDENT RANDOM VARIABLES
Mingle Guo
School of Mathematics and Computer Science, Anhui Normal University, Wuhu 241003, China
ABSTRACT
The complete convergence for weighted sums of sequences of negatively dependent random variables is investigated. By applying moment inequality and truncation methods, the equivalent conditions of complete convergence for weighted sums of sequences of negatively dependent random variables are established. These results not only extend the corresponding results obtained by Li et al. (1995), Gut (1993), and Liang (2000) to sequences of negatively dependent random variables, but also improve them.
Citation: Mingle Guo, “Equivalent Conditions of Complete Convergence for Weighted Sums of Sequences of Negatively Dependent Random Variables,” Abstract and Applied Analysis, vol. 2012, Article ID 425969, 13 pages, 2012. doi:10.1155/2012/425969.
Copyright: © 2012 Mingle Guo. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
In many stochastic models, the assumption that random variables are independent is not plausible. Increases in some random variables are often related to decreases in other random variables, so an assumption of negative dependence is more appropriate than an assumption of independence. Lehmann [1] introduced the notion of negatively quadrant dependent (NQD) random variables in the bivariate case.
Definition 1.1. Random variables X and Y are said to be NQD if
$$P(X \le x,\, Y \le y) \le P(X \le x)\,P(Y \le y) \tag{1.1}$$
for all x, y ∈ ℝ. A sequence of random variables {Xn, n ≥ 1} is said to be pairwise NQD if for all i ≠ j, Xi and Xj are NQD. It is important to note that (1.1) is equivalent to
$$P(X > x,\, Y > y) \le P(X > x)\,P(Y > y) \tag{1.2}$$
for all x, y ∈ ℝ. However, (1.1) and (1.2) are not equivalent for a collection of three or more random variables. Consequently, the definition of NQD was extended to the multivariate case by Ebrahimi and Ghosh [2].
Definition 1.2. A finite family of random variables {Xi, 1 ≤ i ≤ n} is said to be negatively dependent (ND) if for all real numbers x1, x2, ..., xn,
$$P\Big(\bigcap_{i=1}^{n}\{X_i \le x_i\}\Big) \le \prod_{i=1}^{n} P(X_i \le x_i), \qquad P\Big(\bigcap_{i=1}^{n}\{X_i > x_i\}\Big) \le \prod_{i=1}^{n} P(X_i > x_i). \tag{1.3}$$
An infinite family of random variables is ND if every finite subfamily is ND. Negative dependence has been very useful in reliability theory and applications. Since the paper of Ebrahimi and Ghosh [2] appeared, Taylor et al. [3, 4] studied the laws of large numbers for arrays of rowwise ND random variables, Ko and Kim [5] and Ko et al. [6] investigated the strong laws of large numbers for weighted sums of ND random variables. Volodin [7]
obtained the Kolmogorov exponential inequality for ND random variables, Amini and Bozorgnia [8] studied the complete convergence for ND random variable sequences, Volodin et al. [9] obtained convergence rates of Baum-Katz type, and Wu [10] investigated complete convergence for weighted sums of sequences of negatively dependent random variables. The concept of negatively associated random variables was introduced by Alam and Saxena [11] and carefully studied by Joag-Dev and Proschan [12].
Definition 1.3. A finite family of random variables {Xi, 1 ≤ i ≤ n} is said to be negatively associated (NA) if for every pair of disjoint subsets A and B of {1, 2, ..., n} and any real coordinate-wise nondecreasing functions f1 on A and f2 on B,
$$\operatorname{Cov}\big(f_1(X_i,\, i \in A),\; f_2(X_j,\, j \in B)\big) \le 0 \tag{1.4}$$
whenever the covariance exists. An infinite family of random variables {Xi, −∞ 0.
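As a concrete illustration of Definition 1.1, the following sketch checks the NQD condition for a pair of random variables with finite support. It assumes the usual form of the condition, P(X ≤ x, Y ≤ y) ≤ P(X ≤ x)P(Y ≤ y) for all real x and y. The function name `is_nqd` and the two example distributions are hypothetical and are not part of the paper.

```python
from itertools import product


def is_nqd(pmf, tol=1e-12):
    """Check P(X<=x, Y<=y) <= P(X<=x) P(Y<=y) at every pair of support points.

    pmf maps (x, y) pairs to probabilities. For a finitely supported pair the
    distribution functions are step functions, so support points suffice.
    """
    xs = sorted({x for x, _ in pmf})
    ys = sorted({y for _, y in pmf})
    for x, y in product(xs, ys):
        joint = sum(p for (a, b), p in pmf.items() if a <= x and b <= y)
        px = sum(p for (a, _), p in pmf.items() if a <= x)
        py = sum(p for (_, b), p in pmf.items() if b <= y)
        if joint > px * py + tol:
            return False
    return True


# Hypothetical examples: Y = 1 - X (moves against X) and Y = X (moves with X).
opposed = {(0, 1): 0.5, (1, 0): 0.5}
comonotone = {(0, 0): 0.5, (1, 1): 0.5}
print(is_nqd(opposed))     # True
print(is_nqd(comonotone))  # False
```

The second example fails at x = y = 0, where the joint probability 0.5 exceeds the product 0.25, so a pair that always moves together is not NQD.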
Theorem A. Let 0 1. Assume β > −1 and is a triangular array of real numbers for all n ≥ 1. Then, the following are equivalent: such that
(i)
(1.7)
(ii)
(1.8) Theorem C. Let {X, Xn, n ≥ 0} be a sequence of identically distributed NA random variables and let r > 1, 0 < α ≤ 1. Then, the following are equivalent:
(i)
(1.9)
(ii)
(1.10)
where . In the current work, we study the complete convergence for ND random variables. Equivalent conditions of complete convergence for weighted sums of sequences of ND random variables are established. As a result, we not only extend and improve the results of Liang [17] for NA random variables to ND random variables without imposing any extra conditions, but also relax the range of β. For the proofs of the main results, we need to restate a few lemmas for easy reference. Throughout this paper, the symbol C denotes a positive constant which is not necessarily the same in each appearance, and I(A) denotes the indicator function of A. Let an ≪ bn denote that there exists a constant C > 0 such that an ≤ Cbn for sufficiently large n, and let an ≈ bn mean that an ≪ bn and bn ≪ an. Also, let log x denote ln max(e, x).
Lemma 1.4 (see [11]). Let {Xn, n ≥ 1} be a sequence of ND random variables and let {fn, n ≥ 1} be a sequence of Borel functions all of which are monotone increasing (or all are monotone decreasing). Then, {fn(Xn), n ≥ 1} is still a sequence of ND random variables.
Lemma 1.5 (see [18]). Let {Xi, 1 ≤ i ≤ n} be a sequence of ND random variables with EXi = 0 and E|Xi|^M < ∞, 1 ≤ i ≤ n, M ≥ 2. Then,
(1.11)
where C depends only on M. Lemma 1.6 (see [10]). Let {Xn, n ≥ 1} be a sequence of ND random variables. Then, there exists a positive constant C such that for any x ≥ 0 and all n ≥ 1,
(1.12)
By Lemma 1.2 and Theorem 3 in [19], we can obtain the following lemma.
Lemma 1.7. Let {Xi, 1 ≤ i ≤ n} be a sequence of ND random variables with EXi = 0 and E|Xi|^M < ∞, 1 ≤ i ≤ n, M ≥ 2. Then,
(1.13)
where C depends only on M.
By using Fubini’s theorem, the following lemma can be easily proved. Here, we omit the details of the proof. Lemma 1.8. Let X be a random variable, then (i) (ii)
for any and
.
MAIN RESULTS
Now we state our main results. The proofs will be given in Section 3.
Theorem 2.1. Let {X, Xn, n ≥ 1} be a sequence of identically distributed ND random variables, r > 1, p > 1/2, β + p > 0, and suppose that EX = 0 for 1/2 < p ≤ 1. Assume that {ani ≈ (i/n)^β (1/n^p), 1 ≤ i ≤ n, n ≥ 1} is a triangular array of real numbers. Then, the following are equivalent:
(i)
(2.1)
(ii)
(2.2)
Theorem 2.2. Let {X, Xn, n ≥ 0} be a sequence of identically distributed ND random variables, r > 1, p > 1/2, β + p > 0, and suppose that EX = 0 for 1/2 < p ≤ 1. Assume that {ani ≈ ((n − i)/n)^β (1/n^p), 0 ≤ i ≤ n − 1, n ≥ 1} is a triangular array of real numbers. Then, the following are equivalent:
(i) (ii)
(2.3)
(2.4) Remark 2.3. Since NA random variables are a special case of ND random variables, taking p = 1 in Theorem 2.1, we obtain the result of Liang [17].
Thus, we not only extend and improve the results of Liang [17] for NA random variables to ND random variables without imposing any extra conditions, but also relax the range of β.
Remark 2.4. Taking β = 0, ani = 1/n^p, 1 ≤ i ≤ n, n ≥ 1 in Theorem 2.1, we improve the result of Baum and Katz [14].
Corollary 2.5. Let {X, Xn, n ≥ 0} be a sequence of identically distributed ND random variables, r > 1, p > 1/2, 0 < α ≤ 1, and EX = 0. Let 1. Then, the following are equivalent:
(i)
(2.5)
(ii)
(2.6)
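To make the weight array of Theorem 2.1 concrete, the following sketch builds one row of a triangular array and the corresponding weighted sum. The theorem only requires ani ≈ (i/n)^β (1/n^p); the code takes exact equality as a simplifying assumption, and the function names `weights` and `weighted_sum` are hypothetical.

```python
def weights(n, beta, p):
    """Row n of the array a_ni = (i/n)**beta / n**p for i = 1, ..., n."""
    return [(i / n) ** beta / n ** p for i in range(1, n + 1)]


def weighted_sum(xs, beta, p):
    """Sum of a_ni * x_i over one row, with n = len(xs)."""
    n = len(xs)
    return sum(a * x for a, x in zip(weights(n, beta, p), xs))


row = weights(4, beta=0.0, p=1.0)
print(row)  # every entry equals 1/4 when beta = 0 and p = 1
print(weighted_sum([2.0, 4.0, 6.0, 8.0], beta=0.0, p=1.0))  # the sample mean, 5.0
```

With β = 0 the weights reduce to 1/n^p, which is the setting of Remark 2.4, and with p = 1 the weighted sum is the ordinary sample mean.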
PROOFS OF THE MAIN RESULTS
Proof of Theorem 2.1. First, we prove (2.1) ⇒ (2.2). Note that ani = . Thus, to prove (2.2), it suffices to show that
(3.1)
So, without loss of generality, we can assume that ani > 0, 1 ≤ i ≤ n, n ≥ 1. Choose δ > 0 small enough and a sufficiently large integer K. Let, for every 1 ≤ i ≤ n, n ≥ 1,
Obviously,
. Note that
(3.2)
Thus, in order to prove (2.2), it suffices to show that
(3.3)
(3.4)
By the definition of , we see that . Since , by Lemma 1.8, we have
(3.5)
Therefore, by (2.1), I4 < ∞. From the definition of , we know . By using the definition of ND family, we have
(3.6)
Since (2.1) implies E|X|^{r/p} < ∞, by Markov’s inequality and (3.6), we obtain
(3.7)
Noting that r > 1, p + β > 0, we can choose δ small enough and a sufficiently large integer K such that r − 2 − Krp + β − δ/p < −1 and r − 2 − Kr − 1 − rδ/p < −1. Thus, by (3.7), we get I2 < ∞. Similarly, we can obtain I3 < ∞. In order to estimate I1, we first verify that
(3.8)
Note that (2.1) implies E|X|^{r/p} < ∞ and E|X|^{1/p} < ∞. When p > 1, noting that and , by Hölder’s inequality we have
(3.9)
When 1/2 < p ≤ 1, noting that EX = 0, by choosing δ being small enough such that −δ1 − r/p + 1 − r < 0, we have
(3.10)
Therefore, to prove I1 < ∞, it suffices to prove that
(3.11)
Note that is still ND by Lemma 1.4. Using Markov’s inequality, the Cr inequality, and Lemma 1.7, we get, for a suitably large M which will be determined later,
(3.12)
Choosing sufficiently large M such that −2 − δM + rδ − β/p < −1, −1 − M − r/pδ < −1, we have
(3.13)
When r/p ≥ 2, (2.1) implies EX^2 < ∞. Noting that p + β > 0, p > 1/2, we can choose sufficiently large M such that r − 2 − M(p + β) < −1, r − 2 − (2p − 1)M/2 < −1. Then,
(3.14)
When r/p < 2, choosing sufficiently large M such that r − 2 − [δ(2 − r/p) + r(p + β)/p]M/2 < −1, r − 2 − [δ(2 − r/p) + r − 1]M/2 < −1, we have
(3.15)
Thus, by (3.13), (3.14), and (3.15), we have .
Now, we proceed to prove (2.2) ⇒ (2.1). Since , by (2.2), we have
(3.16)
Next, we shall prove
(3.17)
In fact, when r ≥ 2, (3.16) obviously implies (3.17). When 1 0, by (3.19), we have
(3.19)
(3.20) j+1 Now, for each n ≥ 1, let j be such that 2 ≤ n < 2 − 1. Then, for all j
Noting that
(3.21)
, we have
(3.22)
Therefore, by (3.21) and (3.22), we get that (3.17) holds. Thus, by (3.17) and Lemma 1.6, we have
(3.23)
Now (3.16) and (3.23) yield
(3.24)
By the process of proof of (3.5), we see that (3.24) is equivalent to (2.1). Proof of Theorem 2.2. The proof is similar to that of Theorem 2.1 and is omitted. Proof of Corollary 2.5. Put . Note that, for . Therefore, for , we obtain , . Thus, letting in Theorem 2.2, we can conclude that (2.5) is equivalent to (2.6).
ACKNOWLEDGMENTS
The research is supported by the National Natural Science Foundation of China (nos. 11271020 and 11201004), the Key Project of Chinese Ministry of Education (no. 211077), and the Anhui Provincial Natural Science Foundation (nos. 10040606Q30 and 1208085MA11).
REFERENCES
1. E. L. Lehmann, “Some concepts of dependence,” Annals of Mathematical Statistics, vol. 37, no. 5, pp. 1137–1153, 1966.
2. N. Ebrahimi and M. Ghosh, “Multivariate negative dependence,” Communications in Statistics, vol. 10, no. 4, pp. 307–337, 1981.
3. R. L. Taylor, R. F. Patterson, and A. Bozorgnia, “Weak laws of large numbers for arrays of rowwise negatively dependent random variables,” Journal of Applied Mathematics and Stochastic Analysis, vol. 14, no. 3, pp. 227–236, 2001.
4. R. L. Taylor, R. F. Patterson, and A. Bozorgnia, “A strong law of large numbers for arrays of rowwise negatively dependent random variables,” Stochastic Analysis and Applications, vol. 20, no. 3, pp. 643–656, 2002.
5. M.-H. Ko and T.-S. Kim, “Almost sure convergence for weighted sums of negatively orthant dependent random variables,” Journal of the Korean Mathematical Society, vol. 42, no. 5, pp. 949–957, 2005.
6. M.-H. Ko, K.-H. Han, and T.-S. Kim, “Strong laws of large numbers for weighted sums of negatively dependent random variables,” Journal of the Korean Mathematical Society, vol. 43, no. 6, pp. 1325–1338, 2006.
7. A. Volodin, “On the Kolmogorov exponential inequality for negatively dependent random variables,” Pakistan Journal of Statistics, vol. 18, no. 2, pp. 249–253, 2002.
8. D. M. Amini and A. Bozorgnia, “Complete convergence for negatively dependent random variables,” Journal of Applied Mathematics and Stochastic Analysis, vol. 16, no. 2, pp. 121–126, 2003.
9. A. Volodin, M. O. Cabrera, and T. C. Hu, “Convergence rate of the dependent bootstrapped means,” Theory of Probability and Its Applications, vol. 50, no. 2, pp. 337–346, 2006.
10. Q. Wu, “Complete convergence for weighted sums of sequences of negatively dependent random variables,” Journal of Probability and Statistics, vol. 2011, Article ID 202015, 16 pages, 2011.
11. K. Alam and K. M. L. Saxena, “Positive dependence in multivariate distributions,” Communications in Statistics, vol. 10, no. 12, pp. 1183–1196, 1981.
12. K. Joag-Dev and F. Proschan, “Negative association of random variables, with applications,” The Annals of Statistics, vol. 11, no. 1, pp. 286–295, 1983.
13. P. L. Hsu and H. Robbins, “Complete convergence and the law of large numbers,” Proceedings of the National Academy of Sciences of the United States of America, vol. 33, pp. 25–31, 1947.
14. L. E. Baum and M. Katz, “Convergence rates in the law of large numbers,” Transactions of the American Mathematical Society, vol. 120, no. 1, pp. 108–123, 1965.
15. D. L. Li, M. B. Rao, T. F. Jiang, and X. C. Wang, “Complete convergence and almost sure convergence of weighted sums of random variables,” Journal of Theoretical Probability, vol. 8, no. 1, pp. 49–76, 1995.
16. A. Gut, “Complete convergence and Cesàro summation for i.i.d. random variables,” Probability Theory and Related Fields, vol. 97, no. 1-2, pp. 169–178, 1993.
17. H.-Y. Liang, “Complete convergence for weighted sums of negatively associated random variables,” Statistics and Probability Letters, vol. 48, no. 4, pp. 317–325, 2000.
18. N. Asadian, V. Fakoor, and A. Bozorgnia, “Rosenthal’s type inequalities for negatively orthant dependent random variables,” Journal of the Iranian Statistical Society, vol. 5, no. 1, pp. 66–75, 2006.
19. F. Móricz, “Moment inequalities and the strong laws of large numbers,” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 35, no. 4, pp. 299–314, 1976.
CHAPTER
8
STRONG LAWS OF LARGE NUMBERS FOR FUZZY SET-VALUED RANDOM VARIABLES IN Gα SPACE
Lamei Shen, Li Guan
College of Applied Sciences, Beijing University of Technology, Beijing, China
ABSTRACT
In this paper, we shall present the strong laws of large numbers for fuzzy set-valued random variables in the sense of the Hausdorff metric. The results are based on the result of single-valued random variables obtained by Taylor [1] and set-valued random variables obtained by Li Guan [2].
Citation: Shen, L. and Guan, L. (2016), “Strong Laws of Large Numbers for Fuzzy SetValued Random Variables in Gα Space”. Advances in Pure Mathematics, 6, 583-592. doi: 10.4236/apm.2016.69047. Copyright: © 2016 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http:// creativecommons.org/licenses/by/4.0
Keywords: Laws of Large Numbers, Fuzzy Set-Valued Random Variable, Hausdorff Metric
INTRODUCTION
With the development of set-valued stochastic theory, it has become a new branch of probability theory, and limit theory is one of the most important theories in probability and statistics. Many scholars have done a lot of research in this area. For example, Artstein and Vitale [3] proved the strong law of large numbers for independent and identically distributed random variables by using embedding theory. Hiai [4] extended it to separable Banach spaces. Taylor and Inoue [5] proved the strong law of large numbers for independent random variables in Banach spaces. Many other scholars have also worked on the laws of large numbers for set-valued random variables. In [2], Li proved the strong laws of large numbers for set-valued random variables in Gα space in the sense of the dH metric.
As we know, the fuzzy set is an extension of the set, and the concept of fuzzy set-valued random variables is a natural generalization of that of set-valued random variables, so it is necessary to discuss convergence theorems for fuzzy set-valued random sequences. Limit theories for fuzzy set-valued random sequences have also been discussed by many researchers. Colubi et al. [6], Feng [7] and Molchanov [8] proved the strong laws of large numbers for fuzzy set-valued random variables; Puri and Ralescu [9] and Li and Ogura [10] proved convergence theorems for fuzzy set-valued martingales. Li and Ogura [11] proved the SLLN of [12] in the sense of by using the “sandwich” method. Guan and Li [13] proved the SLLN for weighted sums of fuzzy set-valued random variables in the sense of , using the same method. In this paper, we are concerned with convergence theorems for fuzzy set-valued sequences in Gα space in the sense of the Hausdorff metric. The purpose of this paper is to prove the strong laws of large numbers for fuzzy set-valued random variables in Gα space, which extends both the result in [1] for single-valued random sequences and the result in [2] for set-valued random sequences.
This paper is organized as follows. In Section 2, we briefly introduce some concepts and basic results on set-valued and fuzzy set-valued random variables. In Section 3, we prove the strong laws of large numbers for fuzzy set-valued random variables in Gα space, in the sense of the Hausdorff metric.
PRELIMINARIES ON SET-VALUED RANDOM VARIABLES
Throughout this paper, we assume that is a complete probability space, is a real separable Banach space, is the family of all nonempty closed subsets of , is the family of all nonempty bounded closed (compact) subsets of , and is the family of all nonempty compact convex subsets of .
Let A and B be two nonempty subsets of and let , the set of all real numbers. We define addition and scalar multiplication by
The Hausdorff metric on
for
is defined by
. For an A in
The metric space of (cf. [14] ,
, let
.
is complete, and
is a closed subset
Theorems 1.1.2 and 1.1.3). For more general hyperspaces, more topological properties of hyperspaces, readers may refer to the books [15] and [14] . For each
where
, define the support function by
is the dual space of .
Let denote the unit sphere of , and the norm is defined
,
the all continuous functions of
as
The following is the equivalent definition of Hausdorff metric. For each
,
A set-valued mapping variable (or a random set, or a multi-
is called a set-valued random
function) if, for each open subset O of . For each set-valued random variable F, the expectation of F, denoted by , is defined by
where is the usual Bochner integral in -valued random variables, and
, the family of integrable .
Let denote the family of all functions the following conditions:
which satisfy
1)
The level set
.
2)
Each v is upper semicontinuous, i.e. for each level set
3)
is a closed subset of
The support set
A function v in for any Let
, the .
is compact.
is called convex if it satisfies
. be the subset of all convex fuzzy sets in
.
It is known that v is convex in the above sense if and only if, for any , the level set ). For any
is a convex subset of (cf. Theorem 3.2.1 of [16]
, the closed convex hull
of v is
defined by the relation
for all
For any two fuzzy sets
.
define
for any Similarly for a fuzzy set v and a real number , define
for any The following two metrics in
which are extensions of the Hausdorff
metric dH are often used (cf. [17] and [18] , or [14] ): for
Denote value one at 0 and zero for all
,
, where is the fuzzy set taking
. The space is a complete metric space (cf. [18] , or [14] : Theorem 5.1.6) but not separable (cf. [17] , or [14] : Remark 5.1.7). , for every
It is well known that completeness of
. Due to the
, every
Cauchy sequence
has a limit v in
.
A fuzzy set-valued random variable (or a fuzzy random set, or a fuzzy , such that
random variable in literature) is a mapping
is a set-valued random variable for every (cf. [18] or [14] ). The expectation of any fuzzy set-valued random variable X, denoted by , is an element in
such that, for every
,
where the expectation of right hand is Aumann integral. From the existence theorem (cf. [19] ), we can get an equivalent definition: for any ,
Note that
is always convex when
is nonatomic.
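The set operations and the Hausdorff metric recalled in this section can be illustrated on finite subsets of the real line. This is a minimal sketch that assumes the usual definitions: A + B = {a + b}, λA = {λa}, and dH(A, B) equal to the larger of the two one-sided distances sup over a of inf over b of |a − b| and sup over b of inf over a of |a − b|. The function names are hypothetical, and finite sets of reals stand in for closed subsets of a Banach space.

```python
def minkowski_sum(a, b):
    """A + B = {x + y : x in A, y in B} for finite sets of reals."""
    return {x + y for x in a for y in b}


def scale(lam, a):
    """lam * A = {lam * x : x in A}."""
    return {lam * x for x in a}


def hausdorff(a, b):
    """Hausdorff distance between two nonempty finite sets of reals."""
    a_to_b = max(min(abs(x - y) for y in b) for x in a)
    b_to_a = max(min(abs(x - y) for x in a) for y in b)
    return max(a_to_b, b_to_a)


A = {0.0, 1.0}
B = {0.0, 3.0}
print(sorted(minkowski_sum(A, B)))  # [0.0, 1.0, 3.0, 4.0]
print(sorted(scale(2.0, A)))        # [0.0, 2.0]
print(hausdorff(A, B))              # 2.0
```

The distance is driven by the point 3.0 in B, whose nearest neighbour in A is 1.0. The one-sided distance from A to B is only 1.0, which shows why both directions are needed for a metric.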
MAIN RESULTS
In this section, we give the limit theorems for fuzzy set-valued random variables in Gα space. We first introduce the definition of Gα space. The following Definition 3.1 and Lemma 3.2 are from Taylor’s book [1], which will be used later.
Definition 3.1. A Banach space is said to satisfy the condition Gα for some if there exists a mapping such that
1) ;
2) ;
3) , for all and some positive constant A.
Note that Hilbert spaces are mapping G. Lemma 3.2. Let
and some positive
with constant
and identity
be a Banach space which satisfies the condition of ,
be independent random elements in . Then
, such that
for each
where A is the positive constant in 3) of definition 3.1. In order to obtain the main results, we firstly need to prove Lemma 3.5. The following lemma are from [14] (cf. p89, Lemma 3.1.4), which will be used to prove Lemma 3.5.
Lemma 3.3. Let
for some
be a sequence in
. If
, then
Lemma 3.4. (cf. [13] ) For any
, there exists a finite
, such that
Now we prove that the result of Lemma 3.3 is also true for fuzzy sets. Lemma 3.5. Let
be a sequence in
. If
for some
, then
Proof. By (3.1), we can have
and
for
and
. Then by Lemma 3.3, for
, we have
(3.1)
By Lemma 3.4, take an such that
Then for
, there exists a finite
,
,
Consequently,
Since the first two terms on the right hand converge to 0 in probability one, we have
but
is arbitrary and the result follows. , Theorem 3.6. Let
of
, let , such that
be a Banach space which satisfies the condition
be independent fuzzy set-valued random variables in for any n. If
where for 0 ≤ t ≤ 1 and with probability 1 in the sense of . Proof. Define
for t ≥ 1, then
converges
Note that for each j, and both independent sequence of fuzzy
set-valued
random
variables.
are When
,
we
have
. Then, for any
And from sequence. So, we have
, we know that
is a Cauchy
Since convergence in the mean implies convergence in probability, Itô and Nisio’s result in [9] for independent random elements (cf. Section 4.5) provides that
So, for any n, m ≥ 1, m > n, by triangle inequality we have
It means
is a Cauchy sequence in the sense of
completeness of in the sense of
, we have .
. By the
converges almost everywhere
Next we shall prove that
converges in the sense of
. Firstly, we
assume that are all convex fuzzy set-valued random variables. Then by the equivalent definition of Hausdorff metric, we have
For any fixed n, m, there exists a sequence
That means there exist a sequence
, such that
, such that
Then by Cr inequality, dominated convergence theorem and Lemma 3.2, we have
for each n and m. Then, we know
is a Cauchy sequence. Hence,
is a Cauchy sequence. Thus by the similar way as above to prove converges with probability 1 in the sense of prove that
. We also can
with probability 1 in the sense of
. In fact, for each
,
So, we can prove that
with probability 1 in the sense of
. If
are not convex, we can prove
converges with probability 1 in the sense of Lemma 3.5, we can prove that bility 1 in the sense of
as above, and by the
converges with proba-
. Then the result was proved. ,
From Theorem 3.6, we can easily obtain the following corollary. Corollary 3.7. Let . Let
be a separable Banach space which is
for some
be
a sequence of independent fuzzy set-valued random variables in such that
for each n. If
are continuous and such that non-decreasing, then for each the convergence of
implies that
,
are
converges with probability one in the sense of
.
Proof. Let
If
, by the non-decreasing property of
, we have
That is
If
, by the non-decreasing property of
(4.1) , we have
That is
(4.2)
Then as the similar proof of Theorem 3.6, we can prove both converges with probability one in the sense of
, and the result was obtained. ,
ACKNOWLEDGEMENTS We thank the Editor and the referee for their comments. Research of Li Guan is funded by the NSFC (11301015, 11571024, 11401016).
REFERENCES 1.
Taylor, R.L. (1978) Lecture Notes in Mathematics. Springer-Verlag, 672. 2. Li, G. (2015) A Strong Law of Large Numbers for Set-Valued Random Variables in Gα Space. Journal of Applied Mathematics and Physics, 3, 797-801. http://dx.doi.org/10.4236/jamp.2015.37097 3. Artstein, Z. and Vitale, R.A. (1975) A Strong Law of Large Numbers for Random Compact Sets. Annals of Probability, 3, 879-882. http:// dx.doi.org/10.1214/aop/1176996275 4. Hiai, F. (1984) Strong Laws of Large Numbers for Multivalued Random Variables, Multifunctions and Integrands. In: Salinetti, G., Ed., Lecture Notes in Mathematics, Vol. 1091, Springer, Berlin, 160-172. 5. Taylor, R.L. and Inoue, H. (1985) A Strong Law of Large Numbers for Random Sets in Banach Spaces. Bulletin of the Institute of Mathematics Academia Sinica, 13, 403-409. 6. Colubi, A., López-Díaz, M., Domnguez-Menchero, J.S. and Gil, M.A. (1999) A Generalized Strong Law of Large Numbers. Probability Theory and Related Fields, 114, 401-417. http://dx.doi.org/10.1007/ s004400050229 7. Feng, Y. (2004) Strong Law of Large Numbers for Stationary Sequences of Random Upper Semicontinuous Functions. Stochastic Analysis and Applications, 22, 1067-1083. http://dx.doi.org/10.1081/ SAP-120037631 8. Molchanov, I. (1999) On Strong Laws of Large Numbers for Random Upper Semicontinuous Functions. Journal of Mathematical Analysis and Applications, 235, 249-355. http://dx.doi.org/10.1006/ jmaa.1999.6403 9. Puri, M.L. and Ralescu, D.A. (1991) Convergence Theorem for Fuzzy Martingales. Journal of Mathematical Analysis and Applications, 160, 107-121. http://dx.doi.org/10.1016/0022-247X(91)90293-9 10. Li, S. and Ogura, Y. (2003) A Convergence Theorem of Fuzzy Valued Martingale in the Extended Hausdorff Metric H∞. Fuzzy Sets and Systems, 135, 391-399. http://dx.doi.org/10.1016/S01650114(02)00145-8 11. Li, S. and Ogura, Y. (2003) Strong Laws of Numbers for Independent Fuzzy Set-Valued Random Variables. Fuzzy Sets and Systems, 157, 2569-2578. 
http://dx.doi.org/10.1016/j.fss.2003.06.011
Strong Laws of Large Numbers for Fuzzy Set-Valued Random Variables ...
12. Inoue, H. (1991) A Strong Law of Large Numbers for Fuzzy Random Sets. Fuzzy Sets and Systems, 41, 285-291.http://dx.doi. org/10.1016/0165-0114(91)90132-A 13. Guan, L. and Li, S. (2004) Laws of Large Numbers for Weighted Sums of Fuzzy Set-Valued Random Variables. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12, 811-825. http://dx.doi.org/10.1142/S0218488504003223 14. Li, S., Ogura, Y. and Kreinovich, V. (2002) Limit Theorems and Applications of Set-Valued and Fuzzy Set-Valued Random Variables. Kluwer Academic Publishers, Dordrecht. http://dx.doi. org/10.1007/978-94-015-9932-0 15. Beer, G. (1993) Topologies on Closed and Closed Convex Sets. Mathematics and Its Applications. Kluwer Academic Publishers, Dordrecht, Holland. http://dx.doi.org/10.1007/978-94-015-8149-3 16. Chen, Y. (1984) Fuzzy Systems and Mathematics. Huazhong Institute Press of Science and Technology, Wuhan. (In Chinese) 17. Klement, E.P., Puri, L.M. and Ralescu, D.A. (1986) Limit Theorems for Fuzzy Random Variables. Proceedings of the Royal Society of London A, 407, 171-182. http://dx.doi.org/10.1098/rspa.1986.0091 18. Puri, M.L. and Ralescu, D.A. (1986) Fuzzy Random Variables. Journal of Mathematical Analysis and Applications, 114, 406-422. http:// dx.doi.org/10.1016/0022-247X(86)90093-4 19. Li, S. and Ogura, Y. (1996) Fuzzy Random Variables, Conditional Expectations and Fuzzy Martingales. Journal of Fuzzy Mathematics, 4, 905-927.
SECTION III COMPUTING APPLICATIONS WITH RANDOM VARIABLES
CHAPTER 9
COMPARISON OF MULTIPLE RANDOM WALKS STRATEGIES FOR SEARCHING NETWORKS
Zhongtuan Zheng1,2, Hanxing Wang3,4, Shengguo Gao1, and Guoqiang Wang1
1 Shanghai University of Engineering Science, Shanghai 201620, China
2 Nanyang Technological University, Singapore 639798
3 Shanghai Lixin University of Commerce, Shanghai 201620, China
4 Shanghai University, Shanghai 200444, China
ABSTRACT
We investigate diverse random-walk strategies for searching networks, especially multiple random walks (MRW). We use random walks on weighted networks to establish various models of single random walks and take the order statistics approach to study the corresponding MRW, which provides a general framework for understanding random walks on networks. Multiple preferential random walks (MPRW) and multiple simple random walks (MSRW) are two special types of MRW. As search strategies, MPRW prefers high-degree nodes while MSRW searches for low-degree nodes more efficiently. We analyze the first passage time (FPT) of the wandering walkers of MRW and give the corresponding formulas for the probability distributions and moments, including the mean first passage time (MFPT). We show the convergence of the MFPT of the first arriving walker and find that the MFPT of the last arriving walker is closely related to the mean cover time. Simulations confirm the analytical predictions and deepen the discussion. We use a small random network to test the FPT properties from different aspects. We also explore some practical search-related issues by MRW, such as detecting unknown shortest paths and avoiding poor routings on networks. Our results are of practical significance for realizing optimal routing and performing efficient search on complex networks.
Citation: Zhongtuan Zheng, Hanxing Wang, Shengguo Gao, and Guoqiang Wang, “Comparison of Multiple Random Walks Strategies for Searching Networks,” Mathematical Problems in Engineering, vol. 2013, Article ID 734630, 12 pages, 2013. doi:10.1155/2013/734630. Copyright: © 2013 Zhongtuan Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
A random walk on a network [1] is precisely what its name says: a walk 𝑋0𝑋1 ⋅⋅⋅ operated in a certain random fashion. Starting a random walk at 𝑋0, its next node, 𝑋1, is selected at random from among the neighbours of 𝑋0; then 𝑋2 is a random neighbour of 𝑋1, and so on. The stochastic sequence of nodes selected this way is a random walk on the network, which mainly depends on the network structure. Random walks on complex networks are a fundamental dynamic process, which can be used to model various dynamic stochastic systems in physical, biological, or social contexts such as diffusive motion [2], traffic flow [3], synchronization [4], animal movement [5], and epidemic spread [6]. They can also be a mechanism of transport and search in real-world networks [7–9] when no global information about the underlying network is available. For instance, in some networks, such as wireless sensor networks [10], ad hoc networks [11], and peer-to-peer networks [12], data packets traverse the network in a random-walk fashion. It has been demonstrated that the structural properties of networks, such as the large connectivity heterogeneity and the short average distance, play an essential role in determining the scaling features of the random-walk search on complex networks [13]. Recent studies show that random walks on complex networks can also reveal a variety of characteristics of the network [14].
First passage time (FPT) and cover time (CT) are two important characteristics of random walks on complex networks, which heavily rely on the underlying network topology [15–17]. FPT, denoted by 𝑇𝑠ℎ, is the number of steps the random walker takes to arrive at the destination node ℎ for the first time after starting from the source node 𝑠. Correspondingly, the mean FPT (MFPT), that is, the expectation of 𝑇𝑠ℎ is denoted by ⟨𝑇𝑠ℎ⟩ or ET𝑠ℎ. For a random walk on a network, the vertex CT is the number of steps the random walker takes to reach every node, and the edge CT is the number of steps the walker takes to traverse every edge. Let ECV𝑠 be the expectation of vertex CT (or the mean vertex CT), and let ECE𝑠 be the expectation of edge CT (or the mean edge CT) if the walk starts at node 𝑠. FPT and CT of a single walker or multiple walkers in complex networks have been studied in a number of articles [16–22]. These results are frequently based on the spectral properties of the adjacency matrix of the graph and of the transition matrix of the random walk [13, 21, 22]. However, thus far, the comparison of random-walk strategies for searching complex networks remains less understood, especially MRW strategies. Furthermore, a uniform analysis of various MRW strategies is still lacking. In this paper, based on random walks on weighted networks, we establish various models of single random walks and then construct corresponding MRW. By means of the order statistics, we mainly investigate the behaviors of the first and last arriving walkers when all the walkers start from the same source simultaneously and head to the same destination randomly, that is, MRW on networks. We focus on the FPT and the MFPT of MRW, which are closely related to the distance between a selected pair of nodes and the CT of corresponding single random walks. 
Considering the MRW as the search strategy, the FPT of the first arriving walker, to a certain extent, characterizes the search efficiency, while that of the last arriving walker portrays network searchability [7]. In the future, the MRW search strategy could be used to quickly detect the unknown shortest path and avoid the worst routing between a pair of nodes.
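The definition of the FPT given above can be made concrete with a short simulation. The following is only a minimal sketch under stated assumptions: the network is a hypothetical four-node path (not the network used in this chapter), the walker is a simple random walker that picks a neighbour uniformly at random, and the names `first_passage_time` and `mean_fpt` are illustrative.

```python
import random

def first_passage_time(adj, s, h, rng):
    """Number of steps a simple random walker started at s needs to reach h."""
    node, steps = s, 0
    while node != h:
        node = rng.choice(adj[node])  # next node: uniform over current neighbours
        steps += 1
    return steps

def mean_fpt(adj, s, h, runs=20000, seed=0):
    """Monte Carlo estimate of the MFPT <T_sh>."""
    rng = random.Random(seed)
    return sum(first_passage_time(adj, s, h, rng) for _ in range(runs)) / runs

# Hypothetical toy network (an assumption, not from the chapter): the path 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
estimate = mean_fpt(path, 0, 3)
print(f"estimated MFPT from 0 to 3: {estimate:.2f}")
```

On a path, the end-to-end MFPT of a simple random walk equals the squared number of links, here 3² = 9, so the estimate should land close to 9. No walk can arrive in fewer than 𝑑𝑠ℎ = 3 steps.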
SINGLE RANDOM WALKS ON NETWORKS
First, we take a look at single random walks on an arbitrary finite network (or graph) 𝐺 = (𝑉, 𝐸). The network consists of nodes (or vertices, sites) 𝑖 = 1, 2, . . . , 𝑁 and links (or edges) connecting them. We assume that the network is connected; that is, there is a path between each pair of nodes; otherwise, we simply consider each component separately. To implement a random walk, each node needs to calculate the transition probability from the node to each of its adjacent ones. In this paper, we mainly perform preferential random walks (PRW) on networks by the following rule: a walker wandering on the network randomly selects a neighboring node as its next dwelling point according to the degrees of the neighboring nodes. So the probability of jumping to a neighboring node 𝑗 is $p_{ij} = P(X_{t+1} = j \mid X_t = i) = d_j / \sum_{k \in \tau(i)} d_k$, where 𝑑𝑗 denotes the degree (the number of connected neighbors) of node 𝑗, and 𝜏(𝑖) denotes the set of all connected neighbors of node 𝑖. Transforming 𝑝𝑖𝑗 into $d_i d_j / \sum_{k \in \tau(i)} d_i d_k$, we have defined a Markov chain on the network with the transition matrix

$$\mathbf{P} = (p_{ij})_{N \times N}, \qquad p_{ij} = \begin{cases} \dfrac{d_i d_j}{\sum_{k \in \tau(i)} d_i d_k}, & j \in \tau(i), \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

We make comparative studies with simple random walks (SRW), of which the transition matrix is

$$\mathbf{P} = (p_{ij})_{N \times N}, \qquad p_{ij} = \begin{cases} \dfrac{1}{d_i}, & j \in \tau(i), \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$

In fact, as (1) and (2) show, when SRW and PRW are used as search strategies on networks, PRW prefers high-degree nodes, while SRW searches for relatively low-degree nodes more efficiently. More generally, for the network 𝐺 = (𝑉, 𝐸), we assign a positive weight 𝑐𝑖𝑗 = 𝑐𝑗𝑖 > 0 to link (𝑖, 𝑗); otherwise, if the link (𝑖, 𝑗) is absent, we attach weight 𝑐𝑖𝑗 = 𝑐𝑗𝑖 = 0. Writing 𝑐 for the function 𝑖𝑗 → 𝑐𝑖𝑗, we have obtained the weighted network (𝐺, 𝑐). Define a random walk on the weighted network as a sequence of random variables {𝑋𝑡 : 𝑡 = 0, 1, 2, . . .}, each taking values in the set 𝑉 of nodes. The sequence is such that if 𝑋𝑡 = 𝑖, namely, at time 𝑡 the walk is at node 𝑖, then with the transition probability $p_{ij} = c_{ij} / \sum_{k} c_{ik}$ the walk is at node 𝑗 at the next time 𝑡+1; that is to say, the walk randomly selects a neighboring node as its next dwelling point according to link weights. Such a walk is named a single random walk on the weighted network or a single weighted random walk. Clearly, the sequence {𝑋𝑡 : 𝑡 = 0, 1, 2, . . .} is a Markov chain with the finite state space 𝑉, whose transition matrix P satisfies [15] where
.
(3)
Thereby, we can use a uniform method to study PRW and SRW on networks. If we attach weight 𝑐𝑖𝑗 = 𝑑𝑗𝑑𝑖 to each link (𝑖, 𝑗), then the random walk on the weighted network is a PRW, whereas if we assign weight 𝑐𝑖𝑗 = 1 to each link (𝑖, 𝑗), it is the usual SRW. Both are irreducible and reversible Markov chains. Random walks mentioned below include SRW and PRW unless otherwise noted. Now we randomly select a pair of nodes on the network, one node as source 𝑠, the other as destination ℎ. Let 𝑇𝑠ℎ be the number of steps the walker takes to arrive at the destination ℎ for the first time after starting from the source 𝑠, namely, the FPT. 𝑇𝑠ℎ is a nonnegative integer-valued random variable. Its probability distribution is denoted by
$$p(t) = P(T_{sh} = t), \qquad t = d_{sh},\, d_{sh} + 1,\, \ldots, \tag{4}$$
where 𝑑𝑠ℎ is the distance (or the shortest path length) between the source and the destination; here we assume .
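The uniform treatment of PRW and SRW as weighted random walks can be illustrated in a few lines. The sketch below is an assumption-laden illustration, not the authors' code: the four-node network and the helper names (`transition_matrix`, `prw_weights`, `srw_weights`, `is_reversible`) are hypothetical. It builds 𝑝𝑖𝑗 = 𝑐𝑖𝑗 / ∑𝑘 𝑐𝑖𝑘 from link weights and checks the detailed-balance condition 𝑐𝑖 𝑝𝑖𝑗 = 𝑐𝑗 𝑝𝑗𝑖 with 𝑐𝑖 = ∑𝑘 𝑐𝑖𝑘, which is what makes both walks reversible.

```python
from fractions import Fraction

def transition_matrix(weights):
    """weights[i][j] = c_ij > 0 for each link (symmetric); returns p_ij = c_ij / sum_k c_ik."""
    P = {}
    for i, row in weights.items():
        c_i = sum(row.values())
        P[i] = {j: Fraction(c, c_i) for j, c in row.items()}
    return P

def prw_weights(adj):
    """Preferential walk: c_ij = d_i * d_j on every link."""
    deg = {i: len(nb) for i, nb in adj.items()}
    return {i: {j: deg[i] * deg[j] for j in nb} for i, nb in adj.items()}

def srw_weights(adj):
    """Simple walk: c_ij = 1 on every link."""
    return {i: {j: 1 for j in nb} for i, nb in adj.items()}

def is_reversible(weights, P):
    """Detailed balance c_i * p_ij == c_j * p_ji, with c_i the total weight at node i."""
    c = {i: sum(row.values()) for i, row in weights.items()}
    return all(c[i] * P[i][j] == c[j] * P[j][i] for i in P for j in P[i])

# Hypothetical toy network (an assumption): node 2 has degree 3, node 3 is a leaf
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P_prw = transition_matrix(prw_weights(adj))
P_srw = transition_matrix(srw_weights(adj))
print(P_prw[0], P_srw[0])
```

From node 0, whose neighbours have degrees 2 and 3, the preferential walk moves to the degree-3 node with probability 3/5, while the simple walk moves there with probability 1/2.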
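In this paper, we investigate the FPT and MFPT for single random walks by analysis of the transition matrix. We define a matrix M(ℎ) [22] associated with the transition matrix P. M(ℎ) is obtained from P by setting the ℎth column of P to zero; that is,
$$\mathbf{M}(h) = (m_{ij})_{N \times N}, \qquad m_{ij} = \begin{cases} p_{ij}, & j \neq h, \\ 0, & j = h. \end{cases} \tag{5}$$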
(6)
(7)
(8)
(9)
Thus,
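The role of M(ℎ) can be illustrated numerically. The sketch below uses the standard absorbing-chain argument, which may differ in form from the chapter's own equations: for 𝑠 ≠ ℎ, row 𝑠 of the 𝑡-th power of M(ℎ) gives the probabilities of being at each node after 𝑡 steps without having visited ℎ, so the probability of first arrival at step 𝑡 is that row at 𝑡 − 1 multiplied by column ℎ of P. The function name `fpt_distribution` and the three-node example are assumptions.

```python
def fpt_distribution(P, s, h, t_max):
    """Return [P(T_sh = t) for t = 1..t_max]; P is a row-stochastic list of lists, s != h."""
    n = len(P)
    # M(h): copy of P with the h-th column set to zero
    M = [[0.0 if j == h else P[i][j] for j in range(n)] for i in range(n)]
    v = [1.0 if i == s else 0.0 for i in range(n)]  # row s of M(h)^0
    pmf = []
    for _ in range(t_max):
        # step into h now, having avoided it on all earlier steps
        pmf.append(sum(v[j] * P[j][h] for j in range(n)))
        # advance one step while still avoiding h: v <- v * M(h)
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
    return pmf

# Hypothetical example (an assumption): simple random walk on a triangle, s = 0, h = 2
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
pmf = fpt_distribution(P, s=0, h=2, t_max=200)
mfpt = sum(t * p for t, p in enumerate(pmf, start=1))
print(pmf[:3], mfpt)
```

On the triangle every step reaches the destination with probability 1/2, so the FPT is geometric with P(𝑇𝑠ℎ = 𝑡) = (1/2)^𝑡 and the MFPT is 2, which the truncated sum reproduces.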
In a separate, unpublished paper, using the stopping time technique, we derive an exact formula for the MFPT, denoted by ET𝑖𝑗 or ⟨𝑇𝑖𝑗⟩, for the PRW on networks. Considering the PRW on the finite network, the MFPT of node 𝑗 from node 𝑖 is
(10)
(11)
Where
thus, the average MFPT over all node pairs is
(12)
MULTIPLE RANDOM WALKS ON NETWORKS
Formation of Multiple Random Walks
Based on the above discussion, we now consider 𝑧 parallel, independent single weighted random walks that start from the same node, which form the multiple weighted random walks with 𝑧 walkers, abbreviated as MRW. We mainly analyze the FPT problem of MPRW and MSRW, which are two special types of multiple weighted random walks and correspond to PRW and SRW, respectively. Suppose there are 𝑧 walkers wandering independently from the same source simultaneously to the same destination on the network. The FPTs of the walkers are, respectively, denoted by
(13)
They are independent random variables with the same distribution as 𝑇𝑠ℎ. Let $T_{sh}^{(i)}$ be the 𝑖th order statistic [23], given by
$$T_{sh}^{(1)} \le T_{sh}^{(2)} \le \cdots \le T_{sh}^{(z)}, \tag{14}$$
where they are arranged from 𝑧 independent observations from the discrete distribution of 𝑇𝑠ℎ. That is to say, $T_{sh}^{(1)}$, $T_{sh}^{(z)}$, and $T_{sh}^{(i)}$ denote the FPTs of the first, last, and 𝑖th arriving walkers, respectively.
FPT of Multiple Random Walks
In this subsection, we explore the FPT of the first and last arriving walkers and present the corresponding equations of the probability distribution, the joint distribution, and the moments. Further, we study the relationship between the MFPT of the first arriving walker and the distance between the selected two nodes, as well as the relationship between the MFPT of the last arriving walker and the mean CT of the corresponding single random walks. Taking advantage of the order statistics technique and using (6)–(9), we can easily get the following formulas, of which the detailed proofs are omitted. Note that the results in this subsection hold generally for multiple weighted random walks, including MPRW and MSRW. We would like to stress that previously similar calculations were done only for the FPT of the first arriving walker for MSRW on networks. The results presented in our paper encompass the results of [24] as a special case.
Probability Distributions of $T_{sh}^{(1)}$ and $T_{sh}^{(z)}$
(15)
(16)
where 𝑝(𝑡) is given by (7) and the second function of 𝑡 by (8).
Joint Distributions of $T_{sh}^{(1)}$, $T_{sh}^{(z)}$ and Distributions of $T_{sh}^{(z)} - T_{sh}^{(1)}$
(1) Joint distributions of $T_{sh}^{(1)}$, $T_{sh}^{(z)}$
(17)
When ,
(18)
(19)
where 𝑡1, 𝑡2 = 1, 2, . . ..
(2) Distributions of $T_{sh}^{(z)} - T_{sh}^{(1)}$
When ,
(20)
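For readers who want to experiment with the order statistics approach, the distributions of the first and last arriving walkers can be computed from the single-walker distribution using the textbook identities for the minimum and maximum of 𝑧 independent, identically distributed variables: P(min ≤ 𝑡) = 1 − (1 − 𝐹(𝑡))^𝑧 and P(max ≤ 𝑡) = 𝐹(𝑡)^𝑧, with 𝐹(𝑡) = P(𝑇𝑠ℎ ≤ 𝑡). The sketch below is written under that assumption and is not claimed to match the exact form of the chapter's equations; `first_last_pmf` and the geometric example are illustrative.

```python
def first_last_pmf(pmf, z):
    """pmf[t] = P(T = t) for t = 0..len(pmf)-1; returns the pmfs of the min and max of z i.i.d. copies."""
    first, last = [], []
    F_prev = 0.0                 # F(t-1) = P(T <= t-1)
    for p in pmf:
        F = F_prev + p           # F(t) = P(T <= t)
        first.append((1 - F_prev) ** z - (1 - F) ** z)  # P(min = t)
        last.append(F ** z - F_prev ** z)               # P(max = t)
        F_prev = F
    return first, last

# Hypothetical single-walker FPT (an assumption): geometric, P(T = t) = (1/2)^t for t >= 1
pmf = [0.0] + [0.5 ** t for t in range(1, 60)]
first, last = first_last_pmf(pmf, z=2)
print(first[1], last[1])
```

With two walkers and this geometric FPT, the first walker arrives at step 1 with probability 1 − (1/2)² = 0.75, while both have arrived by step 1 with probability (1/2)² = 0.25.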
Moment Analysis of FPT
Consider the following:
(21)
(22)
(23)
(24)
(25)
(26)
We can investigate the general cases of $T_{sh}^{(i)}$ in a similar way. Distributions, joint distributions, moments, and further characteristics of these FPTs, as well as the range, can be obtained easily. The discrete-time results obtained here can also be compared with those for multiple continuous-time random walks. We leave these problems to our future work.
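The means of the first and last arrival times can be evaluated with the tail-sum identity E[𝑋] = ∑ P(𝑋 > 𝑡), summed over 𝑡 ≥ 0, which holds for any nonnegative integer-valued variable. The sketch below relies on that identity and on the independence of the walkers; it is an illustration under these assumptions rather than a transcription of the chapter's moment formulas, and `mean_first_last` and the geometric example are hypothetical.

```python
def mean_first_last(pmf, z):
    """E[min] and E[max] of z i.i.d. copies of T, where pmf[t] = P(T = t), t = 0..len(pmf)-1."""
    mean_first = mean_last = 0.0
    F = 0.0
    for p in pmf:
        F += p                       # F(t) = P(T <= t)
        mean_first += (1 - F) ** z   # P(min > t): every walker is still on the way
        mean_last += 1 - F ** z      # P(max > t): at least one walker is still on the way
    return mean_first, mean_last

# Hypothetical single-walker FPT (an assumption): geometric, P(T = t) = (1/2)^t, so d_sh = 1
pmf = [0.0] + [0.5 ** t for t in range(1, 200)]
for z in (1, 2, 10, 50):
    m_first, m_last = mean_first_last(pmf, z)
    print(z, round(m_first, 4), round(m_last, 4))
```

In this example the mean first arrival time is 1 / (1 − 2^(−𝑧)), which falls quickly from 2 towards the distance 1 as 𝑧 grows, while the mean last arrival time keeps increasing with 𝑧.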
MFPT of the First Arriving Walker
Since there is no path with a length shorter than 𝑑𝑠ℎ and at least one path with a length equal to 𝑑𝑠ℎ from the source to the destination, the FPT of a single random walk on networks clearly satisfies 𝑃(𝑇𝑠ℎ < 𝑑𝑠ℎ) = 0 and 𝑃(𝑇𝑠ℎ = 𝑑𝑠ℎ) > 0. Then, combining with (15) and (21), it is not difficult to obtain the following results:
(27)
(28)
(29)
That is to say, the first arriving walker of MRW will take the shortest path or near shortest path with high probability if the walker number is large enough.
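The limiting behaviour described here can be checked with a short derivation. The following is a sketch that uses only the independence of the 𝑧 walkers and the two facts 𝑃(𝑇𝑠ℎ < 𝑑𝑠ℎ) = 0 and 𝑃(𝑇𝑠ℎ = 𝑑𝑠ℎ) > 0; the notation $T_{sh}^{(1)}$ for the first arriving walker is assumed, and the steps are not claimed to coincide line by line with the chapter's own equations.

```latex
\begin{align*}
E\,T_{sh}^{(1)}
  &= \sum_{t \ge 0} P\bigl(T_{sh}^{(1)} > t\bigr)
   = \sum_{t \ge 0} \bigl[P(T_{sh} > t)\bigr]^{z} \\
  &= d_{sh} + \sum_{t \ge d_{sh}} \bigl[P(T_{sh} > t)\bigr]^{z}
   \;\longrightarrow\; d_{sh} \qquad (z \to \infty), \\
P\bigl(T_{sh}^{(1)} = d_{sh}\bigr)
  &= 1 - \bigl[1 - P(T_{sh} = d_{sh})\bigr]^{z}
   \;\longrightarrow\; 1 \qquad (z \to \infty).
\end{align*}
```

The first 𝑑𝑠ℎ terms of the tail sum equal 1 because no walker can arrive earlier than the distance. Every later term has a base strictly below 1 and therefore vanishes as 𝑧 grows, and the sum of these terms vanishes as well because it is dominated by the finite single-walker MFPT.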
MFPT of the Last Arriving Walker
How long does one have to walk before seeing all nodes or traversing all edges in a network? The answer to this question is given by the vertex CT and the edge CT. In the following, we mainly consider the coverage problem of single random walks on a simple network by using numerical simulation. For a simple random walk on a general simple network, a good upper bound on the mean vertex cover time is 𝑂(|𝑉||𝐸|) [25], which is well verified by the simulations. An interesting problem is how large the number of walkers 𝑧 must be for the MFPT of the last arriving walker to match or exceed the mean cover time of the corresponding single random walks. To address this problem, this paper gives only a simulation analysis. We find that the MFPT of the last arriving walker is significantly larger than the mean cover times ECV𝑠 and ECE𝑠 when the walker number 𝑧 exceeds the network size 𝑁 by several orders of magnitude, that is, $z = N \times 10^k$.
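The vertex and edge cover times discussed above are straightforward to estimate by simulation. The sketch below is a minimal illustration for a simple random walk on a connected network; the triangle example and the names `cover_times` and `mean_cover_times` are assumptions, not the chapter's setup. As a point of comparison, one classical bound of order |𝑉||𝐸| for the mean vertex cover time of a simple random walk on a connected graph is 2|𝐸|(|𝑉| − 1).

```python
import random

def cover_times(adj, s, rng):
    """One simple random walk from s; returns (vertex cover time, edge cover time)."""
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    seen_nodes, seen_edges = {s}, set()
    node, steps = s, 0
    vertex_ct = 0 if len(adj) == 1 else None
    while len(seen_edges) < n_edges:
        nxt = rng.choice(adj[node])
        steps += 1
        seen_nodes.add(nxt)
        seen_edges.add(frozenset((node, nxt)))   # undirected edge just traversed
        if vertex_ct is None and len(seen_nodes) == len(adj):
            vertex_ct = steps                    # every node has now been visited
        node = nxt
    return vertex_ct, steps                      # all edges traversed at 'steps'

def mean_cover_times(adj, s, runs=20000, seed=0):
    """Monte Carlo estimates of the mean vertex and mean edge cover times from s."""
    rng = random.Random(seed)
    v_sum = e_sum = 0
    for _ in range(runs):
        v, e = cover_times(adj, s, rng)
        v_sum += v
        e_sum += e
    return v_sum / runs, e_sum / runs

# Hypothetical toy network (an assumption): a triangle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
ecv, ece = mean_cover_times(triangle, 0)
print(f"mean vertex CT ~ {ecv:.2f}, mean edge CT ~ {ece:.2f}")
```

On the triangle the first step always reaches a new node and each later step reaches the remaining node with probability 1/2, so the mean vertex cover time is 1 + 2 = 3, well below the bound 2 · 3 · 2 = 12. Traversing every edge necessarily visits every node of a connected network, so the edge cover time is never smaller than the vertex cover time.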
SIMULATIONS AND DISCUSSIONS
In this section, for both MPRW and MSRW, we make numerical simulations to deepen our discussion as well as to confirm the analytic results. We use a small random network, generated by the Erdős-Rényi (ER) model [14], to test the FPT properties of the first and last arriving walkers, respectively, and their relationship as well. If the MRW is used as the search strategy, the FPT of the first arriving walker, to a certain extent, characterizes the search efficiency while that of the last arriving walker depicts network searchability. We demonstrate that wandering walkers are able to utilize and detect the shortest paths and avoid the worst routings. In addition, we explore the cover time of single random walks, which is closely related to the MFPT of the last arriving walker and can be regarded as a performance indicator of network search. We define a small random graph with 𝑁 = 21 labelled nodes in which every pair of nodes is connected with probability 𝑝 = 0.1 by using the ER model. The average degree of the simple random network is ⟨𝑑⟩ = 58/21. We choose a pair of nodes as the source and the destination; the source is 𝑠 = 16 with degree 𝑘16 = 1, and the destination is ℎ = 21 with degree 𝑘21 = 3. The distance between them is 6 steps, 𝑑𝑠ℎ = 6. The walkers start from a node to reach a given node on the network according to different random-walk strategies. We perform MPRW and MSRW on the simple random network, respectively. Numerical data presented in the figures have been averaged over $10^4$ runs using different numbers of wandering walkers. The walker number 𝑧 increases from 5 to 400. Moreover, we carry out single PRW and single SRW starting at 𝑠 on the simple random network. Simulations give the mean vertex cover time ECV𝑠 = 244.985 and the mean edge cover time ECE𝑠 = 262.445 for the single PRW, while ECV𝑠 = 676.97 and ECE𝑠 = 691.7 for the single SRW. In the following diagrams of the probability distributions, the unit of the vertical coordinate is percent (%).
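A comparable test network can be set up as follows. This is a sketch under explicit assumptions: the chapter does not say how a connected ER sample was obtained, so the code simply redraws until the graph is connected; the seed, the 0-based node labels, and the helper names (`er_graph`, `distances_from`, `connected_er_graph`, `farthest_pair`) are illustrative. The resulting network will not be the authors' particular network.

```python
import random
from collections import deque

def er_graph(n, p, rng):
    """Erdos-Renyi G(n, p): each pair of nodes is linked independently with probability p."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def distances_from(adj, s):
    """Breadth-first search: shortest path length from s to every reachable node."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_er_graph(n, p, seed=0):
    """Assumption: redraw G(n, p) until the sample is connected."""
    rng = random.Random(seed)
    while True:
        adj = er_graph(n, p, rng)
        if len(distances_from(adj, 0)) == n:
            return adj

def farthest_pair(adj):
    """Return (distance, source, destination) for a pair of nodes at maximum distance."""
    best = (0, None, None)
    for s in adj:
        dist = distances_from(adj, s)
        h = max(dist, key=dist.get)
        if dist[h] > best[0]:
            best = (dist[h], s, h)
    return best

net = connected_er_graph(21, 0.1, seed=1)
avg_degree = sum(len(nb) for nb in net.values()) / len(net)
print("average degree:", round(avg_degree, 3), "farthest pair:", farthest_pair(net))
```

Choosing a distant source and destination, as the chapter does with 𝑑𝑠ℎ = 6, makes the difference between the first and last arriving walkers easier to see.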
MPRW on a Simple Random Network
We record the FPT of each wandering walker, that is, the number of steps taken from the source node before the destination node is visited. When all the walkers have reached the destination, a round of the trial is over. We carry out $10^4$ rounds of trials and then analyze the statistical data. We calculate the frequencies of the FPT of the first and last arriving walkers for MPRW with 10 walkers on the simple network, which are depicted in Figure 1. We can explore MPRW with 30 walkers in the same way; the numerical result is shown in Figure 2. We can also get the corresponding theoretical results for MPRW with 10 walkers and 30 walkers using (15) and (16) under the same conditions. The simulation results confirm our analytic predictions.
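The trial procedure just described can be sketched in code. This is an illustrative implementation under assumptions, not the authors' program: each round releases 𝑧 independent walkers, records the smallest and largest FPT, and tallies their frequencies. The `preferential` flag switches between choosing a neighbour in proportion to its degree (PRW) and uniformly (SRW); the function names and the triangle example are hypothetical.

```python
import random
from collections import Counter

def step(adj, node, rng, preferential):
    """Move to a neighbour: degree-proportional for PRW, uniform for SRW."""
    nbrs = adj[node]
    if preferential:
        weights = [len(adj[j]) for j in nbrs]     # degrees of the neighbours
        return rng.choices(nbrs, weights=weights)[0]
    return rng.choice(nbrs)

def fpt(adj, s, h, rng, preferential):
    """First passage time of one walker from s to h."""
    node, t = s, 0
    while node != h:
        node = step(adj, node, rng, preferential)
        t += 1
    return t

def mrw_trials(adj, s, h, z, rounds, preferential, seed=0):
    """Frequency tables of the FPT of the first and last arriving walkers over all rounds."""
    rng = random.Random(seed)
    first, last = Counter(), Counter()
    for _ in range(rounds):
        times = [fpt(adj, s, h, rng, preferential) for _ in range(z)]
        first[min(times)] += 1   # first arriving walker
        last[max(times)] += 1    # last arriving walker; the round ends here
    return first, last

# Hypothetical toy network (an assumption): a triangle, source 0, destination 2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
rounds = 20000
first, last = mrw_trials(triangle, 0, 2, z=2, rounds=rounds, preferential=False, seed=1)
print("P(first = 1) ~", first[1] / rounds, " P(last = 1) ~", last[1] / rounds)
```

On the triangle a single walker arrives at step 1 with probability 1/2, so with two walkers the first arrival happens at step 1 with probability 1 − (1/2)² = 0.75, which the frequencies should approximate. Dividing the counters by the number of rounds gives empirical distributions of the kind plotted in the figures.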
Figure 1: (a) Distribution of FPT of the first arriving walker for MPRW, $T_{sh}^{(1)}$. (b) Distribution of FPT of the last arriving walker for MPRW, $T_{sh}^{(z)}$. Here walker number 𝑧 = 10.
Figure 2: (a) Distribution of FPT of the first arriving walker for MPRW, $T_{sh}^{(1)}$. (b) Distribution of FPT of the last arriving walker for MPRW, $T_{sh}^{(z)}$. Here walker number 𝑧 = 30.
We observe that for MPRW with 30 walkers, the probability of the first arriving walker utilizing the shortest path is $P(T_{sh}^{(1)} = 6) = 0.2373$, while for MPRW with 10 walkers the probability is $P(T_{sh}^{(1)} = 6) = 0.0871$. The obvious gap can be seen in Figures 2(a) and 1(a). There is a significant increase in the probability of the first arriving walker taking the shortest path. If we use more walkers, the probability will be higher. This agrees with (27). Comparing with Figure 1(a), we can observe in Figure 2(a) that the distribution curve of the first passage time peaks towards 𝑡1 = 6 and mainly concentrates in the interval 𝑡1 ∈ [6, 20]; for 𝑡1 > 25, the probability $P(T_{sh}^{(1)} = t_1) = 0$ for MPRW with 30 walkers. This implies that the first arriving walker will utilize the shortest path or near shortest paths with high probability. In fact, we can calculate by (15) the probability of the first arriving walker taking near shortest paths with a length 𝐿 ≤ 10, $P(T_{sh}^{(1)} \le 10) = 0.8443$, as also shown in Figure 2(a). That is to say, if we put 30 walkers on the network, the first arriving walker will, with probability about 0.84, take a path of length not more than 10. This observation has deeper meanings. On one hand, we can mark the nodes and the paths visited by the first arriving walker of MPRW, and thereby detect the unknown shortest paths between two selected nodes. This is of practical significance for realizing optimal routing on networks, especially mobile ad hoc sensor networks. On the other hand, with a specified source and a walker number 𝑧 large enough, we conjecture that the wandering walkers will cover the vast majority of the network within 𝑂(𝐷) steps with high probability, because we can treat any other node as the destination. Here 𝐷 is the diameter of the network, which is the maximum distance over all pairs of nodes in the network. Consequently, based on the local information of the degrees of the current node’s neighbours, the MPRW could be a mechanism for performing efficient search on networks, particularly on highly heterogeneous small-world networks [14]. Considering the MRW as the search strategy, the FPT of the first arriving walker characterizes the search efficiency while that of the last arriving walker depicts network searchability. For MPRW with 10 walkers, we can see that the FPT of the last arriving walker is mainly distributed in the interval [35, 424], namely, $T_{sh}^{(z)} \in [35, 424]$ in Figure 1(b), and the corresponding MFPT is $\langle T_{sh}^{(z)} \rangle = 131.818$ in Figure 3(b). For MPRW with 30 walkers, we observe that $T_{sh}^{(z)} \in [62, 482]$ in Figure 2(b), $\langle T_{sh}^{(z)} \rangle = 176.596$ in Figure 3(b), and the sample standard deviation of $T_{sh}^{(z)}$ is 𝑠𝑡𝑑 = 54.576. These results can also be computed by (23) and (24). From Figure 3(b), we can see that as the walker number 𝑧 increases from 5 to 30, the MFPT of the last arriving walker increases fast, whereas when 𝑧 is greater than 30, the MFPT increases slowly. The phenomenon that its growth rate is very small reflects
the distribution of FPT which tends to be relatively centralized. At the same time, when the walker number 𝑧 is more than 30, with 𝑧 increasing, the sample standard deviation of $T_{sh}^{(z)}$ tends downwards, as shown in Table 1. Further, we also run $10^4$ rounds of trials with $10^7$ walkers and analyze the distribution of $T_{sh}^{(z)}$ according to the statistical data. We find that $T_{sh}^{(z)}$ takes only 7 values, namely 655, 664, 697, 699, 710, 727, and 871, and its sample standard deviation is only 40.338. Moreover, $T_{sh}^{(z)}$ is concentrated in 697, 699, 710, and 727, and the sum of the probabilities of these four values is up to 0.8. This observation agrees with the significant increase in the probability of the first arriving walker taking the shortest path with a small increment in the walker number. We can also record traces of the last arriving walker and the traces containing bad cases of routings. Then, for instance, when designing the routing algorithm for a mobile wireless sensor network, based on local information, we can avoid poor routings of the last arriving walker and improve the reliability of the routing algorithm by MRW.
Table 1: Sample standard deviation of $T_{sh}^{(z)}$ with different walker number 𝑧.
Figure 3: (a) MFPT of the first arriving walker for MPRW, $\langle T_{sh}^{(1)} \rangle$, versus walker number z. (b) MFPT of the last arriving walker for MPRW, $\langle T_{sh}^{(z)} \rangle$, versus walker number z. (c) Range versus walker number z.
We can compute by (21) the MFPT of the first arriving walker $\langle T_{sh}^{(1)} \rangle$ for MPRW with a different walker number 𝑧; see Figure 3(a). From Figure 3(a), we can see that with the walker number 𝑧 increasing from 5 to 400, $\langle T_{sh}^{(1)} \rangle$ decreases from 16.537 to 6.03. $\langle T_{sh}^{(1)} \rangle$ monotonically decreases with walker number 𝑧 increasing, and it will converge to the distance 6 when 𝑧→∞. This further confirms the analytic prediction of (28). It can also be seen that $\langle T_{sh}^{(1)} \rangle$ decreases sharply in the interval 𝑧 ∈ [5, 30]. That is to say, a small increment in the number of walkers will enable the MFPT to decline substantially in the initial stage. This observation also holds in the case of MSRW [24, 26], which highlights its obvious practical meaning. For example, in [26], for one unstructured peer-to-peer network, the authors propose an MSRW query algorithm that resolves queries almost as fast as the flooding method while reducing network traffic much more in many cases. In the algorithm they use only 16–64 walkers while the network size is 4736–10,000. Their experimental practice is in accordance with our analysis here. The difference is that the query here may be faster since we conduct MPRW on networks. We can also calculate by (23) the MFPT of the last arriving walker $\langle T_{sh}^{(z)} \rangle$ for MPRW with a different walker number 𝑧; see Figure 3(b). From Figure 3(b), when 𝑧 is more than 30, the MFPT increases slowly with 𝑧 increasing. This has been mentioned above when discussing the sample standard deviation of $T_{sh}^{(z)}$ and is consistent with the fact that the MFPT of the first arriving walker declines substantially in the initial stage 𝑧 ∈ [5, 30]. In addition, we can compute by (25) the range $\langle T_{sh}^{(z)} \rangle - \langle T_{sh}^{(1)} \rangle$; see Figure 3(c). From Figure 3, we can see that the range mainly depends on $\langle T_{sh}^{(z)} \rangle$. Unlike $\langle T_{sh}^{(1)} \rangle$, we cannot determine the convergence of $\langle T_{sh}^{(z)} \rangle$. However, we can explore the relationship between it and the mean cover time of single PRW, as well as the bounds on $\langle T_{sh}^{(z)} \rangle$. Actually, $\langle T_{sh}^{(z)} \rangle$ hardly exceeds ECE𝑠 or ECV𝑠, unless 𝑧≫𝑁. From Figure 3(b), when 𝑧 = 400, almost 20 times the size of network 𝑁, $\langle T_{sh}^{(z)} \rangle = 284.591$ is significantly larger than ECE𝑠 = 262.445 and ECV𝑠 = 244.985. To some extent, the FPT and MFPT of the last arriving walker characterize network searchability [7]. In fact, for complex networks with the same sizes and different topologies, with a specified source and a walker number under the same conditions, we carry out single and multiple random walks on them. The greater the quantities $\langle T_{sh}^{(z)} \rangle$ and the mean CT, the more difficult it is to search the network.
MSRW on a Simple Random Network: Comparison to MPRW
We make a comparative study of MSRW and MPRW in the following aspects. Numerical results of the FPT of the first and last arriving walkers for MSRW with 10 and 30 walkers are shown in Figures 4 and 5. The corresponding analytic results can be obtained by (15) and (16).
Figure 4: (a) Distribution of FPT of the first arriving walker for MSRW, $T_{sh}^{(1)}$. (b) Distribution of FPT of the last arriving walker for MSRW, $T_{sh}^{(z)}$. Here walker number 𝑧 = 10.
Figure 5: (a) Distribution of FPT of the first arriving walker for MSRW, $T_{sh}^{(1)}$. (b) Distribution of FPT of the last arriving walker for MSRW, $T_{sh}^{(z)}$. Here walker number 𝑧 = 30.
From Figures 5(a) and 4(a), we see that for MSRW with 30 walkers, the probability of the first arriving walker taking the shortest path is $P(T_{sh}^{(1)} = 6) = 0.0717$, while for MSRW with 10 walkers the probability is $P(T_{sh}^{(1)} = 6) = 0.0251$. This observation is similar to the case of MPRW, but the gap is not so apparent here. The first arriving walker will take the shortest path or near shortest paths with probability $P(T_{sh}^{(1)} \le 10) = 0.4872$; see Figure 5(a). However, compared with MPRW, the probability 0.4872 here is much smaller than 0.8443; see Figure 2(a). We are able to compute by (21) the MFPT of the first arriving walker $\langle T_{sh}^{(1)} \rangle$ for MSRW with a different walker number 𝑧; see Figure 6(a). From Figure 6(a), we can see that with the walker number 𝑧 increasing from 5 to 400, $\langle T_{sh}^{(1)} \rangle$ decreases from 22.706 to 6.595. $\langle T_{sh}^{(1)} \rangle$ monotonically decreases with walker number 𝑧 increasing, and it will converge to the distance between the selected two nodes when 𝑧→∞. It can also be seen that $\langle T_{sh}^{(1)} \rangle$ decreases sharply in the initial interval 𝑧 ∈ [5, 50]. The results are consistent with MPRW. The difference is that the decreasing rate and the convergence speed of $\langle T_{sh}^{(1)} \rangle$ in the case of MSRW are much smaller than for MPRW; see Figures 6(a) and 3(a).
Figure 6: (a) MFPT of the first arriving walker for MSRW, $\langle T_{sh}^{(1)} \rangle$, versus walker number z. (b) MFPT of the last arriving walker for MSRW, $\langle T_{sh}^{(z)} \rangle$, versus walker number z. (c) Range versus walker number z.
As shown in Figures 4(b) and 5(b), the distribution of $T_{sh}^{(z)}$ for MSRW is relatively concentrated in the case of 𝑧 = 30, compared with 𝑧 = 10. Numerical simulations show that when the walker number 𝑧 is more than 50, with 𝑧 increasing, the sample standard deviation of $T_{sh}^{(z)}$ tends downwards. Comparing Figures 4(b) and 5(b) with Figures 1(b) and 2(b), with the same walker number 𝑧, we find that the distribution of $T_{sh}^{(z)}$ for MSRW is more spread out than for MPRW, while the MFPT $\langle T_{sh}^{(z)} \rangle$ of MSRW is much greater than that of MPRW; see Figures 6(b) and 3(b).
We can also calculate by (23) the MFPT of the last arriving walker $\langle T_{sh}^{(z)} \rangle$ for MSRW with a different walker number 𝑧; see Figure 6(b). When 𝑧 is more than 50, the MFPT increases slowly with 𝑧 increasing. In addition, we can compute by (25) the range $\langle T_{sh}^{(z)} \rangle - \langle T_{sh}^{(1)} \rangle$ and find that the range mainly depends on $\langle T_{sh}^{(z)} \rangle$; see Figure 6(c). From Figure 6(b), when $z = 10^3$, $\langle T_{sh}^{(z)} \rangle = 623.985$ < ECE𝑠 = 691.7020; when $z = 10^6$, almost $5 \times 10^4$ times the size of network 𝑁, $\langle T_{sh}^{(z)} \rangle = 743.012$ is only just significantly greater than ECE𝑠 and ECV𝑠. These results indicate that it costs less to understand network searchability by MPRW than by MSRW. In short, the MPRW search strategy here is more effective and feasible than MSRW. This is due to the fact that the MPRW strategy is prone to searching for high-degree nodes. In other words, if we start from a periphery (or low-degree) node and search for a hub (or high-degree) node on networks, we need only a small number of wandering walkers and a small number of steps by the MPRW strategy, which also offers one efficient way of detecting hub nodes in scale-free networks [14].
CONCLUSIONS
Some theoretical and technological problems related to MRW on networks have been investigated in this paper. These problems mainly come from random-walk theory and network search applications. We mainly discussed the behaviors of the first and last arriving walkers of MRW on networks. We made use of random walks on weighted graphs to study single random walks on networks and took advantage of the order statistics approach to investigate MRW on networks, which can serve as a framework for random walks on networks. We analyzed the FPT and MFPT of MPRW on networks, making comparative studies with MSRW. We explored the FPT of the first and last arriving walkers and presented the corresponding equations of the probability distributions and moments. The MFPT of the first arriving walker monotonically decreases as the walker number 𝑧 increases, and it converges to the distance between the source node and the destination node. Furthermore, a small increment in the number of walkers enables the MFPT of the first arriving walker to decrease sharply in the initial stage and the distribution curve of the FPT of the first arriving walker to peak towards the distance between the selected two nodes as well. Thus, if the walker number is properly set, the first arriving walker will take the shortest path or near shortest paths with high probability. With the walker number increasing, the MFPT of the last arriving walker also increases substantially in the initial stage. The MFPT of the last arriving walker is closely related to the mean CT of the corresponding single random walk. It could be seen that the MFPT of the last arriving walker hardly exceeds the mean cover time unless 𝑧≫𝑁.
Numerical simulations of an ensemble of random walkers moving on a paradigmatic network model confirmed the analytical predictions and deepened the discussion. In addition, by MRW on networks, we investigated some practical search-related issues. When the MRW is regarded as the search strategy, the FPT of the first arriving walker characterizes the search efficiency while that of the last arriving walker portrays network searchability. Accordingly, the mean CT of single random walks can also be considered as a performance indicator of network search. We can detect unknown shortest paths by the first arriving walker and avoid poor routings by the last arriving walker, which is of practical significance for realizing optimal routing and performing efficient searching on networks. When MSRW and MPRW are used as search strategies on networks, MPRW prefers high-degree nodes, while MSRW searches for relatively low-degree nodes more efficiently. This inspires us to develop proper MRW for the identification of influential spreaders in complex networks [27], including hub nodes in highly heterogeneous networks and congestion bottlenecks in urban traffic networks. We leave these interesting problems to our future work. Finally, we point out that MPRW and MSRW could complement each other in web search, routing, communication transmission, transportation, and other applications, thereby increasing the efficiency and reliability of the search-related processes.
ACKNOWLEDGMENTS This work is supported by Shanghai Municipal Education Commission Foundation (Grant nos. shgcjs020 and B-8938-11-0440), The National Natural Science Foundation of China (Grant no. 11001169), Connotative Construction Project of Shanghai University of Engineering Science (Grant no. nhky-2012-13), Shanghai Municipal Education Commission Discipline Construction Foundation (Grant nos. 11XK11, 2011X33, and 2011XY34), and Scientific Research Foundation of Shanghai University of Engineering Science (Grant no. A-0501-10-019).
REFERENCES
1. L. Lovász, Random Walks on Graphs: A Survey, vol. 2, Combinatorics, Paul Erdős is Eighty, 1993.
2. E. M. Bollt and D. Ben-Avraham, “What is special about diffusion on scale-free nets?” New Journal of Physics, vol. 7, article 026, 2005.
3. P. Holme, “Congestion and centrality in traffic flow on complex networks,” Advances in Complex Systems, vol. 6, no. 2, pp. 163–176, 2003.
4. B. Kriener, L. Anand, and M. Timme, “Complex networks: when random walk dynamics equals synchronization,” New Journal of Physics, vol. 14, no. 9, Article ID 093002, 2012.
5. E. A. Codling, M. J. Plank, and S. Benhamou, “Random walk models in biology,” Journal of the Royal Society Interface, vol. 5, no. 25, pp. 813–834, 2008.
6. M. Draief and A. Ganesh, “A random walk model for infection on graphs: Spread of epidemics & rumours with mobile agents,” Discrete Event Dynamic Systems, vol. 21, no. 1, pp. 41–61, 2011.
7. R. Guimerà, A. Díaz-Guilera, F. Vega-Redondo, A. Cabrales, and A. Arenas, “Optimal network topologies for local search with congestion,” Physical Review Letters, vol. 89, no. 24, 2002.
8. I. G. Portillo, D. Campos, and V. Méndez, “Intermittent random walks: transport regimes and implications on search strategies,” Journal of Statistical Mechanics, vol. 2011, no. 2, Article ID P02033, 2011.
9. V. Tejedor, O. Bénichou, and R. Voituriez, “Close or connected: distance and connectivity effects on transport in networks,” Physical Review E, vol. 83, no. 6, Article ID 066102, 2011.
10. K. K. Rachuri and C. S. R. Murthy, “Energy efficient and low latency biased walk techniques for search in wireless sensor networks,” Journal of Parallel and Distributed Computing, vol. 71, no. 3, pp. 512–522, 2011.
11. C.-F. Hsin and M. Liu, “Hitting time analysis for a class of random packet forwarding schemes in ad hoc networks,” Ad Hoc Networks, vol. 7, no. 3, pp. 500–513, 2009.
12. C. Gkantsidis, M. Mihail, and A. Saberi, “Random walks in peer-to-peer networks: algorithms and evaluation,” Performance Evaluation, vol. 63, no. 3, pp. 241–263, 2006.
13. J. D. Noh and H. Rieger, “Random walks on complex networks,” Physical Review Letters, vol. 92, no. 11, 2004.
14. S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, “Complex networks: structure and dynamics,” Physics Reports, vol. 424, no. 4-5, pp. 175–308, 2006.
15. D. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs, monograph in preparation, 2002.
16. S. Condamin, O. Bénichou, V. Tejedor, R. Voituriez, and J. Klafter, “First-passage times in complex scale-invariant media,” Nature, vol. 450, no. 7166, pp. 77–80, 2007.
17. S. Ikeda, I. Kubo, and M. Yamashita, “The hitting and cover times of random walks on finite graphs using local degree information,” Theoretical Computer Science, vol. 410, no. 1, pp. 94–100, 2009.
18. J. Jonasson, “On the cover time for random walks on random graphs,” Combinatorics Probability and Computing, vol. 7, no. 3, pp. 265–279, 1998.
19. C. Mejía-Monasterio, G. Oshanin, and G. Schehr, “First passages for a search by a swarm of independent random searchers,” Journal of Statistical Mechanics, vol. 2011, no. 6, Article ID P06022, 2011.
20. R. Elsässer and T. Sauerwald, “Tight bounds for the cover time of multiple random walks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2623–2641, 2011.
21. A. Fronczak and P. Fronczak, “Biased random walks in complex networks: the role of local navigation rules,” Physical Review E, vol. 80, no. 1, Article ID 016107, 2009.
22. H. Zhou, “Network landscape from a Brownian particle’s perspective,” Physical Review E, vol. 67, no. 4, Article ID 041908, 5 pages, 2003.
23. M. Ahsanullah, V. B. Nevzorov, and M. Shakil, An Introduction to Order Statistics, Atlantis Press, Paris, France, 2013.
24. S.-P. Wang and W.-J. Pei, “First passage time of multiple Brownian particles on networks with applications,” Physica A, vol. 387, no. 18, pp. 4699–4708, 2008.
25. R. Aleliunas, R. M. Karp, R. J. Lipton et al., “Random walks, universal traversal sequences, and the complexity of maze problems,” in Proceedings of the 20th Annual IEEE Symposium on Foundations of Computer Science, pp. 218–223, 1979.
26. Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and replication in unstructured peer-to-peer networks,” in Proceedings of the 16th International Conference on Supercomputing, pp. 84–95, June 2002.
27. M. Kitsak, L. K. Gallos, S. Havlin et al., “Identification of influential spreaders in complex networks,” Nature Physics, vol. 6, no. 11, pp. 888–893, 2010.
CHAPTER 10
FUZZY C-MEANS AND CLUSTER ENSEMBLE WITH RANDOM PROJECTION FOR BIG DATA CLUSTERING
Mao Ye1,2, Wenfen Liu1,2, Jianghong Wei1, and Xuexian Hu1
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
ABSTRACT Because of its positive effects on dealing with the curse of dimensionality in big data, random projection for dimensionality reduction has recently become a popular method. In this paper, a theoretical analysis of the influence of random projection on the variability of a data set and on the dependence of dimensions is presented. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of original FCM clustering, but also is more efficient than original clustering and clustering with singular value decomposition. At the same time, a new cluster ensemble approach based on FCM clustering with random projection is also proposed. The new aggregation method can efficiently compute the spectral embedding of data with a cluster-center-based representation which scales linearly with data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared to the state-of-the-art methods.
Citation: Mao Ye, Wenfen Liu, Jianghong Wei, and Xuexian Hu, “Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering,” Mathematical Problems in Engineering, vol. 2016, Article ID 6529794, 13 pages, 2016. doi:10.1155/2016/6529794. Copyright: © 2016 Mao Ye et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION With the rapid development of mobile Internet, cloud computing, the Internet of things, social network services, and other emerging services, data has recently been growing at an explosive rate. How to analyze data quickly and effectively, and thereby maximize the benefit of data as an asset, has become a focus of attention. The “four Vs” model [1] of big data (variety, volume, velocity, and value) has made traditional methods of data analysis inapplicable. Therefore, new techniques for big data analysis, such as distributed or parallelized computing [2, 3], feature extraction [4, 5], and sampling [6], have attracted wide attention.
Clustering is an essential method of data analysis through which the original data set is partitioned into several subsets according to the similarities of data points. It is an underlying tool for outlier detection [7], biology [8], indexing [9], and so on. In fuzzy clustering analysis, each object in the data set no longer belongs to a single group but may belong to any group; the degree to which an object belongs to a group is denoted by a value in [0, 1]. Among various methods of fuzzy clustering, fuzzy c-means (FCM) [10] clustering has received particular attention for its special features. In recent years, based on different sampling and extension methods, many modified FCM algorithms [11–13] designed for big data analysis have been proposed. However, these algorithms are unsatisfactory in efficiency for high dimensional data, since they do not take the “curse of dimensionality” into account.
In 1984, Johnson and Lindenstrauss [14] used the projection generated by a random orthogonal matrix to reduce the dimensionality of data. This method can preserve pairwise distances of the points within a factor of $1 \pm \varepsilon$. Subsequently, [15] stated that such a projection could be produced by a random Gaussian matrix. Moreover, Achlioptas showed that even a projection from a random scaled sign matrix preserves pairwise distances [16]. These results laid the theoretical foundation for applying random projection to clustering analysis based on pairwise distances. Recently, Boutsidis et al. [17] designed a provably accurate dimensionality reduction method for k-means clustering based on random projection. Since that method was analyzed for crisp partitions, the effect of random projection on the FCM clustering algorithm is still unknown.
Because it can combine multiple base clustering solutions of the same object set into a single consensus solution, cluster ensemble has many attractive properties such as improved quality of solution, robust clustering, and knowledge reuse [18]. Ensemble approaches of fuzzy clustering with random projection have been proposed in [19–21]. These methods were all based on multiple random projections of the original data set and then integrated all fuzzy clustering results of the projected data sets. Reference [21] pointed out that their method used less memory and ran faster than the ones of [19, 20]. However, when a crisp partition solution is wanted, their method still needs to compute and store the product of membership matrices, which requires time and space quadratic in the data size.
Our Contribution. In this paper, our contributions can be divided into two parts: one is the analysis of the impact of random projection on FCM clustering; the other is the proposition of a cluster ensemble method with random projection which is more efficient, robust, and suitable for a wider range of geometrical data sets.
Concretely, the contributions are as follows:
(i) We theoretically show that random projection can preserve the entire variability of data and prove the effectiveness of random projection for dimensionality reduction from the linear independence of dimensions of projected data. Together with the property of preserving pairwise distances of points, we obtain a modified FCM clustering algorithm with random projection. The accuracy and efficiency of the modified algorithm have been verified through experiments on both synthetic and real data sets.
(ii) We propose a new cluster ensemble algorithm for FCM clustering with random projection which gets the spectral embedding efficiently through singular value decomposition (SVD) of the concatenation of membership matrices. The new method avoids the construction of a similarity or distance matrix, so it is more efficient and space-saving than the method in [21] with respect to crisp partition and the methods in [19, 20] for large scale data sets. In addition, the improvements in robustness and efficiency of our approach are also verified by the experimental results on both synthetic and real data sets. At the same time, our algorithm is not only as accurate as the existing ones on the Gaussian mixture data set, but also clearly more accurate than the existing ones on the real data set, which indicates that our approach is suitable for a wider range of data sets.
PRELIMINARIES In this section, we present some notations used throughout this paper, introduce the FCM clustering algorithm, and give some traditional cluster ensemble methods using random projection.
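Matrix Notations We use $\mathbf{X}$ to denote the data matrix, $\mathbf{x}_i$ to denote the $i$th row vector of $\mathbf{X}$ (the $i$th point), and $x_{ij}$ to denote the $(i, j)$th element of $\mathbf{X}$. $E(\xi)$ means the expectation of a random variable $\xi$ and $\Pr(A)$ denotes the probability of an event $A$. Let $\mathrm{cov}(\xi, \eta)$ be the covariance of random variables $\xi$, $\eta$; let $\mathrm{var}(\xi)$ be the variance of random variable $\xi$.
We denote the trace of a matrix by $\mathrm{tr}(\cdot)$. Given $\mathbf{A} \in \mathbb{R}^{n \times n}$,
$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}. \tag{1}$$
For any matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$, we have the following property:
$$\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A}). \tag{2}$$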
Singular value decomposition is a popular dimensionality reduction method, through which one can get a projection $f: \mathbf{X} \to \mathbb{R}^{t}$ with $f(\mathbf{x}_i) = \mathbf{x}_i \mathbf{V}_t$, where $\mathbf{V}_t$ contains the top $t$ right singular vectors of matrix $\mathbf{X}$. The exact SVD of $\mathbf{X}$ takes time cubic in the number of dimensions and quadratic in the data size.
Fuzzy c-Means Clustering Algorithm (FCM) The goal of fuzzy clustering is to get a flexible partition, where each point has membership in more than one cluster with values in [0, 1]. Among the various fuzzy clustering algorithms, FCM clustering algorithm is widely used in low dimensional data because of its efficiency and effectiveness [22]. We start from giving the definition of fuzzy 𝑐-means clustering problem and then describe the FCM clustering algorithm precisely.
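Definition 1 (the fuzzy c-means clustering problem). Given a data set of $n$ points with $d$ features denoted by an $n \times d$ matrix $\mathbf{X}$, a positive integer $c$ regarded as the number of clusters, and fuzzy constant $m > 1$, find the partition matrix $\mathbf{U}_{opt} \in \mathbb{R}^{c \times n}$ and centers of clusters $\mathbf{V}_{opt} = \{\mathbf{v}_{opt,1}, \mathbf{v}_{opt,2}, \ldots, \mathbf{v}_{opt,c}\}$ such that
$$(\mathbf{U}_{opt}, \mathbf{V}_{opt}) = \arg\min_{\mathbf{U}, \mathbf{V}} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \left\| \mathbf{x}_j - \mathbf{v}_i \right\|^{2}. \tag{3}$$
Here, $\|\cdot\|$ denotes a norm, usually the Euclidean norm; the element $u_{ij}$ of the partition matrix denotes the membership of point $j$ in cluster $i$. Moreover, for any $j \in [1, n]$, $\sum_{i=1}^{c} u_{ij} = 1$. The objective function is defined as $\mathrm{obj} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \| \mathbf{x}_j - \mathbf{v}_i \|^{2}$.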
FCM clustering algorithm first computes the degree of membership through distances between points and centers of clusters and then updates the center of each cluster based on the membership degree. By means of computing cluster centers and partition matrix iteratively, a solution is obtained. It should be noted that FCM clustering can only get a locally optimal solution and the final clustering result depends on the initialization. The detailed procedure of FCM clustering is shown in Algorithm 1. Algorithm 1: FCM clustering algorithm.
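The alternating update described above can be sketched in a few lines of NumPy. This is a minimal illustration of standard FCM under stated assumptions, not a transcription of Algorithm 1: the function name `fcm`, the optional `init` argument, and the stopping rule (largest change in membership below `tol`) are choices made here for the example.

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iter=100, init=None, seed=0):
    """Fuzzy c-means on the rows of X.

    Returns U (c x n memberships, columns sum to 1) and V (c x d centers).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if init is None:  # pick c distinct random points as initial centers
        init = np.random.default_rng(seed).choice(n, size=c, replace=False)
    V = X[np.asarray(init)].copy()
    U = np.zeros((c, n))
    for _ in range(max_iter):
        # squared distances between centers and points, shape (c, n)
        d2 = (V**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * V @ X.T
        d2 = np.maximum(d2, 1e-12)          # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)       # membership update
        Um = U_new ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # center update
        converged = np.abs(U_new - U).max() < tol     # assumed stopping rule
        U = U_new
        if converged:
            break
    return U, V

# Toy usage: two tight, well separated blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
U, V = fcm(X, c=2, init=[0, 20])   # one initial center taken from each blob
labels = U.argmax(axis=0)
```

Because the result depends on the initialization, the toy usage fixes the initial centers; with random initialization different runs may end in different local optima.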
Ensemble Aggregations for Multiple Fuzzy Clustering Solutions with Random Projection There are several algorithms proposed for aggregating multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generation and aggregation of membership matrices lead to various ensemble approaches for fuzzy clustering.
The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model $\theta$ of a mixture of $c$ Gaussians obtained by EM clustering. Subsequently, the similarity of points $i$ and $j$ was computed as $P_{ij}^{\theta} = \sum_{l=1}^{c} P(l \mid i, \theta) \times P(l \mid j, \theta)$, where $P(l \mid i, \theta)$ denoted the probability of point $i$ belonging to cluster $l$ under model $\theta$ and $P_{ij}^{\theta}$ denoted the probability that points $i$ and $j$ belonged to the same cluster under model $\theta$. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For a mixture model, the estimation of the cluster number and of the values of unknown parameters is often complicated [23]. In addition, this approach needs $O(n^2)$ space for storing the similarity matrix of data points.
Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices $\mathbf{U}_i \in \mathbb{R}^{c \times n}$, $i = 1, 2, \ldots, r$, over $r$ runs. For each run $i$, the similarity matrix was computed as $\mathbf{M}_i = \mathbf{U}_i^{T} \mathbf{U}_i$. Then the combined similarity matrix $\mathbf{M}$ was calculated by averaging as $\mathbf{M} = (1/r) \sum_{i=1}^{r} \mathbf{M}_i$. A distance matrix was computed by $\mathbf{D} = 1 - \mathbf{M}$ and the final partition matrix was gained by FCM clustering on the distance matrix $\mathbf{D}$. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is $O(r \cdot cn^2)$ and the space complexity is $O(n^2)$.
Considering large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as $\mathbf{U}_{con} = [\mathbf{U}_1^{T}, \mathbf{U}_2^{T}, \ldots, \mathbf{U}_r^{T}]$, instead of averaging the agreement matrix. Finally, they got the ensemble result as $\mathbf{U}_f = \mathrm{FCM}(\mathbf{U}_{con}, c)$. This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
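The difference in storage between the averaging strategy of [19] and the concatenation strategy of [21] can be made concrete with a small sketch. The function names are hypothetical and the membership matrices are random placeholders rather than FCM outputs; the only point is the shape of the aggregated objects and the identity that links them.

```python
import numpy as np

def aggregate_average(memberships):
    """Averaging as in [19]: mean of U_i^T U_i, an n x n similarity matrix."""
    n = memberships[0].shape[1]
    M = np.zeros((n, n))
    for U in memberships:
        M += U.T @ U
    return M / len(memberships)

def aggregate_concat(memberships):
    """Concatenation as in [21]: the n x (c*r) matrix [U_1^T, ..., U_r^T]."""
    return np.hstack([U.T for U in memberships])

# Placeholder membership matrices (c x n, columns sum to 1), not real FCM output.
rng = np.random.default_rng(0)
c, n, r = 3, 6, 4
memberships = []
for _ in range(r):
    U = rng.random((c, n))
    memberships.append(U / U.sum(axis=0, keepdims=True))

M = aggregate_average(memberships)      # n x n, quadratic in the data size
D = 1.0 - M                             # distance matrix used in [19]
U_con = aggregate_concat(memberships)   # n x (c*r), linear in the data size
```

Note that `U_con @ U_con.T / r` reproduces `M`, which is why forming that product for a crisp partition brings back the quadratic cost.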
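RANDOM PROJECTION Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several studies have shown that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [15, 16]. Its low computational complexity and its preservation of the metric structure have drawn much attention to random projection. Lemma 2 indicates that there are three kinds of simple random projection possessing the above properties.
Lemma 2 (see [15, 16]). Let matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ be a data set of $n$ points and $d$ features. Given $\varepsilon, \beta > 0$, let
$$k_0 = \frac{4 + 2\beta}{\varepsilon^{2}/2 - \varepsilon^{3}/3} \ln n. \tag{4}$$
For integer $t \ge k_0$, let matrix $\mathbf{R}$ be a $d \times t$ ($t \le d$) random matrix, wherein elements $R_{ij}$ are independently identically distributed random variables from any one of the following three probability distributions:
$$R_{ij} \sim N(0, 1), \qquad R_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases} \tag{5}$$
$$R_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -1 & \text{with probability } 1/6. \end{cases} \tag{6}$$
Let $f: \mathbb{R}^{d} \to \mathbb{R}^{t}$ with $f(\mathbf{x}_i) = (1/\sqrt{t})\, \mathbf{x}_i \mathbf{R}$. For any $\mathbf{u}, \mathbf{v} \in \mathbf{X}$, with probability at least $1 - n^{-\beta}$, it holds that
$$(1 - \varepsilon) \| \mathbf{u} - \mathbf{v} \|^{2} \le \| f(\mathbf{u}) - f(\mathbf{v}) \|^{2} \le (1 + \varepsilon) \| \mathbf{u} - \mathbf{v} \|^{2}. \tag{7}$$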
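A small numerical sketch can make Lemma 2 tangible. It draws each of the three kinds of random matrix, projects a toy data set, and records the ratio of projected to original squared pairwise distances. The helper names (`jl_dimension`, `random_matrix`, `project`, `sq_dists`) and the toy sizes are assumptions of this example, and the bound is evaluated with the natural logarithm.

```python
import numpy as np

def jl_dimension(n, eps, beta):
    """Lower bound k_0 on the projected dimension (natural log assumed)."""
    return (4.0 + 2.0 * beta) / (eps**2 / 2.0 - eps**3 / 3.0) * np.log(n)

def random_matrix(d, t, kind, rng):
    """d x t matrix with i.i.d. entries from one of the three distributions."""
    if kind == "gaussian":
        return rng.standard_normal((d, t))
    if kind == "sign":
        return rng.choice([-1.0, 1.0], size=(d, t))
    if kind == "sparse":
        return np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(d, t),
                                         p=[1 / 6, 2 / 3, 1 / 6])
    raise ValueError(f"unknown kind: {kind}")

def project(X, t, kind="sign", seed=0):
    """f(x_i) = (1 / sqrt(t)) x_i R applied to every row of X."""
    rng = np.random.default_rng(seed)
    R = random_matrix(X.shape[1], t, kind, rng)
    return X @ R / np.sqrt(t)

def sq_dists(A):
    """All pairwise squared Euclidean distances between rows of A."""
    g = A @ A.T
    s = np.diag(g)
    return s[:, None] + s[None, :] - 2.0 * g

k0 = jl_dimension(n=10000, eps=0.5, beta=1.0)   # roughly 663

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2000))             # toy data, 40 points
iu = np.triu_indices(40, k=1)                   # each pair counted once
ratios = {}
for kind in ("gaussian", "sign", "sparse"):
    Y = project(X, t=500, kind=kind)
    ratios[kind] = sq_dists(Y)[iu] / sq_dists(X)[iu]
```

In this toy run the ratios cluster around 1 for all three kinds of matrix; how tightly they cluster depends on `t`, in line with the role of $\varepsilon$ in (4).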
Lemma 2 implies that if the number of dimensions of data reduced by random projection is larger than a certain bound, then squared pairwise Euclidean distances are preserved within a multiplicative factor of $1 \pm \varepsilon$. With the above properties, researchers have checked the feasibility of applying random projection to k-means clustering in terms of theory and experiment [17, 24]. However, as membership degrees for FCM clustering and k-means clustering are defined differently, the analysis method cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion, based on an analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables corresponding to the dimensions of the projected data are linearly independent, that is, pairwise uncorrelated. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data as “principal components.” Next, we give a useful lemma for the proof of the subsequent theorem.
Lemma 3. Let $\xi_i$ ($1 \le i \le n$) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then
$$\Pr\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \xi_i^{2} = 1 \right\} = 1. \tag{8}$$
Proof. According to the probability distribution of random variable $\xi_i$, it is easy to know that
$$E(\xi_i^{2}) = 1 \ (1 \le i \le n), \qquad E\left( \frac{1}{n} \sum_{i=1}^{n} \xi_i^{2} \right) = 1. \tag{9}$$
Then, $\{\xi_i^{2}\}$ obeys the law of large numbers; namely,
$$\Pr\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \xi_i^{2} = E\left( \frac{1}{n} \sum_{i=1}^{n} \xi_i^{2} \right) \right\} = \Pr\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \xi_i^{2} = 1 \right\} = 1. \tag{10}$$
Since centralization of data does not change the distance of any two points and the FCM clustering algorithm is based on pairwise distances to partition data points, we assume that expectation of the data input is 0. In practice, covariance matrix of population is likely unknown. Therefore, we investigate the effect of random projection on variability of both population and sample.
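Theorem 4. Let data set $\mathbf{X} \in \mathbb{R}^{n \times d}$ be $n$ independent samples of the $d$-dimensional random vector $(X_1, X_2, \ldots, X_d)$, and let $\mathbf{S}$ denote the sample covariance matrix of $\mathbf{X}$. The random projection induced by random matrix $\mathbf{R} \in \mathbb{R}^{d \times t}$ maps the $d$-dimensional random vector to the $t$-dimensional random vector $(Y_1, Y_2, \ldots, Y_t) = (1/\sqrt{t})(X_1, X_2, \ldots, X_d) \cdot \mathbf{R}$, and $\mathbf{S}^{*}$ denotes the sample covariance matrix of the projected data. If the elements of random matrix $\mathbf{R}$ obey a distribution demanded by Lemma 2 and are mutually independent of the random vector $(X_1, X_2, \ldots, X_d)$, then
(1) dimensions of projected data are linearly independent: $\mathrm{cov}(Y_i, Y_j) = 0$, $\forall i \ne j$;
(2) random projection maintains the whole variability: $\sum_{i=1}^{t} \mathrm{var}(Y_i) = \sum_{i=1}^{d} \mathrm{var}(X_i)$; when $t \to \infty$, with probability 1, $\mathrm{tr}(\mathbf{S}^{*}) = \mathrm{tr}(\mathbf{S})$.
Proof. It is easy to know that the expectation of any element of the random matrix is $E(R_{ij}) = 0$, $1 \le i \le d$, $1 \le j \le t$. As the elements of random matrix $\mathbf{R}$ and the random vector $(X_1, X_2, \ldots, X_d)$ are mutually independent, the covariance of the random vector induced by random projection is
$$\mathrm{cov}(Y_i, Y_j) = \mathrm{cov}\left( \frac{1}{\sqrt{t}} \sum_{k=1}^{d} X_k R_{ki}, \frac{1}{\sqrt{t}} \sum_{l=1}^{d} X_l R_{lj} \right) = \frac{1}{t} \sum_{k=1}^{d} \sum_{l=1}^{d} \left[ E(X_k R_{ki} X_l R_{lj}) - E(X_k R_{ki}) E(X_l R_{lj}) \right] = \frac{1}{t} \sum_{k=1}^{d} \sum_{l=1}^{d} E(X_k X_l) E(R_{ki} R_{lj}). \tag{11}$$
(1) If $i \ne j$, then $R_{ki}$ and $R_{lj}$ are independent, so $E(R_{ki} R_{lj}) = E(R_{ki}) E(R_{lj}) = 0$ and
$$\mathrm{cov}(Y_i, Y_j) = 0. \tag{12}$$
(2) If $i = j$, then $E(R_{ki} R_{li}) = 0$ for $k \ne l$ and $E(R_{ki}^{2}) = 1$, so
$$\mathrm{var}(Y_i) = \mathrm{cov}(Y_i, Y_i) = \frac{1}{t} \sum_{k=1}^{d} E(X_k^{2}) E(R_{ki}^{2}) = \frac{1}{t} \sum_{k=1}^{d} E(X_k^{2}). \tag{13}$$
Thus, by the assumption $E(X_i) = 0$ ($1 \le i \le d$), we can get
$$\sum_{i=1}^{t} \mathrm{var}(Y_i) = \sum_{k=1}^{d} E(X_k^{2}) = \sum_{k=1}^{d} \mathrm{var}(X_k). \tag{14}$$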
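We denote the spectral decomposition of the sample covariance matrix $\mathbf{S}$ by $\mathbf{S} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{T}$, where $\mathbf{V}$ is the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_d$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Supposing the data samples have been centralized, namely, their means are 0, we can get the covariance matrix $\mathbf{S} = (1/n) \mathbf{X}^{T} \mathbf{X}$. For convenience, we still denote a sample of the random matrix by $\mathbf{R}$. Thus, the projected data are $\mathbf{Y} = (1/\sqrt{t}) \mathbf{X} \mathbf{R}$ and the sample covariance matrix of the projected data is $\mathbf{S}^{*} = (1/n) ((1/\sqrt{t}) \mathbf{X} \mathbf{R})^{T} ((1/\sqrt{t}) \mathbf{X} \mathbf{R}) = (1/t) \mathbf{R}^{T} \mathbf{S} \mathbf{R}$. Then, we can get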
(15)
where 𝑟𝑖𝑗 (1 ≤ 𝑖 ≤ 𝑑, 1 ≤ 𝑗 ≤ 𝑡) is sample of element of random matrix R.
In practice, the spectrum of a covariance often displays a distinct decay after few large eigenvalues. So we assume that there exists an integer 𝑝, limited constant 𝑞>0, such that, for all 𝑖>𝑝, it holds that 𝜆𝑖 ≤ 𝑞. Then,
(16)
(17)
By Lemma 3, with probability 1,
Combining the above arguments, we achieve $\mathrm{tr}(\mathbf{S}^{*}) = \mathrm{tr}(\mathbf{S})$ with probability 1, when $t \to \infty$.
Part (1) of Theorem 4 indicates that compressed data produced by random projection can take much information with low dimensionality owing to linear independence of reduced dimensions. Part (2) manifests that sum of variances of dimensions of original data is consistent with the one of projected data, namely, random projection holds the variability of primal data. Combining results of Lemma 2 with those of Theorem 4, we consider that random projection can be employed to improve the efficiency of FCM clustering algorithm with low dimensionality, and the modified algorithm can keep the accuracy of partition approximately.
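Part (2) can be checked numerically on a toy example. The sketch below uses synthetic centered data with a decaying per-feature spread and a random sign matrix, both chosen only for illustration, and compares the total sample variance before and after projection for a growing projected dimension `t`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2000
scales = 1.0 / np.sqrt(1.0 + np.arange(d))   # decaying per-feature spread
X = rng.standard_normal((n, d)) * scales
X -= X.mean(axis=0)                          # centre the data

trace_S = (X**2).sum() / n                   # tr(S) with S = X^T X / n

trace_ratio = {}
for t in (10, 100, 1000):
    R = rng.choice([-1.0, 1.0], size=(d, t)) # random sign matrix
    Y = X @ R / np.sqrt(t)                   # projected data
    trace_S_star = (Y**2).sum() / n          # tr(S*) with S* = Y^T Y / n
    trace_ratio[t] = trace_S_star / trace_S
```

The ratio `trace_ratio[t]` is expected to approach 1 as `t` grows. For a single fixed draw of `R` the off-diagonal entries of the sample covariance of `Y` are generally nonzero, since part (1) is a statement in expectation over both the data and the random matrix.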
FCM CLUSTERING WITH RANDOM PROJECTION AND AN EFFICIENT CLUSTER ENSEMBLE APPROACH
FCM Clustering via Random Projection According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2: FCM clustering with random projection.
Algorithm 2 reduces the dimensions of the input data by multiplying it by a random matrix. Compared with the $O(cnd^{2})$ time for running each iteration in original FCM clustering, the new algorithm implies an $O(cn(\varepsilon^{-2} \ln n)^{2})$ time for each iteration. Thus, the time complexity of the new algorithm decreases markedly for high dimensional data in the case $\varepsilon^{-2} \ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with the $O(d^{3} + nd^{2})$ time of running SVD on data matrix $\mathbf{X}$, the new algorithm only needs $O(\varepsilon^{-2} d \ln n)$ time to generate random matrix $\mathbf{R}$. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
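As a rough order-of-magnitude illustration of the per-iteration saving, take the synthetic setting used in the experiments of this chapter ($c = 3$, $n = 10^{4}$, $d = 10^{3}$) and a projected dimension $t = 100$ standing in for $\varepsilon^{-2} \ln n$. The constants hidden by the $O$-notation are ignored, so the numbers below are only indicative:

```latex
\begin{aligned}
c\,n\,d^{2} &= 3 \times 10^{4} \times (10^{3})^{2} = 3 \times 10^{10},\\
c\,n\,t^{2} &= 3 \times 10^{4} \times (10^{2})^{2} = 3 \times 10^{8},\\
\frac{c\,n\,d^{2}}{c\,n\,t^{2}} &= \left(\frac{d}{t}\right)^{2} = 100 .
\end{aligned}
```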
Ensemble Approach Based on Graph Partition As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the big data setting. In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. The overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partitioning a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between points and the cluster centers. Through SVD on the concatenation of membership matrices, we get the spectral embedding of data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
Algorithm 3: Cluster ensemble for FCM clustering with random projection.
Figure 1: Framework of the new ensemble approach based on graph partition.
In step (3) of the procedure in Algorithm 3, the left singular vectors of $\hat{\mathbf{U}}_{con}$ are equivalent to the eigenvectors of $\hat{\mathbf{U}}_{con} \hat{\mathbf{U}}_{con}^{T}$. It implies that we regard the matrix product as a construction of the affinity matrix of data points. This method is motivated by the research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as the landmark-based representation. Thus, the concatenation of membership matrices forms a combinational landmark-based representation matrix. In this way, the graph similarity matrix is computed as
$$\mathbf{W} = \hat{\mathbf{U}}_{con} \hat{\mathbf{U}}_{con}^{T}, \tag{18}$$
which can create the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply $\mathbf{U}_{con}$ by $(r \hat{\mathbf{D}})^{-1/2}$, where $\hat{\mathbf{D}}$ is the diagonal matrix of the column sums of $\mathbf{U}_{con}$. As a result, the degree matrix of $\mathbf{W}$ is an identity matrix.
There are two perspectives to explain why our approach works. Considering the similarity measure defined by $u_{ij}$ in FCM clustering, proposition 3 in [26] demonstrated that singular vectors of $\mathbf{U}_i$ converged to eigenvectors of $\mathbf{W}_s$ as $c$ converges to $n$, where $\mathbf{W}_s$ was the affinity matrix generated in standard spectral clustering. As a result, singular vectors of $\hat{\mathbf{U}}_{con}$ converge to eigenvectors of the normalized affinity matrix. Thus, our final output will converge to the one of standard spectral clustering as $c$ converges to $n$. Another explanation is about the similarity measure defined by $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \mathbf{x}_j^{T}$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points. We can treat each row of $\hat{\mathbf{U}}_{con}$ as a transformational data point. As a result, the affinity matrix obtained here is the same as the one of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the algorithm of EFCM-C also needs the product of concatenated membership matrices in order to get the crisp partition result. Thus the above methods both need $O(n^{2})$ space and $O(crn^{2})$ time. However, the main computation of EFCM-S is SVD for $\hat{\mathbf{U}}_{con}$ and $k$-means clustering for $\mathbf{A}$. The overall space is $O(crn)$, the SVD time is $O((cr)^{2} n)$, and the $k$-means clustering time is $O(lc^{2} n)$, where $l$ is the iteration number of $k$-means. Therefore, the computational complexity of EFCM-S is clearly lower than that of EFCM-A and EFCM-C, considering $cr \ll n$ and $l \ll n$ in large scale data sets.
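The pipeline described in this section (random projection, FCM on each projected data set, concatenation of memberships, column scaling, truncated SVD, and k-means on the embedding) can be sketched as follows. This is an illustrative reading of the text, not a transcription of Algorithms 2 and 3: the function names, the column scaling by the square root of `r` times the column sums (chosen because it makes every row of the implied similarity matrix sum to 1), the deterministic farthest-first seeding, and the toy data are all assumptions of this example.

```python
import numpy as np

def sq_dist(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 1e-12)

def farthest_first(Y, c):
    """Deterministic seeding (an assumption; the chapter uses random seeds)."""
    idx = [0]
    for _ in range(c - 1):
        idx.append(int(sq_dist(Y, Y[idx]).min(axis=1).argmax()))
    return Y[idx].copy()

def fcm(Y, c, m=2.0, iters=100, tol=1e-5):
    """Compact fuzzy c-means returning the c x n membership matrix."""
    V = farthest_first(Y, c)
    U = np.zeros((c, Y.shape[0]))
    for _ in range(iters):
        inv = sq_dist(V, Y) ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)
        V = (U_new**m @ Y) / (U_new**m).sum(axis=1, keepdims=True)
        done = np.abs(U_new - U).max() < tol
        U = U_new
        if done:
            break
    return U

def kmeans(A, c, iters=100):
    """Plain Lloyd iterations on the rows of A."""
    C = farthest_first(A, c)
    for _ in range(iters):
        labels = sq_dist(A, C).argmin(axis=1)
        C_new = np.array([A[labels == i].mean(axis=0) if np.any(labels == i)
                          else C[i] for i in range(c)])
        if np.allclose(C_new, C):
            break
        C = C_new
    return labels

def efcm_s(X, c, t, r, seed=0):
    """Spectral ensemble of r FCM runs on random sign projections of X."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    blocks = []
    for _ in range(r):
        R = rng.choice([-1.0, 1.0], size=(d, t))
        blocks.append(fcm(X @ R / np.sqrt(t), c).T)    # n x c memberships
    U_con = np.hstack(blocks)                          # n x (c*r)
    U_hat = U_con / np.sqrt(r * U_con.sum(axis=0))     # column scaling
    A = np.linalg.svd(U_hat, full_matrices=False)[0][:, :c]  # top-c left vectors
    return kmeans(A, c), U_hat

# Toy data: three well separated Gaussian clusters in 200 dimensions.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 60)
X = rng.standard_normal((180, 200)) + np.array([3.0, 0.0, -3.0])[y][:, None]
labels, U_hat = efcm_s(X, c=3, t=40, r=5)
```

Only the thin matrix `U_hat` of size n x (c*r) is ever stored; the n x n similarity matrix is never formed, which is the source of the space saving discussed above.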
EXPERIMENTS
In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
Data Sets and Parameter Settings We conducted the experiments on synthetic and real data sets which both have relatively high dimensionality. The synthetic data set had 10000 data points with 1000 dimensions which were generated from 3 Gaussian mixtures in proportions (0.25, 0.5, 0.25). The means of the components were $(2, 2, \ldots, 2)_{1000}$, $(0, 0, \ldots, 0)_{1000}$, and $(-2, -2, \ldots, -2)_{1000}$ and the standard deviations were $(1, 1, \ldots, 1)_{1000}$, $(2, 2, \ldots, 2)_{1000}$, and $(3, 3, \ldots, 3)_{1000}$. The real data set is the daily and sports activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors in 5 minutes at 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices whose rows were activity instances and columns were features.
For the parameters of FCM clustering, we let $\varepsilon = 10^{-5}$, we let the maximum iteration number be 100, we let the fuzzy factor $m$ be 2, and we let the number of clusters be $c = 3$ for the synthetic data set and $c = 19$ for the ACT data sets. We also normalized the objective function as $\mathrm{obj}^{*} = \mathrm{obj} / \|\mathbf{X}\|_F^{2}$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of evaluation indices of 20 independent experiments. In order to compare different dimensionality reduction methods for FCM clustering, we initialized algorithms by choosing $c$ points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested for verifying their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we just took $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT1 data set.
For the comparisons of different cluster ensemble algorithms, we set the dimension number of the projected data as 𝑡 = 10, 20, . . . , 100 for both the synthetic and ACT2 data sets. In order to meet 𝑐𝑟 ≪ 𝑛 for Algorithm 3, the number of random projections 𝑟 was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.
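A data set with the structure of the synthetic one described above can be generated along the following lines. The function name `make_mixture` is hypothetical, and drawing the component of each point at random according to the proportions is an assumption; the chapter does not state whether the proportions were exact or sampled.

```python
import numpy as np

def make_mixture(n=10000, d=1000, props=(0.25, 0.5, 0.25),
                 means=(2.0, 0.0, -2.0), stds=(1.0, 2.0, 3.0), seed=0):
    """Isotropic Gaussian mixture: component k has mean means[k] in every
    coordinate and standard deviation stds[k] in every coordinate."""
    rng = np.random.default_rng(seed)
    y = rng.choice(len(props), size=n, p=props)      # component of each point
    mu = np.asarray(means)[y][:, None]
    sigma = np.asarray(stds)[y][:, None]
    X = rng.standard_normal((n, d)) * sigma + mu
    return X, y

# A smaller instance than the one in the text, to keep the example light.
X, y = make_mixture(n=2000, d=100)
```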
Evaluation Criteria
For clustering algorithms, clustering validation and running time are two important indices for judging their performances. Clustering validation measures evaluate the goodness of clustering results [28] and often can be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information such as the given class labels to evaluate the goodness of solution output by a clustering algorithm. On the contrary, internal measures are to evaluate the clustering results using feature inherited from data sets. In this paper, validity evaluation criteria used are rand index and clustering validation index based on nearest neighbors for crisp partition, together with fuzzy rand index and Xie-Beni index for fuzzy partition. Here, rand index and fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors index and Xie-Beni index are internal validation measures. (1) Rand Index (RI) [29]. RI describes the similarity of clustering solution and correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and different clusters. The RI is defined as
$$\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \quad (19)$$
where 𝑛11 is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, 𝑛00 is the number of pairs of points that are in different subsets in both the clustering result and the given class labels, and $C_n^2$ equals 𝑛(𝑛 − 1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It requires computing the analogous 𝑛11 and 𝑛00 through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate clustering solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:
$$\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\,\|x_j - v_i\|^2}{n \cdot \min_{i \ne j}\|v_i - v_j\|^2}, \quad (20)$$
where the numerator $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\|x_j - v_i\|^2$ is just the objective function of FCM clustering and v𝑖 is the center of cluster 𝑖. The smallest XB indicates the optimal cluster partition.
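As a small illustration of the Rand index in (19), the following sketch counts agreeing pairs directly. The function name `rand_index` is hypothetical, and the quadratic pair loop is chosen for clarity rather than speed.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Rand index: fraction of point pairs on which two crisp partitions agree."""
    n = len(labels_true)
    if len(labels_pred) != n or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    n11 = 0  # pairs placed together by both partitions
    n00 = 0  # pairs separated by both partitions
    for a, b in combinations(range(n), 2):
        same_pred = labels_pred[a] == labels_pred[b]
        same_true = labels_true[a] == labels_true[b]
        if same_pred and same_true:
            n11 += 1
        elif not same_pred and not same_true:
            n00 += 1
    return (n11 + n00) / (n * (n - 1) / 2)

# Usage: identical partitions agree on every pair.
ri_demo = rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # label names do not matter
```

Because only pair co-membership is compared, relabeling the clusters leaves the value unchanged.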
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN is about the situation of objects that have geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:
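$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \quad (21)$$
where $\mathrm{Sep}(c, k) = \max_{i = 1, \dots, c}\left(\frac{1}{n_i}\sum_{j=1}^{n_i}\frac{q_j}{k}\right)$ and $\mathrm{Com}(c) = \sum_{i=1}^{c}\left(\frac{2}{n_i(n_i - 1)}\sum_{x, y \in \mathrm{Clu}_i} d(x, y)\right)$. Here, 𝑐 is the number of clusters in the partition result, 𝑐max is the maximum cluster number given, 𝑐min is the minimum cluster number given, 𝑘 is the number of nearest neighbors, 𝑛𝑖 is the number of objects in the 𝑖th cluster Clu𝑖, 𝑞𝑗 denotes the number of nearest neighbors of Clu𝑖’s 𝑗th object which are not in Clu𝑖, and 𝑑(𝑥, 𝑦) denotes the distance between 𝑥 and 𝑦. A lower CVNN value indicates a better clustering solution.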
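The separation and compactness terms of CVNN described above can be sketched as follows. This is an illustrative reading of the definitions in the text, assuming Euclidean distance and a brute-force neighbor search; the names `cvnn_terms` and `cvnn` are hypothetical and not taken from [32].

```python
import numpy as np

def cvnn_terms(X, labels, k):
    """Return (separation, compactness) of one crisp partition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of objects")
    # Pairwise Euclidean distances (assumed metric), O(n^2) memory.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # k nearest neighbors of each object, excluding the object itself.
    D_no_self = D.copy()
    np.fill_diagonal(D_no_self, np.inf)
    knn = np.argsort(D_no_self, axis=1)[:, :k]
    sep, com = 0.0, 0.0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        n_i = len(members)
        # q_j: number of neighbors of object j lying outside its own cluster.
        q = (labels[knn[members]] != c).sum(axis=1)
        sep = max(sep, float((q / k).mean()))
        if n_i > 1:
            block = D[np.ix_(members, members)]
            pair_sum = block.sum() / 2.0  # each unordered pair counted once
            com += 2.0 / (n_i * (n_i - 1)) * pair_sum
    return sep, com

def cvnn(seps, coms):
    """Normalize both terms by their maxima over the candidate partitions."""
    seps = np.asarray(seps, dtype=float)
    coms = np.asarray(coms, dtype=float)
    sep_max = seps.max() if seps.max() > 0 else 1.0
    com_max = coms.max() if coms.max() > 0 else 1.0
    return seps / sep_max + coms / com_max
```

To choose among candidate partitions, `cvnn_terms` would be evaluated once per candidate and the resulting lists passed to `cvnn`; the candidate with the lowest score is preferred.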
Objective function is a special evaluation criterion of validity for FCM clustering algorithm. The smaller objective function indicates that the points inside clusters are more “similar.” Running time is also an important evaluation criterion often related to the scalability of algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of algorithm in the context of big data.
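The FCM objective function and the Xie-Beni index built on it can be written compactly. The sketch below assumes a membership matrix `U` of shape (c, n), centers `V` of shape (c, d), and a fuzzifier `m`; these array conventions and the function names are assumptions made for illustration.

```python
import numpy as np

def fcm_objective(X, V, U, m=2.0):
    """Sum over clusters i and points j of u_ij^m * ||x_j - v_i||^2."""
    # d2[i, j] is the squared distance between center i and point j.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d2).sum())

def xie_beni(X, V, U, m=2.0):
    """FCM objective divided by n times the smallest squared center distance."""
    n = X.shape[0]
    c = V.shape[0]
    center_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    center_d2[np.eye(c, dtype=bool)] = np.inf  # ignore i == j
    return fcm_objective(X, V, U, m) / (n * center_d2.min())

# Usage on a toy partition with two well-separated clusters.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
V_demo = np.array([[1.0, 0.0], [11.0, 0.0]])
U_demo = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
xb_demo = xie_beni(X_demo, V_demo, U_demo)
```

Compact clusters lower the numerator and distant centers raise the denominator, so both effects push the index down.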
Performance of FCM Clustering with Random Projection
The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI in (a) and (b), XB in (c) and (d), the objective function in (e) and (f), and running time in (g) and (h). “SignRP” denotes the proposed algorithm with a random sign matrix, “GaussRP” denotes FCM clustering with a random Gaussian matrix, “FCM” denotes the original FCM clustering algorithm, and “SVD” denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e + 12, not 0.
Figure 2: Performance of clustering algorithms with different dimensionality.
From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions 𝑡 is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that “accuracy of clustering algorithm can be preserved when the dimensionality exceeds a certain bound.” The effectiveness of the random projection method is also confirmed by how small this bound is compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Moreover, the two kinds of random projection have a similar impact on FCM clustering, as their plots are nearly the same.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also show that the SVD method gives better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering. Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have similar FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are clearly more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
Comparisons of Different Cluster Ensemble Methods
The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. As before, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI in (a) and (b) and running time in (c) and (d) to present the performance of the ensemble methods. The meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. To obtain a crisp partition for EFCM-A and EFCM-C, we applied hierarchical clustering with the complete linkage method after computing the distance matrix as in [21]. Since all three cluster ensemble methods obtain perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.

Table 1: CVNN indices for different ensemble approaches on ACT2 data

| Dimension t | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EFCM-A | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.819 | 1.83 | 1.7623 | 1.8182 | 1.8685 | 1.8067 |
| EFCM-C | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381 |
| EFCM-S | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.409 | 1.3701 | 1.3765 |
Figure 3: Performance of cluster ensemble approaches with different dimensionality.
In Figure 3, running time of our algorithm is shorter for both data sets. This verifies the result of time complexity analysis for different algorithms in Section 4.2. The three cluster ensemble methods all get the perfect partition for synthetic data set, whereas our method is more accurate than the other two methods for ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data set. However, the almost 18% improvement on RI for ACT2 data set should be due to the different grouping ideas. Our method is based on the graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering way of spectral embedding is more suitable for ACT2 data set. In Table 1, the smaller values of CVNN of our new method also show that new approach has better partition results
on ACT2 data set. These observations indicate that our algorithm has the advantage in efficiency and adapts to a wider range of geometries. We also compare the stability of the three ensemble methods, presented in Table 2. From the table, we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.

Table 2: Standard deviations of RI of 20 runs with different dimensions on ACT2 data

| Dimension t | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EFCM-A | 0.0222 | 0.0174 | 0.018 | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184 |
| EFCM-C | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214 |
| EFCM-S | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019 |
To address the situation in which the number of clusters is unknown, we also varied the number of clusters c in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we varied the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we varied the number of clusters from 14 to 24 as the input of spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data
In Table 3, the values for “EFCM-SV” are the average RI values obtained with the estimated numbers of clusters over 20 individual runs. The values of “+CVNN” are the average numbers of clusters decided by the CVNN cluster validity index. Using the numbers of clusters estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
CONCLUSION AND FUTURE WORK
The “curse of dimensionality” in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied random projection as a feature extraction method for FCM clustering. By analyzing theoretically the effect of random projection on the entire variability of the data, and by verifying it empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the original FCM clustering and is more efficient than the SVD feature extraction method. In addition, we proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach achieves spectral embedding efficiently from an SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, produced more robust partition solutions, and fitted a wider range of geometrical data sets. One direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method in order to obtain a trade-off between clustering accuracy and efficiency.
ACKNOWLEDGMENTS
This work was supported in part by the National Key Basic Research Program (973 programme) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
REFERENCES
1. M. Chen, S. Mao, and Y. Liu, “Big data: a survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
2. J. Zhang, X. Tao, and H. Wang, “Outlier detection from large distributed databases,” World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
3. C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, “PCA for large data sets with parallel data summarization,” Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
4. D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, “Anomaly detection in large-scale data stream networks,” Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
5. F. Murtagh and P. Contreras, “Random projection towards the Baire metric for high dimensional clustering,” in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
6. T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, “Fuzzy c-means algorithms for very large data,” IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
7. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
8. S. Khan, G. Situ, K. Decker, and C. J. Schmidt, “GoFigure: automated gene ontology™ annotation,” Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
9. S. Günnemann, H. Kremer, D. Lenhard, and T. Seidl, “Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations,” in Proceedings of the 14th International Conference on Extending Database Technology (EDBT ’11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
10. J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: the fuzzy c-means clustering algorithm,” Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
11. R. J. Hathaway and J. C. Bezdek, “Extending fuzzy and probabilistic clustering to very large data sets,” Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
12. P. Hore, L. O. Hall, and D. B. Goldgof, “Single pass fuzzy c means,” in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ ’07), pp. 1–7, London, UK, July 2007.
13. P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, “A scalable framework for segmenting magnetic resonance images,” Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
14. W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
15. P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
16. D. Achlioptas, “Database-friendly random projections: Johnson-Lindenstrauss with binary coins,” Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
17. C. Boutsidis, A. Zouzias, and P. Drineas, “Random projections for k-means clustering,” in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
18. C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
19. R. Avogadri and G. Valentini, “Fuzzy ensemble clustering based on random projections for DNA microarray data analysis,” Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
20. X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: a cluster ensemble approach,” in Proceedings of the 20th International Conference on Machine Learning (ICML ’03), vol. 3, pp. 186–193, August 2003.
21. M. Popescu, J. Keller, J. Bezdek, and A. Zare, “Random projections fuzzy c-means (RPFCM) for big data clustering,” in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE ’15), pp. 1–6, Istanbul, Turkey, August 2015.
22. A. Fahad, N. Alshatri, Z. Tari et al., “A survey of clustering algorithms for big data: taxonomy and empirical analysis,” IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
23. R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
24. C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, “Randomized dimensionality reduction for k-means clustering,” IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
25. X. Chen and D. Cai, “Large scale spectral clustering with landmark-based representation,” in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
26. D. Cai and X. Chen, “Large scale spectral clustering via landmark-based sparse representation,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
27. G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
28. U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
29. W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
30. D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, “Comparing fuzzy, probabilistic, and possibilistic partitions,” IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
31. X. L. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
32. Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, “Understanding and enhancement of internal clustering validation measures,” IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
SECTION IV SIMULATIONS WITH RANDOM NUMBERS AND VARIABLES
CHAPTER 11
SOCIAL EMOTIONAL OPTIMIZATION ALGORITHM WITH RANDOM EMOTIONAL SELECTION STRATEGY
Zhihua Cui1,2, Yuechun Xu1 and Jianchao Zeng1
1 Complex System and Computational Intelligence Laboratory, Taiyuan University of Science and Technology, China
2 State Key Laboratory of Novel Software Technology, Nanjing University, China
Citation: Zhihua Cui, Yuechun Xu and Jianchao Zeng (March 16th 2012). “Social Emotional Optimization Algorithm with Random Emotional Selection Strategy”, Theory and New Applications of Swarm Intelligence, Heitor Lopes, IntechOpen, DOI: 10.5772/38980.
Copyright: © 2012 by authors and Intech. This paper is an open access article distributed under a Creative Commons Attribution 3.0 License
INTRODUCTION
With industrial and scientific development, many new optimization problems need to be solved. Several of them are complex, multimodal, high-dimensional, nondifferentiable problems. Therefore, new optimization techniques have been designed, such as the genetic algorithm, the simulated annealing algorithm, Tabu search, etc. However, due to the strong linkage and correlation among different variables, these algorithms are easily trapped in a local optimum and fail to obtain a reasonable solution. Swarm intelligence (SI) is a recent research topic which mimics animal social behaviors. Up to now, many new swarm intelligence algorithms have been proposed, such as the group search optimizer[1], artificial physics optimization[2], the firefly algorithm[3] and the ant colony optimizer (ACO)[4]. All of them are inspired by different animal group systems. Generally, they are decentralized, self-organized systems in which a population of individuals interacts locally. Each individual follows several simple rules, and the emergence of “intelligent” global behavior is used to carry out the optimization task. The most famous one is particle swarm optimization. Particle swarm optimization (PSO) [5-8] is a population-based, self-adaptive search optimization method motivated by the observation of simplified animal social behaviors such as fish schooling, bird flocking, etc. It is becoming very popular due to its simplicity of implementation and its ability to converge quickly to a reasonably good solution. In a PSO system, multiple candidate solutions coexist and collaborate simultaneously. Each solution, called a “particle”, flies in the problem search space looking for the optimal position to land. As time passes, a particle adjusts its position according to its own “experience” as well as the experience of neighboring particles. Tracking and memorizing the best position encountered builds the particle’s experience. For that reason, PSO possesses a memory (i.e. every particle remembers the best position it reached in the past). A PSO system combines a local search method (through self experience) with global search methods (through neighboring experience), attempting to balance exploration and exploitation.
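The particle behaviour described above (each particle remembers its own best position and is also drawn toward the best position found by its neighbours) can be sketched with the commonly used global-best PSO update. This is a minimal illustration, not the implementation used in the experiments: the function name `pso_minimize`, the search bound, and the use of the whole swarm as the neighbourhood are assumptions, while the inertia weight decreasing from 0.9 to 0.4 and c1 = c2 = 2.0 follow the SPSO settings quoted in the SIMULATION section.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=20, iters=200, bound=5.0, seed=0):
    """Minimal global-best PSO for minimizing f over [-bound, bound]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, (n_particles, dim))  # positions
    v = np.zeros_like(x)                                # velocities
    pbest = x.copy()                                    # each particle's memory
    pbest_val = np.apply_along_axis(f, 1, x)
    c1 = c2 = 2.0
    for it in range(iters):
        w = 0.9 - 0.5 * it / max(iters - 1, 1)          # inertia: 0.9 -> 0.4
        gbest = pbest[np.argmin(pbest_val)].copy()      # swarm's best so far
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        # Own experience (pbest) plus neighbours' experience (gbest).
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        v = np.clip(v, -bound, bound)                   # velocity threshold
        x = x + v
        val = np.apply_along_axis(f, 1, x)
        improved = val < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = val[improved]
    best = int(np.argmin(pbest_val))
    return pbest[best], float(pbest_val[best])

# Usage: minimize a 2-dimensional sphere function.
best_x, best_val = pso_minimize(lambda p: float(np.sum(p ** 2)), dim=2)
```

Because personal bests are only ever replaced by better positions, the returned value can never be worse than the best initial particle.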
Human society is a complex group that is more effective than other animal groups. Therefore, an algorithm that mimics human society may be more effective and robust than swarm intelligence algorithms inspired by other animal groups. In this manner, the social emotional optimization algorithm (SEOA) was proposed by Zhihua Cui et al. in 2010 [9-13]. In SEOA methodology, each individual represents one person, while all points in the problem space construct the status society. In this virtual world, all individuals aim to seek a higher social status. Therefore, they communicate through cooperation and competition to increase personal status, and the one with the highest score wins and is output as the final
solution. In the experiments, the social emotional optimization algorithm (SEOA) shows remarkably superior performance in terms of accuracy and convergence speed [9-13]. In this chapter, we propose an improved social emotional optimization algorithm with a random emotional selection strategy and evaluate its performance on 5 benchmark functions in comparison with standard SEOA and other swarm intelligence algorithms. The rest of this chapter is organized as follows: the standard version of the social emotional optimization algorithm is presented in section 2, the modification is described in section 3, and simulation results are given in section 4.
STANDARD SOCIAL EMOTIONAL OPTIMIZATION ALGORITHM
In this paper, we only consider the unconstrained minimization of an objective function.
In human society, all people work hard to increase their social status. To achieve this goal, people try their best to find a path along which more social wealth can be earned. Inspired by this phenomenon, Cui et al. proposed a new population-based swarm methodology, the social emotional optimization algorithm, in which each individual simulates a virtual person whose decisions are guided by his emotion. In the social emotional optimization algorithm, each individual represents a virtual person who, in each generation, selects his behavior according to the corresponding emotion index. After the behavior is carried out, a status value is fed back from the society to confirm whether this behavior was right or not. If the choice was right, his emotion index increases, and vice versa. In the first step, all individuals’ emotion indexes are set to 1; with this value, they choose the following behaviour:
$$x_j(1) = x_j(0) \oplus Manner_1 \quad (1)$$
where $x_j(0)$ represents the social position of individual j in the initialization period, and the corresponding fitness value is denoted as the society status. The symbol ⊕ denotes an operation; in this paper, we only take it as the addition operation +. Since the emotion index of j is 1, the movement phase Manner1 is defined by:
(2)
where k1 is a parameter that controls the size of the emotion change and rand1 is a random number sampled uniformly from the interval (0,1). The worst L individuals are selected to remind individual j to avoid wrong behaviour. In the initialization period there is little emotional influence and few good experiences are available for reference, so Manner1 simulates the influence of bad experiences. In generation t, if individual j does not obtain a better society status value than its previous value, j’s emotion index is decreased as follows:
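$$BI_j(t+1) = BI_j(t) - \Delta \quad (3)$$
where $BI_j(t)$ denotes the emotion index of individual j in generation t and ∆ is a predefined value, set to 0.05 on the basis of experimental tests. If individual j is rewarded a new status value which is the best one among all previous iterations, the emotion index is reset to 1.0:
$$BI_j(t+1) = 1.0 \quad (4)$$
Remark: According to Eq.(3), $BI_j(t+1)$ is kept no less than 0.0; in other words, if $BI_j(t+1) < 0.0$, then $BI_j(t+1) = 0.0$.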
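In order to simulate human behavior, three kinds of manners are designed, and the next behavior is chosen according to the following three cases, where $BI_j(t+1)$ is the emotion index of individual j and $x_j(t)$ is its position in generation t:
$$\text{If } BI_j(t+1) < TH_1: \quad x_j(t+1) = x_j(t) \oplus Manner_2 \quad (5)$$
$$\text{If } TH_1 \le BI_j(t+1) < TH_2: \quad x_j(t+1) = x_j(t) \oplus Manner_3 \quad (6)$$
$$\text{Otherwise}: \quad x_j(t+1) = x_j(t) \oplus Manner_4 \quad (7)$$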
Parameters TH1 and TH2 are two thresholds that separate the different behavior manners. In Case 1, because the emotion index is too small, individual j prefers to imitate the successful experiences of others. Therefore, Manner2 is defined as:
(8)
where represents the best society status position obtained from all people previously. In other words, it is:
(9)
In a similar way, Manner3 is defined:
(10)
where denotes the best status value obtained by individual j previously, and is defined by
(11)
For Manner4 , we have
(12)
Manner2, Manner3 and Manner4 correspond to three different emotional cases. In the first case, the individual’s movement is protective: with a calm mind, Manner2 aims to preserve his achievements (good experiences). As the emotion index increases, more rewards are expected, so Manner3 is a moderate manner in which the individual also considers avoiding danger in order to increase his society status. Finally, when the emotion index is larger than the second threshold, the individual is in an agitated state of mind: he loses some of his good capabilities and does not listen to the views of others. Manner4 is designed to simulate this phenomenon.
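The emotion bookkeeping described in this section (decrease by ∆ = 0.05 when the status does not improve, reset to 1.0 on a new best, never below 0.0, then pick a manner by comparing with two thresholds) can be sketched as follows. The function names, the treatment of "improved but not a new best" as leaving the index unchanged, and the threshold values in the usage lines are assumptions made for illustration; the text does not specify them.

```python
DELTA = 0.05  # decrement quoted in the text

def update_emotion(emotion, status, prev_status, best_status_so_far, delta=DELTA):
    """Return the next emotion index of one individual (minimization)."""
    if status < best_status_so_far:
        return 1.0                          # new best status: reset to 1.0
    if status >= prev_status:
        return max(0.0, emotion - delta)    # no improvement: decrease, floor 0.0
    return emotion                          # assumption: otherwise unchanged

def select_manner(emotion, th1, th2):
    """Map an emotion index to one of the three behaviour manners."""
    if emotion < th1:
        return "Manner2"
    if emotion < th2:
        return "Manner3"
    return "Manner4"

# Usage with hypothetical thresholds TH1 = 0.4 and TH2 = 0.6.
e = update_emotion(1.0, status=5.0, prev_status=4.0, best_status_so_far=3.0)
manner = select_manner(e, th1=0.4, th2=0.6)
```

Keeping the update and the selection separate makes it easy to swap the deterministic threshold test for a randomized one.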
To enhance the global search capability, a mutation strategy similar to that of evolutionary computation is introduced to improve the ability to escape from local optima. The mutation operator is the same as that of Cai XJ[14]; please refer to the corresponding reference for details. The steps of the social emotional optimization algorithm are listed as follows:
Step 1. Initialize all individuals, placing their initial positions randomly in the problem space.
Step 2. Compute the fitness value of each individual according to the objective function.
Step 3. For individual j, determine the best position it has obtained so far.
Step 4. For the whole population, determine the best position obtained so far.
Step 5. Determine the emotion index of each individual and select one of the three emotional cases according to Eq.(5)-(7).
Step 6. Determine the decision with Eq.(8)-(12), respectively.
Step 7. Perform the mutation operation.
Step 8. If the stopping criterion is satisfied, output the best solution; otherwise, go to Step 3.
RANDOM EMOTIONAL SELECTION STRATEGY
In the standard SEOA, the emotion index is employed to simulate the personal decision mechanism. However, because the emotional selection strategy is deterministic, some stochastic aspects are omitted. To provide a more precise simulation, we replace the deterministic emotional selection strategy in the standard SEOA with three different random manners to mimic human emotional changes.
Gauss Distribution
The Gauss (normal) distribution is a widely used distribution, described in Wikipedia as “a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value. The graph of the associated probability density function is “bell”-shaped, and is known as the Gaussian function or bell curve” [15] (see Fig.1):
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where the parameter µ is called the mean and σ2 is the variance. The standard normal distribution is the special case with µ = 0 and σ2 = 1.
Cauchy Distribution
The Cauchy distribution is also called the Lorentz distribution, Lorentz(ian) function, or Breit–Wigner distribution. The probability density function of the Cauchy distribution is
$$f(x; x_0, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$$
where x0 is the location parameter, specifying the location of the peak of the distribution, and γ is the scale parameter, which specifies the half-width at half-maximum. The special case with x0 = 0 and γ = 1 is called the standard Cauchy distribution, with the probability density function
$$f(x; 0, 1) = \frac{1}{\pi(1 + x^2)}$$
Figure 1: Illustration of Probability Density Function for Gauss Distribution.
Figure 2: Illustration of Probability Density Function for Cauchy Distribution.
Levy Distribution
In the past few years, there has been more and more evidence from a variety of experimental, theoretical and field studies that many animals employ a movement strategy approximated by Levy flight when they are searching for resources. For example, wandering albatrosses were observed to adopt Levy flights to adapt stochastically to their prey field[16]. Levy flight patterns have also been found in a laboratory-scale study of starved fruit flies. In a recent study by Sims[17], marine predators adopted Levy flights to pursue Levy-like fractal distributions of prey density. In [18], the authors concluded that “Levy flights may be a universal strategy applicable across spatial scales ranging from less than a meter, ..., to several kilometers, and adopted by swimming, walking, and airborne organisms”. Shaped by natural selection, the Levy flight searching strategies of living animals should be regarded as optimal strategies to some degree[19]. Therefore, it would be interesting to incorporate Levy flight into the SEOA algorithm to improve its performance. Indeed, several studies have already incorporated Levy flight into heuristic search algorithms. In [20], the authors proposed a novel evolutionary programming with mutations based on the Levy probability distribution. To improve a swarm intelligence algorithm, the Particle Swarm Optimizer, a novel velocity threshold automation strategy incorporating the Levy probability distribution was proposed in [21]. In a different study of the PSO algorithm[22], the particle movement characterized by a Gaussian probability distribution was replaced by particle motion with a Levy flight. A mutation operator based on the Levy probability distribution was also introduced into the Extremal Optimization (EO) algorithm[23]. Levy flights comprise sequences of straight-line movements with random orientations. Levy flights are considered to be ‘scale-free’ since the straight-line movements have no characteristic scale.
The distribution of the straight-line movement lengths L has a power-law tail, $P(L) \sim L^{-\mu}$ with $1 < \mu \le 3$. The Levy distribution is governed by an index α and a scaling factor γ > 0. For α → 1 the distribution becomes the Cauchy distribution, and for α → 2 it becomes the Gaussian distribution. Without loss of generality, we set the scaling factor γ = 1. Since the analytic form of the Levy distribution is unknown for general α, we adopted the fast algorithm presented in [24] to generate Levy random numbers. First, two independent random variables x and y drawn from Gaussian distributions are combined by a nonlinear transformation. Then the random variable z, which follows the Levy distribution, is generated from the result by a second nonlinear transformation, where the values of the parameters K(α) and C(α) are given in [24]. In each iteration, a different random number is generated for each individual with the Gauss distribution, the Cauchy distribution or the Levy flight, and the individual then chooses among the different rules according to Eq.(5)-(7).
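Random numbers from the three distributions can be drawn as in the sketch below. For the Levy case it uses only the ratio of two Gaussian variables, x / |y|^(1/α), with the usual Mantegna-style scale for x; it omits the correction step involving K(α) and C(α), and it is an assumption that this matches the first transformation of the algorithm in [24]. How the sampled number is mapped onto the rules of Eq.(5)-(7) is not specified here, so the sketch stops at sampling. The function names are hypothetical.

```python
import math
import numpy as np

def levy_steps(alpha, size, rng):
    """Heavy-tailed steps v = x / |y|**(1/alpha) from two Gaussian variables."""
    # Scale of x commonly used with this ratio (assumed, see lead-in).
    sigma_x = (math.gamma(1 + alpha) * math.sin(math.pi * alpha / 2)
               / (math.gamma((1 + alpha) / 2) * alpha * 2 ** ((alpha - 1) / 2))
               ) ** (1 / alpha)
    x = rng.normal(0.0, sigma_x, size)
    y = rng.normal(0.0, 1.0, size)
    return x / np.abs(y) ** (1 / alpha)

def emotion_noise(kind, size, rng, alpha=1.5):
    """Draw random numbers for the SEOA-GD, SEOA-CD or SEOA-LD variant."""
    if kind == "gauss":
        return rng.normal(0.0, 1.0, size)   # standard normal
    if kind == "cauchy":
        return rng.standard_cauchy(size)    # x0 = 0, gamma = 1
    if kind == "levy":
        return levy_steps(alpha, size, rng)
    raise ValueError("kind must be 'gauss', 'cauchy' or 'levy'")

# Usage: one random number per individual in a population of 100.
rng_demo = np.random.default_rng(0)
noise_demo = emotion_noise("levy", 100, rng_demo)
```

With α = 1 the ratio of two standard Gaussian variables is a standard Cauchy variable, which is consistent with the limiting behaviour described in the text.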
SIMULATION
To test the performance of the proposed SEOA variants with the random emotional selection strategy, five typical unconstrained numerical benchmark functions are chosen, and the results are compared with those of standard particle swarm optimization (SPSO), modified particle swarm optimization with time-varying accelerator coefficients (TVAC)[25] and the standard version of SEOA (more details about the test suite can be found in [26]). For clarity, SEOA combined with the Gauss distribution, Cauchy distribution and Levy distribution are denoted SEOA-GD, SEOA-CD and SEOA-LD, respectively.
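Sphere Model:

$$f_1(x) = \sum_{j=1}^{n} x_j^2,$$

where $|x_j| \le 100$, and $f_1(x^*) = f_1(0, \ldots, 0) = 0$.

Rosenbrock Function:

$$f_2(x) = \sum_{j=1}^{n-1} \left[ 100\,(x_{j+1} - x_j^2)^2 + (x_j - 1)^2 \right],$$

where $|x_j| \le 30$, and $f_2(x^*) = f_2(1, \ldots, 1) = 0$.

Schwefel 2.26:

$$f_3(x) = -\sum_{j=1}^{n} x_j \sin\!\left(\sqrt{|x_j|}\right),$$

where $|x_j| \le 500$, and $f_3(x^*) = f_3(420.9687, \ldots, 420.9687) \approx -418.98\,n$.

Rastrigin:

$$f_4(x) = \sum_{j=1}^{n} \left[ x_j^2 - 10\cos(2\pi x_j) + 10 \right],$$

where $|x_j| \le 5.12$, and $f_4(x^*) = f_4(0, \ldots, 0) = 0$.

Penalized Function2:

$$f_5(x) = 0.1\left\{ \sin^2(3\pi x_1) + \sum_{j=1}^{n-1} (x_j - 1)^2 \left[ 1 + \sin^2(3\pi x_{j+1}) \right] + (x_n - 1)^2 \left[ 1 + \sin^2(2\pi x_n) \right] \right\} + \sum_{j=1}^{n} u(x_j, 5, 100, 4),$$

where $|x_j| \le 50$, and $f_5(x^*) = f_5(1, \ldots, 1) = 0$, with

$$u(x_j, a, k, m) = \begin{cases} k\,(x_j - a)^m, & x_j > a, \\ 0, & -a \le x_j \le a, \\ k\,(-x_j - a)^m, & x_j < -a. \end{cases}$$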
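Four of the five benchmarks can be written compactly in Python. The sketch below uses the standard textbook forms of these functions; treat them as an assumption wherever they differ from the definitions in the cited test suite. The function names are illustrative only.

```python
import math


def sphere(x):
    """Sphere model: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def rosenbrock(x):
    """Rosenbrock function: minimum 0 at (1, ..., 1)."""
    return sum(
        100.0 * (x[j + 1] - x[j] ** 2) ** 2 + (x[j] - 1.0) ** 2
        for j in range(len(x) - 1)
    )


def schwefel_2_26(x):
    """Schwefel 2.26: minimum of about -418.98 * n near x_j = 420.9687."""
    return -sum(v * math.sin(math.sqrt(abs(v))) for v in x)


def rastrigin(x):
    """Rastrigin: highly multi-modal, minimum 0 at the origin."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)


if __name__ == "__main__":
    n = 30
    print(sphere([0.0] * n), rosenbrock([1.0] * n), rastrigin([0.0] * n))
    print(schwefel_2_26([420.9687] * n))
```

Evaluating each function at its known optimum, as in the last lines, is a quick sanity check before plugging the functions into an optimizer.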
The inertia weight w is decreased linearly from 0.9 to 0.4 for SPSO and TVAC. The accelerator coefficients c1 and c2 are both set to 2.0 for SPSO, while in TVAC c1 decreases from 2.5 to 0.5 and c2 increases from 0.5 to 2.5. The population size is 100, and the velocity threshold vmax is set to the upper bound of the domain. The dimensionality is 30, 50, 100, 150, 200, 250 and 300. Each experiment is run 30 times, and the maximum number of iterations in each run is 50 times the dimension, e.g. 1500 iterations for dimension 30. For SEOA, all parameters are the same as in Cui et al[9].
1. Comparison among SEOA-GD, SEOA-CD and SEOA-LD
From Tab.1, SEOA-GD is the best algorithm for all 5 benchmarks, especially in the high-dimensional cases. This implies that SEOA-GD is the best choice among the three random variants.
2. Comparison with SPSO, TVAC and SEOA
In Tab.2, SEOA-GD is superior to the other three algorithms on all benchmarks, especially for the multi-modal functions. Based on the above analysis, we can draw the following conclusion: SEOA-GD is the most stable and effective of the three random variants, and is significantly superior to the other optimization algorithms, i.e. SPSO, TVAC and SEOA. It is especially suitable for high-dimensional cases.
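The parameter schedules in this setup are all linear in the iteration counter, so a single helper covers them. The sketch below is only an illustration of that arithmetic; the helper names are hypothetical and not taken from the original experiments.

```python
def linear_schedule(start, end, t, t_max):
    """Linearly interpolate from `start` (t = 0) to `end` (t = t_max)."""
    return start + (end - start) * t / t_max


def max_iterations(dimension):
    """Iteration budget used in the experiments: 50 times the dimension."""
    return 50 * dimension


if __name__ == "__main__":
    t_max = max_iterations(30)
    for t in (0, t_max // 2, t_max):
        w = linear_schedule(0.9, 0.4, t, t_max)    # inertia weight (SPSO, TVAC)
        c1 = linear_schedule(2.5, 0.5, t, t_max)   # TVAC cognitive coefficient
        c2 = linear_schedule(0.5, 2.5, t, t_max)   # TVAC social coefficient
        print(t, round(w, 3), round(c1, 3), round(c2, 3))
```

Halfway through the run the inertia weight is 0.65 and both TVAC coefficients meet at 1.5, after which the social term dominates.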
Table 1: Comparison results between SEOA-GD, SEOA-CD and SEOA-LD
Table 2: Comparison results between SEOA-GD and SPSO, TVAC, SEOA
CONCLUSION In the standard version of the social emotional optimization algorithm, the decisions of all individuals are influenced by one constant emotional selection strategy. However, this strategy may lead to a wrong search selection because some randomness is omitted. Therefore, to further improve the performance, three different random emotional selection strategies are added. Simulation results show that SEOA with the Gauss distribution is the most effective. Future research topics include the application of SEOA to other problems.
ACKNOWLEDGEMENT This paper was supported by the Key Project of Chinese Ministry of Education under Grant No.209021 and the National Natural Science Foundation of China under Grant 61003053.
REFERENCES
1. He S, Wu QH and Saunders JR. (2006) Group search optimizer: an optimization algorithm inspired by animal searching behavior. IEEE International Conference on Evolutionary Computation, pp.973-990.
2. Yang XS. (2010) Firefly algorithm, stochastic test functions and design optimization. International Journal of Bio-inspired Computation, 2(2), 78-84.
3. Laalaoui Y and Drias H. (2010) ACO approach with learning for preemptive scheduling of real-time tasks. International Journal of Bio-inspired Computation, 2(6), 383-394.
4. Abraham S, Sanyal S and Sanglikar M. (2010) Particle swarm optimisation based Diophantine equation solver. International Journal of Bio-inspired Computation, 2(2), 100-114.
5. Yuan DL and Chen Q. (2010) Particle swarm optimisation algorithm with forgetting character. International Journal of Bio-inspired Computation, 2(1), 59-64.
6. Lu JG, Zhang L, Yang H and Du J. (2010) Improved strategy of particle swarm optimisation algorithm for reactive power optimization. International Journal of Bio-inspired Computation, 2(1), 27-33.
7. Upendar J, Singh GK and Gupta CP. (2010) A particle swarm optimisation based technique of harmonic elimination and voltage control in pulse-width modulated inverters. International Journal of Bio-inspired Computation, 2(1), 18-26.
8. Cui ZH and Cai XJ. (2010) Using social cognitive optimization algorithm to solve nonlinear equations. Proceedings of 9th IEEE International Conference on Cognitive Informatics (ICCI 2010), pp.199-203.
9. Chen YJ, Cui ZH and Zeng JH. (2010) Structural optimization of Lennard-Jones clusters by hybrid social cognitive optimization algorithm. Proceedings of 9th IEEE International Conference on Cognitive Informatics (ICCI 2010), pp.204-208.
10. Cui ZH, Shi ZZ and Zeng JC. (2010) Using social emotional optimization algorithm to direct orbits of chaotic systems. Proceedings of 2010 International Conference on Computational Aspects of Social Networks (CASoN2010), pp.389-395.
11. Wei ZH, Cui ZH and Zeng JC. (2010) Social cognitive optimization algorithm with reactive power optimization of power system. Proceedings of 2010 International Conference on Computational Aspects of Social Networks (CASoN2010), pp.11-14.
12. Xu YC, Cui ZH and Zeng JC. (2010) Social emotional optimization algorithm for nonlinear constrained optimization problems. Proceedings of 1st International Conference on Swarm, Evolutionary and Memetic Computing (SEMCCO2010), pp.583-590.
13. Cai XJ, Cui ZH, Zeng JC and Tan Y. (2008) Particle swarm optimization with self-adjusting cognitive selection strategy. International Journal of Innovative Computing, Information and Control, 4(4): 943-952.
14. http://en.wikipedia.org/wiki/Normal_distribution
15. G. M. Viswanathan, S. V. Buldyrev, S. Havlin, M. G. da Luz, E. Raposo and H. E. Stanley. (1999) Optimizing the success of random searches. Nature, 401: 911-914.
16. D. W. Sims et al. (2008) Scaling laws of marine predator search behavior. Nature, 451: 1098-1102.
17. A. M. Reynolds and C. J. Rhodes. (2009) The Levy flight paradigm: random search patterns and mechanisms. Ecology, 90(4): 877-887.
18. G. A. Parker and J. Maynard Smith. (1990) Optimality theory in evolutionary biology. Nature, 348(1): 27-33.
19. C. Y. Lee and X. Yao. (2004) Evolutionary programming using mutations based on the Levy probability distribution. IEEE Transactions on Evolutionary Computation, 8(1): 1-13.
20. X. Cai, J. Zeng, Z. H. Cui and Y. Tan. (2007) Particle swarm optimization using Levy probability distribution. Proceedings of the 2nd International Symposium on Intelligence Computation and Application, 353-361, Wuhan, China.
21. T. J. Richer and T. M. Blackwell. (2006) The Levy particle swarm. Proceedings of IEEE Congress on Evolutionary Computation, 808-815.
22. M. R. Chen, Y. Z. Lu and G. Yang. (2006) Population-based extremal optimization with adaptive Levy mutation for constrained optimization. Proceedings of 2006 International Conference on Computational Intelligence and Security, pp.258-261, November 3-6, Guangzhou, China.
23. R. N. Mantegna. (1994) Fast, accurate algorithm for numerical simulation of Levy stable stochastic processes. Physical Review E, 49(5): 4677-4683.
24. Ratnaweera A, Halgamuge SK and Watson HC. (2004) Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Transactions on Evolutionary Computation, 8(3): 240-255.
25. Yao X, Liu Y and Lin GM. (1999) Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 3(2): 82-102.
CHAPTER 12
LYMPH DISEASES PREDICTION USING RANDOM FOREST AND PARTICLE SWARM OPTIMIZATION
Waheeda Almayyan
Department of Computer Science and Information Systems, College of Business Studies, Public Authority for Applied Education and Training, Kuwait City, Kuwait
ABSTRACT This research aims to develop a model that enhances lymphatic disease diagnosis by using the random forest ensemble machine-learning method trained with a simple sampling scheme. The study was carried out in two major phases: feature selection and classification. In the first stage, a number of discriminative features out of 18 were selected using PSO and several feature selection techniques to reduce the feature dimension. In the second stage, we applied the random forest ensemble classification scheme to diagnose lymphatic diseases. In the experiments with the selected features, we used the original and resampled distributions of the dataset to train the random forest classifier. Experimental results demonstrate that the proposed method achieves a remarkable improvement in classification accuracy.
Keywords: Classification, Random Forest Ensemble, PSO, Simple Random Sampling, Information Gain Ratio, Symmetrical Uncertainty
Citation: Almayyan, W. (2016), "Lymph Diseases Prediction Using Random Forest and Particle Swarm Optimization". Journal of Intelligent Learning Systems and Applications, 8, 51-62. doi: 10.4236/jilsa.2016.83005. Copyright: © 2016 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
INTRODUCTION Computer-Aided Diagnosis (CAD) applications have become one of the key research topics in medical biometrics diagnostic tasks. Medical diagnosis depends on the experience of the physician as well as on the existing data. Consequently, a number of articles have suggested strategies to support the physician's analysis and judgment in actual clinical assessments [1]. Machine-learning techniques have been applied with reasonable success in constructing CAD applications, owing to their strong capability of extracting complex relationships from medical data [2]. Raw medical data requires effective classification techniques to support the computer-based analysis of such voluminous and heterogeneous data. The accuracy of clinically diagnosed cases is a particularly important issue to consider during classification. Medical datasets are usually large, which directly affects the complexity of the data mining procedure [3]. Large-scale medical data is therefore a source of significant challenges in data mining applications, which involve extracting the most descriptive or discriminative features. Thus, feature reduction has a significant role in eliminating irrelevant features from medical datasets [4] [5]. The dimensionality reduction procedure aims to reduce computational complexity, with the possible advantage of enhancing the overall classification performance. It eliminates insignificant features before model implementation, which makes screening tests faster, more practical and less costly, an important requirement in medical applications [6]. The lymphatic system is a vital part of the immune system that removes interstitial fluid from tissues. It absorbs and transports fats and fat-soluble vitamins from the digestive system and delivers these nutrients to
the cells of the body. It transports white blood cells to and from the lymph nodes into the bones. Moreover, it transports antigen-presenting cells to the lymph nodes, where an immune response is stimulated. Different medical imaging techniques have been used to investigate the status of the lymphatic channels and lymph glands [7]. The current state of the lymph nodes, together with data obtained from the lymphography technique, can ascertain the classification of the investigated diagnosis [8]. The enlargement of lymph nodes can be an indicator of several conditions, ranging up to serious conditions that threaten life [9]. The study of the lymph nodes is important in the diagnosis, prognosis and treatment of cancer [10]. Therefore, the main contribution of this paper is to investigate the effectiveness of the suggested technique in diagnosing lymph disease. In this article, a CAD system based on a random forest ensemble classifier is introduced to improve the classification accuracy for lymph disease diagnosis. The difference between this article and other articles that address the same topic is that a strong ensemble classifier scheme has been created by combining PSO feature selection and random forest decision tree methods, which yields more efficient results than any of the other methods tested in this paper. Several approaches using conventional and artificial intelligence techniques have been investigated in order to evaluate the lymphography dataset. Karabulut et al. studied the effect of feature selection methods with Naïve Bayes, Multilayer Perceptron (MLP) and J48 decision tree classifiers on fifteen real datasets, including the lymph disease dataset [11]. The best accuracy, 84.46%, was achieved using Chi-square feature selection and MLP. Derrac et al. proposed an evolutionary algorithm for data reduction enhanced by rough set based feature selection. The best accuracy recorded was 82.65% with 5 neighbors [12].
Madden [13] presented a comparative study of the Naïve Bayes, Tree Augmented Naïve Bayes (TAN), General Bayesian Network (GBN) with K2 search and GBN with hill-climbing search classifiers, which scored accuracies of 82.16%, 81.07%, 77.46% and 75.06%, respectively. De Falco [14] proposed a differential evolution technique to classify eight databases from the medical domain. The suggested technique scored an accuracy of 85.14%, compared to 80.18% using the Part classifier. Abellán and Masegosa designed bagging credal decision trees using imprecise probabilities and uncertainty measures. The proposed decision tree model scored an accuracy of 79.69% without pruning and 77.51% with pruning [15].
In this article, a two-stage algorithm is investigated to enhance the classification of lymph disease diagnosis. In the first stage, a number of discriminative features out of 18 were selected using PSO and several feature selection techniques to reduce the dimension. In the second stage, we used a random forest ensemble classification scheme to diagnose lymphography types. In the experiments with the selected features, we used the original and resampled distributions of the dataset to train the random forest algorithm. We noticed a promising improvement in the classification performance of the algorithm with the resampling strategy. The article commences with the suggested feature selection techniques and the random forest ensemble classifier. Section 4 briefly introduces the simple random sampling strategy. Section 5 focuses on the applied performance measures. Section 6 describes the experiment steps and the dataset involved, and shows the results of the experiments. The article closes with conclusions and further research.
FEATURE SELECTION The main objectives of the proposed approach are to improve classification accuracy and to obtain the most important features. Essentially, the feature space is searched in order to reduce its dimension and prepare the conditions for the classification step. This task is carried out using different state-of-the-art dimension reduction techniques, namely Particle Swarm Optimization, Information Gain Ratio attribute evaluation and the Symmetric Uncertainty correlation-based measure.
Particle Swarm Optimization for Feature Selection The particle swarm optimization (PSO) technique is a population-based stochastic optimization technique first introduced in 1995 by Kennedy and Eberhart [16]. In PSO, a candidate solution is encoded as a finite-length string called a particle $p_i$ in the search space. Each particle makes use of its own memory and of the knowledge gained by the swarm as a whole to find the best solution. To discover the optimal solution, each particle adjusts its search direction according to two quantities: its own best previous experience (pbest) and the best experience of its companions (gbest). Each particle moves around the n-dimensional search space S with objective function $f: S \subseteq \mathbb{R}^n \to \mathbb{R}$. Each particle has a position $x_i(t)$ (t represents the iteration counter), a fitness function $f(x_i(t))$, and "flies" through the problem space with a velocity $v_i(t)$. A new position $z_1 \in S$ is called better than $z_2 \in S$ iff $f(z_1) < f(z_2)$ [17].
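Particles evolve simultaneously based on knowledge shared with neighbouring particles; they make use of their own memory and of the knowledge gained by the swarm as a whole to find the best solution. The best search space position that particle i has visited until iteration t is its previous experience pbest. Each particle is assigned a subset of all particles as its neighbourhood. The best previous experience of all neighbours of particle i is called gbest. Each particle additionally keeps a fraction of its old velocity. In continuous PSO, the particle updates its velocity and position with the following equations [17]:

$$v_i(t+1) = w\,v_i(t) + C_1 r_1 \left( pbest_i - x_i(t) \right) + C_2 r_2 \left( gbest - x_i(t) \right) \qquad (1)$$

$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (2)$$

Here $x_i(t)$ and $v_i(t)$ are the position and velocity of particle i at iteration t, w is the inertia weight, and $r_1$ and $r_2$ are random numbers drawn uniformly from [0, 1].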
The first part of Equation (1) represents the previous flying velocity of the particle. The second part is the "cognition" part, which is the private thinking of the particle itself, where C1 is the individual factor. The third part is the "social" part, which represents the collaboration among the particles, where C2 is the societal factor. The acceleration coefficients C1 and C2 are constants that weight the stochastic acceleration terms pulling each particle toward the pbest and gbest positions. Particle velocities are restricted to a maximum velocity, Vmax. If Vmax is too small, particles can become trapped in local optima. In contrast, if Vmax is too high, particles might fly past good solutions. According to Equation (1), the particle's new velocity is calculated from its previous velocity and the distances of its current position from its own best experience and the group's best experience. Afterwards, the particle flies toward a new position according to Equation (2). The performance of each particle is measured according to a predefined fitness function (Figure 1).
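The three-part velocity update and the Vmax restriction described above can be sketched as follows. This is a generic continuous PSO step under the usual assumptions (uniform random factors r1 and r2 drawn per dimension, symmetric velocity clamping); the function name and argument layout are hypothetical and not taken from the paper's implementation.

```python
import random


def pso_step(x, v, pbest, gbest, w, c1, c2, vmax, rng):
    """Return the new (position, velocity) of one particle.

    velocity = inertia part + cognition part + social part, clamped to [-vmax, vmax].
    """
    new_x, new_v = [], []
    for xj, vj, pj, gj in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        vel = w * vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj)
        vel = max(-vmax, min(vmax, vel))  # restrict the velocity to Vmax
        new_v.append(vel)
        new_x.append(xj + vel)            # position update
    return new_x, new_v


if __name__ == "__main__":
    rng = random.Random(0)
    x, v = [0.5, -1.0], [0.0, 0.0]
    x, v = pso_step(x, v, pbest=[0.0, 0.0], gbest=[1.0, 1.0],
                    w=0.33, c1=0.34, c2=0.33, vmax=4.0, rng=rng)
    print(x, v)
```

A particle that already sits on both pbest and gbest with zero velocity does not move, which is why the inertia term matters for continued exploration.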
Information Gain Ratio Attribute Evaluation The Information Gain Ratio attribute evaluation (IGR) measure was developed by Quinlan (Quinlan, 1993) within the C4.5 algorithm and is based on the Shannon entropy; it is used to select the test attribute at each node of the decision tree [18]. It represents how precisely the attributes predict the classes of the test dataset, so that the "best" attribute can be used as the root of the decision tree.
The expected IGR needed to classify a given sample s from a set of data samples C is calculated as follows:
(3)
where the frequency term, $C_i$ and $|C_i|$ are the frequency of the sample s in C, the i-th class of C and the number of samples in $C_i$, respectively.
Figure 1: Pseudocode of PSO-based feature selection approach.
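For readers who want to experiment, the gain ratio as it is usually defined for C4.5 (information gain divided by the split information of the attribute) can be computed directly from counts. The sketch below follows that common definition, which is an assumption here rather than a transcription of Equation (3); the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `labels` given `feature`, divided by the split information."""
    n = len(labels)
    groups = {}
    for f, c in zip(feature, labels):
        groups.setdefault(f, []).append(c)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    if split_info == 0:  # constant attribute: nothing to split on
        return 0.0
    return (entropy(labels) - remainder) / split_info


if __name__ == "__main__":
    feature = ["yes", "yes", "no", "no"]
    labels = ["malign", "malign", "normal", "normal"]
    print(gain_ratio(feature, labels))
```

Dividing by the split information penalizes attributes with many distinct values, which plain information gain tends to favour.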
Symmetrical Uncertainty The symmetric uncertainty correlation-based measure (SU) can be used to evaluate the goodness of features by calculating the correlation between a feature and the target class (Fayyad & Irani, 1993; Liu et al., 2002) [19] [20]. Features with a greater SU value are given higher importance. SU is defined as
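$$SU(X, Y) = \frac{2\,IG(X \mid Y)}{H(X) + H(Y)}, \qquad IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (4)$$

where H(X), H(Y), H(X|Y) and IG are the entropy of X, the entropy of Y, the entropy of the posterior probability of X given Y, and the information gain, respectively.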
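Symmetrical uncertainty is easy to compute from value counts. The following sketch assumes the usual definition, SU = 2 · IG / (H(X) + H(Y)) with IG = H(X) − H(X|Y), and uses hypothetical function names; it is meant as an illustration rather than the routine used in the experiments.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetrical_uncertainty(x, y):
    """SU in [0, 1]: 0 for independent variables, 1 for fully dependent ones."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    info_gain = hx - conditional_entropy(x, y)
    return 2.0 * info_gain / (hx + hy)


if __name__ == "__main__":
    feature = [1, 1, 2, 2]
    target = ["normal", "normal", "fibrosis", "fibrosis"]
    print(symmetrical_uncertainty(feature, target))
```

Because the information gain is normalized by the sum of both entropies, the score does not favour features simply for having more distinct values.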
RANDOM FOREST ENSEMBLE CLASSIFICATION ALGORITHM Ensemble learning methods, which use ensembles of classifiers such as neural network ensembles, random forest, bagging and boosting, have received increasing interest because they deliver accurate predictions and are more robust to noise and outliers than single classifiers [21] [22]. The basic idea behind ensemble classifiers is the premise that a group of classifiers can perform better than an individual classifier. In 2001, Breiman proposed a new and promising tree-based ensemble classifier built from a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees, and called it random forest. The random forest classifier consists of a combination of individual base classifiers, where each tree is generated using a random vector sampled independently from the classification input vector, which enables a much faster construction of trees. For classification, the single votes from all trees are combined using a rule-based approach (such as majority voting, product, sum or Bayesian rule), or using an iterative error minimization technique that reduces the weights of the correctly classified samples. In Random Forest, the method of building an ensemble of classifiers can be summarized as follows:
• The Random Forest training algorithm starts with constructing multiple trees. In the literature, several methods have been used, such as random trees, CART, J48, C4.5, etc. In this article we use random trees with no pruning to build the Random Forest classifier, which makes it light from a computational perspective.
• The next step is preparing the training set for each tree, which is formed by randomly sampling the training dataset using the bootstrapping technique with replacement. This step is called the bagging step [23]; the selected samples are called the in-bag samples, and the rest are set aside as out-of-bag samples.
• For each new training set that is generated, approximately one third of the data in the in-bag set is duplicated (sampling with replacement) and used for building the tree, whereas the remaining training samples, the out-of-bag samples, are used to test the classification performance of the tree. Figure 2 illustrates the data sampling procedure. Each tree is constructed using a different bootstrap sample.
• Random Forest increases the diversity of the trees by choosing and using a random number of features (in this work four features) to construct the nodes and leaves of a random tree classifier. According to Breiman [21] [23], this step minimizes the correlation among the trees, decreases the sensitivity to noise in the data and, at the same time, increases the accuracy of classification.
• Building a random tree begins at the top of the tree with the in-bag dataset. The first step involves selecting a feature at the root node and then splitting the training data into subsets for every possible value of the feature. This makes a branch for each possible value of the attribute. Tree design requires choosing a suitable attribute selection measure for splitting, and selecting the root node so as to maximize dissimilarity between classes.
• The information gain (IG) of splitting the training dataset (Y) into subsets ($Y_i$) can be defined as:

$$IG(Y) = H(Y) - \sum_{i} \frac{|Y_i|}{|Y|}\, H(Y_i) \qquad (5)$$

where H(·) denotes the entropy. If the information gain is positive, the node is split; otherwise the node becomes a leaf node that provides a decision for the most common target class in the training subset.
• The partitioning procedure is repeated recursively at each branch node, using the subset that reaches the branch and the remaining attributes, and continues until all attributes are selected. The remaining attribute with the highest information gain is selected as the next attribute. Eventually, the most frequently occurring target class in the training subset that reached a node is assigned as the classification decision.
• The procedure is repeated to build all trees. After all trees are built, the out-of-bag dataset is used to test the individual trees as well as the entire forest. The obtained average misclassification error can be used to adjust the weights of the vote of each tree. In this article, the implementation of the random forest classifier gives each tree the same weight.
Figure 2: Data partition in constructing random forest trees.
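The in-bag / out-of-bag partition shown in Figure 2 can be reproduced with a few lines of code. The sketch below draws one bootstrap sample of indices per tree; the function name is hypothetical, and the example size of 148 simply mirrors the number of instances in the lymphography dataset.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement (in-bag); unused indices are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


if __name__ == "__main__":
    rng = random.Random(7)
    in_bag, out_of_bag = bootstrap_split(148, rng)
    # On average roughly 37% of the samples end up out-of-bag,
    # since (1 - 1/n)**n approaches 1/e for large n.
    print(len(set(in_bag)), len(out_of_bag))
```

Each tree receives its own split, so every training sample is out-of-bag for some trees and can be used to estimate the error of those trees without a separate test set.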
SIMPLE RANDOM SAMPLING Medical data usually suffer from class imbalance problems, because one class is represented by a considerably larger number of instances than the other classes. As a result, classification algorithms tend to ignore the minority classes. Simple random sampling has been advised as a good means of increasing the sensitivity of the classifier to the minority class by scaling the class distribution. An empirical study by Weiss and Provost on twenty datasets from the UCI repository showed quantitatively that classifier accuracy might be increased with a progressive sampling algorithm [24]. The authors deployed decision trees to evaluate classification performance with the use of a sampling strategy. Another important study used sampling to scale the class distribution, focusing mainly on biomedical datasets [25]. The authors measured the effect of the suggested sampling strategy using nearest neighbor and decision tree classifiers. In simple random sampling, a sample is randomly selected from the population so that the obtained sample is representative of the population. Therefore, this technique provides an unbiased sample of the original data. There are two approaches to making the random selection. In the first approach, the samples are selected with replacement, so a sample can be selected more than once, each time with an equal chance of selection. In the other approach, the samples are selected without replacement, so each sample in the dataset has an equal chance of being selected and, once selected, cannot be chosen again [26].
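The two selection approaches map directly onto two functions of Python's standard `random` module. The sketch below is illustrative only; the wrapper names and the toy population are assumptions, not part of the original study.

```python
import random


def sample_without_replacement(items, k, rng):
    """Each item can be chosen at most once (requires k <= len(items))."""
    return rng.sample(items, k)


def sample_with_replacement(items, k, rng):
    """Each draw is independent, so an item may appear several times."""
    return rng.choices(items, k=k)


if __name__ == "__main__":
    rng = random.Random(3)
    population = list(range(10))  # hypothetical instance identifiers
    print(sample_without_replacement(population, 5, rng))
    print(sample_with_replacement(population, 5, rng))
```

With replacement, the sample may be larger than the population and can over-represent rare instances, which is what makes it usable for rescaling a skewed class distribution.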
PERFORMANCE MEASURES When the data is inadequate, predicting the classification performance of a machine learning method is difficult. Thus, cross-validation is preferred when only a small amount of data is available [27]. When machine-learning methods explore data, decisions must be made on how to split the dataset for training and testing. To estimate the performance of the machine learning methods, the lymphography dataset is split into training and testing subsets by 10-fold cross-validation, a commonly used evaluation technique. The performance of the suggested technique was evaluated using four commonly used performance metrics: Precision, ROC, MCC and Cohen's kappa coefficient. The main formulations are defined in Equations (6)-(8), according to the confusion matrix. In the confusion matrix of a two-class problem, TP is the number of true positives, i.e. positive instances classified correctly. FN is the number of false negatives, i.e. positive instances incorrectly classified as negative. TN is the number of true negatives, i.e. negative instances correctly classified as negative. FP is the number of false positives, i.e. negative instances incorrectly classified as positive. Accordingly, we can define Precision as:

$$Precision = \frac{TP}{TP + FP} \qquad (6)$$
The Receiver Operator Characteristic (ROC) curve is another commonly used measure for evaluating two-class decision problems in machine learning. The ROC curve is a standard tool for summarizing classifier performance over a range of tradeoffs between TP and FP error rates [28]. The ROC value usually lies between 0.5 for random drawing and 1.0 for perfect classifier performance. For class-imbalanced datasets, such as the one used here, the Matthews Correlation Coefficient (MCC) is an appropriate measure that is considered balanced. It can be used even if the classes are of very different sizes, as it is a correlation coefficient between the observed and predicted classification decisions. The MCC measure falls within the range [−1, 1]. A larger MCC indicates better classifier prediction. The MCC measure can be calculated directly from the confusion matrix using the following formula:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (7)$$
Additionally, the Kappa error or Cohen's kappa statistic is a recommended measure for comparing the performance of different classifiers, and hence the quality of the selected features. In general the Kappa error value lies in [−1, 1]; the closer the value calculated for a classifier is to 1, the more the performance of the classifier is assumed to be realistic rather than due to chance [28]. The Kappa error measure can be calculated using the following formula:

$$K = \frac{P(A) - P(E)}{1 - P(E)} \qquad (8)$$

where P(A) is the total agreement probability and P(E) is the hypothetical probability of chance agreement.
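For a two-class confusion matrix, the three count-based measures can be computed as below. The sketch assumes the standard definitions of Precision, MCC and Cohen's kappa (with the chance agreement P(E) derived from the row and column totals); the function names are hypothetical, and zero denominators are mapped to 0.0 as a convenience.

```python
import math


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    p_a = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_a - p_e) / (1 - p_e) if p_e != 1 else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, tn, fp, fn = 40, 40, 10, 10
    print(precision(tp, fp), mcc(tp, tn, fp, fn), cohen_kappa(tp, tn, fp, fn))
```

For the balanced counts in the example, MCC and kappa coincide at 0.6 while precision is 0.8, which shows how the chance-corrected measures are stricter than a raw rate.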
EXPERIMENTAL STUDY The lymphography database was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, and was donated by the same contributors to the UCI Machine Learning Repository [29] [30]. It comprises 148 instances represented by 18 diagnostic features. The Lymph dataset is classified with respect to the condition of the subject as normal, metastases, malign lymph or fibrosis. The 18 features, along with their description, mean and standard deviation, are listed in Table 1. The classes have sample sizes of 2, 81, 61 and 4, respectively. All the experiments were carried out in the Waikato Environment for Knowledge Analysis (WEKA), a popular suite of data mining algorithms written in Java, as follows:
i) The RF ensemble classifier is designed based on 150 trees and 10 random features to build each tree.
ii) The suggested algorithm is trained with the lymphographic dataset using a 10-fold cross-validation strategy to evaluate the classification accuracy on the dataset.
As mentioned earlier, the system suggested in this study for enhancing lymph disease diagnosis is carried out in two major phases. In the first phase, the feature space is searched in order to reduce its dimension and prepare the conditions for the next step. This task is carried out using three feature selection techniques: PSO, IGR and SU. For PSO feature selection, the population size is 20, the number of iterations is 20, the individual weight is 0.34 and the inertia weight is 0.33. The optimal features found by these techniques are summarized in Table 2. It is worth noting that the number of features is remarkably reduced, and therefore less storage space is
required for the execution of the classification algorithms. This step reduced the size of the dataset to only 9 to 13 attributes. Figure 3 visualizes the agreement among the feature selection techniques. The Venn diagram shows that the three feature selection techniques share the lymphatic, block of afferent, regeneration, lymph nodes diminish, changes in lymph, defect in node, changes in node and number of nodes attributes. It also indicates that the early uptake and changes in structure attributes are common to the PSO and IGR techniques, whereas the special forms attribute is common to the IGR and SU techniques.
Figure 3: Feature selection techniques agreement.

Table 1: Lymphographic dataset description of attributes

| Attribute number | Attribute description | Possible values of attributes | Assigned values | Mean | S.D. |
|---|---|---|---|---|---|
| 1 | Lymphatic | Normal = 1, arched = 2, deformed = 3, displaced = 4 | 1 - 4 | 2.74 | 0.82 |
| 2 | Block of afferent | No, Yes | 1 - 2 | 1.55 | 0.50 |
| 3 | Block of lymph c (superior and inferior flaps) | No, Yes | 1 - 2 | 1.17 | 0.38 |
| 4 | Block of lymph s (lazy incision) | No, Yes | 1 - 2 | 1.04 | 0.21 |
| 5 | By pass | No, Yes | 1 - 2 | 1.24 | 0.43 |
| 6 | Extravasates (force out of lymph) | No, Yes | 1 - 2 | 1.51 | 0.50 |
| 7 | Regeneration | No, Yes | 1 - 2 | 1.07 | 0.25 |
| 8 | Early uptake | No, Yes | 1 - 2 | 1.7 | 0.46 |
| 9 | Lymph nodes diminish | 0 - 3 | 0 - 3 | 1.06 | 0.31 |
| 10 | Lymph nodes enlarge | 1 - 4 | 1 - 4 | 2.47 | 0.84 |
| 11 | Changes in lymph | Bean = 1, oval = 2, round = 3 | 1 - 3 | 2.4 | 0.57 |
| 12 | Defect in node | No = 1, lacunar = 2, lacunar marginal = 3, lacunar central = 4 | 1 - 4 | 2.97 | 0.87 |
| 13 | Changes in node | No, lacunar, lacunar marginal, lacunar central | 1 - 4 | 2.8 | 0.76 |
| 14 | Changes in structure | No, grainy, drop-like, coarse, diluted, reticular, stripped, faint | 1 - 8 | 5.22 | 2.17 |
| 15 | Special forms | No, Chalices, vesicles | 1 - 3 | 2.33 | 0.77 |
| 16 | Dislocation | No, Yes | 1 - 2 | 1.67 | 0.48 |
| 17 | Exclusion of node | No, Yes | 1 - 2 | 1.8 | 0.41 |
| 18 | Number of nodes | 0 - 80 | 1 - 8 | 2.6 | 1.91 |
| 19 | Target Class | Normal = 1, metastases = 2, malign lymph = 3, fibrosis = 4 | | | |
Table 2: Selected features of lymph disease data set

| Feature selection technique | Number of selected features | Selected features labels |
|---|---|---|
| 1. PSO | 10 | 1, 2, 7, 8, 9, 11, 12, 13, 14, 18 |
| 2. IGR | 13 | 1, 2, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18 |
| 3. SU | 9 | 1, 2, 7, 9, 11, 12, 13, 15, 18 |
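The agreement between the three feature selection techniques can be recomputed directly from the labels in Table 2. The following is a minimal sketch using plain Python sets; the function name `agreement` and the dictionary layout are illustrative assumptions, not part of the original study.

```python
# Selected feature labels copied from Table 2 (numbers refer to Table 1).
selected = {
    "PSO": {1, 2, 7, 8, 9, 11, 12, 13, 14, 18},
    "IGR": {1, 2, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18},
    "SU": {1, 2, 7, 9, 11, 12, 13, 15, 18},
}


def agreement(selected):
    """Return labels chosen by every technique, and labels shared by exactly two."""
    names = sorted(selected)
    common = set.intersection(*selected.values())
    pairwise = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Labels picked by any technique other than a and b.
            others = set().union(*(selected[n] for n in names if n not in (a, b)))
            pairwise[(a, b)] = (selected[a] & selected[b]) - others
    return common, pairwise


common, pairwise = agreement(selected)
print("all three:", sorted(common))
for pair, labels in pairwise.items():
    print(pair, sorted(labels))
```

With the labels above, attributes 8 and 14 (early uptake, changes in structure) come out as shared by PSO and IGR only, and attribute 15 (special forms) as shared by IGR and SU only.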
Afterwards, the selected features are used as inputs to the classifiers. Three machine-learning classification paradigms that are considered very robust in solving non-linear problems are examined to estimate the likelihood of lymph disease: C4.5 as a decision tree, k-NN as an instance-based learner, and a feedforward artificial neural network (multi-layer perceptron, MLP). The k-NN classifier used the Euclidean distance measure with k = 1, the C4.5 classifier was applied with a pruning confidence factor of 0.25 and a minimum of 2 instances per leaf, and the MLP
classifier with a learning rate of 0.3 and a momentum of 0.2. Table 3 compares the classification performance before and after the feature reduction phase, which deploys the PSO, IGR and SU algorithms to detect the most significant features. Before feature reduction, the highest precision, 84.3%, was obtained by the RF classifier with 18 features. The proposed RF+PSO approach obtained 82.6%, 67.5%, 92.4% and 66.8% for Precision, MCC, ROC and Kappa, respectively, with 10 features. The RF+IGR and RF+SU techniques obtained precision rates of 78.1% and 77.7% with 12 and 9 features, respectively. Clearly, PSO helped in reducing the dimension of the feature space, yet this step did not improve the classification performance.

Table 3: Classification performances of lymphographic data, without sampling

| Classifier | Performance index | Without FS | PSO + FS | IGR + FS | SU + FS |
|---|---|---|---|---|---|
| RF | Precision | 0.843 | 0.826 | 0.781 | 0.777 |
| RF | MCC | 0.712 | 0.675 | 0.591 | 0.568 |
| RF | ROC | 0.935 | 0.924 | 0.906 | 0.890 |
| RF | Kappa error | 0.7105 | 0.6676 | 0.5942 | 0.5811 |
| k-NN | Precision | 0.739 | 0.802 | 0.766 | 0.731 |
| k-NN | MCC | 0.505 | 0.629 | 0.557 | 0.494 |
| k-NN | ROC | 0.754 | 0.811 | 0.783 | 0.755 |
| k-NN | Kappa error | 0.5069 | 0.6322 | 0.5596 | 0.4893 |
| MLP | Precision | 0.813 | 0.795 | 0.760 | 0.801 |
| MLP | MCC | 0.653 | 0.626 | 0.536 | 0.622 |
| MLP | ROC | 0.914 | 0.893 | 0.866 | 0.869 |
| MLP | Kappa error | 0.6626 | 0.6205 | 0.5282 | 0.6075 |
| C4.5 | Precision | 0.774 | 0.726 | 0.738 | 0.754 |
| C4.5 | MCC | 0.583 | 0.491 | 0.491 | 0.541 |
| C4.5 | ROC | 0.785 | 0.740 | 0.756 | 0.817 |
| C4.5 | Kappa error | 0.5874 | 0.4937 | 0.5014 | 0.5503 |

Table 4 describes the class distribution, which clearly shows that the lymphography dataset is imbalanced. A common problem with the
imbalanced data is that the minority class contributes very little to the accuracy of standard algorithms. This unbalanced distribution makes the lymphography dataset suitable for testing the effect of a simple random sampling strategy. We therefore used simple random sampling with replacement to rescale the class distribution of the dataset. The class distributions before and after simple random sampling are given in Table 4. The classification performance of the trained algorithm is tested on the original distribution of the data, i.e., without resampling, using a 10-fold cross-validation scheme. Table 5 shows the final classification results after applying the random sampling strategy to the reduced dataset to balance the number of instances in the minority classes. This step contributes to a more diverse and balanced dataset. As can be seen from Table 5, the highest precision before the feature reduction step, 92.7%, was obtained by the k-NN classifier with 18 features. The proposed method based on RF and PSO obtained 94%, 89.8%, 98.3% and 89.7% for Precision, MCC, ROC and Kappa error, respectively, with 10 features, while the method based on RF and IGR obtained 95.4%, 92.5%, 98.4% and 92.3% for Precision, MCC, ROC and Kappa error, respectively, with 12 features. Table 5 also shows that the other performance indexes support this improvement, with higher values than under the un-sampled classification strategy. We can observe that the proposed RF+PSO model helped in improving the classification performance with a limited number of features. The results demonstrate that these features are fairly competent to represent the dataset’s class information.
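Simple random sampling with replacement, as described above, can be sketched in a few lines. This is only an illustration under stated assumptions: the chapter does not say which tool, sample size or random seed was used, so the function name `resample_with_replacement`, the default sample size (equal to the original size) and the seed are hypothetical. The toy labels mirror the "before sampling" class counts reported in Table 4.

```python
import random
from collections import Counter


def resample_with_replacement(rows, n=None, seed=0):
    """Draw n rows uniformly at random, with replacement.

    n defaults to len(rows); seed is fixed only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    n = len(rows) if n is None else n
    return rng.choices(rows, k=n)


# Toy class labels with the 'before sampling' counts from Table 4.
labels = (["normal"] * 2 + ["metastases"] * 81
          + ["malign"] * 61 + ["fibrosis"] * 4)

sample = resample_with_replacement(labels, seed=42)
print("before:", Counter(labels))
print("after: ", Counter(sample))
```

Because every draw is independent, the class counts in the sample fluctuate around the original proportions, and very small classes can shrink or disappear in a given draw.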
In terms of Precision, MCC, ROC and Cohen’s kappa coefficient, the proposed technique, which deploys random sampling, succeeded in significantly improving the classification accuracy of the minority classes while the classification accuracy of the majority class remained high. The proposed technique shows better results than the un-sampled datasets and than the attribute selection techniques used independently. As the results above show, the proposed method based on RF+PSO produced very promising results in classifying patients with possible lymph diseases.
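For readers who want to reproduce two of the indexes named above, the following sketch computes Cohen’s kappa and the multi-class Matthews correlation coefficient (MCC) from a confusion matrix. It uses the standard textbook definitions and is not necessarily the exact implementation used in the study. The function name `kappa_and_mcc` and the toy confusion matrix are hypothetical.

```python
import math


def kappa_and_mcc(cm):
    """cm[i][j] = number of instances of true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums
    chance = sum(t * p for t, p in zip(true_counts, pred_counts))

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = correct / total
    p_e = chance / total ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Multi-class MCC (reduces to the usual binary formula for two classes).
    denom = math.sqrt((total ** 2 - sum(p * p for p in pred_counts))
                      * (total ** 2 - sum(t * t for t in true_counts)))
    mcc = (correct * total - chance) / denom if denom else 0.0
    return kappa, mcc


# Hypothetical two-class confusion matrix, not taken from the chapter.
toy_cm = [[40, 10],
          [5, 45]]
kappa, mcc = kappa_and_mcc(toy_cm)
print(round(kappa, 4), round(mcc, 4))
```

Both measures discount agreement that would be expected by chance, which is why they are more informative than raw accuracy on an imbalanced dataset.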
Table 4: Class distribution of the Lymphographic dataset before and after simple random sampling

| Index | Class | Class distribution before sampling | Class distribution after sampling |
|---|---|---|---|
| 1 | Normal | 2 | 1 |
| 2 | Metastases | 81 | 74 |
| 3 | Malign | 61 | 69 |
| 4 | Fibrosis | 4 | 4 |
Table 5: Classification performance of Lymphographic dataset, with sampling

| Classifier | Performance index | Without FS | PSO + FS | IGR + FS | SU + FS |
|---|---|---|---|---|---|
| RF | Precision | 0.907 | 0.940 | 0.954 | 0.886 |
| RF | MCC | 0.833 | 0.898 | 0.925 | 0.790 |
| RF | ROC | 0.946 | 0.983 | 0.984 | 0.964 |
| RF | Kappa error | 0.8328 | 0.8972 | 0.9229 | 0.7954 |
| k-NN | Precision | 0.927 | 0.940 | 0.926 | 0.880 |
| k-NN | MCC | 0.872 | 0.898 | 0.865 | 0.774 |
| k-NN | ROC | 0.936 | 0.947 | 0.947 | 0.917 |
| k-NN | Kappa error | 0.8713 | 0.8972 | 0.8594 | 0.7696 |
| MLP | Precision | 0.907 | 0.935 | 0.935 | 0.853 |
| MLP | MCC | 0.833 | 0.883 | 0.883 | 0.721 |
| MLP | ROC | 0.946 | 0.952 | 0.950 | 0.924 |
| MLP | Kappa error | 0.8328 | 0.8847 | 0.8847 | 0.7185 |
| C4.5 | Precision | 0.888 | 0.841 | 0.827 | 0.774 |
| C4.5 | MCC | 0.792 | 0.700 | 0.673 | 0.557 |
| C4.5 | ROC | 0.910 | 0.900 | 0.873 | 0.855 |
| C4.5 | Kappa error | 0.795 | 0.7052 | 0.6798 | 0.5562 |
CONCLUSION
The main goal of medical data mining is to extract hidden information from medical data using data mining techniques, and one of its benefits is that it supports the analysis of these data. The accuracy of the classification algorithms used in disease diagnosis is therefore an essential issue. In this article, a random forest classifier approach was investigated to improve the diagnosis of lymph diseases. The proposed RF + PSO model improved accuracy and achieved promising results. The experiments showed that the PSO feature selection technique helped in reducing the feature space, whereas adjusting the original data with simple random sampling helped in enlarging the region of the minority class, which addresses the imbalance in the data. Future work will apply the proposed technique to other medical diagnosis problems.
ACKNOWLEDGEMENTS
This work was supported by the Public Authority for Applied Education and Training in Kuwait (grant number BS-14-02). The author gratefully acknowledges the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) for providing the lymphographic dataset.
266
Computing and Random Numbers
REFERENCES 1.
Ciosa, K.J. and Mooree, G.W. (2002) Uniqueness o Medical Data Mining. Artificial Intelligence in Medicine, 26, l-24. http://dx.doi. org/10.1016/s0933-3657(02)00049-0 2. Ceusters, W. (2000) Medical Natural Language Understanding as a Supporting Technology for Data Mining in Healthcare Medical Data Mining and Knowledge Discovery. Cios KJ Editor, Heidelberg: Springer, pp. 32-60,. 3. Calle-Alonso, F., Pérez, C.J., Arias-Nicolás, J.P. and Martín, J. (2012) Computer-Aided Diagnosis System: A Bayesian Hybrid Classification Method. Computer Methods and Programs in Biomedicine, 112, 104113. http://dx.doi.org/10.1016/j.cmpb.2013.05.029 4. Cselényi, Z. (2005) Mapping the Dimensionality Density and Topology of Data: The Growing Adaptive Neural Gas. Computer Methods and Programs in Biomedicine, 78, 141-156. http://dx.doi.org/10.1016/j. cmpb.2005.02.001 5. Huang, S.H., Wulsin, L.R., Li, H. and Guo, J. (2009) Dimensionality Reduction for Knowledge Discovery in Medical Claims Database: Application to Antidepressant Medication Utilization Study. Computer Methods and Programs in Biomedicine, 93, 115-123. http://dx.doi. org/10.1016/j.cmpb.2008.08.002 6. Luukka, P. (2011) Feature Selection Using Fuzzy Entropy Measures with Similarity Classifier. Expert Systems with Applications, 38, 46004607. http://dx.doi.org/10.1016/j.eswa.2010.09.133 7. Luciani, A., Itti, E., Rahmouni, A., Michel Meignan, M. and Clement, O. (2006) Lymph Node Imaging: Basic Principles. European Journal of Radiology, 58, 338-344. http://dx.doi.org/10.1016/j.ejrad.2005.12.038 8. Sharma, R., Wendt, J.A., Rasmussen, J.C., Adams, A.E., Marshall, M.V. and Sevick-Muraca, E.M. (2008) New Horizons for Imaging Lymphatic Function. Annals of the New York Academy of Sciences, 1131, 13-36. http://dx.doi.org/10.1196/annals.1413.002 9. Guermazi, A., Brice, P., Hennequin, C. and Sarfati, E. (2003) Lymphography: An Old Technique Retains Its Usefulness. RadioGraphics, 23, 1541-1558. http://dx.doi.org/10.1148/ rg.236035704 10. 
Cancer Research UK. http://www.cancerresearchuk.org
Lymph Diseases Prediction Using Random Forest and Particle ...
267
11. Karabulut, E.M., Özel, S.A. and Íbrikci, T. (2012) A Comparative Study on the Effect of Feature Selection on Classification Accuracy. Procedia Technology, 1, 323-327. http://dx.doi.org/10.1016/j. protcy.2012.02.068 12. Derrac, J., Cornelis, C., García, S. and Herrera, F. (2012) Enhancing Evolutionary Instance Selection Algorithms by Means of Fuzzy Rough Set Based Feature Selection. Information Sciences, 186, 73-92. http:// dx.doi.org/10.1016/j.ins.2011.09.027 13. Madden, M.G. (2009) On the Classification Performance of TAN and General Bayesian Networks. Knowledge-Based Systems, 22, 489495. http://dx.doi.org/10.1016/j.knosys.2008.10.006 14. De Falco, I. (2013) Differential Evolution for Automatic Rule Extraction from Medical Databases. Applied Soft Computing, 13, 1265-1283. http://dx.doi.org/10.1016/j.asoc.2012.10.022 15. Abellán, J. and Masegosa, A.R. (2012) Bagging Schemes on the Presence of Class Noise in Classification. Expert Systems with Applications, 39, 6827-6837. http://dx.doi.org/10.1016/j.eswa.2012.01.013 16. Kennedy, J. and Eberhart, R.C. (2001) Swarm Intelligence. Morgan Kaufmann Publishers, Burlington. 17. Kennedy, J. and Eberhart, R.C. (1997) A Discrete Binary Version of the Particle Swarm Algorithm. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 5, 4104-4108. http:// dx.doi.org/10.1109/icsmc.1997.637339 18. Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Burlington. 19. Fayyad, U. and Irani, K. (1993) Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, 28 August-3 September 1993, 1022-1027. 20. Liu, H., Hussain, F., Tan, C. and Dash, M. (2002) Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393423. http://dx.doi.org/10.1023/A:1016304305535 21. Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. 22. 
Dietterich, T.G. (2000) An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting,
268
23. 24.
25.
26.
27.
28.
29.
30.
Computing and Random Numbers
and Randomization. Machine Learning, 40, 139-157. http://dx.doi. org/10.1023/A:1007607513941 Breiman, L. (1996) Bagging Predictors. Machine Learning, 24, 123140. Weiss, G. and Provost, F. (2003) Learning when Training Data Are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19, 315-354. Park, B.-H., Ostrouchov, G., Samatova, N.F. and Geist, A. (2004) Reservoir-Based Random Sampling with Replacement from Data Stream. Proceedings of the 2004 SIAM International Conference on Data Mining, 22-24 April 2004, Lake Buena Vista, 492-496. http:// dx.doi.org/10.1137/1.9781611972740.53 Mitra, S.K. and Pathak, P.K. (1984) The Nature of Simple Random Sampling. The Annals of Statistics, 12, 1536-1542. http://dx.doi. org/10.1214/aos/1176346810 Schumacher, M., Hollander, N. and Sauerbrei, W. (1997) Resampling and Cross-Validation Techniques: A Tool to Reduce bias Caused by Model Building? Statistics in Medicine, 16, 2813-2827. http:// dx.doi.org/10.1002/(SICI)1097-0258(19971230)16:243.0.CO;2-Z Ben-David, A. (2008) Comparison of Classification Accuracy Using Cohen’s Weighted Kappa. Expert Systems with Applications, 34, 825832. http://dx.doi.org/10.1016/j.eswa.2006.10.022 Cestnik, G., Konenenko, I. and Bratko, I. (1987) Assistant-86: A Knowledge-Elicitation Tool for Sophisticated Users. In: Bratko, I. and Lavrac, N., Eds., Progress in Machine Learning, Sigma Press, Wilmslow, 31-45. UCI (2016) Machine Learning Repository. http://archive.ics.uci.edu/ ml/index.html
CHAPTER 13
CHALLENGES OF INTERNAL AND EXTERNAL VARIABLES OF CONSUMER BEHAVIOUR TOWARDS MOBILE COMMERCE
Arif Sari1, Pelin Bayram2
1 Department of Management Information Systems, Girne American University, Kyrenia, Cyprus
2 Department of Business Management, Girne American University, Kyrenia, Cyprus

Citation: Sari, A. and Bayram, P. (2015) “Challenges of Internal and External Variables of Consumer Behaviour towards Mobile Commerce”. International Journal of Communications, Network and System Sciences, 8, 578-596. doi: 10.4236/ijcns.2015.813052.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0

ABSTRACT
Mobile commerce (m-commerce) has become a very powerful tool in competitive business markets. Companies have started to use this technology to attract customers and catch their attention. The use of mobile commerce applications has spread across different countries and become very popular. Different communication protocols and security techniques have been designed for business use of m-commerce. Mobile commerce, like e-commerce, has brought significant change to the market. People have started to use this technology because of the freedom to conduct transactions anywhere and at any time. However, consumers face many difficulties while using this technology, some consumer-based and some service-provider-based. This research exposes the impact of the determinants that influence mobile commerce application users’ attitudes by classifying and investigating the internal and external variables in a case study of Cyprus Research Centre.
Keywords: M-Commerce, internal variables, external variables, consumer behaviour
INTRODUCTION
Technology never stops advancing: as our needs increase, new innovations emerge that make things easier and more efficient. Mobile commerce (M-Commerce), like its predecessor electronic commerce, enhances business strategies while remaining easy for users to adopt. M-commerce involves electronic transactions between buyers and sellers using mobile communications devices such as cell phones, personal digital assistants (PDAs), or laptop computers (Miller, 2010). This form of commerce is flexible, highly mobile, and extremely versatile, making it a popular model for some businesses, including companies that do business only electronically, as well as for consumers.
In many countries, M-Commerce has become a very popular and powerful tool for communication and data exchange. People use this technology for private and corporate purposes, and different mobile commerce applications serve different sectors. Today’s technology-savvy people appreciate this technology because of the significant difference it makes. It has brought many advantages into our lives from many different perspectives. Beyond music, video and chat, M-commerce has brought significant change to transactions at train stations, cinemas, parking places and so on. Payments, admission to events, interactive services and information services such as weather forecasts, traffic information and exchange rates are some of the advantages that mobile commerce provides to consumers anytime and anywhere.
However, consumers face many obstacles while using this technology. Consumer behaviour and attitudes towards mobile commerce change rapidly for several reasons related to these obstacles, which may be environmental or consumer related. This study concentrated on a particular region and conducted a survey to analyze consumer behaviour towards mobile commerce applications.
The paper is structured as follows. Section 2 discusses mobile commerce and the challenges of mobile commerce. Section 3 explains the relationship between M-Commerce and the challenges of M-Commerce in different subsections, covering customer loyalty and trust issues in M-commerce activities. The research model formulation, hypothesis testing, sampling, data collection and questionnaire design are explained in Section 3. The findings of the research are presented in Section 4 with different frequencies. Section 5 elaborates on the results of the hypothesis testing and concludes the research.
MOBILE COMMERCE AND CHALLENGES OF MOBILE COMMERCE
Competition between businesses and the popularity of mobile technology have both increased rapidly in the last decade. Because mobile technology eliminates distances, removes communication barriers and provides 24/7 service to everyone, it has become very important for software and system development companies. Customer-centred organizations give primary importance to their customers. To help customers adopt this new technology, companies try to develop new resources and solutions such as easy-to-use programs, user-friendly interfaces and stronger security for particular transactions. Today, the mobile Internet is emerging even faster, in part because service providers, content partners, customers and investors in these markets are leveraging lessons both nationally and globally and have made significant advances that enable next-generation data or wireless Web services and mobile commerce (m-Commerce). Researchers have broadly defined m-Commerce as an emerging set of applications and services that people can access from their Web-enabled mobile devices [1]. As stated before, this technology differs from other technologies in providing 24/7 service to everyone. Technologies of this type can bring many advantages to
a country’s economy and welfare. As a developing technology, however, m-Commerce faces many difficulties and obstacles as an emerging market, particularly in Small Island Developing States (SIDS). According to the literature, SIDS are coastal countries that share similar sustainable development challenges, including small but growing populations, limited resources and high sensitivity to natural effects and changes [2]-[5]. They are also vulnerable to external shocks such as crises. Cyprus is one of these countries. For example, lack of standards, incomplete infrastructure, and cost and speed issues are the main obstacles to m-Commerce in Cyprus. Because these external factors affect the applicability and usability of m-Commerce, consumer behaviour and perceptions towards m-Commerce become negative. Some developed countries face similar difficulties, such as lack of standards, high mobile telecommunication costs and low speed. In the US, for example, a survey in the literature suggested that consumers are not convinced they want or need mobile services, and many think they are simply too complicated [6]. This is in contrast to other global markets in Asia and Europe, where “going online” means reaching for a mobile handset, not turning on a PC. In Korea, for example, reports suggest that one-third of all mobile phone subscribers use their handsets for m-Commerce activities [7]. This illustrates the popularity of m-Commerce services and the acceptance of mobile usage. As mentioned above, mobile service providers are working on the usability and applicability of m-Commerce services. However, new developments by mobile service providers alone may not have a significant effect on customer perceptions.
External and internal factors may influence consumer behaviour and perceptions towards m-Commerce services. The internal variables can be demographic or psychographic, and the external variables can be classified as social, cultural and technological. All these factors are important in getting consumers to adopt m-Commerce services. The use of m-Commerce services has great benefits for local companies and businesses. The 24/7 availability of services lets customers conduct transactions anywhere and anytime, and the processing power gives businesses an opportunity to offer services nationally as well as globally. Several large companies are abandoning or scaling back country-based wireless efforts to focus on global markets. Nevertheless, carriers and content partners are still investing, and bright spots exist. Several companies have started to provide services that are compatible and accessible for
mobile devices. eBay recently launched a new service that lets customers bid more easily from a mobile device. According to a Yankee Group report, the new service has the right success factors: the right price, speed, and ease of use. Providing compatible, accessible, faster and more reliable services for consumer mobile devices has brought a different dimension to business markets. Like e-Commerce, m-Commerce represents a huge opportunity for businesses to connect to consumers through these mobile devices at any time. While a set of issues warrants attention, we focus on an area that has been largely neglected: applicability, usability and security. Many activities compete for a user’s attention on the Web. Different mobile services send different information such as news, stories, weather information, exchange rates, stock price alerts and e-mail notifications. With wired e-commerce, the environment outside the Web is fairly stable from day to day. Places such as offices, workshops and homes function with a good degree of predictability even if they experience a huge amount of activity, and a relatively consistent amount of attention can be devoted to performing tasks on the computer. In the m-Commerce world, conversely, a significant number of additional people, objects and activities may vie for a user’s attention aside from the application itself. The amount of attention a consumer can give to a mobile application will vary over time, and a user’s priorities can also change unpredictably. For that reason, the circumstances under which m-Commerce applications and services are used can be significantly different from those of their desktop e-Commerce counterparts. Moreover, in the m-Commerce environment, consumers and applications must deal with a variety of separate devices, such as phones, handhelds and telematics units, which continue to shrink in size and weight.
While this brings high device portability, the usability of the devices may suffer. Traditional mice and keyboards are being replaced with buttons and small keypads, but smaller screens are difficult to read and smaller devices are difficult to use with only one hand. The use of mobile devices is also affected by external factors such as noise level, weather and brightness. Difficulties in using mobile phones, such as the lack of ease of use and of user-friendly interfaces, translate into wasted time or user frustration.
Researchers have also stated that security is another serious challenge in m-Commerce environments [8]. There are potential benefits in storing sensitive data, including medical, personal and financial information, on mobile devices for use by m-Commerce applications. But the vulnerability and mobility of these devices increase the risk of losing the device and its data. Moreover, the risk of data access by unauthorized parties makes positive user identification a priority. Safety issues also arise as user activities vary. Precautions have been implemented to prevent unauthorized access, and sometimes these precautions make m-Commerce services inapplicable in a particular environment or area. For example, when designing m-Commerce systems for automobiles, serious consequences can result if the application distracts the driver from the traffic and diverts too much attention from the primary task of driving. Web access or the use of m-Commerce applications in automobiles thus creates potential problems associated with browsing while driving. Such problems may be solved by designing minimal-attention interfaces. As mentioned above, the dynamic environment may cause problems because of its instability, and context-aware systems may warn consumers about particular events. New designs and flexible input/output systems may be developed to provide ease of use and to overcome mobile device limitations. Lack of data sharing and data security may cause loss of customer trust in m-Commerce services. To create customer trust in these services, security must be strengthened through new biometric security systems, commonsense design and legislation covering the entire system. Researchers have described the potential challenges of m-Commerce in previous studies.
Table 1 lists some of these potential m-Commerce challenges and their potential solutions [9]. As shown in Table 1, M-Commerce faces a variety of challenges, each with a potential solution. One of the main challenges is that, while there is significant demand in the market, little attention has been paid to developing m-commerce application interfaces. For the security and safety issues that the dynamic environment makes more pressing, solutions such as biometrics and commonsense design with legislation have been proposed. Before mobile commerce (M-Commerce) can be discussed, it is necessary to discuss e-commerce first. For that reason, the previous section described commerce briefly in order to clearly
define e-Commerce, which is also known as “Electronic Commerce”. To understand the importance and role of e-commerce, we must differentiate e-commerce and m-commerce from each other. E-commerce has been explained as a monetary transaction conducted using the combination of the Internet and a desktop or laptop computer [10]. An Internet connection and a computer are therefore required for e-commerce. Because m-commerce is related to e-commerce, the same or similar tools are required, and an Internet connection is again compulsory. Mobile commerce has many similarities with e-commerce; it is a more developed, technology-based member of the commerce family. Once a wireless or other Internet connection device is part of the system and gives clients freedom of movement, the commerce becomes “Mobile Commerce”. A basic milestone in the development of m-commerce was Wi-Fi (Wireless Fidelity). For that reason, researchers have defined mobile commerce as any transaction using a wireless device that results in the transfer of monetary value in exchange for information, goods or services [11]. This definition is very similar to the e-commerce definition in [10]. The role of the computer or laptop is taken over entirely by mobile devices such as PDAs or mobile phones. The medium for data transmission is the same, the Internet, but the connecting devices are no longer switches or hubs; they are wireless telecommunication network devices, wireless hubs or wireless antennas that allow users to connect to the Internet anywhere and anytime. E-commerce websites are designed for e-commerce and are accessed through the combination of a computer and the Internet.
Clients accessed these websites through the Internet and a computer and conducted transactions. In the case of m-commerce, these websites are designed and coded according to compatibility standards, so clients access them from mobile devices and carry out the same processes as in e-commerce. When a client accesses a website and conducts a transaction from a mobile device over a wireless connection, the process is called mobile commerce. Mobile commerce has also been defined in a book as “a monetary transaction for goods and services conducted by a mobile device, an operating system specific to mobile devices and a mobile-dedicated infrastructure” [12].
Clients use mobile commerce applications developed by software developers and connect to the Internet through their GSM operators. They browse company websites using software developed for mobile devices. So far in this section we have explained e-commerce and m-commerce. As mentioned above, m-commerce is an extension, or a more developed and technological form, of e-commerce. There are nevertheless basic differences between m-commerce and e-commerce: the communication protocols used for transactions, the types of Internet connection and connection media, the key enabling technologies, and the development languages.

Table 1: M-Commerce challenges and potential solutions [9]

| Challenges | Potential Solutions |
|---|---|
| Increased demands on attention | Minimal-attention interfaces |
| Dynamic environment | Context awareness |
| Mobile device limitations and usability | New and flexible I/O modalities |
| Security | Biometrics |
| Safety | Commonsense design and legislation |
| Social concerns | Societal norms and written laws |
For example, while m-commerce gives clients freedom of movement, it requires the Wireless Application Protocol (WAP), a key enabling technology for m-commerce. Clients access the Internet and browse pages through WAP, and all data and packet transmission over the wireless medium is provided through WAP. WAP thus becomes a communication standard for m-commerce. In e-commerce, WAP is not needed to connect to the Internet. Instead, the computer uses the Hypertext Transfer Protocol (HTTP), and pages are formatted in Hypertext Markup Language (HTML) so that they can be displayed in an Internet browser. WAP, on the other hand, requires the Wireless Markup Language (WML) to standardize the formatting of pages and display them on mobile devices.
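The difference between the two markup languages can be illustrated with a small sketch that renders the same content either as a WML deck or as an HTML page. Everything here is an assumption made for illustration: the function name `render_page`, the sample content, and the choice to select the format from the client’s `Accept` header are not described in the chapter. The WML media type and document type declaration follow the WML 1.1 conventions.

```python
from html import escape

WML_MIME = "text/vnd.wap.wml"  # media type used for WML decks


def render_page(title, body, accept_header=""):
    """Return (content_type, markup), using WML when the client asks for it."""
    title, body = escape(title), escape(body)
    if WML_MIME in accept_header:
        # A WML document is a "deck" made of one or more "cards".
        markup = (
            '<?xml version="1.0"?>\n'
            '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" '
            '"http://www.wapforum.org/DTD/wml_1.1.xml">\n'
            f'<wml><card id="main" title="{title}"><p>{body}</p></card></wml>'
        )
        return WML_MIME, markup
    # Desktop browsers receive an ordinary HTML page.
    markup = (f"<html><head><title>{title}</title></head>"
              f"<body><p>{body}</p></body></html>")
    return "text/html", markup


ctype, page = render_page("Exchange rates", "EUR & USD", "text/vnd.wap.wml")
print(ctype)
print(page)
```

The same content is wrapped in a card for the mobile client and in a regular page for the desktop client, which mirrors the distinction drawn above between WML for mobile devices and HTML for computers.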
Customer Loyalty and Trust in Mobile Commerce

The concept of trust has been studied in different disciplines ranging from business to psychology to medicine, and perspectives on it differ, but it can be loosely defined as “a state involving confident positive expectations about
Challenges of Internal and External Variables of Consumer Behaviour...
another’s motives with respect to oneself in situations entailing risk” [13]. Business relationships would not exist without trust, which is expressed in business contexts such as laws, contracts and regulations, as well as in company policy, personal reputations and long-term relationships. Once a business gains a customer’s trust, a long-term relationship forms between them, and over time that relationship turns into loyalty. Not surprisingly, studies show that trust also plays an essential role in successful Internet retailing [14]. Gaining customer trust in m-Commerce is a frustrating process, extending from initial trust formation to continuous trust development, but it can be done. Studies show that nearly all customers refuse to provide personal information to a Web site at one time or another, a majority because they lack trust in the site [15]. Most of these customers are still not comfortable with Web-based developments and the electronic medium itself. They doubt that e-Commerce can satisfy consumer needs left unfulfilled in the bricks-and-mortar business world, and they wonder whether e-Commerce is technologically feasible and reliable. From this uncertainty, it is a small step for customers to doubt the integrity of Internet vendors. Without social cues and personal interaction, such as body language, linguistics, the observation of other buyers, and the ability to feel, touch and inspect products directly, customers can perceive online business and transactions as riskier in nature. According to researchers [16], gaining consumer trust in the m-Commerce world, which uses radio-based wireless devices to conduct business transactions over the Web-based e-commerce system, is a particularly frustrating task because of its unique features.
As noted in the discussion of m-Commerce challenges above, mobile devices are very convenient for anytime shopping and offer various advantages to both consumers and businesses. But their small screens, low-resolution displays and tiny multifunction keypads make developing user-friendly interfaces and graphical applications a challenge. Mobile handsets are also comparatively limited in computational power, memory and battery life. Because m-Commerce runs over wireless networks, the major limitations of these networks must be taken into consideration: they have difficulty providing high bandwidth, their connections are less stable, and their behaviour is less predictable. In addition, operating costs are relatively high, standardized protocols are lacking, and data transmitted wirelessly is more vulnerable to eavesdropping.
Various factors may influence the complex process of engendering customer trust in Internet shopping. These are the variables that influence consumer behaviour: internal (demographic and psychographic) and external (social, cultural and technological). Consumer characteristics such as need, motivation, capacity and willingness, along with seller characteristics such as ability, benevolence and integrity, all play a role in Internet purchasing and m-Commerce usage behaviour [17]. Customer perceptions of security and privacy control, integrity and competence, as well as third-party recognition and the legal framework, are important antecedents of trust in Internet shopping [18]. Several other factors may influence customer trust and trigger Internet shopping, such as personal experience, familiarity, affiliation and belonging, transparency, factual signals and heuristic cues. Table 2 lists some of the factors that may influence consumer behaviour towards m-Commerce and Internet shopping.
RESEARCH MODEL FORMULATION

Formulating hypotheses requires determining the variables involved. The determinants that affect hypothesis testing should be classified as internal or external and explained.

Table 2: Factors influencing consumer behaviour towards internet shopping (columns are categories, rows list the factors in each)

| Consumer Characteristics | Seller Characteristics | Consumer Perceptions | Consumer Perceptions for Corporate Branding |
|---|---|---|---|
| Need | Ability | Security | Personal Experience |
| Motivation | Benevolence | Privacy Control | Familiarity |
| Capacity | Integrity | Integrity | Affiliation and Belonging |
| Willingness | | Competence | Transparency |
| | | Third-party recognition | Factual Signals and Heuristic cues |
| | | Legal framework | |
This section explains the formulation of the research model in detail. Figure 1 shows the general research model. The determinants that affect mobile commerce customers are divided into two categories of variables: internal and external.
As shown below, internal variables are classified as demographic and psychographic, and external variables as social, cultural and technological. In this study, two hypotheses are formulated on the impact of internal and external variables on mobile commerce. These variables are discussed in later sections of this study.
Formulation of Research Objectives and Hypothesis

To identify the impact of internal variables (demographic and psychographic variables) on consumer behaviour towards mobile commerce:
H1: There is a significant impact of internal variables (demographic and psychographic variables) on consumer behaviour towards mobile commerce.

To identify the impact of external variables (social, cultural and technology variables) on consumer behaviour towards mobile commerce:
H2: There is a significant impact of external variables (social, cultural and technology variables) on consumer behaviour towards mobile commerce.

There are four types of variables: dependent, independent, moderating and intervening. Figure 2 shows how the variables used in this research were classified during the formulation of the hypotheses.

Psychographic variables refer to attributes related to personality, lifestyle, values, interests or attitudes. These factors capture various influences on a person’s buying behaviour. Lifestyle choices such as parenting, exercise decisions, religion, marriage or health can greatly affect a person’s requirements or preferences for certain products or services.

Technological variables measure and categorise consumers based on their ownership of, use patterns of and attitudes towards different technologies. A concrete example is people’s attitude towards the Internet. There are distinct differences between frequent Internet users and those who seldom use it. Most experienced Internet users are more affluent and tend to be more optimistic about modern technology than those who are not, as the number of online shoppers shows. Confident online shoppers are those who have been using the Web for quite some time, which makes them feel safer than new Web users.
The behavioural variable of market segmentation groups consumers in terms of occasions, usage, loyalty and benefits sought. This is based on the way different consumers respond to, use or know a product or service. The variable of occasion simply means the occasion on which a product or service is consumed or purchased.
Range of Study and the Sample Selection

This study focuses on the students of the Research Centre in Cyprus. Students from different research departments participated in this project. A total of 100 students participated, out of a total population of approximately 971 at the research centre.
Figure 1: Research model formulation.
Figure 2: Dependent and independent variables classifications.
Data Collection

Two different sources are used to accomplish the objectives of this study: primary sources and secondary sources.
The Primary Sources

The primary source for this research is the information collected through questionnaires. This source provided statistics that measure the students’ behaviour towards mobile commerce applications.
Secondary Sources

The secondary sources are references, published books, journals and information available on the internet. Additionally, some data were obtained from particular websites.
Sample Plan

The sample units were selected using the following criteria, all of which a unit had to meet to be selected for the study:
・ Should be a student of Cyprus Research Centre.
・ Should be registered in one of the programs at Cyprus Research Centre.
・ Should be aware of mobile commerce applications.
・ Should be able to use mobile commerce applications.
・ Should have a mobile device that is compatible with mobile commerce applications.
Sampling Technique

Simple random sampling was used to select target respondents for this study. The students of the Research Centre were informed of the purpose of the study, and willing students participated. The respondents for the survey were 100 individuals randomly selected from the Research Centre. The sample includes only students, including those who also work outside the centre.
Sample Size

The sample size used for this research study is one hundred (N = 100).
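As an illustration of the sampling step, the sketch below draws a simple random sample of 100 from a population of 971 without replacement. The roster of numbered students, the function name and the fixed seed are assumptions made for the example; the study does not describe how its random selection was carried out.

```python
import random

POPULATION_SIZE = 971  # approximate population reported in the study
SAMPLE_SIZE = 100      # N = 100


def draw_simple_random_sample(population_ids, sample_size, seed=None):
    """Draw a sample without replacement; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population_ids, sample_size)


# Hypothetical roster: students numbered 1..971.
student_ids = list(range(1, POPULATION_SIZE + 1))
sample = draw_simple_random_sample(student_ids, SAMPLE_SIZE, seed=42)

sampling_fraction = SAMPLE_SIZE / POPULATION_SIZE
print(f"Sampled {len(sample)} of {POPULATION_SIZE} students "
      f"({sampling_fraction:.1%} of the population)")
```

With these figures the sample covers roughly one student in ten.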
Research Tools and Collection Instrument

The instrument used for collecting the primary data is a structured questionnaire. Selected questions were asked and respondents ticked the appropriate answers. An open-ended questionnaire was used for the pilot study, and its results were used to formulate the closed-ended structured questionnaire.
The Questionnaires

The structured questionnaire was hand-delivered to the respondents by the researcher of this study, to ensure that the respondents understood the questions before ticking the appropriate answers. The structured questionnaire was also translated into Turkish for easy understanding by the respondents. Questionnaires are one of the most widely used social research techniques. Here they are used to gather information from subjects in order to explore the impact of determinants on consumer behaviour towards mobile commerce and its applications. Researchers stated that “Questionnaire is one of the most widely used survey data collection techniques, because each respondent is asked to response to the same set of questions, it provides an efficient way of collecting responses from a large sample prior to quantitative analysis” [19]. There are two main types of survey: questionnaires and interviews. The method selected for this research was the questionnaire. Questionnaires are classified into two types: those designed for self-completion, in which respondents complete the questionnaire themselves, and those designed for assisted completion, in which the researcher asks the questions and fills in the questionnaire [20].
The Design of the Questionnaire

The contents of the questionnaire were mainly derived from the literature review and from research and studies conducted in this field. The questionnaire was developed in English but, because of the student profile at the Research Centre of Cyprus, it was also translated into Turkish, as stated before. This brought two main benefits. First, it saved time that would otherwise be spent with respondents translating and explaining the questionnaire items. Second, it guaranteed the highest level
of understanding of the questionnaire items and the ideal amount of freedom for answering.
The Questionnaire Contents

The questionnaire contains 10 questions in total. The first five questions were designed to gather general information about the respondents: age, sex, nationality, occupation and monthly income. Questions 6 to 8 concern the mobile commerce transactions that respondents conduct and how often they use mobile commerce; question 6 in particular measures the usage of particular types of mobile commerce transactions. The last two questions were designed to examine the consumers’ satisfaction with and perceptions of mobile commerce applications and their recommendations about mobile commerce. Question 9 uses a five-point scale and measures the replies in five ranks. It measures people’s actual reasons for using m-commerce, and it also shows the personality, perceptions and behaviour of consumers towards mobile commerce applications. More details about this question are given in the “Findings” section of this study.

The questionnaire was designed for Bachelor students studying in any faculty or department of Cyprus Research Centre. A single questionnaire was given to each respondent, containing questions about his or her perceptions and the particular behaviour that determines the factors influencing mobile commerce usage. The questions determine the demographic distribution as well as the perceptions and behaviours of the respondents:

・ Question 1: age of the respondent (Below 20, 21 - 30, 31 - 40, 41 - 50, Above 50 years).
・ Question 2: gender of the respondent (Male or Female).
・ Question 3: nationality of the respondent (Turkish Cypriot, Turkish, British, Others).
・ Question 4: occupation of the respondent (Student, Private, Business, Housewife).
・ Question 5: monthly income of the respondent in Turkish Liras (TL) (Below 1000, 1001 - 2000, 2001 - 3000, 3001 - 4000, Above 4000).
・ Question 6: types of mobile commerce transactions, aiming to find the transactions consumers conduct most frequently.
・ Question 7: how often the consumer conducted mobile commerce transactions in the last 12 months (Once, 2 - 4 times, 5 - 10 times, More than 10 times).
・ Question 8: amount of money spent on m-commerce transactions in Turkish Lira (Below 50, 51 - 100, 101 - 150, 151 - 200, Above 200).
・ Question 9: a five-point Likert scale on which respondents rate their reasons for using mobile commerce services, from “highly agree” (highest) to “highly disagree” (lowest).
・ Question 10: whether the respondent would recommend mobile commerce to relatives and friends (Definitely Recommend, Somewhat Recommend, No Comments, Do Not Recommend, Not at All).
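The closed questions described above can be encoded as a small lookup table, which makes it easy to check a completed questionnaire for missing or unrecognised answers. This is only a sketch: the dictionary layout and the `invalid_answers` helper are assumptions for illustration, and questions 6 and 9 are left out because their options are listed in the findings tables rather than in this section.

```python
# Closed questions and their allowed options, as listed in the text.
# Questions 6 and 9 are omitted here (see the lead-in).
QUESTION_OPTIONS = {
    1: ["Below 20", "21 - 30", "31 - 40", "41 - 50", "Above 50"],
    2: ["Male", "Female"],
    3: ["Turkish Cypriot", "Turkish", "British", "Others"],
    4: ["Student", "Private", "Business", "Housewife"],
    5: ["Below 1000", "1001 - 2000", "2001 - 3000", "3001 - 4000", "Above 4000"],
    7: ["Once", "2 - 4 times", "5 - 10 times", "More than 10 times"],
    8: ["Below 50", "51 - 100", "101 - 150", "151 - 200", "Above 200"],
    10: ["Definitely Recommend", "Somewhat Recommend", "No Comments",
         "Do Not Recommend", "Not at All"],
}


def invalid_answers(response):
    """Return the question numbers whose answer is missing or not allowed."""
    return [
        number
        for number, options in QUESTION_OPTIONS.items()
        if response.get(number) not in options
    ]


# Hypothetical completed questionnaire.
example_response = {
    1: "21 - 30", 2: "Male", 3: "Turkish", 4: "Student",
    5: "Below 1000", 7: "Once", 8: "Below 50", 10: "Definitely Recommend",
}
print(invalid_answers(example_response))
```

A check of this kind corresponds to the screening step reported later, where no questionnaire was rejected as incomplete.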
The Questionnaire Responses

The questionnaire targeted all students in all faculties of the Research Centre, so it was necessary to distribute it to most of the departments of these faculties. 100 questionnaires were distributed and all 100 were collected; none was rejected as incomplete or unfilled.
Limitation of the Study

This research has specific limitations:
・ The data collected were purely based on the knowledge, perception, feelings, attitude and opinions of the target respondents.
・ The research has the limitations that all social science research has.
・ The study was conducted in Cyprus Research Centre and the results cannot be generalized.
FINDINGS

This section highlights and discusses the analysis of the survey. Mobile commerce service providers and application developers keep conducting research to understand customer profiles and analyze their needs in order to create better and more secure mobile applications. But people’s perceptions differ, and behaviour may vary depending on the time, place and corresponding action. The survey in this research shows the different perceptions of people, their actual reasons for using mobile commerce applications, their problems with this technology and their
recommendations. Additionally, each questionnaire was analyzed carefully, and consumers’ viewpoints were measured. People’s spending on mobile commerce and their main reasons for using this technology were among the most interesting outcomes of this study.
General Information about the Sample

This analysis is based on the data collected from 100 respondents. As mentioned before, the response rate was 100% and no unfilled or incomplete questionnaires were received. The 100 respondents came from different faculties and departments of Cyprus Research Centre. The demographic information includes nationality, age, sex, occupation and monthly income.
Age Group

With regard to age, students between 21 and 30 years old represent the largest proportion of the sample, 92% (92 respondents); the remaining 8% (8 respondents) are below 20. No respondent was above 30 years old. Figure 3 and Table 3 show the same age distribution in different ways.

Inference: From Figure 3, Table 3 and the analysis, it can be inferred that the majority of the respondents in this study are between 21 and 30 years old. Only 8% of respondents are in the “Below 20” category.
Gender Ratio

Table 4 shows that 68 male respondents and 32 female respondents participated in this survey. Figure 4 shows the same numbers as a chart.

Inference: From the above analysis, it can be inferred that the majority of the respondents in this study are male.

Table 3: Age frequency of respondents

| Age | Number of Respondents | % of Respondents |
|---|---|---|
| Below 20 | 8 | 8% |
| 21 - 30 | 92 | 92% |
| 31 - 40 | 0 | 0% |
| 41 - 50 | 0 | 0% |
| Above 50 | 0 | 0% |
| Null | 0 | 0% |
| Total | 100 | 100% |
Figure 3: Age frequency of respondents.
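The frequency tables in this section all have the same shape: a count and a percentage per answer category, plus a total row. A minimal sketch of how such a table can be tabulated from raw answers is given below. The `frequency_table` helper is hypothetical, and the raw answers are rebuilt from the counts in Table 3, since the individual questionnaires are not part of the chapter.

```python
from collections import Counter

AGE_CATEGORIES = ["Below 20", "21 - 30", "31 - 40", "41 - 50", "Above 50"]


def frequency_table(answers, categories):
    """Return (category, count, percent) rows followed by a total row."""
    if not answers:
        raise ValueError("no answers to tabulate")
    counts = Counter(answers)
    total = len(answers)
    rows = [
        (category, counts.get(category, 0), 100.0 * counts.get(category, 0) / total)
        for category in categories
    ]
    rows.append(("Total", sum(r[1] for r in rows), sum(r[2] for r in rows)))
    return rows


# Raw answers rebuilt from the counts reported in Table 3.
age_answers = ["Below 20"] * 8 + ["21 - 30"] * 92

age_table = frequency_table(age_answers, AGE_CATEGORIES)
for label, count, percent in age_table:
    print(f"{label:<10} {count:>4} {percent:>6.0f}%")
```

The same helper applies to the gender, nationality, occupation and income questions by swapping the category list.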
Nationality Distribution of Respondents

The nationality distribution of the sample is given in Table 5 and Figure 5. According to Table 5, 13 Turkish Cypriot respondents (13%), 59 Turkish respondents (59%), 0 British respondents (0%) and 28 respondents of other nationalities (28%) participated in this survey.

Inference: According to the data gathered and analyzed above, the majority of participants are Turkish or of other nationalities. Turkish Cypriot citizens are a minority in this survey, and no British citizen participated.

Table 4: Gender ratio of respondents

| Gender | Number of Respondents | % of Respondents |
|---|---|---|
| Male | 68 | 68% |
| Female | 32 | 32% |
| Null | 0 | 0% |
| Total | 100 | 100% |
Figure 4: Gender ratios of respondents.
Figure 5: Nationality distributions of respondents.
Occupation of Respondents

In this analysis, students were found to make up the highest percentage of respondents. Table 6 shows the occupation of the respondents. According to Table 6, 100 students (100%) participated in this research, and no workers or housewives (0%) took part in this survey.
Inference: Figure 6 and Table 6 show the distribution of occupation among participants. According to these results, all participants in this survey were students.
Income Ratios of Respondents

Table 7 shows the monthly income of the respondents who participated in the survey. According to Table 7, all 100 respondents (100%) had a monthly income below 1000 TL, and no respondent (0%) had a monthly income of more than 1000 TL.

Table 5: Nationality distribution of respondents

| Nationality | Number of Respondents | % of Respondents |
|---|---|---|
| Turkish Cypriot | 13 | 13% |
| Turkish | 59 | 59% |
| British | 0 | 0% |
| Others | 28 | 28% |
| Total | 100 | 100% |
Table 6: Rate of occupation in the survey

| Occupation | Number of Respondents | % of Respondents |
|---|---|---|
| Student | 100 | 100% |
| Private | 0 | 0% |
| Business | 0 | 0% |
| Housewife | 0 | 0% |
| Total | 100 | 100% |
Figure 6: Rate of occupation in the survey.

Table 7: Income ratios of respondents

| Income (TL) | Number of Respondents | % of Respondents |
|---|---|---|
| Below 1000 | 100 | 100% |
| 1001 - 2000 | 0 | 0% |
| 2001 - 3000 | 0 | 0% |
| 3001 - 4000 | 0 | 0% |
| Above 4000 | 0 | 0% |
| Total | 100 | 100% |
Figure 7: Income Ratios of respondents.
Inference: Figure 7 and Table 7 show the distribution of income among participants. According to these results, all participants in this survey have a monthly income of less than or equal to 1000 TL.
Types of Mobile Commerce Transactions by Respondents

Table 8 shows the types of m-commerce transactions conducted by the respondents. The majority of respondents prefer interactive services such as chats and games (79%). Most respondents mentioned the “facebook” and “twitter” web addresses when the questionnaire was distributed. Only a few respondents selected ringtones (10%, 10 respondents), music and video content (5%, 5 respondents), screen savers (4%, 4 respondents) or information services (2%, 2 respondents), which shows that they use m-commerce applications and services for these purposes. None of the respondents used mobile commerce services such as admissions to events, parking or others. The respondents also mentioned that they did not use these services either because the services were unavailable or because they did not find them beneficial. Figure 8 shows this distribution clearly.

Inference: The analysis, table and figure above show that the respondents in this survey prefer social activities such as chatting or web forums. This may be related to the respondents’ age group and occupation. On the other hand, the options that respondents did not select should also be considered carefully.
Frequency of Mobile Commerce Transactions by Respondents

Table 9 shows the frequency of m-commerce transactions conducted by respondents in the last 12 months. The largest group of respondents (45%) conducted more than 10 transactions in the last 12 months; 33% of respondents replied 5 - 10 times, 18% replied 2 - 4 times and 4% conducted a transaction just once.

Inference: According to the analysis, the majority of respondents conducted mobile commerce transactions frequently in the last 12 months. However, other respondents did not conduct as many transactions, for reasons they did not state. Figure 9 shows the frequency as a bar chart.

Table 8: Types of m-commerce transactions done by respondents

| Type of Services | Number of Respondents | % of Response |
|---|---|---|
| Ringtones | 10 | 10% |
| Screen Savers | 4 | 4% |
| Music and Video Content | 5 | 5% |
| Interactive services, such as chats, games etc. | 79 | 79% |
| Information services, such as weather forecasts, traffic information etc. | 2 | 2% |
| Admissions to events | 0 | 0% |
| Parking | 0 | 0% |
| Other | 0 | 0% |
| Total | 100 | 100% |
Table 9: Frequency of m-commerce transactions done by respondents for last 12 months

| Frequency | Number of Respondents | % of Respondents |
|---|---|---|
| Once | 4 | 4% |
| 2 - 4 Times | 18 | 18% |
| 5 - 10 Times | 33 | 33% |
| More Than 10 Times | 45 | 45% |
| Total | 100 | 100% |
Figure 8: Types of m-commerce transactions done by respondents.
Amount of Money Spent on Mobile Commerce Applications

Figure 10 and Table 10 show the amount of money respondents spent on mobile commerce applications in the last 12 months. According to these data, 63% (63 respondents) spent less than 50 TL, 28% spent 51 - 100 TL, 9% spent 101 - 150 TL and 0% spent more than 150 TL on mobile commerce transactions.
Figure 9: Frequency of m-commerce transactions done by respondents for last 12 months.
Figure 10: Amount of money spent on mobile commerce applications for last 12 months.

Table 10: Amount of money spent on mobile commerce applications for last 12 months

| Amount of Money (TL) | Number of Respondents | % of Respondents |
|---|---|---|
| Below 50 | 63 | 63% |
| 51 - 100 | 28 | 28% |
| 101 - 150 | 9 | 9% |
| 151 - 200 | 0 | 0% |
| More Than 200 | 0 | 0% |
| Total | 100 | 100% |
Inference: According to the questionnaire analysis, a minority of respondents spend a considerable amount of money on mobile commerce applications, and the majority spend less than 50 TL.
Agreement Level of Respondents

Table 11 shows the agreement levels of respondents with different statements.

Inference: Table 11 shows the agreement level for each stated reason. According to these statements, only the reasons ranked 1st to 4th were rated “Highly Agree”. Accordingly, it can be said that occupation, marital status, the popularity of the technology and people’s characteristics affect the usage of mobile commerce.
Recommendation of Mobile Commerce Transactions by Respondents

Table 12 shows the respondents’ recommendation of mobile commerce transactions to their relatives and friends. According to the data gathered and analyzed, 75% of respondents (75 respondents) answered “definitely recommend”, 12% (12 respondents) answered “somewhat recommend”, 10% (10 respondents) answered “no comments”, and 3% (3 respondents) answered that they do not recommend mobile commerce transactions to their friends and relatives.

Table 11: Agreement level of respondents

| S. No | Statements | Score | S. Score | Rank | Agreement Level |
|---|---|---|---|---|---|
| 1. | I use mobile commerce transactions because I am single | 242 | 12.1 | 17 | Disagree |
| 2. | I use mobile commerce transactions because I am married | 364 | 18.2 | 4 | Highly Agree |
| 3. | I use mobile commerce transactions because I have children | 269 | 13.5 | 16 | Disagree |
| 4. | I use mobile commerce transactions because I am educated | 277 | 13.9 | 13 | Disagree |
| 5. | I use mobile commerce transactions due to my occupation | 370 | 18.5 | 3 | Highly Agree |
| 6. | I use mobile commerce transactions because I have high income | 236 | 11.8 | 18 | Disagree |
| 7. | I use mobile commerce transactions because I like to live a good life style | 226 | 11.3 | 19 | Disagree |
| 8. | I use mobile commerce transactions because I want to show myself different from my friends | 325 | 16.3 | 10 | Agree |
| 9. | I use mobile commerce transactions because I want show myself modern to my friends | 274 | 13.7 | 14 | Agree |
| 10. | I use mobile commerce transactions because everyone in my society uses the same | 341 | 17.05 | 7 | Agree |
| 11. | I use mobile commerce transactions because I want to get appreciation from society | 329 | 16.5 | 8 | Agree |
| 12. | I use mobile commerce transactions because I want the society to respect me | 273 | 13.7 | 15 | Agree |
| 13. | I use mobile commerce transactions because it is suitable to my culture | 284 | 14.2 | 11 | Agree |
| 14. | I use mobile commerce transactions because it is useful for my work culture | 341 | 17.1 | 6 | Agree |
| 15. | I use mobile commerce transactions because it helps me to align with my culture | 282 | 14.1 | 12 | Agree |
| 16. | I use mobile commerce transactions because the technology save my time and money | 356 | 17.8 | 5 | Agree |
| 17. | I use mobile commerce transactions to show I am a tech savvy (lover of technology) | 373 | 18.7 | 2 | Highly Agree |
| 18. | I use mobile commerce transactions because it is the latest technology on commerce | 387 | 19.4 | 1 | Highly Agree |
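The chapter does not say how the “S. Score” column of Table 11 is derived. The figures are consistent with the raw score divided by 20 and rounded to one decimal place, so the sketch below assumes that rule; it is an inference from the numbers, not a method stated by the authors. The code reproduces the S. Score and rank for the five highest-scoring statements; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Raw scores of the five highest-scoring statements in Table 11,
# keyed by statement number.
TOP_SCORES = {2: 364, 5: 370, 16: 356, 17: 373, 18: 387}


def scaled_score(score, divisor=20):
    """Assumed rule: raw score / 20, rounded half up to one decimal."""
    value = Decimal(score) / Decimal(divisor)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def rank_statements(scores):
    """Rank statements from highest to lowest raw score (1 = highest)."""
    ordered = sorted(scores, key=lambda number: scores[number], reverse=True)
    return {number: position for position, number in enumerate(ordered, start=1)}


ranks = rank_statements(TOP_SCORES)
for number in sorted(TOP_SCORES, key=ranks.get):
    print(number, TOP_SCORES[number], scaled_score(TOP_SCORES[number]), ranks[number])
```

`Decimal` is used instead of `round()` so that values such as 387 / 20 = 19.35 round up to 19.4, as in the table, rather than depending on binary floating-point representation.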
Table 12: Recommendation of mobile commerce transactions by respondents

| Opinion | Number of Respondents | % of Respondents |
|---|---|---|
| Definitely Recommend | 75 | 75% |
| Somewhat Recommend | 12 | 12% |
| No comments | 10 | 10% |
| Do not Recommend | 3 | 3% |
| Not at all | 0 | 0% |
| Total | 100 | 100% |
Figure 11: Recommendation of Mobile Commerce transactions by respondents.
Inference: According to the data in Table 12 and Figure 11, the majority of respondents definitely recommend mobile commerce transactions, and a very small minority do not recommend this technology. However, some respondents in the middle only somewhat recommend it or have no opinion about it. It is important to consider that portion of the sample.
HYPOTHESIS TESTING

The following hypotheses were formulated in the hypothesis formulation section of this study:
H1: There is a significant impact of internal variables (demographic and psychographic variables) on consumer behaviour towards mobile commerce.
H2: There is a significant impact of external variables (social, cultural and technology variables) on consumer behaviour towards mobile commerce.

According to the answers gathered from the respondents and the analysis of these data, the results indicate that H1 and H2 are correct, because demographic and psychographic variables affect the respondents’ behaviour towards mobile commerce in this study. Table 11 shows the rankings of the views gathered from the respondents’ questionnaires. According to that information, the usage of mobile commerce applications is affected by occupation, marital status, the popularity of the technology in the market and people’s characteristics. This shows that both internal and external variables have a significant impact on mobile commerce.
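The conclusion above rests on the rankings in Table 11 rather than on a reported statistical test. As a purely descriptive aid, the sketch below averages the Table 11 scores by type of variable. Two things are assumptions made for this example and are not stated in the chapter: the mapping of statements 1 to 6, 7 to 9, 10 to 12, 13 to 15 and 16 to 18 onto demographic, psychographic, social, cultural and technological variables, and the reading of each score as the sum of 100 ratings on a 1 to 5 scale (so that score / 100 is a mean rating with 3 as the neutral midpoint).

```python
from statistics import mean

# Raw scores from Table 11, in statement order 1..18.
SCORES = [242, 364, 269, 277, 370, 236, 226, 325, 274,
          341, 329, 273, 284, 341, 282, 356, 373, 387]

# Assumed grouping of statement numbers by variable type.
GROUPS = {
    "demographic": range(1, 7),
    "psychographic": range(7, 10),
    "social": range(10, 13),
    "cultural": range(13, 16),
    "technological": range(16, 19),
}
RESPONDENTS = 100  # N = 100


def mean_rating_by_group(scores, groups, respondents):
    """Average rating per respondent for each group of statements."""
    return {
        name: mean(scores[number - 1] for number in numbers) / respondents
        for name, numbers in groups.items()
    }


ratings = mean_rating_by_group(SCORES, GROUPS, RESPONDENTS)
for name, rating in ratings.items():
    print(f"{name:<14} {rating:.2f}")
```

Under these assumptions the technological statements have the highest mean rating. This is a summary of the reported scores only; it is not a significance test and does not by itself establish H1 or H2.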
CONCLUSIONS & RECOMMENDATIONS

This study was conducted in a small region of the country, but it can serve as a sample for analysing consumer behaviour towards mobile commerce in order to provide better services or improve existing services in the region. The findings suggest that respondents are technology savvy and like technological improvements and developments. Some limitations exist, as mentioned above; in addition, competition between mobile service providers can be a handicap for mobile commerce services.

As stated at the end of the findings section, the study suggests that people do not spend much money on mobile commerce transactions. However, Section 4.6 indicated that interactive services such as chatting and games are the most popular mobile commerce services among the respondents. In particular, some respondents stated that the facebook and twitter web sites are the most visited and most important websites for them. It is therefore important to highlight the free access that mobile service providers in this region offer to services such as facebook, twitter and messenger; this must be considered one of the reasons for the low amount of money spent on mobile commerce transactions. On the other hand, because the mobile commerce infrastructure in the region is lacking from a security point of view, most m-commerce consumers could not be highly satisfied.

A further study could consider which types of mobile commerce services consumers use and under what circumstances they will rely on these services. For some consumers, especially in a region whose infrastructure is not completely secure, m-commerce cannot go further than a simple chatting tool, and it cannot be used as a transaction tool by businesses or consumers.

Since this study was conducted at Cyprus Research Centre, it is limited by resources, time and population. The research outcomes may vary if the study is conducted in different regions of the country with different populations, and they may depend on the respondents’ profiles or characteristics. This research was conducted mainly with university students; their perceptions of mobile commerce technology and their monetary income may affect the outcomes of the study. Mobile commerce applications can become more popular by providing more services and by drawing on customer feedback. People’s views and behaviour indicate that they use this technology because they are fond of it.
REFERENCES
1. Sari, A. (2012) Impact of Determinants on Student Performance towards Information Communication Technology in Higher Education. International Journal of Learning and Development, 2, 18-30. http://dx.doi.org/10.5296/ijld.v2i2.1371
2. Cellatoglu, N. and Sari, A. (2010) Environmental Impacts of Private Transportation on Sustainable Development: A Case Study of Northern Cyprus. 1st International Sustainable Building Symposium, Volume 1, 441-445.
3. Sari, A., Karaduman, A. and Firat, A. (2015) Deployment Challenges of Offshore Renewable Energy Systems for Sustainability in Developing Countries. Journal of Geographic Information System, 7, 465-477. http://dx.doi.org/10.4236/jgis.2015.75037
4. Sari, A. (2014) Economic Impact of Higher Education Institutions in a Small Island: A Case of TRNC. Global Journal of Sociology, 4, 41-45.
5. Sari, A. (2012) Diversification of Tourism Activities in Small Island Developing States. International Journal of Applied Science and Technology, 2.
6. Yankee Group Research (2002) Mobile User Survey Results Part 1: Will Next Generation Data Services Close the Value Gap?
7. Instat/MDR (2002) Worldwide Wireless Data/Internet Market: Bright Spots in a Dark Industry.
8. Ghosh, A.K. and Swaminatha, T.M. (2001) Software Security and Privacy Risks in Mobile e-Commerce. Communications of the ACM, 44, 51-57. http://dx.doi.org/10.1145/359205.359227
9. Tarasewich, P. (2003) Designing Mobile Commerce Applications. Communications of the ACM, 46, 57-60. http://dx.doi.org/10.1145/953460.953489
10. Will, G. (2004) Upstart Airline Shows Direction of Industry. The Chicago Sun-Times, Chicago Sun Times Inc., Chicago. http://web.lexis-nexis.com
11. Tsalgatidou, A., Veijalainen, J. and Pitoura, E. (2000) Challenges in Mobile Electronic Commerce. Proceedings of IEC of the Third International Conference on Innovation through E-Commerce, Manchester, 14-16.
12. Turban, E. (2004) Electronic Commerce: A Managerial Perspective. Pearson Education, Inc., Upper Saddle River.
13. Boon, S. and Holmes, J. (1991) The Dynamics of Interpersonal Trust: Resolving Uncertainty in the Face of Risk. In: Hinde, R. and Groebel, J., Eds., Cooperation and Prosocial Behaviour, Cambridge University Press, Cambridge, 190-211.
14. Ambrose, P. and Johnson, G. (2000) A Trust Based Model of Buying Behavior in Electronic Retailing. Proceedings of America Conference of Information System.
15. Hoffman, D., Novak, T. and Peralta, M. (1999) Building Customer Trust Online. Communications of ACM, 42, 54-57. http://dx.doi.org/10.1145/299157.299175
16. Ratnasingham, P. and Kumar, K. (2000) Trading Partner Trust in Electronic Commerce Participation. Proceedings of International Conference of Information Systems.
17. Cheung, C. and Lee, M. (2000) Trust in Internet Shopping: A Proposed Model and Measurement Instrument. Proceedings of America Conference of Information System.
18. Androulidakis, N. and Androulidakis, I. (2005) Perspectives of Mobile Advertising in Greek Market. Proceedings of 2005 International Conference on Mobile Business (ICMB 2005). http://dx.doi.org/10.1109/icmb.2005.78
19. Thornhill, A., et al. (2003) Research Methods for Business Students. 3rd Edition, Rotolito Lombarda, Italy, 72, 85.
20. Robson, C. (1993) Real World Research. A Resource for Social Scientists and Practitioner Researchers. Blackwell Publishers Inc., Oxford.
INDEX
A
Achieve faster random number generation 5
Affinity matrix 211, 212
Airborne organism 234
Altera module contained the Cyclone 63
Analytic prediction 190
Animal group system 228
Annual municipal register 126
Ant colony optimizer (ACO) 228
Application Programming Interface (API) 16
Arbitrary finite network 177
Artificial neural network classifier 261
Audio and Video Random Number Generator 86
Autonomous region 123, 125
Auxiliary uniform 29
B
Behavior of human 230
C
Circumstances 273, 297
Cloud storage via personal devices 84
Cluster ensemble framework 210
Clustering algorithm 200, 201, 202, 203, 206, 209, 210, 214, 215, 216, 217, 220, 221, 222, 223, 224
Collection method 282
Combination of random route 122
Combined Multiple Recursive Generator (CMRG) 5
Compare generators’ behavior 62
Compare sample quality across 121
Comparison appear spurious 130
Complete convergence 141, 143, 144, 145
Complete metric space 161
Complete probability 159
Complex networks 176, 177, 191, 195, 196, 197, 198
Compounding method 107
Computational complexity reasons 44
Computer-Aided Diagnosis (CAD) 250
Conditional density function 110
Conduct standard uniform 109
Consumer behavior 271
Correlated Standard Uniform Geometric (CSUG) 109
Critical security parameters (CSPs) 56
Cyprus Research center 281
D
Definitely Recommend 284, 295
Designing minimal-attention interfaces 274
Desktop e-Commerce counterparts 273
Deterministic algorithms 30
Diagonal elements 208
Digestive system 250
Dimensionality reduction 205
Dimensionality reduction method 213
Distributed random variable 158
Distribution curve 187, 194
Dual-Video Random Number Generator (DVRNG) 84
Dynamic stochastic systems 176
E
Economic activity 122, 123, 126, 127, 133, 134, 135, 136
Educational attainment 126, 127, 130, 131, 135, 137
Education category 130, 131
Effective aggregation 210
Effective classification techniques 250
Efficient hardware architecture 40
Efficient implementation 29
Electronic circuit 87, 95
Emotion changing 230
Employ one read-controller 13
Enhancement impacts 42
Enormous potential of hardware accelerators 28
Ensemble learning methods 255
Entropy insufficiency 56, 78
Environment Survey 124, 128, 129, 131, 132, 133, 134, 136
Estimate economic activity 133
European Social Survey (ESS) 123
Extremal Optimization (EO) 234
F
Field programmable gate arrays (FPGAs) 28
First passage time (FPT) 176, 177
Frequency 283, 285, 286, 290, 291
Fundamental dynamic process 176
Fuzzy clustering 200, 201, 202, 203, 204, 224
Fuzzy c-means (FCM) 200
Fuzzy Rand Index (FRI) 214
Fuzzy set-valued random variable 161
Fuzzy set-valued sequence 158
G
Gauss distribution 232, 235, 244
Gaussian mixture data 202, 218, 219
Gender distribution 127, 128, 130
General Bayesian network (GBN) 251
Generation of random bit-stream 58
Geometric random variable 109
Global behavior 228
Global jitter component 59
Global jitter sources 59
Global markets 272
Goal of medical data 264
H
Hardware-Accelerated Modified Lagged Fibonacci Generator 6
Hardware-Accelerated Scalable Parallel Random Number Generators library (HASPRNGs) 5
Hardware-Accelerated version of SPRNG (HASPRNG) 3
Hardware architecture 29, 30, 36, 41, 49
Hardware Description Language (VHDL) 6
Hardware platforms 63
Health Barometer 123, 125, 129, 132, 134
Heterogeneous networks 195
High-Performance Reconfigurable Computing (HPRC) 5
Hyperspaces 159
Hyper Text Transfer Protocol (HTTP) 276
Hypothesis formulation 296
I
Identical numerator 112
Implementing deterministic data processing algorithms 56
Individual decision mechanism 232
Influence customer 278
Information Gain Ratio attribute evaluation (IGR) 253
Initialization function 15, 16, 17
Intermingling of polynomial 114
Internal clustering validation 214, 224
Internal validation measure 214
Internet graph 103
Inverse cumulative distribution function 32, 48
Investigated high-quality uniform 30
Investigate diverse random-walk strategies 175
L
Landmark-based representation 211, 224
Law of Large Numbers (LLNs) 17
Logic array block (LAB) 63
Lookup tables (LUT) 62
M
Marginal distribution 127
Matrix product 211
Matthews Correlation Coefficient (MCC) 258
Mean first passage time (MFPT) 176
Mechanism of transport 176
Medical application 250
Medical biometrics diagnostic tasks 250
Medical diagnosis depends 250
Memory element 78
Mobile Commerce 269, 271, 275, 276, 290, 292, 294, 295, 298
Mobile commerce application 271, 276, 281, 282, 283, 284, 293, 296
Mobile commerce customer 278
Mobile telecommunication 272
Modulus operation 11
Motherboard containing Cypress 63
Movement strategy 234
Multilayer Perceptron 251
Multiple continuous-time random 184
Multiple preferential random walks (MPRW) 176
Multiple random walks (MRW) 175
Multiple simple random walks (MSRW) 176
Multiplicative Lagged Fibonacci Generator (MLFG) 5
Multivariate hypergeometric 143
Mutation strategy 231
N
National Institute of Statistics (INE) 124, 138, 139
Natural generalization 158
Negatively quadrant dependent (NQD) 142
Network topology 177
Non-empty compact convex 159
Nonnegative integer-valued random variable 179
Non-probability sampling methods 122
Non-uniform distributions 31, 50
Non-uniform random number generators 28
Numerical simulation 193, 195
O
Obtain population-representative 122
Optimized hardware implementation 31
P
Parallel Pseudorandom Number Generators (PPRNGs) 4
Parameter estimation 104
Particle swarm optimization (PSO) 228, 252
Particle Swarm Optimizer 234
Particular behavior 283
Particular transaction 271
Partition matrix 203, 204, 205, 210
Peer-to-peer networks 176, 196, 198
Permutation distribution 143
Personal computer 86, 95
Personal digital assistants (PDAs) 270
Physical nondeterministic process 56
Poisson distribution 103, 115
Polynomial approximation 32, 34, 39
Population-based swarm methodology 229
Practical search-related issues 176, 195
Practical significance 176, 188, 195
Predominant approach 122
Preferential random walks (PRW) 178
Primary motivation 115
Principal component analysis 205, 206
Probability distribution 179, 181
Probability sampling-based surveys 135
Probability samplings 125, 131, 135, 137
Pseudo-random number generators (PRNGs) 84
Psychographic 272, 278, 279, 296
Q
Qualified random numbers 87, 90, 92, 94
Questionnaire 133, 136
Questionnaire design 271
Quota-based sample 130
Quota replacements 135
R
Random emotional selection strategy 229, 235
Random number generation 23
Random Number Generator Based on Mouse Movement 87
Random number generator (RNG) 28
Random projection 199, 200, 201, 202, 203, 204, 205, 206, 207, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 220, 221, 223
Random sampling strategy 252, 263
Random variable 141, 142, 143, 144, 145, 146, 147, 151, 154, 155
Random walker 177
Random walks mentioned 179
Rapid development 200
Receiver Operator Characteristic (ROC) 258
Reconfigurable Computing Monte Carlo (RC MC) 5
Reconfigurable Computing (RC) 4
Reconfigurable computing systems 21
Reducing network traffic 190
Regarding simple random sampling 257
Representativeness 126
Research model formulation 278
Reversible Markov chains 179
S
Sampling strategy 257
Scalable Parallel Random Number Generators (SPRNGs) 3, 4
Scientific developments 227
Search strategy 177, 185, 188, 194, 195
Sensor networks 176, 188, 196
Set-valued random variable 157, 158, 159, 162, 164, 165, 166, 168
Set-valued stochastic theory 158
Similarity matrix 204, 211, 212
Simple random network 186
Simple random walks (SRW) 178
Simulated annealing algorithm 228
Simulations confirm 176
Single-flux-quantum (SFQ) 87
Single random walks 175, 177, 179, 181, 185, 186, 194, 195
Singular value decomposition (SVD) 201
Small Island Developing States (SIDS) 272
Social emotional optimization algorithm 228, 229, 244, 245
Source causing the pseudo-randomness 78
Source of significant challenge 250
Statistical Test Suite (STS) 31
Stochastic model 142
Substantial literature 144
Sum of absolute values (SAV) 133
Swarm intelligence (SI) 228
Swarm intelligent algorithm 228, 229
Symmetric uncertainty correlation-based measure (SU) 254
Synchronization 90
T
Technological problem 194
Theoretical cumulative distribution function 46
Transfer of monetary 275
Transfer proposed algorithm 95
Transformation 112, 113
Transformational data 212
Transition matrix 177, 178, 179
Tree Augmented Naïve Bayes 251
Triangular array 144, 146
True random number generators (TRNGs) 84
Turkish Liras (TL) 283
U
Uniform distribution random 87
Uniform method 179
Uniform random number generator 28, 40, 44, 49
Universal strategy applicable across spatial 234
V
Value connection 113
Video Random Number Generator (VRNG) 84
W
Waikato Environment for Knowledge Analysis (WEKA) 259
Wireless Application Protocol (WAP) 276
Wireless telecommunication network 275