Communication, Signal Processing & Information Technology: Extended Papers 9783110470383, 9783110468229

The book presents selected, extended and peer-reviewed papers on Communication and Signal Processing, as Vol. 8 of the series Advances in Systems, Signals and Devices.


Table of contents :
Preface of the Editors
Advances in Systems, Signals and Devices
Editorial Board Members
Advances in Systems, Signals and Devices
Contents
New Proposed Adaptive Beamforming Algorithms Based on Merging CGM and NLMS methods
Efficient Hardware Architecture of DCT Cordic based Loeffler Compression Algorithm for Wireless Endoscopic Capsule
Exploring the Physical Characteristics of an On-chip Router for its Integration in a 3D-mesh NoC
Compact High Speed Hardware for SHA-2 on FPGA
An Extensible Platform for Smart Home Services
Fuzzy-Based Gang Scheduling Approach for Multiprocessor Systems
A Robust Multiple Watermarking Scheme Based on the DWT
Retinal Identification System based on Optical Disc Ring Extraction and New Local SIFT-RUK Descriptor
An Optimized Hardware Architecture of 4×4 Intra Prediction for HEVC Standard
Arabic Continuous Speech Recognition Based on Hybrid SVM/HMM Model
Enhancing the Odd Peaks Detection in OFDM Systems Using Wavelet Transforms
Methodology for Analysis of Direct Sampling Receiver Architectures from Signal Processing and System Perspective
A Hybrid PAPR Reduction Scheme for OFDM Using SLM with Clipping at the Transmitter, and Sparse Reconstruction at the Receiver
Mobile Workflow Management System Architecture Taking into Account Relevant Security Requirements

F. Derbel, O. Kanoun and N. Derbel (Eds.) Communication, Signal Processing and Information Technology

Advances in Systems, Signals and Devices


Edited by Olfa Kanoun, University of Chemnitz, Germany

Volume 8

Communication and Signal Processing | Edited by F. Derbel, O. Kanoun and N. Derbel

Editors of this Volume Prof. Dr.-Ing. Faouzi Derbel Leipzig University of Applied Sciences Chair of Smart Diagnostic and Online Monitoring Wächterstrasse 13 04107 Leipzig, Germany [email protected]

Prof. Dr.-Eng. Nabil Derbel University of Sfax Sfax National Engineering School Control & Energy Management Laboratory 1173 BP, 3038 SFAX, Tunisia [email protected]

Prof. Dr.-Ing. Olfa Kanoun Technische Universität Chemnitz Chair of Measurement and Sensor Technology Reichenhainer Strasse 70 09126 Chemnitz, Germany [email protected]

ISBN 978-3-11-046822-9 e-ISBN (PDF) 978-3-11-047038-3 e-ISBN (EPUB) 978-3-11-046841-0 ISSN 2365-7493 Library of Congress Control Number: 2018934306 Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de. © 2018 Walter de Gruyter GmbH, Berlin/Boston Typesetting: Konvertus, Haarlem Printing and binding: CPI books GmbH, Leck www.degruyter.com

Preface of the Editors

The eighth volume of the series "Advances in Systems, Signals and Devices" (ASSD) is devoted to the field of Communication, Signal Processing & Information Technology. The scope of the volume encompasses all aspects of research, development and applications of science and technology in these fields. The topics concern the design, modeling, fundamentals and application of communication systems, signal processing and information technology approaches. The volume covers topics in the areas of communication systems, wireless communication, wireless sensor networks, cooperative and MIMO systems, new emerging communication technologies, adaptive and smart antennas, telecommunication systems, digital signal processing, image and video compression algorithms, speech recognition, biometry and medical imaging, data fusion and pattern recognition, information retrieval, computational intelligence, distributed and real-time systems, cloud computing, and information technology. Every issue is edited by a special editorial board of renowned scientists from all over the world. Authors are encouraged to submit novel contributions that include results of research or experimental work discussing new developments in the field of communication, signal processing and information technology. The series can also host special issues on novel developments in specific fields. The aim of this international series is to promote scientific progress in the fields of systems, signals and devices. It is a great pleasure for us to work together with an international editorial board consisting of renowned scientists in the field of communication, signal processing and information technology.

Editors-in-Chief: Faouzi Derbel, Olfa Kanoun and Nabil Derbel

De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, p. V. https://doi.org/10.1515/9783110470383-202

Advances in Systems, Signals and Devices Series Editor: Prof. Dr.-Ing. Olfa Kanoun Technische Universität Chemnitz, Germany. [email protected]

Editors in Chief: Systems, Automation & Control Prof. Dr.-Eng. Nabil Derbel ENIS, University of Sfax, Tunisia [email protected]

Power Systems & Smart Energies Prof. Dr.-Ing. Faouzi Derbel Leipzig Univ. of Applied Sciences, Germany [email protected]

Communication, Signal Processing & Information Technology Prof. Dr.-Ing. Faouzi Derbel Leipzig Univ. of Applied Sciences, Germany [email protected]

Sensors, Circuits & Instrumentation Systems Prof. Dr.-Ing. Olfa Kanoun Technische Universität Chemnitz, Germany [email protected]

Editorial Board Members: Systems, Automation & Control Dumitru Baleanu, Çankaya University, Ankara, Turkey Ridha Ben Abdennour, Engineering School of Gabès, Tunisia Naceur Benhadj Braïek, ESSTT, Tunis, Tunisia Mohamed Benrejeb, Engineering School of Tunis, Tunisia Riccardo Caponetto, Università degli Studi di Catania, Italy Yang Quan Chen, Utah State University, Logan, USA Mohamed Chtourou, Engineering School of Sfax, Tunisia Boutaïeb Dahhou, Univ. Paul Sabatier, Toulouse, France Gérard Favier, Université de Nice, France Florin G. Filip, Romanian Academy, Bucharest, Romania Dorin Isoc, Tech. Univ. of Cluj Napoca, Romania Pierre Melchior, Université de Bordeaux, France Faïçal Mnif, Sultan Qaboos Univ., Muscat, Oman Ahmet B. Özgüler, Bilkent University, Bilkent, Turkey Manabu Sano, Hiroshima City Univ., Hiroshima, Japan Abdul-Wahid Saif, King Fahd University, Saudi Arabia José A. Tenreiro Machado, Engineering Institute of Porto, Portugal Alexander Pozniak, Instituto Politécnico Nacional, Mexico Herbert Werner, Univ. of Technology, Hamburg, Germany Ronald R. Yager, Machine Intelligence Inst., Iona College, USA Blas M. Vinagre, Univ. of Extremadura, Badajoz, Spain Lotfi Zadeh, Univ. of California, Berkeley, CA, USA

Power Systems & Smart Energies Sylvain Allano, Ecole Normale Sup. de Cachan, France Ibrahim Badran, Philadelphia Univ., Amman, Jordan Ronnie Belmans, University of Leuven, Belgium Frédéric Bouillault, University of Paris XI, France Pascal Brochet, Ecole Centrale de Lille, France Mohamed Elleuch, Tunis Engineering School, Tunisia Mohamed B. A. Kamoun, Sfax Engineering School, Tunisia Mohamed R. Mékidèche, University of Jijel, Algeria Bernard Multon, Ecole Normale Sup. Cachan, France Francesco Parasiliti, University of L’Aquila, Italy Manuel Pérez Donsión, University of Vigo, Spain Michel Poloujadoff, University of Paris VI, France Francesco Profumo, Politecnico di Torino, Italy Alfred Rufer, Ecole Polytech. Lausanne, Switzerland Junji Tamura, Kitami Institute of Technology, Japan

Communication, Signal Processing & Information Technology Til Aach, Aachen University, Germany Kasim Al-Aubidy, Philadelphia Univ., Amman, Jordan Adel Alimi, Engineering School of Sfax, Tunisia Najoua Benamara, Engineering School of Sousse, Tunisia Ridha Bouallegue, Engineering School of Sousse, Tunisia Dominique Dallet, ENSEIRB, Bordeaux, France Mohamed Deriche, King Fahd University, Saudi Arabia Khalifa Djemal, Université d’Evry, Val d’Essonne, France Daniela Dragomirescu, LAAS, CNRS, Toulouse, France Khalil Drira, LAAS, CNRS, Toulouse, France Noureddine Ellouze, Engineering School of Tunis, Tunisia Faouzi Ghorbel, ENSI, Tunis, Tunisia Holger Karl, University of Paderborn, Germany Berthold Lankl, Univ. Bundeswehr, München, Germany George Moschytz, ETH Zürich, Switzerland Radu Popescu-Zeletin, Fraunhofer Inst. Fokus, Berlin, Germany Basel Solimane, ENST Bretagne, France Philippe Vanheeghe, Ecole Centrale de Lille, France

Sensors, Circuits & Instrumentation Systems Ali Boukabache, Univ. Paul Sabatier, Toulouse, France Georg Brasseur, Graz University of Technology, Austria Serge Demidenko, Monash University, Selangor, Malaysia Gerhard Fischerauer, Universität Bayreuth, Germany Patrick Garda, Univ. Pierre & Marie Curie, Paris, France P. M. B. Silva Girão, Inst. Superior Técnico, Lisboa, Portugal Voicu Groza, University of Ottawa, Ottawa, Canada Volker Hans, University of Essen, Germany Aimé Lay-Ekuakille, Università degli Studi di Lecce, Italy Mourad Loulou, Engineering School of Sfax, Tunisia Mohamed Masmoudi, Engineering School of Sfax, Tunisia Subhas Mukhopadhyay, Massey University, Turitea, New Zealand Fernando Puente León, Technical Univ. of München, Germany Leonard Reindl, Inst. für Mikrosystemtechnik, Freiburg, Germany Pavel Ripka, Tech. Univ. Praha, Czech Republic Abdulmotaleb El Saddik, SITE, Univ. Ottawa, Ontario, Canada Gordon Silverman, Manhattan College, Riverdale, NY, USA Rached Tourki, Faculty of Sciences, Monastir, Tunisia Bernhard Zagar, Johannes Kepler Univ. of Linz, Austria

Advances in Systems, Signals and Devices

Volume 1: N. Derbel (Ed.), Systems, Automation, and Control, 2016. ISBN 978-3-11-044376-9, e-ISBN 978-3-11-044843-6, e-ISBN (EPUB) 978-3-11-044627-2, Set-ISBN 978-3-11-044844-3

Volume 2: O. Kanoun, F. Derbel, N. Derbel (Eds.), Sensors, Circuits and Instrumentation Systems, 2016. ISBN 978-3-11-046819-9, e-ISBN 978-3-11-047044-4, e-ISBN (EPUB) 978-3-11-046849-6, Set-ISBN 978-3-11-047045-1

Volume 3: F. Derbel, N. Derbel, O. Kanoun (Eds.), Power Systems & Smart Energies, 2016. ISBN 978-3-11-044615-9, e-ISBN 978-3-11-044841-2, e-ISBN (EPUB) 978-3-11-044628-9, Set-ISBN 978-3-11-044842-9

Volume 4: F. Derbel, N. Derbel, O. Kanoun (Eds.), Communication, Signal Processing & Information Technology, 2016. ISBN 978-3-11-044616-6, e-ISBN 978-3-11-044839-9, e-ISBN (EPUB) 978-3-11-043618-1, Set-ISBN 978-3-11-044840-5

Volume 5: F. Derbel, N. Derbel, O. Kanoun (Eds.), Systems, Automation, and Control, 2017. ISBN 978-3-11-046821-2, e-ISBN 978-3-11-047046-8, e-ISBN (EPUB) 978-3-11-046850-2, Set-ISBN 978-3-11-047047-5

Volume 6: O. Kanoun, F. Derbel, N. Derbel (Eds.), Sensors, Circuits and Instrumentation Systems, 2017. ISBN 978-3-11-044619-7, e-ISBN 978-3-11-044837-5, e-ISBN (EPUB) 978-3-11-044624-1, Set-ISBN 978-3-11-044838-2

Volume 7: F. Derbel, N. Derbel, O. Kanoun (Eds.), Power Systems & Smart Energies, 2017. ISBN 978-3-11-046820-5, e-ISBN 978-3-11-047052-9, e-ISBN (EPUB) 978-3-11-044628-9, Set-ISBN 978-3-11-047053-6

Volume 8: F. Derbel, O. Kanoun, N. Derbel (Eds.), Communication, Signal Processing & Information Technology, 2017. ISBN 978-3-11-046822-9, e-ISBN 978-3-11-047038-3, e-ISBN (EPUB) 978-3-11-046841-0, Set-ISBN 978-3-11-047039-0

Contents

Preface of the Editors | V

T. M. Jamel
New Proposed Adaptive Beamforming Algorithms Based on Merging CGM and NLMS methods | 1

N. Jarray, M. Elhajji and A. Zitouni
Efficient Hardware Architecture of DCT Cordic based Loeffler Compression Algorithm for Wireless Endoscopic Capsule | 23

M. Langar, R. Bourguiba and J. Mouine
Exploring the Physical Characteristics of an On-chip Router for its Integration in a 3D-mesh NoC | 37

M. Anane and N. Anane
Compact High Speed Hardware for SHA-2 on FPGA | 47

M. Götze, W. Kattanek and R. Peukert
An Extensible Platform for Smart Home Services | 61

K. M. Al-Aubidy, K. Batiha and H. Y. Al-Kharbshh
Fuzzy-Based Gang Scheduling Approach for Multiprocessor Systems | 81

H. Ouazzane, H. Mahersia and K. Hamrouni
A Robust Multiple Watermarking Scheme Based on the DWT | 97

T. Chihaoui, R. Kachouri, H. Jlassi, M. Akil and K. Hamrouni
Retinal Identification System based on Optical Disc Ring Extraction and New Local SIFT-RUK Descriptor | 113

M. Kammoun, A. Ben Atitallah, H. Loukil and N. Masmoudi
An Optimized Hardware Architecture of 4×4 Intra Prediction for HEVC Standard | 127

E. Zarrouk, Y. Ben Ayed and F. Gargouri
Arabic Continuous Speech Recognition Based on Hybrid SVM/HMM Model | 145

A. Damati, O. Daoud and Q. Hamarsheh
Enhancing the Odd Peaks Detection in OFDM Systems Using Wavelet Transforms | 161

C. Schultz and P. Hillger
Methodology for Analysis of Direct Sampling Receiver Architectures from Signal Processing and System Perspective | 175

M. Gay, A. Lampe and M. Breiling
A Hybrid PAPR Reduction Scheme for OFDM Using SLM with Clipping at the Transmitter, and Sparse Reconstruction at the Receiver | 197

Z. J. Muhsin and A. Al-Taee
Mobile Workflow Management System Architecture Taking into Account Relevant Security Requirements | 217

T. M. Jamel

New Proposed Adaptive Beamforming Algorithms Based on Merging CGM and NLMS methods

Abstract: This paper proposes two new adaptive beamforming algorithms based on a merging method for performance enhancement of mobile communications systems. The first proposed algorithm merges the pure Conjugate Gradient Method (CGM) with the pure Normalized Least Mean Square (NLMS) algorithm and is therefore called CGM-NLMS, while the second merges the pure CGM with a Modified NLMS (MNLMS) algorithm and is called CGM-MNLMS. The MNLMS algorithm uses a time-varying regularization parameter ε(k) in place of the fixed parameter ε of the conventional NLMS algorithm: ε(k) is set to the reciprocal of the squared estimation error in the NLMS step-size update. With the proposed CGM-NLMS and CGM-MNLMS algorithms, the estimated weight coefficients acquired from the first stage (CGM) are stored and then used as initial weight coefficients for the NLMS (or MNLMS) processing. Simulation results of an adaptive beamforming system using an Additive White Gaussian Noise (AWGN) channel model and a Rayleigh fading channel with a Jakes power spectral density show that the two proposed algorithms provide fast convergence, higher interference suppression capability, and low steady-state MSD and MSE compared with the pure CGM and pure NLMS algorithms.

Keywords: Beamforming algorithm, Least Mean Square (LMS), Normalized LMS (NLMS), Conjugate Gradient Method (CGM), Time-varying regularization parameter.

1 Introduction

The main limitation of the Least Mean Square (LMS) algorithm, first proposed by Widrow and Hoff at Stanford University, Stanford, CA, in 1960, is its relatively slow rate of convergence [1]. In order to increase the convergence rate, the LMS algorithm is modified by normalization, which is known as the normalized LMS (NLMS) [1, 2]. The normalized LMS algorithm may be viewed as an LMS algorithm with a time-variable step-size parameter [1]. Many time-varying step-size approaches for the NLMS algorithm, such as the Error Normalized Step Size LMS (ENSS), Robust Variable Step Size LMS (RVSS), and Error-Data Normalized Step Size LMS (EDNSS), have been reported [3-16]. The generalized

T. M. Jamel: University of Technology, Baghdad, Iraq, email: [email protected]

De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 1–22. https://doi.org/10.1515/9783110470383-001


normalized gradient descent (GNGD) algorithm (2004) [6] used a gradient-adaptive term for updating the step size of the NLMS. The first tuning-free algorithm was proposed in 2006 [7]; it used the MSE and the estimated noise power to update the step size. The robust regularized NLMS (RR-NLMS) filter was proposed in 2006 [8]; it uses a normalized gradient to control the regularization parameter update. Another scheme with a hybrid filter structure was proposed in 2007 in order to enhance the performance of the GNGD [9]. The noise-constrained normalized least mean squares (NC-NLMS) adaptive filter was proposed in 2008 [10]; it can be regarded as a time-varying step-size NLMS. Another tuning-free NLMS algorithm was achieved in 2008 [11, 12], called the generalized square-error-regularized NLMS algorithm (GSER) [10, 11]. The inverse of the weighted square error was proposed for a variable step-size NLMS algorithm in 2008 [13]. After that, the Euclidean vector norm of the output error was suggested for updating a variable step-size NLMS algorithm in 2010 [14]. Another nonparametric algorithm, which uses the mean square error and the estimated noise power, was presented in 2012 [15]. Finally, in 2016, Young-Seok Choi proposed an approach that dynamically updates the regularization parameter by exploiting a gradient descent direction [16]. All these algorithms suffer from the preselection of different constant parameters in the initial state of adaptive processing or have high computational complexity. In this paper, a Modified Normalized Least Mean Square (MNLMS) algorithm is proposed which is also tuning-free (i.e. nonparametric). It uses a time-varying regularization ε(k) instead of a fixed value ε. The gradient-based direction method in some cases has a slow convergence rate. In order to overcome this problem, Hestenes and Stiefel developed the conjugate gradient method (CGM) in the early 1950s [17]. The CGM suffers from the fact that its rate of convergence depends on the condition number of the matrix A. Therefore, many modifications have been proposed to improve the performance of the CG algorithm for different applications [18]. One of these modifications suggested that the step size can be replaced by a constant value or by a normalized step size [19]. Moreover, preconditioning is used to increase the convergence rate of the CGM algorithm by changing the distribution of the eigenvalues of A and clustering them around one point [19]. In 1997, spatial and time diversity were used with the CGM algorithm to obtain an algorithm for adaptive beamforming in mobile communication systems [20]. In 1999, researchers solved the problem of applying the CGM to small numbers of both snapshots and array elements by proposing new forward and backward CGM (FBCGM) and multilayer (WBCGM) methods [21]. In 2013, interference alignment in time-varying MIMO (multiple-input, multiple-output) interference channels was achieved by applying an approach based on the conjugate gradient method combined with metric projection [22]. Also in 2013, an adaptive block least mean square (B-LMS) algorithm with an optimally derived step size using conjugate gradient search directions was proposed to minimize the mean square error (MSE) of the linear system [23].


Although the pure CGM has better performance than the pure NLMS algorithm, further performance enhancement can be obtained when the two are merged into one algorithm. This paper presents a new approach to achieve fast convergence, higher interference suppression capability, and low levels of MSD and MSE. The proposed algorithms combine the CGM (as a first stage) with the NLMS (or MNLMS) algorithm (as a second stage). In this way, the desirable fast convergence and good interference suppression capability of the CGM are combined with the good tracking capability of the variable step-size method and the low steady-state MSD and MSE of the NLMS (or MNLMS) algorithm.

2 Basic concepts of the LMS and NLMS algorithms

An adaptive beamforming system with an M-element array can be drawn as in Fig. 1. The weight vector w = [w_1 w_2 . . . w_M]^T must be adjusted so as to minimize the error while iterating the array weights [24]. The signal s(k) and the interferers i_1(k), i_2(k), . . ., i_N(k) are received by an array of M elements with M potential weights [24]. Each received signal at element m also includes additive Gaussian noise, and k denotes the k-th time sample. Thus, the weighted array output can be given in the following form [24]:

y(k) = w^H(k) x(k)   (1)

Fig. 1. Block diagram of adaptive beamforming algorithm.

where the operator H denotes the vector Hermitian transpose and x(k) is the input signal vector:

x(k) = a_0 s(k) + [a_1 a_2 . . . a_N] [i_1(k) i_2(k) . . . i_N(k)]^T + n(k) = x_s(k) + x_i(k) + n(k)   (2)

where x_s(k) is the desired signal vector, x_i(k) is the interfering signal vector, n(k) is zero-mean Gaussian noise for each channel, and a_i is the M-element array steering vector for the θ_i direction of arrival. The error signal is defined as the difference between the desired signal d(k) and the output signal y(k) [24]:

e(k) = d(k) − w^H(k) x(k)   (3)

Using the gradient of the cost function, the LMS weight update is:

w(k + 1) = w(k) + μ e(k) x(k)   (4)

The parameter μ is a constant known as the step size [24]. In order to guarantee stability of the LMS algorithm, the step size should be bounded by [25]:

0 < μ < 2/λ_max

where λ_max is the largest eigenvalue of the input autocorrelation matrix. The NLMS algorithm normalizes the step size by the instantaneous input power, giving the update

w(k + 1) = w(k) + [μ_0 / (ε + ‖x(k)‖²)] e(k) x(k)

where the small constant ε > 0 is added to overcome the problem of dividing by a small value of x^T(k)x(k) [24].
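For concreteness, the following is a minimal NumPy sketch of this NLMS recursion applied to an M-element array. The function name, the parameter values, and the complex-signal conventions are illustrative assumptions, not the paper's simulation code.

```python
import numpy as np

def nlms_beamformer(X, d, mu0=0.5, eps=1e-6):
    """NLMS adaptive beamformer sketch.

    X   : (M, K) complex array snapshots, one column per time sample k
    d   : (K,) desired (reference) signal
    mu0 : base step size
    eps : small positive regularization constant
    Returns the final weights and the a-priori error history.
    """
    M, K = X.shape
    w = np.zeros(M, dtype=complex)             # weights initialized to zero
    e = np.zeros(K, dtype=complex)
    for k in range(K):
        x = X[:, k]
        e[k] = d[k] - np.vdot(w, x)            # e(k) = d(k) - w^H(k) x(k), eq. (3)
        norm = eps + np.vdot(x, x).real        # eps + ||x(k)||^2
        w = w + (mu0 / norm) * np.conj(e[k]) * x   # normalized update
    return w, e
```

The conjugated error in the update is the usual complex-LMS convention for the output definition y(k) = w^H(k) x(k).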

3 New proposed modified normalized least mean square algorithm (MNLMS)

The proposed MNLMS algorithm introduces a new way of choosing the step size. The small constant ε in the NLMS algorithm has a fixed effect on the step-size update and may cause a reduction in its value. This reduction in step size affects the convergence rate and weight stability of the NLMS algorithm. In the MNLMS algorithm, the error signal is used both to avoid a zero denominator and to control the step size at each iteration. According to this approach, the ε parameter is set as:

ε(k) = 1 / e²(k)   (12)

The proposed new step-size formula can then be written as:

μ_mnlms(k) = μ_0 / (ε(k) + ‖x(k)‖²)   (13)

Clearly, μ_mnlms(k) is controlled by the normalization of both the reciprocal of the squared estimation error and the input data vector. Therefore, the weight update of the MNLMS algorithm is:

w(k + 1) = w(k) + [μ_0 / (ε(k) + ‖x(k)‖²)] e(k) x(k)   (14)

As can be seen from (13), the step size of the MNLMS decreases and increases according to the reciprocal of the squared estimation error and to the input tap vector. In other words, when the error signal is large at the beginning of the adaptation process, ε(k) is small and the step size is large, in order to increase the convergence rate. However, when the error signal is small at steady state, ε(k) is large and the step size is small, giving a low level of misadjustment at steady state, as shown in Fig. 2. This prevents the updated weights from diverging and makes the MNLMS more stable and faster converging than the NLMS algorithm.
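A sketch of how one MNLMS iteration of (12)-(14) could look in the same setting as the NLMS sketch above; the tiny guard term keeping ε(k) finite when the error is exactly zero is an added assumption.

```python
import numpy as np

def mnlms_step(w, x, d_k, mu0=0.5, guard=1e-12):
    """One MNLMS iteration, following eqs. (12)-(14)."""
    e_k = d_k - np.vdot(w, x)                   # a-priori error
    eps_k = 1.0 / (np.abs(e_k) ** 2 + guard)    # eq. (12); guard keeps eps finite at e=0
    mu_k = mu0 / (eps_k + np.vdot(x, x).real)   # eq. (13): time-varying step size
    w = w + mu_k * np.conj(e_k) * x             # eq. (14)
    return w, e_k
```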

4 Analysis of the proposed time-varying step size

In this section, an approximate performance analysis of the proposed time-varying step-size algorithm is given, using an approach similar to that of [4, 25]. The weight coefficients of the proposed algorithm are updated as in (14), which can be rewritten as:

w(k + 1) = w(k) + μ_mnlms(k) e(k) x(k)   (15)

Let w*(k) represent the time-varying optimal weight vector, which evolves as [4]:

w*(k + 1) = w*(k) + p(k)   (16)

where p(k) is a zero-mean white disturbance process [4]. Moreover, let ζ(k) represent the optimum estimation error process, defined as [4]:

ζ(k) = d(k) − x^T(k) w*(k)   (17)

Fig. 2. (a) Profile of the ε(k) parameter and (b) profile of the step-size parameter μ_mnlms(k) of the MNLMS algorithm.

or

d(k) = ζ(k) + x^T(k) w*(k)   (18)

Let v(k) denote the coefficient misadjustment (error) vector, defined as [25]:

v(k) = w(k) − w*(k)   (19)

Substituting (18) for d(k) and using (19) in (3), e(k) becomes:

e(k) = ζ(k) + x^T(k) w*(k) − x^T(k) w(k) = ζ(k) − v^T(k) x(k)   (20)

Squaring (20) and taking the expected value:

E[e²(k)] = ξ_min + σ_x² tr[G(k)]   (21)

where ξ_min = E[ζ²(k)] is the minimum mean-square error (MMSE) [4], σ_x² = E[x²(k)], and G(k) = E[v(k) v^T(k)] is the correlation matrix of the coefficient misadjustment (error) vector [4]. Substituting (18), (19), and (20) into (15), it can easily be shown that:

v(k + 1) = [I − μ_mnlms(k) x(k) x^T(k)] v(k) + μ_mnlms(k) x(k) ζ(k) − p(k)   (22)

Now assume that μ_mnlms(k) is uncorrelated with x(k) and with ζ(k), and that p(k) is zero mean [4, 25]. The expected value of the error vector is then given by [4]:

E[v(k + 1)] = {I − E[μ_mnlms(k)] E[x(k) x^T(k)]} E[v(k)]   (23)

Convergence of the proposed algorithm is therefore guaranteed if the expected value of the step-size parameter lies within the bound:

0 < E[μ_mnlms(k)] < 2   (24)
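As a rough empirical sanity check of bound (24), one can evaluate the step size of (12)-(13) on synthetic data; all values below (dimensions, step size, distributions) are arbitrary assumptions, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, mu0 = 10, 5000, 0.5
mu = np.empty(K)
for k in range(K):
    x = rng.standard_normal(M)
    e = rng.standard_normal()               # stand-in a-priori error sequence
    eps_k = 1.0 / (e ** 2 + 1e-12)          # eq. (12)
    mu[k] = mu0 / (eps_k + x @ x)           # eq. (13)
print(0.0 < mu.mean() < 2.0)                # True: bound (24) holds empirically
```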

5 Concept of the conjugate gradient method (CGM) algorithm

The goal of the CGM algorithm is to iteratively search for the optimum solution by choosing conjugate (perpendicular) paths for each new iteration [24]. CGM is an iterative method whose aim is to minimize the quadratic cost function [24]:

J(w) = (1/2) w^H A w − d^H w   (25)

where A is the K × M matrix of array snapshots (K = number of snapshots, M = number of array elements) and d = [d(1) d(2) . . . d(K)]^T is the desired signal vector of K snapshots. It can be shown that the gradient of the cost function is [24]:

∇_w J(w) = A w − d   (26)

Starting with an initial guess for the weights w(1), the first residual at the first iteration is given by [24]:

r(1) = −J′[w(1)] = d − A w(1)   (27)

The new conjugate direction vector D, used to iterate toward the optimum weights, is [24]:

D(1) = A^H r(1)   (28)

The general weight update expression is given by [24]:

w(k + 1) = w(k) + μ_CGM(k) D(k)   (29)

where μ_CGM(k) is the step size of the CGM, given by [24]:

μ_CGM(k) = [r^H(k) A A^H r(k)] / [D^H(k) A^H A D(k)]   (30)

The residual vector update is given by [24]:

r(k + 1) = r(k) − μ_CGM(k) A D(k)   (31)

and the direction vector update is given by [24]:

D(k + 1) = A^H r(k + 1) − α(k) D(k)   (32)

A linear search is used to determine the α(k) that minimizes J[w(k)]:

α(k) = [r^H(k + 1) A A^H r(k + 1)] / [r^H(k) A A^H r(k)]   (33)

Thus, the procedure to use the CGM is to find the residual and the corresponding weights, and to update them until convergence is satisfied.
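As a concrete reference, here is a sketch of the recursion (27)-(33) in NumPy, written in the standard conjugate-gradient-on-normal-equations form; the function name and the fixed iteration count are assumptions, and the sign of α(k) is folded into the direction update as in common textbook formulations.

```python
import numpy as np

def cgm_weights(A, d, n_iter=None):
    """Conjugate-gradient beamformer weights, cf. eqs. (27)-(33).

    A : (K, M) matrix of array snapshots; d : (K,) desired signal.
    """
    K, M = A.shape
    w = np.zeros(M, dtype=complex)              # initial guess w(1)
    r = d - A @ w                               # residual, eq. (27)
    z = A.conj().T @ r                          # A^H r(1)
    D = z.copy()                                # first direction, eq. (28)
    for _ in range(n_iter or M):
        AD = A @ D
        mu = np.vdot(z, z) / np.vdot(AD, AD)    # step size, eq. (30)
        w = w + mu * D                          # eq. (29)
        r = r - mu * AD                         # residual update, eq. (31)
        z_new = A.conj().T @ r
        alpha = np.vdot(z_new, z_new) / np.vdot(z, z)  # eq. (33)
        D = z_new + alpha * D                   # direction update, cf. eq. (32)
        z = z_new
    return w
```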

6 Merging CGM with NLMS and CGM with MNLMS algorithms

The two proposed algorithms can be summarized as follows:
1. The first proposed algorithm is called CGM-NLMS and merges the CGM with the NLMS algorithm. The NLMS algorithm uses the weight vector calculated by the CGM algorithm as the initial value from which to compute the final optimal weights.
2. The second proposed algorithm is called CGM-MNLMS. It makes use of two individual algorithm stages, based on the CGM and the proposed MNLMS algorithms.

With the proposed CGM-NLMS and CGM-MNLMS schemes, the estimated weight coefficients obtained from the first-stage CGM algorithm are stored and then used as initial weight coefficients for the NLMS (or MNLMS) processing. In this way, the NLMS weight coefficients are not initialized with zero values but with previously estimated values obtained from the first algorithm (CGM). Table 1 shows the step sequence for both proposed algorithms.

7 Simulation results

In this section, the pure CGM, pure NLMS, CGM-NLMS, and CGM-MNLMS algorithms are simulated and investigated for adaptive beamforming applications in a mobile communications system. The performance of all algorithms is investigated in terms of

Tab. 1. CGM-NLMS and CGM-MNLMS algorithm.

Set the parameters: K, AOA0, AOA1, AOA2, the order of the FIR filter and the number of array elements; generate the desired and interference signals.
Step 0: Initialization (CGM as the first stage). Get input data of K snapshots. Set w(0) = [0, 0, . . . , 0]^T.
Step 1: For k = 1, 2, . . ., K: initialize the columns of the input-data matrix as x(:, k) = x(k); define the matrix A of array values for the K time samples; set
  r(1) = d − A w(1);  D(1) = A^H r(1).
Step 3: For k = 2 : K, compute μ_CGM(k); update the weight coefficients as w(k + 1) = w(k) + μ_CGM(k) D(k); update the CGM parameters r(k + 1), α(k), D(k + 1). End.
  Store w(k + 1) as w(K), to be used as the initial weight for the second stage, i.e. NLMS (or MNLMS).
Step 4: Set w(1) = w(K). Calculate the error signal as e(k) = d(k) − w^H(k) x(k) and update the weight coefficients as
  w(k + 1) = w(k) + [μ_0 / (ε + ‖x(k)‖²)] e(k) x(k)   for NLMS
  w(k + 1) = w(k) + [μ_0 / (ε(k) + ‖x(k)‖²)] e(k) x(k)   for MNLMS
End. (A short software sketch of this two-stage procedure follows.)
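A compact sketch of the two-stage procedure of Tab. 1, reusing the cgm_weights function from the Section 5 sketch; the step sizes and the regularization guard are illustrative assumptions.

```python
import numpy as np

def cgm_then_nlms(A, d, n_cgm=10, mu0=0.5, eps=1e-6, modified=False):
    """Two-stage CGM-NLMS / CGM-MNLMS procedure of Tab. 1 (sketch).

    Stage 1: a few CGM iterations produce coarse weights.
    Stage 2: NLMS (or MNLMS when modified=True) continues from those
    weights instead of from zeros.
    """
    w = cgm_weights(A, d, n_iter=n_cgm)        # Steps 0-3: CGM stage
    for k in range(A.shape[0]):                # Step 4: NLMS/MNLMS stage
        x = A[k, :]                            # snapshot k as input vector
        e = d[k] - np.vdot(w, x)
        reg = 1.0 / (np.abs(e) ** 2 + 1e-12) if modified else eps
        w = w + (mu0 / (reg + np.vdot(x, x).real)) * np.conj(e) * x
    return w
```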


interference suppression capability, MSD, and the MSE learning curve. In all simulations presented here, a linear array consisting of M = 10 isotropic elements with d = 0.5λ element spacing is used. The desired signal is the cosine signal S(k) = cos[2πft(k)] with f = 1/T = 900 MHz, and the iteration number is set to 200. The angle of arrival (AOA) of the desired signal is set to θ0 = 0°, and two interfering signals arrive with AOAs θ1 = 30° and θ2 = −30°, respectively. The signal-to-noise ratio (SNR) is set to 30 dB, and the signal-to-interference ratio (SIR) is set to 10 dB. The ensemble average is taken over 100 runs of 200 iterations each. The mean square coefficient deviation (MSD) is computed over the 100-run ensemble average as follows:

MSD = 10 log10(|w − w0|²)   (34)

where w is the estimated weight vector and w0 is the optimum weight vector. The mean square error (MSE) is also computed over the 100-run ensemble average as follows:

MSE = 10 log10(|e(k)|²)   (35)
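The two figures of merit can be computed directly from (34) and (35); a small helper sketch follows, where the (runs × iterations) layout of the error matrix is an assumption.

```python
import numpy as np

def msd_db(w_est, w_opt):
    """Mean square coefficient deviation in dB, eq. (34)."""
    return 10.0 * np.log10(np.sum(np.abs(w_est - w_opt) ** 2))

def mse_learning_curve_db(E):
    """MSE learning curve in dB, eq. (35): average |e(k)|^2 over the
    ensemble of runs (rows of E), then convert to dB."""
    return 10.0 * np.log10(np.mean(np.abs(E) ** 2, axis=0))
```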

7.1 Pure CGM and pure NLMS algorithms

In this section, an Additive White Gaussian Noise (AWGN) channel model is used, with additive zero-mean Gaussian noise. Figure 3 presents the linear plot of the radiation pattern for the pure CGM and pure NLMS algorithms. It shows that the pure CGM generates nulls of about −30 dB and −32 dB at the interference angles −30° and 30°, respectively, while the pure NLMS algorithm generates nulls of about −29 dB. Figure 4 shows the MSD curves for both pure CGM and NLMS algorithms. As shown in

Fig. 3. Linear radiation patterns for the pure CGM and pure NLMS algorithms.

this figure, the CGM algorithm has a faster convergence rate, but a higher minimum MSD at steady state, than the NLMS algorithm. Figure 5 shows the MSE learning curves for both pure algorithms.

Fig. 4. MSD curves for pure CGM and pure NLMS algorithms.

As shown in Figures 4 and 5, if the two pure algorithms, CGM and NLMS, are combined into one algorithm (CGM-NLMS), then all of the good performance features (fast convergence rate, high interference suppression, and minimum MSD and MSE at steady state) can be obtained, as shown in the next section.

Fig. 5. MSE learning curves for pure CGM and pure NLMS algorithms.


7.2 Proposed algorithms using AWGN channel

Figure 6 presents the linear plot of the radiation pattern for the pure CGM, pure NLMS and CGM-NLMS algorithms. It shows that the CGM-NLMS algorithm generates deeper nulls of about −32 dB and −38.5 dB at the interference angles −30° and 30°, respectively, deeper than both the pure CGM and pure NLMS algorithms. This means that the proposed CGM-NLMS algorithm achieves about 4.5 dB and 6 dB of average improvement in interference suppression compared with the pure CGM and pure NLMS algorithms, respectively. Figure 7 presents the linear plot of the radiation pattern for the pure CGM, pure NLMS and CGM-MNLMS algorithms. It shows that the CGM-MNLMS algorithm generates deeper nulls of about −35 dB and −39 dB at the interference angles −30° and 30°, respectively, deeper than both the pure CGM and the CGM-NLMS algorithms. This means that the second proposed CGM-MNLMS algorithm achieves between 6 dB and 8 dB of average improvement in interference suppression compared with both pure algorithms; in other words, the second proposed algorithm achieves a further performance enhancement over the first. Figure 8 shows the magnitude estimation of one element weight (w4) for the first 20 iterations of a single run. As can be observed, the pure CGM and NLMS algorithms converge toward the optimum (desired) weight from an arbitrary initial value, while the two proposed algorithms converge faster and with very low misadjustment at steady state. Figures 9 and 10 show the MSD and MSE learning curves for all algorithms using an ensemble average of 100 runs of 200 iterations.

Fig. 6. Linear radiation patterns for pure CGM, pure NLMS, and CGM-NLMS algorithms / AWGN.

Fig. 7. Linear radiation patterns for pure CGM, pure NLMS, and CGM-MNLMS algorithms / AWGN.

Fig. 8. One element weight estimation for all algorithms / AWGN channel.

It can be seen that both proposed algorithms have a faster convergence rate and a lower MSD at steady state than the other algorithms. Moreover, the second proposed algorithm reaches a lower steady-state MSD than the first.

7.3 Proposed algorithms using the Rayleigh fading channel with Jakes model

The Jakes fading model used in the simulations, also known as the Sum of Sinusoids (SOS) model, is a deterministic method for simulating time-correlated


Fig. 9. MSD plot for all algorithms / AWGN channel.

Fig. 10. MSE learning curve for all algorithms / AWGN channel.

Rayleigh fading waveforms, and it is still widely used today. The model assumes that N equal-strength rays arrive at a moving receiver with uniformly distributed arrival angles α_n, such that ray n experiences a Doppler shift ω_n = ω_m cos(α_n), where ω_m = 2π f_c v/c is the maximum Doppler frequency shift, v is the vehicle speed, f_c is the carrier frequency, and c is the speed of light. As a result, the fading waveform can be modeled with N_0 + 1 complex oscillators, where N_0 = (1/2)(N/2 − 1). This leads to the equation [26]:

T_h(t) = √(1/(2N + 1)) [2 Σ_{n=1}^{N_0} (cos β_n + j sin β_n) cos(ω_m cos(α_n) t + θ_nh) + √2 cos(ω_m t + θ_0h)]   (36)


where h is the waveform index, h = 1, 2, . . ., N_0, and λ is the wavelength of the transmitted carrier. Here β_n = πn/(N_0 + 1). To generate the multiple waveforms, Jakes suggests using [26]:

θ_nh = 2π(h − 1)/(N_0 + 1) + πn/(N_0 + 1)   (37)

The output is viewed as a power spectrum, with the signal power on the y-axis and the sampling time (or sample number) on the x-axis [26]:

T_h(t) = √(2/N_0) Σ_{n=1}^{N_0} (cos β_n + j sin β_n) cos(ω_m cos(α_n) t + θ_nh)   (38)
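A possible NumPy rendering of the sum-of-sinusoids generator of (36)-(38); the choice of arrival angles α_n = 2πn/N and the ray count N = 34 (giving N_0 = 8) are the classic Jakes assumptions, not values fixed by the paper.

```python
import numpy as np

def jakes_waveform(t, f_m, N=34, h=1):
    """Jakes sum-of-sinusoids fading waveform, cf. eqs. (36)-(38).

    t   : array of time instants [s]
    f_m : maximum Doppler shift v * f_c / c [Hz]
    N   : total number of rays; N0 = (N/2 - 1)/2 oscillators
    h   : waveform index for generating multiple waveforms, eq. (37)
    """
    w_m = 2 * np.pi * f_m
    N0 = int((N / 2 - 1) / 2)
    T = np.zeros_like(t, dtype=complex)
    for n in range(1, N0 + 1):
        beta = np.pi * n / (N0 + 1)
        alpha = 2 * np.pi * n / N                        # arrival angle of ray n
        theta = 2*np.pi*(h-1)/(N0+1) + np.pi*n/(N0+1)    # eq. (37)
        T += (np.cos(beta) + 1j*np.sin(beta)) * np.cos(w_m*np.cos(alpha)*t + theta)
    return np.sqrt(2.0 / N0) * T                         # normalization of eq. (38)

# Example: 80 km/h at 900 MHz gives f_m = v*f_c/c ~ 66.7 Hz
t = np.arange(0.0, 1.0, 1e-4)
env_db = 20 * np.log10(np.abs(jakes_waveform(t, f_m=(80/3.6)*900e6/3e8)))
```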

To present a classic scenario, the velocity of a car is set to 80 km/h at 900 MHz. The Rayleigh envelope that results for inputs of v = 80 km/h, f_c = 900 MHz, f_s = 500 kbps, U = 3 and M = 10⁶ is shown in Fig. 11, where U is the number of sub-channels and M is the number of channel coefficients. Figure 12 presents the linear plot of the radiation pattern for the pure CGM and pure NLMS algorithms using the Rayleigh fading channel with the Jakes model (Fig. 11). It shows that the pure CGM generates nulls of about −28 dB at the interference angles −30° and 30°, while the pure NLMS generates nulls of only about −19 dB. Figure 13 presents the linear plot of the radiation pattern for the pure CGM, pure NLMS and CGM-NLMS algorithms. It shows that the CGM-NLMS algorithm generates deeper nulls of about −32 dB and −29 dB at the interference angles −30° and 30°, respectively, deeper than both the pure CGM and pure NLMS algorithms.

Fig. 11. Simulation of Jakes fading model with v = 80 km/h.


Fig. 12. Linear radiation patterns for pure CGM and pure NLMS algorithms / Rayleigh channel.

Fig. 13. Linear radiation patterns for pure CGM, pure NLMS and CGM-NLMS algorithms / Rayleigh channel.

This means that the proposed CGM-NLMS algorithm achieves between 2.5 dB and 11 dB of average improvement in interference suppression compared with the pure CGM and pure NLMS algorithms, respectively. Figure 14 presents the linear plot of the radiation pattern for the pure CGM, pure NLMS and CGM-MNLMS algorithms. It shows that the CGM-MNLMS algorithm generates deeper nulls of about −32 dB at the interference angles −30° and 30°, deeper than both the pure CGM and the CGM-NLMS algorithms.


Fig. 14. Linear radiation patterns for pure CGM, pure NLMS and CGM-MNLMS algorithms / Rayleigh channel.

This means that the second proposed CGM-MNLMS algorithm achieves between 4 dB and 13 dB of average improvement in interference suppression compared with the pure CGM and NLMS algorithms; in other words, the second proposed algorithm achieves a further performance enhancement over the first. Figure 15 shows the magnitude estimation of one element weight (w4) for the first 20 iterations of a single run. As can be observed, the CGM-MNLMS has the best performance among all the algorithms in terms of

Fig. 15. One weight tracking for all algorithms / Rayleigh channel.

fast convergence rate and low misadjustment at steady state. In addition, the CGM-NLMS algorithm performs better than both pure algorithms. Figures 16 and 17 show the MSD and MSE learning curves for all algorithms using an ensemble average of 100 runs of 200 iterations. It can be seen that both proposed algorithms have a faster convergence rate and a lower steady-state MSD than the other algorithms; moreover, the second proposed algorithm reaches a lower steady-state MSD than the first.

Fig. 16. MSD plot for all algorithms / Rayleigh channel.

Fig. 17. MSE plot for all algorithms / Rayleigh channel.


8 Performance analyses

An intuitive justification for the performance enhancement of the two proposed algorithms is as follows. For the NLMS (or MNLMS) algorithm alone, the weights are initialized arbitrarily with w(0) = 0 and then updated. In order to speed up convergence, an initial weight vector obtained from the CGM algorithm is used instead. Once this initial weight vector has been derived, the antenna beam has already been steered close to the incident direction of the desired signal (by the CGM), so when the NLMS (or MNLMS) algorithm begins its adaptation it takes less time to converge than the pure CGM or pure NLMS. Even if the signal environment subsequently changes, the two proposed combined algorithms are able to track these changes. In this paper, we consider systems in which the environmental change is mild (AWGN channel) and strong (Rayleigh fading channel with a Jakes power spectral density). Under these conditions, the NLMS (or MNLMS) algorithm can track the desired signal with a fast convergence time because both NLMS and MNLMS have a time-varying step size. The two proposed algorithms therefore combine the fast convergence rate and deep nulls of the CGM with the low MSD and MSE and good tracking capability of the time-varying step-size NLMS (or MNLMS). The final outcome of the combined algorithms is a fast convergence rate, deep nulls (interference suppression), low MSD and MSE, and high stability at steady state.

9 Conclusion

This paper has presented a new approach to achieve fast convergence and higher interference suppression capability in adaptive beamforming for mobile communications systems. The proposed algorithms merge the CGM in a first stage with the NLMS (or MNLMS) as a second stage. In this way, the desirable fast convergence and high interference suppression capability of the CGM are combined with the better tracking and low MSD and MSE of the NLMS (or MNLMS). The simulation results of adaptive beamforming over the AWGN and Rayleigh fading channels show the performance enhancement of the proposed algorithms in terms of convergence rate and interference suppression capability compared with the pure CGM and pure NLMS algorithms for both radio channels.


Bibliography

[1] S. Haykin. Adaptive Filter Theory. 4th ed. Englewood Cliffs, NJ: Prentice Hall, 2002.
[2] A.H. Sayed. Fundamentals of Adaptive Filtering. New York: Wiley, 2003.
[3] A.D. Poularikas and Z.M. Ramadan. Adaptive Filtering Primer with MATLAB. CRC Press, 2006.
[4] I. Sulyman and A. Zerguine. Convergence and steady-state analysis of a variable step-size NLMS algorithm. Signal Processing, 83:1255–1273, June 2003.
[5] H.C. Shin, A.H. Sayed and W.J. Song. Variable step-size NLMS and affine projection algorithms. IEEE Signal Processing Letters, 11(2):132–135, February 2004.
[6] D.P. Mandic. A generalized normalized gradient descent algorithm. IEEE Signal Processing Letters, 11(2), February 2004.
[7] J. Benesty, H. Rey, L. Rey Vega and S. Tressens. A nonparametric VSS NLMS algorithm. IEEE Signal Processing Letters, 13(10):581–584, October 2006.
[8] Y.S. Choi, H.C. Shin and W.J. Song. Robust regularization for normalized LMS algorithms. IEEE Trans. on Circuits and Systems II: Express Briefs, 53(8):627–631, August 2006.
[9] D. Mandic, P. Vayanos, C. Boukis, B. Jelfs, S.L. Goh, T. Gautama and T. Rutkowski. Collaborative adaptive learning using hybrid filters. IEEE ICASSP’07, III:921–924, April 2007.
[10] S.C. Chan, Z.G. Zhang, Y. Zhou and Y. Hu. A new noise-constrained normalized least mean squares adaptive filtering algorithm. IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 2008.
[11] J. Lee, H.C. Huang and Y.N. Yang. The generalized square-error-regularized LMS algorithm. World Congress on Engineering and Computer Science (WCECS’08), 157–160, October 2008.
[12] J. Lee, J.-W. Chen and H.-C. Huang. Performance comparison of variable step-size NLMS algorithms. World Congress on Engineering and Computer Science, I, San Francisco, USA, October 20–22, 2009.
[13] J. Lee, H.C. Huang, Y.N. Yang and S.Q. Huang. A square-error-based regularization for normalized LMS algorithms. Int. Multi-Conf. of Engineers and Computer Scientists, II, Hong Kong, March 19–21, 2008.
[14] Z. Ramadan. Error vector normalized adaptive algorithm applied to adaptive noise canceller and system identification. American J. of Engineering and Applied Sciences, 3(4):710–717, 2010.
[15] H.-C. Huang and J. Lee. A new variable step-size NLMS algorithm and its performance analysis. IEEE Trans. on Signal Processing, 60(4), April 2012.
[16] Y.-S. Choi. Variable regularization parameter normalized least mean square adaptive filter. Int. J. of Electrical, Computer, Energetic, Electronic and Communication Engineering, 10(1):129–132, 2016.
[17] S. Wang, H. Mi, B. Xi and D.J. Sun. Conjugate gradient-based parameters identification. 8th IEEE Int. Conf. on Control and Automation, Xiamen, China, 1071–1075, June 9–11, 2010.
[18] P.S. Chang and A.N. Willson. Analysis of conjugate gradient algorithms for adaptive filtering. IEEE Trans. on Signal Processing, 48(2):409–418, February 2000.
[19] G.K. Boray and M.D. Srinath. Conjugate gradient techniques for adaptive filtering. IEEE Trans. on Circuits and Systems, 1:1–10, January 1992.
[20] G.D. Mandyam, N. Ahmed and M.D. Srinath. Adaptive beamforming based on the conjugate gradient algorithm. IEEE Trans. on Aerospace and Electronic Systems, 33(1):343–347, 1997.
[21] T. Jun, P. Yingning and W. Xiqin. Application of conjugate gradient algorithm to adaptive beamforming. Int. Symp. on Antennas and Propagation Society, Orlando, FL, USA, 2:1460–1463, 1999.
[22] J. Lee, H. Yu and Y. Sung. Beam tracking for interference alignment in time-varying MIMO interference channels: a conjugate gradient based approach. IEEE Trans. on Vehicular Technology, 63(2):958–964, February 2014.
[23] S.A. Abbas. A new fast algorithm to estimate real-time phasors using adaptive signal processing. IEEE Trans. on Power Delivery, 28(2):807–815, April 2013.
[24] F.B. Gross. Smart Antennas for Wireless Communication. McGraw-Hill, USA, 2005.
[25] J. Mathews and Z. Xie. A stochastic gradient adaptive filter with gradient adaptive step size. IEEE Trans. on Signal Processing, 41(6):2075–2087, 1993.
[26] V.A. Silva, T. Abrao and P.J.E. Jeszensky. Statistically correct simulation models for the generation of multiple uncorrelated Rayleigh fading waveforms. 8th IEEE Int. Symp. on Spread Spectrum Techniques and Applications, Sydney, Australia, August 30–September 2, 2004.

Biography

Thamer M. Jamel was born in Baghdad, Iraq. He graduated from the University of Technology with a Bachelor’s degree in electronics engineering in 1983. He received a Master’s degree in digital communications engineering from the University of Technology in 1990 and a Doctoral degree in communication engineering from the University of Technology in 1997. He is an associate professor in the communication engineering department at the University of Technology, Baghdad, Iraq. His research interests include adaptive signal processing for digital communications systems.

N. Jarray, M. Elhajji and A. Zitouni

Efficient Hardware Architecture of DCT Cordic based Loeffler Compression Algorithm for Wireless Endoscopic Capsule

Abstract: In general, image compression is the best way to decrease the communication bandwidth and save transmission power. However, due to power and size limitations, traditional image compression techniques are not appropriate for the endoscopic capsule, and more specialized techniques are required. The image compressor should sufficiently compress the captured images to save transmission power, retain reconstruction quality for an accurate diagnosis, and occupy a small physical area. To meet all these conditions, we propose an efficient hardware architecture of the Cordic based Loeffler DCT compression algorithm designed for the wireless endoscopic capsule. Our improvement over the original algorithm concerns the Cordic part: it reduces the number of additions in the main algorithm from 38 to 30, while the number of shift operations remains constant (16 shifts). Moreover, to further improve our results, we used the Modified Carry Look-ahead Adder (MCLA) and Carry Save Adder (CSA), which offer low power and high speed compared with the classical Carry Look-ahead Adder (CLA). Our aim is to provide an architecture optimized in terms of area and power consumption while preserving image quality. The proposed design was implemented on a Field Programmable Gate Array (FPGA). Compared with other architectures, the proposed architecture reduces not only the computational complexity but also the area and power consumption. The proposed DCT architecture is very suitable for low-power, high-quality codecs, especially battery-based systems; the preserved image quality is demonstrated by a software implementation using Matlab.

Keywords: Wireless endoscopic capsule, DCT, MCLA, FPGA, Power consumption.

1 Introduction

Thanks to the enormous progress in microelectronics, a wireless endoscopic capsule (WEC) has recently been invented. This capsule allows evaluation of the whole GI tract and the small intestine. The first such capsule was invented by Given Imaging Ltd. [1] at the end of the 20th century. It is equipped with a CMOS sensor, lighting, a data

N. Jarray: University of Monastir, Tunisia, email: [email protected]; M. Elhajji: University of Shaqra, Saudi Arabia, email: [email protected]; A. Zitouni: University of Damman, Saudi Arabia, email: [email protected]

De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 23–36. https://doi.org/10.1515/9783110470383-002


processing module and a transmission unit, as shown in Fig. 1. After being swallowed by a patient, the capsule passes through the GI tract by peristaltic intestinal movements and takes images that are wirelessly transmitted to a recorder carried by the patient. The wireless capsule uses a tiny wireless camera to take images of the digestive tract, producing about 50,000–60,000 digital images for the doctor’s review [2]. The endoscopic capsule needs to be small enough to be swallowed easily and to pass through the human GI tract; generally, it takes 24 hours to move from the mouth to evacuation. The images are transmitted by a wireless radio-frequency transmitter to the workstation, where they are stored [2]. The transmission of the image data consumes about 90 % of the total power in the battery of the endoscopic capsule [2]. The data should therefore be compressed first, to reduce the power of the image data transmission and the communication bandwidth. Recently, much research has been done to improve the performance of the WCE, which presents a complex architecture in which the addition of memory causes higher area consumption and power dissipation. Some work has been reported on image compressors. It is well known that the Discrete Cosine Transform (DCT) has been widely used in many areas, such as image coding. In particular, the two-dimensional (2-D) DCT has been adopted in international standards such as MPEG, JPEG and CCITT [3]. A 2-D DCT can be obtained by applying a 1-D DCT over the rows followed by a 1-D DCT over the columns of an 8×8 data block [4] (a short software sketch is given at the end of this section). Therefore, the implementation of the DCT has become an important issue for real-time embedded systems. In mobile multimedia devices such as digital cameras, cell phones and endoscopic capsules, hardware complexity as well as power consumption have to be minimized. To achieve this goal, many algorithms have been proposed in the literature. In [5, 6] the authors propose a lower-complexity fast DCT algorithm based on the flow graph algorithm (FGA) that requires only 11 multiplications and 29 additions. However, the common disadvantage of all fast DCT algorithms is that they still need floating

Fig. 1. Block diagram of a typical endoscopic system [15].


point multiplication. These operations are very slow in software implementations and require large area and power in hardware. Therefore, there is still a need for new DCT algorithm designs better suited to particular applications. Mathematically, a fast DCT is composed of additions and multiplications by constants. When implemented in hardware, the multiplications by constants are often carried out by sequences of additions and shifts, which is less expensive in terms of chip area and power consumption [7]. Such implementations of transforms are referred to as multiplierless. The binDCT seems to have the most notable result in this field [8]. This transform is based on a VLSI-friendly lattice structure and is derived from a DCT matrix factorization by replacing plane rotations with lifting schemes. Another popular way of implementing a multiplierless DCT is to use the coordinate rotation digital computer (Cordic) algorithm [9][10][21], since the Cordic algorithm leads to a very regular structure suitable for VLSI implementation. In [11], it is concluded that the length of the critical path, i.e. the maximum number of addition operations in cascade, strongly affects the performance of a hardware DCT implementation. It has been shown that a 30 %–40 % decrease in delay and power consumption is obtained after shortening the critical path from 10 to 7, even at the expense of increasing the total number of adders [22]. In 2006, the authors of [12][13] proposed a low-power and high-quality Cordic-based Loeffler DCT architecture, optimized by taking advantage of certain properties of the Cordic algorithm and its implementation. The computational complexity is reduced from 11 multiplications and 29 additions to 38 additions and 16 shift operations. In this paper, we propose an efficient hardware architecture of the 1-D DCT based on the Cordic based Loeffler compression algorithm for the wireless endoscopic capsule, optimized by taking advantage of certain properties of the novel Cordic-based unified architecture for the DCT and IDCT [14]. It requires only 30 additions and 16 shift operations. The resulting Cordic based Loeffler DCT architecture not only reduces the computational complexity and power consumption significantly, but also retains the good transformation quality of the previous Cordic based Loeffler DCT. Therefore, the presented Cordic-based Loeffler DCT implementation is especially suited for low-power and high-quality CODECs, such as mobile handsets and digital cameras. This paper is structured as follows: Section 2 introduces the Cordic based Loeffler DCT algorithm and some related works; Section 3 explains the proposed Cordic based Loeffler DCT architecture; the experimental results are shown in Section 4; and Section 5 concludes the paper.
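To make the row-column separability mentioned above concrete, here is a floating-point reference model of the 8×8 2-D DCT using SciPy's type-II DCT; this is the transform that multiplierless hardware approximates, not the paper's fixed-point design.

```python
import numpy as np
from scipy.fft import dct

def dct2_8x8(block):
    """Separable 2-D DCT of an 8x8 block: a 1-D DCT over the rows
    followed by a 1-D DCT over the columns (type-II, orthonormal)."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

block = np.arange(64, dtype=float).reshape(8, 8)
coeffs = dct2_8x8(block)   # the DC term sits at coeffs[0, 0]
```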

2 Cordic based Loeffler DCT algorithm

A few years ago, an optimized Cordic based Loeffler DCT implementation required 38 additions and 16 shift operations, as shown in Fig. 2. To perform a multiplier


Fig. 2. Flow graph of an 8-point Cordic based Loeffler DCT [12].

less DCT transformation, researchers combine the Cordic and Loeffler algorithms to avoid using multiplications because of their complexity; the DCT algorithm then becomes very simple to implement. They took the original Loeffler DCT as the starting point and replaced its plane rotations by the circular rotation of the Cordic algorithm. To realize the rotation of a vector (X, Y) by an angle θ, the circular rotation angle is described as follows [12]:

Θ = Σ_{i=1} α_i tan^{-1}(2^{-i})   (1)

with α_i ∈ {1, −1}, where i is the rotation iteration and α_i the vector rotation direction. The vector rotation of (X, Y) can then be achieved using the iterative equations:

X_{i+1} = x_i − α_i y_i 2^{-i}
Y_{i+1} = y_i + α_i x_i 2^{-i}   (2)

Moreover, the results of the rotation iteration needed to be scaled by a compensation factor s. X i+1 = x i (1 + 𝛾i F i ) Y i+1 = y i (1 + 𝛾i F i )

(3)

with ∏(1 + 𝛾i F i ), 𝛾 = (0, 1, −1), F i = 2−i When using Cordic algorithm to replace the multiplications of the 8 DCT points whose rotation angles θ are fixed, it can skip some unnecessary Cordic iterations without losing accuracy. As shown in Tab. 1. which summarizes the Cordic iteration of Cordic based Loefler DCT.


Tab. 1. Cordic based Loeffler DCT parameters [13].

Angle                                          π/4       3π/8      π/16      3π/16
Rotation iterations [σ_i, i] per eq. (2):
  1                                            −1,0      −1,0      −1,3      −1,1
  2                                            0         −1,1      −1,4      −1,3
  3                                            −         ±1,4      −         −
Compensation iterations [1 + γ_i F_i] per eq. (3):
  1                                            −         −         −         1−(1/8)
  2                                            −         −         −         1+(1/64)

3 Proposed architecture of Cordic based Loeffler DCT

3.1 Optimized Cordic based Loeffler DCT algorithm

On the basis of the previous work on the Cordic based Loeffler DCT [13], we propose an optimized Cordic based Loeffler DCT algorithm. This implementation requires only 30 add and 16 shift operations. We took the original Cordic based Loeffler DCT as the starting point for our optimization; our improvement over the previous algorithm lies in the Cordic part. According to Tab. 1, the unfolded Cordic flow graph of the 3π/16 angle, shown in Fig. 3, needs 8 shifts and 8 additions to evaluate its rotation angle. The rotation can be evaluated more efficiently when the iteration indices i and j are large enough: using the two following equations, two iterations are merged into one [14][16]:

X_{k+i+j} = x_k − σ_i y_k 2^{−i} − σ_j y_k 2^{−j} − σ_i σ_j x_k 2^{−(i+j)}
          ≈ x_k − σ_i y_k 2^{−i} − σ_j y_k 2^{−j}


Fig. 3. Unfolded flow graph of the 3π/16 angle [12].

Y_{k+i+j} = y_k + σ_i x_k 2^{−i} + σ_j x_k 2^{−j} − σ_i σ_j y_k 2^{−(i+j)}
          ≈ y_k + σ_i x_k 2^{−i} + σ_j x_k 2^{−j}   (4)

Similarly, the compensation iterations can be approximated as follows:

X_{k+i+j} = x_k (1 + γ_i F_i + γ_j F_j + γ_i γ_j F_{i+j}) ≈ x_k (1 + γ_i F_i + γ_j F_j)
Y_{k+i+j} = y_k (1 + γ_i F_i + γ_j F_j + γ_i γ_j F_{i+j}) ≈ y_k (1 + γ_i F_i + γ_j F_j)   (5)

with γ_{i,j} ∈ {0, 1, −1} and F_{i,j} = 2^{−i}, 2^{−j}. Based on these two equations, the conventional unfolded Cordic is modified so that it needs fewer shift and addition operations. The modified unfolded Cordic flow graph of the 3π/16 angle is shown in Fig. 4. According to Fig. 4, evaluating the 3π/16 rotation angle requires only 4 additions and 8 shifts; the number of adders decreases compared to the conventional unfolded Cordic. We apply the same principle to the two other angles, π/16 and 3π/8 (Fig. 5 and Fig. 6). Compared to the conventional unfolded π/16 and 3π/8 Cordic angles, the number of additions is reduced from 4 to 2 and from 6 to 4 respectively. Consequently, the total number of additions in the main Cordic based Loeffler DCT algorithm is reduced from 38 to 30.
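To illustrate the merging numerically, the following Python sketch (our own floating-point model, not the paper's fixed-point hardware) evaluates the 3π/16 rotation with the iteration pair [σ, i] = [1, 1], [1, 3] and the compensation factors (1 − 1/8)(1 + 1/64) from Tab. 1; the sign convention is chosen here so the rotation is counter-clockwise, and every multiplication by 2^{−i} stands for a hard-wired shift:

    import math

    def rot_3pi16_merged(x, y):
        # Merged micro-rotations for i = 1 and i = 3 (cf. Tab. 1); per
        # eq. (4) the 2^-(i+j) cross term is dropped, so both shifted
        # terms are taken from the original x and y.
        xm = x - (y * 2**-1 + y * 2**-3)
        ym = y + (x * 2**-1 + x * 2**-3)
        # Compensation (1 - 1/8)(1 + 1/64) per eq. (5), realized as two
        # shift-and-add stages.
        xm, ym = xm - xm * 2**-3, ym - ym * 2**-3
        xm, ym = xm + xm * 2**-6, ym + ym * 2**-6
        return xm, ym

    theta = 3 * math.pi / 16
    exact = (math.cos(theta), math.sin(theta))       # rotation of (1, 0)
    approx = rot_3pi16_merged(1.0, 0.0)
    print("exact :", [round(v, 4) for v in exact])   # [0.8315, 0.5556]
    print("merged:", [round(v, 4) for v in approx])  # [0.8887, 0.5554]

The residual error in the first output comes from the dropped 2^{−(i+j)} cross term, which is precisely the accuracy/complexity trade-off discussed above.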

3.2 DCT Cordic based Loeffler architecture

Based on the proposed modified algorithm, an 8-point Cordic-Loeffler-based DCT architecture is presented. The modified Cordic-Loeffler architecture is adopted to improve the performance and reduce the hardware complexity.


Fig. 4. Modified unfolded flow graph of the 3π/16 angle.



Fig. 5. Modified unfolded flow graph of the π/16 angle.


Fig. 6. Modified unfolded flow graph of the 3π/8 angle.

The proposed architecture of the DCT based Cordic-Loeffler is shown in Fig. 7. It consists of adders, subtractors and the modified Cordic blocks. In addition, to further improve the efficiency of the architecture, it is important to speed up the adders. Therefore an MCLA (Modified Carry Look-ahead Adder, shown in Fig. 8) is used, because of its high speed and low cost [17]. The modified carry look-ahead adder is similar in its basic construction to the CLA adder: it contains an arithmetic adder circuit and a carry look-ahead circuit. To make it faster, the authors of [17] proposed replacing the AND and NOT gates of the CLA adder with NAND gates, in order to decrease the cost and increase the speed. Within the Cordic blocks, fast addition is equally important: according to Fig. 5, the architecture of the modified unfolded Cordic is implemented as shown in Fig. 9 using CSA (Carry Save Adder) blocks and hard-wired shifters.
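As a word-level illustration of the carry look-ahead principle that the MCLA accelerates, the following Python sketch derives all carries of a 4-bit addition from generate/propagate signals; the MCLA of [17] computes the same Boolean functions with NAND gates only, a gate-level detail this model does not capture:

    def cla4(a, b, cin=0):
        # Carry look-ahead addition: carries are derived from generate
        # (g) and propagate (p) signals instead of rippling. The
        # look-ahead unit flattens the recurrence into two gate levels;
        # this loop computes the same values.
        g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # propagate
        c = [cin]
        for i in range(4):
            c.append(g[i] | (p[i] & c[i]))
        s = sum((p[i] ^ c[i]) << i for i in range(4))
        return s, c[4]

    assert cla4(0b1011, 0b0110) == (0b0001, 1)   # 11 + 6 = 17 = 0b10001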



Fig. 7. Architecture of the DCT based Cordic-Loeffler.


Fig. 8. 4-bit MCLA [17].

4 Experimental results and comparisons

The proposed architecture is described in VHDL (VHSIC Hardware Description Language) and synthesized via Xilinx ISE 13.1 using a Virtex-5 FPGA as the target device. The synthesis results show that the architecture occupies 613 slices out of 93120 and 394 LUTs out of 46560, and operates at about 226.2 MHz. In addition, it consumes about 0.037 W of power. The proposed modified DCT-based Cordic-Loeffler algorithm has a low computational complexity compared to the other designs: as shown in Fig. 10, the proposed design requires only 30 additions and 16 shifts, compared with


the previous Cordic-based Loeffler DCT, which requires 38 additions and 16 shifts. It also has a lower computational complexity than the binDCT algorithm, which needs 36 additions and 17 shift operations. Therefore, the proposed design is more efficient and has a lower hardware complexity than both the original Cordic based Loeffler DCT algorithm and the binDCT. It is suitable for low-power and high-quality CODECs, especially for battery-powered systems.


Fig. 9. Architecture of the modified unfolded Cordic.

Fig. 10. Comparison of the computational complexity (multiplication, addition and shift counts) of the proposed design with the Cordic DCT, Cordic Loeffler DCT, Loeffler DCT and binDCT algorithms.

Fig. 11. Original and reconstructed endoscopic images (PSNR = 38.37 dB, 37.31 dB, 35.57 dB, 36.10 dB and 37.58 dB).


Tab. 2. Comparison of performance.

                        Compression algorithm    PSNR (dB)
Wahid et al. [15]       DCT                      32.95
Turcza et al. [19]      DCT                      36.49
Lin et al. [20]         DCT                      32.51
Proposed algorithm      DCT                      36.98

In order to verify the performance of our proposed DCT Cordic-Loeffler algorithm, we used Matlab. In this paper, we use grey-scale endoscopic images of different parts of the gastrointestinal tract [18] as test images, as shown in Fig. 11, which presents the original and the reconstructed images. To measure the reconstructed image quality, we use the peak signal-to-noise ratio (PSNR), calculated using equation (6):

PSNR = 20 · log_{10} (255 / RMSE),   RMSE = √( (1/(M·N)) ∑_{n=1}^{N} ∑_{m=1}^{M} (x_{m,n} − x′_{m,n})² )   (6)

where M and N are respectively the image width and height, and x and x′ are the original and reconstructed pixel values respectively. The qualitative results are shown in Fig. 11; it is clear from these results that our proposed algorithm improves the visual image quality in terms of peak signal-to-noise ratio. It can be seen from the simulation results that our proposed algorithm performs well, with the greatest PSNR value equal to 38.37 dB. In addition, the average PSNR of the images in Fig. 11 is 36.98 dB, which is about 4 dB higher than that of [15] and [20], as shown in Tab. 2. In all cases, the PSNR of our reconstructed images is well above 30 dB, which is highly acceptable to medical doctors for diagnosis. This implies that the proposed DCT architecture not only reduces the computational complexity, area and power consumption, but also improves the quality of the results in terms of PSNR.
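A minimal numpy sketch of this PSNR computation (the image and the distortion used here are synthetic placeholders, not the endoscopic test set):

    import numpy as np

    def psnr(original, reconstructed):
        # PSNR per eq. (6) for 8-bit images (peak value 255).
        x = original.astype(np.float64)
        xr = reconstructed.astype(np.float64)
        rmse = np.sqrt(np.mean((x - xr) ** 2))
        return 20 * np.log10(255.0 / rmse)

    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, (256, 256))
    rec = np.clip(img + rng.integers(-3, 4, img.shape), 0, 255)
    print(f"PSNR = {psnr(img, rec):.2f} dB")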

5 Conclusion

Nowadays, image compression is very important for decreasing the communication bandwidth and saving transmission power. Due to the small size and power limitation of the wireless endoscopy capsule, the image compressor should be able to compress the captured image sufficiently to save transmission power, occupy a small physical area and retain reconstruction quality adequate for an accurate diagnosis. To meet all these conditions, an efficient hardware architecture of the Cordic based Loeffler DCT compression algorithm designed for the wireless endoscopic capsule has been proposed. Compared to the previous Cordic based Loeffler DCT architecture,


the proposed algorithm has the lowest computational complexity: it requires only 30 additions and 16 shift operations to perform the DCT transform. Furthermore, the efficient MCLA and CSA adders are used to implement the Cordic based Loeffler DCT architecture. Thus, it not only reduces the computational complexity, but also reduces the area and power consumption compared to the conventional Cordic based Loeffler DCT algorithm, while keeping a high-quality output image. The results show that the proposed work achieves the highest PSNR values. In this regard, the proposed DCT architecture is very suitable for low-power and high-quality CODECs, especially for battery-powered systems.

Bibliography

[1] G. Iddan, G. Meron, A. Glukhovsky and P. Swain. Wireless capsule endoscopy. Nature, 405(6785):417, 2000.
[2] X. Xie, G. Li, X. Chen, L. Liu, C. Zhang and P. Wang. A low power digital IC design inside the wireless endoscopy capsule. National Nature Science Foundation of China.
[3] R.C. Gonzalez and R.E. Woods. Digital Image Processing. Prentice-Hall, New Jersey, 2001.
[4] S. Wolter, D. Birreck and R. Laur. Classification for 2D DCTs and a new architecture with distributed arithmetic. IEEE Int. Symp. on Circuits and Systems, pp. 2204–2207, 11–14 June 1991.
[5] K.R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press Professional, Inc., 1990.
[6] C. Loeffler, A. Ligtenberg and G.S. Moschytz. Practical fast 1-D DCT algorithms with 11 multiplications. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 988–991, 1989.
[7] A.C. Zelinski, M. Püschel, S. Misra and J.C. Hoe. Automatic cost minimization for multiplierless implementations of discrete signal transforms. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 221–224, 2004.
[8] T.D. Tran. The binDCT: fast multiplierless approximation of the DCT. IEEE Signal Processing Letters, 141–144, 2000.
[9] M. Parfieniuk. Shortening the critical path in CORDIC-based approximations of the eight-point DCT. Int. Conf. on Signal and Electronic Systems (ICSEC), pp. 405–408, 2008.
[10] S. Yu and E.E. Swartzlander Jr. A scaled DCT architecture with the CORDIC algorithm. IEEE Transactions on Signal Processing, 160–167.
[11] C.C. Sun, S.J. Ruan, B. Heyne and J. Götze. Low-power and high-quality Cordic-based Loeffler DCT for signal processing. IET Circuits, Devices & Systems, 623–655, 2007.
[12] C.C. Sun, B. Heyne and S.J. Ruan. A low power and high quality Cordic-based Loeffler DCT. IEEE conference, 2006.
[13] C.C. Sun, S.J. Ruan, B. Heyne and J. Götze. Low-power and high-quality Cordic-based Loeffler DCT for signal processing. IET Circuits, Devices & Systems, 453–461, 2007.
[14] L. Xiao. A novel Cordic-based unified architecture for DCT and IDCT. Int. Conf. on Optoelectronics and Microelectronics (ICOM).


[15] K. Wahid, S.-B. Ko and D. Tang. Efficient hardware implementation of image compression for wireless capsule endoscopy applications. Int. Joint Conf. on Neural Networks, pp. 2761–2765, 2008.
[16] H. Huang and L. Xiao. Variable-length reconfigurable algorithms and architectures for DCT/IDCT based on modified unfolded Cordic. The Open Electrical and Electronic Engineering Journal, 71–81, 2013.
[17] Y.T. Pai and Y.K. Chen. The fastest carry look-ahead adder. Proc. of the 2nd IEEE Int. Workshop on Electronic Design, 2004.
[18] Gastrolab [Online] (2016). Available at: http://www.gastrolab.net.
[19] P. Turcza and M. Duplaga. Low power image compression for wireless capsule endoscopy. Proc. IEEE Int. Workshop on Imaging Systems and Techniques, pp. 1–4, 2007.
[20] M.-C. Lin, L.-R. Dung and P.-K. Weng. An ultra-low-power image compressor for capsule endoscope. BioMedical Engineering OnLine, 1–8, 2006.
[21] B. Heyne, C.C. Sun, J. Götze and S.J. Ruan. A computationally efficient high-quality Cordic based DCT. Signal Processing Conference, pp. 1–5, 2006.
[22] C.C. Sung, S.J. Ruan, B.Y. Lin and M.C. Shie. Quality and power efficient architecture for the discrete cosine transform. IEICE Transactions on Fundamentals, Special Section on VLSI Design and CAD Algorithms, 2005.

Biographies

Nedra Jarray was born in Tunisia in 1988. She received her MS degree in Physics (Electronics option) from the Faculty of Sciences of Monastir, Tunisia. Currently, she is preparing her PhD degree in Electronics and Microelectronics at the Faculty of Sciences of Monastir.

Majdi Elhajji was born in Tunisia in 1982. He received his PhD degrees in computer science and in Physics from, respectively, the University of Lille 1 and the Faculty of Sciences of Monastir in 2012. Since then he has been an associate professor in electronics and microelectronics with the Physics Department of the Faculty of Sciences of Monastir. His research interests include video coding, Network on Chip, System on Chip co-design, synthesis and simulation, performance evaluation and model-driven engineering.


Abdelkrim Zitouni was born in Tunisia in 1970. He received the DEA and PhD degrees in Physics (Electronics option) from the Faculty of Sciences of Monastir, Tunisia, in 1996 and 2001 respectively, and the HDR degree in Physics (Electronics option) in 2009. Since then he has been an associate professor in electronics and microelectronics with the Physics Department of the Faculty of Sciences of Monastir, and a full professor since 2014 in the same institution. He is currently an associate professor at the University of Dammam, KSA. His research interests are communication synthesis for SoC, video coding and asynchronous system design. He is the author of two books on synchronous and asynchronous circuit design and of more than 75 journal and conference papers. He leads a research group working on communication synthesis for SoC, asynchronous system design and video processing.

M. Langar, R. Bourguiba and J. Mouine

Exploring the Physical Characteristics of an On-chip Router for its Integration in a 3D-mesh NoC

Abstract: Three-dimensional Networks-on-Chip (3D-NoCs) represent a new generation of communication structures that enhance performance and overcome the limitations of two-dimensional NoCs. By replacing long interconnects with much shorter high-speed vertical links called Through-Silicon Vias (TSVs), they offer fast and power-efficient inter-core communication. However, vertical TSV links bring several problems due to their large area footprint and increased manufacturing cost. In this context, we propose a new 3D-mesh NoC architecture based on shared TSVs and serialized data. This architecture is constructed using 2D and 3D routers. In this paper, we describe the design and implementation of the 2D router that ensures the intra-layer communication. It is characterized by a simple virtual channel allocation scheme that improves its area and power consumption characteristics.

Keywords: Network on Chip, 3D-Mesh, Router, Virtual channel.

1 Introduction

Three-dimensional chip technology is a promising approach to deal with the integration difficulties faced by current Systems-on-Chip (SoCs). By vertically stacking multiple silicon layers, it increases transistor density and allows heterogeneous technologies to be integrated. It also reduces the delay and power consumption of interconnections by replacing the long horizontal wires of a 2D circuit with shorter high-speed vertical links called Through-Silicon Vias (TSVs) [18]. Despite their benefits, TSV connections have several drawbacks [13, 18]. Firstly, TSV pads occupy a significant area. Secondly, a large number of TSVs implies a large number of pads in the connected layers, which increases routing congestion. Finally, the additional manufacturing steps for TSVs involve a potential yield reduction, so fabricating a large number of TSVs accentuates this problem. Thus, it is important to find the best trade-off between performance and manufacturing cost when using TSV connections. Designing a 3D Network on Chip that meets the constraints imposed by 3D technology is a promising interconnect solution for a 3D SoC.

M. Langar, R. Bourguiba and J. Mouine: M. Langar, National Engineering School of Tunis, Tunisia, email: [email protected] , R. Bourguiba, National Engineering School of Tunis, Tunisia, email: [email protected] , J. Mouine, Prince Sattam Bin Abdulaziz University, Saudi Arabia, email: [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 37–46. https://doi.org/10.1515/9783110470383-003


Several works [5–7, 14] proposed 3D NoCs using serialized data communication over the vertical TSV links. Although this solution minimizes the TSV area footprint and reduces the routing complexity, it has an impact on performance due to the serialization overhead and the loss of vertical link bandwidth. In [16], Rahmani et al. introduce a new 3D NoC architecture based on power-aware Bidirectional Bisynchronous Vertical Channels (BBVC). It replaces the pair of unidirectional vertical links between layers by a dynamically self-configurable BBVC operating at a high frequency. This architecture reduces the TSV area footprint and enhances the NoC power consumption, but it does not provide single-hop vertical communication. Liu et al. [10] present a partially connected 3D NoC topology constructed using only a few selected vertical TSV links. It allows neighboring routers to share the same TSV in a time-division multiplexing mode. This structure reduces the TSV area overhead and improves TSV utilization. Our research considers the design of a 3D-NoC with a mesh topology using shared vertical links and serialized data. Communication between IP cores in the same layer is done using 2D routers, while communication between layers is done via 3D routers, as described in Fig. 1. Four or fewer neighboring 2D routers of the same layer are grouped and share the

Fig. 1. “4×2×2” 3D-mesh NoC.


same TSV through the 3D router. When a flit enters the 3D-NoC, it is first routed in the Z-direction, then it is routed in the XY-directions using the XY routing algorithm. As for data serialization, it is generally performed at a relatively high frequency using a conventional parallel-in/serial-out shift register. In order to reduce the serialization frequency and ensure a high-speed data transfer on the TSV, we propose the use of a multi-stage Serializer/Deserializer (SerDes) [19]. Communication between the different layers can be done in asynchronous or synchronous mode, with synchronization ensured by bisynchronous FIFOs. In this paper, we are interested in the design and implementation of the 2D router that will later be used to build the 3D mesh NoC. We propose an architecture using a new virtual channel allocation mechanism which improves the performance of the router in terms of area occupancy and power consumption. The remainder of this paper is organized as follows. Section 2 presents NoCs and their characteristics and specifies the characteristics of the proposed NoC. Section 3 presents the architecture of the 2D virtual channel router and details the strategy of its virtual channel allocator. Section 4 presents the implementation results using the Tezzaron technology. Finally, conclusions are drawn.

2 Network on chip

The Network on Chip paradigm provides a solution to the global wire delay problem and makes the integration of a high number of IPs in a single SoC easier [1, 2]. This packet-based communication approach is characterized by its scalability and high bandwidth. A NoC is composed of a set of routers, Network Interfaces (NIs) and links. The network interface connects IP cores to the network by adapting the NoC protocol to that of the IP cores and vice versa. The router is the most important component of the NoC: it steers and coordinates the data flow. The links physically connect the different IP cores and NIs together. A NoC is characterized by its topology, routing algorithm, switching technique and flow control method. The proposed NoC has a 3D mesh topology. It uses the virtual channel wormhole switching technique [3, 4] with the stop/go protocol [8]. The choice of wormhole switching is justified by the fact that it relaxes the constraints on buffer size compared to store-and-forward and virtual cut-through switching [17], and the use of virtual channels enhances the NoC performance in terms of latency and bandwidth. As illustrated in Fig. 1, the 2D router is one of the basic components of our proposed 3D mesh NoC; the next section provides an overview of its architecture.


3 Proposed 2D virtual channel router

Figure 2 illustrates the structure of the proposed router. It has six input/output ports: one port from the local IP core through the NI, one port from the 3D router allowing communication between layers in the Z-direction, and four ports from the four neighboring directions (North, East, South and West). Each input port has four virtual channels, implemented by four dedicated FIFO buffers.


Fig. 2. 2D Virtual channel router architecture.

The packet is divided into multiple FLow control unITS (flits): a head flit, one or multiple body flits and a tail flit. The head flit allocates router resources, the body flits carry the payload of the packet, and the tail flit releases the router resources allocated to the packet. The main components that constitute the router are the input buffers, the routing logic (R), the virtual channel allocator (VCA), the switch allocator (SA) and the crossbar. When a head flit arrives at an input port of the router, it is first decoded and buffered in the FIFO corresponding to its input VC. Then, its output port is calculated by the routing logic and its output VC is reserved by the virtual channel allocator. Once an output VC has been successfully allocated, the head flit requests access to its output port from the switch allocator. Finally, it traverses the crossbar to its destination. All these operations are done within one clock cycle.


Body and tail flits undergo the same operations except the routing computation and VC allocation, because they must inherit the resources reserved by the head flit. The following sections describe in detail the different components of the router.

3.1 Routing logic

It computes the appropriate output Physical Channel (PC) for the head flit of a new packet and/or dictates the valid output Virtual Channels within the selected PC. The routing computation uses the destination address of the packet, which is present in the head flit. Our 2D router implements the XY deterministic routing algorithm because it is simple and deadlock-free [12].
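As an illustration, a minimal Python sketch of deterministic XY routing on a mesh; the coordinate representation and port names are our own, not taken from the paper, and only the intra-layer XY part is modeled (the Z-direction is handled by the 3D router):

    def xy_route(cur, dst):
        # Route fully in X first, then in Y; deliver locally when both
        # coordinates match. cur and dst are (x, y) mesh coordinates.
        cx, cy = cur
        dx, dy = dst
        if dx != cx:
            return "EAST" if dx > cx else "WEST"
        if dy != cy:
            return "NORTH" if dy > cy else "SOUTH"
        return "LOCAL"

    assert xy_route((1, 1), (3, 0)) == "EAST"   # X is resolved first
    assert xy_route((3, 0), (3, 2)) == "NORTH"  # then Y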

3.2 Virtual channel allocator

It resolves the contention for output VCs by arbitrating between all packets requesting access to the same output VCs. In this paper, we focused on the virtual channel allocation mechanism in order to improve the performance of our 2D router. Hence we implemented a new VCA [9] that simplifies the logic of the classical separable VCA used by the typical VC router [11, 15]. It consists of two main steps:
– The first step is composed of two arbitration stages. The first stage allows only one input VC from each input port to request an output VC; it requires a 4-input arbiter for each input port to reduce the number of requests to one. The winning requests from all input ports then proceed to the second arbitration stage, which resolves the contention for output ports. As any of the granted input VCs may request any output port, it requires a 6-input arbiter for each output port. Figure 3 illustrates the first step of the proposed VCA scheme.
– The second step allocates a free output VC to each input VC that won the first step.
When compared to the classical VCA, the proposed VCA minimizes the number of virtual channel requests per cycle. In the classical VCA, all input VCs of an input port are allowed to make requests simultaneously, whereas in the proposed VCA the 4:1 arbiter selects only one of them, so the number of requests per cycle is reduced. Also, at the output port, the classical VCA can grant one request per output virtual channel, while the proposed VCA uses only one arbiter per port, so the maximum number of grants per cycle is reduced to one. (A behavioral sketch of this two-stage arbitration is given after Fig. 3.)


Fig. 3. Complexity of the first step of VCA.

As the data path of the router allows only one flit per cycle to go through an output port, the reduction of the number of parallel virtual channel requests per cycle does not impact performance. The proposed allocation mechanism therefore improves area occupancy as well as power consumption, since it reduces the number and the size of the arbiters.
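The following Python sketch models the first step of this allocation behaviorally, with round-robin arbiters; it is a simplified illustration under our own naming, not the RTL of the router:

    def rr_arbiter(requests, last):
        # Round-robin arbiter: grant the first requester after 'last'.
        n = len(requests)
        for k in range(1, n + 1):
            i = (last + k) % n
            if requests[i]:
                return i
        return None

    def vca_first_step(port_requests, last4, last6):
        # port_requests[p][v]: output port requested by VC v of input
        # port p, or None if that VC has no pending request.
        # Stage 1: a 4:1 arbiter per input port picks one requesting VC.
        stage1 = {}
        for p, vcs in enumerate(port_requests):
            v = rr_arbiter([r is not None for r in vcs], last4[p])
            if v is not None:
                stage1[p] = (v, vcs[v])
        # Stage 2: a 6:1 arbiter per output port picks one winning port.
        grants = {}
        for out in range(6):
            contenders = [p in stage1 and stage1[p][1] == out
                          for p in range(6)]
            p = rr_arbiter(contenders, last6[out])
            if p is not None:
                grants[out] = (p, stage1[p][0])
        return grants

    # Example: VC 2 of port 0 and VC 1 of port 3 both want output port 5.
    reqs = [[None] * 4 for _ in range(6)]
    reqs[0][2] = 5
    reqs[3][1] = 5
    print(vca_first_step(reqs, [0] * 6, [0] * 6))  # -> {5: (3, 1)}

At most one grant per output port is issued per cycle, which matches the single-flit-per-port data path noted above.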

3.3 Switch allocator

It arbitrates between all VCs requesting access to the crossbar and gives permission to the winning flits. It employs the same separable allocation scheme implemented by the typical VC router [11, 15]. This allocation scheme consists of two arbitration stages: the first stage arbitrates between all VCs within the same input port; the second stage arbitrates amongst the winning requests from each input port for each output port.

3.4 Crossbar

It passes the flits that won arbitration to the appropriate output channels. We implemented a symmetric crossbar structure composed of a set of multiplexers.


4 Results

We implemented the proposed router using the 130 nm Tezzaron technology [20] in order to measure its performance in terms of area occupancy and power consumption. This technology was chosen because the router will be used in future work to design a 3D mesh NoC. We considered a router having 4 VCs per input port; each VC can hold at most 4 flits of 32 bits. We first wrote its VHDL code and simulated it using Mentor Graphics ModelSim, driving it through a test bench. Then, we synthesized the VHDL code using Cadence RTL Compiler; this step converts the RTL description of the router into a logic gate netlist. We constrained the design and verified that it works properly at a 200 MHz target frequency. Finally, we performed the placement and routing of the synthesized gate-level netlist using Cadence Encounter. The synthesis and place & route results are summarized in Table 1 below:

Tab. 1. Synthesis and place & route results.

Technology           Target clock frequency (MHz)    Area (mm²)    Average power (dynamic + static) (mW)
Tezzaron (1.32 V)    200                             0.2           50.13

5 Conclusion

In this paper, we detailed the design of an enhanced 2D on-chip router characterized by its simple virtual channel allocation strategy. The latter will be used in future work to implement a 3D-mesh NoC based on shared TSVs and serialized data, in which it ensures the intra-layer communication. To measure its performance in terms of area and power consumption, this router was implemented in the Cadence digital ASIC design flow using the 130 nm Tezzaron technology.

Bibliography

[1] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. Computer, 35(1):70–78, January 2002.
[2] W.J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. Design Automation Conf., pp. 684–689, 2001.
[3] W.J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[4] W.J. Dally. Virtual-channel flow control. SIGARCH Comput. Archit. News, 18(2SI):60–68, May 1990.
[5] F. Darve, A. Sheibanyrad, P. Vivet and F. Petrot. Physical implementation of an asynchronous 3D-NoC router using serial vertical links. IEEE Computer Society Annual Symp. on VLSI, pp. 25–30, July 2011.
[6] Y. Ghidini, M. Moreira, L. Brahm, T. Webber, N. Calazans and C. Marcon. Lasio 3D NoC vertical links serialization: evaluation of latency and buffer occupancy. 26th Symp. on Integrated Circuits and Systems Design (SBCCI), pp. 1–6, September 2013.
[7] A. Kologeski, C. Concatto, D. Matos, D. Grehs, T. Motta, F. Almeida, F.L. Kastensmidt, A. Susin and R. Reis. Combining fault tolerance and serialization effort to improve yield in 3D networks-on-chip. 20th Int. Conf. on Electronics, Circuits, and Systems (ICECS), pp. 125–128, December 2013.
[8] S. Kundu and S. Chattopadhyay. Network-on-Chip: The Next Generation of System-on-Chip Integration. CRC Press, 2014.
[9] M. Langar, R. Bourguiba and J. Mouine. Design and implementation of an enhanced on-chip mesh router. 12th Int. Multi-Conf. on Systems, Signals & Devices (SSD), pp. 1–4, March 2015.
[10] C. Liu, L. Zhang, Y. Han and X. Li. Vertical interconnects squeezing in symmetric 3D mesh network-on-chip. 16th Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 357–362, January 2011.
[11] R. Mullins, A. West and S. Moore. Low-latency virtual-channel routers for on-chip networks. 31st Annual Int. Symp. on Computer Architecture, pp. 188–197, June 2004.
[12] M. Palesi and M. Daneshtalab. Routing Algorithms in Networks-on-Chip. Springer, 2013.
[13] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. 46th ACM/IEEE Design Automation Conf. (DAC'09), pp. 581–586, July 2009.
[14] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. 46th ACM/IEEE Design Automation Conf. (DAC'09), pp. 581–586, July 2009.
[15] L.S. Peh and W.J. Dally. A delay model and speculative architecture for pipelined routers. 7th Int. Symp. on High-Performance Computer Architecture, pp. 255–266, 2001.
[16] A.-M. Rahmani, P. Liljeberg, J. Plosila and H. Tenhunen. Developing a power-efficient and low-cost 3D NoC using smart GALS-based vertical channels. J. of Computer and System Sciences, 79(4):440–456, June 2013.
[17] E. Rijpkema, K.G.W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. Design, Automation and Test in Europe Conf. and Exhibition, pp. 350–355, 2003.
[18] A. Sheibanyrad, F. Pétrot and A. Jantsch. 3D Integration for NoC-based SoC Architectures. Springer, 2011.
[19] M. Zid, A. Scandurra, R. Tourki and C. Pistritto. A high-speed four-phase clock generator for low-power on-chip SerDes applications. Microelectronics J., 42(9):1049–1056, 2011.
[20] Tezzaron. http://tezzaron.com/.


Biographies

Manel Langar was born in Tunis, Tunisia, in 1985. She received the engineering degree in electrical engineering, the Master degree in Perception and Digital Communications, and the PhD degree in electrical engineering from the National Engineering School of Tunis, Tunisia, in 2008, 2010 and 2016, respectively. Her current research interests are related to embedded systems, NoC and MPSoC design.

Riadh Bourguiba was born in Paris, France, in 1973. He received the Bachelor degree in Electrical Engineering from the University of Pierre et Marie Curie, Paris, in 1995, the Master degree in Electronic Systems for Information Processing from the Faculty of Sciences of Orsay, Paris-Sud, and the PhD degree in Signal and Image Processing from the University of Cergy-Pontoise and the National Engineering School of Applied Electronics, France, in 2000. In 2001, he joined Prosilog, a French start-up specialized in developing CAD tools for designing SoCs with SystemC, as a project manager, before joining the Department of Electrical Engineering of the National Engineering School of Tunis, Tunisia, as an assistant professor in 2004. Since then, he has been in charge of teaching digital electronic system design. His current research interests are related to advanced microsystems, run-time reconfiguration, NoC and MPSoC design.

Jaouhar Mouine was born in Nabeul, Tunisia, in 1965. He received the engineering degree in electrical engineering from the University of Québec at Trois-Rivières, Trois-Rivières, Canada, in 1986, and the Master and PhD degrees in electrical engineering from the University of Sherbrooke, Sherbrooke, Canada, in 1988 and 1992, respectively. In 1994, he joined the Department of Electrical Engineering, University of Sherbrooke, as an assistant professor, and in 2001 became a full professor. From 2003 to 2015, he was with the Department of Electrical Engineering, National Engineering School of Tunis, Tunis, Tunisia, where he has also been Director of Studies and Deputy Director. Currently, he is a full professor at the Department of Electrical Engineering, College of Engineering, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia. His current research interests are related to advanced microsystems and embedded systems, including mixed-signal VLSI design.

M. Anane and N. Anane

Compact High Speed Hardware for SHA-2 on FPGA

Abstract: Hash functions play an important role in modern cryptography. They are widely used to provide services of data integrity and authentication. Hash algorithms perform a number of complex operations on the input data, which requires a significant amount of computing resources, especially when the input data are large. Thus, hardware implementation is far more suitable, for security and execution performance reasons, than the corresponding software implementations. Hash functions perform their internal operations in an iterative fashion, which opens the possibility of exploring several implementation strategies. In this paper, we are concerned with optimizing the hardware implementation of the SHA-256 algorithm on a Xilinx Virtex-5 FPGA. Our main contribution is to design a compact SHA-256 core and to speed up its critical paths, which are respectively seven- and six-word additions. The CS (Carry Save) representation is advantageously used to overcome carry propagation until the last addition. Special efforts were made to design, at the LUT level, the two components (compressors 7:2 and 6:2) which are the key features of our design; their delay is data-path independent and equivalent to the delay of two LUT6s. The resulting architecture is compact and operates at 170 MHz with a throughput of 1.36 Gbps.

Keywords: Security, Hash Functions, SHA-256, Hardware Implementation, FPGA.

1 Introduction

Nowadays, security is a major demand, due to the rapid evolution of a plethora of communication standards. In particular, hash function algorithms are widely used to provide services of data integrity and authentication. They are mainly used with public-key cryptosystems in digital signature schemes [1]. A digital signature can directly provide several security services including message authentication, message integrity, and non-repudiation. The hash function is also a basic building block in the implementation of secret-key message authentication codes (HMAC) [2]. Hash functions map messages of arbitrary length to a string of fixed length, called the “message digest”. This compression process is known as “hashing”. Hash functions allow maintaining a high level of security by performing complex

M. Anane and N. Anane: M. Anane, ESI (Ecole Nationale Supérieure d’Informatique), BP 68M Oued Smar, Algeria, [email protected], N. Anane, CDTA (Centre de Développement des Technologies Avancées), BP 17, Baba Hassen, Algiers, Algeria, [email protected]

De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 47–60. https://doi.org/10.1515/9783110470383-004


operations, usually in an iterative manner, which require a significant amount of computing resources. There are several algorithms for performing hash functions: MD5 [3], SHA-1 [4] and SHA-2 [5]. The latter is until now the most used and is considered safe, in spite of the selection by the NIST (National Institute of Standards and Technology) of the Keccak algorithm as SHA-3 [6]. However, as is well known, the transition to a new standard does not happen immediately. Hence, as also reported by NIST experts, the SHA-2 family is expected to continue being used in near- and medium-term applications. Moreover, the advent of the SHA-2 family solved the insecurity problems of SHA-1 and other popular algorithms such as MD5 and SHA-0. In addition to the demanded high security level, the need for high performance is a significant factor in the selection of a security implementation. Thus, hardware implementation is far more suitable for security issues than the corresponding software implementations. Several papers have been published on the optimization of SHA-2 in terms of improving the throughput rate, using pipelining techniques [7] or merging two or more computation steps to reduce the number of iterations [8]. The resulting designs often map poorly onto silicon and consume considerable hardware and routing resources. Most of these papers rely on CSA blocks to overcome carry propagation and accelerate the critical path; this solution is often effective for an ASIC application, but it is often less optimal in terms of speed and occupied area for an FPGA application, because the logic and routing resources of FPGAs and ASICs are different. In this work, we exploit the features of the Virtex-5 resources to improve the speed and occupied area of the SHA-256 algorithm on FPGAs. Our main objective is to design a compact hardware core for SHA-256, which can be used as an IP (Intellectual Property) in PSoC (Programmable System on Chip) platforms. This paper is organized as follows. The next section gives an overview of the algorithms in the SHA-2 standard, especially SHA-256. In Section 3, our proposed hardware core for SHA-256 is presented, and the speed/area performance results are discussed in Section 4. Conclusions are given in Section 5.

2 SHA-2 Algorithms

The secure hash algorithm SHA-2 family is composed of four variants: SHA-224, -256, -384 and -512. All of them use the Merkle-Damgård structure [9], as illustrated in Fig. 1, where the message is divided into blocks, padded as required, and the length is added to the end of the message. All of these hash algorithms have the same structure; they differ only in their output hash lengths and in the set of K_t constants used in each algorithm. An overview of


SHA-256 is given here; more details on the other algorithms can be found in the official NIST standard [10]. The SHA-256 algorithm consists essentially of two steps, preprocessing and hashing, as shown in Fig. 2.


Fig. 1. Merkle-Damgård hash construction (M_i: message block, Cf: compression function, H_k^(i): chaining variable, l: message length).

2.1 Preprocessing

In the first step, the binary message is appended with a “1” and padded with zeros until its length ≡ 448 mod 512. The original message length is then appended as a 64-bit binary number. The resulting padded message is parsed into N 512-bit blocks, denoted M^(1), M^(2), …, M^(N). These M^(i) message blocks are passed individually to the message expander. In the second step, each M^(i) block is seen as 16 32-bit words m_t^(i), where 0 ≤ t ≤ 15. Each M^(i) message is then expanded into 64 32-bit words W_t^(i), according to:

W_t^(i) = m_t^(i)                                              if 0 ≤ t ≤ 15
W_t^(i) = σ_1(W_{t−2}) + W_{t−7} + σ_0(W_{t−15}) + W_{t−16}    if 16 ≤ t ≤ 63   (1)

where:

σ_0(x) = ROT_7(x) ⊕ ROT_18(x) ⊕ SHF_3(x)     (2)
σ_1(x) = ROT_17(x) ⊕ ROT_19(x) ⊕ SHF_10(x)   (3)
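As an executable reference for the preprocessing and expansion steps, the following Python sketch implements the padding rule and the message schedule of equations (1)–(3); 32-bit arithmetic is emulated with explicit masking, and the function names are ours:

    def pad(msg: bytes) -> bytes:
        # Append a '1' bit, zero-pad to 448 mod 512 bits, then append
        # the original bit length as a 64-bit big-endian number.
        length = len(msg) * 8
        msg += b"\x80"
        msg += b"\x00" * ((56 - len(msg)) % 64)
        return msg + length.to_bytes(8, "big")

    def rotr(x, n):  # ROT_n: right rotation of a 32-bit word
        return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

    def sigma0(x):   # eq. (2); SHF_n is a plain right shift
        return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

    def sigma1(x):   # eq. (3)
        return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

    def expand(block: bytes):
        # Expand one 512-bit block into the 64-word schedule W, eq. (1);
        # all additions are modulo 2**32.
        w = [int.from_bytes(block[4 * t:4 * t + 4], "big") for t in range(16)]
        for t in range(16, 64):
            w.append((sigma1(w[t - 2]) + w[t - 7]
                      + sigma0(w[t - 15]) + w[t - 16]) & 0xFFFFFFFF)
        return w

    w = expand(pad(b"abc"))   # single-block example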


Fig. 2. Computing process of the SHA-256 message digest.

The function ROT_n(x) denotes a circular rotation of x by n positions to the right, whereas the function SHF_n(x) denotes the right shift of x by n positions. All additions in the SHA-256 algorithm are modulo 2^32.

2.2 SHA-256 iteration

The W_t^(i) words from the message expansion stage are then passed to the SHA-256 iteration function. The latter uses eight 32-bit working variables labeled A, B, …, H, which are initialized to the predefined values H_0^(0), H_1^(0), …, H_7^(0) shown in Table 1 at the beginning of the hash computation, and performs 64 SHA-2 iterations as follows:

A = T_1 + T_2   (4)
B = A           (5)
C = B           (6)
D = C           (7)


Tab. 1. Initial hash values of SHA-2 [1].

           SHA-224     SHA-256     SHA-384             SHA-512
H_0 → A    c1059ed8    6a09e667    cbbb9d5dc1059ed8    6a09e667f3bcc908
H_1 → B    367cd507    bb67ae85    629a292a367cd507    bb67ae8584caa73b
H_2 → C    3070dd17    3c6ef372    9159015a3070dd17    3c6ef372fe94f82b
H_3 → D    f70e5939    a54ff53a    152fecd8f70e5939    a54ff53a5f1d36f1
H_4 → E    ffc00b31    510e527f    67332667ffc00b31    510e527fade682d1
H_5 → F    68581511    9b05688c    8eb44a8768581511    9b05688c2b3e6c1f
H_6 → G    64f98fa7    1f83d9ab    db0c2e0d64f98fa7    1f83d9abfb41bd6b
H_7 → H    befa4fa4    5be0cd19    47b5481dbefa4fa4    5be0cd19137e2179

E = D + T_1   (8)
F = E         (9)
G = F         (10)
H = G         (11)

where:

T_1 = H + Σ_1(E) + Ch(E, F, G) + K_t + W_t   (12)
T_2 = Σ_0(A) + Maj(A, B, C)                  (13)

with:

Ch(x, y, z) = (x AND y) ⊕ ((NOT x) AND z)             (14)
Maj(x, y, z) = (x AND y) ⊕ (x AND z) ⊕ (y AND z)      (15)
Σ_0(x) = ROT_2(x) ⊕ ROT_13(x) ⊕ ROT_22(x)             (16)
Σ_1(x) = ROT_6(x) ⊕ ROT_11(x) ⊕ ROT_25(x)             (17)

The inputs denoted K_t are 64 32-bit constants, specified in [10]. The arithmetic and logic operations involved in the SHA-256 iteration above are AND, XOR, ROT and modulo-2^32 addition. The functional block of the SHA-256 iteration is depicted in Fig. 3. As shown in Fig. 2, after 64 iterations, new values of H_0^(i), H_1^(i), …, H_7^(i) are calculated as follows:

H_0^(i) = A + H_0^(i−1),   H_1^(i) = B + H_1^(i−1)
H_2^(i) = C + H_2^(i−1),   H_3^(i) = D + H_3^(i−1)
H_4^(i) = E + H_4^(i−1),   H_5^(i) = F + H_5^(i−1)
H_6^(i) = G + H_6^(i−1),   H_7^(i) = H + H_7^(i−1)   (18)
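A compact Python model of one such iteration, following equations (4)–(17) (rotr is the 32-bit right rotation ROT_n; equation (14) uses the standard Ch definition with the complemented first operand):

    MASK = 0xFFFFFFFF

    def rotr(x, n):
        return ((x >> n) | (x << (32 - n))) & MASK

    def sha256_round(state, kt, wt):
        # One SHA-256 iteration over the working variables A..H,
        # eqs. (4)-(17); all additions are modulo 2**32.
        a, b, c, d, e, f, g, h = state
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)    # Sigma_1, eq. (17)
        ch = (e & f) ^ (~e & g)                        # Ch, eq. (14)
        t1 = (h + s1 + ch + kt + wt) & MASK            # eq. (12)
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)    # Sigma_0, eq. (16)
        maj = (a & b) ^ (a & c) ^ (b & c)              # Maj, eq. (15)
        t2 = (s0 + maj) & MASK                         # eq. (13)
        # eqs. (4)-(11): shift the registers, inserting T1+T2 and D+T1
        return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)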



Fig. 3. Functional block of the SHA-256 iteration.

Then the resulting values H_0^(i), H_1^(i), …, H_7^(i) are used to initialize the A, B, …, H variables for the next 512-bit message block. After processing all N data blocks, the final message digest output (M.D) is formed by concatenating the final values H_0^(N), H_1^(N), …, H_7^(N) as follows:

M.D = H_0^(N) & H_1^(N) & H_2^(N) & H_3^(N) & H_4^(N) & H_5^(N) & H_6^(N) & H_7^(N)   (19)

3 Proposed SHA-256 hardware core

Hash functions are iterative algorithms which, in order to compute the final message digest, perform N × 64 iterations, where N is the number of message blocks. Since N can be very large (the message length field allows up to 2^64 − 1 bits), optimizing the iteration delay is crucial for enhancing the computation throughput of this function. The longest data path (critical path) in the SHA-256 iteration is the calculation of the working variable A, which involves a modulo-2^32 addition of seven operands (see equations (4), (12) and (13)). The second longest data path involves the calculation of the working variable E by a modulo-2^32 addition of six operands (see equations (8) and (12)).


In the following we present our method to speed up these two data paths. Instead of using six adders, we keep all intermediate results in carry-save form to overcome the carry propagation, which is very expensive in the case of multi-operand additions. The method is based on the use of specific components optimized at the LUT (Look-Up Table) level, namely a 7:2 compressor and a 6:2 compressor. The resulting functional block of the SHA-256 iteration is depicted in Fig. 4; its architecture is essentially composed of a number of logic functions, the two compressors (7:2 and 6:2) and two modulo-2^32 adders. The logic functions (Σ_0, Maj, Σ_1 and Ch) are computed in parallel. Each function can easily be implemented in one six-input LUT (LUT6 of the Virtex-5 FPGA), so the resulting delay of all these functions is that of crossing one slice. The adders are implemented as Carry Propagate Adders (CPA) using the carry chains of the Xilinx FPGA.
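The arithmetic idea can be checked with a small word-level Python model: a chain of 3:2 carry-save stages reduces the seven operands to two words with no carry rippling across bit positions, and a single carry-propagate addition resolves the result. This is the reduction the 7:2 compressor performs; the hardware realizes it as a parallel compressor rather than this sequential chain:

    import random

    MASK = 0xFFFFFFFF

    def csa(a, b, c):
        # 3:2 carry-save stage: a + b + c == s + cy (mod 2**32), computed
        # bitwise with no carry propagation across positions.
        s = a ^ b ^ c
        cy = ((a & b) | (a & c) | (b & c)) << 1
        return s & MASK, cy & MASK

    def add7_mod32(ops):
        # Reduce seven operands to carry-save form, then resolve the
        # carries with a single carry-propagate addition.
        s, c = csa(ops[0], ops[1], ops[2])
        for op in ops[3:]:
            s, c = csa(s, c, op)
        return (s + c) & MASK

    ops = [random.getrandbits(32) for _ in range(7)]
    assert add7_mod32(ops) == sum(ops) & MASK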

3.1 Compressor 7:2

Compressor trees are a general class of circuits that perform multi-input addition much more efficiently than adder trees. This holds for ASIC implementations; with the


Fig. 4. SHA-256 iteration component based on compressors.


introduction of dedicated adder circuitry and fast carry chains in FPGAs, the conventional wisdom has been that multi-input addition is best implemented using adder trees rather than compressor trees. The reason is twofold: firstly, it was thought that compressor trees are not efficiently synthesized onto LUTs, and secondly, the delay is reduced because the fast carry chain does not pass through the routing network. Nevertheless, compressors have found their use in such applications, and a significant performance improvement can be obtained by exploiting the features available in recent FPGAs. In the following, we describe the implementation of our logic cell, namely the 7:2 compressor, on the Virtex-5 CLB. As presented in Fig. 5, our compressor must reduce 9 input bits of rank i (including the two carry-in bits) into 4 output bits of ranks i, i+1 and i+2 respectively (the latter two being the carry-out bits). This compressor accepts as inputs the seven bits (a_i, b_i, c_i, d_i, e_i, f_i, g_i) plus the carries F2^{i−1} from rank (i−1) and F3^{i−2} from rank (i−2), and produces as outputs the two carries F2^i and F3^i and the two final CS (Carry Save) result bits C_out^i and S_out^i. The reduction of seven (i+1)-bit operands A, B, C, D, E, F, G to a CS result (C_out and S_out) by a chain of 7:2 compressors is shown in Fig. 6. The mapping of our compressor onto the 6-input LUTs of the Virtex-5 FPGA is shown in Fig. 7. The resulting delay of our 7:2 compressor is independent of the data-path length and equivalent to the delay of two LUTs. The 6:2 compressor is slightly different from the 7:2 in terms of occupied area: the computation of F1^i, F2^{i−1} and F3^{i−2} needs only three 6-input LUTs instead of six, while the delay remains practically the same, equal to the crossing of two LUTs.
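For reference, one arithmetic-consistent bit-slice decomposition of such a 7:2 compressor into 3:2 cells is sketched below and verified exhaustively; the paper's LUT-level mapping (Fig. 7) realizes the same function in two LUT6 levels, and its internal wiring may differ from this particular decomposition:

    from itertools import product

    def fa(a, b, c):  # 3:2 cell (full adder): a + b + c = s + 2*cy
        return a ^ b ^ c, (a & b) | (a & c) | (b & c)

    def slice_7to2(x7, f2_in, f3_in):
        # One rank of a 7:2 compressor: 7 operand bits of weight 2^i plus
        # the lateral carry-ins F2 (from rank i-1) and F3 (from rank i-2)
        # are reduced to S (weight 2^i), C (2^(i+1)) and the lateral
        # carry-outs F2 (to rank i+1) and F3 (to rank i+2).
        a, b, c, d, e, f, g = x7
        s1, c1 = fa(a, b, c)
        s2, c2 = fa(d, e, f)
        s3, c3 = fa(s1, s2, g)
        s_out, c_out = fa(s3, f2_in, f3_in)
        f2_out, f3_out = fa(c1, c2, c3)   # combine the rank-(i+1) carries
        return s_out, c_out, f2_out, f3_out

    # Exhaustive check of value conservation over all 2^9 input patterns.
    for bits in product((0, 1), repeat=9):
        *x7, f2, f3 = bits
        s, c, f2o, f3o = slice_7to2(x7, f2, f3)
        assert sum(x7) + f2 + f3 == s + 2 * c + 2 * f2o + 4 * f3o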


Fig. 5. Compressor 7:2 and its equivalent representation based on 3:2 CSAs.


Fig. 6. Chain of 7:2 compressors.

To summarize, the 6:2 compressor used in our architecture is a simplified version of the 7:2 compressor presented above. In terms of timing, the two compressors have the same delay of two LUT levels; in terms of area, the 6:2 compressor occupies less area. The difference lies in the first stage of the mapping in Fig. 7, where the 6:2 compressor needs only three LUT6s to generate the six-input functions F1^i, F2^{i−1} and F3^{i−2}.

4 Implementation results and comparisons The SHA-256 was captured at the LUT level using VHDL and implemented on a Virtex-5 FPGA circuit (xc5vlx330t-2ff1738). The ISE12.2 (Integrated Software Environment) Design suit and VHDL were used for the hardware description of our core component and was fully simulated and verified using ISim (ISE Simulator). The design was fully verified using a large set of test vectors, apart from the test example proposed by the standard. The proposed implementation is based on optimizations at the lowest level, which means its VHDL description is based on the smallest FPGA component which is the look-up table LUT6.These optimizations have resulted in compact implementation of SHA-256 compared to the same architecture based on VHDL description at RTL level by using CSA components, as shown in figure 8. The achieved operating frequency is equal to 169.95 MHz for the SHA-256 hashing core. Achieving this high frequency, throughput exceeds 1359.6 Mbps. The number of occupied slices by our architecture is 1203. The proposed implementations are based on optimizations at the lowest level (slice level) which led to efficient mapping of

ai–1 bi–1 ci–1 di–1 ei–1 f i–1 gi–1

ai–2 bi–2 ci–2 di–2 ei–2 f i–2 gi–2

56 | M. Anane and N. Anane

ai bi ci di ei f i gi

LUT6

LUT6

LUT6

LUT6

O6

O6

O6

O6

F7

LUT6

LUT6

O6

O6

F7

i F1

F7

i–1 F2

i–2

F3 0 0 0

0 0 0

LUT6

LUT6

O6

O6 i

Cout

i

Sout

Fig. 7. Mapping of one bit 7:2 compressor on Virtex-5 slices.

the architecture onto slices. Therefore, the routing among slices is reduced and the most occupied slices are fully used (Fig. 8b). In order to evaluate the proposed architecture, we first compared our LUT-based 7:2 compressor with its analogous CSA-based component, as shown in Table 2. Furthermore, we selected two criteria to compare our work with related work in the literature, throughput and efficiency, computed as follows:

Throughput = Data_block_size / (Clock_period × Clock_cycles)   (20)
Efficiency = Throughput / Number_of_slices                     (21)
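Plugging in the figures reported above (512-bit blocks, a 169.95 MHz clock and 1203 slices; the 64 cycles per block is our assumption of one iteration per clock cycle, back-derived from the reported throughput):

    block_bits = 512          # one SHA-256 message block
    cycles = 64               # assumed: one iteration per clock cycle
    f_clk = 169.95e6          # reported operating frequency (Hz)
    slices = 1203             # reported slice count

    throughput = block_bits * f_clk / cycles   # eq. (20), period = 1/f_clk
    efficiency = throughput / slices           # eq. (21)
    print(f"{throughput / 1e6:.1f} Mbps, {efficiency / 1e6:.2f} Mbps/slice")
    # -> 1359.6 Mbps and 1.13 Mbps/slice, matching Tab. 3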

In Table 3, we compare our proposed SHA-256 core to those of [11] and [12] based on the criteria presented above. The optimizations used to design the SHA-256 architecture can easily be generalized to SHA-512. In the proposed design, most of the iteration delay is independent of the data-path length; therefore, widening the data path from the 32 bits of SHA-256 to the 64 bits of SHA-512 will change the iteration delay only slightly. Despite the fact that the number


of iterations in SHA-512 (80 iterations) is higher than that of SHA-256, the throughput can be greater, because each SHA-512 iteration processes 8 × 64 bits instead of the 8 × 32 bits of SHA-256.

Fig. 8. Mapping of the SHA-256 core on the Virtex-5 FPGA: (a) RTL level, (b) LUT level.

Tab. 2. Compressor 7:2 delay.

              Compressor 7:2, LUT level    Compressor 7:2, CSA level
Delay (ns)    1.836                        2.347


Tab. 3. Throughput and efficiency of the proposed SHA-256 core compared to other works.

                     Throughput (Mbps)    Efficiency (Mbps/slice)
[11]                 819.74               0.734
[12]                 772.11               0.516
Our SHA-256 core     1359.6               1.13

5 Conclusion

In this paper, we have presented an area-efficient hardware core for the SHA-256 hash function. It is based on the optimization of the two critical paths to speed up the SHA-2 iteration using two dedicated components, namely the 7:2 and 6:2 compressors. These components were optimized at the lowest level (LUT), which allows performing a seven-operand addition in only 5.016 ns. In order to fully use the slices during the synthesis of the proposed architecture on the FPGA, a special effort was made at the VHDL description level. The resulting hardware core is compact and offers good performance. This core can easily be adapted to the other variants of SHA-2 by simply adjusting the data path of the operands to the word length specified in each standard.

Bibliography

[1] M. Khalil, M. Nazrin and Y.W. Hau. Implementation of SHA-2 hash function for a digital signature system-on-chip in FPGA. Int. Conf. on Electronic Design, Penang, Malaysia, December 2008.
[2] M. Juliato and C. Gebotys. A quantitative analysis of a novel SEU-resistant SHA-2 and HMAC architecture for space missions security. IEEE Trans. on Aerospace and Electronic Systems, 49(3), July 2013.
[3] K. Järvinen, M. Tommiska and J. Skyttä. Hardware implementation analysis of the MD5 hash algorithm. 38th Hawaii Int. Conf. on System Sciences, 2005.
[4] L. Zhou and W. Han. A brief implementation analysis of SHA-1 on FPGAs, GPUs and Cell processors. Int. Conf. on Engineering Computation, pp. 101–104, 2009.
[5] R.P. McEvoy, F.M. Crowe, C.C. Murphy and W.P. Marnane. Optimisation of the SHA-2 family of hash functions on FPGAs. Emerging VLSI Technologies and Architectures (ISVLSI'06), 2006.
[6] B. Jungk and J. Apfelbeck. Area-efficient FPGA implementations of the SHA-3 finalists. Int. Conf. on Reconfigurable Computing and FPGAs, 2011.
[7] G.S. Athanasiou, H.E. Michail, G. Theodoridis and C.E. Goutis. Optimising the SHA-512 cryptographic hash function on FPGAs. IET Comput. Digit. Tech., 8(2):70–82, 2014.
[8] H.E. Michail, A.P. Kakarountas, E. Fotopoulou and C.E. Goutis. High-speed and low-power implementation of hash message authentication code through partially unrolled techniques. 5th WSEAS Int. Conf. on Multimedia, Internet and Video Technologies, pp. 130–135, Corfu, Greece, August 17–19, 2005.


[9] P. Gauravaram, W. Millan, E. Dawson and K. Viswanathan. Constructing secure hash functions by enhancing Merkle-Damgård construction. Australasian Conf. on Information Security and Privacy (ACISP) 2006, LNCS 4058, pp. 407–420, Springer-Verlag, Berlin Heidelberg, 2006.
[10] NIST. Secure Hash Standard. FIPS PUB 180-2, 2002.
[11] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido and M. Morales-Sandoval. Novel hardware architecture for implementing the inner loop of the SHA-2 algorithms. 14th Euromicro Conf. on Digital System Design, 2011.
[12] S. Ducloyer, R. Vaslin, G. Gogniat and E. Wanderley. Hardware implementation of a multi-mode hash architecture for MD5, SHA-1 and SHA-2. Workshop on Design and Architectures for Signal and Image Processing, 2007.

Biographies Mohamed Anane is currently Associate Professor at the Ecole Nationale Supérieure d’Informatique (ESI) in Algiers Algeria. He teaches Computer architecture and IT security. His research activity is centered on computer arithmetic, FPGA design and Programmable System on Chip with special emphasis on the application domain of security and cryptography-related processing.

Nadjia Anane obtained her engineering degree in Electrical Engineering from the Ecole Nationale Polytechnique d'Alger (ENP) in 1986 and her Master's degree in image processing and electronic systems from Saad Dahlab University of Blida in 2012. She is currently a researcher at the Center of Development of Advanced Technologies, Algiers, Algeria. She has attended many national and international conferences, and her research areas are cryptography, embedded systems, and FPGA implementation.

M. Götze, W. Kattanek and R. Peukert

An Extensible Platform for Smart Home Services

Abstract: Providing services to both inhabitants and building operators may overcome the landlord/tenant dilemma with respect to the deployment of current home automation technologies in residences. Pairing this with wireless technologies in particular enables cost-efficient retrofitting of existing buildings. To this end, we introduce an overall system architecture and a residential services gateway essential to it, detailing design considerations aiming for extensibility, cost efficiency, and low power consumption. The system enables both local and centralized services. We also describe how these concepts and their implementations have found first applications in a larger-scale R&D project and beyond.

Keywords: Residential Services Gateway, Smart Home, Smart Metering, Building Automation, Services, Sensors, Wireless, Component-Based Software, Web Applications.

1 Introduction

While smart metering systems are increasingly being deployed in both private and association-operated buildings due to political pressures and the utilities' own interests, home automation solutions are found almost exclusively in private residential buildings because of their costliness. Housing associations' willingness to incrementally introduce even basic home automation infrastructures is hindered by the landlord/tenant dilemma [1], a variant of the general investor/user dilemma. From a technical perspective, this contrasts starkly with obvious benefits such as a general increase in comfort for the user, energy savings, or ambient assisted living (AAL) scenarios. If it were possible to offer housing associations some sort of return on investment, their hesitance could likely be overcome. One way of compensating for investments may consist in using the added infrastructure to provide services commercially to both tenants and third parties, such as building operators. For the latter group, the benefits effectively need to be clearly monetary. As for tenants, it stands to reason that, with prices in line with the user's perceived gain in comfort (or other benefits), this may establish an end-user market mirroring that of smartphones, where users willingly pay for inexpensive apps.

M. Götze, W. Kattanek and R. Peukert: IMMS Institut für Mikroelektronik- und Mechatronik-Systeme gemeinnützige GmbH, Ehrenbergstrasse 27, 98693 Ilmenau, Germany, emails: [email protected], [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 61–80. https://doi.org/10.1515/9783110470383-005


Taking this general idea one step further, it is only logical to utilize available smart metering infrastructures in the context of such services [2]. Doing so may enable innovative services while limiting the required investment in new infrastructures. One such example is that of a leakage alarm service combining data received in near real-time from available smart water meters with humidity data from, e. g., a mold risk sensor (which in itself enables services and thus benefits of its own).

Putting this vision into practice, however, requires elaborate concepts to be developed and realized. Providing services requires an appropriate platform, and integrating them with existing and novel infrastructures means dealing with heterogeneous technologies (both across the realms of smart metering and home automation and with respect to the plethora of standards and proprietary protocols used within each). This reasoning has led us to investigate the feasibility of, and design a prototype platform for, providing this type of integrative service. We have come to term the platform a "residential services gateway" due to its characteristics of being (primarily) deployed in residences, its purpose of enabling "smart" services, and its ability to connect residences to building-level and even central infrastructures.

Retrofitting existing buildings, such as the "plattenbau" (concrete panel) type frequently encountered in Eastern Germany, with wired infrastructures is, of course, extremely costly, usually requiring flats to be temporarily evacuated during renovations. In contrast, wireless systems, implementations of which are becoming increasingly available, can be set up much less invasively and offer tremendous application potential for these smart systems. Consequently, we have focused specifically on the incorporation of wireless technologies, both conceptually and in our implementations. This work sets itself apart from others, including existing insular solutions (such as RWE SmartHome [3]), by its integrative character (regarding the two separate domains of smart metering and home automation as well as the numerous technologies available within each), its goal of providing a platform for services for tenants as well as operators, and its focus on retrofitting existing, association-operated buildings.

2 Overview of the system

Providing services integrating metering and home automation infrastructures as outlined in the introduction requires access to relevant concrete implementations and thus dealing with a variety of interfaces and communication standards. Building upon this, a platform for implementing these services needs to be created and provided access to the data in a unified manner. Offering services beyond the scope of a single residence, such as to housing operators, further requires carefully restricted subsets of these data (possibly anonymized) to be transmitted to the outside, e. g., to a central infrastructure, such as specific servers or a "cloud".


2.1 Overall system architecture

Figure 1 illustrates the general overall system architecture. Data are aggregated per residence from networks of end devices of various technologies by a residential services gateway. Further connected to this is some sort of terminal, such as a tablet computer, serving as the HMI for local services. Via the Internet, residences' gateways are connected to a central infrastructure, relaying data for centralized services. Introducing a distinction between local and centralized services makes sense for a number of reasons, which will be discussed in section 5. This overall system architecture does not mandate an actual deployment in a 1:1 relation of residences to residential services gateways ("home" setup). Depending on the specific services to be offered, economic considerations, or building specifics, an individual gateway may handle multiple residences ("campus" setup), or a services gateway may be deployed in a building's installations room, merely monitoring central infrastructures such as HVAC and lighting systems ("facility" setup).

Fig. 1. Principal system architecture (residences 1..N, each with networks of end devices of various technologies aggregated by a residential services gateway and a terminal, e. g., a tablet; the gateways connect via the Internet to the operator's central infrastructure).


As the local key element of the architecture, the residential services gateway's foremost responsibilities are:
– data acquisition from heterogeneous sources (networks of various technologies, end devices from various vendors)
– local data storage and provisioning
– data transmission to the central server infrastructure
– providing a platform for local services interacting with end devices, the user, and, possibly, central servers

Consequently, major requirements considered in the services gateway's architectural design included the following:
– extensibility regarding hardware interfaces (the set of supported buses/networking standards), requiring flexibility in both hardware and software
– the computational performance necessary for data processing and interactive services
– cost efficiency, especially as the residential services gateway will be deployed along with additional devices
– power efficiency, in contrast to traditional PC-based solutions, particularly since one of the aimed-for benefits is enabling overall energy savings

All of these essential requirements affect both the hardware and the software to be developed. We have therefore developed hardware and software architectures and implementations addressing both the functional and these non-functional requirements.

2.2 Supported infrastructures

The smart metering and home automation landscapes are both expansive and largely disjoint. In consequence, it proved necessary to select a subset of relevant technologies to support initially while providing for future extensibility. The following domains and specific technologies have been considered relevant:

2.2.1 Metering solutions

Wireless M-Bus (wM-Bus) and the Open Metering System (OMS) [4] are related wireless metering standards gaining traction in Europe and Germany in particular. These are based on (wired) M-Bus as an established metering standard. However, in our concept, metering solutions are to be integrated preferably via a multi-utility communication device (MUC), e. g. via TCP/IP, rather than directly. This decision has been made deliberately in light of foreseeable legislation raising concerns particularly with respect to required certifications. Integration via a MUC also offers the advantage of not having to deal with the intrinsics of the various devices on the market and rather


leave this to the MUC. One specific MUC has been integrated exemplarily; others can be attached in a similar fashion (by implementing a software component, assuming that the physical connection is made using one of the available interfaces).

2.2.2 Home automation solutions

In the area of home automation, various technologies compete. Here, ZigBee, a wireless standard supporting a number of distinct device profiles and particularly popular in the USA, and the EnOcean standard [5] have been selected to be supported. Legacy home automation equipment is often connected via serial interfaces such as RS-232 or RS-485; providing such interfaces thus makes it possible to integrate these devices physically.

2.2.3 Wireless sensor networks

Wireless sensor networks are trending towards the adoption of IPv6 in their protocol stacks, pioneered by 6LoWPAN [6, 7], which, similarly to ZigBee, builds upon IEEE 802.15.4 and has already found applications in smart metering and smart grid contexts. 6LoWPAN is also the basis of ConSAS and BASe-Net, a proprietary protocol and a modular sensor platform, respectively, previously developed at IMMS (section 6), the integration of which enables scenarios beyond smart home services.

3 Hardware architecture

In order to enable software development in parallel to hardware development and allow for early software testing, a two-stage approach to the creation of the residential services gateway's hardware platform was chosen: initially, an off-the-shelf hardware platform was selected; the development of custom hardware tailored to the specific requirements was deferred until first practical experiences had been made.

3.1 Hardware platform

The initial prototypical hardware platform was selected by investigating hardware requirements (primarily concerning computational performance and interfaces), surveying the market, and benchmarking a selection of platforms. In particular, a modular evaluation platform offered by PHYTEC [8] was examined which allows various CPU modules to be combined with a base board providing interfaces. The prototype platform eventually chosen paired the base board with an ARM-Cortex-A8-based (ARMv7) CPU module, a combination which promised to be sufficient in terms of performance and its array of interfaces. For example, benchmarks showed the Cortex-A8 clocked at 720 MHz to perform approximately twice as well in


real-world benchmarks as a Freescale i.MX35 ARM11-based processor at 532 MHz while consuming similar (low) amounts of power. Without peripherals, the boards consume 3.1 W and 3.6 W of power, respectively, when idle, and consumption does not rise considerably even under full load; peripherals such as the various USB transceivers and additional USB Ethernet adapters raise this to approximately 10 W. In contrast, a current medium-class PC (Intel Core i5) consumes approximately 95 W when idle.

For the necessary interfaces towards the selected smart metering, home automation, and wireless sensor network technologies, it was possible to resort to COTS components as well: USB sticks with transceivers for wM-Bus, EnOcean, and ZigBee. A USB stick with a transceiver for our own 6LoWPAN-based BASe-Net sensor platform was used in addition. Furthermore, a MUC (multi-utility communication) device was incorporated, connected via an on-board Ethernet interface and a proprietary protocol, which enables access to wM-Bus/OMS and (wired) M-Bus meters supported by it. WLAN support has been provided through a USB transceiver, and additional Ethernet and serial interfaces can be included on demand via additional USB adapters. Besides immediate availability, resorting to USB COTS components has the added benefit of being able to work with the same hardware on both development hosts and the target platform, facilitating software development.

Based on the experiences made with this initial setup, an optimized modular hardware platform was developed which uses a TI Sitara AM3305 processor (ARMv7 architecture) running at 600 MHz. In this design, cost efficiency as a primary concern has been addressed by selecting, e. g., a CPU with the minimum set of "extra" features out of the numerous variants of ARM Cortex-A8 processors available and limiting external interfaces to a common denominator of deployment scenarios. The processor board containing this CPU also provides 256 MB of RAM, 512 MB of flash memory, an RJ45 Ethernet port, two USB ports, an SD card slot, and an RS-232 serial port, rendering it a versatile platform for a variety of application scenarios. The platform is competitive in terms of costs with alternative platforms with similar specifications.

The platform has been subjected to systematic tests regarding performance, latency, and power consumption. It achieves up to 1,018 MIPS in the Dhrystone 2.1 benchmark and 1,428 points in the EEMBC Coremark benchmark. The measured Ethernet transfer rate of 11.5 MB/s is close to the theoretical maximum of a 100 MBit interface. Running Linux with real-time extensions (Preempt-RT), the measured latency of 127 μs is within the expected range for the hardware (i. e., similar to TI's development boards). This renders the platform suitable even for real-time applications. One specimen of the platform has been submitted to the Open Source Automation Development Lab's (OSADL) real-time QA farm [9], where it is undergoing continuous benchmarking. The processor board itself consumes no more than 1.75 W under full load, rendering even battery-powered operation viable.

Additionally, the processor board can be paired with one or two extension boards (interfaced via up to three separate connectors) which may offer additional and more specialized interfaces. The concept allows for arbitrary extension boards to be


realized; an existing first multi-purpose extension board offers wM-Bus/OMS and KNX interfaces (in addition to RS-485 and RS-232 serial ports, GPIOs, and another USB port). Besides, smaller extension boards containing only a particular wireless transceiver each have been created as well. Alternative extension boards currently being designed will incorporate the IMMS BASe-Net wireless sensor network solution, a UMTS modem, and an MMI interface, respectively. As for connecting the system to external IT infrastructures, this can be accomplished either directly in a LAN, using WiFi (provided by an external access point or a USB adapter), or via a mobile data connection (using, e. g., a UMTS USB modem). Initially, an even more modular custom hardware platform had been considered which would also have included an exchangeable CPU module. However, the lower production quantities per specific module rendered this approach inefficient in terms of costs: using the same processor yields a more competitive platform price-wise, even if the processor is overpowered for some scenarios.

4 Software architecture

The residential services gateway's software comprises a number of distinct parts (Fig. 2). Linux as the operating system of choice serves as the basic software platform, custom-built for the requirements at hand using the OpenEmbedded build framework [10]. On top of this, a modular suite of software termed core software performs the gateway's core functionalities.

Fig. 2. Basic software architecture of the services gateway (dark-gray: preexisting (yet custom-built) software; medium-gray: custom native software; light-gray/white: custom platform-independent software (provided/user)).


the gateway’s core functionalities. As a part of this, high-level drivers provide access to networks of end devices. Building in part upon the core software, a services infrastructure is provided. This infrastructure offers a platform-independent interface to services at a higher level of abstraction. The specifics of the services infrastructure are not tied intrinsically to the core software; rather, its realization is intentionally kept as independent as possible in order to provide for more flexibility.

4.1 Core software

The core software's primary responsibilities are data acquisition from heterogeneous sources, local storage and provisioning of data, and transfer of data to external parties (both local and remote). In order to provide a uniform interface towards external parties, the heterogeneous data acquired from end devices of different types communicating via different networking standards need to be stored in a unified manner. To this end, the core software applies a number of abstractions (a minimal code sketch of this data model follows after the list of advantages below):
– networks are represented internally as non-hierarchical pools of end devices identified by arbitrary addresses
– end devices are further described by product and device type information and an operational status
– end devices each provide a number of data points termed inputs (sensors, state information), a subset of which may also be writable (actuators/outputs)
– inputs' data are stored as (long) integer values (for reasons of precision) and optionally scaled and/or accompanied by a string-type annotation
– interfaces, devices, and inputs are identified by numerical IDs unique per gateway instance

Specifically, these abstractions do not yet deal with any unifying semantics concerning physical quantities, as done, e. g., in HYDRA/LinkSmart [11] via ontology mappings. While inputs can be assigned type descriptors and quantities, dealing with these is currently left to service implementations (i. e., service implementations require device/input assignments through explicit configuration or matching).

For the realization of the core software's functionalities, we have chosen a modular approach, yielding separate components for separate concerns. While this introduces communication among components as an additional explicit aspect, the approach has a number of advantages over a complex monolithic software:
– the components' complexity is significantly lower
– loose coupling and well-defined interfaces facilitate testing
– deployment can easily be tailored to the particular requirements of a specific instance of the system
– the implied need to provide interfaces among components automatically enables extensibility (by implementing additional components utilizing the same interfaces)


– depending on the means of communication, it is possible to choose the best-suited implementation language on a per-component basis and still enable seamless integration
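To make the data-point abstractions above concrete, here is a minimal sketch of the data model in Python; the class and field names are our own illustration, not the platform's actual C++/Qt types:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Input:
    """A data point of an end device; writable inputs act as actuators."""
    input_id: int            # numerical ID, unique per gateway instance
    quantity: str            # type descriptor, e.g. "temperature"
    scale: float = 1.0       # raw integers are scaled for presentation
    writable: bool = False
    raw_value: Optional[int] = None   # stored as (long) integer for precision
    annotation: Optional[str] = None  # optional string-type annotation

    @property
    def value(self):
        return None if self.raw_value is None else self.raw_value * self.scale

@dataclass
class Device:
    device_id: int
    address: str             # arbitrary, technology-specific address
    product: str             # product/device type information
    status: str = "online"   # operational status
    inputs: list = field(default_factory=list)

# a network is simply a non-hierarchical pool of such devices
network = [Device(1, "0x04A1", "wM-Bus water meter",
                  inputs=[Input(10, "volume", scale=0.001)])]
```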

Popular options for inter-component communication include inter-process method call paradigms and general socket-based communications. For the residential services gateway, D-Bus [12] as an implementation of the former has been chosen. D-Bus is an open standard for intra-system communications commonly used in desktop Linux systems, implementing a binary message-passing protocol enabling cross-process method calls and notifications. The D-Bus implementation is available as open source (native C), with bindings for major programming languages and frameworks, including Java, Python, and the C++ Qt framework [13].








The specific components currently realized include the following:
– the device manager, a component managing physical (networks/interfaces, devices) and virtual (data points/inputs) components connected to the residential services gateway
– the data store, a component storing data per input in embedded SQL databases (SQLite) and providing an interface to access these data
– the concept of (high-level) drivers as abstractions of entities establishing access to end device networks via a uniform API, registering themselves and their respective devices and inputs with the device manager, and broadcasting data upon reception
– various implementations of drivers for, i. a., several IEEE-802.15.4-based networks, EnOcean devices, and the integration of a MUC (multi-utility communication) device by means of which meters communicating via M-Bus and wM-Bus/OMS can be incorporated
– additional components, such as a script engine for the back-end part of local services and a web API component for efficient access from web-based services (section 5), a component providing virtual sensors through sensor fusion, and one displaying status information using LEDs

Components are based on a common hierarchy of classes provided as libraries. In all, the software encompasses four libraries (for the common object models, I/O, and database functionalities) and 17 concrete components. Accompanying tools deal with such aspects as the run-time life cycle management of components and their dependencies. Components interoperate using well-defined D-Bus APIs, sets of documented properties, methods, and signals. In some cases, separate APIs are defined for distinct contexts: for example, the device manager provides distinct application and driver APIs. New data is propagated upon reception using signals (event-driven). Figure 3 illustrates the way the various components communicate with each other across the D-Bus.

Fig. 3. Components of the core software and representative aspects of their communication across the D-Bus (drivers for OMS, ZigBee, BASe-Net, and the MUC broadcast data via a uniform driver interface; the device manager handles the registration of interfaces, devices, and inputs/outputs; the data store, script engine, status display, and web backend consume and query data for local services, terminals, and building management/centralized services).
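While the core components themselves are native C++/Qt (see below), the Python D-Bus bindings mentioned earlier make the event-driven signal propagation easy to illustrate. In the following sketch, the bus choice, interface name, signal name, and handler signature are hypothetical placeholders, not the platform's published APIs:

```python
import dbus
from dbus.mainloop.glib import DBusGMainLoop
from gi.repository import GLib

# integrate D-Bus with a GLib event loop so that signals are dispatched
DBusGMainLoop(set_as_default=True)
bus = dbus.SessionBus()  # a deployment might equally use the system bus

def on_new_data(input_id, raw_value):
    # drivers broadcast new readings; interested components react here
    print(f"input {input_id}: raw value {raw_value}")

# the interface and signal names below are invented for this example
bus.add_signal_receiver(on_new_data,
                        dbus_interface="org.example.Gateway.Driver",
                        signal_name="NewData")

GLib.MainLoop().run()
```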


The core software’s components have been implemented natively in C++ using the Qt framework [13]. A native implementation, while subject to issues of platform dependence, has lower requirements on the underlying hardware platform (the requirements imposed by, e. g., Java would have been significantly higher), allowing for lower-cost hardware designs. Conversely, a less resource-intensive core software leaves more of the available performance of a given platform for additional infrastructures, such as for services. The specific choice of libraries allows for the software to be built for both PC development hosts and embedded targets (such as the ARM-Cortex-A8-based residential services gateway hardware platform). Thanks to this and the utilization of (mostly) USB transceivers for the various infrastructures, it is possible to debug on the host using advanced diagnosis tools such as Valgrind [14], which employs heavy-weight binary instrumentation for detecting, e. g., subtle memory access errors.

5 Services

Providing a platform for services represents the ultimate goal of the development of the residential services gateway. As has been pointed out before, these services are to explicitly combine metering data with additional sensors and actuators in order to reduce the investment required on the part of the landlord or building association. One representative example is that of a leakage warning service for residences (Fig. 4), where leakages may occur, e. g., due to tap water left running or due to defects. This service may be provided by monitoring water consumption via regular smart water meters on the one hand and sensory information from, e. g., a mold risk or climate sensor on the other hand, issuing a warning (via the local terminal, SMS, or Internet infrastructures) if both exceed certain thresholds.

Fig. 4. Leakage warning service as an example of services combining information from multiple sources and building upon each other (end devices such as water metering and occupancy sensing feed elementary services like data monitoring and mold warning, which combine into integrated services such as the water leakage alarm).


As another example, individual-room regulations can be optimized by incorporating up-to-date data from central heating infrastructures, and vice versa. These optimizations may yield benefits such as energy savings and increased comfort. A third example consists in correlating mold risk indications from individual flats with external climate data or user behavior, such as aeration, determined, e. g., via EnOcean window contacts. Hence, the cause of mold may be determined in individual cases, or the spread of risk may be analyzed across entire buildings, which may enable conclusions on the influence of intrinsics of the buildings.

As services are to be offered to tenants as well as building operators, it is necessary to distinguish between local and centralized services. In fact, this distinction also corresponds with technical considerations, such as:
– the scope of involved data (single residence vs., e. g., an entire apartment block)
– hardware requirements of the local platform
– the dependence on or independence from permanent Internet connectivity
– the degree of interactivity/real-time requirements (vs. latencies)
– deployment (setup and updating in the field vs. centrally)

In consequence, centralized services will be mostly informational while local ones will often include control aspects. For both types of services, web-based solutions have been envisioned. In contrast to the current "app" trend established by smartphones, web applications offer a greater degree of platform independence (or, conversely, less total effort for multiple platforms). This stance does not, however, forestall the implementation of apps; in fact, the very same interfaces used by the local web platform (such as REST APIs or WebSockets, see below) can equally well be used to this end, and a proof-of-concept app has already been realized for the Android operating system. Moreover, the web-based services infrastructure has been designed such that it suffices that service developers be familiar with common web technologies: HTML, CSS, JavaScript, and (in more complex cases) some Python.

As the platform provider (rather than an operator of centralized platforms or even cloud infrastructures), our focus has been on local services. For these, the services gateway runs a light-weight web server (lighttpd) and a Python-based "micro" application framework, Flask [15] (see Fig. 5). Besides providing the basic infrastructure for building web applications, Flask uses the Jinja2 templating language for dynamic content in HTML templates. This basic infrastructure is complemented by custom libraries providing access to the core software through its D-Bus APIs. The concept of realizing services as web pages also entails a dynamic client-side portion (executed in the user's browser). Consequently, custom JavaScript libraries complement the use of standard HTML 5, CSS 3, and JavaScript (see again Fig. 5). Client-side service frontends will typically communicate more or less continuously with the backend, for which a dedicated core component providing REST APIs [16] is available. JavaScript code executing in the user's browser will use AJAX requests (using the JSON-RPC protocol) and/or an HTML 5 WebSocket also provided by the component for near-real-time data reception.
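As a sketch of what a minimal server-side service frontend on this stack could look like, the Flask routes below render a template and expose a small JSON endpoint that client-side JavaScript could poll via AJAX. The data-access helper and the template name are hypothetical stand-ins, not the platform's actual custom Python libraries:

```python
from flask import Flask, jsonify, render_template

app = Flask(__name__)

def latest_reading(input_id):
    # hypothetical stand-in for the platform's D-Bus-backed data access
    return {"input": input_id, "value": 21.5}

@app.route("/")
def index():
    # server-side frontend: a Jinja2 template rendered with current data
    # (assumes a mold_warning.html template exists in the templates folder)
    return render_template("mold_warning.html", reading=latest_reading(42))

@app.route("/api/reading/<int:input_id>")
def api_reading(input_id):
    # polled continuously by client-side JavaScript
    return jsonify(latest_reading(input_id))

if __name__ == "__main__":
    app.run()
```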


Fig. 5. Software architecture of the services platform (dark-gray: preexisting implementations; medium-gray: custom native software; light-gray/white: custom platform-independent software (provided/user)): client-side service frontends build on HTML 5, CSS 3, JavaScript, and custom JavaScript libraries; server-side frontends and service backends build on lighttpd, Python, the Flask micro framework, the Jinja2 templating engine, and custom Python libraries.

Operational services (in contrast to those merely visualizing data, etc.) require some sort of control logic to run continuously. For this purpose, client-side code is inappropriate since it depends on a human user keeping a web page open, and the server-side web frontend is unsuitable because web servers are request-driven. Hence, in order to provide a means of continuously executing pieces of logic on the platform, another dedicated core component has been developed. This component is a script engine executing backend logic written in a JavaScript dialect (QtScript), supplemented by an API nearly identical to the custom JavaScript libraries provided client-side. This enables the implementation of backend logic to be executed on the platform. Frontend and backend are integrated through an API enabling the former to configure the latter. A practical example of this approach has been realized as an automated lighting control service (see Fig. 7), where the user can configure motion- or time-based control schemes via a web frontend, and a script executed by the script engine is configured accordingly, taking care of actually controlling the lighting in the configured manner; a schematic sketch of such backend logic follows at the end of this section.

The services platform is rounded off by comprehensive tools supporting the development of services. These include tools for localization, configuration, and deployment; deployment is possible both locally and remotely, where in the latter case services are packaged and deployed on the target system via an SSH connection. The tools also support versioning and manage dependencies among services.

Development efforts by others involved in the project representing the context of this work (section 6) have focused on centralized services. Those realized thus


far offer web-based frontends to repositories of data gathered from all of a housing association’s residences, implementing adequate access regulations towards different parties and users, such as managers and technicians.
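To close the section with the promised sketch: operational backend logic of this kind is event- and timer-driven. The platform's actual scripts are QtScript, but the following Python-flavored sketch, with entirely invented helper names (platform.read, platform.write, platform.start_timer), conveys the shape of a motion-based lighting rule that the web frontend would parameterize:

```python
# Illustrative only: the real backend runs as QtScript inside the
# gateway's script engine; all helper names below are invented.

LIGHT_ON_SECONDS = 120   # configured by the service's web frontend

def on_motion(platform):
    """Called by the engine whenever the occupancy input broadcasts."""
    if platform.read("occupancy") == 1:
        platform.write("light", 1)                  # actuator output
        platform.start_timer("light_off", LIGHT_ON_SECONDS)

def on_timer(platform, name):
    """Called by the engine when a previously started timer expires."""
    if name == "light_off":
        platform.write("light", 0)
```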

6 Applications

6.1 Smart home applications

The residential services gateway has found first applications in the context of a publicly-funded project, "Smart Home Services" (SHS) [17], in which more than 15 companies, institutes, and housing associations have joined efforts. In fact, the project's primary goals correspond to the motivation presented in the introduction, and the overall system architecture as introduced in section 2 and its variants have been elaborated in an architectural working group. In this context, the core software has exemplarily been integrated with an established building management system [18]. For this purpose, a preexisting embedded data acquisition software has been ported to Linux by the provider, and a connector component has been added enabling data propagation and access to actuators via a binary shared-memory interface provided by the legacy software. This enables an integration of the residential services gateway and the end device networks operated by it into existing building management infrastructures. This way, data acquired by the residential services gateway's core software is made available in a central location, allowing for it to be monitored as well. Even more importantly, it adds general sensory data to the metering data hitherto primarily collected, enabling innovative centralized services. Conversely, data from additional, wired house automation systems, such as M-Bus, is made accessible to local services. Furthermore, the system has also been interfaced with an existing building automation solution [19], enabling the latter to interface with devices outside its domain.

As an immediate result of the project, pilot installations have been set up at a residence and a facility basement in Jena, Germany, with the purpose of gaining insights into operating the platform and obtaining feedback from end users (Fig. 6). These installations include prototypes of the residential services gateway platform and various end devices of supported technologies. Data from these is aggregated locally, visualized via local services on a tablet computer or smartphone (Fig. 7), and transmitted to a central building management infrastructure.


Fig. 6. Pilot site in Jena, Germany.

Fig. 7. An automatic lighting service's frontend and a mold-warning service as examples of web-based services, accessed from mobile devices connected via WLAN.

6.2 Applications beyond the smart home

The platform's versatility has been proven further by applications in additional scenarios, such as environmental monitoring [20], where it is used to collect data from an outdoor network of wireless sensors and then relay them to a central, EU-level environmental database (TERENO) via a mobile data connection (UMTS). This application is the result of a tight integration of the platform with IMMS's wireless sensor network solution, BASe-Net [21, 22]. BASe-Net encompasses a multitude of sensors for numerous quantities, such as illuminance, temperature, humidity, and CO2, and also offers switchable mains plugs, general-purpose reed contacts, or GPIOs, and the possibility of integrating customer-specific sensors. BASe-Net components


communicate via IPv6, using a proprietary application-level protocol, ConSAS [23], which is encapsulated in 6LoWPAN [6, 7], in turn building upon the IEEE 802.15.4 physical layer. Towards the user, various PC tools for monitoring sensor networks and for visualizing and exporting data are provided. The services gateway platform has been integrated in such a way that it serves as an embedded logger of and access point to data acquired from the sensor network. Dedicated services visualize both the network's status and measurement data (using, e. g., diagrams or heatmaps, Fig. 8). In this context, a concrete configuration of the hardware and software platform has been named BASe-Box and is available either individually or in conjunction with BASe-Kit [24], a measuring kit offered specifically for short-term measurements, e. g., in residences or buildings. Another ongoing project aims to employ and further develop the platform for applications in traffic monitoring [25]. Here, the scenario is to collect data from distributed wireless (traffic) sensors and transmit them to a central traffic management system via GPRS/UMTS. Challenges lie, i. a., in the requirement to operate the system energy-autonomously over extended periods of time.

7 Summary and outlook

This contribution has given an overview of a residential services gateway, an extensible, versatile, energy-efficient platform for smart home services and beyond, detailing various design considerations concerning both its hardware and software. While the software has been made particularly modular, realizing a comprehensive set of components communicating via a software bus, the hardware design's modularity represents a compromise between maximum flexibility and economic considerations. Several prototypes of the platform have been produced, and the software already supports various networking standards, such as wM-Bus/OMS, ZigBee, or integration with a MUC. Pilot installations have been set up at a residence and a facility basement and are being operated in order to gather experiences and feedback from users. As openness and extensibility have been primary concerns during development, systems incorporating the residential services gateway will be easily extensible by interfaces to additional smart metering, home automation, and building management solutions and offer an attractive platform to potential service providers. The focus on and integration of wireless standards facilitates retrofitting existing buildings. While the platform has been developed and employed in the context of SHS, a specific smart home services project, its potential applications are not limited to it. Thanks to the modular architecture in both hardware and software, it can easily be adapted to specific requirements. In hardware, this is supported by the concept of extension boards providing additional interfaces. We consider the processor platform itself a suitable basis also for alternative high-level software platforms.


Fig. 8. Examples of web-based visualizations: graphs (top), heatmap (bottom).

In that regard, the feasibility of operating it as an energy management gateway running the Java-/OSGi-based OGEMA [26] v2 framework is currently being evaluated.

Acknowledgments: This work was supported by the German Federal Ministry of Economics and Technology (grant number KF2534502KM9) as well as the Free State of Thuringia and the European Commission (grant number 2010 FE 9073).


Bibliography

[1] European Council of Real Estate Professionals (CEPI) and the International Union of Property Owners (UIPI). Landlord/Tenant Dilemma, 2010.
[2] M. Götze, T. Rossbach, W. Kattanek, S. Nicolai and H. Rüttinger. Erweitertes Smart Metering zur verbesserten Verbrauchsanalyse und für neuartige Smart Home Services. Talk, Leibniz-Konferenz Lichtenwalde: Solarzeitalter 2011, Lichtenwalde, Germany, May 13th, 2011.
[3] RWE Smart Home. https://www.rwe-smarthome.de/.
[4] OMS Group. http://www.oms-group.org/.
[5] ISO/IEC 14543-3-10:2012. http://www.iso.org/iso/catalogue_detail.htm?csnumber=59865.
[6] Z. Shelby and C. Bormann. 6LoWPAN: The Wireless Embedded Internet. Wiley, 2009.
[7] G. Montenegro, N. Kushalnagar, J. Hui and D. Culler. RFC 4944: Transmission of IPv6 Packets over IEEE 802.15.4 Networks, September 2007.
[8] PHYTEC: phyCARD. http://www.phytec.eu/europe/products/modules-overview/phycard.html.
[9] OSADL realtime QA farm. https://www.osadl.org/Profile-of-system-in-rack-8-slot-6.qa-profiler8s6.0.html.
[10] H. H. P. Freyther, K. Kooi, D. Vollmann, J. Lenehan, M. Juszkiewicz and R. Leggewie. OpenEmbedded User Manual. OpenEmbedded.org.
[11] M. Eisenhauer, P. Rosengren and P. Antolin. HYDRA: A development platform for integrating wireless devices and sensors into ambient intelligence systems. The Internet of Things: 20th Tyrrhenian Workshop on Digital Communications, :367–373, Springer, 2010.
[12] H. Pennington, A. Carlsson, A. Larsson, S. Herzberg, S. McVittie and D. Zeuthen. D-Bus specification. freedesktop.org, July 2011.
[13] Qt – A cross-platform application and UI framework. http://qt.nokia.com/products/.
[14] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI '07), New York, NY, USA, :89–100, ACM, 2007.
[15] Flask: A Python microframework. http://flask.pocoo.org/.
[16] R. Fielding. Architectural styles and the design of network-based software architectures. Ph.D. thesis, University of California, Irvine, 2000.
[17] Smart Home Services Netzwerk. http://www.elmug-shs.de/, select "Über uns".
[18] KDS DAISY. http://kdssvcpub-prepare.sharp.bcpraha.com/eigene_produkte.
[19] ACX ViciOne. http://acx-gmbh.de/.
[20] J. Bumberger, P. Remmler, T. Hutschenreuther, H. Töpfer and P. Dietrich. Potentials and limitations of wireless sensor networks for environmental monitoring. 12th Int. UFZ-Deltares Conf. on Groundwater-Soil-Systems and Water Resource Management (AquaConSoil 2013), Barcelona, Spain, 2013.
[21] E. Chervakova, W. Kattanek, U. Knoblich and M. Binhack. Added-value services in smart home applications through system integration of different wireless standards. Embedded World Conf., Nuremberg, Germany, Weka Fachmedien, February 28th–March 1st, 2012.
[22] M. Götze, T. Rossbach, A. Schreiber, S. Nicolai and H. Rüttinger. Distributed in-house metering via self-organizing wireless networks. 55th Int. Scientific Colloquium (IWK), Ilmenau, Germany, Ilmenau University of Technology, September 2010.
[23] T. Rossbach and E. Chervakova. Wireless sensor/actor networks for building automation – A system for distributed computing and unique sensor identification. 3rd Summer School on Applications of Wireless Sensor Networks and Wireless Sensing in the Future Internet, Talk, SenZations'08, Ljubljana, Slovenia, September 1st, 2008.
[24] S. Engelhardt, E. Chervakova, A. Schreiber, T. Rossbach, W. Kattanek and M. Götze. BASe-Kit – Ein mobiles Messsystem für die Gebäudeautomation. Tagungsband Sensor+Test 2010, Nuremberg, Germany, :431–435, VDE Verlag, May 18th–20th, 2010.
[25] Smart Mobility in Thüringen (sMobiliTy). http://www.smobility.net/.
[26] Open Gateway Energy Management Alliance (OGEMA). http://www.ogema.org/.

Biographies

Marco Götze received his diploma (Dipl.-Inf.) in Computer Science from Ilmenau University of Technology in 2003. He has since been a research staff member with the Institut für Mikroelektronik- und Mechatronik-Systeme (IMMS) in Ilmenau, Germany. His work deals primarily with high-level application development for communications and visualization in and peripheral to embedded systems. In recent years, his focus has been on infrastructures for incorporating wireless sensor networks into various application domains.

Wolfram Kattanek received his diploma (Dipl.-Ing.) in Electrical Engineering with a focus on Computer Engineering from Ilmenau University of Technology in 1993. In the same year, he joined a Research Training Group funded by the DFG (German Research Foundation), where he worked on new design methods for analog and mixed analog-digital structures. In 1997, he became a research staff member with the Institut für Mikroelektronik- und Mechatronik-Systeme (IMMS) in Ilmenau, Germany. His research interests include embedded software development methods, resource-constrained HW/SW systems, and wireless sensor networks.

Rolf Peukert received his diploma (Dipl.-Inform.) in Computer Science from Kaiserslautern University in 1994. From 1995 through 2001, he worked as a teaching and research assistant for the Faculty of Computer Science and Automation at Ilmenau University of Technology, where he focused on verification tools for logical circuits and on computer-assisted teaching methods. In 2001, he became a research staff member with the Institut für Mikroelektronik- und Mechatronik-Systeme (IMMS) in Ilmenau, Germany. His research interests include embedded operating systems and real-time systems.

K. M. Al-Aubidy, K. Batiha and H. Y. Al-Kharbshh

Fuzzy-Based Gang Scheduling Approach for Multiprocessor Systems

Abstract: Different types of task scheduling problems have been formulated due to their importance in operating systems design. However, the problem has been shown to be computationally difficult, and it is generally not possible to obtain optimal solutions. Therefore, much attention has been paid in recent years to formulating approximate and heuristic algorithms. This paper presents an attempt to adopt a modified scheduling algorithm suitable for multiprocessor systems by applying soft-computing concepts. It suggests a modified approach for multiprocessor scheduling that combines gang scheduling with a fuzzy logic decision maker. The fuzzy decision maker uses incomplete information about the current states of the jobs and the multiprocessor system to compute a new priority for each runnable task. The proposed scheduling approach is tested on several case studies, and computational results are compared to those obtained with other approaches. The given results prove that the proposed approach achieves the best performance in terms of computational efficiency, average waiting time, and average turnaround time.

Keywords: Task Scheduling, Gang Scheduling, Fuzzy Decision Making, Multiprocessor Systems, Operating Systems.

1 Introduction

With the rise of the internet and the success of web-based applications, the need for decentralization and distribution of software and hardware resources has increased tremendously. This has necessitated new abstractions and architectures that can enable dynamic resource sharing in a geographically and administratively distributed setting [1]. Furthermore, multiprocessor platforms have become widely used, not only for servers and personal computers but also for embedded systems. Research on task scheduling has therefore been renewed for those multiprocessor platforms, especially in the context of real-time scheduling [2]. Based on scheduling algorithms for a single processor, several algorithms for multiprocessor scheduling have been developed and presented in the literature [3], such as time sharing, space sharing and gang scheduling. Time-sharing scheduling is a simple algorithm used to schedule independent processes. For dependent processes, space sharing is introduced.

K. M. Al-Aubidy, K. Batiha and H. Y. Al-Kharbshh: Philadelphia University, Jordan, emails: [email protected], [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 81–96. https://doi.org/10.1515/9783110470383-006


The main problem with this method is time: space sharing wastes time, especially when a processor is blocked and when the tasks of a job need to communicate with each other. For this reason, the gang scheduling algorithm was introduced [4]. The performance of gang scheduling strategies in a parallel system has been studied by Karatza [5, 6]. Simulation is used to generate the results needed to compare the performance of different scheduling policies under various workload models. The simulation results reveal that the relative performance of the different scheduling methods depends on the workload. Also, the variability in task execution time affects the relative performance of a scheduling algorithm. In gang scheduling there are often idle processors although there are tasks in the respective queues; fragmentation thus occurs in the schedule because gangs do not fit the available processors. A migration scheme suitable for gang scheduling in multi-core clusters is suggested by Papazachos and Karatza [7]. Scheduling multiple tasks on multiple processors is an optimization problem and has been a source of challenges to researchers in this area. Due to the high complexity of scheduling problems, many heuristic approaches have been developed to solve different scheduling problems. Among them, a fuzzy-based algorithm for load balancing in a symmetric multiprocessor environment is proposed by Rantonen et al. [8]. The given results prove that the proposed fuzzy load balancer achieves the best load balance among the processors as well as the fastest response time. Litoiu et al. [9] presented a distributed scheduling algorithm for real-time systems whose decision making component is based on fuzzy rules. They analyzed several algorithms and found that the proposed fuzzy decision maker achieves good performance and good robustness. In reference [10], fuzzy set concepts were used in the Longest Processing Time first (LPT) scheduling algorithm to schedule uncertain tasks; the authors report that the fuzzy LPT scheduling algorithm is a feasible solution for both definite and uncertain scheduling. Genetic algorithms have received much attention as a class of robust stochastic search algorithms for various optimization problems. A method based on a genetic algorithm has been developed to solve the multiprocessor scheduling problem [11]. The representation of the search node is based on the order of the tasks being executed on each individual processor, and the proposed genetic operator is based on the precedence relations between the tasks in the task graph. A pure genetic algorithm generates illegal schedules due to the crossover and mutation operators. Cheng and his group [12] proposed a genotype that generates legal schedules without modifying the genetic algorithm or its genetic operators; the proposed dynamic real-time scheduling approach shows high performance and generates good-quality schedules. In order to adapt the genetic algorithm to the non-identical parallel processor scheduling problem, Balin [13] proposes a new crossover operator and a new criterion. The simulation results show that, in addition to its high computational speed, the proposed algorithm is suitable for such multiprocessor scheduling problems of minimizing the maximum completion time. Moreover, there is an increasing number of studies in the literature which deal with the development of scheduling strategies with genetic fuzzy systems [11, 14] and neuro-genetic systems [15]. The neuro-genetic approach


is a hybrid metaheuristic search technique that combines neural networks and genetic algorithms. Such a hybrid approach has been applied to multiprocessor scheduling problems. It is a non-deterministic iterative local search method which combines the benefits of a heuristic search and an iterative neural network search to improve scheduling solutions compared to a neural-network-based or genetic-algorithm-based approach alone. In fact, the scheduling of a multiprocessor system is a real-time online problem, as jobs are submitted over time and the precise processing times of these jobs are frequently not known in advance. Furthermore, information about upcoming jobs and the status of processors is usually not available. Several scheduling algorithms have been applied to multiprocessor systems to guarantee the time constraints of real-time processes [16]. The decisions of these algorithms are usually based on crisp parameters; in reality, however, the values of these parameters are vague [17]. This vagueness has motivated the use of fuzzy logic in scheduling decision making to obtain better system utilization. Our main contribution is a fuzzy approach to multiprocessor real-time scheduling in which the scheduling parameters are treated as fuzzy variables; simulation results indicate that the proposed fuzzy approach is very promising and has the potential to be considered in future research. The proposed scheduling algorithm has the following features:
– combining fuzzy decision making with gang scheduling
– dealing with vague information
– considering more than a single parameter in the scheduling
– suitability for real-time applications
– improving scheduling efficiency

The remainder of this paper is organized as follows. The scope of the scheduling problem in multiprocessor systems is outlined in the next section. In section 3, the gang scheduling approaches are reviewed in brief. The fuzzy decision maker of the modified task scheduler is described in section 4. In section 5, the simulation of the proposed scheduling model is described. In section 6, some numerical examples and computational results are provided to demonstrate the effectiveness of the proposed approach. Finally, conclusions and suggestions for future work are given in section 7.

2 Scheduling problem

Scheduling is mainly concerned with the optimal allocation of resources to activities over time, with the objective of optimizing one or several performance measures. Scheduling, in computer science, concerns allocating suitable resources to available tasks over time under the necessary constraints. Therefore, scheduling is an essential decision making function in any engineering system, since it determines what is going to be made, when, where and with what resources. Appropriate scheduling minimizes


completion time and reduces throughput time to improve performance. Task scheduling problems are characterized as combinatorial optimization problems [13]. Most solutions to the problem have been proposed in the form of heuristics [18], since it is difficult to achieve an optimal solution with traditional optimization methods. Mathematical optimization techniques can give optimal solutions for well-defined problems; however, in the case of multiprocessor scheduling their application is limited. Research efforts have therefore concentrated on intelligent approaches, such as fuzzy logic, neural networks and genetic algorithms. Among these various approaches to different scheduling problems, there has been an increasing interest in applying fuzzy logic as a good tool to deal with unknown parameters and vague information related to the scheduling problem. Most scheduling algorithms are based on single-parameter decision making and assume full and exact information about the job to be scheduled. A job that is more important and should be executed immediately may thus stay in the queue unexecuted. For this reason, many attempts have been made to adopt new scheduling algorithms that can deal with two or more parameters in task scheduling. This research work is mainly based on previous research [17] which applied a fuzzy decision maker to compute a new priority for each task based on its priority and execution time. The main objective of this research is to obtain a sub-optimal solution for task scheduling in multiprocessor systems by combining gang scheduling and fuzzy logic as a tool for decision making.

3 Scheduling in multiprocessor systems

In modern operating systems, scheduling is based on threads rather than processes. Applications can be implemented as a set of threads which cooperate and execute concurrently in the same address space. There are four general approaches to multiprocessor thread scheduling: load sharing, dedicated processor assignment, dynamic scheduling, and gang scheduling [8]. Gang scheduling is a time-space sharing technique where the tasks of a parallel job are grouped together, as a gang, and executed simultaneously on different processors. The number of tasks in a gang cannot exceed the number of available processors. Gang tasks interact efficiently by busy waiting, without the risk of waiting on a task which is currently waiting and not running on any processor [7]. The policy selected to schedule parallel tasks in a multiprocessor system affects the overall performance of the parallel running applications [18]. In this paper, the following policies for scheduling gangs are considered:
– First Come First Served Scheduling (FCFS): attempts to schedule the job which arrives first.


– Adapted First Come First Served (AFCFS): attempts to schedule a job as soon as the required processors are available. If there are not enough processors to assign a large job at the front of the queue, smaller jobs are scheduled.
– Largest Job First Served (LJFS): places the largest jobs at the top of the processor queue.
– Shortest Job First Served (SJFS): places the shortest jobs at the top of the processor queue.
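The four policies differ mainly in how the gang queue is ordered and whether a blocked job may be bypassed. The following Python sketch (our illustration, with simplified job tuples and field names of our choosing) captures that difference, assuming each job is a gang needing `size` processors:

```python
from collections import namedtuple

Job = namedtuple("Job", "arrival size runtime")

def order_queue(queue, policy):
    """Return the dispatch order of the gang queue under each policy."""
    if policy in ("FCFS", "AFCFS"):
        return sorted(queue, key=lambda j: j.arrival)
    if policy == "LJFS":
        return sorted(queue, key=lambda j: -j.size)   # largest gangs first
    if policy == "SJFS":
        return sorted(queue, key=lambda j: j.size)    # smallest gangs first
    raise ValueError(policy)

def dispatch(queue, free_cpus, policy):
    """Pick the next gang to start; AFCFS may bypass a blocked large job."""
    for job in order_queue(queue, policy):
        if job.size <= free_cpus:
            return job
        if policy != "AFCFS":
            return None   # strict policies wait until the head job fits
    return None

queue = [Job(0, 6, 9), Job(1, 2, 3)]
print(dispatch(queue, free_cpus=4, policy="FCFS"))   # None: head needs 6 CPUs
print(dispatch(queue, free_cpus=4, policy="AFCFS"))  # the 2-CPU job bypasses it
```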

4 Fuzzy-gang scheduling
Many scheduling algorithms have been studied and reported for single and multiprocessor scheduling. In several algorithms, scheduling is mainly based on one or more parameters that are assumed to be crisp. However, the real values of these parameters are vague. Therefore, it is necessary to use a suitable tool that can deal with vague parameters in decision making. Fuzzy logic is an effective tool for the scheduling problem, since it is a multi-valued logic derived from fuzzy set theory to deal with reasoning under vague information [17]. Fuzzy logic is specifically designed to represent uncertainty and vagueness and to provide formalized tools for dealing with the imprecision intrinsic to many problems. In this paper, a fuzzy-based scheduling algorithm is proposed to improve the performance of gang scheduling for multiprocessor systems. Normally, tasks are generated randomly and are served based on the FCFS approach, for example. In the proposed scheduling algorithm, a fuzzy decision maker is used to manage the available tasks by processing fuzzy inputs from the current state of the jobs and processors. The structure of the developed fuzzy decision maker, shown in Fig. 1, has three modules: a fuzzification module, a reasoning module, and a defuzzification module. The input variables are:
– Agree (AG) between the number of tasks per job and the number of processors.
– Job priority (PR).
– Job execution time (Tx).

Fig. 1. Layout of the fuzzy decision maker (crisp inputs: execution time, priority and agree; fuzzification; If...Then fuzzy rule base with inference engine; defuzzification; crisp output: the new priority).


Each input parameter has three sets: low (L), medium (M) and high (H). The new priority parameter is the module output, which has seven sets: very low (VL), low (L), below-mid (BM), mid (M), above-mid (AM), high (H) and very high (VH), as shown in Fig. 2. Triangular membership functions were selected here to minimize the computational overhead and to speed up decision making. The value of each parameter does not require full and exact information; it has a range between low and high. The measured variables are converted into suitable linguistic variables during the fuzzification stage.
Fuzzy Rules: Decisions are made according to a collection of linguistic rules, which describe the relationship between the input variables (agree (AG), execution time (Tx) and present priority (Pp)) and the output variable (new priority (Pn)). An important goal of fuzzy logic is to make reasonable inferences even when the condition of an implication rule is only partially satisfied. Table 1 lists the fuzzy rules of the proposed decision maker. There are 27 rules; each rule has three inputs and a single output and is represented by an IF-THEN statement, for example:
IF AGH and LTx and LPp THEN AMPn
This means that if the agreement between the gang size and the number of processors is high, the execution time is low and the present priority is low, then the newly calculated priority is above mid. The Mamdani-style inference process is used.

Fig. 2. Membership functions of the fuzzy sets. (a): Input variables (L, M, H over the range 0-100). (b): Output variable (VL, L, BM, M, AM, H, VH over the range 0-100).


Tab. 1. Fuzzy decision making rules.

            Agree: AGL            Agree: AGM            Agree: AGH
        LPp    MPp    HPp     LPp    MPp    HPp     LPp    MPp    HPp
HTx     VLPn   LPn    BMPn    LPn    LMPn   LPn     BMPn   MPn    AMPn
MTx     LPn    LMPn   MPn     BMPn   MPn    AMPn    MPn    AMPn   HPn
LTx     BMPn   MPn    AMPn    MPn    AMPn   HPn     AMPn   HPn    VHPn

The center of gravity defuzzification method is applied to convert the fuzzy output into a crisp value that represents the new priority of the job.
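The following Python sketch makes the Mamdani evaluation concrete; the membership breakpoints are illustrative assumptions (the paper's exact parameters are those of Fig. 2), not the authors' implementation.

import numpy as np

def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

IN = {"L": (-50, 0, 50), "M": (0, 50, 100), "H": (50, 100, 150)}
step = 100 / 6                                   # seven output sets over [0, 100]
OUT = {s: ((k - 1) * step, k * step, (k + 1) * step)
       for k, s in enumerate(["VL", "L", "BM", "M", "AM", "H", "VH"])}

def new_priority(ag, tx, pp, rules):
    """rules: list of ((AG set, Tx set, Pp set), Pn set), e.g. from Tab. 1."""
    y = np.linspace(0, 100, 501)
    agg = np.zeros_like(y)
    for (s_ag, s_tx, s_pp), s_pn in rules:
        w = min(tri(ag, *IN[s_ag]), tri(tx, *IN[s_tx]), tri(pp, *IN[s_pp]))
        agg = np.maximum(agg, np.minimum(w, tri(y, *OUT[s_pn])))  # clip, aggregate
    s = agg.sum()
    return float((y * agg).sum() / s) if s else 0.0               # centre of gravity

# IF AGH and LTx and LPp THEN AMPn  ->  a new priority around "above-mid"
print(new_priority(ag=90, tx=10, pp=10, rules=[(("H", "L", "L"), "AM")]))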

5 Task scheduling model
The proposed scheduling model is based on a fuzzy decision maker that takes imprecise information related to the arriving tasks and the current states of the processing elements. The fuzzy engine performs its analysis on the fuzzy input parameters and then makes decisions on the assignment of tasks to the available processing elements.
Simulation: The task scheduling simulation assumes a multiprocessor system with n identical, fully connected processing elements. Each processing element has a processor and a memory model. A job j can be processed by any of the available processors. Each processor can execute only one task at a time. Each task is a sequential unit of work in a program, and can be arbitrarily small. Each job consists of a set of tasks (the gang size). Under gang scheduling, all members of a gang run simultaneously on the available processors, and all gang tasks start and end together.
Assumptions:
– The system is composed of n identical processors and m jobs.
– Each job is characterized by three parameters (T_c(j), d_j and Pp_j), where T_c(j) is the computation time, d_j is a relative deadline, and Pp_j is the priority of job (j).
– Each job generates an infinite sequence of tasks.
– Each task is independent and preemptive.
– Tasks produced by the same job must be executed sequentially, which means that no task of job (j) is allowed to begin before the tasks of job (j − 1) complete.
Performance evaluation: In order to evaluate the system's performance we employ the following factors and metrics:
– Throughput: the number of processes that complete per unit time.
– Average turnaround time: the time from submission to completion of a job.


– Average waiting time: the amount of time spent ready to run but not running.
– Efficiency: the ratio of the energy used in doing the work to the total energy supplied, which in our case is given by:

Efficiency = T_c^max / T_total = ( ∑_{i=0}^{n} ∑_{j=1}^{m} M(i, j) T_s(j) ) / ( ∑_{i=0}^{n} ∑_{j=1}^{m} T_s(j) )     (1)

where T_c^max is the maximum completion time, T_s(j) is the slot time for job (j), and M(i, j) is the Boolean variable which determines whether job (j) is executed by processor (i). A minimal implementation of this metric is sketched below.
Case studies: Several case studies were generated with varying numbers of processors and jobs. These case studies were configured according to the following parameters:
– the number of tasks per job with respect to the number of processors;
– the dependency between jobs;
– the slot time (T_s): the maximum execution time among the tasks of a job;
– the number of processors, varied between 4 and 16;
– the number of jobs, between 5 and 50;
– the mean duration of jobs in each test, between 5 and 40.
For the first case study, there are six independent jobs, each with several tasks, as given in Tab. 2.
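The sketch below gives one possible reading of equation (1) in Python, reusing the Job/schedule sketch from section 3; it is our illustration under that assumption, not the authors' code. Each time slot "supplies" all n processors for the slot duration, while the numerator counts busy processor-time.

def metrics(slots, n):
    """slots: output of schedule(); returns the efficiency in [0, 1]."""
    busy = total = 0.0
    for slot in slots:                       # one column of Tab. 3
        ts = max(job.slot for job in slot)   # slot time = longest gang in the slot
        total += n * ts                      # all n processors supplied for ts
        busy += sum(job.tasks * job.slot for job in slot)  # busy processor-time
    return busy / total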

6 Results and discussion
To demonstrate the applicability and performance of the fuzzy gang scheduling algorithm, it is compared with well-known scheduling algorithms, namely LJFS, SJFS, FCFS and AFCFS. Four case studies were considered in this analysis for a homogeneous system, in terms of efficiency, throughput, average waiting time, and average turnaround time.

Tab. 2. Case-study one.

Job & Sequence    J3    J4    J6    J1    J2    J5
Tasks/Job         10     2     8     5     7     8
Slot time         10     8     5     5    10    10


Table 3 shows the task scheduling of the given jobs under the different scheduling approaches. The same analysis can be applied to the second case study, which considers six jobs and ten processors. The third case study has 15 jobs divided into three levels, each job having a number of tasks and a specific slot time, as in Fig. 3. Seven dependent jobs with 10 homogeneous processors were considered in the fourth case study. The performance analysis of the proposed fuzzy scheduling approach, compared with the other approaches, is given for all case studies in Tab. 4 and Fig. 4. It is clear that:
– The FCFS method is fair to the jobs in the wait queue. The main drawback of this strategy occurs when a process monopolizes the CPU.

Tab. 3. Task scheduling for case study 1 (processors 1-8; each column of the original table is one time slot).

(a) FCFS scheduling:
Slot 1: J5 (Ts = 10) on processors 1-8.
Slot 2: J2 (Ts = 10) on processors 1-7; processor 8 idle.
Slot 3: J1 (Ts = 5) on processors 1-5; processors 6-8 idle.
Slot 4: J6 (Ts = 5) on processors 1-8.
Slot 5: J4 (Ts = 8) on processors 1-2; processors 3-8 idle.
Slot 6: J3 (Ts = 10) on processors 1-8.
Slot 7: J3 (Ts = 10) on processors 1-2; processors 3-8 idle.

(b) LJFS scheduling:
Slot 1: J3 (Ts = 10) on processors 1-8.
Slot 2: J3 (Ts = 10) on processors 1-2; processors 3-8 idle.
Slot 3: J5 (Ts = 10) on processors 1-8.
Slot 4: J6 (Ts = 5) on processors 1-8.
Slot 5: J2 (Ts = 10) on processors 1-7; processor 8 idle.
Slot 6: J1 & J4 (Ts = 8): J1 on processors 1-5, J4 on processors 6-7; processor 8 idle.

(c) SJFS scheduling:
Slot 1: J1 & J4 (Ts = 8): J4 on processors 1-2, J1 on processors 3-7; processor 8 idle.
Slot 2: J2 (Ts = 10) on processors 1-7; processor 8 idle.
Slot 3: J5 (Ts = 10) on processors 1-8.
Slot 4: J6 (Ts = 5) on processors 1-8.
Slot 5: J3 (Ts = 10) on processors 1-8.
Slot 6: J3 (Ts = 10) on processors 1-2; processors 3-8 idle.

(d) AFCFS scheduling:
Slot 1: J5 (Ts = 10) on processors 1-8.
Slot 2: J2 (Ts = 10) on processors 1-7; processor 8 idle.
Slot 3: J1 & J4 (Ts = 8): J1 on processors 1-5, J4 on processors 6-7; processor 8 idle.
Slot 4: J6 (Ts = 5) on processors 1-8.
Slot 5: J3 (Ts = 10) on processors 1-8.
Slot 6: J3 (Ts = 10) on processors 1-2; processors 3-8 idle.

(e) FLS scheduling:
Slot 1: J6 (Ts = 5) on processors 1-8.
Slot 2: J1 (Ts = 5) on processors 1-5; processors 6-8 idle.
Slot 3: J5 (Ts = 10) on processors 1-8.
Slot 4: J2 (Ts = 10) on processors 1-7; processor 8 idle.
Slot 5: J3 (Ts = 10) on processors 1-8.
Slot 6: J3 & J4 (Ts = 10): J3 on processors 1-2, J4 on processors 3-4; processors 5-8 idle.

Fig. 3. Jobs of case study 3 (job: number of tasks, slot time). Level 1: J1 (5, 5), J2 (7, 10), J3 (10, 10), J4 (2, 8), J5 (8, 10). Level 2: J6 (8, 5), J7 (7, 15), J8 (8, 2), J9 (2, 12), J10 (10, 15). Level 3: J11 (6, 11), J12 (8, 10), J13 (2, 9), J14 (4, 9), J15 (10, 4).







– The AFCFS algorithm attempts to schedule a job as soon as the assigned processors are available. If there are not enough processors to assign a large job at the front of the queue, smaller jobs are scheduled. This allows the smaller jobs to be executed before the larger ones.
– The LJFS algorithm places the largest jobs at the top of the processor queue. The job for which the assigned processors are available is executed first; the job at the top is considered first and then the following jobs in the work queue. This method gives a large job higher priority than a smaller one, which may lead to starvation.
– The SJFS algorithm places the shortest jobs at the top of the processor queue. This method gives a short job higher priority than a large one, and thus does not solve the starvation problem of LJFS.
– The average waiting time, average turnaround time, throughput and efficiency obtained from the proposed FLS scheduling algorithm are better than or equal to those obtained from the other methods.


Tab. 4. Performance analysis of the proposed scheduling approach.

Case Study    Processors    Jobs    Tasks per Job    Average Slot time
1             8             6       2-10             —
2             10            6       3-10             10.5
3             8             15      2-10             9.6
4             10            7       1-10             9.0

Performance           FCFS      LJFS      SJFS      AFCFS     FLS
Case Study 1:
Efficiency (%)        71.33     78.06     78.06     78.06     82.75
Throughput            0.1034    0.1132    0.1132    0.120     0.1132
Average Tw            20.5      21.66     14.5      15.166    10.83
Average Tt            30.16     30.5      23.33     24        19.16
Case Study 2:
Efficiency (%)        79.52     84.91     82.13     82.1      82.13
Throughput            0.0952    0.1017    0.0983    0.0983    0.098
Average Tw            27.83     26.5      14.1      23.166    15.50
Average Tt            38.33     36.33     24.33     33.83     25.66
Case Study 3:
Efficiency (%)        69.84     72.16     72.1      78.44     80.1
Throughput            0.1       0.1       0.0967    0.1086    0.111
Average Tw            66.93     63.6      58.26     52.26     45.1
Average Tt            77.26     73.6      68.26     61.47     54.1
Case Study 4:
Efficiency (%)        56.4      88.125    88.125    86.47     88.125
Throughput            0.28      0.4375    0.4375    0.411     0.4375
Average Tw            11.28     4.2857    5.285     5.1428    4.285
Average Tt            14.85     6.571     7.571     7.571     6.5714

Fig. 4. Performance of the FLS approach compared with the other approaches (efficiency (%), throughput, average waiting time and average turnaround time for FCFS, LJFS, SJFS, AFCFS and FLS). (a): Case Study 1. (b): Case Study 2. (c): Case Study 3. (d): Case Study 4.


7 Conclusions
Different multiprocessor scheduling algorithms have different characteristics, and no single one is absolutely ideal for every application. The proposed scheduling approach is an attempt to incorporate fuzzy logic as an effective tool for dealing with vague and incomplete information in the scheduling process. When a fuzzy decision maker is combined with gang scheduling, a modified dynamic scheduling algorithm is obtained. Applied to multiprocessor systems, such an algorithm achieves several features:
– The fuzzy scheduling algorithm gives good results, close to LJFS, when the gang size is less than or equal to the number of processors.
– FLS is a dynamic scheduling algorithm and can deal with more than one parameter in the scheduling process. It can also deal with uncertain and vague information and can solve starvation problems.
– Fuzzy scheduling does not require full and exact information about each job.
– Fuzzy scheduling calculates a new priority for each job according to the job situation; it is therefore suitable for real-time applications.
For future work, this work may be extended to deal with:
– heterogeneous multiprocessor systems;
– multiprocessor systems under processor failure;
– distributed systems, by considering the communication speed and time delay between devices.

Bibliography
[1] U. Farooq, S. Majumdar and E.W. Parsons. Achieving efficiency, quality of service and robustness in multi-organizational Grids. J. of Systems and Software, 82:23-38, 2009.
[2] S. Kato. A Fixed-Priority Scheduling Algorithm for Multiprocessor Real-Time Systems. In: Parallel and Distributed Computing, edited by Alberto Ros, :144-158, InTech Publisher, January 2010.
[3] A.S. Tanenbaum. Modern Operating Systems. Prentice Hall, USA, 2007.
[4] D. Bishop. Survey on scheduling algorithms for multiprocessing systems. 2008. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.3010&rep=rep1&type=pdf
[5] H. Karatza. A simulation-based performance analysis of gang scheduling in a distributed system. 32nd Annual Proc. (Simulation Symposium), :26-33, April 1999.
[6] H. Karatza. Scheduling gangs in a distributed system. Int. Journal of Simulation, Systems, Science & Technology, 7(1):15-22, 2007.
[7] Z.C. Papazachos and H.D. Karatza. Gang scheduling in multi-core clusters implementing migrations. Future Generation Computer Systems, 27:1153-1165, 2011.
[8] M. Rantonen, T. Frantti and K. Leiviska. Fuzzy expert system for load balancing in symmetric multiprocessor systems. Expert Systems with Applications, 37:8711-8720, 2010.
[9] M. Litoiu, T.C. Ionescu and J. Labarta. Dynamic task scheduling in distributed real-time systems using fuzzy rules. Microprocessors & Microsystems, 21:299-311, 1998.
[10] T.P. Hong, C.M. Huang and K.M. Yu. LPT scheduling for fuzzy tasks. Fuzzy Sets and Systems, 97:277-286, 1998.
[11] E.S. Hou, N. Ansari and H. Ren. A genetic algorithm for multiprocessor scheduling. IEEE Trans. on Parallel and Distributed Systems, 5(2):113-120, February 1994.
[12] S.C. Cheng, D.F. Shiau, Y.M. Huang and Y.T. Lin. Dynamic hard-real-time scheduling using genetic algorithm for multiprocessor task with resource and timing constraints. Expert Systems with Applications, 36:852-860, 2009.
[13] S. Balin. Non-identical parallel machine scheduling using genetic algorithm. Expert Systems with Applications, 38:6814-6821, 2011.
[14] C. Franke, F. Hoffmann, J. Lepping and U. Schwiegelshohn. Development of scheduling strategies with genetic fuzzy systems. Applied Soft Computing, 8:706-721, 2008.
[15] A. Agarwal. A neurogenetic approach for multiprocessor scheduling. In: Multiprocessor Scheduling: Theory & Applications, edited by E. Levner, I-Tech Education & Publishing, Austria, :121-136, 2007.
[16] M. Sabeghi, H. Deldari, V. Salmani, M. Bahekmat and T. Taghavi. A fuzzy algorithm for real time scheduling of soft periodic tasks on multiprocessor systems. Int. Conf. Applied Computing, IADIS, :467-471, Spain, 25-28 February 2006.
[17] S.J. Kadhim and K.M. Al-Aubidy. Design and evaluation of a fuzzy-based CPU scheduling algorithm. Int. Conf. on Recent Trends in Business Administration and Information Processing BAIP'2010, :45-52, India, April 2010.
[18] S. Salleh and A.Y. Zomaya. Multiprocessor scheduling using mean-field annealing. Future Generation Computer Systems, 14:393-408, 1998.

Biography
Kasim M. Al-Aubidy received his BSc and MSc degrees in control and computer engineering from the University of Technology, Iraq in 1979 and 1982, respectively, and his PhD degree in real-time computing from the University of Liverpool, England in 1989. He is currently a professor and dean of the Engineering Faculty at Philadelphia University, Jordan. His research interests include fuzzy logic, neural networks, genetic algorithms and their real-time applications. He was the winner of the Philadelphia Award for the best researcher in 2000. He is also the chief editor of two international journals and a member of the editorial boards of several scientific journals. He has co-authored 4 books and published 82 papers on topics related to computer applications.
Khaldoon Batiha received his BSc degree in Computer Engineering from Venitsia University, Ukraine in 1991, and his PhD degree in Computer Networks from Venitsia University, Ukraine in 1995. He is currently an associate professor and dean of Admissions and Registration at Philadelphia University, Jordan. His research interests include computer networks, operating systems and intelligent systems. He has co-authored 2 books and published 16 papers on topics related to computer applications.


Hesham Y. Al-Kharbshh received his BSc degree in Computer Systems Engineering from Balqa Applied University, Jordan in 1991, and his MSc degree in Computer Systems from Philadelphia University, Jordan in 2011. He is currently a lecturer at The World Islamic Sciences and Education University, Jordan. His research interests include parallel computers and fuzzy logic applications in intelligent systems. He has published 2 papers in his field of interest.

H. Ouazzane, H. Mahersia and K. Hamrouni

A Robust Multiple Watermarking Scheme Based on the DWT
Abstract: In this paper we make contributions to a non-blind multiple watermarking scheme that embeds a binary image in the discrete wavelet transform bands of a gray scale image. Unlike common wavelet-based watermarking techniques, the proposed scheme relies essentially on marking the approximation and diagonal bands of the discrete wavelet transform (DWT) of the cover image, achieving a better compromise between fidelity and robustness. Experiments show that our contributions provide the multiple watermarking scheme with robustness to a wide variety of attacks.
Keywords: Digital Watermarking, Discrete Wavelet Transform, Non-Blind Image Watermarking.

1 Introduction
The development of communication networks and the trivialization of image processing tools have given rise to content security problems, underscoring the need to secure digital images against illegal modification, protect their economic interest and ensure intellectual property. Digital image watermarking is an attractive alternative that matches these necessities. This technique consists in embedding a permanent watermark in a cover image in such a way that the watermarked image remains accessible to everyone and the embedded watermark can be decoded after the watermarked image has undergone several attacks. Potential attacks can be non-malicious, like compression and image enhancement techniques, or malicious, like rewatermarking and cropping [1, 2]. The embedded mark can be visible or invisible. Digital watermarking has many applications according to the type of the watermark and the technique used. In general, visible watermarking is used to reveal ownership; invisible robust watermarking is used for copyright protection and the organization of digital contents in archiving systems; and invisible fragile watermarking is used for tampering detection. Image watermarking usually requires three relevant criteria [3]:
– Fidelity: the watermarking process should not distort the original image, so as to preserve its commercial value.

H. Ouazzane, H. Mahersia and K. Hamrouni: Université de Tunis El Manar, Ecole Nationale d’Ingénieurs de Tunis. LR-SITI: Signal Image et Technologies de l’Information, Tunis, Tunisia, emails: [email protected], [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 97–112. https://doi.org/10.1515/9783110470383-007






– Robustness: the inserted mark should remain detectable after the cover image has undergone potential attacks. It should, however, be difficult and complex for unauthorized people to detect. Fragile watermarks, in contrast, should be altered in an irreversible way if the cover image has been modified.
– Capacity: describes the amount of data that can be inserted in the cover image. Watermarking schemes should have high capacity.

Every watermarking scheme includes an encoder, the process that embeds the watermark in the cover image, and a decoder, the process that detects or extracts the watermark. Watermarking schemes can be distinguished according to the encoding and decoding domain. Images can be represented either in the spatial domain, i. e. the image pixel domain, or in a transformed domain such as the discrete cosine transform or discrete wavelet transform domain. Watermark embedding in the spatial domain is performed by modifying the cover image pixel values; watermark embedding in a transformed domain is performed by modifying the image coefficients in that domain. Watermarking schemes can also be distinguished according to the watermark embedding approach:
– LSB substitution approach: embeds the watermark by substituting some specific least significant bits (LSB) of the cover image pixels, as in the schemes [4-6].
– Additive insertion: adds the watermark to some image components, as in the schemes [7-9].
– Statistical approach: known as Patchwork [10], it performs by pseudo-randomly choosing pixels from the cover image and modifying their luminosity.
– Visual approach: texture block watermarking is a method based on the visual approach. It uses random texture patterns in the cover image and performs by producing identical textured regions by copying a randomly chosen pattern [10].
– Quantization-based watermarking: this approach uses quantization to embed the watermark in the cover image. For example, the scheme proposed in [11] quantizes coefficients relative to some special image edges to embed the binary watermark bits.
Watermarking schemes can also be classified according to the decoder type. There are three decoding modes:
– Non-blind decoding: requires at least the original image.
– Semi-blind decoding: uses only the original watermark.
– Blind decoding: extracts the watermark from the possibly distorted image using neither the original image nor the original watermark.
In this paper, we present a new watermarking scheme based on the DWT. This transform is commonly used in digital watermarking because of its advantages.


Keyvanpour and Merrikh-Bayat propose in [11] a blind watermarking scheme that embeds the watermark in the HL and LH sub-bands resulting from a multilevel DWT, using the quantization approach. Tao and Eskicioglu propose in [12] a non-blind multiple watermarking scheme based on the DWT. First, they apply the first or second level DWT to the cover image; the level choice depends on the watermark size, which must be equal to the size of each sub-band thumbnail. Following the additive approach, they embed four copies of the binary watermark into the LL, HL, LH and HH sub-bands. They then apply the IDWT to get back to the spatial domain and obtain the watermarked image. The decoding process consists in applying the DWT and extracting the four embedded watermark copies, which are afterwards compared to the original watermark to check the watermark presence in the attacked image. For objective examination, they calculate the similarity ratio (SR) between each extracted watermark and the original one, and consider that the highest SR value identifies the most resistant sub-band for a given attack. Our contributions consist in embedding only two copies of the watermark, in the high and low frequency sub-bands: in fact, the extraction results in [12] show that the highest SR values are always found in the LL or HH sub-bands, according to the attack type. The paper continues as follows: in section 2 we present a brief introduction to the two-dimensional DWT and we describe the encoding and decoding processes. Section 3 is dedicated to the experimental evaluation; in this section we present our test platform and the results of the method simulation, and we compare the new scheme to some schemes based on the DWT. Finally, in section 4, we give our observations regarding the obtained simulation results and our perspectives.

2 Proposed watermarking method Every two-dimensional DWT decomposition level produces four representations of an image: an approximation image (LL) and three detail ones (LH, HL and HH). The approximation image represents the image low frequencies, it has the largest coefficient magnitudes at each level and, thus, contains the most significant information of the image. To obtain the next level decomposition, the two dimensional DWT is applied to the LL sub-band. The detail images are called the vertical (LH), the horizontal (HL) and the diagonal (HH) sub-bands, they represent the mid and high frequency sub-bands and contain information about edges and texture patterns. Figure 1 shows a two level decomposition. In the following, we describe the embedding and the extraction process.


Fig. 1. Two level DWT decomposition (LL2, HL2, LH2 and HH2 within the first-level LL band, alongside HL1, LH1 and HH1).

2.1 Watermark embedding process
The cover image (I) is a gray scale image. If the cover image size is N × N, then the binary watermark image (W) must be of size N/2^n × N/2^n, where n is the decomposition level used during the embedding process.
1. Decompose I using the n-level two-dimensional DWT.
2. Insert the watermark in the LL_n and HH_n sub-bands by modifying their coefficients as follows:

Î_k(i, j) = I_k(i, j) + α_k W(i, j)     (1)

where k denotes the sub-band LL_n or HH_n, Î_k is the watermarked LL_n or HH_n image representation, and α_k denotes the scaling factor corresponding to each sub-band. Indeed, we do not use the same scaling factor for the LL_n and HH_n sub-bands, since their coefficients are not of the same magnitude order.
3. Apply the n-level IDWT to obtain the watermarked image Î in the spatial domain.

2.2 Watermark extraction process
Let I′ be the possibly corrupted image.
1. Decompose I′ using the n-level two-dimensional DWT.
2. Extract the watermark from the LL_n and HH_n sub-bands as follows:

W_k(i, j) = (I′_k(i, j) − I_k(i, j)) / α_k     (2)

where W_k is the watermark extracted from the LL_n or HH_n sub-band.

3. Convert W_k to a binary image by applying a simple thresholding:

W_k(i, j) = 1 if W_k(i, j) > 0.5, and 0 otherwise.     (3)
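A minimal Python sketch of steps (1)-(3) is given below, using PyWavelets; the wavelet choice and the alpha values are illustrative assumptions, not the authors' settings. In wavedec2's coefficient list, the diagonal detail cD of the coarsest level corresponds to the HH_n sub-band.

import numpy as np
import pywt

def embed(cover, wm, level=1, a_ll=0.1, a_hh=1.0, wavelet="haar"):
    """wm: binary watermark of size N/2**level, per the size rule above."""
    coeffs = pywt.wavedec2(cover.astype(float), wavelet, level=level)
    cH, cV, cD = coeffs[1]                   # level-n details; cD is HH_n
    coeffs[0] = coeffs[0] + a_ll * wm        # equation (1) on LL_n
    coeffs[1] = (cH, cV, cD + a_hh * wm)     # equation (1) on HH_n
    return pywt.waverec2(coeffs, wavelet)

def extract(marked, cover, level=1, a_ll=0.1, a_hh=1.0, wavelet="haar"):
    cm = pywt.wavedec2(marked.astype(float), wavelet, level=level)
    co = pywt.wavedec2(cover.astype(float), wavelet, level=level)
    w_ll = (cm[0] - co[0]) / a_ll            # equation (2) on LL_n
    w_hh = (cm[1][2] - co[1][2]) / a_hh      # equation (2) on HH_n
    return (w_ll > 0.5).astype(np.uint8), (w_hh > 0.5).astype(np.uint8)  # eq. (3)

For a 512 × 512 cover and a first level decomposition, the watermark must be 256 × 256, consistent with the size rule stated above.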

3 Experimental evaluation
In this part, we test the proposed watermarking scheme on the 512 × 512 gray scale Goldhill test image and we use three binary watermarks (Fig. 2). The Goldhill test image is also used in Tao and Eskicioglu's paper, which makes it convenient for comparing the proposed scheme with Tao and Eskicioglu's method. The tests involve watermarking of the LL and HH sub-bands for first and second level DWT decomposition. To evaluate the fidelity of the proposed scheme, we measure the visual quality of the watermarked image using the Peak Signal to Noise Ratio (PSNR):

PSNR = 20 log10 (255 / RMSE)     (4)

Fig. 2. Test platform. (a): Goldhill cover test image. (b): Watermark used to test robustness against attacks. (c): Watermark used for the rewatermarking attack. (d): Watermark used to compare the proposed scheme with Tao and Eskicioglu's scheme.


where RMSE is the square root of the mean squared error (MSE) between the original image and the distorted one:

RMSE = √( (1/N²) ∑_{i=1}^{N} ∑_{j=1}^{N} [I′(i, j) − I(i, j)]² )     (5)

Qualitative evaluation of the watermark presence can be done by comparing the two extracted watermarks with the original one. Quantitative evaluation is performed by calculating the similarity ratio (SR) between each extracted watermark and the original one. The SR value lies between 0 and 1:

SR = S / (S + D)     (6)

where S is the number of matching pixels between the original and extracted watermarks, and D is the number of differing pixels between the same images. Figure 3 presents the PSNR values of twelve first-level watermarked images (Goldhill, Lena, Peppers, Couple, Cameraman, Boat, F16, Barbara, Mandrill, Printer test, Zelda and Pirate). The figure shows that all the PSNR values are greater than 40 db. Figure 4 presents the results of embedding the "WMK" binary logo into the Goldhill test image (Fig. 2). Embedding the "BC" binary logo, used in Tao and Eskicioglu's paper, in the Goldhill image gives PSNR values slightly exceeding the PSNRs indicated in Tao and Eskicioglu's paper:

Fig. 3. PSNRs of twelve watermarked gray scale images, each of size 512 × 512 (Goldhill, Lena, Peppers, Couple, Cameraman, Boat, F16, Barbara, Mandrill, Printer test, Zelda and Pirate; all above 40 db).

Fig. 4. Watermarking results: watermarked images at first level decomposition. (a): PSNR = 42.724 db. (b): PSNR = 42.723 db.

– First level decomposition: PSNR = 42.724 db with the proposed scheme vs. PSNR = 42.400 db with Tao and Eskicioglu's method.
– Second level decomposition: PSNR = 42.701 db with the proposed scheme vs. PSNR = 42.230 db with Tao and Eskicioglu's method.
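To make the fidelity and similarity measures concrete, a minimal numpy sketch of equations (4)-(6) follows; the function names are ours, and the inputs are assumed to be gray scale arrays (for PSNR) and binary arrays (for SR).

import numpy as np

def psnr(original, distorted):
    rmse = np.sqrt(np.mean((original.astype(float) - distorted.astype(float)) ** 2))
    return 20 * np.log10(255.0 / rmse)       # equations (4) and (5)

def similarity_ratio(wm, wm_extracted):
    s = np.count_nonzero(wm == wm_extracted) # S: matching pixels
    d = wm.size - s                          # D: differing pixels
    return s / (s + d)                       # equation (6)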

Table 1 provides the extraction results without applying any attack to the watermarked image. To evaluate the robustness of the scheme, we applied different attacks to the watermarked Goldhill image; for each attacked image, we extracted the two embedded watermarks and calculated the SRs. Tables 2 and 3 provide the SRs of the watermarks extracted from the LL and HH sub-bands after each attack. These results suggest that:
– The LL sub-bands are most resistant to lossy compression, filtering, geometrical deformations and noise addition.
– The HH sub-bands are robust to nonlinear deformations of the gray scale.
– Both sub-bands are resistant to rewatermarking.

Tab. 1. Watermark extraction results.
(a) First level decomposition: LL: SR = 1.000; HH: SR = 1.000.
(b) Second level decomposition: LL: SR = 1.000; HH: SR = 1.000.


Tab. 2. Extraction results for first decomposition level.

Attack                                    LL: SR    HH: SR
(a) JPEG compression Q = 25               0.813     0.477
(b) JPEG compression Q = 50               0.899     0.478
(c) JPEG compression Q = 75               0.958     0.476
(d) Gaussian filtering (3 × 3)            0.879     0.476
(e) Gaussian filtering (5 × 5)            0.773     0.477
(f) Sharpening                            0.936     0.916
(g) Histogram equalization                0.682     0.885
(h) Intensity adjustment (0 0.8)(0 1)     0.855     0.897
(i) Gamma correction (1.5)                1.000     1.000
(j) Pixelate                              0.800     0.475
(k) Gaussian noise (0 0.001)              0.782     0.653
(l) Rescaling (512 → 256 → 512)           0.903     0.476
(m) Cropping                              0.863     0.920
(n) Rewatermarking                        0.880     0.880

Tab. 3. Extraction results for second decomposition level.

Attack                                    LL: SR    HH: SR
(a) JPEG compression Q = 25               1.000     1.000
(b) JPEG compression Q = 50               0.985     0.607
(c) JPEG compression Q = 75               0.991     0.828
(d) Gaussian filtering (3 × 3)            0.971     0.634
(e) Gaussian filtering (5 × 5)            0.893     0.442
(f) Sharpening                            0.992     0.931
(g) Histogram equalization                0.691     0.884
(h) Intensity adjustment (0 0.8)(0 1)     0.853     0.895
(i) Gamma correction (1.5)                1.000     1.000
(j) Pixelate                              0.860     0.512
(k) Gaussian noise (0 0.001)              0.928     0.774
(l) Rescaling (512 → 256 → 512)           0.981     0.636
(m) Cropping                              0.877     0.926
(n) Rewatermarking                        0.885     0.885




– Robustness is enhanced for second level decomposition. In particular, the visual quality of the LL extracted watermarks and their SR values are visibly increased in Table 3 for lossy compression, low-pass filtering, sharpening and noise addition.

Comparison with previous methods
In this part, we compare the experimental results of the proposed method with Tao and Eskicioglu's method and Yuan's method [13], both of which are based on the multiple watermarking approach in the DWT domain. Results are shown in figures 5 to 10; they are based on applying the same attacks on the Goldhill test image watermarked with the same binary logo for each comparison. Figures 5, 7 and 9 provide the SR values after watermark extraction from the LL sub-band. They show that the SRs of the proposed method exceed the SRs of both previous methods; in particular, the robustness of the LL sub-band has improved significantly for the gray scale deformation attacks, such as histogram equalization. Figures 6, 8 and 10 provide the SR values after watermark extraction from the HH sub-band. The SR values reveal that the robustness of the HH sub-band is also enhanced with the proposed scheme.

Fig. 5. LL sub-band robustness comparison between Tao's method and the proposed method for first level decomposition. (a) JPEG compression (Q = 25), (b) JPEG compression (Q = 50), (c) JPEG compression (Q = 75), (d) Gaussian filtering (3×3), (e) Sharpening, (f) Rescaling (512->256->512), (g) Gaussian noise ([0 0.001]), (h) Histogram equalization, (i) Intensity adjustment ([0 0.8][0 1]), (j) Gamma correction (1.5), (k) Rewatermarking.


Fig. 6. HH sub-band robustness comparison between Tao's method and the proposed method for first level decomposition. (a) Histogram equalization, (b) Intensity adjustment ([0 0.8], [0 1]), (c) Gamma correction (1.5), (d) Sharpening, (e) Gaussian noise ([0 0.001]), (f) Rewatermarking.

Fig. 7. LL sub-band robustness comparison between Tao's method and the proposed method for second level decomposition. (a) JPEG compression (Q = 25), (b) JPEG compression (Q = 50), (c) JPEG compression (Q = 75), (d) Gaussian filtering (3×3), (e) Sharpening, (f) Rescaling (512->256->512), (g) Gaussian noise ([0 0.001]), (h) Histogram equalization, (i) Intensity adjustment ([0 0.8][0 1]), (j) Gamma correction (1.5), (k) Rewatermarking.


Fig. 8. HH sub-band robustness comparison between Tao's method and the proposed method for second level decomposition. (a) Histogram equalization, (b) Intensity adjustment ([0 0.8], [0 1]), (c) Gamma correction (1.5), (d) Sharpening, (e) Gaussian noise ([0 0.001]), (f) Rewatermarking.

Fig. 9. LL sub-band robustness comparison between Yuan's method and the proposed method for first level decomposition. (a) JPEG compression (Q = 75), (b) Gaussian filtering (3×3), (c) Rescaling (512->256->512), (d) Gaussian noise ([0 0.001]), (e) Gamma correction (1.5), (f) Cropping.


Fig. 10. HH sub-band robustness comparison between Yuan's method and the proposed method for first level decomposition. (a) JPEG compression (Q = 75), (b) Histogram equalization, (c) Intensity adjustment ([0 0.8], [0 1]), (d) Gamma correction (1.5), (e) Gaussian noise ([0 0.001]), (f) Cropping.

4 Conclusion
In this paper, we have made a contribution to a multiple non-blind watermarking scheme based on the DWT. The proposed scheme consists in applying the DWT to the gray scale cover image and modifying the LL and HH sub-band coefficients in order to insert the binary watermark according to an additive approach. Experimental results indicate that modification of the LL and HH sub-bands results in good fidelity and robustness against a large range of attacks. Watermark embedding with second level decomposition results in better robustness. Objective evaluation shows that the proposed method outperforms Tao and Eskicioglu's scheme in terms of fidelity and robustness. The proposed watermarking method can be further improved by automating the selection of the optimal thresholding parameter and the appropriate scaling factor for each band.

Bibliography
[1] V.M. Potdar, S. Han and E. Chang. A Survey of Digital Image Watermarking Techniques. 3rd Int. Conf. on Industrial Informatics, 2005.
[2] S.P. Mohanty. Digital Watermarking: A Tutorial Review, 2005.
[3] E. Ganic and A.M. Eskicioglu. Robust DWT-SVD Domain Image Watermarking: Embedding Data in All Frequencies. Workshop on Multimedia and Security, New York, USA, 2004.
[4] P.-Y. Chen and H.-J. Lin. A DWT Based Approach for Image Steganography. Int. J. of Applied Science and Engineering, 4(3):275-290, 2006.
[5] A. Bamatraf, R. Ibrahim and M.N.M. Salleh. A New Digital Watermarking Algorithm Using Combination of Least Significant Bit (LSB) and Inverse Bit. J. of Computing, 3, 2011.
[6] R.G. Van Schyndel, A.Z. Tirkel and C.F. Osborne. A Digital Watermark. IEEE Int. Conf. on Image Processing, ICIP-94, 2:86-90, 1994.
[7] E.T. Lin and E.J. Delp. Spatial Synchronization Using Watermark Key Structure. Proc. SPIE, 5306:536-547, 2004.
[8] P. Bas, B. Roue and J.-M. Chassery. Tatouage d'Images Couleur Additif: Vers la Sélection d'un Espace d'Insertion Optimal. Compression et Représentation des Signaux Audiovisuels, Lyon, France, 2003.
[9] S. Rastegar, F. Namazi, K. Yaghmaie and A. Aliabadian. Hybrid Watermarking Algorithm Based on Singular Value Decomposition and Radon Transform. Int. J. of Electronics and Communication, 658-663, 2011.
[10] W. Bender, D. Gruhl, N. Morimoto and A. Lu. Techniques for Data Hiding. IBM Systems J., 35:313-336, 1996.
[11] M.-R. Keyvanpour and F. Merrikh-Bayat. Robust Dynamic Watermarking in DWT Domain. Procedia Computer Science, 3, 2010.
[12] P. Tao and A.M. Eskicioglu. A Robust Multiple Watermarking Scheme in the Discrete Wavelet Transform Domain. Proc. SPIE, 5601:133-144, 2004.
[13] Y. Yuan, D. Huang and D. Liu. An Integer Wavelet Based Multiple Logo-Watermarking Scheme. 1st Int. Multi-Symp. on Computer and Computational Sciences, 2:175-179, 2006.

Biographies
Hana Ouazzane is a PhD student working on image watermarking under the supervision of Pr. Kamel Hamrouni. She received her computer science engineering degree in June 2012 from the National Engineering School of Tunis (ENIT), University of Tunis El Manar. She is interested in multimedia security, notably digital watermarking, data hiding and the protection of 3D objects.
Hela Mahersia (Dr. Ing.) received both her master degree and her PhD degree in electrical engineering (image processing) from the National Engineering School of Tunis (ENIT, Tunisia), and her Electronic Engineer Diploma in 2001 from the National Engineering School of Monastir (ENIM, Tunisia). She is currently an associate professor at the Faculty of Science of Bizerte (FSB, Tunisia). She has taught several computer science courses, especially related to image processing, DSP and pattern recognition, since 2004. Her research interests include image processing, texture processing, segmentation, pattern recognition and artificial intelligence.


Kamel Hamrouni received his Master and PhD degrees from "Pierre and Marie Curie" University, Paris, France. He received his "HDR" diploma from the University of Tunis El Manar, Tunisia. He is currently a professor at the National Engineering School of Tunis (ENIT), University of Tunis El Manar, teaching graduate and undergraduate courses in computer science and image processing. His main research interests include image segmentation, texture analysis, mathematical morphology, biometry and medical image applications. He supervises a research team of around thirty researchers preparing master's theses, PhD theses and HDR diplomas. He has published more than eighty papers in scientific journals and international conferences.

T. Chihaoui, R. Kachouri, H. Jlassi, M. Akil and K. Hamrouni

Retinal Identification System based on Optical Disc Ring Extraction and New Local SIFT-RUK Descriptor
Abstract: Personal recognition based on the retina has been an attractive topic of scientific research. A common challenge for retinal identification systems is to ensure a high recognition rate while maintaining a low false matching (FMR) rate and a low execution time. In this context, this paper presents a retinal identification system based on a novel local feature description. The proposed system is composed of three stages: firstly, we enhance the retinal image and select a ring around the optical disc as the region of interest, using our recently proposed Optical Disc Ring (ODR) method; secondly, in order to reduce the mismatching rate and speed up the matching step, we propose an original alternative local description based on the Remove of Uninformative SIFT Keypoints, which we call SIFT-RUK; finally, the generalization of Lowe's matching technique (the g2NN test) is employed. Experiments on the VARIA database are carried out to evaluate the performance of our proposed SIFT-RUK feature-based identification system. We show that we obtain a high performance, with 99.74 % identification accuracy, no mismatching (0 % False Matching Rate, FMR) and a low matching processing time compared to existing identification systems.
Keywords: Biometric, Retinal Identification Systems, Optical Disc Ring (ODR) Method, Scale Invariant Feature Transform (SIFT), Remove of Uninformative Keypoints (RUK), Speeded Up Robust Features (SURF).

1 Introduction
Biometric recognition is a challenging topic in the pattern recognition area, based on distinguishing and measurable "morphological and behavioral" characteristics such as the face, retina, iris, DNA, etc. [1]. Retinal recognition has received increasing attention in recent years as it provides a promising solution to security issues [2]. Hence, nowadays, the retina is one of the most secure and valid biometric traits for personal recognition, due to its uniqueness, universality, time-invariance and difficulty to be

T. Chihaoui, R. Kachouri, H. Jlassi, M. Akil and K. Hamrouni: T. Chihaoui, R. Kachouri, M. Akil, Université Paris-Est, Laboratoire Informatique Gaspard−Monge, Equipe A3SI, ESIEE Paris, France, email: [email protected], [email protected], [email protected], H. Jlassi, K. Hamrouni, Université de Tunis El Manar, Ecole Nationale d’Ingénieurs de Tunis, LR−SITI Signal Image et Technologie de l’information, Tunis, Tunisie, email: [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 113–126. https://doi.org/10.1515/9783110470383-008


forged. Indeed, retinal patterns have highly distinctive characteristics, and the features extracted from the retina identify individuals effectively, even among genetically identical twins [1]. In addition, this pattern does not change throughout the life of the individual, unless a serious pathology appears in the eye. Existing retinal recognition systems include two modes: identification, which compares each pattern to all others (1:N), and verification, which compares the pattern to other patterns from the same individual (1:1). Both identification and verification systems aim to find the best compromise between good retinal recognition accuracy and processing time. In this context, we recently proposed an original ring selection around the optical disc as the region of interest, called "ODR", and developed two new ODR-based retinal recognition systems. Our SIFT [4]-based identification system [11] gives a high identification rate (99.8 %) but still suffers from an important execution time (10.3 s). The recently proposed SURF [6]-based verification system [12] offers a high verification rate (100 %) and a low processing time thanks to the verification mode, which compares one pattern only to the other ones belonging to the same individual; however, it suffers from mismatching and slowness in the identification mode. Consequently, and to solve these problems, we present in this paper a new retinal identification system, in which we first use the ODR method [11], then employ the SIFT description [4] for its accuracy, and finally the g2NN matching. In order to reduce the mismatching rate and speed up the matching step, we propose an original alternative local description based on the Remove of Uninformative SIFT Keypoints. Our new local feature descriptor is named SIFT-based Remove of Uninformative Keypoints (SIFT-RUK). It allows an important identification rate while reducing the number of mismatched keypoints and speeding up the execution time. The remainder of this paper is organized as follows: the next section presents related work on retinal recognition systems based on local features; our proposed identification system is detailed in section 3; section 4 illustrates the obtained experimental results; finally, conclusions are discussed in section 5.

2 Biometric recognition systems based on local feature descriptors: Related works
Image geometric transformations, illumination changes, mismatching and processing time are the most challenging problems in retinal biometric recognition. Local feature descriptors [4] [6] are distinctive and robust to geometric changes, contrary to global generic algorithms [8]. In this context, many different local descriptor-based biometric systems have been developed in the literature [7] [11].


The most known and performant local feature descriptor is the Scale Invariant Feature Transform (SIFT), developed by David Lowe [4] in 1999. Thanks to its invariance to translation, rotation and scale changes, it is widely used in object recognition. For instance, it was employed in adaptive optics retinal image registration by Li et al. [18]; it extracts corner points and matches corresponding ones in two frames, correcting the retina motion. However, its main defect is that it may produce a huge number of keypoints, which leads to eventual mismatching and a consequently high matching processing time. In order to overcome this drawback, recent extensions of the SIFT algorithm [5] have been proposed, as well as some preprocessing and matching topologies which reduce the number of extracted local features. In 2008, Herbert Bay [6] proposed a fast local descriptor named Speeded Up Robust Features (SURF), which is similar to SIFT but faster, with a 64-dimensional descriptor. This algorithm was applied to biometric recognition [7] in 2009 and proved its efficiency in terms of processing time. To further speed up the local feature extraction, Israa Abdul et al. [10] applied in 2014 Principal Component Analysis (PCA) to the SIFT descriptor. PCA reduced the dimensionality of the SIFT feature descriptor from 128 to 36, so that PCA-SIFT minimized the size of the SIFT feature descriptor and sped up the feature matching by a factor of 3 compared to the original SIFT method [4]. In [19], sub-segments around the pupil (left, right and bottom of the iris) are used as input for SIFT feature extraction instead of the whole iris image. In retinal recognition, we proposed in 2015 a new ROI selection based on Optical Disc Ring (ODR) extraction [11]; it aims to keep the densest retinal region in order to improve the SIFT-based identification rate and further decrease the execution time. In 2016, we exploited the SURF descriptors in a retinal verification system [12]. This system is faster than some existing ones, but it still suffers from some individual mismatching in identification mode. In order to reduce the mismatching rate and speed up the matching step, we propose in this paper a retinal identification system based on the Remove of Uninformative SIFT Keypoints. The next section details the proposed system based on the SIFT-RUK descriptor.

3 Proposed local feature SIFT-RUK descriptor-based retinal identification system
In this work, we propose a novel retinal identification system based on the Remove of Uninformative SIFT Keypoints. The proposed system first uses the ODR method [11], then employs the newly proposed SIFT-based Remove of Uninformative Keypoints (SIFT-RUK) description and finally the g2NN matching. The flowchart of the proposed system is illustrated in Fig. 1.


Fig. 1. Flowchart of our proposed retinal identification system (input image → ODR method → local SIFT-RUK feature extraction → g2NN feature matching against the database → recognition decision).

3.1 Ring extraction around the optical disc
On the one hand, the presence of noise, the low contrast between the vasculature and the background, the brightness, and the variability of vessel diameter, shape and orientation are the main obstacles in retinal images. On the other hand, the SIFT characterization is sensitive to the intensity variation between regions, which requires the elimination of the brightest area (the optical disc). In order to overcome these problems, we use our recently proposed Optical Disc Ring (ODR) extraction [11]. The CLAHE technique [20] is used to improve the quality of the retinal image. Then a ring


around the optical disc is selected as the region of interest. The output of this first step of our identification system is shown in Fig. 2. A minimal sketch of these two stages is given below.
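The following Python sketch illustrates the CLAHE enhancement followed by the ring selection; od_center, r_in and r_out are illustrative parameters, not the values computed by the ODR method of [11].

import numpy as np
from skimage import exposure

def odr_ring(gray, od_center, r_in, r_out):
    enhanced = exposure.equalize_adapthist(gray)         # CLAHE [20]
    yy, xx = np.mgrid[:gray.shape[0], :gray.shape[1]]
    r2 = (yy - od_center[0]) ** 2 + (xx - od_center[1]) ** 2
    ring = (r2 >= r_in ** 2) & (r2 <= r_out ** 2)        # ring around the disc
    return np.where(ring, enhanced, 0.0)                 # keep only the ROI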

3.2 SIFT-RUK feature descriptor
In this step, we improve the Scale Invariant Feature Transform (SIFT) [4] local image descriptor. To this end, we apply an original Remove of Uninformative Keypoints (RUK) method, which reduces the number of redundant SIFT keypoints while maintaining the quality of the description. The SIFT-RUK local descriptor algorithm is detailed in the following subsections.

3.2.1 Standard SIFT description
The standard SIFT description is stable and robust to some imperfections of the retinal image acquisition process, such as scale, rotation, translation and illumination changes. The SIFT algorithm includes four stages [4]: scale-space peak detection, keypoint localization, orientation assignment and keypoint description.

Scale-space peak detection
This first stage aims to identify locations and scales that are identifiable from different views of the same object. The scale-space is defined by equation 1:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)     (1)

where L(x, y, σ), G(x, y, σ) and I(x, y) are respectively the scale-space function, the variable-scale Gaussian function and the input image.

Fig. 2. Optical Disc Ring (ODR) extraction step: (a) the input retinal image and (b) the extracted ring around the optical disc.


In order to detect the most stable keypoint locations in the scale-space, we compute the difference of Gaussians D(x, y, σ) between adjacent scales, as given by equation 2:

D(x, y, σ) = L(x, y, sσ) − L(x, y, σ)     (2)

where s is a scale factor. In order to detect local scale-space extrema, each point is compared with its 8 neighbors in the same scale and its 9 neighbors in each of the scales immediately above and below. If the point is the minimum or the maximum of all compared points, it is considered an extremum. A minimal sketch of this stage is given below.
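The Python sketch below illustrates this first stage; the sigma, scale factor and contrast threshold values are illustrative assumptions, not the paper's settings.

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigma=1.6, s=2 ** (1 / 3), levels=5):
    L = np.stack([gaussian_filter(img.astype(float), sigma * s ** k)
                  for k in range(levels)])        # L(x, y, sigma), equation (1)
    D = L[1:] - L[:-1]                            # D = L(s*sigma) - L(sigma), eq. (2)
    is_max = D == maximum_filter(D, size=3)       # 3x3x3 neighbourhood in (scale, y, x)
    is_min = D == minimum_filter(D, size=3)
    keep = (is_max | is_min) & (np.abs(D) > 1e-3) # discard flat responses
    keep[0] = keep[-1] = False                    # extrema need both neighbour scales
    return np.argwhere(keep)                      # rows of (scale, y, x)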

Keypoint localization
In this second stage, we compute the Laplacian value for each keypoint found in stage 1. The location of the extremum E is given by equation 3:

E(x, y, σ) = (∂²D(x, y, σ)/∂(x, y, σ)²)⁻¹ ∗ (∂D(x, y, σ)/∂(x, y, σ))     (3)

Orientation assignment
This third stage identifies the dominant orientations for each selected keypoint based on its local region. The assigned orientation(s), scale and location of each keypoint enable SIFT to construct a canonical view of the keypoint that is invariant to similarity transforms.

Keypoint description
This final stage builds a representation for each keypoint by generating its local image descriptor, a 128-element vector: the best results were achieved with a 4 × 4 array of histograms with 8 orientation bins each (4 × 4 × 8 = 128).

3.2.2 Removal of uninformative SIFT keypoints
As is known, a huge number of local keypoints increases the matching run time and the mismatching rate. In addition, a study we carried out after the extraction of SIFT keypoints shows that local keypoints extracted from the same region of a retinal image may be redundant and have similar descriptions. For that reason, our proposed SIFT-based Remove of Uninformative Keypoints method, called SIFT-RUK, aims to limit the number of interest keypoints used to characterize the retinal image. To detect and reject the redundant keypoints, we use the locations (x, y) and the orientations O of these local features. The algorithm of our proposed RUK method is presented as follows.


Algorithm 1: The algorithm of our RUK method
Data: Local SIFT keypoint set K = {k_1, k_2, . . . , k_n}; k_i: (L(k_i), O(k_i)), i ∈ {1, . . . , n}; L(k_i) = (x_i, y_i): localisation vector; O(k_i): orientation vector.
Result: Local SIFT-RUK keypoint set K″ = {k″_1, k″_2, . . . , k″_m}
/* Compute the localisation distance D_l(L(k_i), L(k_j)) of feature pairs */
1: D_l(L(k_i), L(k_j)) = |x_i − x_j| + |y_i − y_j|, i, j ∈ {1, . . . , n} and i ≠ j
/* Compute the OTSU-based localisation distance threshold T_l */
2: D_l max = max(D_l)
3: D_l min = min(D_l)
4: Dif_l = D_l max − D_l min
5: Otsu_l = Otsu_graythresh(|D_l|)
6: T_l = (Otsu_l ∗ Dif_l) + D_l min
/* Localisation condition (1) */
7: K′ = []
8: if D_l(L(k_i), L(k_j)) < T_l, i, j ∈ {1, . . . , n} and i ≠ j then
9:     K′ = {K′, k_i}
10: l = length(K′), l < n
/* Compute the orientation distance D_o(O(k′_i), O(k′_j)) of feature pairs */
11: D_o(O(k′_i), O(k′_j)) = ||O(k′_i) − O(k′_j)||, i, j ∈ {1, . . . , l}, i ≠ j
/* Compute the OTSU-based orientation distance threshold T_o */
12: D_o max = max(D_o)
13: D_o min = min(D_o)
14: Dif_o = D_o max − D_o min
15: Otsu_o = Otsu_graythresh(|D_o|)
16: T_o = (Otsu_o ∗ Dif_o) + D_o min
/* Orientation condition (2) */
17: K″ = []
18: if D_o(O(k′_i), O(k′_j)) < T_o, i, j ∈ {1, . . . , l} and i ≠ j then
19:     K″ = {K″, k′_i}
20: m = length(K″), m < l < n

We assume that the retinal image is characterized by a set of n local SIFT keypoints K = {k_1, . . . , k_i, . . . , k_n}, where each local keypoint k_i is represented by a vector composed of its location vector L and its orientation O. The Remove of Uninformative SIFT Keypoints method is detailed as follows.


The SIFT keypoint location condition: Firstly, we employ the Manhattan distance, for its simplicity and speed, to compute the location distances D_l between all extracted keypoint pairs in the considered image (Algorithm 1, line 1). Secondly, based on OTSU thresholding [14], the optimal value of the location distance threshold T_l is automatically determined for each retinal image. The T_l value (Algorithm 1, line 6) is found by multiplying the Otsu_l threshold level (Algorithm 1, line 5) [14] by the difference Dif_l (Algorithm 1, line 4) between the maximum Manhattan distance D_l max and the minimum one D_l min. The Otsu_l level is obtained by Otsu thresholding of the normalized location distances D_l, classifying them into two classes, small and large distances; the resulting product thus maximizes the variance between the near local keypoints and the distant ones. The minimum Manhattan distance D_l min over all extracted keypoint pairs is then added. Thirdly, we check the location distance of each keypoint pair (Algorithm 1, line 8): we consider that k_i is a neighbor of k_j, i, j ∈ {1, . . . , n}, and may describe the same interest region, if the Manhattan location distance between k_i and k_j is less than the threshold T_l.
The SIFT keypoint orientation condition: After identifying the local neighbor keypoint set K′ (Algorithm 1, line 9), which satisfies the location condition, we compute the Manhattan orientation distance between these pairs (Algorithm 1, line 11). We then compute the optimal orientation distance threshold T_o (Algorithm 1, line 16), following the same process as for the OTSU-based location distance threshold. Consequently, if k′_i and k′_j satisfy the orientation condition (Algorithm 1, line 18), the local keypoint k′_j is removed from the keypoint candidate list, and the new distribution K″ (Algorithm 1, line 19) is used as input for the matching process. This SIFT keypoint set K″ contains a reduced number of keypoints m for each image (Algorithm 1, line 20), where m < l. Indeed, the standard SIFT description suffers from a huge number of uninformative keypoints, as shown in Fig. 3-b, which leads to mismatching. It may also be very time-consuming.

condition (algorithm 1, line 18), the local keypoint k j is removed from the keypoint candidate list and we design the new distribution K" (algorithm 1, line 19)as input for matching process. This SIFT keypoint set K" contains a reduced number of keypoints m for each image (algorithm 1, line 20) where m < l. Indeed, the standard SIFT description suffer from huge number of uninformative keypoints, as shown in Fig. 3-b, that leads to mismatching. It may also be very
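To make the two conditions concrete, the following minimal Python sketch gives our reading of the RUK filtering. It is an illustration, not the authors' MATLAB implementation: it estimates both Otsu-based thresholds over all keypoint pairs (rather than re-estimating the orientation threshold on K′ only), and scikit-image's threshold_otsu stands in for the Otsu graythresh routine.

    import numpy as np
    from skimage.filters import threshold_otsu  # assumed available (Otsu's method)

    def otsu_threshold(d):
        # Algorithm 1, lines 2-6 / 12-16: Otsu's level on the normalized
        # distances, mapped back to the raw distance range.
        lo, hi = float(d.min()), float(d.max())
        return threshold_otsu((d - lo) / (hi - lo)) * (hi - lo) + lo

    def sift_ruk(loc, ori):
        """loc: (n, 2) keypoint positions, ori: (n,) orientations.
        Returns a boolean mask of the keypoints kept in K''."""
        n = len(loc)
        d_loc = np.abs(loc[:, None, :] - loc[None, :, :]).sum(-1)  # Manhattan
        d_ori = np.abs(ori[:, None] - ori[None, :])
        iu = np.triu_indices(n, k=1)
        t_loc, t_ori = otsu_threshold(d_loc[iu]), otsu_threshold(d_ori[iu])
        keep = np.ones(n, dtype=bool)
        for i in range(n):
            for j in range(i + 1, n):
                # conditions (1) and (2): a pair close in both location and
                # orientation is redundant, so the second member is dropped
                if keep[i] and keep[j] and d_loc[i, j] < t_loc and d_ori[i, j] < t_ori:
                    keep[j] = False
        return keep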

Fig. 3. SIFT-RUK description: (a) The input retinal ring around the optical disc (ROI) image, (b) the SIFT-characterised ring of interest and (c) the ring of interest characterised by SIFT-RUK keypoints.


Hence the interest of the new local descriptor SIFT-RUK, which drastically eliminates uninformative SIFT keypoints, as illustrated in Fig. 3-c.

3.3 Feature matching strategy

This matching step is performed in the SIFT-RUK space, among the feature vectors of each keypoint, to identify similar retinal images. Following Lowe [11], we use a matching technique called g2NN [11] to find the best candidate; it is based not only on the distance to the first most similar keypoint, but also on the distance to the second one. In particular, we use the ratio between the distance to the candidate match, d_i, and the distance to the second nearest candidate, d_{i+1}. The two considered keypoints are matched only if this ratio is low (e.g. lower than 0.6). Finally, by iterating over all keypoints, we obtain the set of matched points that identifies identical retinal images.
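As an illustration only (not the authors' code), the sketch below applies the nearest/second-nearest ratio test that g2NN generalizes; the descriptor arrays desc_a and desc_b are assumptions, and the 0.6 ratio is taken from the text.

    import numpy as np

    def ratio_test_matches(desc_a, desc_b, ratio=0.6):
        """Accept the nearest neighbour in desc_b for each descriptor in
        desc_a only if d_i / d_{i+1} < ratio."""
        matches = []
        for i, d in enumerate(desc_a):
            dist = np.linalg.norm(desc_b - d, axis=1)  # Euclidean distances
            j1, j2 = np.argsort(dist)[:2]              # two closest candidates
            if dist[j1] / (dist[j2] + 1e-12) < ratio:
                matches.append((i, j1))                # keypoint i matches j1
        return matches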

4 Experimental results

To evaluate our proposed retinal identification system based on the new SIFT-based Removal of Uninformative Keypoints (SIFT-RUK) descriptor, we use the publicly available VARIA database [15]. This database includes 233 retinal images of 139 different subjects with a resolution of 768 × 584. The images have been acquired over several years with a TopCon NW-100 non-mydriatic retinal camera. These images are optic-disc centered and have a high variability in contrast and illumination. All the experiments are implemented in MATLAB and performed on a PC with a 3.2 GHz CPU and 4 GB of memory. Table 1 reflects the impact of our proposed SIFT-RUK method on the number of standard SIFT keypoints. We note that the average number of keypoints kept in each retinal image is reduced by 25.4 %, from 2012 to 1501 keypoints. As shown in Tab. 2, this large number of uninformative SIFT keypoints can lead to much higher time consumption and to individual mismatching.

Tab. 1. Removal Uninformative SIFT keypoint analysis on images of the used VARIA database.

Local description    Extracted keypoint number per image (average)
SIFT                 2012
SIFT-RUK             1501


Tab. 2. Identification rate, the FMR error and the processing time of our proposed retinal identification system compared to SIFT and SURF-based systems.

Identification system     Identification rate (%)   FMR (%)        Matching run time average (s)
SIFT-RUK-based system     99.74                     0              6.6
SIFT-based system         99.8                      4.3 · 10−5     9.8
SURF-based system         99.4                      6.2 · 10−4     3

On the one hand, our proposed SIFT-RUK-based system ensures almost the same identification rate, 99.74 %, as the SIFT-based identification system (99.8 %) [12], and clearly more than the SURF-based identification rate (99.4 %), as shown in Tab. 2. The identification rate curves of these retinal identification systems on the VARIA database are illustrated in Fig. 4. We can see clearly that our system reaches a higher rate than the other ones, with an optimal identification rate of 99.74 %. Moreover, the matching execution time has been severely reduced, from 9.8 s in the SIFT-based identification system [12] to only 6.6 s in our SIFT-RUK system. This reduced execution time is essentially due to the Removal of Uninformative SIFT Keypoints strategy, which keeps the most informative keypoints in order to maintain a high performance (99.74 %) and decreases the processing time of the matching step.


Fig. 4. Identification Rate Curves of retinal Human identification systems based on local feature descriptors.



Fig. 5. False Match Rate (FMR) Curves of retinal Human identification systems based on local feature descriptors.

However, this execution time is still slower than that of the SURF-based system, due to the low dimension (32) of the SURF descriptor. On the other hand, mismatching in a biometric identification system is measured by the False Match Rate (FMR) error, the percentage of invalid input identities that are incorrectly assigned to a person; the lower this error, the more efficient the evaluated system [16]. Table 2 shows that our novel proposed system reduces the FMR error from 6.2 · 10−4 % with the SURF-based system and 4.3 · 10−5 % with the SIFT-based one to 0 % with the new proposed SIFT-RUK-based identification system. Therefore, thanks to the elimination of uninformative SIFT keypoints, the SIFT-RUK-based identification system identifies all individuals without any error. Figure 5 illustrates this matching performance: the SIFT-RUK-based identification system has the lowest FMR (0 %) compared to the two other ones.

5 Conclusion

In this paper, we presented an automatic retina identification system based on a new local image descriptor extended from SIFT, named SIFT-RUK. Firstly, we extract a ring of interest around the optical disc as input for the feature extraction


phase. Secondly, we employ our newly proposed SIFT-RUK feature to describe the selected region well. Indeed, SIFT-RUK detects redundant SIFT keypoints and then eliminates uninformative ones while maintaining a relevant quality of description. Finally, the g2NN test is applied to compute the number of matched keypoint pairs and classify identical retinal images. We show in this paper that our proposed system reduces the matching processing time (6.6 s) compared to the standard SIFT identification system (9.8 s), while achieving a high identification performance (99.74 %) with 0 % FMR error.

Bibliography

[1] A. Jain and A. Kumar. Biometric Recognition: An Overview. Second Generation Biometrics: The Ethical, Legal and Social Context, E. Mordini and D. Tzovaras (Eds.), Springer, :49–79, 2012.
[2] R.B. Hill. Retinal identification. In Biometrics: Personal Identification in Networked Society, A. Jain, R. Bolle, and S. Pankati, Eds., :126, Springer, Berlin, Germany, 1999.
[3] X. Meng, Y. Yin, G. Yang and X. Xi. Retinal Identification Based on an Improved Circular Gabor Filter and Scale Invariant Feature Transform. Sensors (14248220), 13(7):9248, July 2013.
[4] D.G. Lowe. Object recognition from local scale-invariant features. Int. Conf. on Computer Vision, Greece, :1150–1157, Sep. 1999.
[5] Y. Tao, M. Skubic, T. Han, Y. Xia and X. Chi. Performance Evaluation of SIFT-Based Descriptors for Object Recognition. Int. Multiconf. of Engineers and Computer Scientists, IMECS 2010, Hong Kong, II:17–19, March 2010.
[6] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
[7] G. Du, F. Su and A. Cai. Face recognition using SURF features. Pattern Recognition and Computer Vision (MIPPR), 7496:749628.1–749628.7, 2009.
[8] K. Tan and S. Chen. Adaptively weighted sub-pattern PCA for face recognition. Neurocomputing, 64:505–511, 2005.
[9] F. Alhwarin, C. Wang, D. Ristic-Durrant and A. Gräser. VF-SIFT: very fast SIFT feature matching. Pattern Recognition, Springer Berlin Heidelberg, :222–231, 2010.
[10] I. Abdul, A. Abdul Jabbar and J. Tan. Adaptive PCA-SIFT matching approach for face recognition application. Int. Multi-Conf. of Engineers and Computer Scientists, IMECS, I:1–5, 2014.
[11] T. Chihaoui, R. Kachouri, H. Jlassi, M. Akil and K. Hamrouni. Human identification system based on the detection of optical disc ring in retinal images. Int. Conf. on Image Processing Theory, Tools and Applications (IPTA), Orleans, :263–267, Nov. 2015.
[12] T. Chihaoui, H. Jlassi, R. Kachouri, K. Hamrouni and M. Akil. Personal verification system based on retina and SURF descriptors. 13th Int. Multi-Conf. on Systems, Signals and Devices (SSD), Leipzig, :280–286, 2016.
[13] Y. Meng and B. Tiddeman. Implementing the Scale Invariant Feature Transform (SIFT) Method. Department of Computer Science, University of St Andrews, 2012.
[14] N. Otsu. A threshold selection method from grey-level histograms. IEEE Trans. on Systems, Man, and Cybernetics, 9:62–66, 1979.
[15] VARIA. VARPA Retinal Images for Authentication (Database). http://www.varpa.es/varia.html.
[16] J.L. Wayman. Error Rate Equations for the General Biometric System. IEEE Robotics & Automation, :35–48, 6, 9 January 1999.


[17] L. Juan and O. Gwun. A comparison of SIFT, PCA-SIFT and SURF. Int. J. of Image Processing, 3:143–152, 2009.
[18] H. Li, H. Yang, G. Shi and Y. Zhang. Adaptive optics retinal image registration from scale-invariant feature transform. Optik - Int. J. Light Electron Opt., 122:839–841, 2011.
[19] I. Mesecan, A. Eleyan and B. Karlik. SIFT-based Iris Recognition Using Sub-segments. Int. Conf. on Technological Advances in Electrical, Electronics and Computer Engineering, :350–350, 2013.
[20] S. Singh Dhillon and S. Sharma. A review on Histogram Equalisation Techniques for Contrast Enhancement in Digital Image. Int. J. of Innovation and Advancement in Computer Science, 5(1), 2016.

Biographies

Takwa Chihaoui is currently a PhD student at the National Engineering School of Tunis (ENIT) and at ESIEE Paris, since 2013 and 2014 respectively. She received her Engineer and Master degrees from the National Engineering School of Tunis (ENIT) in 2012 and 2013 respectively. Her research focuses on image and signal processing (pattern recognition, forgery detection, biometrics). In particular, in her Master's project she worked on image forgery detection systems for digital forensics investigations. Currently, her doctoral research is about retina-based biometrics.

Rostom Kachouri received his Engineer and Master degrees from the National Engineering School of Sfax (ENIS) in 2003 and 2004, respectively. In 2010, he received his Ph.D. in Image and Signal Processing from the University of Evry Val d'Essonne. From 2005 to 2010, Dr. Kachouri was an Assistant Professor at the National Engineering School of Sfax (ENIS) and then at the University of Evry Val d'Essonne. From 2010 to 2012, he held a post-doctoral position as part of a project with the high-technology group SAGEMCOM. Dr. Kachouri is currently Associate Professor in the Computer Science Department and Head of apprenticeship computer and application engineering at ESIEE Paris. He is a member of the Institut Gaspard-Monge, unité mixte de recherche CNRS-UMLPE-ESIEE, UMR 8049. His main research interests include pattern recognition, machine learning, clustering and Algorithm-Architecture Matching.

Hejer Jlassi received both her Master degree and Ph.D. degree in electrical engineering (image processing) in 2010 from the National Engineering School of Tunis (ENIT), and a Computer Science Diploma in 2002 from the Faculty of Science of Tunis (FST). She is currently an associate professor at the Higher Institute of Medical Technologies of Tunis (ISTMT). She has taught several computer science courses, especially related to image processing, DSP and pattern recognition, since 2004. Her research interests include image processing and analysis, biometrics and medical image applications.


Mohamed Akil received his PhD degree from Montpellier University (France) in 1981 and his doctorat d'état (DSc) from the Pierre et Marie Curie University (UPMC, Paris, France) in 1985. Since September 1985, he has been with ESIEE Paris, the CCIR's (Chambre de commerce et d'industrie de région Paris Ile-de-France) center for scientific and engineering education and research. He is currently a Professor in the Computer Science Department, ESIEE Paris. He is a member of the Laboratoire d'Informatique Gaspard Monge, Université Paris-Est Marne-la-Vallée (UMR 8049, unité mixte de recherche CNRS), a joint research laboratory between Université Paris-Est Marne-la-Vallée (UPEM), ESIEE Paris and École des Ponts ParisTech (ENPC). He is a member of the program committee of the SPIE Real-Time Image and Video Processing conference (RTIVP) and a member of the Editorial Board of the Journal of Real-Time Image Processing (JRTIP). His research interests include dedicated and parallel architectures for image processing, image compression and virtual reality. His main research topics are parallel and dedicated architectures for real-time image processing, reconfigurable architectures and FPGA, high-level design methodology for multi-FPGA, mixed architectures (DSP/FPGA) and Systems on Chip (SoC). He has published more than 160 research papers in these areas.

Kamel Hamrouni received in 1971 his first cycle diploma in mathematics and physics from the University of Tunis El Manar, followed by a Master (1976) and a PhD (1979) in computer science from Pierre and Marie Curie University, Paris, France. He received in 2005 his HDR diploma in image processing from the University of Tunis El Manar, Tunisia. Since 1980, he has been a professor at the National Engineering School of Tunis (ENIT), University of Tunis El Manar, teaching graduate and undergraduate courses in computer science and image processing. His main research interests include image segmentation, texture analysis, mathematical morphology, biometry and medical image applications. He supervises a research team preparing master theses, PhD theses and HDR diplomas. He has published more than one hundred papers in scientific journals and international conferences.

M. Kammoun, A. Ben Atitallah, H. Loukil and N. Masmoudi

An Optimized Hardware Architecture of 4×4 Intra Prediction for HEVC Standard

Abstract: High Efficiency Video Coding (HEVC) is a proposed new video coding standard intended for a wide range of applications such as Ultra HD and 3D. The Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG) have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard, which is expected to provide a significant improvement in data transmission and streaming efficiency compared to H.264/AVC (Advanced Video Coding). In this proposed standard, various coding modules are defined, among the most complex of which is the intra prediction module. HEVC defines 35 intra prediction modes for 8×8, 16×16 and 32×32 blocks, 3 modes for 64×64 and 17 modes for 4×4, while H.264 uses 9 modes for intra 4×4 and 4 modes for intra 16×16. In this paper, we propose an efficient uniform architecture for all of the 4×4 intra directional modes. This architecture offers an important gain in processing time compared to the literature. Our proposed architecture is designed in VHDL and implemented with FPGA and TSMC 0.18 μm CMOS technologies.

Keywords: HEVC, JCT-VC, intra 4×4, FPGA, TSMC.

1 Introduction

With the increasing needs of HD applications, new video compression technologies that provide more efficient coding than the existing standards have become the focus of organizations active in this field. New tools have thus been proposed, covering many aspects of video compression technology. These include the general structure of a new standard proposed in 2010, known as HEVC [1], for a more efficient representation of video content. It aims to provide an equivalent improvement in image quality and compression ratio at the cost of roughly 3 times higher complexity relative to H.264/AVC [2]. The video coding layer of HEVC employs the same hybrid approach (inter/intra prediction and 2D transform coding) [3] used in the previous compression standards. Figure 1 illustrates the block diagram of the HEVC encoder. The HEVC encoding algorithm produces a conforming bitstream by following these steps.

M. Kammoun, A. Ben Atitallah, H. Loukil and N. Masmoudi: LETI laboratory, Circuits and Systems group (C&S), University of Sfax, Sfax Engineering School, Sfax, Tunisia, emails: [email protected], [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 127–144. https://doi.org/10.1515/9783110470383-009



Fig. 1. HEVC video encoder layer.

All pictures are first divided into block-shaped regions which can be larger than a traditional macroblock [4]. There are three types of units in HM (the HEVC reference software): coding unit (CU), prediction unit (PU) and transform unit (TU) [5]. The coding unit (CU) can have various sizes, with no distinction corresponding to its size. Two special terms are defined: the largest coding unit (LCU) and the smallest coding unit (SCU). The PU is defined only for the last-depth CU and is used for the prediction process; at the last level of splitting, two PU sizes, N × N and 2N × 2N, are supported [6]. The first picture of a video sequence is coded using only the intra-picture prediction process (performed from samples of already decoded adjacent PUs) [7]. The remaining pictures are treated with inter, temporally-predictive modes applied for most blocks. The inter prediction process consists of selecting the corresponding reference frame and the MV (motion vector) to be applied to the current predicted block. The residual signal, derived from the difference between the original block and the predicted one, is transformed by approximating scaled DCT (Discrete Cosine Transform) coefficients, quantized, entropy coded and finally transmitted with the prediction information [8]. In the decoder, the reconstructed residual block is obtained by inverse scaling followed by an inverse transform. This residual is then added to the predicted block to get the reconstructed data, which is finally processed with two filters (SAO (Sample Adaptive Offset) and the deblocking filter) to smooth artifacts caused by block effects. The final reconstructed frame is stored in a decoded picture buffer to be used for the next prediction process [9].


In this paper, we have chosen to study the complexity of the intra prediction module, in which the number of modes used varies depending on the PU size. Unlike H.264/AVC [10] [11], the intra prediction process of HEVC includes PUs of sizes 4×4, 8×8, 16×16, 32×32 and even 64×64. The statistics of occurrence of the different PU sizes, measured over 100 frames of the sequence Foreman, are shown in Fig. 2. According to these results, the 4×4 PU size is the most frequent, so it is sufficient to study the complexity of intra 4×4 alone [12]. This paper is organized as follows. Section 2 highlights the 4×4 intra prediction in HM. Section 3 describes the architectures proposed in the literature. In section 4, we propose the optimized hardware architecture for the whole set of 4×4 intra prediction modes. Finally, the simulation results are illustrated in section 5.

2 4×4 intra prediction for H.265/HEVC

In HM 5.0 [13], unified intra prediction provides up to 17 prediction modes for a 4×4 block, including the DC and planar modes, as shown in Fig. 3. The prediction directions have the angles ±[0, 2, 5, 9, 13, 17, 21, 26, 32]/32. The pixels are reconstructed using linear interpolation of the above or left reference samples at 1/32 pixel accuracy [14]. Two arrays of reference samples are used to predict a 4×4 PU, corresponding to the row of pixels situated above the current PU [15] and the column of samples situated to the left of the same PU. For a vertical direction, the row of samples above the PU is called the main reference and the column to the left of the PU is called the side reference. For a horizontal direction, the column of samples to the left of the PU is the main reference and the row of samples above the PU is the side reference.


Fig. 2. Probable portion of various PU size in a 64×64 LCU.



Fig. 3. Directional modes of intra 4×4.

The positive angles use only the main reference for intra prediction, while negative angles require both the main and the side references. When the side array is available for prediction, the index of interception requires a division operation. The indices of interception of the side reference are given in (1) and (2):

deltaIntSide = [256 × 32 × (l + 1)/absAng] ≫ 8    (1)
deltaFractSide = [256 × 32 × (l + 1)/absAng] % 256    (2)

where absAng ∈ [2, 5, 9, 13, 17, 21, 26, 32] and l is the pixel position in the horizontal or vertical direction. In order to remove the division operation, a look-up table (LUT) [16] is used to calculate the fractional and integer parts of the interception corresponding to the vertical or horizontal directions. The pixels associated with the side reference are thus determined by the integer and fractional parts of the interception without any division. The equations for calculating the intercepted indices of the side reference become:

Inverseangle = (256 × 32)/Angle    (3)
deltaIntSide = (invAbsAngTable[absAngIndex] × (l + 1)) ≫ 8    (4)
deltaFractSide = (invAbsAngTable[absAngIndex] × (l + 1)) % 256    (5)

where the inverse angles are given in Tab. 1.


Tab. 1. LUT conversion table.

Angles           −32    −26    −21    −17    −13    −9     −5      −2
Inverse angles   −256   −315   −390   −482   −630   −910   −1638   −4096

Angles           32     26     21     17     13     9      5       2
Inverse angles   256    315    390    482    630    910    1638    4096
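As a hedged illustration of equations (3)-(5) (a software model we add here, not the paper's hardware), the inverse-angle LUT turns the per-row division into one multiplication followed by a shift and a modulo:

    # Software model of equations (3)-(5); the LUT values reproduce Tab. 1.
    ANGLES = [2, 5, 9, 13, 17, 21, 26, 32]
    INV_ANGLE = {a: round(256 * 32 / a) for a in ANGLES}  # eq. (3): 4096, 1638, ...

    def side_intercepts(abs_ang, l):
        """Integer and fractional intercepts of the side reference for row l."""
        v = INV_ANGLE[abs_ang] * (l + 1)
        return v >> 8, v % 256        # eq. (4) and eq. (5)

    for l in range(4):                # the four rows of a 4x4 PU, angle 13
        print(side_intercepts(13, l))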

The LUT technique simplifies the division operation, but the per-sample test to determine whether the main or the side array is used for prediction still exists. As a solution, an extended main reference, built by projecting samples of the side array, is proposed. As a result, there is only one reference, and pixels are predicted by interpolating or copying samples of this extended reference, as presented in Fig. 4. In the case of a vertical direction, pixels are predicted using the extended main reference above the predicted block. Neighbouring pixels are designated by the labels (AV to MV) for the vertical direction or (AH to MH) for the horizontal direction, as given in Fig. 4. According to the HM reference software, the prediction equations of the intra angular modes are calculated once the "iFact" and "iIdx" of each row or column of pixels are determined, as given in (6), (7) and (8). If "iFact" is equal to 0, the neighbouring pixels


Fig. 4. Pixels used for prediction in case of vertical or horizontal directions.

NOTE: MV = MH = M.


are copied directly to the predicted pixels:

iFact = [(y + 1) × intraPredAngle] & 31    (6)
iIdx = [(y + 1) × intraPredAngle] ≫ 5    (7)
predSamples[x, y] = ((32 − iFact) × refMain[x + iIdx + 1] + iFact × refMain[x + iIdx + 2] + 16) ≫ 5    (8)
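The following short Python model of equations (6)-(8) is only illustrative (the paper's design is in VHDL); ref_main stands for the extended main reference array, and the toy values at the end are assumptions:

    def predict_row(ref_main, y, intra_pred_angle):
        """Predict one 4-pixel row of a 4x4 PU (positive angles only)."""
        delta = (y + 1) * intra_pred_angle
        i_fact = delta & 31          # eq. (6): fractional position, 1/32 accuracy
        i_idx = delta >> 5           # eq. (7): integer offset into refMain
        if i_fact == 0:              # exact position: copy the reference samples
            return [ref_main[x + i_idx + 1] for x in range(4)]
        return [((32 - i_fact) * ref_main[x + i_idx + 1]
                 + i_fact * ref_main[x + i_idx + 2] + 16) >> 5   # eq. (8)
                for x in range(4)]

    ref = list(range(100, 112))      # toy extended main reference (12 samples)
    print(predict_row(ref, y=0, intra_pred_angle=13))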

The DC mode is computed by choosing one of three cases, (9), (10) or (11), whose parameters are presented in Fig. 5:

DCval = (A + B + C + D + E + F + G + H + 4) ≫ 3, if top and left neighbours are available    (9)
DCval = (A + B + C + D + 2) ≫ 2, if only top neighbours are available    (10)
DCval = (E + F + G + H + 2) ≫ 2, if only left neighbours are available    (11)

3 Related works

In this section, we describe two uniform architectures, proposed in [12] and [17], for the implementation of all directional modes of intra 4×4 in the new HEVC standard. The first one, proposed in [12], integrates a register table and a flexible selection technique for the reference samples. At the input of this architecture, a decision is made between the directional modes and the DC mode. The 17 reference samples located above and to the left of the predicted block are arranged in a register table.


Fig. 5. Intra prediction DC mode.


Those samples are then used in the prediction process. The copying circuit needs only four samples to predict a row of pixels, while the interpolation circuit requires five reference samples to interpolate the four pixels of one row. The accumulator is incremented each time the prediction of a row of pixels is completed, and the obtained results are used to calculate the integer and fractional parts of the intercept. As a final step, a decision between the interpolating circuit and the copying circuit is made according to the state of the fractional part. The second architecture, proposed in [17], uses data reuse and pixel equality based computation reduction (PECR) techniques for the HEVC intra prediction algorithm, which can affect PSNR and bit rate. This hardware architecture covers both the intra 4×4 and 8×8 luminance angular prediction modes; 56 neighboring registers are used to store the neighboring pixels in order to load in parallel the predicted pixels of one 8×8 and four 4×4 PUs. When estimating the performance of these two architectures, we notice a significant gain in occupied area for the architecture proposed in [12], but it requires 24 clock cycles to process one mode of a 4×4 PU. For the one proposed in [17], the number of clock cycles required to compute the modes of a 4×4 PU is even higher (about 40 clock cycles). So, in the next section, we design a novel hardware architecture for the implementation of intra 4×4 for the HEVC standard that improves the processing time while taking the occupied area into account.

4 Intra 4×4 proposed architecture

The general structure of our optimized hardware architecture is given in Fig. 6. It uses three different components, each performing a definite function. The principle is as follows. The input of component 1 uses a multiplexing technique to select one of the vertical references (AV, BV, CV, DV or IV) or horizontal references (EH, FH, GH, NH, HH or M). Similarly, the input of component 2 selects one of the vertical references (EV, FV, KV or JV) or, in the case of a horizontal direction, (AH, BH or OH). The operation of the multiplexer (MUX) is controlled by the control unit, which helps the MUX select the corresponding neighbor for each prediction coefficient, or the mode itself when neighboring pixels must be copied. All the prediction coefficients are computed sequentially. Our optimized approach exploits the fact that the prediction coefficients calculated by component 1 and component 2 can be used more than once in the prediction equations. The idea is to group these coefficients, given in Fig. 7 and Fig. 8, and reduce the number of evaluations to one. Component 1 is used 11 times to evaluate the most frequently occurring prediction coefficients, whereas component 2 is used 7 times to produce the prediction coefficients that present a low redundancy in the prediction equations.


Fig. 6. Optimized hardware intra architecture.

To minimize the computational complexity, we have replaced the multiplication operations with addition and shift operations (example: 7AV = 4AV + 2AV + AV). The total number of operations for the non-optimized and optimized methods is given in Tab. 2. In this way we optimize both the occupied area and the processing time. The architecture of component 1 is shown in Fig. 9; as we can notice, the outputs (S5, S6, S7, S10, S20, S31) need only one clock cycle to be produced, while the outputs (S11, S13, S12, S27, S15, S25, S21, S26, S22, S17, S19) need two clock cycles.
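For instance, the shift-add decomposition can be modeled as below (our sketch; the hardware realizes the same identities with adders and wired shifts):

    # Multiplierless constant multiplications used for the prediction coefficients:
    def times7(a):  return (a << 2) + (a << 1) + a   # 7A  = 4A + 2A + A
    def times13(a): return (a << 3) + (a << 2) + a   # 13A = 8A + 4A + A
    def times25(a): return (a << 4) + (a << 3) + a   # 25A = 16A + 8A + A

    assert times7(3) == 21 and times13(3) == 39 and times25(3) == 75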

Fig. 7. Call number of prediction coefficients calculated from component 1.

Fig. 8. Call number of prediction coefficients calculated from component 2.


Tab. 2. Operation number comparison.

Operations       Non-optimized method   Optimized method
Addition         153                    498
Multiplication   187                    0
Shift            0                      80


Fig. 9. Architecture of component 1.

On the other hand, the outputs (W5, W12, W7, W10, W20 and W31) of component 2 are produced in only one clock cycle, as presented in Fig. 10. Finally, the outputs of components 1 and 2 (X0..X31) are the inputs of component 3, which selects the predicted coefficients of the corresponding mode to evaluate the predicted pixels. The architecture of component 3 is given in Fig. 11. These three components are used only for the directional modes; the DC mode is defined depending on the neighboring pixels. In this context there are four cases:
– DC0 is shifted by 3 if both top and left neighbors are available;
– DC1 is shifted by 2 if only top neighbors are available;
– DC2 is shifted by 2 if only left neighbors are available;
– 128 otherwise.
The control unit is based on a Moore state machine, which controls the operation of the three components by assigning to each state the neighbors and prediction coefficients. Explicitly, each state is assigned the corresponding neighbors, and the calculation of the prediction coefficients is carried out within the same machine once the associated neighbors have been assigned in previous states.
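A compact software rendering of this DC selection, consistent with equations (9)-(11) and the four cases above (our sketch, not the VHDL), is:

    def dc_value(top, left):
        """top = [A, B, C, D] if available, left = [E, F, G, H] if available."""
        if top and left:
            return (sum(top) + sum(left) + 4) >> 3   # eq. (9): mean of 8 samples
        if top:
            return (sum(top) + 2) >> 2               # eq. (10)
        if left:
            return (sum(left) + 2) >> 2              # eq. (11)
        return 128                                   # no neighbours available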

Optimized Hardware Architecture

|

137

13 BITS

w

+

>>2

ws

8 BITS w12

+ –

>>3

w7 w10

+

>>1 >>4

w20

+

>>2

>>5

w21



13 BITS

Fig. 10. Architecture of component 2.


Fig. 11. Architecture of component 3.

As detailed in Fig. 12, the operation of this state machine begins when the signal start-M1 is set to 1. From state 0, every state is assigned the inputs of component 1 and component 2, and the associated results are displayed in state 2.

Fig. 12. 1st Moore state machine.

Fig. 13. 2nd Moore state machine.

Fig. 14. Number of cycles required for the intra-4×4 architecture.


The same principle is applied to the rest of the reference samples. From state 4, we introduce a second state machine into the system. It starts running once the signal start-M2 is set to 1, in order to display in each state the predicted pixels of all modes. We begin in state 1 by assigning the modes which need only one cycle to be displayed (mode 1, mode 2, mode 4, mode 7 and mode 10). From state 4, we assign the prediction coefficients of the modes which need two cycles to be calculated (mode 5, mode 6, mode 11, mode 12, mode 13, mode 14, mode 8, mode 9, mode 15, mode 16, mode 17); the DC mode is obtained in state 17. The general structure of this second state machine is given in Fig. 13. After functional simulation, the predicted pixels of the different modes need only 22 clock cycles to be produced. In fact, the prediction coefficients of components 1 and 2 are calculated in 12 and 7 clock cycles, respectively, and each mode requires one clock cycle in component 3, so the generation of the 17 modes takes 17 clock cycles. As shown in Fig. 14, the three components operate in parallel in order to minimize the processing time; 5 cycles of latency are necessary for generating the prediction coefficients of the different modes, giving 22 clock cycles for the treatment of the 17 modes. Finally, when comparing with the architecture proposed in [12], the major advantage of our architecture is the remarkable optimization of execution time without introducing any significant delay.

5 Implementation and performance results

Our proposed architecture was designed in VHDL, simulated with the Mentor Graphics ModelSim simulation tool, and synthesized for FPGA [18] and, with the Leonardo Spectrum tool, for the TSMC 0.18 μm standard-cell technology [19]. The synthesis results obtained with FPGA technology using the Stratix III EP3SL150F1152C2 [20] component are given in Tab. 3. In order to compare with previous works, we synthesized our architecture with the Leonardo Spectrum tool using the TSMC 0.18 μm standard-cell technology; these synthesis results are supplied in Tab. 4. An analysis of the number of cycles required to process a complete 4×4 block with the 17 modes of the 4×4 intra prediction has been established. As is evident from Tab. 4, the proposed architecture performs the whole set of prediction modes of a 4×4 block in only 22 clock cycles, instead of the 24 clock cycles needed to generate a single mode in [12]. Indeed, from Tab. 4, we can see that the throughput of the proposed design is up to 41 times higher, whereas the integration cost is 50 % higher, with the use of the 17 modes of the 4×4 intra prediction.


Tab. 3. FPGA synthesis results.

ALUTs             2,864/113,600 (3 %)
Frequency         416 MHz
I/O pins          310/744
Register number   2,494/113,600 (2 %)
Processing time   22 cycles

Tab. 4. Comparison with the architecture proposed in [12].

Architectures                   [12]            Optimized hardware architecture
Technology                      TSMC 0.13 μm    TSMC 0.18 μm
Frequency                       150 MHz         339 MHz
Logic gate count (gates)        9020            20742
Total modes                     1               17
Processing time (cycles/MB)     24              22
Processing latency (clocks)     8               5
Maximum throughput (KMB/s)*     6250            15409

* Throughput (KMB/s) = [(1/Fmax) × processing time]^−1
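As a quick check of the footnote, the throughput column is simply the operating frequency divided by the per-block processing time: 339 MHz / 22 cycles ≈ 15 409 KMB/s for the proposed design, and 150 MHz / 24 cycles = 6 250 KMB/s for [12].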

6 Conclusion

In this paper we have proposed a hardware architecture optimized for processing time for the implementation of the directional modes of the 4×4 intra prediction in the HEVC standard. According to the synthesis results with the TSMC 0.18 μm CMOS technology, our architecture is more efficient in processing time than those proposed in the literature: we need 22 clock cycles to predict a 4×4 block with the entire set of modes, whereas the architecture in [12] requires 24 clock cycles for one mode. With an operating frequency of 339 MHz and a maximum throughput of 15,409 KMB/s, the new proposed intra hardware architecture can easily reach real-time processing of 1080p video frame sequences. Finally, the importance of our proposed architecture is especially manifest in the significant gain in processing time without introducing any significant delay. In future work, we may concentrate on studying the whole intra prediction chain of HEVC, or extend our implementation to the intra 8×8, 16×16, 32×32 and 64×64 modes.

Acknowledgment: This work has been supported by the Laboratory of Electronics and Information Technology (LETI) at the National School of Engineering of Sfax.


Bibliography

[1] G.J. Sullivan. Next-Generation High Efficiency Video Coding (HEVC) Standard. HEVC presentation for HPA/ATSC, Co-chair JCT-VC, Chair VCEG, Co-Chair Video in MPEG, Video & Image Technology Architect, Microsoft, February 15, 2011.
[2] http://blog.radvision.com/voipsurvivor/2011/06/30/what-can-we-expect-from-h-265/, June 30, 2011.
[3] G.J. Sullivan, J. Ohm, W. Han and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. on Circuits and Systems for Video Technology, December 2012.
[4] T. Smith. Ultra-efficient vid codec paves way for MONSTER-res TVs, decent mobe streaming. 11th April 2013.
[5] J. Ohm and G. Sullivan. MPEG High Efficiency Video Coding (HEVC). MPEG doc. N11922, January 2011.
[6] K. McCann, W. Han and I. Kim. Samsung's Response to the Call for Proposals on Video Compression Technology. JCTVC-A124, :7–10, 1st Meeting, Dresden, April 15–23, 2010.
[7] G.J. Sullivan and J. Ohm. Recent developments in standardization of high efficiency video coding (HEVC). Microsoft Corporation, USA; Institute of Communications Engineering, Aachen.
[8] M.T. Pourazad, C. Doutre, M. Azimi and P. Nasiopoulos. HEVC: The New Gold Standard for Video Compression. IEEE Consumer Electronics Magazine, July 2012.
[9] Y.J. Ahn, W.J. Han and D.G. Sim. Study of decoder complexity for HEVC and AVC standards based on tool-by-tool comparison. SPIE, 8499, article id. 84990X, 14 pp, October 2012.
[10] S. Smaoui, H. Loukil, A. Ben Atitallah and N. Masmoudi. An Efficient Pipeline Execution of H.264/AVC Intra 4×4 Frame Design. Int. Conf. on Systems, Signals & Devices, SSD'10, Amman, Jordan, June 27–30, 2010.
[11] A. Ben Atitallah, H. Loukil and N. Masmoudi. FPGA Design for H.264/AVC encoder. AIRCC, Int. J. of Computer Science, Engineering and Applications (IJCSEA), 1(5):119–138, October 2011.
[12] F. Li, G. Shi and F. Wu. An efficient VLSI architecture for 4×4 intra prediction in the high efficiency video coding (HEVC) standard. 18th IEEE Int. Conf. on Image Processing, :373–376, Brussels, September 11–14, 2011.
[13] https://hevc.hhi.fraunhofer.de/trac/hevc/browser/tags/HM-5.0?rev=2765.
[14] B. Bross, W. Han, J.-R. Ohm, G.J. Sullivan and T. Wiegand. WD5: Working Draft 5 of High-Efficiency Video Coding. JCTVC-G1103, 7th Meeting, Geneva, November 21–30, 2011.
[15] F. Bossen. Simplified angular intra prediction. Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 2nd Meeting, Geneva, CH, 21–28 July 2010.
[16] Y. Liu. www.h265.net/2010/12/analysis-of-coding-tools-in-hevc-test-model-hm-intra-prediction.html, 01-12-2012.
[17] E. Kalali, Y. Adibelli and I. Hamzaoglu. A high performance and low energy intra prediction hardware for high efficiency video coding. Int. Conf. on Field Programmable Logic and Applications (FPL), :719–722, Turkey, August 29–31, 2012.
[18] http://www.altera.com/.devices/fpga/stratix-fpgas/stratix-iii/st3-index.jsp.
[19] http://www.europractice-ic.com/technologies_TSMC.php?tech_id=018um.
[20] http://components.arrow.com/part/detail/45902608S9811898N7401.


Biographies

Manel Kammoun received her electrical engineering degree from the National School of Engineering of Sfax (ENIS) in 2012. She is currently a researcher in the Laboratory of Electronics and Information Technology within the C&S (Circuit & System) team. Her main research activities are focused on image and video signal processing, hardware implementation and embedded systems.

Ahmed Ben Atitallah is currently an Associate Professor in Electronics at the University of Sfax (Tunisia) and a member of the LETI laboratory within the C&S (Circuit & System) team. He received his PhD degree in Electronics from the University of Bordeaux 1 in 2007, and the diploma of engineer and MS degree in Electronics from the University of Sfax in 2002 and 2003, respectively. His main research activities are focused on image and video signal processing, FPGA implementation and embedded system design.

Hassen Loukil received his electrical engineering degree from the National School of Engineering of Sfax (ENIS) in 2004, and his M.S. and Ph.D. degrees in electronics engineering from the same school in 2005 and 2011, respectively. He is currently an assistant professor at the Higher Institute of Electronics and Communication of Sfax (Tunisia), teaching embedded system conception and System on Chip. He is a researcher in the Laboratory of Electronics and Information Technology at the University of Sfax, Tunisia. His main research activities are focused on image and video signal processing, hardware implementation and embedded systems.

Nouri Masmoudi received his electrical engineering degree from the Faculty of Sciences and Techniques of Sfax, Tunisia, in 1982, and the DEA degree from the National Institute of Applied Sciences of Lyon and University Claude Bernard, Lyon, France, in 1984. He received his Ph.D. degree from the National Engineering School of Tunis (ENIT), Tunisia, in 1990. He is currently a professor at the electrical engineering department of ENIS. Since 2000, he has been leading the 'Circuits and Systems' group in the Laboratory of Electronics and Information Technology, and since 2003 he has been responsible for the Electronics Master Program at ENIS. His research activities have been devoted to several topics: design, telecommunication, embedded systems, information technology, video coding and image processing.

E. Zarrouk, Y. Ben Ayed and F. Gargouri

Arabic Continuous Speech Recognition Based on Hybrid SVM/HMM Model

Abstract: Hidden Markov Models (HMM) have achieved huge progress, but they still suffer from a lack of discrimination capability, especially in speech recognition. Therefore, as a way to improve the results of recognition systems, we employ Support Vector Machines (SVM), which act as estimators of posterior probabilities, inasmuch as they are characterized by strong discrimination and high predictive power. Moreover, they are based on structural risk minimization (SRM), where the goal is to learn a classifier that minimizes a bound on the expected risk, rather than the empirical risk. In this work we also present a new approach for automatic labeling that respects the syntax and grammar rules of the Arabic language. The results obtained for the Arabic speech recognition system based on triphones are 64.68 % with standard HMMs, and we achieve 76.96 % as the best recognition rate of a tested speaker with the proposed SVM/HMM system. Consequently, the WER obtained for the recognition of continuous speech by the three systems confirms the performance and power of SVM/HMM. The speech recognizer was evaluated on the ARABIC_DB corpus and performs at 11.42 % WER, compared to 13.32 % for the triphone mixture-Gaussian HMM system.

Keywords: Automatic Speech Recognition, Hybrid System, Support Vector Machine, Automatic Labeling, Triphones-Based Continuous Speech.

1 Introduction

Automatic speech recognition (ASR) is the task of taking an utterance of a speech signal as input and converting it into a text sequence as close as possible to what was represented by the acoustic data. Current speech recognition systems follow the statistical pattern recognition approach, with feature extraction at the front end and likelihood evaluation of these features at the back end [10, 11]. Traditionally, statistical models such as Gaussian mixture models have been used to represent the various modalities of a given speech sound. Hidden Markov models, introduced in the late 1960s and early 1970s, became the solution of choice for the problems of automatic speech recognition. Indeed, these models are rich in mathematical structure and can therefore be used in a wide range of applications.

E. Zarrouk, Y. Ben Ayed and F. Gargouri: MIRACL, Multimedia Information system and Advanced Computing Laboratory, Higher Institute of Computer Science and Multimedia (ISIMS), University of Sfax, Tunisia, emails: [email protected], [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 145–160. https://doi.org/10.1515/9783110470383-010


Despite the enormous progress made by HMMs, they suffer from a lack of discrimination capability; in particular, the learning phase of an HMM requires a large amount of data to approximate the conditional probabilities. There is a lack of ASR systems for the Arabic language, so this research, which builds on the work of Zarrouk et al. [1, 2], aims to prove the performance and effectiveness of using Support Vector Machines (SVM) as an estimator of posterior probabilities together with HMMs for Arabic continuous speech. Much research has been carried out to improve Arabic ASR with various techniques; the best of it is presented in [13]. Recognition of Arabic continuous speech was addressed by Al-Otaibi [14]. He proposed a technique for labeling Arabic speech and reported a recognition rate for speaker-dependent ASR of 93.78 % using his technique; the ASR was built using the HMM toolkit (HTK). Hyassat and Abu Zitar [15] described an Arabic speech recognition system based on Sphinx4. They also proposed an automatic toolkit for building phonetic dictionaries for the Holy Qur'an and standard Arabic. Three corpora were developed in this work, namely the Holy Qur'an corpus HQC-1 of about 18.5 h, the command and control corpus CAC-1 of about 1.5 h, and the Arabic digits corpus ADC of less than 1 h of speech. Soltau et al. [16] reported advancements in the IBM system for Arabic speech recognition as part of the continuous effort for the GALE project. The system consists of multiple stages that incorporate both vocalized and non-vocalized Arabic speech models; it also incorporates a training corpus of 1,800 h of unsupervised Arabic speech. Nofal et al. [17] demonstrated the design and implementation of new stochastic-based acoustic models suitable for a command and control speech recognition system for the Arabic language. Park et al. [18] explored the training and adaptation of multilayer perceptron (MLP) features in Arabic ASRs. Three schemes were investigated: first, the use of MLP features to incorporate short-vowel information into the graphemic system; second, a rapid training approach for use with the perceptual linear predictive (PLP) + MLP system; and finally, the use of linear input network (LIN) adaptation as an alternative to the usual HMM-based linear adaptation. Shoaib et al. [19] presented a novel approach to developing a robust Arabic speech recognition system based on a hybrid set of speech features consisting of intensity contours and formant frequencies. Bourouba et al. [20] presented a new HMM/support vector machine (SVM) (k-nearest neighbor) system for the recognition of isolated spoken words. Messaoudi et al. [21] demonstrated that by building a very large vocalized vocabulary and by using a language model including a vocalized component, the WER could be significantly reduced. Al-Diri et al. [9] aimed to build large-vocabulary continuous speech recognition for the Arabic language; they described and implemented a hybrid HMM/NN for Arabic triphone-based continuous speech. A typical ASR system operates with the help of five basic modules: feature extraction for signal parameterization, acoustic models, language models, a pronunciation model and a decoder.


The contribution of this paper consists in the definition of a new approach able to improve automatic Arabic speech recognition. Our work incorporates two interventions in the development of ASR for continuous speech. The first one concerns the preparation of the acoustic models and consists of automatic labeling. The second and principal one is the application of the SVM/HMM model to Arabic continuous speech recognition. We make a comparative study between HMM and SVM/HMM for the recognition of Arabic monophone- and triphone-based continuous speech. The paper is organized as follows. Section 2 describes the general architecture of an ASR system. Section 3 presents the parameterization phase and the acoustic models, and introduces the new approach of automatic labeling and sub-word modeling. We briefly describe in section 4 the basic concepts of HMMs. In the fifth part, we describe the architecture of our proposed SVM/HMM system, including a brief description of SVMs. Experimental results are presented in section 6.

2 Automatic speech recognition

An automatic speech recognition (ASR) system transcribes a voice message into text. The main applications of ASR are automatic transcription, indexing of multimedia documents and man-machine dialogue. Current continuous speech recognition systems are based on the statistical formalization proposed in [30], which derives from information theory. From the acoustic observations X, the goal of a recognition engine is to find the sequence of words W most likely among all possible sequences, as shown in Fig. 1. This sequence must maximize the following equation [12]:

Ŵ = arg max_W P(W|X)    (1)

Fig. 1. General architecture of ASR systems.


Applying Bayes' theorem, the equation becomes:

Ŵ = arg max_W [P(X|W) P(W)] / P(X)    (2)

where

P(X) = Σ_W P(X|W) P(W)    (3)

P(X) does not depend on a particular value of W and can therefore be dropped from the arg max:

Ŵ = arg max_W P(X|W) P(W)    (4)

where P(W) is estimated using the language model and P(X|W) corresponds to the probability given by the acoustic models. This approach integrates the acoustic and linguistic information in the same decision-making process.

3 Acoustic models and parameters

The speech signal cannot be used directly. Indeed, the signal contains many elements other than the linguistic message: information related to the speaker, the recording conditions, etc. All this information is unnecessary for decoding speech and acts as noise. In addition, the variability and redundancy of the speech signal make it difficult to use as-is. It is therefore necessary to extract only the parameters that depend on the linguistic message. Generally, these parameters are estimated via sliding windows on the signal. The analysis window estimates the signal on a portion considered stationary, typically 10 to 30 ms, while limiting side effects and discontinuities of the signal via a Hamming window. Most parameters represent the frequency spectrum and its evolution over the window. The parameterization techniques most commonly used are PLP (Perceptual Linear Prediction: spectral domain) [4], LPCC (Linear Prediction Cepstral Coefficients: time domain) [5] and MFCC (Mel Frequency Cepstral Coefficients: cepstral domain). We used HTK (the Hidden Markov Model Toolkit) for the data preparation and feature extraction phases. Our first intervention in the recognition system is in the phase of labeling the sound files. Labeling is a critical phase in the speech recognition process, given that it is there that all linguistic units are delimited and bounded in the signal. The HSLab command of HTK only allows labeling files manually, which makes this phase very painful and leads to segmentation mistakes.
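As an illustration of such a front end (our sketch, not the paper's HTK configuration; the window settings and the synthetic signal are assumptions), MFCCs can be computed over 25 ms Hamming windows with a 10 ms shift:

    import numpy as np
    import librosa  # assumed available

    sr = 16000
    y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone, not speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr),  # 10 ms frame shift
                                window="hamming")
    print(mfcc.shape)  # (13, number_of_frames)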


HSLab is an interactive label editor for manipulating speech label files. An example of using HSLab would be to load a sampled waveform file, determine the boundaries of the speech units of interest and assign labels to them. Alternatively, an existing label file can be loaded and edited by changing current label boundaries, deleting and creating new labels. HSLab is the only tool in the HTK package which makes use of the graphics library HGraf. Labeling manually with HTK is done through the command window (Fig. 2). As an example, consider a sentence containing 65 monophones: to label it manually, we would have to use the display window with the HSLab command, and this task cannot be carried out easily, since we need information on the structure of all the monophone signals to finish the labeling phase, which is hard and tedious. To perform this task automatically, we apply the following algorithm:
– Let u be the elementary window unit.
– The Arabic alphabet consists of vowels and consonants; we divide the vowels into two categories, short vowels and long vowels.
– Let A = 2 × (number of consonants) + 3 × (number of short vowels) + 4 × (number of long vowels).
– Let B be the total duration of the signal (in microseconds).
– The size of the elementary window is then u = B/A.

Fig. 2. HSlab display window.


Starting from the beginning of each signal, we tag each label (monophone, diphone or triphone) by the following rule:
– each consonant is assigned a duration of 2 × u;
– each short vowel is assigned a duration of 3 × u;
– each long vowel is assigned a duration of 4 × u.
In our case, the ARABIC_DB database contains 4404 wave files, each with about 40 phonetic units. Preparing a single file manually takes at least 20 minutes in the best case, so preparing the whole corpus would require 88,080 minutes in the best case. This task is clearly impractical, so we automate it according to the linguistic and grammar rules of the Arabic language, which is composed of vowels and consonants; we divided the vowels into two categories, long and short, following the extension rule [27]. The categories used by the automatic labeling algorithm are given in Tab. 1. In fact, evaluating the start and end times of a unit depends on the total number of units in the sound file, the coefficient of its category and the total duration of the input signal.
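A minimal sketch of this proportional-segmentation rule follows (our reading of the algorithm; the category codes and the example durations are hypothetical):

    # Weights 2/3/4 for consonants, short vowels and long vowels, as above.
    WEIGHT = {"C": 2, "SV": 3, "LV": 4}

    def label_boundaries(categories, total_duration_us):
        a = sum(WEIGHT[c] for c in categories)
        u = total_duration_us / a          # elementary window size u = B/A
        t, segments = 0.0, []
        for c in categories:
            d = WEIGHT[c] * u
            segments.append((t, t + d))    # (start, end) of this phonetic unit
            t += d
        return segments

    # e.g. a consonant-short vowel-consonant sequence over a 70 000 us signal:
    print(label_boundaries(["C", "SV", "C"], 70_000))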

Tab. 1. Categories of Arabic phones.

Consonants | Long vowels | Short vowels

Fig. 3. Monophone (a), biphone (b) and triphone (c) HMMs for the Arabic word [q a l]; 'sil' stands for silence at the beginning and end of the utterance, which is modeled as a phone, too. Panel content: (a) monophones sil, q, a, l, sil; (b) biphones sil+q, q+a, a+l, l+sil; (c) triphones sil+*, *−sil+q, sil−q+a, q−a+l, a−l+sil, l−sil+*.

The information coming from the language model and the acoustic models, as well as the information from the pronunciation dictionary, has to be balanced during speech recognition. In our work, we focus our experiments on the recognition of Arabic triphones-based continuous speech.

4 Hidden Markov models

The acoustic signal of speech is modeled by a small set of acoustic units, which can be considered as elementary sounds of the language. Traditionally, the chosen unit is the phoneme: a word is formed by concatenating phonemes. More specific units such as syllables, disyllables or phonemes in context can be used, thereby making the model more discriminating, but this theoretical improvement is limited in practice by the complexity involved and by estimation problems [24]. A compromise often employed is the use of contextual phonemes sharing states. The speech signal can thus be likened to a series of units. In Markov-based ASR, the acoustic units are modeled by HMMs which are typically left-to-right with three states. Each state of the Markov model has an associated probability distribution modeling the generation of acoustic vectors through this state [6]. An HMM is characterized by several parameters:
– N: the number of states of the model.
– A = {a_ij} = {P(q_t = j | q_{t−1} = i)}: the matrix of transition probabilities over the set of states of the model.


– B = {b_k(X_t)} = {P(X_t | q_t = k)}: the matrix of emission probabilities of the observations X_t for the state q_k.
– π: the initial distribution of states, π_i = P(q_0 = i).

HMM is a well-established paradigm that has been widely used in ASR systems for the last four decades. Other techniques can be combined with HMMs, or used as extensions of HMMs, to improve system performance [12]. Our approach consists of the hybridization of SVM with HMM, with the SVM acting as an estimator of posterior probabilities. In this part, we present the structure of the SVM/HMM model and describe the characteristics of SVMs which motivate us to apply them.

5 Support vector machines

SVMs are statistical learning techniques introduced by V. Vapnik in 1995 [22]. The success of this method is justified by the solid theoretical foundation that underpins it. SVMs can address a variety of problems including classification, and the method is well suited to deal with high-dimensional data such as text and images [7]. Since their introduction in the field of pattern recognition, several studies have demonstrated the effectiveness of these techniques, primarily in image processing. Classifiers are typically optimized based on some form of risk minimization. Empirical risk minimization is one of the most commonly used techniques, where the goal is to find a parameter setting that minimizes the risk:

R_emp(α) = (1/2l) ∑_{i=1}^{l} |y_i − f(x_i, α)|   (5)

where α is the set of adjustable parameters, and y_i, x_i are the expected output and the given input, respectively. However, minimizing R_emp does not necessarily yield the best possible classifier [23]. For example, Fig. 4 shows a two-class problem and the corresponding decision regions in the form of hyperplanes. The hyperplanes C0, C1 and C2 all achieve perfect classification and, hence, zero empirical risk. However, C0 is the optimal hyperplane because it maximizes the distance between the margins H1 and H2, thereby offering better generalization [7]. This form of learning is an example of structural risk minimization, where the aim is to learn a classifier that minimizes a bound on the expected risk rather than the empirical risk [7]; SVM is based on this structural risk. Therefore, respecting the principle of structural risk minimization, given a data set for which we have no information on the underlying distribution, if we choose the space of separating hyperplanes as hypothesis space, the zero-empirical-risk hyperplane that must be chosen is the one which has the greatest


Fig. 4. Two-class hyperplane example: separating hyperplanes C0, C1 and C2, margins H1 and H2, normal vector W, and the samples of Class 1 and Class 2.

geometric margin. The corresponding optimal solution is called the optimal hyperplane. All the above remarks assume that the data are linearly separable [23]. What if they are not? Two ideas were proposed by Vapnik. The first is to project the data into a redescription space in which they become linearly separable. The second is to accept that a small number of examples may lie within the margin, if the other examples can be separated linearly. The classification of data thus depends on the nature of the separation: there are cases of linearly separable data and of nonlinearly separable data. With SVM, a discriminative hyperplane with maximal margin is searched for when the classes are linearly separable; with constant intra-class variation, the classification confidence grows with increasing inter-class distance. The linearly separable cases are the simplest for SVM because a linear separation can easily be found. Otherwise the data are mapped by

φ(x) : ℝ^n → ℝ^m   (6)

and an optimal linear discrimination function is found in this new space. This transformation increases the linear separability. For nonlinear SVMs we are faced with a very high dimension of the feature space ℝ^m, so φ(x)φ(y) must not be calculated explicitly; it can instead be expressed with reduced complexity through kernel functions:

K(x_i, y_i) = φ(x_i)φ(y_i)   (7)

We do not need to know how the new feature space ℝ^m looks; we only need the kernel function as a measure of similarity [29]. Among the most common kernel functions used in SVM, we quote those frequently used in many applications:
∙ Polynomial kernel: K(x_i, y_i) = [(x_i · y_i) + 1]^d   (8)
∙ Linear kernel: K(x_i, y_i) = x_i · y_i   (9)
∙ Sigmoid kernel: K(x_i, y_i) = tanh(β_1 x_i · y_i + β_2)   (10)
∙ Radial basis function kernel: K(x_i, y_i) = exp(−γ |x_i − y_i|²)   (11)
where d, β_1, β_2 and γ are parameters to be determined empirically.
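For illustration, the four kernels of Eqs. (8)–(11) translate directly into Python with NumPy; the parameter defaults below are arbitrary placeholders, since d, β_1, β_2 and γ are to be tuned empirically:

import numpy as np

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1.0) ** d                     # Eq. (8)

def linear_kernel(x, y):
    return np.dot(x, y)                                  # Eq. (9)

def sigmoid_kernel(x, y, beta1=0.5, beta2=0.0):
    return np.tanh(beta1 * np.dot(x, y) + beta2)         # Eq. (10)

def rbf_kernel(x, y, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)   # Eq. (11)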

6 Hybrid model SVM/HMM

Here we use SVMs to estimate posterior probabilities in both the training phase and the recognition phase. First, we train one SVM for every sub-task signal, i.e., one versus all: every phoneme is a separate class. The function f(x_i) that describes the separation plane measures the distance of the element x_i to the margin, and the assignment of the element x_i to one of the classes depends on the sign of f(x_i). Moreover, the farther the element is from the margin, the higher its probability of belonging to the class [29].

6.1 Emission probabilities with SVM

After choosing and applying the kernel function, the conditional probability P(x | class_j) is generated; a general model is obtained by minimizing the number of support vectors while supporting the maximum of the data [7, 23]. We need the likelihood of the input vector given the class of the appropriate phoneme j, and we apply the Bayes rule to obtain the HMM emission probabilities:

P(class_j | x_i) = P(x_i | class_j) P(class_j) / P(x_i)   (12)

– P(class_j | x_i) is the posterior probability of the class of phoneme j given the input vector x_i.
– P(class_j) is the prior probability of the phoneme j.
– P(x_i) is the prior probability of the acoustic vector x_i.
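In hybrid systems the emission scores needed by the HMM are commonly obtained by rearranging Eq. (12); since P(x_i) is the same for every class, it can be dropped during decoding. A minimal sketch (array names are ours):

import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Rearranged Eq. (12): P(x_i | class_j) ∝ P(class_j | x_i) / P(class_j).
    posteriors: per-frame SVM outputs; priors: class priors from training."""
    return posteriors / priors

posteriors = np.array([0.7, 0.2, 0.1])  # example SVM posteriors for one frame
priors = np.array([0.5, 0.3, 0.2])      # example phoneme priors
print(scaled_likelihoods(posteriors, priors))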

6.2 Decoding phase

To each phoneme we attribute an HMM with five states. We consider that all the states are combined into one HMM, because the first and the last state of each phoneme HMM have no transitions to other states. Given a sequence of observations X = x_1, x_2, x_3, . . . , x_l and an HMM M with N states (number of phones × 3), we wish to find the maximum probability state path Q = q_1, q_2, q_3, . . . , q_l.

Fig. 5. Architecture of the SVM/HMM model: the current acoustic vector x_n (39 MFCC features) is fed to the SVM, whose outputs P(q_1) . . . P(q_N) serve as emission probabilities for the hidden Markov models; the most likely state sequence is evaluated by the Viterbi algorithm.

Let δ_j(t) be the probability of the most probable path ending in state j at time t:

δ_j(t) = max_{q_1, q_2, q_3, ..., q_{t−1}} P(q_1 q_2 q_3 . . . q_{t−1}, q_t = j, x_1 x_2 x_3 . . . x_t | M)   (13)

Here the emission term P(q_j | x_t) entering δ_j(t) is the probability estimated by the SVM kernel function from the observation x_t. Finally, we have to determine:

q_optimal = max_{1≤j≤N} [δ_j(N)]   (14)

At the end we choose the highest probability endpoint, and then we backtrack from there to find the highest probability path [30]. We obtain a sequence of states that represents the observation sequence X. Thus, every monophone or triphone is represented by a five-state HMM with a start state, 3 hidden states and an end state. The Viterbi algorithm is the best solution to the problem of estimating the posterior probability P(Q|X): the most likely state sequence at time t depends only on t and on the most likely sequence up to t − 1.
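The decoding of Eqs. (13)–(14) is standard Viterbi search; a compact log-domain sketch, assuming the per-frame emission scores come from the SVM stage:

import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """log_emis: (T, N) log emission scores per frame and state,
    log_trans: (N, N) log transition matrix, log_init: (N,) log initial
    distribution. Returns the most probable state path (Eqs. (13)-(14))."""
    T, N = log_emis.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)                # back-pointers
    delta[0] = log_init + log_emis[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emis[t]
    path = [int(delta[-1].argmax())]                 # best endpoint ...
    for t in range(T - 1, 0, -1):                    # ... then backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]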


7 Experimental results

In this section, we present experimental results for automatic monophone- and triphone-based continuous speech recognition for the Arabic language. The contribution presented in this paper is the automatic labeling of the signals of the sound files, used to prove the performance of the SVM/HMM system proposed in [2] for the recognition of Arabic triphones-based continuous speech. The development of a robust large vocabulary continuous speech recognition system needs a large vocabulary speech corpus for the Arabic language. There is no well-established standard database for the Arabic language; however, the database ARABIC_DB of [8] was used. This database has 5034 unique triphones. After the training phase, we evaluate the learning accuracy for the two compared systems. We trained the two systems with 620 statements from six speakers (3 males and 3 females). The number of Mel Frequency Cepstral Coefficients (MFCC) used is 39. After experimental tests, we developed the recognizer systems with the best parameterization. In fact, the accuracy rates range between 96.91 % and 98.68 % for the learning of the HMM system and the proposed SVM/HMM system [3]. As a first step, we evaluate the recognition rate of Arabic monophones on 171 sentences from 4 speakers A, B, C and D (2 males and 2 females) for the two systems: standard HMM and SVM/HMM. As presented in Tab. 2, we can deduce that the recognition rates of Arabic monophones with the SVM/HMM model are better than with HMM. We may deduce from these results that the hybrid system SVM/HMM is more successful in obtaining the best recognition rate even for a sequence of acoustic vectors, namely 65.98 % for the HMM system and 68.26 % for the most efficient hybrid SVM/HMM [3]. The second part of our experimentation is the recognition of Arabic triphones-based continuous speech. The recognizer was tested with 171 statements (7841 triphones) for each speaker. Table 3 shows the results obtained for the recognition of Arabic triphones by our proposed system SVM/HMM compared to standard HMM for the 4 speakers A, B, C and D. The results illustrated in Tab. 3 prove the effectiveness and the performance of SVM/HMM compared to standard HMM. In fact, speaker B has the best accuracy rate

Tab. 2. Recognition rates of Arabic monophones by HMM and SVM/HMM for 4 speakers.

Speaker   HMM %   SVM/HMM %
A         60.57   66.23
B         65.98   67.84
C         64.75   63.54
D         62.32   68.26


Tab. 3. Recognition rates of Arabic triphones by HMM and SVM/HMM for 4 speakers.

Speaker   HMM %   SVM/HMM %
A         63.24   72.16
B         67.56   74.64
C         62.03   76.96
D         65.89   72.30

with 67.56 % for HMM, while 76.96 %, obtained for SVM/HMM with speaker C, is the best accuracy rate obtained in all experimentations [3]. As illustrated in the previous tables, the recognition rates of Arabic monophones and triphones-based continuous speech obtained by the standard HMM system are the lowest. Standard HMMs behave well, but the hybrid system SVM/HMM is more efficient, obtaining the best recognition rate for all 4 speakers. That is why we evaluate the gain obtained by SVM/HMM relative to HMM. The word error rate (WER) is a valuable tool for comparing different systems as well as for evaluating improvements within one system. When reporting the performance of a speech recognition system, sometimes word accuracy (WAcc) is used instead:

WAcc = 1 − WER   (15)

Let Reco_Phone denote the number of phones of a word that suffice to recognize it, and let Ratio_Phone denote the ratio of Reco_Phone over the number of the word's phones. Then

WAcc = (number of recognized words) / (number of all the words)   (16)
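A direct reading of Eq. (16) in Python (the word lists below are hypothetical examples):

def word_accuracy(recognized, reference):
    """Eq. (16): number of recognized words over the number of all words."""
    hits = sum(1 for rec, ref in zip(recognized, reference) if rec == ref)
    return hits / len(reference)

print(word_accuracy(["qal", "daras"], ["qal", "kataba"]))  # -> 0.5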

The recognition of a phone is binary: either it is caught or it fails. The recognition of a word, however, can be achieved through the maximum probability of its phones, so recognizing the correct word can be achieved without recognizing all the phones of that word. For example, if a word consists of two phones and we recognize only the first one, there may be no phone other than the second that can complete a valid word; thus, even if the phones are recognized at a rate of 50 %, some of the produced words can be recognized at 100 %. The systems were trained with 620 sentences (2251 words) from 6 speakers. Table 4 presents the WER obtained by HMM and SVM/HMM for the 4 speakers; the lowest average WER is obtained by the hybrid model, namely 11.42 % for SVM/HMM against 13.32 % for standard HMM [3]. Thus, comparing the performance of the two systems, we observe an improvement in the recognition rates of Arabic


Tab. 4. WER obtained by the two systems HMM and SVM/HMM for the speakers A, B, C and D.

Speaker   HMM %   SVM/HMM %
A         12.67   10.54
B         14.62   11.84
C         12.03   10.96
D         13.97   12.36

monophones and triphones-based continuous speech, i.e. the recognition rates of Arabic monophones and triphones-based continuous speech with SVM/HMM are higher than those obtained by the standard HMM, which proves the effectiveness of the hybridization of the SVM with the HMM.

8 Conclusion

In this paper, we have proposed a hybrid ASR system SVM/HMM applied to the recognition of Arabic triphones-based continuous speech. Our work incorporates two major contributions. First, we have presented a novel approach for the phone segmentation of sound files; this task consists of the automatic labeling of input signals for language models, which has been reported as a very difficult task in the literature. Second, to prove the performance of the SVM/HMM model, we have presented a comparison of the recognition rates of Arabic monophones and triphones-based continuous speech using, consecutively, standard HMM and our proposed SVM/HMM. The results of Arabic monophone and triphone-based continuous speech recognition obtained by the hybrid model SVM/HMM, compared with those obtained with standard HMM, showed good effectiveness and performance. In fact, the best results are obtained with the proposed system SVM/HMM, where we achieved 76.96 % as the recognition rate for one of the tested speakers. The speech recognizer was evaluated with the ARABIC_DB corpus and performs at 11.42 % WER, as compared to 13.32 % with the triphone mixture-Gaussian HMM system.

Bibliography
[1] E. Zarrouk and Y. Ben Ayed. Automatic Speech Recognition with Hybrid Models. Proc. of SPED Conf., :183–188, 2011.
[2] E. Zarrouk and Y. Ben Ayed. Hybrid SVM/HMM model for the Arab phonemes recognition. Proc. of SPED Conf., :183–188, 2011.
[3] E. Zarrouk, Y. Ben Ayed and F. Gargouri. Hybrid SVM/HMM Model for the Recognition of Arabic Triphones-based Continuous Speech. 10th Int. Multi-Conf. on Systems, Signals & Devices (SSD), Tunisia, 2013.
[4] H. Hermansky and L. Cox. Perceptual linear predictive (PLP) analysis-resynthesis. Proc. of Eurospeech'91, Genova, :329–332, 1991.
[5] J.D. Markel and A.H. Gray Jr. Linear Prediction of Speech. Berlin, Springer-Verlag, 1976.
[6] L.-R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[7] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.
[8] B. Al-Diri and A. Sharieh. A Database for Arabic Speech Recognition ARABIC_DB. Technical Report, the University of Jordan, Amman, Jordan, 2002.
[9] B. Al-Diri, A. Sharieh and M. Qutiashat. A speech recognition model based on tri-phones for the Arabic language. Advances in Modelling Series B: Signal Processing and Pattern Recognition, 50(2):49–64, 2007.
[10] D. O'Shaughnessy. Interacting with computers by voice: automatic speech recognition and synthesis. Proc. of the IEEE, 91(9):1272–1305, 2003.
[11] F. Mihelic and J. Zibert. Speech Recognition Technologies and Applications. Vienna: I-TECH, 2008.
[12] K.A. Rajesh and D. Mayank. Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I). Int. J. of Speech Technology (IJST), 14(4):297–308, 2011.
[13] A. Dhia and E. Moustafa. Cross-Word Modeling for Arabic Speech Recognition. Springer Briefs in Electrical and Computer Engineering, Speech Technology, :17–21, 2012.
[14] F. Al-Otaibi. Speaker-dependant continuous Arabic speech recognition. M.Sc. thesis, King Saud University, 2001.
[15] H. Hyassat and R. Abu Zitar. Arabic speech recognition using SPHINX engine. Int. J. of Speech Technology, 9(3–4):133–150, 2008.
[16] H. Soltau, G. Saon et al. The IBM 2006 Gale Arabic ASR system. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 2007.
[17] M. Nofal, R.E. Abdel et al. The development of acoustic models for command and control Arabic speech recognition system. Int. Conf. on Electrical, Electronic and Computer Engineering, ICEEC'04, 2004.
[18] J. Park, F. Diehl et al. Training and adapting MLP features for Arabic speech recognition. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 2009.
[19] M. Shoaib, F. Rasheed, J. Akhtar, M. Awais, S. Masud and S. Shamail. A novel approach to increase the robustness of speaker independent Arabic speech recognition. 7th Int. Multi Topic Conf., INMIC, 2003.
[20] H. Bourouba and R. Djemili. New hybrid system (supervised classifier/HMM) for isolated Arabic speech recognition. 2nd Information and Communication Technologies, ICTTA'06, 2006.
[21] A. Messaoudi, J.L. Gauvain et al. Arabic broadcast news transcription using a one million word vocalized vocabulary. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 2006.
[22] V. Vapnik. Estimation of Dependences Based on Empirical Data. Nauka, Moscow; English translation: Springer-Verlag, New York, 1979.
[23] A. Ganapathiraju et al. Support Vector Machines for Speech Recognition. Proc. of the ICSLP, :2923–2926, Sydney, Australia, 1998.
[24] S. Connel. A Comparison of Hidden Markov Model Features for the Recognition of Cursive Handwriting. Computer Science Department, Michigan State University, MS Thesis, 1996.
[25] R. Schwartz, J. Klovstad, J. Makhoul and J. Sorensen. A preliminary design of a phonetic vocoder based on a diphone model. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, :32–35, 1980.
[26] H. Abd-Arahman. Nutshell in science of tajweed, :52–69, 2008.
[27] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland. The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, :294–297, 2006.
[28] U. von Luxburg, O. Bousquet and B. Schölkopf. A compression approach to support vector model selection. J. of Machine Learning Research, :293–323, 2004.
[29] A. Castellani, D. Botturi, M. Bicego and P. Fiorini. Hybrid HMM/SVM Model for the Analysis and Segmentation of Teleoperation Tasks. IEEE Int. Conf. on Robotics and Automation, New Orleans, 2004.
[30] A. Faria. An Investigation of Tandem MLP Features for ASR. Int. Computer Science Institute, TR 07–003, 2007.
[31] F. Jelinek. Continuous speech recognition by statistical methods. Proc. of the IEEE, 64(4):532–556, 1976.

Biographies

Elyes Zarrouk received his Master's degree in Computer Science from the Higher Institute of Computer Science of Monastir, Tunisia, in 2007. He obtained his MS degree in New Information Technologies and Dedicated Systems from the National Engineering School of Sfax, Tunisia, in 2010, and graduated with a PhD degree in Computer Science from the Faculty of Economics and Management of Sfax, Tunisia, in 2017. Currently, he focuses his research on pattern recognition, artificial intelligence and speech recognition at MIRACL, the Multimedia Information Systems and Advanced Computing Laboratory, University of Sfax, Tunisia. He is an assistant professor in Computer Science at the University of Kairouan in Tunisia.

Yassine Ben Ayed graduated in Electrical Engineering from the National School of Engineering in Sfax, Tunisia, in 1998. He obtained his PhD degree in Signal and Image from Telecom ParisTech in 2003 and his University habilitation from the National Engineering School of Sfax, Tunisia, in 2015. Currently, he is a professor in Electrical and Computer Engineering at the University of Sfax. He focuses his research on pattern recognition, artificial intelligence and speech recognition.

Faiez Gargouri received his Diploma of higher technician in computer management from the Faculty of Economics and Management of Sfax, Tunisia, in 1986, and his Master's degree in management computer science from the same faculty in 1988. He obtained his MS degree in Computer Methods of Industrial Systems from Pierre and Marie Curie University, Paris 6, in 1990, and graduated with his PhD degree from René Descartes University, Paris 5, in 1995. He received his University habilitation in computer science from the University of Tunisia in 2002. He is a professor in Computer Science at the University of Sfax.

A. Damati, O. Daoud and Q. Hamarsheh

Enhancing the Odd Peaks Detection in OFDM Systems Using Wavelet Transforms

Abstract: This work aims to study the effect of unwanted peaks and to enhance the performance of wireless systems by tackling such peaks. A new proposition has been made based on the wavelet transform method and its entropy. Signals with a large peak-to-average power ratio (PAPR), which is considered one of the major drawbacks of Orthogonal Frequency Division Multiplexing (OFDM) systems, will be examined. Furthermore, spatial diversity Multiple-Input Multiple-Output (MIMO) technology is used to overcome the additional complexity that could arise in our proposition. To draw out the best performance of this work, a MATLAB simulation has been used; it is divided into three main stages, namely, MIMO-OFDM symbol reconstruction based on the wavelet transform, a predetermined thresholding formula, and finally a moving filter. This algorithm is called Peaks' Detection based on the Entropy Wavelet Transform (PD-EWT). Based on the simulation, and under some constraints such as the bandwidth occupancy and the complexity structure of the transceivers, a peak detection ratio of around 85 % has been achieved. Compared with our previously published works, the PD-EWT detects around 25 % more peaks.

Keywords: Wavelet Transform, Entropy, MIMO, OFDM, PAPR.

1 Introduction

The overwhelmingly large amount of data due to the high demand for various wireless and cellular system applications has attracted researchers' interest in handling its effects on wireless systems. Thus, during the last two decades, attention has focused on the combination of the Orthogonal Frequency Division Multiplexing (OFDM) modulation technique and Multiple-Input Multiple-Output (MIMO) technology, targeting data rates of more than 100 Mbps for such systems. OFDM systems use parallel transmission, while MIMO technologies have been employed to reduce the effect of rich scattering environments. Moreover, OFDM has been adopted in both wireless and wired applications for high data rates, as it offers significant advantages over the conventional

A. Damati, O. Daoud and Q. Hamarsheh: A. Damati, Dept. of Electrical Engineering, Philadelphia University, Amman 19392, Jordan, email: adamati@philadelphia.edu.jo, O. Daoud: Dept. of Communications and Electronics Engineering, Philadelphia University, Amman, Jordan, odaoud@philadelphia.edu.jo Q. Hamarsheh: Dept. of Computer Engineering, Philadelphia University, Amman, Jordan, qhamarsheh@philadelphia.edu.jo De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 161–174. https://doi.org/10.1515/9783110470383-011


ones, and shows robustness to multipath fading and a greater simplification of channel equalization. Furthermore, multiple antennas have been employed to support the extraordinary data rates required by the rapid growth of wireless systems and to make use of rich scattered environments [1–5]. The MIMO technologies that could be used for this purpose are either spatial multiplexing or BLAST [6]. The OFDM technique is considered a multi-carrier system that utilizes parallel processing and allows the simultaneous transmission of data on many closely spaced, orthogonal sub-carriers. This is attained by making use of the inverse fast Fourier transform (IFFT) and the fast Fourier transform (FFT). However, the peak-to-average power ratio (PAPR) is a major deficiency of the OFDM signal, which limits the efficiency of non-linear devices such as power amplifiers, mixers, and analog-to-digital converters. Therefore, the wavelet transform method has been used to tackle the effect of this deficiency, as will be discussed in section two [7]. Previously, in [8], another PAPR reduction technique was proposed, also based on wavelet transformation: it denoises the OFDM signal using the DWT, then defines an adaptive threshold to limit the peaks, and finally replaces these peaks and valleys using an average filter. That algorithm gives an enhancement of around 65 %. The PAPR is defined in (1) through the maximum power of the OFDM symbol and its average power as:

PAPR = 10 log_10 (P_peak / P_avg)   (1)

where P_peak is the maximum power of an OFDM symbol, and P_avg is the average power. The PAPR can be reformulated as given in (2), where T is the symbol duration, x(t) is the OFDM symbol at time t, which can take values from 0 to NT, X_n is the data modulating the n-th sub-carrier and f_0 is the nominal subcarrier frequency spacing:

PAPR = [ (1/N) |∑_{n=0}^{N−1} X_n e^{j2πf_0 nt}|² ] / [ (1/(NT)) ∫_0^{NT} |(1/√N) ∑_{n=0}^{N−1} X_n e^{j2πf_0 nt}|² dt ]   (2)

Moreover, the average power of the OFDM symbol presented in (2) is given in (3):

P_avg = (1/T) ∫_0^T ( ∑_{υ=0}^{N−1} c_υ² ) dt   (3)

Here, c_υ is the magnitude of the modulated data. For the sake of simplicity, |c_υ| = 1, which can be attained by using BPSK modulation without channel coding in the interval between 0 and T. This results in a direct relationship between the average power and the total number of IFFT points, N, as is clearly shown in (4)


as follows:

P_avg = (N/T) ∫_0^T c_υ² dt = N.   (4)
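A short numerical check of Eq. (1) in Python; the IFFT size and the BPSK data follow the assumptions of the text (|c_υ| = 1):

import numpy as np

def papr_db(subcarrier_data, n_ifft=256):
    """PAPR of one OFDM symbol, Eq. (1): 10 log10(P_peak / P_avg)."""
    x = np.fft.ifft(subcarrier_data, n_ifft)   # time-domain OFDM symbol
    p = np.abs(x) ** 2
    return 10.0 * np.log10(p.max() / p.mean())

rng = np.random.default_rng(0)
bpsk = rng.choice([-1.0, 1.0], size=256)       # |c_v| = 1
print(papr_db(bpsk))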

From [4] and based on the mathematical modelling that was used to combat the effect of the PAPR, the PAPR will be decreased if the average power of the OFDM symbol is decreased. The flowchart in Fig. 1 shows the previously proposed technique. Another technique to compare with can be found in Fig. 2. It can be summarized by the following steps:
1. Read a segment of the OFDM signal.
2. Denoise the OFDM signal from additive white Gaussian noise (AWGN) using wavelet techniques [13]. In this step the unwanted random addition to the wanted signal is removed using the following sub-steps:
(a) Applying the discrete wavelet transform (DWT) to the noisy signal.

Fig. 1. Algorithm flowchart [9]: an OFDM signal of length 2l is checked against a PAPR threshold; if exceeded, the OFDM symbol period is spread to l times the original, the spread symbol is divided into l blocks, a guard time (control data instead of zero carriers) is added, one of these blocks is combined (after a delay) with the original OFDM signal while the others are sent through l−1 antennas, followed by up-conversion and transmission through l antennas.

Fig. 2. Proposed algorithm flowchart [8]: for an OFDM symbol suffering from high PAPR, the guard interval is removed; wavelets are used to de-noise the symbol; an adaptive magnitude threshold is defined; the local maxima and minima are detected and their locations saved in an array; a moving average filter replaces each peak and valley by the average of its value and the surrounding neighbors; finally the guard interval is added back and the symbol is prepared to be transmitted through l antennas.

(b) Applying a soft thresholding operator (wavelet shrinkage) [13] to highlight large wavelet coefficients, which mostly correspond to the OFDM signal, and to suppress small values, which correspond to noise.
(c) Applying the inverse discrete wavelet transform (IDWT) to the thresholded wavelet coefficients to reconstruct a denoised OFDM signal.
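Steps (a)–(c) correspond to classical wavelet shrinkage; a sketch using the PyWavelets package (the universal-threshold estimate below is a common choice, not necessarily the one used in [13]):

import numpy as np
import pywt  # PyWavelets, assumed available

def wavelet_denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)      # (a) DWT
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745           # noise estimate
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))         # universal threshold
    shrunk = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft")
                            for c in coeffs[1:]]             # (b) soft shrinkage
    return pywt.waverec(shrunk, wavelet)                     # (c) IDWT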

In this work, the wireless system's performance will be drawn for the PD-EWT and compared to the previously published work in [8, 9]. This performance will be based on the BER. It is known that either wrong detection or noisy channels will cause burst errors, so special protection is necessary. Let us first define the received OFDM symbol as shown in (5):

Ŝ = s_0 + s_1   (5)

where s_0 is the useful information and s_1 the interference signal. From this, the SINR expression can be deduced as

SINR = E[|s_0|²] / E[|s_1|²]   (6)

Then, the BER follows from the relationship between the bit error probability and the SINR. Thus, a mapping function can be defined through link level simulation with the needed channel, making use of the definition found in [10], which is based on the Chernoff union bound. The rest of the paper is organized as follows: the


introduced structure of the proposed algorithm in the MIMO-OFDM wireless system is defined in Section 2, the simulation results are presented in Section 3, while the last section summarizes the conclusions.

2 Description of the used PD-EWT in the wireless systems

Figure 3 shows the description of the used wireless system, the MIMO-OFDM system, which is divided into three main stages: the OFDM stage, the PD-EWT stage and the MIMO stage. As shown in Fig. 3, the transmission layer contains these three stages, with the proposed PD-EWT introduced to overcome the effect of the PAPR. The OFDM stage consists of three main blocks, as shown in Fig. 4.

Fig. 3. Proposed work block diagram: input data → OFDM stage → PD-EWT stage → MIMO stage.

Fig. 4. Structure of the OFDM stage: coding → modulation → IFFT → guard band.


In Fig. 4, a turbo encoder with 1/2 coding ratio is used in the coding block and 16 QAM in the modulation block; the overall throughput is expressed in terms of bits/symbol for the OFDM symbols, which are generated by applying the IFFT. After the IFFT stage, and due to the coherent addition of the independently modulated subcarriers in the OFDM symbol, a large PAPR can appear. The generated OFDM signal then passes through the second stage, which is capable of detecting the high PAPR peaks and overcoming their effect. The work in this stage can be divided into four blocks, as shown in Fig. 5. The achieved results will be compared with our previously published work [8, 9]. The continuous wavelet transform (CWT) is attained as a sum of time signals multiplied by scaled and shifted versions of small wavy functions of effectively limited duration with an average of zero. If these scaled versions are generated based on powers of two, the discrete wavelet transform (DWT) is obtained. In addition to the wavelet transforms that are based on decomposition high- and low-pass filters, there is the wavelet packet transform (WP). A pair of low- and high-pass filters is used to obtain two sequences capturing dissimilar frequency sub-band features of the original signal; these sequences are then decimated (downsampled by a factor of two). Many works have indicated that WP features perform better than the DWT [11]. In [12], the authors define the entropy of a discrete random variable X as:

H(X) = − ∑_i P(X = a_i) log P(X = a_i)   (7)

Here H is the entropy and the a_i are the possible values of the discrete random variable X. This equation reflects the degree of disorder that the variable acquires. The discrete wavelet decomposition of the sampled values of the signal S(t) can then be written as:

S(t) = ∑_{j=−N}^{−1} ∑_k C_j(k) ψ_{j,k}(t)   (8)

Fig. 5. PD-EWT architecture stage: wavelet transform → entropy calculations → predetermined threshold formula → moving average filter.


The signal S(t) is given by its sampled values, and C_j(k) is the wavelet coefficient, limited to the frequency interval 2^{j−1} ω_s ≤ |ω| ≤ 2^j ω_s. Moreover, the wavelet entropy can be defined in terms of the relative wavelet energy of the wavelet coefficients as follows:

P_j = E_j / E_total   (9)
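Equations (7)–(9) translate directly into code; a sketch computing the relative wavelet energies and the resulting entropy from the per-level coefficients (function names are ours):

import numpy as np

def relative_energies(coeffs_per_level):
    """Eq. (9): P_j = E_j / E_total from the wavelet coefficients C_j(k)."""
    e = np.array([np.sum(np.abs(c) ** 2) for c in coeffs_per_level])
    return e / e.sum()

def wavelet_entropy(coeffs_per_level):
    """Shannon entropy (cf. Eq. (7)) of the relative wavelet energies."""
    p = relative_energies(coeffs_per_level)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))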

E_j is the energy at each resolution level j, and E_total is the sum of the E_j, j = N, . . . , 1. Having defined the wavelet transforms, we note that the scope of this work is the use of the DWT, while it could be extended in the future to cover the WP. The proposed algorithm starts by scanning the resultant x(t) defined previously. This signal is processed as follows:
– The preprocessing stage:
  – Remove the noise from the signal using wavelet technology.
  – Perform P-level Haar wavelet decomposition of the signal (P = 8).
  – Construct the approximations CA_P and the details CD_P.
– Zero crossing mechanism.
– Entropy calculation.
– Case studies based on the decomposition process, using the flowchart depicted in Fig. 6; the results of these case studies are depicted in Fig. 7:
  – Case Study 1: detect true and false local extreme points using all detail coefficients (CD_P1 − CD_P8).
  – Case Study 2: detect true and false local extreme points using the detail coefficients (CD_P1, CD_P2).
  – Case Study 3: detect true and false local extreme points using all detail coefficients except (CD_3, CD_7 and CD_8).
– Thresholding process.
– Moving average (MA) filter.

The examined CDp’s

Combine them in one matrix Sort it to avoid duplication Allocate the extreme points and calculate the true and false ones Fig. 6. Flowchart of the used procedures.

Fig. 7. Odd peaks detection based on the three different case studies.


In this section, a new technique has been proposed to allocate peaks in the OFDM signal based on the entropy of the wavelet packets. As is clearly seen in Tab. 1, the entropy of the original signal is about 100.4268. After decomposing the signal into eight levels using the wavelet packets, the entropies of the levels range from 0.68262 to 74.2432. Therefore, based on the entropy, we can use the combination that results in the best peak allocation process.

3 Simulation results and discussion

The MATLAB simulation was performed and limited to the use of:
- theoretical randomly generated test data,
- a simple linear convolutional encoder,
- 16-Quadrature Amplitude Modulation (16 QAM) and Binary Phase Shift Keying (BPSK),
- an IFFT size of 256.
The novelty in this work arises from the way the entropy of the wavelet coefficients is used to determine the peaks in the OFDM signal before transmission. To check the system performance, two main key factors will be studied: the bit error rate (BER) and the complementary cumulative distribution function (CCDF) curves of the processed OFDM signal. As a comparison, Tab. 2 demonstrates the strength of the proposed work over both the work found in the literature and our previously published works; it provides an extra PAPR reduction ratio in the range of 15 % – 81 %. Table 2 clearly shows the improvement of the proposed work compared either to our previously published work or to the SLM technique found in the literature. The achieved reduction rate varies between 8 % and 81.62 %, based on the selected case study and the used technique. Figures 8 and 9 show the simulation results in terms of the

Tab. 1. Packet's entropy values.

P level           CD_p      CA_p       Entropy Summation   Decomposition Acceptance*
1                 13.3924   60.8508    74.2432             Accepted
2                 26.5319   32.917     59.4489             Accepted
3                 16.3553   20.7162    37.0715             Not Accepted
4                 8.2519    9.7128     17.9647             Accepted
5                 6.3765    3.2822     9.6587              Accepted
6                 1.6068    0.93369    2.5405              Accepted
7                 1.541     0.54864    2.0897              Not Accepted
8                 0.62314   0.059486   0.68262             Not Accepted
Original Signal                        100.4268

* Acceptance if the sum of entropies of the given level is less than the entropy of the level above the given one.
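The acceptance rule of Tab. 1 can be checked programmatically; in the sketch below, "the entropy of the level above" is read as the entropy of the approximation CA_{p−1} that gets decomposed (the original signal for p = 1), an interpretation that reproduces the accept/reject pattern of the table:

def accepted_levels(level_entropies, signal_entropy):
    """level_entropies: list of (H(CD_p), H(CA_p)) per level p = 1..P.
    A level is accepted when H(CD_p) + H(CA_p) < H(parent node)."""
    decisions, parent = [], signal_entropy
    for p, (h_cd, h_ca) in enumerate(level_entropies, start=1):
        decisions.append((p, h_cd + h_ca < parent))
        parent = h_ca              # level p+1 decomposes CA_p
    return decisions

# entropy values taken from Tab. 1
levels = [(13.3924, 60.8508), (26.5319, 32.917), (16.3553, 20.7162),
          (8.2519, 9.7128), (6.3765, 3.2822), (1.6068, 0.93369),
          (1.541, 0.54864), (0.62314, 0.059486)]
print(accepted_levels(levels, signal_entropy=100.4268))
# -> levels 1, 2, 4, 5, 6 accepted; levels 3, 7, 8 not accepted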


Tab. 2. Simulation results of the proposed technique based on linear coding compared to the literature and our work for the three different case studies.

Modulation   Case    PAPR without   PAPR based   PAPR based    PAPR based    Additional Reduction (%)
Technique    study   coding (dB)    SLM (dB)     work in [8]   work in [9]   SLM      Work in [9]   Conv. coding
16 QAM       I       7.9            3.6          3.5           2.7           72.1     44.5          11
16 QAM       II                     3.81         2.73          1.92          81.62    60            15
16 QAM       III                    4.2          3.9           3.1           65.5     35.9          8
BPSK         I       12.3           5.6          4.1           3.7           62       51.43         14
BPSK         II                     6.4          3.5           3.2
BPSK         III                    6.9          4.7           4.2

Modulation   PAPR no       Previously published PAPR reduction      Additional reduction (%) of the proposed algorithm based on
Technique    coding (dB)   technique based on (dB):                 convolution coding (1/2 coding rate) over the previously published work:
                           Convolution   Turbo    LDPC spreading    Convolution   Turbo    LDPC
                           coding        coding   rate of 2         coding        coding   coding
BPSK         2.3           3.5           2.91     2.56              51.43         41.59    33.6
16 QAM       7.9           2.73          2.25     1.89              60            51.47    42.23

CCDF and BER curves with different modulation techniques. These results check the performance of our system from the point of view of reducing the PAPR problem for two different modulation techniques, 16 QAM and BPSK, respectively. The CCDF figures compare the threshold value against the probability that the PAPR exceeds that threshold. From these figures the reduction improvements over what has been achieved in the literature for conventional MIMO-OFDM systems are clearly visible. The CCDF plot shown in Fig. 8 shows that the probability of peaks exceeding 16 dB could be reduced to 3.9 × 10^−3, while it was 53 × 10^−2; moreover, it shows an extra 15 % reduction over the PAPR combating technique of [9]. Figure 9 shows the BER curves for the BPSK modulation technique to confirm the reliability of the proposed work in combating the PAPR problem. The performance of the MIMO-OFDM system based on the FFT remains better than that of either the conventional PAPR reduction techniques or our previously published work in [8, 9]. At a 20 dB threshold, the BER curve for the conventional MIMO-OFDM system shows a reduction from 48 × 10^−2 to 36 × 10^−2.

Fig. 8. Comparison between the probabilities of PAPR values that exceed a certain threshold for the proposed work and for the conventional MIMO-OFDM system (for the best case study results, Case study II, using the 16 QAM modulation process). The CCDF curves, Prob(PAPR > Pth) versus SNR (dB), cover: without any PAPR reduction technique; based on the PAPR reduction technique in [8]; based on the PAPR reduction technique in [9]; based on PD-EWT; based on the SLM PAPR reduction technique.

Fig. 9. Comparison between the BER of the previously published work and the PD-EWT (for the best case study results, Case study II, using the BPSK modulation process). The BER-versus-SNR curves cover the conventional work [8], the proposed algorithm based on PD-EWT, and the theoretical results.


4 Conclusion

A new proposition, PD-EWT, has been made in this paper. This work introduces a new OFDM transceiver design based on allocating the peaks and valleys of the OFDM signal to be analyzed. In the PD-EWT, the entropy of the DWT has been analyzed and used to identify those peaks. The allocated peaks are then processed using a special thresholding algorithm to overcome their effect. The analytical derivation of this technique was used to build a MATLAB simulation, which eases the study of its feasibility for enhancing the MIMO-OFDM wireless system's performance even in a condensed multipath channel, using two different modulation techniques, BPSK and 16 QAM. This work contains three case studies based on the entropy level of the DWT; thus, a comparison among PD-EWT, our previously published work, and SLM has been made. The results show that we can use just the first two detail parameters, since the decomposition is no longer accepted after that. From this comparison, the PD-EWT shows very promising results in allocating and combating the high peaks. The achieved performance improvement for allocating and combating the effect of the PAPR is between 8 % and 81 % under the limitations that have been taken into consideration. Moreover, the BER for the best scenario has been reduced to 36 × 10^−2 from the 48 × 10^−2 achieved in our previous work.

Bibliography
[1] B. Lu, X. Wang and K. Narayanan. LDPC-based space-time coded OFDM systems over correlated fading channels. IEEE Trans. on Communications, 50:74–88, 2002.
[2] S. ten Brink, G. Kramer and A. Ashikhmin. Design of low-density parity check codes for multi-antennas modulation and detection. IEEE Trans. on Communications, 52(4):670–678, 2004.
[3] J.G.-Rodriguez, A. Drygajlo, D.R.-Castro, M.G.-Gomar and J. O.-Garcia. Robust estimation, interpretation and assessment of likelihood ratios in forensic speaker recognition. Computer Speech and Language, 20:331–355, 2006.
[4] M. Juntt, M. Vehkapera, J. Leinonen, V. Zexian, D. Tujkovic, S. Tsumura and S. Hara. MIMO MC-CDMA Communications for Future Cellular Systems. IEEE Communication Magazine, 43(2):118–124, 2006.
[5] A. Khaizuran and H. Zahir. Studies on DWT-OFDM and FFT-OFDM Systems. Int. Conf. on Communication, Computer and Power, :382–386, Muscat, Oman, February 15–18, 2009.
[6] S. Han and J. Lee. PAPR Reduction of OFDM Signals Using a Reduced Complexity PTS Technique. IEEE Signal Processing Letters, 11(11):887–890, 2004.
[7] A. Ben Aicha, F. Tilli and S. Ben Jebara. PAPR analysis and reduction in WPDM systems. 1st IEEE Int. Symp. on Control, Communications and Signal Processing, :315–318, Hammamet, Tunisia, March 21–24, 2004.
[8] O. Daoud, Q. Hamarsheh and W. Al-Sawalmeh. Enhancing the BER of MIMO-OFDM Systems for Speaker Verification. Int. Multi-Conf. on Systems, Signals & Devices, :1–6, Hammamet, Tunisia, March 18–21, 2013.
[9] M. Al-Akaidi, O. Daoud and S. Linfoot. A new Turbo Coding Approach to reduce the Peak-to-Average Power Ratio of a Multi-Antenna-OFDM. Int. J. of Mobile Communications, 5(3):357–369, 2007.
[10] J. Tellado. Multicarrier Modulation with Low PAR: Applications to DSL and Wireless. Kluwer Academic Publishers, New York, 2002.
[11] X. Zheng, M. Sun and X. Tian. Wavelet Entropy Analysis of Neural Spike Train. IEEE Congress on Image and Signal Processing, :225–227, Sanya, Hainan, China, May 27–30, 2008.
[12] R. Gray. Entropy and Information Theory. Springer, New York, 1990.
[13] A. Damati, O. Daoud and Q. Hamarsheh. Wavelet Transform Basis to Detect the Odd Peaks. Int. Multi-Conf. on Systems, Signals & Devices, Barcelona, Spain, February 11–14, 2014.

Biographies

Ahlam A. Damati received her MSc in Electrical Engineering/Communications from the University of Jordan, Jordan, in 2002. She has worked as a lecturer in different universities since 2003 (UJ and GJU), and joined Philadelphia University in 2014 as a lecturer in the Electrical Engineering department. Her research interests are in achieving Quality of Service for new wireless technologies based on a variety of digital signal processing techniques.

Omar R. Daoud received his PhD in Communication and Electronics Engineering from DMU, UK, in 2006. He joined Philadelphia University in 2007 as an assistant professor. His current work is about achieving Quality of Service for the 4th generation of wireless and mobile communication systems by combining the advantages of OFDM and multiple antenna technology. He is the Assistant Dean of the Faculty of Engineering, in addition to being Head of the Communications and Electronics Engineering department; in March 2012 he was promoted to the rank of associate professor.

Qadri J. Hamarsheh received the Master's degree in Computer Machines, Systems and Networks from the Department of Computer Engineering, Lviv Polytechnic Institute, in 1991. He obtained his PhD degree from Lviv National University Lvivska Polytechnica, Ukraine, in 2001. Currently he is working as an assistant professor of Computer Engineering at Philadelphia University, Jordan, where he has been teaching since 2001. His areas of interest include digital signal processing (DSP), digital image and speech processing, object-oriented technology and programming languages, Internet technology and wireless programming.

C. Schultz and P. Hillger

Methodology for Analysis of Direct Sampling Receiver Architectures from Signal Processing and System Perspective

Abstract: Several works have been published that describe implementations of a certain class of direct sampling receiver architectures. A signal processing representation of this class is presented that enables a design space exploration at an early stage of the system definition. This signal processing implementation adds to the common description the parasitic components that result from the real-world nature of a circuit realization. The design space parameters are mapped back to system cost functions to support the decision process. Finally, the results are applied to a circuit design in a 65 nm technology. The presented methodology therefore enables a full system level assessment taking performance and cost into consideration before the design is started.

Keywords: Sampling Receiver, MTDSM, Software Defined Radio.

1 Introduction

The development of modern mobile communication systems demands an ever increasing number of different standards and bands to be supported by a single transceiver System on Chip (SoC). As classical analog RF architectures are highly optimized to receive a specific band with a certain bandwidth under certain environmental constraints, they are typically of limited flexibility. To optimize the required area for the receiver it is desirable to share as many blocks as possible. One approach to tackle this challenge was presented by Muhammad [1] in 2004 for Bluetooth and by Ho [2] in 2006 for GSM. The central idea behind these publications is to replace the mixer by a sampler and to exchange the analog anti-aliasing filter for a switched-capacitor filter structure running at an RF frequency. Karvonen's work presents a similar structure at an earlier time, but designed for a lower input frequency band: his publication in 2000 [3] was still designed to receive a 45 MHz intermediate frequency signal only, and in 2001 he published a detailed noise analysis of this structure [4]. Four years later he presented an extended structure that already allows sampling of a 100 MHz signal, which pushes the structure into the FM radio frequency band [5]. In the same year he presented a programmable FIR filter [6], proposing to realize this programmability by a configurable transconductance amplifier.

C. Schultz and P. Hillger: Intel Deutschland GmbH, Germany, emails: [email protected], [email protected] De Gruyter Oldenbourg, ASSD – Advances in Systems, Signals and Devices, Volume 8, 2018, pp. 175–196. https://doi.org/10.1515/9783110470383-012

176 | C. Schultz and P. Hillger

ance amplifier. This is only possible because he skips the history capacitance (section 2). The mentioned publications can be found summarized in his Ph. D.-thesis [7]. Jakonis [8][9], Andersson [10], Ru [11], Huang [12] and Jiangtao [13] presented further circuits based on the sampling receiver principle as it will be described further in this publication. These publications prove the feasibility of this architecture for certain use-cases. This publication will present a meta-analysis of all those structures and will transfer the circuit class into a signal processing description that enables a design space exploration. As final step this article will close the circle and will map the parameters used in the design space exploration back to real world quantities. The presented models can be used for an early feasibility study, as they allow to map system requirements to explicit circuit requirements and at the same gives indicators about the expected cost of the solution. Certain sub-blocks are already analyzed in different publications. In [14] and [15] the input sampling methods are described and analyzed. The publication [16] points to a polyphase model. A similar approach is used in [17] to model a complex dual-band receiver architecture for GNSS. This article will start in Section 2 with a detailed description of the class of sampling receivers and a mapping of the different publications onto this defined class. In section 3 each sub-block is presented and analyzed. Section 4 will close the circle and map back the design space representation to the real world circuit. Section 5 finally gives a detailed example of a use case that has been analyzed using the presented theory including simulation results.

2 Class of sampling receivers The class of sampling receivers that shall be described and the publications referred in section 1 share the following features: The initial down-conversion in the receive path is realized by a sampler instead of a mixer. From signal-processing perspective a modern passive mixer is very comparable to a sampler, with the main difference, that a sampler blocks the input from the internal node, while the mixer presents a continuous electrical connection. The internal node is the input node of a switched-capacitor block that is used for filtering and decimation. All presented systems can be mapped on the block diagram as depicted in Fig. 1. This is the main motivation of building a super-set or class of all those circuits, and analyzing this superset, instead of each realization by itself. When the class is analyzed and understood as a whole, all possible realizations can easily be understood, easily compared and the features can be predicted. A mapping of the referred to the key parameters of the discussed sampling receiver class is shown in Tab. 1.

Methodology for Analysis of Direct Sampling Receiver Architectures

| 177

Sample-Switch M-times

fs,1

Output-Cap

lin Vin Ch,1i

Cr,11

History-Cap

Cr,1M

Bank

Vout

Path

Fig. 1. Generalized Sampling Receiver Topology.

Tab. 1. Mapping of referred publications to sampling receiver class. Pub. [1] [2] [5] [9] [10] [11] [12] [13] [17]

f s [MHz]

f c /f s

Decimation Factor

Sampling

Paths

IIR Filter

FIR-Taps

2400 869 200 1072 2400