277 80 78MB
English Pages XIX, 707 [725] Year 2020
LNCS 12137
Valeria V. Krzhizhanovskaya · Gábor Závodszky · Michael H. Lees · Jack J. Dongarra · Peter M. A. Sloot · Sérgio Brissos · João Teixeira (Eds.)
Computational Science – ICCS 2020 20th International Conference Amsterdam, The Netherlands, June 3–5, 2020 Proceedings, Part I
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Gerhard Woeginger RWTH Aachen, Aachen, Germany Moti Yung Columbia University, New York, NY, USA
12137
More information about this series at http://www.springer.com/series/7407
Valeria V. Krzhizhanovskaya Gábor Závodszky Michael H. Lees Jack J. Dongarra Peter M. A. Sloot Sérgio Brissos João Teixeira (Eds.) •
•
•
• •
•
Computational Science – ICCS 2020 20th International Conference Amsterdam, The Netherlands, June 3–5, 2020 Proceedings, Part I
123
Editors Valeria V. Krzhizhanovskaya University of Amsterdam Amsterdam, The Netherlands
Gábor Závodszky University of Amsterdam Amsterdam, The Netherlands
Michael H. Lees University of Amsterdam Amsterdam, The Netherlands
Jack J. Dongarra University of Tennessee Knoxville, TN, USA
Peter M. A. Sloot University of Amsterdam Amsterdam, The Netherlands
Sérgio Brissos Intellegibilis Setúbal, Portugal
ITMO University Saint Petersburg, Russia Nanyang Technological University Singapore, Singapore João Teixeira Intellegibilis Setúbal, Portugal
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-50370-3 ISBN 978-3-030-50371-0 (eBook) https://doi.org/10.1007/978-3-030-50371-0 LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Twenty Years of Computational Science Welcome to the 20th Annual International Conference on Computational Science (ICCS – https://www.iccs-meeting.org/iccs2020/). During the preparation for this 20th edition of ICCS we were considering all kinds of nice ways to celebrate two decennia of computational science. Afterall when we started this international conference series, we never expected it to be so successful and running for so long at so many different locations across the globe! So we worked on a mind-blowing line up of renowned keynotes, music by scientists, awards, a play written by and performed by computational scientists, press attendance, a lovely venue… you name it, we had it all in place. Then corona hit us. After many long debates and considerations, we decided to cancel the physical event but still support our scientists and allow for publication of their accepted peer-reviewed work. We are proud to present the proceedings you are reading as a result of that. ICCS 2020 is jointly organized by the University of Amsterdam, NTU Singapore, and the University of Tennessee. The International Conference on Computational Science is an annual conference that brings together researchers and scientists from mathematics and computer science as basic computing disciplines, as well as researchers from various application areas who are pioneering computational methods in sciences such as physics, chemistry, life sciences, engineering, arts and humanitarian fields, to discuss problems and solutions in the area, to identify new issues, and to shape future directions for research. Since its inception in 2001, ICCS has attracted increasingly higher quality and numbers of attendees and papers, and 2020 was no exception, with over 350 papers accepted for publication. The proceedings series have become a major intellectual resource for computational science researchers, defining and advancing the state of the art in this field. The theme for ICCS 2020, “Twenty Years of Computational Science”, highlights the role of Computational Science over the last 20 years, its numerous achievements, and its future challenges. This conference was a unique event focusing on recent developments in: scalable scientific algorithms, advanced software tools, computational grids, advanced numerical methods, and novel application areas. These innovative novel models, algorithms, and tools drive new science through efficient application in areas such as physical systems, computational and systems biology, environmental systems, finance, and others. This year we had 719 submissions (230 submissions to the main track and 489 to the thematic tracks). In the main track, 101 full papers were accepted (44%). In the thematic tracks, 249 full papers were accepted (51%). A high acceptance rate in the thematic tracks is explained by the nature of these, where many experts in a particular field are personally invited by track organizers to participate in their sessions.
vi
Preface
ICCS relies strongly on the vital contributions of our thematic track organizers to attract high-quality papers in many subject areas. We would like to thank all committee members from the main and thematic tracks for their contribution to ensure a high standard for the accepted papers. We would also like to thank Springer, Elsevier, the Informatics Institute of the University of Amsterdam, the Institute for Advanced Study of the University of Amsterdam, the SURFsara Supercomputing Centre, the Netherlands eScience Center, the VECMA Project, and Intellegibilis for their support. Finally, we very much appreciate all the Local Organizing Committee members for their hard work to prepare this conference. We are proud to note that ICCS is an A-rank conference in the CORE classification. We wish you good health in these troubled times and hope to see you next year for ICCS 2021. June 2020
Valeria V. Krzhizhanovskaya Gábor Závodszky Michael Lees Jack Dongarra Peter M. A. Sloot Sérgio Brissos João Teixeira
Organization
Thematic Tracks and Organizers Advances in High-Performance Computational Earth Sciences: Applications and Frameworks – IHPCES Takashi Shimokawabe Kohei Fujita Dominik Bartuschat Agent-Based Simulations, Adaptive Algorithms and Solvers – ABS-AAS Maciej Paszynski David Pardo Victor Calo Robert Schaefer Quanling Deng Applications of Computational Methods in Artificial Intelligence and Machine Learning – ACMAIML Kourosh Modarresi Raja Velu Paul Hofmann Biomedical and Bioinformatics Challenges for Computer Science – BBC Mario Cannataro Giuseppe Agapito Mauro Castelli Riccardo Dondi Rodrigo Weber dos Santos Italo Zoppis Classifier Learning from Difficult Data – CLD2 Michał Woźniak Bartosz Krawczyk Paweł Ksieniewicz Complex Social Systems through the Lens of Computational Science – CSOC Debraj Roy Michael Lees Tatiana Filatova
viii
Organization
Computational Health – CompHealth Sergey Kovalchuk Stefan Thurner Georgiy Bobashev Computational Methods for Emerging Problems in (dis-)Information Analysis – DisA Michal Choras Konstantinos Demestichas Computational Optimization, Modelling and Simulation – COMS Xin-She Yang Slawomir Koziel Leifur Leifsson Computational Science in IoT and Smart Systems – IoTSS Vaidy Sunderam Dariusz Mrozek Computer Graphics, Image Processing and Artificial Intelligence – CGIPAI Andres Iglesias Lihua You Alexander Malyshev Hassan Ugail Data-Driven Computational Sciences – DDCS Craig C. Douglas Ana Cortes Hiroshi Fujiwara Robert Lodder Abani Patra Han Yu Machine Learning and Data Assimilation for Dynamical Systems – MLDADS Rossella Arcucci Yi-Ke Guo Meshfree Methods in Computational Sciences – MESHFREE Vaclav Skala Samsul Ariffin Abdul Karim Marco Evangelos Biancolini Robert Schaback
Organization
Rongjiang Pan Edward J. Kansa Multiscale Modelling and Simulation – MMS Derek Groen Stefano Casarin Alfons Hoekstra Bartosz Bosak Diana Suleimenova Quantum Computing Workshop – QCW Katarzyna Rycerz Marian Bubak Simulations of Flow and Transport: Modeling, Algorithms and Computation – SOFTMAC Shuyu Sun Jingfa Li James Liu Smart Systems: Bringing Together Computer Vision, Sensor Networks and Machine Learning – SmartSys Pedro J. S. Cardoso João M. F. Rodrigues Roberto Lam Janio Monteiro Software Engineering for Computational Science – SE4Science Jeffrey Carver Neil Chue Hong Carlos Martinez-Ortiz Solving Problems with Uncertainties – SPU Vassil Alexandrov Aneta Karaivanova Teaching Computational Science – WTCS Angela Shiflet Alfredo Tirado-Ramos Evguenia Alexandrova
ix
x
Organization
Uncertainty Quantification for Computational Models – UNEQUIvOCAL Wouter Edeling Anna Nikishova Peter Coveney
Program Committee and Reviewers Ahmad Abdelfattah Samsul Ariffin Abdul Karim Evgenia Adamopoulou Jaime Afonso Martins Giuseppe Agapito Ram Akella Elisabete Alberdi Celaya Luis Alexandre Vassil Alexandrov Evguenia Alexandrova Hesham H. Ali Julen Alvarez-Aramberri Domingos Alves Julio Amador Diaz Lopez Stanislaw Ambroszkiewicz Tomasz Andrysiak Michael Antolovich Hartwig Anzt Hideo Aochi Hamid Arabnejad Rossella Arcucci Khurshid Asghar Marina Balakhontceva Bartosz Balis Krzysztof Banas João Barroso Dominik Bartuschat Nuno Basurto Pouria Behnoudfar Joern Behrens Adrian Bekasiewicz Gebrai Bekdas Stefano Beretta Benjamin Berkels Martino Bernard
Daniel Berrar Sanjukta Bhowmick Marco Evangelos Biancolini Georgiy Bobashev Bartosz Bosak Marian Bubak Jérémy Buisson Robert Burduk Michael Burkhart Allah Bux Aleksander Byrski Cristiano Cabrita Xing Cai Barbara Calabrese Jose Camata Mario Cannataro Alberto Cano Pedro Jorge Sequeira Cardoso Jeffrey Carver Stefano Casarin Manuel Castañón-Puga Mauro Castelli Eduardo Cesar Nicholas Chancellor Patrikakis Charalampos Ehtzaz Chaudhry Chuanfa Chen Siew Ann Cheong Andrey Chernykh Lock-Yue Chew Su Fong Chien Marta Chinnici Sung-Bae Cho Michal Choras Loo Chu Kiong
Neil Chue Hong Svetlana Chuprina Paola Cinnella Noélia Correia Adriano Cortes Ana Cortes Enrique Costa-Montenegro David Coster Helene Coullon Peter Coveney Attila Csikasz-Nagy Loïc Cudennec Javier Cuenca Yifeng Cui António Cunha Ben Czaja Pawel Czarnul Flávio Martins Bhaskar Dasgupta Konstantinos Demestichas Quanling Deng Nilanjan Dey Khaldoon Dhou Jamie Diner Jacek Dlugopolski Simona Domesová Riccardo Dondi Craig C. Douglas Linda Douw Rafal Drezewski Hans du Buf Vitor Duarte Richard Dwight Wouter Edeling Waleed Ejaz Dina El-Reedy
Organization
Amgad Elsayed Nahid Emad Chriatian Engelmann Gökhan Ertaylan Alex Fedoseyev Luis Manuel Fernández Antonino Fiannaca Christos Filelis-Papadopoulos Rupert Ford Piotr Frackiewicz Martin Frank Ruy Freitas Reis Karl Frinkle Haibin Fu Kohei Fujita Hiroshi Fujiwara Takeshi Fukaya Wlodzimierz Funika Takashi Furumura Ernst Fusch Mohamed Gaber David Gal Marco Gallieri Teresa Galvao Akemi Galvez Salvador García Bartlomiej Gardas Delia Garijo Frédéric Gava Piotr Gawron Bernhard Geiger Alex Gerbessiotis Ivo Goncalves Antonio Gonzalez Pardo Jorge González-Domínguez Yuriy Gorbachev Pawel Gorecki Michael Gowanlock Manuel Grana George Gravvanis Derek Groen Lutz Gross Sophia Grundner-Culemann
Pedro Guerreiro Tobias Guggemos Xiaohu Guo Piotr Gurgul Filip Guzy Pietro Hiram Guzzi Zulfiqar Habib Panagiotis Hadjidoukas Masatoshi Hanai John Hanley Erik Hanson Habibollah Haron Carina Haupt Claire Heaney Alexander Heinecke Jurjen Rienk Helmus Álvaro Herrero Bogumila Hnatkowska Maximilian Höb Erlend Hodneland Olivier Hoenen Paul Hofmann Che-Lun Hung Andres Iglesias Takeshi Iwashita Alireza Jahani Momin Jamil Vytautas Jancauskas João Janeiro Peter Janku Fredrik Jansson Jirí Jaroš Caroline Jay Shalu Jhanwar Zhigang Jia Chao Jin Zhong Jin David Johnson Guido Juckeland Maria Juliano Edward J. Kansa Aneta Karaivanova Takahiro Katagiri Timo Kehrer Wayne Kelly Christoph Kessler
xi
Jakub Klikowski Harald Koestler Ivana Kolingerova Georgy Kopanitsa Gregor Kosec Sotiris Kotsiantis Ilias Kotsireas Sergey Kovalchuk Michal Koziarski Slawomir Koziel Rafal Kozik Bartosz Krawczyk Elisabeth Krueger Valeria Krzhizhanovskaya Pawel Ksieniewicz Marek Kubalcík Sebastian Kuckuk Eileen Kuehn Michael Kuhn Michal Kulczewski Krzysztof Kurowski Massimo La Rosa Yu-Kun Lai Jalal Lakhlili Roberto Lam Anna-Lena Lamprecht Rubin Landau Johannes Langguth Elisabeth Larsson Michael Lees Leifur Leifsson Kenneth Leiter Roy Lettieri Andrew Lewis Jingfa Li Khang-Jie Liew Hong Liu Hui Liu Yen-Chen Liu Zhao Liu Pengcheng Liu James Liu Marcelo Lobosco Robert Lodder Marcin Los Stephane Louise
xii
Organization
Frederic Loulergue Paul Lu Stefan Luding Onnie Luk Scott MacLachlan Luca Magri Imran Mahmood Zuzana Majdisova Alexander Malyshev Muazzam Maqsood Livia Marcellino Tomas Margalef Tiziana Margaria Svetozar Margenov Urszula Markowska-Kaczmar Osni Marques Carmen Marquez Carlos Martinez-Ortiz Paula Martins Flávio Martins Luke Mason Pawel Matuszyk Valerie Maxville Wagner Meira Jr. Roderick Melnik Valentin Melnikov Ivan Merelli Choras Michal Leandro Minku Jaroslaw Miszczak Janio Monteiro Kourosh Modarresi Fernando Monteiro James Montgomery Andrew Moore Dariusz Mrozek Peter Mueller Khan Muhammad Judit Muñoz Philip Nadler Hiromichi Nagao Jethro Nagawkar Kengo Nakajima Ionel Michael Navon Philipp Neumann
Mai Nguyen Hoang Nguyen Nancy Nichols Anna Nikishova Hitoshi Nishizawa Brayton Noll Algirdas Noreika Enrique Onieva Kenji Ono Eneko Osaba Aziz Ouaarab Serban Ovidiu Raymond Padmos Wojciech Palacz Ivan Palomares Rongjiang Pan Joao Papa Nikela Papadopoulou Marcin Paprzycki David Pardo Anna Paszynska Maciej Paszynski Abani Patra Dana Petcu Serge Petiton Bernhard Pfahringer Frank Phillipson Juan C. Pichel Anna Pietrenko-Dabrowska Laércio L. Pilla Armando Pinho Tomasz Piontek Yuri Pirola Igor Podolak Cristina Portales Simon Portegies Zwart Roland Potthast Ela Pustulka-Hunt Vladimir Puzyrev Alexander Pyayt Rick Quax Cesar Quilodran Casas Barbara Quintela Ajaykumar Rajasekharan Celia Ramos
Lukasz Rauch Vishal Raul Robin Richardson Heike Riel Sophie Robert Luis M. Rocha Joao Rodrigues Daniel Rodriguez Albert Romkes Debraj Roy Katarzyna Rycerz Alberto Sanchez Gabriele Santin Alex Savio Robert Schaback Robert Schaefer Rafal Scherer Ulf D. Schiller Bertil Schmidt Martin Schreiber Alexander Schug Gabriela Schütz Marinella Sciortino Diego Sevilla Angela Shiflet Takashi Shimokawabe Marcin Sieniek Nazareen Sikkandar Basha Anna Sikora Janaína De Andrade Silva Diana Sima Robert Sinkovits Haozhen Situ Leszek Siwik Vaclav Skala Peter Sloot Renata Slota Grazyna Slusarczyk Sucha Smanchat Marek Smieja Maciej Smolka Bartlomiej Sniezynski Isabel Sofia Brito Katarzyna Stapor Bogdan Staszewski
Organization
Jerzy Stefanowski Dennis Stevenson Tomasz Stopa Achim Streit Barbara Strug Pawel Strumillo Dante Suarez Vishwas H. V. Subba Rao Bongwon Suh Diana Suleimenova Ray Sun Shuyu Sun Vaidy Sunderam Martin Swain Alessandro Taberna Ryszard Tadeusiewicz Daisuke Takahashi Zaid Tashman Osamu Tatebe Carlos Tavares Calafate Kasim Tersic Yonatan Afework Tesfahunegn Jannis Teunissen Stefan Thurner
Nestor Tiglao Alfredo Tirado-Ramos Arkadiusz Tomczyk Mariusz Topolski Paolo Trunfio Ka-Wai Tsang Hassan Ugail Eirik Valseth Pavel Varacha Pierangelo Veltri Raja Velu Colin Venters Gytis Vilutis Peng Wang Jianwu Wang Shuangbu Wang Rodrigo Weber dos Santos Katarzyna Wegrzyn-Wolska Mei Wen Lars Wienbrandt Mark Wijzenbroek Peter Woehrmann Szymon Wojciechowski
Maciej Woloszyn Michal Wozniak Maciej Wozniak Yu Xia Dunhui Xiao Huilin Xing Miguel Xochicale Feng Xu Wei Xue Yoshifumi Yamamoto Dongjia Yan Xin-She Yang Dongwei Ye Wee Ping Yeo Lihua You Han Yu Gábor Závodszky Yao Zhang H. Zhang Jinghui Zhong Sotirios Ziavras Italo Zoppis Chiara Zucco Pawel Zyblewski Karol Zyczkowski
xiii
Contents – Part I
ICCS Main Track An Efficient New Static Scheduling Heuristic for Accelerated Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas McSweeney, Neil Walton, and Mawussi Zounon
3
Multilevel Parallel Computations for Solving Multistage Multicriteria Optimization Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Victor Gergel and Evgeny Kozinov
17
Enabling Hardware Affinity in JVM-Based Applications: A Case Study for Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto R. Expósito, Jorge Veiga, and Juan Touriño
31
An Optimizing Multi-platform Source-to-source Compiler Framework for the NEURON MODeling Language . . . . . . . . . . . . . . . . . . . . . . . . . . . Pramod Kumbhar, Omar Awile, Liam Keegan, Jorge Blanco Alonso, James King, Michael Hines, and Felix Schürmann
45
Parallel Numerical Solution of a 2D Chemotaxis-Stokes System on GPUs Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raffaele D’Ambrosio, Stefano Di Giovacchino, and Donato Pera
59
Automatic Management of Cloud Applications with Use of Proximal Policy Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Włodzimierz Funika, Paweł Koperek, and Jacek Kitowski
73
Utilizing GPU Performance Counters to Characterize GPU Kernels via Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bob Zigon and Fengguang Song
88
A Massively Parallel Algorithm for the Three-Dimensional Navier-Stokes-Boussinesq Simulations of the Atmospheric Phenomena . . . . . Maciej Paszyński, Leszek Siwik, Krzysztof Podsiadło, and Peter Minev
102
Reconstruction of Low Energy Neutrino Events with GPUs at IceCube . . . . . Maicon Hieronymus, Bertil Schmidt, and Sebastian Böser
118
Cache-Aware Matrix Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dominik Huber, Martin Schreiber, Dai Yang, and Martin Schulz
132
xvi
Contents – Part I
QEScalor: Quantitative Elastic Scaling Framework in Distributed Streaming Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weimin Mu, Zongze Jin, Weilin Zhu, Fan Liu, Zhenzhen Li, Ziyuan Zhu, and Weiping Wang From Conditional Independence to Parallel Execution in Hierarchical Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Balazs Nemeth, Tom Haber, Jori Liesenborgs, and Wim Lamotte Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damodar Sahasrabudhe and Martin Berzins Analysis of Checkpoint I/O Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco, Dolores Rexachs, and Emilio Luque
147
161
175 191
Enabling EASEY Deployment of Containerized Applications for Future HPC Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maximilian Höb and Dieter Kranzlmüller
206
Reproducibility of Computational Experiments on Kubernetes-Managed Container Clouds with HyperFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michał Orzechowski, Bartosz Baliś, Renata G. Słota, and Jacek Kitowski
220
GPU-Accelerated RDP Algorithm for Data Segmentation . . . . . . . . . . . . . . . Pau Cebrian and Juan Carlos Moure
234
Sparse Matrix-Based HPC Tomography . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefano Marchesini, Anuradha Trivedi, Pablo Enfedaque, Talita Perciano, and Dilworth Parkinson
248
heFFTe: Highly Efficient FFT for Exascale . . . . . . . . . . . . . . . . . . . . . . . . Alan Ayala, Stanimire Tomov, Azzam Haidar, and Jack Dongarra
262
Scalable Workflow-Driven Hydrologic Analysis in HydroFrame . . . . . . . . . . Shweta Purawat, Cathie Olschanowsky, Laura E. Condon, Reed Maxwell, and Ilkay Altintas
276
Patient-Specific Cardiac Parametrization from Eikonal Simulations . . . . . . . . Daniel Ganellari, Gundolf Haase, Gerhard Zumbusch, Johannes Lotz, Patrick Peltzer, Klaus Leppkes, and Uwe Naumann
290
An Empirical Analysis of Predictors for Workload Estimation in Healthcare . . . Roberto Gatta, Mauro Vallati, Ilenia Pirola, Jacopo Lenkowicz, Luca Tagliaferri, Carlo Cappelli, and Maurizio Castellano
304
Contents – Part I
How You Say or What You Say? Neural Activity in Message Credibility Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Łukasz Kwaśniewicz, Grzegorz M. Wójcik, Andrzej Kawiak, Piotr Schneider, and Adam Wierzbicki Look Who’s Talking: Modeling Decision Making Based on Source Credibility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Kawiak, Grzegorz M. Wójcik, Lukasz Kwasniewicz, Piotr Schneider, and Adam Wierzbicki An Adaptive Network Model for Burnout and Dreaming . . . . . . . . . . . . . . . Mathijs Maijer, Esra Solak, and Jan Treur Computational Analysis of the Adaptive Causal Relationships Between Cannabis, Anxiety and Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Merijn van Leeuwen, Kirsten Wolthuis, and Jan Treur Detecting Critical Transitions in the Human Innate Immune System Post-cardiac Surgery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alva Presbitero, Rick Quax, Valeria V. Krzhizhanovskaya, and Peter M. A. Sloot Using Individual-Based Models to Look Beyond the Horizon: The Changing Effects of Household-Based Clustering of Susceptibility to Measles in the Next 20 Years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elise Kuylen, Jori Liesenborgs, Jan Broeckhove, and Niel Hens Modelling the Effects of Antibiotics on Gut Flora Using a Nonlinear Compartment Model with Uncertain Parameters . . . . . . . . . . . . . . . . . . . . . Thulasi Jegatheesan and Hermann J. Eberl
xvii
312
327
342
357
371
385
399
Stochastic Volatility and Early Warning Indicator . . . . . . . . . . . . . . . . . . . . Guseon Ji, Hyeongwoo Kong, Woo Chang Kim, and Kwangwon Ahn
413
Boost and Burst: Bubbles in the Bitcoin Market . . . . . . . . . . . . . . . . . . . . . Nam-Kyoung Lee, Eojin Yi, and Kwangwon Ahn
422
Estimation of Tipping Points for Critical and Transitional Regimes in the Evolution of Complex Interbank Network . . . . . . . . . . . . . . . . . . . . . Valentina Y. Guleva
432
Modeling of Fire Spread Including Different Heat Transfer Mechanisms Using Cellular Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jarosław Wąs, Artur Karp, Szymon Łukasik, and Dariusz Pałka
445
Narrow Passage Problem Solution for Motion Planning . . . . . . . . . . . . . . . . Jakub Szkandera, Ivana Kolingerová, and Martin Maňák
459
xviii
Contents – Part I
Fault Injection, Detection and Treatment in Simulated Autonomous Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Garrido, Leonardo Ferreira, João Jacob, and Daniel Castro Silva
471
Using Cellular Automata to Model High Density Pedestrian Dynamics . . . . . Grzegorz Bazior, Dariusz Pałka, and Jarosław Wąs
486
Autonomous Vehicles as Local Traffic Optimizers . . . . . . . . . . . . . . . . . . . Ashna Bhatia, Jordan Ivanchev, David Eckhoff, and Alois Knoll
499
Modeling Helping Behavior in Emergency Evacuations Using Volunteer’s Dilemma Game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jaeyoung Kwak, Michael H. Lees, Wentong Cai, and Marcus E. H. Ong
513
Learning Mixed Traffic Signatures in Shared Networks . . . . . . . . . . . . . . . . Hamidreza Anvari and Paul Lu
524
A Novel Metric to Evaluate In Situ Workflows. . . . . . . . . . . . . . . . . . . . . . Tu Mai Anh Do, Loïc Pottier, Stephen Thomas, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman
538
Social Recommendation in Heterogeneous Evolving Relation Network . . . . . Bo Jiang, Zhigang Lu, Yuling Liu, Ning Li, and Zelin Cui
554
DDNE: Discriminative Distance Metric Learning for Network Embedding . . . Xiaoxue Li, Yangxi Li, Yanmin Shang, Lingling Tong, Fang Fang, Pengfei Yin, Jie Cheng, and Jing Li
568
Extracting Backbone Structure of a Road Network from Raw Data . . . . . . . . Hoai Nguyen Huynh and Roshini Selvakumar
582
Look Deep into the New Deep Network: A Measurement Study on the ZeroNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Siyuan Wang, Yue Gao, Jinqiao Shi, Xuebin Wang, Can Zhao, and Zelin Yin
595
Identifying Influential Spreaders On a Weighted Network Using HookeRank Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjay Kumar, Nipun Aggarwal, and B. S. Panda
609
Community Aware Models of Meme Spreading in Micro-blog Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mikołaj Kromka, Wojciech Czech, and Witold Dzwinel
623
Contents – Part I
A Dynamic Vote-Rank Based Approach for Effective Sequential Initialization of Information Spreading Processes Within Complex Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patryk Pazura, Kamil Bortko, Jarosław Jankowski, and Radosław Michalski On the Planarity of Validated Complexes of Model Organisms in Protein-Protein Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . Kathryn Cooper, Nathan Cornelius, William Gasper, Sanjukta Bhowmick, and Hesham Ali Towards Modeling of Information Processing Within Business-Processes of Service-Providing Organizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sergey V. Kovalchuk, Anastasia A. Funkner, Ksenia Y. Balabaeva, Ilya V. Derevitskii, Vladimir V. Fonin, and Nikita V. Bukhanov A Probabilistic Infection Model for Efficient Trace-Prediction of Disease Outbreaks in Contact Networks . . . . . . . . . . . . . . . . . . . . . . . . . Willian Qian, Sanjukta Bhowmick, Marty O’Neill, Susie Ramisetty-Mikler, and Armin R. Mikler
xix
638
652
667
676
Eigen-AD: Algorithmic Differentiation of the Eigen Library . . . . . . . . . . . . . Patrick Peltzer, Johannes Lotz, and Uwe Naumann
690
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
705
ICCS Main Track
An Efficient New Static Scheduling Heuristic for Accelerated Architectures Thomas McSweeney1(B) , Neil Walton1 , and Mawussi Zounon1,2
2
1 University of Manchester, Manchester, UK [email protected] The Numerical Algorithms Group (NAG), Manchester, UK
Abstract. Heterogeneous architectures that use Graphics Processing Units (GPUs) for general computations, in addition to multicore CPUs, are increasingly common in high-performance computing. However many of the existing methods for scheduling precedence-constrained tasks on such platforms were intended for more diversely heterogeneous clusters, such as the classic Heterogeneous Earliest Finish Time (HEFT) heuristic. We propose a new static scheduling heuristic called Heterogeneous Optimistic Finish Time (HOFT) which exploits the binary heterogeneity of accelerated platforms. Through extensive experimentation with custom software for simulating task scheduling problems on user-defined CPU-GPU platforms, we show that HOFT can obtain schedules at least 5% shorter than HEFT’s for medium-to-large numerical linear algebra application task graphs and around 3% shorter on average for a large collection of randomly-generated graphs. Keywords: High-Performance Computing · GPU computing · Scheduling · Precedence constraints · Directed Acyclic Graphs
1
Introduction
Modern High-Performance Computing (HPC) machines typically comprise hundreds or even thousands of networked nodes. These nodes are increasingly likely to be heterogeneous, hosting one or more powerful accelerators—usually GPUs— in addition to multicore CPUs. For example, Summit, which currently heads the Top5001 list of the world’s fastest supercomputers, comprises over 4000 nodes, each with two 22-core IBM Power9 CPUs and six NVIDIA Tesla V100 GPUs. Task-based parallel programming is a paradigm that aims to harness this processor heterogeneity. Here a program is described as a collection of tasks— logically discrete atomic units of work—with precedence constraints that define the order in which they can be executed. This can be expressed in the form of a graph, where each vertex represents a task and edges the precedence constraints 1
https://www.top500.org/.
Supported by the Engineering and Physical Sciences Research Council (EPSRC). c Springer Nature Switzerland AG 2020 V. V. Krzhizhanovskaya et al. (Eds.): ICCS 2020, LNCS 12137, pp. 3–16, 2020. https://doi.org/10.1007/978-3-030-50371-0_1
4
T. McSweeney et al.
between them. We are interested only in the case when such task graphs are Directed Acyclic Graphs (DAGs)—directed and without any cycles. The immediate question is, how do we find the optimal way to assign the tasks to a set of heterogeneous processing resources while still respecting the precedence constraints? In other words, what schedule should we follow? This DAG scheduling problem is known to be NP-complete, even for homogeneous processors [17], so typically we must rely on heuristic algorithms that give us reasonably good solutions in a reasonable time. A fundamental distinction is made between static and dynamic scheduling. Static schedules are fixed before execution based on the information available at that time, whereas dynamic schedules are determined during runtime. There are generic advantages and disadvantages to both: static scheduling makes greater use of the data so is superior when it is sufficiently accurate, whereas dynamic scheduling uses more recent data. In practice task scheduling is usually handled by a runtime system, such as OmpSs [11], PaRSEC [6], or StarPU [4]. Most such systems use previous execution traces to predict task execution and data transfer times at runtime. On a single machine the latter is tricky because of shared buses and the possibility of asynchronous data transfers. Hence at present dynamic scheduling is typically preferred. However static schedules can be surprisingly robust, even when estimates are poor [1]. Furthermore, robustness can be improved using timing distribution information [18]. In addition, superior performance can be achieved in dynamic environments by modifying an existing static schedule, rather than computing a new one from scratch [1]. In this paper we therefore focus on the problem of finding good static schedules for multicore and GPU platforms. To facilitate this investigation, we developed an open-source software simulator which allows users to simulate the static scheduling of arbitrary task DAGs on arbitrary CPU-GPU platforms, without worrying about the time or energy usage constraints imposed by real systems. The popular HEFT scheduling heuristic comprises two phases: a task prioritization phase in which the order tasks are to be scheduled is determined and a processor selection phase in which they are actually assigned to the processing resources. In this article we introduce HOFT, which follows the HEFT framework but modifies both phases in order to exploit accelerated architectures in particular, without significantly increasing the complexity of the algorithm. HOFT works by first computing a table of optimistic estimates of the earliest possible times all tasks can be completed on both processor types and using this to guide both phases. Simulations with real and randomly-generated DAGs on both single and multiple GPU target platforms suggest that HOFT is always at least competitive with HEFT and frequently superior. Explicitly, the two main contributions of this paper are: 1. A new static scheduling heuristic that is optimized specifically for accelerated heterogeneous architectures; 2. Open-source simulation software that allows researchers to implement and evaluate their own scheduling algorithms for user-defined CPU-GPU platforms in a fast and reproducible manner.
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
5
The remainder of this paper is structured as follows. In Sect. 2 we summarize the relevant existing literature. Then in Sect. 3 we explicitly define the simulation model we use to study the static task scheduling problem. We describe HEFT in detail in Sect. 4, including also benchmarking results with our simulation model and a minor modification to the algorithm that we found performs well. In Sect. 5 we describe our new HOFT heuristic, before detailing the numerical experiments that we undertook to evaluate its performance in Sect. 6. Finally in Sect. 7 we state our conclusions from this investigation and outline future work that we believe may be useful.
2
Related Work
Broadly, static scheduling methods can be divided into three categories: mathematical programming, guided-random search and heuristics. The first is based on formulating the scheduling problem as a mathematical program; see, for example, Kumar’s constraint programming formulation in [14]. However solving these is usually so expensive that they are restricted to small task graphs. Guidedrandom search is a term used for any method that generates a large population of potential schedules and then selects the best among them. Typically these are more general optimization schemes such as genetic algorithms which are refined for the task scheduling problem. As a rule, such methods tend to find very high-quality schedules but take a long time to do so [7]. Heuristics are the most popular approach in practice as they are often competitive with the alternatives and considerably faster. In turn listing heuristics are the most popular kind. They follow a two-phase structure: an ordered list of all tasks is first constructed (task prioritization) and they are then scheduled in this order according to some rule (processor selection). HEFT is the most prominent example: all tasks are prioritized according to their upward rank and then scheduled on the processor expected to complete their execution at the earliest time; a fuller description is given in Sect. 4. Canon et al. [8] compared twenty different task scheduling heuristics and found that HEFT was almost always among the best in terms of both schedule makespan and robustness. Many modifications of HEFT have been proposed in the literature, such as HEFT with lookahead from Bittencourt, Sakellariou and Madeira [5], which has the same task prioritization phase but schedules all tasks on the resources estimated to minimize the completion time of their children. This has the effect of increasing the time complexity of the algorithm so, in an attempt to incorporate a degree of lookahead into the HEFT framework without increasing the cost, Arabnejad and Barbosa proposed Predict Earliest Finish Time (PEFT) [3]. The main innovation is that rather than just minimizing the completion time of a task during processor selection, we also try to minimize an optimistic estimate of the time it will take to execute the remaining unscheduled tasks in the DAG. Like the majority of existing methods for static DAG scheduling in heterogeneous computing, HEFT was originally intended for clusters with diverse nodes. At least one extension specifically targeting accelerated architectures has
6
T. McSweeney et al.
been proposed before, namely HEFT-NC (No Cross) from Shetti, Fahmy and Bretschneider [15], but our new HOFT heuristic differs from this in both the task prioritization and processor selection phases of the algorithm.
3
Simulation Model
In this paper we use a simulation model to study the static task scheduling problem for multicore and GPU. This simulator follows the mathematical model described in Sect. 3.1 and therefore facilitates the evaluation of scheduling algorithms for idealized CPU-GPU platforms. The advantage of this approach is that it allows us to compare multiple algorithms and determine how intrinsically well-suited they are for accelerated architectures. Although this model may not capture the full range of real-world behavior, we gathered data from a single heterogeneous node of a local computing cluster to guide its development and retain the most salient features. This node comprises four octacore Intel (Skylake) Xeon Gold 6130 CPUs running at 2.10 GHz with 192 GB RAM and four Nvidia V100-SXM2-16 GB (Volta) GPUs, each with 16 GB GPU global memory, 5120 CUDA Cores and NVLink interconnect. We used Basic Linear Algebra Subroutine (BLAS) [10] and Linear Algebra PACKage (LAPACK) [2] kernels for benchmarking as they are widely-used in scientific computing applications. The simulator is implemented in Python and the complete source code is available on Github2 . All code used to generate results presented in this paper is available in the folder simulator/scripts so interested researchers may repeat our experiments for themselves. In addition, users may make modifications to the simulator that they believe will more accurately reflect their own target environment. 3.1
Mathematical Model
The simulator software implements the following mathematical model of the problem. Suppose we have a task DAG G consisting of n tasks and e edges that we wish to execute on a target platform H comprising P processing resources of two types, PC CPU resources and P − PC = PG GPU resources. In keeping with much of the related literature and based on current programming practices, we consider CPU cores individually but regard entire GPUs as discrete [1]. For example, a node comprising 4 GPUs and 4 octacore CPUs would be viewed as 4 GPU resources and 4 × 8 = 32 CPU resources. We assume that all tasks t1 , . . . , tn are atomic and cannot be divided across multiple resources or aggregated to form larger tasks. Further, all resources can only execute a single task at any one time and can in principle execute all tasks, albeit with different processing times. Given the increasing versatility of modern GPUs and the parallelism of modern CPUs, the latter is a much less restrictive assumption than it once may have been. 2
https://github.com/mcsweeney90/heterogeneous optimistic finish time.
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
7
In our experiments, we found that the spread of kernel processing times was usually tight, with the standard deviation often being two orders of magnitude smaller than the mean. Thus we assume that all task execution times on all processing resources of a single type are identical. In particular, this means that each task has only two possible computation costs: a CPU execution time wC (ti ) and a GPU execution time wG (ti ). When necessary, we denote by wim the processing time of task ti on the specific resource pm . The communication cost between task ti and its child tj is the length of time between when execution of ti is complete and execution of tj can begin, including all relevant latency and data transfer times. Since this depends on where each task is executed, we view this as a function cij (pm , pn ). We assume that the communication cost is always zero when m = n and that there are only four possible communication costs between tasks ti and tj when this isn’t the case: cij (C, C), from a CPU to a different CPU; cij (C, G), from CPU to GPU; cij (G, C), from GPU to CPU; and cij (G, G) from GPU to a different GPU. A schedule is a mapping from tasks to processing resources, as well as the precise time at which their execution should begin. Our goal is to find a schedule which minimizes the makespan of the task graph, the total execution time of the application it represents. A task with no successors is called an exit task. Once all tasks have been scheduled, the makespan is easily computed as the earliest time all exit tasks will be completed. Note that although we assume that all costs represent time, they could be anything else we wish to minimize, such as energy consumption, so long as this is done consistently. We do not however consider the multi-objective optimization problem of trading off two or more different cost types here. 3.2
Testing Environments
In the numerical experiments described later in this article, we consider two simulated target platforms: Single GPU, comprising 1 GPU and 1 octacore CPU, and Multiple GPU, comprising 4 GPUs and 4 octacore CPUs. The latter represents the node we used to guide the development of our simulator and the former is considered in order to study how the number of GPUs affects performance. We follow the convention that a CPU core is dedicated to managing each of the GPUs [4], so these two platforms are actually assumed to comprise 7 CPU and 1 GPU resources, and 28 CPU and 4 GPU resources, respectively. Based on our exploratory experiments, we make two further assumptions. First, since communication costs between CPU resources were negligible relative to all other combinations, we assume they are zero—i.e., cij (C, C) = 0, ∀i, j. Second, because CPU-GPU communication costs were very similar to the corresponding GPU-CPU and GPU-GPU costs, we take them to be identical—i.e., cij (C, G) = cij (G, C) = cij (G, G), ∀i, j. These assumptions will obviously not be representative of all possible architectures but the simulator software allows users to repeat our experiments for more accurate representations of their own target platforms if they wish.
8
T. McSweeney et al.
We consider the scheduling of two different sets of DAGs. The first consists of ten DAGs comprising between 35 and 22, 100 tasks which correspond to the Cholesky factorization of N × N tiled matrices, where N = 5, 10, 15, . . . , 50. In particular, the DAGs are based on a common implementation of Cholesky factorization for tiled matrices which uses GEMM (matrix multiplication), SYRK (symmetric rank-k update) and TRSM (triangular solve) BLAS kernels, as well as the POTRF (Cholesky factorization) LAPACK routine [10]. All task CPU/GPU processing times are means of 1000 real timings of that task kernel. Likewise, communication costs are sample means of real communication timings between the relevant task and resource types. All numerical experiments were performed for tile sizes 128 and 1024; which was used will always be specified where results are presented. Those sizes were chosen as they roughly mark the upper and lower limits of tile sizes typically used for CPU-GPU platforms. The standard way to quantify the relative amount of communication and computation represented by a task graph is the Computation-to-Communication Ratio (CCR), the mean computation cost of the DAG divided by the mean communication cost. For the Cholesky DAGs, the CCR was about 1 for tile size 128 and about 18 for tile size 1024, with minor variation depending on the total number of tasks in the DAG. We constructed a set of randomly-generated DAGs with a wide range of CCRs, based on the topologies of the 180 DAGs with 1002 tasks from the Standard Task Graph (STG) set [16]. Following the approach in [9], we selected GPU execution times for all tasks uniformly at random from [1, 100] and computed the corresponding CPU times by multiplying by a random variable from a Gamma distribution. To consider a variety of potential applications, for each DAG we made two copies: low acceleration, for which the mean and standard deviation of the Gamma distribution was defined to be 5, and high acceleration, for which both were taken to be 50 instead. These values roughly correspond to what we observed in our benchmarking of BLAS and LAPACK kernels with tile sizes 128 and 1024, respectively. Finally, for both parameter regimes, we made three copies of each DAG and randomly generated communication costs such that the CCR fell into each of the intervals [0, 10], [10, 20] and [20, 50]. Thus in total the random DAG set contains 180 × 2 × 3 = 1080 DAGs.
4
HEFT
Recall that as a listing scheduler HEFT comprises two phases, an initial task prioritization phase in which the order all tasks are to be scheduled is determined and a processor selection phase in which the processing resource each task is to be scheduled on is decided. Here we describe both in order to give a complete description of the HEFT algorithm. The critical path of a DAG is the longest path through it, and is important because it gives a lower bound on the optimal schedule makespan for that DAG. Heuristics for homogeneous platforms often use the upward rank, the length of the critical path from that task to an exit task, including the task itself
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
9
[17], to determine priorities. Computing the critical path is not straightforward for heterogeneous platforms so HEFT extends the idea by using mean values instead. Intuitively, the task prioritization phase of HEFT can be viewed as an approximate dynamic program applied to a simplified version of the task DAG that uses mean values to set all weights. More formally, we first define the mean execution cost of all tasks ti through wi :=
P wC (ti )PC + wG (ti )PG wim = , P P m=1
(1)
where the second expression is how wi would be computed under the assumptions of our model. Likewise, the mean communication cost cij between ti and tj is the average of all such costs over all possible combinations of resources, 1 1 cij = 2 cij (pm , pn ) = 2 Ak cij (k, ), (2) P m,n P k,∈{C,G}
where ACC = PC (PC − 1), ACG = PC PG = AGC , and AGG = PG (PG − 1). For all tasks ti in the DAG, we define their upward ranks ranku (ti ) recursively, starting from the exit task(s), by ranku (ti ) = wi +
max (cij + ranku (tj )),
tj ∈Ch(ti )
(3)
where Ch(ti ) is the set of ti ’s immediate successors in the DAG. The task prioritization phase then concludes by listing all tasks in decreasing order of upward rank, with ties broken arbitrarily. The processor selection phase of HEFT is now straightforward: we move down the list and assign each task to the resource expected to complete its execution at the earliest time. Let Rmi be the earliest time at which the processing resource pm is actually free to execute task ti , P a(ti ) be the set of ti ’s immediate predecessors in the DAG, and AF T (tk ) be the time when execution of a task tk is actually completed (which in the static case is known precisely once it has been scheduled). Then the earliest start time of task ti on processing resource pm is computed through (4) EST (ti , pm ) = max Rmi , max (AF T (tk ) + cki (pk , pm )) tk ∈P a(ti )
and the earliest finish time EF T (ti , pm ) of task ti on pm is given by EF T (ti , pm ) = wim + EST (ti , pm ).
(5)
HEFT follows an insertion-based policy that allows tasks to be inserted between two that are already scheduled, assuming precedence constraints are still respected, so Rmi may not simply be the latest finish time of all tasks on pm . A complete description of HEFT is given in Algorithm 1. HEFT has a time complexity of O(P · e). For dense DAGs, the number of edges is proportional to n2 , where n is the number of tasks, so the complexity is effectively O(n2 P ) [17].
10
T. McSweeney et al.
Algorithm 1: HEFT. 1 2 3 4 5 6 7 8 9 10 11
4.1
Set the computation cost of all tasks using (1) Set the communication cost of all edges using (2) Compute ranku for all tasks according to (3) Sort the tasks into a priority list by non-increasing order of ranku for task in list do for each resource pk do Compute EF T (ti , pk ) using (4) and (5) end pm := arg mink (EF T (ti , pk )) Schedule ti on the resource pm end
Benchmarking
Using our simulation model, we investigated the quality of the schedules computed by HEFT for the Cholesky and randomly-generated DAG sets on both the single and multiple GPU target platforms described in Sect. 3. The metric used for evaluation was the speedup, the ratio of the minimal serial time (MST)—the minimum execution time of the DAG on any single resource—to the makespan. Intuitively, speedup tells us how well the schedule exploits the parallelism of the target platform. Figure 1a shows the speedup of HEFT for the Cholesky DAGs with tile size 128. The most interesting takeaway is the difference between the two platforms. With multiple GPUs the speedup increased uniformly with the number of tasks until a small decline for the very largest DAG, but for a single GPU the speedup stagnated much more quickly. This was due to the GPU being continuously busy and adding little additional value once the DAGs became sufficiently large and more GPUs therefore postponing this effect. Results were broadly similar for Cholesky DAGs with tile size 1024, with the exception that the speedup values were uniformly smaller, reaching a maximum of just over four for the multiple GPU platform. This was because the GPUs were so much faster for the larger tile size that HEFT made little use of the CPUs and so speedup was almost entirely determined by the number of GPUs available. Figure 1b shows the speedups for all 540 high acceleration randomlygenerated DAGs, ordered by their CCRs; results were broadly similar for the low acceleration DAGs. The speedups for the single GPU platform are much smaller with a narrower spread compared to the other platform, as for the Cholesky DAGs. More surprising is that HEFT sometimes returned a schedule with speedup less than one for DAGs with small CCR values, which we call a failure since this is obviously unwanted behavior. These failures were due to the greediness of the HEFT processor selection phase, which always schedules a task on the processing resource that minimizes its earliest finish time without considering the communication costs that may later be incurred by doing so.
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
(a) Cholesky.
11
(b) Random.
Fig. 1. Speedup of HEFT for Cholesky (tile size 128) and random (high acceleration) DAG sets.
The effect is more pronounced when the CCR is low because the unforeseen communication costs are proportionally larger. 4.2
HEFT-WM
The implicit assumption underlying the use of mean values when computing task priorities in HEFT is that the probability of a task being scheduled on a processing resource is identical for all resources. But this is obviously not the case: if a task’s GPU execution time is ten times smaller than its CPU execution time then it is considerably more likely to be scheduled on a GPU, even if not precisely ten times so. This suggests that we should weight the mean values according to each task’s acceleration ratio ri , the CPU time divided by the GPU time. In particular, for each task ti we estimate its computation cost to be wi =
wC (ti )PC + ri wG (ti )PG , PC + ri PG
(6)
and for each edge (say, between tasks ti and tj ) the estimated communication cost is ACC · cij (C, C) + ACG ri cij (G, C) + rj cij (C, G) + ri rj AGG · cij (G, G) . cij = (ri PG + PC ) · (rj PG + PC ) (7) We call the modified heuristic which follows Algorithm 1 but uses (6) instead of (1) and (7) instead of (2) HEFT-WM (for Weighted Mean). We were unable to find any explicit references to this modification in the literature but given its simplicity we do not think it is a novel idea and suspect it has surely been used before in practice.
12
5
T. McSweeney et al.
HOFT
If we disregard all resource contention, then we can easily compute the earliest possible time that all tasks can be completed assuming that they are scheduled on either a CPU or GPU resource, which we call the Optimistic Finish Time (OFT). More specifically, for p, p ∈ {C, G}, we move forward through the DAG and build a table of OFT values by setting OF T (ti , p) = wp (ti ), if ti is an entry task, and recursively computing {OF T (t , p ) + δ c (p, p )} , (8) OF T (ti , p) = wp (ti ) + max min j pp ij tj ∈P a(ti )
p
for all other tasks, where δpp = 1 if p = p and 0 otherwise. We use the OFT table as the basis for the task prioritization and processor selection phases of a new HEFT-like heuristic optimized for CPU-GPU platforms that we call Heterogeneous Optimistic Finish Time (HOFT). Note that computing the OFT table does not increase the order of HEFT’s time complexity. Among several possible ways of using the OFT to compute a complete task prioritization, we found the most effective to be the following. First, define the weights of all tasks to be the ratio of the maximum and minimum OFT values, wi =
max{OF T (ti , C), OF T (ti , G)} . min{OF T (ti , C), OF T (ti , G)}
(9)
Now assume that all edge weights are zero, cij ≡ 0, ∀i, j, and compute the upward rank of all tasks with these values. Upward ranking is used to ensure that all precedence constraints are met. Intuitively, tasks with a strong preference for one resource type—as suggested by a high ratio—should be scheduled first. We also propose a new processor selection phase which proceeds as follows. Working down the priority list, each task ti is scheduled on the processing resource pm with the smallest EFT as in HEFT except when pm is not also the fastest resource type for that task. In such cases, let pf be the resource of the fastest type with the minimal EFT and compute sm := EF T (ti , pf ) − EF T (ti , pm ),
(10)
the saving that we expect to make by scheduling ti on pm rather than pf . Suppose that pm is of type Tm ∈ {C, G}. By assuming that each child task tj of ti is scheduled on the type of resource Tj which minimizes its OFT and disregarding the potential need to wait for other parent tasks to complete, we estimate E(Ch(ti )|pm ), the earliest finish time of all child tasks, given that ti is scheduled on pm , through EF T (ti , pm ) + cij (Tm , Tj ) + wTj (tj ) . (11) E(Ch(ti )|pm ) := max tj ∈Ch(ti )
Likewise for pf we compute E(Ch(ti )|pf ) and if sm > E(Ch(ti )|pm ) − E(Ch(ti )|pf )
(12)
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
13
we schedule task ti on pm ; otherwise, we schedule it on pf . Intuitively, the processor selection always chooses the resource with the smallest EFT unless by doing so we expect to increase the earliest possible time at which all child tasks can be completed.
Algorithm 2: HOFT. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
6
Compute the OFT table for all tasks using (8) Set the computation cost of all tasks using (9) Set the communication cost of all edges to be zero Compute ranku for all tasks according to (3) Sort the tasks into a priority list by non-increasing order of ranku for task in list do for each resource pk do Compute EF T (ti , pk ) using (4) and (5) end pm := arg mink (EF T (ti , pk )) if wim = min(wC (ti ), wG (ti )) then pf := arg mink EF T (ti , pk )|wik = min(wC (ti ), wG (ti )) Compute sm using (10) Compute E(Ch(ti )|pm ) and E(Ch(ti )|pf ) using (11) if (12) holds then Schedule ti on pm else Schedule ti on pf end end end
Simulation Results
Figure 2 shows the reduction in schedule makespan, as a percentage of the HEFT schedule, achieved by HOFT and HEFT-WM for the set of Cholesky DAGs, on both the single and multiple GPU target platforms. The overall trend for the multiple GPU platform is that HOFT improves relative to both HEFT variants as the number of tasks in the DAG grows larger; it is always better than standard HEFT for tile size 1024. For the single GPU platform, HOFT was almost always the best, except for the smallest DAGs and the largest DAG with tile size 128 (for which all three heuristics were almost identical). Interestingly, we found that the processor selection phase of HOFT never actually differed from HEFT’s for these DAGs and so the task prioritization phase alone was key. HOFT achieved smaller makespans than HEFT on average for the set of randomly-generated DAGs, especially for those with high acceleration, but was slightly inferior to HEFT-WM, as can be seen from Table 1. However also
14
T. McSweeney et al.
(a) Tile size 128.
(b) Tile size 1024.
Fig. 2. HEFT-WM and HOFT compared to HEFT for Cholesky DAGs.
included in the table is HOFT-WM, the heuristic obtained by using the HEFTWM task prioritization with the HOFT processor selection. HOFT-WM was identical to HEFT-WM for the Cholesky DAGs but improved on both heuristics for the randomly-generated DAG set, suggesting that the HOFT processor selection phase is generally more effective than HEFT’s, no matter which task ranking is used. The alternative processor selection also reduced the failure rate for DAGs with very low CCR by about half on the single GPU platform, although it made little difference on the multiple GPU platform. Table 1. Makespan reduction vs. HEFT. Shown are both the average percentage reduction (APR) and the percentage of (540) DAGs for which each heuristic improved on the HEFT schedule (Better). Heuristic
Single GPU Multiple GPU Low acc. High acc. Low acc. High acc. APR Better APR Better APR Better APR Better
HEFT-WM 0.8 HOFT
74.8
2.3
69.6
1.6
84.8
2.4
79.8
−0.2 50.3
3.8
83.1
1.4
69.2
2.3
76.5
4.6
76.9
1.4
78.1
3.7
81.1
HOFT-WM 0.8
7
70.9
Conclusions
Overall our simulations suggest that HOFT is often superior to—and always competitive with—both standard HEFT and HEFT-WM for multicore and GPU platforms, especially when task acceleration ratios are high. The processor selection phase in particular appears to be more effective, at least on average, for any task prioritization. It should also be noted that HEFT-WM was almost always
An Efficient New Static Scheduling Heuristic for Accelerated Architectures
15
superior to the standard algorithm in our simulations, suggesting that it should perhaps be the default in accelerated environments. Although the number of failures HOFT recorded for DAGs with low CCRs was slightly smaller than for HEFT, the failure probability was still unacceptably high. Using the lookahead processor selection from [5] instead reduced the probability of failure further but it was still nonzero and the additional computational cost was considerable. We investigated cheaper sampling-based selection phases that consider only a small sample of the child tasks, selected either at random or based on priorities. These did reduce the failure probability in some cases but the improvement was usually minor. Alternatives to lookahead that we intend to consider in future are the duplication [13] or aggregation of tasks. Static schedules for multicore and GPU are most useful in practice as a foundation for superior dynamic schedules [1], or when their robustness is reinforced using statistical analysis [18]. This paper was concerned with computing the initial static schedules; the natural next step is to consider their extension to real—i.e., dynamic—environments. This is often called stochastic scheduling, since the costs are usually modeled as random variables from some—possibly unknown—distribution. The goal is typically to find methods that bound variation in the schedule makespan, which is notoriously difficult [12]. Experimentation with existing stochastic scheduling heuristics, such as Monte Carlo Scheduling [18], suggested to us that the main issue with adapting such methods for multicore and GPU is their cost, which can be much greater than computing the static schedule itself. Hence in future work we intend to investigate cheaper ways to make effective use of static schedules for real platforms.
References 1. Agullo, E., Beaumont, O., Eyraud-Dubois, L., Kumar, S.: Are static schedules so bad? A case study on Cholesky factorization. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1021–1030, May 2016. https://doi.org/10.1109/IPDPS.2016.90 2. Anderson, E., et al.: LAPACK Users’ Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia (1999) 3. Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014). https://doi.org/10.1109/TPDS.2013.57 4. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput. Pract. Exp. 23(2), 187–198 (2011). https://doi.org/10.1002/cpe.1631 5. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.M.: DAG scheduling using a lookahead variant of the Heterogeneous Earliest Finish Time algorithm. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 27–34, February 2010. https://doi.org/10.1109/PDP.2010.56 6. Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., H´erault, T., Dongarra, J.J.: PaRSEC: exploiting heterogeneity to enhance scalability. Comput. Sci. Eng. 15(6), 36–45 (2013). https://doi.org/10.1109/MCSE.2013.98
16
T. McSweeney et al.
7. Braun, T.D., et al.: A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 61(6), 810–837 (2001). https://doi.org/10.1006/jpdc.2000.1714 8. Canon, L.C., Jeannot, E., Sakellariou, R., Zheng, W.: Comparative evaluation of the robustness of DAG scheduling heuristics. In: Gorlatch, S., Fragopoulou, P., Priol, T. (eds.) Grid Computing, pp. 73–84. Springer, Cham (2008). https://doi. org/10.1007/978-0-387-09457-1 7 9. Canon, L.-C., Marchal, L., Simon, B., Vivien, F.: Online scheduling of task graphs on hybrid platforms. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. 192–204. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-96983-1 14 10. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990). https://doi. org/10.1145/77626.79170 11. Duran, A., et al.: OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21(02), 173–193 (2011). https://doi.org/10. 1142/S0129626411000151 12. Hagstrom, J.N.: Computational complexity of PERT problems. Networks 18(2), 139–147. https://doi.org/10.1002/net.3230180206 13. Ahmad, I., Kwok, Y.-K.: On exploiting task duplication in parallel program scheduling. IEEE Trans. Parallel Distrib. Syst. 9(9), 872–892 (1998). https://doi. org/10.1109/71.722221 14. Kumar, S.: Scheduling of dense linear algebra kernels on heterogeneous resources. Ph.D. thesis, University of Bordeaux, April 2017. https://tel.archives-ouvertes.fr/ tel-01538516/file/KUMAR SURAL 2017.pdf 15. Shetti, K.R., Fahmy, S.A., Bretschneider, T.: Optimization of the HEFT algorithm for a CPU-GPU environment. In: PDCAT, pp. 212–218 (2013). https://doi.org/ 10.1109/PDCAT.2013.40 16. Tobita, T., Kasahara, H.: A standard task graph set for fair evaluation of multiprocessor scheduling algorithms. J. Scheduling 5(5), 379–394. https://doi.org/10. 1002/jos.116 17. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002). https://doi.org/10.1109/71.993206 18. Zheng, W., Sakellariou, R.: Stochastic DAG scheduling using a Monte Carlo approach. J. Parallel Distrib. Comput. 73(12), 1673–1689 (2013). https://doi.org/10. 1016/j.jpdc.2013.07.019
Multilevel Parallel Computations for Solving Multistage Multicriteria Optimization Problems Victor Gergel(B)
and Evgeny Kozinov
Lobachevsky State University of Nizhni Novgorod, Nizhni Novgorod, Russia [email protected], [email protected] Abstract. In the present paper, a novel approach for solving the computationally costly multicriteria optimization problems is considered. Within the framework of the developed approach, the obtaining of the efficient decisions is ensured by means of several different methods for the scalarization of the efficiency criteria. The proposed approach provides an opportunity to alter the scalarization methods and the parameters of these ones in the course of computations that results in the necessity of multiple solving the time-consuming global optimization problems. Overcoming the computational complexity is provided by reusing the computed search information and efficient parallel computing on highperformance computing systems. The performed numerical experiments confirmed the developed approach to allow reducing the amount and time of computations for solving the time-consuming multicriteria optimization problems. Keywords: Multicriteria optimization · Criteria scalarization · Global optimization · Search information · Parallel computations · Numerical experiment
1
Introduction
The paper discusses a novel approach for solving the time-consuming multicriteria optimization (MCO) problems. Such problems arise in many applications that is confirmed by a wide spectrum of research on this subject – see, for example, monographs [1–5] and reviews of scientific and practical results [6–8]. The solving of the MCO problems is usually reduced to finding the efficient (non-dominated) decisions1 . In the limiting case, it may appear to be necessary to obtain the whole set of the efficient decisions (the Pareto set) that may 1
The solutions, which cannot be improved with respect to any criteria without worsening of the efficiency values with respect to other criteria are understood as the efficient (non-dominated) decisions.
This research was supported by the Russian Science Foundation, project No 16-1110150 “Novel efficient methods and software tools for time-consuming decision making problems using supercomputers of superior performance.”. c Springer Nature Switzerland AG 2020 V. V. Krzhizhanovskaya et al. (Eds.): ICCS 2020, LNCS 12137, pp. 17–30, 2020. https://doi.org/10.1007/978-3-030-50371-0_2
18
V. Gergel and E. Kozinov
require a large amount of computations. Another approach used widely consists in finding a relatively small set of efficient decisions only. As a rule, to find particular efficient decisions, the vector criterion is transformed to a single (scalar) criterion, which can be optimized by some algorithms of nonlinear programming. Among such approaches, one can outline various types of criteria convolutions, the lexicographic optimization methods, the reference point algorithms, etc. [1–3]. In the present paper it is supposed that the efficiency criteria can be multiextremal and computing the values of criteria and constraints can be timeconsuming. But the computational complexity is even higher as it is supposed that the applied scalarization method can be varied in the course of computations that leads to a necessity of multiple solving of the global optimization problems [9–11]. In the framework of developed approach efficient global optimization methods are proposed for solving such computationally expensive MCO problems and different multilevel parallel computation schemes are investigated for executing these methods on high-performance supercomputer systems. Further structure of the paper is organized as follows. In Sect. 2, the statement of the multicriteria optimization problem is given and a general scheme for the criteria scalarization is proposed. In Sect. 3, efficient global optimization methods utilizing the computed search information are considered. In Sect. 4, the multilevel schemes of parallel computations for the multistage solving of the computational-costly MCO problems are given. Section 5 presents the results of numerical experiments confirming the developed approach to be promising. In Conclusion, the obtained results are summarized and main directions of further investigations are outlined.
2
Multicriteria Optimization Problem Statement
Multicriteria optimization (MCO) problem can be formulated as follows [1–5] f (y) → min, y ∈ D,
(1)
where y = (y1 , y2 , . . . , yN ) is a vector of varied parameters, f (y) = (f1 (y), f2 (y), . . . , fs (y)) is a vector efficiency criterion, and D ⊂ RN is a search domain D = {y ∈ RN : ai ≤ yi ≤ bi , 1 ≤ i ≤ N },
(2)
for the given vectors a and b. Without loss of generality, in further consideration the criteria fi (y), 1 ≤ i ≤ s are suggested to be non-negative and the decrease of these ones corresponds to the increase of the decision efficiency. In the most general case, the criteria fi (y), 1 ≤ i ≤ s can be multiextremal, and the procedure of computing the values of these ones can appear to be time consuming. Also, the criteria fi (y), 1 ≤ i ≤ s are supposed to satisfy the Lipschitz condition |fi (y1 ) − fi (y2 )| ≤ Li y1 − y2 , 1 ≤ i ≤ s,
(3)
Multilevel Parallel Computations
19
where Li , 1 ≤ j ≤ s are the Lipschitz constants for the criteria fj (y), 1 ≤ j ≤ s and ∗ denotes the Euclidean norm in RN . The efficiency criteria of the MCO problems are usually contradictory, and the decisions y ∗ ∈ D with the best values with respect to all criteria simultaneously may be absent. In such situations, for the MCO problems it is appropriate to find the efficient (non-dominated) decisions, for which the improvement of the values with respect to any criterion results in worsening the efficiency values with respect to other criteria. Obtaining of the whole set of efficient decisions (the Pareto set) may require to perform a large amount of computations. As a result, another approach is applied for solving the MCO problems often – finding only a relatively small set of efficient decisions defined according to the requirements of the decision maker. An approach to obtaining particular efficient decisions used widely consists in the transformation of a vector criterion into some general scalar criterion of efficiency2 [1–5] min ϕ(y) = F (α, y), y ∈ D, (4) where F is the objective function generated as a result of scalarization of the criteria fi , 1 ≤ i ≤ s, α is a vector of parameters of the criteria convolution applied, and D is the search domain from (2). Because of (3), the function F (α, y) also satisfies the Lipschitz condition with some constant L i.e. |F (α, y ) − F (α, y )| ≤ Ly1 − y2 .
(5)
To construct a general scalar efficiency function F (α, y) from (4), one can use, in particular, the following methods of the criteria scalarization. 1. One of the scalarization methods used often consists in the use of the minmax convolution of criteria [1–3]: F1 (λ, y) = max (λi fi (y), 1 ≤ i ≤ s), s λ = (λ1 , λ2 , . . . , λs ) ∈ Λ ⊂ Rs : i=1 λi = 1, λi ≥ 0, 1 ≤ i ≤ s.
(6)
2. Another approach used widely is applied if there are some a priori estimates of the criteria values for the required decision (for example, based on some ideal decision or any existing prototype). In such cases, the solving of a MCO problem may consist in finding an efficient decision corresponding to given criteria values most completely. The scalar criterion F2 (λ, y) can be presented as root-mean-square deviation of a point y ∈ D from the ideal decision y ∗ [3]: s
F2 (θ, y) =
1 θi (fi (y) − fi (y ∗ ))2 , y ∈ D, s i=1
(7)
where the parameters 0 ≤ θi ≤ 1, 1 ≤ i < s are the indicators of importance of the approximation precision with respect to each varied parameter yi , 1 ≤ i ≤ N separately. 2
It is worth noting that such an approach provides an opportunity to use a wide set of already existing global optimization methods for solving the MCO problems.
20
V. Gergel and E. Kozinov
3. In the case when the criteria can be arranged in the importance, the method of successive concessions (MSC) [2–4] is applied often, according to which the solving of a MCO problem is reduced to a multistage solving of the global optimization problems with nonlinear constraints: f1∗ = miny∈D f1 (y), ∗ fi = miny∈D fi (y), fj (y) ≤ fj∗ + δj , 1 ≤ j < i, 1 < i ≤ s, δ Plex (f, δ, D) = Arg miny∈D fs (y), fj (y) ≤ fj∗ + δj , 1 ≤ j
1 global search iterations to be completed. The choice of the trial point of the next (k + 1)th iteration is determined by the following rules. Rule 1. For each interval (xi−1 , xi ), 1 < i ≤ k compute a magnitude R(i) called further a characteristic of the interval. Rule 2. Determine the interval (xt−1 , xt ), which the maximum characteristic corresponds to5 R(t) = max {R(i) : 1 < i ≤ k}. (16) Rule 3. Perform the new trial in the interval with the maximum characteristic xk+1 ∈ (xt−1 , xt ).
(17)
The stopping condition, according to which the trials are terminated, is defined by the condition (xt − xt−1 )1/N ≤ ε, (18) where t is from (15), N is the dimensionality of the problem being solved from (1), and ε > 0 is the predefined accuracy of the problem solution. If the stopping condition is not fulfilled, the number of iterations k is incremented by unity, and new global search iteration is performed. The convergence conditions for the algorithms developed within the framework of the information-statistical theory of global search were considered in [12]. Thus, at appropriate numerical estimates of the H¨ older constants Hi , 1 ≤ i ≤ m from (12), MAGS converges to all available global minima points of the minimized function ϕ(y(x)). It is worth noting also that the application of the MAGS algorithm after solving the current problem F (α, y) from (4) to solving the next problems from the set FT from (10) allows reusing the search information Ak from (13) obtained in the course of all preceding computations.
4
Multilevel Parallel Computations for Solving Multistage Multicriteria Optimization Problems
The applied general approach to parallel solving the global optimization problems is the following – parallel computations is provided by simultaneous computing the minimized function values F (α, y) from (4) in several different points of the search domain D – see, for example, [12,20]. Such an approach provides 5
The characteristics R(i), 1 < i ≤ k may be interpreted as some measures of importance of the intervals with respect to the probability to find the global minimum point in respective intervals.
Multilevel Parallel Computations
23
the parallelization of the most time-consuming part of the global search process and is a general one – it can be applied to almost all global search methods for a variety of global optimization problems [18,19,21,22]. This approach can be applied at different computation levels – either at the level of solving of one of the problems of the set FT from (10) or at the level of parallel solving of several problems of this set. These methods of parallel computations will be considered below in relation to multiprocessor computer systems with shared memory. 4.1
Parallel Computations in Solving Single Multicriteria Optimization Problem
Since the characteristics R(i), 1 < i ≤ k of the search intervals (xi−1 , xi ), 1 < i ≤ k play the role of the measures of importance of the intervals with respect to the probability to find the global minimum points, the MAGS algorithm can be extended for the parallel execution at the following generalization of the rules (16)–(17) [12,18,20,23]: Rule 2 . Arrange the characteristics of the intervals in the decreasing order R(t1 ) ≥ R(t2 ) ≥ · · · ≥ R(tk−2 ) ≥ R(tk−1 )
(19)
and select p intervals with the indices tj , 1 ≤ j ≤ p having the maximum values of the characteristics (p is the number of processors (cores) employed in the parallel computations). Rule 3 . Perform new trials (computing of the minimized function values F (α, y(x)) in the points xk+j , 1 ≤ j ≤ p placed in the intervals with the maximum characteristics from (19) in parallel. The stopping condition for the algorithm (18) should be checked for all intervals, in which the scheduled trials are performed (xtj − xtj −1 )1/N ≤ ε, 1 ≤ tj ≤ p.
(20)
As before, if the stopping condition is not satisfied, the number of iterations k is incremented by p, and new global search iteration is performed. This extended version of the MAGS algorithm will further called Parallel Multidimensional Algorithm of Global Search for solving Single MCO problems (PMAGS-S). 4.2
Parallel Computations in Solving Several Multicriteria Optimization Problems
Another possible method of parallel computations consists in simultaneous solving several problems F (α, y) of the set FT from (10). In this approach, the number of problems being solved simultaneously is determined by the number of processors (computational cores) available. The solving of each particular problem F (α, y) is performed using the MAGS algorithm. Then, taking into account
24
V. Gergel and E. Kozinov
that all the problems of the set FT are generated from the same MCO problem (the values of the scalar criterion F (α, y) are computed on the basis of the criteria values fi (y), 1 ≤ i ≤ s from (1)), it is possible to provide the interchange of the computed search information. For this purpose, the computational scheme of the MAGS algorithm should be appended by the following rule: Rule 4. After completing a trial (computing the values of the function F (α, y) and criteria fi (y), 1 ≤ i ≤ s) by a processor, the point of current trial y k+1 ∈ D and the values of criteria f (y k+1 )) are transferred to all processors. Then the availability of the data transferred from other processors is checked, and new received data is included into the search information Ak from (13). Such a mutual use of the search information Ak from (13) obtained when solving particular problems of the set FT from (10) allows to reduce significantly the number of global search iterations performed for each problem F (α, y) – see Sect. 5 for the results of the numerical experiments. This version of the MAGS algorithm will be further called Parallel Multidimensional Algorithm of Global Search for solving Multiple MCO problems (PMAGS-M). 4.3
Parallel Computations in Joint Solving of Several Multicriteria Optimization Problems
The computational scheme of the PMAGS-S algorithm can be applied for the parallel solving of several problems of the set FT as well. In this case, the choice of the intervals with the maximum characteristics R(i), 1 < i ≤ k from (16) must be performed taking into account all simultaneously solved problems F (α, y): Rl1 (t1 ) ≥ Rl2 (t2 ) ≥ · · · ≥ RlK−2 (tK−2 ) ≥ RlK−1 (tK−1 ), 1 ≤ li ≤ p, 1 ≤ i ≤ K−1, (21) where li , 1 ≤ i ≤ K − 1 is the index of the problem, which the characteristic Rli belongs to and K is the total number of trials for all problems being solved simultaneously. In this approach, the problems, for which the trials are performed, are determined dynamically in accordance with (21) – at each current global search iteration for a problem F (α, y), the trials may be absent or all p trials may be performed. This version of the MAGS algorithms will further be called Parallel Multidimensional Algorithm of Global Search for Joint solving of Multiple MCO problems (PMAGS-JM).
5
Results of Numerical Experiments
The numerical experiments were carried out using the “Lobachevsky” supercomputer at University of Nizhny Novgorod (operating system – CentOS 6.4, managing system – SLURM). Each supercomputer node had 2 Intel Sandy Bridge E5-2660 processors 2.2 GHz, 64 GB RAM. Each processor had 8 cores (i.e. total
Multilevel Parallel Computations
25
16 CPU cores per node were available). To obtain the executable program code, Intel C++ 17.0 compiler was used. The numerical experiments were performed using the Globalizer system [24]. The comparison of the efficiency of the sequential version of the developed approach with other approaches to solving the MCO problems was performed in [10,11]. This paper presents the results of numerical experiments for the evaluation of the efficiency of the parallel generalization of the developed approach. Each experiment consisted of the solving of 100 two-dimensional bi-criterial test MCO problems, in which the criteria were defined as the multiextremal functions [12] f (y1 , y2 ) = −(AB + CD)1/2 2 7 7 AB = [A a (y , y ) + B b (y , y )] ij ij 1 2 ij ij 1 2 i=1 j=1 2 7 7 CD = [C a (y , y ) − D b (y , y )] ij ij 1 2 ij ij 1 2 i=1 j=1
(22)
where aij (y1 , y2 ) = sin(πiy1 ) sin(πjy2 ), bij (y1 , y2 ) = cos(πiy1 ) cos(πjy2 ) were defined in the ranges and the parameters −1 ≤ Aij , Bij , Cij , Dij ≤ 1 were independent random numbers distributed uniformly. The functions of this kind are multiextremal essentially and are used often in the evaluation of the efficiency of the global optimization algorithms [10–12,18,19,21]. When performing the numerical experiments, the construction of a numerical approximation of the Pareto domain was understood as a solution of a MCO problem. To construct an approximation of a Pareto domain for each MCO problem with the criteria from (22), 64 scalar global optimization subproblems F (α, y) from (4) were solved with different values of the criteria convolution coefficients (i.e. total 6400 global optimization problems were solved in each experiment). The obtained results of experiments were averaged over the number of solved MCO problems. It should be noted that since the developed approach is oriented onto the MCO problems, in which computing the criteria values requires a large amount of computations, in all tables presented below the computational costs of solving the MCO problems is measured in the numbers of global search iterations performed. In carrying out the numerical experiments, the following values of parameters of the applied algorithms were used: the required accuracy of the problem solutions ε = 0.01 from (18) and (20), the reliability parameter6 r = 2.3. The experiments were carried out using a single supercomputer node (two processors, 16 computational cores with shared memory). In Table 1, the indicators of the computational costs (the numbers of the performed global search iterations) for all considered schemes of parallel computations (see Sect. 4) are presented (Fig. 1). 6
The reliability parameter is used in the construction of the numerical estimate of the constants Lj , 1 ≤ j ≤ s from (3), L from (5) and H from (12).
26
V. Gergel and E. Kozinov
Fig. 1. Comparison of the averaged numbers of iterations executed for solving the MCO problems using various efficiency criteria convolutions
The results of experiments demonstrate the lowest computational costs (the number of performed global search iterations) to be achieved when using 16 cores for the PMAGS-S algorithm and the convolution F2 from (7). Also, one can see from Table 1 that almost all computational schemes have a high efficiency from the viewpoint of parallelization. When using 16 cores, all algorithms except PMAGS-M with the convolutions F1 from (6) and F2 from (7) have demonstrated the speedup greater than 9.9.
Multilevel Parallel Computations
27
Table 1. Comparison of performance of various schemes of parallel computations (the second column indicates the criteria convolution schemes F1 from (6), F2 from (7), F3 from (9)) Method\Cores Conv. Average number of iterations 1 2 4 8
Speedup 2 4 8
16
16
Method PMAGS-S (Section 4.1)
F1 F2 F3
1238.3 625.7 1018.6 512 1657.6 868.3
Method PMAGS-M (Section 4.2)
F1 F2 F3
1257.2 1035.9 1653.3
868.4 408.6 811.2 532.1 1102.3 509.8
231.9 407.6 282.4
143.4 284.3 150.2
1.4 3.1 5.4 8.8 1.3 1.9 2.5 3.6 1.5 3.2 5.9 11.0
Method PMAGS-JM (Section 4.3)
F1 F2 F3
1552.1 1387.9 2760.3
762.4 390.5 721.9 362.8 1371.7 652
214 209.8 421.4
130.7 135.9 227.6
2.0 4.0 7.3 11.9 1.9 3.8 6.6 10.2 2.0 4.2 6.6 12.1
324.5 186.7 115.3 2.0 3.8 6.6 10.7 264.2 158.5 102.9 2.0 3.9 6.4 9.9 443.6 245.4 143.4 1.9 3.7 6.8 11.6
Table 2. Comparison of the efficiency of the developed methods in solving the applied problem (the second row indicates the criteria convolution schemes F1 from (6), F2 from (7), F3 from (9)) Parallel scheme 4.1 Convolution Iterations
Parallel scheme 4.2
Parallel scheme 4.3
F1
F2
F3
F1
F2
F3
F1
F2
F3
125
145
118
168
138
149
81
82
79
In order to demonstrate the efficiency of the proposed approach, a problem of vibration isolation for a system with several degrees of freedom consisting of an isolated base and an elastic body has been solved. In the considered problem statement, the protected object was represented as multi-mass mechanical system consisting of several material points connected by vibration damping elements. As the criteria, the maximum deformation and maximum displacement of the object relative to the base were minimized (for details, see [25]). The dimensionality of the space of the optimized parameters was selected to be 3. The problem was solved by all considered methods using the Globalizer system. When solving the problem, the parameter r = 3 and the number of cores 16 were used. The number of convolutions was selected to be 16, and the accuracy of the method was set to ε = 0.05. The comparison of the efficiency of the methods by solving the applied problem is presented in Table 2. The results of the numerical experiments demonstrate that all methods have found the sufficient approximation of the Pareto domain. The lowest number of iterations performed the PMAGS-JM method. In Fig. 2, the computed approximation of the Pareto domain obtained by the PMAGS-JM method with the convolution F2 from (7) is presented.
28
V. Gergel and E. Kozinov
Fig. 2. Approximation of the Pareto domain for the problem of vibration isolation using PMAGS-JM method with the convolution F2 from (7)
6
Conclusion
In the present paper, a novel approach for solving the time-consuming multicriteria optimization problems has been considered. Within the framework of the developed approach, obtaining the efficient decisions is provided by using several different efficiency criteria scalarization methods. The proposed approach allows altering the scalarization methods used and the parameters of these ones in the course of computations. In turn, such a variation of the problem statement leads to the need for solving multiple time-consuming global optimization problems. Overcoming the computational complexity is provided by means of the reuse of the whole search information obtained in the course of computations and efficient parallel computations on high-performance computational systems. The proposed methods of parallel computations can be used both for solving single MCO problems and for joint solving several ones. The performed numerical experiments (total 6400 global optimization problems have been solved) and the example of solving the applied problem of vibration isolation confirm the developed approach to allow reducing the amount and time of computations for solving time-consuming MCO problems. In order to obtain more reliable evaluation of efficiency of the parallel computations, it is intended to continue carrying out the numerical experiments on solving the MCO problems with more efficiency criteria and for larger dimensionality. Future research will also include investigations how to select the best parallel method automatically by using different computational platforms. Finally problems how to generalize the proposed approach for applying on multi-node cluster with GPU processors will be considered.
References 1. Ehrgott, M.: Multicriteria Optimization, 2nd edn. Springer, Heidelberg (2010) 2. Collette, Y., Siarry, P.: Multiobjective Optimization: Principles and Case Studies (Decision Engineering). Springer, Hiedelberg (2011). https://doi.org/10.1007/9783-662-08883-8 3. Marler, R.T., Arora, J.S.: Multi-Objective Optimization: Concepts and Methods for Engineering. VDM Verlag (2009)
Multilevel Parallel Computations
29
ˇ ˇ 4. Pardalos, P.M., Zilinskas, A., Zilinskas, J.: Non-Convex Multi-Objective Optimization. Springer Optimization and Its Applications, vol. 123. Springer, New York (2017). https://doi.org/10.1007/978-3-319-61007-8 5. Parnell, G.S., Driscoll, P.J., Henderson, D.L. (eds.): Decision Making in Systems Engineering and Management, 2nd edn. Wiley, New Jersey (2011) 6. Figueira, J., Greco, S., Ehrgott, M. (eds.): Multiple Criteria Decision Analysis: State of the Art Surveys. Springer, New York (2005). https://doi.org/10.1007/ b100605 7. Hillermeier, C., Jahn, J.: Multiobjective optimization: survey of methods and industrial applications. Surv. Math. Ind. 11, 1–42 (2005) 8. Cho, J.-H., Wang, Y., Chen, I.-R., Chan, K.S., Swami, A.: A survey on modeling and optimizing multi-objective systems. IEEE Commun. Surv. Tutor. 19(3), 1867– 1901 (2017) 9. Gergel, V.P., Kozinov, E.A.: Accelerating multicriterial optimization by the intensive exploitation of accumulated search data. In: AIP Conference Proceedings, vol. 1776, p. 090003 (2016). https://doi.org/10.1063/1.4965367 10. Gergel, V., Kozinov, E.: Efficient multicriterial optimization based on intensive reuse of search information. J. Global Optim. 71(1), 73–90 (2018). https://doi. org/10.1007/s10898-018-0624-3 11. Gergel, V., Kozinov, E.: Multistage global search using various scalarization schemes in multicriteria optimization problems. In: Le Thi, H.A., Le, H.M., Pham Dinh, T. (eds.) WCGO 2019. AISC, vol. 991, pp. 638–648. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-21803-4 64 12. Strongin, R., Sergeyev, Y.: Global Optimization with Non-Convex Constraints: Sequential and Parallel Algorithms, 2nd edn. Kluwer Academic Publishers, Dordrecht (2013). (3rd edn. 2014) ˇ 13. Zhigljavsky, A., Zilinskas, A.: Stochastic Global Optimization. Springer, Boston (2008). https://doi.org/10.1007/978-0-387-74740-8 14. Locatelli, M., Schoen, F.: Global optimization: theory, algorithms, and applications. In: SIAM (2013) 15. Sergeyev, Y.D., Strongin, R.G., Lera, D.: Introduction to Global Optimization Exploiting Space-Filling Curves. Springer Briefs in Optimization. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-8042-6 ˇ 16. Paulaviˇcius, R., Zilinskas, J.: Simplicial Global Optimization. Springer Briefs in Optimization. Springer, New York (2014). https://doi.org/10.1007/978-1-46149093-7 17. Floudas, C.A., Pardalos, M.P.: Recent Advances in Global Optimization. Princeton University Press, Princeton (2016) 18. Gergel, V.P., Strongin, R.G.: Parallel computing for globally optimal decision making. In: Malyshkin, V.E. (ed.) PaCT 2003. LNCS, vol. 2763, pp. 76–88. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45145-7 7 19. Sergeyev, Y.D.: An information global optimization algorithm with local tuning. SIAM J. Optim. 5(4), 858–870 (1995) 20. Strongin, R.G., Sergeyev, Y.D.: Global multidimensional optimization on parallel computer. Parallel Comput. 18(11), 1259–1273 (1992) 21. Sergeyev, Y.D., Kvasov, D.E.: A deterministic global optimization using smooth diagonal auxiliary functions. Commun. Nonlinear Sci. Numer. Simul. 21(1–3), 99– 111 (2015) 22. Barkalov, K., Gergel, V., Lebedev, I.: Solving global optimization problems on GPU cluster. In: AIP Conference Proceedings, vol. 1738, p. 400006 (2016) https://doi. org/10.1063/1.4952194
30
V. Gergel and E. Kozinov
23. Sergeyev, Y.D., Grishagin, V.A.: Parallel asynchronous global search and the nested optimization scheme. J. Comput. Anal. Appl. 3(2), 123–145 (2001). https:// doi.org/10.1023/A:1010185125012 24. Sysoyev, A., Barkalov, K., Sovrasov, V., Lebedev, I., Gergel, V.: Globalizer – a parallel software system for solving global optimization problems. In: Malyshkin, V. (ed.) PaCT 2017. LNCS, vol. 10421, pp. 492–499. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-62932-2 47 25. Balandin, D.V., Kogan, M.M.: Multi-objective generalized H2 control for optimal protection from vibration. In: 2018 UKACC 12th International Conference on Control, vol. 8516721, pp. 205–210 (2018)
Enabling Hardware Affinity in JVM-Based Applications: A Case Study for Big Data Roberto R. Exp´ osito(B) , Jorge Veiga, and Juan Touri˜ no Universidade da Coru˜ na, CITIC, Computer Architecture Group, A Coru˜ na, Spain {roberto.rey.exposito,jorge.veiga,juan}@udc.es
Abstract. Java has been the backbone of Big Data processing for more than a decade due to its interesting features such as object orientation, cross-platform portability and good programming productivity. In fact, most popular Big Data frameworks such as Hadoop and Spark are implemented in Java or using other languages designed to run on the Java Virtual Machine (JVM) such as Scala. However, modern computing hardware is increasingly complex, featuring multiple processing cores aggregated into one or more CPUs that are usually organized as a Non-Uniform Memory Access (NUMA) architecture. The platformindependent features of the JVM come at the cost of hardware abstraction, which makes it more difficult for Big Data developers to take advantage of hardware-aware optimizations based on managing CPU or NUMA affinities. In this paper we introduce jhwloc, a Java library for easily managing such affinities in JVM-based applications and gathering information about the underlying hardware topology. To demonstrate the functionality and benefits of our proposal, we have extended FlameMR, our Java-based MapReduce framework, to provide support for setting CPU affinities through jhwloc. The experimental evaluation using representative Big Data workloads has shown that performance can be improved by up to 17% when efficiently exploiting the hardware. jhwloc is publicly available to download at https://github.com/rreye/jhwloc. Keywords: Big Data · Java Virtual Machine (JVM) affinity · MapReduce · Performance evaluation
1
· Hardware
Introduction
The emergence of Big Data technologies offers great opportunities for researchers and scientists to exploit unprecedented volumes of data sources in innovative ways, resulting in novel insight discovery and better decisions. Distributed processing frameworks are the great facilitators of the paradigm shift to Big Data, as they enable the storage and processing of large datasets and the application of analytics techniques to extract valuable information from such massive data. The MapReduce paradigm [5] introduced by Google in 2004 and then popularized by the open-source Apache Hadoop project [19] has been considered c Springer Nature Switzerland AG 2020 V. V. Krzhizhanovskaya et al. (Eds.): ICCS 2020, LNCS 12137, pp. 31–44, 2020. https://doi.org/10.1007/978-3-030-50371-0_3
32
R. R. Exp´ osito et al.
as a game changer to the way massive datasets are processed. During the past decade, there has been a huge effort in the development of Big Data frameworks. Some projects focus on adapting Hadoop to take advantage of specific hardware (RDMA-Hadoop [16]), or to provide improved performance for iterative algorithms by exploiting in-memory computations (Twister [7], Flame-MR [22]). Other frameworks have been designed from scratch to overcome other Hadoop limitations such as the lack of support for streaming computations, real-time processing and interactive analytics. Many of these frameworks are developed under the umbrella of the Apache Software Foundation: Storm [10], Spark [24], Flink [2], Samza [14]. Despite the large amount of existing frameworks, Hadoop and its ecosystem are still considered the cornerstone of Big Data as they provide the underlying core on top of which data can be processed efficiently. This core consists of the Hadoop Distributed File System (HDFS) [17], which allows to store and distribute data across a cluster of commodity machines, and Yet Another Resource Negotiator (YARN) [20], for the scalable management of the cluster resources. New frameworks generally only replace the Hadoop MapReduce data engine to provide faster processing speed. Unlike the C++-based original MapReduce implementation by Google, the entire Hadoop stack is implemented in Java to increase portability and ease of setup. As Big Data users are generally non-expert programmers, the use of Java provides multiple appealing features for them: object orientation, automatic memory management, built-in multithreading, easy-to-learn properties, good programming productivity and a wide community of developers. Moreover, later Java releases adopt concepts from other paradigms like functional programming. A core feature of Java is cross-platform portability: programs written on one platform can be executed on any combination of software and hardware with adequate runtime support. This is achieved by compiling Java code to platformindependent bytecode first, instead of directly to platform-specific native code. The bytecode instructions are then executed by the Java Virtual Machine (JVM) that is specific to the host operating system and hardware combination. Furthermore, modern JVMs integrate efficient Just-in-Time (JIT) compilers that can provide near-native performance from Java bytecode. Apart from Hadoop, most state-of-the-art Big Data frameworks are also implemented in Java (e.g., Flink, Storm 2, Flame-MR). Other frameworks rely on Scala (Spark, Samza), whose source code is compiled to Java bytecode so that the resulting executable runs on a JVM. However, the platform-independent feature provided by the JVM is only possible by abstracting most of the hardware layer away from developers. This makes it difficult or even impossible for them to access interesting low-level functionalities in JVM-based applications such as setting hardware affinities or gathering topology information for performing hardware-aware optimizations. The increasing complexity of multicore CPUs and the democratization of Non-Uniform Memory Access (NUMA) architectures [11] raise the need for exposing a portable view of the hardware topology to Java developers, while also providing an appropriate API to manage CPU and NUMA affinities. JVM languages in general, and Big Data frameworks in
Enabling Hardware Affinity in JVM Applications with jhwloc
33
particular, may take advantage of such API to exploit the hardware more efficiently without having to resort to non-portable, command-line tools that offer limited functionalities. The contributions of this paper are: – We present jhwloc, a Java library that provides a binding API to manage hardware affinities straightforwardly in JVM-based applications, as well as a means to gather information about the underlying hardware topology. – We implement the support for managing CPU affinity using jhwloc in FlameMR, a Java-based MapReduce framework, to demonstrate the functionality and benefit of our proposal. – We analyze the impact of affinity on the performance of in-memory data processing with Flame-MR by evaluating six representative Big Data workloads on a 9-node cluster. The remainder of the paper is organized as follows. Section 2 provides the background of the paper and summarizes the related work. Section 3 introduces the jhwloc library and its main features. Section 4 describes a case study of integrating jhwloc in Flame-MR, and presents the experimental evaluation. Finally, our concluding remarks are summarized in Sect. 5.
2
Background and Related Work
Exploiting modern hardware platforms requires in-depth knowledge of the underlying architecture together with appropriate expertise from the application behaviour. Current architectures provide a complex multi-level cache hierarchy with dedicated caches (one per core), a global cache (one for all the cores) or even partially shared caches. Moving a computing task from one core to another can cause performance degradation because of cache affinities. Simultaneous Multithreading (SMT) technologies [6] such as Intel Hyper-Threading (HT) [13] involve sharing computing resources of a single core between multiple logical Processing Units (PUs). This fact also means to share cache levels so that performance may be even reduced in some particular cases. Furthermore, NUMA architectures [11] are currently widely extended, in which memory is transparently distributed among CPUs connected through a cross-chip interconnect. Hence, an access from one CPU to the memory of another CPU (i.e., a remote memory access) incurs additional latency overhead due to transferring the data through the network. As an example, Fig. 1 shows the hierarchical organization of a typical NUMA machine with two octa-core CPUs that implement two-way SMT, so 8 cores and 16 PUs are available per CPU. All this complexity of the hardware topology of modern computing platforms is considered a critical aspect of performance and must be taken into account when trying to optimize parallel [18] and distributed applications [23]. Nowadays, the hardware locality (hwloc) project [9] is the most popular tool for exposing a static view of the topology of modern hardware, including CPU, memory and I/O devices. This project solves many interoperability issues due to the amount and variety of the sources of locality information for querying the
34
R. R. Exp´ osito et al.
Fig. 1. Overview of a NUMA system with two octa-core CPUs
topology of a system. To do so, hwloc combines locality information obtained from the operating system (e.g., /sys in Linux), from the hardware itself (cpuid instruction in x86), or using high-level libraries (numactl) and tools (lscpu). Moreover, hwloc offers APIs to interoperate with device-specific libraries (e.g., libibverbs) and allows binding tasks (e.g., process, thread) according to hardware affinities (CPU and memory) in a portable and abstracted way by exposing a unified interface, as different operating systems have diverse binding support. As a consequence, hwloc has become the de facto standard software for modeling NUMA systems in High Performance Computing (HPC) environments. In fact, it is used by most message-passing implementations, many batch queueing systems, compilers and parallel libraries for HPC. Unfortunately, hwloc only provides C-based APIs for gathering topology information and binding tasks, whereas the JVM does not offer enough support for developing hardware-aware applications as it only allows to obtain the number of cores available in the system. Overseer [15] is a Java-based framework to access low-level data such as performance counters, JVM internal events and temperature monitoring. Among its other features, Overseer also provides basic information about hardware topology through hwloc by resorting to the Java Native Interface (JNI). Moreover, it allows managing CPU affinity by relying on the Linux Kernel API (sched.h). However, the support provided by Overseer for both features is very limited. On the one hand, topology information is restricted to a few Java methods that
Enabling Hardware Affinity in JVM Applications with jhwloc
35
only allow to obtain the number of available hardware resources of a certain type (e.g., cores per CPU), without providing any further details about cache hierarchy (e.g., size, associativity), NUMA nodes (e.g., local memory) or additional functionality to manipulate topologies (e.g., filtering, traversing). On the other hand, support for CPU binding is limited to use Linux-specific affinity masks that determine the set of CPUs on which a task is eligible to run, without providing convenient methods to operate on such CPU set (e.g., logical operations) or utility methods for performing more advanced binding operations in an easy way (e.g., bind to one thread of the last core of the machine). Furthermore, no kind of memory binding is provided by Overseer. Our jhwloc library allows to overcome all these limitations by currently providing support for more than 100 methods as the Java counterparts of the hwloc ones. There exist few works that have evaluated the impact of managing hardware affinities on the performance of Big Data frameworks. In [1], authors analyze how NUMA affinity and SMT technologies affect Spark. They manage affinities using numactl and use hwloc to obtain the identifier of hardware threads. Their results reveal that performance degradation due to remote memory accesses is 10% on average. Authors in [4] characterize several TPC-H queries implemented on top of Spark, showing that NUMA affinity is slightly advantageous in preventing remote memory accesses. To manage NUMA affinity, they also rely on numactl. However, both studies are limited to evaluating a single machine, they lack the assessment of the impact of CPU binding on performance and they do not provide any useful API for gathering topology information and managing hardware affinities for JVM-based languages. Hence, our work extends the current state-of-the-art in all these directions.
3
Java Hardware Locality Library
The Java Hardware Locality (jhwloc) project has been designed as a wrapper library that consists of: (1) a set of standard Java classes that model hwloc functionalities using an object-oriented approach, and (2) the necessary native glue code that interfaces with the C-based hwloc API. In order to do so, jhwloc uses JNI to invoke the native methods implemented in C. Hence, jhwloc acts as a transparent bridge between a client application written in any JVM-based language and the hwloc library, but exposing a more friendly and object-oriented Java API to developers. One important advantage provided by jhwloc is that it frees Java developers from interacting directly with JNI calls, which is considered a cumbersome and time-consuming task. So, our library can be easily employed in Java applications to perform hardware-oriented performance tuning. 3.1
Java API
Currently, jhwloc provides Java counterparts for more than 100 hwloc functions, covering a significant part of its main functionality. A complete Javadoc
36
R. R. Exp´ osito et al.
documentation that describes all the public methods together with their parameters is publicly available at the jhwloc website. Basically, the jhwloc API enables Java developers to: (1) obtain the hierarchical hardware topology of key computing elements within a machine such as: NUMA nodes, shared/dedicated caches, CPU packages, cores and PUs (I/O devices are not yet supported); (2) gather various attributes from caches (e.g., size, associativity) and memory information; (3) manipulate hardware topologies through advanced operations (e.g., filtering, traversing); (4) build “fake” or synthetic topologies that allow querying them without having the underlying hardware available; (5) export topologies to XML files to reload them later; (6) manage hardware affinities in an easy way using bitmaps, which are sets of integers (positive or null) used to describe the location of topology objects on the CPU (CPU sets) and NUMA nodes (node sets); and (7) handle such bitmaps by providing advanced methods to operate over them through the hwloc bitmap API. It is important to remark that, unlike C, the JVM provides an automatic memory management mechanism through the built-in Garbage Collector (GC), which is in charge of performing memory allocation/deallocation without interaction from the programmer. Hence, the memory binding performed by the hwloc functions that manage memory allocation explicitly (e.g., alloc membind ) or migrate already-allocated data (set area membind ) cannot be supported in jhwloc. NUMA affinity management is thus restricted to those functions for performing implicit memory binding: set membind and get membind. These functions allow to define the current binding policy that will be applied to the subsequent calls to malloc-like operations performed by the GC. 3.2
Usage Example
As an illustrative usage example, Listing 1 presents a Java code snippet that shows the simplicity of use of the jhwloc API. Basic hardware topology information such as the number of cores and PUs and the available memory is obtained (lines 6–11) after creating and initializing an instance of the HwlocTopology class (lines 1–4), which represents an abstraction of the underlying hardware. Most of the jhwloc functionality is provided through this class with more than 50 Java methods available that allow to manipulate, traverse and browse the topology, as well as to perform CPU and memory binding operations. Next, the example shows how to manage CPU affinities by binding the current thread to the CPU set formed by the first and last PU of the machine. As can be seen, the Java objects that represent those PUs, which are instances of the HwlocObject class, can be easily obtained by using their indexes (lines 13–15). The CPU set objects from both PUs are then operated using a logical and (lines 16–17), and the returned CPU set is used to perform the actual CPU binding of the current thread (lines 18–19). The logical and operation is an example of the more than 30 Java methods that are provided to conform with the hwloc bitmap API, supported in jhwloc through the HwlocBitmap abstract class. This class is extended by the HwlocCPUSet and HwlocNodeSet subclasses that provide concrete implementations for representing CPU and NUMA node sets, respectively.
Enabling Hardware Affinity in JVM Applications with jhwloc 1 2 3 4
37
// Create , i n i t i a l i z e and l o a d a t o p o l o g y o b j e c t HwlocTopology topo = new HwlocTopology ( ) ; topo . i n i t ( ) ; topo . l o a d ( ) ;
5 6 7 8 9 10 11
// Get t h e number o f c o r e s , PUs , and t h e a v a i l a b l e memory i n t nc = topo . g e t n b o b j s b y t y p e (HWLOC. OBJ CORE ) ; i n t np = topo . g e t n b o b j s b y t y p e (HWLOC. OBJ PU ) ; long mem = topo . g e t r o o t o b j ( ) . getTotalMemory ( ) ; System . out . p r i n t l n ( ‘ ‘# Cores /PUs : ’ ’+nc + ‘ ‘/ ’ ’+np ) ; System . out . p r i n t l n ( ‘ ‘ T o t a l memory : ’ ’+mem ) ;
12 13 14 15 16 17 18 19
// Get f i r s t and l a s t PU o b j e c t s HwlocObject fpu = topo . g e t o b j b y t y p e (HWLOC. OBJ PU , 0 ) ; HwlocObject l p u = topo . g e t o b j b y t y p e (HWLOC. OBJ PU , np −1); // L o g i c a l ‘ and ’ o v e r t h e CPU s e t s o f b o t h PUs HwlocCPUSet c p u s e t = fpu . getCPUSet ( ) . and ( l p u . getCPUSet ( ) ) ; // Bind c u r r e n t t h r e a d t o t h e r e t u r n e d CPU s e t topo . s e t c p u b i n d ( c p u s e t , EnumSet . o f (HWLOC.CPUBIND THREAD ) ;
Listing 1. Getting hardware topology information and managing CPU affinities
More advanced usage examples are provided together with the jhwloc source code. These examples include NUMA binding, manipulating bitmaps, obtaining cache information, traversing topologies, exporting topologies to XML and building synthetic ones, among other jhwloc functionalities.
4
Impact of CPU Affinity on Performance: Flame-MR Case Study
This section analyzes the impact of setting CPU affinity on the performance of Flame-MR, our Big Data processing framework. First, Flame-MR is briefly introduced in Sect. 4.1. Next, Sect. 4.2 describes how jhwloc has been integrated into Flame-MR to manage CPU affinities, explaining the different affinity levels that are supported. Section 4.3 details the experimental testbed, and finally Sect. 4.4 discusses the results obtained. 4.1
Flame-MR Overview
Flame-MR [22] is a Java-based MapReduce implementation that transparently accelerates Hadoop applications without modifying their source code. To do so, Flame-MR replaces the underlying data processing engine of Hadoop by an optimized, in-memory architecture that leverages system resources more efficiently. The overall architecture of Flame-MR is based on the deployment of several Worker processes (i.e., JVMs) over the nodes of a cluster (see Fig. 2). Relying on
38
R. R. Exp´ osito et al.
an event-driven architecture, each Worker is in charge of executing the computational tasks (i.e., map/reduce operations) to process the input data from HDFS. The tasks are executed by a thread pool that can perform as many concurrent operations as the number of cores configured for each Worker. These operations are efficiently scheduled in order to pipeline data processing and data movement steps. The data pool allocates memory buffers in an optimized way, reducing the amount of buffer creations to minimize garbage collection overheads. Once the buffers are filled with data, they are stored into in-memory data structures to be processed by subsequent operations. Furthermore, these data structures can be cached in memory between MapReduce jobs to avoid writing intermediate results to HDFS, thus providing efficient iterative computations.
Fig. 2. Flame-MR Worker architecture
4.2
Managing CPU Affinities in Flame-MR
Flame-MR has been extended to use the functionalities provided by jhwloc to enable the binding of computational tasks to the hardware processing elements available in the system (i.e., CPU/cores/PUs). To do so, the software components that manage such tasks, Worker and thread pool classes, have been modified to make them aware of the hardware affinity level that is set by the user through the configuration file of Flame-MR. When Flame-MR starts a computational task, it first checks the configuration file to determine if the jhwloc library must be called and, if so, the specific affinity level that must be enforced. Then, the Worker initializes a HwlocTopology object and uses the set cpubind method provided by jhwloc to bind computational tasks to the appropriate hardware. The configuration of the Workers is affected by the affinity level being used, as the number of threads launched by them to execute map/reduce operations should be adapted to the hardware characteristics of the underlying system. The intuitive recommendation would
Enabling Hardware Affinity in JVM Applications with jhwloc
39
be to create as many Workers as CPUs, and as many threads per Worker as cores/PUs available in the nodes. The CPU affinity levels currently supported by Flame-MR through jhwloc are: – NONE: Flame-MR does not manage hardware affinities in any way (i.e., jhwloc is not used). – CPU: Workers are bound to specific CPUs. Each JVM process that executes a Worker is bound to one of the available CPUs in the system using the jhwloc flag HWLOC.CPUBIND PROCESS. Hence, each thread launched by a Worker is also bound to the same CPU, which means that the OS scheduler can migrate threads among its cores. The mapping between Workers and CPUs is done cyclically by allocating each Worker to a different CPU until all the CPUs are used, starting again if there are remaining Workers. – CORE: map/reduce operations are bound to specific cores. Each thread launched by a Worker to perform such operations is bound to one of the available cores in the system using the flag HWLOC.CPUBIND THREAD. So, the OS scheduler can migrate threads among the PUs of a core (if any). The mapping between threads and cores is done by allocating a group of cores from the same CPU to each Worker. Note that the number of threads used by all Workers executed in a node should not exceed the number of cores to avoid resource oversubscription. – PU: map/reduce operations are bound to specific PUs. Each thread launched by a Worker to perform such operations is bound to one of the available PUs in the system using the flag HWLOC.CPUBIND THREAD. The mapping between threads and PUs is done by allocating a group of cores to each Worker, and then distributing its threads over the PUs of those cores in a cyclic way. Note also that the number of threads used by all Workers on a node should not exceed the number of PUs to avoid oversubscription. It is important to note that all the threads launched by a Worker process are created during the Flame-MR start-up phase. So, jhwloc is only accessed once to set the affinity level, avoiding any JNI overhead during data processing. 4.3
Experimental Configuration
Six MapReduce workloads from four domains that represent different Big Data use cases have been evaluated: (1) data sorting (Sort), (2) machine learning (K-Means), (3) graph processing (PageRank, Connected Components), and (4) genome sequence analysis (MarDRe, CloudRS). Sort is an I/O-bound microbenchmark that sorts an input text dataset generated randomly. K-Means is an iterative clustering algorithm that classifies an input set of N samples into K clusters. PageRank and Connected Components are popular iterative algorithms for graph processing. PageRank obtains a ranking of the elements of a graph taking into account the number and quality of the links to each one, and Connected Components explores a graph to determine its subnets. MarDRe [8] and CloudRS [3] are bioinformatics tools for preprocessing genomics datasets.
40
R. R. Exp´ osito et al.
MarDRe removes duplicate and near-duplicate DNA reads, whereas CloudRS performs read error correction. In the experiments conducted in this paper, Sort processes a 100 GB dataset and K-Means performs a maximum of five iterations over a 35 GB dataset (N = 360 million samples) using 200 clusters (K = 200). Both PageRank and Connected Components execute five iterations over a 40 GB dataset (60 million pages). MarDRe removes duplicate reads using the SRR377645 dataset, named after its accession number in the European Nucleotide Archive (ENA) [12], which contains 241 million reads of 100 base pairs each (67 GB in total). CloudRS corrects read errors using the SRR921890 dataset, which contains 16 million reads of 100 base pairs each (5.2 GB in total). Both genomics datasets are publicly available to download at ENA website. The experiments have been carried out on a 9-node cluster with one master and eight slave nodes running Flame-MR version 1.2. Each node consists of a NUMA system with two Intel Xeon E5-2660 octa-core CPUs. This CPU model features two-way Intel HT, so 8 cores and 16 logical PUs are available per CPU (i.e., 16 and 32 per node, respectively). Each node has a total of 64 GiB of memory evenly distributed between the two CPUs. The NUMA architecture just described is the one previously shown in Fig. 1, which also details the cache hierarchy. Additionally, each node has one local disk of 800 GiB intended for both HDFS and intermediate data storage during the execution of the workloads. Nodes are interconnected through Gigabit Ethernet (1 Gbps) and InfiniBand FDR (56 Gbps). The cluster runs Linux CentOS 6.10 with kernel release 2.6.32– 754.3.5, whereas the JVM version is Oracle JDK 10.0.1. To deploy and configure Flame-MR on the cluster, the Big Data Evaluator (BDEv) tool [21] has been used for ease of setup. Two Workers (i.e., two JVM processes) have been executed per slave node, since our preliminary experiments proved it to be the best configuration for Flame-MR on this system. Regarding the number of threads per Worker, two different configurations have been evaluated: (1) using as many threads per Worker as cores per CPU (8 threads), and (2) using as many threads per Worker as PUs per CPU (16 threads), thus also evaluating the impact of Intel HT on performance. Finally, the metric shown in the following graphs corresponds to the median runtime for a set of 10 executions for each experiment, clearing the OS buffer cache of the nodes between each execution. Variability is represented in the graphs by using error bars to indicate the minimum and maximum runtimes. 4.4
Performance Results
Figure 3 presents the measured runtimes of Flame-MR for all the workloads when using different affinity levels and Worker configurations as previously described. When running 8 threads per Worker (i.e., not using Intel HT), all the workloads benefit from enforcing some hardware affinity, although the best level to use varies for each workload. On the one hand, the performance improvements for Sort (Fig. 3a) and K-Means (Fig. 3b) with respect to the baseline scenario (i.e., without using affinity) are lower than for the remaining workloads. The main
Enabling Hardware Affinity in JVM Applications with jhwloc
41
reason is that both workloads are the most I/O-intensive codes under evaluation, being clearly bottlenecked by disk performance (only one disk per slave node is available). This fact limits the potential benefits of using an enforced hardware placement. Nevertheless, the improvements obtained are up to 3% and 4% using the PU and CPU affinity levels for Sort and K-Means, respectively. On the other hand, the performance improvements for the remaining workloads (see Figs. 3c–3f) are significantly higher: up to 13%, 17%, 16% and 11% for PageRank, Connected Components, MarDRe and CloudRS, respectively. Although the best affinity level depends on each particular workload (e.g., CPU for PageRank, CORE for CloudRS), it can be concluded that the performance differences between different levels are generally small. Note that the workloads benefit not only from accelerating a single Worker (i.e., JVM) using CPU binding, but also from reducing the impact of the synchronizations among all Workers, performed between the global map and reduce phases for each MapReduce job. Moreover, this reduction in the runtimes is obtained as a zero-effort transparent configuration for Flame-MR users, without recompiling/modifying the source code of the workloads. Regarding the impact of Intel HT on performance (i.e., running 16 threads per Worker ), note that the results for the CORE affinity level cannot be shown when using two Workers per node since slave nodes have 16 physical cores, as mentioned in Sect. 4.3. In the HT scenario, K-Means, MarDRe and CloudRS take clear advantage of this technology. In the case of K-Means (see Fig. 3b), execution times are reduced with respect to the non-HT counterparts by 16% and 12% for the baseline and CPU affinity scenarios, respectively. The improvements for these two scenarios increase up to 26% and 32% for CloudRS (Fig. 3f), and 48% and 41% for MarDRe (Fig. 3e). However, the impact of HT can be considered negligible for Sort, as shown in Fig. 3a, whereas the performance of PageRank and Connected Components is even reduced in most scenarios. Note that using HT technology implies that the two logical PUs within a physical core must share not only all levels of the cache hierarchy, but also some of the computational units. This fact can degrade the performance of CPU-bound workloads due to increased cache miss rates and resource contention, as it seems to be the case with PageRank. Finally, the impact of enforcing hardware affinity when using HT can only be clearly appreciated for CloudRS, providing a reduction in the execution time of 14% when using the CPU affinity level. We can conclude that there is no one-size-fits-all solution, since the best affinity level depends on each workload and its particular resource characterization. Furthermore, the impact of managing CPU affinities is clearly much more beneficial when HT is not used. However, the results shown in this section reinforce the utility and performance benefits of managing hardware affinities in Big Data JVM-based frameworks such as Flame-MR.
42
R. R. Exp´ osito et al. Sort NONE
K-Means
CPU affinity CPU CORE
2000
PU
Runtime (seconds)
Runtime (seconds)
1000 800 600 400 200 0
8
NONE
800 400
8
(b)
PageRank PU
NONE
Runtime (seconds)
Runtime (seconds)
Connected Components 2000
CPU affinity CPU CORE
2400 1800 1200 600 0
8
400
8
(d) 3000 PU
NONE
Runtime (seconds)
Runtime (seconds)
CloudRS
CPU affinity CPU CORE
800 600 400 200
8
16
#Threads/Worker
(e)
16
#Threads/Worker
MarDRe
0
PU
800
(c)
NONE
CPU affinity CORE
1200
0
16
CPU
1600
#Threads/Worker
1000
16
#Threads/Worker
(a)
NONE
PU
1200
#Threads/Worker
3000
CPU affinity CORE
1600
0
16
CPU
CPU
CPU affinity CORE
PU
2400 1800 1200 600 0
8
16
#Threads/Worker
(f)
Fig. 3. Runtimes of Flame-MR for different workloads and CPU affinity levels
5
Conclusions
The complexity of current computing infrastructures raises the need for carefully placing applications on them so that affinities can be efficiently exploited by the hardware. However, the standard class library provided by the JVM lacks support for developing hardware-aware applications, preventing them from taking advantage of managing CPU or NUMA affinities. As most popular distributed processing frameworks are implemented using languages executed by the JVM, having such affinity support can be of great interest for the Big Data community.
Enabling Hardware Affinity in JVM Applications with jhwloc
43
In this paper we have introduced jhwloc, a Java library that exposes an object-oriented API that allows developers to gather information about the underlying hardware and bind tasks according to it. Acting as a wrapper between the JVM and the C-based hwloc library, the de facto standard in HPC environments, jhwloc enables JVM-based applications to easily manage affinities to perform hardware-aware optimizations. Furthermore, we have extended our Java MapReduce framework Flame-MR to include support for setting hardware affinities through jhwloc, as a case study to demonstrate the potential benefits. The experimental results, running six representative Big Data workloads on a 9-node cluster, have shown that performance can be transparently improved by up to 17%. Other popular JVM-based Big Data frameworks such as Hadoop, Spark, Flink, Storm and Samza could also take advantage of the features provided by jhwloc in a similar way to Flame-MR. The source code of the jhwloc library is released under the open-source GNU GPLv3 license and is publicly available together with the Javadoc documentation at https://github.com/rreye/jhwloc. As future work, we aim to explore the impact of setting NUMA affinities on JVM performance when using different garbage collection algorithms. We also plan to extend jhwloc to provide other functionalities such as gathering information about the network topology. Acknowledgments. This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union [ref. TIN201675845-P (AEI/FEDER/EU)]; and by Xunta de Galicia and FEDER funds [Centro de Investigaci´ on de Galicia accreditation 2019–2022, ref. ED431G2019/01, Consolidation Program of Competitive Reference Groups, ref. ED431C2017/04].
References 1. Awan, A.J., Vlassov, V., Brorsson, M., Ayguade, E.: Node architecture implications for in-memory data analytics on scale-in clusters. In: Proceedings of 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT 2016), Shanghai, China, pp. 237–246 (2016) 2. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink: stream and batch processing in a single engine. Bull. IEEE Tech. Comm. Data Eng. 38(4), 28–38 (2015) 3. Chen, C.C., Chang, Y.J., Chung, W.C., Lee, D.T., Ho, J.M.: CloudRS: an error correction algorithm of high-throughput sequencing data based on scalable framework. In: Proceedings of IEEE International Conference on Big Data (IEEE BigData 2013), Santa Clara, CA, USA, pp. 717–722 (2013) 4. Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on Apache Spark. In: Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2016), Uppsala, Sweden, pp. 112–121 (2016) 5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008) 6. Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., Stamm, R.L., Tullsen, D.M.: Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17(5), 12–19 (1997)
44
R. R. Exp´ osito et al.
7. Ekanayake, J., et al.: Twister: a runtime for iterative MapReduce. In: Proceedings of 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, IL, USA, pp. 810–818 (2010) 8. Exp´ osito, R.R., Veiga, J., Gonz´ alez-Dom´ınguez, J., Touri˜ no, J.: MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud. Bioinformatics 33(17), 2762–2764 (2017) 9. Goglin, B.: Managing the topology of heterogeneous cluster nodes with hardware locality (HWLOC). In: Proceedings of International Conference on High Performance Computing & Simulation (HPCS 2014), Bologna, Italy, pp. 74–81 (2014) 10. Iqbal, M.H., Soomro, T.R.: Big Data analysis: Apache Storm perspective. Int. J. Comput. Trends Technol. 19(1), 9–14 (2015) 11. Lameter, C.: NUMA (Non-Uniform Memory Access): an overview. ACM Queue 11(7), 40:40–40:51 (2013) 12. Leinonen, R., et al.: The European nucleotide archive. Nucleic Acids Res. 39, D28–D31 (2011) 13. Marr, D.T., et al.: Hyper-threading technology architecture and microarchitecture. Intel Technol. J. 6(1), 1–12 (2002) 14. Noghabi, S.A., et al.: Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10(12), 1634–1645 (2017) 15. Peternier, A., Bonetta, D., Binder, W., Pautasso, C.: Tool demonstration: overseer - low-level hardware monitoring and management for Java. In: Proceedings of 9th International Conference on Principles and Practice of Programming in Java (PPPJ 2011), Kongens Lyngby, Denmark, pp. 143–146 (2011) 16. Wasi-ur Rahman, M., et al.: High-performance RDMA-based design of Hadoop MapReduce over InfiniBand. In: Proceedings of IEEE 27th International Symposium on Parallel & Distributed Processing, Workshops & PHD Forum (IPDPSW 2013), Boston, MA, USA, pp. 1908–1917 (2013) 17. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: Proceedings of IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 2010), Incline Village, NV, USA, pp. 1–10 (2010) 18. Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.: Data and thread affinity in OpenMP programs. In: Proceedings of Workshop on Memory Access on Future Processors: A Solved Problem? (MAW 2008), Ischia, Italy, pp. 377–384 (2008) 19. The Apache Hadoop project: http://hadoop.apache.org. Accessed 31 Mar 2020 20. Vavilapalli, V.K., et al.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of 4th Annual Symposium on Cloud Computing (SOCC 2013), Santa Clara, CA, USA, pp. 5:1–5:16 (2013) 21. Veiga, J., Enes, J., Exp´ osito, R.R., Touri˜ no, J.: BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks. Future Gener. Comput. Syst. 86, 565–581 (2018) 22. Veiga, J., Exp´ osito, R.R., Taboada, G.L., Touri˜ no, J.: Flame-MR: an event-driven architecture for MapReduce applications. Future Gener. Comput. Syst. 65, 46–56 (2016) 23. Wang, L., Ren, R., Zhan, J., Jia, Z.: Characterization and architectural implications of Big Data workloads. In: Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2016), Uppsala, Sweden, pp. 145–146 (2016) 24. Zaharia, M., et al.: Apache spark: a unified engine for Big Data processing. Commun. ACM 59(11), 56–65 (2016)
An Optimizing Multi-platform Source-to-source Compiler Framework for the NEURON MODeling Language Pramod Kumbhar1 , Omar Awile1 , Liam Keegan1 , Jorge Blanco Alonso1 , urmann1(B) James King1 , Michael Hines2 , and Felix Sch¨ 1
2
Blue Brain Project, Ecole Polytechnique F´ed´erale de Lausanne (EPFL), Campus Biotech, 1202 Geneva, Switzerland [email protected] Department of Neuroscience, Yale University, New Haven, CT 06510, USA
Abstract. Domain-specific languages (DSLs) play an increasingly important role in the generation of high performing software. They allow the user to exploit domain knowledge for the generation of more efficient code on target architectures. Here, we describe a new code generation framework (NMODL) for an existing DSL in the NEURON framework, a widely used software for massively parallel simulation of biophysically detailed brain tissue models. Existing NMODL DSL transpilers lack either essential features to generate optimized code or capability to parse the diversity of existing models in the user community. Our NMODL framework has been tested against a large number of previously published user models and offers high-level domain-specific optimizations and symbolic algebraic simplifications before target code generation. NMODL implements multiple SIMD and SPMD targets optimized for modern hardware. When comparing NMODL-generated kernels with NEURON we observe a speedup of up to 20×, resulting in overall speedups of two different production simulations by ∼7×. When compared to SIMD optimized kernels that heavily relied on autovectorization by the compiler still a speedup of up to ∼2× is observed. Keywords: NEURON
1
· HPC · DSL · Code generation · Neuroscience
Introduction
The use of large scale simulation in modern neuroscience is becoming increasingly important (e.g. [2,24]) and has been enabled by substantial performance progress in neurosimulation engines over the last decade and a half (e.g. [14,17,19,20,27,28]). While excellent scaling has been achieved on a variety of platforms with the conversion to vectorized implementations, domain specific knowledge expressed in the models is not yet optimally used. In other fields, the use of DSLs and subsequent code-to-code translation have been effective in generating efficient codes and allowing easy adaptation to new architectures c Springer Nature Switzerland AG 2020 V. V. Krzhizhanovskaya et al. (Eds.): ICCS 2020, LNCS 12137, pp. 45–58, 2020. https://doi.org/10.1007/978-3-030-50371-0_4
46
P. Kumbhar et al.
[8,9,30,31]. This is becoming more important as the architectural diversity of hardware is increasing. Motivated by these observations, we have revisited the widely adopted NEURON simulator [11], which enables simulations of biophysically detailed neuron models on computing platforms ranging from desktop to petascale supercomputers, and which has over 2,000 reported scientific studies using it. One of the key features of the NEURON simulator is extendability via a domain specific language (DSL) layer called the NEURON Model Description Language (NMODL) [12]. NMODL allows the neuroscientist to extend NEURON by incorporating a wide range of membrane and intracellular submodels. The domain scientist can easily express these channel properties in terms of algebraic and ordinary differential equations, kinetic schemes in NMODL without worrying about lower level implementation details. The rate limiting aspect for performance of NEURON simulations is the execution of channels and synapses written in the NMODL DSL. The code generated from NMODL often accounts for more than 80% of overall execution time. There are more than six thousand NMODL files that are shared by the NEURON user community on the ModelDB platform [29]. As the type and number of mechanisms differ from model to model, hand-tuning of the generated code is not feasible. The goal of our NMODL Framework is to provide a tool that can parse all existing models, and generate optimized code from NMODL DSL code, which is responsible for more than 80% of the total simulation time. Here we present our effort in building a new NMODL source-to-source compiler that generates C++, OpenMP, OpenACC, CUDA and ISPC targeting modern SIMD hardware platforms. We also describe several techniques we employ in the NMODL Framework, which we believe to be useful beyond the immediate scope of the NEURON simulation framework.
2
Related Work
The reference implementation for the NMODL DSL specification is found in nocmodl [13], a component in the NEURON simulator. Over the years nocmodl underwent several iterations of development and gained support for a number of newer language constructs. One of the major limitations of nocmodl is its lack of flexibility. Instead of constructing an intermediate representation, such as an Abstract Syntax Tree (AST), it performs many code generation steps on the fly, while parsing the input. This leaves little room for performing global analysis, optimizations, or targeting a different simulator altogether. The CoreNEURON [19] library uses a modified version of nocmodl called mod2c [3], which duplicates most of the legacy code and has some of the same limitations as nocmodl. Pynmodl [23] is a Python based parsing and post-processing tool for NMODL. The primary focus of pynmodl is to parse and translate NMODL DSL to other computational neuroscience DSLs but does not support code generation for a particular simulator. The modcc source-to-source compiler is being developed as part of the Arbor simulator [1]. It is able to generate from NMODL DSL code,
An Optimizing Multi-platform Source-to-source Compiler Framework
47
optimized C++/CUDA to be used with the Arbor simulator. It only implements a subset of the NMODL DSL specification and hence is only able to process a modest number of existing models available in ModelDB [29]. For a more comprehensive review of current code-generator techniques in computational neuroscience we refer the reader to Blundell et al. [4]. Other fields have adopted DSLs and code generation techniques [9,25,31] but they are not yet fully exploited in the context of NMODL. We conclude that current NMODL tools either lack support for the full NMODL DSL specification, lack the necessary flexibility to be used as a generic code generation framework, or are unable to adequately take advantage of modern hardware architectures, and thus are missing out on available performance from modern computing architectures.
3
NMODL DSL
In most simple terms, the NEURON simulator deals with two aspects of neuronal tissue simulations: 1) the exchange of spiking events between neuronal cells and 2) the numerical integration of a spatially discretized cable equation that is equipped with additional terms describing transmembrane currents resulting from ion channels and synapses. The NMODL DSL allows the modelers to efficiently express these transmembrane mechanisms. As an example, many models of neurons use a non-linear combination of the following basic ordinary differential equation to describe the kinetics of ion channels first developed by Hodking and Huxley [15]: dV dt dn dt dm dt dh dt
= I − g¯N a m3 h(V − VN a ) − g¯K n4 (V − VK ) − gL (V − VL ) /C
(1)
= αn (V )(1 − n) − βn (V )n
(2)
= αm (V )(1 − m) − βm (V )m
(3)
= αh (V )(1 − h) − βh (V )h
(4)
where V is the membrane potential, I is the membrane current, gi is conductance per unit area for ith ion channel, n, m, and h are dimensionless quantities between 0 and 1 associated with channel activation and inactivation, Vi is the reversal potential of the ith ion channel, C is the membrane capacitance, and αi and βi are rate constants for the ith ion channel, which depend on voltage but not time. Figure 1 shows a simplified NMODL DSL fragment of a specific ion channel, a voltage-gated calcium ion channel, published in Traub et al. [32]. Our example highlights the most important language constructs and serves as an example for the DSL-level optimizations presented in the following sections. The NMODL language specification can be found in [12]. At DSL level a lot of information is expressed implicitly that can be used to generate efficient code and expose more parallelism, e.g.:
48 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
P. Kumbhar et al. TITLE T-calcium channel (adapted) channel dependency NEURON { SUFFIX cat USEION ca READ cai, cao WRITE ica RANGE gcatbar, ica, gcat, hinf, minf, mtau, htau } PARAMETER { celsius = 25 (degC) cao = 2 (mM) }
memory footprint
constant variables with limited precision
STATE { m h } ASSIGNED { ica gcat hinf htau minf mtau } BREAKPOINT { SOLVE states METHOD cnexp gcat = gcatbar*m*m*h ica = gcat*ghk(v,cai,cao) }
often memory bound
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
DERIVATIVE states { rates(v) m' = (minf - m)/mtau h' = (hinf - h)/htau }
often compute bound
PROCEDURE rates(v(mV)) { LOCAL a, b, qt qt= q10^((celsius-25)/10) if inlined into DERIVATIVE minf, mtau could become a = 0.2*(-1.0*v+19.26) local variables b = 0.009*exp(-v/22.03) minf = a/(a+b) mtau = betmt(v)/(qt*a0m*(1+alpmt(v))) } FUNCTION ghk(v(mV), ci(mM), co(mM)) (mV) { LOCAL nu,f elemental function f = KTF(celsius)/2 nu = v/f VERBATIM unsafe for optimisations, // C code implementation need C lexer/parser ENDVERBATIM ghk = -f*(1.0-(ci/co)*exp(nu))*efun(nu) }
Fig. 1. NMODL example of a simplified model of a voltage-gated calcium channel showing different NMODL constructs and summary of optimization information available at DSL level. Keywords are printed in uppercase and marked with boldface.
– USEION statement describes the dependency between channels and can be used to build the runtime dependency graph to exploit micro-parallelism [22] – PARAMETER block describes the variables that are constant, often can be stored with limited precision. – ASSIGNED statement describes modifiable variables and can be allocated in fast memory. – DERIVATIVE, KINETIC and SOLVE describes ODEs which can be analyzed and solved analytically to improve the performance as well as accuracy. – BREAKPOINT describes current and voltage relation. If this is ohmic then one can use analytical expression instead of numerical derivatives to improve the accuracy as well as performance. – PROCEDURE can be inlined at DSL level to eliminate RANGE variables and thereby significantly reduce memory access cost as well as memory footprint. To use this information and perform such optimizations, often a global analysis of the NMODL DSL is required. For example, to perform inlining of a PROCEDURE one needs to find all function calls and recursively inline the function bodies. As nocmodl lacks the intermediate AST representation, this type of analysis is difficult to perform and such optimizations are not implemented. The NMODL Framework is designed to exploit such information from DSL specification and perform optimizations.
4
Design and Implementation
The implementation of the NMODL Framework can be broken down into four main components: lexer/parser implementation, DSL level optimisation passes, ODE solvers, and code generation passes. Figure 2 summarizes the overall architecture of NMODL Framework. As in any compiler framework, lexing and parsing are the
An Optimizing Multi-platform Source-to-source Compiler Framework
hh mod na mod ca mod
Python Bindings
User Preferred Interface NMODL
C
Inlining
LoopUnroll
DefUse
Optimization Passes
A MOD Collection
ODEs
B
Units
SymPy
Solver
NMODL
Renaming
Conductance
driver
Localizer
Intermediate Utility AST
Kinetic
Parser
49
Passes
Passes
ODESolver
Look Up Performance
Abstract Syntax Tree (AST) Initial
tokens
Lexer
Code Generation Backends
C Detail Improved Abstract Syntax Tree
AST to C++ Transformer
Default
OpenMP
CUDA
ISPC
OpenACC
AST to NMODL Transformer
AST to Custom Transformer
Fig. 2. Architecture of the NMODL Code Generation Framework showing: A) Input NMODL files are processed by different lexers & parsers generating the AST; B) Different analysis and optimisation passes further transform the AST; and C) The optimised AST is then converted to low level C++ code or other custom backends
first two steps performed on an input NMODL. The lexer implementation is based on the popular flex package and bison is used as the parser generator. The ODEs, units and inline C code need extra processing and hence separate lexers and parsers are implemented. DSL level optimizations and code generation are main aspects of this framework and discussed in detail in subsequent sections. 4.1
Optimization Passes
Modern compilers implement various passes for code analysis, code transformation, and optimized code generation. Optimizations such as constant folding, inlining, and loop unrolling are commonly found in all of today’s major compilers. For example, the LLVM compiler framework [21] features more than one hundred compiler passes. In the context of the NMODL Framework, we focus on a few optimization passes with very specific objectives. By taking advantage of domain-specific and high-level information that is available in the DSL but later lost in the lower level C++ or CUDA code, we are able to provide additional significant improvements in code performance compared to native compiler optimizations. For example, all NMODL RANGE, ASSIGNED, and PARAMETER variables are translated to double type variables in C++. Once this transformation is done, C/C++ compilers can no longer infer these high-level semantics from these variables. Another example is RANGE to LOCAL transformations with the help of PROCEDURE inlining discussed in Sect. 3. All RANGE variables in the NMODL DSL are converted to array variables and are dynamically allocated in C++. Once this transformation is done, the C/C++ compiler can only do limited optimizations.
50
P. Kumbhar et al.
To facilitate the DSL level optimizations summarized in Sect. 3, we have implemented the following optimization passes. Inlining: To facilitate optimizations such as RANGE to LOCAL conversion and facilitate other code transformations, the Inlining pass performs code inlining of PROCEDURE and FUNCTION blocks at their call sites. Variable Usage Analysis: Different variable types such as RANGE, GLOBAL, ASSIGNED can be analysed to check where and how often they are used. The Variable Usage Analysis pass implements Definition-Use (DU) chains [18] to perform data flow analysis. Localiser: Once function inlining is performed, DU chains can be used to decide which RANGE variables can be converted to LOCAL variables. The Localiser pass is responsible for this optimization. Constant Folding and Loop Unrolling: The KINETIC and DERIVATIVE blocks can contain coupled ODEs in WHILE or FOR loop statements. In order to analyse these ODEs with SymPy (see Subsect. 4.4), first we need to perform constant folding to know the iteration space of the loop and then perform loop unrolling to make all ODE statements explicit. 4.2
Code Generation
Once DSL and symbolic optimizations (see Subsect. 4.3) are performed on the AST, the NMODL Framework is ready to proceed to the code generation step (cf. Fig. 2). The C++ code generator plays a special role, since it constitutes the base code generator extended by all other implementations. This allows easy implementation of a new target by overriding only necessary constructs of the base code generator. To better leverage specific hardware platform features such as SIMD, multithreading or SPMD we have further implemented code generation targets for OpenMP, OpenACC, ISPC (Intel SPMD Program Compiler ) and experimentally CUDA. We chose ISPC for its performance portability and support for all major vector extensions on x86 (SSE4.2, AVX2, AVX-512), ARM NEON and NVIDIA GPUs (using NVPTX) giving us the ability to generate optimized SIMD code for all major computing platforms. We have, furthermore, extended the C++ target with an OpenMP and an OpenACC backend. These two code generators emit code that is largely identical to the C++ code generator but add appropriate pragma annotations to support OpenMP shared-memory parallelism and OpenACC GPU acceleration. Finally, our code-generation framework supports CUDA as a main backend to target NVIDIA GPUs.
An Optimizing Multi-platform Source-to-source Compiler Framework
4.3
51
ODE Solvers
NMODL allows the user to specify the equations that define the system to be simulated in a variety of ways. – The KINETIC block describes the system using a mass action kinetic scheme of reaction equations. – The DERIVATIVE block specifies a system of coupled ODEs (note that any kinetic scheme can also be written as an equivalent system of ODEs.) – Users can also specify systems of algebraic equations to be solved. The LINEAR and NONLINEAR blocks respectively specify systems of linear and nonlinear algebraic equations (applying a numerical integration scheme to a system of ODEs typically also results in a system of algebraic equations to solve.) To reduce duplication of functionality for dealing with these related systems of equations, we implemented a hierarchy of transformations as shown in Fig. 3. First, any KINETIC blocks of mass action kinetic reaction statements are translated to DERIVATIVE blocks of the equivalent ODE system. Linear and independent ODEs are solved analytically. Otherwise a numerical integration scheme such as implicit Euler is used which results in a system of algebraic equations equivalent to a LINEAR or NONLINEAR block. If the system is linear and small, it is solved analytically at compile time using symbolic Gaussian elimination. Optionally, Common Subexpression Elimination (CSE) [7] can then be applied. KINETIC kin {
SymPy Based Code Generation
NMODL DERIVATIVE Abstract Syntax Tree
KINETIC Visitor
Compile-Time
cnexp/euler
Equations solved analytically at compile time
CSE
~ x y (a,b) ~ z void matmul ( const Matrix & A , const Matrix & B , Matrix & C ) { for ( int i =0; i