Michael M. Resch · Johannes Gebert · Hiroaki Kobayashi · Hiroyuki Takizawa · Wolfgang Bez Editors
Sustained Simulation Performance 2022
Sustained Simulation Performance 2022 Proceedings of the Joint Workshop on Sustained Simulation Performance, High-Performance Computing Center Stuttgart (HLRS), University of Stuttgart and Tohoku University, May and October 2022
Editors Michael M. Resch High Performance Computing Center University of Stuttgart, HLRS Stuttgart, Baden-Württemberg, Germany
Johannes Gebert High Performance Computing Center University of Stuttgart Stuttgart, Germany
Hiroaki Kobayashi Graduate School of Information Sciences Tohoku University Aoba-ku, Japan
Hiroyuki Takizawa Cyberscience Center Tohoku University Sendai, Miyagi, Japan
Wolfgang Bez NEC High Performance Computing Europe GmbH Düsseldorf, Nordrhein-Westfalen, Germany
ISBN 978-3-031-41072-7 ISBN 978-3-031-41073-4 (eBook) https://doi.org/10.1007/978-3-031-41073-4 Mathematics Subject Classification: 65-XX, 65Exx, 65Fxx, 65Kxx, 68-XX, 68Mxx, 68Uxx, 68Wxx, 70-XX, 70Fxx, 70Gxx, 76-XX, 76Fxx, 76Mxx, 92-XX, 92Cxx © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
The Workshop on Sustained Simulation Performance was held at HLRS in May 2022 and at the Cyberscience Center, Tohoku University, in October 2022. The workshop, a collaboration between the High-Performance Computing Center Stuttgart, Tohoku University, and NEC, was the first to be held in person again after the COVID pandemic. We are glad that the joint efforts continue to strengthen our research undertakings. Ultimately, we are happy to continue the relationship that began in 2004 with the establishment of what we called the "Teraflop Workshop". The meeting evolved into the Workshop on Sustained Simulation Performance, with more than 30 events on two continents.

While HPC systems were designed for many years as single-processor vector machines, they are now large cluster systems with fast interconnects and a variety of processors and accelerators, among them vector processors. Climate and weather simulation is one of the scientific fields with a particularly high demand for computing power, and such research makes clear that we need to use our resources more sustainably. This is at odds with the ever-larger size and ever-higher energy consumption of modern HPC systems. At the same time, however, there has been a tremendous increase in efficiency. The contributions in this book and the upcoming workshops will help to continue and accelerate the development of fast and efficient high-performance computing.

The contributed papers study the development of novel system management concepts, investigate load balancing, and present the current state of the art in the most powerful vector supercomputer, AOBA.

We would like to thank all the contributors and organizers of this book and the Sustained Simulation Performance Workshops. We especially thank Prof. Hiroaki Kobayashi for the close collaboration over the past years and look forward to intensifying our cooperation in the future.

Stuttgart, Germany
December 2022
Michael M. Resch Johannes Gebert
Contents
Digital Convergence
Michael M. Resch, Johannes Gebert, and Benjamin Schnabel

A Provenance Management System for Research Data Management in High-Performance Computing Systems
Yuta Namiki, Takeo Hosomi, Hideyuki Tanushi, Akihiro Yamashita, and Susumu Date

Management of Data Flows Between Cloud, HPC and IoT/Edge
Kamil Tokmakov

Dynamic Load Balancing of a Coupled Lagrange Particle Tracking Solver for Direct Injection Engine Application
Tim Wegmann, Matthias Meinke, and Wolfgang Schröder

Toward Scalable Empirical Dynamic Modeling
Keichi Takahashi, Kohei Ichikawa, and Gerald M. Pao

AOBA: The Most Powerful Vector Supercomputer in the World
Hiroyuki Takizawa, Keichi Takahashi, Yoichi Shimomura, Ryusuke Egawa, Kenji Oizumi, Satoshi Ono, Takeshi Yamashita, and Atsuko Saito
Digital Convergence

Michael M. Resch, Johannes Gebert, and Benjamin Schnabel
Abstract High-Performance Computing has recently been challenged by the advent of Artificial Intelligence. Artificial Intelligence has become rather popular in recent years and has claimed some success in solving relevant scientific problems in a variety of fields. In this paper we will look at the question of whether these technologies are mutually exclusive or whether they complement each other. We will argue that High-Performance Computing and Artificial Intelligence are two technologies that work well together. We will further argue that they are complemented by the Internet of Things, which helps to create a concept that we want to call Digital Convergence. We will furthermore explore how this Digital Convergence already shapes the future of computer simulation today. We will finally point at some new types of problems that will benefit from this Digital Convergence.
1 Introduction

Artificial Intelligence (AI) has become more visible within the realms of science and engineering in recent years. Following a period of relative stagnation, AI has gained new momentum and has achieved a number of interesting results, mainly because of two developments [1, 2]. Firstly, the speed of computers has increased over the last decades to an extent that has made concepts of AI more feasible. A lack of the necessary compute power held AI back for some time. New accelerators, and specifically the combination of bundles of such accelerators, increasingly provide a level of hardware performance that makes it possible to tackle problems that could not be solved before. Secondly, the integration of Machine Learning (ML) methods has provided AI with a foundation for enhancement, both in terms of speed and quality. In contemporary discourse, ML is frequently regarded as synonymous with AI, representing the most pervasive AI technology today. The intersection of increased computational
capacity and the pervasive use of ML techniques has propelled AI into new realms of possibility, marking a transformative phase in its evolution [3]. The ascent of modern AI has, to some extent, begun to challenge traditional High-Performance Computing (HPC). On the one hand, problems amenable to simulation can often also be analyzed using ML technologies. On the other hand, the momentum of HPC is waning, primarily because Moore's law [4] seems to be no longer valid. This anticipated superiority of AI over classical approaches has led some economists to posit that AI will instigate such profound changes in the world that millions of jobs will be lost, ushering in what they term a "second machine age" [5]. However, it is crucial to note that such societal implications are beyond the purview of this scientific investigation. This article explores the symbiotic relationship between HPC and AI and examines how they are poised to collaborate in the future. Presently, we observe substantial potential in the integration of these technologies. Moreover, AI emerges as a logical extension of HPC, not only to solve old problems better but also to provide an opportunity to address novel problems [6–8]. Firstly, we will examine the current state of HPC in a general context, with a specific focus on the extensively debated end of Moore's Law. Subsequently, we will delve into the potential of AI. Additionally, we will introduce the Internet of Things (IoT) and the concept of Digital Convergence, which brings HPC, AI, and IoT together. Finally, we will demonstrate how such a Digital Convergence is poised to significantly impact our problem-solving capabilities.
2 The Situation of Traditional High-Performance Computing

The future of HPC is widely discussed not only in the scientific community but also in industry. HPC has become a politically relevant topic. The People's Republic of China has declared HPC a key technology and is aiming for a leading position in HPC [9]. At least concerning the TOP500 list [10] (a list of the 500 fastest HPC systems in the world), it has been successful. The U.S. has reacted to that perceived challenge and has put an embargo on certain technologies. After many years of intensive discussions, U.S. President Joe Biden signed into law what is known as the "U.S. CHIPS and Science Act" [11]. The U.S. even went so far as to control scientific exchange in the field of HPC for its scientists. Europe has always been an open marketplace for HPC, with U.S. and Japanese vendors competing, and has witnessed decades of intense competition leading to lower prices. In response to the U.S.-Chinese competition in HPC, Europe has decided to re-enter this race. Aiming for "technological sovereignty", Europe has opted to develop its own Information Technology (IT) industry, specifically focusing on its HPC sector. Japan has chosen to diversify its strategy. After approximately two decades of concentrating resources on a single large national system, this approach has recently been complemented by several smaller systems and a network connecting all these installations to bundle both hardware and scientific research resources.
Fig. 1 Evolution of HPC performance according to the Top500 list, in reference to [14]
The TOP500 list is a well-known project that ranks and provides detailed information about the 500 most powerful HPC systems in the world. The list is updated twice a year and has been published since 1993, providing a reliable reference for measuring the performance of the world's fastest supercomputers. The list includes information on each system's hardware and software architecture, as well as its performance on a range of benchmark tests. The TOP500 list is an essential tool for researchers and organizations working in fields such as scientific computing, machine learning, and data analytics [12]. In July 2022, the Hewlett Packard Enterprise Frontier emerged as the first exascale system, boasting a theoretical peak performance of 1.6 EFLOP s−1 (see Fig. 1) [10, 13]. However, despite such remarkable achievements, certain HPC systems face challenges in keeping up with the performance predictions of Moore's Law. These slower HPC systems, once leaders in computing performance, now confront limitations imposed by physical and architectural constraints. For decades, Moore's Law has been the catalyst behind the exponential growth in computing power. Nonetheless, as the fundamental physical limits of miniaturization approach, the previously predictable growth rate is decelerating. As a result, HPC systems that were designed based on Moore's Law can no longer sustain the expected trajectory of performance improvement.
3 Slowing Down the Speedup

While architecture may not be the most immediate concern, the future of HPC is primarily challenged by a hardware problem: the end of Moore's Law [4]. In 1965, Moore predicted that the number of transistors on a chip would double every 12 (later revised to 18) months, allowing for increased cramming of transistors on the
same surface area. This prediction held true for around 50 years. However, the limits of miniaturization are becoming increasingly uncertain. We have already reached the point of manufacturing using five nanometer technology, with potential advancements to three nanometers in the foreseeable future. Yet, the economic feasibility of achieving these further miniaturizations remains unclear. Presently, only three manufacturers (Samsung Group, Taiwan Semiconductor Manufacturing Company (TSMC), and Intel Corporation) handle the seven nanometer process. Consequently, it is reasonable to assume that beyond the mid-2020s, achieving performance increases in supercomputers through transistor scaling will be challenging [15]. The debate surrounding Moore's Law continues to evolve. In 2022, Jensen Huang of NVIDIA Corporation proclaimed, "Moore's Law is dead" [16]. In response, Ann Kelleher from Intel countered his statement by emphasizing the importance of innovation, stating, "Innovation is not dead, and we will maintain Moore's Law as we always have, through innovation—innovation in process, in packaging, and in architecture" [17]. Both Huang and Kelleher represent differing interpretations of Moore's Law. While Huang is correct in asserting that we may not witness a doubling of performance at the same cost, Kelleher expresses her intention to explore new technologies that can further enhance performance. From the perspective of users and HPC centers, the crucial message is the need to step outside the comfort zone and recognize that the same budget will not automatically guarantee increased performance with each supercomputer system replacement. This realization is not surprising when we delve deeper into the issue.
4 Digital Convergence

Digital Convergence, the synergistic combination of HPC, AI, data, and people, is reshaping the technological landscape. While HPC systems continue to advance in speed, the pace of improvement has slowed compared to previous years. Simultaneously, AI methods are continually evolving and becoming more sophisticated. Moreover, the size of data is experiencing exponential growth, emphasizing the need for efficient processing and analysis. In this context, Digital Convergence demands a new kind of expertise from users. They must not only understand the technical aspects of HPC and AI but also possess the ability to handle large volumes of data and derive meaningful insights from it. Digital Convergence presents significant opportunities as the combination of HPC, AI, and data analysis enables new discoveries and insights across various research fields. The concept of Digital Convergence is shown in Fig. 2.
Company websites: Samsung Group (https://www.samsung.com/), TSMC (https://www.tsmc.com/), Intel Corporation (https://www.intel.com/), NVIDIA Corporation (https://www.nvidia.com).
Fig. 2 Concept of Digital Convergence: HPC systems are getting faster, but at a slowing rate; data is growing exponentially; AI methods are getting better; people need a new type of expertise
4.1 Artificial Intelligence in High-Performance Computing

The concept of AI is said to have been initially introduced by Alan Turing in 1950 [18]. Intelligence, being a non-technical concept, has undergone significant changes in meaning and understanding over the past decades, making it challenging to assess the technical merits of AI. While Alan Turing referred to AI as a computer system capable of replicating the logical behavior of a human being, the modern interpretation of AI focuses on two main aspects. Firstly, AI is seen as a means to create humanoid robots that imitate human behavior, encompassing the development of artificial human-like physical bodies. However, this aspect is distinct from the discussion on HPC and will not be further explored here. On the other hand, AI is regarded as capable of replacing humans in the decision-making process. As highlighted by I. H. Sarker and illustrated in Fig. 3, Deep Learning (DL) and ML are recognized as subsets of AI and Artificial General Intelligence (AGI) [2]. AGI can be defined as the hypothetical intelligence of a computer program that has the ability to understand or learn any intellectual task that a human being can perform. In the 1970s and 1980s, substantial investments were made in AI research, with high expectations of achieving both goals. However, the current wave of enthusiasm regarding AI takes a more realistic approach.
Fig. 3 Illustration of the position of ML and DL within the area of AI and AGI, in reference to [2]: AGI, an intelligent machine entity accomplishing or surpassing every task a human can do; AI, incorporating human behavior and intelligence into machines or systems; ML, methods that learn from data or past experience and automate analytical model building; DL, computation through multi-layer neural networks and processing
It seeks to integrate software and hardware solutions with ample data to develop systems that can assist humans in complex yet standardized decision-making tasks, such as medical treatment decisions and facial analysis. Although far from achieving a human-like machine, this technology holds significant potential. In the context of Computational Fluid Dynamics, AI can facilitate learning from previous simulations to make informed decisions about future simulations required to solve specific problems, effectively offloading the decision-making process within the simulation to an AI system.
4.2 Internet of Things in High-Performance Computing

In the context of HPC, the IoT serves as a significant representation of the vast amounts of data generated. The IoT refers to the interconnectivity of devices, sensors, and objects embedded with software, enabling them to collect and exchange data [19, 20]. This interconnected network of devices generates a continuous stream of real-time data that can be utilized by HPC systems for various purposes. By integrating IoT devices with HPC infrastructure, a wealth of data becomes accessible, offering valuable insights for analysis and decision-making. This data can be leveraged to optimize performance, monitor system efficiency, and drive advancements in scientific research domains. The IoT's ability to capture and transmit data from diverse sources positions it as a key enabler for harnessing the power of data in the context of HPC, facilitating innovative solutions and enhancing the capabilities of computational systems.
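To make the notion of streaming IoT data into an HPC environment concrete, the following minimal Python sketch subscribes to an MQTT topic and buffers readings into a file on shared storage for later batch processing. It is an illustration only: the broker address, topic layout, and staging path are assumptions, and the paho-mqtt client (1.x constructor style) merely stands in for whatever ingestion stack an actual centre would use.

```python
# Illustrative only: broker, topic, and staging path are assumptions.
import json
import paho.mqtt.client as mqtt  # paho-mqtt, 1.x constructor style

STAGING = "/scratch/iot/readings.jsonl"  # assumed path on HPC shared storage
buffer = []

def on_message(client, userdata, msg):
    buffer.append(json.loads(msg.payload))   # one JSON reading per message
    if len(buffer) >= 1000:                  # flush to shared storage in batches
        with open(STAGING, "a") as f:
            f.writelines(json.dumps(r) + "\n" for r in buffer)
        buffer.clear()

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.org", 1883)   # assumed MQTT broker
client.subscribe("sensors/+/readings")       # assumed topic hierarchy
client.loop_forever()
```

A batch job on the HPC side could then consume the staged file as ordinary input data.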
4.3 Data in High-Performance Computing

The exponential growth of data in HPC environments is notably propelled by the increasing integration of AI methodologies [21]. The utilization of AI techniques, such as DL, for data analysis and pattern recognition has led to an unprecedented influx of information in HPC systems. While these AI-driven approaches enhance the accuracy and efficiency of scientific simulations and experiments, they concurrently contribute to the creation of massive datasets that are in part heavily curated and in part crawled from the internet. Extracting meaningful insights from these extensive datasets involves the development and implementation of advanced AI algorithms tailored to the specific characteristics of HPC-generated data [22]. Challenges arise in terms of scalability, as the volume of data surpasses the capacities of traditional computational infrastructures. Moreover, interpreting AI-generated results poses challenges in ensuring the transparency and explainability of complex models. In particular, as more and more data is generated by AI, more AI will in turn be trained on such data, introducing additional uncertainties. Collaborative efforts between AI researchers, domain experts, computational scientists and
research software engineers are imperative to address these challenges and optimize the synergy between HPC and AI for scientific discovery. The ethical considerations associated with the use of AI in HPC, including data privacy and bias mitigation, also emerge as critical focal points in this rapidly evolving landscape. This is a terrain in which Europe is well advised to create AI models that take European training data and values into account. Consequently, while the integration of AI accelerates data growth in HPC, navigating the software, hardware and political challenges is crucial to unlock the transformative potential of these technologies for scientific exploration.
4.4 People in High-Performance Computing

The integration of AI and IoT into HPC heralds a new era of computational capabilities but concurrently introduces unprecedented challenges. The synergy of these technologies amplifies the complexity of data processing, requiring HPC systems to contend with vast and diverse datasets generated by interconnected devices. The advent of IoT contributes to the influx of real-time, streaming data, demanding enhanced data management and processing capabilities. Furthermore, the incorporation of AI algorithms into HPC workflows introduces the necessity for expertise in algorithm development, model optimization, and the seamless integration of these computational paradigms [23, 24]. As HPC becomes increasingly intertwined with AI and IoT, interdisciplinary expertise becomes imperative, encompassing not only the traditional computational sciences but also proficiency in data science, machine learning, and sensor network technologies. Addressing the challenges posed by this convergence demands collaborative efforts from experts with diverse skill sets. Consequently, the future of HPC necessitates a new breed of professionals who can navigate the intricate interplay between HPC, AI, IoT, and data, ensuring the efficient utilization of computational resources and unlocking transformative potential across scientific and industrial domains [25]. Thinking about researchers as users of AI systems also opens a window of opportunity for compounding intellectual gains: knowledge-based work is expected to be accelerated significantly by AI-based tooling, with substantial productivity gains. Not only will the convergence of HPC and AI enable entirely new research findings, but it will also shorten the turnaround time of discoveries.
4.5 How Does Digital Convergence Help in Traditional Computer Simulation?

Digital Convergence significantly enhances traditional computer simulations in two key ways. Firstly, the integration of data from the IoT, preprocessed by AI, offers improved input for simulations. The wealth of real-time data provided by IoT devices, coupled with AI's ability to refine and preprocess this information, ensures simulations
are based on more accurate and dynamic input parameters. This, in turn, enhances the overall reliability and precision of traditional simulations. Secondly, AI plays a crucial role in the analysis of simulation results, providing a deeper understanding of the outcomes. Not only can AI assist in interpreting current simulation results, it also enables retrospective analysis of data from simulations conducted over previous years and decades. This retrospective analysis contributes to a more comprehensive overview of the specific field in which simulations are performed. Through its analytical capabilities, AI becomes an invaluable tool for gaining insights, identifying patterns, and improving the overall efficacy of traditional computer simulations.
4.6 Which New Applications Can We Expect to See in the Future?

Simulations will change due to Digital Convergence. This shift not only marks a change in the nature of simulations but also heralds the emergence of innovative applications that seamlessly integrate all three technologies. An example is the analysis of global systems through a fusion of extensive data scrutiny and simulation techniques. The simulation of a pandemic's spread, leveraging mobility data and advanced simulations, allows for a predictive understanding of its consequences, even if prevention proves challenging for contagious viruses [26, 27]. Extending this paradigm, the analysis of complex systems, illustrated by smart city simulations, becomes paramount. These simulations amalgamate data from diverse fields such as traffic, power supply, and consumption, thereby enhancing the planning process and engaging a broader public in planning processes [28–30]. By deconstructing large-scale simulations, such as those pertaining to weather and climate, into localized models enriched with detailed regional data, disaster prediction becomes viable on both short and long timescales. This integration of HPC into disaster prediction strategies marks a departure from its traditional scientific confines, emphasizing the transformative potential of technology convergence in reshaping the landscape of simulations [31, 32].
5 Conclusion

In conclusion, our exploration of the symbiotic relationship between HPC, AI, big data, and the IoT points towards a transformative era marked by Digital Convergence. The trajectory of traditional HPC, driven by Moore's Law, faces challenges with the deceleration of performance improvements and the emergence of power consumption as a significant concern. Ethical and political considerations suggest a focus on European sovereignty in software and hardware.
The convergence of these computational domains into a common information technology landscape reshapes the traditional boundaries of computer simulation, offering unprecedented opportunities for innovation. This convergence provides a foundation for addressing existing challenges more effectively and delving into novel problem domains. All in all, the interconnectedness of such technologies, as illustrated by the concept of Digital Convergence, introduces a new era of possibilities that communities are well advised to embrace.
References

1. Hirsch-Kreinsen, H.: Artificial intelligence: a "promising technology". In: AI & SOCIETY (2023). https://doi.org/10.1007/s00146-023-01629-w
2. Sarker, I.H.: AI-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Comput. Sci. 3(2) (2022). https://doi.org/10.1007/s42979-022-01043-x
3. Pugliese, R., Regondi, S., Marini, R.: Machine learning-based approach: global trends, research directions, and regulatory standpoints. Data Sci. Manag. 4, 19–29 (2021). https://doi.org/10.1016/j.dsm.2021.12.002
4. Moore, G.E.: Cramming more components onto integrated circuits. IEEE Solid-State Circuits Soc. Newslett. 11(3), 33–35 (2006). https://doi.org/10.1109/nssc.2006.4785860
5. Brynjolfsson, E., Mcafee, A.: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, 306 pp. First published as a Norton paperback. W. W. Norton & Company, New York (2016)
6. Resch, M.M.: The role of machine learning and artificial intelligence for high-performance computing. In: Singh, V.K., Sergeyev, Y.D., Fischer, A. (eds.) Trends in Mathematics, pp. 241–249. Springer International Publishing (2021). https://doi.org/10.1007/978-3-030-68281-1_18
7. Resch, M.M., Boenisch, T., Gienger, M., Koller, B.: High performance computing: challenges and risks for the future. In: Singh, V.K., Gao, D., Fischer, A. (eds.) Advances in Mechanics and Mathematics, pp. 249–257. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-02487-1_14
8. Resch, M.M., Boenisch, T.: High performance computing - trends, opportunities and challenges. In: Iványi, P., Topping, B.H.V., Várady, G. (eds.) Advances in Parallel, Distributed, Grid and Cloud Computing for Engineering, pp. 1–8. Saxe-Coburg Publications (2017). https://doi.org/10.4203/csets.40.1
9. insideHPC: China Intends to Exceed 300 Exaflops Aggregate Compute Power by 2025 (2023). https://insidehpc.com/2023/10/china-intends-to-exceed-300-exaflops-aggregate-computepower-by-2025/. Accessed 14 Oct 2023
10. Strohmaier, E., Dongarra, J., Simon, H., Meuer, M.: TOP500 List - November 2022 (2023). http://www.top500.org. Accessed 03 March 2023
11. US Government Publishing Office (GPO): PUBLIC LAW 117-167-AUG.9,2022 (2022). https://www.govinfo.gov/content/pkg/PLAW-117publ167/pdf/PLAW-117publ167.pdf. Accessed 03 March 2023
12. Strohmaier, E., Dongarra, J., Simon, H., Meuer, M.: About (2022). https://www.top500.org/project/. Accessed 03 March 2023
13. Hewlett Packard Enterprise: Hewlett Packard Enterprise Frontier (2019). https://www.olcf.ornl.gov/frontier/. Accessed 10 March 2023
14. Strohmaier, E., Dongarra, J., Simon, H., Meuer, M.: Performance Development (2023). https://www.top500.org/statistics/perfdevel/. Accessed 03 March 2023
15. Courtland, R.: Transistors Could Stop Shrinking in 2021 (2016). https://spectrum.ieee.org/transistors-could-stop-shrinking-in-2021. Accessed 03 March 2023
16. Witkowski, W.: 'Moore's Law's dead,' Nvidia CEO Jensen Huang says in justifying gaming-card price hike (2022). https://www.marketwatch.com/story/moores-laws-deadnvidia267ceo-jensen-says-in-justifying-gaming-card-price-hike-11663798618. Accessed 03 March 2023
17. Kelleher, A.: Moore's Law - Now and in the Future (2022). https://www.intel.de/content/www/de/de/newsroom/opinion/moore-law-now-and-in-the-future.html. Accessed 03 March 2023
18. Turing, A.M.: Computing machinery and intelligence. Mind LIX(236), 433–460 (1950). https://doi.org/10.1093/mind/lix.236.433
19. Baz, D.E.: IoT and the need for high performance computing. In: 2014 International Conference on Identification, Information and Knowledge in the Internet of Things. IEEE (2014). https://doi.org/10.1109/iiki.2014.8
20. de Souza Cimino, L., de Resende, J.E.E., Silva, L.H.M., Rocha, S.Q.S., de Oliveira Correia, M., Monteiro, G.S., de Souza Fernandes, G.N., da Silva Moreira, R., de Silva, J.G., Santos, M.I.B., Aquino, A.L.L., Almeida, A.L.B., de Castro Lima, J.: A middleware solution for integrating and exploring IoT and HPC capabilities. Softw. Pract. Exper. 49(4), 584–616 (2018). https://doi.org/10.1002/spe.2630
21. Borrill, J., Oliker, L., Shalf, J., Shan, H., Uselton, A.: HPC global file system performance analysis using a scientific-application derived benchmark. Parallel Comput. 35(6), 358–373 (2009). https://doi.org/10.1016/j.parco.2009.02.002
22. Ejarque, J., Badia, R.M., Albertin, L., Aloisio, G., Baglione, E., Becerra, Y., Boschert, S., Berlin, J.R., D'Anca, A., Elia, D., Exertier, F., Fiore, S., Flich, J., Folch, A., Gibbons, S.J., Koldunov, N., Lordan, F., Lorito, S., Løvholt, F., Macías, J., Marozzo, F., Michelini, A., Monterrubio-Velasco, M., Pienkowska, M., de la Puente, J., Queralt, A., Quintana-Ortí, E.S., Rodríguez, J.E., Romano, F., Rossi, R., Rybicki, J., Kupczyk, M., Selva, J., Talia, D., Tonini, R., Trunfio, P., Volpe, M.: Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence. Future Gen. Comput. Syst. 134, 414–429 (2022). https://doi.org/10.1016/j.future.2022.04.014
23. Müller, M., Terboven, C., Nellesen, M., Yazdi, M.A., Politze, M.: Combining HPC, AI and RDM: challenges and approaches (2023). https://doi.org/10.5281/ZENODO.8083624
24. Diaz-de-Arcaya, J., Torre-Bastida, A.I., Zárate, G., Miñón, R., Almeida, A.: A joint study of the challenges, opportunities, and roadmap of MLOps and AIOps: a systematic survey. ACM Comput. Surv. 56(4), 1–30 (2023). https://doi.org/10.1145/3625289
25. Becciani, U., Petta, C.: New frontiers in computing and data analysis - the European perspectives. Rad. Effects Defects Solids 174(11–12), 1020–1030 (2019). https://doi.org/10.1080/10420150.2019.1683840
26. Klüsener, S., Schneider, R., Rosenbaum-Feldbrügge, M., Dudel, C., Loichinger, E., Sander, N., Backhaus, A., Fava, E.D., Esins, J., Fischer, M., Grabenhenrich, L., Grigoriev, P., Grow, A., Hilton, J., Koller, B., Myrskylä, M., Scalone, F., Wolkewitz, M., Zagheni, E., Resch, M.M.: Forecasting intensive care unit demand during the COVID-19 pandemic: a spatial age-structured microsimulation model (2020). https://doi.org/10.1101/2020.12.23.20248761
27. Pham, Q.-V., Nguyen, D.C., Huynh-The, T., Hwang, W.-J., Pathirana, P.N.: Artificial Intelligence (AI) and big data for coronavirus (COVID-19) pandemic: a survey on the state-of-the-arts. IEEE Access 8, 130820–130839 (2020). https://doi.org/10.1109/access.2020.3009328
28. Dong, S., Ma, M., Feng, L.: A smart city simulation platform with uncertainty. In: Proceedings of the ACM/IEEE 12th International Conference on Cyber-Physical Systems. ACM (2021). https://doi.org/10.1145/3450267.3452002
29. Suciu, G., Butca, C., Dobre, C., Popescu, C.: Smart city mobility simulation and monitoring platform. In: 2017 21st International Conference on Control Systems and Computer Science (CSCS). IEEE (2017). https://doi.org/10.1109/cscs.2017.105
30. Deren, L., Wenbo, Y., Zhenfeng, S.: Smart city based on digital twins. Comput. Urban Sci. 1(1) (2021). https://doi.org/10.1007/s43762-021-00005-y
31. Önol, B., Semazzi, F.H.M.: Regionalization of climate change simulations over the Eastern Mediterranean. J. Clim. 22(8), 1944–1961 (2009). https://doi.org/10.1175/2008jcli1807.1
32. Keuler, K., Block, A., Schaller, E.: High resolution climate change simulation for Central Europe. In: High Performance Computing in Science and Engineering '03, pp. 11–22. Springer, Berlin, Heidelberg (2003). https://doi.org/10.1007/978-3-642-55876-4_2
A Provenance Management System for Research Data Management in High-Performance Computing Systems

Yuta Namiki, Takeo Hosomi, Hideyuki Tanushi, Akihiro Yamashita, and Susumu Date

Abstract Research data management (RDM) has become increasingly important recently. High-performance computing (HPC) systems that process large volumes of data face the problem of how to manage the research data produced in their systems. Since the purposes of RDM are to improve the reproducibility of research and the reusability of the research data, collecting and managing the provenance of data produced in HPC systems is essential. In this paper, we introduce the concept of a provenance management system for RDM in HPC systems. We analyze use cases of provenance and the requirements for collecting and managing provenance. We also discuss implementation of the requirements. Finally, we introduce our prototype of the provenance management system.
1 Introduction

Research data management (RDM) has attracted much attention in recent years. The purposes of RDM are to improve the reproducibility of research and the reusability of research data. Improving the reproducibility of research makes the research verifiable and prevents research misconduct. Improving the reusability of research data allows researchers to utilize the data or knowledge in the research outcomes, which is a demand of Open Science. The Open Science movement makes research efforts accessible [6]. The FAIR Principles [11] provide guidelines that data and other materials, including research outputs, should follow to improve reusability. The FAIR Principles stipulate that data needs to be assigned an identifier, described with metadata, and associated with provenance.
High-performance computing (HPC) systems produce a large amount of data through simulations or computer experiments. However, there is no established method for managing the data so that it is reproducible and reusable. In this paper, we discuss a provenance management system for data produced in HPC systems. This system aims to fulfill the purposes of RDM, namely the improvement of reproducibility and reusability. The remainder of this paper is structured as follows. In Sect. 2, we describe how provenance works for RDM in HPC systems and list the requirements. In Sect. 3, the design of our system is described. In Sect. 4, we introduce our prototype implemented based on the design. In Sect. 5, we introduce related work. Finally, we conclude this paper in Sect. 6.
2 Provenance Management in HPC Systems

2.1 The Purpose of RDM

RDM is about creating, finding, organizing, storing, sharing and preserving data within any research process [3]. The purposes of RDM are as follows.

● To make research reproducible: By organizing, storing and preserving the information and data that are needed to perform the research process again, another researcher can verify the claims in the original research. Making research reproducible involves prevention of research misconduct such as fabrication or falsification in the data.
● To make research data reusable: By organizing and sharing reliable research data, researchers can build original work and avoid duplicating efforts. This will improve the productivity of research.

Here, fabrication and falsification are defined as follows [7]:

● Fabrication is making up data or results and recording or reporting them.
● Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Note that this paper focuses on how to manage the provenance of data produced in HPC systems. How to manage the provenance of data produced by experimental devices or measuring equipment is beyond our scope.
2.2 Provenance of Data in HPC Systems

The provenance of data represents information about how the data was produced. The provenance allows researchers to identify which processes were used, and which data were used as sources.
Fig. 1 Example of a provenance (files db.dat, db.csv, raw.dat, and result.csv; processes conv and sim1; metadata such as creation date, user, and hash attached to each file)
Output data of one process can be input data for another process. In HPC systems, data exists as a file, and a process is the execution of a program. Figure 1 shows an example of the provenance of research data produced through a simulation, which consists of executions of several programs in an HPC system. A rectangle in the figure is a file, and an oval is a program that processes a file. An arrow connecting a file to a program indicates that the file is an input of the program, and an arrow connecting a program to a file indicates that the program outputs the file. In general, the relationships between inputs, processing, and outputs form a chain. In this example, the final output of the simulation, "result.csv", was obtained from a program called "sim1" using "db.csv" and "raw.dat" as input. Here, one of the input files, "db.csv", was itself output from another process, "conv". From the figure, we can see all the files and programs used to obtain the result. In addition, metadata such as a hash value, creation date, and creator of the file are also included in the provenance. The hash value is generated by applying a hash function (e.g., SHA-256) to a file or a program. It is used to check the integrity of the file or the program, as described in a later section.
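As an illustration of the metadata just described, the short Python sketch below computes the SHA-256 hash of a file and assembles a provenance record shaped like the example in Fig. 1. It is not the authors' implementation; the field names are chosen for readability, and the file modification time stands in for the creation date.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def file_entity(path, user):
    p = Path(path)
    return {
        "file": p.name,
        "user": user,
        # modification time as a stand-in for the creation date
        "created": datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat(),
        "hash": sha256_of(path),
    }

# The chain of Fig. 1: conv turns db.dat into db.csv; sim1 reads db.csv and
# raw.dat and writes result.csv. Each file carries its metadata, including the hash.
provenance = [
    {"process": "conv", "inputs": [file_entity("db.dat", "A")],
     "outputs": [file_entity("db.csv", "A")]},
    {"process": "sim1", "inputs": [file_entity("db.csv", "A"), file_entity("raw.dat", "A")],
     "outputs": [file_entity("result.csv", "B")]},
]
print(json.dumps(provenance, indent=2))
```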
2.3 Use of Provenance

The provenance contributes to accomplishing the purposes of RDM in the following ways. To make research reproducible, the essential component is the relevant information another researcher requires to obtain a consistent result [5]. The provenance records input files and programs, which clarifies the sources and the procedure used to obtain the result. Another researcher can expect to obtain a consistent result by executing the program with the data shown in the provenance. The following cases of fabrication or falsification can be detected through provenance. The first case is where fabrication or falsification has been made through
editing a file with an unusual program. For example, if a file that is a result for a paper is output not from a simulation program but from an editor, it is doubtful that the file is authentic (i.e., it may have been falsified). Similarly, if a file expected to be an output of a specific program is instead an output of an editor or another unexpected program, it might be fabricated. The second case is that a program is modified to output falsified results. Since the hash value of the executed program is stored in the provenance, we can detect the usage of an improper program by comparing the hash value to the authentic one. Provenance also contributes to preventing the misidentification of data. A researcher can confirm that they have chosen the correct set of revisions of input and output files, as well as of the simulation program and configuration files, to produce the results. Therefore, the researcher can avoid unintentional misidentification of data files (e.g., debugging a program based on an output file that was produced by a different revision of the program). This will be helpful in situations where a large number of similar data files exist due to trial and error, which is common in the research process. Reproducibility of data also contributes to reusability, because reproducible data is considered reliable by third parties.
2.4 Characteristics of HPC Systems

We list below our assumptions of an HPC system for our provenance management system. We assume the HPC systems have the following characteristics:

C1 A system consists of multiple computing nodes. A user executes a program written with a parallel programming framework (e.g., MPI) to run the program on multiple computing nodes in parallel. The execution is done through a workload manager (e.g., Slurm).
C2 Users have many assets (e.g., software programs, knowledge about operations) for the current system environment. It is difficult to ask users of the system to modify their programs or operations.
C3 A system is expected to offer superior performance. The primary goal of the system is to provide HPC resources, which is even more important than RDM.
2.5 Requirements for the Provenance Management System

As we discussed in the previous sections, provenance is beneficial for making research reproducible and reusable. We list below the requirements for realizing RDM on data in HPC systems, based on the discussions in Sect. 2.3 and the characteristics of HPC systems C1–C3:
● To collect the provenance:
R1 Provenance should be collected exhaustively. To provide evidence of research, there must be no omission of files, programs, and metadata in the provenance. (Sect. 2.3)
R2 Provenance should be collected automatically and must not be modifiable. The system must prevent recording a provenance that is inconsistent with the actual process. The system must also prevent anyone from modifying the recorded provenance. Therefore, the recording should be done automatically and should not depend on the researcher. (Sect. 2.3)
R3 Through the collection of provenance, impacts on the user programs or on the usage of the HPC system should be minimal. (C2)
R4 Through the collection of provenance, the performance impact on the HPC system should be minimal. The primary purpose of the HPC system targeted in this paper is not to provide RDM functions but to provide HPC resources. (C3)
● For using provenance:
R5 Provenance should be browsable and findable. Users should be able to find the provenance by the name of the file or program, dates, contents of the file, etc. (Sect. 2.3)
R6 Provide a function to verify files and programs. Users should be able to verify that a given file or program has not been fabricated or falsified. (Sect. 2.3)
R7 Provenance should be provided in a human-friendly representation. As shown in the examples below, precise provenance is too complicated to read in some cases. Provenance, therefore, needs to be summarized adequately. (C1)

The following are some example cases for R7. When a program is written to run on multiple computing nodes in parallel using MPI, a set of provenances will be built, one for each computing node. This set should be merged into a single provenance, which shows the provenance in the whole HPC system. Another example for R7 is the case of invoking another program. When a user executes a program A (e.g., a shell script or Makefile) that invokes another program B that processes some files, the provenance will show that B processed the files, as B indeed has. However, the user is unaware of B, since it is not executed directly by the user. For this case, the provenance should merge B into A and show that the files were processed by A.
3 System Design

3.1 System Overview

In this section, we show the architecture of a provenance management system that meets the requirements described in the previous section. Figure 2 is a diagram of the system. The system consists of three major components: a tracer, an aggregator, and a database.
Fig. 2 System overview (a user program runs on Linux on the computing nodes of the HPC system; the tracer observes its execution and feeds the aggregator, which builds provenance and stores it in the database, Apache Atlas, behind an interface for viewing and verification)
The tracer runs on each computing node in an HPC system. It collects the information needed to build provenance and sends the information to an aggregator. The aggregator builds the provenance from the data collected by the tracer instances. A database stores the provenance built by the aggregator. In addition, it provides functions to find provenance. The aggregator and the database are supposed to run on a server outside the HPC system.
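The record exchanged between tracer and aggregator is not specified in detail here; the following Python sketch merely illustrates the kind of per-node event such a tracer could emit. All field names are assumptions made for the example, not the authors' format.

```python
from dataclasses import dataclass, asdict
import json, socket, time

@dataclass
class FileAccessEvent:
    node: str          # computing node where the event was observed
    pid: int           # process id of the user program
    program: str       # path of the executed program
    operation: str     # "execve", "read" or "write"
    path: str          # file operated on
    timestamp: float   # seconds since the epoch

def emit(event):
    # A real tracer would send this to the aggregator; here it is printed as JSON.
    print(json.dumps(asdict(event)))

emit(FileAccessEvent(socket.gethostname(), 4242, "/opt/sim/sim1",
                     "read", "/scratch/db.csv", time.time()))
```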
3.2 Implementation This section describes how to implement each requirement in Sect. 2.5 on the architecture.
3.2.1
Recording Provenance and Metadata
This feature corresponds to requirement R1. Provenance is built from records of system call invocations that are related to file access operations. For example, in Linux, a program invokes the read() system call provided by the operating system (kernel) to read data from a file. The tracer detects this invocation and records the program that invoked the system call and the files to be read. Similarly, files written by the user program are also detected and recorded.
In addition, the tracer collects metadata such as creator and creation date of files from the file system. A hash value of the files is calculated and recorded. This value is used to identify files (described later). The tracer stores the history of system call invocation and the metadata in a temporary area of the database. The aggregator periodically builds provenance from this history. For a read operation recorded in the history, the aggregator connects the file read to the program that invokes the operation as an input. Similarly, for a write operation in the history, the aggregator connects the file to the program as an output. In addition, metadata such as the creator, creation date, and the hash value of the file is associated with the file in the provenance. The provenance built by the aggregator is stored in the database.
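The aggregation step just described can be pictured with a few lines of Python: read and write events from the history are folded into per-program input and output sets. The event fields follow the illustrative schema sketched earlier rather than the prototype's actual format.

```python
from collections import defaultdict

def build_provenance(history):
    # program name -> sets of files it read and wrote
    graph = defaultdict(lambda: {"inputs": set(), "outputs": set()})
    for ev in history:
        if ev["operation"] == "read":
            graph[ev["program"]]["inputs"].add(ev["path"])
        elif ev["operation"] == "write":
            graph[ev["program"]]["outputs"].add(ev["path"])
    return graph

history = [
    {"program": "conv", "operation": "read",  "path": "db.dat"},
    {"program": "conv", "operation": "write", "path": "db.csv"},
    {"program": "sim1", "operation": "read",  "path": "db.csv"},
    {"program": "sim1", "operation": "read",  "path": "raw.dat"},
    {"program": "sim1", "operation": "write", "path": "result.csv"},
]
for program, io in build_provenance(history).items():
    print(program, "reads", sorted(io["inputs"]), "writes", sorted(io["outputs"]))
```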
3.2.2
Securing Records
This feature corresponds to requirement R2. The provenance management system is operated separately from the HPC system, except for the tracer program that needs to run on the computing nodes. The tracer program runs in the background with root (administrator) privileges. Therefore, users of the HPC system cannot modify the behavior of the tracer. The aggregator and the database run on a dedicated node. An authentication mechanism prevents the recorded data from being modified (i.e., fabricated or falsified).
3.2.3
Minimizing Impact on Users of HPC Systems
This feature corresponds to requirements R3 and R4. As described in Sect. 3.2.1, the information needed to build the provenance is collected from the Linux kernel. This method allows the tracer to collect the information without requiring any modification to users' programs. Furthermore, this method works for a program implemented in any programming language (e.g., Fortran, C), and even for a program without source code. The impact on the performance of the user program is discussed in Sect. 4.
3.2.4
Viewing Provenance
This feature corresponds to requirement R5. The provenance management system provides functionalities to find an entity in the provenance by name or by other metadata values of a file or a program. The system also accepts a file or a program directly: it calculates the hash value of the given file or program and then finds an entity that has the same hash value.
3.2.5
Verifying Files/Programs
This feature corresponds to requirement R6. The provenance management system stores a hash value of a file when the file has been read or written. The system provides a function to compare the hash value recorded in the system with the one calculated from the file given by the user. If the two values are identical, the file has not been modified (falsified) since the provenance was recorded.
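A minimal sketch of this verification step, assuming the recorded hash values are available as a simple lookup table (the real system queries the provenance database instead):

```python
import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for the provenance database: recorded hash -> metadata.
# The key below is an illustrative placeholder, not a real digest.
recorded = {"cd3a19exampleplaceholder": {"file": "result.csv", "user": "B"}}

def verify(path):
    entry = recorded.get(sha256_of(path))
    if entry is None:
        return False   # unknown content: possibly modified since recording
    return True        # bit-identical to the file whose provenance was recorded
```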
3.2.6
Formatting Provenance
This feature corresponds to requirement R7. The tracer collects metadata not only from the file system but also from the workload manager (e.g., Slurm) that is used to execute programs in a typical HPC system. For example, the identifier of a job, which is the unit of execution of programs in the workload manager, is one of the metadata items to be collected. It can be used to group programs in the provenance to improve its readability. If a program runs in parallel on multiple computing nodes using MPI, the multiple instances of the program and their inputs and outputs in the provenance need to be merged into a single instance of the program. If they are not merged, inputs and outputs will be partitioned by node, even if they were processed by a single execution of an MPI program, because each tracer collects information on a single node. The merge operation is done by the aggregator, which can access the database that stores the information from the tracers on all nodes.
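To illustrate the merge rule described above, the sketch below collapses per-node provenance entries that share a workload-manager job identifier into one program instance whose inputs and outputs are the union over all nodes; the entry layout is again an assumption made for the example.

```python
from collections import defaultdict

def merge_by_job(per_node_entries):
    merged = defaultdict(lambda: {"program": None, "inputs": set(), "outputs": set()})
    for e in per_node_entries:
        m = merged[e["job_id"]]          # one merged instance per job id
        m["program"] = e["program"]
        m["inputs"].update(e["inputs"])
        m["outputs"].update(e["outputs"])
    return dict(merged)

per_node = [
    {"job_id": "1234", "node": "n01", "program": "sim1",
     "inputs": ["part0.dat"], "outputs": ["out0.dat"]},
    {"job_id": "1234", "node": "n02", "program": "sim1",
     "inputs": ["part1.dat"], "outputs": ["out1.dat"]},
]
print(merge_by_job(per_node))
```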
4 Prototype

We developed a prototype that implements the functions described in the previous section. The core technologies used in the implementation are listed in Table 1. To detect system call invocations in the tracer (Sect. 3.2.1), we use BPF [10]. BPF is a mechanism for running programs in the Linux kernel without requiring modifications to the kernel. BPF is typically used for monitoring and profiling. Major Linux distributions, including Red Hat Enterprise Linux and Ubuntu, use a BPF-enabled kernel by default. In our prototype, we write a BPF program that records an operation and its target file (e.g., "process P writes file F") on the invocation of system calls related to process and file operations (execve(), read(), etc.).
Table 1 Technologies used for the implementation
Tracing system call invocations: BPF
Programming languages: C (BPF), Python (other components)
Database: Apache Atlas [9]
Fig. 3 Showing provenance
The impact on performance is relatively small (Sect. 3.2.3): we observed about 2% overhead for tracing with a heavy I/O workload generated by the I/O benchmark tool fio [1]. This is better than the results with ptrace and SystemTap, other mechanisms for tracing the invocation of system calls. The function of integrating the provenance of parallel MPI programs (Sect. 3.2.6) is implemented in the aggregator. It runs on a server for the provenance recording system that is separate from the HPC system. The provenance database uses the open-source software Apache Atlas, which runs on a dedicated server for the provenance recording system (Sects. 3.2.2, 3.2.5). Apache Atlas is designed for managing the provenance and metadata of data assets. In addition to storing data, Apache Atlas also provides functionalities for displaying provenance and metadata. The aggregator registers information such as files, processes, and the relationships between them, as well as metadata, in a predefined format to produce graphical output as described below. Figure 3 is a screen of our prototype showing the provenance. In the screenshot, a blue hexagon with a table icon indicates a file and a yellow-green hexagon with a gear icon indicates a program. A user can reach this screen by searching for the name of a file, the hash value of a file, or other metadata (Sect. 3.2.4). Figure 4 is a screen showing the details of a file. This screen can be reached through a search of the value of the metadata (Sect. 3.2.4), similar to the previous
Fig. 4 Showing metadata
screen (Fig. 3), or from the previous screen by clicking an entity in the provenance. In this screen, metadata such as the date of creation, the user who created the file, and the hash value of the file are shown. A user can add user-defined properties and labels to improve findability.
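For readers unfamiliar with Apache Atlas, the following heavily simplified sketch shows what registering a file entity over Atlas's REST entity API can look like using the requests library. The server address, credentials, type name, and attribute layout are placeholder assumptions and do not reflect the prototype's actual entity model; the endpoint should be checked against the Atlas version in use.

```python
import requests

ATLAS = "http://atlas.example.org:21000"   # assumed Atlas server
payload = {
    "entity": {
        "typeName": "fs_path",             # placeholder type; the prototype defines its own model
        "attributes": {
            "qualifiedName": "hpc://cluster/scratch/result.csv",
            "name": "result.csv",
            "path": "/scratch/result.csv",
        },
    }
}
# v2 entity endpoint; verify path and authentication for the deployed Atlas version.
resp = requests.post(f"{ATLAS}/api/atlas/v2/entity",
                     json=payload, auth=("admin", "admin"), timeout=30)
resp.raise_for_status()
print(resp.json())
```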
5 Related Work

To improve the reproducibility of research and the reusability of research data in HPC systems, several approaches have been used in previous research. The first approach is to package the computing environment. Previous works proposed methods for automatically collecting the files, programs, and libraries (i.e., the environment) needed to run a simulation. These methods provide a feature that packages the environment for distribution. An external researcher can then re-execute the simulation on their own system without setting up an environment. ReproZip [2] is one implementation of this approach. These works differ from our approach: we manage not the environment of a simulation, but the provenance of the data produced in the simulation. This is because we focus on describing how the data is produced, while the previous research focuses on re-executing the program and obtaining the same result again. The second approach is to preserve the provenance of data, which is similar to our approach. Previous works exist that build provenance for files produced in HPC
systems [4, 8]. They proposed methods for automatically building the provenance of files by capturing system calls in the operating system kernel. Unlike the previous research, we choose the newer and more lightweight mechanism BPF for the capturing. In addition, we discuss not only how we build the provenance, but also how we can use it, i.e., functionalities for verifying files by using the provenance together with metadata including hash values.
6 Conclusion and Future Work

In this paper, we discussed how to implement a provenance management system for research data in HPC systems. Our system aims to improve the reproducibility of research and the reusability of research data, which are the purposes of RDM. The provenance shows how data in HPC systems is processed, which allows other researchers to replicate the process (i.e., makes it reproducible). Reproducible data is expected to improve reusability because it makes the data reliable. To implement provenance management in HPC systems, we analyzed the requirements based on the role of provenance and the characteristics of HPC systems. These included collecting provenance without modifications to programs or user operations, functions for verification, minimizing the performance impact, etc. Next, based on the requirements, the system architecture and implementation methods were considered. We used the history of system call invocations from the user's programs to build the provenance with minimal impact on users and on performance. Finally, we introduced our prototype of the provenance management system. We plan to incorporate users' opinions and enhance the system so that the functionalities of our prototype become practical. In addition to enhancing the prototype, which focuses on research data in HPC systems, a conceivable extension is to connect our system to other systems such as measurement equipment and devices, local computers and storage in laboratories, and existing repositories. This will enable end-to-end RDM and make the data usable in more situations.
References 1. Axboe, J.: fio—flexible I/O tester (2006). https://git.kernel.dk/cgit/fio/ 2. Chirigati, F., Rampin, R., Shasha, D., Freire, J.: ReproZip: computational reproducibility with ease. In: Proceedings of the 2016 International Conference on Management of Data, ACM, SIGMOD '16, pp. 2085–2088 (2016). https://doi.org/10.1145/2882903.2899401 3. Cox, A., Verbaan, E.: Exploring Research Data Management. Facet Publishing (2018) 4. Dai, D., Chen, Y., Carns, P., Jenkins, J., Ross, R.: Lightweight provenance service for high-performance computing. In: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 117–129 (2017). https://doi.org/10.1109/PACT.2017.14 5. National Academies of Sciences, Engineering, and Medicine: Reproducibility and Replicability in Science. The National Academies Press, Washington, DC (2019). https://doi.org/10.17226/25303
6. OECD: Making Open Science a reality. OECD Science, Technology and Industry Policy Papers (25) (2015). https://doi.org/10.1787/5jrs2f963zs1-en 7. Office of Research Integrity: Definition of research misconduct (2000). https://ori.hhs.gov/definition-research-misconduct 8. Pasquier, T., Han, X., Goldstein, M., Moyer, T., Eyers, D., Seltzer, M., Bacon, J.: Practical whole-system provenance capture. In: Proceedings of the 2017 Symposium on Cloud Computing, Association for Computing Machinery, New York, NY, USA, SoCC '17, pp. 405–418 (2017). https://doi.org/10.1145/3127479.3129249 9. The Apache Software Foundation: Apache Atlas (2017). https://atlas.apache.org/ 10. The kernel development community: BPF documentation (2014). https://docs.kernel.org/bpf/ 11. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3(1), 1–9 (2016)
Management of Data Flows Between Cloud, HPC and IoT/Edge Kamil Tokmakov
Abstract The components of heterogeneous applications are deployed across various execution platforms and utilise the capabilities of the platforms. As such, one component can utilise HPC resources for better performance in batch computations, while another—Cloud resources, for better scalability and elasticity. Furthermore, there is also a possibility for processing on Edge devices. The usage of such a hybrid setup, where dependent components of the applications are deployed across various platforms, might require flexible and adaptive data transfers from one platform to another. This work presents a data management framework, based on the Apache NiFi dataflow management system and developed in the scope of the SODALITE EU project. This framework enables scalable data transfer between any of GridFTP (a file transfer protocol common in HPC), HTTP, S3-compatible and data streaming (such as MQTT) endpoints.
1 Introduction A heterogeneous application is a type of software application whose components are deployed across various execution platforms to utilise the capabilities of these platforms [1]. For example, one component of such an application can utilise bare-metal compute resources, offered by HPC systems, for high performance in batch computations, while another component can be deployed in the Cloud for better availability, scalability and elasticity of the virtualised compute resources. Furthermore, there is also the possibility for a component to be deployed on Edge devices, where the data processing can be performed closer to the data sources, or the data sources could directly stream the data to the Cloud or HPC platforms, as in the case of IoT devices.
K. Tokmakov (B) High Performance Computing Center Stuttgart (HLRS), University of Stuttgart, Nobelstraße 19, 70569 Stuttgart, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. M. Resch et al. (eds.), Sustained Simulation Performance 2022, https://doi.org/10.1007/978-3-031-41073-4_3
Therefore, heterogeneous applications demand an adaptive and scalable way to connect loosely coupled application components deployed on various platforms and to perform data transfers between them. For instance, consider a production HPC infrastructure, where strict firewall rules are applied [2] and the number of available ingress endpoints for data management is often limited to the GridFTP and SFTP file transfer protocols. Moreover, egress connectivity from the infrastructure to the Internet is often blocked for security reasons, hence directly requesting the data from external repositories, data services or cloud storage is rarely possible. For these cases, one can rely on data management platforms designed to provide adaptivity between various data interfaces. Apache NiFi [3] is one such platform, providing rich support for data management between various cloud storage vendors (e.g. S3-based, GCS, Azure) and streaming platforms (e.g. Apache Kafka, MQTT). It is open-source software and provides a means to extend its capabilities to support other platforms or data transfer protocols. As such, in this work we extended Apache NiFi with support for GridFTP, a protocol common in HPC, in order to cover additional cases of heterogeneous applications, where data from cloud storage and streaming platforms can be transferred to the storage of an HPC infrastructure. Moreover, we present a data management framework for scalable data movement that builds on the load balancing offered by Apache NiFi. This framework follows the data pipeline architecture described in [4] and extends it with load balancing support. This paper is structured as follows. Section 2 outlines related work in data management of heterogeneous data endpoints. Sections 3 and 4 present the design and implementation, respectively, of the framework for the management of data flows for heterogeneous applications. Section 5 provides an evaluation of the framework with three use cases that require data flows between different data endpoints. Section 6 concludes the paper.
2 Related Work The European Organization for Nuclear Research (CERN) has developed services, such as FTS3 [7] and Rucio [9], for multi-protocol data management. FTS3 is based on the GFAL2 [8] multi-protocol data management library, which allows data transfers between cloud storages, Grid and HPC systems and supports various protocols, such as GridFTP, SRM, S3, HTTP(S), WebDAV and XrootD. Rucio [9] provides data management across globally distributed locations and heterogeneous data centres. It utilises FTS3 as a middleware for multi-protocol connections and unites different storage and network technologies as a single federated entity. Scientific federated infrastructure and service providers often offer different storage technologies. In order to provide unified data access, services for unified data management, such as Globus [5] and Onedata [6], have been developed. These services also allow integration with external cloud providers, such as Amazon S3 and Microsoft Azure.
Current data management services found in scientific communities mostly focus on HPC and cloud storage platforms and do not cover edge and IoT platforms: with the emergence of edge computing and IoT, data processing has shifted towards data stream and data flow handling. Apache Kafka [10] is a distributed data stream processing platform. In terms of supporting various data endpoints, it offers import and export of data from other protocols (such as S3, Azure, FTP, GCS, SQL, HDFS, MQTT, etc.) via Kafka Connect. Fledge [11] is an IoT platform that serves as a gateway between edge devices and cloud storage systems. It currently provides plugins for various platforms and protocols, such as Apache Kafka, GCS, HTTP, EdgeX. Dataflow services, such as Apache NiFi [3] and StreamSets [12], also provide rich support for data management between streaming platforms (e.g. Kafka, MQTT) and cloud storage (e.g. S3, GCS, Azure, etc.). It can thus be observed that data management services for edge and IoT platforms target streaming platforms and cloud storage, but do not integrate HPC. In this work we extend Apache NiFi to support GridFTP, a common file transfer protocol used in HPC, in order for heterogeneous applications to transfer data to HPC and incorporate HPC as their computing platform. The RADON EU project [13] has developed a data pipeline architecture for serverless platforms [4] and a framework [14] for dataflow management, also based on Apache NiFi. However, these developments do not include data transfers to HPC infrastructures, nor do they address the scalability of the workflows. In this work we extended the data pipeline architecture with load balancing and scalable dataflows.
3 A Design of Scalable Dataflows This section describes a dataflow design for scalable data transfers between heterogeneous data endpoints. It first explains the terminology and concepts used by Apache NiFi and then discusses the design choices for the dataflow. Apache NiFi [3] is a dataflow management system that provides Processors for various storage endpoints, platforms and systems, such as S3, (S)FTP, Apache Kafka, HDFS, MQTT, HTTP, etc. The Processors are connected into a pipeline using Connections that serve as queues for transferring FlowFiles and offer load balancing capabilities. A FlowFile represents a single piece of data and consists of Content with the actual data and Attributes, such as file name, file size, path, permission, etc. A set of Processors can be further grouped into a Process Group. For a dataflow between Process Groups, Input and Output Ports are exposed, respectively, to ingest FlowFiles from and send FlowFiles to another Process Group. These Ports are also connected using Connections. NiFi can be configured as a cluster of multiple nodes. In this case it employs a Zero-Leader Clustering paradigm: each node performs the same tasks on the data, but operates on a different set of data. However, for the dataflow to be scalable, load balancing of the Connections should also be enabled, such that FlowFiles are distributed across the available NiFi nodes. Below we present an extended architecture of data pipelines [4], which accounts for scalability and load balancing.
Fig. 1 Dataflow design of Consumer and Publisher Process Groups
Figure 1 depicts a dataflow design for scalable data transfers from a Data Consumer to a Data Publisher, logically grouped as Process Groups. The dataflow is executed by each node of the NiFi cluster. The Data Consumer obtains the data from the specified data source and then transfers FlowFiles to the Data Publisher via a Connection. The Data Publisher in turn receives the FlowFiles and transfers them to another endpoint. As such, the dataflow is able to move data between heterogeneous data endpoints. The Data Consumer Process Group starts with the List Processor, which generates a list of empty-content FlowFiles, i.e., FlowFiles with specified Attributes, but without the actual Content. The List Processor should be run on the so-called Primary Node. The reason for this is that otherwise all the NiFi nodes would be running this Processor due to Zero-Leader Clustering, which would lead to duplicate data. The empty-content FlowFiles are then transferred to the queue of the Connection. This Connection is set to use round-robin load balancing, which in turn distributes the empty-content FlowFiles within the cluster, meaning that different data will be processed by each NiFi node. This achieves scalability, as each node of the NiFi cluster now processes its own data. The benefit of processing content-less FlowFiles instead of content-filled FlowFiles at this stage is the decreased internal distribution latency of load balancing: it is faster for the Primary Node to distribute empty FlowFiles rather than FlowFiles with the data. It is the task of the Get Processor to receive the empty-content FlowFiles, populate their Content and send them to the Output Port.
The Input Port of the Data Publisher Process Group receives FlowFiles from the Data Consumer and sends them to a Connection. This Connection is not load balanced, as that would be redundant: the distribution of FlowFiles has already been performed by the Consumer. It rather serves as a back-pressure buffer, since there can be cases when the rate at which a Consumer obtains data is higher than the rate at which a Publisher can write it. The FlowFiles are then passed to the Put Processor, which uploads the data to the specified endpoint. Consumer and Publisher are independently running Process Groups. For example, one can configure a Consumer to collect files from an S3 bucket, while a Publisher can be configured to push files to a GridFTP server. Connecting these Process Groups will transfer the files from an S3 endpoint to GridFTP. Following this design, the scalability and heterogeneity of data transfers can thus be achieved.
4 Implementation In this section the implementation of GridFTP support for Apache NiFi is described, as well as the details of the implementation of some of the Process Groups specific to the data endpoints. The implementation of the dataflow management relies on the clustered version of Apache NiFi and the specific dataflow design (see Sect. 3) to enable scalability and heterogeneity. GridFTP support was implemented in a secure and scalable way: the credentials are obscured and encrypted, and custom GridFTP Processors were implemented to fit into the NiFi clustering paradigm.
4.1 Implementation of GridFTP Processors Apache NiFi provides experimental Processors that execute custom scripted logic for handling FlowFiles. These Processors are ExecuteScript and InvokeScriptedProcessor. Both provide various script engines, such as Groovy and Jython, and also the ability to add external modules as JAR (Java Archive) files. The difference between them is that ExecuteScript only executes the scripted logic when the processor is triggered, whereas with InvokeScriptedProcessor it is possible to additionally define other parameters, such as initialization and custom properties of the Processor. The properties can also be configured as sensitive, meaning that they will be obscured for users and encrypted in the file system of the Apache NiFi nodes. Therefore, InvokeScriptedProcessor provides more control over the implementation and was chosen as the underlying processor to support GridFTP (a minimal skeleton of this pattern is sketched at the end of this subsection). As the scripting engine for InvokeScriptedProcessor, Groovy was chosen, since it allows importing third-party Java libraries into its runtime. The GridFTP Processors have the following common properties:
● Host: the hostname of the GridFTP server
● Port: the port number of the GridFTP server
● Username: optional username of the client
● Path: remote path to the source or destination directory on the GridFTP server
● Usercert: certificate of the user to authenticate with the GridFTP server
● Userkey: certificate key of the user to authenticate with the GridFTP server
The Usercert and Userkey are sensitive properties. The user interface of Apache NiFi automatically detects and renders the properties once the Script Body and Module Directory properties of the InvokeScriptedProcessor configuration are filled in. Next, we implemented three Processors: ListGridFTP, GetGridFTP and PutGridFTP, which respectively follow the design of the List, Get and Put Processors explained in Sect. 3. ListGridFTP and GetGridFTP belong to the Consumer Process Group, whereas the PutGridFTP Processor belongs to the Publisher Process Group. Connecting a GridFTP Process Group with another Process Group implies data transfers between a GridFTP server and the storage type of the other Process Group through an Apache NiFi cluster. This enables connectivity between an HPC infrastructure and Apache NiFi, which was previously missing, and therefore accomplishes the objective of this work to provide the management of data flows between heterogeneous data sources. The GridFTP Processors use the JGlobus [15] Java client library for GridFTP, which can be added into the Module Directory of the InvokeScriptedProcessor configuration and later imported into the Groovy runtime when the Processors are initialized. Trusted Grid certificates should also be added into the /etc/grid-security directory in the file system of the host running NiFi. All of the GridFTP Processors use authentication based on X.509 certificates. The X.509 credentials are created by filling in the Usercert and Userkey properties and are used to authenticate with the target GridFTP server. ListGridFTP Processor ListGridFTP implements a List Processor by listing the files on the specified Path of the GridFTP server and yielding empty FlowFiles with GridFTP-specific attributes, such as the absolute path and the GSIFTP URL of a file. Internally, ListGridFTP sends the MLSD [16] command using the GridFTP client of the JGlobus library. This command returns a list of MLSX entries, each containing information such as the file name, file size and permission. This list is then parsed and transformed into empty-content FlowFiles with the attributes populated with the information contained in the MLSX entries. These FlowFiles are then output to the next Processor. GetGridFTP Processor GetGridFTP implements a Get Processor by reading the content of a remote file on the specified GridFTP server. In Java and Groovy, read operations (e.g. from a local or remote file) are performed by using so-called input streams. The JGlobus library contains an input stream implementation for GridFTP—GridFTPInputStream—in order to fetch the content of a GridFTP remote file. When GetGridFTP receives a FlowFile sent by ListGridFTP, it reads the attributes of the FlowFile and retrieves the path to the file. It then creates a GridFTP input
stream, specifying the destination and the file path, and associates the stream with the FlowFile. The FlowFile is then populated with the file content and sent to the output of the Processor for transferring it further in the dataflow. PutGridFTP Processor PutGridFTP uses a JGlobus output stream for GridFTP (GridFTPOutputStream) in order to write a GridFTP remote file. An incoming FlowFile, with the file name attribute and content containing the data, is associated with the GridFTP output stream to write to a file in the remote directory specified in the Path property, hence copying the content of the FlowFile into the GridFTP remote file. Notes on limitations The GridFTPInputStream and GridFTPOutputStream streams of the JGlobus library do not support the GridFTP protocol's Extended block (EBLOCK) mode [17]. This mode sends data in blocks, defined by the GridFTP protocol, and enables out-of-order reception, which is needed for parallel and striped data transfers. Therefore, features such as parallel and striped data transfers are not supported. Moreover, the implementation of the dataflows is designed for data transfers between heterogeneous data sources and not for third-party transfers, i.e. direct data transfers between GridFTP servers, which are common in HPC and Grid computing.
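To make the described pattern more concrete, the following is a minimal, illustrative Groovy skeleton of the kind of script that can be placed into the Script Body of an InvokeScriptedProcessor. It is not the authors' actual ListGridFTP/GetGridFTP/PutGridFTP code: the processor name, the reduced property set and the placeholder transfer logic are assumptions made only for illustration, and the JGlobus calls are omitted.

```groovy
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.ValidationContext
import org.apache.nifi.components.ValidationResult
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSessionFactory
import org.apache.nifi.processor.Processor
import org.apache.nifi.processor.ProcessorInitializationContext
import org.apache.nifi.processor.Relationship
import org.apache.nifi.processor.exception.ProcessException
import org.apache.nifi.processor.util.StandardValidators

class GridFtpSkeletonProcessor implements Processor {

    // Example properties: sensitive values are obscured in the UI and
    // encrypted on the NiFi nodes, as described above.
    static final PropertyDescriptor HOST = new PropertyDescriptor.Builder()
        .name('Host').description('Hostname of the GridFTP server')
        .required(true).addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build()
    static final PropertyDescriptor USERKEY = new PropertyDescriptor.Builder()
        .name('Userkey').description('Certificate key used for authentication')
        .required(true).sensitive(true)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build()

    static final Relationship REL_SUCCESS = new Relationship.Builder()
        .name('success').description('Successfully processed FlowFiles').build()

    @Override void initialize(ProcessorInitializationContext context) { }
    @Override Set<Relationship> getRelationships() { [REL_SUCCESS] as Set }
    @Override List<PropertyDescriptor> getPropertyDescriptors() { [HOST, USERKEY] }
    @Override PropertyDescriptor getPropertyDescriptor(String name) {
        getPropertyDescriptors().find { it.name == name }
    }
    @Override Collection<ValidationResult> validate(ValidationContext context) { [] }
    @Override void onPropertyModified(PropertyDescriptor d, String oldValue, String newValue) { }
    @Override String getIdentifier() { 'gridftp-skeleton' }

    @Override
    void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException {
        def session = sessionFactory.createSession()
        try {
            FlowFile flowFile = session.get()
            if (flowFile == null) { return }
            String host = context.getProperty(HOST).getValue()
            // Placeholder: the real Processors open a JGlobus GridFTP stream here
            // and read or write the FlowFile content for the given host and path.
            session.transfer(flowFile, REL_SUCCESS)
            session.commit()
        } catch (Exception e) {
            session.rollback(true)
            throw new ProcessException(e)
        }
    }
}

// InvokeScriptedProcessor picks up the instance bound to the variable 'processor'.
processor = new GridFtpSkeletonProcessor()
```

In the actual Processors, the common property list shown above would take the place of the two example properties, and the placeholder comment marks where the GridFTPInputStream/GridFTPOutputStream logic described in this subsection operates on the FlowFile content.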
4.2 Implementation of the Process Groups This subsection presents a set of Process Groups developed for typical data endpoints in Cloud, HPC and Edge/IoT: S3 storage, HTTP repositories, GridFTP storage and MQTT streaming. All Processors of the Process Groups are already offered by NiFi; only the GridFTP Processors are custom. The supported set of Process Groups can be extended: for instance, one can add support for Google Cloud Storage or Apache Kafka, as long as the design principles of Sect. 3 are followed.
4.2.1 S3 Storage
In order to obtain objects from an S3 bucket, an S3 Consumer Process Group was developed. It consists of ListS3 and FetchS3Object Processors and an Output Port, as depicted in Fig. 2. The ListS3 Processor runs on the Primary Node (labelled P in the figure) and generates empty-content FlowFiles that contain attributes of the objects in an S3 bucket. The Connection between the Processors is set to use load balancing in order to distribute the empty-content FlowFiles across the cluster, as explained in Sect. 3. Each FlowFile is then processed by FetchS3Object to retrieve the content of the corresponding object. The content-filled FlowFiles are then passed to the Output Port in order to be connected with a Publisher Process Group. The S3 Publisher Process Group is presented in Fig. 3 and contains an Input Port and a PutS3Object Processor. The FlowFiles generated from a Consumer Process Group
Fig. 2 S3 Consumer Process Group. FlowFiles are generated from the ListS3 Processor, which outputs a list of objects of an S3 bucket. FetchS3Object then downloads the contents of the objects and passes them to the Output Port
Fig. 3 S3 Publisher Process Group. FlowFiles are obtained from an Input Port and sent to PutS3Object, which uploads objects into an S3 bucket
Fig. 4 HTTP Consumer process group. A text containing a list of URLs to download is split and each URL is downloaded and its content is populated into a FlowFile. The FlowFiles are then transferred into the Output Port
are transferred to the Input Port, and the PutS3Object Processor uploads the content of the FlowFiles into the specified S3 bucket.
4.2.2 HTTP Files
The HTTP Consumer Process Group obtains the data from an HTTP endpoint and generates the corresponding FlowFiles. An HTTP Publisher is not yet implemented. Two versions of the HTTP Consumer were developed. The first version (Fig. 4) accepts a text containing a list of URLs to download, separated by new lines. The GenerateFlowFile Processor creates a FlowFile containing this text and sends it to the SplitText Processor, which splits the text and generates FlowFiles, each containing one URL line. The FlowFiles are then distributed across the cluster using load balancing. The ExtractText Processor attaches an attribute (url), parsed from the FlowFile content, to the FlowFiles. InvokeHTTP sends an HTTP GET request to the specified URL and obtains its content. UpdateAttribute sets the file names of the FlowFiles, which are then sent to the Output Port. In the second version (not shown in the figure), a JSON document containing an array of dictionaries is evaluated: SplitText is substituted with SplitJson and ExtractText is substituted with EvaluateJsonPath. Each dictionary should contain the url key and can optionally contain other keys. The array is split by SplitJson, and each dictionary is parsed by the EvaluateJsonPath Processor, subsequently setting FlowFile attributes from the dictionary keys. The contents of the URLs are then obtained and the file names are set, similarly to the text-based HTTP Consumer.
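As an illustration of the JSON-based variant, an input document of the following shape would be accepted; the URLs and the extra filename key are hypothetical and only show that additional keys become FlowFile attributes.

```json
[
  { "url": "https://data.example.org/images/cam01.jpg", "filename": "cam01.jpg" },
  { "url": "https://data.example.org/images/cam02.jpg", "filename": "cam02.jpg" }
]
```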
Fig. 5 GridFTP Consumer Process Group. FlowFiles are generated from the ListGridFTP Processor, which outputs a list of files of a GSIFTP URL directory. GetGridFTP then downloads the contents of the files and passes them to the Output Port
Fig. 6 GridFTP Publisher Process Group. FlowFiles are obtained from an Input Port and sent to PutGridFTP, which uploads files into the destination GridFTP server
4.2.3 GridFTP Storage
The GridFTP Consumer and Publisher Process Groups operate in a similar way as the S3 Process Groups. The GridFTP Consumer contains the ListGridFTP and GetGridFTP Processors (Fig. 5), which generate empty-content FlowFiles from a list of files of the specified GSIFTP URL directory, distribute the FlowFiles across the cluster and download their contents into the FlowFiles. The FlowFiles are then sent to the Output Port for a Publisher Process Group. The GridFTP Publisher contains the PutGridFTP Processor (Fig. 6), which receives FlowFiles from the Input Port and uploads the files into the specified GSIFTP URL directory. The implementation of these Processors is described in Sect. 4.1.
4.2.4 MQTT Streaming
Apache NiFi offers support for handling MQTT streams and provides support for MQTT Shared Subscriptions. This feature enables load balancing of the MQTT Processors internally, without the need for a dedicated load balancing queue. As such, a subscription to a topic by multiple NiFi nodes will distribute the data across the cluster, as opposed to having the same subscription executed by multiple nodes and thus duplicating the data. This can be set with a shared-subscription topic filter of the form $share/<group>/<topic> for the MQTT Consumer Processor (ConsumeMQTT) of the MQTT Process Group (Fig. 7). The generated FlowFiles containing the data stream are then transferred to the Output Port. The MQTT Publisher Process Group also consists of a single Processor, PublishMQTT (Fig. 8), that receives FlowFiles from the Input Port and publishes the data to the specified topic of the MQTT broker.
Fig. 7 MQTT Consumer Process Group. FlowFiles are generated from the ConsumeMQTT Processor, which outputs the data stream of a particular topic to the Output Port
Fig. 8 MQTT Publisher Process Group. FlowFiles are obtained from an Input Port and sent to PublishMQTT, which publishes data into the destination MQTT broker
5 Evaluation Four experiments were performed for the evaluation of the data management framework in the scope of the SODALITE use cases [18]: the Snow, Clinical and Vehicle IoT use cases. Each experiment reflects a different setup for the data flow between heterogeneous data endpoints. As such, the first experiment involves the transfer of webcam images of the mountains to an S3 storage. In the second experiment, medical datasets available online were transferred to a GridFTP server, while the third and fourth experiments involve the transfer of streaming data from an MQTT broker to S3 and GridFTP storage, respectively. With respect to the underlying tools for data storage and streaming used in the experiments, MinIO [19] was selected as an S3-compatible object storage and the MQTT broker is based on Eclipse Mosquitto [20]. Both of them were deployed in a private OpenStack cloud. A GridFTP server [21] was installed on a testbed at the HLRS premises. The setup for the evaluation consists of a NiFi cluster deployed in another private OpenStack cloud, such that the deployed storage and the NiFi cluster are geographically separated, in order to keep the experiments close to real cases. For each experiment, the number of nodes ranges from 1 to 4 and the total average execution time for the Consumer and Publisher is recorded, in order to show the performance of the data management framework with an increasing number of nodes, thereby demonstrating scalability.
5.1 Experiment 1: Pulling Images from Webcams of the Mountains and Pushing them into a MinIO (S3) Storage This experiment is performed in the scope of the SODALITE Snow use case. This use case consists of a pipeline of components in order to compute the snow index based
Fig. 9 Experiment 1 results. Total average execution time for HTTP Consumer and S3 Publisher per each number of nodes
on the mountain images from public webcams. One of the components—the WebCam Crawler—pulls images from public webcams and stores them either in the local file system or in an S3 storage. Here we used the data management framework as an alternative to the WebCam Crawler in order to present a more scalable approach. The following experiment was performed: the NiFi cluster pulls 1000 images (total size of 50 MB) from 10 different webcams and stores them in the remote MinIO storage. The dataflow for this experiment is the connection of the HTTP Consumer (Fig. 4) and the S3 Publisher (Fig. 3). The results are presented in Fig. 9. It can be seen that with an increasing number of nodes, the performance improved both for the Consumer and the Publisher Process Groups. The time to download the images (HTTP Consumer) is lower than the upload time (S3 Publisher) due to the small size of each image: storing small data in an S3 storage creates a write operation overhead in the file system of the storage device. The average execution time of the HTTP Consumer dropped from 30 s to 8 s, while for the S3 Publisher the execution time dropped from 4 min 36 s to 2 min 24 s. The execution time both for the HTTP Consumer and the S3 Publisher decreased due to the round-robin based load balancing, where the FlowFiles were distributed across the nodes in the cluster, and each node in parallel downloaded the images and pushed them into the S3 storage. With more nodes in the cluster, fewer FlowFiles have to be processed by each node, and therefore less execution time is needed, as shown in the experiment.
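For reference, the speedups implied by these reported timings when going from one to four nodes are (a derived illustration, not reported separately by the authors):

$$
S_{\mathrm{HTTP\ Consumer}} = \frac{30\ \mathrm{s}}{8\ \mathrm{s}} \approx 3.8,
\qquad
S_{\mathrm{S3\ Publisher}} = \frac{276\ \mathrm{s}}{144\ \mathrm{s}} \approx 1.9 .
$$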
5.2 Experiment 2: Pulling Medical Datasets and Pushing them into GridFTP Server This experiment is performed in the scope of the SODALITE Clinical use case. This use case reproduces real clinical trials in biomechanics by means of simulation to determine an optimal fixation and function of bone implant systems for patients with spinal conditions (e.g. disk displacement or prolapse). Patients' CT scans are ingested into the use case workflow, which is then executed either in the Cloud or on HPC. This experiment consists of pulling 50 datasets [22], each containing CT scans in DICOM format, and then pushing the datasets into the GridFTP server in the HPC testbed. The resulting size of the data is approximately 5.9 GB. The dataflow for this experiment is the connection of the HTTP Consumer (Fig. 4) and the GridFTP Publisher (Fig. 6). The results of this experiment are presented in Fig. 10. It can again be observed that with an increasing number of nodes, the performance improved both for the Consumer and the Publisher Process Groups. The time to download the datasets (HTTP Consumer) is similar to the upload time (GridFTP Publisher) due to the larger size of each dataset: the network transfer time and the time of file system IO commands do not cause the overhead observed with small data. The average execution time of the HTTP Consumer dropped from 19 min 3 s to 9 min 40 s, while for the GridFTP Publisher the execution time dropped from 19 min 5 s to 10 min 10 s. The execution time both for the HTTP Consumer and the GridFTP Publisher decreased due to load balancing, where the FlowFiles were distributed across the nodes in the cluster
Fig. 10 Experiment 2 results. Total average execution time for HTTP Consumer and GridFTP Publisher per each number of nodes
by the HTTP Consumer, and each node in parallel downloaded the datasets and pushed them into the GridFTP server. Fewer FlowFiles have to be processed by each node and less execution time is needed with more nodes in the cluster.
5.3 Experiment 3: Streaming Vehicle Events into S3 Storage and GridFTP Server The Vehicle IoT use case involves the development and provisioning of services for connected vehicles, running both in the Cloud as well as directly at the Edge. In the scope of this use case, vehicle telemetry events were fetched and stored in S3 storage and a GridFTP server to evaluate the Data Management component in terms of handling data streams and small messages. The events are generated with the KnowGo Vehicle Simulator [23], which allows streaming vehicle telemetry events directly to an MQTT broker for further processing by interested subscribers. With an event rate of one event per second, the KnowGo Simulator can produce 600 events per vehicle within 10 min. We measured the execution time of completing the transfer of 6000 events (10 vehicles over a 10 min time frame) produced by the KnowGo Simulator and published at once from an MQTT broker into S3 storage and a GridFTP server, representing the cases where events should be handled in clouds and HPC, respectively. The dataflow for MQTT to S3 storage is the connection of the MQTT Consumer (Fig. 7) and the S3 Publisher (Fig. 3), whereas for MQTT to the GridFTP server it is the connection of the MQTT Consumer (Fig. 7) and the GridFTP Publisher (Fig. 6). Eclipse Mosquitto was used as the MQTT broker and the shared subscription feature was utilised, as explained in Sect. 4.2.4. The results of this experiment are presented in Figs. 11 and 12. The performance of the Consumer Process Group is the same across the nodes due to the small amount of data to pull from the data broker. The execution time of the Publisher decreases with an increasing number of nodes. The time to download the events (MQTT Consumer) is significantly lower than the upload time (S3 and GridFTP Publishers) due to the small size of each telemetry event: similarly to Experiment 1, storing small data in an S3 storage creates a write operation overhead. This is even more prominent when using the GridFTP Publisher, where an additional communication overhead is created because the control and data channels are established for the transfer of each telemetry event. The execution time of the MQTT Consumer averages 3 s, while for the S3 Publisher the execution time dropped from 34 min 50 s to 13 min 22 s and for the GridFTP Publisher it dropped from 424 min 38 s to 105 min 36 s. The execution time both for the S3 and GridFTP Publishers decreased due to load balancing. To address the issue with small data in future work, tuning of the Run Duration parameter of the NiFi Processors should be explored. This parameter is responsible for handling FlowFiles in micro-batches, i.e. processing multiple FlowFiles at a time. However, not all Processors support this. For example, InvokeScriptedProcessor, on which the GridFTP Processors are based, does not allow this setting, whereas the S3
Fig. 11 Experiment 3 results. Total average execution time for MQTT Consumer and S3 Publisher per each number of nodes
Fig. 12 Experiment 3 results. Total average execution time for MQTT Consumer and GridFTP Publisher per each number of nodes
Processors provide this support. Another approach is the introduction of a NiFi Processor (such as MergeContent or MergeRecord) that accumulates and merges data before sending it to the next Processor of a Process Group.
6 Conclusion In this work a dataflow management framework for scalable data transfers between heterogeneous data endpoints was presented. A common use of the framework is data movement between Cloud, HPC and Edge/IoT, which is achieved by introducing data flows. Apache NiFi was chosen as the underlying dataflow management system. The dataflow for heterogeneous data transfers was designed to respect NiFi's clustering paradigm and to enable scalability by introducing load balancing and a separation of the data listing and data fetching processes. Moreover, custom NiFi Processors for GridFTP were developed by using InvokeScriptedProcessor and the JGlobus library. Several experiments were performed in the scope of the SODALITE use cases that demonstrated the scalability of the framework when transferring data between S3, HTTP, GridFTP and MQTT endpoints. Acknowledgements This work is supported by the European Commission grant No. 825480 (H2020), SODALITE, in collaboration with No. 825040 (H2020), RADON.
References 1. Di Nitto, E., Cruz, J.G., Kumara, I., Radolovi´c, D., Tokmakov, K., Vasileiou, Z.: Deployment and Operation of Complex Software in Heterogeneous Execution Environments: The SODALITE Approach. Springer Nature (2022) 2. Bulusu, R., Jain, P., Pawar, P., Afzal, M., Wandhekar, S.: Addressing security aspects for HPC infrastructure. In: 2018 International Conference on Information and Computer Technologies (ICICT), pp. 27–30. IEEE (2018) 3. Apache NiFi. https://nifi.apache.org. Accessed 16 Dec 2022 4. Dehury, C., Jakovits, P., Srirama, S.N., Tountopoulos, V., Giotis, G.: Data pipeline architecture for serverless platform. In: European Conference on Software Architecture, pp. 241–246. Springer, Cham (2020) 5. Globus. https://www.globus.org/. Accessed 16 Dec 2022 6. Onedata. https://onedata.org. Accessed 16 Dec 2022 7. FTS3. https://fts.web.cern.ch/fts/. Accessed 16 Dec 2022 8. GFAL2. https://dmc-docs.web.cern.ch/dmc-docs/. Accessed 16 Dec 2022 9. Rucio. http://rucio.cern.ch/. Accessed 16 Dec 2022 10. Apache Kafka. https://kafka.apache.org/. Accessed 16 Dec 2022 11. Fledge. https://www.lfedge.org/projects/fledge/. Accessed 16 Dec 2022 12. StreamSets. https://streamsets.com/. Accessed 16 Dec 2022 13. Casale, G., Artaˇc, M., Van Den Heuvel, W.J., van Hoorn, A., Jakovits, P., Leymann, F., Long, M., Papanikolaou, V., Presenza, D., Russo, A., Srirama, S.N.: Radon: rational decomposition and orchestration for serverless computing. SICS Softw.-Intensive Cyber-Phys. Syst. 35(1), 77–87 (2020)
14. Dehury, C.K., Srirama, S.N., Chhetri, T.R.: CCoDaMiC: a framework for coherent coordination of data migration and computation platforms. Future Gener. Comput. Syst. 1–6 (2020) 15. JGlobus, GitHub page. https://github.com/jglobus/JGlobus. Accessed 16 Dec 2022 16. Hethmon, P.: Extensions to FTP. RFC-3659 (2007) 17. Allcock W.: Protocol extensions to FTP for the Grid. In: Global Grid Forum (2003) 18. Pita Costa, J., Fraternali, P., Meth, K., Mundt, P., Quattrocchi, G., Schneider, R., Tokmakov, K., Torres, RN.: SODALITE use cases. In: Deployment and Operation of Complex Software in Heterogeneous Execution Environments, pp. 109–144. Springer, Cham (2022) 19. MinIO. https://min.io. Accessed 16 Dec 2022 20. Eclipse Mosquitto. https://mosquitto.org. Accessed 16 Dec 2022 21. Globus Connect Server. https://www.globus.org/globus-connect-server. Accessed 16 Dec 2022 22. Shakouri, S., Bakhshali, M.A., Layegh, P., Kiani, B., Masoumi, F., Ataei Nakhaei, S., Mostafavi, S.M.: COVID19-CT-dataset: an open-access chest CT image repository of 1000+ patients with confirmed COVID-19 diagnosis. BMC Res. Notes 14(1), 1–3 (2021) 23. KnowGo Vehicle Simulator. https://knowgoio.github.io/knowgo-vehicle-simulator/docs. Accessed 16 Dec 2022
Dynamic Load Balancing of a Coupled Lagrange Particle Tracking Solver for Direct Injection Engine Application Tim Wegmann, Matthias Meinke, and Wolfgang Schröder
Abstract A dynamic load balancing technique for multiphysics simulation methods based on hierarchical Cartesian meshes for a direct injection internal combustion engine application is presented. A finite-volume method for the large-eddy simulation of the turbulent in-cylinder flow field is two-way coupled to a Lagrange Particle Tracking algorithm for the liquid fuel phase. Additionally, a semi-Lagrange level-set solver is used to track the location of the moving engine parts. The joint Cartesian mesh is used for the domain decomposition, which allows an efficient redistribution of the computational load using a space filling curve. The simulations are based on meshes with approximately 155 million cells and 1 million embedded spray parcels. Due to significant load changes, created by the solution adaptive mesh refinement, the necessity of a dynamic load balancing technique is demonstrated. The optimal load balancing interval is computed and different weighting methods for the domain decomposition are compared. The simulation results show a strong influence of the in-cylinder flow field on the fuel vapor distribution at start of ignition.
1 Introduction In the transport sector, a large number of vehicles with internal combustion engines (ICEs) will remain operating during the next decades. Therefore, the use of CO.2 neutral energy carriers such as bio-hybrid fuels produced on a regenerative basis is essential to mitigate global warming. Bio-hybrid fuels, however, exhibit different properties in terms of heat capacity, latent heat of evaporation and many other influencing factors, which impact the fuel-air mixing behavior and the subsequent combustion process. Thus, injection systems of ICEs have to be optimized for promising new fuel candidates [3, 23]. Such optimization can be achieved for existing engine T. Wegmann (B) · M. Meinke · W. Schröder Institute of Aerodynamics RWTH Aachen University, Wüllnerstraße 5a, 52062 Aachen, Germany e-mail: [email protected] W. Schröder Jülich Aachen Research Alliance Center for Simulation and Data Science, RWTH Aachen University and Forschungszentrum Jülich, Seffenter Weg 23, 52074 Aachen, Germany © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. M. Resch et al. (eds.), Sustained Simulation Performance 2022, https://doi.org/10.1007/978-3-031-41073-4_4
hardware, e.g., by adjusting the engine control parameters or the exchange of injectors. This, however, requires accurate predictions of the engine performance in terms of efficiency and emission of pollutants, on which the fuel mixing during the intake and compression stroke has a substantial influence. In this paper, a numerical simulation method for the prediction of the spray jet and in-cylinder flow interaction and subsequent fuel-air mixing is presented, which has been designed for efficient execution on high performance computing (HPC) systems. A large-eddy simulation (LES) based on a finite-volume (FV) method is used to predict the turbulent flow field. The spray, i.e., the liquid fuel, is described by a Lagrange Particle Tracking (LPT) algorithm. The continuous and disperse phase are two-way coupled by source term and flow field exchange. Additionally, a level-set (LS) solver is used to define the in-cylinder geometry with the moving piston and valves. The implementation of such a multiphysics simulation method on HPC systems is non-trivial due to the different numerical methodologies and discretization requirements for the individual solvers. Additionally, coupling information needs to be exchanged between the solution methods, usually requiring communication and synchronization of the involved solvers. When solution adaptive mesh refinement is used, the computational load can change significantly both overall and between compute cores. The solution method presented here allows the three different solvers to use their individual mesh refinement as a non-conforming subset of a joint hierarchical Cartesian mesh. Domain decomposition is based on a uniform coarse level of the joint octree mesh via a space-filling curve (SFC). A comparison of a zero-knowledge-type dynamic load balancing (DLB) scheme with a static scheme based on pre-defined cell weights is presented. The different solvers and coupling procedures with their associated computational time and load distribution are discussed. Furthermore, results for the application of liquid bio-hybrid fuel injection in an internal combustion engine, for which DLB and an efficient solver coupling are essential, will be presented. This application is especially challenging for DLB methods due to the continuously changing computational domain and load distribution caused by piston, valve and fuel droplet motion. This study has the following structure. First, the mathematical models for the continuous and disperse phase are described in Sect. 2. In Sect. 3, the numerical methods are introduced. Simulation results and the analysis of the DLB method are presented in Sect. 5. Finally, conclusions are drawn in Sect. 6.
2 Mathematical Model The mathematical model for the large-eddy simulation of the continuous phase is given in Sect. 2.1. Subsequently, the particle motion, energy and spray equations of the Lagrange Particle Tracking algorithm for the disperse phase are summarized in Sect. 2.2. The level-set equation used to predict the location of moving boundary surfaces is given in Sect. 2.3.
2.1 Continuous Phase The motion of a compressible, viscous fluid is governed by the conservation equations for mass, momentum, and energy, i.e., the Navier–Stokes equations. For a two-species flow, i.e., a fuel-air mixture, an additional advection-diffusion equation has to be solved to determine the fuel concentration $Y$. The set of equations can be formulated in non-dimensional form

$$
\frac{\mathrm{d}}{\mathrm{d}t}\int_{V(t)}
\begin{bmatrix}\rho\\ \rho\mathbf{u}\\ \rho E\\ \rho Y\end{bmatrix}\mathrm{d}V
+\oint_{\partial V(t)}\left(
\begin{bmatrix}\rho(\mathbf{u}-\mathbf{u}_{\partial V})\\
\rho\mathbf{u}(\mathbf{u}-\mathbf{u}_{\partial V})+p\,\overline{I}\\
\rho E(\mathbf{u}-\mathbf{u}_{\partial V})+p\mathbf{u}\\
\rho Y(\mathbf{u}-\mathbf{u}_{\partial V})\end{bmatrix}
+\frac{1}{Re_0}
\begin{bmatrix}0\\ \overline{\tau}\\ \overline{\tau}\,\mathbf{u}+\mathbf{q}\\ -\tfrac{\rho D}{Sc_0}\nabla Y\end{bmatrix}
\right)\cdot\mathbf{n}\,\mathrm{d}\Gamma
=\begin{bmatrix}S_\rho\\ S_u\\ S_E\\ S_Y\end{bmatrix}
\qquad(1)
$$

where $V(t)\subset\Omega(t)$ is a moving control volume bounded by the surface $\Gamma(t)=\partial V(t)$ with the outward pointing normal vector $\mathbf{n}$. The vector of conservative variables contains the gas phase density $\rho$, velocity vector $\mathbf{u}$, total specific energy per unit mass $E$ and fuel density $\rho Y$. The source term vector on the right side of the equation is used for the two-way coupling of the disperse and continuous phase. The inviscid and viscous flux tensors contain the pressure $p$, the velocity of the control volume surface $\mathbf{u}_{\partial V}$, and the unit tensor $\overline{I}$. All equations are non-dimensionalized by the stagnation state properties, denoted by the index $(\cdot)_0$; dimensional reference values are marked accordingly. The Reynolds number is based on $Re_0=\rho_0 a_0 L/\mu_0$, using the speed of sound at rest $a_0=\sqrt{\gamma p_0/\rho_0}$ with the heat capacity ratio of air $\gamma=1.4$ and the engine bore as characteristic length $L$. The Schmidt number in the viscous flux tensor for the passive scalar species is $Sc_0=\mu_0/(\rho_0 D_0)=1.0$ and the diffusion coefficient is $D=\mu(T)/\rho$. Since the injected amount of fuel is typically low in ICEs, the fuel concentration $Y$ is assumed to have no impact on the gas properties. Assuming a Newtonian fluid with zero bulk viscosity, the stress tensor $\overline{\tau}$ is expressed by

$$
\overline{\tau}=\frac{2}{3}\,\mu(T)\,(\nabla\cdot\mathbf{u})\,\overline{I}
-\mu(T)\left(\nabla\mathbf{u}+(\nabla\mathbf{u})^{T}\right),\qquad(2)
$$

in which the dynamic viscosity $\mu(T)$ is determined by Sutherland's law. The heat flux, based on Fourier's law with a constant Prandtl number for air of $Pr_0=0.72$, reads

$$
\mathbf{q}=-\frac{\mu(T)}{Pr_0\,(\gamma-1)}\,\nabla T.\qquad(3)
$$

The equations are closed by the ideal gas law in non-dimensional form, $\gamma p=\rho T$.
T. Wegmann et al.
2.2 Disperse Phase The equation of motion for a discrete particle . p with velocity .u p in non-dimensional form reads .
du p C D Re p ρ 1 ) 2 eg + (u − u p ) , = (1 − dt ρ p Fr0 24 τ p
(4)
√ with the Froude-number . Fr0 = a0/ g L , unity gravity vector .e g , particle Reynolds number. Re p = ρ||u−u p ||d p/μ, and particle relaxation time.τ p = ρ p d 2p/18μ. The drag coefficient .C D (Re p · Re0 ) is given by the empirical relation in [15]. For the total particle mass and temperature temporal rate of change (.m˙ p ,.T˙ p ) two differential equations following the evaporation model of Miller and Bellan [9, 10] are included .
dm p mp Sh ' =− ( ) ln(1 + B M,neq ) dt 3 Sc' Re0 τ p N u 'neq T − Tp m˙ p dT p θ1 + L ev (γ − 1)θ1 , . = ' dt 3 Pr Re0 τp mp
(5) (6)
where .θ1 is the ratio of fluid to liquid heat capacity .θ1 = C p/C p, p and . L ev = L ev/a0 the non-dimensional latent heat of vaporization. The Schmidt and Prandtl number of the fluid phase are given by . Sc' = μ/ρΓ · μ0/ρ0 Γ0 and . Pr ' = μC p/λ · μ0 C p/λ0 , with the dimensionless binary diffusion coefficient .Γ , the thermal conductivity .λ, and the respective dimensional reference state values. For the convective heat and mass transfer, the empirical Ranz–Marshall correlations [16, 17] for the Sherwood and Nusselt number are used. The equilibrium vapor mole fraction (.χs,eq ) is obtained from the Clausius–Clapeyron equation. Non-equilibrium effects are considered by a correction of the Nusselt number with the non-dimensional evaporation parameter .β and a correction of the vapor mole fraction at the surface by the Langmuir–Knudsen law. The non-equilibrium species vapor mass fraction at the particle surface .Ys,neq and the Spalding transfer number for mass . B M,neq are defined as Y
. s,neq
=
χs,neq
χs,neq Ys,neq − Y and B M,neq = . + (1 − χs,neq )θ2 1 − Ys,neq
(7)
The “1/3 rule” following [6] is used as a reference state in the boundary layer around the particle to obtain fluid properties. The Wilke mixing rule [26] is applied for the fuel vapor and fluid mixing in the particle boundary layer. The following source terms are used for the mass, momentum, and energy exchange between the particle and the fluid phase
Dynamic Load Balancing of a Coupled Lagrange Particle …
⎤ ⎤ ⎡ m˙ p Sρ Re ⎥ ⎢ Su ⎥ ⎢ m p C24D τ pp (u − u p ) + m˙ p u p ⎥ ⎥=⎢ ( ) . .⎢ ⎢ ⎣ S E ⎦ ⎣ S · u − 0.5m˙ u · u + C p, p m T˙ + m˙ T ⎥ ⎦ p p p p i p p p p γ −1 SY m˙ p
45
⎡
(8)
To reduce the computational effort of the spray injection, a number of fuel droplets Nd are grouped into a parcel. All droplets in a parcel are assumed to have the same physical values and suffer from the same change in their properties, i.e., they are tracked as a single Lagrangian particle. The spray atomization and secondary breakup process, leading to decreasing particle diameter, is modeled by a combination of the Kelvin–Helmholtz (KH) and Rayleigh–Taylor (RT) model. The present implementation is based on the formulation of Reitz [18] and Beale and Reitz [19].
.
2.3 Moving Surfaces An individual moving surface element can be expressed as the zero contour of the signed-distance function .ϕ(x, t) which evolves by the transport equation .
∂ϕ + vΓ · ∇ϕ = 0 , ∂t
(9)
with the translational surface motion function .vΓ . This equation could be solved numerically to determine the location of the piston and valves. Due to the accumulation of the truncation error during the time integration, the accuracy of the moving geometries will deteriorate. Therefore, another approach is used, where the zero level-set is computed by interpolation of reference level-set distributions as described in [4].
3 Numerical Methods The numerical prediction of the fuel injection and fuel-air mixing is performed with three non-conforming and distinct solution methods. A finite-volume method (FV) is used for the solution of the fluid conservation equation (Eq. 1). A semi-Lagrange level-set solver (LS) is used to determine the evolution of the embedded moving boundaries [4]. The disperse, liquid phase is tracked by a Lagrange particle tracking (LPT) method. The different numerical methods are presented, followed by a description of the solver coupling and load balancing method.
46
T. Wegmann et al.
3.1 Finite-Volume Solver The turbulent flow field is predicted by an LES using a cell-centered finite-volume method, where a monotone integrated LES (MILES) approach is used. The inviscid fluxes in the Navier–Stokes equations are approximated by a low-dissipation variant of the advection upstream splitting method (AUSM) as proposed in [8]. The viscous fluxes are approximated by a central-difference scheme, where gradients at the cell-surface centers are computed using the re-centering approach by Berger et al. [2]. Cell-centered gradients are determined by a weighted least-squared reconstruction scheme such that a second-order accurate scheme in space is obtained. Second-order accurate time integration is achieved by an explicit five-stage Runge– Kutta scheme optimized for moving boundary problems. Runge–Kutta coefficients optimized for stability as proposed by Schneiders et al. [21] are used. The immersed moving boundaries are represented by a conservative sharp multiple-cut-cell and split-cell method [22]. A multiple-ghost-cell approach is used to prescribe the flow state individually for cut cells assigned to more than one boundary condition. A combined interpolation and flux-redistribution technique yields a conservative and stable numerical method for small cells. Further details on the overall numerical method can be found, e.g., in [4, 14, 21].
3.2 Level-Set Solver For the translational motion of different surface elements e.g., cylinder head, piston, and valves a semi-Lagrange multiple level-set method [4] is used to solve the levelset equation (Eq. 9) individually for each surface element. In this approach, the error of the geometric representation of the surface is reduced to the constant interpolation error between two reference locations. The overall surface of the fluid domain is constructed from all elements using a combined level-set function as described in [4]. Additionally, a gap-closing method is applied to the combined level-set in the vicinity of the valves and the valve seat when their gap width falls below a specified threshold.
3.3 Lagrange Particle Tracking Solver A modified second-order accurate Euler method is used to solve the particle motion equation (Eq. 4). The differential equations for particle temperature- and mass change (Eqs. 5 and 6) are solved by an iterative implicit Euler method. Continuous phase state variables (.ρ, u, T, p, Y ) are interpolated with first-order accuracy to the parcel position. A stencil containing all surrounding cells of the parcel and a distance weight is applied. The same weights are used for the redistribution of the parcel source terms for the continuous phase description. Particle-wall collisions are modeled as hardsphere collisions based on boundary surface normal and position.
Dynamic Load Balancing of a Coupled Lagrange Particle …
47
3.4 Solver Coupling All solvers operate on a hierarchical Cartesian grid in which cells are organized in an octree structure with parent, child, and neighbor relationships [5]. The joint domain decomposition based on this underlying Cartesian grid ensures an efficient spatial coupling between the different numerical methods. Solver cells contained in the same volume of the three-dimensional domain are assigned to the same process allowing in-memory exchange of the coupling terms. Individual spatial solver constraints for the different physical systems are taken into account during the mesh generation and solution adaptive grid refinement, where cells are tagged according to their use by the solver. When different domain sizes are used, single solvers can be deactivated for certain subdomains, thus not participating in the solver communication. An example of the spatial decomposition and solver affiliation for the Direct Injection application is presented in Fig. 1. The number of possible combinations of solver affiliations for a single cell is reduced by physical and application driven solver restrains, i.e., LPT and FV cells are both reduced to the fluid domain. The LPT solver is additionally deactivated in the intake and exhaust ports due to the closed valves shortly after start of injection. Note that level-set cells are required outside the fluid domain for the resolution of the zero level-set contour. The solution execution uses the same time-step for all solvers. In the current implementation, the single iteration level-set interpolation is conducted first, after which the predicted boundary position is passed to the finite-volume and LPT solvers. Next, the 5 Runge–Kutta integration stages of the FV solver are alternatively advanced with sub-steps of the particle time step. Before the last Runge–Kutta step, updated source terms are passed from the LPT to the FV cells. The advantages and limitations of this time-stepping procedure is further discussed is Sect. 5.2.2.
Fig. 1 Domain decomposition of the underlying quad-tree hierarchical mesh for Finite-Volume (FV), Level-Set (LS) and Lagrange Particle Tracking (LPT) solvers for a Direct Injection application. The cell color defines the solver affiliation. Due to an increased local mesh refinement, the partitioning is shifted from the lowest level .lα to a higher refinement level on subdomain .di and .di+1
48
T. Wegmann et al.
3.5 Domain Decomposition and Load Balancing Method The domain decomposition of the coarse partition level of the joint Cartesian grid is based on a Hilbert space-filling curve (SFC), which creats a one-dimensional ordering of all partition cells. When a computational workload for each grid cell is known, the partition cell workload is found by traversing the subtree of the grid and accumulating the individual cell workloads. This reduces the subdomain decomposition to a socalled chains-on-chains partitioning (CCP) problem [13]. This approach is illustrated in Fig. 2 for two time-steps of the Direct Injection application. The specific workload for a solver cell type can be either defined a-priory using a static work load or can be computed dynamically for each subdomain during the simulation by the dynamic load balancing method. The DLB algorithm applied in this study [11] estimates the computational weights based on measurements of the
Fig. 2 Illustration of the domain decomposition and dynamic load balancing for the Direct Injection application between two time steps t1 on the top and t2 on the bottom. The adaptive mesh refinement due to moving parcels and geometries is schematically shown on the left. The changing numerical workload is displayed in the center, where the cell color indicates the sum of the associated accumulated solver cell workload, based on static exemplary cell weights wFV = 1, wLS = 0.5 and a parcel weight of wLPT = 1 assigned for each parcel within a cell. On the right the space-filling curve through the coarse level partition cells is shown, where the cell color denotes the associated subdomain. Changing colors for the individual partition cells indicate a redistribution to the corresponding subdomain
computing time and the distribution of solver cells on the individual subdomains. This assumes that, on average, the load can be expressed as a linear combination of the individual workload contributions. This approach usually leads to an optimal load balancing when any overhead from communication can be neglected. Load balancing using static workloads requires additional user knowledge of the workload share of the involved solution methods and the associated cell loads. In principle, it can lead to load distributions similar to those obtained with dynamically determined cell weights. The parallel efficiency of the computation is maximized by an even distribution of the workload among all processes [20]. For a single-solver framework with constant cell load, this leads to an even distribution of cells among all parallel subdomains, assuming a homogeneous computing environment. Thus, memory usage is evenly distributed among all processes as well. However, load balancing of coupled multiphysics simulations is inherently more complex and poses various challenges. For instance, multi-stage calculations and computation steps with communication barriers may prevent a balanced distribution of load. Time-dependent load variations through changing domain boundaries, adaptive grid refinement, and other load shifts usually generate additional imbalance.
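To illustrate the partitioning step, the sketch below distributes partition cells, already ordered along the space-filling curve, into contiguous subdomains of roughly equal accumulated workload. It is a simple greedy heuristic for illustration only; the chains-on-chains partitioning algorithms of [13] solve this problem optimally.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy 1D partitioning of partition-cell workloads (ordered along the Hilbert
// space-filling curve) into nDomains contiguous chunks of roughly equal load.
// Returns the first cell index of each subdomain. Illustrative heuristic only.
std::vector<std::size_t> partitionAlongSfc(const std::vector<double>& cellLoad,
                                           std::size_t nDomains) {
  const double total = std::accumulate(cellLoad.begin(), cellLoad.end(), 0.0);
  const double target = total / static_cast<double>(nDomains);

  std::vector<std::size_t> offsets = {0};
  double acc = 0.0;
  for (std::size_t i = 0; i < cellLoad.size(); ++i) {
    if (acc >= target && offsets.size() < nDomains) {
      offsets.push_back(i);  // start a new subdomain at this partition cell
      acc = 0.0;
    }
    acc += cellLoad[i];
  }
  while (offsets.size() < nDomains) offsets.push_back(cellLoad.size());
  return offsets;
}
```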
4 Computational Setup
The 8-hole Spray G injector of the ECN [7], which is also extensively studied in [1, 12], has been selected for the current Direct Injection study. Ethanol bio-hybrid fuel with a stoichiometric fuel mass and a start of injection (SOI) at 210 crank angle degrees (CAD) is modeled. A high-tumble, long-stroke research engine with a bore of 0.075 m and an operating condition of 2000 RPM and 6 bar IMEP is utilized. The engine features a piston bowl and an inlet-side installed tumble control plate with standard intake valve timings. The intake port region extends 3.3 bore diameters upstream and the exhaust port 2.7 diameters downstream of the engine center. 5 injection cycles with meaningful individual initial flow field conditions at SOI have been conducted to analyze cycle-to-cycle variations in the fuel-air mixture. Additionally, a rapid compression machine (RCM) with an identical setup, but with a fluid at rest as the initial condition for the flow field at the beginning of the compression stroke, is considered. Details of the spray and engine initial and boundary conditions can be found in [24]. Simulations are performed on a solution and boundary adaptive refined mesh consisting of a uniform level and two higher refinement levels, as shown in Fig. 3. The smallest cell length is Δx/bore ≈ 0.001827 or 0.137 mm, which is chosen at engine walls, near particles and in other flow regions with high gradients. The spatial and time steps match the requirements obtained from a grid convergence study for spray injections by Wehrfritz et al. [25]. The total number of level-set, finite-volume, and underlying LPT mesh cells changes strongly during a simulation cycle as shown in Fig. 4. During injection, the number of active flow cells reaches 155 million with
Fig. 3 Cut through the active finite-volume mesh in the engine tumble plane at 215 CAD. Fuel droplets in the tumble plane are displayed in red to visualize the adaptive mesh refinement near the particles. Additionally, the current piston position (orange) and the piston position at 340 CAD (blue) are indicated to visualize the varying extent of the fluid domain
up to 1 million Lagrangian particles. The computational cost for the injection and subsequent compression stroke for a single cycle is ≈450,000 core hours on the HAWK system installed at HLRS Stuttgart.
5 Simulation Results
First, the simulation results for the direct injection and the fuel-air mixing in the internal combustion engine are presented. Next, the parallel efficiency and load balancing results are discussed.
Fig. 4 Finite Volume cell numbers and number of Lagrange parcels for the ethanol ICE injection at 210 CAD SOI and the engine simulation without injection
5.1 Fuel-Air Mixing
The temporal evolution of the fuel-air mixing is quantitatively discussed by means of the integral coefficient of variance CV computed from the stoichiometric mean concentration Ȳt with

$$CV(t) = \frac{1}{V(t)\,\overline{Y}_t} \oint_{V(t)} \left( Y(x_i, t) - \tilde{Y}(t) \right)^2 \mathrm{d}V(t) . \qquad (10)$$

A larger coefficient of variance indicates poor fuel-air mixing with regions of lean and rich fuel concentration in the cylinder volume. Rich regions with a large fuel vapor concentration typically lead to an unfavorable combustion and can increase engine emissions. In Fig. 5a, the coefficient of variance is plotted for the internal combustion engine (ICE) and rapid compression machine (RCM) setup with the start of injection at 210 CAD. The influence of the different flow conditions at the beginning of the compression stroke on the fuel-air mixing can be seen. The engine setup shows a favorable fuel-air mixing with significantly lower CV values. For instance, at the start of ignition at 340 CAD, CV is decreased by a factor of 3.3 compared to the RCM setup. The different flow conditions of the engine cycles impact fuel-air mixing and cause cycle-to-cycle variations in the distribution of the fuel concentration. In Fig. 5b, the fuel distribution function V(Ỹ) at 340 CAD is displayed. The cycle with the lowest CV value for the ICE shows a significantly higher volume with the target concentration of Ỹ = 1. The cycle with the largest CV has slightly larger lean and rich regions. From the volume rendered cycle mean fuel distribution of the ICE
Fig. 5 a Temporal evolution of the coefficient of variance (CV) of the fuel concentration for the spray-G injector with 210 CAD as SOI for the RCM and ICE. b Dimensionless fuel concentration distribution functions for the spray-G injector at 210 CAD as SOI at 340 CAD. Displayed are the cycle mean and the values for the cycles with the largest and lowest coefficient of variance in the fuel concentration. The volume ratios are computed based on a concentration step of 0.01 and are a percentage of the engine volume
Fig. 6 Ensemble average for 5 injection cycles of the non-dimensional fuel concentration for the ICE at the start of ignition at approx. 340 CAD. Volumes with less than 5% deviation from the stoichiometric fuel concentration are displayed as transparent
setup, collective rich and lean regions can be identified in Fig. 6. The distribution for the RCM differs significantly from these cycle-to-cycle variations and shows larger rich volumes and even regions with unmixed air. The difference in this fuel-air mixing behavior is explained based on the evolution of the in-cylinder flow field derived from the integral specific kinetic energy ẽ and the counter-clockwise z-component of the vorticity Ω as displayed in Fig. 7a and b. Without injection, the specific kinetic energy and vorticity inside the cylinder increase
Fig. 7 Temporal evolution of the average specific kinetic energy ẽ (a) and the cycle mean of the counter-clockwise z-component of the cylinder vorticity Ω (b) for the spray-G injector with 210 CAD as SOI for the RCM and ICE in comparison to the cylinder flow without injection
towards the end of the compression stroke due to the shrinking diameter of the large-scale circular tumble motion. During the injection, the spray induces a jet, which interacts with this tumble motion. The injection setup shows a slightly lower kinetic energy but a similar vorticity strength after the end of injection. The additional shear created by the circular tumble motion constantly generates smaller turbulent scales, which enhance the fuel-air mixing. No circular tumble motion is present in the RCM setup; therefore, the spray induces a symmetrical jet, which only interacts with the upward fluid motion caused by the piston movement. Thus, almost no vorticity is observed there. It can be concluded that the circular tumble motion favors a homogeneous fuel-air mixing and larger volumes with stoichiometric fuel concentration at the start of ignition. Further, high spatial resolution is essential for the accurate prediction of the fuel concentration since meshes with too large spatial steps lead to an overprediction of the dissipation rates, causing an earlier tumble break-up.
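For reference, the integral of Eq. (10) can be evaluated discretely by summing over the cells of the fluid domain, as in the following sketch; the data layout and field names are assumptions for illustration.

```cpp
#include <vector>

// Discrete evaluation of the integral coefficient of variance, Eq. (10):
// the volume integral is replaced by a sum over cells (placeholder data layout).
double coefficientOfVariance(const std::vector<double>& Y,    // fuel mass fraction per cell
                             const std::vector<double>& vol,  // cell volumes
                             double Ymean,                    // stoichiometric mean concentration
                             double Ytilde) {                 // target concentration in Eq. (10)
  double integral = 0.0;
  double V = 0.0;
  for (std::size_t i = 0; i < Y.size(); ++i) {
    const double dev = Y[i] - Ytilde;
    integral += dev * dev * vol[i];  // (Y - Ytilde)^2 dV
    V += vol[i];                     // total cylinder volume V(t)
  }
  return integral / (V * Ymean);     // normalization as in Eq. (10)
}
```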
5.2 Parallel Efficiency
The parallel efficiency of the numerical method is analyzed by performing a certain number of time steps around 215 CAD using 4096 MPI ranks or computing cores of the HAWK system installed at HLRS for various simulation setups. All simulations are started with the same checkpoint data using an initial domain decomposition with the same number of active cells for all MPI ranks, i.e., a load distribution assuming the same computational weight for all cells associated with the various solvers. Based on this setup, the necessity for a dynamic load balancing, the optimal dynamic load balancing interval, and the limitations of the load balancing approach are discussed.
5.2.1 Mesh Adaptation and Load Balancing Interval
During the injection phase, the liquid spray penetrates the cylinder volume and solution adaptive mesh refinement is triggered at irregular intervals based on the droplet and piston positions. At these adaptation steps more cells are added than removed, such that the overall computational effort increases and load and memory imbalances are generated. In this case, a dynamic load balancing method is necessary for a continuous simulation run without a disruptive increase in the number of computing nodes through a restart of the simulation, as seen in Fig. 8. On the left, the maximum wall time from all MPI ranks to compute a single time step is displayed for different setups. The wall time is non-dimensionalized by the reference wall time required for the first time step with the reference domain decomposition. A setup using the reference domain decomposition without load balancing (yellow) is compared to a setup using a domain decomposition based on static cell weights (SCW) with two different time step intervals between load rebalancing steps. Individual static cell weights are assigned to the different solver cell types based on the computational load of the solver method and the cell boundary condition formulation. For the FV solver, 5 different cell types are identified, i.e., active leaf cells and cells at or near the moving interface. Additionally, the maximum number of active finite volume leaf cells, which is an indicator for the maximum memory consumption of the simulation, is plotted on the right. Jumps in the cell count for the setup without load balancing occur at a time step when a mesh refinement is conducted, which simultaneously leads to an increase in the reference wall time for the subsequent time steps. The setup based on the reference domain decomposition without load balancing reaches the memory limit of the compute nodes during the adaptation process after 1350 time steps, at which point the run time has increased by a factor of 1.6. The setup based on an improved domain decomposition by using pre-defined static cell weights but without dynamic
Fig. 8 Non-dimensional wall time for one time step (a) and the maximum number of active finite volume leaf cells on a MPI rank (b) for a setup without a dynamic load-balancing and with dynamic load balancing based on static cell weights (SCW) with different adaptation intervals I between the load balancing
Fig. 9 Benefit in reference time steps after 1350 time steps as a function of the number of mesh adaptations after which a load balancing is triggered, compared to the setup with the default domain decomposition without DLB. A load-balancing interval of 1 means that a DLB is performed after each mesh adaptation
load balancing reaches the memory limit after 1135 time steps and a computational increase by a factor of 1.26. This comparison shows the need for meaningful cell weights based on the solver and boundary condition type to reduce the overall wall time of the simulation, i.e., higher cell weights for cells with moving interfaces due to the additional computational cost of the boundary condition formulation. Due to the continuing adaptation steps and the resulting continuous increase in the maximum number of cells on an MPI rank, a dynamic load balancing method is required. For this application, load balancing is triggered after a certain number of mesh adaptation steps. Different load balancing intervals were used and their run-time improvement was compared to the reference domain decomposition, while taking into account the computational cost for the execution of the load balancing. The results are presented in Fig. 9. Clearly, the optimum load balancing interval is a function of the load balancing cost, i.e., the compute time required for the load balancing, and of the run-time improvement of the dynamic load balancing. However, additional parameters such as the adaptation interval and the generation of load changes, i.e., the increase in run time after the mesh adaptation and the additional injection of spray parcels, play an important role. A load balancing interval of 2, where a dynamic load balancing is performed after every other adaptation step, shows the lowest overall computational cost for the considered application. In comparison to a load balancing that is performed after each adaptation, a run-time improvement of approx. 3.45% can be achieved. A robust plateau towards larger intervals exists, i.e., the computational increase for an interval of 4 is only 1.25%. However, larger memory requirements for larger adaptation intervals can be seen in Fig. 8b. Evaluations over a larger number of time steps have shown an even clearer benefit of the optimum interval of 2 in the present application.
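The triggering logic discussed above can be captured in a few lines; the counter-based sketch below is illustrative only, with the interval of 2 taken from the result reported here.

```cpp
// Trigger dynamic load balancing only after every I-th mesh adaptation step.
// Illustrative logic; an interval of 2 gave the lowest overall cost here.
struct DlbTrigger {
  int interval = 2;        // adaptations between two load balancing steps
  int adaptationsSeen = 0;

  // Call after each mesh adaptation; returns true if DLB should run now.
  bool onAdaptation() {
    ++adaptationsSeen;
    if (adaptationsSeen >= interval) {
      adaptationsSeen = 0;
      return true;
    }
    return false;
  }
};
```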
5.2.2 Performance of Static and Dynamic Cell-Weights
In the following, the influence of the determination of the cell weights on the performance is discussed. A domain decomposition based on static, pre-defined cell weights (SCW), which are assigned a priori to the different solver cell types, is compared to a dynamic cell weight (DCW) computation based on the compute time of all involved MPI ranks. The SCW method depends on user knowledge and must be adapted to different applications and setups. The DCW method offers a general approach, where only the measured computing time defines the domain decomposition. However, idle times between the computational stages of the multi-solver time step integration due to blocking communication are not taken into account. The wall time per time step for the SCW and DCW method is displayed in Fig. 10a with dynamic load balancing performed after every second adaptation. The DCW shows an overall increased wall time with larger wall time jumps at adaptation and load balancing steps compared to the SCW method. Overall, a computational overhead of approx. 12% after 2000 time steps is observed for the DCW method compared to the optimized weights used in the SCW method. One reason is that the computational cost of the load balancing process increases for the DCW compared to the SCW due to the dynamic computation of cell weights. However, the major disadvantage of the DCW is due to the blocking solver execution in the present implementation and the resulting unavoidable blocking communication and induced idle time between the execution of solver time steps, since this run-time overhead is not captured in the load measurements and cannot be directly assigned to a specific cell type for the DCW. The advantage of the SCW will be discussed based on the load distribution in Fig. 10b and the load and communication timers in Fig. 11. The performance of the
Fig. 10 a Required non-dimensional wall time to compute a time step for dynamic load balancing with an adaptation interval of 2 with domain decomposition based on the SCW and DCW method. The average wall time required with the reference domain decomposition for the first time step is used as reference. b Distribution of the computational load without idle times for all ranks for the different domain decompositions obtained at 825 time steps after the restart
Fig. 11 Run time distribution for the Direct Injection application averaged during the time steps 728–825 after the restart without (top right) and with load balancing with the SCW (top left) and DCW method (bottom). Compute ranks are sorted by the FV compute load in decreasing order
DCW method in achieving a balanced computational load distribution among all processes can be seen in Fig. 10b. The dynamic cell weight computation method shows a sharp Gaussian load distribution with low variance and a maximum load of 1.8. For the static cell weight computation, an additional local maximum with a load of 0.4 and a larger global maximum load of 2.05 are observed. However, the presented load distribution only considers loads based on measured computational times, which are also incorporated in the DCW. These computational times are displayed in the opaque color schemes in the timer distributions in Fig. 11. Next to these times, additional communication times (in translucent color schemes) must be considered in the current multi-solver and coupling implementation. In this case, the additional communication time for the FV and LPT solvers is caused by the blocking transfer of the source terms and the flow field between the two solution methods and grid refinements. The advantage of the SCW in the overall run time reduction can be seen in the reduction of this communication time in the light blue area. Overall, peaks in communication time are reduced for the SCW by setting a minimum cell weight threshold for cells participating in the solver coupling. This ensures a more
even cell distribution, but deteriorates the load distribution, which is shifted towards the unbalanced load distribution in Fig. 10b. A balance between the optimal load distribution and a reduced communication time between the solver executions allows for an overall wall time reduction for the SCW method.
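The dynamic cell-weight estimation can be viewed as a small least-squares problem: each rank's measured compute time is modeled as a linear combination of its per-type cell counts, and the fitted coefficients serve as cell weights. The following sketch solves the corresponding normal equations; it illustrates the linear load model only and is not the algorithm of [11].

```cpp
#include <cstddef>
#include <vector>

// Estimate per-cell-type weights w from measured per-rank compute times t and
// per-rank cell counts n (t_r ~= sum_c n[r][c] * w[c]) by solving the normal
// equations (N^T N) w = N^T t with Gaussian elimination. Illustration only.
std::vector<double> estimateCellWeights(const std::vector<std::vector<double>>& n,
                                        const std::vector<double>& t) {
  const std::size_t nRanks = n.size();
  const std::size_t nTypes = n.empty() ? 0 : n[0].size();

  // Build A = N^T N and b = N^T t.
  std::vector<std::vector<double>> A(nTypes, std::vector<double>(nTypes, 0.0));
  std::vector<double> b(nTypes, 0.0);
  for (std::size_t r = 0; r < nRanks; ++r) {
    for (std::size_t i = 0; i < nTypes; ++i) {
      b[i] += n[r][i] * t[r];
      for (std::size_t j = 0; j < nTypes; ++j) A[i][j] += n[r][i] * n[r][j];
    }
  }

  // Gaussian elimination without pivoting (sufficient for a well-posed sketch).
  for (std::size_t k = 0; k < nTypes; ++k) {
    for (std::size_t i = k + 1; i < nTypes; ++i) {
      const double f = A[i][k] / A[k][k];
      for (std::size_t j = k; j < nTypes; ++j) A[i][j] -= f * A[k][j];
      b[i] -= f * b[k];
    }
  }
  std::vector<double> w(nTypes, 0.0);
  for (std::size_t i = nTypes; i-- > 0;) {
    double s = b[i];
    for (std::size_t j = i + 1; j < nTypes; ++j) s -= A[i][j] * w[j];
    w[i] = s / A[i][i];
  }
  return w;
}
```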
6 Conclusion
High parallel efficiency for large-scale coupled multiphysics simulations is challenging since multi-stage computations with communication barriers can severely impact the overall performance and may prevent even load balancing. Time-varying load changes through domain boundary alterations and solution adaptive grid refinement generate additional dynamic load imbalance. The necessity for a dynamic load balancing scheme is shown for the application of liquid fuel injection in an internal combustion engine. The optimal load balancing interval was found by considering the load balancing cost, where rebalancing after every second adaptation step showed the best overall wall time performance. Furthermore, a static computation of solver- and boundary-specific cell weights for the domain decomposition was compared to a dynamic weight computation based on run-time evaluations of the different solution methods. While the dynamic weight computation showed a good distribution of the considered computational load, the static weight computation showed an overall more efficient parallel performance. The limitations of the dynamic weight computation are caused by neglecting the blocking communication time between different solver sub-steps in the weight computation. For a further improvement of the computational efficiency, the solver coupling steps must be restructured such that non-blocking communication strategies can be used, which will be addressed in future work.
Acknowledgements The authors gratefully acknowledge the funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, Cluster of Excellence 2186 "The Fuel Science Center" (ID: 390919832). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for granting computing time on the GCS Supercomputer HAWK at Höchstleistungsrechenzentrum Stuttgart (www.hlrs.de). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC). The authors gratefully acknowledge the computing time granted through JARA on the supercomputer CLAIX at RWTH Aachen.
References 1. Aguerre, H.J., Nigro, N.M.: Implementation and validation of a lagrangian spray model using experimental data of the ecn spray g injector. Comp. Fluids 190, 30–48 (2019) 2. Berger, M.J., Aftosmis, M., Allmaras, S.: Progress towards a cartesian cut-cell method for viscous compressible flow. AIAA Paper, pp. 2012–1301 (2012)
3. Dahmen, M., Hechinger, M., Villeda, J.V., Marquardt, W.: Towards model-based identification of biofuels for compression ignition engines. SAE Int. J. Fuels Lubr. 5(3), 990–1003 (2012) 4. Günther, C., Meinke, M., Schröder, W.: A flexible level-set approach for tracking multiple interacting interfaces in embedded boundary methods. Comp. Fluids 102, 182–202 (2014) 5. Hartmann, D., Meinke, M., Schröder, W.: An adaptive multilevel multigrid formulation for cartesian hierarchical grid methods. Comput. & Fluids 37, 1103–1125 (2008) 6. Hubbard, G., Denny, V., Mills, A.: Droplet evaporation: effects of transients and variable properties. Int. J. Heat Mass Transf. 18, 1003–1008 (1975) 7. Lucchini, T., Peredi, D., Lacey, J.: Topic 10: evaporative spray g (external, plume interaction, flash boiling). In: ECN6 Proceedings (2018) 8. Meinke, M., Schneiders, L., Günther, C., Schröder, W.: A cut-cell method for sharp moving boundaries in cartesian grids. Comp. Fluids 85, 134–142 (2013) 9. Miller, R., Harstad, K., Bellan, J.: Evaluation of equilibrium and non-equilibrium evaporation models for many-droplets gas-liquid flow simulations. J. Multiphase Flow 24, 1025–1055 (1998) 10. Miller, R.S., Bellan, J.: Direct numerical simulation of a confined three-dimensional gas mixing layer with one evaporating hydrocarbon-droplet-laden stream. J. Fluid Mech. 384, 293–338 (1999) 11. Niemöller, A., Schlottke-Lakemper, M., Meinke, M., Schröder, W.: Dynamic load balancing for direct-coupled multiphysics simulations. Comp. Fluids 199, 104437 (2020) 12. Paredi, D., Lucchini, T., D’Errico, G., Onorati, A., Pickett, L., Lacey, J.: Validation of a comprehensive computational fluid dynamics methodology to predict the direct injection process of gasoline sprays using spray g experimental data. Int. J. Eng. Res. 21(1), 199–216 (2020) 13. Pinar, A., Aykanat, C.: Fast optimal load balancing algorithms for 1d partitioning. J. Parallel Distrib. Comput. 64(8), 974–996 (2004) 14. Pogorelov, A., Schneiders, L., Meinke, M., Schröder, W.: An adaptive cartesian mesh based method to simulate turbulent flows of multiple rotating surfaces. Flow Turbul. Combust. 100(1), 19–38 (2018) 15. Putnam, A.: Integratable form of droplet drag coefficient. Ars J. 31(10), 1467–1468 (1961) 16. Ranz, W., Marshall, W.: Evaporation from drops: I. Chem. Eng. Prog. 48, 141–146 (1952) 17. Ranz, W., Marshall, W.: Evaporation from drops: II. Chem. Eng. Prog. 48, 173–180 (1952) 18. Reitz, R.D.: Modeling atomization processes in high-pressure vaporizing sprays. At. Spray 3(4), 309–337 (1987) 19. Reitz, R.D., Beale, J.: Modeling spray atomization with the kelvin-helmholtz rayleigh-taylor hybrid model. At. Spray 9(6), 623–650 (1999) 20. Schlottke-Lakemper, M., Niemöller, A., Meinke, M., Schröder, W.: Efficient parallelization for volume-coupled multiphysics simulations on hierarchical cartesian grids. Comp. Meth. Appl. Mech. Eng. 352, 461–487 (2019) 21. Schneiders, L., Günther, C., Meinke, M., Schröder, W.: An efficient conservative cut-cell method for rigid bodies interacting with viscous compressible flows. J. Comput. Phys. 311, 62–86 (2016) 22. Schneiders, L., Hartmann, D., Meinke, M., Schröder, W.: An accurate moving boundary formulation in cut-cell methods. J. Comput. Phys. 235, 786–809 (2013) 23. Thewes, M., Muether, M., Pischinger, S., Budde, M., Brunn, A., Sehr, A., Adomeit, P., Klankermayer, J.: Analysis of the impact of 2-methylfuran on mixture formation and combustion in a direct-injection spark-ignition engine. 
Energy & Fuels 25(12), 5549–5561 (2011) 24. Wegmann, T., Meinke, M., Schröder, W.: Numerical analyses of spray penetration and evaporation in a direct injection engine. In: SAE Technical Paper (2023) 25. Wehrfritz, A., Vuorinen, V., Kaario, O., Larmi, M.: Large eddy simulation of high-velocity fuel sprays: studying mesh resolution and breakup model effects for spray a. At. Spray 23(5), 419–442 (2013) 26. Wilke, C.: A viscosity equation for gas mixtures. J. Chem. Phys. 18, 517–519 (1950)
Toward Scalable Empirical Dynamic Modeling Keichi Takahashi, Kohei Ichikawa, and Gerald M. Pao
Abstract Empirical Dynamic Modeling (EDM) is an emerging non-linear time series analysis framework that allows prediction and analysis of non-linear dynamical systems. Although EDM is increasingly adopted in various research fields, its application to large-scale data has been limited due to its high computational cost. This article describes our ongoing efforts toward accelerating EDM computation using HPC technologies such as GPU offloading and parallel processing. We describe mpEDM, a massively parallel implementation of EDM designed for GPU-accelerated supercomputers, and kEDM, a performance-portable implementation of EDM based on the Kokkos performance portability framework. Furthermore, we present our ongoing work toward porting EDM to NEC's Vector Engine processor and carry out a preliminary performance evaluation.
1 Introduction
Empirical Dynamic Modeling (EDM) [3] is an emerging non-linear time series analysis framework that allows prediction and analysis of non-linear dynamical systems. EDM is increasingly utilized in various fields, such as neuroscience [10], ecology [18], medicine [7] and geophysics [12]. However, its application to large-scale data has been limited due to the high computational cost. Few studies have been carried out to speed up EDM by improving the algorithm [8] and by taking advantage
K. Takahashi (B)
Cyberscience Center, Tohoku University, 6-3 Aramaki-Aza-Aoba, Aoba-ku, Sendai 980-8578, Japan
K. Ichikawa
Nara Institute of Science and Technology, 8916-5 Takayamacho, Nara 630-0192, Japan
G. M. Pao
Salk Institute for Biological Studies, 10010 N Torrey Pines Rd, La Jolla, San Diego, CA 92037, USA
of parallel and distributed computing [13]. We have been tackling this problem by leveraging High Performance Computing (HPC) technologies such as GPUs. This article describes our ongoing efforts toward accelerating EDM using HPC technologies. Section 2 briefly introduces the basic concept of EDM and two algorithms that we mainly target, which are Simplex projection and Convergent Cross Mapping. Section 3 describes mpEDM, a highly parallel implementation of EDM designed for large-scale analysis. Section 4 describes kEDM, a performance-portable implementation of EDM based on the Kokkos framework. Section 5 describes our ongoing work on implementing EDM on NEC’s Vector Engine processor and presents preliminary performance evaluation results. Section 6 concludes this article and discusses future work.
2 Empirical Dynamic Modeling (EDM)
EDM is a non-linear time series analysis framework that builds upon Takens' embedding theorem [4, 17]. Takens' theorem states that the state space of a dynamical system can be reconstructed from lagged time series observations. Figure 1 illustrates the basic idea behind state space reconstruction. The figure on the left shows the original state space of the Lorenz attractor, which has three variables xt, yt and zt. The figure on the right shows the state space reconstructed from xt and two time lags xt−τ and xt−2τ. According to Takens' theorem, there exists a diffeomorphism (i.e. local neighborhoods are preserved) that maps the original state space to the reconstructed state space if enough time lags are given. EDM takes advantage of this property of the reconstructed state space to perform various time series analysis tasks. Simplex projection [15] is the most basic EDM algorithm, which performs short-term forecasts. Let x be an observed time series. The embedding of xt in the E-dimensional state space is defined as Xt = (xt, xt−τ, ..., xt−(E−1)τ), where τ is the time lag. Given an embedded observation Yt, Simplex projection predicts the state of the system Tp steps ahead. The first step of Simplex projection is to find the k nearest neighbors
Fig. 1 State space reconstruction
of Yt in the state space. The time index of the i-th nearest neighbor is defined as ni. Simplex projection then calculates the weighted average of the Tp-step futures of the k nearest neighbors to predict the future of Yt as follows:

$$\hat{Y}_{t+T_p} = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{i=1}^{E+1} w_i} \cdot X_{n_i}, \qquad (1)$$

where

$$w_i = \exp\left\{ -\frac{\| Y_t, X_{n_i} \|}{\min_{1 \le i \le E} \| Y_t, X_{n_i} \|} \right\}. \qquad (2)$$
Convergent Cross Mapping (CCM) [1, 11, 14] is an EDM algorithm that detects and quantifies the causal relationship between two variables. Given two time series x and y, CCM uses Simplex projection to predict x using y as library points. We start with a small subset of y and gradually increase the subset size. If the prediction skill increases with the library size, it indicates that x causes y.
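A compact, didactic sketch of one Simplex projection step following Eqs. (1) and (2) is given below: the time series is embedded, the E + 1 nearest neighbors of the query point are found by exhaustive search, and their Tp-step futures are averaged with exponential distance weights. This is a reference illustration, not the optimized code of cppEDM, mpEDM, or kEDM.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// One Simplex projection step (Eqs. 1 and 2), didactic reference implementation.
// x: observed time series, E: embedding dimension, tau: time lag, Tp: prediction
// horizon, t: time index of the query point. Exhaustive k-NN search with
// k = E + 1 neighbors; library points exclude the query itself.
double simplexPredict(const std::vector<double>& x, int E, int tau, int Tp, int t) {
  const int k = E + 1;
  const int first = (E - 1) * tau;                 // first fully embeddable index
  const int last = static_cast<int>(x.size()) - 1 - Tp;

  auto dist = [&](int a, int b) {                  // distance between embeddings X_a, X_b
    double d2 = 0.0;
    for (int j = 0; j < E; ++j) {
      const double d = x[a - j * tau] - x[b - j * tau];
      d2 += d * d;
    }
    return std::sqrt(d2);
  };

  // Collect (distance, index) pairs and keep the k nearest neighbors of X_t.
  std::vector<std::pair<double, int>> nn;
  for (int i = first; i <= last; ++i) {
    if (i == t) continue;
    nn.emplace_back(dist(i, t), i);
  }
  std::partial_sort(nn.begin(), nn.begin() + k, nn.end());

  const double dmin = std::max(nn[0].first, 1e-30);  // distance to nearest neighbor
  double wsum = 0.0, pred = 0.0;
  for (int i = 0; i < k; ++i) {
    const double w = std::exp(-nn[i].first / dmin);  // weight as in Eq. (2)
    wsum += w;
    pred += w * x[nn[i].second + Tp];                // Eq. (1), scalar observable
  }
  return pred / wsum;
}
```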
3 mpEDM: Massively Parallel EDM
We initiated this research by profiling cppEDM,1 the de facto standard implementation of EDM being developed by the Sugihara laboratory. We identified through profiling that the k-nearest neighbor (k-NN) search in the state space is the primary bottleneck in EDM computation. We thus focused on parallelizing and optimizing the k-NN search, and developed mpEDM [19], a parallel distributed implementation of EDM optimized for large-scale supercomputers equipped with GPUs as accelerators. ArrayFire [9], a tensor library for GPU computing, was used to offload computation such as the k-NN search to the GPU. An MPI-based master-worker framework was designed to distribute the work across multiple GPUs on multiple compute nodes. Furthermore, the original CCM algorithm was improved to reduce the number of required k-NN searches. To demonstrate the practicality of mpEDM, we deployed mpEDM on the AI Bridging Cloud Infrastructure (ABCI)2 supercomputer at the National Institute of Advanced Industrial Science and Technology (AIST) and analyzed the causal interactions among neurons using real-world neural activity datasets. The datasets were sampled from an entire zebrafish brain at single neuron resolution. Using 512 ABCI nodes, mpEDM was able to finish computing the causal map from a dataset containing 53,053 time series each with 1,450 time steps in just 20 s. The same dataset took
1 https://github.com/SugiharaLab/cppEDM
2 https://abci.ai/
8.5 hours to analyze using cppEDM, indicating that mpEDM is 1,530× faster than cppEDM. Furthermore, an even larger dataset that contains 101,729 time series was analyzed in just 199 seconds on 512 ABCI nodes.
4 kEDM: Performance Portable EDM
Although mpEDM successfully accelerated EDM using HPC, several challenges remained. First, mpEDM had different implementations for different architectures. This design requires development effort when porting to novel HPC hardware. Second, mpEDM was limited by ArrayFire. As described in Sect. 3, mpEDM used ArrayFire to offload computation to the GPU. However, some parts of the EDM computation, specifically the lookup of nearest neighbor points (Eq. 1), could not be efficiently implemented using ArrayFire. Thus, the lookups were executed on the host CPU even if a GPU was available on the system. To solve these challenges, we developed a new implementation of EDM named kEDM [16]. kEDM uses the Kokkos [2] performance portability framework developed at Sandia National Laboratories, and runs on both CPUs and GPUs while sharing the same source code. Thus, porting to new hardware can be completed with minimal effort. Furthermore, we were able to port the whole computation to the GPU using Kokkos, including the lookups. As a result, kEDM demonstrated up to 5.5× speedup compared to mpEDM in CCM analysis of various real-world datasets.
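To give a flavor of the performance-portable approach, the fragment below expresses a pairwise-distance kernel, the core of the k-NN search, once in Kokkos so that the same source can be compiled for a CPU or a GPU backend. It is a simplified sketch and not kEDM's actual kernel.

```cpp
#include <Kokkos_Core.hpp>

// Simplified sketch of a pairwise squared-distance kernel written once with
// Kokkos; the same source compiles for OpenMP, CUDA, or other backends.
// "emb" holds N embedded points of dimension E; "dist" receives the N x N
// matrix of squared distances (sufficient for neighbor ranking).
void pairwiseDistances(Kokkos::View<const float**> emb, Kokkos::View<float**> dist) {
  const int N = static_cast<int>(emb.extent(0));
  const int E = static_cast<int>(emb.extent(1));

  Kokkos::parallel_for(
      "pairwise_distances",
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {N, N}),
      KOKKOS_LAMBDA(const int i, const int j) {
        float d2 = 0.0f;
        for (int k = 0; k < E; ++k) {
          const float d = emb(i, k) - emb(j, k);
          d2 += d * d;
        }
        dist(i, j) = d2;  // squared distance
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1024, E = 20;                 // example sizes
    Kokkos::View<float**> emb("emb", N, E);     // embedded time series (zeros here)
    Kokkos::View<float**> dist("dist", N, N);
    pairwiseDistances(emb, dist);
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```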
5 EDM on the Vector Engine
Our EDM implementations mpEDM and kEDM have so far mainly targeted GPUs, since GPUs can provide massive memory bandwidth to accelerate memory-bound algorithms such as EDM. NEC's Vector Engine (VE) [5, 6] is another HPC processor that provides massive memory bandwidth comparable to GPUs. It is thus a promising alternative to GPUs for accelerating EDM computation. In this article, we analyze the performance of top-k sort, the main bottleneck in the k-NN search, on the VE and evaluate the feasibility of leveraging the VE for EDM computation.
5.1 Evaluation Method
We compare five top-k sort implementations summarized in Table 1. The evaluation experiments are carried out on NEC Vector Engine Type 20B. The source codes are available on GitHub.3 The details of each implementation are summarized as follows:
3 https://github.com/keichi/ve-partial-sort
Table 1 List of top-k search implementations

Name                    | Full/Partial | Vectorized | Existing
STL sort                | Full         | No         | Yes
STL partial sort        | Partial      | No         | Yes
NEC ASL sort            | Full         | Yes        | Yes
LSD radix sort          | Full         | Yes        | No
MSD radix partial sort  | Partial      | Yes        | No
• STL sort uses the std::sort() function provided by the C++ Standard Template Library (STL). NEC's implementation of std::sort() uses the introsort algorithm, which starts with quick sort and switches to heap sort once the recursion level reaches a threshold. Furthermore, insertion sort is used to sort each partition once a partition is smaller than a threshold size.
• STL partial sort uses the std::partial_sort() function provided by STL. NEC's implementation of std::partial_sort() uses heap select to find the k smallest elements and then sorts them using heap sort. Heap select works by scanning through the array and updating a max-heap that maintains the k smallest elements that have appeared so far (a minimal sketch of this usage is given after this list).
• NEC ASL sort is a vectorized sort implementation provided by NEC's Advanced Scientific Library (ASL).4 The documentation mentions that radix sort is used, but the detailed algorithm is not described.
• Least Significant Digit (LSD) radix sort is our implementation of a vectorized radix sort. During each step, it examines a "digit" (a block of bits) and sorts the elements based on the current digit using stable counting sort. The considered digit is moved from the least to the most significant bits.
• Most Significant Digit (MSD) radix partial sort is our implementation of radix select. During each step, it examines a digit and moves the elements into their corresponding bins. Only the bins that hold the k smallest elements are sorted in the next step. The considered digit is moved from the most to the least significant bits.
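For reference, the generic C++ way to obtain the k smallest elements with the STL primitive used by the "STL partial sort" baseline is shown below; this is standard library usage, not NEC's VE-specific implementation.

```cpp
#include <algorithm>
#include <vector>

// Return the k smallest elements of v in ascending order using std::partial_sort,
// the same STL primitive as the "STL partial sort" baseline above.
std::vector<float> topK(std::vector<float> v, std::size_t k) {
  k = std::min(k, v.size());
  std::partial_sort(v.begin(), v.begin() + k, v.end());
  v.resize(k);
  return v;
}
```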
5.2 Evaluation Result
Figure 2 shows the runtime of top-k sort with respect to the length of the array across different implementations. Here, k is fixed to 1 and the benchmark is executed on a single core. Clearly, STL sort is the slowest, as it sorts the full array and is not vectorized. STL sort is followed by ASL sort and LSD radix sort, which also sort the
4 https://sxauroratsubasa.sakura.ne.jp/documents/sdk/SDK_NLC/UsersGuide/asl/c/en/index.html
Fig. 2 Top-k sorting runtime [ms] as a function of the array length N on VE Type 20B (k = 1)
full array. However, ASL sort and LSD radix sort become significantly faster than STL sort as the length of the array increases. This is because the two implementations are highly vectorized and can take advantage of the vector processing in the VE. The two fastest implementations are STL partial sort and MSD radix partial sort. This is expected since these two partially sort the array while the others sort the full array. The vectorized MSD radix partial sort is slower than STL partial sort if the array is short, but outperforms STL partial sort as the array becomes longer. Figure 3 shows the runtime of top-k sort with varying values of k across different implementations. Clearly, the runtime of the implementations that sort the full array (STL sort, LSD radix sort and ASL sort) does not change depending on k. The partial sort implementations (STL partial sort and MSD radix partial sort) are the fastest among all implementations, and their runtime generally increases with k. However, the increase in runtime becomes smaller if the array is long. In summary, these results suggest that we need to dynamically choose between STL partial sort and MSD radix partial sort depending on the array length to obtain the highest performance.
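The selection rule suggested by these results can be expressed as a tiny dispatcher: below some array-length threshold the heap-based STL partial sort is used, and above it the vectorized MSD radix partial sort. The crossover value and the radix-select stand-in below are placeholders, not measured or implemented values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for the vectorized MSD radix partial sort described above; here it
// simply falls back to std::partial_sort so that the sketch is self-contained.
void msdRadixPartialSort(std::vector<float>& v, std::size_t k) {
  std::partial_sort(v.begin(), v.begin() + std::min(k, v.size()), v.end());
}

// Dispatch between the two fastest implementations depending on the array
// length; the crossover threshold is a hypothetical tuning parameter for the VE.
void topKSort(std::vector<float>& v, std::size_t k) {
  constexpr std::size_t kCrossover = 100000;  // placeholder value
  if (v.size() < kCrossover) {
    std::partial_sort(v.begin(), v.begin() + std::min(k, v.size()), v.end());
  } else {
    msdRadixPartialSort(v, k);
  }
}
```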
6 Conclusions and Future Work In this article, we described our efforts to accelerate EDM using HPC technologies and introduced mpEDM and kEDM, which are our optimized and parallelized implementations of EDM. We also showed the result of the preliminary performance
Fig. 3 Top-k sorting runtime on VE Type 20B (varying k)
evaluation being conducted as a part of our ongoing work to port EDM to NEC's Vector Engine. Specifically, we compared different top-k search implementations on the VE and found that one of two implementations needs to be dynamically selected for maximal performance. In the future, we will port the remaining parts necessary to run EDM on the VE, and evaluate the performance of EDM algorithms such as Simplex projection and Convergent Cross Mapping on the VE. Another direction of future research is to take advantage of approximate algorithms to further accelerate the k-NN search. All existing EDM implementations including ours perform exhaustive search to find the k nearest neighbors. However, this approach is not scalable since the cost of the k-NN search increases rapidly with the number of points. Approximate k-NN search algorithms require lower time complexity than exhaustive search, and might allow us to further scale EDM.
Acknowledgements This work was supported by “Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN)” in Japan (Project ID: jh220050) and JSPS KAKENHI Grant Number 20K19808. Part of the experiments were carried out using the AOBA-A and AOBA-C systems at the Cyberscience Center, Tohoku University.
References 1. van Berkel, N., Dennis, S., Zyphur, M., Li, J., Heathcote, A., Kostakos, V.: Modeling interaction as a complex system. Hum.-Comput. Interact. 00(00), 1–27 (2020). https://doi.org/10.1080/ 07370024.2020.1715221 2. Carter Edwards, H., Trott, C.R., Sunderland, D.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12) (2014). https://doi.org/10.1016/j.jpdc.2014.07.003 3. Chang, C.W., Ushio, M., hao Hsieh, C.: Empirical dynamic modeling for beginners. Ecolog. Res. 32(6), 785–796 (2017). https://doi.org/10.1007/s11284-017-1469-9 4. Deyle, E.R., Sugihara, G.: Generalized theorems for nonlinear state space reconstruction. PLoS ONE 6(3) (2011). https://doi.org/10.1371/journal.pone.0018295 5. Egawa, R., Fujimoto, S., Yamashita, T., Sasaki, D., Isobe, Y., Shimomura, Y., Takizawa, H.: Exploiting the Potentials of the Second Generation SX-Aurora TSUBASA. In: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS 2020), vol. 2, pp. 39–49 (2020). https://doi.org/10.1109/PMBS51919.2020.00010 6. Komatsu, K., Momose, S., Isobe, Y., Watanabe, O., Musa, A., Yokokawa, M., Aoyama, T., Sato, M., Kobayashi, H.: Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC18), pp. 685–696 (2018). https://doi.org/10.1109/SC.2018.00057 7. Liu, S., Ye, M., Pao, G.M., Song, S.M., Jhang, J., Jiang, H., Kim, J.H., Kang, S.J., Kim, D.I., Han, S.: Divergent brainstem opioidergic pathways that coordinate breathing with pain and emotions. Neuron 110(5), 857-873.e9 (2022). https://doi.org/10.1016/J.NEURON.2021. 11.029 8. Ma, H., Aihara, K., Chen, L.: Detecting causality from nonlinear dynamics with short-term time series. Sci. Rep. 4, 1–10 (2014). https://doi.org/10.1038/srep07464 9. Malcolm, J., Yalamanchili, P., McClanahan, C., Venugopalakrishnan, V., Patel, K., Melonakos, J.: ArrayFire: a GPU acceleration platform. In: Modeling and Simulation for Defense Systems and Applications VII, vol. 8403, p. 84030A. SPIE (2012). https://doi.org/10.1117/12.921122 10. Natsukawa, H., Deyle, E.R., Pao, G.M., Koyamada, K., Sugihara, G.: A Visual Analytics Approach for Ecosystem Dynamics based on Empirical Dynamic Modeling. IEEE Trans. Visual. Comput. Graph. 2626(c), 1–1 (2020). https://doi.org/10.1109/tvcg.2020.3028956 11. Natsukawa, H., Koyamada, K.: Visual analytics of brain effective connectivity using convergent cross mapping. In: SIGGRAPH Asia 2017 Symposium on Visualization (2017). https://doi. org/10.1145/3139295.3139303 12. Park, J., Pao, G.M., Sugihara, G., Stabenau, E., Lorimer, T.: Empirical mode modeling: a datadriven approach to recover and forecast nonlinear dynamics from noisy data. Nonlinear Dyn. 108(3), 2147–2160 (2022). https://doi.org/10.1007/S11071-022-07311-Y/FIGURES/12 13. Pu, B., Duan, L., Osgood, N.D.: Parallelizing convergent cross mapping using apache spark. In: International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS 2019), pp. 133–142 (2019). https://doi.org/10.1007/978-3-030-21741-9_14 14. Sugihara, G., May, R., Ye, H., Hsieh, C.H., Deyle, E., Fogarty, M., Munch, S.: Detecting causality in complex ecosystems. Science 338(6106), 496–500 (2012). https://doi.org/10.1126/ science.1227079
15. Sugihara, G., May, R.M.: Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature 344(6268), 734–741 (1990). https://doi.org/10.1038/ 344734a0 16. Takahashi, K., Watanakeesuntorn, W., Ichikawa, K., Park, J., Takano, R., Haga, J., Sugihara, G., Pao, G.M.: kEDM: a performance-portable implementation of empirical dynamic modeling using Kokkos. In: Practice and Experience in Advanced Research Computing, pp. 1–8. ACM, New York, NY, USA (2021). https://doi.org/10.1145/3437359.3465571 17. Takens, F.: Detecting strange attractors in turbulence. In: Dynamical Systems and Turbulence, Lecture Notes in Mathematics, vol. 898, pp. 366–381 (1981). https://doi.org/10.1007/ BFb0091924 18. Ushio, M., Hsieh, C.H., Masuda, R., Deyle, E.R., Ye, H., Chang, C.W., Sugihara, G., Kondoh, M.: Fluctuating interaction network and time-varying stability of a natural fish community. Nature 554(7692), 360–363 (2018). https://doi.org/10.1038/nature25504 19. Watanakeesuntorn, W., Takahashi, K., Ichikawa, K., Park, J., Sugihara, G., Takano, R., Haga, J., Pao, G.M.: Massively parallel causal inference of whole brain dynamics at single neuron resolution. In: 26th International Conference on Parallel and Distributed Systems (ICPADS), pp. 196–205. IEEE (2020). https://doi.org/10.1109/ICPADS51040.2020.00035
AOBA: The Most Powerful Vector Supercomputer in the World Hiroyuki Takizawa, Keichi Takahashi, Yoichi Shimomura, Ryusuke Egawa, Kenji Oizumi, Satoshi Ono, Takeshi Yamashita, and Atsuko Saito
H. Takizawa (B) · K. Takahashi · Y. Shimomura · K. Oizumi · S. Ono · T. Yamashita · A. Saito
Tohoku University, Sendai, Japan
R. Egawa
Tokyo Denki University, Tokyo, Japan

1 Introduction
The Innovative High-Performance Computing Infrastructure (HPCI) connects major supercomputers of universities and research institutions in Japan. Today, there is a diversity of High-Performance Computing (HPC) system architectures. Also, there is a diversity of HPC workloads that need different performance characteristics of HPC systems. One important point that we have to emphasize here is that no system architecture can be optimal in every regard. The best system for one application area is not necessarily the best for other application areas. Therefore, HPCI provides various kinds of HPC systems to academic users in Japan. We believe that the diversity of
Fig. 1 System specifications of AOBA-1.0
HPC system architectures is one outstanding point of the Japanese academic computing environment. The primary mission of Tohoku University Cyberscience Center is to offer a leading-edge computing environment to academic users, and also to industrial users. As a member of HPCI, the Cyberscience Center is expected to complement the national flagship system, Fugaku, by offering a characteristic system different from other supercomputers in HPCI. Specifically, the Cyberscience Center has been operating NEC SX-series vector supercomputers, such as SX-ACE [4] and SX-Aurora TSUBASA [15], because the bandwidth-oriented system design can widely cover memory-intensive numerical simulations and achieve high sustained simulation performance. In October 2020, the first version of Supercomputer AOBA, referred to as AOBA-1.0 in this article, started operation as the first supercomputer using the second-generation SX-Aurora TSUBASA as the main computing resource [3]. AOBA-1.0 consists of two subsystems, AOBA-A and AOBA-B. AOBA-A is a 72-node system of NEC SX-Aurora TSUBASA with vector processors, while AOBA-B is a 68-node system of standard x86 processors. The overall system performance is about 1.75 Pflop/s, and the aggregated memory bandwidth is 924 TB/s. The system specifications of AOBA-1.0 are summarized in Fig. 1. Since its operation start, AOBA-1.0 had been used heavily by 1500+ registered users nation-wide in Japan, and was always busy. Therefore, we decided to gradually update the system to meet the strong demands for vector computing. This article describes the plan of system updates at the Cyberscience Center as shown in Fig. 2. In October 2022, AOBA-1.0 was already upgraded by additionally introducing vector computing performance of about 2.4 Pflop/s installed at a remote data center. The current system generation as of December 2022 is called AOBA-1.2. Moreover, AOBA-1.5 will start operation in August 2023 as the most powerful vector supercomputer in the world, and indeed in the long history of vector supercomputers.
Fig. 2 System update plan of Tohoku University Cyberscience Center
2 Supercomputer AOBA
AOBA originally means young growing leaves in Japanese, and Aoba-yama or Mt. Aoba is the name of the university campus where Tohoku University Cyberscience Center is located. When SX-ACE [4] being operated at the Cyberscience Center was replaced by a new supercomputer in 2020, its nickname, i.e., AOBA, was chosen from submissions from the public. The first version of AOBA, AOBA-1.0, had a total performance of 1.75 Pflop/s and was not a very large system even at the operation start in October 2020. However, since the second generation of NEC SX-Aurora TSUBASA (SX-AT) was employed as the main computing resource, i.e., subsystem AOBA-A, the aggregated memory bandwidth of AOBA-1.0 was an outstanding 924 TB/s thanks to the vector processors with HBM2E memory modules [3]. The node architecture of SX-AT is shown in Fig. 3. A node of SX-AT has one or more vector processors implemented as PCI Express cards called Vector Engines (VEs), which are hosted by one or more x86 processors called Vector Hosts (VHs). In the case of AOBA-A, eight VEs and one VH are installed in each node. VEs and VHs are connected via PCI Express links, and also to InfiniBand Host Channel Adapters (HCAs). VHs run the standard Linux operating system, while VEs run user processes, and their system calls on the VEs are implicitly forwarded to the VHs. As a result, user processes running on VEs appear as if they are running in a standard x86 Linux environment, even though they are actually running on vector processors with a special instruction set architecture. Therefore, SX-AT allows users to benefit from the special hardware
Fig. 3 Node architecture of SX-AT
configuration, such as the high memory bandwidth of 1.53 TB/s per VE, while using the standard software environment. Since the operation start, AOBA-1.0 had been extensively used and was almost always full. The average system utilization had exceeded 90% almost every month, and even 95% in the busy season. It shortly became difficult for AOBA-1.0 alone to meet the strong demands for vector computing. In 2022, therefore, we decided to increase the vector computing performance by introducing additional nodes of SX-AT installed in a remote data center. Figure 4 shows an overview of the AOBA-1.2 system configuration. The subsystem comprising the remote "cloud" computing resources is called AOBA-C, which is named after AOBA Cloud besides denoting the third subsystem. The performance of AOBA-C is about 2.4 Pflop/s, achieved by adopting 106 nodes of SX-AT equipped with the same type of VEs as AOBA-A. The new generation of AOBA after incorporating AOBA-C is called AOBA-1.2, as illustrated in Fig. 2. The on-premise and cloud computing resources in Fig. 4 are connected via a virtual private network. The cloud computing resource is dedicated to AOBA users and not shared with others. Therefore, AOBA-C is physically located in a remote data center, but logically looks as if it were installed at the Cyberscience Center. One concern in the system design of AOBA-1.2 was that users might be confused if they became capable of using distant subsystems in the same manner. As shown
Fig. 4 System configuration of AOBA-1.2
in Fig. 4, there are two different storages for the on-premise and cloud subsystems, respectively. One possible design option was to properly and timely synchronize data between the two storages so that every subsystem keeps access to the same data. However, we ultimately did not select this option because the storage capacities are not identical and there is also a risk of data synchronization failures. Instead, we decided that users have two home directories. With this design option, another concern was that a user might write a job script assuming the files in one home directory, and mistakenly submit the job script to a job queue associated with the other home directory. To avoid causing confusion, we decided to have different front-end servers on both sites so that users are conscious of which system they are using. Consequently, users are required to select their computing resources, either on-premise or cloud, when logging into the front-end server, and thus in advance of submitting their jobs. The system design of AOBA-1.2 prevents users from unintentionally submitting their jobs to the wrong subsystem. Instead, jobs are not automatically transferred to other subsystems, unlike so-called cloud bursting technologies [2]. It is observed that only some advanced users are enjoying the additional computing power of AOBA-C for now, because users need to intentionally log into the front-end server on the cloud side. However, the utilization ratio of AOBA-C has gradually increased since the operation start. Therefore, we expect that more users will move to the cloud environment to use AOBA-C. Since there was no precedent HPCI supercomputer in Japan that relied more on cloud resources than on on-premise ones, there were several non-technical issues that potentially needed to be discussed. Because of this, AOBA-1.2 was designed as a transitional generation available only for a short period. At the system design, the top priority was given to smooth deployment and stable operation from not only technical but also administrative points of view. AOBA-C is available until the end of July 2023, and we already have a plan to increase the on-premise computing resources as described in the next section.
3 AOBA-1.5: The World's Largest Vector Supercomputer
In August 2022, the Cyberscience Center made a contract with NEC Corporation to build a new vector supercomputer in 2023 by employing the latest generation of SX-AT. In addition to subsystems AOBA-A and AOBA-B in Fig. 1, a new subsystem of 504 nodes with the third-generation VEs called Type 30A will be installed. Although the official name of the new subsystem has not been decided yet, it is tentatively called AOBA-S in this article. The new system generation after incorporating AOBA-S is called AOBA-1.5. Figure 5 shows the system specifications of AOBA-1.5. As shown in the figure, the peak flop/s rate of AOBA-S reaches 21 Pflop/s in double precision, and therefore AOBA-S will be the most powerful vector supercomputer in the world at its launch. As shown in Fig. 6, the interconnect of AOBA-S is a full-bisection fat-tree network, and the nodes are connected via dual-rail InfiniBand NDR 200G. The storage system for AOBA-S is a DDN ES400NVX2 with a capacity of 4.5 PB. Moreover, although AOBA-C will be retired, we plan to allow AOBA users to use external cloud computing resources not shown in this figure when the subsystems become too busy and more computing resources are needed. The computing power of 21 Pflop/s is achieved by using the new-generation vector processor, named Vector Engine Type 30A. Table 1 shows the key performance metrics of SX-series generations, and Fig. 7 shows the configuration of Type 30A. The most important feature of Type 30A is that the peak memory bandwidth is 2.45 TB/s, which is about 1.6 times higher than that of Type 20B used in AOBA-A. Hence, for memory-intensive scientific simulations, we can expect 1.6 times higher sustained performance per socket in comparison with Type 20B. The peak flop/s rate of each vector core remains the same as that of Type 20B. However, Type 30A has 16 vector cores to double the peak flop/s rate per socket, and thus will improve the sustained per-socket performance of even compute-bound applications. The memory capacity per socket is also doubled. In addition, each vector core has a private cache
Fig. 5 System specifications of AOBA-1.5
Fig. 6 Interconnect network of AOBA-S

Table 1 SX-series performance comparison

                                           SX-ACE  Type 10B  Type 10AE  Type 20B  Type 20A  Type 30A
Core    Clock frequency (GHz)              1.00    1.40      1.58       1.60      1.60      1.60
        Peak performance (Gflop/s)         64      268       304        307       307       307
Socket  Core count                         4       8         8          8         10        16
        Peak performance (Gflop/s)         256     2,150     2,433      2,457     3,072     4,912
        Memory bandwidth (GB/s)            256     1,228     1,352      1,536     1,526     2,457
        B/F ratio (B/flop)                 1.00    0.56      0.56       0.62      0.50      0.50
        Memory capacity (GB)               64      48        48         48        48        96

Fig. 7 Configuration of vector engine type 30A
Fig. 8 Speedup ratio of Vector Engine Type 30A over Type 20B
Figure 8 shows the speedup ratio of Type 30A over Type 20B for our key application kernels. Our preliminary evaluation results clearly demonstrate that Type 30A achieves speedup ratios of about 1.6 for MHD and Turbine, which is reasonable given the memory bandwidth improvement. For Nano_Powder, the speedup ratio slightly exceeds 1.6. This kernel requires a more detailed analysis, but we currently suspect that the extra gain comes from the private cache newly introduced in Type 30A: if the data reside in the private cache, traffic on the on-chip network is reduced, which improves performance. Effective use of the private cache will be discussed further once the production version of Type 30A becomes readily accessible. For the other two kernels, the speedup ratio was lower than expected. Since the compiler used in the preliminary evaluation had not yet been optimized for Type 30A, one possible reason is that new architectural features of Type 30A, not detailed in this article, were left unused. We will therefore evaluate all kernels again with a more mature version of the compiler in the future.
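As one illustration of what "effective use of the private cache" might mean in practice, the following generic C++ sketch blocks the inner dimension of a matrix-vector product so that the reused vector segment stays resident in a per-core cache. It is not one of the evaluated kernels, and the tile size is a purely hypothetical tuning parameter.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic cache-blocking sketch (not one of the evaluated kernels):
// the j-dimension is processed in tiles so that the reused segment of x
// stays resident in a per-core private cache across all rows of the tile,
// instead of being fetched repeatedly through the shared last-level cache.
void blocked_matvec(const std::vector<double>& A,  // n x n matrix, row-major
                    const std::vector<double>& x,  // input vector of length n
                    std::vector<double>& y,        // output vector of length n
                    std::size_t n,
                    std::size_t tile = 4096)       // hypothetical tuning knob
{
  std::fill(y.begin(), y.end(), 0.0);
  for (std::size_t jj = 0; jj < n; jj += tile) {
    const std::size_t jend = std::min(jj + tile, n);
    for (std::size_t i = 0; i < n; ++i) {
      double sum = 0.0;
      // Long, unit-stride inner loop: vectorizes well, and x[jj..jend)
      // is reused for every row i within this tile.
      for (std::size_t j = jj; j < jend; ++j)
        sum += A[i * n + j] * x[j];
      y[i] += sum;
    }
  }
}
```

Choosing the tile so that the reused data fit in the private cache, rather than only in the shared last-level cache, is the kind of trade-off that will have to be re-examined once Type 30A hardware is available.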
4 Research Activities

Tohoku University Cyberscience Center not only provides the computing infrastructure to its users but also conducts research to develop novel HPC technologies. Our recent research achievements relevant to AOBA are described below.

As illustrated in Fig. 3, SX-AT has a heterogeneous hardware configuration of VHs and VEs, unlike the traditional NEC SX-series vector supercomputers before SX-AT.
VHs and VEs have their own strengths and weaknesses: a VH achieves moderate performance on any workload, whereas a VE achieves excellent performance only on "vector-friendly" workloads. In practice, an HPC application usually contains vector-unfriendly tasks as well as vector-friendly ones. One interesting and important research topic is therefore how to make the best use of both VHs and VEs by assigning each task to the appropriate processor.

A common way to use two kinds of processors within a node is to offload part of the application execution to the other processor. NEC provides Vector Engine Offloading (VEO), a proprietary programming framework that enables an application running on a VH to offload kernel execution to a VE [8]. By developing an application with VEO, we can run the application on VHs and offload only the vector-friendly kernels to VEs, achieving high performance on a wider range of HPC applications than vector supercomputers have traditionally been used for. However, such an application is specialized for SX-AT and is not portable to other systems. Therefore, we are interested in offload programming on SX-AT with standard programming interfaces such as OpenCL [11] and SYCL [5, 6]. We have developed our own SYCL implementation, named neoSYCL [13], and are exploring effective ways of improving performance portability by employing advanced features such as meta-programming [12] and memory layout optimization [14]. A minimal example of the kind of standard code this approach targets is sketched below.
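To make the portability argument concrete, the following is a small, self-contained kernel written against the standard SYCL 2020 interface. It is only an illustrative sketch: the kernel is a generic element-wise operation rather than one of the applications mentioned above, and whether this exact snippet is accepted by the current neoSYCL implementation (its SYCL version and supported feature subset) is an assumption.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1 << 20;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  sycl::queue q;  // device selection is left to the SYCL runtime
  {
    sycl::buffer<float> A(a.data(), sycl::range<1>(n));
    sycl::buffer<float> B(b.data(), sycl::range<1>(n));
    sycl::buffer<float> C(c.data(), sycl::range<1>(n));

    q.submit([&](sycl::handler& h) {
      sycl::accessor ra(A, h, sycl::read_only);
      sycl::accessor rb(B, h, sycl::read_only);
      sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
      // The vector-friendly kernel: a simple element-wise operation.
      h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        wc[i] = ra[i] + 2.0f * rb[i];
      });
    });
  }  // buffers synchronize back to the host vectors here
  return (c[0] == 5.0f) ? 0 : 1;
}
```

In the VEO model, the host part of such a program runs on the VH and only the kernel body is executed on the VE; with a standards-based front end such as neoSYCL, the same source can in principle be retargeted to other offload devices.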
Tohoku University Cyberscience Center and Osaka University Cybermedia Center have been operating a real-time tsunami inundation forecast system [7]. Once a tsunamigenic earthquake occurs, some running jobs are suspended to make a predefined number of nodes available, and a tsunami simulation is immediately executed on those nodes for disaster mitigation. After the simulation, the suspended jobs are automatically resumed. The urgent computing mechanism of suspending and resuming jobs described in [7] was developed for NEC SX-ACE [4], the predecessor of SX-AT. Because SX-AT has a heterogeneous hardware configuration of VHs and VEs, the original mechanism cannot simply be reused on SX-AT. In joint research with NEC, we have therefore investigated an urgent computing mechanism as well as job scheduling strategies for SX-AT [1]. As a result, the real-time tsunami inundation forecast system is in operation on AOBA and is always ready for tsunamigenic earthquakes. In fact, the simulation was executed as an urgent job in response to the earthquake of March 16, 2022.

We are now working with Mitsui Consultants Co., Ltd. to achieve real-time flood simulation based on the Rainfall-Runoff-Inundation model [10]. The rainfall data are sent from the Japan Meteorological Business Support Center at a fixed interval of 30 minutes. Based on these data, we must finish the simulation within 20 minutes to leave a margin for delays in the arrival of the rainfall data. To this end, we carefully optimized the simulation code to fully exploit the potential of SX-AT. Furthermore, we have proposed predicting the minimum amount of computing resources necessary to complete the simulation by the deadline, so that the real-time flood simulation can be executed efficiently with fewer computing resources, adapted to the weather conditions (a sketch of this idea is given below). Moreover, we are developing a job scheduling simulator that reproduces AOBA's job scheduling as faithfully as possible [9]. We will combine resource demand prediction with such job scheduling simulation to explore new ways of using HPC technologies, with disaster prevention and mitigation especially in mind.
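The deadline-driven resource selection can be illustrated with the following toy sketch. The runtime model `predict_runtime_minutes`, all of its coefficients, the safety margin, and the node limit are hypothetical placeholders chosen for illustration only; they are not the model or parameters used in the production system.

```cpp
#include <cstdio>

// Hypothetical runtime model: strong scaling with a serial fraction.
// All coefficients are invented for illustration only.
double predict_runtime_minutes(int nodes, double rainfall_intensity) {
  const double serial = 2.0;                               // non-scalable part
  const double base   = 60.0 + 5.0 * rainfall_intensity;   // single-node work
  return serial + base / nodes;
}

// Pick the smallest node count whose predicted runtime, plus a safety
// margin for late rainfall data, still meets the 20-minute deadline.
int minimum_nodes(double rainfall_intensity,
                  double deadline_min = 20.0,
                  double margin_min   = 2.0,
                  int    max_nodes    = 64) {
  for (int n = 1; n <= max_nodes; ++n)
    if (predict_runtime_minutes(n, rainfall_intensity) + margin_min <= deadline_min)
      return n;
  return max_nodes;  // fall back to the largest allowed allocation
}

int main() {
  std::printf("nodes needed: %d\n", minimum_nodes(3.0));
  return 0;
}
```

The point of the real system is the same as in this sketch: reserve only as many nodes as the predicted runtime requires, so that the remaining nodes stay available for regular jobs.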
5 Conclusions

This article has introduced the system update plan at Tohoku University Cyberscience Center. Our preliminary evaluation results demonstrate that VE Type 30A, newly adopted in AOBA-1.5, can increase the performance of our key applications by a factor of 1.26 to 1.655. With 7 times more nodes in the system and twice the memory capacity per VE, the overall performance of the AOBA-S subsystem will be much higher than that of AOBA-A in terms of both computational speed and memory capacity. Consequently, AOBA-1.5, consisting of AOBA-A, AOBA-B, and AOBA-S, will contribute to advancing various research areas as the world's largest vector supercomputer. The Cyberscience Center will also continue to develop HPC technologies that make full use of these vector supercomputers.

Acknowledgements This work is partially supported by the MEXT Next Generation High-Performance Computing Infrastructures and Applications R&D Program "R&D of a Quantum-Annealing-Assisted Next Generation HPC Infrastructure and its Applications," and by JSPS KAKENHI Grant Numbers JP20H00593, JP21H03449, and JP22K19764.
References

1. Agung, M., Watanabe, Y., Weber, H., Egawa, R., Takizawa, H.: Preemptive parallel job scheduling for heterogeneous systems supporting urgent computing. IEEE Access 9, 17557–17571 (2021)
2. Date, S., Kataoka, H., Gojuki, S., Katsuura, Y., Teramae, Y., Kigoshi, S.: First experience and practice of cloud bursting extension to OCTOPUS. In: Proceedings of the 10th International Conference on Cloud Computing and Services Science (CLOSER), pp. 448–455 (2020)
3. Egawa, R., Fujimoto, S., Yamashita, T., Sasaki, D., Isobe, Y., Shimomura, Y., Takizawa, H.: Exploiting the potentials of the second generation SX-Aurora TSUBASA. In: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 39–49 (2020)
4. Egawa, R., Komatsu, K., Momose, S., Isobe, Y., Musa, A., Takizawa, H., Kobayashi, H.: Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE. J. Supercomput. 73, 3948–3976 (2017)
5. Ke, Y., Agung, M., Takizawa, H.: neoSYCL: a SYCL implementation for SX-Aurora TSUBASA. In: The International Conference on High Performance Computing in Asia-Pacific Region, pp. 50–57 (2021)
6. Li, J., Agung, M., Takizawa, H.: Evaluating the performance and conformance of a SYCL implementation for SX-Aurora TSUBASA. In: 22nd International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 36–47 (2021)
7. Musa, A., Watanabe, O., Matsuoka, H., Hokari, H., Inoue, T., Murashima, Y., Ohta, Y., Hino, R., Koshimura, S., Kobayashi, H.: Real-time tsunami inundation forecast system for tsunami disaster prevention and mitigation. J. Supercomput. 74, 3093–3113 (2018)
8. NEC: VE offloading: Introduction. https://www.hpc.nec/documents/veos/en/veoffload/index.html. Accessed 27 Aug 2021
9. Ohmura, T., Shimomura, Y., Egawa, R., Takizawa, H.: Toward building a digital twin of job scheduling and power management on an HPC system. In: 25th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP) (2022)
10. Shimomura, Y., Musa, A., Sato, Y., Konja, A., Cui, G., Aoyagi, R., Takahashi, K., Takizawa, H.: A real-time flood inundation prediction on SX-Aurora TSUBASA. In: 29th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2022) (2022)
11. Takizawa, H., Shiotsuki, S., Ebata, N., Egawa, R.: An OpenCL-like offload programming framework for SX-Aurora TSUBASA. In: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 282–288 (2019)
12. Takizawa, H., Shiotsuki, S., Ebata, N., Egawa, R.: OpenCL-like offloading with metaprogramming for SX-Aurora TSUBASA. Parallel Comput. 102, 102754 (2021)
13. Tohoku University Takizawa Laboratory: neoSYCL (2022). https://github.com/TohokuUniversity-Takizawa-Lab/neoSYCL
14. Wang, W., Li, J., Shimomura, Y., Takizawa, H.: A memory bank conflict prevention mechanism for SYCL on SX-Aurora TSUBASA. In: International Symposium on Computing and Networking (CANDAR) Workshops, pp. 217–222 (2021)
15. Yamada, Y., Momose, S.: Vector engine processor of NEC's brand-new supercomputer SX-Aurora TSUBASA. In: Proceedings of a Symposium on High Performance Chips (Hot Chips), vol. 30, pp. 19–21 (2018)