Machine Learning Applications for Intelligent Energy Management: Invited Chapters from Experts on the Energy Field (Learning and Analytics in Intelligent Systems, 35) 3031479084, 9783031479083

​As carbon dioxide (CO2) emissions and other greenhouse gases constantly rise and constitute the main contributor to cli

103 14

English Pages 240 [234] Year 2024

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Series Editors’ Foreword
Preamble: AI-Powered Transformation and Decentralization of the Energy Ecosystem
Contents
An Explainable AI-Based Framework for Supporting Decisions in Energy Management
1 Introduction
2 Problem Statement, Stakeholders, Analytics Services and Previous Related Work
2.1 Problem Statement
2.2 Key Energy Stakeholders
2.3 Analytics Services for Intelligent Energy Management
3 AI Explainability
3.1 Responsibility and Trust
3.2 Global Explainability
3.3 Local Explainability
4 Survey Questionnaire and Analysis
4.1 Technology Acceptance Model
4.2 Questionnaire Structure
4.3 Demographics
4.4 Quantification Methodology
5 xAI Incorporation Results
5.1 Model Architecture and Benchmarks
5.2 Textual Descriptors—Semantic Grouping
5.3 Local xAI
5.4 Global xAI
6 Conclusions and Future Work
References
The Big Data Value Chain for the Provision of AI-Enabled Energy Analytics Services
1 Introduction
2 Background
2.1 The Big Data Value Chain
2.2 Energy Analytics Services
2.3 Well-Known Big Data Architectures for Energy Analytics Services
3 AI-Enabled Energy Analytics Services Requirements
4 Analytics Services Architecture
4.1 Field Layer
4.2 Interoperability
4.3 Data Capturing
4.4 Data Integration
4.5 Data Lake
4.6 Data Analytics Environment
4.7 Data Streaming
4.8 Analytics Services
4.9 Identity and Access Control Management and Vulnerability Assessment
5 Implications
6 Conclusions
References
Modular Big Data Applications for Energy Services in Buildings and Districts: Digital Twins, Technical Building Management Systems and Energy Savings Calculations
1 Introduction
2 Big Data and the Energy Value Chain
2.1 Digital Twins
2.2 Technical Building Management Systems
2.3 Energy Savings Calculations
3 Modular Big Data Applications for Energy Services
3.1 Digital Twins at Different Scales
3.2 Technical Building Management Systems
3.3 Energy Savings Calculation Based on IPMVP
4 Acceptance of Solutions by the Energy Value Chain
4.1 Evaluation Framework
4.2 User Satisfaction Methodology
4.3 Preliminary Results from Validation and Next Steps
5 Discussion
6 Conclusions
References
Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems
1 Introduction
2 Related Work
2.1 Comparison to Expected PV Output
2.2 I,V-Based Classification
2.3 Comparison to Reference PV System
3 Problem Definition
4 Recurrent Neural Network Model for a Single PV System Using Satellite Weather Information
4.1 Model: A Stacked GRU Network
4.2 Experiment Setup
4.3 Results
5 Graph Neural Network Model for Multiple PV System Sites
5.1 Our GNN Model
5.2 Experiment Setup
5.3 Results
6 Conclusions
References
Clustering of Building Stock
1 Case Study 1: Heat Saving Cost Curves for EU-27
1.1 Introduction
1.2 Methodology
1.3 Clustering
1.4 Results
1.5 Conclusions
2 Case Study 2: Synthetic Building Energy Performance Data for the Flanders Building Stock (VITO)
2.1 Background
2.2 Methodology
2.3 Results
2.4 Discussion
References
Big Data Supported Analytics for Next Generation Energy Performance Certificates
1 Introduction
2 Energy Performance Certification in Europe: Main Challenges and Opportunities
2.1 Stages in the Energy Performance and Certification Schemes
2.2 Energy Performance Certification in Spain
3 Big Data Supported Solutions Based on Energy Performance Certificates
3.1 EPCs Checker
3.2 EPCs Data Exploitation and Reports Generator
3.3 Energy Conservation Measures Explorer
3.4 Visualisation of EPCs and Estimated Energy Parameters
3.5 Climate Change Impact on Energy Use Analyses
4 Acceptance of Solutions by the Energy Value Chain
4.1 Validation Methodology Deployed
4.2 Preliminary Results from Validation and Next Steps
5 Discussion
6 Conclusions
References
Synthetic Data on Buildings
1 Introduction
2 Why and What Are Synthetic Data and How It Can Help the Construction Sector?
3 ML Techniques for the Generation of Synthetic Data in the Building Sector
4 Case Study 1: Synthetic Data of Indoor and Outdoor Temperature in a School
4.1 Methodology
4.2 Results and Discussion
Appendix
References
Recommend Papers

Machine Learning Applications for Intelligent Energy Management: Invited Chapters from Experts on the Energy Field (Learning and Analytics in Intelligent Systems, 35)
 3031479084, 9783031479083

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Learning and Analytics in Intelligent Systems 35

Haris Doukas Vangelis Marinakis Elissaios Sarmas   Editors

Machine Learning Applications for Intelligent Energy Management Invited Chapters from Experts on the Energy Field

Learning and Analytics in Intelligent Systems Volume 35

Series Editors George A. Tsihrintzis, University of Piraeus, Piraeus, Greece Maria Virvou, University of Piraeus, Piraeus, Greece Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The main aim of the series is to make available a publication of books in hard copy form and soft copy form on all aspects of learning, analytics and advanced intelligent systems and related technologies. The mentioned disciplines are strongly related and complement one another significantly. Thus, the series encourages cross-fertilization highlighting research and knowledge of common interest. The series allows a unified/integrated approach to themes and topics in these scientific disciplines which will result in significant cross-fertilization and research dissemination. To maximize dissemination of research results and knowledge in these disciplines, the series publishes edited books, monographs, handbooks, textbooks and conference proceedings. Indexed by EI Compendex.

Haris Doukas · Vangelis Marinakis · Elissaios Sarmas Editors

Machine Learning Applications for Intelligent Energy Management Invited Chapters from Experts on the Energy Field

Editors Haris Doukas School of Electrical and Computer Engineering National Technical University of Athens Athens, Greece

Vangelis Marinakis School of Electrical and Computer Engineering National Technical University of Athens Athens, Greece

Elissaios Sarmas School of Electrical and Computer Engineering National Technical University of Athens Athens, Greece

ISSN 2662-3447 ISSN 2662-3455 (electronic) Learning and Analytics in Intelligent Systems ISBN 978-3-031-47908-3 ISBN 978-3-031-47909-0 (eBook) https://doi.org/10.1007/978-3-031-47909-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Series Editors’ Foreword

Over the past two centuries, the Industrial Revolution has led to a dramatic increase in global energy consumption, a trend that clearly continues into the twenty-first century. Indeed, global energy consumption is estimated to have risen from less than 6.000 TWh in 1800 to over 120.000 TWh in 2000 and to about 180.000 TWh in 2022 [1]. Moreover, projections show that energy consumption will continue to grow well into the foreseeable future and possibly exceed 250.000 TWh by the year 2050 [2]. This rise in energy consumption is certainly a sign of significant advances in human civilization and of improvement of living conditions worldwide. However, it does come with side effects that may have dangerous consequences for humankind and the entire planet. Indeed, carbon dioxide (CO2 ) emissions and other greenhouse gasses have been clearly recognized as the main contributors to climate change, temperature rise and global warming [3]. These emissions are mainly due to the use of fossil fuels as energy source, and they have been constantly rising from about 28.000.000 tons in 1800 [4] to almost 40.000.000.000 tons in 2022 [5] and already over 32.000.000.000 tons in the first three quarters of 2023 [2]. Governments around the World, including the European Union, are addressing climate change with high priority [6]. Indeed, energy transition from energy production based on fossil fuels (oil, natural gas, or carbon) to renewable and environmentally friendlier energy sources (solar energy, wind energy, hydrogen-based energy production, or use of lithium-ion batteries) has been set as a common target of many states worldwide. However, about 83% of consumed energy still comes from fossil fuels, while the use of environmentally friendlier sources of energy is rising at a slower-than-required pace [2] raising concerns even from oil giants such as BP [7]. Fortunately, the development and availability of important streamlined technologies, including artificial intelligence, big data, internet of things and blockchain technologies, can provide powerful tools towards intensifying efforts, accelerating energy transition and getting all energy stakeholders actively involved when decisions are made regarding energy production, distribution and management. As chief editor of the Learning and Analytics in Intelligent Systems Series (LAIS) series of Springer, I am particularly happy to present the book at hand, on Machine Learning Applications for Intelligent Energy Management, which is one of the very v

vi

Series Editors’ Foreword

first of its kind. The book has been edited by three outstanding researchers, namely Haris Doukas, Vangelis Marinakis and Elissaios Sarmas, who are renowned for their contributions to the fields of decision-making and policy designing in the energy sector. Its publication in our series aims at filling a gap in the literature on the use of the most advanced artificial intelligence, big data, internet of things and blockchain technologies in the energy sector and to verse the scientific community in the most recent relevant advances. More specifically, the book discusses both artificial intelligence-empowered analytics of energy data and artificial intelligence-empowered application development. The book consists of a preamble and an additional seven chapters written by leading experts. Topics covered include a presentation of the various stakeholders in the energy sector and their corresponding required analytic services, such as stateof-the-art machine learning, artificial intelligence and optimization models and algorithms tailored for a series of demanding energy problems and aiming at providing optimal solutions under specific constraints. Overall, the book is very well written and constitutes a valuable guide for both the experts in the field and the newcomers. The former will be updated on the most recent advances in terms of challenges and solutions regarding energy transition and the transformation of the energy sector into an environmentally friendlier human activity. On the other hand, the newcomers will also benefit from this book, as they will obtain knowledge and develop practical skills. Finally, the book will certainly attract the interest of readers from other areas as well, who wish to get versed in this significant scientific discipline. As series editor, I welcome this monograph to the Learning and Analytics in Intelligent Systems Series of Springer and present it to the research communities worldwide. I congratulate the editors for their superb work, in confidence that their book will help its readers not only understand, but also apply the proposed methodologies in various energy transition problems. Finally, I encourage the editors to continue their research work in this important area and keep the scientific communities appropriately updated on their research results. October 2023

George A. Tsihrintzis Department of Informatics University of Piraeus Piraeus, Greece

References 1. https://ourworldindata.org/energy-production-consumption 2. https://www.theworldcounts.com/challenges/climate-change/energy/global-energy-consum ption 3. https://www.youtube.com/watch?v=ipVxxxqwBQw 4. https://ourworldindata.org/co2-emissions 5. https://www.energyinst.org/statistical-review

Series Editors’ Foreword

vii

6. Paris agreement. Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris). Retrieved December, volume 4, 2017 (https://unfccc.int/process-and-meetings/the-paris-agreement) 7. “There is a growing mismatch between societal demands for action on climate change and the actual pace of progress, with energy demand and carbon emissions growing at their fastest rate for years. The world is on an unsustainable path.”—Spencer Dale, BP chief economist, 2019.

Preamble: AI-Powered Transformation and Decentralization of the Energy Ecosystem

During the last decade, the global society has confronted the challenge of climate change. The escalating repercussions of climate shifts, coupled with the relentless surge in carbon dioxide (CO2) emissions, have compelled profound shifts on both national and international fronts. This transformation stands as an imperative response to combat the devastating impacts of climate change and drive the crucial process of decarbonizing our energy sector. Decarbonization, in essence, signifies the imperative to diminish our dependence on carbon-centric energy sources. It necessitates a methodical transition toward a pioneering energy paradigm, meticulously structured to curtail carbon emissions. The urgency of this transition becomes undeniably clear when we consider that as of 2019, global CO2 emissions exceeded the 1990 levels by a staggering 60%, signifying the inception of earnest climate negotiations. These emissions continue to be the chief driver of the relentless global warming phenomenon. Significantly, 2019 etched a disconcerting milestone with record-breaking CO2 emissions. Yet, there was a glimmer of hope in 2020, a year fraught with the COVID19 pandemic, when stringent lockdown measures led to a noteworthy reduction in emissions. It is universally acknowledged that the most direct route to decarbonization hinges upon a sweeping shift toward harnessing Renewable Energy Sources (RES). This clarion call has found a formidable proponent in the form of the International Renewable Energy Agency (IRENA). The agency has meticulously charted a comprehensive roadmap, envisioning a significant reduction in carbon emissions by 2050. Consequently, a multitude of nations have already embraced a substantial share of renewable resources, such as wind and solar energy, to meet their burgeoning energy demands. Simultaneously, technological advancements over the past decade have empowered energy consumers, facilitating the decentralization of the energy system. Decentralization involves the placement of energy production facilities closer to consumption sources, as opposed to the centralized energy generation model characterized by large-scale power plants. This transition is underpinned by the adoption and widespread use of RES to reduce dependence on fossil fuels, as discussed in the previous sections. The current relevance of decentralization is paramount, leading ix

x

Preamble: AI-Powered Transformation and Decentralization …

to the gradual transformation of energy systems. It also supports the emergence of distributed, peer-to-peer energy transactions, centered around community involvement. This necessitates predictive models and optimization algorithms to ensure the successful operation of innovative microgrids and decentralized systems. Moreover, the democratization of knowledge within energy communities is crucial. The decentralized model relies heavily on individuals’ capacity to generate, store, and manage their energy consumption. This model can be implemented through the widespread installation of energy generation equipment, such as rooftop solar panels, energy storage systems (though their cost remains a barrier), and smart management and monitoring systems for the microgrids formed. Individuals who produce their energy are commonly referred to as “prosumers,” indicating their dual role as both energy producers and consumers. Prosumers can feed surplus energy into the grid when available and draw from the grid when needed. Energy democratization aims to ensure universal access to affordable and clean energy, enabling everyone to harness RES according to their needs. This condition fosters local economic value by expanding opportunities for prosumers and small to medium-sized enterprises due to their reliable access to electric power. In this context, full democratization of energy data and services aims to grant endusers access to data, information, and intelligent systems without requiring external expertise. The goal of democratizing data and applications is to enable energy community members to collect and analyze data on energy production systems, consumption patterns, and flexible loads independently. The preceding paragraphs provide a concise overview of the energy transition challenge and its relationship with technological progress, which has generated an immense volume of available data. This situation presents substantial opportunities for designing, implementing, and developing intelligent systems capable of optimizing energy management and aiding decision-making across a multitude of issues. Artificial Intelligence (AI) plays a pivotal role in the development of these systems, facilitating the optimal management of energy networks by controlling energy flows between homes, businesses, energy storage units, RES, microgrids, and the electrical grid. This AI-driven approach reduces energy waste and enhances consumer participation in energy management. AI is already deployed in a wide range of energy applications related to RES, energy-efficient buildings, and smart energy management within microgrids comprised of distributed energy resources. Therefore, the design and development of structured machine learning models and optimization algorithms for energy management problems, RES integration, or building energy efficiency represent an ongoing scientific field. The aim is to support an equitable, sustainable, and democratized energy transition.

Preamble: AI-Powered Transformation and Decentralization …

xi

In conclusion, AI serves as a linchpin in realizing the potential of energy transition and decentralization. Leveraging AI’s capabilities allows for the effective utilization of abundant data and the optimization of energy systems. This not only aids in addressing climate change, but also empowers individuals and communities to actively participate in energy management, contributing to the development of a more sustainable and democratic energy future. Stavros Stamatoukos CINEA Horizon Europe Energy

Contents

An Explainable AI-Based Framework for Supporting Decisions in Energy Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elissaios Sarmas, Dimitrios P. Panagoulias, George A. Tsihrintzis, Vangelis Marinakis, and Haris Doukas The Big Data Value Chain for the Provision of AI-Enabled Energy Analytics Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Konstantinos Touloumis, Evangelos Karakolis, Panagiotis Kapsalis, Sotiris Pelekis, and Dimitris Askounis Modular Big Data Applications for Energy Services in Buildings and Districts: Digital Twins, Technical Building Management Systems and Energy Savings Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gema Hernández Moral, Víctor Iván Serna González, Roberto Sanz Jimeno, Sofía Mulero Palencia, Iván Ramos Díez, Francisco Javier Miguel Herrero, Javier Antolín Gutiérrez, Carla Rodríguez Alonso, David Olmedo Vélez, Nerea Morán González, José M. Llamas Fernández, Laura Sanz Martín, Manuel Pérez del Olmo, and Raúl Mena Curiel

1

29

53

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Jonas Van Gompel, Domenico Spina, and Chris Develder Clustering of Building Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Matteo Giacomo Prina, Ulrich Filippi Oberegger, Daniele Antonucci, Yixiao Ma, Mohammad Haris Shamsi, and Mohsen Sharifi

xiii

xiv

Contents

Big Data Supported Analytics for Next Generation Energy Performance Certificates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Gema Hernández Moral, Víctor Iván Serna González, Sofía Mulero Palencia, Iván Ramos Díez, Carla Rodríguez Alonso, Francisco Javier Miguel Herrero, Manuel Pérez del Olmo, and Raúl Mena Curiel Synthetic Data on Buildings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Daniele Antonucci, Francesca Conselvan, Philipp Mascherbauer, Daniel Harringer, and Cristian Pozza

An Explainable AI-Based Framework for Supporting Decisions in Energy Management Elissaios Sarmas, Dimitrios P. Panagoulias, George A. Tsihrintzis, Vangelis Marinakis, and Haris Doukas

Abstract Climate change and energy production and consumption are two inextricably linked concrete concepts of great concern. In an attempt to guarantee our future, the European Union (EU) has prioritized the addressing of both concepts, creating a new social contract between its citizens and the environment. The dazzling progress in its methodologies and applications during the recent years and the familiarization of the public with its abilities indicate Artificial Intelligence (AI) as a potential and powerful tool towards addressing important threats that climate change imposes. However, when using AI as a tool, it is vital to do so responsibly and transparently. Explainable Artificial Intelligence (xAI) has been coined as the term that describes the route of responsibility when implementing AI-driven systems. In this paper, we expand applications that have been previously built to address the problem of energy production and consumption. Specifically, (i) we conduct a survey to key stakeholders of the energy sector in the EU, (ii) we analyse the survey to define the required depth of AI explainability and (iii) we implement the outcomes of our analysis by developing a useful xAI framework that can guarantee higher adoption rates for our AI system and a more responsible and safe space for that system to be deployed. Keywords Machine learning · Optimization · Explainable artificial intelligence · Energy management · Energy transition

1 Introduction Since the Paris Agreement [1, 2], there is increasing international concern and action taken with regard to climate change and viability of Earth. Indeed, the Paris Agreement focuses on the design and the application of viable, effective, socially acceptable and fair policies to fight and possibly reverse climate change on a global level [3]. E. Sarmas (B) · V. Marinakis · H. Doukas Decision Support Systems Laboratory, School of Electrical & Computer Engineering, National Technical University of Athens, Athens, Greece e-mail: [email protected] D. P. Panagoulias · G. A. Tsihrintzis · H. Doukas Department of Informatics, University of Piraeus, Piraeus, Greece © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_1

1

2

E. Sarmas et al.

As a result, radical changes are effected on the energy sector, which affect all of its stakeholders. These changes are collectively referred to as “Energy Transition”, which is a term implying the transition from energy production based on fossil fuels (oil, natural gas or carbon) to renewable and environmentally friendlier energy sources (solar energy, wind energy, hydrogen-based energy production or use of lithium ion batteries). Energy Transition is not restricted to decarbonisation only, but it also incorporates multiple social, technological and environmental targets [4]. In more detail, Energy Transition is founded on four pillars, namely the, so-called, 4D’s of the energy sector, which specifically stand for Decarbonization, Digitization, Decentralization, and Democratization. In more detail, Decarbonization refers to the reduction of the dependence on carbon for energy production and its gradual replacement with other sources of energy, such as Renewable Energy Sources (RES). Decarbonization is a high priority in Energy Transition, as it will slow down both climate change and the rate of increase in carbon dioxide (CO2) emissions [5]. Indeed, CO2 emissions in 2019 were over 60% higher than 1990 emissions. On the other hand, there was a significant decline in CO2 emissions in 2020, which was due to the strict lockdowns enforced by many governments around the World to contain the COVID-19 pandemic [6]. These two facts clearly indicate CO2 emissions as a main cause for climate change. The ultimate goal of Decarbonization is a World economy that does not produce CO2 emissions. To measure progress towards this goal, the Paris Agreement has established several actions and performance indices, which, however, are still far from being met [3]. On the other hand, Decentralization refers to the construction of energy production systems that are physically located near energy consumers, as opposed to large-scale, but centrally located ones [7]. Decentralization is commensurate with Decarbonization as Decentralization is based on the extensive use of RES and reduction of the dependence on fossil fuels. Decentralization also depends on the existence of active consumers who not only consume energy, but also possess the ability to, at least partially, produce and store energy and manage energy demand [8]. Democratization comes as a consequence of Decentralization. Indeed, the decentralized model is largely based on the ability of consumers to produce, store and manage the energy they consume, for example when they have installed photovoltaic equipment on their roof. The term “prosumer” (a combination of the two terms “producer” and “consumer”) has been coined to indicate any energy consumer with their own means to produce part of the energy they consume [9]. Energy Democratization is expected to produce significant local economic value as it will allow both prosumers and SMEs to have reliable access to low-cost energy. Digitization refers to the use of electronic tools, systems, devices and resources which produce, store and process data and extract meaningful and useful information [10]. Data, in particular, are widely available and of very high quality and can be used towards optimizing functions and processes at the household, building, work environment, community or city levels [11, 12]. As with most big data applications, energy data are characterized by the, so-called, 5V’s, namely, very high V olume, great

An Explainable AI-Based Framework for Supporting Decisions in Energy …

3

V ariety, high rate of new data collection (V elocity), increased reliability (V eracity), and the ability of information technologies to add V alue to them. Clearly, the transition from a producer/provider-centered system to a consumer/ household-centered system is a very challenging goal [13]. Fortunately, it coincides with the development and availability of important streamline technologies, including Artificial Intelligence (AI), Internet of Things and Blockchain Technologies. These technologies offer a unique opportunity to intensify efforts and accelerate Energy Transition. Particularly important is the active involvement of all energy stakeholders and their participation in decision making with regard to the production, distribution and management of energy. Towards this, decision support systems (DSSs) need to be made available to the various stakeholders, which are stakeholder-tailored and provide decision support at various levels and in various forms. On one hand predictive models and optimization algorithms need to be incorporated in them, which rely on state-of-the-art Machine Learning (ML) and AI technologies [14, 15]. On the other hand and in order to be adopted by the various classes of their users and to exclude unacceptable decisions, these DSSs need to incorporate explainable AI (xAI) and responsible AI technologies, which provide the system users with justification of the recommended actions [16, 17]. As explainability and responsibility of an AI system are concepts related to an audience, i.e. to a class of users or even to specific individual users [16], the various stakeholders (i.e. classes of users) need to be identified first. In this paper, we present a DSS for supporting decisions on energy management and energy efficiency in buildings. The system is based on a set of state-of-the-art ML and AI models which have been tailored for a series of demanding energy problems, as well as on a set of optimization algorithms aiming at providing the optimal solution under specific constraints. The novelty of the DSS lies in the use of xAI modules which have been developed based on conducting a survey among key stakeholders of the energy sector in the EU and analyzing its results. The development and implementation of the outcomes of our analysis into a useful xAI framework can guarantee higher adoption rates for our AI system and a more responsible and safe space for that system to be deployed as it provides well-established justification of the recommendations presented to its users. This is especially important for critical applications in the energy sector, such as energy load management or RES production forecasting, where the consequences of incorrect predictions can be quite significant. The proposed xAI system can provide insights into how AI models make predictions (e.g. about future energy demand or supply), assisting energy companies and regulators understand the factors that drive energy demand and make informed decisions. In comparison to existing DSSs that often lack transparency and interpretability making it difficult for users to assess the reliability of the results, the proposed xAI system provides a clear and understandable explanation of the predictions being made and the factors that contribute to them. More specifically, the paper is organized as follows: Sect. 2 is devoted to a statement of various problems that are associated with energy production, distribution and consumption, an outline of the various stakeholders and their relation to cor-

4

E. Sarmas et al.

responding problems and a review of previous related work on providing analytics services in this area. Section 3 summarizes recent approaches to xAI, while in Sect. 4 a questionnaire-based survey is conducted and its results are analyzed and presented. This survey forms the basis for the methodology followed in Sect. 5 with regard to incorporation and implementation of xAI in our DSS Finally, Sect. 6 a discussion of the main conclusions drawn are included, as well as indications to future related research avenues.

2 Problem Statement, Stakeholders, Analytics Services and Previous Related Work 2.1 Problem Statement Current challenges that researchers are facing and addressing in the energy sector can generally be grouped into three general categories, namely (1) managing renewable energy sources, (2) establishing and managing distributed energy resources and (3) improving building energy efficiency. In the following, these challenges are briefly analyzed further. Managing renewable energy sources: Driven by European and international concern about climate change and action taken to reduce Greenhouse emissions by 55% by the year 2030 [18], energy production from RES, especially from those based on wind and solar power, is increasing at a fast rate. Indeed, it is expected that, by the year 2026, up to 95% of the increase in global power generation will come from RES [19]. This translates to an increase of over 60% in electrical power generation from renewable sources during the years 2021 to 2026, with construction of 25% more new wind-energy parks than new win-energy parks constructed during 2015–2020 [19]. Clearly, new tools need to be developed and made available to optimize the energy production from RES, as well as its distribution and management. ML and AI technologies seem to provide an indispensable tool towards this goal. More specifically, challenges that need to be addressed by researchers include: – Predicting energy production from RES in the very short- (up to 30 mins ahead), short- (up to one hour ahead), medium- (up to one day ahead), and long- (more than one day ahead) term, based on predictive methodologies that process such measurements as humidity, temperature, cloudiness, wind speed and direct or diffuse solar irradiance. – Detecting faults and applying prognostic maintenance, in order to minimize outof-service time and reduce relevant cost. – Optimizing the location of RES, especially photovoltaic panels and wind-energy units, based on measurement of various parameters, such as solar irradiance or

An Explainable AI-Based Framework for Supporting Decisions in Energy …

5

wind potential, and taking into account other factors, such as environmental or social. Establishing and managing distributed energy resources: A distributed energy resource consists of a number of small energy production units, such as photovoltaic panels or wind-energy units, which reside on the consumer side. These distributed resources, when combined with energy storage units and managed in a way that takes flexible electric load demands [20], form smart (micro-)grids. Several challenges arise with regard to distributed energy resources forming smart micro-grids, that researchers need to address: – Predicting energy demand, i.e. predicting upcoming energy load, either in the short(a few minutes, a day or a week ahead) or in the medium- (up to a year ahead) or in the long- (several years or more ahead) term. This is a crucial requirement for the reliable operation of a distributed energy resource [21]. – Scheduling demand response, i.e. asking consumers to reduce energy consumption during specified time intervals in order to reduce the strain on the energy grid, save on energy costs, reduce the use of fossil fuels for energy production, and incorporate renewable energy resources into the grid [22]. – Optimizing flexible loads and demand side managing with practices that go beyond demand response, attempt to alter consumer behavior and possibly do not require intermediate energy storage. Such practices include peak shaving, valley filling and load shifting [23]. – Developing more efficient energy storage systems, ranging from the development of new materials to hybrid methods in which advanced algorithms may be employed to monitor battery health, reduce battery discharge or improve the frequency of required battery charging [24–26], Improving building energy efficiency: To date, a number of researches and studies have been effected on improving energy efficiency via detecting equipment faults in various building subsystems, predicting building energy consumption, classifying buildings according to their energy efficiency, identifying building user behavior and optimizing building subsystem operation. Clearly, all related works need to focus on improving (1) the thermal comfort of building users, (2) the building energy efficiency, (3) the flexibility of demand and (4) the overall building resilience. Thus, targeting at greener buildings with reduced carbon fingerprint, actions need to be taken at the following stages of building lifetime: – Building design, perhaps with use of parameterized [27] and genetic [28] design methodologies which, subsequently, are extensively evaluated [29]. – Building construction, perhaps with use of building information modeling methodologies [30] and streamline construction monitoring methodologies [31], which is followed by identification of construction faults [32]. – Building operation, during which energy efficiency [33], thermal comfort [34] and maintenance requirements [35] are monitored. – Building renovation, after energy audits [36].

6

E. Sarmas et al.

Clearly, AI, especially ML, and Optimization are fields that can provide powerful methodologies and tools to address the above-mentioned challenges in efficient ways and to provide solutions that can be made available and embraced by all stakeholders involved in the energy sector.

2.2 Key Energy Stakeholders Several stakeholders are involved in the energy sector, who require intelligent energy management systems that provide them with different groups of analytics services. In this subsection, we outline eleven (11) key stakeholders, while in the following subsection we present eight (8) analytics services, which are associated with stakeholders and have been fully implemented. Figure 1 illustrates key energy stakeholders along with their corresponding importance. The description and the roles of these stakeholders is analyzed in the following. 1. KS01—Producers (PD): A producer produces the energy1 that will be consumed by households and businesses. Conventional means of electricity production include nuclear power plants, combined cycle gas turbines and coal plants, while renewable electricity production comes from biomass power plants, hydroelectric power stations, wind farms and solar parks. The latter producers can be further categorized into individual park owners and aggregators who may control several solar parks. 2. KS02—Suppliers (SP): A supplier sells energy to households and businesses, without necessarily being a producer as well. Today, a large number of suppliers are in operation due to the opening of the energy market to competition. 3. KS03—Balance Responsible Party (BRP): A BRP is a private legal entity that overlooks the balance of one or multiple access points to the transmission grid. 4. KS04—Transmission System Operator (TSO): The role of the TSOs is to carry electricity from power plants and the different delivery points either to the distribution networks or straight to industrial customers. The TSOs are also obliged to ensure the overall balance of the network, i.e., to continuously balance production and consumption. The required investments for a TSO are colossal which forbids opening this business to competition. 5. KS05—Distribution System Operators (DSO): These stakeholders form the last link in energy delivery, as they construct and manage medium and low voltage/pressure networks to liaise between the transmission networks and private dwellings. Moreover, in the event of problems with the energy meter or power failure, the DSO rather than the supplier should be contacted. 6. KS06—Regulatory Bodies (REG): As some stakeholders enjoy a legal monopoly status, bodies have to be created to control and regulate the energy 1

In this work, the term “energy producer” refers to an electricity producer, but, in more general terms, the term may as well refer to a natural gas producer.

An Explainable AI-Based Framework for Supporting Decisions in Energy …

7

Fig. 1 Key energy stakeholders

7.

8.

9.

10.

market. Their role is to ensure transparency and competition in the energy market, defend consumer interests, advise the authorities on energy matters, and certify the operation of energy markets. KS07—Electricity Customers, Consumers, Residents of Buildings, Occupants (BUI): These stakeholders constitute entities that consume energy at various scales. KS08—Project Developers, Investing Funds, Governmental Institutions (INV): These stakeholders constitute entities that are looking for the best financing opportunities in investing in energy efficiency projects (e.g. renovations/refurbishments in buildings) with the goal of reducing carbon emissions at the building, district, or city level [37]. KS09—Aggregators (AGG): An aggregator is a new type of energy service provider which can increase or moderate the electricity consumption of a group of consumers according to total electricity demand on the grid. An aggregator consists of a grouping of agents in a power system, which acts as a single entity when engaging in the electricity market. The aggregator’s role is to gather flexibility from the prosumers’ devices and sell it to KS05-DSOs, KS03-BRPs, and (either directly or through a KS03-BRP) to KS04-TSOs. KS10—Facility Managers, Building Operators (FMB): Building Operators ensure that the heating, cooling, mechanical and electrical equipment of a building is running effectively. FMB duties may include inspecting the building for safety hazards and regulation violations, performing repairs, and checking the ability of the facility to operate successfully and in the most efficient manner.

8

E. Sarmas et al.

11. KS11—Energy Managers (ENM): Energy managers are responsible for handling heating, ventilation and air conditioning (HVAC) systems of large, multistorey buildings, so as to ensure that thermal comfort levels are kept within acceptable ranges.

2.3 Analytics Services for Intelligent Energy Management Figure 2 is illustrative of the various analytic services towards intelligent energy management, developed previously in related work. In this study, we examine explainability approaches to better communicate each service functionality to the stakeholders described in Fig. 1, to increase adoption and optimise usability. Each service is defined by its inputs and outputs. In detail, AS01 [38, 39] focuses on photovoltaic (PV) production forecasting, which involves the use of PV production time series, historical weather data, and numerical weather predictions as input to estimate the energy generation of the PV system. This estimation can be done for short-term (e.g., an hour ahead) or mid-term (e.g., a week ahead) intervals. AS02 [40] deals with consumption forecasting at the building level, utilizing building consumption time series as input to estimate the future energy consumption of a building. AS03 [41] is concerned with load forecasting of the grid. It employs building consumption time series, historical weather data, and numerical weather predictions as input to estimate the future grid load.

Fig. 2 Artificial Intelligence/Machine Learning/Optimization-based analytics services

An Explainable AI-Based Framework for Supporting Decisions in Energy …

9

AS04 [42] estimates energy savings as a result of renovations. It uses historical consumption time series of the building, historical weather data, and measurements after the renovation actions to determine the actual energy savings of the renovations. AS05 [14] involves financing of energy conservation measures. The input for this service includes historical records of financing projects, characteristics of the building (age, floors, country, heating area), and the cost of energy efficiency measures. The output is a renovation class (A, B, or C), which represents the potential of the renovation in terms of energy savings against cost. AS06 focuses on optimization of distributed energy resources (DERs) in microgrids. It uses forecasts of PV production, forecasts of building consumption and other loads, profiles of flexible loads, and storage systems as input. The output is the optimal scheduling of the flexible loads and the optimal sizing of the storage system. This service is directly connected with analytics services AS01 and AS02. AS07 [15] addresses scheduling of flexible loads for peak shaving at the grid level. It utilizes forecasts of PV production, forecasts of the grid load, and profiles of the flexible loads as input to determine the optimal scheduling of the flexible loads to minimize peak of loads. This service is directly connected with analytics services AS01 and AS03. Lastly, AS08 deals with thermal comfort with dynamic energy management [43, 44]. It uses consumption per building room, humidity, and temperature as input to schedule the heating, ventilation, and air conditioning (HVAC) system to ensure the thermal comfort of the occupants. This service is directly connected with analytics service AS02. Finally, Fig. 3 illustrates the interconnection between the key stakeholders and the analytics services defined previously.

Fig. 3 Interconnection between key stakeholders and analytics services

10

E. Sarmas et al.

3 AI Explainability xAI refers to the level and depth that the decision process of a trained (ML) model is explained and described. When the process is adequately explained via a framework of textual, visual and tabular paradigms, the usually referred to as “black box” in ML is disassembled. The users are prawn to increase adoption of systems, when misconceptions about the underlying technologies have been unraveled. The key characteristics of a xAI system are fairness, ethics, transparency, privacy, security, accountability and safety. When those characteristics are addressed, then the explainable framework is referred to as responsible. Systems that can infer intelligence are required to have an adequate framework, where the underlying technology that outputs decisions and recommendations can be adequately and properly explained. Due to the increasing social impact of such intelligent systems, xAI is considered a prerequisite. xAI is also an important feature that increases adoption. Familiarity with technology and AI literacy should be considered in the xAI framework, design process according to [45], where layers of scientific language and simplified explanations should be combined and applied accordingly. For that purpose cross-discipline knowledge and the mental capacity of the user should be taken into account and served accordingly [46]. The main concern to be addressed is related to the reasons that would make a user trust ML models to make predictions, automate classifications and support decision making. If the proposed solutions that are served by a machine are aligned with user expectations, a need for deeper explanations may be overlooked. In any case though, it is the developers’ responsibility to offer sufficient description and a road map to the decisions suggested by the system [47].

3.1 Responsibility and Trust System transparency is a basic social demand and on par with the need for data privacy and data security [48]. It is important that the user can identify the characteristics of the models utilised as the backbone of the AI system. The algorithms deployed, that lead to a decision, have to be adequately represented [49]. If these requirements are met, the system is considered responsible, inclusive and transparent.

3.2 Global Explainability Global model explanation is the process that takes place in the training phase of the ML pipeline. It involves deep data analysis, measurement of bias and semantic grouping of inputs and outputs, the analysis of the adopted ML methodology, feature importance and feature dependency, the classification report and the confusion

An Explainable AI-Based Framework for Supporting Decisions in Energy …

11

Fig. 4 Technology acceptance model

matrix. The global explainability method is, essentially, a way to give the user the clarifications required as per the inputs of a ML model and the expected outcomes (output) when those inputs are analysed.

3.3 Local Explainability Local model explanation is that feature of an AI-infused system, where a specific decision made by the trained algorithm is explained in detail. More commonly used for this process are plots from which individual explanations can be derived as per the contribution of a feature and the effect of values of features. The fairness of a system can also be evaluated by testing; for example, different demographic scenarios of a dataset can be analyzed to determine whether the model can serve predictions equally for all possible cases.

4 Survey Questionnaire and Analysis 4.1 Technology Acceptance Model The technology acceptance model [50] (TAM) theorizes that acceptance of a new computer system is related to two external variables, namely the perceived usefulness and the perceived ease of use. Those two variables are considered measurements of individual intention to use a specific technology and, thus, determine the likelihood of adoption of said technology. Perceived usefulness refers to the users’ belief that a certain technology can deliver value. Perceived ease of use refers to the level of effort that a user would be required to undertake to use the technology. TAM is extracted from the theory of reasoned action [51], which assumes that to predict user behaviour, those two external variables should be linked to specific intentions (Fig. 4). We use TAM to align survey outcomes with application deployment requirements to ensure higher adoption rate and smaller learning curves. For that purpose,

12

E. Sarmas et al.

the survey questionnaire is structured in a way to facilitate scoring with regard to usefulness (as related to AI) and perceived ease of use (as associated with a more generic approach to AI capabilities and tools provided by the system). By retrieving valuable input by its main potential users and stakeholders, a more concrete definition is provided of the required depth of system explainability. Recognising users’ attitude towards AI potential and their level of AI literacy, especially with regard to energy system automation and predictive capabilities of the proposed models, the developer can outline and build with intention and purpose better systems, and at the same time deliver tools for the circular economy.

4.2 Questionnaire Structure The questionnaire is split into three sections. The first one is related to the demographic characteristics of the participants, including age, gender, educational level, occupation and employment space (public, private or both). The aim of the second section is to define the AI literacy level (perceived usefulness) of the participants, while the third section attempts to define the perceived ease of use of the provided AI tools. In the second and third sections, both qualitative and quantitative questions were included. The quantitative questions were used as scoring components and facilitated the clustering of the specialists that participated in the survey. On the other hand, the qualitative questions were used as descriptors of the market trends and the general attitude towards AI infusion in the energy sector. In Tables 1, 2 and 3, the different questions are separated in a way that reflects the described process.

4.3 Demographics Based on [52, 53] and since the goal is to address AI specific usability and design concerns, a number of 20 to 40 participants would suffice to draw reliable and high quality results. Thus, the survey was conducted among 20 specialists working in the energy sector, with 30% of the participants being employed in a public company, 45% in a private company and 25% both in a public and private company. Their average employment duration was 2 years with a minimum of 1 and a maximum of 5 years in this particular sector. 75% of the participants are identified as men and 25% as women and work or/and live in the EU (Fig. 5). The primary language of the participants is mainly Italian(6), followed by Spanish(5). The average age of the participants was 36.6 years with a minimum age of 22 years and a maximum age of 58 years. The highest educational level of the participants was that of Ph.D status (2 out of 20), with the majority holding a master’s degree (15 out of 20) and the rest having acquired relevant certifications or equivalent to certifications degrees (3 out of 20). The electronic questionnaire was anonymously filled in, no detailed personal

An Explainable AI-Based Framework for Supporting Decisions in Energy …

13

Table 1 AI literacy level Perceived ease of use Question Q1 Q2 Q3 Q4 Q5

Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14

Competency level in the English language Level of literacy in AI and related expertise Years of Experience in the sector In the system use, were you adequately informed for every change that the system performs? In the system use, to what extent your previous knowledge on the use of information systems has helped you understand how to use the present system? In the system use, to what extent you needed a button for help? To what extent were the system messages informative? To what extent did you need to learn new functions to operate the system? To understand the results, did you need more information that was missing from the system? Did you need more automatic recommendations on how to use the results of the system? Did you need more explanations on how the results of the system were generated? Did you trust the systems results? Did you cross-check the results using other methods? Were the results satisfying?

data were collected and the participants submitted their consent prior to filling in the questionnaire.

4.4 Quantification Methodology As stated in the previous section, to better define the requirements of AI explainability depth and related usability functions, a clustering analysis was performed on the questions presented in Tables 1 and 2, that were answered using a likert scale, ranging from 1 (lowest) to 5 (highest). Due to the small amount of participants, which however was sufficient for the purpose of this study, the silhouette coefficients were utilised to define the correct amount of clusters and the . K -means clustering algorithm was employed to partition the group of answers into the defined number of clusters. The silhouette coefficients are values that measure the similarity of an object to a cluster. The silhouette coefficients range from .−1 to .+1 and their exact value is indicative of the similarity of an object to its cluster and its dissimilarity to the other clusters [54, 55]. . K -means clustering is a quantification method of unsupervised learning, that partitions observations in . K clusters based on the mean (centroid) of each cluster

14

E. Sarmas et al.

Table 2 AI usability Question Perceived usefulness Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11

In the system use, how easily can you predict what results are given by the functions of the system (e.g. what happens if you press a button)? In the system use, how easily can you understand what the current situation of the system is based on your previous actions in the system? In the system use, how easily could you predict what actions you needed to perform in similar situations within the system In the system use, were menus and figures easy to understand in the whole of the system? In the system use, how easy was to undo some action when needed? In the system use, to what extent could you see the results of each of your actions? Were the results compatible with other external sources and expertise? Did you find the overall system useful? Did you find the overall system easy to use? Was the experience of using the system satisfying for you? Would you use this system in the future?

Table 3 Qualitative questions to define trends Question Determining perception of AI Q1 Q2 Q3 Q4 Q5 Q6

AI incorporation in the energy sector will result in (mark as many as appropriate) “Suggested level of explainability of a software and decision support system incorporating AI in the energy sector (mark only one) Suggestions of potential actions towards safer, more efficient, user-friendlier, and faster incorporation of AI in the energy sector” AI will radically transform the energy sector (mark only one) AI incorporation in the energy sector imposes threats and ethical concerns with regard to (mark as many as appropriate) Which one of the following do you believe is the most likely cause impeding the incorporation of AI in the energy sector (mark only one)?

and, subsequently, approximates each observation with that mean (centroid) [56]. Since the number of questions/observations are more than 2 and it would be difficult to define the clusters in the multidimensional space, the PCA reduction technique was used for simplification purposes. Principal component analysis (PCA) converts the full features into fewer features based on their contribution in explaining a component (variance explained) [57]. AI literacy score, clusters and perceived usefulness Using the proposed methodology, of . K -means clustering, the silhouette coefficients to define the number of clusters and PCA technique to visualise those clusters, the answers of the questions of Table 1 are analysed. The results can be seen in Fig. 6. According to the analysis,

An Explainable AI-Based Framework for Supporting Decisions in Energy …

15

Fig. 5 Demographic characteristics

Fig. 6 Defining the layers of explainability

there are two different clusters related to the corresponding participants’ AI literacy level, as revealed by the silhouette coefficient and seen in Fig. 6 in which the highest value corresponds to the number of clusters. The primary characteristics, based on the average score of each answer, can be seen in Table 4. To plot the different clusters,

16 Table 4 AI literacy level Question Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Total count Average

E. Sarmas et al.

Cluster 1 (mean)

Cluster 2 (mean)

4.25 2.5 2.875 4.000 2.750 3.875 3.000 3.750 3.750 3.125 3.000 3.625 3.875 4.000 8 3.45

3.5 2.0833 1.4166 2.8333 2.7500 3.0833 3.000 3.166 3.2500 3.4166 3.166 3.416 2.666 3.333 12 2.93

we have used 3 components which, as can be seen in the principal component analysis of Fig. 6, explain more that 58% of the total variables (14). The average score of the first cluster is 3.45, while that of the second cluster is 2.93. There is greater separation (above 0.70 decimal points) between the two clusters in Questions 1, 3, 9, 13 and 14. Question 1 is related to the users’ competency in the English language. Question 3 is related to the years of working experience in the sector. Question 9 corresponds to the required explainability of the DSS result and the need of additional information that is perceived as missing. Question 13 measured the frequency that the user validated results using external sources and finally Question 14 was related to the degree of user satisfaction with the results provided by the system. To summarize, two levels of explainability seem necessary, that correspond to the characteristics of the two different users. Thus, at least years of working experience, level of understanding of the English language, and potentially offering a localised environment based on user language requirements should be taken into consideration. Trialability is also an important factor as described in Questions 13 and 14, where related literature should be additionally provided and which users should be encouraged to study. In both clusters AI literacy is either average or below average (Q2) and for that reason usefulness of the different explainability methodologies (local—global xAI) should be provided and detailed. The experienced users felt that system messages were adequate, but the second less experienced group of users felt that some actions performed by the provided AI pipeline were not sufficiently explained.

An Explainable AI-Based Framework for Supporting Decisions in Energy …

17

Fig. 7 Defining the layers of perceived ease of use

AI tools usability score, clusters and perceived ease of use. In this section, we assess the perceived ease of use of the AI tools offered. Similarly to the previous section, a segmentation methodology was used to define the optimum number of clusters. Specifically, 5 clusters were identified, using the silhouette coefficients, where 5 is indicated as the highest value of a total 6 cluster scenarios tested. Using PCA, we reduced the plotted space from 11 components in total in the questions to three components, i.e. we reduce the feature space down to a 3-dimensional one. The components explain almost 60% of the variance and the corresponding . K -means plotted clusters can be seen in Fig. 7. The five different clusters assume different scores, which indicates different abilities and different perceptions as per the usability of AI tools offered by our system. Cluster 3, within which gather most of the users, have a slightly above average perception of 2.98 for the perceived ease of use of the system AI Clusters 2, 4 and 5 which hold the highest average scores and also include 35% of the total participants, score above 3.5 which should be considered a strong indicator of a general acceptance of the offered tools, with room, though, for improvements. The questions with the lowest scores are Question 3 in Cluster 1, Question 1 in Cluster 4 and Question 4 in Cluster 5. These questions are stated as follows: – In the system use, how easily can you predict what results are given by the functions of the system (e.g. what happens if you press a button)? – In the system use, how easily could you predict what actions you needed to perform in similar situations within the system

18

E. Sarmas et al.

Table 5 Perceived ease of use Question Cluster 1 (mean) Q1 Q2 Q3 Q4 Q5 Q7 Q8 Q9 Q10 Q11 Total count Average

2.75 2.00 1.75 2.00 2.25 3.50 3.75 3.50 3.50 4.25 4 2.93

Cluster 2 (mean)

Cluster 3 (mean)

Cluster 4 (mean)

Cluster 5 (mean)

3.75 3.75 2.75 3.00 3.00 4.00 4.50 3.50 3.75 3.50 4 3.59

2.888889 3.000000 3.000000 3.222222 2.444444 3.000000 3.333333 2.888889 3.111111 3.333333 9 2.98

2.0 5.0 4.0 5.0 3.0 3.0 5.0 4.0 3.0 5.0 1 3.63

2.5 3.0 3.0 1.0 3.5 4.0 5.0 4.0 5.0 5.0 2 3.5

– In the system use, were menus and figures easy to understand in the whole of the system? The above mentioned questions indicate a difficulty to understand and navigate through the exported results of the AI pipeline. Again, here as in the previous section, an explainability framework, that considers different levels of AI-related capabilities can indeed offer a greater understanding of the system underlying value and thus increase adoption via improving perception of usability of AI and a general usefulness of extracted recommendations and automated suggestions related to the different energy analytical services provided by the ML pipeline. It is also important to note that the general sentiment, as reflected by Question 8 on the overall system usefulness, scores the highest for all clusters, indicating that the needs of the energy sector are well served and identified (Table 5). Market trends, towards an AI-driven energy sector. The answers provided in the last portion of the survey can be seen in Fig. 8. The main concern of the participants, with regard to AI incorporation in the energy sector, is related to data privacy (80%) and security. The general consensus is that AI will further decentralise the sector (80%) and will facilitate faster decarbonisation (65%) and increase energy savings (65%). From the following questions, which are also highly suggestive to the previous results analysed and also validate the clustering hypothesis, it appears that the most likely cause that would impede AI incorporation is by a majority of 60% the lack of explainability (30%) and lack of required training of users to the available software (30%). The suggested level of explainability should be at a level of limited detail for non-technical users (45%), followed by a level that corresponds to deeper technical detail, suitable for more technical users.

An Explainable AI-Based Framework for Supporting Decisions in Energy …

19

Fig. 8 Qualitative-based questions

The vast majority (45%), expect AI to transform the energy sector in the 30’s, while 30% see this transformation taking place in the 20’s. Moreover, 10% of the participants expressed the opinion that AI will never transform the energy sector and 15% that the transformation will occur beyond the 30’s. Again the results signify two different kind of users in the participant pool.

5 xAI Incorporation Results According to the TAM, analysed in the previous sections, the perceived ease of use (AI usability) and perception of usefulness (AI literacy level) are communicating vessels, interconnected in the users’ consciousness. We have identified two different clusters in our data [58], related to their analytical experience and have quantified their AI literacy level. More clusters have been proposed based on the perceived ease of use, where some key points were identified and should be addressed via design modifications, either via global functionality or via personalised options. In Fig. 9, the two main clusters can be seen. For each cluster different tools are proposed with some tools overlapping, with regard to AI explainability. For Cluster 1, a technical analysis of the ML models should be provided, using local and global xAI, bias analysis and model architecture characteristics. The related literature should also be provided. Cluster 2, which consists of less experienced users, should have access to basic textual description of the process. A semantic grouping of inputs and output [59] combined with cost benefit analysis and contribution to the circular economy via AI applicability in the sector, should use provide incentives and build trust.

20

E. Sarmas et al.

Fig. 9 xAI clusters

To address the problem of perceived usefulness and perceived ease of use, following the proposed methodology outlined in Fig. 9, we offer a preview of screens used to explain the predictive technology used for one the developed applications.

5.1 Model Architecture and Benchmarks In Fig. 10, the model architecture of one of the system applications is analysed, namely analytics service AS04. i.e. estimation of energy savings from renovations in Fig. 2. An ensemble of key models are used to make predictions based on a probability of a three class outcome, using as features-inputs (1) the energy consumption before renovation, (2) the cost of renovation, (3) the planned CO2 reduction as detailed by technical specifications, (4) the building age in years and (5) the total heating area defined by the building envelope [14]. Next, the overall model outputs the probabilities for each class. Based on the investment potential, which is the relationship between energy consumption reduction and cost of investment, the model considers three classes. For Class A, the potential for investment is optimum, while for Class B it is medium and the project should be only partially financed. For Class C, the project should not be financed. The different models are stacked and use linear regression as the activation function to make predictions. Using an ensemble of models, an average prediction is provided, ensuring optimised results for a given problem using different algorithms, for better definition and a boosted solution overall. In the benchmark section, the accuracy of each model is calculated, alongside the final result of the stacked model. Finally, the correlation between energy consumption and investment cost for each class is shown. In this section, we look into the proposed methods outlined in Fig. 10, where model architecture for Cluster 1 and benchmarks for Cluster 2 are touched upon.

An Explainable AI-Based Framework for Supporting Decisions in Energy …

21

Fig. 10 Model architecture—benchmarks, change benchmark figures

5.2 Textual Descriptors—Semantic Grouping Textual descriptions are comments for users to understand the reasoning behind the development of an application and the personal and macro-economic benefits of using it. Some key points that have been analysed in this paper are used as descriptors of usefulness, such as the idea of circular economics and the semantic grouping of applications and stakeholders. The relation between inputs and outputs [59, 60] is also an important factor that can increase adoption, add trust to the system and clarify the system results. In Fig. 11, the semantic relation of inputs and outputs is presented [14]. In the middle section, the data structure of each input is introduced. Each element has more information attached to it, for the user to explore, thus offering an in depth look of how the system analyses data and outputs predictions.

5.3 Local xAI In Fig. 12, selected graphs show how the different features contribute to each decision for a specific observation from the analysed dataset. We have used shapley values to determine the different outcomes and the SHAP python libraries [61, 62] to create an explainability framework for the examined case [63]. According to the observation analysed, the true actual value refers to the Class B or Class 1. The predicted array, entitled as predicted in Fig. 12, shows the probabilities for each class, as they are extracted by the stacked model. For Class 0, the probabilities are equal to 0.16 (0.0155), while for Class 1 they are equal to 0.924 and for Class 2 they are equal

22

E. Sarmas et al.

Fig. 11 Textual descriptors and semantic grouping

Fig. 12 Local xAI

to 0.06 (0.0597). Our model has correctly predicted the class as 1. To see how the different features contributed to the prediction, we have used SHAP force and waterfall plots. In the force plot, the different interactions between the features and the decision per class is outlined, arriving to the predicted solution-probability of outcome, symbolised as f(x). On the other hand, the waterfall plots similarly show the different responses of each parameter. In the bottom of the graph, the expected

An Explainable AI-Based Framework for Supporting Decisions in Energy …

23

output is shown, while the top of the graph indicates how the different features contributed to the actual prediction extracted by the model. Since Class 1 was indeed the correct output, in both graphs all features in Class 1 point to the right result and the proposed investment should be only partially funded.

5.4 Global xAI In Fig. 13, selected graphs show how the different features contribute to the final prediction for each class. Again, shapley values have been utilised, alongside the SHAP python libraries for explaining the results visually. Feature importance is first analysed for each class for all inputs. Cost is the highest contributing factor in all cases, but the contribution of other factors varies for the different classes. The dependency plots are then utilised to summarise the dependence between the actual cost and the actual contribution of the cost to the decision eventually made by the model, labeled as shap value for cost on the y-axis of each scatter graph. The shap value for the examined variable shows the extent to which knowing this particular variable affects the prediction of the model. On the opposite side of the y-axis a feature that is closely related to the examined variable is tracked and the extent of the effect to the examined variable is differentiated in color.

Fig. 13 Global xAI

24

E. Sarmas et al.

6 Conclusions and Future Work To define the AI explainability demands in the EU energy sector, a survey was conducted. The survey questionnaire was split into sections to address three important concerns, namely (i) the AI literacy level of the stakeholders Fig. 1, (ii) the AI usability perception of stakeholders and (iii) via qualitative questions the definition of sector trends as related to AI Quantifying the survey results, the alignment of the AI literacy level and usability perception with the technology acceptance model was measured. The technology acceptance model assumes that the two main drivers of market adoption of new tech are the perceived usefulness and perceived ease of use. Cluster analysis was performed for both cases to outline the different users in our dataset and, thus, tailor the development demands and explainability depth as per the users’ criteria and requirements. Two clusters were recognised as per the AI literacy level (perceived usefulness by the participants) and five as per the usability perception (perceived ease of use by the participants). Finally and as per the qualitative part of the questionnaire, especially illustrated in Fig. 8, two main groups were recognized with regard to the suggested level of explainability defined by the users. Following the survey analysis, a proposed methodology was subsequently outlined and implemented on the stacked neural network for analytics service AS04 (i.e. for estimating energy savings from renovations). The aim of this methodology is to partition the explainability layers of the application on two levels of user ability. By communicating in detail the underlying technology of an AI-infused application, its adoption will increase and benefit the circular economy. Having a better understanding of the technical demands and standpoints of the energy stakeholders in the EU, a system where explainability is an important feature has been implemented and a validating review with a survey for clarification of user adoption is under way. Explainability, is not only the ethical approach of building AI, but it also constitutes a tool to build user trust and ensure faster adoption rates. The energy sector offers a great benchmark opportunity for the world to identify and measure the benefits on the environment and the climate of automation and enhanced intelligence offered by trained artificial systems that can provide a better framework for a more sustainable feature. This and other related work is currently underway and its results will be announced on a future occasion. Acknowledgements The work presented is based on research conducted within the framework of the project “Modular Big Data Applications for Holistic Energy Services in Buildings (MATRYCS)”, of the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 101000158 (https://matrycs.eu/), of the Horizon 2020 European Commission project BD4NRG under grant agreement no. 872613 (https://www.bd4nrg.eu/) and of the Horizon Europe European Commission project DigiBUILD under grant agreement no. 101069658 (https://digibuild-project.eu/). The authors wish to thank the Coopérnico team, whose contribution, helpful remarks and fruitful observations were invaluable for the development of this work. The content of the paper is the sole responsibility of its authors and does not necessary reflect the views of the EC.

An Explainable AI-Based Framework for Supporting Decisions in Energy …

25

References 1. P. Agreement, Paris agreement, in Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris). Retrived December. vol. 4, HeinOnline (2015), p. 2017 2. J. Tollefson, K.R. Weiss, Nations adopt historic global climate accord: agreement commits world to holding warming’well below’2 [degrees] c. Nature 582(7582), 315–317 (2015) 3. H. Doukas, A. Nikas, M. González-Eguino, I. Arto, A. Anger-Kraavi, From integrated to integrative: delivering on the Paris agreement. Sustainability 10(7), 2299 (2018) 4. S. Carley, D.M. Konisky, The justice and equity implications of the clean energy transition. Nat. Energy 5(8), 569–577 (2020) 5. E. Papadis, G. Tsatsaronis, Challenges in the decarbonization of the energy sector. Energy 205, 118025 (2020) 6. P. Friedlingstein, M. O’sullivan, M.W. Jones, R.M. Andrew, J. Hauck, A. Olsen, G.P. Peters, W. Peters, J. Pongratz, S. Sitch et al., Global carbon budget 2020. Earth Syst. Sci. Data 12(4), 3269–3340 (2020) 7. A. Hope, T. Roberts, I. Walker, Consumer engagement in low-carbon home energy in the United Kingdom: implications for future energy system decentralization. Energy Res. Soc. Sci. 44, 362–370 (2018) 8. S. Baidya, V. Potdar, P.P. Ray, C. Nandi, Reviewing the opportunities, challenges, and future directions for the digitalization of energy. Energy Res. Soc. Sci. 81, 102243 (2021) 9. R. Zafar, A. Mahmood, S. Razzaq, W. Ali, U. Naeem, K. Shehzad, Prosumer based energy management and sharing in smart grid. Renew. Sustain. Energy Rev. 82, 1675–1684 (2018) 10. P. Weigel, M. Fischedick, Review and categorization of digital applications in the energy sector. Appl. Sci. 9(24), 5350 (2019) 11. V. Marinakis, H. Doukas, J. Tsapelas, S. Mouzakitis, Á. Sicilia, L. Madrazo, S. Sgouridis, From big data to smart energy services: An application for intelligent energy management. Futur. Gener. Comput. Syst. 110, 572–586 (2020) 12. E. Sarmas, N. Dimitropoulos, S. Strompolas, Z. Mylona, V. Marinakis, A. Giannadakis, A. Romaios, H. Doukas, A web-based building automation and control service, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA). (IEEE, 2022), pp. 1–6 13. A. Esmat, M. de Vos, Y. Ghiassi-Farrokhfal, P. Palensky, D. Epema, A novel decentralized platform for peer-to-peer energy trading market with blockchain technology. Appl. Energy 282, 116123 (2021) 14. E. Sarmas, E. Spiliotis, V. Marinakis, T. Koutselis, H. Doukas, A meta-learning classification model for supporting decisions on energy efficiency investments. Energy Build. 258, 111836 (2022) 15. E. Sarmas, E. Spiliotis, V. Marinakis, G. Tzanes, J.K. Kaldellis, H. Doukas, Ml-based energy management of water pumping systems for the application of peak shaving in small-scale islands. Sustain. Urban Areas 82, 103873 (2022) 16. A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai. Inf. Fusion 58, 82–115 (2020) 17. C. Meske, E. Bunde, J. Schneider, M. Gersch, Explainable artificial intelligence: Objectives, stakeholders, and future research opportunities. Inf. Syst. Manag. 39(1), 53–63 (2022) 18. E. Council, Fit for 55: The eu’s plan for a green transition (2020) 19. IEA: Renewables 2021: Analysis and forecasts to 2026 (2021) 20. P. Skaloumpakas, E. Spiliotis, E. Sarmas, A. Lekidis, G. Stravodimos, D. Sarigiannis, I. Makarouni, V. Marinakis, J. Psarras, A multi-criteria approach for optimizing the placement of electric vehicle charging stations in highways. Energies 15(24), 9445 (2022) 21. M.Q. Raza, A. Khosravi, A review on artificial intelligence based load demand forecasting techniques for smart grid and buildings. Renew. Sustain. Energy Rev. 50, 1352–1372 (2015)

26

E. Sarmas et al.

22. P. Bradley, M. Leach, J. Torriti, A review of the costs and benefits of demand response for electricity in the UK. Energy Policy 52, 312–327 (2013) 23. C.W. Gellings, W.M. Smith, Integrating demand-side management into utility planning. Proc. IEEE 77(6), 908–918 (1989) 24. R. Xiong, J. Cao, Q. Yu, Reinforcement learning-based real-time power management for hybrid energy storage system in the plug-in hybrid electric vehicle. Appl. Energy 211, 538–548 (2018) 25. J. Duan, Z. Yi, D. Shi, C. Lin, X. Lu, Z. Wang, Reinforcement-learning-based optimal control of hybrid energy storage systems in hybrid ac-dc microgrids. IEEE Trans. Ind. Inf. 15(9), 5355–5364 (2019) 26. G. Zsembinszki, C. Fernández, D. Vérez, L.F. Cabeza, Deep learning optimal control for a complex hybrid energy storage system. Buildings 11(5), 194 (2021) 27. L. Fuhrimann, V. Moosavi, P.O. Ohlbrock P. D’acunto, Data-driven design: Exploring new structural forms using machine learning and graphic statics, in Proceedings of IASS Annual Symposia. vol. 2 in 1. International Association for Shell and Spatial Structures (IASS) (2018), pp. 1–8 28. D. Nagy, D. Lau, J. Locke, J. Stoddart, L. Villaggi, R. Wang, D. Zhao, D. Benjamin, Project discover: an application of generative design for architectural space planning, in Proceedings of the Symposium on Simulation for Architecture and Urban Design (2017), pp. 1–8 29. P. Geyer, S. Singaravel, Component-based machine learning for performance prediction in building design. Appl. Energy 228, 1439–1453 (2018) 30. M. Huang, J. Nini´c, Q. Zhang, Bim, machine learning and computer vision techniques in underground construction: current status and future perspectives. Tunn. Undergr. Space Technol. 108, 103677 (2021) 31. I.K. Brilakis, L. Soibelman, Shape-based retrieval of construction site photographs. J. Comput. Civ. Eng. 22(1), 14–20 (2008) 32. Z. Zhu, I. Brilakis, Parameter optimization for automated concrete detection in image data. Autom. Constr. 19(7), 944–953 (2010) 33. C. Fan, Y. Sun, K. Shan, F. Xiao, J. Wang, Discovering gradual patterns in building operations for improving building energy efficiency. Appl. Energy 224, 116–123 (2018) 34. S. Lu, W. Wang, C. Lin, E.C. Hameen, Data-driven simulation of a thermal comfort-based temperature set-point control with ashrae rp884. Build. Environ. 156, 137–146 (2019) 35. K. Yan, L. Ma, Y. Dai, W. Shen, Z. Ji, D. Xie, Cost-sensitive and sequential feature selection for chiller fault detection and diagnosis. Int. J. Refrig 86, 401–409 (2018) 36. J. Granderson, S. Touzani, S. Fernandes, C. Taylor, Application of automated measurement and verification to utility energy efficiency program data. Energy Build. 142, 191–199 (2017) 37. N. Dimitropoulos, E. Sarmas, M. Lampkowski, V. Marinakis, A quantitative methodology to support local governments in climate change adaptation and mitigation actions, in International Symposium on Distributed Computing and Artificial Intelligence (Springer, 2023), pp. 99–108 38. E. Sarmas, N. Dimitropoulos, V. Marinakis, Z. Mylona, H. Doukas, Transfer learning strategies for solar power forecasting under data scarcity. Sci. Rep. 12(1), 14643 (2022) 39. E. Sarmas, E. Spiliotis, E. Stamatopoulos, V. Marinakis, H. Doukas, Short-term photovoltaic power forecasting using meta-learning and numerical weather prediction independent long short-term memory models. Renew. Energy 216, 118997 (2023) 40. E. Sarmas, N. Dimitropoulos, V. Marinakis, A. Zucika, H. Doukas, Monitoring the impact of energy conservation measures with artificial neural networks, in In ECEEE Summer Study (2022) 41. E. Sarmas, S. Strompolas, V. Marinakis, F. Santori, M.A. Bucarelli, H. Doukas, An incremental learning framework for photovoltaic production and load forecasting in energy microgrids. Electronics 11(23), 3962 (2022) 42. E. Sarmas, E. Spiliotis, N. Dimitropoulos, V. Marinakis, H. Doukas, Estimating the energy savings of energy efficiency actions with ensemble machine learning models. Appl. Sci. 13(4), 2749 (2023) 43. C. Tsolkas, E. Spiliotis, E. Sarmas, V. Marinakis, H. Doukas, Dynamic energy management with thermal comfort forecasting. Build. Environ. 237, 110341 (2023)

An Explainable AI-Based Framework for Supporting Decisions in Energy …

27

44. P. Skaloumpakas, E. Sarmas, Z. Mylona, A. Cavadenti, F. Santori, V. Marinakis, Predicting thermal comfort in buildings with machine learning and occupant feedback, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv. (IEEE, 2023), pp. 34–39 45. T. Miller, Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38 (2019) 46. C.H. Tsai, J.M. Carroll, : Logic and pragmatics in ai explanation, in xxAI-Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers. (Springer, 2022), pp. 387–396 47. T. Ngo, J. Kunkel, J. Ziegler, Exploring mental models for transparent and controllable recommender systems: a qualitative study, in Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (2020), pp. 183–191 48. P. Hacker, J.H. Passoth, Varieties of ai explanations under the law. from the gdpr to the aia, and beyond, in xxAI-Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers. (Springer, 2022), pp. 343–373 49. A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins et al., Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai. Inf. fusion 58, 82–115 (2020) 50. F.D. Davis, Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly (1989), pp. 319–340 51. I. Ajzen, The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 50(2), 179–211 (1991) 52. R. Alroobaea, P.J. Mayhew, How many participants are really enough for usability studies? in 2014 Science and Information Conference. (IEEE, 2014), pp. 48–56 53. J. Sauro, J.R. Lewis, Quantifying the User Experience: Practical Statistics for User Research. (Morgan Kaufmann, 2016) 54. S. Aranganayagi, K. Thangavel, Clustering categorical data using silhouette coefficient as a relocating measure, in International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), vol. 2 (2007), pp. 13–17 55. H.B. Zhou, J.T. Gao, Automatic method for determining cluster number based on silhouette coefficient, in Advanced Materials Research, vol. 951. (Trans Tech Publ, 2014), pp. 227–230 56. A. Likas, N. Vlassis, J.J. Verbeek, The global k-means clustering algorithm. Pattern Recognit. 36(2), 451–461 (2003) 57. J. Shlens, A tutorial on principal component analysis (2014). arXiv:1404.1100 58. E. Sarmas, M. Kleideri, A. Zuˇcika, V. Marinakis, H. Doukas, Improving energy performance of buildings: dataset of implemented energy efficiency renovation projects in latvia. Data Brief 48, 109225 (2023) 59. D.P. Panagoulias, M. Virvou, G.A. Tsihrintzis, Regulation and validation challenges in artificial intelligence-empowered healthcare applications - the case of blood-retrieved biomarkers. Knowledge-Based Software Engineering: 2022, in Proceedings of the 14th International Joint Conference on Knowledge-Based Software Engineering (JCKBSE 2022, Larnaca, Cyprus), Maria Virvou, Takuya Saruwatari, Lakhmi C. Jain 133 (2023) 60. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, Nutritional biomarkers and machine learning for personalized nutrition applications and health optimization, in 2021 12th International Conference on Information, Intelligence, Systems & Applications (IISA). (IEEE, 2021), pp. 1–6 61. E. Sarmas, P. Xidonas, H. Doukas et al., Multicriteria Portfolio Construction with Python. (Springer, 2020) 62. P. Xidonas, H. Doukas, E. Sarmas, A python-based multicriteria portfolio selection dss. RAIROOper. Res. 55, S3009–S3034 (2021) 63. S.M. Lundberg, S.I. Lee, A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017)

The Big Data Value Chain for the Provision of AI-Enabled Energy Analytics Services Konstantinos Touloumis , Evangelos Karakolis , Panagiotis Kapsalis , Sotiris Pelekis , and Dimitris Askounis

Abstract In order to support decision-making problems on the energy sector, like energy forecasting and demand prediction, analytics services are developed that assist users in extracting useful inferences on energy related data. Such analytics services use AI techniques to extract useful knowledge on collected data from energy infrastructure like smart meters and sensors. The big data value chain describes the steps of big data life cycle from collecting, pre-processing, storing and querying energy consumption data for high-level user-driven services. With the exponential growth of networking capabilities and the Internet of Things (IoT), data from the energy sector is arriving with a high throughput taking the problem of calculating big data analytics to a new level. This research will review existing approaches for big data energy analytics services and will further propose a framework for facilitating AIenabled energy analytics taking into consideration all the requirements of analytics services through the entire big data value chain from data acquisition and batch and stream data ingestion, to creating proper querying mechanisms. Those query mechanisms will in turn enable the execution of queries on huge volumes of energy consumption data with low latency, and establishing high-level data visualizations. The proposed framework will also address privacy and security concerns regarding the big data value chain and allow easy applicability and adjustment on various use cases on energy analytics.

K. Touloumis · E. Karakolis (B) · P. Kapsalis · S. Pelekis · D. Askounis Decision Support Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou Str, 15773 Athens, Greece e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_2

29

30

K. Touloumis et al.

Keywords Big data value chain · Big data energy analytics · AI-enabled decision-making services · Big data architecture

Nomenclature Term Internet of things Machine learning Neural network Support vector machine Online analytical processing Deep learning Online transactional processing Atomicity, consistency, isolation and durability Hadoop distributed file system International dataspaces Artificial neural network Multiple linear regression Multiple layer perceptron Building management system Building information management Infrastructure as a service Platform as a service Software as a service Identity access management Generic enabler Context broker Machine to machine Linear model K-nearest neighbour

Abbreviation IoT ML NN SVM OLAP DL OLTP ACID HDFS IDS ANN MLR MLP BMS BIM IaaS PaaS SaaS IAM GE CB M2M LM KNN

1 Introduction During the Covid pandemic several measures have been proposed by governments globally to prevent congestion in public places like large scale lockdowns, and remote working and education was further embraced [1]. Those measures assisted in reducing the energy demand significantly. It is actually the case that the electricity generation within 16 European countries in April 2020 had dropped by 9%, whereas fossil energy generation decreased by 28%, while nuclear energy decreased by 14% [2]. Despite the decline in the electricity generation, the building sector is responsible

The Big Data Value Chain for the Provision of AI-Enabled Energy …

31

for one fourth of the energy consumed world-wide and 40% of the global carbon dioxide emissions [3]. The EU has recognized the need for reducing the energy consumption and tackling its environmental harm and issued the Paris agreement for achieving carbon neutrality by 2050 [4]. It is therefore made clear that efficient energy planning is of the outmost importance for reducing energy consumption and preserving the already overstretched energy sources. Energy analytics can play a vital role in efficient organization, planning and reduction of energy consumption. Energy analytics are applied on various fields like IoT [5] and machine learning (ML) [6] and regard the development of computing techniques for providing high-level userdriven services such as energy consumption forecasting. Such computing techniques may include AI techniques like Neural Networks (NN’s), support vector machines (SVM) and gradient boosting, that can be used for model training in order to predict the energy consumption under circumstances like particular weather conditions, such as temperature and humidity [7–9]. Since data collected from sensors on the energy sector is arriving in batches and is increasing rapidly in size, the need for big data analytics has arouse, taking the problem of energy analytics to an entirely new level. Big data is hard to conceptualize, the fact is that except the quality of data there are also other dimensions that need to be defined like the volume, velocity, variety veracity, value known as the 5 V’s [10]: Volume indicates the amount of the collected data and since smart meters produce enormous numbers of data, efficient storing of such amounts is becoming a substantial problem; velocity indicates the speed to which data is produced and transferred and the problem is that each meter has its own production and transfer rate; variety indicates there is a huge range of available data sources which may differ in structure, so there can be structured data sources where data is stored in specific and strict format, unstructured, where data is stored in a combination of types maintaining its original structure and semi-structured, where data is not unstructured but still contains some structure schemas; veracity indicates the quality of the collected data, in such vast amounts of data it is highly possible that empty values and errors will be included affecting the quality and results of analytics; value indicates that the collected data should be useful for user-driven services and decision-making processes. With the introduction of big data on AI analytics services of the energy sector the need for dividing tasks into multiple subtasks executed on a large number of computing nodes through parallel programming has become compelling to assist in effective storage, processing and analysis of such huge volumes of data. Artificial intelligence (AI) aims to develop intelligent machines able of performing complex tasks not easily performed by human and is used in a plethora of domains including among others text translation, speech recognition, healthcare, search and recommendation engines [11]. AI allows the delegation of such complex problems and contributes to the velocity of data by establishing rapid and accurate decisions. AI can further allow users of big data to perform and automate complex descriptive analytical tasks that would be extremely intensive and time consuming to perform for humans [12]. Employing AI methods on big data can have a huge impact on the role data plays on various aspects of daily and economic life. The market and public

32

K. Touloumis et al.

bodies adopt new methodologies to exploit and stay competitive by developing AIenabled tools and applications. Up to now, AI algorithms have been implemented in single machines but nowadays with the introduction of big data, the need to increase the scalability of AI algorithms, making them run on multiple nodes in a parallelized manner has rapidly emerged [13]. Thus, companies and enterprises are looking for new and innovative ways of implementing highly distributed AI methodologies that will be able to receive, process and analyse big data assisting in solving complex computation problems on energy analytics [14]. Large-scale smart grids consist of thousands of microgrids, with smart meters being an integral part of them. Smart meters produce densely populated data thus transforming the problem of “energy analytics” to “energy big data analytics” [15]. Big data methods for AI can prove very beneficial for energy analytics purposes [16, 17]. By exploiting big data and AI, energy analytics can assist in improving smart meter and smart grid management from a high-level point of view, the accessibility of users in consumption data by performing statistical analysis and online analytical services (OLAP) [18]. Energy analytics can also assist on energy forecasting, analysing the consumer behaviour patterns on energy consumption and estimating the demand of energy [19]. Furthermore, analytics can analyse historical data on consumption and estimate energy production [20]. Load prediction to prevent excessive energy consumption under parameters and unnecessary transmission of excessive data can further be performed by exploiting Deep Learning (DL) techniques [6, 21]. Among the included services of big data analytics are services for the prediction of power system failures and sensor downtime increasing the robustness and efficiency of the power grid systems and meters. Energy analytics can further assist in tracking the energy production from different sources and taking decisions on prioritizing consumption demands [22]. Of course, in order to provide efficient energy analytics services, complex big data technologies and infrastructure should be utilised. In particular, continuous flows of real-time data from multiple data sources should be ingested, processed, harmonized and stored to a big data infrastructure, to facilitate efficient, near real-time analytics services. In this publication, we provide a thorough overview of all the technologies and techniques that are required to facilitate efficient, scalable and effective energy analytics AI services. Furthermore, we propose a high-level architecture, indicating and analysing all the proposed technologies selected for each functionality along with some alternative technologies that could be used instead.

The Big Data Value Chain for the Provision of AI-Enabled Energy …

33

2 Background 2.1 The Big Data Value Chain Value chains have been employed as decision support tools by organizations in order to model the chain of activities they have to perform to deliver a product or service to the market [23, 24]. The purpose of a value chain to is to categorize the activities of an organization for them to be better performed and optimized. The value chain consists of a chain of systems where each one receives an input, performs an operation and produces an output. The term value chain is used to describe the steps needed in order to produce valuable information from the collected data. With the exponential growth of the collected data from sensors and smart meters on the energy sector the data value chain is transformed to big data value chain which describes the processes needed to efficiently store, process and retrieve valuable statistics from big data of the energy sector. The big data value chain is depicted in Fig. 1. Data acquisition describes the process of collecting, processing and storing data in a data warehouse. Data acquisition has become one of the most challenging tasks on collected data on the energy sector due to their huge volume and requires low latency, high throughput and execution of queries on distributed environments. To achieve its purpose of retrieving data from distributed sources and storing them effectively in data warehouses many protocols are used for continuous flow of data like AMQP and MQTT [25]. Such protocols can be used across different industries, their messages can be easily encrypted, communication can be achieved through different protocols (TCP, UDP), and further ease manageability and applicability on various systems and hardware devices. Data pre-processing describes the process of cleaning and homogenising data from various data sources. Collected data from an increasing number of sources in huge volumes is not going to be clean and will contain noise and anomalies that will affect the quality of future steps in the value chain and especially data analysis. Data pre-processing involves a set of steps [26, 27]. Specifically, filtration describes the process of treating corrupted data; extraction describes the process of transforming incompatible data to proper format; transformation describes the process of adapting data scaling attributes to improve their quality; validation describes the process of managing semantic structures and removing invalid data; cleaning refers to the process of processing inaccurate data; fusion refers to the process of merging data

Fig. 1 The big data value chain

34

K. Touloumis et al.

from various sources; reduction describes the process of reducing the dimensions of the extracted data to increase their relevancy and reduce their complexity. Data analysis describes the process of transforming raw data to useful information for decision-making and user-driven services. Data analysis [24] involves data exploration, feature extraction, transformation, modelling of data, and uncovering hidden relations between data for extracting useful knowledge for decision-making and user-driven services. Big data analysis also faces challenging problems regarding the huge volumes of collected data on the energy sector. One of them is streaming data, big data energy analytics must be able of calculating statistics on streams of data arriving on a high frequency. Another challenge is semantic analysis, so big data analytics must be able of calculating useful semantics on entities. Data storage describes the process of storing the acquired data to data warehouses. Data that is stored in warehouses regards mainly 2 categories: Operational data that is used in everyday transactions and handled by Online Transaction Processing systems (OLTP), and analytical data that is used for predicting performance, and forecasting, among others. Such data is stored in OLAP. Relational database systems have been developed to store data in tables and capture relations between different attributes of different tables. Relational database systems comply to Atomicity, Consistency, Isolation and Durability (ACID) standard [26], which signifies that if one change fails, the transaction will fail and the database will remain in the same state it was before the transaction. Also, NoSQL systems have been developed mostly for OLAP. These systems do not store collected data in tables and try to represent the collected data in different ways like document representation (e.g., MongoDB [28]) and graphs (Neo4J [27]). Big data storage technologies address the same challenges with those of classical storage (volume, velocity, variety) but in a much larger scale. Thus, several technologies like Hadoop Distributed File System (HDFS [29]) have been developed. These technologies are more efficient for storing huge volumes of data from the energy sector. Analytics services describe the high-level user-driven services provided to users for the decision-making process involved in high-level services such as forecasting, load prediction and customer behaviour prediction. Such services can be applied on the energy sector and can include predictive analytics, like predicting the energy demand under certain circumstances, or predicting the load of the transferred data. Such services can assist users in performing exploratory analysis for inferring useful statistics among data.

2.2 Energy Analytics Services On top of the described phases of the big data value chain, a wide variety of energy analytics services can be developed. Furthermore, several generic services that can facilitate energy analytics can be developed as well. Such generic services include intelligent querying services, visual analytics services, as well as reasoning ones,

The Big Data Value Chain for the Provision of AI-Enabled Energy …

35

among others. Additionally, a very important aspect in big data platforms in the energy domain is data sharing among different organisations. In particular, there is a wide variety of efficient querying technologies in big data such as Presto [30] and Trino [31], as well as several query engine applications that take advantage of the aforementioned technologies. For instance, in [32] the authors present an intelligent query engine application for querying smart building data based on Presto query engine. Moreover, [33] presents another query engine application based on Trino, that enables efficient querying on different and heterogeneous data sources, along with a user-friendly interface for building and executing queries, with fine-grained access control. Regarding visual analytics services, there are well-established open-source technologies such as Apache Superset and Grafana [34] among others, as well as in-house ones such as the one presented in [35]. Concerning reasoning services, they enable logical inferences from the available data using ontologies or graph data. Examples of such services are presented in [36–38]. Lastly, concerning data sharing, European data spaces [39] provide trusted and secure data sharing among organisations through several technologies such as the Dataspace Connector [40] and the FIWARE TRUE Connector [41]. All services presented can facilitate energy analytics services development, monitoring and analysis. With regards to energy analytics services, a number of analytics approaches have been developed for calculating analytics on the energy sector. Analytics have mainly focused on buildings and smart grids, since the building sector consumes vast amounts of energy as mentioned earlier, while seamless and efficient grid operation is of utmost importance for energy prosumers. The established services aim at improving energy performance, reducing energy consumption, forecasting consumed energy, policy making and de-risking investments among others. Some examples of AI-enabled energy analytics services include electrical load forecasting, flexibility forecasting and demand response, and anomaly detection in smart buildings, among others. For instance, Xuemei et al. [42] used the SVM algorithm for HVAC system operation improvement by reducing the daily average, lowest, and highest temperature. Zhang et al. [43] used the SVM algorithms for conserving energy according to the weather attributes, zone temperature, heat gained through lights and windows. Mena et al. [44] used Artificial Neural Networks (ANN) for managing the energy demand based on outdoor temperature, humidity, solar radiation wind speed and direction. Finally, Tsolkas et al. proposed a DL model in order to optimally control the HVAC system of a public building in Sant Cugat, Spain, with the objective to achieve optimal thermal comfort [45]. Regarding energy consumption forecasting, [46] used Multiple Linear Regression (MLR) to forecast photovoltaic consumption based on the outdoor temperature, humidity, pressure, wind direction and speed. Additionally, [47] also used Multiple Layer Perceptrons (MLP) for forecasting energy consumption based on indoor and outdoor temperature, humidity, pressure, wind speed, direction envelope characteristics and the occupancy profile. In addition, study [6] experiments with different DL architectures for short term load forecasting in power grids. Alternative architectures and DL models are analysed and compared. Moreover, in the same study, the impact of Covid-19 pandemic is analysed. Another study ([48]) compares different

36

K. Touloumis et al.

DL models for short-term load forecasting and provides an analysis of the key accuracy drivers for the developed models. Another aspect in this domain involves the development of ML and DL models for estimating the energy savings of retrofitting actions in buildings, by leveraging building characteristics and weather conditions [49, 50]. Concerning flexibility forecasting and demand response, in [51] a real-time demand response framework for smart communities is proposed. The methodology proposed is based on clustering on different commercial and residential prosumers. Specifically, different actions and consumption patterns are proposed to prosumers of each cluster in order to shift their consumption behaviour to mitigate reverse power flow and decrease the demand during peak hours. Similar approaches are proposed also in other studies (e.g., [52]). With regards to anomaly detection, there is an abundance of studies, most of them focusing either on detecting anomalies in people’s daily habits in a smart building setup or on detecting anomalies in the operation of a smart grid and its components. In the first category, there are publications focusing on the response times of the developed services and the underlying technological architectures that can unlock such low latency services (e.g., [53]), as well as publications that focus mostly on the accuracy of the developed models (e.g. [54]). In the second category most publications focus on anomaly detection in data obtained from smart meters of a smart grid using ML and DL techniques [55, 56]. Anomaly detection services are of particular importance for this publication, as they pose requirements for near real-time response. To this end, streaming data technologies should be employed to serve these requirements.

2.3 Well-Known Big Data Architectures for Energy Analytics Services There is a wide variety of pan-European projects that are focusing on AI and Big Data applications and services for the energy sector. Moreover, there are several big data architectures proposed for the development of energy analytics services. Specifically, the I-NERGY project [57] aims at providing next-generation AI services for the energy sector and reinforcing the service layer of AI on Demand platform. I-NERGY conceptual architecture [58] presents all the software components required for building AI based energy analytics services, alongside their underlying technologies. These components serve several different functionalities of the big data value chain from data ingestion, storage and harmonisation to model training deployment and serving, along with the end-user energy analytics services. Last, I-NERGY conceptual architecture, also demonstrates the connection with AI on-demand platform. The MATRYCS project aims at delivering big data and AI applications and services for smart buildings. MATRYCS big data architecture [59] focuses on big

The Big Data Value Chain for the Provision of AI-Enabled Energy …

37

data management in the building domain and facilitates efficient data ingestion, processing, querying, as well as data sharing and interoperability, and serves a variety of analytics services for smart buildings. MATRYCS architecture also includes the main technologies that have been selected for each one of the functionalities of MATRYCS platform. Last but not least, BRIDGE energy data exchange reference architecture [60] provides interoperable and business agnostic data exchange along with other components required for energy data platforms that cover the entire data value chain. BD4NRG reference architecture [61] was built on top of BRIDGE reference architecture and provides similar functionalities. It consists of four horizontal layers (or components) that cover the entire data value chain and one vertical that is dedicated to data sharing capabilities.

3 AI-Enabled Energy Analytics Services Requirements To support decision-making on problems of the energy domain, analytics should comply to a set of requirements to reassure the quality of the provided services. In particular, integration [62] is a basic requirement; energy analytics services must be able of homogenizing data coming from different sources of the energy sector. Collected data from different meters and sensors will differ in format and structure. Hence, they may be structured, such data include elements that are organized into a formatted repository; semi-structured, such data does not reside in a relational database but has some organizational properties; and unstructured meaning that such data is not organized in a predefined data model. Proper pre-processing and transformation according to the acquainted data specificities is essential to ensure the quality of data in the following dimensions [63]; accuracy describes the degree to which data manages to measure real-world observations; reliability indicates the trust of the collected measurements; consistency describes the degree of change between a source and a reference dataset; completeness indicates the percentage of missing values in one data source; relevance describes the closeness between data consumer need and data provider output; accessibility describes the degree of ease at which the stored data can be easily retrieved or manipulated. From a technological perspective, collected and transformed data should be stored to the proper database according to its properties, including SQL and NoSQL databases. The latter include general purpose document oriented big data databases focusing on performance (e.g. MongoDB), timeseries databases (e.g. InfluxDB) for efficient timeseries operations, and conventional SQL ones for typical OLTP, as well as data lakes. Ontology models [64] are further required to effectively represent data coming from heterogenous data sources like sensors and meters of Building Management Systems (BMSs) and Building Information Management (BIM) systems by applying the proper ontology transformations. On top of the enriched data lakes, query mechanisms with the form of REST APIs should exist to allow simple and more complex aggregation queries on top of collected data with low latency on huge

38

K. Touloumis et al.

volumes of data for facilitating federated querying [65]. AI-enabled energy analytics must allow users to perform statistical analysis [66], from calculating simple statistics like the average value of a variable to complex join aggregational queries on different data sources. Exploratory analysis [67] must also be supported by allowing users to create their customized dashboards and visualizations on data allowing them to infer useful insights. Another requirement for energy analytics services is to support decision-making [68] on domain and non-domain specific problems, like providing high-level user-driven services for policy making on the energy sector. The scalability [69] must also be reassured indicating that energy analytics services must be able to scale up to the huge volumes of collected data, especially since sensors on the energy sector produce densely populated data with a high throughput. Near real-time monitoring [70] should also be facilitated, meaning that energy analytics services must be able of performing simple and complex statistical services in almost real-time in order to provide users with near real-time statistics. Near real-time monitoring is necessary in energy analytics for a number of reasons. Firstly, to establish connection with streaming data sources; these are data sources that create high-rate messages and events. Secondly, to manage batch pipelines and ETLs that are triggered periodically using predefined schedule, for instance once a day, to precompute data warehouse schemas and data volumes. To index batch pipelines results and provide indices to resulting data volumes and data marts so they can be queried in low-latency and on an ad-hoc analysis and last but not least, to manage and provide views and statistics on real-time data and on small-scale time window aggregations. Resource efficiency must also be reassured, so the services must be able of utilizing resources in an efficient way securing the analytics services from the memory overflow problem that can prove to be destructive both for the high-level user services and low-level system components and host devices.

4 Analytics Services Architecture In this section, we propose a cloud-based framework which provides a holistic and microservice oriented architecture for the entire big data life cycle starting from the collection of data from smart meters and sensors of the energy sector, up to the creation of high-level user-driven services for supporting decision-making problems on energy analytics. By establishing the framework on the cloud, scalability and flexibility is reassured. The architecture is presented in Fig. 2 while the technologies that can be used for implementing each layer is depicted in Fig. 3. Each layer is strongly interoperating with its previous and next layer. In the following subsections each layer is analysed thoroughly. The offered services can be divided in three categories: • Infrastructure as a service (IaaS): Includes the cloud infrastructure, both physical and virtual host machines used in the development of the architecture. It regards all layers except the Field Layer where smart meters and sensors are included.

The Big Data Value Chain for the Provision of AI-Enabled Energy …

Fig. 2 Conceptual architecture

39

40

K. Touloumis et al.

Fig. 3 Conceptual architecture and proposed technologies

– Vulnerability detection in IaaS cloud environments: aims at securing the infrastructure against vulnerabilities and potential cyber-threats by offering real-time vulnerability and threat detection and mitigation. – Identity Access Management (IAM): aims at securing the resource access on the framework to prevent unprivileged access.

The Big Data Value Chain for the Provision of AI-Enabled Energy …

41

• Platform as a service (PaaS): includes the workbench that participates in the steps of big data value chain and the development of energy analytics. It regards the Data Services and Data Analytics Environment Layers. – Data Services layer: performs the necessary steps of the entire big data value chain from pre-processing data to distributed big data querying. – Data Analytics layer: regards the training and storage of ML models that will be used by energy analytics services. • Software as a Service (SaaS): includes the analytics services provided to end users for assisting decision-making problems on energy management and financing. It regards the analytics Services Layer.

4.1 Field Layer The field layer consists of all the physical devices, meters and sensors, for measuring the energy consumption data needed for further layers of the architecture and highlevel analytics services. Such data includes weather data, which measures different weather characteristics including among others the temperature, humidity, wind speed and direction. Also, useful data on energy analytics can be gathered from BMS which may include data on energy consumption (AC, 3-PHASE etc.), temperature, humidity, coming from sensors on different building installations like lighting, HVAC, heating etc. Data on the energy sector may further include photovoltaic energy consumption and production data based on weather conditions like the temperature, wind speed and solar radiation. Open-source data on energy consumption may be included. For instance, ENTSO-E [71] provides access to electricity generation, transportation and consumption data for the pan-European market.

4.2 Interoperability The interoperability layer aims at securing the data flow during the entire lifecycle of big data value chain. To achieve its goals the layer can use the FIWARE Generic Enablers (GE) [72]. GE cover aspects of security, data, device management and cloud hosting. FIWARE has implanted two open-source generic enablers, the Backend Devices Management GE to connect IoT devices to the platform and the context broker (CB) to model IoT data as entities and make them available to other services. GE consists of many software agents. Each one of them is used to connect devices that use a particular communication protocol to a platform that supports the MQTT protocol. The Orion Context Broker [73] is an interface where users can get high-level context information regarding consumers of data, and machine to machine (M2M) applications. It also comes with an API that allows registration of producers like sensor updates on the context, such as updated sensor values and query available

42

K. Touloumis et al.

context information. Data sovereignty and trust can further be reassured by using dataspace connectors [74]. The IDS connector [74] can run internally in an organization to secure the flow of data from one node to another or externally for securing data exchange across different organizations.

4.3 Data Capturing The Data Capturing layer is responsible for receiving data from field devices. Moreover, this layer is responsible for processing and forwarding these data to upper layers of the architecture. Streaming data has become a necessity for energy analytics in order to depict real-time accurate statistics to users [75]. This can be achieved through live-streaming mechanisms like Kafka [76], which is a highly distributed, mechanism for transmitting huge volumes of data to multiple consumers in near real- time, with high throughput and low latency. Regarding the IoT communication protocols, the most commonly used are MQTT and AMQP [77]. Both protocols were designed for publishing messages for lightweight M2M communication, but AMQP was designed to focus on generic enterprise communication and not solely IoT [77]. An MQTT client sends a message to an MQTT broker from which all other clients can receive that message. Messages can be published in topics to assist in the organization of communication. They both use TCP for transport and TLS/SSL for security. Once data is received it is pre-processed and transformed in the proper format for further ontology modelling and analytics of the upper layers. Typical pre-processing includes cleaning, transformation, feature extraction, enrichment and validation as analysed in the background section.

4.4 Data Integration The Data Integration layer is responsible for receiving the processed data of the previous layer and homogenizing with proper ontology modelling. Since data coming from different IoT devices of the energy sector and usually differs in format, it is necessary to apply ontology modelling for homogenizing it. Many ontology models have been created for such a purpose including Brick and SAREF [78]. SAREF tries to capture devices and systems and their topological connection, but it cannot define a specific input and output [64]. Such devices may include sensors measuring energy consumption from HVAC, heating and lighting. Brick tries to represent descriptions of physical assets, equipment and devices, and further captures relationships between them and their topological connection. Data integration also regards metadata. Metadata regarding data source details can be stored in classical SQL, and NoSQL databases, whereas metadata regarding the interconnection of sensors can be stored in graph databases like Neo4j. Also there have been developed metadata repositories like DataHub [79] that allow efficient storage, querying and aggregation

The Big Data Value Chain for the Provision of AI-Enabled Energy …

43

of metadata through REST APIs. DataHub offers granularity to describe complex data sets in a semantically rich way [80].

4.5 Data Lake The Data Lake layer stores the acquainted data in the appropriate databases to effectively save it with its semantic structure and properties and make it further available for efficient querying. For data storage, a plethora of databases is available, including relational databases (PostgreSQL [81], SQLite [82]), document-oriented databases (MongoDB [28]), graph databases (Neo4j [27]). Relational databases, represent the data they store in tables based on the notion of keys which make a record unique and allow users to join different tables on their keys. Document oriented databases are based on the concept of collections and documents. For instance, MongoDB is the most well-known NoSQL database that provides high performance, availability and scalability. Graph databases, like Neo4j, try to represent the stored data in nodes and the relationships between them with output and input links. To query huge volumes of data from data lakes, several big data querying mechanisms have been developed like Presto [30] and Trino [31]. These are open-source distributed query engines, ANSI SQL compliant, that are capable of querying exabytes of data with low latency. They are capable of connecting to a wide range of SQL and NoSQL databases including PostgreSQL, MongoDB and Kafka for querying real-time streaming data.

4.6 Data Analytics Environment The data analytics environment provides all the necessary functionalities for supporting the entire ML model lifecycle including model training, storing, reproduction and serving. A number of available solutions can be used for training models like Jupyter Notebook [83]. Jupyter Notebook is an open-source programming tool that can be used for models training, development, visualization and analysis and is a popular way of computing, presenting and disseminating results on ML. The cellular structure of a Jupyter Notebook makes code easily executable, reproducible, and extensible. It can be used for implementing supervised algorithms, as well as unsupervised ones. Such algorithms include linear models, clustering techniques and NN using python modules like sklearn [84] and TensorFlow [85]. It also allows trained models to be easily served through some basic python code. As for storing trained models many ML registries have been proposed like MLflow [86]. Such registries allow the efficient storing of the trained models along with their input, output parameters and evaluation metrics. Mlflow also makes available reproduction code for assisting users in using the trained models to make predictions.

44

K. Touloumis et al.

4.7 Data Streaming In order for analytics services to provide accurate and near real-time statistics on data from the energy sector, they must fetch data using live streaming data mechanisms. Many mechanisms have been created for that purpose that allow fetching of streaming data in batches with a high throughput. Apache Spark [87] is a fast, general and large-scale data processing engine based on Map Reduce [88], featuring in memory computation. Apache Flink [89] is tailor made for distributed streaming and batch processing data. A comparative analysis of Spark and Flink shows that even if Spark has a much higher latency, it is expected to be able to handle much higher throughput by configuring its batch interval higher [90].

4.8 Analytics Services The analytics services layer contains high-level user-driven analytics services that assist decision-making on optimal energy management by exploiting trained models of the previous layer to develop predictive capabilities. Pre-trained models from model’s registry can be obtained, reproduced and extended through simple code provided by the model’s registry that assists in making the connection. Visualization services can be provided by exploiting well-known open-source visualization tools like Apache Superset, and Grafana. Such tools assist users in performing statistical analysis, OLAP, inferring useful statistics and uncovering hidden relations between data. High-level user-driven services may further assist decision-making in several activities such as energy financing. For instance, finding the optimal energy efficiency retrofitting actions for a building can be facilitated through services of this layer. Another example for services of this layer is energy consumption forecasting, which describes the prediction of the energy consumption under particular weather conditions. Furthermore, load forecasting services are also relevant to this layer. Such services assist the load prediction under particular circumstances. Thermal comfort and well-being services that predict the occupant’s well-being under uncertainty parameters are also provided by this layer. Optimal energy management is also supported by the analytics services layer by allowing users to manage efficiently energy sources to prevent excessive energy consumption.

4.9 Identity and Access Control Management and Vulnerability Assessment It is highly logical that such frameworks that provide high-level analytical services to end users have been targeted by adversaries. One such cause of adversarial attacks might be unauthorized access to resources. In this context, IAM plays a vital role in

The Big Data Value Chain for the Provision of AI-Enabled Energy …

45

securing the services against unauthorised access. Many IAM solutions have been proposed like the FIWARE Keyrock [91] and Keycloak [92]. Both are open-source identity management tools that register and query data by providing tokens using the OAuth2 protocol to authenticate users to trusted applications without knowing the host’s credentials. Another threat for big data systems is cyber-attacks. Specifically, in order to establish communication among its layers the framework’s resources use the TCP/ IP protocol. This means it is vulnerable to classical cyber-attacks exploiting TCP/IP. Such attacks include injection attacks, like SQL injections [93] where a user might be able to exploit a poorly structured app to inject an SQL command that could delete part of a database. Memory overflow [94] attacks might also be instigated where a user can tamper with the bandwidth of transmitting data to make the system crash. Regarding the ML Layer cyber-attacks can include poisoning attacks [95] where an adversary might be able to interfere with a trained model to modify its training set making it produce irrational output. It is therefore made clear that vulnerability assessment is of paramount importance for securing the framework against adversarial attacks. To achieve this, the proposed architecture will follow a multi-agent approach [96]. A SIEM agent [97] can be used to secure every physical and virtual machine of the infrastructure. The agents will send the collected data to one central data warehouse. The collected monitoring data will contain among others information on the detected vulnerabilities for each host. A security analyst can exploit such data for vulnerability assessment purposes [98] investigating vulnerabilities, their cause and mitigating them, thus preventing them from being exploited in the future, assisting in securing the framework against potential cyber-attacks. Lately, AI and ML-enabled methodologies have been developed to support forensics on multi-agent architectures [99], allowing almost real-time detection and mitigation of cyber-attacks and further shedding light in the context of cyber-security incidents by analysing their cause, impact and steps of execution.

5 Implications The proposed architecture satisfies all the requirements presented in Sect. 3. In particular, it satisfies the requirement for integration since it is able of handling data from heterogenous data sources. Data is pre-processed and stored in the proper database according to its specificities and a plethora of open-source NoSQL and SQL databases were proposed to ensure the quality of the acquainted data and effective storage of data along with its semantics. The framework also provides efficient big data querying mechanisms for executing complex queries on multiple and heterogenous data sources at a low latency as well as federated querying capabilities. Furthermore, it allows users of analytics to perform statistical analysis by allowing them to perform complex aggregation queries for OLAP through REST APIs in order to calculate simple and complex statistics on data to further infer hidden relations. It

46

K. Touloumis et al.

also facilitates data exploration since it allows users to create custom visualizations and dashboards by using well-known open-source solutions for visualizing data. Moreover, the proposed framework enables near real-time streaming of data by using open-source technologies for real-time data streaming like Kafka and operations to data streams (e.g. Apache Spark). In addition, it ensures the secure flow of data between its components by using the IDS connector and the FIWARE Orion context broker. Also, scalability is facilitated by using technologies that minimize resource utilization for both storage and big data querying. Finally, several examples of energy analytics services that can be benefitted by the proposed framework are presented in brief. It is worth mentioning that the proposed architecture proposes open-source solutions to support all steps of big data value chain from fetching the data to making the latter available for user-driven services. The entire technological solution can be deployed on the cloud which makes it easily applicable to almost every framework providing big data energy analytics. Moreover, cyber-security is thoroughly considered, to make the solution secure against adversarial attacks during all steps of the big data value chain, while forensics are supported to deeply investigate the cause of cyber-security incidents and the mechanisms adversaries exploited to inflict the physical infrastructure involved in the computation of energy analytics services.

6 Conclusions The purpose of this research is to review existing approaches on energy analytics services and the underlying big data techniques and technologies that are required to facilitate the latter. Specifically, the steps of big data value chain over the entire lifecycle of big data are analysed. As a next step, some already existing methodologies on analytics and the use of big data value chain were outlined, including AI algorithms that are used for supporting high-level user-driven services, such as energy and load forecasting. Furthermore, several well-known projects and their system architectures that regard energy analytics on big data are mentioned along with the approach they follow to address big data analytics challenges on the energy domain. In addition, the requirements for a framework that supports big data analytics on the energy sector are discussed alongside data quality and resource management issues. Finally, a framework in terms of a system architecture and its underlying technologies is proposed. The proposed framework enables AI-enabled energy analytics services by addressing the entire big data value chain over its entire lifecycle, beginning with fetching the data, processing, storing, querying and providing high-level user-driven services to support decision-making on domain and non-domain problems. Special attention is paid to securing the framework against unauthorized access and cybersecurity attacks. Open-source technologies are proposed for each layer to support each functionality thus allowing its implementation and application on various use cases of energy analytics services. The effectiveness and advantages of the proposed

The Big Data Value Chain for the Provision of AI-Enabled Energy …

47

framework are elaborated along with its applicability on various use cases on the energy sector.

References 1. D. Koh, COVID-19 lockdowns throughout the world. Occup. Med. (Chic Ill) 70(5), 322–322 (2020) 2. P. Jiang, Y. Van Fan, J.J. Klemeš, Impacts of covid-19 on energy demand and consumption: challenges, lessons and emerging opportunities. Appl. Energy 285, 116441 (2021) 3. K. Ahmed Ali, M.I. Ahmad, Y. Yusup, Issues, impacts, and mitigations of carbon dioxide emissions in the building sector. Sustainability 12(18), 7427 (2020) 4. C.A. Horowitz, Paris agreement. Int. Leg. Mater. 55(4), 740–755 (2016) 5. E. Karakolis, K. Alexakis, P. Kapsalis, S. Mouzakitis, J. Psarras: An end-to-end approach for scalable real time anomaly detection in smart buildings, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–7 6. S. Pelekis et al, In search of deep learning architectures for load forecasting: a comparative analysis and the impact of the covid-19 pandemic on model performance, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–8 7. P. Skaloumpakas, E. Sarmas, Z. Mylona, A. Cavadenti, F. Santori, V. Marinakis, Predicting thermal comfort in buildings with machine learning and occupant feedback, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv) (IEEE, 2023), pp. 34–39 8. E. Sarmas, N. Dimitropoulos, S. Strompolas, Z. Mylona, V. Marinakis, A. Giannadakis, H. Doukas et al., A web-based building automation and control service, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, 2022), pp. 1–6 9. E. Sarmas, S. Strompolas, V. Marinakis, F. Santori, M.A. Bucarelli, H. Doukas, An incremental learning framework for photovoltaic production and load forecasting in energy microgrids. Electronics 11(23), 3962 (2022) 10. J. Anuradha, A brief introduction on big data 5Vs characteristics and hadoop technology. Proc. Comput. Sci. 48, 319–324 (2015) 11. B.W. Wirtz, J.C. Weyerer, C. Geyer, Artificial intelligence and the public sector—applications and challenges. Int. J. Public Adm. 42(7), 596–615 (2019) 12. D.E. O’Leary, Artificial intelligence and big data. IEEE Intell. Syst. 28(2), 96–99 (2013) 13. R. Mayer, H.A. Jacobsen, Scalable deep learning on distributed infrastructures. ACM Comput. Surv.Comput. Surv. 53(1), 1–37 (2021) 14. S. Dilmaghani, M.R. Brust, G. Danoy, N. Cassagnes, J. Pecero, P. Bouvry, Privacy and security of big data in AI systems: a research and standards perspective, in 2019 IEEE International Conference on Big Data (Big Data) (IEEE, Los Angeles, CA, USA, 2019), pp. 5737–5743 15. J. Hu, A.V. Vasilakos, Energy big data analytics and security: challenges and opportunities. IEEE Trans. Smart Grid 7(5), 2423–2436 (2016) 16. E. Sarmas, E. Spiliotis, V. Marinakis, G. Tzanes, J.K. Kaldellis, H. Doukas, ML-based energy management of water pumping systems for the application of peak shaving in small-scale islands. Sustain. Cities Soc. 82, 103873 (2022) 17. E. Sarmas, E. Spiliotis, E. Stamatopoulos, V. Marinakis, H. Doukas, Short-term photovoltaic power forecasting using meta-learning and numerical weather prediction independent Long Short-Term Memory models. Renew. Energy 216, 118997 (2023) 18. H. Plattner, The impact of columnar in-memory databases on enterprise systems. Proc. VLDB Endow. 7(13), 1722–1729 (2014)

48

K. Touloumis et al.

19. S. Singh, A. Yassine, Big data mining of energy time series for behavioral analytics and energy consumption forecasting. Energies (Basel) 11(2), 452 (2018) 20. G. Hernández-Moral et al., Big data value chain: multiple perspectives for the built environment. Energies (Basel) 14(15), 4624 (2021) 21. E. Sarmas, N. Dimitropoulos, V. Marinakis, Z. Mylona, H. Doukas, Transfer learning strategies for solar power forecasting under data scarcity. Sci. Rep. 12(1), 14643 (2022) 22. A. Kongkanand, M.F. Mathias, The priority and challenge of high-power performance of lowplatinum proton-exchange membrane fuel cells. J. Phys. Chem. Lett. 7(7), 1127–1137 (2016) 23. A.Z. Faroukhi, I. El Alaoui, Y. Gahi, A. Amine, Big data monetization throughout big data value chain: a comprehensive review. J. Big Data 7(1), 3 (2020) 24. N. Naik, Choice of effective messaging protocols for IoT systems: MQTT, CoAP, AMQP and HTTP, in 2017 IEEE International Systems Engineering Symposium (ISSE) (IEEE, Vienna, Austria, 2017), pp. 1–7 25. E. Curry, The big data value chain: definitions, concepts, and theoretical approaches, in New Horizons for a Data-Driven Economy (Springer International Publishing, Cham, 2016), pp. 29– 37 26. Y. Li, S. Manoharan, A performance comparison of SQL and NoSQL databases, in 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) (IEEE, Victoria, British Columbia, Canada 2013), pp. 15–19 27. Neo4j, https://neo4j.com/. Accessed 29 May 2023 28. MongoDB, https://www.mongodb.com/. Accessed 29 May 2023 29. Apache Handoop, https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ HdfsDesign.html. Accessed 29 May 2023 30. Presto, https://prestodb.io/. Accessed 29 May 2023 31. Trino, https://trino.io/. Accessed 29 May 2023 32. K. Alexakis et al, Intelligent querying for implementing building aggregation pipelines, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–6 33. K. Touloumis, E. Karakolis, P. Kapsalis, V. Marinakis, J. Psarras, BD4NRG query engine– intuitive, efficient and federated querying on big data, in e-Society (Iadis, Lisbon, Portugal, 2023) 34. Grafana, https://grafana.com/. Accessed 29 May 2023 35. G. Kormpakis, P. Kapsalis, K. Alexakis, S. Pelekis, E. Karakolis, H. Doukas, An advanced visualisation engine with role-based access control for building energy visual analytics, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–8 36. P. Kapsalis, G. Kormpakis, K. Alexakis, D. Askounis, Leveraging graph analytics for energy efficiency certificates. Energies (Basel) 15(4), 1500 (2022) 37. P. Kapsalis, G. Kormpakis, K. Alexakis, E. Karakolis, S. Mouzakitis, D. Askounis, A reasoning engine architecture for building energy metadata management, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–7 38. N. Dimitropoulos, E. Sarmas, M. Lampkowski, V. Marinakis, A quantitative methodology to support local governments in climate change adaptation and mitigation actions, in International Symposium on Distributed Computing and Artificial Intelligence (Springer Nature Switzerland, Cham, 2023), pp. 99–108 39. European Dataspaces, https://dataspaces.info/common-european-data-spaces/#page-content. Accessed 29 May 2023 40. IDS Connector, https://www.isst.fraunhofer.de/en/business-units/data-business/technologies/ Dataspace-Connector.html. Accessed 29 May 2023 41. True Connector, https://fiware-true-connector.readthedocs.io/en/latest/. Accessed 29 May 2023 42. L. Xuemei, D. Lixing, L. Jinhu, X. Gang, L. Jibin, A novel hybrid approach of KPCA and SVM for building cooling load prediction, in 2010 Third International Conference on Knowledge Discovery and Data Mining (IEEE, Washington DC, USA, 2010), pp. 522–526

The Big Data Value Chain for the Provision of AI-Enabled Energy …

49

43. J.P. Zhang, Z.W. Li, J. Yang, A parallel SVM training algorithm on large-scale classification problems, in 2005 International Conference on Machine Learning and Cybernetics (IEEE, 2005), pp. 1637–1641 44. R. Mena, F. Rodríguez, M. Castilla, M.R. Arahal, A prediction model based on neural networks for the energy consumption of a bioclimatic building. Energy Build. 82, 142–155 (2014) 45. C. Tsolkas, E. Spiliotis, E. Sarmas, V. Marinakis, H. Doukas, Dynamic energy management with thermal comfort forecasting. Build. Environ. 237, 110341 (2023) 46. V. Marinakis, H. Doukas, An advanced IoT-based system for intelligent energy management in buildings. Sensors 18(2), 610 (2018) 47. S.S.K. Kwok, R.K.K. Yuen, E.W.M. Lee, An intelligent approach to assessing the effect of building occupancy on building cooling load prediction. Build. Environ. 46(8), 1681–1690 (2011) 48. S. Pelekis et al, A comparative assessment of deep learning models for day-ahead load forecasting: investigating key accuracy drivers (2023) 49. E. Sarmas, E. Spiliotis, N. Dimitropoulos, V. Marinakis, H. Doukas, Estimating the energy savings of energy efficiency actions with ensemble machine learning models. Appl. Sci. 13(4), 2749 (2023) 50. E. Sarmas, E. Spiliotis, V. Marinakis, T. Koutselis, H. Doukas, A meta-learning classification model for supporting decisions on energy efficiency investments. Energy Build. 258, 111836 (2022) 51. S. Pelekis et al, Targeted demand response for flexible energy communities using clustering techniques (2023) 52. R. Ahmadiahangar, T. Haring, A. Rosin, T. Korotko, J. Martins, Residential load forecasting for flexibility prediction using machine learning-based regression model, in 2019 IEEE International Conference on Environment and Electrical Engineering and 2019 IEEE Industrial and Commercial Power Systems Europe (EEEIC/I&CPS Europe) (IEEE, Genova, Italy, 2019), pp. 1–4 53. E. Karakolis, K. Alexakis, P. Kapsalis, S. Mouzakitis, S. Psarras, An end-to-end approach for scalable real time anomaly detection in smart buildings, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–7 54. M. Yamauchi, Y. Ohsita, M. Murata, K. Ueda, Y. Kato, Anomaly detection in smart home operation from user behaviors and home conditions. IEEE Trans. Consum. Electron. 66(2), 183–192 (2020) 55. J. Xu, H. Wu, J. Wang, M. Long, Anomaly transformer: time series anomaly detection with association discrepancy (2021) 56. M. Panthi, Anomaly detection in smart grids using machine learning techniques, in 2020 First International Conference on Power, Control and Computing Technologies (ICPC2T) (IEEE, Raipur, India, 2020), pp. 220–222 57. E. Karakolis et al, Artificial intelligence for next generation energy services across Europe– the I-Nergy project, in International Conferences e-Society 2022 and Mobile Learning 2022 (IADIS, Lisbon, Portugal, 2022) 58. E. Karakolis, S. Pelekis, S. Mouzakitis, G. Kormpakis, V. Michalakopoulos, J. Psarras, The I-Nergy reference architecture for the provision of next generation energy services through artificial intelligence, in e-Society (IADIS, Lisbon, Portugal, 2022) 59. M. Pau, P. Kapsalis, Z. Pan, G. Korbakis, D. Pellegrino, A. Monti, MATRYCS—a big data architecture for advanced services in the building domain. Energies (Basel) 15(7), 2568 (2022) 60. European Energy Data Exchange, https://energy.ec.europa.eu/system/files/2021-06/bridge_ wg_data_management_eu_reference_architcture_report_2020-2021_0.pdf/. Accessed 29 May 2023 61. K.A. Wehrmeister, The BD4NRG reference architecture for big data driven energy applications, in 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (IEEE, Corfu, Greece, 2022), pp. 1–8

50

K. Touloumis et al.

62. F. Gao, Frameworks for big data integration, warehousing, and analytics, in Big Data Application in Power Systems (Elsevier, 2018), pp. 57–73 63. L. Cai, Y. Zhu, The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14, 2 (2015) 64. B. Balaji et al, Brick, in Proceedings of the 3rd ACM International Conference on Systems for Energy-Efficient Built Environments (ACM, New York, NY, USA, 2016), pp. 41–50 65. M. Muniswamaiah, T. Agerwala, C.C. Tappert, Federated query processing for big data in data science, in 2019 IEEE International Conference on Big Data (Big Data) (IEEE, Los Angeles, CA, USA, 2019), pp. 6145–6147 66. A. Gandomi, M. Haider, Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manage. 35(2), 137–144 (2015) 67. D. Cheng, P. Schretlen, N. Kronenfeld, N. Bozowsky, W. Wright, Tile based visual analytics for Twitter big data exploratory analysis, in 2013 IEEE International Conference on Big Data (IEEE, Santa Clara, CA, USA, 2013), pp. 2–4 68. M. Tavana, A. Shaabani, F. Javier Santos-Arteaga, I. Raeesi Vanani, A review of uncertain decision-making methods in energy management using text mining and data analytics. Energies (Basel) 13(15), 3947 (2020) 69. V. Cevher, S. Becker, M. Schmidt, Convex Optimization for Big Data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31(5), 32–43 (2014) 70. J. Kampars, J. Grabis, Near real-time big-data processing for data driven applications, in 2017 International Conference on Big Data Innovations and Applications (Innovate-Data) (IEEE, Prague, Czech Republic, 2017), pp. 35–42 71. ENTSO-E, https://www.entsoe.eu/. Accessed 29 May 2023 72. V. Araujo, K. Mitra, S. Saguna, C. Åhlund, Performance evaluation of FIWARE: a cloud-based IoT platform for smart cities. J. Parallel Distr. Comput. 132, 250–261 (2019) 73. M.A. da Cruz, J.J. Rodrigues, P. Lorenz, P. Solic, J. Al-Muhtadi, V.H.C. Albuquerque, A proposal for bridging application layer protocols to HTTP on IoT solutions. Futur. Gener. Comput. Syst. 97, 145–152 (2019) 74. A. Braud, G. Fromentoux, B. Radier, O. Le Grand, The road to European digital sovereignty with Gaia-X and IDSA. IEEE Netw. 35(2), 4–5 (2021) 75. M. Mohammadi, A. Al-Fuqaha, S. Sorour, M. Guizani, Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun. Surv. Tutor., 2923–2960 (2018) 76. K.M.M. Thein, Apache kafka: next generation distributed messaging system. Int. J. Sci. Eng. Technol. Res. 3(47), 9478–9483 (2014) 77. J.E. Luzuriaga, M. Perez, P. Boronat, J.C. Cano, C. Calafate, P. Manzoni, A comparative evaluation of AMQP and MQTT protocols over unstable and mobile networks, in 2015 12th Annual IEEE Consumer Communications and Networking Conference (CCNC) (IEEE, Las Vegas, Nevada, USA, 2015), pp. 931–936 78. SAREF, https://saref.etsi.org/. Accessed 29 May 2023 79. DataHub, https://datahub.io/. Accessed 29 May 2023 80. M. Brümmer, C. Baron, I. Ermilov, M. Freudenberg, D. Kontokostas, S. Hellmann, DataID: towards semantically rich metadata for complex datasets, in Proceedings of the 10th International Conference on Semantic Systems (ACM, New York, NY, USA, 2014), pp. 84–91 81. PostgreSQL, https://www.postgresql.org/. Accessed 29 May 2023 82. SQLite, https://www.sqlite.org/index.html. Accessed 29 May 2023 83. Jupyter, https://jupyter.org/. Accessed 29 May 2023 84. Scikit-Learn, https://scikit-learn.org/stable/. Accessed 29 May 2023 85. TensorFlow, https://www.tensorflow.org/. Accessed 29 May 2023 86. Mlflow, https://mlflow.org/. Accessed 29 May 2023 87. Apache Spark, https://spark.apache.org/. Accessed 29 May 2023 88. W. Wu, W. Lin, C.H. Hsu, L. He, Energy-efficient hadoop for big data analytics and computing: a systematic review and research insights. Futur. Gener. Comput. Syst. 86, 1351–1367 (2018) 89. Apache Flink, https://flink.apache.org/. Accessed 29 May 2023

The Big Data Value Chain for the Provision of AI-Enabled Energy …

51

90. S. Chintapalli et al.: Benchmarking streaming computation engines: storm, flink and spark streaming, in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (IEEE, Chicago, IL, USA, 2016), pp. 1789–1792 91. Fiware Keyrock, https://fiware-idm.readthedocs.io/en/latest/. Accessed 29 May 2023 92. Keycloack, https://www.keycloak.org/. Accessed 29 May 2023 93. M. Muthuprasanna, K. Wei, S. Kothari, Eliminating SQL injection attacks—a transparent defense mechanism, in 2006 Eighth IEEE International Symposium on Web Site Evolution (WSE’06) (IEEE, Philadelphia, PA, USA, 2006), pp. 22–32 94. S. Biswas, T. Carley, M. Simpson, B. Middha, R. Barua, Memory overflow protection for embedded systems using run-time checks, reuse, and compression. ACM Trans. Embed. Comput. Syst. 5(4), 719–752 (2006) 95. B. Biggio, B. Nelson, P. Laskov, Poisoning attacks against support vector machines (2012) 96. T. Nagata, H. Sasaki, A multi-agent approach to power system restoration. IEEE Trans. Power Syst. 17(2), 457–462 (2002) 97. V. Vasilyev, R. Shamsutdinov, Security analysis of wireless sensor networks using SIEM and multi-agent approach, in 2020 Global Smart Industry Conference (GloSIC) (IEEE, Chelyabinsk, Russia, 2020), pp. 291–296 98. K. Touloumis, A. Michalitsi-Psarrou, P. Kapsalis, A. Georgiadou, D. Askounis, Vulnerabilities manager, a platform for linking vulnerability data sources, in 2021 IEEE International Conference on Big Data (Big Data) (IEEE, Orlando, FL, USA, 2021), pp. 2178–2184 99. K. Touloumis, A. Michalitsi-Psarrou, A. Georgiadou, D. Askounis, A tool for assisting in the forensic investigation of cyber-security incidents, in 2022 IEEE International Conference on Big Data (Big Data) (IEEE, Osaka, Japan, 2022), pp. 2630–2636

Modular Big Data Applications for Energy Services in Buildings and Districts: Digital Twins, Technical Building Management Systems and Energy Savings Calculations Gema Hernández Moral , Víctor Iván Serna González , Roberto Sanz Jimeno , Sofía Mulero Palencia , Iván Ramos Díez , Francisco Javier Miguel Herrero , Javier Antolín Gutiérrez , Carla Rodríguez Alonso , David Olmedo Vélez , Nerea Morán González , José M. Llamas Fernández , Laura Sanz Martín , Manuel Pérez del Olmo , and Raúl Mena Curiel

Abstract Buildings should play a key role in the energy transition, since they account for almost 40% of EU’s energy consumption. This is acknowledged by numerous European directives, which set strict objectives towards promoting energy demand reduction in buildings, optimisation of the energy use or the application of G. Hernández Moral (B) · V. I. Serna González · R. Sanz Jimeno · S. Mulero Palencia · I. Ramos Díez · F. J. Miguel Herrero · J. Antolín Gutiérrez · C. Rodríguez Alonso · D. Olmedo Vélez · N. Morán González · J. M. Llamas Fernández · L. Sanz Martín · M. Pérez del Olmo · R. Mena Curiel CARTIF Technology Centre, Parque Tecnológico de Boecillo, Parcela 205, 47151 Valladolid, Spain e-mail: [email protected] V. I. Serna González e-mail: [email protected] R. Sanz Jimeno e-mail: [email protected] S. Mulero Palencia e-mail: [email protected] I. Ramos Díez e-mail: [email protected] F. J. Miguel Herrero e-mail: [email protected] J. Antolín Gutiérrez e-mail: [email protected] C. Rodríguez Alonso e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_3

53

54

G. Hernández Moral et al.

renewable energy sources. On the other hand, the current digitalisation trend has as a consequence the appearance of vast amounts of static and real-time data both on energy and off-domain data (big data), as well as new procedures to exploit them (e.g. machine learning). All of this, makes it possible to apply smart analytics to the data and enhance the decision-making process of stakeholders towards pursuing a more efficient building stock. In this line, this chapter will present three applications (digital twins, technical building management and IPMVP protocol application) developed within the H2020 projects “Modular big data applications for holistic energy services in buildings”, I-NERGY “Artificial intelligence for next generation energy” and BD4NRG “Big Data for Next Generation Energy” to support the efficient management and decision-making process around buildings. Digital twins will be considered at three different scales (building, district and regional scales), and applied in pilots in Poland and Spain; whereas technical building management systems and IPMVP protocol application will focus on the building scale, and will be applied in Slovenia and Spain pilots, respectively. Keywords Digital twin · Big data · Technical building systems · Building automation and control · IPMVP · Energy savings · Energy efficiency

Acronyms AEC API BACnet BAS BIM BMS CityGML CityJSON

Architecture Engineering and Construction Application Programming Interface Building Automation and Control Networks Building Automation Systems Building Information Modelling Building Management Systems City Geography Markup Language City JavaScript Object Notation

D. Olmedo Vélez e-mail: [email protected] N. Morán González e-mail: [email protected] J. M. Llamas Fernández e-mail: [email protected] L. Sanz Martín e-mail: [email protected] M. Pérez del Olmo e-mail: [email protected] R. Mena Curiel e-mail: [email protected]

Modular Big Data Applications for Energy Services in Buildings …

CLI CSV DB DEM DSM DL DT EVO GA JSON KPI ECM EU GIS HTML HTTP HVAC IFC IOT IPMVP LiDAR LOD LSP LSTM ML M&V OGC OPC-UA OSM PED RAM RNN SPA TBMS

55

Command-line Interface Comma Separated Values Database Digital Elevation Model Digital Surface Model Deep Learning Digital twin Efficiency Valuation Organisation Grant Agreement JavaScript Object Notation Key Performance Indicator Energy Conservation Measure European Union Geographic Information System HyperText Markup Language Hypertext Transfer Protocol Heating Ventilation and Air Conditioning Industry Foundation Classes Internet of Things International Performance Measurement and Verification Protocol Laser Imaging Detection and Ranging Level of Detail Large-Scale Pilot Long-term short-term memory Machine Learning Measurement and Verification Open Geospatial Consortium Open Platform Communications Unified Architecture Open Street Maps Positive Energy District Random Access Memory Recurrent Neural Network Single-Page Application Technical Building Management Systems

1 Introduction To achieve climate neutrality in Europe by 2050 is the main challenge pursued by the European Commission’s strategic vision “A Clean Planet for All”[1], striving to become the first climate neutral continent as reinforced one year later through the European Green Deal [2]. This transition is, according to the EC, both an urgent challenge and an opportunity to build a better future for all. In this context, the building

56

G. Hernández Moral et al.

sector represents one of the greatest challenges to be tackled, since it accounts for 38% of total global energy-related CO2 emissions. Moreover, while global building energy consumption remained steady year-on-year, energy-related CO2 emissions increased to 9.95 GtCO2 in 2019 [3]. To tackle this challenge, all parts of society and economic sectors will play a role (power sector, industry, mobility, buildings, agriculture and forestry). To do so, in the EU’s 2050 long-term strategy the investment into realistic technological solutions, that can empower citizens and align actions in key areas such as industrial policy, finance and research, while ensuring social fairness for a just transition plays a key role [4]. For this, several initiatives have been put into action by the European Commission (‘New Industrial Strategy for Europe’[5], the Circular Economy Action Plan [6], digital strategies ‘Shaping Europe’s Digital Future’ [7], ‘Data’ [8], ‘Artificial Intelligence White Paper’ [9], among other) that will contribute to achieving a green digital transformation. It then becomes apparent that the EC establishes a strong link in between the move to climate neutrality and a faster digitalisation and accelerated economic and societal changes. Indeed, the progress in digital and industrial technologies has the potential to shape all sectors of the economy and society. Certainly, they transform the way industry develops, produces new products and services, and are central to any sustainable future. In particular, when tackling the most challenging sector, the building stock, the challenges to be addressed in this context can be categorised into four main groups: design, performance, fund and policy [10, 11]. The design challenge is linked to the decision-making process that needs to be performed when designing a new building or building infrastructure or refurbishing an existing one [12, 13]. This can be performed by focusing on building level as well as at district level. Other challenges are related to the performance of the buildings. These are linked to the operational stage of the building and are connected to the exploitation of monitoring data towards the optimization of the functioning of energy systems, management of comfort-aware energy consumption, or energy matching with renewable energy produced on site [14, 15]. A further step to take would be boosting energy refurbishments, which can be categorized under “fund”. It is fundamental to enhance reliability and reduce risks of energy efficiency investments, in particular, tailored to ESCOs and financing institutions, that contribute to better define energy performance contract conditions, the analysis of refurbishment actions and the evaluation of their bankability. Finally, when approaching the building sector with a holistic perspective it is important to tackle the policy making and policy impact assessment dimension, to support the analysis and monitoring of strategies’ implementation at a broader scale [16, 17]. The challenges in the building sector are clear, now the question is how to address them with technological advancements and taking advantage of the vast amount of currently existing data [18, 19]. The following sections will provide some insights on solutions proposed focused on the building stock that leverage big data capabilities. Section 2 will present the current state of the art on relevant technologies related to big data and the energy value chain that are related to the solutions presented in this chapter. Then, Sect. 3 will present the modular big data applications proposed,

Modular Big Data Applications for Energy Services in Buildings …

57

namely, digital twins at different scales, technical building management systems, and an energy savings calculation service. Last but not least, a methodology to assess the acceptance of solutions is presented in Sect. 4, as well as the current progress in terms of user satisfaction related to these services presented. The chapter concludes with Sects. 5 and 6 with discussion and conclusions, respectively.

2 Big Data and the Energy Value Chain The overarching objective of accelerating the twin green and digital transitions towards building a lasting and prosperous growth, is particularly salient in the four objectives of the Horizon Europe programme, which will pave the way for research in Europe for the next six years, in line with the EU’s new growth strategy and the European Green Deal. To implement these twin transitions new technologies, investments and innovations will be required to create the new products, services and business models needed to sustain or enable EU industrial leadership and competitiveness, and to create new markets for climate neutral and circular products. The stakeholder’s building value chain should be seamlessly connected in order to make this transition possible. In this respect, this chapter will present innovative services and analytics that will contribute to supporting several building related cases at different scales addressed to a variety of target groups. Special emphasis will be placed on the development of digital twins at different scales (strengthening the links between data and digitisation and the built environment) and providing progress on fault detection of sensors, and energy savings calculations. To this end, the following paragraphs present brief account of the state of the art in these related technologies that will establish the basis to understand the progress proposed with the digital solutions presented in this chapter.

2.1 Digital Twins A digital twin is a virtual representation of a physical building or infrastructure that incorporates real-time data and simulations to model and monitor its behaviour [20, 21]. At the building scale, digital twins offer numerous benefits, including enhanced design, construction, operation, and maintenance of buildings. Digital twins enable architects and engineers to create virtual prototypes of buildings, allowing them to simulate and optimize various design parameters [22–24]. This helps in identifying potential issues, improving energy efficiency, and ensuring optimal performance before the construction phase. They also facilitate better construction planning and project management [25]. They enable real-time monitoring of construction progress, comparing it with the digital model, and identifying deviations or potential clashes [26]. This aids in reducing errors, optimizing workflows, and improving overall efficiency. Moreover, regarding energy efficiency, digital twins allow for continuous

58

G. Hernández Moral et al.

monitoring and analysis of a building’s energy consumption and performance [27]. By integrating data from sensors, measuring devices and IoT devices, they can identify energy inefficiencies, simulate energy-saving scenarios, and optimize building systems for better performance [28]. In addition, they enable proactive maintenance by integrating real-time data from various sensors installed in a building. By continuously monitoring equipment and systems, they can detect anomalies, predict failures, and schedule maintenance activities accordingly. This approach helps in reducing downtime, optimizing maintenance costs, and extending the lifespan of building assets. Digital twin technology has also found application in the context of cities and regions. It is important to note that when operating at these scales, the level of detail expected differs from that of individual buildings [29]. It is improbable to have access to all the necessary information required to create a flawless BIM model encompassing all the structures within a city or region. Utilizing a digital twin as a model for an entire city is an ambitious concept. At smaller scales, digital twins have been employed for many years to first create, test, and construct everything virtually, from products to plants design. Employing digital twins within a city framework represents a broader vision compared to industrial design or building-specific digital twins. Its potential applications include facilitating simulations and analysis of existing and future urban environments [30–32], supporting maintenance and administrative systems, aiding in emergency planning and management, and optimizing the construction production chain, among other possibilities. City digital twins primarily rely on topographic and geometric models of city infrastructure [33]. In recent times, CityGML [34] has been utilized as a data model for representing 3D urban objects [35, 36]. It defines classes and relationships for the most relevant topographic objects in city models, encompassing properties such as geometries, topology, and semantic data. In addition, 2D Geographic Information Systems (GIS) models have been extensively utilized to represent cities [37], particularly in scenarios where 3D attributes are not required. Furthermore, open GIS data is increasingly prevalent. Recently, there have been initiatives aimed at integrating city and building digital twins into a unified model. ESRI [38] and Autodesk [39], for example, are collaborating on the development of a combined tool that enables the creation of digital twins for cities. Similarly, other initiatives are adopting a similar approach. SuperMap GIS [40], for instance, integrates both 2D and 3D technologies, operating with a comprehensive spatial data model.

2.2 Technical Building Management Systems In this context it is crucial to analyse the synergies and the potential that digital twins can have when combining them with energy concepts to be able to monitor and optimise consumptions, and also guarantee energy savings. Actually, cloudbased platforms and digital twin technologies are being integrated into Technical

Modular Big Data Applications for Energy Services in Buildings …

59

Building Management Systems (TBMS), enabling remote access, storage, and analysis of building data [41]. Cloud-based solutions facilitate scalability, collaboration, and data sharing among multiple stakeholders, while digital twins provide virtual replicas of buildings for simulation, optimization, and predictive analysis. These technical Building Management Systems (TBMS), also known as Building Automation Systems (BAS) or Building Management Systems (BMS), have advanced significantly in recent years. They integrate various technologies and functions to monitor, control, and optimize the performance of building systems. Modern BMSs focus on seamless integration with diverse building systems and devices, such as HVAC (Heating, Ventilation, and Air Conditioning), lighting, access control, fire safety, and energy management. Interoperability standards, such as BACnet, KNX, and OPC-UA, enable different systems to communicate and exchange data effectively [42]. Additionally, these systems increasingly leverage the Internet of Things (IoT) to connect and communicate with a wide range of sensors and measuring devices deployed throughout the building. This enables real-time data acquisition, remote monitoring, and control, enhancing operational efficiency and allowing for predictive and proactive maintenance [43, 44]. In addition, they play a crucial role in optimizing energy consumption and improving building performance. Advanced energy management features include real-time monitoring of energy usage, demand response capabilities, automated energy-saving strategies, and integration with renewable energy sources [44]. Energy analytics and machine learning algorithms are often employed to identify energy-saving opportunities and optimize system performance [45, 46].

2.3 Energy Savings Calculations In terms of energy savings calculations, it is paramount to understand the International Performance Measurement and Verification Protocol (IPMVP) developed by the Efficiency Valuation Organisation (EVO) [47]. The implementation of M&V (Measurement and Verification) plans for savings calculations is normally addressed by adopting suitable protocols. In this context, IPMVP is the most recognized international protocol for this purpose. M&V is the process of planning, measuring, collecting and analysing data to verify and report energy savings resulting from the implementation of an energy conservation measure. Energy savings are by definition, the absence of energy use and subsequently cannot be measured, but what can be measured is the energy use. M&V therefore represents the process of analysing measured energy use before and after a retrofit to determine savings. To make a consistent comparison, appropriate adjustments shall be made according to the IPMVP methodology. The comparison of before and after energy use is made using the following general equation, where the adjustment term is used to re-state the energy use of the baseline and reporting periods under a common set of conditions.

60

G. Hernández Moral et al. Savings = (Baseline period energy use−Reporting period energy use) ± Adjustments

IPMVP provides four options for determining energy savings depending on the specific peculiarities of the assessed project. The selection of the option is a decision that is made by the designer of the M&V program in each project based on the project conditions, analysis, budgets and professional judgment. Below the description of the existing options and when they are applied are presented: • Option A: Retrofit Isolation: Key Performance Measurement. Savings are determined by field measurements of the key parameter(s), which define the energy consumption and demand of the energy conservation measures (ECMs) affected system(s) or the success of the project. • Option B: Retrofit Isolation: All Parameter Measurement. Savings are determined by field measurement of the energy consumption and demand and/or related independent or proxy variables of the ECM affected system. • Option C: Whole Facility: Continuous measurements of the entire facility’s energy use. Savings are determined by measuring energy consumption and demand at the whole facility utility meter level. • Option D: Calibrate Simulations: savings are determined through simulations. Savings are determined through simulation of the energy consumption and demand of the whole facility, or of a sub-facility. M&V is not just a collection of tasks conducted to help a project meet IPMVP requirements. Properly considered, each M&V task serves to enhance and improve the operation and savings of the facility. A complete M&V plan based on IPMVP should cover the following 13 topics: (1) ECM Intent: Describing the ECMs, its intended result and the operational verification procedures; (2) Selected IPMVP option and Measurement boundary for the savings determination; (3) Baseline period: Document the conditions and energy data for the reference situation, (4) Reporting period: Document the information about the reporting period after the deployment of the ECMs, (5) Basis for adjustment: Declare the set of conditions to which the energy measurements will be adjusted, (6) Calculation methodology and analysis procedures: specify the specific methods and analysis used in the savings reports, (7) Energy prices specification that will be used for the economic savings, (8) Meter specifications: Specifying the metering points and periods, (9) Monitoring responsibilities definition, (10) Expected accuracy used in the savings report, (11) Budget definition and resources required for the savings determination, (12) Savings report format specifying how the results will be reported and documented, (13) Quality assurance procedures. Depending on the specific circumstances of each project, some additional topics should be covered in the M&V plan. For example, for Option A, justification of the estimations is needed and for Option D, information about the simulation and modelling activities is required. In general, all the M&V plans follow a common structure based on the IPMVP but at the end they must be adapted to the specific characteristics and needs of the assessed project.

Modular Big Data Applications for Energy Services in Buildings …

61

All in all, this section has provided a solid base upon which to understand the approaches followed in the services development presented in the next section.

3 Modular Big Data Applications for Energy Services The following subsections showcase the different solutions developed to support stakeholders in the twin transition and digitalisation efforts, by tackling specific challenges encountered in these stakeholders’ day by day undertakings. These services have been developed in the context of the MATRYCS [48], I-NERGY [49] and BD4NRG [50] European projects, and have counted on specific pilots where these services are currently being validated [51–54]. All of the sections follow the same structure. First, the main objective and the challenge addressed is briefly presented, then, the solution design, data used and steps covered are explained. Next, the user experience, through the visualisation of graphical user interfaces or mock-ups is illustrated. Then, the application of the solution in a specific context is described. Finally, replication possibilities of the solutions are analysed, as well as potential next steps defined.

3.1 Digital Twins at Different Scales A digital twin (DT) can be defined as a connected, digital representation of a physical building (district, city, region, etc.) and corresponding processes that are used to understand, predict and optimise performance in order to achieve a more costeffective, straightforward and sustainable smart building. It brings together dynamic and static data from multiple sources in 2D/3D models and enables informed and effective decisions to be made. It provides real-time understanding of how a building is performing–enabling immediate adjustment to optimize efficiency and to provide data to improve the design and management of future buildings or districts. A. Digital twins at building level There are several layer-based approaches to the architecture of a digital twin (DT) at building level. Starting with the core of the building’s DT, three layers can be distinguished: a physical layer that corresponds to sensorisation (physical equipment that functions as data sources); a data layer in which all the available data would be concentrated, (such as the data repositories and the real-time information); and finally, the model layer, where the reasoning, behaviour, simulation and prediction models would be located (Fig. 1). Other approaches expand the number of layers by adding a user interface layer or subdividing some of the previous layers (a cloud storage layer could be added in the data layer; an analysis layer could be included in the model layer; and the

62

G. Hernández Moral et al.

Fig. 1 Digital twin schema

physical layer could also include communications, IoT). In the MATRYCS project, these three main layers will be used, and other layers will be included depending on the use cases and the specific requirements of each application. Main objective and challenge addressed At the building level scale, the following specific objectives are proposed: • Create a coherent aggregation of data from multiple sources and link them to a digital model of the building to obtain an initial digital twin. • Integrate the different analytics building services in the digital twin answering the needs specified in the pilots and captured in the user stories and the specifications. • Modelling the elements and behaviour of the entities taking part in the digital twin. • Prepare/adapt the digital twin to the required inputs and outputs, the required predictions and behavioural analysis, as well as the features to be applied for the ML and DL definitions. Solution design, data used and steps covered To achieve these objectives, an architecture of the digital twin service has been proposed at a building scale, which is based on the use of the following data from three main sources: • 3D models of buildings: for an adequate implementation of the digital twin, it is considered necessary to have three-dimensional models of the building that faithfully represent its geometry. These 3D models must be uploaded into the digital twin application via IFC files. They are subsequently converted into XKT format throught the xeokit-converter library located in the server side of the application and stored into the database.

Modular Big Data Applications for Energy Services in Buildings …

63

• Sensor attributes: information on the available sensors in the building, mainly the type of sensor (what information it measures), its position and, although it is not strictly necessary, it is advisable to have its dimensions. All this information is obtained directly from the uploaded IFC file. • Sensor data: the data provided by the sensors (value, unit of measure, date and time of capture, annotations or comments…). It is essential to have data in real time (to be able to connect it to the digital twin) but it is also important to collect historical data (in the form of databases, excel tables, etc.). All this data is obtained through HTTP requests to other services developed within the project. The architecture deployed is shown in the following Fig. 2: A frontend application for the Digital Twin at building level has been developed in the form of a Single-Page Application (SPA). In this type of application, all the functions exist in a single HTML page. This pattern can significantly improve the application’s user experience. This application is intended to easily view and manage the digital twin and also to guarantee the security of the information. It allows navigating the 3D models of the pilots and viewing the sensor data installed on them at the same time. Angular [55] open-source framework has been used for web application design and development. CSS framework used in this service is Bootstrap [56], a framework for building responsive mobile-first sites. For the visualization, navigation and management of the BIM models of the digital twin different libraries were evaluated. Initially, the Xbim toolkit and BIMvision solutions were analyzed, although it was finally decided to use the Xeokit SDK [57] for two main reasons: Xeokit is open-source and is designed for viewing large models in the browser. Xeokit is designed for viewing large models in the browser and has several capabilities that make it interesting for implementation in MATRYCS: fast loading and rendering of 3D models, double-precision rendering, BIM/AEC-friendly programming API (JavaScript graphics programming toolkit, with an API designed especially for BIM/

Fig. 2 Digital twin architecture at building level

64

G. Hernández Moral et al.

AEC applications), and Plugins (big library of plugins –for navigation, measurement, collaboration- to accelerate BIM/AEC application development). The Xeokit SDK can load IFC (2 × 3 & 4), CityJSON, glTF, OBJ, 3DS, STL, 3DXML, LAS, LAZ, PLY and XKT formats. Also, it is possible to use open source CLI tools to convert IFC STEP files into Xeokit’s compressed native model format, XKT, which Xeokit can then load super quickly into its browser-based viewer. Additionally, plotly library [58] has been used to visualize the sensor data. Plotly.py is an interactive, open source, and browser- based graphing library for Python [59]. Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or integrated into Dash applications. Finally, Axios [60] was chosen as the data querying library to connect the frontend application to the backend application, which will be described below. The backend application was developed to provide the data to the frontend side. This application uses Node [61] as a server-side JavaScript runtime environment allowing to execute JavaScript code on the server, giving it the ability to interact with the operating system, file system and other low-level operations. This makes it possible to use the Xeokit-Converter [62] library to convert IFC models into XKT (file format especially designed to be viewed by the Xeokit SDK). The XKT models are stored in a Mongo DB [63], which is also fully managed by the backend application. User experience The user experience of this service can be summed up in four easy steps: (1) access the service, (2) select the desired digital twin, (3) navigate and consult the available information and finally (4) exit the service. When connecting to the web application of this service, the login option is displayed first. This page asks for the username and password to access the tool and clicking on the “Sign in” button grants access to the user if the security framework allows it. The digital twin service is integrated with the Security Enabler, which means that the functionalities of the service will only be available to users who have provided valid credentials to the Sign In page. Each process is validated by access token: if a valid access token exists, the user is allowed to use the service. Once access is granted, the home page is shown. The home page currently shows a brief summary of the service, but it will be expanded with the information deemed necessary to include a help manual, available data, project information, etc. (Fig. 3). The six available options are displayed at the top of the Homepage and are: (1) Homepage, (2) Viewer page, (3) Upload model page, (4) Services page (Technical Building Management—TBM), (5) User information and (6) Sign Out. Using the pilot’s option, it is possible to choose the digital twin of the pilot the user wants to interact or work with. Figure 4 shows an example of the digital twin of one of the MATRYCS pilots (BTC tower located in Ljubljana, Slovenia). Using the available options, it is possible to visualize each of the floors of the building; to hide them or see them in x-ray mode. The view can be rotated, zoomed in or out as in any 3D viewing program. Using the Xeokit library it is possible to load one or more IFC models without the need for prior format conversion (as it occurs in some libraries) (Fig. 5).

Modular Big Data Applications for Energy Services in Buildings …

Fig. 3 Home page of the web application tool

Fig. 4 Digital twin of the BTC tower in Ljubljana, Slovenia

65

66

G. Hernández Moral et al.

Fig. 5 X-ray view of the Digital Twin BIM model

Once the digital twin of the corresponding pilot has been loaded, the drop-down menu in the upper right area shows the different sensors linked to that digital twin (Fig. 6). Selecting a sensor shows us its type, identifier and the information it has linked to it (Fig. 7). When selecting the desired sensor, a query is made to the database accessible online and the corresponding values are displayed, available both in text and graphic format. It is possible to interact with the graph, for example by zooming to the interval of interest (Fig. 8). Other options available in the application include the feasibility to upload a new IFC file to a database through the Upload Model page. Clicking on the corresponding button, a file browser allows the user to select the desired IFC file. Once the file has been successfully converted it is possible to select it in the viewer option. Last but not least, the User information option shows the name, username, email and the corresponding roles of the user, and through the Sign Out option, the application logs the user out and navigates to the Sign In page. Application in a specific context This service has been tested and validated within the MATRYCS project. In particular, within two pilots. The first one is related to building operation, facility and resources fingerprinting for efficiency and optimal balancing of energy vectors, whereas the second one is focused on building refurbishment, specifically on sustainable building assessment and optimisation of refurbishment options. The first large-scale pilot (LSP) entails a series of three buildings, in a commercial complex called BTC in Slovenia; however, the service is tested in just one of these buildings, the BTC tower,

Modular Big Data Applications for Energy Services in Buildings …

Fig. 6 Selection of the desired sensor from those available in the DT

Fig. 7 Selected sensor information

67

68

G. Hernández Moral et al.

Fig. 8 Data sensor values

used as offices. The second LSP is located in Poland and is a single building used as a kindergarten. The different uses lead to a different data availability. In particular, the following data from the BTC tower are currently being used: (1) Simple 3D model, (2) Energy sources data, (3) Slovenian calendar, (4) Cooling system data, (5) Air condition data, (6) Heating system data. And with respect to the kinder garden in Poland, the following data are available: (1) Kindergarten Revit file, (2) Kindergarten IFC file, (3) Energy consumption data (per month, twenty months stored), (4) Temperature and humidity sensor data in 4 min frequency (two years stored). The testing of this service in the selected locations is never performed in an isolated manner, but linked to an additional service that is added on top of the digital twins, to enrich the proposed functionalities. This enrichment will become apparent in the explanation of the service related to technical building management systems explained in Sect. 3.2. Replication possibilities and envisaged next steps The service is fully replicable as it is a web service that runs remotely on the Internet. This service allows new pilots to be incorporated simply by uploading their IFC models into the platform. The file with the IFC model is limited to 16 Megabytes due to the MongoDB document size limitation. This maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store larger documents than the maximum size, MongoDB provides the GridFS API but it has not been implemented in the digital twin.

Modular Big Data Applications for Energy Services in Buildings …

69

It should also be noted that in order to view the sensor measurements in the digital twin, on the one hand, the IFC model must include the modelled sensors, and on the other hand, it is necessary to develop a data collecting system to feed the LSP database with the sensor measurements and create the ‘endpoints’ needed to provide this data to the DT through HTTP requests. These ‘endpoints’ would not only be responsible for providing the sensor data but also for formatting it as expected by the Digital Twin so in principle, any type of sensor could be used. Consequently, potential next steps to improve the current service include implementing GridFS API to allow uploading IFC files larger the 16 Megabytes, and upgrading Xeokit-Converter to permit uploading other types of model files (LAS, glTF, STL formats). 2 Digital twins at district level Digital twins are 3D complex virtual representations of city elements and landscapes that make possible to translate real environments into virtual ones making possible not only 3D visualization but also queries, analytics or spatial and data mining analysis. Targeted application areas of 3D virtual models explicitly include urban and landscape planning, architectural design, tourist and leisure activities, 3D cadastres, environmental simulations, mobile telecommunications, disaster management, homeland security, vehicle and pedestrian navigation, training simulators and mobile robotics. When considering digital twins at district scale, challenges that connect different buildings among each other (such as district heating networks), or even the calculation of energy demand (assessing the impact of shadowing elements from one building to another) can be explored. This approach can contribute in the development of sustainable urban strategies, for instance through the support of positive energy districts (PEDs) [64]. Main objective and challenge addressed The objective of the service is to generate a digital twin (DT) at district level that provides a 3D virtual model of the buildings located in a district to make feasible the implementation of applications to simulate built environments based on a 3D model under a standard format as CityGML [34] or 3D Tiles [65]. CityGML is based in XML schema language describing the grammar of the conformant data instances. CityGML differentiates five consecutive Levels of Detail (LOD), where objects become more detailed with increasing LOD regarding both their geometry and thematic differentiation. 3D Tiles dataset contains any combination of tile formats organized into a spatial data structure. 3D Tiles are declarative, extendable, and applicable to various types of 3D data. Both formats are built following OGC formats under well documented standards. In addition, the DT service at district level provides users with estimated energy demand values covering heating, cooling and domestic hot water that could be used as a reference for other applications such as the building stock characterization, or primary energy estimation.

70

G. Hernández Moral et al.

Solution design, data used and steps covered The Digital Twin at district level service enables users to analyse energy demand differences thanks to a virtual 3D representation of buildings. Two different formats are used for this purpose: CityGML and 3Dtiles. The CityGML is a common semantic information model for the representation of 3D urban objects based on an OGC Encoding Standard for the representation, storage and exchange of virtual 3D city and associated landscape. In addition, a 3DTiles representation is provided for visualization purposes in a web-based application while the downloadable CityGML model calculated automatically by the service could be used for thematic queries, analytical tasks or spatial data mining in standalone applications. Developed as Python library, this service considers the following datasets as starting point for the 3D model implementation: (1) building boundaries collected from cadastre or other sources as OpenStreetMaps considering that at least building use is included as and attribute; (2) LiDAR data for building height estimation; (3) reference energy demand values for heating, cooling and domestic hot water demand per square meter and climatic zone and (4) building use profiles for energy demand disaggregation. The service could run at national level thanks to the use of standard data as building boundaries but it is necessary to ensure that building attributes follow defined standards making possible the implementation of calculations using the algorithms that are part of the DT Python library. The first phase for the CityGML implementation is the transformation from shapefile (cadastre or OSM information) to CityGML using the specifications of the CityGML 2.0 version and the LOD 1 level scheme for building representation. This version requires the height as an attribute to extrude buildings and generate the triangular mesh representing each building wall. The height is calculated using LiDAR data and requires the implementation of a Digital Elevation Model (DEM), a Digital Surface Model (DSM) and a normalized DSM calculated as the difference between the DSM and the DEM. This normalized model is used to calculate the real height of buildings that is included as a building attribute by means of a mean statistics calculation procedure using the building boundaries as the variable for data calculation and integration. In parallel, energy demand for heating, cooling and domestic hot water in a yearly basis is calculated considering the reference values for each building use and climatic zone, and also the built square meters per building calculated as the product of the number of floors and the area of each building boundary. Energy demand values are also integrated as properties of the 3D CityGML virtual model before making it accessible in the service. The second phase of the service will enable the evaluation of the 3D virtual model in a web application thanks to Cesium 3D Tiles objects represented using the Cesium web-based library. In this case, the CityGML is translated into 3D Tiles objects including the same attributes as in the CityGML format using a code parser developed by Stadt Zürich [66]. This part of the DT service provides users capabilities for exploring 3D virtual building models without standalone applications for CityGML visualization installed in the computer. As a complementary OCG format, 3D Tiles

Modular Big Data Applications for Energy Services in Buildings …

71

expands the service usability and replicability thanks to a more flexible format that avoids explicit rules in data visualization. Finally, the third phase of the service is based on the disaggregation of the energy demand. Users are able to run and algorithm to translate yearly values into hourly values in order to generate the hourly demand pattern of each building based on the building typology, climate reference values to define the heating and cooling hours according to the defined set point (21 °C for heating and 25 °C for cooling) and the energy use pattern according to the building typology. Once the disaggregation is finished, users are able to evaluate the hourly energy demand profile for heating, cooling and domestic hot water but also the accumulated demand in each hour. It is also necessary to reflect that the obtained values at hourly basis are an estimation based on hypothesis that need to be validated with real values of energy consumption at building level. User experience The service will be used by means of a web-based interface in which CityGML models will be calculated using shapefile geometries and stored attributes. Using an upload form, the user will be able to import data (LiDAR and shapefile with building boundaries) from its area of interest to build the 3D virtual model and run the energy demand calculation. The following steps could describe the user experience: • Enter the Digital Twin at district level service. • Select and import the required data for yearly and hourly demand calculation, CityGML or 3D scene (3D Tiles) virtual model implementation. • Choose between three different alternatives: run demand calculation, create CityGML or visualize 3D scene model. CityGML or 3D Tiles virtual models can be downloaded by the user once calculated. The generation of the 3D Tiles model requires, as a previous step, the calculation of the CityGML model from the geometries of the buildings stored in a shapefile object. • Visualize results, select and compare features from different buildings. • Export results after calculation. In the case of demands, the user will be able to download the table with the estimated results at hourly level and the GeoJSON object with yearly values at building level. Figure 9, presents different screenshots of the service with interfaces for data insertion and a representation of a 3D scene similar to the one developed for the visualization of CityGML and 3D Tiles virtual models in the DT service. Application in a specific context The DT at district level has been tested and validated in the MATRYCS project, in particular in a pilot located in the district of Torrelago (Spain). This a district located in Laguna de Duero village that is located close to Valladolid city. This district, covers more than 30 buildings including residential and educational buildings that are connected with a biomass district heating network. For the service application, the following data are currently being used: (1) cadastre, (2) LiDAR, (3) reference energy demand values by building type and

72

G. Hernández Moral et al.

Fig. 9 Screenshots of the Digital Twin at district level service

climate zone. With these data as inputs for the calculation engine, the service is capable of estimating the demands and generating the 3D scene for virtual model generation (CityGML and 3D Tiles) as it can be seen in Fig. 10, which corresponds with the implementation of the service in the Torrelago district. Replication possibilities and envisaged next steps The service has a very large replication potential since it is based on standards and type data so that the user is able to calculate any district as long as it is guaranteed that the data insertion meets the requirements defined by the service. The limiting factor of the service could be the availability of LiDAR data, since without them, the height could not be calculated and, therefore, it would not be possible to carry out virtual 3D models. As future steps, it would be possible to improve and expand the service by inserting new functionalities for the development of models with a LOD2 level of detail, requiring the complete transformation of the parser from shapefile geometries to

Fig. 10 Visualisation of a 3D Tiles virtual model at LOD 0 calculated by means of the Digital Twin at district level for Torrelago district

Modular Big Data Applications for Energy Services in Buildings …

73

CityGML as well as the development of algorithms for the characterization of the roofs of the buildings allowing to increase the level of detail of the 3D virtual model. 3 Digital twins at regional level The highest scale of digital twin development is proposed at regional level. This covers from the city-municipality level to the province and regional level. These digital twins can provide valuable information to analyse, develop and deploy policies at broader scale. Basic information data to build these digital twins and then enrich them are mostly shapefiles, including cadastral data, or Open Street Maps data (OSM) [67]. In this way, the potential of Geographic Information Systems (GIS) can be properly exploited. However, the amount of information with which the model is enriched cannot be comparable to those used in finer scales (building and district level), due to the computing intensity and storage requirements this will imply. Main objective and challenge addressed The goal of the DT at regional level is to create a coherent aggregation of data from multiple sources, in order to obtain a basis for the estimation of the energy performance of the buildings in a given region. The data sources for this service include cadastral data and OSM data. Depending on these inputs, some challenges needed to be addressed. These include addressing incomplete data (for instance, lack of information on years of construction), lack of data availability in certain European countries, or the use of big data methodologies to handle and store these big amounts of information. For this reason, it is necessary to carefully define the data processing steps to be followed. Solution design, data used and steps covered The digital twin service at the regional level utilizes data obtained from public sources. The service uses a common data model as its foundation and enriches the information based on this model. The sources identified include cadastral information and data from OSM. The information primarily consists of geometric details, gross floor area, above-ground and below-ground floor counts, year of construction, building usage information, and the number of dwellings. It is essential to emphasize that the geometric information must be georeferenced. So, the process includes collecting this basis information (OSM or cadastral information) and storing it in a way that makes it easy to organise and access. Once this information has been collected, the next step is to calculate new parameters that are useful for subsequent calculations through geographical and geometrical tools. One example of this is the calculation of the walls belonging to two connected buildings, that makes possible for this case to identify the walls that are facade walls and party walls. Besides for the cases in which the information contained in the source is not complete enough (for example in OSM the use of each building and the year of construction is not always defined), the application has mechanisms to complete this information, using other sources like the Urban Atlas [68] for the case of the

74

G. Hernández Moral et al.

typology. Urban Atlas information helps to define the use of the buildings for those in which the labels included in the OSM information do not include information about the use or typology. In the case of cadastral information, the application performs the calculations using the Spanish cadastre [69] as a data source. In this case the information about the year of construction, the number of levels and the use of the building is well defined. The uses of the building defined by the Spanish cadastre are the following: residential, agriculture, industrial, office, retail and public services. Once the information about the geometry, the year of construction and the use of the building is established, information about the characterization of the building can be added. This characterization will be focused in the energy aspects of the buildings, since the aim of this Digital Twin at Regional level is to provide a basis for applications that evaluate energy usage in a region. For the characterization of the building, information from TABULA [70] and Building Stock Observatory (BSO) [71] will be used. On one hand, TABULA data, conveniently filtered will be used in the case of residential buildings. On the other hand, BSO will be used for buildings of the tertiary sector. Besides, information from the catalogue of building elements according to the regulations (from Spanish Building Code) of each year will be used. The parameters extracted from these sources are those related to the physical characteristics of the building (window-wall ratio, number of floors for those buildings in which this information is not present, etc.), thermal characteristics of the envelopes and openings (U-values of the different elements) and characteristics of the energy systems (efficiency of the different systems, hot water demands by m2, etc.). This information will be used to enrich the Digital Twin. Besides the information calculated for each building can be aggregated at block, district, municipality, province or regional level. The outcome of the tool is a regional-level Digital Twin that encompasses all the gathered information, at different scales (building block, district, municipality, province or region). This service serves as a basis for calculations in other services. These services rely on the building information extracted from the regional-level Digital Twin to derive the necessary parameters for calculations and estimations. User experience The aim of the Digital Twin at the regional level is to provide information regarding buildings within a specific region. The information displayed is less detailed compared to digital twins at other levels, primarily focusing on the building’s geometry and some general parameters. However, if this service is complemented by other services and the Digital Twin is enriched, it can present more useful information. The visualization of the digital twin at the regional level can be generated through website applications using Leaflet [72], a JavaScript library that can be accessed through standard web browsers. The following Fig. 11 the view of the DT at the regional level. The usefulness of this energy-focused digital twin can be primarily tested in decision-making processes. It becomes crucial to have a clear understanding of

Modular Big Data Applications for Energy Services in Buildings …

75

Fig. 11 Digital twin at regional level, and information provided about one specific building

energy usage, consumption patterns, geographical distribution, and the potential impact of external factors on the regional level. Application in a specific context The generation of the digital building twin at regional level has been applied in different projects: MATRYCS, I-NERGY and BD4NRG. For Castilla y Leon region, in MATRYCS and BD4NRG projects a Digital Twin has been generated using information from Spanish cadastre and Spanish catalogue of building elements. Relevant data for the 2248 municipalities of Castilla y León region have been collected: 2248 Building files, 2248 Building Part files, 2248 Cadastral Parcel files and Cadastral Zones files. In the I-NERGY project the same process was done with for Principado de Asturias region, in this case for the 76 Municipalities in the region the same types of files were collected for its processing. Besides, in MATRYCS the creation of the DT was applied to Gdynia (Poland) municipality using different basis information: OpenStreetMap. In this case, the main difficulty was the lack of information associated with the buildings on the map, being critical the absence of year of construction and descriptive labels that could help to define the use of the building. In this case, information extracted from Urban Atlas was processed in order to improve the identification of the building use and information on the year of construction was added in a semiautomatic way.. With these estimations the DT has been obtained with an acceptable degree of accuracy. Replication possibilities and envisaged next steps The replication possibilities of this applications are limited by the weaknesses of the tool, which in turn, are closely related to the availability of open data in the area to be applied. The main potential hurdles of the tool are: • Creating a Digital Twin without accurate information for the basis is challenging. In some cases, the available data, such as OpenStreetMaps, may lack crucial

76

G. Hernández Moral et al.

details like building height and year of construction. Additionally, there might be inconsistencies in the data distribution. For example, the Spanish cadastre data may be incomplete for certain areas and, for the case of Autonomous Communities of Navarre and the Basque Country, the cadaster has a different structure. As a result, the analyses conducted using such incomplete data may yield results that are not entirely valid in certain cases, making necessary the adoption of further estimations and process of other specific sources. • Certain European countries may lack access to specific data sources, requiring the identification of feasible alternatives. For example, due to variations in information availability, TABULA/EPISCOPE cannot be used for all European countries for the generation of the Digital Twin. • Sometimes the data sources might be incomplete or lacking completeness, or even they could present inconsistencies in the structure or the content of the data. One example is the TABULA data source, where there is no strict coincidence between building typologies in different countries, and also some of the climatic regions from the countries might not be covered in certain cases, forcing the usage of typologies features that do not correspond to the proper climatic region inside the given country. In any case, although the direct replication is not always possible in all the regions, some adaptations could be done in order to generate a Digital Twin for the different locations. Regarding the next steps for improving the tool, they could entail: • Introducing other type of Open Data available at European level, in order to improve the process and increase the system’s accuracy. • Allowing the user to correct the data of the Digital Twin manually, but in an easy way in order to introduce more precise information for specific buildings or sets of buildings. • To automate the processes of data gathering, especially the ones regarding the yearly weather values, indicators linked to building usage (occupancies, handling of electric appliances, heating usage routines, etc.).

3.2 Technical Building Management Systems Once a digital twin is created, especially at building level, it is particularly useful to link to it the data obtained from a network of sensors that enable to observe its performance at different moments of the day and during different seasons. The analysis of this data, coupled with the functionalities offered by a digital twin, can enable building managers to optimise functioning hours of energy systems, couple energy demand with renewable energy sources, etc. However, for all of this to happen, the installed network of sensors should function adequately. For these reasons, the implementation of analytics to detect the system malfunctioning and generate alarms should be one of the first steps.

Modular Big Data Applications for Energy Services in Buildings …

77

A. Main objective and challenge addressed In order to support the energy management of a building, the main objective of this service is to automatically detect faulty sensors installed in different buildings. The following specific objectives are proposed: (1) create a coherent aggregation of data from multiple sources, (2) pre-process and filter data to obtain a consistent dataset, (3) create a neural network to obtain a trained model using the dataset mentioned above, (4) generate alarms due to faulty sensors and (5) create a historical record of alarms. 2 Solution design, data used and steps covered To achieve the abovementioned objectives, the architecture of the digital twin tool at building level has been adapted by introducing a trained model for alarm generation. The architecture for this is shown in the Fig. 12. A new alarm management page has been added to the frontend of the Digital Twin Service application. The Technical Building Management Service has also been developed as a Single-Page Application (SPA), so that all the functions are available in a single HTML page. The data used by the Technical Building Management Service comes from the data provided by the sensors (value, unit of measure, date and time of capture, …). This data is filtered and stored in a Mongo database with the aim of training an LSTM model. Once the trained model is obtained, the stored data is used to compare the predicted data with the actual data, and if the two datasets differ, an alarm is generated. The service is focused on the identification of sensors that do not work correctly, comparing the actual measurements with the modelled behaviour. In this line, sensor data prediction has been modelled using machine/deep learning techniques, specifically artificial recurrent neural network (RNN) architecture such as long-term shortterm memory (LSTM). Long Short-Term Memory networks are a type of recurrent

Fig. 12 Technical building management systems

78

G. Hernández Moral et al.

neural network capable of learning order dependency and recognising patterns in data sequences. What makes RNNs and LSTMs different from other neural networks is that they take time and sequence into account, they have a temporal dimension. Therefore, LSTMs are well suited to classify, process and predict time series with time lags of unknown duration (Figs. 13 and 14). In this context, Python is a versatile programming language commonly used for neural networks due to its extensive libraries such as TensorFlow [73] and PyTorch [74], which enable efficient implementation and training of deep learning models. An LSTM model has been developed using the TensorFlow library in order to generate alarms when a faulty sensor is detected. 3 User experience The user experience of this service can be summed up in these simple steps: (1) access the service, (2) select the TBM service, (3) navigate and consult the available information and finally (4) exit the service. The login option is displayed first when one connects to the web application of this service. The service is integrated with the Security Enabler, which means that the functionalities of the service will only be available to users who have provided valid credentials on the Sign In page. Once access is granted, same home page as Digital Twin at building level is shown. Using the Services’s option, it is possible to choose the Technical Building Management Service. Once a pilot is selected, the alarms activated for that pilot are displayed in red as shown in the Fig. 15. It is also possible to view all historical data by selecting “All alarms” as shown in the Fig. 16. Other options available in the service include the application of filters available on the right side of the page. These filters make it possible to insert a range of data or select a specific sensor. Once the filter is selected, clicking on the “Filter” button will display only those values that match the selected conditions. It is also possible

Fig. 13 Model training for the TBM service using energy data from BTC tower

Modular Big Data Applications for Energy Services in Buildings …

79

Fig. 14 Model training for the TBM service using temperature data from FASADA

Fig. 15 TBM service–activated alarms

to reset all filters by clicking on the “Reset Filter” button. Finally, when the Sign Out option is clicked, the application logs the user out and navigates to the Sign In page. 4 Application in a specific context

80

G. Hernández Moral et al.

Fig. 16 TBM service–all alarms

The Technical Building Management Service has been tested and validated in the same pilots within the MATRYCS project as the Digital Twin at building level. The Fig. 17 shows an alarm that has been activated by the FASADA pilot (LSP2: the kindergarten located in Gdynia, Poland). It is possible to see that the sensor with ID 605,503, which is located in the kitchen, will stop transmitting data on 1 July 2021. The alarm will remain active until this sensor sends data. On the other hand, selecting “All alarms” it is possible to see in a table records from historical data for monitoring each sensor to see the periodicity of an alarm or simply as a consultation. Historical data from the same pilot is shown in the Fig. 18.

Modular Big Data Applications for Energy Services in Buildings …

81

Fig. 17 TBM service–activated alarms for FASADA pilot

Fig. 18 TBM service–historical data from FASADA pilot

By clicking on the “View data” button (right side of the table), the data of the selected sensor can be visualised in the time graph, with the aim of analysing where the error is coming from. In this case, as an example, an alarm has been selected from a moisture sensor with ID 604,535, which is located in a room, whose values are shown in the Fig. 19. In the time graph it is possible to see that the value is stacked.

82

G. Hernández Moral et al.

Fig. 19 TBM service–moisture sensor from FASADA pilot

The time graph shows that the value is stuck and the table shows that the value has been stuck for five hours. 5 Replication possibilities and envisaged next steps The technical building management system deployed could be used for any service that has sensors that provide time series, not just for temperature sensors, for example. However, if the sensor type is different from those used in the pilots mentioned in the previous sections, a new model must be trained using the neural network. The steps that should be followed to replicate the service include: (1) creating a coherent aggregation of data from each selected sensor, (2) pre-processing and filtering data to provide a consistent dataset, (3) using the neural network already created and the dataset mentioned above, train the model. As these steps are computer intensive, they require the availability of a powerful computer. Another interesting point is that a large amount of data is needed to adapt to the real data. An application for managing sensor alarms can be very useful in a variety of contexts. Here are some features that could be useful for such an application: • Custom alarm configuration: Allow users to set alarms based on different types of sensors, such as motion sensors, door/window sensors, temperature sensors, etc. Users should be able to set thresholds and adjust the sensitivity of the sensors according to their needs. • Real-time notifications: The application should be able to send instant notifications to the user when an alarm is triggered. This can be done via push notifications on the mobile device or even via text messages, phone calls or emails.

Modular Big Data Applications for Energy Services in Buildings …

83

• Activate/deactivate alarm mode: Allow users to activate or deactivate alarms as needed, even if they are not at home or where the sensors are located. • Multiple location settings: If the application is used in multiple locations, it is helpful to allow users to configure this location and manage alarms independently for each location.

3.3 Energy Savings Calculation Based on IPMVP Based on monitoring or calculated data derived from the building, additional services can be applied to further analyse the energy consumption of a building, and be able to evaluate the impact an energy conservation measure has on its consumption. As a result, real energy savings can be obtained. To implement these calculations, and not to bias the results due to changes in the context of the building, or climatic circumstances, it is necessary to normalise the data. To do so, a protocol has been devised: International Performance Measurement and Verification Protocol (IMPVP) [47]. Its automated application can support in the development of energy performance contracts, and thus, support energy efficiency renovations by increasing trust by owners on the results to be reaped after the implementation of energy conservation measures. A. Main objective and challenge addressed The objective of the tool is to help the user to create a mathematical model to be used in the calculations of the IPMVP (International Performance Measurement and Verification Protocol) of a building. Also, the tool allows the user to measure and verify the energy and economic savings achieved after the implementation of Energy Conservation Measures (ECMs) in a building following the IPMVP, in particular, by using option C. This option C is the one suitable for complete facilities with availability of data before and after the implementation of the energy efficiency measures. To this end, the tool has been split into two different sections, the first one is related to the definition of the plan and its configuration, and the other one is related to savings calculations. 2 Solution design, data used and steps covered As previously described, the service covers two different operations of the process of the definition and calculation of the IPMVP. To this end the design of the service has followed the specifications of the IPMVP on how to define the protocol, calculate mathematical models and perform savings calculations. More technically speaking the service has been developed to be a web application based on the Python programming language and other frameworks such as Bootstrap [56], Flask [75], etc. as shown in the Fig. 20.

84

G. Hernández Moral et al.

Fig. 20 IPMVP service architecture

Fig. 21 Measurement and verification plan

The functioning of the service works as follows. After authentication on the service, the user must create a new project and define the general project data (type of building, location, ECMs, etc.) through a questionnaire/form. Then the user should upload the reference data (energy consumption, independent variables and static) in the service. The reference data could be uploaded to the tool in different ways such as CSV files, XLSX files, JSON… Once the tool obtains the data from the user, the tool runs, analyses and processes the data through the statistical engine and provides the user with the optimal mathematical model of the reference scenario to predict the savings (one mathematical model for each type of fuel). The mathematical model will be shown to the user through a virtual interface. The user has the opportunity to stop here and obtain the measurement and verification (M&V) plan if the only information available is the reference data (Fig. 21). In the second step, if the user provides the tool with the post-retrofit data (energy consumption, independent variables, changes in the static factors, energy price), the tool will run again using the mathematical model generated in the previous step and the savings calculation algorithms providing the user the corresponding reports. These reports will be shown directly in the service interface. Depending on the information provided to the tool, the tool will provide the energy savings report and the economic reports. These savings reports can be calculated periodically (e.g. each year the savings can be calculated) (Fig. 22).

Modular Big Data Applications for Energy Services in Buildings …

85

Fig. 22 Savings report generation (once ECMs have been implemented in the building)

3 User experience The service has been developed to be used not only by an expert user, but also by users that are not familiar with the definition, calculation and application of the IPMVP. To this end, the user interface has been developed as simple as possible giving the users a really good experience when using the service. The following paragraphs describe the use of the service step by step, which can be divided into two main functionalities: (1) creating a new plan and (2) creating a new report. Part 1: Creating a new plan When starting a new plan, the tool offers the users two different possibilities: to create a new building or to apply actions on a building (if there are already defined buildings). Once the building is defined or selected, a series of “ACTIONS” can be performed: (1) Manage Data, (2) Manage Plan and (3) Delete building (Fig. 23). • “Manage Data” gives the user the possibility to import data from different sources. In particular, the tool allows to upload a.CSV file containing data for the baseline. It is possible to upload several files at once (it takes several minutes to import). In this step, it is important to note that the format of the CSV is currently based on.CSV files that VEOLIA uses regularly, to import data to the tool the internal format of these.CSV files must be maintained. • “Manage Plan”. After importing data, clicking on “back to Building Plans” and clicking on “Manage Plan” from the list on “Select an existing building” a new screen appears. This screen follows the same layout as the previous one. On the left the tool gives the users the possibility to create a New Plan and the right part offers the possibility to interact with previously defined plans. As before, in case there are no previously defined plans, users must create a new one. It is recommended to create different plans for each fuel existing in the building users are measuring. In other words, it is very convenient to create a plan for the electrical energy, another plan for fuel, another plan for heat energy, etc. it depends on the different energy sources available in the building.

86

G. Hernández Moral et al.

Fig. 23 IPMVP service: starting a new plan

Going “Back to Buildings” once users have created a new plan, users have the possibility to perform the following actions to the plan: (A) Select variables for the plan, (B) Add ECM to plan, (C) Generate Model, (D) View plan or (E) Delete plan. • Select variables for the plan. Clicking on “Select variables for the Plan” from the list on “Existing Plan” users go to the following screen. On this screen the users must select which of the available variables they want to use to calculate the mathematical model. The users must consider that this selection of variables is very important for the correct calculation of the model and it depends on the kind of energy users are evaluating. As it can be seen in the previous picture there are four different columns for each variable where users must select (for each column and variable) the desired feature. The meaning of the four columns is: • “Temperature”. The tool needs to know which is the temperature used to calculate the Heating Degree Day (HDD) and the Cooling Degree Day (CDD). Users indicate this by clicking on the symbol of the variable TEMPERATURAEXTERIOR15minute (for this specific case). • “Dependent”. Users must select this feature when the variable is result of different processes, for example the heat consumed by a building. It is not an independent variable because its variation depends on other external factors such as the use of the boiler. Users can select several dependent variables considering that these

Modular Big Data Applications for Energy Services in Buildings …









87

variables will be merged and used like the final energy consumption that will be used to calculate the model. The dependent variables are the ones that will be used as energy consumption in the calculation of the mathematical model. The dependent variables will also be used subsequently in the reporting period to calculate the energy consumption savings that have been achieved. “Accumulated”. This column indicates whether the tool must use the correspondent variable like accumulated instead of current value. Users must select this feature when variables correspond to energy meters or another variable whose measure is accumulated in time. “Include Variable”. Indicates the tool that that specific variable will be used to calculate the mathematical model. All the previously selected variables as Dependent, Temperature and Accumulated must be also selected as Included. Other variables not accumulated, dependent or temperature must be also included in the model. These will be the independent variables. For example, in this specific case (as seen in the previous picture) the variable HORASFUNCIONAMIENTO CALDERA215minutos is included to calculate the model, it is not accumulated neither dependent variable and neither temperature (Fig. 24). “Add ECM to Plan”. Once all the necessary variables (the ones users want to use to calculate the model) are configured, clicking on “Add ECM to Plan” in the list of “ACTIONS”, users go to the following screen, where some descriptions about the used ECMs are given. Fulfilling this screen is not mandatory because it is just a descriptive form but it is useful for the users in order to know which ECMs are going to be implemented on the building. The tool not only offers the possibility to select an ECM from a list and then add it to the plan by clicking on “Add ECM”, but also the possibility to create new ECMs by clicking on “Define and apply a New ECM to the building” (Fig. 25). “Generate model”. When all variables have been configured and the ECMs have been defined, users can calculate the mathematical model by clicking on “Generate Model” in the list of “ACTIONS”, opening the next screen. Here users must select the aggregation of data, which means the period of time that users want to define to calculate the mathematical model. Values of accumulated variables (explained in previous sections) will be calculated using the selected aggregation period. There are five different options, 15 min, Hourly, Daily, Weekly and Monthly. Also, users must select the model calculation method. Currently there are two different possibilities, Linear Regression, which is the most common for this kind of models, and Random Forest. Once users select these two parameters, after clicking on “Calculate Model” the tool starts to calculate the mathematical mode. It takes several minutes depending on the imported data and the selected variables. The calculation engine tries to reduce the number of independent variables in order to make the model as simple as possible. To do so, the first step is to see the level of correlation among variables, then, the mathematical model calculator deletes the variables that are less significant or have a high value of correlation with other variables. Once this step is finished, the calculation engine is able to calculate a proper model and, finally, the tool shows an estimation on the real behavior of

88

Fig. 24 IPMVP service: variables selection

G. Hernández Moral et al.

Modular Big Data Applications for Energy Services in Buildings …

89

Fig. 25 IPMVP service: selecting ECMs to be included in the plan

the building and the one that is result of the mathematical model. Additionally, when all these steps are finished, the tool shows a summary about the model and statistics on the results obtained. • “View plan”. Users have the possibility to have a look on the plan by clicking on “View Plan” in the list of “ACTIONS” in the “Select Building Plan” screen. By doing so, the application displays a summary of the created plan (Fig. 26). At this point the mathematical model for the IPMVP would be completed and users will be available to make savings estimation going through the other way the tool offers in the main screen, clicking on “New Report” (Fig. 27). Part 2: Creating a new report As previously described procedure, the tool guides the users step by step. This process is simpler to the previous one, so, some of the following screens will not be described in depth. The user would need to select the desired building, then the desired plan, and finally the desired model to be reported. It is worth to highlight again that different models can be calculated depending on the selected variables and the selected aggregation.

90 Fig. 26 IPMVP service: view plan

G. Hernández Moral et al.

Modular Big Data Applications for Energy Services in Buildings …

91

Fig. 27 IPMVP service: creating new report

When users arrive to this form (Fig. 28), they have to input data regarding the variables shown in the screen. The first variable is the dependent variable. As the tool shows in this specific case (Fig. 28) the dependent variable is the sum of two different variables (this must be aligned with the selected as dependent variables when selecting them). The rest of the variables are independent variables. It should be noted that all variables must be included using the same aggregation period as the one configured to calculate the model, if this is not done in this way, calculations will be incorrect. Then, users must input also the price of the energy for the specific period of time they want to calculate the savings and the defined aggregation. For example, they can introduce the mean value of the price for a specific month (when the selected aggregation is monthly). After completing this input of data, the tool calculates savings for that specific period of time, price and aggregation and shows them in the next screen (example of a specific case, Fig. 29). 4 Application in a specific context The data used to develop and test the service are linked to two large-scale pilots, in the MATRYCS and I-NERGY projects, where in both cases the data has been provided

92

G. Hernández Moral et al.

Fig. 28 IPMVP service: creating new report, various screenshots

Fig. 29 IPMVP service: savings calculation

by Veolia. The service has been tested using these data and for several mathematical models that have been configured/calculated with successful results.

Modular Big Data Applications for Energy Services in Buildings …

93

The data used for the validation of the service were provided by VEOLIA and corresponds to available data of one of their buildings acting as a demo site in the MATRYCS project. 5 Replication possibilities and envisaged next steps This service is able to be used in any other building, just by considering that the format of the input data must follow the format that has been used for the development of the service. When importing data, the service is currently using several .CSV files that VEOLIA provided and the process of extracting data is based on the columns included in these files. This format could be changed in order to admit any other data and formats, for example, JSON data could be included just changing in the service the module that reads data from files. Other potential improvement of the service is the inclusion of the rest of the options of the IPMVP, option A, B and D. The differences in between these options have been explained in Sect. 2.

4 Acceptance of Solutions by the Energy Value Chain Big data service developments provide extremely useful functionalities, but, in order for stakeholders to reap the benefits from using them, it is necessary to assure that these services comply with their needs. This is performed via an adequate user requirement gathering (before the service is developed) and through assessing the user satisfaction after the service is developed. Additionally, the evaluation of the impact achieved with the implementation of the services is of the utmost importance. In this line, in both MATRYCS and I-NERGY projects, a similar approach for the deployment of the validation of the services in the different pilots where they are applied is set. This is materialised in the different elements contained in the evaluation framework, which is used to track the progress of the pilots throughout the project and evaluate its impact. This has proven useful in MATRYCS and I-NERGY contexts, counting on 11 pilots on the one hand, and 9 pilots and 15 use cases on the other. Thus, this approach can serve as a reference for projects needing large-scale pilots impact assessment, as well as services validation.

4.1 Evaluation Framework The evaluation framework deployed in these projects consists of three main pillars: (1) Strategy and general context, (2) Data, infrastructure and digital technologies, and (3) user satisfaction methodology. These three pillars are complemented by two further pillars: (4) Main stakeholders, and (5) Procedures to personalise the tools and services. Each of these elements contributed in a different manner throughout

94

G. Hernández Moral et al.

the project, considering the different cycles of work established to structure the developments. 1. Strategy and general context. Within this pillar the KPIs framework is defined. This framework is used to evaluate the Large-Scale Pilots (LSPs) impact, and it contains different defined KPIs (some common for all pilots and others specific based on the pilots’ objectives and services deployed). 2. Data, infrastructures and digital technologies. Within this pillar the basis upon which the analytics that are provided were determined. Starting from a preliminary identification of data availability and the baseline assessment per pilot; which assessed is at the end of the project through the achievement of fundamental targets through the new digital technologies. 3. User satisfaction. The objective of the user satisfaction methodology is to validate the analytics services by end users through questionnaires, live demonstrations or workshops in the different stages of the projects. It is used also to gather feedback for analytics developers to improve them in intermediate stages. 4. Main stakeholders. The objective of this pillar is to complement the three previous pillars and especially the user satisfaction pillar, through the identification of target groups (defined types of stakeholders) and personas (fictional characters created in the pilots to further specify casuistry of an identified target group, to better capture their user context, needs and challenges). 5. Procedures to personalise the tools and services. This last pillar of the evaluation framework consists of methods to personalise the different tools and services developed in the two projects in the different pilots where they applied. This is fostered in different manners, by (1) analysing how the same services are applied and adapted in each pilot, (2) by observing usage scenarios created in the pilots, (3) by incorporating users’ feedback (from user satisfaction feedback rounds) (Fig. 30).

Fig. 30 Evaluation framework structure defined for the projects

Modular Big Data Applications for Energy Services in Buildings …

95

4.2 User Satisfaction Methodology The user satisfaction methodology was defined based on the main aim of gathering feedback from users validating services and solutions and to measure how satisfied users are. To measure these aspects, it consists of five main pillars, divided in two groups according to their close relation to the pilot challenges (effectiveness) or to technical aspects linked to the functioning of services and tools (efficiency, satisfaction, safety and usability), as reflected in the Fig. 31. On the one hand, the ‘effectiveness’ pillar measures if the needs from the pilots are adequately addressed at the end of the services development, and if the services or analytics provided are effective to meet their needs and goals. Questions were defined according to pilot’s objectives and main challenges. On the other hand, the ‘efficiency, satisfaction, safety and usability’ group assesses the services and analytics deployed in the projects from the user’s perspective. This group contains general questions related to the overall functioning of the services (common to all services) and services- or analytic-specific questions, that serve to evaluate if the analytics and services are performing as expected and if the users are satisfied with them, as well as if there is any risk or if they are easy to use. Once all questions were defined, they were digitalised in order to obtain the answers from validators in a same structured way from all different pilots, as well as to get logical and ordered results. Due to the “subjectivity” in answering these questions, they are evaluated through Likert Scales (it is a bipolar scaling method, measuring either positive or negative response to a statement), to be able to measure the qualitative feedback into a quantitative manner. For the effectiveness questions for each pilot, the description of the Likert Scales values from 1 to 5 was added related

Fig. 31 User satisfaction methodology structure defined for the projects

96

G. Hernández Moral et al.

to each question. For the other two groups of questions related to the second group of evaluation (efficiency, satisfaction, safety and usability), a general Likert Scale was defined to proper answer the questions as: (1) Very low, (2) Low, (3) Neutral/ Average, (4) Good, (5) Excellent. Through the scores to the questions in each of the groups, percentages of satisfaction in the different pillars of the methodology can be obtained and analysed. In the questionnaires, the name of the user is asked, as well as its entity/company and the type of stakeholder they are, based on the target groups defined in the projects. This allows afterwards to make evaluations based on the target groups that tested the tools and services.

4.3 Preliminary Results from Validation and Next Steps The user satisfaction questionnaires are answered at two different times in the projects, one at an intermediate stage where tools and services are in a draft version, to obtain feedback from internal users (users that belong to the pilot, within the projects) and another one at the end of the project, with tools and services completely developed and with external stakeholders. This last round of validation is still pending in both projects, but from the first validation round it can be observed that the user satisfaction linked to the digital twins is around is over 70% approximately (digital twins at district level have not been validated yet), whereas the technical building management system services is over 82%, and the IPMVP service is around 90%. In the interactions with stakeholders some of the remarks for improvement related to the digital twins at building level and the TBMs were related to the specific modelling of sensors in their exact location, and also the provision of additional filtering functionalities when processing the metered data. In the case of the IPMVP, the functionalities provided were very close to the normal steps an Energy Services Company (ESCO) follows when performing energy savings calculations. This is the reason for having a comparably higher satisfaction, and not so many recommendations for improvement. Nevertheless, a further validation round should be performed at the end of the projects to assess the final user satisfaction, as well as the impact achieved with the implemented services in each of their contexts. Only then, with the services fully refined according to their requests could the satisfaction truly be assessed. All in all, this methodology has proven so far to be useful in tracking and assessing the satisfaction of users and the impact to be achieved with services implementation in a series of pilots. Nevertheless, both projects where this has been applied have not finished yet. Thus, a complete assessment of the adequacy of the method and lessons learned will be reported in the final deliverables of both projects.

Modular Big Data Applications for Energy Services in Buildings …

97

5 Discussion The services presented in this document contribute in the digitalisation and energy transition efforts proposed by the EU, by linking big data to the twin transition. This becomes evident in the way the 5Vs of the big data (volume, variety, velocity, veracity and value) have been contemplated within the services proposed. Volume is a relevant dimension especially in the digital twin at building level and technical building management systems (when considering a vast amount of monitored data to either be shown, or to be used in a model training); or in the digital twin at regional level, where whole regions are analysed and the corresponding geographical information is processed. In addition to the volume, the variety dimension becomes fundamental. Multiple data sources are combined to be able to obtain a solid basis for the digital twins upon which even further types of data coming from additional services can be added. Data interoperability and the usage of common standards, such as IFC or the INSPIRE directive, become fundamental to be able to carry out this process. In addition, the velocity dimension has become especially apparent in the Technical Building Management Systems service, but it could be a potentially important dimension when the services proposed in this chapter are combined with real-time data. Finally, veracity and value are relevant in order to provide robust results to the end-users at the end of the process. Nevertheless, these should be observed at the end of the process, when digital twins are further integrated with additional services. These considerations have had an impact on how the data value chain has been considered in the development of these services. In particular, the data generation acquisition, data analysis processing, and data storage curation. The last step of the data value chain (data visualisation and services) has been partially addressed in the case of the digital twins due to the possibility to add further services on top of this twin representations. The services presented in this chapter have a high replicability potential in other contexts, due to how the design has been set up, as well as due to the reliance on standards (such as IFC) whenever it was possible. Some limitations have been observed in terms of size of files, availability of data, data formats or the necessity to train new models. Based on these aspects, a series of next steps that can serve as future research avenues have been exposed. They are related to technical matters (admitting multiple data formats, allowing upload of larger files, data curation procedures to increase accuracy, or automating processes), to the increase of functionalities (generating custom alarm configuration, real-time notifications, multiple location settings, including more options in IPMVP service), or to the enhancement of the accuracy of services (for instance, by obtaining LOD2 models in the digital twin at district level through the characterisation of roofs). The approaches presented in this chapter generates numerous opportunities that can offer benefits to a series of stakeholders in the building value chain. Especially, when considering the main groups of challenges identified in Sect. 2 (performance, design, fund, and policy) the services proposed can offer advantages and improve

98

G. Hernández Moral et al.

decision-support making. Last but not least, the acceptance of solutions so far has been measured at an intermediate phase, but further feedback will be obtained at the end of the projects where these solutions are developed, and the identification of further research avenues to the ones detected in this chapter will be performed.

6 Conclusions The chapter has presented a series of modular big data applications focused on supporting decision-making in buildings and districts. In particular, a strong focus has been placed on digitalisation and the generation of digital twins at different scales (building, district and region), and some instances of services that could enrich these digital twins and provide additional functionalities: technical building systems and sensor fault detection, and energy savings measurement and verification through the IPVMP. The energy context related to the building stock, digitalisation and the EU vision has been presented. This serves as a guiding thread in the development of the services, as well as the challenges that need to be addressed. In this line, big data technologies and the increase of data availability present the perfect conditions to develop services that enable to yield relevant insights that would not be possible to achieve with traditional approaches, or that have not been implemented in the day to day undertakings of the stakeholders of the building value chain. Three approaches to build digital twins have been presented, one for each scale: building, district, and region. This expands the concept usually applied at building level by applying it at broader scales. In every case, there is a physical layer, data layer and a model layer. The latter has always been interpreted as the addition of services to the base layers of digital twin that have been generated. In this line, some examples are showcased in this chapter by presenting the technical building management systems service, and the energy savings calculation based on IMPVP. In addition, the stakeholders of the building value chain are brought to the spotlight through the exploration of their acceptability of the services proposed. Undeniably, they represent one of the key elements to take into consideration when developing services, since the success and usability of a solution depends on the value end users place on them. To this end, a methodology deployed in three European projects is presented. This is currently being applied, since neither of the three projects have come to an end at the moment of writing this chapter. The methodology does not only include a user satisfaction methodology, but also a complete evaluation framework that can contribute to tracking large scale pilots in European projects. So far, the methodology has worked well in the projects where it has been applied. Nevertheless, an in-depth assessment of lessons learned will be provided at the end of these projects. Finally, the discussion section has explored how the big data components in the solutions proposed have been applied, and which of the 5vs have been exploited in more depth. Especially the “variety” pillar can be highlighted as one of the most prominent in the services proposed. Moreover, the importance of the data value chain

Modular Big Data Applications for Energy Services in Buildings …

99

has also been showcased. Furthermore, even though the services have been applied in specific pilot cases within the projects where they have been developed, special emphasis has been placed on their replication capabilities. It is crucial to develop the services in a modular manner, so that they are as replicable as possible. For this, the main hurdles towards their replication have been identified. Some of them have to do with technical matters, but most of them are linked to lack of data availability, which consequently leads to a certain degree of inaccuracies of the results obtained. In all cases, potential next steps for the development of the services have been identified, where in some cases they are related to broadening the functionalities offered, or enhancing the replicability of the services in order to facilitate an improved market uptake. All in all, it has been demonstrated how modular big data services can contribute to the different challenges that the building value chain have to address. In particular, how the services proposed can contribute to performance, design, fund and policy dimensions of energy in buildings. Even though there is currently a strong momentum in the services development and data generation, since it is well accompanied by technical developments, these developments should be coupled by an adequate uptake and acceptance by end users. Understanding their pain points and main necessities will remain fundamental to ensure that services reap the expected benefits. Only this way the existing wealth of data that can be exploited at European level will be put into action in the development of policies that are more strategic and well-targeted, towards the final goal of decarbonising the building stock and become the first climate neutral continent, as pursued by the European Green Deal. Acknowledgements The work presented in this chapter is based on research conducted within the framework of three projects funded under the European Union’s Horizon 2020 research and innovation programme. Their details are as follows: (1) MATRYCS: Modular Big Data Applications for Holistic Energy Services in Buildings [GA: 101000158] https://www.matrycs.eu/ ; (2) INERGY: Artificial Intelligence for Next Generation Energy [GA: 101016508] https://i-nergy.eu/ and (3) BD4NRG: Big Data for Next Generation Energy [GA: 872613] https://www.bd4nrg.eu/. The authors would also like to express their gratitude to colleagues at FASADA (http://prefasada. pl/en/), BTC (https://www.btc.si/en/), VEOLIA (https://www.veolia.es/), Municipality of Gdynia (https://www.gdynia.pl/), EREN (Ente Público Regional de la Energía de Castilla y León, https://gob ierno.jcyl.es/web/es/consejerias/ente-publico-regional-energia.html), FAEN (Fundación Asturiana de la Energía, https://www.faen.es/), as well as the rest of the projects’ colleagues for their help, fruitful discussions and insights. The content of the paper is the sole responsibility of its authors and does not necessarily reflect the views of the EC.

100

G. Hernández Moral et al.

References 1. Communication from the Commission to the European Parliament, The European Council, The Council, The European Economic and Social Committee, The Committee of the Regions and The European Investment Bank a Clean Planet for all a European strategic long-term vision for a prosperous, modern, competitive and climate neutral economy COM/2018/773 final, https:// eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52018DC0773. Accessed June 2023 2. Communication from the Commission to the European Parliament, The European Council, The Council, The European Economic and Social Committee and the Committee of the Regions The European Green Deal COM/2019/640 final, https://eur-lex.europa.eu/legal-content/EN/ TXT/?qid=1588580774040&uri=CELEX%3A52019DC0640. Accessed June 2023 3. UN Environment Programme-Global Alliance for Buildings and Construction: Global status report, https://globalabc.org/news/launched-2020-global-status-report-buildings-andconstruction. Accessed June 2023 4. European Commission–Climate action. 2050 long-term strategy, https://ec.europa.eu/clima/ policies/strategies/2050_en. Accessed June 2023 5. Communication from the Commission to the European Parliament, The European Council, The Council, The European Economic and Social Committee and the Committee of the Regions a New Industrial Strategy for Europe COM/2020/102 final, https://eur-lex.europa.eu/legal-con tent/EN/TXT/?uri=CELEX:52020DC0102. Accessed June 2023 6. Communication from the Commission to the European Parliament, The Council, The European Economic and Social Committee and The Committee of the Regions a new Circular Economy Action Plan for a cleaner and more competitive Europe COM/2020/98 final, https://eur-lex.eur opa.eu/legal-content/EN/TXT/?uri=COM%3A2020%3A98%3AFIN. Accessed June 2023 7. Communication from the Commission to the European Parliament, The Council, The European Economic and Social Committee and The Committee of the Regions Shaping Europe’s digital future COM/2020/67 final, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:520 20DC0067. Accessed June 2023 8. Communication from the Commission to the European Parliament, The Council, The European Economic and Social Committee and the Committee of the Regions a European strategy for data COM/2020/66 final, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX: 52020DC0066. Accessed June 2023 9. Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts Com/2021/206 final, https://eur-lex.europa.eu/legal-content/EN/TXT/? uri=CELEX:52021PC0206. Accessed June 2023 10. E. Sarmas, V. Marinakis, H. Doukas, A data-driven multicriteria decision making tool for assessing investments in energy efficiency. Oper. Res. Int. J. 22(5), 5597–5616 (2022) 11. P. Skaloumpakas, E. Sarmas, Z. Mylona, A. Cavadenti, F. Santori, V. Marinakis, Predicting thermal comfort in buildings with machine learning and occupant feedback, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv), pp. 34–39. (IEEE, 2023) 12. M.Á. García-Fuentes, S. Álvarez, V. Serna, M. Pousse, A. Meiss, Integration of prioritisation criteria in the design of energy efficient retrofitting projects at district scale: a case study. Sustainability 11, 3861 (2019). https://doi.org/10.3390/su11143861 13. G. Hernández-Moral, V.I. Serna-González, A.M. Crespo, S.S. Rodil, Multi-objective optimization algorithms applied to residential building retrofitting at district scale: BRIOTOOL. In E3S Web. Conf. 362, 03002 (2022). https://doi.org/10.1051/e3sconf/202236203002 14. E. Sarmas, N. Dimitropoulos, V. Marinakis, Z. Mylona, H. Doukas, Transfer learning strategies for solar power forecasting under data scarcity. Sci. Rep. 12(1), 14643 (2022) 15. E. Sarmas, E. Spiliotis, E. Stamatopoulos, V. Marinakis, H. Doukas, Short-term photovoltaic power forecasting using meta-learning and numerical weather prediction independent Long Short-Term Memory models. Renew. Energy 216, 118997 (2023)

Modular Big Data Applications for Energy Services in Buildings …

101

16. Álvaro Samperio, F.J. Miguel, P. Hernampérez, G. Hernández-Moral, E. Vallejo, La herramienta Civis como soporte a la transparencia y evaluación del buen gobierno. VI Congreso de Ciudades Inteligentes, Madrid, (2020). ISBN 9798680179963 17. G. Hernández Moral, E. Vallejo-Ortega, I. Ramos-Díez, N. Ferreras-Alonso, V. SernaGonzález, C. Valmaseda, G. Martirano, F. Pignatelli, F. Vinci, Supporting the design and implementation of a regional energy strategy. Publications Office of the European Union, Luxembourg, (2021). JRC124886 18. European Commission, Joint Research Centre, G. Hernandez-Moral, E. Vallejo-Ortega, I. Ramos-Díez et al.: Supporting the design and implementation of a regional energy strategy– ELISE energy and location applications : final report, Publications Office, 2021, https://doi. org/10.2760/300138. Accessed June 2023 19. G. Hernández-Moral, S. Mulero-Palencia, V.I. Serna-González, C. Rodríguez-Alonso, R. SanzJimeno, V. Marinakis, N. Dimitropoulos, Z. Mylona, D. Antonucci, H. Doukas, Big data value chain: multiple perspectives for the built environment. Energies, 14, 4624 (2021). https://doi. org/10.3390/en14154624 20. T. Lützkendorf, D. Lorenz, Digital twins for sustainable smart building design and operation: a review. Adv. Eng. Inform. 42, 100976 (2019) 21. F. Naghdy, S. Zhang, G. Naghdy, Digital twin in construction industry: recent advances and future trends. Adv. Eng. Inform. 44, 101098 (2020) 22. A. Borrmann, J. O’Donnell, E. Rank, Digital twins in the construction industry: a review. Adv. Eng. Inform. 42, 100954 (2019) 23. X. Wang, W. Wang, Y. Jiang, P. Tang, Digital twin-enabled building information modeling: a review and outlook. Autom. Constr. 117, 103225 (2020) 24. R. Soetanto, P. De Wilde, C. Fortune, Application of digital twins in the built environment: a review. Autom. Constr. 91, 297–310 (2018) 25. F.H. Abanda, A. Elnokaly, M.P. Roddis, Review of digital twins in the built environment: a systematic review. J. Build. Eng. 24, 100761 (2019) 26. Y. Cao, X. Lin, W. Chen, T. Peng, H. Zhang, Building digital twin and its applications. Autom. Constr. 100, 212–226 (2019) 27. X. Zhang, Y. Lu, X. Wang, A review on digital twin for building life-cycle performance management: concept, applications and challenges. J. Clean. Prod. 291, 125762 (2021) 28. E. Sarmas, S. Strompolas, V. Marinakis, F. Santori, M.A. Bucarelli, H. Doukas, An incremental learning framework for photovoltaic production and load forecasting in energy microgrids. Electronics 11(23), 3962 (2022) 29. J.J. Bloem, F. Pignatelli, G. Martirano, M.T. Borzacchiello, C. Lodi, G. Mor, G. Hernández, Building energy performance and location—from building to urban area, Ispra: European. Com, JRC 110645 (2018) 30. Á. Samperio-Valdivieso, P. Hernampérez-Manso, F.J. Miguel-Herrero, E. Vallejo-Ortega, G. Hernández-Moral, City indicators visualization and information system (civis), in 3rd international conference on smart and sustainable planning for cities and regions (Bolzano, Italy, 2019) 31. F.J. Miguel.-Herrero, V.I. Serna González, G. Hernández Moral, Supporting tool for multi-scale energy planning through procedures of data enrichment. Special issue on Tools, technologies and systems integration for the Smart and Sustainable Cities to come. Int. J. Sust. Energy. Plann. Manag. 24, 125–134 (2019). Submitted 30 June 2019, published 30 October 2019. https://doi. org/10.5278/ijsepm.3345 32. S. Álvarez, M.Á. García-fuentes, G. Hernández Moral, V.I. Serna González, S. Martín, Simulation based tool to design energy efficient retrofitting projects at district level: a case study, in 14th Conference on Advanced Building Skins (Bern, Switzerland, 2019), pp. 28–29. ISBN: 978-3-9524883-0-0 33. G. Martirano, F. Pignatelli, F. Vinci, C. Struck, V. Coors, M. Fitzky, G. Hernández Moral, V. Serna-González, I. Ramos-Díez, C. Valmaseda, Comparative analysis of different methodologies and datasets for energy performance labelling of buildings. EUR 30963 EN. (Publications Office of the European Union, Luxembourg 2022). ISBN 978-92-76-46608-6. https://doi.org/ 10.2760/746342. JRC124885

102

G. Hernández Moral et al.

34. Open Geospatial Consortium-CityGML standard, https://www.ogc.org/standard/CityGML/. Accessed June 2023 35. Helsinki’s 3D city models, https://www.hel.fi/helsinki/en/administration/information/general/ 3d/3d/. Accessed June 2023 36. Virtual Helsinki, https://www.virtualhelsinki.fi/. Accessed June 2023 37. S. Rizou, V. Marinakis, G. Hernández Moral, C. Sánchez-Guevara, L.J. Sánchez-Aparicio, I. Brilakis, V. Baousis, T. Maes, V. Tsetsos, M. Boaria, P. Dymarski, M. Bourmpos, P. Pergar, I. Brieze, Buildspace: enabling innovative space-driven services for energy efficient buildings and climate resilient cities. EGU Gen Assembly 2023, Vienna, Austria, 24–28 (2023). EGU234298. https://doi.org/10.5194/egusphere-egu23-4298 38. Esri, https://www.esri.com/en-us/home. Accessed June 2023 39. Autodesk, https://www.autodesk.com/. Accessed June 2023 40. Supermap, https://www.supermap.com/en-us/list/?152_1.html. Accessed June 2023 41. Y. Cho, D. Kim, Smart building energy management systems: a review of architecture, user behaviour, and data analytics. Renew. Sustain. Energy Rev. 91, 1179–1193 (2018) 42. W. Lee, S. Kim, Smart building management systems: a review. Appl. Sci. 9(4), 647 (2019) 43. M. Almasri, I. Zualkernan, Internet of Things (IoT): architecture, protocols and services, in Internet of Things (IoT) in five days (Springer, 2016), pp. 19–41 44. Z. Ma, W. Qiao, D. Yang, Advances in IoT-based smart building automation systems: a survey. IEEE Access 9, 8277–8296 (2021) 45. J. Fumo, D. Xu, Building energy management systems: a review. Appl. Energy 257, 113995 (2020) 46. M. Coccia, G. Concas, A. Gorreri, F. Pellizzoni, A review of digital twin: concepts, achievements, and opportunities. J. Clean. Prod. 198, 1358–1374 (2018) 47. Efficiency Valuation Organisation (EVO): International Performance Measurement and Verification Protocol (IPMVP), https://evo-world.org/en/products-services-mainmenu-en/protocols/ ipmvp. Accessed June 2023 48. MATRYCS: Modular big data applications for holistic energy services in buildings [H2020] GA: 101000158, https://www.matrycs.eu/. Accessed June 2023 49. I-NERGY: Artificial intelligence for next generation energy [H2020] GA: 101016508, https:// i-nergy.eu/. Accessed June 2023 50. BD4NRG: Big data for next generation energy [H2020] GA: 872613, https://www.bd4nrg.eu/. Accessed June 2023 51. E. Sarmas, E. Spiliotis, V. Marinakis, G. Tzanes, J.K. Kaldellis, H. Doukas, ML-based energy management of water pumping systems for the application of peak shaving in small-scale islands. Sustain. Cities Soc. 82, 103873 (2022) 52. C. Tsolkas, E. Spiliotis, E. Sarmas, V. Marinakis, H. Doukas, Dynamic energy management with thermal comfort forecasting. Build. Environ. 237, 110341 (2023) 53. E. Sarmas, E. Spiliotis, V. Marinakis, T. Koutselis, H. Doukas, A meta-learning classification model for supporting decisions on energy efficiency investments. Energy Build. 258, 111836 (2022) 54. T. Testasecca, M. Lazzaro, E. Sarmas, S. Stamatopoulos, Recent advances on data-driven services for smart energy systems optimization and pro-active management, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv) (IEEE, 2023), pp. 146–151 55. Angular, https://angular.io/. Accessed June 2023 56. Bootstrap, https://getbootstrap.com/. Accessed June 2023 57. Xeokit, S.D.K., https://github.com/xeokit/xeokit-sdk. Accessed June 2023 58. Plotly, https://plotly.com. Accessed June 2023 59. Python, https://www.python.org/. Accessed June 2023 60. Axios, https://axios-http.com/. Accessed June 2023 61. Node, https://nodejs.org/. Accessed June 2023 62. Xeokit-Converter, https://xeokit.github.io/xeokit-convert/. Accessed June 2023 63. D.B. Mongo, https://www.mongodb.com/. Accessed June 2023

Modular Big Data Applications for Energy Services in Buildings …

103

64. JPI Urban Europe–Postive Energy Districts (PED), https://jpi-urbaneurope.eu/ped/. Accessed June 2023 65. Cesium–3D tiles, https://cesium.com/why-cesium/3d-tiles/. Accessed June 2023 66. Stadt Zürich–Citygml to 3Dtiles converter, https://data.stadt-zuerich.ch/showcase/citygml-to3dtiles-converter. Accessed June 2023 67. OpenStreetMap, https://www.openstreetmap.org. Accessed June 2023 68. Copernicus Europe’s eyes on Earth–Land Monitoring Service, https://land.copernicus.eu/local/ urban-atlas. Accessed June 2023 69. INSPIRE Services of Cadastral Cartography, https://www.catastro.minhap.es/webinspire/ index_eng.html. Accessed June 2023 70. Episcope. Tabula 2020, https://episcope.eu/welcome/. Accessed June 2023 71. Building Stock Observatory, https://energy.ec.europa.eu/topics/energy-efficiency/energy-eff icient-buildings/eu-building-stock-observatory_en. Accessed June 2023 72. Leaflet, https://leafletjs.com/. Accessed June 2023 73. TensorFlow, https://www.tensorflow.org/. Accessed June 2023 74. PyTorch, https://pytorch.org/. Accessed June 2023 75. Flask, https://flask.palletsprojects.com/en/2.3.x/. Accessed June 2023

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems Jonas Van Gompel, Domenico Spina, and Chris Develder

Abstract Faults in photovoltaic (PV) systems due to manufacturing defects and normal wear and tear are practically unavoidable. The effects thereof range from minor energy losses to risk of fire and electrical shock. Thus, several PV fault diagnosis techniques have been developed, usually based on dedicated on-site sensors or high-frequency current and voltage measurements. Yet, implementing them is not economically viable for common small-scale residential systems. Hence, we focus on cost-effective techniques that enable introducing fault diagnosis without incurring costs for on-site sensor systems. In this chapter, we will present in particular two machine-learning-based approaches, built on recent neural network models. The first technique relies on recurrent neural networks (RNNs) using satellite weather data and low-frequency inverter measurements for accurate fault detection, including severity estimation (i.e., the power loss caused by the fault, usually not quantified in stateof-the-art methods in literature). The second technique is based on graph neural networks (GNNs), which we use to monitor a group of PV systems by comparing their current and voltage production over the last 24 h. By comparing outputs from multiple (geographically nearby) PV installations, we avoid any need for additional sensor data. Moreover, our results suggest that the GNN-based model can generalize to PV systems it was not trained on (as long as nearby sites are available) and retains high accuracy when multiple PV systems are simultaneously affected by faults. This work was supported in part by the DAPPER project, which is financed by Flux50 and Flanders Innovation and Entrepreneurship (project number HBC.2020.2144). This chapter is based on published papers [1, 2] by the same authors, from which we reproduce selected results. J. Van Gompel · D. Spina · C. Develder (B) IDLab, Department of Information Technology, Ghent University–imec, Technologiepark Zwijnaarde 126, 9052 Gent, Belgium e-mail: [email protected] J. Van Gompel e-mail: [email protected] URL: https://ugentai4e.github.io D. Spina e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_4

105

106

J. Van Gompel et al.

Keywords Photovoltaics · Predictive maintenance · Fault detection · Recurrent neural networks · Graph neural networks · Time series classification

1 Introduction In facing today’s climate change challenges, the ambition has been expressed to limit the rise of global temperature to +1.5.◦ C, which amounts to achieve net zero emissions by 2050 [3]. Since solar photovoltaic (PV) power generation forms a crucial element of realizing these ambitions, we observe a rapid increase of the number of installed PV capacity in recent years. While PV cell efficiency has been improving steadily, they still can fail, and PV system faults may cause non-negligible energy losses, especially when such faults remain undetected [4]. To limit the impact of such losses and thus maximize the efficiency and lifetime of PV installations, predictive maintenance solutions to detect and identify PV system faults from their earliest manifestation are essential. Indeed, since climate change is expected to incur more extreme weather events, PV systems will likely be subject to higher levels of thermal and mechanical stress, which in turn impacts some of the possible PV faults. Thus, we expect the importance of widespread PV fault diagnosis only to increase. The types of faults that may occur in PV systems include short circuits, wiring degradation, hot spots, etc. The majority of such PV faults can be identified through visual inspection either by human technicians, or even infrared thermography with drones. Yet, such inspection is costly and therefor typically not adopted for smaller scale systems: the majority of PV systems thus remains unmonitored [5]. To provide cost-effective fault diagnosis, we explore methods based on artificial intelligence: these methods can achieve reasonably high fault detection accuracy, while avoiding the deployment of costly sensor infrastructure and without requiring to define systemspecific detection thresholds [6]. In Sect. 2, we will first give a brief overview state-of-the-art models for PV fault diagnosis based on machine learning [7], and classify them in 3 categories. Subsequently, we will present and analyze two recent neural network models we proposed for this task: (i) a recurrent neural network (RNN) model (Sect. 4) taking a single site’s local PV installation’s current and voltage measurements and satellite based weather information as input, and (ii) a graph neural network (GNN) model (Sect. 5) to detect faults from multiple sites, without using any weather information at all. Both models are developed to detect and identify 6 different fault types. The RNN model also estimates their impact severity in terms of relative reduction of the PV installation’s output power as a result of the defect. Note that in this chapter we will outline the models’ architectures and key results from our experiments. For in-depth discussions, we refer to [1] for the RNN model and [2] for the GNN. We summarize our key take-away messages in the concluding Sect. 6.

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

107

2 Related Work We will discuss PV fault diagnosis methods according to the type of information they use as inputs to d01etermine whether the PV system is not performing as expected (and possibly determine the type of underlying fault): (1) Comparison of the expected and actual power output (or the actual current I and voltage V profiles that determine it), which essentially relies on environmental information (i.e., irradiance) that is typically gathered from sensors; (2) Pure current (I) and voltage (V) based classification, i.e., without relying on environmental sensor information, which thus needs more detailed measurements in terms of either (i) I-V curves, or (ii) high frequency transient behavior in Impp and Vmpp , i.e., the current and voltage at maximum power point (MPP); (3) Comparison with reference PV output, which may be either (i) on the module level, i.e., comparing to a reference module at the same PV site, or (ii) from a geographically nearby PV site. For each of these 3 categories, the following subsections discuss associated related works, which are summarized in Tables 1, 2 and 3 (wherein we also position our proposed methods, as discussed and analyzed in detail in Sects. 4 and 5).

2.1 Comparison to Expected PV Output To assess whether a PV system suffers from any defect, a common method is to compare its measured output to a prediction of the expected output, given the current environmental conditions [8–14]. Such prediction is realized either using physicsbased PV simulations [8, 10, 11] or rather through data-driven regression [12, 13]. The environmental conditions in both cases include solar irradiance and ambient and/or PV module temperature. Both [12] and [13] adopt a multilayer perceptron (MLP) for prediction. De Benedetti et al. [12] define a threshold on the difference between measured and predicted power output, above which a failure is assumed. Also Jiang et al. [13] define a threshold, this time on 1 kHz voltage and current measurements versus predictions: such high-frequency measurements allow further identification of the failure type, yet are costly to collect and process. Chine et al. [8] rather assumes I-V curve characteristics, both from the actual PV system and a simulated counterpart, which thus requires an I-V curve tracer (which again is not cost-effective in typical residential set-ups). Assessing deviations between predicted and measured outputs, and thus determining faults based on that deviation exceeding a predefined threshold, seems conceptually simple. Yet, the definition of appropriate thresholds requires expert knowledge and becomes increasingly complex when multiple different fault types need to be identified [6]. Therefore, more advanced machine learning models have

108

J. Van Gompel et al.

Table 1 Methods based on comparison to expected PV output. Bold: our proposed RNN model. (MPP: maximum power point; .Impp : MPP current; .Vmpp : MPP voltage; .Pmpp : power at the MPP; °T: temperature.) Refs.

Machine learning model

Inputs

Implementation cost

# identified fault types

[8]

Multilayer perceptron

.•

I-V curve

High

6

.•

Irradiance

.•

Module °T Medium

2

Medium

2

Medium

4

Medium

Detection only

High

5

Low

6

[9]

[10]

Ensemble model

.• Impp

Irradiance

.•

Module °T

Probabilistic neural .• Impp & Vmpp network .• .•

[11]

[12]

[13]

Ours (RNN) [1]

& Vmpp

.•

Gradient boosted trees

Multilayer perceptron

Multilayer perceptron

Recurrent neural network

Irradiance

Module °T

.• Impp

& Vmpp

.•

Irradiance

.•

Ambient °T

.•

Pmpp

.•

Irradiance

.•

Ambient °T

.•

High-freq. Impp & Vmpp

.•

Irradiance

.•

Ambient °T

.• .Impp

& .Vmpp

.•

Satellite irradiance

.•

Satellite ambient °T

been adopted: Garoudja et al. [10] use probabilistic neural networks, while Adhya et al. [11] adopt gradient boosted trees. The downside of these largely black-box approaches is that they are not as easily interpretable as threshold-based rules. While the methods discussed above rely on environmental information, essentially from local weather sensors, Zhao et al. [14] avoid needing detailed sensory information but rather rely on designated reference PV modules at the site itself. Clearly, the method will suffer if those reference modules themselves are experiencing faults. Our RNN-based approach that we will detail in Sect. 4 avoids any local measurements, but rather relies on environmental data (irradiance and ambient temperature)

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

109

Table 2 I,V-based classification methods. Top: based on I-V curve, Bottom: based on transients in and .Vmpp . (MPP: maximum power point; .Impp : MPP current; .Vmpp : MPP voltage; .Pmpp : power at the MPP; .°T: temperature) Refs. Machine learning Inputs Implementation # identified fault model cost types

.Impp

[15] [16] [17]

[18]

[21] [19] [22]

[20]

Kernel extreme .• I-V curve learning machine Fuzzy classifier .• I-V curve Convolutional .• I-V curve neural network .• Irradiance .• Ambient °T ResNet .• I-V curve .• Irradiance .• Ambient °T Wavelet packet .• High-freq. Impp transforms & Vmpp Multi-grained .• High-freq. Impp cascade forest & Vmpp Convolutional .• High-freq. Impp neural network & Vmpp .• I & V of reference module Random forest .• High-freq. I per substring .• High-freq. Vmpp

Medium

4

Medium High

3 4

High

5

High

Detection only

High

3

High

2

High

4

Table 3 Methods based on comparison to a reference PV system. Top: comparison to reference PV modules at the same site, Bottom: comparison to nearby PV systems. Bold: our proposed GNN model. (MPP: maximum power point; .Impp : MPP current; .Vmpp : MPP voltage; .Pmpp : power at the MPP) Refs. Machine learning Inputs Implementation # identified model cost fault types [14]

Local and global consistency algorithm

.• Impp

& Vmpp

Medium

2

Medium

Detection only

Medium Low Low

3 Detection only 6

.•

[23] [24] [25] Ours (GNN) [2]

Convolutional neural network Random forest Random forest Graph neural network

I & V of ref. modules .• Pmpp per module .•

Pmpp per module Pmpp .• .Impp & .Vmpp .•

110

J. Van Gompel et al.

readily available from satellites. Thus being pragmatic and cost-effective, our method will prove to still exhibit high classification performance despite the less accurate information used.

2.2 I,V-Based Classification The methods that avoid using any environmental sensor information at all need more detailed local measurements to detect possible failures. In particular, in our overview listed in Table 2, we find methods that rely either on (i) I-V curves, which implies that I-V tracers are installed at the PV site, or (ii) high-frequency measurements of Impp and Vmpp , to identify transient effects stemming from defects. Chen et al. [15] manually define parameters from the I-V curves, based on analysis of such curves for various faults and weather conditions, and feed them to a kernel learning machine to perform 4-way classification. Similarly, Spataru et al. [16] rely on I-V curve parameters fed into fuzzy classifiers. To avoid manually engineering the input features to determine failures from, others have proposed purely data-driven deep learning approaches, using, e.g., convolutional neural networks [17] or residual neural networks (ResNet) [18]. If a fault occurs, this typically will also lead to transient behavior that can be observed in high frequency measurements of current and voltage, even though such transients typically last no longer than a few seconds [19]. Since the footprint of faults in such transients differ, various methods have been exploited for failure detection using the high resolution Impp and Vmpp measurements: random forests [19, 20], wavelet packet transforms [21], and convolutional neural networks [22]. Practical drawbacks are that (i) only the short period of transients is useful to detect the fault (i.e., if measurements are missing from such period, the fault will remain undetected), and (ii) high frequency measurements require costly sensors (e.g., in the order of e 5,000 for the 1 kHz measurements in [21]).

2.3 Comparison to Reference PV System To avoid additional sensors (for either weather or current/voltage tracing), several researchers have proposed to compare PV output to a given reference, as summarized in Table 3. This can be done either (i) on the module level, thus comparing the individual modules to one another, or (ii) on the full system level, using nearby PV systems to compare against. Clearly, methods in category (i) require the PV system to be equipped with micro-inverters (per module), whereas in practice string inverters are more common. Examples of techniques adopted in such module-based solutions include convolutional neural networks [23] and random forests [24]. The latter technique has also been used in a system-level comparison approach [25]. Our second proposed method, detailed in Sect. 5, will use graph neural networks to exploit

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

111

the same idea of comparing system outputs (but realizing detection of also the fault type, as well as generalization to unseen PV systems—whereas [25] only considers binary classification, i.e., fault vs. no fault, on a fixed set of PV systems).

3 Problem Definition The desired output of our predictive model is twofold: (i) the type of fault the PV system suffers from (if any), and (ii) the relative power reduction, compared to normal operation, that results from the fault. The first essentially is a categorical variable, where we want to distinguish different fault types, as well as the no fault class. In our case studies, we will discriminate 6 different fault types, described briefly in Table 4. As input variables, we will consider time series data spanning hourly measurements of the PV system’s current (I) and voltage (V), as measured by the inverter. Furthermore, our single-site model (based on RNNs, see Sect. 4) will take environmental data (zenith angle, ambient temperature, irradiance), whereas our multi-site model (based on GNNs, see Sect. 5) will assume to know the differences between pairs of sites in terms of distance, altitude, azimuth and tilt.

Table 4 Description of the PV system faults considered in our work Fault type Description Simulated severities Open circuit Short circuit

Wiring degradation

Partial shading

Soiling

PID (shunting type)

Disconnection in the wiring Accidental low-resistance path between two points in the PV system Increased series resistance of PV modules

Disconnection of a (sub)string Short circuit of 1, 2, 3 or 4 modules in a (sub)string

Add a resistor on the connection to the inverter of 5.Ω, 10.Ω, 15.Ω or 20.Ω Local shading cast by clouds, Reduced irradiance of 1, 2, 3 chimneys, trees, etc. or 4 modules in a (sub)string by 50% during low sun (zenith larger than 60°) Accumulation of dust on the Reduced irradiance of all surface of PV modules modules by 5%, 10%, 15% or 20% Potential-induced performance PID severities corresponding degradation: electrochemical to a 5%, 10%, 15% or 20% degradation due to large loss of average power output voltage differences, leading to leakage current between PV cells and the array’s frame. The PID simulation is described in [26]

112

J. Van Gompel et al.

Fig. 1 Schematic layout of the 18-module PV system that was simulated with an indication of the simulated falut types (Adapted from [1])

Since both models are essentially supervised machine learning approaches, we need training data. Given that it is difficult and possibly dangerous to induce various faults in real-world systems, we rely on simulated data to train our models. We thus use physics-based PV simulations, based on the well established single-diode model. The model we adopt has been experimentally validated on real-world residential PV systems [27, 28], and has been shown to deliver output power values that are significantly more accurate than what the commercial tool PVsyst predicts.1 The rightmost column of Table 4 indicates how the various fault types were modeled in the simulation tool. The PV system layout we used is sketched in Fig. 1, which also indicates how/where we introduced the various faults. This layout was aligned with the PV systems assumed in literature [15, 29] to facilitate a fair comparison of results. Weather data was taken from the real-world dataset that is publicly available from the National Renewable Energy Laboratory (NREL) [30], including irradiance, ambient temperature, relative humidity, wind speed and wind direction. Since we assume that current and voltage measurements are taken from the inverter, and in practice these measurements are imperfect, we introduce 5% random noise on the simulated current and voltage. For further details, we refer to [1, 2].

1

In [27], PVsyst version 6.23 was used to establish this result: the authors’ tool achieved a mean absolute error (MAE) of predicted energy yield of 3.6%.± 2.8%, while PVsyst achieved an MAE of 5.5%.± 4.1%.

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

113

4 Recurrent Neural Network Model for a Single PV System Using Satellite Weather Information 4.1 Model: A Stacked GRU Network As outlined in Sect. 3, the aim is to infer the type and severity of possible PV system faults from a multivariate time series covering the past 24 h of environmental parameters as well as current and voltage, measured at hourly resolution. Recurrent neural networks (RNN) are well suited for such time series, and recent well-performing models include long short-term memory (LSTM) and gated recurrent unit (GRU) networks: such RNN models have been successfully applied to a broad range of tasks, ranging from speech recognition, over machine translation, to time series forecasting [31]. We adopt a GRU-based architecture rather than using an LSTM, since it achieves similar performance but is characterized by fewer parameters. To minimize computational requirements, and aiming for maximal generalization, we share the same GRU layers to generate common representations for both the classification (fault identification) and regression (severity estimation) tasks. Note that we apply layer normalization in between the GRU layers (to speed up training and act as regularization), and add a residual connection (to avoid vanishing gradients). We feed the last GRU cells’ outputs to fully connected layers with a ReLU activation, and subsequently add a softmax and sigmoid head to finally obtain respectively the fault type classification and severity regression outputs, as sketched in Fig. 2. To feed in the 5-dimensional time series into the GRU layers, we apply standardization (i.e., i , where .μi and .σi are the feature’s mean and rescaling a feature .xi as .xi, = xi −μ σi standard deviation respectively). To train the model, we define a loss function that combines a cross-entropy loss term for the classification part, and a mean square error loss term for the severity regression part. Additionally, to enforce consistency between both outputs, we add a penalization loss to avoid predicting “no fault” with a severity greater than zero. Since our training set is imbalanced, i.e., not all fault types are equally represented (e.g., we consider various levels of shading, which is not applicable for the “no fault” class), samples of a more prevalent class are weighed less in the loss function. This results in a loss function as defined in Eq. (1), where for each data sample . j we define .w j as its weight, . y ˆ cj as its softmax probability for class .c (with 0 representing the “no fault” class), .s j as its true and .sˆ j as its predicted severity level (i.e., the average relative power reduction compared to fault-free operation, thus 0 for “no fault”), and .c j as its true fault type. L = Lce + α Lmse + β Lnfs N ( ) 1 ∑ c = w j − log yˆ j j + α (s j − sˆ j )2 + β sˆ j yˆ 0j N j=1

.

(1)

114

J. Van Gompel et al.

Fig. 2 We propose a stacked GRU architecture to process the multivariate time series, and add (i) a softmax layer to classify the fault into 7 types (including no fault), and (ii) a sigmoid to predict the power reduction impact (a factor in [0, 1]) on the PV system’s output. We use ReLU activation in fully connected layers. Connections are annotated with the dimensions of the vector passed between model components

For evaluation, we will use 5-fold cross-validation, where in each fold we keep a separate year as test set to assess performance of a model. In the training loop, we keep 100 randomly selected days as a validation set and shuffle the remaining data to construct the training batches. That validation set is used for hyperparameter tuning and early stopping. The latter implies that we stop training if the model performance on the validation set does not improve compared to the previous training epoch, since this can help prevent overfitting [31]. For the full training procedure details, we refer to [1].

4.2 Experiment Setup We implement the aforementioned GRU-based model, which we train for the fault types described in Sect. 3. As a baseline, we use CatBoost as a reference, which is based on gradient boosted decision trees, since it has been shown to outperform other boosting implementations [32]. Note that CatBoost cannot perform both the (fault type) classification and (severity) regression tasks simultaneously, and we thus construct two independent models for each of them—implying that we cannot include

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

115

the 3rd loss function to enforce consistency between them, as we did in Eq. (1) for our RNN-model. We further setup experiments to answer three research questions: (Q1) What is the maximal performance our model can achieve, assuming it has perfect weather data? (Q2) How much does that performance suffer when we can only rely on approximate environmental information by relying on satellite weather data? (Q3) Can our model detect unknown faults by looking solely at its fault severity prediction (exceeding a certain threshold)? For (Q1), we simply feed the exact environmental parameter values as we also use to generate the simulated PV system output data (still adding noise to the inverter’s V and I values, as explained previously). With (Q2), we aim to assess our model performance in real-world, practical settings, where residential PV systems are not equipped with local weather sensors. Thus, we rely on weather satellite data, which we obtain from MERRA-2 [33]. This only offers a single time series for an entire US state (North Carolina in our experiments), and thus deviations between the actual local conditions and satellite data values, in terms of mean absolute error (MAE), amount to 49.9 W/m for irradiance and 2.9 °C for ambient temperature. We use this satellite data as environmental input to our RNN prediction model, along with the I and V values from PV system simulations (which obviously are still fed with the actual local weather data, rather than satellite weather data). Finally, to answer (Q3), we leave out one fault type from the training set, and then use that fault type’s test data to assess whether our model would still detect a failure by comparing the predicted severity against a (learned) threshold. Specifically, we learn the threshold as the maximal severity level predicted by the model for correctly classified “no fault” data points in a validation set.

4.3 Results Exact weather data—With the first experiment, we answer (Q1). In terms of fault identification, Fig. 3a shows the confusion matrix between our RNN model’s prediction and the ground truth. We obtain a high balanced classification accuracy (i.e., weighing each fault type equally) of 96.9%.± 1.3% (averaged over the 5-fold cross validation test sets .± 3 times the standard deviation). We note that the most mistakes are made by confusing “no fault” and wiring degradation. Looking at the severity predictions in Fig. 4, we note that the misclassified samples (orange dots) tend to also suffer from poor severity estimation. First of all, this is unsurprising because both models rely on the shared GRU layers to derive the representations fed into the classification and regression heads. Second, we also note the effect of the loss term to make both outputs consistent (i.e., .Lnfs in Eq. (1)). For example, in the actual “no fault” samples, we only observe (erroneously) high severity levels for the misclassified ones. Over all samples together, we find a

116

J. Van Gompel et al.

Fig. 3 Confusion matrix for our RNN model using a exact local weather, or b satellite weather data inputs

Fig. 4 Actual and predicted severity level from our RNN model, using actual exact local sensor weather data as inputs. Each dot represents a test sample, where we only pick one per 24 h window to avoid cluttering, and we group them by the gold truth label on the . X -axis

balanced MAE of our model’s severity estimation of 0.67%.± 0.14%, which is quite limited. Finally, comparing our RNN model to the CatBoost baseline in Table 5, we note a significant benefit of our model (.+3% for fault type classification accuracy and .−1% MAE for severity estimation). Satellite weather data—The second set of experimental results looks into (Q2). Given the discrepancy between actual local conditions and the low-resolution satellite information (cf. the stated irradiance MAE of 49.9 W/m2 , with individual data points

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

117

Table 5 Performance metrics averaged over the 5-fold cross-validation results, where error margins are 3.× the standard deviation. Prediction accuracy is balanced over the 7 possible classes (6 fault types and “no fault”). (Values reproduced from [1]) Type prediction accuracy Severity prediction MAE Weather data CatBoost RNN (Ours) CatBoost RNN (Ours) Exact Satellite

93.2% ± 1.3% 83.5% ± 2.1%

96.9% ± 1.3% 86.4% ± 2.1%

1.66% ± 0.12% 3.6% ± 0.24%

0.67% ± 0.14% 2.09% ± 0.18%

deviating up to 850 W/m2 ), the fault classification and severity estimation gets more challenging: a difference between expected power output and the actually observed values could be due to a PV system fault or just the environmental parameter error. Inspecting the satellite versus local data deviations more closely (for the studied location), we observed a general over-estimation of irradiance and under-estimation of ambient temperature by relying on satellite data. Despite these inaccurate input feature values, our RNN-based model still achieves 86.4% ± 2.1% balanced classification accuracy. Intuitively, this can be explained by the fact that our model can learn to take into account the noise on the inputs, since it is also present in the training data. Looking at the confusion matrix in Fig. 3(b), we note that the model mostly has difficulty distinguishing between soiling and “no fault”, given that an overestimated irradiance also results in seemingly under-performing PV output power (that also arises from dust accumulation on the panels in the “soiling” case). In terms of severity estimation as illustrated in Fig. 5, we note an overall MAE in severity estimation of 2.09% ± 0.18%, largely due to an underestimation of soiling

Fig. 5 Actual and predicted severity level from our RNN model using satellite weather data as inputs. Each dot represents a test sample, where we only pick one per 24 h window to avoid cluttering, and we group them by the gold truth label on the . X -axis

118

J. Van Gompel et al.

Fig. 6 Actual and predicted severity level from our RNN model trained on all but the PID fault data. Each dot represents a test sample, where we only pick one per 24 h window to avoid cluttering, and we group them by the gold truth label on the . X -axis

severity (whereof the effect is hard to distinguish from the satellite data overestimating irradiation). To a lesser degree, we note similarly natured severity estimation errors for wiring degradation and PID. Comparing our RNN-model’s performance to the CatBoost baseline solution in Table 5 still reveals notable advantages of our model (.+3% balanced classification accuracy, .−1.5% severity prediction MAE). Testing on unknown faults—Our final set of results presented here answer (Q3). Figure 6 shows the severity predictions by our model when trained on all data except that with PID faults. We determine a binary fault/no fault classification threshold on the severity predictions as the maximal severity that is predicted for correctly classified “no fault” samples from the validation set. We note that 97.5% of the PID samples—which are obviously misclassified, since the fault type was not considered in the training set—are thus detectable as a “fault”. Repeating this setup for all of the faults (i.e., training an instance of our RNN-based model without supplying such fault data), we obtain the binary fault detection accuracies listed in Table 6. Averaging the binary fault detection accuracies among all 6 fault types, we find an “unknown fault” detection accuracy of 94.3% when using exact weather data. We note that the performance is substantially lower for the model that has not seen degradation. We hypothesize that this is due to the difficulty to differentiate between degradation and “no fault” as previously observed in the confusion matrix of the model trained on all faults (see Fig. 3a). When we only have access to (inaccurate) weather information based on satellite data, we see from Table 6 that the overall unknown fault detection capability is reduced to 80.4% on average. In Table 6, we particularly note the poor performance when trying to detect soiling with a model that did not see any such faults during

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

119

Table 6 Binary classification performance of models trained on all but 1 fault type. Final column is the overall macro-average (i.e., average of the preceding 6 columns) (Values reproduced from [1]) Detection accuracy of excluded fault type as “unknown fault” Short Degradation Shading Soiling Weather Open circuit circuit (%) (%) (%) data (%) (%) Exact Satellite

100 100

99.6 98.4

70.7 83.4

98.4 93.0

99.7 22.3

PID (%)

Average (%)

97.5 85.2

94.3 80.4

training, while compared to model based on exact weather data the detection of degradation improves. Again, looking at the all-fault model’s confusion table (see Fig. 3b) gives some idea why: we note a high occurrence of mistakenly classifying soiling faults as “no fault” and vice versa—whereas among degradation and “no fault” there is less confusion than when using exact weather data.

5 Graph Neural Network Model for Multiple PV System Sites 5.1 Our GNN Model The primary goal of our graph neural network (GNN) based model, detailed below, is to enable failure detection that does not require any weather information (i.e., no such local sensors, nor relying on coarser estimations based on satellite data). The basic intuition is to use and compare the inverter measurements from nearby PV sites, assuming that some of them are operating without faults, and thus can serve to (implicitly) construct a reference to compare against. We now consider only fault type classification, hence we do not include a severity estimation output—although that can be straightforwardly added with a separate sigmoid layer, as in the previously discussed RNN model. The principle of a GNN is that nodes and edges, which we respectively note as .vi and .ei, j ,2 are represented by feature vectors as input to a GNN layer that transforms them by (i) first calculating output edge representations .ei,, j derived from the original edge representations and its pair of adjacent nodes, and then (ii) calculating output node representations.vi, based on the original node representations, and an aggregated

2

For simplicity, in our notation we assume undirected edges and use .ei, j to represent an edge between nodes .vi and .v j .

120

J. Van Gompel et al.

representation of its incident edges. This implies the use of parameterized functions f and . f v to consecutively calculate the output edge and node representations as follows3 :

. e

( ) e, = f e ei, j , vi , v j ∑ ( ) , , with E i, = .vi = f v vi , E i , ei,, j

. i, j

(2) (3)

j

The model parameters to learn thus are those that define . f e and . f v , which are multilayer perceptrons. Model architecture—The overall architecture of our GNN model is sketched in Fig. 7. We construct a graph, where each node corresponds to a PV site. For the node representations, we thus only use local measurements from each site individually, being 24 h of voltage and current measurements from the inverter. As in the the RNN model, we process this time series with a stacked GRU architecture (as in Fig. 2). We create a fully connected graph between these nodes, where each edge between a pair of PV sites has as features the distance between those sites, as well as their difference in altitude, azimuth and tilt (which all remain constant over time). The graph features then go through GNN layers, more specifically 2 XENet layers [34]. We opted for XENet rather than graph convolutional networks (GCN) [35], since it explicitly supports edge features, which GCN does not. Moreover, we found that XENet outperformed other GNN types that support edge features, particularly edge-conditioned convolutional networks [36] and crystal graph convolutional networks [37]. Finally, we note that we normalized the input features as detailed in [2, Sect. 4]. Training—As before, we rely on physics-based simulations to produce the training data [27, 28]. We consider the same faults listed previously in Table 4, but will use different PV system layouts and module types for each of the 6 considered sites, as listed in Table 7. For the weather data, we use measurements publicly available from the National Renewable Energy Laboratory (NREL) [30], taking data from weather stations at the locations shown on the map in Fig. 7, spanning a time range of 2012–18 for sites 1–4 and 2012 for sites 5 and 6. Since we are dealing with a pure classification problem (cf. we do not include severity estimation in this section), we use a cross-entropy minimization objective (i.e., the .Lce part of Eq. (1)). As before, we weigh the instances to compensate for the class imbalance in the data, such that each fault type contributes equally to the overall loss. Note that, in contrast to the RNN-based model of Sect. 4, in the GNN case the “no fault” instances are far more numerous, since we consider at most one of the PV sites to suffer from a fault. Note that the results do not change significantly when multiple PV systems are simultaneously faulty [1, Sect. 7.4].

3

Note that instead of simply summing the incident edge representations, in general GNNs other aggregation functions can be used to obtain . E i , e.g., mean-, min- or max-pooling.

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

121

Fig. 7 The proposed graph neural network (GNN) based model. Each PV site is represented as a node in the graph, where (i) node features are the outputs of a stacked GRU (as in our RNN-based model, see Fig. 2) that processes a site’s hourly current and voltage measurements over a 24 h time window, and (ii) edge features are the distance between PV sites and their difference in altitude, azimuth and tilt. Note that we do not use/calculate edge representations coming out of the 2nd XENet layer

5.2 Experiment Setup We set up experiments to answer three research questions: (Q4) How does the GNN model compare to our previous RNN-based model and the CatBoost baseline? (Q5) How good do our models perform without any weather information? (Q6) Can a trained model generalize to PV system sites that were not included in the training data? (The latter thus amounts to zero-shot classification for new PV sites.) Baseline models—As (Q4) states, we will compare the presented GNN-based model against our previously introduced RNN-based model, as well as the CatBoost baseline described in Sect. 4.2. Since we consider only fault classification (i.e., we do not perform severity estimation as with the original RNN model), we slightly alter the RNN model to only have the classification output and thus remove the ‘fault severity’ branch from the architecture sketched in Fig. 2. Similarly, we train a single CatBoost model, only for fault classification. All of our models are trained for a balanced objective where each class is weighed equally, by multiplying the various samples with an appropriate factor depending on their actual fault type, as explained earlier (cf. the weight factors .w j in Eq. (1)). Note that in answering (Q4), we initially will consider providing also satellite weather information (irradiance, temperature, solar zenith). This means we also add

122

J. Van Gompel et al.

Table 7 PV system and module configurations for each of the 6 sites Site

PV module type

1 2

SW 325 XL duo Scheuten P6-60 i30 Scheuten P6-60 i30 SW 325 XL duo SW 325 XL duo Scheuten P6-60 i30

3 4 5 6

Module parameter Number of cells (series .× parallel) Maximum power (.Pmpp ) Maximum power point voltage Maximum power point current Open circuit voltage (.VOC ) Short circuit current (.ISC ) Temperature coefficient of .Pmpp Temperature coefficient of .VOC Temperature coefficient of .ISC

Number of modules 6.× 3 15.× 1

Tilt

Azimuth

15° 25°

180° 90°

10.× 2

35°

135°

12.× 1 8.× 2 4.× 3

45° 30° 20°

270° 225° 160°

SW 325 XL duo 24.× 3 325 W 37.7 V 8.68 A 47.0 V 9.28 A .−0.43%/K .−0.31%/K 0.044%/K

Scheuten P6-60 i30 20.× 3 230 W 29.3 V 7.84 A 37.2 V 8.31 A .−0.42%/K .−0.30%/K 0.040%/K

weather time series as inputs of the GRU layers (now taking 5-dimensional vectors as input) for both our GNN model and the baseline RNN model. Next, for (Q5) we will omit weather information and thus only consider voltage and current time series as input. Since all of our models in this section now only consider fault classification, we use cross-entropy loss minimization as objective for all of them, where we weigh each sample to achieve balancing of the loss across all 7 classes. For training details, we refer to [2].

5.3 Results Performance with satellite weather data—We first test our GNN model and compare it with our earlier RNN as well as CatBoost, for fault classification based on current, voltage, irradiance, ambient temperature and solar zenith—essentially the same inputs as in Sect. 4. We used the freely available satellite data from MERRA-2 [33] for the weather inputs (which are for the entire state of Denver, and thus only coarse approximations of the actual weather conditions at each of the 6 selected locations

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

123

Fig. 8 Confusion matrices for the CatBoost, RNN and GNN models on the 1st fold (of the 5-fold cross-validation test sets), using a weather info as well as voltage and current, or b only current and voltage measurements from the PV inverter. Gray shaded columns and rows state the precision and recall values, respectively

of the PV sites; e.g., amounting to a MEA of 51.11 W/m2 for irradiation at site 1). We train and evaluate all models on data for sites 1–4, using 5-fold cross-validation as explained before. Looking at Fig. 8a to answer (Q4), we find that our proposed GNN model significantly outperforms both CatBoost and RNN models,4 especially in terms of discriminating between “no fault” and soiling. As we noted in Sect. 4, this is difficult for both the RNN model and CatBoost, because the satellite weather information does not accurately match the local PV site conditions: the irradiance overestimations of the satellite data lead to seemingly under-performing PV output, which is hard to distinguish from impaired performance due to soiling. The GNN model does not suffer as much from imprecise satellite measurements, since it can differentiate inaccurate weather inputs (which affect all sites) from fault conditions (at only 1 or a limited number of PV sites). The full cross-validation results over all 5 folds presented in Table 8a confirm our GNN model’s superiority.

4 Note that the results presented here for CatBoost and RNN differ from those in Sect. 4 because there we only had 1 PV site, where here we have a heterogeneous set of PV systems across the considered sites 1–4, and also the current models are not additionally trained on severity regression.

124

J. Van Gompel et al.

Table 8 Average accuracy on sites 1–4 over the 5-fold cross-validation, with 3 times the standard deviation over the cross validation folds as error margins Inputs Accuracy CatBoost (%) RNN (%) GNN (%) (a) (b)

Satellite weather, 79.8.± 2.4 I&V I & V only 73.0.± 2.4

82.3.± 2.9

87.5.± 1.6

72.7.± 2.7

84.6.± 2.1

Performance without any weather data—Our motivation for using a GNN model and combining input data from multiple PV sites was that this should enable fault detection without any weather information. Yet, since the various PV sites considered may have different orientations (as in our experiments, cf. Table 7), the PV output at a particular time of the day cannot be readily compared: a south-facing PV system will have its maximal power (in clear sky conditions) earlier than a west-facing system. However, since we provide the past 24 h time window of data, our GNN model should learn to adjust for such differences. To answer (Q5) we thus omit the weather data time series from the model inputs, only keeping the inverter voltage (V) and current (I) measurements (as originally sketched in Fig. 7). The results in Fig. 8b show the confusion matrices for the CatBoost, RNN and GNN models using only V and I time series as inputs. As expected, compared to the model variant including weather data, the GNN model does not suffer much performance loss. Conversely, the CatBoost and RNN models do deteriorate considerably, even though Table 8(b) shows they still attain an overall accuracy well over 70%. The latter may come as a surprise, but the still ‘decent’ performance may be due to the models being able to infer the ‘expected’ current and voltage levels from the training data without faults. Performance on unseen PV sites—As explained in Sect. 5.1, a GNN model essentially contains parameterized functions to map input node and edge representations to output node and edge representations. Given that the output node features are based on aggregated edge features (cf. the summation in Eq. (3)), implying that the mapping functions do not depend on the actual number of nodes (nor edges) in the graph, we can apply the learned mappings also for new nodes that were not included in training the GNN model. Thus, we can train a GNN model on a limited number of sites (in casu sites 1–4 of our case study), and afterwards apply it for fault detection of all sites, including the unseen sites 5 and 6. We set up this experiment to answer (Q6), and compare the GNN model against the CatBoost and RNN baselines, which only use single-site time series data as inputs and thus can also be tested on unseen sites—we thus also train these baselines on sites 1–4 and will test on sites 5–6. Since we only have 1 year of weather data for sites 5–6, we use this full year of 2012 as test data, while training the GNN model on only sites 1–4 for a five-year period, excluding 2012.

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

125

Table 9 Accuracy on unseen PV sites 5 and 6 (as indicated on the map in Fig. 7), of which no data was included in the training or validation set of the models. Note that the GNN model predictions for all 6 sites, but only the accuracy on the unseen sites is reported here Site Accuracy Inputs CatBoost (%) RNN (%) GNN (%) Satellite weather, 5 I&V 6 5 I & V only 6

65.4

79.5

86.8

57.6 59.6 55.6

70.0 64.0 65.5

62.2 83.7 62.2

Table 9 shows that our GNN model successfully generalizes to site 5, which is situated relatively close to some of the sites the model was trained on (and thus inter-site distances, which serve as edge features in the GNN, are comparable to that in the training data). Yet, this is still a non-trivial achievement since the tilt, azimuth, and module configuration of site 5 is still quite different from the training sites 1–4 (recall Table 7). We note that the RNN model and especially CatBoost do show a considerably steeper performance drop on this unseen site 5, especially when looking at the model variants that can only use current and voltage measurements—which is what one would intuitively expect. Still, looking at the results for site 6, we observe that also our GNN model is far less successful (even though still beating the RNN and CatBoost models). This performance drop for site 6 is likely due to the site being quite distant from any of the training sites 1–4 (of which site 6’s closest neighbor is 196 km away). Hence, our answer to (Q6) seems to be that, yes, our GNN model can generalize to unseen sites, but only as long as the new PV system is not too remote from any of the sites used for training the model.

6 Conclusions In this chapter, we discussed data-driven solutions for cost-effective fault detection for PV panels. By cost-effective we mean that they do not rely on additional sensor equipment beyond (low temporal resolution) voltage and current measurements from the inverter, which converts the PV modules’ DC to AC power. More specifically, we investigate the adoption of recent state-of-the-art neural network models: (i) a recurrent neural network (RNN) based architecture, using gated recurrent units (GRUs) to process hourly measurements from a single PV site, and subsequently (ii) a graph neural network (GNN) based model, taking the per-site representations from such GRU layers, and jointly processing them across multiple sites. We have shown that the RNN-based model is effective even when only using (inaccurate) satellite based weather data—although overall fault type accuracy does drop from around 96% when using exact local weather information to 86% when using

126

J. Van Gompel et al.

satellite based data—and thus constitutes a workable solution for independent single site fault monitoring. Compared to state-of-the-art in PV fault detection literature, our RNN model (i) requires neither high temporal resolution data nor I-V curves nor local weather measurements, (ii) supports both fault detection and severity level estimation (i.e., relative power reduction compared to fault-free operation), and (iii) is shown to be also effective in detecting unseen/new fault types (i.e., not considered during training of the model). Further, the GNN-based model performing fault detection jointly for multiple PV sites is shown to be successful for a set of PV systems even when their configurations and orientations vary substantially. For example, in our case study on 6 sites with 2 types of PV modules arranged in 6 different system sizes and configurations, we attain overall classification accuracies of 85%. Moreover the GNN approach does not require any weather information at all. Additionally, we show that our GNN model also is capable of generalizing to PV sites unseen during training (as long as they are subject to similar weather conditions, i.e., are geographically nearby). Limitations—Since our solutions were targeted to be practical and cost-effective, they do not reach maximal performance: using dedicated sensors, approaches reaching over 99% accuracy have been reported [10, 15, 20]. From our experiments testing a trained GNN model on PV sites unseen during training, as discussed in Sect. 5.3, we found that the GNN model does not perform well on an isolated, remote PV system (site 6 in our case study) that is far from the sites included in the model’s training set. Yet, in practice this would only problematic in sparsely populated regions. In such cases, a dedicated model (e.g., using our RNN model) trained on local historical data of that site would be a more meaningful (i.e., better performing) solution. Further, if the remote PV system would be a large scale one, comprising multiple inverters, the GNN approach would still make sense considering each inverter individually as nodes in the model graph. Since in the latter scenario all strings essentially are subject to almost exactly the same weather conditions, we expect the thus trained GNN model to be highly accurate. Although our models are generic with respect to the PV module technology, climate/weather conditions, etc., of the sites to perform fault identification/classification for, we only have tested them in a limited set of case studies (and a relatively low number of sites for the GNN model). Yet, a priori we do expect the RNN model to perform as well for any PV system technology, configuration, and geographical location: e.g., further analysis in [1, Sect. 5.2] showed that the performance of an RNN-based model trained on North Carolina weather data was barely affected when tested in Nevada weather conditions. For the GNN model, we believe it is likely that the accuracy of the fault identification model will only improve when more (located relatively closely together) sites are considered. Considering GNNs have been applied to graphs with over 100 million nodes [38], scaling up our approach to monitor more PV systems is clearly feasible. Note that our RNN model includes a severity level prediction component, while our considered GNN-based model does not. Yet, clearly such a component could also be added there. Still, it is unclear whether severity level prediction would work

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

127

equally well, especially when considering heterogeneous PV site technologies and orientations. Future work—From the above limitations, it is clear that a number of research questions need further analysis. For both models, we note that we only considered fault cases where only a single fault type occurs. Thus, how to identify possibly multiple simultaneously occurring faults, and/or how well our models fare under such multi-label classification conditions, remains to be investigated. Similarly, more detailed analysis is required to assess the RNN/GNN model performance in function of time resolution (we only considered 1 h measurement intervals) and input time window size (we used a 24 h history), which can guide deployment strategies to balance performance versus computational and sensor equipment requirements. We also note that in both model types we only considered stationary cases, i.e., we trained and evaluated on time series for which the fault was either present or absent the entire time. It would be interesting to study how to train our models (and/or tweak them) to most effectively and as quickly as possible identify faults, for time series that include the transition of fault-free conditions to the occurrence of a fault. A last common direction for both models is to validate them in the field, using actual real-world PV systems (including cases with known faults). Specifically for the GNN model, we advocate a study to establish up to what scale, in terms of both absolute number and geographical spread of the PV sites, it is meaningful to jointly process multiple sites. Acknowledgements This work was supported by the DAPPER project, which is financed by Flux50 and Flanders Innovation & Entrepreneurship (project number HBC.2020.2144). We further would like to thank Arnaud Schils (who was with imec at the time) for facilitating the PV simulations.

References 1. J. Van Gompel, D. Spina, C. Develder, Satellite based fault diagnosis of photovoltaic systems using recurrent neural networks. Appl. Energy 305, 1–12 (2022) 2. J. Van Gompel, D. Spina, C. Develder, Cost-effective fault diagnosis of nearby photovoltaic systems using graph neural networks. Energy 266(126444) (2023) 3. International Energy Agency (IEA): World Energy Outlook 2022 (2022), https://www.iea.org/ reports/world-energy-outlook-2022 4. S.R. Madeti, S. Singh, Online fault detection and the economic analysis of grid-connected photovoltaic systems. Energy 134, 121–135 (2017) 5. A. Livera, M. Theristis, G. Makrides, G.E. Georghiou, Recent advances in failure diagnosis techniques based on performance data analysis for grid-connected photovoltaic systems. Renew. Energy 133, 126–143 (2019) 6. D.S. Pillai, N. Rajasekar, A comprehensive review on protection challenges and fault diagnosis in PV systems. Renew. Sust. Energ. Rev. 91, 18–40 (2018) 7. A. Mellit, G.M. Tina, S.A. Kalogirou, Fault detection and diagnosis methods for photovoltaic systems: a review. Renew. Sust. Energ. Rev. 91, 1–17 (2018)

128

J. Van Gompel et al.

8. W. Chine, A. Mellit, V. Lughi, A. Malek, G. Sulligoi, A. Massi Pavan, A novel fault diagnosis technique for photovoltaic systems based on artificial neural networks. Renew. Energy 90, 501–512 (2016) 9. C. Kapucu, M. Cubukcu, A supervised ensemble learning method for fault diagnosis in photovoltaic strings. Energy 227, 120463 (2021). Accessed from 15 July 2021 10. E. Garoudja, A. Chouder, K. Kara, S. Silvestre, An enhanced machine learning based approach for failures detection and diagnosis of PV systems. Energy Conv. Manag. 151, 496–513 (2017) 11. D. Adhya, S. Chatterjee, A.K. Chakraborty, Performance assessment of selective machine learning techniques for improved PV array fault diagnosis. Sustain. Energy Grids Netw. 29, 100582 (2022) 12. M. De Benedetti, F. Leonardi, F. Messina, C. Santoro, A. Vasilakos, Anomaly detection and predictive maintenance for photovoltaic systems. Neurocomputing 310, 59–68 (2018) 13. L.L. Jiang, D.L. Maskell, Automatic fault detection and diagnosis for photovoltaic systems using combined artificial neural network and analytical based methods, in Proceedings of International Joint Conference on Neural Networks (IJCNN 2015). Killarney, Ireland (2015), pp. 1–8. Accessed from 11–15 July 20215 14. Y. Zhao, R. Ball, J. Mosesian, J.F. de Palma, B. Lehman, Graph-based semi-supervised learning for fault detection and classification in solar photovoltaic arrays. IEEE Trans. Power Electron. 30(5), 2848–2858 (2015) 15. Z. Chen, L. Wu, S. Cheng, P. Lin, Y. Wu, W. Lin, Intelligent fault diagnosis of photovoltaic arrays based on optimized kernel extreme learning machine and I-V characteristics. Appl. Energy 204, 912–931 (2017) 16. S. Spataru, D. Sera, T. Kerekes, R. Teodorescu, Diagnostic method for photovoltaic systems based on light I-V measurements. Sol. Energy 119, 29–44 (2015) 17. Q. Liu, B. Yang, Z. Wang, D. Zhu, X. Wang, K. Ma, X. Guan, Asynchronous decentralized federated learning for collaborative fault diagnosis of PV stations. IEEE Trans. Netw. Sci. Eng. 1680–1696 (2022) 18. P. Lin, Z. Qian, X. Lu, Y. Lin, Y. Lai, S. Cheng, Z. Chen, L. Wu, Compound fault diagnosis model for photovoltaic array using multi-scale SE-ResNet. Sustain. Energy Technol. Assess. 50, 101785 (2022) 19. W. Gao, R.J. Wai, S.Q. Chen, Novel PV fault diagnoses via SAE and improved multi-grained cascade forest with string voltage and currents measures. IEEE Access 8, 133144–133160 (2020) 20. Z. Chen, F. Han, L. Wu, J. Yu, S. Cheng, P. Lin et al., Random forest based intelligent fault diagnosis for PV arrays using array voltage and string currents. Energy Conv. Manag. 178, 250–264 (2018) 21. B.P. Kumar, G.S. Ilango, M.J.B. Reddy, N. Chilakapati, Online fault detection and diagnosis in photovoltaic systems using wavelet packets. IEEE J. Photovolt. 8(1), 257–265 (2017) 22. X. Lu, P. Lin, S. Cheng, Y. Lin, Z. Chen, L. Wu, Q. Zheng, Fault diagnosis for photovoltaic array based on convolutional neural network and electrical time series graph. Energy Conv. Manag. 196, 950–965 (2019) 23. T. Huuhtanen, A. Jung, Predictive maintenance of photovoltaic panels via deep learning, in Proceedings of IEEE Data Science Workshop (DSW 2018), Lausanne, Switzerland (2018), pp. 66–70. Accessed from 4–6 June 2018 24. M. Feng, N. Bashir, P. Shenoy, D. Irwin, D. Kosanovic, SunDown: model-driven per-panel solar anomaly detection for residential arrays, in Proceedings of the 3rd ACM SIGCAS Conference Computing Sustainable Society (COMPASS 2020), Guyacquil, Ecuador (2020), pp. 291–295. Accessed from 15–17 June 2020 25. S. Iyengar, S. Lee, D. Sheldon, P. Shenoy, SolarClique: detecting anomalies in residential solar arrays, in Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable (COMPASS 2018), Menlo Park and San Jose, CA, USA (2018), pp. 1–10. Accessed from 20–22 June 2018 26. A. Schils, R. Breugelmans, J. Carolus, J. Ascencio-Vásquez, A. Wabbes, E. Bertrand, B. Aldalali, M. Daenen, E. Voroshazi, S. Scheerlinck, A grey box model for shunting-type potential induced degradation in silicon photovoltaic cells under environmental stress, in Proceedings

Neural Network Based Approaches for Fault Diagnosis of Photovoltaic Systems

27.

28.

29.

30. 31. 32.

33.

34.

35.

36.

37. 38.

129

of 38th European Photovoltaic Solar Energy Conference and Exhibition (EU PVSEC 2021) (2021), pp. 578–582. Accessed from 6–10 Sep 2021 H. Goverde, B. Herteleer, D. Anagnostos, G. Köse, D. Goossens, B. Aldaladi, G. J, K. Baert, F. Catthoor, J. Driesen, J. Poortmans, Energy yield prediction model for PV modules including spatial and temporal effects, in Proceedings of 29th European Photovoltaic Solar Energy Conference and Exhibition (EU PVSEC 2014) (2014), pp. 3292–3296. Accessed from 22–26 Sep 2014 D. Anagnostos, H. Goverde, B. Herteleer, F. Catthoor, S. Dimitrios, J. Driesen, J. Poortmans, Demonstration and validation of an energy yield prediction model suitable for non-steady state non-uniform conditions. In: Proceedings of 6th World Conference Photovoltaic Energy Conversion Kyoto, Japan (2014). Accessed from 23–27 Nov 2014 A.Y. Appiah, X. Zhang, B.B.K. Ayawli, F. Kyeremeh, Long short-term memory networks based automatic feature extraction for photovoltaic array fault diagnosis. IEEE Access 7, 30089– 30101 (2019) D. Jager, A. Andreas, NREL National Wind Technology Center (NWTC): M2 Tower; Boulder, Colorado (Data). NREL Report No. DA-5500-56489 (1996) I. Goodfellow, Y. Bengio, A. Courville, Deep Learning. (MIT Press, 2016), http://www. deeplearningbook.org L. Prokhorenkova, G. Gusev, A. Vorobev, A.V. Dorogush, A. Gulin, CatBoost: unbiased boosting with categorical features, in Proceedings of 32nd International Conference on Neural Information Processing System (NIPS 2018), Montreal, Canada (2018), pp. 6639–6649, https:// proceedings.neurips.cc/paper_files/paper/2018/file/14491b756b3a51daac41c24863285549Paper.pdf. Accessed from 3–8 Dec 2018 R. Gelaro, W. McCarty, M.J. Suárez, R. Todling, A. Molod, L. Takacs, C.A. Randles, A. Darmenov, M.G. Bosilovich, R. Reichle, K. Wargan, L. Coy, R. Cullather, C. Draper, S. Akella, V. Buchard, A. Conaty, A.M. da Silva, W. Gu, G.K. Kim, R. Koster, R. Lucchesi, D. Merkova, J.E. Nielsen, G. Partyka, S. Pawson, W. Putman, M. Rienecker, S.D. Schubert, M. Sienkiewicz, B. Zhao, The modern-era retrospective analysis for research and applications, version 2 (merra2). J. Climate 30(14), 5419–5454 (2017) J.B. Maguire, D. Grattarola, V.K. Mulligan, E. Klyshko, H. Melo, XENet: using a new graph convolution to accelerate the timeline for protein design on quantum computers. PLoS Comput. Biol. 17(9), 1–21 (2021) T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in Proceedings of 5th International Conference Learning Representations (ICLR 2017), Toulon, France (2017), pp. 1–14, https://openreview.net/forum?id=SJU4ayYgl. Accessed from 24–26 Apr 2017 M. Simonovsky, N. Komodakis, Dynamic edge-conditioned filters in convolutional neural networks on graphs, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA (2017), pp. 29–38. Accessed from 22–25 July 2017 T. Xie, J.C. Grossman, Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, 145301 (2018) W. Hu, M. Fey, H. Ren, M. Nakata, Y. Dong, J. Leskovec, Ogb-lsc: a large-scale challenge for machine learning on graphs (2021). arXiv:2103.09430

Clustering of Building Stock Matteo Giacomo Prina , Ulrich Filippi Oberegger , Daniele Antonucci , Yixiao Ma, Mohammad Haris Shamsi, and Mohsen Sharifi

Abstract In Europe, buildings account for 40% of final energy demand. Building stock models assess the impacts of technologies on energy consumption, greenhouse gases, policies, city planning, renewable energy, renovation strategies, and health effects. Two approaches, top-down and bottom-up, generate these models using real data and simulations. In the big data era, information about buildings is increasingly available, allowing real analysis of building stocks (top-down approach). In the bottom-up approach, models are estimated through simulations of building archetypes and aggregated at stock level. Unsupervised machine learning like clustering is widely used to find and group similar buildings. Centroid- and density-based algorithms are most popular but subsequent evaluation of clusters is essential. In this chapter we demonstrate two applications of clustering on different building stocks. In the first, the aim is to generate heat saving cost curves for the residential sector. These curves allow policy makers to choose renovations that save most energy per Euro invested. In the second application, clustering is applied to a building stock in Flanders to generate synthetic data allowing to simulate energy efficiency scenarios for buildings. The archetypal modeling approach used classifies buildings based on characteristics and scales the energy consumption up to the entire housing stock. Keywords Clustering · Building stock · K-Means · Policy · Spatial analysis

M. G. Prina · U. F. Oberegger · D. Antonucci (B) Eurac Research, Institute for Renewable Energy, Bolzano, Italy e-mail: [email protected] Y. Ma · M. H. Shamsi · M. Sharifi Flemish Institution for Technological Research (VITO), Boeretang, Belgium © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_5

131

132

M. G. Prina et al.

1 Case Study 1: Heat Saving Cost Curves for EU-27 1.1 Introduction To combat global warming, the European Union (EU) has set an ambitious target to actualize a zero-emission building stock by 2050 [1]. the Renovation Wave initiative postulates the need to double the annual energy renovation rate of buildings by 2030. This initiative is set to fundamentally transform the energy landscape in Europe [2]. The building stock within the EU is heterogenous, with residential buildings being predominant [3]. Energy conservation measures like insulation of external surfaces and window replacements hold immense potential not only for energy savings but also for stimulating local economies [4]. The challenge lies in balancing cost efficiency and emission reduction when comparing energy renovation actions with alternative heat supply options or energy-saving measures in other sectors. In this context, we propose a novel methodology that employs clustering techniques to generate heat saving cost curves explicitly tailored for the residential sector. Such a methodology would provide invaluable support for devising building energy renovation strategies and inform decision-making processes related to decarbonization. The derived cost curves delineate the functional relationship between the heat savings and the costs associated with energy renovation measures. These curves guide policymakers to identify the renovations that ensure maximum energy conservation per Euro invested. Nonetheless, the derivation of these curves necessitates comprehensive data on the characteristics of the building stock and the costs of energy renovation. Additionally, this data must be consistent and comparable across countries, a requirement often not met. Our methodology innovatively leverages open data sources, predominantly the Hotmaps project database [5], to derive cost curves for heat savings in buildings.. This database offers harmonized data on the characteristics of building stock and energy renovation costs for all EU-27 countries. Through clustering, we derive cost curves for each EU-27 country at the national level, comparing them across different climatic zones and building types, all while maintaining manageable data and effort requirements. The overarching goal is to develop a tool that aids energy system modelers in incorporating energy efficiency within buildings into their scenario development processes. Existing literature offers multiple approaches and methods to develop energysaving cost curves for diverse countries and building stocks. While these studies provide significant contributions and excel in specific areas, they do present limitations. Most of these studies rely on building stock databases that are not publicly accessible and employ a building energy performance simulation model with annual or monthly timesteps, or only cover a single country. Most studies used building stock databases that are not openly available [6–11], adopted a building energy performance simulation model with annual [7–9, 12] or monthly [6, 10, 11] timesteps, or covered a single country [7–11].

Clustering of Building Stock

133

Our approach differs from the existing literature by using an open building stock database, namely the one of the Hotmaps project [5], an hourly timestep model based on the standard ISO 52016-1:2017 [13], and by ultimately covering all EU-27 countries through clustering analysis.

1.2 Methodology Our overall methodology can be visualized as depicted in Fig. 1, outlining the systematic workflow we adhered to. Initially, we embarked on data collection and analysis. For each country, building type, and construction period, we meticulously extracted data on heated areas, U-values, and space heating energy demands at the national level from the Hotmaps building stock database. This was performed for a subset of the EU-27 countries. Energy renovation costs were ascertained from a publicly available Italian regional construction price list for 2021 [14]. These costs were subsequently adjusted for other countries using the construction cost index furnished by Eurostat [15]. Additionally, Heating degree days, serving as a proxy for climate, were gathered from the Eurostat database [16]. Upon collation of the necessary data, we conducted a clustering analysis to discern similarities in the building stock across European countries. This crucial phase aimed to determine the optimal number of clusters, focusing on recognizing analogous building types, construction periods, and heating characteristics prevalent in diverse EU nations. Subsequently, building simulations were executed for the clusters identified, conforming to the EN ISO 52016-1:2017 standard. This analysis was crucial for ascertaining the energy performance of the building stock, both pre and postenergy renovation, for the chosen clusters. Additionally, it served to evaluate the costs associated with the renovation.

Fig. 1 Methodological workflow

134

M. G. Prina et al.

In the final phase, we engaged in a ranking exercise for the various retrofit measures. Subsequently, energy-saving cost curves were constructed for all EU27 countries at the national level. The basis for the ranking was the results derived from the building simulation, thus ensuring an informed and data-driven approach to the process. This methodology allows a comprehensive comparison and assessment of the energy performance and cost-effectiveness of renovation measures across the EU.

1.3 Clustering The Hotmaps building stock database compartmentalizes the residential building stock for each of the EU-27 countries into three distinct categories: single-family/ terraced houses, multifamily houses, and apartment blocks. Further subdivisions within each building type are made based on seven chronological categories: pre1945, 1945–1969, 1970–1979, 1980–1989, 1990–1999, 2000–2010, and post-2010. We used the following indicators for each subdivision: heated floor area, number of buildings, number of dwellings/units, U-values (thermal transmittance) for walls, windows, roof, and floor, and annual space heating energy demand. Performing building energy renovation simulations and subsequent analyses for every country and building category would be prohibitively labor-intensive. Considering the 27 countries, 3 building categories, and 7 construction periods, we would need to analyze a total of 567 combinations. In order to mitigate this workload and streamline the process, we executed a clustering analysis to identify clusters of countries and building categories that would be compatible with similar energy renovation strategies. Clustering analysis, a data mining technique, categorizes a dataset into groups, or clusters, based on inherent similarities. For this study, we employed the k-means method for our clustering analysis [17]. This algorithm segregates a dataset into ‘k’ predetermined clusters, taking into account the mean distance between the data points and the respective cluster centroid [18]. To pinpoint the optimal number of clusters, we utilized two evaluation metrics: the Elbow Method [19] and the Silhouette Score [20]. The Elbow Method locates the optimal number of clusters where the increase in explained variance begins to plateau. The explained variance is gauged by the sum of squared distances between the data points and their allocated cluster centroid. Conversely, the Silhouette Score assesses the similarity of a data point with the data points in its own cluster in contrast to those in other clusters. Scores range from -1 to 1, with a score near 1 signifying that the data point is well-suited to its own cluster and ill-suited to other clusters. Our clustering analysis took into account three indicators, assigning equal weightage to each: U-value average, specific space heating energy demand [kWh/ m2 year], and heating degree days. These interdependent indicators describe distinct factors that significantly affect the characteristics of building stocks. A colder climate generally means buildings are constructed with greater insulation, hence lower Uvalues. However, this might not be the case for older structures. Energy demand

Clustering of Building Stock

135

for space heating is influenced by U-values, but other factors like infiltration and occupant behaviour can also impact it. We consolidated the four U-values (walls, roof, windows, and floor) using a weighted average based on the building surface area of each component. For the building geometry, we adhered to Kragh et al. [21] simplified modelling approach, which presumes a cuboid with a building width of 8 m and a room height between 2.5 and 2.8 m, depending on the construction year. The window area is assumed to represent 15% of the heated floor area. The heated floor area, provided by the Hotmaps database, allowed us to discern the surfaces of the floor and roof. With these surfaces defined, we were then able to calculate the average U-values for each country’s building category and construction period. Building simulations. Following the clustering process, we selected the points nearest to the centroids as the baseline buildings for simulation in accordance with the ISO 52016–1:2017 hourly method, as outlined in Table 1. We then calibrated the building models to match the heating energy need centroids, adjusting the ventilation levels via the air change rate. We calibrated the building models to the heating energy need centroids by varying the building ventilation via the air change rate. Next, we considered a range of retrofit packages: (1) façade insulation, (2) roof insulation, (3) window replacement, (4) basement insulation, and various combinations of these four elements, culminating in a complete renovation package that included all four measures. We employed a “staged” energy renovation strategy, also known as “over-time” or “phased” energy renovation. Unlike a comprehensive, one-off energy overhaul, a staged energy renovation involves a series of carefully planned, individual steps carried out over time [22]. This approach limits the immediate investment cost of each step, thereby making the energy renovation process more financially feasible. The choice of each energy renovation step was made based on a ranking according to the Levelized Cost of Saved Energy (LCSE). The LCSE allows for a direct comparison between energy supply (e.g., natural gas price) and energy renovation: Table 1 Baseline parameters used to create the building energy performance simulation models Nation

Year

Type

HDD

kWh/(m2 year)

Average U-value

Romania

1970–1979

Single family-terraced houses

2886

175.0

1.47

Portugal

1990–1999

Apartment blocks

1199

84.8

1.72

Ireland

1990–1999

Single family-terraced houses

2804

83.2

0.69

Sweden

1945–1969

Multifamily houses

5175

134.1

0.59

136

M. G. Prina et al.

LC S E = C RF =

C · C RF S

i · (1 + i )n (1 + i )n − 1

(1)

where: • C is the total cost spread in equal annual payments (investment cost is considered as loan with annual repayment); • S is the annual energy saving; • C R F is the capital recovery factor; • i is the discount rate fixed at 4%; • n is the energy renovation lifetime fixed at 30 years. We calculated the LCSE for each country, building type, construction period, and energy renovation. Each subsequent step in the staged energy renovation of a building implements the renovation measure with the lowest LCSE. For each successive renovation measure—be it façade insulation, roof insulation, window replacement, or basement insulation—the insulation thickness and window type were varied and then set at their minimum LCSE. This was achieved by measuring the additional heating energy savings relative to the previous state of the building, which could already have undergone one to three retrofit steps.

1.4 Results This section presents the results of our comprehensive analysis and simulations. It provides insights into the clusters identified in European building stocks, detailing their specific characteristics and similarities. The findings from our building simulations, following the implementation of various renovation packages, are also delineated. Here, we unpack the cost-effectiveness of these retrofit measures and evaluate their impact on the overall energy performance of the building clusters. These findings pave the way for a deeper understanding of energy efficiency strategies in the European context, thereby enabling more targeted and efficient approaches to energy renovation. Figure 2 graphically presents the outcome of applying the Elbow Method to determine the optimal number of clusters for our analysis. While it typically indicates an “elbow” or clear bend representing the optimal number of clusters, in this particular case, it is not easily discernible. Instead, our findings suggest a broader optimal range of clusters varying between 3 and 15. The less defined elbow in this instance emphasizes the complexity and diversity inherent in the European building stock data. Therefore, for the choice of the optimal number of clusters we should use an additional method: the Silhouette Score.

Clustering of Building Stock

137

Fig. 2 Elbow method

Figure 3 presents the Silhouette Score plotted against the varying number of clusters considered in our analysis. This chart aims to enhance the clarity of cluster selection after the Elbow Method proved inconclusive. The Silhouette Score method offers a more definitive optimal number of clusters, which in our case is four. The score associated with this number of clusters is the highest among all others tested, indicating the strongest data point cohesion within clusters and the most distinct separation between clusters.

Fig. 3 Silhouette score

138

M. G. Prina et al.

Figure 4 adds another layer of detail to the cluster selection process. Not only does this chart confirm that four is the optimal number of clusters (echoing the findings from Fig. 3), but it also provides insight into the internal distribution of data points within each cluster. The Silhouette Coefficient values indicate the degree of similarity of each data point to others within its cluster, compared to those in other clusters. The distribution visualized in this chart highlights the balanced grouping of the data points, reinforcing the validity of selecting four as the optimal number of clusters. Figure 5 presents a comparative analysis of the distribution of clustered points using 4, 5, and 6 clusters. The three categories considered for clustering—energy consumption (kWh/m2 y), average U-values, and Heating Degree Days—are represented on three subplots for each case. In the case of the 4-cluster model, we observe a discernable separation between the clusters, validating our previous selection of four as the optimal number of clusters. The data points are well-distributed and defined

Fig. 4 Silhouette score and distribution of data points within each cluster

Clustering of Building Stock

139

within each cluster, indicating a strong internal cohesion. With 5 and 6 clusters, the clarity and distinction between clusters begin to diminish. The overlap among some clusters increases, leading to potential misinterpretation of data. Moreover, some clusters become sparser, raising questions about their validity. Figure 6 presents a correlation matrix that showcases the relationships between the categories used for clustering: energy consumption (kWh/m2 y), average U-values, and Heating Degree Days. Figure 7 presents a scatter plot derived from Principal Component Analysis (PCA), a dimensionality reduction technique, applied to our three clustering categories: energy consumption (kWh/m2 y), average U-values, and Heating Degree Days. The purpose of applying PCA in this context is to visualize the four clusters identified in our analysis in a two-dimensional space, which can be graphically represented with ease.

Fig. 5 Comparison of clustering analysis with 4, 5, and 6 clusters

140

M. G. Prina et al.

Fig. 6 Graphical representation of the clustering analysis with 4 clusters

Fig. 7 Results of the application of PCA for two-dimension graphical representation

Clustering of Building Stock

141

The plot presents each data point, representing a unique combination of the three categories, placed according to its values for the first two principal components (x, y). These components are new variables derived from our original categories that capture the maximum possible variance in our data. Each cluster is differentiated by a distinct color, allowing for an easy visual distinction. Figure 8 provides a visualization of the results of our clustering analysis. For simplicity, the results are visualized over the energy consumption data for various building stocks in different countries. Each displayed table represents a distinct country with the rows denoting the construction period and the columns indicating the building type. In every cell, you can find the specific energy consumption value in kWh/m2 y for a given building type from a particular construction period within the represented

Fig. 8 Results of the clustering analysis shown over the data of the building stock energy consumption

142

M. G. Prina et al.

country. The cells are color-coded based on the cluster each data point (building type from a specific construction period) is assigned to, as per the results of our four-cluster analysis. This allows us to visually understand how different segments of the building stock in each country are grouped into distinct clusters based on their energy consumption characteristics. Most considered countries exhibit higher space heating values in their oldest single-family/terraced houses. However, there are some exceptions where newer buildings have worse space heating values than the previous ones. Italy, for instance, has worse values in the period of 1945–1969 for single-family/terraced houses compared to the period before 1945. Other examples include Croatia, Estonia, France, etc. Additionally, some countries have lower space heating values in general, such as those with warmer climates like Cyprus, Malta, and Spain. Table 1 shows the parameters close to the centroids extracted for each cluster to perform the building energy performance simulations. The space heating energy saving cost curves for the considered countries resulting from carrying out the methodological steps in Sect. 1.5 are shown in Fig. 9. In Fig. 9 the x-axis shows the cumulative energy savings that can be achieved by implementing all energy renovation measures up to that point on the curve, starting from the left. The y-axis shows the LCSE for each energy renovation measure in terms of Euros invested per kWh saved per year over the lifetime of the energy renovation measure. Several curves exhibit a flexion point where further energy renovation becomes prohibitively expensive while other curves have a more gradual increase in steepness. The reason why the LCSE can also drop at certain points on a curve is because a retrofit step in the staged energy renovation can make a subsequent retrofit step cheaper than if it were carried out without that previous step. Furthermore, there are sections on several curves where the LCSE stays basically constant. These refer to energy renovation of large portions of a building stock with similar performance characteristics.

Fig. 9 Space heating energy saving cost curves at country level

Clustering of Building Stock

143

The granularity of the approach taken allows us to further drill down into the specific energy renovation measures and the types of buildings to which they would be applied.

1.5 Conclusions We presented a methodology to obtain heat saving cost curves in residential buildings starting from an open database of the building stock for all European countries. The methodology demonstrates an important application of clustering including determining the optimal number of clusters through the elbow and silhouette methods. Elements of novelty of this work are the use of open building stock data and the dynamic building energy performance models. Cost curves allow for the explicit technological detail needed to evaluate the decarbonization process of the residential sector and its associated costs. They are therefore frequently used, relevant tools in energy system modelling to evaluate the decarbonization process not only through active measures (heat pumps, district heating, etc.) but also passive ones (energy efficiency options in buildings).

2 Case Study 2: Synthetic Building Energy Performance Data for the Flanders Building Stock (VITO) 2.1 Background One of the highest contributors of CO2 , the urban building sector accounts for approximately 35% of emissions in the EU. Buildings are at the epicenter of urban supply and demand of energy. There exists significant potential to reduce the urban building energy consumption and CO2 emissions. Urban building energy modeling (UBEM) facilitates large scale implementation of sustainable and energy-efficient scenarios. However, UBEM poses significant challenges when considering energy system’s complexity, desired modeling resources, time, and effort to produce accurate results. Furthermore, current UBEM studies lack an in-depth understanding of the modeling challenges and research opportunities associated with the entire UBEM spectrum. Different methods and tools have been developed for the building stock with the aim to provide various insights into system performance, building stock retrofitting potential, energy driven planning, forecasting and urban decision making by evaluating factors such as short or long-term energy use and demand, short term demand response, GHG emissions, potential renewable energy generation and storage etc. UBEM modeling approaches are normally classified as top-down and bottom-up approaches. Top-down models rely on macro level information, historical data and statistical energy use, socio-economic factors, and energy prices to estimate energy

144

M. G. Prina et al.

consumption or carbon dioxide emissions for long term purposes such as high-level building energy policy evaluation, while bottom-up models start from detailed individual building level data and scale all the way up to street, neighborhood, district, city, regional and national building stock level. Bottom-up approaches become more and more popular due to the emerging data at building level. Specifically, a hybrid approach in-between top-down and bottom-up can be favorable in terms of applicability. Among bottom-up approaches, physics-based methods have the advantage that they enable the assessment and quantification of the combined effect of several technologies on the building energy demand, and do not require detailed historical energy consumption and socio-economic factors. Synthetic data generation for this demonstration case focuses on bottom-up, physical-based techniques, given its flexibility with regards to data availability. This type of urban energy modeling requires the definition of model data inputs regarding the modeled buildings’ geometry, construction assemblies, HVAC systems and usage patterns, as well as climate conditions. However, such detailed data collection efforts become impractical for larger urban areas and the computational efforts become often too excessive in case one has to setup an individual building model for each building in large urban areas. Hence, this study uses the archetypal modeling approach, which classifies the building stock according to several building characteristics, after which the energy consumption estimates of modeled archetypes are scaled up to be representative of modeled housing stock. The classification of building stock involves the formulation of representative buildings (or Archetypes). Such classification is driven by the availability of urban building stock data that might include geometrical coordinates, building characteristics, operational energy use, street view imagery and building footprints. Within the machine learning and artificial intelligence domain, there exists another process workflow, called, clustering that has also been widely used for characterizing the urban building stock. Classification is a supervised learning approach where a specific label is provided to the machine to classify new observations. Clustering is an unsupervised learning approach where grouping is done on similarities basis. The major difference between classification and clustering is that classification includes the levelling of items according to their membership in pre-defined groups. This study demonstrates the use of a decision support tool for future scenario analysis for energy planning purposes. Urban Energy Pathfinder (UEP) provides a holistic energy solution by calculating energy, CO2 savings, and financial conditions for renovation scenarios and energy technology measures at building, district and city level. These scenarios include a mix of technological measures such as district heating/cooling networks, building renovation measures and decentralized renewable energy production technologies. The workflow of UEP is subdivided into three main parts: • Characterization of the existing situation of the buildings in the modeled district. • The evaluation of renovation measures on individual building level. • The evaluation of district heating potential for the district.

Clustering of Building Stock

145

For this study, we mainly demonstrate the use of characterization to generate synthetic building stock data. We further elaborate on the evaluation of renovation measures for the Flanders residential building stock. UEP first gathers all available information on individual building, including the building geometry, construction year, building function, installed HVAC, etc. Next, primary characteristics and input data such as current actual energy consumption data are where needed transferred, processed, re-calculated from higher aggregation levels towards the building level by means of spatial allocation algorithms. Next, UEP employs a bottom-up, archetype-based approach in which buildings are classified based on their function, type (apartment, terraced, semi-detached, detached) and construction year period. The archetype characterization is done probabilistic and derived from the national EPC database. Herewith, for each archetype, each energy-related building parameter is described by a probability distribution. To determine an input value needed for the actual building energy simulations, a single value is sampled from these distributions for each parameter and each building. Since two buildings of the same archetype will be characterized by different sampled values, this methodology considers the natural spread of the modeled building stock.

2.2 Methodology The devised methodology (Fig. 10) in this study identifies a balance between the bottom-up and top-down modeling approaches. The data inputs to initiate the building stock include building physical and operational characteristics. The calculated building characteristics are sequentially calibrated using the national statistics on energy balance. The second phase in the methodology deploys a bottom-up energy modeling approach using these data inputs, which undergo a data pre-processing routine to identify any outliers, remove duplicates and fill in any missing values. The pre-processed dataset is then fed into the clustering framework to devise archetypes that represent the building stock. These clusters are then used to create a GIS representation of the building stock at the district scale. The clusters are further used to devise fabric renovation scenarios on an individual building as well as collective district level. The data output phase uses cost optimization for various renovation scenarios at the building level as well as the district level. The building level clustering step in the bottom-up energy modeling phase uses a data-driven unsupervised clustering technique (k-means). The k-means algorithm is used for clustering building stock data as this algorithm is considered to be the best approach for archetypes development when compared to others clustering algorithms. K-means is the most common unsupervised partitional classification algorithm used to solve the clustering problem. Each cluster is represented by the mean of the cluster with the aim of dividing the observations into k clusters where each observation belongs to a respective cluster (center point). The objective is to minimize the sum of the distances of the points to their respective centroid. The most

146

M. G. Prina et al.

Fig. 10 The devised methodology to generate urban building stock data using data-driven machine learning clustering techniques

common distance definition is the Sum of Squared Error (SSE) minimization function, also known as the Euclidean distance. The scalability and simplicity of this approach is a key advantage of the K-means clustering technique when compared to other algorithms. K-means has serious limitations when the data contains outliers and clusters are of different sizes and densities.

2.3 Results We illustrate a few clustering results (Fig. 11) below to provide an idea of the clustering procedure. The clustering process is performed for an exemplary set of detached houses in a Flemish district.

Fig. 11 The k-means clustering results for detached houses in Flanders using building volume and building height as the key parameters; corresponding curves denote the elbow method and the Silhouette Index to identify the number of clusters

Clustering of Building Stock

147

The silhouette index of a cluster value is a measure of the ratio between separation and compactness. The calculation is based on the silhouette width of their cluster objects. The k value is the number of clusters or archetypes present in each dwelling type (Fig. 12). For example, clusters of the detached houses have 2, 3, or (k = 2, 3, or 4) classes that relate to 2, 3, or 4 different building archetypes. The current building stock model in Flanders is rather limited, 4 typologies in TABULA database. By joining various data sources, the methodology formulates a residential sector database (geometry, energy use, user profile, etc.) (Fig. 13). Clustering analysis, based on the database, is conducted to derive the representative residential typologies, and further compare with state-of-the-art research outcomes. The methodology derives typologies with the geometry data and cadaster data (construction year). The number of typologies is carefully defined/limited to account for the calculation capacity, and key parameters of typologies are defined as follows:

Fig. 12 Different values of k (clusters) illustrating the spread of height and building volume

Fig. 13 The identified clusters for the Flemish building stock depicting various hierarchical levels

148

M. G. Prina et al.

Basic parameters for the clustering analysis • Construction period (5 construction periods—pre-1945, 1946–1970, 1971–1990, 1990–2005, after 2005). In the event of existing strong correlations between construction year and geometry, the analysis considers the construction year. Clusters are based on geometrical data when no correlations exist. The clusters identify and assign centroids to each construction period, which are further run in the static calculation engine using different renovation packages. This is followed by merging the similar clusters together. • Building type (4 types: terraced, detached, semi-detached, apartment) • Geometry (3 geometrical representations—small, medium, large categories are assigned numbers using the centroid values of the cluster. Derived/calculated parameters • Based on the construction period, building characteristics (U value etc.) are estimated for Flanders. • Renovation measures: Only building envelope related measures with the cost’s figures. Heating source for each typology (gas, oil, electricity, etc.) Demand—Existing buildings The baseline demand consists of (mainly) heating/DHW/cooling demand(s) of the existing building stock, calculated using a static calculation engine with the derived typologies from clustering analysis. The current heating/DHW/cooling demand(s) are cross checked and verified with the national/regional energy balance data. Demand—New buildings The typologies of new buildings are derived from the clustering analysis to define the current new built demand. Additional crucial inputs include: • Current total floor area of the new built • Historical new built rate as reference—assumes a fixed rate or parameter varies along the time. • Availability of land for new built • Policy elements and implementation timeline considered for new building’s demand calculation (e.g. EPC requirements in Flanders). Renovation measures The renovation measures in the building stock model focus mainly on the building fabric (roof, wall, windows). Constraints are set within the model when evaluating. Renovation scenarios include: – Roof – Wall

Clustering of Building Stock

– – – – –

149

Window Roof + Wall Roof + Window Wall + Window Roof + Wall + Window.

The renovation packages are generated by the static calculation engine and are further fed into the building stock model. This includes a pre-screening of the measures due to the limitation of the calculation time. Additional renovations include PV rooftop panels, and HVAC system upgrades.

2.4 Discussion This study further demonstrates a clustering-based renovation approach to identify the fastest way to A-label and calculates the investment costs per building (Fig. 14). This forms a part of the data output phase in the devised methodology and performs a cost-based optimization considering minimum total cost of ownership. For instance, the figure illustrates the costs associated with different renovation scenarios when applied to the identified clusters. The investment costs differ based on the construction age bands. For old buildings, the recommended fastest way to A-label is to invest in fabric renovations, particularly roof and façade insulations. New buildings are constructed as per the new standards and hence, including a PV rooftop system is the recommended renovation.

Fig. 14 The clustering-based renovation recommendations to identify the fastest way to achieve A-label per building

150

M. G. Prina et al.

We would like to mention that the selection of renovations is highly dependent on the evolution of prices in the future. For instance, heat pumps would eventually phase out from the recommended renovations when considering the increase in electricity prices. Furthermore, the recommendation to upgrade windows considers only glass replacement (less investment) in newer buildings and hence, appears frequently in the list of recommended options.

References 1. Renovation and decarbonisation of buildings. https://ec.europa.eu/commission/presscorner/det ail/en/IP_21_6683. Accessed January 13, 2022 2. Renovation wave. https://energy.ec.europa.eu/topics/energy-efficiency/energy-efficient-buildi ngs/renovation-wave_en. Accessed February 21, 2023 3. EU Buildings Factsheets | Energy. https://ec.europa.eu/energy/eu-buildings-factsheets_en. Accessed February 21, 2023 4. M.G. Prina, D. Moser, R. Vaccaro, W. Sparber, EPLANopt optimization model based on EnergyPLAN applied at regional level: the future competition on excess electricity production from renewables. Int. J. Sustain. Energy Plan Manag. 27, 35–50 (2020). https://doi.org/10.5278/ijs epm.3504 5. Hotmaps project building stock data. https://gitlab.com/hotmaps/building-stock/-/tree/master/ data. Accessed May 16, 2023 6. M. Hummel, R. Büchele, A. Müller, E. Aichinger, J. Steinbach, L. Kranzl et al., The costs and potentials for heat savings in buildings: Refurbishment costs and heat saving cost curves for 6 countries in Europe. Energy Build 231, 110454 (2021). https://doi.org/10.1016/J.ENBUILD. 2020.110454 7. M. Jakob, Marginal costs and co-benefits of energy efficiency investments: the case of the Swiss residential sector. Energy Policy 34, 172–187 (2006). https://doi.org/10.1016/J.ENPOL. 2004.08.039 8. H. Lund, J.Z. Thellufsen, S. Aggerholm, K.B. Wittchen, S. Nielsen, B.V. Mathiesen et al., Heat saving strategies in sustainable smart energy systems. Int. J. Sustain. Energy Plan Manag. 4, 3–16 (2014). https://doi.org/10.5278/IJSEPM.2014.4.2 9. K. Promjiraprawat, P. Winyuchakrit, B. Limmeechokchai, T. Masui, T. Hanaoka, Y. Matsuoka, CO2 mitigation potential and marginal abatement costs in Thai residential and building sectors. Energy Build 80, 631–639 (2014). https://doi.org/10.1016/J.ENBUILD.2014.02.050 10. A. Toleikyte, L. Kranzl, A. Müller, Cost curves of energy efficiency investments in buildings— Methodologies and a case study of Lithuania. Energy Policy 115, 148–157 (2018). https://doi. org/10.1016/J.ENPOL.2017.12.043 11. U. Filippi Oberegger, R. Pernetti, R. Lollini, Bottom-up building stock retrofit based on levelized cost of saved energy. Energy Build, 210 (2020). https://doi.org/10.1016/j.enbuild. 2020.109757 12. R. Harmsen, R. Harmsen, B. Zuijlen van, P. Manz ee al., Cost-curves for heating and cooling demand reduction in the built environment and industry. Utrecht (2018) 13. ISO 52016–1:2017. Energy performance of buildings — Energy needs for heating and cooling, internal temperatures and sensible and latent heat loads — Part 1: Calculation procedures. https://www.iso.org/standard/65696.html. Accessed May 16, 2023 14. Construction price list for the Province of Bolzano, Italy (only available in Italian and in German). https://www.provincia.bz.it/lavoro-economia/appalti/elenco-prezzi-provinciale-onl ine.asp. Accessed May 16, 2023 15. Construction producer prices or costs, new residential buildings—quarterly data (sts_copi_q). https://ec.europa.eu/eurostat/databrowser/view/sts_copi_q/default/table?lang=en. Accessed May 16, 2023

Clustering of Building Stock

151

16. Energy statistics—cooling and heating degree days (nrg_chdd). https://ec.europa.eu/eurostat/ cache/metadata/en/nrg_chdd_esms.htm. Accessed March 13, 2023 17. A. Likas, N,J. Vlassis, J. Verbeek, The global k-means clustering algorithm. Pattern Recognit 36, 451–61 (2003). https://doi.org/10.1016/S0031-3203(02)00060-2 18. H. Teichgraeber, A.R. Brandt, Clustering methods to find representative periods for the optimization of energy systems: an initial framework and comparison. Appl. Energy 239, 1283–1293 (2019). https://doi.org/10.1016/J.APENERGY.2019.02.012 19. M.A. Syakur, B.K. Khotimah, E.M.S. Rochman, B.D. Satoto, Integration K-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conf. Ser. Mater. Sci. Eng. 336, 012017 (2018). https://doi.org/10.1088/1757-899X/336/1/012017 20. A.M. Bagirov, R.M. Aliguliyev, N. Sultanova, Finding compact and well-separated clusters: Clustering using silhouette coefficients. Pattern Recognit. 135, 109144 (2023). https://doi.org/ 10.1016/J.PATCOG.2022.109144 21. J. Kragh, K.B. Wittchen, Development of two Danish building typologies for residential buildings. Energy Build 68, 79–86 (2014). https://doi.org/10.1016/J.ENBUILD.2013.04.028 22. M. Saffari, P. Beagon, Home energy retrofit: reviewing its depth, scale of delivery, and sustainability. Energy Build, 269 (2022). https://doi.org/10.1016/j.enbuild.2022.112253

Big Data Supported Analytics for Next Generation Energy Performance Certificates Gema Hernández Moral , Víctor Iván Serna González , Sofía Mulero Palencia , Iván Ramos Díez , Carla Rodríguez Alonso , Francisco Javier Miguel Herrero , Manuel Pérez del Olmo , and Raúl Mena Curiel

Abstract Energy Performance Certificates (EPCs) have been in place since the implementation of the Energy Performance Directive (2010/21/EU) and were envisaged as a tool to raise awareness on energy efficiency in buildings as well as boost energy refurbishments in the building sector. However, they are slowly becoming a bureaucratic checkpoint that has not fully reached its intended objective. Moreover, due to the lack of interoperability in between data sources and the complex management of the EPCs it has been unfeasible up to now to fully exploit the potential of these official documents. In this chapter, the focus will be placed on several challenges surrounding Energy Performance Certificates, and how big data

G. Hernández Moral (B) · V. I. Serna González · S. Mulero Palencia · I. Ramos Díez · C. Rodríguez Alonso · F. J. Miguel Herrero · M. Pérez del Olmo · R. Mena Curiel CARTIF Technology Centre, Parque Tecnológico de Boecillo, Parcela 205, 47151 Boecillo, Valladolid, Spain e-mail: [email protected] V. I. Serna González e-mail: [email protected] S. Mulero Palencia e-mail: [email protected] I. Ramos Díez e-mail: [email protected] C. Rodríguez Alonso e-mail: [email protected] F. J. Miguel Herrero e-mail: [email protected] M. Pérez del Olmo e-mail: [email protected] R. Mena Curiel e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_6

153

154

G. Hernández Moral et al.

and machine learning applications can support them. By exploiting publicly available data (cadastre, building construction catalogues, climate data) and monitoring data (real time data), five main functionalities will be provided: (1) EPCs checker, (2) EPCs data exploitation and reports generator, (3) energy conservation measures explorer, (4) visualisation of EPCs and estimated energy parameters, and (5) climate change impact on energy use analysis. These functionalities will support regional authorities in charge of EPC management, building managers as well as citizens, tenants and owners to acquire a deeper understanding on the energy use of their and the neighbouring buildings, as well as assure quality in the information provided or gather insights towards future energy refurbishments. This chapter provides a summary of the application of these services in the Castilla y León and Asturias regions in Spain, as deployed within the H2020 projects: MATRYCS “Modular big data applications for holistic energy services in buildings”, BD4NRG “Big data for next generation energy” and I-NERGY “Artificial intelligence for next generation energy”. Keywords Energy performance certificates · EPCs · Big data · Analytics · Energy planning · Data exploitation · Energy refurbishment

Acronyms API BAC BIM BSO CDD CMIP6 CO2 COP CSV CTE C3S DHW DT EPB EPBD EPC ESCO EU GeoJSON GIS HDD HVAC

Application Programming Interface Building Automation and Control Building Information Modelling Building Stock Observatory Cooling Degree Days Coupled Model Intercomparison Project Phase 6 Carbon Dioxide Coefficient of Performance Comma Separated Value Spanish Building Code (Código Técnico de la Edificación) Copernicus Climate Change Initiative Domestic Hot Water Digital Twin Energy Performance of Buildings Energy Performance of Buildings Directive Energy Performance Certificate Energy Service COmpany European Union Geographic JavaScript Object Notation Geographical Information Systems Heating Degree Days Heating Ventilation and Air Conditioning

Big Data Supported Analytics for Next Generation Energy Performance …

ICT KPI LiDAR LPG ML MS NUTS OSM PDF PV RCP SME SSP TBS XML XSD

155

Information and Communication Technologies Key Performance Indicator Laser Imagining Detection and Ranging Liquefied Petroleum Gas Machine Learning Member State Nomenclature of territorial units for statistics (Nomenclature des Unités Territoriales Statistiques) Open Street Maps Portable Document Format Photovoltaic panel Representative Concentration Pathway Small Medium Enterprise Shared Socioeconomic Pathway Technical Building Systems eXtensible Markup Language XML Schema Definition

1 Introduction The current situation of the building stock and its energy consumption has led the EU to propose actions to increase energy efficiency, boost refurbishment actions and foster the application of renewable energy sources. This was reflected in the “Clean Energy for All Europeans” package [1], which included eight legislative proposals that addressed among other Energy Efficiency, Energy Performance in Buildings, Renewable Energy and Governance [2]. The objectives set in these directives have been further strengthened with the European Green Deal [3], where the EU “strives to be the first climate-neutral continent”, and reduce net greenhouse gas emissions by at least 55% by 2030 [4]. In this context, the Renovation Wave strategy [5], published in 2020, can be highlighted. It has the aim to tackle energy poverty and worstperforming buildings, public buildings and social infrastructure and decarbonise heating and cooling. These strategies are aligned with one of the most significant packages “Fit for 55” [6], which aims at delivering the EU’s 2030 climate target on the way to climate neutrality. In this context, and since 2010, Energy Performance Certificates (EPCs) have been a tool set forward in the EU to support these actions. Member States are required to submit an EPC when relevant transactions related to dwellings, building blocks or commercial premises happen, such as renting, selling or finalising a new construction project. The objective is for Energy Performance Certificates to support in the energy efficiency improvement of the building stock by boosting energy refurbishments as

156

G. Hernández Moral et al.

set out on the Energy Performance of Buildings Directive (EPBD 2010 and 2018 [7]). The application of this EU directive implied transposing at national level the objectives and targets from the directive, and making them actionable through strategies. As a consequence, Member States developed and implemented Energy Performance Assessment and Certification schemes using as a basis the common guidelines set in the directive. These include not only the definition of how Energy Performance Certificate should be calculated, but also defining what standards should be used, what input data are entered into the different tools, who are the experts entitled to issue these documents, how are EPCs registered, and how is their quality checked, if sanctions should be applied or not, where and when should EPCs be used, etc. Nowadays, energy performance assessment and certification schemes and their accompanying assessment methodologies are well-established in many EU countries. Through an understandable colour coding based on the buildings’ efficiency, an increased knowledge of the building stock of a country can be acquired, as well as comparability among buildings and countries fostered. However, even though clear benefits could be obtained with the implementation of EPCs, the review of the energy performance assessment and certification schemes has shown different interpretations, approaches and levels of compliance. This fact leads to a climate of distrust by the general public in the whole process and in the actors involved in it. Even though it will require a great amount of time and resources for the whole approach to energy performance assessment and certification schemes to be altered based on the currently identified shortcomings, improvements to these schemes should be implemented in a coherent and cost-effective manner. Several key identified areas of improvement have been highlighted by the European Commission and translated into research topic calls, giving rise to the “Next Generation Energy Performance Certificates cluster” [8], which include H2020 EU projects that started in 2019 (QualDeEPC [9], U-CERT [10] and X-tendo [11]), 2020 (D2EPC [12], E-Dyce [13], ePANACEA [14] and EPC RECAST [15]), 2021 (crossCert [16], EUB SuperHub [17], iBRoad2EPC [18] and TIMEPAC [19]), and 2022 (Smartliving EPC [20] and Chronicle [21]). Convergence of energy performance certificates in Europe, digitalisation and links to other existing concepts, or increasing the understanding and usability of EPCs are some of the key topics addressed. Moreover, since EPCs are an official document to display energy performance of buildings across the EU and are mandatory in certain cases, an increasing amount of valuable data is generated to this respect that can benefit a varied set of stakeholders in the building value chain [22]. If this is coupled with the constantly increasing momentum of big data and related ICTs or other technologies [23, 24], and the availability of energy and off-domain data [25, 26], this creates an unprecedented market opportunity for energy efficiency in the EU. This chapter presents different big data solutions that can support in the deployment of the energy performance assessment and certification schemes. For this, the main aspects related to Energy Performance Certification will be covered in Sect. 2, delving into their context in Europe, their structure, and how they are managed. This will constitute the basis to understanding the big data solutions proposed in Sect. 3. The objectives they pursue are varied. The first one focuses on the quality assurance of

Big Data Supported Analytics for Next Generation Energy Performance …

157

Energy Performance Certification by performing quality checks (Sect. 3.1). Services 2 and 3 exploit the data across different regions to extract insights for different purposes, such as the generation of regional energy strategies (Sect. 3.2), or provide insights on energy conservation measures included in Energy Performance Certificates (Sect. 3.3). Finally, the last two services contribute to a better understanding on energy performance in regions by visualising EPCs at different scales (from building block level to regional level) under current climatic circumstances (Sect. 3.4), and predicting the impact of climate change on the energy demand (Sect. 3.5).

2 Energy Performance Certification in Europe: Main Challenges and Opportunities Energy Performance Certificates in Europe are the end result of a series of administrative steps that pursue the increase of energy efficiency investment rates to support the decarbonisation objectives put forward by the EU. These are part of Energy Performance Certification schemes, which can be structured into three main areas, linked to the stages the EPCs go through: • Before issuing EPCs—setting the framework. This includes how energy performance certification methods (and corresponding tools) are established, what data needs to be used (and what data model it follows), and how are EPC registers established and managed. • During EPCs issuing process—ensuring quality. Quality in the EPC schemes includes setting up a regulatory and sanction framework, quality practices and procedures, determining who the qualified experts should be, and assuring compliance and verification. • After EPCs have been issued—boosting energy refurbishments. Maximising the use of EPCs can be linked to adding value to these documents by linking them to specific procedures, as well as enhancing social awareness and increasing the usability of EPCs and recommendations (for instance, through one-stop-shops). A comprehensive description of all the challenges in these three areas would be too vast and out of the scope of this chapter. For this reason, the following section focus on key issues that can be linked to the big data solutions presented in tshis chapter. More information about the Energy Performance Certification in Europe has been collected in this JRC Report [27].

158

G. Hernández Moral et al.

2.1 Stages in the Energy Performance and Certification Schemes A. Before issuing EPCs—setting the framework: Energy performance assessment, data, registers and tools To ensure reliable and high-quality energy performance assessment, the focus on calculation methods and data is paramount. For this, Annex I of the EPBD sets up a framework for energy performance assessment based on standards and calculation methodologies derived from mandate M480 [28]. Each national transposition, supported by standards under M480 [29], establishes a methodology for calculating the integrated energy performance of buildings. To align the calculation results, reference documents and tools support these methodologies. A set of reference documents and tools to assess the energy performance of buildings and obtain the information contained in the certificate are defined in every MS. However, despite the effort in aligning the methods, documents and tools across the EU, some divergences exist [30]. In the end, a huge number of tools and reference documents exist across Europe, all of them based on the criteria established on the calculation methodology proposed at national (or regional) level in each MS. In this context, data needs, required results and registers are strongly linked to the calculation methodologies deployed. When transposing the EPBD, each country has defined parameters to be used as input to EPB certification tools. This set of data represents a data model. However, even though a framework has been provided, not all MS follow the same one to represent their EPCs, leading to a lack of comparability among certificates in MS. This data model is also used to define the repositories to store EPCs. These have been implemented broadly in the EU, but how they have been implemented differs depending on the MS: they can be maintained at national or regional level, count on public access or be private. Main challenges: assuring reliability and trust with energy performance calculation methods, cost-effectiveness of methods deployed, adequate balance on the amount of data required, potential consideration of metering data in assessment process, definition of common EPC data model in Europe to make results comparable, reliance on standards for energy performance and data deployed.

B. During EPCs issuing process—ensuring quality: Quality assurance schemes To ensure having high-quality EPCs a harmonised control system and enforcing mechanisms are essential. For this, it is necessary to count on a robust regulatory framework, qualified experts and compliance and verification mechanisms. It has been widely acknowledged [31] that continuous monitoring of the actions within these schemes has an impact if accompanied by a clear sanctioning system. For this, responsibilities among stakeholders and the enforcing authorities should be

Big Data Supported Analytics for Next Generation Energy Performance …

159

clear. Moreover, apart from having a method to identified qualified experts to issue EPCs, it is of the utmost importance to establish an Inspection Control System for EPCs compliance and verification (as per Art 18 and Annex II of the amending 2018 EPBD). Aspects such as the size of the samples to be examined or the use of databases for monitoring and verification purposes are defined in the mentioned article. Additionally, the corresponding generated workload should be balanced, as well as the progress monitored, links to other databases (interconnected databases) established and bodies in charge of the validation defined, etc. Last but not least, the incorporation of smart readiness capabilities should be also considered since it may facilitate this effort. Main challenges: implementation of independent compliance and control quality assurance schemes, determination of control samples, assurance of EPC compliance and validity (preferably in an automatic manner), monitoring EPC compliance rate, enforcing penalties for non-compliance.

C. After EPCs have been issued—boosting energy refurbishments: Maximisation of EPC usability and social awareness Reaching the final result to boost the energy refurbishment market, and thus proving the effectiveness of Energy Performance Certificates, will only be achieved when they are regarded as a valuable mechanism. For this, it is crucial to enhance the opinion the general public has on EPC (increase social awareness) and maximise their use. In this context it is fundamental that the concept of EPC is understood and valued by the general public towards their uptake and further exploitation of their benefits. Information needs to be transmitted in an easy and comparable manner, without overburdening the end user (citizens) with information. Moreover, further interest on EPCs can be obtained by adding value to them, as well as by linking them to other concepts or processes. Adding value to EPCs can be pursued by incorporating a more precise and complete information on buildings (building skin, TBS and BAC), which in turn may lead “recommendations” included in the EPC to match those energy actions defined in an energy audit. Also, inputs necessary for the EPC may be collected from other sources of information (BIM, inspection reports, audits) or databases (cadastres, heating and HVAC). This connection among existing registers may also facilitate checking actions. The inclusion and linking of the Energy Performance Certificates within wellknown concepts or processes related to energy evaluation (such as energy audits), or to new initiatives to harmonize information of buildings (building renovation passports, renovation roadmaps, etc.) can be paramount to their success. Also, promoting one stop shops where to facilitate details about financial mechanisms can generate demand and therefore boost the energy refurbishment market. However,

160

G. Hernández Moral et al.

a pre-requisite for them to function adequately this is assuring that they are easily understood and accepted by society. Main challenges: increase of social acceptance and understanding of EPCs by the general public, linkage of EPCs with other processes (e.g. energy audits or one-stop-shops), interoperability of information with other data sources.

2.2 Energy Performance Certification in Spain This dedicated subsection to energy performance certification in Spain is provided to highlight two relevant specificities of the transposition of the EPC scheme in this country, since this has affected the design of the solutions proposed, as well as it has an impact on their replication possibilities. • Submission process, EPC registers and open data: qualified experts in Spain (architects, engineers, etc.) calculate EPCs through a nationally validated tool [32] (all freely available and offered by the MITECO1 ) or an equally valid method (which should be adequately justified). Regions are in charge of managing the EPC registers and should report at national level an assessment on the EPCs they have registered in a determined period of time. Depending on the regional authority, some of the data offered in the EPC is made publicly available through an open data platform. This is the case of selected parameters of the EPC in Castilla y León region, but in the case of Asturias no data is made available. • Data model and enrichment: even when the EPC registers are managed at regional level (as well as their quality assurance), the calculation framework established at national level to issue the certificates (EPC tools) are common in all the country. As a consequence, it is possible for the data model of the outputs of the tools to be defined in an XSD, and to be the same in the whole country. Also, it is worth mentioning, that some regions, such as Castilla y León add complementary data gathered in the submission process of the EPC. This does not alter the original EPC, but facilitates the identification of EPCs and the querying within the regional repository.

1

Ministry for the Ecological Transition and Demographic challenge—Ministerio para la transición ecológica y el reto demográfico.

Big Data Supported Analytics for Next Generation Energy Performance …

161

3 Big Data Supported Solutions Based on Energy Performance Certificates As it has been observed, energy performance assessment and certification schemes present both a wealth of possibilities, as well as need to overcome certain challenges in order for the implemented system to be robust and be able to achieve its final objective of decarbonising the building stock through energy performance refurbishments. The increasing amount of energy performance certificates, coupled with energyrelated data from relevant data sources, as well as off-domain data offer a basis for big data solutions to be applied and relevant insights to be extracted to improve decision-making processes. The challenges to be tackled when developing this kind of services related to energy performance certificates, and coupling them with additional data, do not only have to do with the volume of data that is managed, but also with their velocity, value, variety and veracity. In particular, as it will be explained, the variety and the veracity are two of the key pillars that are consistently addressed in the services presented in this chapter. Also, the data value chain is carefully considered in all of the services, by paying attention to the data generation acquisition, data analysis processing, data storage curation, and data visualisation and services [33]. Last but not least, it is crucial to consider the end users in the development of the services. Advanced technical solutions including complex algorithms and high volumes of data can be technologically pleasing and interesting from a research point of view, but might not bring impactful results nor be effective if they are not well understood by the end users. For this reason, special efforts have been deployed to validate the services presented in this chapter with the end users during their development, so that refinements could be included at later stages of their development. In the context of the energy performance assessment and certification schemes, there is a huge variety of stakeholders within the value chain that can benefit from the services proposed in this chapter. These range from facility managers, ESCOs, utilities, constructors and contractors, real estate developers and managers, designers, SMEs and companies, institutions, investors, policy makers at different levels to citizens, tenants and owners. A detailed account on how the services can provide relevant insights to each of these target groups will be provided in the discussion section. The following sections showcase the different solutions developed to support stakeholders in the energy performance assessment and certification schemes. All of the sections follow the same structure. First, the main objective and the challenge addressed is briefly presented, then, the solution design, data used and steps covered are explained. Next, the user experience, through the visualisation of graphical user interfaces or mock-ups is illustrated. Then, the application of the solution in a specific context is described. Last but not least, replication possibilities of the solutions are analysed, as well as next steps defined.

162

G. Hernández Moral et al. A. BEFORE ISSUING EPCs

B. DURING EPCs ISSUING

C. AFTER ISSUING EPCs

ENERGY PERFORMANCE ASSESSMENT, DATA, REGISTERS AND TOOLS

QUALITY ASSURANCE SCHEMES

ENERGY REFURBISHMENTS BOOST, USE MAXIMISATION & SOCIAL AWARENESS

1. EPCs CHECKER 2. EPCs DATA EXPLOITATION AND REPORTS GENERATOR 3. ENERGY CONSERVATION MEASURES EXPLORER

3. ENERGY CONSERVATION MEASURES EXPLORER 4. VISUALISATION OF EPCs AND ESTIMATED ENERGY PARAMETERS 5. CLIMATE CHANGE IMPACT ON ENERGY USE

Fig. 1 Proposed services related to key issues in energy performance assessment and certification schemes

These solutions are linked to the existing steps in energy performance assessment and certification schemes, as it can be seen in the Fig. 1. These services have been developed in the context of the MATRYCS [34], INERGY [35] and BD4NRG [36] European projects, and have counted on specific pilots where these services are currently being validated.

3.1 EPCs Checker EPCs can be seen as a key instrument to provide meaningful information to assess and benchmark the energy performance of buildings. Therefore, these documents not only provide information, but can also drive the transition towards an improved building stock in terms of energy efficiency. In this context, it seems crucial to guarantee the health and quality of the source data, and consequently, of the EPCs databases in charge of these processes. The following sections focus on this inspiring topic, starting with an overview of the main challenges to face, and how the solution proposed in the MATRYCS project has been designed to be able to cover them all. A. Main objective and challenge addressed There are many aspects that can be listed when facing the verification of the improvement of the data quality of EPC databases. To begin with, EPCs present a very complex schema, full of multiple connections between the described building blocks, even the process related to the type of database where to store this information can be full of drawbacks that need to be analysed in depth. The implementation of EPC databases is voluntary, however almost all Member States have done so. These databases contain the information extracted from the XML files, allowing access to building information for general checks or to implement statistics, among others. Despite this, it is still a major problem for database managers in Spain to have adequate mechanisms for quality assurance, due to the lack of experience in some fields and the difficulties of the nature of this complex information, which requires the joint work of experts from the fields of energy, architecture, computer science and engineering, for the successful implementation of solutions. This means that the usual information review procedure consists of manual checks of a stipulated percentage of certificates per year. It is therefore necessary to have tools to automate

Big Data Supported Analytics for Next Generation Energy Performance …

163

such a difficult and time-consuming process in order to benefit from such valuable information. B. Solution design, data used and steps covered The work presented here sheds light on the Spanish EPC scheme, and accordingly, a set of quality control mechanisms have been defined. It is worth mentioning that during the process of issuing the EPC the supporting calculation software performs some validations of the input data, such as ranges and types, based on a set of previously established rules. However, inconsistencies between related elements are not explored, and physically impossible scenarios may be described in the data. This quality checker analyses such cases, in line with the suggestions provided by EREN2 after a joint in-depth analysis. The Spanish EPC schema contains 15 main blocks, in which the nature of the information differs, and, therefore, the types of errors that can be found are also diverse. Those blocks and the nature of the data contained, are listed below: • Building identification. All these fields contain information in string format. There are specific fields whose format has to be defined in line with predefined lists or patterns (type of building, scope information, or climate zone, year of construction). • Certifier’s data. All these fields contain information in string format. This field is not displayed in the tool, following the defined anonymisation procedures. • Building geometry. The information in this section is in int, float and string format. No specific admitted values are foreseen. There is multiplicity between elements. • Building envelope. These fields are in string and float format. There are specific fields whose format has to be defined according to predefined lists (type, mode, for different constructive elements). There are complex multiplicities between elements. • Thermal energy systems. These fields are in string and float format. There are specific fields whose format needs to be defined in line with predefined lists (type of energy, production mode). Multiplicity can occur for the heating, cooling or DHW installations. • Lighting systems. These fields are in string and float format. One of the fields must be defined according to a predefined list (production mode). Multiple systems can exist for the same building. These values are available only in tertiary buildings. • Functioning and occupancy conditions. These fields are in string and float format. One of the fields should be defined in line with a predefined list (building use). • Renewable energy. These fields are in string and float format. No specific admitted values are foreseen. Multiplicity can take place. • Demand. All these fields contain information in float format. No specific admitted values are foreseen. Within this section, optional values of a reference building 2

Ente Público Regional de la Energía de Castilla y León – Regional Energy Agency of Castilla y León.

164



• • • • •

G. Hernández Moral et al.

according to legislation or compliance with the Spanish building code are provided within this section. Consumption. All of these fields contain information in float format. No specific admitted values are foreseen. Within this section, optional values of a reference building according to legislation or compliance with the Spanish building code are provided within this section. Emissions. All of these fields contain information in float format. No specific admitted values are foreseen. Energy label. These fields are in string and float format. The admitted values correspond to a pattern for all the string values. Improvement measures. These fields are in string and float format. The admitted values correspond to a pattern for all the string values. Multiplicity can take place. Building tests, checks and inspections. These fields are in string format. No specific admitted values are foreseen. Multiplicity may occur. Customized data. No specific format is foreseen. Multiplicity may occur. This field is optional. The parameters to be checked have been grouped into three main categories:

• Values in list: based on the expected values for specific fields, verifications are performed to see whether the information follows the correct format or not. • Values in range: within this category, the quality checks involve simple and more advanced validations. Simple ones, include verifications such as non-zero values, percentage values or data below/ above a threshold. More advanced validations consider trends and average values for the specific typologies under study. • Values according to regulation: compliance with the Spanish building code is verified. The main validations offered through the EPC checker, aim to help identify the EPCs to be further inspected through a manual process. The tool provides this information through an easy-to use graphical user interface, that will be presented in the following subsection. However, no corrective actions are performed, leaving it up to the end-user to take the measures they deem appropriate for data curation. C. User experience The tool is quite simple and straightforward, allowing the user to check whether the information in the EPC is correct and specific key parameters don’t exceed certain limits. This can be done in two ways, as can be seen in the first screen: by uploading an individual XML file or by selecting one from those in the database. Then, the tool “checks” the EPC and provides a general information on the EPC checked. In a first screen, it provides an assessment of all EPC categories, by showing them as dropdown and with the total number of fields checked, as well as the ones that are checked as “unexpected values” at the right. In this screen, information on the total detected errors is provided, as well as general EPC key data for reference: municipality, address, year of construction of the building, as well as type of building and the cadastral reference (Fig. 2).

Big Data Supported Analytics for Next Generation Energy Performance …

165

Fig. 2 Second screen once the selected EPC is checked, it provides information at the right side on the number of parameters checked within each category and the number of those for which an error is found

In the same screen, each of the categories can be displayed and then, all the information of the EPC in each of them can be visualised through the drop-down menus. This is done for reference, and to provide the user the overall information contained in the EPC even if some of the fields shown are not checked. Nevertheless, visualising all the data can help to understand the errors in other parameters. So, the full EPC can be consulted through the checker tool, and at the right of such data, it provides the assessment for each value on whether it is checked or not, and for those checked, it returns different messages: “expected value” for those fields checked and seen as right values according to the check defined; “unexpected value” is the general message for those fields checked and returned in error (Figs. 3 and 4). For the unexpected values, also a reference message is provided to the user, to give them with an idea on the type of checking performed on them. These are, for example, “address not in the cadastre”; “value not in the list”, for those fields where the value should be among a range or list of figures or words; “inconsistent with

Fig. 3 First screen for the selection of the EPC to check

166

G. Hernández Moral et al.

Fig. 4 Second screen of a checked EPC, when displaying the first category of “building identification”

CTE”, for those fields contrasted with normative values and type of element in force at the time of its construction (CTE is the Spanish Building Code with which the Spanish EPCs have been contrasted); “outlier value”, for those fields in which its figures are checked as analysis of all values in such field over the whole EPCs in the database; or “not in line with other values”, for those fields that are checked according to their interdependence with others. D. Application in a specific context This service has been developed in two regions in Spain. Namely, in Castilla y León region as part of the MATRYCS project and in the Asturias region as part of the I-NERGY project. In the first case, the database contains around 100,000 EPCs in XML format, whereas the database for the Asturias region EPC’s checker currently contains the EPCs from 2016 to 2018, in XML format. E. Replication possibilities and envisaged next steps The functionalities provided by the EPC checker have been defined in a generalist way, with the aim of being easily adaptable to new parameters. Thus, this tool could be easily modified to incorporate new sections in the list of elements to be verified, at least for the first level of validations. In addition, some of the defined thresholds may undergo changes in the regulations, so a procedure has been envisaged to update them. The second level of quality verification has more implications: to start with, trends and average values are considered to define reference values to compare with. These validations are only considered for critical elements and specific cases detected in the EPCs of Castilla y León. Further analysis should be carried out to determine whether this is the case in other regions of Spain.

Big Data Supported Analytics for Next Generation Energy Performance …

167

Finally, cadastral information also plays a key role for the verification of a large part of the values, for example in terms of consistency in location or geometry. So, the EPCs database together with the cadastre database feed this tool in order to offer these functionalities. A new data flow must be defined in case this geometric database is not available or follows a different scheme and cannot be connected.

3.2 EPCs Data Exploitation and Reports Generator Energy performance certificates need to be issued when renting, selling or building a new building/dwelling or commercial area. Moreover, these need also to be issued when applying to receive a grant to refurbish a building. As a consequence, over the years when energy performance certificates have been in place, a vast amount of information has been generated. They contain not only the energy performance status of the building stock, but also building stock characteristics. When these data are analysed at broader scales, relevant insights can be obtained that improve decisionmaking processes when planning energy refurbishment actions. A. Main objective and challenge addressed The solution proposed will provide the user with the capabilities to exploit data contained in energy performance certificates, as well as in cadastral data. This will facilitate the users to have a robust understanding not only of the building stock of a certain region, but also its energy status. The wealth of information contained in this data sources is enormous, therefore, analytical tools that enable users to extract relevant insights should be applied. B. Solution design, data used and steps covered The proposed solution offers information of two main blocks: (1) information related to the characteristics of the building and (2) information of the EPCs files. The tool has been generated using Power BI tool [37], which is a data visualization service that to turn data into insights and actions. The tool can connect to hundreds of data sources, both local and cloud-based, and create interactive reports with stunning visuals. Besides the reports can be embedded into the web applications easily. The first step is to feed the tool with the input data. For this, the data sources to be analysed have to be connected. Once the data sources are defined and connected, the data needs to be shaped and modelled with queries that transform and enrich the data. A well-structured data model helps to generate visuals that are more useful and attractive. Some processes are performed to clean, filter, join, and calculate the data to be show. Once the data is prepared, the next step is to create visualizations and reports with the data models. An analysis of the best visuals has to be done in order to generate interactive and compelling reports which could be useful for the users. Besides, the filters in the different screens have to be synchronised in order not to lose the filters selected between different screens. The last step is the publication of the report in a web page.

168

G. Hernández Moral et al.

Fig. 5 Data model of the building’s characteristics

In the case of the information related to the characteristics of the building, a digital twin (DT) is used. This digital twin is composed mainly of: (1) cadastral information (Building, Building Parts, Cadastral Parcels and Cadastral Zones files [38]), (2) geographical information of municipalities, province and regions [39] and (3) digitized mapping of census sections [40]. The digital twin offers the information properly processed and enriched. This DT is connected with the Power BI tool and it is organized in a data model. The data model can be seen in the following Fig. 5. Apart from the creation of the data model, in this step some processing is carried out, in order to select the parameters that are used in the final report, and some minor translations that are needed for a right visualization. After that, the generation of the visualization screen is done. The result can be seen in the next section “user experience” (Fig. 6). In the case of the EPC data, the process followed was similar. Firstly, the data sources are identified. In this case the information contained in the EPCs is combined with information of the digital twin in order to geolocate the EPCs. Once the information is connected to the Power BI, the data is filtered and modelled. The selection of the main parameters is needed because EPCs contain a big quantity of information, but the tool only shows selected parameters, in order not to overwhelm the user. The model for the EPCs is shown in the following picture. Once the parameters have been selected, the different screens to present the data are generated. In this case there are three screens to present the data, with different types of visuals as for example histograms graphs, tables, tart charts, different types of selectors. In the last step the tool is published.

Big Data Supported Analytics for Next Generation Energy Performance …

169

Fig. 6 Data model of the EPCs information

C. User experience The user experience is quite intuitive. It opens on a screen with the visually analysed data for both cases on the building stock information from cadastral data and for EPCs’ information. In the case of the analysis from cadastral data, this is performed at regional level and it is composed by three pages with different information analyses. In the three pages of the tool, the user can see the information at regional level, or select a municipality to see its specific information. The first page shows data on total number of buildings, total area, and conditioned area. There is also a filter that the user can set for the year of construction. Then, the screen shows information on the three different types of climate zones. It also provides a graph on the use of the buildings. The second screen of the tool (Fig. 7) provides information on the total number of buildings and total surface. It also allows the user filter by the year of construction of the buildings. It provides two thread diagrams with the use of the buildings (residential, industrial, agricultural, public services, retail, office) and with the status of the buildings (functional, declined, ruin). A graph showing the status of the buildings according to the year of construction is also included. As well as

170

G. Hernández Moral et al.

ASTURIAS CADASTRE

Building Use

Municipality

Building Status Bu ildin g u se Bu ildin g st at u s

Number of Buildings

Year of Construction Total Surface

Number of Dwellings

Buildings

Building status

Condit. surface

Buildings

Building use

Condit. surface

Buildings

Building status per year of construction

Num ber of buildings

Bu ildin g st at u s

Year of const r uct ion

Fig. 7 Second screen of the Power BI tool analysing cadastral data

tables with the relation of the number of buildings that have the different numbers of dwellings, the relation between the building status and its living area and the number of buildings, and another table relating the building uses with its living area and number of buildings. The last screen on cadastral data analysis shows the surface above ground, the surface underground and the total surface. The user can filter as well per year of construction and per number of floors. Information is provided in the form of a table on the different façades (including their four orientations) surfaces, as well as party walls and roof surfaces. Also, a graph relating the number of buildings with the number of floors is included. The Power BI of the EPCs information is composed of two pages or screens. In both of them the user can filter per province and per municipality within. The first screen (see Fig. 8) includes a table with the different building uses and the count on each use type. It also includes graphs showing the type of building use, one simpler with greater typologies (building blocks, single family houses, locals (similar to commercial premises) and tertiary buildings; and a second one with more detailed building uses. The second screen of this latter Power BI (Fig. 9) provides the energy information on the EPCs. As in the first page, user can select the province and municipality, and see the use and the count for buildings on each use type. Then, four bar graphs show the percentages for the labels in the four EPC indicators: cooling demand, heating demand, CO2 emissions and non-renewable primary energy. It should be noted, that the information contained in the EPC is in the national language. This is the reason why the interface has been provided in Spanish.

Big Data Supported Analytics for Next Generation Energy Performance …

171

Fig. 8 First screen of the Power BI tool analysing EPC data

Fig. 9 Second screen of the Power BI tool analysing EPC data

D. Application in a specific context This service is applied in the Castilla y León region as part of the MATRYCS project, and in Asturias region as part of the I-NERGY project. As they are both Spanish regions, they have the same EPC scheme, and so the replication of it is quite straightforward once data is available and processed. Both are complemented, as already explained, with the cadastral information, which has also the same structure for both regions as part of the same country.

172

G. Hernández Moral et al.

E. Replication possibilities and envisaged next steps As mentioned before, this service has high replication possibilities in Spain, as EPCs follow the same data model. The only requirement is to have the EPCs in a digital format. So, the main aspect that needs to be considered is how each region stores the data of the EPCs, and how they can be made available. For replication in other EU countries, its application should be further studied according to the different data model for the EPCs. That is why it would be a great advantage to start using a same data model for all EU EPCs. For improving the tool, a validation with more users (or more types of users) could help to detect if the parameters selected are enough for offering the functionality expected. The tool may be configurable for different types of user, offering more or less information depending the role of the user. Besides a generation of reports based on the data and the role of the user could be an added functionality. More ambitiously, the introduction of machine learning techniques for the identification of zones with high energy consumption or CO2 emission values could be applied to help urban planners.

3.3 Energy Conservation Measures Explorer Energy performance certificates’ final objective is to boost energy refurbishment by making end users understand the current energy status of the building and proposing some refurbishment measures that would improve this situation. For this reason, apart from the energy labels and the main parameters measured in all buildings, an important section of the EPCs are the energy conservation measures proposed by the experts issuing the certificate. These include not only the description of the measure, but also the impact it would have in terms of energy demand, consumption, primary energy and CO2 emissions’ reduction. In the case of Spain, the EPC printed document in PDF format provides a summary of the information contained in this energy conservation measures’ section, but in the XML format additional information can be found. However, even though this is highly valuable information, it does not enable to fully characterise the measure and assess it in an EPC certification tool. Therefore, the descriptions inserted by the experts issuing the EPC should be sufficiently explanatory and detailed. However, from the analysis of the data contained in the EPCs in Castilla y León and Asturias regions in Spain, it has been detected that this is sometimes not the case, and there are certain limitations to the exploitation of this part of data contained in the EPCs. A. Main objective and challenge addressed The objective of this service is to exploit the data contained in the energy conservation measures section in the EPCs. This will enable end users to search for potential energy conservation measures that could be applied in their building by analysing what impact other measures proposed in other buildings have had. Thus, this information

Big Data Supported Analytics for Next Generation Energy Performance …

173

is made accessible to users, who can filter, compare and explore in more depth a list of energy conservation measures contained in EPCs within the province/municipality/ post code of their choice, and/or within the year of construction range of their choice. B. Solution design, data used and steps covered The energy conservation measures explorer uses a database with the XMLs of the EPCs. All the information related to the energy conservation measures included in the EPCs has been stored in the database. Besides the system uses a list with the names of the municipalities of the regions to which the buildings evaluated in these EPCs may belong. Also, there is a list of the postal codes and provinces of these municipalities. The tool offers the possibility, through a web application, to search different energy conservation measures, by using different filters. The search engine works by searching for a word provided by the user in the energy conservation measures EPCs category, in the name and description of the measure texts. The application has been optimized to work with a big quantity of data, since the database contains big volumes of data. In one of the cases it contains more than 100.000 EPCs files, with one or more measures for each EPC. Moreover, the information of each measure contains at least 40 parameters (including the new energy rating, savings with respect to the baseline, and characterisation of the included measures, among other). The application also offers the information of all measures that match with the criteria expressed by the filters in a table format. Besides, the app can also provide specific information of each measure. C. User experience The experience for the user starts with an initial screen where they should enter the word from which they want to see the improvement measures related to. As it will search in the database of the EPC’s XML, it only works in the language where they are written. Thus, for the case of the Spanish EPCs, even if the tool is provided in several languages, the search will only be effective if the key word searches is in Spanish in this case, so that it has the chance to appear in the database. The first screen has also the type of building as mandatory field, and by default it is residential, since it is where there is a greater number of EPCs, although the user can change it to non-residential. The subtype of building can also be defined, related to the type selected, but it is not a mandatory field, and can be left open to obtain a greater search in the main type. Optional fields can be added for a more precise search: in terms of location (defining province, municipality or post code) and specifying year of construction ranges (by default it is from 1900 to 2023) (Fig. 10). After adding the required and desired fields in the first step of the service and searching, a second screen is displayed, where the results are shown in the form of a table. It informs on the number of available measures found with the key words added by the user (which are searched in the fields of name and description of the improvement measures category in the XML of the EPCs database). This table provides information from different fields of the EPCs (structured as the columns of

174

G. Hernández Moral et al.

Fig. 10 First screen with the search engine of the energy conservation measures explorer

such table), while each EPC measure found stands for a row in the table. The fields in the table are the following: (1) measure name; (2) description of the measure; (3) cost (in e) estimated for the implementation of such measure; (4) conditioned area (in m2 ), which provides the user of an idea of the size of the building or dwelling for which the measure is being defined, especially to be compared or related with the previous field of cost; (5) energy demand, showing the difference or savings with respect to the initial/current energy demand of the building without applying the measure, it is provided in both savings expressed in kWh/m2 ·year, and in percentage of savings. This field offers the possibility to the user of sorting the measures list based on those figures. (6) heating energy demand label, which shows the new label (coloured, for visual aid) that the building obtains in this respect with the application of the energy conservation measure, as the previous field, measures list can be sorted taking into account this parameter; (7) cooling energy demand label, which works in the same way as the heating label; (8) non-renewable primary energy, which is shown in the same manner as the energy demand, proving the information in kWh/m2 ·year for the difference with respect to the initial or current non-renewable primary energy demand of the building, and in percentage to show the savings with respect to that initial situation; allowing also the possibility to filter the measures list taking into account both figures (at the left and at the right). (9) non-renewable primary energy label shows the new label that would be obtained by the building if applying the energy conservation measure; the measures’ list can be sorted also considering this label. (10) CO2 emissions information is provided as the other figures that produce the labels: as the difference in kg of CO2 /m2 ·year with respect to the initial or current emissions of the building, and as savings in percentage; allowing also to sort the list of measures according to those figures. (11) CO2 emissions label, as the other label

Big Data Supported Analytics for Next Generation Energy Performance …

175

Fig. 11 Second screen of the energy conservation measures explorer, where the table with the results of the search is shown in a comparative way for the measures available

columns, provide the visualization of the new label obtained by the building with respect to its emissions after the implementation of the measure. Then, for further information, there are two more fields: (12) year of construction and (13) location. This provides information of the province where the building is, the municipality and the post code. The table shows 10 results per page, as the optimal format of the table in the screen of the computer, then the user can move to the next pages to see all measures (Fig. 11). If the user is interested in a specific measure from those shown in the table, they can click on its name and it will drive it to a new screen where more detailed information of the measure and EPC. This new screen is structured with an information table at the left side, which contains the name of the measure, the description, the cost in e, the conditioned area for reference in m2 , other data that the EPC issuer may have added, the location of the building (specifying the province, the municipality and the post code), the address, the cadastral reference, the year of construction and the type of building. Then, energy information is provided at the right side of the screen. The EPC label scale (from A to G) is shown for reference, and then, the four label parameters are depicted: non-renewable primary energy, CO2 emissions, heating and cooling energy demand. For each of them the new label is shown (and coloured), as well as the percentage of savings obtained through the implementation of the measure, and the savings in the corresponding unit (kWh/m2 ·year for the energy savings and kg CO2 /m2 ·year for the CO2 emissions saved). After these parameters, a table is depicted below to further detail these fields. It is structured with a row for each of the following parameters: final energy consumption, non-renewable primary energy, CO2 emissions and energy demand. In the columns, each of those parameters are depicted into heating, cooling, DHW (domestic hot water) and lighting, as well as

176

G. Hernández Moral et al.

Fig. 12 Third screen of the energy conservation measures explorer, when user wants detailed information of a specific measure from those that appear in the table after the search

providing the total. For final energy consumption, there are no labels, so only the value in kWh/m2 ·year is shown for each column. The non-renewable primary energy is shown as the figure and corresponding label for all column parameters, and in the total, it shows also the savings achieved. CO2 emissions is in the same line as the previous one, it is shown for all column parameters in figure (in kg of CO2 /m2 ·year) except for the lighting, and also in savings for the total. Energy demand is provided as figure and label for the heating and cooling, as well as in the total, which contains the information of the figures for the energy demand in kWh/m2 ·year and in savings (Fig. 12). D. Application in a specific context This service has been developed in Castilla y León region (Spain) as part of the MATRYCS project. The database contains around 100.000 EPCs in XML format. In an initial testing of the tool it has been observed that the most common measures are related to envelope insulation and window replacement, followed by boiler replacement, which is relevant in the context of Spain where heating systems are mainly individual, especially in older buildings. This is very relevant in the context of the Castilla y León region, where the heating is really needed and has a high impact in energy demand and consumption due to cold weather in the winter season. Also, due to the search engine of the tool, which searched in plain text in two fields, and the way in which the energy conservation measures category is written in the EPCs by the issuers, it will show more measures when the key word inserted is simpler (e.g. write “insulation” instead of “envelope insulation”) and less detailed. Although it will provide a smaller number of results when precising the word for searching, they are usually of higher quality, meaning with this a more detailed measure. In the EPC issuing, these fields are required but there are not clear rules or any engine that tests what it is written there. So, for some measures the name is just “measure 1” or similar, which makes difficult the proper work of the tool. That is why it was decided to run the search of the key word provided also in the description field. Another difficulty found was the lack of descriptions of the measures in the EPCs. This denotes a lack of details that are usually found also over the whole measure.

Big Data Supported Analytics for Next Generation Energy Performance …

177

But, due to the way the tool works, unless the name field is complete enough, these measures will have less text, and therefore less chances of being returned when searching by key words. E. Replication possibilities and envisaged next steps All the parameters related the energy conservation measures analysed in this service are mandatory to be included. This means that in all EPCs analysed, there will always be a value for the parameters selected. However, there are limitations to this, which are related to how the information of the energy conservation measures in the EPC is added by the issuers. In this respect, the current approach for the energy conservation measures explorer avoids in a certain way the lack of information in some fields, and provides the user a nice visualisation of key parameters when planning retrofitting of buildings and an assessment of the improvement in the different EPC fields can be obtained. Apart from this limitation, as a simple search tool, it can be easily and directly replicated for the EPCs in Spain because the same data model is used in the whole country. This could even be used for most of the EU EPCs, after studying the particularities of their EPC structure, especially with respect to how the improvement measures category is included in them. Next steps to improve the current service and provide additional functionalities would be related with a higher processing of the plain text in the different fields (mainly name and description). Moreover, an interest processing to be added to the tool is the application of categorization to automatically organize the measures in different categories to be presented to the user in a more structured way. Besides, a more intelligent processing on some of the parameters could improve the tool. This is the case of the cost parameter. This parameter appears in the EPCs as a text string and not as a numerical value, so right now it cannot be used in any calculation in a direct way. Using word processing and extracting the numerical value of this parameter, the measures could be organized on the basis of their cost–benefit or even providing an estimate of the amortization periods for each measure.

3.4 Visualisation of EPCs and Estimated Energy Parameters The maximisation of the use of the EPCs after they have been issued is a challenging task. Increasing social awareness and understanding of these documents is a fundamental step to be taken. In this line, using appealing tools can be a good first step towards gaining the attention of end users. In this context, Geographical Information Systems (GIS) and the use of maps can represent a valuable tool to resort to in order to present information in an easy manner. Moreover, this can be valid not only for the general public to understand the context in energy terms of their city, through colour coded maps; but also, for regional authorities to plan refurbishment strategies at broader scales. This can be achieved by adding a geolocation component to already existing EPCs and mapping them to the buildings where they have been issued. This

178

G. Hernández Moral et al.

can provide an idea of the density of EPCs in a certain area, but with this approach not all the buildings would have information to be displayed, since it is not mandatory to have an EPC (only when a relevant transaction is performed). As a consequence, in order to have a holistic view of the energy status of municipalities, it is necessary to complement the mapping of real EPCs with other estimated approaches. This section proposes the visualisation of three different approaches to visualise energy-related parameters. A. Main objective and challenge addressed The main objective of this service is to provide an easy-to-use interface based on maps where the user can visualise energy performance parameters coming from three different approaches at different scales (building, district, municipality, province, region) and be able to make comparisons among them. Moreover, additional information from the considered buildings is provided, as well as interactive filtering options for an improved user experience. B. Solution design, data used and steps covered For the visualization service, three main groups of datasets will be generated to be visualized in the application. These three approaches are: (1) real EPCs, (2) EPCs parameters estimations based on real EPCs and (3) EPCs parameters estimations based on EPBD approach. For the first case the idea is to visualize selected parameters of real existing EPCs. This means, that not all buildings will count on values to be displayed. To counterbalance this fact, the other two approaches are proposed. On the one hand, approach (2) estimates EPCs parameters based on the analysis of data contained in real EPCs, generates some typologies and then maps the information in all buildings of a municipality. On the other hand, approach (3) proposes some estimations based on the calculation methodology set in the Energy Performance of Buildings Directive (EPBD) [7]. This enables the characterisation of the whole building set by the means of applying typologies as they have been defined in TABULA [41] and BSO (Building Stock Observatory) [42] and calculating in base of these typologies using algorithms that follow the indications of the EPBD. The main parameters considered are the heating and cooling demand, energy consumption and CO2 emissions, although in some cases more information is provided (disaggregated information or other valuable parameters). The presentation of the three approaches enables the comparison between the estimations and the real EPCs, whose values could be considered closer to actual demand, consumption and emissions values. Although the information contained in each dataset group is slightly different, the basis for creating the dataset is common. The datasets will be generated through the enrichment of a digital twin at city or regional level. The digital twin, based on GIS data, enables the visualization of data in a geographical way, through a map, improving the user experience and the location of the different data. For the creation of the Digital Twin at city or regional level used by this service, open data coming from the cadastre or the Open Street Maps (OSM) project are used as basis, although information from other sources is added (land uses maps, for example).

Big Data Supported Analytics for Next Generation Energy Performance …

179

Apart from the different approaches, and depending on the application and the context, various scales are shown to the user. The basic scale is the building scale, in which information for each building is shown. After that there are other scales (district, municipality, province and region), in which the most valuable information is shown in an aggregated manner. Once again, the basis for these scales are based in the abovementioned digital twins. In the following paragraphs the particularities of each group of datasets, the data used and the step covered for every approach will be explained. It is important to note that only residential and tertiary sector buildings are considered in the application. This means that industrial buildings are discarded, since they are not considered under the energy performance certification schemes, due to the special characteristics of these buildings. Real EPCs visualization Energy Performance Certificates have a large amount of information about one of more dwellings or buildings (in the case of the Spanish EPCs they present around 200 parameters). A big challenge is to select the minimum number of parameters to be visualised. They have to be enough to offer the essential information of the EPCs, but not so many as to mislead the user or make it difficult to find useful data. Besides, the selection of the parameters has to consider the information generated for the other energy estimation approaches proposed in the service in order to facilitate the comparison among the main parameters, listed below: • The four main parameters to display the performance of the energy efficiency: heating and cooling demand, CO2 emissions and non-renewable primary energy. All of them are presented per m2 and year and with the respective labels. • Emission and non-renewable primary energy information disaggregated by heating, cooling, Domestic Hot Water and lighting (when available). • Non-renewable primary energy information disaggregated by energy carrier. The identified energy carriers are the following: Natural Gas, Diesel, Coal, BiomassPellet, Biomass-Others, Electricity, Biofuel and LPG. In cases in which not all the data is presented in the dataset, a reduced version of the application can be offered and adjusted to the available data. It can be the case when the EPCs with full data are not available and only a short version of the EPCs (with a smaller quantity of data) is accessible. A challenge for the generation of the data in this approach was the problem of granularity: there are some EPCs at dwelling level, while the digital twin’s finest disaggregation level is the building. In this case the information has been carefully organized and also the visualization service has been adapted to show the information of the different dwellings presented in each building through collapsible menus. EPCs parameters estimations based on real EPCs Real EPCs data in combination with geometries and other parameters collected from the cadastre, provide the basis for this approach. In general, only four of the parameters that summarise the energy performance of a building in a representative way

180

G. Hernández Moral et al.

are publicly available. These four fields correspond to the values of non-renewable primary energy, CO2 emissions, heating demand and cooling demand per m2 , and thus to the associated labels. These are the main parameters that will be displayed through the EPCs visualisation tool, although, depending on the adopted approach, some more can be provided. The aim of the methodology is to explore the possibilities offered by data science techniques to infer selected values for buildings that have similar characteristics to those that already have a certificate. Therefore, this module has two different modes of operation: • The simplified approach, based on the four representative energy efficiency parameters. It takes as input the public data of EPCs available in the Energy Data Hub of the Castilla y León Region [43]. • The extended approach, which in addition to the previous parameters (also available in the extended version of the certificate), is able to infer additional ones, in parallel to the information provided in the visualisation tool of the real EPCs approach. This makes this approach more complex in computational terms. It has been developed using python programming language, and the models are trained using Machine Learning libraries, which allow classification and regression algorithms to be applied. The models take as input the parameters of the EPC database and the cadastre database, whose relations are established in a first step. The data is then pre-processed (data cleaning, data manipulation, data normalisation), so that the parameters can be easily compared, correlations can be analysed and the final parameters can be selected. The ML models are then calculated and applied to obtain the information of the buildings that can be geometrically described. Finally, this new dataset is included in a GeoJSON file, so that it can be represented in a graphical way, and easily compared with the other approaches. EPCs parameters estimations based on EPBD approach The software generated to calculate energy consumptions at city level (NUTS-3) has been programmed in Python during the development of different European projects, where it has been enhanced in order to approach its results to foreseen scenarios in different demo sites across Europe. Such diversity of scenarios has forced the development of several alternatives for the data sources, as well as the search for common available data. There are some ancillary elements that have become necessary in order to fit the procedures of the software into the different realities of European building environments, and for that matter, two main frames of typology information have been selected: • On one side, TABULA [41] data have been filtered and adapted, in order to provide building U-values and other parameters related to the building for characterize energetically the building, taken for the different typologies that can be found in a certain country.

Big Data Supported Analytics for Next Generation Energy Performance …

181

Fig. 13 Execution order of the developed modules for the EPCs parameters estimation

• On the other side, data extracted from BSO [42] have been taken into consideration, to cover buildings that would not fit into TABULA typologies as well as tertiary sector ones, in order to gather acceptable U-values and other parameters. With the previous sources, and some other specific data from the EU repositories, these ancillary elements in the form of csv tables, have been conditioned to approach the reality of each national building inventory, and along with the building dataset and one year of climate data, are used to calculate the energy values related to the buildings being considered. The program itself has been designed in a modular way, so the different functionalities can be easily adapted for different project needs. They are invoked in a sequential manner during the execution of the main code, and they keep modifying an original set of data from single building so in the end, energy values calculated for every single building included in the digital twin used as basis can be obtained. In execution order, the following modules have been programmed (Fig. 13). The first step in the process is the generation of the input data required. For this, first module is the geometry module, which is a multi-purpose module that is invoked for its multiple functionalities. It extracts as many possible geometric data from the building source as possible, checking the availability of data like height, surfaces, number of floors, and year of construction. This information is complemented in the typology module. When the module receives the data frame from the previous module, and sets the country zone, it then obtains the building typology based on TABULA/ BSO tables of typologies. With the previous data, the module links the u-values to the buildings and other parameters, considering the typologies assigned beforehand. Then, the geometry module is invoked to calculate the areas of the building, like roofs, grounds, wall/window surfaces, etc. These values will be the base, along with the specific U-values from the typologies, to calculate the energy values. Finally, and in parallel to the functioning of these modules, the climate module is the last step to generate all the input data required. For this it uses weather data from one sample year of the zone, to generate HDD and CDD values in an hourly schema. Then, the next step of the energy calculations starts with the energy calculation module. With all the data gathered by the previous modules, it obtains energy values, generating the demand (for heating, cooling and DHW), the consumption related to the demand and the quantity of CO2 emissions associated to this consumption. The final step is processing and storing all the outputs obtained. In this line, the program stores the results in a csv format file that could be easily readable, and generates a JSON file (geoJSON concretely) that can be represented in many

182

G. Hernández Moral et al.

geographic applications like QGIS, and it can be also visualized in the application tool. After the whole data has been submitted, the buildings in the considered location had been characterized in terms of their energy consumptions. C. User experience The tool has powerful visualisation capabilities that show the information of the labels in the four indicators of the EPCs in a map. The colourful map is surrounded by different menus that allow the user the selection of different parameters to visualise both in the map and from which obtain analytical information at the right bar. Also, by moving the mouse around the map, pop-up information is provided for each building: the cadastral reference and the indicator value for which it is labelled coloured. Within the map, the user can select between the different scales to visualise (province, municipality, district or building), although moving across different scales and through the map with the mouse at the screen. Then, the top bar allows the user to filter by year range, by climate zone or by typology. Then, at that top bar the average values for the pilot region or area are shown for the following Key Performance Indicators (KPIs): non-renewable primary energy, CO2 emissions, heating demand and cooling demand; all of them are shown together with the corresponding label. The top bar continues with optional filters for province, municipality and district, and if selected, the average values for them on the four EPCs’ indicators will be displayed (Fig. 14). Then, on the right menu, the user can select among the three available approaches (as mentioned before): information from real EPCs, estimation based on EPCs information or estimation based on EPBD. There is also an information icon where explanations on which each approach consists of are provided. Then, the user can select the indicator they want to visualise for the coloured map (and the pop-up information values). Below those options, information is provided on the number of elements currently being seen in the map and their type (e.g. building). After that information,

Fig. 14 Screenshot of the service of visualisation of EPCs and estimated energy parameters

Big Data Supported Analytics for Next Generation Energy Performance …

183

the energy label scale is depicted, showing in a share scale the numbers of buildings with the different labels (from A to G, and also those without label). Then the average value for the indicator selected of the area shown in the map is provided, as well as two graphs for the assessment of the numbers of buildings according to their year of construction and according to the gross floor area. Average values for those two parameters are displayed as well. Scrolling down the right menu, two more graphs can be seen, they are cake graphs and provide visualisation for the condition of the construction (functional, declined, ruin or not specified), and for the current use of the building (residential, industrial, office, retail, public services or not detailed). More detailed information on a specific building can be obtained if, instead of just moving the mouse over a building, the user clicks on it. Cadastral reference and climate zone are informed, as well as the total surface of the building, its average height, current use and typology, as well as its year of construction and number of dwellings within. Then, average geometry characteristics of the buildings are provided, namely the values of the surface for the external façade (in total and per orientation), party walls surface, surface roof, volume above ground and compacity. Below, the values according to the approach selected are depicted: that of the main indicator selected in the map on previous screen, which can be depicted to see the different values for that building of the four indicators, as well as its labels. And those of the emissions (also those caused by heating, cooling and DHW). Non-renewable primary energy is also depicted under those three parameters (Fig. 15). D. Application in a specific context This visualization service is being applied in different contexts and projects. It is important to highlight that in each scenario the application is adapted according to data availability, not only for the creation of the digital twin but also for the generation of the different datasets to be visualised. In the case of the MATRYCS project, the service provides the visualisation of the three approaches because of the data availability. For one of the pilots, Castilla y León region (Spain), the information contained in the EPCs was used, since the service developer had access to the EPCs registry of the region. Besides, in Spain cadastral data [44] is available following the INSPIRE initiative for GIS data [45], allowing to have a digital twin used as basis with valuable information. In this pilot the EPCs parameters estimations based on real EPCs approach was used to its full potential, considering that in this case all the data from the EPCs are available. Besides in other pilot of MATRYCS project, Gdynia municipality pilot (Poland), a reduced version of the application was provided. Considering the unavailability of EPCs for this municipality only the estimation EPDB-based approach was offered. The lack of some of the parameters in BSO needed for the calculations (as the year of construction or the use of the building) forced the decision to try new approaches based on ground usage (provided by Urban Atlas [46]), and the process of additional types of data, filling data gaps (for example, the year of construction) with a proximity algorithm.

184

G. Hernández Moral et al.

Fig. 15 Cropped screenshot of the service of visualisation of EPCs and estimated energy parameters, showing the detailed information for a selected building

For the BD4NRG project the pilot also focused on Castilla y Leon region. The main difference with the MATRYCS project was that in this context the complete registry of EPCs was not available. Instead, a reduced version of the EPCs was used (containing all records, but only an extract of the information for each of them). This information has been extracted from the Public EPCs offered via an open data platform by the Ente Público Regional de la Energía de Castilla y Leon (EREN) [43]. For this reason, in this pilot different versions of the Real EPCs visualization and EPCs parameters estimations based on real EPCs approaches have been used. Working this way has allowed to test if this reduced versions, based in Open data, are accurate enough for the purpose of the service. Finally, in the case of the I-NERGY project, the pilot is the Principado de Asturias region. As in MATRYCS project case, there is information of the EPC with all the characteristics. In this case only Real EPCs visualization and EPCs parameters estimations based on real EPCs approaches are offered.

Big Data Supported Analytics for Next Generation Energy Performance …

185

E. Replication possibilities and envisaged next steps The limits concerning the spread-out and development for the solutions created will be defined mainly by the available information. As it was commented before, the tools need to work with geometrical and spatial data in order to be properly labelled, or the energy values checked. The OpenStreetMaps data source has the advantage of being a free and completely open source of information. The inconvenient is the irregular quantity and quality of such data: some buildings in certain European cities have enough data to be properly characterized, while other locations/regions across Europe lack the majority of the needed info values. Moreover, one of the most needed parameters is the date of construction/renovation, and such data are regularly missing even in the most complete datasets. Outside from European ground, the usage of OSM is limited to small zones in certain countries. One critical issue regarding the data from the cadaster is the diversity of formats in every single country of the EU that forces the development of interface modules to extract the building data contained there. In the case of Spain, most regions follow the same cadaster structure based in INSPIRE used for the generation of the Digital Twin used as basis. For those that do not follow the same structure (Autonomous Community of Navarre and Autonomous Community of the Basque Country) an extra effort should be done for the replication. Another current limitation is the access to weather data, in terms of getting values from stations near the building dataset site and also avoiding typical paying fees from these sources. Finally, the access to an EPCs register of the location is a key factor. The EPCs information (complete or restricted to some main parameters) is necessary for the development of 2 of the 3 approaches explained before and, therefore, to unlock the full potential of the tool. Once these issues have been analyzed, the conclusion is that the replication possibilities would cover most of the EU ground with a reasonable amount of effort, meanwhile the chance to export these solutions to undeveloped countries outside the EU would be very limited. The next steps to improve this service should cover the following topics: • New validations of data with real values and better calibration. Sometimes the current data available come from statistical calculations, satellite image processing, user observations, etc. And as it has been stated before, even the data contained in the EPCs has to be checked. The goal should be to find reliable data to compare with. Upcoming new valuable data could be the occupancy of the buildings (both in number of dwellers and seasonal usage), energy poverty indicators, dates of refurbishment, etc. • Enhancement of base parameters. It is important to find better procedures to obtain missing data or assign values like the year of construction or the building typology in order to improve the accuracy of the results. It is true that at global level, the algorithms are currently good enough to help for global energy policies in municipalities or regions, but there are still significant deviations even

186

G. Hernández Moral et al.

at neighborhood levels, when the final goal is to replicate EPC data from every single dwelling. • Better processing algorithms to deal with geometrical features. Some algorithms that have been, and currently are being developed in different related projects can be adapted to deal with surface sharing among different buildings in order to recalculate energy losses, and also shadow calculations can be done to adjust the effect of the solar radiation in the buildings, providing a more accurate characterisation of energy needs. • Inclusion of energy conservation measures and derived expenses. Once the estimations would be properly checked, revised and corrected, the following step is to offer the possibility of applying different energy conservation measures in specific buildings or set of buildings and calculate the savings in terms of energy and economic cost, and the amortization period [47]. • Feedback with new related indicators. Some indicators like the energy poverty can be utilized for tuning energy consumption values in the models, but this is also true in the opposite way: the energy calculations can assess geographically the zones prone to suffer from energy poverty.

3.5 Climate Change Impact on Energy Use Analyses The effects of climate change are becoming more visible every day, making it necessary to understand what their impact will be on the energy demand of buildings but also in the production of photovoltaic solar energy. In this sense, climate change will alter the balance between energy demands for heating and cooling respectively across Europe with differences that can be very visible between regions. Reductions in energy demand for heating will be offset by an increase in energy needs for cooling as a result of the forecast increase in temperature. Consequently, according to the overall energy demand across the EU could decline by 26% under a high warming scenario [48]. Considering this hypothesis as starting point and the different future climate scenarios proposed by the Coupled Model Intercomparison Project Phase 6 (CMIP6) [49, 50], the climate change impact on energy demand services will allow the user to: (i) analyze the current energy demand of a building being able to compare the effect that predicted climate change will have on future energy needs and (ii), evaluate the changes produced in the generation of photovoltaic energy through the variations in solar radiation and the predicted increase in the average daily temperature. This context makes it necessary for tools like the Energy Performance Certificates to evolve and consider not only current climate conditions, but also future predicted climate scenarios. A. Main objective and challenge addressed The main objective of this service is twofold. On the one side, energy demand estimation according to varying climate conditions expressed via the different

Big Data Supported Analytics for Next Generation Energy Performance …

187

climate change scenarios (shared socioeconomic pathways-representative concentration pathways (SSPs-RCPs) from CMIP6) will be addressed. On the other hand, it enables to forecast the solar radiation received in a region for the prediction of its mid to long-term potential under different CMIP6 scenarios that represent the impacts of climate change. Forecast of solar radiation plays an important role in planning the future deployment of renewable sources in an effective manner, also contemplating the effects of climate change through the assessment of vulnerabilities that may arise from them. Users will benefit from this solar forecasting service considering that the solar resource and its use through photovoltaic panels or thermal collectors is a great alternative for the development of self-consumption facilities. The number of households that have this type of facility is increasing and even more so with the regulatory change developed in recent months. For this reason, the current and future available resources are key to correctly size the facilities. B. Solution design, data used and steps covered Both parts of the service rely on forecast of energy demand according to the standard accepted CMIP6 scenarios (SSPs-RCPs) through the downscaling of the climate data and their bias-correction with historical data from weather stations. Below, the two parts of the solution are explained in more depth. Part 1: Climate change impact on energy demand The service calculates the current energy demand of the building stock and its evolution according to different climate change scenarios, and, thus, helps to determine how it will be affected by climate change. The required data for the energy demand estimation include daily and hourly data for temperature and solar radiation covering the historical and the future period using two different CMIP6 scenarios. Historical data are collected from PVGIS [51] by an API to request data directly in each calculation of the energy demand while modelled climate data in the historical and future are extracted from Copernicus Climate Change initiative (C3S) [52] as a step before the calculation. For the calculation, the service is based on cadastral data to obtain the main building attributes (e.g. building type, year of construction and floor area) for the characterization of the energy demand. Finally, Laser Imaging Detection and Ranging (LiDAR) data to calculate accurate building heights is used. At the end, current and future heating, cooling and domestic hot water demand is provided by the service. Built as a Python library, the service considers the following hypothesis and parameters in the demand estimation procedure helping to improve the understanding of the impact of climate change on energy demand thanks to the evaluation of daily and yearly demand data: • Building envelope parameters associated with its type and year of construction. These parameters include the percentage of glazing, the thermal transmittance of the walls (including coefficients for floor, roof and walls), the thermal transmittance of the windows and their reflection coefficient of solar radiation. The service

188

• • • • •



G. Hernández Moral et al.

covers the following building types: residential, education, restaurant, commerce, health care, public administration, office, hotel and sport. Thermal gains of the building due to the effect of solar radiation. Internal thermal gains due to the occupation and building use, also accounting for those generated by equipment and lighting. Energy losses through the walls, which has been defined by the permeability of the construction elements (walls and windows) of the building. Hourly profiles of energy use by type of building, with different profiles for heating, cooling and domestic hot water. Influence of outside temperature on energy demand. For this calculation, the hourly temperature obtained from the PVGIS API is considered and the degree hours are counted considering the heating needs from 21 ºC and cooling from 25 ºC. Energy systems for the selected building is provided by the user. Coefficient of Performance (COP) of the energy systems and coefficients of transformation of energy into emissions according to the type of fuel used by the energy system are used to determine the primary energy demand and the related CO2 emissions.

In addition, the service uses statistical downscaling algorithms (quantile mapping procedure) for bias correction and downscaling from 100 km of spatial resolution to local (point) resolution. This downscaling algorithm provides users with climate data with a daily frequency, being necessary a transformation distributional algorithm to generate hourly data based on the hourly distribution of temperature and the hour of maximum and minimum temperature and solar radiation in the evaluated day. Part 2: Climate change solar radiation forecasting The solar radiation is predicted in terms of mid and long-term potential from weather stations data, Copernicus C3S and knowledge from studies of climate change effects in solar energy. Built as a Python library, the service considers the data from the weather stations and the historical PVGIS radiation and temperature for the location selected by the user to execute a process of downscaling and bias correction through statistical adjustments that allow us to obtain values of the evolution of future climate variables by integrating Copernicus C3S data obtained for different climate scenarios using an average value for different climate models. These reference climate values serve as a basis to assess on an individual level what the solar radiation performance of a specific location would be, and how a PV panel would perform. In order to run this service, specific input on the location, and characteristics of the PV panel is requested from the user (PV panel slope, power, technology, peak power and installation area). After running the algorithms, the tool will provide several Key Performance Indicators (KPIs) to support the decisionmaking process in the installation of PV panels, as it will be explained in the user experience section.

Big Data Supported Analytics for Next Generation Energy Performance …

189

C. User experience The user experience can be divided into the two parts of the service. The workflow in the case of part 1 resembles the approach showed in the visualisation of EPCs and estimation of energy demand including the calculation engine to run the demand characterization. As in the previous services, it is based in an interactive application with a map, menus and queries to select the different scenarios to run at a certain scale (building, district or city level): current energy demand and consumption, or future hypothesis for energy demand calculation in a climate change scenario. Buildings are coloured according to their energy demand in the corresponding map, the user is enabled to zoom in and out, and select specific buildings to show additional information, which is presented to them in tables, graphs and pop-ups. Since the functioning is very similar to the previous service, a stronger focus is placed on the definition of the user experience of part 2 of the service related with the prediction of future solar energy production. This second service is based on a user-friendly interface for obtaining the coordinates of the location where the user wants to carry out a photovoltaic installation. Through a map, the insertion of points is allowed to obtain the coordinates that will be needed for the adjustment of data and calculation of solar energy production under different technologies and simulation scenarios. After that, the user must select the parameters of the installation (e.g. size and peak power) that intends to develop and the photovoltaic technology (e.g. amorphous silicon or crystalline silicon) that will be used in order to parameterize the simulation. Once calculated, the user will be able to evaluate the results of total, daily and monthly energy production for the installation that is intended to be carry out. Some images of the tool interface are presented in Fig. 16 covering the data insertion and results visualization. D. Application in a specific context The service is applied in two regions, as part of the BD4NRG and I-NERGY projects: Castilla y León (application of service part 1) and Asturias regions (application of service part 2). The demand prediction service (part 1) has been applied in Castilla

Fig. 16 Screenshots of the service of climate change solar radiation forecasting

190

G. Hernández Moral et al.

y León implementing algorithms for the characterization of the energy demand of buildings based on the evaluation of their thermal envelope, the building typology and energy consumption profiles, and the effect of current and future climate variables for different future scenarios and models. On the other hand, the photovoltaic prediction service has been applied in the Asturias region through the development of an access, adjustment and visualization algorithm that integrates both the effects of variation in solar radiation and changes in the average daily temperature. The results can be obtained for different future scenarios using at all times an average value of different models that allows reducing the variability in future climate variables. E. Replication possibilities and envisaged next steps The replication possibilities of both parts of the services do not count on many restrictions, since the data used as a basis can be normally found in open data sources. However, a series of considerations should be observed. • Part 1: depending on the approach used to calculate the baseline, it could be replicated globally. The main constraint to deploying this approach in a global manner is obtaining the baseline data to generate the energy demand models. Then, the calculation of climatic conditions according to the mentioned scenarios could be executed in any part of the world in which PVGIS data are available. • Part 2: the methodology can be globally deployed as well. The same considerations apply to climatic conditions in the future; however, depending on complementary data such as the LiDAR data or the availability of weather stations can result in having less accurate results. As for next potential improvements, this service can be complemented in different manners: • Including retrofitting strategies: The calculation of energy demand with the inclusion of retrofitting strategies could strongly support energy planning processes by indicating the energy gains that could be obtained. This would enable to see the effects of different future climate scenarios on the energy performance of the building stock with the application of different retrofitting strategies. This way, it could be determined if a measure that is seen as effective under current retrofitting scenarios is adequate with future climate conditions. • Optimisation of the combination of solutions: The considerations mentioned above could be optimised to derive not only the best combination of measures (passive, active, renewable and control), but also under different scenarios. In this line, several experiences [53–55] have been made to this respect considering current climatic conditions, but not so much emphasis has been placed on analysing the impact under future conditions. • Improve technologies and temperature impacts: Include a database with a greater number of photovoltaic technologies, including their performance while improving the characterization of the effect of temperature on photovoltaic production.

Big Data Supported Analytics for Next Generation Energy Performance …

191

4 Acceptance of Solutions by the Energy Value Chain The acceptance of solutions by the energy value chain is crucial for big data services to actually be deployed and used. The services presented in this chapter, developed in the MATRYCS, I-NERGY and BD4NRG have followed a validation methodology that has been carried out in parallel to the development cycles established in the projects.

4.1 Validation Methodology Deployed In the three mentioned projects, three development cycles have been established, and after the second and the third cycles are finalised, a validation process is implemented. The validation performed after the second cycle has finished provides relevant feedback for the service developers to implement and refine the proposed services, whereas the validation process performed at the end of the project (third cycle) enables to actually measure the satisfaction of the users with the final version of the services. At the time of writing this chapter, neither of the three mentioned projects have finalised. Therefore, only preliminary user satisfaction can be reported, derived from the first validation round performed (cycle 2). Nevertheless, it is worth to mention that this validation methodology is complemented by an evaluation framework that has enabled to manage the numerous pilots these projects have (MATRYCS—11 pilots, I-NERGY—9 pilots and 15 use cases and BD4NRG—12 pilots). The services presented in this chapter have not been applied in all the pilots in this project, only in selected ones, however the evaluation framework is presented because it can serve as a reference for projects needing large-scale pilots impact assessment, as well as services validation [56]. The evaluation framework implemented counts on three main pillars: (1) strategy and general context, (2) data, infrastructures and digital technologies, (3) user satisfaction; and two complementary ones: (4) main stakeholders, (5) procedures to personalise the tools and services. User satisfaction is measured through pillar number 3, by establishing a series of questionnaires devoted to measuring the satisfaction in terms of functionalities provided (how the services address the challenges the stakeholders face), and how the service works from a technical standpoint (i.e. if there are any efficiency, safety or usability remarks, among other). These questionnaires are answered by stakeholders through likert scales.

4.2 Preliminary Results from Validation and Next Steps As it has been previously explained, the last round of validation is still pending in the three projects, but from the first validation round performed on the EPCs

192

G. Hernández Moral et al.

checker, EPCs data exploitation and reports generation, and the Visualisation of EPCs and estimated energy parameters services, a user satisfaction of around 65% has been achieved. This validation was performed over preliminary prototypes of the services presented here, and the majority of improvements have been considered and included already. In particular, several remarks were highlighted by the validating stakeholders, that included the addition of functionalities that in most cases were related to the provision of additional data. Normally, these data had to do with aggregated values at higher scales that enabled the comparison with other values presented on the screen. This process has shown to be extremely useful, since additional functionalities and capabilities had been detected and included in the refined version of the service, even though they were not identified in the requirements gathering performed in collaboration with stakeholders at the beginning of the projects. This can be due to the fact that stakeholders were not aware of the high potential that big data solutions can offer in their day to day undertakings. In any case, the final validations still need to be performed in the three projects, and the final results to be reported. Additionally, an assessment of how the implemented methodology has worked to assess the pilots will be performed at the end of the three projects, and reported in the corresponding deliverables.

5 Discussion The services proposed in this chapter bridge the gap in between big data technologies and the already well-established schemes of the energy performance assessment and certification. Indeed, all the dimensions of big data, the 5Vs, can be observed to some extent in the development of these services. Energy performance certificates, when considered individually in a determined service (for instance, when the EPCs checker applies its procedures on an individual EPC or when showing the energy conservation measures), one could think that the volume dimension is not applicable in this context. Nevertheless, the queries performed to extract the data or the definition of some values to be checked are performed in a whole set of Energy Performance Certificates available in a region (more than 100.000), and this is expected to keep growing. In terms of velocity, in the services proposed this is not a salient characteristic, since most of the data are semi-static. However, these services could be coupled with monitored data in the future. On the other hand, dealing with variety of data has been one of the most relevant challenges to be tackled in the development of these services. Multiple data sources have been combined: EPCs in XML, weather data in CSVs, building characteristics in excel format, cadastral and OSM data in geoJSON format, or LiDAR data. In many of the cases, not only the different formats have posed a challenge, but also dealing with different data models for similar datasets. An extra effort has been devoted in the design phase to adequately set the data value chain in a coherent manner to obtain the expected results from the services. This challenge was linked to ensuring veracity of the data used within the services, and,

Big Data Supported Analytics for Next Generation Energy Performance …

193

therefore, in the results obtained. The lack of data availability in some cases, as well as the origin of the data (data coming from collaborative approaches, or even the errors found in official registered EPCs) imply reaching certain levels of uncertainty in the results obtained by the services. Additional calibration and validations of the services should be performed in order to reduce this uncertainty. Relevant value is obtained from the results of the services proposed, since the conclusions from the analyses performed cannot be extracted following other methods. However, further work on the veracity of data, as explained should be performed in order to make the V related with value even more prominent, which in many cases would be achieved by increasing the quality and availability of open data. As it can be observed, even though all the dimensions of big data are relevant, the volume, variety and the veracity Vs can be highlighted due to their impact in the accuracy and usefulness of the service. This is also noticeable when analysing the data value chain. The first two steps in this chain (data generation acquisition and data analysis processing) have been proven to be crucial in all the services presented in this chapter. These conditioned how the data analysis processing needed to be performed, and also the data storage curation approach. In the latter case, it was specially challenging to combine data from a multitude of energy performance certificates with geographic information systems in a way that was practical not only for querying, but also for the last step (data visualisation and services) where insights extracted should be presented in a user-friendly manner. Visualisation limitations ranging from the time required to load a dataset in a visualisation service, to the usefulness from a user’s perspective to analyse a specific set of datasets have conditioned not only the analytics to be performed, but also how the data should be stored to enable this latter step. For instance, this has been paramount to adequately balance the amount of information with which the maps were enriched. Furthermore, in many of the cases, how the data value chain is established conditions the replicability of the services. In all of the services proposed, its design has considered the replicability potential at its core. For this reason, all the services use when possible open data, that is easily accessible at global level, or European or national level. For instance, potentially global level available data sources include OpenStreetMaps or some climate data. However, even though the data coverage could cover the whole world in both examples, OSM data do not count on homogenously distributed information in all countries, which is a barrier when trying to replicate services like the Data visualisation approach proposed. Data availability (or lack thereof) is not the only barrier when tackling replicability. In almost all the services presented in this chapter, the data model deployed conditions the replicability potential of the service. This is really obvious in those services where the Energy Performance Certificates as a whole are exploited. This has been applied in Spanish EPCs, where the data model deployed to describe the information follow the same XML structure at national level. This means that the services proposed have no limitations to this respect in the whole country. However, one could think that this should be the case in all Europe, since Energy Performance Certificates schemes have been applied for a long time and in all EU countries. Nevertheless, even though the calculation procedures are guided by the same standards, each country can perform

194

G. Hernández Moral et al.

adaptations, and, as a result, the data model used in every country can vary. Even this can vary on a regional basis within the same country. This leads to such solutions as the ones presented in this chapter, to not be replicable at the EU level. Thus, the harmonisation of Energy Performance Certificates’ data model at EU level is seen as a crucial challenge to be overcome in order to reap the benefits of these and similar solutions, and be able to obtain a robust understanding of the European building stock to support decision-making in energy refurbishments. Nonetheless, some of the replicability barriers proposed cannot be addressed from a single developers’ perspective, since they require a systems’ change in terms of data availability enhancement or homogenisation of data models. For this reason, the next steps for the improvement of the services presented have to deal with increasing the replicability potential, but within the scope of the barriers imposed from the exterior, which cannot be fought against. In this line, the proposed next steps are linked to increasing the functionalities offered, adapting the service to work with other inputs, validating the service further for a better calibration of the results obtained, adding higher processing capabilities, or addressing specific target groups’ needs. In any case, the linkage in between big data and the energy performance assessment and certification schemes presented in this chapter generates numerous opportunities that can offer benefits to a series of stakeholders in the building value chain. In particular, when considering the steps followed in the energy performance assessment and certification schemes (Fig. 1), the following benefits for each stakeholder group can be observed in Table 1. With this approach some of the main challenges identified in Sect. 2 are addressed, incrementing the reliability and trust in energy performance assessment and certification schemes, increasing quality in the process, while at the same time increasing social awareness and providing methods to boost energy refurbishments. In this context, the acceptance of solutions so far has been measured at an intermediate phase, but further feedback will be obtained at the end of the projects where these solutions are developed, and the identification of further research avenues to the ones detected in this chapter will be identified.

6 Conclusions This chapter has focused on services to support decision-making processes in energy performance assessment and certification schemes. As it has been explained, this does not only have to do with the end results of this process (the issuing of Energy Performance Certificates), but there is a varied set of steps where support can be offered. In this complex deployment of actions to assess the energy performance of the building stock, there is room for improvement that can facilitate the actions of the building value chain stakeholders involved in the energy performance assessment and certification schemes. In this line, Sect. 2 has presented these steps and the main challenges. These can be classified into three groups: (1) before issuing the EPCs and how the framework

The setting of alarms when a parameter does not comply with the expected values will help certifiers, experts and EPC managers to correct those mistakes and ensure robustness in the EPCs Identification of common patterns in the building stock, as well as linking potential errors in EPCs with the usage of specific tools, or the issuing of EPCs by specific experts. This can help authorities in checking specific sets of EPCs and imposing sanctions if necessary

Ensuring that the data is adequately stored in the repositories will facilitate the later exploitation of EPCs information

1. EPCs checker

2. EPCs data exploitation and reports Understanding of the current status generator of the building stock, detecting additional aspects that should be General public, energy experts, measured towards ensuring a better energy planners at different scales understanding, as well as towards (including authorities) modifying how EPC registers are set

Energy performance certifiers and experts, and EPCs managers

B. During EPCs issuing [Quality assurance schemes]

A. Before issuing EPCs [Energy performance, data registers and tools]

Service and main targeted stakeholders

Table 1 Stakeholders’ benefits related to key aspects in energy performance assessment and certification

(continued)

The provision of building stock values will facilitate the understanding of the energy context in an area, both for the general public as well as for energy planners to develop strategic policies

N/A. The service is better suited to contribute in the decision making of the other two steps

C. After issuing EPCs [Maximisation of EPC usability and social awareness]

Big Data Supported Analytics for Next Generation Energy Performance … 195

A. Before issuing EPCs [Energy performance, data registers and tools]

B. During EPCs issuing [Quality assurance schemes]

3. Energy conservation measures explorer

The understanding of the potential of N/A. The service is better suited to ECMs analysis contained within contribute in the decision making of the other two steps Architects, engineers, energy experts, EPCs can invite EPCs managers to implement additional processing investors, local authorities, energy services companies, EPCs managers capabilities (such as the categorisation of ECMs) to extract improved insights from the EPCs stored in the repositories

Service and main targeted stakeholders

Table 1 (continued)

(continued)

Individuals looking to perform an investment in energy refurbishment, architects, engineers or ESCOs can benefit from a further understanding of the impact of specific energy conservation measures, towards planning a future investment in energy efficiency

C. After issuing EPCs [Maximisation of EPC usability and social awareness]

196 G. Hernández Moral et al.

Quality checks by comparing the different approaches proposed can be interesting for all the value chain, but especially for EPC registers managers, since an automatic search that compares several approaches can bring to the spotlight sets of EPCs that have a strong deviation, and thus, that should be checked

N/A. The service is better suited to contribute in the decision making of the other two steps

N/A. The service is better suited to contribute in the decision making in the last step

4. Visualisation of EPCs and estimated energy parameters

5. Climate change impact on energy use

The whole building value chain

The whole building value chain

B. During EPCs issuing [Quality assurance schemes]

A. Before issuing EPCs [Energy performance, data registers and tools]

Service and main targeted stakeholders

Table 1 (continued)

In a similar manner to the previous service, this can not only contribute to an understanding of the current status of EPCs, but also increase social awareness of climate change related impacts on energy demand and solar forecast estimation. It will contribute to future-proof the actions (energy efficiency or PV) proposed

The user-friendly visualisation capabilities provided by the maps will contribute to enhance the understanding of Energy Performance Certificates across the building value chain, especially for the general public. The comparison among neighbouring buildings can also catalyse initiatives towards energy refurbishments

C. After issuing EPCs [Maximisation of EPC usability and social awareness]

Big Data Supported Analytics for Next Generation Energy Performance … 197

198

G. Hernández Moral et al.

in terms of tools, data and repositories is set; (2) during the EPCs issuing process and how quality is ensured; and (3) after EPCs have been issued and how their use can be maximised towards boosting energy refurbishments. The latter has a strong social component, where the user acceptance and social awareness and understanding are fundamental. To address some aspects of these challenges, five user-friendly services are proposed, each of them linked to one or more of the abovementioned challenges. The first one, the EPCs checker, addressed to EPCs managers as well as EPC issuers, proposes a series of checking actions to be performed on a number of EPC parameters to ensure their quality and robustness. The second one (EPCs data exploitation and reports generator) is linked to the exploitation of data contained in the EPCs in a municipality or region, towards facilitating the analysis of the building stock. This can be particularly useful when developing energy refurbishment strategies at district, municipality and regional levels. The two last services are linked to improving the understanding by end users of the value of Energy Performance Certificates and enable them to compare values coming from three different approaches with neighbouring building. This service can also prove useful for EPC managers who, through the comparison of the different proposed approaches, they can detect discrepancies that lead to a further inspection of a specific EPC. Moreover, the service can also be helpful for energy planners in order to detect areas that should be prioritised in energy retrofitting strategies. This service is complemented by the last one, which is focused on analysing the impact of climate change in energy demand, and also in solar radiation, towards the assessment of PV panels implementation. This last service adds a further dimension to the normally considered ones, by contemplating potential future climate scenarios. This assessment can contribute to identifying if solutions that are cost-effective and adequate under current climate conditions will continue to be adequate in the future or not. Therefore, future-proofed strategies that are better scoped can be defined. After the presentation of the services, a crucial step is assessing the acceptance of the solutions proposed. Technological advancements and data exploitation approaches need to be coupled with addressing the pains and needs of stakeholders. For this reason, this chapter offers a first preliminary assessment of user satisfaction based on the deployment of a methodology developed in three European projects. In general, it was observed that additional functionalities that had not been previously identified were demanded by the end users. A potential reason for this can be due to the fact that end users were not fully aware of the capabilities big data solutions could provide in their day to day processes. Finally, the discussion section offers a description on how big data is addressed, in particular depicting how the 5vs are considered. In the services presented in this chapter, the key Vs are related to the volume (especially when addressing calculations that entail more than 100.000 items, or whole regions are tackled), and to the variety. This last aspect becomes really tangible in the services proposed when combining different data sources to extract insights for each of the steps of the energy performance assessment and certification schemes. In addition, an assessment of how the data value chain is tackled is provided in this same section. In the services presented,

Big Data Supported Analytics for Next Generation Energy Performance …

199

all the steps in the data value chain are of equal importance, from the data generation acquisition to the data visualisation. Apart from these aspects, a special focus is placed on the replicability potential of the solutions proposed, by identifying the main hurdles that would be encountered to use the services in other contexts. In this line, the main difficulties would be encountered when having full access to complete Energy Performance Certificates and their corresponding registers, the availability of open data to describe not only the geometric characteristics of buildings (e.g. cadastral, LiDAR data), but also their thermal characteristics (such as U-values, etc.), or weather stations data. In addition, next steps are proposed for all of the services, which in some cases aim to counterbalance the replicability limitations encountered, and in other cases they broaden the functionalities offered. In conclusion, there are numerous opportunities to ease the processes that are carried out in energy performance assessment and certification schemes. There is available technology and sufficient data to offer innovative solutions to current challenges that are faced by the building value chain in the energy performance assessment and certification scheme’s steps. Nevertheless, it is crucial to go the last mile and ensure an adequate acceptability by users and maximisation of the services’ uptake. In the end, the services can provide support in the decision-making processes, but these decisions need to be turned into actionable strategies that are more strategic and well-targeted, with the final aim to obtain a decarbonised building stock and reach climate neutrality. Acknowledgements The work presented in this chapter is based on research conducted within the framework of three projects funded under the European Union’s Horizon 2020 research and innovation programme. Their details are as follows: (1) MATRYCS: Modular Big Data Applications for Holistic Energy Services in Buildings [GA: 101000158] https://www.matrycs.eu/ ; (2) BD4NRG: Big Data for Next Generation Energy [GA: 872613] https://www.bd4nrg.eu/ and (3) I- NERGY: Artificial Intelligence for Next Generation Energy [GA: 101016508] https://i-nergy.eu/. The authors would also like to express their gratitude to colleagues at EREN (Ente Público Regional de la Energía de Castilla y León https://gobierno.jcyl.es/web/es/consejerias/ente-publicoregional-energia.html), FAEN (Fundación Asturiana de la Energía https://www.faen.es/), as well as the rest of the projects’ colleagues for their help, fruitful discussions and insights. The content of the paper is the sole responsibility of its authors and does not necessarily reflect the views of the EC.

References 1. “Clean energy for all Europeans” package: https://energy.ec.europa.eu/topics/energy-strategy/ clean-energy-all-europeans-package_en. Accessed June 2023 2. E. Sarmas, N. Dimitropoulos, V. Marinakis, Z. Mylona, H. Doukas, Transfer learning strategies for solar power forecasting under data scarcity. Sci. Rep. 12(1), 14643 (2022) 3. Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee Of The Regions The European Green Deal COM/2019/640 final. https://eur-lex.europa.eu/legal-content/EN/ TXT/?qid=1588580774040&uri=CELEX%3A52019DC0640. Accessed June 2023

200

G. Hernández Moral et al.

4. E. Sarmas, E. Spiliotis, V. Marinakis, T. Koutselis, H. Doukas, A meta-learning classification model for supporting decisions on energy efficiency investments. Energy Build. 258, 111836 (2022) 5. Communication from the Commission to the European Parliament, the Council, The European Economic and Social Committee and the Committee of the Regions a Renovation Wave for Europe—greening our buildings, creating jobs, improving lives. COM(2020) 662 final https:// eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52020DC0662. Accessed June 2023 6. Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions ‘Fit for 55’: delivering the EU’s 2030 Climate Target on the way to climate neutrality https://eur-lex.europa.eu/legal-con tent/EN/TXT/?uri=CELEX%3A52021DC0550. Accessed June 2023 7. Directive 2010/31/EU of the European Parliament and of the Council of 19 May 2010 on the energy performance of buildings (recast) http://data.europa.eu/eli/dir/2010/31/2021-01-01. Accessed June 2023 8. Next Generation Energy Performance Certificates cluster at EUSEW21 https://www.crosscert. eu/materials/events/the-next-generation-energy-performance-certificates-cluster-at-eusew21. Accessed June 2023 9. QualDeEPC: High-quality Energy Performance Assessment and Certification in Europe Accelerating Deep Energy Renovation [H2020 GA: 847100] https://qualdeepc.eu/. Accessed June 2023 10. U-CERT: Towards a new generation of user-centred Energy Performance Assessment and Certification; facilitated and empowered by the EPB Center [H2020 GA: 839937] https://u-cer tproject.eu/. Accessed June 2023 11. X-tendo: eXTENDing the energy performance assessment and certification schemes via a mOdular approach [H2020 GA: 845958] https://x-tendo.eu/. Accessed June 2023 12. D^2EPC: Next-generation Dynamic Digital EPCs for Enhanced Quality and User Awareness [H2020 GA: 892984] https://www.d2epc.eu/en. Accessed June 2023 13. E-Dyce: Energy flexible DYnamic building CErtification [H2020 GA: 893945] https://edy ce.eu/. Accessed June 2023 14. ePANACEA: Smart European Energy Performance AssessmeNt And CErtificAtion [H2020 GA: 892421] https://epanacea.eu/. Accessed June 2023 15. EPC RECAST: Energy Performance Certificate Recast [H2020 GA: 893118] https://epc-rec ast.eu/. Accessed June 2023 16. crossCert: Cross Assessment of Energy Certificates in Europe [H2020 GA: 101033778] https:// www.crosscert.eu/. Accessed June 2023 17. EUB SuperHub: European Building Sustainability performance and energy certification Hub [H2020 GA: 101033916] https://eubsuperhub.eu/. Accessed June 2023 18. iBRoad2EPC: Integrating Building Renovation Passports into Energy Performance Certification schemes for a decarbonised building stock [H2020 GA: 101033781] https://ibroad2ep c.eu/. Accessed June 2023 19. TIMEPAC: Towards innovative methods for energy performance assessment and certification of buildings [H2020 GA: 101033819] https://timepac.eu/. Accessed June 2023 20. Smartliving EPC: Advanced Energy Performance Assessment towards Smart Living in Building and District Level [H2020 GA: 101069639] https://www.smartlivingepc.eu/en. Accessed June 2023 21. Chronicle: Building Performance Digitalisation and Dynamic Logbooks for Future ValueDriven Services [H2020 GA: 101069722] https://www.chronicle-project.eu/. Accessed June 2023 22. T. Testasecca, M. Lazzaro, E. Sarmas, S. Stamatopoulos, Recent advances on data-driven services for smart energy systems optimization and pro-active management, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv) (pp. 146–151). IEEE (2023) 23. E. Sarmas, E. Spiliotis, V. Marinakis, G. Tzanes, J.K. Kaldellis, H. Doukas, ML-based energy management of water pumping systems for the application of peak shaving in small-scale islands. Sustain. Cities Soc. 82, 103873 (2022)

Big Data Supported Analytics for Next Generation Energy Performance …

201

24. E. Sarmas, S. Strompolas, V. Marinakis, F. Santori, M.A. Bucarelli, H. Doukas, An incremental learning framework for photovoltaic production and load forecasting in energy microgrids. Electronics 11(23), 3962 (2022) 25. C. Tsolkas, E. Spiliotis, E. Sarmas, V. Marinakis, H. Doukas, Dynamic energy management with thermal comfort forecasting. Build. Environ. 237, 110341 (2023) 26. E. Sarmas, E. Spiliotis, E. Stamatopoulos, V. Marinakis, H. Doukas, Short-term photovoltaic power forecasting using meta-learning and numerical weather prediction independent Long Short-Term Memory models. Renew. Energy 216, 118997 (2023) 27. European Commission, Joint Research Centre, V. Serna-González, G. Hernández Moral, F. Miguel-Herrero et al., Harmonisation of datasets of energy performance certificates of buildings across Europe—ELISE energy and location applications : final report, Publications Office (2021), https://doi.org/10.2760/500135. Accessed June 2023 28. Mandate to CEN, CENELEC and ETSI for the elaboration and adoption of standards for a methodology calculating the integrated energy performance of buildings and promoting the energy efficiency of buildings, in accordance with the terms set in the recast of the Directive on the energy performance buildings (2010/31/EU) https://energy.ec.europa.eu/system/files/ 2014-11/2010_mandate_480_en_0.pdf. Accessed June 2023 29. EPB Centre, Background (On EPB standards): https://epb.center/epb-standards/background/. Accessed June 2023 30. Concerted Action—Energy Performance of Buildings—Outputs 2015–2018 CCT1 Technical Elements https://www.epbd-ca.eu/ca-outcomes/outcomes-2015-2018/book-2018/ct/technicalelements. Accessed June 2023 31. Concerted Action—Energy Performance of Buildings—Outputs 2015–2018 CCT3 Compliance, Capacity and Impact https://www.epbd-ca.eu/ca-outcomes/outcomes-2015-2018/book2018/ct/compliance-capacity-and-impact. Accessed June 2023 32. Available Energy Performance Certification tools in Spain—MITECO (Ministerio para la transición ecológica y el reto demográfico): https://energia.gob.es/desarrollo/EficienciaEn ergetica/CertificacionEnergetica/DocumentosReconocidos/Paginas/procedimientos-certifica cion-proyecto-terminados.aspx. Accessed June 2023 33. Big Data Value Association (BDVA): European Big Data Value Strategic Research & Innovation Agenda, January 2015. https://www.bdva.eu/sites/default/files/europeanbigdatavaluepart nership_sria__v1_0_final0.pdf. Accessed June 2023 34. MATRYCS: Modular big data applications for holistic energy services in buildings [H2020] GA: 101000158. https://www.matrycs.eu/. Accessed June 2023 35. I-NERGY: Artificial intelligence for next generation energy [H2020] GA: 101016508 https:// i-nergy.eu/. Accessed June 2023 36. BD4NRG: Big data for next generation energy [H2020] GA: 872613 https://www.bd4nrg.eu/. Accessed June 2023 37. Power BI https://powerbi.microsoft.com/en-us/. Accessed June 2023 38. INSPIRE Services of Cadastral Cartography: https://www.catastro.minhap.es/webinspire/ index_eng.html. Accessed June 2023 39. Spanish Download Center—National Information Center https://centrodedescargas.cnig.es/ CentroDescargas/catalogo.do?Serie=CAANE. Accessed June 2023 40. Mapping of census sections and street map of Electoral Census—Spain https://www.ine.es/ss/ Satellite?L=es_ES&c=Page&cid=1259952026632&p=1259952026632&pagename=Produc tosYServicios%2FPYSLayout. Accessed June 2023 41. TABULA/EPISCOPE: https://episcope.eu/welcome/. Accessed June 2023 42. Building Stock Observatory https://energy.ec.europa.eu/topics/energy-efficiency/energy-effici ent-buildings/eu-building-stock-observatory_en. Accessed June 2023 43. Energy Data Hub of the Castilla y León region https://analisis.datosabiertos.jcyl.es/explore/dat aset/certificados-de-eficiencia-energetica/table/?sort=fecha_inscripcion. Accessed June 2023 44. Spanish Cadastral data https://www.sedecatastro.gob.es/. Accessed June 2023 45. INSPIRE Registry—Building (BU) https://inspire.ec.europa.eu/theme/bu. Accessed June 2023

202

G. Hernández Moral et al.

46. Copernicus—Land monitoring service—Urban Atlas https://land.copernicus.eu/local/urbanatlas. Accessed June 2023 47. E. Sarmas, E. Spiliotis, N. Dimitropoulos, V. Marinakis, H. Doukas, Estimating the energy savings of energy efficiency actions with ensemble machine learning models. Appl. Sci. 13(4), 2749 (2023) 48. JRC PESETA III Science for POlicy Summary Series. Available online on https://joint-res earch-centre.ec.europa.eu/system/files/2018-11/task_05_energy_final_v1.pdf. Accessed June 2023 49. Coupled Model Intercomparison Project Phase 6 (CMIP6) https://wcrp-cmip.org/cmip-phase6-cmip6/. Accessed June 2023 50. V. Eyring, S. Bony, G.A. Meehl, C.A. Senior, B. Stevens, R.J. Stouffer, K.E. Taylor, Overview of the coupled model intercomparison project phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 9, 1937–1958 (2016). https://doi.org/10.5194/gmd-9-19372016 51. JRC—Photovoltaic Geographical Information System (PVGIS) https://joint-research-centre. ec.europa.eu/pvgis-online-tool_en. Accessed June 2023 52. Copernicus Climate Change initiative (C3S) https://cds.climate.copernicus.eu/about-c3s. Accessed June 2023 53. G. Hernández-Moral, V.I. Serna-González, A. Martín-Crespo, S. Saludes-Rodil, Multiobjective optimization algorithms applied to residential building retrofitting at district scale: BRIOTOOL. E3S Web Conf. 362 03002 (2022). https://doi.org/10.1051/e3sconf/202236 203002 54. G. Hernández, V. Serna, M.A. García-Fuentes, Design of energy efficiency retrofitting projects for districts based on performance optimization of District Performance Indicators calculated through simulation models, in Proceedings of the CISBAT 2017 International Conference, Laussane, Switzerland, pp. 721–726 (2017) 55. M.A. García-Fuentes, V. Serna, G. Hernández, Evaluation and optimisation of energy efficient retrofitting scenarios for districts based on district performance indicators and stakeholders’ priorities, in Proceeding of the BSO2018. (United Kindgom, Cambridge, 2018), pp.68–75 56. P. Skaloumpakas, E. Sarmas, Z. Mylona, A. Cavadenti, F. Santori, V. Marinakis, Predicting thermal comfort in buildings with machine learning and occupant feedback, in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv) (pp. 34–39). IEEE (2023)

Synthetic Data on Buildings Daniele Antonucci , Francesca Conselvan, Philipp Mascherbauer, Daniel Harringer, and Cristian Pozza

Abstract The data-driven approach to building evaluation is gaining significant popularity. Policies related to the adoption of new energy-saving technologies are now rooted in their actual effects on reducing energy consumption. The pandemic and subsequent war have emphasized the crucial significance of various building aspects, including air quality, indoor comfort (thermal, light, and acoustic), and energy efficiency. Evaluating these aspects necessitates the use of real data, which, unfortunately, is often inaccessible, of low quality, or incomplete. Yet, even in the energy and building fields, data anonymization methods like normalization and aggregation limit the information that can be shared. Therefore, finding more effective ways to collaborate on data without compromising privacy is critical for both data owners and data analysts. According to the European Commission’s Joint Research Centre, synthetic data will play a crucial role in enabling AI. This type of data serves as a unifying bridge between policy support and computational models, unlocking the potential of data that may be hidden in silos. Consequently, it becomes the primary catalyst for AI adoption in business and policy applications across Europe. Additionally, the resulting data not only can be freely shared but also aids in rebalancing under-represented classes in research studies through over-sampling, making it an ideal input for machine learning and AI models. One of the fundamental elements involves developing AI models that can accurately and faithfully reproduce various types of building data, considering factors such as climatic conditions, building and system types, occupancy profiles, and intended use. The present study aims to evaluate potential scenarios and applications of AI models for generating valuable data for both building energy assessments and economic evaluations. D. Antonucci (B) · C. Pozza Eurac Research, Institute for Renewable Energy, Bolzano, Italy e-mail: [email protected] F. Conselvan e-Think Energy Research, Wien, Austria P. Mascherbauer · D. Harringer Institute of Energy Systems and Electrical Drives, Vienna University of Technology, EEG TU Wien, Vienna, Austria © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Doukas et al. (eds.), Machine Learning Applications for Intelligent Energy Management, Learning and Analytics in Intelligent Systems 35, https://doi.org/10.1007/978-3-031-47909-0_7

203

204

D. Antonucci et al.

Keywords Synthetic data · Generative AI · Machine learning · GANs · DoppelGANger · Time-series · Privacy

Acronymous IEA HVAC ECMs BIM AI ML GDPR BMS TBM RNN CVAE GAN EPBD

International Energy Agency Heating Ventilation Air Conditioning Energy Conservation Measures Building Information Modelling Artificial Intelligence Machine Learning General Data Protection Regulation Management System Technical Building Management. Recurrent Neural Network Conditional Variational Autoencoder Generative Adversarial Network Energy Performance of Buildings Directive

1 Introduction It is widely recognized that buildings constitute one of the most energy-intensive sectors. According to the International Energy Agency (IEA), in 2021, buildings were responsible for about 30% of global energy-related carbon dioxide (CO2 ) emissions, with 8% arising from direct emissions in buildings and 19% from electricity and heat consumption [1]. These percentages are influenced by various factors. Notably, the European building stock requires significant restructuring due to inefficiencies both in facades and HVAC systems [2]. Achieving energy efficiency in buildings necessitates evaluating various aspects, spanning from the design and construction process to subsequent analyses to assess the actual energy savings and comfort improvements following the implementation of Energy Conservation Measures (ECMs). Historically, these steps have heavily relied on manual processes, entailing lengthy timelines and considerable uncertainty. Additionally, the advent of Construction 4.0 has resulted in an overwhelming abundance of data generated throughout different project stages, from design and construction to facility management. This data encompasses not only conventional values but also diverse information like metadata, materials, models, and drawings. Prominent examples include BIM, which serves as a fundamental source for organizing and managing building information, and the Digital Twin, allowing the creation of a “parallel” model of entities like buildings, districts, cities, or regions.

Synthetic Data on Buildings

205

Despite this wealth of information, companies tend to keep it confidential, safeguarding their intellectual property and refraining from sharing it with others. One consequence of the current state of the construction sector is its reliance on outdated digital methods, leading to a lack of coordination between tasks, lengthy lead times, and increased costs [3]. However, looking at the transformative impact of Artificial Intelligence (AI), particularly generative AI, on various industries, it becomes evident that the world of buildings could also significantly enhance its performance through the adoption of this technology. Generative-based Machine Learning (ML) models, like ChatGPT and Bard, rely on vast and diverse datasets to train intricate models. In the building domain, their implementation should not be viewed as mere replacements for architects or engineers’ work. Instead, they can aid professionals in swiftly accessing the best information for making well-informed decisions. Combining rule-based AI systems with generative AI allows the generation of novel and valid data in the form of design alternatives, promoting creativity and efficiency in the decision-making process. Automated design generation drastically reduces the time and effort required, substantially outperforming traditional manual methods. Moreover, generative AI plays a crucial role in optimizing resource usage, minimizing waste, and enhancing energy efficiency. By creating synthetic data representing different scenarios, construction professionals can analyze the impact of various design choices, leading to more sustainable and cost-effective buildings. Notably, generative AI also revolutionizes the construction process itself. By leveraging real-time data, sensors, and AI algorithms, construction sites can be monitored and managed more effectively. Intelligent systems can identify potential safety hazards, track equipment and material usage, and optimize building energy consumption and production through predictive models trained with reliable ML data. However, the success of generative AI in the construction sector hinges on the availability and sharing of data, which is often limited by privacy concerns. Nevertheless, generative AI remains one of the most prominent methods for creating synthetic data, but it is not the sole approach. In the following paragraphs, we will delve into a detailed explanation of synthetic data, its potential to improve the construction sector, and present two relevant case studies. The first case study demonstrates the assessment of comfort and benchmark buildings, while the second showcases the creation of synthetic data from electricity consumption profiles and its application in predictive models and market analysis.

206

D. Antonucci et al.

2 Why and What Are Synthetic Data and How It Can Help the Construction Sector? As defined by [4] there is no precise and accepted definition of synthetic data. According to the authors of the above-mentioned scientific work, a definition of synthetic data could be the following: Synthetic data is data that has been generated using a purpose- built mathematical model or algorithm, with the aim of solving a (set of) data science task(s). Kalyan Veeramachaneni, principal research scientist with MIT’s [5] Schwarzman College of Computing said: A synthetic data set has the same mathematical properties as the real-world data set it’s standing in for, but it doesn’t contain any of the same information. It’s generated by taking a relational database, creating a generative machine learning model for it, and generating a second set of data. The result is a data set that contains the general patterns and properties of the original—which can number in the billions—along with enough “noise” to mask the data itself. It can therefore be seen that synthetic data is data that provides information that is very similar to real one even though it is not. A question therefore arises spontaneously, why use synthetic data instead real data and why use it in the construction field? Synthetic data can be a solution to several problems in different domains. However, there are three areas of interest where their use is more common, such as: • Release of private data: Synthetic data emerges as a promising solution in the context of data-driven machine learning solutions, which have become the primary drivers of innovation. The need to share data is evident as scientists and developers cannot make significant progress without access to high-quality data. However, privacy regulations, such as GDPR, rightly demand cautious handling of personal data, resulting in a cumbersome process of meeting data access requirements. This challenge is most pronounced when it comes to sharing sensitive data that could be exploited for profit or marketing purposes by companies. For instance, the risk lies in disclosing personal details like a person’s first name, last name, address, and other sensitive information related to health, finances, or general consumption habits. Companies can then use this data to create detailed user profiles and target marketing campaigns for profit maximization. In the insurance sector, knowledge of specific health conditions could lead to increased insurance premiums, while in the energy industry, understanding a user’s consumption patterns could result in tailored marketing strategies to promote certain products. To address these concerns and protect individual privacy, synthetic data offers a valuable alternative. By generating artificial data that closely resembles real data while containing no actual personal information, companies can share relevant insights without compromising privacy. This allows researchers, scientists, and developers to access useful data for their work without accessing actual sensitive information. As such, synthetic

Synthetic Data on Buildings

207

data strikes a balance between the need for data-driven advancements and ensuring data privacy compliance. • Bias-reduction and data accuracy. Bias arises when data collection methods fail to accurately represent the intended population, resulting in a sample that does not properly reflect it. Synthetic data empowers organizations to generate well-balanced or representative samples that more accurately mirror the underlying population. This approach helps mitigate the potential for discriminatory outcomes and fosters fairness and equity in decision-making processes. • Data augmentation. Synthetic data can be used to augment existing datasets, providing a larger and more diverse sample for analysis. Modern machine learning models rely heavily on large quantities of accurately annotated data to achieve good performance. However, the process of collecting and annotating data is typically manual and time-consuming, demanding significant resources. The availability of high-quality curated data for a specific task depends on both the inherent availability of clean data in that domain and the expertise of the developers involved. In numerous real-world applications, obtaining an adequate amount of training data is often impractical. Data augmentation has emerged as the most effective solution to address this challenge, aiming to augment the training data by increasing its volume, quality, and diversity. The utilization of synthetically generated labeled data presents a cost-effective approach to overcome this hurdle, which has already gained traction in the industry. This method involves training a ML model on synthetic data with the objective of deploying it on real-world data. Typically, privacy concerns take a backseat in such applications since the intention is not to substitute real data but to complement it by utilizing the synthetic data in tandem. The use of synthetic data in the construction field is used both for the aspects defined above, but also for other evaluations. Regarding privacy protection, it is well known that many data and information are sensitive. Among these there are not only the drawings of building or of mechanical/electrical system created by engineering firms, but there are also all the energy data that is collected by energy utilities with smart meters, as well as monitoring data (indoor comfort or performance of facades or components systems in place) or data collected by Building Management System (BMS) or Technical Building Management (TBM). All these data are regulated by GDPR, but contain a series of information that can be very useful for various aspects, such as: • Project design. In the design phase, it is crucial to define input data more precisely. Currently, pre-set values are employed to size components, but they may not always align with actual data, leading to malfunctions or discomfort. For instance, consider the sizing of ventilation flow rates for air exchange, where factors like internal temperatures, carbon dioxide levels, and occupancy should be considered to ensure optimal performance. • Data retention. In general, by law, the data collected by both building monitoring systems and energy systems (smart meters) can be saved by service providers (e.g.

208

D. Antonucci et al.

energy utilities) only for a limited period of time. This information can be useful for historical analysis in different studies, but also to identify possible market choices. In particular, keep a record of individual consumption data during major crises, which would allow studies to better understand these periods in time. • Research project. Construction research studies rely heavily on data as their foundation. However, obtaining this data is often challenging due to the stringent regulations set forth by GDPR, which restrict data sharing. In such situations, synthetic data emerges as a valuable aid for researchers, as it can offer the essential intrinsic information found in real data while circumventing the issues related to privacy infringement. • Data sharing. Many companies collect data from their service users, such as energy supply companies and smart meter data of users (citizens). However, companies are internally composed of different departments each with its own objective. Although these departments are from the same company, original data cannot be easily shared. A solution is to provide anonymized and aggregated data, but this reduces or eliminates the information needed for certain actions. (e.g., defining marketing campaigns with different offers). With the use of synthetic data, this problem can be overcome since as defined so far, such data retain the statistical information needed for analysis, while masking sensitive data. A second aspect to be considered is bias-reduction and data accuracy. Machine learning models, especially neural networks, have become common in the building sector, being trained to predict building energy consumption, renewable energy production, and diagnose faults in plant systems. However, real data often suffer from bias issues, stemming from dataset characteristics or data scarcity. In this context, synthetic data can prove invaluable, as it can be generated to reduce or eliminate bias problems, ensuring data accuracy. Consequently, ML models trained with more accurate and realistic data yield improved results, which is particularly critical for models controlling systems like HVAC systems. Accurate models enable better management of controls to reduce energy consumption in buildings. Furthermore, the reliability of ML models relies on analyzing large amounts of diverse and relevant data, which is not always readily available. Data augmentation through synthetic data generation addresses this limitation, producing qualitatively sound data in large quantities. A practical application of this approach involves creating specific profiles for different building archetypes, assisting energy suppliers in tailoring offers and services to individual users. Consideration must also be given to the economic aspect of generating and using synthetic data. While there may be upfront costs associated with identifying the correct model for generating synthetic data, its use leads to significant cost reduction over time. This is because organizations no longer need to continually search for and process new data, which can be time-consuming and expensive. Traditional data collection methods involve high costs, time requirements, and resource intensity. However, adopting synthetic data helps organizations overcome these challenges by reducing data collection and storage expenses. This is particularly advantageous for smaller businesses or startups with limited resources, enabling them to perform

Synthetic Data on Buildings

209

Fig. 1 Synthetic data will become the main form of data used in AI. Source Gartner, “Maverick Research: Forget About Your Real Data—Synthetic Data Is the Future of AI,” Leinar Ramos, Jitendra Subramanyam, 24 June 2021

analyses that would otherwise be cost-prohibitive or time-consuming. Moreover, synthetic data simplifies storage and manipulation, eliminating the need for costly hardware and software. This cost-effective approach allows organizations to save money on data-related expenditures and allocate resources to other critical aspects of their operations (Fig. 1). According to what is said by Gartner, an American technological research and consulting firm based in Stamford, in their latest report and presented at the Gartner Data & Analytics Summit 2022. By the year 2030, a substantial portion of the data utilized in AI applications is expected to be predominantly generated artificially through rules, statistical models, simulations, or alternative techniques.

3 ML Techniques for the Generation of Synthetic Data in the Building Sector In recent years, AI has made its way into the realm of buildings, aiming to optimize their energy efficiency. Researchers have harnessed the power of ML and AI techniques to create realistic and diverse synthetic data that captures the complexities of the built environment. This approach helps overcome the limitations of real-world data and unlocks new possibilities for investigation. ML predictive models are then developed, with the goal of reducing energy consumption, minimizing environmental impact, and enhancing the overall comfort and well-being of building occupants. The progress in ML has led to a wide array of techniques for generating synthetic data tailored to specific data types and project scopes. In the context of the building

210

D. Antonucci et al.

sector, several promising techniques stand out, especially for time-series data. These include Markov Chain, Recurrent Neural Network (RNN), Conditional Variational Autoencoder (CVAE), and Generative Adversarial Network (GAN). Each of these methods offers unique advantages for creating synthetic data that closely resembles real-world patterns and trends, enabling more accurate and effective ML models for building energy efficiency and environmental sustainability. • Markov Chain The Markov chain is a stochastic process that describes a sequence of possible states, in which the probability of each state depends on the previous states. The model is based on the assumption that activities evolve, and future activities can be determined from the current state and the transition probabilities, but not from external factors. Markov chain models are relatively straightforward to implement and computationally efficient. Once the model parameters are estimated, generating synthetic data becomes a computationally inexpensive process. This makes it t highly scalable and efficient for applications that demand the generation of significant volumes of synthetic data. Moreover, the Markov chain provides control over data generation. By manipulating the transition probabilities or initial state distribution, researchers can explore different scenarios and generate synthetic data that represent specific patterns or conditions of interest. However, the Markov-Chain lacks in recognizing complex dependencies ad longrange interactions between various conditions, like weather or building operations. Markov-Chain models typically focus on internal state transition rather than external factors, which also influence energy consumption. Moreover, the Markov-Chain models may struggle to capture long-term trends and seasonal variations and this is [6] a big limitation when it comes to the building sector. In that case, the model must be combined with other techniques specific to capture time series patterns. The Markov-Chain method can be applied to real-world scenarios to estimate the energy consumption of a building or a group of buildings, helping the optimization of the building systems and reducing energy waste. The method can also help to evaluate the impact of occupancy patterns and identify opportunities for energy savings. For example, the Markov chain model can help determine the optimal setpoints and schedules for heating, cooling, and ventilation systems based on the predicted occupancy states. In the study of occupancy profiles, Kelly and colleagues [6] refined the Markov chain to generate accurate and realistic occupancy profiles to model energy demand and to have a better understanding of occupants’ behavior. The authors used transition probability data compiled from the UK Time-Use Survey to account for typical behavioral differences and develop a higher-order Markov model that improves the prediction of transitions and durations of different occupancy states. The model considers historical occupant data, and the transition probabilities between different occupancy states and simulates a realistic sequence of occupancy states over time.

Synthetic Data on Buildings

211

• Recurrent Neural Network (RNN) RNNs are a neural network technique commonly used to capture and model temporal dependencies. It is well-suited to generate synthetic data from time series data, such as energy consumption patterns or sensor data. RNN processes and models information from previous time steps and uses this information to make better predictions at future time steps. One of the key contributions of RNNs is to capture long-range dependencies and complex interactions over extended periods. This is particularly valuable for modelling energy consumption and demand patterns. For instance, Kleinebrahm and colleagues used a neural network, because it better captures long-term mobility, and activity patterns and depicts behavior diversity across [7] the population. The authors combine social practice theory, energy-related activity modelling and novel machine learning approaches to simulate household electricity, heat and mobility demand. • Conditional Variational Autoencoders (CVAEs) Conditional Variational Autoencoder is an implementation of the Variational Autoencoders, a generative model that learns the underlying distribution of real building data by capturing its statistical patterns and characteristics [7]. CVAE extends the VAE framework by incorporating conditional information that includes any relevant information to have accurate and faithful synthetic data (Fig. 2). By training a CVAE on a dataset of building attributes, sensor readings, or energy consumption patterns, the model learns to encode the input data into a lowerdimensional latent space representation. The CVAE then decodes the latent space representation back into the original data space, generating synthetic samples that resemble real-world data. Compared to other techniques CVAEs are easy to train

Fig. 2 Architecture of variational autoencoder. Source Understanding Variational Autoencoders (VAES), Towards Data Science – J. Rocca

212

D. Antonucci et al.

and are easily adaptable to various types of building data, such as sensor reading or occupancy behavior. Second, latent space offers several advantages, like probabilistic modelling, data reconstruction and meaningful representation of the data. This makes VAEs robust to missing data by imputing missing values. Moreover, VAEs can reconstruct the input data from the latent space and this reconstruction ability allows for a better assessment of the model’s performance. For instance, by comparing the reconstructed data with the original data, anomalies or discrepancies can be detected. However, CVAEs require a large amount of data to accurately train the model and have an accurate distribution of the dataset. It can also struggle to capture long-range dependencies in sequential data, making it not ideal for generating synthetic time series. Despite these limitations, researchers continue to explore enhancement to VAEs architecture to improve the quality and fidelity of the generated synthetic data. Fan and colleagues used CVAEs to augment an existing building dataset to improve short-term building energy predictions [8]. Compared to other techniques, such as transfer learning or feature engineering, the proposed strategy has several advantages. First, it can generate meaningful yet synthetic data that can increase the diversity of existing data and potentially describe unseen working conditions for reliable model development. Second, it is a lightweight solution that can tackle practical data shortage problems in the building field. Finally, it has been shown to improve short-term building energy predictions in experiments using actual measurements from 52 buildings. • Generative Adversarial Networks (GANs) The advent of GANs has revolutionized the generation of synthetic data [9]. Since its debut in 2014 [10], such a method can produce high-quality data, initially images and videos, and our days in several fields [11]. Its ability to generate realistic and high-quality data and its flexibility have made it an advanced tool in data generation. Its flexibility lies in the possibility to create different variants of GAN using that fundamental piece of work (some of the known variants are StyleGAN, WGAN, ConditionalGAN). GANs consist of two interconnected networks: the generator and the discriminator. These networks are trained in an adversarial manner. The generator generates synthetic outputs, which are then evaluated by the discriminator along with real data. The discriminator’s role is to differentiate between the real and artificial outputs. The ultimate objective is to attain equilibrium, where the generated samples closely adhere to the statistic distribution as the real data. Once equilibrium is [9] achieved, the discriminator cannot differentiate between real and synthetic samples [10, 11] (Fig. 3). The first GANs models were implemented on the image dataset and their architecture could not capture the complex temporal structures and irregularities of time series data. Load profiles typically show how the amount of energy used changes over time, such as on an hourly, daily, weekly, monthly, or annual basis. Nevertheless, for many fields of research time is a valuable factor and several GAN nets have been specifically designed for time series (Table 1). Training a time series GAN involves

Synthetic Data on Buildings

213

Fig. 3 Visualization of the GAN architecture. Source [12]

capturing the temporal dependencies, patterns, and characteristics of the real-time series data. Nowadays, the use of GAN to generate synthetic load profiles is widely used and, as the demand for realistic synthetic data is growing, numerous studies and developments have been done to advance this technique in the building sector [13– 18]. In one of the pioneering projects, researchers at the Lawrence Berkley National Laboratory synthesized daily load profiles using a GAN algorithm to capture both the general trend and the random fluctuations of the electrical loads [12]. The growing emphasis on data protection and security has created challenges in accessing the metadata information of load profiles, which is a fundamental component to generating accurate synthetic data. That has opened a new challenge to tune the GAN workflow to address fidelity and accuracy. For instance, Asre and Anwar successfully implemented the timeGAN framework to generate energy load profiles at household levels having as input data only the timestamp and the energy consumption. In the current state of the art, the DoppelGANger is one of the most promising GAN models available to generate synthetic time series. DoppelGANger has been recently introduced by Lin et alias from Carnegie Mellon University, who specifically created an architecture able to capture the temporal dependencies between time series data [9] and handle datasets containing both continuous and discrete features. DoppelGANger overcomes the obstacle of mixed-type data by separating the generation of metadata from time series data and leveraging metadata to exert significant contributions to the generation of synthetic data. As a result, DoppelGANger achieves up to 43% better fidelity than baseline models and accurately captures the subtle correlations between data. Through the generation of synthetic data, ML is facilitating the development of more accurate and efficient energy models and is helping in designing and optimizing energy management strategies. Load modelling is one of the crucial tasks for improving building energy efficiency and the different machine learning-based

214

D. Antonucci et al.

Table 1 GAN architecture specific for time series data GANs architecture Description Vanilla GANs

Vanilla GANs, also known as traditional GANs, can be used to generate time series data by modifying the architecture and training process to account for the temporal nature of the data. This may involve using recurrent neural networks (RNNs), such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), in the generator and discriminator to capture the temporal dependencies in the data

Recurrent GANs (RGANs)

RGANs are a type of GAN that explicitly incorporate recurrent neural networks (RNNs) in both the generator and discriminator to generate time series data. RGANs are designed to capture the temporal dependencies in the data and generate sequences of data points that exhibit realistic temporal patterns

SeqGANs

SeqGANs are a type of GAN that are designed for generating sequences of discrete data, such as time series data with discrete events or categorical variables. SeqGANs can use techniques such as Reinforcement Learning (RL) to guide the generator in generating realistic and representative sequences of data points

Attention-based GANs

Attention-based GANs incorporate attention mechanisms, which are commonly used in natural language processing tasks, to capture the temporal dependencies in the data. Attention mechanisms allow the generator and discriminator to focus on different parts of the input time series data, which can improve the quality of the generated time series data

Time GANs (TGAN)

TimeGAN is is a variant of GAN specifically designed for generating synthetic time series data. TimeGAN was proposed in a research paper titled “Time-series Generative Adversarial Networks” by Jinsung Yoon et al. in 2019. TimeGAN extends the traditional GAN architecture by incorporating additional components to handle the temporal dynamics of time series data. The embedder in TimeGAN is responsible for transforming the real and synthetic time series data into a lower-dimensional representation, which is used as input to the discriminator and supervisor

models enable researchers to explore various scenarios and evaluates the performance of energy systems under different conditions. Synthetic data are the cornerstone to path a more sustainable and energy-efficient future.

4 Case Study 1: Synthetic Data of Indoor and Outdoor Temperature in a School The latest Energy Performance of Buildings Directive (EPBD) mandates member states to assess building energy consumption dynamically, considering various factors such as space heating, cooling, domestic hot water, ventilation, lighting, and other building systems. The typical energy use should represent actual operating conditions and user behavior [19]. However, the accuracy of these assessments

Synthetic Data on Buildings

215

heavily relies on correctly implementing building properties and boundary conditions [20]. For example, a study on heating patterns in UK residential buildings emphasized the importance of accurately simulating indoor temperatures. The research revealed significant discrepancies between monitored indoor temperatures and assumptions made by simulation models, particularly in living rooms. Interestingly, heating patterns showed minimal variations between weekdays and weekends, contrary to common simulation tools used in the UK, such as BREDEM. Accurate boundary conditions were found to be essential for simulating building energy consumption, and the study used in-situ measurements from nine dwellings to derive an indoor temperature profile for improved energy demand prediction. Real data on building performance play a critical role in understanding building behavior. Unfortunately, such data are not always available due to the recent installation of sensors for building performance assessment and reluctance to release data publicly, often due to GDPR constraints. While data privacy concerns for sensitive information are understandable, there should be more openness to sharing non-sensitive data like internal temperatures. To address this issue, synthetic data generation can be employed. An example of generating synthetic temperature data from a school in northern Italy, based on a 6-month monitoring period, is presented. Synthetic data ensures the original data’s confidentiality, allowing the model to provide indoor temperature profiles for simulations of buildings with similar intended use and location. The model is trained using both indoor and outdoor temperatures to correlate outdoor temperature with indoor temperature and generate specific indoor temperature profiles based on a range of outdoor temperatures. This approach helps improve the accuracy of building energy performance predictions and enhances energy efficiency measures in various building types with similar climatic conditions.

4.1 Methodology In this study, we propose a data-driven approach that employs a generative algorithm called DoppelGANger, a variant of Generative Adversarial Network (GAN), to create realistic indoor and outdoor temperature profiles for a school building. The dataset comprises time series temperature data from 11 rooms in a school located in the North-East of Italy. The data was collected over a period of less than 3 months, from April to June, using temperature sensors with an error margin of ±0.5 °C and a sampling rate of 10 min. Additionally, a weather station was installed to monitor various external parameters, including outdoor temperature. Before processing the data with the DoppelGANger algorithm, it was cleaned to remove any outliers. All the processes were implemented in Python, utilizing a specific library developed by GreteAI. The machine learning model was generated, taking into account all variables together, which includes the 11 time series of indoor temperature and the 1 outdoor temperature.

216

D. Antonucci et al.

To evaluate the effectiveness of the resulting model in generating synthetic data, a comprehensive analysis was conducted. • The correlation matrix of datasets. The aim is to verify that ahe correlation between the original data and the generated synthetic data should be the same. • Autocorrelation function. The autocorrelation of real and synthetic data should follow the same trend. • Density distribution. The distribution of the real and synthetic data should be similar. • Generation of data. The synthetic data should provide unique temperature profile but similar to the real data.

4.2 Results and Discussion Figures 4 and 5 provides the correlation matrix for the real and synthetic data. In Figs. 6 and 7 a zoom of the two matrixes has been highlighted to better show the correlation values. As second analysis the autocorrelation function has been applied on all variables of both real and generated data. Figure 8 show the result for all temperature rooms and the outdoor temperature. To have a better overview of the performance of the model, a comparison of the distribution of the temperature data (real and synthetic) is performed. Figure 9

Fig. 4 Correlation matrix of real dataset of indoor and outdoor temperatures

Synthetic Data on Buildings

Fig. 5 Correlation matrix of synthetic dataset of indoor and outdoor temperatures

Fig. 6 Zoom of correlation matrix for real dataset of indoor and outdoor temperatures

217

218

D. Antonucci et al.

Fig. 7 Zoom of correlation matrix for synthetic dataset of indoor and outdoor temperatures

shows the result for the temperature of one room. The other results are available in the Appendix. As last assessment the simulation of synthetic data for some days has been tested, as depicted from Figs. 10, 11, 12 and 13. The results reveal the following key findings: 1. When comparing the correlation matrices between the original dataset and the synthetic ones, we observe that they follow the same trend, albeit with different values. This is normal because noise is introduced into the ML model to strike a balance between the similarity of synthetic data to real data and data privacy concerns. The level of deviation between synthetic and real data is thus controlled. 2. Autocorrelation, which measures the dependence between values in a sampled function, exhibits a remarkably similar trend for temperature profiles in both real and synthetic data. This underscores the ability of the synthetic data to maintain the statistical characteristics of the real data. 3. The distribution of temperature values across different ranges is comparable between synthetic and real data, with slight deviations attributed to the inherent noise present in synthetic data. Lastly, a crucial evaluation involves testing the model’s capability in generating synthetic data by comparing it to real data. Graphs (from Figs. 10, 11, 12 and 13) demonstrate how well the model can provide values that resemble real data while being authentic. By adjusting the parameters used to train the ML model, one can generate data that align more closely or diverge further from the real data. The

Synthetic Data on Buildings

219

Fig. 8 Autocorrelation of 12 time-series profile of indoor temperature and outdoor temperature

definition of the boundary for determining whether a synthetic model conforms to data privacy rules is currently under analysis, and users must rely on the obtained results and statistical evaluations to assess compliance with regulations, especially those set forth by the GDPR.

220

D. Antonucci et al.

Fig. 9 Distribution of indoor temperature in room 1

Fig. 10 Real data of indoor temperatures and outdoor temperature for a random day 1 from April to June

Fig. 11 Synthetic data of indoor temperatures and outdoor temperature for a random day 1 from April to June

Synthetic Data on Buildings

221

Fig. 12 Real data of indoor temperatures and outdoor temperature for a random day 2 from April to June

Fig. 13 Synthetic data of indoor temperatures and outdoor temperature for a random day 2 from April to June

Appendix Distribution of real and synthetic data for all inputs used in the model (Figs. 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24).

222

Fig. 14 Distribution of indoor temperature in room 2

Fig. 15 Distribution of indoor temperature in room 3

Fig. 16 Distribution of indoor temperature in room 4

D. Antonucci et al.

Synthetic Data on Buildings

Fig. 17 Distribution of indoor temperature in room 5

Fig. 18 Distribution of indoor temperature in room 6

Fig. 19 Distribution of indoor temperature in room 7

223

224

Fig. 20 Distribution of indoor temperature in room 8

Fig. 21 Distribution of indoor temperature in room 9

Fig. 22 Distribution of indoor temperature in room 10

D. Antonucci et al.

Synthetic Data on Buildings

225

Fig. 23 Distribution of indoor temperature in room 11

Fig. 24 Distribution of outdoor temperature

References 1. IEA, “Buildings” https://www.iea.org/reports/buildings, Paris (2022) 2. L.T.H.W.R.C.S.A.H.C. Nägeli, Availability-adapted approaches for the spatial analysis of building stock energy demand. Energies 15, no. Bottom-Up Urban Building Energy Modelling, p. 18 (2022) 3. P.S.J.B.J.S.N.S.E. Thomas, Construction Disconnected (2018) 4. F.H.G.C.S.C.L.M.B.C.A.J. Jordan, Synthetic Data—what, why and how?. The Royal Society (2022) 5. B. Eastwood, What is synthetic data—and how can it help you competitively?, 23 1 2023. [Online]. Available: https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-andhow-can-it-help-you-competitively 6. G.A.K.N. Flett, An occupant-differentiated, higher-order Markov Chain method for prediction of domestic occupancy. Energy Build. 125, 219–230 (2016) 7. D.P.K.A.M. Welling, An introduction to variational autoencoders. Found. Trends Mach. Learn. 12, 307–392 (2019) 8. R.J.W.M.C. Fan, A novel deep generative modeling-based data augmentation strategy for improving short-term building energy predictions. Build. Simul. 15, (2021)

226

D. Antonucci et al.

9. A.J.C.W.G.F.V.S.Z. Lin, Using GANs for sharing networked time series data: challenges, initial promise, and open questions, in ACM Internet Measurement Conference (IMC ’20), Virtual Event, USA. ACM, New York, NY, USA (2020) 10. J.P.-A.M.M.B.X.D.W.-F.S.O.A.Y.I. Goodfellow, Generative adversarial nets, in Advances in Neural Information Processing Systems 27 (2014) 11. J.Y.A.G.W.A. Dash, A review of generative adversarial networks (GANs) and its applications in a wide variety of disciplines—from medical to remote sensing. Association for Computing Machinery, 1 10 (2021) 12. Z. Wang, T. Hong, Generating realistic electrical load profiles through the Generative Adversarial Network (GAN). Energy Build. 224 (2020, October) 13. J.A.D.Y.X. Kang, A systematic review of building electricity use profile models. Energy Build. (2022) 14. R.K.B. Yilmaz, Synthetic demand data generation for individual electricity consumers : Generative Adversarial Networks (GANs). Energy AI 9 (2022) 15. G.R.E.G. Baash, A conditional generative adversarial for energy use in multiple buildings using scarce data. Energy and AI (2021) 16. X.D.Y.Z.Y. Zhang, Generation of sub-item load profiles for public buildings based on the conditional generative adversarial network and moving average method. Energy Build. 268 (2022) 17. L.S.O.K.A. Pinceti, Synthetic time-series load data via conditional generative adversarial networks, in IEEE Power & Energy Society General Meeting (PESGM) (2021) 18. Y.L.N.L.L. Song, IEEE transactions on smart grid, in ProfileSR-GAN: A GAN based SuperResolution Method for Generating High-Resolution Load Profiles (2022) 19. E. Commision, Energy performance of buildings directive, [Online]. Available: https://energy. ec.europa.eu/topics/energy-efficiency/energy-efficient-buildings/energy-performance-buildi ngs-directive_en 20. M.I.V.D.V.D.S.E. Lambie, Experimental analysis of indoor temperature of residential buildings as an input for building simulation tools. Energy Procedia 132, 123–128 (2017)