English Pages 274 Year 2018
Benjamin Lasshof Operating Costs of Real Estate
Schriftenreihe Bauökonomie
Edited by Prof. Dr. Christian Stoy
Volume 5
Benjamin Lasshof
Operating Costs of Real Estate
Models and Cost Indicators for a Holistic Cost Planning
Dissertation, Universität Stuttgart (D 93), 2017
ISBN 978-3-11-059514-7
e-ISBN (PDF) 978-3-11-059608-3
e-ISBN (EPUB) 978-3-11-059616-8
Library of Congress Control Number: 2018934294
Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.
© 2018 Walter de Gruyter GmbH, Berlin/Boston
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com
Table of Contents
List of Abbreviations | IX
List of Figures | XI
List of Tables | XIII
Abstract | XVII
Zusammenfassung | XIX
1 Introduction | 1
1.1 Background and problem statement | 1
1.2 Scope and objective | 2
1.3 State of the art | 3
1.4 Manuscript structure | 8
2 Methodology | 11
2.1 Quantitative approach | 11
2.2 Definition of key variables | 14
2.2.1 Response variables | 14
2.2.2 Predictor variables | 18
2.3 Statistical methods | 23
2.3.1 Pre-analysis of data sample | 23
2.3.2 Regression analysis | 25
2.3.3 Artificial neural network analysis | 33
2.3.4 Classification tree analysis | 37
2.3.5 Cost indicators | 41
2.3.6 Performance measures | 43
3 Data sample | 47
3.1 Overview | 47
3.2 Presentation of the sample | 48
3.2.1 Response variables | 48
3.2.2 Predictor variables | 51
3.3 Test sample | 53
3.4 Representativeness | 54
4 Analysis results | 57
4.1 Operating costs (CG 300) | 57
4.1.1 Theoretical basis and variables | 57
4.1.2 Model design and specifications | 61
4.1.3 Categorised cost indicators | 66
4.1.4 Performance validation | 67
4.1.5 Summary | 69
4.2 Utilities (CG 310) | 69
4.2.1 Theoretical basis and variables | 69
4.2.2 Model design and specifications | 72
4.2.3 Categorised cost indicators | 77
4.2.4 Performance validation | 78
4.2.5 Summary | 80
4.3 Water (CG 311) | 81
4.3.1 Theoretical basis and variables | 81
4.3.2 Model design and specifications | 82
4.3.3 Categorised cost indicators | 88
4.3.4 Performance validation | 89
4.3.5 Summary | 90
4.4 Heating (CG 312-316) | 91
4.4.1 Theoretical basis and variables | 91
4.4.2 Model design and specifications | 94
4.4.3 Categorised cost indicators | 99
4.4.4 Performance validation | 100
4.4.5 Summary | 102
4.5 Electricity (CG 316) | 102
4.5.1 Theoretical basis and variables | 102
4.5.2 Model design and specifications | 105
4.5.3 Categorised cost indicators | 109
4.5.4 Performance validation | 110
4.5.5 Summary | 112
4.6 Disposal (CG 320) | 112
4.6.1 Theoretical basis and variables | 112
4.6.2 Model design and specifications | 114
4.6.3 Categorised cost indicators | 119
4.6.4 Performance validation | 120
4.6.5 Summary | 122
4.7 Cleaning and care of buildings (CG 330) | 122
4.7.1 Theoretical basis and variables | 122
4.7.2 Model design and specifications | 125
4.7.3 Categorised cost indicators | 129
4.7.4 Performance validation | 131
4.7.5 Summary | 132
4.8 Cleaning and care of outdoor facilities (CG 340) | 133
4.8.1 Theoretical basis and variables | 133
4.8.2 Model design and specifications | 135
4.8.3 Categorised cost indicators | 139
4.8.4 Performance validation | 140
4.8.5 Summary | 142
4.9 Operation, inspection and maintenance (CG 350) | 142
4.9.1 Theoretical basis and variables | 142
4.9.2 Model design and specifications | 146
4.9.3 Categorised cost indicators | 151
4.9.4 Performance validation | 152
4.9.5 Summary | 154
4.10 Inspection and maintenance of building construction (CG 352) | 155
4.10.1 Theoretical basis and variables | 155
4.10.2 Model design and specifications | 157
4.10.3 Categorised cost indicators | 162
4.10.4 Performance validation | 163
4.10.5 Summary | 164
4.11 Inspection and maintenance of technical installations (CG 353) | 165
4.11.1 Theoretical basis and variables | 165
4.11.2 Model design and specifications | 167
4.11.3 Categorised cost indicators | 172
4.11.4 Performance validation | 173
4.11.5 Summary | 175
4.12 Inspection and maintenance of outdoor facilities (CG 354) | 175
4.12.1 Theoretical basis and variables | 175
4.12.2 Model design and specifications | 177
4.12.3 Categorised cost indicators | 182
4.12.4 Performance validation | 183
4.12.5 Summary | 184
4.13 Inspection and maintenance of furniture (CG 355) | 185
4.13.1 Theoretical basis and variables | 185
4.13.2 Model design and specifications | 186
4.13.3 Categorised cost indicators | 191
4.13.4 Performance validation | 192
4.13.5 Summary | 193
4.14 Security and surveillance (CG 360) | 194
4.14.1 Theoretical basis and variables | 194
4.14.2 Model design and specifications | 195
4.14.3 Categorised cost indicators | 200
4.14.4 Performance validation | 201
4.14.5 Summary | 203
4.15 Statutory charges and contributions (CG 370) | 203
4.15.1 Theoretical basis and variables | 203
4.15.2 Model design and specifications | 204
4.15.3 Categorised cost indicators | 209
4.15.4 Performance validation | 210
4.15.5 Summary | 211
5 Results summary | 213
5.1 Identified reference quantities | 213
5.2 Identified predictor variables | 214
5.3 Cost estimation performance | 218
6 Implementation | 223
6.1 Example facility | 223
6.2 Regression model | 225
6.3 Binary classification tree model | 227
6.4 Categorised cost indicators | 229
6.5 Implementation summary | 231
7 Conclusion | 235
References | 239
Appendix: Data sample | 245
List of Abbreviations
ANN
Artificial neural network
APE
Absolute percentage error
approx.
Approximately
BCIS
Building Cost Information Service of the Royal Institution of Chartered Surveyors RICS
BCT
Binary classification tree
BKI
Cost Information Centre of the German Chamber of Architects (German: Baukosteninformationszentrum Deutscher Architektenkammern)
BMUB
German Federal Ministry for the Environment, Nature Conservation, Building and Nuclear Safety (German: Bundesministerium für Umwelt, Naturschutz, Bau und Reaktorsicherheit)
BMVBS
German Federal Ministry of Transport, Building and Urban Development (German: Bundesministerium für Verkehr, Bau und Stadtentwicklung)
BNB
Evaluation System for Sustainable Building (German: Bewertungssystem nachhaltiges Bauen)
cf.
Compare
CG
Cost group
cGIFA
Regularly cleaned gross internal floor area
CN
Continuous numeric variable
Coef.
Coefficient
Coef. SE
Standard error of the coefficient
CRT
Classification and regression tree
CV(RMSE)
Normalised root mean square error
DESTATIS
German Federal Statistical Office (German: Statistisches Bundesamt)
DIN
German Institute for Standardisation
DN
Discrete numeric variable
e.g.
For example
et al.
And others
FMBW
Ministry of Finance of the German Federal State Baden-Wuerttemberg (German: Finanzministerium Baden-Württemberg)
GBV
Gross building volume
GEFA
Gross external floor area
GIFA
Gross internal floor area
hGIFA
Heatable gross internal floor area
LN
Natural logarithm
LR
Linear regression
MAPE
Mean absolute percentage error
MAPE (aggr.)
Aggregated mean absolute percentage error
MLP
Multilayer perceptron
MV
Median value
N
Population size
n
Sample size
nbSAR
Non-built site area
NLR
Non-linear regression
No. var.
Number of variables
NPV
Net present value
Obs.
Observation
OLS
Ordinary least squares
PE
Percentage error
PE (aggr.)
Aggregated percentage error
pSAR
Planted site area
R²
Coefficient of determination
R² (adj.)
Adjusted coefficient of determination
REFQ
Reference quantity
RMSE
Root mean square error
SAR
Site area
SQ
Square
SQR
Square root
SSE
Sum of squared errors
St. Coef.
Standardised coefficient
Transf.
Transformation
UFA
Usable floor area
VAT
Value added tax
VIF
Variance inflation factor
List of Figures
Figure 1.1 Scope of cost planning | 1
Figure 2.1 Interrelationships between variables | 11
Figure 2.2 General description of the research process | 13
Figure 2.3 Classification of building areas and volume | 21
Figure 2.4 Classification of site areas | 21
Figure 2.5 Example of a simple linear regression function | 27
Figure 2.6 Box-Cox plot of optimal transformation of variable | 32
Figure 2.7 Histograms of untransformed and LN-transformed variable | 32
Figure 2.8 Schematic architecture of a MLP artificial neural network ANN | 35
Figure 2.9 Schematic structure of a binary classification tree BCT | 39
Figure 2.10 Partitions of a binary classification tree BCT | 39
Figure 3.1 Locations of the analysed facilities | 47
Figure 3.2 Box plots of operating cost indicators | 50
Figure 4.1 Residuals for non-linear regression model NLR(ind)300 | 64
Figure 4.2 Tree diagram of binary classification tree model BCT(ind)300 | 65
Figure 4.3 APE values for CG 300 estimation methods | 68
Figure 4.4 Residuals for non-linear regression model NLR(ind)310 | 76
Figure 4.5 Tree diagram of binary classification tree model BCT(ind)310 | 77
Figure 4.6 APE values for CG 310 estimation methods | 80
Figure 4.7 Residuals for non-linear regression model NLR(ind)311 | 86
Figure 4.8 Tree diagram of binary classification tree model BCT(ind)311 | 87
Figure 4.9 APE values for CG 311 estimation methods | 90
Figure 4.10 Residuals for non-linear regression model NLR(ind)312-316 | 97
Figure 4.11 Tree diagram of binary classification tree model BCT(ind)312-316 | 98
Figure 4.12 APE values for CG 312-316 estimation methods | 101
Figure 4.13 Residuals for non-linear regression model NLR(ind)316 | 108
Figure 4.14 Tree diagram of binary classification tree model BCT(ind)316 | 109
Figure 4.15 APE values for CG 316 estimation methods | 112
Figure 4.16 Residuals for non-linear regression model NLR(ind)320 | 117
Figure 4.17 Tree diagram of binary classification tree model BCT(ind)320 | 118
Figure 4.18 APE values for CG 320 estimation methods | 121
Figure 4.19 Residuals for non-linear regression model NLR(ind)330 | 128
Figure 4.20 Tree diagram of binary classification tree model BCT(ind)330 | 129
Figure 4.21 APE values for CG 330 estimation methods | 132
Figure 4.22 Residuals for non-linear regression model NLR(ind)340 | 138
Figure 4.23 Tree diagram of binary classification tree model BCT(ind)340 | 139
Figure 4.24 APE values for CG 340 estimation methods | 142
Figure 4.25 Residuals for non-linear regression model NLR(ind)350 | 149
Figure 4.26 Tree diagram of binary classification tree model BCT(ind)350 | 150
Figure 4.27 APE values for CG 350 estimation methods | 154
Figure 4.28 Residuals for non-linear regression model NLR(ind)352 | 160
Figure 4.29 Tree diagram of binary classification tree model BCT(ind)352 | 161
Figure 4.30 APE values for CG 352 estimation methods | 164
Figure 4.31 Residuals for non-linear regression model NLR(ind)353 | 170
Figure 4.32 Tree diagram of binary classification tree model BCT(ind)353 | 171
Figure 4.33 APE values for CG 353 estimation methods | 175
Figure 4.34 Residuals for non-linear regression model NLR(ind)354 | 180
Figure 4.35 Tree diagram of binary classification tree model BCT(ind)354 | 181
Figure 4.36 APE values for CG 354 estimation methods | 184
Figure 4.37 Residuals for non-linear regression model NLR(ind)355 | 189
Figure 4.38 Tree diagram of binary classification tree model BCT(ind)355 | 190
Figure 4.39 APE values for CG 355 estimation methods | 193
Figure 4.40 Residuals for non-linear regression model NLR(ind)360 | 198
Figure 4.41 Tree diagram of binary classification tree model BCT(ind)360 | 199
Figure 4.42 APE values for CG 360 estimation methods | 202
Figure 4.43 Residuals for non-linear regression model NLR(ind)370 | 207
Figure 4.44 Tree diagram of binary classification tree model BCT(ind)370 | 208
Figure 4.45 APE values for CG 370 estimation methods | 211
Figure 5.1 APE and MAPE for 1st, 2nd, and 3rd level cost estimation | 221
Figure 6.1 Overview of selected example facility | 224
Figure 6.2 Implementation of the binary classification tree BCT model | 228
Figure A.1 Box plots of absolute operating costs | 245
Figure A.2 Percentage distribution on 1st level operating costs | 246
Figure A.3 Percentage distribution on 2nd level operating costs | 247
Figure A.4 Box plots of operating cost indicators | 248
List of Tables
Table 1.1 Relevant studies and publications | 5
Table 2.1 Structure of analysed operating costs | 15
Table 2.2 Predictor variable groups | 19
Table 3.1 Operating cost indicators | 49
Table 3.2 Qualitative candidate predictor variable type of facility | 53
Table 3.3 Comparison of the total, test, and training samples | 54
Table 4.1 Absolute cost models of CG 300 | 61
Table 4.2 Cost indicator models of CG 300 | 62
Table 4.3 Coefficients of non-linear regression model NLR(ind)300 | 63
Table 4.4 Specifications of binary classification tree model BCT(ind)300 | 66
Table 4.5 Categorised CG 300 cost indicators | 67
Table 4.6 Comparison of PE and MAPE of CG 300 | 68
Table 4.7 Absolute cost models of CG 310 | 73
Table 4.8 Cost indicator models of CG 310 | 73
Table 4.9 Coefficients of non-linear regression model NLR(ind)310 | 75
Table 4.10 Specifications of binary classification tree model BCT(ind)310 | 76
Table 4.11 Categorised CG 310 cost indicators | 78
Table 4.12 Comparison of PE and MAPE of CG 310 | 79
Table 4.13 Absolute cost models of CG 311 | 83
Table 4.14 Cost indicator models of CG 311 | 83
Table 4.15 Coefficients of non-linear regression model NLR(ind)311 | 85
Table 4.16 Specifications of binary classification tree model BCT(ind)311 | 86
Table 4.17 Categorised CG 311 cost indicators | 88
Table 4.18 Comparison of PE and MAPE of CG 311 | 89
Table 4.19 Absolute cost models of CG 312-316 | 94
Table 4.20 Cost indicator models of CG 312-316 | 95
Table 4.21 Coefficients of non-linear regression model NLR(ind)312-316 | 96
Table 4.22 Specifications of binary classification tree model BCT(ind)312-316 | 99
Table 4.23 Categorised CG 312-316 cost indicators | 100
Table 4.24 Comparison of PE and MAPE of CG 312-316 | 101
Table 4.25 Absolute cost models of CG 316 | 105
Table 4.26 Cost indicator models of CG 316 | 106
Table 4.27 Coefficients of non-linear regression model NLR(ind)316 | 107
Table 4.28 Specifications of binary classification tree model BCT(ind)316 | 108
Table 4.29 Categorised CG 316 cost indicators | 110
Table 4.30 Comparison of PE and MAPE of CG 316 | 111
Table 4.31 Absolute cost models of CG 320 | 115
Table 4.32 Cost indicator models of CG 320 | 115
Table 4.33 Coefficients of non-linear regression model NLR(ind)320 | 116
Table 4.34 Specifications of binary classification tree model BCT(ind)320 | 119
Table 4.35 Categorised CG 320 cost indicators | 120
Table 4.36 Comparison of PE and MAPE of CG 320 | 121
Table 4.37 Absolute cost models of CG 330 | 125
Table 4.38 Cost indicator models of CG 330 | 126
Table 4.39 Coefficients of non-linear regression model NLR(ind)330 | 127
Table 4.40 Specifications of binary classification tree model BCT(ind)330 | 128
Table 4.41 Categorised CG 330 cost indicators | 130
Table 4.42 Comparison of PE and MAPE of CG 330 | 131
Table 4.43 Absolute cost models of CG 340 | 135
Table 4.44 Cost indicator models of CG 340 | 136
Table 4.45 Coefficients of non-linear regression model NLR(ind)340 | 137
Table 4.46 Specifications of binary classification tree model BCT(ind)340 | 138
Table 4.47 Categorised CG 340 cost indicators | 140
Table 4.48 Comparison of PE and MAPE of CG 340 | 141
Table 4.49 Absolute cost models of CG 350 | 146
Table 4.50 Cost indicator models of CG 350 | 147
Table 4.51 Coefficients of non-linear regression model NLR(ind)350 | 148
Table 4.52 Specifications of binary classification tree model BCT(ind)350 | 150
Table 4.53 Categorised CG 350 cost indicators | 152
Table 4.54 Comparison of PE and MAPE of CG 350 | 153
Table 4.55 Absolute cost models of CG 352 | 157
Table 4.56 Cost indicator models of CG 352 | 158
Table 4.57 Coefficients of non-linear regression model NLR(ind)352 | 159
Table 4.58 Specifications of binary classification tree model BCT(ind)352 | 160
Table 4.59 Categorised CG 352 cost indicators | 162
Table 4.60 Comparison of PE and MAPE of CG 352 | 163
Table 4.61 Absolute cost models of CG 353 | 168
Table 4.62 Cost indicator models of CG 353 | 168
Table 4.63 Coefficients of non-linear regression model NLR(ind)353 | 169
Table 4.64 Specifications of binary classification tree model BCT(ind)353 | 172
Table 4.65 Categorised CG 353 cost indicators | 173
Table 4.66 Comparison of PE and MAPE of CG 353 | 174
Table 4.67 Absolute cost models of CG 354 | 178
Table 4.68 Cost indicator models of CG 354 | 178
Table 4.69 Coefficients of non-linear regression model NLR(ind)354 | 179
Table 4.70 Specifications of binary classification tree model BCT(ind)354 | 181
Table 4.71 Categorised CG 354 cost indicators | 182
Table 4.72 Comparison of PE and MAPE of CG 354 | 183
Table 4.73 Absolute cost models of CG 355 | 187
Table 4.74 Cost indicator models of CG 355 | 187
Table 4.75 Coefficients of non-linear regression model NLR(ind)355 | 188
Table 4.76 Specifications of binary classification tree model BCT(ind)355 | 191
Table 4.77 Categorised CG 355 cost indicators | 192
Table 4.78 Comparison of PE and MAPE of CG 355 | 193
Table 4.79 Absolute cost models of CG 360 | 196
Table 4.80 Cost indicator models of CG 360 | 196
Table 4.81 Coefficients of non-linear regression model NLR(ind)360 | 197
Table 4.82 Specifications of binary classification tree model BCT(ind)360 | 200
Table 4.83 Categorised CG 360 cost indicators | 201
Table 4.84 Comparison of PE and MAPE of CG 360 | 202
Table 4.85 Absolute cost models of CG 370 | 205
Table 4.86 Cost indicator models of CG 370 | 205
Table 4.87 Coefficients of non-linear regression model NLR(ind)370 | 206
Table 4.88 Specifications of binary classification tree model BCT(ind)370 | 208
Table 4.89 Categorised CG 370 cost indicators | 209
Table 4.90 Comparison of PE and MAPE of CG 370 | 210
Table 5.1 Identified reference quantities | 213
Table 5.2 Identified predictor variables (Quantities) | 215
Table 5.3 Identified predictor variables (Characteristics) | 216
Table 5.4 Identified predictor variables (Utilisation, location, strategy) | 217
Table 5.5 Comparison of MAPE for developed statistical methods | 219
Table 5.6 Comparison of MAPE for 1st, 2nd, and 3rd level cost estimation | 220
Table 6.1 Implementation of categorised CG 370 cost indicators | 230
Table 6.2 Observed and estimated values for example facility | 231
Table 6.3 PE and MAPE for example facility | 232
Table A.1 Absolute operating costs | 245
Table A.2 Percentage distribution on 1st level operating costs | 246
Table A.3 Percentage distribution on 2nd level operating costs | 247
Table A.4 Operating cost indicators | 248
Table A.5 Candidate reference quantities | 249
Table A.6 Candidate predictor variables: Specific areas | 249
Table A.7 Candidate predictor variables: Compactness | 249
Table A.8 Candidate predictor variables: Function | 250
Table A.9 Candidate predictor variables: Condition | 250
Table A.10 Candidate predictor variables: Standard | 251
Table A.11 Candidate predictor variables: Location | 253
Table A.12 Candidate predictor variables: Utilisation | 254
Table A.13 Candidate predictor variables: Management strategy | 254
Abstract

To assess the economic impact of planning decisions holistically, the entire life cycle of real estate should be considered. Life cycle costing therefore covers not only the initial investment costs but also the consequential costs incurred during the operation and occupancy of real estate. As various studies and publications illustrate, operating costs account for a significant share of the total financial expenditure over the entire life cycle. Accordingly, expenditure on operation offers considerable potential for cost savings. To ensure economic viability, participants in the planning process should determine operating costs as early as possible. Construction, renovation, or modernisation measures, including all available planning alternatives, can then be assessed holistically by means of cost-benefit analyses. In addition, the iterative process of cost control and cost management during the operation of real estate is a fundamental task of cost planning. Consequently, the determination of operating costs and knowledge of the variables that significantly influence them form a crucial foundation for decision making and budgeting.

Although the financial relevance of operating costs and the substantial potential for cost savings for property owners and leaseholders have been illustrated in various studies and publications, cost planning is still largely limited to the determination of construction costs. Previous research on operating costs focuses on specific utilisations (in particular office and education facilities) and mainly employs regression analysis as the statistical method for examining empirical data.
In this context, there remains a lack of research that determines the operating costs of real estate across a wide range of utilisations and evaluates a variety of statistical methods. The current research study is dedicated to the development, validation, and evaluation of statistical models for examining causal interrelationships between operating costs and a variety of potential influential variables. The objective of the study is to provide essential information, models, and adequate cost indicators for an accurate determination of operating costs in the practical cost planning of real estate.

The quantitative approach of the study is based on empirical data from more than 250 operated facilities located in Germany. Besides the operating cost data, the investigation includes a wide range of variables with potential influence on these costs, for example quantities, conditions, standards, utilisations, locations, and management strategies. An extensive review of relevant research studies and publications serves as the basis for the definition of key variables and the selection of the statistical methods employed. The analysed cost data are classified into cost groups according to the German standard DIN 18960:2008-02. Thereafter, regression, artificial neural network, and classification tree models are developed and validated for 15 cost groups. The outcome includes the determination of adequate reference quantities, the identification of significant influential variables, and the introduction of categorised cost indicators. The developed statistical models are also evaluated, and the most accurate operating cost estimation methods are determined. A detailed implementation example demonstrates the practical application of the statistical models step by step, using a randomly selected and independent observation. Finally, the limitations and restrictions of the investigation are critically discussed and addressed in an outlook on further research.

The main findings reveal a significant interrelationship between the utilisation of the facilities and their operating costs. For nearly all cost groups, the type of facility, the variable that describes the utilisation, has the largest effect on the variance of the costs. Further significant influence is revealed in particular for the variable groups of quantities, conditions, and standards. On the most detailed estimation level, a combination of the best-performing statistical models provides an operating cost estimation with an error rate of under 13 %. For 8 of the 15 cost groups, regression models with variables transformed to correct for non-normality of the data distribution offer the highest accuracy of cost estimation. Classification tree models show the best estimation performance for 4 of the analysed cost groups, including the heating energy costs (about 23 % of the operating costs) and the cleaning costs (about 35 % of the operating costs). The results serve as a basis for the assessment of planning alternatives, decision making, and budgeting and are directed towards architects, planners, and real estate management.
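The two quantitative ideas named in the abstract, a regression on LN-transformed variables and an error rate expressed as mean absolute percentage error (MAPE), can be illustrated with a minimal sketch. It is not one of the models developed in the study: the single predictor, the function names `mape`, `fit_log_log`, and `predict`, and all data values are hypothetical and chosen only for illustration.

```python
import math


def mape(observed, estimated):
    """Mean absolute percentage error in percent."""
    if not observed or len(observed) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - o) / abs(o) for o, e in zip(observed, estimated))
    return 100.0 * total / len(observed)


def fit_log_log(x, y):
    """Ordinary least squares fit of ln(y) = a + b * ln(x); returns (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Back-transform the fitted model to the original cost scale."""
    return [math.exp(a + b * math.log(v)) for v in x]


# Hypothetical sample: floor area in m² and annual operating costs in EUR.
area = [500.0, 1200.0, 3000.0, 8000.0]
cost = [21000.0, 46000.0, 105000.0, 250000.0]

a, b = fit_log_log(area, cost)
estimated = predict(a, b, area)
print(f"exponent b = {b:.3f}, MAPE = {mape(cost, estimated):.1f} %")
```

Fitting on the logarithmic scale and back-transforming with the exponential is one common way to handle skewed cost data; the error is evaluated here on the same observations used for fitting, whereas a validation on an independent test sample would give a more realistic figure.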
Zusammenfassung (Summary)

To assess the economic consequences of decisions in a planning process holistically, the entire life cycle of a property should be considered. For this reason, a life cycle cost analysis does not cover the initial investment costs alone but equally takes into account the consequential costs incurred during the operation and use of real estate. As various studies have already established, operating costs account for a substantial share of the total financial expenditure over the life of a property. Operating costs therefore offer considerable savings potential, and it is a key task for everyone involved in planning to consider these costs as early as possible. New construction, conversion, or modernisation measures can thus be assessed holistically, together with all available planning alternatives, by means of appropriate cost-benefit analyses. Likewise, the costs of real estate during the operating phase can be controlled and managed in an iterative process as an indispensable part of cost planning. For this reason, determining the operating costs and knowing the significant factors that influence them are essential for decision making and budgeting.

Although the relevance of operating costs and the considerable savings potential for owners and tenants of real estate have been described in numerous studies, cost planning is largely confined to construction costs. Previous research on operating costs concentrates on individual types of building use (in particular office and education facilities) and mainly applies regression analysis as the statistical method for evaluating empirical data.

In this context, there is a lack of research on determining the costs of operation across different building uses on the basis of several statistical methods. The present study is devoted to the development and evaluation of statistical models for examining cause-and-effect relationships between operating costs and possible influencing factors. The objective of the study is to provide fundamental information, models, and suitable cost indicators for an accurate determination of operating costs for practical application in the cost planning of real estate.

The quantitative approach of the study is based on empirical data from more than 250 properties in operation. Besides the operating costs, the study includes a large number of variables that may influence these costs, for example quantities, descriptions of condition, standards, information on building use, characteristics of the surroundings, and management strategies. A comprehensive literature review serves as the basis for the selection of the variables and of the statistical methods used. The cost data are evaluated after classification into the corresponding cost groups according to the standard DIN 18960:2008-02. Subsequently, regression models, artificial neural networks, and classification tree models are developed and validated for 15 cost groups. The results presented are suitable reference quantities for the operating costs, significant influencing factors, and categorised cost indicators. The developed statistical models are also evaluated with regard to their accuracy. The practical application of the developed models is demonstrated in a detailed example based on a randomly selected and independent property. Finally, the restrictions and limits of application of the investigation are critically discussed, and an outlook on possible future research is given.

The results show a significant dependence of the operating costs on the type of use of the properties. The statistical analysis shows that, in nearly all cost groups, building use accounts for the largest share of the variance of the respective costs. Further significant influencing factors are found in particular in the variable groups of quantities, descriptions of condition, and standards. On the most detailed level of cost estimation, a combination of the most accurate statistical models developed allows an error rate of under 13 % in determining the operating costs. Regression models with variables transformed to compensate for the absence of a normal distribution offer the highest accuracy of cost estimation for 8 of the 15 cost groups. Classification tree models allow the most accurate cost estimation for 4 of the analysed cost groups, for example the heating energy costs (about 23 % of the operating costs) and the cleaning costs (about 35 % of the operating costs). The results serve as a basis for assessing the economic viability of planning alternatives, for decision making, and for budgeting. The study is directed towards architects, specialist planners, and real estate management.
1 Introduction

1.1 Background and problem statement

The term life cycle costing, denoting the determination of the total costs incurred during the service life of a system, was introduced in 1965 by the US Logistics Management Institute for the economic assessment of military systems, as described for example by Dhillon (2009). The first guideline for the assessment of construction projects was published by the US Department of Commerce in 1978 (Ruegg et al.). Written in the context of the first global energy crisis, the guideline describes how to determine and evaluate the life cycle costs of alternative energy saving measures for public building projects. In general, life cycle costing in the field of construction covers not only the initial investment costs but also the consequential costs. As illustrated in various studies and publications, the costs incurred during the operation and occupancy of real estate account for a significant share of the total financial expenditure over the entire life cycle. Consequently, expenditure on the operation of real estate offers property owners and leaseholders considerable potential for cost savings. Nevertheless, cost planning is still largely limited to the determination of the initial construction costs. Determining operating costs at an early stage of the planning process is therefore an essential task in ensuring the economic viability of construction projects. As a substantial element of cost planning, alternative construction, renovation, or modernisation measures can be assessed holistically by cost-benefit analyses that take operating costs into account.
Likewise, the iterative processes of cost control and cost management can be conducted on the basis of a holistic determination of costs. The general scope of cost planning according to the standard DIN 18960:2008-02 is illustrated in Figure 1.1.

Cost planning

Cost determination: Estimation of future costs and assessment of incurred costs on various levels:
- First projection
- Preliminary estimate
- Approximate estimate
- Final estimate
- Final statement

Cost control: Comparison of current costs with earlier cost determinations and requirement specifications, e.g. comparison with benchmarks.

Cost management: Intervention in the process of
- planning,
- execution,
- occupancy, or
- operation
to meet the requirement specifications and for an optimisation if necessary.

Figure 1.1. Scope of cost planning according to DIN 18960:2008-02
Cost planning comprises the iterative processes of cost determination, cost control, and cost management. The process of cost determination can be distinguished into five levels, whereby the first projection, the preliminary estimation, the approximate estimation, and the final estimation intend to predict future expenditures and the final statement intends to assess the costs incurred during the operation phase. The various levels of cost determination can be carried out for construction, renovation, or modernisation measures. In contrast, the process of cost control is applied in order to compare incurred costs to earlier cost determinations or requirement specifications. The intervention in the processes of planning, execution, occupancy, and operation is accomplished by the cost management as outlined in the standard DIN 18960:2008-02. The scope of cost management includes optimisations in the operation or in the planning that are carried out in order to meet the initial requirement specifications. Throughout all these iterative processes of cost planning, the involved architects, planners, and the real estate management face the challenge of providing an accurate determination of operating costs. Likewise, there is a lack of knowledge on significant influential factors for a holistic assessment of measures. Against this background, the operating cost planning is currently subject to the following problems:
– What are significant influential factors on operating costs and how do they affect these costs?
– What are appropriate methods for an accurate determination of operating costs for a holistic cost planning?
1.2 Scope and objective
In the illustrated context of cost planning, the current research study is dedicated to the provision of relevant essentials for the determination, control, and management of expenditures for the operation of real estate. Therefore, a quantitative approach is employed in order to analyse empirical data of operated facilities and explain operating costs systematically. The current study intends to reveal and describe the causal interrelationships between operating costs and a variety of variables with potential influence on these costs. Furthermore, the study aims to introduce and evaluate appropriate tools for an accurate determination of operating costs. The main objective is the provision of an essential basis of information for the practical application in the field of life cycle cost planning of real estate. The results aim to provide the foundation for a holistic assessment of planning alternatives of construction, renovation, or modernisation measures and have the objective to support decision making and budgeting under consideration of operating costs. The research study is directed towards architects, planners, and the real estate management. An extensive review of relevant research studies and publications in the fields of operating cost estimation and cost modelling provides the foundation for the definition of relevant key variables and for the selection of multiple statistical methods
for the analysis. In the course of the statistical analysis, significant interrelationships between the costs and adequate reference quantities are identified and respective operating cost indicators are introduced. Based on the cost indicators, further statistical models are developed and presented in detail. The statistical models intend to identify significant influential variables and give an accurate estimation of operating costs. The most significant influential variables are employed in order to present categorised cost indicators. Finally, the most accurate operating cost estimation method is determined by a comparison and evaluation of the developed statistical models and categorised cost indicators. In conclusion, the main objectives of the current research on operating cost planning can be summarised as follows: – Identification of adequate reference quantities – Identification of significant influential variables – Introduction of categorised cost indicators – Introduction, comparison, and evaluation of cost estimation methods
1.3 State of the art
The calculation and determination of costs in the field of construction has a relatively long tradition as summarised in an extensive review of literature on cost modelling published by Newton (1991). The first approaches of cost modelling are based on the techniques of regression analysis and simulation and were introduced in the 1970s as described by Ashworth and Perera (2015). Based on the availability of computers as a common tool in the construction industry, a more reliable and accurate estimation of construction costs was expected by the employment of the new techniques. Simultaneously, the evaluation of different designs and planning alternatives under consideration of the initial construction costs has become an important matter in construction economics. The first regression models as tools for the estimation of construction costs are presented in the studies of Kouskoulas and Koehn (1974), McCaffer (1975), and Bowen and Edwards (1985). With the life cycle costing approach in the 1980s and in particular with the whole life costing approach at the beginning of the 2000s, the consideration of costs incurred during the operation and occupancy of real estate has become significantly more important. With the shift away from the focus on the initial construction costs, the whole life cycle of real estate is considered for the calculation and determination of costs in a long-term perspective (cf. Ashworth and Perera, 2015). A selection of relevant studies and publications in the context of operating cost determination is presented in Table 1.1. The list of studies and publications includes a description of the type of conducted investigation, the employed data sample, the scope of analysed costs, and a summary of the main results. The focus of the review of literature lies both on literature providing tools for an estimation of operating costs and on literature analysing the causal interrelationships between the costs and influential factor groups. The first investigation addressing the consequential costs of real estate in Germany was published by Siegel and Wonneberg (1977). The empirical analysis employs the data of 110 office facilities and presents an extensive documentation of management and operating costs under consideration of building characteristics as an influential factor group. A further investigation of the data basis of Siegel and Wonneberg is conducted by Kalusche (1991) in an analysis of capital, object management, operating, and maintenance costs. In 1998, the study of Al-Hajj and Horner provides an approach to model operating and maintenance costs by application of the Pareto principle in order to identify the most significant components of these cost types. The first approach to employ a relatively large data basis of approximately 14,000 residential facilities in the United States in a statistical analysis is presented by the Graduate School of Design of Harvard University (GSD, 2003). Focussing on internal factors, building characteristics, and the location as influential factors, the variance of management and operating costs is analysed utilising regression models. Based on 116 office facilities located in Switzerland, Stoy (2005) provides regression models and cost indicators for capital, management, and maintenance costs. The statistical analysis determines strategies and building characteristics as the main factors influencing these cost types. Further statistical models for the identification of significant variables with influence on the energy consumption and operating costs of school facilities are introduced by Beusker (2012). The research study is based on 130 observations and identifies the utilisation, functional and technical characteristics, and strategies as significant factor groups with impact on costs.
Likewise, Hawlik (2015) conducts a statistical analysis of 125 child day care facilities and determines the utilisation, building characteristics, and technical characteristics as relevant factor groups. Besides the approaches to analyse the operating costs of real estate statistically, various studies offer cost indicators based on empirical data for the estimation of consequential costs. For example, the Building Cost Information Service of the Royal Institution of Chartered Surveyors provides a compilation of occupancy cost indicators from various sources for the British market since 1999 (BCIS, 2007a,b). The scope of costs includes the management, utilities, cleaning, and maintenance with a categorisation according to multiple utilisations, locations, building characteristics, and technical characteristics. On an international level, further data is published by the International Facility Management Association including categorised management, operating, and maintenance cost indicators based on the data of more than 1,400 facilities (IFMA, 2009). The cost indicators are differentiated by the utilisation, characteristics, the location, and the management strategy. For the Swiss market, pom+ (2016) publishes categorised management, operating, and maintenance cost indicators employing a data basis of currently more than 15,000 facilities. The utilisation, building characteristics, and the location are employed for a categorisation of the indicators presented in the publication.
Table 1.1. Relevant studies and publications

Source | Type | Data basis | Analysed costs | Results
Al-Hajj and Horner (1998) | Simple modelling: Pareto principle | 20 university facilities (residential, teaching, laboratory buildings) | Operating and maintenance costs | Most significant cost components: Cleaning, gas, electricity
Bahr (2008) | Calculation method | 17 facilities (school and office facilities) | Maintenance costs | Influential factor groups: Building characteristics (age, standard, size, layout), maintenance strategy, utilisation, and location
BCIS (2007a) and BCIS (2007b) | Categorised cost indicators | Facilities of different utilisations, data of various sources | Management, utility, cleaning, and maintenance costs | Categorised cost indicators: Utilisation, location, building characteristics, technical characteristics
Beusker (2012) | Statistical analysis: Regression and categorised indicators | 130 school facilities | Operating and repair costs, energy consumption | Reference quantities and influential factor groups: Utilisation, functional characteristics, technical characteristics, strategy and operation
BMVBS (2013) and BMUB (2015) | Calculation method for life cycle costs (net present value NPV) | | Capital, operating, and repair costs | Consideration of the following parameters: Water, heating energy, electricity consumption, cleaning level, and service life of components
FMBW (2004) | Categorised cost and consumption indicators | approx. 2,000 municipal facilities | Operating and maintenance costs, energy consumption | Categorised cost indicators: Utilisation
GSD (2003) | Statistical analysis: Regression | approx. 14,000 residential facilities | Management and operating costs | Factor groups: Internal factors (occupants), building characteristics, location
Hawlik (2015) | Statistical analysis: Regression and cost indicators | 125 child day care facilities | Operating, repair, and personnel costs | Factor groups: Utilisation, building characteristics, technical characteristics
IFMA (2009) | Categorised area and cost indicators | 1,422 facilities of different utilisations | Management, operating, and maintenance costs | Categorised cost indicators: Utilisation, building characteristics, location, strategy
JLL (2016) | Categorised cost indicators | 337 office facilities | Management, operating, and maintenance costs | Categorised cost indicators: Building characteristics (size, standard of the construction and technical installations), location
Kalusche (1991) | Empirical analysis | Siegel and Wonneberg (1977) | Capital, object management, operating, and maintenance costs | Influential factor group: Building characteristics
Kalusche (2008) | Remarks on the DIN 18960:2008-02 | | Capital, object management, operating, and repair costs | Relevant factor groups: Characteristics, utilisation, internal factors (owners, operators, occupants), external factors (social, legal, economic, environmental)
Naber (2002) | Calculation method based on cost indicators | Office and school facilities | Capital, management, operating, and maintenance costs | Significant cost components: Maintenance, cleaning, heating, electricity
Pelzeter (2006) | Case study: theoretical calculation of life cycle costs | 2 facilities | Capital, operating, and maintenance costs | Variation of factor groups: Location, layout and characteristics, environment
pom+ (2016) | Categorised area and cost indicators | 15,200 facilities of different utilisations | Management, operating, and maintenance costs | Categorised cost indicators: Utilisation, building characteristics, location
Riegel (2004) | Calculation method and software tool | Validation based on 6 office buildings | Management, operating, and repair costs | Influential factor groups: Utilisation, building characteristics, technical characteristics, management strategies
Rotermund (2016) | Cost, consumption, and area indicators | 2,925 facilities of different utilisations | Management, operating, and repair costs | Categorised cost indicators: Utilisation
Siegel and Wonneberg (1977) | Empirical analysis | 110 office facilities | Management and operating costs | Influential factor group: Building characteristics
Stoy (2005) | Statistical analysis: Regression and cost indicators | 116 office facilities | Capital, management, operating, and maintenance costs | Influential factor groups: Strategies (net asset value, amortisation, maintenance), building characteristics (standard and condition), utilisation
Stoy et al. (2015) and Stoy et al. (2017) | Categorised cost indicators | 250 facilities of different utilisations | Operating and repair costs | Categorised cost indicators: Utilisation
On the German market, various facility management companies aim to provide operating cost data since the 1990s. For example, an annual report published by Jones Lang LaSalle offers data of service charges of office facilities for the purpose of benchmarking with a categorisation according to various factors since 1996. The current report (JLL, 2016) is based on the data of 337 operated office buildings and categorises cost indicators according to various characteristics and the location. Likewise, Rotermund (2016) provides management, operating, and repair cost indicators based on the data of about 3,000 facilities as benchmarks. Specific consumption and cost indicators directed towards the real estate management of public facilities are published by the FMBW (2004) with a categorisation by the type of facility and based on a data sample of approximately 2,000 observations. Furthermore, the Cost Information Centre of the German Chamber of Architects BKI offers operating and repair costs of operated real estate since 2010 (e.g. Stoy et al., 2015, 2017). Providing both detailed documentations of individual facilities and statistical cost indicators, the publication series is directed towards architects and planners and aims to provide the basis for an accurate cost estimation. Further studies and publications intend to provide calculation methods and are partly based on empirical data. The research study of Naber (2002) introduces calculation methods for capital, management, operating, and maintenance costs and is based on the utilisation of respective cost indicators. A software tool including calculation methods for management, operating, and repair costs is provided by Riegel (2004). The calculation methods are validated on the basis of 6 office buildings and the utilisation, building characteristics, technical characteristics, and management strategies
are determined as relevant influential factor groups. A detailed calculation method directed towards the determination of maintenance costs is presented by Bahr (2008). The study is based on the data of 17 operated school and office facilities and identifies building characteristics, maintenance strategies, the utilisation, and the location as factor groups with influence on the maintenance costs. Furthermore, the Evaluation System for Sustainable Building BNB by the BMVBS (2013) and the BMUB (2015) introduces a calculation method for life cycle costs of teaching facilities, office facilities, and laboratories under consideration of operating costs. Therefore, water, heating energy, electricity, service levels, and the service life of components are included as relevant parameters for the calculation. The presented summary of relevant studies and publications gives a detailed overview of the current state of the art in the field of research. The review reveals various limitations that are addressed in the current study. In general, only minor importance is given to the investigation of adequate reference quantities for the cost planning of operating costs. Therefore, one of the main objectives of the current study is the determination of adequate reference quantities in order to provide foundations for cost planning and accurate cost estimations. Furthermore, available studies and publications are to a large extent restricted to an analysis of individual utilisations and types of facilities (in particular office and education facilities). Against this background, a variety of facility types with a wide range of utilisations are employed as basis for the current investigation. Likewise, the focus of previous empirical research lies primarily on the utilisation of regression analysis as statistical method.
In contrast, the current study introduces and evaluates multiple statistical methods for the identification of interrelationships between variables and for the estimation of operating costs. A further review of relevant studies and publications is presented as a basis for the definition of the operating costs as analysed variables in Section 2.2.1. Likewise, a wide range of potential influential variables is determined on the basis of an extensive review of relevant literature as presented in Section 2.2.2. Furthermore, the selection and introduction of the employed statistical methods in Section 2.3 contains a presentation of relevant research studies and publications.
1.4 Manuscript structure
Chapter 2 gives an overview of the research methodology. This includes a general description of the approach to conduct a statistical analysis based on empirical data. A definition of key variables of the investigation includes the description of relevant studies describing variables in the field of research. Moreover, the statistical methods and the procedure to develop statistical models are presented in detail. Finally, performance measures for the evaluation and comparison of the performance of the developed models are introduced.
Chapter 3 introduces the data sample employed to conduct the current research. This includes information on the collection of the data and an evaluation of their quality. Furthermore, the process of data preparation is presented in detail. In order to draw unbiased conclusions about the performance of the developed models, the sampling of the underlying data into a training and test sample is outlined. Likewise, the representativeness and restrictions of the results of the investigation are discussed under consideration of the scope of the sample.

Chapter 4 describes the results of the conducted data analysis. Based on an illustration of the theoretical basis and the employed variables, the development of the statistical models is presented in a summary. The performance of the models is discussed and the best models are described in detail. Categorised cost indicators are introduced as an alternative estimation method. The performance of the statistical methods is validated and compared by the independent test sample. The results of the respective analyses are outlined in a short summary.

Chapter 5 summarises the main results of the investigation. This includes a description of the revealed causal interrelationships between the available variables. Furthermore, the introduced statistical models and their estimation performance are compared and outlined employing the total data sample.

Chapter 6 provides a step-by-step presentation of the implementation of the developed statistical methods with the best performance. Based on the example of a randomly selected and independent observation, the practical applicability of the statistical methods is demonstrated in detail. This includes a discussion of the respective amount of information required for a practical application.

Chapter 7 focuses on the limitations of the conducted research in a concluding statement.
This contains a discussion of the restrictions of the results under consideration of the scope of the underlying data and the employed statistical methods. Furthermore, an outlook addresses problems and objectives that may be considered in future research in the current research field.
2 Methodology
2.1 Quantitative approach
A quantitative approach is based on empirical data and intends to explain an observed phenomenon systematically. Empirical data consists of measurements on suitable scales and the collection of the data is built upon the findings of previous research as outlined by Fellows and Liu (2015). An analysis of the empirical data can be conducted employing statistical, mathematical, or computational techniques. In order to avoid interferences of the analyses and findings, a precise definition of the subject matter is required (Fellows and Liu, 2015). The current study has the objective to identify significant interrelationships between the operating costs and factors that influence these costs. Furthermore, the study intends to introduce and evaluate tools for an accurate estimation of operating costs. Therefore, various statistical models are developed and presented. Fellows and Liu (2015) describe models as a tool to represent the reality as closely as possible. Models can describe a design, actual object, process, or system.
Figure 2.1. Interrelationships between variables (adapted from Tabachnick and Fidell, 2009)
[Diagram of overlapping circles: one response variable and three predictor variables]
In a statistical model, the response variable describes the quantity that varies under the influence of one or multiple predictor variables. In Figure 2.1, the interrelationships between one response variable and three predictor variables are illustrated exemplarily. The overlapping areas of the circles indicate the share of variance of the response variable that can be explained by the respective predictor variable. The statistical models aim to cover as much variance of the response variable as possible and provide consequently an accurate representation of the reality. The current Chapter describes the scope of the operating costs as the analysed response variables. Besides, a wide range of candidate predictor variables is presented including a review of previous research. Furthermore, the methods employed for statistical modelling are described in detail.
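The share of variance described above can be illustrated with a minimal sketch. It assumes a simple linear regression with a single predictor and uses the coefficient of determination as the measure of explained variance; the floor areas and cost values are hypothetical and not taken from the study's data sample.

```python
def r_squared(x, y):
    """Share of the variance of y explained by a simple linear regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept for one predictor
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations: floor area (m2) as predictor,
# annual operating costs (EUR) as response variable
floor_area = [1000.0, 2000.0, 3000.0, 4000.0]
operating_costs = [21000.0, 39000.0, 61000.0, 79000.0]

explained_share = r_squared(floor_area, operating_costs)
print(f"Explained share of variance: {explained_share:.4f}")
```

A value close to 1 corresponds to circles that overlap almost completely in Figure 2.1, whereas a value close to 0 corresponds to a predictor that explains hardly any variance of the response variable.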
A general illustration of the dynamic research process employed in the current study is presented in Figure 2.2. Based on the definitions of the aim and objectives as described in the previous Chapter, an initial preparation of the data sample is conducted. A review of relevant research studies in the fields of operating cost estimation and modelling of costs in construction serves as the basis for the definition of the key variables to be analysed. Furthermore, suitable statistical methods are selected by the literature review and by an initial analysis of the underlying data sample. After the statistical pre-analysis, regression analysis is used to develop models for the estimation of absolute operating costs. Primarily, the developed absolute cost models intend to identify significant interrelationships between the costs and candidate reference quantities. After a validation of the developed regression models, the identified most adequate reference quantities are employed to establish respective operating cost indicators. In the main analysis, the introduced cost indicators serve as response variables for the development of various statistical models. Based on the underlying data and previous studies, in particular in the field of construction cost estimation, the statistical methods for the analysis and model development are selected to be regressions, artificial neural networks, and classification trees. In contrast to the absolute cost models, these models are developed to give an accurate estimation of operating costs and to reduce the extent of estimation error to a minimum. A further aim of the cost indicator models, and in particular of the regression and classification tree models, is the identification of relevant predictor variables with significant impact on the analysed costs.
The model development is conducted in an iterative process including a validation of relevant parameters, a measure of fit to the underlying data, and a comparison of the results between individual models to ensure their correct specification. Based on the results of the statistical analysis of both absolute costs and cost indicators, the identified reference quantities and the significant predictor variables are employed to provide categorised operating cost indicators. The categorisation of the indicators is conducted by the variables with the largest effect on the analysed costs, as determined by the regression and classification tree analysis. The respective median values are introduced to provide a further method for cost estimation. In the final procedure of the main analysis, all developed models and the median values of categorised cost indicators are evaluated in regard to their estimation performance. Therefore, an independent test sample is employed for unbiased conclusions to be drawn on the estimation accuracy. Finally, the identified reference quantities, the significant predictor variables, and the most accurate cost estimation methods are summarised and the practical application of the developed statistical models is presented in an implementation example.
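The estimation by median values of categorised cost indicators and its evaluation on an independent test sample can be sketched as follows. This is a minimal illustration under stated assumptions: the categories, the reference quantity (floor area), and all values are hypothetical, and the mean absolute percentage error is used here only as an example of a performance measure, not necessarily the one applied in the study.

```python
from statistics import median


def categorised_medians(observations):
    """Median cost indicator (annual cost per unit of reference quantity) per category."""
    groups = {}
    for category, annual_cost, reference_quantity in observations:
        groups.setdefault(category, []).append(annual_cost / reference_quantity)
    return {category: median(values) for category, values in groups.items()}


def mean_absolute_percentage_error(medians, test_observations):
    """Compare observed costs with estimates from the categorised median indicators."""
    errors = []
    for category, annual_cost, reference_quantity in test_observations:
        estimate = medians[category] * reference_quantity
        errors.append(abs(estimate - annual_cost) / annual_cost)
    return 100.0 * sum(errors) / len(errors)


# Hypothetical training sample: (category, annual cost in EUR, floor area in m2)
training_sample = [
    ("office", 30000.0, 1000.0),
    ("office", 44000.0, 2000.0),
    ("office", 78000.0, 3000.0),
    ("school", 18000.0, 1000.0),
    ("school", 40000.0, 2000.0),
    ("school", 66000.0, 3000.0),
]
# Hypothetical independent test sample, not used to derive the medians
test_sample = [
    ("office", 50000.0, 2000.0),
    ("school", 20000.0, 1000.0),
]

medians = categorised_medians(training_sample)
mape = mean_absolute_percentage_error(medians, test_sample)
print(medians, mape)
```

Keeping the test observations out of the median calculation mirrors the purpose of the independent test sample: the error is measured on facilities the indicators have not seen.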
Figure 2.2. General description of the research process
[Flowchart with the following stages:
- Aim and objectives, literature review, initial data preparation
- Definition of key variables, selection of statistical methods
- Pre-analysis of data sample
- Development of absolute cost models (input: absolute costs / candidate predictors): linear regression LR(abs) model, model validation
- Development of cost indicator models (input: cost indicators / candidate predictors): linear regression LR(ind) model, non-linear regression NLR(ind) model, artificial neural network ANN(ind) model, binary classification tree BCT(ind) model, each followed by model validation
- Categorised cost indicators: reference quantities / significant predictors, median values of categorised cost indicators
- Measurement of performance: independent test sample, comparison of observed and estimated values, performance evaluation of developed models
- Results summary: adequate reference quantities, significant predictor variables, indicators for cost estimation, most accurate estimation method
- Implementation example
- Outlook on further research]
2.2 Definition of key variables
2.2.1 Response variables
The objective of the current study is the identification of significant interrelationships between predictor variables and the operating costs. A further aim is the development of statistical models for an accurate estimation of these costs. The definition of response variables as the variables under investigation is an essential task in order to conduct this quantitative approach. The response variables describe the quantities that vary under the influence of one or multiple predictor variables. Considering that, the operating costs as defined in the standard “Occupancy costs of buildings” (DIN 18960:2008-02) are determined as the response variables of the current study. The standard is widely used in Germany and includes a classification structure for the planning with occupancy costs and principles for their determination. The planning with occupancy costs comprises the estimation, the control, and the management of costs. The standard was first introduced in 1976 and further developed since then in the version of 1999 and the current version of 2008. The standard DIN 18960:2008-02 generally describes the scope of occupancy costs as “all regularly and irregularly recurring costs” of “building structures and their sites” under consideration of the period of time from “the begin of their occupancy until their demolition (operating life)”. Occupancy costs include “the period of delivery and optimisation, the period of operation, the period of modernisation, and the period of restitution until the beginning of the period of demolition”. According to the definition, occupancy costs do not include “the costs for the construction, for the conversion, and for the removal of buildings”. Likewise, the scope of occupancy costs does not include the “enterprise and production related costs for staff and equipment” if they can be determined separately.
The structure of the standard DIN 18960:2008-02 classifies occupancy costs into cost groups on three levels. The first level comprises the CG 100 of capital costs, the CG 200 of management costs, the CG 300 of operating costs, and the CG 400 of repair costs. The second and third level cost groups cover a more detailed sub-classification of the mentioned cost groups. The scope of cost data under investigation in the current study includes the first level cost group CG 300 of operating costs and the respective sub-classified cost groups. The cost data were obtained from multiple project partners as cash flows for accounting periods of between one and five years as described in Section 3.2.1. A mean value of the data is determined if the data contain cash flows for more than one year. The annual data are adjusted to first quarter 2016 prices and include the current value added tax rate VAT. The data are classified into 15 cost groups on all three levels of the cost structure of DIN 18960:2008-02. The structure of the current investigation is presented in Table 2.1. The cost groups are analysed in the respective Sections of Chapter 4 as outlined in the Table. On the most detailed third level of the cost structure, analyses are conducted for seven sub-groups of CG 310 (utility costs)
and CG 350 (operation, inspection and maintenance costs). On the second level, seven cost groups are analysed. The first level cost group CG 300 of operating costs is aggregated from the respective second level cost groups and is likewise analysed as illustrated in Table 2.1.

Table 2.1. Structure of analysed operating costs

Cost group (DIN 18960:2008-02) | Statistical analysis
CG 300 Operating costs | Section 4.1
  CG 310 Utilities | Section 4.2
    CG 311 Water | Section 4.3
    CG 312 Oil | CG 312-316: Section 4.4
    CG 313 Gas | CG 312-316: Section 4.4
    CG 314 Solid fuels | CG 312-316: Section 4.4
    CG 315 District heating | CG 312-316: Section 4.4
    CG 316 Electricity | CG 316: Section 4.5
  CG 320 Disposal | Section 4.6
    CG 321 Sewage |
    CG 322 Waste |
  CG 330 Cleaning and care of buildings | Section 4.7
    CG 331 Regular cleaning |
    CG 332 Glass cleaning |
    CG 333 Facade cleaning |
    CG 334 Cleaning of technical installations |
  CG 340 Cleaning and care of outdoor facilities | Section 4.8
    CG 341 Paved areas |
    CG 342 Planted and green areas |
    CG 343 Water areas |
    CG 344 Outdoor constructions |
    CG 345 Outdoor technical installations |
    CG 346 Outdoor equipment |
  CG 350 Operation, inspection and maintenance | Section 4.9
    CG 352 Inspection and maintenance of building construction | Section 4.10
    CG 353 Inspection and maintenance of technical installations | Section 4.11
    CG 354 Inspection and maintenance of outdoor facilities | Section 4.12
    CG 355 Inspection and maintenance of furniture and equipment | Section 4.13
  CG 360 Security and surveillance | Section 4.14
    CG 361 Supervision according to public law regulations |
    CG 362 Property security and surveillance |
  CG 370 Statutory charges and contributions | Section 4.15
    CG 371 Taxes |
    CG 372 Insurance contributions |
The current study focuses on the scope of costs directly related to the operation of facilities. The capital costs depend on the construction costs (DIN 276-1:2008-12) and are not related to the operation of facilities. Likewise, the management costs depend on the organisational structure and services of the real estate management and are not related to the operation period. The CG 100 of capital costs, the CG 200 of management costs, and the CG 400 of repair costs are therefore not included in the scope of costs under investigation. The employed cost classification structure of the German standard DIN 18960:2008-02 is adjusted to local practices and is directly related to other national standards such as DIN 276-1:2008-12. A comparison or implementation of the results of the current study on an international level therefore depends on the definition and classification of the investigated cost types. In the following, an overview of international standards and their cost classification structures is presented and compared to the definition of operating costs according to DIN 18960:2008-02.
DIN 31051:2012-09: Fundamentals of maintenance (Germany)
Within a presentation of basic definitions, the German standard differentiates the term maintenance into preventive maintenance, inspection, repair, and improvement. The respective terminology is translated according to the standard DIN EN 13306:2010-12. The operating costs under investigation in the current study include inspection and maintenance costs (CG 350). The definition of CG 350 corresponds with the definitions of preventive maintenance (measures for the delay of the wear of operated systems) and inspection (measures to determine and assess the state and condition of operated systems) in the standard DIN 31051:2012-09. Further maintenance costs defined in DIN 31051:2012-09 correspond with the repair costs of DIN 18960:2008-02 and the construction costs described in DIN 276-1:2008-12 and are not included in the scope of costs under investigation in the current study.
GEFMA 200:2004-07: Costs of facilities management (Germany)
The German standard provides a cost classification structure including four levels, considers the entire life cycle, and targets the facility management. A differentiation on the first level of the structure is made between the periods of conception, planning, construction, marketing, procurement, operation and occupancy, conversion and renovation, vacancy, and liquidation. The standard is partially in conformity with the cost classification structures of the DIN 276-1:2008-12 and the DIN 18960:2008-02. The operating costs under investigation in the current study can be mapped to the period of operation and occupancy as presented in the standard GEFMA 200:2004-07.
ISO 15686-5:2008-06: Buildings and constructed assets (International standard)
The international standard provides definitions, a basic structure, and a calculation method for life-cycle costing of buildings and constructed assets. The exemplary classification structure differentiates between construction costs, operation costs, maintenance costs, and end-of-life costs with examples for the respective cost types. According to the standard, the scope of the analysed costs is to be adjusted in order to accommodate local practices. The scope of the operating costs under investigation in the current study can generally be mapped to the operation and maintenance costs according to the classification structure of the standard. The international standard is adopted as a national standard in the United Kingdom (BS ISO 15686-5:2008-06) and Sweden (SS-ISO 15686-5:2008-06).
ÖNORM B 1801-2:2011-04: Project and object management in construction (Austria)
The Austrian standard defines basic terms and a classification structure of occupancy costs for the application in structural and civil engineering. The classification structure provides two levels and includes costs of management, technical operations of buildings, utility and disposal, cleaning, security services, building management, repair and conversion, miscellaneous, and demolition and removal. Furthermore, the standard describes necessary data for the determination of occupancy costs and respective cost-benefit analyses.
PD 15686-5:2008-06 (2008): Standardized method of life cycle costing (UK)
As a supplement to the international standard ISO 15686-5:2008-06, the publication provides a more detailed description of the exemplary cost classification structure under consideration of specific national issues. The provided cost structure contains two levels and differentiates between construction costs, maintenance costs, operation costs, occupancy costs, end of life costs, non-construction costs, and income. In contrast to the German standard, the British supplement defines occupancy costs as “user support costs relating to the occupation of the building”. The building-related operating costs under investigation in the current study can be mapped to the maintenance and operation costs according to the British supplement.
NS 3454:2013-03: Life cycle costs for construction works (Norway)
The Norwegian standard provides definitions, a detailed calculation method, and a three level classification structure of life cycle costs for construction projects. The classification structure differentiates between nine cost groups on the first level: procurement and construction, management, operation and maintenance, repair and development, utilities, cleaning services, service and support, occupancy-specific costs, and revenues from renting or selling of values. In a comparison with the German standard DIN 18960:2008-02, the Norwegian standard covers a larger scope of costs and includes for example investment costs. Further differences can be observed in particular in the definition of operation and maintenance.
SIA 480:2016-03: Economic assessment of investments (Switzerland)
A description of methods and assumptions for the economic assessment of investments is provided in the Swiss standard. The management costs and the costs of operation and maintenance are described as annual expenditures to be included in an economic assessment. The operating costs are differentiated into utilities and disposal, cleaning and care, operation of technical installations, inspection of building construction and technical installations, security and surveillance, statutory charges and contributions, and maintenance and care of outdoor circulation areas and green areas. The classification of operating costs corresponds generally with the classification introduced in the German standard DIN 18960:2008-02.
2.2.2 Predictor variables
Predictor variables describe the quantities that cause the variance of one or multiple response variables in statistical models as described by Backhaus et al. (2011). In order to conduct a statistical analysis, candidate predictor variables that explain or estimate the response variables have to be selected (cf. Chatterjee and Hadi, 2006). The candidate predictor variables can be determined from theory and literature as outlined by Fellows and Liu (2015). A general overview of variables with potential influence on operating costs is presented in the standard DIN 18960:2008-02. The standard describes functional, technical, and organisational system characteristics, the user behaviour, and the system environment as relevant variables with influence on occupancy costs. The standard ISO 15686-5:2008-06 characterises the orientation, the building footprint, the location, the site, the building height, and the building layout as major variables influencing the life cycle costs. Furthermore, the indoor climate control, the ventilation and solar design, and the air conditioning and heating system are described as technical standards with significant implications for operating costs. According to the publication of BCIS (2007b), occupancy costs may be affected by the following variables: utilisation, building size, building shape, building layout, design and specification, intensity of use, and location. Further information on relevant predictor variables is presented in the studies of Stoy (2005) and Beusker (2012). They employ strategies, building characteristics, the location, and the utilisation as main variable groups in their investigation of occupancy costs. The variable groups introduced in further relevant studies are presented in detail in Section 1.3.
Based on the review of various studies, the main variable groups of the current study are determined as the quantities, the characteristics, the utilisation, the location, and the management strategy of facilities. A detailed illustration of these variable groups and respective sub-groups is presented in Table 2.2. The variable group of quantities is differentiated into reference quantities, specific areas, compactness, and function. A detailed description of the characteristics of the facilities is provided by variable sub-groups giving information on the condition and standard of the building construction, technical installations, outdoor facilities, and furniture and equipment. The main variable group of utilisation contains both descriptions of the type of facility and specific utilisations. Moreover, the main variable group location is differentiated into urban location and topography. Finally, the variable group management strategy includes information on cleaning services of the facilities. A detailed description of the analysed variables is presented in the theoretical basis of the analysis of the respective cost groups in Chapter 4. The general classification of the predictor variables into the variable groups and their expected interrelationships with the operating costs is outlined in the following overview.
Table 2.2. Predictor variable groups
Variable group | Sub-groups
Quantities | Reference quantities; specific areas; compactness; function
Characteristics | Condition: condition of building construction, technical installations, outdoor facilities, and furniture and equipment
Characteristics | Standard: standard of building construction, technical installations, outdoor facilities, and furniture and equipment
Utilisation | Type of facility; specific utilisation
Location | Urban location; topography
Management strategy | Cleaning services
Quantities
The employment of quantities for the normalisation of cost data is a common approach in order to generate cost indicators for the estimation and comparison of costs. Indicators are provided by various publications, for example BKI (2016) for construction costs or BCIS (2007b), JLL (2016), Rotermund (2016), and Stoy et al. (2017) for operating costs. The employed reference quantities are for example the gross external floor area GEFA, the gross internal floor area GIFA, or the number of workplaces in a building. The standard DIN 277-3:2005-04 defines a reference quantity as a measurable value with a unit and gives volumes, areas, lengths, or numbers as examples. The current study employs various candidate reference quantities for the introduction of operating cost indicators. The most adequate reference quantity is selected by a statistical analysis, and further investigations are conducted with the respective cost indicators as response variable. The employed candidate reference quantities are measured and classified as defined in the standards DIN 277-1:2016-01 and DIN 277-3:2005-04. The terminology of the German standards was translated according to the terms used by the CEEC (2008) and the RICS (2013). Figure 2.3 illustrates the classification structure of the building areas and the volume according to DIN 277-1:2016-01. In the current investigation, the gross external floor area GEFA, the gross internal floor area GIFA, the usable floor area UFA, and the gross building volume GBV are included as candidate reference quantities. As described in the standard VDI 3807-1:2013-06, the heatable share of the building area is the preferable reference quantity for the energy consumption. Assuming that the heatable area also has significant influence on the operating costs, the heatable gross internal floor area hGIFA is analysed as a candidate reference quantity. In addition, the regularly cleaned gross internal floor area cGIFA is considered for the normalisation of costs of some of the analysed cost groups. For cost groups related to outdoor facilities, the non-built site area nbSAR is analysed as a candidate reference quantity. The classification structure of site areas according to the standards DIN 277-1:2016-01 and DIN 277-3:2005-04 is illustrated in Figure 2.4.
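The normalisation described above amounts to dividing an annual cost by a candidate reference quantity. The following minimal sketch illustrates this; the `Facility` record, its field names, and all numbers are hypothetical and are not taken from the data sample of the study.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    """Hypothetical facility record; all values are illustrative only."""
    annual_cost_eur: float  # annual operating costs, e.g. CG 300
    gefa_m2: float          # gross external floor area GEFA
    gifa_m2: float          # gross internal floor area GIFA
    ufa_m2: float           # usable floor area UFA


def cost_indicators(f: Facility) -> dict[str, float]:
    """Normalise the annual costs by each candidate reference quantity."""
    return {
        "GEFA": f.annual_cost_eur / f.gefa_m2,
        "GIFA": f.annual_cost_eur / f.gifa_m2,
        "UFA": f.annual_cost_eur / f.ufa_m2,
    }


school = Facility(annual_cost_eur=120_000.0, gefa_m2=5_000.0,
                  gifa_m2=4_000.0, ufa_m2=3_000.0)
indicators = cost_indicators(school)
for name, value in indicators.items():
    print(f"{name}: {value:.2f} EUR per m2 and year")
```

The same facility yields a different indicator for each reference quantity, which is why the choice of the most adequate reference quantity matters for the subsequent analysis.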
Besides their utilisation as candidate reference quantities, various quantities are examined as candidate predictor variables in the current study. For example, the indicators provided by JLL (2016) are differentiated into multiple building sizes and indicate a significant variation of costs. Likewise, the amount of specific areas, for example cleaning or sanitary areas, is expected to cause an alteration of costs. Therefore, the shares of the usable floor area UFA, circulation area CA, sanitary area sGIFA, and regularly cleaned area cGIFA as defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12 are included as candidate predictor variables. Furthermore, the shares of the heatable area hGIFA and ventilated and air-conditioned area vGIFA according to the standard VDI 3807-1:2013-06 are considered as predictors. The described shares are expressed as a percentage of the gross internal floor area GIFA. The shares of the non-built area nbSAR and planted area pSAR are, in contrast, expressed as a percentage of the site area and are likewise analysed as descriptions of outdoor facilities and grounds. The design and shape of buildings have significant influence on the energy consumption and consequently also affect the operating costs, as for example outlined in the study of Depecker et al. (2001). The impact of the geometry, the type of glazing, and the glazing area of buildings is for example investigated in the study of Ourghi et al. (2007), establishing a correlation between the design of buildings and the heating energy and electricity consumptions. In order to investigate the influence of the compactness of facilities on the operating costs, the average floor size, average storey height, and number of floors are considered as candidate predictor variables in the analysis. Furthermore, various quantities describing the function of the facilities are analysed as candidate predictor variables, such as the number of elevator stops and number of sanitary facilities. Likewise, the share of glass surfaces and share of double or triple glass surfaces on the facades are included as candidate predictors describing the design of the facilities.
[Figure 2.3, according to DIN 277-1:2016-01: the gross external floor area GEFA is subdivided into the gross internal floor area GIFA and the construction area CONA; the GIFA is subdivided into the usable floor area UFA, the technical area TA, and the circulation area CA. The gross building volume GBV is shown alongside the areas.]
Figure 2.3. Classification of building areas and volume
[Figure 2.4, according to DIN 277-1:2016-01 and DIN 277-3:2005-04: the site area SAR is subdivided into the non-built site area nbSAR and the built site area bSAR; the nbSAR is subdivided into the planted site area pSAR and the paved site area pvSAR.]
Figure 2.4. Classification of site areas
Characteristics
The condition of the construction and technical installations has significant impact on the operating costs as for example outlined by IPBau (1995). Significant correlations between the conditions of the fitments and technical installations and the maintenance costs of the construction and technical installations are likewise determined in the study of Stoy (2005). Therefore, the current investigation includes various predictor variables that describe the conditions of the construction, technical installations, outdoor facilities and grounds, and furniture and equipment. Based on a detailed level of available data as described in Section 3.2.2, the condition of single components is aggregated for the specific requirements of the respective costs to be analysed. Consequently, a wide range of predictor variables is included in the investigation, for example the share of defective electrical installations in percent, which provides information on the technical installations of the facilities. As for example determined in the study of Balaras et al. (2007), the standards of the construction and technical installations such as the building automation and the lighting systems cause a significant alteration of the energy consumption and consequently affect the utility costs. Likewise, the VDI 3807-2:2014-11 describes a significant correlation between district heating systems and energy consumption. For a facility with district heating, no heat losses from heat distribution are expected, and a difference in utility costs is therefore anticipated. Consequently, multiple candidate predictor variables providing information on the standards of the construction, technical installations, outdoor facilities and grounds, and furniture and equipment are analysed with regard to the variation of operating costs they cause. The candidate predictors describe for example the thermal mass, flexibility, existence of building conservation regulations, standard, and type of heating energy source.
Utilisation
The utilisation of facilities has significant influence on operating costs as for example described by Kalusche (2008). Detailed parameters for the calculation of water and energy consumptions are provided in the Swiss standard SIA 380/1:2016-12. The standard outlines an essential difference of room temperatures, air exchange rates, and water consumptions for various types of utilisations. Consequently, a correlation between the utilisation and the utility costs is assumed. Likewise, the cleaning intervals, materials, and type of areas to be cleaned differ substantially depending on the utilisation of facilities and are therefore expected to alter the cleaning costs as for example described by Ashworth and Perera (2015). Based on an extensive review of literature, the current investigation considers the type of facility, the specific utilisation, and the type of water usage as candidate predictor variables describing the utilisation of the analysed facilities.
Location
Further interrelationships are expected between the candidate predictor variables of the variable group location and the operating costs as determined by a review of literature. For example, the operating and maintenance cost indicators provided by JLL (2016) differentiate between multiple locations. The variation of costs depending on the location indicates a significant correlation. Furthermore, the topography of the site area of the facilities is expected to have significant influence on, for example, the cleaning and maintenance costs of outdoor facilities. The investigation therefore includes a differentiation between urban location and rural location as a candidate predictor variable. The type of topography is likewise examined as a candidate predictor with a distinction between a flat and a sloped topography of the site areas of the analysed facilities.
Management strategy
A general impact of management strategies on operating costs is outlined in various studies, for example by Riegel (2004) and Kalusche (2008). An interrelationship between the outsourcing level of cleaning services and cleaning and care costs is determined in the study of Stoy (2005). Therefore, the influence of the management strategy is examined by the analysis of a candidate predictor variable providing information on the type of cleaning services with a differentiation between internal cleaning services, external contractors, and cleaning by tenants. Likewise, facilities cleaned on a voluntary basis are represented by the variable. Further management strategies, for example the level of outsourced facility services or service level agreements, cannot be considered since the respective data are not available in the current study as described in detail in Section 3.4.
2.3 Statistical methods
2.3.1 Pre-analysis of data sample
The data sample employed in the current study is evaluated in a statistical pre-analysis in order to determine the quality of the data and to search for basic patterns. As described by Fellows and Liu (2015), a basic examination of the raw data may reveal differences in the patterns and findings from what is determined by for example previous studies. With a basic examination, an unbiased analysis can be ensured. Furthermore, a pre-analysis may reveal measurement mistakes, corrupted data, and missing data. A screening of the data before the main analysis may therefore prevent a distortion of the results as described by Tabachnick and Fidell (2009). The pre-analysis of the data employed in the current study is conducted by descriptive statistics. Descriptive statistics describe and summarize the characteristics of a data sample and aim to give information on the data distribution, extreme values, and errors as described by Fahrmeir et al. (2013). A rough overview of the conducted pre-analysis is presented in the current Section and a summary of descriptive statistics of the data sample is illustrated in the Appendix.
Central tendency
Measures for the central tendency of a sample describe the centre of a data distribution and are employed to give information on the average level of a characteristic as presented by Toutenburg and Heumann (2008). The arithmetic mean represents the average of a series of values. The mean is defined as the sum of the values divided by the number of observations included in a sample. The mean value is not robust to outliers and is highly influenced by extreme values and skewed data distributions. The utilisation of the median value has the advantage of a certain robustness against outliers and skewed data. The median value describes the middle value and separates the higher 50 % partition from the lower 50 % partition. With its robustness, the median is a commonly used measure to describe a typical value of a data sample.
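The different robustness of the two measures can be illustrated with a short sketch. The values are hypothetical cost indicators chosen for illustration and are not taken from the data sample.

```python
from statistics import mean, median

# Hypothetical cost indicators in EUR per m2 and year (illustrative only).
costs = [18.0, 20.0, 21.0, 22.0, 24.0]
costs_with_outlier = costs + [95.0]  # one extreme observation added

mean_before, mean_after = mean(costs), mean(costs_with_outlier)
median_before, median_after = median(costs), median(costs_with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

A single extreme value shifts the mean considerably, whereas the median moves only slightly.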
Dispersion and measures of spread
The utilisation of quantiles generalises the idea of separating a data sample into partitions as described by Toutenburg and Heumann (2008). Quantiles describe the value of a cut-point that divides the data sample into two partitions. In contrast to the median value that generates two partitions of the same size, quantiles can be selected for an arbitrary cut-point. The most commonly used quantiles are the lower quartile, which separates a data sample into a lower 25 % and an upper 75 % partition, and the upper quartile, which separates it into a lower 75 % and an upper 25 % partition. The standard deviation is a measure utilised to give information on the variance of the values and allows conclusions to be drawn on the dispersion of a data sample. It is calculated as the square root of the variance and is therefore measured in the unit of the examined variable. The variance is determined as the mean of the squared deviations of the observations from the mean value of a data sample. Low values of the standard deviation indicate a low spread of the observations around the mean whereas high values indicate a wide spread.
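A minimal sketch of these measures using the Python standard library follows. The sample is hypothetical. The variance is computed as the mean of the squared deviations, as described above; the quartile interpolation rule (`method="inclusive"`) is an assumption, since several conventions exist and the text does not specify one.

```python
from statistics import pstdev, pvariance, quantiles

# Hypothetical sample (illustrative only).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Lower quartile, median, and upper quartile; the interpolation
# method is an assumption and differs between software packages.
quartiles = quantiles(values, n=4, method="inclusive")

# Variance as the mean of the squared deviations from the mean,
# and the standard deviation as its square root.
variance = pvariance(values)
std_dev = pstdev(values)

print("quartiles:", quartiles)
print("variance:", variance, "standard deviation:", std_dev)
```

Statistical packages often report the sample variance with a divisor of n - 1 instead; `statistics.variance` and `statistics.stdev` would be the corresponding calls.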
Data distribution
The distribution of a data sample may affect the outcome of a statistical analysis as described by Tabachnick and Fidell (2009). The skewness is a measure to describe the asymmetry of the data distribution with respect to the mean. A negative skewness describes a distribution with most of the observations concentrated on the right of the distribution curve and a longer left tail. A right-tailed distribution displays a positive skewness and has the observations concentrated on the left of the distribution curve. The kurtosis of a distribution measures the shape of the curve and is used as an indicator for the deviation of a data sample from normal distribution (Toutenburg and Heumann, 2008). A kurtosis value of 0 indicates normality of the data distribution whereas a negative value indicates a flat distribution and a positive value indicates a peaked distribution as presented by Ryan et al. (2012).
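Both measures can be sketched from the central moments of a sample. The estimators below are the simple moment-based forms, which is an assumption: statistical software frequently applies bias corrections, so values may differ slightly. The kurtosis is reported as excess kurtosis so that a normal distribution corresponds to 0, in line with the convention above. The data are hypothetical.

```python
def central_moment(values, order):
    """Mean of the deviations from the mean raised to the given power."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: third moment over variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical, symmetric
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]   # hypothetical, long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric sample has a skewness of zero and a negative excess kurtosis (a flat distribution), while the right-tailed sample has a positive skewness.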
Outliers
Tabachnick and Fidell (2009) define an outlier as an observation with an extreme value that may distort the results of a statistical analysis. Identified outliers may indicate a mistake in the measurement of the data or a corruption of the data. In the current study, the interquartile range is employed for the identification of outliers. The interquartile range describes the difference between the upper and lower quartile and therefore measures the range of the middle 50 % partition of a data sample. If an observation lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile, it is considered an outlier. At a distance of more than 3 times the interquartile range from the respective quartile, an observation is identified as an extreme outlier (cf. Toutenburg and Heumann, 2008). In this way, extreme values of a data sample are identified and a detailed examination of the respective observations can be conducted.
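The interquartile range rule can be sketched as follows. The function name, the labels, and the sample are hypothetical, and the quartile interpolation method is an assumption that may differ from the software used in the study.

```python
from statistics import quantiles


def classify_by_iqr(values):
    """Label each value as regular, outlier (> 1.5 IQR) or extreme outlier (> 3 IQR)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    result = []
    for v in values:
        # Distance below the lower quartile or above the upper quartile.
        distance = max(q1 - v, v - q3, 0.0)
        if distance > 3.0 * iqr:
            label = "extreme outlier"
        elif distance > 1.5 * iqr:
            label = "outlier"
        else:
            label = "regular"
        result.append((v, label))
    return result


# Hypothetical cost indicators (illustrative only).
sample = [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 30.0, 60.0]
labels = dict(classify_by_iqr(sample))
print(labels)
```

In this sample the quartiles are 13 and 18, so the interquartile range is 5; the value 30 lies beyond 1.5 times and the value 60 beyond 3 times that range above the upper quartile.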
Graphical description
In an exploratory data analysis, graphical illustrations of the measures of descriptive statistics are employed for an analysis of a data sample as described by Chatterjee and Hadi (2006). The pre-analysis of the data sample of the current study is conducted by box plots, histograms, and scatter plots. Box plots depict the central tendency and the dispersion of data graphically by an illustration of the median value, the upper and lower quartiles, the minimum and maximum values, and the outliers. Information on the data distribution including the skewness and kurtosis is provided by histograms. They illustrate the frequency of observations of a data sample in intervals and are likewise used for the pre-analysis by descriptive statistics in the current study.
2.3.2 Regression analysis
The analysis of data with regressions has a relatively long tradition as a research method in the field of construction. As for example introduced in the studies of Kouskoulas and Koehn (1974), McCaffer (1975), and Bowen and Edwards (1985), regression models have served as tools for the estimation of construction costs since the 1970s. Recent studies, for example by Lowe et al. (2006), Stoy et al. (2012), and Dursun and Stoy (2016), employ regression models for the identification of relevant predictors and for the purpose of an early construction cost estimation. A first approach to create a model in order to estimate operating and maintenance costs of buildings under consideration of relevant predictor variables was carried out by Al-Hajj and Horner (1998). Further approaches, for example by Stoy (2005), use regression models for the analysis of capital, object management, operating, and maintenance costs of office buildings. More recently, Beusker (2012) introduced various regression models for the identification of variables with influence on the energy consumption and operating and repair costs based on the empirical data of 130 school facilities. Based on the review of literature presented in Section 1.3, regression analysis is employed as one of the statistical methods for the execution of the current study.
Model development
As described by Chatterjee and Hadi (2006), regression analysis is a statistical method to investigate functional interrelationships among variables. The developed regression models can be used for the description of a process, for the estimation and prediction of values, and for the purpose of controlling the response variable. Using the empirical data as described in detail in Chapter 3 as a basis, the statistical models developed in the current study aim to reveal and describe causal interrelationships between the operating costs and multiple variables. Therefore, the multiple regression is selected as the method to explain the attributes of a response variable under consideration of multiple predictor variables. Generally, the dependence of a response variable $Y$ on multiple predictor variables $X_i$ is explained by the following function (Backhaus et al., 2011):
$$Y = f(X_1, X_2, X_3, \ldots, X_i) + \varepsilon$$
The error term $\varepsilon$ of the function denotes the discrepancy between an estimated value and an observed value for an observation. The discrepancy cannot be explained by the function and describes the failure of the model to fit the underlying data sample. The following equation illustrates, as an example, a simple linear regression function $Y$ with a single predictor variable $X_1$ under consideration of the error term $\varepsilon$:
$$Y = \beta_0 + \beta_1 X_1 + \varepsilon$$
As presented in the diagram in Figure 2.5, the regression constant $\beta_0$ specifies the value of $Y$ for the predictor $X_1 = 0$ and therefore illustrates the intercept with the Y-axis. The slope of the regression function is denoted by the regression coefficient $\beta_1$ and is calculated as the change $\Delta Y$ divided by the corresponding change $\Delta X$. Moreover, the diagram provides an example for the determination of the error for a single observation $Obs_k$. The error term $\varepsilon_k$ of the single observation $Obs_k$ (residual) is calculated as the difference between the observed value of the response variable $y_k$ and the corresponding estimated value $\hat{y}_k$. As presented by Chatterjee and Hadi (2006), the linear relationship between one response variable and multiple predictor variables can be explained by the following generalised function of a multiple regression:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \ldots + \beta_i X_i + \varepsilon$$
where $Y$ is the response variable of the regression, $\beta_0$ is the regression constant to be determined, $\beta_i$ are the regression coefficients to be determined, $X_i$ are the predictor variables, and $\varepsilon$ is the error term of the regression.
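[Figure 2.5: scatter plot with the predictor variable $X_1$ on the horizontal axis and the response variable $Y$ on the vertical axis. The regression function $Y$ is drawn with its intercept $\beta_0$ and its slope $\beta_1 = \Delta Y / \Delta X$. For a single observation $Obs_k$ with observed value $y_k$ and estimated value $\hat{y}_k$, the residual $\varepsilon_k = y_k - \hat{y}_k$ is marked.]
Figure 2.5. Scatter plot and example of a simple linear regression function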
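The quantities shown in Figure 2.5 can be computed for a small sample with the closed-form least squares solution of a simple linear regression. This is only an illustrative sketch: the function name and the data are hypothetical, and the models of the study involve multiple predictors and dedicated statistical software.

```python
def fit_simple_ols(x, y):
    """Return (beta0, beta1) of the least squares line y = beta0 + beta1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx               # slope
    beta0 = y_mean - beta1 * x_mean   # intercept with the Y-axis
    return beta0, beta1


# Hypothetical observations (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

beta0, beta1 = fit_simple_ols(x, y)
# Residual of each observation: observed value minus estimated value.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

For a least squares line with an intercept, the residuals sum to zero, which is a quick plausibility check for such a fit.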
For the determination of a regression function, the unknown parameters β0 and βi are estimated by the so-called ordinary least squares (OLS) method. As described by Fahrmeir et al. (2013), the method keeps the sum of the squared error terms ε over all underlying observations (residuals) as small as possible. By squaring the residuals, extreme deviations are given greater weight. Furthermore, positive and negative residuals cannot cancel each other out and thereby distort the results.

The selection of the predictor variables of the multiple regression models is conducted in a stepwise, bidirectional procedure under consideration of alpha-to-enter and alpha-to-remove values as described by Harrell (2015). During the bidirectional procedure, the p-values of the candidate predictor variables not yet included in the regression model are compared to the specified alpha-to-enter value. A candidate predictor variable is entered into the model if its p-value falls below the alpha-to-enter value. Likewise, the p-value of each predictor variable included in the regression model is compared to the alpha-to-remove value. If the p-value exceeds the alpha-to-remove value, the variable is removed from the model. The selected p-value to determine significance of predictor variables is presented subsequently.

In order to include qualitative predictor variables in the development of multiple regression models, the respective characteristics of the variables are transformed into binary dummy variables as suggested by Chatterjee and Hadi (2006). The available characteristics are represented by numerical values of 0 or 1. If a characteristic of a qualitative predictor has a value of 0, it has no influence on the regression function, whereas a characteristic with a value of 1 may alter the result of the regression function. Consequently, a qualitative predictor variable can be employed in a regression model as a set of binary quantitative variables.
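The least squares estimation and the dummy coding described above can be sketched for the simple case of one predictor variable. The following Python example is a minimal illustration and not the software procedure used in the study; the function names, the data values, and the omission of a reference category in the dummy coding are assumptions made for the example.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Estimate beta0 and beta1 of y = beta0 + beta1 * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx              # slope: delta Y / delta X
    beta0 = y_bar - beta1 * x_bar    # intercept with the Y-axis
    return beta0, beta1


def residuals(x, y, beta0, beta1):
    """Residual of each observation: observed value minus estimated value."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]


def to_dummies(values, reference):
    """Encode a qualitative predictor as 0/1 columns.

    Assumption: the reference characteristic is omitted, so that it is
    represented by all dummy variables being 0.
    """
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical data, for illustration only.
floor_area = [100.0, 200.0, 300.0, 400.0, 500.0]
annual_cost = [5.2, 8.9, 13.1, 17.0, 20.8]

b0, b1 = fit_simple_ols(floor_area, annual_cost)
eps = residuals(floor_area, annual_cost, b0, b1)
print(f"beta0 = {b0:.3f}, beta1 = {b1:.4f}, SSE = {sum(e * e for e in eps):.4f}")
print(to_dummies(["office", "school", "office", "housing"], reference="office"))
```

With an intercept in the model, the residuals of the fitted function sum to zero, which is why their squares rather than their raw values are minimised.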
Measures of fit

The parameters, the significance, the fit to the underlying data, and the estimation performance of a regression model are indicated by various measures. The coefficient of determination R², the normalised root mean square error CV(RMSE), and the mean absolute percentage error MAPE reflect the fit of a model to the underlying data sample. The measures are not only used to describe the performance of the developed regression models but are also employed as indicators for the performance of the artificial neural network models (cf. Section 2.3.3) and classification tree models (cf. Section 2.3.4). Therefore, the performance measures R², CV(RMSE), and MAPE and their implementation are presented in detail in Section 2.3.6. Consequently, they are not included in the following description of relevant measures for the validation of the regression function and regression coefficients:
Generalised significance of a regression model (F-statistics)

In order to conclude a general significance of the developed regression model, the F-statistics test a null hypothesis against an alternative hypothesis. The null hypothesis assumes that no significant interrelations between any variables are explained by the model, whereas the alternative hypothesis assumes a causal interrelationship between the response variable and at least one of the predictor variables included in the model. Therefore, the empirical F-value of the developed regression model is compared to a theoretical F-value determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level as described by Backhaus et al. (2011). The selected confidence level of 95 % is equivalent to a significance level α of 5 %. If the empirical F-value exceeds the theoretical F-value, the null hypothesis can be rejected and a general significance of the developed regression model can be concluded.
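As an illustration of the comparison described above, the empirical F-value can be computed from the observed and estimated values of the response variable. The sketch below assumes the standard definition of the F-value for a least squares model with an intercept (explained sum of squares per predictor divided by residual sum of squares per residual degree of freedom); the function name and the data are hypothetical, and the theoretical F-value is taken from a standard F-table rather than computed.

```python
def f_statistic(y, y_hat, n_predictors):
    """Empirical F-value of a regression model with an intercept.

    y:            observed values of the response variable
    y_hat:        values estimated by the regression model
    n_predictors: number of predictor variables included in the model
    """
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total
    ss_reg = ss_tot - ss_res                                  # explained
    df_model = n_predictors
    df_resid = n - n_predictors - 1
    return (ss_reg / df_model) / (ss_res / df_resid)


# Hypothetical observations and estimates, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
estimated = [1.2, 1.9, 3.1, 3.9, 4.9]

f_emp = f_statistic(observed, estimated, n_predictors=1)

# Theoretical F-value for alpha = 5 % with 1 and 3 degrees of freedom,
# as tabulated in standard F-tables (approximately 10.13).
f_theoretical = 10.13
print(f"F = {f_emp:.1f}, reject null hypothesis: {f_emp > f_theoretical}")
```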
Regression coefficients and standard error of coefficients (β and SE β)

As presented in the example of the linear regression function in Figure 2.5, the coefficients of the regression describe the intercept with the Y-axis (regression constant β0) and the slope of the function for a single predictor variable (regression coefficients βi). The coefficient βi represents the change of the response variable Y corresponding to a one-unit change of the predictor Xi for the case that all other predictors are held constant (Chatterjee and Hadi, 2006). Since the predictor variables may be measured on different scales, the coefficients cannot be used to determine and compare the effects of multiple predictors on the response variable. The standard error SE indicates the accuracy of the estimate of the coefficient for the underlying observations.
Standardised regression coefficients

The size of the effect of the predictor variables on the response variable is indicated by the standardised coefficients. Conclusions about the effect can be drawn even if the predictors are measured on different scales as described by Ryan et al. (2012). In order to calculate the standardised coefficients, the regression coefficients are multiplied by the standard deviation of the respective predictor variable and divided by the standard deviation of the response variable. Consequently, the different scales of measurement of the variables and their coefficients are eliminated. Since categorical predictors are included in the regression models as binary variables, the standardised coefficients can initially only be determined for the individual characteristics of the qualitative predictors. In order to determine the standardised coefficients for the qualitative predictors as a whole, the regression function is recalculated with composite variables in exchange for the binary variables. The composite variables are created by application of the regression coefficients as weights for the binary variables as established by Eisinga et al. (1991).
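The standardisation can be illustrated with a short calculation. The sketch below assumes the usual definition, in which a coefficient is multiplied by the standard deviation of its predictor and divided by the standard deviation of the response variable; the function names and data are hypothetical. For a regression with a single predictor, the standardised coefficient coincides with the Pearson correlation coefficient, which serves as a plausibility check.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Unstandardised regression coefficient of a simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def standardised_coefficient(beta, x, y):
    """Scale-free coefficient: beta * s_x / s_y."""
    return beta * stdev(x) / stdev(y)


def pearson_r(x, y):
    """Pearson correlation coefficient, used here only as a check."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / (s_xx * s_yy) ** 0.5


# Hypothetical data: the same relationship measured in m2 and in 1,000 m2.
area_m2 = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
area_thousand_m2 = [a / 1000.0 for a in area_m2]
cost = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_m2 = ols_slope(area_m2, cost)
beta_k = ols_slope(area_thousand_m2, cost)
std_m2 = standardised_coefficient(beta_m2, area_m2, cost)
std_k = standardised_coefficient(beta_k, area_thousand_m2, cost)

# The raw coefficients differ by a factor of 1,000; the standardised ones agree.
print(beta_m2, beta_k, std_m2, std_k)
```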
Significance of predictor variables (t-statistics)

The significance of the relationship between a single predictor variable and the response variable of the regression is tested by the t-statistics. As described for the F-statistics, a null hypothesis is tested against an alternative hypothesis. The null hypothesis assumes that the predictor has no influence on the response variable of the regression, whereas the alternative hypothesis assumes a causal interrelationship between the tested predictor variable and the response variable (Fahrmeir et al., 2013). Under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level, the empirical t-value is compared to a theoretical t-value as described by Backhaus et al. (2011). The selected confidence level of 95 % is equivalent to a significance level α of 5 %. If the empirical t-value exceeds the theoretical t-value, the null hypothesis can be rejected and significance of the tested predictor variable can be concluded. The p-value is likewise employed to test the null hypothesis against the alternative hypothesis (Ryan et al., 2012). If the p-value falls below the specified significance level α of 5 %, significance is indicated for the predictor variable.
Multicollinearity (VIF)

Multicollinearity describes strong linear relationships among the predictor variables and is associated with an unstable estimation of the coefficients of a regression model (Chatterjee and Hadi, 2006). The relationships between the predictor variables included in a model can be determined by the evaluation of the variance inflation factor VIF. The VIF measures the increase of the variance of the coefficient of a variable caused by collinearity with other variables as described by Ryan et al. (2012). A VIF-value of 1.0 indicates the absence of any linear relationship among the predictor variables. Since the current investigation is not conducted on an experimental level where the variables can be observed in isolation and without bias, a certain degree of multicollinearity among the predictor variables is expected. As suggested by Urban and Mayerl (2011), the threshold for a critical value of the VIF is selected to be 5.0 for the current investigation.
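A brief numerical sketch may clarify how the VIF relates to the threshold of 5.0. It assumes the common definition VIF = 1 / (1 − R²), where R² is the coefficient of determination obtained when one predictor is regressed on all other predictors; for a model with exactly two predictors this R² equals the squared correlation between them. Function names and data are hypothetical.

```python
def vif_from_r_squared(r_squared):
    """VIF of a predictor, given the R² of its regression on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF in a model with exactly two predictors (identical for both)."""
    r = pearson_r(x1, x2)
    return vif_from_r_squared(r * r)


CRITICAL_VIF = 5.0  # threshold selected in the text

# Hypothetical predictors: uncorrelated versus strongly correlated.
uncorrelated = vif_two_predictors([-1.0, 0.0, 1.0], [1.0, -2.0, 1.0])
correlated = vif_two_predictors([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 2.0, 2.9, 4.2, 4.8])
print(uncorrelated, correlated, correlated > CRITICAL_VIF)
```

Under this definition, the threshold of 5.0 corresponds to an R² of 0.8 among the predictors.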
Regression assumptions

The application of a regression model for inference or estimation is subject to several principal assumptions, which are checked in order to detect model violations as described by Chatterjee and Hadi (2006). A violation of the model assumptions may distort the results of a regression model, and the application of the developed model may then result in errors. Therefore, the following assumptions about the variables and the residuals of a regression model are evaluated in the current study:
Linearity

The relationship between the response variable and the predictor variables of a multiple regression model is linear. Non-linearity distorts the results and influences the estimation of the coefficients negatively. A violation of the linearity of the relationships in a multiple regression model is detected by an evaluation of scatter plots of the distribution of the residuals against the predicted values. The scatter plot should display a symmetrical distribution around the zero-line of the residuals. Detected non-linearity of the relationship between the response and predictor variables may be fixed by a transformation of the affected variables, whereby non-linear relationships may be transformed to linear relationships (Backhaus et al., 2011).
Normality

The residuals of a regression model are normally distributed and not skewed, for example by extreme outliers. Since the determination of the coefficients of a regression model is based on the ordinary least squares (OLS) method, which keeps the sum of the squared residuals as small as possible, extreme outliers and a skewed data distribution may distort the results and strongly influence the estimation of the coefficients. For the detection of non-normality in the distribution of the residuals, histograms and scatter plots are evaluated. As described by Tabachnick and Fidell (2009), violations of the normality of the residuals can be caused by non-normality in the distribution of the underlying variables and may be avoided by transformation.
Independence

The residuals of a multiple regression model are independent and not serially correlated. Correlations among consecutive residuals violate the assumptions for time series regression models and indicate an incorrect specification. Since the observations of the current study are not time related and only non-time series models are developed, an evaluation of the independence of the residuals is not conducted.
Homoscedasticity

The residuals of a multiple regression model are homoscedastic, i.e. their variance is constant. A heteroscedastic variance of the residuals indicates non-normality in the distribution of the variables, extreme outliers, missing terms, or influential points as described by Fahrmeir et al. (2013). The assumption of homoscedasticity is directly related to the assumptions of normality and linearity. The violation of homoscedasticity may be detected by an evaluation of scatter plots of the distribution of the residuals against the predicted values. Detected violations may be fixed by a re-specification of the multiple regression model and by transformation of variables.
Variable transformation

As described previously, the transformation of variables may establish linearity of the relationship between response and predictor variables. Likewise, the transformation may avoid non-normality of the distribution and heteroscedasticity of the residuals of a regression model. The transformation of variables may therefore fix the violation of the principal model assumptions as described by Tabachnick and Fidell (2009). When the residuals of a regression model comply with the assumption of normality, an improvement of the regression models in terms of quality and performance is expected (Schmidt, 2010). The transformation of variables was introduced by Grimm (1960). The optimal transformation of the variables in the current study is conducted by the method of Box and Cox (1964). Thereby, power transformations of a variable are tested over a range of values of lambda λ and the optimal transformation is determined as the value of lambda λ with the lowest pooled standard deviation. The determination of lambda λ for a transformation of the response variable CG 352 (annual costs, cf. Chapter 3) according to Box and Cox (1964) is illustrated by way of example in Figure 2.6. The pooled standard deviation of the variable is calculated for various values of lambda λ and the optimal transformation with the lowest pooled standard deviation is presented with a 95 % confidence interval. The optimal transformation is estimated to be a natural logarithm with a lambda λ of 0. Histograms of the distribution of both the original and the transformed variable are presented in Figure 2.7. The untransformed variable displays a right-tailed form with a positive skewness value of 6.53 and a kurtosis value of 49.75. The values of skewness and kurtosis and the visual inspection of the histogram indicate non-normality in the data distribution. After the estimated optimal transformation by a natural logarithm, the values of skewness and kurtosis decrease to 0.32 and 0.23, respectively. Consequently, the distribution of the response variable CG 352 is substantially improved. The regression models for CG 352 with both untransformed and transformed variables are presented in detail in Section 4.10.
Figure 2.6. Box-Cox plot of optimal transformation of variable CG 352 costs (pooled standard deviation StDev plotted against λ from −3 to 1 with 95.0 % confidence limits; estimate of λ: −0.08, lower CL: −0.20, upper CL: 0.06, rounded value: 0.00)

Figure 2.7. Histograms of variable CG 352 costs and LN-transformed variable CG 352 costs (frequency distributions; CG 352: mean 3,539, st. dev. 7,269, skewness 6.53, kurtosis 49.75, n = 171; LN (CG 352): mean 7.46, st. dev. 1.12, skewness 0.32, kurtosis 0.23, n = 171)
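The effect of the transformation on the skewness of a variable can be reproduced in principle with a few lines of Python. The sketch below implements the Box-Cox power transformation in its usual form, (y^λ − 1) / λ for λ ≠ 0 and ln(y) for λ = 0, and a moment-based skewness; it does not reproduce the selection of λ by the pooled standard deviation. The data are hypothetical and not taken from the sample of the study, and the moment-based skewness formula is an assumption that may differ from the one used by the statistical software behind Figure 2.7.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transformation of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]          # lambda = 0: natural logarithm
    return [(v ** lam - 1.0) / lam for v in values]   # lambda != 0: power transformation


def skewness(values):
    """Moment-based skewness m3 / m2^1.5 (population form, an assumption)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed annual costs, for illustration only.
costs = [120.0, 150.0, 200.0, 260.0, 400.0, 650.0, 1100.0, 2500.0, 9000.0]

ln_costs = box_cox(costs, lam=0.0)
print(f"skewness untransformed: {skewness(costs):.2f}")
print(f"skewness LN-transformed: {skewness(ln_costs):.2f}")
```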
The presented and specified procedure for the correction of non-normality of the data distribution is conducted for all response and quantitative predictor variables in the case that non-normality is indicated. All analysed cost groups in the current study
are presented with both linear regression models LR and non-linear regression models NLR. The linear regression models are developed with untransformed variables and the non-linear regression models are developed with variables that have undergone an optimal transformation as described in the current section. The denotation non-linear describes solely the transformation of at least one of the included variables and does not indicate a non-linear relationship between the response and predictor variables of the developed model.
2.3.3 Artificial neural network analysis

Generally, the application of neural network models for data samples with non-linear relationships may offer a better performance, as confirmed in the empirical comparison of several statistical methods conducted by Curram and Mingers (1994). The application of artificial neural networks as tools for construction, engineering, and management tasks was first introduced by Moselhi et al. (1991). Besides various potential areas of application, such as the selection of alternatives and optimisation, they describe the application of artificial neural networks for the estimation and classification of factors, schedules, and cost indices. Boussabaine (1996) likewise describes the use of artificial neural networks in construction management and suggests, for example, the estimation of costs, risk analysis, and decision making as potential areas of application. Smith and Mason (1997) present a general comparison of regression and neural network models for cost estimation in the fields of engineering and economics and describe that neural network models may represent an alternative if the underlying cost data does not fit a regression model. The first application of artificial neural network models for the cost estimation of construction projects is presented by Adeli and Wu (1998) and includes a validation based on empirical data. Emsley et al. (2002), Kim et al. (2004), and Sonmez (2004) adopt the artificial neural network approach for the estimation of construction costs of buildings and present a comparison to the cost estimation by regression models. Since then, the construction cost estimation by artificial neural network models and the respective approaches have been enhanced continually, for example in the studies of Sonmez (2011) and Dursun (2013).
The application of neural network models for the estimation of energy consumption of buildings is presented in various studies, for example by Kalogirou and Bojic (2000) and Tso and Yau (2007). Studies introducing neural network models for the estimation of operating costs of real estate are currently not available. Based on the presented review of literature and the analysis technique presented subsequently, the current study employs artificial neural network models as one of the statistical methods for the operating cost estimation. As described in the previous Section 2.3.2, regression models are solely able to represent linear relationships between response and predictor variables. In contrast, the computational approach of artificial neural network models is capable of describing relationships between response and predictor variables that cannot be classified linearly. The concept of artificial neural network models is biologically inspired by the operation of the central nervous system of humans and animals. The models are able to learn and generalise from experience and to reveal generally unknown relations between a variety of variables as presented by Backhaus et al. (2013). The computational process of learning and generalising is based on empirical data as input. According to the study of Sharda (1994), the estimation of data is one of the main application areas of artificial neural network models.
Multilayer perceptron architecture

The multilayer perceptron architecture MLP is the most adequate structure of an artificial neural network model for the estimation and the identification of cause-effect relationships as described by Backhaus et al. (2013). The MLP model represents a feedforward architecture that processes input variables to a set of one or more appropriate output variables. The structure of an MLP artificial neural network is typically composed of multiple layers of neurons as illustrated by way of example in Figure 2.8. The selected predictor variables Xi are each fed to one neuron of the input layer of the model. The neurons of the input layer receive the values of the predictor variables without alteration. The content of the neurons of the input layer is then processed and transferred to the neurons of the first hidden layer. During the transfer process to the neurons of the hidden layer, the respective values are weighted. The number of neurons in the hidden layer and the number of hidden layers are to be determined in the model specifications. In the neurons of the hidden layer, the values are summed up and further processed by an activation function. Thereafter, the transfer to the neurons of the next hidden layer (or output layer) is conducted. The activation function of an MLP determines the relationship between the input and the output of a single neuron of the model. The most common activation functions of artificial neural networks are presented by Backhaus et al. (2013) as the linear function, the binary step function, the hyperbolic tangent function, and the logistic function. The activation function f for the neurons of an MLP model is to be selected in the model specifications. In an iterative process, the output Y of an MLP model is compared to the observed response variables of the underlying empirical data and the output error is kept as small as possible.
The so-called backpropagation process is conducted automatically and the model is stepwise adjusted by an alteration of the connection weights W i . The backpropagation reiterates the processing of the input data with an alteration of the connection weights until a stop criterion is fulfilled. For the feedforward architecture of the MLP models, the transfer of information is only conducted in one direction from the input layer to the output layer without an exchange of information between the neurons of one layer.
Figure 2.8. Schematic architecture of a multilayer perceptron MLP artificial neural network ANN (input layer receiving the predictor variables X1, X2, X3, ..., Xi; hidden layer(s); output layer delivering the response variable Y. Wi: connection weights; ∑: summation of inputs multiplied by weights W; f: transfer function, whose output is the input for the next layer)
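The feedforward processing illustrated in Figure 2.8 can be sketched as a single pass through a network with one hidden layer. The example below is a minimal illustration under several assumptions that are not stated in the text: each neuron carries an additional bias term, the hidden neurons use the hyperbolic tangent as activation function, and the output neuron is linear. All weights and input values are hypothetical; the adjustment of the weights by backpropagation is not shown.

```python
import math


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One feedforward pass of an MLP with a single hidden layer.

    x:        values of the predictor variables (input layer, passed on unaltered)
    w_hidden: one list of connection weights per hidden neuron
    b_hidden: one bias per hidden neuron (assumption, not shown in Figure 2.8)
    w_out:    connection weights from the hidden neurons to the output neuron
    b_out:    bias of the output neuron (assumption)
    """
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        weighted_sum = sum(w * xi for w, xi in zip(weights, x)) + bias  # summation
        hidden.append(math.tanh(weighted_sum))                          # activation f
    # Linear output neuron (assumption): weighted sum of the hidden outputs.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Hypothetical network with 3 inputs and 2 hidden neurons.
x = [0.5, -1.0, 2.0]
w_hidden = [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]]
b_hidden = [0.0, 0.1]
w_out = [1.5, -0.7]
b_out = 0.2

print(mlp_forward(x, w_hidden, b_hidden, w_out, b_out))
```

Information flows only from the input layer to the output layer; there is no connection between the two hidden neurons.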
As described by Adeli and Wu (1998), over-fitting of an artificial neural network model leads to estimation and generalisation problems. Since the backpropagation is conducted until a defined stop criterion is fulfilled, the output error for the sample used for the training of the model may tend toward zero. Therefore, the model has a perfect compliance to the training sample but allows no conclusion to be drawn about the fit to the population. Over-fitting can be avoided by employment of an independent validation sample for the modelling process. The output error of the training sample is constantly compared to the output error of the validation sample and the backpropagation is stopped when over-fitting of the model is detected.
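The comparison of training and validation errors can be sketched as a simple early-stopping rule. The text does not specify how over-fitting is detected; the rule below, which stops once the validation error has not improved for a given number of iterations, is one common choice and is assumed here for illustration. The parameter `patience` and the error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the iteration whose weights would be kept.

    Training is stopped as soon as the validation error has not improved
    for `patience` consecutive iterations (assumed rule).
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # over-fitting assumed: validation error no longer decreases
    return best_epoch


# Hypothetical error curves: the training error keeps falling towards zero,
# while the validation error rises again after iteration 3.
training_errors = [0.80, 0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
validation_errors = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.70]

print(early_stopping_epoch(validation_errors, patience=3))
```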
Model specification

For the development of multilayer perceptron MLP artificial neural network models, various input parameters are to be determined. In the following, the main parameters of the MLP models presented in the current study are summarised:
Predictor variables

As described by Backhaus et al. (2013), artificial neural network models are not suited to reveal and describe the causal interrelationships between variables but to provide solutions for a defined problem. The selection of predictor variables to be included in an MLP artificial neural network model is therefore considered as a development process. In the current study, the predictor variables are determined according to the results of the conducted regression analysis as described for example by Emsley et al. (2002) and Sonmez (2004). Predictor variables with significant influence on the variation of the response variable of the regression model are therefore employed as input for the MLP models. Further models are developed employing all available predictor variables as input. The results of the study presented in Chapter 4 contain solely the MLP models with the best fit to the underlying data. Since artificial neural network models accept only quantitative variables as input, categorical predictor variables are transformed into binary dummy variables as suggested by Backhaus et al. (2011).
Hidden layers and number of neurons

The specification of the number of hidden layers and the number of neurons in the hidden layers plays an essential role in the development of neural network models. For example, too large a number of neurons may improve the fit of the model to the underlying data but impair the capability to generalise to data of the population through over-fitting (Backhaus et al., 2013). The development of the MLP models in the current study includes one hidden layer and is based on the results of the studies of Emsley et al. (2002) and Sonmez (2004, 2011). Hegazy et al. (1994) suggest using 0.75n, n, or 2n+1 neurons (where n is the number of neurons in the input layer) for the application of neural networks for construction cost estimation. The number of neurons used for the developed models in the current study is therefore varied between the three suggested values. The results, however, contain solely the MLP models with the best fit to the underlying data.
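The three candidate sizes of the hidden layer follow directly from the number of input neurons. The short sketch below computes them; since 0.75n is not necessarily an integer, rounding to the nearest integer with a minimum of one neuron is assumed here, which the text does not specify. The function name is hypothetical.

```python
def hidden_neuron_candidates(n_inputs):
    """Candidate numbers of hidden neurons: 0.75n, n, and 2n + 1.

    Assumption: 0.75n is rounded to the nearest integer (Python's round)
    and at least one neuron is used.
    """
    return [max(1, round(0.75 * n_inputs)), n_inputs, 2 * n_inputs + 1]


# Example: a model with 8 neurons in the input layer.
print(hidden_neuron_candidates(8))
```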
Activation function

The definition of an activation function is an essential task for the development of an artificial neural network since the function determines the degree of non-linearity of the model. In a comparison of linear, logistic, and hyperbolic activation functions, Emsley et al. (2002) found that the hyperbolic tangent function yielded the better networks for the estimation of construction costs. Therefore, the hyperbolic tangent is employed as activation function for the MLP models developed in the current study. According to Backhaus et al. (2013), the hyperbolic tangent function is defined as follows:

\[ f(x) = \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \]
Performance

The performance of the developed MLP models is indicated by various measures describing the fit of the model to the underlying data sample. The employed performance measures are the coefficient of determination R², the normalised root mean square error CV(RMSE), and the mean absolute percentage error MAPE. A detailed presentation of the measures is provided in Section 2.3.6.
Validation and sampling

As described previously, artificial neural networks may over-fit to the data sample used for model development. This may lead to generalisation problems and distort the estimation performance for the population. As described by Smith and Mason (1997), cross-validation with an independent sample can avoid the development of over-fitted neural networks. Therefore, independent observations are employed during model development and their output errors are constantly compared to the output errors of the training sample during the backpropagation process. The so-called ANN-validation sample contains approximately 20 % of the total sample and is selected randomly. A further 10 % of the total observations are employed as test sample for the validation of the models as described in detail in Section 3.3.
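The random partitioning of the observations can be sketched as follows. The shares of approximately 20 % and 10 % are taken from the text; the function name, the fixed random seed, the rounding of the partition sizes, and the total of 200 observations are assumptions chosen for the illustration.

```python
import random


def split_sample(observations, validation_share=0.20, test_share=0.10, seed=0):
    """Randomly split observations into training, ANN-validation, and test samples."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n = len(shuffled)
    n_test = round(n * test_share)
    n_validation = round(n * validation_share)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


# Hypothetical sample of 200 observations, identified by index.
training, validation, test = split_sample(range(200))
print(len(training), len(validation), len(test))
```

Each observation is assigned to exactly one of the three samples, so that the test sample remains independent of model development.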
2.3.4 Classification tree analysis

In a comparison with artificial neural networks and linear discriminant analysis, Curram and Mingers (1994) describe the advantages of classification trees as their transparency and their ability to give insight into the relationships between variables. In an experimental study, Kim (2008) compares decision trees, artificial neural networks, and linear regressions under consideration of different types of predictor variables and sizes of the data sample. Though the results indicate a better estimation accuracy for linear regression and artificial neural network models, classification tree models may represent interpretable rules or logic statements and provide clear information on the importance of significant predictor variables and their classification as described by Tso and Yau (2007). Their study reveals a better performance and a simpler structure of the classification trees for the estimation of energy consumption in comparison with regression and artificial neural network models. Yu et al. (2010) describe the application of classification trees for the estimation of building energy demands as categorical variables. Shin (2015) utilises regression trees for the cost estimation of building construction projects on the basis of empirical data. In a comparison with the estimation accuracy and performance of artificial neural networks, the classification trees show slightly better results.

Classification and regression trees are decision models that are employed in order to estimate categorical and numerical data values. The computational application of classification and regression trees with pruning and cross-validation was introduced by Breiman et al. (1984). The modelling of a classification and regression tree is based on rules aiming to classify the data in nodes according to the characteristics of one or multiple predictor variables in order to determine the best classification of the response variables. The classification of the data is conducted in an iterative process and creates further classifications of each node into child nodes until a defined stop criterion is fulfilled. The stop criterion is usually determined as a minimum number of observations, a maximum depth of the developed tree, the homogeneity of the values of the response variable in the child node, or by a cross-validation to avoid over-fitting. Terminal nodes describe the end of a branch of the tree and are each defined by a unique combination of characteristics of the predictor variables. All observations of the employed data sample can be classified into the tree structure and each is assigned to exactly one terminal node. For an estimation of values, the observed characteristics of the predictor variables are employed to follow the tree structure until a terminal node is reached. Usually, the observations included in the terminal node are used for the estimation, for example by taking the mean of their values of the response variable.
Binary classification trees

Based on the presented general approach for the development of classification and regression trees, binary classification tree BCT models are employed in the current study both for the identification of significant interrelationships between variables and for the estimation of the response variable. Following the illustration in the study of Loh (2011), the schematic structure of a binary classification tree is presented by way of example in Figure 2.9. The observations of a data sample n are classified binarily by the characteristics of the predictor variables X1 and X2. The predictor variable X1 is qualitative with the characteristics a1, b1, c1, d1, and e1. X2 is a continuous quantitative predictor variable. The first classification of the total sample n into Partition 1 and Partition 2 is conducted by the qualitative predictor variable X1. The partitions include the characteristics a1 and b1 of X1 for Partition 1 and the characteristics c1, d1, and e1 of X1 for Partition 2. A further classification of Partition 1 is determined by the quantitative predictor variable X2 into Partition 1.1 for values of X2 ≤ f2 and into Partition 1.2 for values of X2 > f2. The stop criterion is fulfilled for both nodes and they are determined as terminal nodes. For Partition 2, a further classification is conducted into Partition 2.1 for values of the quantitative predictor X2 ≤ g2. Partition 2.2 for values of X2 > g2 is determined as a terminal node by the stop criterion. On the third layer of the exemplary binary classification tree, Partition 2.1 is classified by the characteristics c1 and d1 of the qualitative predictor X1 into the terminal node Partition 2.1.1 and by the characteristic e1 of X1 into the terminal node Partition 2.1.2.
The presented classifications into Partition 1.1, Partition 1.2, Partition 2.1.1, Partition 2.1.2, and Partition 2.2 of the terminal nodes according to the predictor variables X1 and X2 are illustrated in Figure 2.10. In a next step, the partitions with their respective sub-samples of the total data sample can be utilised for the estimation of the response variable Y under consideration of the determined significant characteristics of the predictor variables X1 and X2 .
Figure 2.9. Schematic structure of a binary classification tree BCT (Layer 1: sample n split by X1 into Partition 1 with a1 and b1 and Partition 2 with c1, d1, and e1; Layer 2: Partition 1 split by X2 at f2 into the terminal nodes Partition 1.1 and Partition 1.2, Partition 2 split by X2 at g2 into Partition 2.1 and the terminal node Partition 2.2; Layer 3: Partition 2.1 split by X1 into the terminal nodes Partition 2.1.1 with c1 and d1 and Partition 2.1.2 with e1; TN: terminal node)
Figure 2.10. Partitions of a binary classification tree BCT (horizontal axis: variable X1 with the characteristics a1 to e1; vertical axis: variable X2 with the thresholds f2 and g2; areas: Partition 1.1, Partition 1.2, Partition 2.1.1, Partition 2.1.2, and Partition 2.2)
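The exemplary tree of Figures 2.9 and 2.10 can be written as a short sequence of nested decision rules, which illustrates how an observation is followed through the tree until a terminal node is reached. The function name and the numerical values chosen for the thresholds f2 and g2 are hypothetical.

```python
def classify(x1, x2, f2, g2):
    """Assign an observation to a terminal node of the exemplary tree.

    x1: characteristic of the qualitative predictor X1 ("a1" to "e1")
    x2: value of the quantitative predictor X2
    f2: threshold of X2 within Partition 1
    g2: threshold of X2 within Partition 2
    """
    if x1 in {"a1", "b1"}:                      # Layer 1: Partition 1
        return "Partition 1.1" if x2 <= f2 else "Partition 1.2"
    if x1 in {"c1", "d1", "e1"}:                # Layer 1: Partition 2
        if x2 > g2:                             # Layer 2
            return "Partition 2.2"
        # Layer 3: Partition 2.1 is split again by X1.
        return "Partition 2.1.1" if x1 in {"c1", "d1"} else "Partition 2.1.2"
    raise ValueError(f"unknown characteristic of X1: {x1!r}")


# Hypothetical thresholds, for illustration only.
F2, G2 = 10.0, 25.0

print(classify("b1", 12.0, F2, G2))
print(classify("e1", 20.0, F2, G2))
```

Every combination of characteristics leads to exactly one terminal node, so the five partitions cover the plane of Figure 2.10 without overlap.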
Model specification

For the development of binary classification tree BCT models, various input parameters are to be determined. In the following, the main parameters of the BCT models presented in the analysis of the current study are summarised.
Classification process

Based on the classification and regression tree algorithm as introduced by Breiman et al. (1984), the classification of the data in a BCT model is conducted by an assessment of all possible classifications. For both quantitative and qualitative predictor variables, all available characteristics are considered as candidate classification values for the data sample. The optimal classification is then selected by the measure of classification as described subsequently. The algorithm considers only univariate classifications, i.e. classifications by one predictor variable at a time. With the presented classification process, conclusions can be drawn about the significance of individual predictor variables. For the development of the BCT models introduced in the current study, all available candidate predictor variables for the respective cost groups to be analysed are considered.
Measure of classification For the quantitative response variables of the current study, the least squares method is employed as measure for the classification of the sample as described by Steinberg (2009). Therefore, the sum of squared estimation errors is determined for all possible classifications and the classification with the lowest value is selected. The estimation errors are defined as the differences of the observed and estimated values for the sample or sub-sample $n_p$ of a partition $p$. The sum of the squared estimation errors SSE is calculated as follows:
$$\mathrm{SSE} = \sum_{i=1}^{n_p} (y_i - \hat{y}_i)^2$$
where $y_i$ is the observed value for the observation $i$, $\hat{y}_i$ is the estimated value for the observation $i$, and $n_p$ is the sample size $n$ of the partition $p$.
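The least squares measure of classification can be illustrated with a minimal Python sketch. It assumes a single quantitative predictor and uses the partition mean as the estimate; the function names `sse` and `best_split` and the data are hypothetical and not taken from the study. Qualitative predictors would additionally require an enumeration of groupings of characteristics.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors when each value is estimated by the partition mean."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split 'x <= threshold'."""
    best = None
    # The largest value is skipped because it would leave the right partition empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: floor area (m²) and annual costs (thousand Euro)
area = [100, 120, 150, 400, 450, 500]
cost = [10.0, 12.0, 11.0, 40.0, 42.0, 41.0]
threshold, total = best_split(area, cost)
```

In this example the split at 150 m² separates the small from the large facilities and yields the lowest total SSE of all candidate splits.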
Stop criterion As described previously, the classification of the data is conducted in an iterative process. Thereby, the algorithm creates further classifications until a defined stop criterion is fulfilled. In the current study, the stop criterion is determined as the minimum number of observations to be included in a child node or the maximum depth of the developed tree to be reached. The criterion that is fulfilled first results in the termination of the classification process and in the creation of a terminal node.
Estimation In the current study, the mean values of the respective partitions of the terminal nodes are selected for the estimation of the response variable.
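The interplay of the stop criterion and the estimation by partition means can be sketched as follows. This is a simplified illustration for one quantitative predictor, not the implementation used in the study; the names `grow` and `predict`, the parameters `max_depth` and `min_leaf`, and the data are assumptions.

```python
from statistics import mean

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def grow(x, y, depth=0, max_depth=2, min_leaf=2):
    """Grow a binary tree on one predictor x; returns nested dictionaries."""
    candidates = []
    for t in sorted(set(x))[:-1]:
        left = [i for i, xi in enumerate(x) if xi <= t]
        right = [i for i, xi in enumerate(x) if xi > t]
        # Stop criterion 1: every child node needs at least min_leaf observations.
        if len(left) >= min_leaf and len(right) >= min_leaf:
            total = sse([y[i] for i in left]) + sse([y[i] for i in right])
            candidates.append((total, t, left, right))
    # Stop criterion 2: maximum depth reached (or no admissible split remains).
    if depth >= max_depth or not candidates:
        return {"estimate": mean(y), "n": len(y)}  # terminal node: partition mean
    _, t, left, right = min(candidates, key=lambda c: c[0])
    return {
        "threshold": t,
        "left": grow([x[i] for i in left], [y[i] for i in left], depth + 1, max_depth, min_leaf),
        "right": grow([x[i] for i in right], [y[i] for i in right], depth + 1, max_depth, min_leaf),
    }

def predict(node, xi):
    """Follow the splits down to a terminal node and return its mean."""
    while "estimate" not in node:
        node = node["left"] if xi <= node["threshold"] else node["right"]
    return node["estimate"]

# Hypothetical data: floor area (m²) and annual costs (thousand Euro)
area = [100, 120, 150, 400, 450, 500]
cost = [10.0, 12.0, 11.0, 40.0, 42.0, 41.0]
tree = grow(area, cost, max_depth=1, min_leaf=2)
small, large = predict(tree, 130), predict(tree, 480)
```

Whichever criterion is fulfilled first turns the current node into a terminal node, whose mean then serves as the estimate for all observations falling into that partition.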
Performance Various measures describe the performance and the fit of the developed BCT models to the underlying data sample. The coefficient of determination R², the normalised root mean square error CV(RMSE), and the mean absolute percentage error MAPE are employed as performance measures and presented in detail in Section 2.3.6.
Validation and sampling A validation of the developed BCT models is conducted by an independent test sample. Therefore, 10 % of the total observations are selected randomly and allow unbiased conclusions to be drawn about the estimation performance for the population as described in detail in Section 3.3.
2.3.5 Cost indicators The estimation of costs employing empirical data as a basis is a widely used approach in the field of construction. For example, the Cost Information Centre of the German Chamber of Architects BKI has provided cost indicators for the cost estimation of construction projects since the 1990s (e.g. BKI, 2016). For an estimation of operating and repair costs, a publication series provides statistically compiled cost data and cost indicators in documentations of individual facilities (e.g. Stoy et al., 2015, 2017). Further cost indicators for the estimation of operating costs are for example presented in the publications of BCIS (2007b), Rotermund (2016), and JLL (2016). Cost indicators can be utilised for the estimation, the control, and the management of operating costs as described in the standard DIN 18960:2008-02. Furthermore, cost indicators can be employed for the evaluation of the performance of operated facilities by a comparison with reference values in a quantitative benchmarking as described in the standard draft GEFMA 250:2011-02 and the standard DIN EN 15221-7:2013-01. The current study employs categorised cost indicators for the estimation of operating costs and evaluates their estimation accuracy in a comparison with the estimation by regression models, artificial neural network models, and classification tree models. Therefore, the results of the statistical analyses serve as basis for the compilation of adequate cost indicators. The provision of cost indicators requires a consistent definition of various parameters in order to ensure validity. The following parameters are considered for the cost indicators presented in Chapter 4.
Cost data The cost data as a basis for the introduction of operating cost indicators are defined in the cost structure of the standard DIN 18960:2008-02. Cost indicators are presented for all analysed cost groups as an annual value in the unit Euro per reference quantity per year. All presented cost data contain the value added tax rate and are adjusted to first quarter 2016 prices.
Reference quantity In order to allow comparisons or draw inferences, the underlying cost data is normalised by a reference quantity to be defined. According to the standard DIN 277-3:2005-04, a reference quantity is a measurable quantity as for example a volume, an area, a length, or a number described by a unit and a value. The available reference quantities are presented in the theoretical basis of the analysis in Chapter 4. The respective reference quantity for the compilation of cost indicators is determined by a statistical analysis with regression models. Therefore, several regression models are developed, where each model contains one of the available candidate reference quantities as a predictor. In an evaluation of these regression models, the quantity with the highest effect on the absolute costs is determined and defined as the reference quantity for the respective cost group to be analysed.
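The selection of a reference quantity can be sketched in Python. The sketch assumes that the effect of a candidate quantity is judged by the R² of a simple linear regression of the absolute costs on that quantity; this criterion, the function name `r_squared_simple`, and all data are illustrative assumptions rather than the procedure documented in the study.

```python
from statistics import mean

def r_squared_simple(x, y):
    """R² of an ordinary least squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical absolute annual costs (Euro) and two candidate reference quantities
costs = [5200.0, 8100.0, 11900.0, 15800.0, 21000.0]
candidates = {
    "GEFA (m²)": [400.0, 620.0, 910.0, 1200.0, 1600.0],
    "site area (m²)": [900.0, 700.0, 1500.0, 1100.0, 1300.0],
}
scores = {name: r_squared_simple(x, costs) for name, x in candidates.items()}
reference_quantity = max(scores, key=scores.get)
```

With these made-up figures the costs are almost proportional to the floor area, so the floor area would be chosen as the reference quantity.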
Categorisation In order to provide a more accurate cost estimation, the cost indicators in the current study are presented in categories. The categorisation of the indicators is conducted by multiple significant predictor variables and is determined by the statistical analysis of the respective cost group.
Estimation by median values Since the utilisation of median values has the advantage of a certain robustness against outliers and skewed data, they are commonly used as a measure to describe typical values of a data sample (Toutenburg and Heumann, 2008). Therefore, the median value is employed for the estimation of operating costs in the current study. A further description of the dispersion of the underlying cost indicators is given by the upper and lower quartiles.
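A short Python sketch illustrates the estimation by median values and the description of the dispersion by quartiles. The categories and values are hypothetical, and the quartile interpolation method (`method="inclusive"`) is an assumption, since the text does not state which quartile definition is used.

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical observations: (category, cost indicator in Euro/m² GEFA*year)
observations = [
    ("kindergarten", 12.0), ("kindergarten", 14.0), ("kindergarten", 15.0),
    ("kindergarten", 18.0), ("kindergarten", 45.0),  # the last value is an outlier
    ("school", 9.0), ("school", 10.0), ("school", 11.0), ("school", 13.0),
]

by_category = defaultdict(list)
for category, value in observations:
    by_category[category].append(value)

indicators = {}
for category, values in by_category.items():
    lower, _, upper = quantiles(values, n=4, method="inclusive")
    indicators[category] = {
        "lower_quartile": lower,
        "median": median(values),
        "upper_quartile": upper,
    }
```

For the hypothetical kindergartens the median remains at 15.0 although the outlier pulls the mean up to 20.8, which illustrates the robustness mentioned above.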
Validation and sampling The estimation by median values of categorised cost indicators is validated by an independent test sample as presented in Section 3.3. Furthermore, the test sample is employed to conduct a comparison of the different estimation methods. Therefore, 10 % of the total observations are selected randomly and allow unbiased conclusions to be drawn about the estimation performance for the population.
2.3.6 Performance measures In order to conduct a stepwise development of statistical models and evaluate the fit to the underlying data sample, various performance measures are employed in the current study. Based on the measures, the statistical models can be compared and the accuracy of cost estimation can be evaluated. The performance is measured by the coefficient of determination R², the normalised root mean square error CV(RMSE), and the mean absolute percentage error MAPE. Furthermore, the number of outliers provides information on the fit of the statistical models. Besides an evaluation based on the underlying data sample, an independent and randomly selected test sample is employed to avoid an over-fitting of the models. The test sample is not included in the development of the models and therefore allows unbiased conclusions to be drawn about the estimation accuracy for the population as described in detail in Section 3.3. The performance measures evaluated in the statistical analysis in Chapter 4 are presented in the following.
Coefficient of determination R² The fit of the statistical models to the underlying empirical data sample is determined by the coefficient of determination R². The so-called goodness of fit describes the proportion of variation of the response variable that is explained by the statistical model (Backhaus et al., 2011). The coefficient of determination is calculated as the explained sum of squares divided by the total sum of squares. The explained sum of squares describes the sum of the squares of the differences of the estimated values and the mean of the observed values of the underlying data sample. The total sum of squares is determined as the sum of the squares of the differences of the observed values and their mean. The coefficient of determination can take values between 0 and 1, where a value of 0 indicates that the variation of the response variable is not explained by the statistical model and a value of 1 indicates that the total variation of the response variable is explained by the statistical model. The coefficient of determination R² is calculated as follows:
$$R^2 = \frac{\sum_{i=1}^{n} \left(\hat{y}_i - \frac{1}{n}\sum_{i=1}^{n} y_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \frac{1}{n}\sum_{i=1}^{n} y_i\right)^2} \cdot 100\,\%$$
where $y_i$ is the observed value for the observation $i$, $\hat{y}_i$ is the estimated value for the observation $i$, and $n$ is the sample size.
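The definition as explained sum of squares divided by total sum of squares can be written as a small Python function. The function name `r_squared` and the data are illustrative assumptions. Note that this ratio is only guaranteed to lie between 0 and 100 % for least squares fits with an intercept.

```python
def r_squared(observed, estimated):
    """Explained sum of squares divided by total sum of squares, in percent."""
    n = len(observed)
    mean_observed = sum(observed) / n
    explained = sum((e - mean_observed) ** 2 for e in estimated)
    total = sum((o - mean_observed) ** 2 for o in observed)
    return explained / total * 100.0

observed = [10.0, 20.0, 30.0, 40.0]
perfect_fit = r_squared(observed, observed)         # estimates equal observations
mean_only = r_squared(observed, [25.0] * 4)         # every estimate is the mean
```

The two calls show the limiting cases described above: a model reproducing every observation explains the total variation, and a model that always returns the mean explains none of it.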
Adjusted coefficient of determination R² (adj.) For regression models, the R² and therefore the explained variation of the response variable increases with the number of predictor variables included in the model as described by Backhaus et al. (2011). The proportion of the explained variation may even increase by randomness with an increased number of variables. Therefore, the coefficient of determination is adjusted by taking the degrees of freedom into account as introduced by Theil (1961). For a comparison of regression models and in particular for a comparison of models with different numbers of included predictor variables, the adjusted coefficient of determination R² (adj.) serves as a more reliable measure not depending on the number of predictor variables included in a regression model (Fahrmeir et al., 2013). As for the R², higher values of the R² (adj.) indicate a better fit of the regression model to the underlying data sample. In the current study, the R² (adj.) is only employed as a performance measure for regression models. The R² (adj.) is determined as follows:
$$R^2_{\mathrm{adj.}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
where p is the number of predictor variables included in the regression model and n is the sample size.
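The adjustment can be illustrated with a few lines of Python. The sketch assumes the commonly quoted form 1 − (1 − R²)(n − 1)/(n − p − 1), with R² given as a fraction between 0 and 1; the function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² for a model with p predictors fitted to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("n must exceed p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model: R² = 0.80 with 5 predictors and 50 observations
r2, n, p = 0.80, 50, 5
adjusted = adjusted_r_squared(r2, n, p)

# Algebraically equivalent form: R² minus a penalty proportional to p
equivalent = r2 - (1 - r2) * p / (n - p - 1)
```

The adjusted value is always at most R², and the gap widens as predictors are added without a corresponding gain in explained variation.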
Mean absolute percentage error MAPE As a measure of performance, the mean absolute percentage error MAPE describes the accuracy of an estimation under consideration of the observed values. Therefore, the difference between the observed and the estimated values is divided by the observed value. The absolute percentage error is summed for all observations and divided by the total number of observations. In the current study, the MAPE serves as a measure for the comparison of the developed statistical models. As suggested by Fahrmeir et al. (2013), the MAPE is furthermore employed to draw unbiased conclusions about the estimation accuracy of the developed statistical models based on an independent test data sample (cf. Section 3.3). The following equation describes the calculation of the mean absolute percentage error MAPE:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100\,\%$$
where $y_i$ is the observed value for the observation $i$, $\hat{y}_i$ is the estimated value for the observation $i$, and $n$ is the sample size.
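As a worked illustration, the MAPE can be computed with the following Python sketch. The function name `mape` and the data are hypothetical; observed values of zero are not handled because the measure is undefined for them.

```python
def mape(observed, estimated):
    """Mean absolute percentage error in percent."""
    n = len(observed)
    return sum(abs((o - e) / o) for o, e in zip(observed, estimated)) / n * 100.0

# Hypothetical annual costs in Euro: the first is overestimated by 10 %,
# the second underestimated by 10 %.
observed = [100.0, 200.0]
estimated = [110.0, 180.0]
error = mape(observed, estimated)
```

Because absolute values are used, over- and underestimations do not cancel out, and the example yields an error of about 10 %.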
Normalised root mean square error CV(RMSE) The root mean square error RMSE is a measure for the accuracy of the estimated values of a statistical model by comparison to the observed values. The RMSE measures the standard deviation of the differences between the estimated values and the observed values of the underlying data sample as described by Ryan et al. (2012). As described by Hyndman and Koehler (2006), the standard deviation of the residuals is scale dependent. Therefore, the root mean square error RMSE is normalised by a division by the mean of the observed values of the data sample. Consequently, the normalised root mean square error CV(RMSE) can be employed for the comparison of various statistical models. Under consideration of the estimated and observed values, the CV(RMSE) is calculated as follows:
$$\mathrm{CV(RMSE)} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{\frac{1}{n}\sum_{i=1}^{n} y_i} \cdot 100\,\%$$
where $y_i$ is the observed value for the observation $i$, $\hat{y}_i$ is the estimated value for the observation $i$, and $n$ is the sample size.
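The normalisation of the RMSE by the mean of the observed values can be sketched in Python as follows; the function name `cv_rmse` and the data are hypothetical.

```python
from math import sqrt

def cv_rmse(observed, estimated):
    """Root mean square error divided by the mean observed value, in percent."""
    n = len(observed)
    rmse = sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)
    mean_observed = sum(observed) / n
    return rmse / mean_observed * 100.0

# Hypothetical annual costs in Euro
observed = [100.0, 200.0]
estimated = [110.0, 180.0]
value = cv_rmse(observed, estimated)

# Expressing the same data in cents leaves the measure unchanged
value_in_cents = cv_rmse([o * 100 for o in observed], [e * 100 for e in estimated])
```

The second call illustrates why the normalised measure is suitable for comparing models across cost groups of different magnitude: rescaling the data does not change the result.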
Outliers An outlier is defined as an observation with an extreme value for a variable that distorts the results of a statistical analysis as described by Tabachnick and Fidell (2009). In the current study, extremely large values of the error term for single observations (residuals) are identified as outliers. The outliers are employed as a measure of performance in order to describe and compare the estimation accuracy of the developed statistical models. The outliers are determined by measurement of the standardised difference between the observed value and the estimated value of an observation. The standardisation is conducted by a division of the residual by the standard deviation of the estimated values. If the outcome has a value larger than 2, the respective observation is counted as an outlier as suggested by Ryan et al. (2012). The standardised residual is determined according to the following function:
$$\text{Standardised residual} = \frac{y_i - \hat{y}_i}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i\right)^2}}$$
where $y_i$ is the observed value for the observation $i$, $\hat{y}_i$ is the estimated value for the observation $i$, and $n$ is the sample size.
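Counting outliers by standardised residuals can be sketched in Python. Following the description above, the residual is divided by the standard deviation of the estimated values (population form, divisor n). Comparing the absolute value with the limit of 2 is an assumption of this sketch, as are the function name and the data.

```python
from statistics import pstdev

def count_outliers(observed, estimated, limit=2.0):
    """Count observations whose absolute standardised residual exceeds the limit."""
    spread = pstdev(estimated)  # standard deviation of the estimated values
    standardised = [(o - e) / spread for o, e in zip(observed, estimated)]
    return sum(1 for s in standardised if abs(s) > limit)

# Hypothetical data: the last observation is far above its estimate
observed = [10.0, 12.0, 14.0, 16.0, 40.0]
estimated = [11.0, 12.0, 13.0, 15.0, 19.0]
n_outliers = count_outliers(observed, estimated)
```

Only the last observation, whose residual of 21 is several times the spread of the estimates, is counted as an outlier in this example.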
3 Data sample 3.1 Overview The reliable and valid measurement of data plays an essential role for the quality of the results of a statistical analysis as described by Tabachnick and Fidell (2009). The quantitative approach of the current investigation is based on empirical data describing a total sample of 253 operated facilities. In cooperation with 25 project partners, the employed data was collected in Germany between 2008 and 2014. The project partners are mainly public sector institutions as for example municipalities, universities, ecclesiastical administrations, social housing administrations, and social associations. The project partners provided data of up to 80 facilities with buildings constructed between the years 1370 and 2010. In Figure 3.1, the locations of the analysed facilities are illustrated. The data collection was primarily restricted to the south-west area of Germany with a large number of facilities located in the city of Stuttgart.
Figure 3.1. Locations of the analysed facilities
https://doi.org/10.1515/9783110596083-067
As described by Hox and Boeije (2005), a data collection can generally be differentiated into two categories: Primary data is collected specifically for the purpose of a research project or a study, whereas secondary data was originally collected for a different purpose as for example administrative reasons. The statistical analysis of the current investigation is based on a data collection process that was conducted on both a primary and a secondary level. The secondary data was obtained from the project partners and includes cost data, quantities, and schematic floor plans of the facilities. The cost data was submitted in digital form as for example spreadsheets and usually contained all cash flows related to a facility for an accounting period of between one and five years. Based on the descriptions of the cash flows, the cost data was classified according to the cost structure of the standard DIN 18960:2008-02 as described in Section 2.2. All cost data are adjusted to first quarter 2016 prices and include the current German VAT. Besides the cost data, the project partners usually provided spreadsheets of various quantities. On the basis of the obtained schematic floor plans of the facilities, the quantity data is classified according to the structure of the standards DIN 277-1:2016-01 and DIN 277-3:2005-04. Consequently, all received secondary data is processed in order to provide a consistent data base for the implementation of the current investigation. In a primary data collection, further information was collected on site at the facilities using a structured and standardised questionnaire. Interviews with the responsible facility managers or owners were conducted during the visits of the facilities. The data collected on site include detailed information on characteristics, conditions, standards, utilisations, locations, and management strategies of the facilities. Information related to the building construction, building services, and the site area was collected in accordance with the structure of the standard DIN 276-1:2008-12. Based on the detailed level of the collected data, information can be aggregated for the specific requirements of a cost group to be analysed. As basis for the empirical investigation of operating costs, the response and candidate predictor variables are presented in detail including descriptive statistics in the following Sections.
Furthermore, the representativeness of the data sample is discussed critically and conclusions about the applicability and restrictions of the results of the current investigation are drawn.
3.2 Presentation of the sample 3.2.1 Response variables As presented in Section 2.2, the annual operating costs are defined as the response variables of the current investigation. All cost data are obtained as secondary data from project partners and are classified in accordance with the cost structure of the standard DIN 18960:2008-02. The cost structure of the standard contains first, second, and third level cost groups as illustrated in Table 2.1 of Section 2.2.1 where the available cost data of the current investigation are classified into 15 different cost groups. On the most detailed third level of the cost structure, cost data can only be assigned to the sub-groups of the cost groups CG 310 for utility costs and CG 350 for operation, inspection and maintenance costs. Due to the limited extent of the cost descriptions and the structure of the provided cash flows of the facilities, further cost data can only be assigned to the second level cost groups of the standard. The cost data of the first level cost group CG 300 (operating costs) are aggregated from the respective second level cost groups. The data included in the investigation contain the current German value added tax rate and are adjusted to first quarter 2016 prices based on the figures of the German Federal Statistical Office (DESTATIS, 2017a,b) for consumer prices of construction works and the maintenance of buildings. Initially, the cost data sample of the current investigation contains 253 observations in total. In the course of a pre-analysis of the data, multiple filters are applied to the data sample. As a result, the size of the data sample employed for the analysis varies for the respective cost groups between 65 and 244 observations. Individual observations of the total sample are excluded from the data basis of the respective cost groups due to the limited availability of data. The analysis of cost groups with incomplete or missing cost data values may distort the reliability of the results of the developed statistical models. The data sample is therefore filtered according to the level of available cost data as essential information. The reasons for the exclusion of observations are described in detail in the theoretical basis of the analysis of the particular cost groups in Chapter 4. On the basis of descriptive statistics and an analytical and visual inspection of the distribution of the cost data and respective cost indicators (e.g. histograms, box plots), errors and measurement mistakes in the data sample are identified as suggested by Tabachnick and Fidell (2009).

Table 3.1. Operating cost indicators (per m² GEFA)
Cost group   Unit                Mean    Standard deviation   Lower quartile   Median   Upper quartile   n
CG 300       Euro/m² GEFA*year   39.84   15.52                29.27            38.84    48.37            194
CG 310       Euro/m² GEFA*year   15.22    5.26                11.24            14.22    18.01            206
CG 311       Euro/m² GEFA*year    1.35    1.15                 0.59             0.99     1.68            194
CG 312-316   Euro/m² GEFA*year    9.33    3.77                 6.61             8.42    11.23            206
CG 316       Euro/m² GEFA*year    4.47    2.57                 2.80             3.85     5.38            206
CG 320       Euro/m² GEFA*year    1.18    1.01                 0.40             0.85     1.78            200
CG 330       Euro/m² GEFA*year   14.20   10.70                 5.81            12.48    20.42            238
CG 340       Euro/m² GEFA*year    6.22    7.57                 1.45             3.59     7.60            169
CG 350       Euro/m² GEFA*year    5.46    3.75                 2.58             4.51     7.44            244
CG 352       Euro/m² GEFA*year    1.51    1.38                 0.61             1.07     1.90            195
CG 353       Euro/m² GEFA*year    2.27    1.39                 1.27             1.92     2.93            212
CG 354       Euro/m² GEFA*year    0.69    0.73                 0.16             0.42     0.97             65
CG 355       Euro/m² GEFA*year    0.94    0.96                 0.27             0.58     1.21             91
CG 360       Euro/m² GEFA*year    1.10    1.40                 0.17             0.38     1.63            149
CG 370       Euro/m² GEFA*year    1.08    0.94                 0.47             0.66     1.42            208
1st quarter 2016 prices including VAT.
Figure 3.2. Box plots of operating cost indicators (per m² GEFA) (one box plot per cost group; axis: cost indicators in Euro/m² GEFA*year, 0 to 80)
The quality of the current cost data is verified by a comparison with various publications providing annual operating cost information. Annually updated statistical operating cost data on the basis of currently 337 office buildings are published in the Office Service Charge Analysis Report by JLL (2016). The publication takes building sizes, standards, building characteristics, and locations for a classification of cost indicators into account. A comparison with the published data indicates plausibility for the operating cost data sample employed in the current investigation. Further verifications are carried out employing the data provided for example in the Evaluation System for Sustainable Building BNB by the BMVBS (2013) and the BMUB (2015). A comparison with operating cost indicators provided in the publications by BCIS (2007b) and by Rotermund (2016) likewise verifies the quality of the current sample. The annual operating costs employed as response variables in the current investigation are presented in Table 3.1. The gross external floor area GEFA is used as reference quantity for the compilation of operating cost indicators. Mean values, standard deviations, lower quartiles, median values, and upper quartiles describe the data of all analysed cost groups of DIN 18960:2008-02. The size of the respective data sample is presented by the number of observations n. Accordingly, Figure 3.2 illustrates the data distribution with box plots. Further descriptive statistics of the annual operating cost data employed as response variables in the current investigation are illustrated in the Appendix. The absolute costs of all 15 analysed cost groups according to DIN 18960:2008-02 and corresponding box plots are presented in Table A.1 and Figure A.1, respectively. A description of the distribution of the operating costs amongst the cost groups as percentage is displayed in Table A.2 and Table A.3 and as box plots in Figure A.2 and Figure A.3 of the Appendix. Furthermore, cost indicators of the underlying cost data employing various available areas as reference quantity are presented in Table A.4 and Figure A.4.
3.2.2 Predictor variables In order to conduct a statistical investigation with empirical data as a basis, variables potentially explaining or predicting the response variables have to be selected as described by Chatterjee and Hadi (2006). As presented in detail in Section 2.2.2, various variable groups with variables potentially influencing the operating costs are selected by a review of literature. Specific areas, the compactness, the function, the condition, the standard, the utilisation, the location, and the management strategy are defined as relevant variable groups in the current investigation. Furthermore, various quantities are selected as candidate reference units for the introduction of adequate operating cost indicators. The candidate reference quantities, the specific areas, and information on the compactness of the facilities were obtained as secondary data from the project partners. On the basis of spreadsheets and schematic floor plans, the data were classified according to the measurement rules and structures provided in the standards DIN 277-1:2016-01 and DIN 277-3:2005-04 in order to ensure a consistent data base for the investigation. The reference quantities employ the respective area in m² or volume in m³ as unit, whereas the specific areas refer to the gross internal floor area GIFA as a percentage. Descriptive statistics of the candidate reference quantities, the specific areas, and the compactness are presented in Table A.5, Table A.6, and Table A.7 in the Appendix for the total sample of 253 observations. The variables of the variable groups function, condition, standard, utilisation, location, and management strategy were collected as primary data on site at the facilities using a standardised questionnaire and in interviews with the responsible facility managers or owners.
The variable group function contains for example variables describing the number of elevator stops and the number of sanitary facilities and is presented by descriptive statistics in Table A.8 in the Appendix of the study for all 253 observations. The condition was assessed on site at the facilities on a precise level for all appraisable components of the building construction, building services, outdoor facilities, and furniture and equipment according to the standard DIN 276-1:2008-12. Based on the detailed level of the collected data, information is aggregated for the specific requirements of a cost group to be analysed. The aggregation of the condition is conducted under consideration of the respective construction costs for a component by a weighting according to the respective share of costs. Furthermore, the utilisation of a facility is considered in the weighting. For example, the share of defective building envelope as a candidate predictor variable on heating costs is aggregated from the condition of the building components base plate, external walls, and roofs. The shares of the defective base plate, external walls, and roofs in percent are therefore weighted by the respective construction costs of the components under consideration of the respective utilisation of the facility as published by BKI (2016) and aggregated to the variable share of defective building envelope in percent. The underlying components of the various aggregated conditions are presented in detail in the theoretical bases for the analyses of the respective cost groups in Chapter 4. Descriptive statistics of the conditions employed as candidate predictor variables are provided in Table A.9 in the Appendix. Qualitative information on the standard was collected as primary data on site at the facilities employing the standardised questionnaire. Therefore, the standard of the construction, technical installations, and outdoor facilities, grounds, furniture and equipment were assessed. The heat storage capacity of the structure is considered by the candidate predictor variable thermal mass and divides the total sample into facilities with light thermal mass and heavy thermal mass. The significance of the existence of conservation regulations for the entire facility or parts of the facility is examined by the qualitative variable protected structure. The flexibility of the construction and the technical installations comprise the respective building components as defined in the standard DIN 276-1:2008-12 and give information about the variability of the infrastructure in case of a structural modification. The qualitative variables standard of the technical installations, heating system, building automation, outdoor facilities, and furniture and equipment contain information on the fulfilment of the usage requirements of the respective components and are included with the characteristics high or low. The qualitative variable outdoor facilities included describes the availability of information on outdoor facilities.
Further information on the standard of the technical installations is given by the variable type of heating energy source characterising the heating system of the facilities. Descriptive statistics of the variables including detailed information on the characteristics are presented in the Appendix in Table A.10. Table 3.2 illustrates the wide variety of utilisations considered in the current investigation. The data sample with a total number of 253 observations is presented with a differentiation into the available types of facilities as qualitative candidate predictor of the variable group utilisation. The differentiation into the facility types is conducted according to the Catalogue for the Classification of Civil Works by Argebau (2010). Table 3.2 contains the number of observations for the respective characteristics and the share of the total number of observations in percent. Further information on the utilisation of the facilities is provided by the qualitative candidate predictor variables specific utilisation and type of water usage as presented in Table A.12 of the Appendix including detailed information on the available characteristics. The variable group location includes the qualitative variables urban location and type of topography and provides information on the surrounding area of the facilities and the topography of the site areas, respectively. Both variables and the available characteristics including their statistical distribution are described in the Appendix in Table A.11. Furthermore, the significance of the management strategy is investigated by the qualitative variable type of cleaning services as presented in Table A.13 of the Appendix.

Table 3.2. Qualitative candidate predictor variable type of facility
Candidate predictor variable: Type of facility
Characteristic               %[a]      n
Care retirement home          5.1 %    13
Church facility               5.1 %    13
Community hall                2.0 %     5
Fire department               2.0 %     5
Kindergarten                 48.2 %   122
Library                       1.6 %     4
Municipal facility            3.6 %     9
Research/teaching facility    4.7 %    12
Residential facility          8.7 %    22
School facility               8.7 %    22
Sport facility                7.5 %    19
Town hall                     2.8 %     7
[a] Total number of observations: 253.
3.3 Test sample As described by Fellows and Liu (2015), data used to develop a model cannot be used to validate the model. The validation of a model must be conducted with data not involved in model development; otherwise, the validation of the model may be distorted. According to Snee (1977), the splitting of a data sample for a cross-validation is an appropriate method to compare the fit of a model to the data and to measure the estimation accuracy. Therefore, the data sample of the current study is divided into two sub-samples as indicated in Section 2.3.6. A training sample is used to develop the statistical models and to introduce categorised cost indicators for the purpose of operating cost estimation. The training sample consists of approximately 90 % of the total observations. A test sample of approximately 10 % of the total observations is solely used for the purpose of performance validation. The observations included in the test sample are selected randomly and shall be representative for the total sample of the current investigation. The observations of the test sample are not included in the development of the statistical models or used for the introduction of categorised cost indicators. Consequently, the validation of the performance can be conducted under independent and unbiased conditions.
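A random division into a training and a test sample can be sketched in Python as follows. The function name `split_sample`, the fixed seed, and the use of consecutive identifiers are assumptions for illustration; the sketch does not reproduce the actual selection made in the study.

```python
import random

def split_sample(ids, test_share=0.10, seed=0):
    """Randomly split observation identifiers into (training, test) lists."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

# 253 hypothetical observation identifiers
training, test = split_sample(range(253))
```

Since every observation ends up in exactly one of the two lists, no observation of the test sample can influence the development of the models.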
Table 3.3. Comparison of the total, test, and training samples
Candidate predictor variable   n Total sample   n Training sample   n Test sample   %[a]
Care retirement home            13               12                   1              7.7 %
Church facility                 13               12                   1              7.7 %
Community hall                   5                5                   0              0.0 %
Fire department                  5                5                   0              0.0 %
Kindergarten                   122              110                  12              9.8 %
Library                          4                4                   0              0.0 %
Municipal facility               9[b]             7                   1             11.1 %
Research/teaching facility      12               11                   1              8.3 %
Residential facility            22               20                   2              9.1 %
School facility                 22               19                   3             13.6 %
Sport facility                  19               17                   2             10.5 %
Town hall                        7                6                   1             14.3 %
Total                          253[b]           228                  24              9.5 %
[a] Percentage of the test sample on the total sample.
[b] Includes the observation employed as implementation example in Chapter 6.
A comparison of the randomly selected training and test samples for model development and validation is presented in Table 3.3 under consideration of the respective types of facility. With a number of 24 observations, the test sample consists of 9.5 % of the total sample with 253 observations. For the presented types of facilities, the test sample includes a share of between 7.7 % and 14.3 % of the total observations. Since a limited number of observations is available for community halls, fire departments, and libraries, the respective types of facilities are not represented in the test sample. Nevertheless, the relatively consistent distribution of the observations regarding their utilisation indicates representativeness of the test sample for the total sample. Besides the training and test sample, a further data sample is employed solely for the development of artificial neural network models in order to avoid over-fitting as described in Section 2.3.3. The ANN-validation sample consists of approximately 20 % of the total observations, is selected randomly, and reveals a similar distribution of data as the presented test sample.
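The shares quoted above can be recomputed from the observation counts of Table 3.3 with a few lines of Python. The counts are copied from the table; the variable names are illustrative.

```python
# (n total sample, n test sample) per type of facility, copied from Table 3.3
samples = {
    "Care retirement home": (13, 1),
    "Church facility": (13, 1),
    "Community hall": (5, 0),
    "Fire department": (5, 0),
    "Kindergarten": (122, 12),
    "Library": (4, 0),
    "Municipal facility": (9, 1),
    "Research/teaching facility": (12, 1),
    "Residential facility": (22, 2),
    "School facility": (22, 3),
    "Sport facility": (19, 2),
    "Town hall": (7, 1),
}

# Share of the test sample on the total sample per type, in percent
shares = {name: round(100 * test / total, 1) for name, (total, test) in samples.items()}

n_total = sum(total for total, _ in samples.values())
n_test = sum(test for _, test in samples.values())
overall_share = round(100 * n_test / n_total, 1)
represented = [share for share in shares.values() if share > 0]
```

Among the types represented in the test sample, the shares range from the smallest to the largest value stated in the text, while the three types with very few observations have a share of zero.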
3.4 Representativeness
In order to conduct quantitative research on a statistical population, a sub-set of observations is selected from the population as a sample. The research can be carried out employing the selected data sample and statistical inferences can be drawn about the behaviour of the entire population as described by Fellows and Liu (2015). The determination of a sample simplifies the research to be conducted, for example by reducing the effort to collect and analyse data. The employed data sample should therefore provide an accurate representation of the statistical population. As introduced by Kahneman and Tversky (1972), the representativeness of a data sample is defined as the similarity of the sample and the population in essential characteristics under consideration of relevant conditions. For example, a data sample fulfils representativeness if the observations of the sub-set are selected randomly from the population. Consequently, a detailed description of the consistency of the data sample is crucial for the validity of the statistical inferences about the behaviour of the population.
In the current investigation, a critical discussion about the representativeness of the underlying data sample is essential in order to draw conclusions about the practical applicability of the results and respective restrictions. As described in the previous Sections, the cost data and further information on the observations of the investigation were obtained from multiple project partners. The selection of the facilities was usually conducted by the respective project partners, and the consistency of the data sample is therefore based on a restricted level of randomness in terms of statistics. Nevertheless, it is assumed that the selected facilities are representative of the real estate portfolio of the respective project partners. The participating project partners are mainly public sector institutions, for example municipalities, universities, ecclesiastical administrations, social housing administrations, and social associations. As presented in Section 3.2.1, the verification of the cost data of individual facility types, for example municipal buildings, revealed a certain conformity with facilities owned and operated by the private sector.
Nevertheless, only restricted statistical inferences can be drawn from the underlying data sample about facilities operated and owned by the private sector. Further limitations are expected in the analysis of management strategies as for example for the level of outsourced facility services or service level agreements. The analysis of management strategies is restricted to the outsourcing rate of cleaning services since other strategies and concepts are not existent for the observations provided by the participating project partners. Another restriction of the representativeness is expected in regard to the location. The data collection was primarily conducted in the south west area of Germany and a large amount of facilities included in the investigation are located in the city of Stuttgart. Therefore, the analysis of the variation of regional economic and climatic conditions is only available on a restricted level. In order to generalise the results of the current investigations for the application on facilities in other locations, the regional conditions can be taken into account by statistical data on the local economics of the construction sector as for example provided by the BKI (2016). Likewise, the variation of the climate conditions can be considered by statistical data on the local climate as for example presented in the standard VDI 3807-1:2013-06. The applicability of the results of the current investigation is limited by the scope of costs and cost types under investigation. As described in Section 2.2.1 in detail, the
operating costs analysed in the current study are determined according to the cost structure of the standard DIN 18960:2008-02. The application of the results of the statistical models and the categorised cost indicators is therefore restricted by the definition of the respective costs included in the various cost groups of the structure. In particular, the results of the aggregated first and second level cost groups require a detailed consideration of their scope when practically applied.
A further limitation of the representativeness of the data sample is indicated by the scope of the collected cost data. As presented in the previous Sections, the cost data provided by the project partners contained the cash flows related to a facility for an accounting period of between 1 and 5 years. As a result of the variation of the general price level, for example by inflation, operating costs may vary significantly depending on the different years of their observation. In order to provide a consistent database, the cost data included in the investigation are adjusted to first quarter 2016 prices based on the figures of DESTATIS (2017a,b). Correspondingly, the figures can be employed for an adjustment of the results of the investigation for a future application.
The maintenance costs vary significantly across the different stages of the life cycle of a facility as described, for example, by Bahr (2008). With relatively short accounting periods of between 1 and 5 years under consideration, the current investigation is restricted in its representativeness regarding the cost data for inspection and maintenance. Nevertheless, the investigation includes facilities constructed between the years 1370 and 2010 and therefore represents cost data of facilities in a variety of life cycle stages.
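The adjustment of cost data to a common price level can be sketched as a simple index ratio. The index values below are hypothetical placeholders and are not DESTATIS figures; the function name `adjust_to_base_period` is likewise an assumption for illustration.

```python
def adjust_to_base_period(cost, index_observed, index_base):
    """Scale a cost from its observation period to the base period.

    Both index values must refer to the same price index series.
    """
    return cost * index_base / index_observed


# Hypothetical example: a cost of 10,000 Euro observed in a year with an
# index value of 95.0, adjusted to a base period with an index value of 100.0
adjusted = adjust_to_base_period(10_000.0, index_observed=95.0, index_base=100.0)
print(round(adjusted, 2))
```

The same ratio, applied in the opposite direction, would carry the first quarter 2016 results forward to a later period.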
Finally, restrictions are expected for the limited amount of observations for individual characteristics of variables as for example the type of facility. With a limited number of observations available, the presence of outliers or errors in the underlying data may distort the results of a statistical analysis substantially as described by Tabachnick and Fidell (2009). Therefore, the impact of a limited data sample is significantly reduced by a detailed pre-analysis based on descriptive statistics and an analytical and visual inspection of the data.
4 Analysis results
4.1 Operating costs (CG 300)
4.1.1 Theoretical basis and variables
A data sample of 194 observations, including a training sample of 176 observations and a test sample of 18 observations, serves as a basis for the empirical investigation of operating costs (CG 300). Individual observations of the total sample are excluded from the data basis of the current analysis due to the limited availability of data. The underlying cost data of the current analysis are aggregated from several second level cost groups according to the standard DIN 18960:2008-02. Excluded observations are described in detail in the theoretical basis of the particular second level cost groups in the subsequent Sections. The underlying cost data of operating costs have median values of absolute costs of 45,182 Euro per year and cost indicators of 38.84 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). A general overview and a definition of the investigated cost data is given in Section 2.2.1. The examined first level cost group of operating costs (CG 300) is defined in the standard DIN 18960:2008-02 and is aggregated from the following second level cost groups:
– CG 310: Utilities
– CG 320: Disposal
– CG 330: Cleaning and care of buildings
– CG 340: Cleaning and care of outdoor facilities
– CG 350: Operation, inspection and maintenance
– CG 360: Security and surveillance
– CG 370: Statutory charges and contributions
Besides the CG 300 costs as response variable of the analysis, various candidate predictor variables are defined and their effect on operating costs is examined. The variables give detailed information on the quantities, characteristics, utilisation, location, and management strategy of the facilities. A general overview of the variable groups and the available variables is presented in Section 2.2.2 and Section 3.2.2, respectively. The following overview describes the variables included in the current investigation.
Quantities
Reference quantities
Various areas and volumes are examined in order to determine an adequate reference quantity for the estimation of operating costs. Therefore, the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 are analysed as candidate reference quantities in the current investigation.
https://doi.org/10.1515/9783110596083-077
Specific areas
The investigation of operating costs examines various specific areas as candidate predictor variables. The analysis includes the share of usable floor area UFA on the GIFA in %, the share of circulation area CA on the GIFA in %, and the share of sanitary area on the GIFA in % as defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12. The share of heatable GIFA in % according to VDI 3807-1:2013-06, the share of ventilated and air-conditioned GIFA in %, and the share of regularly cleaned GIFA in % according to DIN 277-1:2016-01 refer to the gross internal floor area GIFA as percentage and are likewise included. Describing outdoor facilities and grounds, the share of non-built site area nbSAR in % and the share of planted site area pSAR in % refer to the site area, are defined in the standard DIN 277-1:2016-01, and are also considered as candidate predictor variables.
Compactness
Describing the compactness of the facilities, the average floor size in m², the average storey height in m, and the number of floors are examined in the investigation of operating costs as candidate predictors. The average floor size in m² is calculated as the gross external floor area GEFA in m² according to DIN 277-1:2016-01 divided by the number of floors. The average storey height in m is calculated as the gross building volume GBV in m³ divided by the gross external floor area GEFA in m² as defined in the standard DIN 277-1:2016-01.
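The two derived compactness variables follow directly from the quantities named above. The following short sketch uses a hypothetical building; the function names and the input values are assumptions for illustration only.

```python
def average_floor_size(gefa_m2, number_of_floors):
    """Average floor size in m²: GEFA divided by the number of floors."""
    return gefa_m2 / number_of_floors


def average_storey_height(gbv_m3, gefa_m2):
    """Average storey height in m: GBV divided by GEFA."""
    return gbv_m3 / gefa_m2


# Hypothetical facility: 2,400 m² GEFA, 8,400 m³ GBV, 3 floors
floor_size = average_floor_size(2400.0, 3)
storey_height = average_storey_height(8400.0, 2400.0)
print(floor_size, storey_height)
```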
Function
A functional description of the facilities is given by the quantities number of elevator stops and number of sanitary facilities, which are therefore included in the analysis of the operating costs in the current Section. Likewise, the share of glass surfaces on above-grade exterior walls in % and the share of double or triple glazing on exterior glass surfaces in % as defined in the standard DIN 277-3:2005-04 are considered as candidate predictor variables describing the envelope of buildings.
Characteristics
Condition
– Condition of the construction: Aggregated from the condition of base plates, external walls, and roofs, the investigation includes the share of modernised building envelope and the share of defective building envelope as candidate predictor variables. The share of defective floorings describes the condition of floor and ceiling coverings and is likewise considered as candidate predictor influencing the operating costs. In order to summarise the condition of the construction of buildings, the share of defective construction comprises the condition of foundations, external and internal walls, ceilings, roofs, and structural fitments and is therefore examined.
– Condition of the technical installations: Describing the condition of sewage and fresh water drains and pipes, the share of defective water installations is considered as candidate predictor. Besides, the investigation quantifies the significance and influence of the share of defective sanitary installations including the condition of sanitary appliances and equipment. The share of defective heat supply systems is likewise examined, giving information about the condition of heat generators, heat distribution networks, radiators, and panel heating systems. In order to comprise the condition of high voltage installations, lighting systems, and telecommunication systems, the share of defective electrical installations is included as candidate predictor. The share of defective technical installations summarises the condition of sewerage, water, gas, heat supply, air treatment, electrical, and telecommunication installations, as well as transport and building automation systems, and is therefore taken into account as candidate predictor variable.
– Condition of the outdoor facilities, grounds, furniture and equipment: The investigation of operating costs includes the share of defective outdoor facilities and grounds as candidate predictor, aggregated from the condition of grounds, surfaces, external construction works, technical installations, fitments, water areas, and planting and sowing areas. Besides, the share of defective furniture and equipment including the condition of furniture, equipment, and art work is examined.
Standard
– Standard of the construction: The heat storage capacity of the structure is described by the predictor thermal mass. Therefore, the qualitative variable with the characteristics heavy thermal mass and light thermal mass is included in the investigation. Furthermore, the flexibility of building construction gives information about the variability of the infrastructure of the building in case of a structural modification and is considered as a variable potentially influencing the operating costs. The significance of the existence of building conservation regulations for the entire building or parts of the building is examined by the qualitative predictor protected structure.
– Standard of the technical installations: The variable flexibility of technical installations contains a description of the variability in case of a structural modification of the building and is included with the characteristics high flexibility and low flexibility. Comprising information about the sewerage, water, gas, heat supply, air treatment, electrical, communication, transport, and building automation systems, the standard of technical installations is examined in the investigation. Besides, the influence and significance of the standard of the heating system is quantified. The variable describes the availability of time scheduled programs and measurement of external factors. The availability of individual or automatised room controls of the heating, ventilation, shading, and lighting systems is considered by the qualitative predictor variable standard of building automation. A further description of the standard of the technical installations is given by the variable type of heating energy source with the characteristics district heating, electricity, gas, and oil heating.
– Standard of the outdoor facilities, grounds, furniture and equipment: The qualitative variable outdoor facilities included describes the availability of information on outdoor facilities for the respective observation. Summarising the standard of grounds, surfaces, external construction works, technical installations, fitments, water areas, and planting and sowing areas and their particular fulfilment of the usage requirements, the investigation includes the standard of outdoor facilities and grounds as candidate predictor. The standard of furniture and equipment is likewise examined and contains information about the fulfilment of the usage requirements of the furniture and equipment.
Utilisation
The utilisation of the facilities is represented by the candidate predictor variable type of facility. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation. The variable type of water usage with the characteristics process water usage and sanitary water usage specifies the water consumption according to the utilisation and is therefore examined.
Location
The investigation of operating costs includes the qualitative variable urban location with the characteristics urban and rural describing the surrounding area of the facilities. The variable type of topography is likewise examined, including a differentiation of the sample into the characteristics flat topography and sloped topography in order to describe the site area of the facility.
Strategy
With the characteristics internal cleaning services, external contractor, solely cleaning materials, and cleaning by tenants, the significance of the management strategy and the effect on cleaning and care costs of buildings is quantified by the candidate predictor variable type of cleaning services.
4.1.2 Model design and specifications
On the basis of the training sample with 176 observations and the variables presented above, various statistical models are developed stepwise and described. The statistical models aim to give an accurate estimation of the CG 300 costs as response variable. Furthermore, the models intend to reveal and describe the causal interrelationships between the response variable and the candidate predictor variables.
Model overview
For the determination of an adequate reference quantity, several linear regression models estimating absolute costs are developed. Each model contains one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models is presented in Table 4.1 including multiple measures of performance. With the highest value of the R² (adj.) of 93.6 % and the lowest value of the MAPE of 42.0 %, the best estimation accuracy is indicated for the model LR(abs)300GEFA including the GEFA as one of the predictor variables. The assumption is likewise confirmed by the lowest value of the CV(RMSE) of 31.5 % compared to the other developed models. Based on the results of the absolute cost models, the GEFA is determined as reference quantity for the introduction of cost indicators for the further investigation of operating costs.

Table 4.1. Models (incl. reference quantities) for the estimation of CG 300 absolute costs

| Absolute cost model (Euro/year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(abs)300GEFA | Linear regression (incl. GEFA) | 7 | 94.2 % | 93.6 % | 42.0 % | 31.5 % | 13 (7.4 %) | 176 |
| LR(abs)300GIFA | Linear regression (incl. GIFA) | 7 | 93.8 % | 93.2 % | 43.7 % | 32.4 % | 13 (7.4 %) | 176 |
| LR(abs)300UFA | Linear regression (incl. UFA) | 7 | 92.8 % | 92.0 % | 44.8 % | 35.1 % | 14 (8.0 %) | 176 |
| LR(abs)300GBV | Linear regression (incl. GBV) | 7 | 92.5 % | 91.7 % | 52.9 % | 35.7 % | 16 (9.1 %) | 176 |
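The performance measures MAPE and CV(RMSE) used in the comparison can be sketched as follows. The sketch applies the common textbook definitions, which are assumed here to correspond to those introduced in Section 2.3.6; the observed and estimated values are hypothetical.

```python
import math


def mape(observed, estimated):
    """Mean absolute percentage error in %."""
    errors = [abs(o - e) / abs(o) for o, e in zip(observed, estimated)]
    return 100.0 * sum(errors) / len(errors)


def cv_rmse(observed, estimated):
    """Coefficient of variation of the root mean square error in %."""
    n = len(observed)
    rmse = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)
    return 100.0 * rmse / (sum(observed) / n)


# Hypothetical absolute costs in Euro per year
observed = [100.0, 200.0, 400.0]
estimated = [110.0, 180.0, 400.0]
print(mape(observed, estimated), cv_rmse(observed, estimated))
```

The MAPE weights every observation equally in relative terms, whereas the CV(RMSE) is dominated by large absolute deviations. This may explain why the two measures differ considerably for the absolute cost models, where facility sizes vary widely.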
Employing the CG 300 cost indicators as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. A summary of the statistical models with the best performance is given in Table 4.2. The non-linear regression model NLR(ind)300 offers the best estimation accuracy with an R² (adj.) of 78.8 % and a MAPE of 15.1 % comparing the measures of performance. A CV(RMSE) of 19.1 % and the number of 5 outliers indicate the lowest estimation error compared to the other models. By transformation of both response and predictor variables, the MAPE decreases by 2.7 percentage points and the R² (adj.) increases by 8.7 percentage points for the non-linear regression model NLR(ind)300 compared to the linear regression model LR(ind)300. As indicated throughout all measures, the artificial neural network model ANN(ind)300 and the binary classification tree model BCT(ind)300 cannot offer an improvement of performance. Compared to the models estimating absolute costs, all developed cost indicator models show higher levels of accuracy as illustrated by the values of the MAPE and the CV(RMSE).

Table 4.2. Models for the estimation of CG 300 cost indicators

| Cost indicator model (Euro/m² GEFA*year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(ind)300 | Linear regression model | 7 | 73.3 % | 70.1 % | 17.8 % | 20.4 % | 5 (2.8 %) | 176 |
| NLR(ind)300 | Non-linear regression model | 7 | 81.1 % | 78.8 % | 15.1 % | 19.1 % | 5 (2.8 %) | 176 |
| ANN(ind)300 | Artificial neural network model | 7 | 72.6 % | - | 16.7 % | 20.6 % | 11 (6.3 %) | 176 |
| BCT(ind)300 | Binary classification tree model | 7 | 73.5 % | - | 16.4 % | 20.3 % | 9 (5.1 %) | 176 |
Regression model
In Table 4.3, the non-linear regression model NLR(ind)300 is presented in detail with all relevant specifications as the model with the best compliance to the underlying data sample comparing the developed cost indicator models. With the linear regression model as a basis, both response and quantitative predictor variables of the non-linear regression model are transformed according to their Box-Cox transformations (Box and Cox, 1964). The respective transformations are represented by the values of lambda (λ) and range from a natural logarithm transformation (λ=0) of the response variable (CG 300 cost indicators) to square root (λ=0.5) and square (λ=2) transformations of the included quantitative predictor variables. The empirical F-value of the model NLR(ind)300 of 35.23 exceeds the theoretical F-value of 1.65 as determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)300 can be concluded. Significant relationships between the response variable and 7 predictor variables can be determined as indicated by the p-values based on a significance level of alpha (α) set to 0.05. The absolute t-values of the determined variables exceed the threshold of 1.97 determined under consideration of a 95 % confidence level, the sample size, and the number of predictors (Backhaus et al., 2011). The coefficients (β) are presented for the regression constant and the determined predictor
variables. Besides, the standard deviation of the estimate of the coefficients to the underlying sample is displayed by the standard error of the coefficients.

Table 4.3. Description of coefficients of non-linear regression model NLR(ind)300

| Response variable | Transf. (λ) | R² | R² (adj.) | MAPE | CV(RMSE) | F-value | n |
|---|---|---|---|---|---|---|---|
| CG 300 (Euro/m² GEFA*year) | 0 (LN) | 81.1 % | 78.8 % | 15.1 % | 19.1 % | 35.23 | 176 |

| | Predictor variables | Transf. (λ) | Coef. (β) | Coef. SE | St. Coef. | t-value | p-value | VIF |
|---|---|---|---|---|---|---|---|---|
| β0 | Constant | - | 2.598 | 0.111 | 0.000 | 23.32 | 0.000 | - |
| X1 | Share of heatable GIFA (%) | 2 (SQ) | 0.472 | 0.110 | 0.246 | 4.28 | 0.000 | 1.98 |
| X2 | Share of regularly cleaned GIFA (%) | 2 (SQ) | 0.584 | 0.129 | 0.219 | 4.54 | 0.000 | 2.42 |
| X3 | Share of defective envelope (%) | 0.5 (SQR) | 0.329 | 0.071 | 0.176 | 4.60 | 0.000 | 1.21 |
| X4 | Share of defective technical installations (%) | 0.5 (SQR) | 0.361 | 0.061 | 0.210 | 5.90 | 0.000 | 1.13 |
| X5 | Type of heating energy source | - | - | - | 0.181 | - | 0.000 | - |
| | District heating | - | 0.000 | 0.000 | - | - | - | - |
| | Electricity | - | 0.252 | 0.082 | - | 3.08 | 0.002 | 1.31 |
| | Gas | - | 0.138 | 0.040 | - | 3.44 | 0.001 | 1.80 |
| | Oil | - | -0.014 | 0.067 | - | -0.20 | 0.839 | 1.56 |
| X6 | Type of facility | - | - | - | 0.492 | - | 0.000 | - |
| | Care retirement home | - | 0.000 | 0.000 | - | - | - | - |
| | Church facility | - | -0.873 | 0.092 | - | -9.53 | 0.000 | 2.03 |
| | Community hall | - | -0.783 | 0.108 | - | -7.23 | 0.000 | 1.46 |
| | Fire department | - | -0.580 | 0.129 | - | -4.50 | 0.000 | 1.67 |
| | Kindergarten | - | -0.196 | 0.065 | - | -3.02 | 0.003 | 4.75 |
| | Library | - | -0.664 | 0.119 | - | -5.60 | 0.000 | 1.41 |
| | Municipal facility | - | -0.492 | 0.107 | - | -4.58 | 0.000 | 1.71 |
| | Research/teaching facility | - | -0.361 | 0.100 | - | -3.62 | 0.000 | 1.95 |
| | School facility | - | -0.403 | 0.085 | - | -4.72 | 0.000 | 3.02 |
| | Sport facility | - | -0.257 | 0.088 | - | -2.91 | 0.004 | 2.74 |
| | Town hall | - | -0.317 | 0.106 | - | -2.98 | 0.003 | 1.68 |
| X7 | Type of cleaning services | - | - | - | 0.281 | - | 0.000 | - |
| | Internal cleaning services | - | 0.000 | 0.000 | - | - | - | - |
| | External contractor | - | 0.356 | 0.069 | - | 5.18 | 0.000 | 2.25 |
| | Only cleaning materials | - | -0.030 | 0.092 | - | -0.32 | 0.749 | 2.06 |
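The reported adjusted R² and F-value can be cross-checked against the R², the sample size, and the number of estimated coefficients. The sketch below uses the textbook formulas and assumes k = 19 estimated coefficients, counted as the 4 quantitative predictors plus the 3, 10, and 2 non-reference categories of the three qualitative predictors in Table 4.3. This count is an assumption for illustration and is not stated explicitly in the text.

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination for n observations and k regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_value(r2, n, k):
    """Empirical F-value of the overall regression."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n = 176       # training sample size
k = 19        # assumed number of estimated coefficients (excluding the constant)
r2 = 0.811    # R² of NLR(ind)300 as reported in Table 4.3

adj = adjusted_r2(r2, n, k)
f_emp = f_value(r2, n, k)
print(round(100 * adj, 1), round(f_emp, 2))
```

Under this assumption, both values agree with the figures in Table 4.3 up to rounding of the reported R².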
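The following equation describes the non-linear regression model NLR(ind)300:

\[
\hat{Y} = e^{\beta_0} \cdot e^{\beta_1 X_1^{2}} \cdot e^{\beta_2 X_2^{2}} \cdot e^{\beta_3 \sqrt{X_3}} \cdot e^{\beta_4 \sqrt{X_4}} \cdot e^{\beta_5 X_5} \cdot e^{\beta_6 X_6} \cdot e^{\beta_7 X_7}
\]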
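The application of the model equation can be sketched for a single hypothetical facility. The coefficients are taken from Table 4.3. It is assumed here that the shares X1 to X4 enter the model as fractions between 0 and 1, which is inferred from the magnitude of the coefficients and not stated in the text, and that each qualitative predictor contributes the coefficient of its observed category. The function name and the input values are illustrative.

```python
import math

# Coefficients of NLR(ind)300 from Table 4.3 (reference categories are 0.0)
HEATING = {"district heating": 0.0, "electricity": 0.252, "gas": 0.138, "oil": -0.014}
FACILITY = {
    "care retirement home": 0.0, "church facility": -0.873,
    "community hall": -0.783, "fire department": -0.580,
    "kindergarten": -0.196, "library": -0.664, "municipal facility": -0.492,
    "research/teaching facility": -0.361, "school facility": -0.403,
    "sport facility": -0.257, "town hall": -0.317,
}
CLEANING = {"internal cleaning services": 0.0, "external contractor": 0.356,
            "only cleaning materials": -0.030}


def estimate_cg300(sh_heatable, sh_cleaned, sh_def_envelope, sh_def_tech,
                   heating, facility, cleaning):
    """Estimate CG 300 cost indicators in Euro/m² GEFA*year.

    Shares are assumed to be fractions in [0, 1] (assumption, see lead-in).
    """
    log_cost = (2.598
                + 0.472 * sh_heatable ** 2
                + 0.584 * sh_cleaned ** 2
                + 0.329 * math.sqrt(sh_def_envelope)
                + 0.361 * math.sqrt(sh_def_tech)
                + HEATING[heating] + FACILITY[facility] + CLEANING[cleaning])
    return math.exp(log_cost)   # back-transformation of the LN response


# Hypothetical kindergarten with gas heating and an external cleaning contractor
estimate = estimate_cg300(0.90, 0.80, 0.10, 0.05,
                          "gas", "kindergarten", "external contractor")
print(round(estimate, 2))
```

Summing the terms on the logarithmic scale and exponentiating once is equivalent to the product of exponentials in the equation above.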
The size of the effect of the predictor variables on the response variable is indicated by the standardised coefficients as described by Ryan et al. (2012). In order to determine the standardised coefficients of the categorical predictors, the regression is recalculated with the coefficients of the dummy variables as introduced by Eisinga et al. (1991). For the non-linear regression model NLR(ind)300, the largest effect is indicated for the qualitative predictor variables type of facility and type of cleaning services. None of the values of the VIF of the predictor variables exceeds the selected threshold of 5 (cf. Section 2.3.2). Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables, a stable regression model is indicated (cf. Chatterjee and Hadi, 2006).
Figure 4.1. Residuals for non-linear regression model NLR(ind)300 (panels: scatter plot of standardised residuals against fitted values; histogram of the frequency of standardised residuals)
The distribution of the residuals (difference between observed and estimated value) is presented in the residual plots in Figure 4.1 and gives further information about the quality of fit of the model NLR(ind)300 to the underlying data sample. The standardised residuals appear to be uncorrelated to the estimated values and are distributed normally as illustrated by the scatter plot and histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates a correctly specified model and an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points (cf. Fahrmeir et al., 2013).
Binary classification tree model
Compared to the non-linear regression model NLR(ind)300, the developed binary classification tree model BCT(ind)300 cannot offer an improvement of performance as indicated by the achieved values of the R² of 73.5 % and the MAPE of 16.4 %. The performance measures of the model (cf. Table 4.2) nevertheless indicate a relatively high estimation accuracy. Since classification tree models may have the advantage of providing clear information on the importance of significant predictor variables as described by Tso and Yau (2007), the developed BCT(ind)300 model is likewise presented in the current Section. Figure 4.2 illustrates the structure of the tree and Table 4.4 summarises relevant parameters and specifications of the model.
[Tree diagram; recoverable node labels: root node re_300_GEFA with mean 39.74 and n = 176; node for care_ret, kind_gar, school, sp_fac with mean 44.73 and n = 133.]
Figure 4.2. Tree diagram of binary classification tree model BCT(ind)300
Based on the training sample of 176 observations, the model BCT(ind)300 is developed with the classification and regression tree growing method CRT as described in Section 2.3.4 in detail. With a tree depth of 7 layers, the developed model includes 29 nodes in total, of which 15 are terminal nodes with stop rules. A significant effect on the operating costs is indicated for 7 predictor variables. The 7 identified predictor variables correspond with the variables identified by the non-linear regression model NLR(ind)300. The model BCT(ind)300 displays the largest effect for the qualitative variable type of facility and the quantitative variable share of heatable GIFA. Though minor differences of the size of the effect of the predictors can be observed, the comparison of the models NLR(ind)300 and BCT(ind)300 reveals a high level of conformity of the results and therefore indicates a correct specification of the models.

Table 4.4. Specifications and results of binary classification tree model BCT(ind)300

| Parameter | Specifications |
|---|---|
| Growing method | Classification and regression tree CRT |
| Dependent variable | re_300_GEFA (CG 300, Euro/m² GEFA*year) |
| Sample size | 176 |
| Minimum cases in nodes | 3 |

| Results | |
|---|---|
| Independent variables | qv_Util (Type of facility); cn_sh_hGIFA (Share of heatable GIFA, %); qv_CleanServ (Type of cleaning services); cn_sh_cGIFA (Share of regularly cleaned GIFA, %); cn_sh_defTecIn (Share of defective technical installations, %); qv_EnSource (Type of heating energy source); cn_sh_defEnv (Share of defective envelope, %) |
| Number of nodes | 29 |
| Number of terminal nodes | 15 |
| Tree depth | 7 |
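The principle of a single binary split in a regression tree can be sketched as follows. The sketch assumes the common least-squares splitting criterion for a quantitative predictor and a minimum of 3 cases per node as listed in Table 4.4. It is not the software procedure used in the study, and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_cases=3):
    """Find the threshold on x that minimises the summed SSE of both child nodes.

    Returns (threshold, summed SSE) or None if no admissible split exists.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_cases, len(pairs) - min_cases + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: share of heatable GIFA (fraction) and cost indicators
share = [0.2, 0.3, 0.4, 0.5, 0.8, 0.9, 1.0, 1.0]
cost = [20.0, 22.0, 21.0, 23.0, 45.0, 47.0, 44.0, 46.0]
threshold, total_sse = best_split(share, cost)
print(threshold, total_sse)
```

A full tree would repeat this search recursively over all predictors in each child node until a stop rule, such as the minimum number of cases, is reached.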
4.1.3 Categorised cost indicators
As described in Section 2.3.5, the observations of the training sample are employed to introduce median values of cost indicators for the purpose of estimation. The indicators are presented with a categorisation based on the results of the developed statistical models. Therefore, the underlying cost data are defined as the operating costs and the respective reference quantity is defined as the gross external floor area GEFA as presented in the previous Section. The categorisation of the cost indicators is determined according to the predictor variables with the largest effect on operating costs. As indicated by the standardised coefficients of the non-linear regression model NLR(ind)300, the largest effect is revealed for the qualitative variables type of facility and type of cleaning services. Therefore, the operating cost indicators are categorised according to their type of facility as presented in Table 4.5. A further categorisation is made by the type of cleaning services with a distinction between internal cleaning services and an external contractor. Sub-categorised cost indicators are only available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25 % percentile) and upper quartiles (75 % percentile) are introduced for all categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The presented median values of the operating cost indicators range between 14.74 Euro/m² GEFA*year for church buildings and 56.37 Euro/m² GEFA*year for care retirement homes with an external contractor for cleaning services.

Table 4.5. Categorised CG 300 cost indicators

| Type of facility | Lower quartile[a] | MV(ind)300[a] | Upper quartile[a] | n[b] |
|---|---|---|---|---|
| Care retirement home | 30.62 | 46.42 | 68.64 | 12 |
| – Internal cleaning services | 23.81 | 27.06 | 32.61 | 4 |
| – External contractor | 46.23 | 56.37 | 70.60 | 8 |
| Church facility | 12.78 | 14.74 | 20.27 | 10 |
| Community hall | 16.35 | 21.22 | 23.25 | 5 |
| Fire department | 14.36 | 20.16 | 30.91 | 4 |
| Kindergarten | 36.21 | 44.26 | 54.96 | 88 |
| – Internal cleaning services | 23.98 | 34.99 | 41.39 | 4 |
| – External contractor | 38.75 | 46.73 | 57.94 | 76 |
| – Only cleaning materials | 16.46 | 24.86 | 39.07 | 8 |
| Library | 22.61 | 28.28 | 31.38 | 4 |
| Municipal facility | 20.60 | 24.55 | 34.17 | 6 |
| Research/teaching facility | 24.80 | 28.68 | 39.79 | 8 |
| School facility | 30.08 | 33.78 | 36.76 | 18 |
| Sport facility | 39.72 | 43.17 | 49.01 | 15 |
| Town hall | 25.44 | 26.73 | 33.25 | 6 |

[a] CG 300 cost indicators (Euro/m² GEFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 176 observations.
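The derivation of categorised cost indicators can be sketched as a grouping step followed by the calculation of quartiles. The sketch assumes the "inclusive" quartile method of the Python standard library, which may differ from the percentile definition used in the study, and it applies the minimum of 3 observations mentioned above to every category. The records are hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles


def categorised_indicators(records, min_obs=3):
    """Return {category: (lower quartile, median, upper quartile)}.

    Categories with fewer than min_obs observations are omitted.
    """
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    result = {}
    for category, values in groups.items():
        if len(values) < min_obs:
            continue
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        result[category] = (q1, median(values), q3)
    return result


# Hypothetical cost indicators in Euro/m² GEFA*year
records = [("Town hall", 25.0), ("Town hall", 27.0), ("Town hall", 33.0),
           ("Library", 28.0), ("Library", 30.0)]
indicators = categorised_indicators(records)
print(indicators)
```

In this example the category with only two observations is omitted, which mirrors the restriction applied to the sub-categories in Table 4.5.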
4.1.4 Performance validation
The development of the statistical models for the estimation of operating cost indicators is conducted on the basis of a training sample of 176 observations. As summarised in Section 4.1.2, the description of the models contains multiple measures of performance. Hereinafter, unbiased statistical inferences about the estimation accuracy of the developed models and the estimation by the median values of the cost indicators are drawn employing an independent test sample of 18 observations. As described in Section 3.3, the test sample is selected randomly, is intended to be representative of the total sample, and is not employed for model development. The values of the percentage errors PE are presented for all statistical models and the estimation by the categorised cost indicators in Table 4.6. The PE values are calculated by applying the observed characteristics of the 18 observations to the statistical models. The median values of the cost indicators are selected under consideration of the respective characteristics. Furthermore, the respective preference for the most accurate estimation method is given and the estimation accuracy is summarised by the mean absolute percentage error MAPE for the test sample.
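The aggregation of the percentage errors into the MAPE of the test sample can be retraced with the PE values reported for the model NLR(ind)300 in Table 4.6. The sign convention of the helper `percentage_error` (estimated minus observed, relative to observed) is an assumption, since the definition is given outside this Section.

```python
def percentage_error(observed, estimated):
    """Percentage error in %; the sign convention is an assumption."""
    return 100.0 * (estimated - observed) / observed


def mean_absolute_percentage_error(pe_values):
    """MAPE in % from a list of percentage errors in %."""
    return sum(abs(pe) for pe in pe_values) / len(pe_values)


# PE values of NLR(ind)300 for the 18 test observations (Table 4.6)
pe_nlr = [-53.3, 15.6, 27.9, -14.2, -1.0, -8.0, -0.2, 12.4, 17.5,
          -19.2, 26.2, -32.7, -10.0, 24.7, 19.1, 12.6, -4.5, 3.5]

mape_nlr = mean_absolute_percentage_error(pe_nlr)
print(round(mape_nlr, 1))
```

Because absolute values are averaged, over- and underestimations do not cancel each other out in the MAPE.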
Table 4.6. Comparison of PE and MAPE values (test sample) for CG 300 estimation methods

| Obs. | Type of facility | LR(ind)300 | NLR(ind)300 | ANN(ind)300 | BCT(ind)300 | MV(ind)300 | Preference |
|---|---|---|---|---|---|---|---|
| 5 | Kindergarten | -40.2 % | -53.3 % | -44.8 % | -29.0 % | -6.4 % | MV(ind)300 |
| 24 | Church facility | 18.4 % | 15.6 % | 19.8 % | 14.0 % | 29.7 % | BCT(ind)300 |
| 34 | Kindergarten | 21.1 % | 27.9 % | 17.2 % | -17.6 % | 44.5 % | ANN(ind)300 |
| 40 | Town hall | -25.3 % | -14.2 % | -25.7 % | -70.2 % | -40.3 % | NLR(ind)300 |
| 47 | Kindergarten | -1.9 % | -1.0 % | -1.3 % | 18.3 % | -8.5 % | NLR(ind)300 |
| 54 | Kindergarten | -8.9 % | -8.0 % | -4.7 % | 2.4 % | 11.0 % | BCT(ind)300 |
| 57 | Research/teaching | -1.5 % | -0.2 % | -10.5 % | 11.1 % | 14.7 % | NLR(ind)300 |
| 68 | School facility | 14.2 % | 12.4 % | 17.4 % | 24.3 % | 37.6 % | NLR(ind)300 |
| 92 | Kindergarten | 14.4 % | 17.5 % | 4.9 % | 23.4 % | 6.1 % | ANN(ind)300 |
| 112 | Sport facility | -27.9 % | -19.2 % | -28.7 % | -2.9 % | -36.6 % | BCT(ind)300 |
| 130 | Kindergarten | 28.5 % | 26.2 % | 24.6 % | 27.5 % | 40.7 % | ANN(ind)300 |
| 167 | Care retirement home | -41.8 % | -32.7 % | -30.7 % | 22.6 % | 7.8 % | MV(ind)300 |
| 187 | School facility | -11.8 % | -10.0 % | -23.6 % | -9.2 % | -4.8 % | MV(ind)300 |
| 202 | Municipal facility | 20.7 % | 24.7 % | 22.8 % | 5.4 % | 34.1 % | BCT(ind)300 |
| 211 | School facility | 20.3 % | 19.1 % | 11.2 % | 9.9 % | 13.6 % | BCT(ind)300 |
| 230 | Kindergarten | 10.9 % | 12.6 % | 12.1 % | 16.5 % | 3.3 % | MV(ind)300 |
| 237 | Sport facility | -3.1 % | -4.5 % | -21.6 % | -4.9 % | -28.6 % | LR(ind)300 |
| 247 | Kindergarten | -6.0 % | 3.5 % | -3.8 % | -12.6 % | -30.5 % | NLR(ind)300 |
| Total (MAPE) | | 17.6 % | 16.8 % | 18.1 % | 17.9 % | 22.2 % | NLR(ind)300 |

[Figure 4.3: absolute percentage error APE (axis from 0.0 to 0.7) for the test and training samples of LR(ind)300, NLR(ind)300, ANN(ind)300, BCT(ind)300, and MV(ind)300]

Figure 4.3. APE values (test and training sample) for CG 300 estimation methods
As already indicated by the performance measures of the training sample, the values of the MAPE for the test sample confirm that the model NLR(ind)300 provides the most accurate estimation of operating costs. The values of the PE range between an error of -0.2 % for the most accurate estimation and -53.3 % for the estimation with the lowest accuracy. A comparison of the MAPE values of the test sample (16.8 %) and training sample (15.1 %) confirms a consistent estimation performance for the model NLR(ind)300. Figure 4.3 gives an overview of the achieved values of the MAPE for both test and training samples for all methods. Likewise, the figure illustrates the distribution of the absolute percentage errors APE for all observations. A comparison of the respective test and training samples reveals a consistent distribution of the APE values. Therefore, it can be assumed that the statistical models are correctly specified, and a consistent estimation performance beyond the training sample is validated.
4.1.5 Summary

Based on the analysis of a data sample of 194 observations, various statistical models are developed to estimate operating costs (cost group 300). The gross external floor area GEFA is determined as the most adequate reference quantity. The most accurate estimation of cost indicators is provided by a non-linear regression model as indicated by a mean absolute percentage error of 15.3 % for the total data sample. Significance is revealed for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Type of cleaning services (strategy)
– Share of heatable gross internal floor area GIFA (quantity)
– Share of regularly cleaned gross internal floor area GIFA (quantity)
– Share of defective technical installation (condition)
– Type of heating energy source (standard)
– Share of defective envelope (condition)
4.2 Utilities (CG 310)

4.2.1 Theoretical basis and variables

The empirical analysis of utility costs is based on a data sample of 206 observations including a training sample of 188 observations and a test sample of 18 observations. Individual observations of the total sample are excluded from the data basis of the current analysis due to a limited availability of data. For residential facilities in particular, tenants are charged directly by the respective suppliers for their water, heating energy, and electricity consumption. The utility cost data of these facilities cannot be provided by the project partners, either in part or in total, and the corresponding observations are excluded from the analysis. Furthermore, the utility cost data of some of the analysed facilities are only available on an aggregated level for multiple facilities and cannot be differentiated and allocated to a particular facility. These observations are likewise excluded from the current analysis. The analysed cost data of the underlying data sample include the second level cost group of utilities (CG 310) as defined in the standard DIN 18960:2008-02 and have median values of absolute costs of 16,000 Euro per year and cost indicators of 14.22 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). The utility costs represent a relatively high share of about 36.2 % of the total operating costs (CG 300). A general overview and definition of the investigated cost data is presented in Section 2.2.1. The examined costs are aggregated from the following third level cost groups according to DIN 18960:2008-02:
– CG 311: Water
– CG 312: Oil
– CG 313: Gas
– CG 314: Solid fuels
– CG 315: District heating
– CG 316: Electricity
– CG 317: Technical and operational materials

Besides the utility costs as response variable of the analysis, various candidate predictor variables are defined and examined regarding their effect on utility costs. The variables give detailed information on the quantities, characteristics, utilisation, and location of the facilities. A general overview of the variable groups and the available variables is given in Section 2.2.2 and Section 3.2.2, respectively. The variables included in the current investigation of CG 310 costs are presented in the following overview.
Quantities

Reference quantities

In order to determine an adequate reference quantity for the estimation of utility costs, various areas and volumes are examined. The investigation therefore includes the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 as candidate reference quantities.

Specific areas

Various specific areas are examined as candidate predictor variables in the analysis of the utility costs. The investigation includes the share of the usable floor area UFA in the GIFA in % and the share of sanitary area in the GIFA in % as defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12. The share of the heatable GIFA in % according to VDI 3807-1:2013-06 and the share of ventilated and air-conditioned GIFA in % according to DIN 277-1:2016-01 likewise refer to the gross internal floor area GIFA as a percentage and are included.
Compactness

For a description of the compactness of buildings, the investigation of the CG 310 costs examines the average floor size in m², the average storey height in m, and the number of floors as candidate predictors. The average floor size in m² is calculated as the gross external floor area GEFA in m² according to DIN 277-1:2016-01 divided by the number of floors. The average storey height in m is calculated as the gross building volume GBV in m³ divided by the gross external floor area GEFA in m² as defined in the standard DIN 277-1:2016-01.
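The two derived compactness quantities can be sketched in a few lines of Python. The function names and the example facility (1,200 m² GEFA, 4,200 m³ GBV, 3 floors) are hypothetical and serve only to illustrate the arithmetic described above.

```python
# Sketch of the derived compactness predictors.
# The function names and the example values are illustrative assumptions.


def average_floor_size(gefa_m2, number_of_floors):
    """Average floor size in m²: GEFA divided by the number of floors."""
    if number_of_floors <= 0:
        raise ValueError("number_of_floors must be positive")
    return gefa_m2 / number_of_floors


def average_storey_height(gbv_m3, gefa_m2):
    """Average storey height in m: GBV divided by GEFA."""
    if gefa_m2 <= 0:
        raise ValueError("gefa_m2 must be positive")
    return gbv_m3 / gefa_m2


if __name__ == "__main__":
    # Hypothetical facility: 1,200 m² GEFA, 4,200 m³ GBV, 3 floors.
    print(average_floor_size(1200.0, 3))           # m² per floor
    print(average_storey_height(4200.0, 1200.0))   # m per storey
```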
Function

A functional description is given by the quantities number of elevator stops and number of sanitary facilities, which are therefore included in the analysis of the utility costs in the current Section. Likewise, the share of glass surfaces on above-grade exterior walls in % and the share of double or triple glazing on exterior glass surfaces in % are defined in the standard DIN 277-3:2005-04 and are considered as candidate predictor variables describing the envelope of the facilities.
Characteristics

Condition

– Condition of the construction: The investigation includes the share of modernised building envelope and the share of defective building envelope, aggregated from the condition of base plates, external walls, and roofs, as candidate predictor variables influencing the utility costs.
– Condition of the technical installations: The share of defective water installations is considered as candidate predictor comprising the condition of sewage and fresh water drains and pipes. Besides, the investigation quantifies the significance and influence of the share of defective sanitary installations including the condition of sanitary appliances and equipment. The share of defective heat supply systems is likewise examined, giving information about the condition of heat generators, heat distribution networks, radiators, and panel heating systems. In order to cover the condition of high voltage installations, lighting systems, and telecommunication systems, the share of defective electrical installations is included as candidate predictor.
Standard

– Standard of the construction: The qualitative predictor variable thermal mass describes the heat storage capacity of the structure and is included in the investigation of the CG 310 costs with the characteristics heavy thermal mass and light thermal mass.
– Standard of the technical installations: Comprising information about the sewerage, water, gas, heat supply, air treatment, electrical, communication, transport, and building automation systems, the standard of technical installation is examined in the investigation. Besides, the influence and significance of the standard of the heating system is quantified; this variable describes the availability of time scheduled programs and measurement of external factors. The availability of individual or automated room controls of the heating, ventilation, shading, and lighting systems is considered by the qualitative predictor variable standard of building automation. A further description of the standard of the technical installations is given by the variable type of heating energy source with the characteristics district heating, electricity, gas, and oil heating.
Utilisation

The investigation of the utility costs includes the type of facility as candidate predictor variable to take the utilisation of the facilities into account. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation. The variable type of water usage with the characteristics process water usage and sanitary water usage specifies the water consumption according to the utilisation and is therefore examined.
Location

The qualitative variable urban location describes the surrounding area of the facilities and is included with the characteristics urban location and rural location in the investigation of utility costs. The variable type of topography is likewise examined, differentiating the sample into the characteristics flat topography and sloped topography in order to describe the site area of the facilities.
4.2.2 Model design and specifications

Using the presented response and predictor variables and the sample of 188 observations as a basis, various statistical models are developed stepwise and presented in the current Section. These statistical models aim to reveal and describe the causal interrelationships of the utility costs and the candidate predictor variables presented above. Furthermore, the models are intended to give an accurate estimation of utility costs.
Model overview

In order to identify an adequate reference quantity, several linear regression models estimating absolute costs are developed, each of which contains one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models and their measures of performance presented in Table 4.7 indicates that the model LR(abs)310GIFA including the GIFA as one of the predictor variables offers the best estimation accuracy with the highest value of the R² (adj.) of 93.5 %, the lowest value of the MAPE of 33.9 %, and the lowest value of the CV(RMSE) of 39.7 %. Based on the evaluation and comparison of the absolute cost models, the GIFA is determined as adequate reference quantity for the further investigation of CG 310 costs and corresponding cost indicators are established.

Table 4.7. Models (incl. reference quantities) for the estimation of CG 310 absolute costs

| Absolute cost model (Euro/year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(abs)310GEFA | Linear regression (incl. GEFA) | 4 | 93.9 % | 93.4 % | 36.2 % | 40.1 % | 10 (5.3 %) | 188 |
| LR(abs)310GIFA | Linear regression (incl. GIFA) | 4 | 94.0 % | 93.5 % | 33.9 % | 39.7 % | 10 (5.3 %) | 188 |
| LR(abs)310UFA | Linear regression (incl. UFA) | 4 | 92.1 % | 91.5 % | 35.9 % | 45.5 % | 11 (5.9 %) | 188 |
| LR(abs)310GBV | Linear regression (incl. GBV) | 4 | 93.4 % | 92.9 % | 42.9 % | 41.8 % | 14 (7.4 %) | 188 |
Table 4.8. Models for the estimation of CG 310 cost indicators

| Cost indicator model (Euro/m² GIFA*year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(ind)310 | Linear regression model | 6 | 60.8 % | 56.9 % | 16.7 % | 20.6 % | 10 (5.3 %) | 188 |
| NLR(ind)310 | Non-linear regression model | 6 | 63.6 % | 60.0 % | 15.6 % | 19.8 % | 7 (3.7 %) | 188 |
| ANN(ind)310 | Artificial neural network model | 6 | 53.8 % | - | 18.6 % | 22.3 % | 10 (4.8 %) | 188 |
| BCT(ind)310 | Binary classification tree model | 5 | 58.4 % | - | 16.5 % | 21.1 % | 10 (4.8 %) | 188 |
Employing utility cost indicators as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed and evaluated. The statistical models with the best performance are summarised and presented in Table 4.8. Comparing these models and their measures of performance, the non-linear regression model NLR(ind)310 offers the best estimation accuracy with an R² (adj.) of 60.0 % and a MAPE of 15.6 %. Likewise, the CV(RMSE) of 19.8 % and the number of 7 outliers (3.7 % of the training sample) indicate the lowest estimation error compared to the other models. The transformation of both response and predictor variables improves the value of the MAPE by 1.1 percentage points and increases the R² (adj.) by 3.1 percentage points compared to the linear regression model LR(ind)310. The artificial neural network model ANN(ind)310 and the binary classification tree model BCT(ind)310 cannot improve the performance and accuracy as indicated by all measures. All developed cost indicator models nevertheless show higher levels of accuracy than the models estimating absolute costs, as indicated by the values of the MAPE and the CV(RMSE).
Regression model

The non-linear regression model NLR(ind)310, the model with the best fit to the underlying data sample, is presented in detail with all relevant specifications in Table 4.9. Based on the developed linear regression model, the non-linear regression model is introduced with transformations of both response and quantitative predictor variables, as determined by Box-Cox plots (Box and Cox, 1964). The applied transformations are presented by their respective values of lambda (λ) and range from a natural logarithm transformation (λ=0) of the response variable (utility cost indicators) to square (λ=2) and square root (λ=0.5) transformations of the included quantitative predictor variables. With a value of 17.50, the empirical F-value of the model NLR(ind)310 exceeds the theoretical F-value of 1.68 as determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)310 can be concluded. Significant relationships between the response variable and 6 predictor variables are detected as indicated by the p-values of the predictors with a significance level of alpha (α) set to 0.05. The regression constant and the determined predictor variables are presented with their respective coefficients (β). The standard error of each coefficient describes the standard deviation of its estimate. As described by Ryan et al. (2012), the size of the effect of the predictor variables is indicated by the standardised coefficients. The standardised coefficients of the categorical predictors are determined by re-calculation of the regression with the coefficients of the dummy variables as introduced by Eisinga et al. (1991).

For the model NLR(ind)310, the qualitative predictor variables type of facility and type of heating energy source show the largest effect on the utility costs. Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables (cf. Chatterjee and Hadi, 2006), none of the values exceeds the selected threshold of 5, which indicates a stable regression model.
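The reported summary statistics of the model can be cross-checked against each other with the standard textbook formulas for the adjusted R² and the empirical F-value. The following sketch assumes that the model contains 17 estimated regressors (4 quantitative predictors plus 3 and 10 dummy variables for the two categorical predictors); this count is an assumption inferred from the coefficient table and is not stated in the text. The function names are hypothetical.

```python
# Cross-check of R², R² (adj.) and the empirical F-value.
# Assumption: 17 regressors (4 quantitative + 3 + 10 dummy variables);
# this count is inferred, not reported in the source.


def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted coefficient of determination."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


def empirical_f(r2, n_obs, n_regressors):
    """Empirical F-value of the overall significance test."""
    explained = r2 / n_regressors
    unexplained = (1.0 - r2) / (n_obs - n_regressors - 1)
    return explained / unexplained


N_OBS = 188         # training sample size
N_REGRESSORS = 17   # assumed number of estimated slope coefficients

if __name__ == "__main__":
    print(adjusted_r2(0.636, N_OBS, N_REGRESSORS))   # NLR(ind)310
    print(adjusted_r2(0.608, N_OBS, N_REGRESSORS))   # LR(ind)310
    print(empirical_f(0.636, N_OBS, N_REGRESSORS))   # NLR(ind)310
```

Under this assumption, the sketch yields values close to the reported R² (adj.) of 60.0 % and 56.9 % and to the reported F-value of 17.50; the small remaining differences are consistent with rounding of the published R².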
Table 4.9. Description of coefficients of non-linear regression model NLR(ind)310

| Response variable | Transf. (λ) | R² | R² (adj.) | MAPE | CV(RMSE) | F-value | n |
|---|---|---|---|---|---|---|---|
| CG 310 (Euro/m² GIFA*year) | 0 (LN) | 63.6 % | 60.0 % | 15.6 % | 19.8 % | 17.50 | 188 |

| | Predictor variables | Transf. (λ) | Coef. (β) | Coef. SE | St. Coef. | t-value | p-value | VIF |
|---|---|---|---|---|---|---|---|---|
| β0 | Constant | - | 2.610 | 0.102 | 0.000 | 25.48 | 0.000 | - |
| X1 | Share of heatable GIFA (%) | 2 (SQ) | 0.451 | 0.090 | 0.231 | 5.00 | 0.000 | 1.40 |
| X2 | Share of ventilated and air-conditioned GIFA (%) | 0.5 (SQR) | 0.236 | 0.081 | 0.147 | 2.92 | 0.004 | 1.45 |
| X3 | Share of defective envelope (%) | 0.5 (SQR) | 0.353 | 0.073 | 0.246 | 4.85 | 0.000 | 1.21 |
| X4 | Share of defective heat supply systems (%) | 0.5 (SQR) | 0.219 | 0.060 | 0.178 | 3.63 | 0.000 | 1.14 |
| X5 | Type of heating energy source | - | - | - | 0.274 | - | 0.000 | - |
| | District heating | - | 0.000 | 0.000 | - | - | - | - |
| | Electricity | - | 0.413 | 0.093 | - | 4.46 | 0.000 | 1.24 |
| | Gas | - | 0.085 | 0.038 | - | 2.20 | 0.029 | 1.71 |
| | Oil | - | -0.084 | 0.067 | - | -1.25 | 0.213 | 1.56 |
| X6 | Type of facility | - | - | - | 0.622 | - | 0.000 | - |
| | Care retirement home | - | 0.000 | 0.000 | - | - | - | - |
| | Church facility | - | -0.568 | 0.087 | - | -6.50 | 0.000 | 1.97 |
| | Community hall | - | -0.535 | 0.109 | - | -4.92 | 0.000 | 1.44 |
| | Fire department | - | -0.239 | 0.116 | - | -2.06 | 0.041 | 1.64 |
| | Kindergarten | - | -0.392 | 0.065 | - | -6.05 | 0.000 | 4.92 |
| | Library | - | -0.405 | 0.120 | - | -3.38 | 0.001 | 1.40 |
| | Municipal facility | - | -0.494 | 0.107 | - | -4.62 | 0.000 | 1.66 |
| | Research/teaching facility | - | -0.058 | 0.098 | - | -0.59 | 0.554 | 1.84 |
| | School facility | - | -0.722 | 0.079 | - | -9.12 | 0.000 | 2.67 |
| | Sport facility | - | -0.751 | 0.082 | - | -9.17 | 0.000 | 2.44 |
| | Town hall | - | -0.293 | 0.106 | - | -2.75 | 0.007 | 1.64 |
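The following equation describes the non-linear regression model NLR(ind)310:

$$\hat{Y} = e^{\beta_0} \cdot e^{\beta_1 X_1^{2}} \cdot e^{\beta_2 \sqrt{X_2}} \cdot e^{\beta_3 \sqrt{X_3}} \cdot e^{\beta_4 \sqrt{X_4}} \cdot e^{\beta_5 X_5} \cdot e^{\beta_6 X_6}$$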
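To illustrate how the model equation is applied, the following Python sketch evaluates it with the coefficients reported in Table 4.9. Two points are assumptions and not taken from the source: the shares are entered as fractions between 0 and 1 (rather than as percentages between 0 and 100), and the categorical terms β5·X5 and β6·X6 are implemented as a lookup of the dummy coefficient of the respective category. The function name and the example facility are hypothetical.

```python
import math

# Coefficients of NLR(ind)310 as reported in Table 4.9.
BETA_0 = 2.610
BETA_HEATABLE = 0.451       # X1, enters squared (lambda = 2)
BETA_VENTILATED = 0.236     # X2, enters as square root (lambda = 0.5)
BETA_DEF_ENVELOPE = 0.353   # X3, square root
BETA_DEF_HEAT = 0.219       # X4, square root

# Dummy coefficients; the reference categories have a coefficient of 0.
HEATING_SOURCE = {
    "district heating": 0.000, "electricity": 0.413,
    "gas": 0.085, "oil": -0.084,
}
FACILITY_TYPE = {
    "care retirement home": 0.000, "church facility": -0.568,
    "community hall": -0.535, "fire department": -0.239,
    "kindergarten": -0.392, "library": -0.405,
    "municipal facility": -0.494, "research/teaching facility": -0.058,
    "school facility": -0.722, "sport facility": -0.751,
    "town hall": -0.293,
}


def estimate_cg310(facility, heating, sh_heatable, sh_ventilated,
                   sh_def_envelope, sh_def_heat):
    """Estimate the CG 310 cost indicator in Euro/m² GIFA*year.

    Assumption: all shares are fractions in [0, 1], not percentages.
    """
    log_y = (BETA_0
             + BETA_HEATABLE * sh_heatable ** 2
             + BETA_VENTILATED * math.sqrt(sh_ventilated)
             + BETA_DEF_ENVELOPE * math.sqrt(sh_def_envelope)
             + BETA_DEF_HEAT * math.sqrt(sh_def_heat)
             + HEATING_SOURCE[heating]
             + FACILITY_TYPE[facility])
    # The response is log-transformed (lambda = 0), so exponentiate.
    return math.exp(log_y)


if __name__ == "__main__":
    # Hypothetical kindergarten with gas heating, fully heatable GIFA,
    # no ventilated area, 10 % defective envelope and heat supply.
    print(estimate_cg310("kindergarten", "gas", 1.0, 0.0, 0.1, 0.1))
```

For the hypothetical kindergarten, the sketch returns a value slightly below 19 Euro/m² GIFA*year, which lies within the interquartile range reported for kindergartens with gas heating in Table 4.11. This supports, but does not prove, the assumption that the shares are entered as fractions.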
The residual plots in Figure 4.4 illustrate the distribution of the residuals (difference between observed and estimated value) and allow further assumptions about the quality of fit of the non-linear regression model NLR(ind)310 to the underlying data sample. The standardised residuals of the model appear to be uncorrelated to the estimated values and are distributed normally as illustrated by the scatter plot and histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points and therefore a correctly specified model (cf. Fahrmeir et al., 2013).
[Figure 4.4: left panel, standardised residuals (-3 to 3) versus fitted values (2.0 to 3.5); right panel, histogram of the standardised residuals (frequency, 0 to 25)]

Figure 4.4. Residuals for non-linear regression model NLR(ind)310
Binary classification tree model

Though the binary classification tree model BCT(ind)310 cannot improve the estimation performance compared to the non-linear regression model NLR(ind)310, the achieved values of the R² of 58.4 % and the MAPE of 16.5 % indicate a relatively high accuracy of utility cost estimation (cf. Table 4.8). Since a classification tree model has the potential advantage of revealing and describing the causal interrelationships between the response variable and the predictor variables transparently, as described by Curram and Mingers (1994), the developed model is presented hereinafter. A summary of relevant parameters and specifications is given in Table 4.10 and Figure 4.5 illustrates the model.

Table 4.10. Specifications and results of binary classification tree model BCT(ind)310

| Parameter | Specifications |
|---|---|
| Growing method | Classification and regression tree CRT |
| Dependent variable | re_310_GIFA (CG 310, Euro/m² GIFA*year) |
| Sample size | 188 |
| Minimum cases in nodes | 3 |

| Results | |
|---|---|
| Independent variables | qv_Util (Type of facility); cn_sh_defEnv (Share of defective envelope, %); cn_sh_hGIFA (Share of heatable GIFA, %); cn_sh_defHtSys (Share of defective heat supply systems, %); qv_EnSource (Type of heating energy source) |
| Number of nodes | 19 |
| Number of terminal nodes | 10 |
| Tree depth | 5 |
As outlined in Section 2.3.4, the classification and regression tree growing method CRT is used to develop the model BCT(ind)310. The model employs the training sample of 188 observations, has a tree depth of 5 layers, and includes a total of 19 nodes whereof 10 are terminal nodes with stop rules. A significant effect on the CG 310 cost indicators is revealed for 5 predictor variables. The predictors correspond with the predictors identified by the non-linear regression model NLR(ind)310 and the largest effect on the utility costs is displayed for the qualitative variable type of facility. Despite the minor variance of the size of the effects of the predictor variables comparing the models NLR(ind)310 and BCT(ind)310, both models reveal a relatively high level of conformity of their results and therefore indicate their correct specification.

[Figure 4.5: tree diagram; root node re_310_GIFA with a mean of 17.44 and n = 188]

Figure 4.5. Tree diagram of binary classification tree model BCT(ind)310
4.2.3 Categorised cost indicators

Besides the statistical models presented previously, median values of categorised cost indicators are introduced for the purpose of estimation. The results of the developed models are used as a basis as outlined in Section 2.3.5. The underlying cost data are defined as the utility costs and the reference quantity is defined as the gross internal floor area GIFA. The categorisation is conducted according to the predictor variables with the largest effect on CG 310 costs. The standardised coefficients of the non-linear regression model NLR(ind)310 show the largest effects for the qualitative variables type of facility and type of heating energy source.

Table 4.11. Categorised CG 310 cost indicators

| Type of facility | Lower quartile[a] | MV(ind)310[a] | Upper quartile[a] | n[b] |
|---|---|---|---|---|
| Care retirement home | 20.17 | 25.11 | 28.44 | 12 |
| Church facility | 11.36 | 13.73 | 16.85 | 11 |
| Community hall | 13.97 | 15.52 | 16.48 | 5 |
| Fire department | 11.82 | 18.98 | 24.99 | 5 |
| Kindergarten | 14.45 | 17.71 | 20.57 | 96 |
| – District heating | 14.24 | 16.63 | 19.12 | 15 |
| – Electricity | 22.42 | 26.31 | 32.00 | 6 |
| – Gas | 14.67 | 17.90 | 20.44 | 68 |
| – Oil | 10.13 | 14.38 | 14.88 | 7 |
| Library | 15.56 | 16.56 | 19.16 | 4 |
| Municipal facility | 12.36 | 13.96 | 17.62 | 6 |
| Research/teaching facility | 20.90 | 23.23 | 27.94 | 8 |
| School facility | 9.95 | 11.67 | 13.57 | 19 |
| – District heating | 9.01 | 11.35 | 12.25 | 9 |
| – Gas | 10.80 | 13.46 | 16.06 | 10 |
| Sport facility | 10.75 | 13.89 | 16.11 | 16 |
| – District heating | 10.60 | 13.58 | 14.67 | 9 |
| – Gas | 10.36 | 15.85 | 16.98 | 7 |
| Town hall | 13.61 | 14.73 | 19.20 | 6 |

[a] CG 310 cost indicators (Euro/m² GIFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 188 observations.
Therefore, the utility cost indicators are categorised according to their type of facility as presented in Table 4.11. A further categorisation is performed with the type of heating energy source, whereby the sub-categorised cost indicators are only available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25th percentile) and upper quartiles (75th percentile) are presented for all categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The median values of the utility cost indicators range between 11.35 Euro/m² GIFA*year for school facilities with district heating and 26.31 Euro/m² GIFA*year for kindergartens with electricity as heating energy source.
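The categorisation described above can be sketched as a simple grouping step followed by the calculation of quartiles. The sketch below is illustrative only: the function name and the sample records are hypothetical, and the choice of the inclusive quantile method of the Python standard library is an assumption, since the quartile definition used for the published values is not stated here.

```python
from collections import defaultdict
from statistics import quantiles

# Sketch of the categorisation of cost indicators.
# Assumptions: hypothetical records; the "inclusive" quantile method is
# a choice made for this sketch, not necessarily the one used in the study.


def categorised_indicators(records, min_obs=3):
    """Group (category, cost indicator) pairs and compute quartiles.

    Categories with fewer than min_obs observations are omitted.
    """
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)

    result = {}
    for category, values in groups.items():
        if len(values) < min_obs:
            continue  # too few observations for a cost indicator
        lower, median, upper = quantiles(values, n=4, method="inclusive")
        result[category] = {
            "lower_quartile": lower,
            "median": median,
            "upper_quartile": upper,
            "n": len(values),
        }
    return result


if __name__ == "__main__":
    # Hypothetical cost indicators in Euro/m² GIFA*year.
    sample = [
        ("school facility / gas", 10.0), ("school facility / gas", 12.0),
        ("school facility / gas", 14.0), ("school facility / gas", 16.0),
        ("school facility / gas", 18.0),
        ("school facility / oil", 11.0), ("school facility / oil", 13.0),
    ]
    for name, stats in categorised_indicators(sample).items():
        print(name, stats)
```

In this hypothetical example, only the category with five observations is reported, while the category with two observations is dropped by the minimum-size rule.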
4.2.4 Performance validation

Based on a training sample of 188 observations, various statistical models for the estimation of utility cost indicators are developed. The accuracy of the models is described by various measures of performance as summarised in Section 4.2.2. Hereinafter, an independent test sample is employed to draw unbiased statistical inferences about the estimation accuracy of the developed models and the estimation by cost indicators. For the CG 310, the test sample includes 18 observations, is selected randomly, is representative of the total sample, and is not employed for the model development as described in Section 3.3.

Table 4.12. Comparison of PE and MAPE values (test sample) for CG 310 estimation methods

| Obs. | Type of facility | LR(ind)310 | NLR(ind)310 | ANN(ind)310 | BCT(ind)310 | MV(ind)310 | Preference |
|---|---|---|---|---|---|---|---|
| 5 | Kindergarten | -12.6 % | -7.8 % | -7.8 % | 22.3 % | -15.9 % | NLR(ind)310 |
| 24 | Church facility | 28.5 % | 26.7 % | 31.8 % | 24.5 % | 60.6 % | BCT(ind)310 |
| 34 | Kindergarten | -13.7 % | -21.1 % | 0.2 % | 17.1 % | -2.4 % | ANN(ind)310 |
| 40 | Town hall | 14.7 % | 11.2 % | 37.5 % | -16.6 % | -20.6 % | NLR(ind)310 |
| 47 | Kindergarten | -9.1 % | -0.7 % | -12.9 % | 12.9 % | 51.2 % | NLR(ind)310 |
| 54 | Kindergarten | 14.1 % | 12.6 % | 20.8 % | 4.2 % | 19.6 % | BCT(ind)310 |
| 57 | Research/teaching | 22.4 % | 23.3 % | 8.8 % | 9.3 % | 24.1 % | ANN(ind)310 |
| 68 | School facility | 11.8 % | 15.0 % | 11.2 % | 11.0 % | 9.7 % | MV(ind)310 |
| 92 | Kindergarten | 1.8 % | 0.7 % | -22.3 % | -28.1 % | -8.1 % | NLR(ind)310 |
| 112 | Sport facility | -18.3 % | -5.0 % | -14.7 % | 3.7 % | -36.3 % | BCT(ind)310 |
| 130 | Kindergarten | 16.5 % | 21.8 % | 12.8 % | 14.1 % | 11.1 % | MV(ind)310 |
| 167 | Care retirement home | 10.4 % | 9.8 % | 21.3 % | 13.0 % | 13.0 % | NLR(ind)310 |
| 187 | School facility | -36.4 % | -35.8 % | -38.6 % | -16.7 % | -36.1 % | BCT(ind)310 |
| 202 | Municipal facility | -9.8 % | -8.3 % | -36.2 % | -25.9 % | -34.0 % | NLR(ind)310 |
| 211 | School facility | 35.3 % | 32.7 % | 8.5 % | 37.8 % | 46.1 % | ANN(ind)310 |
| 230 | Kindergarten | 9.1 % | 8.5 % | -14.8 % | -37.5 % | 3.0 % | MV(ind)310 |
| 237 | Sport facility | 38.4 % | 34.4 % | 11.3 % | 13.4 % | 14.7 % | ANN(ind)310 |
| 247 | Kindergarten | -11.9 % | -2.1 % | -16.2 % | -22.0 % | -9.2 % | NLR(ind)310 |
| Total (MAPE) | | 17.5 % | 15.4 % | 18.2 % | 18.3 % | 23.1 % | NLR(ind)310 |
The values of the percentage errors PE are presented for all statistical models and the estimation by the median values of the categorised cost indicators MV(ind)310 in Table 4.12. The PE values for the observations of the test sample are calculated by applying the observed characteristics to the statistical models. The median values of the cost indicators are selected according to the respective characteristics. Besides, the most accurate estimation method is presented for each observation and the estimation accuracy is summarised by the mean absolute percentage error MAPE. The most accurate estimation of utility costs is offered by model NLR(ind)310 as indicated by the values of the MAPE for the test sample and the performance measures for the training sample. The values of the PE range between an error of 0.7 % for the most accurate estimation and -35.8 % for the observation with the lowest estimation accuracy. A comparison of the MAPE values of the test sample (15.4 %) and training sample (15.6 %) confirms a consistent estimation performance for the non-linear regression model NLR(ind)310 beyond the data sample employed for the model development. An overview of the achieved values of the MAPE for both test and training sample for all methods is given in Figure 4.6. Furthermore, the distribution of the absolute percentage errors APE is illustrated for all observations. The figure reveals a relatively consistent distribution of the APE values comparing the respective test and training samples for all developed statistical models and therefore indicates a correct specification of the models.
[Figure 4.6: absolute percentage error APE (axis from 0.0 to 0.7) for the test and training samples of LR(ind)310, NLR(ind)310, ANN(ind)310, BCT(ind)310, and MV(ind)310]

Figure 4.6. APE values (test and training sample) for CG 310 estimation methods
4.2.5 Summary

In order to estimate utility costs (cost group 310) and identify significant predictor variables, a data sample of 206 observations is analysed and various statistical models are developed. The gross internal floor area GIFA is determined as the most adequate reference quantity. With a mean absolute percentage error MAPE of 15.6 % for the total sample, the most accurate estimation of cost indicators is provided by a non-linear regression model. Significance is indicated for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Type of heating energy source (standard)
– Share of defective envelope (condition)
– Share of heatable gross internal floor area GIFA (quantity)
– Share of defective heat supply systems (condition)
– Share of ventilated and air-conditioned gross internal floor area GIFA (quantity)
4.3 Water (CG 311)

4.3.1 Theoretical basis and variables

The investigation of water costs (CG 311) is based on empirical data and examines a data sample of 194 observations including a training sample of 177 observations and a test sample of 17 observations. Individual observations of the total sample are excluded from the data basis of the current analysis due to a limited availability of data. For residential facilities, for example, tenants are charged directly by the respective suppliers for their water consumption. The water cost data of these facilities cannot be provided by the project partners, either in part or in total, and the corresponding observations are excluded from the analysis. Furthermore, the water cost data of some of the analysed facilities are only available on an aggregated level for multiple facilities and cannot be differentiated and allocated to a particular facility. These observations are likewise excluded from the current analysis of water costs. The cost data of the underlying data sample have median values of absolute costs of 1,153 Euro per year and cost indicators of 0.99 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). The water costs represent a relatively small share of about 2.3 % of the operating costs (CG 300) and about 7.3 % of the utility costs (CG 310). A general overview and a definition of the investigated cost data is given in Section 2.2.1. The examined third level cost group of water costs (CG 311) is defined in the standard DIN 18960:2008-02 and includes the supply of facilities with fresh water and rain water. Besides the CG 311 costs as response variable of the analysis, various candidate predictor variables are defined and their effect on water costs is examined. The variables give detailed information on the quantities, characteristics, utilisation, and location of the facilities. A general overview of the variable groups and the available variables is presented in Section 2.2 (Definition of key variables) and Section 3.2 (Presentation of the sample), respectively. The following overview describes the variables included in the current investigation of water costs.
Quantities Reference quantities Various areas and volumes are examined in order to determine an adequate reference quantity for the estimation of water costs. Therefore, the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 are analysed as candidate reference quantities in the current investigation.
Specific areas
The investigation of water costs examines specific areas as candidate predictor variables. The share of the usable floor area UFA in % and the share of sanitary area in %, as defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12, both refer to the gross internal floor area GIFA and are included in the analysis.
Function A functional description of the facilities is given by the quantity number of sanitary facilities which is therefore included in the analysis of the CG 311 costs in the current Section.
Characteristics Condition of the technical installations Describing the condition of sewage and fresh water drains and pipes, the share of defective water installations is considered as candidate predictor. Besides, the investigation quantifies the significance and influence of the share of defective sanitary installations including the condition of sanitary appliances and equipments.
Utilisation The utilisation of the facilities is represented by the candidate predictor variable type of facility. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation. The variable type of water usage with the characteristics process water usage and sanitary water usage specifies the water consumption according to the utilisation and is therefore examined.
Location Describing the surrounding area of the facilities, the investigation of water costs includes the qualitative variable urban location with the characteristics urban and rural.
4.3.2 Model design and specifications
Based on the training sample with 177 observations and the variables described above, various statistical models are developed stepwise and presented in the current Section. The statistical models aim to reveal and describe the causal interrelationships between the response variable and the candidate predictor variables. Furthermore, the models are intended to give an accurate estimation of the CG 311 costs as response variable.
Model overview
For the determination of an adequate reference quantity, several linear regression models estimating absolute costs are developed. Each model contains one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models is presented in Table 4.13 including various measures of performance. With the highest value of the R² (adj.) of 81.7 % and the lowest value of the MAPE of 62.3 %, the best estimation accuracy is indicated for the model LR(abs)311UFA employing the usable floor area UFA as one of the predictor variables. The lowest value of the CV(RMSE) of 82.8 % compared to the other developed models confirms the assumption. Based on the presented results of the absolute cost models, the UFA is determined as reference quantity for the introduction of cost indicators for the further investigation of water costs.

Table 4.13. Models (incl. reference quantities) for the estimation of CG 311 absolute costs
Absolute cost model (Euro/year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n
LR(abs)311GEFA | Linear regression (incl. GEFA) | 3 | 82.7 % | 81.4 % | 65.0 % | 83.6 % | 12 (6.8 %) | 177
LR(abs)311GIFA | Linear regression (incl. GIFA) | 3 | 82.7 % | 81.5 % | 64.6 % | 83.4 % | 12 (6.8 %) | 177
LR(abs)311UFA | Linear regression (incl. UFA) | 3 | 83.0 % | 81.7 % | 62.3 % | 82.8 % | 10 (5.6 %) | 177
LR(abs)311GBV | Linear regression (incl. GBV) | 3 | 82.7 % | 81.5 % | 70.6 % | 83.3 % | 10 (5.6 %) | 177
Table 4.14. Models for the estimation of CG 311 cost indicators
Cost indicator model (Euro/m² UFA*year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n
LR(ind)311 | Linear regression model | 3 | 61.9 % | 59.1 % | 51.9 % | 55.2 % | 9 (5.1 %) | 177
NLR(ind)311 | Non-linear regression model | 4 | 73.3 % | 71.2 % | 32.8 % | 55.8 % | 9 (5.1 %) | 177
ANN(ind)311 | Artificial neural network model | 4 | 58.9 % | - | 65.4 % | 57.4 % | 10 (5.6 %) | 177
BCT(ind)311 | Binary classification tree model | 3 | 78.5 % | - | 38.5 % | 41.5 % | 10 (5.6 %) | 177
Employing the CG 311 cost indicators as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. A summary of the statistical models with the best performance is given in Table 4.14. According to the R² of 78.5 % and a CV(RMSE) of 41.5 %, the binary classification tree model BCT(ind)311 offers the best estimation performance compared to the other models. By contrast, taking the value of the MAPE of 32.8 % and the number of 9 outliers into account, the highest accuracy is indicated for the non-linear regression model NLR(ind)311. As described in Section 2.3.6 in a presentation of the utilised performance measures, differences in the assessment of the models may be caused by a different weighting of the residuals (differences between observed and estimated values). For the calculation of the MAPE, all residuals are weighted evenly, whereas the R² and the CV(RMSE) give extreme residuals a higher weight. Therefore, it is indicated that the model NLR(ind)311 produces more extreme outliers, whereas its residuals are lower on average compared to the model BCT(ind)311. A further comparison and validation of both models is conducted subsequently. As indicated throughout all measures, the artificial neural network model ANN(ind)311 cannot offer an improvement of performance. Compared to the models estimating absolute costs, all developed cost indicator models show higher levels of accuracy as illustrated by the values of the MAPE and the CV(RMSE).
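The different weighting of residuals can be illustrated with a minimal Python sketch. The formulas below are the common textbook definitions of MAPE, CV(RMSE), and R², assumed here because Section 2.3.6 is not reproduced in this Section; the cost indicators and the two sets of estimates are hypothetical values chosen for illustration, not data from the sample.

```python
import math

def mape(observed, estimated):
    """Mean absolute percentage error: each residual is scaled by its own observed value."""
    return sum(abs((o - e) / o) for o, e in zip(observed, estimated)) / len(observed)

def cv_rmse(observed, estimated):
    """Root mean squared error divided by the mean observed value (squares emphasise large residuals)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)
    return rmse / (sum(observed) / n)

def r_squared(observed, estimated):
    """Coefficient of determination based on squared residuals."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical cost indicators (Euro/m² UFA*year), for illustration only.
observed = [1.0, 1.5, 2.0, 2.5, 8.0]
model_a = [1.3, 1.9, 2.6, 3.1, 7.8]  # moderate errors on every observation
model_b = [1.0, 1.5, 2.0, 2.5, 5.5]  # exact except for one extreme residual

for name, est in (("A", model_a), ("B", model_b)):
    print(name, round(mape(observed, est), 3),
          round(cv_rmse(observed, est), 3), round(r_squared(observed, est), 3))
```

In this constructed case, model B has the lower MAPE but the higher CV(RMSE) and the lower R², which mirrors the kind of disagreement between the measures described above.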
Regression model
In Table 4.15, the non-linear regression model NLR(ind)311 is presented in detail with all relevant specifications, as it is the model with the highest accuracy according to the MAPE and the lowest number of outliers among the developed cost indicator models. In order to improve the performance of the linear regression model (cf. Schmidt, 2010), both response and quantitative predictor variables of the non-linear regression model are transformed according to their Box-Cox transformations (Box and Cox, 1964). The respective transformations are represented by the values of lambda (λ) and range from a natural logarithm transformation (λ=0) of the response variable (CG 311 cost indicators) to square root (λ=0.5) transformations of the included quantitative predictor variables. The empirical F-value of the model NLR(ind)311 of 34.43 exceeds the theoretical F-value of 1.78 as determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)311 can be concluded. Significant relationships between the response variable and 4 predictor variables can be determined as indicated by the p-values based on a significance level of alpha (α) set to 0.05. The coefficients (β) are presented for the regression constant and the determined predictor variables. Besides, the standard error of the coefficients indicates the standard deviation of the coefficient estimates for the underlying sample. The size of the effect of the predictor variables on the response variable is described by the standardised coefficients as explained by Ryan et al. (2012). In order to determine the standardised coefficients of the categorical predictors, the regression is re-calculated with the coefficients of the dummy variables as introduced by Eisinga et al. (1991). For the non-linear regression model NLR(ind)311, the largest effect is indicated for the qualitative predictor variable type of facility followed by the condition of the technical installations represented by the quantitative variable share of defective water installations. None of the values of the VIF of the predictor variables exceeds the selected threshold of 5 (cf. Section 2.3.2). Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables, a stable regression model is indicated (cf. Chatterjee and Hadi, 2006).

Table 4.15. Description of coefficients of non-linear regression model NLR(ind)311
Response variable: CG 311 (Euro/m² UFA*year)
Transf. (λ) | R² | R² (adj.) | MAPE | CV(RMSE) | F-value | n
0 (LN) | 73.3 % | 71.2 % | 32.8 % | 55.8 % | 34.43 | 177

Predictor variables | Transf. (λ) | Coef. (β) | Coef. SE | St. Coef. | t-value | p-value | VIF
β0 Constant | - | 1.343 | 0.349 | 0.000 | 3.85 | 0.000 | -
X1 Share of sanitary GIFA (%) | 0.5 (SQR) | 2.348 | 0.876 | 0.198 | 2.68 | 0.008 | 3.38
X2 Share of defective water installations (%) | 0.5 (SQR) | 1.164 | 0.119 | 0.411 | 9.76 | 0.000 | 1.07
X3 Type of facility | - | - | - | 0.604 | - | 0.000 | -
  Care retirement home | - | 0.000 | 0.000 | - | - | - | -
  Church facility | - | -0.386 | 0.329 | - | -1.17 | 0.243 | 3.20
  Community hall | - | -0.051 | 0.288 | - | -0.18 | 0.859 | 1.54
  Fire department | - | -1.042 | 0.253 | - | -4.11 | 0.000 | 1.96
  Kindergarten | - | -0.806 | 0.152 | - | -5.31 | 0.000 | 4.10
  Library | - | -0.594 | 0.370 | - | -1.61 | 0.110 | 1.63
  Municipal facility | - | -1.587 | 0.270 | - | -5.87 | 0.000 | 2.17
  Research/teaching facility | - | -1.096 | 0.236 | - | -4.65 | 0.000 | 3.26
  School facility | - | -1.459 | 0.198 | - | -7.35 | 0.000 | 4.04
  Sport facility | - | -1.726 | 0.185 | - | -9.32 | 0.000 | 3.14
  Town hall | - | -1.638 | 0.278 | - | -5.89 | 0.000 | 2.25
X4 Type of water usage | - | - | - | 0.148 | - | 0.001 | -
  Process and sanitary | - | 0.000 | 0.000 | - | - | - | -
  Sanitary | - | -0.749 | 0.216 | - | -3.46 | 0.001 | 1.11
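The summary measures of Table 4.15 can be related to each other with a short Python sketch. It uses the standard formulas for the overall F-statistic and the adjusted R² of a regression with k slope coefficients. The count k = 13 (two quantitative predictors, ten facility dummies, one water-usage dummy) is an assumption read off the coefficient rows of the table; the source does not state the degrees of freedom explicitly.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic of a regression with k slope coefficients and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def adjusted_r2(r2, n, k):
    """R² penalised for the number of estimated slope coefficients."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 177           # training sample size (Table 4.15)
r2 = 0.733        # R² of NLR(ind)311 (Table 4.15)
k = 2 + 10 + 1    # assumed: 2 quantitative predictors + 10 facility dummies + 1 water-usage dummy

print(round(f_statistic(r2, n, k), 2))
print(round(100 * adjusted_r2(r2, n, k), 1))
```

Under this assumption the sketch gives values close to the reported F-value of 34.43 and R² (adj.) of 71.2 %; the small remaining difference in the F-value is consistent with the rounding of R².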
The following equation describes the non-linear regression model NLR(ind)311:
\[ \hat{Y} = e^{\beta_0} \cdot e^{\beta_1 \sqrt{X_1}} \cdot e^{\beta_2 \sqrt{X_2}} \cdot e^{\beta_3 X_3} \cdot e^{\beta_4 X_4} \]
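As a worked illustration, the equation can be evaluated with the coefficients of Table 4.15. The following Python sketch is not the author's implementation. It assumes that the two shares enter the model as fractions between 0 and 1 (the table labels them in %, so this scaling is an assumption), and that the terms β3·X3 and β4·X4 stand for the dummy coefficient of the applicable category. The function name and the example facility are hypothetical.

```python
import math

BETA_0 = 1.343            # regression constant
BETA_SANITARY = 2.348     # X1: share of sanitary GIFA (square root transformed)
BETA_DEFECTIVE = 1.164    # X2: share of defective water installations (square root transformed)

# X3: dummy coefficients per type of facility (reference: care retirement home)
FACILITY = {
    "care retirement home": 0.000, "church facility": -0.386,
    "community hall": -0.051, "fire department": -1.042,
    "kindergarten": -0.806, "library": -0.594,
    "municipal facility": -1.587, "research/teaching facility": -1.096,
    "school facility": -1.459, "sport facility": -1.726, "town hall": -1.638,
}

# X4: dummy coefficients per type of water usage (reference: process and sanitary)
WATER_USAGE = {"process and sanitary": 0.000, "sanitary": -0.749}

def estimate_cg311(share_sanitary, share_defective, facility, water_usage):
    """Estimated CG 311 cost indicator in Euro/m² UFA*year (shares given as fractions, assumed)."""
    log_y = (BETA_0
             + BETA_SANITARY * math.sqrt(share_sanitary)
             + BETA_DEFECTIVE * math.sqrt(share_defective)
             + FACILITY[facility]
             + WATER_USAGE[water_usage])
    return math.exp(log_y)  # back-transformation of the natural logarithm

# Hypothetical kindergarten: 5 % sanitary area, no defective water installations.
example = estimate_cg311(0.05, 0.0, "kindergarten", "sanitary")
print(round(example, 2))
```

With fractional shares, the hypothetical kindergarten yields an estimate of the same order of magnitude as the median cost indicators reported for kindergartens, which lends some plausibility to the assumed scaling.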
Further information about the quality of fit of the model NLR(ind)311 to the underlying data sample is given by the distribution of the residuals (difference between observed and estimated value) presented in the residual plots in Figure 4.7. The standardised residuals appear to be uncorrelated with the estimated values and are normally distributed, as illustrated by the scatter plot and the histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates a correctly specified model and an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points (cf. Fahrmeir et al., 2013).
[Figure: scatter plot of standardised residuals versus fitted values (left) and histogram of the standardised residuals with their frequency (right)]
Figure 4.7. Residuals for non-linear regression model NLR(ind)311
Binary classification tree model
Classification tree models may have the advantage of providing clear information on the importance of significant predictor variables as described by Tso and Yau (2007). According to the values of the R² of 78.5 % and the CV(RMSE) of 41.5 %, the binary classification tree model BCT(ind)311 offers the best fit to the underlying data sample among the cost indicator models (cf. Table 4.14) and is presented in the current Section. Table 4.16 summarises relevant parameters and specifications of the model and Figure 4.8 illustrates the developed structure of the tree.

Table 4.16. Specifications and results of binary classification tree model BCT(ind)311
Parameter | Specifications
Growing method | Classification and regression tree CRT
Dependent variable | re_311_UFA (CG 311, Euro/m² UFA*year)
Sample size | 177
Minimum cases in nodes | 3
Results |
Independent variables | qv_Util (Type of facility); cn_sh_defWatIn (Share of defective water installations, %); cn_sh_defSanIn (Share of defective sanitary installations, %)
Number of nodes | 23
Number of terminal nodes | 12
Tree depth | 6
[Figure: tree diagram with the root node re_311_UFA (mean 2.03, n = 177), split first by qv_Util (type of facility); the node care_ret (mean 6.63, n = 12) is split by cn_sh_defWatIn at 0.136 into two terminal nodes (mean 5.07, n = 6 and mean 8.19, n = 6); the remaining facility types are split further by qv_Util and cn_sh_defWatIn; terminal nodes are marked TN]
Figure 4.8. Tree diagram of binary classification tree model BCT(ind)311
Based on the training sample of 177 observations, the model BCT(ind)311 is developed with the classification and regression tree growing method CRT as described in detail in Section 2.3.4. With a tree depth of 6 layers, the developed model includes 23 nodes in total, of which 12 are terminal nodes with stop rules. A significant effect on the water costs is indicated for 3 predictor variables. Two of them, type of facility and share of defective water installations, are also among the 4 variables identified by the non-linear regression model NLR(ind)311, whereas the share of defective sanitary installations is included in the tree model only. The model BCT(ind)311 displays the largest effect for the qualitative variable type of facility and the quantitative variable share of defective water installations. With only minor differences in their estimation performance, a comparison of the models NLR(ind)311 and BCT(ind)311 reveals a high level of conformity of the results and therefore indicates a correct specification of the models.
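How a single binary split on a quantitative predictor can be found is sketched below in Python. The sketch assumes the least-squares criterion commonly used for regression trees (choose the threshold that minimises the summed squared deviations from the node means) and a minimum of 3 cases per node; the growing method itself is described in Section 2.3.4, which is not reproduced here, so this is an illustration of the general technique rather than the procedure used for BCT(ind)311. All data values and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(x, y, min_cases=3):
    """Return (threshold, cost) of the binary split on x that minimises the summed SSE of y."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_cases, len(pairs) - min_cases + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical node: share of defective water installations and cost indicators.
share_defective = [0.0, 0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
cost_indicator = [4.8, 5.0, 5.2, 5.1, 8.0, 8.3, 8.1, 8.4]

threshold, cost = best_threshold(share_defective, cost_indicator)
print(threshold, round(cost, 4))
```

Applying the same search recursively to each resulting node, until a stop rule such as the minimum number of cases is reached, yields a tree structure of the kind shown in Figure 4.8.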
4.3.3 Categorised cost indicators
As described in Section 2.3.5, the training sample is employed to introduce median values of cost indicators for the purpose of estimation of water costs. The indicators are presented with a categorisation based on the results of the developed statistical models. Therefore, the underlying cost data are defined as the water costs and the respective reference quantity is defined as the usable floor area UFA as presented in the previous Section. The categorisation of the cost indicators is determined according to the predictor variables with the largest effect on water costs. As indicated by the standardised coefficients of the non-linear regression model NLR(ind)311 and the order of the variables included in model BCT(ind)311, the largest effect is revealed for the qualitative predictor variable type of facility and the quantitative predictor variable share of defective water installations.

Table 4.17. Categorised CG 311 cost indicators
Type of facility | Lower quartile[a] | MV(ind)311[a] | Upper quartile[a] | n[b]
Care retirement home | 4.72 | 7.24 | 7.80 | 12
  without defective water installations | 3.56 | 4.50 | 6.40 | 5
  with defective water installations | 7.09 | 7.73 | 8.70 | 7
Church facility | 1.02 | 3.45 | 4.53 | 5
Community hall | 2.39 | 4.10 | 5.34 | 3
Fire department | 0.83 | 1.28 | 1.88 | 5
Kindergarten | 1.20 | 1.71 | 2.33 | 95
  without defective water installations | 1.04 | 1.45 | 1.90 | 60
  with defective water installations | 1.71 | 2.34 | 3.34 | 35
Municipal facility | 0.36 | 0.52 | 2.50 | 5
Research/teaching facility | 0.62 | 0.96 | 1.21 | 10
  without defective water installations | 0.60 | 0.87 | 1.11 | 7
  with defective water installations | 0.93 | 1.14 | 6.02 | 3
School facility | 0.61 | 0.77 | 1.12 | 19
  without defective water installations | 0.59 | 0.69 | 0.98 | 11
  with defective water installations | 0.65 | 0.97 | 1.35 | 8
Sport facility | 0.61 | 0.70 | 1.05 | 16
  without defective water installations | 0.38 | 0.62 | 0.67 | 6
  with defective water installations | 0.70 | 1.00 | 1.18 | 10
Town hall | 0.31 | 0.62 | 0.86 | 5
[a] CG 311 cost indicators (Euro/m² UFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 175 observations.
Therefore, the water cost indicators are categorised according to the predictor variable type of facility as presented in Table 4.17. A further categorisation is made by the share of defective water installations with a distinction between facilities without defective water installations and facilities with defective water installations. Sub-categorised cost indicators are only available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25th percentile) and upper quartiles (75th percentile) are introduced for all presented categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The presented median values of the CG 311 water cost indicators range between 0.52 Euro/m² UFA*year for municipal facilities and 7.73 Euro/m² UFA*year for care retirement homes with defective water installations.
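The derivation of categorised cost indicators can be sketched in Python as a grouping step followed by the calculation of the median and the quartiles per category. The interpolation rule for the quartiles (`method="inclusive"`), the function name, and the records are assumptions made for this illustration; the source does not state how the quartiles are interpolated. The threshold of 3 observations follows the rule stated above for sub-categories.

```python
from collections import defaultdict
from statistics import median, quantiles

def categorised_indicators(records, min_n=3):
    """Group (category, indicator) pairs and return quartiles, median, and n per category."""
    groups = defaultdict(list)
    for category, indicator in records:
        groups[category].append(indicator)

    table = {}
    for category, values in groups.items():
        if len(values) < min_n:
            continue  # too few observations for a reliable indicator
        lower, _, upper = quantiles(values, n=4, method="inclusive")
        table[category] = (lower, median(values), upper, len(values))
    return table

# Hypothetical cost indicators in Euro/m² UFA*year, for illustration only.
records = [
    ("kindergarten", 1.0), ("kindergarten", 2.0), ("kindergarten", 3.0),
    ("kindergarten", 4.0), ("kindergarten", 5.0),
    ("library", 0.8), ("library", 1.2),
]

result = categorised_indicators(records)
print(result)
```

Categories with fewer observations than the threshold, such as the hypothetical library group, are simply omitted from the result.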
4.3.4 Performance validation
Various statistical models for the estimation of water cost indicators are developed employing a training sample of 177 observations. As summarised in Section 4.3.2, the description of the models contains multiple measures of performance. Hereinafter, unbiased statistical inferences about the estimation accuracy of the developed models and the estimation by the median values of the cost indicators are drawn employing an independent test sample of 17 observations. As described in Section 3.3, the test sample is not employed for model development, is selected randomly, and shall be representative for the total sample.

Table 4.18. Comparison of PE and MAPE values (test sample) for CG 311 estimation methods
Obs. | Type of facility | LR(ind)311 | NLR(ind)311 | ANN(ind)311 | BCT(ind)311 | MV(ind)311 | Preference
5 | Kindergarten | -25.6 % | 7.7 % | -19.6 % | -35.2 % | 9.6 % | NLR(ind)311
34 | Kindergarten | 6.7 % | 32.4 % | 17.3 % | 31.1 % | 32.0 % | LR(ind)311
47 | Kindergarten | 38.5 % | 55.8 % | 50.9 % | 30.2 % | 53.3 % | BCT(ind)311
54 | Kindergarten | -42.5 % | 1.7 % | -0.3 % | -11.1 % | -9.7 % | ANN(ind)311
57 | Research/teaching | -62.3 % | -12.3 % | -28.9 % | -5.1 % | 55.2 % | BCT(ind)311
68 | School facility | 25.5 % | 39.3 % | -28.0 % | 19.3 % | 31.8 % | BCT(ind)311
92 | Kindergarten | 14.5 % | 30.8 % | -8.2 % | 6.7 % | 37.6 % | BCT(ind)311
112 | Sport facility | -118.5 % | 0.4 % | -99.5 % | -75.9 % | -33.6 % | NLR(ind)311
130 | Kindergarten | -58.1 % | -10.2 % | -28.2 % | -70.1 % | -13.8 % | NLR(ind)311
145 | Kindergarten | -4.0 % | 27.2 % | 14.5 % | 24.2 % | 25.1 % | LR(ind)311
161 | Kindergarten | 7.3 % | -28.8 % | -66.4 % | -24.7 % | -9.1 % | LR(ind)311
167 | Care retirement home | -220.7 % | -84.1 % | -133.4 % | -128.0 % | -102.5 % | NLR(ind)311
187 | School facility | -26.4 % | -12.5 % | -153.0 % | -41.8 % | -19.8 % | NLR(ind)311
211 | School facility | 33.7 % | 51.8 % | -0.1 % | 35.0 % | 45.1 % | ANN(ind)311
221 | Kindergarten | -77.2 % | -114.6 % | -114.9 % | -141.6 % | -111.4 % | LR(ind)311
237 | Sport facility | -9.6 % | 41.2 % | -5.5 % | 64.9 % | 69.4 % | ANN(ind)311
247 | Kindergarten | 35.1 % | 20.5 % | 14.3 % | 29.6 % | 23.6 % | ANN(ind)311
Total (MAPE) | | 47.4 % | 33.6 % | 46.1 % | 45.6 % | 40.2 % | NLR(ind)311
In Table 4.18, the percentage errors PE are presented for all statistical models and the estimation by the median values of the categorised cost indicators MV(ind)311 . The PE values are calculated by application of the observed characteristics into the statistical models for the 17 observations. The median values of the cost indicators are selected under consideration of the respective characteristics. Furthermore, the respective preference for the most accurate estimation method is presented and the estimation accuracy is summarised by the MAPE values for the test sample.
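The evaluation steps described above can be sketched in Python. The sign convention of the percentage error, PE = (observed − estimated) / observed, is an assumption consistent with the definition of the residuals in Section 4.3.2 and with the range of the tabulated values; the preference is taken as the method with the smallest absolute PE. The PE values are copied from Table 4.18, and the function and variable names are hypothetical.

```python
METHODS = ["LR", "NLR", "ANN", "BCT", "MV"]

def percentage_error(observed, estimated):
    """PE in percent; negative values indicate an overestimation (assumed sign convention)."""
    return (observed - estimated) / observed * 100.0

def preference(pe_by_method):
    """Method with the smallest absolute percentage error for one observation."""
    return min(pe_by_method, key=lambda method: abs(pe_by_method[method]))

def mape(pe_values):
    """Mean absolute percentage error over a list of PE values (in percent)."""
    return sum(abs(pe) for pe in pe_values) / len(pe_values)

# PE values of three test observations, as listed in Table 4.18.
rows = {
    92: dict(zip(METHODS, [14.5, 30.8, -8.2, 6.7, 37.6])),
    112: dict(zip(METHODS, [-118.5, 0.4, -99.5, -75.9, -33.6])),
    211: dict(zip(METHODS, [33.7, 51.8, -0.1, 35.0, 45.1])),
}
preferences = {obs: preference(pe) for obs, pe in rows.items()}

# PE values of NLR(ind)311 for all 17 test observations, as read from Table 4.18.
nlr_pe = [7.7, 32.4, 55.8, 1.7, -12.3, 39.3, 30.8, 0.4, -10.2,
          27.2, -28.8, -84.1, -12.5, 51.8, -114.6, 41.2, 20.5]

print(preferences)
print(round(mape(nlr_pe), 1))
```

Averaging the absolute PE values of the NLR(ind)311 column in this way reproduces the MAPE of 33.6 % reported in the total row of the table.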
[Figure: distribution of the absolute percentage error APE (axis from 0.0 to 1.4) for the test and training samples of LR(ind)311, NLR(ind)311, ANN(ind)311, BCT(ind)311, and MV(ind)311]
Figure 4.9. APE values (test and training sample) for CG 311 estimation methods
As already indicated by the performance measures of the training sample, the values of the MAPE for the test sample validate the most accurate estimation of water costs for the model NLR(ind)311 . The values of the PE range between an error of 0.4 % for the most accurate estimation and -114.6 % for the estimation with the lowest accuracy. A comparison of the MAPE values of the test sample (33.6 %) and training sample (32.8 %) confirms a constant estimation performance for the non-linear regression model NLR(ind)311 . According to the MAPE of the test sample of 45.6 %, the model BCT(ind)311 reveals likewise a relatively constant estimation accuracy compared to the results of the training sample (MAPE of 38.5 %). Figure 4.9 gives an overview of the achieved values of the MAPE for both test and training samples for all methods. Likewise, the figure illustrates the distribution of the absolute percentage errors APE for all observations. A comparison of the respective test and training samples reveals a relatively consistent distribution of the APE values. Therefore, it can be assumed that the statistical models are correctly specified and a consistent estimation performance beyond the data sample used for model development is validated.
4.3.5 Summary
For the identification of significant predictor variables and the estimation of water costs (cost group 311), various statistical models are developed based on a sample size of 194 observations. The usable floor area UFA is determined as the most adequate reference quantity. The most accurate estimation of cost indicators is provided by a non-linear regression model with a mean absolute percentage error MAPE of 32.9 % for the total sample. Significance is indicated for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Share of defective water installations (condition)
– Share of sanitary gross internal floor area GIFA (quantity)
– Type of water usage (utilisation)
4.4 Heating (CG 312-316)
4.4.1 Theoretical basis and variables
The empirical investigation of heating costs utilises a total data sample of 206 observations, comprising a training sample of 186 observations and a test sample of 20 observations. Individual observations of the total sample are excluded from the data basis due to limited availability of data. For residential facilities, for example, tenants are charged directly by the respective suppliers for their energy consumption. Heating cost data of these facilities cannot be provided by the project partners and the corresponding observations are excluded from the analysis. Furthermore, the heating cost data of some of the analysed facilities are only available at an aggregated level for multiple facilities and cannot be differentiated and allocated to a particular facility. These observations are likewise excluded from the current analysis. The analysis of heating costs includes electricity costs solely as expenditures for heating energy. Further expenditures for electricity are investigated in Section 4.5. The underlying cost data have median values of absolute costs of 9,952 Euro per year and cost indicators of 8.42 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). The heating costs represent a share of about 23.1 % of the operating costs (CG 300) and a share of about 64.9 % of the utility costs (CG 310). A general overview and definition of the investigated cost data is given in Section 2.2.1. The examined costs are aggregated from the following third level cost groups according to the standard DIN 18960:2008-02:
– CG 312: Oil
– CG 313: Gas
– CG 314: Solid fuels
– CG 315: District heating
– CG 316: Electricity (solely expenditures for heating with electricity)
Besides the heating costs as response variable of the investigation, various candidate predictor variables are defined and examined regarding their effect on heating costs. The variables give detailed information on the quantities, characteristics, utilisation, and location of the facilities. A general overview of the variable groups and the available variables is given in Section 2.2.2 and Section 3.2.2, respectively. The variables included in the current investigation of CG 312-316 costs are presented in the following overview.
Quantities Reference quantities In order to determine an adequate reference quantity for the estimation of heating costs, various areas and volumes are examined. The investigation therefore includes the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 as candidate reference quantities. Besides, the heatable gross internal floor area hGIFA in m² according to the directive VDI 3807-1:2013-06 is considered as a candidate reference quantity.
Specific areas The analysis of heating costs examines specific areas as candidate predictor variables. The share of the usable floor area UFA in % according to DIN 277-1:2016-01 and the share of the heatable GIFA in % according to VDI 3807-1:2013-06 refer to the gross internal floor area GIFA (DIN 277-1:2016-01) as percentage and are included in the investigation of CG 312-316 heating costs.
Compactness For a description of the compactness of buildings, the investigation of the heating costs examines the average floor size in m², the average storey height in m, and the number of floors as candidate predictors. The average floor size in m² is calculated as the gross external floor area GEFA in m² according to DIN 277-1:2016-01 divided by the number of floors. The average storey height in m is calculated as the gross building volume GBV in m³ divided by the gross external floor area GEFA in m² as defined in the standard DIN 277-1:2016-01.
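The two derived compactness variables follow directly from the definitions above and can be written as a short Python sketch. The function names and the example building are hypothetical and serve only to illustrate the arithmetic.

```python
def average_floor_size(gefa_m2, number_of_floors):
    """Average floor size in m²: gross external floor area divided by the number of floors."""
    return gefa_m2 / number_of_floors

def average_storey_height(gbv_m3, gefa_m2):
    """Average storey height in m: gross building volume divided by gross external floor area."""
    return gbv_m3 / gefa_m2

# Hypothetical building: GEFA 1,200 m², GBV 4,200 m³, 3 floors.
floor_size = average_floor_size(1200.0, 3)
storey_height = average_storey_height(4200.0, 1200.0)
print(floor_size, storey_height)
```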
Function A functional description is given by the share of glass surfaces on above-grade exterior walls in % and the share of double or triple glazing on exterior glass surfaces in %. Both variables are defined in the standard DIN 277-3:2005-04 and are considered as candidate predictor variables describing the envelope of the facilities.
Characteristics
Condition
– Condition of the construction: The investigation includes the share of modernised building envelope and the share of defective building envelope aggregated from the condition of base plates, external walls, and roofs as candidate predictor variables influencing the heating costs.
– Condition of the technical installations: The share of defective heat supply systems gives information about the condition of heat generators, heat distribution networks, radiators, and panel heating systems and is therefore included as candidate predictor.
Standard
– Standard of the construction: The qualitative predictor variable thermal mass describes the heat storage capacity of the structure and is included in the investigation with the characteristics heavy thermal mass and light thermal mass.
– Standard of the technical installations: Comprising information about the sewerage, water, gas, heat supply, air treatment, electrical, communication, transport, and building automation systems, the standard of technical installations is examined in the investigation. Besides, the influence and significance of the standard of the heating system is quantified. The variable describes the availability of time scheduled programs and measurement of external factors. The availability of individual or automatised room controls of the heating, ventilation, shading, and lighting systems is considered by the qualitative predictor variable standard of building automation. A further description of the standard of the technical installations is given by the variable type of heating energy source with the characteristics district heating, electricity, gas, and oil heating.
Utilisation The investigation includes the type of facility as candidate predictor variable to take the utilisation of the facilities into account. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, school facility, sport facility, and town hall as facility types.
Location The qualitative variable urban location describes the surrounding area of the facilities and is included with the characteristics urban and rural in the investigation of heating costs.
4.4.2 Model design and specifications
Building on the presented theoretical basis, various statistical models are developed stepwise employing the training sample of 186 observations. These statistical models aim to reveal and describe the causal interrelationships of the heating costs and the candidate predictor variables presented above. Furthermore, the models are intended to give an accurate estimation of heating costs.
Model overview
In a first step, several linear regression models estimating absolute costs are developed, each containing one of the available candidate reference quantities as predictor. Comparing the developed absolute cost models and their measures of performance presented in Table 4.19, it is indicated that the model LR(abs)312-316hGIFA including the heatable gross internal floor area hGIFA as one of the predictor variables offers the best estimation accuracy with the highest value of the R² (adj.) of 84.4 % and the lowest value of the MAPE of 40.1 %. The lowest value of the CV(RMSE) of 49.0 % and the number of 7 outliers confirm the assumption. On the basis of the evaluation of the absolute cost models, the hGIFA is determined as adequate reference quantity for the further investigation of heating costs and corresponding cost indicators are established.

Table 4.19. Models (incl. reference quantities) for the estimation of CG 312-316 absolute costs
Absolute cost model (Euro/year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n
LR(abs)312-316GEFA | Linear regression (incl. GEFA) | 3 | 84.5 % | 83.4 % | 42.4 % | 50.5 % | 11 (5.9 %) | 186
LR(abs)312-316GIFA | Linear regression (incl. GIFA) | 3 | 84.3 % | 83.2 % | 43.0 % | 50.8 % | 11 (5.9 %) | 186
LR(abs)312-316hGIFA | Linear regression (incl. hGIFA) | 3 | 85.4 % | 84.4 % | 40.1 % | 49.0 % | 7 (3.8 %) | 186
LR(abs)312-316UFA | Linear regression (incl. UFA) | 5 | 85.0 % | 83.7 % | 43.0 % | 49.7 % | 9 (4.8 %) | 186
LR(abs)312-316GBV | Linear regression (incl. GBV) | 6 | 83.6 % | 82.1 % | 49.0 % | 51.9 % | 9 (4.8 %) | 186
With the heating cost indicators employed as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. The statistical models with the best performance are summarised and presented in Table 4.20. Comparing these models and their measures of performance, the binary classification tree model BCT(ind)312-316 offers the highest estimation accuracy with an R² of 65.7 % and a MAPE of 18.4 %. The CV(RMSE) of 20.5 % and the number of 5 outliers indicate the lowest estimation error compared to the other models. Comparing the models LR(ind)312-316 and NLR(ind)312-316, the transformation of both response and predictor variables decreases the MAPE by 3.0 percentage points and increases the R² (adj.) by 7.6 percentage points. The regression models can nevertheless not offer a better performance compared to the model BCT(ind)312-316 as indicated by all measures. All developed cost indicator models show a significantly higher level of accuracy compared to the models estimating absolute costs, as shown by the values of the MAPE and the CV(RMSE).

Table 4.20. Models for the estimation of CG 312-316 cost indicators
Cost indicator model (Euro/m² hGIFA*year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n
LR(ind)312-316 | Linear regression model | 6 | 51.8 % | 47.0 % | 21.5 % | 24.2 % | 6 (3.2 %) | 186
NLR(ind)312-316 | Non-linear regression model | 7 | 59.0 % | 54.6 % | 18.5 % | 22.3 % | 6 (3.2 %) | 186
ANN(ind)312-316 | Artificial neural network model | 7 | 50.6 % | - | 20.6 % | 24.5 % | 10 (5.4 %) | 186
BCT(ind)312-316 | Binary classification tree model | 6 | 65.7 % | - | 18.4 % | 20.5 % | 5 (2.7 %) | 186
Regression model
The non-linear regression model NLR(ind)312-316 is presented in detail with all relevant specifications in Table 4.21. Based on the linear regression model, the non-linear regression model is developed with transformations of both response and quantitative predictor variables, as determined by Box-Cox transformations (Box and Cox, 1964). The applied transformations are presented by their respective values of lambda (λ) and vary between natural logarithm (λ=0) and square root (λ=0.5) transformations. With a value of 13.36, the empirical F-value of the model NLR(ind)312-316 exceeds the theoretical F-value of 1.67 as determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)312-316 can be concluded. With a significance level of alpha (α) of 0.05, significant relationships between the response variable and 7 predictor variables can be determined as indicated by the p-values of the predictors. The regression constant and the determined predictors are presented with their coefficients (β) and the standard error of the coefficients. The size of the effect of the predictor variables is indicated by the standardised coefficients as described by Ryan et al. (2012). The standardised coefficients of the categorical predictors are determined by re-calculation of the regression with the coefficients of the dummy variables as introduced by Eisinga et al. (1991). The qualitative predictors type of facility and type of heating energy source show the largest effect on the heating costs. Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables (cf. Chatterjee and Hadi, 2006), none of the values exceeds the selected threshold of 5, indicating a stable regression model.

Table 4.21. Description of coefficients of non-linear regression model NLR(ind)312-316
Response variable: CG 312-316 (Euro/m² hGIFA*year)
Transf. (λ)
R²
R² (adj.)
MAPE
0 (LN)
59.0 %
54.6 %
18.5 %
Transf. (λ) Coef. (β) Coef. SE St. Coef.
CV(RMSE) F-value
n
22.3 %
13.36
186
t-value
p-value
VIF
-
4.036
0.280
0.000
14.42
0.000
-
0 (LN)
-0.193
0.034
0.267
-5.75
0.000
2.31
0 (LN)
-0.145
0.042
0.282
-3.48
0.001
2.65
0.5 (SQR)
0.334
0.096
0.215
3.47
0.001
1.54
0.407
0.082
0.294
4.94
0.000
1.47
β0
Constant
X1
Average floor size (m²)
X2
Number of floors
X3
Share of defective envelope (%)
X4
Share of defective
0.5 (SQR)
heat supply systems (%) X5
X6
X7
Thermal mass
-
-
-
0.224
-
0.000
-
Heavy thermal mass
-
0.000
0.000
-
-
-
-
Light thermal mass
-
-0.215
0.052
-
-4.12
0.000
1.22
-
-
-
0.436
-
0.000
-
District heating
-
0.000
0.000
-
-
-
-
Electricity
-
0.365
0.110
-
3.31
0.001
1.28
Gas
-
-0.025
0.046
-
-0.55
0.584
1.76
Oil
-
-0.229
0.079
-
-2.92
0.004
1.59
-
-
-
0.451
-
0.000
-
Care retirement home
-
0.000
0.000
-
-
-
-
Church facility
-
-0.305
0.111
-
-2.74
0.007
2.38
Community hall
-
-0.462
0.136
-
-3.40
0.001
1.62
Fire department
-
0.131
0.130
-
1.01
0.314
1.53
Kindergarten
-
-0.448
0.095
-
-4.73
0.000
4.79
Library
-
-0.359
0.158
-
-2.28
0.024
1.38
Municipal facility
-
-0.458
0.125
-
-3.65
0.000
1.65
Research/teaching facility
-
-0.450
0.114
-
-3.94
0.000
1.87
School facility
-
-0.432
0.090
-
-4.79
0.000
2.53
Sport facility
-
-0.590
0.099
-
-5.97
0.000
2.66
Town hall
-
-0.232
0.120
-
-1.92
0.056
1.52
Type of heating energy source
Type of facility
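The significance and multicollinearity criteria named in the text (α = 0.05, VIF threshold of 5) can be applied row by row to the values of Table 4.21. The following sketch is illustrative only; the list `rows` and the variable names are assumptions, and the reference categories (coefficient 0) are left out because no p-value or VIF is reported for them.

```python
ALPHA = 0.05      # significance level used in the text
VIF_LIMIT = 5.0   # multicollinearity threshold used in the text

# (label, p-value, VIF) copied from Table 4.21; reference categories omitted.
rows = [
    ("Average floor size", 0.000, 2.31),
    ("Number of floors", 0.001, 2.65),
    ("Share of defective envelope", 0.001, 1.54),
    ("Share of defective heat supply systems", 0.000, 1.47),
    ("Light thermal mass", 0.000, 1.22),
    ("Electricity", 0.001, 1.28),
    ("Gas", 0.584, 1.76),
    ("Oil", 0.004, 1.59),
    ("Church facility", 0.007, 2.38),
    ("Community hall", 0.001, 1.62),
    ("Fire department", 0.314, 1.53),
    ("Kindergarten", 0.000, 4.79),
    ("Library", 0.024, 1.38),
    ("Municipal facility", 0.000, 1.65),
    ("Research/teaching facility", 0.000, 1.87),
    ("School facility", 0.000, 2.53),
    ("Sport facility", 0.000, 2.66),
    ("Town hall", 0.056, 1.52),
]

# Dummy levels that do not differ significantly from their reference category.
not_significant = [label for label, p, _ in rows if p >= ALPHA]

# Largest VIF and whether every VIF stays below the threshold.
max_vif_label, _, max_vif = max(rows, key=lambda r: r[2])
all_below_limit = all(vif < VIF_LIMIT for _, _, vif in rows)

print("not significant at alpha:", not_significant)
print("largest VIF:", max_vif_label, max_vif, "| all below limit:", all_below_limit)
```

A p-value above α for a single dummy level only means that this level cannot be distinguished from the reference category; the categorical predictor as a whole is still reported as significant in the table.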
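The following equation describes the non-linear regression model NLR(ind)312-316:

$$\hat{Y} = e^{\beta_0} \cdot X_1^{\beta_1} \cdot X_2^{\beta_2} \cdot e^{\beta_3 \sqrt{X_3}} \cdot e^{\beta_4 \sqrt{X_4}} \cdot e^{\beta_5 X_5} \cdot e^{\beta_6 X_6} \cdot e^{\beta_7 X_7}$$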
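The model equation can be evaluated directly with the coefficients of Table 4.21. The sketch below is a hedged illustration rather than the author's implementation: it assumes that the shares of defective components enter as fractions between 0 and 1 (the source gives them in %, and the scaling is not stated here), that the categorical predictors are coded as 0/1 dummy variables, and that no bias correction is applied when transforming back from the logarithmic scale. The function name and the category keys are hypothetical.

```python
import math

BETA0 = 4.036
B_AFS, B_FLOORS = -0.193, -0.145          # ln-transformed predictors (λ = 0)
B_DEF_ENV, B_DEF_HEAT = 0.334, 0.407      # square-root-transformed predictors (λ = 0.5)
THERMAL_MASS = {"heavy": 0.0, "light": -0.215}
ENERGY_SOURCE = {"district heating": 0.0, "electricity": 0.365,
                 "gas": -0.025, "oil": -0.229}
FACILITY = {
    "care retirement home": 0.0, "church facility": -0.305,
    "community hall": -0.462, "fire department": 0.131,
    "kindergarten": -0.448, "library": -0.359,
    "municipal facility": -0.458, "research/teaching facility": -0.450,
    "school facility": -0.432, "sport facility": -0.590, "town hall": -0.232,
}


def estimate_heating_cost(afs_m2, floors, def_env, def_heat,
                          thermal_mass, energy_source, facility):
    """Estimated CG 312-316 cost indicator (Euro/m² hGIFA*year).

    def_env and def_heat are assumed to be fractions in [0, 1].
    """
    log_y = (BETA0
             + B_AFS * math.log(afs_m2)
             + B_FLOORS * math.log(floors)
             + B_DEF_ENV * math.sqrt(def_env)
             + B_DEF_HEAT * math.sqrt(def_heat)
             + THERMAL_MASS[thermal_mass]
             + ENERGY_SOURCE[energy_source]
             + FACILITY[facility])
    return math.exp(log_y)


# Hypothetical facility: two-storey kindergarten, 400 m² per floor, gas heating.
example = estimate_heating_cost(400, 2, 0.0, 0.0, "heavy", "gas", "kindergarten")
print(f"{example:.2f} Euro/m² hGIFA*year")
```

Summing the terms on the logarithmic scale and exponentiating once is equivalent to the product form of the equation and avoids repeated exponentiation.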
The residual plots in Figure 4.10 illustrate the distribution of the residuals (difference between observed and estimated value) and allow further conclusions about the quality of fit of the non-linear regression model NLR(ind)312-316 to the underlying data sample. The standardised residuals of the model appear to be uncorrelated with the estimated values and are distributed normally, as illustrated by the scatter plot and the histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points and therefore a correctly specified model (cf. Fahrmeir et al., 2013).
[Figure: scatter plot of standardised residuals against fitted values and histogram (frequency) of standardised residuals]
Figure 4.10. Residuals for non-linear regression model NLR(ind)312-316
Binary classification tree model
According to Curram and Mingers (1994), classification tree models may have the advantage of revealing and describing the causal interrelationships between variables transparently. As the model with the best compliance to the underlying data sample, the binary classification tree model BCT(ind)312-316 is presented hereinafter. The achieved R² of 65.7 % and MAPE of 18.4 % indicate a relatively high accuracy of cost estimation (cf. Table 4.20). The tree of the model is illustrated in Figure 4.11 and a summary of relevant parameters and specifications is given in Table 4.22. The model BCT(ind)312-316 is based on the training sample of 186 observations and employs the classification and regression tree growing method CRT as outlined in Section 2.3.4. The developed model has a tree depth of 7 layers and includes a total of 31 nodes, of which 16 are terminal nodes with stop rules. A significant effect on the heating costs is indicated for 6 predictor variables. These predictors are among the 7 predictors identified by the non-linear regression model NLR(ind)312-316 (thermal mass is not included in the tree), though the largest effect is displayed for the variables share of defective heat supply systems and type of heating energy source. Although the size of the effects of the predictors differs between the models NLR(ind)312-316 and BCT(ind)312-316, both models yield consistent results. Therefore, it can be assumed that both models are specified correctly.
[Figure: tree diagram; root node re_312-316_hGIFA with mean 12.43 and n = 186, first split on cn_sh_defHtSys; further nodes not recoverable]
Figure 4.11. Tree diagram of binary classification tree model BCT(ind)312-316
Table 4.22. Specifications and results of binary classification tree model BCT(ind)312-316

| Binary classification tree model | Parameter | Specifications |
|---|---|---|
| Specifications | Growing method | Classification and regression tree CRT |
| | Dependent variable | re_312-316_hGIFA (CG 312-316, Euro/m² hGIFA*year) |
| | Sample size | 186 |
| | Minimum cases in nodes | 3 |
| Results | Independent variables | cn_sh_defHtSys (Share of defective heat supply systems, %); qv_EnSource (Type of heating energy source); qv_Util (Type of facility); cn_AFS (Average floor size, m²); cn_sh_defEnv (Share of defective envelope, %); dn_Floors (Number of floors) |
| | Number of nodes | 31 |
| | Number of terminal nodes | 16 |
| | Tree depth | 7 |
4.4.3 Categorised cost indicators
Based on the results of the developed statistical models, the training sample is employed to introduce the median values of categorised cost indicators for the purpose of estimation as outlined in Section 2.3.5. The underlying cost data are therefore defined as the CG 312-316 heating costs and the respective reference quantity is defined as the heatable gross internal floor area hGIFA as determined in the previous Section. The cost indicators are categorised according to the predictor variables with the largest effect on heating costs. The standardised coefficients of the non-linear regression model NLR(ind)312-316 show the largest effects for the qualitative variables type of facility and type of heating energy source. As presented in Table 4.23, the heating cost indicators are categorised according to their type of facility. A further categorisation is made by the type of heating energy source with a distinction between facilities with district heating, electricity, gas, and oil heating. The sub-categorised cost indicators are solely available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25 % percentile) and upper quartiles (75 % percentile) are presented for all categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The median values of the heating cost indicators range between 8.70 Euro/m² hGIFA*year for research/teaching facilities and 20.91 Euro/m² hGIFA*year for kindergartens with electricity heating.
Table 4.23. Categorised CG 312-316 cost indicators

| Type of facility | Lower quartile [a] | MV(ind)312-316 [a] | Upper quartile [a] | n [b] |
|---|---|---|---|---|
| Care retirement home | 9.26 | 11.03 | 14.65 | 12 |
| Church facility | 10.63 | 11.62 | 15.97 | 11 |
| Community hall | 10.07 | 10.60 | 12.84 | 5 |
| Fire department | 13.55 | 17.23 | 23.17 | 5 |
| Kindergarten | 10.31 | 12.97 | 16.31 | 95 |
| - District heating | 9.54 | 11.85 | 15.33 | 15 |
| - Electricity | 17.38 | 20.91 | 22.91 | 6 |
| - Gas | 10.02 | 12.97 | 15.90 | 67 |
| - Oil | 8.25 | 11.76 | 13.39 | 7 |
| Library | 7.41 | 9.78 | 14.14 | 3 |
| Municipal facility | 8.88 | 9.73 | 11.68 | 6 |
| Research/teaching facility | 6.20 | 8.70 | 10.55 | 8 |
| School facility | 8.03 | 9.52 | 11.62 | 19 |
| - District heating | 7.83 | 8.99 | 10.67 | 9 |
| - Gas | 7.79 | 10.70 | 12.85 | 10 |
| Sport facility | 7.80 | 10.42 | 12.74 | 16 |
| - District heating | 7.51 | 9.76 | 12.05 | 9 |
| - Gas | 7.73 | 11.46 | 12.32 | 7 |
| Town hall | 9.86 | 10.43 | 13.76 | 6 |

[a] CG 312-316 cost indicators (Euro/m² hGIFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 186 observations.
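An estimate based on the categorised cost indicators amounts to multiplying the indicator of the matching category by the reference quantity. The following sketch shows this with a few rows of Table 4.23; the dictionary layout, the function name, and the fallback to the facility-level indicator when no sub-category exists are assumptions made for illustration.

```python
# (lower quartile, median, upper quartile) in Euro/m² hGIFA*year, from Table 4.23.
# Key: (type of facility, type of heating energy source or None).
INDICATORS = {
    ("kindergarten", None): (10.31, 12.97, 16.31),
    ("kindergarten", "gas"): (10.02, 12.97, 15.90),
    ("kindergarten", "electricity"): (17.38, 20.91, 22.91),
    ("school facility", None): (8.03, 9.52, 11.62),
    ("school facility", "district heating"): (7.83, 8.99, 10.67),
}


def annual_heating_cost(facility, hgifa_m2, energy_source=None):
    """Return (lower, median, upper) annual heating cost estimates in Euro."""
    key = (facility, energy_source)
    if key not in INDICATORS:
        # No sub-category available: fall back to the facility-level indicator.
        key = (facility, None)
    return tuple(round(value * hgifa_m2, 2) for value in INDICATORS[key])


# Hypothetical kindergarten with 800 m² hGIFA and gas heating.
print(annual_heating_cost("kindergarten", 800, "gas"))
```

The quartiles give a simple range around the median estimate; they are not a confidence interval of the individual estimate.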
4.4.4 Performance validation
As presented in Section 4.4.2, the development of the statistical models for the estimation of heating cost indicators is based on a training sample of 186 observations. The description of the models contains various measures of performance as summarised in Section 4.4.2. Hereinafter, an independent test sample is employed to draw unbiased statistical inferences about the estimation accuracy of the developed models and of the estimation by the median values of the cost indicators. For the CG 312-316, the test sample contains 20 observations, is selected randomly, is representative of the total sample, and is not employed for the model development as described in Section 3.3. In Table 4.24, the values of the percentage errors PE are presented for all statistical models and for the estimation by the median values of the categorised cost indicators MV(ind)312-316. The PE values for the 20 observations are calculated by applying the observed characteristics to the statistical models. The median values of the cost indicators are selected under consideration of the respective characteristics. In addition, the method with the most accurate estimation is given for each observation and the estimation accuracy is summarised by the mean absolute percentage error MAPE.
Table 4.24. Comparison of PE and MAPE values (test sample) for CG 312-316 estimation methods

| Obs. | Type of facility | LR(ind)312-316 | NLR(ind)312-316 | ANN(ind)312-316 | BCT(ind)312-316 | MV(ind)312-316 | Preference |
|---|---|---|---|---|---|---|---|
| 5 | Kindergarten | -9.4 % | -19.0 % | 19.0 % | -8.2 % | -10.4 % | BCT(ind)312-316 |
| 24 | Church facility | -1.5 % | -23.3 % | 20.5 % | 17.8 % | 55.7 % | LR(ind)312-316 |
| 34 | Kindergarten | 12.1 % | 0.4 % | 2.9 % | 8.3 % | 29.1 % | NLR(ind)312-316 |
| 40 | Town hall | 12.1 % | -11.0 % | 15.4 % | 11.7 % | 55.1 % | NLR(ind)312-316 |
| 47 | Kindergarten | -35.1 % | 3.4 % | 10.0 % | -21.7 % | -38.5 % | NLR(ind)312-316 |
| 54 | Kindergarten | -38.9 % | -25.4 % | -26.6 % | -24.4 % | -19.8 % | MV(ind)312-316 |
| 57 | Res./teach. | 50.9 % | 29.9 % | 33.1 % | 18.7 % | 43.6 % | BCT(ind)312-316 |
| 68 | School facility | 7.8 % | -0.9 % | 16.3 % | -7.5 % | -13.7 % | NLR(ind)312-316 |
| 92 | Kindergarten | -4.5 % | -5.2 % | 2.4 % | 13.5 % | -7.8 % | ANN(ind)312-316 |
| 112 | Sport facility | -25.1 % | -31.4 % | -34.5 % | -30.5 % | -3.5 % | MV(ind)312-316 |
| 130 | Kindergarten | -14.5 % | -7.1 % | -6.6 % | -1.2 % | -15.1 % | BCT(ind)312-316 |
| 145 | Kindergarten | -7.9 % | 4.7 % | 9.1 % | 5.8 % | -17.3 % | NLR(ind)312-316 |
| 161 | Kindergarten | 13.6 % | 12.6 % | 28.3 % | 11.3 % | 9.5 % | MV(ind)312-316 |
| 167 | Care ret. home | -25.7 % | -27.1 % | 4.0 % | -4.1 % | -1.6 % | MV(ind)312-316 |
| 187 | School facility | -3.0 % | -2.6 % | -11.1 % | 5.3 % | -10.8 % | NLR(ind)312-316 |
| 202 | Municipal fac. | 26.0 % | 33.8 % | 31.4 % | 21.7 % | 39.3 % | BCT(ind)312-316 |
| 211 | School facility | -4.4 % | -10.8 % | -20.3 % | 5.8 % | 7.5 % | LR(ind)312-316 |
| 230 | Kindergarten | -40.2 % | -35.0 % | -24.6 % | -33.5 % | -28.5 % | ANN(ind)312-316 |
| 237 | Sport facility | 23.1 % | 31.4 % | 28.9 % | 11.1 % | 18.8 % | BCT(ind)312-316 |
| 247 | Kindergarten | -43.5 % | -34.7 % | -36.6 % | -26.7 % | -44.1 % | BCT(ind)312-316 |
| Total (MAPE) | | 20.0 % | 17.5 % | 19.1 % | 14.4 % | 23.5 % | BCT(ind)312-316 |

[Figure: distribution of the absolute percentage error APE (axis from 0.0 to 0.7) for the test and training samples of LR(ind)312-316, NLR(ind)312-316, ANN(ind)312-316, BCT(ind)312-316, and MV(ind)312-316]
Figure 4.12. APE values (test and training sample) for CG 312-316 estimation methods
The MAPE values for the test sample confirm that the model BCT(ind)312-316 provides the most accurate estimation of heating costs, as already indicated by the performance measures of the training sample. For this model, the values of the PE range between an error of -1.2 % for the most accurate estimation and -33.5 % for the estimation with the lowest accuracy. A comparison of the MAPE values of the test sample (14.4 %) and the training sample (18.4 %) confirms a relatively constant estimation performance for the binary classification tree model BCT(ind)312-316 beyond the data sample employed for the development of the models. With a MAPE of 17.5 % for the test sample, the model NLR(ind)312-316 likewise shows a constant estimation accuracy compared to the results of the training sample (MAPE of 18.5 %). An overview of the achieved MAPE values for both test and training samples for all analysed methods is illustrated in Figure 4.12. Furthermore, the distribution of the absolute percentage errors APE is illustrated for all observations. The figure reveals a relatively consistent distribution of the APE values between the respective test and training samples and therefore indicates a correct specification of the developed statistical models.
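The MAPE values of the test sample follow directly from the percentage errors listed in Table 4.24. The sketch below retraces them; the PE values are copied from the table in row order (Obs. 5 to 247), while the dictionary keys and the function name `mape` are chosen for illustration.

```python
# Percentage errors (in %) of the 20 test observations, copied from Table 4.24.
PE = {
    "LR":  [-9.4, -1.5, 12.1, 12.1, -35.1, -38.9, 50.9, 7.8, -4.5, -25.1,
            -14.5, -7.9, 13.6, -25.7, -3.0, 26.0, -4.4, -40.2, 23.1, -43.5],
    "NLR": [-19.0, -23.3, 0.4, -11.0, 3.4, -25.4, 29.9, -0.9, -5.2, -31.4,
            -7.1, 4.7, 12.6, -27.1, -2.6, 33.8, -10.8, -35.0, 31.4, -34.7],
    "ANN": [19.0, 20.5, 2.9, 15.4, 10.0, -26.6, 33.1, 16.3, 2.4, -34.5,
            -6.6, 9.1, 28.3, 4.0, -11.1, 31.4, -20.3, -24.6, 28.9, -36.6],
    "BCT": [-8.2, 17.8, 8.3, 11.7, -21.7, -24.4, 18.7, -7.5, 13.5, -30.5,
            -1.2, 5.8, 11.3, -4.1, 5.3, 21.7, 5.8, -33.5, 11.1, -26.7],
    "MV":  [-10.4, 55.7, 29.1, 55.1, -38.5, -19.8, 43.6, -13.7, -7.8, -3.5,
            -15.1, -17.3, 9.5, -1.6, -10.8, 39.3, 7.5, -28.5, 18.8, -44.1],
}


def mape(errors):
    """Mean absolute percentage error of a list of percentage errors."""
    return sum(abs(e) for e in errors) / len(errors)


scores = {name: mape(values) for name, values in PE.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAPE = {score:.1f} %")
print("lowest MAPE:", best)
```

Taking absolute values before averaging prevents over- and underestimations from cancelling each other out, which is why the MAPE rather than the mean PE is used to rank the methods.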
4.4.5 Summary
The heatable gross internal floor area hGIFA is determined as the most adequate reference quantity for the analysis of heating cost indicators (cost groups 312-316). Under consideration of a data sample of 206 observations, a binary classification tree model can provide the most accurate cost estimation with a mean absolute percentage error MAPE of 18.0 %. By a non-linear regression model, significance is indicated for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Type of heating energy source (standard)
– Share of defective heat supply systems (condition)
– Number of floors (quantity)
– Average floor size (quantity)
– Thermal mass (standard)
– Share of defective envelope (condition)
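The MAPE of 18.0 % for the total sample is consistent with a weighting of the training and test sample results of the model BCT(ind)312-316 by their number of observations. The source does not state how the value was calculated, so the following is only one reading that reproduces the reported figure:

$$\mathrm{MAPE}_{\mathrm{total}} = \frac{186 \cdot 18.4\,\% + 20 \cdot 14.4\,\%}{186 + 20} = \frac{3710.4\,\%}{206} \approx 18.0\,\%$$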
4.5 Electricity (CG 316)
4.5.1 Theoretical basis and variables
A data sample of 206 observations, consisting of a training sample of 186 observations and a test sample of 20 observations, serves as the basis for the empirical investigation of electricity costs (CG 316). Individual observations of the total sample are excluded from the data basis of the current analysis due to a limited availability of data. For residential facilities, for example, tenants are charged directly by the respective suppliers for their electricity consumption. Electricity cost data of these facilities cannot be provided by the project partners and the corresponding observations are excluded from the analysis. Furthermore, the electricity costs of some of the analysed facilities are only available at an aggregated level for multiple facilities and cannot be differentiated and allocated to a particular facility. These observations are likewise excluded from the current analysis of electricity costs.
The investigation includes all expenditures for electricity except the costs of heating electricity, which are included in the investigation of heating costs in Section 4.4. The underlying cost data have median values of absolute costs of 4,265 Euro per year and cost indicators of 3.85 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). The electricity costs represent a share of about 9.5 % of the operating costs (CG 300) and a share of about 27.5 % of the utility costs (CG 310).
A general overview and a definition of the investigated cost data are given in Section 2.2.1. The examined third level cost group of electricity (CG 316) is defined in the standard DIN 18960:2008-02. Besides the CG 316 costs as response variable of the analysis, various candidate predictor variables are defined and their effect on electricity costs is examined. The variables give detailed information on the quantities, characteristics, utilisation, and location of the facilities. A general overview of the variable groups and the available variables is presented in Section 2.2.2 and Section 3.2.2, respectively. The following overview describes the variables included in the current investigation.
Quantities

Reference quantities
Various areas and volumes are examined in order to determine an adequate reference quantity for the estimation of electricity costs. Therefore, the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ are analysed as candidate reference quantities in the current investigation of CG 316. All reference quantities are defined in the standard DIN 277-1:2016-01.

Specific areas
The analysis of electricity costs examines specific areas as candidate predictor variables. The share of the usable floor area UFA in % according to DIN 277-1:2016-01 and the share of the ventilated and air-conditioned GIFA in % are expressed as percentages of the gross internal floor area GIFA (DIN 277-1:2016-01) and are included in the investigation of CG 316 electricity costs.

Compactness
As a variable describing the compactness of the facilities, the average floor size in m² is examined as candidate predictor in the investigation of electricity costs. The average floor size in m² is calculated as the gross external floor area GEFA in m² according to DIN 277-1:2016-01 divided by the number of floors.

Function
A functional description of the facilities is given by the quantity number of elevator stops, which is therefore included in the analysis of the electricity costs in the current Section. Likewise, the share of glass surfaces on above-grade exterior walls in % and the share of double or triple glazing on exterior glass surfaces in % as defined in the standard DIN 277-3:2005-04 are considered as candidate predictor variables describing the envelope of buildings.

Characteristics

Condition of the technical installations
The share of defective electrical installations summarises the condition of high voltage installations, lighting systems, and telecommunication systems and is therefore taken into account as candidate predictor variable in the investigation of CG 316 electricity costs.

Standard of the technical installation
Comprising information about the sewerage, water, gas, heat supply, air treatment, electrical, communication, transport, and building automation systems, the standard of technical installations is examined in the investigation. In addition, the influence and significance of the standard of the heating system is quantified. The variable describes the availability of time-scheduled programs and measurement of external factors. The availability of individual or automated room controls of the heating, ventilation, shading, and lighting systems is considered by the qualitative predictor variable standard of building automation.

Utilisation
The utilisation of the facilities is represented by the candidate predictor variable type of facility. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation.

Location
The investigation of electricity costs includes the qualitative variable urban location with the characteristics urban and rural describing the surrounding area of the facilities.
4.5.2 Model design and specifications
On the basis of the training sample with 186 observations and the variables presented above, various statistical models are developed stepwise and described. The statistical models aim to give an accurate estimation of the CG 316 costs as response variable. Furthermore, the models are intended to reveal and describe the causal interrelationships between the response variable and the candidate predictor variables.
Model overview
For the determination of an adequate reference quantity, several linear regression models estimating absolute costs are developed. Each model contains one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models is presented in Table 4.25 including multiple measures of performance. With the highest R² (adj.) of 85.9 % and the lowest MAPE of 74.4 %, the best estimation accuracy is indicated for the model LR(abs)316GIFA, which includes the GIFA as one of the predictor variables. This is likewise confirmed by the lowest CV(RMSE) of 83.7 % compared to the other developed models. Based on the results of the absolute cost models, the gross internal floor area GIFA is determined as reference quantity for the introduction of cost indicators in the further investigation of electricity costs.

Table 4.25. Models (incl. reference quantities) for the estimation of CG 316 absolute costs

| Absolute cost model (Euro/year) | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|
| LR(abs)316GEFA, Linear regression (incl. GEFA) | 3 | 86.6 % | 85.7 % | 79.8 % | 84.3 % | 10 (5.4 %) | 186 |
| LR(abs)316GIFA, Linear regression (incl. GIFA) | 3 | 86.8 % | 85.9 % | 74.4 % | 83.7 % | 10 (5.4 %) | 186 |
| LR(abs)316UFA, Linear regression (incl. UFA) | 3 | 84.4 % | 83.3 % | 78.5 % | 91.0 % | 9 (4.8 %) | 186 |
| LR(abs)316GBV, Linear regression (incl. GBV) | 3 | 85.9 % | 84.8 % | 151.7 % | 86.5 % | 10 (5.4 %) | 186 |
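The selection of the reference quantity described above compares three performance measures across the candidate models. The sketch below retraces this comparison with the values of Table 4.25; the data structure and variable names are illustrative assumptions, not part of the source.

```python
# (R² adj., MAPE, CV(RMSE)) in % per candidate reference quantity, from Table 4.25.
MODELS = {
    "GEFA": (85.7, 79.8, 84.3),
    "GIFA": (85.9, 74.4, 83.7),
    "UFA": (83.3, 78.5, 91.0),
    "GBV": (84.8, 151.7, 86.5),
}

# Higher R² (adj.) is better; lower MAPE and CV(RMSE) are better.
best_r2_adj = max(MODELS, key=lambda q: MODELS[q][0])
best_mape = min(MODELS, key=lambda q: MODELS[q][1])
best_cv_rmse = min(MODELS, key=lambda q: MODELS[q][2])

print(best_r2_adj, best_mape, best_cv_rmse)
```

All three criteria point to the same candidate here, so no weighting between the measures is needed for this cost group.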
Employing the CG 316 cost indicators as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. A summary of the statistical models with the best performance is given in Table 4.26. Comparing the measures of performance, the non-linear regression model NLR(ind)316 offers the best estimation accuracy with an R² (adj.) of 66.3 % and a MAPE of 23.4 %. A CV(RMSE) of 28.8 % and the number of 6 outliers indicate the lowest estimation error compared to the other models. By transformation of both response and predictor variables, the MAPE decreases by 4.2 percentage points and the R² (adj.) increases by 1.6 percentage points for the non-linear regression model NLR(ind)316 compared to the linear regression model LR(ind)316. As indicated throughout all measures, the artificial neural network model ANN(ind)316 and the binary classification tree model BCT(ind)316 cannot offer an improvement of performance. Compared to the models estimating absolute costs, all developed cost indicator models show higher levels of accuracy as illustrated by the values of the MAPE and the CV(RMSE).

Table 4.26. Models for the estimation of CG 316 cost indicators

| Cost indicator model (Euro/m² GIFA*year) | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|
| LR(ind)316, Linear regression model | 3 | 67.0 % | 64.7 % | 27.6 % | 32.9 % | 9 (4.8 %) | 186 |
| NLR(ind)316, Non-linear regression model | 5 | 68.8 % | 66.3 % | 23.4 % | 28.8 % | 6 (3.2 %) | 186 |
| ANN(ind)316, Artificial neural network model | 5 | 44.3 % | - | 33.6 % | 42.6 % | 12 (6.5 %) | 186 |
| BCT(ind)316, Binary classification tree model | 3 | 63.4 % | - | 26.1 % | 34.5 % | 9 (4.8 %) | 186 |
Regression model
In Table 4.27, the non-linear regression model NLR(ind)316 is presented in detail with all relevant specifications as the model with the best compliance to the underlying data sample among the developed cost indicator models. With the linear regression model as a basis, both response and quantitative predictor variables of the non-linear regression model are transformed according to their Box-Cox transformations (Box and Cox, 1964). The respective transformations are represented by the values of lambda (λ) and range from a natural logarithm transformation (λ=0) of the response variable (CG 316 cost indicators) to square root (λ=0.5) and logarithm (λ=0) transformations of the included quantitative predictor variables. The empirical F-value of the model NLR(ind)316 of 26.97 exceeds the theoretical F-value of 1.75 as determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)316 can be concluded. Significant relationships between the response variable and 5 predictor variables can be determined as indicated by the p-values based on a significance level alpha (α) of 0.05. The coefficients (β) are presented for the regression constant and the determined predictor variables. In addition, the standard error of each coefficient indicates the standard deviation of its estimate for the underlying sample. The size of the effect of the predictor variables on the response variable is indicated by the standardised coefficients as described by Ryan et al. (2012). In order to determine the standardised coefficients of the categorical predictors, the regression is re-calculated with the coefficients of the dummy variables as introduced by Eisinga et al. (1991).
For the non-linear regression model NLR(ind)316, the largest effect is indicated for the qualitative predictor variable type of facility followed by the quantitative predictor variable share of ventilated and air-conditioned GIFA. None of the VIF values of the predictor variables exceeds the selected threshold of 5 (cf. Section 2.3.2). Although the VIF values reveal a certain multicollinearity among the determined predictor variables, a stable regression model is expected (cf. Chatterjee and Hadi, 2006).

Table 4.27. Description of coefficients of non-linear regression model NLR(ind)316

| Response variable | Transf. (λ) | R² | R² (adj.) | MAPE | CV(RMSE) | F-value | n |
|---|---|---|---|---|---|---|---|
| CG 316 (Euro/m² GIFA*year) | 0 (LN) | 68.8 % | 66.3 % | 23.4 % | 28.8 % | 26.97 | 186 |

| | Predictor variables | Transf. (λ) | Coef. (β) | Coef. SE | St. Coef. | t-value | p-value | VIF |
|---|---|---|---|---|---|---|---|---|
| β0 | Constant | - | 1.010 | 0.294 | 0.000 | 3.43 | 0.001 | - |
| X1 | Share of ventilated and air-conditioned GIFA (%) | 0.5 (SQR) | 0.558 | 0.120 | 0.247 | 4.66 | 0.000 | 1.56 |
| X2 | Average floor size (m²) | 0 (LN) | 0.102 | 0.042 | 0.160 | 2.41 | 0.017 | 2.41 |
| X3 | Number of elevator stops | 0.5 (SQR) | 0.066 | 0.026 | 0.177 | 2.57 | 0.011 | 2.63 |
| X4 | Share of defective electrical installations (%) | 0.5 (SQR) | 0.369 | 0.113 | 0.175 | 3.27 | 0.001 | 1.54 |
| X5 | Type of facility | - | - | - | 0.553 | - | 0.000 | - |
| | Care retirement home | - | 0.000 | 0.000 | - | - | - | - |
| | Church facility | - | -0.681 | 0.153 | - | -4.46 | 0.000 | 2.88 |
| | Community hall | - | -0.858 | 0.170 | - | -5.04 | 0.000 | 1.69 |
| | Fire department | - | -0.397 | 0.158 | - | -2.52 | 0.013 | 1.51 |
| | Kindergarten | - | -0.245 | 0.121 | - | -2.02 | 0.045 | 4.84 |
| | Library | - | -0.311 | 0.187 | - | -1.66 | 0.099 | 2.31 |
| | Municipal facility | - | -0.191 | 0.152 | - | -1.25 | 0.212 | 1.85 |
| | Research/teaching facility | - | 0.151 | 0.134 | - | 1.13 | 0.260 | 1.69 |
| | School facility | - | -0.813 | 0.127 | - | -6.42 | 0.000 | 3.29 |
| | Sport facility | - | -0.722 | 0.133 | - | -5.43 | 0.000 | 3.01 |
| | Town hall | - | -0.099 | 0.148 | - | -0.67 | 0.504 | 1.59 |
The following equation describes the non-linear regression model NLR(ind)316:

$$\hat{Y} = e^{\beta_0} \cdot e^{\beta_1 \sqrt{X_1}} \cdot X_2^{\beta_2} \cdot e^{\beta_3 \sqrt{X_3}} \cdot e^{\beta_4 \sqrt{X_4}} \cdot e^{\beta_5 X_5}$$
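Because the response variable is log-transformed, each dummy coefficient of the predictor type of facility acts as a multiplicative factor on the estimated cost indicator. The sketch below converts the coefficients of Table 4.27 into such factors relative to the reference category (care retirement home); the function name and dictionary keys are hypothetical, and the factors ignore the uncertainty expressed by the standard errors.

```python
import math

# Dummy coefficients of "type of facility" from Table 4.27;
# care retirement home is the reference category (coefficient 0).
FACILITY_COEF = {
    "care retirement home": 0.0, "church facility": -0.681,
    "community hall": -0.858, "fire department": -0.397,
    "kindergarten": -0.245, "library": -0.311,
    "municipal facility": -0.191, "research/teaching facility": 0.151,
    "school facility": -0.813, "sport facility": -0.722, "town hall": -0.099,
}


def relative_level(facility: str) -> float:
    """Factor on the estimate relative to the reference category, all else equal."""
    return math.exp(FACILITY_COEF[facility])


# List the facility types from the lowest to the highest factor.
for facility in sorted(FACILITY_COEF, key=FACILITY_COEF.get):
    print(f"{facility:28s} x {relative_level(facility):.2f}")
```

A factor below 1 means that, with all other predictors unchanged, the model estimates lower electricity cost indicators for this facility type than for a care retirement home.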
The distribution of the residuals (difference between observed and estimated value) presented in the residual plots in Figure 4.13 gives further information about the quality of fit of the model NLR(ind)316 to the underlying data sample. The standardised residuals appear to be uncorrelated to the estimated values and are distributed normally as illustrated by the scatter plot and histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates a correctly specified model and an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points (cf. Fahrmeir et al., 2013).
[Figure: scatter plot of standardised residuals against fitted values and histogram (frequency) of standardised residuals]
Figure 4.13. Residuals for non-linear regression model NLR(ind)316
Binary classification tree model
Compared to the model NLR(ind)316, the developed binary classification tree model BCT(ind)316 cannot offer an improvement of performance, as indicated by the achieved R² of 63.4 % and MAPE of 26.1 %. The performance measures of the model (cf. Table 4.26) nevertheless indicate a relatively high estimation accuracy. Since classification tree models may have the advantage of providing clear information on the importance of significant predictor variables as described by Tso and Yau (2007), the developed BCT(ind)316 model is likewise presented in the current Section. Figure 4.14 illustrates the structure of the model and Table 4.28 summarises relevant parameters and specifications.

Table 4.28. Specifications and results of binary classification tree model BCT(ind)316

| Binary classification tree model | Parameter | Specifications |
|---|---|---|
| Specifications | Growing method | Classification and regression tree CRT |
| | Dependent variable | re_316_GEFA (CG 316, Euro/m² GIFA*year) |
| | Sample size | 186 |
| | Minimum cases in nodes | 3 |
| Results | Independent variables | qv_Util (Type of facility); cn_sh_defElecIn (Share of defective electrical installations, %); dn_ElevSt (Number of elevator stops) |
| | Number of nodes | 9 |
| | Number of terminal nodes | 5 |
| | Tree depth | 3 |
Based on the training sample of 186 observations, the model BCT(ind)316 is developed with the classification and regression tree growing method CRT as described in Section 2.3.4. With a tree depth of 3 layers, the developed model includes 9 nodes in total, of which 5 are terminal nodes with stop rules. A significant effect is indicated for 3 predictor variables, which are among the 5 variables identified by the non-linear regression model NLR(ind)316. The model BCT(ind)316 displays the largest effect for the variables type of facility and share of defective electrical installations. A comparison of the models NLR(ind)316 and BCT(ind)316 reveals a certain level of conformity of the results and therefore indicates their correct specification.
[Figure: tree diagram; root node re_316_GIFA with mean 5.24 and n = 186, first split on qv_Util into res_teach, care_ret (mean 11.15, n = 20, terminal node) and all other facility types (mean 4.52, n = 166); further splits on cn_sh_defElecIn, qv_Util, and dn_ElevSt]
Figure 4.14. Tree diagram of binary classification tree model BCT(ind)316
4.5.3 Categorised cost indicators

As described in Section 2.3.5, the observations of the training sample are employed to introduce median values of cost indicators for the purpose of cost estimation. The indicators are categorised on the basis of the results of the developed statistical models. The underlying cost data are the electricity costs and the reference quantity is the gross internal floor area GIFA, as presented in the previous Section. The categorisation of the cost indicators follows the predictor variables with the largest effect on electricity costs. As indicated by the standardised coefficients of the non-linear regression model NLR(ind)316, the largest effects are revealed for the qualitative variable type of facility and the quantitative variable share of ventilated and air-conditioned GIFA.

The electricity cost indicators are categorised according to the type of facility as presented in Table 4.29. A further categorisation is made by the share of ventilated and air-conditioned GIFA, distinguishing facilities without ventilation and air-conditioning systems from facilities with ventilation or air-conditioning systems. Sub-categorised cost indicators are only available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25 % percentile) and upper quartiles (75 % percentile) are introduced for all categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The presented median values of the electricity cost indicators range between 2.32 Euro/m² GIFA*year for sport facilities without ventilation and air-conditioning systems and 13.36 Euro/m² GIFA*year for research and teaching facilities.

Table 4.29. Categorised CG 316 cost indicators

| Type of facility | Lower quartile [a] | MV(ind)316 [a] | Upper quartile [a] | n [b] |
|---|---|---|---|---|
| Care retirement home | 7.75 | 9.72 | 11.27 | 12 |
| Church facility | 1.95 | 2.44 | 3.12 | 11 |
| Community hall | 2.10 | 2.79 | 3.38 | 5 |
| Fire department | 4.55 | 6.08 | 7.75 | 5 |
| Kindergarten | 3.47 | 4.48 | 5.74 | 95 |
| – without ventilation and ac system | 3.08 | 3.74 | 5.04 | 50 |
| – with ventilation or ac system | 3.78 | 5.18 | 6.13 | 45 |
| Library | 6.16 | 6.27 | 8.68 | 3 |
| Municipal facility | 4.65 | 5.21 | 7.28 | 6 |
| Research/teaching facility | 9.83 | 13.36 | 17.38 | 8 |
| School facility | 2.54 | 3.01 | 3.94 | 19 |
| – without ventilation and ac system | 2.08 | 2.78 | 3.76 | 8 |
| – with ventilation or ac system | 2.56 | 3.49 | 4.36 | 11 |
| Sport facility | 2.32 | 3.89 | 6.20 | 16 |
| – without ventilation and ac system | 2.15 | 2.32 | 4.57 | 3 |
| – with ventilation or ac system | 2.43 | 4.00 | 6.51 | 13 |
| Town hall | 4.93 | 5.99 | 11.14 | 6 |

[a] CG 316 cost indicators (Euro/m² GIFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 186 observations.
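The derivation of categorised cost indicators (median and quartiles per category, with a minimum of 3 observations) can be sketched as follows. This is a minimal illustration, not the procedure of the study: the observations are hypothetical, the function name `categorised_indicators` is invented for this example, and the "inclusive" quartile method is an assumption, since the text does not state which percentile definition is applied.

```python
from statistics import median, quantiles

# Hypothetical observations: (type of facility, cost indicator in Euro/m² GIFA*year).
observations = [
    ("Library", 6.0), ("Library", 6.5), ("Library", 9.0),
    ("Fire department", 4.0), ("Fire department", 5.0), ("Fire department", 6.0),
    ("Fire department", 7.0), ("Fire department", 9.0),
    ("Town hall", 5.0), ("Town hall", 6.0),  # only 2 observations: no indicator
]

MIN_OBSERVATIONS = 3  # categories with fewer observations are not reported


def categorised_indicators(rows):
    """Return lower quartile, median, upper quartile and n per category."""
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)

    result = {}
    for category, values in groups.items():
        if len(values) < MIN_OBSERVATIONS:
            continue
        # Assumption: quartiles by linear interpolation ("inclusive" method).
        lower, _, upper = quantiles(values, n=4, method="inclusive")
        result[category] = {
            "lower_quartile": lower,
            "median": median(values),
            "upper_quartile": upper,
            "n": len(values),
        }
    return result


indicators = categorised_indicators(observations)
for category, values in indicators.items():
    print(category, values)
```

Other quartile definitions yield slightly different values for small categories, which matters for groups with only 3 observations.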
4.5.4 Performance validation

The statistical models for the estimation of electricity cost indicators are developed on the basis of a training sample of 186 observations. As summarised in Section 4.5.2, the description of the models contains multiple measures of performance. Hereinafter, unbiased statistical inferences about the estimation accuracy of the developed models and of the estimation by the median values of the cost indicators are drawn from an independent test sample of 20 observations. As described in Section 3.3, the test sample is selected randomly, is intended to be representative for the total sample, and is not employed for model development. The values of the percentage errors PE are presented for all statistical models and the estimation by the median values of the categorised cost indicators MV(ind)316 in
Table 4.30. The PE values are calculated by applying the observed characteristics of the 20 observations to the statistical models. The median values of the cost indicators are selected under consideration of the respective characteristics. Furthermore, the most accurate estimation method is given as preference for each observation and the estimation accuracy is summarised by the mean absolute percentage error MAPE.

Table 4.30. Comparison of PE and MAPE values (test sample) for CG 316 estimation methods

| Obs. | Type of facility | LR(ind)316 | NLR(ind)316 | ANN(ind)316 | BCT(ind)316 | MV(ind)316 | Preference |
|---|---|---|---|---|---|---|---|
| 5 | Kindergarten | -20.6 % | | | | | |
| 24 | Church facility | -22.7 % | | | | | |
| 34 | Kindergarten | 26.9 % | | | | | |
| 40 | Town hall | -71.7 % | -21.1 % | -40.2 % | -32.5 % | -32.0 % | NLR(ind)316 |
| 47 | Kindergarten | -37.7 % | -49.1 % | -16.1 % | -28.3 % | -5.4 % | MV(ind)316 |
| 54 | Kindergarten | -46.4 % | -35.5 % | -26.0 % | -39.2 % | -14.4 % | MV(ind)316 |
| 57 | Research/teaching | 7.4 % | 35.9 % | 49.9 % | 3.7 % | -15.4 % | BCT(ind)316 |
| 68 | School facility | 24.2 % | 27.3 % | -33.1 % | 5.3 % | 56.2 % | BCT(ind)316 |
| 92 | Kindergarten | -12.4 % | 8.0 % | -17.7 % | -24.1 % | 22.7 % | NLR(ind)316 |
| 112 | Sport facility | -15.4 % | 10.5 % | -38.1 % | -52.3 % | 41.2 % | NLR(ind)316 |
| 130 | Kindergarten | -55.1 % | -33.2 % | -18.3 % | -58.8 % | -30.5 % | ANN(ind)316 |
| 145 | Kindergarten | -29.7 % | -30.4 % | -8.6 % | -32.8 % | -9.1 % | ANN(ind)316 |
| 161 | Kindergarten | 3.7 % | 12.3 % | -3.4 % | -7.4 % | 33.2 % | ANN(ind)316 |
| 167 | Care retirement home | -4.2 % | 4.8 % | 18.4 % | -30.7 % | -14.3 % | LR(ind)316 |
| 187 | School facility | -66.6 % | -47.8 % | -79.1 % | -73.7 % | -63.2 % | NLR(ind)316 |
| 202 | Municipal facility | 6.3 % | -25.2 % | -77.5 % | -23.2 % | 29.3 % | LR(ind)316 |
| 211 | School facility | -36.3 % | -17.5 % | -43.2 % | -34.5 % | -26.3 % | NLR(ind)316 |
| 230 | Kindergarten | -20.3 % | -9.7 % | 3.8 % | -19.3 % | 2.0 % | MV(ind)316 |
| 237 | Sport facility | 20.5 % | 17.4 % | -33.3 % | -6.7 % | 58.8 % | BCT(ind)316 |
| 247 | Kindergarten | -49.4 % | -27.9 % | -13.6 % | -52.9 % | -25.6 % | ANN(ind)316 |
| Total (MAPE) | | 28.9 % | 22.9 % | 30.9 % | 30.0 % | 27.0 % | NLR(ind)316 |

For observations 5, 24, and 34, the remaining PE values and preferences are preserved only as an unaligned sequence: -7.1 %, 6.4 %, -28.6 %, -67.3 %, -23.5 %, -1.5 %, MV(ind)316, -33.4 %, -10.0 %, 9.6 %, MV(ind)316, -23.2 %, 18.5 %, 49.3 %, NLR(ind)316.
As already indicated by the performance measures of the training sample, the values of the MAPE for the test sample validate the most accurate estimation of electricity costs for the model NLR(ind)316. The values of the PE range between an error of 4.8 % for the most accurate estimation and -49.1 % for the estimation with the lowest accuracy. A comparison of the MAPE values of the test sample (22.9 %) and training sample (23.4 %) confirms a constant estimation performance for the non-linear regression model NLR(ind)316. Figure 4.15 gives an overview of the achieved values of the MAPE for both test and training samples for all applied methods. Likewise, the figure illustrates the distribution of the absolute percentage errors APE for all observations. A comparison of the respective test and training samples reveals a relatively consistent distribution of the APE values. Therefore, it can be assumed that the statistical models are correctly specified and a consistent estimation performance beyond the data sample used for model development is validated.

[Figure: absolute percentage error APE (vertical axis, 0.0 to 1.4) of the test and training samples for LR(ind)316, NLR(ind)316, ANN(ind)316, BCT(ind)316, and MV(ind)316]

Figure 4.15. APE values (test and training sample) for CG 316 estimation methods
4.5.5 Summary

Based on the analysis of a data sample of 206 observations, various statistical models are developed to estimate electricity costs (cost group 316). The gross internal floor area GIFA is determined as the most adequate reference quantity. The most accurate estimation is provided by a non-linear regression model as indicated by a mean absolute percentage error of 23.3 % for the total data sample. Significance is revealed for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Share of ventilated and air-conditioned gross internal floor area GIFA (quantity)
– Number of elevator stops (quantity)
– Share of defective electrical installations (condition)
– Average floor size (quantity)
4.6 Disposal (CG 320)

4.6.1 Theoretical basis and variables

The empirical analysis of the disposal costs (CG 320) employs a total data sample of 200 observations divided into a training sample of 185 observations and a test sample of 15 observations. Individual observations of the total sample are excluded from the data basis of the current analysis due to a limited availability of data. For residential facilities in particular, tenants are charged directly by the respective suppliers for sewage and waste disposal. The disposal cost data of these facilities can, in part or in total, not be provided by the project partners and the corresponding observations are excluded from the analysis. Furthermore, the disposal costs of some of the analysed facilities are only available on a superior level for multiple facilities and cannot be differentiated and allocated to a particular facility. These observations are likewise excluded from the current analysis of disposal costs.

The underlying cost data of the second level cost group are defined in the standard DIN 18960:2008-02 and have median values of absolute costs of 1,252 Euro per year and cost indicators of 0.85 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). With 1.9 %, the disposal costs represent a relatively small share of the total operating costs (CG 300). A general overview and definition of the investigated cost data is presented in Section 2.2.1. The examined costs of CG 320 are aggregated from the following third level cost groups according to the standard DIN 18960:2008-02:
– CG 321: Sewage
– CG 322: Waste

Various candidate predictor variables are defined and analysed regarding their effect on disposal costs. The variables give detailed information on the quantities, characteristics, utilisation, and location of the facilities. A general overview of the variable groups and the available variables is given in Section 2.2 (Definition of key variables) and Section 3.2 (Presentation of the sample), respectively. The variables included in the current investigation of CG 320 costs are presented in the following overview.
Quantities

Reference quantities

In order to determine an adequate reference quantity for the estimation of disposal costs, various areas and volumes are examined. The investigation therefore includes the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 as candidate reference quantities.

Specific areas

The share of usable floor area UFA on the GIFA in % and the share of sanitary area on the GIFA in % are defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12, refer to the gross internal floor area GIFA as percentage, and are examined as candidate predictor variables in the analysis of disposal costs.

Function

The quantity number of sanitary facilities gives a functional description of the facilities and is therefore included in the analysis of the CG 320 costs in the current Section.

Characteristics

Condition of the technical installations

The share of defective water installations is considered as candidate predictor describing the condition of sewage and fresh water drains and pipes. Besides, the investigation quantifies the significance and influence of the share of defective sanitary installations describing the condition of sanitary appliances and equipment.

Utilisation

The investigation includes the type of facility as candidate predictor variable to take the utilisation of the facilities into account. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, residential facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation. The variable type of water usage with the characteristics process water usage and sanitary water usage specifies the water consumption according to the utilisation and is therefore examined.

Location

The qualitative variable urban location describes the surrounding area of the facilities and is included with the characteristics urban location and rural location in the investigation of disposal costs.
4.6.2 Model design and specifications

Using the presented response and predictor variables and the sample of 185 observations as a basis, various statistical models are developed stepwise and presented in the current Section. These statistical models aim to reveal and describe the causal interrelationships between the disposal costs and the candidate predictor variables presented above. Furthermore, the models intend to give an accurate estimation of CG 320 disposal costs.
Model overview

In order to identify an adequate reference quantity, several linear regression models estimating absolute costs are developed, each containing one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models and their measures of performance presented in Table 4.31 indicates that the model LR(abs)320GEFA, which includes the GEFA as one of the predictor variables, offers the best estimation accuracy with the highest value of the R² (adj.) of 70.6 %, the lowest value of the MAPE of 134.4 %, and the lowest value of the CV(RMSE) of 85.8 %. Based on the evaluation of the absolute cost models, the GEFA is determined as reference quantity for the further investigation of disposal costs and corresponding cost indicators are established.

Table 4.31. Models (incl. reference quantities) for the estimation of CG 320 absolute costs

| Absolute cost model (Euro/year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(abs)320GEFA | Linear regression (incl. GEFA) | 5 | 73.0 % | 70.6 % | 134.4 % | 85.8 % | 13 (7.0 %) | 185 |
| LR(abs)320GIFA | Linear regression (incl. GIFA) | 6 | 73.0 % | 70.5 % | 136.4 % | 85.9 % | 14 (7.6 %) | 185 |
| LR(abs)320UFA | Linear regression (incl. UFA) | 6 | 72.9 % | 70.4 % | 137.4 % | 86.1 % | 13 (7.0 %) | 185 |
| LR(abs)320GBV | Linear regression (incl. GBV) | 5 | 72.1 % | 69.6 % | 151.0 % | 87.4 % | 13 (7.0 %) | 185 |
With the disposal cost indicators employed as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. The statistical models with the best performance are summarised and presented in Table 4.32. Comparing these models and their measures of performance, the non-linear regression model NLR(ind)320 offers the best estimation accuracy with an R² (adj.) of 67.4 %, a MAPE of 44.3 %, and a CV(RMSE) of 55.4 %. The transformation of both response and predictor variables decreases the MAPE by 16.3 percentage points and increases the R² (adj.) by 22.7 percentage points compared to the linear regression model LR(ind)320. The models ANN(ind)320 and BCT(ind)320 cannot offer an improvement of performance over the model NLR(ind)320 as indicated by all measures. All developed cost indicator models show higher levels of accuracy compared to the models estimating absolute costs, as indicated by the values of the MAPE and the CV(RMSE).

Table 4.32. Models for the estimation of CG 320 cost indicators

| Cost indicator model (Euro/m² GEFA*year) | Description | No. var. | R² | R² (adj.) | MAPE | CV(RMSE) | Outliers | n |
|---|---|---|---|---|---|---|---|---|
| LR(ind)320 | Linear regression model | 2 | 48.3 % | 44.7 % | 60.6 % | 60.8 % | 9 (4.9 %) | 185 |
| NLR(ind)320 | Non-linear regression model | 3 | 69.7 % | 67.4 % | 44.3 % | 55.4 % | 7 (3.8 %) | 185 |
| ANN(ind)320 | Artificial neural network model | 3 | 45.0 % | - | 72.5 % | 62.7 % | 10 (5.4 %) | 185 |
| BCT(ind)320 | Binary classification tree model | 2 | 56.2 % | - | 52.8 % | 55.9 % | 10 (5.4 %) | 185 |
Regression model

With the best compliance to the underlying data sample, the non-linear regression model NLR(ind)320 is presented in detail with all relevant specifications in Table 4.33. Based on the linear regression model, the non-linear regression model is developed with transformations of both response and quantitative predictor variables, as determined by Box-Cox transformations (Box and Cox, 1964). The applied transformations are presented by their respective values of lambda (λ) and range from a natural logarithm transformation (λ=0) of the response variable (disposal cost indicators) to a square root (λ=0.5) transformation of the quantitative predictor variable.

Table 4.33. Description of coefficients of non-linear regression model NLR(ind)320

| Response variable | Transf. (λ) | R² | R² (adj.) | MAPE | CV(RMSE) | F-value | n |
|---|---|---|---|---|---|---|---|
| CG 320 (Euro/m² GEFA*year) | 0 (LN) | 69.7 % | 67.4 % | 44.3 % | 55.4 % | 30.19 | 185 |

| | Predictor variables | Transf. (λ) | Coef. (β) | Coef. SE | St. Coef. | t-value | p-value | VIF |
|---|---|---|---|---|---|---|---|---|
| β0 | Constant | - | 1.154 | 0.296 | 0.000 | 3.90 | 0.000 | - |
| X1 | Share of defective water installations (%) | 0.5 (SQR) | 0.830 | 0.169 | 0.217 | 4.92 | 0.000 | 1.09 |
| X2 | Type of facility | - | - | - | 0.828 | - | 0.000 | - |
| | Care retirement home | - | 0.000 | 0.000 | - | - | - | - |
| | Church facility | - | -0.463 | 0.356 | - | -1.30 | 0.195 | 1.28 |
| | Community hall | - | -0.961 | 0.318 | - | -3.02 | 0.003 | 1.35 |
| | Fire department | - | -0.994 | 0.292 | - | -3.40 | 0.001 | 1.42 |
| | Kindergarten | - | -0.597 | 0.175 | - | -3.41 | 0.001 | 4.82 |
| | Library | - | -1.281 | 0.319 | - | -4.02 | 0.000 | 1.36 |
| | Municipal facility | - | -1.222 | 0.275 | - | -4.44 | 0.000 | 1.50 |
| | Research/teaching facility | - | -2.229 | 0.249 | - | -8.97 | 0.000 | 1.61 |
| | Residential facility | - | 0.064 | 0.202 | - | 0.32 | 0.751 | 2.71 |
| | School facility | - | -2.130 | 0.207 | - | -10.29 | 0.000 | 2.26 |
| | Sport facility | - | -2.354 | 0.215 | - | -10.94 | 0.000 | 2.18 |
| | Town hall | - | -0.205 | 0.318 | - | -0.65 | 0.519 | 1.35 |
| X3 | Type of water usage | - | - | - | 0.098 | - | 0.030 | - |
| | Process and sanitary | - | 0.000 | 0.000 | - | - | - | - |
| | Sanitary | - | -0.637 | 0.291 | - | -2.19 | 0.030 | 1.13 |
The following equation describes the non-linear regression model NLR(ind)320:

$$\hat{Y} = e^{\beta_0} \cdot e^{\beta_1 \sqrt{X_1}} \cdot e^{\beta_2 X_2} \cdot e^{\beta_3 X_3}$$
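The application of the model equation can be illustrated with the coefficients of Table 4.33. The following is a sketch under stated assumptions rather than the implementation of the study: the function name `estimate_disposal_indicator` is hypothetical, the share of defective water installations is assumed to enter as a fraction between 0 and 1, the categorical predictors are assumed to enter as dummy variables whose coefficient is added for the matching category, and no correction for the back-transformation from the logarithmic scale is applied.

```python
import math

# Coefficients of NLR(ind)320 as listed in Table 4.33.
CONSTANT = 1.154
COEF_DEFECTIVE_WATER = 0.830  # multiplies the square root of X1

# Dummy coefficients; reference categories carry a coefficient of 0.000.
FACILITY_COEF = {
    "Care retirement home": 0.000,
    "Church facility": -0.463,
    "Community hall": -0.961,
    "Fire department": -0.994,
    "Kindergarten": -0.597,
    "Library": -1.281,
    "Municipal facility": -1.222,
    "Research/teaching facility": -2.229,
    "Residential facility": 0.064,
    "School facility": -2.130,
    "Sport facility": -2.354,
    "Town hall": -0.205,
}
WATER_USAGE_COEF = {"Process and sanitary": 0.000, "Sanitary": -0.637}


def estimate_disposal_indicator(share_defective, facility, water_usage):
    """Estimate CG 320 costs in Euro/m² GEFA*year.

    Assumption: share_defective is a fraction in [0, 1].
    """
    log_estimate = (
        CONSTANT
        + COEF_DEFECTIVE_WATER * math.sqrt(share_defective)
        + FACILITY_COEF[facility]
        + WATER_USAGE_COEF[water_usage]
    )
    # The response is modelled on the natural-log scale (lambda = 0).
    return math.exp(log_estimate)


kindergarten = estimate_disposal_indicator(0.0, "Kindergarten", "Sanitary")
care_home = estimate_disposal_indicator(0.0, "Care retirement home", "Process and sanitary")
print(round(kindergarten, 2), round(care_home, 2))
```

Because the response is log-transformed, each coefficient acts multiplicatively on the estimate: a negative dummy coefficient scales the cost indicator down relative to the reference category.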
With a value of 30.19, the empirical F-value of the model NLR(ind)320 exceeds the theoretical F-value of 1.78 as determined under consideration of the sample size, the number of included predictor variables, and a confidence level of 95 % (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)320 can be concluded. With a significance level of alpha (α) set to 0.05, significant relationships between the response variable and 3 predictor variables can be determined as indicated by the p-values of the predictors. The regression constant and the determined predictor variables are presented with their respective coefficients (β) and the standard error of the coefficients. The size of the effect of the predictor variables is indicated by the standardised coefficients as described by Ryan et al. (2012). The standardised coefficients of the categorical predictors are determined by re-calculation of the regression with the coefficients of the dummy variables as introduced by Eisinga et al. (1991). The qualitative predictor type of facility shows the largest effect on the disposal costs, followed by the condition of the technical installations represented by the quantitative variable share of defective water installations. Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables (cf. Chatterjee and Hadi, 2006), none of the values exceeds the selected threshold of 5, indicating a stable regression model.

Further assumptions about the quality of fit of the model NLR(ind)320 to the underlying data sample can be drawn from the distribution of the residuals (difference between observed and estimated value) presented in the residual plots in Figure 4.16. The standardised residuals of the model appear to be uncorrelated to the estimated values and are distributed normally as illustrated by the scatter plot and histogram of the residuals, respectively. The homoscedastic variance of the residuals therefore indicates a correctly specified model and an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points (cf. Fahrmeir et al., 2013).
[Figure: scatter plot of the standardised residuals against the fitted values and histogram (frequency) of the standardised residuals]

Figure 4.16. Residuals for non-linear regression model NLR(ind)320
Binary classification tree model

With an R² of 56.2 %, a MAPE of 52.8 %, and a CV(RMSE) of 55.9 %, the binary classification tree model BCT(ind)320 cannot offer an improved performance compared to the non-linear regression model NLR(ind)320 (cf. Table 4.32). Since the results of a classification tree model may have the advantage of revealing and describing the causal interrelationships between the response variable and the predictor variables transparently, as described by Curram and Mingers (1994), the developed model is likewise presented hereinafter. A summary of relevant parameters and specifications is presented in Table 4.34 and the tree diagram of the model is illustrated in Figure 4.17. The model BCT(ind)320 is based on the training sample of 185 observations and employs the classification and regression tree growing method CRT as outlined in Section 2.3.4. The developed model has a tree depth of 6 layers and includes a total of 17 nodes, of which 9 are terminal nodes with stop rules. A significant effect on the disposal costs is indicated for 2 predictor variables. These predictors correspond with predictors identified by the non-linear regression model NLR(ind)320: the model BCT(ind)320 displays the largest effect on the disposal costs for the qualitative variable type of facility and the quantitative variable share of defective water installations. The models NLR(ind)320 and BCT(ind)320 reveal a high level of conformity of their results. Therefore, it can be assumed that both models are specified correctly.
[Figure 4.17: tree diagram of the model BCT(ind)320. Recoverable fragments: a node for care retirement homes and residential facilities (care_ret, res_build; mean 2.30, n = 34), the split variable cn_sh_defWatIn, and terminal nodes marked TN.]
Table 4.34. Specifications and results of binary classification tree model BCT(ind)320

| Binary classification tree model | Parameter | Specifications |
|---|---|---|
| Specifications | Growing method | Classification and regression tree CRT |
| | Dependent variable | re_320_GEFA (CG 320, Euro/m² GEFA*year) |
| | Sample size | 185 |
| | Minimum cases in nodes | 3 |
| Results | Independent variables | qv_Util (Type of facility); cn_sh_defWatIn (Share of defective water installations, %) |
| | Number of nodes | 17 |
| | Number of terminal nodes | 9 |
| | Tree depth | 6 |
4.6.3 Categorised cost indicators

Based on the results of the developed statistical models, the training sample is employed to introduce the median values of categorised cost indicators for the purpose of estimation as outlined in Section 2.3.5. The underlying cost data are the disposal costs and the reference quantity is the gross external floor area GEFA as determined in the previous Section. The cost indicators are categorised according to the predictor variables with the largest effect on disposal costs. The standardised coefficients of the non-linear regression model NLR(ind)320 show the largest effects for the qualitative variable type of facility and the quantitative variable share of defective water installations.

As presented in Table 4.35, the disposal cost indicators are categorised according to the type of facility. A further categorisation is made by the share of defective water installations, distinguishing facilities without defective water installations from facilities with defective water installations. The sub-categorised cost indicators are solely available for datasets with 3 or more observations. Besides the median values (MV), the lower quartiles (25 % percentile) and upper quartiles (75 % percentile) are presented for all categories. The costs are adjusted to 1st quarter 2016 prices and include the current German VAT. The median values of the disposal cost indicators range between 0.16 Euro/m² GEFA*year for research/teaching and sport facilities without defective water installations and 3.12 Euro/m² GEFA*year for care retirement homes with defective water installations.
Table 4.35. Categorised CG 320 cost indicators

| Type of facility | Lower quartile [a] | MV(ind)320 [a] | Upper quartile [a] | n [b] |
|---|---|---|---|---|
| Care retirement home | 1.92 | 2.26 | 3.19 | 12 |
| – without defective water installations | 1.29 | 1.94 | 2.60 | 5 |
| – with defective water installations | 2.13 | 3.12 | 3.81 | 7 |
| Church facility | 0.71 | 1.24 | 1.34 | 3 |
| Community hall | 0.39 | 0.79 | 1.28 | 4 |
| Fire department | 0.40 | 0.47 | 1.91 | 5 |
| Kindergarten | 0.65 | 0.91 | 1.74 | 87 |
| – without defective water installations | 0.63 | 0.91 | 1.60 | 59 |
| – with defective water installations | 0.74 | 0.97 | 1.80 | 28 |
| Library | 0.29 | 0.42 | 1.04 | 4 |
| Municipal facility | 0.29 | 0.36 | 2.01 | 6 |
| Research/teaching facility | 0.16 | 0.21 | 0.26 | 8 |
| – without defective water installations | 0.12 | 0.16 | 0.23 | 5 |
| – with defective water installations | 0.22 | 0.25 | 1.01 | 3 |
| Residential facility | 1.45 | 2.01 | 2.42 | 20 |
| School facility | 0.16 | 0.23 | 0.32 | 17 |
| – without defective water installations | 0.14 | 0.18 | 0.30 | 9 |
| – with defective water installations | 0.20 | 0.25 | 0.74 | 8 |
| Sport facility | 0.16 | 0.21 | 0.28 | 15 |
| – without defective water installations | 0.12 | 0.16 | 0.29 | 5 |
| – with defective water installations | 0.19 | 0.22 | 0.28 | 10 |
| Town hall | 0.94 | 1.69 | 2.20 | 4 |

[a] CG 320 cost indicators (Euro/m² GEFA*year), 1st quarter 2016 prices including VAT.
[b] Total sample size: 185 observations.
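An estimation by categorised cost indicators amounts to selecting the median value that matches the characteristics of a facility and multiplying it by the reference quantity. The sketch below illustrates this with a few median values from Table 4.35; the function name `estimate_annual_disposal_costs`, the fallback to the facility-level median, and the example floor areas are assumptions made for this illustration.

```python
# Median values MV(ind)320 in Euro/m² GEFA*year, taken from Table 4.35.
# Key: (type of facility, defective water installations: True/False/None).
# None denotes the median of the whole facility type.
MEDIAN_INDICATORS = {
    ("Care retirement home", None): 2.26,
    ("Care retirement home", False): 1.94,
    ("Care retirement home", True): 3.12,
    ("Kindergarten", None): 0.91,
    ("Kindergarten", False): 0.91,
    ("Kindergarten", True): 0.97,
    ("Town hall", None): 1.69,  # no sub-categorised indicators listed
}


def estimate_annual_disposal_costs(facility, gefa_m2, defective_water=None):
    """Estimate annual CG 320 costs in Euro from a categorised median value.

    Assumption: if no sub-categorised indicator exists, the median of the
    facility type is used instead.
    """
    indicator = MEDIAN_INDICATORS.get(
        (facility, defective_water), MEDIAN_INDICATORS[(facility, None)]
    )
    return indicator * gefa_m2


# Hypothetical facilities with an assumed gross external floor area GEFA.
care_home_costs = estimate_annual_disposal_costs("Care retirement home", 2000, True)
town_hall_costs = estimate_annual_disposal_costs("Town hall", 1000, True)
print(round(care_home_costs, 2), round(town_hall_costs, 2))
```

The lower and upper quartiles of the same category can be used in the same way to obtain a plausible range around the estimate.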
4.6.4 Performance validation

The development of the statistical models for the estimation of disposal cost indicators is based on a training sample of 185 observations. The description of the models contains various measures of performance as summarised in Section 4.6.2. Hereinafter, an independent test sample is employed to draw unbiased statistical inferences about the accuracy of the developed estimation methods. For the CG 320, the test sample contains 15 observations, is selected randomly, is representative for the total sample, and is not employed for the model development as described in Section 3.3. In Table 4.36, the values of the percentage errors PE are presented for all statistical models and the estimation by the median values of the categorised cost indicators MV(ind)320. The PE values for the 15 observations are calculated by applying the observed characteristics to the statistical models. The median values of the cost indicators are selected under consideration of the respective characteristics. Besides, the method with the most accurate estimation is presented for all observations and the estimation accuracy is summarised by the mean absolute percentage error MAPE.
With a MAPE of 41.3 %, the most accurate estimation of disposal costs is indicated for the median values of the cost indicators MV(ind)320.

Table 4.36. Comparison of PE and MAPE values (test sample) for CG 320 estimation methods

| Obs. | Type of facility | LR(ind)320 | NLR(ind)320 | ANN(ind)320 | BCT(ind)320 | MV(ind)320 | Preference |
|---|---|---|---|---|---|---|---|
| 5 | Kindergarten | -11.2 % | 9.5 % | -9.4 % | -17.6 % | 10.8 % | ANN(ind)320 |
| 34 | Kindergarten | 57.5 % | 61.7 % | 60.3 % | 57.4 % | 65.5 % | BCT(ind)320 |
| 47 | Kindergarten | -72.7 % | -40.6 % | -70.0 % | -82.6 % | -38.6 % | MV(ind)320 |
| 68 | School facility | 65.4 % | 48.5 % | 24.6 % | 23.1 % | 53.5 % | BCT(ind)320 |
| 92 | Kindergarten | -0.4 % | 18.3 % | 1.2 % | -6.1 % | 19.5 % | ANN(ind)320 |
| 112 | Sport facility | 199.2 % | -97.5 % | -235.8 % | -269.4 % | -98.4 % | NLR(ind)320 |
| 121 | Residential facility | -63.5 % | -33.1 % | -60.9 % | -72.8 % | -31.2 % | MV(ind)320 |
| 145 | Kindergarten | -62.2 % | -32.1 % | -59.6 % | -71.5 % | -30.2 % | MV(ind)320 |
| 163 | Residential facility | -71.7 % | -39.8 % | -131.6 % | -30.4 % | -37.8 % | BCT(ind)320 |
| 167 | Care retirement home | -68.3 % | -21.8 % | -30.5 % | -46.2 % | -40.9 % | NLR(ind)320 |
| 187 | School facility | 46.6 % | 50.7 % | 42.5 % | 41.4 % | 50.8 % | BCT(ind)320 |
| 202 | Municipal facility | 43.0 % | 54.0 % | 80.0 % | 56.5 % | 73.4 % | LR(ind)320 |
| 211 | School facility | 26.0 % | -10.1 % | -61.4 % | -64.5 % | 0.6 % | MV(ind)320 |
| 230 | Kindergarten | -104.9 % | -84.6 % | -106.4 % | -21.2 % | -36.6 % | BCT(ind)320 |
| 237 | Sport facility | -32.8 % | -44.9 % | -92.0 % | 7.9 % | 32.0 % | BCT(ind)320 |
| Total (MAPE) | | 61.7 % | 43.1 % | 71.1 % | 57.9 % | 41.3 % | MV(ind)320 |

[Figure: absolute percentage error APE (vertical axis, 0.0 to 2.0) of the test and training samples for LR(ind)320, NLR(ind)320, ANN(ind)320, BCT(ind)320, and MV(ind)320]

Figure 4.18. APE values (test and training sample) for CG 320 estimation methods
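The relation between the percentage errors PE, the absolute percentage errors APE, and the MAPE of Table 4.36 can be retraced with a short sketch. The PE values are those listed in the table; the variable and function names are chosen for this illustration, and the MAPE is assumed to be the arithmetic mean of the APE values, which reproduces the reported total for MV(ind)320.

```python
# PE values (in %) of the estimation by median values MV(ind)320 for the
# 15 test observations, in the order of Table 4.36.
pe_mv = [10.8, 65.5, -38.6, 53.5, 19.5, -98.4, -31.2, -30.2,
         -37.8, -40.9, 50.8, 73.4, 0.6, -36.6, 32.0]


def mean_absolute_percentage_error(percentage_errors):
    """Average the absolute percentage errors APE = |PE|."""
    absolute_errors = [abs(pe) for pe in percentage_errors]
    return sum(absolute_errors) / len(absolute_errors)


def preferred_method(pe_by_method):
    """Return the method with the smallest absolute percentage error."""
    return min(pe_by_method, key=lambda method: abs(pe_by_method[method]))


mape_mv = mean_absolute_percentage_error(pe_mv)

# PE values of observation 211 (school facility) for all methods.
observation_211 = {
    "LR(ind)320": 26.0,
    "NLR(ind)320": -10.1,
    "ANN(ind)320": -61.4,
    "BCT(ind)320": -64.5,
    "MV(ind)320": 0.6,
}
preference_211 = preferred_method(observation_211)
print(round(mape_mv, 1), preference_211)
```

Since the MAPE averages absolute values, a single large error such as the -98.4 % of observation 112 weighs heavily in a test sample of only 15 observations.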
The values of the PE of the cost indicators MV(ind)320 range between 0.6 % for the most accurate estimation and -98.4 % for the estimation with the lowest accuracy. For the non-linear regression model NLR(ind)320 as the model with the second best performance, a comparison of the MAPE of the test sample (43.1 %) and training sample (44.3 %) confirms a constant estimation performance beyond the data employed for the model development. An overview of the MAPE values for both test and training samples for all methods is given in Figure 4.18. Furthermore, the distribution of the absolute percentage errors APE is illustrated for all observations. The figure reveals a relatively consistent distribution of the APE values between the respective test sample and training sample and therefore indicates a correct specification of the developed statistical models.
4.6.5 Summary

In order to estimate disposal costs (cost group 320) and identify significant predictor variables, a data sample of 200 observations is analysed and various statistical models are developed. The gross external floor area GEFA is determined as the most adequate reference quantity. The estimation by categorised cost indicators provides the highest level of accuracy with a mean absolute percentage error MAPE of 39.2 % for the total sample. By a non-linear regression model, significance is indicated for the following predictor variables (ordered by their size of effect):
– Type of facility (utilisation)
– Share of defective water installations (condition)
– Type of water usage (utilisation)
4.7 Cleaning and care of buildings (CG 330)

4.7.1 Theoretical basis and variables

The investigation of cleaning and care costs of buildings (CG 330) is based on empirical data and examines a data sample of 238 observations including a training sample of 220 observations and a test sample of 18 observations. Individual observations of the total sample are excluded from the data basis of the current analysis due to a limited availability of data. Cleaning and care of some of the facilities is conducted by internal personnel and cleaning and care cost data cannot be differentiated and allocated to a particular facility. The corresponding observations are excluded from the current analysis. If the cleaning and care of buildings is conducted by unsalaried volunteers or by tenants, the respective observations are included and the scope of costs is represented by a corresponding candidate predictor variable.

The underlying cost data have median values of absolute costs of 13,003 Euro per year and cost indicators of 12.48 Euro per m² GEFA and year (1st quarter 2016 prices including VAT). The cleaning and care costs of buildings represent a relatively high share of about 34.7 % of the total CG 300 operating costs. A general overview and a definition of the investigated cost data is presented in Section 2.2.1. The examined second level cost group is defined in the standard DIN 18960:2008-02 and is aggregated from the following third level cost groups:
– CG 331: Regular cleaning
– CG 332: Glass cleaning
– CG 333: Facade cleaning
– CG 334: Cleaning of technical installations

Besides the CG 330 costs as response variable of the analysis, various candidate predictor variables are defined and their effect on cleaning and care costs of buildings is examined. The variables give detailed information on the quantities, the characteristics, the utilisation, the location, and the management strategy of the facilities. A general overview of the variable groups and the variables available for the current study is presented in Section 2.2 (Definition of key variables) and Section 3.2 (Presentation of the sample), respectively. The following overview describes the variables included in the current investigation.
Quantities

Reference quantities

Various areas and volumes are examined in order to determine an adequate reference quantity for the estimation of cleaning and care costs of buildings. Therefore, the gross external floor area GEFA in m², the gross internal floor area GIFA in m², the usable floor area UFA in m², and the gross building volume GBV in m³ as defined in the standard DIN 277-1:2016-01 are analysed as candidate reference quantities in the current investigation. Besides, the regularly cleaned gross internal floor area cGIFA in m² according to the standard DIN 277-1:2016-01 is considered as reference quantity.

Specific areas

The investigation of cleaning and care costs of buildings examines various specific areas as candidate predictor variables. The analysis includes the share of the usable floor area UFA on the GIFA in %, the share of circulation area CA on the GIFA in %, and the share of sanitary area on the GIFA in % as defined in the standards DIN 277-1:2016-01 and DIN EN 15221-6:2011-12. The share of regularly cleaned GIFA in % according to DIN 277-1:2016-01 refers to the gross internal floor area GIFA as percentage and is likewise included.

Function

A functional description of the facilities is given by the quantities number of elevator stops and number of sanitary facilities, which are therefore included in the analysis of the cleaning and care costs of buildings in the current Section. Likewise, the share of glass surfaces on above-grade exterior walls in % as defined in the standard DIN 277-3:2005-04 is considered as candidate predictor variable.

Characteristics

Condition of the construction

The share of defective floorings describes the condition of floor and ceiling coverings and is considered as candidate predictor influencing the cleaning and care costs of buildings.

Condition of the technical installations

The investigation quantifies the significance and influence of the share of defective sanitary installations including the condition of sanitary appliances and equipment. The share of defective technical installations summarises the condition of sewerage, water, gas, heat supply, air treatment, electrical, and telecommunication installations, as well as transport and building automation systems and is therefore taken into account as candidate predictor variable.

Utilisation

The utilisation of the facilities is represented by the candidate predictor variable type of facility. The qualitative variable contains the characteristics care retirement home, church facility, community hall, fire department, kindergarten, library, municipal facility, research/teaching facility, residential facility, school facility, sport facility, and town hall as facility types. Furthermore, the data sample is differentiated into facilities with kitchen/canteen, tea kitchen, or none by the candidate predictor specific utilisation.

Location

The investigation of cleaning and care costs of buildings includes the qualitative variable urban location with the characteristics urban and rural describing the surrounding area of the facilities.

Strategy

With the characteristics internal cleaning services, external contractor, solely cleaning materials, and cleaning by tenants, the significance of the management strategy and the effect on cleaning and care costs of buildings is quantified by the candidate predictor variable type of cleaning services.
4.7 Cleaning and care of buildings (CG 330) | 125
4.7.2 Model design and specifications Based on the training sample with 220 observations and the variables described above, various statistical models are developed stepwise and presented in the current Section. The statistical models aim to give an accurate estimation of the CG 330 costs as response variable. Furthermore, the models are intended to reveal and describe the causal interrelationships between the response variable and the candidate predictor variables.
Model overview For the determination of an adequate reference quantity, several linear regression models estimating absolute costs are developed. Each model contains one of the available candidate reference quantities as predictor. A comparison of the developed absolute cost models is presented in Table 4.37 including multiple measures of performance. With the highest value of the R² (adj.) of 86.3 %, the lowest value of the MAPE of 88.0 %, and the lowest value of the CV(RMSE) of 51.6 %, the best estimation accuracy is indicated for the model LR(abs)330cGIFA including the regularly cleaned gross internal floor area cGIFA as one of the predictor variables. Based on the results of the absolute cost models, the cGIFA is determined as reference quantity for the introduction of cost indicators for the further investigation of cleaning and care costs of buildings.

Table 4.37. Models (incl. reference quantities) for the estimation of CG 330 absolute costs

Absolute cost model (Euro/year)                     No. var.  R²      R² (adj.)  MAPE     CV(RMSE)  Outliers    n
LR(abs)330GEFA   Linear regression (incl. GEFA)     6         86.4 %  85.2 %     95.8 %   52.8 %    15 (7.5 %)  220
LR(abs)330GIFA   Linear regression (incl. GIFA)     6         86.8 %  85.6 %     91.4 %   52.0 %    14 (7.0 %)  220
LR(abs)330cGIFA  Linear regression (incl. cGIFA)    6         87.5 %  86.3 %     88.0 %   51.6 %    12 (6.0 %)  220
LR(abs)330UFA    Linear regression (incl. UFA)      6         83.0 %  81.5 %     110.1 %  59.1 %    13 (6.5 %)  220
LR(abs)330GBV    Linear regression (incl. GBV)      6         86.6 %  85.4 %     103.7 %  52.7 %    13 (6.5 %)  220
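The choice of the reference quantity can be retraced from the figures in Table 4.37. The following Python sketch is illustrative only: the dictionary layout and the helper name best_model are assumptions introduced here, while the values are copied from the table.

```python
# Performance measures copied from Table 4.37 (all values in %).
# The dictionary layout is an assumption made for this illustration.
MODELS = {
    "LR(abs)330GEFA": {"r2_adj": 85.2, "mape": 95.8, "cv_rmse": 52.8},
    "LR(abs)330GIFA": {"r2_adj": 85.6, "mape": 91.4, "cv_rmse": 52.0},
    "LR(abs)330cGIFA": {"r2_adj": 86.3, "mape": 88.0, "cv_rmse": 51.6},
    "LR(abs)330UFA": {"r2_adj": 81.5, "mape": 110.1, "cv_rmse": 59.1},
    "LR(abs)330GBV": {"r2_adj": 85.4, "mape": 103.7, "cv_rmse": 52.7},
}


def best_model(measure, higher_is_better):
    """Return the model name with the best value of the given measure."""
    pick = max if higher_is_better else min
    return pick(MODELS, key=lambda name: MODELS[name][measure])


# A high R² (adj.) is favourable, whereas low error measures are favourable.
best_by_r2_adj = best_model("r2_adj", higher_is_better=True)
best_by_mape = best_model("mape", higher_is_better=False)
best_by_cv_rmse = best_model("cv_rmse", higher_is_better=False)

print(best_by_r2_adj, best_by_mape, best_by_cv_rmse)
```

All three measures point to the same model in this comparison, which is why a single reference quantity can be selected without weighting the measures against each other.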
Employing the CG 330 cost indicators as response variable, linear and non-linear regression models, artificial neural network models, and binary classification tree models are developed. A summary of the statistical models with the best performance is given in Table 4.38. Comparing the measures of performance, the binary classification tree model BCT(ind)330 offers the best estimation accuracy with an R² of 83.4 % and a MAPE of 25.4 %. Its CV(RMSE) of 27.6 % indicates the lowest estimation error compared to the other models. Comparing the regression models LR(ind)330 and NLR(ind)330, the transformation of both response and predictor variables decreases the MAPE by 3.8 percentage points and increases the R² (adj.) by 11.2 percentage points. Nevertheless, the regression models cannot offer a better performance than the model BCT(ind)330, as indicated by all measures.
Compared to the models estimating absolute costs, all developed cost indicator models show higher levels of accuracy as illustrated by the values of the MAPE and the CV(RMSE).

Table 4.38. Models for the estimation of CG 330 cost indicators

Cost indicator model (Euro/m² cGIFA*year)        No. var.  R²      R² (adj.)  MAPE    CV(RMSE)  Outliers    n
LR(ind)330   Linear regression model             4         71.0 %  68.8 %     35.6 %  36.3 %    10 (4.5 %)  220
NLR(ind)330  Non-linear regression model         4         81.5 %  80.0 %     31.8 %  35.9 %    9 (4.1 %)   220
ANN(ind)330  Artificial neural network model     4         59.2 %  -          42.0 %  43.2 %    13 (6.5 %)  220
BCT(ind)330  Binary classification tree model    4         83.4 %  -          25.4 %  27.6 %    9 (4.1 %)   220
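For readers who wish to reproduce error measures of the kind reported in Tables 4.37 and 4.38, the following Python sketch may serve as a starting point. It assumes the common textbook definitions of the MAPE and the CV(RMSE); the exact definitions applied in this work are those of Section 2.3.6 and may differ in detail (for example in the denominator of the RMSE). The function names and the toy values are invented for illustration and are not taken from the data sample.

```python
import math


def mape(observed, estimated):
    """Mean absolute percentage error in % (common textbook definition)."""
    ratios = [abs(o - e) / abs(o) for o, e in zip(observed, estimated)]
    return 100.0 * sum(ratios) / len(ratios)


def cv_rmse(observed, estimated):
    """Root mean squared error relative to the mean observed value, in %."""
    mse = sum((o - e) ** 2 for o, e in zip(observed, estimated)) / len(observed)
    mean_observed = sum(observed) / len(observed)
    return 100.0 * math.sqrt(mse) / mean_observed


# Invented toy values (Euro/m² cGIFA*year), not taken from the data sample.
observed = [100.0, 200.0]
estimated = [110.0, 180.0]

toy_mape = mape(observed, estimated)
toy_cv_rmse = cv_rmse(observed, estimated)
print(toy_mape, toy_cv_rmse)
```

The MAPE weights each observation by its own size, whereas the CV(RMSE) penalises large absolute deviations more strongly, which is one reason why the two measures can rank models differently.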
Regression model In order to improve the performance of the linear regression model (cf. Schmidt, 2010), both the response and the quantitative predictor variables of the non-linear regression model are transformed according to their Box-Cox transformations (Box and Cox, 1964). The non-linear regression model NLR(ind)330 is presented in detail with all relevant specifications in Table 4.39. The respective transformations are represented by the values of lambda (λ). The response variable and one quantitative predictor variable, the share of defective sanitary installations, are transformed by a square root (λ = 0.5) transformation. The empirical F-value of the model NLR(ind)330 of 55.86 exceeds the theoretical F-value of 1.69, determined under consideration of the sample size, the number of included predictor variables, and a 95 % confidence level (Backhaus et al., 2011). Therefore, the null hypothesis can be rejected and a general significance of the model NLR(ind)330 can be concluded. Significant relationships between the response variable and 4 predictor variables are indicated by the p-values, based on a significance level alpha (α) of 0.05. The absolute t-values of the determined variables exceed the threshold of 1.97, determined under consideration of a 95 % confidence level, the sample size, and the number of predictors (Backhaus et al., 2011). The coefficients (β) are presented for the regression constant and the determined predictor variables. In addition, the standard error of the coefficients displays the standard deviation of the coefficient estimates for the underlying sample. The size of the effect of the predictor variables is indicated by the standardised coefficients as described by Ryan et al. (2012). In order to determine the standardised coefficients of the categorical predictors, the regression is re-calculated with the coefficients of the dummy variables as introduced by Eisinga et al. (1991).
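As a plausibility check, the empirical F-value and the adjusted R² can be approximately recomputed from the R², the sample size, and the number of estimated slope coefficients. The sketch below assumes the standard textbook formulas and assumes p = 16 slope coefficients (two quantitative predictors, eleven dummy variables for the type of facility, and three for the type of cleaning services). This count is inferred from Table 4.39 and is not stated explicitly in the text.

```python
def f_statistic(r2, n, p):
    """Overall F-statistic of a regression with p slope coefficients."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def adjusted_r2(r2, n, p):
    """R² adjusted for the number of slope coefficients p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


R2 = 0.815  # R² of NLR(ind)330 as reported (rounded)
N = 220     # sample size as reported
P = 16      # assumption: 2 quantitative + 11 + 3 dummy coefficients

f_value = f_statistic(R2, N, P)
r2_adj = adjusted_r2(R2, N, P)
print(round(f_value, 2), round(r2_adj, 3))
```

Under these assumptions the recomputed values lie close to the reported F-value of 55.86 and the reported R² (adj.) of 80.0 %; small deviations are to be expected because the R² enters the calculation in rounded form.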
For the non-linear regression model NLR(ind)330, the largest effect is indicated for the qualitative predictor variables type of facility and type of cleaning services. None of the values of the VIF of the predictor variables exceed the selected threshold of 5 (cf. Section 2.3.2). Though the values of the VIF reveal a certain multicollinearity among the determined predictor variables, a stable regression model is indicated (cf. Chatterjee and Hadi, 2006).

Table 4.39. Description of coefficients of non-linear regression model NLR(ind)330

Response variable             Transf. (λ)  R²      R² (adj.)  MAPE    CV(RMSE)  F-value  n
CG 330 (Euro/m² cGIFA*year)   0.5 (SQR)    81.5 %  80.0 %     31.8 %  35.9 %    55.86    220

Predictor variables                                    Transf. (λ)  Coef. (β)  Coef. SE  St. Coef.  t-value  p-value  VIF
β0  Constant                                           -            4.012      0.415     0.000      9.66     0.000    -
X1  Share of circulation area CA (%)                   -            1.902      0.767     0.090      2.48     0.014    1.46
X2  Share of defective sanitary installations (%)      0.5 (SQR)    0.964      0.224     0.144      4.31     0.000    1.22
X3  Type of facility                                   -            -          -         0.675      -        0.000    -
      Care retirement home                             -            0.000      0.000     -          -        -        -
      Church facility                                  -            -2.338     0.395     -          -5.91    0.000    2.78
      Community hall                                   -            -1.626     0.457     -          -3.56    0.000    1.67
      Fire department                                  -            -2.161     0.464     -          -4.66    0.000    1.79
      Kindergarten                                     -            0.060      0.320     -          0.19     0.851    4.78
      Library                                          -            -2.267     0.486     -          -4.66    0.000    1.64
      Municipal facility                               -            -1.556     0.405     -          -3.84    0.000    2.34
      Research/teaching facility                       -            -2.242     0.385     -          -5.82    0.000    2.73
      Residential facility                             -            -1.240     0.369     -          -3.36    0.001    4.81
      School facility                                  -            -1.926     0.351     -          -5.49    0.000    3.67
      Sport facility                                   -            -1.325     0.365     -          -3.63    0.000    3.56
      Town hall                                        -            -2.123     0.456     -          -4.65    0.000    1.69
X4  Type of cleaning services                          -            -          -         0.564      -        0.000    -
      Internal cleaning services                       -            0.000      0.000     -          -        -        -
      External contractor                              -            1.040      0.226     -          4.60     0.000    2.69
      Solely cleaning materials                        -            -2.546     0.283     -          -9.01    0.000    2.07
      Cleaning by tenants                              -            -3.201     0.394     -          -8.12    0.000    1.58
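The t-values listed in Table 4.39 can be retraced as the ratio of a coefficient to its standard error and then compared with the threshold of 1.97. The following Python sketch does this for a selection of rows; the list layout and the function name are assumptions made for illustration, and small deviations from the printed t-values arise because the table shows rounded coefficients.

```python
T_THRESHOLD = 1.97  # threshold stated for the model NLR(ind)330

# (name, coefficient, standard error, t-value as printed in Table 4.39)
ROWS = [
    ("Constant", 4.012, 0.415, 9.66),
    ("Share of circulation area CA", 1.902, 0.767, 2.48),
    ("Share of defective sanitary installations", 0.964, 0.224, 4.31),
    ("Kindergarten", 0.060, 0.320, 0.19),
    ("External contractor", 1.040, 0.226, 4.60),
    ("Cleaning by tenants", -3.201, 0.394, -8.12),
]


def t_value(coefficient, standard_error):
    """t-value of a coefficient as the ratio of estimate to standard error."""
    return coefficient / standard_error


# Map each row to its recomputed t-value and whether |t| exceeds the threshold.
results = {}
for name, coefficient, standard_error, _printed in ROWS:
    t = t_value(coefficient, standard_error)
    results[name] = (t, abs(t) > T_THRESHOLD)

for name, (t, significant) in results.items():
    print(f"{name}: t = {t:.2f}, exceeds threshold: {significant}")
```

The level kindergarten stays below the threshold, which means that its costs are not distinguishable from those of the reference category care retirement home, while the categorical variable type of facility as a whole remains significant.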
The following equation describes the non-linear regression model NLR(ind)330:

$$\hat{Y} = \left(\beta_0 + \beta_1 X_1 + \beta_2 \sqrt{X_2} + \beta_3 X_3 + \beta_4 X_4\right)^2$$

The distribution of the residuals (difference between observed and estimated value) presented in the residual plots in Figure 4.19 gives further information about the quality of fit of the model NLR(ind)330 to the underlying data sample. The standardised residuals appear to be uncorrelated to the estimated values and are distributed normally as illustrated by the scatter plot and histogram of the residuals, respectively. The homoscedastic variance of the residuals indicates a correctly specified model and an unbiased estimation of the response variable with no missing terms, extreme outliers, or influential points (cf. Fahrmeir et al., 2013).
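The model equation of NLR(ind)330 can be evaluated directly with the coefficients of Table 4.39. The Python sketch below is a minimal illustration under two assumptions that are not confirmed by the text: the shares are passed as fractions between 0 and 1, and the square root transformation applies to the share of defective sanitary installations only. The function and variable names are invented for this example.

```python
import math

# Coefficients copied from Table 4.39; they apply on the square-root scale
# of the response (Euro/m² cGIFA*year).
BETA_0 = 4.012
BETA_1 = 1.902  # X1: share of circulation area CA
BETA_2 = 0.964  # X2: share of defective sanitary installations (sqrt-transformed)

# X3: type of facility (reference category: care retirement home).
BETA_3 = {
    "Care retirement home": 0.000, "Church facility": -2.338,
    "Community hall": -1.626, "Fire department": -2.161,
    "Kindergarten": 0.060, "Library": -2.267,
    "Municipal facility": -1.556, "Research/teaching facility": -2.242,
    "Residential facility": -1.240, "School facility": -1.926,
    "Sport facility": -1.325, "Town hall": -2.123,
}

# X4: type of cleaning services (reference: internal cleaning services).
BETA_4 = {
    "Internal cleaning services": 0.000, "External contractor": 1.040,
    "Solely cleaning materials": -2.546, "Cleaning by tenants": -3.201,
}


def estimate_cg330(share_ca, share_def_san, facility, cleaning):
    """Estimate the CG 330 cost indicator in Euro/m² cGIFA*year.

    Assumption: both shares are given as fractions between 0 and 1.
    """
    linear = (BETA_0
              + BETA_1 * share_ca
              + BETA_2 * math.sqrt(share_def_san)
              + BETA_3[facility]
              + BETA_4[cleaning])
    # Back-transformation of the square-root-transformed response.
    return linear ** 2


reference = estimate_cg330(0.0, 0.0, "Care retirement home",
                           "Internal cleaning services")
example = estimate_cg330(0.15, 0.04, "Kindergarten", "External contractor")
print(round(reference, 2), round(example, 2))
```

For the reference categories with both shares set to zero, the estimate reduces to the squared constant, roughly 16.1 Euro/m² cGIFA*year. Because of the squaring step, combinations of categories that drive the bracket below zero would still return a positive value, so such estimates should be treated with caution.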
Figure 4.19. Residuals for non-linear regression model NLR(ind)330 (left: standardised residuals plotted against the fitted values; right: histogram of the standardised residuals with their frequency)
Binary classification tree model Classification tree models may have the advantage of providing clear information on the importance of significant predictor variables as described by Tso and Yau (2007). As the model with the best fit to the underlying data sample, the binary classification tree model BCT(ind)330 is presented hereinafter. With the achieved values of the R² of 83.4 % and the MAPE of 25.4 %, a relatively high accuracy of cost estimation is indicated (cf. Table 4.38). A summary of the parameters and specifications is given in Table 4.40 and the tree of the model is illustrated in Figure 4.20.

Table 4.40. Specifications and results of binary classification tree model BCT(ind)330

Binary classification tree model
Parameter                   Specifications
Growing method              Classification and regression tree CRT
Dependent variable          re_330_cGIFA (CG 330, Euro/m² cGIFA*year)
Sample size                 220
Minimum cases in nodes      3

Results
Independent variables       qv_Util (Type of facility)
                            qv_CleanServ (Type of cleaning services)
                            cn_sh_defSanIn (Share of defective sanitary installations, %)
                            cn_sh_CA (Share of circulation area CA, %)
Number of nodes             31
Number of terminal nodes    16
Tree depth                  6
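The growing method CRT named in Table 4.40 builds the tree by repeatedly splitting a node into two child nodes. The following Python sketch illustrates one such split for a single quantitative predictor under simplifying assumptions: the split criterion is the sum of squared deviations within the two child nodes, candidate thresholds are midpoints between neighbouring values, and each child must contain at least three cases as in Table 4.40. The function names and the toy data are invented, and the settings of the software used for BCT(ind)330 may differ.

```python
def sum_squared_errors(values):
    """Sum of squared deviations from the mean of the given values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_cases=3):
    """Return (threshold, cost) of the best binary split, or None.

    The cost is the summed squared error of both child nodes; each child
    must contain at least min_cases observations.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_cases, len(pairs) - min_cases + 1):
        lower_x, upper_x = pairs[i - 1][0], pairs[i][0]
        if lower_x == upper_x:
            continue  # no threshold can separate identical predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sum_squared_errors(left) + sum_squared_errors(right)
        if best is None or cost < best[1]:
            best = ((lower_x + upper_x) / 2.0, cost)
    return best


# Invented toy data: share of defective sanitary installations (fraction)
# and cost indicator (Euro/m² cGIFA*year).
shares = [0.05, 0.10, 0.15, 0.20, 0.35, 0.40, 0.45, 0.50]
costs = [25.0, 27.0, 26.0, 28.0, 42.0, 44.0, 43.0, 45.0]

threshold, cost = best_split(shares, costs)
print(threshold, cost)
```

Applying the same search recursively to each child node, and across all candidate predictors, yields a binary tree in which the number of nodes equals twice the number of terminal nodes minus one, consistent with the 31 nodes and 16 terminal nodes reported in Table 4.40.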
Based on the training sample of 220 observations, the model BCT(ind)330 is developed with the classification and regression tree growing method CRT as described in Section 2.3.4 in detail. With a tree depth of 6 layers, the developed model includes 31 nodes in total whereof 16 are terminal nodes with stop rules. A significant effect on the cleaning and care costs of buildings is indicated for 4 predictor variables. The identified predictor variables correspond with the variables identified by the non-linear regression model NLR(ind)330. The model BCT(ind)330 displays the largest effect for the qualitative variable type of facility and the quantitative variable share of defective sanitary installations. A comparison of the models NLR(ind)330 and BCT(ind)330 reveals a high level of conformity of their results and indicates therefore a correct specification of the developed models.

[Figure 4.20: tree of the model BCT(ind)330. Recoverable nodes: root node re_330_cGIFA (mean 20.67, n = 220); split on qv_Util with the branch kind_gar, care_ret (mean 29.19, n = 113); split on qv_CleanServ into outsourced, internal (mean 31.85, n = 103) and materials (mean 1.84, n = 10, terminal node TN); split on cn_sh_defSanIn at 0.303 into nodes with mean 28.62 (n = 80) and mean 43.05 (n = 23); split of the node with n = 80 on cn_sh_CA at 0.132 into nodes with mean 26.06 (n = 36) and mean 31.77 (n = 44).]
0.187) – Layer 3: qv_ProtStr (no protected structure) – Layer 4: cn_sh_defTecIn ( 10,000.
83.8 %
212
Type of topography
  Flat topography: The site area has a flat shaped topography.      62.5 %   158
  Sloped topography: The site area has a sloped topography.         37.5 %   95

[a] Total number of observations: 253.
Appendix: Data sample

Table A.12. Candidate predictor variables: Utilisation

Candidate predictor variable   Characteristics                  %[a]     n
Type of facility               Care retirement home             5.1 %    13
                               Church facility                  5.1 %    13
                               Community hall                   2.0 %    5
                               Fire department                  2.0 %    5
                               Kindergarten                     48.2 %   122
                               Library                          1.6 %    4
                               Municipal facility               3.6 %    9
                               Research/teaching facility       4.7 %    12
                               Residential facility             8.7 %    22
                               School facility                  8.7 %    22
                               Sport facility                   7.5 %    19
                               Town hall                        2.8 %    7
Specific utilisation           Kitchen / canteen                51.0 %   129
                               Tea kitchen                      15.8 %   40
                               None                             33.2 %   84
Type of water usage            Process and sanitary water usage: Water is used for production or industrial processes and sanitary purposes.   2.4 %    6
                               Sanitary water usage: Water is used solely for sanitary purposes.   97.6 %   247

[a] Total number of observations: 253.
Table A.13. Candidate predictor variables: Management strategy

Candidate predictor variable   Characteristics                                                                   %[a]     n
Cleaning services              Internal cleaning services: Cleaning and care of buildings or outdoor facilities are conducted by internal personnel. Costs include personnel and cleaning materials.   9.9 %    25
                               External contractor: Cleaning and care of buildings or outdoor facilities are conducted by an external contractor. Costs include personnel and cleaning materials.   81.4 %   206
                               Solely cleaning materials: Cleaning and care of buildings or outdoor facilities are conducted by unsalaried volunteers or internal personnel. Costs include solely cleaning materials and no personnel costs.   6.3 %    16
                               Cleaning by tenants: Cleaning and care of buildings or outdoor facilities are conducted by tenants. Cleaning and care costs are not available.   2.4 %    6

[a] Total number of observations: 253.