Extreme Events
For other titles in the Wiley Finance series please see www.wiley.com/finance
Extreme Events Robust Portfolio Construction in the Presence of Fat Tails
Malcolm H. D. Kemp
A John Wiley and Sons, Ltd., Publication
This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

ISBN: 978-0-470-75013-1

A catalogue record for this book is available from the British Library.
Typeset in 10/12pt Times by Aptara Inc., New Delhi, India Printed in Great Britain by CPI Antony Rowe, Chippenham, Wiltshire
Contents

Preface
Acknowledgements
Abbreviations
Notation

1 Introduction
1.1 Extreme events
1.2 The portfolio construction problem
1.3 Coping with really extreme events
1.4 Risk budgeting
1.5 Elements designed to maximise benefit to readers
1.6 Book structure

2 Fat Tails – In Single (i.e., Univariate) Return Series
2.1 Introduction
2.2 A fat tail relative to what?
2.3 Empirical examples of fat-tailed behaviour in return series
2.3.1 Introduction
2.3.2 Visualising fat tails
2.3.3 Behaviour of individual bonds and bond indices
2.3.4 Behaviour of equity indices
2.3.5 Currencies and other asset types
2.4 Characterising fat-tailed distributions by their moments
2.4.1 Introduction
2.4.2 Skew and kurtosis
2.4.3 The (fourth-moment) Cornish-Fisher approach
2.4.4 Weaknesses of the Cornish-Fisher approach
2.4.5 Improving on the Cornish-Fisher approach
2.4.6 Statistical tests for non-Normality
2.4.7 Higher order moments and the Omega function
2.5 What causes fat tails?
2.5.1 Introduction
2.5.2 The Central Limit Theorem
2.5.3 Ways in which the Central Limit Theorem can break down
2.6 Lack of diversification
2.7 A time-varying world
2.7.1 Introduction
2.7.2 Distributional mixtures
2.7.3 Time-varying volatility
2.7.4 Regime shifts
2.8 Stable distributions
2.8.1 Introduction
2.8.2 Defining characteristics
2.8.3 The Generalised Central Limit Theorem
2.8.4 Quantile–quantile plots of stable distributions
2.9 Extreme value theory (EVT)
2.9.1 Introduction
2.9.2 Extreme value distributions
2.9.3 Tail probability densities
2.9.4 Estimation of and inference from tail index values
2.9.5 Issues with extreme value theory
2.10 Parsimony
2.11 Combining different possible source mechanisms
2.12 The practitioner perspective
2.12.1 Introduction
2.12.2 Time-varying volatility
2.12.3 Crowded trades
2.12.4 Liquidity risk
2.12.5 ‘Rational’ behaviour versus ‘bounded rational’ behaviour
2.12.6 Our own contribution to the picture
2.13 Implementation challenges
2.13.1 Introduction
2.13.2 Smoothing of return series
2.13.3 Time clocks and non-constant time period lengths
2.13.4 Price or other data rather than return data
2.13.5 Economic sensitivities that change through time

3 Fat Tails – In Joint (i.e., Multivariate) Return Series
3.1 Introduction
3.2 Visualisation of fat tails in multiple return series
3.3 Copulas and marginals – Sklar’s theorem
3.3.1 Introduction
3.3.2 Fractile–fractile (i.e., quantile–quantile box) plots
3.3.3 Time-varying volatility
3.4 Example analytical copulas
3.4.1 Introduction
3.4.2 The Gaussian copula
3.4.3 The t-copula
3.4.4 Archimedean copulas
3.5 Empirical estimation of fat tails in joint return series
3.5.1 Introduction
3.5.2 Disadvantages of empirically fitting the copula
3.5.3 Multi-dimensional quantile–quantile plots
3.6 Causal dependency models
3.7 The practitioner perspective
3.8 Implementation challenges
3.8.1 Introduction
3.8.2 Series of different lengths
3.8.3 Non-coincidently timed series
3.8.4 Cluster analysis
3.8.5 Relative entropy and nonlinear cluster analysis

4 Identifying Factors That Significantly Influence Markets
4.1 Introduction
4.2 Portfolio risk models
4.2.1 Introduction
4.2.2 Fundamental models
4.2.3 Econometric models
4.2.4 Statistical risk models
4.2.5 Similarities and differences between risk models
4.3 Signal extraction and principal components analysis
4.3.1 Introduction
4.3.2 Principal components analysis
4.3.3 The theory behind principal components analysis
4.3.4 Weighting schemas
4.3.5 Idiosyncratic risk
4.3.6 Random matrix theory
4.3.7 Identifying principal components one at a time
4.4 Independent components analysis
4.4.1 Introduction
4.4.2 Practical algorithms
4.4.3 Non-Normality and projection pursuit
4.4.4 Truncating the answers
4.4.5 Extracting all the un-mixing weights at the same time
4.4.6 Complexity pursuit
4.4.7 Gradient ascent
4.5 Blending together principal components analysis and independent components analysis
4.5.1 Introduction
4.5.2 Including both variance and kurtosis in the importance criterion
4.5.3 Eliminating signals from the remaining dataset
4.5.4 Normalising signal strength
4.6 The potential importance of selection effects
4.6.1 Introduction
4.6.2 Quantifying the possible impact of selection effects
4.6.3 Decomposition of fat-tailed behaviour
4.7 Market dynamics
4.7.1 Introduction
4.7.2 Linear regression
4.7.3 Difference equations
4.7.4 The potential range of behaviours of linear difference equations
4.7.5 Multivariate linear regression
4.7.6 ‘Chaotic’ market dynamics
4.7.7 Modelling market dynamics using nonlinear methods
4.7.8 Locally linear time series analysis
4.8 Distributional mixtures
4.8.1 Introduction
4.8.2 Gaussian mixture models and the expectation-maximisation algorithm
4.8.3 k-means clustering
4.8.4 Generalised distributional mixture models
4.8.5 Regime shifts
4.9 The practitioner perspective
4.10 Implementation challenges
4.10.1 Introduction
4.10.2 Local extrema
4.10.3 Global extrema
4.10.4 Simulated annealing and genetic portfolio optimisation
4.10.5 Minimisation/maximisation algorithms
4.10.6 Run time constraints

5 Traditional Portfolio Construction Techniques
5.1 Introduction
5.2 Quantitative versus qualitative approaches?
5.2.1 Introduction
5.2.2 Viewing any process through the window of portfolio optimisation
5.2.3 Quantitative versus qualitative insights
5.2.4 The characteristics of pricing/return anomalies
5.3 Risk-return optimisation
5.3.1 Introduction
5.3.2 Mean-variance optimisation
5.3.3 Formal mathematical notation
5.3.4 The Capital Asset Pricing Model (CAPM)
5.3.5 Alternative models of systematic risk
5.4 More general features of mean-variance optimisation
5.4.1 Introduction
5.4.2 Monotonic changes to risk or return measures
5.4.3 Constraint-less mean-variance optimisation
5.4.4 Alpha-beta separation
5.4.5 The importance of choice of universe
5.4.6 Dual benchmarks
5.5 Manager selection
5.6 Dynamic optimisation
5.6.1 Introduction
5.6.2 Portfolio insurance
5.6.3 Stochastic optimal control and stochastic programming techniques
5.7 Portfolio construction in the presence of transaction costs
5.7.1 Introduction
5.7.2 Different types of transaction costs
5.7.3 Impact of transaction costs on portfolio construction
5.7.4 Taxes
5.8 Risk budgeting
5.8.1 Introduction
5.8.2 Information ratios
5.9 Backtesting portfolio construction techniques
5.9.1 Introduction
5.9.2 Different backtest statistics for risk-reward trade-offs
5.9.3 In-sample versus out-of-sample backtesting
5.9.4 Backtesting risk models
5.9.5 Backtesting risk-return models
5.9.6 Transaction costs
5.10 Reverse optimisation and implied view analysis
5.10.1 Implied alphas
5.10.2 Consistent implementation of investment ideas across portfolios
5.11 Portfolio optimisation with options
5.12 The practitioner perspective
5.13 Implementation challenges
5.13.1 Introduction
5.13.2 Optimisation algorithms
5.13.3 Sensitivity to the universe from which ideas are drawn

6 Robust Mean-Variance Portfolio Construction
6.1 Introduction
6.2 Sensitivity to the input assumptions
6.2.1 Introduction
6.2.2 Estimation error
6.2.3 Mitigating estimation error in the covariance matrix
6.2.4 Exact error estimates
6.2.5 Monte Carlo error estimation
6.2.6 Sensitivity to the structure of the smaller principal components
6.3 Certainty equivalence, credibility weighting and Bayesian statistics
6.3.1 Introduction
6.3.2 Bayesian statistics
6.3.3 Applying Bayesian statistics in practice
6.4 Traditional robust portfolio construction approaches
6.4.1 Introduction
6.4.2 The Black-Litterman approach
6.4.3 Applying the Black-Litterman approach in practice
6.4.4 Opinion pooling
6.4.5 Impact of non-Normality when using more complex risk measures
6.4.6 Using the Herfindahl Hirshman Index (HHI)
6.5 Shrinkage
6.5.1 Introduction
6.5.2 ‘Shrinking’ the sample means
6.5.3 ‘Shrinking’ the covariance matrix
6.6 Bayesian approaches applied to position sizes
6.6.1 Introduction
6.6.2 The mathematics
6.7 The ‘universality’ of Bayesian approaches
6.8 Market consistent portfolio construction
6.9 Resampled mean-variance portfolio optimisation
6.9.1 Introduction
6.9.2 Different types of resampling
6.9.3 Monte Carlo resampling
6.9.4 Monte Carlo resampled optimisation without portfolio constraints
6.9.5 Monte Carlo resampled optimisation with portfolio constraints
6.9.6 Bootstrapped resampled efficiency
6.9.7 What happens to the smaller principal components?
6.10 The practitioner perspective
6.11 Implementation challenges
6.11.1 Introduction
6.11.2 Monte Carlo simulation – changing variables and importance sampling
6.11.3 Monte Carlo simulation – stratified sampling
6.11.4 Monte Carlo simulation – quasi (i.e., sub-) random sequences
6.11.5 Weighted Monte Carlo
6.11.6 Monte Carlo simulation of fat tails

7 Regime Switching and Time-Varying Risk and Return Parameters
7.1 Introduction
7.2 Regime switching
7.2.1 Introduction
7.2.2 An example of regime switching model
7.2.3 The operation of regime switching model
7.2.4 Complications
7.3 Investor utilities
7.3.1 Introduction
7.3.2 Expected utility theory
7.3.3 Constant relative risk aversion (CRRA)
7.3.4 The need for monetary value to be well defined
7.4 Optimal portfolio allocations for regime switching models
7.4.1 Introduction
7.4.2 Specification of problem
7.4.3 General form of solution
7.4.4 Introducing regime switching
7.4.5 Dependency on ability to identify prevailing regime
7.4.6 Identifying optimal asset mixes
7.4.7 Incorporating constraints on efficient portfolios
7.4.8 Applying statistical tests to the optimal portfolios
7.4.9 General form of RS optimal portfolio behaviour
7.5 Links with derivative pricing theory
7.6 Transaction costs
7.7 Incorporating more complex autoregressive behaviour
7.7.1 Introduction
7.7.2 Dependency on periods earlier than the latest one
7.7.3 Threshold autoregressive models
7.7.4 ‘Nearest neighbour’ approaches
7.8 Incorporating more intrinsically fat-tailed behaviour
7.9 More heuristic ways of handling fat tails
7.9.1 Introduction
7.9.2 Straightforward ways of adjusting for time-varying volatility
7.9.3 Lower partial moments
7.10 The practitioner perspective
7.11 Implementation challenges
7.11.1 Introduction
7.11.2 Solving problems with time-varying parameters
7.11.3 Solving problems that have derivative-like elements
7.11.4 Catering for transaction costs and liquidity risk
7.11.5 Need for coherent risk measures

8 Stress Testing
8.1 Introduction
8.2 Limitations of current stress testing methodologies
8.3 Traditional stress testing approaches
8.3.1 Introduction
8.3.2 Impact on portfolio of ‘plausible’ but unlikely market scenarios
8.3.3 Use for standardised regulatory capital computations
8.3.4 A greater focus on what might lead to large losses
8.3.5 Further comments
8.4 Reverse stress testing
8.5 Taking due account of stress tests in portfolio construction
8.5.1 Introduction
8.5.2 Under-appreciated risks
8.5.3 Idiosyncratic risk
8.5.4 Key exposures
8.6 Designing stress tests statistically
8.7 The practitioner perspective
8.8 Implementation challenges
8.8.1 Introduction
8.8.2 Being in a position to ‘stress’ the portfolio
8.8.3 Calculating the impact of a particular stress on each position
8.8.4 Calculating the impact of a particular stress on the whole portfolio
8.8.5 Engaging with boards, regulators and other relevant third parties

9 Really Extreme Events
9.1 Introduction
9.2 Thinking outside the box
9.3 Portfolio purpose
9.4 Uncertainty as a fact of life
9.4.1 Introduction
9.4.2 Drawing inferences about Knightian uncertainty
9.4.3 Reacting to really extreme events
9.4.4 Non-discoverable processes
9.5 Market implied data
9.5.1 Introduction
9.5.2 Correlations in stressed times
9.5.3 Knightian uncertainty again
9.5.4 Timescales over which uncertainties unwind
9.6 The importance of good governance and operational management
9.6.1 Introduction
9.6.2 Governance models
9.6.3 Enterprise risk management
9.6.4 Formulating a risk appetite
9.6.5 Management structures
9.6.6 Operational risk
9.7 The practitioner perspective
9.8 Implementation challenges
9.8.1 Introduction
9.8.2 Handling Knightian uncertainty
9.8.3 Implementing enterprise risk management

10 The Final Word
10.1 Conclusions
10.2 Portfolio construction principles in the presence of fat tails
10.2.1 Chapter 2: Fat tails – in single (i.e., univariate) return series
10.2.2 Chapter 3: Fat tails – in joint (i.e., multivariate) return series
10.2.3 Chapter 4: Identifying factors that significantly influence markets
10.2.4 Chapter 5: Traditional portfolio construction techniques
10.2.5 Chapter 6: Robust mean-variance portfolio construction
10.2.6 Chapter 7: Regime switching and time-varying risk and return parameters
10.2.7 Chapter 8: Stress testing
10.2.8 Chapter 9: Really extreme events

Appendix: Exercises
A.1 Introduction
A.2 Fat tails – In single (i.e., univariate) return series
A.3 Fat tails – In joint (i.e., multivariate) return series
A.4 Identifying factors that significantly influence markets
A.5 Traditional portfolio construction techniques
A.6 Robust mean-variance portfolio construction
A.7 Regime switching and time-varying risk and return parameters
A.8 Stress testing
A.9 Really extreme events

References

Index
Preface

There are two reasons for writing this book. The first is timing. This book has been written during a worldwide credit and economic crisis. The crisis has reminded us that extreme events can occur more frequently than we might like. It followed an extended period of relative stability and economic growth that, with hindsight, was the calm before a storm. Throughout this book, this crisis is referred to as the ‘2007–09 credit crisis’. During the preceding relatively stable times, people had, maybe, paid less attention than they should have to the possibility of extreme events occurring. Perhaps you were one of them. Perhaps I was too. Adversity is always a good spur to careful articulation of underlying reality.

Extreme events were evident even right at the start of the 2007–09 credit crisis, at the end of July 2007 and in the first two weeks of August 2007. This marked the time when previously ruling relationships between interbank money market rates for different deposit terms first started to unravel. It also coincided with some sudden and unexpectedly large losses being incurred by some high profile (quantitatively run) hedge funds that unwound some of their positions when banks cut funding and liquidity lines to them. Other investors suffered unusually large losses on positions they held that had similar economic sensitivities to the ones that these hedge funds were liquidating.

During 2007 investors were also becoming more wary of US sub-prime debt. But, even as late as April 2008, central banks were still expecting markets gradually to regain their poise. It was not to be. In late 2008 the credit crisis erupted into a full-blown global banking crisis with the collapse of Lehman Brothers, the need for Western governments to shore up their banking systems and economies across the globe entering recessions. Market volatility during these troubled times was exceptional, reaching levels not seen since the Great Depression in the early 1930s. Extreme events had well and truly returned to the financial landscape.

The second reason for writing this book is that it naturally follows on from my earlier book on Market Consistency; see Kemp (2009). Market consistency was defined there as the activity of taking account of ‘what the market has to say’ in financial practice. In Market Consistency I argued that portfolio construction should be viewed as the third strand in a single overarching branch of financial theory and practice, the other two strands being valuation methodology and risk management processes. However, I also noted that the application of market consistency to portfolio construction was in some sense simultaneously both core and peripheral. It was ‘core’, in the sense that ultimately there is little point in merely valuing and measuring positions or risk; the focus should also be on managing them. It was also ‘peripheral’, because portfolio management generally also involves taking views about when the market is right and when it
is wrong, and then acting accordingly, i.e., it necessarily does not always agree with ‘what the market has to say’. These different drivers meant that Market Consistency naturally focused more on valuation methodology and risk management and less on portfolio construction.

So, this book seeks to balance the exposition in Market Consistency by exploring in more depth the portfolio construction problem. It is particularly aimed at those practitioners, students and others who, like me, find portfolio construction a fascinating topic to study or a useful discipline to apply in practice.

The style I have adopted when writing this book is similar to the one used for Market Consistency. This involves aiming simultaneously to write both a discursive text and a reference book. Readers will no doubt want a reference book to help them navigate through the many techniques that can be applied to portfolio construction. However, undue focus on the ‘how to’ rather than also on the ‘why should I want to’ or ‘what are the strengths and weaknesses of doing so’ is typically not the right way to help people in practice. Instead, readers ideally also expect authors to express opinions about the intrinsic usefulness of different techniques.

As in Market Consistency I have tried to find a suitable balance between mathematical depth and readability, to avoid some readers being overly daunted by unduly complicated mathematics. The book focuses on core principles and on illuminating them where appropriate with suitably pitched mathematics. Readers wanting a more detailed articulation of the underlying mathematics are directed towards the portfolio construction pages of the www.nematrian.com website, referred to throughout this book as Kemp (2010). Kemp (2010) also makes available a wide range of online tools that, inter alia, can be applied to the portfolio construction problem. Most of the charts in this book and the analyses on which they are based use these tools, and where copyrighted are reproduced with kind permission from Nematrian Limited. A few of the charts and principles quoted in this book are copied from ones already contained in Market Consistency and are reproduced with permission from John Wiley & Sons, Ltd.
Acknowledgements

I would like to thank Pete Baker, Aimee Dibbens and their colleagues at Wiley for encouraging me to embark on writing this book. Thanks are also due to Colin Wilson and others who have read parts of this manuscript and provided helpful comments on how it might be improved. A special appreciation goes to my wife and family for supporting me as this book took shape. However, while I am very grateful for the support I have received from various sources when writing this book, I still take sole responsibility for any errors and omissions that it contains.
Abbreviations

ALM   asset-liability management
APT   Arbitrage Pricing Theory
AR   autoregressive
ARMA   autoregressive moving average
BL   Black-Litterman
bp   basis point
CAPM   Capital Asset Pricing Model
CDO   collateralised debt obligation
cdf   cumulative distribution function
CDS   credit default swap
CML   capital market line
COSO   Committee of Sponsoring Organisations of the Treadway Commission
CPPI   constant proportional portfolio insurance
CRO   chief risk officer
CRRA   constant relative risk aversion
CVaR   conditional Value-at-Risk (aka TVaR and Expected Shortfall)
EM   expectation-maximisation algorithm
EMH   efficient markets hypothesis
ERM   enterprise risk management
ESG   economic scenario generator
EU   European Union
EVD   extreme value distribution
EVT   extreme value theory
FOC   first order conditions
FSA   Financial Services Authority (UK)
GAAP   Generally Accepted Accounting Principles
GMM   Gaussian (i.e., multivariate Normal) mixture models
GRS F-test   Gibbons, Ross and Shanken (Econometrica 1989) F-test
HHI   Herfindahl Hirshman Index
i.i.d.   independent and identically distributed
IAS   International Accounting Standards
IASB   International Accounting Standards Board
ICA   independent components analysis
IT   information technology
lpm   lower partial moment
LTCM   Long Term Capital Management
MA   moving average
MCMC   Markov chain Monte Carlo
MDA   maximum domain of attraction
MRP   minimum risk portfolio
OLS   ordinary least squares
ORSA   Own Risk and Solvency Assessment
OSLL   out-of-sample log-likelihood
PCA   principal components analysis
pdf   probability density function
QQ   quantile–quantile
RE   resampled efficiency
RS   regime switching
S&Ls   (US) savings and loans associations
SCR   Solvency Capital Requirement
SETAR   self-exciting threshold autoregressive
SIV   structured investment vehicle
SYSC   Senior Management Arrangements, Systems and Controls
TAR   threshold autoregressive
TVaR   tail Value-at-Risk (aka CVaR and Expected Shortfall)
UK   United Kingdom of Great Britain and Northern Ireland
US and USA   United States of America
VaR   Value-at-Risk
Notation

0 = vector of zeros
1 = vector of ones
a, a = active positions, (linear combination) signal mixing coefficients
α̂ = tail index parameter estimate
b, b = benchmark, minimum risk portfolio, (distributional mixture) mixing coefficients
β = portfolio or individual security beta
γ, γi = cumulant of a distribution (if i = 1 then skew and if i = 2 then ‘excess’ kurtosis)
Γ0 = option gamma
Δ = amount to invest in the underlying in a hedging algorithm
Δ0 = option delta
E = analogue of energy in simulated annealing algorithm
E(X) = expected value of X
E(X | θ) = expected value of X given θ
εj,t = error terms in a regression analysis
f(x) = probability density function
F(x), F⁻¹(x) = cumulative distribution function (or more generally some specified distributional form) and its inverse function
f̂ = Monte Carlo estimate of the average of a function f
f̄ = true average of a function f
h = time period length
i, j, k = counting indexes
I, I = identity matrix
It = index series
Iα(t) = ‘hit’ series when backtesting
K = option strike price
L = lottery, Lagrange multiplier
λ = risk-reward trade-off parameter, also eigenvalue
m = number of assets (or liabilities, or both) in the portfolio optimisation problem
µ, µ, µ̂ = mean of a univariate distribution, vector of population means of a multivariate distribution, vector of sample or estimated means (likewise use of ‘hat’ symbol for other variables)
n = number of time periods, observations or simulations
N(z), N⁻¹(z) = cumulative distribution function for the Normal distribution, and its inverse function
o(x) = tends to zero more rapidly than x as x → 0
O(n) = of order (magnitude) a constant times n (in analysis of algorithm run times)
p(X) = probability of X occurring
P(X ≤ x) = probability that a random variable X is less than or equal to some value x
p(X | θ), P(X | θ) = probability of X given θ
r = return on an asset, liability or index
r̄ = mean of r
rrf, rb = risk-free rate of return, benchmark return
ρ = correlation coefficient, risk measure
s = stress test, regime
S = spectrum of an autoregressive time series
St = stock index series
S(α, β, γ, δk; k) = Stable distribution with parameters (α, β, γ, δk) using parameter definition k (= 0 or 1)
σ = standard deviation
T = matrix transpose (if used as a superscript), also (when not confusing) time at end of analysis/time horizon, also analogue of temperature in simulated annealing algorithm
T(z) = transfer function
V = a volume in a multi-dimensional space
V = covariance matrix
wi, wt, wt, wi,j = weight in asset i, innovation at time t (possibly expressed in vector form for vector autoregressive series), elements of mixing matrix W
W = mixing matrix, terminal wealth
Xn ≜ X = equality in distributional form
Zn →ᴰ F = the sequence of random variables, Zn, converges in distribution to F as n → ∞, i.e., in the limit takes the distributional form characterised by F
1 Introduction

1.1 EXTREME EVENTS

This book is about how best to construct investment portfolios if a priori it is reasonable to assume that markets might exhibit fat-tailed behaviour. It is designed to appeal to a wide variety of practitioners, students, researchers and general readers who may be interested in extreme events or portfolio construction theory either in isolation or in combination. It achieves this aim by

(a) Exploring extreme events, why they might arise in a financial context and how we might best analyse them.
(b) Separately exploring all the main topics in portfolio construction theory applicable even in the absence of fat tails. A special case of any more general approach capable of effectively handling extreme events is the situation where the extent of fat-tailed behaviour is too small to be discernible.
(c) Blending points (a) and (b) together to identify portfolio construction methodologies better able to cater for possible fat-tailed behaviour in the underlying assets or liabilities.

Given its wide intended audience, the book covers these topics both from a more qualitative perspective (particularly in the earlier and later chapters) and from a more quantitative (i.e., mathematical) perspective (particularly in the middle chapters). Where possible, this book has been segmented so that valuable insights can be gained without necessarily having to read the whole text. Conversely, in the author’s opinion, valuable insights arise throughout the book, including the parts that are more mathematical in nature. More general readers are therefore encouraged not to skip over these parts completely, although they do not need to worry too much about following all the details.

By fat-tailed behaviour we mean that the distribution of future returns is expected to involve more extreme events than might be expected to occur were returns to follow the (multivariate) (log-) Normal distributions often assumed to apply to markets in basic portfolio construction texts.1 Most practitioners believe that most markets are ‘fat-tailed’ given this terminology. There is a wide body of empirical academic literature that supports this stance, based on analysis of past market behaviour. There is also a growing body of academic theory, including some involving behavioural finance, explaining why fat-tailed behaviour seems so widespread. So, we might also characterise this book as exploring how best to construct investment portfolios in the real world.

1 By ‘multivariate’ we mean that the returns on different return series have a joint distribution, the characterisation of which includes not only how individual return series in isolation might behave, but also how they behave when considered in tandem, see Chapter 3. By ‘(log-) Normal’ we mean that the natural logarithm of 1 + r is Normally distributed, where the return, r, is expressed in fractional form, see Section 2.3.1.

Of course, practitioners and academics alike are not themselves immune from behavioural biases. It is one thing to agree to pay lip service to the notion that market behaviour can be
fat-tailed, but quite another to take this into account in how portfolios are actually constructed. Following the dot.com boom and bust in the late 1990s and early 2000s, markets settled into a period of unusually low volatility. Strategies that benefited from stable economic conditions, e.g., ones that followed so-called ‘carry’ trades or strategies relying on continuing ready access to liquidity, proved successful, for a time. The 2007–09 credit crisis, however, painfully reminded the complacent that markets and economies more generally can and do experience extreme events.
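As a small, purely illustrative aside on the ‘(log-) Normal’ convention described in footnote 1, the Python snippet below simply shows the transformation involved: it is the natural logarithm of 1 + r, with the return r expressed in fractional form, that is treated as Normally distributed. The –2.5% return used is an arbitrary assumed figure, not taken from the book.

```python
import math

r = -0.025                      # an illustrative -2.5% period return, in fractional form
log_return = math.log(1 + r)    # the quantity assumed to be Normally distributed
print(f"Simple return: {r:.2%}, log return: {log_return:.4f}")

# Going the other way: a Normally distributed log return can be arbitrarily negative,
# yet the implied simple return can never fall below -100%.
very_bad_log_return = -3.0
print(f"Implied simple return: {math.exp(very_bad_log_return) - 1:.2%}")
```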
1.2 THE PORTFOLIO CONSTRUCTION PROBLEM

We do not live in a world in which we have perfect foresight. Instead, portfolio construction always involves striking a balance between risk and reward, i.e., the risk that the views implicit in our portfolio construction will prove erroneous versus the rewards that will accrue if our views prove correct.

Everyone involved in the management of portfolios, whether of assets or of liabilities, faces a portfolio construction problem. How do we best balance risk and return? Indeed, what do we mean by ‘best’?

Given the lack of perfect foresight that all mortals face, it is not reasonable to expect a book like this to set out guaranteed ways of profiting from investment conditions come what may. Instead, it seeks to achieve the lesser but more realistic goal of exploring the following:

(a) core elements of portfolio construction;
(b) mathematical tools that can be used to assist with the portfolio construction problem, and their strengths and weaknesses;
(c) ways of refining these tools to cater better for fat-tailed market behaviour;
(d) mindsets best able to cope well with extreme events, and the pitfalls that can occur if we do not adopt these mindsets.
1.3 COPING WITH REALLY EXTREME EVENTS

Lack of perfect foresight is not just limited to a lack of knowledge about exactly what the future holds. Typically in an investment context we also do not know how uncertain the future will be. Using statistical terminology, we do not even know the precise form of the probability distribution characterising how the future might evolve.

The differentiation between ‘risk’ and ‘uncertainty’ is a topic that several popular writers have sought to explore in recent times, e.g., Taleb (2004, 2007). In this context ‘risk’ is usually taken to mean some measurable assessment of the spread of possible future outcomes, with ‘uncertainty’ then taken to mean lack of knowledge, even (or particularly) concerning the size of this spread. In this book, we take up this baton particularly in Chapters 8 and 9.

Holding such an insight in mind is, I think, an important contributor to successful portfolio construction. In particular, it reminds us that really extreme events seem to have a nasty habit of occurring more often than we might like. Put statistically, if there is a 1 in 10¹⁰ (1 in 10 billion) chance of an event occurring given some model we have adopted, and there is a 1 in 10⁶ (1 in a million) chance that our model is fundamentally wrong, then any really extreme events are far more likely
to be due to our model being wrong than representing random (if unlikely) draws from our original model.2

Yet such insights can also be overplayed. The portfolio construction problem does not go away merely because the future is uncertain. Given a portfolio of assets, someone, ultimately, needs to choose how to invest these assets. Although it is arguably very sensible for them to bear in mind intrinsic limitations on what might be knowable about the future, they also need some framework for choosing between different ways of structuring the portfolio. This framework might be qualitatively formulated, perhaps as someone’s ‘gut feel’. Alternatively, it might be quantitatively structured, based on a more mathematical analysis of the problem at hand. It is not really the purpose of this book to argue between these two approaches. Indeed, we shall see later that the outcome of essentially any qualitative judgemental process can be reformulated as if it were coming from a mathematical model (and arguably vice versa).

Perhaps the answer is to hold onto wealth lightly. All of us are mortal. The more religious among us, myself included, might warm to this philosophy. But again, such an answer primarily characterises a mindset to adopt, rather than providing specific analytical tools that we can apply to the problem at hand.
1.4 RISK BUDGETING

Some practitioners point to the merits of risk budgeting. This involves identifying the total risk that we are prepared to run, identifying its decomposition between different parts of the investment process and altering this decomposition to maximise expected value-added for a given level of risk. It is a concept that has wide applicability and is difficult to fault. What business does not plan its evolution via forecasts, budgets and the like? Indeed, put like this risk budgeting can be seen to be common sense.

Again, though, we have here principally a language that we can use to describe how to apply investment principles. Risk budgeting principally inhabits the ‘mindset’ sphere rather than constituting an explicit practical toolset directly applicable to the problem at hand. This should not surprise us. Although sensible businesses clearly do use budgeting techniques to good effect, budgeting per se does not guarantee success. So it is with risk budgeting.3

However, language is the medium through which we exchange ideas and so cannot be ignored. Throughout this book, we aim to explain emerging ideas using terms that can be traced back to risk budgeting concepts. This helps clarify the main aspects of the methodology under discussion. It also helps us understand what assumptions need to be made for the relevant methodology to be valid.
2 More precisely, in this situation we need the probability of occurrence of the event that we are considering to be much higher (on average) than 1 in 10,000 in the 1 in a million circumstances when our underlying model is assumed to prove to be fundamentally wrong. This, however, is typically what is implied by use of the term ‘fundamentally wrong’. For example, suppose that our model presupposes that possible outcomes are Normally distributed with zero mean and standard deviation of 1. Then the likelihood of an outcome worse than c. –6.4 is 1 in 10 billion. However, suppose that there is actually a one in a million chance that our model is ‘fundamentally wrong’ and that the standard deviation is not 1 but 10. Roughly 26% of outcomes when the standard deviation is 10 will involve an outcome worse than c. –6.4. So, in this instance an event this extreme is roughly 3,000 times as likely to be a result of our original model being ‘fundamentally wrong’ as it is to be a fluke draw from the original model.
3 Likewise, no portfolio construction technique is able to guarantee success.
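Footnote 2’s arithmetic can be reproduced in a few lines of Python. The sketch below is purely illustrative: it simply re-computes the footnote’s own numbers (a Normal base model with standard deviation 1, a one in a million chance that the true standard deviation is instead 10, and a threshold of –6.4) using Bayes’ rule in odds form.

```python
from scipy.stats import norm

# Illustrative numbers from footnote 2 (not a definitive implementation):
p_wrong = 1e-6       # prior probability that the model is fundamentally wrong
threshold = -6.4     # an outcome at least this adverse

p_extreme_if_right = norm.cdf(threshold, loc=0, scale=1)   # ~7.8e-11, c. 1 in 10 billion
p_extreme_if_wrong = norm.cdf(threshold, loc=0, scale=10)  # ~26%

# Relative likelihood that such an extreme outcome reflects a wrong model
# rather than a fluke draw from the original model (Bayes' rule, odds form)
odds = (p_wrong * p_extreme_if_wrong) / ((1 - p_wrong) * p_extreme_if_right)
print(f"P(extreme | model right) = {p_extreme_if_right:.2e}")
print(f"P(extreme | model wrong) = {p_extreme_if_wrong:.2%}")
print(f"Odds that the model is wrong rather than the draw a fluke: {odds:,.0f} to 1")
```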
1.5 ELEMENTS DESIGNED TO MAXIMISE BENEFIT TO READERS

As explained in Section 1.1, this book aims to appeal to a wide variety of audiences. To do this, I have, as with my earlier book on Market Consistency, sought a suitable balance between mathematical depth and readability, to avoid some readers being overly daunted by unduly complicated mathematics. The book focuses on core principles and on illuminating them where appropriate with suitably pitched mathematics. Readers wanting a more detailed articulation of the underlying mathematics are directed towards the portfolio construction pages of the www.nematrian.com website, referred to throughout this book as Kemp (2010).

To maximise the benefit that both practitioners and students can gain from this book, I include two sections at the end of each main chapter that provide:

(a) Comments specifically focusing on the practitioner perspective. To navigate successfully around markets typically requires an enquiring yet somewhat sceptical mindset, questioning whether the perceived benefits put forward for some particular technique really are as strong as some might argue. So, these sections either focus on the ways that practitioners might be able to apply insights set out earlier in the relevant chapter in their day-to-day work, or highlight some of the practical strengths and weaknesses of techniques that might be missed in a purely theoretical discussion of their attributes.
(b) A discussion of some of the more important implementation challenges that practitioners may face when trying to apply the techniques introduced in that chapter. Where the same challenge arises more than once, I generally discuss the topic at the first available opportunity, unless consideration of the challenge naturally fits better in a later chapter.

The book also includes an Appendix containing some exercises for use by students and lecturers. Each main chapter of the book has associated exercises that further illustrate the topics discussed in that chapter. The exercises are reproduced with kind permission from Nematrian Limited. Hints and model solutions are available on the www.nematrian.com website, as are any analytical tools needed to solve the exercises.

Throughout the book, I draw out principles (i.e., guidance, mainly for practitioners) that have relatively universal application. Within the text these principles are indented and shown in bold, and are referenced by P1, P2, etc.
1.6 BOOK STRUCTURE

The main title of this book is Extreme Events. It therefore seems appropriate to focus first, in Chapters 2 and 3, on fat tails and extreme events. We explore some of the ways in which fat-tailed behaviour can be analysed and the existence or otherwise of extreme events confirmed or rejected. We differentiate between analysis of fat tails in single return series in Chapter 2 and analysis of fat tails in joint (i.e., multiple) return series in Chapter 3. The shift from ‘one’ to ‘more than one’ significantly extends the nature of the problem.

Before moving on to portfolio construction per se, we consider in Chapter 4 some ways in which we can identify what seems to be driving market behaviour. Without some underlying model of market behaviour, it is essentially impossible to assess the merits of different possible approaches to portfolio construction (or risk modelling). We consider tools such as principal components analysis and independent components analysis, and we highlight their links with other statistical and econometric tools such as multivariate regression.
In Chapters 5–7 we turn our attention to the portfolio construction problem. Chapter 5 summarises the basic elements of portfolio construction, both from a quantitative and from a qualitative (i.e., ‘fundamental’) perspective, if fat tails are not present. At a suitably high level, both perspectives can be viewed as equivalent, apart perhaps from the mindset involved. In Chapter 5 we also explore some of the basic mathematical tools that commentators have developed to analyse the portfolio construction problem from a quantitative perspective. The focus here (and in Chapter 6) is on mean-variance portfolio optimisation (more specifically, mean-variance optimisation assuming time stationarity). We consider its application both in a single-period and in a multi-period world.

In Chapter 6 we highlight the sensitivity of the results of portfolio construction analyses to the input assumptions, and the tendency of portfolio optimisers to maximise ‘model error’ rather than ‘risk-return trade-off’. We explore ways of making the results more robust to errors affecting these input assumptions. The academic literature typically assumes that input assumptions are estimated in part from past data. We might argue that asset allocation is a forward-looking discipline, and that the assumptions we input into portfolio construction algorithms should properly reflect our views about what might happen in the future (rather than about what has happened in the past). However, some reference to past data nearly always arises in such analyses. We pay particular attention to Bayesian approaches in which we have some prior (‘intrinsic’) views about the answers or input parameters that might be ‘reasonable’ for the problem and we give partial weight to these alongside partial weight to external (often past) data. The best-known example of this is probably the Black-Litterman approach. Some Bayesian approaches can also be viewed as justifying heuristic4 techniques that can be applied to the portfolio construction problem. This again highlights the high-level equivalence that exists between quantitative and qualitative approaches to portfolio construction.

In Chapter 6 we also introduce ‘market consistent’ portfolio construction, in which we derive input assumptions not from past data but from market implied data. Such an approach is ‘robust’ in the sense that the input assumptions are in theory not subject to the same sorts of estimation errors as ones derived from historical behaviour. We also explore tools that practitioners less keen on Bayesian approaches have developed to tackle estimation error, particularly resampled portfolio optimisation. We show that they are less divorced from Bayesian approaches than might appear at first sight.

In Chapter 7 we identify how to incorporate fat tails into portfolio construction theory. We start by exploring what happens when we relax the assumption of time stationarity, by introducing the concept of regime shifting. This involves assuming that the world is, at any point in time, in one of several possible states, characterised by different distributions of returns on the different assets and liabilities under consideration. The mixing of distributions introduced in such a model naturally leads to fat-tailed behaviour. We then extend these ideas to encompass more general ways of incorporating fat-tailed behaviour.
We focus mainly but not exclusively on situations where the regime is characterised not by a single Normal distribution but by a distributional mixture of Normal distributions (because this type of model is sufficiently general that it can approximate other ways in which fat tails might arise). We also explore approaches that involve continuously varying parameterisations of the different regimes and focus on behaviour in continuous rather than discrete time.
4 In this context, a ‘heuristic’ technique is one that is akin to a rule of thumb that is not principally proposed on the basis of some formal mathematical justification but more because the approach is relatively convenient to implement.
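The statement above that mixing Normal distributions naturally leads to fat-tailed behaviour can be checked directly. The Python sketch below uses illustrative regime parameters of my own choosing (a ‘quiet’ regime with 1% volatility and a ‘turbulent’ regime with 3% volatility occurring 20% of the time); they are assumptions for illustration only. Any single Normal distribution has zero excess kurtosis, whereas the mixture does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative "regimes": a quiet one and a turbulent one.
# Each draw is Normal, but the regime (and hence the volatility) varies.
n = 1_000_000
quiet_vol, turbulent_vol = 0.01, 0.03   # assumed daily volatilities
p_turbulent = 0.2                       # assumed probability of the turbulent regime

regime = rng.random(n) < p_turbulent
vols = np.where(regime, turbulent_vol, quiet_vol)
returns = rng.normal(0.0, vols)

# Excess kurtosis is zero for any single Normal distribution,
# but clearly positive for the mixture, i.e., the mixture is fat-tailed.
excess_kurtosis = ((returns - returns.mean())**4).mean() / returns.var()**2 - 3
print(f"Excess kurtosis of the mixture: {excess_kurtosis:.2f}")
```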
Chapters 2 to 7 are largely concerned with a probability-theoretic view of portfolio construction. In them, we identify, somehow or other, a distributional form to which we believe future outcomes will adhere (or more than one in the case of regime shifting). At least in principle, this involves specifying a likelihood of occurrence for any given future scenario. However, the belief that we can in all cases actually identify such likelihoods arguably involves an overly rosy view about our ability to understand the future. More to the point, regulators and other bodies who specify how financial services entities should operate may want to ensure that entities do not put too many of their eggs into what may prove an uncertain basket. In recent years this has led to an increased focus on stress testing, which we explore in Chapter 8.

Stress testing, in this context, generally involves placing less emphasis on likelihood and more emphasis on magnitude (if large and adverse) and on what might make the scenario adverse. We can view ‘reverse stress testing’ and ‘testing to destruction’ as being at one extreme of this trend. In the latter, we hypothesise a scenario adverse enough to wipe out the business model of the firm in question (or the equivalent if we are considering a portfolio) irrespective of how likely or unlikely it might be to come to pass. We then focus on what might cause such a scenario to arise and whether there are any strategies that we can adopt that might mitigate these risks.

Chapter 9 extends the ideas implicit in Chapter 8 to consider ‘really extreme’ events. It is more heuristic and ‘mindset-orientated’ than the rest of the book. This is inevitable. Such events will almost certainly be so rare that there will be little if any directly relevant data on them. Market implied portfolio construction techniques potentially take on added importance here. Merely because events are rare does not mean that others are not exposed to them too. The views of others, distilled into the market prices of such risks, may be subject to behavioural biases, but may still help us in our search to handle extreme events better.

Finally, in Chapter 10 we collate and summarise in one place all the principles highlighted elsewhere in the book.
2 Fat Tails – In Single (i.e., Univariate) Return Series

2.1 INTRODUCTION

The 2007–09 credit crisis is a profound reminder that ‘extreme’ events, and particularly ‘black swans’ (i.e., those rare events that, until they occur, may have been thought essentially impossible), occur more frequently than we might expect, were they to be coming from the Normal distributions so loved by classical financial theory.

In this chapter we first explore what we mean by an ‘extreme event’ and hence by a ‘fat-tailed’ distribution. We then explore the extent to which some financial series appear to exhibit fat-tailed behaviour. In later chapters we will reuse the methodologies that we develop in this chapter for analysing such behaviours, particularly when applied to the task of practical portfolio construction.

We focus in this chapter on univariate data series, e.g., the return series applicable to a single asset such as a bond, equity or currency or a single composite asset, such as an equity market or sector index. In Chapter 3 we focus on multivariate data, i.e., the combination of such series when viewed in tandem. The portfolio construction problem ultimately involves selecting between different assets. Therefore, in general it can only be tackled effectively when a full multivariate view is adopted.
2.2 A FAT TAIL RELATIVE TO WHAT?

If everyone agrees that extreme events occur rather more frequently than we might like, then why don’t we take more cognisance of the possibility of such events? This rather profound question is linked in part to behavioural biases that humans all too easily adopt.

An event can only be classified as ‘extreme’ by comparison with other events, with which we might reasonably expect it to be comparable. A five-hour train journey might be deemed to have taken an ‘extremely long’ time if journeys between the same two stations usually take only ten minutes and have never before taken more than 20 minutes. Conversely, this journey might be deemed to be ‘extremely short’ relative to a transcontinental train journey that usually takes days to complete. Indeed, a journey might be deemed ‘extremely’ short and still take roughly its ‘expected’ time. A commentator might, for example, note that the expected time taken to get from an airport to a holiday resort is ‘extremely short’, because the resort is just round the corner from the airport, if most ‘equivalent’ resorts involve much longer transfer times.

Principle P1: Events are only ‘extreme’ when measured against something else. Our innate behavioural biases about what constitute suitable comparators strongly influence our views about how ‘extreme’ an event actually is.
In the context of finance and economics we usually (although not always) make such comparisons against corresponding past observations. For example, we might view a particular day’s movement in a market index as extreme relative to its movements on previous days or a recent quarterly economic decline (or growth) as extreme versus equivalent movements in previous quarters.

However, we generally do not view ‘the past’ as one single monolithic dataset that implicitly weights every observation equally. Instead, most of us, deep down, believe (or want to believe) in ‘progress’. We typically place greater weight on more recent past observations. We usually think that the recent past provides ‘more relevant’ comparative information in relation to current circumstances. Even if we do not consciously adopt this stance our innate behavioural biases and learning reflexes often result in us doing so anyway. ‘Recent’ here needs to be sufficiently recent in the context of a typical human lifetime, or maybe merely a human career span, for the generality of informed commentators to incorporate the relevant past data observation into the set of observations that they use (implicitly or otherwise) to work out what spread of outcomes is ‘to be expected’ for a given event.

But what happens if the nature of the world changes through time? The ‘recent’ past may no longer then be an appropriate anchor to use to form an a priori guess as to what spread of outcomes might be ‘expected’ at the present juncture. An event may be ‘exceptional’ relative to one (relatively recent) past time span but be less exceptional in the context of a longer past time span.

Figure 2.1 illustrates some issues that such comparisons can raise. This chart shows the behaviour of the spread (i.e., difference) between two different types of money market (annualised) interest rates. Both rates relate to 1 month interest rates: one relates to unsecured interbank lending (Euribor) and the other relates to secured lending in which the borrower posts collateral in the form of Euro denominated government debt (Eurepo).

[Figure 2.1: Spread between 1 month Eurepo and Euribor interest rates (% pa), 31/12/2003 to 31/12/2008. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian.]

The difference
between the two can be viewed as a measure of the potential likelihood of the borrowing bank defaulting within a 1 month horizon.

Until July 2007, the relationship between the two interest rates appeared to be very stable (and the spread very small). In preceding months, market participants believed that banks active in the interbank market were generally very sound and unlikely to default in the near future. A spread between these two lending rates of even, say, 0.1% pa would have seemed very large, based on the (then) recent history. However, with the benefit of hindsight we can see that much larger spreads still were to become commonplace. In the latter part of the period covered by this chart, a spread of 0.1% pa would have seemed very small! As noted in the Preface, the breakdown of previously ruling relationships such as these in the money markets in late July and early August 2007 marked the start of the credit crisis. What appeared to be ‘normal’ before the credit crisis no longer appeared ‘normal’ (or at least ‘typical’) once it had started.

Economists and quantitative investors and researchers have a particular name for comparisons of the same underlying object through time: they are called longitudinal comparisons. Such comparisons may be contrasted with cross-sectional comparisons, in which we, say, compare the returns on different securities over the same time period. A particular security’s return on a particular day might, for example, be deemed extremely large relative to the spread of returns that other ‘equivalent’ securities achieved on the same day.

With a cross-sectional comparison, the need for some sort of ‘equivalence’ between the securities being considered is self evident. Otherwise the comparison may be viewed as spurious. For example, we do not ‘expect’ the spread of returns exhibited by different equities to be anything like as narrow as the spread of returns on less volatile asset categories, such as cash, because we do not view equities and cash as ‘equivalent’ in this context. We also come with less of a preconception that extreme returns might be very rare. Given a large enough universe, we naturally expect to see some outliers. For example, with equities, we ‘expect’ exceptional events, such as takeovers and bankruptcies, to occur from time to time. We might ‘expect’ events like these, which generate extreme outliers, only to affect any given individual security rarely. However, given a large enough sample of such securities exposed to the same sorts of outliers, such events are common enough to remind us that very large positive or negative returns do affect individual securities from time to time, and might therefore show up as outliers in cross-sectional comparisons.

The need for a corresponding underlying ‘equivalence’ between observations also arises with longitudinal comparisons, but it is easier to forget that this is the case. In particular we need this period’s observation to be ‘comparable’ with the corresponding past observations against which it is to be compared. Where the data is immediately perceived not to be comparable then we readily adjust for this and discount ‘spurious’ conclusions that we might otherwise draw from such comparisons. For example, we do not normally place much emphasis on a grown man’s height being large relative to his height as a baby. Instead we view such a progression as being part of the natural order.1 But what if lack of comparability is less clearly not present?

1 We might even reposition the comparison within a cross-sectional framework. For example, biologists may be interested in how the magnitude of physical changes between youth and maturity varies by species because it can reveal regularities (or differences) between species.
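To make the idea of a longitudinal comparison concrete, the Python sketch below uses a made-up spread series (the actual Euribor–Eurepo data behind Figure 2.1 is not reproduced here, so the series and its parameters are assumptions for illustration) and asks how ‘extreme’ the latest observation looks relative to a trailing window of past observations. The verdict depends entirely on the comparator window chosen, which is exactly the point made above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily spread series (% pa): a long quiet period followed by a
# crisis-like jump, loosely mimicking the shape of Figure 2.1.
quiet = rng.normal(0.05, 0.005, 900)
crisis = rng.normal(0.60, 0.20, 100)
spread = np.concatenate([quiet, crisis])

def z_score(series, window):
    """How extreme is the latest observation relative to the trailing window?"""
    history = series[-window - 1:-1]
    return (series[-1] - history.mean()) / history.std(ddof=1)

# The same kind of observation looks wildly extreme against the quiet recent past,
# far less so once the comparator window itself contains crisis observations.
print(f"First crisis day vs 250 quiet days   : {z_score(spread[:901], 250):8.1f}")
print(f"Latest crisis day vs 60 crisis days  : {z_score(spread, 60):8.1f}")
```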
Our understanding of what constitutes an extreme event (in relation to comparisons through time) is heavily coloured by an implicit assumption of time stationarity (or by a rejection of such an assumption). By this we mean the assumption that the distribution from which
the observation in question is being drawn does not change through time. Implicitly, time stationarity (or lack of it) has two parts, namely:

(a) A part relating to the underlying nature of the world (including how it is viewed by others). Usually this is what we are trying to understand better.
(b) A part relating to the way in which we may be observing the world. Ideally, we want this not to cloud our understanding of point (a). However, to assume that it is not present at all may overstate our own ability to avoid human behavioural biases.

Differentiating between these two elements is not always straightforward, and is further compounded by market prices being set by the interaction of investor opinions as well as by more fundamental economic drivers.

Consider, for example, a company that claims to have radically changed its business model and to have moved into an entirely new industry. We might then notice a clear lack of time stationarity in the observed behaviour of its share price. How relevant should we expect its past behaviour or balance sheet characteristics to be to its changing fortunes going forwards in its new industry? Perhaps there would be some relevance if its management team, its corporate culture and its behavioural stances have not altered much, but data more generally applicable to its new industry/business model might be viewed as more relevant. Conversely, sceptics might question whether the company really has moved into a new industry. It might, wittingly or unwittingly, merely be presenting itself as having done so.

Similar shifting sands also affect the overall market. Every so often, commentators seem to focus on new paradigms, in which it is claimed that the market as a whole has shifted in a new direction. Only some of these turn out to be truly new paradigms with the benefit of hindsight. We will find echoes of all these issues in how we should best cater for extreme events in practical portfolio construction.

The difference between effects arising from the underlying nature of the world and arising from how we observe the world can be particularly important in times of change. When we observe new information that seems to invalidate our earlier expectations, we will need to form a judgement about whether the new ‘information’ really reflects a change in the underlying nature of the world, or whether it merely reflects inadequacies in the way in which we have previously been observing the world. Wise investors have always appreciated that their understanding of how the world operates will be incomplete and that they need to learn from experience. Even wiser investors will appreciate that their learning may never be complete, a topic that we return to in Chapter 9.

Principle P2: The world in which we live changes through time. Our perception of it also changes, but not necessarily at exactly the same time.

Mathematically, we might formalise the definition of an ‘extreme event’ as one where the probability of occurrence of an event, X, which is this extreme (say X ≤ x, for downside extreme events, for some given threshold x, or X ≥ x for upside extreme events) is sufficiently small, i.e., P(X ≤ x) < α (for some sufficiently small positive α) given our ‘model’, i.e., a probability distribution P(X) characterising how we think the world works.
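By way of illustration, the following minimal Python sketch applies this formalisation directly: it fits a Normal ‘model’ to a simulated return history and asks whether a new observation is ‘extreme’ at a chosen threshold α. The history, the new observation and the value of α are all assumptions made up purely for the example.

```python
import numpy as np
from scipy.stats import norm

# Purely illustrative 'history' of daily (log) returns and a new observation
history = np.random.default_rng(0).normal(0.0003, 0.01, 2500)
x_new = -0.045

# Our 'model' P(X): here a Normal distribution fitted to the history
mu, sigma = history.mean(), history.std(ddof=1)

# Probability, under the model, of a downside outcome at least this extreme
p = norm.cdf(x_new, loc=mu, scale=sigma)

alpha = 0.001   # chosen threshold below which we label the event 'extreme'
print(f"P(X <= {x_new}) = {p:.2e}; extreme at alpha={alpha}: {p < alpha}")
```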
‘Fat-tailed’ behaviour cannot typically be identified just from one single event, but is to do with the apparent probabilities of occurrence of X, say P̂(X) (which will in general involve an amalgam of intuition and observational data), being such as to involve a higher frequency of extreme events than would arise with a Normal distribution with the same standard deviation and mean (if they exist) as P̂(X).
2.3 EMPIRICAL EXAMPLES OF FAT-TAILED BEHAVIOUR IN RETURN SERIES

2.3.1 Introduction

In this section we explore some of the methodologies that can be used to tell whether return series appear to exhibit fat tails. We consider various ways of visualising the shape of the distributional form2 and we explore some of the stylised ‘facts’ that are generally held to apply to investment returns in practice. Throughout this section we assume that the world exhibits time stationarity (see Section 2.2). Given this assumption, a distribution is ‘fat-tailed’ if extreme outcomes seem to occur more frequently than would be expected were returns to be coming from a (log-) Normal distribution. The rationale for focusing on log-Normal rather than Normal distributions is that returns, i.e., r(t), compound through time and thus log returns, i.e., log(1 + r(t)), add through time.
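The additivity of log returns is easy to verify directly. The following minimal Python sketch (the return figures are made up purely for illustration) confirms that compounding simple returns over several periods gives the same cumulative outcome as summing the corresponding log returns.

```python
import numpy as np

# Hypothetical simple returns for three consecutive periods
r = np.array([0.02, -0.01, 0.03])

# Compounding simple returns...
compounded = np.prod(1 + r) - 1

# ...is equivalent to summing log returns and exponentiating
summed_logs = np.exp(np.log(1 + r).sum()) - 1

print(compounded, summed_logs)   # both are approximately 0.0401
```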
2.3.2 Visualising fat tails

Perhaps the most common way of visualising a continuous distributional form is to plot its probability density function (pdf). Such a chart can be thought of as the continuous limit of a traditional histogram chart. It indicates the probability of a given outcome being within a certain small range, scaled by the size of that range. For a discrete distributional form the equivalent is its probability mass function, which indicates the likelihood of occurrence of each possible outcome, again typically with the outcomes arranged along the horizontal (i.e., x) axis and the probabilities of these occurrences along the vertical (i.e., y) axis. A chart directly plotting the pdf immediately provides a visual indication of the relative likelihood of a given outcome falling into one of two different equally-sized possible (small) ranges. The ratio between these probabilities is the ratio of the heights of the line within such a chart.

Directly plotting the entire pdf is not the ideal visualisation approach from our perspective. The ‘scale’ of such a chart (here meaning what might be the largest viewable y-value) is dominated by differences in the likelihood of the most likely outcomes. It is difficult to distinguish the likelihoods of occurrence of unlikely events. We can see this visualisation problem by viewing in the same chart the pdf of a (log) return series that is Normally distributed3 and the equivalent pdf of a (log) return series with the same mean and standard deviation but which is (two-sided) fat-tailed and thus has more outliers (at both ends of the distribution); see Figure 2.2.4
2 By ‘distributional form’ we mean a characterisation of the probability distribution in a way that enables us to differentiate between it and other types of probability distribution.
3 We see here that we have immediately implicitly applied Principle P1 by comparing a distribution (which we think might be fat-tailed) against an a priori distribution (here a Normal distribution) that we think is more ‘normal’ or innately likely to be correct. This begs the question of why we might have an innate bias towards viewing the Normal distribution as ‘normal’, which we discuss further in Section 2.5.
4 The fat-tailed distribution used for illustrative purposes in Figure 2.1 (and in Figures 2.2–2.5) has a pdf that results in its quantile–quantile form (see Figure 2.5) being y = ax(1 + x²/8), where a has been chosen so that it has the same standard deviation as the Normal distribution against which it is being compared.
[Figure 2.2 Illustrative probability density function plot (Normal distribution versus example fat-tailed distribution); x-axis: (log) return. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
The main visual difference between the two charts is in the centre of the distribution (with the example fat-tailed distribution appearing more peaked there5). It is possible to tell that the fat-tailed distribution also has greater mass in the tails (e.g. beyond, say, ±3 in Figure 2.2), but this feature is not as obvious, because visually it is less marked. It only becomes more obvious if we zoom in on the tail, e.g., as in Figure 2.3.

A mathematically equivalent6 way of describing a (continuous) probability distribution is to plot its cumulative distribution function (cdf), i.e., the likelihood of the outcome not exceeding a certain value, as shown in Figure 2.4. This approach presents a similar visualisation challenge. Its ‘scale’ is dominated by the height of the cdf at its right hand end, i.e., by unity. The differential behaviour in the tail is again not immediately striking to the eye. It has the further disadvantage that many people have an intuitive feel for the bell shape curve applicable to a Normal pdf, but have less of an intuitive feel for the shape of the corresponding cdf, potentially making it harder for them to spot ‘significant’ deviation from Normality.

A more helpful visualisation approach when analysing fat tails is a quantile–quantile plot (‘QQ-plot’) as shown in Figure 2.5. This illustrates the return outcome (i.e., ‘quantile’) associated with a given (cumulative) probability level, plotted against the corresponding return outcome applicable to a (log-) Normal distribution with the same mean and standard deviation as the original distribution. In it, a (log-) Normally distributed return series would be characterised by a straight line, while (two-sided) fat-tailed behaviour shows up as a curve that is below this straight line at the bottom left hand end of the curve and above it at the top right hand end of the curve. In practice, its ‘scale’ characteristics are driven by the extent to which distributions have different quantile behaviours in the ‘tails’ of the distribution.

5 The main reason why fat-tailed distributions typically in these circumstances appear more peaked in the middle than the corresponding Normal distribution is the way that we have standardised the distributions to have the same means and standard deviations. If different standardisation approaches are used then there may be less difference in height at the centre of the distribution. It is also possible for distributions to have multiple peaks, in which case the main visual differences may again no longer be right in the centre of the distribution.
6 By ‘mathematically equivalent’ we mean that it is possible to deduce one from the other (and vice versa) by suitable mathematical transformations.
[Figure 2.3 Illustrative probability density function plot as per Figure 2.2, but zooming in on just the part of the lower tail of the distribution between x = −6 and x = −2; x-axis: (log) return. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
[Figure 2.4 Illustrative cumulative probability distribution plot (Normal distribution versus example fat-tailed distribution); x-axis: (log) return. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
[Figure 2.5 Illustrative quantile–quantile plot (Normal distribution versus example fat-tailed distribution); x-axis: expected (log) return, if Normally distributed; y-axis: observed (log) return. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
This is in contrast with plots of pdfs, which as we can see from Figure 2.1, largely focus on differences in the centre of the distribution. Of the three graphical representations described above, the one in Figure 2.5 (the QQ-plot) is the easiest one in which to see visually the extent of any fat-tailed behaviour in the extremities. It is the visualisation approach that we concentrate on in this section.

QQ-plots such as these have a natural interpretation in the context of Value-at-Risk (VaR). This is a forward-looking risk measure commonly used in many parts of the financial community. VaR is an enticingly simple concept and therefore relatively easy to explain to lay-people. It involves a specification of a confidence level, say 95%, 99% or 99.5%, and a time period, say 1 day, 10 days, 1 month or 1 year. If we talk about a fund having a 10 day 99% confidence VaR of, say, X then we mean that there is7 only a 1% chance of losing more than X over the next 10 days, if the same positions are held for this 10 day time frame.8 The VaR at any given confidence level can be read off such a quantile–quantile chart by using as the x-coordinate the relevant VaR level that would have applied to a (log-) Normally distributed variable.
7 To be precise, we mean that ‘we believe that there is’ rather than ‘there is’. This highlights the point that VaR and other forward-looking risk measures always involve estimation, and hence are always subject to estimation error.
8 VaR and other similar risk measures can be expressed in percentage terms or in monetary amounts. We may also express them in ‘relative’ terms (if our focus is on return relative to a benchmark) or in ‘absolute’ terms (if our focus is on the absolute monetary movement in the value of the portfolio). The latter also requires a specification of the currency or ‘numeraire’ in which the value of the portfolio is expressed; see e.g., Section 7.3.4. In the asset management community, VaR would usually, although not always, be understood in a relative context, because fund managers are usually given a benchmark to beat. However in the banking community, VaR would usually, although not always, be understood in an absolute context, as it is more common there to focus on absolute quantum of loss or capital required.
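To make the construction of such a plot concrete, the following Python sketch builds the coordinates of a QQ-plot from a return series; the simulated Student-t data that drives it is purely illustrative and is not one of the index series discussed in the next section.

```python
import numpy as np
from scipy.stats import norm

def qq_points(returns):
    """Coordinates for a QQ-plot of standardised (log) returns.

    The observed returns are standardised by their sample mean and standard
    deviation and sorted; the 'expected' values are standard Normal quantiles
    at the plotting positions (i - 0.5)/n, so a Normally distributed series
    plots (approximately) on the line y = x.
    """
    r = np.sort(np.asarray(returns, dtype=float))
    observed = (r - r.mean()) / r.std(ddof=1)
    n = len(observed)
    expected = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return expected, observed

# Purely illustrative fat-tailed (log) return series
log_returns = np.random.default_rng(1).standard_t(df=4, size=2500) * 0.01
x, y = qq_points(log_returns)

# A fat lower tail shows up as observed quantiles lying below the expected ones
print(x[:3])   # expected standardised quantiles in the lower tail
print(y[:3])   # observed standardised quantiles in the lower tail
```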
Incidentally, the other forward-looking risk measure one often comes across in the financial community, namely (ex-ante) tracking error,9 can also be inferred from such plots, because it drives the scale used to plot the x-axis.

Are quantile–quantile plots the best possible visualisation approach to adopt for this purpose? This question is not an easy one to answer. Different types of chart implicitly assign different levels of importance, i.e., weights, to different ways in which distributional forms might differ. Our eyes will implicitly deem large displacements within a given chart type of greater ‘importance’ than small or difficult to spot displacements. Our eyes typically process such data as if distance along a chart scale is the correct way to distinguish things of note. Thus the ‘scale’ characteristics of the chart methodology directly feed through to the importance that we will implicitly give to different aspects of the distributional form.

QQ-plots give much greater visual weight to outliers than probability density or cumulative density plots. However, this could result in them giving too much weight to what ultimately are still unlikely occurrences. From a return-seeking perspective, the mean drift of the return series may be the feature that is of most importance. Focusing too much on tail behaviour may distract from more important matters, especially because not all fat tails are bad in this context, only downside fat tails. Upside fat tails typically correspond to particularly favourable outcomes!

Conversely, even QQ-plots might not give sufficient weight to downside risk in all circumstances. As noted above, quantiles are closely associated with Value-at-Risk. However, this type of risk measure can be criticised because it effectively ascribes the same downside whatever happens if a given cut-off threshold is met, even though the worse the loss the greater is the downside for the party who ultimately bears this loss. Kemp (2009) points out that this makes Value-at-Risk an intrinsically shareholder focused risk measure for financial firms. Shareholders are largely indifferent to how spectacularly a financial firm might default. Given the limited liability structure adopted by almost all modern financial firms, once a firm has defaulted its shareholders have largely lost whatever economic interest they might previously have had in the firm. Conversely, regulators, customers and governments may have a greater interest in ‘beyond default’ tail risk, if they are the parties that bear the losses beyond this cut-off. For such parties, risk measures such as tail VaR (TVaR) – also called conditional VaR (CVaR) or Expected Shortfall – i.e., the expected loss conditional on the loss being worse than a certain level, could potentially be more appropriate risk measures to use.10 There are also more technical reasons why TVaR might be preferable to VaR.11
9 The (ex-ante) tracking error is the expected future standard deviation of returns (or relative returns). Strictly speaking, it requires a specification of a time frame, because, as we shall see in Section 2.3.4, return series may behave differently depending on the time frequency used when analysing them. However, commonly, a scaling in line with the square root of time might be assumed.
10 Tail Value-at-Risk corresponds to the average quantum of loss conditional on the loss being greater than the VaR. If VaR is being measured at a quantile y, then VaR(y) = k where \int_{-\infty}^{-k} p(x)\,dx = y and TVaR(y) = -(1/y)\int_{-\infty}^{-k} x\,p(x)\,dx, if the pdf is p(x) and x is suitably defined. Writers use TVaR and CVaR largely interchangeably, usually with the same loss trigger as the quantile level that would otherwise be applicable if the focus was on VaR. Occasionally, TVaR and/or CVaR are differentiated, with one being expressed in terms of the loss beyond the VaR rather than below zero. However, such a definition inherits some of the technical weaknesses attributable to VaR, see below. Expected Shortfall has a similar meaning, but might use a trigger level set more generically, e.g., it might include all returns below some level that corresponds to an actuarial ‘shortfall’ rather than below zero as is used in the above definition.
11 For example, VaR is not a coherent risk measure, whereas TVaR/CVaR is. A coherent risk measure is one that satisfies the technical properties of monotonicity, sub-additivity, homogeneity and translational invariance; see, e.g., Artzner et al. (1999) or Kemp (2010). See also Section 7.11.5.
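The difference between VaR and TVaR is easy to see on simulated data. The sketch below estimates both empirically from a return sample, reporting losses as positive numbers; the fat-tailed (Student-t) and Normal samples, and the 99% confidence level, are simply illustrative assumptions.

```python
import numpy as np

def var_and_tvar(returns, confidence=0.99):
    """Empirical VaR and TVaR (expected shortfall) at the given confidence level.

    Losses are reported as positive numbers; TVaR is the average loss
    conditional on the loss being at least as bad as the VaR threshold.
    """
    r = np.asarray(returns, dtype=float)
    q = np.quantile(r, 1 - confidence)    # e.g. the worst 1% return threshold
    var = -q
    tvar = -r[r <= q].mean()
    return var, tvar

rng = np.random.default_rng(42)
fat_tailed = rng.standard_t(df=3, size=100_000) * 0.01       # illustrative fat-tailed returns
normal = rng.normal(0.0, fat_tailed.std(), size=100_000)     # Normal with matching volatility

for label, r in [("fat-tailed", fat_tailed), ("Normal", normal)]:
    var, tvar = var_and_tvar(r, 0.99)
    print(f"{label:>10}: 99% VaR = {var:.3%}, 99% TVaR = {tvar:.3%}")
```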
Principle P3: The ways in which we visualise data will influence the importance that we place on different characteristics associated with this data. To analyse extreme events, it helps to use methodologies such as quantile–quantile plots that highlight such occurrences. However, we should be aware that they can at times encourage us to focus too much on fat-tailed behaviour, and at other times to focus too little on it.

The practical impact of such subtleties can depend heavily on the behaviour of the distribution in the far tail. This takes us into so-called ‘extreme value theory’ (see Section 2.9). To visualise such downsides, we could use a visualisation approach similar to QQ-plots, but with the y-axis showing (downside) tail-VaRs rather than quantiles (i.e., VaRs). Figure 2.6 shows such an analysis for the same example fat-tailed distribution we analysed in Figures 2.2–2.5. In the downside tail, this chart appears qualitatively (i.e., asymptotically) similar to the shape of the corresponding quantile–quantile plot shown in Figure 2.5. This is true for any probability distribution where the pdf falls off steeply enough in the tail. For these sorts of probability distributions, most observations further into the tail than a given quantile point are not ‘much’ further into the tail.

However, this similarity is not the case for all probability distributions. In Section 2.8 we will discuss stable distributions. Some of these are so heavy-tailed that their TVaRs are infinite! More practically, we might be analysing possible movements in the goodwill element in a firm value calculation. These may decline catastrophically in the event of a firm defaulting.
[Figure 2.6 Illustrative TVaR versus quantile plot, showing TVaR for the Normal distribution, the example fat-tailed distribution and two ‘Normal + Discontinuity’ cases; x-axis: expected (log) return, if Normally distributed; y-axis: observed (log) return. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
The behaviour of the TVaR might then be more akin to one of the other two lines shown in Figure 2.6, which assume that there is a small probability of a sudden decline in value (but no further decline in worse situations), with smaller declines and rises following a Normal distribution.12

2.3.3 Behaviour of individual bonds and bond indices

Certain types of instrument, including many types of bonds, naturally exhibit fat-tailed characteristics. High quality bonds typically rarely default. When they do, however, losses on them can be very substantial. A typical market assumption is that the average recovery, i.e., the average payment as a proportion of par value on a bond that has defaulted, is roughly 40%. This does not mean that the average market value decline shortly before default is circa 60%; an impaired bond often trades at well below par value. Even so, market declines as and when a bond does actually default can be substantial, particularly if the default was not widely expected prior to it actually happening.

We might initially expect well diversified bond indices to exhibit less fat-tailed behaviour than individual bonds (particularly individual bonds where there is a material likelihood of default). This is indeed normally the case, at least for corporate bond indices. However, in practice there are other factors that come into play at a bond index level.13 Many common bond indices relate to the bonds issued by a single government. Inclusion of many different bonds within them does not therefore diversify away issuer risk (to the extent that there is deemed to be any such risk). The behaviour of these types of bonds is more normally viewed as driven principally by yield curve dynamics. Factors influencing these dynamics include supply and demand for the relevant government’s bonds and the perceived trade-off between consumption and saving through time.

Corporate bond indices may be more ‘diversified’ in terms of default risk, but they are still exposed to overall yield curve dynamics. They are also exposed to a systematic (or maybe a systemic14) factor corresponding to the ‘risk appetite’ that investors might at any point in time have for carrying the risk that defaults will turn out higher or lower than expected (a factor that can become particularly important in times of stress, as we saw during the 2007–09 credit crisis).15

2.3.4 Behaviour of equity indices

The returns delivered by individual equities, particularly of companies that are smaller or less well diversified in terms of business coverage, might also be intrinsically expected to exhibit fat tails. From time to time they may be hit by a seriously adverse effect, e.g., a fraud, or alternatively they may come across a goldmine.

12 We may also have little idea how large the decline might be in such a scenario or precisely how far into the tail it might be. This is a form of model risk, which we will explore further in Chapters 8 and 9.
13 An example of such a factor is the economic ‘cycle’. Default rates can vary quite materially depending on economic conditions. This is an example of lack of time homogeneity (see Section 2.7.3).
14 Both ‘systematic’ and ‘systemic’ risks involve industry-wide (or even economy-wide) exposures. Usually the term ‘systemic’ risk is applied to a particular subset of systematic risks that show up rarely but when they do they cause havoc to the entire financial or economic system. Thus major financial crises are often described as ‘systemic’ problems, particularly if they affect large parts of the economy rather than being limited to narrow sectors. See, e.g., Besar et al. (2009) for a further exploration of what types of financial services risks might be ‘systemic’. In contrast, ‘systematic’ risks might more usually describe exposures that a security has to ongoing market-wide risk exposures; see, e.g., Section 5.3.4.
15 As explained in Kemp (2009), commentators also postulate that some element of the return on a corporate bond may reflect its liquidity characteristics. This also has a market-wide dimension, because aggregate (corporate bond) market liquidity can rise and fall through time.
The shape of the tails of the return distribution will be heavily dependent on the likelihood of such unusual events occurring, on the transformational impact that any such event might have on the firm’s fortunes and on investor response to such events (and the perceived likelihood of them repeating).

There is perhaps less intrinsic reason to postulate strong asymmetric return characteristics for diversified baskets of equities, such as typical equity market indices. We might naively expect company specific one-off events such as those referred to in the previous paragraph to be largely diversified away. Indeed, this is one of the tenets of traditional portfolio construction theory as epitomised by the Capital Asset Pricing Model (see Section 5.3.4). Markets as a whole, however, often still appear to exhibit fat tails, in either or both directions. For example, the main equity market indices are typically still considered to exhibit fat-tailed behaviour – witness the October 1987 market crash. Even in more recent (‘normal’?) times mostly prior to the 2007–09 credit crisis, there is evidence of fat-tail behaviour.

In Figures 2.7–2.9 we plot the tail characteristics of monthly, weekly and daily (logged) returns on the FTSE All-Share Index (in GBP), the S&P 500 Index (in USD), the FTSE-W Europe (Ex UK) Index (in EUR) and the Topix Index (in JPY) for the period from end June 1994 to end December 2007. These figures are quantile–quantile plots as per Figure 2.5. On the horizontal axis is shown the corresponding (sorted) size of movement expected were the (logged) returns to be Normally distributed (with mean and standard deviation in line with their observed values). The returns for each index have been scaled in line with their observed means and standard deviations, so that the comparator line is the same for each index.
[Figure 2.7 QQ-plots of monthly returns on various major equity market indices (FTSE All-Share, S&P 500 Composite, FTSE W Europe Ex UK, Tokyo SE (Topix)) from end June 1994 to end December 2007; x-axis: expected standardised (logged) return (sorted); y-axis: observed standardised (logged) return (sorted). Source: Kemp (2009), Thomson Datastream. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd]
[Figure 2.8 QQ-plots of weekly returns on various major equity market indices from end June 1994 to end December 2007; indices and axes as per Figure 2.7. Source: Kemp (2009), Thomson Datastream. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd]
[Figure 2.9 QQ-plots of daily returns on various major equity market indices from end June 1994 to end December 2007; indices and axes as per Figure 2.7. Source: Kemp (2009), Thomson Datastream. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd]
[Figure 2.10 QQ-plots of daily, weekly and monthly returns on FTSE All-Share Index from end June 1994 to end December 2007; x-axis: expected standardised (logged) return (sorted); y-axis: observed standardised (logged) return (sorted). Source: Nematrian, Thomson Datastream. © Nematrian. Reproduced by permission of Nematrian]
Whether and to what extent returns are fat-tailed seems to depend on the time-scale over which each return is measured. For daily data, all four indices analysed appear to exhibit fat tails on both upside and downside, but there is less evidence of upside fat tails in monthly data.

Commentators commonly assert, merely by reference to such charts, that higher frequency return data (e.g., daily data) is more fat-tailed than lower frequency return data (e.g., monthly data). Although this does indeed seem to be the case, visual inspection merely of charts such as Figures 2.7–2.9 arguably overstates this phenomenon. Some of the visual difference in tail behaviour in these three figures is because the more frequent the data is, the further into the tail the observations go. Better, if our focus is on understanding sensitivity to data frequency, is to plot on the same chart QQ-plots relating to the same index but with differing return frequencies (see Figures 2.10–2.13). These comparisons suggest that (over the period analysed) daily data was not much more fat-tailed than weekly data for FTSE W Europe Ex UK (both upside and downside), for S&P 500 Composite (downside) or for FTSE All-Share (upside), although again Tokyo SE (Topix) stands out as having somewhat different characteristics.

If all the daily returns were independent of each other then the fat-tailed behaviour in weekly and particularly monthly returns should be less noticeable than in daily data (if the Central Limit Theorem applies, see Section 2.5). This suggests that for some of the indices being analysed there is some persistence in fat-tailed behaviour through time. We might, for example, conclude that days on which a large movement occurs (whether positive or negative) tend to be followed by days where other large movements also occur, a phenomenon that is referred to as heteroscedasticity (see Section 2.7.3).
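The dependence of measured fat-tailed behaviour on return frequency can be illustrated with simulated data. The sketch below is not based on the index data used in the figures: it generates daily log returns with persistent (GARCH-style) volatility, using parameter values that are simply plausible assumptions, aggregates them into non-overlapping weekly and monthly returns, and compares the resulting excess kurtosis.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(7)

# Illustrative daily log returns with volatility clustering (simple GARCH(1,1)-style loop)
n = 252 * 13
eps = rng.standard_normal(n)
h = np.empty(n)      # conditional variance
r = np.empty(n)      # daily log return
h[0] = 5e-5
for t in range(n):
    if t > 0:
        h[t] = 1e-6 + 0.08 * r[t - 1] ** 2 + 0.90 * h[t - 1]
    r[t] = np.sqrt(h[t]) * eps[t]

# Aggregate daily log returns into non-overlapping weekly (5-day) and monthly (21-day) returns
weekly = r[: n - n % 5].reshape(-1, 5).sum(axis=1)
monthly = r[: n - n % 21].reshape(-1, 21).sum(axis=1)

for label, x in [("daily", r), ("weekly", weekly), ("monthly", monthly)]:
    print(f"{label:>8}: excess kurtosis = {kurtosis(x):.2f}")
# With persistent volatility, excess kurtosis typically declines with aggregation,
# but more slowly than it would if the daily returns were independent.
```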
[Figure 2.11 QQ-plots of daily, weekly and monthly returns on S&P 500 Composite Index from end June 1994 to end December 2007; axes as per Figure 2.10. Source: Nematrian, Thomson Datastream. © Nematrian. Reproduced by permission of Nematrian]

[Figure 2.12 QQ-plots of daily, weekly and monthly returns on FTSE W Europe Ex UK Index from end June 1994 to end December 2007; axes as per Figure 2.10. Source: Nematrian, Thomson Datastream. © Nematrian. Reproduced by permission of Nematrian]
[Figure 2.13 QQ-plots of daily, weekly and monthly returns on Tokyo SE (Topix) from end June 1994 to end December 2007; axes as per Figure 2.10. Source: Nematrian, Thomson Datastream. © Nematrian. Reproduced by permission of Nematrian]
2.3.5 Currencies and other asset types

Currencies can be viewed as possessing some of the behavioural attributes of both equities and bonds and therefore some of their corresponding exposures to extreme events. This is in part because currencies may be pegged or freely floating, or shades in between.

At one extreme, currencies can, in effect, be completely pegged at an absolutely fixed rate, with essentially no likelihood of the pegging coming unstuck. Examples might be the legacy currencies that merged to form the Euro block. There are set conversion rates still applicable to contracts that refer to these legacy currencies (which are used to convert payments expressed in them into Euros). It is possible that countries might leave the Euro-zone and begin to issue their own currencies again, but even in such circumstances there is no particular reason why they would restart issuing their old legacy currencies; they might instead launch new currencies.

More usually, when people refer to pegged currencies they mean currencies that are still being used as legal tender but where there is an explicit government-sponsored mechanism that tends to keep the currency closely aligned to another currency. Often this takes the form of a link to the US dollar, given that it is currently the world’s main ‘reserve’ currency. This might, for example, take the form of a government-sponsored Currency Board, which issues only a certain amount of the first currency and at the same time itself holds an equivalent amount of the reserve currency. But how ‘explicit’ does such government promotion of a fixed exchange rate need to be for the currency to be ‘pegged’? This is the bread and butter of currency traders!
At the other extreme, since the break-down of the Bretton Woods agreement, many of the world’s most important currencies have largely been allowed to float relatively freely by their governments, i.e., with prices in the main set by the international currency markets. However, even major developed country governments can and do intervene from time to time in currency markets to prop up or to keep down the prices of their own (or other) currencies. Often, decisions to provide such support will be taken with a wider economic perspective in mind; the market clearing level for a currency may not be deemed attractive by its government (or by other governments) for a variety of economic reasons.

Principle P4: Most financial markets seem to exhibit fat-tailed returns for data sampled over sufficiently short time periods, i.e., extreme outcomes seem to occur more frequently than would be expected were returns to be coming from a (log-) Normal distribution. This is true both for asset types that might intrinsically be expected to exhibit fat-tailed behaviour (e.g., some types of bond, given the large market value declines that can be expected to occur if the issuer of the bond defaults) and for asset types, like equity indices, where there is less intrinsic reason to postulate strong fat-tailed return characteristics.
2.4 CHARACTERISING FAT-TAILED DISTRIBUTIONS BY THEIR MOMENTS

2.4.1 Introduction

Several of the return series described in the previous section have quantile–quantile plots that appear visually to deviate significantly from (log-) Normality. However, visual appearances might be deceptive. The plots shown in the previous section relate to samples which we can view as coming from some (unknown) underlying distributions, and will therefore be subject to sampling error. Before concluding that the return series do actually deviate materially from (log-) Normality, we should ideally identify more rigorous statistical methodologies for clarifying whether any observed deviation really is significant enough for an assumption of Normality to be unlikely to be correct.

Sampling error arises because a finite sample of data drawn from a distribution only imperfectly corresponds to the underlying distribution. We might therefore test for sampling error by carrying out Monte Carlo simulations,16 repeatedly drawing at random samples of n different observations from the distribution we intrinsically ‘expect’ the returns to be coming from, where n is the number of observations in the original observation set. If the actually observed QQ-plot appears to be within the typical envelope of QQ-plots exhibited by these random samples, then we might conclude that the observed behaviour is merely an artefact of sampling error.

In Figure 2.14 we illustrate how this approach can be developed by illustrating the QQ-plots of four such randomly drawn samples, here drawn from a Normal distribution and with the same number of observations as the daily data referred to in earlier charts.
16 Monte Carlo simulation techniques are explained further in Section 6.11.
[Figure 2.14 QQ-plots of four Monte Carlo simulations of daily return data with samples drawn from a Normal distribution; axes as per Figure 2.10. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
The QQ-plots of individual random samples exhibit some jagged behaviour in the tails, which reflects the few observations that are present in these parts of the distribution. In practice, we might carry out many thousands of simulations and identify the level of deviation from Normality (for a given quantile point) exceeded only by a small fraction of simulations (say α of them). This is illustrated in Figure 2.15, in which we show (for each quantile point) the lines above and below which 0.5% of simulations lie, based on a Monte Carlo simulation involving 1000 randomly drawn samples. The substantial visual difference between Figure 2.9 and either of Figure 2.14 or Figure 2.15 indicates that deviations from Normality as large as those observed in practice with daily equity index return series are highly unlikely to have arisen purely by chance.17

More usually, we identify a relatively straightforward statistical measure that can then be used within a hypothesis test, adopting a null hypothesis that the (log) returns are actually coming from a Normal distribution. The null hypothesis would be deemed rejected if the probability of the null hypothesis being true is sufficiently small18 (see, e.g., Section 2.4.619).

17 This particular example illustrates a more general feature of Monte Carlo simulation. Because of its conceptual simplicity, researchers can often be tempted to focus on it to the exclusion of other possibly more helpful techniques (see Section 6.2.5). In this particular instance, the results shown in Figure 2.15 can be derived more quickly and accurately using analytical techniques; see Kemp (2010).
18 Some variation of this approach is needed if we are carrying out many different hypothesis tests simultaneously. For example, suppose that we set our confidence level for rejecting the null hypothesis at 1 in 20 (i.e., 5%). Suppose that we are also testing 100 such hypotheses. Then we would expect on average 5% of such tests to result in rejections even if the null hypotheses were all true. We can no longer conclude that the five cases where we would have in isolation rejected their null hypotheses are in fact statistically ‘significant’. Instead, we need, in some sense, to treat the entire series of tests as a whole, to avoid drawing unwarranted conclusions.
19 The reason that straightforward hypothesis tests such as these are generally used (if available) in preference to Monte Carlo simulation is primarily one of speed of computation and ease of interpretation of the answers. Monte Carlo simulation approaches to analysis of sampling error can also be formulated in the form of hypothesis tests, but they are often much more time consuming to carry out. One element of the problem that they retain when applied to QQ-plots, which conventional (moment-based) hypothesis tests potentially miss, is the sense that the QQ-plot is a line and so can wiggle in many places. We may, for example, be particularly interested in deviation from Normality only in a certain segment of the overall QQ-plot. We explore this point further in Section 2.4.5.
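A minimal sketch of how such a Monte Carlo envelope might be computed is given below. The number of observations, the number of simulations and the 0.5%/99.5% percentile levels are illustrative choices mirroring Figure 2.15, not a prescription.

```python
import numpy as np
from scipy.stats import norm

def qq_envelope(n_obs, n_sims=1000, lower=0.005, upper=0.995, seed=0):
    """Pointwise Monte Carlo envelope for the QQ-plot of n_obs Normal draws.

    Each simulation draws n_obs standard Normal observations, standardises them
    by their own sample mean and standard deviation and sorts them; the envelope
    is the (lower, upper) quantile of the sorted values at each plotting position.
    """
    rng = np.random.default_rng(seed)
    sims = rng.standard_normal((n_sims, n_obs))
    sims = (sims - sims.mean(axis=1, keepdims=True)) / sims.std(axis=1, ddof=1, keepdims=True)
    sims.sort(axis=1)
    expected = norm.ppf((np.arange(1, n_obs + 1) - 0.5) / n_obs)
    lo = np.quantile(sims, lower, axis=0)
    hi = np.quantile(sims, upper, axis=0)
    return expected, lo, hi

x, lo, hi = qq_envelope(n_obs=3500)
print(x[0], lo[0], hi[0])   # envelope for the most extreme lower-tail observation
```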
[Figure 2.15 Envelopes of QQ-plots (at the 0.5% and 99.5% percentiles) of 1000 Monte Carlo simulations of daily return data with samples drawn from a Normal distribution; axes as per Figure 2.10. Source: Nematrian. © Nematrian. Reproduced by permission of Nematrian]
The most common such approach in this context is to use tests based on the moments of the distribution, i.e., E(Xⁿ), where here and elsewhere in the book, E(f(Y)) is the expected value of a function f(Y) of a random variable Y, i.e., for a continuous valued random variable with probability density p(y):

E(f(Y)) = \int_{-\infty}^{\infty} f(y)\,p(y)\,dy    (2.1)
In this section we explore the main approaches that are adopted, and some of the issues that arise because usually we are particularly interested principally in deviations from Normality in the tail of the distribution rather than in the generality of the distributional form.

2.4.2 Skew and kurtosis

Any Normal distribution is completely characterised by its mean and standard deviation, which in effect correspond to its first two moments. A common way of measuring deviation from Normality is thus to calculate the higher order moments of the observed distribution,
particularly the third and fourth moment. These correspond to the skewness and (excess) kurtosis20 of a distribution respectively (both of which are conveniently pre-canned functions within Microsoft Excel). Skewness is a measure of the asymmetry of a distribution; i.e., how different it would appear versus its mirror image where downside is replaced by upside and vice versa. (Excess) kurtosis is more directly related to fat-tailed behaviour, but does not differentiate between fat-tailed behaviour on the downside and fat-tailed behaviour on the upside.

The (equally weighted) sample skewness and kurtosis of a series of n observations are calculated as follows, where the observations are X_1, ..., X_n and \hat{\mu} and \hat{\sigma} are the sample mean and sample standard deviation of the observations, i.e., \hat{\mu} = \sum_{i=1}^{n} X_i / n and \hat{\sigma}^2 = \sum_{i=1}^{n} (X_i - \hat{\mu})^2 / (n - 1):

skew = \gamma_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \frac{(X_i - \hat{\mu})^3}{\hat{\sigma}^3}    (2.2)

kurtosis = \gamma_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \frac{(X_i - \hat{\mu})^4}{\hat{\sigma}^4} - \frac{3(n-1)^2}{(n-2)(n-3)}    (2.3)

In the case where n is large, these simplify to the following, where \mu = \sum_{i=1}^{n} X_i / n and \sigma^2 = \sum_{i=1}^{n} (X_i - \mu)^2 / n (\mu and \sigma are often then called ‘population’ rather than ‘sample’ means and standard deviations):

\gamma_1 \approx \frac{1}{n} \sum_{i=1}^{n} \frac{(X_i - \mu)^3}{\sigma^3}    (2.4)

\gamma_2 \approx \frac{1}{n} \sum_{i=1}^{n} \frac{(X_i - \mu)^4}{\sigma^4} - 3    (2.5)
Corresponding formulae where different weights are given to different observations are set out in Kemp (2010). The skew and (excess) kurtosis of a Normal distribution are both zero (this is one of the reasons for including the 3 in the definition of excess kurtosis). Although it is possible for a distribution to be non-Normal and still exhibit zero skew and kurtosis, such distributional forms are not often observed in practice. Those for the data referred to in Section 2.3.4 are summarised in Table 2.1.
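For reference, a direct Python implementation of equations (2.2) and (2.3) might look as follows; the simulated Student-t data in the usage example is purely illustrative. The results should match the bias-corrected figures produced by Excel's SKEW and KURT functions.

```python
import numpy as np

def sample_skew(x):
    """Sample skewness per equation (2.2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)

def sample_excess_kurtosis(x):
    """Sample (excess) kurtosis per equation (2.3)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(z ** 4)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Illustrative use on simulated fat-tailed (log) returns
r = np.random.default_rng(3).standard_t(df=5, size=3500) * 0.01
print(sample_skew(r), sample_excess_kurtosis(r))
```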
2.4.3 The (fourth-moment) Cornish-Fisher approach

A common way within risk management circles of deriving the shape of an entire distributional form merely from its moments uses the Cornish-Fisher asymptotic expansion; see, e.g., Abramowitz and Stegun (1970), Kemp (2009) or Kemp (2010). Most commonly the focus is on the fourth-moment version of this expansion, because it merely uses moments up to and including kurtosis.

20 A distribution’s kurtosis is sometimes referred to in older texts as its ‘excess’ kurtosis (with its ‘base’ kurtosis then being 3 plus its excess kurtosis).
Table 2.1 Skew and (excess) kurtosis for several mainstream equity (log) return series from end June 1994 to end December 2007

                                    Skew                    (excess) kurtosis
                      currency   monthly  weekly  daily   monthly  weekly  daily
FTSE All-Share          GBP        –1.0    –0.4   –0.3      1.6     2.0     3.2
S&P 500                 USD        –0.8    –0.5   –0.1      1.3     3.0     3.8
FT World Eur Ex UK      EUR        –1.0    –0.2   –0.3      2.2     3.1     3.6
Topix                   JPY         0.0    –0.1   –0.1     –0.2     0.5     2.5

Source: Threadneedle, Thomson Datastream
In effect, the fourth-moment Cornish-Fisher approach aims to provide a reliable estimate of the distribution’s entire quantile–quantile plot merely from the first four moments of the distribution, i.e., its mean, standard deviation, skew and kurtosis. It involves estimating the shape of a quantile–quantile plot by the following cubic, where \gamma_1 is the skew and \gamma_2 is the kurtosis of the distribution:

y(x) = \mu + \sigma\left(x + \frac{\gamma_1(x^2 - 1)}{6} + \frac{3\gamma_2(x^3 - 3x) - 2\gamma_1^2(2x^3 - 5x)}{72}\right)    (2.6)

For standardised returns (with \mu = 0 and \sigma = 1), this simplifies to the following:

y(x) = x + \frac{\gamma_1(x^2 - 1)}{6} + \frac{3\gamma_2(x^3 - 3x) - 2\gamma_1^2(2x^3 - 5x)}{72}    (2.7)

In Chapter 4, we will also be interested in the situation where skewness is ignored; Equation (2.7) then further simplifies to:

y(x) = x + \frac{\gamma_2(x^3 - 3x)}{24}    (2.8)
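A direct implementation of equation (2.6) (of which (2.7) and (2.8) are special cases) might look as follows; the mean, standard deviation, skew and kurtosis figures in the usage example are illustrative assumptions rather than fitted values.

```python
import numpy as np
from scipy.stats import norm

def cornish_fisher_quantile(x, mu=0.0, sigma=1.0, skew=0.0, kurt=0.0):
    """Fourth-moment Cornish-Fisher estimate of the QQ-plot, equation (2.6).

    x is the quantile of a standard Normal distribution; skew and kurt are the
    skewness and (excess) kurtosis of the target distribution.
    """
    x = np.asarray(x, dtype=float)
    adj = (x
           + skew * (x ** 2 - 1) / 6
           + (3 * kurt * (x ** 3 - 3 * x) - 2 * skew ** 2 * (2 * x ** 3 - 5 * x)) / 72)
    return mu + sigma * adj

# Illustrative use: estimated 1%-quantile daily return given assumed moments
z = norm.ppf(0.01)
print(cornish_fisher_quantile(z, mu=0.0003, sigma=0.01, skew=-0.3, kurt=3.2))
```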
2.4.4 Weaknesses of the Cornish-Fisher approach

Unfortunately, as noted in Kemp (2009), the fourth-moment Cornish-Fisher approach does not appear to give a visually satisfying fit in the tails of quantile–quantile plots for the mainstream equity index return distributions considered above, particularly for more frequent data. Figure 2.16 repeats Figure 2.9 for just the FTSE All-Share Index but now includes the relevant (fourth-moment) Cornish-Fisher estimate of the quantile–quantile plot. It also includes an alternative method of estimating the quantile–quantile plot that arguably fits the distributional form better in the tails (see Section 2.4.5).

The reason for this poor fit in the tails is explored in Kemp (2009), where it is noted that skew and kurtosis are ‘parametric’ statistics and give equal weight to every observation. However, most observations are in the centre of a distribution rather than in its tails.21

21 This can be seen from Figure 2.1. The vast majority of outcomes are in the middle of the distribution, especially for a distribution that is more strongly peaked than the Normal distribution. Consider also the proportion of observations that are in the tails of a distribution. For example, only approximately 1 in 1.7 million observations from a Normal distribution should be further away from the (sample) mean than 5 standard deviations. Each one in isolation might on average contribute at least 625 times as much to the computation of kurtosis as an observation that is just one standard deviation away from the (sample) mean (since 5⁴ = 625), but because there are so few observations this far into the tail, they in aggregate have little impact on the overall kurtosis of the distribution.
[Figure 2.16 Fitting the distributional form for daily returns on FTSE All-Share Index from end June 1994 to end Dec 2007, showing the observed QQ-plot, the Cornish-Fisher approximation (incorporating skew and kurtosis) and a fitted cubic (weighted by average distance between points); axes as per Figure 2.10. Source: Kemp (2009), Thomson Datastream. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd]
Thus parametric statistics arguably give undue weight to behaviour in the central part of the distribution and too little weight to behaviour in the tails. In effect, the Cornish-Fisher approach provides a good guide to the extent to which the distribution is akin to a fat-tailed one in its centre, rather than akin to one in its tails.

A corollary is that over-reliance on use of just skew and kurtosis to characterise fat-tailed behaviour may also be inappropriate. It can be argued that the intrinsic justification for using these measures for this purpose is because one can then extrapolate from them to characterise the nature of the distributional form. If this really was the case then the fourth-moment Cornish-Fisher adjustment should be the relevant way of implementing this extrapolation process.

2.4.5 Improving on the Cornish-Fisher approach

Instead of giving equal weight to each data point, we could fit the quantile–quantile curve directly, giving greater weight to observations in the part of the ranked sample in which we are most interested. A simple example would be to use weights that corresponded to the distance between neighbouring expected (logged) returns (i.e., giving less weight to the observations bunched in the centre of the distribution). If we fit a cubic using a least squares methodology with this
weighting approach22 then the fit in the tails becomes visually much more appealing than with the Cornish-Fisher approach (see Figure 2.16). This is at the expense of a not quite so good fit towards the middle of the distribution. We could also use weights that placed particular focus on sub-elements of the observed distributional form, e.g., only ones between µ − 3σ and µ − 2σ, if that part of the distribution was of particular interest to us.

Principle P5: Skewness and kurtosis are tools commonly used to assess the extent of fat-tailed behaviour. However, they are not particularly good tools for doing so when the focus is on behaviour in the distribution extremities, because they do not necessarily give appropriate weight to behaviour there. Modelling distribution extremities using the fourth-moment Cornish-Fisher approach (an approach common in some parts of the financial services industry that explicitly refers to these statistics and arguably provides the intrinsic statistical rationale for their use in modelling fat tails) is therefore also potentially flawed. A more robust approach may be to curve fit the quantile–quantile form more directly.
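A rough sketch of such a weighted fit is given below. It uses numpy's polynomial fitting rather than the generalised linear least squares regression of Section 4.6 referred to in footnote 22, it does not impose the non-decreasing constraint mentioned there, and the weighting scheme (the approximate gap between neighbouring expected quantiles) is just one of the possible choices discussed above.

```python
import numpy as np
from scipy.stats import norm

def fit_weighted_cubic_qq(returns):
    """Fit a cubic to the QQ-plot of standardised returns, weighting each point
    by the gap between neighbouring expected quantiles (so tail points matter more)."""
    y = np.sort(np.asarray(returns, dtype=float))
    y = (y - y.mean()) / y.std(ddof=1)
    n = len(y)
    x = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    w = np.gradient(x)        # approximate distance between neighbouring expected quantiles
    # np.polyfit minimises sum((w_i * residual_i)**2), so pass the square root of the desired weights
    coeffs = np.polyfit(x, y, deg=3, w=np.sqrt(w))
    return np.poly1d(coeffs)

# Illustrative use on simulated fat-tailed returns
r = np.random.default_rng(5).standard_t(df=4, size=3500)
fitted = fit_weighted_cubic_qq(r)
print(fitted(norm.ppf([0.001, 0.01, 0.99, 0.999])))   # fitted standardised tail quantiles
```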
2.4.6 Statistical tests for non-Normality

The most common tests for Normality refer to the distribution’s skewness and kurtosis. The sample skewness and sample kurtosis as defined in Section 2.4.2 have the following asymptotic forms for large n if the sample is drawn from a Normal distribution:

skew = \gamma_1 \sim N(0, 6/n)    (2.9)

kurtosis = \gamma_2 \sim N(0, 24/n)    (2.10)
In each case, we can either carry out a one-sided hypothesis test or a two-sided hypothesis test. With a one-sided test, we would reject the null hypothesis if, say, γ > α (for the given test statistic γ), where α is set by reference to the strength of the rejection criterion we wish to adopt. For example, if we were using a 1 in 20 (i.e., 5% level) rejection criterion then α would be set so that if the null hypothesis were true the probability that γ would be greater than α is 5%. This would be an upside one-sided test and would be used if a priori we were expecting deviation from Normality to show up with γ being larger than expected. If the opposite was true – i.e., a priori we expected deviation from Normality to show up with γ being smaller than expected – then we would use a downside one-sided test, rejecting the null hypothesis if γ < α for some suitably chosen α.

22 The simplest way of fitting a curve in this manner is to use generalised linear least squares regression as described in Section 4.6. Using the ‘expected’ quantile values, x_j, we create four series, f_{0,j} ≡ 1, f_{1,j} ≡ x_j, f_{2,j} ≡ x_j^2 and f_{3,j} ≡ x_j^3 and then we regress the observed values, y_j, against these four series simultaneously, giving different weights to the different observations. This involves finding the values of a_0, a_1, a_2 and a_3 that minimise \sum_j w_j \left(y_j - \left(a_0 f_{0,j} + a_1 f_{1,j} + a_2 f_{2,j} + a_3 f_{3,j}\right)\right)^2 where w_j are the weights we want to ascribe to the different observations (here the distances between neighbouring x_j). See Kemp (2010) for further details. However, it is also usually desirable for the fitted curve to be non-decreasing, because otherwise it would not correspond to a mathematically realisable QQ-plot.
Table 2.2 Results of tests for Normality for several mainstream equity (log) return series from end June 1994 to end December 2007

                              skew                  (excess) kurtosis          Jarque-Bera
                      mth       wk      day        mth     wk     day       mth     wk    day
FTSE All-Share        ≈0      5×10⁻⁵    ≈0         ≈0      ≈0     ≈0        ≈0      ≈0    ≈0
S&P 500             2×10⁻⁵      ≈0      ≈0         ≈0      ≈0     ≈0        ≈0      ≈0    ≈0
FT World Eur Ex UK    ≈0       0.005    ≈0         ≈0      ≈0     ≈0        ≈0      ≈0    ≈0
Topix                0.47      0.084   0.002       0.89    ≈0     ≈0        0.82    ≈0    ≈0

Note: tests quoted in terms of likelihoods of occurrence were data to be Normally distributed. Likelihoods less than 10⁻⁶ are shown by ‘≈0’.
Source: Nematrian, Thomson Datastream
With a two-sided hypothesis test we would reject the null hypothesis if either γ > α_1 or γ < α_2. If we were using the same 5% rejection criterion as before then we would normally choose α_1 and α_2 so that the probability under the null hypothesis of γ > α_1 and the probability of γ < α_2 were the same and both were one-half of 5%, i.e., 2.5%.

We can use both measures simultaneously and test for Normality using the Jarque-Bera (JB) test, the test statistic for which is (asymptotically) distributed as a chi-squared distribution with two degrees of freedom:

JB statistic = n\left(\frac{\gamma_1^2}{6} + \frac{\gamma_2^2}{24}\right) \sim \chi^2(2)    (2.11)
As this statistic depends merely on the squares of the two earlier statistics, it can be meaningfully applied only in the context of a one-sided hypothesis test.23

If n is not particularly large (i.e., is not large enough for us to be comfortable using the confidence intervals set by reference to the asymptotic distributions of these test statistics) then we could carry out Monte Carlo simulations to estimate the spread of these statistics under the null hypothesis; see, e.g., Kemp (2010).

In Table 2.2 we use these statistics to test for non-Normality for each of the data series and frequencies shown in Table 2.1. The figures show the approximate likelihood of the test statistic in question exceeding its observed value, bearing in mind the number of observations in each series. As noted above, the choice between a one-sided and a two-sided confidence interval depends principally on the observer’s prior view about how returns might differ from Normality. With skewness, it is not immediately obvious (at least for these indices) whether we should expect (log) returns to exhibit positive or negative skewness. We might therefore naturally use two-sided tests in such a context. With kurtosis, there are rather firmer grounds for postulating a priori that (log) returns should exhibit positive (excess) kurtosis, and so (upside) one-sided tests might be preferred for them.

23 An exception is if you believe that the data might have been ‘fixed’ (deliberately or accidentally) in some way that would tend unduly to favour the relevant null hypothesis being true. If, for example, the JB statistic was ‘too’ close to zero (or the skew or kurtosis in isolation were too close to zero) then we might infer that there had been some standardisation of the data before it came to us that would have removed evidence of fat-tailed behaviour had it existed in the original (unadjusted) dataset.
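A compact Python sketch of these moment-based tests, using the asymptotic forms in equations (2.9)–(2.11) (two-sided for skewness, upside one-sided for kurtosis, as discussed above) and purely illustrative simulated data, might look as follows.

```python
import numpy as np
from scipy.stats import chi2, norm

def normality_tests(log_returns):
    """Approximate p-values for skew, (excess) kurtosis and Jarque-Bera tests,
    using the large-n asymptotic distributions in equations (2.9)-(2.11)."""
    x = np.asarray(log_returns, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()          # 'population' standardisation, as in (2.4)-(2.5)
    g1 = np.mean(z ** 3)                  # skew
    g2 = np.mean(z ** 4) - 3              # excess kurtosis
    p_skew = 2 * norm.sf(abs(g1) / np.sqrt(6 / n))   # two-sided test
    p_kurt = norm.sf(g2 / np.sqrt(24 / n))           # upside one-sided test
    jb = n * (g1 ** 2 / 6 + g2 ** 2 / 24)
    p_jb = chi2.sf(jb, df=2)
    return g1, g2, p_skew, p_kurt, p_jb

r = np.random.default_rng(11).standard_t(df=5, size=3500) * 0.01
print(normality_tests(np.log(1 + r)))
```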
When multiple data series are being simultaneously tested in this type of fashion, footnote 18 in Section 2.4.1 becomes relevant. Often the results would not then be expressed merely in the form of a series of binary outcomes, i.e., either rejecting the null hypothesis for that particular data series or not rejecting it. Instead, the results would often be presented in a form that gave the likelihood (under the null hypothesis) of the test statistic for the particular data series in question exceeding its observed value, together with a flag indicating whether this would then result in rejection or otherwise given the specified confidence criterion. Where such likelihoods are sufficiently close to 0 (or to either 0 or 1 for a two-sided confidence test) then we would reject that particular null hypothesis (for that particular data series). We use such a format in Table 2.2.

Nearly all of the test statistics appear highly significant. In all cases apart from monthly Topix returns, the (excess) kurtosis appears to be significantly greater than zero, suggesting that it is implausible to assume that these (log) returns are coming from a Normal distribution. Evidence for deviation from Normality in terms of skewness is also compelling except for Topix.

Other tests that can be used to test whether observed data appears to be coming from a Normal distribution include the Shapiro-Wilk test, the Anderson-Darling test, the Kolmogorov-Smirnov test and the Smirnov-Cramér-von-Mises test (see Kemp, 2010).24
2.4.7 Higher order moments and the Omega function Although it is most common to focus merely on the first four moments of a (univariate) distribution, sometimes the focus is on higher moments. Return series can occasionally be quite non-Normal but still have relatively modest or even zero skew and (excess) kurtosis. One way of visualising all the moments simultaneously is via the Omega function, popularised by Shadwick and Keating (2002). This is defined as follows, where F (r ) is the cumulative distribution function for the future return25 and a and b are the lower and upper bounds for the return distribution:
\Omega(y) = \frac{\int_y^b (1 - F(r))\,dr}{\int_a^y F(r)\,dr}    (2.12)
Shadwick and Keating (2002) proposed this function as a generalisation of various downside risk measures that others had proposed for performance measurement purposes.26 It generalises all such measures because it captures all the higher moments, it being possible to recover F
24 Readers should note that all these tests incorporate all data points equally, and thus potentially run into the same problem as the Cornish-Fisher asymptotic expansion. They may merely be telling us about apparent non-Normality in the generality of the distribution whereas we will often be most interested in non-Normality in the tails (particularly, if we are risk managers, in the downside tail). Visual inspection of the data as per Figures 2.5–2.7 suggests that in this case the relevant tails are also non-Normal. 25 As we have noted earlier, the form of F may depend on timescale. Shadwick and Keating (2002) originally proposed this function for past performance measurement purposes, i.e., it would in practice have been based on historic data, and so its relevance for what might happen in the future also requires assumptions about how useful the past is as a guide to the future. 26 Examples of such risk measures include those used in the Sortino ratio (see, e.g., Section 5.9.2) and the use of lower partial moments (see Section 7.9.3).
from Ω.27 A high value for the Omega function (for a given value of y) implies that there is more density of returns to the right of the threshold y than to the left and is thus to be preferred. Shadwick and Keating (2002) also proposed optimising portfolio construction by reference to it, rather than to standard deviation of returns as is usually done in traditional portfolio construction theory (see Chapter 5).
Reaction to the Omega function from some commentators has been mixed. For example, Kazemi, Schneeweis and Gupta (2003) argue that the Omega function is not significantly new in finance and can be represented as a call-put ratio as follows, where C(L) is essentially the price of a European call option written on the investment and P(L) is essentially the price of a European put option written on the investment:

\Omega(L) = \frac{C(L)}{P(L)}    (2.13)
This is because

\int_L^\infty (1 - F(r))\,dr = \int_L^\infty (r - L)\,f(r)\,dr = E\left(\max(r - L, 0)\right) = e^{r_f} C(L)    (2.14)

and

\int_{-\infty}^L F(r)\,dr = \int_{-\infty}^L (L - r)\,f(r)\,dr = E\left(\max(L - r, 0)\right) = e^{r_f} P(L)    (2.15)
where f(r) is the probability density function for the one-period rate of return on the investment and r_f is the corresponding risk-free rate.
Kazemi, Schneeweis and Gupta (2003) use this insight to present what they call a 'Sharpe-Omega' ratio (see Section 5.9.2). They argue that this ratio preserves all the features of the original Omega function but at the same time provides a more intuitive measure of risk that is similar to the Sharpe ratio of traditional portfolio construction theory.
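As an illustration of the above, the following sketch (not code from the book) computes an empirical Omega function directly from a sample of returns using the call-put representation in equations (2.14) and (2.15); the discounting terms cancel in the ratio, and the threshold values used are purely illustrative assumptions.

import numpy as np

def omega(returns, threshold):
    """Empirical Omega(threshold) = E[max(r - L, 0)] / E[max(L - r, 0)]."""
    r = np.asarray(returns, dtype=float)
    upside = np.mean(np.maximum(r - threshold, 0.0))    # proportional to C(L)
    downside = np.mean(np.maximum(threshold - r, 0.0))  # proportional to P(L)
    return upside / downside

# Illustration: a symmetric fat-tailed return series scores lower at a
# positive threshold than at the mean
rng = np.random.default_rng(0)
r = rng.standard_t(df=4, size=5000) * 0.01
print(omega(r, 0.0), omega(r, 0.005))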
2.5 WHAT CAUSES FAT TAILS?

2.5.1 Introduction

If return distributions are often fat-tailed, then a natural question to ask is why such behaviour seems so common. Given Principle P1, this raises the question of why, a priori, we might have assumed that such behaviour should be 'unexpected'.
In this section we introduce the Central Limit Theorem, which is the fundamental reason why a priori we might expect many investment variables to be distributed according to the Normal distribution. We also include a validation of its derivation, principally so that in Section 2.5.3 we can summarise all of the possible ways in which the Central Limit Theorem can break
27 This also means that use of the Omega function can be viewed as 'merely' another way of visualising the cdf, which means that Principle P3 becomes applicable, i.e., we need to be aware that its use might inadvertently lead us at times to focus too much on extreme events and at other times to focus too little on them.
down. Sections 2.6–2.8 then explore specific ways in which the Central Limit Theorem can break down. These include a lack of adequate diversification, a lack of time stationarity, the possibility that the underlying distributions are so fat-tailed that they do not possess some of the characteristics needed for the Central Limit Theorem to apply, and the impact of liquidity risk.
2.5.2 The Central Limit Theorem

The fundamental reason why we might 'intrinsically' expect diversified baskets of securities (and even in some circumstances individual securities) to have return series that are approximately (log-) Normal is the Central Limit Theorem.
The Central Limit Theorem as classically formulated relates merely to independent identically distributed (i.i.d.) random variables. It states that if X_1, X_2, ..., X_n is a sequence of n such random variables, each having a finite expectation µ and variance σ² > 0, then as n → ∞ the distribution of the sample average of these random variables approaches a N(µ, σ²/n) distribution, i.e., a Normal distribution with mean µ and variance σ²/n, irrespective of the shape of the original distribution. Equivalently, if X̄_n = (X_1 + ··· + X_n)/n is the sample mean then Z_n = √n(X̄_n − µ)/σ converges to the unit Normal distribution (i.e., the one with a mean of zero and standard deviation of unity) as n → ∞. Such convergence is often written as √n(X̄_n − µ)/σ →(D) N(0, 1). As is well known, if the X_i are Normally distributed then the distribution of √n(X̄_n − µ)/σ is already N(0, 1) for all n ≥ 1, and thus no further convergence is needed.
One way of validating this theorem uses characteristic functions. We introduce these functions here because they will prove useful later on. If a probability distribution relating to a real-valued random variable, X, has a probability density f_X(x) then its characteristic function, ϕ_X(t), is a complex-valued function that is usually28 defined as follows:
\varphi_X(t) \equiv E\left(e^{itX}\right) \equiv \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx    (2.16)
For example, the characteristic function of a Normal distribution, N(µ, σ²), is e^{itµ − σ²t²/2}. A more general definition is used if the random variable does not have a probability density function (e.g., if it is discrete). The concept can also be extended to multivariate distributions. Characteristic functions for a range of distributional forms are set out in Kemp (2010).
A probability distribution is completely described by its characteristic function,29 in the sense that any two random variables that have the same probability distribution will have the same
28 Some commentators use alternative definitions for characteristic functions, e.g., \varphi_X(t) \equiv \int_{-\infty}^{\infty} e^{2\pi itx} f_X(x)\,dx, to bring out more clearly the similarities with Fourier transforms, because the characteristic function is then seen to be the Fourier transform of the pdf. Here, and elsewhere in this book, e^x, also called exp(x), is the exponential function, i.e., e (= 2.718282...) raised to the power of x, and, where the context requires it, i is the square root of minus 1. If t and X are real then e^{itX} = cos(tX) + i sin(tX). The characteristic function ϕ_{X+Y}(t) of the sum of two independent random variables X + Y satisfies the relationship ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t). The characteristic function of any random variable, Y, with zero mean and unit variance has the form ϕ_Y(t) = 1 − t²/2 + o(t²) for small t, where o(x) means a function that goes to zero more rapidly than x as x → 0.
29 This complete characterisation of the distributional form is one reason such functions are called 'characteristic' functions.
characteristic function and vice versa (this is true for both univariate and multivariate distributions).
If we let Y_i = (X_i − µ)/σ then the characteristic function of Z_n = \sum_{i=1}^{n} Y_i/\sqrt{n} takes the following form, converging to the characteristic function of the unit Normal distribution30:

\varphi_{Z_n}(t) = \left(\varphi_Y\!\left(\frac{t}{\sqrt{n}}\right)\right)^n \to e^{-t^2/2} \quad \text{as } n \to \infty    (2.17)
More relevant for our purposes is that the Central Limit Theorem also applies in the case of sequences that are not identically distributed, provided certain regularity conditions apply, including that the X_i need to be independent and each have finite expected value, say µ_i, and finite (and non-zero) standard deviation, say σ_i. We define S_n^2 = \sum_{i=1}^{n} \sigma_i^2. Then, if certain other regularity conditions are satisfied, Z_n = \sum_{i=1}^{n} (X_i - \mu_i)/S_n converges to a unit Normal distribution.
The Central Limit Theorem is thus potentially applicable at two different levels in an investment context:
(a) The first applies primarily at the individual instrument level. If the return that a given security might exhibit over some (infinitesimally) small time interval arises merely from the accumulation of lots of small independent contributory factors (each with fixed finite variance) then the Central Limit Theorem implies that over this short time interval the return should be approximately Normal. If these (random) contributions are also independent and identically distributed through time (i.e., over each separate such time interval) then the overall return exhibited by the security should be log-Normal over any specified time period, even ones that are not infinitesimally short, because returns compound through time and so log returns add through time.
(b) The second applies primarily at the aggregate market level. We shall see shortly that there are several different ways in which the assumptions in (a) may not apply in practice. However, even when the returns exhibited by individual instruments are not themselves (log-) Normally distributed, we might still expect the return on a diversified portfolio of securities to be driven by a large number of such contributions. Indeed, we might view this as an intuitive interpretation of what we mean by diversification, i.e., not being too much exposed to any individual factor that might drive returns. The Central Limit Theorem, applied to the average of the individual returns within the portfolio, should then more plausibly result in the (log) return on the portfolio being Normally distributed over any given short time interval. If these contributions are independent and identically distributed through time then the portfolio returns will be approximately log-Normally distributed over any specified time interval.
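The following sketch (not code from the book) illustrates the classical result numerically: sample averages of strongly skewed i.i.d. innovations look progressively more Normal as n grows. The choice of lognormal innovations is an assumption made purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 0.5
mu = np.exp(sigma ** 2 / 2)                        # mean of the lognormal innovation
var = (np.exp(sigma ** 2) - 1) * np.exp(sigma ** 2)  # its variance

for n in (1, 10, 100, 1000):
    # 10,000 simulated values of Z_n = sqrt(n) * (sample mean - mu) / sigma
    draws = rng.lognormal(mean=0.0, sigma=sigma, size=(10_000, n))
    z = np.sqrt(n) * (draws.mean(axis=1) - mu) / np.sqrt(var)
    print(n, round(float(stats.skew(z)), 2), round(float(stats.kurtosis(z)), 2))
# The skew and excess kurtosis of Z_n shrink towards the Normal values of zero.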
2.5.3 Ways in which the Central Limit Theorem can break down

Readers who are less mathematically minded may have been tempted to skip the previous section. Its real power comes when we use it to characterise the situations in which the Central Limit Theorem may break down. This has important ramifications for how we should handle extreme events in practice.
30 The Central Limit Theorem then follows from the Lévy continuity theorem, which confirms that convergence of characteristic functions implies convergence in distribution.
Essentially we can characterise all the possible ways in which the Central Limit Theorem can break down as follows:
(a) There may only be a small number of non-Normal innovations driving the investment returns, or even if there are a large number of such innovations there may be just a few (that are non-Normal) that have disproportionate influence. Thus we may never approach the well-diversified limit of n → ∞. We explore this possibility further in Section 2.6.
(b) The innovations may not be arising from distributions that remain the same through time, i.e., the innovations may not adhere to time stationarity (see Section 2.7). Strictly speaking, this does not invalidate the application of the Central Limit Theorem in a cross-sectional sense to a well-diversified portfolio (e.g., to, one might hope, some mainstream equity market indices), but it does make its interpretation tricky. Return series where the different elements are coming from time-varying (Normal) distributions can also be tricky or impossible to differentiate from those where the elements are coming from other distributions, if the distribution parameters are varying rapidly enough.
(c) The innovations may not be coming from distributions with finite variance (or even, possibly, finite mean) (see Section 2.8).
(d) The impact of the innovations may not respect the usual mathematical axioms that we might expect to apply to portfolio valuations (e.g., that an innovation that is twice as large in magnitude should, all other things being equal, be expected to alter portfolio value by twice as much). We will come across this issue particularly when exploring the impact of liquidity risk on portfolio construction.
2.6 LACK OF DIVERSIFICATION

By far the most obvious way in which fat tails might arise is if there is insufficient diversification present in the return innovations. The innovations in question also then need to be non-Normal (because the sum of independent Normal random variables is itself Normal however many or few of them there are).31
We have already made this point in relation to idiosyncratic stock-specific behaviour. Extreme events applicable to such securities tend to be relatively rare one-off events of major corporate significance, e.g., a takeover or a default.
The same can also apply at an entire market level. The generality of the market might be heavily influenced by a single overarching driver (e.g., the extent to which there is 'risk appetite' across the market as a whole), and thus might move in tandem if an extreme outcome derives from this driver.
There is an important additional point relevant to practical portfolio construction. Usually (active) fund managers are being asked to outperform a benchmark. They are thus assessed
31 The return innovations typically also need to be fat-tailed rather than thin-tailed for the overall outcome to be fat-tailed. For example, consider the number of 'heads' that arise if we randomly toss a coin n times. Each individual toss follows the Bernoulli distribution, taking a value of 1 with (success) probability p and 0 with (failure) probability q = 1 − p. The total number of heads follows a binomial distribution. The (excess) kurtosis is then (1/(1 − p) − 6p)/(np); see, e.g., Kemp (2010). Suppose also that the coin is 'fair'. Then p = 1/2 and a single toss, i.e., the case where n = 1, has the minimum possible (excess) kurtosis of −2. More generally the (excess) kurtosis when there are multiple tosses, n in total, is −2/n, i.e., always negative (although, as we should expect from the Central Limit Theorem, tending to zero as n → ∞). This incidentally indicates that if non-mathematical investment commentators are talking about investment outcomes being 'binary', i.e., meaning that effectively only one of two possible outcomes might arise, we should be wary about jumping to the conclusion that this will contribute to fat-tailed behaviour, at least in the sense described above. However, we might more reasonably form the conclusion that the outcome might be volatile and uncertain, because usually such commentators only express such views when they are viewing the two possible outcomes as quite different to each other.
by reference to their performance relative to that benchmark. It is the ways in which their portfolios deviate from the benchmark that drive relative performance, not the extent of diversification of the portfolio as a whole.
Fund managers will often only be expressing a relatively small number of overarching themes within their portfolio construction. For example, they might believe that some new pricing dynamic or market paradigm is likely to drive market behaviour over their investment time horizon. This might involve, say, the market favouring more environmentally friendly companies. The way in which this view is expressed at an individual asset level may appear relatively diversified to a third party observer who does not know what is driving the manager's choice of positions. But this does not make the resulting stances actually diversified, just apparently diversified as far as that third party observer is concerned. If the postulated new market paradigm does not come about or, worse, something conspires to cause the opposite, then the apparently diversified positions may end up moving in tandem, i.e., exactly not what the third party observer might have expected!
We may therefore expect actively managed portfolios to be less well diversified than we might expect based on a naive analysis of the portfolio positions, and thus to be more exposed to fat tails than we might otherwise expect. This does indeed seem to be the case; see, e.g., Kemp (2008d).

Principle P6: Actively managed portfolios are often positioned by their managers to express a small number of themes that the manager deems likely to prove profitable across many different positions. This is likely to increase the extent of fat-tailed behaviour that they exhibit, particularly if the manager's theme is not an exact replay of past investment movements and so may not be emphasised in any independent risk assessment of the portfolio structure.

More generally, this is an example of the underlying dynamics of the portfolio construction problem being intrinsically altered by the inclusion of a conscious agent, in this instance the fund manager. This is exactly what we do want to happen if we are employing the manager because we perceive that person to possess skill in asset selection. We discuss in more detail the potential selection effects that can arise with active management in Section 4.6.
The same also applies at an aggregate market level. Introduction of multiple conscious agents does not necessarily create much greater diversification, if they tend to behave in tandem – witness the 2007–09 credit crisis! This is the bread and butter of behavioural finance and economics.

Principle P7: Not all fat tails are bad. If the manager can arrange for their impact to be concentrated on the upside rather than the downside then they will add to performance rather than subtract from it.
2.7 A TIME-VARYING WORLD

2.7.1 Introduction

It is a truism that the world changes, and yet often researchers implicitly or explicitly assume that it will not. This partly reflects the extremely large number of possible ways in which the world might change, which makes selection of suitable models rather challenging. Moreover,
time-varying models are often dramatically more complicated than models that are time-stationary, i.e., that ignore this possibility.
In this section we provide merely a brief introduction to this issue. We highlight that if returns are coming from a range of distributions depending on when the return occurs then the aggregate return series will often appear to be fat-tailed, even if the distribution from which the return is being drawn at any point in time is not. We use this insight to introduce the concepts of time-varying volatility and regime shifts. These two latter topics (and their interaction with portfolio construction) are explored in more detail in Chapter 7.

2.7.2 Distributional mixtures

Suppose that we have two Normal distributions with the same mean, one with a relatively modest standard deviation and one with a much higher standard deviation. Consider the distribution of a random variable that is randomly selected a set fraction of the time from the first distribution and at other times from the second one, i.e., involves a distributional mixture of the two Normal distributions.
In Figures 2.17 and 2.18 we show the probability density function and quantile–quantile plot of an illustrative distributional mixture in which the first distribution has a standard deviation of 1, the second has a standard deviation of 3, both have means of 0, 80% of the distributional mixture comes from the first distribution and 20% from the second distribution. In each figure we contrast it with a Normal distribution with the same overall standard deviation and mean.
We see that the distributional mixture is fat-tailed, being denser in the tails (and in the central peak) than the corresponding (unmixed) Normal distribution with the same overall mean and standard deviation. In the 'far' tail (i.e., asymptotically as x → ±∞), the distributional mixture has a quantile–quantile form that is a straight line, but steeper than the corresponding (unmixed) Normal distribution. This is because essentially all the observations in these regions
[Figure 2.17: Probability density function of an example Normal distributional mixture, compared with a Normal distribution with the same mean and standard deviation (x-axis: x, here (log) return). Source: Nematrian. Reproduced by permission of Nematrian.]
[Figure 2.18: QQ-plot of an example Normal distributional mixture (x-axis: expected (log) return, if Normally distributed; y-axis: observed (log) return). Source: Nematrian. Reproduced by permission of Nematrian.]
come from the distribution with the higher standard deviation, i.e., in the far tails the probability distribution is like that of a single Normal distribution.
Distributional mixtures of Normal distributions with the same mean and different standard deviations will always be leptokurtic (i.e., will have positive excess kurtosis), as long as more than one of the distributional mixing coefficients, b_i, is greater than zero; see Kemp (2010). If the means are not the same then the distributional mixtures can be mesokurtic (i.e., have zero excess kurtosis) or even platykurtic (i.e., have negative excess kurtosis); see also Kemp (2010).
Readers are warned that different authors use the term 'mixture' when applied to probability distributions with quite different meanings and it is not always immediately obvious which meaning is being used. Perhaps most common is to use the term 'mixture' as shorthand for what we might more precisely call a linear combination mixture. With such a mixture, the 'mixed' random variable is derived as some suitable weighted average of the underlying random variables being 'mixed', i.e., if the random variables are x_i then we derive the 'mixture' of them, y, as y = \sum_i a_i x_i for suitable 'mixing coefficients', a_i. Distributional mixtures, in which y is drawn from the underlying probability distribution applicable to x_i with probability b_i, are quite different in nature, even though the b_i may also be referred to as 'mixing coefficients'.32
32 The 'mixing coefficients' also have quite different meanings: for example, the b_i need to satisfy b_i ≥ 0 and \sum_i b_i = 1, whereas there are no corresponding constraints on the a_i.
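The sketch below (not code from the book) simulates the 80%/20% distributional mixture used in Figures 2.17 and 2.18 and confirms that it is leptokurtic even though each component is itself Normal; the sample size is an illustrative assumption.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000
from_first = rng.random(n) < 0.8                      # mixing coefficients b = (0.8, 0.2)
sigmas = np.where(from_first, 1.0, 3.0)               # both components have mean 0
x = rng.standard_normal(n) * sigmas

print("overall std dev:", x.std())                    # roughly sqrt(0.8*1 + 0.2*9) = 1.61
print("excess kurtosis:", stats.kurtosis(x))          # noticeably greater than zero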
2.7.3 Time-varying volatility

A corollary is that fat-tailed behaviour can arise even when markets appear to be responding to the existence of large numbers of small, relatively well-behaved, innovations, if there is time-varying volatility. Such behaviour is also known as heteroscedasticity. The observed time
series will then correspond to a distributional mixture, formed from the mixing of different distributions depending on when the relevant observation within the series occurred (and hence at what level the supposed time-varying volatility then was). In Chapter 4 we will find that a much richer set of market dynamics can arise if the world is time-varying, so much so that it is implausible to assume that the world is not time-varying. We should thus expect to find that fat-tailed behaviour can be decomposed into two parts, namely: (a) a component that is linked to time-varying volatility; (b) the remainder. This decomposition is only helpful if there is some way of estimating the current level of time-varying volatility (more precisely in the portfolio construction context, if we can predict its likely immediate future level). We will also explore this topic further in Chapter 4. For the purposes of this section it suffices to say that a common mathematical tool for doing so is to assume that time-varying volatility follows a simple (first-order) autoregressive behaviour, which means that we can estimate its immediate future level from its recent past level. This methodology underlies GARCH (Generalised Autoregressive Conditional Heteroscedasticity) modelling and its analogues. We can test whether time-varying volatility does seem to explain observed market dynamics by repeating the analysis shown in Figure 2.9 but focusing on returns scaled by, say, past volatility of daily returns over, say, the preceding 50 trading days (see Figure 2.19). Assuming that the choice of a 50 trading day window is effective at capturing shorter-term movements
[Figure 2.19: Daily returns on FTSE All-Share Index from end June 1994 to end Dec 2007, scaled by 50 business day trailing volatility. QQ-plot of observed versus expected standardised (logged) returns, together with a Cornish-Fisher approximation (incorporating skew and kurtosis) and a fitted cubic (weighted by average distance between points). Source: Kemp (2009), Thomson Datastream. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd.]
in (time-varying) volatility, this picture should show the contribution merely from fat-tailed behaviour that does not arise due to (short-term) time-varying volatility.
The upside fat tail in Figure 2.9 largely disappears in Figure 2.19, suggesting that upside fat tails (at least for this index over this time period) may be largely explainable by this effect. The downside fat tail is somewhat less marked than before. However, it is still quite noticeable, suggesting that time-varying volatility is not the whole explanation of fat-tailed behaviour, particularly not for downside fat tails.
The importance of time-varying volatility for risk measurement purposes is well recognised by researchers and practitioners alike. Many practical risk system vendors include such effects in their risk estimation algorithms.33 Researchers such as Hull and White (1998) and Giamouridis and Ntoula (2007) have analysed different ways of estimating downside risk measures and highlight the importance of incorporating allowance for time-varying volatility, e.g., via GARCH-style models.

Principle P8: For major Western (equity) markets, a significant proportion of deviation from (log-) Normality in daily (index) return series appears to come from time-varying volatility, particularly in the upside tail. This part of any fat-tailed behaviour may be manageable by reference to rises in recent past shorter-term volatility or corresponding forward-looking measures such as implied volatility.
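A minimal sketch of the volatility-scaling idea behind Figure 2.19 is set out below (not code from the book): each day's return is divided by trailing 50 business day volatility and the excess kurtosis is compared before and after scaling. The GARCH(1,1)-like simulated index series and its parameters are illustrative assumptions; with real index data the raw returns would simply be read in instead.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, window = 3000, 50

# Simulate returns whose volatility varies through time (GARCH(1,1)-like recursion)
sigma2 = np.empty(n)
r = np.empty(n)
sigma2[0] = 0.01 ** 2
for t in range(n):
    if t > 0:
        sigma2[t] = 1e-6 + 0.08 * r[t - 1] ** 2 + 0.90 * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Trailing volatility over the preceding `window` days (excluding today)
trailing_vol = np.array([r[t - window:t].std() for t in range(window, n)])
scaled = r[window:] / trailing_vol

print("excess kurtosis, raw returns:   ", stats.kurtosis(r[window:]))
print("excess kurtosis, scaled returns:", stats.kurtosis(scaled))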
2.7.4 Regime shifts

A more common way of developing the same theme, as far as commentators focusing on asset allocation are concerned, is to use the concept of regime shifts. (We return to this theme in more detail in Chapter 7.) The assumption being made is that the world can be in one of two or more states or 'regimes' and there is a certain probability of it shifting from one regime to another regime at any particular point in time, perhaps described by a Markov process.34 For such modelling to be of use in practice, we typically need to postulate that movement to a new state is relatively unlikely over each individual time period in the analysis (otherwise the regime structure will exhibit insufficient persistency to help with the modelling35).
33 For example, a risk system vendor's model may give greater weight to more recent observations (see, e.g., Exercise A.2.3 in the Appendix). This implicitly introduces a GARCH-like behaviour to the model. Even if their models do not explicitly use such a methodology, the estimation process may do so implicitly as previous parameter estimates get updated with the passage of time.
34 A Markov process, named after Russian mathematician Andrei Markov, is a mathematical model for the random evolution of a 'memory-less' system, i.e., one in which the likelihood of a given future state, at any given moment, depends only on its present state and not on any past states. Formally it is one where the probability of it taking state y at time t + h, conditioned on it being in the particular state x(t) at time t, is the same as the probability of it taking the same state y but conditioned on its values at all previous times before t, i.e.:
P(X(t + h) = y \mid X(s) = x(s),\ \forall s \leq t) = P(X(t + h) = y \mid X(t) = x(t)) \quad \forall h > 0

Processes that do not appear to be 'Markovian' can often be given a Markovian representation if we expand the concept of the 'current' state to include its characteristics at previous times too, i.e., we define the 'state' by Y, where Y(t) is a (possibly infinite dimensional) vector incorporating all values of X(s) for s ≤ t.
35 If states typically change too frequently (e.g., many times within each individual time period within our analysis) then we effectively end up, for each individual time period, with merely a composite probability distribution (arising from a suitable distributional mixture of the underlying individual regime probability distributions). The precise dynamics of individual states and likelihoods of movements between them no longer then feature in the analysis, except at a conceptual level when trying to justify why the observed composite distributional form might take its apparent structure.
The main differences between this sort of approach and the type of time-varying volatility and GARCH modelling described in Section 2.7.3 are as follows:
(a) Time-varying volatility and GARCH modelling can be thought of as involving some continuous parameterisation of the state of the world (the parameter being the supposed time-varying level of volatility, which in principle is allowed to take a continuous spectrum of possible values). In contrast, regime shifting would normally assume that the world exists only in a small number of states (maybe even just two), i.e., it would normally involve a discrete parameterisation of the state of the world.
(b) The regimes used, particularly when applying such techniques to asset allocation decision-making, would often differ not just in relation to volatility but also in relation to other factors influencing asset allocation, e.g., means, correlations and/or tail dependencies between different markets or instruments and/or in the distributional forms that returns from them are assumed to take.
Thus, we can think of regime shift models as involving a more generalised description of how the world evolves, but one that is potentially more open to over-fitting, i.e., to lack of parsimony (see Section 2.10).
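The following sketch (not code from the book) simulates a simple two-state regime-shifting model of the kind described above, with a 'quiet' and a 'turbulent' regime and a Markov chain governing shifts between them; all parameter values are illustrative assumptions. Even though returns are conditionally Normal within each regime, the resulting series is fat-tailed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000
vols = np.array([0.006, 0.020])          # daily volatility in regimes 0 and 1
p_switch = np.array([0.01, 0.05])        # probability of leaving regime 0 / regime 1

state = 0
states = np.empty(n, dtype=int)
for t in range(n):
    if rng.random() < p_switch[state]:   # rare Markov switch between regimes
        state = 1 - state
    states[t] = state

returns = rng.standard_normal(n) * vols[states]
print("fraction of time in turbulent regime:", states.mean())
print("excess kurtosis of returns:          ", stats.kurtosis(returns))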
2.8 STABLE DISTRIBUTIONS

2.8.1 Introduction

We noted in Section 2.5 that one way in which the Central Limit Theorem can break down is if returns do not have finite variance. Commentators focusing on this possibility often concentrate on stable distributions, because (a) they have certain interesting theoretical properties that make them potentially useful for modelling financial data (in particular, they adhere to a generalisation of the Central Limit Theorem); and (b) they in general depend on four parameters rather than merely the two (i.e., mean and variance) applicable to the Normal distribution and in general have fat tails – they are thus able to provide a better fit to past observed data.
Traditionally, stable distributions have been perceived to be relatively difficult to manipulate mathematically because of their infinite variance (apart from the special case of the Normal distribution, which is the only member of the stable family of distributions with finite variance). More recently, mathematical tools and programs have been developed that simplify such manipulations and they are becoming more common tools for seeking to handle extreme events. Some tools for manipulating stable distributions are available in Kemp (2010).
Whether stable distributions are actually good at modelling financial data is less clear. Longuin (1993), when analysing the distribution of US equity returns, concluded that their distribution was not sufficiently fat-tailed to be adequately modelled by stable distributions, even if it was fatter-tailed than implied by the Normal distribution. Moreover, as we have seen above, fat tails can arise from a variety of effects, not all of which bear much resemblance to a world in which lots of small infinite-variance return innovations are the prime sources of fat-tailed behaviour at an aggregate level. In practice, we might expect any such innovation to have a potentially large but finite impact on the end result, and so the 'infiniteness' of the supposed underlying distributional form is presumably just an approximation to reality. Even the generalisation of the Central Limit Theorem can be questioned, because combinations of
different types of stable distribution do not necessarily converge to the same type of limiting distribution (see Section 2.9.5).
Nevertheless, stable distributions do form an important way in which some commentators have sought to explore fat tails. They also naturally lead into other topics relevant to our discussions – see Sections 2.9 and 2.10. We therefore summarise below some of their key attributes. A more detailed treatment of stable distributions is available in Kemp (2010).

2.8.2 Defining characteristics

The defining characteristic of stable distributions, and the reason for the term stable, is that they retain their shape (suitably scaled and shifted) under addition. The definition of a stable distribution is that if X, X_1, X_2, ..., X_n are independent, identically distributed random variables, then for every n we have, for some constants c_n > 0 and d_n,

X_1 + X_2 + \cdots + X_n \,\overset{d}{=}\, c_n X + d_n    (2.18)
Here \overset{d}{=} denotes equality in distributional form, i.e., the left and right hand sides have the same probability distribution. The distribution is called strictly stable if d_n = 0 for all n. Some authors use the term sum stable to differentiate this type of 'stability' from other types that might apply.
Normal distributions satisfy this property. They are the only distributions with finite variance that do so. Other probability distributions that exhibit the stability property described above include the Cauchy distribution and the Lévy distribution.
The class of all distributions that satisfy the above property is described by four parameters: α, β, γ, δ. In general there are no analytical formulae36 for the probability densities, f, and cumulative distribution functions, F, applicable to these distributional forms, but there are now reliable computer algorithms for working with them. Special cases where the pdf has an analytical form include the Normal, Cauchy and Lévy distributions; see Kemp (2010).
Nolan (2005) notes that there are multiple definitions used in the literature regarding what these parameters mean, although he focuses on two – which he denotes by S(α, β, γ, δ_0; 0) and S(α, β, γ, δ_1; 1) – that are differentiated according to the meaning given to δ. The first is the one that he concentrates on, because it has better numerical behaviour and (in his opinion) a more intuitive meaning, but the second is more commonly used in the literature. In these descriptions:
• α is the index of the distribution, also known as the index of stability, characteristic exponent or tail index, and must be in the range 0 < α ≤ 2. The constant c_n in Equation (2.18) must be of the form n^{1/α}.
• β is the skewness of the distribution and must be in the range −1 ≤ β ≤ 1. If β = 0 then the distribution is symmetric, if β > 0 then it is skewed to the right and if β < 0 then it is skewed to the left.
• γ is a scale parameter and can be any positive number.
• δ is a location parameter, shifting the distribution to the right if δ > 0 and to the left if δ < 0.
36 There is some flexibility in exactly what we deem to constitute an 'analytical' formula (also known as a 'closed form' formula). Generally the term is understood to mean a formula that merely includes relatively elementary functions, such as trigonometric functions, polynomials, exponentials and logarithms, and perhaps a few other 'similar' sorts of functions such as the gamma function. Without some limitation on the complexity of elements that such a formula can include, we can always make any formula 'analytical' by 'creating' a new function that corresponds to the function we are trying to express in analytical form.
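The stability property in Equation (2.18) can be checked empirically. The sketch below (not code from the book) does so for the Cauchy distribution, which is stable with α = 1, so that c_n = n^{1/α} = n and d_n = 0 and the rescaled sum should have the same distribution as a single draw; the sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_terms, n_sims = 20, 200_000

single = rng.standard_cauchy(n_sims)
averaged = rng.standard_cauchy((n_sims, n_terms)).sum(axis=1) / n_terms

# Compare a few quantiles (means and variances do not exist for the Cauchy)
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(single, qs).round(2))
print(np.quantile(averaged, qs).round(2))
# The two sets of quantiles should be very close, illustrating the stability property.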
The distributional form is normally defined via the distribution's characteristic function (see Section 2.5.2). Nolan (2005) defines a random variable X as coming from a S(α, β, γ, δ_0; 0) distribution if its characteristic function is ϕ_0(u), and he defines one as coming from a S(α, β, γ, δ_1; 1) distribution if its characteristic function is ϕ_1(u), where ϕ_k(u) is defined as follows for k = 0 or 1:

\varphi_k(u) = \exp\left(-\gamma^\alpha |u|^\alpha Q_k(u) + i\delta_k u\right)    (2.19)

where

Q_0(u) = \begin{cases} 1 + i\beta \tan\frac{\pi\alpha}{2}\,(\operatorname{sgn} u)\left(|\gamma u|^{1-\alpha} - 1\right), & \alpha \neq 1 \\ 1 + i\beta\,\frac{2}{\pi}\,(\operatorname{sgn} u)\log(\gamma|u|), & \alpha = 1 \end{cases}    (2.20)

Q_1(u) = \begin{cases} 1 - i\beta \tan\frac{\pi\alpha}{2}\,(\operatorname{sgn} u), & \alpha \neq 1 \\ 1 + i\beta\,\frac{2}{\pi}\,(\operatorname{sgn} u)\log|u|, & \alpha = 1 \end{cases}    (2.21)

The location parameters, δ_0 and δ_1, are related by

\delta_0 = \begin{cases} \delta_1 + \beta\gamma\tan\frac{\pi\alpha}{2}, & \alpha \neq 1 \\ \delta_1 + \beta\,\frac{2}{\pi}\,\gamma\log\gamma, & \alpha = 1 \end{cases}    (2.22)

\delta_1 = \begin{cases} \delta_0 - \beta\gamma\tan\frac{\pi\alpha}{2}, & \alpha \neq 1 \\ \delta_0 - \beta\,\frac{2}{\pi}\,\gamma\log\gamma, & \alpha = 1 \end{cases}    (2.23)
Nolan (2005) notes that if β = 0 then the 0-parameterisation and the 1-parameterisation coincide. When α ≠ 1 and β ≠ 0, the parameterisations differ by a shift βγ tan(πα/2), which gets infinitely large as α → 1. Nolan argues that the 0-parameterisation is a better approach because it is jointly continuous in all four parameters, but accepts that the 1-parameterisation is simpler algebraically, and so is unlikely to disappear from the literature. He also notes the following:
(a) Stable distributions are unimodal (i.e., have a single peak), whatever the choice of (α, β, γ, δ).
(b) When α is small then the skewness parameter is significant, but when α is close to 2 then it matters less and less.
(c) When α = 2 (i.e., the Normal distribution), the distribution has 'light' tails and all moments exist. In all other cases (i.e., 0 < α < 2), stable distributions have 'heavy' tails (i.e., 'heavy' relative to behaviour of the Normal distribution) and an asymptotic power law (i.e., Pareto)
decay in their tails (see Section 2.8.3). The term stable Paretian is thus used to distinguish the α < 2 case from the Normal case.
A consequence of these heavy tails is that not all population moments exist. If α < 2 then the population variance does not exist. If α ≤ 1 then the population mean does not exist either. Fractional moments, e.g., the pth absolute moment, defined as E(|X|^p), exist if and only if p < α (if α < 2). All sample moments exist, if there are sufficient observations in the sample, but these will exhibit unstable behaviour as the sample size increases if the corresponding population moment does not exist.

2.8.3 The Generalised Central Limit Theorem

Perhaps the most important feature of stable distributions in relation to the topic of fat-tailed innovations is that any linear combination of independent stable distributions with the same index, α, is itself a stable distribution. For example, if X_j ~ S(α, β_j, γ_j, δ_j; k) for j = 1, ..., n then

\sum_{j=1}^{n} a_j X_j \sim S(\alpha, \beta, \gamma, \delta; k)    (2.24)

where

\gamma^\alpha = \sum_{j=1}^{n} \left|a_j \gamma_j\right|^\alpha    (2.25)

\beta = \frac{1}{\gamma^\alpha} \sum_{j=1}^{n} \beta_j \,(\operatorname{sgn} a_j)\, \left|a_j \gamma_j\right|^\alpha    (2.26)

\delta = \begin{cases} \sum_{j=1}^{n} \delta_j + \gamma\beta\tan\frac{\pi\alpha}{2}, & k = 0,\ \alpha \neq 1 \\ \sum_{j=1}^{n} \delta_j + \beta\,\frac{2}{\pi}\,\gamma\log\gamma, & k = 0,\ \alpha = 1 \\ \sum_{j=1}^{n} \delta_j, & k = 1 \end{cases}    (2.27)

This leads on to a generalisation of the Central Limit Theorem. The classical Central Limit Theorem states that the normalised sum of independent, identically distributed random variables converges to a Normal distribution. The Generalised Central Limit Theorem extends this result to cases where the finite variance assumption is dropped. Let X_1, X_2, ... be a sequence of independent, identically distributed random variables. Then it can be shown that there exist constants a_n > 0 and b_n and a non-degenerate random variable Z with

a_n (X_1 + \cdots + X_n) - b_n \xrightarrow{D} Z

if and only if Z is stable. A random variable X is said to be in the domain of attraction of Z if there exist constants a_n > 0 and b_n such that this equation holds when X_1, X_2, ... are independent identically distributed copies of X. Thus the only possible distributions with a domain of attraction are stable distributions as described above.
Distributions within a given domain of attraction are characterised in terms of tail probabilities. If X is a random variable with x^α P(X > x) → c_+ ≥ 0 and x^α P(X < −x) → c_− ≥ 0, with c_+ + c_− > 0, for some 0 < α < 2 as x → ∞, then X is in the domain of attraction of an α-stable law. a_n must then be of the form a_n = a n^{−1/α} for some constant a. However, it is essential for the αs to be the same; adding stable random variables with different αs does not result in a stable distribution.
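The earlier remark that sample moments always exist but exhibit unstable behaviour as the sample size increases when the corresponding population moment does not exist can also be illustrated numerically. The sketch below (not code from the book) tracks the running sample variance of Cauchy (α = 1) data, for which the population variance does not exist, against that of Normal data; the sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
cauchy = rng.standard_cauchy(1_000_000)   # population variance does not exist
normal = rng.standard_normal(1_000_000)   # population variance equals 1

for m in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={m:>9,}  var(Cauchy)={cauchy[:m].var():12.1f}  var(Normal)={normal[:m].var():6.3f}")
# The Cauchy sample variance jumps around erratically as the sample grows,
# whereas the Normal sample variance settles down close to 1.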
2.8.4 Quantile–quantile plots of stable distributions

We highlighted earlier the benefits of trying to visualise fat-tailed behaviour using QQ-plots. Unfortunately, creating quantile–quantile plots for stable distributions presents some challenges.
This difficulty arises because the QQ-plots we have described above plot a distribution against the 'comparable' Normal distribution. However, non-Normal stable distributions do not have finite standard deviations (or even in some cases finite means), and so the 'comparable' Normal distribution is ill-defined. We might, for example, define 'expected' values (i.e., the x-coordinates) by reference to the means and standard deviations falling within a given quantile range such as 0.1% to 99.9% (perhaps corresponding to observations that might be observable 'in practice'), but the choice of truncation is rather arbitrary. Kemp (2010) has a further discussion of this issue.
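One possible construction along these lines is sketched below (not code from the book): the 'comparable' Normal is defined using the mean and standard deviation of the observations lying between the 0.1% and 99.9% quantiles, and its quantiles are compared with those of a simulated stable sample. The use of scipy.stats.levy_stable (available in recent SciPy versions) and the choice of α = 1.7 are assumptions made purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = stats.levy_stable.rvs(alpha=1.7, beta=0.0, size=100_000, random_state=rng)

lo, hi = np.quantile(sample, [0.001, 0.999])
core = sample[(sample >= lo) & (sample <= hi)]          # truncated 'comparable' range
mu, sd = core.mean(), core.std()

probs = np.array([0.001, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999])
expected = stats.norm.ppf(probs, loc=mu, scale=sd)      # x-coordinates of the QQ-plot
observed = np.quantile(sample, probs)                   # y-coordinates of the QQ-plot
for p, e, o in zip(probs, expected, observed):
    print(f"{p:6.3f}  expected {e:8.2f}  observed {o:8.2f}")
# The observed tail quantiles lie well beyond those of the comparable Normal.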
2.9 EXTREME VALUE THEORY (EVT)

2.9.1 Introduction

The different tail behaviours of stable distributions with different (tail) indices, i.e., with different values of α, naturally lead us to consider the tail behaviours of other probability distributions. This subject is normally referred to as extreme value theory (EVT). EVT attempts to provide a complete characterisation of the tail behaviour of all types of probability distributions, arguing that tail behaviour can only in practice take a small number of possible forms.
Such a theory is conceptually appealing when analysing extreme events. It suggests that we can identify likelihoods of very extreme events 'merely' by using the following prescription:
1. Identify the apparent type of tail behaviour being exhibited by the variable in question.
2. Estimate the (small number of) parameters that then characterise the tail behaviour.
3. Estimate the likelihood of occurrence however far we like into the tail of the distribution, by plugging our desired quantile level into the tail distributional form estimated in step 2.
However, we shall find that life is not this simple in practice. Not surprisingly, extrapolation into the tail of a probability distribution is challenging, not because it is difficult to identify possible probability distributions that might fit the observed data (it is relatively easy to find plenty of different possible distributions) but because the range of answers that can plausibly be obtained can be very wide, particularly if we want to extrapolate into the far tail where there may be few if any directly applicable observation points.
Table 2.3 Distributional forms of EVT distributions

Type   Name of distribution   Distributional form
I      Gumbel                 exp(−exp(−x)) for −∞ < x < ∞
II     Fréchet                exp(−(1 + x/α)^(−α)) for 1 + x/α > 0; otherwise 0
III    Weibull                exp(−(1 − x/α)^(α)) for 1 − x/α > 0; otherwise 1
2.9.2 Extreme value distributions

The most important theorem in extreme value theory is the Fisher-Tippet theorem: see, e.g., Malhotra (2008) who in turn quotes Fisher and Tippet (1928), Balkema and de Haan (1974) and Pickands (1975). The theorem is also called the Fisher-Tippet-Gnedenko theorem or the first theorem in extreme value theory by some writers.
According to Malhotra (2008) it may be stated as follows. Suppose (x_1, ..., x_n) is a set of n independent draws from a (continuous and unbounded) distribution F(x). Suppose these draws are ordered, using the notation that x_(i) is the ith smallest element of the set, i.e., x_(1) ≤ ··· ≤ x_(i) ≤ ··· ≤ x_(n). Then the Fisher-Tippet theorem indicates that as n → ∞ the distribution function of an appropriately scaled version of x_(n) (if it converges at all) must take one of three possible forms, i.e., one of three possible extreme value distributions (EVD), as shown in Table 2.3.
By 'appropriately scaled', Malhotra (2008) means the following. As n → ∞ we can expect x_(n) → ∞. However, so too will the mean, µ_(n), and standard deviation, σ_(n), of a sample of many such x_(n) taken from many different sample sets (each of size n). Thus, to define a distribution for x_(n) we may 're-scale' it using y_n = (x_(n) − µ_(n))/σ_(n) and focus on the behaviour of y = lim_{n→∞}(y_n).
The Gumbel, Fréchet and Weibull distributions are particular cases of the generalised extreme value distribution, which is a family of continuous probability distributions specifically developed to encompass all three limiting forms. It is defined to have a cumulative distribution function (as long as 1 + ξ(x − µ)/σ > 0):

F(x; \mu, \sigma, \xi) = \exp\left(-\left(1 + \xi\,\frac{x - \mu}{\sigma}\right)^{-1/\xi}\right)    (2.28)
The three parameters µ ∈ R, σ > 0 and ξ ∈ R are location, scale and shape parameters respectively. Some other features of the generalised extreme value distribution are given in Kemp (2010). The sub-families defined by ξ → 0, ξ > 0 and ξ < 0 correspond respectively to the Gumbel, Fréchet and Weibull37 distributions.
All the above formulae refer to maximum values. Corresponding generalised extreme value distributions for minimum values can be obtained by substituting (−x) for x in the above formulae.
A related concept is that of the maximum domain of attraction (MDA) of an extreme value distribution. This is the set of all distributions that have that particular EVD for their extreme
37 The Weibull distribution is more normally expressed using the variable t = µ − x and more normally is used in cases that deal with minimum rather than maximum values.
statistic. For example, the extreme statistic of a Normal distribution has a Gumbel EVD. The Normal distribution is thus said to belong to the Gumbel MDA.

2.9.3 Tail probability densities

The Weibull EVD has an upper bound and is not generally used for tail risk purposes in a financial context. Its MDA includes bounded (continuous) functions such as the uniform and beta distributions. Instead, focus is generally on the Fréchet EVD, if the researcher does not believe that Normal distributions are appropriate.
According to Malhotra (2008), the Pickands-Balkema-de Haan theorem indicates that the Fréchet MDA has the property that its members always have a simple power-law form for the probability densities of their tail, i.e., that their probability density function is proportional to x^{−α−1} as x → ∞. The tail index is therefore related to how fast the density falls to zero for extreme returns. Smaller values of α correspond to fatter tails. Thus we appear merely to have one parameter to estimate, relating to the relative thickness of the tail as x → ∞, rather than an infinite number of possible parameters to choose from.

2.9.4 Estimation of and inference from tail index values

The Pickands-Balkema-de Haan theorem indicates a way of estimating the magnitude of the tail parameter, α, for distributions in the Fréchet MDA. Once we have such an estimate we can also apply statistical tests and develop confidence limits to apply to it.
If we order the returns r_(1) ≤ r_(2) ≤ ··· ≤ r_(i_tail) ≤ ··· ≤ r_(n) (here we are now considering the downside tail) then Scherer (2007) notes that we can derive an estimate of the tail index, α̂, using

\hat{\alpha}^{-1} = \frac{1}{i_{tail}} \sum_{i=1}^{i_{tail}} \log\left(\frac{r_{(i)}}{r_{(i_{tail}+1)}}\right)    (2.29)
This estimator is called the Hill estimator by Malhotra (2008) (and other writers), referring to Hill (1975). Here i_tail is the number of observations in the tail and i_tail + 1 is the index of the observation where the tail is assumed to start. However, Scherer (2007) also indicates that there is no definitive rule on how to determine i_tail, i.e., on where to deem the tail to have begun. He suggests that practitioners often plot the tail index against the choice of i_tail (in what is known as a 'Hill' plot) and then choose i_tail, and hence the tail index, to accord with where such a plot seems to become more or less horizontal (i.e., at the point where the tail index estimate seems to become robust against changes to where the tail is deemed to have begun).
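The sketch below (not code from the book) implements the Hill estimator in equation (2.29) and a rudimentary 'Hill plot' of the kind just described, applied to the losses (negated returns) of a simulated fat-tailed series; the Student-t example, whose true tail index is 4, is an illustrative assumption.

import numpy as np

def hill_tail_index(losses, i_tail):
    """Hill estimate of alpha from the i_tail largest losses (all assumed > 0)."""
    x = np.sort(np.asarray(losses, dtype=float))[::-1]   # largest loss first
    inv_alpha = np.mean(np.log(x[:i_tail] / x[i_tail]))
    return 1.0 / inv_alpha

rng = np.random.default_rng(0)
losses = -rng.standard_t(df=4, size=20_000)              # downside tail of the returns
losses = losses[losses > 0]

# Hill plot: the estimate as a function of where the tail is deemed to begin
for k in (25, 50, 100, 200, 400, 800):
    print(f"i_tail={k:4d}  alpha_hat={hill_tail_index(losses, k):.2f}")

In practice the estimate would be read off from the region where it is broadly flat across neighbouring choices of i_tail.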
2.9.5 Issues with extreme value theory

There are several issues with EVT as formulated above that reduce its usefulness when analysing extreme events:
(a) As soon as we go further into the tail than we have observations, we move into the realm of extrapolation rather than interpolation.
(b) The theory implicitly assumes that the behaviour of an appropriately rescaled maximum value (or minimum value) actually converges to a probability distribution. This may not actually be the case.
(c) Even if the appropriately rescaled maximum (or minimum) does converge to a probability distribution, we not only need to estimate the tail index but also the location and shape parameters to interpolate/extrapolate appropriately.
(d) To be directly applicable without modification, the theorem requires an assumption of time stationarity, which we have already seen in Section 2.7 is a suspect assumption.
We discuss each of these issues below.

2.9.5.1 Extrapolation is intrinsically more hazardous than interpolation

Press et al. (2007), along with many other commentators, point out the extremely hazardous nature of extrapolation compared with interpolation. If life were simple then we would have nice smooth curves that extrapolated relatively easily and reliably. For example, having already been primed in Figure 2.5 as to what quantile–quantile plots of fat-tailed distributions might look like, we might argue that the 'obvious' extrapolation of the dotted line in Figure 2.20 is shown as Possible Extrapolation (1) (as the two together form the dotted line in Figure 2.5). However:
(a) Even if smoothness of join is deemed paramount, there are an infinite number of curves that smoothly join with the dotted line. Two further ones are illustrated that have substantially different tail characteristics. An extrapolation that flattened off might be more appropriate
[Figure 2.20: Illustration of the perils of extrapolating into the far tail. QQ-plot (x-axis: expected (log) return, if Normally distributed; y-axis: observed (log) return) showing a Normal distribution, an example fat-tailed distribution and Possible Extrapolations (1), (2) and (3). Source: Nematrian. Reproduced by permission of Nematrian.]
if there happened to be some cap on the downside that applied not much further into the tail. Conversely, a steeper extrapolation might be more appropriate if there were effects that magnified losses in seriously adverse scenarios.
(b) The observed data would not in practice be as smooth as the dotted line illustrated in Figure 2.20. We can see this by looking at Figures 2.7–2.9, which include individual observation points. There is a noticeable difference in position along the x-axis between the last two points of the chart, highlighting the graininess of the observations close to the observable limit.38

Principle P9: Extrapolation is intrinsically less reliable, mathematically, than interpolation. Extreme value theory, if used to predict tail behaviour beyond the range of the observed dataset, is a form of extrapolation.
2.9.5.2 The tail may not converge to any EVD

Most generalist writers who apply EVD to financial risk modelling appear to assume that tail behaviour necessarily converges to one of the three limiting forms described above. However, this is not necessarily the case.
The precise specification of the Fisher-Tippet theorem makes this point clearer. This involves the following. Let X_1, X_2, ... be a sequence of independent and identically-distributed random variables. Let M_n = max(X_1, X_2, ..., X_n). If a sequence of pairs of real numbers (a_n, b_n) exists such that each a_n > 0 and lim_{n→∞} P((M_n − b_n)/a_n ≤ x) = F(x), and if F is a non-degenerate distribution function, then F belongs to the Gumbel, the Fréchet or the Weibull family of distributions. However, there may be no sequence of pairs of real numbers (a_n, b_n) for which there is such a limit.
We can contrive such a situation by arranging for the log-log plot of the tail to have a form that does not tend to any particular limiting line as x → ±∞. However far our existing observation set is into the tail, we can always arrange for such behaviour to occur further still into the tail, highlighting once again the challenges that arise when we attempt to extrapolate beyond the observed dataset; see above.
2.9.5.3 To calculate quantile values we need to estimate scale parameters other than α

The Hill estimator (and other equivalent approaches) assumes that we know when the tail begins and also implicitly assumes that we know the exact distributional value at that point. Neither of these is true. Even if EVT is applicable, we can only estimate the quantile values in the tail approximately. As we might expect, the closer we are to the edge of the observed dataset, the greater is the estimation error that can arise from this issue. Clauset (2007) provides a further discussion on how we might jointly estimate the relevant parameters on which tail behaviour might depend.
38 The graininess at the limit of the observable dataset does eventually disappear as the number of points tends to infinity, but only extremely slowly; e.g., to halve the distance along the x-axis between the two lowest points, compared to the length of the plot as a whole along the x-axis (in, say, Figure 2.9), we would need c. 1000 times as many observations!
[Figure 2.21: QQ-plot of S&P 500 daily price movements from early 1968 to 24 March 2009, shown without volatility adjustment and with trailing 20 and 50 business day volatility adjustments. Source: Nematrian, Thomson Datastream. Reproduced by permission of Nematrian.]
2.9.5.4 Different tail indices arise depending on how we might relax the assumption of time stationarity

EVT is usually developed in the context of things that we might expect to exhibit approximate time stationarity, e.g., the potential magnitudes of large earthquakes, the number of people with a common surname and so on. When time stationarity is less obviously valid, as is probably the case with most financial data (see Section 2.7), it is important to be aware of the large impact that different ways of catering for lack of time stationarity can have on tail index estimates and hence on deemed likelihoods of tail events.
We can see this by, say, analysing daily price movement data for the S&P 500 Index and for the FTSE All-Share Index for the period 31 December 1968 to 24 March 2009 both without and with short-term time-varying volatility adjustments as per Section 2.7 (here focusing on trailing 20 business day and trailing 50 business day volatility); see Figure 2.21 for S&P 500 and Figure 2.22 for FTSE All-Share. Fat-tailed behaviour is evident in both figures and for both indices, except arguably for the S&P 500 Index on the upside once we have adjusted for short-term volatility. However, there are noticeable differences in the shape of the downside fat tail depending on whether a time-varying volatility adjustment is applied (and depending on the index to which it is applied), reflecting the large impact that data relating to the October 1987 Stock Market crash has on this part of both charts.
2.10 PARSIMONY
A discussion of stable distributions also naturally leads us into another more general topic, that of parsimony and of the dangers of over-fitting data.
[Figure 2.22: QQ-plot of FTSE All-Share daily price movements from early 1968 to 24 March 2009 (observed versus expected standardised (logged) returns, sorted), shown without volatility adjustment and with trailing 20 and 50 business day volatility adjustments. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
We noted in Section 2.8.1 that one of the perceived benefits of stable distributions is that they involve four parameters rather than the two, i.e., mean and variance, applicable to the Normal distribution. They are thus able to provide a better fit to past observed data. There are other ways of incorporating additional parameters into characterisations of distributional forms, which can also be used to achieve better fits to past data. Some of these are materially simpler to manipulate than stable distributions. For example, the cubic quantile–quantile curve fits referred to in Section 2.4.5 introduce two additional parameters into the distributional form. Thus they too naturally provide better fits to observed data than the Normal distribution alone.39

Incorporating additional parameters to improve fits to past data does not, however, mean that the resulting models are necessarily better at describing future data. Quite apart from the possibility that the past might not actually be a suitable guide to the future, we run the risk of over-fitting the observed data. We can see this by considering what happens if, instead of using a cubic quantile–quantile curve fit, we use a quartic, quintic or, more generally, an nth order polynomial, i.e., a curve of the form y(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n. We can exactly fit any set of n + 1 data points using an nth order polynomial, because we then have n + 1 parameters (i.e., degrees of freedom) to play around with (namely a_0, a_1, a_2, ..., a_n) when fitting the n + 1 data points. However, as soon as an additional data point became available, our nth order polynomial would almost certainly no longer exactly fit the augmented dataset.

39 Relative to stable distributions, cubic quantile–quantile curve fits also have the advantage of being more analytically tractable, and they can be tailored so that the fit is best in the particular part of the distributional form in which we are most interested. Conversely, it can be argued that their choice of form is more arbitrary (e.g., why concentrate on cubics rather than quartics or quintics etc.?) and has less underlying ‘merit’ than stable distributions, because the latter can be justified by reference to the Generalised Central Limit Theorem described in Section 2.8.3.
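The over-fitting danger is easy to demonstrate numerically. The following sketch (ours, for illustration only) fits an exact nth order polynomial through n + 1 points drawn from a noisy linear relationship and shows how badly it extrapolates to a new point compared with a parsimonious straight-line fit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2.0, 2.0, 8)                 # 8 observed points
y = 0.5 * x + 0.2 * rng.standard_normal(8)    # 'true' relationship is linear plus noise

exact_fit = np.polyfit(x, y, deg=len(x) - 1)  # 7th order: passes through every point
line_fit = np.polyfit(x, y, deg=1)            # parsimonious straight-line fit

x_new = 2.5                                   # a point just outside the observed range
print("true value approx :", 0.5 * x_new)
print("7th order polynomial:", np.polyval(exact_fit, x_new))   # typically far off
print("straight line       :", np.polyval(line_fit, x_new))    # typically close
```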
How do we decide how many additional parameters we need to provide a good but not over-fitted characterisation of the available data? There is no unique answer to this question. Typically, we adopt a fitting technique that penalises use of additional parameters when assessing ‘goodness of fit’. An example is the Akaike information criterion (see, e.g., Billah, Hyndman and Koehler (2003)), but there are other similar techniques that place greater or lesser weight on keeping down the number of parameters included in the model. Some examples of the dangers of over-fitting are given in Kemp (2010).
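As an illustration of a parameter-penalising criterion, the sketch below (ours; the book does not prescribe any particular implementation) compares a two-parameter Normal fit with a three-parameter Student-t fit using the Akaike information criterion, AIC = 2k − 2 ln L, where k is the number of fitted parameters and L the maximised likelihood.

```python
import numpy as np
from scipy.stats import norm, t

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

# A fat-tailed sample, purely for illustration
data = t.rvs(4, size=2000, random_state=3)

# Normal fit: 2 parameters (mean, standard deviation)
mu, sigma = data.mean(), data.std(ddof=1)
ll_normal = norm.logpdf(data, loc=mu, scale=sigma).sum()

# Student-t fit: 3 parameters (degrees of freedom, location, scale)
df_, loc_, scale_ = t.fit(data)
ll_t = t.logpdf(data, df_, loc=loc_, scale=scale_).sum()

print("AIC Normal   :", aic(ll_normal, 2))
print("AIC Student-t:", aic(ll_t, 3))   # lower AIC => preferred despite the extra parameter
```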
Principle P10: Models that fit the past well may not fit the future well. The more parameters a model includes the better it can be made to fit the past, but this may worsen its fit to the future.
2.11 COMBINING DIFFERENT POSSIBLE SOURCE MECHANISMS
Although extreme events are probably most commonly analysed using selections of the distributional forms described above, the sheer variety of possibilities that then become available can make the subject very daunting to the non-expert mathematician. Is it possible to identify simpler ways of characterising different distributional forms?

One possible approach that we will reuse several times as this book progresses is to approximate the distribution in which we are interested by a distributional mixture of Normal distributions. We noted in Section 2.7 that such mixtures naturally lead to fat-tailed behaviour if the Normal distributions being combined in this manner all have the same mean. However, if they have different means then a wider range of distributional forms is possible. Indeed, if we have a sufficiently large number of arbitrary Normal distributions being (distributionally) mixed together then we can replicate arbitrarily accurately any distributional form. Consider, for example, the situation where we have a large number of Normal distributions each with standard deviation dμ and with means given by μ_{±k} = μ_0 ± k·dμ. If dμ is small enough then any distributional form can be approximated arbitrarily accurately using (distributional) mixing weights b_k for the kth distribution that match (up to a normalising constant) the value at μ_k of the pdf of the distributional form we are trying to replicate.

The quantile–quantile plot of a distributional mixture of two Normal distributions was illustrated in Figure 2.18 on page 38. We notice that it has limiting forms in the two tails which are linear. Far enough into the tails, observations will asymptotically come merely from the distribution with the largest standard deviation. We can therefore in general only replicate non-Normal limiting tail behaviour if we are distributionally mixing infinitely many Normal distributions. However, we cannot in any case estimate tail behaviour accurately beyond a certain point from any finite set of observations, so this is arguably not too serious an issue in practice.

In practice, if we are using this approach, we will want a parsimonious representation of the distributional form. This will involve (distributionally) mixing just a few Normal distributions. Different possible ways in which fat-tailed behaviour can appear can then be approximated by different ways in which the b_k for the relevant Normal distributions evolve through time.
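A minimal numerical sketch of this construction (ours; the Student-t target, grid spacing and number of components are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm, t

def normal_mixture_approximation(target_pdf, mu0=0.0, d_mu=0.05, n_components=400):
    """Means and mixing weights for a mixture of Normals (each with std dev d_mu)
    whose density approximates target_pdf, following the construction in the text."""
    k = np.arange(-n_components, n_components + 1)
    means = mu0 + k * d_mu
    weights = target_pdf(means)
    weights = weights / weights.sum()            # normalise the mixing weights b_k
    return means, weights

def mixture_pdf(x, means, weights, d_mu):
    x = np.atleast_1d(x)[:, None]
    return (weights * norm.pdf(x, loc=means, scale=d_mu)).sum(axis=1)

# Illustration: approximate a fat-tailed Student-t(4) density
means, weights = normal_mixture_approximation(lambda m: t.pdf(m, df=4))
grid = np.array([0.0, 1.0, 3.0, 5.0])
print(mixture_pdf(grid, means, weights, 0.05))   # close to...
print(t.pdf(grid, df=4))                         # ...the target density
```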
2.12 THE PRACTITIONER PERSPECTIVE

2.12.1 Introduction
In the previous sections of this chapter we concentrated on a mathematical consideration of how fat tails might arise. Here we adopt a more qualitative, practitioner-oriented perspective. We also highlight some immediate corollaries relating to how practitioners might best react to fat-tailed behaviour.

2.12.2 Time-varying volatility
We saw in Section 2.7 that an important source of fat tails is time-varying volatility (and, more generally, changes through time of any parameter used to describe the distributional form). A natural consequence should be a desire to understand and, if possible, to predict (and hence profit from) the way in which the world might change in the future.

Investors take views all the time on which ways markets might move going forwards. These usually involve views on the means of the relevant distributional forms. For example, an investor may think that one investment looks more attractively priced than another, believing that on average the first investment should outperform the second. We might therefore want to place greater emphasis on higher moments, e.g., variance, skew and kurtosis. Indeed, the growth of modern derivatives markets has made it possible to take views specifically on these moments as well. For example, it is nowadays possible to trade variance swaps. In return for a fixed payment stream, the purchaser of such a swap will receive the realised variance40 of the relevant underlying index between purchase date and maturity date. So, if it were possible for an investor to predict accurately the volatility that an index will exhibit in the future, then the investor could make a large profit by appropriately buying or selling such swaps. There are several other types of derivative, e.g., options, which also provide ways of profiting from views on the likely future volatility of market behaviour rather than just on likely future market direction. A corollary is that investors ought to be interested in the market implied views on such topics that are derivable from the prices of these instruments. We will explore this topic in more detail in Sections 6.8 and 9.5.

40 The realised variance of the index will typically be defined along the lines of Σ_t r_t^2, where the r_t are the daily index returns; see, e.g., Kemp (2009).

2.12.3 Crowded trades
We also saw in Section 2.7 that time-varying volatility does not appear to be the only contributor to fat tails, particularly on the downside. Another probable contributor is the existence of what are anecdotally called ‘crowded trades’. For example, as noted in Kemp (2009) and in the preface to this book, several quantitative fund managers suffered performance reversals in August 2007. Rather too many of their supposedly independent investment views all declined at the same time, and the returns that they posted were in some cases well into the extreme downside tail.

Any style of active fund management is potentially exposed to ‘crowded trades’, because any style can be linked to rapid growth of assets under management, commonality of investment ideas and so on. Hedge funds are potentially particularly exposed to ‘crowded’ trades, whether
they focus on less liquid strategies or stick to more straightforward styles, e.g., equity long-short. Hedge funds can (potentially unwittingly) end up adopting similar positions to those of their close competitors. They can thus be caught out merely because there is a change in risk appetite affecting their competitors, and the consequent position unwinds then affect their own positions. Crowded trade risk increases as the part of the industry that might be ‘crowding’ grows in size (as the quantitatively run part of the hedge fund industry had done prior to August 2007), but is latent until some set of circumstances triggers greater volatility in risk appetite.

Hedge funds, like other high conviction fund management styles, typically seem to exhibit more fat-tailed behaviour than more traditional fund types. They may therefore be more at risk from such fat tails, because another typical feature of hedge funds, namely leverage, adds risks such as liquidity risk, forced unwind risk and variable borrow cost risk that may be particularly problematic in fat-tail scenarios.

One response to the existence of crowded trades is to seek to understand whether and how the views that we might be adopting coincide with views that lots of other people are also adopting. If they do coincide and there is a stress, will our portfolio be able to ride out the storm, or will it be one of the portfolios forced to unwind at the ‘wrong’ time? Crowded trades, though often apparently relatively parochial, can sometimes encompass entire markets. Examples of the latter arguably include both the dot com boom and bust during the late 1990s and early 2000s and the events before and during the 2007–09 credit crisis.

2.12.4 Liquidity risk
A particular type of ‘crowded trade’ involves a latent exposure to liquidity risk. As we shall see in later chapters, most quantitative approaches to portfolio construction implicitly assume that the assets we might select from when constructing our portfolios are suitably liquid. This proved an invalid assumption for a wide range of investor types during the 2007–09 credit crisis. For example, some banks that funded lending to customers by borrowing short-term in the capital markets discovered that they were unable to roll over such borrowings in a timely fashion. Some funds, including some hedge funds, were unable to liquidate assets sufficiently rapidly to meet heightened investor redemption requests. Some funds of funds, including some hedge funds of funds, struggled to meet redemption requests from their own investors, because they themselves had invested in funds that had put up ‘gates’41 (or maybe needed to liquidate less desirable assets to avoid having to gate their own investor base).

41 The term ‘gate’ refers to the ability of such a fund to slow down redemptions when it becomes difficult or impossible for the fund to liquidate its own investments. For example, the fund might retain the right to stop more than 10% of units being liquidated during any particular month – if more investors want to redeem units than this amount then any excess may be carried forward to future months. Or the fund might have wide powers to defer redemptions until it was possible to sell the underlying assets (this is relatively common for property, i.e., real estate, funds where sale negotiations can take several months to conclude).

2.12.5 ‘Rational’ behaviour versus ‘bounded rational’ behaviour
Another response to the existence of crowded trades is to explore why it is that investors have a propensity to get into crowded trades in the first place. This may partly be to make sure, if possible, that their own investment processes do not unduly expose them to such risks. It may also partly be to position their portfolios to benefit from such behaviour by others.

It is unlikely that there will ever be general consensus on such matters. If there were, then the tendency to crowd would presumably move somewhere else. However, it is highly
probable that behavioural finance and market structure play an important role. For example, a market involving heterogeneous investors or agents (perhaps with different agents principally focusing on different market segments, or only able to buy and sell from a limited number of other agents) may naturally exhibit fat-tailed behaviour even when each agent is exhibiting ‘bounded rationality’ (i.e., fully rational behaviour within the confines of the bounds within which the agent is placed). This is because behavioural drivers unique to particular types of investor may be triggered in different circumstances. This again creates a form of distributional mixing and hence potential for fat tails. This idea is explored further in Palin et al. (2008) and Palin (2002), who develop (nonlinear) models of the economy and financial markets that naturally give rise to fat-tailed return distributions even though their underlying drivers are more classically Normal (or log-Normal) in form. Johansen and Sornette (1999) use a similar idea to model crashes, in which traders are particularly influenced by their ‘neighbours’.42

2.12.6 Our own contribution to the picture
Any discussion of the limitations of ‘rational’ behaviour should remind us that we ourselves may also be operating in a manner that may not always be viewed as ‘rational’ by others. We are part of the market rather than dispassionate observers. This has several ramifications, including:
(a) Investment arguments that we find persuasive may also be found persuasive by others, but this does not necessarily make them right. Feedback loops may exacerbate such behaviours and may create fat-tailed behaviour, as we saw when discussing ‘crowded trades’ in Section 2.12.3.
(b) Being part of the market implies that from time to time we will be buying and selling assets and potentially liabilities. Our behaviour, magnified by others (to the extent that they are tending to follow the same investment ideas as us), will influence the costs of buying and selling these assets and liabilities. Some types of risk, such as liquidity risk, are particularly influenced by transaction costs and how these costs might behave in the future.

Principle P11: Part of the cause of fat-tailed behaviour is the impact that human behaviour (including investor sentiment) has on market behaviour.
2.13 IMPLEMENTATION CHALLENGES

2.13.1 Introduction
Arguably, relatively few difficulties arise if our focus is on single return series, as long as we have access to the series themselves and to suitable software with which to manipulate and analyse them (or we can create such software ourselves).

42 Johansen and Sornette (1999) make the interesting observation that crashes caused by such factors are exactly the opposite of the popular characterisation of crashes as times of chaos. They view disorder, characterised by a balanced and varied opinion spectrum, as the factor that keeps the market liquid in normal times. When opinions cease to be varied, markets can move very substantially very quickly.
Market data, at least for mainstream markets, is often relatively easy to come by, if you are prepared to pay for it. There are now many different providers of such data. Several companies, e.g., Bloomberg, Thomson Reuters, FT Interactive Data, MSCI Barra, Dow Jones and MarkIt Partners, specialise in collating live (and historic) market data for onward provision to their clients. The larger ones also typically provide newsfeeds alongside these data feeds (or, conversely, may provide market data as an adjunct to news provision). Their clients, usually fund managers or traders, often find qualitative information on market developments helpful alongside quantitative data. Alongside these data vendors are the index providers themselves, e.g., MSCI, FT, Standard & Poor's and Dow Jones. Some data vendors and investment banks also calculate their own indices, as do some industry bodies, particularly for niche market areas such as hedge funds or private equity.

Some of the data vendors, particularly the larger ones, also provide analytical systems for manipulating data (which may include third party tools, e.g., risk systems). The tendency recently has been towards integrated order management systems, e.g., ones that can capture and analyse individual portfolio positions and transactions alongside more general market data.

Software specifically geared towards analysis of fat-tailed behaviour is not quite so readily accessible. The analytical toolkits of the providers mentioned above do not (yet) typically contain many elements specifically geared to such activities. Practitioners interested in more complicated mathematical manipulation of market data often resort to building their own toolkits, perhaps using computer languages and/or software libraries that are tailored towards carrying out such analyses. More sophisticated mathematics packages such as Matlab, Mathematica and Maple tend to come to the fore if the analysis includes more complex mathematical manipulations, e.g., ones involving symbolic algebra43 or more sophisticated numerical computations such as numerical integration, root finding, or minimisation or maximisation of more complicated mathematical expressions or functions.

As explained in the Preface, an online toolkit capable of carrying out most of the analyses described in this book is available through www.nematrian.com. Its tools are specifically designed to be accessible through standard spreadsheet systems such as Microsoft Excel and/or through more sophisticated programming environments. Such access routes may be particularly relevant to users who want to use the tools repeatedly. The tools can also be accessed interactively using relevant pages on that website, for those users who are less interested in repeatability and more interested in a quick one-off analysis.

2.13.2 Smoothing of return series
Leaving aside these more generic issues, perhaps the most important challenge that can arise with individual data series is that the observable series may not fully correspond to the ‘true’ underlying behaviour we are trying to analyse. This is a particular issue with market data that is smoothed in some way, or that incorporates in its creation flexible time lags. These issues are discussed further in Kemp (2009) in the context of calibration to market data, because it may be necessary to ‘de-smooth’ the data for such purposes. Smoothing is particularly relevant to less liquid types of asset (or liability) or to valuations of pooled vehicles that hold such assets.
43 By symbolic algebra we mean manipulations that directly involve mathematical functions. Thus a symbolic algebra package asked to calculate the differential with respect to x of y = x^2 will be able to tell you that it is dy/dx = 2x, rather than merely being able to provide you with a numerical answer for some specifically chosen value of x.

For example, valuation methodologies used by surveyors
in real estate markets typically appear to introduce smoothing; see, e.g., Booth and Marcato (2004). Surveyor valuations typically seem to overestimate the prices at which transactions can actually be completed when the market is depressed, and to understate them when the market is buoyant. As a result, surveyor valuations also seem to exhibit autocorrelation and an artificially low volatility relative to what one might have expected to apply on intrinsic economic grounds. The smoothing does not need to be intentional; it can arise merely because of unconscious behavioural biases that creep into the pricing process.44 For example, there is a natural tendency to benchmark what is considered a sensible price quotation by reference to the last transaction in the instrument.

Price smoothing should show up as autocorrelation in the return series. It can therefore be unwound by de-correlating the return series r_t, e.g., by assuming that there is some underlying ‘true’ return series r̃_t and that the observed series derives from it via, say, the formula r_t = (1 − ρ) r̃_t + ρ r̃_{t−1}, estimating ρ (from the autocorrelation characteristics of r_t) and then backing out r̃_t; a sketch of this calculation is given at the end of this subsection.

There is another interpretation for ρ. In effect, the de-correlation approach is assuming that the observed behaviour is akin to a moving average of the underlying behaviour. 1/ρ then represents the natural scale (in time period units) over which this moving average behaviour is occurring. Thus implicit within the above methodology is an assumption that the timescale over which such smoothing occurs (measured in units corresponding to the length of the time periods over which the data is available) is reasonably constant through time. If this is felt to be an inaccurate assumption then some time dependency for ρ would need to be incorporated in the analysis. A particular example of this might be if smoothing behaviour is believed to be driven by ‘stale pricing’, i.e., by price data that is not changed until a new transaction takes place,45 and there is some good reason to believe that the average historicity of prices contributing to the return series in question has changed (e.g., markets have become more or less liquid over the period being analysed).46 In principle more complicated de-correlation adjustments could also be adopted, as per Section 4.7, if the assumed market dynamics warrant this, but such adjustments are rarely used in practice.

44 The fact that Booth and Marcato (2004) find evidence for such behaviour with surveyors’ valuations indicates that the issue does not necessarily go away merely by using independent valuers, because this is the norm in the real estate market.
45 The term ‘stale pricing’ is also used to refer to instances where the price should have been updated (because, e.g., there have been subsequent market transactions) but has not been. Checking for stale prices is thus an important component in checking that the pricing process being applied to a portfolio seems to be working effectively.
46 Typically, secular changes like these are ignored if the period of analysis is not particularly long, unless there is a clear market structure discontinuity (e.g., the launch of a new trading venue) that invalidates this assumption. However, when analysing extreme events we typically try to use the longest histories possible, to maximise the number of extreme events that might be present within the dataset. Over very long time periods the assumption that secular changes can be ignored becomes more suspect.
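The de-smoothing step referred to above can be sketched as follows in Python (our illustration, not the Nematrian toolkit). It uses the relationship r_t = (1 − ρ) r̃_t + ρ r̃_{t−1} given in the text and additionally assumes that the underlying ‘true’ returns are serially uncorrelated, so that ρ can be backed out from the observed lag-1 autocorrelation; the seed value used to start the recursion is also an assumption.

```python
import numpy as np

def estimate_rho(observed):
    """Back out rho from the lag-1 autocorrelation a1 of the observed series, using
    a1 = rho*(1-rho) / (rho**2 + (1-rho)**2), which holds if the underlying 'true'
    returns are serially uncorrelated (an assumption)."""
    r = np.asarray(observed, dtype=float)
    r = r - r.mean()
    a1 = np.dot(r[1:], r[:-1]) / np.dot(r, r)
    # Solve (2*a1 + 1)*rho**2 - (2*a1 + 1)*rho + a1 = 0, taking the root < 0.5
    c = 2.0 * a1 + 1.0
    disc = max(c * c - 4.0 * a1 * c, 0.0)
    return (c - np.sqrt(disc)) / (2.0 * c)

def de_smooth(observed, rho=None):
    """Recursively back out the deemed 'true' returns from the observed ones."""
    r = np.asarray(observed, dtype=float)
    if rho is None:
        rho = estimate_rho(r)
    true_returns = np.empty_like(r)
    true_returns[0] = r[0]                      # seed value (an assumption)
    for t in range(1, len(r)):
        true_returns[t] = (r[t] - rho * true_returns[t - 1]) / (1.0 - rho)
    return rho, true_returns

# Usage sketch: the de-smoothed series should show higher volatility and much
# lower autocorrelation than the observed (smoothed) series.
rng = np.random.default_rng(4)
underlying = 0.02 * rng.standard_normal(500)
observed = 0.7 * underlying + 0.3 * np.concatenate(([0.0], underlying[:-1]))
rho_hat, recovered = de_smooth(observed)
print(rho_hat, underlying.std(), observed.std(), recovered.std())
```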
2.13.3 Time clocks and non-constant time period lengths
Typically we make the implicit assumption that, all other things being equal, the ‘natural’ time clock of the processes driving the market behaviour we are analysing progresses uniformly in line with ‘proper’ (i.e., circadian) time. If this is not the case then it is generally appropriate to re-express the passage of time in units that we do think progress approximately uniformly. For example, in the absence of evidence to the contrary, we might assume that the arrival of new market information is approximately uniform per business day, rather than per actual day. This conveniently accords with how daily return series are usually built up, which is from end-of-day published index levels. These are only present for business days, i.e., not over weekends. This then creates some uncertainty over how to handle days that are public holidays in some jurisdictions but not in others. As there are only relatively few such days each year, this nicety is often ignored.

Another situation where this point can become relevant is around individual company announcements. Individual equities generally become more volatile around the time that their corporate results are announced; their volatility then tapers off again. These times typically correspond to when the highest amount of new market information is becoming available in relation to these equities.

Sometimes, even after making these refinements, the return data involves time periods that are not of constant deemed time ‘length’. In such circumstances:
(a) De-correlation adjustments as per Section 2.13.2 should ideally be adjusted so that 1/ρ follows the desired behaviour in ‘proper time’ (after allowing for any deemed secular adjustments).
(b) Most distributional moments, such as volatility, skewness and kurtosis, should ideally not be calculated by giving equal weight to the individual observations, but by giving greater weight to observations that correspond to longer ‘proper’ time periods; see, e.g., Kemp (2010).

2.13.4 Price or other data rather than return data
Ideally, our analysis of market behaviour should focus on returns, because these correspond best to the overall relative economic impact exhibited by different portfolios. To be more precise, the focus should ideally be on log-returns, given the geometric behaviour of returns and hence linear behaviour of log-returns (see Section 2.3.1).47 However, historic data series going back a long time often focus more on price movements than on return data. They do not necessarily take account of reinvested income, perhaps because it was not efficiently captured at the time. Estimates of volatility, skew, kurtosis and so on are usually little affected by this nicety, as long as income accrues in relatively small increments through time.48 However, it is important to include such sources of return when calculating mean returns, because the smoothness of the accrual does not affect the magnitude of the impact that income has on this statistic.

We can also analyse factors that contribute to the returns on different instruments, rather than the returns themselves. For example, we might analyse yield curve movements. There is often, however, no good reason why these factors should follow approximately Normal behaviour (except perhaps over quite short timescales). For example, bond yields typically cannot fall below zero except in highly unusual circumstances.

47 Sometimes it is important to bear in mind the precise tax position of the investor. Usually the investment return series practically available concentrate on the position for mainstream institutional investors such as pension funds, who normally pay little if any income or capital gains taxes.
48 This would generally be the case for market indices, but may be less true for individual securities. For example, a company might pay a large one-off special dividend as part of a restructuring or to facilitate tax management by its shareholders. Its share price might then fall dramatically as a consequence, even though the total return to shareholders might follow a much smoother trajectory.

2.13.5 Economic sensitivities that change through time
A particular issue that encapsulates all the above points arises with instruments like bonds whose intrinsic economic sensitivities alter as time progresses. A bond that currently has a term to maturity of ten years will only have a term to maturity of nine years in one year’s
time. Thus the price behaviour of bonds exhibits a pull to par through time. We may expect the bond to be redeemed at par when it matures (assuming that the issuer has not defaulted in the meantime).

One common way of handling this issue is to create, for each time period separately, a sub-index by bracketing together instruments of a particular type and maturity. For example, we might bracket together bonds issued in a particular currency with a time to maturity within a particular range and with a particular credit rating status. We can then chain-link together the sub-index returns over different periods.49 A possible weakness of such an approach, however, is that the economic sensitivities of such categories can themselves change through time. For example, the duration of a bond of a given term typically reduces as yields rise, and so focusing through time on instruments with a given average time to maturity may not be as appropriate as focusing on instruments with a given average duration.50

A more sophisticated variant is to include explicit pricing algorithms in our analysis that link the behaviour of the instruments in which we are interested with factors, i.e., economic sensitivities, which we think do have a greater uniformity of ‘meaning’ through time. We would then use these pricing algorithms to derive from observed prices the applicable values of these factors at any point in time, and then reverse the pricing algorithm to determine the return behaviour of notional instruments deemed to have constant intrinsic exposures to economic sensitivities through time. For example, we might derive an average yield on all bonds of a particular duration and credit rating status and then use this yield to derive the return behaviour that we deem an instrument with, say, a constant duration to have exhibited. Classification of equities by industry or sector can be thought of as a simplified version of this approach, in which the economic sensitivities of the categories used are assumed to be constant within each individual element of the categorisation. The step involving calibration to market and the step involving subsequent re-pricing are then exact inverses of each other. Several times later on in this book we will come across the need, at least in principle, to have pricing algorithms available to map observed instrument behaviour onto deemed economic factor behaviour and vice versa.
49 By ‘chain-link’ we mean calculating the return over a composite time period 0 to T, say, by calculating the returns r_1, ..., r_n for individual sub-periods 0 to t_1, t_1 to t_2, ..., t_{n−1} to t_n (= T), using just the instruments that fell within the relevant category at the relevant time, and then calculating the overall return r using 1 + r = (1 + r_1) × · · · × (1 + r_n).
50 The duration of a bond is the weighted average of the times to payment of future income and maturity proceeds, with weights equal to the present values of the corresponding payments.
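As a simple illustration of the chain-linking described in footnote 49, the following sketch (ours) compounds per-period sub-index returns, each computed from whatever instruments fell within the category during that period, into an overall return; the constituent returns shown are hypothetical.

```python
import numpy as np

def chain_link(period_returns):
    """Compound per-period returns r_1,...,r_n into an overall return r,
    using 1 + r = (1 + r_1) x ... x (1 + r_n)."""
    return np.prod(1.0 + np.asarray(period_returns, dtype=float)) - 1.0

# Each period's return is the average return of the instruments that fell
# within the relevant category during that period; membership can change.
period_constituent_returns = [
    [0.010, 0.012, -0.004],           # period 1 constituents
    [-0.020, -0.015],                 # period 2 constituents (membership has changed)
    [0.005, 0.007, 0.006, 0.004],     # period 3 constituents
]
sub_index_returns = [np.mean(r) for r in period_constituent_returns]
print(chain_link(sub_index_returns))
```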
3 Fat Tails – In Joint (i.e., Multivariate) Return Series

3.1 INTRODUCTION
We noted earlier how some quantitative fund managers posted particularly poor performance in August 2007. Movements in any one quantitative style, although sizeable, were not obviously particularly extreme. What was problematic was the extent to which multiple factors that were supposed to behave independently all moved simultaneously in an adverse direction. In short, it was the joint fat-tailed behaviour of equity market elements that proved particularly problematic for them at that time.

In this chapter we focus on how to refine the techniques described in Chapter 2 to cater for fat-tailed behaviour in the joint behaviour of multiple return series. Many of the conclusions we drew in Chapter 2 also apply here. By ‘fat-tailed’ we again mean extreme outcomes occurring more frequently than would be expected were the returns to be coming from a (log-)Normal distribution. However, instead of focusing on univariate Normal distributions we now focus on multivariate Normal distributions. Univariate Normal distributions are characterised by their mean and standard deviation. Multivariate Normal distributions are more complicated than a series of univariate distributions, because they are characterised not only by the means and standard deviations of each series in isolation but also by the correlations between the different series.
3.2 VISUALISATION OF FAT TAILS IN MULTIPLE RETURN SERIES
Effective visualisation of deviations from (now multivariate) Normality in co-movements of multiple series is more challenging than effective visualisation of fat tails in the univariate case. Even when considering just two return series in tandem, there are an infinite number of ‘standard’ bivariate Normal probability density functions, because their definition includes an extra parameter, ρ, corresponding to the correlation between the two series. This can take any value between −1 and +1 (if ρ = 0 then the two series are independent). With three or more series, we run out of dimensions for plotting joint behaviour (and there is even more flexibility in the correlation matrix structure that a ‘standard’ trivariate or higher dimensional Normal distribution might take).

The (joint) probability density functions corresponding to three ‘standard’ bivariate Normal probability distributions are shown in Figures 3.1 (ρ = 0), 3.2 (ρ = −0.3) and 3.3 (ρ = +0.6). These each have a hump in the middle that falls away at the edges (i.e., at the ‘tails’ of these bivariate distributions). In each case, cross-sections through the middle of the probability density function plot have the appearance of the traditional bell-shaped form of the univariate Normal distribution.
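Densities of the kind plotted in Figures 3.1–3.3 can be reproduced along the following lines (a sketch using scipy; the grid range and spacing are our choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf_grid(rho, lo=-4.0, hi=4.0, n=81):
    """Evaluate the standard bivariate Normal pdf with correlation rho on a square grid."""
    x = np.linspace(lo, hi, n)
    xx, yy = np.meshgrid(x, x)
    cov = [[1.0, rho], [rho, 1.0]]
    pdf = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(np.dstack((xx, yy)))
    return x, pdf

# Densities corresponding to Figures 3.1-3.3
for rho in (0.0, -0.3, 0.6):
    x, pdf = bivariate_normal_pdf_grid(rho)
    print(rho, pdf.max())   # the central peak height increases with |rho|
```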
[Figure 3.1: Standard bivariate Normal probability density function with ρ = 0. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
It is not easy to spot much difference between these charts, particularly not between Figures 3.1 and 3.2. The difference between the three charts can primarily be identified by considering different cross-sectional slices of the plot through an axis passing through the middle of the plot. The cross-sections are in general wider or narrower in different directions.
[Figure 3.2: Standard bivariate Normal probability density function with ρ = −0.3. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
[Figure 3.3: Standard bivariate Normal probability density function with ρ = +0.6. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
If ρ is positive, as in Figure 3.3, then such cross-sections are widest towards the corners corresponding to the two series moving in the same direction (hence outcomes in which the two series move in the same direction are more likely than those where they move in opposite directions), whereas if ρ is negative, as in Figure 3.2, the opposite is the case.

As with univariate distributions, we can also display cumulative distribution plots (cdfs), i.e., the likelihood of the (joint) outcome not exceeding a particular value. Unfortunately, it is often even harder to spot visually differences between (joint) cdfs than it is to spot differences in (joint) pdfs; see, e.g., Kemp (2010).
[Figure 3.4: Cumulative distribution function for the standard bivariate Normal with ρ = 0. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
[Figure 3.5: Cumulative distribution function for the standard bivariate Normal with ρ = 0.95. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
If the difference in ρ is large enough, however, then it does become more practical to do so. For example, joint cdfs of bivariate Normal distributions with ρ = 0 and ρ = 0.95 are shown in Figures 3.4 and 3.5 respectively.
3.3 COPULAS AND MARGINALS – SKLAR’S THEOREM

3.3.1 Introduction
Perhaps the most common (relatively sophisticated) way, within the financial community, of characterising more effectively the shape of the (multivariate) return distribution is to decompose the problem into two parts:
(a) A part relating to how fat-tailed each return series is in isolation, via the shape of their marginal probability distributions (i.e., the distributional form relevant to each series in isolation, ignoring any information about how the other series might move).
(b) A part relating to how the different return series might co-move together, having first standardised all their original marginal distributions. The function most commonly used to characterise these co-movement characteristics is the copula of the distribution.

The tools we introduced in Chapter 2 can be applied to tackle each marginal distribution, so the ‘new’ element in this chapter is how to focus on the remainder, i.e., the copula. By analysing both sources in tandem it is possible to identify what proportion of any overall fat-tailed behaviour comes from each part.

The definition of a copula is a function C : [0, 1]^n → [0, 1] where
(a) there are (uniform) random variables U_1, ..., U_n taking values in [0, 1] such that C is their cumulative (multivariate) distribution function; and
(b) C has uniform marginal distributions, i.e., for all i ≤ n and u_i ∈ [0, 1] we have C(1, ..., 1, u_i, 1, ..., 1) = u_i.

The basic rationale for copulas is that any joint distribution F of a set of random variables X_1, ..., X_n, i.e., F(x) = Pr(X_1 ≤ x_1, ..., X_n ≤ x_n), can be separated into two parts. The first is the combination of the marginals, which (if expressed in cdf form) are F_i(.) where F_i(x) = Pr(X_i ≤ x), and the second is the copula that describes the dependence structure between the random variables. Mathematically, this decomposition relies on Sklar's theorem, which states that if X_1, ..., X_n are random variables with marginal distribution functions F_1, ..., F_n and joint distribution function F then there exists an n-dimensional copula C such that

F(x) = C(F_1(x_1), ..., F_n(x_n))  for all x ∈ R^n    (3.1)

i.e., C is the joint cdf of the unit random variables (F_1(x_1), ..., F_n(x_n)). If F_1, ..., F_n are continuous then C is unique.

A particularly simple copula is the Product copula Π^n(x) = ∏_{i=1}^{n} F_i(x_i) (also called the Independence copula). It is the copula that applies to independent random variables. Because the copula completely specifies the dependency structure of a set of random variables, the random variables X_1, ..., X_n are independent if and only if their n-dimensional copula is the Product copula.

The copula most commonly used in practice is probably the Gaussian copula (for a given correlation matrix). It is the copula applicable to a multivariate Normal distribution with that correlation matrix. In the special case where all the variables are independent, the covariance matrix has nonzero terms only along the leading diagonal, and the (Gaussian) copula for a Normal distribution with such a correlation matrix is equal to the Product copula. All Normal distributions with the same correlation matrix give rise to the same copula, because the standard deviations that also drive the covariance matrix are part of the characterisation of the marginals, rather than of the dependency structure characterised by the copula.

Unfortunately, although any multivariate distribution can be decomposed into these two parts, this does not necessarily mean that copulas are much better for visualising joint fat-tailed behaviour than the cdfs from which they are derived; see Kemp (2010). Once again, however, if the difference in ρ is large enough then it does become more practical to do so. Copulas corresponding to bivariate Normal distributions with ρ = 0 and ρ = 0.95 are shown in Figures 3.6 and 3.7 respectively.

Rather easier to distinguish are differences between two copulas.1 The difference between copulas corresponding to the distributions used in Figures 3.1 and 3.2 (i.e., with ρ = 0 and ρ = −0.3) is shown in Figure 3.8. It is peaked upwards at the corners corresponding to situations where the two series move in opposite directions (which become more likely when ρ is negative) and peaked downwards at the other two corners (which become less likely when ρ is negative).
1 To be more precise, we show here differences between copula gradients, which are functions of the form C_{xy} = ∂²C/∂x∂y, or equivalently copula densities (see Section 3.4).
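By way of illustration, the bivariate Gaussian copula can be evaluated directly via Sklar's theorem as C(u_1, u_2) = Φ_ρ(Φ^{-1}(u_1), Φ^{-1}(u_2)), where Φ_ρ is the bivariate Normal cdf with correlation ρ and Φ^{-1} the inverse standard Normal cdf. A sketch using scipy (ours, not the Nematrian toolkit):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula(u1, u2, rho):
    """Bivariate Gaussian copula C(u1, u2) for correlation rho."""
    z = np.column_stack((norm.ppf(u1), norm.ppf(u2)))
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(z)

u = np.array([0.1, 0.5, 0.9])
print(gaussian_copula(u, u, 0.0))    # Product (independence) copula: u1*u2
print(gaussian_copula(u, u, 0.95))   # strong positive dependence: close to min(u1, u2)
```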
[Figure 3.6: Gaussian copula with ρ = 0, also known as the Product or Independence copula. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
3.3.2 Fractile–fractile (i.e., quantile–quantile box) plots
A closely related concept is a fractile–fractile or quantile–quantile box plot. Fractiles are the generalisation of percentiles, quartiles and the like.
[Figure 3.7: Gaussian copula with ρ = 0.95. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
[Figure 3.8: Difference between the Gaussian copula (ρ = −0.3) and the Product copula (expressed as a difference between copula gradients/densities). © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]
For example, suppose we were analysing the co-movement characteristics of two different return series. We might take each series in isolation and slice up the observed returns into, say, 20 buckets, with the top 5% put into the first bucket, the next 5% into the second and so on. We might then plot the number of times the observed return for the first series was in bucket x while the observed return for the second series was in bucket y. Another term used for this sort of analysis is box counting (see also Abarbanel (1993) and Kemp (2010)); a simple implementation is sketched below. Such a chart is essentially the same (up to a constant) as one that indicates how the gradient of the copula applicable to a particular distribution differs from the gradient of the Product copula.

An example of how such plots operate is shown in Figures 3.9 and 3.10. Figure 3.9 is a scatter plot of weekly (log) sector relative returns on the MSCI ACWI Utilities sector versus (log) sector relative returns on the MSCI ACWI Transport sector for the period 9 March 2001 to 4 July 2008, sourced from MSCI and Thomson Datastream.2 Figure 3.10 is based on the same data but shows merely the ranks of these sector relatives within a listing of all that sector's relative returns; i.e., it strips out the effect of the sector's marginal distribution. Figure 3.11 groups these ranks into deciles, and shows the number of returns falling into each decile–decile pairing.

All other things being equal, if we double the number of fractiles used, we can expect the number of observations falling into a given pairing to fall by a factor of four. For example, if we have 20 quantile boxes (i.e., using increments of 5%) then we would need 400 observations to expect to have more than one observation on average within each quantile pairing. This considerably limits the usefulness of quantile–quantile box analysis on a single sector pairing. However, we can instead look at numbers across all sector pairings simultaneously.
2 Weekly return data on these indices are available from end 2000 but we start 10 weeks into this period to allow us to include, later on in this chapter, adjustments relating to shorter-term time-varying volatility of returns, which we estimate using the preceding 10 weeks’ returns.
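A decile–decile (box counting) tabulation of the kind shown in Figure 3.11 can be produced along the following lines (our sketch; the choice of 10 fractiles matches the figure, and the simulated data merely stands in for actual sector relative returns):

```python
import numpy as np

def fractile_fractile_counts(x, y, n_fractiles=10):
    """Count joint occurrences of the fractile (e.g., decile) ranks of two series."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # Rank each series in isolation, then map ranks to fractile buckets 0..n_fractiles-1
    bx = (np.argsort(np.argsort(x)) * n_fractiles) // n
    by = (np.argsort(np.argsort(y)) * n_fractiles) // n
    counts = np.zeros((n_fractiles, n_fractiles), dtype=int)
    for i, j in zip(bx, by):
        counts[i, j] += 1
    return counts

# Usage sketch on correlated simulated data: counts pile up along the diagonal,
# and (for fat-tailed data) particularly in the corner boxes.
rng = np.random.default_rng(5)
common = rng.standard_normal(382)
x = common + 0.8 * rng.standard_normal(382)
y = common + 0.8 * rng.standard_normal(382)
print(fractile_fractile_counts(x, y))
```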
[Figure 3.9: Scatter plot of weekly sector relative returns for MSCI ACWI Utilities versus Transport. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd. Source: Kemp (2009), Thomson Datastream.]
Expected numbers in each fractile–fractile pair increase about four-fold as the number of sectors we can pair up is doubled. In Figure 3.12 we show such an analysis, considering all pairs of the 23 MSCI industry sectors that have a continuous history over the period 30 May 1996 to 28 February 2009 (and all +/– combinations, so that the plot is symmetrical around each corner). The (excess) kurtosis of each of these sectors (the ordering is not important in this context) is shown in Figure 3.13 on page 71. In this chart the dotted lines show the 5th and 95th percentile confidence limits given the null hypothesis that each relative return series is Normal, assuming that the sample is large enough for the asymptotic tests referred to in Section 2.4.6 to apply. We see that most sector relative return series appear to be fat-tailed, some quite significantly.

A noticeable feature of Figure 3.12 is the four peaks in each of the four corners of the (sector pair averaged) fractile–fractile plot. This, arguably, is a visual articulation of the phenomenon often articulated by risk managers that all correlations seem to go to ‘unity’ in adverse times. By this is actually meant that they seem to have the annoying habit (unless the risk management is top-notch) of going to +1 or −1, depending on which is the worse outcome.

However, there is a possible error of logic involved in drawing this conclusion. We have already seen in Section 2.7 that (distributional) mixtures of univariate Normal distributions are typically not themselves Normally distributed. We might expect the same to apply to (distributional) mixtures of bivariate Normal distributions. Even if (log) sector relative return distributions were accurately modelled by a multivariate Normal distribution, we should expect the corresponding bivariate Normal distributions for each specific sector pairing to differ, because they can be expected in general to have different pair-wise correlations.
[Figure 3.10: Scatter plot of ranks of weekly sector relative returns for MSCI ACWI Utilities versus Transport. © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd. Source: Kemp (2009), Thomson Datastream.]
This mixing might create the same effect – i.e., the four peaks might merely be a feature of there being some sectors with relatively high positive and negative correlations.

One way of eliminating this possible error of logic is to identify series that in some sense contain exactly the same information but are all uncorrelated with each other. The most common way of doing this is to identify the principal components corresponding to the original data series (see Section 4.3). In aggregate they span the same data3 but also happen to be uncorrelated with each other. If mixing of distributions were the only source of the peaking in the four corners, we should therefore expect peaking in the four corners not to appear in a quantile–quantile box plot of the principal component series. Instead, the peaking does still arise, albeit less than for the original data series (see Figure 3.14 on page 71). In Figure 3.14 we have given greater weight to the more significant principal components (by weighting by the magnitude of the corresponding eigenvalue), because the smaller principal components may principally reflect random noise (see Section 4.3.6). Indeed, only the first circa six to ten eigenseries appear to be statistically significant, based on the cut-off described there (see Figure 3.15 on page 72).

3 By ‘span’ the same data we mean that if the original data series are x_{i,t} and the corresponding principal component series are y_{i,t} then each x_i can be expressed as some linear combination of the y_i, i.e., x_{i,t} = Σ_j a_{i,j} y_{j,t} for some suitable choice of a_{i,j} (which do not vary with t).
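A sketch of how such uncorrelated principal component series (eigenseries) might be extracted from a set of return series, using an eigendecomposition of the sample covariance matrix (our own illustration; see Section 4.3 for the book's fuller treatment):

```python
import numpy as np

def principal_component_series(returns):
    """Return eigenvalues (descending) and the corresponding uncorrelated
    principal component series for a (T x N) array of return series."""
    x = np.asarray(returns, dtype=float)
    x = x - x.mean(axis=0)                      # de-mean each series
    cov = np.cov(x, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]       # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    eigenseries = x @ eigenvectors              # T x N matrix of uncorrelated series
    return eigenvalues, eigenseries

# Usage sketch: the eigenseries are (sample) uncorrelated and together span
# exactly the same data as the original series.
rng = np.random.default_rng(6)
raw = rng.standard_normal((670, 23)) + 0.5 * rng.standard_normal((670, 1))
vals, series = principal_component_series(raw)
print(np.round(np.corrcoef(series, rowvar=False)[:3, :3], 6))  # off-diagonal terms ~0
```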
[Figure 3.11: Decile–decile plot between MSCI ACWI Utilities and Transport sectors (number of occurrences in each decile pairing). © John Wiley & Sons, Ltd. Reproduced by permission of John Wiley & Sons, Ltd. Source: Kemp (2009), Thomson Datastream.]
[Figure 3.12: Fractile–fractile plot of sector relative return rankings showing the number of observations in each fractile pairing, averaged across all sector pairings and all +/– combinations of such pairs. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
[Figure 3.13: Excess kurtosis of each sector relative return series (x-axis shows the number of the relevant data series), with 5th/95th percentile confidence limits. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
In Figure 3.15, the most important principal components – i.e., the ones contributing most to the overall variability of the different original sector returns (and therefore the ones given most weight in Figure 3.14) – are the ones on the left hand side of the chart. Only those to the left of where the solid and dotted lines join are distinguishable from random noise. The lack of statistical significance for most of the smaller principal components is probably why the typical (excess) kurtosis we saw in Figure 3.13 is concentrated in the more important principal components (see Figure 3.16). The principal components extracted in this manner show little evidence of skewness (see Figure 3.17); most of them are within the 5th to 95th percentile confidence levels derived by using the large sample asymptotic values set out in Section 2.4.6, assuming that the underlying series are Normally distributed.
[Figure 3.14: Fractile–fractile plot of principal component rankings of sector relative returns showing the number of observations in each fractile pairing, averaged across all principal component pairings and all +/– combinations of such pairs. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
[Figure 3.15: Magnitudes of the eigenvalues of each principal component derived from the relative return series used in Figure 3.13 (most important principal components to the left of the chart), shown as % of total, with RMT cut-off. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
3.3.3 Time-varying volatility
As we noted in Chapter 2, it is widely accepted within the financial services industry and within relevant academic circles that markets exhibit time-varying volatility. Markets can, possibly for extended periods of time, be relatively ‘quiet’, e.g., without many large daily movements, but then move to a different ‘regime’ with much larger typical daily movements. We have also noted that time-varying volatility can lead to fat tails (see Section 2.7). Indeed, it appears to be a significant contributor to the fat-tailed behaviour of major developed equity market indices, particularly upside fat tails. However, volatility does not necessarily always appear to move in tandem across markets, or even across parts of the same market. Time-varying volatility also seems to explain a material fraction of the clumping into the corners still seen in Figure 3.14.

There are two different ways of adjusting for time-varying volatility in multiple return series. We can apply a longitudinal adjustment, adjusting each series in isolation by its own recent past volatility, in the same way as was done in Figure 2.17. Alternatively, we can adjust every series simultaneously by reference to changes in the recent past cross-sectional dispersion of sector returns. In Figures 3.18 and 3.19 we show how the fractile–fractile plots shown above would alter if we incorporate such adjustments.
[Figure 3.16: Excess kurtosis of each principal component derived from the relative return series used in Figure 3.13 (most important principal components to the left of the chart), with 5th/95th percentile confidence limits. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
[Figure 3.17: Skewness of each principal component derived from the relative return series used in Figure 3.13 (most important principal components to the left of the chart), with 5th/95th percentile confidence limits. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.]
Figure 3.18 adjusts for recent past changes in each individual series' own volatility, while Figure 3.19 adjusts for changes in recent past cross-sectional dispersion.4 In both cases we show a composite plot that groups together several different sub-charts corresponding to Figures 3.12–3.17. In these figures the term ‘eigenseries’ is used synonymously with ‘principal component’, highlighting the link between principal components and eigenvalues (see Section 4.3).

The clumping into the four corners of the fractile–fractile plot is materially reduced by such adjustments. As with single return series in isolation, a material fraction (but not all) of joint fat-tailed behaviour (at least for mainstream equity sectors) does appear to be explained by time-varying volatility. Indeed, based on Figure 3.19, we might believe that we had identified a methodology using cross-sectional adjustments that nearly eliminated fat-tailed behaviour. The fractile–fractile plot of the principal components is much closer to flat than before, and the kurtosis of the individual principal components (at least the ones that appear ‘real’ in the context of random matrix theory) no longer looks extreme compared to the 5th to 95th percentile (large sample) confidence limit applicable to a Normally distributed sample.

Unfortunately, the results turn out not to be as good as Figure 3.19 suggests when viewed ‘out-of-sample’. We explore out-of-sample backtesting further in Section 5.9. Suppose for a given portfolio that we calculate at the start of each period an estimated tracking error for the portfolio, i.e., an estimate of what standard deviation we expect its return to exhibit over the coming period. Suppose we calculate the tracking error using a covariance matrix for the period in question that has been derived merely from return data for time periods prior to the period in question. To ensure that there is sufficient data to allow suitable estimation of the covariance matrix even early on, we start the analysis 36 months into the dataset. Suppose that we also compare the actual return applicable to that portfolio with its standard deviation, i.e., we normalise each period's return by reference to the estimated tracking error. If a cross-sectional adjustment really had largely explained fat-tailed behaviour, then the normalised returns after adjusting for time-varying volatility in this manner should be approximately Normally distributed.

4 When adjusting each series in isolation by its own recent past volatility we have used the standard deviation of the preceding 10 weeks' relative returns. When adjusting every series simultaneously by reference to changes in the recent past cross-sectional dispersion of sector returns we have used the average of the preceding 10 weeks' cross-sectional dispersion (calculated using the cross-sectional standard deviation of normalised sector relative returns, where ‘normalised’ here means scaling by the overall standard deviation of that sector's sector relatives over the entire period under analysis, to avoid giving undue weight to the behaviour of a small number of sectors).
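The two adjustment styles described above (and specified more precisely in footnote 4) can be sketched as follows. This is our own illustration rather than the code used for the figures; it uses a 10-week trailing window as in the footnote and is applied to a T x N array of weekly sector relative returns (here simulated).

```python
import numpy as np

def longitudinal_adjustment(returns, window=10):
    """Scale each series by its own trailing volatility (one column per series)."""
    r = np.asarray(returns, dtype=float)
    out = np.full_like(r, np.nan)
    for t in range(window, r.shape[0]):
        trailing_vol = r[t - window:t].std(axis=0, ddof=1)   # per-series volatility
        out[t] = r[t] / trailing_vol
    return out

def cross_sectional_adjustment(returns, window=10):
    """Scale every series simultaneously by trailing average cross-sectional dispersion."""
    r = np.asarray(returns, dtype=float)
    r_norm = r / r.std(axis=0, ddof=1)                 # normalise each sector overall
    dispersion = r_norm.std(axis=1, ddof=1)            # cross-sectional dispersion each week
    out = np.full_like(r, np.nan)
    for t in range(window, r.shape[0]):
        out[t] = r[t] / dispersion[t - window:t].mean()
    return out

# Usage sketch on simulated data with a common volatility regime shift
rng = np.random.default_rng(7)
vol_regime = np.where(np.arange(670) < 335, 1.0, 2.5)[:, None]
returns = vol_regime * rng.standard_normal((670, 23))
adj_long = longitudinal_adjustment(returns)
adj_cross = cross_sectional_adjustment(returns)
```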
74
Extreme Events
(a) Underlying series
(c) Underlying series: (excess) kurtosis
10.0
kurtosis
5.0
5%ile/95%ile 2000
1500-2000
0.0
1000-1500
-5.0
1
1500
3
1000
9
0 1
5 5
9
13
7
9
11 13 15 17 19 21 23
(d) Eigenvalues (sorted by size)
500-1000
20% 15%
17 13
500
5
0-500 Sector 2
% of total
10% 5%
1 17
RMT cut-off
0%
Sector 1
1
3
5
7
9
11 13 15 17 19 21 23
(e) Eigenseries: skewness
(b) Principal components 0.5 0.0 1
1500
1000-1500
3
5
7
9
skew 5%ile/95%ile
11 13 15 17 19 21 23
-0.5 -1.0
1000 500-1000 500
17 13
0-500 9 5 Sector 2
0 1
5
9
13
(f) Eigenseries: (excess) kurtosis 3.0 2.0
(excess) kurtosis
1.0
95%ile
1 0.0
17
Sector 1
-1.0
1
3
5
7
9
11 13 15 17 19 21 23
Figure 3.18 Impact of adjusting each relative return series by its own recent past volatility: (a) number of observations in each fractile–fractile pairing; (b) number of observations for each principal component pairing (weighted by eigenvalue size); (c) (excess) kurtosis of each underlying return series; (d) eigenvalues (sorted by size); (e) skewness of each eigenseries (sorted in order of decreasing eigenvalue size); (f) (excess) kurtosis of each eigenseries (sorted in order of decreasing eigenvalue size) C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
Table 3.1 shows the results of such an analysis, averaged across 2300 randomly chosen portfolios, 100 with just one 1 sector position, 100 with 2 sector positions and so on. We see that the cross-sectional time-varying volatility adjustment has reduced but has not eliminated fat-tailed behaviour, particularly in the far tail.
Table 3.1 Out-of-sample test of the effectiveness of different ways of adjusting for time-varying volatility
Unadjusted data Longitudinal adjustment Cross-sectional adjustment c.f. expected if Gaussian
Kurtosis
90 percentile
99 percentile
99.9 percentile
2.3 1.2 0.8 0.0
1.2 1.2 1.3 1.3
2.7 2.5 2.6 2.3
4.3 3.8 3.8 3.1
Source: Nematrian, Thomson Datastream
Fat Tails – In Joint (i.e., Multivariate) Return Series (c) Underlying series: (excess) kurtosis
(a) Underlying series
3.0
2000
1500-2000
1500
1000-1500
2.0
kurtosis
1.0
5%ile/ 95%ile
0.0 -1.0 1
3
1000 17 13
500
1
500-1000
5
9
13
1 17
20% 10%
RMT cutoff 3
9 5 Sector 2
Sector 1
9
11 13 15 17 19 21 23
(e) Eigenseries: skewness
0.5
skew
1000-1200
0.0
5%ile/ 95%ile 1
3
200-400
5
7
9
11 13 15 17 19 21 23
-0.5
400-600 17 13
7
1200-1400
600-800
1
5
1.0
800-1000
13
11 13 15 17 19 21 23
(d) Eigenvalues (sorted by size)
1
1400 1200 1000 800 600 400 200 0 9
9
0%
(b) Principal components
5
7
30%
Sector 1
1
5
% of total
0-500 9 5 Sector 2
0
75
(f) Eigenseries: (excess) kurtosis 2.0 (excess) kurtosis
0-200 1.0
95%ile 0.0
17
1
3
5
7
9
11 13 15 17 19 21 23
-1.0
Figure 3.19 Impact of adjusting every series simultaneously by recent past cross-sectional volatility: (a) number of observations in each fractile–fractile pairing; (b) number of observations for each principal component pairing (weighted by eigenvalue size); (c) (excess) kurtosis of each underlying return series; (d) eigenvalues (sorted by size); (e) skewness of each eigenseries (sorted in order of decreasing eigenvalue size); (f) (excess) kurtosis of each eigenseries (sorted in order of decreasing eigenvalue size) C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
3.4 EXAMPLE ANALYTICAL COPULAS 3.4.1 Introduction As in many branches of mathematical finance, there are considerable advantages in being able to characterise functions of interest analytically. We have already given the analytical formula applicable to one copula, namely the Independence or Product copula, in Section 3.3.1. Here we introduce a few more copulas. Trying to characterise a copula analytically limits the range of possible copulas we might consider. In Section 3.5 we will explore how we might empirically estimate copulas (and fat-tailed behaviour more generally) from observed return data. However, focusing only on analytical copulas does not limit the range of possible copulas ‘very much’ in practice. There are still an (uncountably) infinite number of possible analytical copulas that could in theory be chosen. This returns us to issues such as model error and parsimony that we explored in Section 2.10. Practitioners typically limit their attention to a small number of copula families, chosen because they are perceived to have intuitively reasonable economic characteristics.
76
Extreme Events
To facilitate such expressions, we here use mainly the notion of a copula density function as per Equation (3.2). These are usually easier to handle than the copula itself (see Equation (3.1) in Section 3.3.1). In Equation (3.2), f is the joint probability density function and f i are the marginal probability density functions for each variable xi . As always, use of density functions is only meaningful for continuous distributions, because the density becomes infinite in places for discrete distributions: c (F1 (x1 ) , . . . , Fn (x n )) =
f (x1 , . . . , xn ) f 1 (x1 ) × · · · × f n (xn )
(3.2)
3.4.2 The Gaussian copula The two-dimensional Gaussian copula density function (as derived from a bivariate Gaussian distribution with correlation coefficient ρ) is as follows, where a1 = N −1 (q1 ) and a2 = N −1 (q2 ). If ρ = 0 then the Gaussian copula becomes the Independence copula and c (q1 , q2 ) becomes constant: ρ ρa12 +ρa22 −2a1 a2 ) exp − ( 2(1−ρ)(1+ρ) c (q1 , q2 ) = 1 − ρ2
(3.3)
The most common Gaussian copula used for credit risk portfolio management purposes is the one-factor copula in which any two credits have the same default correlation. Market implied correlations for collateralised debt obligations (CDOs) and similar credit sensitive instruments are conventionally quoted as if this was a suitable model for correlation between defaults (see Section 3.5.1). Gregory and Laurent (2004) discuss how a more general factorbased Gaussian copula can be developed, in which correlations between different credits may vary depending on more than one factor. 3.4.3 The t-copula The Student’s t-distribution is a commonly used distribution in finance and other applications in which heavy-tailed behaviour is postulated. For a single random variable, its probability density is as follows (see, e.g., Malhotra and Ruiz-Mata (2008)): −(ν+1)/2 Ŵ ν+1 1 x −µ 2 2 1+ tν (x) = ν √ ν σ Ŵ 2 q νπ
(3.4)
In this formula, Ŵ (z) is the Gamma function, µ is the mean (a ‘location’ parameter) and ν is the number of degrees of freedom (a ‘shape’ parameter). Here, q is not the standard deviation of the distribution but is instead √ a ‘scale’ parameter, which for ν > 2 is such that the standard deviation is given by σ = q ν/(ν − 2). According to Malhotra and Ruiz-Mata (2008), the most popular way to extend this to the multivariate situation is for the multivariate probability density function to be tν (x 1 , . . . , xn ) =
Ŵ Ŵ
ν 2
ν+1
|P|
2 1/2
(νπ )n/2
−(ν+n)/2 1 T −1 1 + (x − µ) P (x − µ) ν
(3.5)
Fat Tails – In Joint (i.e., Multivariate) Return Series
77
Here P is an n × n non-singular matrix, but it is not necessarily the standard covariance matrix. Using the above definitions, the t-copula density is defined as ct (Tν (x1 ) , . . . , Tν (xn ) ; ν, P) =
tν (x 1 , . . . , x n ) tν (x1 ) × · · · × tν (x n )
(3.6)
Here Tν is the t-distribution cdf with degrees of freedom ν corresponding to a univariate t-distribution, defined as above. The basic t-copula defined above has the possible disadvantage, when applied to extreme events, that it has the same tail behaviour in either tail. We saw in Chapter 2 that downside fat-tailed behaviour often seems to be more pronounced than upside fat-tailed behaviour (and to be explained by different factors). To circumvent this problem, Malhotra and Ruiz-Mata (2008) also describe the following: (a) A copula based on the skewed t-distribution. This distribution is a special case of the generalised hyperbolic class of distributions. It has one heavy tail with asymptotic powerlaw behaviour (i.e., within the Fr´echet MDA explored in Section 2.9) and one semi-heavy tail with an exponential decay similar to the Normal distribution. (b) A ‘hybrid’ t-distribution in which the central part of the distributional form follows a standard Student’s t-distribution but the downside and upside tails have power-law dependencies with independent tail indices. The functional form of the density is then ⎧ ⎨
c1 x −α1 , ctν (x) , t H (x) = ⎩ c2 (−x)−α2 ,
x > x1 x2 ≤ x ≤ x1 x < x2
(3.7)
In point (b), the α1 and α2 are the tail index exponents for the right and left tails respectively. The cut-off points, x 1 and x2 , are the points at which the power-law densities must match the central t-distribution. This requirement drives the values of c1 and c2 relative to c. There is also a normalising condition on c, c1 and c2 because the probabilities must sum to unity. The central distribution could be replaced by another distribution that is close to Normal (including the Normal distribution itself). However, when Malhotra and Ruiz-Mata (2008) then proceeded to fit such marginal distributions to a range of weekly return series, they concluded that the improved fits from the two more advanced approaches were not much better than the fit available purely from a standard Student’s t-distribution, particularly bearing in mind the extra parameters present in these distributional forms (which ought to mean that they fit better anyway; see Section 2.10). Malhotra and Ruiz-Mata (2008) applied an out-of-sample log-likelihood (OSLL) test5 to analyse whether the potentially improved fits available from the two more advanced approaches would 5 Malhotra and Ruiz-Mata (2008) refer to Norwood, Bailey and Lusk (2004) when describing the OSLL approach. The main idea is that if rt is a vector of returns at time t then we calibrate its joint pdf using, say, the T returns before it, i.e., using time periods starting at t − T to t − 1. We then calculate the log likelihood for rt , We repeat this for the different t and the resulting log likelihoods are summed to obtain the net OSLL. Malhotra (2008) argues that one advantage of the OSLL technique is that it allows us to test the entire pdf at once, which is potentially very useful when the amount of data is relatively small. Unfortunately, this means that we may also give undue weight to a good fit in a part of the distribution in which we are not interested. This is analogous to the point highlighted in Section 2.4 when we highlighted some of the weaknesses of skewness and kurtosis as measures of fat-tailed behaviour in the tails of individual return series.
78
Extreme Events
then lead to risk models that had better out-of-sample properties,6 and concluded that the two more advanced t-distribution variants described above did not. Whichever type of t-distribution we might use in these circumstances, we also need a correlation matrix or equivalent to knit the marginals together. Malhotra and Ruiz-Mata (2008) use an empirical covariance matrix filtered to exclude the smallest principal components (see Section 4.3). Daul et al. (2003) propose a ‘grouped t-copula’ that clusters individual risk factors within various geographical sectors and show how the parameters describing this copula might be estimated. 3.4.4 Archimedean copulas Another family of copulas is the Archimedean family. According to Shaw, Smith and Spivak (2010) this family is frequently used in actuarial modelling, particularly in non-life insurance applications. A particular feature of Archimedean copulas is their ability to model particularly heavy tail dependency.7 They are also naturally asymmetric, and so they allow the modelling of dependency structures where tail dependency is different depending on whether we are focusing on the upper or lower tails (as is also achievable using some of the more complex variants of the t-copula referred to in Section 3.4.3). The Archimedean family includes the Gumbel and Clayton bivariate copulas. These have copulas defined as follows: Gumbel: 1/θ for θ ≥ 1 C θ (u 1 , u 2 ) = exp − (− log u 1 )θ + (− log u 2 )θ Clayton:
−1/θ −θ Cθ (u 1 , u 2 ) = u −θ 1 + u2 − 1
for θ > 0
3.5 EMPIRICAL ESTIMATION OF FAT TAILS IN JOINT RETURN SERIES 3.5.1 Introduction The usual way in which fat-tailed behaviour in joint return series is estimated empirically is to choose the form of the copula (say, one of the families referred to in Section 3.4) and then to fit it empirically to the data using a methodology akin to the one used by Malhotra and Ruiz-Mata (2008) described above. As noted in Section 3.4, choice of copula form is heavily influenced by the researcher’s prior view of what form of co-dependency is most likely to apply to the particular variables in question. Shaw, Smith and Spivak (2010) also describe how maximum likelihood estimation and/or the method of moments can be applied to such a problem. An exception is when the estimation process is merely a market convention used to express price differentials between different instruments. Standardisation is then the order of the day, 6
For a discussion of in-sample versus out-of-sample backtesting, see Section 5.9. Shaw, Smith and Spivak (2010) provide definitions of ‘tail dependency’ and contrast it with the concept of ‘tail correlations’. They define the coefficient of lower tail dependence of (X, Y ) as λ L (X, Y ) = limu→0 P(Y ≤ FY−1 (u)|X ≤ FX−1 (u)) assuming that such a limit exists. The coefficient of upper tail dependence, λU (X, Y ), is defined likewise but referring to the upper rather than the lower tail. The Gaussian copula has zero tail dependence (both lower and upper). 7
Fat Tails – In Joint (i.e., Multivariate) Return Series
79
irrespective of whether investors actually believe that the copula form used in the convention is a reasonable representation of reality. For example, tranches of CDOs and SIVs (structured investment vehicles) are often compared by reference to their implied correlations, which are calculated as if the co-dependency between defaults on their underlying credits will follow a Gaussian copula.8
3.5.2 Disadvantages of empirically fitting the copula However, there are some disadvantages with the approach described in Section 3.5.1: (a) Most copula fitting approaches used in practice use standard maximum-likelihood estimation techniques, giving ‘equal’ weight to every observation. Thus we may run into the same problem as we came across in Section 2.4. The fit may tend to be best where most of the data points reside. This may not be the part of the distributional form in which we are most interested. For example, consider 400 individual squares arranged in a 20 × 20 square. The overall square has 76 individual squares along its four edges and 324 squares that are not along any outside edge. So, if we define ‘tail’ as involving observations for either or both series observations falling into the most extreme 10% of outcomes for each return series in isolation, then only 19% of observations fall within the ‘tail’. (b) Ideally, we should be estimating fat-tailed behaviour taking into account not just fat-tailed behaviour in the copula but also fat-tailed behaviour in the marginal. Just because we can decompose the problem into two parts as per Section 3.3 does not mean that the two parts are not linked. We might be particularly interested in cases where a multi-dimensional copula had a peak in a particular corner and the corner in question related to series that had particularly fat-tailed marginals.9 (c) There are few if any copula forms in common use that are general enough to be able to provide a good fit to essentially any observation set that might arise. (d) We need to be aware of the risk of model error. If the copula family being used is unable to cater well with particular characteristics of the observed data then conclusions drawn by using it may be flawed. We also need to be mindful of the possibility that assumptions implicit in our use of the copulas may influence the end results more than the precise shape of the copula we are trying to fit. Otherwise we may focus so much attention on identifying the ‘right’ copula that we miss these wider issues.10
8 The main requirements for such conventions are that they should have an intuitive basis, some (albeit potentially only tenuous) link to reality, suitable mathematical properties (principally that any reasonable price the instrument can take can be characterised by one and only one value for the reference parameter) and, if there are several possible options satisfying the earlier criteria, that they get established as market conventions before the others. 9 There are also some subtleties about how the decomposition operates in practice that are not obvious from Sklar’s theorem, which we explore further in Section 4.5.2. 10 For example, we may be so focused on fitting a copula effectively that we might discount the possibility that no single copula validly applies to the entire time period being analysed, and that the data is not time stationary in this respect. Market implied correlations, derived from, say, prices of tranches of corporate bond indices using the Gaussian copula, certainly have varied dramatically through time depending on market conditions. We might also miss the possible clumping together through time of defaults, a phenomenon that is not directly relevant to the copula that we might choose to model co-dependency as measured over any set time period, but which might have more impact on the end answer.
80
Extreme Events 0.16 0.14 0.12 0.1 0.08 0.06 0.04 Expected Observed (series 1)
0.02 0
-0.15
-0.1
-0.05
-0.02
0
0.05
0.1
0.15
Figure 3.20 A one-dimensional ‘upwards’ QQ-plot in which we plot F−X on the downside and FX on the upside C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
3.5.3 Multi-dimensional quantile–quantile plots One way of circumventing most or all the problems highlighted in Section 3.5.2 is to use a suitable generalisation of the quantile–quantile plots we introduced in Chapter 2. As noted in Section 2.4.5, curve fitting quantile–quantile plots can provide more visually appealing fits. By weighting different parts of the distributional form differently it can also focus on the particular part of the distributional form in which we are most interested. In Section 2.4.5 we concentrated on cubic curve fits, but we could equally have used higher order polynomials. The higher the order the better the fit will be (to the past, although not necessarily to the future; see Sections 2.10 and 5.9). The challenge is to find a suitable generalisation to higher dimensions of the one-dimensional quantile–quantile approach used in, say, Figures 2.7–2.13. Quantiles are inherently onedimensional and so do not naturally appear to generalise to more than one dimension. We suggest the following approach. We focus below on the two-dimensional case, i.e., the one involving analysis of the co-dependency of just two series, X i and Yi , as it can be plotted conveniently. However, the approach can be generalised to higher dimensions if desired: 1. We first refine the presentation of a one-dimensional quantile–quantile plot in a way that will facilitate generalisation to multiple dimensions, using an upwards QQ-plot as shown in Figure 3.20. This involves reflecting the lower half of a normal QQ-plot through a line parallel to the x-axis and plotting z = FX∗ (x) rather than z = FX (x). Here the horizontal axis is the x-axis and the vertical axis is the z-axis and FX∗ (x) is defined as follows: FX∗ (x) =
F−X (x) , FX (x) ,
x < x¯ x ≥ x¯
(3.8)
FX (x) represents the ordered ‘observed’ values for the series of observations X i . The xi are the corresponding ordered ‘expected’ values derived from corresponding quantile
Fat Tails – In Joint (i.e., Multivariate) Return Series
81
points of a Normal distribution with the same mean and standard deviation as the observed values and x¯ is the mean of this distribution (and, therefore, also the mean of the observed values). Such a plot contains exactly the same information as a normal QQ-plot, just presented differently (because the ordering of −X i is exactly the inverse of the ordering of X i ). In Figure 3.20 we plot both the ‘observed’ and the ‘expected’ upwards QQ-plots, the latter assuming that the observations had come from a Normal distribution with the same mean ¯ ¯ FX (x)). and standard deviation as the observation set. In general the plot has a kink at (x, In Figure 3.20, the axes have been defined so that this kink is positioned at the origin. 2. We then create a surface plot (in three dimensions) as shown in Figure 3.21, each vertical cross-section of which (through a common centre/origin defined as above) is a separate onedimensional QQ-plot defined in the following manner. Using cylindrical polar coordinates (r, θ, z) and again scaling the axes so that each ‘expected’ upwards QQ-plot has its kink at the origin, the upwards QQ-plot included in the vertical plane making an angle θ to the x-axis distribution is z (r ) = FR∗ (r ; θ) where Ri = X i cos θ + Yi sin θ. Such an approach offers a number of advantages over more traditional approaches when seeking to analyse joint fat-tailed behaviour, as described in the next five sections. 3.5.3.1 It contains a complete characterisation not only of each individual marginal distribution but also of the co-dependency between them The nature of each individual marginal distribution can be derived from cross-sections of Figure 3.21 through the x-axis and the y-axis. The one through the x-axis (i.e., with θ = 0) characterises the marginal distribution of the X i , because it plots z (x) = FX∗ (x), while the one through the y-axis (i.e., with θ = π/2) characterises the marginal distribution of the Yi , because it plots z (y) = FY∗ (y). However, Figure 3.21 also contains a complete characterisation of the co-dependency between the two series. This is because it is possible to derive from it the multi-dimensional characteristic function for the joint distribution of X and Y , and from this the joint pdf, the joint cdf and thus the copula.
0.3 0.2 0.2-0.3
0.1 0 -0.16 -0.096 -0.032 0.032 Sector 1 0.096
0.1-0.2 0.096 0.032 -0.032 -0.096 Sector 2
0-0.1
-0.16
Figure 3.21 A two-dimensional ‘upwards’ QQ-plot characterising all (linear) combinations of two return series C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
82
Extreme Events
3.5.3.2 It encapsulates in a single chart fat-tailed behaviour arising both from co-dependency characteristics as well as from marginal distributions In contrast, the copula contains no information about fat-tailed behaviour in the marginal distributions and only focuses on fat-tailed behaviour arising from co-dependency characteristics. Usually the end conclusions drawn from any analysis will depend on both. It is not often clear a priori which of the two will prove to be the more important. 3.5.3.3 Like a one-dimensional QQ-plot, it places greater visual emphasis on extreme events These events correspond to ones where r and hence z are large, which are thus given greater visual prominence than behaviour towards the centre of the distribution 3.5.3.4 It is relatively easy to identify the types of circumstances in which divergences from a Normal distribution seem to be most pronounced Multivariate Normal distributions are characterised in such plots by cones, whose vertical cross-sections are straight lines and whose horizontal cross-sections are ellipses. The slopes of the cone along directions that project onto the x and y axes indicate the relative sizes of the standard deviations of the relevant marginal distributions. The direction in which the major axis of each horizontal ellipse points indicates the correlation between X and Y . Deviations from Normality correspond to plots that are more irregularly shaped. For example, fat-tailed behaviour will be characterised by the cone tending to curve upwards towards its edges. The point at which this is most pronounced highlights the joint events that appear to diverge most in the tail from Normality. For example, if this divergence occurs most along the positive x-axis, we might conclude that the most important divergence from Normality involves upside fat-tailed behaviour in the X i , whereas if it is along the negative y-axis, we might conclude that the most important divergence from Normality would correspond to downside fat-tailed behaviour in the Yi . However, if it were half way between the negative x-axis and the negative y-axis, this would correspond to joint downside fat-tailed behaviour occurring to both X i and Yi simultaneously. In Figure 3.21 we see that the two series being analysed are negatively correlated, because the direction in which the surface is longest is pointing closer to the diagonal running from (+,–) to (–,+) than to the diagonal running from (–,–) to (+,+). We can also see that the volatility of Sector 2 is larger than the volatility of Sector 1. Fat-tailed behaviour is evident particularly on the downside to Sector 2 because this is where the surface curves most upwards. 3.5.3.5 As with one-dimensional QQ-plots, we can use standard (or weighted) curve-fitting techniques to smooth out sampling error and encapsulate in a convenient manner the main elements of fat-tailed behaviour being exhibited by the series under analysis We saw in Section 2.4.5 that we could simplify the characterisation of fat-tailed behaviour by fitting, say, cubics to observed (one-dimensional) QQ-plots. A corresponding technique with multi-dimensional upwards QQ-plots would be to identify best fit polynomials of the form ai, j r i (sin θ ) j .
Fat Tails – In Joint (i.e., Multivariate) Return Series
83
The approach naturally generalises to higher dimensions, e.g., in three dimensions we might focus on the upwards QQ-plot characteristics of z (r ) = FR∗ (r ; θ, φ) where Ri = X cos θi cos φi + Yi sin θi cos φi + Z i sin φi . However it is impractical to plot these higherdimensional upwards QQ functions. Even two-dimensional upwards QQ-plots are necessarily somewhat ‘busy’. They contain a large amount of information that is potentially difficult to assimilate all at once. They might be easier to interpret if we could rotate them interactively in three dimensions. There are also a few niceties to worry about with such a format in practice. As with the one-dimensional case, we may need to think carefully about what we mean by ‘expected’ quantiles if the marginals do not appear to have finite means or variances. Also, not all polynomial curve fits correspond to physically realisable upwards QQ-plots. Moreover, carrying out Monte Carlo simulations using such distributions is not particularly easy, because it is in practice necessary to recreate the marginals and the copula to do so (see Section 6.11.6).
3.6 CAUSAL DEPENDENCY MODELS Copulas focus on the co-dependency between two or more variables. They do not attempt to identify why the co-dependency characteristics arise. An alternative way of analysing codependency is to develop causal dependency models that seek to identify and model the drivers that appear to be creating the co-dependency. For example, Shaw, Smith and Spivak (2010) note that the usual method for capturing dependencies within economic scenario generators used in insurance contexts would be to formulate a number of underlying drivers and then to have these drivers linked to features of interest via a number of causal links. They give the example of future inflation, which might be derived from simulated nominal and real yield curves and which is then used as an input when simulating future insurance losses. The inflation rate some years into the future is not typically an ‘output’ of the model per se (at least not one that directly corresponds with the behaviour of any particular asset or liability). Instead it can in this instance be viewed as a state variable that, behind the scenes, influences outputs we can observe (here future asset returns and insurance pricing) and which influence portfolio construction.11 In Chapter 4 we explore in more detail the topic of how best to identify what drives markets. Causal dependency models can be thought of as a subset of some of the more general techniques we describe in Section 4.7 for analysing and modelling market dynamics. Table 3.2 summarises some of the advantages and disadvantages of such models highlighted by Shaw, Smith and Spivak (2010).
3.7 THE PRACTITIONER PERSPECTIVE We saw in Section 3.3.3 that a substantial proportion of ‘fat-tailed’ behaviour in joint return series appears to come from time-varying volatility, just as was the case for single series. 11 In this particular example, some information pertinent to future inflation is externally observable in its own right (namely current inflation levels, if we adopt the reasonable premise that inflation rates are likely to change only relatively modestly between consecutive time periods). In such circumstances we might seek to refine the specification of the causal model to take account of any additional information such as this that is available to us, here perhaps blending together the current observed inflation rate with any break-even rates derived from short-term nominal versus real yields, to the extent that the two differ.
84
Extreme Events
Table 3.2 Table 3.2: Advantages and disadvantages of causal dependency models Advantages
Disadvantages
Theoretically very appealing and intuitive
Transparency and results communication can be a particular issue, because the model can take on the characteristics of a ‘black box’ Risk of over-fitting, lack of parsimony and model error; could lead to an overly complicated model providing a false sense of accuracy It is unlikely to be feasible to model all common risk factors at the lowest level (even if they can be accurately identified)
Potentially the most accurate way of imitating how the ‘real world’ works
Can be used in combination with other approaches: for example, inflation might be a common risk driver for expense and claims risk and for some asset class returns, but we might include other elements of correlation between them in a way that does not include any assumed causal dependency. Possible to capture nonlinearities through causal relationships
If lots of common risk factors are being simulated using a Monte Carlo approach then the model can put a very high demand on computing power
A natural consequence should therefore again be a desire to understand and, if possible, to predict (and hence profit from) the way in which the world might change in the future. In Section 2.12.2 we discussed how investment managers could use straightforward options and variance swaps to take views on the future volatility of an individual return series. We also noted that market implied views backed out of observed prices for such instruments should then be able to help us identify what others are thinking about such risks. These types of instruments are equally valid for understanding the marginal distributions of each individual series when considering series jointly. However, to understand better the co-dependency between different return series we need to refer to other instruments that are sensitive to such risks, such as more complicated derivatives dependent on multiple underlyings. In equity-land these include correlation (and/or covariance) swaps. In bond-land, the prices of CDOs and other similar instruments are sensitive to changes in market views on the co-dependency between the exposures underlying the structure. Most of the other comments that we noted in Section 2.12 also apply to joint return series, such as the need to worry about crowded trades, liquidity risk and the impact that our own behaviour (alongside our peers) might have on market behaviour. Principle P12: A substantial proportion of ‘fat-tailed’ behaviour (i.e., deviation from Normality) exhibited by joint return series appears to come from time-varying volatility and correlation. This part of any fat-tailed behaviour may be able to be managed by reference to changes in recent past shorter-term volatilities and correlations or by reference to forward-looking measures such as implied volatility and implied correlation or implied covariance.
Fat Tails – In Joint (i.e., Multivariate) Return Series
85
However, there is an important additional behavioural issue that can potentially arise as soon as we have more than one return series in tandem. This is the possibility that our choice between the factors may itself tend to exacerbate fat-tailed behaviour more than we might otherwise expect. The possibility of selection effects operating in this context forms a major backdrop to Chapter 4. Principle P13: Time-varying volatility does not seem to explain all fat-tailed behaviour exhibited by joint (equity) return series. Extreme events can still come ‘out of the blue’, highlighting the continuing merit of practitioners using risk management tools such as stress testing that seek to reflect the existence of such ‘unknown unknowns’ or ‘black swan’ events.
3.8 IMPLEMENTATION CHALLENGES 3.8.1 Introduction In Section 2.13 we covered implementation challenges that can affect individual return series. In this section we look at a number of implementation challenges that can arise when we are considering more than one series in tandem. 3.8.2 Series of different lengths Perhaps the most important problem that can arise in practice is that the series being analysed may be of different lengths or otherwise inconsistently timed in nature. Suppose, as is often the case, that we want to analyse jointly several return series that start at different times in the past. For example, assume that we have 6 series, and for 5 of these we have 60 common time periods of observations but for the last one we have only 40 observations in common with the first 5 series. It is unwise, say, to derive a correlation or covariance matrix by using correlations based on 60 time periods between the first 5 series, but only 40 time periods for the correlation between any of these series and the 6th one. This is because the resulting matrix may not be positive definite, and may not therefore correspond to any possible correlation or covariance matrix that might actually characterise a probability distribution from which the data might be coming. The simplest way to tackle this problem is to use whatever data we have that is overlapping. However, if any individual return series has a very short history then this approach effectively throws away nearly all the data relating the rest of the dataset, which is usually undesirable. It is particularly undesirable when we are trying to model risk for a large universe of individual securities, because the larger the universe the more likely it is that some of the instruments in question will have only been around for a short time. ‘Most’ problems to which we will want apply a risk model are ones where precise exposure to those instruments will be relatively unimportant. At the other end of the scale, we might derive the correlation matrix using all the available data but then adjust it to whatever is the ‘closest’ possible matrix that is a valid one for the problem at hand.12 We will find in Chapter 4 that for large universes for which we have 12
‘Close’ might here be defined using ideas such as relative entropy, see Section 3.8.5.
86
Extreme Events
limited observational data, a pre-filtering is always in practice necessary even if all the series have entirely coincident observations. So, we could merely build such an adjustment into this pre-filtering element of the process. However, the most common way of handling this problem is to choose a suitable overall time period for which most series have complete data and then to pad out any missing observations within this period with proxy data derived from series that do have more complete histories deemed to be ‘similar’ to the ones that do not. For example, with an equity-orientated risk model we might proxy missing return data for a given security by the return on a suitable sector index (or if, say, size or style were deemed likely to be relevant to security behaviour, by a suitable combination of sub-indices deemed likely to best represent how the given security would have behaved had it been present at the time). In bond-orientated and other types of risk models the need for a proxy may, in effect, be wrapped up in the computation of economic factors underlying such models (see Section 2.13.5). We might identify factors expected to be applicable to a range of assets under consideration (rather than just one), and then calculate these factors by suitably averaging across the assets in question. For widely applicable economic exposures, such as exposure to yield curve movements, the computation may incorporate some contribution from every single instrument in the universe to which the model applies.13 When it is more difficult to identify a suitable proxy merely by a priori reasoning, we might instead use methodologies that purely refer to the data, the most common of which involves clustering (see Section 3.8.4). 3.8.3 Non-coincidently timed series Another problem that we may face is that the sub-periods in question may not be coincident in time. Usually we would convert all returns into indices (or better still logged indices),14 choose a template with time periods that was compatible with the observations available for most of the instruments in question and then attempt to complete the template for the remaining instruments by interpolating between available index levels for the instrument in question. Crude interpolation, e.g., estimating the value of an index half way in time between known points using the formula It+1/2 = (It + It+1 )/2, introduces smoothing, which may distort the end results of any exercise that makes use of these estimates. This effect can be reduced by more sophisticated interpolation approaches that refer back to some suitable market observable that is available ‘live’ at each of the time points in question; see, e.g., Kemp (2009). At first sight, we might think that non-coincidently timed series is a relatively rare problem. Most data vendor systems are geared to providing daily, weekly, monthly return series etc., and when asked to produce such data for several series simultaneously will automatically tabulate the results in a manner that forces consistency on the return series. However, market opening and closing hours are not themselves coincident in ‘proper’ time, especially if the markets are located in different time zones. In particular, US markets close several hours after European markets, which in turn close several hours after Far Eastern 13 More common would be to slice up the yield curve into defined time brackets or ‘buckets’ and treat the ‘yield curve’ exposure within different cash flow timing buckets separately. 
However, even here there may be some contribution from buckets that are next to the time bucket in question, in order to force the yield curve to exhibit a level of smoothness that we think it ‘ought’ to exhibit. For further information on how to fit curves through points in ways that ensure such smoothness is present, see Press et al. (2007) or Kemp (2010). 14 This involves, say, calculating It = It−1 × (1 + rt ) or better still Jt = log(1 + It ).
Fat Tails – In Joint (i.e., Multivariate) Return Series
87
markets. Ignoring this nicety can result in some significantly inaccurate conclusions in terms of correlations between daily returns on markets located in different time zones. It is sometimes possible to adjust for such effects using live market observables (such as S&P 500 futures that are traded nearly around the clock) and by ‘deeming’ the market when closed to have been moving in a manner that perfectly correlated with that market observable. However, such an assumption may itself introduce bias if the answers from the exercise are sensitive to the correlation between two different instruments, each of which we are deeming to behave in a similar fashion. We are again introducing a priori assumptions and should again bear in mind the impact that they may have on the end results. 3.8.4 Cluster analysis If we are not prepared to identify proxies to handle missing data by a priori logic, we can instead seek to rely on cluster analysis. Most types of cluster analysis used in finance involve hierarchical clustering. We take some information about individual elements and build up a nested tree that best characterises the degree of linkage between the different elements (without presupposing any ‘right answer’ in advance, making it a form of unsupervised learning). For example, we might have a series of stock or sector returns, and we want to see which ones appear to be closest in behaviour to each other. The output is a collection of fully nested sets. The smallest sets are the individual elements themselves. The largest set is the whole dataset. The intermediate sets are nested, i.e., the intersection of any two sets is either the null set or the smaller of the two sets. We might use cluster analysis to identify ‘sectors’ into which individual securities might be grouped or to identify broader classifications that grouped together these sectors. Or we might use the approach to test whether some grouping we think might apply based on some economic arguments actually does seem to describe actual market behaviour.15 The common convention is to have the nesting arrangement form a binary tree, i.e., one where each larger set is deemed to split into just two subsets at each node of the tree. Where, say, three subsets are equally near each other within a larger set, this is typically represented by an arbitrary choice of one of the three subsets to stand distinct and for a branch of zero length to join it to the join of the other two subsets. Precise choice of how to measure ‘degree of linkage’, i.e., the ‘distance’ between different elements of such a tree, can be quite important in this context, and can depend on what question we are trying to answer. For instance, we might have some implicit view of how individual elements in the market should behave: e.g., for an equity-orientated analysis we might assume a model that involves stocks having sector and country exposures (‘betas’) derived from a formula such as the following (see, e.g., Morgan Stanley (2002)): n C n n r nj = α j + β Sj · r S( j) + β j · r C( j) + ε j
(3.9)
where S ( j) and C ( j) are sector and country of stock j, r nj , r Sn and rCn are the returns of stock j, sector S and country C in month n, and εnj is the residual return of stock j during that month not explained by either sector or country. 15 For example, we might think that stocks (or sectors) can be split into ones that are ‘global’, i.e., driven more by globalisation trends or other global macroeconomic effects, and ones that are ‘local’, i.e., more driven by events specific to their country of domicile. We might test this using such a cluster analysis. If stocks/sectors cluster together in the manner expected, we might conclude that our premise was supported by actual market behaviour. If they do not, we might need to refine our economic intuition.
88
Extreme Events
We might then try to identify which sectors or countries appeared to be the most ‘similar’ to each other, i.e., which r Sn and rCn were most similar to those of other sectors or countries. We might need this if a sector or country had recently been added to the universe and we wanted to proxy its behaviour further back into the past. However, even here there are several possible ways in which we might measure ‘similarity’. For example, we might measure it either by reference to correlations or by reference to covariances. If we use covariances then relatively non-volatile sectors will be deemed to be relatively similar while relatively volatile sectors may be deemed to be quite different to each other even when they are highly correlated. Usually, in finance, the focus is on correlations rather than covariances, because we typically view kxt as ‘similar’ in terms of market dynamics to xt whatever the value of k = 0. For example, Scherer (2007) uses the following formula to define the ‘distance’ between two possible clusters, C 1 and C2 (each such cluster will be a set of individual return series, perhaps derived from a regression analysis as above)16 : Distance (C1 , C2 ) =
1 Distance (i, j) |C1 | |C2 |
(3.10)
i∈C 1 j∈C 2
where |Cn | is the number of objects (assets) in cluster n, so the ‘distance’ between two clusters is the average ‘distance’ between their individual elements, and the distance between two elements is Distance (i, j) = 1 − Correlation (i, j). Some variants on this approach are described in SSSB (2000). The results of applying such a cluster analysis to the sector relative returns used in Figure 3.12 are summarised in Figure 3.22. The y-axis represents the distance between two clusters. When interpreting the results of a cluster analysis it is important to bear in mind that no specific choice of how to arrange the cluster hierarchy along the y-axis is imposed by the analysis. For example, any sector we like could be the one that was positioned at the top of this chart. Two sectors that are close together on the y-axis will only have exhibited ‘similar’ behaviour if they belong to the same or closely linked clusters. Principle P14: Some parts of the market behave more similarly to each other than other parts. For some purposes, it can be helpful to define a ‘distance’ between different parts of the market, characterising how similar their behaviour appears to be. The definition of ‘distance’ used for this purpose and, more generally, the weights we ascribe to different parts of the market in our analysis can materially affect the answers to portfolio construction problems.
3.8.5 Relative entropy and nonlinear cluster analysis Some readers may object to a focus on correlation as a measure of similarity in a text on extreme events. We might instead use more general techniques for measuring ‘similarity’ that 16 With this definition of distance it is possible for a cluster to have an average distance between its own elements that is greater than the average distance between elements of its ‘parent’ cluster. To maintain the hierarchical structure of the analysis we might place a limit on the average distance within a cluster equal to the least average distance within any cluster of which it is a part. Such an override was applied once in Figure 3.22.
Fat Tails – In Joint (i.e., Multivariate) Return Series
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
89
1
Figure 3.22 Illustrative cluster analysis of the sector relative return series used in Figure 3.12. The distance along the x-axis corresponds to the average ‘distance’ between the individual elements of the cluster C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
are less dependent on linear relationships. Perhaps the most important of these is to focus on the concept of entropy; see, e.g., Press et al. (2007), Abarbanel (1993) or Kemp (2010). Usually the focus is on relative entropy, otherwise known as the Kullback-Leibler ‘distance’17 between two distributions, p and q defined (by Press et al.) as D (p q ) ≡
i
pi log
pi qi
(3.11)
There is a direct link between relative entropy, hypothesis testing and likelihood ratios, as pointed out by Press et al. (2007). Suppose, for example, that we are seeing events drawn from distribution p but want to rule out an alternative hypothesis that they are drawn from q. We might do this by calculating a likelihood ratio, which is the likelihood of the data coming from p divided by the likelihood of the data coming from q,18 and then rejecting the alternative hypothesis if this ratio is larger than some suitable number (in this shorthand equation, the product over the ‘data’ is to be interpreted as substituting for i in each product element the particular observation in question): L=
pi p (Data |p ) = p (Data |q ) Data qi
(3.12)
17 The Kullback-Leibler ‘distance’ is not a true ‘metrical’ distance as such. It is not symmetric. Nor does it satisfy the triangle inequality to which a true metric distance adheres. 18 Here and elsewhere we use the notation P(X |y) to mean the probability of X given y and xi to mean the product of the x i , i.e., xi = x 1 x2 . . . .
90
Extreme Events
Taking the logarithm of Equation (3.12) we see that, under the hypothesis p, the average increase in log L per ‘data event’ is just D (p q ). The relative entropy is therefore the expected log-likelihood (per observation) with which a false hypothesis can be rejected.19 As in Section 3.8.4, we would normally want to focus on some measure of ‘similarity’ or lack of similarity that was scale invariant. One possibility would be to exclude the impact of the marginal distributions in any comparison between sectors (this, in effect, assumes that two distributions are identical if there is a monotonically increasing, or monotonically decreasing, relationship between the two). The definition of ‘similarity’ is then entirely confined to the characteristics of the copula between the two series. See Kemp (2010) for further details.
19 Press et al. (2007) explain why the above likelihood ratio test is asymmetric by referring to Bayesian statistics (see also Chapter 6). The asymmetry arises because, until we introduce the notion of a Bayesian ‘prior’ distribution, we have no way of treating the hypotheses p and q symmetrically. Suppose, however, p(p) is the a priori assumed probability of p and hence p(q) = 1 − p(p) is the a priori assumed probability of q. Then the Bayes odds ratio between the two distributions is ( p(p)/ p(q))( pi /qi ). The Bayesian figure of merit, which is the analogue of log likelihood in a Bayesian world, is the expected increase in the logarithm of the odds ratio if p is true plus the expected decrease if q is true, which simplifies to p(p)D(p q) + p(q)D(q p), which is now symmetric between p and q. We can therefore use this expression to estimate how many observations we will need on average to distinguish between two distributions. In the case of a uniform (‘noninformative’) prior, which has p(p) = p(q) = 1/2, we end up with the symmetrised average of the two Kullback-Leibler distances.
4 Identifying Factors That Significantly Influence Markets 4.1 INTRODUCTION In Chapters 2 and 3 we discussed how individual return series as well as joint return series appear to exhibit fat tails. We also discussed, from a theoretical and a conceptual perspective, what might cause this fat-tailed behaviour. Before moving on to consider portfolio construction in more detail in later chapters, we explore in this chapter a selection of topics designed to help us gain a better understanding of market dynamics. Our focus is on how to identify factors that seem to drive market behaviour (and on how to refine traditional ways of doing so if we want to model fat-tailed behaviour effectively). In particular: (a) In Section 4.2, we describe the main methods that practitioners use to identify factors driving market behaviour when constructing portfolio risk models.1 The three main types of portfolio risk model that we consider are fundamental, econometric and statistical risk models. Fundamental and econometric risk models are augmented, in the sense that they rely on data that is in addition to individual return series. In contrast, statistical risk models are usually blind source, in the sense that they require no additional data other than the return series themselves. (b) In Sections 4.3–4.4, we focus on two particular ways of constructing statistical risk models that involve linear combination mixtures: principal components analysis (PCA) and independent components analysis (ICA). We show in Section 4.5 how a blend between the two may be more appropriate than either in isolation if we want our risk models to cater as effectively as possible with fat-tailed return behaviour. We use such a blended model to highlight in Section 4.6 the potential need to take account of selection effects when choosing which risk model (or model type) to use. (c) In Section 4.8, we describe some techniques for analysing data that we believe is coming from a distributional mixture. Why do we focus so much in these early chapters on the difference between linear combination mixtures and probability distribution mixtures? The differentiation has already been highlighted in each of the last two chapters. The reasons for doing so are explored further in Section 4.7. In essence, a much richer possible set of market dynamics arises if we do not limit ourselves merely to linear combination mixtures. Indeed it is so much richer that it is implausible to assume that distributional mixtures do not feature significantly in the behaviour of markets in the real world. One of the reasons for the depth and range of analysis contained in this chapter is to give the reader a better appreciation of the types of techniques that we need to employ if we wish 1 We will find in Chapters 5 to 7 that a key element of portfolio construction is to balance the risks that might be expressed by a particular portfolio positioning with its potential rewards. So identifying an appropriate risk model is a key precursor to effective portfolio construction, as we might have surmised from the discussion on risk budgeting in Section 1.4.
92
Extreme Events
to cater fully for fat-tailed behaviour. This reflects an apparent inconsistency between theory and practice prevalent in much modern risk modelling. Despite it being widely accepted that returns are fat-tailed, commercial risk and portfolio modelling systems often ignore this point in their fundamental formulation. This typically makes the mathematics of the underlying model framework much more analytically tractable. Risk system providers may also argue that the impact of deviation from Normality may be insufficient (or the evidence for such deviation insufficiently compelling) to justify inclusion in their model of fat-tailed behaviour. In short, there is a trade-off between: • complexity, practicality and presumed model ‘correctness’; and • simplicity, analytical tractability and practicality of implementation. We will come across a similar trade-off when we move on to portfolio construction in subsequent chapters. Practitioners again accept conceptually that returns are fat-tailed, but again often ignore the refinements that might then be needed to traditional portfolio construction techniques. Part of the purpose of this book is to equip readers with an appreciation of the extra steps needed to handle fat-tailed behaviour effectively, so that they can decide for themselves whether doing so is worth the extra effort involved.
4.2 PORTFOLIO RISK MODELS 4.2.1 Introduction Kemp (2009) contains a comprehensive description of different types of portfolio risk models. When our focus is on identifying models based wholly or mainly on past time series data, the three main model types used in practice are as follows: (a) Fundamental risk models – see Section 4.2.2. These ascribe certain fundamental factors to individual securities, which might include factors such as price to book, size, type of business the company pursues, industrial sector and how and where the company pursues its business. ‘Fundamental’ factors are ones that are exogenously derived, e.g., by reference to information contained in the company’s annual report or other regulatory filings. The factor exposures for a portfolio as a whole (and for a benchmark, and hence for a portfolio’s active positions versus a benchmark) are the weighted averages of the individual position exposures. Different factors are assumed to behave in the future in a manner described by some joint probability distribution. The overall portfolio risk (versus its benchmark) can then be derived from its active factor exposures, this joint probability distribution and any additional variability in future returns deemed to arise from security specific idiosyncratic behaviours. (b) Econometric risk models – see Section 4.2.3. These are similar to fundamental models except that the factor exposures are individual security-specific sensitivities to certain pre-chosen exogenous economic variables, e.g., interest rates, currency or oil price movements. The variables are usually chosen because we have surmised (from point (a), point (c) or from more general reasoning) that they should influence the returns on particular securities. The sensitivities are typically found by regressing the returns from the security in question against movements in the relevant economic variables, typically using multivariate regression techniques. (c) Statistical risk models – see Section 4.2.4. These eliminate the need to define any exogenous factors, whether fundamental or econometric. Instead we identify a set of otherwise
Identifying Factors That Significantly Influence Markets
93
arbitrary time series that in aggregate explain well the past return histories of a high proportion of the relevant security universe, ascribing to elements of this set the status of ‘factors’. Simultaneously we also derive the exposures that each security has to these factors. We explore in Sections 4.3–4.5 several different ways in which such factors can be identified, some of which seem more suitable for handling fat-tailed data than others. Historically, the earliest portfolio risk models tended to be fundamental ones. It is, for example, natural to expect returns from two companies operating in the same industry to behave more similarly than those from two companies operating in quite different industries. Econometric models were originally developed to provide a more intuitive interpretation of the ‘factors’ that might otherwise have been used in a fundamental risk model. More recently, some risk system providers have marketed pure statistical risk models to third parties. However, statistical risk modelling techniques have for a long time formed part of the panoply of tools that risk system providers have used internally when formulating the other two types of portfolio risk model. The key difference between the three approaches is between statistical risk models and the other two techniques. A statistical risk model refers only to the return data itself, i.e., it is a blind source technique.2 In contrast, fundamental and econometric risk models refer to return data augmented by other data, either fundamental data applicable to the security in question or econometric data relevant more generally. Within this overall divide, each type of risk model can exhibit otherwise similar variations. Our goal is to be able to explain as much as possible of the behaviour of the returns actually exhibited in practice across the universe of securities in which we are interested. The most common and simplest risk models focus largely on linear (Normally distributed) factor models with Normal residuals. Our goal then translates to minimising ε2j,t , where we are modelling the returns using the following equation3 : r j,t = α j + β j,k xk,t + ε j,t
(4.1)
All three types of risk model then have the same underlying mathematical framework, if we proxy risk by tracking error or an equivalent. We are modelling the jth security’s return as coming from ‘exposures’ β j,k to the kth ‘factor loading’. Here one unit of each factor generates a prospective return (in the relevant future period) of sk,t , which are random ˜ say. So a portfolio described by a vector of variables with a joint covariance matrix of V, active weights a,say, will have an ex-ante tracking error due to factor exposures of σ where ˜ a = (βa)T V ˜ (βa) and the matrix β is formed by the terms β j,k . σ 2 = aT βT Vβ Usually, the residuals are assumed to be idiosyncratic, i.e., security specific, although this is not strictly necessary.4 ε j,t and εk,t would then be independent for all j = k. There are two 2
Here ‘blind source’ means exclusively based on the available data without incorporating any suppositions on that data. Strictly speaking, we want to minimise what is unexplained after taking account of the impact that an increase in the number of parameters will have on our model (see comments on parsimony in Section 2.10). 4 In practice, there may be good reasons for believing that certain securities may share what would otherwise be purely idiosyncratic exposures. For example, some companies have a dual holding company structure, or are listed on more than one exchange. The two listings do not necessarily trade at identical prices (there may be differences in how they are taxed etc.) but are still likely to exhibit strongly linked behaviours. Likewise, corporate bonds issued by the same issuer are likely to share many common characteristics, even if their behaviour will not be identical (e.g., because the bonds may have different terms, priority status in the event of company wind-up ⌢ or different liquidity characteristics). Computation of ex-ante tracking error can then be characterised as σ 2 = (βa)T V (βa) + aT Ya, where Y is a sparse m × m, with few terms other than those along the leading diagonal. 3
94
Extreme Events
different ways in practice that risk models handle ‘residual’ risk within such a framework, i.e., the idiosyncratic risk that is relevant only to specific individual securities: (a) The matrix β may be deemed to include all such idiosyncratic risks, i.e., our set of ‘factors’ includes idiosyncratic factors that predominantly affect only individual securities. (b) The matrix β may exclude these idiosyncratic risks. In such a formalisation, the idiosyncratic risk of the jth security might be, say, σ j , and we might calculate the tracking error ⌢ ⌢ of the active portfolio using σ 2 = aT (βT Vβ)a + j ai2 σi2 = (βa)T V(βa) + j ai2 σi2 , ⌢ where V is now a much smaller sized matrix characterising merely the covariance matrix between ‘factor’ returns (a q × q matrix, if there are just q different factors included in the model). There are typically far fewer parameters to estimate using this sort of methodology, and so estimation of the parameters is likely to be more robust. Each of the three types of risk model can also be specified to be nonlinear in nature, with the β j,kxk,t replaced by more complicated functions f j xk,t of the underlying factors xk . The f j xk,t can be thought of as constituting pricing models that describe the price of the instrument in terms of changes to the underlying factor drivers. It can be particularly important to include such elements when modelling the risk characteristics of nonlinear instruments such as options; see Kemp (2009). To introduce fat-tailed behaviour into any of these models, we need to introduce fat-tailed characteristics into the behaviour of the factors, the behaviour of the residuals or both. 4.2.2 Fundamental models Suppose that we are trying to model the returns on a large number of equities. As noted above, we might intrinsically expect returns from equities in the same sector to behave more similarly than ones from quite different sectors (as long as we have some confidence in the sector classification agency being able to group together stocks with similar economic characteristics). Likewise, all other things being equal, we might expect bonds of a similar duration to behave more similarly than ones with quite different durations. Some of these a priori views have very strong intrinsic economic rationales relating to the extent to which it is possible to substitute one instrument for another. For example, the linkage between a bond’s behaviour and its duration arises because there is an intrinsic additional driver, i.e., the government bond yield curve, which strongly influences the behaviour of all bonds of the same currency (whether or not they are issued by the government). As we might expect, the linkage appears to be stronger the better the credit quality of the bond in question, because the closer the bond then is in economic terms to one of an equivalent duration issued by the relevant government. As Kemp (2009) notes, substitutability is closely allied to the principle of no arbitrage, a fundamental principle in derivative pricing theory. This principle can be viewed as akin to the assertion that two economically identical exposures should be valued identically. In other situations, we may not be able to fall back directly onto pure substitutability arguments, but the classifications may still have explanatory power. For example, there appears to be some explanatory benefit available if we classify stocks into ‘large capitalisation’ and ‘small capitalisation’. 
We can talk about a ‘size’ effect,5 and we can meaningfully express an 5 Usually when commentators refer to a ‘size’ effect or the like they mean a sustained return bias towards one type of company, which they then justify using a mixture of empirical evidence (i.e., long-term return patterns) and structural arguments, e.g., here
Identifying Factors That Significantly Influence Markets
95
investment view that large cap stocks will outperform small cap ones (or vice versa). Likewise ‘value’ stocks may be differentiated from ‘growth’ stocks, as may stocks in different industries and/or sectors, or stocks with other characteristics that can be derived from corporate reports, e.g., their gearing (i.e., leverage) level. Some of the tendency of stocks classified in a particular way to move in tandem may have fundamental economic rationale. For example, equities that are highly geared may be more exposed if organisations become less willing in aggregate to lend companies money. However, some of the tendency may arise because of the involvement of humans in the process of setting prices for different investments. Humans naturally seek order, to make sense of the plethora of investment opportunities that are available. Any classification structure that gains widespread acceptance will influence the way in which investment research is carried out and presented, potentially causing individuals to view such stocks as being ‘similar’ in nature whether or not they really are.6 To identify whether any particular classification structure seems to offer particularly good explanatory benefit, we might carry out a cluster analysis as per Section 3.8.4. To create a fundamental factor model, we 1. Identify fundamental characteristics, i.e., ‘factors’, that we believe a priori have some explanatory merit. 2. Calculate return series, x k,t , that correspond to a unit amount of a given factor exposure (this is typically done by calculating a suitable ‘average’ return across all securities exhibiting this factor, after stripping away the impact of any exposures the securities have to any to other factors, and so is actually done in tandem with step 3). 3. Carry out a multiple regression analysis of r j,t versus xk,t based on Equation (4.1) to identify the β j,k . This simultaneously identifies the residuals remaining after the impact of the relevant factor exposures. 4. Impose some structure on the residuals (perhaps using ‘blind’ factors as per a statistical model, which we discuss in Section 4.2.4). 5. Identify the expected future behaviour of the factors in step 1 and the residuals in step 4. Either or both could incorporate fat-tailed behaviour, if we so wished. 4.2.3 Econometric models The main perceived advantage that fundamental models have over statistical models is that the characterisation of securities using fundamental factors ought to improve the quality of explanatory power of our models. Another way of potentially achieving the same result is to identify observable exogenous factors that can be intrinsically expected to drive the behaviour of a variety of individual return series. For example, with bonds we might view movements in the government yield curve as just such a factor, because (as we saw above) it strongly influences the behaviour of individual bonds. In equity-land, possible econometric factors to use in this context might be bond yields, oil price movements and so on.
that ‘small cap’ are typically under researched and riskier than large cap stocks and therefore ‘ought’ to command a long-term return premium. 6 Stocks can also benefit or suffer if they are reclassified within a widely used classification structure. For example, stocks often perform well when they join an index (or when it becomes likely that they will join), because a wider selection of investors may then want to purchase them.
96
Extreme Events
Usually, the sensitivity of a particular security to such a factor is estimated using regression techniques, e.g., by finding the βˆ j,k that minimise the aggregate size of the residuals in an equation like Equation (4.1). Creating an econometric model is thus similar to creating a fundamental factor model. However, we do not normally need to carry out step 3 in Section 4.2.2, because the econometric factor in question provides its own ‘return’. 4.2.4 Statistical risk models What if we do not have ready access to fundamental or econometric factors that we think might explain observed security behaviour? We might, for example, have identified a few fundamental or econometric factors based on general reasoning. However, we might believe that (or might wish to test whether) there are other discernable factors, less easily mapped onto other data series that we have access to, that appear to be driving the residual behaviour of large numbers of securities simultaneously. Historically the most important technique that has been used to test for the presence of additional ‘blind’ factors has been principal components analysis (PCA) (see Section 4.3). However, there are other similar signal extraction techniques, such as independent components analysis (ICA), which may be better suited to catering for fat-tailed behaviour (see Section 4.4). Either methodology involves simultaneously identifying both a set of ‘blind’ factor return series and the factor exposures that each security has to these ‘blind’ factors. This type of methodology can be used in isolation (so that all the factors included in the model are ‘blind’ ones) or in conjunction with some fundamental or econometric factors. In the latter case, we might carry out the following: (a) Identify a series of fundamental or econometric factors as per Section 4.2.2 and/or Section 4.2.3. (b) Determine the residual behaviours of securities after stripping out the behaviour explained by these factors. (c) Test for the presence of additional ‘blind’ factors, by applying statistical techniques such as PCA or ICA to these residuals. (d) Enhance the fundamental/econometric factor coverage until the residuals do not appear to have any remaining ‘blind’ factors present or specifically include with the risk model the most important of these blind factors. 4.2.5 Similarities and differences between risk models We might expect the quite different derivation of factors apparently used in these three types of risk model to lead them to have rather different characteristics. However, there is arguably less difference in practice than might appear at first sight. It would be nice to believe that factors included within a fundamental or econometric model are chosen purely from inherent a priori criteria. In reality, however, the factors will normally be chosen in part because they seem to have exhibited some explanatory power in the past. They are therefore almost certain to have some broad correspondence to what you would have chosen had you merely analysed past returns in some detail as per Section 4.2.4. How can we ever expect to decouple entirely what we consider to be a ‘reasonable’ way of describing market dynamics from past experience regarding how markets have actually operated? There is ultimately only one past that we have to work with!
Identifying Factors That Significantly Influence Markets
97
The blurring is particularly noticeable with bond risk models. A key driver of the behaviour of a bond is its duration. Is this a ‘fundamental’ factor, because we can calculate it exogenously by reference merely to the timing of the cash flows underlying the bond? Or is it an ‘econometric’ factor, because a bond’s modified duration is also its sensitivity to small parallel shifts in the yield curve? Or is it a ‘statistical’ factor, because if we carry out a PCA of well-rated bonds we typically find that the most important statistical explanatory driver for a bond’s behaviour is closely allied to its duration? The ‘best possible’ exogenous factors to use, in terms of explaining what has happened in the past, are ones that in aggregate align closely with the most important blind source components. This is because the latter by design provide the series with the largest possible degree of past explanatory power of any factors we might possibly be able to identify. So why seek out fundamental or econometric factors if we know that they cannot improve on the past explanatory power of the ‘blind’ factors derivable using principal components or similar techniques? The answer is that using exogenous econometric series in this manner should give greater intrinsic meaning to the regression analysis (as long as there is economic substance to the linkage involved), which should improve the explanatory power of the model in the future. The hope is that such a model will provide a better explanation of the future even if it does not provide quite as good an explanation of the past. To test whether this does appear to be the case we cannot merely analyse all past data in one go. We know already that statistical factor models ought to provide a better fit to a single past dataset than any equivalent fundamental or econometric model. The ‘blind’ factors chosen in a statistical model are specifically chosen so that they are the best possible factors for this purpose. Instead, we would need to carry out some sort of ‘out-of-sample’ backtest, akin to the one used in Section 3.3.3. In Section 5.9.4 we describe such backtests in more detail (and some of the pitfalls involved). In broad terms they involve the following: (a) creating a separate set of models for each type of model we wish to test, each set consisting of a series of models, one for each time period, derived only from data available prior to the start of the period in question; and (b) testing the effectiveness of each such model for the period of time in question, to build up a picture through time of how reliable were the different types of model.
4.3 SIGNAL EXTRACTION AND PRINCIPAL COMPONENTS ANALYSIS 4.3.1 Introduction The primary aim of this chapter is to explore how we might formulate models that identify what drives markets and individual instruments within these markets and how, in particular, we might incorporate fat-tailed behaviour into them. In other contexts, the search for factors that appear to be driving end outcomes is normally referred to as signal extraction. In the world of quantitative finance, the term ‘signal’ is more often associated with a measurable quantity that helps us to choose when to buy or sell a particular asset (or, potentially, liability). It thus tends to be associated with tools that focus mainly on the first moment of the expected (short-term) future return distribution (i.e., its mean return), rather than on the higher moments more associated with the risks applicable to such a strategy.
98
Extreme Events
When we refer to ‘signal extraction’ we will use its more general meaning. Ultimately the focus of portfolio construction is on enhancing the risk-reward trade-off. ‘Signals’ ought therefore, in general, to consider return and risk in tandem. In the mathematical literature, signal extraction techniques tend to focus on blind source separation techniques. In our situation, this involves working only with return data (as with statistical risk models) rather than supplementing the return dataset with other information (as with fundamental and econometric risk models). However, we have already seen in Section 4.2.4 how in principle these two types of approach can be blended, if we first identify exogenous data series we think ought to be relevant to future market behaviour and, only once these have been allowed for, switch to a blind source technique. In this section and the next two, we consider two blind source techniques both of which involve linear combination mixtures. In each case the aim is in some sense to ‘un-mix’ the output signals (here the observed return series) and thus to recover the supposed input signals (here the underlying factors driving the behaviour observed in the return series). The two methodologies are (a) Principal components analysis – see this section. PCA is probably the best known blind source technique within the financial community. As explained in Section 4.2, it is commonly used in the creation and validation of many current portfolio risk and modelling system designs. It allows risk system designers to identify potential factors that seem to explain the largest amounts of individual stock variability, and may therefore also be called factor analysis. We have already used it ourselves in Chapter 3. It includes an implicit assumption of Normality (or to be more precise an investor indifference to fat-tailed behaviour, and thus a lack of need to include in the model a characterisation of the extent to which behaviour is non-Normal). It is thus not, by itself, well suited to cater for fat tails. (b) Independent components analysis – ICA and several variants motivated by the same underlying rationale are described in Section 4.4. Independent components analysis has perhaps more commonly been applied to other signal extraction problems, e.g., image or voice recognition or differentiating between mobile phone signals. In its usual formulation it seeks to extract ‘meaningful’ signals however weak or strong these signals might be (with the remaining ‘noise’ discarded). ‘Meaningful’ here might be equated with extent of non-Normality of behaviour, as is explicitly done in certain formulations of ICA, making it potentially better able to cater for fat-tailed behaviour. Ideally we would like an approach that blends together the best features of each of these two methodologies. This is because ‘meaningfulness’ should ideally be coupled with ‘magnitude’ for the source in question to be worth incorporating within models of the sort we are endeavouring to create. Moreover, ‘noise’ is not to be discarded merely because it does not appear to be ‘meaningful’. It still adds to portfolio risk. Fortunately, both principal components and independent components analysis can be thought of as examples of a more generic approach in which we grade possible sources of observed behaviour according to some specified importance criterion. Their difference is in the importance criterion that they each use. 
This means that it is possible to combine the two methodologies, by choosing a composite importance criterion that includes ‘meaningfulness’ as well as ‘magnitude’. We describe how this can be done in Section 4.5.
Identifying Factors That Significantly Influence Markets
99
Our focus will not be primarily on aggregate market behaviour but on intra-market behaviour. To simplify the explanation, our example analyses concentrate on (equity) sector effects, but the approach could equally be applied to ‘style’ dynamics such as small versus large cap or growth versus value, high yield versus investment grade, offices versus retail real estate and so on. 4.3.2 Principal components analysis PCA seeks to identify (from the data alone) the supposed input signals that seem to explain the most aggregate variability in the output signals. The first (i.e., most important) principal component explains the largest amount of aggregate variability, the next principal component the next largest amount of aggregate variability and so on (hence the name ‘principal’ components). Each lesser ‘component’ also excludes any ‘echo’ of more important components (more technically, each principal component is ‘orthogonal’ to every other one). In an investment context, the ‘output signals’ generally correspond to the returns on individual assets within a given universe. The ‘input signals’ then correspond to some supposed set of factors, i.e., drivers (that we are trying to identify), which in aggregate cause/explain the observed behaviour of these returns. We can illustrate some of the concepts involved by considering the particularly simple case where we have just two ‘output’ series x1,t and x2,t . Take, for example, two MSCI AC World sector relative return series, adjusted to have zero mean (one of which is the same as one that we used in Figure 3.9 on page 68). We can plot these series against each other in a scatter plot format as in Figure 4.1. Each point in this chart has coordinates x1,t , x 2,t for some t.
MSCI ACWI Software/Services (relative return)
15% Individual data points Principal Component 1
10% Principal Component 2 2 × standard deviation (of relevant normalised mixture)
5%
0% –15%
–10%
–5%
5%
10%
15%
–5%
–10%
–15%
MSCI ACWI Utilities (relative return) Figure 4.1 Illustrative principal components analysis C Nematrian. Reproduced by permission of Nematrian Source: Nematrian, Thomson Datastream.
100
Extreme Events
The maximum number of input series (i.e., ‘signals’) that it is possible to extract from m output series is m. In this example, this means that we will be able to identify only two input signals. Each input signal will be some linear combination of the output signals. Suppose that we consider all possible normalised linear combinations of the output signals and for each one we work out the variability exhibited by the corresponding combination. By
‘normalised’ we mean that if we have a1 in sector 1 and a2 in sector 2 then a12 + a22 = 1. With just two series, all possible normalised linear combinations can be represented by a unit vector, and hence by the angle that this vector makes to the x-axis. Thus we can plot a contour around the origin in Figure 4.1 whose distance from the origin corresponds to, say, twice the standard deviation of returns of the corresponding linear combination. The first principal component then lies along the direction in which this contour is farthest away from the origin, as can be seen in Figure 4.1. To find each subsequent principal component we first remove any variability explained by earlier principal components. We then identify in which direction the adjusted contour is farthest away from the origin (which will be a direction perpendicular to each previously identified principal component). With just two input series and hence a two-dimensional scatter plot there is only one direction perpendicular to the first principal component, which is thus the second principal component (see also Figure 4.1).
4.3.3 The theory behind principal components analysis Usually, the theory of principal components is developed by reference to matrix theory and the theory of vector spaces.7 In particular the focus is usually on the eigenvectors and eigenvalues of the observed covariance matrix derived from the different return series. The mathematics develops as follows. Suppose that we have m different return series,8 ri,t , and n different time periods (and for each return series we have a return for each time period), then the observed covariance matrix9 is an m × m symmetric matrix, V, whose terms are Vi, j
7 An n-dimensional vector space is the set of all possible vectors of the form x = (x 1 , x 2 , . . . , xn ) where the x i are each real numbers. For two-dimensional vector spaces such vectors can be visualised as corresponding to where a point can be positioned on a plane, with the x i corresponding to the Cartesian coordinates of x. Addition and scalar multiplication of such vectors adhere to linear combination rules. If we have two such vectors, x and y, and a scalar, k (i.e., a real number), then k (x + y) = (k (x1 + y1 ) , k (x 2 + y2 ) , . . . , k (xn + yn )). The point (0, 0, . . . , 0) is the origin. In such a formulation, an m × n matrix, M (often written as M) can be viewed as a function that maps points, x, in such a vector space to new points, y = M x in a new m-dimensional vector space in a way that again respects certain linear combination rules. In particular, we require that if x and p are two vectors, k is a scalar, y = M x and q = M p then k (M (x + p)) = k (y + q). If the new vector space is m-dimensional then the matrix can be characterised by a m × n array of numbers Mi, j and if y = M x then yi = j Mi, j x j . A n-dimensional vector can then be thought of as a 1 × n matrix. The transpose of a matrix M T is a matrix with the indices flipped over. A symmetric n × n matrix is one that has Mi j = M ji for each i and j (and thus has M = M T ). The inverse of a matrix, M −1 , if it exists, is the function that returns the mapped point back to itself. The identity matrix, I , is the matrix that maps each point to itself, i.e., x = M −1 M x = M M −1 x and x = I x for all x in the original vector space. Most of the properties of such vector spaces can be inferred from a geometrical analysis of how such spaces can be expected to behave. For example, we can infer from geometry that any set of n nonzero vectors can ‘span’ the entire vector space (i.e., linear combinations of them can in aggregate form any member of the vector space), as long as they are all, in a suitable sense, pointing in ‘different’ directions. How a matrix operates on an arbitrary point is then uniquely determined by how it operates merely on any given set of such ‘basis’ vectors. A sub-space is the set of vectors ‘spanned’ by some but not all of such a set of ‘basis’ vectors. 8 In theory we should use (log) returns here, for the sorts of reasons set out in Section 2.3.1 but this nicety is not always observed in practice. 9 We have assumed here that each time period is given equal weight in the calculation of the covariance matrix and that we are concentrating on ‘sample’ rather than ‘population’ statistics. More generally, we might weight the different time periods differently; see, e.g., Kemp (2010).
Identifying Factors That Significantly Influence Markets
101
where r¯i is the mean return for the ith series and n
Vi, j =
1 ri,t − r¯i r j,t − r¯ j n − 1 t=1
(4.2)
Eigenvectors are solutions to the matrix equation10 Vq = λq where λ is a scalar called the eigenvalue corresponding to that eigenvector. In general an m × m symmetric matrix has m eigenvectors and corresponding eigenvalues (although if any of the eigenvalues are the same then the corresponding eigenvectors become degenerate). For a non-negative definite symmetric matrix (as V should be if it has been calculated as above), the m eigenvalues will all be non-negative. We can therefore order them in descending order λ1 ≥ λ2 ≥ · · · ≥ λm ≥ 0. The corresponding eigenvectors qi are orthogonal i.e., have qiT q j = 0 if i = j. Typically we also normalise them so that |qi | ≡ qiT qi = 1 (i.e., so that they have ‘unit length’). For any λs that are equal, we need to choose a corresponding number of orthonormal eigenvectors that span the relevant vector subspace. T Each of the qi is an m-dimensional vector with terms, say, qi = Q i,1 , . . . , Q i,m . We can therefore create another matrix Q whose terms Q i j in aggregate summarise the eigenvectors. We can also associate with each eigenvector qi a corresponding n-dimensional vector gi T corresponding to a (zero-mean) ‘factor’ return series, gi = gi,1 , . . . , gi,n where gi,t =
j
Q i, j r j,t − r¯ j
(4.3)
If Q is the matrix with coefficients Q i, j then the orthonormalisation convention adopted above means that QT Q = I where I is the identity matrix. This means that Q−1 = QT and hence we may also write the observed (i.e., output) return series, ri,t , as a linear combination of the factor return series (in each case only up to a constant term per series, because the covariance does not depend on the means of the respective series): ri,t = r¯i + Q k,i gk,t (4.4) k
Additionally, we have qiT V q j = λ j qiT q j = 0 if i = j and = λi if i = j. We also note that 1 Vi, j = Q k, j g j,t = Q k,i gi,t Q k,i Q k, j λk (4.5) m−1 t k k k Hence the sum of the return variances (summed across the entire universe under consideration), i.e., the trace of the covariance matrix, satisfies tr (V ) ≡
i
Vi,i =
i
k
Q k,i Q k,i λk =
k
i
Q k,i Q k,i λk =
λk
(4.6)
k
Thus the aggregate variability of the output signals (i.e., the sum of their individual variabilities) is equal to the sum of the eigenvalues of the corresponding covariance matrix. Hence 10
Or equivalently to the equation |V − λI| = 0, where I is the identity matrix, and |A| ≡ det (A) is the determinant of A.
102
Extreme Events
the larger the eigenvalue the more the input signal (i.e., ‘factor’) corresponding to that eigenvalue (and the signal associated with it) ‘contributes’ to the aggregate variability of returns as observed over the universe of instruments as a whole. We have therefore decomposed the observed returns across the universe as a whole in exactly the manner we wanted, as long as we associate the first principal component with the factor/signal corresponding to the largest eigenvalue, the next principal component with the factor/signal corresponding to the next largest eigenvalue and so on. All these factors are orthogonal to each other and thus ‘distinct’ in terms of contributing to the observed return behaviour across the universe as a whole. The factors also in aggregate ‘span’ the entire range of observed behaviours, so there are no more than m of them (because there are no more than m output signals).
4.3.4 Weighting schemas Commercial statistical factor risk models typically derive estimates of underlying factor signals using the method described above. Suitably averaged across possible portfolios that might be chosen, the factors corresponding to the highest eigenvalues really are the ‘most important’ ones, because they explain the most variability across the universe as a whole. However, we made an implicit assumption in the above that is not always appropriate. We assumed that an equal level of ‘importance’ should be given to each instrument. Implicit in PCA is a weighting schema being applied to the different output signals. Suppose that we multiply each individual output signal, i.e., return series, ri,t , by a different weighting factor, wi = 0, i.e., we now recast the problem as if the output signals were yi,t = wi ri,t . This does not, in some sense, alter the available information we have to identify input signals, and thus the ability of any resulting PCA analysis to provide factors that in aggregate explain the entire variability in the observed datasets. However, it does alter how much variability each original return series now contributes to the deemed total variability across the universe as a whole. The results of PCA are thus not scale invariant in relation to individual stocks, and will change if we give greater or lesser weight to different stocks. Another way of seeing this is to consider what happens if we weight some stocks with wi = 1 and others with wi = 0. Then the net effect is as if we had considered just a particular subset of securities. What is the ideal weighting scheme to use? The answer depends on the purpose of the exercise. One issue that often seems to get neglected in practice (when PCA is applied to equity markets) is the extent, if any, to which the weighting scheme should take account of differences in the market capitalisations of individual stocks. It can be argued that each separate security represents an ability to take a security-specific investment stance, and so should be given equal weight. However, stocks with large market capitalisations are typically ‘more important’ in some fundamental sense when considering the market as a whole. Giving equal weight to individual securities does not fully reflect the likely desire of the investor to take industry or thematic stances. These sorts of stances will typically span multiple securities simultaneously.
Principle P15: Models often seek to explain behaviour spanning many different instruments principally via the interplay of a much smaller number of systematic factors. The selection and composition of these factors will depend in part on the relative importance we assign to the different instruments under consideration.
Identifying Factors That Significantly Influence Markets
103
4.3.5 Idiosyncratic risk An important point to note when using PCA with observed return series is that however large m is – i.e., however many instruments there are in the universe we are analysing – there are at most only n − 1 nonzero eigenvalues for a covariance matrix built up from observed returns over n time periods. This result can be confirmed by noting that all the return series can be derived from linear combinations of just n series, the ith of which has a return of 1 in the ith period and a return of 0 in all other periods. A further degree of freedom (i.e., in this context, a possible nonzero eigenvalue) is eliminated because the observed covariance matrix does not depend on the means of the underlying return series.11 This cannot, of course, in general be consistent with the characteristics of the underlying probability distribution from which we might postulate the observed returns are coming. As noted earlier, this implicitly involves an assumption of (time stationary) multivariate Normality. As there are m instruments involved, the underlying probability distribution being assumed will in general be characterised by an m × m covariance matrix and hence by m different eigenvalues which in general are all nonzero. Instead, the cut-off to at most n − 1 nonzero eigenvalues (i.e., a number often far fewer than m) is merely an artefact of the sampling process we have used, i.e., that we have only n observation periods to consider. In the common situation where there are many more securities than periods, i.e., where m ≫ n, the ‘factors’ able to be derived via a PCA analysis correspond very largely to ones that drive the behaviour of several or even many different securities simultaneously. In these circumstances, there is, in essence, no data within the observable return dataset that can help us estimate accurately contributions from idiosyncratic risks, i.e., risks that affect only individual securities at a time (e.g., the risk of fraud within a particular business). We must choose them using a priori reasoning. Some problems are sensitive to the assumptions we make regarding idiosyncratic risk. PCA (indeed, any blind source separation technique) is then unable to help us much. Instead we need to fall back onto general reasoning, or, as proposed in Kemp (2009), use sources such as market implied data from derivatives markets that are not subject to the same intrinsic data limitations. Portfolio construction, at least when viewed at an individual instrument level, happens to be one such problem. This intrinsic data limitation places intrinsic limits on the reliability of any portfolio construction process at a highly granular level. Principle P16: Historic market return datasets contain far too little data to permit reliable estimation of idiosyncratic risks, i.e., risk exposures principally affecting just one or at most a handful of individual instruments within the overall market. 4.3.6 Random matrix theory What can we tell about the n − 1 factors that we can identify from the data? Are all of them ‘significant’? If the underlying data series are actually entirely random (and each has the same underlying variability, σ ) then it is still possible to carry out a PCA and thus extract what look like n − 1 input signals, i.e., factors, corresponding to the different nonzero eigenvalues of the observed covariance matrix. 
However, in some sense, we know that such signals do not actually carry useful information, because by design we have arranged for there to be effectively no information content within the aggregate dataset. 11 A corollary is that these inherent limitations imposed by the data also apply to other signal extraction techniques, including the ICA and blended PCA/ICA approaches described later in this chapter.
104
Extreme Events
Even when there is real information in the output signals, we will typically be deriving the covariance matrix from a finite-sized sample of observed returns. Ifwe have n time periods and m securities then the covariance matrix will contain m(m + 1) 2 distinct entries. This will be substantially fewer than the total number of observations, nm, we have to estimate the covariance matrix, if m is large relative to n. We should then expect empirical determination of the correlation matrix to be ‘noisy’ and we should use it with caution. It is the smallest (and hence apparently least relevant) of the eigenvalues and associated eigenvectors that are most sensitive to this noise. But these are precisely the ones that determine the least risky portfolios; see, e.g., Laloux et al. (1999). We will therefore ideally want to limit our use of principal components to merely those that appear to be ‘significant’, in the sense that they are unlikely to have arisen by chance under the null hypothesis that the data was actually coming from the interplay of completely independent, random and otherwise homogeneous input signals. These will be the ones that appear to correspond to real characteristics of the relevant securities, rather than being mere artefacts of measurement ‘noise’. However, we need to be slightly careful with our choice of null hypothesis here. Input signals might be completely random (and independent of each other) but have different variabilities, say, σi . Then there is some information content in the output, and PCA analysis could in principle help us identify which input signals exhibited (effectively) greater variability than which others. Selection of how many principal components we should deem ‘significant’ is a branch of random matrix theory; see, e.g., Edelman and Rao (2005). This theory has a long history in physics starting with Eugene Wigner and Freeman Dyson in the 1950s. It aims to characterise the statistical properties of the eigenvalues and eigenvectors of a given matrix ensemble. The term ‘ensemble’ is used here to refer to the set of all random matrices that exhibit some pre-defined symmetry or property. Among other things, we might be interested in the average density of eigenvalues, i.e., the distribution of spacing between consecutively ordered eigenvalues etc. For example, we might compare the properties of an empirical covariance matrix V with a ‘null hypothesis’ that the returns from which the covariance matrix was derived are, in fact, arising from random noise as above. Deviations from this null hypothesis that were sufficiently unlikely might then suggest the presence of true information. In the limit when the matrices are very large (i.e., m → ∞) the probability density for the eigenvalues of the matrix ensemble is analytically tractable, if the ensemble consists of all covariance matrices derived from samples drawn from (independent) series that have a common standard deviation, σ . The density, for a given Q = n /m, is f Q,σ (λ) given by ⎧ √ (λmax − λ) (λ − λmin ) λmin ≤ λ ≤ λmax ⎨ Q , (4.7) f Q,σ (λ) = 2π σ 2 λ ⎩ otherwise 0, where
λmax
λmin
1 1 +2 = σ2 1 + Q Q 1 1 2 −2 = σ 1+ Q Q
(4.8)
(4.9)
Identifying Factors That Significantly Influence Markets
105
Some of the discontinuous features in the above formulae get smoothed out in practice for less extreme values of m. In particular we no longer then come across a hard upper and (if Q < 1) a hard lower limit for the range within which the eigenvalues must lie. We may therefore adopt the following prescription for ‘denoising’ an empirically observed covariance matrix, if we adopt the ‘null’ hypothesis that all the instruments involved are independent of each other12 (see Scherer (2007)): 1. Work out the empirically observed correlation matrix (which by construction has standardised the return series so that σ = 1). 2. Identify the largest eigenvalue and corresponding eigenvector of this matrix. 3. If the eigenvalue is sufficiently large, e.g., materially larger than the cut-off derived from the above (or some more accurate determination of the null hypothesis density applicable to the finite m case), then deem the eigenvalue to represent true information rather than noise and move on to step 4; otherwise stop. 4. Record the eigenvalue and corresponding eigenvector. Determine the contribution to each instrument’s return series from this eigenvector. Strip out these contributions from each individual instrument return series, calculate a new correlation matrix for these adjusted return series and loop back to step 2. The reason why in theory we need to adjust each instrument return series in step 4 is that otherwise the ‘residual’ return series can no longer be assumed to have a common σ , and so we can no longer directly use a formula akin to that above to identify further eigenvectors that appear to encapsulate true information rather than noise. However, if we adopt the null hypothesis that all residual return series, after stripping out ‘significant’ eigenvectors, are independent identically distributed Gaussian random series with equal standard deviations then the computation of such random matrix theory cut-offs can be materially simplified. This is because (a) The trace of a symmetric matrix (i.e., the sum of the leading diagonal elements) is the same as the sum of its eigenvectors (see Section 4.3.3). Therefore it does not change if we change the basis we use to characterise the corresponding vector space. (b) So, removing the leading eigenvectors as above merely involves removing the leading row and column, if the basis used involves the eigenvectors. (c) The variance of the residual series under this null hypothesis is therefore merely the sum of the eigenvalues not yet eliminated iteratively. (d) Hence, we can calculate the cut-off λmax (i) for the ith eigenvalue (1 ≤ i ≤ m) as follows, where λ( j) is the magnitude of the jth sorted eigenvalue, the largest eigenvalue being λ(1) :
⎛
⎞
(i (i m − − 1) m − − 1) 1 ⎝ λ( j) ⎠ 1 + λmax (i) = +2 m j=i n n n
(4.10)
12 This null hypothesis would be inappropriate if, a priori, we ‘expected’ two or more instruments to be correlated. For example, in equity-land, two instruments might correspond to separate listings of the same underlying company on two different stock exchanges. In bond-land, different bonds issued by the same underlying issuer would also typically be correlated, even once we had stripped out all other more general factors, because each such bond is exposed to that particular issuer’s idiosyncratic risk. This will not normally materially invalidate the de-noising prescription we set out here (as long as such instances are isolated), but does require further thought when we add back into our model some idiosyncratic risk components as per Section 4.2.3.
106
Extreme Events
Using this prescription, we would exclude any eigenvalues and corresponding eigenvectors for which the eigenvalue or any preceding eigenvalue was not significantly above this cut-off. More precise tests of significance if n is not deemed particularly large can be identified by simulating spreads of results for random matrices. 4.3.7 Identifying principal components one at a time Most algorithms used in practice to identify principal components extract all eigenvalues of the covariance matrix simultaneously, because this is normally the most computationally efficient approach to adopt. However, a disadvantage of such an approach is that it hides some features of PCA that we will want to use in Section 4.5. Therefore, in this section we show how PCA can be carried out extracting principal components one at a time. Suppose V is an m × m covariance matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λm ≥ 0. Consider f (a) = aT Va where a is an m dimensional vector of unit length (i.e., |a| = aT a = 1) but otherwise takes an arbitrary value. Then we can express a in coordinates defined by the orthonormal eigenvectors qi corresponding to λi , i.e., we can find a1 , . . . , a m so that qi are orthonormal, we must have ai2 = 1. a = a1 q1 + · · · + am qm . Since |a| = 1 and the Using these coordinates we find that f (a) = ai2 λi and f (a) takes its largest value when a = q1 , when f (a) = λ1 . We can therefore view PCA as formally equivalent to an algorithm in which we sequentially identify a series of ‘signals’ (each signal here characterised as a vector of unit length in the n-dimensional space and only then associated as above with an actual return series), at each step maximising a specific importance criterion and then stripping out any contribution from that signal in subsequent steps. The importance criterion we use in this formulation of PCA is to maximise f (a) = aT Va among all vectors satisfying |a| = 1. Given the ordering we chose for the λi we see that at each iteration of this algorithm we would identify the most important remaining eigenvalue. If we carry out enough iterations we will thus reproduce exactly the results that were extracted ‘all at once’ using the usual formulation of the PCA algorithm.
4.4 INDEPENDENT COMPONENTS ANALYSIS 4.4.1 Introduction ICA is another blind source separation technique used to extract useful information from potentially large amounts of output data. It has been applied to a very wide range of possible problems, including analysis of mobile phone signals, stock price returns, brain imaging and voice recognition. For example, with voice recognition, the aim might be to differentiate between different foreground contributors to the overall sound pattern and, additionally, to filter out, as far as possible, anything that appears to be background noise. Like PCA, an underlying assumption ICA makes is that the output signals derive from a linear combination of input signals. The mixing coefficients applicable to this sort of mixture are then the multipliers applied to each input signal to create the relevant output signal. ICA is based on the often physically realistic assumption that if different input signals are coming from different underlying physical processes then these input signals will be largely independent of each other. ICA aims to identify how to decompose output signals into (linear combination) mixtures of different input signals that are as independent as possible of each other. Several variants exist, which we describe below, where ‘independent’ is replaced by
Identifying Factors That Significantly Influence Markets
107
an alternative statistical property that we may also expect might differentiate between input signals. It is the seeking of independence in the source signals that differentiates ICA from PCA. PCA merely seeks to find a set of signals that are uncorrelated with each other. By uncorrelated we mean that the correlation coefficients between the different supposed input signals are zero. Lack of correlation is a potentially much weaker property than independence. Independence implies a lack of correlation, but lack of correlation does not imply independence. The correlation coefficient in effect ‘averages’ the correlation across the entire distributional form. For example, two signals might be strongly positively correlated in one tail, strongly negatively correlated in another tail and show little correspondence in the middle of the distribution. The correlation between them, as measured by their correlation coefficient, might thus be zero, but it would be wrong to conclude that the behaviours of the two signals were independent of each other (particularly, in this instance, in the tails of the distributional form). How ICA works in practice can perhaps best be introduced, as in Stone (2004), by using the example of two people speaking into two different microphones, the aim of the exercise being to differentiate, as far as possible, between the two voices. The microphones give different weights to the different voices (e.g., there might be a muffler between one of the speakers and one of the microphones). To simplify matters we assume that the microphones are equidistant from each source, so that phase differentials are not relevant to the problem at hand. ICA and related techniques rely on the following observations: (a) The two input signals, i.e., the two individual voices, are likely to be largely independent of each other, when examined at fine time intervals. However, the two output signals, i.e., the signals coming from the microphones, will not be as independent because they involve mixtures (albeit differently weighted) of the same underlying input signals. (b) If histograms of the amplitudes of each voice (when examined at these fine time intervals) are plotted then they will most probably differ from the traditional bell-shaped histogram corresponding to random noise. Conversely, the signal mixtures are likely to be more Normal in nature. (c) The temporal complexity of any mixture is typically greater than (or equal to) that of its simplest, i.e., least complex, constituent source signal. These observations lead to the following prescription for extracting source signals: If source signals have some property X and signal mixtures do not (or have less of it) then given a set of signal mixtures we should attempt to extract signals with as much X as possible, since these extracted signals are then likely to correspond as closely as possible to the original source signals.
Different variants of ICA and its related techniques ‘un-mix’ output signals, thus aiming to recover the original input signals, by substituting ‘independence’, ‘non-Normality’ and ‘lack of complexity’ for X in the above prescription. 4.4.2 Practical algorithms As noted earlier, ICA typically assumes that outputs are linear combination mixtures of inputs, i.e., are derived by adding together input signals in fixed proportions (that do not vary through time). If there are m input (i.e., source) signals then there need to be at least m different mixtures for us to be able to differentiate between the sources. In practice, the number of signal mixtures is often larger than the number of source signals. For example, with electroencephalography
108
Extreme Events Inputs
Market dynamics (assumed to arise from a mixing matrix)
Outputs
xi
yi = Wxi hence xi = W –1yi
yi
Figure 4.2 Schematic illustration of the impact of market dynamics C Nematrian. Reproduced by permission of Nematrian Source: Nematrian.
(EEG), the number of signal mixtures is equal to the number of electrodes placed on the head (typically at least ten) but there are typically fewer expected sources than this. If the number of signals is known to be less than the number of signal mixes then the number of signals extracted by ICA can be reduced by dimension reduction, either by pre-processing the signal mixtures, e.g., by using PCA techniques,13 or by arranging for the ICA algorithm to return only a specified number of signals. Such mixtures can be expressed succinctly in matrix form, W . Schematically, we are assuming that market dynamics behave as shown in Figure 4.2. We can then write the formula deriving the output signals from the input signals as yt = W x t
(4.11)
We note that we have implicitly assumed a model of the world involving time homogeneity, i.e., that W is constant through time. If the mixing coefficients, i.e., the elements of W , are known already then we can easily derive the input signals from the output matrix by inverting this matrix equation, i.e., x = Ay where A = W −1 . However, we are more usually interested in the situation where the mixing coefficients are unknown. Therefore we seek an algorithm that estimates the un-mixing coefficients, i.e., the coefficients ai, j of A, directly from the data, allowing us then to recover the signals themselves (and the original mixing coefficients). If the number of input and output signals is the same then a matrix such as W can be viewed as describing a transformation applied to an m-dimensional vector space (m being the number of input signals) spanned by vectors corresponding to the input signals. In this representation, each input signal would be characterised by a vector of unit length in the direction of a particular axis in this m-dimensional space, with each different pair of input signals being orthogonal to each other (in geometric terms, ‘perpendicular’ to each other). Any possible (linear combination) mixture of these signals then corresponds to some vector in the same vector space, and a set of m of them corresponds to a set of m vectors in such a space. An m × m matrix thus defines how simultaneously to map one set of m vectors to another in a way that respects underlying linear combinations. Inverting such a matrix (if this 13 This would involve carrying out a PCA analysis as per Section 4.3.2 and discarding signal contributions corresponding to the smaller principal components. If these correspond to ‘noise’ then such a truncation will not eliminate any information. Instead it will merely have further mixed it, but this mixing will also be unwound by the ICA algorithm if it is applied to the remaining principal components.
Identifying Factors That Significantly Influence Markets
109
is possible) corresponds to identifying the corresponding inverse transformation that returns a set of m transformed signal vectors to their original positions. ICA uses this insight by identifying which (orthogonal) mixtures of the output series seem to exhibit the largest amount of ‘independence’, ‘non-Normality’ or ‘lack of randomness’, because these mixtures can then be expected to correspond to the original input series (or scalar multiples of them). We note that there is no way in such a framework of distinguishing between two input signals that are constant multiples of each other. Thus ICA and its variants will only generally identify signals up to a scalar multiplier (although we might in practice impose some standardised scaling criteria when presenting the answers or deciding which signals to retain and which to discard as ‘insignificant’ or unlikely to correspond to a true input signal). ICA also cannot differentiate between, say, two different pairs of signals in which the ordering of the signals is reversed. This is because it does not directly include any prescription that ensures that any particular input signal will be mapped back to its own original axis. Instead, it is merely expected to arrange for each input signal to be mapped back to any one of the original (orthogonal) axes (but for no two different input signals to be mapped back onto the same axis). However, there may be a natural ordering that can be applied to the extracted signals, e.g., if input signals are expected to be strongly non-Normal then we might order the extracted signals so that the first one is the most non-Normal one, etc. 4.4.3 Non-Normality and projection pursuit Suppose that we focus further on the property of non-Normality. Despite the comments made in Section 2.4, we might measure non-Normality by the (excess) kurtosis of a distribution. Kurtosis has two properties relevant to ICA: (a) All linear combinations of independent distributions have a smaller kurtosis than the largest kurtosis of any of the individual distributions (a result that can be derived using the Cauchy-Schwarz inequality). (b) Kurtosis is invariant to scalar multiplication, i.e., if the kurtosis of distribution s is γ2 then the kurtosis of the distribution defined by q = ks where k is constant is also γ2 . Suppose that we also want to identify the input signals (up to a scalar multiple) one at a time, starting with the one with the highest kurtosis. This can be done via projection pursuit. Given points (a) and (b) we can expect the kurtosis of z t = i pi yi,t = i j pi wi, j xt to be maximised when this results in z t = kxq,t for the q corresponding to the signal xi,t which has the largest kurtosis, where k is arbitrary. Without loss of generality, we can reorder the input signals so that this one is deemed the first one, and thus we expect the kurtosis of z t to be maximised with respect to p when pW = (k, 0, . . . , 0)T and z t = kx1,t . We can also normalise the signal strength to unity, by adjusting k (because, as we noted above, the kurtosis of kz is the same as the kurtosis of z). Although we do not at this stage know the full form of W we have still managed to extract one signal (namely the one with the largest kurtosis) and found out something about W . In principle, the appropriate value of p can be found using brute force exhaustive search, but in practice more efficient gradient-based approaches would be used instead (see Section 4.4.7). 
We can then remove the recovered source signal from the set of signal mixtures and repeat the above procedure to recover the next source signal from the ‘reduced’ set of signal mixtures.
Repeating this iteratively, we should extract all available source signals (assuming that they are all leptokurtic, i.e., all have kurtosis larger than any residual noise, which we might assume is merely Normally distributed). The removal of each recovered source signal involves a projection of an m-dimensional space onto one with m − 1 dimensions and can be carried out using Gram-Schmidt orthogonalisation (see Section 4.5.3).
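To make the idea concrete, the sketch below applies kurtosis-based projection pursuit to two artificially mixed leptokurtic signals. The simulated sources, the mixing matrix and the simple grid search over unit-length weight vectors are illustrative assumptions only; in practice the gradient-based approaches of Section 4.4.7 would be used, and real return mixtures would replace the simulated data.

```python
# Hypothetical illustration: kurtosis-based projection pursuit on two mixed signals.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 10_000

# Two leptokurtic "input" signals (Laplace draws have positive excess kurtosis)
x = rng.laplace(size=(2, n))

# An assumed mixing matrix W, giving observed "output" signals y = W x
W = np.array([[0.8, 0.6],
              [0.3, 0.9]])
y = W @ x

# Grid search over unit-length weight vectors p (parameterised by an angle) for
# the mixture z = p . y with the largest excess kurtosis
angles = np.linspace(0.0, np.pi, 1_000, endpoint=False)
best = max(angles, key=lambda th: kurtosis(np.cos(th) * y[0] + np.sin(th) * y[1]))
p = np.array([np.cos(best), np.sin(best)])
z = p @ y  # recovered signal, up to an (unknown) scalar multiple

print("excess kurtosis of recovered signal:", round(kurtosis(z), 2))
print("correlation with the true sources:",
      np.round(np.corrcoef(np.vstack([z, x]))[0, 1:], 2))
```

If the kurtosis-based criterion is doing its job, the recovered mixture should be very highly correlated with one of the original sources and only weakly correlated with the other.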
4.4.4 Truncating the answers

As with PCA (or any other blind source separation technique), a fundamental issue that has no simple answer is when to truncate the search for additional signals. If the mixing processes are noise-free then we should be able to repeat the projection pursuit algorithm until we have extracted exactly the right number of input signals, leaving no remaining signal to analyse (as long as there are at least as many distinct output signals as there are input signals). However, in practice, market return data is not noise-free. We might therefore truncate the signal search once the signals we seem to be extracting via it appear to be largely artefacts of noise in the signals or mixing process, rather than suggestive of additional true underlying source signals. The difference with the approach discussed in Section 4.3.6 is that there may be no easy way to characterise analytically what constitutes an appropriate cut-off. However, if our focus is on kurtosis as a proxy for non-Normality then we might use a cut-off derived from the distribution of (excess) kurtosis arising at random for a sample drawn from a Normal distribution (see Section 2.4.6).
4.4.5 Extracting all the un-mixing weights at the same time

ICA as normally understood can be thought of as a multivariate, parallel version of projection pursuit, i.e., an algorithm that returns 'all at once' all of the un-mixing weights applicable to all the input signals. If ICA uses the same measure of 'signal-likeness' (i.e., 'independence', 'non-Normality' and 'lack of complexity') and assumes that the same number of signals exist as used in the corresponding projection pursuit methodology then the two should extract the same signals.
To the extent that the two differ, the core measure of 'signal-likeness' underlying most implementations of ICA is that of statistical independence. As we noted earlier, this is a stronger concept than mere lack of correlation. To make use of this idea, we need a measure that tells us how close any given set of unmixed signals is to being independent. Perhaps the most common measure used for this purpose is entropy, which is often thought of as a measure of the uniformity of the distribution of a bounded set of values. However, more generally, it can also be thought of as the amount of 'surprise' associated with a given outcome. This requires some a priori view of what probability distribution of outcomes is to be 'expected'. Surprise can then be equated with relative entropy (i.e., Kullback-Leibler divergence – discussed in Section 3.8.5), which measures the similarity between two different probability density functions.
The ICA approach thus requires an assumed probability density function for the input signals and identifies the un-mixing matrix that maximises the joint entropy of the resulting unmixed signals. This is called the 'infomax' ICA approach. A common assumed pdf used for this purpose is a very high-kurtosis one such as some suitably scaled version of the hyperbolic tangent distribution, $p(x) = 1 - \tanh^2 x$.
ICA can also be thought of as a maximum likelihood method for estimating the optimal un-mixing matrix. With maximum likelihood we again need to specify an a priori probability distribution, in this case the assumed joint pdf $p_s$ of the unknown source signals, and we seek the un-mixing matrix, A, that yields extracted signals $x = Ay$ with a joint pdf as similar as possible to $p_s$. In such contexts, 'as similar as possible' is usually defined via the log likelihood function, which results in the same answer as the equivalent infomax approach, because both involve logarithmic functions of the underlying assumed probability distribution.
Both projection pursuit and ICA appear to rely on the frankly unrealistic assumption that the model pdf is an exact match for the pdf of the source signals. In general, the pdf of the source signals is not known exactly. Despite this, ICA seems to work reasonably well. The reason is that we do not really care about the form of the pdf. Indeed, it could correspond to a quite extreme distribution. Instead, all we really need for the approach to work is for the model pdf to have the property that the closer any given distribution is to it (in relative entropy or log likelihood terms), the more likely that distribution is to correspond to a true source input signal. A hyperbolic tangent ('tanh')-style pdf may be an unrealistic 'model' for a true signal source, but its use within the algorithm means that distributional forms with higher kurtosis will be preferentially selected versus ones with lower kurtosis (even though neither may have a kurtosis anywhere near as large as that exhibited by the hyperbolic tangent pdf itself). The relative ordering of distributional forms introduced by the choice of model pdf is what is important rather than the structure of the model pdf per se. As a tanh-style model pdf preferentially extracts signals exhibiting high kurtosis it will extract similar signals to those extracted by kurtosis-based projection pursuit methods. Indeed, it ought to be possible to select model pdfs (or at least definitions of how to order distributional forms) that exactly match whatever metric is used in a corresponding projection pursuit methodology (even if this is not how ICA is usually specified).
To estimate the un-mixing matrix $A$ ($= W^{-1}$) that maximises the relative entropy or log likelihood and hence corresponds to the supposed input signals, we could again use brute force. However, again it is more efficient to use some sort of gradient ascent method as per Section 4.4.7, iteratively adjusting the estimated $W^{-1}$ in order to maximise the chosen metric.
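For readers who want to experiment, the sketch below uses scikit-learn's FastICA as a stand-in for the 'all-at-once' un-mixing just described. FastICA is a fixed-point algorithm built around a contrast function (here log cosh, whose derivative is tanh) rather than the infomax formulation itself, but it extracts signals on essentially the same principle. The simulated sources and mixing matrix are assumptions for illustration only.

```python
# Hypothetical illustration: all-at-once un-mixing with scikit-learn's FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
n = 5_000
S = rng.laplace(size=(n, 2))              # two leptokurtic source signals
W_mix = np.array([[0.7, 0.5],
                  [0.2, 0.9]])
Y = S @ W_mix.T                           # observed mixtures, one row per date

ica = FastICA(n_components=2, fun="logcosh", random_state=0)
S_est = ica.fit_transform(Y)              # estimated sources (up to scale and order)
A_est = ica.mixing_                       # estimated mixing matrix

# Each recovered series should line up with one true source (up to sign and scale)
print(np.round(np.corrcoef(np.vstack([S.T, S_est.T]))[:2, 2:], 2))
```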
4.4.6 Complexity pursuit

Most signals measured within a physical system can be expected to be a mixture of statistically independent source signals. The most parsimonious explanation for the complexity of an observed signal is that it consists of a mixture of simpler signals, each from a different source. Underpinning this is the assumption that a mixture of independent source signals is typically more complex than the simplest (i.e., least complex) of its constituent source signals. This complexity conjecture underpins the idea of complexity pursuit.
One simple measure of complexity is predictability. If each value of a signal is relatively easy to predict from previous signal values then we might characterise the signal as having low complexity. Conversely, if successive values of a signal are independent of each other then prediction is in principle impossible and we might characterise such a signal as having high complexity. When discussing this technique, Stone (2004) focuses on minimising Kolmogorov complexity and defines a measure, F, of temporal predictability as follows, for a given set of signal mixtures $y_i$ (each one of which would typically be a time series in an investment context) and
weights $a_{i,j}$ applied to these signal mixtures:

$$F\left(a_{i,j}, y_i\right) = \log V_i - \log U_i \qquad (4.12)$$

where $x_{i,t} = \sum_j a_{i,j} y_{j,t}$, $\tilde{x}_{i,t}$ is a suitable exponentially weighted moving average of $x_{i,t}$, i.e., $\tilde{x}_{i,t} = \eta \tilde{x}_{i,t-1} + (1 - \eta)x_{i,t}$ for some suitable (perhaps predefined) value of $\eta$, $V_i = (1/n)\sum_{t=1}^{n}\left(x_{i,t} - \bar{x}_i\right)^2$ corresponds to the overall variance of the given linear combination and $U_i = (1/n)\sum_{t=1}^{n}\left(x_{i,t} - \tilde{x}_{i,t}\right)^2$ corresponds to the extent to which it is well predicted by its previous values (assuming here a first order autoregressive dependency).
Complexity pursuit has certain advantages and disadvantages over ICA and projection pursuit. Unlike ICA it does not appear explicitly to include an a priori model for the signal pdfs, but seems only to depend on the complexity of the signal. It ought therefore to be able to extract signals with different pdf types. Also it does not ignore signal structure, e.g., its temporal nature if it is a time series. Conversely, 'complexity' is a less obviously well defined concept than independence or non-Normality. For example, the prescription introduced by Stone (2004) and described above seems to be very heavily dependent on 'lack of complexity' being validly equated with signals exhibiting strong one-period auto-dependency.

4.4.7 Gradient ascent

All the above approaches require us to maximise (or minimise) some function (the kurtosis, the log likelihood, the Kolmogorov complexity and so on) with respect to different un-mixing vectors (or un-mixing matrices, i.e., simultaneously for several un-mixing vectors all at once). Although brute force could be applied for simple problems this rapidly becomes impractical as the number of signals increases. Instead, we typically use gradient ascent, in which we head up the (possibly hyper-dimensional) surface formed by plotting the value of the function for different un-mixing vectors in the direction of steepest ascent. The direction of steepest ascent can be found from the first partial derivative of the function with respect to the different components of the un-mixing vector/matrix. Second order methods can be used to estimate how far to go along that gradient before next evaluating the function and its derivatives (see, e.g., Press et al. (2007) and Section 4.10).
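As a concrete, purely illustrative sketch, the function below evaluates the temporal predictability measure F of Equation (4.12) for a single candidate un-mixing vector; in a full complexity pursuit implementation one would maximise it over the un-mixing weights using the gradient ascent methods of Section 4.4.7. The EWMA parameter and the example data are assumptions.

```python
# Hypothetical illustration: Stone-style temporal predictability F = log V - log U.
import numpy as np

def temporal_predictability(a, Y, eta=0.9):
    """F(a, Y) for the extracted series x_t = sum_j a_j * Y[j, t].

    V is the overall variance of x; U is the mean squared deviation of x from
    its exponentially weighted moving average (a proxy for short-term
    predictability). Low-complexity (predictable) series give large F."""
    x = a @ Y                                  # candidate extracted signal
    x_tilde = np.empty_like(x)                 # EWMA of x, as in the text
    x_tilde[0] = x[0]
    for t in range(1, len(x)):
        x_tilde[t] = eta * x_tilde[t - 1] + (1.0 - eta) * x[t]
    V = np.mean((x - x.mean()) ** 2)
    U = np.mean((x - x_tilde) ** 2)
    return np.log(V) - np.log(U)

# Example: a slowly varying (predictable) mixture scores higher than pure noise
rng = np.random.default_rng(2)
t = np.arange(2_000)
Y = np.vstack([np.sin(2 * np.pi * t / 250) + 0.1 * rng.normal(size=t.size),
               rng.normal(size=t.size)])
print(temporal_predictability(np.array([1.0, 0.0]), Y))   # smooth signal: large F
print(temporal_predictability(np.array([0.0, 1.0]), Y))   # white noise: close to zero
```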
4.5 BLENDING TOGETHER PRINCIPAL COMPONENTS ANALYSIS AND INDEPENDENT COMPONENTS ANALYSIS 4.5.1 Introduction In Section 4.3.4 we noted that PCA implicitly included a weighting schema that was being applied to the different output signals. A consequence is that the results of a PCA analysis change if we scale up or scale down the volatility of an individual return series. PCA is not scale invariant in relation to individual securities. Instead, it explicitly focuses on contribution to aggregate variability, i.e., the magnitude of the contribution that each extracted signal has to the aggregate variability of different securities within the universe being considered. ICA has different scaling properties. The projection pursuit method introduced earlier (and corresponding infomax and maximum likelihood ICA approaches) grades signal importance by reference to kurtosis, rather than by reference to contribution to overall variability. As we noted earlier, kurtosis is scale invariant. Thus, in the absence of noise, ICA focuses on
the meaningfulness of the signals being extracted (if we are correct to ascribe 'meaning' to signals that appear to exhibit 'independence', 'non-Normality' or 'lack of complexity'), but it will not necessarily preferentially select ones whose behaviours contribute significantly to the behaviour of the output signal ensemble. PCA will even 'un-mix' pure Gaussian (i.e., Normally distributed) signals whereas ICA will fail then to identify any signals.
Despite these differences, PCA and ICA share many similarities. In particular, both can be thought of as examples of a projection pursuit methodology in which we sequentially extract signals that maximise some specified importance criterion (see Sections 4.2.5, 4.3.4 and 4.3.5). With PCA the importance criterion involves maximising $a^T V a$ (subject to the constraint $|a| = 1$), where V is the covariance matrix. With ICA the importance criterion involves maximising kurtosis (again, if we normalise the signal strengths, subject to the constraint $|a| = 1$). This intrinsic similarity also explains the close analogy between methods for deciding when to stop a projection pursuit ICA algorithm and when to truncate a PCA.
Which of these methods is intrinsically more appropriate for what we are trying to achieve? The answer depends in part on how the portfolios we are trying to analyse are likely to have been put together. If portfolios are selected randomly, and all we are principally interested in is the magnitude of outcomes, then PCA seems the more appropriate. It seems dangerous, however, to assume that portfolios really are being selected randomly. The active managers putting them together certainly would not be happy if you were to postulate that this is the case. Their clients pay them deliberately not to choose random portfolios; they want them to outperform. So, we may expect that at least some of the time they are biased towards portfolios that exhibit 'meaning'.

4.5.2 Including both variance and kurtosis in the importance criterion

The underlying similarity between PCA and ICA implies that a good way to capture the strengths of each is to adopt a projection pursuit methodology applied to an importance criterion that blends together variance as well as some suitable measure(s) of independence, non-Normality and/or lack of complexity. One possible approach would be to use the following importance criterion (or any monotonic equivalent):

$$f(a) = \sigma(a)\left(1 + c\,\gamma_2(a)\right) \qquad (4.13)$$
Here $\sigma(a)$ is the standard deviation of the time series corresponding to the mixture of output signals characterised by a (for portfolio construction, a corresponds to the portfolio's active positions), $\gamma_2(a)$ is its kurtosis and c is a constant that indicates the extent to which we want to focus on kurtosis rather than variance in the derivation of which signals might be 'important'. We constrain a to be of 'unit length', i.e., to have $|a| = \sqrt{a^T a} = 1$. The larger (i.e., more positive) c is, the more we would expect such an approach to tend to highlight signals that exhibit positive kurtosis. Thus the closer the computed unmixed input signals should be to those that would be derived by applying ICA to the mixed signals (if the ICA was formulated using model pdfs with high kurtosis).14 The closer to zero c is, the closer the result should be to a PCA analysis.

14 We here need to assume that $\sigma(a)$ does not vary 'too much' with respect to a, so that in the limit as $c \to \infty$ any signal exhibiting suitably positive kurtosis will be selected at some stage in the iterative process. Variation in $\sigma(a)$ might then 'blur' together some signals that ICA might otherwise distinguish.
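A minimal sketch of this blended importance criterion is given below, assuming the excess kurtosis convention for $\gamma_2$; the dummy data and the particular value of c are illustrative only (the discussion that follows motivates c = 0.39 via a 1 in 200 Cornish-Fisher quantile).

```python
# Hypothetical illustration: the blended importance criterion of Equation (4.13).
import numpy as np
from scipy.stats import kurtosis

def blended_importance(a, returns, c=0.39):
    """returns: array of shape (n_series, n_periods); a: weights, one per series."""
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)          # impose |a| = 1
    z = a @ returns                    # time series of the candidate mixture
    sigma = z.std(ddof=1)
    gamma2 = kurtosis(z)               # excess kurtosis
    return sigma * (1.0 + c * gamma2)

rng = np.random.default_rng(3)
returns = rng.standard_t(df=6, size=(5, 1_000)) * 0.01   # fat-tailed dummy data
print(blended_importance(np.ones(5), returns))
```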
A possible way of defining c is to focus on the value that corresponds to the VaR for a quantile level that is well into the tail of the distribution, using the Cornish-Fisher asymptotic expansion described in Section 2.4. If the distribution is assumed to have zero skew then this involves assuming that the quantile is as follows, where $\mu$ is the mean of the distribution, $\sigma$ is its standard deviation, $x = N^{-1}(\alpha)$, where $\alpha$ is the quantile in question and $N^{-1}(z)$ is the inverse Normal distribution function:

$$y = \mu + \sigma\left(x + \frac{\gamma_2\left(x^3 - 3x\right)}{24}\right) \qquad (4.14)$$
For example, we might adopt the equivalent of a 1 in 200 quantile cut-off, in which case $x = N^{-1}(0.005) = -2.576$ and we would use the following importance criterion:

$$f(a) = \sigma(a)\left(x + \frac{\gamma_2(a)\left(x^3 - 3x\right)}{24}\right) = \sigma\left(1 + 0.39\gamma_2\right) \qquad (4.15)$$
We may interpret this criterion as indicating that the 1 in 200 quantile is expected to be a factor of $(1 + 0.39\gamma_2)$ further into the tail than we might otherwise expect purely from the standard deviation of the distribution (if the assumptions underlying the fourth-moment Cornish-Fisher asymptotic expansion are valid, and the distribution is not skewed).

4.5.3 Eliminating signals from the remaining dataset

Before finalising our blended algorithm we need to make two further choices. We need to decide how at successive iterations we will 'remove' the 'signal' just extracted from the remaining dataset used in subsequent iterations. We also need to decide how to normalise 'signal strength'.
The most obvious way of removing each extracted signal from subsequent returns analysed (and the one that seems to be referred to most commonly in the literature on ICA) is to
1. Identify at each iteration the sub-space of mixtures of the original data series that are orthogonal to the signal that we are extracting.
2. Search for signals only in this sub-space.
This procedure is known as Gram-Schmidt orthogonalisation and ensures that each extracted signal is orthogonal to, i.e., uncorrelated with, all preceding and subsequent extracted signals. If the signals extracted (indexed by the order in which they were identified and time indexed by t) are $s_i(t)$, then we have $\sum_t s_i(t)s_j(t) = 0$ if $i \neq j$. Eigenvectors are always orthogonal to each other (if their corresponding eigenvalues are equal, we need to choose them appropriately), so such a signal extraction approach automatically arises with PCA. If we use this approach with an importance criterion set in line with PCA (i.e., with c = 0) then we exactly replicate the results of the conventionally formulated 'all-at-once' PCA algorithm.
Equivalently, we can think of this extraction approach as involving the calculation of a beta for each return series relative to the signal being extracted and then subtracting from each
return series its own beta times the signal in question. Beta is here calculated as

$$\beta = \frac{\sum_t \left(r(t) - \bar{r}\right)\left(s_j(t) - \bar{s}_j\right)}{\sum_t \left(s_j(t) - \bar{s}_j\right)^2} \qquad (4.16)$$
The extraction methodology described above is not the only possible approach. Choosing beta as per Equation (4.16) minimises the variance of $y(t) = r(t) - \beta s_j(t)$. Within our blended PCA/ICA approach we focus on an importance criterion that blends variance with kurtosis. We might therefore attempt to define a modified beta, say $\beta^*$, that minimised the blended importance criterion applied to $y(t)$ rather than merely the variance of $y(t)$. However, such an approach does not seem to work well in practice. The $\beta^*$, crudely defined in this manner, often seem to end up being smaller or larger than we might intrinsically expect, resulting in a given signal not being fully excluded at the first opportunity and the algorithm ending up repeatedly trying to extract the same (or nearly the same) signal at subsequent iterations. When presenting the results of a specimen blended PCA/ICA analysis in Section 4.6 we have therefore used a beta in line with Equation (4.16) when extracting consecutive signals.
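A minimal sketch of this removal step is shown below, assuming Equation (4.16): each return series' beta to the extracted signal is computed and beta times that signal is subtracted before searching for the next signal. The dummy data are illustrative only.

```python
# Hypothetical illustration: removing an extracted signal via betas, Equation (4.16).
import numpy as np

def remove_signal(returns, signal):
    """returns: (n_series, n_periods); signal: (n_periods,). Returns residuals and betas."""
    s = signal - signal.mean()
    r = returns - returns.mean(axis=1, keepdims=True)
    betas = (r @ s) / (s @ s)                    # Equation (4.16), one beta per series
    residuals = returns - np.outer(betas, signal)
    return residuals, betas

rng = np.random.default_rng(4)
signal = rng.normal(size=500)
returns = np.outer([0.5, -0.2, 1.1], signal) + 0.1 * rng.normal(size=(3, 500))
residuals, betas = remove_signal(returns, signal)
print(np.round(betas, 2))                                    # close to [0.5, -0.2, 1.1]
print(np.round(np.corrcoef(np.vstack([residuals, signal]))[-1, :3], 2))  # ~0 after removal
```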
4.5.4 Normalising signal strength

The other element of the algorithm that we need to finalise is how to 'normalise' the signal strength. The idea of un-mixing is that we identify a series of signals, $s_i(t)$, which (if we extract enough of them) 'span' the output signals, i.e., output signals can be written as $r_i(t) - \bar{r}_i = \sum_j a_{i,j}\left(s_j(t) - \bar{s}_j\right)$, for some suitable constants $a_{i,j}$. However, if this is the case then the output signals can also be spanned by signals that are arbitrary scalar multiples of the original $s_i(t)$, e.g., by $k_i s_i(t)$ (if $k_i \neq 0$), by replacing the $a_{i,j}$ by $a^*_{i,j} = a_{i,j}/k_j$.
With a pure (kurtosis-orientated) ICA how we normalise the signal strength is irrelevant, given the scale invariant properties of kurtosis. The kurtosis of $k_i s_i(t)$ is the same as the kurtosis of $s_i(t)$ whatever the value of $k_i$ (if $k_i \neq 0$). However, the choice is relevant with PCA and blended PCA/ICA, because it then influences which signals we extract first and (for the blended PCA/ICA) potentially all subsequently extracted signals.
One possible way of normalising signal strength is to require the squares of the mixing coefficients applicable to the given signal to add to unity. This approach seems to be commonly suggested for ICA (see, e.g., Stone (2004)) despite the issue not actually being particularly relevant to ICA. It also seems consistent with how PCA is normally presented. However, arguably a better normalisation approach is to require that the sum of the squares of the betas as per Equation (4.16) should add to unity. This results in the sum of the squares of the variances of the extracted signals, and hence the aggregate amount of variance explained by the signals across the entire security universe, being the same as would arise with PCA. The two approaches happen to produce the same answers in the special case where c = 0, i.e., with PCA, which is why we might be misled by how PCA is normally presented.
16 The use of the term 'beta' here is a convention linked to the way in which regression analyses are usually introduced in text books, i.e., as attempts to fit data according to expressions such as $y(t) = \alpha + \beta x(t) + \varepsilon(t)$ for a univariate regression or $y_i(t) = \alpha_i + \sum_j \beta_{i,j} x_j(t) + \varepsilon_i(t)$ for a multivariate regression.
4.6 THE POTENTIAL IMPORTANCE OF SELECTION EFFECTS 4.6.1 Introduction We have seen that design of risk systems often involves trade-offs. One of these trade-offs relates to the types of portfolios for which the system is best able to cater. For example, a model designed principally for analysing bond portfolios may not be particularly effective at analysing equity portfolios – the factors driving the two markets are not necessarily the same. Even if the risk model does cover the right sort of instruments, however, we should still consider whether the portfolios for which it is most important that the system provide the ‘correct’ answer are likely to be ‘representative’ of the generality of portfolios that the system is designed to handle. Typically, systems aim to be agnostic, as far as possible, across all possible randomly chosen portfolios. However, actively managed portfolios are not randomly chosen (at least not as far as the manager in question is concerned). This introduces the possibility of selection effects. Selection effects can be important in other financial contexts. For example, the average purchaser of an annuity does not have a life expectancy in line with that of the generality of the populace. If you were an annuity provider and you based your premium rates on the mortality of the general populace then you would rapidly go bankrupt. Active investment managers presumably also do not believe that they are choosing their portfolios at random. Indeed clients specifically pay them not to do so! We show below that if a manager is actively selecting strategies that are likely to exhibit fat-tailed characteristics then typical risk model designs can radically understate the extent of fat-tailed behaviour that the resulting portfolio might exhibit. In effect, typical risk models implicitly incorporate something akin to an ‘averaged’ level of fat-tailed behaviour across all portfolios they might be used to analyse, whereas we want them to express an average merely over ones that the manager might actually select. The particular reason for stressing this point is that active managers do not seek to construct their portfolios randomly. Instead they seek to impart structure or ‘meaning’ to them. However, we saw in Section 4.3.1 that in signal extraction theory ‘meaning’ is often associated with non-Normality, so just such a selection process probably does happen, although it is difficult to tell just how strong the effect might be.
4.6.2 Quantifying the possible impact of selection effects In Table 4.1 we set out the results of applying PCA, blended PCA/ICA as above (with c = 0.39) and pure (kurtosis-orientated) ICA to monthly sector relative return data for the 23 MSCI ACWI sectors considered in Chapter 3, for the period 30 May 1996 to 28 February 2009. We include in Table 4.1 the largest six components extracted in each case by the algorithm, on the grounds that application of random matrix theory suggests that only roughly six of the principal components can reliably be considered to be other than artefacts of random noise (see Figure 3.15 on page 72). We also highlight (in bold) two numbers, namely the average importance criterion applicable to the top six components in the PCA analysis and the corresponding number in the blended PCA/ICA approach. We do so because their difference highlights the substantial impact that selection effects can potentially have on the estimated magnitude of extreme events.
Table 4.1 The potential impact of selection effects on observed fat-tailed behaviour

                     PCA, only StdDev (c = 0)           Blended (c = 0.39)                 Only kurtosis
Component      StdDev (%)   Kurt   Criterion (%)   StdDev (%)   Kurt   Criterion (%)   StdDev (%)   Kurt
1                  10.6      3.1       10.6            8.3      14.9       56.6            4.5      24.2
2                   6.5      2.1        6.5            4.9      24.9       52.7            4.2      23.5
3                   5.6      1.7        5.6            5.0      22.1       48.0            4.5      18.1
4                   4.8      1.4        4.8            4.5      14.7       30.1            6.9      16.2
5                   4.2      0.4        4.2            4.3      15.0       29.7            4.2      15.0
6                   3.7      1.1        3.7            4.8       9.2       22.1            4.2      13.7
Av (top 6)          5.9      1.6        5.9            5.3      16.8       39.9            4.7      18.5
Av (all 23)         3.2      1.2        3.2            3.6       8.2       17.5            3.7       9.1

Ordering of series extracted is by reference to the size of the relevant importance criterion.
Source: Nematrian, Thomson Datastream
Suppose, for example, that an investment manager selected his or her portfolio on the basis of 'meaning' and that 'meaning' in this context is validly associated with non-Normality. In particular, suppose that the manager selects a portfolio that happens to correspond to the same mixture of assets as one that reproduces one of the top six blended PCA/ICA components indicated in Table 4.1. Remember that these components have been extracted by un-mixing the original sector relative return series, and so there is some combination of sector positions that does correspond to each one of these components.
The difference between the 39.9% and the 5.9% shown in the two columns titled 'Criterion' has a sobering interpretation. The ratio between the two (nearly 7) can be viewed as representing (if we believed the Cornish-Fisher approximation, see below) roughly how large the size of a 1 in 200 event might be for a portfolio selected to express strong kurtosis relative to the size we would predict for it were we to ignore fat tails (and hence to treat the exposure as merely a suitable mixture of factor exposures that could be derived from a PCA analysis). Even if we reintroduced a Cornish-Fisher adjustment to each individual PCA principal component, to reflect the apparent fat-tailed behaviour applicable to the generality of portfolios, the figure of 5.9% in the first column titled 'Criterion' would still only rise to c. 10%. There would then still be a factor of 4 difference between what we might have naively assumed was likely to be the 'corrected' magnitude of extreme events (here viewed as 1 in 200 events) and their implied actual size in situations where there is a strong selection bias towards exposures exhibiting fat tails.
It can be argued that the above analysis overstates the potential size of selection effects. For example, we noted in Section 2.4.4 that the fourth order Cornish-Fisher expansion is not necessarily very good at estimating the shape of the distributional form in regions in which we might be most interested. Indeed it might overstate the actual magnitude of fat-tailed behaviour.17 Moreover, investment managers may not necessarily impute 'meaning' in their investment process merely or even mainly to factors that have an intrinsic fat-tailed element.
Conversely, equity sector relative returns are typically seen as relatively well-behaved compared with other more naturally fat-tailed potential exposures that might be present in some investment portfolios. This suggests that selection effects could be even larger with some other asset types. Indeed, we might view the events of the 2007–09 credit crisis as just such an example of selection effects writ large. The portfolios and financial institutions that ran into most trouble were typically the ones that systematically selected for themselves strategies implicitly dependent on continuation of benign liquidity conditions. Liquidity risk is an exposure type that is very fat-tailed, because most of the time it does not come home to roost. It is also a type of risk that is not handled very well by many risk systems. The losses that some portfolios and institutions incurred were in some cases many times larger than their risk modellers might previously have thought realistically possible.
17 We could seek to refine the blended PCA/ICA approach to be more consistent with the methodology proposed in Section 2.4.5 in which we ‘extrapolated into the tail’ by directly fitting a curve through the quantile–quantile plot, i.e., the observed (ordered) distributional form. Such an approach is more computationally intensive than the Cornish-Fisher approach, particularly if the data series involve a large number of terms. This is because it requires the return series to be sorted, in order to work out which observations to give most weight to in the curve fitting algorithm. Sorting large datasets is intrinsically much slower than merely calculating their moments because it typically involves a number of computations that scale in line with approximately O (n log n) rather than merely O (n), where n is the number of observations in each series.
Principle P17: Selection effects (in which those involved in some particular activity do not behave in a fashion representative of a random selection of the general populace) can be very important in many financial fields. They can also be important in a portfolio construction context. Managers can be expected to focus on ‘meaning’ as well as ‘magnitude’ in their choice of positions. The spread of behaviours exhibited by portfolios that managers deem to have ‘meaning’ may not be the same as the spread of behaviours exhibited by portfolios selected randomly.
4.6.3 Decomposition of fat-tailed behaviour

In Chapter 3 we described how we could decompose fat-tailed behaviour into two parts. One corresponded to fat-tailed behaviour in individual return series in isolation (as characterised by their marginals). The other corresponded to fat-tailed behaviour in their co-dependency (as characterised by the copula).
In Figure 3.12 (on page 70) we showed an 'averaged' fractile–fractile plot for pairs of relative return series, which suggested a strong tail dependency via the peaks in its corners. We suggested that this might overstate 'actual' tail dependency because we were mixing distributions with different correlations. To compensate for this, we provided in Figure 3.14 (on page 71) an equivalent 'averaged' fractile–fractile plot for pairs of principal components (which by construction are orthogonal, i.e., uncorrelated). This had some peaking in its corners but not as much as Figure 3.12.
The components extracted by the blended PCA/ICA algorithm as above are also by construction orthogonal, i.e., uncorrelated, and we can prepare an equivalent 'averaged' fractile–fractile plot for them (see Figure 4.3). Comparison of this chart with Figure 3.14 highlights a subtlety concerning how we might decompose fat-tailed behaviour between the marginals and the copula.
Figure 4.3 is noticeably flatter, i.e., less fat-tailed, than Figure 3.14 (and has a peak at its centre as well as at its four corners).
Figure 4.3 Fractile–fractile plot of blended PCA/ICA component rankings of sector relative returns showing number of observations in each fractile pairing, averaged across all principal component pairings and all +/– combinations of such pairs
© Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.
To compensate for this, the individual return series included in Figure 4.3 are in isolation fatter-tailed than the ones included in Figure 3.14. However, in both cases the series being analysed are all orthogonal. So we appear to be able to shift fat-tailed behaviour between the marginals and the copula despite Sklar's theorem indicating that for any given (continuous) probability distribution the copula is uniquely defined.
How can this be? The answer is that the charts we are using here do not describe the full (multi-dimensional) copula. Instead, they merely show a type of projection of the copula into two dimensions. Our apparent ability to shift fat-tailed behaviour between the marginals and the copula relates to higher dimensional behaviour in the copula that is incompletely captured merely by the type of two-dimensional projections we are using here. In practice, it is extremely difficult to visualise higher-dimensional behaviour. Estimating it accurately is also subject to inherent data limitations of the sort described in Section 4.2.3, because of our inherent inability to estimate accurately the underlying component series.
This ambiguity in how we might apportion fat-tailed behaviour between marginals and copula is one reason why in Section 3.5 we favoured an approach for visualising joint fat-tailed behaviour that did not overly rely on the applicability of this decomposition. Instead, we focused primarily on an approach that involved a more holistic view of fat-tailed behaviour.
4.7 MARKET DYNAMICS

4.7.1 Introduction

In the previous sections of this chapter, we have focused on behaviours that can arise by linearly combining factors/signals, i.e., by combining them in the form $\sum_i a_i s_i(t)$ where the $a_i$ are constants that do not change through time. In this book we call such mixtures 'linear combination mixtures'. The assumption of time stationarity is implicit in most traditional time series analysis. In this section we highlight just how restrictive this assumption can be. It considerably limits the market dynamics that can be reproduced by such models, indeed so much so that important elements of market dynamics observed in practice cannot be replicated using such models. Linear combination models can only describe a relatively small number of possible market dynamics, in effect just regular cyclicality and purely exponential growth or decay. Investment markets do show cyclical behaviour, but the frequencies of the cycles are often far from regular.

4.7.2 Linear regression

Suppose that we have an 'output' series, $y_t$, that we want to forecast and an 'input' series, $x_t$ (which we can observe), that drives the behaviour of the output series (in each case for $t = 1, \ldots, n$, where t is a suitable time index). Suppose further that there is a linear relationship between the two time series of the form $y_t = a + bx_t + \varepsilon_t$ where the $\varepsilon_t$ are random errors each with mean zero, and a and b are unknown constants. The same relationship can be written in vector form as $y = a + bx + \varepsilon$ where $x = (x_1, \ldots, x_n)^T$ is now a vector of n elements corresponding to each element of the time series etc. In such a problem the $y_t$ are called the dependent variables and the $x_t$ the independent variables, because in the postulated relationship the $y_t$ depend on the $x_t$ not vice versa.
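As a small, purely illustrative sketch of this set-up, the snippet below simulates such a relationship and recovers a and b using the closed-form least squares estimates discussed next; the parameter values are assumptions.

```python
# Hypothetical illustration: simulating y_t = a + b x_t + e_t and recovering a and b.
import numpy as np

rng = np.random.default_rng(6)
n = 250
x = rng.normal(size=n)                     # observed "input" series
eps = 0.2 * rng.normal(size=n)             # zero-mean random errors
a_true, b_true = 0.01, 0.6
y = a_true + b_true * x + eps              # "output" series to be forecast

b_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # least squares slope
a_hat = y.mean() - b_hat * x.mean()                      # least squares intercept
print(round(a_hat, 3), round(b_hat, 3))    # close to (0.01, 0.6)
```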
As explained in Kemp (1997) and Kemp (2010), traditional time series analysis typically solves such a problem using regression techniques. If the $\varepsilon_t$ are independent identically distributed Normal random variables with the same variance (and zero mean) then the maximum likelihood estimators of a and b are the values that minimise the sum of the squared forecast error, i.e., $\sum_t \left(y_t - (a + bx_t)\right)^2$. These are also known as their least squares estimators.

4.7.3 Difference equations

To convert this simple example into one that might be relevant to return forecasting, we need to incorporate some time lags in the above relationship. This involves assuming that stocks, markets and/or the factors driving them exhibit autoregression. We assume that there is some equation governing the behaviour of the system $y_t = f(y_{t-1}, y_{t-2}, \ldots)$. The $y_t$ might now in general be vector quantities rather than scalar quantities, some of whose elements might be unobserved state variables (or associated with other economic factors, see Section 4.2). However, the simplest examples have a single (observed) series in which later terms depend on former ones.
Consider first a situation where we have only one time series and we are attempting to forecast future values from observed present and past values. The simplest example might involve the following equation, where c is constant:

$$y_t = cy_{t-1} + w_t \qquad (4.17)$$
This is a linear first order difference equation. A difference equation is an expression relating a variable $y_t$ to its previous values. The above equation is first order because only the first lag ($y_{t-1}$) appears on the right hand side of the equation. It is linear because it expresses $y_t$ as a linear function of $y_{t-1}$ and the innovations $w_t$. The $w_t$ are often (but do not always need to be) treated as random variables. Such a model of the world is called an autoregressive model, with a unit time lag. It is therefore often referred to as an AR(1) model. It is also time stationary, because c is constant. We can however introduce seasonal factors (or some steady secular trend) by including a 'dummy' variable linked to time. An example commonly referred to in the quantitative investment literature is a dummy variable set equal to 1 in January but 0 otherwise, to identify whether there is any 'January' effect that might arise from company behaviour around year-ends.
If we know the value $y_0$ at t = 0 then we find using recursive substitution that

$$y_t = c^t y_0 + \sum_{j=1}^{t} c^{t-j} w_j \qquad (4.18)$$
We can also determine the effect of each individual $w_t$ on, say, $y_{t+j}$, the value of y that is j time periods further into the future than $y_t$. This is sometimes called the dynamic multiplier, $\partial y_{t+j}/\partial w_t = c^j$. If $|c| < 1$ then such a system is stable, in the sense that the consequences of a given change in $w_t$ will eventually die out. It is unstable if $|c| > 1$. An interesting possibility is the borderline case where c = 1, when the output variable $y_{t+j}$ is the sum of its initial starting value and historical inputs.
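The short sketch below, offered purely as an illustration with assumed parameter values, simulates Equation (4.17) and confirms numerically that a unit shock to $w_t$ moves $y_{t+j}$ by $c^j$.

```python
# Hypothetical illustration: AR(1) dynamics and the dynamic multiplier c**j.
import numpy as np

def simulate_ar1(c, w, y0=0.0):
    """Iterate y_t = c * y_{t-1} + w_t for a given innovation sequence w."""
    y = np.empty(len(w))
    prev = y0
    for t, wt in enumerate(w):
        prev = c * prev + wt
        y[t] = prev
    return y

c, horizon = 0.8, 12
base = simulate_ar1(c, np.zeros(horizon))
shocked = simulate_ar1(c, np.eye(horizon)[0])     # unit shock at t = 0 only
multipliers = shocked - base
print(np.round(multipliers, 4))                   # equals c**j, j = 0, 1, 2, ...
print(np.allclose(multipliers, c ** np.arange(horizon)))   # True
```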
We can generalise the above dynamic system to be a linear pth order difference equation by making it depend on the first p lags along with the current value of the innovation (input value) $w_t$, i.e., $y_t = c_1 y_{t-1} + c_2 y_{t-2} + \cdots + c_p y_{t-p} + w_t$. This can be rewritten as a first order difference equation, but relating to a vector, if we define the vector, $g_t$, as follows:

$$g_t \equiv \begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+2} \\ y_{t-p+1} \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & \cdots & c_{p-1} & c_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} \begin{pmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p+1} \\ y_{t-p} \end{pmatrix} + \begin{pmatrix} w_t \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} \equiv F g_{t-1} + w_t \qquad (4.19)$$

$$\Rightarrow g_t = F^t g_0 + \sum_{j=1}^{t} F^{t-j} w_j \qquad (4.20)$$
For a pth order equation we have the following, where $f_{i,k;j} = \left(F^j\right)_{i,k}$ is the element in the ith row and kth column of $F^j$:

$$y_{t+j} = \sum_{k=1}^{p} f_{1,k;j+1}\, y_{t-k} + \sum_{k=1}^{j} f_{1,1;j-k}\, w_{t+k} \qquad (4.21)$$
To analyse the characteristics of such a system in more detail, we first need to identify the eigenvalues of F, which are the roots of the following equation:

$$\lambda^p - c_1\lambda^{p-1} - c_2\lambda^{p-2} - \cdots - c_{p-1}\lambda - c_p = 0 \qquad (4.22)$$
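As a hedged illustration of this condition, the sketch below builds the companion matrix F for an assumed AR(2) specification and checks whether all eigenvalues, i.e., the roots of Equation (4.22), lie inside the unit circle; the chosen coefficients give complex eigenvalues, anticipating the cyclical behaviour discussed next.

```python
# Hypothetical illustration: stability check via the companion matrix F.
import numpy as np

def companion_matrix(c):
    """c = (c_1, ..., c_p) from y_t = c_1 y_{t-1} + ... + c_p y_{t-p} + w_t."""
    p = len(c)
    F = np.zeros((p, p))
    F[0, :] = c
    F[1:, :-1] = np.eye(p - 1)      # sub-diagonal of ones shifts the lags down
    return F

c = [1.0, -0.9]                      # an (assumed) AR(2) specification
eigenvalues = np.linalg.eigvals(companion_matrix(c))
print(np.round(eigenvalues, 3))                       # complex conjugate pair
print("stable:", np.all(np.abs(eigenvalues) < 1))     # True here
```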
A pth order equation such as this always has p roots, but some of these may be complex numbers rather than real ones, even if (as would be the case in practice for investment time series) all the $c_j$ are real numbers. Complex roots correspond to regular cyclical (i.e., sinusoidal) behaviour because $e^{x+iy} = e^x(\cos y + i\sin y)$ if x and y are real. We can therefore have combinations of exponential decay, exponential growth and sinusoidal (perhaps damped or inflating) behaviour. For such a system to be stable we require all the eigenvalues $\lambda$ to satisfy $|\lambda| < 1$, i.e., for their absolute values all to be less than unity.

4.7.4 The potential range of behaviours of linear difference equations

An equivalent way of analysing a time series is via its spectrum, because we can transform a time series into a frequency spectrum (and vice versa) using Fourier transforms. Take, for example, another sort of prototypical time series model, namely the moving average or MA model. This assumes that the output depends purely on an input series (without autoregressive components), i.e.:
$$y_t = \sum_{k=1}^{q} b_k w_{t-k+1} \qquad (4.23)$$
There are three equivalent characterisations of a MA model:
(a) In the time domain, i.e., directly via the $b_1, \ldots, b_q$.
(b) In the form of autocorrelations, i.e., via $\rho_\tau$ where

$$\rho_\tau = \frac{E\left((y_t - \mu)(y_{t-\tau} - \mu)\right)}{\sigma} \qquad (4.24)$$

where $\mu = E(y_t)$ and $\sigma = E\left((y_t - \mu)^2\right)$ (strictly speaking, the autocorrelation computation may involve lag-dependent $\mu$ and $\sigma$). If the input to the system is a stochastic process with input values at different times being uncorrelated (i.e., $E(w_i w_j) = 0$ for $i \neq j$) then the autocorrelation coefficients become

$$\rho_\tau = \begin{cases} \sum_{k=|\tau|+1}^{q} b_k b_{k-|\tau|}, & \text{if } |\tau| \le q \\ 0, & \text{if } |\tau| > q \end{cases} \qquad (4.25)$$
(c) In the frequency domain. If the input to a MA model is an impulse then the spectrum of the output (i.e., the result of applying the discrete Fourier transform to the time series) is given by

$$S_{MA}(f) = \left|1 + b_1 e^{-2\pi i f} + \cdots + b_q e^{-2\pi i qf}\right|^2 \qquad (4.26)$$
It is also possible to show that an AR model of the form described earlier has the following power spectrum:

$$S_{AR}(f) = \frac{1}{\left|1 - c_1 e^{-2\pi i f} - \cdots - c_p e^{-2\pi i pf}\right|^2} \qquad (4.27)$$
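The sketch below simply evaluates these two spectra on a frequency grid for assumed coefficient values; with the AR(2) coefficients used in the earlier sketch the AR spectrum shows a peak close to the frequency implied by the angle of its complex eigenvalues.

```python
# Hypothetical illustration: evaluating the MA and AR power spectra of (4.26)-(4.27).
import numpy as np

def ma_spectrum(b, f):
    q = np.arange(1, len(b) + 1)
    return np.abs(1 + np.exp(-2j * np.pi * np.outer(f, q)) @ np.asarray(b)) ** 2

def ar_spectrum(c, f):
    p = np.arange(1, len(c) + 1)
    return 1.0 / np.abs(1 - np.exp(-2j * np.pi * np.outer(f, p)) @ np.asarray(c)) ** 2

f = np.linspace(0.0, 0.5, 501)               # frequencies in cycles per period
S_ar = ar_spectrum([1.0, -0.9], f)           # strongly cyclical (assumed) AR(2)
S_ma = ma_spectrum([0.5, 0.2], f)            # an (assumed) MA(2)
print("AR spectrum peaks at f =", round(f[np.argmax(S_ar)], 3))
```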
The next step in complexity is to have both autoregressive and moving average components in the same model, e.g., what is typically called an ARMA ( p, q) model:
$$y_t = \sum_{i=1}^{p} c_i y_{t-i} + \sum_{j=1}^{q} b_j w_{t-j} \qquad (4.28)$$
The output of an ARMA model is most easily understood in terms of the z-transform, which generalises the discrete Fourier transform to the complex plane, i.e.:
$$X(z) \equiv \sum_{t=-\infty}^{\infty} x_t z^t \qquad (4.29)$$
On the unit circle in the complex plane the z-transform reduces to the discrete Fourier transform. Off the unit circle, it measures the rate of divergence or convergence of a series. Convolution of two series in the time domain corresponds to the multiplication of their z-transforms. Therefore the z-transform of the output of an ARMA model, $Y(z)$, satisfies

$$Y(z) = C(z)Y(z) + B(z)W(z) \qquad (4.30)$$
$$\Rightarrow Y(z) = \frac{B(z)}{1 - C(z)}\, W(z) \equiv T(z)W(z), \text{ say} \qquad (4.31)$$
This has the form of an input z-transform $W(z)$ multiplied by a transfer function $T(z) = B(z)\left(1 - C(z)\right)^{-1}$ unrelated to the input. The transfer function is zero at the zeros of the MA term, i.e., where $B(z) = 0$, and diverges to infinity, i.e., has poles (in a complex number sense), where $C(z) = 1$, unless these are cancelled by zeros in the numerator. The number of poles and zeros in this equation determines the number of degrees of freedom in the model. Since only a ratio appears there is no unique ARMA model for any given system. In extreme cases, a finite-order AR model can always be expressed by an infinite-order MA model, and vice versa.
There is no fundamental reason to expect an arbitrary model to be able to be described in an ARMA form. However, if we believe that a system is linear in nature then it is reasonable to attempt to approximate its true transfer function by a ratio of polynomials, i.e., as an ARMA model. This is a problem in function approximation. It can be shown that a suitable sequence of ratios of polynomials (called Padé approximants) converges faster than a power series for an arbitrary function. But this still leaves unresolved the question of what the order of the model should be, i.e., what values of p and q to adopt. This is in part linked to how best to approximate the z-transform. There are several heuristic algorithms for finding the 'right' order, for example the Akaike Information Criterion; see, e.g., Billah, Hyndman and Koehler (2003). These heuristic approaches usually rely very heavily on the model being linear and can also be sensitive to the assumptions adopted for the error terms.

4.7.5 Multivariate linear regression

There are several ways in which we can generalise linear regression, including:
(a) Introduce heteroscedasticity. This involves assuming that the $\varepsilon_t$ have different (known) standard deviations. We then adjust the weightings assigned to each term in the sum, giving greater weight to the terms in which we have greater confidence.
(b) Introduce autoregressive heteroscedasticity. The standard deviations of the $\varepsilon_t$ then vary in some sort of autoregressive manner.
(c) Use generalised linear least squares regression. This involves assuming that the dependent variables are linear combinations of linear functions of the $x_t$. Least squares regression is a special case of this, consisting of a linear combination of two functions (if univariate regression), namely $f_1(x_t) \equiv 1$ and $f_2(x_t) \equiv x_t$.
(d) Include non-Normal innovations. We no longer assume that the $\varepsilon_t$ (conventionally viewed as random terms) are distributed as Normal random variables. This is sometimes called robust regression and may involve distributions for which the maximum likelihood estimators minimise $\sum_t \left|y_t - (a + bx_t)\right|$, in which case the formulae for the estimators involve medians rather than means. We can in principle estimate the form of the dependency by the process of box counting, which has close parallels with the mathematical concept of entropy – see e.g., Press et al. (2007), Abarbanel (1993) or Section 3.8.5.
In all the above refinements, if we know the form of the error terms and heteroscedasticity we can always transform the relationship back to a generalised linear regression framework by transforming the dependent variable to be linear in the independent variables. The noise element might in such circumstances need to be handled using copulas and the like. All such refinements are therefore still ultimately characterised by a spectrum (or to be more precise a z-transform) that in general is approximated merely by rational polynomials.19 Thus the output of all such systems is still characterised by combinations of exponential decay, exponential growth, and regular sinusoidal behaviour on which is superimposed some random noise. We can therefore in principle identify the dynamics of such systems by identifying the eigenvalues and eigenvectors of the corresponding matrix equations. If noise does not overwhelm the system dynamics, we should expect the spectrum/z-transform of the system to have a small number of distinctive peaks or troughs corresponding to relevant poles or zeros applicable to the AR or MA elements (and hence for the underlying dynamics of the system to be derivable from their location). Noise will result in the spreading out of the power spectrum around these peaks and troughs. The noise can be ‘removed’ by replacing the observed power spectrum with one that has sharp peaks/troughs. However, we cannot do so with perfect accuracy (because we do not know exactly where the sharp peak/trough should be positioned). Sampling error is ever present. For these sorts of time series dynamics, the degree of external noise present is in some sense linked to the degree of spreading of the power spectrum around its peaks. However, the converse is not true. A power spectrum that is broad (and without sharp peaks) is not necessarily due to external noise. Irregular behaviour can still appear in a perfectly deterministic framework, if the framework is chaotic. 4.7.6 ‘Chaotic’ market dynamics To a mathematician, chaotic behaviour has a particular connotation that is similar to but not identical with the typical English usage of the term. It implies a degree of structure within a framework that intrinsically results in outcomes that appear ‘noisy’, without the outcomes necessarily being entirely random. Sometimes (but not always) some aspects of this structure can be inferred from detailed analysis of the system dynamics. For such behaviour to arise, we need to drop the assumption of linearity. This does not mean that we need to drop time predictability. Instead it means that the equation governing the behaviour of the system yt = f (yt−1 , yt−2 , . . .) needs to involve a nonlinear function f . This change can create quite radically different behaviour. Take for example the following quadratic function, where c is constant: yt = cyt−1 (1 − yt−1 )
(4.32)
This mapping can be thought of as a special case of generalised least squares regression (but not generalised linear least squares regression), in the sense that we can find c by carrying out a suitable regression analysis where one of the input functions is a quadratic.

19 An exception would be if the time periods in question vary in length. Such an approach does have parallels in derivative pricing theory when in some instances we redefine the relevant 'clock' to progress at a rate that varies with proper time dependent on the volatility of some instrument (see also Section 2.13.3).
In this equation $y_t$ depends deterministically on $y_{t-1}$ and c is a parameter that controls the qualitative behaviour of the system, ranging from c = 0 which generates a fixed point (i.e., $y_t = 0$) to c = 4 where each iteration in effect destroys one bit of information. To understand the behaviour of this map when c = 4, we note that if we know the value to within $\varepsilon$ ($\varepsilon$ small) at one iteration then we will only know the position within $2\varepsilon$ at the next iteration. This exponential increase in uncertainty or divergence of nearby trajectories is what is generally understood by the term deterministic chaos.
This behaviour is quite different from that produced by traditional linear models. Any broadband component in the power spectrum output of a traditional linear model has to come from external noise. With nonlinear systems such output can be purely deterministically driven (and therefore in some cases may be predictable). This example also shows that systems do not need to be complicated to generate chaotic behaviour.
The main advantages of nonlinear models in quantitative finance are:
• Many factors influencing market behaviour can be expected to do so in a nonlinear fashion.
• The resultant behaviour exhibits elements that match observed market behaviour. For example, markets often seem to exhibit cyclical behaviour, but with the cycles having irregular lengths. Markets also appear to be affected relatively little by some drivers in some circumstances, but to be affected much more noticeably by the same drivers in other circumstances.
Their main disadvantages are:
• The mathematics is more complex than for linear models.
• Modelling underlying market dynamics in this way will make the modelling process less accurate if the underlying dynamics are in fact linear in nature.
• If markets are chaotic, this typically places fundamental limits on the ability of any approach to predict more than a few time steps ahead.
The last point arises because chaotic behaviour is characterised by small disturbances being magnified over time in an exponential fashion (as per the quadratic map described above with c = 4), eventually swamping the predictive power of any model that can be built up. Of course, in these circumstances using linear approaches may be even less effective!
There are even purely deterministic nonlinear models that are completely impossible to use for predictive purposes even one step ahead. For example, take a situation in which there is a hidden state variable, $x_t$, developing according to the formula set out below, but we can only observe $y_t$, the integer nearest to $x_t$:

$$x_t = 2x_{t-1} \pmod{1}, \qquad y_t = \mathrm{int}\left(x_t + \tfrac{1}{2}\right) \qquad (4.33)$$
The action of the map is most easily understood by writing $x_t$ in a binary fractional expansion, i.e., $x_1 = 0.d_1 d_2 d_3 \ldots = \frac{d_1}{2} + \frac{d_2}{2^2} + \frac{d_3}{2^3} + \cdots$ where each $d_i = 0$ or 1. Each iteration shifts every digit to the left, and so $y_t = d_t$. Thus this system successively reveals each digit in turn. However, without prior knowledge of the seeding value, the output will appear to be completely random, and the past values of $y_t$ available at time t will tell us nothing whatsoever about values at later times!
According to Fabozzi, Focardi and Jonas (2009), nonlinear methods are used to model return processes at 19% of (mainly quantitative) firms that they polled in 2006 on the future
of quantitative investment management. Details of the survey results are also available in Fabozzi, Focardi and Jonas (2008).
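Before turning to modelling approaches, a small numerical illustration (with assumed seed values and perturbation size) of the sensitive dependence described above: iterating the quadratic map of Equation (4.32) with c = 4 from two nearby starting points shows the gap between the trajectories growing roughly exponentially until it is of order one.

```python
# Hypothetical illustration: divergence of nearby trajectories under Equation (4.32).
import numpy as np

def quadratic_map_path(y0, c=4.0, steps=60):
    y = np.empty(steps)
    y[0] = y0
    for t in range(1, steps):
        y[t] = c * y[t - 1] * (1.0 - y[t - 1])
    return y

a = quadratic_map_path(0.2)
b = quadratic_map_path(0.2 + 1e-10)        # perturb the seed by 1e-10
gap = np.abs(a - b)
for t in (0, 10, 20, 30, 40):
    print(t, f"{gap[t]:.2e}")              # the gap roughly doubles each iteration
                                           # until it is of order one
```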
4.7.7 Modelling market dynamics using nonlinear methods

Mathematicians first realised the fundamental limitations of traditional linear time series analysis two or three decades ago. This coincided with a time when computer scientists were particularly enthusiastic about the prospects of developing artificial intelligence. The combination led to the development of neural networks.
A neural network is a mathematical algorithm that takes a series of inputs and produces some output dependent on these inputs. The inputs cascade through a series of steps that are conceptually modelled on the apparent behaviour of neurons in the brain. Each step ('neuron') takes as its input signals one or more of the input feeds (and potentially one or more of the output signals generated by other steps), and generates an output signal that would normally involve a nonlinear function of the inputs (e.g., a logistic function such as $\left(1 + e^{-\gamma x}\right)^{-1}$). Typically some of the steps are intermediate.
Essentially any function of the input data can be replicated by a sufficiently complicated neural network. So it is not enough merely to devise a single neural network. What instead is needed is to identify some way of selecting between possible neural networks. This is typically done in one of two ways (or possibly using elements of both):
(a) We may select lots of alternative neural networks (perhaps chosen randomly) and then apply some evolutionary or genetic algorithm that is used to work out which is the best one to use for a particular problem.
(b) We may define a narrower class of neural networks that are suitably parameterised (maybe even just one class, with a fixed number of neurons and predefined linkages between these neurons, but where the nonlinear functions within each neuron are parameterised in a suitable fashion). We then train the neural network, by giving it some historic data, adopting a training algorithm that we hope will home in on a choice of parameters that will work well when predicting the future.
There was an initial flurry of interest within the financial community in neural networks, but this interest seems since to have subsided. It is not that the brain does not in some respects seem to work in the way that neural networks postulate, but instead that computerised neural networks generally proved rather poor at the sorts of tasks they were being asked to perform. One possible reason why neural networks were found to be relatively poor at financial problems is that the effective signal-to-noise ratio involved in financial prediction may be much lower than for other types of problem where neural networks have proved more successful. In other words there is so much random behaviour that cannot be explained by the inputs that neural networks struggle to cope with it.

Principle P18: Markets appear to exhibit features that a mathematician would associate with chaotic behaviour. This may place intrinsic limits on our ability to predict the future accurately.
4.7.8 Locally linear time series analysis Mathematically, our forecasting problem involves attempting to predict the immediate future from some past history. For this to be successful we must implicitly believe that the past does offer some guide to the future. Otherwise the task is doomed to failure. If the whole of the past is uniformly relevant to predicting the immediate future then a suitable transformation of variables moves us back into the realm of traditional linear time series, which we might in this context call globally linear time series analysis. To get the sorts of broadband characteristics that real time series return forecasting problems seem to exhibit we must therefore assume that some parts of the past are a better guide for forecasting the immediate future than other parts of the past. This realisation perhaps explains the growth in interest in models that include the possibility of regime shifts, e.g., threshold autoregressive (TAR) models or refinements. TAR models assume that the world can be in one of two (or more) states, characterised by, say, yt = f 1 (yt−1 , yt−2 , . . .), yt = f 2 (yt−1 , yt−2 , . . .), . . . and that there is some hidden variable indicating which of these two (or more) world states we are in at any given time. We then estimate for each observed time period which state we were most likely to have been in at that point in time, and we focus our estimation of the model applicable in these instances to information pertaining to these times rather than to the generality of past history. (See also Section 7.7.3.) More generally, in some sense we should be trying to do the following: (a) Identify the relevance of a given element of the past to forecasting the immediate future. We might quantify this via some form of ‘distance’ between conditions ruling at that point in time and now, where the greater the distance the less relevance that part of the past has to our prediction of what is just about to happen. (b) Carry out what is now (up to a suitable transform) a locally linear time series analysis, in which we give more weight to those elements of the past that we deem ‘closer’, i.e., more relevant in the sense of (a), to forecasting given current conditions; see, e.g., Abarbanel (1993) or Weigend and Gershenfeld (1993). Such an approach is locally linear in the sense that it involves a linear time series analysis but only using data that is ‘local’ or ‘close’ to current circumstances (in the sense of relevant in a forecasting sense). It is also implicitly how non-quantitative investment managers think. One often hears them saying that conditions are (or are not) similar to ‘the bear market of 1973–74’, ‘the Russian Debt Crisis’, ‘the Asian crisis’ and so on. The unwritten assumption is that what happened then is (or is not) some reasonable guide to what might happen now. The approach also caters for any features of investment markets that we think are truly applicable in all circumstances, because this is the special case where we deem the entire past to be ‘local’ to the present in terms of its relevance to forecasting the future. The approach therefore provides a true generalisation of traditional time series analysis into the chaotic domain. It also provides some clues as to why neural networks might be relatively ineffective when applied to financial problems. In such a conceptual framework, the neural network training process can be thought of as some (relatively complicated) way of characterising the underlying model dynamics. 
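A minimal sketch of the locally linear idea in (a) and (b) above is given below: each past observation is weighted by a kernel of its 'distance' from current conditions and a weighted least squares fit is made mainly to the nearby data. The distance measure, kernel, bandwidth and dummy data are all illustrative assumptions; in practice the choice of how to define 'closeness' is the crucial judgement.

```python
# Hypothetical illustration: a locally weighted (locally linear) regression forecast.
import numpy as np

def locally_linear_forecast(X_past, y_past, x_now, bandwidth=1.0):
    """Forecast y given x_now, weighting past (X, y) pairs by closeness to x_now."""
    d = np.linalg.norm(X_past - x_now, axis=1)           # distance to current conditions
    w = np.exp(-0.5 * (d / bandwidth) ** 2)              # Gaussian kernel weights
    Xd = np.column_stack([np.ones(len(X_past)), X_past]) # add an intercept column
    WX = Xd * w[:, None]
    beta = np.linalg.lstsq(WX.T @ Xd, WX.T @ y_past, rcond=None)[0]  # weighted LS fit
    return np.concatenate([[1.0], x_now]) @ beta

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))                            # past "conditions"
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500)
print(round(locally_linear_forecast(X, y, np.array([0.5, -1.0])), 3))
```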
An ever-present danger is that we over-fit the model. Of course, it can be argued that a locally linear time series analysis approach also includes potential for over-parameterisation. There is almost unlimited flexibility in how we might
define the relevance that a particular segment of the past has to forecasting the immediate future. This flexibility is perhaps mathematically equivalent to the flexibility that exists within the neural network approach, because any neural network training approach can in principle be reverse engineered to establish how much weight it is giving to different parts of the past. However, at least the importance of the choice of how to define/assess relevance then comes more to the fore. Another possibility is that market behaviour (or at least elements of it) is akin to the system described in Equation (4.33). It is then impossible to forecast. Indeed, without knowledge of the characteristics of the seeding value, x0 , it is not even possible to describe what it might be like. The unknowable uncertainty that such behaviour introduces is explored further in Section 9.4.
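To make the locally linear weighting scheme of (a) and (b) above a little more concrete, the following minimal sketch fits a kernel-weighted AR(1) regression in which each past observation is down-weighted according to the 'distance' between the conditions then ruling and current conditions. It is purely illustrative: the use of trailing volatility as the conditioning 'state', the Gaussian kernel and the bandwidth are all assumptions made for the sketch rather than anything prescribed in the text.

import numpy as np

def locally_linear_ar1_forecast(returns, state, bandwidth=1.0):
    """Forecast the next return with an AR(1) model fitted by weighted least squares,
    where each past observation is weighted by how 'close' the market state ruling at
    that time was to the current state (cf. the locally linear idea of Section 4.7.8)."""
    y = returns[1:]                    # y_t
    x = returns[:-1]                   # y_{t-1}
    past_state = state[1:]             # state ruling when each y_t was observed
    current_state = state[-1]

    # 'Distance' between past conditions and now; smaller distance => more relevant
    dist = np.abs(past_state - current_state)
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)   # Gaussian kernel weights

    # Weighted least squares for y_t = a + b * y_{t-1}
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    a, b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return a + b * returns[-1]

# Illustrative use: 'state' here is trailing volatility, a purely hypothetical choice
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, 500)
vol = np.array([r[max(0, i - 20):i + 1].std() for i in range(len(r))])
print(locally_linear_ar1_forecast(r, vol, bandwidth=vol.std()))

Setting the bandwidth very large recovers the 'globally linear' case in which the whole of the past is deemed equally relevant.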
4.8 DISTRIBUTIONAL MIXTURES 4.8.1 Introduction We have seen in Section 4.7 that the introduction of nonlinearities substantially affects the range of possible behaviours that a system can exhibit. The challenge is that there are a large (indeed infinite) number of possible ways in which nonlinearities can be introduced. Fortunately from our perspective the full range of market dynamics can be approximated by a suitably rich interweaving of linear combination mixtures with distributional mixtures. As we saw in Section 2.7, distributional mixtures exhibit fundamentally different behaviour to linear combination mixtures. The leap up from just using linear combination mixtures to using these in combination with distributional mixtures is in some sense the ‘only’ jump we need to make because (a) any arbitrary distributional form can be postulated as being made up of distributional mixtures of sufficiently many more basic distributional forms; and (b) any arbitrary time-dependent characteristics (or indeed any other form of dependency we might impose on our model) can then be encapsulated by some model of how the probabilities of drawing from the different underlying distributions change through time. In this section we introduce techniques for analysing (multivariate) market data that involve distributional mixtures called Gaussian mixture models. We describe the expectationmaximisation (EM) algorithm that seeks to ‘un-mix’ such combinations. We also introduce a simpler algorithm motivated by the same basic goal, called k-means clustering. We show how these techniques can be used either in a time ‘agnostic’ manner, or in a manner that includes some pre-specified time dynamics to the distributional mixtures. This then leads us on to a more general consideration of time-varying volatility and how market analysis can incorporate such behaviour. 4.8.2 Gaussian mixture models and the expectation-maximisation algorithm In Section 3.8.4 we introduced a straightforward example of classification by unsupervised learning, namely cluster analysis. Another straightforward example, more applicable to distributional mixtures, involves Gaussian mixture models (GMM). The setup is as follows: (a) We have n data points in an m-dimensional space, e.g., n time periods, and for each time period we have an m-dimensional vector representing the (log) returns on m different asset categories.
Figure 4.4 Illustrative Gaussian mixture modelling analysis: observed return pairs overlaid with four fitted distributions ('Dist 1' to 'Dist 4', with mixture probabilities 0.09, 0.21, 0.43 and 0.27). © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.
(b) We want to ‘fit’ the data in the sense of finding a set of K multivariate Normal distributions that best represent the observed distribution of data points. The number of distributions, i.e., K , is fixed in advance, but the distributional parameters for the kth distribution (i.e., its means, µk , and its covariance matrix, Vk ) are not. (c) The exercise is ‘unsupervised’ because we are not told in advance which of the n data points come from which of the K distributions. Indeed, one of the desired outputs is to have some estimate of the probability, pt,k ≡ p (k |t ), that the observation vector at time t came from the kth distribution. How this can be done in practice using the EM algorithm is described further in Press et al. (2007) or Kemp (2010). The EM algorithm identifies the combination of pt,k , µk and Vk that maximises the likelihood that the data has come from the relevant distributional mixture. It gets its name from being a two stage process in which we cycle repeatedly through an E-stage and an M-stage, the former in effect corresponding to an ‘expectation’ step and the latter in effect corresponding to a ‘maximisation’ step. Press et al. (2007) appear to assume that K should be small for the algorithm to be effective, usually in the range of one to just say three or four. This would limit the applicability of the algorithm for practical risk modelling purposes and seems not to be particularly necessary, except to keep run times to reasonable levels.20 When m = 1 the problem collapses to the approach involving a mixture of Normal distributions as described in Section 2.7.2. The approach is illustrated in Figure 4.4, which considers just two of the sectors used in Chapter 3 (so that we can plot their joint distribution more easily). We have assumed that the observations are coming from four bivariate Gaussian distributions to be determined by 20 However, we do need to ensure that the total number of parameters we are attempting to fit is modest in relation to the number of observation periods we have, because otherwise we will run into the problem that there is not enough data in the dataset to allow estimation of them all. This requires a modification of the basic EM algorithm akin to that introduced by applying random matrix theory (see Section 4.3.6). The k-means clustering algorithm set out in Section 4.7.3 can be viewed as an extreme example of this modification, in which we truncate away every principal component.
Figure 4.5 Illustrative k-means clustering analysis: the same observed return pairs assigned to four clusters ('Dist 1' to 'Dist 4', with probabilities 0.15, 0.42, 0.12 and 0.31). © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Thomson Datastream.
GMM, i.e., K = 4. The four distributions are overlaid on the individual data pairs by using ellipses of constant probability density,21 positioned two standard deviations away from the mean along any given marginal distribution. The choice of four distributions is for illustrative purposes only and does not reflect a view on the part of the author that there are four applicable ‘regimes’ that best describe the market dynamics of this sector pairing.
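As a concrete illustration of the EM machinery just described, the following minimal sketch fits a Gaussian mixture to a two-dimensional set of return pairs using scikit-learn's GaussianMixture class, which implements the E-step/M-step iteration for us. The simulated data, and the choice of K = 4, are purely illustrative assumptions and are not the data underlying Figure 4.4.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Illustrative stand-in for pairs of (log) sector returns; real data would be used in practice
returns_2d = np.vstack([
    rng.multivariate_normal([0.002, 0.001], [[1e-4, 5e-5], [5e-5, 2e-4]], 150),
    rng.multivariate_normal([-0.010, -0.015], [[4e-4, 3e-4], [3e-4, 9e-4]], 50),
])

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(returns_2d)                   # EM: alternate 'expectation' and 'maximisation' steps

print(gmm.weights_)                   # mixture probabilities (cf. the 'prob=' legend in Figure 4.4)
print(gmm.means_)                     # the means mu_k of each component
p_tk = gmm.predict_proba(returns_2d)  # p(k | t): probability each observation came from component k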
4.8.3 k-means clustering

Press et al. (2007) also refer to a simplification of the GMM that has an independent history and is known as k-means clustering. We forget about covariance matrices completely and about the probabilistic assignment of data points to different components. Instead, each data point is assigned to one (and only one) of the K components. The aim is to assign each data point to the component it is nearest to (in a Euclidean sense), i.e., to the component whose mean µk is closest to it. A simplified version of the EM algorithm can be used to identify the solution to a k-means clustering analysis, and it converges very rapidly.

Although it introduces an intrinsically 'spherical' view of the world, which is unlikely to be accurate, k-means clustering does have the advantage of being able to be done very rapidly and robustly. It can therefore be used as a method for reducing a large number of data points to a much smaller number of 'centres', which can then be used as starting points for more sophisticated methods such as GMM. A k-means clustering analysis of the same data as used in Figure 4.4 is shown in Figure 4.5.
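A minimal sketch of such an analysis, using scikit-learn's KMeans class, follows; the data here is a purely illustrative stand-in rather than the sector returns used in the figures.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
returns_2d = rng.normal(0.0, 0.01, size=(200, 2))   # illustrative stand-in for return pairs

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(returns_2d)
labels = km.labels_              # hard assignment of each period to one of the K components
centres = km.cluster_centers_    # the K component 'means'

# The centres can be re-used to seed a subsequent, more sophisticated GMM fit, e.g.
# GaussianMixture(n_components=4, means_init=centres).fit(returns_2d)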
21 Maybe these should be called ‘iso-likelihoods’, bearing in mind that isobars are lines of constant pressure, isotherms lines of constant average temperature etc.
Table 4.2 Contingency table showing probability of being in 'regime' D1 to D4 in one period and moving to 'regime' D1 to D4 in the next period

From (t − 1)    To (t): D1     D2     D3     D4
D1                      0.19   0.31   0.15   0.35
D2                      0.17   0.44   0.10   0.29
D3                      0.20   0.35   0.20   0.25
D4                      0.09   0.47   0.10   0.34

Source: Nematrian
4.8.4 Generalised distributional mixture models

Rather than simplifying GMM we might instead want to make it more sophisticated. For example, we might allow each component to take on a more complicated distributional form, perhaps including fat-tailed behaviour and perhaps of a form akin to one mentioned in Chapter 3. In some sense this is not needed, because by mixing enough Gaussian distributions together we can always approximate arbitrarily accurately any given distributional form. However, doing so may not be a parsimonious way of describing the distributional form, especially if we want the distribution to take a particular form in the far tail. Conceptually, exactly the same EM-style approach to maximising the likelihood of the data coming from the given distributional mixture can be used as with GMM, as long as we can parameterise the possible component distributions in suitable fashion. Unfortunately, the EM algorithm is then likely to become less robust.
4.8.5 Regime shifts

Gaussian mixture modelling and k-means clustering have a natural link to regime shifts (see Chapter 7). We can perhaps see this most easily with k-means clustering. We can, for example, tabulate the proportion of times that the data was classified as coming from component k1 in time period t and from component k2 in time period t + 1 in the k-means analysis of Section 4.8.3. This is shown in Table 4.2. One way of interpreting this table is to say that there is a probability p(k1, k2) of moving from a 'state of the world' or 'regime' characterised by the k1th distribution to a regime characterised by the k2th distribution over any time period. In principle exactly the same concept can be applied to GMM except that now the resulting contingency table is more difficult to interpret because the state of the world is deemed 'fuzzy' during any particular period (involving mixtures of multiple simultaneous regimes). Practitioners seem in general to avoid the leap in mathematical complexity that is introduced by GMM style fuzzy regime shifting, preferring the ease of interpretation that comes from viewing the world as only ever being in one 'state' at a time.22 For example, we typically

22 A physical analogy between the one state at a time of k-means clustering and the many overlapping states of Gaussian mixture models is the difference between classical mechanics (where a system is only ever in one state at one particular time) and quantum mechanics (which also inherently permits multiple overlapping states).
refer to an economy as being 'in recession' or 'in a growth phase', rather than being in some probabilistic mixture of the two. We shall see in Chapter 7 that a willingness to tackle this fuzziness is potentially highly relevant to handling extreme events more effectively. One feature apparent from Table 4.2, if the interpretation given above is meaningful, is how unstable and unpredictable the 'regimes' that the world occupies at any given time appear to be. This is partly because using the table in this manner implies that we should infer from return behaviour alone which 'regime' we are in. Far better would be to infer it from other, more stable criteria. We shall also see in Chapter 7 that reliable estimation of which 'regime' we are in is an important ingredient in actually making use of more sophisticated regime-dependent portfolio construction techniques.
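A table like Table 4.2 can be tabulated directly from the hard regime labels produced by a k-means fit. The following minimal sketch (the label sequence and the choice of K are assumptions for illustration, not the book's own data) simply counts consecutive-period transitions and normalises each row.

import numpy as np

def transition_matrix(labels, k):
    """Estimate P(regime j at time t | regime i at time t-1) from a sequence of hard labels."""
    counts = np.zeros((k, k))
    for prev, nxt in zip(labels[:-1], labels[1:]):
        counts[prev, nxt] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.where(rows == 0, 1.0, rows)   # each row sums to one, as in Table 4.2

# e.g. for labels produced by a k-means fit (here a made-up sequence):
labels = np.array([0, 1, 1, 3, 2, 1, 0, 3, 3, 1])
print(transition_matrix(labels, 4))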
4.9 THE PRACTITIONER PERSPECTIVE Practitioners seeking to understand what factors drive markets can be split, in the main, into two camps: (a) Risk managers – working directly for entities with such exposures or for third party risk system providers who provide software solutions to such clients – whose primary aim, ultimately, is to limit the risks that the entity might incur. (b) Investment managers, whose primary aim, ultimately, is to maximise return. As we shall see in the next chapter, we ultimately want a good balance between risk and reward and so total polarisation along the lines implied above is not necessarily ideal. Good investment managers need to have a keen appreciation of risk and, as we shall argue in later chapters, good risk managers need to have a keen appreciation of return potential. Nevertheless, such a division does seem to be a feature of the way in which organisational structures typically evolve, in part because of the benefits that accrue from greater clarity in articulation of what individuals are to concentrate on in their day-to-day activities. This division also influences the perspectives that individuals may adopt regarding the activities described in this chapter. For example, risk managers are typically more open about the general ways in which they might model risks. Risk system providers will typically have an open and explicit bias towards one of the three different approaches to formulating risk models (statistical, fundamental or econometric), or may explicitly use one approach for one part of their modelling and another approach for another part of their modelling. This openness helps them differentiate their systems from those of other providers. Internal risk managers may be somewhat more circumspect about exactly how their risk models operate with direct competitors, but will still be reasonably forthcoming, particularly with their bosses and with regulators23 (or with their peers in other firms if they think that exchange of information would be mutually beneficial). In contrast, investment managers tend to be much more circumspect about how their return forecasting models might operate, particularly quantitative investment managers. It is their competitive advantage. 23 Firms typically need to be open with the regulators if they want to be able to use their own risk models to help set capital requirements; see Section 8.3.4.
Some firms spend very substantial sums developing sophisticated quantitative models that they believe will improve their forecasting ability.24 They will offer tantalising hints of how these models work, to encourage clients to invest money with them, but not give so much away that their models become reproducible by others. They will typically place a premium on robustness of system design, and so may adopt an engineering approach to model design and implementation, with different individuals involved in different stages of the process.25 Less quantitative firms may be more sceptical of the ability of quantitative models to predict the future. However, they may still spend heavily on information technology (or spend heavily on outsourcing contracts that have a high IT component) to facilitate an infrastructure in which investment ideas developed more judgementally can be efficiently and robustly implemented. The confidential nature of most (quantitative) firms’ return forecasting techniques makes it very difficult to identify exactly the current state-of-the-art in such matters. Surveys such as Fabozzi, Focardi and Jonas (2008, 2009) can provide clues. These surveys suggest that enthusiasm for different approaches can rise and fall depending on recent market events (and/or on practitioners’ views about whether the techniques will prove useful in practice). We have already seen an example of this with neural networks in Section 4.7. Large quantitative firms may therefore develop multiple potential approaches simultaneously, in the hope that one will prove to be the ‘holy grail’. As far as clients are concerned, this business approach is not without its issues. If a firm incubates26 lots of new investment strategies but only reveals the results of some of them, it may become more difficult to work out how effective such strategies might be in the future.
4.10 IMPLEMENTATION CHALLENGES 4.10.1 Introduction In this chapter we have introduced a much wider range of analytical techniques than in earlier chapters. Most of them involve the minimisation or maximisation of some multivariate function, say f (x). For example, the PCA, ICA and blended PCA/ICA approaches involve (in general) identification of a vector that maximises the chosen importance criterion. GMM likewise involves maximising the likelihood of the distributional mixture in question. Any maximisation problem can trivially be re-expressed as a minimisation problem and vice versa, by replacing f (x) by − f (x). Our aim, in general, is to identify the maximum/minimum as quickly and as cheaply as possible. In computational terms this often boils down to doing so with as few evaluations of f (x) (or of its partial derivatives) as possible. 24 Firms may also spend substantial sums in investing in systems that improve their ability to execute their investment ideas in a timely fashion. At one extreme is the trend towards algorithmic trading. This may involve millisecond adjustments to portfolio exposures based on real-time exchange feeds. The aim is to respond more rapidly than other market participants to new information as soon as it arises and/or to take advantage of behavioural biases that occur in the dissemination of new information across the market. 25 Such firms may also occasionally seek to patent intellectual property implicit in their models. However, many jurisdictions frown on such patent applications and in any case patent filings become public in due course. Use of patent protection in this field therefore tends to focus on conceptual refinements or business process improvements rather than on refinements to actual forecasting techniques built into investment models per se. 26 ‘Incubation’ here means creating portfolios (that might be notional or real, depending on whether funds are actually allocated to the strategy) but not publicising their results until it becomes clear which ones have been doing well. See also Section 5.9.
As Press et al. (2007) point out, there is a very large body of numerical research linked to the maximisation/minimisation problem. This body of research is more usually called optimisation and thus also underpins most of the numerical techniques underlying the later chapters of this book. Part of the reason for the extensive amount of research carried out in this context is that identifying an extremum (a maximum or minimum point) is in general very difficult. Extrema may be either global (truly the highest or lowest value of the function) or local (the highest or lowest in a small finite neighbourhood and not on the boundary of that neighbourhood).
4.10.2 Local extrema

Finding a local extremum is in principle relatively straightforward, if we know or can estimate the first derivative of the function, i.e., df/dx if x is one-dimensional, or the gradient vector ∂f/∂x = (∂f/∂x1, . . . , ∂f/∂xn)ᵀ if x is n-dimensional. The extremum is the point where df/dx = 0 or ∂f/∂x = 0. A necessary but not sufficient condition for the first derivative to exist is that the function be continuous.

To identify whether a local extremum is a minimum, a maximum, a point of inflexion or a saddle point we ideally need to be able to compute the second derivative of the function. A saddle point arises if the extremum is a minimum along one direction but a maximum along another, and so only arises if x is multidimensional. In the one-dimensional case, the extremum will be a minimum if d²f/dx² is positive, and a maximum if d²f/dx² is negative. If d²f/dx² = 0 then determination of the form of the extremum becomes more difficult; it could be a point of inflexion (i.e., it could increase through the point or decrease through the point), or it could be a maximum or minimum depending on higher derivatives. Some functions, e.g., e^(−1/x²) at x = 0, have all their derivatives zero at their local extremum, meaning that even consideration of higher derivatives is not a foolproof way of identifying the type of extremum involved. Fortunately, in numerical problems we do not need to worry unduly about such issues, because we can typically evaluate the function at f(x ± ε) for some small ε and hence work out the form of the local extremum.

In the multi-dimensional case, the form of the local extremum depends on the form of the matrix ∂²f/∂x² whose elements are ∂²f/∂xi∂xj. If ∂²f/∂x² is positive definite then the local extremum is a minimum; if ∂²f/∂x² is negative definite then the local extremum is a maximum; if ∂²f/∂x² is neither positive definite nor negative definite but is not zero then the local extremum is a saddle point; and if ∂²f/∂x² = 0 then we run into the same sorts of issues as arose in the one-dimensional case when d²f/dx² = 0 (with the added issue that we then need to calculate f(x ± εk) for several (indeed in theory an arbitrarily large number) of different εk to be sure that we know the true form of the extremum).

There are several ways of finding the 'nearest' local extremum to a given starting point. Simplest is to be able to solve df/dx = 0 or ∂f/∂x = 0 explicitly. This very considerably simplifies the problem, and is probably one of the main reasons for the focus on quadratic functions that we shall see in this book most explicitly in Chapters 5 and 6. If we cannot work out the solution analytically then in one dimension the algorithms usually first involve bracketing the extremum by a triplet of points a < b < c (or c < b < a) such that (if we are seeking a minimum) f(b) is less than both f(a) and f(c). We then know that the function (if it is smooth) has a minimum somewhere in the interval (a, c). The minimum can
Figure 4.6 Illustration of the difficulties of finding global extrema, even in one dimension and even for smooth functions. Points A and C are local, but not global, maxima. The global maximum is at E, which is on the boundary of the interval so the derivative of the function need not vanish there. Points B and D are local minima and B is also the global minimum. The points F, G and H are said to bracket the maximum at A, because G is greater than both F and H and is in between them. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.
then be found in much the same way as we search for a root of an equation in one dimension, by successively narrowing down the interval over which the root is to be found.27 Bracketing does not have a practical analogue in multiple dimensions. If x is multidimensional then the algorithms used typically involve taking an initial starting value for x, heading along some promising direction until we cannot improve on where we have reached and then seeing if we can improve further by heading off in a different direction.
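In one dimension this bracket-and-subdivide recipe is available off the shelf. The following minimal sketch uses SciPy's minimize_scalar routine with an explicitly supplied bracket; the function being minimised is purely illustrative.

from scipy.optimize import minimize_scalar

f = lambda x: (x - 1.5) ** 2 + 0.1 * x ** 4      # smooth illustrative function

# Golden-section search within a bracket (a, b, c) chosen so that f(b) < f(a) and f(b) < f(c)
res = minimize_scalar(f, bracket=(0.0, 1.0, 3.0), method="golden")
print(res.x, res.fun)

# Brent's method, which mixes golden-section steps with parabolic interpolation,
# usually converges faster on smooth functions:
print(minimize_scalar(f, bracket=(0.0, 1.0, 3.0), method="brent").x)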
4.10.3 Global extrema

In general, finding a global extremum is a very difficult problem, even in one dimension and even for continuous and adequately differentiable functions, as illustrated in Figure 4.6. Two standard approaches are to

(a) find all available local extrema starting from widely varying starting values (perhaps randomly or quasi-randomly chosen), and then pick the most extreme of these (a short sketch of this 'multi-start' approach follows below); or
(b) perturb a local extremum by taking a finite amplitude step away from it, and see if this takes us to a better point.
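A minimal sketch of approach (a) is as follows: restart a standard local optimiser from many random starting points and keep the best result. The two-variable objective, the search region and the number of restarts are all purely illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def f(x):
    # Illustrative multi-modal function of two variables
    return np.sin(3 * x[0]) * np.cos(2 * x[1]) + 0.1 * (x[0] ** 2 + x[1] ** 2)

rng = np.random.default_rng(0)
best = None
for _ in range(50):                              # approach (a): many random starting points
    x0 = rng.uniform(-3, 3, size=2)
    res = minimize(f, x0, method="Nelder-Mead")  # local search from this starting point
    if best is None or res.fun < best.fun:
        best = res
print(best.x, best.fun)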
27 Indeed, we can think of such an algorithm as involving a root search, because we are searching for the root of the equation f ′ (x) ≡ d f /d x = 0, i.e., the value of x for which f ′ (x) = 0. In the absence of knowledge of the function’s first derivative, we might root search by bisecting the interval in which we know it is located. When searching for an extremum in an equivalent way it is more efficient to use a golden section subdivision; see, e.g., Press et al. (2007) or Kemp (2010). If we have greater knowledge about the function’s first derivative then a root and/or extremum search can usually be speeded up by splitting the interval/bracket in a different way, e.g., using the Newton-Raphson method for root finding or parabolic interpolation for extremum searching.
Some element of point (b) is often needed even when our primary approach is (a), because a global extremum may be at the boundary of the set of feasible values of x, and so may not be at a local extremum. The problem becomes even more complicated if x is multi-dimensional, particularly if large numbers of dimensions are involved, as is often the case in practical financial problems. High-dimensional spaces are very sparse, with nearly all points being, relatively speaking, a long way away from each other. In such circumstances, approaches that use physical or biological analogies can prove helpful, particularly if the global extremum is hidden among lots of individual local extrema.

Principle P19: Finding the very best (i.e., the 'globally optimal') portfolio mix is usually very challenging mathematically, even if we have a precisely defined model of how the future might behave.

4.10.4 Simulated annealing and genetic portfolio optimisation

An example of an approach relying on such an analogy is the method of simulated annealing (see, e.g., Press et al. (2007)). The analogy here is with thermodynamics, specifically with the way in which liquids crystallise when they freeze or metals cool and anneal. At high temperatures the molecules in the liquid move freely with respect to each other, but as the temperature cools, particularly if it cools slowly, the molecules can take on a highly ordered arrangement, which involves the thermodynamic minimum state of the system. Nature amazingly seems to be able to find this minimum energy state automatically, as long as the cooling is relatively slow, allowing ample time for the redistribution of the energy originally present in the liquid molecular behaviour.

Although the analogy is not perfect, there is a sense in which the algorithms described earlier in this section try to get to a solution as rapidly as possible. This will tend to result in us getting quickly to a local extremum but not as reliably to the global extremum, particularly if the global extremum is hidden among lots of individual local extrema. If we 'slow the process down' then we should have a better chance of replicating nature's success at teasing out global extrema. Nature's minimisation algorithm focuses on the Boltzmann probability distribution, as follows:

P(E) = c e^(−E/kT)     (4.34)

where E is the energy of the system, T is its temperature, k is the Boltzmann constant and c is some constant of proportionality so that the probability distribution P sums to unity. A system in thermal equilibrium thus has its energy probabilistically distributed across all different energy states. Even at low temperature there is some chance, albeit small, of the system getting out of a local energy minimum in favour of finding a better, more global minimum.

Simulated annealing uses this analogy as follows (a short sketch follows this list):

(a) We describe the possible configurations that the problem solution may take.
(b) We generate random changes to these configurations.
(c) We have an objective function E (the analogue of energy) whose minimisation is the goal of the procedure, and a control parameter T (the analogue of temperature).
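SciPy packages these ingredients in its dual_annealing routine (a generalised variant of simulated annealing), which handles the random perturbations and the cooling schedule internally; we supply only the 'energy' function and the feasible bounds. The objective and bounds below are purely illustrative assumptions.

import numpy as np
from scipy.optimize import dual_annealing

def energy(x):
    # Objective ('energy') with many local minima; purely illustrative
    return np.sin(3 * x[0]) * np.cos(2 * x[1]) + 0.1 * (x[0] ** 2 + x[1] ** 2)

result = dual_annealing(energy, bounds=[(-3, 3), (-3, 3)], seed=0)
print(result.x, result.fun)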
An annealing schedule then identifies how we lower T from high to low temperature, e.g., after how many random changes to the initially chosen configuration is each downward step in T taken and how large is this step. Exactly how we define ‘high’ and ‘low’ in this context often requires physical insights or trial-and-error. Another analogy that researchers have used in this context is evolution, leading to genetic algorithms (see, e.g., Doust and White (2005)). Here, we again describe possible configurations that the problem solution may take, but rather than starting with just one we start with many. We then generate random changes (‘mutations’) to these configurations and select ‘offspring’ from the then current population that seem to perform better at solving the optimisation problem. We can, as with simulated annealing, include in the iterative step some additional randomness that means that not all immediately ‘worse’ mutations are excluded. Instead, some may be allowed to propagate, reflecting the somewhat haphazard nature of life itself. On Earth, evolution appeared to take a huge jump forward when sexes became differentiated, and so these types of algorithms may also involve analogies to ‘mating’, if there is a relevant part of the optimisation problem to which such a concept can be attached. In either case, we usually abandon the aim of explicitly identifying the global extremum; instead our focus is on reaching some solution that is close to globally optimal. 4.10.5 Minimisation/maximisation algorithms Most of the complications described in Sections 4.10.1–4.10.4 await us in future chapters rather than being directly relevant to algorithms in this one. This is because most of the algorithms considered in this chapter are quasi-quadratic in form and have the property that any local extremum is also (up to a suitable reorganisation of the variables) the global extremum. This, for example, explains the relatively quick and robust nature of algorithms typically used for PCA. As we noted in Section 4.3.7, PCA in effect involves maximisation of a quadratic form along the lines of f (a) = a T V a. There are in fact many different local extrema, but each one is the same up to a reordering of the variables, which is why we normally present the answers after imposing some additional ordering of the variables, such as showing first the most important principal component. Exceptions are the ICA and blended PCA/ICA algorithms presented in Sections 4.4 and 4.5. Incorporating kurtosis into the importance criterion means that the problem involves maximising a quartic rather than a quadratic. Although it is not obvious from the presentation in Table 4.1 (because we have sorted the series extracted by the size of the importance criterion), only in the case of PCA were the series extracted in the order shown, i.e., with the first series being the one with the largest importance criterion etc. Thus, some of the series identified were, at that step of the algorithm, merely local extrema, rather than global extrema. Traditional ways of carrying out regression analysis also fit within this framework, because again the aim is to minimise a quadratic, i.e., the sum of the squares of the error terms. Robust regression techniques as per Section 4.7.5 again move us away from quadratics and can again ramp up the computational complexity. 
However, as we noted there, the special case where we focus on minimisation of the mean absolute deviation, i.e., Σt |yt − (a + bxt)|, is more tractable, because the estimators are then medians rather than means.

4.10.6 Run time constraints

Even when the problem merely involves quadratic forms, run times can be significant, especially if we are considering a large number of instruments; see, e.g., Kemp (2010). More
complex problems, needing algorithms such as ICA, simulated annealing, genetic algorithms, neural networks and locally linear regression, can be many times more computationally intensive and may involve a significant amount of trial and error or testing to get to work well or to produce useful output. Conversely, because these sorts of algorithms are more sophisticated, they may be more helpful in teasing out information on market dynamics that simpler (and more commonly used) tools cannot capture. This makes them potentially more attractive to investment managers focusing on spotting future return generating opportunities. As always, reward rarely comes without doing some work!
5 Traditional Portfolio Construction Techniques 5.1 INTRODUCTION In Chapter 1 we indicated that the fundamental problem underlying portfolio construction is ‘how do we best balance risk and return?’. Up until this point we have principally focused on the risks expressed by a portfolio and particularly risks linked to extreme events. We now move on to how we can incorporate both risk and return in portfolio construction. In this chapter we explore some of the more traditional (quantitative) techniques that practitioners have used to tackle this problem, together with the theory on which such techniques are based. We will, in the main, focus in this chapter on mean-variance optimisation. This reflects its historic and continuing importance as a portfolio construction technique. Before doing so, it is worth highlighting the potential gain that can arise by putting together a portfolio in an ‘efficient’ rather than an ‘inefficient’ manner. An example of such an analysis is given in Scherer (2007). In the preface to his book, Scherer estimates the uplift available by moving from a ‘naive’ portfolio construction approach (defined there as equally over/underweighting assets with the best/worst deemed return potentials) to an ‘optimal’ approach (defined there as one in which size of over/underweighting takes into account contribution to risk as well as deemed return potentials). For ‘realistic’ skill levels, he claims that a four-fold uplift in the risk-return trade-off is available from effective (i.e., optimal) portfolio construction.1 Principle P20: Portfolio construction seeks to achieve the best balance between return and risk. It requires some formalisation of the trade-off between risk and return, which necessitates some means of assessing each element contributing to this tradeoff.
5.2 QUANTITATIVE VERSUS QUALITATIVE APPROACHES? 5.2.1 Introduction Before we explore optimisation in practice, we need to explore the role or otherwise of any form of quantitative approach to portfolio construction. There are some who argue that quantitative approaches to investment management are intrinsically flawed, whereas others argue that only explicitly quantitative approaches can provide the necessary disciplines for an investment process to be effective. Those who argue
1 At higher skill levels Scherer argues that the uplift declines, confirming the intuition that risk management becomes less important the better are the available forecasts for future returns.
strongly for either of these extreme positions generally do so because it suits their own purposes, e.g., because they are marketing one investment style to the exclusion of the other. Most practitioners are more open-minded. Although they do not necessarily articulate their portfolio construction disciplines as follows, any portfolio construction approach, if it is to have intrinsic rationale, can be viewed as involving the following: (a) The manager2 will have positive and negative views on a range of different instruments. These views will differ in terms of strength of conviction that the manager would ideally wish to place on the relevant idea. All other things being equal, the greater the conviction behind the idea, the bigger should be the risk-adjusted position size implementing the idea. (b) If a portfolio is optimally constructed then we can, in principle, assign to each position an ‘implied alpha’, which is the alpha-generating potential that needs to be assigned to that particular stock for the portfolio as a whole to express an optimal balance between risk and reward. By ‘alpha’ we mean scope for outperformance. (c) As the manager’s ideas change (e.g., because they come to fruition), these implied alphas will change and hence the ideal individual position sizes to express within the portfolio will also change. Assuming that the manager possesses skill, we may also assume that positions of the same magnitude should on average be rewarded over time in proportion to their implied alphas. If the manager knew which ideas would be better rewarded than this then he or she would presumably ascribe them greater weight, which would result in their implied alphas rising to compensate. 5.2.2 Viewing any process through the window of portfolio optimisation Any (active) portfolio construction approach can be reformulated to fit within this suitably broad ‘quantitative’ definition of an investment process. Given sufficient flexibility in the characterisation of risk that we are using (i.e., in quant-speak, the underlying risk model we are adopting), and in the evolution of this characterisation through time, any set of decisions taken by any type of manager can be recast into a portfolio optimisation framework. This is true whatever the extent to which the manager believes him- or herself to be more traditionally judgemental (i.e., ‘fundamental’) or more quantitatively orientated in nature. Such an assertion is no less or more tautological than claiming that measuring the length of an object is a ‘quantitative’ activity. Measurement of distance does indeed get bracketed with other basic mathematical disciplines such as addition and subtraction early enough in a typical school curriculum. Such a reformulation does not therefore explicitly support (or discourage) the use of more sophisticated quantitative techniques within the research or portfolio construction elements of an investment process. Rather, it indicates that we can analyse the problem in a quantitative manner irrespective of the style of management being employed.3 Indeed this must be so, if we return to the topic of risk budgeting. We have already seen that portfolio construction involves an effective allocation of the risk budget between different 2 By ‘manager’ we mean in this context anyone responsible for the positioning of a portfolio (which could include assets, liabilities or both). Most traders would fall within such a definition as would most investment managers. 
3 Problems that can be tackled in this way include analysis of the optimal sizes of short books in 130/30 funds (see, e.g., Kemp (2007) and Kemp (2008a)), efficient construction of global equity portfolios (see, e.g., Kemp (2008b)) and efficient alpha delivery in socially responsible investment portfolios (see, e.g., Kemp (2008c)).
alpha-generating possibilities. What manager would knowingly aim for a 'sub-optimal' asset mix and hence an 'ineffective' use of the risk budget? Remember, we are talking here about intentions, rather than whether the risk budget actually proves to have been allocated effectively in practice.

Principle P21: The intrinsic difference between qualitative and quantitative portfolio construction approaches, if there is one, is in the mindsets of the individuals involved. From a formal mathematical perspective, each can be re-expressed in a way that falls within the scope of the other.
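The 'implied alpha' idea of Section 5.2.1(b) can be made concrete with a small sketch. Under the mean-variance utility introduced later in Section 5.3.2, the alphas implied by a set of active weights are proportional to the marginal contributions to risk, i.e., α = 2λVa at an optimum. The covariance matrix, active weights and risk aversion below are purely illustrative assumptions.

import numpy as np

V = np.array([[0.04, 0.01, 0.00],      # illustrative covariance matrix of asset returns
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
a = np.array([0.05, -0.03, -0.02])     # illustrative active weights versus the benchmark
risk_aversion = 2.0                    # the lambda of Section 5.3.2

# At an optimum of U(a) = alpha.a - lambda * a'Va we have alpha = 2 * lambda * V a,
# so the alphas 'implied' by the positions are proportional to marginal risk contributions.
implied_alpha = 2.0 * risk_aversion * V @ a
print(implied_alpha)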
5.2.3 Quantitative versus qualitative insights

Another way of understanding the broad equivalence between quantitative and qualitative investment management styles, at least in a formal sense, is to explore some of the key insights that each claims to offer in terms of portfolio allocation. For example, Kahn (1999) highlights seven quantitative insights applicable to active management:

(a) active management is forecasting;
(b) information ratios determine value added;
(c) information ratios depend on skill and breadth;
(d) alphas must control for skill, volatility and expectations;
(e) data mining is easy;
(f) implementation subtracts value;
(g) distinguishing skill from luck is difficult.
We could equally restate these insights in less quantitative language, for example: (a) active management is about working out which assets will outperform and by how much; (b) the best outcomes involve an effective trade-off between return and risk; (c) the better the underlying investment ideas and the wider their range the better is likely to be the overall outcome; (d) assessing investment performance should take into account the possibility that outcomes arose by luck rather than by skill; (e) there are lots of pieces of information ‘out there’ waiting to be discovered, and the trick is to find the pieces that will actually prove useful going forwards; (f) transaction and other costs typically reduce overall investor return; (g) there are fewer managers who are actually skilled than the number claiming to be skilled. Indeed, there can sometimes be merit in being vaguer in our prescriptions. For example, Kahn’s insights (b) and (c) might be read by some as equating value added only with information ratios, i.e., with ratios of outperformance to volatility (i.e., standard deviation) of relative performance. Such an association would involve equating risk with standard deviation of relative returns. As we have already seen, this might be fine in a world characterised by Normal distributions, but is less justifiable if markets are fat-tailed. Conversely, being vague does not help us apply prescriptions in practice. Perhaps this is one of the strongest reasons in favour of the use of quantitative techniques in investment
management – they encourage disciplined application of an investment process. However, some argue that ‘discipline’ and ‘flair’ may not be compatible. So there may be some investors who will want to downplay any type of investment discipline in the hope that this improves the chances that their managers will exhibit the ‘flair’ that they are looking for. Selecting fund managers is a non-trivial task, not least because of the human dynamics involved! 5.2.4 The characteristics of pricing/return anomalies A writer who draws out mainly qualitative lessons from ideas that ultimately derive primarily from a quantitative perspective is Cochrane (1999). His thesis is that there has been a revolution in how financial economists view the world. This has involved a shift away from an assumption that market behaviour can be viewed as being largely driven by a single factor (the ‘market’), as encapsulated in, say, the Capital Asset Pricing Model (see Section 5.3.4), towards a multifactor view of the world. At the same time, the world of investment opportunities, even at a fund level, has changed, with development of a bewildering variety of fund styles. Arguably, these trends have become even more pervasive since Cochrane was writing. His focus is on the generality of (retail and/or institutional) investors who might seek exposures using such pooled vehicles. Cochrane (1999) argues that certain pricing/return anomalies, if they are persistent, can, over the long term, offer meaningful reward.4 However, he notes that investors may need to be very patient to profit from such strategies once parameter uncertainty is taken into account. From this, he springboards onto the topic of whether pricing/return anomalies will last and considers three possibilities: (a) The anomaly might be real, coming from a well-understood exposure to risk, and widelyshared views about it. (b) The anomaly might represent an irrational misunderstanding of the nature of the world. (c) The return premium may be the result of narrowly-held risks. Cochrane (1999) thinks that anomalies falling into category (c) are the least well covered in the academic literature. He also thinks that there is . . . a social function in all this: The stock market acts as a big insurance market. By changing weights in, say, recession-sensitive stocks, people whose incomes are particularly hurt by recessions can purchase insurance against that loss from people whose incomes are not hurt by recessions.
He argues that this insight provides a plausible interpretation of many anomalies that he and others have documented. For example, he refers to the apparent tendency for small cap stocks to outperform over long periods of time, but that this tendency has become less marked more recently. He expresses the view that the risks involved were narrowly held but have become more widely held as more small-cap funds have been launched. An important question that then arises is what institutional barriers keep investors from sharing such risks more widely. Cochrane thinks that anomalies of type (a) are the ones most likely to be sustainable over the longer term (but are in some sense perhaps of least use to the generality of investors, see below) whereas ones of type (b) are the least likely to persist, on the grounds that news travels quickly, investors are challenged regularly about their views and so on. An exception 4
He refers to one that is based on dividend to price ratio, but see also Section 5.3.5.
in relation to (b) might be if the anomaly has a deep-seated behavioural driver underlying it. If the supposed return premium ‘comes from a behavioural aversion to risk, it is just as inconsistent with widespread portfolio advice as if it were real’ (Cochrane, 1999). He notes that we cannot all be less exposed than average to a given behaviour, just as we cannot all be less exposed than average to a given risk. Thus, such ‘advice must be useless to the vast majority of investors . . . If it is real or behavioural and will persist then this necessarily means that very few people will follow the portfolio advice’ (Cochrane, 1999). Out of such discussions he concludes that investors should ask themselves the following questions (all of which have a natural resonance with anyone who takes the idea of risk budgeting seriously): What is my overall risk tolerance? What is my horizon? What are the risks that I am exposed to? What are the risks that I am not exposed to? He exhorts us to remember that the average investor holds the market (‘to rationalize anything but the market portfolio, you have to be different from the average investor in some identifiable way’ (Cochrane, 1999)). This observation is of direct relevance to techniques such as Black-Litterman and certain other ‘Bayesian’ approaches to portfolio construction (see Chapter 6). Perhaps his final conclusion is the one that is most apposite for readers of this book to bear in mind: Of course, avoid taxes and snake oil . . . the most important piece in traditional portfolio advice applies as much as ever: Avoid taxes and transaction costs. The losses from churning a portfolio and paying needless short-term capital gain, inheritance, and other taxes are larger than any of the multifactor and predictability effects I have reviewed. Cochrane, 1999
This should remind us that, however sophisticated might be the techniques we might want to adopt to cater better for extreme events, we should not then forget the basics. We explore the potential impact of transaction costs in Section 5.7.
5.3 RISK-RETURN OPTIMISATION 5.3.1 Introduction We will make little headway in this book, however, if we do not in the main formulate our arguments using the language of quantitative investment management. It is just not practical to expect to be able to differentiate effectively between what we might do in a Normal world and what we might do in a fat-tailed world without the use of a framework capable of differentiating between the two. The traditional quantitative workhorse that practitioners use to help decide how much risk to take (and of what type) is risk-return optimisation, also referred to as efficient frontier analysis. Mathematically, any optimiser requires some definition of return (also referred to as reward) and some definition of risk. The optimisation exercise then involves maximising, for a given risk aversion parameter, λ, some risk-reward trade-off function (i.e., utility function, see also Section 7.3) subject to some constraints on the portfolio weights, and then repeating the exercise for different values of λ. The trade-off might be expressed mathematically using the following equation: U (x) = Return (x) − λ.Risk (x)
(5.1)
where x is the portfolio asset mix (which for the moment we will assume remains fixed), Return (x) measures the return generating potential of this portfolio and Risk (x) measures its risks. The output from the optimisation is then a series of portfolios, one for each λ, each of which optimally trades off risk against reward for some particular level of risk (or equivalently for some particular level of return). Such portfolios are known as ‘efficient portfolios’. Collectively the line that they form when plotted in a risk-reward chart is known as the ‘efficient frontier’ because it is not possible (given the assumptions adopted) to achieve a higher return (for a given level of risk) or lower risk (for a given level of return) than a point on the efficient frontier. Figure 5.1 shows an example of an efficient frontier, together with the corresponding efficient asset mixes. Implicitly or explicitly superimposed on this basic framework will be assumptions regarding the costs of effecting transactions moving us from where we are now to where we would like to be. Often the assumption made is that these costs are small enough to be ignored, but more sophisticated optimisation engines do not necessarily adopt this simplification. 5.3.2 Mean-variance optimisation The simplest (non-trivial) portfolio optimisation is (one-period) mean-variance portfolio optimisation. In it, we assume that the ‘return’ component of the utility function depends on a vector, r, of (assumed future) returns on different assets.5 The return for any given portfolio is then the weighted average (i.e., weighted mean) of these returns (weighted in line with the asset mix being analysed). The ‘risk’ component of the utility function is proxied by an estimate of forward-looking tracking error (versus some suitable minimum risk position,6 say, b) based on a suitable covariance matrix. Thus the utility function has a quadratic form as follows: U (x) = r.x − λ (x − b)T V (x − b)
(5.2)
Equally, it could be stated as follows, where a = x − b, because this involves a constant shift of −r·b to the utility function and therefore leads to the same portfolios being deemed optimal:

U(x) = r·a − λ aᵀV a     (5.3)
Constraints applied in practice are often linear, i.e., of the form Ax ≤ P. Equality constraints (such as requiring that the weights in a long-only portfolio add to unity) can be included as special cases of this form as follows:
Σi xi ≤ 1     and     −Σi xi ≤ −1     (5.4)
5 Throughout this discussion ‘asset’ should be taken to include ‘liability’, because the optimisation may involve short positions, liability driven investment or asset-liability management. 6 The common alternative involving measuring risk in absolute volatility terms can be viewed as a special case where we have a hypothetical asset that exhibits zero volatility. It also typically produces similar answers to the above approach if a 100% cash minimum risk portfolio is assumed, given the relatively low volatility of cash, but this latter approach would be less helpful if we were attempting to allocate a portfolio across different types of cash-like instruments.
Figure 5.1 Illustrative efficient portfolio analysis: (a) Illustrative efficient frontier (including risks and returns of individual asset categories, drawn from ML US cash, government, corporate, high yield and emerging market bond indices); (b) Composition of corresponding efficient portfolios. Both panels plot against risk % pa (annualised volatility of returns). Analysis uses merely illustrative forward-looking expected return assumptions applicable some years ago and is not therefore a guide to what might be a suitable asset allocation in current market conditions. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian, Merrill Lynch, Bloomberg.
Long-only portfolio constraints (i.e., that the weight invested in each stock should not be less than zero) can also be included as special cases of this form as follows:

−xi ≤ 0 for each i     (5.5)
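Putting Equations (5.2), (5.4) and (5.5) together gives a small constrained quadratic programme. A minimal sketch using scipy.optimize is shown below; the expected returns, covariance matrix and λ are purely illustrative assumptions, and b is taken to be the zero vector (i.e., risk is measured in absolute terms, as in footnote 6).

import numpy as np
from scipy.optimize import minimize

r = np.array([0.03, 0.05, 0.08])                      # illustrative expected returns
V = np.array([[0.020, 0.005, 0.001],
              [0.005, 0.060, 0.010],
              [0.001, 0.010, 0.150]])                  # illustrative covariance matrix
lam = 4.0                                              # risk aversion parameter lambda

def neg_utility(x):
    # Maximising U(x) = r.x - lambda * x'Vx is the same as minimising -U(x)
    return -(r @ x - lam * x @ V @ x)

n = len(r)
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0},)   # weights sum to one, cf. (5.4)
bounds = [(0.0, None)] * n                                          # long-only constraint, cf. (5.5)

res = minimize(neg_utility, x0=np.full(n, 1.0 / n),
               method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x)        # an 'efficient' portfolio for this value of lambda
# Re-running for a range of lambda values traces out an efficient frontier (cf. Figure 5.1)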
Other common types of constraint include ones linked to a risk measure. For example, the client might impose a maximum limit7 on tracking error, VaR and/or CVaR/TVaR. If V is a positive definite symmetric matrix (which it should be if it corresponds to a true probability distribution or is derived directly from historic data), if constraints are linear and if there is no allowance made for transaction costs (or for deals needing to be in multiples of given lot sizes) then the problem involves constrained quadratic optimisation. It can be solved exactly using a variant of the Simplex algorithm or solved using other standard algorithms for solving such problems; see, e.g., Kemp (2009) or references quoted in Press et al. (2007). Application of mean-variance optimisation is very widespread in both academic finance circles and in actual business practice. There are several examples of its use in the public domain by the author (Kemp, 2005; 2008a; 2008b; 2008c). In a world in which returns (and any other economic factors included in the utility function) are characterised by multivariate (log-) Normal distributions, single-period risk-return optimisation defaults to mean-variance optimisation. This is because the return distribution is then completely characterised by its mean vector and its covariance matrix. The investor’s utility, to the extent that it depends on market derived factors, must therefore also merely depend on these parameters. Risk-return optimisation also defaults to mean-variance optimisation if the investor has a quadratic utility function as per Equation (5.2), i.e., again depends merely on these parameters, even if the distribution is not multivariate (log-) Normal. However, quadratic utility functions are generally considered to be implausible representations of how an investor might actually view risk and return. If return distributions are believed to deviate materially from Normality, some refinement of mean-variance optimisation is typically considered desirable. One-period mean-variance optimisation is also called Markowitz portfolio optimisation, because Harry Markowitz was the individual credited with first developing the relevant formulation (Markowitz, 1952). 5.3.3 Formal mathematical notation More formally, we can state the one-period problem as follows. We want to find the value of x that maximises the future utility that the investor expects to obtain from the portfolio, i.e., to find the value of x for which U (x) takes its maximum value, Umax : Umax = max U (x) x
(5.6)
The value of x achieving this maximum is written as arg maxx U (x). By ‘one-period’ we mean that the investor’s asset mix is fixed at the start of the period and cannot be changed. If we have a ‘more-than-one-period’ problem then it is assumed that the 7 Institutions also sometimes place a minimum limit on such risk measures. For example, if a client is paying fees commensurate with active management then the client might want the portfolio to exhibit some minimum level of active positioning versus the benchmark; index-like exposures can be achieved at lower fees by investing in an index fund.
investor can alter the asset mix at the end of each intermediate period. The problem can then be restated as follows (for an n period problem), where xi is the asset mix adopted at the start of the ith period:

Umax = max_{x0, x1, ..., xn−1} U(x0, x1, . . . , xn−1)     (5.7)
A complicating factor, however, is that we do not need to set x_1 etc. in advance at time t = 0. Instead, we may wait until time t_i to choose the x_i to be adopted at that time, bearing in mind what has already happened up to that time. The solution therefore involves dynamic programming, in which we define rules for working out how we might optimally choose the x_i based on information available to us then. These rules will also, in general, depend on what alternatives will become available to us in the future. Generally, optimal rules can only be determined by working backwards from t_n and identifying consecutively the optimal asset allocations to adopt at times t_{n-1}, t_{n-2}, ... given the information that might then be available to us. The problem simplifies if there is no cost to switching asset allocation, because we can then choose the asset mix at time t_i solely by reference to what we expect to occur during the period t_i to t_n. However, if a penalty is imposed when we alter the asset mix then the optimal strategy to adopt at, say, time t_{n-1} becomes dependent on what strategies we have previously adopted as well as what we expect to happen further into the future.
A simplification that is often adopted in practice in the multi-period problem is to view the utility of the trajectory taken as defined solely by the expected value of some measure derivable from an aggregate statistic linked to the x_0, x_1, ..., x_{n-1}. For example, the focus might be on the terminal wealth, W (i.e., wealth at time t_n), with all income and redemption proceeds received in the meantime assumed to be reinvested. The problem then becomes the following:

max_{x_0, x_1, ..., x_{n-1}} E_0(U(W(x_0, x_1, ..., x_{n-1})))    (5.8)
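As a simplified illustration of Equation (5.8), the sketch below uses Monte Carlo simulation to estimate E_0(U(W)) for a range of constant risky/cash mixes and selects the best of them. The return distributions, the CRRA utility exponent and the 10-period horizon are all hypothetical, and restricting attention to constant mixes deliberately side-steps the dynamic programming element discussed above; it illustrates the evaluation of the objective only, not the full multi-period optimisation.

import numpy as np

rng = np.random.default_rng(0)
n_periods, n_sims = 10, 20000
gamma = 3.0                                   # hypothetical CRRA risk aversion
cash = 1.02                                   # hypothetical per-period cash growth factor
# Hypothetical i.i.d. lognormal risky growth factors, drawn once and reused so that
# different asset mixes are compared on the same simulated paths.
risky = np.exp(rng.normal(0.05, 0.18, size=(n_sims, n_periods)))

def expected_utility(w):
    growth = w * risky + (1 - w) * cash       # per-period portfolio growth factors
    terminal_wealth = growth.prod(axis=1)     # W for this (constant) asset mix
    return np.mean(terminal_wealth ** (1 - gamma) / (1 - gamma))   # E_0(U(W)) under CRRA utility

weights = np.linspace(0.0, 1.0, 21)
best = max(weights, key=expected_utility)
print("best constant risky weight:", round(float(best), 2))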
This simplification is not, however, always valid in practice. What happens in the meantime may be as important as, if not more important than, what happens when we reach t_n. This may be particularly relevant for endowments and other investors whose liability structures involve them using up their asset pool over time.

Principle P22: Multi-period portfolio optimisation often focuses just on the end result, e.g., terminal wealth. However, this simplification may not be appropriate in practice. What happens during the journey that we take to get to a destination can often be important too.
5.3.4 The Capital Asset Pricing Model (CAPM) Perhaps the most important application in practice of (one-period) mean-variance optimisation is the Capital Asset Pricing Model (CAPM). In this section we summarise this model so that readers can understand its interaction with other topics covered in this book. According to Berk and DeMarzo (2007) the CAPM was proposed as a model of risk and return by Sharpe
(1964), as well as in related papers by Treynor (1962), Lintner (1965) and Mossin (1966), and 'has become the most important model of the relationship between risk and return'. For his contributions to the theory, William Sharpe was awarded the Nobel Prize in economics in 1990.
Making suitable assumptions,8 the key insight of the CAPM is to deduce from the actions of investors themselves that all efficient portfolios should consist of a linear combination of the risk-free asset and the market portfolio. All portfolios that involve just mixtures of securities have a risk-reward position on or to the right of the curved line shown in Figure 5.2 and all efficient portfolios lie on the capital market line (CML). The CML is also called the security market line if the 'market' in question refers merely to securities (e.g., the stockmarket) and the tangency portfolio (because it corresponds to a risk-reward profile that forms a tangent to that available from any combination of risky assets).

[Figure 5.2 CAPM and the capital market line (CML): expected return (% pa) plotted against volatility (standard deviation, % pa), showing the risk-free asset, Companies A to E, the Market Portfolio and the Capital Market Line. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian.]

8 The assumptions underlying the CAPM include: (a) Investors can buy and sell all securities at competitive market prices without suffering market frictions (such as transaction costs and taxes): by 'competitive' we mean, inter alia, that the market is in equilibrium; (b) Investors can borrow and lend at the risk-free interest rate (and all share the same view as to what constitutes 'risk-free'). Usually the risk-free rate is equated by practitioners with the Treasury-bill rate or some other similar short-term interest rate. However, see Kemp (2009) for a further discussion of how we might identify what is 'risk-free'; (c) Investors are 'rational', holding only efficient portfolios or securities (plus some exposure, possibly zero or negative, to the risk-free asset) where 'efficient' means one-period mean-variance optimal; and (d) Investors all share the same expectations regarding future volatilities, correlations and expected returns on securities (and all assume that return distributions are Normally distributed or all share a common quadratic expected utility function). Given these assumptions, if the market portfolio was not efficient then the market would not be in equilibrium and according to assumption (a) it would adjust itself so that it came back into equilibrium.
The CAPM is particularly important in the context of evaluating the net present value of different investments that a firm might undertake. To do this we must determine the appropriate discount rate, or cost of capital, to use for a given investment. If the assumptions underlying the CAPM apply (and if we do not need to worry about other factors such as differential tax regimes dependent on which entity makes the investment) then the cost of capital rate we should use is determined by the beta, β_i, of the investment with the market, i.e., we should use a discount rate R_i for investment opportunity i as follows:

R_i = R_rf + β_i × (E(R_m) − R_rf)    (5.9)

where R_rf is the risk-free rate and E(R_m) is the expected return on the market, and β_i = ρ_{i,m} σ_i / σ_m, where σ_i is the expected standard deviation of returns on the investment, σ_m is the expected standard deviation of returns on the market and ρ_{i,m} is the correlation between the returns on the investment and the returns on the market.
In practice, firms use this concept by estimating E(R_m) − R_rf using general economic reasoning (and/or historic analyses of the excess return that equities have returned over cash) and then estimating the likely betas of the opportunities available to them (perhaps in part by considering the historic beta of the sector index for the sector of which they are a part). Once they have identified a suitable discount rate that ought to be applicable to investment opportunities open to them, they can judge the merits of different business opportunities that they might follow, and only invest in ones which provide an adequate expected return to shareholders.
In the above, the 'market' is usually taken to mean a value-weighted portfolio that includes all securities within the equity market. In practice, there are subtleties around this concept. For example, equating 'the market' with 'the equity market' is a rather equity-centric stance to adopt. Even if we think that doing so is correct, the CAPM is usually applied principally at a country level, leaving unresolved what to do with equities from other countries. We would also need to decide how to handle equities that have a limited 'free float' (e.g., are partly owned by other companies that are also included in the same market). These subtleties take on added importance when we consider the Black-Litterman portfolio construction methodology in Section 6.4.2.
An important corollary of the CAPM is that there is no necessary reward for taking risk per se. In the CAPM, we can notionally split our portfolio, x, into two parts: a systematic part, x_systematic, investing along the CML with the same beta β to the market as x, and the remaining, idiosyncratic, part x_idiosyncratic. So

x_systematic = (1 − β) x_risk-free + β x_market    (5.10)

and

x_idiosyncratic = x − x_systematic    (5.11)

If x_idiosyncratic is nonzero then the position of x in the risk-return space as per Figure 5.2 will in general be inefficient. Inclusion of idiosyncratic risk (i.e., security exposures not value-weighted in line with the market) subtracts expected return, adds risk or does both (relative to that available on the CML). Better, if the assumptions underlying the CAPM are correct,
is to diversify away the idiosyncratic exposures by holding merely market-weighted security exposures.

5.3.5 Alternative models of systematic risk

The simplicity and power of the CAPM have had an enormous influence on modern finance. However, almost as soon as it was introduced, researchers sought to make it more realistic and more complex. The most important refinement, for our purposes, is to drop the assumption that systematic risk is one-dimensional and instead to view it as multi-dimensional. This is usually referred to as the Arbitrage Pricing Theory (APT). Factors that researchers often argue appear to have their own systematic risk premia include the following:
(a) Size: At times in the past small companies (i.e., those with lower market capitalisations) appear to have achieved higher returns than their larger brethren.
(b) Value/growth characteristics: This involves differentiating stocks that have high growth prospects (perhaps being in new industries) from ones that are expected to be more limited in their growth opportunities (perhaps because they operate in older declining industries but with business models that might offer more predictable and higher current income streams).
(c) Leverage/gearing: If credit is scarce, companies that are highly leveraged may struggle if for any reason they need to refinance their debt or take on additional leverage.
Some researchers such as Fama and French (1992) have argued that the risk premia available from exposure to some of these factors may be systematically positive. Other risk premia may be viewed as relevant only some of the time and not necessarily always positively. These might include exposures to short versus long-term bond yield spreads or price momentum.
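In practice the beta in Equation (5.9), and the exposures to APT-style factors such as those listed above, are usually estimated by linear regression of (excess) returns on (excess) market or factor returns. The sketch below shows the basic computation using simulated data; the factor series, the assumed risk-free rate and the assumed equity risk premium are all hypothetical, and it is not a description of any particular commercial factor model.

import numpy as np

rng = np.random.default_rng(1)
n = 120                                   # e.g., 10 years of monthly observations

# Hypothetical excess return series for the market and two additional factors
market = rng.normal(0.004, 0.04, n)
size = rng.normal(0.001, 0.02, n)
value = rng.normal(0.001, 0.02, n)
# Hypothetical stock excess returns with 'true' exposures of 1.1, 0.4 and -0.2
stock = 1.1 * market + 0.4 * size - 0.2 * value + rng.normal(0.0, 0.03, n)

# CAPM beta as per Equation (5.9): beta_i = rho_{i,m} * sigma_i / sigma_m
beta = np.cov(stock, market)[0, 1] / np.var(market, ddof=1)
cost_of_capital = 0.03 + beta * 0.04      # hypothetical R_rf = 3%, E(R_m) - R_rf = 4%

# Multi-factor (APT-style) loadings via ordinary least squares
X = np.column_stack([np.ones(n), market, size, value])
loadings, *_ = np.linalg.lstsq(X, stock, rcond=None)

print("estimated beta:", round(beta, 2), " cost of capital:", round(cost_of_capital, 3))
print("alpha and factor loadings:", np.round(loadings, 2))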
5.4 MORE GENERAL FEATURES OF MEAN-VARIANCE OPTIMISATION 5.4.1 Introduction In this section we explore some of the more important features of mean-variance optimisation. We drop the assumption implicit in the CAPM that all investors have the same expectations regarding means, variances etc., but we retain the assumption that we have such expectations. So the portfolio mixes that we deem optimal will no longer necessarily be the same ones that others deem optimal. 5.4.2 Monotonic changes to risk or return measures The same optimal portfolios arise with any other risk measure that monotonically increases in tandem with the risk measure used above. So, for example, we will get the same efficient portfolios whether we use the ex-ante tracking error, the ex-ante variance (i.e., the square of the ex-ante tracking error) or any VaR statistic that has been determined using a Normal distribution approximation from the same underlying covariance matrix (as long as the risk measures relate to the same minimum risk portfolio). The same is also true if we replace the original return measure by any other that monotonically increases in line with the original return measure.
At first sight, therefore, the results of Alexander and Baptista (2004) (and of certain other papers that they quote) seem surprising. Alexander and Baptista explore the merits of applying VaR versus CVaR-style constraints to mean-variance portfolio construction. They argue that in certain cases when a VaR constraint is applied agents may select a larger exposure to risky assets than they would otherwise have done in its absence, and they go on to argue that, in these circumstances, CVaR-style constraints may be better than VaR constraints. Their analysis seems paradoxical because if (x_1, ..., x_m)^T is multivariate Normally distributed, then the distribution of any linear combination of the x_i, e.g., Σ_i a_i x_i (for any fixed scalars a_i), is also Normally distributed.9 Thus we might expect all the above risk measures to scale linearly together. The way to resolve this apparent paradox is to realise that VaR, CVaR, tracking error and the like all implicitly involve computation relative to something,10 and that if we mix together risk measures that are relative to different things then this scaling no longer applies. Indeed, as Alexander and Baptista (2004) note, it is even possible for there to be no portfolio that satisfies all the constraints simultaneously, i.e., for the efficient frontier to be empty.

5.4.3 Constraint-less mean-variance optimisation

If there are no constraints (other than that portfolio weights add to unity), and by implication short-selling is permitted, then all efficient portfolios end up merely involving a linear combination of two portfolios: a 'low-risk' portfolio and a 'risky' portfolio.11 These are the minimum risk portfolio, b, as above (if it is investible) and some single 'risky' set of 'active' stances that represent the best (diversified) set of investment opportunities available at that time. All (unconstrained) investors adopting the same assumptions about the future (multivariate) return distribution should then hold the same active stances versus their own minimum risk portfolios.

5.4.4 Alpha-beta separation

It is perhaps more common to express the decomposition in Section 5.4.3 using the language of alpha-beta separation. Strictly speaking, alpha-beta separation involves physically splitting the portfolio into two parts, one explicitly aiming to deliver beta, i.e., desired market exposures (achieved via, say, the use of low-cost index funds or low-cost index derivatives) and the other aiming to deliver alpha (i.e., enhanced returns over and above those available purely from market indices, sourced by employing skilled managers in some particular investment area). Conceptually any portfolio can be hypothetically decomposed in an equivalent way. Re-expressed in this fashion, the optimal portfolio in the constraint-less mean-variance optimisation case involves (a) the minimum risk portfolio, i.e., the liability matched beta; (b) the parts of the active stances that are market-index-like in nature, i.e., the active beta; and (c) the remainder, i.e., the alpha.

9 Visually, we can see this by noting that any cross-section through the centre of the joint probability density function of a multivariate Normal distribution (e.g., as shown in Figures 3.1–3.3) is a univariate Normal distribution.
10 Typically, the 'something' can be expressed as an asset mix, i.e., a minimum risk portfolio.
11 More generally, although efficient frontiers generally look curved when plotted, they are actually (for mean-variance optimisation) multiple joined straight line segments if the axes are chosen appropriately, with the joins occurring whenever a new constraint starts or ceases to bite. This feature is used by some commercial optimisation packages to speed up computation of the entire efficient frontier.
The presence of point (b) reflects the observation that investors often expect some intrinsic long-term reward for investing in particular types of risky assets per se (e.g., investors might consider there to be an equity 'risk premium' available over the long-term by investing in equity market indices), while (c) reflects the hope that there are opportunities that can be exploited by skilled fund managers operating within a given asset class.

5.4.5 The importance of choice of universe

Within the above conceptual framework there is no necessary consensus on what actually corresponds to 'market' beta. For example, should it correspond to the main market index of the market we are analysing, e.g., the S&P 500 index for a US equity portfolio? But if so, why not use a global equity index, or one that includes both equities and bonds (and cash and other less traditional asset types)? Asking this question highlights that 'alpha' can come not just from within a given asset class but also from successful timing of allocation between such classes, however granular the definition of 'asset class'.

5.4.6 Dual benchmarks

In Section 5.4.2 we noted that in a mean-variance framework different efficient portfolios arise if we are focusing on different types of risk, as encapsulated by different minimum risk portfolios. At times we may want to identify efficient portfolios that specifically optimise against two (or more) different benchmarks simultaneously. This can be handled using the following utility function:

U_dual(x) = r.x − λ_1 (x − b_1)^T V (x − b_1) − λ_2 (x − b_2)^T V (x − b_2)    (5.12)

where b_1 is the first benchmark, b_2 is the second benchmark and λ_1 and λ_2 are the risk aversion parameters applicable to the corresponding risks. This is the same, up to a constant shift, as the following utility function:

U_dual(x) = r.x − (λ_1 + λ_2) (x − b)^T V (x − b)    (5.13)

where b = (λ_1 b_1 + λ_2 b_2) / (λ_1 + λ_2). We can therefore find the efficient portfolios for the dual benchmark problem using the same basic methodology as before but applied to a modified mean-variance optimisation problem as defined in Equation (5.12). If different covariance matrices are used in the two different risk measures then the problem can also still be simplified to a single quadratic utility function, but the mathematical manipulations required are more complicated; see Kemp (2010).
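The equivalence of Equations (5.12) and (5.13) up to a constant is easy to verify numerically, as in the sketch below, which uses arbitrary made-up inputs and checks that the two utilities differ by the same amount whatever portfolio x is chosen.

import numpy as np

rng = np.random.default_rng(2)
m = 4
r = rng.normal(0.04, 0.02, m)
A = rng.normal(size=(m, m))
V = A @ A.T + np.eye(m)                      # an arbitrary positive definite matrix
b1, b2 = rng.dirichlet(np.ones(m)), rng.dirichlet(np.ones(m))
lam1, lam2 = 3.0, 5.0
b = (lam1 * b1 + lam2 * b2) / (lam1 + lam2)  # the blended benchmark in Equation (5.13)

def u_dual(x):      # Equation (5.12)
    return r @ x - lam1 * (x - b1) @ V @ (x - b1) - lam2 * (x - b2) @ V @ (x - b2)

def u_single(x):    # Equation (5.13)
    return r @ x - (lam1 + lam2) * (x - b) @ V @ (x - b)

for _ in range(3):
    x = rng.dirichlet(np.ones(m))
    # The difference is the same constant for every x (up to floating-point error).
    print(round(u_dual(x) - u_single(x), 10))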
5.5 MANAGER SELECTION Portfolio optimisation techniques can be applied not just at an individual asset or asset class level but also at a fund or manager level. They may then help identify what sort of investment management structure we should be adopting for our portfolio.
For example, Jorion (1994) explores the merits of different types of currency overlay management styles. He considers three approaches: (a) a joint, full-blown optimisation over the underlying assets (deemed to be stocks or bonds) and currencies; (b) a partial optimisation over the currencies given a predetermined position in the core portfolio; (c) separate optimisations of each of the currency exposures and the underlying asset portfolios. Only approaches (b) and (c) involve an 'overlay' as such, because (a) corresponds to the situation where the portfolio is managed holistically (presumably by a single manager or management house, or with very close communication between the managers responsible for different parts of the portfolio). Intrinsically, we should expect approach (a) to be better than (b) or (c) and (b) to be better than (c).12 When going from (c) to (a) we are increasingly widening the range of strategies that we are considering in the analysis. Jorion (1994) shows that when returns on the underlying assets are uncorrelated with exchange rates, there is no loss of efficiency from optimising underlying assets and currencies separately. He also notes that if expected (relative) returns on one currency versus another are zero (as some commentators argue), the optimal currency positions are then zero, and there is no reason to invest in currencies. However, he then goes on to argue that the expected returns on currency hedges are not likely to be zero as some might argue,13 justifying inclusion of some form of separate currency exposures. He is more agnostic on the practical merits of these three approaches. Approach (a), i.e., the holistic approach, should be preferred, if viewed purely through the lens of optimisation efficiency. However, he notes that whether the corresponding loss of efficiency14 from adopting (b) or (c) 'is significant must be judged against the benefits from using specialized overlay managers. There is some evidence that returns on currencies are predictable. If so, the value added by overlay managers could outweigh the inherent inefficiency of the set-up' (Jorion, 1994). The intrinsic merits of adopting a 'holistic' approach to investment management are not just limited to currency overlays. For example, Kemp (2008a) shows that a holistic approach to global equity investment management ought to be more efficient than one in which the equity portfolio is parcelled out to several managers from different houses who do not talk to each other, as long as the manager skill levels are equivalent. Interestingly, over the last 20 years or so, there has been a general shift in the opposite direction. There has been a shift away from employing generalist managers who are responsible for 'the entire portfolio'15 (and who presumably can be expected to adopt a 12 An interesting aside is that for some types of mathematical problems the fact that approach (a) is 'better' than (b) and (b) is 'better' than (c) does not necessarily make (a) 'better' than (c); see, e.g., footnote 4 in Section 7.3.2 on page 222. 13 For example, Jorion (1994) notes that one argument others have promoted for there being no expected (relative) return on forward currency contracts is that such contracts are in zero net supply (because there is always an opposite party to anyone transacting such instruments).
He then notes that this argument cannot be valid in isolation because there is also zero net supply in instruments like equity futures, but usually some expected return would be posited from such exposures if the commentator believes that there is a long-term equity risk premium. 14 Jorion (1994) actually describes the comparison as resulting in ‘underperformance’, but this of course requires the assumption that the manager in question has skill. 15 Typically here ‘entire’ actually only means the asset portfolio, although some firms do offer more integrated management of assets and liabilities in tandem.
holistic investment management style) and towards employing a series of specialists who concentrate on a particular asset type (including, potentially, regional equity specialists or currency overlay managers). The argument generally put forward for such an approach is that it allows a ‘best in breed’ approach to investment management. The validity of this argument relies on the assumption that the person or organisation selecting the managers can successfully identify which managers have the most skill in the specialism in question. There may also be behavioural biases at play. The countries of domicile in which such approaches tend to be most prevalent (in, say, the institutional defined benefit pensions arena) also seem to be the ones where there is typically the greatest manager selection or governance capability able to support the use of specialists. Conversely, sceptics might argue that they are also therefore the countries where there might be the greatest vested interest towards using specialist managers.
5.6 DYNAMIC OPTIMISATION

5.6.1 Introduction

In the previous sections of this chapter we focused mainly on the single-period problem. Risk-return optimisation may also be dynamic or multi-period. This requires us to work out the optimal strategy to adopt now and how it might change in the future as new information comes along. The decomposition in Section 5.4.3 is often implicitly relied upon for portfolio construction in a multi-period, i.e., dynamic, environment. For example, suppose that we have some desired goal that we would be uncomfortable failing to meet (e.g., some desired level of coverage of liabilities), but ideally we would like to do better than this. The portfolio construction problem can then be re-expressed mathematically in the form of a derivative pricing problem, in which our desired goal is characterised by some form of derivative pay-off involving as its underlyings the 'low risk' and the 'risky' portfolios noted in Section 5.4.3. As the two portfolios are independently characterised, we find that the optimal portfolio, even in a dynamic environment, still involves combinations of these two portfolios throughout time, although the minimum risk part may change through time (as may the mix between the two parts), if the liabilities (and client risk appetite) change.16
Merton (1969) and Merton (1971) provide a formal justification for the decomposition of the problem into a risk-free and a risky asset. Merton (1969) considers portfolio construction in a continuous time problem and derives the optimality equations for a multi-asset problem when rates of return behave as per Brownian motion (i.e., follow Wiener processes in line with the sorts of behaviour typically adopted for derivative pricing purposes). Merton (1971) extends these results to cover more complicated situations. He shows that if a 'geometric Brownian motion' hypothesis is adopted then a general 'separation' or 'mutual fund' theorem applies, in which the risky portfolio will be the same for all investors with suitably consistent utilities. The risky portfolio could therefore be structured as a single mutual fund in which any such investor might participate.17

16 Readers interested in understanding more about liability driven investment are directed towards Kemp (2005).
17 The approach Merton (1971) uses involves Itô processes and other techniques more commonly associated with derivative pricing theory. He also argues that the separation theorem applies even if we drop potentially questionable assumptions about quadratic utility or Normality of return distributions. However, in a sense, Brownian motion is inherently 'Normal' in nature if viewed over short enough time periods, because it can be shown that however generalised is the form of Brownian motion being assumed, over each infinitesimal time period it still involves returns that are Normally distributed.
5.6.2 Portfolio insurance An example of how the ‘fund separation’ rule described in Section 5.6.1 can be used in practice is constant proportional portfolio insurance (CPPI). Classical index-orientated CPPI can be viewed in this framework as a methodology for seeking to meet some specified liability goal (typically delivery of an investment floor), while assuming that the optimal risky portfolio merely expresses an ‘active beta’ stance. More modern fund-linked variants assume that additional alpha is available from employing a fund manager within the risky portfolio.
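A minimal sketch of the classical CPPI rule may help make this concrete: at each period the exposure to the risky portfolio is set to a fixed multiplier times the 'cushion' (the excess of portfolio value over the floor), with the remainder held in the low-risk asset. The floor level, multiplier and simulated returns used below are all hypothetical.

import numpy as np

rng = np.random.default_rng(3)
n_periods = 60
value, floor = 100.0, 85.0          # hypothetical starting value and floor
multiplier = 4.0                    # hypothetical CPPI multiplier
cash_rate = 0.002                   # hypothetical per-period low-risk return

risky_returns = rng.normal(0.005, 0.05, n_periods)   # hypothetical risky portfolio returns

for risky_return in risky_returns:
    cushion = max(value - floor, 0.0)
    risky_exposure = min(multiplier * cushion, value)   # no leverage in this simple variant
    low_risk_exposure = value - risky_exposure
    value = risky_exposure * (1 + risky_return) + low_risk_exposure * (1 + cash_rate)
    floor *= (1 + cash_rate)        # the floor accretes at the low-risk rate

print("final value:", round(value, 2), " final floor:", round(floor, 2))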
5.6.3 Stochastic optimal control and stochastic programming techniques When expected returns or other aspects of the multi-period problem vary through time then the ‘separation’ theorem described in Section 5.6.1 becomes less helpful. We can still notionally (or practically) split our portfolio into two parts, a risk-free part and a risky part, but doing so no longer helps us much when trying to identify what should be in the risky portfolio. The two main ways of identifying how to invest the risky assets are using stochastic optimal control approaches and using stochastic programming approaches. The difference between these two approaches lies in the way in which uncertainty in the external environment is modelled. The stochastic optimal control problem captures this uncertainty by allowing for a continuum of states. As a result, the size of the stochastic optimal control problem grows exponentially with the number of state variables, i.e., the distinct factors (some of which might be observable directly while others might not) that are deemed to drive how the world might behave. Using this approach we end up with a series of partial differential equations that we then solve numerically. Examples include Merton (1969), Merton (1971), Brennan, Schwartz and Lagnado (1997) and Hsuku (2009). In contrast, stochastic programming models capture uncertainty via a branching event tree. Therefore, they can more easily accommodate a large number of random variables at each node, thus permitting a richer description of the state of the world: examples include Carino et al. (1994) and Herzog et al. (2009). The latter paper concentrates on how such techniques can be used to solve asset-liability management (ALM) problems. Stochastic programming problems are usually solved using Monte Carlo simulation techniques, by selecting a large number of random simulations of how the future might evolve and then choosing asset mixes that optimise the expected utility averaged across all these paths. We will come across such a model in Section 7.4 when discussing regime switching models. Arguably, however, the difference between these two approaches is principally one of detail rather than corresponding to a fundamentally different mathematical outcome. This is akin to the position with derivative pricing theory. The Black-Scholes option pricing formula can be derived analytically using a continuous time model or it can be derived as the limit of a discrete time model (e.g., the binomial lattice). In practice, few complex continuous time problems in finance can be solved without any recourse to numerical methods, and so the choice between the two approaches is primarily driven by how easy and efficient they might be to implement, bearing in mind that at some stage in the process discrete approximations will typically need to be introduced.
Principle P23: Multi-period portfolio optimisation is usually considerably more complicated mathematically than single-period optimisation. However, the form of the multi-period optimisation problem can often be approximated by one in which most of the relative stances within the dynamically optimal portfolio match those that would be chosen by a ‘myopic’ investor (i.e., one who focuses on just the upcoming period).
5.7 PORTFOLIO CONSTRUCTION IN THE PRESENCE OF TRANSACTION COSTS 5.7.1 Introduction In the material set out above we have assumed that we can buy or sell assets without incurring transaction costs. This assumption does not normally reflect reality. In this section we explore how the introduction of transaction costs affects the portfolio construction problem. We already noted in Section 5.2.4 that one of the most important pieces of portfolio advice is to ‘avoid taxes and transaction costs’. The average active investment manager is generally viewed as exhibiting zero alpha before costs. Incurring unnecessary taxes or transaction costs is thus a very effective way for managers to handicap themselves in the quest for outperformance. Conversely, if there is real merit in an investment idea then failing to capitalise on the idea because you are not prepared to incur some reasonable level of costs when implementing the idea is equally unhelpful in this quest.18 5.7.2 Different types of transaction costs Transaction costs come in a variety of forms, including: (a) Brokerage commission. To an outsider, this may be viewed as reflecting the mechanics of order processing and the costs thus charged by a third party, e.g., a stockbroker, for implementing the transaction on behalf of the portfolio. The reality is more complicated. Some of any such commission payments may relate to research that this third party provides to the manager/portfolio. This research may not explicitly relate to the trade in question. For example, the research might highlight which of several stocks are attractive to buy, and the portfolio might only then buy one of the highlighted stocks (or none of them). Many brokers provide such research and will seek to recoup the costs of providing this research by increasing the commission rates they charge on transactions actually 18 At a more macro, i.e., industry-wide, level, other factors come into play. Because active managers are often paid ad valorem fees the impact to them from incurring transaction costs on behalf of their clients can be viewed as being quite modest. Thus some commentators argue that they have an inadequate incentive to keep such costs low. Conversely, managers may argue that they have a strong interest in avoiding unnecessary transaction costs because these costs lower the investment performance that the manager delivers and therefore, all other things being equal, the fees that the manager is likely to be able to charge in the future (either from the client in question or from other future clients). This is an example of an agency-principal problem – how do we ensure that the agents (here the managers) really do act in the best interests of their principals (here the owners or beneficiaries of the portfolios being managed by these agents). The issues are perhaps most stark if transaction costs incurred on behalf of client portfolios include elements relating to services supplied back to the investment manager that the manager might otherwise have had to pay for itself. Many regulatory frameworks, e.g., the one applying to asset managers in the UK, do seek to address such topics in relation to what can or cannot be covered by so-called ‘soft commission’ or ‘directed commission’ elements often present within brokerage commission payments. Clients may also seek a better alignment of objectives by suitable use of performance-related fee arrangements.
undertaken. Some of the commission may also relate to provision of research or other services by other organisations back to the manager/portfolio/client in question (this is called soft commission or directed commission). (b) Bid-ask (or bid-offer) spread. This reflects the costs that would be incurred if we were to buy a security and sell it immediately afterwards. Economically, we might expect the bid-ask spread to recompense whoever is providing liquidity (i.e., the ‘market-maker’) for – the costs of carrying an inventory of securities necessary to provide this liquidity; – the costs of actually effecting the trade, to the extent that this is not included in (a); and – the risks incurred by being willing to be a counterparty to the relevant trade (e.g., there will be uncertainty in how long it might take such a trader to offload the position he or she will receive as a result of the transaction, and there may be information asymmetries that could work against the counterparty in question in this respect). (c) Market impact. Suppose we want to make a significant shift to a particular exposure. If the desired change is large relative to the typical levels of market transactions occurring in the securities in question then we may have to slice up our desired overall trade into several parts in order to effect the desired overall transaction efficiently.19 Earlier components of the trade will tend to shift the market price in a direction that makes later components more expensive to effect. This is called market impact. For large asset managers with scalable investment processes (which typically such managers need if their business is to be cost efficient), market impact can be the most important source of transaction costs. Transaction costs can be significant in size (particularly for a high turnover portfolio) but are notoriously difficult to analyse in detail. Several of the apparently separate types of transaction cost described above can morph into another type depending on the market in question.20 For example, some markets, such as bond markets, typically do not operate with brokerage commission. Any such costs are in effect moved into the bid-offer spread (including costs related to research pieces penned by research analysts in the market-making firms concerned). Some markets also operate on a ‘matched’ order basis, at least at certain points of the day (such as market close). There may be no apparent bid-offer spread present in such markets, but transaction effects will not generally have gone away (because many of the economic drivers relating to them are still present). Instead they may have migrated to ‘market impact’. 5.7.3 Impact of transaction costs on portfolio construction In theory we can introduce transaction costs relatively easily into the specification of the utility function – they behave as akin to a drag on performance. In practice, however, they can very considerably complicate the identification of what portfolio mix is optimal (see Sections 5.13 and 7.6), particularly for more complicated models. This is because the optimal portfolio at
19 Our dealers could, of course, try to carry out the whole trade simultaneously. However, the bid-ask spread incurred when doing so might be correspondingly larger, because of the increased risk incurred by the market-maker with whom we dealt. 20 This even arguably applies with taxes. For example, some jurisdictions have ad valorem duties on purchases or sales of some types of security (e.g., at the time of writing in late 2009 the UK had a stamp duty regime that applied to purchases of UK equities). It is possible to achieve broadly economically equivalent exposures potentially without paying such duties using certain types of derivative (e.g., contracts for differences, CFDs or single stock futures) but additional brokerage costs or bid-offer spreads or other charges might be incurred instead when doing so. Such costs may also be spread through time, rather than being incurred at point of transaction, meaning that the relative benefit (to the portfolio) of using such approaches may depend on the average time an exposure is held within the portfolio.
any given point in time becomes sensitive to what has happened between now and then, as well as our views of what might happen thereafter. 5.7.4 Taxes As far as investors are concerned, taxes behave in a manner akin to (other) transaction costs. Most taxes that influence portfolio allocation fall into one of the following categories: (a) Flat-rate ‘stamp’ duties: These involve a flat payment per transaction. (b) Ad valorem ‘stamp’ duties: These involve a payment proportional to the market value (usually) of the transaction. They may potentially be payable on purchases or on sales, but it is less common for duties to be levied both ways within the same jurisdiction. (c) Income tax: These involve a payment proportional to income (e.g., dividends, coupons) received on an investment. (d) Capital gains tax: These involve a payment proportional to the capital gain on an investment. Different countries often have different rules for different types of investor. The tax treatment for non-local investors may also vary according to any tax treaties that exist between the country in which the investor is domiciled and the country where the transaction occurs or from which the return source is deemed to have arisen. Readers are advised to seek specialist tax advice where relevant. Stamp duty is generally handled in a portfolio construction context in a manner akin to other transaction costs (which generally also only occur when a transaction takes place and are also principally ad valorem in nature in practice). Income tax and capital gains tax are generally handled differently. They are payable on the return stream generated by the investment, i.e., are incurred even if we adopt a buy and hold strategy. The simplest approach is to use after tax returns rather than the gross (i.e., before tax) returns typically used within portfolio construction exercises. Expected future returns (and, as a consequence, also the expected future volatilities of these returns) would be reduced to reflect the taxes expected to be incurred on these returns. However, this approach will generally be modified if the effective rate of tax21 on income differs from that on capital gains. Strategies that favour payment of one sort of tax over another may then be optimal as far as the investor is concerned. Principle P24: Taxes and transaction costs are a drag on portfolio performance. They should not be forgotten merely because they inconveniently complicate the quantitative approach that we might like to apply to the portfolio construction problem.
21 Capital gains tax is usually only payable when the gain is realised, and so the ‘effective’ rate needs to take account of the likely time to payment. Suppose that we had an investment that produced a (gross) return of 10% p.a. and also that income and capital gains taxes are both levied at a notional 50%. If income tax at 50% is levied at the end of each year on the whole of each year’s return then (in the absence of other transaction costs) an investment of 1 will accumulate to (1 + 0.10 × 50%)10 = 1.629 at the end of 10 years. However, if capital gains tax at 50% is levied at the end of the 10 year period on the whole of the return accrued over the 10 year period then (in the absence of other transaction costs) an investment of 1 will accumulate to 1 + 50% × (1.1010 − 1) = 1.797, i.e., the same as if it were subject to annual income tax at 39.6%.
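As noted in Section 5.7.3, transaction costs can in principle be introduced into the utility function as a drag on performance. The single-period sketch below deducts proportional dealing costs on the trades needed to move from the current portfolio to the new one before evaluating the usual mean-variance trade-off; after-tax rather than gross returns could be substituted in the same spirit (Section 5.7.4). The cost rate and other inputs are hypothetical, and a full multi-period treatment, in which costs create path dependency, would be considerably more involved.

import numpy as np
from scipy.optimize import minimize

r = np.array([0.06, 0.05, 0.03, 0.02])         # hypothetical expected returns
V = np.diag([0.04, 0.03, 0.01, 0.002])         # hypothetical covariance matrix
b = np.array([0.40, 0.30, 0.20, 0.10])         # benchmark
x_current = np.array([0.25, 0.25, 0.25, 0.25]) # the portfolio held today
lam, cost_rate = 5.0, 0.005                    # risk aversion; 50bp proportional dealing cost

def utility_after_costs(x):
    a = x - b
    costs = cost_rate * np.abs(x - x_current).sum()   # proportional cost on turnover
    return r @ x - lam * a @ V @ a - costs

# The kink introduced by abs() means a dedicated solver may be preferable in practice;
# a general-purpose solver is adequate for a sketch of this size.
cons = ({'type': 'eq', 'fun': lambda x: x.sum() - 1.0},)
res = minimize(lambda x: -utility_after_costs(x), x0=x_current,
               bounds=[(0.0, 1.0)] * 4, constraints=cons, method='SLSQP')
print("optimal weights allowing for costs:", np.round(res.x, 4))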
5.8 RISK BUDGETING 5.8.1 Introduction A challenge that arises whether or not a quantitative approach to portfolio construction is being followed is to decide how much risk to adopt. There is no right answer here. It depends on the client’s risk appetite. Mathematically we see that this is the case because the results of any optimisation exercise depend on the investor’s risk aversion, i.e., the λ introduced in Section 5.3. This parameter cannot be specified a priori. Instead, it is defined by (or defines) the investor’s ‘risk budget’. Risk budgeting involves the following: (a) identifying the total risk that we are prepared to run; (b) identifying its decomposition between different parts of the investment process; and (c) altering this decomposition to maximise expected value-added for a given level of risk. Risk budgeting has wide applicability and as a concept is difficult to fault. It can be applied to asset-liability management, asset allocation, manager selection, stock selection etc. and to practically any other element of the investment or business management process. Indeed, if we apply the concept more generally then it involves framing in a mathematical fashion the fundamentals underlying rational economic behaviour. This is generally taken to underpin all economic activity (although see Section 2.12.5 regarding ‘bounded rational behaviour’). As Scherer (2007) points out, the ‘merit of risk budgeting does not come from any increase in intellectual insight but rather from the more accessible way it provides of decomposing and presenting investment risks’. In this sense, it is akin to budgeting more generally. It is not strictly speaking intrinsically necessary for countries, companies or individuals to prepare budgets in order to plan for the future, but it is a very helpful discipline, particularly if resources are constrained. Scherer (2007) defines risk budgeting as ‘a process that reviews any assumption that is critical for the successful meeting of pre-specified investment targets and thereby decides on the trade-off between the risks and returns associated with investment decisions’. He notes that in a mean-variance world risk budgeting defaults to Markowitz portfolio optimisation, but with results being shown not only in terms of weights and monetary amounts but also in terms of risk contributions. All other things being equal, the principles of risk budgeting mean that we should focus our investments on those areas where there is the highest expected level of manager skill or ‘alpha’. By ‘skill’ we mean, ultimately, the ability to outperform the market other than merely by luck. In a multivariate Normal world this implies the presence of an underlying mean drift (albeit one that might be time-varying). Typically those who use such language do so with the implicit understanding that the mean drift is positive. It is rude in polite (active management) circles to highlight other people’s possibly negative mean drifts, even though the average manager must, by definition, be average, and hence (unless everyone is equal) there must be some with below average skill if there are to be others with above average skill.22 22 A possible flaw in this reasoning is that the average as a whole may ‘add value’. For example, we might argue that the scrutiny that financial analysts apply when assessing a company’s prospects and hence share price should ultimately lead to more efficient allocation of capital across the economy as a whole. 
This leads on to discussions about whether it is possible/right to ‘free-load’ off such activities merely by investing in index funds, and how value becomes apportioned between different elements of the capital structure or across different companies in capital raising situations.
Principle P25: Risk budgeting, like any other form of budgeting, makes considerable sense but does not guarantee success.
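A sketch of the sort of decomposition Scherer describes, i.e., restating portfolio weights as contributions to overall risk, is set out below. It uses the standard marginal-contribution algebra for a covariance-based risk measure, under which the contributions sum to the total tracking error; the weights and covariance matrix are made up purely for illustration.

import numpy as np

V = np.array([[0.040, 0.012, 0.002],
              [0.012, 0.030, 0.004],
              [0.002, 0.004, 0.010]])          # hypothetical covariance matrix
x = np.array([0.50, 0.30, 0.20])               # portfolio weights
b = np.array([0.40, 0.40, 0.20])               # benchmark weights
a = x - b                                      # active positions

tracking_error = np.sqrt(a @ V @ a)
marginal = V @ a / tracking_error              # marginal contribution d(TE)/d(a_i)
contributions = a * marginal                   # these sum to the tracking error

for weight, contrib in zip(a, contributions):
    print(f"active weight {weight:+.2f}  risk contribution {contrib:+.4f}")
print("total tracking error:", round(float(tracking_error), 4))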
5.8.2 Information ratios In a multivariate (log-) Normal world, ‘skill’ can be equated with a fund’s information ratio. By this we mean that statistical tests of presence or absence of skill based on historical data will in general focus on the observed values of the fund’s information ratio; see, e.g., Kemp (2009). To the extent that it is possible to estimate what a fund’s (or fund manager’s) information ratio might be in the future, it is also how you would most naturally characterise the extent to which you might expect to benefit from the manager’s supposed ‘skill’ going forwards.23 We can see this by re-expressing the definition of the information ratio as follows. An advantage of this re-expression is that tracking error can in essence be thought of as principally deriving from portfolio construction disciplines and information ratio principally from the skill that the manager exhibits.24 Information Ratio (IR) =
Outperformance (α) / Tracking Error (σ)    (5.14)

⇒ α = IR × σ    (5.15)
The problem that we find in practice is that optimal asset allocations based on different managers' skill levels are hugely sensitive to our assumptions about these future levels of skill, i.e., future likely observed information ratios. In particular, if we adopt the usual risk management starting assumption that the information ratio is zero then the answers to the optimisation exercise become ill-defined. So, to make good use of risk budgeting in this context we need to believe that we have some skill at choosing 'good' managers (i.e., ones with an expected information ratio versus the underlying benchmark greater than zero) as opposed to 'poor' ones (i.e., ones with an expected information ratio less than zero).
Risk budgeting theory can also be used to help define appropriate portfolio construction discipline rules. For example, a fund management house might a priori believe that it has an upper quartile level of skill (e.g., because of the way it selects staff, researches securities and so on). It might then adopt the following algorithm to define some portfolio construction disciplines:
1. If the house adopts the working assumption that all other managers will behave randomly (which to first order does not seem unreasonable if we analyse many different peer groups) then to target an upper quartile level of skill it should be aiming to deliver approximately a 0.7 information ratio over 1 year (if both return and risk are annualised), a 0.4 information ratio over 3 years or a 0.3 information ratio over 5 years; see Kemp (2009). These are close to the rule of thumb of an information ratio of 0.5 that is often used by investment consultants to define a 'good' manager (a simulation illustrating these figures is sketched at the end of this subsection).
2. Once it has defined an appropriate information ratio target, it can identify what level of risk needs to be taken to stand a reasonable chance of achieving a given client's desired level of relative return, using the formula in Equation (5.15).
3. It can then use simulation techniques (or other approaches) to identify the sorts of portfolio construction parameters that might typically result in it running approximately this level of risk through time.
Risk budgeting logic in theory implies that for the same fixed target outperformance level, a fund manager should alter average position sizes as general levels of riskiness of securities change, even if he or she has unchanged intrinsic views on any of the securities in question. This is arguably an appropriate approach if a short-term change really is a harbinger of a longer-term structural shift in the market. But what if it just reflects a temporary market phenomenon? During the dot com boom and bust of the late 1990s and early 2000s, average position sizes in many equity peer groups did not change much. Fund managers' ex ante tracking errors derived from leading third-party risk system vendors thus typically rose and then fell again. This suggests that fund managers often use more pragmatic portfolio construction disciplines (e.g., applying maximum exposure limits to a single name and changing these limits only infrequently) and view with some scepticism what might arise were risk budgeting theory to be rigorously applied.

23 We can also express the information ratio as IR = IC × √BR, where the information coefficient, IC, represents the manager's skill at judging correctly any particular investment opportunity, and investment breadth, BR, represents the number (breadth) of opportunities that the manager is potentially able to exploit; see Grinold and Kahn (1999).
24 Tracking errors and information ratios are not in practice fully independent of each other. For example, alpha might be added by timing of increases or reductions in tracking error. For long-only portfolios, the expected information ratio of a skilled manager might also be expected to decrease as tracking error increased, because the best short ideas are preferentially used up first. With such a portfolio it is not possible to go shorter than the benchmark weight in any particular stock.
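The 0.7 / 0.4 / 0.3 figures in step 1 above can be checked by simulation: if managers' active returns are pure noise then the upper quartile of their realised (annualised) information ratios over T years is approximately 0.67/√T. The sketch below, with purely illustrative parameters, reproduces these figures.

import numpy as np

rng = np.random.default_rng(4)
n_managers, te = 10000, 0.03        # hypothetical peer group size and 3% annual tracking error

for years in (1, 3, 5):
    # Zero-skill managers: annual active returns are pure noise with the given tracking error.
    active = rng.normal(0.0, te, size=(n_managers, years))
    annualised_active = active.mean(axis=1)              # annualised active return
    ir = annualised_active / te                          # realised IR using the ex ante tracking error
    print(years, "years: upper quartile IR approx", round(np.percentile(ir, 75), 2))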
5.9 BACKTESTING PORTFOLIO CONSTRUCTION TECHNIQUES 5.9.1 Introduction Backtesting is an important element in the testing of most types of quantitative or qualitative investment techniques. Nearly all quantitatively framed investment techniques aim to provide some guide to the future. The aim might merely be to provide guidance on the potential risks of following different strategies, as in risk modelling. Or, the aim might encompass the trade-off between risk and return, as is the main focus of this book. In either case, any model being created is in principle amenable to verification, by comparing predictions with actual future outcomes. Backtesting can be thought of as a way of carrying out such a comparison without actually having to wait for the future to arrive. It involves identifying how well the model would have worked had it been applied in the past. Backtesting can also be thought of as a core step in the calibration of a model to (past) market behaviour. To calibrate a model to observed market behaviour, we parameterise the model in a suitable fashion and choose which parameters to adopt by finding the model variant that best fits the (past) data. More specifically, backtesting generally involves the following algorithm: 1. We formulate from general principles a model that we think a priori should have merit. 2. We work out what results it would have given had we been using it in the past. 3. We then test these results to see if they are statistically significant, i.e., to see whether the success that might have accrued from using the model in the past was sufficiently large to have been unlikely to have arisen purely from chance.
4. If our model does not seem to work well then we reconsider the principles that led us to select it in step 1, identify a new and hopefully better model and start again from step 1 using that model.25 Commonly, there is a fifth step in practice, which is to identify the most effective way of implementing the desired model. This formulation applies equally to return forecasting models and risk models, although there are then differences in what we focus on as outputs from the model and hence in what we test in step 3. With a return forecasting model, we generally want the model to outperform (and to do so relatively reliably, i.e., at relatively low risk). Hence the aim would be to find a model that maximises some utility function that balances reward and risk. With a risk model, the implicit assumption usually adopted is that there will be no excess return, hence the focus is typically exclusively on the accuracy of prediction merely of risk measures. The latter can be thought of as a special case of the former, where the risk-reward trade-off parameter, λ, is chosen to give nil weight to future returns. In either case, the best that we can hope for is a statistical view regarding the quality of the model. Even a model that would have apparently performed exceptionally well in the past might have been ‘lucky’ and this ‘luck’ might not repeat itself in the future. Backtesting introduces subtleties into this analysis, in particular look-back bias, which make even apparently sound assessment of statistical accuracy potentially suspect (see Section 5.9.3). 5.9.2 Different backtest statistics for risk-reward trade-offs Investment performance measurement is a well developed discipline. Many different statistics have been proposed for assessing how well a portfolio has performed. These include cumulative return, average return, drawdown and the like. From a risk budgeting perspective, the analysis needs to include both risk and return. High past performance may sell well, but ultimately if it has come with excessive risk then its true ‘value’ to the investor is reduced.26 Statistics that explicitly include risk as well as return include the following: (a) The Sharpe ratio: This is the return on the portfolio in excess of the ‘risk-free’ rate, divided by the standard deviation of return relative to the same ‘risk-free’ rate. ‘Risk-free’ is usually taken to mean cash or short-dated treasury bills, even though these may not be ‘risk-free’ in the context of the investor in question. It is usually defined as follows, where r is the return on the portfolio, r rf is the risk-free return and σrf is the standard deviation of returns relative to the risk-free return: Sharpe ratio =
(r − r_rf) / σ_rf    (5.16)
(b) The Information ratio: This is the return on the portfolio in excess of its benchmark, divided by the standard deviation of return relative to that benchmark. The benchmark will be fund specific, e.g., a cash return, a market index (or composite of such indices) or a peer group average. It is usually defined as follows, where r is the return on the portfolio, r_b is the return on the fund's benchmark and σ_b is the standard deviation of returns relative to that benchmark, i.e., the tracking error:

Information ratio (IR) = (r − r_b) / σ_b    (5.17)

25 A variant of this approach is to parameterise the original model and then, in effect, to select whatever parameters seem to provide the 'best' model, where 'best' is again, ideally, chosen with a focus on statistical significance; however, see Section 5.9.3.
26 This does not stop some practitioners, particularly those who focus on backtesting purely as a marketing tool, from overemphasising return (relative to risk) in such analyses.
(c) The Sortino ratio: This is the return on the portfolio in excess of the risk-free rate divided by the downside standard deviation of returns relative to the same risk-free rate. It is usually defined as follows, where r is the return on the portfolio, r_rf is the risk-free return and σ_d,rf is the downside standard deviation of returns relative to the risk-free return:

Sortino ratio = (r − r_rf) / σ_d,rf    (5.18)
Here, a 'downside' standard deviation, σ_d, is usually calculated as follows, where n^− is the number of observations for which r_j − r̄ < 0.27 It is an example of a lower partial moment (see Section 7.9.3). We have here assumed that the cut-off threshold applied in the definition of 'downside' is the mean return, i.e. r̄; the formula becomes more complicated in the more general case where the threshold differs from the mean:

σ_d = sqrt( (1/n^−) Σ_{j: r_j − r̄ < 0} (r_j − r̄)^2 )    (5.19)
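The statistics in Equations (5.16) to (5.19) can be computed directly from a periodic return series, as in the minimal sketch below. The return series is simulated, the risk-free return is a hypothetical constant, no annualisation is applied and the downside deviation uses the mean as its cut-off threshold, as in Equation (5.19).

import numpy as np

rng = np.random.default_rng(5)
r = rng.normal(0.006, 0.03, 60)        # hypothetical monthly portfolio returns
r_rf = 0.002                           # hypothetical risk-free return per month
r_b = rng.normal(0.005, 0.025, 60)     # hypothetical benchmark returns

excess_rf = r - r_rf
excess_b = r - r_b

sharpe = excess_rf.mean() / excess_rf.std(ddof=1)            # Equation (5.16)
info_ratio = excess_b.mean() / excess_b.std(ddof=1)          # Equation (5.17)

# Downside deviation as per Equation (5.19), with the mean as the cut-off threshold
dev = excess_rf - excess_rf.mean()
downside = dev[dev < 0]
sigma_d = np.sqrt((downside ** 2).sum() / len(downside))     # n^- observations below the mean
sortino = excess_rf.mean() / sigma_d                         # Equation (5.18)

print("Sharpe:", round(sharpe, 2), " IR:", round(info_ratio, 2), " Sortino:", round(sortino, 2))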
For m > 2, Efron and Morris (1976), generalising Stein (1955), showed that μ̂_MLE is an inadmissible estimator of μ in relation to the sort of quadratic loss function that typically arises in (mean-variance) portfolio construction theory, i.e., a loss function of the form set out in Equation (6.19) – by 'inadmissible' we mean that there is some other estimator with at least equal and sometimes lower loss for any value of the true unknown parameter μ:

L(μ, μ̂(y)) = (μ − μ̂(y))^T V^{−1} (μ − μ̂(y))    (6.19)
It turns out that the James-Stein estimator defined above with ω̂ as follows has a uniformly lower loss function than the maximum likelihood estimator:

ω̂ = min( 1, ((m − 2)/n) / ((μ̂_MLE − μ_0 1)^T V^{−1} (μ̂_MLE − μ_0 1)) )    (6.20)
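A sketch of the resulting estimator is set out below. It computes ω̂ as per Equation (6.20) and then forms the shrunk means as (1 − ω̂)μ̂_MLE + ω̂μ_0 1, which is the form of the James-Stein estimator described in the surrounding text (Equation (6.18) itself is not reproduced here). The simulated data, the choice of the grand mean as the shrinkage target μ_0 and the use of a known covariance matrix V are all illustrative simplifications.

import numpy as np

rng = np.random.default_rng(6)
m, n = 8, 60                                   # m assets, n return observations
true_mu = rng.normal(0.005, 0.002, m)
V = np.diag(np.full(m, 0.03 ** 2))             # known covariance matrix, for simplicity
returns = rng.multivariate_normal(true_mu, V, size=n)

mu_mle = returns.mean(axis=0)                  # the sample means, i.e., mu_hat_MLE
mu0 = mu_mle.mean()                            # shrinkage target: the grand mean
ones = np.ones(m)

d = mu_mle - mu0 * ones
omega = min(1.0, ((m - 2) / n) / (d @ np.linalg.inv(V) @ d))   # Equation (6.20)
mu_js = (1 - omega) * mu_mle + omega * mu0 * ones              # the shrunk (James-Stein) means

print("shrinkage weight omega_hat:", round(omega, 3))
print("largest sample mean       :", round(float(mu_mle.max()), 5))
print("corresponding JS estimate :", round(float(mu_js[np.argmax(mu_mle)]), 5))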
The failure of the maximum likelihood estimator, i.e., the unadjusted sample means, to be as good an estimator as the James-Stein estimator is a perhaps surprising result. Jorion (1986) argues that it arises because the loss function encapsulates in a single joint measure all the impact of estimation error, while the sample means handle the estimation error on each individual mean separately. The James-Stein estimator is also called a shrinkage estimator because the sample means are all multiplied by a coefficient (1 − ω̂) lower than one. The estimator can be shrunk towards any point μ_0 and still have better properties than the sample mean, but the gains are greater when μ_0 is close to the 'true' value. For negative values of (1 − ω̂) an improvement is to set the coefficient to zero, hence the inclusion of an upper limit on ω̂ of 1.
For more complicated loss functions arising if we move away from the mean-variance paradigm, the same inadmissibility of the sample mean arises, because again the sample mean treats estimation error in a manner that is too fragmented. Unfortunately, proof of inadmissibility does not necessarily make it easy to identify better estimators. Jorion (1986) does, however, refer to work done by, e.g., Berger (1978) to devise possible shrinkage estimators that might work better than the sample mean in such circumstances.
With such a strong statistical underpin we might expect shrinkage to have become much more widely used as a portfolio construction concept than has actually proved to be the case.11 Unfortunately there is a twist in the tail, which probably explains the relatively modest take-up of this technique. The key point to note is that what we are actually interested in is not the estimation error in the input parameter estimates but the impact that estimation error has on the portfolios that are eventually deemed optimal. In Section 5.10 we saw that when we try to reverse optimise a given asset mix the corresponding (reverse optimised) mean returns for each individual asset are not uniquely defined. Instead the equivalent mean returns, a_j, are only defined up
to a_j = Cb_j + D for arbitrary constants C and D. However, when we write this in vector form we find that it becomes a = Cb + D1, i.e., exactly the same as the definition of the James-Stein estimator as per Equation (6.18), if we set C = (1 − ω̂) and D = ω̂µ₀. Thus any portfolio that is (unconstrained) mean-variance optimal for any given choice of ω̂ and µ₀ will also be (unconstrained) mean-variance optimal for any other choice of ω̂ and µ₀, including the one where ω̂ = 0, i.e., where no shrinkage has taken place at all. Shrinking the means will therefore only actually affect the portfolios that we deem optimal if there are constraints being applied within the problem (because it can then alter the extent to which these constraints bite).

6.5.3 'Shrinking' the covariance matrix

Perhaps another reason for the less than wholehearted take-up of the concept of shrinkage is the limited reliance that practitioners typically place on sample means when formulating assumptions for future means (see Section 6.2.2). A better component of the parameter input set in which to explore the merits of shrinkage might therefore be the covariance element, bearing in mind the typically greater reliance that is often placed on historic data when formulating assumptions for it. In essence, this is the basis of Ledoit and Wolf (2003a, 2003b, 2004). Central to their message is that the sample covariance matrix naturally includes estimation error of a kind most likely to perturb a mean-variance optimiser. In particular, the most extreme coefficients in the sample matrix tend to take on extreme values, not because this is the 'truth' but because they contain the highest amount of error. The covariance matrix is thus particularly sensitive to what Michaud (1989) terms the phenomenon of 'error maximisation'. Ledoit and Wolf (2003b) view the beauty of the principle of shrinkage as follows:

that by properly combining two 'extreme' estimators one can obtain a 'compromise' estimator that performs better than either extreme. To make a somewhat sloppy analogy: most people would prefer the 'compromise' of one bottle of Bordeaux and one steak to either 'extreme' of two bottles of Bordeaux (and no steak) or two steaks (and no Bordeaux).
They adopt the stance that the ideal shrinkage target for the covariance matrix, i.e., the analogue to the vector µ₀1 that appears in Jorion (1986), should fulfil two requirements at the same time. It should involve only a small number of free parameters (that is, include a lot of structure). It should also reflect important characteristics of the unknown quantity being estimated. Their papers identify the following alternative shrinkage targets:

(a) a single-factor model (akin to the CAPM) – see Ledoit and Wolf (2003a);
(b) a constant correlation covariance matrix – see Ledoit and Wolf (2003b);
(c) the identity matrix – see Ledoit and Wolf (2004).

In each case, the aim is to identify an asymptotically optimal convex linear combination of the sample covariance matrix with the shrinkage target. Under the assumption that m is fixed, while n tends to infinity, Ledoit and Wolf (2003a) prove that the optimal value of the shrinkage factor asymptotically behaves like a constant over n (up to higher-order terms). They term this constant κ and give formulae describing how to estimate κ.

Applying shrinkage to the covariance matrix really does affect the end answer. However, it turns out that the unravelling of sample mean shrinkage in an unconstrained world does have an analogy in the context of shrinkage of the covariance matrix. As we shall see in the next
section, the portfolios deemed efficient using shrinkage of the covariance matrix are, at least in some instances, the same as those that arise merely by including particular types of position limits within the original (i.e., not shrunk) problem.
6.6 BAYESIAN APPROACHES APPLIED TO POSITION SIZES

6.6.1 Introduction

Traditionally, application of Bayesian statistics has focused on the parameters that characterise the probability distribution applicable to future returns. DeMiguel et al. (2009b) adopt a different approach. They focus Bayesian theory directly on the outputs of the optimisation exercise, i.e., the position sizes, rather than on the model parameters that are the inputs to the end optimisation exercise. They show how this change of focus (when applied to the minimum-variance portfolio) includes, as special cases, several other previously developed approaches to mitigating sampling error in the covariance matrix.

So, rather than 'shrinking' the moments of asset returns, they apply 'shrinkage' to the end portfolio weights. In particular they require that a suitably chosen norm¹² applied to the portfolio weight vector should be smaller than a given threshold. They then show what this implies for a variety of different norms and threshold values. They show that this alternative way of applying Bayesian theory to the portfolio construction problem replicates several other approaches described earlier. In particular (using definitions for the norms that are elaborated upon in Section 6.6.2):

(a) Constraining the 1-norm (i.e., the sum of the absolute values of the weights in the portfolio) to be less than or equal to 1 is equivalent to imposing short-sale constraints on a portfolio, i.e., requiring it to be a long-only portfolio.
(b) Constraining the A-norm of the portfolio to be less than or equal to some specified threshold is equivalent to the shrinkage transformation proposed by Ledoit and Wolf (2004), if the threshold is suitably chosen.
(c) Constraining the squared 2-norm to be smaller than or equal to 1/m is equivalent to adopting an equally weighted, i.e., '1/N', portfolio as per Section 6.2.3.

DeMiguel et al. (2009b) also show how to generalise some of these approaches. For example, imposing the constraint that the 1-norm of the portfolio-weight vector is smaller than a certain threshold but then setting the threshold to be greater than unity results in a new class of 'shrinkage' portfolios in which the total amount of short-selling in the portfolio is limited, rather than there being no shorting allowed on any individual stock. These types of portfolio are also called 120/20 funds or 130/30 funds etc., depending on the size of the short book.¹³

12 A norm is a mathematical measure of the magnitude of something. More formally, a vector norm n(x), usually written \|x\| where x is the vector in question, is a function that exhibits the following properties: (i) positive scalability (a.k.a. positive homogeneity), i.e., \|ax\| = |a| \|x\| for any scalar a; (ii) sub-additivity (a.k.a. the triangle inequality), i.e., \|u + v\| ≤ \|u\| + \|v\|; and (iii) positive definiteness, i.e., \|u\| = 0 if and only if u is the zero vector. For example, \|x\|_2 ≡ (Σ_i x_i²)^{1/2} is a norm, as is \|x\|_* ≡ (x^T V x)^{1/2} = (Σ_{i,j} x_i V_{ij} x_j)^{1/2} where V is a positive definite symmetric matrix. There are several different ways of generalising such norms to matrices. Perhaps the most obvious are 'entrywise' norms that treat an m × n matrix as akin to a vector of size mn and then apply one of the familiar vector norms to the resulting vector. Examples are the 'entrywise' p-norms, calculated as \|A\|_p = (Σ_{i=1}^{m} Σ_{j=1}^{n} |A_{ij}|^p)^{1/p}. The Frobenius norm is the special case of such a norm with p = 2. For positive definite symmetric square matrices of the sort that one typically comes across in portfolio construction problems, the square of the Frobenius norm also happens to be the sum of the squares of the eigenvalues of the matrix.

13 A significant number of asset managers have launched 130/30 type funds, also called 'long extension' or 'relaxed long-only' funds, in which the fund is allowed to go short to some extent. The intrinsic rationale for doing so is that pure long-only funds cannot underweight stocks by more than the stock's weight in the fund's benchmark. For most stocks the extent of underweighting permitted is thus small. However, logically, asset managers, if they are skilful, should be able to identify stocks that are going to perform poorly as well as stocks that are going to do well. So, pure long-only portfolios appear to be unable to use nearly one-half of a manager's potential skill. The manager may also be less able to manage risks effectively with a pure long-only fund, because there may be times when a manager would ideally like to short some stocks to hedge long positions in other stocks. Some of the enthusiasm for 130/30 funds from fund managers arises because they share some similarities with hedge funds (e.g., potentially higher fees). Conversely, not all managers who appear to be good at managing long-only portfolios seem to be good at shorting, there are extra costs incurred when shorting, and there may also be extra risks as well (see Section 2.12).
DeMiguel et al. (2009b) also calibrate their approaches using historical data and conclude that they perform well out-of-sample compared with most other robust portfolio construction approaches against which they benchmark them. When doing so they use bootstrapping methods (which they consider 'are suitable when portfolio returns are not independently and identically distributed as a multivariate Normal'); see also Section 6.9.6.

6.6.2 The mathematics

Readers with a mathematical bias may be interested in exploring further (in Sections 6.6.2.1–6.6.2.4) the underlying mathematics used by DeMiguel et al. (2009b) to justify their results, given the (at first sight somewhat surprising) ability of this apparently rather different formulation to replicate many other 'robust' portfolio construction methodologies. Others with a less mathematical bias may want to skip straight to Section 6.7 to understand why Bayesian methods seem more universal than we might otherwise expect.

Principle P34: Many robust portfolio construction techniques can be re-expressed in terms of constraints applied to the form of the resulting portfolios (and vice versa).

6.6.2.1 Short-sale constraints

In the absence of short-sale constraints, the minimum-variance portfolio is the solution to the following optimisation problem:

\min_{w} \; w^T \hat{V} w \quad \text{s.t.} \quad w^T \mathbf{1} = 1 \qquad (6.21)
Here w ∈ R^m is the vector of portfolio weights (if there are m potential holdings), V̂ ∈ R^{m×m} is the estimated (sample) covariance matrix, 1 ∈ R^m is the vector of ones and the constraint w^T 1 = 1 ensures that the portfolio weights sum to one. We might call the solution to this problem w_MINU. The short-sale constrained minimum-variance portfolio, w_MINC, is the solution to the above problem but with the additional constraints that w ≥ 0 (which is shorthand for each element of w, i.e., each w_i, say, needing to be non-negative). Jagannathan and Ma (2003) show that the solution to the short-sale constrained problem is the same as the solution to the unconstrained problem if the (sample) covariance matrix V̂ is replaced by V̂_JM, where

\hat{V}_{JM} = \hat{V} - \lambda \mathbf{1}^T - \mathbf{1} \lambda^T \qquad (6.22)

Here λ ∈ R^m is the vector of Lagrange multipliers for the short-sale constraint. Because λ ≥ 0, the matrix V̂_JM may be interpreted as the sample covariance matrix after applying a shrinkage transformation.
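A minimal sketch of the unconstrained problem in Equation (6.21) is given below (Python/numpy, for illustration only): with just the budget constraint the solution is available in closed form as w = V̂⁻¹1 / (1ᵀV̂⁻¹1), and the 1-norm of the resulting weight vector indicates how much short-selling the unconstrained solution involves (anticipating the short-sale budget interpretation of Equation (6.23) below). The random test data is purely illustrative.

import numpy as np

def min_variance_unconstrained(V_hat):
    """Closed-form solution of Equation (6.21): min w'Vw subject to w'1 = 1."""
    m = V_hat.shape[0]
    ones = np.ones(m)
    x = np.linalg.solve(V_hat, ones)       # V^{-1} 1
    return x / (ones @ x)                  # normalise so the weights sum to one

# Illustrative use on a sample covariance matrix built from random data
rng = np.random.default_rng(0)
returns = rng.standard_normal((60, 5))
V_hat = np.cov(returns, rowvar=False)
w = min_variance_unconstrained(V_hat)
short_budget = (np.abs(w).sum() - 1) / 2   # total short position, cf. Equation (6.23)
print(w, w.sum(), short_budget)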
If we impose the requirement that the portfolio's 1-norm is less than a specified threshold, δ, i.e., \|w\|_1 ≡ Σ_{i=1}^{m} |w_i| ≤ δ, then, given that the portfolio weights must add up to 1, we must have δ ≥ 1 for there to be a feasible solution. The constraint can then be rewritten as follows, where S⁻ is the set of asset indices for which the corresponding weight is negative, i.e., S⁻ = {i : w_i < 0}:

-\sum_{i \in S^-} w_i \le \frac{\delta - 1}{2} \qquad (6.23)
The interpretation that may be given to this is that the left-hand side is the total proportion of wealth that is sold short, and (δ − 1)/2 is the total size of the short book, i.e., the size of the short-sale budget (which can be freely distributed among all the assets). If δ = 1 then the short book is of zero size, i.e., a no short-selling constraint is applied to each individual asset.

6.6.2.2 Shrinkage as per Ledoit and Wolf (2003a, 2003b, 2004)

The approach proposed by Ledoit and Wolf (2003a, 2003b, 2004) can be viewed as replacing the (sample) covariance matrix with a weighted average of it and a low-variance target estimator, V̂_target, using the following formula, i.e., applying the following shrinkage transform:

\hat{V}_{LW} = \frac{1}{1+\nu} \hat{V} + \frac{\nu}{1+\nu} \hat{V}_{target} \qquad (6.24)
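The sketch below (Python/numpy) applies the shrinkage transform of Equation (6.24) using a constant-correlation target in the spirit of Ledoit and Wolf (2003b). The shrinkage intensity ν is taken as a user-supplied input rather than estimated via their asymptotic formulae, so this illustrates the transform itself, not their full estimator.

import numpy as np

def constant_correlation_target(V_hat):
    """Build a constant-correlation shrinkage target from a sample covariance matrix."""
    sd = np.sqrt(np.diag(V_hat))
    corr = V_hat / np.outer(sd, sd)
    m = V_hat.shape[0]
    mean_corr = (corr.sum() - m) / (m * (m - 1))   # average off-diagonal correlation
    target = mean_corr * np.outer(sd, sd)
    np.fill_diagonal(target, sd ** 2)              # keep the sample variances on the diagonal
    return target

def shrink_covariance(V_hat, nu):
    """Equation (6.24): convex combination of the sample matrix and the target."""
    target = constant_correlation_target(V_hat)
    return V_hat / (1 + nu) + nu * target / (1 + nu)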
Ledoit and Wolf (2003a, 2003b, 2004) also show how to estimate the value of ν that minimises the expected Frobenius norm of the difference between V̂_LW and the supposed true covariance matrix, V.

If we impose the requirement that the 'A-norm' is less than some specified threshold, δ̂, i.e., \|w\|_A ≡ (w^T A w)^{1/2} ≤ δ̂, then w^T A w ≤ δ ≡ δ̂², say. DeMiguel et al. (2009b) show that, provided V̂ is non-singular, there exists a δ such that the solution to the minimum-variance problem with the sample covariance matrix V̂ replaced by V̂_LW = (1/(1 + ν)) V̂ + (ν/(1 + ν)) A coincides with the solution to the A-norm constrained case. For example, if we choose A to be the identity matrix then there is a one-to-one correspondence with the shrinkage approach described in Ledoit and Wolf (2004).

6.6.2.3 The equally weighted portfolio

In the case where A is the identity matrix, the A-norm becomes the 2-norm. Suppose we constrain the square of the 2-norm to be less than or equal to δ. Since the weights sum to one, Σ_i (w_i − 1/m)² = Σ_i w_i² − 2/m + 1/m = Σ_i w_i² − 1/m, and so we note that

\sum_{i=1}^{m} w_i^2 \le \delta \;\Rightarrow\; \sum_{i=1}^{m}\left(w_i - \frac{1}{m}\right)^2 \le \delta - \frac{1}{m} \qquad (6.25)
Thus, imposing a constraint on the 2-norm is equivalent to imposing a constraint that the square of the 2-norm of the difference between the portfolio and an equally weighted (i.e.,
1/N) portfolio is bounded by δ − 1/m. The equally weighted portfolio is thus replicated if δ = 1/m.

6.6.2.4 A Bayesian interpretation

DeMiguel et al. (2009b) also give a more explicitly Bayesian interpretation for 1-norm and A-norm constrained portfolios. They show that the 1-norm constrained portfolio is the mode of the posterior distribution of portfolio weights for an investor whose prior belief is that the portfolio weights are independently and identically distributed as a double exponential distribution. The corresponding result for the A-norm constrained portfolio is that it is the mode of the posterior distribution of portfolio weights for an investor whose prior belief is that the portfolio weights w_i have a multivariate Normal distribution with covariance matrix A.
6.7 THE 'UNIVERSALITY' OF BAYESIAN APPROACHES

At the start of Section 6.6.2 we suggested that less mathematically-minded readers might wish to skip straight to this point, where we explain the apparently impressive ability to re-express other robust portfolio construction methodologies in what is in effect a Bayesian framework, even when the methodology appeared to be relatively heuristic in nature.

Consider again Bayes' theorem, i.e., p(θ|r) ∝ p(r|θ)p(θ). Suppose we re-express this relationship as follows:

p(\theta) \propto \frac{p(\theta \mid r)}{p(r \mid \theta)} \qquad (6.26)
We note that p(r|θ) is fixed by the form of the problem. For any given distributional form, the probability that we will observe a given dataset r given an assumption set θ is always set in advance. However, what this re-expression also highlights is that we can always arrange for p(θ|r) to take essentially any form we like merely by choosing the p(θ) to accord with the above equation. In short, any posterior distribution is possible (within reason) if we allow ourselves sufficient flexibility in our choice of prior distribution.

Thus the ability of Bayesian-style approaches to reproduce any other type of methodology should not by itself be a surprise to us. Instead, what is perhaps more surprising is the relatively simple ways in which this can often be achieved. Some of this is due to the particularly simple form of the log-density of the multivariate Normal distribution, which is merely a quadratic form:

\log(pdf) = C - \frac{1}{2}(x - \mu)^T V^{-1} (x - \mu) \qquad (6.27)

where

C = -\log\left((2\pi)^{m/2} |V|^{1/2}\right) \qquad (6.28)
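As a quick check of Equations (6.27) and (6.28), the sketch below (Python/numpy) evaluates this log-density directly; it is for illustration only and makes no attempt at the numerical refinements (e.g., Cholesky-based evaluation) that a production implementation would normally use.

import numpy as np

def mvn_logpdf(x, mu, V):
    """Log-density of an m-dimensional multivariate Normal, per Equations (6.27)-(6.28)."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(V, diff)                            # (x-mu)' V^{-1} (x-mu)
    C = -np.log((2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(V)))   # Equation (6.28)
    return C - 0.5 * quad                                             # Equation (6.27)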
For example, we can view Black-Litterman as equivalent to a Bayesian approach in which we give partial weight to our own assumptions and partial weight to assumptions that correspond
to ones that make the market efficient. This is because we can view regression and other building blocks implicit in the Black-Litterman methodology in a Bayesian light; see, e.g., Wright (2003a). In this case, the exponential form of the Normal probability density function naturally leads to results that are also Normal in form and hence directly match the usual Black-Litterman formulation. In this context, Black-Litterman can be viewed as involving a model that includes a single risk factor. Cheung (2007b) explains how to adapt the theory to situations where the model involves multiple factors, as per Chapter 4.
6.8 MARKET CONSISTENT PORTFOLIO CONSTRUCTION

One possible source of a Bayesian prior is the information contained in derivative prices, and in particular options markets. These provide us with information about market implied volatilities (and in some cases market implied values for other parameters on which derivative prices may depend). If options markets were sufficiently deep and liquid and a sufficiently wide range of option types were actively traded, we could in principle derive the entire covariance matrix (indeed the entire distributional form) that might be applicable to future market behaviour. As explained in Kemp (2009), such an approach would in principle circumvent the inherent limitations on knowledge about the smaller principal components that arise with historic datasets.

The idea of using market implied data seems to be one that generates mixed opinions. Some leading risk system vendors do use some market implied data within their risk systems, often because these data items are believed to be better predictors of (short-term) future market behaviour than estimates derived from the past; see, e.g., Malz (2001). Market implied information is also a common input into the types of economic scenario generators (ESGs) that are used, for example, to place market consistent valuations on more complicated types of insurance product; see, e.g., Varnell (2009). ESGs in this context play a similar role to (or can be deemed to be just another name given to) the sorts of Monte Carlo pricing techniques that are used to price many more complicated types of derivative.¹⁴

14 Equivalent types of Monte Carlo simulations are also used by other types of financial services entities such as banks, but are not then usually (at present) called economic scenario generators. Outside the insurance industry, the term 'economic scenario' would more usually currently be used to refer to a one-off scenario, perhaps in the form of a stress test, of how the future economy might develop (see, e.g., Chapter 8).

Use of market implied data, however, seems less common when it comes to portfolio construction. Varnell (2009) suggests that commonly for these types of purposes ESGs are instead calibrated to 'real world' models of the expected future; i.e., to models that incorporate the firm's own views about how the future might evolve. The assumptions will therefore typically be formulated in ways similar to those described in Chapter 5 or in earlier sections of this chapter but tailored to the particular needs of the firm in question.

Kemp (2009) adopts a slightly different perspective and accepts that a 'completely' market consistent stance does not add a lot of value within the portfolio construction problem per se (because there is then no justification for taking any investment stance). However, he disagrees with the thesis that this makes market consistent portfolio construction irrelevant. Instead, he argues that a more refined stance is needed. The arguments put forward in Kemp (2009) for this stance are similar to the ones from Cochrane (1999) that we have already discussed in Section 5.2.4. Kemp argues that it is important to understand the market implied view so that we then know how our own views differ from
it and hence how our portfolio should differ from one that follows the market implied view, which we might presume (ignoring subtleties regarding different minimum risk portfolios etc.) corresponds to the market itself. He also picks up on the point that there should be a further tilt reflecting how the types of risk to which we are exposed, and the utility function to which we are subject, differ from those of the generality of other market participants. Put like this, we see that market consistent portfolio construction has strong affinities with the Black-Litterman approach. The BL approach also includes 'market implied views', by adopting the stance that the market is 'right' if we do not have views that differ from the consensus.
6.9 RESAMPLED MEAN-VARIANCE PORTFOLIO OPTIMISATION

6.9.1 Introduction

In earlier sections of this chapter we concentrated on robust portfolio construction methodologies that have a strong Bayesian flavour about them. Nearly all of them included, to a greater or lesser extent, a blending of input assumptions derived from sample data with predefined views about how the world, and hence portfolio construction, 'should' operate.

Although invoking such a priori reasoning does seem to be the most common way of mitigating the problem of sensitivity to the input assumptions, it is not the only methodology that has been proposed to tackle this problem. One might argue that the antithesis of a Bayesian statistician is a 'frequentist' who is determined to 'allow the data to say it all'. Statisticians of this hue typically focus on resampling or bootstrapping¹⁵ techniques. Such techniques can also be applied to the portfolio construction problem, in which case they typically go under the name of resampled efficiency (RE); see, e.g., Scherer (2002, 2007) or Michaud (1998).

15 The terminology relates to the derivation of information about the spread of outcomes merely from consideration of the spread within the original dataset. This use of the term 'bootstrapping' is not to be confused with the technique of the same name used to calculate deterministically a yield curve from the prices of a series of bonds of sequentially increasing terms.

To identify resampled efficient portfolios, we:

1. Choose a (suitably large) number of random input assumption sets, all ultimately derived merely from information available within the original dataset. By an 'assumption set' we mean (if we are focusing on mean-variance optimisation) a set of choices for mean returns for each asset category and the associated covariance matrix.
2. Carry out a portfolio optimisation exercise for each set of input assumptions chosen in step 1.
3. Deem portfolios that are efficient to be ones that, in some suitable sense, correspond to the average of the efficient portfolios resulting from the different portfolio optimisation exercises carried out in step 2.

As noted earlier, it had originally been my intention to cover resampled optimisation techniques in a different chapter to the one covering Bayesian techniques, given their apparently quite different focus. However, RE is still a tool that aims to mitigate sampling error and so naturally fits alongside other tools with similar aims. More importantly, it turns out to have characteristics quite like shrinkage. The impact of using resampling techniques on which
portfolios are deemed efficient turns out to be similar to the impact of imposing individual position limits on the portfolio. Therefore, it too can be reformulated as a type of Bayesian approach akin to others set out in Section 6.6 even though its derivation looks quite different. This broad equivalence with other simpler techniques has led some commentators to view RE techniques as largely a waste of effort. As we have already seen in Section 6.7, essentially any variant of basic portfolio construction techniques can be viewed as involving Bayesian techniques.
6.9.2 Different types of resampling There are several ways that RE portfolio optimisation can be carried out. They differ mainly in relation to step 1 of Section 6.9.1. Suppose, for example, that the original dataset included n time periods. We might then resample the data without replacement as if there were only q < n time periods available to us, by randomly choosing q of the n actual time periods we have in question. Or we might resample with replacement, choosing a random draw of, typically, n of the time periods (some of which will typically be duplicates). Either approach is called bootstrapping. Alternatively, we might, say, use all the time periods in the original dataset to derive a best-fit ‘observed’ multivariate Normal probability distribution with the same means and covariances as the original dataset. To define each input assumption set to use in step 2, we might draw n (multivariate) observations at random from this ‘observed’ Normal probability distribution, and we might use for the input assumption set the observed means and covariance matrix of these n random draws. This is a variant of Monte Carlo simulation. In either case we use data that is coincident in time for any particular draw that contributes to the resampled dataset. So, if the bootstrapped resampled dataset includes as its kth draw the return on asset i during time period t, the kth draw will also include as its returns on all other assets their returns during the same time period t. Otherwise the bootstrapped correlations (and covariances) will make little sense in a portfolio construction context.
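The sketch below (Python/numpy) illustrates the two bootstrap variants just described; note that resampling is applied to whole rows (time periods) of the return matrix, so that the cross-sectional alignment of asset returns, and hence the bootstrapped correlations, is preserved. The function names and defaults are illustrative choices, not something prescribed in the text.

import numpy as np

def bootstrap_returns(returns, q=None, replace=True, rng=None):
    """Resample whole time periods (rows) from an (n, m) return matrix.

    replace=True  : draw n (or q) periods with replacement ('bootstrapping').
    replace=False : draw q < n distinct periods without replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = returns.shape[0]
    size = n if q is None else q
    idx = rng.choice(n, size=size, replace=replace)
    sample = returns[idx, :]            # rows kept intact => correlations make sense
    return sample.mean(axis=0), np.cov(sample, rowvar=False)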
6.9.3 Monte Carlo resampling

Let us first focus on the Monte Carlo variant described above. Suppose, for example, that our dataset contains n time periods. The methodology operates as follows:

1. We start with some data from which we derive estimates of the mean and covariance matrix of the distribution, say µ̂ and V̂ respectively (we will for the moment assume that the distribution is multivariate Normal).
2. We know that µ̂ and V̂ are themselves subject to sampling error. There is some 'true' underlying distribution from which they are coming, but µ̂ and V̂ will not exactly match its distributional characteristics because we only have n observations on which to base our estimates. We estimate how large the sampling error might be by preparing a large number of simulations (say K of them), each of which involves n independent random draws from an N(µ̂, V̂) distribution. For, say, the kth simulation we calculate the estimated mean and covariance matrix, µ̂_k and V̂_k respectively, based merely on the n random draws that form that particular simulation. The spread of the µ̂_k and V̂_k as k
varies should then give us an idea of how much sampling error there might have been in the original µ̂ and V̂.
3. We are not primarily interested in quantifying the magnitude of the sampling error that applies to the distributional parameters. Instead, what we want to understand is how sampling error affects which portfolios we might deem to be efficient (and what risk-reward trade-off we might expect from them). So for each of the K simulations we compute which portfolios would have been considered efficient based purely on the data available to us in that simulation (i.e., the n observations that have been used to construct the simulation). For the kth simulation the efficient portfolio set can be characterised by a series of portfolios a_{k,j} in a variety of ways. For example, we could choose them to be equally spaced between lowest and highest return, equally spaced from lowest to highest risk or indexed by a constant risk-reward trade-off factor, λ, as per Section 5.3. This last choice is perhaps the most natural one from a mathematical perspective but not necessarily the one that gives the most intuitive answer.
4. The final step is to come up with a resampled efficient portfolio. This involves averaging the a_{k,j} across all the K simulations, i.e., using the following formula:

a_{\text{resampled},j} = \frac{1}{K} \sum_{k} a_{k,j} \qquad (6.29)
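The sketch below (Python/numpy) puts steps 1–4 together for the unconstrained case, where the λ-indexed optimum has a closed form. It is purely illustrative: it assumes the trade-off convention maximise w'µ − ½λ w'Vw subject to w'1 = 1 (the λ convention of Section 5.3 may be scaled differently), and it ignores the no short-sale constraint used in the worked example that follows, which would require a constrained optimiser.

import numpy as np

def mv_optimum(mu, V, lam):
    """Unconstrained mean-variance optimum: max w'mu - 0.5*lam*w'Vw s.t. w'1 = 1."""
    ones = np.ones(len(mu))
    Vinv_mu = np.linalg.solve(V, mu)
    Vinv_1 = np.linalg.solve(V, ones)
    gamma = (ones @ Vinv_mu - lam) / (ones @ Vinv_1)   # Lagrange multiplier on the budget
    return (Vinv_mu - gamma * Vinv_1) / lam

def resampled_efficient(returns, lambdas, K=1000, rng=None):
    """Steps 1-4: average the lambda-indexed optima across K resampled parameter sets."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = returns.shape
    mu_hat = returns.mean(axis=0)
    V_hat = np.cov(returns, rowvar=False)
    totals = np.zeros((len(lambdas), m))
    for _ in range(K):
        sim = rng.multivariate_normal(mu_hat, V_hat, size=n)   # n draws from N(mu_hat, V_hat)
        mu_k = sim.mean(axis=0)
        V_k = np.cov(sim, rowvar=False)
        for j, lam in enumerate(lambdas):
            totals[j] += mv_optimum(mu_k, V_k, lam)
    return totals / K                                          # Equation (6.29)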
The results of an analysis of this type are illustrated in Figure 6.2 and Table 6.2, based on some illustrative sample means and variances as set out in Table 6.1, assumed to come from return series containing 60 observations. The minimum risk portfolio (MRP) is assumed to be 100% invested in Asset A, and a no short-sale constraint is applied. We have used 1000 simulations in the RE exercise. Twenty different lambdas have been used to define the efficient frontier (λ = 0.001, 0.101, 0.201, . . .).
[Figure 6.2 Illustrative efficient frontier using assumptions set out in Table 6.1. The chart plots return (pa) against risk versus MRP (pa) for the efficient frontier, the individual assets A–E and the resampled efficient portfolios. © Nematrian. Reproduced by permission of Nematrian. Source: Nematrian]
Table 6.1 Assumptions used to illustrate resampled efficiency

                                                          Correlations
Asset   Means (% pa)   Standard deviations (% pa)     A       B       C       D       E
A       4.0            2                              1.00
B       5.0            5                              0.39    1.00
C       6.0            9                             –0.61   –0.53    1.00
D       7.0            14                            –0.11   –0.38    0.19    1.00
E       7.5            14                            –0.47   –0.48    0.61    0.31    1.00

Source: Nematrian
In Figure 6.2 we show the efficient frontier for the original (not resampled) assumptions as well as risk and return characteristics of corresponding resampled efficient portfolios. Risks and returns have been calculated as if we knew that the 'true' distributional characteristics were the ones set out in Table 6.1, and so the resampled efficient portfolios lie to the right of the 'true' efficient portfolios. In Table 6.2 we show the portfolio mixes corresponding to the eighth point along this plot (i.e., λ = 0.701). We show the mixes corresponding to both the 'true' efficient portfolio and the corresponding resampled efficient portfolio (i.e., the average of the simulated efficient portfolio mixes for that particular λ). We also give an indication of the range of portfolios that were deemed optimal in any given simulation (using the 5th, 25th, 50th, 75th and 95th percentiles and the maxima and minima of the ranges exhibited by the weight of any given asset class for the chosen value of λ).

We see that a very wide range of portfolios were deemed efficient in the simulations. For all asset classes other than A there was at least one simulation that involved the efficient portfolio holding nothing in that asset class and another that involved the whole of the efficient portfolio being invested in it! Asset class A had such a low assumed return that none of the 1000 simulations included it in their efficient portfolio. There is also a noticeable difference between the expected return using the original assumptions and the expected return on the resampled efficient portfolio. This highlights the point noted in step 4 (above) that choice of how to average across simulations is not as straightforward as might appear at first sight.

Table 6.2 Portfolios deemed efficient versus those deemed 'resampled' efficient (MRP used here is 100% Asset A)

                                                      Asset mix (%)
                        Mean    Risk versus MRP     A      B      C      D      E
Efficient               6.30    6.50                0      43     0      25     32
'Resampled' efficient   7.03    10.87               0      7      7      36     49
Min                                                 0      0      0      0      0
5%ile                                               0      0      0      0      0
25%ile                                              0      0      0      0      0
50%ile                                              0      0      0      0      50
75%ile                                              0      0      0      100    100
95%ile                                              0      87     97     100    100
Max                                                 0      100    100    100    100

Source: Nematrian
6.9.4 Monte Carlo resampled optimisation without portfolio constraints Various commentators have tested the benefits that resampling might bring using out-of-sample backtests; see, e.g., Scherer (2002, 2007). The general consensus seems to be that they appear to produce more robust portfolios than would arise by merely applying basic (unadjusted) mean-variance optimisation. There is not as much consensus on whether the backtest properties of resampled approaches are better or worse than some of the other robust portfolio construction methodologies that we have considered earlier in this chapter. This is partly because it is not obvious how we should interpret the terms ‘better’ or ‘worse’ in this context; see, e.g., Scherer (2002, 2007).16 It primarily arises, however, from the intrinsic mathematical nature of RE, which means that resampled optimisation does not differ as much as we might first expect from several of the alternatives we have explored already in this chapter. This is particularly true when the problem has no portfolio constraints (other than that the weights should add to unity) and the underlying data is coming from a multivariate Normal distribution. The results of a large sample Monte Carlo simulation will then, according to Section 6.2.5, tend to the analytical solution. We already know the analytical solution, however, because it is the distributional form that we identified as part of the process of identifying how to quantify estimation error in Section 6.2.4. This means that, in the absence of portfolio constraints, a (mean-variance) Monte Carlo resampling approach as described above merely comes up with the same efficient portfolios as we would have derived without carrying out any resampling; see, e.g., Scherer (2007). This assumes that we adopt an appropriate way of choosing where along the efficient frontier to pick the portfolios we include in the end averaging step of the resampling algorithm. In Scherer (2007) this involves averaging portfolios with the same λ. Thus in this situation resampled efficiency and Markowitz mean-variance efficiency coincide. If we plot the different resampled portfolios for a common λ in a hyper-space with the same number of dimensions as there are assets then the individual simulated resampled portfolios will be spread out through the space. In the limit (as the number of resamplings tends to infinity) the probability distribution describing this spread will take on the same shape as the distributional form described in Section 6.2.4. This equivalence ties in with other properties we might also expect RE to exhibit; see, e.g., Scherer (2007). He notes that RE does not actually add extra sample information. Indeed he notes that it uses resampling to derive better estimates which is somewhat in contrast to the statistical literature, where resampling is used foremost to come up with confidence bands. Also note that resampling does not address estimation error in the first place as it effectively suffers from estimation error heritage. Repeatedly drawing from data that are measured with error, will simply transfer this error to the resampled data.
16 This point is similar to the one made in Section 5.9.3, where we highlighted that a risk model that exhibits good out-of-sample backtest features could just be one with a particularly flexible structure that has then been well calibrated to the past. Equally, with Bayesian robust portfolio construction methods, we have flexibility in the choice of distributional form for the Bayesian prior. If we are considering enough variants then some of them should be very good fits to the past data, particularly if they include many input parameters.

One way that some commentators suggest for circumventing these inherent weaknesses in the RE approach is to combine it with a Bayesian approach. This involves altering the
probability distribution from which the resampling data is drawn in a way that blends together the original observed data with prior information as supplied by the practitioner. In the extreme case where we give 100% weight to our Bayesian prior and 0% to the original observed data, the problem reverts to one in which resampling is not present. However, in intermediate cases where some resampling is still involved, the same sorts of issues as discussed above still apply, just to a lesser extent depending on the weight given to the original observed data.

6.9.5 Monte Carlo resampled optimisation with portfolio constraints

In the presence of portfolio constraints, the position is somewhat different. The optimal portfolios derived in each individual run within the resampling exercise will still be spread out in the same hyper-space referred to above. But now, any individual resampled efficient portfolio that would otherwise have fallen outside the portfolio constraints is shifted to the 'nearest' point on the boundary formed by the constraints (if 'nearest' is defined suitably). Individual portfolio mixes that fall within this boundary are not shifted. The final averaging step in the RE algorithm then moves the RE portfolio to strictly inside the feasible set (i.e., the set of all portfolios that do satisfy the constraints) rather than merely to just edges or corners of its boundary.¹⁷ It is as if we have again imposed individual position 'constraints', but now they are being implemented in a more 'smoothed' fashion, resulting in us positioning the portfolio some way strictly within the constraints, rather than exactly where they bite. The extent of this 'pull' towards the middle of the feasible set rather than just to its edge depends in practice on:

(a) how much the relevant constraint typically bites (if individual RE portfolios are always well outside the edge of the feasible set then few if any of them stay unmoved by this 'pull to the middle'); and
(b) how reliable the sample data is. As the sample size increases and presumably therefore becomes more reliable, the fuzziness of the effect diminishes and the problem becomes more and more like a traditional constrained portfolio optimisation.

This analogy with the imposition of a generalised type of position limits also means that we can view resampled efficiency in a Bayesian light, because it is akin to the approaches described in Section 6.6 (and through them to more classical Bayesian approaches). The main practical difference is the smoother nature of the 'constraints', which we can think of as involving an increasing penalty function as we approach the constraint boundary (with the penalty function tending to infinity close to the boundary).

Is such smoothing intrinsically desirable? Some practitioners seem to think so. Suppose an optimal portfolio falls on what could otherwise be viewed as a somewhat arbitrary constraint boundary. It might then distort (and hence possibly make more difficult) subsequent rebalancings. We might also expect RE portfolios to be more diversified than basic (unadjusted mean-variance) portfolios, because the 'pull' to centre will typically pull all positions subject to a no short-selling constraint into strictly positive territory.

17 We make the assumption here that it is possible to satisfy all the constraints simultaneously and that the resulting feasible set of points that do satisfy all the constraints is convex.
By a convex set we mean that if points a and b satisfy the constraints and thus lie within the feasible set (here deemed also to include points on its boundary) then so does any point along the line joining them and strictly between them, i.e., any point (1 − p) a + pb where 0 ≤ p ≤ 1.
Minsky and Thapar (2009) appear to be two practitioners who prefer smoothed constraints. They describe a selection exercise that they carried out for a fund of hedge funds portfolio optimiser. One attribute they penalised in this selection was a high incidence of solutions that fell on a constraint boundary or at a constraint corner, behaviour that should be less common or excluded by resampled optimisation.

6.9.6 Bootstrapped resampled efficiency

The analysis of the behaviour of bootstrapped resampled efficiency is slightly more complicated than the Monte Carlo approach analysed above. This is because the bootstrapping element of the process now accords more closely with how bootstrapping and resampling are typically used in the statistical literature, i.e., as means of deriving confidence bands. If the data is adequately modelled by a multivariate Normal distribution then presumably the spread of outcomes (i.e., portfolio mixes) arising from the individual resamplings should be similar to those arising with the Monte Carlo variants described above. All the conclusions drawn in Sections 6.9.4 and 6.9.5 should therefore flow through unaltered. If the data is not adequately modelled by a multivariate Normal distribution, however, the confidence bounds (and presumably therefore also the end averaged RE portfolios) may diverge from the traditional mean-variance solution even in the absence of portfolio constraints. It can be argued that in such circumstances the simulations in the Monte Carlo variant (if in each resampling they are being assumed to come from a multivariate Normal) are being drawn from an inappropriate distribution. If instead we chose a distributional form that adequately reflected the data then we should again see consistency between these two resampling techniques.

6.9.7 What happens to the smaller principal components?

Another point worth highlighting is the way in which resampling handles the smaller principal components. We saw in Section 4.3 that there is not enough information to tell us anything about the smallest principal components, if the number of time periods is fewer than the number of assets. Even where there is information present it may not be reliable. This lack of reliable information is replicated in the estimated covariance matrix used in the Monte Carlo resampled approach. It is also replicated in the bootstrapped approach, because it too relies solely on observations that have actually occurred (rather than on ones that might have occurred). RE does not provide a means to circumvent this type of estimation error. Instead, it inherits this error from the intrinsic limitations applicable to the original dataset.
6.10 THE PRACTITIONER PERSPECTIVE Robust portfolio optimisation introduces several additional trade-offs as far as practitioners are concerned. Take, for example, a fund manager. Robust portfolio construction equivalent to the imposition of position limits can seem arbitrary and liable to fetter the expression of investment flair. This is particularly so if the manager is a traditional portfolio manager looking enviously at the additional freedoms (and remuneration) typically present in the hedge fund space, where such restrictions are less prevalent. Robust portfolio construction as per Black-Litterman is arguably less ‘arbitrary’ in this respect. However, it involves blending market implied assumptions with
one's own views using credibility weighting as per Bayesian theory. So, in a sense it involves a deliberate depletion in the magnitude of expression of the manager's own views and so might also be viewed by managers as liable to fetter the expression of investment flair. Robust portfolio construction techniques therefore require careful explanation, justification and implementation if they are not to run up against manager resistance. Perhaps extending the idea to include opinion pooling as per Section 6.4.4 is a promising route to follow. Alternatively, robust portfolio construction techniques can be built into the product definition from the outset and then used for marketing advantage. Even sceptical fund managers can be won over if the process results in greater sales.

From the client's (i.e., the end investor's) perspective, robust portfolio construction seems to have more direct appeal. It ought to limit the scope for the manager to impose wayward views that prove with the benefit of hindsight to be unconstructive. Yet the problem with this stance is that this argument is about risk appetite (rather than about adequate recompense for the risk being taken). Fund management houses operate as businesses, and so presumably managers will position their business models so that ultimately the fees that they charge are commensurate with the value-added that they might be expected to deliver. So, deplete their ability to express investment ideas and you will also deplete what you might expect to get out of them. This perhaps explains the tendency for clients to implement robust portfolio construction techniques themselves rather than arranging for their investment managers to do so. For example, consider a manager structure involving investment in a passive (i.e., index tracking) core portfolio alongside a more actively managed satellite portfolio (or alongside a long-short portfolio, as in alpha-beta separation; see Section 5.4.4). The positions being expressed by such a structure can be formally re-expressed in mathematical terms as akin to those that we would adopt using a Black-Litterman approach in which we credibility weight the active manager's views alongside views implicit in the market portfolio (the credibility weights being consistent with how large the satellite portfolio is relative to the core portfolio).

Where robust portfolio construction does seem to have clearer justification, particularly as far as the fund manager is concerned, is when our Bayesian priors might themselves add value, especially if we define 'added value' broadly enough. For example, Bayesian priors relating to investment position limits might express softer investment facets that allow clients to sleep more easily at night (even if they are at best neutral when it comes to more tangible added value). The comfort that these priors can engender itself has a utility to the client, and is therefore not to be ignored. Ultimately, fund managers understand that they operate in a service industry, and will happily contribute to such utility maximising gestures as long as it keeps the client happy. The ideal is for our Bayesian priors also to have real investment merit as well as contributing to these softer issues. In this case, however, they perhaps involve investment views masquerading as something else.
The ubiquity of Bayesian priors that we will see repeatedly throughout this book suggests that some introspection about our own inbuilt biases and mindsets is in order, to try to ensure that the a priori views we adopt have as much investment merit as possible in their own right. Principle P35: Use of robust portfolio construction techniques may deplete the ability of a talented manager to add value. Clients may wish to implement them at the manager structure level rather than within individual portfolios.
6.11 IMPLEMENTATION CHALLENGES

6.11.1 Introduction

The main additional implementation challenge arising from issues discussed in this chapter is the slow speed of convergence of Monte Carlo simulations. Monte Carlo methods will also prove relevant in later chapters and so we explore this topic in somewhat more depth here than would be justified merely by the extent to which Monte Carlo approaches are relevant to this chapter in isolation.

The problem is that each simulation typically involves a separate portfolio optimisation exercise, and so can be quite time consuming in its own right. The accuracy of basic Monte Carlo simulation scales according to 1/√n, where n is the number of simulations; see Equation (6.31) below. So, to double the accuracy, i.e., halve the error, we need to quadruple the number of simulations we carry out. For example, in a resampled efficiency exercise, we have to carry out a different optimisation exercise for each simulation. Each individual optimisation exercise can itself take a relatively long time (especially if a large number of assets and/or constraints are involved); see Section 5.13. This makes it highly desirable to keep the number of simulations and hence the number of times that we need to repeat the optimisation exercise involved to a minimum.

Press et al. (2007) discuss in some detail the basic characteristics of Monte Carlo simulation, including its convergence properties. They also describe some ways in which these convergence properties can be improved upon. We summarise some of these techniques below, highlighting those which seem most relevant to applications discussed in this chapter or later in this book. The approaches we consider are as follows:

(a) changing variables and importance sampling;
(b) stratified sampling;
(c) quasi- (that is sub-) random sequences;
(d) weighted Monte Carlo.
Press et al. (2007) focus on the use of Monte Carlo techniques to carry out numerical integration. This may at first sight seem to be of limited relevance here. However, numerical integration covers most of the uses of Monte Carlo in the financial sphere even if it is not how the problem is usually specified by financial practitioners. For example, finding the location of the mean of a probability density function falls into this category because it involves estimating ∫ x p(x) dx. So too does calculating resampled efficient portfolios, since they are merely the average (i.e., mean) of a function of the probability distribution. The function in question is the mapping that takes as its input the probability density function and produces as its output the relevant RE portfolio(s). Monte Carlo simulation of risk statistics such as VaR or TVaR can also be substantially speeded up using some of the techniques we discuss below; see, e.g., Klöppel, Reda and Schachermayer (2009).

In what follows, we use the following terminology. We assume that we have a function f defined on a volume V in some multi-dimensional space. Let ⟨f⟩ correspond to the true average of the function over the volume V, while f̄ corresponds to a (uniformly) sampled Monte Carlo estimator of that average, i.e. (where the x_i are the sample points used in the Monte Carlo simulation):

\langle f \rangle = \frac{1}{V} \int f \, dV \qquad \bar{f} = \frac{1}{n} \sum_{i} f(x_i) \qquad (6.30)
The basic theorem of Monte Carlo integration is that we can estimate ⟨f⟩ as follows, where the 'plus-or-minus' term is a one standard deviation error estimate for the integral:¹⁸

\langle f \rangle \cong \bar{f} \pm \sqrt{\frac{\overline{f^2} - \bar{f}^2}{n}} \qquad (6.31)
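The sketch below (Python/numpy) implements the estimator and error term of Equations (6.30) and (6.31) for a simple one-dimensional integrand; it is illustrative only, but running it for increasing n shows the 1/√n convergence referred to above.

import numpy as np

def mc_average(f, n, dim=1, rng=None):
    """Uniform Monte Carlo estimate of <f> over the unit hypercube, with 1-sd error."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.random((n, dim))
    vals = f(x)
    f_bar = vals.mean()                                    # Equation (6.30)
    err = np.sqrt((np.mean(vals ** 2) - f_bar ** 2) / n)   # Equation (6.31)
    return f_bar, err

# Example: <x^2> over [0,1] is 1/3; quadrupling n roughly halves the error estimate
for n in (1_000, 4_000, 16_000):
    print(n, mc_average(lambda x: x[:, 0] ** 2, n))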
6.11.2 Monte Carlo simulation – changing variables and importance sampling

Suppose, for example, that we are trying to calculate the average, using Monte Carlo simulation, of the function f(x) where x = (x₁, . . . , x_m)^T, and each x_j ranges, say, from 0 to 1. Suppose also that f(x) depends very strongly on the value taken by x_m but (relatively speaking) hardly at all on the values taken by all the remaining x_j. For example, it might be f(x) = (x₁ + · · · + x_{m−1}) e^{10x_m}. Then, speaking approximately, parts of the function where x_m ≈ 1 contribute circa 20,000 times as much (for the same unit volume of x) as parts of the function where x_m ≈ 0, because e^{10}/e^0 ≈ 22,026. In other words, if we randomly choose points scattered throughout the hypercube formed by the join of x_i ∈ [0, 1] then most of the sampled points will contribute almost nothing to the weighted average, because most have very little weight compared to the most important contributors. Far better, therefore, is to apply a change of variables. In this situation, the ideal is to change x_m to s, say, where

ds = e^{10x_m} \, dx_m \;\Rightarrow\; s = \frac{1}{10} e^{10x_m} \;\Rightarrow\; x_m = \frac{1}{10} \log(10s) \qquad (6.32)
Instead of choosing values of x_m randomly between 0 and 1 we now choose values of s randomly between 0.1 and 2202.6. Since e^{10×0.9}/10 ≈ 810.3, we note that roughly two-thirds of the points sampled will now come from within the much smaller original region x_m ∈ [0.9, 1], which happens also to be where f(x) is largest.

More generally, we can use the concept of importance sampling. In the change of variables used above we tried to express the integral as follows, where h = f/g was approximately constant and we changed variables to G, the indefinite integral of g, because

\int f \, dV = \int \frac{f}{g} \, g \, dV = \int h g \, dV \qquad (6.33)
This makes g dV a perfect differential, making the computation particularly straightforward. More directly, we can go back and generalise the basic theorem in Equation (6.31). Suppose that points x_i are chosen within the volume V with a probability density p satisfying ∫ p dV = 1 rather than being uniformly chosen. Then we find that we can estimate the integral of any function f using n sample points x₁, . . . , x_n using

\int f \, dV = \int \frac{f}{p} \, p \, dV \approx \overline{\left(\frac{f}{p}\right)} \pm \sqrt{\frac{\overline{f^2/p^2} - \overline{f/p}^{\,2}}{n}} \qquad (6.34)
18 The error estimate is an approximation, not a rigorous bound. Moreover, there is no guarantee that the error estimate is distributed as Gaussian, i.e., multivariate Normal, and so the error term in this equation should be taken only as a rough indication of probable error.
The 'best' choice of p, i.e., the one that minimises the error term, can be shown to be as per Equation (6.35). If f is uniformly of the same sign then this corresponds, as we might intuitively expect from the analysis underlying Equations (6.32) and (6.33), to one that makes f/p as close to constant as possible. In the special case where we can make f/p exactly constant then the error disappears, but this is only practical if we already know the value of the integral we are seeking to estimate.

p = \frac{|f|}{\int |f| \, dV} \qquad (6.35)
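The sketch below (Python/numpy) applies importance sampling, per Equation (6.34), to the one-dimensional analogue of the e^{10x} example above, drawing from a density proportional to e^{10x} on [0, 1]. Because f here is of one sign and p is chosen proportional to it (as per Equation (6.35)), f/p is exactly constant and the error term collapses to (numerically) zero; the example and its numbers are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
n = 10_000

def f(x):
    return np.exp(10 * x)          # integrand on [0, 1]; true integral = (e^10 - 1)/10

# Plain Monte Carlo (uniform sampling), error per Equation (6.31)
x = rng.random(n)
plain = f(x)
plain_est = plain.mean()
plain_err = np.sqrt((np.mean(plain ** 2) - plain_est ** 2) / n)

# Importance sampling with p(x) = 10 e^{10x} / (e^{10} - 1), per Equation (6.34)
u = rng.random(n)
c = np.expm1(10.0)                 # e^10 - 1
xs = np.log1p(c * u) / 10.0        # inverse-CDF draw from p
p = 10 * np.exp(10 * xs) / c
ratio = f(xs) / p                  # f/p is exactly constant here, so the error vanishes
is_est = ratio.mean()
is_err = np.sqrt((np.mean(ratio ** 2) - is_est ** 2) / n)

print(plain_est, plain_err, is_est, is_err, (np.exp(10) - 1) / 10)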
A special form of importance sampling which turns out to be important in the context of Bayesian statistics is Markov chain Monte Carlo (MCMC). The goal is to visit a point x with probability given by some distribution function¹⁹ π(x). Instead of identifying sample points via unrelated, independent points as in a usual Monte Carlo approach, we choose consecutive points using a Markov chain (see Section 7.2) where p(x_i|x_{i−1}) is chosen in a specific way that ensures that we do indeed eventually visit each point in the set with the desired probability.²⁰ The reason that such a methodology is particularly helpful with Bayesian approaches is that the posterior probability distribution for x, the parameters of a Bayesian model, is proportional to π(x) = p(D|x)p(x), where p(x) is the Bayesian prior distribution for x and p(D|x) is the probability of the observed data, D, given model parameters x. If we can sample proportional to π(x) then we can estimate any quantity of interest, such as its mean or variance.

6.11.3 Monte Carlo simulation – stratified sampling

Stratified sampling is another example of a variance reduction approach, i.e., one aimed at minimising the 'plus-or-minus' term in Equation (6.31), but otherwise does not have too much in common with importance sampling. The variance of the Monte Carlo estimator, var(f̄), is asymptotically related to the variance of the original function, var(f), by the relation

\text{var}(\bar{f}) \cong \frac{\text{var}(f)}{n} \qquad (6.36)
Suppose that we divide the volume V into two equal sub-regions A and B and sample, say, n_A times in sub-region A and n_B = n − n_A times in sub-region B. Then another estimator is, say, f̄* defined as follows, where f̄_A means the mean of the simulations in region A:

\bar{f}^* = \frac{1}{2}\left(\bar{f}_A + \bar{f}_B\right) \qquad (6.37)
We find that var(f̄*) is minimised if we choose n_A as follows, using the notation σ_A ≡ (var_A(f))^{1/2} for the standard deviation of f within sub-region A:

\frac{n_A}{n} = \frac{\sigma_A}{\sigma_A + \sigma_B} \qquad (6.38)
19 Such a function is not quite a probability distribution because it may not integrate to unity over the sampled region, but is proportional to a probability distribution.
20 The p(x_i|x_{i−1}) are chosen to satisfy the detailed balance equation, i.e., π(x₁)p(x₂|x₁) = π(x₂)p(x₁|x₂).
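A minimal sketch (Python/numpy) of the two-region allocation rule of Equation (6.38): the standard deviations σ_A and σ_B are estimated from small pilot samples, and the remaining budget is then split in proportion to them. The pilot-sample device and the choice of test function are illustrative assumptions, not something prescribed in the text.

import numpy as np

def stratified_two_region(f, n, rng=None, pilot=200):
    """Stratify [0,1] into A=[0,0.5) and B=[0.5,1], allocating points per Equation (6.38)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pilot samples to estimate the standard deviation of f in each sub-region
    sd_A = f(0.5 * rng.random(pilot)).std()
    sd_B = f(0.5 + 0.5 * rng.random(pilot)).std()
    n_A = int(round(n * sd_A / (sd_A + sd_B)))
    n_B = n - n_A
    f_A = f(0.5 * rng.random(n_A)).mean()
    f_B = f(0.5 + 0.5 * rng.random(n_B)).mean()
    return 0.5 * (f_A + f_B)            # Equation (6.37)

print(stratified_two_region(lambda x: np.exp(x), 10_000))   # true <e^x> on [0,1] is about 1.7183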
More generally, if we split V into lots of different equal sub-regions then we should ideally allocate sample points among the regions so that the number in any given region is proportional to the square root of the variance of f in that region. Unfortunately, for high dimensional spaces we run into the curse of dimensionality. The most obvious way of subdividing V into equal spaces is to do so by dividing the volume into K segments along each dimension, but in spaces of high dimensionality, say d greater than 3 or 4, we then end up with so many sub-volumes, namely K^d of them, that it typically becomes impractical to estimate all the corresponding σ_k. A better approach may then involve recursive stratified sampling, where we successively bisect V not along all d dimensions but only along one dimension at a time (chosen in a manner that seems likely to maximise the speed of convergence).

In practice, we can use importance sampling and stratified sampling simultaneously. Importance sampling requires us to know some approximation to our integral, so that we are able to generate random points x_i with desired probability density p. To the extent that our choice of p is not ideal, we are left with an error that decreases only as n^{-1/2}. Importance sampling works by smoothing the values of the sampled function h and is effective only to the extent that we succeed with this. Stratified sampling, in contrast, does not in principle require us to know anything about f, merely about how to subdivide V into lots of equally sized regions. The simplest stratified strategy, dividing V into n equal sub-regions, gives a method whose error asymptotically decreases as n^{-1}, i.e., much faster than the conventional n^{-1/2} applicable to basic Monte Carlo simulation. However, 'asymptotically' is an important caveat in this context. If f is negligible in all but a single one of these sub-regions then stratified sampling is all but useless and a far better result would be to concentrate sample points in the region where f is non-negligible. In more sophisticated implementations, it is possible to nest the techniques, perhaps again choosing the extent to which one is used relative to the other in a manner designed to achieve as rapid as possible convergence to the true answer.

6.11.4 Monte Carlo simulation – quasi (i.e., sub-) random sequences

Part of the problem of basic Monte Carlo is that it does not select simulations 'uniformly' in space, merely randomly. Visual inspection of such simulations indicates a tendency for some areas of the space involved to have relatively large numbers of nearby simulation points and other areas where simulation points are relatively sparse – this is inherent in the choice of points at random. Suppose instead that we chose sample points so that they were the grid points of a Cartesian grid, and each grid point was sampled exactly once. For sufficiently smooth f such an approach has a fractional error that decreases at least as fast as n^{-1} for large n, because the points are more uniformly spread out through space. The trouble with such a grid is that we would have to decide in advance how fine it should be, and we would then be committed to sampling all its sample points in order to achieve the desired error characteristics. Ideally, we would like a process in which we are less committed to a set number of sample points but still avoid the random clumping together of simulation points that arises if they are chosen purely at random.
This can be achieved using so-called quasi-random sequences, also called ‘low discrepancy’ sequences or ‘sub-random’ sequences; see, e.g., Press et al. (2007) or Papageorgiou and Traub (1996). These are not random sequences as such. Instead they are deliberately chosen so that the selected points are as uniformly spaced as possible across V .
214
Extreme Events
Use of such sequences can be thought of as akin to a stratified sampling approach but without taking account of differences in the variance of f in different parts of V . 6.11.5 Weighted Monte Carlo A final variant of Monte Carlo that is mainly relevant to topics discussed in the next chapter is weighted Monte Carlo. This involves choosing a series of points typically randomly from a given ‘prior’ probability distribution and then applying different weights to these points. Thus instead of using an estimator f defined as per Equation (6.30) we define it as follows with the wi no longer all being the same: f =
wi f (xi ) i wi
i
(6.39)
The extra flexibility provided by the choice of wi , allows us to ‘fit’ the sampled points so that estimators such as f have desired properties. For example, we might use the approach to calibrate derivative pricing models to market prices so that the present value of the expected future payoffs from a derivative exactly matches its observed market price. We also use what is effectively weighted Monte Carlo in Section 7.11 to arrange for a set of sample points to have properties (such as means, covariances and selected higher moments) that exactly match what we want a distribution to exhibit. Ensemble Monte Carlo, a Monte Carlo technique used by Thompson and McLeod (2009) when pricing credit derivatives, seems to share many similarities with weighted Monte Carlo. 6.11.6 Monte Carlo simulation of fat tails Monte Carlo simulation involving fat-tailed data is relatively simple if the fat-tailed characteristics are present only in marginal distributions. Usually, random number generators provide uniform random numbers, i.e., variables of the form xi that are drawn from a uniform distribution on [0,1]. To convert them to Normally distributed random numbers we merely need to transform them to yi = N −1 (xi ), where N −1 (z) is the inverse function of the Normal cumulative distribution function (cdf). Likewise, to convert them to any other type of random number, following a (fat-tailed) (marginal) distribution F (z), we merely need to transform them to qi = F −1 (z) where F −1 (z) is the corresponding inverse function. This process is particularly easy if the distribution has been expressed in terms of its QQ plot, because we then have qi = f (yi ) = f N −1 (xi ) where the function f (z) is the one that has been used to define its QQ-plot characteristics. However, even if this is not the case, derivation of the qi is still normally relatively straightforward. We might, say, approximate F −1 (z) by applying linear interpolation to a tabulation based on the values of F (z) at a suitably well spaced-out range of discrete quantile points. To knit together m such series we first note that if fat-tailed behaviour is present only in the marginal distributions then the co-dependency characteristics, as defined by the copula, will not be fat-tailed, i.e., will involve a Gaussian copula with some specific (positive definite) covariance matrix V. We can therefore adopt the following approach: 1. We use a random number generator to generate m independent uniform random numbers (x 1 , . . . , xm ) ∈ [0,1]m .
Robust Mean-Variance Portfolio Construction
215
2. We use these uniform random numbers to generate m independent standardised Gaussian (i.e., Normal) random numbers (y1 , . . . , ym ) ∈ Rm , using yi = N −1 (xi ). 3. We apply a Cholesky decomposition to V. If the elements of V are Vij , this involves identifying aij as follows. This can easily be done working sequentially through (i, j) = (1,1) , (1,2) , . . . , (1, m) , (2,1) , . . . , (2, m) , . . ., where m is the number of series to be simulated:21 ⎧ 2 ⎪ ⎪ Vii − aik , i= j ⎪ ⎪ ⎪ ⎪ k j ⎪ ⎪ ajj ⎪ ⎩ 0, i< j 4. We use this decomposition to generate m Gaussian (i.e., Normal but not necessarily independent) random numbers, (z 1 , . . . , z m ) ∈ Rm , by defining z i = a ik yk . k≤i
5. We use these non-independent Gaussian random variables to generate m random variables, (q1 , . . . , qm ) ∈ Rm , with the required joint fat-tailed distribution with zero mean and Gaussian copula with covariance matrix V by using qi = Fi−1 (N (z i )), where N (z) is the cdf of the Normal distribution and Fi−1 is the inverse function of the cdf corresponding to the ith marginal distribution. Such an approach does not work if the series are fat-tailed in the copula as well as (or instead of) in the marginal distributions. We then need to replace steps 2–4 with steps specific to the copula we are using; see e.g., Kemp (2010). 2 21 to be greater than zero, because otherwise its square root is For the Cholesky decomposition to work, we need Vii2 − k B). Then P(A > B) = P (B > C) = P(C > D) = P(D > A) = 2/3. On average A will beat B, B will beat C, C will beat D and, counter-intuitively, D will beat A. This issue becomes particularly pertinent here if we view portfolio construction from a game-theoretic perspective, i.e., if we view our aim to be to do ‘better’ than someone else. This is perhaps less relevant from the perspective of any one individual end investor (although we should still bear in mind the common desire to ‘keep up with the Jones’). It is perhaps more relevant for asset managers (and other professional investors), whose future fee revenue may be particularly sensitive to how well they do relative to others against whom they are competing.
224
Extreme Events
There are several points worth highlighting about CRRA: (a) The CRRA utility function is convex for any value of γ > 0, γ = 1. All other things being equal, this means that strategies resulting in less volatility in outcomes are preferred over more volatile ones. Likewise, all other things being equal, strategies with higher mean terminal wealth are preferred over ones with lower mean terminal wealth. Thus CRRA generates a risk-reward trade-off along the lines we might reasonably expect, with the investor’s coefficient of risk aversion, γ , driving the relative importance to give to risk versus return in the analysis. (b) CRRA assumes that inter-temporal utility can be fully catered for merely by expressing any utility ‘consumed’ in the interim in terms of an equivalent impact on terminal wealth. This is not necessarily sound. Whenever the CRRA utility function is convex, it includes an aversion to volatility (and fat tails). However, investors might have a different degree of risk aversion to volatility of outcomes depending on when those outcomes might arise. Some of the time volatility may be easier for the investor to cope with than at other times. (c) We can view quadratic utility as a ‘local’ approximation to CRRA (and vice versa) because we can expand the CRRA utility function using a multi-dimensional Taylor series expansion (in respect to the parameters on which the return distribution depends). The first terms in this expansion (at least for distributions of returns that are approximately multivariate Normal) are the same as for a quadratic utility function.5 (d) Because of this ‘local’ equivalence we may expect certain aspects of mean-variance optimisation to have analogues with CRRA even though this may not be obvious from Equation (7.8). Perhaps the most important is that mean-variance optimisation in effect requires a definition of what is ‘minimum risk’ (which might be a specific portfolio or might be, say, a hypothetical return series that was constant in a given currency). This point also arises with CRRA because we do not actually measure wealth as an abstract dimensionless quantity. Instead we generally express wealth with a particular numeraire in mind, e.g., US dollars, UK Sterling, the unit value of a fund and so on; see, e.g., Kemp (2009). Thus the WT used in CRRA should be interpreted as also including some suitable numeraire.
7.3.4 The need for monetary value to be well defined For CRRA to have meaning, we need the concept of (terminal) wealth to be well defined. More generally, for any utility function dependent on monetary values to be meaningful, we need to be able to place an unambiguous monetary value on whatever it is that we are focusing on. Economists are well aware of this. Utility theory is applied very widely, e.g., to the provision of public services, to game theory and to other activities where utility and monetary worth are not necessarily easily linked. This possible disconnect, however, is not always uppermost in the minds of financial practitioners. Financial practitioners usually take it for granted in the financial world that they can identify the value/price of anything in which they might be interested.
5 However, the asset mixes deemed efficient may not be so similar, if the risk level is not very low or if the distributional form differs materially from Normality, because the asset mixes may also be driven by ‘non-local’ aspects of the utility function.
Regime Switching and Time-Varying Risk and Return Parameters
225
Kemp (2009) explores in some depth what characteristics monetary value should have if it is to be meaningful. He draws analogies with the properties of money itself, in particular its applicability as a means of exchange. He highlights the need for monetary value, if it is to be generally recognised as such, to satisfy axioms of uniqueness, additivity and scalability, i.e., for V (k (A + B)) = kV (A) + kV (B) where V (X ) is the (monetary) value that we place on contingency X and k is an arbitrary scalar value. The most obvious example type of valuation that adheres to these axioms is the market value of something, as long as the market is deep, liquid and transparent and satisfies the principle of no arbitrage (and more specifically respects the identity V (0) = 0). Market value also needs to refer to the price of a marginal trade. Based on this analysis, Kemp (2009) also argues that the most obvious way of defining a market consistent value of something that is not actively traded on a market as such is as a reasoned best estimate of the market value at which it would trade on a hypothetical market that was sufficiently deep, liquid and transparent and that did exhibit these properties. Inconveniently, markets generally do not fully exhibit these characteristics. For example, in general market transactions will incur transaction costs, either explicit (e.g., commission, bid-offer spread) or implicit (e.g., market impact; see Section 5.7). In such circumstances, monetary value ceases to have a unique well defined meaning. The analogy with money itself is less effective here, because transaction costs in relation to money are generally minimal or nonexistent.6 Instead any price between bid and offer prices becomes theoretically justifiable.7 These market imperfections can at times be important. For example: (a) Transaction costs (and uncertainty in their magnitude) are intimately bound up with liquidity risk – see, e.g., Kemp (2009).8 This means that most portfolio construction methodologies, being ultimately based on utility functions that are based on monetary values, are not good at handling liquidity risk. We explore other ways of catering for liquidity risk in Section 7.11.4 and in Chapters 8 and 9. (b) We need to be careful about application of the utility theory whenever our transactions are large. With CRRA, for example, this would add an extra element of uncertainty in relation to what exactly was our terminal wealth in any particular outcome. A pragmatic way of circumventing this difficulty (if our portfolio is not too large) is to require the portfolio to be suitably diversified. As we saw in Section 6.6, such an approach can be thought of as a particular form of robust portfolio optimisation. One reason we might want to impose the a priori view that such restrictions ‘ought’ to be imposed on the portfolio is because they may reduce our exposure to the vagaries of this type of concentration risk.
6 A possible exception might be foreign exchange transactions (which do potentially incur transaction costs, albeit still typically modest relative to transaction costs on most other types of asset, at least for most freely convertible currencies). However, this exception is not really relevant here. Only one of the currencies in any such situation is typically the legal tender or the de facto medium of exchange in any one location, and thus ‘money’. 7 The usual convention is to focus on the mid-price between the bid and offer prices. This implicitly assumes equivalence in bargaining position which we might view as equitable but is not necessarily representative of reality. 8 Commentators discussing liquidity risk typically differentiate between ‘funding’ liquidity risk and ‘market’ (or ‘asset’) liquidity risk. The former corresponds to the difficulty that an agent might experience when trying to use a given asset as collateral to borrow against, while the latter corresponds to the difficulty that an agent might experience when trying to sell an asset (or buy back a liability). Uncertainty in future transaction costs influences both, although not necessarily by an equal amount.
226
Extreme Events
Principle P37: Utility functions used to quantify the trade-off between risk and return usually relate to the behaviour of the market values of the assets and liabilities under consideration. The inclusion of transaction costs complicates the meaning we might ascribe to ‘market value’ and may have a material impact on the resulting answers if our focus includes risks such as liquidity risk that are intimately bound up with future transaction costs and uncertainty in their magnitude.
7.4 OPTIMAL PORTFOLIO ALLOCATIONS FOR REGIME SWITCHING MODELS 7.4.1 Introduction In this section we describe in more detail how we can identify efficient portfolios if our model of the world is an RS model as described above. We will concentrate on a CRRA utility function and, in this section, we ignore transaction costs. 7.4.2 Specification of problem Consider an investor facing a T period horizon who rebalances his portfolio over m assets at the start of each period and who wants to maximise his expected utility from his end of Tth period wealth. The problem can be stated formally in a mathematical sense as involving finding the combination of weights xt applicable at time t (t = 0, 1, . . . , T − 1) that maximise the following, where WT is the end of T period wealth and E 0 (U (WT )) is the expected value now of the utility of this future wealth: max E 0 (U (WT ))
x0 ,...,xT −1
(7.9)
The optimisation is subject to the constraint that portfolio weights at time t must add to unity T use the notation xt = x1,t , . . . , xm,t in which case this constraint (i.e., xtT 1 = 1). We might m can be written as j=1 x j,t = 1. For the time being we will assume that there are no costs for short-selling or rebalancing and that wealth alters through time according to the formula: Wt+1 = Rt+1 (xt ) Wt
(7.10)
Here, Rt+1 (xt ) is the gross return on the portfolio (using our chosen numeraire), and may be derived as follows, where y j,t+1 is the logged return on asset j from time t to time t + 1: Rt+1 (xt ) =
m j=1
exp y j,t+1 x j,t
(7.11)
Ang and Bekaert (2003a) note that in this instance and by using dynamic programming the portfolio weights at each time t can be obtained by maximising the (scaled) indirect utility,
Regime Switching and Time-Varying Risk and Return Parameters
227
i.e., as 1−γ x∗t = arg max E t Q t+1,T Wt+1
(7.12)
xt
where Q t+1,T = E t+1
1−γ Rt+2 x∗t+1 . . . Rt+2 x∗T −1
and
Q T,T = 1
(7.13)
The optimal portfolio weights may be found where the first partial differentials of the above are simultaneously zero, i.e., at the values that satisfy the first order conditions (FOC) for this problem using terminology more commonly adopted by economists. If we express all returns as relative to the return on the mth asset then the relevant criteria become ⎛ ⎛ ⎞⎞ r1,t+1 ⎜ r2,t+1 ⎟⎟ ⎜ ⎜ ⎜ ⎟⎟ −γ −γ E t ⎜ Q t+1,T Rt+1 (xt ) ⎜ (7.14) ⎟⎟ ≡ E t Q t+1,T Rt+1 (xt ) rt+1 = 0 .. ⎝ ⎝ ⎠⎠ . rm−1,t+1
T where rt+1 = r1,t+1 , . . . , rm−1,t+1 is the vector of the (m − 1) returns on each asset relative to the mth asset. We could equally specify returns relative to some other numeraire in which case the rt+1 would then be an m-dimensional vector, but there would still be only m − 1 degrees of freedom because of the constraint that mj=1 x j,t = 1. 7.4.3 General form of solution The specification of the portfolio construction problem as above is quite general if we assume a utility function in line with CRRA, do not impose constraints other than that the weights add to unity and ignore short-selling constraints, transaction costs and other market frictions. In the special case where the yt+1 (i.e., the vector of logged returns) is i.i.d. across time, Samuelson (1969) showed that optimal portfolio weights would then be constant and the T period problem thus becomes equivalent to solving the ‘myopic’ one period problem.9 When returns are not i.i.d. then Merton (1971) showed that the portfolio weights can be broken down into a myopic and a hedging component. The myopic component is the solution to the one period problem and the hedging component results from the investor’s desire to hedge against potential unfavourable changes in the investment opportunity set. 7.4.4 Introducing regime switching Regime switching can be introduced into such a model by assuming that there are, say, K regimes, numbered (for the one present between t and t + 1) as st+1 = 1, . . . , st+1 = K . The regimes, st , are assumed to follow a Markov chain in which the probability of transition 9 The feature that the asset weights remain constant through time in such a situation is a consequence of using a CRRA utility function. More generally, if a more complex intertemporal utility function is applicable, the problem separates into two components, as per Section 5.4.
228
Extreme Events
from regime i at time t (more precisely, during the period t − 1 to t) to regime j at time t + 1 is denoted by, say, pi, j,t . More precisely we have the following, where ℑt refers to the information10 potentially available to us at time t: pi, j,t ≡ p (st+1 = j |st = i, ℑt )
(7.15)
Each regime is potentially characterised by a probability density function11 f (rt+1 |st+1 ) ≡ f (rt+1 |st+1 , ℑt ), i.e., the (multivariate) probability density function of rt+1 conditional on st+1 . Ang and Bekaert (2003a) concentrate on RS models where the f (rt+1 |st+1 ) are multivariate Normal distributions that are constant through time and where the transition probabilities are also constant through time, i.e., pi, j,t = pi, j . g (rt+1 |st ) ≡ f (rt+1 |st , ℑt ), the probability density function of rt+1 conditional on st (rather than conditional on st+1 ), is then given by the following distributional mixture of Normal distributions: g (rt+1 |st ) =
K
pi, j f (rt+1 |st+1 = k )
(7.16)
i=1
7.4.5 Dependency on ability to identify prevailing regime A further complication arises because there is an ambiguity in the definition of the information ℑt that might be available to us at time t. In the above formula we might deem the ℑt to be the information available to an infinitely well informed observer watching the evolution of the system who is also able to determine with certainty which regime the world is in at time t (more precisely, during the period t−1 to t). This, in effect, is what is assumed in the Ang and Bekaert model. In practice, however, it may be (indeed, it is likely to be) unrealistic to assume that agents have this level of ability to determine which state the world is currently in. More realistically, there is likely to be some uncertainty about which regime the world is in at any given point in time. Ang and Bekaert (2003a) argue that an assumption of full knowledge about which regime is prevailing can be viewed, in a sense, as a ‘worst-case scenario’, because they are primarily interested in this context in understanding the extent to which introducing regime switching alters investor behaviour versus the i.i.d. solution highlighted by Samuelson (1969) in which the optimal asset mix is constant through time. The harder agents find it to identify the regime in which the world currently is, the more ‘fuzzy’ becomes their understanding of the world. However, this ‘fuzziness’ still has certain specific characteristics in the Ang and Bekaert model. In the limit in which they cannot judge at all which regime the world is in, then, as far as they are concerned, g(rt+1 ) becomes time invariant and, in effect, regime invariant (as long as the f (rt+1 |st+1 ) and pi, j are themselves time-invariant). However, g(rt+1 ) would still be known to them and, because it involves a 10 The ℑt is also more technically called a ‘filtration’. The need to ensure that we only make use of information that would be available to us at a particular point in time rather than make use of the model’s entire history (both forwards and backwards in time) is also a vital ingredient in derivative pricing; see Section 7.5. 11 This would be replaced by a probability mass function if we were working with discrete rather than continuous probability distributions.
Regime Switching and Time-Varying Risk and Return Parameters
229
distributional mixture of the different f (rt+1 |st+1 ), would still involve a distributional mixture of Normal distributions. Moreover, it would be possible to identify the mixing coefficient involved from the pi, j , although it might in practice be easier to estimate them directly from observed (or forecast) behaviour and then to reverse out corresponding pi, j that led to such mixing coefficients (if it ever proved necessary to identify values for the individual pi, j ). It is therefore worth noting at this juncture that there are some other ways in which ‘fuzziness’ in this respect could be introduced into the model, including: (a) Agents might view each regime as a ‘fuzzy’ (distributional) mixture of Normal distributions but still have a clear idea at any given time what the corresponding mixing coefficients applicable to that regime might be. This, in a sense, generalises the Ang and Bekaert model to cater for any type of fat-tailed behaviour we like, because we can always approximate arbitrarily accurately any probability distribution (even a multivariate one) by a mixture of Normal distributions. We explore this alternative in Section 7.8. (b) Agents might have intrinsic (immeasurable) uncertainty about which regime is applicable at any particular point in time and/or about what are the correct values to use for pi, j,t . This is a form of ‘Knightian’ uncertainty, a topic that we discuss further in Section 9.4.
Principle P38: To take full advantage of the predictive power of models that assume the world changes through time, agents would need to have some means of identifying what ‘state’ the world was currently in. This is usually not easy. Such models also need to make assumptions about the level of ambiguity present in this respect.
7.4.6 Identifying optimal asset mixes Optimal asset mixes can then be derived by working backwards from time t = T solving consecutive one-period asset allocation problems as per Equation (7.12) applicable to each possible regime in which the world might be at time t = T − 1, t = T − 2, . . . back to the present. In general, Ang and Bekaert (2003a) believe that there is no closed form, i.e., analytical solution, to this problem (or to any analogous problem specified in continuous time), and so Equation (7.12) would need to be solved numerically (see Section 7.11).
7.4.7 Incorporating constraints on efficient portfolios In Chapter 5 we highlighted that, in practice, choice of efficient portfolio also usually involves some further constraints, e.g., short-selling constraints and/or specific position, sector or portfolio-wide limits. Given the specification of the problem as above, we can include such constraints by altering what are deemed acceptable portfolios in the optimisation stage carried out in Section 7.4.6. Optimal portfolios may still be solutions to Equation (7.12) (if none of these additional constraints bite) but more usually they would lie somewhere at the ‘edge’ of the feasible set of possible asset allocations that satisfy the constraints.
230
Extreme Events
7.4.8 Applying statistical tests to the optimal portfolios We saw in Chapter 6 that it was helpful to be able to derive standard errors and other distributional metrics for portfolio weights under the null hypothesis that our model was correctly specified. This made it possible to formulate statistical tests about how reliable the resulting optimal portfolios might be. Ang and Bekaert (2002a) also describe how, in general terms, this can be done with an RS model, if we assume that individual parameters used in the model, θˆ , are assumed to possess an asymptotic Normal distribution N (θ0 , ), where θ0 is assumed to be the vector of the true population parameters. They show that the resulting asset mixes deemed optimal then also have an asymptotically Normal distribution that can in principle be derived from the form of Equation (7.12) and θ0 and , as follows, with ϕ and D being suitably defined (see also Kemp (2010)): D x∗t → N ϕ(θ0 ), DD T
(7.17)
ˆ x∗t ) = 0, x∗t is the where Equation (7.12) is re-expressed as the solution to the equation t(θ, 0 ∗ ˆ xt ) = 0 at time t = t0 , ϕ is a function such that t (θ0 , ϕ) = 0 and solution to t (θ,
∂ϕ
D= ∂θ θ=θ0
(7.18)
In practice, Ang and Bekaert (2002a) compute D numerically. They first solve Equation (7.12) for the estimated parameter vector θˆ to identify optimal portfolio weights xˆ ∗t . They then change the ith parameter in θˆ by a small amount εand re-compute the new optimal portfolio ∗ε ∗ ˆ ˆ x . The ith column in D is then given by − x /ε. weights xˆ ∗ε t t t 7.4.9 General form of RS optimal portfolio behaviour Some generic characteristics of how RS optimal portfolios will behave can be identified merely by considering the form of the problem set out above rather than actually having to solve any particular example numerically. These include: (a) The more the regimes differ (and are determinable by agents, see Section 7.4.5) and/or the greater the probability of transition to a new regime, the larger will be the potential asset allocation shifts through time. (b) By selecting regime characterisations appropriately, it should be possible to replicate many possible types of market behaviour including some that exhibit fat-tailed behaviour. (c) If the mean returns applicable under different regimes differ (as is the case with Ang and Bekaert’s model), the assets involved can exhibit ‘momentum’ or ‘rebound’ (i.e., mean-reverting) characteristics, depending on whether transition to a new regime is less or more likely than staying in the existing regime. Ang and Bekaert (2002a) note, quoting Samuelson (1991), that with a rebound process (i.e., where returns next period tend to be low if in this period they were high, and vice versa) risk-averse investors should typically increase their exposure to risky assets (as long as there is a suitable reward from doing so). The opposite is the case with a momentum process (i.e., where returns next period tend to be high if in this period they were also high, likewise if low). Intuitively, this is because
Regime Switching and Time-Varying Risk and Return Parameters
231
long-run volatility is smaller, all other things being equal, under a rebound process than under a momentum process.
7.5 LINKS WITH DERIVATIVE PRICING THEORY The identification of optimal asset mixes in the presence of regime switching has strong similarities with certain facets of derivative pricing theory. This is to be expected because (a) To price (and hedge) derivatives it is necessary to formulate models for how markets might behave. For more complicated derivatives dependent on multiple underlyings, these models must simultaneously handle multiple asset types and so need to be multivariate, just like those used for asset allocation purposes. For example, Ang and Bekaert (2002b) argue that interest rates exhibit regime switching characteristics similar to those that they developed in Ang and Bekaert (2002a) for equity markets. The resulting interest rate models that they develop have strong similarities with some of the interest models that are used in practice when pricing interest rate derivatives. (b) Although derivative pricing theory is usually expressed in continuous time, pricing problems are often in practice solved in discrete time, using lattices and/or trees (see, e.g., Kemp (2009)), and working backwards from time t = T as per Section 7.4.6. (c) Some of the more modern derivative instruments that it is possible to buy and sell include ones that are portfolio rather than index based. In general, their pay-offs will depend on how the underlying portfolio might be managed, which in turn can be influenced by the asset mix that the portfolio adopts. In a sense, therefore, we can think of portfolio construction theory as being a subset of derivative pricing theory, although this does require a relatively all encompassing definition to be adopted for what we mean by ‘derivative pricing theory’.12 There are some differences, however, between the two theories as conventionally understood. The main one is that in derivative pricing theory we calibrate inputs to the pricing algorithm by reference to the current market prices at which different (usually more standardised) instruments are trading. In contrast, in portfolio construction theory we calibrate models by reference to our own views about how the future might evolve (coloured, most probably, by some reference to the past; see, e.g., Section 6.2). In derivative pricing theory we thus typically assume that the market is ‘right’ while in portfolio construction theory we typically assume that the market is ‘wrong’.13 Another difference often applicable in practice is that multi-period portfolio construction theory generally jumps straight to focusing on terminal wealth and usually ignores 12 Such a definition would require us to deem derivative pricing theory to be equivalent to contingent claims theory (and, arguably, thus with nearly every part of finance theory) and for us to include in the latter any type of contingent claim including ones that have actively managed underlyings. The leap in complexity involved is rather larger than most quantitative finance specialists would be comfortable adopting. More usually they would attempt to minimise the sensitivity of derivative prices to such features. 
For example, as explained in Kemp (2009), early CDOs were sold in which the prices of different tranches were sensitive to how the portfolio underlying the structure was positioned, but more recently it has become normal for CDOs to be sold in which the terms of the tranche are reset whenever the underlying portfolio is changed (the adjustments being designed to minimise the impact of such a change on the market value of the tranche at the point in time the change occurs). 13 This dichotomy is too black and white to be a perfect reflection of reality. In practice, there is a limit to the number of derivative instruments in which there is a ready market. As explained in Kemp (2009), derivative pricing also involves judgemental elements. Market makers trading derivatives generally cannot perfectly hedge their exposures or can only do so some time after they take on a position. Thus they are always to some extent taking positions, and when doing so will tend to favour positions that they think will perform well when measured in a risk-adjusted fashion.
232
Extreme Events
inter-temporal utility issues (e.g., as we did in Section 7.4). In contrast, derivative pricing theory can at times be quite focused on the ‘how we get there’ as well as on ‘what might be the end outcome’. Some derivatives that trade in practice such as barrier options can exhibit strong path dependency features. A greater focus on path dependency is being injected into portfolio construction theory by the development of economic scenario generators (ESGs) used by, e.g., insurance companies to calculate market consistent liability valuations or to choose investment or business strategies; see Section 6.8. Varnell (2009) argues that the ability to simulate paths as well as merely oneoff ‘point’ scenarios is a significant benefit available from ESGs and other similar simulation based modelling approaches.14 Principle P39: There are strong links between derivative pricing theory and portfolio construction theory, particularly if the minimum risk portfolio changes through time.
7.6 TRANSACTION COSTS The underlying links that exist between portfolio construction theory and derivative pricing theory can also be illustrated by considering how we might take account of transaction costs in either theory. Kemp (2009) describes how transaction costs (both directly visible elements, such as marketmaker bid-ask spreads, and less visible elements such as market impact) influence derivative pricing theory. The existence of such market frictions disrupts the hedging arguments typically used to justify the generalisations of the Black-Scholes option pricing theory that underpin a large part of conventional derivative pricing theory. In this body of theory it is usually assumed that we can rebalance a ‘hedging’ portfolio arbitrarily frequently at nil cost. If certain other conditions are met,15 we become able to hedge the derivative perfectly by investing in this dynamically adjusted hedge portfolio. By applying the principle of no arbitrage, we can then price the derivative merely by identifying the price of the hedge portfolio. Transaction costs disrupt this neat theory because for almost all derivatives the total volume of transactions that would be incurred by such ‘perfect’ dynamic hedging programmes tends to infinity as the time interval, h, separating consecutive rebalancings tends to zero.16 In the presence of nonzero transaction costs, sufficiently frequent rebalancing will completely extinguish the hedge portfolio. One possibility would be to ‘over-hedge’, i.e., always hold more than enough to meet any level of transaction costs. Unfortunately, the characteristics of Brownian motion mean that to avoid completely the possibility of extinguishing the hedge portfolio we would need to hold the upper limit on the possible value of the option and then carry out no dynamic hedging 14 The path taken to reach a particular situation can itself materially influence the outcome in several important areas of insurance practice. For example, pay-offs arising from participating (i.e., with-profit) business depend on bonuses declared between now and maturity. Simulations in an ESG can be carried out either using ‘market implied’ or ‘real world’ assumptions (or at times via nested loops in which instruments are valued using market implied assumptions, but we then allow factors driving the evolution of their value through time to evolve using ‘real world’ assumptions). ‘Real world’ here generally refers to assumptions that we ourselves believe to be correct rather than ones that are consistent with current market prices. 15 In particular, the underlying needs to move continuously (i.e., not to exhibit jumps) and we need to know when the cumulative volatility in the underlying will reach a given value; see Kemp (2009). 16 This is because the (generalised) Brownian motions that typically underlie the theory have the characteristic that as the time interval becomes shorter and shorter, the observed variability of the price process reduces only by the square root of the time interval.
Regime Switching and Time-Varying Risk and Return Parameters
233
whatsoever. Thus in practice we need to accept some possibility of being unable to hedge fully an option pay-off whenever transaction costs are present. This introduces a trade-off between the degree to which we rebalance (and hence incur transaction costs), and the degree to which we replicate accurately the final option pay-off. We can analyse this trade-off using expected utility theory as per Section 7.3. The uncertainty in the quality of replication that the trade-off creates is explained in Davis, Panas and Zariphopoulou (1993) who in turn borrow ideas from Hodges and Neuberger (1989). Using a utility maximisation approach they show that the mathematics involved reduce to two stochastic optimal control problems, i.e., partial differential equations involving inequalities. A simple asymptotic approximation to the solution to these equations is set out in Whalley and Wilmott (1993), if transaction costs are small and proportional to value traded and we have a single underlying, with price S. It involves rebalancing the portfolio if it moves outside a certain band, back to the nearest edge of the band, i.e., it defines ‘buy’, ‘hold’ and ‘sell’ regions. The band is defined in terms of a factor defining how much should ideally be invested in the underlying. is like an option delta but is given by = 0 ∓
3k Se−r(T −t) 2λtrans
1/3
|Ŵ0 |2/3
(7.19)
where k is the cost of trading a unit value of the underlying and 0 and Ŵ0 are the delta and the gamma of the option ignoring transaction costs, i.e. 0 =
∂V ∂S
Ŵ0 =
∂2V ∂ S2
(7.20)
The utility function assumed to apply in this analysis is an exponential one, with an index of risk aversion of λtrans . This approximation has several intuitive features: (a) The band (i.e., the ‘hold’ region) becomes wider as transaction cost levels rise. (b) The band tightens when the option becomes deep in-the-money, or deep out-of-the-money (i.e., when the rate of change of delta in the zero transaction cost case becomes small), but increases when the delta is liable to fluctuate more, i.e., when the option gamma is larger (although not as quickly as the change in the gamma, because of the trade-off between cost and quality of hedging). (c) In the limit of zero transaction costs (i.e., k → 0), the band collapses to rebalancing in line with the standard Black-Scholes case, i.e., using = ∂ V /∂ S. Turning to portfolio construction, if we try to identify optimal asset mixes in the presence of proportional transaction costs, we end up with a similar stochastic optimal control problem. It can in principle be solved in a similar manner to the above and in general should lead to a similar decomposition into ‘buy’, ‘hold’ and ‘sell’ regions. However: (a) We need to apply a different interpretation to Ŵ0 . For a start, the portfolio construction problem usually involves many different asset mixes, rather than just one risky one (on which we incur transaction costs) and one riskless one (on which we do not incur transaction costs) as typically arises in derivative pricing theory. Ŵ0 is therefore potentially multidimensional. Moreover, most portfolios do not have benchmarks that dynamically alter
234
Extreme Events
in an option-like manner. At an asset allocation level, they more commonly involve fixed weight benchmarks that are regularly rebalanced back to their starting weights if the benchmark asset mix would otherwise have drifted away from these weights due to market movements (with rebalancing occurring, say, monthly or quarterly, to match the timescale over which performance assessment of the portfolio usually takes place). For these types of benchmark, the equivalent to Ŵ0 will depend mainly on the volatility of the particular asset in question relative to the benchmark. (b) We also need to link the choice of λtrans with the choice of the main λ used when identifying optimal asset mixes. There is little point in adopting a very precise match to the optimal asset allocation (by choosing a large λtrans ), thus incurring potentially material transaction costs, if the choice of the main λ applicable to the derivation of efficient asset mixes emphasises return rather than risk. Transaction costs can in this context be thought of as akin to a drag on returns (i.e., as an adjustment to the first moment of the outcome). In contrast, quality of replication of what would otherwise be the optimal asset mix can be thought of as akin to an adjustment to the risk of the portfolio (and one that might be small relative to the riskiness already present in the relevant asset allocation). (c) Because we typically have many more assets, the positioning of the optimal ‘bands’ defining when we should and should not trade becomes much more complicated. For example, suppose that we adopt the naive approach of deriving bands for each asset category as if it was the only category to which an equation like Equation (7.19) applied. This might then result in only one asset falling outside the ‘hold’ band we had computed for it. However, we cannot sell (or buy) one asset without buying (or selling) another (if we include cash as an asset in its own right). In general, our decision on whether to buy or sell a particular asset will depend partly on the positioning of all the other assets. The precise form of the optimal asset allocation strategy in the presence of proportional transaction costs when there are many different assets potentially included in the problem is therefore beyond the scope of this book (see Kemp (2010) for further comments). The inclusion of fixed (in monetary terms) dealing costs alters the application of the banding and in particular the feature seen above that it is generally best merely to trade to the nearest boundary of the ‘hold’ region. Rebalancing back to, say, the middle of the ‘hold’ band rather than its closest edge may then reduce the number of times we might trade and hence the overall transaction costs that we might incur.
7.7 INCORPORATING MORE COMPLEX AUTOREGRESSIVE BEHAVIOUR 7.7.1 Introduction The type of Markov chain that we introduced in Section 7.2.3 was relatively straightforward. In it, the world was deemed to be in one of two states at the start of a given period, and we assumed that by the end of that period it had potentially transitioned with some given probability to the other state. Only two parameters were involved. One corresponded to the probability of transition from Regime 1 to Regime 2. The other corresponded to the probability of transition from Regime 2 to Regime 1. Even when we generalise it to involve more than two regimes, we still have a relatively simple model structure. In this section and the next we explore what can happen if our model includes more complicated transition arrangements. In this section we retain the assumption that the world is only
Regime Switching and Time-Varying Risk and Return Parameters
235
ever in one state at any one time, but we allow more complicated models of transitions between states. In Section 7.8 we also consider what happens if the world can be in a (distributional) mixture of states at any particular point in time. 7.7.2 Dependency on periods earlier than the latest one Having transition probabilities between time t − 1 and t that depend merely on the applicable regime at time t − 1 can be viewed as akin to the AR(1) models described in Section 4.7. Cyclical behaviour can arise, as the world transitions between the regimes. However, it is relatively straightforward in nature. Much more complicated is if the outcome at time t depends not only on the regime applicable at time t − 1, but also on the regime applicable at earlier points in time, e.g., t − 2, t − 3, . . . etc. We can think of these as analogous to AR(2), AR(3), . . . models. A richer set of behaviours becomes possible with such models. However, just as with Equation (4.19), it is possible to re-express such models as first order ones, i.e., in the vein of Equation (4.20). This involves redefining the transition probabilities so that they apply to transitions of ‘super-states’ from time t − 1 to time t, where the super-states are characterised by a vector s = s1 , . . . , s p where si characterises the state of the world at time t − i; i.e., it is the state of states going back in time. When doing so, we implicitly need to assume that the state of the world more than p time periods ago does not influence the probability of transition from the current state to any new state. Unfortunately, refining our models in this manner also comes with an important downside. In a two-regime world, refining the likelihood of transition to include all super-states of the world characterised by the exact path that the world has taken over the last a periods, rather than just its state in the last (i.e., one) period, increases the number of possible ‘states’ that the Markov chain needs to handle from 2 to 2a (e.g., if a = 5 then there are 32 possible paths it might take). If the world could be in more than two states then the shift is even more extreme. 3a or 4a rapidly becomes very much larger than 2a even for quite modest values of a. This increases very substantially the potential risk of over-fitting the model. Moreover, most of these super-states are likely to be very rare, increasing still further the difficulties of estimating their characteristics. 7.7.3 Threshold autoregressive models One type of autoregressive model that does appear relatively frequently in the academic literature is the threshold autoregressive model, particularly self-exciting threshold autoregressive (SETAR) models. Such a model consists of k autoregressive AR(p) parts, each one characterising a different regime. The model is therefore usually referred to as a SETAR (k, p) model, or often just as a SETAR (k) model because the p can in principle vary between the different regimes. The defining feature of a SETAR model is that one of the elements of the regime (typically a ‘state’ variable rather than one that relates to the actual returns achieved on a given asset) triggers a regime shift when it reaches a particular threshold. Threshold models such as these are nonlinear and can therefore be made to exhibit chaotic behaviour. 
Usually, the models proposed have symmetrical thresholds; i.e., if the state variable driving regime changes, z t , rises above threshold Q j then we switch from state s = j to state s = j + 1 and similarly if it falls below the same threshold level Q j then it returns back to state s = j. However, more generally we could introduce hysteresis, if we thought that this
236
Extreme Events
more accurately reflected market behaviour, by setting different thresholds triggering regime changes depending on the direction in which the state variable was moving.17 7.7.4 ‘Nearest neighbour’ approaches In Section 4.7 we described several ways of generalising autoregressive models so that they exhibited behaviour more akin to how the world really seems to behave. In particular, we described a ‘nearest neighbour’ approach which potentially reduced the risk of model overfitting. In this approach, we viewed different parts of the past as being more or less relevant to how the world might behave now. Greater or lesser weight was then given to data applicable to those times in the past when predicting what might happen in the future. Based on the discussion in Chapter 6, we might also want to incorporate Bayesian priors into such a framework, i.e., our own general reasoning about how the world ‘ought’ to operate. The important shift in perspective that this brings is to move away from the notion that the world will be in a single well-defined (and potentially identifiable) regime at any given point in time. Instead we accept that our characterisation of the world will necessarily be ‘fuzzy’, involving a superposition of different states.18 As we shall see below, such fuzziness also naturally arises in a different way when we explicitly incorporate fat-tailed behaviour into our models.
7.8 INCORPORATING MORE INTRINSICALLY FAT-TAILED BEHAVIOUR Up to now in this chapter we assumed that at each point in time the regime that the world was in was characterised by a single (multivariate) Normal distribution. We now want to explore what happens when we relax this assumption. Arguably the most general way of doing so is to assume that at each point in time the world is characterised not by a single Normal distribution but by a (distributional) mixture of more than one (multivariate) Normal distribution. As we have seen earlier, any distribution can be approximated arbitrarily accurately by such a mixture, as long as we are allowed to mix sufficiently many Normal distributions. To the extent that it is possible to impose any sort of mathematical structure onto market behaviour, we should therefore be able to approximate the desired behaviour arbitrarily accurately using such a model.19 This type of generalisation can in principle be achieved by a suitable generalisation of the Markov chain approach referred to earlier. We now define super-states to involve not just the past history of regimes but also a contemporaneous mixture of current states. So, if we had two original (Normal) states and thought that the autoregressive features of the model 17 Hysteresis is a well understood concept in several areas of physics, explaining, for example, how some metals can become magnetised. Conceptually we might also view investors as exhibiting similar behaviour – they might become jittery because of a change in circumstances but might only lose their jitteriness some way beyond the reversal of whatever change originally led to their concerns. 18 We use the term ‘superposition’ deliberately, because it is the usual term applied in quantum mechanics to situations where the world can be viewed as being in a mixture of several states simultaneously. 19 However, by ‘any’ market behaviour we actually mean only ones that satisfy the axioms of additivity and scalability referred to earlier. This means that, in general, our models will also need to include elements relating to transaction costs, liquidity risk and other features that represent deviations from the idea that there is a single ‘market price’ for any given asset. We will also in general need to include in our model elements relating to investor utilities and hence to investor behavioural biases (some of which may of course interact with parts of the model driving asset behaviour, because the price of an asset is itself in part driven by such biases).
Regime Switching and Time-Varying Risk and Return Parameters
237
should only involve the state that the world was in at most two states previously, we might characterise the contingencies by a table relating the two-dimensional vector ( pt−1 , pt−2 ) with the two-dimensional vector ( pt , pt−1 ) where pt indicated the mixing coefficient applicable to the first state (i.e., the distribution characterised by s = 1) at time t etc. We immediately see a problem. Although the original Markov chain contingency table in Table 7.1 involved just two parameters, this one now involves an infinite number of parameters. Moreover, the transition characterisation will become even more complicated if we drop the assumption of time stationarity that is usually included in Markov chain modelling. The level of complexity involved is quite daunting, as can be gleaned from the sizes of some of the standard texts that try to explore some of the resulting complications; see, e.g., Sayed (2003). To make the problem tractable and even vaguely parsimonious we need to impose considerably greater structure on the possible ways in which the world might be able to transition between different ‘super-states’. We have here another example of a Bayesian perspective appearing in portfolio construction theory, here relating to the form of the model we wish to consider.
7.9 MORE HEURISTIC WAYS OF HANDLING FAT TAILS 7.9.1 Introduction If we do not feel comfortable with the level of mathematical complexity implicit in Sections 7.7 and 7.8 (or if we are worried that there may be spurious sophistication at work), there are several simpler and more heuristic approaches that we could adopt. We may first note that if all return opportunities (and combinations of them) are in a suitable sense ‘equally’ (jointly) fat-tailed then the optimal portfolios are the same as those arising using traditional mean-variance optimisation approaches (as long as the risk budget is adjusted appropriately). Only if different combinations exhibit differential fat-tailed behaviour should portfolio construction be adjusted to compensate. Even then, we also need to be able to estimate these differentials. The simplest heuristic approach is thus to decide that handling fat tails is just too difficult and either to revert to unadjusted mean-variance optimisation or to a purely judgemental approach (perhaps supplemented by use of reverse optimisation techniques). This stratagem might be particularly attractive to less mathematically-orientated practitioners. Some investment management approaches, however, necessarily require some sort of automated approach to portfolio construction. Moreover, even if we place full reliance purely on ‘human judgement’, what is the right way to ‘train’ or foster such judgement? A half-way house is then to try to cater in a simple way for the main factors that might be causing fat tails. We saw in Chapters 2 and 3 that the most important single predictable contributor to fat tails (at least for series that we might otherwise intrinsically expect to be approximately Normal) is time-varying volatility, and so we focus on this source of fat-tailed behaviour in Section 7.9.2.
7.9.2 Straightforward ways of adjusting for time-varying volatility One relatively straightforward way of taking into account a significant fraction of the fat-tailed behaviour observed in practice is as follows:
238
Extreme Events
1. Strip out the effect of time-varying volatility as best we can, most probably using a crosssectional adjustment as per Section 3.3.3 (because cross-sectional adjustments applied simultaneously to all return series seemed there to be more effective than longitudinal adjustments applied separately to each series in isolation). 2. Calculate/estimate the covariance matrix between the return series adjusted as per step 1. 3. Identify optimal asset mixes as we think most appropriate (using whatever variants from standard mean-variance, ‘robust’, Bayesian, Black-Litterman and so on described in Chapters 5 and 6 that we think are most appropriate), using the covariance matrix derived as per step 2. 4. Identify the ‘adjusted’ optimal portfolios by unwinding the adjustments made in step 1 from the optimal portfolios identified in step 3. Cross-sectional volatility adjustments as per Section 3.3.3 can be thought of as involving giving different weights to different elements of the historic time series (but giving the same weight to all data points from all time series at any particular time). Thus a similar effect to the above can be achieved merely by estimating the covariance matrix using weighted covariances; see Kemp (2010). Step 4 then involves adjusting the overall risk appetite we want to run according to the level of volatility deemed to be currently applicable. The position becomes more complicated if we apply longitudinal adjustments that are time series specific (which was the alternative type of time-varying volatility adjustment considered in Section 3.3.3). Different adjustments may then be applied to different series making the unwinding in step 4 more complicated. If the focus is on reverse optimisation, i.e., implied alphas, a similar approach can also be adopted. However, if the implied alphas are being principally used to provide a sense check that the portfolio is not horribly mis-positioned (as might be the case with view optimisation as in Section 5.10) then there may be less point or need to incorporate such refinements into the checking process. 7.9.3 Lower partial moments Another approach suggested by Scherer (2007) involves the use of lower partial moments (lpms). For a single return series, the lpm of the mth degree and with threshold return γ is defined as lpm(γ , m) = E (max (γ − r, 0))m
(7.21)
In the discrete case this can also be written as
lpm(γ , m) =
n
k=1
dk (γ − rk )m n
(7.22)
dk
k=1
Here dk = 1 if rk < γ and otherwise is 0. The denominator might instead be n rather than dk , depending on the particular use to which the lpm was to be put. In such a definition γ can be a fixed number (in some suitable currency), usually zero, in which case the focus would be on nominal capital protection. Alternatively, we might
incorporate other types of risk into the definition (see Table 7.2).

Table 7.2 Choice of threshold, γ

Threshold return type      Corresponds to objective type
Zero return                Nominal capital protection
Inflation                  Real capital protection
Risk-free rate             Minimising opportunity costs
Actuarial rate             Actuarial funding protection
Benchmark                  Avoidance of underperformance

As with most types of risk measure, lpms can be based on actual observations (ex-post, historic, sample based) or on how we believe the future will behave (ex-ante, forward-looking, population based). For multiple return series, there are several possible ways of defining lpms. For example, one way of generalising the lpm equivalent of variance (i.e., the case where m = 2) to provide the lpm equivalent of covariance would be to use the following equation (possibly with the denominator adjusted as above):
\mathrm{lpm}_{i,j}(\gamma, 2) = \frac{\sum_{k=1}^{n} d_k (\gamma - r_{i,k})(\gamma - r_{j,k})}{\sum_{k=1}^{n} d_k} \quad (7.23)
where r_{i,k} is the kth element of the return series for asset i and d_k = 1 if r_{i,k} < γ and otherwise is 0. Such an approach is asymmetric because lpm_{i,j} ≠ lpm_{j,i}. Scherer (2007) proposes an adjustment that results in a symmetric lower partial moment, by taking the individual lower partial moments and 'gluing' them together using ordinary correlation coefficients, e.g.:

\mathrm{lpm}^{\mathrm{symmetric}}_{i,j}(\gamma, 2) = \rho_{i,j} \sqrt{\mathrm{lpm}_i(\gamma, 2)\,\mathrm{lpm}_j(\gamma, 2)} \quad (7.24)
This approach may not be ideal, however, because the form of the dependency between different return series may not be well handled merely by reference to correlation coefficients. In practice, methodologies involving lpms still require estimation of significantly increased numbers of parameters if they are to cater effectively for fat-tailed behaviour, increasing the risk of over-fitting. Moreover, in Section 2.4 we concluded that skewness and kurtosis were not necessarily good ways of describing or characterising fat-tailed behaviour even for individual return series (because they could give inappropriate weights to different data points). Their multi-dimensional analogues, co-skew (also called co-skewness) and co-kurtosis, may be even less effective ways of characterising fat-tailed behaviour in multiple return series. Scherer (2007) suggests that 'best practice', if an lpm approach is to be used, is to specify possible candidate distributional forms using a small number of parameters, to find the form within this candidate type that best fits the data, and then to calculate lpms and thence optimal portfolios from these best fit distributional forms rather than directly from the observed return data. The approach suggested in Section 2.4.5 for rectifying some of the weaknesses evident in skew and kurtosis as descriptors of fat-tailed behaviour can be recast to fit within this framework.
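To make Equations (7.22) to (7.24) concrete, the following is a minimal sketch in Python (not taken from Scherer or from elsewhere in this book) of how sample lpms, the asymmetric co-lpm and a correlation-glued symmetric co-lpm might be computed for two simulated fat-tailed return series. The function names, the Student-t test data and the default threshold of zero are illustrative assumptions.

import numpy as np

def lpm(returns, gamma=0.0, m=2, denominator="shortfall"):
    """Sample lower partial moment of degree m with threshold gamma.
    denominator="shortfall" divides by the number of sub-threshold
    observations, as in Equation (7.22); "all" divides by n instead."""
    r = np.asarray(returns, dtype=float)
    d = r < gamma                                  # indicator d_k
    denom = d.sum() if denominator == "shortfall" else r.size
    return np.sum(d * (gamma - r) ** m) / denom

def co_lpm(r_i, r_j, gamma=0.0):
    """Asymmetric co-lower-partial-moment of degree 2, as in Equation (7.23);
    the indicator is driven by asset i, so co_lpm(i, j) != co_lpm(j, i)."""
    r_i, r_j = np.asarray(r_i, float), np.asarray(r_j, float)
    d = r_i < gamma
    return np.sum(d * (gamma - r_i) * (gamma - r_j)) / d.sum()

def symmetric_lpm(r_i, r_j, gamma=0.0):
    """Symmetric version, as in Equation (7.24): glue the univariate lpms
    together using the ordinary correlation coefficient."""
    rho = np.corrcoef(r_i, r_j)[0, 1]
    return rho * np.sqrt(lpm(r_i, gamma, 2) * lpm(r_j, gamma, 2))

rng = np.random.default_rng(0)
a = 0.01 * rng.standard_t(df=4, size=2000)         # fat-tailed return series
b = 0.6 * a + 0.008 * rng.standard_t(df=4, size=2000)
print("lpm_a(0, 2)   :", lpm(a))
print("co_lpm(a, b)  :", co_lpm(a, b))
print("co_lpm(b, a)  :", co_lpm(b, a))             # differs: the measure is asymmetric
print("symmetric lpm :", symmetric_lpm(a, b))

Running this shows co_lpm(a, b) and co_lpm(b, a) differing, which is precisely the asymmetry noted above.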
7.10 THE PRACTITIONER PERSPECTIVE One of the purposes of covering so many different techniques in this chapter is to highlight how dependent the answers of portfolio optimisation exercises can be on model choice. This is no different in principle to the position with pure risk modelling, because in either case a fundamental leap into the unknown occurs whenever we move our focus from the past to the future. Practitioner response to such topics will vary according to the practitioner’s own perspective on the likely reliability of choices that might be made in this context. Some choices may seem intrinsically more plausible than others, or just more akin to our experience of how the world has actually operated. We can attempt to mitigate some of this model risk by using ‘robust’ optimisation techniques, as per Chapter 6. However, we have seen that these also incorporate some of our own prior beliefs about how the world ‘ought’ to behave. The tendency in such circumstances may be to duck the issue and pay particular attention to deflecting the risk of ‘getting it wrong’ onto someone else. However, successful practitioners realise that ultimately this road eventually peters out. As we noted earlier, merely because portfolio construction is difficult does not make the need to do it go away. Someone (or some group of individuals) still needs to decide how to invest the assets. The rewards for doing so are commensurately attractive, because so many others are uncertain of their own abilities in this area. So too are the rewards for claiming to be competent even when you are not. The need at each stage in the investment value chain to justify that value-added is occurring (and, more importantly, is likely to continue to occur in the future) drives a large number of practical facets of the investment management industry. It is a global industry that is relatively fragmented with even the largest players managing only relatively modest proportions of the total world asset stock. New players are constantly arising, searching out new techniques for adding value. Some of these focus on niche investment markets, others on different investment strategies, still others on how best to combine investment ideas (i.e., on portfolio construction as understood in this book). The interplay between investor behavioural biases and the huge range of possible ways of modelling future market behaviour means that these facets of the asset management industry are unlikely to dry up any time soon, unless there is a catastrophe that wipes out a large part of the world’s asset base. New portfolio construction techniques will no doubt come to prominence. So too will new investment vehicle types within which investment exposures may be expressed. Success in the asset management industry often requires nimble business strategies as well as high quality investment management skills. Another trend noticeable in the quantitative investment arena is the greater emphasis placed on derivative pricing theory. We have already discussed some of the mathematical aspects driving this trend in Section 7.5. Nowadays, a ‘quant’ is more commonly associated with an individual well versed in derivative pricing techniques (particularly with a banking or market making background) than was the case 10 or 20 years ago, when the term might more usually have been applied principally to someone well versed in active quantitative investment management techniques, such as time series forecasting. 
Over this period we have seen growth in exchange traded funds, hedge fund replication, guaranteed products and other somewhat more banking orientated products encroach into the asset management space, further driving convergence between these two industries and hence in the types of skills most relevant for successful practitioners in either sphere.
7.11 IMPLEMENTATION CHALLENGES 7.11.1 Introduction Including time-varying parameters involves a significant leap in terms of mathematical complexity, creating several implementation challenges. These may loosely be grouped into those involved in (a) the actual solution of portfolio optimisation problems that involve RS models; (b) ways of incorporating derivative pricing techniques into the problem when they are relevant; and (c) ways of handling transaction cost complexities and particularly features relating to liquidity risk. 7.11.2 Solving problems with time-varying parameters The main implementation challenge with RS, as Ang and Bekaert (2002a) note, is that there does not appear to be any analytical solution available to Equation (7.12) even when our model involves switching between just two multivariate Normal distributions. Moreover, Ang and Bekaert are not aware of the existence of analytical solutions to similar continuous time problems. We must therefore solve the equations numerically. The method that Ang and Bekaert use for this purpose involves quadrature. This is a well developed technique for carrying out numerical integration; see, e.g., Press et al. (2007) or Kemp (2010). We in essence approximate E(A), i.e., the expected value of A(x) given a probability density f (x), using a set of weights, wk , applied to a set of, say, Q ‘quadrature points’, xk (k = 1, . . . , Q) as follows: E(A) =
\int A(x) f(x)\,\mathrm{d}x \approx \sum_{k=1}^{Q} A(x_k)\, w_k \quad (7.25)
In the case of portfolio optimisation, each x k corresponds to a different path that the world might take, and so we might instead call them ‘quadrature paths’. Ang and Bekaert (2003a) discretise the Markov chain component of the model by 1. Choosing a sufficiently large number of quadrature paths (randomly). 2. Choosing the regime dependent weights to apply to each of these quadrature paths that result in key characteristics of the conditional probability distributions (e.g., their means and covariances etc.) being well fitted in a numerical integration sense as above. Optimal asset allocations are then found as if the only paths that the world can take are the ones highlighted in step 1 and as if their probabilities of occurrence are as per step 2. The quadrature approach therefore converts the problem from a continuous problem into a discrete problem (which is therefore in principle solvable exactly, albeit typically only by using fairly involved algorithms). The approach used is akin to weighted Monte Carlo, which is discussed in Section 6.11.5. One advantage of a weighted Monte Carlo approach like this is that it is in principle relatively simple to cater for additional model complexities. For example, adding autoregressive elements can be catered for ‘merely’ by altering the distribution fitting element in step 2.
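As a simple illustration of the numerical integration in Equation (7.25), the sketch below (a toy example, not Ang and Bekaert's algorithm) approximates the expected power utility of terminal wealth for a single Normally distributed log return using Gauss-Hermite quadrature points and weights, and cross-checks the answer against brute-force Monte Carlo. All parameter values are illustrative assumptions.

import numpy as np

mu, sigma, risk_aversion = 0.05, 0.20, 5.0          # illustrative parameters

def utility(log_return):
    # Power (CRRA) utility of terminal wealth per unit invested.
    w = np.exp(log_return)
    return w ** (1.0 - risk_aversion) / (1.0 - risk_aversion)

# Quadrature: E[A(X)] ~ sum_k w_k A(x_k) for X ~ N(mu, sigma^2), as in Equation (7.25).
# numpy's hermgauss targets integrals against exp(-x^2), hence the rescaling below.
nodes, weights = np.polynomial.hermite.hermgauss(20)
x_k = mu + sigma * np.sqrt(2.0) * nodes
quad_value = np.sum(weights * utility(x_k)) / np.sqrt(np.pi)

# Monte Carlo cross-check.
rng = np.random.default_rng(1)
mc_value = utility(rng.normal(mu, sigma, size=1_000_000)).mean()

print("quadrature :", quad_value)
print("monte carlo:", mc_value)

With 20 quadrature points the two estimates agree closely, which is the sense in which a small number of well-chosen points and weights can stand in for a full distribution.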
If we do not wish to impose much structure on the problem, we might seek to use the flexible least squares approach suggested by Lohre (2009). This involves assuming that there is a factor-like structure to the world, but the factors, as epitomised by betas, are changing through time. We may estimate the betas (along with the rest of the model) using a model that penalises changes in the beta through time, e.g., using the following equation, where \Delta\beta_t is the vector (of length T − 1) formed by \beta_{t+1} - \beta_t and D is a matrix that penalises variation in beta:

Q(\beta_1, \ldots, \beta_T) = \sum_{t=1}^{T} \left( y_t - \sum_i \beta_{i,t} x_{i,t} \right)^2 + \lambda_{tv} \sum_{t=1}^{T-1} (\Delta\beta_t)^T D\,(\Delta\beta_t) \quad (7.26)
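The following is a minimal one-factor sketch of the flexible least squares idea in Equation (7.26), with a single regressor and D taken to be 1, so that the objective is quadratic in the betas and the minimiser is found by solving one linear system. The smoothing weight lambda_tv and the simulated data are illustrative assumptions, not Lohre's specification.

import numpy as np

def flexible_least_squares(y, x, lambda_tv=50.0):
    """Estimate time-varying betas for y_t = beta_t * x_t + e_t by minimising
    sum_t (y_t - beta_t x_t)^2 + lambda_tv * sum_t (beta_{t+1} - beta_t)^2,
    i.e. a one-factor version of Equation (7.26) with D = 1."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    T = y.size
    L = np.diff(np.eye(T), axis=0)        # (T-1) x T first-difference matrix
    A = np.diag(x ** 2) + lambda_tv * (L.T @ L)
    return np.linalg.solve(A, x * y)      # beta_1, ..., beta_T

rng = np.random.default_rng(2)
T = 300
x = rng.normal(0.0, 0.04, T)              # factor returns
true_beta = 0.5 + 0.8 * np.sin(np.linspace(0.0, 3.0 * np.pi, T))
y = true_beta * x + rng.normal(0.0, 0.01, T)
beta_hat = flexible_least_squares(y, x)
print("mean absolute error in estimated beta:", np.mean(np.abs(beta_hat - true_beta)))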
7.11.3 Solving problems that have derivative-like elements The flexibility of the weighted Monte Carlo approach can be particularly helpful if the portfolio problem has explicit derivative-like features. Most commonly, portfolios have fixed or nearly fixed benchmarks. With the increasing convergence of banking and asset management techniques mentioned in Section 7.10, however, it has become more common for portfolios to exhibit significant option-like features. For example, the portfolio might include protection elements that mean that it could move en masse into or out of risky assets depending on aggregate market movements. In a time-stationary world, the problem can be decomposed into alpha and beta components as per Section 5.4.4. In a time-varying world such as is implied by a regime switching model; however, this usually becomes less practical. This is particularly so if the linkage between asset variability and factors driving the portfolio’s overall exposure to risky assets itself depends on the asset in question. In such circumstances, we can still use a similar approach to that set out in Section 7.11.2. We need to change the regime transition arrangements to include such dependencies, but apart from this nicety the algorithm remains essentially unchanged. The key is to find suitable wk so that the resulting conditional probability distributions adequately fit our characterisation of how the future might behave. 7.11.4 Catering for transaction costs and liquidity risk There does not yet appear to be any consensus on how best to handle transaction costs in portfolio construction. The position becomes even more complicated when we stray into areas affected by uncertainty in future transaction costs levels, e.g., ones involving liquidity risk. Substantially different answers potentially arise depending on what we understand by liquidity risk. This is illustrated in Kemp (2009) who surveys some of the literature on liquidity risk. If liquidity risk is purely seen as the risk arising from uncertain mid-prices arising from (modest) transaction costs then the effects of including liquidity risk seem reasonably modest. However, if liquidity risk is to be equated with the possibility of a complete inability to transact in the relevant asset at essentially any price, the effects can be much more significant. The idea that markets could completely shut down for extended periods of time might have seemed fanciful just a few years back. This, however, is exactly what happened to some markets during the 2007–09 credit crisis, highlighting the very fat-tailed nature of liquidity risk and its tendency to throw up very extreme events from time to time. Whether it is ever likely to be practical to model effectively the likelihood of such risks occurring using techniques of the sort we have explored up to this point in the book remains an open question. Currently, the
tendency with liquidity risk is to focus on methodologies that are less explicitly quantitative and statistical in nature, along the lines described in Chapters 8 and 9. 7.11.5 Need for coherent risk measures Portfolio optimisation when distributions are non-Gaussian, i.e., not multivariate Normal, generally requires risk measures included in the utility function being maximised to exhibit certain features, if the results are not to be difficult or impossible to interpret sensibly. In particular, the risk measure needs to be sub-additive (which is one of the properties a risk measure needs to exhibit if it is to be coherent); see Kemp (2010). This means that ρ(x + y) ≤ ρ(x) + ρ(y), where the risk measure for a portfolio x is ρ(x). If the risk measure is not coherent then additional diversification may appear to add to rather than subtract from risk, a result that flies in the face of intuitive reasoning. This means that we would normally seek to avoid utility functions based on VaR, because VaR is not a sub-additive risk measure. This is the case even if all individual return series are Normally distributed in isolation. If these types of return series are knitted together using a non-Gaussian copula then we may still be able to find portfolios where added diversification increases rather than reduces VaR.
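The sub-additivity point can be made concrete with the standard two-bond illustration (a simpler construction than the Normal-marginals-plus-copula case mentioned above, but making the same point): each bond in isolation has zero 95% VaR because its default probability is below 5%, yet the two-bond portfolio has a strictly positive 95% VaR, so 'diversification' appears to add risk. The default probabilities and loss amounts below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
# Two independent bonds, each losing 100 with probability 4% (default), else 0.
loss_a = 100.0 * (rng.random(n) < 0.04)
loss_b = 100.0 * (rng.random(n) < 0.04)

def var(losses, level=0.95):
    # Historical-simulation VaR: the loss quantile at the given confidence level.
    return np.quantile(losses, level)

print("95% VaR, bond A alone:", var(loss_a))              # 0, since P(default) < 5%
print("95% VaR, bond B alone:", var(loss_b))              # 0
print("95% VaR, both bonds  :", var(loss_a + loss_b))     # 100, since P(any default) is about 7.8%
# VaR(A+B) > VaR(A) + VaR(B): VaR is not sub-additive, so adding a diversifying
# position can appear to increase measured risk.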
8 Stress Testing 8.1 INTRODUCTION The final chapters of this book move away from a more quantitative treatment of extreme events towards a more qualitative one. The reason for doing so is simple. Even a financial model that appears to have worked very well in the past may not necessarily work well in the future. No amount of clever quantitative analysis can circumvent this problem; it arises because the past is not necessarily a good guide to the future. Moreover, it is not just that we are travelling forwards in time in something akin to a fog, which we could see through if only we were equipped with a sufficiently advanced form of market ‘radar’. Financial markets do not primarily fit within the ambit of precisely defined external physical systems amenable to such tools. Instead they include feedback loops in which we (the observers) are implicated. So if it were possible to create a fully prescient toolkit then it would probably rapidly cease to function, as market participants adjusted their actions accordingly.1 Of course, that does not mean that we should ignore the past or reject the conclusions of any analysis based on it, merely because it might not perfectly describe the future. If we adopted this line of reasoning, a book like this might be viewed as largely a waste of time. Instead, we need to bear in mind that the past is an imperfect guide to the future, perhaps sometimes quite a poor guide, but at other times quite good, and, if possible, build into our mindset some resilience against over-reliance on its future accuracy. Put less mathematically, the challenge we face is not primarily how to quantify the magnitude of the impact that a given scenario might have on a portfolio, if the scenario is specified appropriately.2 Instead, it is our imperfect ability to identify the likelihood of the scenario occurring. This difficulty of estimating accurately the likelihoods of different scenarios has not been lost on regulators, client boards and other bodies interested in the sound management of portfolios. As a result, increasing attention has been placed in recent years on methodologies that do not place as much emphasis on correct computation of scenario likelihoods as some of
1 The idea that human actions are, in aggregate, entirely predictable as per, say, the ‘Foundation’ novels of the science fiction writer Isaac Asimov, is arguably just that, i.e., science fiction. Indeed, even in those novels a character arises who disrupts the orderly progression of human history that the founders of the Second Foundation predicted would occur. Nowadays, prevailing scientific wisdom is more closely aligned to the ideas underlying chaos theory (see Section 4.6). Indeed, even apparently very predictable systems can become quite chaotic over long enough timescales. For example, Laskar and Gastineau (2009) simulated the possible future trajectories of the planets of our Solar System over the next few billion years, in each case using initial conditions compatible with our present knowledge of Solar System trajectories. In one of their simulations a gravitational resonance coupling between Jupiter and Mercury becomes large enough to destabilise all the inner planets of the Solar System resulting in possible collisions of Mercury, Mars or Venus with the Earth! Fortunately, only 1 of the c. 2,500 simulations they analysed exhibited this property and the destabilisation only took place some 3 billion years from now. 2 If the scenario is expressed in terms of returns on or price movements of exposures we have within the portfolio and if we have invested in advance sufficient time and effort implementing a system that correctly highlights what exposures we have within our portfolio, in principle it becomes relatively straightforward to estimate the magnitude of the impact of any scenario (but see Sections 3.5 and 8.8).
the portfolio construction techniques discussed in earlier chapters. These go under the general heading of stress testing. In this chapter we explore some of the meanings that different commentators place on the term ‘stress testing’. We highlight some of the differences as well as similarities with more ‘statistically-orientated’ ways of modelling and managing risk. We also explore how we might take due account of such methodologies in portfolio construction techniques of the sort introduced in earlier chapters. Our focus in this chapter is in the main on scenarios that might be viewed as within the realm of ‘plausible’ outcomes for a portfolio. In Chapter 9 we will explore what if anything we can do regarding ‘really extreme events’.
8.2 LIMITATIONS OF CURRENT STRESS TESTING METHODOLOGIES Relative to other techniques described earlier in this book, stress testing focuses more on magnitude (if large and adverse), and what makes the scenario adverse, and pays less attention to likelihood of occurrence. A greater emphasis has recently been given to such techniques by risk practitioners, regulators and internal and external auditors. They are seen as complementary to VaR-like measures of risk. This is particularly true in areas such as liquidity risk that are perceived to be less well suited to VaR-like risk measures. More generally, portfolios (or risks) that are perceived to be less well diversified or nonlinear or skewed may be viewed as less suited to VaR-style risk measures, and so stress testing may be considered potentially more important in, say, bond-land and currency-land than equity-land. The basic idea behind stress testing is to identify a range of scenarios that in aggregate describe what might ‘go wrong’. The term ‘stress testing’ is used nearly synonymously by risk managers with the term ‘scenario analysis’, but arguably scenario analyses may also include focus on what might ‘go right’. Again, we see here a natural bias on the part of risk managers to worry about what might ‘go wrong’ (corresponding to what they are typically being paid to do). The fact that stress tests are complementary to VaR-type measures of risk, however, does not mean that stress testing provides a panacea when VaR-type measures of risk are deficient. Kemp (2009) highlights the following principle: Important tools in the armoury of practitioners wishing to mitigate against ‘model’ risk are to think outside the box and to carry out a range of stress tests that consider potentially different scenarios. However, stress testing should not be seen as a perfect panacea when doing so, since the stresses tested will often be constrained by what is considered ‘plausible’ in relation to past experience.
Kemp (2009) also highlights another, more subtle, limitation: virtually all stress testing as currently implemented focuses on loss merely up to the level of the stress rather than beyond it. Stress testing thus typically involves a mindset closer to VaR, in terms of measuring loss required to default, than to tail Value-at-Risk (TVaR), which in this context can be thought of as measuring average loss in the event of default. He notes that the two mindsets can create dramatically different answers, particularly for whoever picks up the loss beyond the point of
default.3 He highlights that one of the lessons of the 2007–09 credit crisis (and in particular of the default of Lehman Brothers) was the magnitude of the ‘beyond default’ loss that an entity can suffer. The unpredictable nature of where such losses might eventually come to rest seems to have been an important contributor to the subsequent reluctance of governments to allow other large market participants to fail.4 A corollary is that governments and regulators ought to have a strong interest in encouraging companies to focus more on TVaR than is done at present. Shareholders do not generally care about loss given default (except perhaps if it affects other holdings they might have in their portfolio). They will typically have lost all that they are going to lose once the loss has become large enough to trigger default. Any further losses are no longer their problem. Such losses are precisely the ones that governments and the like have, in extreme circumstances, found themselves shouldering, via bank rescue packages etc. Even some newer stress testing approaches, such as reverse stress testing or ‘testing to destruction’, may still give insufficient weight to consideration of what might happen in really extreme events. This is perhaps one of the drivers towards the introduction of ‘living wills’,5 which are designed to make it easier (and presumably therefore less costly) for governments to dismember a failed bank. Principle P40: Stress tests are important tools for mitigating ‘model’ risk, but do not guarantee success in this respect.
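The VaR versus TVaR distinction discussed above is easy to see numerically: on the same loss distribution, VaR reports the loss level at a chosen confidence level, whereas TVaR averages the losses beyond that level, and the gap between the two widens as the tail fattens. A minimal sketch on illustrative fat-tailed (Student-t) losses:

import numpy as np

rng = np.random.default_rng(4)
losses = 10.0 * rng.standard_t(df=3, size=1_000_000)   # illustrative fat-tailed losses

def var(losses, level=0.99):
    # Loss level exceeded with probability (1 - level).
    return np.quantile(losses, level)

def tvar(losses, level=0.99):
    # Average loss conditional on exceeding the VaR at the same level.
    v = var(losses, level)
    return losses[losses >= v].mean()

print("99% VaR :", var(losses))
print("99% TVaR:", tvar(losses))    # materially larger than VaR when tails are fat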
8.3 TRADITIONAL STRESS TESTING APPROACHES 8.3.1 Introduction Kemp (2009) notes that there are several different interpretations that regulators and commentators place on the term ‘stress-testing’, including (a) an analysis of the impact on the portfolio (or firm as a whole) arising from movements in specific market drivers, the sizes of these movements being considered to be appropriately within the tail of the plausible distribution of outcomes; (b) specific industry-wide stress scenarios mandated by a regulator and directly applicable in the computation of regulatory capital; 3 Exactly who picks up such losses, and in what proportions, can be tricky to identify in advance. For many industrial or manufacturing firms, such losses may be borne by bondholders or other creditors, which can include the tax authorities, employees, suppliers and customers who have bought something from the firm but not yet received in full the benefits accruing from what they bought. However, most entities operating in the financial services industry are regulated and often industry-wide protection schemes limit the impact that default might have on depositors, policyholders or other beneficiaries. Sometimes these protection arrangements are funded by other industry participants but either implicitly or explicitly there is typically government backing for these arrangements in the event of the protection arrangements running out of money. This means that if such defaults are large enough (and extensive enough) then usually, one way or another, the public purse picks up at least part of the tab, as happened during the 2007–09 credit crisis. 4 A large part of the subsequent debate on how to reform the financial system has revolved around the question of how to handle entities that are systemically important and thus deemed ‘too large to fail’. 5 The Bank of England prefers to call these documents ‘recovery and resolution plans’ (see Financial Times (2009a) or Bank of England (2009)). The FSA in FSA (2009) also uses this latter term, but accepts that colloquially these plans will also be called ‘living wills’. The essential argument for such plans is also set out in that paper, namely: ‘However, the FSA recognises that it is not possible, or desirable, to reduce the probability of failure to zero. The FSA must therefore be prepared for the fact that a systemically important firm – however well managed and however well regulated – may reach the point of failure and require the authorities to intervene. This means that, in extremis, the UK authorities must be able to resolve systemically important firms without systemic disruption and without putting the public finances at risk.’
(c) a greater focus on the sorts of configurations of market events that might lead to large losses, involving something of a mental adjustment away from purely probability theory and towards hypothetical scenario construction. We summarise these interpretations below, and also comment on how different in practice they are to more traditional VaR-like risk modelling. 8.3.2 Impact on portfolio of ‘plausible’ but unlikely market scenarios Typically a range of such scenarios would be considered in turn, e.g., oil price up 40%, equity market down 50%, a leading counterparty defaults and so on. There might also be some consideration of what are considered to be plausible combinations of such events. For such stress tests to be ‘reasonable’ they will naturally need to take some account of what is considered a sensible range of outcomes. This will ultimately be influenced in people’s minds by what has already happened. In a sense, therefore, they can be thought of as VaR-like in nature, using a range of different underlying VaR-models, one for each stress test considered. For example, a common way in practice to frame such scenarios is to identify every observed movement in the driver in question that has actually happened over some extended period of time in the past, to identify what impact each of these movements might have on the portfolio/firm as currently positioned if the movement was repeated now, and to use as the stress test the worst such outcome (or the second worst outcome. . .). This is called a (historic) worst loss stress test. Such a figure, however, could also be described as an ‘x-quantile historic simulation VaR’ (using a VaR model derived by considering the impact of just one single factor driver, namely the one involved in the stress test) where x is set by reference to the number of observations in the past dataset, to correspond to the worst observed outcome.6 We see that worst loss stress tests are thus VaR-like in derivation. One advantage of highlighting this equivalence is that it reminds us that the larger the number of past periods included in a worst loss computation, the more onerous a worst loss stress test is likely to be, because the dataset is more likely to include a really adverse outcome. Even if users in practice show a range of worst loss scenarios relating to different drivers, each one is still in isolation VaR-like in derivation. 8.3.3 Use for standardised regulatory capital computations Examples here would include many of the so-called ‘standardised’ approaches set out in Basel II and Solvency II. The ‘resilience’ reserving computations mandated by the FSA for UK insurance companies also use such methodologies – see, e.g., Kemp (2005) or Kemp (2009) – but they are not as sophisticated as the ones that will apply to such firms once Solvency II is implemented. For these stress tests to be intellectually credible their derivation will typically need to exhibit a suitable level of logical consistency. Although driven in part by the prior beliefs of whoever has formulated them, this means that their derivation will almost inevitably also take due account of what has actually happened in the past; see, e.g., Frankland et al. (2008). This is particularly true if the relevant ‘standardised’ framework involves several different stress 6
For example, if there are n observations then x might be set equal to 1/(2n), although see Kemp (2010) for other suggestions.
test elements and there is a need for the stress tests to be seen to exhibit consistency relative to each other. The natural way to achieve this consistency is again to build up each individual stress test in a VaR-like manner. The Standard Formula Solvency Capital Requirement (SCR) that will shortly be applicable to EU insurers under the new Solvency II rules is explicitly derived in this manner. Typically, each individual stress test focuses on just one or at most a few drivers at a time. It may also involve a combination of sub-stress tests (a particularly simple example would be if the stress test involves consideration of both an up and a down move and focuses only on the one that is more onerous). The overall standard formula SCR then combines stress tests applicable to different risks in a suitable fashion. Nesting stress tests built up in different ways for different types of risk perhaps offers some protection against model risk. A problem that it then creates, however logical is the derivation of each stress test in isolation, is how to combine the stresses. This is usually done via some sort of assumed correlation matrix, which takes account of potential diversification between different risk exposures. For example, if the stress test requirement for the ith risk is Si then the overall capital requirement might be set as follows, where ci, j is the assumed correlation between the ith and the jth risk:7 S=
\sqrt{\sum_{i,j} c_{i,j} S_i S_j} \quad (8.1)
For risks that are assumed to be highly correlated, the overall capital requirement will be close to additive, but for risks of similar magnitude that are assumed to occur largely independently of each other (or even to offset each other), a material diversification offset can arise via this type of formula. There may be further adjustments reflecting the risk absorbing characteristics of deferred taxes8 , profit participation arrangements9 and/or management actions10 . 7 The Solvency II framework differentiates between operational risk and other types of risk in this respect, in that in the Standard Formula SCR is set as SCR = BSCR + SCRop , where BSCR incorporates all risks other than operational risk, using a correlation matrix type of approach, and SCRop is the operational risk charge, i.e., there is no diversification offset assumed between operational risk and other types of risk. 8 For example, a firm may have some deferred tax liabilities that it would be due to pay were profitability to be as expected. In the presumed adverse conditions corresponding to the stress being tested, profits might be lower, reducing the amount of tax that the firm would need to pay. Under Solvency II, the risk absorbing characteristics of deferred taxes are limited to the amount of the deferred tax liability carried in the firm’s balance sheet as per International Accounting Standards (IAS) set by the International Accounting Standards Board (IASB) or other relevant Generally Accepted Accounting Principles (GAAP). 9 This type of adjustment is particularly important for life insurance companies because participating (i.e., with-profits) business is an important class of life insurance business in many EU jurisdictions. 10 Insurance contracts (particularly life insurance contracts) are often relatively long term in nature (or have guaranteed renewability features making them potentially long term in nature) but conversely often contain elements that allow some flexibility to the firms issuing the contracts to alter contract terms. For example, disability insurance may be provided on the basis that premiums payable by the policyholder may be reviewed by the firm at regular intervals depending on the aggregate costs of providing disability coverage to the pool of individuals being insured. If it is reasonable to assume that the firm will take advantage of these flexibilities then the impact of a given stress test may be mitigated by assuming that they are exercised. However, most jurisdictions also place legal restrictions on the extent to which a firm can arbitrarily modify contracts it has entered into. For example, if the firm has a history of never using these flexibilities, or of doing so only in a particular way, this custom may be construed as limiting the potential application of such flexibilities going forwards. There may also be other types of management action, such as adoption of hedging programmes, that can mitigate the impact of a particular deemed stress scenario, but these would usually only be allowed to offset capital requirements if it was reasonable to assume that the management action would actually occur. A firm without the necessary operational infrastructure would not necessarily be deemed capable of mitigating its capital requirements in such circumstances.
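A minimal sketch of the aggregation in Equation (8.1) is set out below; the stand-alone stress amounts and the correlation assumptions are purely illustrative and are not the Solvency II Standard Formula parameters.

import numpy as np

# Illustrative stand-alone stress requirements S_i (same monetary units),
# e.g. market, credit, life and counterparty risk modules.
S = np.array([100.0, 80.0, 50.0, 30.0])
# Illustrative correlation matrix c_{i,j}.
C = np.array([
    [1.00, 0.25, 0.25, 0.25],
    [0.25, 1.00, 0.25, 0.50],
    [0.25, 0.25, 1.00, 0.25],
    [0.25, 0.50, 0.25, 1.00],
])

aggregate = np.sqrt(S @ C @ S)        # Equation (8.1)
print("sum of stand-alone stresses   :", S.sum())
print("aggregate with diversification:", round(float(aggregate), 1))

With these numbers the aggregate requirement comes out materially below the simple sum of the stand-alone stresses, which is the diversification offset referred to in the surrounding text.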
The extent of correlation offset between different types of risk allowed for in such computations can be a particularly thorny issue between the industry and the regulator. This is because the estimation of an appropriate correlation matrix is very difficult. Arguably the focus should be on tail correlations11 rather than on correlations applicable to the generality of the distribution, and hence there may be few observations available that can inform the debate. Regulators may also, for example, have an overall target capital base that they want the industry to adopt and they may, to some extent, work backwards from there to identify the magnitudes of the stress tests and the diversification offsets that lead to this outcome. Superimposed on this may be a desire to avoid undue pro-cyclicality in regulatory frameworks; see, e.g., Kemp (2009). Regulatory frameworks are increasingly encouraging firms to adopt more sophisticated ways of measuring risk. Both Basel II and Solvency II allow firms to dispense, to a lesser or greater extent, with standard formula capital requirements and to propose to regulators ways of setting capital requirements more tailored to their own requirements. These are called internal models. We can view internal models as ways of setting capital requirements that are more fully VaR-like in nature, e.g., by placing greater emphasis on more accurately assessing how different contributors to risk might interact. Regulators typically expect the models to be used not only to set capital requirements but also for other business management purposes. This is the so-called use test. One reason for doing so is to limit the scope for firms to select artificially favourable ways of assessing their own capital requirements. Another reason is that one of the perceived merits of allowing firms to use internal models is to provide an incentive for improved risk management. This needs some link to exist between the internal model and what the firm actually then does in practice. An opposite problem that can arise with internal models is that there may be insufficient incentive for firms to take account of the possibility that they will be hit (singly or in tandem) by events that have no direct prior market analogue but which if they did occur might wreck the firm’s business model or even lead to firm default.12 These might include the ‘unknown unknowns’ or ‘black swans’ of, say, Taleb (2007). This leads us to a third interpretation of the term ‘stress testing’, and probably one that is gaining ground relative to the first two. 8.3.4 A greater focus on what might lead to large losses This type of meaning highlights a mental adjustment away from probability theory and towards hypothetical scenario construction. Ideally, such a form of stress testing involves some group of individuals starting with a clean slate and then brainstorming potential scenarios that might adversely impact the firm/portfolio. They would then endeavour to quantify what impact these scenarios may actually have (bearing 11 Malevergne and Sornette (2002) describe a technique for quantifying and empirically estimating the extent to which assets exhibit extreme co-movements, via a coefficient of tail dependence. 12 Another related issue is the possibility that firms might be incentivised to underplay potential fat-tailed behaviour in their internal models. 
Regulators would typically like there to be incentives to adopt internal models (because internal models should reflect more accurately the risks being run by individual firms). In the financial world money can be a particularly important driver of firm behaviour. A powerful argument, if it can be mustered, in favour of firms adopting internal models is if use of internal models typically results in lower capital requirements than standard formulae. Unfortunately, incorporation of additional fat-tailed behaviour within an internal model is likely to create higher rather than lower capital requirements. So, if reduction in capital requirements is the main driver behind adoption of internal models then we might expect those responsible for internal models to come under pressure to sweep under the carpet evidence of fat-tailed behaviour unless it is overwhelming. Of course, regulators are aware of the mixed impact that overly financial incentives can have on firm behaviour, and usually stress qualitative as well as quantitative benefits that might accrue from a firm introducing an internal model.
in mind how effective existing risk mitigating strategies might actually prove to be in such circumstances). The hope is that such a process may highlight potential weaknesses in existing risk mitigants, and thus suggest suitable refinements. The challenge is that typically stress testing ultimately still has, as its starting point, assumptions about how underlying markets might behave. If the target audiences for the stress tests (e.g., senior management) consider these assumptions to be unrealistic or too onerous (or not onerous enough), to involve implausible correlations between events or to lack credibility in some other way, they may place little credence on the chosen stress tests and information content contained within the stress tests may be lost.
8.3.5 Further comments The perception by recipients that stress tests provide the plausible boundary of what might happen in practice13 presents a big challenge to risk managers. Any stress test can always with hindsight be shown to have been inadequate, if something even more extreme comes along in the meantime. So, whenever something more extreme does come along we can expect there to be calls for additional stress tests to be added to the ones previously considered. We can see such a phenomenon at work in developments surrounding the 2007–09 credit crisis, particularly in the development of new liquidity requirements imposed on banks; see, e.g., CRMPG-III (2008), BCBS (2008) and FSA (2009a). The considerable pain that firms suffered as a result of a drying up of liquidity was typically well outside the spectrum of risks that their liquidity stress tests (that Basel II encouraged them to do) might have implied. The need is for stress tests to be more plausibly consistent with potential adverse actual market dynamics (particularly ones that are in the sufficiently recent past still to be fresh in people’s minds). There also seems to be a trend towards an interpretation of stress testing that sees it as a complement to, rather than a restatement of, VaR-like techniques. This is particularly true in areas where, for whatever reason, VaR-like techniques are less well entrenched, which can be because (a) the risks are more ‘jumpy’ or more ‘skewed’ in nature; (b) the regulatory framework has yet to develop standardised approaches to risk management in the area in question and regulators want to put more of the onus back on management to consider appropriately the risks of extreme events; and/or (c) there is a desire to encourage a mentality that ‘thinks outside the box’ to help highlight appropriate risk mitigants that might reduce the downside impact of existing exposures and to foster a business culture able to react more effectively to whatever actual stresses the future holds.
13 By implication, it is generally accepted that we do not need to include ‘hyper extreme’ events such as an asteroid striking the earth or a major nuclear war (at least not unless we work in civil defence parts of the government). These are much too doomsday to be typically considered worth bringing up in such contexts. However, there may be a grey area, involving substantial but not completely doomsday scenarios where it is less clear-cut what corresponds to the plausible range of adverse scenarios. For example, insurance firms, particularly reinsurers, do need to worry about major natural catastrophes such as hurricanes, earthquakes or flooding, and have in some cases been in the forefront of assessing what impact global warming might have on occurrence of such events. Another example that might no longer be viewed as not sufficiently plausible to be worthy of consideration is the risk of sovereign default of a major developed western nation to which a firm or portfolio might have exposure see, e.g., Kemp (2009).
Whatever the interpretation adopted, any stress testing requires the re-evaluation of an entire portfolio or firm balance sheet assuming the occurrence of the set of market events corresponding to the stress ‘scenario’. CRMPG-III (2008) notes the importance of suitable valuation disciplines in mitigating systemic risks and highlights the impact that challenges in the valuation of complex instruments might have in this respect. For example, CRMPG-III (2008) highlights the increased possibility of valuation disputes relating to collateral flows that might further exacerbate systemic issues. Sorting out these operational challenges may need a fair investment in time, money and effort, for example, to ensure that positions, exposures and valuations are available in a timely manner and in suitably consolidated/aggregated form. The more ‘complementary’ style of stress testing that seems to be coming to the fore typically requires a fair amount of subjective input, indeed potentially significantly more than is the case with VaR (particularly if the firm buys in a VaR system from a third party). It may be potentially more time consuming (and more IT intensive) to do well (see Section 8.8). Some would argue that this is a good thing; i.e., entities and the people managing them should be spending more time thinking about what might happen and how they might respond.
8.4 REVERSE STRESS TESTING CRMPG-III (2008) proposed that firms put greater emphasis on reverse stress testing. This idea has since been picked up enthusiastically by regulators; see, e.g., FSA (2008). The term ‘reverse stress testing’ seems to derive from the ideas of ‘reverse optimisation’ (see Section 5.10) or ‘reverse engineering’ (which involves analysing what a piece of hardware or software does to work out how it does it). The starting point in a reverse stress test would be an assumption that over some suitable period of time the firm suffers a large loss. For a large integrated financial institution of the sort that CRMPG-III focuses on this might be a very large multi-billion dollar loss. The analysis would then work backward to identify how such a loss might occur given actual exposures at the time the reverse stress test was carried out. For very large losses these would most probably need to include contagion or other systemic (i.e., industry-wide) factors or compounding operational issues that might not typically be included in a conventional stress test. Done properly, CRMPG-III (2008) argues, this would be a very challenging exercise, requiring the engagement of senior personnel from both the income-producing and the control functions in a context in which the results of such exercises would be shared with senior management.
CRMPG-III is particularly focused on issues such as systemic liquidity risk and counterparty risk and hence it is perhaps particularly apposite that it should focus on stress testing methodologies that might be expected to cater for industry-wide business-disrupting stresses of the sort that happen infrequently but are of very serious magnitude for large numbers of industry participants simultaneously. The concept, however, is not limited merely to very large organisations. For example, the FSA (2008) indicates that it is planning to require most types of UK financial firms to carry out reverse stress testing. This would involve consideration of scenarios most likely to cause the firm’s business model to fail (however unlikely such scenarios might be deemed to be). The FSA thought that senior management at firms less affected by the 2007–09 credit crisis had more successfully established comprehensive firm-wide risk assessment processes in which
thoughtful stress and scenario testing played a material part, allowing better informed and more timely decision-making. Principle P41: Reverse stress tests, if done well, can help us uncover under-appreciated risks and formulate plans for mitigating them.
8.5 TAKING DUE ACCOUNT OF STRESS TESTS IN PORTFOLIO CONSTRUCTION 8.5.1 Introduction It is relatively easy to see the attractions of stress testing, particularly to regulators. A greater focus on stress testing will typically involve a greater focus on extreme events and therefore, regulators might hope, on use of strategies to mitigate the likelihood of business model or even firm failure. It is not quite so simple, however, to identify how best to take account of stress tests in portfolio construction, particularly quantitative portfolio construction. This is because any meaningful risk-return trade-off must somehow put some utility on different outcomes, which necessarily involves some assessment of likelihood as well as magnitude. Nevertheless, there are some ways in which stress testing can directly inform portfolio construction (and some pitfalls for the unwary), which we explore further in this section. 8.5.2 Under-appreciated risks An important benefit of stress testing is that it can highlight risks that the manager is unaware of or has forgotten about. For example, a stress that simultaneously involves one market moving up and another moving down may highlight the reliance that the portfolio has to a hedging strategy that might not actually work effectively. This may encourage the manager to explore further just how reliable that hedge might be, or whether there are other better ways of hedging the risk in question without giving up (too much) potential return. From the manager’s own perspective, there may also be risks implicit in the way in which the portfolio is being run that are insufficiently rewarded. For example, if the strategy that the portfolio is supposed to be following is complex and is specified too precisely (e.g., it involves a rigorously defined dynamic hedging algorithm that the manager is explicitly agreeing to follow) then this may be opening up the manager to a level of operational risk that is excessive compared to the fees being charged. 8.5.3 Idiosyncratic risk Conversely, stress testing may lull the manager into a false sense of security. We can view stress testing as involving the definition of ‘important’ factors that might drive the future behaviour of the portfolio or firm in question. Ideally, we should include at least one stress for every ‘important’ factor that exists. In practice, however, only at most a moderate number of specific stresses will be defined. Moreover, we saw in Section 4.3.5 that factor based modelling has an inherent weakness. There are only a finite number of factors that we can possibly identify from past data alone (and fewer factors still that appear to be meaningful). Choice of idiosyncratic risk beyond a certain cut-off thus becomes essentially arbitrary.
This makes it possible for portfolios to have little or no aggregate exposures to each of the factors being stressed, but still to express material risks, if the risks in question only show up in the idiosyncratic risk bucket beyond the reach of the factors being analysed. Moreover, if we select a portfolio that optimises against risk as measured by the stress tests then we may merely end up steering the portfolio towards risks that are not well catered for by the stress tests, rather than to risks that are adequately compensated for by excess returns. An example of this in practice involves interest rate stress tests. These often involve either an across the board rise in yields or an across the board fall in yields, with the chosen stress being the movement that is the more onerous of the two.14 However, what about portfolios that are duration hedged, e.g., long a 5 and 7 year bond but correspondingly short a 6 year bond? Focusing only on stress tests that merely involve across the board rises or falls in the yield curve may lull us into thinking that such portfolios are low risk, when there are situations in which such an assumption can prove inaccurate.15 One possible way of limiting this tendency towards error maximisation rather than maximisation of risk-reward trade-off is to have the stress tests specified separately to whatever process is used to construct the portfolio. However, this may be impractical or undesirable if the stress tests are specified in advance by regulators or if the individuals putting together the stress tests are believed to have investment insight. We might try to limit exposures not well catered for in factor based stress tests by placing explicit limits on the amounts of exposures the portfolio can have to individual situations. This is similar in concept to some of the Bayesian-style position limits referred to in Chapter 6. We can make stress testing more ‘robust’ by imposing specific requirements on how we think portfolios ‘ought’ to be structured. Principle P42: Stress tests can engender a false sense of security, because there will always be risks not covered in any practical stress testing regime. This issue is particularly important if portfolios are optimised in a way that largely eliminates exposures to risks covered by the suite of stress tests being used, because this can concentrate risks into areas not covered by the stress tests.
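The duration-hedged example above can be illustrated with a first-order (duration-only) approximation: under a parallel yield shift the portfolio shows essentially no loss, whereas a much smaller curvature ('butterfly') move produces a material one, i.e., exactly the kind of risk a purely parallel-shift stress suite would miss. The position sizes and yield moves below are illustrative assumptions.

import numpy as np

# Long 5y and 7y bonds, short 6y, sized so the aggregate duration exposure is zero.
durations = np.array([5.0, 7.0, 6.0])
positions = np.array([100.0, 100.0, -200.0])                # market values
print("net duration exposure:", positions @ durations)      # 500 + 700 - 1200 = 0

def pnl(yield_moves_bp):
    # First-order P&L approximation: -sum_i value_i * duration_i * dy_i.
    dy = np.asarray(yield_moves_bp, dtype=float) / 10_000.0
    return -np.sum(positions * durations * dy)

print("parallel +100bp stress          :", round(float(pnl([100, 100, 100])), 2))   # approximately zero
print("butterfly 5y/7y +20bp, 6y -20bp :", round(float(pnl([20, 20, -20])), 2))     # material loss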
8.5.4 Key exposures Stress tests may also highlight exposures that managers or the entity itself are well aware of but are difficult to discuss effectively, because doing so involves uncomfortable questioning of the status quo. 14 The movement in yields does not need to be the same across the entire yield curve, but might vary according to payment duration. There is also the issue of whether the movement should be expressed as a percentage of the current yields (e.g., a 25% rise that might take the yields at some particular point along the yield curve from 1% pa to 1.25% pa or from 2% pa to 2.5% pa), as an absolute movement (e.g., a 30 bp shift that might take 1% pa to 1.3% pa or 2% pa to 2.3% pa) or in some more complicated fashion. In essence this depends on the assumed a priori dependence that yield curve movements might have conditional on prevailing interest rates, another example of the impact that Bayesian priors can have on portfolio construction. 15 Arguably a case in point involved the ill-fated hedge-fund Long-Term Capital Management (LTCM); see, e.g., Lowenstein (2001). LTCM had positions that sought to exploit the differential yield available on ‘off-the-run’ versus ‘on-the-run’ US Treasury notes of similar durations. In 1998, there was a sustained loss of risk appetite across many markets. Among other things, this resulted in this yield spread rising to unprecedented levels. Although such positions would eventually come good, the mark-to-market losses suffered in the meantime on these and other positions led to LTCM’s collapse.
For example, in many jurisdictions occupational pension schemes are discouraged from investing directly in their sponsoring employer or from building up too large a funding deficit that will only be able to be closed if the employer’s business remains healthy. If, in such circumstances, the sponsor defaults then pension scheme beneficiaries may suffer material loss, particularly if they were expecting that a substantial fraction of their post-retirement income would come from their pension benefits. The likelihood of the sponsoring employer actually being able to make good any shortfalls and/or to honour direct investments in it by the scheme is known more technically as the sponsor covenant. Beneficiaries have been potentially exposed to weakening sponsor covenants for many years. However, in recent times, investment markets (particularly, in the UK, equity markets) have proved weak, longevity has improved and (in the UK) schemes have been closing to new entrants. Sponsor covenant issues have come more to the fore. The advantages of specifically highlighting such risks via, e.g., explicit stress tests designed to indicate what would happen if the sponsor did default, are as follows: (a) The potential magnitude of such risks can turn out to be material, particularly if measured by the market implied default rate available from the employer’s credit default swap (CDS) spreads.16 Few organisations are in such good health that they have very low CDS swap spreads (if such spreads are observable in the marketplace). Indeed, even the CDS spreads applicable to the (external) debt of many western governments are now no longer immaterial, as markets have become worried about the possibility that some of them might default under the weight of debt they have been taking on board to mitigate the recessionary effects of the 2007–09 credit crisis. A scheme with a substantial deficit can therefore find that the sponsor covenant is one of the larger risks to which it is now exposed. (b) Unless such a risk is specifically highlighted there is a danger that it might end up being seen as too politically sensitive to be tackled effectively. Those responsible for the management of the pension scheme (e.g., the trustees in a UK context, but other jurisdictions may use other types of occupational pension structures with different governance arrangements) are often closely aligned with the sponsor. For example, they might be part of the management of the sponsor or be employed by it. They may therefore be keener than might otherwise be the case to prop up the sponsor (or at least not to pull support away from it) in circumstances where the sponsor itself might be financially stretched. (c) Even if the sponsor is currently in good health financially, times change and so do business conditions. In a few years time, the sponsor may not be in such a rosy position. Planning ahead for such an eventuality is a pension scheme equivalent of reverse stress testing as per Section 8.4.
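Point (a) above refers to market-implied default rates derived from CDS spreads. Under the rough 'credit triangle' approximation often used for back-of-the-envelope purposes, spread ≈ (1 − recovery) × default intensity, so an implied default probability over a given horizon can be sketched as follows; the 250bp spread and 40% recovery rate are illustrative assumptions.

import math

def implied_default_probability(cds_spread_bp, recovery=0.40, horizon_years=1.0):
    # Credit triangle: spread ~ (1 - recovery) * hazard rate, so
    # hazard = spread / (1 - recovery) and P(default by t) = 1 - exp(-hazard * t).
    hazard = (cds_spread_bp / 10_000.0) / (1.0 - recovery)
    return 1.0 - math.exp(-hazard * horizon_years)

for horizon in (1, 5, 10):
    p = implied_default_probability(250, recovery=0.40, horizon_years=horizon)
    print(f"implied default probability over {horizon} years: {p:.1%}")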
8.6 DESIGNING STRESS TESTS STATISTICALLY Although the underlying tenet of stress testing appears to be to place less focus on likelihood of occurrence, this has not stopped some researchers from trying to apply more advanced statistical techniques to the selection of stresses. For example, Breuer (2009) proposes an approach that, in effect, involves a ‘meta’ probability distribution that characterises the degree of plausibility of different possible scenarios. 16 The CDS swap spread is the (per annum) premium that would need to be paid to protect against the risk of default of the reference entity to which the CDS relates; for further details see, e.g., Kemp (2009).
Using it we can in principle objectively select stress tests to consider. We can even identify which scenario (with a given level of implausibility) generates the worst possible outcome for our portfolio (out of all possible stress scenarios). When developing his ideas, Breuer (2009) argues that stress tests should ideally exhibit three characteristics simultaneously, namely: (a) plausibility; (b) severity; (c) suggestive of risk reducing actions. He divides stress testing theory into three generations. In his opinion, 'first generation' stress tests involve a small number of scenarios that are picked by hand, ideally by heterogeneous groups of experts, say the set A = {s_1, . . . , s_q}. The focus is then placed on the one that generates the worst loss, i.e., max_{s ∈ A} L(s), where L(s) is the loss suffered if s occurs. He sees such tests as having two main weaknesses. They may neglect severe but plausible scenarios and they may consider scenarios that are too implausible. He views an improvement to be 'second generation' stress tests, which involve a measure of plausibility (a 'Mahalanobis' distance17) being ascribed to each scenario being tested. This measure is defined as follows, where V is the covariance matrix of the risk factor distribution and s is the vector of stress shifts corresponding to stress scenario s:

Maha(s) = \sqrt{(s - E(s))^T V^{-1} (s - E(s))}     (8.2)
According to Breuer, second generation stress testing theory then involves choosing scenarios from the set A∗ defined as the set of scenarios that have Maha(s) ≤ h for some suitably chosen threshold h. The focus with second generation stress tests is again on the worst case scenario within A∗. Assuming that V can be estimated reliably, such an approach should not miss plausible but severe scenarios, and would not consider scenarios that are too implausible. Breuer thought, however, that second generation stress testing still had weaknesses. For example, a Mahalanobis measure does not take into account fat-tailed behaviour and is susceptible to model risk. He therefore proposed an even more statistically sophisticated 'third generation' approach to stress testing, which he called 'stress testing with generalised scenarios'. In such an approach each individual stress test is generalised to involve an entire distribution of possible outcomes.18 Plausibility is defined by reference to the concept of relative entropy (see, e.g., Section 3.8.5). Within its definition is some distribution, say Q, which we can think of as a Bayesian prior for how we think the world will behave and which might be fat-tailed. Instead of choosing individual scenarios from A∗ we choose these generalised scenarios from A∗∗, the set of all distributions whose relative entropy versus Q is less than some specific cut-off value. 17 The Mahalanobis distance between two multivariate vectors x = (x_1, . . . , x_n)^T and y = (y_1, . . . , y_n)^T coming from the same distribution (with covariance matrix V), i.e., d(x, y) = \sqrt{(x − y)^T V^{-1} (x − y)}, can be thought of as a measure of their dissimilarity. If measured relative to the mean µ = (µ_1, . . . , µ_n)^T it can be thought of as how improbable a given outcome might be. A more familiar way of expressing this type of dissimilarity is via a likelihood or log likelihood function. Indeed, the Mahalanobis distance can be seen in effect to be the square root of the Gaussian log likelihood function. This implicit link to a Gaussian, i.e., multivariate Normal, distribution may be considered undesirable, given the strong linkage we have identified earlier in this book between extreme events (which are presumably the ones we should be most interested in with stress tests) and non-Normality. 18 A specific set of market movements can be thought of as an extreme example of such a distribution that is concentrated on just a single outcome, i.e., with 'support' (i.e., possible range of outcomes) concentrated onto a single point.
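To make the 'second generation' selection rule above concrete, the following Python sketch evaluates the Mahalanobis plausibility measure of equation (8.2) for a set of candidate scenarios and picks the worst loss among those with Maha(s) ≤ h. Everything here is hypothetical and chosen purely for illustration: the two risk factors, the covariance matrix V, the threshold h and, in particular, the assumption that the portfolio loss is linear in the factor shifts, L(s) = −w·s.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-factor world (e.g., equities and credit spreads), illustrative only
mu = np.zeros(2)                                   # E(s): expected factor shifts
V = np.array([[0.04, 0.015],                       # covariance of factor shifts
              [0.015, 0.0225]])
V_inv = np.linalg.inv(V)
w = np.array([0.7, 0.3])                           # assumed portfolio factor exposures

def maha(s: np.ndarray) -> float:
    """Mahalanobis plausibility measure, equation (8.2)."""
    d = s - mu
    return float(np.sqrt(d @ V_inv @ d))

def loss(s: np.ndarray) -> float:
    """Illustrative linear loss under scenario s (positive = loss)."""
    return float(-w @ s)

# Candidate scenario set A: here sampled at random; in practice hand-picked or gridded
A = rng.multivariate_normal(mu, V, size=10_000)
h = 3.0                                            # plausibility threshold

A_star = [s for s in A if maha(s) <= h]            # plausible set A*
worst = max(A_star, key=loss)                      # worst plausible scenario
print("worst plausible scenario:", worst, "loss:", round(loss(worst), 4))

For a linear loss function the worst plausible scenario can in fact be written down in closed form, s* = E(s) − hVw/√(wᵀVw); the sampling loop above simply mirrors the 'choose the worst scenario from a candidate set' description in the text and carries over unchanged to nonlinear loss functions.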
Breuer (2009) goes on to show how we can explicitly identify the ‘worst loss’ generalised scenario, i.e., the scenario that (for a given initial portfolio) leads to the maximum expected loss of any distribution with a relative entropy versus a given prior less than or equal to some specified cut-off. For example, he shows how we might derive a ‘worst case’ set of transition probabilities for differently rated bond categories, which can be interpreted as a stress test in which the expected losses experienced by a specific credit portfolio turn out to be y% rather than the x% we might have originally forecast. Whether such an approach will catch on is difficult to say. As mentioned earlier, it attempts to provide statistical rigour and hence a focus on likelihood of occurrence that seems at odds with the general direction in which stress testing has recently been developing. Moreover, the ‘worst loss’ generalised scenario is portfolio specific and may not therefore be an ideal way of identifying appropriate mitigating actions to consider applying to a portfolio. We already know the best way to mitigate a risk, which is not to have it in the first place. Presumably what we are ideally hoping to achieve with stress testing is to identify risks that we did not know we had (or that we had forgotten or been unwilling to admit that we had) and then to explore the best way of mitigating the worst outcomes that might arise with such risks (and only in extremis deciding to cut out exposure to these risks entirely).19 The approach described by Breuer (2009) does provide an interesting spin on one particular aspect of portfolio construction. Usually, we maximise the (expected) risk-reward trade-off, keeping the distribution of future outcomes that might apply fixed and altering the portfolio content. Here we reverse the process. We minimise the (expected) risk-reward trade-off, keeping the portfolio content fixed and altering the distribution of future outcomes. This reminds us of reverse optimisation (see Section 5.10), which involved identifying the distribution of future outcomes that needed to apply for a given portfolio to be efficient. We saw there that (forward) optimisation and reverse optimisation are effectively two sides of the same coin; one is the mathematical inverse of the other. It is just the same here. If, for example, we apply the generalised scenario approach described by Breuer (2009) not just to one portfolio but to many simultaneously, choosing the portfolio that has the best risk-reward trade-off, we end up back with a more conventional portfolio construction problem, but suitably averaged over a fuzzy set of possible distributional forms, similar in concept to resampled optimisation (see Section 6.9).20
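The 'worst loss' generalised scenario idea can also be approximated numerically. The sketch below is an illustration rather than a reconstruction of Breuer's (2009) own algorithm: it relies on the standard result that, among all distributions whose relative entropy versus a prior Q is at most some cut-off, the one maximising expected loss reweights Monte Carlo draws from Q in proportion to exp(θ × loss) for a suitable tilt parameter θ. The fat-tailed prior, the linear loss function and the entropy cut-off are all hypothetical.

import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo draws from the prior Q (here a fat-tailed Student-t, illustrative only)
factor_shifts = rng.standard_t(df=4, size=(100_000, 2)) * np.array([0.2, 0.15])
w = np.array([0.7, 0.3])                         # assumed portfolio factor exposures
losses = -(factor_shifts @ w)                    # loss under each simulated scenario

def tilted_weights(theta: float) -> np.ndarray:
    """Exponentially tilted scenario weights, proportional to exp(theta * loss)."""
    x = np.exp(theta * (losses - losses.max()))  # subtract max for numerical stability
    return x / x.sum()

def relative_entropy(p: np.ndarray) -> float:
    """KL divergence of the tilted weights versus the equally weighted prior draws."""
    n = len(p)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] * n)))

eta = 0.5                                        # relative-entropy cut-off defining A**
lo, hi = 0.0, 50.0                               # bisect on the tilt parameter theta
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if relative_entropy(tilted_weights(mid)) < eta:
        lo = mid
    else:
        hi = mid

p = tilted_weights(lo)
print("prior expected loss     :", round(float(losses.mean()), 4))
print("worst-case expected loss:", round(float(p @ losses), 4))

The tilt parameter θ plays the role of a Lagrange multiplier on the relative entropy constraint; as the cut-off shrinks towards zero the worst-case expected loss collapses back to the expected loss under the prior.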
8.7 THE PRACTITIONER PERSPECTIVE Of the topics considered in this book, stress testing generates perhaps the greatest difference in perspectives between different types of practitioner, because it highlights most starkly the contrasting outlooks of risk managers and investment managers. For risk managers, stress tests provide a mechanism to highlight scenarios in which the entity can suffer material pain. If communicated well, they can steer discussion to topics that the risk managers think are particularly important. As no likelihood is ascribed to any particular scenario there is little downside to the risk manager from postulating the scenario (except perhaps a reduced ability to offer up scenarios in the future, if the risk manager is 19 As has happened repeatedly in this book, we also need a Bayesian prior for the approach to work. In order to define what an implausible scenario looks like, we first need to define what we think constitutes a plausible scenario! 20 We might also, in a similar vein, seek to identify portfolios that maximise the 'worst-case' utility given specified uncertainty sets for the parameter estimates. This involves maximising \min_{r ∈ θ} r · x − λ \max_{V ∈ ϕ} (x − b)^T V (x − b), where θ and ϕ are specified uncertainty sets for parameter estimates for the mean and covariance matrix (with θ and ϕ possibly interacting).
perceived as championing a particular course of action by highlighting a stress test that proves unhelpful). Failure to take sufficient notice of a stress test can always be viewed as someone else's fault (as can taking too much notice of it if it does not prove relevant). Conversely, for investment managers or others who actually have to take decisions about how a portfolio or business will be structured, the lack of a likelihood ascribed to a particular scenario can be frustrating. They may take the view that it is all very well someone highlighting what might go wrong but far better still would be if that person could also offer some help in deciding what to do about it. Perhaps, therefore, risk managers working in the stress testing arena will gain most kudos if they focus primarily on areas where they can directly help in sorting out some of the issues that they highlight. For example, they might themselves help mitigate operational risks by highlighting ways in which such risks can be closed at modest or no cost. This should typically be a win-win situation, because few if any organisations want to incur more operational risk than they have to.
8.8 IMPLEMENTATION CHALLENGES 8.8.1 Introduction There are several implementation challenges with stress testing (other than deciding what stress tests to apply) including (a) knowing what is in the portfolio; (b) calculating the impact of a particular stress on each position within the portfolio; (c) calculating the impact of the stress on the portfolio as a whole (because it may not merely be the sum of the impacts calculated in point (b)); (d) being able to engage effectively with boards, regulators and other relevant third parties regarding the results of the chosen stress tests. We discuss each of these issues below.
8.8.2 Being in a position to 'stress' the portfolio The rest of this chapter implicitly assumes that it is relatively straightforward to work out the impact that a specific stress test has on a portfolio. Would that life were this simple! Working out what is actually within our portfolio sounds straightforward, but it relies on the relevant portfolio management system and back office platform being sufficiently sophisticated to provide an accurate picture of portfolio composition whenever it is needed. Issues that can arise here include the following: (a) What should we do about trades that are in the process of being booked and/or settled? However efficient the portfolio management system being used may be, the trade process usually includes time delays between the legal effecting of a trade (which might involve a telephone conversation) and its actual recording in a firm's books and records. For example, there are usually reconciliation processes in place to ensure that any such booking is accurate. If the
effective proportion21 of the portfolio that is not (yet) recorded properly in the portfolio accounting system is significant then the errors in the results derived by applying the stress test to the (thus wrong) portfolio may also be significant. (b) What should we do about trades that fail? These are trades that we thought we had executed but for whatever reason do not actually occur. (c) What happens if some trades are booked materially late or, worse, not at all? Such instances could be indicative of operational risks that might not affect 'market' or 'credit' risk stresses but might be as troubling if not more so for whoever is legally charged with the sound handling of the portfolio. A good asset portfolio management system also needs to take account of corporate actions, dividend/income payments and other income and expenditure items, and, for open-ended pooled vehicles, unit subscriptions and redemptions. With all these there may be time delays between the legal effecting of a transaction and its actual recording in the books and records applicable to the portfolio. This can, for example, materially affect the amount of counterparty exposure the portfolio has to the organisation with which it banks. Some of these delays may merely reflect the need for robust reconciliation processes, to minimise operational risk. The hope is that these delays are short and are the only ones occurring in practice. Likewise, with liability portfolios there is in principle a similar need for effective recording and accounting for cash flows and for changes to investor/policyholder/beneficiary entitlements. Usually the booking of these also includes time delays (which may also merely reflect the need for suitable reconciliation processes). 8.8.3 Calculating the impact of a particular stress on each position With asset portfolios, the portfolio accounting system may just refer to a particular holding by a name or security ID. This will not necessarily tell us how the instrument will react to a given stress. In addition to the security name or ID we may need (a) information about the characteristics of each instrument within the portfolio; and (b) given point (a), the ability to work out the value impact of a given stress on the instrument in question. At one extreme, the information needed for point (a) might be as simple as knowing whether it is an equity or a bond, because only the former would be influenced by a stress test that refers solely to equities. An issue is then what to do with instruments that express more than one type of exposure, particularly if the sizes of these exposures change in a dynamic or nonlinear fashion. For example, a far in-the-money equity call option is likely to be exercised, and so will tend to rise and fall in value in line with small changes in the price of its underlying. Conversely, a far out-of-the-money option is likely not to be exercised, and so its value (for the same notional option exposure) may be much less sensitive to changes in equity market levels. However, an out-of-the-money option may be much more exposed to the risk of an adverse change in equity market volatility than an in-the-money option (particularly relative to the market value of the option). Equity volatility levels do not affect the level of the equity markets per se, and so if
So all but one trade might have been sorted out fine, but if the one that has not been sorted out is large, unusual or disproportionate in terms of its contribution to a risk exposure in which we are interested, the end answer will be distorted.
the portfolio had substantial exposures to equity volatility, it would be necessary to consider stress tests applying to such exposures in addition to any referring merely to equity market levels. More generally, if stress testing is to be done well then we need an ability to value each instrument given an arbitrary market scenario (or at least some ability to work out the sensitivities of the instrument to the various possible market permutations we might consider). So stress testing in general requires a pricing engine. Risk models more generally also require pricing engines. However, for stress testing purposes we may not be able to piggy-back off any pricing model implicit in any other risk system to which we might have access. Often we will be trying to develop stress tests ourselves, whereas the computation of risk measurements like VaR is often carried out using a third party provided risk system, and the implicit pricing engine built into such risk systems may not be applicable (or accessible) to us when we are trying to carry out our own stress tests. Standard risk system pricing engines may also implicitly assume that only modest movements in currently prevailing financial conditions occur. This simplifies risk calculations because the impact that factor exposures then have on risk measures becomes approximately linear in nature. Such a simplification may not be too far from reality if we are considering the generality of likely portfolio movements. However, a linearisation assumption may be less appropriate for large movements of the sort that might be most likely to be tested for in a stress testing environment. We may also need a substantial amount of information on an instrument to understand fully how it might be affected by our chosen market stress(es). This level of detail may not always in practice be available to us. For example, if the instrument itself corresponds to a portfolio run by a third party then often we will not know what is in the portfolio (or may only come to know this after some substantial delay). The potential future behaviour of more complex instruments such as collateralised debt obligations (CDOs) may be very sensitive to the exposures embedded within them (as well as to the level of seniority of the particular tranche of the CDO that we have bought). The underlying exposures might also be actively managed, and so may change through time. Some structured investment vehicles (SIVs) may be referred to by names that do not immediately link back to their sponsors. Different debt vehicles issued off the same debt programme may have very similar names, but express significantly different exposures. An uneasy balance may then exist. Presumably the people most familiar with such instruments are the managers responsible for deciding to include them within the portfolio. Yet ideally we would want some independence between them and the risk managers charged with independent review of what is in the portfolio, even though this may involve cost duplication.
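The point about linearised pricing engines can be made concrete with a small sketch. The following Python fragment revalues a far out-of-the-money equity put under a severe combined equity and volatility shock using the Black-Scholes formula, and compares the result with a delta-only ('linearised') estimate. All parameter values, and the choice of a simple Black-Scholes pricing function, are illustrative assumptions rather than anything prescribed in the text.

from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf

def bs_put(spot, strike, vol, t, r=0.0):
    """Black-Scholes price of a European put (no dividends)."""
    d1 = (log(spot / strike) + (r + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    return strike * exp(-r * t) * N(-d2) - spot * N(-d1)

def bs_put_delta(spot, strike, vol, t, r=0.0):
    """Black-Scholes delta of the same put."""
    d1 = (log(spot / strike) + (r + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    return N(d1) - 1.0

# Illustrative far out-of-the-money put: spot 100, strike 70, 20% vol, 1 year
spot, strike, vol, t = 100.0, 70.0, 0.20, 1.0
base = bs_put(spot, strike, vol, t)

# Severe stress: equities -40%, implied volatility up 15 points
stressed = bs_put(spot * 0.6, strike, vol + 0.15, t)

# Delta-only ('linearised') estimate of the same stress
linear = base + bs_put_delta(spot, strike, vol, t) * (spot * 0.6 - spot)

print(f"base value      : {base:8.4f}")
print(f"full revaluation: {stressed:8.4f}")
print(f"delta-only      : {linear:8.4f}")

In this example the delta-only estimate captures only a small fraction of the change in value produced by full revaluation, which is precisely the kind of distortion that matters when the stresses being applied are large.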
8.8.4 Calculating the impact of a particular stress on the whole portfolio An implicit assumption that is often made is that the market value of the portfolio satisfies axioms of additivity and scalability (see Section 7.3.4). We generally assume that if a stress test reduces the market value of one unit of an instrument A by V(A) and reduces the market value of one unit of an instrument B by V(B) then the stress will reduce the market value of a portfolio consisting of one unit of each instrument, A + B, by V(A) + V(B). Some types of risk, however, particularly ones such as liquidity risk that have a transaction cost angle, do not respect these axioms. A very large portfolio is typically more than proportionately harder to liquidate than a smaller one. For example, doing so will typically generate
more market impact (see Section 5.7). Stress tests focusing on these types of risk need to take account of this potential nonlinearity. 8.8.5 Engaging with boards, regulators and other relevant third parties Risk managers can prepare stress tests, but how can they ensure that anyone takes note of their results? It helps to have the respect of your colleagues, for the work to be easy to assimilate and for it to be done to a high standard. There is also a more subtle point. Senior management is not merely responsible for the risks a firm or a portfolio might face, but also for the potential benefits that might accrue to interested parties from carrying exposures to such risks. Decisions that involve no trade-off are the easy ones to take – we should if possible not carry such risks. Decisions involving a real trade-off are much harder. Senior management might rightly conclude not to do anything even after receiving stress tests results that highlight a particular vulnerability, if the risks being run are more than compensated for by the potential excess returns accruing from the strategy. At issue is that different interested parties may have different views on what is the right balance to strike in this regard. For example, regulators and governments may want the balance to be more towards mitigating risk. Much of the debate about future regulatory frameworks for banks following on from the 2007–09 credit crisis has revolved around whether banking cultures, structures and regulatory frameworks unduly privatise profits but socialise losses. The incentive structures that these lead to, reaching even up to the top levels in such organisations, could then inappropriately skew behaviour towards undue adoption of risky strategies, at least as far as regulators and governments are concerned. Finance professors are quick to point out that a limited liability corporate structure introduces an inherent put option into corporate finance. It becomes more attractive to shareholders (and possibly employees) to pursue risky strategies, precisely because it is someone else who will pick up the costs if the strategy does not work. Risk managers aspiring to be seen to be producing useful stress tests should also, therefore, seek to appreciate these wider issues. Although it may be their job to provide risk-focused material, indeed at times to champion the risk perspective, they should aspire not to remain within some cosy silo that hardly interacts with the rest of the organisation. Instead they should be prepared to engage fully in the discussions regarding risk-return trade-off that underlie practical portfolio construction. This may sometimes involve appreciating that risk mitigation is not the sole aim in life! Principle P43: Effective stress testing requires effective engagement with the individuals for whom the stress tests are being prepared. Otherwise the informational content within them may be lost.
9 Really Extreme Events 9.1 INTRODUCTION In earlier chapters, we focused largely, though not wholly, on the use of quantitative techniques to analyse fat-tailed behaviour and on how to adjust portfolio construction accordingly. Even the last chapter, on stress testing, still had quantitative elements, even if it was not as explicitly mathematical as some of the earlier chapters. As an actuary who has headed up the quantitative research team at a major asset management house, I can be expected to have a bias towards a quantitative analysis of such problems. However, it is important to realise that a purely quantitative focus is not the only, or even, at times, the most important mindset to adopt when seeking to handle extreme events. This is particularly true for really extreme events. Such events happen so rarely (we hope!) that there will be few if any comparative observations present in any explicitly relevant dataset to which we might have access. Any assessment of the likelihood of such events will therefore almost certainly involve some fairly heroic extrapolations. That being said, there are still some important principles that we can piece together from what we have explored earlier in this book and from more general reasoning. It is the purpose of this chapter to synthesise these strands of thought and to offer observations that practitioners, researchers, students and general readers alike may find helpful as they navigate a world in which extreme events – indeed sometimes really extreme events – do occur from time to time.
9.2 THINKING OUTSIDE THE BOX One important element of a mindset that is able to cope well with really extreme events is to remember that though there may be few if any directly relevant prior examples of the sorts of events we are interested in, there may be analogues elsewhere that give us clues to work with. For example, a particular market may only have a short return history, too short to be of much use by itself when trying to form views about the likelihood of extreme events for that market in isolation. There may be, however, other information that is relevant to the problem. What about other markets? Although they may have idiosyncrasies that make them imperfect guides to the behaviour of the particular market in which we are interested, they nevertheless share some common attributes. For a start, they are themselves markets driven by investor supply and demand etc. Of course, as we saw in Section 2.2, not all 'markets' are equally relevant comparators in this context. If we are interested in the potential future behaviour of an emerging economy equity market then we might view the behaviour of other equity markets (particularly ones relating to other emerging economies) as more relevant guides than, say, the secondary 'market' in Paris theatre tickets. We, and others, will view some markets as having more closely aligned characteristics than others.
This observation highlights another important element to our desired mindset, which is to remember that not everyone thinks the same way as we do. Replaying the above example, it would be difficult to find anyone who really thinks that the behaviour of Paris theatre ticket prices is very representative of the behaviour of share prices for an emerging market economy.1 We might, however, more reasonably expect to find diversity of opinion about which alternative equity market(s) might be the best source of inspiration for the particular market in question. Such comparisons are the bread and butter of researchers and analysts attempting to predict forthcoming trajectories for such markets. The market prices of instruments, particularly ones involving longer-term cash flows, can be very heavily influenced by investor sentiment. For want of a better term, this can result in behaviour more often perceived by us as 'fashionable' rather than 'rational'. However, we again need to remember that to others it may be our behaviour that seems to be at the 'fashionable' end of the spectrum rather than theirs. Astute readers will have noticed that these observations correspond to two key principles that we have already articulated earlier in this book, namely Principles P11 and P13 (for ease of use, all the principles are collected together in Section 10.2). Several other key principles from earlier in this book for coping with extreme events are also relevant in this context. Here are just two: (a) Principle P2 highlights that the world in which we live changes through time. Markets change. Our (and other people's) perceptions change. What has been may come round again. Or it may not. Relationships that have worked well in the past may not always work well in the future. Portfolio construction is about weighing up alternatives. Combinations of positions that appear to diversify well relative to each other will therefore be preferred. However, what seems to diversify well most of the time may not do so in extreme circumstances; these are exactly the sorts of circumstances where hitherto hidden linkages can disrupt previously ruling relationships. (b) Principle P17 highlights the potential selection effects that can arise with active investment management, which can compound the effects in point (a). Not only may new, and hitherto unexpected, linkages appear that disrupt previously ruling relationships, but these linkages can simultaneously be well correlated to several strategies in which we are invested, particularly if our portfolio shares common features with many other portfolios and we are in effect following a crowded trade. Understanding as best we can how others are invested is an important aid in appreciating the extent to which we might be exposed to such types of extreme events. Principle P44: Handling really extreme events well generally requires thinking 'outside the box'.
1 Such comparisons may not be quite as inappropriate as they might look at first sight (which is not to suggest that there is any investment merit in this particular one!). For example, the market in question might potentially be heavily influenced by demand for luxury goods. This might also be an important driver for the emerging market in question; equity markets do not necessarily relate to the generality of the economic activity within a country, only the part to which stocks within that equity market relate. Moreover, there may be global macroeconomic drivers that influence both markets. If Parisians (particularly Parisian theatre owners) happened to be the main owners of the relevant equity market (nowadays unlikely, but not necessarily untrue in times gone by), there might be linkages through ownership structures that would be ‘unexpected’ to the uninitiated.
9.3 PORTFOLIO PURPOSE Unfortunately, all the above comments, worthy though they might be, are rather generic and abstract in nature. How do we actually apply them in the context of real life portfolio construction? Perhaps the most important starting point is to consider the underlying portfolio purpose. In very broad terms this will drive the overall risk appetite and hence risk budget. These in turn will drive the extent to which any specific level of risk might be deemed desirable. The extent to which investors are prepared to stomach extreme events seems to vary according to portfolio rationale. For example: (a) If properly explained, investors typically understand that many types of hedge funds are quite volatile in behaviour, and are possibly materially exposed to extreme events (albeit hopefully, as far as the investors are concerned, principally to upside extreme events). Rarely will investors feel so disappointed about performance that they will sue the hedge fund manager; they typically only consider doing so if performance delivered is extremely adverse and/or there is perceived to have been some element of fraud involved in its delivery. Of course, the business itself may have been wiped out long before then. Hedge funds seem to operate to a faster internal clock than most other investment types. They can start to gain large inflows even with a relatively brief track record (in comparison with other fund types). Equally, even a relatively short string of adverse results (or the perception that the manager has ‘lost his way’) can lead to the evaporation of their asset base.2 (b) Conversely, low risk portfolios, such as cash or money market portfolios, are not only typically expected to be low risk, but investors also typically seem to be less forgiving if they do suffer extreme events (even relative to the lower ‘Normal’ level of volatility we might expect from such portfolios). This link with investor behaviour is perhaps to be expected from Section 2.6, in which we highlighted that active management seems to be associated with greater exposure to fat tails. Paradoxically, the more the portfolio is being sold as being actively managed, the less risky (in terms of being sued) the management of it may be from the manager’s personal perspective. We also need to remember just how non-Normal the world can be in really adverse circumstances, particularly if there are possible selection effects present (which we argued earlier is potentially whenever active management is occurring). The analysis in Chapter 4 suggested that strong selection effects can potentially increase extreme event magnitudes several-fold relative to what we might expect assuming approximate Normality. 2 A further complication with hedge funds is that they are typically sold as ‘absolute return’ vehicles. Investors often make the implicit assumption that the fund manager will lower the level of risk being expressed in the portfolio whenever there is a market downturn or markets become ‘more risky’. Really talented managers can react this quickly, but others may be less effective at repositioning their portfolios in a timely fashion if market conditions deteriorate. Some of these sorts of effects can be seen within individual hedge fund and hedge fund index return series. Hedge fund returns appear to embed exposures akin to option prices, because managers do behave to some extent as if they are dynamically hedging payoffs akin to ones that are positive in rising markets but flat in falling ones. 
Such options in effect involve payment of premium away (corresponding to the cost that would be incurred when buying an option), a fact that sophisticated investors are well aware of. So, the manager needs to be really good (or more precisely needs to be perceived to be really good) to compensate. If investors lose confidence that the manager really is this good then they can be pretty unforgiving. A problem that can then arise (as some investors have found to their cost) is that dealing frequency (i.e., ‘liquidity’ in fund units) may be insufficient to allow a timely exit, particularly if other investors are also heading for the door at the same time (see also Section 2.12). This again highlights the importance of remembering that investor perceptions can change. For example, a fund that is in demand at the moment may not stay so if the manager moves elsewhere.
For portfolios sold as 'low risk' but still actively managed it is important to be conservative in terms of assumptions regarding possible extreme event magnitudes. Alternatively, there needs to be very good communication between the client and the manager to ensure that the potential for disappointment is adequately understood by both parties. This could involve tight parameters being placed on the contents of the portfolio, or include the manager making it very clear that the portfolio was not so low risk that it could not suffer substantially greater losses than might be extrapolated purely from past behaviour in reasonably benign conditions.
9.4 UNCERTAINTY AS A FACT OF LIFE 9.4.1 Introduction Another important element of the mindset needed to navigate extreme events successfully is to hold on to the truth that uncertainty is a fact of life. Not only do we not know exactly how extreme the next period might be (let alone subsequent periods!), we also kid ourselves if we believe that we necessarily have a good handle on the extent of volatility we might expect even in ‘Normal’ circumstances. In economics, this type of ‘uncertainty’ is typically known as Knightian uncertainty after the University of Chicago economist Frank Knight (1885–1972). The intrinsic characteristic of Knightian uncertainty is that it is immeasurable, i.e., not possible to estimate reliably. Knight (1921) highlighted the intrinsic difference between this type of ‘uncertainty’ and ‘risk’ (the latter being amenable to quantitative analysis): Uncertainty must be taken in a sense radically distinct from the familiar notion of Risk, from which it has never been properly separated. . . . The essential fact is that ‘risk’ means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating. . . . It will appear that a measurable uncertainty, or ‘risk’ proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all.
According to Cagliarini and Heath (2000), Knight’s interest in the difference between ‘uncertainty’ and ‘risk’ was spurred by the desire to explain the role of entrepreneurship and profit in the economic process. Knight’s view was that profits accruing to entrepreneurs are justified and explained by the fact that they bear the consequences of the uncertainties inherent in the production process that cannot be readily quantified. Where the risks can be quantified then it is generally possible to hedge them or diversify them away, i.e., they are not really ‘risks’ at all, or at least not ones that should obviously bear any excess profits. We see here a presaging of the Capital Asset Pricing Model (see Section 5.3.4). 9.4.2 Drawing inferences about Knightian uncertainty Merely because Knightian uncertainty is intrinsically immeasurable does not mean that we cannot make meaningful inferences about how it influences markets. To be more precise, we may be able to identify, to some extent, how we humans might react to it. Market prices are by definition set by ‘the market’, which ultimately means that they are determined by interactions between individuals like us. For example, Cagliarini and Heath (2000), when focusing on monetary policy-making, explore two possible ways in which Knightian uncertainty might influence decision-making.
These are characterised by the work of Gilboa and Schmeidler (1989) and Bewley (1986, 1987, 1988) respectively.3 Loosely speaking, the Gilboa and Schmeidler (1989) approach involves decision-makers conceptually formulating a set of alternative possible distributions that might characterise Knightian uncertainty. In such a model, decision-makers are still always able to rank different strategies (because it is assumed that they can always formulate a full range of possible alternatives that might characterise the Knightian uncertainty). Given some further assumptions, Gilboa and Schmeidler (1989) show that decision-makers would then choose the strategy that minimises the adverse impact of whatever they view to be the ‘worst case’ scenario. This provides a theoretical foundation for Wald’s (1950) minimax decision rule. In contrast, Bewley (1986, 1987, 1988) drops the assumption that decision-makers can in all circumstances formulate meaningful comparisons between different strategies. To make the analysis tractable, he then introduces uncertainty aversion and an inertia assumption, in which it is assumed that the decision-maker will remain with the status quo unless an alternative that is better under each probability distribution being considered is available. The two prescriptions imply different behaviours on the part of decision-makers. We can therefore test which appears to be closer to reality by comparing each prescription with what actually seems to happen in the real world. As Cagliarini and Heath (2000) note, the Gilboa and Schmeidler (1989) approach should in principle lead to quite volatile behaviour, epitomised in their study by frequent, albeit potentially small, adjustments to interest rates set by the relevant monetary policy-maker. In contrast, the Bewley approach, because of its inclusion of an inertia assumption, should involve less frequent shifts. The latter appears to be more in line with actual decision-maker behaviour, suggesting that inertia in the presence of uncertainty really is an important contributor to market behaviour. We might have reached this conclusion rather more rapidly, merely by reviewing how we ourselves tend to react to events. Who has not at some stage in their business or personal lives been a model of indecision and procrastination? Others may not share all our foibles and weaknesses, but they do share our intrinsic humanity, and therefore at least some of the behavioural biases that influence us in how we interact with others. Bewley (1986) also highlights several other economic insights that might support his notion of uncertainty aversion and inertia. For example, he argues that it might explain the reluctance to buy or sell insurance when the probability of loss is ambiguous and hence the absence of many markets for insurance and forward contracts. He also thinks that it may offer a possible solution to vexing questions relating to how to explain labour market rigidities. Indeed, he indicates that he ‘was led to Knightian decision theory by exasperation and defeat in trying to deal with these questions using the concepts of asymmetric information and risk aversion’ (Bewley, 1986). In particular, he highlights the merits of simplicity. Simpler contracts are easier to evaluate, and so should involve less vagueness and therefore less Knightian uncertainty. The existence of long-term contracts may also be explained by the bargaining costs resulting from ambiguity. 
We might expect such costs not to scale linearly through time and therefore it is desirable to avoid incurring them unnecessarily by use of repeated short-term contracts. A refinement of the Gilboa and Schmeidler approach is given in Tobelem and Barrieu (2009). Like an earlier paper by Klibanoff, Marinacci and Mukerji (2005), this includes an element 3 Each of these approaches can be viewed as involving relaxations to specific axioms underlying expected utility theory (see Section 7.3). Gilboa and Schmeidler (1989) can be thought of as involving the relaxation of the axiom of independence (see Section 7.3.2(d)) and Bewley (1986, 1987, 1988) as involving the relaxation of the axiom of completeness (see Section 7.3.2(a)).
linked to investor ambiguity aversion, i.e., an assumption that decision-makers show a more averse behaviour towards events where there is uncertainty regarding the underlying model than towards events that are merely ‘risky’ (i.e., where the underlying model is well known). This ambiguity aversion is similar to Bewley’s uncertainty aversion. However, Tobelem and Barrieu’s approach still appears to imply more frequent policy shifts than occur in practice, suggesting that inertia is an important factor to include in any modelling of how people behave in the presence of Knightian uncertainty. Principle P45: Markets are not just subject to risks that are quantifiable. They are also subject to intrinsic (‘Knightian’) uncertainty which is immeasurable. Even though we may not be able to quantify such exposures, we may still be able to infer something about how markets might behave in their presence, given the impact that human reaction to them may have on market behaviour.
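For readers who prefer to see the Gilboa and Schmeidler (1989) worst-case rule discussed above in computational form, the sketch below ranks a handful of strategies by their minimum expected utility across a set of candidate models, in the spirit of Wald's minimax rule. The candidate models, the strategies and the CRRA-style utility function are all hypothetical choices made purely for illustration.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical set of candidate models for next-period returns (the 'Knightian' set)
candidate_models = {
    "benign":   dict(mean=np.array([0.06, 0.02]),  vol=np.array([0.15, 0.02])),
    "stressed": dict(mean=np.array([-0.20, 0.01]), vol=np.array([0.35, 0.02])),
    "stagnant": dict(mean=np.array([0.00, 0.015]), vol=np.array([0.10, 0.02])),
}

# Candidate strategies: weights in (equities, cash)
strategies = {"aggressive": np.array([0.9, 0.1]),
              "balanced":   np.array([0.5, 0.5]),
              "defensive":  np.array([0.1, 0.9])}

def expected_utility(weights, model, risk_aversion=4.0, n=50_000):
    """Monte Carlo mean of a simple CRRA-style utility under one candidate model."""
    returns = rng.normal(model["mean"], model["vol"], size=(n, 2)) @ weights
    wealth = np.maximum(1.0 + returns, 1e-6)
    return float(np.mean(wealth ** (1 - risk_aversion) / (1 - risk_aversion)))

# Gilboa-Schmeidler / Wald rule: rank each strategy by its worst case over the model set
worst_case = {name: min(expected_utility(w, m) for m in candidate_models.values())
              for name, w in strategies.items()}
choice = max(worst_case, key=worst_case.get)
print(worst_case)
print("minimax choice:", choice)

A Bewley-style decision-maker would add inertia to this picture: the incumbent strategy would only be abandoned for a challenger that was better under every candidate model, not merely better in the worst case.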
9.4.3 Reacting to really extreme events The next four subsections include suggestions on how we might use the above insights to identify ways to react to extreme events.
9.4.3.1 Be particularly aware of exposure to positions that are sensitive to aggregate ‘market risk appetite’ The term ‘market risk appetite’ here refers to the aggregate willingness of market participants to invest in instruments or strategies that are perceived as ‘risky’ by most market participants. More precisely, it refers to positions that are risky and where the average market participant is ‘long’ the relevant risk. For example, high yield debt might be seen to be sensitive to the market’s risk appetite because prices for it may be high when credit conditions seem benign but may fall when credit conditions worsen and investors on average have long exposure to such debt. Although use of the concept of market risk appetite can be overstretched,4 it does seem to be an important contributor to market behaviour. There are several types of investor (e.g., hedge funds and/or proprietary trading desks) that seem to be particularly sensitive to perceptions regarding aggregate market risk. They may react quickly to liquidate positions when they believe that risk appetite is drying up, in part to limit their own exposure to liquidity risk. If the decline in aggregate market risk appetite is severe enough, or spread broadly enough, there may be little pressure to unwind such movements until the relevant ‘risk’ exposures represent undoubtedly good value for money. Investor inertia may result in those not immediately caught up in such a fire-sale staying on the sidelines until such a point is clearly reached. By such a mechanism, markets will tend to overreact to new information. Arguably, this is just another way of presenting the risks inherent in crowded trades (see Section 2.12.3).
4 An example of an overstretching of this concept might be if we use it as an arbitrary excuse for poor performance. We might conveniently postulate a ‘common’ linkage between positions that have performed poorly for us, as a way of explaining why the market as a whole has not thought fit to behave as we had hoped.
9.4.3.2 Be particularly aware of exposure to liquidity risk This, too, is a lesson that has been rammed home by the 2007–09 credit crisis. The appetite for taking on liquidity risk is a type of aggregate market risk appetite that is particularly ephemeral in times of stress. Discussion of market 'fundamentals' is somewhat superfluous. We are here talking principally about how market participants might react to changes in the level of 'uncertainty' about future market transaction costs, and more specifically to changes in the level of uncertainty about how other market participants might behave. Put like this, we again have all the hallmarks of a feedback loop, moreover one that we cannot easily model. 9.4.3.3 Keep things simple The more complicated our exposures are, the more difficult they may be for others to evaluate. Eventually, we may want (or need) to transfer our exposures to others. When that day comes, we will typically want to have the least possible Knightian uncertainty attaching to these exposures, to maximise their value in a world in which there typically seems to be uncertainty aversion. If we accept the premise of Knight, Bewley and others that entrepreneurship is inextricably linked to accepting Knightian uncertainty then perhaps our real goal should be to aim to accept exposures that others currently think are very uncertain, but that will either become less uncertain in due course or that we can make less uncertain in the meantime. 9.4.3.4 Bear in mind that short-term contracts may not be able to be rolled in the future on currently prevailing terms Long-term contracts do offer greater certainty. This has explicit value in a world typically prone to uncertainty aversion. Principle P46: Be particularly aware of exposure to positions that are sensitive to aggregate 'market risk appetite'. Be particularly aware of exposure to liquidity risk. Keep things simple. Bear in mind that short-term contracts may not be able to be rolled in the future on currently prevailing terms.
9.4.4 Non-discoverable processes We can also draw some more general conclusions from a Knightian analysis of uncertainty that resonate with other comments we have made elsewhere in this book. For example, Knightian decision theory as presented by Bewley (1988) highlights the essentially Bayesian nature of decision-making; it is Bayesian in formulation, except for the additional elements relating to uncertainty aversion and inertia.5 5 According to Bewley (1988), Knightian decision theory is also compatible with classical statistical approaches. For example, it can reproduce confidence limits that one might derive from ordinary least squares regression, but explains them in a different fashion. A corollary is that we can infer something about how uncertainty averse the typical statistician appears to be if we define such a person as one who focuses on, say, the 95th percentile confidence level; such an individual needs to be quite uncertainty averse. If our focus is on even more extreme outcomes, e.g., the 99th or 99.5th percentiles that a typical risk manager might focus on, the individual needs to be even more risk averse. This perhaps highlights why it may help to differentiate these types of individuals from investment managers and others whom we may want to be not so risk averse.
Bewley (1988) also explores why Knightian uncertainty might be more than merely a temporary phenomenon. The tendency is to assume that given enough information we can always work out how a system will operate, even one that is as complicated as the entire economy. Bewley (1988) notes that Numerous theorems demonstrate that the distribution of a stochastic process is learned asymptotically as the number of observations goes to infinity. Since economic life is always generating new observations, will we not eventually know nearly perfectly all random processes governing economic life? One obvious answer is that the probability laws governing economic life are always changing. But then, why are not those changes themselves governed by a stochastic law one could eventually discern?
Bewley rejects this assumption, by introducing the concept of discoverability and discoverable processes and by specifically identifying examples of processes that would be non-discoverable. In broad terms, discoverability involves the existence of a law characterising the behaviour of the observations in question, which in the limit as the number of observations increases, one can infer from sufficiently many past observations. A non-discoverable process is one where no such law exists, at least not one that can be identified merely from past observations. It should come as little surprise to us that such situations exist, given the comments in Section 4.7. We saw there that we could identify purely deterministic processes that were completely impossible to predict. Life is not in practice quite this unkind, but we can still agree with Bewley’s (1988) assertion that ‘there seems to be no sound reason for believing that economic time series necessarily have discoverable laws, the popularity of time series methods notwithstanding’. Although we can hopefully discover some regularities, we do need to be realistic in our attempts to discover regularities able to explain every facet of market behaviour. Not only may this be beyond our capabilities, it may be beyond anyone’s capabilities!
9.5 MARKET IMPLIED DATA 9.5.1 Introduction Similar somewhat mixed conclusions can also be drawn from consideration of market implied data. By this we mean information, such as market implied volatilities, that can be inferred from the market prices of instruments sensitive to the potential spread of trajectories that some aspect of the market might exhibit in the future. Commentators differ on whether market implied data is a better or worse predictor of future market behaviour than other more time series orientated methods applied to past market movements. Whichever it is, this is not necessarily relevant to whether it is appropriate to use such data for, say, risk management and capital adequacy purposes.6 Key is that the behaviour of market prices involves not only elements based on the ‘fundamental’ aspects of the instruments in question, but also on the ways in which market participants view such instruments. Even if we know nothing about some exposure, we might still be able to use market 6 For example, Kemp (2009) discusses analyses by Christensen and Prabhala (1998) and others, some of whom argue that implied volatility outperforms past volatility as a means of forecasting future volatility and some of whom argue the opposite. However, his main point is that when risk management is ultimately about placing a ‘fair’ price on risk, the most appropriate way of measuring risk is likely to be by reference to the price that might need to be paid to offload the risk to someone else. He argues that the need to put a ‘fair’ price on risk is paramount at least for capital adequacy purposes, because we should want capital requirements to apportion ‘fairly’ the implicit support governments etc. provide when regulated entities become undercapitalised or in extremis default.
implied data to tell us something about how (other) market participants view this exposure. In this sense market implied data (possibly from quite diverse markets and/or sources) can give us clues about how market participants might behave when faced with extreme events. 9.5.2 Correlations in stressed times One example of potentially using market implied data from one market to help us handle extreme events in another is in the area of tail correlations. One feature of the 2007–09 credit crisis was the extremely high implied correlation between defaults on individual credits that one needed to assume in order to replicate the observed prices of super senior tranches of CDOs and SIVs. These tranches should only suffer material losses in the event of multiple defaults in their underlying portfolios. At times during the crisis they traded well below par (to the extent that they traded at all!), and the level of default correlation was higher than many commentators argued could plausibly be reconciled with reality. Or rather, reconciliation was only plausibly achievable if one assumed a radical departure for developed market economies compared with their recent histories. Some of the explanation for this phenomenon probably related to issues highlighted in Section 9.4.3.3, e.g., the negative premium placed on complexity in uncertain times. However, some probably reflected a ‘catastrophe’ premium, i.e., a worry that developed market economies might collapse and/or there might be political revolutions and the like that could wipe out previously ruling relationships or even the capitalist system as a whole.7 When political stress occurs, or there is just a serious worry that it might occur, it is reasonable to expect that investors will become more vexed about extreme outcomes that might involve multiple assets or asset types all moving adversely in tandem. 9.5.3 Knightian uncertainty again Ideas relating to Knightian uncertainty can also be used to explain some of the behaviour of markets from which we might derive market implied data. For example, Kemp (1997, 2005, 2009) highlights the different ways in which option prices can diverge from those characterised by the Black-Scholes option pricing formulae. He notes that whenever there is intrinsic uncertainty in the volatility of the underlying, it becomes impossible to rely on the hedging arguments needed to justify the Black-Scholes option pricing formulae. The market still puts prices on such instruments, but the prices are preference dependent. By this we mean that there is no single ‘right’ answer for the price at which they ‘should’ trade. Instead prices include investor views and uncertainties about the potential magnitude of the spread of future trajectories that the instrument underlying the option might exhibit. Equity option implied volatilities typically exhibit a ‘smile’ or ‘skew’ in which the implied volatilities applicable to out-of-the-money put options are higher than those applicable to at-the-money options. This effect can be caused by a typical investor belief that large downward 7 Historians are often willing to point out that such fears are not quite as far-fetched as we might like. For example, some have argued that the French Revolution was spurred on by the Mississippi bubble and near bankruptcy of the French monarchy. Moreover, at times during the 2007–09 credit crisis investors really did seem to be concerned that there might be rioting in the streets. 
A worry that the financial system might suffer a meltdown (with all the consequential impact this might have on the fabric of society more generally) seems to have been at the forefront of policy-makers’ minds when they approved large-scale bailouts and support arrangements of systemically important financial services entities despite the potentially large drain on the public purse that these support arrangements may have created.
movements in equity prices are more likely than is implied by Normal distributions, by a typically heightened risk aversion against such outcomes or, most probably, both. 9.5.4 Timescales over which uncertainties unwind As far as portfolio construction is concerned, this sort of intrinsic uncertainty is impossible to model accurately. It is impossible to say with certainty how investor views will change in the future. This does not mean that we should not take views that are sensitive to such outcomes, merely that we need to appreciate that they might not prove correct. The timescale over which we are allowed to take investment views comes to the fore in such circumstances. For example, with short-term options, the option may have matured sufficiently rapidly for us not to need to worry unduly about the views of others (unless our plan had been to unwind our position at some interim time). If we know that we are going to hold the position to maturity then it should not matter much to us how the value of the position might move in the meantime. Unfortunately, there is a weakness in such logic. The assertion that we are sure to hold a position to maturity is itself expressing a view that might prove inaccurate in the face of Knightian uncertainty. Moreover, the Knightian uncertainty might be triggered by factors unrelated to the specific positions in which we might be interested. Kemp (2009) explores the application of this point to the question of whether it is appropriate for life insurers to reduce the values they place on illiquid liabilities. The logic usually given for doing so is that the insurer could purchase correspondingly illiquid assets offering higher yields that cash flow match the liabilities. He notes that this logic depends crucially on the assumption that the firm will not need to liquidate these assets prior to maturity; if forced liquidation does for any reason occur then the losses incurred in any corresponding fire-sale can be expected to be magnified if such a strategy is followed. As capital is held to mitigate the impact on policyholders of default, he argues that particular attention should be paid to the quantum of loss conditional on default (i.e., to TVaR rather than VaR; see also Sections 2.3.2 and 8.2). Loss conditional on default may be magnified by investing in illiquid assets.
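The distinction drawn above between VaR and TVaR can be illustrated with a few lines of Python; the simulated fat-tailed loss distribution below is entirely hypothetical.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative fat-tailed loss distribution (positive numbers = losses)
losses = rng.standard_t(df=3, size=200_000) * 0.02

def var(losses: np.ndarray, level: float = 0.99) -> float:
    """Value at Risk: loss threshold exceeded with probability 1 - level."""
    return float(np.quantile(losses, level))

def tvar(losses: np.ndarray, level: float = 0.99) -> float:
    """Tail VaR / expected shortfall: average loss conditional on exceeding VaR."""
    v = var(losses, level)
    return float(losses[losses >= v].mean())

for level in (0.95, 0.99, 0.995):
    print(f"{level:.1%}: VaR {var(losses, level):.3%}  TVaR {tvar(losses, level):.3%}")

For fat-tailed losses the gap between the two measures widens markedly as the confidence level rises; allowing additionally for fire-sale losses on illiquid assets in the tail scenarios would widen it further, which echoes the point made above about loss conditional on default.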
9.6 THE IMPORTANCE OF GOOD GOVERNANCE AND OPERATIONAL MANAGEMENT

9.6.1 Introduction

Throughout the above, we have made a rather important, but at times unwarranted, assumption. We have assumed that the decisions we intend to take are the ones that we actually do take. The assumption is not to do with actual outcomes; they are still uncertain. It is to do with whether we are in a position to capture the outcomes we think might occur. We are here contrasting intent with what we actually do.
From the perspective of risk management, this distinction is perhaps most noticeable in the emphasis that is often nowadays placed on governance. It is sometimes said that a good chief risk officer (CRO) does not need to be particularly intimate with all the quantitative tools and analyses that might be used to prepare risk statistics. Rather, a CRO needs to be able to manage individuals with a range of expertises (including some who are quite quantitative in nature), to be an effective communicator and to ensure that what is intended is what actually gets implemented.
In this section we explore some of the characteristics of good governance. However, for those who think that good governance by itself ‘solves the problem’ we would point out that governance will in part be deemed ‘good’ or ‘bad’ depending on the perceived eventual outcome and not just by how good or bad it appeared at the time. If the eventual outcome is sufficiently poor then those involved in governance in the meantime will struggle to avoid being held to account. 9.6.2 Governance models By ‘governance model’ or ‘system of governance’ we mean the way in which decisions needing to be taken are raised, discussed, reached and then implemented, and the checks and balances that exist to ensure that implementation is sound. Although there may be quantitative elements to such a system (e.g., provision of suitable metrics to allow management to decide on the relative merits of different courses of action), we are not here talking about a primarily quantitative or even financial discipline. The need for good governance applies just as much to manufacturing and other business types as it does to firms or organisations operating in the financial services arena. There are several elements that nearly all commentators agree need to exist for a sound governance model to be present. These include: (a) An appropriate understanding of the interrelationships between different risks. Problems (in our language, ‘extreme events’) are usually characterised by several risks showing up simultaneously; it is their occurrence in tandem that makes for a really adverse outcome. The need for a ‘holistic’ approach to risk management is epitomised by the growth of enterprise risk management (see Section 9.6.3). (b) An effective articulation from the organisation’s governing body of its risk appetite. This topic is explored further in Section 9.6.4. In the context of management of a client’s investment portfolio, the equivalent is an effective articulation of the risk characteristics that the portfolio is expected to adopt, including position or other risk limits applicable to the portfolio and what excess return it is hoping to achieve. (c) An effective ordering of the decision-making and decision-implementation processes applicable to the organisation or portfolio in question (see Section 9.6.5). This involves clear allocation of responsibilities and duties, suitable back-up procedures (for when expected implementation protocols are, for whatever reason, disrupted) and appropriate checks and balances to avoid the people and processes being diverted onto other activities (or in extreme situations being deliberately manipulated for questionable or illegal purposes). Many individual failures in the financial services industry have been pinned wholly or in part onto failures in governance arrangements. Poor governance across an industry as a whole can also contribute to systemic crises.8 It is therefore not surprising that regulators around the 8 An example might be the US savings and loan crisis of the 1980s and 1990s. This crisis involved large numbers of failures of savings and loans associations (S&Ls). S&Ls are financial institutions somewhat akin to the building societies found in the UK and some Commonwealth countries specialising in accepting savings deposits and making mortgage loans. Some estimates put total losses at around $160bn, most of which was ultimately paid for by the US taxpayer. 
One perceived cause of the crisis was a widening of their investment powers via the Depository Institutions Deregulation and Monetary Control Act 1980 without at the same time imposing adequate regulatory capital requirements on S&Ls. The changes introduced new risks and speculative opportunities that many managers pursued, even though they may have lacked the ability or experience to evaluate or administer these opportunities.
globe seek to ensure (by the way in which they regulate firms) that lessons learnt previously or in other contexts do not go unheeded by the organisations that they regulate. We are not here primarily talking about risks intrinsic to the markets in which the organisation operates, or even how much capital a firm might need in order to be allowed to operate in its chosen business area. Instead we are focusing primarily on risks that can be ‘avoided’ by adopting good organisational management principles. For example, the UK’s Financial Services Authority (FSA) requires firms that it regulates to have an articulated risk appetite. Firms regulated by the FSA are also required to have an apportionment officer (usually the Chief Executive) who is formally responsible for ensuring that someone is clearly responsible for every defined (regulated) activity undertaken by the firm, and that overlaps are suitably catered for.9 Under the Solvency II rules for EU insurance companies coming into force in 2012, EU insurers will need to carry out a comprehensive Own Risk and Solvency Assessment (ORSA). Insurers will potentially be penalised for bad governance structures via capital adequacy add-ons. Global banking regulation, as epitomised by Basel II etc., has similar elements. Does perceived good governance actually result in better handling of extreme events? Unfortunately, it is merely a precursor to, and not a guarantee of, success, as we seek to navigate through the vagaries of real life markets and business conditions. We have already noted that there will be a tendency for after-the-event dissection of management actions if outcomes are overly negative, irrespective of how well organised and formulated the decisions contributing to such outcomes may have appeared to be at the time. Good governance is like risk budgeting, i.e., sound common sense and the right mindset to adopt, but not necessarily foolproof. 9.6.3 Enterprise risk management The importance of good governance, as applied to UK insurance companies, is highlighted in Deighton et al. (2009). The authors see this topic as strongly linked to enterprise risk management (ERM). They do not primarily seek to make the business case for enterprise risk management. Rather, they start from the premise that it is perceived to be intrinsically desirable, and they then seek to identify how best to implement it. Over the last couple of decades organisations of all types, but especially firms and other organisations operating in the financial services arena, have placed increasing emphasis on risk management in its various guises. This is not to suggest that companies were not previously practising risk management. Many of the techniques and activities that we might nowadays explicitly associate with risk management have been practised for a long time, but perhaps not by specialist ‘risk managers’ as such. Instead they may have been implicit in the ways in which firms organised their affairs, and embedded more in individual business line managers’ day-to-day roles. ERM takes this increasingly institutionalised role for risk management up a further notch. It aims to look at risks of all types in a holistic manner. 
In other words, it aims to look at risk from the perspective of the whole organisation (although not necessarily in just a ‘top-down’ 9 The relevant part of the FSA Handbook, i.e., the part on ‘Senior Management Arrangements, Systems and Controls’ (known as ‘SYSC’), also covers areas or processes where a firm is expected to have adequate controls, or which themselves act as control processes, including organisational structure, compliance, employee remuneration, risk management, management information, internal audit, strategy, business continuity and record keeping.
manner) and in particular at how risks of various types (and perhaps across various geographies or business units) might interrelate with each other. It may also focus on the perspective of shareholders as well as of the firm itself; see, e.g., Hitchcox et al. (2010). The Committee of Sponsoring Organisations of the Treadway Commission (2004) argues that ERM involves the following: (a) aligning risk appetite and strategy, e.g., when evaluating strategic alternatives; (b) enhancing risk response decisions, by providing rigour when selecting among different responses to different types of risk, such as risk avoidance, risk reduction, risk sharing and risk acceptance; (c) reducing operational surprises and losses, via an improved ability to identify potential adverse events and to formulate responses to such events, thereby reducing surprises and associated costs or losses; (d) identifying and managing multiple and cross-enterprise risks (the ‘holistic’ element of ERM); (e) seizing opportunities, by giving management sufficient information and structure to be able to identify and proactively realise opportunities to leverage the organisation’s risk budget; (f) improving deployment of capital, by ensuring that risk information is robust enough to be able to provide a full assessment of the organisation’s capital needs and where those needs are coming from. From this description we can see the strong alignment between ERM and risk budgeting. In effect, ERM is risk budgeting but specifically applied to the entire firm. It involves the same underlying trade-off between risk and reward, but reformulated into less technical language and made more generically applicable to any type of organisation. The increased focus placed by individual firms (and regulators) on ERM has been mirrored by an increased focus from credit rating agencies. For example, in 2005, Standard and Poor’s included a formal evaluation of ERM as the eighth pillar of its rating process. According to Deighton et al. (2009), Standard and Poor’s rating approach (for insurers) focuses on five key areas of ERM: risk management culture; risk controls; emerging risk management; risk and capital models; and strategic risk management. Based on senior level interviews, review of relevant reports etc. and site visits, it arrives at an ERM classification that can be ‘weak’, ‘adequate’, ‘strong’ or ‘excellent’. As one might expect, the majority of firms apparently fall into the middle two classifications.
9.6.4 Formulating a risk appetite In Chapter 5 we saw that to find an optimal portfolio we needed to articulate a risk appetite, which there involved a parameter that allowed us to identify the optimal trade-off between risk and reward. The need to formulate an appropriate risk appetite also features in enterprise risk management and hence in good governance. It is a key element of any ERM framework. Without it we cannot align strategy with risk appetite, and so the setting of business strategy reverts to a vacuum. It even becomes impossible to work out whether the current strategy is suitable for the business. We have nothing against which to benchmark it.
In this context, ‘risk appetite’ might be defined, as per Deighton et al. (2009), as a combination of (a) the level of acceptable risk, given the overall appetite for earnings volatility, available capital, external stakeholder expectations (which could include return on capital), and any other defined objectives, such as paying dividends or particular ratings levels; and (b) the types of risk which the [Group] is prepared to accept in line with the control environment and the current market conditions.
Unfortunately, there is a possible weakness with such a definition. It might be construed as implying that there is one single way of measuring risk. In practice, as we have already seen in Chapters 5 and 7, there are not only several different ways of measuring risk but also several different vantage points from which to measure risk, potentially epitomised by different minimum risk portfolios. Risk also has a time dimension. So, therefore, may our chosen risk appetite. Risks that take a long time to gestate may be easier for an organisation to handle than ones that come suddenly, even if the quantum of loss is the same in either case. Some strategies (e.g., raising new capital or otherwise altering business philosophy) become more practical if they do not need to be rushed. 9.6.5 Management structures Many writers have opined at length on the merits of different management structures. Different norms may be adopted in different countries. For example, in the UK the roles of Board Chairman and Chief Executive are often assigned to different individuals, whereas this is not so common in the US. In some EU countries the Board is itself split in two, into a Supervisory Board and an Executive Board. So it is with the parts of management structures that are focused on risk. However, a common subdivision might envisage three different strands, namely: (a) risk management, which might principally lie with front line managers; (b) risk oversight or control, which might consist of an independent team overseeing the risks being incurred by those responsible for role (a); and (c) risk assurance, which might involve independent or nearly independent assurance from ‘neutral’ parties (e.g., internal or external auditors) that the risk management environment is operating effectively. How boundaries are defined between these roles will depend partly on the organisation in question, and partly on the industry in which it operates. For example, in the asset management community it has become increasingly common for independent risk monitoring units to be established to fulfil role (b), alongside internal and external audit teams who, with the Board, might be viewed as fulfilling role (c). In such a model it is the ‘front office’, i.e., the managers and/or traders, who are responsible for role (a), although the individuals in question may prefer to argue that ‘risk management’ is being carried out by role (b) (particularly if something has just gone wrong!). Such demarcations also inherit all the usual management issues implicit in any formalisation of a business process. For example, Deighton et al. (2009) note that much of the formal review of items on the risk agenda may be discussed and agreed at a variety of risk committees. They note that this is ‘both good and bad; good, in that there is a clearly targeted and focused agenda to deal with risk issues, but bad, in that risk is perceived to be “covered” by this committee, and hence no-one else need concern themselves about it’.
Given the committee structures often adopted for such purposes, it is important for any individual responsible for teams implementing ERM to be well versed at navigating around such subtleties, a good communicator and leader, and skilled at ensuring that adequate resources are made available for the task in hand. The need for detailed quantitative modelling expertise of the sort we discussed in earlier chapters is not so obviously essential for such an individual, as long as the person understands the strengths and weaknesses of the models in question and can provide analysis and insights that help the risk/reward trade-off to be optimised. Other senior management will also want the head of any risk function to be well-versed in the art of effective documentation and in avoiding creating hostages to fortune. They will want risk management to be seen at least as unarguably comprehensive. The one thing worse than having the market or something else throw out a surprise is for this to happen and for the organisation’s management to be deemed to have been asleep on the job at the time. 9.6.6 Operational risk The tendency with ERM is to view all types of risk as fungible. In a sense this is true, because any loss of a given quantum is just as disruptive to the bottom line as any other. In another sense, however, it is false in relation to operational risk, as is explained in Kemp (2005) and in Section 8.7. Operational risk has two main characteristics that differentiate it from other types of risk, which we discuss in the next two subsections. 9.6.6.1 Operational risk is largely internal to the organisation in question By this we mean that every single organisation will have its own unique pattern of operational risks to which it is exposed, depending on its own precise business model and how it structures itself to carry out its business activities. Of course, similarities will exist between different firms. Identifying and learning from these similarities are the bread and butter of third party consultants who may advise firms on how to mitigate their own operational risks. But equally no two firms are identical, and so exposures that might exist in one firm may be reduced or magnified in another. In contrast, all firms (or portfolios) investing the same amount in the same marketable security largely speaking acquire the same risk exposure. Different firms or portfolios may hedge these exposures differently, but typically the securities when viewed in isolation are fully fungible, in the sense that if we swap one unit of the exposure for another unit of the same exposure then our overall exposures do not alter. 9.6.6.2 Operational risk usually has no upside Firms are generally not rewarded for taking operational risk.10 In contrast, we might postulate that there is some reward for taking market or credit risk, or at least we might sell our management skills on the premise that there ought to be such a reward.
10 There is no ‘reward’ as such from carrying ‘unnecessary’ operational risk. The only type of operational risk that a firm will really want to carry is what it absolutely has to in order to provide the services it is offering to its customers. A possible exception would be insurance companies who might offer insurance against operational risks, but even they would prefer if possible for the firms they insure to be exposed to as little operational risk as possible, as long as it did not reduce their own profitability. Operational risk managers, auditors, software vendors and the like also might be viewed as having a vested interest in there being operational risk to manage, but we would not expect them to try to add to the level of operational risk to which their own business might be exposed. Instead, we would expect them to be keen to offload whatever operational risk there is onto others, if possible.
Principle P47: Good governance frameworks and operational practices may help us implement efficient portfolio construction more effectively. They can help to bridge the gap between what we would like to do and what actually happens in practice.
9.7 THE PRACTITIONER PERSPECTIVE How might we apply insights set out earlier in this chapter in practical portfolio construction? Again, as in Chapter 8, the challenge is that effective portfolio construction actually needs some assessment of likelihood of occurrence as well as magnitude of impact. Much of what we have covered in this chapter also deliberately sidestepped likelihoods. Without likelihoods any attempt to define a risk-reward trade-off and then to optimise it becomes hugely problematic. The tendency may therefore be to duck the issues, leaving someone else to specify the mandate within which the portfolio is to be managed (and by implication to take the rap if this specification proves woefully inadequate in the face of unforeseen circumstances). There are still, however, some important lessons we can draw from the above, including the following: (a) Behavioural finance is important. The market values placed on different contingencies are often heavily influenced by how worried others are about potentially extreme outcomes. How market values move through time is also strongly influenced by investor perceptions. (b) A sound governance and management framework is also important. Those ultimately responsible for portfolios of assets or liabilities want reassurance that their portfolios are being managed appropriately (and are prepared to pay more to managers who are able to demonstrate this ability). A sound governance framework helps an entity articulate its risk appetite effectively and select the investment strategy most appropriate to it. Effective governance and risk management structures should also help an entity spot and handle under-appreciated risks, including any operational risks that have little or no upside. (c) There is typically a premium for flexibility. This shows up most noticeably in terms of liquidity risk. The challenge is that this means there is also a cost to maintaining liquidity and therefore a tendency to skimp on this cost when liquidity conditions seem benign. Likewise, long-term elements of the instrument nature or contract design may limit flexibility (and, e.g., increase exposure to liquidity risk), which again may require a suitable premium.11 The players typically best placed to survive and thrive in the changing world in which we live are the ones nimble enough to react effectively to changing events. (d) The world is intrinsically uncertain. This may seem daunting, because it places intrinsic limits on our ability to look ahead and make sense of what might happen. However, the flip side is that without intrinsic uncertainty there would be little scope to be entrepreneurial. Every problem presents an opportunity!
11 Kemp (2005) highlights that long-termness may not necessarily be viewed as attractive by either party to a contract, because the loss of flexibility that it creates may be viewed negatively by both parties. Conversely, as we saw in Section 9.4, there can be times when long-termness may be attractive to both parties, because it can reduce the intrinsic uncertainty that might otherwise arise in dealings between the parties.
9.8 IMPLEMENTATION CHALLENGES 9.8.1 Introduction Most of this chapter has focused on mindset, which is almost always easier to articulate than actually adopt. This is a feature of life in general rather than merely being limited to consideration of extreme events or portfolio construction theory. In this section we shall therefore concentrate on the new topics introduced in this chapter, namely consideration of Knightian uncertainty and of enterprise risk management. 9.8.2 Handling Knightian uncertainty The essential challenge with this type of uncertainty is that it is immeasurable. Therefore, by definition it is not readily amenable to analysis. To cut through this Gordian knot, we need to create a framework in which (a) lateral thinking can flourish; (b) it is practical to discuss and gain insight about how an investment might behave (allowing latitude to consider both how other ‘comparable’ investments have behaved in ‘comparable’ situations in the past and why they might be less comparable going forwards); and (c) insight into how others might think can come to the fore: as we noted in Section 9.4, the fact that Knightian uncertainty is uncertain does not stop us identifying ways in which investment markets might operate in its presence, because the interpretations that humans place on it will influence market behaviour. Put this way, the desired business culture ideally needed to navigate around Knightian uncertainty sounds suspiciously like the sort of culture that more traditional ‘judgemental’ types of asset managers generally seek to promote. This involves an emphasis on investment ‘flair’, a willingness to back one’s judgement (and therefore not to be caught in the inertia trap of indecision that hinders others) and a healthy scepticism towards views that others put forward, particularly views coloured by self-interest. We see again the fundamental similarity of more quantitative and more qualitative investment management styles that we highlighted in Section 5.2. 9.8.3 Implementing enterprise risk management The biggest implementation challenge with ERM is probably also how to foster the right kind of business culture. As we have already noted, good documentation and reporting frameworks are pretty important in any practical ERM implementation, but material that is never actually read or referred to will not contribute to improved business practices. Again, we are not primarily talking about a quantitative discipline, even if preparation and presentation of sound and relevant quantitative analyses do play an important role in ERM. Any implementation needs to be sensitive to dynamics that are unique to the entity in question, and so there is no ‘one’ right way of implementing a successful ERM framework.
10 The Final Word

10.1 CONCLUSIONS

In this book we have explored from many angles the topic of extreme events and how best to cater for them in portfolio construction. We have blended together qualitative and quantitative perspectives because we think that they complement rather than contradict each other.
We have found that an excess of extreme events versus that to be expected if the world were 'Normal' is a feature of many markets, but not one that is necessarily bad news. The way that we visualise and analyse fat tails can influence the importance we ascribe to extreme events. The time-varying nature of the world in which we live is a major source of fat-tailed behaviour, but it is not the only source.
We have repeatedly found, even with the most data-centric of portfolio construction approaches, just how important are our prior views or beliefs about how the future 'ought' to behave. Important, too, are the views of others. Markets are driven by the interaction of supply and demand, as mediated by economic agents who, like us, are subject to human behavioural biases. These can also create fat tails.
If dependency on our own prior beliefs is important when catering for the generality of market behaviour then it takes on even greater importance in relation to extreme events, because extreme events are relatively rare. Really extreme events are so rare (we hope!) that adopting the right mindset can be the dominant determinant of whether we successfully navigate through them.
10.2 PORTFOLIO CONSTRUCTION PRINCIPLES IN THE PRESENCE OF FAT TAILS

We set out for ease of reference each of the principles we have identified earlier in this book, and the sections in which their derivations may be found.
10.2.1 Chapter 2: Fat tails – in single (i.e., univariate) return series Principle P1 (Section 2.2): Events are only ‘extreme’ when measured against something else. Our innate behavioural biases about what constitute suitable comparators strongly influence our views about how ‘extreme’ an event actually is. Principle P2 (Section 2.2): The world in which we live changes through time. Our perception of it also changes, but not necessarily at exactly the same time. Principle P3 (Section 2.3.2): The ways in which we visualise data will influence the importance that we place on different characteristics associated with this data. To analyse extreme events, it helps to use methodologies such as quantile–quantile plots that highlight such occurrences. However, we should be aware that they can at times encourage us to focus too much on fat-tailed behaviour, and at other times to focus too little on it.
Principle P4 (Section 2.3.5): Most financial markets seem to exhibit fat-tailed returns for data sampled over sufficiently short time periods, i.e., extreme outcomes seem to occur more frequently than would be expected were returns to be coming from a (log-) Normal distribution. This is true both for asset types that might intrinsically be expected to exhibit fat-tailed behaviour (e.g., some types of bond, given the large market value declines that can be expected to occur if the issuer of the bond defaults) and for asset types, like equity indices, where there is less intrinsic reason to postulate strong fat-tailed return characteristics. Principle P5 (Section 2.4.5): Skewness and kurtosis are tools commonly used to assess the extent of fat-tailed behaviour. However, they are not particularly good tools for doing so when the focus is on behaviour in the distribution extremities, because they do not necessarily give appropriate weight to behaviour there. Modelling distribution extremities using the fourth-moment Cornish-Fisher approach (an approach common in some parts of the financial services industry that explicitly refers to these statistics and arguably provides the intrinsic statistical rationale for their use in modelling fat tails) is therefore also potentially flawed. A more robust approach may be to curve fit the quantile–quantile form more directly. Principle P6 (Section 2.6): Actively managed portfolios are often positioned by their managers to express a small number of themes that the manager deems likely to prove profitable across many different positions. This is likely to increase the extent of fat-tailed behaviour that they exhibit, particularly if the manager's theme is not one that involves an exact replay of past investment movements and which may not be emphasised in any independent risk assessment of the portfolio structure. Principle P7 (Section 2.6): Not all fat tails are bad. If the manager can arrange for their impact to be concentrated on the upside rather than the downside then they will add to performance rather than subtract from it. Principle P8 (Section 2.7.3): For major Western (equity) markets, a significant proportion of deviation from (log-) Normality in daily (index) return series appears to come from time-varying volatility, particularly in the upside tail. This part of any fat-tailed behaviour may be able to be managed by reference to rises in recent past shorter-term volatility or corresponding forward-looking measures such as implied volatility. Principle P9 (Section 2.9.5): Extrapolation is intrinsically less reliable, mathematically, than interpolation. Extreme value theory, if used to predict tail behaviour beyond the range of the observed dataset, is a form of extrapolation. Principle P10 (Section 2.10): Models that fit the past well may not fit the future well. The more parameters a model includes the better it can be made to fit the past, but this may worsen its fit to the future. Principle P11 (Section 2.12.6): Part of the cause of fat-tailed behaviour is the impact that human behaviour (including investor sentiment) has on market behaviour. 10.2.2 Chapter 3: Fat tails – in joint (i.e., multivariate) return series Principle P12 (Section 3.7): A substantial proportion of 'fat-tailed' behaviour (i.e., deviation from Normality) exhibited by joint return series appears to come from time-varying volatility and correlation.
This part of any fat-tailed behaviour may be able to be managed by reference to changes in recent past shorter-term volatilities and correlations or by reference to forward-looking measures such as implied volatility and implied correlation or implied covariance. Principle P13 (Section 3.7): Time-varying volatility does not seem to explain all fat-tailed behaviour exhibited by joint (equity) return series. Extreme events can still come 'out of the
blue’, highlighting the continuing merit of practitioners using risk management tools such as stress testing that seek to reflect the existence of such ‘unknown unknowns’ or ‘black swan’ events. Principle P14 (Section 3.8.4): Some parts of the market behave more similarly to each other than other parts. For some purposes it can be helpful to define a ‘distance’ between different parts of the market characterising how similar their behaviour appears to be. The definition of ‘distance’ used for this purpose and, more generally, the weights we ascribe to different parts of the market in our analysis can materially affect the answers to portfolio construction problems.
10.2.3 Chapter 4: Identifying factors that significantly influence markets Principle P15 (Section 4.3.4): Models often seek to explain behaviour spanning many different instruments principally via the interplay of a much smaller number of systematic factors. The selection and composition of these factors will depend in part on the relative importance we assign to the different instruments under consideration. Principle P16 (Section 4.3.5): Historic market return datasets contain far too little data to permit reliable estimation of idiosyncratic risks, i.e., risk exposures principally affecting just one or at most a handful of individual instruments within the overall market. Principle P17 (Section 4.6.2): Selection effects (in which those involved in some particular activity do not behave in a fashion representative of a random selection of the general populace) can be very important in many financial fields. They can also be important in a portfolio construction context. Managers can be expected to focus on ‘meaning’ as well as ‘magnitude’ in their choice of positions. The spread of behaviours exhibited by portfolios that managers deem to have ‘meaning’ may not be the same as the spread of behaviours exhibited by portfolios selected randomly. Principle P18 (Section 4.7.7): Markets appear to exhibit features that a mathematician would associate with chaotic behaviour. This may place intrinsic limits on our ability to predict the future accurately. Principle P19 (Section 4.10.3): Finding the very best (i.e., the ‘globally optimal’) portfolio mix is usually very challenging mathematically, even if we have a precisely defined model of how the future might behave.
10.2.4 Chapter 5: Traditional portfolio construction techniques Principle P20 (Section 5.1): Portfolio construction seeks to achieve the best balance between return and risk. It requires some formalisation of the trade-off between risk and return, which necessitates some means of assessing each element contributing to this trade-off. Principle P21 (Section 5.2.2): The intrinsic difference between qualitative and quantitative portfolio construction approaches, if there is one, is in the mindsets of the individuals involved. From a formal mathematical perspective, each can be re-expressed in a way that falls within the scope of the other. Principle P22 (Section 5.3.3): Multi-period portfolio optimisation often focuses just on the end result, e.g., terminal wealth. However, this simplification may not be appropriate in practice. What happens during the journey that we take to get to a destination can often be important too.
Principle P23 (Section 5.6.3): Multi-period portfolio optimisation is usually considerably more complicated mathematically than single-period optimisation. However, the form of the multi-period optimisation problem can often be approximated by one in which most of the relative stances within the dynamically optimal portfolio match those that would be chosen by a ‘myopic’ investor (i.e., one who focuses on just the upcoming period). Principle P24 (Section 5.7.4): Taxes and transaction costs are a drag on portfolio performance. They should not be forgotten merely because they inconveniently complicate the quantitative approach that we might like to apply to the portfolio construction problem. Principle P25 (Section 5.8.1): Risk budgeting, like any other form of budgeting, makes considerable sense but does not guarantee success. Principle P26 (Section 5.9.2): All historic risk measures are only imprecise measures of the ‘intrinsic’ but ultimately unobservable risk that a portfolio has been running. A portfolio that seems in the past to have run a low level of risk may actually have been exposed to latent risks that just happened not to show up during the period under analysis. Principle P27 (Section 5.9.3): How well a risk model might have performed had it been applied in the past can be assessed using backtesting. However, practitioners using such techniques should bear in mind that it is tricky in any such backtesting to avoid look-back bias, even when the model is tested out-of-sample, i.e., even when any parameter estimates used at a particular point in time within the model are derived from data that would have been historic at that time. This is because look-back bias will creep in via the structure of the model and not just via the input parameters used in its evaluation at any particular point in time. Principle P28 (Section 5.10.1): Practitioners with different views about expected future returns may still identify the same portfolios as optimal. Principle P29 (Section 5.12): Mean-variance optimisation has several features that almost guarantee that it will be the first step on the route to identifying portfolios that efficiently trade off return versus risk. It is simple, relatively easy to implement, insightful and now well-established. 10.2.5 Chapter 6: Robust mean-variance portfolio construction Principle P30 (Section 6.2.1): The results of portfolio optimisation exercises are usually very sensitive to the assumptions being adopted. Principle P31 (Section 6.3.3): Practical portfolio construction algorithms are heavily influenced by our own prior views about how the problem (or answers) ‘ought’ to behave. Bayesian statistics provides a way of incorporating such views into the portfolio construction process, but does not by itself make these a priori views more reliable. Principle P32 (Section 6.4.1): The sensitivity of portfolio optimisation exercises to input assumptions can be reduced by using robust optimisation techniques. Techniques commonly used in other fields, however, need to be adapted when doing so. Unless observations are erroneously recorded, outliers cannot be ignored in portfolio construction methodologies. However extreme they are, they still contribute to portfolio behaviour. Principle P33 (Section 6.4.3): Potentially important sources of a priori views for portfolio construction algorithms are the views of other market participants, as expressed in market weights and in the prices of instruments sensitive to such views. 
Principle P34 (Section 6.6.2): Many robust portfolio construction techniques can be re-expressed in terms of constraints applied to the form of the resulting portfolios (and vice versa).
Principle P35 (Section 6.10): Use of robust portfolio construction techniques may deplete the ability of a talented manager to add value. Clients may wish to implement them at the manager structure level rather than within individual portfolios.
10.2.6 Chapter 7: Regime switching and time-varying risk and return parameters Principle P36 (Section 7.2.4): Models that allow for the world changing through time are more realistic but also more difficult to solve. Principle P37 (Section 7.3.4): Utility functions used to quantify the trade-off between risk and return usually relate to the behaviour of the market values of the assets and liabilities under consideration. The inclusion of transaction costs complicates the meaning we might ascribe to ‘market value’ and may have a material impact on the resulting answers if our focus includes risks such as liquidity risk that are intimately bound up with future transaction costs and uncertainty in their magnitude. Principle P38 (Section 7.4.5): To take full advantage of the predictive power of models that assume the world changes through time, agents would need to have some means of identifying what ‘state’ the world was currently in. This is usually not easy. Such models also need to make assumptions about the level of ambiguity present in this respect. Principle P39 (Section 7.5): There are strong links between derivative pricing theory and portfolio construction theory, particularly if the minimum risk portfolio changes through time.
10.2.7 Chapter 8: Stress testing Principle P40 (Section 8.2): Stress tests are important tools for mitigating ‘model’ risk, but do not guarantee success in this respect. Principle P41 (Section 8.4): Reverse stress tests, if done well, can help us uncover underappreciated risks and formulate plans for mitigating them. Principle P42 (Section 8.5.3): Stress tests can engender a false sense of security, because there will always be risks not covered in any practical stress testing regime. This issue is particularly important if portfolios are optimised in a way that largely eliminates exposures to risks covered by the suite of stress tests being used, because this can concentrate risks into areas not covered by the stress tests. Principle P43 (Section 8.8.5): Effective stress testing requires effective engagement with the individuals for whom the stress tests are being prepared. Otherwise the informational content within them may be lost.
10.2.8 Chapter 9: Really Extreme Events Principle P44 (Section 9.2): Handling really extreme events well generally requires thinking ‘outside the box’. Principle P45 (Section 9.4.2): Markets are not just subject to risks that are quantifiable. They are also subject to intrinsic (‘Knightian’) uncertainty which is immeasurable. Even though we may not be able to quantify such exposures, we may still be able to infer something about how markets might behave in their presence, given the impact that human reaction to them may have on market behaviour.
Principle P46 (Section 9.4.3): Be particularly aware of exposure to positions that are sensitive to aggregate ‘market risk appetite’. Be particularly aware of exposure to liquidity risk. Keep things simple. Bear in mind that short-term contracts may not be able to be rolled in the future on currently prevailing terms. Principle P47 (Section 9.6.6): Good governance frameworks and operational practices may help us implement efficient portfolio construction more effectively. They can help to bridge the gap between what we would like to do and what actually happens in practice.
Appendix Exercises

A.1 INTRODUCTION

This Appendix contains some exercises for use by students and lecturers. Each main chapter of the book has associated exercises (with, say, A.3 corresponding to Chapter 3). The exercises are reproduced with kind permission from Nematrian Limited. Hints and model solutions are available on the www.nematrian.com website, as are any analytical tools needed to solve the exercises.
A.2 FAT TAILS – IN SINGLE (I.E., UNIVARIATE) RETURN SERIES

The exercises in this section relate to the following hypothetical (percentage) returns on two indices, A and B, over 20 periods:

Period    A (%)     B (%)      Period    A (%)     B (%)
1         –25.1       5.0      11          0.7     –10.5
2           0.1     –16.0      12          0.8      11.7
3          24.2     –11.8      13         –3.8      17.9
4           7.9      –3.0      14          2.8       9.7
5          –1.4      –9.3      15         –4.3       8.3
6           0.5      –0.1      16         –1.2      –6.3
7          –0.6      –7.7      17         –7.0      –7.8
8           1.0       2.8      18          6.5      –2.3
9           7.3      13.1      19          3.6       5.4
10        –12.0       1.5      20          1.8       2.8
A.2.1 You are an investor seeking to understand the behaviour of Index A. (a) Calculate the mean, (sample) standard deviation, skew and (excess) kurtosis of its log returns over the period covered by the above table. (b) Do the statistics calculated in question (a) appear to characterise a fat-tailed distribution if we adopt the null hypothesis that the log returns would otherwise be coming from a Normal distribution and we use the limiting form of the distributions for these test statistics (i.e., the form ruling when n → ∞, where n is the number of observations)? (c) Prepare a standardised quantile–quantile plot for Index A. Does it appear to be fat-tailed? (d) Does the Cornish-Fisher fourth moment approximation appear to under or overstate the fat-tailed behaviour of this series? (e) What other methodologies could you use to formulate a view about how fat-tailed this return series might be if your focus was principally on fat-tailed behaviour around or below the lower 10th percentile quantile level?
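For readers who want to check their workings, the sketch below shows one way of computing the sample statistics and the standardised quantile–quantile plot referred to in Exercise A.2.1(a) and (c). The log-return convention, the use of sample (rather than population) statistics and the particular plotting-position formula are assumptions made for illustration; they are not the only defensible choices, and the hints on www.nematrian.com may adopt different conventions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

index_a = np.array([-25.1, 0.1, 24.2, 7.9, -1.4, 0.5, -0.6, 1.0, 7.3, -12.0,
                    0.7, 0.8, -3.8, 2.8, -4.3, -1.2, -7.0, 6.5, 3.6, 1.8])

log_returns = np.log(1.0 + index_a / 100.0)            # convert % returns to log returns

mean = log_returns.mean()
std = log_returns.std(ddof=1)                          # sample standard deviation
skew = stats.skew(log_returns, bias=False)
excess_kurt = stats.kurtosis(log_returns, bias=False)  # excess kurtosis (Normal = 0)
print(f"mean = {mean:.4f}, std = {std:.4f}, skew = {skew:.3f}, excess kurtosis = {excess_kurt:.3f}")

# Standardised quantile-quantile plot: observed standardised log returns versus
# the quantiles a unit Normal would place at the same plotting positions.
standardised = np.sort((log_returns - mean) / std)
n = len(standardised)
plot_positions = (np.arange(1, n + 1) - 0.5) / n       # one common plotting-position convention
expected = stats.norm.ppf(plot_positions)

plt.scatter(expected, standardised)
plt.plot(expected, expected, linestyle="--")           # 45-degree reference line
plt.xlabel("Expected Normal quantile")
plt.ylabel("Observed standardised log return")
plt.title("Standardised quantile-quantile plot, Index A")
plt.show()
```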
A.2.2 You are an investor seeking to understand the behaviour of Index B. (a) You think that the returns shown for Index B may exhibit a material element of smoothing. What sorts of assets might lead to this type of behaviour? (b) Using a first order autoregressive model, de-smooth the observed returns for Index B to derive a return series that you think may provide a better measure of the underlying behaviour of the relevant asset category. (c) Prepare a standardised quantile–quantile plot for this underlying return series. Does it appear to exhibit fat-tailed behaviour?
A.2.3 You discover that the periods being used for Index A are quite long (e.g., yearly), sufficiently long for secular change to make the relevance of data from some of the earlier periods suspect. You decide to exponentially weight the data using a half-life of 10 periods, i.e., the weight given to period t (for t = 1, ..., 20) is w(t) = exp(−(20 − t) ln(2)/10). (a) Recalculate the mean, standard deviation, skew and kurtosis weighting the data as above. Are they still suggestive of fat-tailed behaviour? (b) Prepare a standardised quantile–quantile plot of the weighted data. Is it also suggestive of fat-tailed behaviour? Hint: the 'expected' values for such plots need to bear in mind the weight given to the observation in question.
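A minimal sketch of one way of carrying out the de-smoothing in Exercise A.2.2(b) is given below. It assumes the commonly used first-order recipe in which the observed return is modelled as observed_t = ρ × observed_{t−1} + (1 − ρ) × underlying_t, with ρ estimated from the lag-1 autocorrelation of the observed series; other de-smoothing conventions exist and would give somewhat different answers.

```python
import numpy as np

index_b = np.array([5.0, -16.0, -11.8, -3.0, -9.3, -0.1, -7.7, 2.8, 13.1, 1.5,
                    -10.5, 11.7, 17.9, 9.7, 8.3, -6.3, -7.8, -2.3, 5.4, 2.8]) / 100.0

def desmooth(observed):
    """De-smooth a return series assuming observed_t = rho*observed_{t-1}
    + (1 - rho)*underlying_t, with rho taken as the lag-1 autocorrelation."""
    x = np.asarray(observed, dtype=float)
    rho = np.corrcoef(x[:-1], x[1:])[0, 1]         # estimate of the smoothing parameter
    underlying = (x[1:] - rho * x[:-1]) / (1.0 - rho)
    return rho, underlying

rho_hat, underlying_b = desmooth(index_b)
print(f"estimated rho      = {rho_hat:.3f}")
print(f"observed volatility    = {index_b.std(ddof=1):.3%}")
print(f"de-smoothed volatility = {underlying_b.std(ddof=1):.3%}")  # typically higher
```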
A.3 FAT TAILS – IN JOINT (I.E., MULTIVARIATE) RETURN SERIES

A.3.1 You have the following (percentage) return information on three different Indices A, B and C as follows (Indices A and B are as per Section A.2). Index C has not been around as long as Indices A and B:

Period    A (%)     B (%)     C (%)      Period    A (%)     B (%)     C (%)
1         –25.1       5.0      N/A       11          0.7     –10.5     –10.5
2           0.1     –16.0      N/A       12          0.8      11.7       3.9
3          24.2     –11.8      N/A       13         –3.8      17.9       5.2
4           7.9      –3.0      N/A       14          2.8       9.7       0.1
5          –1.4      –9.3      N/A       15         –4.3       8.3       3.5
6           0.5      –0.1      N/A       16         –1.2      –6.3      –2.9
7          –0.6      –7.7      N/A       17         –7.0      –7.8     –10.8
8           1.0       2.8      3.1       18          6.5      –2.3       2.5
9           7.3      13.1      4.6       19          3.6       5.4       6.3
10        –12.0       1.5     –7.5       20          1.8       2.8       2.4
(a) Explain the advantages and disadvantages of attempting to backfill data for Index C for the first seven periods from data applicable to either Index A or Index B when creating a model that jointly describes the behaviour of all three indices. (b) If you had to select between Index A or Index B to backfill data for Index C as per question (a), which would you use? Why? (c) What are the advantages and disadvantages of using a linear combination of Index A and Index B to backfill Index C, rather than using an either/or approach as per question (b)?
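One simple (and far from unique) way of implementing the linear-combination backfill discussed in Exercise A.3.1(c) is to regress Index C on Indices A and B over the periods where all three are available and to proxy the missing history using the fitted relationship, perhaps with resampled residuals so that the proxied history is not artificially smooth. The sketch below illustrates the mechanics; the ordinary least squares fit and the residual resampling are assumptions made for illustration rather than the book's prescription.

```python
import numpy as np

# Periods 8-20, where all three indices have data (percentage returns)
a = np.array([1.0, 7.3, -12.0, 0.7, 0.8, -3.8, 2.8, -4.3, -1.2, -7.0, 6.5, 3.6, 1.8])
b = np.array([2.8, 13.1, 1.5, -10.5, 11.7, 17.9, 9.7, 8.3, -6.3, -7.8, -2.3, 5.4, 2.8])
c = np.array([3.1, 4.6, -7.5, -10.5, 3.9, 5.2, 0.1, 3.5, -2.9, -10.8, 2.5, 6.3, 2.4])

# Fit c ~ alpha + beta_a * a + beta_b * b by ordinary least squares
X = np.column_stack([np.ones_like(a), a, b])
coeffs, *_ = np.linalg.lstsq(X, c, rcond=None)
residuals = c - X @ coeffs

# Backfill periods 1-7 using the fitted relationship plus resampled residuals
a_early = np.array([-25.1, 0.1, 24.2, 7.9, -1.4, 0.5, -0.6])
b_early = np.array([5.0, -16.0, -11.8, -3.0, -9.3, -0.1, -7.7])
rng = np.random.default_rng(1)
X_early = np.column_stack([np.ones_like(a_early), a_early, b_early])
c_backfilled = X_early @ coeffs + rng.choice(residuals, size=len(a_early))

print("fitted coefficients (alpha, beta_A, beta_B):", np.round(coeffs, 3))
print("proxied early history for Index C:", np.round(c_backfilled, 1))
```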
A.3.2 Plot an empirical two-dimensional quantile–quantile plot as per Section 3.5 characterising the joint distribution of A and B.
A.4 IDENTIFYING FACTORS THAT SIGNIFICANTLY INFLUENCE MARKETS

A.4.1 You are an investor trying to understand better the behaviour of Index B in Section A.2. You think that it is likely to be best modelled by an AR(1) autoregressive model along the lines of y_t − μ = c(y_{t−1} − μ) + w_t with random independent identically distributed Normal error terms: (a) Estimate the value of c 16 times, the first time assuming that you only have access to the first 5 observations, the next time you only have access to the first 6 observations and so on. (b) Do these evolving estimates of c appear to be stable? How would you test such an assertion statistically?
A.4.2 You are an investor trying to understand better the joint behaviour of Indices A and B in Section A.2. (a) Identify the series corresponding to the principal components of Indices A and B. (b) Considering linear combinations of Indices A and B, what is the maximum possible kurtosis that such a combination can exhibit? (c) More generally, can you identify two series where the linear combination of the series with the least variance is also the one with the maximum kurtosis? Hint: try identifying two series with very few terms in them because it simplifies the relevant mathematics. (d) What lessons might you draw from question (c) in terms of use of variance or variance-related risk statistics when used to estimate the likelihood of extreme events?
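The expanding-window estimation in Exercise A.4.1(a) can be carried out along the following lines. The sketch estimates c by ordinary least squares of y_t − ȳ on y_{t−1} − ȳ within each window, with ȳ the window sample mean; this is only one of several reasonable estimators of the AR(1) coefficient.

```python
import numpy as np

index_b = np.array([5.0, -16.0, -11.8, -3.0, -9.3, -0.1, -7.7, 2.8, 13.1, 1.5,
                    -10.5, 11.7, 17.9, 9.7, 8.3, -6.3, -7.8, -2.3, 5.4, 2.8])

def ar1_coefficient(y):
    """OLS estimate of c in (y_t - mu) = c (y_{t-1} - mu) + w_t,
    with mu taken as the sample mean of the window."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.dot(d[1:], d[:-1]) / np.dot(d[:-1], d[:-1])

for n in range(5, 21):                       # first 5, 6, ..., 20 observations
    c_hat = ar1_coefficient(index_b[:n])
    print(f"first {n:2d} observations: c_hat = {c_hat:+.3f}")
```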
A.5 TRADITIONAL PORTFOLIO CONSTRUCTION TECHNIQUES

A.5.1 You are an asset allocator selecting between five different asset categories, A1 to A5, using mean-variance optimisation. Your expected future returns and covariances for the asset categories are as in the following table:

        Expected return   Expected standard        Expected correlation coefficients
        (% p.a.)          deviation (% p.a.)       A1      A2      A3      A4      A5
A1      3.0                2                       1
A2      5.0                4                       0.4     1
A3      6.0                8                      –0.6    –0.5     1
A4      7.0               14                       0.0    –0.4     0.2     1
A5      7.5               15                      –0.4    –0.4     0.6     0.3     1
(a) Plot the efficient frontier and the asset mixes making up the points along the efficient frontier, assuming that risk-free is to be equated with zero volatility of return and that no negative holdings are allowed for any asset category. (b) Show how the efficient frontier and the asset mixes making up the points along the efficient frontier would alter if risk-free is equated with 50% in Asset A1 and 50% in Asset A2. (c) In what circumstances might a mixed minimum risk portfolio as per question (b) apply? Give examples of the types of asset that might then be A1 and A2.
A.5.2 A colleague has a client for which the mixed minimum risk portfolio as per Exercise A.5.1(b) applies. She has invested the client's portfolio as follows:

        Portfolio mix (%)
A1      20
A2      20
A3      20
A4      20
A5      20
(a) You know that she has also used mean-variance optimisation techniques and adopted the same expected covariances as you would have done in Exercise A.5.1. What return assumptions might she have adopted when choosing her portfolio mix? Specify mathematically all possible sets of return assumptions that she could have adopted and still reached this answer. (b) Suppose that she is just about to adjust her portfolio mix so that it includes no holding in A3 and with the amounts that were invested in A3 in the above table redistributed equally between the remaining four asset categories. What return assumptions might she be adopting when choosing her new portfolio mix? Can you specify mathematically all possible sets of return assumptions that she could have adopted and still reached this answer?
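The optimisations in Exercises A.5.1 and A.5.2 can be tackled with any quadratic programming routine. The sketch below traces a long-only efficient frontier for the Exercise A.5.1 assumptions by maximising expected return less λ times variance for a range of risk-aversion parameters λ; the particular solver, the grid of λ values and the treatment of 'risk-free' as zero volatility are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

mu = np.array([3.0, 5.0, 6.0, 7.0, 7.5])             # expected returns (% p.a.)
sd = np.array([2.0, 4.0, 8.0, 14.0, 15.0])           # expected standard deviations (% p.a.)
corr = np.array([[ 1.0,  0.4, -0.6,  0.0, -0.4],
                 [ 0.4,  1.0, -0.5, -0.4, -0.4],
                 [-0.6, -0.5,  1.0,  0.2,  0.6],
                 [ 0.0, -0.4,  0.2,  1.0,  0.3],
                 [-0.4, -0.4,  0.6,  0.3,  1.0]])
cov = np.outer(sd, sd) * corr

def efficient_weights(lam):
    """Long-only weights maximising mu'w - lam * w'Cov w, subject to sum(w) = 1."""
    objective = lambda w: -(mu @ w) + lam * (w @ cov @ w)
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * len(mu)                   # no short sales
    w0 = np.full(len(mu), 1.0 / len(mu))
    return minimize(objective, w0, bounds=bounds, constraints=constraints).x

for lam in (0.01, 0.05, 0.2, 1.0, 5.0):
    w = efficient_weights(lam)
    print(f"lambda = {lam:5.2f}: weights = {np.round(w, 3)}, "
          f"return = {mu @ w:5.2f}%, vol = {np.sqrt(w @ cov @ w):5.2f}%")
```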
A.6 ROBUST MEAN-VARIANCE PORTFOLIO CONSTRUCTION A.6.1 A daily series that you are analysing seems to have a small number of extreme movements that look suspiciously like errors to you. (a) To what extent should you exclude such observations when developing a robust portfolio construction algorithm? (b) What sorts of circumstances (applying to what sorts of financial series) might lead to extreme movements that are not actually errors? (c) What other sorts of observations arising in financial series might be ones that you would question? A.6.2 You are an asset allocator selecting between five different asset categories as per Exercise A.5.1 and you believe that the covariances between the asset categories are as set out in Exercise A.5.1. The ‘market’ involves the following asset mix and is viewed as implicitly involving a minimum risk portfolio that is 100% invested in asset class A1. The mandate does not allow short sales.
        Market mix (%)     Your views relative to those implicit in the market mix
A1      10                 –0.5
A2      15                  0.0
A3      20                 +0.5
A4      30                 –0.5
A5      15                 +0.5
(a) Set out how you would use the Black-Litterman approach to identify a robust optimal asset mix for the portfolio. (b) What additional information would you need before you could decide what is the most suitable asset allocation for this portfolio? (c) Making plausible assumptions about this additional information, suggest a suitable asset allocation for this portfolio. Ideally create a spreadsheet that takes this information as an input and selects the most suitable asset allocation given this information. (d) Set out a series of return and covariance assumptions that results in the same mean-variance asset allocation as in question (c) but without using the Black-Litterman methodology.
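A compact sketch of the standard Black-Litterman posterior calculation referred to in Exercise A.6.2(a) follows. The values of τ, the view-confidence matrix Ω and the risk-aversion coefficient δ used to back out equilibrium returns are all illustrative assumptions (they are essentially the 'additional information' that question (b) asks about), as is the interpretation of the stated views as adjustments relative to the equilibrium returns; for brevity the sketch also ignores the no-short-sales constraint in the final step.

```python
import numpy as np

# Covariance matrix implied by the Exercise A.5.1 assumptions (% p.a. units)
sd = np.array([2.0, 4.0, 8.0, 14.0, 15.0])
corr = np.array([[ 1.0,  0.4, -0.6,  0.0, -0.4],
                 [ 0.4,  1.0, -0.5, -0.4, -0.4],
                 [-0.6, -0.5,  1.0,  0.2,  0.6],
                 [ 0.0, -0.4,  0.2,  1.0,  0.3],
                 [-0.4, -0.4,  0.6,  0.3,  1.0]])
cov = np.outer(sd, sd) * corr

w_mkt = np.array([0.10, 0.15, 0.20, 0.30, 0.15])
w_mkt = w_mkt / w_mkt.sum()            # the stated weights sum to 0.90; renormalise for illustration
views = np.array([-0.5, 0.0, +0.5, -0.5, +0.5])      # views relative to the market-implied returns

delta = 0.05                           # assumed risk-aversion coefficient (illustrative)
tau = 0.05                             # assumed weight on the prior (illustrative)
pi = delta * cov @ w_mkt               # implied equilibrium returns
P = np.eye(5)                          # one view per asset
Q = pi + views                         # views re-expressed as absolute return forecasts
omega = np.diag(np.diag(tau * P @ cov @ P.T))        # a common (but not unique) choice for Omega

middle = np.linalg.inv(np.linalg.inv(tau * cov) + P.T @ np.linalg.inv(omega) @ P)
mu_bl = middle @ (np.linalg.inv(tau * cov) @ pi + P.T @ np.linalg.inv(omega) @ Q)
w_bl = np.linalg.inv(delta * cov) @ mu_bl            # unconstrained optimal weights

print("Black-Litterman expected returns:", np.round(mu_bl, 2))
print("Unconstrained optimal weights   :", np.round(w_bl, 3))
```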
A.7 REGIME SWITCHING AND TIME-VARYING RISK AND RETURN PARAMETERS

In Section 7.2 we set out formulae for the means and covariance matrices for the conditional probability distributions involved in an RS model comprising just two multivariate Normal regimes.
A.7.1 Suppose we had K regimes rather than just 2. How would the formulae in Equations (7.1) to (7.6) generalise in such circumstances?
A.7.2 Suppose that we revert to the 2-regime case and we also have just two assets. Derive formulae for the skew and kurtosis of the conditional probability distributions.
A.7.3 Suppose that the regimes in Exercise A.7.2 have the following distributional characteristics and transition probabilities:

Distributional characteristics

                 Regime 1                        Regime 2
                          Covariances                      Covariances
Asset    Means       A         B         Means       A         B
A        0.03        0.01      0.005     0.01        0.02      0.02
B        0.06        0.005     0.02     –0.01        0.02      0.03
Transition probabilities (where p = 0.2 and q = 0.3)

                                   State at start of next period
State at start of this period      Regime 1        Regime 2
Regime 1                           p               1 − p
Regime 2                           1 − q           q
(a) What are the conditional means and covariance matrices of the distributions for the next period if the world is in (i) Regime 1 and (ii) Regime 2? (b) What in broad terms is the impact on optimal portfolios of increasing p and q by equal amounts, i.e., the likelihood that we switch states over the coming period whatever the state of the world we are currently in?
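The conditional moments asked for in Exercise A.7.3(a) follow from standard mixture-of-Normals algebra: conditional on the current regime, next period's return distribution is a two-component mixture whose weights are the relevant row of the transition matrix. The sketch below implements that algebra, using the transition layout reconstructed in the table above (so p and q are the probabilities of remaining in Regimes 1 and 2 respectively).

```python
import numpy as np

mu = [np.array([0.03, 0.06]),                       # Regime 1 means (assets A, B)
      np.array([0.01, -0.01])]                      # Regime 2 means
cov = [np.array([[0.01, 0.005], [0.005, 0.02]]),    # Regime 1 covariances
       np.array([[0.02, 0.02], [0.02, 0.03]])]      # Regime 2 covariances

p, q = 0.2, 0.3
transition = np.array([[p, 1 - p],                  # currently in Regime 1
                       [1 - q, q]])                 # currently in Regime 2

def conditional_moments(current_regime):
    """Mean and covariance of next period's return mixture, given the current regime."""
    w = transition[current_regime]
    mean = w[0] * mu[0] + w[1] * mu[1]
    # E[XX'] of the mixture, minus the outer product of the mixture mean
    second = sum(w[i] * (cov[i] + np.outer(mu[i], mu[i])) for i in range(2))
    return mean, second - np.outer(mean, mean)

for regime in (0, 1):
    m, c = conditional_moments(regime)
    print(f"Currently in Regime {regime + 1}: conditional mean = {np.round(m, 4)}")
    print(f"conditional covariance =\n{np.round(c, 5)}\n")
```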
A.8 STRESS TESTING A.8.1 The wording of the Solvency II Directive indicates that in the Solvency II Standard Formula SCR the operational risk charge should be added to the charge for all other risk types without any diversification offset. This means that there is little scope in practice for a less onerous approach to be adopted by any particular national insurance regulator. (a) You are an EU insurance regulator who does not wish to rely merely on the wording of the Directive to justify no diversification offset between operational risk and other types of risk. What other arguments might you propose for such an approach? (b) Conversely, you are an insurance firm that is attempting to persuade the regulator to allow it to use an internal model and you would like to be able to incorporate a significant diversification offset between operational risk and other elements of the SCR. What arguments might you propose to justify your position? (c) You are another EU insurance firm. Why might you have an interest in how successful or unsuccessful the firm in question (b) is at putting forward its case for greater allowance for diversification between operational and other risks? A.8.2 Summarise the main risks to which the following types of entity might be most exposed (and for which it would be prudent to provide stress tests if you were a risk manager for such an entity): (a) a commercial bank (b) an investment bank (c) a life insurance company (d) a non-life insurance company (e) a pension fund A.8.3 You are a financial services entity operating in a regulated environment in which your capital requirements are defined by a nested stress test approach as described in Section 8.3.3. Your investment manager has approached you with a new service that will involve the manager explicitly optimising your investment strategy to minimise your regulatory capital requirements and is proposing that you pay them a performance related fee if they can reduce your capital requirements. Set out the main advantages and disadvantages (to you) of such a strategy.
A.9 REALLY EXTREME EVENTS A.9.1 Set out the main types of risk to which a conventional asset manager managing funds on behalf of others might be exposed. Which of these risks is likely to be perceived to be most worth rewarding by its clients? A.9.2 You work for an insurance company and are attempting to promote the use of enterprise risk management within it. Describe some ways in which risks might be better handled when viewed holistically across the company as a whole. A.9.3 You work for a bank and have become worried that different business units might be focusing too little on the liquidity needs that their business activities might be incurring. How might you set an appropriate ‘price’ for the liquidity that the bank as a whole is implicitly providing to its different business units?
References Hyperlinks to some of the references set out below are accessible via http://www.nematrian.com/ extremeevents.aspx. Abarbanel, H.D.I. (1993). The analysis of observed chaotic data in physical systems. Reviews of Modern Physics, 65, No. 4. Abramowitz, M. and Stegun, I.A. (1970). Handbook of Mathematical Functions. Dover Publications Inc. Alexander, G.J. and Baptista, A.M. (2004). A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management Science, 50, No. 9, 1261–1273. Ang, A. and Bekaert, G. (2002a). International asset allocation with regime shifts. Review of Financial Studies, 15, 4, 1137–1187. Ang, A. and Bekaert, G. (2002b). Regime switches in interest rates. Journal of Business and Economic Statistics, 20, 2, 163–182. Ang, A. and Bekaert, G. (2004). How do regimes affect asset allocation? Financial Analysts Journal, 60, 86–99. Artzner, P., Delbaen, F., Eber, J. and Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9, No. 3, 203–228. Balkema, A. and de Haan, L. (1974). Residual life time at great age. Annals of Probability, 2, 792– 804. Bank of England (2009). Recovery and resolution plans – Remarks by Andrew Bailey. Bank of England News Release, 17 November. Barry, C. B. (1974). Portfolio analysis under uncertain means, variances and covariances. Journal of Finance, 29, 515–522. BCBS (2008). Principles for Sound Liquidity Risk Management and Supervision. Basel Committee on Banking Supervision. Berger, J. (1978). Minimax estimation of a multivariate normal mean under polynomial loss. Journal of Multivariate Analysis, 8, 173–180. Berk, J. and DeMarzo, P. (2007). Corporate Finance. Pearson Education, Inc. Besar, D., Booth, P., Chan, K.K., Milne, A.K.L. and Pickles, J. (2009). Systemic risk in financial services. Paper presented to Institute of Actuaries Sessional Meeting, 7 December. Bevan, A. and Winkelmann, K. (1998). Using the Black-Litterman Global Asset Allocation Model: Three years of practical experience. Goldman Sachs Fixed Income Research Note, June. Bewley, T.F. (1986). Knightian decision theory: Part I. Cowles Foundation Discussion Paper, No. 807, Yale University (or Decisions in Economics and Finance, 25, 79–110). Bewley, T.F. (1987). Knightian decision theory: Part II: Intertemporal problems. Cowles Foundation Discussion Paper, No. 835, Yale University. Bewley, T.F. (1988). Knightian decision theory and econometric inference. Cowles Foundation Discussion Paper, No. 868, Yale University. Billah, M.B., Hyndman, R.J. and Koehler, A.B. (2003). Empirical information criteria for time series forecasting model selection. Monash University, Australia, Department of Econometrics and Business Statistics, Working Paper 2/2003, ISSN 1440-771X.
Black, F. and Litterman, R. (1992). Global portfolio optimization. Financial Analysts Journal, 48, No. 5, 28–43. Booth, P.M. and Marcato, G. (2004). The measurement and modelling of commercial real estate performance. British Actuarial Journal, 10, No. 1, 5–61. Brennan, M.J., Schwartz, E.S. and Lagnado, R. (1997). Strategic asset allocation. Journal of Economic Dynamics and Control, 21, 1377–1403. Breuer, T. (2009). If worst comes to worst: Systematic stress tests with discrete and other non-Normal distributions, Presentation to Quant Congress Europe, November 2009. Britten-Jones, M. (1999). The sampling error in estimates of mean-variance efficient portfolio weights. Journal of Finance, 54, No. 2, 655–671. Brown, S. J. (1976). Optimal Portfolio Choice under Uncertainty: A Bayesian Approach. PhD Dissertation, University of Chicago. Cagliarini, A. and Heath, A. (2000). Monetary policy making in the presence of Knightian uncertainty. Economic Research Department, Reserve Bank of Australia. Campbell, S. D. (2006). A review of backtesting and backtesting procedures. Journal of Risk, 9, No. 2, 1–17. Carino, D.R., Kent, T., Myers, D.H., Stacey, C., Watanabe, K. and Ziemba, W.T. (1994). Russell-Yasuda Kasai model: An asset-liability model for a Japanese insurance company using multi-stage stochastic programming. Interfaces, 24, 24–49. Cheung, W. (2007a). The Black-Litterman model explained (II). Lehman Brothers: Equity Quantitative Analytics, 29 March. Cheung, W. (2007b). The Black-Litterman model (III): Augmented for factor-based portfolio construction. Lehman Brothers: Equity Quantitative Analytics, 30 March. Christensen, B.J. and Prabhala, N.R. (1998). The relation between implied and realized volatility. Journal of Financial Economics, 50, 125–150. Christoffersen, P. (1998). Evaluating interval forecasts. International Economic Review, 39, 841–862. Christoffersen, P. and Pelletier, D. (2004). Backtesting value-at-risk: A duration-based approach. Journal of Empirical Finance, 2, 84–108. Clauset, A., Shalizi, C.R. and Newman, M.E.J. (2007). Power-law distributions in empirical data. SIAM Review 51, 4, 661–703. Cochrane, J.H. (1999). Portfolio advice for a multifactor world. Economic Perspectives, Federal Reserve Bank of Chicago, 23(3), 59–78. Committee of Sponsoring Organisations of the Treadway Commission (2004). Enterprise Risk Management: Integrated Framework; see http://www.coso.org/Publications/ERM/ COSO ERM ExecutiveSummary.pdf (accessed 23 July 2010). Cotter, J. (2009). Scaling conditional tail probability and quantile estimators. Risk, April. CRMPG-III (2008). Containing Systemic Risk: The road to reform. Report of the Counterparty Risk Management Policy Group III, 6 August 2008. Daul, S., De Giorgi, E., Lindskog, F. and McNeil, A. (2003). Using the grouped t-copula. Risk, November. Davis, M.H.A, Panas, V.G and Zariphopoulou, T. (1993). European option pricing with transaction costs. SIAM Journal of Control and Optimisation, 31, 470–493. Deighton, S.P., Dix, R.C., Graham, J.R. and Skinner, J.M.E. (2009). Governance and risk management in United Kingdom insurance companies. Paper presented to the Institute of Actuaries, 23 March 2009. DeMiguel, V., Garlappi, L. and Uppal, R. (2009a). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? Rev. Financial Stud., 22, 1915–1953. DeMiguel, V., Garlappi, L., Nogales, F.J. and Uppal, R. (2009b). A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. 
Management Science, 55, No. 5, 798–812. Dempster, M.A.H., Mitra, G. and Pflug, G. (2009) (eds). Quantitative Fund Management. Chapman & Hall/CRC. Dickenson, J. P. (1979). The reliability of estimation procedures in portfolio analysis. Journal of Financial and Quantitative Analysis, 9, 447–462. Doust, P. and White, R. (2005). Genetic algorithms for portfolio optimisation. Royal Bank of Scotland. Dowd, K. (2006). Backtesting market risk models in a standard normality framework. Journal of Risk, 9, No. 2, 93–111.
Edelman, A. and Rao, N.R. (2005). Random matrix theory. Acta Numerica, 14, 233–297. Efron, B. and Morris, C. (1976). Families of minimax estimators of the mean of a multivariate normal distribution. The Annals of Statistics, 4, 11–21. Fabozzi, F.J., Focardi, S.M. and Jonas, C. (2008). Challenges in Quantitative Equity Management. CFA Research Foundation. Fabozzi, F.J., Focardi, S.M. and Jonas, C. (2009). Trends in quantitative equity management: Survey results. In Dempster et al. (2009), Chapter 1. Fama, E. and French, K. (1992). The cross-section of expected stock returns. Journal of Finance, 47, 427–465. Financial Times (2009a). Bank sets off ideas on living wills for lenders. Financial Times, 18 November. Financial Times (2009b). Brazil clips the wings of banks adept at capital flight. Financial Times, 26 November. Fisher, R.A. and Tippett, L.H.C. (1928). Limiting forms of the frequency distribution of the largest and smallest member of a sample. Proceedings of the Cambridge Philosophical Society, 24, 180– 190. Frankfurter, G.M., Phillips, H.E. and Seagle, J.P. (1971). Portfolio selection: The effects of uncertain means, variances and covariances. Journal of Financial and Quantitative Analysis, 6, 1251–1262. Frankland, R., Smith, A.D., Wilkins, T., Varnell, E., Holtham, A., Biffis, E., Eshun, S. and Dullaway, D. (2008). Modelling extreme market events. Paper presented to Institute of Actuaries Sessional Meeting, 3 November 2008. FSA (2008). Consultation Paper 08/24: Stress and scenario testing. UK Financial Services Authority. FSA (2009a). Policy Statement 09/16. Strengthening liquidity standards. UK Financial Services Authority. FSA (2009b). Discussion Paper 09/4 Turner Review Conference Discussion Paper. A regulatory response to the global banking crisis: systemically important banks and assessing the cumulative impact. UK Financial Services Authority. Giacometti, R., Bertocchi, M., Rachev, S. and Fabozzi, F.J. (2009). Stable distributions in the BlackLitterman approach to asset allocation. Chapter 17 of Dempster et al. (2009). Giamouridis, D. and Ntoula, I. (2007). A comparison of alternative approaches for determining the downside risk of hedge fund strategies. Edhec Risk and Asset Management Research Centre. Gibbons, M., Ross, S.A. and Shanken, J. (1989). A test of the efficiency of a given portfolio. Econometrica, 57, 1121–1152. . Gilboa, I. and Schmeidler, D. (1989). Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18, 141–153. Gregory, J. and Laurent, J-P. (2004). In the core of correlation. Risk, October. Grinold, R.C. and Kahn, R. (1999). Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk, second edition. McGraw Hill. Herzog, F., Dondi, G., Keel, S., Schumann, L.M. and Geering, H.P. (2009). Solving ALM problems via sequential stochastic programming. Chapter 9 of Dempster et al. (2009). Hill, B.M. (1975). A simple general approach to inference about the tail of a distribution. Annals of Statistics, 3, 1163–1174. Hitchcox, A.N., Klumpes, P.J.M., McGaughey, K.W., Smith, A.D. and Taverner, N.H. (2010). ERM for insurance companies – adding the investor’s point of view. Paper presented to Institute of Actuaries Sessional Meeting, 25 January 2010. Hodges, S.D. and Neuberger, A. (1989). Optimal replication of contingent claims under transaction costs. Review of Future Markets, 8. Hsuku, Y-H. (2009). Dynamic consumption and asset allocation with derivative securities. Chapter 3 of Dempster et al. 
(2009). Hull, J.C. and White, A. (1998). Incorporating volatility up-dating into the historical simulation method for VaR. Journal of Risk, 1, No. 1, 5–19. Hurlin, C. and Tokpavi, S. (2006). Backtesting value-at-risk accuracy: a simple new test. Journal of Risk, 9, No. 2, 19–37. Jagannathan, R. and Ma, T. (2003). Risk reduction in large portfolios: Why imposing the wrong constraints helps. Journal of Finance, 58, 1651–1684. Jobson, J.D. and Korkie, B. (1980). Estimation for Markowitz efficient portfolios. Journal of the American Statistical Association, 75, 544–554.
Johansen, A. and Sornette, D. (1999). Critical crashes. Risk, January. Jorion, P. (1986). Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative Analysis, 21, No. 3, 279–292. Jorion, P. (1994). Mean/variance analysis of currency overlays. Financial Analysts Journal, May– June. Kahn, R. (1999). Seven quantitative insights into active management. Barclays Global Investors. Kazemi, H., Schneeweis, T. and Gupta, R. (2003). Omega as a Performance Function. University of Massachusetts, Amherst, Isenberg School of Management, Center for International Securities and Derivative Markets. Kemp, M.H.D. (1997). Actuaries and derivatives. British Actuarial Journal, 3, 51–162. Kemp, M.H.D. (2005). Risk management in a fair valuation world. British Actuarial Journal, 11, No. 4, 595–712. Kemp, M.H.D. (2007). 130/30 Funds: Extending the alpha generating potential of long-only equity portfolios. Threadneedle Asset Management working paper. Kemp, M.H.D. (2008a). Enhancing alpha delivery via global equity extended alpha portfolios. Threadneedle Asset Management working paper. Kemp, M.H.D. (2008b). Efficient implementation of global equity ideas. Threadneedle Asset Management working paper. Kemp, M.H.D. (2008c). Efficient alpha capture in socially responsible investment portfolios. Threadneedle Asset Management working paper. Kemp, M.H.D. (2008d). Catering for the fat-tailed behaviour of investment returns: Improving on skew, kurtosis and the Cornish-Fisher adjustment. Threadneedle Asset Management working paper. Kemp, M.H.D. (2009). Market consistency: Model calibration in imperfect markets. John Wiley & Sons, Ltd. Kemp, M.H.D. (2010). Extreme events: tools and studies. www.nematrian.com/extremeevents.aspx (accessed 10 September 2010). Klein, R.W. and Bawa, V.S. (1976). The effect of estimation risk on optimal portfolio choice. Journal of Financial Economics, 3, 215–231. Klibanoff, P., Marinacci, M. and Mukerji, S. (2005). A smooth model of decision making under ambiguity. Econometrica, 73, 1849–1892. Kl¨oppel, S., Reda, R. and Schachermayer, W. (2009). A rotationally invariant technique for rare event simulation. Risk, October. Knight, F.H. (1921). Risk, Uncertainty, and Profit. Houghton Mifflin. Kupiec, P. (1995). Techniques for verifying the accuracy of risk management models. Journal of Derivatives, 3, 73–84. Laloux, L., Cizeau, P., Bouchard, J-P. and Potters, M. (1999). Random matrix theory. Risk, March. Laskar, J. and Gastineau, M. (2009). Existence of collisional trajectories of Mercury, Mars and Venus with the Earth. Nature, 459, 817–819. Ledoit, O. and Wolf, M. (2003a). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10, 603–621. Ledoit, O. and Wolf, M. (2003b). Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 31, No. 1. Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365–411. Linter, J. (1965). The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. Review of Economics and Statistics, 47, 13–37. Litterman, R. and the Quantitative Resources Group, Goldman Sachs Assets Management (2003). Modern Investment Management: An equilibrium approach. John Wiley & Sons, Inc. Lohre, H. (2009). Modelling correlation and volatility within a portfolio. 
Presentation to Incisive Media Portfolio Construction: Robust Optimisation and Risk Budgeting Strategies, April 2009. Longuin, F. (1993). Booms and crashes: application of extreme value theory to the U.S. stock market. Institute of Finance and Accounting, London Business School, Working Paper No. 179. Longuin, F. and Solnik, B. (2001). Extreme correlation of international equity markets. Journal of Finance, 61, No. 2, 649–676. Lowenstein, R. (2001). When Genius Failed: The rise and fall of Long-Term Capital Management. Fourth Estate.
Malevergne, Y. and Sornette, D. (2002). Minimising extremes. Risk, November. Malhotra, R. (2008). Extreme value theory and tail risk management. Lehman Brothers Portfolio Analytics Quarterly, 30 May. Malhotra, R. and Ruiz-Mata, J. (2008). Tail risk modelling with copulas. Lehman Brothers Portfolio Analytics Quarterly, 30 May 2008. Malz, A. (2001). Crises and volatility. Risk, November. Markowitz, H. (1952), Portfolio selection. Journal of Finance, 7, No. 1, 77–91. Markowitz, H. (1959), Portfolio selection: Efficient diversification of investments. John Wiley & Sons, Ltd. Markowitz, H. (1987). Mean-variance analysis in portfolio choice and capital markets. Blackwell. Merton, R.C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. Review of Economics and Statistics, 51, No. 3, 247–257. Merton, R.C. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3, 373–413. Merton, R.C. (1980). On estimating the expected return on the market. Journal of Financial Economics, 8, 323–361. Meucci, A. (2005). Beyond risk and asset allocation. Springer. Meucci, A. (2006). Beyond Black-Litterman: views on non-normal markets. Risk Magazine, February, 87–92. Michaud, R. (1989). The Markowitz optimization enigma: Is optimized optimal? Financial Analysts Journal, 45, 31–42. Michaud, R. (1998). Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation. Oxford University Press. Minsky, B. and Thapar, R. (2009). Quantitatively building a portfolio of hedge fund investments. Presentation to Quant Congress Europe, November 2009. Morgan Stanley (2002). Quantitative Strategies Research Note. Morgan Stanley. Mossin, J. (1966). Equilibrium in a capital asset market. Econometrica, 34, 768–783. Norwood, B., Bailey, J. and Lusk, J. (2004). Ranking crop yields using out-of-sample log likelihood functions. American Journal of Agricultural Economics, 86, No. 4, 1032–1042. Nolan, J.P. (2005). Modelling financial data with stable distributions. See http://academic2. american.edu/∼jpnolan/stable/stable.html (accessed 12 October 2009). Palin, J. (2002). Agent based stock-market models: calibration issues and application. University of Sussex MSc thesis. Palin, J., Silver, N., Slater, A. and Smith, A.D. (2008). Complexity economics: application and relevance in actuarial work. Institute of Actuaries FIRM Conference, June 2008. Papageorgiou, A. and Traub, J. (1996). Beating Monte Carlo. Risk, June. Pena, V.H. de la, Rivera, R. and Ruiz-Mata, J. (2006). Quality control of risk measures: Backtesting VAR models. Journal of Risk, 9, No. 2, 39–54. Pickands, J. (1975). Statistical inference using extreme order statistics. Annals of Statistics, 3, 119–131. Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (2007). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press. Samuelson, P.A. (1969). Lifetime portfolio selection by dynamic stochastic programming. Review of Economics and Statistics, 51, 3, 239–246. Samuelson, P.A. (1991). Long-run risk tolerance when equity returns are mean regressing: Pseudoparadoxes and vindication of businessman’s risk, in Brainard, W.C., Nordhaus, W.D. and Watts, H.W. (eds). Money, Macroeconomics and Economic Policy, MIT Press, 181–200. Sayed, A.H. (2003). Fundamentals of Adaptive Filtering. John Wiley & Sons, Ltd. Scherer, B. (2002). Portfolio resampling: Review and critique. Financial Analysts Journal, November/ December, 98–109. Scherer, B. (2007). 
Portfolio Construction and Risk Budgeting, third edition. RiskBooks. Shadwick, W.F. and Keating, C. (2002). A universal performance measure. Journal of Performance Measurement, 6 (3) . Shaw, R.A., Smith, A.D and Spivak, G.S. (2010). Measurement and modelling of dependencies in economic capital. Paper presented to Sessional Meeting of the Institute of Actuaries, May 2010. Sharpe, W.F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19, 425–442.
SSSB (2000). Introduction to Cluster Analysis. Schroder Salomon Smith Barney, 7 July. Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the 3rd Berkeley Symposium on Probability and Statistics I., University of California Press, 197–206. Stone, J.V. (2004). Independent Component Analysis: A tutorial introduction. MIT Press. Taleb, N.N. (2004). Fooled by Randomness, second edition. Penguin Books. Taleb, N.N. (2007). The Black Swan. Penguin Books. Thompson, K. and McLeod, A. (2009). Accelerated ensemble Monte Carlo simulation. Risk, April. Tobelem, S. and Barrieu, P. (2009). Robust asset allocation under model risk. Risk, February. Treynor, J.L (1962). Towards a theory of the market value of risky assets. Unpublished manuscript. A final version was published in 1999, in Asset Pricing and Portfolio Performance: Models, Strategy and Performace Metrics. Risk Books, pp. 15–22. Varnell, E.M. (2009). Economic scenario generators and solvency II. Paper presented to Sessional Meeting of the Institute of Actuaries, November 2009. Von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior, Princeton University Press. Wald, A. (1950). Statistical Decision Functions. John Wiley & Sons, Inc. Weigend, A.S. and Gershenfeld, N.A. (1993). Time series prediction: forecasting the future and understanding the past, SFI Studies in the Sciences of Complexity, Proc Vol. XV, Addison-Wesley. Whalley, A.E. and Wilmott, P. (1993). An asymptotic analysis of the Davis, Panas and Zariphopoulou model for option pricing with transaction costs. Mathematical Institute, Oxford University, Oxford. Wright, S.M. (2003a). Forecasting with confidence. UBS Warburg Global Equity Research. Wright, S.M. (2003b). Correlation, causality and coincidence. UBS Warburg Global Equity Research. Zellner, A. and Chetty, V.K. (1965). Prediction and decision problems in regression models from the Bayesian point of view. Journal of the American Statistical Association, 60, 608–615. Zumbach, G. (2006). Backtesting risk methodologies from one day to one year. Journal of Risk, 9, No. 2, 55–91.
Index 2-norm constrained portfolios, Bayesian approaches 197–201 2007–09 credit crisis see credit crisis from 2007 A-norm constrained portfolios, Bayesian approaches 197–201 absolute aspects, VaR 14 active fund managers 35–6, 53–4, 61, 116–20, 133–4, 139, 141–4, 154–6, 158–60, 208–9, 257–8, 264, 265–6, 282 actuaries 263 agents, state of the world 228–30 Akaike information criterion 52, 124 see also heuristic approaches alpha-beta separation concepts 153–4, 209 alphas 142–4, 153–4, 161–3, 169, 170–1, 209, 238–9 see also skills ambiguity aversion 267–8 analytical techniques 24, 42, 91–139, 206–7, 229, 241–2 anchors, Black-Litterman portfolio construction technique 191–4 Anderson-Darling normality test 31 Ang and Bekaert RS model 218–41 annealing schedules 138 annuities 116 appendix 4, 287–93 Arbitrage Pricing Theory (APT) 152 Archimedian copulas, definition 78 Asian crisis 128 Asimov, Isaac 245 asset allocations 3, 5, 35–6, 41, 142, 154–6, 161–3, 174–5, 221, 226–31, 241–3, 264, 289–91 asset-liability management (ALM) 157, 161, 192, 240, 242, 258–61, 263 asymmetric correlation phenomenon 218–19 audit teams 276–7 augmented risk models 91, 93, 96–7 see also econometric . . . ; fundamental . . .
autocorrelations 57, 123–4 autoregressive behaviour 39–41, 121–4, 218, 234–6, 241–2 autoregressive models (AR) 121–2, 123–4, 218, 234–6, 288, 289–90 autoregressive and moving average model (ARMA) 123–4 average returns 164, 217–19 backfilled data, exercises 288–9 backtesting 73–8, 97, 163–9, 180, 206–7, 284 see also information ratio . . . ; performance evaluations; Sharpe . . . ; Sortino . . . ; Stirling . . . Bank of England 247 bankruptcies 9, 35 banks 2, 7, 8–9, 18, 54, 61, 85, 118, 221, 240, 242, 246–61, 269, 271, 273–4, 283, 292–3 Basel II 248–50, 274 failures 9, 246–7, 261 ‘living wills’ 247 barrier options 232 Basel II 248–50, 274 ‘Bayes-Stein’ estimation 194–7 see also shrinkage . . . Bayesian approaches to portfolio construction 5, 145, 167, 178, 180–1, 187–90, 193, 197–203, 206–8, 209, 212, 222, 237, 238, 254, 269–70, 284 see also resampled portfolio optimisation Bayesian priors 178, 188–90, 206–7, 209, 212, 222, 225, 236, 256–7, 281, 284 ‘bear’ regime, regime-switching models 218–19 behavioural psychology 1–2, 6–10, 54–5, 57, 87, 134, 144–5, 161–3, 164–9, 173–4, 188–90, 195–6, 218–31, 235–6, 240, 245–7, 263–4, 267, 278, 281, 282, 284 benchmarks 35–6, 57, 154, 164–6, 171, 239, 242 Bentham, Jeremy 222 Bernoulli distribution 35
‘best in breed’ approaches, fund managers 155–6 betas 87–8, 93–5, 114–15, 151–4, 242 Bewley, T.F. 267–70 ‘beyond default’ loss 246–7 biases 1–2, 6–10, 57, 87, 134, 164–9, 195–6, 221, 240, 246–7, 263, 267, 281, 284 bid-ask/offer spreads 159, 178, 225 binary trees 87–90 binomial distributions 35 binomial lattices 157, 231 bivariate Normal probability distributions 61–4, 68–9, 76, 218–19 ‘black swans’ 7, 85, 250, 283 Black-Litterman portfolio construction technique (BL) 5, 145, 151, 173, 177–8, 191–4, 200–2, 208–9, 238 see also robust mean-variance portfolio . . . Black-Scholes option pricing formula 157, 175, 232–4, 271–2 blended PCA/ICA portfolio risk models 91, 98, 103, 112–20, 134–9 blind-source risk models 91, 93, 95, 96–115 see also principal components analysis; statistical . . . Bloomberg market data 56 boards of directors 261, 276–7 Boltzmann probability distributions 137–8 bonds/bond-indices 17, 23, 58–9, 86, 94–7, 172, 175, 246–7, 254, 282 bootstrapping methods 198, 202–9 see also resampled . . . bounded rationality 54–5, 161–3, 173–4, 188–90 box counting 67–8, 124–5 see also fractile–fractile plots bracketing the extremum 135–6 Bretton Woods agreement 23 Breuer, T. 255–7 brokerage commission transaction costs 158–9, 225, 232–4 Brownian motion 156 budgets 3, 142–3, 145, 161–3, 175, 237–8, 265–6, 274, 275–6, 284 business models 6, 10 calibration concepts 163–9, 197–201, 231–2 see also backtesting call-put ratios 32 Capital Asset Pricing Model (CAPM) 18, 144, 149–52, 173–4, 191–2, 196, 266 see also Black-Litterman . . . ; mean-variance optimisation techniques capital gains tax 160 capital market line (CML) 150–4, 191–2 capital requirements see regulatory capital requirements capitalisation classifications 94–5, 99
cardinal utility measures 222–3 ‘carry trades’ 2 Cartesian grids 213–14 ‘catastrophe’ premiums 271 Cauchy distributions 42, 109 Cauchy-Schwarz inequality 109 causal dependency models 83–4 see also co-dependency . . . Central Limit Theorem 20, 32–45 certaincy equivalence 187–90 chain-link calculations 59 changing variables, Monte Carlo simulations 210, 211–13 chaotic market dynamics 125–7, 270, 283 characteristics functions 33–5 chi-squared distributions 30–1 chief executive officers 276–7 chief risk officers (CROs) 272–3 Cholesky decomposition 215 classification structures, portfolio risk models 94–5, 99 Clayton copula, definition 78 cluster analysis 87–90, 95, 129–33, 142, 175, 283 co-dependency characteristics 78–84, 119–20, 214–15 see also causal dependency . . . ; copulas co-skewness 239–40 Cochrane, J.H. 144–5 coherent risk measures 15, 243 see also tail VaR collateralised debt obligations (CDOs) 76, 79, 231, 260, 271 commercial banks, stress testing 292 The Committee of Sponsoring Organisations of the Treadway Commission 275 complexity pursuit 111–12, 113–14 conditional VaR (CVaR) 15, 148, 153, 193–4 see also tail VaR conjugate priors 190 constant proportional portfolio insurance (CPPI) 157 constant relative risk aversion (CRRA) 223–31 continuous distributional forms 11–23, 41 continuous time 5, 231–2 convergence concepts 33–4, 48, 49, 123–4, 185, 210–15, 240 convex sets 207 coping mindsets 2, 4, 6, 263–79, 281, 285 copulas 64–90, 119–20, 193, 214–15, 243 see also fractile–fractile plots Cornish-Fisher fourth moment approximations 26–9, 39, 114, 118, 282, 287–8 corporate actions, stress testing 259 corporate bonds, empirical examples of fat-tailed behaviour 17
Index correlation coefficients 41, 78, 84–5, 87, 88–90, 105, 107, 110–11, 119–20, 181, 192–3, 196–7, 218–43, 250, 271, 282–3, 289–91 see also copulas cost of capital, CAPM 149–54 counterparty risk 252–3, 259 covariances 73–5, 77, 78, 85, 88, 94–7, 100–3, 130–3, 146–75, 177–215, 219–43, 256, 282, 289–92 crashes 2, 7, 8–9, 18, 50, 54, 55, 128–9 credibility weightings 187–90, 209 credit crisis from 2007 2, 7, 8–9, 18, 54, 61, 118, 221, 247, 251–2, 261, 269, 271, 273 credit default swaps (CDSs) 255 credit ratings 94–5, 97, 275 CRMPG-III 251–2 cross-sectional comparisons, concepts 9–10, 72–5, 82, 238–9 crowded trades 53–5, 84–5, 268–9 cubic-cubic quantile curve fits 51 cumulative distribution functions (cdfs) 12–15, 31–2, 42–4, 46–7, 63–5, 81–3, 214–15 cumulative returns, performance evaluation statistics 164 currencies 22–3, 92, 95–6, 155–6, 225, 246–7 curse of dimensionality 213 curve-fitting approaches, QQ-plots 28–9, 80–3, 282 daily returns on various equity market indices 18–30, 40, 50, 51, 57–8, 86–7, 282 data concepts 55–9, 85–90, 91–7, 143, 178–87 mining 143 de-correlated return series 57–8 decision-making, Knightian uncertainty 229, 266–79, 285 deferred taxes 249 delta hedging 232–4 see also dynamic hedging derivatives 53, 84–5, 94–5, 157, 171–3, 175, 201–2, 214, 228, 231–4, 240, 242, 255, 259–60, 271–2, 285 see also futures; options; swaps pricing theory 157, 175, 231–4, 240, 242, 271–2, 285 regime-switching models 231–2, 242, 285 stress testing 255, 259–60 deterministic chaos 126–7, 270 difference equations 121–4 discrete time 5, 228, 231–2, 241–2 ‘distance’ definitions, cluster analysis 88–90, 142, 283 distributional forms, concepts 11–23, 42–50, 51–3, 79–80, 129–33, 168–9, 206–7, 243
distributional mixtures 5, 37–41, 52, 55, 68–9, 91, 129–33, 134–5, 217–43 see also regime-switching models divergences, multi-dimensional QQ-plots 82–3 diversification 17, 18–20, 33–4, 35–6, 220–1, 249–50, 264, 265, 282, 292 domain of attraction 44–5, 46–7 see also generalised central limit theorem dot.com boom 54, 163 Dow Jones market data 56 downside extreme events 10, 15–17, 20, 26, 31–2, 36, 40, 50, 77, 80–1, 218–19 downside standard deviation, Sortino performance ratio 165 drawdown, performance evaluation statistics 164 dual benchmarks, mean-variance optimisation techniques 154, 173 duration 59, 94–5, 97, 254 dynamic hedging 172–3, 232–4, 253, 265 dynamic multipliers 121–2 dynamic programming 149, 226–31 econometric risk model 91, 92–4, 95–7, 98, 133–4 economic scenario generators (ESGs) 201–2, 232 economic sensitivities, time factors 58–9, 86–7 efficient frontier analysis 145–54, 191–4, 204–5, 290 see also risk-return optimisation efficient market hypothesis (EMHs) 173–4 eigenvalues 69–75, 100–6, 114, 122, 125, 187 see also principal components analysis eigenvectors 100–6, 114–15, 125, 186–7 see also principal components analysis electroencephalography (EEG) 107–8 emerging markets 263–4 empirical examples of fat-tailed behaviour 11–23, 56, 78–83, 282 ensemble Monte Carlo technique 214 enterprise risk management (ERM) 274–5, 279, 293 entrepreneurs, uncertainty profits 266, 269, 278 entropy 85, 88–90, 110–11, 124–5, 256–7 equally weighted (1/N rule) portfolios 182–3, 187, 197–201 equity indices 17–22, 23, 24, 40, 50, 51, 56, 85, 94–7, 175, 218–19, 246–7, 259–60, 282 ‘equivalence’ measures 5, 9, 181, 206–7 estimation errors 5, 177–8, 180–7, 195–201, 206–7 Eurepo 8–9 Euribor 8–9 Euro block 22–3, 274 European options 32 ex-ante tracking error 15, 146–54, 162–3 exact error estimates 181–5
excess returns, exact error estimates 182–5 exchange traded funds (ETFs) 240 exercises 4, 287–93 exogenous factors, portfolio risk models 91–7, 98 expectation-maximisation algorithm (EM) 129–33 expected shortfall 15, 193–4 see also tail VaR expected utility theory 222–4 exponential family of probability distributions 190 extrapolation see also extreme value theory interpolation contrasts 47–9, 282 extrema concepts 135–7 extreme events 1–6, 7–10, 11, 246, 251, 263–79, 281–6, 293 see also really extreme events; risk extreme value theory (EVT) 16–17, 45–50, 282 see also Fisher-Tippet theorem F-tests 182–7 factor analysis see principal components analysis fat tails see also kurtosis; skew beneficial aspects 36, 282 causes 20, 32–41, 53–5, 84–5, 217, 237–9, 268–9, 281, 282–3 concepts 1–6, 7–59, 61–90, 91–2, 116–20, 217–43, 264, 281–6 definition 1, 11, 61 empirical examples of fat-tailed behaviour 11–23, 24, 40, 50, 51, 56, 78–83, 282 EVT 16–17, 45–50, 282 heuristic approaches 237–40 joint (multivariate) return series 1–2, 4, 7, 33–4, 61–90, 282–3, 288–92 moments of the distribution 23–32, 287–8 Monte Carlo simulations 214–15 persistence considerations 20, 40 portfolio construction techniques 5, 217–43, 285 practitioner perspectives 53–6, 75–6, 83–5 principles 7, 10, 16, 23, 29, 32, 36, 40, 49, 52, 55, 84–5, 264, 281–2 regime-switching models 5, 236–40 relative aspects 7–10, 263–6, 281 selection effects 116–20, 264, 265–6, 283 single (univariate) return series 4, 7–59, 281–2, 287–8 visualisation approaches 11–17, 27–8, 61–4, 80–1, 281 feedback loops 53–4, 55, 163–4, 245–6 Financial Services Authority (FSA) 247, 248–9, 251, 252–3, 274 financial services entities, stress testing 292
finite variance returns 35, 41–5 see also generalised central limit theorem; stable distributions first order autoregressive models 235, 288 first order conditions (FOC) 227 Fisher-Tippet theorem 46–50 see also extreme value theory fitted cubic 39–40 flexible least squares approach 242 forecasting techniques 127, 129, 133–4, 139, 163–9 Fourier transforms 122–4 fractile–fractile plots 66–75, 119–20 see also copulas Fr´echet distribution 46–50 frequency domain characterisation, moving averages model 123–4 FT Interactive Data 56 FTSE All Share index 18–20, 27–8, 30, 39–40, 50–1, 56 FTSE W Europe (Ex UK) index 18–21, 27, 30 fund managers 35–6, 53–4, 61, 84–5, 116–20, 133–4, 139, 141–5, 154–6, 158–60, 161–3, 164–5, 169, 170–1, 208–9, 238–9, 240, 257–8, 264, 265–6, 268–9, 279, 282, 285 fundamental risk model 91, 92–5, 96–7, 98, 133–4 future returns, past returns 245–6 futures 87 ‘fuzziness’ issues, regime-switching models 228–9, 236 GAAP 249 Gamma function 76–7 Gaussian copula 65–79 Gaussian mixture models (GMM) 129–33, 134–5 see also k-means clustering gearing 152 generalised autoregressive conditional heteroscedasticity (GARCH) 39–41, 218, 236 generalised central limit theorem 44–5 see also finite variance returns generalised least squares regression 29, 121, 124–5, 138 genetic portfolio optimisation 138, 139 geometric Brownian motion 156 global extrema 136–7, 138, 283 globalisation 87, 221 ‘goodness of fit’ 52 goodwill 16–17 governance models 261, 272–8, 286 governments bonds 17, 23, 58–9, 86, 94–7, 172, 175, 246–7, 254, 282
Index currencies 22–3 debt levels 8–9 gradient ascent 112 Gram-Schmidt orthogonalisation 110, 114–15 ‘growth’ stocks 95, 152 ‘GRS F-test’ 183–4 guaranteed products 240 Gumbel copula 78 Gumbel distribution 46–50 ‘gut feelings’ 3 see also qualitative perspectives hedging 53–4, 172–3, 208–9, 231–4, 240, 253, 254, 265–6 Herfindahl Hirshman Index (HHI) 194 heteroskedasticity 20, 38–41, 124–5 see also time-varying volatilities heuristic approaches 5, 124, 221, 237–40 see also Akaike information criterion hierarchical clustering 87–90 higher frequency data, estimation errors 181–2 higher moments 31–2, 53, 80–1, 165–6, 218–19 see also kurtosis; regime-switching models; skew; variance holistic approaches 155–6, 273–5 home country bias 221 hyperbolic tangent distributions 110–11 hypothesis tests 24–32, 89–90, 104–6 hysteresis, concepts 235–6 identity matrix 181, 196–9 idiosyncratic risk 93–7, 103, 151–4, 186–7, 253–4, 283 impact assessments, stress testing 6, 245–61 implementation challenges 4, 55–9, 85–90, 134–9, 143–4, 174–5, 209–15, 241–3, 258–61, 279 implied alphas 142–4, 153–4, 169, 170–1, 209, 238–9 see also reverse optimisation implied correlation 85, 282 implied covariance 85, 282 implied volatility 40, 53, 85, 201–2, 270–2, 282 importance sampling, Monte Carlo simulations 210, 211–13 in-sample backtesting 166–9 inadmissible estimators 195–7 income tax 160 incubation, concepts 134 independence tests, risk-model backtesting 168–9 independent components analysis (ICA) ‘all at once’ extraction of un-mixing weights 110–13, 114–15 blended PCA/ICA portfolio risk models 91, 98, 103, 112–20, 134–9
complexity pursuit 111–12, 113–14 concepts 4, 91, 96, 98–9, 106–15, 134–9 correlation contrasts 107, 110–11 definition 96, 98, 106–7, 112–13 fat tails 116–20 kurtosis 109–10, 113–20 PCA contrasts 112–15 uses 98, 106–7 variance uses 113–15 index of stability 42–4 inertia assumption 267–9, 279 inflation 83–4, 239 information ratios 143–4, 162–3, 164–5 input assumptions, portfolio construction techniques 177–87, 191–4, 284 insurance companies 83, 144–5, 232, 248–61, 267–8, 274, 292–3 interbank lending 8–9 interest rates 92, 95–6, 231–2, 254, 267 interior point method 174–5, 188 internal models 250, 274, 276–7 International Accounting Standards Board (IASB) 249 interpolation concepts 47–9, 86–7, 282 investment banks 292 investors types 221–2, 235–6, 265–6, 268–9 utility functions 217, 221–31 isoelastic utility see constant relative risk aversion Ito processes 156 James-Stein estimate 194–7 see also shrinkage . . . Jarque-Bera normality test (JB), concepts 30–1 joint (multivariate) return series 1–2, 4, 7, 33–4, 61–90, 94, 95, 119–20, 162–3, 177–8, 194–7, 203–4, 208, 214–15, 217–43, 282–3, 288–92 copulas 61–90, 119–20 definition 1, 7, 61–4 empirical estimation of fat tails 78–83 exercises 288–92 fractile–fractile plots 66–75, 119–20 implementation challenges 85–90 marginal distributions 64–90, 119–20, 214–15 multi-dimensional QQ-plots 80–3 practitioner perspectives 83–5 principles 84–5, 88, 264, 282–3 time-varying volatilities 72–5, 84–5, 217–43, 264, 282–3 visualisation approaches 61–4, 80–1 judgemental processes 3, 231, 237–40, 279 see also qualitative perspectives jumps 251–2
k-means clustering 129–33 see also Gaussian mixture models Kalman filter 167 Knightian uncertainty 229, 266–79, 285 Kolmogorov complexity 111–12 Kolmogorov-Smirnov normality test 31 Kullback-Leibler ‘distances’ 89–90, 110 see also relative entropy kurtosis 25–32, 35, 38, 53, 58, 68, 71–5, 109–10, 113–20, 138, 218–19, 239, 282, 287–90, 291–2 see also leptokurtosis; skew l-norm constrained portfolios, Bayesian approaches 197–201 Lagrange multipliers 171, 198–9 lambdas 204–5 language uses, risk budgeting 3 lattices 157, 231 learning 10, 87, 129–30, 270 least squares regression 29, 121, 138, 242 Ledoit and Wolf shrinkage approach 181–2, 196–9 leptokurtosis 38 see also kurtosis leverage 152 L´evy distributions 34, 42 see also stable distributions life insurance companies 248–61, 272, 292 likelihood function 6, 78–9, 89–90, 111, 124–5, 130–3, 183, 188–90, 195–7, 245–6, 255, 257–8, 278 likelihood ratios 89–90 see also relative entropy limited liability structures, put options 261 linear combination mixtures 38, 91, 97–129 see also independent components analysis; principal components analysis linear difference equations 121–4 linear regression, market dynamics 120–5 linear time series analysis 128–9 liquidity risk 33, 54, 55, 84–5, 118, 225, 226, 241, 242–3, 251–3, 260–1, 268–9, 272, 278, 285, 286 ‘living wills’, banks 247 local extrema 135–7, 138 locally linear time series analysis 128–9, 139 log-Normal distributions 1, 11–32, 33–4, 40, 58, 61–90, 162–3 Long-Term Capital Management (LTCM) 254 long-term contracts, really extreme events 267–9, 278, 286 longitudinal comparisons 9–10, 72–3 look-back/forward bias 164, 166–7, 284 loss given default 246–7
lotteries, expected utility theory 223 lower partial moments 238–9 ‘magnitude’ criteria 6, 98–9, 112–15, 119, 245–6, 265, 283 ‘Mahalanobis’ distance 256 management structures, really extreme events 276–7 Maple 56 marginal distributions 64–90, 119–20, 214–15 marginal utility theory 222–3 Market Consistency (author) 4 ‘market consistent’ portfolio construction 5, 201–2 market consistent values 225 market data 55–9, 85–90, 91–7, 178–87 market dynamics 4, 17, 58, 83, 86–7, 91, 94–5, 97, 120–9, 138–9, 172–3, 270, 283 market impact transaction costs 159–60, 225, 232–4 market implied data 5, 6, 178, 186–7, 201–2, 270–2 market makers, options 172–3 market portfolios, Black-Litterman portfolio construction technique 191–4 ‘market risk appetite’, really extreme events 268–9, 277, 286 market values 225–6, 231, 285 market-influencing factors 4, 39, 53, 83, 91–139, 263–4, 283, 289 concepts 4, 83, 91–139, 263–4, 283 exercises 289 implementation challenges 134–9 portfolio risk models 85–90, 91–7 practitioner perspectives 91–7 run-time constraints 138–9 markets 1–10, 18, 39, 50, 53, 54–5, 57–8, 61, 83, 86–7, 88–90, 91–139, 142, 221, 223–31, 235–6, 245–6, 247, 251–2, 261, 263–6, 269, 271, 273, 278, 281, 282, 283, 289 behavioural psychology 1–2, 7–8, 10, 54–5, 57, 223–31, 235–6, 245–6, 263–4, 278, 281, 282 credit crisis from 2007 2, 7, 8–9, 18, 54, 61, 118, 221, 247, 251–2, 261, 269, 271, 273 October 1987 crash 18, 50 MarkIt Partners market data 56 Markov chains 212, 217–18, 227–8, 234–7, 241 Markov processes 40, 212, 217–18, 227–8, 234–7, 241 Markowitz portfolio optimisation 148, 161, 173–4, 180–1, 188, 206–7 see also mean-variance optimisation techniques material element of smoothing 288 Mathematica 56
Index mathematical equivalence 12 mathematical tools 2–3, 4, 55–9, 85–90, 97–8, 143, 283 see also quantitative perspectives Matlab 56 matrix theory 100–6, 125, 195–7 see also principal components analysis maximum domain of attraction (MDA) 46–7 maximum likelihood estimation (MLE) 78–9, 111, 124–5, 130–3, 183, 195–7 mean absolute deviation 138 mean-variance optimisation techniques see also Capital Asset Pricing Model; portfolio . . . ; resampled . . . ; risk-return . . . ; robust . . . concepts 5, 141, 146–75, 177–215, 217, 218, 222, 237, 284–5 constraints 146, 148, 152–4, 178–87, 195–201, 204–8, 229, 284 critique 173–4, 177–8, 217, 284 definition 146–9 estimation errors 5, 177–8, 180–7, 195–201, 206–7 exercises 289–91 implementation challenges 174–5, 209–15 input assumptions 177–87, 191–4, 284 more general features 152–4 universe considerations 154, 172, 174–5 ‘meaningfulness’ criteria 98–9, 112–15, 118–20, 283 means 10, 12–23, 25–32, 37–8, 41, 53, 61, 80–3, 97–8, 114, 138, 146–54, 180–7, 193–7, 203–8, 219–43, 287–8, 292 medians 138 mesokurtic distributional mixtures 38 method of moments 78–9 Microsoft Excel 26, 56 Mill, John Stuart 222 mindsets 2, 4, 6, 263–79, 281, 285 minimax decision rule 267 minimisation/maximisation algorithms 134–9, 175, 178–87, 191–2 see also optimisation minimum risk portfolio (MRP) 204–5, 232, 276–7, 285 model error 5 modern portfolio theory (MPT) 221 moments of the distribution 25–32, 44, 53, 80–1, 97–8, 113–14, 165–6, 180–1, 218–19, 238–9, 287–8 see also Cornish-Fisher . . . ; kurtosis; means; skew; standard deviations monetary values, constant relative risk aversion 224–6 money market interest rates 8–9
Monte Carlo simulations 23–5, 30–1, 83, 84, 157, 185–6, 201–2, 203–8, 210–15, 241–2 monthly returns on various equity market indices 18–30, 86–7 moving averages model (MA) 112, 122–4 MSCI ACWI Transport 67–71, 116 MSCI ACWI Utilities 67–71, 99–100, 116 MSCI Barra market data 56 multi-dimensional QQ-plots 80–3 multi-period (dynamic) portfolio optimisation 148–9, 156–8, 167, 174–5, 283–4 multiple regression analysis 95 multivariate data series see joint (multivariate) return series multivariate linear regression 124–5 naive 1/N rule 182–3, 187, 197–201 narrowly-held risks 144–5 ‘nearest neighbour’ approaches 236 Nematrian Limited 4, 287 nested stress-testing approach 292 net present values, CAPM uses 151 neural networks 127, 129, 134, 139 no arbitrage principle 94–5, 225 noise 98, 110, 116–18, 125–7 non-coincidentally time series, implementation challenges 86–7 non-constant time period lengths 57–8 non-discoverable processes, really extreme events 269–70 non-life insurance companies 292 non-Normal innovations 124–5, 193–4 nonlinear models 125–7, 129, 172–3, 246 Normal distributions 5, 11–32, 33–4, 41–5, 61–90, 105–19, 129–33, 148–54, 162–3, 187–90, 194–7, 208–9, 211, 214–15, 217–20, 230, 243, 272 ‘normal’ regime, regime-switching models 218–19 normalisation of signal strength, blended PCA/ICA portfolio risk models 115 norms, concepts 197–201 null hypothesis 24–32, 104–6, 230 numerical techniques 174, 210–15 observations, learning 10 October 1987 crash 18, 50 oil prices 92, 95 Omega function 31–2, 165–6 one-sided hypothesis tests 29–31 online toolkits 56 operational risk 249–50, 258, 259–61, 275, 277–9, 292 opinion pooling 193, 209 optimal portfolio allocations, regime-switching models 226–31
optimisation 135–9, 141–75, 195–7, 222, 226–43, 252, 257 see also mean-variance . . . ; minimisation/maximisation algorithms optimisation algorithms, implementation challenges 174–5 options 53, 157, 171–3, 175, 231–4, 259–61, 271–2 ordinal utility measures 222–3 ordinary least squares regression (OLS) 182–5, 269 orthogonal signals, ICA 108–9, 110, 114–15 out-of-sample backtesting 73–8, 97, 166–9, 180, 206–7, 284 out-of-sample log-likelihood tests (OSLL) 77–8, 168–9 outside-the-box thinking 263–4, 279, 285 over-fitting dangers 41, 50–2, 84, 282 see also parsimony Own Risk and Solvency Assessments (ORSAs) 274 Pad´e approximants 124 parameterisations 5, 42–4, 51, 217–43 Paris theatre tickets 263–4 parsimony 41, 50–2, 75–6, 84, 132 see also over-fitting dangers partial derivatives 134–5 partial differential equations (PDEs) 227, 233 passive (index-tracking) fund managers 209, 265 past returns, future returns 245–6 path dependency 175, 232 pegged currencies 22–3 pension funds, stress testing 255, 292 perceptions 10, 84, 264, 265–79, 281 performance evaluations 32, 35–6, 163–9, 268–9 see also backtesting; returns . . . persistence considerations, fat tails 20, 40 philosophical perspectives on wealth 3 Pickands-Balkema-de Haan theorem 47 platykurtic distributional mixtures 38 ‘plausible’ but unlikely scenarios, stress testing 247–8, 255–7 ‘population’ means/standard-deviations, definition 26 portfolio construction techniques 1–6, 10, 18, 32, 35–6, 88, 119, 141–75, 177–215, 217–43, 245–6, 265–6, 281–6, 289–91 see also mean-variance optimisation . . . ; robust . . . alphas 142–4, 153–4, 161–3, 169, 170–1, 209, 238–9 backtesting 73–8, 97, 163–9, 180, 206–7 definition 141 exercises 289–91 fat tails 5, 217–43, 285
implementation challenges 174–5, 209–15 input assumptions 177–87, 191–4, 284 manager selections 154–6, 264 ‘market consistent’ portfolio construction 5, 201–2 multi-period (dynamic) portfolio optimisation 148–9, 156–8, 167, 174–5, 283–4 Omega function 32, 165–6 options 171–3 performance evaluations 32, 35–6, 163–9 practitioner perspectives 141–2, 170–1, 173–4, 208–9, 240, 278 principles 141, 149, 158, 160, 162, 166, 167, 171, 174, 190, 193, 198, 226, 229, 254, 264, 278, 281–2, 283–5, 286 really extreme events 265–79 shrinkage approach 181–2, 194–7, 199–203 stress testing 245–6, 253–61 taxes 145, 158, 160, 284 traditional techniques 141–75, 283–4, 289–90 transaction costs 55, 143–4, 145, 158–60, 169, 172–3, 175, 225, 232–4, 241–3, 284 ‘portfolio factories’ 171 portfolio insurance, concepts 157–8 portfolio risk models 85–90, 91–139, 167–9, 240, 246–7, 260–1, 264, 265–6, 283 see also econometric . . . ; fundamental . . . ; statistical . . . principles 102, 283 types 91–7, 133–4 portfolio-purpose considerations, really extreme events 265–6 posterior distributions 188–90 see also Bayesian . . . power spectrum 125 practitioner perspectives 4, 53–6, 75–6, 83–5, 91–7, 133–4, 141–2, 170–1, 173–4, 208–9, 240, 257–8, 278, 284 see also fund managers pre-filtering requirements 85–6 pre-knowledge concepts 188–90 see also priors preferences 222–4, 271–2 price data, concepts 58–9 pricing algorithms 59, 157, 175, 231–4, 240, 242, 260, 271–2, 285 pricing/return anomalies 144–5 principal components analysis (PCA) 4, 69–78, 91, 96–106, 107, 108, 112–20, 134–9, 186–7, 208, 221, 289 see also eigen . . . blended PCA/ICA portfolio risk models 91, 98, 103, 112–20, 134–9 definition 96, 98–102, 107, 112–13 fat tails 116–20 ICA contrasts 112–15
Index one-at-a-time extractions 106 uses 98 weighting schemas 102, 112–15 principles’ summary 281–6 priors 178, 188–90, 206–7, 209, 212, 222, 225, 236, 256–7, 281, 283, 284 probability density functions (pdfs) 11–23, 37–8, 61–4, 81–3, 110–11, 210, 228, 241 probability mass functions 11, 228 product (independence) copula 65–75 projection pursuit 109–10, 112, 113–14 proportion of failures (POF) 168–9 pull to par 59 quadratic functions 135–6, 138, 146–75, 177–215, 218, 222–4 see also mean-variance optimisation techniques quadrature technique 241–2 qualitative perspectives 1–2, 3, 5, 53–9, 141–5, 174, 245–61, 263–79, 281 quantile–quantile box plots see fractile–fractile plots quantile–quantile plots (QQ-plots) 12–32, 37–8, 45, 49–50, 52, 66–75, 80–3, 214–15, 281, 282 curve-fitting approaches 28–9, 80–3, 282 definition 12–14 distributional mixtures 37–8, 52 exercises 287–9 moments of the distribution 23–32, 287–8 multi-dimensional QQ-plots 80–3 stable distributions 45 TVaR 16 quantitative perspectives 1–2, 3, 5, 53–5, 97–8, 126–7, 133–4, 141–75, 217–43, 245–6, 263, 279, 281 see also mathematical tools quasi-quadratic algorithms 138 quasi-random sequences, Monte Carlo simulations 210, 213–14 random matrix theory 104–6, 130 see also principal components analysis random number generators 214–15 rational behaviour 54–5, 144–5, 150, 161–3, 173–4, 188–90, 218–26, 263–4, 278 real estate markets 56–7 real world 5, 10, 84, 217–43, 264, 265–79, 281, 285 realised variance 53 really extreme events 2–3, 6, 7, 246, 251, 263–79, 281, 282–3, 285–6, 293 correlations 271 definition 6, 263–6 enterprise risk management 274–5, 279
governance models 272–8, 286 implementation challenges 279 Knightian uncertainty 229, 266–79, 285 long-term contracts 267–9, 278, 286 management structures 276–7 market implied data 6, 270–2 mindsets 6, 263–79, 281, 285 non-discoverable processes 269–70 operational risk 275, 277–9 outside-the-box thinking 263–4, 279, 285 portfolio construction techniques 265–79 portfolio-purpose considerations 265–6 practitioner perspectives 278 principles 264, 268, 269, 278, 285–6 reaction guidelines 268–70, 274–9, 285 short-term contracts 267–9, 278, 286 statistics 2–3 uncertainty 266–79 rebalanced portfolios 226, 232–4 recent past observations 8–9 recursive substitution 121–2 references 295–300 regime-switching models (RS) 5, 37, 40, 132–3, 217–43, 285, 291–2 see also time-varying . . . autoregressive behaviour 234–6, 241–2 ‘bear’/‘normal’ regime types 218–19 complications 220–1 constraints 229 critique 218–21, 227–9, 241–3 definition 5, 217–18, 230–1 derivatives pricing theory 231–4, 240, 242, 285 examples 218–19 exercises 291–2 fat tails 5, 236–40 ‘fuzziness’ issues 228–9, 236 general form 230–1 identification of prevailing regime 228–9, 236 implementation challenges 241–3 operational overview 219–20 optimal portfolio allocations 226–31 practitioner perspectives 240 principles 221, 226, 229, 285 statistical tests 230 transaction costs 225–6, 232–4, 241–3 uses 218–19 utility functions 217, 221–31, 243 regulatory capital requirements 247, 248–52, 261, 270–1, 272, 274–5, 292–3 relative entropy 85, 88–90, 110–11, 256–7 see also likelihood ratios relative performance 35–6, 164–6 religious beliefs 3 resampled efficiency (RE) 202–9
resampled portfolio optimisation 5, 178, 186, 202–8, 210 see also Bayesian approaches . . . returns 2, 5, 11–23, 55–9, 85–90, 91–139, 141–75, 178–87, 224, 257, 265–6, 276, 278, 282, 283 definition 145–7 empirical examples of fat-tailed behaviour 11–23, 56, 282 market data 55–9, 85–90, 91–7 pricing/return anomalies 144–5 risk 2, 5, 98, 133–4, 141–75, 178–87, 224, 257, 265–6, 278, 283 time-varying risk/return parameters 5, 217–43 reverse optimisation 170–1, 195–7, 222, 238–9, 252, 257 see also implied alphas reverse stress testing 6, 247, 252–5 risk 2–3, 17, 33, 35–6, 54, 55, 98, 102–3, 118, 144–54, 161–3, 209, 222, 223–31, 238–9, 249–50, 251–3, 259–61, 265–9, 273–9, 285, 286, 292, 293 see also extreme events; stress testing appetites 17, 35–6, 144–5, 161–3, 209, 222, 223–31, 238–9, 265–6, 268–9, 273–9, 286 assessments 5, 14–17, 40, 85–90, 91–7, 114, 146–54, 161–6, 173, 177–8, 190–215, 224, 225, 238, 240, 243, 246–61, 274–8, 282, 284–5, 290–1 assurance 276–7 aversion 144–5, 161–3, 223–31, 267–79 budgeting 3, 142–3, 145, 161–3, 175, 237–8, 265–6, 274, 275–6, 284 CAPM 18, 144, 149–52, 266 definitions 2, 145–7, 266, 276–7 ex-ante tracking error 15, 146–54, 162–3 management 6, 68–9, 85, 133–4, 245–61, 270–9, 283, 293 portfolio risk models 85–90, 91–139 returns 2, 5, 98, 133–4, 141–75, 178–87, 224, 257, 265–6, 278, 283 system vendors 40 time-varying risk/return parameters 5, 217–43 TVaR 15–17, 148, 246–7 types 17, 33, 54, 55, 102–3, 118, 151–4, 249–50, 251–3, 259–61, 266–8, 276–8, 285, 292, 293 VaR 5, 14–17, 114, 148, 152–4, 166, 173, 177–8, 190–215, 224, 225, 238, 240, 243, 246–61, 284–5, 290–1 risk models see portfolio risk models risk-free rates 150–4, 156–7, 164–5, 172–3, 182, 290 risk-return optimisation 5, 145–54, 163–9, 191–4, 226, 257, 261, 275–6, 278, 285
see also efficient frontier analysis; mean-variance . . . risk/reward balance 2, 98, 133–4, 141–75, 224, 257 robust mean-variance portfolio construction techniques 5, 173, 177–8, 190–215, 224, 225, 238, 240, 284–5, 290–1 see also Black-Litterman . . . ; portfolio . . . ; resampled . . . definition 190–4 exercises 290–1 HHI 194 implementation challenges 209–15 non-Normal innovations 193–4 practitioner perspectives 208–9, 240 principles 190, 191, 193, 198, 209, 284–5 shrinkage approach 181–2, 194–7, 199–203 traditional approaches 190–4, 208–9 robust regression 124–5, 138, 190–1 run-time constraints, market-influencing factors 138–9 Russian debt crisis 128 S&P 500 Composite index 18–21, 27, 30, 50, 56, 87, 154 sample means 183, 194–206 sampling errors 23–32, 82–3, 125, 182–3, 202–9 savings and loans associations (S&Ls) 273 scale characteristics 12–14, 46–9, 76–7, 102, 112–15, 185, 226–7 scenario analysis 246–52, 256–7 see also stress testing security market line 150 see also capital market line selection effects, portfolio risk models 91, 113–15, 116–20, 264, 265–6, 283 self-exciting threshold autoregressive models (SETARs) 235–6 separation theorem, multi-period (dynamic) portfolio optimisation 156–8 Shapiro-Wilk normality test 31 shareholders 15–16, 247, 261 Sharpe performance ratio 32, 164 Sharpe, William 149–50 Sharpe-Omega ratio 32, 165–6 short-sale constraints 181–2, 198–9, 204–5, 226, 229 short-term contracts, really extreme events 267–9, 278, 286 shrinkage approach 181–2, 194–7, 199–203 signal extraction techniques 96, 97–120, 127 see also independent components analysis; principal components analysis ‘similarity’ concepts 88–90 see also relative entropy Simplex algorithm 148
Index simulated annealing 137–8, 139 single risk factor models 196, 201 single (univariate) return series 4, 7–59, 61, 264, 281–2, 287–8 skew 25–32, 42–4, 53, 58, 71–5, 77–8, 114, 218–19, 239, 246, 251–2, 271–2, 282, 287–8, 291–2 see also kurtosis skills 143–4, 161–3, 169, 170–1, 208–9, 238–9, 240, 257–8, 279 see also alphas Sklar’s theorem 65–90, 120 see also copulas; marginal distributions Smirnov-Cram´er-von-Mises normality test 31 smoothing 56–7, 82–3, 86–7, 207–8, 212–13, 288 software 55–9, 138–9 Solvency Capital Requirement (SCR) 249–50, 292 Solvency II 248–50, 274, 292 Sortino performance ratio, definition 165 sovereign default 251 sponsor covenant schemes 255 spreads 8–9 stable distributions 16–17, 35, 41–5, 50–2, 121–2 see also finite variance . . . ; L´evy distributions stable Paretian 44 stakeholders 15–16, 247, 258, 261, 272–8, 285 ‘stale pricing’ 57 stamp duties 160 standard bivariate Normal probability distributions 61–4, 68–9, 218–19 standard deviations 10, 12–23, 25–32, 37–8, 61, 73–7, 80–3, 99–100, 104–6, 113–14, 116–20, 143–4, 164–6, 182–90, 193–4, 204–8, 211, 287–91 see also risk . . . ; tracking errors; volatilities Standard and Poor’s 275 state of the world, regime-switching models 5, 217–43, 285 statistical design approaches, stress testing 255–7 statistical risk model 91, 92–4, 95, 96–106, 133–4 see also independent components analysis; principal components analysis statistical significance, backtesting 163–9 statistical tests for non-normality 29–31 Stirling performance ratio, definition 165 stochastic optimal control approaches 157–8 stochastic processes 183–4, 233–4, 270 stochastic programming approaches 157–8 stock selections 161, 174–5 stratified sampling, Monte Carlo simulations 210, 212–14 stress testing 6, 85, 245–61, 283, 285, 292 see also risk; Value-at-Risk
critique 246–7 definition 245–6, 247–8, 251–2, 253–4 exercises 292 generations of stress tests 256–7 greater focus on what might lead to large losses 248, 250–2 idiosyncratic risk 253–4 impact calculations 258–61 implementation challenges 258–61 key exposures 254–5 methods 246–7, 251–2 ‘plausible’ but unlikely scenarios 247–8, 255–7 portfolio construction techniques 245–6, 253–61 practitioner perspectives 257–8 principles 247, 253, 254, 261, 285 regulatory capital computations 247, 248–52, 261 scenario analysis 246–52, 256–7 shareholders 247, 261 statistical design approaches 255–7 third parties 258, 261, 285 trade bookings/settlements 258–9 traditional approaches 247–52 under-appreciated risks 253–4 strictly stable distributions, definition 42 structured investment vehicles (SIVs) 79, 260, 271 sub-additive risk measures 243 substitutability issues, no arbitrage principle 94–5, 225 sum of squared residuals (SSRs), exact error estimates 183–5 super-states, regime-switching models 236–7 surveyor valuations 56–7 swaps 53, 255 systematic/systemic risks 17, 102, 151–4, 252–61, 283 t-copula 76–8 tail correlations 271 tail indexes, EVT 47–50 tail VaR (TVaR) 15–17, 148, 193, 210–11, 246–7 takeovers 9, 35–6 tangency portfolio 150 see also capital market line taxes 145, 158, 160, 247, 249, 284 term to maturity, bonds 58–9 ‘testing to destruction’ 6, 247 third parties, stress testing 258, 261, 285 Thomson Reuters market data 56, 67 threshold autoregressive models (TAR) 128–9, 235–6 time clocks 57–8, 86–7
time domain characterisation, moving averages model 123–4 time factors, economic sensitivities 58–9, 86–7 time predictability 125–7 time series analysis, critique 120–9, 269–72 time stationarity 5, 9–10, 11, 33, 35, 36–41, 48, 50, 79, 120–9, 177–8, 217, 242 time-varying distributions 5, 35, 36–41, 53, 72–5, 84–5, 129–33, 217–43, 281, 282–3, 285, 291–2 time-varying risk/return parameters 5, 217–43, 281, 285, 291–2 see also regime-switching models time-varying volatilities 5, 35, 36–41, 53, 72–5, 84–5, 129–33, 217–43, 264, 281, 282–3, 285 see also heteroskedasticity Tokyo SE (Topix) index 18–19, 22, 27, 30, 31 tracking errors 15, 146–54, 162–6 see also mean-variance optimisation . . . ; risk . . . ; standard deviations trade bookings/settlements, stress testing 258–9 trade-off function, risk-return optimisation 5, 145–54, 163–9, 226, 257, 261, 275–6, 278, 285 traders 142 traditional portfolio construction techniques 141–75, 283–4, 289–90 traditional robust mean-variance portfolio construction techniques 190–4, 208–9 training algorithms, neural networks 127, 129, 139 transaction costs 55, 143–4, 145, 158–60, 169, 172–3, 175, 225–6, 232–4, 241–3, 284, 285 transfer functions 124 transitional probabilities 291–2 truncated searches, ICA 110 two-sided hypothesis tests 29–31 uncertainty 2–3, 129, 229, 266–79, 285 unconditional-coverage tests, risk-model backtesting 168–9
under-appreciated risks, stress testing 253–4 univariate data series see single (univariate) return series ‘unknown unknowns’ 85, 129, 203, 250 unsupervised learning 87, 129–30 upside extreme events 10, 20, 26, 30–1, 36, 39–40, 80–2, 218–19 utility functions 217, 221–31, 243, 285 ‘value’ stocks 95, 152 value-added 143–4, 161–3, 209, 240, 285 Value-at-Risk (VaR) 14–17, 114, 148, 152–4, 166, 167–9, 193–4, 210, 243, 246–61 see also stress testing variance 35, 41–5, 53, 113–15, 146–54, 189–90, 289–92 vector spaces 100–2, 122 see also principal components analysis view optimisation 170–1, 201–2 see also implied alphas visualisation approaches, fat tails 11–17, 27–8, 61–4, 80–1, 281 volatilities 2, 5, 35, 36–41, 53, 58, 72–5, 84–5, 129–33, 146–54, 180–7, 201–2, 217–43, 264, 270–2, 276, 281, 282–3 Von Neumann and Morgenstern utility theory 223 weekly returns on various equity market indices 18–30, 86–7 Weibull distribution 46–50 weighted covariances 238–9 weighted Monte Carlo 210, 214, 241–2 weighting schemas, PCA 102, 112–15 Weiner processes 156 worse loss stress tests 248, 256–7 ‘worst case’ scenarios 267 yield-curve dynamics 17, 58, 83, 86–7, 94–5, 97, 202, 254 z-transform 123–5, 168–9 Index compiled by Terry Halliday