Distress Risk and Corporate Failure Modelling: The State of the Art




‘This book provides a comprehensive and highly informative review of corporate distress and bankruptcy modelling literature. The book traces the early development of this literature from linear discriminant models that dominated bankruptcy research of the 1960s and 1970s to modern machine learning methods (such as gradient boosting machines and random forests) which have become more prevalent today. The book also provides a comprehensive illustration of different machine learning methods (such as gradient boosting machines and random forests) as well as several pointers in how to interpret and apply these models using a large international corporate bankruptcy dataset. A helpful book for all empirical researchers in academia as well as in business.’ Iftekhar Hasan, E. Gerald Corrigan Chair in International Business and Finance, Gabelli School of Business, Fordham University in New York, USA

‘The corporate bankruptcy prediction literature has made rapid advances in recent years. This book provides a comprehensive and timely review of empirical research in the field. While the bankruptcy literature tends to be quite dense and mathematical, this book is very easy to read and follow. It provides a thorough but intuitive overview of a wide range of statistical learning methods used in corporate failure modelling, including multiple discriminant analysis, logistic regression, probit models, mixed logit and nested logit models, hazard models, neural networks, structural models of default and a variety of modern machine learning methods. The strengths and limitations of these methods are well illustrated and discussed throughout. This book will be a very useful compendium to anyone interested in distress risk and corporate failure modelling.’ Andreas Charitou, Professor of Accounting and Finance and Dean, School of Economics and Management, The University of Cyprus

‘This is a very timely book that provides excellent coverage of the bankruptcy literature. Importantly, the discussion on machine learning methods is instructive, contemporary and relevant, given the increasingly widespread use of these methods in bankruptcy prediction and in finance and business more generally.’ Jonathan Batten, Professor of Finance, RMIT University

DISTRESS RISK AND CORPORATE FAILURE MODELLING

This book is an introductory text on distress risk and corporate failure modelling techniques. It illustrates how to apply a wide range of corporate bankruptcy prediction models and, in turn, highlights their strengths and limitations under different circumstances. It also conceptualises the role and function of different classifiers in terms of a trade-off between model flexibility and interpretability. Jones’s illustrations and applications are based on actual company failure data and samples. Its practical and lucid presentation of basic concepts covers various statistical learning approaches, including machine learning, which has come into prominence in recent years. The material covered will help readers better understand a broad range of statistical learning models, ranging from relatively simple techniques, such as linear discriminant analysis, to state-of-the-art machine learning methods, such as gradient boosting machines, adaptive boosting, random forests, and deep learning. The book’s comprehensive review and use of real-life data will make this a valuable, easy-to-read text for researchers, academics, institutions, and professionals who make use of distress risk and corporate failure forecasts.

Stewart Jones is Professor of Accounting at the University of Sydney Business School. He specializes in corporate financial reporting and has published extensively in the distress risk and corporate failure modelling field. His publications appear in many leading international journals, including the Accounting Review, the Review of Accounting Studies, Accounting Horizons, the Journal of Business Finance and Accounting, the Journal of the Royal Statistical Society, and the Journal of Banking and Finance. He has published over 150 scholarly research pieces, including 70 refereed articles, 10 books, and numerous book chapters, working papers, and short monographs. Stewart is currently Senior Editor of the prestigious international quarterly, Abacus.

Routledge Advances in Management and Business Studies

Reframing Mergers and Acquisitions Around Stakeholder Relationships: Economic, Political and Social Processes
Simon Segal, James Guthrie and John Dumay

Managing Manufacturing Knowledge in Europe in the Era of Industry 4.0
Justyna Patalas-Maliszewska

Family Business and Management: Objectives, Theory, and Practice
Magdalena Biel and Beata Ślusarczyk

Consumer Packaging Strategy: Localisation in Asian Markets
Huda Khan, Richard Lee and Polymeros Chrysochou

Collaborative Leadership and Innovation: Management, Strategy and Creativity
Elis Carlström

Distress Risk and Corporate Failure Modelling: The State of the Art
Stewart Jones

For more information about this series, please visit: www.routledge.com/Routledge-Advances-in-Management-and-Business-Studies/book-series/SE0305

DISTRESS RISK AND CORPORATE FAILURE MODELLING The State of the Art

Stewart Jones

Cover image: © Getty Images

First published 2023 by Routledge, 4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN and by Routledge, 605 Third Avenue, New York, NY 10158

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2023 Stewart Jones

The right of Stewart Jones to be identified as author of this work has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
Names: Jones, Stewart, 1964– author.
Title: Distress risk and corporate failure modelling : the state of the art / Stewart Jones.
Description: Abingdon, Oxon ; New York, NY : Routledge, 2023. | Series: Routledge advances in management and business studies | Includes bibliographical references and index.
Identifiers: LCCN 2022015228 (print) | LCCN 2022015229 (ebook) | ISBN 9781138652491 (hardback) | ISBN 9781138652507 (paperback) | ISBN 9781315623221 (ebook)
Subjects: LCSH: Bankruptcy—Forecasting—Mathematical models. | Corporations—Finance—Mathematical models. | Risk—Mathematical models.
Classification: LCC HG3761 .J66 2023 (print) | LCC HG3761 (ebook) | DDC 332.7/5—dc23/eng/20220701
LC record available at https://lccn.loc.gov/2022015228
LC ebook record available at https://lccn.loc.gov/2022015229

ISBN: 978-1-138-65249-1 (hbk)
ISBN: 978-1-138-65250-7 (pbk)
ISBN: 978-1-315-62322-1 (ebk)

DOI: 10.4324/9781315623221

Typeset in Bembo by Apex CoVantage, LLC

Access the Support Material: www.routledge.com/9781138652507

CONTENTS

List of Tables  viii
List of Figures  x

1 The Relevance and Utility of Distress Risk and Corporate Failure Forecasts  1
2 Searching for the Holy Grail: Alternative Statistical Modelling Approaches  27
3 The Rise of the Machines  76
4 An Empirical Application of Modern Machine Learning Methods  110
5 Corporate Failure Models for Private Companies, Not-for-Profits, and Public Sector Entities  150
6 Whither Corporate Failure Research?  191

Appendix: Description of Prediction Models  198
References  215
Index  228

TABLES

2.1  Summary of Major Strengths and Challenges of Different Logit Models  53
2.2  Theoretically Derived Predictors of Failure  68
3.1  Applications of Boosting Models to Bankruptcy Prediction Research  84
3.2  ROC Curve Analysis for the Cross-Sectional and Longitudinal Test Sample  99
3.3  Summary of Out-of-Sample Predictive Performance for Hazard Model and Deep Learning Model (Test Sample)  107
4.1  Model Error Measures (Gradient Boosting Machines)  114
4.2  Confusion Matrix – Gradient Boosting  115
4.3  Variable Importance Scores for Gradient Boosting Model  116
4.4  Whole Variable Interactions for Strongest Variable Effects for Gradient Boosting Model  128
4.5  Model Error Measures (Random Forests)  130
4.6  Confusion Matrix – Random Forests Model  131
4.7  Variable Importance Scores for Random Forests Model  132
4.8  Model Error Measures for CART Model  133
4.9  Confusion Matrix – CART Model  134
4.10  Variable Importance for CART Model  135
4.11  Model Error Measures (Generalized Lasso Model)  136
4.12  Confusion Matrix – Generalized Lasso Model  137
4.13  Variable Importance for Generalized Lasso Model  138
4.14  Model Coefficients for Generalized Lasso Model  138
4.15  Model Summary: Model Error Measures (MARS Model)  139
4.16  Confusion Matrix – MARS Model  139
4.17  Variable Importance for MARS Model  140
4.18  Final Model  141
4.19  Model Comparison Based on ROC and K-Stat  142
4.20  Comparison of Model Performance Based on Misclassification  143
4.21  Comparison of Model Performance Based on N-Fold Cross-Validation (TreeNet/Gradient Boosting)  144
4.22  Model Stability Tests on Misclassification Rates (TreeNet/Gradient Boosting)  145
4.23  Model Stability Tests on ROC (TreeNet/Gradient Boosting)  146
4.24  Different Learning Rates on ROC (TreeNet/Gradient Boosting)  147
4.25  Different Learning Rates on Misclassification (TreeNet/Gradient Boosting)  147
4.26  Different Node Settings on ROC (TreeNet/Gradient Boosting)  148
4.27  Node Settings on Misclassification (TreeNet)  148
5.1  Business Survivorship in the US 2015–2021  151
5.2  Global Impacts of COVID-19 on SMEs (Based on OECD Surveys)  152
5.3  Definition of Private Company Failure Used in Jones and Wang (2019)  161
5.4  Summary of Predictive Performance for Five State Model Used in Jones and Wang (2019)  162

FIGURES

2.1  Behaviour of Financial Ratios Prior to Bankruptcy From Beaver (1966) Study  31
2.2  Nested Tree Structure for States of Financial Distress  52
2.3  Moody’s KMV Model  70
3.1  Single Hidden Layer Neural Network Structure  77
3.2  A Simple Decision Tree for Bankruptcy Prediction  81
3.3  Example of Deep Neural Network Structure  105
3.4  Graphical Representation of DGN, Including Projection, Pooling, and Output Blocks  107
4.1  Average LogLikelihood (Negative)  112
4.2  Misclassification Rate Overall (Raw)  113
4.3  ROC Curve of Estimation and Test Samples  113
4.4  Partial Dependency Plot for Excess Returns (12 Months) and Failure Outcome  118
4.5  Partial Dependency Plot for Market Capitalization to Total Debt and Failure Outcome  118
4.6  Partial Dependency Plot for Percent of Shares Owned by the Top 5 Shareholders and Failure Outcome  119
4.7  Partial Dependency Plot for Cash Flow per Share and Failure Outcome  120
4.8  Partial Dependency Plot for Institutional Ownership and Failure Outcome  121
4.9  Partial Dependency Plot for Market Beta (12 Months) and Failure Outcome  121
4.10  Partial Dependency Plot for Total Bank Debt and Failure Outcome  122
4.11  Partial Dependency Plot for Excess Returns (6 Months) and Failure Outcome  122
4.12  Partial Dependency Plot for Earnings per Share and Failure Outcome  123
4.13  Partial Dependency Plot for the Current Ratio and Failure Outcome  123
4.14  Partial Dependency Plot for Interest Cover and Failure Outcome  124
4.15  Partial Dependency Plot for the Altman Z Score and Failure Outcome  125
4.16  Partial Dependency Plot for EBIT Margin and Failure Outcome  125
4.17  Partial Dependency Plot for Average Debt Collection Period and Failure Outcome  126
4.18  Partial Dependency Plot for the Gearing Ratio and Failure Outcome  126
4.19  Partial Dependency Plot for Working Capital to Total Assets and Failure Outcome  127
4.20  Interaction Effects Between Excess Returns and Institutional Ownership  129
4.21  ROC Curve of Estimation and Test Samples (Random Forests Model)  130
4.22  Class Probability Heat Map for Random Forests Model – FAILURE  133
4.23  ROC Curve of Estimation and Test Samples (CART Model)  134
4.24  ROC Curve of Estimation and Test Samples (Generalized Lasso Model)  137
4.25  Comparison of Model Performance (Based on ROC)  142
4.26  Comparison of Model Performance on Misclassification  142
4.27  Comparison of Model Performance Based on N-Fold Cross-Validation (TreeNet/Gradient Boosting)  144
4.28  Seed Setting and Misclassification (TreeNet/Gradient Boosting)  144
4.29  Seed Setting and ROC (TreeNet)  144

1 THE RELEVANCE AND UTILITY OF DISTRESS RISK AND CORPORATE FAILURE FORECASTS

1. Introduction

The distress risk and corporate failure prediction modelling field has been growing steadily for nearly six decades now. Over this period, a prodigious amount of literature has emerged in a wide variety of academic journals spanning several disciplines, including accounting, finance, economics, management, business, marketing, and statistics. The sheer size and breadth of this literature is quite remarkable, even astounding. It raises the question: why such a preoccupation with the “dark side” of finance? One part of the explanation is that corporate failure models are relatively easy to develop and tend to predict quite well, and there is a substantial practitioner audience interested in distress risk and corporate failure forecasts for a variety of reasons. Another part of the explanation relates to the growing interdependency of global financial markets, the economic costs associated with corporate failure, and the changing character of the companies that become financially distressed or go into bankruptcy. For instance, business failure was once considered the dominion of small and/or newly established entities. However, in today’s world it is much more commonplace even for very large and well-established corporations to fail. On a large enough scale, corporate collapses can exacerbate market volatility, create economic recessions, erode investor confidence, and impose significant economic costs on a wide range of stakeholders, including investors, lenders and creditors, employees, suppliers, consumers, and other market participants. “Black swan” events in Taleb’s (2007) sense (supposedly unforeseeable events with massive impact) appear to be occurring with uncomfortable regularity in the financial arena.

The saying that “history never repeats itself, but it rhymes” may be very true in the corporate failure field. Since time immemorial, the world has endured many financial calamities.
While each financial crisis is to some extent unique, there are also many striking similarities. For instance, the 1997 Asian Financial Crisis was triggered when Thailand, unable to keep defending its local currency peg to the US dollar, was forced to float the baht. This led to a dramatic devaluation of its currency, and the currencies of many other Asian countries devalued in the aftermath. In fact, the term “contagion” entered common financial usage during the Asian crisis (it was dubbed the “Asian contagion”), as the crisis highlighted the complex financial and trading interdependencies of global markets. The collapse of local stock markets and property markets and an exodus of foreign investors quickly ensued, leading to a vicious cycle of economic slowdown and widespread business failures. Many Asian countries at the time were fostering fast-growing, export-driven economies that tended to mask underlying weaknesses, such as high credit growth and excessive financial leverage, coupled with a lack of good corporate governance and effective regulatory supervision. Heavy foreign borrowing, often with short-term maturities, also exposed corporations and banks to significant exchange rate and funding risks that exacerbated business financial distress.1

In response to the spreading crisis, the international community mobilized large loans totalling $118 billion for Thailand, Indonesia, and South Korea, and took other interventions to stabilize the economies of other affected countries. Financial support came from the International Monetary Fund, the World Bank, the Asian Development Bank, and governments in the Asia–Pacific region, Europe, and the United States. As stated by Carson and Clark (2013):

The basic strategy was to help the crisis countries rebuild official reserve cushions and buy time for policy adjustments to restore confidence and stabilize economies, while also minimizing lasting disruption to countries’ relations with their external creditors. To address the structural weaknesses exposed by the crisis, aid was contingent on substantial domestic policy reforms. The mix of policies varied by country, but generally included measures to deleverage, clean up and strengthen weak financial systems, and to improve the competitiveness and flexibility of their economies. On the macro side, countries hiked interest rates to help stabilize currencies and tightened fiscal policy to speed external adjustment and cover the cost of bank clean-ups. However, over time, as markets began to stabilize, the macro policy mix evolved to include some loosening of fiscal and interest rate policy to support growth.2

Another major financial crisis occurred a few short years later. The crash of the technology sector in 2001 (also known pejoratively as the “tech wreck”) followed the rapid commercialization of the internet. The crash was precipitated by excessive hype and speculation in internet stocks, on the back of sanguine corporate valuations and heightened media attention on this booming sector. An abundance of venture capital flowing into internet stocks, coupled with their fundamental lack of profitability, was pivotal to the crash. In fact, between 1995 and March 2000, the Nasdaq Composite rose a staggering 400% in value (from an index value of 1,000 to around 5,000).
Many commentators believe the high-water mark of the internet boom was the ill-fated AOL Time Warner megamerger (valued at $165 billion) in January 2000, which proved to be the biggest merger failure in corporate history. By 1999, 39% of all venture capital investments were flowing into internet stocks. There was also a frenzy of internet initial public offerings (IPOs). Of the 457 IPOs in 1999, the majority related to internet companies, with 91 internet IPOs in the first quarter of 2000 alone. The Nasdaq collapsed from its peak of 5,048 on March 10, 2000, to 1,139 on October 4, 2002, a stunning 76.81% decline. In the years following the crash, many technology companies went into bankruptcy, including larger, well-known corporations such as WorldCom, NorthPoint Communications, and Global Crossing. Some companies that survived the stock market carnage, such as Amazon and Qualcomm, lost significant market value, with Cisco Systems alone losing over 85% of its market value. The Nasdaq did not regain its pre-2001 peak until 2015.

In the wake of the technology sector collapse, there was also a spate of high-profile corporate bankruptcies involving accounting frauds between 2000 and 2002, including Enron, WorldCom, Tyco International, Adelphia, Peregrine Systems, and other companies. This led to new legislative measures to protect the investing public. In 2002, the US government passed the Public Company Accounting Reform and Investor Protection Act of 2002 (known as the Sarbanes-Oxley Act).3 The objective of the SOX legislation was to remedy the root causes of the accounting scandals, such as auditor conflicts of interest (for example, audit firms that performed audit engagements while taking on lucrative consultancies with the same clients); boardroom and audit committee failures (for example, directors who did not exercise their responsibilities diligently or who were not competent in financial and/or business matters, and audit committees that were not independent of directors); conflicts of interest involving financial analysts (for instance, analysts who were conflicted by providing stock recommendations to clients while also providing lucrative investment banking services to the same entities); investment managers who recommended buying internet stocks to clients while secretly selling or shorting them; poor lending practices by banking institutions; and questionable executive compensation practices. For instance, issuing stock options and providing bonuses based on earnings performance can encourage earnings management practices (particularly as stock options were not treated as an expense prior to the SOX legislation).

Barely had the world recovered from the “tech wreck” and the high-profile accounting scandals of the Enron and WorldCom era when an even more serious and devastating financial crisis erupted. The Global Financial Crisis (GFC) of 2007–2009 resulted in one of the worst global financial meltdowns since the Great Depression (Blinder, 2020). The GFC shared a number of similarities with previous financial crises but was much more severe and global in its impact. As noted by Jones and Peat (2008), the GFC was sparked by the collapse of the US housing bubble. Investment banks exploited the housing bubble by creating new financial derivative products (such as collateralized debt obligations or CDOs) that were linked to the risky subprime lending market
(pejoratively called “NINJA” loans, or “no income, no job and no assets”). These lower quality loans had very low “honeymoon” rates that reset to much higher interest rates when the honeymoon period was over, which greatly escalated mortgage defaults as the housing market turned south.

As noted by Jones and Peat (2008), the CDO market played a critical role in the GFC. As mortgage-backed securities (MBSs) linked to risky subprime lending markets are not considered investment grade, they would not attract high credit ratings and hence were not acceptable investment products for most professional fund managers or investors. The emergence of CDOs cleverly avoided this dilemma. By dividing up the MBSs into several tranches with different risk profiles, reflecting different investment grades, subprime mortgages (sometimes called the “toxic waste”) were packaged up with the higher-grade debt. Many CDOs sheltered a significant amount of subprime debt but nevertheless were issued high credit ratings by rating agencies (such as Moody’s) because there was a sufficient proportion of high-quality debt to raise the overall instrument to investment grade. Before the GFC, hedge fund managers were particularly active in trading the equity and mezzanine tranches of CDOs. The value of the CDOs was “marked up” in times when housing prices were booming in the US, with the CDOs used as collateral with banks to raise further cheap debt. This in turn allowed investment banks and hedge funds to leverage more heavily into the CDO market. However, when the housing sector stumbled, the value of the mortgages underlying the CDOs’ collateral began to spiral downwards with rising default rates. Banks and investment institutions holding CDOs faced a significant deterioration in the value of their CDO holdings. These problems were compounded by very high leveraging in the CDO market funded from short-term borrowings and a lack of transparency underlying such aggressive investment practices. As stated in the Financial Crisis Inquiry Commission Report of the US government (2011, p. xix), which investigated the causes of the GFC:

For example, as of 2007, the five major investment banks – Bear Stearns, Goldman Sachs, Lehman Brothers, Merrill Lynch, and Morgan Stanley – were operating with extraordinarily thin capital. By one measure, their leverage ratios were as high as 40 to 1, meaning for every $40 in assets, there was only $1 in capital to cover losses. Less than a 3% drop in asset values could wipe out a firm. To make matters worse, much of their borrowing was short-term, in the overnight market – meaning the borrowing had to be renewed each and every day. For example, at the end of 2007, Bear Stearns had $11.8 billion in equity and $383.6 billion in liabilities and was borrowing as much as $70 billion in the overnight market. It was the equivalent of a small business with $50,000 in equity borrowing $1.6 million, with $296,750 of that due each and every day. One can’t really ask “What were they thinking?” when it seems that too many of them were thinking alike. And the leverage was often hidden – in derivatives positions, in off-balance-sheet entities, and through “window dressing” of financial reports available to the investing public.
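The leverage arithmetic in the FCIC passage can be checked with a few lines of code. The sketch below simply recomputes the Bear Stearns figures quoted above; the variable names and the definition of leverage as total assets over equity are illustrative choices made here, not the FCIC’s own methodology.

```python
# Recomputing the leverage figures quoted from the FCIC report (illustrative only).
equity = 11.8e9        # Bear Stearns equity at end of 2007, USD (from the quote)
liabilities = 383.6e9  # Bear Stearns liabilities, USD (from the quote)

total_assets = equity + liabilities
leverage_ratio = total_assets / equity      # assets supported by each dollar of equity
wipeout_drop = equity / total_assets        # proportional fall in assets that exhausts equity

print(f"Total assets: ${total_assets / 1e9:.1f} billion")        # about $395 billion
print(f"Leverage ratio: {leverage_ratio:.1f} to 1")              # roughly 34 to 1
print(f"Asset fall that wipes out equity: {wipeout_drop:.1%}")   # just under 3%
```

On these figures, a fall in asset values of a little under 3% is enough to eliminate the firm’s equity, which is precisely the fragility the Commission is describing.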


The general panic in the market resulted in many banks calling in their original collateral. However, with the escalating volume of CDO sales, the market quickly became saturated, particularly for the equity and mezzanine tranches of the CDOs. In some cases, this resulted in enormous book losses for several hedge funds and investment banks. With no buyers, the equity and mezzanine tranches literally had no value. The subprime crisis had an immediate and devastating impact on world equity and debt markets generally, and credit derivative markets in particular. The subsequent turmoil in world equity and bond markets created a long-lasting international liquidity and credit crisis, which impacted the fortunes of many financial institutions and corporations for some time to come.

An unparalleled spate of corporate bankruptcies, which included one of the world’s largest investment banks, Lehman Brothers, led to unprecedented government intervention in the economy, as well as several government bailouts of large banks and financial institutions. The Lehman Brothers collapse was the largest in history, involving $688 billion in assets. The ten largest GFC failures alone had a combined value of more than $1.37 trillion at the time of their bankruptcy filings. The failure dates and listed assets (at the time of failure) included Lehman Brothers (September 2008, 668 billion), Washington Mutual (September 2008, 328 billion), General Motors (June 2009, 82.6 billion), CIT Group (January 2009, 80.4 billion), Chrysler (April 2009, 39.3 billion), Thornburg Mortgage (May 2009, 36.5 billion), IndyMac (July 2008, 32.7 billion), General Growth Properties (April 2009, 29.6 billion), Lyondell Chemical (January 2009, 27.4 billion) and Colonial BancGroup (August 2009, 25.8 billion) (all amounts in USD).

As discussed in Jones and Johnstone (2012), the US government had to take urgent action to save the global financial system, including many bailouts. For instance, the New York Federal Reserve and JP Morgan Chase provided an emergency cash bailout for Bear Stearns. Following the bailout, Bear Stearns was taken over by JP Morgan Chase at a fraction of its pre-GFC share value. Another example is Fannie Mae and Freddie Mac. Faced with severe financial losses, escalating default rates, high levels of debt, and an inability to raise fresh capital, the Federal Housing Finance Agency (FHFA) placed Fannie Mae and Freddie Mac into conservatorship run by the FHFA in September 2008. Several other companies fell into this category, including Lloyds TSB, Countrywide Financial, and Northern Rock. The passing of the Emergency Economic Stabilization Act4 by the US Congress in 2008 provided the US Treasury with $700 billion to buy troubled assets and restore liquidity in markets through the Troubled Asset Relief Program (TARP). While millions of people lost their jobs and homes as a result of the GFC, many notable commentators questioned the Wall Street culture of unfettered greed and excessive risk taking that drove the world into recession. As stated by the Financial Crisis Inquiry Commission Report (2011, p. 291)5 into the origins and causes of the GFC:
The Commission concludes the failure of Bear Stearns and its resulting government-assisted rescue were caused by its exposure to risky mortgage assets, its reliance on short-term funding, and its high leverage. These were a result of weak corporate governance and risk management. Its executive and employee compensation system was based largely on return on equity, creating incentives to use excessive leverage and to focus on short-term gains such as annual growth goals. Bear experienced runs by repo lenders, hedge fund customers, and derivatives counterparties and was rescued by a government-assisted purchase by JP Morgan because the government considered it too interconnected to fail. Bear’s failure was in part a result of inadequate supervision by the Securities and Exchange Commission, which did not restrict its risky activities, and which allowed undue leverage and insufficient liquidity.

The FCIC Report (2011, p. 212) singled out the poor performance of credit rating agencies, whose ratings of CDOs provided the financial lubricant that helped facilitate the onset of the crisis. Moody’s came in for special criticism in the final report for responding slowly to the crisis and using outdated analytical risk models:

The Commission concludes that the credit rating agencies abysmally failed in their central mission to provide quality ratings on securities for the benefit of investors. They did not heed many warning signs indicating significant problems in the housing and mortgage sector. Moody’s, the Commission’s case study in this area, continued issuing ratings on mortgage-related securities, using its outdated analytical models, rather than making the necessary adjustments. The business model under which firms issuing securities paid for their ratings seriously undermined the quality and integrity of those ratings; the rating agencies placed market share and profit considerations above the quality and integrity of their ratings.

Moody’s eventually settled with state and federal authorities for nearly $864 million for its part in the ratings of risky mortgage securities, while Standard & Poor’s settled for $1.375 billion in state and federal penalties in 2015.6

In the immediate aftermath of the GFC came the Dodd–Frank Wall Street Reform and Consumer Protection Act (2010),7 touted as one of the most significant pieces of legislation under the Obama administration and one of the most sweeping reforms since the Securities Acts of the 1930s. Some of the reforms from the Dodd–Frank Act included the creation of:

(1) a Financial Stability Oversight Council (FSOC), which is accountable to the US Congress. The job of the FSOC is to monitor risks and respond to emerging threats to financial stability and constrain excessive risk in the financial system. The FSOC is also to have power to regulate nonbank financial companies. The Council also has the authority:

to recommend stricter standards for the largest, most interconnected firms, including designated nonbank financial companies. . . . Moreover, where the Council determines that certain practices or activities pose a threat to financial stability, the Council may make recommendations to the primary financial regulatory agencies for new or heightened regulatory standards.8

The FSOC also has a significant role in determining whether action should be taken to break up those firms that pose a “grave threat” to the financial stability of the United States (companies deemed “too big to fail” during the GFC);9
(2) the Consumer Financial Protection Bureau (CFPB), which is an independent agency established under the Dodd–Frank Act. One of the key roles of the CFPB is to “protect consumers from unfair, deceptive, or abusive practices and take action against companies that break the law”.10 For instance, the CFPB was given the role of preventing the predatory mortgage lending that occurred during the GFC and making it easier for consumers to understand the terms of a mortgage (and other loans) before agreeing to them;11

(3) the Volcker rule, which is a federal regulation that generally prohibits banking entities from engaging in proprietary trading or investing in or sponsoring hedge funds or private equity funds.12 It basically prevents banks from trading securities, derivatives, and commodities futures, as well as options on any of these instruments, on their own accounts for short-term proprietary trading. The regulations have been developed by five federal financial regulatory agencies, including the Federal Reserve Board, the Commodity Futures Trading Commission, the Federal Deposit Insurance Corporation, the Office of the Comptroller of the Currency, and the Securities and Exchange Commission.13 The Act also contains a provision for regulating OTC derivatives, such as the credit default swaps that were widely blamed for contributing to the GFC;

(4) the Securities and Exchange Commission (SEC) Office of Credit Ratings. Because credit rating agencies were accused of contributing to the financial crisis by providing misleading ratings, Dodd–Frank established the SEC Office of Credit Ratings. The Office of Credit Ratings:

assists the Commission [SEC] in executing its responsibility for protecting investors, promoting capital formation, and maintaining fair, orderly, and efficient markets through the oversight of credit rating agencies registered with the Commission as “nationally recognized statistical rating organizations” or “NRSROs.” In support of this mission, OCR monitors the activities and conducts examinations of registered NRSROs to assess and promote compliance with statutory and Commission requirements.14

This was clearly a response to the culpable activities of ratings agencies during the GFC in providing misleading ratings to investors on complex derivative products such as CDOs; and

(5) the Whistleblower Program, whereby the Dodd–Frank Act also strengthened and expanded the existing whistleblower program promulgated by the Sarbanes-Oxley Act (SOX).
The Dodd–Frank Wall Street Reform and Consumer Protection Act expanded the protections for whistleblowers and broadened the prohibitions against retaliation. Following the passage of Dodd–Frank, the SEC implemented rules that enabled it to take legal action against employers who have retaliated against whistleblowers. This generally means that employers may not discharge, demote, suspend, harass, or in any way discriminate against an employee in the terms and conditions of employment who has reported conduct to the Commission that the employee reasonably believed violated the federal securities laws.15 It established a mandatory bounty programme in which whistleblowers can receive monetary proceeds from a litigation settlement, expanded the definition of a covered employee, and increased the statute of limitations for whistleblowers to bring a claim against their employer from 90 to 180 days after a violation is discovered.16 The Whistleblower Program was created by Congress to provide monetary incentives for individuals to come forward and report possible violations of the federal securities laws to the SEC. Under the program, eligible whistleblowers (defined below) are entitled to an award of between 10% and 30% of the monetary sanctions collected in actions brought by the SEC and related actions brought by certain other regulatory and law enforcement authorities. An “eligible whistleblower” is a person who voluntarily provides the SEC with original information about a possible violation of the federal securities laws that has occurred, is ongoing, or is about to occur. The information provided must lead to a successful SEC action resulting in an order of monetary sanctions exceeding $1 million.17

The COVID pandemic starting in early 2020 resulted in a global health crisis that also had severe repercussions for the global economy. The pandemic led to yet another serious spate of corporate failures around the world, impacting both small and large companies alike. For instance, the International Monetary Fund estimated a large increase in the failure rate of small and medium-sized enterprises (SMEs) under COVID-19 of nearly 9%, absent any government support provided. The jobs at risk due to COVID-19–related SME business failures represented 3.1% of private sector employment.18 Over the pandemic, there were at least 60 “mega” bankruptcies in 2020 alone, which is higher than the number occurring at the height of the GFC.19 The largest corporate bankruptcies over the COVID period have so far included (all in USD): Hertz Corporation ($25.84 billion), Latam Airlines Group S.A. ($21.09 billion), Frontier Communications Corporation ($17.43 billion), Chesapeake Energy Corporation ($16.19 billion), and Ascena Retail Group Inc ($13.68 billion). Other large COVID-related bankruptcies include: Valaris plc ($13.04 billion), Intelsat S.A. ($11.65 billion), Mallinckrodt plc ($9.58 billion), McDermott International ($8.74 billion), J.C. Penney Company ($7.99 billion), Whiting Petroleum Corporation ($7.64 billion), Neiman Marcus Group LTD LLC ($7.55 billion), Oasis Petroleum Inc. ($7.5 billion), Avianca
Holdings S.A. ($7.27 billion), and Noble Corporation plc ($7.26 billion). Many other businesses, large and small, have remained in varying states of financial distress during the pandemic. As a result of periodic financial crises, there has been a considerable escalation in academic research devoted to developing more sophisticated and reliable corporate distress risk and corporate failure prediction models. There is also a growing recognition that corporate failure forecasts are important to a wide range of market participants, including accountants and audit firms, who need to make periodic going concern evaluations in line with the requirements of auditing standards; directors and senior managers who need to make periodic assessments of the going concern status of an entity and its ongoing solvency; investors and analysts who need to incorporate failure forecasts into valuation models, financial forecasts, and stock recommendations; banks and credit providers who need to assess the likelihood of loan default; and capital market participants who need to evaluate portfolio risk and the pricing of derivative instruments exposed to credit risk (see also Duffie and Singleton, 2003; Jones and Hensher, 2004). We provide further discussion of how various market participants can use distress and failure forecasts below.

1.2 Users of Distress Risk and Corporate Failure Forecasts

1.2.1 Auditors

Auditors can benefit from accurate distress risk and corporate failure forecasting models as they have responsibilities under auditing standards in relation to the going concern status of reporting entities. Auditors need to make assessments about the going concern status of entities, particularly if there are any material uncertainties about the ability of an entity to continue as a going concern. However, as discussed in Chapter 2, the research evidence suggests that bankrupt firms frequently do not receive qualified or adverse opinions based on their going concern status. Research indicates that this occurs in at least 50% of bankruptcy cases (Carson et al., 2013).

It should be noted that auditors also received criticism for their conduct during the GFC. For example, in the case of the failure of Lehman Brothers, the FCIC (2011) report particularly highlighted the Repo 105 practices used by Lehman Brothers. Repo 105 was effectively an accounting loophole that allowed companies to conceal their real amounts of leverage. Lehman Brothers used the loophole to hide its true leverage during the financial crisis (FCIC, 2011, pp. 177–178):
Ernst & Young (E&Y), Lehman’s auditor, was aware of the Repo 105 practice but did not question Lehman’s failure to publicly disclose it, despite being informed in May 2008 by Lehman Senior Vice President Matthew Lee that the practice was improper. The Lehman bankruptcy examiner concluded that E&Y took “virtually no action to investigate the Repo 105 allegations, . . . took no steps to question or challenge the non disclosure by Lehman,” and that “colorable claims exist that E&Y did not meet professional standards, both in investigating Lee’s allegations and in connection with its audit and review of Lehman’s financial statements.” New York Attorney General Andrew Cuomo sued E&Y in December 2010, accusing the firm of facilitating a “massive accounting fraud” by helping Lehman to deceive the public about its financial condition.

The “International Standard on Auditing 570 (revised) Going Concern” sets out the responsibility of auditors in relation to going concern evaluation. The implications for auditor reports are as follows:

Use of Going Concern Basis of Accounting Is Inappropriate

21. If the financial statements have been prepared using the going concern basis of accounting but, in the auditor’s judgment, management’s use of the going concern basis of accounting in the preparation of the financial statements is inappropriate, the auditor shall express an adverse opinion. (Ref: Para. A26 – A27)

Use of Going Concern Basis of Accounting Is Appropriate but a Material Uncertainty Exists

Adequate Disclosure of a Material Uncertainty Is Made in the Financial Statements

22. If adequate disclosure about the material uncertainty is made in the financial statements, the auditor shall express an unmodified opinion and the auditor’s report shall include a separate section under the heading “Material Uncertainty Related to Going Concern” to: (Ref: Para. A28 – A31, A34)

(a) Draw attention to the note in the financial statements that discloses the matters set out in paragraph 19; and

(b) State that these events or conditions indicate that a material uncertainty exists that may cast significant doubt on the entity’s ability to continue as a going concern and that the auditor’s opinion is not modified in respect of the matter.

According to the standard, there are many factors that the auditor needs to consider that may cast doubt on the ability of an entity to continue as a going concern (see para A3):

The following are examples of events or conditions that, individually or collectively, may cast significant doubt on the entity’s ability to continue as a going concern. This listing is not all-inclusive nor does the existence of one or more of the items always signify that a material uncertainty exists.

Financial
• Net liability or net current liability position.
• Fixed-term borrowings approaching maturity without realistic prospects of renewal or repayment; or excessive reliance on short-term borrowings to finance long-term assets.
• Indications of withdrawal of financial support by creditors.
• Negative operating cash flows indicated by historical or prospective financial statements.
• Adverse key financial ratios.
• Substantial operating losses or significant deterioration in the value of assets used to generate cash flows.
• Arrears or discontinuance of dividends.
• Inability to pay creditors on due dates.
• Inability to comply with the terms of loan agreements.
• Change from credit to cash-on-delivery transactions with suppliers.
• Inability to obtain financing for essential new product development or other essential investments.

Operating
• Management intentions to liquidate the entity or to cease operations.
• Loss of key management without replacement.
• Loss of a major market, key customer(s), franchise, license, or principal supplier(s).
• Labor difficulties.
• Shortages of important supplies.
• Emergence of a highly successful competitor.

Other
• Non-compliance with capital or other statutory or regulatory requirements, such as solvency or liquidity requirements for financial institutions.
• Pending legal or regulatory proceedings against the entity that may, if successful, result in claims that the entity is unlikely to be able to satisfy.
• Changes in law or regulation or government policy expected to adversely affect the entity.
• Uninsured or underinsured catastrophes when they occur.

The impact of COVID has also been an important issue for accounting and auditing standards boards around the world. For instance, the International Auditing and Assurance Standards Board (IAASB) issued Staff Audit Practice Alert April 2020, “Going Concern in the Current Evolving Environment – Audit Considerations for the Impact of COVID-19”. It stated (p. 1) that:

as a result of the COVID-19 pandemic and the associated deteriorating economic environment, reduced revenues and cash flows could raise questions about the entity’s ability to meet its current or new obligations and comply with debt covenants.


Under the Staff Audit Practice Alert, auditors now need to consider going concern status in the presence of a number of COVID pandemic risk factors, which can include loss of a major market, key customer(s), or revenue; labour shortages; a significant deterioration in the value of assets used to generate cash flows; a significant deterioration in the value of current assets, such as inventory; delays in the launch of new products or services; foreign exchange fluctuations; measurements affected by increased uncertainty; counterparty credit risk; and the entity’s solvency (pp. 5–6).

Distress risk and corporate failure models, particularly modern machine learning models, can be of considerable value to accountants and auditors. As we show throughout this book, statistical learning methods can combine and weight many such risk factors (such as the factors outlined previously) to produce an overall probability of a firm’s capacity to continue as a going concern. High-dimensional models, such as machine learning methods, are particularly well suited to this purpose. They can generate distress risk probabilities from hundreds, even thousands, of different input variables and rank-order variables based on their overall predictive power. This can potentially reduce the cognitive burden on accountants and auditors of having to consider a myriad of different risk factors, as well as provide a more objective and reliable basis for formulating going concern assessments.
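As a concrete, if simplified, illustration of this point, the sketch below fits a gradient boosting classifier to a small synthetic dataset of going concern risk indicators and then ranks the variables by importance. The feature names, the simulated data, and the use of scikit-learn are assumptions made for this example only; they are not the models or data analysed later in the book.

```python
# Minimal sketch: combining many risk indicators into a single distress probability.
# All data below are simulated; feature names are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "working_capital_to_assets": rng.normal(0.10, 0.20, n),
    "negative_operating_cash_flow": rng.integers(0, 2, n).astype(float),
    "interest_cover": rng.normal(4.0, 3.0, n),
    "excess_returns_12m": rng.normal(0.0, 0.3, n),
    "gearing_ratio": rng.normal(0.5, 0.25, n),
})

# Simulated failure outcome: weaker indicators raise the probability of failure.
logit = (-2.0
         - 3.0 * X["working_capital_to_assets"]
         + 1.2 * X["negative_operating_cash_flow"]
         - 0.2 * X["interest_cover"]
         - 1.5 * X["excess_returns_12m"]
         + 1.0 * X["gearing_ratio"])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# A single going concern risk probability per firm, combining all indicators.
prob_failure = model.predict_proba(X_test)[:, 1]

# Rank-ordered variable importance, as discussed in the text.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)
```

The output is a single probability per firm together with a ranked list of the indicators that drive it, which is the kind of summary an auditor could weigh alongside, rather than instead of, professional judgement.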

1.2.2 Account Preparers

Account preparers also need to consider the credit risk implications of particular types of assets and liabilities that are required to be recognized and disclosed under accounting standards. For instance, International Financial Reporting Standard (IFRS) 9 “Financial Instruments” establishes principles for the financial reporting of financial assets and financial liabilities. IFRS 9 requires that counterparty risk be considered in assessing the fair value price of financial instruments. The Standard states that one of the factors impacting value is credit risk:

The effect on fair value of credit risk (i.e., the premium over the basic interest rate for credit risk) may be derived from observable market prices for traded instruments of different credit quality or from observable interest rates charged by lenders for loans of various credit ratings.

Once again, distress risk models may serve an important role in the pricing of financial instruments exposed to credit risk as they can potentially provide accurate and reliable probabilities of default.
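To make the link between a modelled default probability and pricing concrete, the sketch below discounts the promised cash flows of a simple three-year loan at a rate that adds a credit spread approximated as probability of default times loss given default. The figures and the first-order spread approximation are assumptions for illustration; IFRS 9 itself does not prescribe this calculation.

```python
# Illustrative only: a credit-risk-adjusted value for a simple fixed-rate loan.
risk_free_rate = 0.03   # basic interest rate (assumed)
pd_annual = 0.04        # probability of default supplied by a distress risk model (assumed)
lgd = 0.60              # loss given default (assumed)

credit_spread = pd_annual * lgd            # common first-order approximation: spread ~ PD x LGD
discount_rate = risk_free_rate + credit_spread

cash_flows = [80.0, 80.0, 1080.0]          # coupons plus principal over three years
fair_value = sum(cf / (1.0 + discount_rate) ** t
                 for t, cf in enumerate(cash_flows, start=1))

print(f"Credit spread: {credit_spread:.2%}")                 # 2.40% on these assumptions
print(f"Credit-risk-adjusted value: {fair_value:,.2f}")
```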

1.2.3 Directors and Senior Management

Company directors and senior management also have an interest in accurate distress risk and corporate failure prediction, both in a regulatory sense and because these models can provide an early warning distress signal with sufficient lead time for corporate managers to take appropriate remedial action.
Under international accounting standards, corporate management must provide an assessment of whether the entity is a going concern or is capable of continuing as a going concern. IAS 1 “Presentation of Financial Statements” (para. 25–26) requires that management assess the entity’s going concern status as follows:

When preparing financial statements, management shall make an assessment of an entity’s ability to continue as a going concern. An entity shall prepare financial statements on a going concern basis unless management either intends to liquidate the entity or to cease trading, or has no realistic alternative but to do so. When management is aware, in making its assessment, of material uncertainties related to events or conditions that may cast significant doubt upon the entity’s ability to continue as a going concern, the entity shall disclose those uncertainties. When an entity does not prepare financial statements on a going concern basis, it shall disclose that fact, together with the basis on which it prepared the financial statements and the reason why the entity is not regarded as a going concern.

In assessing whether the going concern assumption is appropriate, management takes into account all available information about the future, which is at least, but is not limited to, twelve months from the end of the reporting period. The degree of consideration depends on the facts in each case. When an entity has a history of profitable operations and ready access to financial resources, the entity may reach a conclusion that the going concern basis of accounting is appropriate without detailed analysis. In other cases, management may need to consider a wide range of factors relating to current and expected profitability, debt repayment schedules and potential sources of replacement financing before it can satisfy itself that the going concern basis is appropriate.

Directors also have responsibilities with respect to insolvent trading and can benefit from accurate distress risk and corporate failure prediction models in making such an assessment. As an example, the Australian Securities and Investments Commission (ASIC) provides guidance to directors on insolvent trading as follows:20

If you suspect your company is in financial difficulty, get professional accounting and/or legal advice as early as possible. This increases the likelihood the company will survive. Do not take a “head in the sand” attitude, hoping that things will improve – they rarely do. Warning signs of insolvency include:

• ongoing losses
• poor cash flow
• absence of a business plan
• incomplete financial records or disorganized internal accounting procedures
• lack of cash-flow forecasts and other budgets
• increasing debt (liabilities greater than assets)
• problems selling stock or collecting debts
• unrecoverable loans to associated parties
• creditors unpaid outside usual terms
• solicitors’ letters, demands, summonses, judgements or warrants issued against your company
• suppliers placing your company on cash-on-delivery terms
• special arrangements with selected creditors
• payments to creditors of rounded sums that are not reconcilable to specific invoices
• overdraft limit reached or defaults on loan or interest payments
• problems obtaining finance
• change of bank, lender or increased monitoring/involvement by financier
• inability to raise funds from shareholders
• overdue taxes and superannuation liabilities
• board disputes and director resignations, or loss of management personnel
• increased level of complaints or queries raised with suppliers
• an expectation that the “next” big job/sale/contract will save the company.

Once again, statistical learning methods, particularly high-dimensional machine learning methods, can combine and weight many such risk factors to produce an accurate probability forecast of a firm’s capacity to continue as a going concern or become insolvent.

There are many other international corporate disclosure regulations that are under active consideration relating to business risk assessments. For instance, Exposure Draft ED/2021/6 Management Commentary published by the International Accounting Standards Board proposes a comprehensive framework that entities could apply when preparing management commentary that complements their financial statements. The proposals represent a major overhaul of IFRS Practice Statement 1 Management Commentary. ED/2021/6 (p. 57) states:

To gain insight into factors that could affect an entity’s ability to create value and generate cash flows, investors and creditors need to understand the risks of events or circumstances that could in the short, medium or long term disrupt the entity’s business model, management’s strategy for sustaining and developing that model, or the entity’s resources and relationships.

ED/2021/6 also states that the source of such risks could be external – for example, political instability – or internal – for example, the failure of a business process or an unintended consequence of a change in strategy. The source of a risk could be a one-off event, gradually changing circumstances or a group of events or circumstances that would cause disruption if they were all to occur.


Information in management commentary shall enable investors and creditors to understand:

(a) the nature of the risks to which the entity is exposed;
(b) the entity’s exposure to those risks;
(c) how management monitors and manages the risks;
(d) how management will mitigate disruption if it occurs; and
(e) progress in managing risks.21

1.2.4  Investors and Financial Analysts It is increasingly recognized in the bankruptcy prediction literature that investors and financial analysts can benefit from corporate failure forecasts. Conventional valuation models such as discounted cash flow and residual income valuation models are predicated on the going concern assumption. If the entity is distressed, these models will tend to overstate corporate valuation unless this bias is somehow corrected. Accurate distress and corporate failure forecasts are critical inputs into valuing distressed companies. Damodaran (2009) states:22 In summary, then, the possibility and costs of distress are far too substantial to be ignored in valuation. The question then becomes not whether we should adjust firm value for the potential for distress but how best to make this adjustment. Damodaran (2009) suggests several ways to incorporate distress forecasts into corporate valuation. One way is to incorporate a probability into the estimated cash flow streams. This is ideally achieved by considering all possible risk scenarios, ranging from the best outcome (or most optimistic) to the worst outcome (or most pessimistic). Probabilities can be assigned to each scenario, as shown equation 1.1 from Damodaran (2009). j n

Expected Cashflow

jt

Cashflow jt (1.1)

j 1

where π_jt is the probability of scenario j in period t and Cashflow_jt is the cash flow under that scenario and in that period. These inputs have to be estimated each year, since the probabilities and the cash flows are likely to change from year to year. The adjustment for distress is a cumulative one and will have a greater impact on the expected cash flows in the later years.23 Another approach is to adjust for distress through the discount rate. Under this approach, one could estimate the cost of equity, using a beta more reflective of a healthy firm in the business, and then add an additional premium to reflect distress, as shown in equation 1.2:

Cost of equity = Riskfree Rate + Beta_Healthy × (Equity Risk Premium) + Distress Premium    (1.2)
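To make the mechanics of equations 1.1 and 1.2 concrete, the short sketch below works through both calculations in Python using entirely invented scenario probabilities, cash flows, and rate inputs (none of these numbers are taken from Damodaran, 2009):

```python
# Illustrative sketch of equations 1.1 and 1.2 (all inputs are hypothetical).

# Equation 1.1: expected cash flow for one period is the probability-weighted
# sum of the cash flows across all risk scenarios.
scenarios = [
    {"prob": 0.60, "cashflow": 120.0},   # base case
    {"prob": 0.30, "cashflow": 60.0},    # downside case
    {"prob": 0.10, "cashflow": 0.0},     # distress: firm ceases to operate
]
expected_cashflow = sum(s["prob"] * s["cashflow"] for s in scenarios)

# Equation 1.2: cost of equity built up from a "healthy firm" beta plus an
# explicit distress premium (e.g. inferred from distressed-debt spreads).
risk_free_rate = 0.04
beta_healthy = 1.1
equity_risk_premium = 0.05
distress_premium = 0.03
cost_of_equity = risk_free_rate + beta_healthy * equity_risk_premium + distress_premium

print(f"Expected cash flow: {expected_cashflow:.1f}")
print(f"Distress-adjusted cost of equity: {cost_of_equity:.3f}")
```

The same scenario weights would need to be re-estimated for each forecast year, because the cumulative probability of distress compounds over time (see note 23).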


The distress premium can be estimated in equation 1.2 by either examining historical data on returns earned by investing in the equity of distressed firms or comparing the company’s own pre-tax cost of debt to the industry average cost of debt. An alternative to the modified discounted cash flow model is to separate the going concern assumption, and the value that emerges from it, from the effects of distress.24 Financial analysts also need to understand the distress implications involved in forming earnings forecasts and stock recommendations. As mentioned earlier, the Dodd–Frank Wall Street Reform and Consumer Protection Act (2010) contains several requirements (mainly soliciting studies and reports) relating to potential conflicts of interest among analysts. In the context of widespread public criticism of analysts, Clarke et al. (2006) considered the extent to which analysts are reluctant to issue negative recommendations on distressed firms because of the potential loss of future investment banking deals. They argued that such behaviour is expected to produce positive biases in analyst recommendations (i.e., overly optimistic recommendations). Their second question concerned the potential for conflicts of interest among analysts that have ongoing business dealings with a firm. Such analysts could have incentives to compromise their recommendations for these firms even when they are financially distressed, as underwriting and related services usually provide higher levels of revenue for the brokerage firm than securities research or brokerage. Jones and Johnstone (2012) explored analyst forecasts and recommendations among 118 large international bankruptcies sampled over the period 2000–2010. Their study revealed a counter-intuitive finding that forecasted EPS for the failed firm sample actually increased up until the bankruptcy announcement. Relative to the control group, average EPS forecasts dropped off sharply (by around 40%) in the month immediately prior to and after bankruptcy, suggesting that analyst earnings forecasts fail to anticipate corporate failure and are overly optimistic over the sample period. Analysts appear to revise down forward EPS growth estimates aggressively after the corporate failure event (on average, EPS growth forecasts dropped more than 60% one month after the failure date). The Jones and Johnstone (2012) study also showed that the number of sell recommendations started to increase noticeably from about 12 months from failure and rose sharply up to the event date. The sell recommendations increased from around 20% to 30% (of total recommendations) in the month of failure. However, the sell recommendations rose even more dramatically after failure, to over 45% around three months following the failure event. It should be noted that while Jones and Johnstone (2012) showed an increasing trend in sell recommendations in the lead-up to the bankruptcy announcement date, at the time of failure 67.77% of analyst recommendations were either buy or hold, which declined sharply to 55% in the months after failure. One explanation for these results is that analysts do not anticipate corporate failure and are overly optimistic in their forecasts as a result. The use of accurate failure prediction models can help financial analysts serve the investing public better by providing more realistic stock recommendations and financial forecasts that are calibrated to the financial health of the entity.


1.2.5  Lending Institutions

Banks can obviously benefit from accurate distress risk and corporate failure forecasting models. Banks are under significantly greater regulatory pressure to adopt rigorous credit risk evaluation tools and to stress test their credit exposures as required in the various recommendations set down by the Basel Committee on Banking Supervision (BCBS).25 For instance, the BCBS issued guidelines, “Principles for Sound Stress Testing Practices and Supervision”, immediately following the GFC in 2009. The BCBS (2009, p. 1) report stated:

The depth and duration of the financial crisis has led many banks and supervisory authorities to question whether stress testing practices were sufficient prior to the crisis and whether they were adequate to cope with rapidly changing circumstances. In particular, not only was the crisis far more severe in many respects than was indicated by banks’ stress testing results, but it was possibly compounded by weaknesses in stress testing practices in reaction to the unfolding events. Even as the crisis is not over yet there are already lessons for banks and supervisors emerging from this episode.

And, further, the GFC had exposed fundamental problems with the use of existing credit risk models (BCBS 2009, pp. 3–4):

Most risk management models, including stress tests, use historical statistical relationships to assess risk. They assume that risk is driven by a known and constant statistical process, ie they assume that historical relationships constitute a good basis for forecasting the development of future risks. The crisis has revealed serious flaws with relying solely on such an approach. First, given a long period of stability, backward-looking historical information indicated benign conditions so that these models did not pick up the possibility of severe shocks nor the build up of vulnerabilities within the system. Historical statistical relationships, such as correlations, proved to be unreliable once actual events started to unfold. Second, the financial crisis has again shown that, especially in stressed conditions, risk characteristics can change rapidly as reactions by market participants within the system can induce feedback effects and lead to systemwide interactions. These effects can dramatically amplify initial shocks as recent events have illustrated.

One of the major lessons for banks from the GFC is the need for the banking sector to develop more robust and dynamic prediction models that can factor in the impacts of rapid changes in the economic and financial environment (and their interaction effects), particularly on the risk assessments of complex financial instruments.


1.3  Scope of this Book

Given the strong international interest in distress and corporate failure prediction modelling generally, this book covers a broad range of statistical learning methods ranging from relatively simple linear techniques such as linear discriminant analysis (LDA) to state-of-the-art machine learning methods such as gradient boosting machines, adaptive boosting (AdaBoost), random forests, and deep learning methods. While this book must assume some technical background and knowledge of the field, every attempt has been made to present the material in a practical, accommodating, and informative way. To add practical appeal and to illustrate the basic concepts more lucidly, the book provides a detailed empirical illustration and comparison of various statistical learning approaches, including machine learning models that have come into prominence in the literature (see Chapter 4). This book covers distress risk and corporate failure modelling only. The terms distress risk and corporate failure prediction are used to differentiate the wide variety of studies that have been published in this literature. Some studies have developed forecasting models using a strict definition of legal bankruptcy (such as entering Chapter 7 or Chapter 11 under the US bankruptcy code). Other studies have used broader definitions of corporate failure that include a variety of distress risk events such as loan default, issuing shares to raise working capital, receiving a qualified audit opinion based on going concern, having to renegotiate the terms and conditions of a loan obligation, financial reorganization where debt is forgiven or converted to equity, failure to pay a preference dividend (or cutting dividends), or a bond ratings downgrade (as some examples). Some studies have also used a multi-class approach to failure forecasting where the dependent variable of the study accommodates not only legal bankruptcy but any one or more of the different distress states described earlier. Because the literature on distress risk and corporate failure modelling is so extensive, I do not cover the credit risk modelling literature in much detail (for instance, models that forecast credit ratings or credit rating changes). However, all the statistical learning methods discussed in this book are just as relevant to credit risk modelling as they are to distress risk and corporate failure prediction (see Jones and Hensher, 2008). For instance, credit ratings are often modelled using multiclass ordered logit or probit models, which are covered in this book. I also discuss structural default models such as the KMV distance-to-default measure that are relevant to credit risk assessments. Chapter 2 discusses the genesis of modern corporate failure modelling. The earliest corporate failure models adopted either univariate approaches (using a single ratio or variable) or multivariate techniques such as LDA. Many LDA corporate failure models have proven to be quite predictive; however, they have been extensively critiqued in the literature. While LDA does suffer from some limiting statistical assumptions (particularly multivariate normality and IID discussed in Chapter 2), LDA models have proven surprisingly robust to violations of these statistical assumptions.


However, the conceptual foundation of LDA is quite naive. For instance, as pointed out by Greene (2008), LDA divides the universe of bankruptcies into two types, entities that will fail and entities that will not. LDA treats a firm as if it were “preordained” to fail or not fail. Depending on a variety of related situations and random factors, the same entity could be in either the failed or non-failed category at any one time. Thus, according to Greene (2008), prediction of corporate failure is not a problem of classification in the same sense as “determining the sex of prehistoric individuals from a fossilized record”. Index function-based models of discrete choice, such as probit and logit, assume that for any firm, given a set of attributes, there is a definable probability that a firm will experience a distress event or corporate failure. This interpretation places all firms in a single population. The observed outcome of failure or non-failure arises from the characteristics and random behaviour of the firms. Ex ante, all that can be produced by the model is a probability. The underlying logic of corporate failure prediction is to ascertain how much each entity resembles entities that have failed in the past. Chapter 2 also points out that logit models have dominated the corporate failure prediction literature now for many decades. This is partly because of their appealing statistical properties, but there are pragmatic reasons as well. Expressing the likelihood of corporate failure as a probability is more useful than cut-off scores produced from an LDA model. For example, consider the incorporation of distress risk into corporate valuation models or going concern assessments discussed earlier. Incorporating corporate failure risk would only be meaningful and practicable if such risks were expressed in terms of probability outcomes. Chapter 2 demonstrates that the bankruptcy prediction literature has now moved beyond the traditional logit framework to consider “advanced” logit models, particularly mixed logit and nested logit models. Over the past 30 years, there have been major developments in discrete choice modelling where new approaches have increasingly relaxed the behaviourally questionable assumptions associated with the IID condition (independently and identically distributed errors) and allowed for observed and unobserved heterogeneity to be formally incorporated into model estimation in various ways. Ultimately, this has improved the explanatory and predictive power of these models. Another limitation of many current approaches is that most distress studies to date have modelled corporate failure as a naive binary classification of failure vs non-failure (the dependent variable can only take on one of two possible states). This has been widely questioned, one reason being that the strict legal concept of bankruptcy may not always reflect the underlying economic reality of corporate financial distress. The binary or two-state classification model can conflict with underlying theoretical models of financial failure and may limit the generalizability of empirical results to other types of distress that a firm can experience in the real world. Further, the practical risk assessment decisions by lenders and other parties usually cannot be reduced to a simple payoff space of just failed or non-failed (Ohlson, 1980).
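As a minimal illustration of the index-function approach just described, the following Python sketch fits a logistic regression to a small simulated sample of financial ratios (the data, variable names, and coefficients are invented purely for illustration) and, ex ante, produces nothing more than a probability of failure for a new firm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic illustration only: two ratios (working capital/total assets and
# total debt/total assets) and a binary failure outcome.
rng = np.random.default_rng(0)
n = 500
wc_ta = rng.normal(0.1, 0.2, n)
td_ta = rng.normal(0.5, 0.2, n)
# Failure is made more likely when liquidity is low and leverage is high.
latent = -2.0 - 4.0 * wc_ta + 3.0 * td_ta + rng.logistic(size=n)
failed = (latent > 0).astype(int)

X = np.column_stack([wc_ta, td_ta])
model = LogisticRegression().fit(X, failed)

# For a new firm with low liquidity and high leverage, the model output is a
# probability of failure, not a deterministic class assignment.
new_firm = np.array([[0.05, 0.70]])
print("Estimated probability of failure:", model.predict_proba(new_firm)[0, 1])
```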


The major advantage of the mixed logit model is that it allows for the complete relaxation of the IID and IIA conditions by allowing all unobserved variances and covariances to be different, up to identification. The model is highly flexible in representing sources of firm-specific observed and unobserved heterogeneity through the incorporation of random parameters, whereas multinomial logit (MNL) and nested logit models only allow for fixed parameter estimates. Hensher and Jones (2007) examined the optimization of the mixed logit model. Their study suggested a number of empirical considerations relevant to harnessing the maximum potential from this approach (as well as avoiding some of the more obvious pitfalls associated with its use). Using a three-state corporate failure model, Jones and Hensher (2007) concluded that the unconditional triangular distribution for random parameters offered the best population-level predictive performance on a holdout sample. Further, the optimal performance for a mixed logit model arises when a weighted exogenous sample maximum likelihood (WESML) technique is applied in model estimation. Finally, they suggested an approach for testing the stability of mixed logit models by re-estimating a selected model using varying numbers of Halton intelligent draws. However, a relative weakness of the mixed logit model is the absence of a single globally efficient set of parameter estimates and the relative complexity of the model in terms of estimation and interpretation. The nested logit model (NL) improves on the standard logit model but possesses quite different econometric properties from the mixed logit model. In essence, the NL model relaxes the severity of the MNL model’s IID condition between subsets of alternatives but preserves the IID condition across alternatives within each nested subset. The popularity of the NL model arises from its close relationship to the MNL model. In fact, NL is essentially a set of hierarchical MNL models, linked by a set of conditional relationships. The main benefits of the NL model are its closed-form solution, which allows parameter estimates to be more easily estimated and interpreted; and a unique global set of asymptotically efficient parameter estimates. A relative weakness of NL is that it is analytically and conceptually closely related to MNL and therefore shares many of the limitations of the basic model. Nested logit only partially corrects for the highly restrictive IID condition and incorporates observed and unobserved heterogeneity to some extent only. Chapter 2 also covers recent studies using hazard or duration models. This method has become much more widely used in the corporate failure prediction literature over the last 20 years. In particular, hazard models (such as the Cox proportional hazard model and the extended Cox model) have become increasingly popular in distress and corporate failure prediction because they explicitly model time to event and can handle censored observations. Censoring exists when there is incomplete information on the occurrence of a corporate failure event because an observation has dropped out of the sample or the study ends before the failure event occurs (for example if a firm becomes subject to a merger or takeover). The extended Cox model can handle time-varying covariates, which are covariates that change in value over time. Given that changes in covariates can influence the probability of event occurrence, time-varying covariates are clearly a very attractive feature of hazard models.
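A minimal sketch of a Cox proportional hazards specification of this kind is shown below, assuming the open-source lifelines library is available; the data frame, column names, and values are hypothetical and serve only to illustrate time-to-event data, censoring, and covariates:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical firm-level data: time (in years) until failure or censoring,
# an event indicator (1 = failed, 0 = censored, e.g. merged or still alive at
# the end of the study window), and two covariates.
df = pd.DataFrame({
    "duration": [2.0, 5.0, 3.5, 7.0, 1.5, 6.0, 4.0, 8.0, 2.5, 9.0],
    "failed":   [1,   0,   1,   0,   1,   1,   0,   0,   1,   0],
    "leverage": [0.80, 0.30, 0.70, 0.20, 0.90, 0.40, 0.60, 0.10, 0.50, 0.35],
    "roa":      [-0.10, 0.05, -0.05, 0.08, -0.20, 0.03, -0.02, 0.10, 0.01, 0.06],
})

# Fit the Cox proportional hazards model; censored firms still contribute
# information about survival up to the point at which they leave the sample.
cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="failed")
cph.print_summary()   # hazard ratios for leverage and roa
```

A tiny sample like this is, of course, only for exposition; the extended (time-varying) Cox model would additionally require covariate values recorded for each interval of a firm’s life.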


Chapter 2 also examines non-parametric techniques, in particular neural networks and recursive partitioning models (such as classification and regression trees or CART). Non-parametric techniques also address some of the limiting statistical assumptions of earlier models, particularly LDA and logit. There have been a number of attempts to overcome these econometric problems, either by selecting a parametric method with fewer distributional requirements or by moving to a non-parametric approach. The logistic regression approach and the general hazard function formulation are examples of the first approach. The two main types of non-parametric approach that have been used in the empirical literature are neural networks and recursive partitioning. The term neural network covers many models and learning (estimation) methods. These methods are generally associated with attempts to improve computerized pattern recognition by developing models based on the functioning of the human brain; and attempts to implement learning behaviour in computing systems. Their weights (and other parameters) have no particular meaning in relation to the problems to which they are applied; hence, they are often regarded as pure “black box” estimators. Estimating and interpreting the values of the weights of a neural network is not the primary modelling exercise; the aim is rather to estimate the underlying probability function or to generate a classification based on the probabilistic output of the network. Recursive partitioning methods (such as CART) are tree-based methods of classification that proceed through a simple mechanism of using one feature to split a set of observations into two subsets. The objective of the split is to create subsets that have a greater proportion of members from one of the groups than the original set. This objective is known as reducing the impurity of the set. The process of splitting continues until the subsets created only consist of members of one group or no split gives a better outcome than the last split performed. The features can be used once or multiple times in the tree construction process. The distinguishing feature of the non-parametric methods is that there is no (or very little) a priori knowledge about the form of the true function that is being estimated. The target function is modelled using an equation containing many free parameters, but in a way that allows the class of functions that the model can represent to be very broad. The empirical application of both of these methods has demonstrated their potential in a corporate failure prediction context; however, the research evidence to date is mixed. For instance, several studies have compared neural networks to conventional models. While neural networks tend to perform as well as, and in some cases better than, simpler linear methods (such as LDA and logit), the incremental improvement in predictive power rarely offsets the complexity of these models and their lack of interpretability.
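The splitting mechanism described above can be illustrated with a short sketch using scikit-learn’s CART-style decision tree; the two “ratios” and the failure rule used to simulate the data are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: each split chooses one feature and a threshold so that the
# resulting subsets are "purer" (i.e. contain a higher proportion of one class).
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))                               # two standardized ratios
y = ((X[:, 0] < -0.3) & (X[:, 1] < 0.2)).astype(int)      # hypothetical failure region

tree = DecisionTreeClassifier(max_depth=3, criterion="gini").fit(X, y)

# Print the sequence of splits; features may be reused at different depths.
print(export_text(tree, feature_names=["liquidity", "profitability"]))
```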


Chapter 2 also examines early theoretically derived models such as the gambler’s ruin model. However, the literature has evolved to more sophisticated structural models of default, such as KMV distance-to-default models, which are widely used by many banks and financial institutions around the world. Structural models use the evolution of a firm’s structural variables, such as asset and debt values, to determine the timing of default. The basic idea is that the firm’s equity is seen as a European call option with maturity T and strike price D on asset value V. There are only three inputs to the model: the market value of assets, the volatility of assets, and the firm’s debt levels. In this approach, default occurs when a firm’s asset value falls below a certain threshold. The risk of default also increases when the market value of assets decreases, the volatility of asset values increases, and/or the level of debt increases. Chapter 2 examines the assumptions, limitations, and empirical performance of such theoretically derived models.
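A stylized sketch of the basic Merton-style calculation, using only the three inputs mentioned above, is given below. The input values are invented, and a full KMV-style implementation would also need to back out the unobservable asset value and asset volatility from equity market data, which is not shown here:

```python
from math import log, sqrt
from scipy.stats import norm

def merton_default_probability(V, sigma_V, D, r=0.03, T=1.0):
    """Distance to default and risk-neutral default probability in a simple
    Merton model: equity is a European call on asset value V with strike equal
    to the face value of debt D and maturity T."""
    d1 = (log(V / D) + (r + 0.5 * sigma_V ** 2) * T) / (sigma_V * sqrt(T))
    d2 = d1 - sigma_V * sqrt(T)     # d2 plays the role of the distance to default
    return d2, norm.cdf(-d2)        # probability that asset value < D at time T

# Hypothetical inputs: market value of assets 120, asset volatility 25%, debt 100.
dd, pd_ = merton_default_probability(V=120.0, sigma_V=0.25, D=100.0)
print(f"Distance to default: {dd:.2f}, default probability: {pd_:.2%}")
```

Consistent with the intuition above, lowering V, raising sigma_V, or raising D in this sketch all increase the estimated default probability.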


Chapters 3 and 4 focus on modern machine learning techniques such as gradient boosting machines, random forests, AdaBoost, and deep learning. These methods are developing very rapidly in the social sciences, although there are still quite limited applications to the distress risk and corporate failure forecasting literature. Chapter 3 explains the theoretical and empirical foundations of these models and reviews the current literature. Chapter 3 also highlights the major differences between alternative machine learning techniques and how they are derived. An important finding in recent literature is that modern machine learning methods perform substantially better (in terms of both Type I and Type II classification errors) than more traditional methods such as LDA, logit, mixed logit, neural networks, and even hazard models. Relative to traditional parametric models such as LDA and logit, machine learning methods have several advantages for corporate failure prediction research. For instance, these methods are largely immune to variable transformation or scaling and are far less impacted by outliers, missing values, the inclusion of irrelevant inputs, non-normality in the data and a host of other data quality issues. Nor are these models destabilized or impaired by statistical problems such as multicollinearity or heteroscedasticity, which can seriously undermine the performance of parametric models such as LDA and logit. Not only do machine learning methods, such as gradient boosting machines, predict very accurately out-of-sample compared to traditional methods, but they can also provide valuable interpretative insights about the role and influence of different predictor variables through variable importance scores, partial dependency plots (or marginal effects), and interaction effects. Chapter 4 provides a comprehensive empirical illustration of modern machine learning methods using a large sample of international corporate bankruptcies. The main purpose of this chapter is to illustrate alternative machine learning techniques and compare predictive performance across different models. Using a commercial software package known as Salford Predictive Modeler (SPM), I compare a range of statistical learning approaches, including gradient boosting machines (called “TreeNet” in the SPM package), random forests, CART, generalized lasso, and multivariate adaptive regression splines (MARS). Chapter 4 illustrates how the outputs of these models should be interpreted. In terms of model diagnostics, I also examine a wide range of tests that can be used to investigate the stability and performance of machine learning models, including an assessment of different learn rates, tree depth, maximal node settings, alternative seed settings, and other hyper-parameter estimation considerations. An interesting finding from Chapter 4 is that the gradient boosting model proved to be the strongest performing machine learning model overall in terms of both model fits and out-of-sample prediction success. Gradient boosting has proven to be a highly resilient model to different hyper-parameter settings and has considerable interpretative capability through variable importance scores, partial dependency plots, and high order interaction effects. In Chapter 5, I explore a much-neglected area of the corporate failure prediction literature. While the motivation for public company failure research is well established, there are also important reasons to develop accurate and robust prediction models for private companies, not-for-profit entities, and public sector entities. As pointed out by Filipe et al. (2016), private companies (which they term SMEs) play a crucial role in most economies. In Organization for Economic Cooperation and Development (OECD) countries, SMEs account for 95% of all enterprises and generate around two-thirds of employment. Further, private company failure rates are much higher than public companies (recent US data indicates that around 50% of private companies fail within five years of establishment). I attempt to fill the gap in the distress literature by examining quantitative modelling approaches relevant to private companies, not-for-profit entities, and public sector entities. Much of the modelling for private companies has not produced models that are highly effective in predicting the distress of these entities. However, recent applications of machine learning models to private companies appear to have produced much better predictive accuracy overall. Chapter 6 discusses directions for future distress risk and corporate failure prediction research. For instance, there is considerable opportunity to model corporate distress on a distress spectrum ranging from more moderate forms of distress to the most extreme form of distress (such as bankruptcy). There is limited research available that has explored predictive models for more moderate (but frequently occurring) distress events, such as adverse going concern opinions, capital reorganizations, reductions in dividends, public offerings to raise working capital, loan default, changes in credit ratings, and distressed mergers/takeovers. Modelling more moderate distress events is important because these events are often a direct precursor to more extreme distress events such as bankruptcy. Further, many distressed companies do not actually go into bankruptcy but are subject to distressed mergers and takeovers (Jones and Hensher, 2004). Second, while this book has focused on modern machine learning methods, these modelling techniques are only in the formative stages. More research is needed to find the best performing prediction model (or set of models) as well as to identify which features (and feature combinations) have optimal predictive power. Future research can continue to compare and evaluate a wider range of machine learning methods, including variations of current models that are developing in the literature. Another direction for corporate failure research discussed in Chapter 6 is to harness the latest advances in text mining and natural language processing (NLP). Future research can utilize innovative techniques in NLP to exploit the predictive value of information signals embedded in unstructured text data. Examples of unstructured text data


include text from emails, social media posts, video and audio files, and corporate annual reports. Text mining is the process of transforming unstructured text into a structured format to reveal hidden patterns, key concepts, and themes embedded in text data. The NLP method can be harnessed to extract key words, terms, phrases, or sentences that can potentially discriminate between different types of distress events and between bankrupt and non-bankrupt firms. While NLP has not been widely used in corporate failure prediction, this technology has the capability to extract linguistic signals from a range of “big data” text sources, including corporate viability statements, management discussion and analysis, annual reports, auditor reports, and the risk information required to be disclosed by accounting standards (as discussed earlier). By converting unstructured text data into a structured format that can be understood by machine learning models, we can compare the predictive signal in new features generated by NLP with other traditional and non-traditional failure predictors and their potential interaction effects. Future research can also compare the performance of the NLP method with the predictive performance of quantitative machine learning models (such as gradient boosting machines) and evaluate the predictive power of linguistic features relative to other traditional and non-traditional variables such as financial ratios, market price variables, corporate valuation and governance variables, macro-economic indicators, and corporate sustainability measures. Finally, more research is needed to understand and interpret the features (and their interaction effects) that are predictive of corporate distress and corporate failure. Machine learning methods can provide new theoretical insights into which feature variables have the greatest explanatory and predictive power across different machine learning models using metrics such as variable importance scores, marginal effects, and interaction effects. More research also needs to be devoted to developing more accurate and robust distress prediction models that are adaptable to the circumstances of private companies, not-for-profit entities (such as charities), and public sector entities (such as local government authorities). While much of the research in this area has not demonstrated the effectiveness of distress prediction models, more recent research using modern machine learning models appears to hold much promise for the prediction of private company distress.
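As a minimal sketch of the text-mining pipeline described above, the example below converts a few invented disclosure-style sentences into a structured TF-IDF representation and feeds them to a simple classifier; a real application would use far larger corpora and richer NLP features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented snippets of narrative disclosure (labels: 1 = later failed).
docs = [
    "material uncertainty over going concern and covenant breaches",
    "strong cash generation and growing order book",
    "creditors unpaid and emergency refinancing under negotiation",
    "record profit, conservative gearing and ample liquidity",
]
labels = [1, 0, 1, 0]

# Text mining step: unstructured text -> structured numeric features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The structured features can then be fed to any statistical learning model.
clf = LogisticRegression().fit(X, labels)
test = vectorizer.transform(["going concern doubt and unpaid creditors"])
print("Estimated failure probability:", clf.predict_proba(test)[0, 1])
```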

Key Points From Chapter 1

The corporate failure modelling field has developed into a prodigiously large literature over the past six decades, with many studies published across a wide range of journals and discipline fields. Financial crises, such as the Asian Financial Crisis of 1997, the “tech wreck” of 2001, the global financial crisis of 2007–2009 (GFC), and the COVID pandemic (from early 2020) have led to an unprecedented number of corporate collapses that, to different degrees, have seriously impacted the global economy. While corporate failure used to be considered the dominion of small and/or newly listed companies, the failure of very large and well-established corporations is now much more commonplace.


Financial crises and the interdependencies of the global financial system have underscored the importance of more accurate and robust corporate failure prediction models. Various financial crises and the ensuing spates of corporate bankruptcies have led to more innovative corporate failure modelling techniques being developed over the past 20 years. While the focus of this book is on modern machine learning methods such as gradient boosting machines, AdaBoost, random forests, and deep learning, the book also covers a number of traditional modelling techniques such as LDA, logit/probit, neural networks, CART, and hazard models. Corporate failure forecasts are useful in a wide variety of contexts, including going concern assessments in auditing, the valuation of distressed companies, the stock recommendations and financial forecasts of financial analysts, and lending decisions by banks and creditors, as well as use by corporate directors and managers who are required to assess the going concern and solvency status of their businesses on an ongoing basis. There are considerable opportunities available in this area of research, including expanding the use of modern machine learning methods to corporate failure modelling as well as the application of text mining and natural language processing techniques to failure forecasting. There are many areas of this literature that require more research, including the need for more accurate distress modelling techniques tailored to private companies, not-for-profit entities (such as charities), and public sector entities (such as local government authorities).

Notes

1 Michael Carson and John Clark (2013). Asian financial crisis: July 1997 – December 1998. Federal Reserve Bank of New York.
2 www.federalreservehistory.org/essays/asian-financial-crisis.
3 www.govinfo.gov/content/pkg/COMPS-1883/pdf/COMPS-1883.pdf.
4 www.congress.gov/110/plaws/publ343/PLAW-110publ343.htm.
5 See Final Report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. Pursuant to Public Law 111–21, January 2011. Official Government Edition.
6 www.theguardian.com/business/2017/jan/14/moodys-864m-penalty-for-ratingsin-run-up-to-2008-financial-crisis.
7 Retrieved at: www.cftc.gov/sites/default/files/idc/groups/public/@swaps/documents/file/hr4173_enrolledbill.pdf.
8 https://home.treasury.gov/policy-issues/financial-markets-financial-institutions-andfiscal-service/fsoc/about-fsoc.
9 https://home.treasury.gov/policy-issues/financial-markets-financial-institutions-andfiscal-service/fsoc/about-fsoc.
10 www.consumerfinance.gov/about-us/the-bureau/
11 www.consumerfinance.gov/about-us/the-bureau/.
12 www.federalreserve.gov/supervisionreg/volcker-rule.htm.
13 www.federalreserve.gov/supervisionreg/volcker-rule.htm.
14 www.sec.gov/page/ocr-section-landing.
15 www.sec.gov/whistleblower/retaliation.
16 www.sec.gov/whistleblower/frequently-asked-questions.
17 www.sec.gov/whistleblower/frequently-asked-questions.
18 See S. Kalemli-Ozcan et al., 2020, September 25, COVID-19 and SME Failures, Working Paper No. 20/207. IMF (International Monetary Fund).


19 See National Law Review “Trends in Large Corporate Bankruptcy and Financial Distress (Midyear 2021 Update): Bankruptcy Filings”, September 23, 2021 Volume XI, Number 266.
20 See https://asic.gov.au/regulatory-resources/insolvency/insolvency-for-directors/#suspectfinancial-difficulty.
21 The requirements of the exposure draft are broadly consistent with the requirements of ASIC RG 247 (August 2019), which also requires disclosure by companies of the material business risks that could adversely affect the achievement of the financial performance or financial outcomes described. RG 247.62 also states: “It is important that a discussion about future prospects is balanced. It is likely to be misleading to discuss prospects for future financial years without referring to the material business risks that could adversely affect the achievement of the financial prospects described for those years. By ‘material business risks’, we mean the most significant areas of uncertainty or exposure, at a whole-of-entity level, that could have an adverse impact on the achievement of the financial performance or outcomes disclosed in the OFR. Equally, it may be appropriate to disclose factors that could materially improve the financial prospects disclosed”. And, further, RG 247.63 states: “an OFR should: (a) only include a discussion of the risks that could affect the entity’s achievement of the financial prospects disclosed, taking into account the nature and business of the entity and its business strategy; and (b) not contain an exhaustive list of generic risks that might potentially affect a large number of entities”.
22 http://people.stern.nyu.edu/adamodar/pdfiles/papers/NewDistress.pdf
23 For instance, if the probability of distress is 10% in year 1, the expected cash flows in all subsequent years have to reflect the fact that if the firm ceases to exist in year 1, there will be no cash flows later. If the probability of distress in year 2 is 10% again, there is now only an 81% chance that the firm will have cash flows in year 3. Probability of surviving into year 3 = (1 − .10) (1 − .10) = 0.81. See Damodaran (2009) for discussion.
24 To value the effects of distress, Damodaran (2009) suggests estimating the cumulative probability that the firm will become distressed over the forecast period, and the proceeds that we estimate we will get from the distress sale. The value of the firm can then be written as: Firm Value = Going concern value × (1 − π_Distress) + Distress sale value × π_Distress, where π_Distress is the cumulative probability of distress over the valuation period. In addition to making valuation simpler, it also allows us to make consistent assumptions within each valuation. There are several ways in which we can estimate this probability. However, this book focuses on the statistical approach, where we relate the probability of distress to a firm’s observable characteristics – firm size, leverage, and profitability, for instance – by contrasting firms that have gone bankrupt in prior years with firms that did not.
25 For instance, see the stress testing principles set out by the Basel Committee on Banking Supervision (2009): www.bis.org/publ/bcbs155.pdf.

2 SEARCHING FOR THE HOLY GRAIL Alternative Statistical Modelling Approaches

1. Introduction

Providing a comprehensive literature review of the distress risk and corporate failure literature is always going to be challenging, as the extant literature is very extensive, with many hundreds if not thousands of studies published across a variety of different outlets, including academic journals, practitioner-related publications, books, various government and commission reports, and other outlets over the past six decades. My approach to the literature is not intended to be exhaustive, but rather to provide a broad overview of the most relevant and representative studies of this research. Considered in its entirety, there are at least two broad streams of distress risk and corporate failure modelling research. One stream of research, which is the most extensive, I term the statistical learning approach. This stream of research uses some type of statistical model to “learn from the data” and formulate predictions about yet to be seen events. In the context of distress risk and corporate failure research, the outcome variable is the occurrence of a distress event or failure event based on observed data and using any number of predictor variables (such as financial ratios, market price variables, corporate governance proxies, macro-economic features, and other variables of interest). Statistical learning models are typically estimated or “trained” on the dataset and predictive performance evaluated on test or holdout samples. A variety of modelling approaches have been applied to this area of research, including simple linear discriminant analysis (LDA), logit and probit models, mixed logit models, hazard models, and first-generation machine learning methods such as neural networks, recursive partitioning, and support vector machines (SVMs). Over the past 5–10 years, we have also seen the rapid rise of more powerful machine learning methods such as gradient boosting machines, AdaBoost, random forests, and deep learning methods. While this literature is fledgling (at


least as it applies to corporate failure modelling), there is also a developing area of research that uses text mining and natural language processing (NLP) for predictive purposes, although there are only very limited applications of this approach to corporate failure prediction so far. The distress risk and corporate failure literature to date reveals a great deal of heterogeneity across studies, particularly with respect to the use of different modelling techniques (e.g., LDA, logit, probit, neural networks, recursive partitioning, and hazard models); different reporting jurisdictions (failure models have now been developed across many countries); differences in sample sizes and sampling periods; alternative explanatory variables (e.g., financial and market price variables) used to predict failure; and different approaches to assessing model performance. Furthermore, the definition of corporate failure varies widely across studies. Hence, the title of this book. The terms “distress risk” and “corporate failure” prediction are used to differentiate the variety of studies that have been published in this literature. Some studies have developed forecasting models using a legal definition of corporate failure (such as entering Chapter 7 or Chapter 11 under the US bankruptcy code).1 Other studies have used broader definitions of corporate failure, which might include distress events such as loan default; issuing shares to raise working capital; receiving a qualified audit opinion based on going concern; having to renegotiate the terms and conditions of a loan obligation; financial reorganization where debt is forgiven or converted to equity; failure to pay a preference dividend (or cutting ordinary dividends); or a bond ratings downgrade. Other studies have applied multi-class modelling to corporate failure (i.e., where the dependent variable takes on more than two states), using a variety of different distress states, including legal concepts of failure, as the dependent variable of interest. The second stream of research involves theoretically derived models such as single period Black Scholes models, gambler’s ruin models (Wilcox, 1971), and second-generation theoretically derived models such as the KMV (Kealhofer, McQuown, and Vasicek) distance-to-default method, which is based on Merton’s (1974) bond pricing model (see, e.g., Kealhofer, 2003a, 2003b). There will clearly be some overlap across these two streams of research, as many studies that use theoretically derived models provide some level of empirical validation for these methods, while recent statistical learning studies have examined the predictive power of theoretically derived input variables (such as distance to default measures), particularly in combination with other predictor variables. I examine both approaches in this chapter.

2.  Early Statistical Learning Approaches

One of the earliest corporate failure studies was FitzPatrick (1932), who published a series of articles in The Certified Public Accountant.2 This study used 20 matched pairs of failed and non-failed firms where data was collected over a three-year period. FitzPatrick (1932) investigated a number of accounting ratios such as the current ratio, the quick ratio, and net worth to fixed assets as predictors of corporate failure.


He concluded that two of the most predictive ratios were net worth to debt and net profits to net worth. FitzPatrick’s (1932) empirical investigation further suggested that less importance should be placed on the current and quick ratios for firms having long-term liabilities on their balance sheets. Smith and Winakor (1935) analysed the financial ratios of 181 failed firms from a variety of industries. They reported that one ratio in particular, working capital to total assets, was a stronger predictor of financial health compared to cash to total assets and the current ratio. As it turned out, working capital to total assets would prove to be a strong predictor in many future corporate failure studies (see Dimitras et al., 1996). Notwithstanding early approaches to predicting corporate failure, the corporate failure literature did not really develop until the 1960s with the early work of Beaver and Altman. From the 1960s, the literature has progressed from relatively simple univariate studies using small samples to much more sophisticated econometric models, including machine learning methods, using much larger distress and corporate failure samples. Early studies modelled distress and corporate failure in terms of a binary outcome of failure vs non-failure. However, in more recent literature there has been a growing recognition that binary classification models are too simplistic and do not represent a realistic depiction of corporate distress in practice (Ohlson, 1980). Hence, statistical learning models of corporate failure prediction have evolved to utilize multi-class approaches where distress can take on a range of possible states. Most of these approaches model corporate distress on a spectrum ranging from most severe to least severe (hence the use of ordered logit models). The most severe state could be forced liquidation, whereas a less severe distress state could be characterized by a cut in dividend payments or a capital reorganization. Another clear trend in distress and corporate failure studies is that most of the research has been conducted on public companies. However, an increasing number of studies have developed prediction models applicable to private companies, not-for-profit entities, and public sector entities (discussed further in Chapter 5). The predictive success of corporate distress and corporate failure models also differs widely across studies, although nearly all published studies dating back to the 1960s have reported some level of predictive success. Some of the earliest corporate failure studies, such as Altman (1968) and Altman et al. (1977), reported quite strong predictive accuracy rates (at least up to two years prior to failure), but other studies, such as Ohlson (1980), documented more modest classification success. While predictive success varies across studies, there is little doubt that the best prediction accuracy rates have come from studies using advanced machine learning methods, such as gradient boosting machines, random forests, and deep learning (see Jones, 2017; Alam et al., 2021). In a study of 16 different corporate failure prediction models, ranging from simpler linear classifiers (such as LDA and standard logit) to advanced machine learning methods, Jones et al. (2017) demonstrated that machine learning models performed significantly better out-of-sample than conventional models such as LDA and logit. However, a noteworthy finding from this study is that notwithstanding the superior


predictive performance of machine learning models, quite simple linear models still performed reasonably well out-of-sample and in some cases did as well as and even outperformed some of the more fancied nonlinear approaches. I will return to machine learning methods in Chapters 3 and 4.
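Although the comparisons reported in this literature are based on real bankruptcy samples, the general form of such an out-of-sample horse race between a simple linear classifier and a gradient boosting machine can be sketched as follows using simulated data (the features and class frequencies below are invented and merely stand in for a real failure sample):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulated stand-in for a failure sample: 20 features, rare positive class.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each model on the training set and score it on the holdout set.
for name, model in [("logit", LogisticRegression(max_iter=1000)),
                    ("gradient boosting", GradientBoostingClassifier())]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: out-of-sample AUC = {auc:.3f}")
```

On real failure samples, the relative ranking of the two models is an empirical question of exactly the kind examined in Chapters 3 and 4.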

2.1  Univariate Approach of Beaver (1966)

The main motivation for Beaver’s (1966) study was to demonstrate the predictive value of accounting data; failure prediction was the context for this illustration. He observed that ratios had a long history, even at that time, and the current ratio was widely considered to be one of the most important ratios to predict credit worthiness. Beaver (1966, p. 72) stated rather prophetically that his study should be viewed not as “one of the last endeavours in this area but as one of the first”. Even at this early stage, Beaver’s study foreshadowed several issues that would prove contentious in future studies. For instance, his definition of corporate failure is quite broad and includes the inability of an entity “to pay its financial obligations as they mature”; operationally, a firm was considered to have failed in the event of bankruptcy, bond default, an overdrawn bank account, or non-payment of a preferred stock dividend. Many different definitions of corporate distress and bankruptcy have since emerged in the literature, which I discuss further in this chapter. Beaver’s (1966) study also recognized one of the perennial difficulties in bankruptcy research, which is finding good quality bankruptcy data. Notwithstanding that corporate failure rates tend to escalate in times of financial crises (such as during the GFC and more recently the COVID pandemic, as discussed in Chapter 1), the bankruptcy of public companies is still a relatively rare event (averaging 1–2% per annum over most years). Using the Moody’s Industrial File, Beaver (1966) relied on a relatively small sample of 79 failed firms having at least one year of data over the period 1954–1964. The sample, while small, was broadly representative of 38 industry groups. Beaver’s study is also one of the first to use a matched pair sampling technique to derive a better measure of control over factors that might influence the relationship between explanatory variables and the failure outcome. In this case, failed firms were matched with a non-failed firm based on size and industry, recognizing a widely accepted view that ratio characteristics can differ significantly across industry groups (Foster, 1986). Also, larger firms tend to be more solvent, on average, than smaller firms even if their ratio characteristics are similar. Beaver (1966) proceeded to test 30 ratios, which were divided into six common groups, based on their recognition and performance in previous literature. In terms of modelling approach, Beaver did not use a multivariate statistical learning model of the kind we commonly see today. Rather, he assessed the predictive value of each variable using a Bayesian approach based on likelihood ratios: essentially, the joint probabilities (the prior probabilities multiplied by the likelihoods) are summed to arrive at the marginal probability. Using Bayes’ theorem, the posterior probability (i.e., the probability of failure or non-failure) is the quotient of the joint probability divided by the marginal probability.
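A small numerical sketch of this likelihood-ratio logic, using invented figures: suppose 5% of firms fail, a “low” cash flow to total debt ratio is observed in 70% of failing firms but only 10% of non-failing firms, and we wish to know the posterior probability of failure for a firm exhibiting the low ratio.

```python
# Bayes' theorem applied to a single ratio (illustrative numbers only).
prior_fail = 0.05                      # prior probability of failure
p_low_given_fail = 0.70                # likelihood of a low ratio if failing
p_low_given_nonfail = 0.10             # likelihood of a low ratio if healthy

# Joint probabilities = prior x likelihood; their sum is the marginal probability.
joint_fail = prior_fail * p_low_given_fail
joint_nonfail = (1 - prior_fail) * p_low_given_nonfail
marginal_low = joint_fail + joint_nonfail

# Posterior probability of failure given a low ratio (joint / marginal).
posterior_fail = joint_fail / marginal_low
print(f"P(failure | low ratio) = {posterior_fail:.3f}")   # about 0.27
```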


Beaver (1966) selected six ratios based on their lowest percentage error. The best ratios were found to be (in this order): (1) cash flow to total debt, (2) net income to total assets, (3) total debt to total assets, (4) working capital to total assets, (5) the current ratio, and (6) the no-credit interval. The first two ratios performed more strongly than the other ratios (they achieved over 90% classification accuracy one year prior to failure). The percentage error did not deteriorate sharply for these variables over a 5-year period. Beaver’s study revealed a widening gap in certain key ratios over the 5-year period as firms approached failure (see Figure 2.1).

FIGURE 2.1  Behaviour of Financial Ratios Prior to Bankruptcy From Beaver (1966) Study. Source: Beaver (1966, p. 82)


While Beaver (1966) used a quite crude measure of cash flow (net income plus depreciation and amortization), subsequent corporate failure studies examined the predictive power of more refined measures of cash flow (see Largay and Stickney, 1980; Casey and Bartczak, 1984, 1985; Gentry et al., 1985, 1987; Aziz et al., 1988; Aziz and Lawson, 1989; Jones and Hensher, 2004).3 Beaver (1968a) extended this study to show that nonliquid asset measures (i.e., cash flow to debt, net income to total assets, and total debt to total assets) outperformed liquid asset measures (such as current assets to total assets, quick assets to total assets, working capital to total assets, cash to total assets, and the current ratio). Several insights from the Beaver studies have been picked up in future studies. For instance, Beaver (1968a) recognized the limitations of a univariate analysis (looking at the likelihood ratios of individual ratios rather than several ratios at a time). While Beaver recognized that a multivariate approach could ultimately improve predictive performance, he was not optimistic because he was convinced that one variable tended to do as well as many! However, subsequent corporate failure research that adopted a multivariate approach appeared to refute this notion.4 That said, it needs to be acknowledged that Beaver was looking not for the best predictor of failure, but rather to determine whether ratios themselves had predictive power. Beaver also anticipated the potential importance of market price data in failure prediction. He stated in his 1966 study: “Market rates of return will provide a very powerful test since they reflect all of the sources of information available to investors – sources that not only include accounting data but other kinds of information as well” (p. 100). This observation was very insightful for its day, as many modern corporate failure studies now use market price data either exclusively or in combination with financial data to predict corporate failure. A second observation from the Beaver study would also resonate in future studies. That is, corporate failure models do not or cannot classify failed and non-failed firms with equal success. For instance, in each year before failure, the Type I error (predicting a firm to be safe when it goes bankrupt) is greater than the Type II error (predicting a firm to go bankrupt when it proves to be safe), and this error rate increases the further out you go from the failure event date. Beaver (1966, p. 91) concluded:

Even with the use of ratios, investors will not be able to completely eliminate the possibility of investing in a firm that will fail. This is a rather unfortunate fact of life, since the costs in that event are rather high.

In fact, as noted by Altman (2002), a Type I error is at least 36 times more costly in economic terms than a Type II error. A further insight from the Beaver study would be corroborated in many future studies. He recognized the potential for distressed companies to manage their financial statements. Beaver (1966, p. 100) stated:

The profile analysis indicated that the mean current ratio of the failed firms was above the magic “2:1” standard in all five years. In fact, in the final year


the mean value is 2.02 – which is about as close to 2.00 as possible. The evidence hints that failed firms may attempt to window dress.

However, it might be that non-failed firms window dress even better, which suggests an “intriguing hypothesis to be tested at some future date” (p. 100). This point was picked up in a growing literature that has examined the earnings and balance sheet management practices of failing firms (discussed in Chapter 6). Beaver (1968a) extended his study to include market price data, and this represented one of the earliest papers to formally test the predictive power of market price information. Using the same sample as his previous studies, Beaver (1968a) compared the predictive value of market price information and financial ratios. He computed annual rates of return for the failed and non-failed firm samples up to five years prior to failure. Beaver (1968b) reasoned that failed firms would have a higher probability of failure over the time horizon relative to the non-failed firm sample. As these firms are riskier, risk-averse investors would expect a higher ex ante rate of return on investment. Beaver (1968a, p. 181) stated:

Each period, investors would reassess the solvency position of the firm and adjust the market price of the common stock such that the ex ante rate of return in future periods would continue to be commensurate with the higher risk. If, at any time, the firm is in a solvency state worse than expected, there will be a downward adjustment of the market price, and the ex post rate of return will be less than the ex ante or expected rate of return. The opposite effect would occur if the solvency position of the firm is better than expected.

As pointed out by Beaver, it is not easy to make an unequivocal statement about the differences in ex post rates of return for failed and non-failed firms. If the unexpected deterioration is large, the price might be revised so that ex post returns can be lower than those of the non-failed firms. If there is little or no unexpected deterioration in solvency, ex post returns are expected to be higher for the failed group (in other words, higher risk should be followed by higher returns). However, it is not possible to say a priori whether the unexpected deterioration will be large or small. Beaver (1968a) stated: “At any point in time, there is no reason why ex post returns for failed firms should necessarily differ from those of nonfailed firms” (p. 182). Beaver concluded that investors continually adjust for any unexpected solvency deterioration over the 5-year period leading to lower ex post returns, but the largest, unexpected deteriorations occur in the final year – in other words, bankruptcy was unexpected or surprised the market (this is sometimes called the “bankruptcy effect”). As stated by Beaver (1968a, p. 182), “the implication is that investors are still surprised at the occurrence of failure, even in the final year before failure”. Further, the observation that failed firms have lower market returns came to be known in future research as the “distress anomaly”. The distress anomaly reflects the abnormally low returns of distressed firms as they approach bankruptcy.


Campbell et al. (2008) provided an extended discussion of this anomaly (their study is discussed further later on). Finally, Beaver (1968) concluded from his cross-sectional and time series analysis that: (1) investors recognize and adjust to the new solvency position of firms and (2) changes in stock prices act as if investors rely on ratios for their assessment and impound ratio information into prices.5

2.2  Early Multivariate Models

Another well-known early bankruptcy study is Altman (1968). The Altman (1968) study avoided the univariate approach of Beaver (1966, 1968b) and adopted a multivariate approach. Altman (1968) posited that the univariate approach is “susceptible to faulty interpretation and is potentially confusing” (p. 591). Altman argued that a multivariate approach that combines and weights many different measures together in a predictive model can ultimately improve prediction outcomes and provide a better indication about which variables have more predictive power in a bankruptcy model. The modelling approach adopted by Altman is known as linear discriminant analysis (LDA), a model that Altman continued to use in many of his future studies.

2.2.1  Linear Discriminant Analysis (LDA)

Greene (2008) illustrated the LDA model as follows. LDA rests on the assumption that there are two populations of firms, denoted “1” and “0”, each characterized by a multivariate normal distribution of the attributes, x. A firm with attribute or feature vector xi is drawn from one of the two populations, and we need to determine which. The analysis is carried out by assigning to the firm a “Z” score, as shown in equation (2.1):

Zi = b0 + b′xi. (2.1)

Given a sample of previous observations on yi and xi, the vector of weights, b = (b0, b1), can be obtained as a multiple of the vector of regression coefficients in the linear regression of di = P0yi − P1(1 − yi) on a constant and the set of explanatory variables, where P1 is the proportion of 1s in the sample and P0 = 1 − P1. The scale factor is (n − 2)/e′e from the linear regression. The firm is classified in group 1 if its “Z” score is greater than Z* (usually 0) and in group 0 otherwise. The linearity (and simplicity) of the computation is an attractive feature of the model. Since the left-hand-side variable in the preceding linear regression is a linear function of yi, namely di = yi − P1, the calculated discriminant function can be construed as nothing more (or less) than a linear probability model. As such, the comparison between discriminant analysis and, say, the probit model reduces to one between the linear probability model and the probit or logit model. Thus, it is no surprise that the differences between them are not great; this has been observed elsewhere (see Greene, 2008).
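To make the mechanics concrete, the following Python sketch fits an LDA classifier to a handful of made-up firm-year ratios and recovers a linear discriminant (“Z”-style) score. The ratio values, class labels, and the new firm being scored are purely illustrative assumptions, not data from any study discussed in this book.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Columns: working capital/TA, retained earnings/TA, EBIT/TA (illustrative values)
X = np.array([
    [0.40, 0.35, 0.12],    # non-failed firms
    [0.35, 0.30, 0.10],
    [0.30, 0.25, 0.08],
    [0.05, -0.10, -0.04],  # failed firms
    [0.02, -0.20, -0.06],
    [-0.05, -0.30, -0.09],
])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = failed, 0 = non-failed

lda = LinearDiscriminantAnalysis().fit(X, y)

# The fitted classifier is linear in the ratios, i.e. a "Z"-style score
print("discriminant weights:", lda.coef_[0])
print("intercept:", lda.intercept_[0])

new_firm = np.array([[0.10, -0.05, 0.01]])
print("predicted class:", lda.predict(new_firm)[0])
print("estimated failure probability:", lda.predict_proba(new_firm)[0, 1])
```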


The full technical details of the LDA model are discussed in the Appendix. Similar to Beaver's study, Altman (1968) used a matched pair design. However, unlike the Beaver (1966) study, Altman's sample is drawn from one industry group, manufacturing firms. Based on a small sample of 66 manufacturing firms, Altman divided his sample into two groups (33 firms each), representing failed and non-failed firms. Unlike Beaver, who used a broader definition of failure, Altman (1968) used a sample based on manufacturers that filed a bankruptcy petition under Chapter 10 of the National Bankruptcy Act during the period 1946–1965. It is unclear which definition of failure is best. Distressed companies do not necessarily go bankrupt, and bankrupt companies are not always necessarily distressed. For instance, Chapter 11 can potentially be used for purely strategic reasons, such as to stave off creditors or thwart a labour dispute (see Zavgren, 1983; Jones, 1987; Delaney, 1999).

Altman (1968) started off with 22 ratios classified into five distinct categories: liquidity, profitability, leverage, solvency, and activity. The LDA analysis identified five key ratios that were statistically significant in the model: working capital to total assets, retained earnings to total assets, earnings before interest and tax to total assets, market value of equity (MVE) to the book value of total debt, and sales to total assets. Altman also emphasized the potential importance of a “market” dimension to bankruptcy prediction as captured by the MVE measure. The final Altman (1968) model is the famous Z score formula, which has been cited in many bankruptcy studies. This is an LDA model with the following fitted form: Z = .012X1 + .014X2 + .033X3 + 0.006X4 + 0.999X5, where Z = overall index (observation discriminant score) and the input variables X1 − X5 represent financial ratios, respectively X1 = Working Capital/Total Assets; X2 = Retained Earnings/Total Assets; X3 = Earnings Before Interest and Taxes/Total Assets; X4 = Market Value of Equity/Book Value of Total Liabilities; and X5 = Sales/Total Assets. When the Z score is below a certain threshold, a firm is classified as being at high risk of failure.

Another landmark early study, Ohlson (1980), introduced the logit (logistic regression) model to the corporate failure literature. Ohlson's fitted model, widely known as the O-score, takes the following form: O = −1.32 − 0.407 (log of total assets deflated by a GNP price-level index) + 6.03 (total liabilities/total assets) − 1.43 (working capital/total assets) + 0.0757 (current liabilities/current assets) − 1.72 (1 if total liabilities > total assets, else 0) − 2.37 (net income/total assets) − 1.83 (funds from operations/total liabilities) + 0.285 (1 if net loss for last two years, else 0) − 0.521 (net incomet − net incomet−1)/(|net incomet| + |net incomet−1|). From these variables, Ohlson identified the statistically significant variables in his logistic regression model (within one year of failure). These are: (1) the size of the company; (2) measure(s) of the financial structure; (3) measure(s) of performance; and (4) measure(s) of current liquidity. However, according to Ohlson (1980), the evidence for (4) was not as clear as for cases (2)–(3). Noteworthy in the Ohlson model is the absence of any market price variables.

Ohlson (1980) used a larger sample of 105 bankrupt firms and 2,058 non-bankrupt industrial companies. In a similar vein to Altman (1968), Ohlson's definition of corporate failure is also purely legalistic. A methodological refinement of Ohlson (1980), however, is the use of 10-K financial statements, rather than Moody's Industrials, which was used in previous bankruptcy studies. The point of distinction is that the 10-K reports showed when financial reports were released to the market; hence, it is possible to make a determination of whether firms went bankrupt before or after that date.
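Returning briefly to the Altman model, the following sketch simply evaluates the Z-score formula reproduced above for a hypothetical firm. With these original coefficients, X1–X4 are conventionally entered as percentages and X5 as a ratio; the zone cut-offs of 1.81 and 2.99 are the commonly cited values and are included here only as an assumption for illustration.

```python
def altman_z(wc_ta, re_ta, ebit_ta, mve_tl, sales_ta):
    """Altman (1968) Z score; X1-X4 in percent, X5 as a ratio."""
    return (0.012 * wc_ta + 0.014 * re_ta + 0.033 * ebit_ta
            + 0.006 * mve_tl + 0.999 * sales_ta)

# A hypothetical manufacturer: all inputs are made-up figures
z = altman_z(wc_ta=25.0, re_ta=30.0, ebit_ta=8.0, mve_tl=150.0, sales_ta=1.4)

# Commonly cited cut-offs, used here as an illustrative assumption only
if z < 1.81:
    zone = "distress zone"
elif z < 2.99:
    zone = "grey zone"
else:
    zone = "safe zone"
print(f"Z = {z:.2f} ({zone})")
```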
According to Ohlson (1980), previous studies may have overstated the predictive power of failure models because they did not consider when financial statement data are actually released to the market. For instance, a financial report that is released after the bankruptcy event is likely to overstate the predictive power of the model because the report would reflect more recent information about the deteriorating health of the entity. As a result, Ohlson (1980) reported much higher prediction error rates than previous studies, particularly compared to Altman et al. (1977). Ohlson (1980, p. 128) stated that “differences in results are most difficult to reconcile”. What is quite telling is that Ohlson (1980) only used an estimation sample for his study. If a holdout sample had been used, error rates would undoubtedly have been much higher. Ohlson (1980) used no market price data but acknowledged that “information based on equity prices and changes in prices might prove to be most useful” (p. 123). Ohlson (1980) also foreshadowed future studies that would take a multi-class approach to corporate failure prediction. In identifying the limitations of binary class models, Ohlson (1980, p. 111) observed:


No decision problem I can think of has a payoff space which is partitioned naturally into the binary status bankruptcy vs nonbankruptcy. (Even in the case of a “simple” loan decision, the payoff configuration is much more complex.) Existing empirical studies reflect this problem in that there is no consensus on what constitutes “failure”, with definitions varying significantly and arbitrarily across studies. In other words, the dichotomy bankruptcy versus no bankruptcy is, at the most, a very crude approximation of the payoff space of some hypothetical decision problem.

Later studies such as Lau (1987), Ward (1994), Jones and Hensher (2004), Hill et al. (1996), and others attempted to model financial distress in multi-class settings (discussed further in what follows). Ohlson (1980) also observed another interesting anomaly reported in many future studies. That is, he concluded that none of the financial reports used in his sample would have been useful in failure prediction as none were qualified on the basis of going concern. Ohlson (1980, p. 129) stated:

Moreover, the accountants' reports would have been of little, if any, use. None of the misclassified bankrupt firms had a “going-concern qualification” or disclaimer of opinion. A review of the opinions revealed that eleven of these companies had completely clean opinions, and the two that did not had relatively minor uncertainty exceptions. Curiously, some of the firms even paid dividends in the year prior to bankruptcy. Hence, if any warning signals were present, it is not clear what these actually were.

This rather alarming observation is still valid today. For instance, Read and Yezegel (2018, p. 20) observed that US legislators have expressed concerns that corporations frequently fail shortly after receiving a standard (unmodified) audit opinion and have criticized auditors for failing to warn the public of their clients' impending financial collapse. Prior research has reported that auditors make Type II errors (i.e., auditors issue an unmodified audit opinion in the year preceding the filing of bankruptcy) in approximately 50% of cases (see also Carson et al., 2013). However, Hopwood et al. (1994) observed that neither auditors' opinions nor bankruptcy prediction models are particularly good predictors of bankruptcy when population proportions, differences in misclassification costs, and financial stress are taken into consideration.

2.3.1  The Probit Model

While the logit model has become one of the most widely used statistical learning methods for corporate failure prediction in the literature, the probit model has also been used in several studies (Greene, 2008). For instance, Zmijewski (1984) used a probit model to illustrate, inter alia, methodological issues in corporate failure research. The probit model is set out as follows:

Prob[Yi = 1 | xi] = Φ(β′xi), (2.3)


where Φ is the cumulative normal distribution and β′xi is a vector of parameter estimates and explanatory variables. The link function for a probit model is the inverse of the cumulative normal distribution, Φ−1. The explanatory variables are assumed to be normally distributed, and the error structure of a probit model is assumed to be IID, which makes the model more restrictive and computationally more intensive than logit. The standard probit model has a similar conceptualization to the logit model. While the probit classifier has more restrictive statistical assumptions, both models normally produce consistent parameter estimates and have comparable predictive accuracy (Greene, 2008). As with the logit model, probit model parameters are estimated using maximum likelihood, as shown in the Appendix.
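As a rough illustration of the point that logit and probit usually tell much the same story, the sketch below fits both models to simulated firm data with statsmodels. The variable names and data-generating process are assumptions made for the example only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
leverage = rng.normal(0.5, 0.2, n)   # total debt / total assets
roa = rng.normal(0.05, 0.10, n)      # net income / total assets
X = sm.add_constant(np.column_stack([leverage, roa]))

# Assumed process: failure risk rises with leverage, falls with profitability
latent = -2.0 + 3.0 * leverage - 4.0 * roa + rng.logistic(size=n)
y = (latent > 0).astype(int)

logit_res = sm.Logit(y, X).fit(disp=False)
probit_res = sm.Probit(y, X).fit(disp=False)

# Coefficients differ roughly by a scale factor, but fitted probabilities align closely
print(logit_res.params.round(3))
print(probit_res.params.round(3))
print("correlation of fitted probabilities:",
      np.corrcoef(logit_res.predict(X), probit_res.predict(X))[0, 1].round(4))
```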

2.4  Methodological Issues With Early Corporate Failure Studies

Zmijewski (1984, p. 59) dealt with two specific methodological problems in corporate failure modelling. The first is the bias in results arising from “oversampling” of distressed firms, which is a choice-based sample bias (for instance, matched pair designs oversample the number of bankrupt firms observable in practice, which can lead to biased estimates). This bias results when a researcher first observes the dependent variable and then selects a sample based on that knowledge; that is, the probability of a firm entering the sample depends on the dependent variable's attributes. The second bias results from using a “complete data” sample selection criterion and falls within the topic of sample selection biases. This bias results when only observations with complete data are used to estimate the model and incomplete data observations occur nonrandomly. The first issue relates to the low frequency of firms exhibiting financial distress characteristics (e.g., petitioning for bankruptcy) in the population. The second is that data for financially distressed firms are often unavailable. Using a weighted exogenous sample maximum likelihood (WESML) adjustment, Zmijewski (1984) demonstrated that while probit parameter estimates are biased, this bias generally does not impact statistical inferences or the overall classification accuracy of the model. Using a bivariate probit model to assess the second bias, Zmijewski's (1984) results were qualitatively similar to the choice-based sample results in that a bias is clearly shown to exist, but generally it does not appear to impact the statistical inferences or overall classification rates. However, Zmijewski (1984, p. 80) concluded that the adjusted estimation techniques appear to estimate probability distributions which fit the population distribution better. This can be important if cardinal probability estimates are required (e.g., for expected value calculations) or if the effect of an independent variable on the probability estimates is of interest.
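The flavour of the choice-based sampling adjustment can be conveyed with a simple re-weighting scheme in which each observation is weighted by the ratio of its class's population share to its sample share. This is only a rough stand-in for the full WESML estimator, and the 2% population failure rate assumed below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# A matched-pair style estimation sample: 200 failed and 200 non-failed firms
X = rng.normal(size=(400, 3))                  # three financial ratios
y = np.r_[np.ones(200), np.zeros(200)].astype(int)

# Assumed population failure rate of 2% versus a 50% sample failure rate
pop_share = {1: 0.02, 0: 0.98}
sample_share = {1: 0.50, 0: 0.50}
w = np.array([pop_share[c] / sample_share[c] for c in y])

unweighted = LogisticRegression().fit(X, y)
weighted = LogisticRegression().fit(X, y, sample_weight=w)

# The weighted intercept is pulled towards the much lower population failure
# rate, giving better-calibrated probability estimates (slopes change little)
print("unweighted intercept:", unweighted.intercept_[0].round(3))
print("weighted intercept:  ", weighted.intercept_[0].round(3))
```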


There are several other methodological issues arising from the early corporate failure studies. For instance, the literature review by Jones (1987) pointed out the need for appropriate validation methods when estimating and testing corporate failure prediction models. The use of test or holdout samples is crucial, as statistical learning models can easily overfit the training data and have limited generalizability when a test sample is not used. It is surprising that many corporate failure studies do not use test or holdout samples, which can greatly limit the value of the empirical findings. It is well appreciated that many of the earlier corporate failure studies were hamstrung by small failure samples. Hence, some studies used the Lachenbruch estimator (or jack-knife), which is a “leave one out” resampling strategy. The essence of the approach is to leave one observation out of the dataset, estimate the model without that observation, and then use the model to predict the left-out observation. The Lachenbruch method might be appropriate when the sample is very small, but it is not considered an adequate replacement for test samples.

Another important point made by Jones (1987, p. 140) is that “using too many ratios can actually make a model less useful”. This point was also raised by Zavgren and Friedman (1988, p. 34), who concluded that the financial variables used in the earlier studies were chosen on an arbitrary basis with no theoretical or empirical support. The result is a confusing array of variables used, sometimes as many as 30 (often redundant) financial ratios. This makes the models' results difficult to interpret. Jones (1987) also noted the limitations of the matching approach to sample selection. Using this approach, it is not possible to investigate the effects of industry sector, company size, or year of failure on the probability of corporate failure. However, more recent studies have addressed many of these concerns by using much larger samples of distressed and bankrupt firms and by using samples of failed and non-failed firms more representative of the proportion of failures observable in practice. Furthermore, modern machine learning methods, discussed in Chapter 3, are capable of handling many hundreds, even thousands, of predictor variables (even in the presence of high correlation among the features) without compromising model stability and performance.
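The Lachenbruch “leave one out” strategy described above is straightforward to reproduce with modern tooling. The sketch below runs leave-one-out validation of a logit classifier on a deliberately small simulated sample; all variable names and data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
n = 60                                    # a deliberately small sample
X = rng.normal(size=(n, 4))               # four financial ratios
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)

# Each of the n fits leaves one firm out and predicts that firm
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("leave-one-out classification accuracy:", scores.mean().round(3))
```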


Mensah (1984) is another early study that addressed methodological issues in corporate failure modelling. In particular, he critiqued previous research that pooled failure observations together for model estimation. While pooling of observations is often unavoidable because of small sample sizes, Mensah (1984) argued that such pooling ignores the economic environment, which can change over different time periods. Mensah (1984) drew three major conclusions from his study. First, the accuracy and structure of predictive models differ across different economic environments. Predictive results can improve if corporate failure models are estimated over different time periods. In other words, we should not assume stationarity in the models,7 as corporate failure rates can rise in times of economic recessions and financial crises and fall in times of economic stability and growth. As Mensah (1984, p. 381) stated, the best multivariate prediction models would show some nonstationarity, with different ratios becoming important at different periods depending on the economic events which triggered failures in the period examined. The reasons we might expect some nonstationarity in the economic environment could include changing inflation rates, changes in interest rates and credit availability, and changes in business cycle phases (contraction/expansion). To some extent, Moyer (1977) provided evidence for the observations by Mensah (1984). Moyer (1977) tested nonstationarity indirectly by replicating the Altman (1968) study in a different time period and produced much weaker predictions as a result. Second, Mensah (1984) concluded that predictive models are more appropriate if applied to different industry groups, even for the same economic environment. Third, more useful empirical results can be obtained by explicitly considering multicollinearity and intersectoral development when developing corporate failure models. Moyer (1977) also concluded that reducing multicollinearity can improve the general application of a corporate failure prediction model.

2.5  Emergence of Multi-Class Models

Another methodological limitation in early corporate failure studies was the use of binary models where distress can only be one of two possible states. Lau (1987) was one of the first studies to model corporate failure in a multi-class setting. Lau (1987) argued that companies face a continuum of financial distress states before going into bankruptcy or liquidation. Lau (1987) used an ordered logit model with five distress states ranging from the least severe to the most severe. The five distress states included: (1) financial stability; (2) omitting or reducing dividend payments; (3) technical defaults and default on loan payments; (4) protection under Chapters 10 or 11 of American bankruptcy law; and (5) bankruptcy and liquidation. The results of this study showed that multinomial logit analysis (MNL) outperformed multinomial discriminant analysis. While Lau (1987) improved on the methodology of dichotomous prediction models by using a five-state model, the study does have a few limitations (see also Ward, 1994). For instance, the MNL approach selected is not robust to violations of the highly restrictive IID assumption, and some of the distress states identified by Lau (1987) are hampered by unacceptably small sample sizes. Using a holdout sample, Lau's model predicted the financial stability state with reasonable accuracy but was less impressive in predicting the other states of financial distress. Another example of a multi-class model is Barniv et al. (2002). They used an ordered logit model to predict bankruptcy resolutions. Prior studies had focused on predicting the bankruptcy filing event or on discriminating between healthy and bankrupt firms.


Barniv et al. (2002) noted that US bankruptcy courts confirm one of three possible final bankruptcy resolutions, namely: acquisition, emergence, or liquidation. Their results are based on a sample of 237 firms, of which 49 were acquired, 119 emerged from bankruptcy, and 69 were liquidated. The authors utilized a ten-variable model, which included five accounting-based variables and five non-accounting variables. Barniv et al.'s (2002) model performed quite well, correctly classifying 61.6% of all firms into their respective three groups (acquired, emerged, and liquidated) and correctly classifying 75.1% of firms in a two-group setting (acquired and emerged vs liquidated). Given the small sample sizes, Barniv et al. (2002) assessed classification accuracy using the Lachenbruch U-estimator, as discussed earlier. In addition, Barniv et al. (2002) examined their results using an inter-temporal split into an estimation and holdout sample. The classification accuracies for the holdout sample were 48.8% for the three groups and 69.9% for the two groups.

Another study by Ward (1994) used an ordered four-state dependent variable to test the incremental predictive power of explanatory variables. The ordinal states used by Ward (1994, p. 549) included:8 (1) financially healthy; (2) cash dividend reduction; (3) loan principal/interest default or debt accommodation (extension of cash payment schedules, reduction in principal, or reduced interest rates); and (4) bankruptcy (if a firm filed, or was forced to file, for Chapter 11 protection). Ward (1994) tested eight variables in his multi-class logit model: net income/total assets; sales/current assets; owners' equity/total liabilities; current assets/current liabilities; current assets/total assets; cash plus marketable securities/total assets; cash flow from operating activities; and net income plus depreciation and amortization. The results indicated that a simple cash flow measure (net income plus depreciation and amortization) was a better predictor of distress than net income over total assets. Ward (1994) concluded that the simple measure of cash flow was a strong predictor not because it should be interpreted as a measure of operating cash flow but rather because it outperforms net income. Ward (1994) also concluded that the more refined operating cash flow variable (cash flow from operating activities) added incremental explanatory power to his model, as it remained statistically significant even when the simple measure of cash flow was added to the model.

A further study using a multi-class logit model was Johnsen and Melicher (1994). Their study identified three states of financial distress: (1) non-bankrupt firms, (2) financially weak firms, and (3) bankrupt firms. The authors' findings indicated that including the “financially weak” distressed state reduced the model's misclassification error and, further, that the three states of distress are seemingly independent. However, their study also suffered from the general limitations of MNL models, such as the restrictive IID condition.9 Other multi-class approaches to bankruptcy prediction used a competing risks framework, which is discussed further later under hazard models.
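Ordered logit estimators of the kind used by Lau (1987), Ward (1994), and Barniv et al. (2002) are available in statsmodels (OrderedModel). As a simpler stand-in, the sketch below fits an unordered multinomial logit to four simulated distress states, which conveys the basic multi-class idea; the state definitions, variable names, and data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1200
X = rng.normal(size=(n, 5))               # five financial ratios

# Assumed states: 0 = healthy, 1 = dividend cut, 2 = loan default, 3 = bankruptcy
latent = X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.5, size=n)
y = np.digitize(latent, bins=[0.5, 1.5, 2.5])

# With a multi-class target, LogisticRegression fits a multinomial logit
mnl = LogisticRegression(max_iter=1000).fit(X, y)
print("per-state probabilities for the first three firms:")
print(mnl.predict_proba(X[:3]).round(3))
```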

2.6  Mixed Logit and Nested Logit Models

Many of the limitations identified in previous corporate failure studies (Jones, 1987; Zmijewski, 1984; and others) were addressed by Jones and Hensher (2004).


Jones and Hensher (2004) introduced discrete choice modelling to the corporate failure literature. Discrete choice theory is concerned with understanding the discrete behavioural responses of individuals or firms to the actions of business, markets, and government when faced with two or more possible outcomes (or choices). Given that the analyst has incomplete knowledge of the information inputs of the agents being studied, the analyst can only explain a choice outcome up to a probability of it occurring. This is the basis for the theory of random utility (Louviere et al., 2000). While random utility theory has developed from economic theories of consumer behaviour, it can be applied to any unit of analysis (e.g., firm failures) where the dependent variable is discrete. Based on random utility theory, Jones and Hensher (2004) introduced a more sophisticated form of logit modelling known as mixed logit (and later the nested logit model). Following Ohlson's criticism of binary class models (see also Lau, 1987; Ward, 1994), they investigated corporate failure in a multi-class setting. They note that while an extensive literature on financial distress prediction has emerged over the past three decades, innovative modelling techniques have been slow to develop. In fact, many commonly used techniques would rate as primitive and dated in other fields of the social sciences. As we have seen in this chapter so far, much of this literature has relied on relatively simplistic linear discriminant analysis (LDA), binary logistic or probit analysis, or rudimentary multinomial logit models (MNL). Mixed logit and its variants have now supplanted simpler models in many areas of economics, marketing, management, transportation, health, housing, energy research, and environmental science (Train, 2003). This can largely be explained in terms of the substantial improvements delivered by mixed logit over binary logistic and MNL models.

The mixed logit model is an example of a model that can accommodate firm-specific heterogeneity through random parameters. The essence of the approach is to decompose the stochastic error component into two additive (i.e., uncorrelated) parts. One part is correlated over alternative outcomes (such as corporate distress outcomes) and is heteroscedastic, and another part is IID over alternative outcomes and firms, as shown in equation (2.4) (ignoring the t subscript for the present):

Uiq = β′Xiq + (ηiq + εiq), (2.4)

where ηiq is a random term with zero mean whose distribution over firms and alternative outcomes depends on underlying parameters and explanatory variables (X) and observed data relating to alternative outcome i and firm q; and εiq is a random term with zero mean and is IID over alternative outcomes. As demonstrated in Jones and Hensher (2004), the mixed logit class of models assumes a general distribution for η and an IID extreme value type 1 distribution for ε. That is, η can take on a number of distributional forms such as normal, lognormal, and triangular. Denote the density of η by f(η | θ), where θ are the fixed parameters of the distribution. For a given value of η, the conditional probability for outcome i is logit, since the remaining error term is IID extreme value, as shown in equation (2.5):

Li(η) = exp(β′xi + ηi) / ∑j exp(β′xj + ηj). (2.5)

Since η is not given, the (unconditional) outcome probability is this logit formula integrated over all values of η, weighted by the density of η, as shown in equation (2.6):

Pi = ∫ Li(η) f(η | θ) dη. (2.6)

As noted by Jones and Hensher (2004), these models are called mixed logit because the outcome probability Li(η) is a mixture of logits with f as the mixing distribution. The probabilities do not exhibit the well-known independence from irrelevant alternatives (IIA) property, and different substitution patterns are obtained by appropriate specification of f. This is handled in two ways. The first way, known as the random parameter specification, involves specifying each βq associated with an attribute of an alternative outcome as having both a mean and a standard deviation (i.e., it is treated as a random parameter instead of a fixed parameter). The second way, known as the error components approach, treats the unobserved information as a separate error component in the random component.

Considering the case of firm failures, the main improvement is that mixed logit models include a number of additional parameters that capture observed and unobserved heterogeneity both within and between firms. In addition to fixed parameters, mixed logit models include estimates for the standard deviation of random parameters, the mean of random parameters, and the heterogeneity in the means. For a mixed logit model, the probability of failure of a specific firm in a sample is determined by the mean influence of each explanatory variable with a fixed parameter estimate within the sampled population, plus, for any random parameters, a parameter weight drawn from the distribution of individual firm parameters estimated across the sample (Jones and Hensher, 2004). This weight is randomly allocated to each sampled firm unless there are specific rules for mapping individual firms to specific locations within the distribution of firm-specific parameters.10 In contrast, the probability of failure for an individual firm using a binary logistic or MNL model is simply a weighted function of its fixed parameters, with all other behavioural information assigned (incorrectly) to the error term.11 As noted by Hensher and Greene (2003), parameter estimation in the mixed logit model maximizes use of the behavioural information embedded in any dataset appropriate to the analysis. Ultimately, these conceptual advantages afford the analyst a substantially improved foundation for explanation and prediction.

A related problem is that most studies to date have modelled failure as a simplistic binary classification of failure or non-failure (see Jones, 1987). This has been questioned, in part because the legal definition of failure may not always represent the underlying economic realities of business financial health (Lau, 1987; Delaney, 1999).


The two-state model can clash with underlying theoretical explanations of financial failure, limiting the generalizability of empirical findings (Scott, 1981; Bahnson and Bartley, 1992; Hill et al., 1996). Others have suggested that lenders' and other stakeholders' risk assessment judgements can rarely be reduced to a simple payoff space of failed and non-failed assets (Ward, 1994; Ohlson, 1980). Jones and Hensher (2004) modelled financial distress in three states: (1) non-failed firms; (2) insolvent firms, which for the purposes of their study are defined by: (i) failure to pay Australian Stock Exchange (ASX) annual listing fees as required by ASX Listing Rules; (ii) a capital raising specifically to generate sufficient working capital to finance continuing operations; (iii) loan default; and (iv) a debt/total equity restructure due to a diminished capacity to make loan repayments; and (3) firms that filed for bankruptcy followed by the appointment of liquidators, insolvency administrators, or receivers.

Following the approach of Joy and Tollefson (1975), Jones and Hensher (2004) used a test or validation sample, which was collected for the period 2001–2003 using the same definitions and procedures applied to the estimation sample. This produced a final useable sample of 4,980 firm years in the non-failed state 0, and 119 and 110 firm years in states 1 and 2, respectively. The financial measures examined in their study included ratios based on cash position; operating cash flow (CFO); working capital; profitability and earnings performance; turnover; financial structure; and debt servicing capacity. Variables found to be significant in the Jones and Hensher (2004) study included the fixed parameters total debt to gross operating cash flow and working capital to total assets. Random parameters that were significant in the study included: cash resources to total assets, net operating cash flow to total assets, total debt to total equity, and cash flow cover. After adjusting for the number of parameters, Jones and Hensher (2004) reported that the mixed logit produced a substantially improved model fit compared with standard MNL. Random parameter estimation in the mixed logit model maximized the use of the behavioural information embedded in the dataset. Ultimately, these conceptual advantages afforded the analyst a substantially improved foundation for explanation and prediction. Furthermore, while the out-of-sample forecasting accuracy of the mixed logit design was much superior to the multinomial logit model, it should be noted that the results of Jones and Hensher (2004) were class-based forecasts, not individual firm forecasts.

In advocating the mixed logit model, Jones and Hensher (2004) acknowledged certain limitations of “static” or single-period distress models that draw on multiple-period distress data (Shumway, 2001; Leclere, 2000). Much of the literature is of this genre (see Altman, 2002, for a survey). As discussed later under hazard model approaches, Shumway (2001) concluded that static models can result in inconsistent and biased estimates. Firm characteristics can change over time, but such problems are not expected to be pronounced in the Jones and Hensher (2004) study given the relatively short time frame from which their estimation sample is drawn.


While a hazard analysis was not deemed appropriate to the modelling focus of their study, Jones and Hensher (2004) did meet some of Shumway's (2001) concerns by undertaking an assessment of a geometric lag specification to test time dependencies in their estimation sample. Hensher et al. (2007) enhanced the mixed logit model to capture additional alternative-specific unobserved variation not subject to the constant variance condition, which is independent of sources revealed through random parameters. They referred to this as the generalized latent kernel effect.
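Equation (2.6) has no closed form, but it can be approximated by simulation, which is essentially how mixed logit models are estimated in practice. The sketch below averages the conditional logit probabilities of equation (2.5) over draws of a normally distributed random coefficient; the attribute values and parameters are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# One firm, three distress states (0, 1, 2); two attributes per state
X = np.array([[0.2, 1.0],
              [0.6, 0.5],
              [0.9, 0.1]])

beta2 = 1.2                       # fixed coefficient on the second attribute
beta1_mean, beta1_sd = -1.5, 0.8  # random (normal) coefficient on the first attribute

draws = rng.normal(beta1_mean, beta1_sd, size=5000)
probs = np.zeros(3)
for b1 in draws:
    v = X @ np.array([b1, beta2])          # utilities for each state
    probs += np.exp(v) / np.exp(v).sum()   # conditional logit, equation (2.5)
probs /= draws.size                        # simulated integral, equation (2.6)

print("simulated mixed logit outcome probabilities:", probs.round(3))
```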

2.6.1  The Nested Logit Model

Jones and Hensher (2007) extended their earlier study by examining the predictive power of multinomial nested logit (NL) models. To understand how the NL model works, take an example from Standard and Poor's credit ratings. Here we might have six outcome alternatives: three level A rating outcomes (AAA, AA, A, called the a-set) and three level B rating outcomes (BBB, BB, B, called the b-set). The NL model is structured such that it predicts the probability of a particular A-rating outcome conditional on an A-rating, and the probability of a particular B-rating outcome conditional on a B-rating. The model then predicts the probability of an A or a B outcome (called the c-set). That is, we have two lower-level conditional outcomes and an upper-level marginal outcome. Since each of the “partitions” in the NL model is of the MNL form, each displays the IID condition between the alternatives within a partition. However, the variances differ between the partitions. Jones and Hensher (2007) argued that the performance of the NL model was an important research consideration for at least two reasons. First, it establishes whether NL can be considered a complementary and/or alternative modelling technique to standard logit or even mixed logit. For instance, Jones and Hensher's (2004) empirical comparison is restricted to standard logit, i.e., multinomial logit (MNL), which is the most basic form of closed-form discrete choice model in the social sciences (Train, 2003). The NL model is the most generally used advanced closed-form discrete choice model, especially for unordered outcomes, while mixed logit is the most advanced open-form model (for both ordered and unordered outcomes). Second, the NL model, having a closed-form solution, has certain practical benefits not shared by open-form models such as mixed logit. The main benefit is that parameter estimates and probability outcomes in an NL model are generally easier to estimate, interpret, and apply, especially as the number of attributes and alternatives increases. In a mixed logit model, determining parameter estimates and probability outcomes is more difficult. Unlike the NL model, which has a closed-form solution and guarantees a unique globally optimal set of parameter estimates (given starting values from an MNL model), the mixed logit model can produce a range of solutions (regardless of starting values), only one of which is globally optimal.


In particular, the mixed logit model incorporates both random and fixed parameter estimates, whereas NL incorporates only fixed parameters in model estimation. Estimation of random parameters requires complex analytical calculations, which involve integration of the logit formula over the distribution of unobserved random effects across the set of alternatives. Outcome probabilities cannot be calculated exactly because the integral does not have a closed form in general; however, they can be approximated through simulation (see Stern, 1997). While NL has some of the practical benefits of a closed-form solution, it is conceptually superior to standard closed-form models such as MNL because the model specification partially corrects for the restrictive IID condition and allows for the incorporation of unobserved heterogeneity to some extent. In this sense, the NL model is widely regarded as a “halfway house” between the mixed logit model and the standard logit model.

For the purposes of their study, Jones and Hensher (2007) utilized a four-state distress model for their NL analysis, which included: (1) non-failed firms; (2) insolvent firms, defined as: (i) loan default; (ii) failure to pay Australian Stock Exchange (ASX) annual listing fees as required by ASX Listing Rules; (iii) a capital raising specifically to generate sufficient working capital to finance continuing operations; and (iv) a debt/equity restructure due to a diminished capacity to make loan repayments; (3) financially distressed firms that were delisted from the ASX because they were subject to a merger or takeover arrangement; and (4) firms that filed for bankruptcy followed by the appointment of receiver managers/liquidators. For the purposes of their study, states 0–3 are treated as mutually exclusive states within the context of an unordered NL model. The nested structure is shown in Figure 2.2, where merger and failure outcomes are conditional on the restructured firms category. The derivation of elemental probabilities for a nested logit model is shown in the Appendix.

Jones and Hensher (2007) reported that four financial variables (total liabilities to total assets; total debt to gross operating cash flow; two periods of consecutive negative reported operating cash flows; and total debt to total equity) had the strongest statistical impact on the distress outcome. A similar set of covariates was statistically significant in the basic MNL model, but there were some noteworthy differences. For instance, the sales to total assets and working capital to total assets variables were significant in the MNL model (but not in the NL analysis), and the total debt to total equity ratio was not statistically significant in the MNL. Their results also showed that the reported cash flow variables had the highest overall significance in the unordered NL model (this appears consistent with other studies such as Ward, 1994).


FIGURE 2.2  Nested Tree Structure for States of Financial Distress
Source: Jones and Hensher (2008)
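To make the two-level structure just described concrete, the sketch below computes nested logit probabilities for the credit-rating style example given earlier: conditional probabilities within each nest, inclusive values (log-sums), marginal nest probabilities, and their product. The utilities and nest scale parameters (lambdas) are assumed values chosen only for illustration.

```python
import numpy as np

nests = {
    "a-set": {"AAA": 1.2, "AA": 0.9, "A": 0.5},
    "b-set": {"BBB": 0.3, "BB": -0.2, "B": -0.8},
}
lam = {"a-set": 0.6, "b-set": 0.7}   # nest scale parameters, 0 < lambda <= 1

# Lower level: probability of each outcome conditional on its nest,
# and the nest's inclusive value (log-sum)
cond, iv = {}, {}
for m, utils in nests.items():
    scaled = {i: np.exp(v / lam[m]) for i, v in utils.items()}
    denom = sum(scaled.values())
    cond[m] = {i: s / denom for i, s in scaled.items()}
    iv[m] = np.log(denom)

# Upper level: marginal probability of choosing each nest
nest_prob = {m: np.exp(lam[m] * iv[m]) for m in nests}
total = sum(nest_prob.values())
nest_prob = {m: p / total for m, p in nest_prob.items()}

# Unconditional probability of each outcome = conditional x marginal
for m in nests:
    for i, p in cond[m].items():
        print(f"P({i}) = {p * nest_prob[m]:.3f}")
```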

Jones and Hensher (2008) summarized the strengths of the standard logit, mixed logit, and nested logit models, as shown in Table 2.1.

TABLE 2.1  Summary of Major Strengths and Challenges of Different Logit Models

Standard MNL
Major Strengths:
• Closed form solution
• Provides one set of globally optimal parameter estimates
• Simple calculation
• Widely understood and used in practice
• Easy to interpret parameter estimates
• Easy to calculate probability outcomes
• Less demanding data quality requirements
Major Challenges:
• Highly restrictive error assumptions (IID condition)
• Violates the IIA assumption
• Ignores firm-specific observed and unobserved heterogeneity, which can lead to inferior model specification and spurious interpretation of model outputs
• Parameters are point estimates with little behavioural definition
• Often provides good aggregate fits but can be misleading given the simple form of the model
• Tends to be less behaviourally responsive to changes in attribute levels

Nested Logit
Major Strengths:
• Closed form solution
• Provides one set of globally optimal parameters (given MNL start values)
• Relatively easy to interpret parameter estimates
• Relatively easy to calculate probability outcomes
• Partially corrects for IID and IIA condition
• Incorporates firm-specific observed and unobserved heterogeneity to some extent (especially the covariance extension)
Major Challenges:
• Only partially corrects for IID condition
• Does not capture potential sources of correlation across nests
• Judgement required in determining which alternatives can be appropriately partitioned into nests (nested logit requires well-separated nests to reflect their correlation)

Mixed Logit
Major Strengths:
• Allows for complete relaxation of IID condition
• Avoids violation of the IIA condition
• High level of behavioural definition and richness allowed in model specification
• Includes additional estimates for random parameters, heterogeneity in means, and decompositions in variances (these influences are effectively treated as “white noise” in basic models)
Major Challenges:
• Open form solution (requires analytical integration and use of simulated maximum likelihood to estimate model parameters)
• Lack of a single set of globally optimal parameter estimates (i.e., due to the requirement for simulated maximum likelihood)
• Assumptions must be imposed for the distribution of unobserved influences
• Complex interpretation
• Model estimation can be time consuming due to computational intensity
• High quality data constraints

Source: Jones and Hensher (2008)

2.7  Hazard Model Approaches

A major limitation of LDA and discrete choice models such as logit and probit is that they are not designed to handle panel data structures. Jones and Hensher (2004) attempted to remedy this issue by specifying a geometric distributed lag structure in their mixed logit analysis. However, this is by no means a complete solution. Hazard models are generally more suitable for panel data structures and have become very popular in the corporate failure literature as a result. Hazard models are essentially time-to-event models where the outcome variable is the time until an event occurs. In this context, time can mean years, months, weeks, or even days, depending on the research context. In a health context, the term event can mean death, incidence of disease, or relapse from remission (as examples). In the context of business failure, an event means some type of distress event or a bankruptcy event. Hazard models typically assume there is only one event. However, when there are many possible events (for instance, death can come from many possible diseases), they are termed competing risk models (we discuss competing risk models in the next section). In hazard analysis, the time variable is usually called the “survival time” because, in the case of bankruptcies, it gives the time an entity has survived over the period. A key issue in survival analysis is censoring. As stated in Kleinbaum and Klein (2012, p. 5), “in essence, censoring occurs when we have some information about individual survival time, but we don't know the survival time exactly”. In the context of bankruptcy, we may have distressed firms that only reach the event in question (i.e., bankruptcy) after the study ends (or at the end of the study's sample period). In this case, the survival time is, at a minimum, as long as the study period, but we do not have a completely accurate picture of the survival time.


Another example is when an entity is removed from the sample for reasons other than bankruptcy prior to the end of the sample period (for example, a firm might be subject to a merger or takeover). Data can be right censored or left censored. The examples above are right-censoring problems, but left censoring might occur when we do not know exactly when a firm went bankrupt. The method of estimating event probability is called the cause-specific hazard function, which is mathematically expressed as follows (Kleinbaum and Klein, 2012):

hc(t) = lim(Δt→0) P(t ≤ Tc < t + Δt | Tc ≥ t) / Δt. (2.7)

The numerator of the equation above is a conditional probability that gives the probability that an entity's survival time, the random variable T, will lie in the interval between t and t + Δt given that the survival time T is greater than or equal to t. The denominator of the formula, Δt, denotes a small time interval. The probability P divided by Δt gives us the hazard rate, which takes a value between zero and infinity. Taking the limit of the right-side expression gives us the instantaneous risk of failure at time t per unit time given survivorship up to time t (Kleinbaum and Klein, 2012, p. 13). If we consider a healthy entity, it will have a constant h(t) relationship with t. That is, the instantaneous potential for failing (or going bankrupt) at any time during the period remains constant over the sample period (this is called an exponential model). A hazard function that increases over time is called an increasing Weibull model. For instance, a distressed company that does not improve with managerial intervention to turn the company around will have an increasing potential of failing as the survival time increases. A decreasing Weibull model might show the opposite effect for a distressed firm that is turned around by managerial intervention. In this case, the potential to fail is actually reduced as a function of survival time. One of the most popular hazard models in the literature is the Cox proportional hazards (PH) model. This is because the model is quite robust and will often closely approximate any parametric version of the model:

h(t, X) = h0(t) exp(∑i=1,…,p βiXi). (2.8)

The model gives an expression for the hazard at time t for an entity with a given vector of explanatory variables denoted by X. The Cox model shows that the hazard at time t is the product of two quantities. The first is h0(t), which is the baseline hazard function; because h0(t) has an unspecified functional form, the Cox PH model is semi-parametric. If the functional form is specified, such as in a Weibull hazard model, the model would be fully parametric. The second quantity is the exponential expression e raised to the linear sum ∑ βiXi, where the sum is over the p explanatory X variables (parameters are estimated using maximum likelihood estimation). The Xs are time-independent (they do not involve t). An extended model, called the extended Cox model, can accommodate time-dependent explanatory variables (Kleinbaum and Klein, 2012, pp. 108–109).
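A Cox proportional hazards model of the form in equation (2.8) can be estimated with the lifelines package, as in the following sketch on simulated firm data. The covariates, censoring rule, and data-generating process are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
n = 500
leverage = rng.normal(0.5, 0.2, n)
roa = rng.normal(0.05, 0.1, n)

# Assumed process: higher leverage and lower profitability shorten survival
baseline = rng.exponential(scale=10, size=n)
duration = baseline * np.exp(-1.5 * leverage + 2.0 * roa)
failed = (duration < 8).astype(int)        # right-censor firms still alive at 8 years
duration = np.minimum(duration, 8)

df = pd.DataFrame({"years": duration, "failed": failed,
                   "leverage": leverage, "roa": roa})

cph = CoxPHFitter()
cph.fit(df, duration_col="years", event_col="failed")
cph.print_summary()                         # hazard ratios for each covariate
```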


As noted by Shumway (2001), most bankruptcy researchers have used single-period classification models (or cross-sectional models) with corporate failure data that usually span many periods (Shumway calls these “static” models). According to Shumway (2001), by ignoring the fact that firms can change state through time, static models inevitably produce biased and inconsistent estimates of the probabilities that they approximate. To remedy this limitation, Shumway (2001) proposed a hazard model that is “simple to estimate, consistent, and accurate”. As discussed previously, hazard models resolve the problems of static models by explicitly accounting for time dependency in failure prediction. The dependent variable in a hazard model, as discussed earlier, is the time spent by a firm in the healthy group. When firms leave the healthy group for some reason other than bankruptcy (e.g., merger), they are considered censored, or no longer observed. Static models simply consider such firms healthy. In a hazard model, a firm's risk of bankruptcy changes through time, and its health is a function of its latest financial data and its age. As noted by Shumway (2001), the bankruptcy probability that a static model assigns to a firm does not vary with time. Time dependency can only be accommodated in logit models in a more rudimentary way, such as including dummy variables for different time periods in the model. Shumway's (2001) hazard approach has been characterized as the equivalent of a multi-period logit model with an adjusted standard-error structure (see Duffie et al., 2007).

To summarize, Shumway (2001) suggested three fundamental reasons to prefer hazard models for forecasting failure. First, static models fail to control for each firm's period at risk. When sampling periods are long, it becomes important to account for the fact that some businesses collapse after many years of being in trouble, while others fail in their first year. This risk is automatically adjusted for in hazard models. Second, hazard models can incorporate time-varying covariates (using the extended Cox model mentioned earlier), which change over time. If the health of a firm deteriorates leading up to bankruptcy, then allowing its financial data to reveal its changing health can improve prediction. Hazard models exploit each firm's time-series data by including annual observations as time-varying covariates. Unlike static models, they can incorporate macro-economic variables that are the same for all firms at a given point in time. Hazard models can also account for potential duration dependence, or the possibility that firm age might be an important explanatory variable. Another advantage of hazard models is that they can produce more efficient out-of-sample forecasts by utilizing much more of the sampled data.

Shumway (2001) estimated both hazard and static models and examined their out-of-sample accuracy using a sample of 300 bankruptcies between 1962 and 1992. Estimating the hazard model with a set of bankruptcies observed over 31 years, Shumway (2001) documented that while half of the accounting ratios used in previous research are poor predictors, several market-driven variables proved to be strongly predictive of bankruptcy.


Shumway (2001) demonstrated that a firm's market size, its past stock returns, and the idiosyncratic standard deviation in its stock returns all predicted failure quite well. By combining these market-driven variables with two accounting ratios, Shumway (2001) produced his preferred model, which proved quite accurate in out-of-sample tests. While Shumway's (2001) claims about the theoretical and empirical advantages of hazard models, particularly when compared to traditional discrete choice models such as logit, are well founded, we show in Chapter 3 that the predictive performance of hazard models can be outperformed by modern machine learning models such as deep learning. In particular, I compare hazard models with a version of deep learning that is designed to accommodate panel data.

Hillegeist et al. (2004) also used a hazard model to test the usefulness of market-based variables in corporate failure prediction. Their sample consisted of 78,100 firm-year observations and 756 initial bankruptcies sampled over the period 1980–2000. The objective of their study was to evaluate the incremental predictive power of accounting and market-based predictors of bankruptcy, as most previous research had focused on financial variables only. Hillegeist et al. (2004) presented a compelling argument why market prices might be better predictors of corporate bankruptcy than accounting-based measures. For instance, while failure probabilities predict future events, accounting data is based on past performance and information that may have limited value for forecasting purposes. The authors also pointed out that financial statements are prepared using generally accepted accounting principles (GAAP), such as going concern and conservatism. Hillegeist et al. (2004) stated that financial statements are formulated under the going-concern principle, which assumes that firms will not go bankrupt. Thus, their ability to accurately and reliably assess the probability of bankruptcy will be limited by design. The conservatism principle usually results in the understatement of non-current assets, resulting in the overstatement of leverage (debt/equity). These aspects of the accounting system will limit the performance of any accounting-based failure model. A further limitation of traditional corporate failure models is that they rely almost exclusively on financial statement information and typically lack a measure of asset volatility. Hillegeist et al. (2004) stated that the probability of corporate failure can be impacted by volatility:

Volatility is a crucial variable in bankruptcy prediction because it captures the likelihood that the value of the firm's assets will decline to such an extent that the firm will be unable to repay its debts. Ceteris paribus, the probability of bankruptcy is increasing with volatility. Two firms with identical leverage ratios can have substantially different PBs [i.e., probabilities of bankruptcies] depending on their asset volatilities. Therefore, volatility is an important omitted variable in both the Altman (1968) and Ohlson (1980) bankruptcy prediction models.
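The role of asset volatility can be illustrated with a simple Merton-style calculation of the kind developed below: holding leverage fixed, the probability of default rises sharply with asset volatility. The asset value, drift, and debt figures are assumed inputs, and asset value and volatility are taken as given here rather than backed out from equity data, as they would be in practice.

```python
from math import log, sqrt
from scipy.stats import norm

def merton_pd(assets, debt_face, mu, sigma_assets, horizon=1.0):
    """Distance to default and default probability over `horizon` years."""
    dd = (log(assets / debt_face) + (mu - 0.5 * sigma_assets ** 2) * horizon) \
         / (sigma_assets * sqrt(horizon))
    return dd, norm.cdf(-dd)

# Same leverage in every case; only asset volatility changes
for vol in (0.15, 0.30, 0.45):
    dd, pd_ = merton_pd(assets=120.0, debt_face=100.0, mu=0.05, sigma_assets=vol)
    print(f"asset volatility {vol:.0%}: DD = {dd:.2f}, PD = {pd_:.3f}")
```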


Many corporate failure studies based on accounting measures recognize the importance of market price information (see Beaver, 1968b; Altman et al., 1977). Market prices impound a wide range of financial and non-financial information. While the potential for market-based variables to provide information about failure has long been recognized, one challenge with this approach has been how to extract corporate failure probabilities from market prices. Hillegeist et al. (2004) used the option-pricing models of Black and Scholes (1973) and Merton (1974) for this task. Under this approach, the firm's equity can be viewed as a call option on the value of the firm's assets. When the value of the assets is below the face value of liabilities (i.e., the strike price), the call option is left unexercised, and the bankrupt firm is turned over to its debtholders. As discussed further under theoretical models, the key variables used to estimate probabilities under this approach are the market value of equity, the standard deviation of equity returns, and total liabilities. Importantly, the market value of equity represents the equity “cushion” and reflects the amount by which the value of assets can decline before they become insufficient to cover the present value of the debt payments. As the equity cushion diminishes, the probability of corporate failure is expected to increase (Chava and Jarrow, 2004; Beaver et al., 2005). This approach is also known as the structural approach to pricing credit risk, as corporate failure probabilities incorporate the asset-liability structure of a company. As pointed out by Hillegeist et al. (2004), the main advantage of using option-pricing models in corporate failure prediction is that they are theoretically grounded and can be applied to any public company.

Hillegeist et al. (2004) compared the performance of corporate failure probabilities generated from the Black–Scholes–Merton approach to four accounting-based models: Altman's Z-Score and Ohlson's O-Score, using both the original coefficients of these models and updated model coefficients. Instead of out-of-sample prediction tests, Hillegeist et al. (2004) used a relative information content test to compare model performance based on log-likelihood statistics. The authors argued that an advantage of this approach is that it allows differences in performance measures to be evaluated using tests of statistical significance. They estimated a discrete hazard model to assess how well each probability measure fitted the data and used the Vuong (1989) test to statistically compare the log-likelihood statistics of alternative models. Their results indicated that corporate failure probabilities generated from the Black–Scholes–Merton option pricing model contained significantly more information about the probability of bankruptcy (at the 1% level) than any of the accounting-based models. A comparison of each model's pseudo-R2 shows that the Black–Scholes–Merton approach outperformed the original Z-Score and O-Score by 71% and 33%, respectively. The Black–Scholes–Merton model pseudo-R2 was also 20% larger than the pseudo-R2 generated for the best accounting-based model (which was Ohlson's model). Overall, Hillegeist et al. (2004) concluded that corporate failure studies that use the Z-score and/or O-score lack sufficient statistical power to yield reliable results.


The authors concluded that the asset volatility component of the Black–Scholes–Merton model could explain the superior performance better than the market-based leverage measure. Their sampled firms exhibited considerable cross-sectional variation in volatility. Hillegeist et al. (2004) were so convinced about the power of their market price measures that they even recommended that future research should focus exclusively on market-price indicators for corporate failure prediction. However, these results are somewhat debatable. Most research suggests that corporate failure models that incorporate both accounting and market price variables yield more optimal results than market price variables alone. Several other studies have shown that market variables alone, even when using sophisticated KMV-style distance-to-default methods, do not outperform accounting-based models. For instance, Agarwal and Taffler (2008) used a well-established UK Z-score model similar to that derived by Altman (1968) and compared it to the Black–Scholes–Merton or KMV approach. The authors found that the accounting-based model performed comparably well to market-based approaches, but also that a bank using the Z-score could realize greater risk-adjusted returns than by employing a market-based approach. Similar to prior studies, Agarwal and Taffler (2008) reported that neither the information contained in the market-based variables nor that in the accounting covariates subsumed the other, as shown by the superior performance of a hybrid model that combined both accounting and market variables. Agarwal and Taffler (2008, p. 1550) concluded, “neither of the market-based models nor the accounting-ratio-based model is a sufficient statistic for corporate failure prediction and both carry unique information about firm failure”. A similar result was found in Campbell et al. (2008), discussed further later.

There is no doubt that a market-based model using the Black–Scholes–Merton option pricing framework has many appealing advantages (as noted by the authors, these measures are powerful and flexible and can be consistently estimated across accounting regimes, which facilitates intertemporal comparisons). However, an important research question raised by Hillegeist et al. (2004) is whether the incremental explanatory power of the Black–Scholes–Merton model probabilities is significant enough to make a difference to research studies that incorporate failure probability proxies as explanatory variables (p. 29). Hillegeist et al. (2004) compared the performance of market-based measures with quite dated modelling techniques and input variables (i.e., standard form logit and LDA models from the Ohlson and Altman studies).12 As we will see in the next chapter (and in the empirical demonstration in Chapter 4), modern machine learning methods not only predict better than conventional bankruptcy modelling techniques, but they can also extract signal from many sources of information, including accounting and market price information. Hence, it is premature to conclude that market price variables are clearly superior to financial variables in corporate failure prediction.

Beaver et al. (2005) also used a hazard model to examine the secular (long-term) change in the explanatory power of financial statements by analysing changes in the ability of financial ratios to predict corporate bankruptcy. Beaver et al. (2005) observed that several influences over the previous 40 years (from the time of their study) have potentially impacted the ability of financial ratios to predict corporate bankruptcy.


study) have potentially impacted the ability of financial ratios to predict corporate bankruptcy. These factors include: (1) the establishment of the FASB and the development of accounting standards, many of which adopt fair value requirements; (2) an increase in the relative importance of intangible assets and financial derivatives, especially during the 1990s; and (3) a perceived increase in managerial discretion in financial statements. To provide empirical evidence on these issues, Beaver et al. (2005) examined a sample of bankrupt and non-bankrupt firms over a 40-year time span from 1962 to 2002. The authors divided their sample into two major sub-periods: 1962–1993 and 1994–2002. The three ratio measures used in their study were: return on total assets (ROA), defined as earnings before interest divided by beginning-of-year total assets; EBITDA to total liabilities (ETL), defined as net income before interest, taxes, depreciation, depletion, and amortization divided by beginning total liabilities (both short term and long term), which Beaver (1966) called the "cash flow" to total liabilities ratio; and LTA, a measure of leverage, defined as total liabilities divided by total assets. As observed by Beaver et al. (2005, pp. 95–96): "The precise combination of ratios used seems to be of minor importance with respect to overall predictive power, because the explanatory variables are correlated." Beaver et al. (2005) noted that the inclusion of market-based variables in their study was also appealing for several reasons. First, prior research indicated that market prices reflect a richer and more comprehensive mix of information, which included financial statement data as a subset. Assuming the market-based measures can be successfully defined to extract the probability of bankruptcy from the observed series of security prices, the resulting model can potentially provide superior estimates of the probability of bankruptcy. Second, market-based variables can be measured with a finer partition of time. While financial statements are available at best on a quarterly basis (in the United States at least) and prior research largely uses annual data, market-based variables can exploit the availability of prices daily. Third, the market value-based variables can provide direct measures of volatility, as pointed out in the Hillegeist et al. (2004) study. The market-based variables used by Beaver et al. (2005) included: log of market capitalization (LSIZE), lagged cumulative security residual return (LERET), and lagged standard deviation of security residual returns (LSIGMA). They reported two major findings in their study (p. 118): (1) the robustness of the predictive models is strong over time, showing only slight changes. It is quite striking that a parsimonious three-variable failure model proved surprisingly robust over the 40-year period (meaning the coefficients were quite stable over time); and (2) the slight decline in the predictive power of the financial ratios is offset by improvement in the incremental predictive ability of market-related variables. When the financial ratios and market-related variables are combined, the decline in predictive ability appears to be very small. According to Beaver et al. (2005), their finding is


consistent with non-financial-statement information compensating for a slight loss in the predictive power of financial ratios. In another prominent study, Campbell et al. (2008) utilized the basic hazard model specification used by Shumway (2001) and Chava and Jarrow (2004) and extended the previous failure literature by evaluating a wide range of explanatory variables, including both accounting and market variables. The authors considered several ways in which a firm may fail to meet its financial obligations. The first is bankruptcy filings under either Chapter 7 or Chapter 11 of the US bankruptcy code. Second, they also included firm failures defined more broadly to include financially driven delistings, or D ("default") ratings issued by a rating agency. The broader definition of failure captured circumstances in which firms avoid bankruptcy through negotiation. It also captured firms that performed so poorly that their stocks were delisted, an event that sometimes preceded bankruptcy or formal default. They also explicitly considered how bankruptcy forecasts vary with the horizon of the forecast. Their final sample included 800 bankruptcies, 1600 failures, and predictor variables for 1.7 million firm months. Campbell et al. (2008) measured the excess stock return of each company over the past month, the volatility of daily stock returns over the past three months, and the market capitalization of each company. From accounting data, the authors measured net income as a ratio to assets and total leverage as a ratio to assets. However, the authors also explored some additional variables and reported that corporate cash holdings, the market-to-book ratio, and a firm's price per share contributed to the overall explanatory power of their model. The eight variables used in the Campbell et al. (2008) model are defined as follows:

NIMTAAVG_{i,t} is the geometrically decaying average of Net Income/(Market Value of Equity + Total Liabilities).

TLMTA_{i,t} is Total Liabilities/(Firm Market Equity + Total Liabilities).

EXRETAVG_{i,t} is the geometrically decaying mean, over the previous 12 months, of EXRET_{i,t} = log(1 + R_{i,t}) − log(1 + R_{S&P500,t}), where R_{i,t} is the return on the stock for the month and R_{S&P500,t} is the return on the S&P 500 Index for the month.

SIGMA_{i,t-1,t-3} is computed as an annualized 3-month rolling sample standard deviation, which is calculated as:

SIGMA_{i,t-1,t-3} = \left( 252 \times \frac{1}{N-1} \sum_{k \in \{t-1,\, t-2,\, t-3\}} r_{i,k}^{2} \right)^{1/2},

where r_{i,k} denotes the firm's daily return and N is the number of trading days in the 3-month window.

RSIZE_{i,t} = log(Market Value of Equity/Total S&P 500 Market Value of Equity).

CASHMTA_{i,t} = Cash and Short-Term Investments/(Market Value of Equity + Total Liabilities).

MB_{i,t} = Market-to-Book ratio of the firm.

PRICE_{i,t} = log of price per share, winsorized above 15.
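To make the construction of these predictors concrete, the following Python sketch computes the inputs for a single firm-month from raw price and balance sheet data. It is an illustration only: the variable values, the decay weight used in the geometrically decaying average, and the short trailing series are assumptions for exposition and are not taken from Campbell et al. (2008).

import numpy as np

def decaying_average(series, phi):
    """Geometrically decaying weighted average of a trailing series (most recent first).
    The decay weight phi is a placeholder, not the value used by Campbell et al. (2008)."""
    series = np.asarray(series, dtype=float)
    weights = phi ** np.arange(len(series))
    return np.sum(weights * series) / np.sum(weights)

# Toy firm-month inputs (illustrative values only)
market_equity = 500.0          # market capitalization ($m)
total_liabilities = 400.0      # book value of total liabilities ($m)
cash_and_st_inv = 60.0         # cash and short-term investments ($m)
book_equity = 250.0            # book value of equity ($m)
net_income = 12.0              # net income for the period ($m)
price_per_share = 8.50
sp500_market_value = 3.0e7     # total S&P 500 market value ($m)
firm_ret, index_ret = 0.02, 0.01                      # monthly returns
daily_returns = np.random.normal(0.0, 0.02, size=63)  # roughly three months of daily returns

mta = market_equity + total_liabilities               # market value of total assets

nimta   = net_income / mta
tlmta   = total_liabilities / mta
exret   = np.log(1 + firm_ret) - np.log(1 + index_ret)
sigma   = np.sqrt(252.0 * np.sum(daily_returns ** 2) / (len(daily_returns) - 1))
rsize   = np.log(market_equity / sp500_market_value)
cashmta = cash_and_st_inv / mta
mb      = market_equity / book_equity
price   = np.log(min(price_per_share, 15.0))          # winsorized above $15

# NIMTAAVG / EXRETAVG apply the decaying average to the trailing NIMTA and EXRET series
nimtaavg = decaying_average([nimta, 0.010, 0.008, 0.011], phi=0.5)

print(dict(NIMTA=nimta, TLMTA=tlmta, EXRET=exret, SIGMA=sigma,
           RSIZE=rsize, CASHMTA=cashmta, MB=mb, PRICE=price, NIMTAAVG=nimtaavg))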


Campbell et al. (2008) demonstrated better explanatory power (measured through pseudo-R2) than the Shumway (2001) study for both the bankruptcy and failure samples, as well as significantly improving on the Altman (1968) and Ohlson (1980) models. In particular, the authors discovered that scaling net income and leverage by the market value of assets (rather than the book value) and adding further lags of stock returns and net income significantly improved the overall explanatory power of their model. The authors also constructed a measure of distance to default (DD) based on the practitioner model of Moody's KMV13 (Crosbie and Bohn, 2003) and ultimately on the structural default model of Merton (1974). Consistent with other research, they found that this measure added very little explanatory power to the reduced form variables already included in their model (similar results were reported in Bharath and Shumway, 2008). For instance, the best model reported in Campbell et al. (2008) had better explanatory power than a model estimated only on the DD variable, with an out-of-sample pseudo-R2 nearly double that of the DD-only model. Adding the DD variable to their best model (the eight-variable model defined earlier) failed to improve the overall model fit. Bharath and Shumway (2004) also utilized a discrete-time hazard model on a sample of 1449 firm defaults from 1980 to 2003. They likewise found that the DD measure is not a sufficient statistic for corporate failure prediction and that models including this variable only marginally outperformed models that omitted DD.14 Examining longer-term horizons, Campbell et al. (2008) observed that as the forecast horizon increases, the coefficients, significance levels, and overall fit (using pseudo-R2) of the logit regression decline, as one would intuitively expect. However, even at three years from failure, almost all of their variables remained statistically significant. Campbell et al. (2008, p. 2914) noted that three variables are particularly important over longer time horizons, concluding that market capitalization, the market-to-book ratio, and volatility are persistent attributes of a firm that become increasingly important measures of financial distress at long horizons. The authors also compared the realized frequency of failure to the predicted frequency over time. Although their model underpredicts the frequency of failure in the 1980s and overpredicts it in the 1990s, the model fits the general time pattern very well. Campbell et al. (2008) also addressed a long-standing anomaly in the bankruptcy literature. While the capital asset pricing model would predict that higher risk firms should have higher expected returns (to compensate investors for riskier investments), the exact opposite has been found in many studies (see also Dichev, 1998). Campbell et al. (2008) used their fitted failure model probabilities to calculate the risks and average returns of portfolios of stocks sorted by these probabilities. They


observed that "since 1981, financially distressed stocks have delivered anomalously low returns" (p. 2899). They found that distressed firms have high market betas and high loadings on the high-minus-low (HML) value factor and the small-minus-big (SMB) size factor proposed by Fama and French (1993, 1996) to capture the value and size effects. Importantly, distressed firms have low average returns, indicating that the equity market has not properly priced in distress risk. Controlling for size, Campbell et al. (2008, p. 2934) found that the distress anomaly is stronger for stocks with low analyst coverage, institutional ownership, price per share, and turnover. Thus, like many other anomalies, the distress anomaly is concentrated in stocks that are likely to be expensive for institutional investors to arbitrage. For instance, short selling distressed stocks will be problematic when there are few institutional investors willing to lend their shares. Low-price stocks with low turnover are also likely to be expensive to trade in sufficient quantity. Low analyst coverage may also result in information being disseminated to markets more slowly than for stocks with high analyst coverage. As suggested by Campbell et al. (2008, pp. 2934–2935), "these limits to arbitrage help us to understand how the distress anomaly has persisted into the 21st century".
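The portfolio-sorting exercise behind the distress anomaly can be sketched in a few lines. The fragment below is an illustration only: the simulated data, the column names, and the use of simple decile bins are assumptions for exposition and do not reproduce the exact construction in Campbell et al. (2008), who sort on lagged fitted probabilities and rebalance through time.

import numpy as np
import pandas as pd

# Panel of firm-months with a fitted failure probability and the next month's excess return
# (toy data; in practice these come from the estimated hazard model and realized returns).
rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "failure_prob": rng.uniform(0, 0.05, 10_000),
    "excess_ret": rng.normal(0.005, 0.10, 10_000),
})

# Sort firm-months into deciles of predicted distress risk
panel["decile"] = pd.qcut(panel["failure_prob"], 10, labels=False) + 1

# Average subsequent excess return per decile: the anomaly appears as
# low (or negative) average returns in the highest-risk deciles.
print(panel.groupby("decile")["excess_ret"].mean())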

2.7.1  Competing Risk Models

Competing risk analysis is a type of survival analysis that aims to correctly estimate the marginal probability of an event in the presence of competing events. In the standard survival data discussed earlier, firms only experience one type of event, which is bankruptcy. However, in the real world, firms can potentially experience more than one type of distress event. For instance, firms could go into voluntary liquidation (a decision of the management), go into forced liquidation, or even merge or be taken over (such as in a distressed merger). When only one of these events can occur, they are called "competing events", in the sense that the occurrence of one type of event excludes the other events. In other words, they are mutually exclusive events. The probability of these events occurring is described as "competing risks", in the sense that the probability of each competing event is conditioned by the other competing events. As with standard survival analysis, the analytical objective for competing event data is to estimate the probability of one event among the many possible events over time, allowing firms to fail from competing events. In the previous example, we might want to estimate the corporate failure rate over time and to know whether corporate failure differs between two or more treatment groups, with or without adjustment for covariates. In standard survival analysis, these questions can be answered by using the Kaplan-Meier product limit method to obtain event probabilities over time, and the Cox proportional hazard model to predict such probabilities. Likewise, with competing event data, the typical approach involves the use of the Kaplan-Meier estimator to separately estimate the probability of each type of event, while treating the other competing events as censored in addition to those observations that are otherwise censored. A minimal illustration of this censoring logic is sketched below.
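The sketch below shows one way to set this up in Python with the lifelines package, anticipating the formal treatment in equations 2.9 and 2.10 below. The simulated data, the column names, and the event coding are assumptions for illustration; they are not drawn from any specific study. Each cause-specific model treats all other event types (and firms experiencing no event) as censored.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 500

# Hypothetical firm-level data (all names and distributions are assumptions)
df = pd.DataFrame({
    "years": rng.exponential(5.0, n).round(2) + 0.1,   # observed follow-up time
    "event_type": rng.choice(
        ["none", "forced_liquidation", "distressed_merger", "voluntary_administration"],
        size=n, p=[0.7, 0.1, 0.1, 0.1]),
    "leverage": rng.uniform(0.1, 1.2, n),
    "roa": rng.normal(0.02, 0.08, n),
})

def cause_specific_cox(data, event_of_interest):
    """Fit a cause-specific Cox model: the event of interest counts as a failure,
    while all other event types (and firms with no event) are treated as censored."""
    d = data.copy()
    d["event"] = (d["event_type"] == event_of_interest).astype(int)
    cph = CoxPHFitter()
    cph.fit(d[["years", "event", "leverage", "roa"]],
            duration_col="years", event_col="event")
    return cph

model_liq = cause_specific_cox(df, "forced_liquidation")      # forced liquidation hazard
model_merger = cause_specific_cox(df, "distressed_merger")    # distressed merger hazard
model_liq.print_summary()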


The standard approach for analysing competing risk data is to use the Cox proportional hazard model described earlier to separately estimate hazards and corresponding hazard ratios for each failure type, treating the other competing failure types as censored in addition to the other censoring issues described earlier. For instance, in our bankruptcy example, when forced liquidation is the event of interest, distressed mergers and voluntary administrations would be treated as censored. Hence, we can estimate the cause-specific hazard for forced liquidation and then fit a cause-specific hazard model on forced liquidation. The same procedure can be applied to distressed mergers when they are the event of interest. Using the cause-specific hazard function discussed previously, we can adjust the formula to take competing events into consideration as follows:

h_c(t) = \lim_{\Delta t \to 0} \frac{P(t \le T_c < t + \Delta t \mid T_c \ge t)}{\Delta t},  (2.9)

where the random variable T_c = time to failure from event c, c = 1, 2, . . ., C (the number of event types) (see Kleinbaum and Klein, 2012, p. 434). Thus h_c(t) gives the instantaneous failure rate at time t for event type c, given that the firm has not failed from event c by time t. Using the Cox proportional hazard model with predictors X = (X_1, X_2, . . ., X_p), the cause-specific hazard model for event type c has the following form:

h_c(t, X) = h_{0c}(t) \exp\left( \sum_{i=1}^{p} \beta_{ic} X_i \right), \quad c = 1, \ldots, C.  (2.10)

Note that \beta_{ic}, the regression coefficient for the ith predictor, is subscripted by c to indicate that the predictors may be different for different event types (see Kleinbaum and Klein, 2012, p. 434). There are several examples of competing risk models that have been applied in the context of bankruptcies. One example of the approach is Hill et al. (1996). Using a sample taken over the period 1977–1987, this study used a competing risk model (what they called a dynamic or event history methodology) to analyse financially distressed firms that survived and those that went bankrupt. As pointed out by Hill et al. (1996), because most distressed firms do not go bankrupt, those that do could be valuable for understanding the corporate failure process. Dynamic or competing risk models allow the calculation of the transition rate, or the conditional probability of a change in a firm's financial status over time. The benefit of their approach is common to all hazard models: their competing risk model captures the dynamics of change in the financial status of firms and allows for time-varying covariates and censored observations. As discussed earlier, this improves on prior research that uses static analysis or cross-sectional models that do not adequately capture the time-to-event dimension in corporate failure models. The authors defined "financial distress" in terms of an entity having three years of


cumulative negative earnings over any period in the sample period. Bankruptcies that are "sudden" were excluded from the sample because they are not likely to be preceded by any type of financial distress. As the authors pointed out, "sudden" bankruptcies might occur as the result of frauds (such as Enron or WorldCom) or possibly through the strategic use of Chapter 11 provisions (see Delaney, 1999). The authors' final sample comprised 75 bankrupt firms, 1311 financially distressed firms, and 1443 stable firms. Notwithstanding limitations in their methodology, their dynamic model proved useful for identifying a number of statistically significant variables, such as liquidity, profitability, leverage, size, qualified audit opinions, and economic factors such as the unemployment rate and prime rates. Duffie et al. (2007) also used a competing-risk discrete-time hazard model that explicitly considered the mean-reverting time dynamics of the covariates. Based on a sample period of 1980 to 2004 that included 1171 bankruptcies and defaults, they found that a model based on the firm-specific components of the distance to default measure and stock returns, as well as the macro-economic covariates of 3-month Treasury bill rates and the trailing 1-year return on the S&P 500, outperformed previous market- and accounting-based models in terms of out-of-sample predictive ability. The strong performance of their model led to their conclusion that market-based variables are the most informative for corporate failure prediction (see Peat and Jones, 2012). Other examples of competing risk models include Harhoff et al. (1998), who applied the approach to a sample of German firms to examine two modes of exit (voluntary liquidation and bankruptcy). Wheelock and Wilson (2000) also utilized a competing risks model on a sample of US banks to identify the characteristics that differentiated those more likely to fail or be acquired. Wheelock and Wilson (2000) assumed that the causal processes for acquisitions and failures were mutually exclusive because an underlying assumption of the competing risk model is that competing events are independent. In other words, when estimating the failure hazard, acquisitions are treated as censored at their dates of acquisition. When estimating the acquisition hazard, failures are treated as censored at their dates of failure. However, as noted by Yu (2006) in his study of Japanese banking institutions, this independence assumption can be questioned. For example, if a bank exhibits poor performance and foresees itself as having a high risk of failure, it might seek out a merger opportunity to avoid bankruptcy. This can induce a positive correlation between the bankruptcy and merger processes, and as a result the competing risk model can produce biased estimates. To address these limitations, Yu (2006) specified a semi-parametric identification of the dependent competing risks model with time-varying covariates. However, one issue to consider is that it can never be explicitly proven whether competing risks are independent or not (see Kleinbaum and Klein, 2012, p. 439). In the review of the literature above, I have focused mainly on statistical learning approaches that have used a variety of modelling techniques. In Chapter 3, I will


examine a variety of machine learning techniques that will cover first-generation methods such as neural networks and recursive partitioning (such as classification and regression trees) followed by a discussion and analysis of more modern machine learning methods such as gradient boosting machines, random forests, AdaBoost, and deep learning. The next section briefly discusses theoretically derived corporate failure models.

3.  Theoretically Derived Corporate Failure Models

As discussed previously, some of the statistical learning models, such as Hillegeist et al. (2004) and Campbell et al. (2008), tested theoretically derived failure predictors, one of the most widely used being the KMV distance-to-default measure (Kealhofer, 2003a). This section provides a brief overview of these theoretical models, highlighting their strengths and limitations. Scott (1981) outlined one of the simplest theoretical corporate failure models, which is based on a firm that lasts for only two periods. The model assumes that the firm's shares are traded in the current period and that the firm will be liquidated next period. The firm goes bankrupt if its liquidation value is less than the amount it owes its creditors. Formally, if V_1 is a random variable representing the end-of-period value of a hypothetical firm and if D_1 represents the amount owed to its creditors, the firm will go bankrupt under the following condition:

V_1 < D_1.  (2.11)

Assume V_1 has a two-parameter probability distribution with location parameter \mu_v and scale parameter \sigma_v. Then, standardizing both sides of this inequality, corporate failure corresponds to the condition in equation 2.12:

\frac{V_1 - \mu_v}{\sigma_v} < \frac{D_1 - \mu_v}{\sigma_v}.  (2.12)

As shown in Scott (1981), if F[\cdot] represents the cumulative distribution function for (V_1 - \mu_v)/\sigma_v, then F[(D_1 - \mu_v)/\sigma_v] equals the probability of corporate failure. For example, if

for ABC Corporation the natural logarithm of V_1 were normally distributed with a mean of 12 and a standard deviation of 7, and if the natural logarithm of D_1 were 3, then ABC's probability of corporate failure would equal 0.09. However, if for ABC Corporation the natural logarithm of V_1 were normally distributed with a mean of 12 and a standard deviation of 7, and if the natural logarithm of D_1 were 10, then ABC's probability of corporate failure would be much higher at 0.39. Increasing the standard deviation and/or D_1, or reducing the mean value, increases the probability of corporate failure.
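Scott's single-period probability is just a normal CDF evaluation once the distributional assumption is made. The short sketch below evaluates the ABC Corporation example under the stated lognormal assumption; the computed values come out close to the figures quoted above, with small differences attributable to rounding.

from math import sqrt, erf

def scott_failure_prob(mu_ln_v, sigma_ln_v, ln_d1):
    """P(V1 < D1) when ln(V1) ~ Normal(mu_ln_v, sigma_ln_v) and ln(D1) = ln_d1."""
    z = (ln_d1 - mu_ln_v) / sigma_ln_v
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF

print(scott_failure_prob(12, 7, 3))    # roughly 0.10 (the text reports 0.09)
print(scott_failure_prob(12, 7, 10))   # roughly 0.39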


In the Black-Scholes framework, the market value of a firm's equity is viewed as an option that will be valuable at the time the firm's debt matures only if the debt can be paid in full. The model assumes that V_1, the market value of the firm, follows a diffusion-type stochastic process with known parameters. Black and Scholes (1973) and Merton (1974) assumed that the firm's debt consists of a single pure discount bond. Scott (1981, p. 327) noted that for a firm with several debt issues that must pay interest and principal on different dates, no closed-form solution to the option pricing formula has been found. Nevertheless, Schwartz (1977) has shown that numerical methods can be used to obtain a solution. Thus, in principle, the option pricing methodology could be used to determine bankruptcy probabilities. This methodology formed the conceptual basis of the KMV model discussed later. Some of the earliest corporate failure models are theoretically derived models. One of the most well known is Wilcox (1971), who developed a corporate failure model explicitly based on theory. Wilcox (1971, 1973, 1976), Santomero and Vinso (1977), and Vinso (1979) based failure prediction models on the gambler's ruin model of probability theory. Gambler's ruin models assume that the firm has a given amount of capital, K, and that changes in K are random. Positive changes in K result from positive cash flows from the firm's operations. Losses require the firm to liquidate assets. When K becomes negative, the firm is declared bankrupt. As noted by Scott (1981), this model implicitly assumes the firm is completely cut off from the securities market. That is, the firm must fund its losses by selling assets and can sell neither debt nor equity.

3.1  Gambler's Ruin Models

Wilcox (1971, p. 389) critiqued the empirical studies of Altman (1968), noting the absence of theory driving the selection of different ratio measures. The purpose of Wilcox's (1971, p. 390) study "is to develop a theoretical model which better explains Beaver's results and which gives rise to hypotheses which could be even better predictors". Wilcox (1971) rather elegantly used a theoretical model to derive the primary Beaver (1966) ratio (cash flow to debt). Wilcox (1971) first specified a one-dimensional random walk that has an absorbing barrier at one end and no barrier at the other, which is the classic gambler's ruin model (i.e., a Markov process). Suppose there exists a firm of wealth C, which every year plays a game that nets it a gain or loss of constant size \Delta, where the probability of a gain equals p and of a loss, q. Suppose p > q; then the probability of this firm's ultimate corporate failure is given by equation 2.13:

P(\text{ultimate failure}) = \left( \frac{q}{p} \right)^{C/\Delta},  (2.13)

where C/\Delta is the number of losses z the firm can take in a row before being ruined. Wilcox (1971) then proposed how to find realistic measures for q/p, C, and \Delta. Wilcox (1971) defined the term C as a function of assets and liabilities (assets minus liabilities). Wilcox (1971) acknowledged that C might need to be interpreted differently depending on the availability of loans in any situation. For example, C might be measured as total assets minus total liabilities for large firms in times of readily available credit, but as current assets minus current liabilities, or even just cash resources, in the case of less established firms or where credit conditions are much tighter. Wilcox (1971) defined q/p in terms of a drift rate of the firm's wealth C. In a random walk model, the average drift rate per time period is (p - q)\Delta. Wilcox proposed a real-world measure of drift defined by \rho A \alpha \beta, where A is the total assets employed, \rho is the average return on total assets per time period, (1 - \alpha) is the dividend pay-out rate, and (1 - \beta) represents the average fraction of net cash flow after dividends reinvested in illiquid capital expenditures. By setting the two drift rates equal (i.e., (p - q)\Delta = \rho A \alpha \beta), and because p + q = 1, Wilcox (1971) derived equation 2.14 as follows:

\frac{q}{p} = \frac{1 - \rho A \alpha \beta / \Delta}{1 + \rho A \alpha \beta / \Delta}.  (2.14)

If \Delta is measurable by \hat{\sigma} (the estimated standard deviation of the firm's net cash flow less capital expenditures on illiquid assets less dividends), we can now estimate q/p in a real situation. By relabelling \rho A \alpha \beta / \hat{\sigma} as x and C / \hat{\sigma} as y, the estimate \hat{P}(\text{ultimate failure}) of P is the statistic (in the case of the very high risk region):

\left( \frac{1 - x}{1 + x} \right)^{y} \approx 1 - 2xy.  (2.15)

After a few more rearrangements, the formula reduces to:

P(\text{ultimate failure}) \approx 1 - \frac{2 \rho A \alpha \beta C}{\hat{\sigma}^2}.  (2.16)

Wilcox (1971) used average net income after taxes as an estimator for \rho A. His proposed hypothetical ratio for discriminating between very high and lower risk firms would simply be xy, where

x = [avg. net income × (1 − dividend payout ratio) × (1 − avg. proportion of net cash flow less dividends reinvested in illiquid assets)] / [std dev of net cash flow less capital expenditures for illiquid assets less dividends]

and

y = (y_1 · assets − y_2 · liabilities) / [std dev of net cash flow less capital expenditures for illiquid assets less dividends],


where y_1 and y_2 are vectors of weights depending on the convertibility of different classes of assets to wealth and of liabilities to negative wealth (see Wilcox, 1971, p. 391). Ignoring the information content of the various components yields the ratios found by Beaver (1966) to have predictive value. There are two major components: the first is x, a measure of the ratio of the observed drift rate to the standard deviation of that drift rate; the other is y, a measure of the ratio of the liquid wealth of the firm to what in some sense is the modal magnitude of setbacks in the drift. The latter term Wilcox somewhat crudely measured as the standard deviation of the drift rate. This last point relied on the standard deviation of the drift rate being large in comparison with the mean drift rate. If we ignore the information in x and ignore the differences between firms in the relative variability of net cash flow less dividends and less capital expenditures on illiquid assets, we get the discriminant statistic

y_2 · liabilities / net cash flow,  (2.17)

which is very similar to the best performing ratio used in the Beaver (1966) study. Other ratios used by Beaver can also be linked to the underlying model. A theoretical model developed by Scott (1981), with different underlying assumptions, has a closer affinity to Altman's 1977 ZETA model. Scott (1981) provides a useful summary of the early theoretical models, as displayed in Table 2.2.

TABLE 2.2  Theoretically Derived Predictors of Failure

Model: Single period or Black-Scholes
Failure predictor (the lower the predictor, the higher the probability of failure): (\mu_v - D_1)/\sigma_v
Definitions: D_1 = next debt payment (principal and/or interest); \mu_v = expected market value of the firm (debt plus equity) at the next debt payment; \sigma_v = standard deviation of firm value at the next debt payment.

Model: Gambler's ruin (no access to securities markets)
Failure predictor: (K + \mu_x)/\sigma_x
Definitions: K = stockholders' equity (book value); \mu_x, \sigma_x = mean and standard deviation of next period's change in retained earnings.

Model: Perfect access
Failure predictor: (S + \mu_x)/\sigma_x
Definitions: S = market value of equity; \mu_x, \sigma_x = mean and standard deviation of next period's net income.

Model: Imperfect access
Failure predictor: (\Delta K^* + S/(1 + c) + \mu_x)/\sigma_x
Definitions: \mu_x, \sigma_x, and S as defined above; \Delta K^* = optimal change in stockholders' equity, given that the firm is faced with earnings losses; c = proportional flotation costs.

Source: Scott (1981, p. 336)


3.1.1  Performance of Gambler's Ruin Models

While the gambler's ruin model of Wilcox has theoretical appeal, Scott (1981, p. 323) noted that the empirical performance of the model was disappointing. The attempts to apply this model have been disappointing, perhaps because the version of the theory used is too simple, assuming, as it does, that cash flow results from a series of independent trials, without the benefit of any intervening management action. Although the theory specified a functional form for the probability of ultimate ruin, Wilcox found that this probability was not meaningful empirically. It could not be calculated for over half of his sample because the data violated the assumptions of the theory: firms that were supposed to be bankrupt were actually solvent. A review by Kinney (1973) concluded that the Wilcox model does not outperform the simple univariate approach of Beaver (1966). Kinney (1973, p. 187) observed that the Wilcox results do not improve on Beaver's model, which is based on less data and computation, and concluded: "Thus, the addition of the variance of CF, at least in the way Wilcox suggests, does not seem to add much to the predictive ability of Beaver's CF/TD ratio, whatever the reason". Faced with this problem, Wilcox (1976) discarded the functional form suggested by the theory and used the variables it suggested to construct a prediction model. But, as Scott (1981) pointed out, it is hard to assess the resulting model since he did not test it on a holdout sample. Santomero and Vinso (1977) provided an empirical application of a gambler's ruin-type model using bank data. However, they provided no test of the model. Their approach is more complex than that of Wilcox, in that they estimated the probability of failure for each bank at that future point in time when its probability of failure would be at a maximum. The riskiest bank in their sample had a probability of failure as low as 0.0000003. Considering the long history of bank failures, such probabilities seem just too low to be plausible. In a further paper, Vinso (1979) used a version of the gambler's ruin model to estimate default probabilities. Unfortunately, no rigorous empirical tests of the model were provided in the study. The models outlined in Scott (1981) foreshadowed many new approaches to default prediction based on the options pricing model. The most commercially successful of these models is the KMV approach, based on the distance-to-default metric.

3.2  KMV Distance to Default Approach

As outlined in Kealhofer (2003a, 2003b), the Black-Scholes and Merton (1974) approach can be illustrated in a simplified case. Let us assume a company has a single asset consisting of 1 million shares of Microsoft stock. Assume further that it has a single fixed liability of a one-year discount note with a par amount of 100 million and is otherwise funded by equity. In a year's time, the market value of the


company’s business will either be sufficient to pay off the note or it will not, in which case the company will default. One can observe that the equity of the company is logically equivalent to 1 million call options on Microsoft stock, each with an exercise price of 100 and a maturity of one year (see Kealhofer, 2003a, p. 30). The implication of this illustration is that the equity of a company works like a call option on the company’s underlying assets. Neither the underlying value of the firm nor its volatility can be directly observed. Under the model’s assumptions, both can be inferred from the value of equity and the volatility of equity. The value of the equity thus depends on, among other things, the market value of the company’s assets, their volatility, and the payment terms of the liabilities. Implicit in the value of the option is a measure of the probability of the option being exercised; for equity, it is the probability of not defaulting on the company’s liability. This is illustrated in Figure 2.3. The horizontal axis represents time, beginning with the current period (“today”) and looking into the future. The vertical axis depicts the market value of the company’s assets. As of the current period, the assets have a single, determinable value, as shown on the vertical axis, but one year from now, a range of asset values is possible, and their frequency distribution (shown in Figure 2.3 on its side) gives the likelihood of various asset values one year in the future. The most likely outcomes

FIGURE 2.3  Moody's KMV Model Illustration: Frequency Distribution of Asset Value at Horizon and Probability of Default
Source: Kealhofer (2003a, p. 31)


are nearest to the starting value, with much larger or smaller values less likely. The mean is shown by the dashed line. The likelihood of extreme outcomes depends on the volatility of the assets; that is, the more volatile the assets, the greater the probability of extreme outcomes. The dotted horizontal line shows the par amount of the liability due in one year. If the company's asset value in one year is less than the amount of the liability, the company will default. Note that this decision is an economic decision; the equity owners could put additional money into the company, but that decision would be irrational because the money would go to pay creditors. If the owners defaulted, they would not be required to put in additional money and they could use this money for their own benefit rather than giving it to the creditors. The probability of default is given by the area under the frequency distribution below the default point, which represents the likelihood of the market value of the company's assets in one year being less than what the company owes. The probability of default will increase if the company's market value of assets today decreases, if the amount of liabilities increases, or if the volatility of the assets' market value increases. These three variables are the main determinants of the company's default probability. According to Kealhofer (2003a), the KMV model has the following characteristics: (1) the company may have, in addition to common equity and possibly preferred stock, any number of debt and non-debt fixed liabilities; (2) the company may have warrants, convertible debt, and/or convertible preferred stock; (3) obligations may be short term, in which case they are treated as demandable by creditors, or long term, in which case they are treated as perpetuities; (4) any and all classes of liability, including equity, may make fixed cash payouts; (5) if the market value of the company's assets falls below a certain value (the default point), the company will default on its obligations; this default point depends on the nature and extent of the company's fixed obligations; and (6) default is a company-wide event, not an obligation-specific event (p. 32). Given the asset characteristics (i.e., value and volatility) and given the company's default point, the KMV model can be used to calculate a measure of the company's default risk, which is the number of standard deviations to the default point. Distance to default, DD(h), or the number of standard deviations to the default point by horizon h, is an ordinal measure of the company's default risk. According to Kealhofer (2003a), it provides a simple and robust measure of default risk. Distance to default is essentially a volatility-corrected measure of leverage (Duffie et al., 2007, p. 639). Mathematically, distance to default is calculated as:

DD(h) = \frac{\ln A - \ln DPT + \left( \mu_A - \sigma_A^2/2 \right) h}{\sigma_A \, h^{1/2}},  (2.18)

where A = current market value of the company's assets, DPT = the company's default point (where the company typically defaults), \mu_A = expected market return on the assets per unit of time, and \sigma_A = volatility of the market value of the company's assets per unit of time.
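The calculation in equation 2.18 is straightforward once the asset value, asset volatility, and default point have been estimated (in practice A and \sigma_A must themselves be backed out from observed equity prices, which is the difficult part). The minimal sketch below simply evaluates the formula for illustrative inputs; all figures are assumptions for the example and are not KMV estimates.

import math

def distance_to_default(asset_value, default_point, mu_a, sigma_a, horizon=1.0):
    """Distance to default per equation 2.18: standard deviations between the
    expected (log) asset value at the horizon and the default point."""
    drift = (mu_a - 0.5 * sigma_a ** 2) * horizon
    return (math.log(asset_value) - math.log(default_point) + drift) / (sigma_a * math.sqrt(horizon))

# Illustrative firm: assets of 120, default point of 80, 8% expected asset return,
# 25% asset volatility, one-year horizon (all figures assumed).
dd = distance_to_default(120.0, 80.0, 0.08, 0.25, 1.0)
print(round(dd, 2))   # about 1.8 standard deviations from the default point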


With the use of the KMV default database, Kealhofer (2003a) reported that he could measure the empirical distribution with sufficient accuracy that the empirical probabilities could be substituted for the theoretical probabilities. This measurement relies on the distance to default being a "sufficient statistic" for default risk, so all the default data for companies with similar DDs can be pooled. In other words, the differences between individual companies are expected to be reflected in their asset values, their volatilities, and their capital structures, all of which are accounted for in their DDs. The estimation need not be performed on separate subsamples, for instance by industry or size. The result of this process is the KMV EDF™ (expected default frequency) credit measure. The EDF is the probability of default within a given time period.
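The pooling idea can be illustrated with a simple bucketing exercise: group historical firm-years by their DD, and use the realized default frequency in each bucket as the empirical default probability for that level of DD. The sketch below is purely illustrative (simulated data and arbitrary bucket edges); the commercial EDF mapping is estimated from KMV's proprietary default database.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000

# Simulated history of firm-years: a distance-to-default and a default indicator
# whose probability falls as DD rises (purely synthetic relationship).
dd = rng.normal(4.0, 2.0, n).clip(-2, 12)
defaulted = rng.random(n) < 1.0 / (1.0 + np.exp(1.5 * dd))

hist = pd.DataFrame({"dd": dd, "defaulted": defaulted})
hist["dd_bucket"] = pd.cut(hist["dd"], bins=np.arange(-2, 13, 1))

# Empirical default frequency by DD bucket: a DD-to-EDF style mapping
edf_table = hist.groupby("dd_bucket", observed=True)["defaulted"].mean()
print(edf_table)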

3.2.1  Limitations of the KMV Approach

While KMV has been a very successful model commercially, it does have some obvious drawbacks. Two fundamental problems with operationalizing the KMV model are: (1) misspecification, due to the restrictive assumptions of the model (e.g., a single class of zero coupon debt, all liabilities maturing in one year, costless bankruptcy, no safety covenants, default triggered only at maturity, and so on); and (2) measurement errors (e.g., the value and volatility of assets are unobservable) (Agarwal and Taffler, 2008, p. 1550). Another limitation of the KMV approach is that it cannot easily be applied to private companies because they have no traded equity. It is difficult to construct theoretical EDFs without the assumption of normality of asset returns. Nor does the model distinguish among different types of long-term bonds according to their seniority, collateral, covenants, or convertibility. As a stock market measure, the KMV distance to default measure can potentially underestimate the number of defaulting companies during bull markets and overstate the number of potential defaults during market downturns. Furthermore, several empirical studies have not found that DD adds significantly to the explanatory value of failure models based on other accounting and financial data (see, e.g., Bharath and Shumway, 2004; Campbell et al., 2008). For instance, Bharath and Shumway (2004, p. 23) concluded "that the KMV-Merton probability is a marginally useful default forecaster, but it is not a sufficient statistic for default". However, Bharath and Shumway (2004) acknowledged that they did not use exactly the same distance-to-default model as the KMV Corporation, which might lead to better predictive results (cf. Kealhofer, 2003a). In the next chapter, we explore the application of modern machine learning methods to corporate failure prediction.

Key Points From Chapter 2

The distress risk and corporate failure literature is very extensive. However, it can be divided into two broad streams of research. First, statistical learning approaches essentially use statistical models to predict corporate failure based on observed data and a set of features or explanatory variables. The second stream involves theoretically derived


models such as the early gambler's ruin models and, more recently, the KMV distance to default measures. The terms distress risk and corporate failure models are used to differentiate the variety of studies that have been published in this literature. Some studies have developed forecasting models using a strict legal definition of failure (such as entering Chapter 7 or Chapter 11 under the US bankruptcy code). Other studies have used a broader definition of failure, which has included a wide range of possible distress events such as loan default; issuing shares to raise working capital; receiving a qualified audit opinion based on going concern; having to renegotiate the terms and conditions of a loan obligation; financial reorganization where debt is forgiven or converted to equity; failure to pay a preference dividend (or cutting ordinary dividends); or a bond ratings downgrade. Other studies have used a multi-class approach with a variety of distress events and legal concepts of bankruptcy as the dependent variable of interest. Many corporate failure studies appear to report good predictive success rates, although classification accuracy varies across studies (for instance, Altman et al., 1977 reported much better prediction success than Ohlson, 1980, which might point to methodological differences between these studies). However, conventional modelling approaches such as LDA and standard logit generally do not perform exceptionally well on Type I errors (predicting a firm to be safe when it goes bankrupt) and Type II errors (predicting a firm to go bankrupt when it is safe). There are numerous methodological issues in early corporate failure models that limit the usefulness and generalizability of empirical results. For instance, many studies did not use test or holdout samples. Because of small sample sizes, some studies have relied on resampling techniques (such as the "jack-knife" technique). Matched pair designs can also lead to biased parameter estimates. Over the decades, a wide range of modelling techniques have emerged, including simple linear models such as LDA, logit, and probit models. However, the literature has moved on to more sophisticated discrete choice models (such as mixed logit and nested logit models) and hazard models. From the earlier corporate failure literature, the Altman (1968) and Ohlson (1980) models are probably the most widely known. The logit model became more favoured because it has less restrictive assumptions and more pragmatic appeal, as the outputs of such models are probabilities rather than cut-off scores. Hazard models are now widely used in the corporate failure literature. Hazard models have many advantages over "static" cross-sectional models such as logit and probit models, which are generally considered unsuitable for panel data. The Cox proportional hazard model is widely used because it can explicitly model time to event and accommodate censored observations. The extended Cox model can accommodate time-varying covariates. Not only do hazard models predict well, but they also have many appealing statistical and conceptual properties. There is a long-standing debate in the literature about the predictive value of different explanatory variables, particularly accounting vs market price variables. Some studies show that market price variables are highly predictive (particularly more sophisticated measures such as Moody's KMV distance-to-default measure); however, the consensus


view seems to be that using a combination of accounting and market-based variables produces better prediction models. There is good evidence that including asset volatility measures in corporate failure models can improve explanatory and predictive power. Among the many financial ratios used in failure forecasting, measures based on cash flows, leverage, and working capital appear to have the strongest predictive power overall across many studies. There is now more extensive research into multi-class models such as mixed logit models, nested logit models, and competing risk models (which are a special type of hazard model). Early theoretical models, such as the gambler's ruin, are conceptually appealing, but their empirical performance has been disappointing. More modern theoretically derived models, such as Moody's KMV approach, have fared much better, although several studies show that the KMV distance-to-default measure is not a sufficient statistic for predicting default.

Notes

1 See https://usbankruptcycode.org.
2 J.P. FitzPatrick, 1932, A comparison of the ratios of successful industrial enterprises with those of failed companies. The Certified Public Accountant (in three issues: October 1932, pp. 598–605; November 1932, pp. 656–662; December 1932, pp. 727–731).
3 In particular, Jones and Hensher (2004) used operating cash flows under the direct method and found this variable to have strong predictive power.
4 For instance, Edmister (1970) used a "stepwise" discriminant analysis on a sample of small businesses and concluded that a multivariate approach is better than a univariate approach. See R. O. Edmister, 1970, Financial ratios as discriminant predictors of small business failure (Ph.D. diss.), Ohio State University.
5 Another example of the univariate approach is Pinches et al. (1975).
6 According to Scott (1981), at least 30 financial institutions were using the Zeta model at that time.
7 Kane et al. (1996) also demonstrated that accounting-based statistical models used to predict corporate failure are sensitive to the occurrence of a recession. Moreover, after controlling for the intertemporally unconditioned "stressed" and "unstressed" types of corporate failure, the authors find that models conditioned on the occurrence of a recession still add incremental explanatory power in predicting the likelihood of corporate failure.
8 Initially this was based on the research of Giroux and Wiggins (1984) and DeAngelo and DeAngelo (1990). Giroux and Wiggins (1984) found that common financial distress events occurring prior to bankruptcy were: dividend reduction/elimination, debt accommodation, and loan principal/interest default. In fact, all bankrupt firms in their sample either had a debt accommodation, loan default, or both, prior to bankruptcy, with 70% of bankrupt firms negotiating debt accommodations and 50% defaulting. Dividend reductions tended to occur before debt accommodation and loan principal/interest default. However, distinguishing the ordering of a default or accommodation is quite difficult. DeAngelo and DeAngelo investigated the dividend policy adjustments of 80 firms that had experienced at least three annual losses during 1980–1985. The authors found that almost all the firms aggressively reduced their cash dividend payments in response to financial trouble.
9 A useful review of other studies using multi-class bankruptcy methods is provided in Chancharat et al. (2010).


10 The moments of an individual firm's coefficient cannot be observed from a single data point, but are rather estimated by assuming a distribution for the coefficients of any particular attribute across all firms in the sample (see Train, 2003, pp. 262–263).
11 A fixed parameter essentially treats the standard deviation as zero such that all the behavioural information is captured by the mean. As noted by Jones and Hensher (2004), standard logit models assume the population of firms is homogeneous across attributes with respect to domain outcomes (i.e., levels of financial distress). For instance, the parameter for a financial ratio such as total debt to total equity is calculated from the sample of all firms (thus it is an average firm effect) and does not represent the parameter of an individual firm.
12 As acknowledged by Hillegeist et al. (2004), different bankruptcy probabilities can be generated from different approaches. For instance, they note that other studies, including Barth et al. (1998), Billings (1999), and Dhaliwal and Reynolds (1994), have used bond ratings to proxy for the probability of bankruptcy. Bond ratings incorporate elements of public information and private information conveyed to the rating agencies by firms and could improve the accuracy of bankruptcy forecasts as a result.
13 Moody's acquired the KMV model in 2002. www.moodysanalytics.com/about-us/history/kmv-history.
14 S.T. Bharath and T. Shumway, 2004, Forecasting default with the KMV–Merton model. Unpublished paper, University of Michigan 1001, 48109.

3 THE RISE OF THE MACHINES

1. Introduction

In this chapter, we discuss the rise of machine learning methods in corporate distress and corporate failure prediction. We begin with a discussion of first-generation machine learning methods that were popular during the 1980s and 1990s, such as neural networks and recursive partitioning methods. We then proceed to a discussion of more modern methods, such as gradient boosting, random forests, AdaBoost, deep learning, and other techniques. As discussed in Chapter 2, while corporate failure prediction techniques have evolved, much of the literature relies on conventional classifiers such as standard logit/probit models and linear discriminant analysis (LDA) (see, e.g., Altman et al., 1977; Iskandar-Datta and Emery, 1994; Blume et al., 1998; Duffie and Singleton, 2003; Altman and Rijken, 2004; Amato and Furfine, 2004; Nickell et al., 2000; Jorion et al., 2009). Among the conventional classifiers, the logit model appears to be the dominant classifier in the distress risk and corporate failure forecasting literatures. To a lesser extent, statistical learning techniques, including neural networks and recursive partitioning methods, have been used (see Duffie and Singleton, 2003). The Jones et al. (2015) review of 150 empirical studies concluded that the logit model appears (either as the primary classifier or as a comparator model) in 27% of studies, followed by (in percentage of studies): LDA (15.1%), neural networks (14.6%), SVMs (6.8%), probit models (6.3%), and recursive partitioning (3.6%); with the remainder of studies using an assortment of different modelling approaches, including rough sets, hazard/duration models, genetic algorithms, ensemble techniques, unsupervised learning models, and other methods. However, these frequencies do not provide a sense of which methods are starting to become more dominant in the distress and corporate failure prediction literature. While we will not be covering


all possible modelling techniques described earlier, we will be paying closer attention to the development of modern machine learning methods that have come into prominence over the past 5–10 years or so.

3.1  Neural Networks and Recursive Partitioning

A neural network is a two-stage model that is typically represented in the form of a network diagram. Figure 3.1 represents the most common neural network structure, often called a single hidden layer back-propagation network. This particular example has two inputs, one hidden layer, and two output classes. As described in the Appendix to this book, for a typical single hidden layer binary neural network classifier, there are inputs (X), one hidden layer (Z), and two output classes (Y). Derived features Z_m are created from linear combinations of the inputs, and then the target Y_k is modelled as a function of linear combinations of the Z_m, as follows:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M.  (3.1)

T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K.  (3.2)

f_k(X) = g_k(T), \quad k = 1, \ldots, K.  (3.3)

where Z = (Z_1, Z_2, Z_3, \ldots, Z_M), and T = (T_1, T_2, T_3, \ldots, T_K).  (3.4)

The activation function \sigma(v) is typically the sigmoid \sigma(v) = \frac{1}{1 + e^{-v}}. The output function g_k(T) allows a final transformation of the vector of outputs T. For K-class classification, the identity function g_k(T) is estimated using the softmax function:

g_k(T) = \frac{e^{T_k}}{\sum_{\ell=1}^{K} e^{T_\ell}}.  (3.5)

FIGURE 3.1  Single Hidden Layer Neural Network Structure

1

Further details are provided in the Appendix. Initially, neural network models were seen to hold great promise for corporate failure research. However, the performance of neural networks has been quite mixed. For instance, O’Leary (1998) provided a meta-analysis of 15 studies that have used some type of neural network model to predict corporate failure. He compared 15 studies in terms of “what works and what doesn’t work” with these models. The studies were compared on several methodological factors, including the impact of using different frequencies of bankrupt firms, different software used for model estimation, alternative input variables, the number of nodes in the hidden layer, the output variables, training and test sampling, differences in the analysis methodology, and classification success. O’Leary (1998) concluded that neural networks, when applied to corporate failure forecasting, have produced results that are at least as good as those results produced by LDA, logit, probit, and ID3 (a classification tree approach). However, O’Leary (1998) qualified his findings that in some settings other approaches seemed to have performed quite well relative to neural networks. O’Leary (1998) concluded that a number of characteristics in the formulation of neural network models influenced the quality of empirical findings across studies. For instance, the training proportion of bankrupt firms in the data influenced the quality of the results in the training and testing of these models, resulting in lack of “upward” generalization. Further, deviation from a single hidden layer tended to have an adverse impact on the relative quality of neural network models. Finally, time seems to negatively influence the relative quality of the neural network models. Coats and Fant (1993) compared neural networks with conventional approaches such as LDA and found that neural networks were more effective for pattern classification as well as generating more reliable type I error rates (although LDA performed better on type II error rates). However, many studies have produced largely inconclusive evidence about the performance of neural networks relative to LDA and logistic regression (see literature review by Altman and Saunders, 1997). For instance, Altman et al. (1994) compared neural networks and conventional methods based on a large sample of Italian corporate data. As noted by the authors, the advantages of the neural network approach are as follows (p. 514): Neural networks do not require the pre-specification of a functional form, nor the adoption of restrictive assumptions about the characteristics of statistical distributions of the variables and errors of the model. Moreover, by their nature, NN make it possible to work with imprecise variables and with changes of the models over time, thus being able to adapt gradually to the appearance of new cases representing changes in the situation.

The Rise of the Machines  79

However, the limitation of the neural network approach is the “black box” effect where input variables have limited interpretability. There is no way of understanding how the variables of interest are being used within the network connections. The main conclusions from the Altman et al. (1994, p. 507) study included: (1) neural networks are able to approximate the numeric values of the scores generated by the discriminant functions, even with a different set of business indicators from the set used by the discriminant functions; (2) neural networks are able to accurately classify groups of businesses as to their financial and operating health, with results that are very close to or, in some cases, even better than those of the discriminant analysis; (3) the use of integrated families of simple networks and ­networks with a “memory” has shown considerable power and flexibility. Their performance has almost always been superior to the performance of single networks with complex architecture; (4) the long processing time for completing the neural network training phase and the need to carry out a large number of tests to identify the neural network structure, as well as the trap of “overfitting”, can considerably limit the use of neural networks. The resulting weights inherent in the system are not transparent and are sensitive to structural changes; (5) the possibility of deriving an illogical network behaviour, in response to different variations of the input values, constitutes an important problem from a financial analysis point of view; (6) in the comparison with neural networks, discriminant analysis proves to be a very effective tool that has the significant advantage for the financial analyst of making the underlying economic and financial model transparent and easy to interpret; and (7) they recommend that the two systems be used in tandem. The main conclusion of this study is that neural networks are not a clearly dominant statistical learning technique compared to traditional statistical techniques, such as LDA. However, the authors conceded that “it is too early to say whether the use of experimental N.N. is simply a fad or it will result into something more permanent” (p. 512). In another study, Boritz and Kennedy (1995) examined the effectiveness of different neural networks, such as back-propagation and Optimal Estimation Theory, in predicting corporate bankruptcies. They compared neural networks against traditional corporate failure prediction techniques such as LDA, logit and probit. Their results showed that the level of Type I and Type II errors varies greatly across techniques. For instance, the Optimal Estimation Theory neural network had the lowest Type I error rate and the highest Type II error rate, while the traditional statistical techniques have the reverse relationship (i.e., high Type I error and low Type II error). The authors also concluded that “performance of the neural networks tested is sensitive to the choice of variables selected and that the networks cannot be relied upon to ‘sift through’ variables and focus on the most important variables” (p.  503). The significant variations across replications for some of the models indicate the sensitivity of the models to variations in the data. Peat and Jones (2012) added to current debates by investigating the performance of neural networks in the context of a forecasting combination methodology. A  neural network framework was used to combine forecasts from three


well-known corporate failure approaches: (1) the Altman (1968) model; (2) the Ohlson (1980) model; and (3) corporate failure scores developed from the distance to default measures generated from the Black–Scholes–Merton option pricing model (discussed in Chapter 2). The study modelled firm failure as a binary outcome of failure vs non-failure. The failed firm sample included the three major forms of failure proceedings available under the legislative provisions of the Australian Corporations Act (2001): (1) voluntary administration (first introduced in Australia in June 1993 under the Corporate Law Reform Act, 1992); (2) liquidation; and (3) receivership. Failed firms also included firms defaulting on loans or other legally enforceable contracts, such as payment terms with creditors. Non-failed firms included all firms not classified as failed but excluded firms that were (1) privatized over the sample period, (2) subject to a distressed merger or takeover, or (3) subject to a compulsory acquisition over the sample period.

Peat and Jones (2012) demonstrated that a neural network using the combination of failure forecasts constructed from the three forecasting approaches produced failure probability forecasts that significantly outperformed the forecasts generated using a regression-based approach alone. Peat and Jones (2012) also observed that the performance of the forecasts generated from the neural network model was improved by using the full sample containing the population proportions of bankrupt and non-bankrupt firms in training the networks and by scaling the input variables. Finally, the results of their study appeared to contradict some previous literature that found little evidence of neural network approaches outperforming simpler linear failure techniques, such as LDA.

However, the generally inconclusive evidence about the effectiveness of neural networks in the literature may have resulted in a decline in academic interest in these methods over the past 20 years. Neural networks are generally more time consuming to set up and much less interpretable than other approaches (i.e., the "black box" effect) (see the literature review by Altman and Saunders, 1997). If these models do not perform appreciably better than simpler, more interpretable models, are they really worth the trouble? While the application of neural networks to corporate failure prediction has been somewhat disappointing, deep learning methods based on more complex and richer neural network architectures appear to hold considerably more promise for future research (discussed further below).
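To make the discussion concrete, the following is a minimal sketch, not drawn from any of the studies cited above, of how a single-hidden-layer neural network of the kind discussed here might be fitted and evaluated on out-of-sample AUC. The data are synthetic, and the settings (one hidden layer of 16 nodes, standardized inputs) are illustrative assumptions only.

```python
# Fit a single-hidden-layer neural network to a synthetic failure sample
# and score it with the area under the ROC curve (AUC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a failed/non-failed sample of financial ratios
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A single hidden layer, consistent with the observation above that deeper
# architectures did not necessarily help early bankruptcy applications
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                 random_state=0))
nn.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, nn.predict_proba(X_test)[:, 1]))
```

Scaling the inputs (as Peat and Jones, 2012, found helpful) is done here via the standardization step in the pipeline.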

3.2  Classification and Regression Trees (CART)

Recursive partitioning methods such as classification and regression trees (CART) are also powerful first-generation machine learning methods. In fact, the antecedents of modern machine learning methods based on boosting developed from the CART method. However, despite the early popularity of CART (particularly in health diagnostics), the technique has been associated with a number of limitations, most notably high variance (i.e., the trees do not generalize well). More sophisticated


techniques began to develop that significantly improved on the performance of CART, starting with bagging (Breiman, 1996). Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction outcomes. While bagging is a major improvement on CART, more sophisticated ensemble methodologies, such as random forests, adaptive boosting (AdaBoost), and gradient boosting, subsequently developed. Random forests are essentially a refined form of bagging. The technique improves on bagging by "de-correlating" the trees, which maximises the reduction in variance (I discuss these models further in what follows).

As discussed in Peat (2008), the tree-based approach to classification proceeds through the simple mechanism of using one feature to split a set of observations into two subsets. The objective of the split is to create subsets that have a greater proportion of members from one of the groups than the original set. This objective is known as reducing the impurity of the set. The process of splitting continues until the subsets created consist only of members of one group or no split gives a better outcome than the last split performed. The features can be used once or multiple times in the tree construction process. Sets which cannot be split any further are known as terminal nodes. The graphical representation of a simple decision tree is provided in Figure 3.2.

Figure 3.2 shows a basic decision tree with two terminal nodes and one split. The model is based on the bankruptcy data described in Chapter 4. If the percentage of stock owned by the top five shareholders is greater than 9.25%, the model predicts class 1 membership (or non-failure). If the value of this variable is less than or equal

FIGURE 3.2  A Simple Decision Tree for Bankruptcy Prediction


to 9.25%, the model predicts class 0 membership (or failure). This simple decision tree has an out-of-sample AUC of .80, which is quite impressive for such a simple model.

As Peat (2008) pointed out, it is possible to proceed with the splitting process until each terminal node contains only one observation. Such a tree will correctly classify every member of the sample used in the construction of the tree, but it is likely to perform poorly in classifying another sample from the population. This is the problem of generalization: the trade-off between the accuracy of a classifier on the data used to construct it and its ability to correctly classify observations that were not used in its construction. Tree construction involves three steps: splitting a set into two, deciding when a set cannot be split further, and assigning the terminal sets to a class. To select the best binary split at any stage of tree construction, a measure of the impurity of a set is needed. The best possible split would result in the two subsets having all members from a single population group. The worst possible split results in two subsets, each consisting of equal numbers from each of the population groups. In non-separable cases, the subsets resulting from a split will contain members from each of the population groups.

Hastie et al. (2009) set out the formula for a classification tree. In a node $m$, representing a region $R_m$ with $N_m$ observations, let

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k), \qquad (3.6)$$

the proportion of class $k$ observations in node $m$. We classify the observations in node $m$ to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node $m$. Hastie et al. (2009, p. 309) pointed out that different measures $Q_m(T)$ of node impurity include the following:

Misclassification error: $\dfrac{1}{N_m} \sum_{i \in R_m} I\big(y_i \neq k(m)\big) = 1 - \hat{p}_{mk(m)}. \qquad (3.7)$

Gini index: $\sum_{k \neq k'} \hat{p}_{mk}\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}\big(1 - \hat{p}_{mk}\big). \qquad (3.8)$

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}. \qquad (3.9)$

Hastie et al. (2009, p. 309) state that all three measures are similar, but cross-entropy and the Gini index are differentiable and hence more amenable to numerical optimization. As noted by Hastie et al. (2009), the Gini index can be interpreted in two ways. Rather than classify observations to the majority class in the node, we could classify them to class $k$ with probability $\hat{p}_{mk}$. Then the training error rate of this rule in the node is $\sum_{k \neq k'} \hat{p}_{mk}\hat{p}_{mk'}$ – the Gini index. Similarly, if we code each observation as 1 for class $k$ and 0 otherwise, the variance over the node of this 0–1 response is $\hat{p}_{mk}(1 - \hat{p}_{mk})$. Summing over classes $k$ again gives the Gini index. I provide an empirical illustration of the CART methodology in Chapter 4.
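A single-split tree of the kind shown in Figure 3.2 can be reproduced in miniature with standard software. The sketch below is illustrative only: the feature names (top5_ownership, leverage), the data-generating rule, and the 10% cut-off are hypothetical stand-ins, not the Chapter 4 dataset.

```python
# A depth-one CART "stump": one Gini-based split on one feature,
# mimicking the structure (not the data) of Figure 3.2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
top5_ownership = rng.uniform(0, 60, n)   # % of stock held by top five holders
leverage = rng.uniform(0, 1, n)          # a second, weaker feature
# Hypothetical rule: failure (class 0) is more likely when ownership
# concentration is low; noise is added so the split is imperfect.
failure = (top5_ownership < 10) & (rng.random(n) < 0.8)
y = (~failure).astype(int)               # 1 = non-failure, 0 = failure
X = np.column_stack([top5_ownership, leverage])

stump = DecisionTreeClassifier(max_depth=1, criterion="gini", random_state=0)
stump.fit(X, y)
print(export_text(stump, feature_names=["top5_ownership", "leverage"]))
print("In-sample AUC:", roc_auc_score(y, stump.predict_proba(X)[:, 1]))
```

Allowing the tree to grow without a depth limit would drive training error towards zero while worsening out-of-sample performance, which is the generalization problem described above.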


3.3  Advanced Machine Learning Techniques

Despite some initial promising developments in neural networks and CART, the literature in corporate failure prediction has not kept pace with developments in the theoretical statistics literature. Empirical evidence from other fields of application suggests that more recent modelling approaches (such as gradient boosting models, adaptive boosting (or AdaBoost), random forests, and deep learning) can significantly outperform conventional classifiers such as logit/probit and LDA, and also standard neural network and CART approaches (see Hastie et al., 2009; Schapire and Freund, 2012). Similarly, there has been relatively little attention in the corporate failure prediction field devoted to evaluating the empirical performance and theoretical merits of alternative predictive models. Studies have compared the performance of conventional classifiers, particularly logit, probit, and LDA (see, e.g., Lennox, 1999; Jones and Hensher, 2004; Greene, 2008), but there has not been equal scrutiny of more recent predictive approaches, apart from some evaluation of neural networks and SVMs (see the Appendix for a discussion of SVM models). While neural networks, CART, and SVMs remain established statistical learning techniques in the literature, these classifiers have been superseded by arguably more powerful techniques, particularly gradient boosting, adaptive boosting, random forests, and deep learning. The potential of these classifiers was not extensively explored in the distress and corporate failure prediction and related literatures until quite recently. Of the few studies that have examined these machine learning techniques, the early results seem very promising. For example, based on a corporate failure sample of 1365 private firms, Cortés et al. (2007) find that the gradient boosting model significantly improved test sample predictive accuracy by up to 28% (see also Kim and Kang, 2010). See Table 3.1 for a review of some recent literature using gradient boosting techniques and other machine learning methods.

TABLE 3.1  Applications of Boosting Models to Bankruptcy Prediction Research

Cortés et al. (2007), Applied Intelligence, 27: 29–37. Sample: 2730 private firms drawn from BvD's SABI database over the period 2000–2003. Definition of distress: bankruptcy, temporary receivership, and acquired and dissolved firms. Inputs: 18 predictors, including 14 financial variables/ratios. Boosting model used: AdaBoost vs a single tree. Architecture/software: 100 trees; R software. Findings: AdaBoost outperforms the single tree, with an overall test error rate of 6.569%.

Sun et al. (2011), Expert Systems With Applications, 38: 9305–9312. Sample: 692 Chinese listed (public) firms sampled over the period 2000–2008, taken from the Shenzhen and Shanghai Stock Exchanges. Definition of distress: negative net profit in two consecutive years, or net capital lower than face value. Inputs: 41 financial variables/ratios. Boosting model used: AdaBoost with single attribute test (SAT) vs a single tree and SVMs. Architecture/software: limited discussion; 30-fold holdout tests. Findings: the AdaBoost ensemble with SAT outperforms all other models, with test errors of 2.78% (1 year from failure), 12.81% (2 years from failure), and 27.51% (3 years from failure).

Kim and Kang (2010), Expert Systems With Applications, 37: 3373–3379. Sample: 1458 externally audited manufacturing firms (private firms obtained from a Korean commercial bank), half of which went bankrupt over the period 2002–2005. Definition of distress: bankruptcy (presumably legal bankruptcy). Inputs: 32 financial variables/ratios. Boosting model used: boosted NNs vs bagged NNs and NNs. Architecture/software: not stated or limited details provided. Findings: boosting and bagging improve the performance of NNs; AUCs around .75, with the best performance appearing to come from bagged NNs.

Wang et al. (2014), Expert Systems With Applications, 41: 2353–2361. Sample: two small datasets of 240 firms and 132 firms; public/private status not clear from the study (refers to previous studies using the data). Definition of distress: generally not well described and drawn from previous studies. Inputs: up to 30 financial variables/ratios. Boosting model used: FS-boosting compared to other techniques. Architecture/software: limited details provided; WEKA software. Findings: FS-boosting achieved the best overall classification success rates of 81.50% on dataset 1 and 86.70% on dataset 2, followed by boosting and bagging; SVMs performed quite strongly on dataset 2.

Hung and Chen (2009), Expert Systems With Applications, 36: 5297–5303. Sample: 56 bankrupt and 64 non-bankrupt public companies sampled between 1997 and 2001. Definition of distress: legal bankruptcy. Inputs: 30 financial variables/ratios. Boosting model used: selective ensemble vs stacking ensemble. Architecture/software: not stated or limited details provided. Findings: no conclusion reached on the best bankruptcy prediction technique; the authors propose a selective ensemble, which integrates the concept of expected probability.

Karthik Chandra et al. (2009), Expert Systems With Applications, 36: 4830–4837. Sample: 240 dotcom (public) companies drawn from the WRDS database (120 failed vs 120 non-failed). Definition of distress: not explicitly stated. Inputs: 24 financial variables/ratios. Models compared: MLP, random forests, logit, SVMs, and CART. Architecture/software: not stated or limited details provided. Findings: boosting yielded results superior to those reported in previous studies on the same dataset; combining boosting with other techniques such as SVMs resulted in AUCs above .90.

Fedorova et al. (2013), Expert Systems With Applications, 40: 7285–7293. Sample: 888 Russian manufacturing firms (private companies from the SPARK database) sampled over the period 2007–2011. Definition of distress: bankruptcy according to Russian federal law. Inputs: 23 financial variables/ratios. Boosting model used: boosted NNs compared with other techniques. Architecture/software: not stated or limited details provided. Findings: boosted NNs appear to achieve the highest accuracy overall on test samples (classification success of 88.8%).

Karas and Režňáková (2014), International Journal of Mathematical Models, 8: 214–223. Sample: 1908 private manufacturing firms within the Czech Republic sampled over the period 2004–2011. Definition of distress: bankruptcy under the laws of the Czech Republic. Inputs: 34 financial variables/ratios. Boosting model used: AdaBoost vs LDA. Architecture/software: 107 trees with a tree depth of 10; statistical software used. Findings: AdaBoost significantly outperformed the LDA model, with a Type II misclassification rate of 1.48% and a Type I misclassification rate of 15.88%.

Olson et al. (2012), Decision Support Systems, 52: 464–473. Sample: a total US sample of 1321 firm years (public firms) sampled over the period 2005–2011, including 100 bankrupt firms. Definition of distress: not discussed but assumed to be Chapter 11. Inputs: 19 financial variables/ratios. Models compared: decision trees vs NNs and SVMs. Architecture/software: limited discussion; WEKA software. Findings: accuracy differed across models depending on how many input variables were used; decision trees performed better than SVMs and NNs, although boosted trees were not specifically considered; the best AUC reported was .947.

Kim and Upneja (2014), Economic Modelling, 36: 354–362. Sample: a US sample of 142 publicly traded restaurant firms sampled over the period 1988–2010. Definition of distress: uses the Zmijewski score as a dependent variable. Inputs: 25 mostly financial variables/ratios. Boosting model used: AdaBoost vs decision trees. Architecture/software: not stated or limited details provided. Findings: AdaBoost outperformed decision tree models, with an AUC of .988 on the full model, .969 on the "full service" model, and .94 on the "limited service" model.

West et al. (2005), Computers & Operations Research, 32: 2543–2559. Sample: 329 observations on public firms (93 bankrupt firm years and 236 healthy firms). Definition of distress: not clear but assumed to be Chapter 11. Inputs: Altman's five key ratios. Boosting model used: boosting ensemble. Architecture/software: not stated or limited details provided. Findings: for the bankruptcy dataset, the bagging ensemble had the lowest test error of 0.126, compared to 0.127 for boosting and 0.129 for the CV ensemble.

Alfaro et al. (2008), Decision Support Systems, 45: 110–122. Sample: 590 failed and 590 active private firms with accounts registered on the Spanish Mercantile Registry. Definition of distress: bankruptcy and temporary receivership during the period 2000–2003. Inputs: 16 measures, including 13 financial variables/ratios. Boosting model used: AdaBoost compared with CART, NNs, and LDA. Architecture/software: limited details provided; R software used. Findings: AdaBoost outperformed NNs, with a test error rate of 8.898%.

Virág and Nyitrai (2014), Financial & Economic Review, 13(4): 178–193. Sample: 976 Hungarian firms (51% solvent and 49% insolvent) sampled over the period 2001 to 2012; public/private status not clear from the study. Definition of distress: bankruptcy or winding-up proceedings reported on the Hungarian Trade Register. Inputs: 17 financial variables/ratios. Boosting model used: comparison of boosting and bagging, using the C4.5 method. Architecture/software: not stated or limited details provided. Findings: bagging slightly outperformed boosting (by about 1%); both models achieved accuracy rates of around 80%.

Heo and Yang (2014), Applied Soft Computing, 24: 494–499. Sample: a total of 1381 Korean construction bankruptcies and 28,481 "normal" construction firms (small and medium size private firms) sampled over the period 2008–2012. Definition of distress: firms that went into workout, receivership, or bankruptcy. Inputs: 12 financial variables/ratios. Boosting model used: AdaBoost compared with NNs, SVMs, decision trees, and Altman Z-scores. Architecture/software: not stated. Findings: AdaBoost performed best overall, with the highest classification success on the large firm partition of the sample (93.8% accurate); performance was more modest across the whole sample (78.5% accurate) but still better than the other models.

Tsai et al. (2014), Applied Soft Computing, 24: 977–984. Sample: three datasets with "bad vs good" firms in the following ratios: Australian dataset (307/383), German dataset (700/300), and Japanese dataset (307/383); public/private status not clear from the study. Definition of distress: not defined. Inputs: up to 20 variables/ratios. Models compared: boosted and bagged SVMs, NNs, and MLPs. Architecture/software: limited details provided; WEKA software. Findings: boosted decision tree ensembles perform best and outperformed other classifier ensembles, including individual classifiers; while predictive accuracy differed across datasets, the best performance was achieved on the Australian and Japanese datasets (classification accuracy around 86% overall).

Kim and Kang (2015), Expert Systems With Applications, 42: 1074–1082. Sample: 500 bankrupt Korean manufacturing firms and 2500 non-bankrupt manufacturing firms (private companies obtained from a Korean commercial bank) sampled over 2002–2005. Definition of distress: not stated, but assumed to be legal bankruptcy in Korea. Inputs: 30 financial variables/ratios. Boosting model used: geometric mean-based boosting (GMBoost) compared with AdaBoost. Architecture/software: not stated or limited details provided. Findings: GMBoost shows monotonically superior performance to AdaBoost when samples are unbalanced.

Note: none of the studies above reported interaction effects (N/A in all cases).

Source: Jones (2017, pp. 1416–1420). Reproduced with permission from Springer.

3.3.1  Concept Behind Boosting

The basic idea behind boosting, as set out in the work of Schapire and Freund (2012), is to combine the outputs of many weak classifiers (trees) to produce a powerful overall "voting" committee. All the individual classifiers can be weak, but as long as the predictive performance of each weak classifier is slightly better than random guessing (i.e., its error rate is smaller than 0.5 for binary classification), the final model can converge to a very strong classifier (Jones, 2017). The weighted voting is based on the quality of the weak classifiers, and every additional weak classifier improves the prediction outcome. The weak learning algorithm is forced to focus on examples where the previous rules of thumb provided inaccurate predictions. The intuition here is straightforward for corporate failure prediction. The first weak classifier, which might be (say) a current ratio of >2x, is trained on the data where all observations receive equal weights. Some observations will be misclassified by the


first weak classifier. A second classifier, say the return on equity ratio with a cut-off of >4.5%, is initiated to focus on the residuals or training errors of the first classifier. The second classifier is trained on the same dataset, but misclassified observations receive a higher weighting while correctly classified observations receive less weight. The re-weighting occurs such that the first classifier gives 50% error (random) on the new distribution. Iteratively, each new classifier focuses on ever more difficult samples. The algorithm keeps adding weak classifiers (trees), often many hundreds or even thousands in a single model, until some desired low error rate is achieved.

3.3.2  Adaptive Boosting or AdaBoost Model

More formally, Schapire and Freund (2012) illustrate how the boosting algorithm works in the case of adaptive boosting or AdaBoost: (1) train the weak learner using distribution $D_t$ (e.g., the distribution of a financial ratio over the failure and non-failure samples); (2) get a weak hypothesis or classifier $h_t : X \to \{-1, +1\}$, which for this study is a binary outcome of the failure/distress event and the non-failure event; (3) select the weak classifier $h_t$ (for instance, a financial ratio) to minimize the weighted error $\epsilon_t$; (4) choose

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right),$$

where $\alpha_t$ is the parameter importance assigned to the weak classifier $h_t$; (5) update, for $i = 1, \ldots, m$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \qquad (3.12)$$

Output the final hypothesis or strong classifier:

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right), \qquad (3.12)$$

where $H(x)$ is the sign of the linear combination of weak classifiers computed by the boosting algorithm. As noted by Jones (2017), the purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers whose combination predicts very accurately.
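As a hedged illustration of the algorithm above, the sketch below fits AdaBoost with depth-one decision stumps as the weak classifiers $h_t$ on synthetic failure data; the fitted $\alpha_t$ weights are exposed by scikit-learn as `estimator_weights_`. The keyword argument is `estimator` in recent scikit-learn releases (`base_estimator` in older ones), and all data and settings here are hypothetical.

```python
# AdaBoost with decision stumps as weak classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=25, n_informative=10,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=300, learning_rate=0.5, random_state=1)
ada.fit(X_tr, y_tr)
print("Test AUC:", roc_auc_score(y_te, ada.predict_proba(X_te)[:, 1]))
# The alpha_t weights assigned to the first few weak classifiers
print("First five classifier weights:", ada.estimator_weights_[:5])
```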

3.3.3  Gradient Boosting Machines

Gradient boosting is a generalization of AdaBoost to handle a variety of loss functions. (AdaBoost uses the exponential loss function and is regarded as a


special case of gradient boosting in terms of loss function.) Both AdaBoost and gradient boosting are conceptually similar techniques. Both approaches boost the performance of a base classifier by iteratively focusing attention on observations that are difficult to predict. AdaBoost achieves this by increasing the weight on observations that were incorrectly classified in the previous round. With gradient boosting, difficult or hard-to-classify observations are identified by large residuals computed in the previous iterations. The idea behind gradient boosting is to build the new base classifiers to be maximally correlated with the negative gradient of the loss function, across the whole tree ensemble (Friedman, 2001). The intuition behind gradient boosting is to build a sequence of predictors, with the final classifier being the weighted average of these predictors (a multiple additive regression tree approach). At each stage, the algorithm adds a new classifier that improves the performance of the entire tree ensemble (with respect to minimizing some loss function). The gradient boosting approach can be understood as a long series expansion such as a Fourier or Taylor series – the sum of weighted factors becomes progressively more accurate as the expansion continues.

The basic gradient boosting model is set out by Friedman (2001) and Hastie et al. (2009). The generic algorithm using steepest gradient descent optimization is as follows:

(1) $F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$;

(2) for $m = 1$ to $M$ do:

(3) $\tilde{y}_i = -\left[\dfrac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \ldots, N$;

(4) $a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \big[\tilde{y}_i - \beta\, h(x_i; a)\big]^2$;

(5) $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big)$;

(6) $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$,

where $L$ is the loss function to be minimized, $y_i$ is the response/outcome variable, $F_{m-1}$ represents the present model, and $x_i$ is the input variable. The expression $h(x_i; a)$ is the weak classifier (normally a decision tree). For regression trees, the parameters $a_m$ are the splitting variables, split locations, and terminal node means of individual trees; $h$ is a function of the input variables and tree split parameters $a$, and a coefficient $\beta$, which is the weight assigned to the tree based on the overall model improvement. The expression $\beta h(x_i; a)$ represents the improvement to the current model (a new tree), which is to be optimized in a stage-wise fashion. The first line initializes to the optimal constant model, which for regression trees is a single terminal node tree (the average of the response variable). Line 2 sets the iterations $M$ of the algorithm; at line 3, the negative gradient $-\partial L(y_i, f(x_i))/\partial f(x_i)$ represents the generalized or pseudo residuals. At line 4 of the algorithm, a new tree is fitted to the generalized residuals of the model (the negative gradient of the loss function). At line 5, the multiplier $\rho_m$ is solved using a one-dimensional optimization problem, with the gradient boosting model updated at line 6. As demonstrated by Friedman (2001), the gradient boosting strategy can be applied to a variety of loss functions, yielding specific boosting algorithms, including least squares regression, least absolute deviation (LAD) regression, Huber-M, and other loss functions. For classification problems such as corporate failure prediction, the loss function is usually the logistic loss function.

A frequent criticism of machine learning models is their lack of interpretability. Jones et al. (2015) described different classifiers in terms of an interpretability vs flexibility trade-off continuum. Sophisticated models such as gradient boosting tend to be very accurate predictors (because they are highly flexible) but at the expense of some interpretability. While the gradient boosting model can be very complex, involving many potential nonlinear relationships and interactions among predictor variables, there are now standard outputs available that allow the role and influence of different predictors to be more readily interpreted. The gradient boosting model is interpreted through relative variable importance scores, partial dependency plots, and interaction effects (these are illustrated in more detail in Chapter 4).

Relative variable importance. Breiman et al. (1984) developed the measurement of variable importance for a single tree (see Hastie et al., 2009; Jones, 2017):

$$I_{l}^{2}(T) = \sum_{t=1}^{J-1} \hat{i}_{t}^{2}\, I\big(v(t) = l\big), \qquad (3.19)$$

where $I_{l}^{2}$ is the variable importance for each input variable $X_l$. The summation is over the $J-1$ internal nodes of the tree. At each node $t$, one of the input variables $X_{v(t)}$ is used to partition the region associated with that node into two subregions, and within each subregion a separate constant is fit to the response variable. The variable selected is the one that provides the best improvement $\hat{i}_{t}^{2}$ in squared error risk over that of a constant fit over the entire region.

Partial dependency plots. Partial dependence functions can be estimated by the following formula (see Hastie et al., 2009, p. 370):

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f\big(X_S, x_{iC}\big), \qquad (3.20)$$

where $x_{1C}, x_{2C}, \ldots, x_{NC}$ are the values of $X_C$ occurring in the training data. This requires a pass over the data for each set of joint values of $X_S$ for which $\bar{f}_S(X_S)$ is to be evaluated. Partial dependence functions represent the effect of $X_S$ on $f(X)$ after accounting for the (average) effects of the other variables $X_C$ on $f(X)$ (see Hastie et al., 2009).
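The following is a minimal sketch, on synthetic data, of a gradient boosting classifier (scikit-learn's default loss for binary outcomes is the logistic/deviance loss) together with the two interpretability outputs just described: relative variable importances and a partial dependence profile for a single predictor. The hyper-parameter values are illustrative assumptions, not the specification used in the studies discussed in this chapter.

```python
# Gradient boosting with variable importances and partial dependence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=30, n_informative=12,
                           weights=[0.9, 0.1], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=2)

gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                 max_depth=3, subsample=0.8, random_state=2)
gbm.fit(X_tr, y_tr)
print("Test AUC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))

# Relative variable importances, scaled to 100 for the strongest predictor
rvi = 100 * gbm.feature_importances_ / gbm.feature_importances_.max()
print("Top 5 predictors by RVI:", np.argsort(rvi)[::-1][:5])

# Averaged partial dependence of the model output on predictor 0
pd_result = partial_dependence(gbm, X_tr, features=[0], kind="average")
print(pd_result["average"][0][:5])
```

In practice the importance scores and partial dependence curves are plotted, as illustrated in Chapter 4; the array output here is shown only to keep the sketch self-contained.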

3.3.4  Random Forests

Random forests are an improvement on the CART system (binary recursive partitioning) and bagged tree algorithms, which tend to suffer from high variance (i.e., if a training sample is randomly split into two halves, the fitted model can vary significantly across the samples) and from weaker classification accuracy. Random forests maintain the advantages of the CART and bagged tree methodology by de-correlating the trees and using the "ensemble" or maximum votes approach of gradient boosting. Unlike CART, random forests do not require true pruning for generalization. As in bagging, random forests build a number of decision trees based on bootstrapped training samples. But when building these decision trees, each time a split in the tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is only allowed to use one of these m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p, which means that at each split we consider roughly the square root of the total number of predictors (i.e., with 16 predictors, no more than 4 will be selected). By contrast, if a random forest is built where the predictor subset size m equals the number of predictors p, this simply amounts to bagging.

The intuition behind random forests is clear. In a bagged tree process, a particularly strong predictor in the dataset (along with some moderately strong predictors) will be used by most if not all of the trees in the top split. Consequently, all the bagged trees will look quite similar to each other. Hence, the predictions from bagged trees will be highly correlated. But averaging many highly correlated quantities does not lead to as significant a reduction in error as averaging uncorrelated quantities. Random forests overcome this problem by forcing each split to consider only a subset m of the predictors. Therefore, on average, (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. By de-correlating the trees, the averaging process will be less variable and more reliable. If there is strong correlation among the predictors, m should be small. Furthermore, random forests typically do not overfit if we increase B (the number of bootstrapped training sets), and in practice a sufficiently large B should be used for the test error rate to settle down.

Both random forests and gradient boosting share the "ensemble" approach. Where the two methods differ is that gradient boosting performs an exhaustive search for which trees to split, whereas random forests choose a small subset. Gradient boosting grows trees in sequence (with the next tree dependent on the last); however, random forests grow trees in parallel, independently of each other. Hence, random forests can be computationally more attractive for very large datasets.


A random forest model is derived as follows (see Appendix): (1) for b = 1 to B training sets: (a) draw a bootstrap sample of size N from the training data; (b) grow a random forest tree $T_b$ to the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{\min}$ is reached: (i) select m variables at random from the p variables; (ii) pick the best variable/split point among the m; (iii) split the node into two daughter nodes; (2) output the ensemble of trees $\{T_b\}_{1}^{B}$. (3.10)

For a discrete outcome variable, let $\hat{C}_b(x)$ be the class prediction of the bth random forest tree. Then $\hat{C}_{rf}^{B}(x) = \text{majority vote}\ \{\hat{C}_b(x)\}_{1}^{B}$. (3.10)
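A minimal sketch of the procedure, again on synthetic data: with p = 16 predictors and max_features="sqrt", at most m = 4 candidate predictors are considered at each split, and B = 1,000 bootstrapped trees are grown in parallel. The out-of-bag (OOB) score provides an internal estimate of generalization error. All settings are illustrative assumptions.

```python
# Random forest with m = sqrt(p) candidate predictors per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=16, n_informative=8,
                           weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=3)

rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=3)
rf.fit(X_tr, y_tr)
print("OOB accuracy:", rf.oob_score_)
print("Test AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```

Setting max_features to the full predictor count would reduce this to bagging, as noted above.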

3.4  MARS and Generalized Lasso Models

Other popular methods include Multivariate Adaptive Regression Splines (MARS) and the generalized lasso (these methods will be illustrated in Chapter 4). The MARS model has the following functional form:

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{R+3} b_{R+3}(x_i) + \epsilon_i, \qquad (3.10)$$

which represents a cubic spline with R knots; the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_{R+3}$ are estimated over different regions of X (i.e., knots); and $b_1, b_2, \ldots, b_{R+3}$ are basis functions. Hence, MARS works by dividing the range of X into R distinct regions (or splines/knots). Within each region, a lower degree polynomial function can be fitted to the data with the constraint that the functions join at the region boundaries through knots. This can provide more stable parameter estimates and frequently better predictive performance than fitting a high degree polynomial over the full range of X. In estimating logit and probit models with a MARS feature, it is conventional to place knots uniformly and use cross-validation to determine the number of knots. A limitation is that regression splines can have high variance at the outer range of the predictors (when X takes on very small or very large values). This can be rectified by imposing boundary constraints. A further limitation is the additivity condition; hence, the MARS model can only be described as a partially nonlinear model.

Regularized regression (such as the lasso, elastic net, and ridge regression). Penalized models or shrinkage methods are an alternative to subset procedures. Two popular techniques are ridge regression and the lasso. A relatively new technique (elastic net) combines the strengths of both techniques. Rather than using OLS to find a subset of variables, ridge regression uses all variables in the dataset but constrains or regularizes the coefficient estimates so that the coefficient estimates of non-important variables "shrink" towards zero. Shrinking the parameter estimates can significantly reduce their variance with only a small increase in bias. A weakness of ridge regression is that all variables are included in the model, making the model difficult to interpret. The lasso has a similar construction, but the penalty forces some parameters to equal exactly zero (so the lasso has a variable selection feature and produces parsimonious models, as shown in Chapter 4). The elastic net technique builds on the strengths of both ridge regression and the lasso (Zou and Hastie, 2005). Setting $\alpha = 0.5$ allows very unimportant variable parameters to be shrunk to 0 (a kind of subset selection), while variables with small importance will be shrunk to some small (non-zero) value. Model parameters are estimated using penalized maximum likelihood. For instance, the elastic net penalty is set out as follows:

$$\lambda \sum_{j=1}^{p} \Big( \alpha\, \lvert \beta_j \rvert + (1 - \alpha)\, \beta_j^{2} \Big), \qquad (3.11)$$

where $\lambda$ is the penalty parameter and $\beta_j$ is the estimated coefficient. If $\alpha = 0$, we have a ridge regression penalty; if $\alpha = 1$, we have a lasso penalty.
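A minimal sketch of penalized estimation for a binary failure outcome, using scikit-learn's logistic regression with an elastic net penalty; the l1_ratio argument plays the role of $\alpha$ in (3.11), so that 1 gives the lasso, 0 gives ridge regression, and 0.5 gives the elastic net. The data and penalty strength are hypothetical.

```python
# Elastic net (penalized) logistic regression with coefficient shrinkage.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           weights=[0.9, 0.1], random_state=4)

enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.1, max_iter=5000))
enet.fit(X, y)
coefs = enet.named_steps["logisticregression"].coef_.ravel()
# The L1 component sets unimportant coefficients exactly to zero,
# which is the variable selection feature described above.
print("Predictors shrunk exactly to zero:", int(np.sum(coefs == 0)))
```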

3.5  Model Comparisons

Jones et al. (2015, 2017) provided the first systematic comparison of advanced machine learning methods vs conventional classifiers in the context of corporate failure and ratings changes in the literature. In both studies, the authors examined a wide set of binary classifiers against a large sample of credit ratings changes (the 2015 study) and a large sample of corporate bankruptcies (the 2017 study). While not an exhaustive list, the 16 statistical learning methods examined in Jones et al. (2017) are broadly representative of the most widely used and cited failure classifiers in the literature. On one side of the spectrum, we have relatively simple linear classifiers such as standard form logit, probit, and LDA. These classifiers are common in the literature but have limited capacity to model nonlinearity and unobserved heterogeneity in the dataset. However, they are generally more interpretable in terms of understanding the functional relationship between predictor variables and the response outcome. In the middle of the spectrum, we have classifiers that are better equipped to handle nonlinearity and unobserved heterogeneity, including mixed model approaches (such as mixed logit), MARS, generalized additive models (GAM), and generalized lasso methods (see the Appendix for discussion of all models). The greater flexibility of these models usually translates into better fit and enhanced predictive performance, at the cost of lower interpretability (see Hastie et al., 2009). At the other end of the spectrum, we have fully general, nonlinear models that are designed to capture all nonlinear relationships and interaction effects in


the dataset. These classifiers include neural networks, SVMs, gradient boosting models, AdaBoost, random forests, and deep learning models. While the complex algorithms underpinning many of these classifiers are designed to enhance classification accuracy, they too pose major hurdles for interpretation, since the relationship between the predictor variables and the response variable is largely hidden in the internal mathematics of the model system.

As suggested by Jones et al. (2017), the benefit from using a more complex nonlinear classifier (such as gradient boosting or random forests) should come from improved out-of-sample predictive success. Following Occam's Razor, if two classifiers have comparable predictive performance, a simpler and more interpretable classifier is preferred to a less interpretable classifier, particularly if statistical inference, and not just prediction, is a major goal of the modelling exercise (James et al., 2013). A major objective of the Jones et al. (2017) study was to assess whether more complex classifiers do in fact lead to better out-of-sample prediction success, particularly compared to simpler, more interpretable classifiers. Jones et al. (2017) also evaluated whether the predictive performance of different classifiers is impacted by the underlying shape and structure of input variables, and whether predictive performance can be enhanced by modifying these conditions. This appears to be an important but much neglected issue in the literature. Jones et al. (2017) examined two common problems that can potentially impact the performance of classification models: non-normality of the input variables and missing values. The predictive performances of all classifiers were tested by Jones et al. (2017) both with and without variable transformation (using the Box-Cox power transformation procedure) and missing value imputation (using the singular value decomposition or SVD procedure), including the combined effects of both techniques.

As many of the binary classifiers examined by Jones et al. (2017) are quite new to accounting and finance research, and corporate failure research in particular, an empirical assessment of their characteristics and forecasting potential, particularly under different data assumptions and restrictions, provided important motivation for the use of these methods. For a model to have good practical value and appeal, Jones et al. (2017) argued that it should: (1) be relatively straightforward to implement, particularly in terms of model architecture specification, data preparation requirements, estimation time, and the ready availability of statistical software for the task; and (2) have some level of interpretability in terms of the role and behavioural influence of input variables on overall model performance. Practitioners are unlikely to embrace purely "black box" approaches where it is impossible to interpret the variables most influencing model predictions. As noted by Jones (2017), one of the major practical benefits of modern machine learning models is that they involve relatively little researcher intervention. For instance, these models are largely immune to monotonic transformation of input variables, the effects of outliers, missing values, and other common data problems. These models are not significantly impaired by statistical problems


such as multicollinearity or heteroscedasticity, which can seriously undermine the performance of parametric models such as logit, probit, and LDA (and related models). In order to compare the empirical performance of alternative binary classifiers, Jones et al. (2017) indicated that it was important to provide a consistent framework for model comparison. Jones et al. (2017) used a common set of input (predictor) variables drawn from previous research, and identical processes for data handling, hyper-parameter estimation (i.e., model selection), and model evaluation. Following the approach of Jones et al. (2015), the predictive performance of all classifiers was tested using different data assumptions and conditions. Jones et al. (2017) evaluate model performance with respect to: (1) out-of-sample predictive accuracy, using both randomized cross-sectional and longitudinal test samples; (2) cross-sectional and longitudinal predictive performance using variance-stabilizing data transformations (i.e., Box-Cox power transformations); and (3) cross-sectional and longitudinal predictive performance both with and without missing value imputation (using the singular value decomposition or SVD method).

The results of the Jones et al. (2017) study are reported in Table 3.2. Panel A of Table 3.2 displays the cross-sectional tests, and Panel B shows the longitudinal tests. The first column shows the area under the ROC curve (AUC) for the unadjusted dataset (no transformations or missing value imputations). The third column shows the AUCs for the transformed datasets. The AUC ranks and the upper and lower AUC confidence intervals are also provided in Table 3.2. The AUCs for the transformed and imputed data, the AUC ranks, and the upper and lower AUC confidence intervals are displayed on the far right of Table 3.2.

Consistent with the results of Jones et al. (2015), the ROC curve analysis indicates that modern machine learning methods, such as AdaBoost, gradient boosting, and random forests, sharply outperformed all other classifiers on the cross-sectional and longitudinal test samples, and across all permutations of the dataset. As these classifiers appear to be very accurate and stable predictors and are relatively easy to implement, they may hold significant promise for future research and practice in this field (and business research more generally). This conclusion is further galvanized by the many appealing properties of these classifiers. The results of the Jones et al. (2017) study suggest that advanced machine learning methods may require minimal data intervention, as their predictive performance appears largely immune to the shape and structure of the data and is largely insensitive to monotone transformation of the input variables (see Table 3.2). The strong performance of these classifiers on test samples also confirms previous findings that these models are resilient to model over-fitting, which typically manifests in weaker test sample accuracy relative to the training sample (Friedman, 2001; Schapire and Freund, 2012). While the underlying model structure of these models can be very complex, as they are designed to capture all nonlinear relationships and interactions, I illustrate in Chapter 4 that the RVIs, partial dependency plots, and interaction effects allow the analyst to see into the "black box" and evaluate which variables are contributing most to overall predictive success. This adds to the practical appeal of the models and could be the

TABLE 3.2  ROC Curve Analysis for the Cross-Sectional and Longitudinal Test Sample

Panel A: AUC Summaries for Cross-Sectional Test Samples

| Classifier | Unadjusted data AUC | Transformed data AUC | Rank | UCI | LCI | Transformed and imputed data AUC | Rank | UCI | LCI |
|---|---|---|---|---|---|---|---|---|---|
| AdaBoost | 0.9269 | 0.9277 | 2 | 0.9298 | 0.9101 | 0.9341 | 2 | 0.9407 | 0.9311 |
| Generalized Boosting | 0.9289 | 0.9292 | 1 | 0.9314 | 0.9221 | 0.9397 | 1 | 0.9413 | 0.9281 |
| Random Forests | 0.9216 | 0.9220 | 3 | 0.9254 | 0.9161 | 0.9199 | 3 | 0.9230 | 0.9033 |
| SVMs | 0.8315 | 0.8501 | 4 | 0.8590 | 0.8345 | 0.8403 | 8 | 0.8488 | 0.8339 |
| Neural Networks | 0.8184 | 0.8499 | 5 | 0.8522 | 0.8266 | 0.8416 | 7 | 0.8488 | 0.8360 |
| Probit_GAM | 0.8081 | 0.8322 | 7 | 0.8471 | 0.8209 | 0.8426 | 6 | 0.8511 | 0.8354 |
| Probit_MARS | 0.8241 | 0.8318 | 8 | 0.8482 | 0.8201 | 0.8661 | 4 | 0.8890 | 0.8851 |
| Logistic_GAM | 0.8011 | 0.8309 | 9 | 0.8415 | 0.8246 | 0.8481 | 9 | 0.8559 | 0.8315 |
| Logistic_MARS | 0.8253 | 0.8397 | 6 | 0.8488 | 0.8309 | 0.8600 | 5 | 0.8890 | 0.8567 |
| Logistic_Penalized | 0.8099 | 0.8177 | 12 | 0.8369 | 0.8215 | 0.8161 | 11 | 0.8225 | 0.8107 |
| Logistic_Subset | 0.7901 | 0.8211 | 10 | 0.8385 | 0.8228 | 0.8091 | 12 | 0.8226 | 0.7988 |
| Logistic_Stepwise | 0.8032 | 0.8201 | 11 | 0.8379 | 0.8121 | 0.8271 | 10 | 0.8315 | 0.8113 |
| Probit_Subset | 0.8067 | 0.8151 | 13 | 0.8348 | 0.8090 | 0.8001 | 14 | 0.8201 | 0.7951 |
| Probit_Stepwise | 0.7955 | 0.8127 | 14 | 0.8325 | 0.8171 | 0.8022 | 13 | 0.8215 | 0.8001 |
| QDA | 0.7671 | 0.8040 | 16 | 0.8174 | 0.7949 | 0.7910 | 16 | 0.8113 | 0.7814 |
| LDA | 0.7561 | 0.8078 | 15 | 0.8237 | 0.7943 | 0.7981 | 15 | 0.8221 | 0.7851 |

Panel B: AUC Summaries for Longitudinal Test Sample (1 Year Prior to Bankruptcy)

| Classifier | Unadjusted data AUC | Transformed data AUC | Rank | UCI | LCI | Transformed and imputed data AUC | Rank | UCI | LCI |
|---|---|---|---|---|---|---|---|---|---|
| AdaBoost | 0.9458 | 0.9464 | 2 | 0.9567 | 0.9361 | 0.9471 | 2 | 0.9598 | 0.9371 |
| Gradient Boosting | 0.9486 | 0.9491 | 1 | 0.9589 | 0.9382 | 0.9481 | 1 | 0.9611 | 0.9383 |
| Random Forests | 0.9412 | 0.9424 | 3 | 0.9543 | 0.9315 | 0.9435 | 3 | 0.9578 | 0.9352 |
| SVMs | 0.8432 | 0.8524 | 9 | 0.8713 | 0.8293 | 0.8592 | 8 | 0.8831 | 0.8377 |
| Neural Networks | 0.8534 | 0.8533 | 8 | 0.8791 | 0.8306 | 0.8552 | 9 | 0.8824 | 0.8331 |
| Probit_GAM | 0.8241 | 0.8925 | 7 | 0.9094 | 0.8712 | 0.8793 | 7 | 0.8985 | 0.8511 |
| Probit_MARS | 0.8699 | 0.9132 | 5 | 0.9312 | 0.8919 | 0.8852 | 5 | 0.9037 | 0.8662 |
| Logistic_GAM | 0.8308 | 0.8934 | 6 | 0.9108 | 0.8731 | 0.8808 | 6 | 0.9015 | 0.8632 |
| Logistic_MARS | 0.8711 | 0.9132 | 5 | 0.9307 | 0.8949 | 0.8858 | 4 | 0.9041 | 0.8451 |
| Logistic_Penalized | 0.8111 | 0.8329 | 13 | 0.8642 | 0.8071 | 0.8173 | 14 | 0.8381 | 0.7911 |
| Logistic_Subset | 0.8088 | 0.8341 | 12 | 0.8641 | 0.8101 | 0.8175 | 11 | 0.8311 | 0.7843 |
| Logistic_Stepwise | 0.8059 | 0.8285 | 15 | 0.8593 | 0.8111 | 0.8173 | 13 | 0.8497 | 0.7955 |
| Probit_Subset | 0.8054 | 0.8369 | 11 | 0.8658 | 0.8127 | 0.8173 | 13 | 0.8488 | 0.7877 |
| Probit_Stepwise | 0.8089 | 0.8326 | 14 | 0.8627 | 0.8072 | 0.8177 | 10 | 0.8418 | 0.7928 |
| QDA | 0.7838 | 0.8400 | 10 | 0.8644 | 0.8128 | 0.8043 | 15 | 0.8263 | 0.7755 |
| LDA | 0.7946 | 0.8180 | 16 | 0.8521 | 0.7921 | 0.8021 | 16 | 0.8399 | 0.7741 |

Note: UCI and LCI denote the upper and lower AUC confidence intervals. The first Rank/UCI/LCI set refers to the transformed data; the second set refers to the transformed and imputed data.

Source: Jones et al. (2017, pp. 18–19)


reason why these classifiers are now widely used in so many different discipline fields and applied settings. A second finding of Jones et al. (2017) is that the performance of many classifiers was improved, in some cases quite significantly, using Box-Cox power transformations of predictor variables. However, as can be seen from Table 3.2, missing value imputation using the SVD approach appears to have contributed very little to improving the overall predictive performance of most classifiers. A third finding of Jones et al. (2017) is that quite simple classifiers (such as best subset, stepwise, penalized, and LDA models) performed the worst on the test samples. However, as seen in the Table 3.2 results, they did not perform far worse than more sophisticated classifiers such as neural networks and SVMs. In fact, on the transformed test samples, the gap between the AUC performance of simple classifiers and some of the more sophisticated techniques was not substantial. Jones et al. (2017) concluded that simple model structures could still represent a viable alternative to more sophisticated approaches, particularly if statistical inference and interpretability is a major objective of the modelling exercise. In another study, Jones (2017) provided a more detailed examination of the gradient boosting machines in a high dimensional context. In the statistical learning literature, high dimensional means that an empirical context such as corporate failure can be explained by a potentially very large number of predictor variables (the number of predictors p is large relative to the sample size n). In some discipline fields, such as DNA research, it is recognized that p can be much greater than n, which has important implications to the types of statistical models most appropriate to the analysis. Corporate failure research is also a field where it is quite possible for p to be greater than n, as failure samples tend to be quite small in practice, but the range of potential predictors (and associated interaction effects) can be large. In the context of corporate failure, a high dimensional setting implies a multidimensional interpretation as well. That is, corporate failure can be explained by numerous distinct facets or dimensions, which might include accounting-based variables, market price indicators, ownership concentration/structure variables, macro-economic factors, executive compensation measures, external ratings indicators, and other factors. Typically, the models used in corporate failure research are unsuitable for high dimensional analysis. Logit, probit, and LDA are not designed to handle large numbers of explanatory variables and related interaction effects or even situations where p is moderately large compared to n. In other words, conventional models suffer from the “curse of dimensionality” (i.e., the problem of separating noise from useful information in the context of many predictor variables) and quickly become unstable as the number of predictors is increased. For example, multicollinearity issues (and other statistical problems) become magnified as the number of predictors is increased, resulting in biased parameter estimates and invalid significance tests. Hence, conventional models can impose quite severe constraints on the number of independent variables (and interactions) that can be examined together.


3.6  Recent Studies Using Machine Learning in Corporate Failure Prediction

Jones (2017) used the gradient boosting model to discriminate, from a very large set of explanatory variables, which particular inputs have the greatest impact on predictive performance. Here, the gradient boosting model can be applied to test specific hypotheses relating to the predictive value of alternative corporate failure indicators. For instance, the predictive power of accounting-based versus market price variables has attracted much interest in the corporate failure literature, as discussed in Chapter 2. Gradient boosting models can incorporate all known predictors and rank order these variables based on their out-of-sample predictive performance. The Jones (2017) study is based on a sample of 1115 US corporate bankruptcy filings and showed that around 90% of the variables tested in the study (out of 91 predictor variables) have non-zero importance scores (meaning they all contribute to predictive success), which supported the view that corporate failure can be better explained/predicted in a high dimensional setting. Further, these high impacting variables appear to be associated with a number of different failure dimensions, which also supports a multi-dimensional interpretation of the results.

After taking into account the role of all other model influences (including all important interaction effects), Jones (2017) reported that ownership concentration/structure variables, such as the level of stockholder concentration, institutional ownership, and insider ownership, are among the strongest predictors of corporate failure overall. Consistent with prior literature, several market price indicators were also found to be important; however, they appear to have no more predictive power than accounting-based indicators (and in some cases are less important). Also, several strong interaction effects between these variables suggest that accounting and market price variables appear to be complementary rather than competing sources of information. Jones (2017) not only showed that the gradient boosting model can predict significantly better than conventional models, but also that the model can reveal deeper relationships among failure predictors (particularly through nonlinearities and interaction effects) that cannot easily be captured using conventional models. The results of the Jones (2017) study can facilitate further theoretical development and debate about the role and predictive power of alternative corporate failure predictors.

Another recent application of modern machine learning methods is to Chinese bankruptcies (Jiang and Jones, 2019). As noted by Jiang and Jones (2019), the corporate failure context is slightly more complex in the case of China, as the Chinese regulatory regime is somewhat unique. Even though China has a bankruptcy code similar to that of the US, corporate bankruptcies are quite rare in China. The Chinese regulatory authorities have adopted a different system to handle corporate distress. As required by the Chinese Securities Regulatory Commission (CSRC), the Shanghai Stock Exchange (SSE) and Shenzhen Stock Exchange (SZSE) launched the Special Treatment (ST) system on April 22, 1998, to differentiate firms in financial distress (CSRC, 2008). The ST system is designed


to identify and single out poorly performing stocks as an early warning signal to investors and creditors (Zhou et al., 2012). According to the stock listing rules, the SSE and SZSE can assign ST status to a listed company with one of the following abnormal financial conditions: (1) negative earnings in two consecutive years; (2) net assets per share less than face value per share; (3) no auditor's report, or an auditor's report that materially disagrees with the financial statements of the company; (4) abnormal financial behaviour identified and claimed by the CSRC or the stock exchanges. These ST stocks are labelled with "ST" to show that they are currently in an abnormal state. During the special treatment period, ST shares are required to observe the following rules: (1) a 5% daily price fluctuation limit; (2) use of the prefix "ST" followed by the original stock name; and (3) interim reports to be audited. An ST company would be issued with a delisting warning and marked as ST shares if one of the following events happens: (1) the company failed to turn a profit after two consecutive years of losses; (2) a material error or fraud is discovered in the financial statements; or (3) failure to disclose financial statements on a timely basis as required by law.1

104  The Rise of the Machines

All data for the Jiang and Jones (2019) study were collected from the China Stock Market and Accounting Research (CSMAR) database. A total sample of 15,504 firm years was extracted for the period between 1999 and 2015. The sample included 12,156 active firm years and 3,348 distressed firm years. Of the distressed firm years, 222 relate to companies that experienced a single ST event; 1,620 relate to companies that experienced between one and four ST events; 993 relate to companies that experienced more than four ST events over the sample period; and 514 relate to companies that were delisted following an ST event. Among the distressed companies, 60.63% of sampled observations come from the Chinese industrial sectors; 12.90% relate to the properties sector; 11.4% relate to public utilities; 6.81% come from the conglomerate industry; 5.85% are from the commercial sector; and 2.38% relate to the finance industry.

Jiang and Jones (2019) used a large number of predictor variables, including market price variables, several accounting variables, macro-economic variables (such as GDP and employment rates), and corporate governance proxies (such as shareholder concentration and ownership). Based on pooled observations and random assignment of 20% of observations to the test or holdout sample, the overall gradient boosting model was 94.74% accurate in predicting distress and 94.85% accurate in predicting active or healthy companies, using a baseline threshold as the cut-off score one year from failure. Around 90% of the explanatory variables proved to have predictive value (RVI > 0), and these are quite well dispersed over a number of different dimensions of corporate distress. Variables with the strongest predictive value in the Jiang and Jones (2019) study included: (1) market variables, particularly market capitalization and annual market returns; (2) macro-economic variables, notably GDP growth, GDP per capita, and unemployment rates; (3) financial variables, particularly retained earnings to total assets, net profit margin, net growth in equity over three years, ROA, annual growth in EPS, ROE, and total assets to total liabilities; (4) shareholder ownership/concentration, notably the percentage of shares held by insiders; and (5) executive compensation, such as the total compensation of the top three executives and the total compensation of the top three directors.

3.7  Deep Learning Another promising machine learning method is deep learning. The terms deep learning and neural networks tend to be used interchangeably, but they are a little different: the "deep" in deep learning refers to the depth of layers in a neural network. A neural network that consists of more than three layers – inclusive of the input and output layers – can usually be considered a deep learning algorithm, whereas a network with only two or three layers is just a basic neural network. Figure 3.3 is an example of a deep learning model. Alam et al. (2021) observed that while machine learning models such as gradient boosting machines (and related methods such as random forests and


FIGURE 3.3  Example of Deep Neural Network Structure. Source: IBM (www.ibm.com/cloud/learn/neural-networks)

AdaBoost) have gained considerable traction in the literature over the last decade, a developing literature on deep learning models is also rapidly emerging. Recent research has demonstrated that convolutional neural networks (CNNs), a class of deep learning model, have led to significant breakthroughs in many types of classification problems, including robotics (such as self-driving cars), speech recognition, image recognition, natural language processing, news aggregation, recommender systems, and medical image analysis (Hosaka, 2019; Wiatowski and Bolcskei, 2018). Alam et al. (2021) developed a fully connected deep neural network architecture model for corporate failure prediction that can be applied to panel data. They used a large number of financial and market variables drawn from previous literature. The Deep Grassmannian Network (DGN) introduced in Huang et al. (2018) was utilized by Alam et al. (2021) and applies the idea of standard deep convolutional neural networks (CNNs), which consist of an input layer, an output layer, and a few hidden layers, including convolutional layers, pooling layers, and normalization layers. Similarly, there exist three basic building blocks in a DGN: projection, pooling, and output blocks (see Figure 1 in Huang et al., 2018). In the projection block, the layers are designed to be fully connected, referred to as the full rank mapping layer (FRMap), followed by a normalization layer referred to as re-orthonormalization (ReOrth). The FRMap layer takes as inputs the Grassmannian


points $X_i^0$, where the superscript 0 denotes the data at the input layer of the network, and linearly transforms them to the first hidden layer as follows:

$$X_i^1 = f_{fr}(X_i^0, W_1) = W_1 X_i^0, \qquad (3.22)$$

where $W_1 \in \mathbb{R}^{D_1 \times D}$ is a full rank matrix with $D_1 < D$. The resulting matrix $X_i^1 \in \mathbb{R}^{D_1 \times d}$ is not guaranteed to be an orthogonal matrix. However, we can use the QR decomposition $X_i^1 = Q_i^1 R_i^1$ to obtain an orthogonal matrix. The process converting $X_i^1$ to $Q_i^1$ is called re-orthonormalization. As such, the normalized data represent linear sub-spaces with respect to the geometry-aware Grassmann manifold. This process is denoted as $X_i^2 = f_{ro}(X_i^1) = Q_i^1$. In the next step, the output $X_i^2 = Q_i^1$ of the ReOrth layer is passed into the pooling block. In this block, a projection mapping layer (ProjMap) is designed based on the Grassmann embedding to maintain the Grassmann context of the data and transfer it into Euclidean data, $X_i^3 = f_{pm}(X_i^2) = X_i^2 (X_i^2)^T$, enabling data pooling to reduce the data dimension and fuse the information of multiple versions $\{X_{1i}^3, \ldots\}$ into $X_i^4 = f_{pp}(\{X_{1i}^3, \ldots\})$. The pooling function $f_{pp}$ can be any of the max, min, or mean pooling functions in order to reduce model complexity. This layer is called the ProjPooling layer. Next, the output $X_i^4$ (a symmetric matrix) is fed into another layer of orthonormal mapping, implemented by the eigenvalue (EIG) decomposition $X_i^4 = U_i^4 \Sigma_i^4 (V_i^4)^T$, such that

$$X_i^5 = f_{om}(X_i^4) = U_i^4(:, 1\!:\!d), \qquad (3.24)$$

where $U_i^4(:, 1\!:\!d)$ denotes the first $d$ largest eigenvectors obtained from the EIG decomposition. Hence, $X_i^5$ is a Grassmann point from $\mathcal{G}(D_1, d)$. This layer is called the OrthMap layer. At this point, another projection mapping layer is applied to $X_i^5$ to transform it into the Euclidean space of symmetric matrices. As such, the data can be converted into vector form, allowing classical Euclidean network layers to be used. Huang et al. (2018) note that classical Euclidean layers, such as the fully connected layer or the softmax layer for classification, can be employed at this point. In a nutshell, the DGN takes Grassmann points as inputs and forwardly transforms them into a set of softmax values for classification. Note that the DGN can repeat the projection block and pooling block many times before connecting to the final output block, as shown in Figure 3.4.


FIGURE 3.4  Graphical Representation of DGN, Including Projection, Pooling, and Output Blocks. Source: Alam et al. (2021)

TABLE 3.3  Summary of Out-of-Sample Predictive Performance for Hazard Model and Deep Learning Model (Test Sample)

                          Overall accuracy   Sensitivity   Specificity
Discrete Hazard Model     93.3%              86.95%        93.53%
Deep Learning Model       91.2%              93.71%        90.28%

Source: Alam et al. (2021)

An interesting feature of the Alam et al. (2021) study is that they compared their results with a hazard model, given the wide use and popularity of this approach (see, e.g., Campbell et al., 2008). Based on a sample of 641,667 firm-month observations of North American listed companies between 2001 and 2018 and a large number of financial and market-based variables, Alam et al. (2021) demonstrated that a deep learning model produces an overall out-of-sample accuracy rate of 91.2% and a Type 1 classification accuracy rate of 93.71%. While the discrete hazard model produced a slightly higher overall accuracy rate of 93.3%, it performed much worse than deep learning on Type 1 classification accuracy (86.95%), which clearly matters more in corporate failure prediction given the significantly higher economic costs associated with corporate failure, as discussed in Chapter 2. The prediction results are summarized in Table 3.3. Not only has the deep learning model proven quite effective in corporate failure prediction, it can also be potentially applied to many other classification problems in finance involving panel data structures.
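As a minimal, hedged sketch of the basic idea of stacking several hidden layers, the following fits a simple feed-forward classifier on a synthetic binary distress outcome. It is not the Deep Grassmannian Network of Alam et al. (2021); the data, layer sizes, and class balance are illustrative assumptions only.

```python
# A minimal sketch of a "deep" feed-forward classifier for a binary distress
# outcome. This is NOT the Grassmannian network of Alam et al. (2021); it
# simply illustrates the idea of stacking several hidden layers. The data
# are synthetic and the settings are hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a firm-year panel: 26 features, unbalanced classes
X, y = make_classification(n_samples=10_000, n_features=26, weights=[0.87],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Three hidden layers (64, 32, 16 units) -> a "deep" network by the
# definition used above; inputs are standardized first.
deep_net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32, 16), activation="relu",
                  max_iter=500, random_state=0))
deep_net.fit(X_train, y_train)

prob_test = deep_net.predict_proba(X_test)[:, 1]
print("Out-of-sample AUC:", round(roc_auc_score(y_test, prob_test), 3))
```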

3.8 Advantages of Machine Learning for Corporate Failure Research Modern machine learning models have several properties that are particularly appealing for distress risk and corporate failure research. For instance, the gradient boosting model, random forests, and deep learning models are demonstrably more suitable for high dimensional, nonlinear analysis, which arguably better reflects the real-world context of corporate failure. These models also appear to be more suitable given the characteristics of corporate failure datasets. In contrast to conventional models, gradient boosting, random forests, and related methods are less sensitive to outliers (the algorithm simply isolates the outliers in a separate node that does not affect the performance of the final tree). These methods are also less


sensitive to monotonic transformations. For instance, transforming one or several variables to their logarithm or square root will not change the structure of the tree itself (it only affects the splitting values of the transformed variable). For conventional models, missing values can also present serious data problems. Typically, missing values are removed case-wise (sacrificing sample) or they must be imputed. However, a gradient boosting model such as TreeNet (which is illustrated in Chapter 4) handles missing values by building surrogates from other correlated variables available in the dataset, thus preserving the sample. In this sense, the model exploits multicollinearity to enhance performance, whereas multicollinearity tends to diminish the performance of conventional models. Furthermore, while multicollinearity among predictors is a serious issue for parametric models such as logit or probit (diminishing the interpretability and stability of parameter estimates), it is simply an issue of redundant information in gradient boosting models. Another benefit of machine learning models is that the approach can eliminate "data snooping" bias and p-hacking (Jones, 2017). This occurs when a model is continually re-estimated with different iterations of the explanatory variables to arrive at the best model. This approach can result in a "good" model produced by chance rather than from something innately meaningful in the data. The high dimensional gradient boosting model utilizes the full set of input variables and automatically detects all important interaction effects. This approach helps remove bias associated with data snooping. Having discussed a range of modern machine learning methods, I now provide a comprehensive empirical demonstration of several of these models.
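To make the missing-value point above concrete, the short sketch below fits a boosted-tree classifier directly on data containing missing entries. Note that scikit-learn's HistGradientBoostingClassifier handles missing values by learning a default split direction, which differs from TreeNet's surrogate-variable approach, but the practical benefit (no case-wise deletion or imputation) is similar. The data are synthetic.

```python
# A minimal sketch of the "dirty data" point: a boosted-tree model fitted
# directly on data containing missing values, with no imputation step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=1)

# Knock out 10% of entries at random to mimic missing financial data
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.10] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
model = HistGradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
print("AUC with missing values left in place:",
      round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```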

Key Points From Chapter 3 This chapter introduces a range of statistical learning methods, including neural networks, CART, MARS, gradient boosting, AdaBoost, and random forests. While first-generation neural network models were initially very popular in the 1980s and 1990s, several studies have shown that the predictive power of these models was generally not much better than simple linear classifiers such as LDA and logit (and, in some cases, the classification performance has been worse). Given the inherent lack of interpretability of neural networks and lukewarm empirical results, many researchers started to lose interest in this method. The CART methodology was also very popular and continues to be used in many research studies. While CART models are highly interpretable, they tend to predict poorly out of sample. CART has been superseded by more powerful machine learning methods such as AdaBoost, gradient boosting, and random forests. Modern machine learning methods such as gradient boosting, adaptive boosting, and random forests are ensemble methods based on the concept of boosting. Ensemble methods use many models (which can be hundreds or even thousands of models) that are fitted to different regions of the data. The average effect of applying many models results in much better model fits and predictions than using just one model (this is the concept


behind “boosting”). The core idea is that each model in the overall ensemble is a “weak” predictor (it only has to predict a little better than a guess). The idea that an ensemble of many weak models can converge to produce a very strong predictive model was an important breakthrough in machine learning that has led to a new generation of machine learning methods based on boosting. Of all the tree-based machine learning methods, the gradient boosting model seems to consistently provide the strongest out-of-sample predictive performance in corporate failure studies. The model also has good interpretability through relative variable importance scores, partial dependency plots, and interaction effects (as will be shown in Chapter 4). Deep learning is another potentially powerful method in corporate failure research. While the literature is still very limited, a recent study by Alam et al. (2021) based on US data showed that a deep learning model outperforms a hazard model on Type I classification performance. Modern machine learning methods offer many advantages for distress prediction and corporate failure research, particularly given some of the general characteristics of bankruptcy datasets. Machine learning methods such as gradient boosting are more able to handle “dirty data” issues, such as non-normalness, missing values, database errors, and so on. Machine learning methods are also designed to handle situations where the number of predictors is greater than the sample size (p > n), which is useful considering that corporate failure samples tend to be quite small. Because modern machine learning methods require minimal researcher intervention, they can reduce “snooping bias”. Because modern machine learning methods are high dimensional, they can be used to test the predictive power of many hundreds, if not thousands, of features without compromising model stability and performance.

Note 1 Bai et al. (2004) reported that around 94% of Chinese listed companies were designated ST as a result of either making two consecutive years of losses or because shareholder’s equity was lower than registered capital. Zheng and Yanhui (2007) also reported that most ST companies in China arise from some type of financial abnormality.

4 AN EMPIRICAL APPLICATION OF MODERN MACHINE LEARNING METHODS

1. Introduction In this chapter, I provide a comprehensive demonstration of some of the statistical learning methods discussed in Chapter 3. I compare the performance of logit, CART, multivariate adaptive regression splines (MARS), and generalized lasso techniques with advanced machine learning methods such as random forests and gradient boosting machines. While machine learning methods are sometimes described as "black boxes", this obstacle has largely been overcome as new machine learning methods provide extensive analytical and data visualization tools to facilitate model interpretation, such as variable importance scores, partial dependency plots (or marginal effects), and interaction effects, which are discussed further in this chapter. The performance of machine learning models can be evaluated using standard outputs such as average log-likelihood, confusion matrices, and the ROC curve in the case of binary models.1 In this chapter, we also demonstrate a number of useful diagnostic tests that should be considered when evaluating the overall stability and performance of machine learning models. For the purposes of this empirical examination, I use Salford Predictive Modeler (SPM) version 8.3, a leading commercial machine learning software package developed by Salford Systems. As commercial software, SPM has very extensive diagnostic and data visualization capabilities far exceeding any open-source software such as R. The SPM machine learning models also have much faster estimation speeds than open-source software such as R. For instance, running random forest models in R can take several hours when there are hundreds or thousands of input features, whereas estimation only takes a few minutes in SPM. Furthermore, Salford Systems developed TreeNet (as part of the SPM suite of models), which is a commercial version of gradient boosting machines.


TreeNet is very well known in the machine learning literature. It was developed in collaboration with Jerome Friedman (who developed the original gradient boosting model in 2001 that is cited in this book) and Salford Systems. TreeNet has won numerous forecasting competitions and is widely considered one of the leading machine learning applications available. According to Salford Systems, TreeNet's high predictive accuracy comes from the algorithm's capacity to generate thousands of small decision trees built in a sequential error-correcting process that converges to a highly accurate model. Other reasons for its impressive predictive power include the power of TreeNet's interaction detection engine. TreeNet establishes whether higher order interactions are required in a predictive model. The interaction detection feature not only helps improve model performance (sometimes dramatically) but also assists in the discovery of valuable new variables and patterns in the dataset. This software has automatic features for detecting higher order interaction effects across any number of variables (not available in most packages or older methods). TreeNet also has optimization features to help the analyst find the best tree depth, which maximizes predictive performance. For the purposes of this illustration, I use a large sample of international company bankruptcies extracted from Standard and Poor's Capital IQ service, one of the largest financial and risk database providers in the world. Corresponding annual financial, market, and other input variables are extracted from a customized software application developed with the researcher by Capital IQ technical staff. The sample includes 1,221 international corporate bankruptcies (most of which are from the US) extracted from the Capital IQ database and 8,231 non-failed entities. The total firm year observations used in this illustration include 7,326 failure firm year observations and 49,317 non-failure firm year observations. The corporate failure sample spanned 25 years (1990 to 2015). For the purposes of this illustration, we use 26 feature variables, which are displayed in Appendix 1. These features include financial variables, market price variables, corporate governance proxies, and other variables. The dependent variable for this illustration is a binary outcome, where a firm that is observed to enter legal bankruptcy is coded "0", and an active or non-failed firm is coded "1". As most of our bankrupt firms are from the US, bankrupt firms are firms that entered either Chapter 11 or Chapter 7. Following conventions in the statistical learning literature, I compared all models out-of-sample by randomly allocating 80% of the total observations to the training data and 20% of observations to the test (or holdout) sample. For all models, out-of-sample predictive accuracy is evaluated using several classification statistics. However, the ROC curve is the most ubiquitous measure in the literature for comparing the classification performance of different classifiers (Hastie et al., 2009). The ROC curve plots the true positive rate (sensitivity) relative to the false positive rate (1 − specificity) with respect to the discretionary cut-off score. (For the binary classifiers, this score is the predicted probability of failure.) Convention


suggests that AUC scores greater than 0.9 indicate a very strong classifier, exhibiting an excellent balance between sensitivity and specificity across different probability thresholds. An AUC score of .5 or less suggests the model does no better than a random guess.
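A minimal sketch of this evaluation protocol, assuming synthetic data in place of the Capital IQ sample, is shown below: an 80/20 split, predicted failure probabilities on the holdout observations, and the ROC curve and AUC computed with scikit-learn.

```python
# A minimal sketch of the evaluation protocol described above: an 80/20
# train/test split, predicted probabilities on the holdout sample, and the
# ROC curve / AUC. Synthetic data stand in for the actual bankruptcy sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=8_000, n_features=26, weights=[0.87],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                           random_state=42, stratify=y)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
p_te = clf.predict_proba(X_te)[:, 1]          # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_te, p_te)  # points on the ROC curve
print("Holdout AUC:", round(roc_auc_score(y_te, p_te), 3))
```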

1.1  Gradient Boosting/TreeNet Results The adjustable parameters for our gradient model include the interaction depth (number of trees), shrinkage (learn rate), maximum number of nodes per tree, minimal number of terminal nodes, choice of loss function, and other settings. For the purposes of this study, I use all the standard default settings for TreeNet available in the SPM software, which include a learn rate of .001 (this is considered quite conservative), a maximum node per tree of 6 (this is because TreeNet produces many smaller or shallow trees), and minimum terminal nodes of 10. For binary classification problems, the logistic loss function is commonly used as the loss function. The average log-likelihood (negative), overall raw misclassification rates, and ROC curve for both learn (estimation) and test samples are shown in Figures 4.1, 4.2, and 4.3. Note that the raw classification rate can often look very accurate, but this can be a little deceiving when the sample is unbalanced, as is the case with most corporate failure datasets. For instance, the high raw classification success can mean that the TreeNet model was more accurate in predicting Type II errors than Type I errors, and it is the Type I error that is more consequential, as discussed in previous chapters. Hence, we need to look carefully at the confusion matrix to get a better idea of classification accuracy. Figure 4.1 displays the average log likelihood of the gradient boosting model as a function of the number of trees, while Figure 4.2 displays the raw misclassification rate of the model as a function of the number of trees. Figure 4.3 shows the gains chart or ROC curve displaying the true positive rate (y-axis) and false positive rate (x-axis). The test error curve in Figure 4.2 lies slightly above the learn error curve, and they are in close agreement, which indicates no evidence of over-fitting on the test sample (Hastie et al., 2009). Over-fitting can also be observed when the test error starts to rise above the learn curve.
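For readers without access to SPM, a roughly analogous configuration can be set up in an open-source gradient boosting implementation. The mapping of TreeNet's settings (learn rate, maximum nodes per tree, minimum terminal node size) to scikit-learn's parameter names below is an assumption made for illustration only, not a statement about how TreeNet is implemented.

```python
# A hedged sketch of an open-source analogue to the settings quoted above.
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    learning_rate=0.001,   # conservative shrinkage ("learn rate")
    max_leaf_nodes=6,      # many small/shallow trees
    min_samples_leaf=10,   # minimum observations in a terminal node
    n_estimators=2000,     # a slow learn rate needs a large number of trees
    random_state=0)        # default loss for classification is the logistic loss
# gbm.fit(X_train, y_train) would then be monitored on a holdout sample to
# choose the number of trees at which the test error stops improving.
```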

FIGURE 4.1  Average LogLikelihood (Negative)


FIGURE 4.2  Misclassification Rate Overall (Raw)

FIGURE 4.3  ROC Curve of Estimation and Test Samples

Table 4.1 displays the error summaries for the gradient boosting model for the learn and test samples, including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold). The AUC is a measure of overall model performance tied closely to the ability of the gradient boosting model to correctly rank records from most likely to least likely to be a 1 (non-failure) or 0 (bankruptcy). Figure 4.3 and Table 4.1 indicate that the gradient boosting model has an out-of-sample AUC of 0.984, which is optimized at 1,956 trees (here optimization simply means that at this tree depth, the test error curve starts to rise upwards, suggesting model overfitting).
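The sketch below illustrates, with purely hypothetical probabilities and an illustrative cut-off, how the threshold-based measures used throughout this chapter (specificity, sensitivity, and the Type I and Type II error rates) are derived from predicted probabilities and a chosen cut-off score. The labelling follows the convention used in the tables of this chapter, where class 1 denotes non-failure.

```python
# A minimal sketch of deriving threshold-based error measures from predicted
# probabilities. The probabilities and the 0.88 cut-off are illustrative.
import numpy as np

def classification_errors(y_true, prob_nonfailure, cutoff):
    """Classify as non-failed (1) when the predicted probability of
    non-failure is at least the cut-off, then tally the error rates."""
    y_pred = (prob_nonfailure >= cutoff).astype(int)
    failed, nonfailed = (y_true == 0), (y_true == 1)
    type1 = np.mean(y_pred[failed] == 1)      # failed firm classified as non-failed
    type2 = np.mean(y_pred[nonfailed] == 0)   # non-failed firm classified as failed
    return {"specificity": 1 - type1,         # accuracy on actual failures
            "sensitivity": 1 - type2,         # accuracy on actual non-failures
            "type_I_error": type1, "type_II_error": type2}

y_true = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
prob   = np.array([0.10, 0.92, 0.95, 0.90, 0.97, 0.85, 0.99, 0.91, 0.93, 0.96])
print(classification_errors(y_true, prob, cutoff=0.88))  # baseline-style cut-off
```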


The classification accuracy displayed in Figure 4.2 is based on a simple tally of how frequently the model classifies an observation correctly or incorrectly over the tree depth. As can be seen by the model summary in Table 4.1, the gradient boosting model has very good out-of-sample predictive accuracy. The model error summaries in Table 4.1 indicate that the gradient boosting model has only misclassified 3.277% of observations for the test sample based on raw classification success. The baseline threshold out-of-sample accuracy rate is 93.9%. The baseline threshold accuracy rate reflects the actual balance of failures vs non-failures in the sample. For instance, if the sample has 88% non-failures, the threshold of .88 is used by TreeNet as the cut-off for non-failure and 12% for failed firms. The confusion matrix is reported in Table 4.2. The Type I and Type II error rates appear to be significantly lower than those reported in many other corporate failure studies (as discussed in Chapter 2) and are also consistent with the findings of Jones (2017). For instance, Table 4.2 shows that using the balanced threshold, the gradient boosting model is 94.03% accurate in predicting actual failures (in other words, the Type I error is 5.97%) and 93.82% accurate in predicting non-failures (the Type II error is 6.18%). The balanced threshold cut-off is intended to rebalance unequal class sizes. TreeNet's automatic re-weighting ensures that all of the weights in each class are equal, eliminating any need for manual balancing of classes via record selection or definition of weights (see Salford Systems, 2019). To ensure consistency across all models, we use balanced threshold classification to compare out-of-sample classification success. It can be seen from Table 4.3 that 23 of the 26 predictor variables used in this illustration have non-zero variable importances or RVIs. This means that all 23

TABLE 4.1  Model Error Measures (Gradient Boosting Machines)

Name                                                 Learn      Test
Average LogLikelihood (negative)                     0.06958    0.09373
ROC (area under curve)                               0.99325    0.98472
Variance of ROC (area under curve)                   0.00000    0.00000
Lower confidence limit ROC                           0.99228    0.98173
Upper confidence limit ROC                           0.99423    0.98770
Lift                                                 1.14688    1.15114
K-S stat                                             0.92375    0.87895
Misclass rate overall (raw)                          0.02335    0.03277
Balanced error rate (simple average over classes)    0.03916    0.06276
Class. accuracy (baseline threshold)                 0.95346    0.93989

Table 4.1 summarizes the predictive performance of the TreeNet (gradient boosting model), including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold).

TABLE 4.2  Confusion Matrix – Gradient Boosting

Actual Class   Total Class   Percent Correct   Predicted 0 (N = 2,029)   Predicted 1 (N = 9,383)
0              1,507         94.03%            1,417                     90
1              9,905         93.82%            612                       9,293
Total:         11,412

Average: 93.92%   Overall % correct: 93.85%   Specificity: 94.03%   Sensitivity/Recall: 93.82%   Precision: 99.04%   F1 statistic: 96.36%

variables contributed to out-of-sample predictive success in some way, although the strength of different predictors varies quite significantly. The RVIs reported in Table 4.3 are expressed on a scale between 0 and 100, where the most important variable always receives a score of 100 from the algorithm and all other variables are rescaled to reflect their importance relative to the most important variable. RVIs are calculated relative to the predictive power of all other variables in the model. Because gradient boosting uses decision trees as the base learner, the RVI is based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged over all trees (Friedman and Meulman, 2003; Hastie et al., 2009). The top 20 variables in Table 4.3 in order of importance are: one year excess returns, which is the top ranked variable with an RVI of 100; market capitalization to total debts (RVI = 71.36); percentage of stock owned by the top five shareholders, (RVI  =  59.44); cash flow per share (RVI  =  46.74); percentage institutional ownership of shares (RVI = 40.16); one year beta (RVI = 38.02); total bank debt (RVI = 29.94); six month annual returns (RVI = 29.76); earnings per share (RVI  =  28.16); current ratio (RVI  =  27.64); interest cover (RVI  =  20.29); Altman Z score (RVI  =  19.42); EBIT margin (RVI  =  19.38); average debt collection period (RVI  =  18.63); gearing ratio (RVI  =  18.16); EBITTA (RVI  =  14.99); average inventory turns (RVI  =  14.65); sales to total assets (RVI = 13.66); total debt to total assets (RVI = 13.52); and total liabilities to total equity (RVI = 12.42). The interesting feature of these results is that the best performing gradient boosting model included a combination of market price variables, financial variables, and other variables. While market price variables featured quite strongly, institutional ownership and financial variables also had quite strong predictive power in

TABLE 4.3  Variable Importance Scores for Gradient Boosting Model

Variable                                          Score
Excess return (12 months)                         100.00
Market capitalization to total debt                71.36
Percent owned by top five shareholders             59.44
Cash flow per share                                46.74
Institutional percent owned                        40.16
One year beta                                      38.02
Total bank debt                                    29.94
Excess return (6 months)                           29.76
Earnings per share (basic)                         28.16
Current ratio                                      27.64
Interest cover ratio                               20.29
Altman Z score                                     19.42
EBIT margin                                        19.38
Average debt collection period                     18.63
Gearing ratio                                      18.16
EBIT to total assets                               14.99
Average inventory turnover                         14.65
Sales to total assets                              13.66
Debt to total assets                               13.52
Total liabilities to total equity                  12.42
Working capital to total assets                    12.16
Short term debt to total liabilities               12.05
Identifiable intangible assets to total assets      9.21
Total audit fees                                    0.00
Percentage of external directors                    0.00
Insider trades percent of shares outstanding        0.00

Table 4.3 shows the relative variable importances (RVIs) for the TreeNet/gradient boosting model estimated on all explanatory variables. RVIs include the effects of all important two-way interaction effects having predictive value in the model. The RVI is based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and then averaged over all trees. Hence, RVIs are calculated relative to all other input variables in the model. The RVIs in Table 4.3 are ranked according to their contribution to the overall predictive success of the model. Since these measures are relative, it is customary to assign the largest or most important variable a value of 100 and then scale all other predictors accordingly. Table 4.3 shows that 23 of the 26 input variables used in this illustration have non-zero importance. The RVIs are also dispersed across a number of bankruptcy dimensions, such as market-price variables, accounting-based indicators, ownership concentration/ structure variables, and other variables.

the overall model. The gradient boosting model supports the findings of previous machine learning literature that using both a combination of different features (a multi-dimensional approach) and a high dimensional approach (using a large number of features) generates more optimal corporate failure prediction models.
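As an indication of how RVI scores of this kind can be computed outside TreeNet, the following sketch rescales the impurity-based importances of an open-source boosted-tree model so that the strongest predictor takes the value 100. The data and feature names are synthetic stand-ins, not the Capital IQ variables, and the importance measure is scikit-learn's rather than TreeNet's.

```python
# A minimal sketch of relative variable importance (RVI) scores: raw
# importances rescaled so that the strongest predictor takes the value 100.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5_000, n_features=10, n_informative=6,
                           random_state=0)
features = [f"x{i}" for i in range(10)]     # hypothetical feature names

gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
raw = gbm.feature_importances_              # split-improvement importances
rvi = 100 * raw / raw.max()                 # rescale: top variable = 100

print(pd.Series(rvi, index=features).sort_values(ascending=False).round(2))
```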


Having established that the gradient boosting model has very strong out-ofsample predictive accuracy, it is also important to assess whether the explanatory variables make sense in terms of their role and influence on the corporate failure outcome. A  major benefit of conventional failure models (such as logit) is that they are highly interpretable. Highly flexible (nonlinear) models, such as neural networks and support vector machines, often predict well but are widely viewed as black boxes from an interpretative standpoint (Jones et al., 2015). The gradient boosting model, including commercial versions of the model such as TreeNet, is a relatively recent statistical learning technology that provides outputs that allow the researcher to see into the black box, particularly through RVIs and partial dependency plots and interaction effects. The RVIs reported in Table  4.3 are valuable for displaying the strength or influence that a particular explanatory variable has on overall classification performance. However, the RVIs themselves provide no indication of the direction of its relationship with the failure outcome. For example, RVIs do not show whether the excess returns variable increases or reduces the probability of failure. Marginal effects or partial dependency plots reveal both the direction and strength of the relationship between explanatory variables and the corporate failure outcome, after holding all other variables in the model constant. While parameter estimates from a conventional model (such as logit) are always linear with respect to the outcome dependent variable, the marginal effects from a gradient boosting model capture all nonlinear impacts, which can be more informative and descriptive of the behaviour of corporate failure predictors in their real-world context. In fact, the marginal effects of each Table 4.3 variable proved to have nonlinear relationships with the failure outcome, after holding all other variables in the model constant. However, in a number of instances, the broad directions of the underlying relationships are interpretable and seem to make intuitive sense. This can be demonstrated with several examples from the Table 4.3 variables. To make it easier to interpret the partial dependency plots, the graphical depictions include a first-order single knot spline that smooths out the relationship and reveals the overall direction of the marginal effects on the corporate failure outcome. Figures 4.4 to 4.19 display the partial dependency plots for several of the variables reported in Table 4.3. From Figure 4.4, it can be seen that the one-year excess returns variable has a strong but nonlinear impact on the failure outcome, after holding all other variables in the model constant. That is, one-year excess returns are increasing (decreasing) of the non-failure (failure) outcome. In other words, the relationship is strongly asymmetric, which is consistent with the findings of Jones (2017). Lower and negative excess returns have a much stronger impact on the failure outcome. This is consistent with the distress anomaly discussed in Chapter 2, where distressed firms exhibit a pattern of lower abnormal returns. Figure 4.5 displays the partial dependency plot for the market capitalization-to-total debt ratio. Figure 4.5 shows that, after holding all other variables in the model constant, the market capitalization-to-debt ratio has a counter-intuitive direction with the failure outcome;


FIGURE 4.4  Partial Dependency Plot for Excess Returns (12 Months) and Failure Outcome

FIGURE 4.5  Partial Dependency Plot for Market Capitalization to Total Debt and Failure Outcome


however, the relationship is highly nonlinear. There is a stronger impact when market capitalization has a value of less than 2, but little or no effect for values of this ratio above 3. Higher market capitalization to debt is increasing of non-failure, which is not as expected. This could be the result of interaction effects or nonlinear relationships with other variables in the model. Figure 4.6 displays the partial dependency plot for percentage of shares owned by the top five shareholders. After holding all other variables in the model constant, higher percentage ownership is increasing of non-failure, which is as expected. As noted by Jones (2017), there are good theoretical reasons for expecting an association between these variables and corporate failure. In a practical sense, large stockholders and institutional investors have a voice in all aspects of the corporate failure process (such as Chapter 11). As the Chapter 11 process is typically complex, costly, and ultimately destructive to stockholder value, there are likely to be strong incentives for these stakeholders to consider all available options and remedial actions that avoid Chapter 11. Prior literature suggests that stockholder concentration is positively correlated with firm value and performance and that large stockholders can influence company performance in various ways. Figure 4.7 displays the partial dependency plot for the cash flow per share variable. After holding all other variables in the model constant, higher cash flow per share is also increasing of non-failure, which is as expected, as it is well established

FIGURE 4.6  Partial Dependency Plot for Percent of Shares Owned by the Top 5 Shareholders and Failure Outcome


FIGURE 4.7  Partial Dependency Plot for Cash Flow per Share and Failure Outcome

that healthy firms are expected to have stronger cash flow positions. However, the relationship is strongly asymmetric, with negative cash flow per share having a much stronger impact on the failure outcome. Figure 4.8 displays the partial dependency plot for institutional ownership. A higher percentage of institutional ownership is increasing of non-failure, which is as expected and is consistent with the results reported in Figure 4.6. From Figure 4.9, it can be seen that beta (a measure of the stock price volatility relative to the volatility of the market) does not have a clear relationship with the failure outcome. As indicated from the fitted spline, the relationship looks slightly negative, indicating that higher betas (indicating higher risk) are more associated with the failure outcome. Figure 4.10 displays the partial dependency plot for total bank debt. Holding all other variables in the model constant, total bank debt is strongly increasing of the failure outcome, which is consistent with prior literature that indebtedness is an important predictor of failure. From Figure 4.11, it can be seen that the six-month excess return variable has a strong but nonlinear impact on the failure outcome, which is consistent with the results reported in Figure 4.4. That is, six-month excess returns are increasing (decreasing) of the non-failure (failure) outcome. Again, the relationship is strongly


FIGURE 4.8 Partial Dependency Plot for Institutional Ownership and Failure Outcome

FIGURE 4.9  Partial Dependency Plot for Market Beta (12 Months) and Failure Outcome

asymmetric. Lower and negative excess returns have a much stronger impact on the failure outcome. Figure  4.12 displays the partial dependency plot for the earnings per share variable. Higher EPS is increasing of non-failure, which is as expected as nonfailing firms are expected to have higher earnings capacity relative to failed firms. However, the relationship is strongly asymmetric, with negative values of EPS


FIGURE 4.10  Partial Dependency Plot for Total Bank Debt and Failure Outcome

FIGURE 4.11  Partial Dependency Plot for Excess Returns (6 Months) and Failure Outcome

having a stronger impact on the failure outcome. Figure 4.13 displays the partial dependency plot for the current ratio. The current ratio is strongly increasing of failure, which is not as expected but is consistent with prior studies such as Beaver (1966). As there is a higher frequency of failure among new and recently


FIGURE 4.12  Partial Dependency Plot for Earnings per Share and Failure Outcome

FIGURE 4.13  Partial Dependency Plot for the Current Ratio and Failure Outcome


established companies, the companies might have high cash positions from investor contributions. As suggested by Beaver (1966), this might reflect some level of balance sheet management. Figure 4.14 displays the partial dependency plot for the interest cover ratio. Figure 4.14 indicates there is no clear direction with the interest cover ratio, as the relationship is highly nonlinear. Figure 4.15 displays the partial dependency plot for the Altman Z score. While the Altman Z score was not a highly ranked variable in the overall machine learning results, the relationship is strongly positive, suggesting, after holding all other variables in the model constant, that higher Altman Z scores are strongly associated with non-failure, as expected. Figure 4.16 displays the partial dependency plot for EBIT margin and shows a highly nonlinear but positive relationship with the non-failure outcome, which is as expected and consistent with the results reported in Figure 4.12. Figure 4.17 displays the partial dependency plot for average debt collection period and shows a highly nonlinear but negative relationship with the non-failure outcome, as expected. That is, failing firms tend to have poorer cash management practices reflected in higher debt collection periods. Figure 4.18 displays the partial dependency plot for the gearing ratio and shows a highly nonlinear but negative relationship with the non-failure outcome, as expected. That is, failing companies tended to have higher gearing ratios. Finally, Figure 4.19 displays the partial dependency plot for the working capital to total assets ratio, which displays a nonlinear but positive relationship with the non-failure

FIGURE 4.14  Partial Dependency Plot for Interest Cover and Failure Outcome


FIGURE 4.15  Partial Dependency Plot for the Altman Z Score and Failure Outcome

FIGURE 4.16  Partial Dependency Plot for EBIT Margin and Failure Outcome


FIGURE 4.17  Partial Dependency Plot for Average Debt Collection Period and Failure Outcome

FIGURE 4.18  Partial Dependency Plot for the Gearing Ratio and Failure Outcome

outcome as expected. That is, failing companies have lower levels of working capital to total assets compared to non-failed firms. Despite the mathematical complexity of advanced machine learning models, and their capacity to capture all nonlinearities among explanatory variables, the partial dependency plots are remarkably stable and, in most cases, consistent with


FIGURE 4.19  Partial Dependency Plot for Working Capital to Total Assets and Failure Outcome

prior expectations about the role and influence of these variables on the corporate failure outcome.
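A minimal sketch of how a partial dependency (marginal effect) plot of this kind can be generated with open-source tools is shown below. The fitted model, data, and plotted feature are illustrative assumptions; TreeNet produces these plots natively, so the sketch is not a description of its internals.

```python
# A minimal sketch of a partial dependency plot of the kind shown in
# Figures 4.4 to 4.19, using scikit-learn's partial dependence tools.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=5_000, n_features=8, n_informative=5,
                           random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# Partial dependence of the predicted outcome on feature 0, averaging over
# (i.e., "holding constant") the joint distribution of the other features.
PartialDependenceDisplay.from_estimator(gbm, X, features=[0])
plt.show()
```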

1.1.1  Interaction effects Another advantage of gradient boosting models, such as TreeNet, is that the algorithm can automatically detect all higher order interaction effects among the features. Table 4.4 reports the variable interactions, ranked according to their overall interaction magnitude. These are the interaction effects that contributed most to the overall predictive power of the gradient boosting model. Generally, regression models (such as LDA and logit) are not designed to capture high-order interaction effects and have very limited capacity to do so (the researcher must modify the data to include interaction effects, which can result in a lengthy trial and error process and potentially lead to data snooping). The interaction magnitudes reported in Table 4.4 are on a percentage scale. For example, the whole variable interaction score for the variable "percent of shares owned by top five shareholders" is 25.21%, which means 25.21% of the total variation in the predicted response can be attributed to an interaction of this variable with any other variable in the model. For the variable "institutional percent owned", the whole variable interaction is 24.60%, which means 24.60% of the total variation in the predicted response can be attributed

TABLE 4.4  Whole Variable Interactions for Strongest Variable Effects for Gradient Boosting Model

Predictor                                  Interaction Weight   Dimension
Percent owned by top five shareholders     25.21                Ownership concentration/structure
Institutional percent owned                24.60                Ownership concentration/structure
Market capitalization to total debt        18.62                Market price
Total bank debt                            12.61                Accounting
Earnings per share (basic)                 11.46                Accounting
One year beta                              11.32                Market price
Average debt collection period             10.73                Accounting
Cash flow per share                         9.51                Accounting
Current ratio                               9.21                Accounting
Excess return (6 months)                    8.99                Market price
Interest cover                              8.20                Accounting
Excess return (12 months)                   8.16                Market price
EBIT margin                                 7.40                Accounting
Average inventory turnover                  7.12                Accounting
Working capital to total assets             6.53                Accounting
Debt to total assets                        6.10                Accounting
Short term debt to total liabilities        5.61                Accounting
Altman Z score                              5.51                Accounting
Gearing ratio                               4.77                Accounting
Sales to total assets                       4.34                Accounting
EBIT to total assets                        4.32                Accounting
Total liabilities to total equity           4.31                Accounting

Table 4.4 displays the “whole variable” interactions sorted according to their overall interaction strength. These are the most important interacting variables that contributed most to the predictive performance of the model reported in Table 4.3. They are ranked according to their interaction weight.

to an interaction of this variable with any other variable in the model. Likewise, the whole variable interaction for "market capitalization to total debt" is 18.62%, which means that 18.62% of the total variation in the predicted response can be attributed to an interaction of this variable with any other variable in the model. These relationships can also be demonstrated graphically. Figure 4.20 shows the interaction between two variables: one-year excess returns and percentage of institutional ownership. To view this image in full-colour, please visit www.routledge.com/9781138652507 to download the eResource materials.


FIGURE 4.20  Interaction Effects Between Excess Returns and Institutional Ownership

Figure 4.20 indicates the areas where the interaction effects of the two variables produce the highest rate of change in the response variable, as well as areas where the interaction effects have a lower impact on the response outcome. The figure indicates a strong initial positive interaction effect between these variables, particularly where the percentage of institutional ownership is upwards of the .25% level. At this level, the percentage of institutional ownership increases the impact of excess returns on the non-failure outcome. However, beyond the .25% level, the level of institutional ownership reduces the impact of excess returns on the non-failure outcome.
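For readers who wish to reproduce this kind of output outside the SPM/TreeNet environment, the following minimal sketch generates a two-way partial dependence surface analogous to Figure 4.20 with an open-source gradient boosting implementation. The data and the feature pair are illustrative assumptions, not the Capital IQ variables or TreeNet's interaction detection engine.

```python
# A minimal sketch of a two-way (joint) partial dependence plot, which shows
# how the effect of one variable on the predicted outcome changes with the
# level of another variable.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=5_000, n_features=8, n_informative=5,
                           random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# Passing a tuple of two features produces a joint partial dependence surface.
PartialDependenceDisplay.from_estimator(gbm, X, features=[(0, 1)])
plt.show()
```

Having discussed the outputs of the gradient boosting model, we now turn to the random forests model.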

1.2  Random Forests We now compare the gradient boosting results with the random forests model, which was described in Chapter 3. Table 4.5 provides the model error summaries for the random forests model using the same features and sample as that used for the gradient boosting model. As can be seen from Table 4.5, the random forests model also shows very strong out-of-sample prediction success with an out-of-sample AUC score of .951. While this is not as strong as gradient boosting, it is nevertheless very impressive. ­Figure 4.21 displays the plot of the ROC curve for the training and test samples. The out-of-sample raw misclassification error rate, however, is substantially higher than gradient boosting at 12.49%. Table 4.6 displays the confusion matrix for the random forests model based on the balanced threshold cut off. As indicated by Table 4.6, the model shows quite strong out-of-sample classification performance on Type I and Type II errors; however, the model does more poorly than gradient boosting. The Type I error for the random forest model is 9.16% (cf. 5.97% for gradient boosting) and the Type II error rate is 9.50% (cf. 6.18% for gradient boosting). The random forests variable importance scores are shown in Table  4.7. The random forest model also shows that 23 out of the 26 variables have relative variable importance above 0, indicating they have predictive power in the overall model.

TABLE 4.5  Model Error Measures (Random Forests)

Name                                                 OOB        Test
Average LogLikelihood (negative)                     2.18831    2.26360
ROC (area under curve)                               0.95548    0.95145
Variance of ROC (area under curve)                   0.00000    0.00001
Lower confidence limit ROC                           0.95333    0.94695
Upper confidence limit ROC                           0.95764    0.95594
Lift                                                 1.14753    1.15203
K-S stat                                             0.82668    0.81542
Misclass rate overall (raw)                          0.12337    0.12496
Balanced error rate (simple average over classes)    0.22850    0.22957
Class. accuracy (baseline threshold)                 0.60525    0.60585

Table 4.5 summarizes the predictive performance of the random forests model, including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold).

FIGURE 4.21  ROC Curve of Estimation and Test Samples (Random Forests Model)

However, the ranking of the RVIs is a little different for the random forests model. For instance, the top ten ranked variables for the random forests model are as follows (in order of magnitude): percentage of shares owned by the top five shareholders (RVI = 100); the percentage of shares owned by institutions (RVI = 36.71); six month excess returns (RVI = 31.65); one year beta (RVI = 31.30); cash flow

TABLE 4.6  Confusion Matrix – Random Forests Model

Actual Class   Total Class   Percent Correct   Predicted 0 (N = 2,309)   Predicted 1 (N = 9,103)
0              1,506         90.84%            1,368                     138
1              9,906         90.50%            941                       8,965
Total:         11,412

Average: 90.67%   Overall % correct: 90.55%   Specificity: 90.84%   Sensitivity/Recall: 90.50%   Precision: 98.48%   F1 statistic: 94.32%

per share (RVI = 29.15); basic EPS (RVI = 25.75); market capitalization to total debt (RVI = 22.86); one year excess return (RVI = 20.24); average inventory turnover (RVI = 19.23); and the Altman Z score (RVI = 15.20). While gradient boosting showed a steadier distribution of RVIs from highest to lowest, random forests reveal a much sharper change in variable importance scores – in other words, the random forests tended to make more use of a few top ranked variables for prediction and assigned lower scores for the majority of other variables in the model. Random forests, by construction, do not have interaction effects. However, partial dependency plots can be created for random forests in SPM by running TreeNet with the random forest loss function. Most of the partial dependency plots were consistent with the TreeNet results. Other outputs are possible, such as class probability heatmaps, as shown in Figure 4.22. To view this image in full-colour, please visit www.routledge.com/9781138652507 to download the eResource materials. The red region of the heat map represents the failure class and the blue region represents the non-failure class. The "rows" of the map are the out-of-bag observations, and on the columns are the target variable classes. The colours represent the probability of a row belonging to a given class; ideally, class 0 (the first column) would be all red and class 1 all blue. The darker the colour, the lower the probability for a record to be in that particular class.
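For completeness, the following is a minimal open-source sketch of a random forests classifier with out-of-bag (OOB) evaluation, the counterpart of the OOB column reported in Table 4.5. The data and settings are illustrative assumptions rather than the sample used in this chapter.

```python
# A minimal sketch of a random forests classifier with out-of-bag evaluation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=26, weights=[0.87],
                           random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1,
                            random_state=0).fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3))           # out-of-bag estimate
print("Top five features:", rf.feature_importances_.argsort()[::-1][:5])
```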

1.3  Results for Classification and Regression Trees (CART) We now examine the results for the CART model, which is described in Chapter 3. As can be seen from the error summaries displayed in Table 4.8, CART also produced strong out-of-sample prediction success.

TABLE 4.7  Variable Importance Scores for Random Forests Model

Variable                                          Score
Percent owned by top five shareholders            100.00
Institutional percent owned                        36.71
Excess return (6 months)                           31.65
One year beta                                      31.30
Cash flow per share                                29.15
Earnings per share (basic)                         25.75
Market capitalization to total debt                22.86
Excess return (12 months)                          20.24
Average inventory turnover                         19.23
Altman Z score                                     15.20
Debt to total assets                               15.00
Interest cover ratio                               14.06
Total bank debt                                    13.85
EBIT margin                                        12.41
Sales to total assets                              11.60
Gearing ratio                                      11.52
Average debt collection period                     11.16
Total liabilities to total equity                  10.65
Current ratio                                       8.59
Short term debt to total liabilities                6.20
EBIT to total assets                                5.24
Working capital to total assets                     5.08
Identifiable intangible assets to total assets      3.31

Table 4.7 shows the relative variable importance (RVIs) of the random forests model estimated on all explanatory variables. Random forests models do not have interaction effects. The RVI is based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and then averaged over all trees. Hence, RVIs are calculated relative to all other input variables in the model. The RVIs in Table 4.7 are ranked according to their contribution to the overall predictive success of the model. Since these measures are relative, it is customary to assign the largest or most important variable a value of 100 and then scale all other predictors accordingly. Table 4.7 shows that 23 of the 26 input variables used in this illustration have non-zero importance. The RVIs are also dispersed across a number of bankruptcy dimensions, such as market-price variables, accounting-based indicators, ownership concentration/structure variables, and other variables.

In fact, the out-of-sample AUC for the CART model is .930 (see Figure 4.23). While this is not as strong as the gradient boosting and random forests models, it is quite impressive for a first-generation machine learning method. The out-of-sample raw misclassification error rate is also impressive at only 5.84%. Further, looking at the confusion matrix in Table 4.9, the CART model does quite well using the balanced threshold cut-off. Both the specificity and sensitivity metrics


FIGURE 4.22  Class Probability Heat Map for Random Forests Model – FAILURE

TABLE 4.8  Model Error Measures for CART Model

Name                                                 Learn      Test
Average LogLikelihood (negative)                     0.09867    0.72681
ROC (area under curve)                               0.98372    0.93038
Variance of ROC (area under curve)                   0.00082    0.00555
Lower confidence limit ROC                           0.92761    0.78443
Upper confidence limit ROC                           1.00000    1.00000
Lift                                                 1.14758    1.08340
K-S stat                                             0.91689    0.81610
Misclass rate overall (raw)                          0.04068    0.05845
Balanced error rate (simple average over classes)    0.04155    0.09373
Class. accuracy (baseline threshold)                 0.95375    0.92666
Relative cost                                        0.08311    0.18746

Table 4.8 summarizes the predictive performance of the CART model, including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold).

show very good Type I  and Type II classification accuracy (the Type I  error is 9.22% and the Type II error rate is 11.06%). While the Type II error rate is worse than gradient boosting and random forests, the Type I error rate is just as good as random forests.


FIGURE 4.23  ROC Curve of Estimation and Test Samples (CART Model)

TABLE 4.9  Confusion Matrix – CART Model

Actual Class   Total Class   Percent Correct   Predicted 0 (N = 2,463)   Predicted 1 (N = 8,949)
0              1,507         90.78%            1,368                     139
1              9,905         88.94%            1,095                     8,810
Total:         11,412

Average: 89.86%   Overall % correct: 89.19%   Specificity: 90.78%   Sensitivity/Recall: 88.94%   Precision: 98.45%   F1 statistic: 93.45%

The variable importance scores for the CART model are shown in Table 4.10. The CART model also shows that 23 out of the 26 variables used in this illustration have RVIs greater than zero. However, similar to the random forests model, the CART model appears to make more use of a much smaller set of features for prediction (the best features are very similar to the random forests model). However, the variable importance rankings are a little different for the CART model. For instance, the top ten ranked variables for the CART model

TABLE 4.10  Variable Importance for CART Model

Variable                                          Score
Percent owned by top five shareholders            100.00
Institutional percent owned                        45.07
One year beta                                      28.25
Debt to total assets                               21.68
Market capitalization to total debt                18.34
Cash flow per share                                16.91
Excess return (6 months)                           16.35
Earnings per share (basic)                         16.05
Gearing ratio                                      15.26
Average inventory turnover                         12.84
EBIT to total assets                               12.12
Altman Z score                                      9.38
Current ratio                                       8.92
Total bank debt                                     8.39
Working capital to total assets                     7.48
Total liabilities to total equity                   7.21
Interest cover ratio                                6.89
EBIT margin                                         6.87
Total bank debt                                     6.43
Excess return (12 months)                           6.34
Sales to total assets                               5.34
Average debt collection period                      5.14
Identifiable intangible assets to total assets      1.36

Table 4.10 shows the relative variable importance (RVIs) of the CART model estimated on all explanatory variables. The RVI is based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and then averaged over all trees. Hence, RVIs are calculated relative to all other input variables in the model. The RVIs in Table 4.10 are ranked according to their contribution to the overall predictive success of the model. Since these measures are relative, it is customary to assign the largest or most important variable a value of 100 and then scale all other predictors accordingly. Table 4.10 shows that 23 of the 26 input variables used in this illustration have non-zero importance. The RVIs are also dispersed across a number of bankruptcy dimensions, such as market-price variables, accounting-based indicators, ownership concentration/structure variables, and other variables.

are (in order of magnitude): percent of shares owned by the top five shareholders (RVI = 100); the percentage of shares owned by institutions (RVI = 45.07); one year beta (RVI = 28.25); debt to total assets (RVI = 21.68); market capitalization to total debt (RVI = 18.34); cash flow per share (RVI = 16.91); six month excess returns (RVI = 16.35); basic EPS (RVI = 16.05); gearing ratio (RVI  =  15.26); and average inventory turnover (RVI  =  12.84). The CART model also shows a much sharper change in variable importance score – similar


to random forests, the CART model tends to make greater use of the top ranked variables for prediction purposes and assigns lower importance scores for other variables. CART, by construction, also does not have interaction effects, and it is not possible to generate partial dependency plots. Hence, we do not know the direction of the variable importance score through partial dependency plots. However, it is possible to determine the direction of variables through CART decision trees, as illustrated in Chapter 3.
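As a simple illustration of this interpretability property, the sketch below fits a single shallow CART-style tree with scikit-learn and prints its splitting rules, from which the direction of each variable's effect can be read directly. The data, feature names, and depth limit are illustrative assumptions.

```python
# A minimal sketch of a single CART-style decision tree whose splitting rules
# can be printed and read directly.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=5_000, n_features=6, n_informative=4,
                           random_state=0)
feature_names = [f"x{i}" for i in range(6)]   # hypothetical names

cart = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50,
                              random_state=0).fit(X, y)
# The printed rules show the split variable, threshold and class at each node.
print(export_text(cart, feature_names=feature_names))
```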

1.4  Generalized Lasso

The model error summaries for the generalized lasso results are reported below in Table 4.11. It can be seen from Table 4.11 that the out-of-sample AUC is .883, which is quite strong but lower than the AUCs reported for the gradient boosting, random forests, and CART models. The ROC curve for the generalized lasso model is provided in Figure 4.24. The confusion matrix based on the balanced threshold cut-off is provided for the generalized lasso model in Table 4.12. It shows that the out-of-sample classification accuracy based on the balanced threshold is much lower than for the previous models. In fact, it can be seen from Table 4.12 that the Type I error rate for the generalized lasso model is 18.77%, while the Type II error rate is 18.27%, considerably higher than the error rates of the gradient boosting, random forests, and CART models. The variable importance scores for the generalized lasso model are shown in Table 4.13.

TABLE 4.11  Model Error Measures (Generalized Lasso Model)

Name                                               Learn     Test
Average LogLikelihood (negative)                   0.29246   0.28652
ROC (area under curve)                             0.87600   0.88393
Variance of ROC (area under curve)                 0.00002   0.00005
Lower confidence limit ROC                         0.86833   0.86983
Upper confidence limit ROC                         0.88368   0.89803
Lift                                               1.14624   1.15875
K-S stat                                           0.67143   0.69783
Misclass rate overall (raw)                        0.13956   0.13932
Balanced error rate (simple average over classes)  0.16938   0.15738
Class. accuracy (baseline threshold)               0.79267   0.79837

Table 4.11 summarizes the predictive performance of the generalized lasso model, including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold).


FIGURE 4.24  ROC Curve of Estimation and Test Samples (Generalized Lasso Model)

TABLE 4.12  Confusion Matrix – Generalized Lasso Model

Actual Class   Total Class   Percent Correct   Predicted Class 0 (N = 1,010)   Predicted Class 1 (N = 2,665)
0              538           81.23%            437                             101
1              3,137         81.73%            573                             2,564
Total:         3,675

Average:              81.48%
Overall % correct:    81.66%
Specificity:          81.23%
Sensitivity/Recall:   81.73%
Precision:            96.21%
F1 statistic:         88.38%

Only one variable, the percentage of institutional ownership, receives any material weight in the lasso model. The model coefficients for the generalized lasso model are provided in Table 4.14.
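For readers working outside SPM, an L1-penalised (lasso) logistic regression gives a broadly comparable workflow, although it is only a stand-in for SPM's GPS generalized lasso. The sketch below uses synthetic data and hypothetical settings: it fits the model, computes the out-of-sample AUC, and derives per-class error rates (Type I/Type II in the book's convention) from the confusion matrix at a simple base-rate threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=20000, n_features=26, n_informative=10,
                           weights=[0.87, 0.13], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=1)

# The L1 penalty drives many coefficients to exactly zero, which is why a
# lasso-type model can end up leaning on only one or two predictors.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
lasso_logit.fit(X_tr, y_tr)

prob = lasso_logit.predict_proba(X_te)[:, 1]
print("Out-of-sample AUC:", round(roc_auc_score(y_te, prob), 3))

# Classify at a base-rate threshold (one simple way of balancing the classes;
# SPM's balanced threshold may be defined differently), then read the
# per-class error rates off the confusion matrix.
threshold = y_tr.mean()
pred = (prob >= threshold).astype(int)
cm = confusion_matrix(y_te, pred)
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)
print("Per-class error rates:", per_class_error.round(4))
```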

1.5  Multivariate Adaptive Regression Splines

The model error summaries for the MARS model are displayed in Table 4.15. It can be seen from Table 4.15 that the out-of-sample AUC of .90 is very strong but lower than the AUCs reported for the gradient boosting, random forests, and CART models. The confusion matrix for the MARS model is shown in Table 4.16.

TABLE 4.13  Variable Importance for Generalized Lasso Model

Variable                                          Score
Institutional percent owned                       100.00
Excess return (6 months)                            0.06
Excess return (12 months)                           0.03
Percent owned by top five shareholders              0.00
Short term debt to total liabilities                0.00
Debt to total assets                                0.00
Total bank debt                                     0.00
Identifiable intangible assets to total assets      0.00
Sales to total assets                               0.00
Altman Z score                                      0.00
EBIT to total assets                                0.00
Interest cover                                      0.00
Cash flow per share                                 0.00
Average debt collection period                      0.00
Earnings per share (basic)                          0.00
Current ratio                                       0.00
One year beta                                       0.00

Table 4.13 shows the relative variable importances (RVIs) for the generalized lasso model estimated on all explanatory variables. Only one variable, institutional percent owned, carries any material importance in the generalized lasso model; all other variables have RVIs of zero or close to zero.

TABLE 4.14  Model Coefficients for Generalized Lasso Model

Variable                                          Coefficient
Constant                                          −0.03510
Percent owned by top five shareholders             0.06486
Institutional percent owned                        1.70669
Short term debt to total liabilities               0.00858
Excess return (6 months)                           0.01892
Excess return (12 months)                          0.07718
Debt to total assets                               0.00030
Cash flow per share                                0.00084
Identifiable intangible assets to total assets     0.00330
One year beta                                      6.00916E−08
Total bank debt                                   −0.00014
Sales to total assets                              0.00020
Average debt collection period                    −1.69559E−06
Earnings per share (basic)                         0.00008
EBIT to total assets                              −7.22938E−06
Interest cover                                    −6.39413E−06
Current ratio                                      0.00205
Altman Z score                                    −0.00001

TABLE 4.15  Model Summary: Model Error Measures (MARS Model)

Name                                               Learn          Test
RMSE                                               0.25892        n/a
MSE                                                0.06704        0.06795
GCV                                                0.06708        n/a
SSE                                                3,032.37656    n/a
R-Sq                                               0.40194        n/a
GCV R-Sq                                           0.40166        n/a
Average LogLikelihood (negative)                   0.37602        0.39918
ROC (area under curve)                             0.90597        0.90423
Variance of ROC (area under curve)                 0.00001        0.00002
Lower confidence limit ROC                         0.90037        0.89519
Upper confidence limit ROC                         0.90937        0.91327
Lift                                               1.13782        1.14138
K-S stat                                           0.69064        0.68855
Misclass rate overall (raw)                        0.08899        0.09017
Balanced error rate (simple average over classes)  0.15568        0.15689
Class. accuracy (baseline threshold)               0.80936        0.80871
MSE adjusted                                       0.06703        n/a
R-Sq adjusted                                      0.40188        n/a

Table 4.15 summarizes the predictive performance of the MARS model, including the average log-likelihood, areas under the ROC curve (AUC), lift, the K-S statistic, raw misclassification rates, balanced error rates (simple average over classes), and classification accuracy (baseline threshold).

TABLE 4.16  Confusion Matrix – MARS Model

Actual Class   Total Class   Percent Correct   Predicted Class 0 (N = 2,817)   Predicted Class 1 (N = 8,595)
0              1,507         83.48%            1,258                           249
1              9,905         84.26%            1,559                           8,346
Total:         11,412

Average:              83.87%
Overall % correct:    84.16%
Specificity:          83.48%
Sensitivity/Recall:   84.26%
Precision:            97.10%
F1 statistic:         90.23%

Based on the balanced threshold cut-off, the Type I error for the MARS model is 16.52%, while the Type II error is 15.74%. Both error rates are lower than those of the generalized lasso model, which remains the worst performing model so far. As can be seen from the variable importance scores in Table 4.17, only four variables feature in the MARS model.

TABLE 4.17  Variable Importance for MARS Model

Variable                                       Score
One year beta_mis                              100.00
One year beta                                  100.00
Percent owned by top five shareholders_mis      68.22
Percent owned by top five shareholders          68.22
Average debt collection period                  24.77
Average debt collection period_mis              24.77
Interest cover_mis                              21.52
Interest cover                                  21.52

Table 4.17 shows the relative variable importance (RVIs) of the MARS model estimated on all explanatory variables.

These are: one year beta, percent owned by top five shareholders, average debt collection period, and the interest cover ratio. The final MARS model is reported in Table 4.18, which displays the basis functions, coefficients, variables, signs, parent signs, and knots.
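The structure reported in Table 4.18 can be read as a sum of hinge (basis) functions. The sketch below uses hypothetical coefficients, variable names, and knots rather than the fitted values from the table; it simply shows how a MARS-style model scores a single observation.

```python
def hinge(x, knot, sign):
    """Standard MARS hinge: max(0, x - knot) if sign is '+', max(0, knot - x) otherwise."""
    return max(0.0, x - knot) if sign == "+" else max(0.0, knot - x)

# Hypothetical MARS model: an intercept plus three basis functions.
# Each entry: (coefficient, variable name, sign, knot).
intercept = 0.25
basis_functions = [
    (-0.0005, "pct_top5_holders", "+", 18.6),
    (-0.0137, "pct_top5_holders", "-", 18.6),
    (-0.0000012, "debt_collection_days", "+", 45.0),
]

def mars_score(obs):
    """Score one observation (a dict of variable values) with the MARS-style model."""
    score = intercept
    for coef, var, sign, knot in basis_functions:
        score += coef * hinge(obs[var], knot, sign)
    return score

example = {"pct_top5_holders": 35.0, "debt_collection_days": 60.0}
print("Model score:", round(mars_score(example), 5))
```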

1.6  Diagnostic Features of the Model

Now that we have considered the performance of several popular statistical learning methods, we need to demonstrate model diagnostic techniques that are useful for assessing the overall stability and robustness of these models. For brevity, I will just focus on the best performing machine learning model, gradient boosting machines. As explained in Chapter 3, machine learning models such as TreeNet gradient boosting machines are generally not influenced by the many data quality issues that plague conventional models, such as outliers, non-normalness, missing values, monotonic transformations, and scaling (see Jones, 2017). However, the performance of machine learning models can be impacted by hyper-parameter estimation or model architecture considerations, such as the node setting (which controls the size of trees), learn rates, tree depth (number of trees), different out-of-sample testing approaches, and different random number seed specifications, to mention just a few issues. For instance, gradient boosting models can generate different results depending on the number of trees used in the model. Different results could be achieved if gradient boosting is estimated with a very low tree depth (say, 10 trees vs 200 trees). Changing the seed setting can also impact the performance of machine learning models. The seed value initializes the randomization processes in gradient boosting.

TABLE 4.18  Final Model

Basis Function   Coefficient   Variable                                  Sign   Parent Sign   Parent                                        Knot
0                 0.25485      (intercept)                               –      –             –                                             –
3                 0.00000      One year beta                             +      +             One year beta_mis                             −30,612,302.00000
6                −0.00052      Percent owned by top five shareholders    +      +             Percent owned by top five shareholders_mis    18.60000
7                −0.01370      Percent owned by top five shareholders    −      +             Percent owned by top five shareholders_mis    18.60000
10               −0.00000      Average debt collection period            +      +             Average debt collection period_mis            −421,820.90625
13               −0.00000      Interest cover                            +      +             Interest cover_mis                            −267,097.00000

Table 4.18 displays the final MARS model, including basis functions, coefficients, variables, signs, parent signs, and knot values.

The random seed number can impact how randomization is used to split the dataset between the training and test samples for resampling, which is critical to these models. Stochastic gradient boosting models such as TreeNet are nondeterministic in the sense that, for a given set of inputs, the outputs may not be identical across runs. Using the same random seed number guarantees reproducibility of results (i.e., producing the same model each time). Generally, we found that the TreeNet models reported earlier are highly robust to different model architecture settings, including the seed setting, node level, learn rate specification, and tree depth. In the following sections, we consider a number of robustness checks and model diagnostic analyses that can be used to evaluate and potentially improve the performance of machine learning models such as gradient boosting.
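The same architecture and seed considerations apply in open-source settings. The sketch below uses scikit-learn's gradient boosting as a stand-in for TreeNet, on synthetic data with hypothetical settings, to make the key hyper-parameters explicit and to fix the random seed so the stochastic subsampling is reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=26, weights=[0.87, 0.13],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=7)

# Key architecture choices discussed in the text: number of trees, learn rate,
# tree size (maximum terminal nodes), stochastic subsampling, and the seed.
gbm = GradientBoostingClassifier(n_estimators=500,
                                 learning_rate=0.1,
                                 max_leaf_nodes=6,
                                 subsample=0.5,       # stochastic gradient boosting
                                 random_state=42)     # fixed seed => reproducible model
gbm.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print("Out-of-sample AUC:", round(auc, 3))
# Re-fitting with the same random_state reproduces the same model and AUC;
# changing the seed or the settings above can shift the results slightly.
```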

1.6.1  Comparison of Model Performance

One of the very useful diagnostic tools in SPM is the model comparison tool. Figure 4.25 and Table 4.19 compare the performance of TreeNet gradient boosting machines with other models available in SPM, such as logit, CART, random forests, and MARS. Figure 4.25 shows that the mean ROC across models is .921 (using the same features and sample used in the previous models). Table 4.19 shows that the highest out-of-sample ROC performance was for TreeNet, with an out-of-sample AUC of .982.


FIGURE 4.25  Comparison of Model Performance (Based on ROC)

TABLE 4.19  Model Comparison Based on ROC and K-S Stat

Model                         Model Complexity (Nodes, Trees, Coeffs)   ROC       K-S Stat   N Observations
TreeNet (gradient boosting)   444                                       0.98245   0.86061    56,643
Random forests                500                                       0.95145   0.81542    56,643
CART                          198                                       0.94332   0.80086    56,643
MARS                            5                                       0.91084   0.69556    56,643
Logit                          22                                       0.87222   0.66491    18,106
Generalized lasso              17                                       0.87138   0.66642    18,106

Table 4.19 displays out-of-sample ROC performance for the gradient boosting machines, random forests, CART, MARS, logistic regression, and GPS (generalized lasso) models. The best performing model is TreeNet with a ROC (or AUC) of .982 and the worst performing model is generalized lasso with an AUC of .871.

FIGURE 4.26  Comparison of Model Performance on Misclassification

This was followed by random forests with an AUC of .951, CART with an AUC of .943, MARS with an AUC of .910, logit with an AUC of .872, and generalized lasso with an AUC of .871. Figure 4.26 and Table 4.20 display the misclassification rates across models. The average out-of-sample misclassification rate is 9.1% across all models.

TABLE 4.20  Comparison of Model Performance Based on Misclassification

Model                         Model Complexity (Nodes, Trees, Coeffs)   Misclass   N Observations
TreeNet (gradient boosting)   495                                       0.03251    56,643
CART                          256                                       0.05906    56,643
MARS                            5                                       0.09017    56,643
Random forests                200                                       0.09569    56,643
Logit                          23                                       0.13007    18,106
Generalized lasso              17                                       0.13932    18,106

Table 4.20 displays out-of-sample misclassification performance for the gradient boosting, random forests, CART, MARS, logistic regression, and GPS (generalized lasso) models. The best performing model is gradient boosting, with a misclassification error rate of 3.25%, and the worst performing model is the generalized lasso, with a misclassification error rate of 13.93%.

However, misclassification is lowest for the TreeNet model at 3.25%, followed by (respectively) CART at 5.9%, MARS at 9%, random forests at 9.5%, logit at 13%, and, finally, generalized lasso at 13.9%.
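A comparable model-comparison exercise can be scripted outside SPM. The sketch below is only illustrative (synthetic data and generic scikit-learn models, so the numbers will not match Table 4.19 or 4.20): it evaluates several classifiers on a common hold-out sample using AUC and the raw misclassification rate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=26, weights=[0.87, 0.13],
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=3)

models = {
    "Gradient boosting": GradientBoostingClassifier(random_state=3),
    "Random forest": RandomForestClassifier(n_estimators=500, random_state=3),
    "CART": DecisionTreeClassifier(max_depth=6, random_state=3),
    "Logit": LogisticRegression(max_iter=1000),
}

# Fit each model on the same learn sample and score it on the same test sample.
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    misclass = 1 - accuracy_score(y_te, model.predict(X_te))
    print(f"{name:18s}  AUC = {auc:.3f}  misclassification = {misclass:.3f}")
```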

1.6.2  N-Fold Classification

Now that we have established that TreeNet is the top performing model, we can investigate further how stable and robust this model really is. One useful diagnostic is to examine the sensitivity of the gradient boosting model to different levels of cross-validation. Figure 4.27 and Table 4.21 show model performance as we change the cross-validation folds from 5 through to 50. Table 4.21 shows the out-of-sample AUC for the TreeNet (gradient boosting) model across different cross-validation folds, ranging from 5 folds to 50 folds. The worst AUC was .984, based on 5 CV folds. Beyond 10 folds, the AUC could not be improved any further. The results in Figure 4.27 and Table 4.21 suggest that the out-of-sample results are largely unaffected by the number of CV folds.
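The fold-sensitivity check can be reproduced with standard cross-validation utilities. A minimal sketch, on synthetic data with scikit-learn gradient boosting standing in for TreeNet:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=10000, n_features=26, weights=[0.87, 0.13],
                           random_state=5)
gbm = GradientBoostingClassifier(random_state=5)

# Vary the number of folds and check how stable the cross-validated AUC is.
for folds in (5, 10, 20):
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=5)
    aucs = cross_val_score(gbm, X, y, cv=cv, scoring="roc_auc")
    print(f"{folds:2d}-fold CV: mean AUC = {aucs.mean():.3f} (sd = {aucs.std():.3f})")
```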

1.6.3  Different Seed Settings

Another useful diagnostic in SPM is the seed setting feature, which is displayed in Figures 4.28 and 4.29. Figures 4.28 and 4.29 show that the TreeNet misclassification rates and ROC curves are highly robust to the different seed settings available in SPM.

1.6.4  Model Stability

The high AUCs of the Table 4.3 model raise questions about the stability of the overall model. TreeNet provides an extensive range of diagnostic checks for model stability and over-fitting.


FIGURE 4.27  Comparison of Model Performance Based on N-Fold Cross-Validation (TreeNet/Gradient Boosting)

TABLE 4.21  Comparison of Model Performance Based on N-Fold Cross-Validation (TreeNet/Gradient Boosting)

Model       Opt. TN Trees Count   ROC       K-S Stat   CV Folds
TreeNet 1   495                   0.98406   0.87306     5
TreeNet 2   500                   0.98525   0.87913    10
TreeNet 3   500                   0.98526   0.87726    20
TreeNet 4   495                   0.98524   0.87777    50

FIGURE 4.28  Seed Setting and Misclassification (TreeNet/Gradient Boosting)

FIGURE 4.29  Seed Setting and ROC (TreeNet)


The partition test is one of the best diagnostics for this purpose. With this test, TreeNet builds a series of models (based on the Table 4.3 inputs and model setups) where the training and test samples are repeatedly drawn at random from the dataset in some proportion. Essentially, the model is estimated with new learn and test samples drawn from the “main” data (Salford Systems, 2019). The results provide a more realistic picture of the possible range in predictive performance for the preferred model. If the test error rates are too erratic across models, this could indicate stability and over-fitting problems with the model. Tables 4.22 and 4.23 provide results of a 20-repeat battery partition test of the Table 4.3 model based on an 80/20 allocation to the training and test samples. Tables 4.22 and 4.23 show the performance of each battery test, the optimum tree depth reached, and the misclassification error rates across each model. As can be seen from these tables, the misclassification error is quite stable across each model.

TABLE 4.22  Model Stability Tests on Misclassification Rates (TreeNet/Gradient Boosting)

Model        Opt. TN Trees Count   Misclass
TreeNet 1    495                   0.03251
TreeNet 2    499                   0.03286
TreeNet 3    473                   0.03321
TreeNet 4    499                   0.03365
TreeNet 5    427                   0.03277
TreeNet 6    494                   0.03339
TreeNet 7    492                   0.03312
TreeNet 8    482                   0.03242
TreeNet 9    393                   0.03304
TreeNet 10   480                   0.03207
TreeNet 11   477                   0.03295
TreeNet 12   496                   0.03470
TreeNet 13   451                   0.03251
TreeNet 14   366                   0.03435
TreeNet 15   494                   0.03470
TreeNet 16   491                   0.03093
TreeNet 17   500                   0.03286
TreeNet 18   399                   0.03225
TreeNet 19   494                   0.03382
TreeNet 20   483                   0.03058

Table 4.22 provides results of a 20-repeat battery partition test of the TreeNet/gradient boosting model with an 80/20 allocation for the training and test samples. The battery partition is a diagnostic check for model stability and over-fitting based on out-of-sample classification errors. Under this test, TreeNet builds a series of models where the training and test samples are repeatedly drawn at random from the dataset in some proportion. The table shows the performance of each battery test, the optimum tree depth reached, and the misclassification error rate for each model. As can be seen from Table 4.22, the misclassification error is quite stable across each model, with the highest test error of 3.47% reported on Models 12 and 15 and the lowest test error of 3.06% on Model 20.

TABLE 4.23  Model Stability Tests on ROC (TreeNet/Gradient Boosting)

Model        Opt. TN Trees Count   ROC       K-S Stat
TreeNet 1    494                   0.98472   0.87972
TreeNet 2    494                   0.98603   0.88228
TreeNet 3    491                   0.98496   0.88049
TreeNet 4    494                   0.98410   0.87077
TreeNet 5    500                   0.98410   0.87307
TreeNet 6    465                   0.98632   0.89066
TreeNet 7    478                   0.98333   0.88210
TreeNet 8    456                   0.98497   0.89186
TreeNet 9    499                   0.98487   0.87736
TreeNet 10   483                   0.98438   0.87136
TreeNet 11   473                   0.98257   0.86999
TreeNet 12   461                   0.98415   0.87011
TreeNet 13   500                   0.98685   0.87891
TreeNet 14   467                   0.98214   0.86171
TreeNet 15   500                   0.98375   0.87132
TreeNet 16   500                   0.98585   0.88144
TreeNet 17   495                   0.98415   0.87427
TreeNet 18   495                   0.98288   0.87853
TreeNet 19   466                   0.98403   0.87729
TreeNet 20   489                   0.98616   0.88045

Table 4.23 provides results of a 20-repeat battery partition test of the TreeNet/gradient boosting model with an 80/20 allocation for the training and test samples. The battery partition is a diagnostic check for model stability and over-fitting based on out-of-sample AUC results. Under this test, TreeNet builds a series of models where the training and test samples are repeatedly drawn at random from the dataset in some proportion. The table shows the performance of each battery test, the optimum tree depth reached, and the AUC for each model. As can be seen from Table 4.23, the AUCs are quite stable across each model, with the highest AUC of 0.986 reported on Model 13 and the lowest AUC of 0.982 on Model 14.

The highest test error, 3.47%, is reported on Models 12 and 15, and the lowest, 3.06%, on Model 20. With respect to the ROC curve tests, the highest AUC of .986 is reported for Model 13, and the lowest AUC of .982 for Model 14. Overall, both the AUCs and the misclassification rates are quite robust across the TreeNet models.
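The battery partition test amounts to repeatedly re-drawing the train/test split and re-fitting the model. A minimal sketch of that idea follows (fewer repeats, synthetic data, and scikit-learn gradient boosting standing in for TreeNet, so the error range is only illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=26, weights=[0.87, 0.13],
                           random_state=11)

aucs, errors = [], []
for repeat in range(10):                      # the text uses a 20-repeat battery
    # Re-draw an 80/20 learn/test partition on each repeat.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                              random_state=repeat)
    gbm = GradientBoostingClassifier(random_state=repeat).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
    errors.append(np.mean(gbm.predict(X_te) != y_te))

print("AUC range:", round(min(aucs), 3), "-", round(max(aucs), 3))
print("Misclassification range:", round(min(errors), 4), "-", round(max(errors), 4))
```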

1.6.5  Learn Rates

Tables 4.24 and 4.25 show the impact on AUCs and misclassification rates of using different learn rates. Three different learn rates are examined: .001, .01, and .10. It can be seen from both of these tables that the learn rate has a significant impact on the performance of the gradient boosting model.

TABLE 4.24  Different Learning Rates on ROC (TreeNet/Gradient Boosting)

Model       Opt. TN Trees Count   ROC       K-S Stat   Learn Rate
TreeNet 1   497                   0.93288   0.70613    0.001
TreeNet 2   500                   0.97210   0.82127    0.010
TreeNet 3   494                   0.98472   0.87972    0.100

Table 4.24 shows the out-of-sample AUC for the TreeNet model across different learn rates, ranging from .001 to .1. The worst AUC was .932, based on a learn rate of .001. The best AUC was .984, based on a learn rate of .1.

TABLE 4.25  Different Learning Rates on Misclassification (TreeNet/Gradient Boosting)

Model       Opt. TN Trees Count   Misclass   Learn Rate
TreeNet 1     1                   0.13205    0.001
TreeNet 2   497                   0.04986    0.010
TreeNet 3   495                   0.03251    0.100

Table 4.25 shows the out-of-sample misclassification rates for the TreeNet model across different learn rates, ranging from .001 to .1. The worst misclassification rate was 13.2%, based on a learn rate of .001. The best misclassification rate was 3.25%, based on a learn rate of .1.

Increasing the learn rate to .01 and .10 significantly improved AUCs and reduced out-of-sample misclassification rates. For instance, a learn rate of .01 increased the AUC from .932 to .972, and increasing the learn rate again to .10 increased the AUC further to .984. The same pattern holds for the misclassification rates. For instance, a learn rate of .01 reduced the misclassification rate from 13.2% to 4.98%, and increasing the learn rate again to .10 reduced the misclassification rate still further to 3.25%. Because the results can be significantly impacted by the learn rate settings, the basis for different learn rates needs to be well justified. Generally, higher learn rates, up to .1, are justified for larger samples, whereas much slower learn rates should be used for small samples. For the gradient boosting illustration, I used SPM's AUTO setting. SPM's AUTO setting uses very slow learn rates for small samples and faster learn rates (.1) for datasets with more than 10,000 records.2
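The learn-rate sensitivity can be checked in the same way, and the AUTO rule in note 2 is easy to encode. A minimal sketch follows (the AUTO formula simply transcribes the note's definition; everything else — synthetic data, scikit-learn stand-in for TreeNet — is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auto_learn_rate(n_learn):
    """SPM-style AUTO rule: max(0.01, 0.1 * min(1, n_learn / 10,000))."""
    return max(0.01, 0.1 * min(1.0, n_learn / 10_000))

X, y = make_classification(n_samples=20000, n_features=26, weights=[0.87, 0.13],
                           random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=9)

print("AUTO learn rate for", len(X_tr), "learn records:", auto_learn_rate(len(X_tr)))

# Sweep the learn rates examined in Tables 4.24 and 4.25.
for rate in (0.001, 0.01, 0.1):
    gbm = GradientBoostingClassifier(learning_rate=rate, n_estimators=500,
                                     random_state=9).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
    print(f"learn rate {rate:<6}: out-of-sample AUC = {auc:.3f}")
```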

1.6.6  Node Settings

Finally, the node setting appears to have an impact on the empirical results as well. Tables 4.26 and 4.27 show ROC performance and misclassification rates at different node settings, ranging from 2 to 20. Increasing the number of nodes in the model tends to improve ROC performance, particularly when moving up from a very low node setting of between 2 and 4. Increasing the node setting also monotonically reduces the misclassification rate.

TABLE 4.26  Different Node Settings on ROC (TreeNet/Gradient Boosting)

Model       Opt. TN Trees Count   ROC       K-S Stat   Max Nodes
TreeNet 1   497                   0.96982   0.83230     2
TreeNet 2   495                   0.98344   0.87719     4
TreeNet 3   494                   0.98472   0.87972     6
TreeNet 4   494                   0.98601   0.88524     9
TreeNet 5   499                   0.98687   0.89143    15
TreeNet 6   498                   0.98764   0.88987    20

Table 4.26 shows the out-of-sample AUC for the TreeNet (gradient boosting model) across different node settings, ranging from 2 to 20. The worst AUC was .969 based on a node setting of 2. The best AUC was .987 based on a node setting of 20.

TABLE 4.27  Node Settings on Misclassification (TreeNet)

Model       Opt. TN Trees Count   Misclass   Max Nodes
TreeNet 1   498                   0.04671     2
TreeNet 2   456                   0.03531     4
TreeNet 3   495                   0.03251     6
TreeNet 4   485                   0.03198     9
TreeNet 5   488                   0.03041    15
TreeNet 6   426                   0.02918    20

Table  4.27 shows the out-of-sample misclassification error rate for the TreeNet (gradient boosting model) across different node settings, ranging from 2 to 20. The worst misclassification error rate was 4.67% based on a node setting of 2. The best misclassification error rate was 2.9% based on a node setting of 20.

For instance, the lowest misclassification rate is achieved when the node setting is 20, which is well above the SPM default setting of 6. However, the improvements in AUC and misclassification rates are not substantial, and it would appear the SPM default setting of a maximum of 6 nodes is reasonably optimal in my illustration of the gradient boosting model.
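The node-setting check maps onto the maximum number of terminal nodes allowed per tree. A minimal sketch, with scikit-learn's max_leaf_nodes standing in for SPM's node setting and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=26, weights=[0.87, 0.13],
                           random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=13)

# Sweep the node settings examined in Tables 4.26 and 4.27.
for nodes in (2, 4, 6, 9, 15, 20):
    gbm = GradientBoostingClassifier(max_leaf_nodes=nodes, n_estimators=300,
                                     random_state=13).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
    print(f"max nodes {nodes:2d}: out-of-sample AUC = {auc:.3f}")
```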

Key Points From Chapter 4

This chapter provides a comprehensive illustration of several statistical learning methods described in Chapters 2 and 3. In particular, Chapter 4 focuses on the estimation and interpretation of classification and regression trees (CART), multivariate adaptive regression splines (MARS), gradient boosting machines, and random forests. The sample used in this illustration includes 7,326 bankrupt firm year observations and 49,317 non-failed firm year observations. The corporate failure sample spanned 25 years (1990 to 2015).


The illustration uses an international sample of bankrupt firms, although most bankrupt firms are from the US jurisdiction and entered Chapter 7 or Chapter 11 under the US bankruptcy code. The software used for this illustration is Salford Predictive Modeler (SPM) (Version 8.2), which is a market leader in machine learning software. SPM provides a comprehensive range of diagnostic and data visualization capabilities, particularly compared to open-source software such as R.

In the context of corporate failure prediction, the key outputs of machine learning methods such as gradient boosting are relative variable importance scores (RVIs), partial dependency plots, and interaction effects. The predictive performance of machine learning models was evaluated through outputs such as area under the ROC curve (AUC), model fit statistics, and confusion matrices. Critical to determining the success of a corporate failure model is out-of-sample classification performance on Type I and Type II errors. All models in the Chapter 4 illustration performed reasonably well on Type I and Type II error rates, although the gradient boosting model was clearly the best and the logit/generalized lasso models performed the worst. The gradient boosting model was 94.03% accurate with respect to Type I errors and 93.82% accurate with respect to Type II errors.

The gradient boosting, random forests, and CART models all revealed slightly different variable importance rankings. However, 23 of the 26 features used in the illustration were found to be important (i.e., had an RVI > 0) in all models. Financial, market price, and corporate governance proxies (such as shareholder ownership and concentration) all contributed to the overall predictive power of the models. As RVIs are scalars, they provide no indication of the direction of a feature's effect on the corporate failure outcome. Partial dependency plots provide a valuable tool for interpreting the direction of predictor variables' impact on the corporate failure outcome and whether the relationship is linear or nonlinear. In just about all cases, the features used in this illustration exhibited a highly nonlinear relationship with the corporate failure outcome.

Chapter 4 also demonstrated several diagnostic and robustness tests that can be applied to machine learning models. Because the gradient boosting machine was the best performing model, we further investigated it using several model stability and robustness tests. The gradient boosting model proved highly robust to a range of model stability tests and different hyper-parameter settings.

Notes

1 For regression models, machine learning models can be evaluated with standard outputs such as AIC, BIC, R-square, mean squared error (MSE), root mean squared error (RMSE), and mean average percentage error (MAPE).
2 The AUTO value = max(.01, .1 × min(1, nl/10,000)), where nl = number of LEARN records. See “SPM Users Guide: Introducing TreeNet” (2019).

5 CORPORATE FAILURE MODELS FOR PRIVATE COMPANIES, NOT-FOR PROFITS, AND PUBLIC SECTOR ENTITIES

Introduction

While the motivation for public company failure research is well established, there are also important reasons to model distress and corporate failure for private companies, not-for-profit entities, and public sector entities. For instance, Filipe et al. (2016) stated that private companies (which they term SMEs) play a crucial role in most economies. In Organization for Economic Cooperation and Development (OECD) countries, SMEs account for 95% of all enterprises and generate around two-thirds of employment. However, compared to public companies, private company failure rates are much higher.1 The US Small Business Administration Office of Advocacy (2018) provided several relevant statistics about small business enterprises in the US. For instance, as of 2015, there were 30.2 million small businesses in the US, with 5.9 million having paid employees. Small businesses comprise:

•  99.9% of all firms;
•  99.7% of firms with paid employees;
•  97.6% of exporting firms (287,835 small exporters);
•  32.9% of known export value ($440 billion out of $1.3 trillion);
•  47.5% of private sector employees (59 million out of 124 million employees);
•  40.8% of private-sector payroll.

There is also information provided on survival rates. The US Small Business Administration Office of Advocacy (2018)2 estimated that 80% of small businesses that were established in 2016 survived until 2017. However, only around half of small businesses survived five years or longer after establishment, and only one-third survived ten years or longer. Table 5.1 below was sourced from the US Bureau of Labor Statistics.

TABLE 5.1  Business Survivorship in the US 2015–2021 (establishments opened in the year ended March 2015)

Year Ended    Surviving Establishments   Total Employment of Survivors   Survival Rate Since Birth (%)   Survival Rate of Previous Year's Survivors (%)   Average Employment of Survivors
March 2015    677,884                    3,011,599                       100.0                           –                                                4.4
March 2016    539,708                    3,073,362                        79.6                           79.6                                             5.7
March 2017    468,300                    3,082,025                        69.1                           86.8                                             6.6
March 2018    416,444                    3,075,865                        61.4                           88.9                                             7.4
March 2019    375,839                    3,043,366                        55.4                           90.2                                             8.1
March 2020    340,131                    2,989,453                        50.2                           90.5                                             8.8
March 2021    314,814                    2,818,430                        46.4                           92.6                                             9.0

Source: US Bureau of Labor Statistics (2022)3
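The two survival-rate columns in Table 5.1 follow directly from the surviving-establishment counts. A quick cross-check, using only the figures reported in the table:

```python
# Surviving establishments (opened in the year ended March 2015), from Table 5.1.
survivors = {2015: 677_884, 2016: 539_708, 2017: 468_300, 2018: 416_444,
             2019: 375_839, 2020: 340_131, 2021: 314_814}

years = sorted(survivors)
base = survivors[years[0]]
for i, year in enumerate(years):
    # Cumulative survival since birth = survivors this year / survivors at opening.
    since_birth = 100 * survivors[year] / base
    # Conditional survival = survivors this year / survivors the previous year.
    if i == 0:
        prev_txt = "n/a"
    else:
        prev_txt = f"{100 * survivors[year] / survivors[years[i - 1]]:.1f}%"
    print(f"March {year}: since birth = {since_birth:.1f}%, "
          f"of previous year's survivors = {prev_txt}")
```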

The US Bureau of Labor Statistics tracks establishments that were opened in March 2015 and the survivorship of these establishments until 2021. Table 5.1 shows that less than half of these establishments had survived as at March 2021.

The COVID pandemic has had a particularly pronounced impact on small and medium enterprises. As stated in the OECD report “Coronavirus (COVID-19): SME Policy Responses” (July, 2020):4

There are several ways the coronavirus pandemic affects the economy, especially SMEs, on both the supply and demand sides. On the supply side, companies experience a reduction in the supply of labour, as workers are unwell or need to look after children or other dependents while schools are closed and movements of people are restricted. Measures to contain the disease by lockdowns and quarantines lead to further and more severe drops in capacity utilisation. Furthermore, supply chains are interrupted leading to shortages of parts and intermediate goods. On the demand side, a dramatic and sudden loss of demand and revenue for SMEs severely affects their ability to function, and/or causes severe liquidity shortages. Furthermore, consumers experience loss of income, fear of contagion and heightened uncertainty, which in turn reduces spending and consumption. These effects are compounded because workers are laid off and firms are not able to pay salaries. Some sectors, such as tourism and transportation, are particularly affected, also contributing to reduced business and consumer confidence. More generally, SMEs are likely to be more vulnerable to “social distancing” than other companies.

Bankruptcies in OECD countries are expected to rise sharply as a result of COVID. The American Bankruptcy Institute reported a 14% increase in Chapter 11 filings in the first quarter of 2020 as compared to Q1 2019 and expects these numbers to

Date

Country

Impact on Business

Expectations

10 Feb

China

80% of SMEs have not resumed operations yet

25 Feb

Finland

Early March Early March

Italy UK

9 March 9 March

Germany Japan

10 March

Poland

11 March

USA

12 March 13 March 16 March 16 March

UK USA Canada Israel

16 March 17 March 17–20 March 18 March

Greece USA Korea

One-third anticipated a negative or very negative impact 72% directly affected 63% see crisis as moderate to high/severe threat to their business 50% expect a negative impact 39% report supply chain disruptions, 26% decrease in orders and sales One-third of SMEs experience increasing costs and reduced sales 70% experience supply chain disruptions, 80% the impact of the crisis 69% experience serious cash flow problems 23% negatively affected, 36% expect to be 50% drop in sales 55% experienced no impact yet, one-third planning lay-offs 60% experience marked decline in sales 50% negatively affected, 75% very concerned 61% have been impacted

One-third out of businesses in one month, another onethird in two months N/A

Belgium

75% report declines in turnover

N/A N/A N/A N/A 27% already encounter cash flow problems N/A One-third fear being out of business in one month 25% expect not to survive longer than one month N/A N/A N/A 42% fear being out of business in three months, 70% in six months 50% fear not to be able to pay costs in the short term


TABLE 5.2  Global Impacts of COVID-19 on SMEs (Based on OECD Surveys)

19 March 20 March 20 March 21 March 24 March 31 March– 6 April

3 April

Belgium

7 April 7 April

Belgium Canada and the US

8 April

UK

8 April

Netherlands

6–10 April

Portugal

1 April 1 April

96% have been affected 60% expect a decline in sales 50% start-ups lost significant revenue 92% experience economic impact 60% experience significant impact 30% of SMEs expect to lay off 50% of their staff

51% indicate not be able to survive three months N/A 50% expect to be out of business within three months N/A One-third expect to be out of business in a month 50% of SMEs have a month cash reserve or less

N/A

18% of firms could be out of business in one month

N/A

35% of small business out of business in three months

Two-thirds of small business experience the impact of the crisis; 41% experience a drop in income of 50% or more in the last two months 40% of companies see drop in revenue of 75% or more N/A 90% of small business affected

N/A

37% expect to furlough 75–100% of their staff in the next week N/A

6% out of cash, 57% three months reserves or less

37% experience a drop in production of more than 50%

1 in 10 companies likely to face bankruptcy Over 31% of Belgium SMEs may not survive the crisis One-third lack the reserves to survive longer than a few weeks

85% of SMEs in financial difficulty because of COVID while 19.21% are at serious risk 50% do not have resources for more than two months


3 April

USA Hungary Netherlands Japan Canada Several Asian countries United Kingdom United States Australia

Date

Country

Impact on Business

Expectations

15–22 April

4 May

Canada

5 May

15 May

United States United States United Kingdom Thailand

62% of small business experience a drop in revenues 58% of SMEs experience a drop in turnover by on average 50% 81% of small businesses indicate their operations are negatively affected N/A

32% cannot stay open longer than three months

24 April

United States Germany

8 June

Ireland

Mid-June

Canada

20 June

New Zealand

11 May 13 May

81% of firms experience and expect impact of pandemic in the next 12–16 months 37% of firms are considering, or have already made, redundancies 90% of firms expect extreme revenue loss SMEs that remained until 18 May closed incurred an average cost of EUR 177,000 during the lockdown period; of businesses that remained open, 70% reported a decrease in revenue 78% of small business reported a drop in sales, 47% between 50 and 100% 71% of SMEs have taken a revenue hit by COVID-19

Source: OECD report “Coronavirus (COVID-19): SME Policy Responses” (July, 2020, Annex B, pp. 5–6)

Half of SMEs have only two months of liquidity reserve 32% worry about the viability of their business over the next year One-fifth of small businesses closed down temporarily, onethird expects to close permanently within two months N/A 41% of firms have temporarily closed, 35% fear they will not reopen again 52% of small business expects to close down if containment measures last longer N/A

N/A 39% of SMEs fear having to close down




continue rising.5 Australia, however, witnessed a decline in insolvencies in March 2020: at 683 companies, this was the lowest March figure since 2007, which could be explained by the support measures (such as the JobKeeper program) put in place by the government. However, corporate failure is only one measure – a much greater percentage of companies experience some form of financial distress. The OECD published the results of a 41-country survey (summarized in Table 5.2) that identified the world-wide impact of COVID-19 on SMEs and the extent of the disruption and issues facing businesses across the world. A more recent survey is the US Census Bureau's Small Business Pulse Survey (SBPS), which measures the effect of changing business conditions during the COVID pandemic on the nation's small businesses. As of late December 2021, around 65.2% of US small businesses surveyed indicated that COVID-19 was having either a large or moderate overall negative impact on their businesses.6 A large number of US small businesses also reported difficulty recruiting staff or significant operational delays and disruptions, and nearly 64% of small businesses were receiving some level of assistance (such as through the Paycheck Protection Program, loan forgiveness, or economic injury disaster loans (EIDL)). Around 56% of small businesses surveyed had fewer than two months of available cash on hand, which might indicate some level of financial distress for these businesses.7

It is not just private for-profit companies that have been severely impacted by the COVID-19 pandemic. As stated by Bossi (2020), charitable institutions and not-for-profits have also been adversely impacted:8

Charitable institutions and not-for-profit entities (NFPs) are particularly vulnerable to financial distress because of the COVID-19 pandemic. Like most for-profit entities, NFPs are experiencing a momentous disruption in their business operations and the revenue derived therefrom. That in itself is likely to trigger a financial crisis for many NFPs. However, as a “double-whammy,” many NFPs can anticipate a significant future reduction in donor contributions and support from other constituencies because of the economic downturn. This disruption in contributions and non-operating revenue may have an even more devastating and longer-lasting impact on NFPs than the loss of business directly caused by the pandemic.9

Unlike private companies, not-for-profit entities and public sector entities are particularly challenging from a corporate failure modelling perspective. Few not-for-profit entities ever declare bankruptcy. To a large extent, they either merge with other not-for-profits or simply “disappear” (Hager et al., 1996). Hager et al. (1996, p. 977) differentiated not-for-profits that “reincarnate” from those that terminate. Reincarnated not-for-profits are entities that merged, were acquired (but retained their governance structure), and/or left the area or changed status (from not-for-profit to profit making).


Terminated entities are those that truly disbanded (ceased operations altogether) and dissolved their board of directors; thus, using “bankruptcy” as the dependent variable excludes a substantial group of nonprofits that may be at risk financially. Second, until recently, it was only possible to examine a small number of not-for-profits, since databases of not-for-profits were largely unavailable (Gordon et al., 1999). As a result, corporate failure models for not-for-profit and public sector entities have to adopt alternative proxies for distress risk or bankruptcy. In the remainder of this chapter, I will first discuss the literature on private company distress modelling, followed by not-for-profits and public sector entities.

1.1  Private Companies

Notwithstanding the importance of private companies to broader economic growth and employment, most corporate failure studies have focused on public company samples. The paucity of research on private company failures can be attributed to a number of possible factors. First, in contrast to public companies, private company datasets tend to be less complete and reliable, as most private companies are not subject to mandatory external auditing requirements or compliance with accounting standards. Second, the range of explanatory variables that can be used for predictive modelling tends to be much more limited. Typically, private company studies are limited to financial indicators (such as financial ratios), whereas public company corporate failure models can draw on much wider and richer sources of data, such as market price data, external ratings, analyst forecasts, and institutional shareholding data, as well as a wider variety of financial information. Third, private companies tend to have more heterogeneous business and legal structures (such as sole proprietorships, partnerships, and even very large corporations), which can render distress events more difficult to detect and classify. As a result of these and other data limitations, many private company prediction studies document fairly modest predictive accuracy rates (see Jones and Wang, 2019). A tabulation of some representative private company distress studies is provided in the Appendix of this chapter. The literature on private company failures is relatively small and fragmented. As can be seen from the Appendix, there is wide diversity in the definitions of failure, sample sizes, types of models used, choices of explanatory variables, reporting jurisdictions, and data sources. As a result, the literature has been slow to develop, and there appears to be limited generalizability of empirical results across studies. We discuss each of these issues as follows.

1.1.1  Definition of Failure

One of the more contentious issues in the literature has been the definition of private company failure. As shown in the Appendix, a wide range of failure definitions have been used in various studies, including:


loss to borrowers and guarantee recipients (Edmister, 1972); wound up by a court (McNamara et al., 1988); liquidation or ceased trading (Keasey and Watson, 1986, 1988; Hall, 1994); sustained non-compliance with banking obligations (Pindado and Rodrigues, 2004); bankruptcy or liquidation (Slotemaker, 2008); loan default (Grunert et al., 2005; Bhimani et al., 2010); financial distress (Franks, 1998); bankruptcy or default (Mitchell and Roy, 2007; Falkenstein et al., 2000); cash shortages (Mramor and Valentincic, 2003); and a combination of bankruptcy, receivership, liquidation, inactive status, and special treatment firms (Altman et al., 2017). Proxies for private company financial distress have been frequently questioned for their arbitrariness (e.g., Balcaen and Ooghe, 2006, p. 72). For instance, Keasey and Watson (1988) did not distinguish between closure (i.e., cessation of trading) and bankruptcy. A small firm could cease trading for many reasons other than bankruptcy, such as avoiding further losses, failure to “make a go of it”, retirement or ill health, or realizing a profit. As acknowledged by Mramor and Valentincic (2003), cash shortages, namely, the excess of cash payment orders over a cash balance, can be temporary and occur for reasons other than financial distress. Further, many definitions of failure relate to a country's legal and financial frameworks, such as Spain (Alfaro et al., 2008), Russia (Fedorova et al., 2013), the Czech Republic (Karas and Režňáková, 2014), Germany (Grunert et al., 2005), and numerous other jurisdictions. As an example of differences across international jurisdictions, “bankruptcy” only applies to persons in the UK and Australia, whereas it relates to companies in North America. Inconsistencies in the definition of failure across studies can limit the interpretability and generalizability of empirical findings. This underscores the need for a consistent approach to defining private company failure. Some recent studies have used the classifications of failure adopted by Bureau van Dijk's ORBIS database, which is one of the largest private company databases in the world. The ORBIS classifications include: (1) default of payment, (2) firms subject to insolvency proceedings, (3) firms subject to bankruptcy proceedings, (4) firms that are dissolved (through bankruptcy), (5) firms in liquidation, and (6) inactive firms (no precision). Another advantage of using the ORBIS definitions relates to the failure date itself, which is not always easy to determine for private companies. This may be due to less robust reporting requirements, reporting lags, and lack of visibility among smaller private firms. However, the ORBIS database supposedly provides reliable estimates of when the failure event actually occurs.

1.1.2  Sample Sizes and Sources of Data

As can be seen from the Appendix, much prior research on private companies and SMEs has relied on relatively small and quite dated samples. Most of these samples have been derived from local reporting jurisdictions, including the US, Australia, the UK, and Europe.10 Only three studies in the Appendix use larger samples of more than 20,000 observations (see, for example, Bhimani et al., 2010, 2014; Mramor and Valentincic, 2003; Altman et al., 2017).


Most of the studies have well under 1,000 sampled firm year observations. At least half of the studies focus on small businesses or exclude medium and large private businesses. Most studies predate economic crises, particularly the GFC, when private and public firm failure rates increased dramatically. More recently, the COVID pandemic has had a dramatic impact on business failure across the globe (Amankwah-Amoah et al., 2021). Similar to public company bankruptcy research, private company distress and corporate failure research has typically relied on matched-pair samples, which can result in biased parameter estimates (see Zmijewski, 1984; Jones and Hensher, 2004). The Appendix studies also indicate that sample data is drawn from a wide range of heterogeneous sources, including regulators, private banks, chambers of commerce, and other sources. These methodological and data concerns can limit the generalizability of empirical findings.

1.1.3  Explanatory Variables

As discussed in Chapter 2, the literature on public company failure has overwhelmingly focused on the role of accounting-based and/or market price indicators in corporate failure prediction modelling (see, e.g., Altman, 1968; Altman et al., 1977; Ohlson, 1980; Zmijewski, 1984; Shumway, 2001; Altman, 2002; Duffie and Singleton, 2003; Hillegeist et al., 2004; Jones and Hensher, 2004; Beaver et al., 2005; Jones and Hensher, 2008; Jones, 2017). A smaller number of studies have investigated other potentially important corporate failure predictors, including corporate governance proxies (such as stockholder concentration/structure), analyst estimates/forecasts, credit ratings changes, macro-economic factors, and other industry and firm-specific factors (Jones et al., 2015, 2017). While most private company studies rely on financial ratios as the primary variables of interest (see Edmister, 1972; McNamara et al., 1988; Keasey and Watson, 1986, 1988; Bhimani et al., 2010, 2014; Franks, 1998; Mramor and Valentincic, 2003), other variables have been tested. Keasey and Watson (1988), for instance, investigated the relevance of reporting lags in the development of failure prediction models for small firms. Other non-financial factors examined on an ad hoc basis include age, size, industry, and region in Bhimani et al. (2010, 2014); internal credit ratings in Grunert et al. (2005); managerial structure, inadequacy of the accounting information system, and audit lags in Keasey and Watson (1987); and strategy, relations with banks, pricing, marketing, characteristics of owners, and quality of the workforce in Hall (1994). However, many of these qualitative factors are difficult to measure, and the data is difficult to gather or replicate. For instance, having collected business plan information by phone interview, Perry (2001) found that non-failed firms did more planning than failed firms. As Perry (2001, p. 205) acknowledged, the reliability of the findings was subject to self-reported biases (see also Hall, 1994).


Aside from the issue of self-reported biases, it is difficult to consistently measure the impact of cause-related qualitative factors on the survival and failure of a firm. For instance, while formal planning may help a firm survive, another firm can equally prosper without formal planning processes. While Keasey and Watson (1987) and Hall (1994) investigated the predictive ability of internal factors, Everett and Watson (1998) were primarily concerned with the impact of external factors on the likelihood of failure. It is expected that broader macro-economic conditions and the state of the economy should have an impact on the likelihood of firm failure. Previous literature indicates that macro-economic factors can impact credit rating changes and corporate failure (see, e.g., Keenan et al., 1999; Bangia et al., 2002; Duffie et al., 2007; Koopman et al., 2009; Koopman et al., 2011; Figlewski et al., 2012; Hensher et al., 2007). Everett and Watson (1998) found a negative association between unemployment rates and small business failure rates. In times of economic downturn, a firm often reduces its workforce to avoid financial distress, contributing to the negative association between unemployment rates and business failure. In addition to unemployment rates, this study also considers population size, which can be a useful proxy for the size, strength, and resilience of an economy. We also use several broad macro-economic indicators relating to the overall economic health of the economy (see Jones et al., 2015). These indicators include: (1) real GDP/real GDP growth. Real GDP and real GDP growth are key measures of overall economic health and prosperity. A strong and growing economy is expected to lead to lower default risks and a lower overall probability of corporate failure; (2) inflation rates. Inflation is a commonly used economic indicator, and the general consensus is that high inflation is bad for the economy, despite the fact that its effects on the economy can be equivocal (Figlewski et al., 2012; Jones et al., 2015). We expect higher inflation to be symptomatic of a weaker economy, leading to higher default risk and an increased likelihood of corporate failure; (3) public debt. There is also a common perception that a high ratio of public debt to GDP is a sign of broad economic weakness and vulnerability, hence we expect this indicator to be positively associated with corporate failure; and (4) budget balance, trade balance, and international reserves per head, which are some other indications of general economic health. Altman et al. (2017) applied the Z-score model to an international dataset extracted from the ORBIS database. The estimation sample consisted of 2,602,563 non-failed firms and 38,215 failed firms from 28 European and 3 non-European countries. They found that the original coefficients of the Z-score model performed well on their international dataset. Re-estimating the multiple discriminant coefficients using the weighted data marginally improved the classification performance. The authors also developed both international and country-specific logistic regression models using the four ratios in the Z-score model and additional variables relating to year, size, age, and industry.


1.1.4  Modelling Techniques and Applications

As can be seen from the Appendix, most studies have relied on conventional discrete choice models, such as LDA and logit/probit models. Nearly all studies in the Appendix are based on binary dependent variables. As shown in Chapter 3, a few studies outside the accounting and finance literature have used more sophisticated machine learning techniques such as gradient boosting, adaptive boosting or AdaBoost, and neural networks (see Cortés et al., 2007; Kim and Kang, 2010). Generally, the predictive accuracy of these models has been shown to be significantly better than that of conventional models. For instance, using AdaBoost, Cortés et al. (2007) reported test error rates as low as 6.7% on a sample of 2,730 private firms, although the predictive power of different models has been quite variable across studies (Karas and Režňáková, 2014). Jones and Wang (2019) used a machine learning method known as gradient boosting machines, outlined in Chapter 3 and illustrated in Chapter 4. A major strength of the gradient boosting model is that it can handle a very large number of input variables without compromising model stability. Gradient boosting is also better equipped to handle “dirty data” issues, such as outliers, missing values, incompleteness in records, database errors, and non-normalness in the data. Arguably, private company datasets are more susceptible to these issues than public company datasets, particularly given the generally inferior quality of data available for private companies. Jones and Wang (2019) extended the private company literature in other ways. For instance, while much of the private company distress literature has used a binary classification setting, this approach is a simplistic representation of reality. As stated in Chapter 2, the practical risk assessment decisions by lenders and other parties normally cannot be reduced to a simple pay-off space of failed and non-failed (Ward, 1994; Ohlson, 1980). This may be particularly true for private companies, where there can be many different manifestations of distress or failure. For instance, the ORBIS database has several financial distress specifications for private companies. Jones and Wang (2019) demonstrated that an advanced machine learning technique such as gradient boosting can be effective in predicting failure in multi-class settings for private companies. They tested the power of gradient boosting with five different prediction tasks, ranging in difficulty from a more commonly used binary model to a multi-class model with up to five states of distress, as defined below in Table 5.3. Jones and Wang (2019) drew their private company sample from the Bureau van Dijk ORBIS database, which is one of the largest private company databases available, containing over 5 million private companies. The ORBIS database claims to provide a consistent methodology for variable definitions and sampling that can improve the reliability, interpretability, and generalizability of the data.

TABLE 5.3  Definition of Private Company Failure Used in Jones and Wang (2019)

Binary Model (1):
  0 = Active firms (active and active branch firms)
  1 = Firms in bankruptcy or liquidation proceeding

Binary Model (2):
  0 = Active firms
  1 = Firms in bankruptcy or liquidation proceeding, active firms in default, active firms subject to insolvency proceeding, firms dissolved through bankruptcy or liquidation

Three State Model:
  1 = Active firms (active and active branch firms)
  2 = Active firms that are in default or subject to an insolvency proceeding
  3 = Bankrupt companies (firms in bankruptcy or liquidation process; firms dissolved through bankruptcy or liquidation)

Five State Model:
  1 = Active firms (active and active branch firms)
  2 = Active firms in default, active firms subject to insolvency proceeding
  3 = Firms in bankruptcy or liquidation proceeding
  4 = Firms dissolved through bankruptcy or liquidation
  5 = Firms dissolved for reasons other than bankruptcy or liquidation (such as mergers)

Source: Jones and Wang (2019, p. 167)

While private company financial records can be particularly prone to incompleteness and error, ORBIS claims to provide a methodology for the collection and verification of all financial records. The key finding of the Jones and Wang (2019) study is that the gradient boosting model predicted very well in all decision contexts and significantly outperformed conventional models, as shown in Table 5.4. Not surprisingly, Jones and Wang (2019) reported that the strongest predictive performance was for a model using a binary class dependent variable. When the binary dependent variable was defined broadly, the best out-of-sample classification success was achieved; when the distress variable was defined quite narrowly, lower out-of-sample accuracy was reported. Jones and Wang (2019) reported that the gradient boosting model achieved an out-of-sample AUC better than .90 in a binary failure setting. A conventional logit model was also tested by Jones and Wang (2019) using the best performing variables from the gradient boosting model and only achieved an AUC of around .70, which is broadly consistent with the classification success for private companies reported in previous literature. In a three-state setting, Jones and Wang (2019) showed that a gradient boosting model achieved quite strong predictive success up to three years prior to failure, with an average AUC greater than .80 on most test samples. By contrast, a conventional multinomial logit model achieved predictive accuracy not much greater than .60.

TABLE 5.4  Summary of Predictive Performance for Five State Model Used in Jones and Wang (2019)

Panel A (t-1)                                   Learn Sample   Test Sample
Average log-likelihood (negative)               0.98609        1.01646
Misclassification rate overall (raw)            0.35421        0.37359
ROC (area under curve)                          0.91204        0.91243
Class. accuracy (baseline threshold)            0.58297        0.57951
Balanced error rate (average over classes)      0.35462        0.37266

Panel B (t-3)                                   Learn Sample   Test Sample
Average log-likelihood (negative)               1.19290        1.21829
Misclassification rate overall (raw)            0.45782        0.47784
ROC (area under curve)                          0.84503        0.80434
Class. accuracy (baseline threshold)            0.40760        0.40534
Balanced error rate (average over classes)      0.45918        0.47510

Panel C (t-5)                                   Learn Sample   Test Sample
Average log-likelihood (negative)               1.02730        1.20531
Misclassification rate overall (raw)            0.33263        0.44539
ROC (area under curve)                          0.76812        0.75114
Class. accuracy (baseline threshold)            0.41161        0.40574
Balanced error rate (average over classes)      0.43288        0.52497

Source: Jones and Wang (2019, p. 175)

Jones and Wang (2019) also showed that gradient boosting achieved similar results for the five-state failure model displayed earlier. Finally, not only did the gradient boosting model predict significantly better than conventional logit models, but the gradient boosting method also revealed a deeper structure among corporate failure predictors, mainly through nonlinearities and interaction effects, which cannot easily be captured using conventional models.

Jones and Wang (2019) also reported the relative variable importance (RVI) scores of the key variables that contributed most to the predictive power of their models. For instance, one year from failure, the top predictor variables for their five-state model shown earlier included: growth in public debt (average RVI = 100); annual change in shareholder numbers (average RVI = 98.99); total shareholder numbers (average RVI = 74.88); growth in inflation rate (average RVI = 74.30); growth in unemployment (average RVI = 65.99); growth in GDP constant (average RVI = 61.48); number of directors (average RVI = 55.45); interest cover (average RVI = 49.05); solvency ratio (average RVI = 44.86); credit period (average RVI = 40.26); total capital (average RVI = 37.64); and average cost of employee (average RVI = 36.05). As noted by Jones and Wang (2019), the ability of modern machine learning methods such as gradient boosting machines to provide an ordinal ranking of the best performing predictors based on out-of-sample prediction success can assist further theoretical development and debate around the role and influence of alternative distress predictors for private companies.
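As a rough analogue of the RVI rankings reported above, the sketch below computes permutation-based variable importance from a gradient boosting model on a held-out sample and rescales the scores so that the strongest predictor equals 100. The data file and predictors are hypothetical, not the Jones and Wang (2019) variables, and permutation importance is only one of several ways such rankings can be produced.

```python
# A hedged, self-contained sketch of an RVI-style ranking: permutation importance
# from a gradient boosting model on a held-out sample, rescaled so the strongest
# predictor scores 100 (mirroring the RVI convention above). Data and column
# names are hypothetical, not the Jones and Wang (2019) variables.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("five_state_sample.csv")             # hypothetical
X, y = df.drop(columns=["state"]), df["state"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

gbm = HistGradientBoostingClassifier(max_iter=500, learning_rate=0.05).fit(X_tr, y_tr)

# Importance = average drop in out-of-sample (one-vs-rest) AUC when a predictor
# is randomly permuted; rescale so the top variable = 100.
imp = permutation_importance(gbm, X_te, y_te, n_repeats=10, random_state=0,
                             scoring="roc_auc_ovr")
rvi = pd.Series(imp.importances_mean, index=X.columns)
print((100 * rvi / rvi.max()).sort_values(ascending=False).head(12).round(2))
```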


1.2  Not-for-Profit Entities

There have been a very limited number of studies that have developed distress prediction models for not-for-profit entities. Similar to private companies, there is good motivation to develop distress predictions for this sector. One of the key reasons is that the not-for-profit sector is very large as well as being socially and economically significant to the entire global economy. Some key statistics on not-for-profits have been provided by the Urban Institute in the United States (see the Nonprofit Sector in Brief 2019):11 (1) approximately 1.54 million non-profits were registered with the Internal Revenue Service (IRS) in 2016, an increase of 4.5% from 2006; (2) the non-profit sector contributed an estimated $1.047.2 trillion to the US economy in 2016, composing 5.6% of the country's gross domestic product (GDP); (3) of the non-profit organizations registered with the IRS, 501(c)(3) public charities accounted for just over three-quarters of revenue and expenses for the non-profit sector as a whole ($2.04 trillion and $1.94 trillion, respectively) and just under two-thirds of the non-profit sector's total assets ($3.79 trillion); (4) in 2018, total private giving from individuals, foundations, and businesses totalled $427.71 billion (Giving USA Foundation, 2019), a decrease of 1.7% from 2017 (after adjusting for inflation). According to Giving USA Foundation (2019), total charitable giving rose for consecutive years from 2014 to 2017, making 2017 the largest single year for private charitable giving, even after adjusting for inflation; and (5) an estimated 25.1% of US adults volunteered in 2017, contributing an estimated 8.8 billion hours, a 1.6% increase from 2016; the value of these hours is approximately $195.0 billion.

Not only is the not-for-profit sector very large and economically significant, but the transparency and governance of these entities has often been under the spotlight. As noted by Keating et al. (2005), several large not-for-profits, notably Planet Aid Canada (Cribb, 2002), the NAACP, United Way, Upsala College, and the Nature Conservancy, have been shaken by a succession of financial scandals (see Gibelman et al., 1997; Stephens, 2004).12 These and other high-profile cases, along with the growing public visibility of the sector, have led to several calls to strengthen the governance of not-for-profit entities (for example, see Boudreau, 2003; Strom, 2003). As stated by Gibelman et al. (1997, p. 21):

Recently, we have witnessed a series of media revelations about the wrongdoings of nonprofit organizations. Are these isolated events, ripe for media exaggeration? Are charitable organizations under greater scrutiny today, and thus their long-standing vulnerabilities are just now surfacing? Or are there pervasive "misdeeds" occurring within nonprofits that warrant further investigation and corrective action? We simply do not know if these are singular events. What is clear is that suspicion and disparagement have been cast upon the nonprofit sector in general, the results of which may reverberate in tighter government regulations, donor skepticism, and greater demands upon and expectations of governing boards. In such circumstances, the efficacy of nonprofit governing boards warrants reconsideration.


There have only been limited applications of financial distress models to the not-for-profit sector. For instance, Schipper (1977) developed a model of financial distress applicable to US private colleges. Schipper (1977) presents an economic model that involves the maximization of the present value of a function, Q, which depends on the levels of certain stocks (library, plant, endowment) and decision variables (tenured faculty, number of students, student aid), subject to conditions on the changes of stocks over time through depreciation and investment. As stated by Schipper (1977, p. 1):

The purpose of this study is to address the problem of providing an analytical description of financial distress in private colleges that will serve as a basis for an empirical description by presenting a model of the financial condition of a private college and by using this model to find and test presently produced accounting measures that will describe financial distress empirically.

For the purposes of this study, distress was defined as any private institution that closes its doors (or whose trustees opt to do so), declares bankruptcy, or is taken over by another entity for financial reasons. Schipper (1977) used a stepwise forward LDA model, which was applied to the candidate variables, as well as four independent variables included to capture the effects of external (i.e., non-financial) factors. However, Magee (1977) critiqued the complexity, ambiguity, and deterministic nature of Schipper's analytical model and the definition of the objective function Q. He stated: "Unfortunately, the study lacks an explicit analysis of the decisions that are being made by the various parties in the educational system, the uncertainties they face, and the resulting costs of any 'mistakes' they might make" (p. 41).

Never (2013) examined financial distress in human service not-for-profit organizations. His analysis is based on the IRS Form 990 digitized data provided by the National Center for Charitable Statistics (NCCS). He used the core data files for two periods: 2004–2006, as the period immediately preceding the GFC, and 2007–2009, concurrent with the GFC. Never (2013) argued that a three-year period allows for an understanding of organizational finances while potentially smoothing any aberrant changes in any one period. Never (2013) also argued that an organization's expenses are a more contemporaneous measure of its presence in a community, and that cutting expenses over a three-year period indicates an organization with a shrinking financial presence and capacity to continue to deliver services.


Never (2013) analysed two dependent variables. Model 1 measured distress as a decline in expenses by at least 20% over a three-year period. Model 2 measured distress as a decline in expenses by at least 50% over a three-year period. Independent variables of the study include median household income; minority population as defined by the Census Bureau; the ESRI diversity index (a measure of the likelihood that any two people randomly selected from the same geography will belong to different racial/ethnic groups); total revenue; frontline organizations (organizations with NTEE codes indicating employment, housing, or general human services sub-sectors); and number of employees (as indicated in the NCCS Core Files). Joining the NCCS Core Files with spatial data from the American Community Survey, Never (2013) reported a positive relationship between financial distress and minority population.13

Tuckman and Chang (1991) developed a theory of financial vulnerability for not-for-profit organizations. They defined a not-for-profit organization as financially vulnerable if "it is likely to cut back its service offerings immediately when it experiences a financial shock" (p. 445), such as an economic downturn or the loss of a major donor. Further, "financial flexibility is assumed to exist if an organization has access to equity balances, many revenue sources, high administrative costs, and high operating margins. Organizations that lack flexibility are assumed to be more vulnerable than organizations with flexibility" (p. 450). Tuckman and Chang (1991, pp. 451–453) hypothesized four indicators of financial vulnerability for a not-for-profit organization (see the summary in Greenlee and Trussel, 2000):

Inadequate equity balances. Not-for-profits with large amounts of equity are in a better position to borrow. After a financial shock, a not-for-profit with a large equity balance may be able to leverage its assets rather than reduce its program offerings. Thus, the lower the equity balance, the more likely the not-for-profit is to be financially vulnerable.

Revenue concentration. Gifts, grants, programme services, membership dues, inventory sales, and investments are all sources of revenue for non-profits. Organizations with a limited number of revenue streams may be more exposed to financial shocks than those with several. A non-profit with numerous revenue streams may be able to rely on alternate financing sources and avoid having to cut back on its programme offerings. Thus, organizations receiving revenues from fewer sources should be more likely to be financially vulnerable – a predicted positive relationship.

Low administrative costs. Not-for-profits with lower administrative costs may be more sensitive to financial shocks than those with higher administrative costs. After a financial shock, a not-for-profit with higher administrative costs may be able to reduce discretionary administrative costs prior to reducing its program offerings. As a result, non-profits with smaller administrative overhead are more likely to face financial difficulties.


Low operating margins. Not-for-profit entities with relatively low operating margins may be more vulnerable to financial shocks than those with relatively high operating margins. A not-for-profit with a high operating margin may be able to operate with a lower operating margin rather than reducing its programme offerings in the event of a financial shock. As a result, the lower the operating margin, the more vulnerable the organization is to financial risk.

Greenlee and Trussel (2000) built on the work of Tuckman and Chang (1991) to develop a prediction model for financially vulnerable charities. In particular, Greenlee and Trussel (2000) used Tuckman and Chang's four indicators of financial vulnerability – equity ratio, revenue concentration, administrative cost ratio, and operating margin – and developed a predictive model using methodology from the for-profit prediction literature. Because of the lack of data on not-for-profit bankruptcies, they defined "financially vulnerable" as meaning any not-for-profit organization that evidenced an overall decline in program expenses during a three-year period. Using the Form 990 database provided by the National Center for Charitable Statistics (NCCS) and a methodology initially developed by Altman (1968), they examined data from the 1992–1995 Form 990s of 6,795 not-for-profits. Greenlee and Trussel (2000) reported that their distress model was significant, with three of the four financial indicators of the Tuckman and Chang study significantly contributing to the overall model. Within certain parameters, they were able to predict with reasonable accuracy whether a charity was financially vulnerable.

Trussel and Greenlee (2004) expanded this study in five ways. First, they included size in the model, since smaller not-for-profits may be more vulnerable to financial distress than larger not-for-profits. Second, they controlled for not-for-profit sub-sector, since different types of not-for-profits may be impacted differently by changes in the economy. Third, they defined "financial distress" as a "significant" decrease in net assets over a three-year period. Fourth, they tested the resulting models for robustness by applying them to different time periods. Finally, they developed a way to rate the financial vulnerability of not-for-profits. Their composite model proved robust and was able to predict financial distress quite accurately. Significant relationships were found between financial distress and two of the Tuckman and Chang measures, and also between financial distress and organizational size.
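For concreteness, the sketch below shows one way the four Tuckman and Chang indicators might be computed from Form 990-style fields and combined in a logit model along the lines of Greenlee and Trussel (2000). The field names, the Herfindahl-style revenue concentration measure, and the distress flag are illustrative assumptions rather than the authors' exact specification.

```python
# A hedged sketch, in the spirit of Greenlee and Trussel (2000): computing the
# four Tuckman-Chang indicators from Form 990-style fields and fitting a logit
# on a decline-in-program-expenses distress flag. The field names, the
# Herfindahl-style concentration measure, and the distress flag are illustrative
# assumptions, not the authors' exact specification.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("form990_panel.csv")                 # hypothetical NCCS-style extract

df["equity_ratio"] = df["net_assets"] / df["total_revenue"]
rev = df[["contributions", "program_revenue", "dues", "investment_income", "other_revenue"]]
df["revenue_concentration"] = (rev.div(rev.sum(axis=1), axis=0) ** 2).sum(axis=1)
df["admin_cost_ratio"] = df["admin_expenses"] / df["total_expenses"]
df["operating_margin"] = (df["total_revenue"] - df["total_expenses"]) / df["total_revenue"]

y = df["program_expense_decline_3yr"].astype(int)     # 1 = decline over three years
X = sm.add_constant(df[["equity_ratio", "revenue_concentration",
                        "admin_cost_ratio", "operating_margin"]])
print(sm.Logit(y, X).fit(disp=0).summary())
```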


Trussel (2002) had earlier used a broader dataset to predict financial vulnerability. The NCCS Core Files included smaller not-for-profits but fewer data fields. The final sample included 94,002 charities for the period 1997–1999, and financial distress was defined as a 20% reduction in net assets over a three-year period. Two of the Tuckman and Chang variables could not be computed since the necessary information was not coded by the Internal Revenue Service (the equity ratio and administrative cost ratio). Trussel (2002) replaced the equity ratio with a debt ratio (total liabilities divided by total assets) and added a size variable. Due to the expanded dataset, not-for-profit sub-sector control variables were more detailed than was possible in previous studies. All the variables were found to be statistically significant, and the predictive ability exceeded that of a naïve model.

Hager (2001) examined the ability of the Tuckman and Chang ratios to predict the actual demise of arts organizations. Hager (2001) found that predictive ability varied within the sector: the Tuckman and Chang measures could be used to predict the closure of some, but not all, arts organizations. Hager (2001, p. 389) stated:

The Tuckman-Chang measures of financial vulnerability do indeed have utility in explaining the demise of nonprofit arts organizations. Low equity balance was found to be a viable predictor of demise among art museums, theaters, and music organizations. High revenue concentration was found to be useful in predicting the death of visual arts organizations, theaters, music organizations, and generic performing arts organizations. Low administrative costs were associated with the loss of both theater and music organizations. Finally, low operating margin was significantly related to the closure of theaters and generic performing arts organizations. These relationships always manifested in the theorized direction, and were supported by both a comparison of group means and multivariate logit models.

Burde (2018) introduced a number of methodological improvements, including the use of hazard analysis to predict the financial vulnerability of not-for-profit organizations. The improvements are based on two new ideas: (1) introducing the generalized time-at-risk, which measures the "level of instability" more consistently than the commonly used definition, and (2) dividing the sampled data into two roughly equivalent samples and comparing the findings produced with each sample, allowing the results to be adjusted for prediction purposes by optimizing the method parameters so that the difference between the results is minimal.

Lord et al. (2020)14 used a modified Altman Z-score to predict financial distress within the nursing home industry. Their modified Altman Z-score model used LDA to examine multiple financial ratios simultaneously to assess a firm's financial distress. The study utilized data from Medicare Cost Reports, LTC Focus, and the Area Resource File. The sample consisted of 167,268 nursing home-year observations, or an average of 10,454 facilities per year, in the United States from 2000 through 2015. The independent financial variables included liquidity, profitability, efficiency, and net worth, which were entered stepwise into the LDA model. All of the financial variables, with the exception of net worth, significantly contributed to the discriminating power of the model. The authors then used K-means clustering to classify the latent variable into three categorical groups: distressed, at risk of financial distress, and healthy.
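A compact sketch of the general two-step approach described above (a discriminant score followed by K-means grouping) is given below. It is not Lord et al.'s (2020) actual specification: the data file, the ratio names, and the preliminary distress label are placeholders.

```python
# A hedged sketch of the two-step approach described above (an LDA-based score
# followed by K-means grouping), not Lord et al.'s (2020) actual specification.
# The file, ratio names, and the preliminary distress label are placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("nursing_home_years.csv")            # hypothetical facility-year data
ratios = df[["liquidity", "profitability", "efficiency", "net_worth"]]
label = df["distressed"]                              # preliminary binary indicator

# One-dimensional discriminant (Z-style) score from the financial ratios.
lda = LinearDiscriminantAnalysis(n_components=1).fit(ratios, label)
df["z_score"] = lda.transform(ratios)[:, 0]

# K-means on the score to form three groups: distressed, at-risk, healthy.
df["group"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df[["z_score"]])
print(df.groupby("group")["z_score"].agg(["mean", "count"]))
```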


1.3  Public Sector Entities

There is also comparatively little research into the financial distress of public sector entities. There has been some analysis of fiscal and financial crises in the local government sector, particularly in the United States in the wake of the financial problems facing New York City and Cleveland during the 1970s (see, e.g., Gramlich, 1976; Clark and Ferguson, 1983; Wallace, 1985; Falconer, 1990) and a subsequent spate of major financial crises in the early 1990s (see, e.g., Honadle, 2003). Local governments in a variety of other countries have also been hit by financial issues, according to commentators (e.g., Carmeli and Cohen, 2001; Bach and Vesper, 2002; Carmeli, 2003). A financial crisis in this scenario could result in bankruptcy or loan default (see, e.g., the cases described in Cahill and James, 1992) or a string of operating losses (Cahill and James, 1992; Bach and Vesper, 2002).

Trussel and Patrick (2009) investigated the financial risk factors associated with fiscal distress in local governments. They hypothesized that fiscal distress is positively correlated with revenue concentration and debt usage, while negatively correlated with administrative costs and entity resources. Their regression model resulted in a prediction of the likelihood of fiscal distress, which correctly classified up to 91% of the sample as fiscally distressed or not. Their model also allowed for an analysis of the impact of a change in a risk factor on the likelihood of fiscal distress. A decrease in intergovernmental revenues as a percentage of total revenues and an increase in administrative expenditures as a percentage of total expenditures had the largest influence on the likelihood of fiscal distress.

In another study, Trussel and Patrick (2013) used a hazard model to investigate fiscal distress in special district governments. Their model accommodated differences in district functions, financing, and legislation. Trussel and Patrick (2013, p. 590) defined fiscal distress as:

a significant and persistent imbalance between revenues and expenditures. We operationalize fiscal distress as three consecutive years of operating deficits (scaled by total revenues) that cumulate to more than five percent.

Their results show that fiscally distressed districts tend to have more diverse revenue sources, lower capital expenditures, use more debt, and are smaller than districts that are not fiscally distressed. Their survival model correctly classified up to 93.4% of the sample and showed that the most important indicator of fiscal distress is a low level of capital expenditures relative to total revenues and bond proceeds.

Notwithstanding the studies cited earlier, Jones and Walker (2007) observed that much of the public sector financial distress literature has been concerned with exploring the reasons for fiscal crises. Some observers attribute these issues to a lack of organizational resources and managerial abilities, resulting in an inability to supply high-quality services in a timely way or adjust to changing circumstances (Carmeli and Cohen, 2001). Others have suggested that the distress is the result of a failure to adapt to economic downturns in general, or the financial impact of unfunded mandates, where state governments shifted responsibilities to cities or municipalities without financial compensation or limited local governments' ability to increase revenues (Falconer, 1990; Beckett-Camarata, 2004).


Still others sought to explain local government behaviour in times of financial stress (for a review, see Cooper, 1996), or to describe state responses to municipal crises (see Cahill and James, 1992).

While there appear to have been only limited attempts to predict local government financial distress in the research literature, those with a responsibility to monitor the performance of the local government sector have utilized a range of techniques to identify municipalities that may be facing difficulties. However, one contribution noted that while some jurisdictions have sought to establish early warning systems, "they may not be functioning as planned" (Cahill and James, 1992, p. 92). Another study reported the development of a simple index based on arbitrary weighting of nine variables (Kleine et al., 2003). Subsequently, these authors applied their distress methodology to a sample of Michigan local governments and concluded that their index performed better than Michigan's current system of identifying potentially distressed local councils, which was apparently based on a general financial statement analysis approach (Kloha et al., 2005). The claim of improved performance was founded on the idea that the index had "theoretical validity", that it provided similar outcomes to state agency assessments, and that it was an efficient or parsimonious model. Honadle (2003) found that just under half of the states attempted to anticipate local government fiscal problems, primarily by studying audit reports, local government reporting, or information obtained from conversations or regional workshops, with only a few states employing "financial analysis methodologies" (p. 1454). However, none appeared to be using a statistical distress prediction model. An Australian study proposed an econometric distress prediction model and (like Kloha et al., 2005) compared the results to a state government agency's "watch list", concluding that the latter's selection of "at risk" councils did not accurately identify municipalities that were (according to their model) in fact "at risk" (Murray and Dollery, 2005).

A number of commentaries have highlighted that financial statement analysis alone may be a poor basis for projecting local government distress because, as stated by Jones and Walker (2007), financial ratios may only reveal problems "too late". Indeed, Clark and Ferguson (1983) provided significant data to support their claim that fiscal stress reflects governments' failure to adjust to changes in the taxpaying community's resources. That observation in itself highlights the difficulties of applying distress prediction models to the public sector environment. It is well recognized that, in the private sector, a prediction of distress may not be fulfilled if management takes corrective action. Much research has modelled failure as a basic binary classification of failure or non-failure, which is acknowledged as a key weakness of the distress literature (Jones, 1987; Jones and Hensher, 2004). This methodology has been frequently questioned because the formal legal idea of bankruptcy (or insolvency) may not necessarily reflect the underlying economic realities of business financial distress. For example, there is substantial documented evidence that corporations have, from time to time, used bankruptcy protection for their own strategic goals, such as avoiding creditors (Delaney, 1999).


Furthermore, the two-state model may contradict underlying theoretical models of financial failure, which may restrict the generalizability of empirical findings (see Chapter 2). If the archetypal two-state failure model has dubious relevance in the private sector, its application in the public sector may be even more severely limited, as financially distressed public sector entities may respond to lower revenues or higher costs by reducing the range and quality of services they provide to the community. A statistical modelling approach is arguably superior to more rudimentary and heuristic approaches (such as financial statement analysis) because it allows the testing of formal hypotheses and an examination of the statistical and explanatory impact of a range of covariates in a multivariate setting.

Previous studies, according to Jones and Walker (2007), have struggled to come up with a meaningful measure of local government distress. For instance, Clark (1977) discussed four indicators of municipal fiscal strain: (1) probability of default, where default is defined as not meeting bond repayments; (2) ratio indicators, such as gross debt divided by the tax base or short-term debt to long-term debt; (3) socio-economic characteristics, such as population size and median family income; and (4) funds flow measures. However, these measures have certain intractable problems when operationalized as a formal measure of local government distress (particularly in Australia). As noted by Clark (1977), bond defaults may not be useful given that actual default rates have historically been extremely low (see also Kaplan, 1977).

Another dependent variable that can potentially indicate distress in local councils is the incidence of mergers and amalgamations. The NSW government has encouraged voluntary local council mergers to promote efficiencies. For instance, the Local Government Amendment (Amalgamations and Boundary Changes) Bill 1999 streamlines the procedures laid down in the Local Government Act 1993 for voluntary amalgamations of council areas. Prima facie, the incidence of mergers and amalgamations represents a potentially attractive dependent variable, as merged councils can be readily identified and typically such merger activity has been motivated by concerns about the financial viability of some local councils. For example, a proposal for the creation of a New Capital City Regional council in southern areas of the state would have seen the merger of five smaller councils. However, as noted by Jones and Walker (2007), there are a number of issues to consider if council mergers and amalgamations are used as the distress metric. Public companies experiencing distress can seek out merger partners in any number of locations, and typically merge with business partners that are in a stronger financial position. However, mergers of local councils in NSW (and elsewhere) are constrained by geographic considerations. Typically, distressed councils merge with adjacent councils that may only be marginally better off in financial terms themselves. Merging two or more financially fragile councils does not necessarily produce one larger "healthy" council. Most mergers in recent years in NSW have involved smaller regional councils, and the numbers have been comparatively small in absolute terms (not more than 13% in the past ten years).


Given the difficulties in operationalizing an appropriate financial distress measure for local councils, the Jones and Walker (2007) study of distressed local government councils in Australia focused on constructing a proxy of distress linked to the basic operating objective of local councils, which is to provide services to the community. The major responsibilities of Australian local government are the provision of local infrastructure (such as roads, bridges, and community facilities) and waste collection. Local councils are responsible for administering building controls (though in some circumstances these may be overridden by state authorities). In major metropolitan areas, the provision of water and sewerage services is undertaken by state agencies, but in rural and regional areas these functions are generally provided by local councils (sometimes through joint ventures). While individual councils may provide some social welfare services, the provision of health and education services is a responsibility of the states, with the commonwealth government providing earmarked grants to support some services (such as home care programs for the aged or persons with disabilities).

Service delivery can be considered in terms of both the quantity and quality of services provided. Jones and Walker (2007) focused on the qualitative aspects of service delivery. The authors concluded that a purely quantitative measure of service delivery can result in misleading interpretations of local council distress and may not, for various reasons, be strongly associated with explanators of distress. For example, road infrastructure can be provided and/or maintained by a local council even though road quality itself may be steadily diminishing over time or left in a poor state of repair. Similarly, sewerage infrastructure may continue to operate even though it is in such a poor state of repair that it threatens public health standards and the local environment.

The sample used in the Jones and Walker (2007) study is based on the financial statements and infrastructure report data of 172 local councils in New South Wales over the two-year period 2001–2002. The data collected included local council characteristics (such as whether the council is large or small, or urban or rural, based on formal classifications used by the DLG); service delivery outputs; condition of infrastructure; and an extensive range of financial variables (described below). Data was collected from several sources. Infrastructure data was accessed from the 2002 infrastructure reports to the Minister on the condition of public infrastructure prepared by New South Wales councils in accordance with the Local Government Act of 1993. These reports were provided by the New South Wales Department of Local Government for the population of 172 councils then operating in the state in 2002.

From the preceding discussion, the definition of distress used by Jones and Walker (2007) incorporated a qualitative measure of service delivery. This definition is not linked to social service outputs per se but to the condition of the infrastructure assets upon which the delivery of local council services is critically dependent. The dependent variable in the Jones and Walker (2007) study is a continuous variable defined as the ratio of the expected total costs to bring local council infrastructure assets to a satisfactory condition, scaled by total revenues. Scaling (i.e., dividing) by total revenues is intended to control for size differences between local councils and is appropriate because general revenue is the primary source of funds available to local councils to maintain infrastructure in a satisfactory condition.


As noted previously, many NSW councils (mainly those outside major metropolitan areas) received revenues from charges for water and sewerage services (and these charges are not subject to rate pegging). Accordingly, scaling involved the use of total revenues both excluding and including water and sewerage charges. It was found that scaling total costs to bring infrastructure assets to a satisfactory condition by other denominators (such as operating cash flows or total assets) was highly correlated with total revenues, suggesting that this measure is robust to the choice of scale. A broad range of financial and non-financial measures were tested by Jones and Walker (2007). The explanatory variables fall into four categories: (1) council characteristics; (2) local service delivery variables; (3) infrastructure variables; and (4) financial variables.

Jones and Walker (2007) reported that revenue-generating capacity (measured by rates revenue to total ordinary revenue and ordinary revenue less waste and sewerage charges to total assets) had the strongest overall statistical impact on council distress. Road maintenance costs were also found to have a significant main effect in the model but were also associated with a number of significant interaction effects, particularly when this variable was interacted with the area (in square kilometres) serviced by councils; whether councils are classified as urban or rural; and the expected costs of getting road assets into satisfactory condition. Surprisingly, some variables widely believed to be associated with local council distress in Australia were not found to be significant, such as the distinction between rural and urban councils (using t-tests, the dependent variable used in the study was not found to differ significantly between rural and urban councils, nor were a large number of financial performance variables significant).

Jones and Walker (2007) concluded that there are several advantages to using a statistical model in evaluating local council distress, as opposed to a purely descriptive or heuristic approach (such as a ratio analysis of key variables). A statistically grounded distress model, for example, can be used as an effective screening tool across a large number of councils to detect potential financial challenges and pressures. Such models can also have substantial policy implications, such as providing an objective basis for analysing local councils' financial circumstances in various rate-pegging applications and/or identifying troubled municipalities that are vulnerable to merger and/or amalgamation activity (as well as assessing whether the merger activity itself has been effective in reducing distress). Triangulation of several distress measurement methodologies (such as heuristic/descriptive and quantitative tests) could potentially provide more insight into the circumstances of public sector agencies than using one approach alone.
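Returning to the dependent variable described above, the sketch below shows how such an infrastructure-based distress proxy might be constructed and related to council characteristics in a simple regression. The field names and regressors are hypothetical and are not intended to reproduce the Jones and Walker (2007) model.

```python
# A hedged sketch of the distress proxy described above -- the estimated cost of
# bringing infrastructure assets to a satisfactory condition, scaled by total
# revenue -- and a simple illustrative regression on council characteristics.
# Field names are hypothetical and the regressors are not the authors' model.
import pandas as pd
import statsmodels.formula.api as smf

councils = pd.read_csv("nsw_councils_2002.csv")       # hypothetical extract

councils["distress"] = councils["cost_to_satisfactory"] / councils["total_revenue"]
councils["distress_ex_ws"] = (councils["cost_to_satisfactory"]
                              / (councils["total_revenue"] - councils["water_sewer_charges"]))

model = smf.ols("distress ~ rates_to_total_revenue + road_maintenance_costs"
                " + area_sq_km + C(urban)", data=councils).fit()
print(model.summary())
```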

Key Points From Chapter 5

Much of the distress risk and corporate failure literature to date has been applied to publicly traded companies. There is comparatively little distress risk and corporate failure research relevant to private companies, not-for-profit entities, and public sector entities. As with public companies, the COVID pandemic has wreaked havoc on private companies and not-for-profit entities, the majority of which are charities. Private company failure rates are likely to escalate as the pandemic continues and various government support programs are withdrawn.


In private company distress research, we find a wide range of definitions of failure and a significant amount of heterogeneity in sample sizes, reporting jurisdictions, and data sources. There is also a wide range of alternative explanatory variables and statistical learning methods that have been used. In prior research, definitions of failure have included: loss to borrowers and guarantee recipients; being wound up by a court; liquidation or ceasing to trade; sustained non-compliance with banking obligations; bankruptcy or liquidation; loan default; financial distress; bankruptcy or default; cash shortages; and combinations of bankruptcy, receivership, liquidation, inactivity, and special treatment firms.

Most private company distress studies have used conventional discrete choice models, such as LDA and logit/probit models. However, recent research has used machine learning methods in binary and multi-class settings. Private company distress studies have used a wide variety of explanatory variables, including financial ratios; reporting lags, firm age and size; industry and regional background; macro-economic variables (such as employment rates and GDP); internal credit ratings; managerial structure, inadequacy of the accounting information system, and audit lags; strategy, relations with banks, product pricing, and marketing; and characteristics of owners and quality of the workforce.

There have also been a very limited number of distress studies for not-for-profit entities, although there are very good reasons to develop distress predictions for this sector. One of the key reasons is that the sector is very large as well as being socially and economically significant to the global economy. As few not-for-profit and public sector entities declare bankruptcy, the definition of distress can be problematic. Many studies use financial indicators as distress proxies, such as a series of deficits or a reduction in expenses over a given period of time. Some studies combine several financial indicators to identify financially vulnerable not-for-profit entities, such as the equity ratio, revenue concentration, administrative cost ratio, and operating margins. Other studies have used a modified version of the Altman Z-score model.

Modelling the distress of public sector entities faces similar challenges, as public sector entities do not typically fail or declare bankruptcy. Again, many studies use financial indicators as distress proxies, such as a series of operating deficits. Other studies have used a qualitative measure of service delivery (such as the expected costs to bring local council infrastructure assets to a satisfactory condition). The continual development and refinement of distress prediction models for private sector entities, not-for-profits, and public sector entities remains a major challenge and opportunity for future research.

Appendix for Chapter 5 Applications of financial distress risk and bankruptcy prediction models for private firms Definition of Distress

Number/ Type of Input Variables

Interaction Out-of-Sample Traditional Effects Prediction Statistical or Machine Learning considered Techniques Used

38,215 failed and Journal of 2,602,563 nonInternational failed firms from Financial the ORBIS Management database over and the period of Accounting, 8: 2007–2010 131–171

Private firms Bankruptcy, under receivership, from many and special cases countries depending on (largely countries (e.g., European) liquidation in Norway)

Nine variables, including four financial ratios

Multiple discriminant analysis and binary logistic regression

No

The Journal of Credit Risk, 6: 95–127

Liquidation, British administration, small and and receivership. medium enterprises (unlisted firms)

Up to 13 variables, including five financial ratios in SME1 and

Binary logistic regression

No

Author/s

Journal

Altman et al. (2017)

Altman et al. (2010)

Estimation Sample Size

Two datasets: 2,263,403 firms and 3,462,619 firms from Leeds University and CreditScore Ltd over the period of 2000–2005

Public/Private Firm Sample

AUCs depending on models on a test sample 3,148,079 non-failed and 43,664 failed firms from various countries SME1: AUC of 0.76 on a test sample of 237,154 firms; SME2: AUC of 0.75 on a test sample of 540,905 firms



The British Balcaen Accounting and Review, 38: Ooghe 63–93 (2006) Bhimani Journal of et al. Accounting (2010) and Public Policy, 29: 517–532

N/A

N/A

24,818 firms from Portuguese private the Central Bank firms of Portugal over the period of 1997–2003

N/A

   up to 20 variables, including nine financial ratios in SME2 N/A

N/A

Loan default as 15 variables, Binary logistic regression defined in Basel II including 11 financial ratios

N/A

Yes

Classification success of 68.7% to 74.8% and AUC of 0.753 depending on cut-off points on a test sample of 6207 firms (Continued)


N/A

Bhimani et al. (2013)

Bhimani et al. (2014)

European Accounting Review, 22: 739–763

Review of Accounting Studies, 19: 769–804

1,529 observations from the Portuguese Central Bank’s databases over the period 1999–2005

Portuguese private firms

Loan default as defined in Basel II

Up to 12 variables, including four financial ratios

21,558 failed and nonfailed firms from the Central Bank of Portugal over the period of 1996–2003

Portuguese private firms

Loan default as defined in Basel II

14 variables, including eight financial ratios

The Cox proportional hazard model

No

AUC of 0.714 to 0.901 on a test sample of 655 observations depending on the number and type of predictors used

Binary mixed logistic regression

No

An AUC of 0.766 on a test sample of 21,559 firms



Buehler et al. (2012)

Small Business Economics, 39: 231–251

Ciampi (2015)

Journal of Business Research, 68: 1012–1025

Roughly 74,0000 firms from various sources over the period of 1995–2000 934 firms from CERVED databases over the period of 2008–2010 (randomly split 500 times into training and test samples)

Largely private firms in Switzerland

Bankruptcy

Up to 19 variables

Italian small enterprises (presumably private)

Form legal proceedings for debt recovery (bankruptcy, forced liquidation etc.)

Up to 14 variables, including four financial ratios

The Cox proportional hazard model and mixed logit regression Binary logistic regression

No

N/A

No

Overall classification of 76.4% and 87.9%, depending on the number of predictors



Author/s

Journal

Estimation Sample Size

Public/Private Definition of Distress Number/ Firm Sample Type of Input Variables

Interaction Out-of-Sample Traditional Effects Prediction Statistical or Machine Learning Considered Techniques Used

Ten financial Neural networks, No Italian private The beginning Ciampi and Journal of Small 100 firms from ratios multiple firms of formal legal the CERVED Gordini Business discriminant proceedings (2013) Management, database over the analysis, for debt period of 51: 23–45 and logistic (bankruptcy, regression forced liquidation etc.)

Cortés et al. Applied (2007) Intelligence, 27: 29–37

Bankruptcy, 2,730 firms from Spanish temporary public and BvD’s SABI receivership, private database over acquired and companies the period of dissolved firms 2000–2003 Duarte et al. Small Business 5,898 observations Portuguese Loan default small and (2018) Economics: from a major 1–18 commercial bank medium sized covering the enterprises period 2007– (private 2010 firms)

18 variables, AdaBoost versus N/A single tree including 14 financial ratios Ten variables Binary probit regression (none of them are financial ratios)

Yes

Overall prediction accuracy of 68.4% for neural networks, 65.9% for multiple discriminant analysis, and 67.2% for logistic regression on a test sample of 6113 firms AdaBoost outperforms single tree. Overall test error rate of 6.569% AUC of 0.772 on a test sample of 4719 observations



Fedorova et al. (2013)

Expert Systems with Applications, 40: 7285– 7293

790 firms from SPARK database over the period 2007–2011

Bankrupt Russian according manufacturing to Russian firms with Federal Law fewer than 100 employees (presumably private)

Up to 14 financial variables

Neural networks and AdaBoost

Overall classification success of 57% when 25 variables were contained, and 93% when 7 variables were contained N/A

The outof-sample accuracy ratio of approximately 0.54 N/A Overall classification success 88.8% of based on a test sample of 98 (Continued)


Multiple Yes American small Loss borrowers Up to 25 Edmister (1972) The Journal of Two samples: a and guarantee financial discriminant businesses sample of 42 Financial and recipients variables analysis (largely private firms, and a Quantitative firms) sample of 282 Analysis, 7: firms drawn from 1477–1493 Small Business Administration and Robert Morris Associates over the period 1958–1965 Everett and Small Business 5,196 from managed Australian small Five alternative Seven variables, Binary logistic No regression businesses definitions including five Watson (1998) Economics, shopping centres economic 11: 371–390 over the period variables 1961–1990 Default and Ten financial Binary probit No Falkenstein et al. N/A A subset of data from American and Canadian bankruptcy ratio regression (2000) Moody’s Credit private firms Research Database over the period 1983–1999

Fidrmuc and Hainz (2010)

Economic Systems 34: 133–147

Filipe et al. (2016)

Journal of Banking and Finance, 64: 112–135

Loan defaults 1,496 observations of Slovakian small and 667 firms from a medium-sized major commercial enterprises bank over the (presumably period 2000–2005 private) Defaulted, in 2,721,861 firm-year Private firms receivership, from eight observations of bankruptcy, European 644,234 firms from in countries Amadeus ORBIS, liquidation, and other databases disappearing over the period from the 2000–2009 sample with no updated status with negative equity in the last year

Eight variables, Binary probit regression including three financial ratios

Up to 15 variables, including five financial ratios

No

Multi-period Yes logit model and the Cox proportional hazard model

N/A

AUCs of 0.823 to 0.847 on a test sample of 304,037 observations depending on the number of predictors



Multinomial No British Ordered financial 13 variables, 483 stable, 78 logit including private distress (i.e., vulnerable, and regression two financial firms vulnerable 57 stressed farm ratios and stressed) businesses from based on rental the farm business equivalent/gross survey database margin over the period of 1983–1991 Loan default Up to 17 A discrete-time No American Small businesses Glennon Journal of variables hazard model small Money, Credit whose 13,550 and businesses loans were and Banking, Nigro (largely guaranteed by (2005) 37: 923–947 private) Small Business Administration

Franks European (1988) Review of Agricultural Economics, 25: 30–52

Overall classification success of 80% on a test sample of 85 stable, 22 vulnerable, and 10 stressed firms

The predicted number of defaults closely following actual defaults


Grunert et al. Journal of Banking (2005) and Finance, 29: 509–531

Up to ten Binary probit No variables, regression including up to six financial ratios Ceasing trading Six nonBinary logistic No financial regression variables

240 failed and non- German private firms Default as defined in failed firms from Basel II six major banks over the period of 1992–1996

N/A

N/A British small Journal of Management, 28 failed and 30 companies with 31: 737–760 non-failed firms fewer than 100 from a wide range employees (largely of sources over private) the period of 1985–1990 No N/A Norwegian private Bankruptcy Up to 11 Binary logit 21,203 firms from Hol (2007) International regression firms variables the Dun and Transactions including Bradstreet register in Operational four Research, 14: 75–90 and Norsk financial Lysningsblad over ratios the period of 1995–2000 Up to 34 AdaBoost and N/A Overall Karas and International Journal 408 failed and 1,500 Private manufacturing Bankruptcy classification multiple firms from Czech under the laws financial Režňáková of Mathematical non-failed firms success of variables discriminant Republic of the Czech (2014) Models, 8: 214–213 from Amadeus models depend analysis Republic Database over the on the number period of 2004– of predictors 2011 used, and AdaBoost outperformed multiple discriminant analysis

Hall (1994)



Keasey and Watson (1987)

Keasey and Watson (1988)

Up to ten Court variables, winding-up, including creditors’ up to five voluntary, financial members’ ratios voluntary, and ceased trading/ dissolved

Binary logistic regression

No

Overall classification success of 55% to 65% depending on models on a test sample of ten failed and ten non-failed firms

Liquidation and ceased trading

Multiple No discriminant analysis

Overall classification success of 58.9% to 67.8% depending on competing models on a test sample of 73 failed and 73 non-failed firms

Two financial ratios



British private firms 73 failed Journal of and 73 Business non-failed Finance and firms from Accounting, research 14: 335–354 centres at the University of Newcastle upon Tyne over the period of 1970–1983 British private firms Accounting and 40 firms from research Business centres at the Research, 19: University 47–54 of Newcastle upon Tyne over the period of 1970–1980

Kim and Kang (2010)

Expert Systems with Applications, 37: 3373– 3379

Korean 1,458 firms manufacturing from a firms (presumably Korean private) commercial bank over the period of 2002–2005

Bankruptcy (presumably legal bankruptcy)

Laitinen (1992) Journal of Business Venturing, 7: 323–340

Bankruptcy 78 observations Finnish newly of 40 firms founded small and entrepreneurial firms (private)

Laitinen (1995) European Accounting Review, 4: 433–454

80 firms from the National Board of Patents and Registration of Trademarks

Finnish private firms Bankruptcy

Seven financial Boosted neural N/A Average AUCs of 71.02% for ratios networks, neural networks, bagged 75.10% for neutral boosted neural networks, networks, neural 75.97% for networks bagged neural networks No N/A Eight variables, Multiple discriminant including analysis seven financial ratios Binary logistic No Overall 12 financial regression classification variables, success of including 81.25% for eight the logistic financial benchmark ratios in model and logistic 83.75% for regression the combined benchmark model of the model benchmark variables and overall probability of bankruptcy



European Accounting Review, 8: 67–92

76 firms over the period of 1986– 1989

Finnish private firms

Bankruptcy

Three financial ratios

McNamara et al. (1988)

Accounting and Finance 28: 53–64

63 failed and 84 non-failed firms from the Queensland Commissioner for Corporate Affairs from over the period of 1980–1983

Australian private firms

Being wound up by court order

Six financial ratios

Multiple discriminant analysis, binary logistic regression, recursive partitioning, the Cox proportional hazard model and neural networks, human information processing Multiple discriminant analysis

No

Overall classification success depending on models and the time horizons

No

Overall classification success of 85% on a test sample of 18 failed and 22 non-failed firms



Laitinen and Kankaanpaa (1999)

Author/s

Journal

Estimation Sample Size

Public/Private Firm Sample

Definition of Distress Number/Type of Input Variables

Slovenian Cash shortage 2,745 failed Mramor and Journal of private firms as short-term and 16,882 Valentincic Business insolvency non-failed (2003) Venturing, 18: firms from 745–771 Central Database of the Agency of Payments over the period of 1996–1998 British large Liquidation and Peel (1987) The Investment 56 failed and entering into Analysis, 83: 56 non-failed private firms receivership 23–27 firms from the Extel Unquoted Companies Service over the period of 1980–1985

Interaction Out-of-Sample Traditional Effects Prediction Statistical Considered or Machine Learning Techniques Used

No Eight variables, Binary logit regression, including seven financial binary probit regression, ratios and multiple discriminant analysis

N/A

Binary logistic Up to seven regression variables, including four financial ratios

Overall classification success of 75% to 88.89% on a test sample depending on the number of variables integrated

No



Perry (2001) Journal of Small Business Management, 39: 201–208

One variable relating to business planning

N/A

N/A

Two financial ratios

No Multiple discriminant analysis and binary logistic regression

N/A

Overall classification success of 77.95% for discriminant analysis, and 75.98% for logistic regression on a sample of 254 firms



N/A Bankruptcy 152 failed and 152 non-failed firms from the credit reporting database of Dun & Bradstreet Corporation Portuguese Sustained nonPindado and Small Business 24 failed and compliance Rodrigues Economics, 22: 24 non-failed private firms with banking (2004) 51–66 firms from obligations the Central throughout the Balancewhole year Sheet Office of the Banco de Portugal over the period of 1990–1993

Bankruptcy

No Eight financial Multiple discriminant ratios for old analysis firms and up and neural to nine ratios networks for new firms

Dutch small Bankruptcy 1,584 firms private firms from the Dutch Chamber of Commerce, and Graydon Credit Management Services over the period of 2001–2006

No Six variables, Multiple including five discriminant financial ratios analysis, binary logistic regression neural networks

Belgian 7,080 Pompe and Journal of small and observations Bilderbeek Business mediumfrom Belgian (2005) Venturing, 20: sized National 847–868 enterprises Bank over (private the period of firms) 1986–1994

Slotemaker (2008)

N/A

Classification success depending on ratios, models for new and old firms, and the use of techniques. Overall, neural networks outperformed multiple discriminant analysis Neural networks outperform logistic regression and multiple discriminant analysis



A discrete-time No Up to 20 British newly Liquidation, 5,878,238 Wilson et al. International hazard model variables, incorporated administration, observations (2014) Small Business and receivership including up small and from a Journal, 32: to six financial mediumUK credit 733–758 ratios sized reference enterprises agency over (largely the period of private) 2000–2008

N/A

Note: Balcaen and Ooghe (2006) and Perry (2001) do not estimate financial distress prediction models. Balcaen and Ooghe (2006) provide a literature review of the single-period models in the corporate distress prediction literature, and Perry (2001) examines the influence of planning on US small business failures by comparing the extent of planning between active and failed small businesses in the descriptive statistics.
Source: Jones and Wang (2019), Appendix B. Reprinted with permission from Elsevier.

Notes


1 For instance, the US Bureau of Labor Statistics reports that, on average, 46.5% of private businesses failed within the first five years of operation between 1994 and 2015 (www.bls.gov/bdm/entrepreneurship/bdm_chart3.htm).
2 The Office of Advocacy defines a small business as an independent business having fewer than 500 employees.
3 www.bls.gov/bdm/us_age_naics_00_table7.txt.
4 www.oecd.org/coronavirus/policy-responses/coronavirus-covid-19-sme-policy-responses-04440101.
5 www.abi.org/newsroom/press-releases/commercial-chapter-11-bankruptcies-increase-14-percent-in-the-first-quarter.
6 https://portal.census.gov/pulse/data/#data.
7 https://portal.census.gov/pulse/data/#data.
8 www.thompsoncoburn.com/insights/blogs/credit-report/post/2020-04-02/financial-distress-of-the-covid-19-pandemic-on-not-for-profits.
9 www.washingtonpost.com/local/non-profits-coronavirus-fail/2020/08/02/ef486414-d371-11ea-9038-af089b63ac21_story.html.
10 While most studies to date are country specific, Altman et al. (2017) used a much larger global sample based on ORBIS data.
11 https://nccs.urban.org/publication/nonprofit-sector-brief-2019#the-nonprofit-sector-in-brief-2019.


12 Margaret Gibelman, Sheldon R. Gelman, and Daniel Pollack, 1997, The credibility of nonprofit boards, Administration in Social Work, 21(2), pp. 21–40.
13 B. Never, 2014, Divergent patterns of nonprofit financial distress, Nonprofit Policy Forum, 5(1), pp. 67–84.
14 Justin Lord, Amy Landry, Grant T. Savage, and Robert Weech-Maldonado, 2020, January–December, Predicting nursing home financial distress using the Altman Z-Score, Inquiry, 57.

6 WHITHER CORPORATE FAILURE RESEARCH?

The corporate distress and failure prediction modelling literature has flourished over the past six decades, with a vast number of studies published across a wide cross section of journals and discipline fields, including accounting, finance, economics, management, business, marketing, and statistics. Financial crises over the past 25 years have had a substantial impact on the development of this literature. For instance, the Asian Financial Crisis of 1997, the "tech wreck" of 2001, the global financial crisis (GFC) of 2007–2009, and the COVID pandemic (starting early 2020) have led to an unprecedented number of corporate collapses around the world, wiping out trillions of dollars in combined asset values. During times of financial crisis, widespread corporate collapses can exacerbate market volatility, accelerate the onset of economic recessions, erode investor confidence, and impose enormous economic costs on a wide range of stakeholders, including investors, lenders and creditors, employees, suppliers, consumers, and other market participants.

While corporate failure used to be considered the province of small and/or newly listed companies, it is now commonplace for very large and well-established corporations to fail, particularly during times of financial crisis. Among private companies, at least half fail within five years of their establishment. This has led to renewed research efforts to develop more accurate and robust distress risk and corporate failure prediction models. It should not be surprising that distress risk and corporate failure forecasts have grown increasingly important to a range of users in the more complex, globally interdependent, and highly regulated financial markets we see today. As detailed in Chapter 1, distress risk and corporate failure forecasts are now widely used by a variety of users, including accountants and auditors; investors; financial analysts; corporate directors and senior management; banks and creditors; regulators; and other parties.


While this volume has reviewed a wide range of traditional modelling techniques, such as LDA, logit/probit, neural networks, and hazard models, I have drawn particular attention to modern machine learning methods, such as gradient boosting machines, random forests, adaptive boosting (AdaBoost), and deep learning, which hold much promise in this field of research. For instance, machine learning methods such as gradient boosting machines are fully non-parametric and better equipped to handle common data issues that plague parametric models, such as non-normality, missing values, database errors, and so on. These models are also better equipped to handle irrelevant inputs and statistical problems such as multicollinearity and heteroscedasticity. Machine learning methods are also designed to handle situations where the number of predictors is very large relative to the sample size (p > n), which is useful because corporate failure samples are typically quite small while the set of potential predictors is vast.

Because modern machine learning methods require minimal researcher intervention, they can reduce biases associated with data snooping and p-hacking (Ohlson, 2015). As Jones (2017) pointed out, such biases arise when a model is repeatedly re-estimated with multiple iterations of the explanatory variables to arrive at the best model; this can lead to a seemingly decent model being generated by coincidence rather than by anything innately relevant in the data. The high dimensional gradient boosting model uses the full set of input variables and automatically detects all important interaction effects. As machine learning models are largely immune to variable transformations, scaling, and the inclusion of irrelevant inputs, and impose no underlying distributional assumptions on the data, this approach helps remove the bias associated with data snooping. As shown in Chapter 4, adjusting the tunable settings (or hyper-parameters) of machine learning models, such as learn rates, tree depth, maximal nodes for trees, loss functions, and cross-validation methods, does not greatly affect machine learning performance. In commercial machine learning software such as SPM, I found that the default settings generally produced the best models in corporate failure modelling.

Because modern machine learning methods are high dimensional, they can be used to test the predictive power of many hundreds if not thousands of features without compromising model stability and performance. The rank ordering of predictor variables based on out-of-sample predictive power is a powerful feature of modern machine learning methods. It is useful for identifying potentially new and theoretically interesting variables that can be developed in future research.
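To make the preceding points concrete, the short sketch below fits a gradient boosting classifier with default settings and ranks predictors by their out-of-sample contribution. It is an illustrative sketch only, using scikit-learn rather than the SPM software referred to above, and it assumes hypothetical objects X (a matrix of candidate predictors), y (a 0/1 failure indicator), and feature_names (the predictor labels):

    # Sketch: gradient boosting with default settings and an out-of-sample
    # ranking of predictors. X, y and feature_names are hypothetical inputs.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    gbm = GradientBoostingClassifier(random_state=0)   # default hyper-parameters
    gbm.fit(X_train, y_train)

    # Rank variables by their contribution to out-of-sample predictive power
    imp = permutation_importance(gbm, X_test, y_test, scoring="roc_auc",
                                 n_repeats=20, random_state=0)
    ranking = np.argsort(imp.importances_mean)[::-1]
    for j in ranking[:10]:
        print(feature_names[j], round(imp.importances_mean[j], 4))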


There are many promising directions for future research. First, the vast majority of corporate failure studies have modelled distress risk or corporate failure in a simplistic binary setting of failed vs non-failed. However, as pointed out in Chapters 1 and 2, this representation is quite unrealistic, as distress events in the real world can manifest across a number of dimensions. There is considerable opportunity to model corporate distress on a spectrum ranging from more moderate forms of distress to the most extreme form of distress (such as bankruptcy). There is quite limited research available that has explored predictive models for more moderate but frequently occurring distress events (particularly in multi-class contexts), such as adverse going concern opinions, capital reorganizations, reductions in dividends, public offerings to raise working capital, loan default, changes in credit ratings, and distressed mergers/takeovers. Modelling these distress events is important because they can be direct precursors to more extreme distress events such as bankruptcy. Hence, distress prediction models can potentially provide a more effective early warning signal, giving businesses more lead time to take corrective or remedial action before the onset of more severe distress states (such as bankruptcy). While multi-class modelling of corporate distress can be quite challenging from a modelling perspective (particularly in terms of interpretative complexity), such models can provide a more realistic depiction of the types of distress that are observable in practice.

Second, while this book has focused on modern machine learning methods, the application of these models to distress risk and corporate failure modelling is only at a formative stage. More research is needed to evaluate the predictive performance, theoretical merits, and statistical properties of alternative machine learning models. For instance, gradient boosting and random forests often perform equally well in terms of prediction, but variable rankings can be quite different across models. This can affect the interpretative insights that can be drawn about the usefulness and theoretical value of different predictor variables. Further research can continue to compare and evaluate different machine learning methods across a broader range of predictive scenarios and using a wider range of predictor variables. For instance, future research can consider the application of machine learning models to panel data structures. As discussed in Chapter 2, hazard models such as the Cox proportional hazards model are useful because they can explicitly model failure as a function of time as well as automatically handle censored data. The extended Cox model can also accommodate time-varying covariates. However, machine learning models are generally not able to accommodate time dependency except in a more rudimentary way (such as creating lags of the feature variables or dummy variables for different time periods). While most machine learning models are not designed to handle panel data, Alam et al. (2021) compared hazard models with a deep learning approach that could accommodate panel data structures. As the hazard model is conceptually very appealing and widely used in the corporate failure prediction literature, further research that compares the performance of machine learning methods with hazard models will be a fruitful area of future research.

Another direction for future research is to investigate whether corporate failure probabilities generated from machine learning models can actually improve on conventional models when used as a corporate failure risk proxy. For instance, many studies have used Z scores from the Altman (1968) model and probabilities from the Ohlson (1980) model as risk proxies in various contexts. This is quite surprising considering that the Altman study is over 50 years old and was developed on a corporate failure sample comprising a mere 33 failed manufacturing firms! (Ohlson's study is not much better, using only 105 industrial companies.)
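Returning to the panel data point above, the sketch below illustrates the rudimentary lagging approach available to machine learning models alongside a hazard model benchmark. It is a sketch under stated assumptions: panel is a hypothetical firm-year DataFrame with columns firm_id, leverage, roa, failed, and years_to_event, and the Cox benchmark uses the third-party lifelines package.

    # Sketch: lagged features as a rudimentary treatment of time dependency for
    # a machine learning model, with a Cox proportional hazards benchmark.
    from lifelines import CoxPHFitter
    from sklearn.ensemble import GradientBoostingClassifier

    for col in ["leverage", "roa"]:
        panel[col + "_lag1"] = panel.groupby("firm_id")[col].shift(1)
    ml_data = panel.dropna(subset=["leverage_lag1", "roa_lag1"])

    gbm = GradientBoostingClassifier(random_state=0)
    gbm.fit(ml_data[["leverage", "roa", "leverage_lag1", "roa_lag1"]],
            ml_data["failed"])

    # Hazard benchmark on one row per firm (duration until failure or censoring)
    firm_level = panel.groupby("firm_id").last().reset_index()
    cph = CoxPHFitter()
    cph.fit(firm_level[["years_to_event", "failed", "leverage", "roa"]],
            duration_col="years_to_event", event_col="failed")
    print(cph.concordance_index_)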


Even when researchers have used updated coefficients for these models (that is, re-estimating the parameters on more recent data), the coefficients themselves may no longer be relevant, or there may be other predictors available that are more effective for corporate failure prediction.1 However, recent research has shown that machine learning models can produce much lower Type I and Type II errors than conventional models (this was also shown in Chapter 4). Would using machine learning probabilities change our understanding, for example, of the distress anomaly discussed in Chapter 2? As noted by Dichev (1998), many studies have postulated that the effects of firm size and book-to-market, two of the most important predictors of stock returns, could be related to a firm distress risk factor. As corporate failure risk is a good proxy for distress risk, Dichev (1998) applied measures of corporate failure risk derived from existing models of corporate failure prediction, notably Altman (1968) and Ohlson (1980). For the portfolio results, firms were assigned monthly into decile portfolios according to their probability of bankruptcy based on Z scores (Altman model) and probabilities (Ohlson model). However, would conclusions about the distress anomaly be the same if machine learning probabilities were used instead?

There are many other ways that machine learning models can be used to provide insight into current debates in the literature. One debate has been about the usefulness of accounting vs market price variables (Beaver et al., 2005; Campbell et al., 2008). Conventional models have limited capacity to handle many features, and different studies have tended to adopt different methodological approaches and model evaluation techniques, rendering comparisons across studies quite problematic. However, machine learning models can be used to test a wider range of variables (such as accounting and market price variables) within a single statistical learning framework (Jones, 2017). As variable rankings are based on out-of-sample predictive power, more meaningful and statistically valid comparisons can be made about the relative contribution of different variables to overall model success. Machine learning can rank order any number of predictors based on the RVI metric discussed in Chapter 3.
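As a simple illustration of comparing variable groups within a single statistical learning framework, the sketch below evaluates accounting-only, market-only, and combined feature sets with the same classifier and the same cross-validation design. It is a sketch only: df, acct_cols, mkt_cols, and the "failed" label column are hypothetical.

    # Sketch: comparing accounting-based and market-based predictors within one
    # statistical learning framework. df, acct_cols and mkt_cols are assumed.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    def cv_auc(cols):
        model = GradientBoostingClassifier(random_state=0)
        return cross_val_score(model, df[cols], df["failed"],
                               scoring="roc_auc", cv=5).mean()

    for label, cols in [("accounting only", acct_cols),
                        ("market only", mkt_cols),
                        ("combined", acct_cols + mkt_cols)]:
        print(label, round(cv_auc(cols), 3))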


While accounting and market variables have been investigated in some depth, relatively little research has focused on other promising variables that might prove to be effective predictors of distress and corporate failure, such as earnings management and corporate governance proxies. For instance, several empirical studies document that failing and troubled companies have a higher propensity to engage in earnings management practices to mask underlying performance issues (see, e.g., Schwartz, 1982; Lilien et al., 1988; Sweeney, 1994; DeFond and Jiambalvo, 1994; Rosner, 2003; Lee et al., 1999). Jones (2011) explored an accounting environment that evidenced more permissive accounting practices with respect to the capitalization of identifiable intangible assets (such as patents, licences, and copyrights). Jones (2011) reported, first, that failing firms capitalized intangible assets more aggressively than non-failed firms over the 16-year sample period, and particularly over the five-year period leading up to firm failure. Second, drawing on the accounting choice literature, Jones (2011) found that managers' propensity to capitalize intangible assets has a strong statistical association with earnings management proxies, particularly among failing firms.

Fisher et al. (2019) examined the impact of earnings management prior to bankruptcy filing on the passage of firms through Chapter 11. Using data on public US firms, the study finds that earnings management prior to bankruptcy significantly reduces the likelihood of Chapter 11 plan confirmation and emergence from Chapter 11. The results are driven primarily by extreme values of earnings management, characterized by one or two standard deviations above or below the mean. The findings are consistent with creditors reacting positively to unduly conservative earnings reports and negatively to overly optimistic earnings reports. Charitou et al. (2007) examined the earnings management behaviour of 455 distressed US firms that filed for bankruptcy during the period 1986–2001. Their results are consistent with downwards earnings management one year prior to the bankruptcy filing. Their results also showed that: (1) firms receiving unqualified audit opinions four or five years prior to the bankruptcy-filing event manage earnings upwards in subsequent years, consistent with Rosner (2003); (2) more conservative earnings management seems to be related to qualified audit opinions occurring in the preceding year; (3) firms with long-term negative accruals in the year of bankruptcy filing have a greater chance of surviving thereafter; and (4) more pronounced (negative) earnings management is associated with more negative subsequent (next year's) returns. However, more research is needed to investigate the earnings management practices of distressed and bankrupt firms. For instance, future research could consider whether the current findings in the literature hold up using high dimensional machine learning models, perhaps using multiple earnings management proxies and a larger number of control variables to investigate the relationship between earnings management and corporate failure.

There is also comparatively little research that has investigated the relationship between corporate governance proxies (such as shareholder ownership/concentration, board composition, and so on) and corporate failure. In an early study, Gilson (1990) reported that corporate default engenders significant changes in the ownership of firms' residual claims and in the allocation of rights to manage corporate resources. Daily and Dalton (1994) reported differences between the bankrupt and matched groups in the proportions of affiliated directors, chief executives, board chairperson structure, and their interaction. Darrat et al. (2016) reported evidence that having larger boards reduces the risk of bankruptcy, but only for complex firms. Complexity was measured by the number of business segments, log(sales), and leverage (defined as total liabilities to the market value of assets). Their results also suggested that the proportion of inside directors on the board is inversely associated with the risk of bankruptcy in firms that require more specialist knowledge (and that the reverse is true in technically unsophisticated firms). Their results further revealed that the additional explanatory power from corporate governance variables becomes stronger as the time to bankruptcy increases, implying that although corporate governance variables are important predictors, governance changes are likely to be too late to rescue the company from bankruptcy.


In a more recent study, Jones (2017) reported that ownership variables, namely the percentage of stock owned by the top five stockholders and the percentage of stock owned by insiders, were the two strongest variables in the machine learning analysis. Future research can investigate a wider range of corporate governance proxies using modern statistical learning methods to draw out a more conclusive and theoretically defensible relationship between corporate governance and corporate failure.

Another direction for corporate failure research is to utilize the latest advances in text mining and natural language processing (NLP) to exploit the predictive value of the signal embedded in unstructured text data (Kotu and Deshpande, 2018). Examples of unstructured text data include text from emails, social media posts, video and audio files, and corporate annual reports. Text mining is the process of converting unstructured text into a structured format in order to uncover hidden patterns, important concepts, and themes. NLP methods can be harnessed to extract key words, terms, phrases, or sentences that can potentially discriminate between different types of distress events and between bankrupt and non-bankrupt firms. While NLP has not been widely used in corporate failure prediction, this technology has the capability to extract linguistic signals from a range of "big data" text sources, including corporate viability statements, management discussion and analysis, annual reports, auditor reports, and the risk information required to be disclosed by accounting standards. By converting unstructured text data into structured formats that can be understood by machine learning models, we can compare the predictive signal in new features generated by NLP with other traditional and non-traditional corporate failure predictors and their potential interaction effects. Future research can also compare the performance of NLP methods with the predictive performance of quantitative machine learning models (such as gradient boosting machines) and evaluate the predictive power of linguistic features relative to other traditional and non-traditional variables such as financial ratios, market price variables, corporate valuation and governance variables, macro-economic indicators, and corporate sustainability measures.
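A minimal sketch of such a text-mining workflow is given below. It converts hypothetical disclosure excerpts (mdna_text) into TF-IDF features, combines them with a matrix of financial ratios (ratios), and evaluates the combined feature set against a 0/1 failure label (failed); all three inputs are assumptions for illustration.

    # Sketch: turning unstructured disclosure text into structured features and
    # combining them with financial ratios for a machine learning classifier.
    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score

    vec = TfidfVectorizer(max_features=500, stop_words="english")
    text_features = vec.fit_transform(mdna_text)            # document-term matrix
    X_all = hstack([text_features, csr_matrix(np.asarray(ratios))]).toarray()

    auc = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X_all, failed, scoring="roc_auc", cv=5).mean()
    print("AUC, linguistic + ratio features:", round(auc, 3))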


Finally, more research is needed to develop accurate and robust prediction models for private companies, not-for-profits, and public sector entities. This appears to be a much-neglected area of the literature. However, as noted in Chapter 5, private companies and not-for-profit entities provide a significant amount of global employment as well as contributing significantly to GDP. Failure rates are also substantially higher for private companies than for public companies, as shown in Chapter 5. With respect to private companies, there is a great deal of variability across current prediction models, particularly with respect to definitions of private company failure, sample sizes, types of statistical learning methods used, choice of explanatory variables, and differences in reporting jurisdictions. Hence, the generalizability of empirical results across these studies can be highly problematic. These issues are even more pronounced for not-for-profit entities (such as charities) and public sector entities (such as local governments), where the dearth of research is even more readily apparent. The continual development and refinement of distress prediction models for private sector entities, not-for-profits, and public sector entities remain a major challenge and opportunity for future research.

Key Points From Chapter 6

Chapter 6 mainly considers directions for future research. While this volume has reviewed a wide range of traditional modelling techniques, such as LDA, logit/probit, neural networks, and hazard models, I have drawn particular attention to the application of modern machine learning methods to distress risk and corporate failure modelling.

Machine learning methods hold much promise in distress risk and corporate failure research. For instance, machine learning methods such as gradient boosting machines are better equipped to handle common data issues, such as non-normality, missing values, database errors, and so on. Not only can machine learning models predict significantly better than conventional models, they can also provide important interpretative insights through variable importance scores, partial dependency plots, and interaction effects.

As machine learning models such as gradient boosting machines are largely immune to variable transformations, scaling, missing values, and the inclusion of irrelevant inputs, and impose no underlying distributional assumptions on the data, this approach helps remove the bias associated with data snooping and p-hacking that has received attention in recent literature.

Machine learning models can be used to test a wider range of variables within a single statistical learning framework. Because machine learning models such as gradient boosting machines and random forests can rank order variables on their out-of-sample predictive power, these methods can be very useful for addressing current debates in the literature (such as the predictive power of accounting vs market price variables) as well as facilitating the discovery of theoretically interesting new variables.

Another fruitful area of research is to use modern machine learning methods to test a wider range of predictor variables, including macro-economic variables, earnings management variables, and corporate governance proxies.

A further direction for corporate failure research is to utilize the latest advances in text mining and natural language processing (NLP) to exploit the predictive value of the signal embedded in unstructured text data.

Finally, there are considerable research opportunities to develop more accurate and robust distress prediction models for private companies, not-for-profits, and public sector entities, particularly using modern machine learning methods.

Note 1 Begley et al. (1997) show that the Altman and Ohlson models do not perform as well in more recent periods (in particular the 1980s), even when the coefficients are re-estimated. However, Ohlson’s model has the strongest overall performance when the model coefficients are re-estimated on more recent data.

APPENDIX
Description of Prediction Models

Each entry below gives the model type, its specification, and a description covering hyper-parameter estimation and the model fitting process.

Logit

Specification: Prob[Y_i = 1 | x_i] = exp(β′x_i) / (1 + exp(β′x_i)), where β is a vector of parameter estimates and x_i a vector of explanatory variables.

Description: The logit model is conceptualized as log-odds, which converts a binary outcome domain (0, 1) to the real line (−∞, ∞). For the logit model, this index or link function is based on the logistic distribution. The error structure is assumed to be IID, while explanatory variables have distribution-free assumptions. Parameters are estimated using maximum likelihood.
Hyper-parameter estimation: Hyper-parameters for this model are regarded as being the set of explanatory variables included in the final model. The explanatory variables are selected using backwards stepwise selection based on the BIC.
Fitting process: Model parameters are estimated using maximum likelihood.

Probit

Specification: Prob[Y_i = 1 | x_i] = Φ(β′x_i), where Φ is the cumulative normal distribution (whose inverse serves as the link function), β is a vector of parameter estimates, and x_i a vector of explanatory variables.

Description: The link function for a probit model is the inverse of the cumulative normal distribution, Φ^(−1). The explanatory variables and error structure of a probit model are assumed to be IID, which makes the model more restrictive and computationally more intensive. The standard probit model has a similar conceptualization to the logit model. While the probit classifier has more restrictive assumptions, both classifiers normally produce consistent parameter estimates and have comparable predictive accuracy (Greene, 2008).
Hyper-parameter estimation: Hyper-parameters for this model are regarded as being the set of explanatory variables included in the final model. The input set was selected using backwards stepwise selection based on the BIC.
Fitting process: Model parameters are estimated using maximum likelihood.

Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA)

Specification: The Bayesian linear discriminant classifier is defined as δ_k(x) = x′Σ^(−1)μ_k − (1/2)μ_k′Σ^(−1)μ_k + log π_k, where the parameters to be estimated are μ_k, a class-specific mean vector; Σ, a covariance matrix that is common to all K classes; and π_k, the prior probability that a randomly chosen observation comes from the kth class. An observation X = x is assigned to the class for which δ_k(x) is largest.

Description: The LDA classifier assumes that the observations in the kth class are drawn from a multivariate normal distribution and that all classes share a common covariance matrix (i.e., the variance is the same for all K classes). For QDA, predictor variables appear in the discriminant function as a quadratic function. The quadratic discriminant classifier (QDA) makes the same normality assumption as LDA; however, unlike LDA, QDA assumes that each class has its own covariance matrix. As a rule of thumb, LDA can lead to improved modelling performance if the sample sizes are small (and hence reducing variance is important). However, if the dataset is relatively large, QDA is preferred because this method usually fits the data better and can handle a greater range of data issues (such as nonlinearity in the data). While LDA is based on quite restrictive statistical assumptions, Greene (2008) observes that these concerns have been exaggerated in the literature and the performance differences between logit/probit classifiers and LDA are usually not exceptional.
Hyper-parameter estimation: Hyper-parameters for this model are regarded as being the set of explanatory variables included in the final model. The input set for this model was selected using best subset selection. This is a two-step process, with the first step being to select the best subset for each potential cardinality of the input space (for P explanatory variables, the potential cardinality vector is 1 : P). This is achieved using a leaps and bounds algorithm that provides a best subset for each cardinality. The next step is to choose the optimal input cardinality (based on binomial deviance) using five-fold cross-validation on the training set. The input subset is chosen to be the best subset for the optimal cardinality.
Fitting process: Model parameters are estimated using maximum likelihood.

Logit/Probit – best subset selection

Specification: Let M_0 denote the null model having no predictors. (1) For k = 1, 2, ..., p, fit all (p choose k) models that contain exactly k predictors; pick the best among these, based on smallest RSS, and denote it M_k. (2) Select a single best model from among M_0, ..., M_p based on the AIC and BIC criteria.

Description: Two popular approaches to selecting subsets of predictors are (1) best subset selection and (2) stepwise procedures. With best subset selection, the classifier fits a separate least squares regression for each combination of the p predictors. The classifier fits all p models that contain exactly one predictor, then all models that contain exactly two predictors, and so on. The algorithm then examines the resulting models and identifies the best model.
Hyper-parameter estimation: Hyper-parameters for this model are regarded as being the set of explanatory variables included in the final model. The explanatory variables for this model are selected using best subset selection. This is a two-step process, with the first step being to select the best subset for each potential cardinality of the feature space (for P features, the potential cardinality vector is 1 : P). This is achieved using a leaps and bounds algorithm that provides a best subset for each cardinality. The next step is to choose the optimal cardinality (based on binomial deviance) using five-fold cross-validation on the training set. The input subset is chosen to be the best subset for the optimal cardinality.
Fitting process: Model parameters are estimated using maximum likelihood.
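A brief sketch of fitting these baseline classifiers in Python is given below. It is illustrative only and does not reproduce the backward stepwise or leaps-and-bounds selection described above; X (explanatory variables) and y (a binary failure indicator) are hypothetical inputs.

    # Sketch: baseline logit, probit, LDA and QDA classifiers on hypothetical
    # inputs X and y.
    import statsmodels.api as sm
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)

    Xc = sm.add_constant(X)                      # add an intercept term
    logit_fit = sm.Logit(y, Xc).fit(disp=0)      # maximum likelihood
    probit_fit = sm.Probit(y, Xc).fit(disp=0)
    print(logit_fit.bic, probit_fit.bic)         # BIC used for input selection

    lda = LinearDiscriminantAnalysis().fit(X, y)      # common covariance matrix
    qda = QuadraticDiscriminantAnalysis().fit(X, y)   # class-specific covariances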


Logit/Probit – backward stepwise model

Specification: Let M_p denote the full model containing all p predictors. (1) For k = p, p − 1, ..., 1, consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 predictors. (2) Pick the best among these k models, based on smallest RSS, and call it M_(k−1). (3) Select a single best model from among M_0, ..., M_p based on the AIC and BIC criteria.

Description: The major limitation of the best subset procedure is computational complexity, which rapidly escalates for large numbers of predictors. Generally, there are 2^p models that involve subsets of p predictors (so if p = 20, there are over 1,000,000 models to estimate). The larger search space can lead to over-fitting and high variance in parameter estimates. Stepwise selection explores a much more restricted set of models. Backward stepwise (used for this study) begins with a model containing all parameters and then sequentially removes less useful predictors, one at a time. Stepwise models have a number of significant limitations, including potential overstatement of model fit, biased parameters, inconsistency in model selection, and deletion of variables that potentially carry signal. However, some of these limitations are mitigated by using out-of-sample prediction tests to evaluate overall model performance.
Hyper-parameter estimation: Hyper-parameters for this model are regarded as being the set of explanatory variables included in the final model. The input set was selected using backwards stepwise selection based on the BIC.
Fitting process: Model parameters are estimated using maximum likelihood.
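The sketch below implements a simple version of this backward elimination on the BIC for a logit model. X is assumed to be a hypothetical pandas DataFrame of candidate predictors and y the binary outcome; it is a sketch of the idea, not the exact procedure used in the book.

    # Sketch: backward stepwise elimination on BIC for a logit model.
    import statsmodels.api as sm

    def fit_bic(cols):
        return sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0).bic

    selected = list(X.columns)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        current = fit_bic(selected)
        candidates = {c: fit_bic([v for v in selected if v != c])
                      for c in selected}
        best = min(candidates, key=candidates.get)
        if candidates[best] < current:       # dropping `best` lowers the BIC
            selected.remove(best)
            improved = True
    print("retained predictors:", selected)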

Logit/Probit – penalized models

Specification: The elastic net penalty is set out as λ Σ_(j=1)^(p) [ α|β_j| + (1 − α)β_j²/2 ], where λ is the penalty parameter, β_j is the estimated coefficient, and α is the mixing parameter. If α = 0, we have a ridge regression penalty; if α = 1, we have a lasso penalty.

Description: Penalized models or shrinkage methods are an alternative to subset procedures. Two popular techniques are ridge regression and the lasso. A relatively new technique (the elastic net) combines the strengths of both. Rather than using OLS to find a subset of variables, ridge regression uses all variables in the dataset but constrains or regularizes the coefficient estimates so that they "shrink" towards zero for non-important variables. Shrinking the parameter estimates can significantly reduce their variance with only a small increase in bias. A weakness of ridge regression is that all variables are included in the model, making the model difficult to interpret. The lasso has a similar construction, but its penalty forces some parameters to equal zero (so the lasso has a variable selection feature and produces parsimonious models). The elastic net technique builds on the strengths of ridge regression and the lasso (Zou and Hastie, 2005). By setting α = 0.5, very unimportant variable parameters are shrunk to zero (a kind of subset selection), while variables with small importance are shrunk to some small (non-zero) value.
Hyper-parameter estimation: The hyper-parameters for this model are the values of the penalty (λ) and a penalty mixing parameter (α). For this study, α is fixed at 0.5, which gives a combination of lasso and ridge penalties. The value of λ is estimated using five-fold cross-validation with the binomial deviance loss function.
Fitting process: Model parameters are estimated using penalized maximum likelihood.
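A minimal sketch of this penalized set-up, with the mixing parameter fixed at 0.5 and the penalty weight chosen by five-fold cross-validation, is given below (hypothetical X and y; scikit-learn is used here in place of the penalized likelihood routine described above).

    # Sketch: elastic net logistic regression with cross-validated penalty.
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    enet = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(penalty="elasticnet", solver="saga",
                             l1_ratios=[0.5],        # mix of lasso and ridge
                             Cs=20, cv=5, scoring="neg_log_loss",
                             max_iter=5000))
    enet.fit(X, y)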


Probit boosted

Specification: The probit boosted model has the typical form of a probit regression model, Prob[Y_i = 1 | x_i] = Φ(β′x_i), where Φ is the cumulative normal distribution, β is a vector of parameter estimates, and x_i a vector of explanatory variables. However, the β values are not estimated using standard likelihood maximization techniques; they are estimated using the gradient boosting algorithm.

Description: Gradient boosting is a regression fitting algorithm that improves the predictive performance of models by shrinking model parameters towards zero. Shrinkage improves the predictive performance of models as it reduces the variance of parameter estimates at the cost of a small increase in bias. The algorithm consists of sequentially fitting the negative gradient of a specified loss function using a set of "base learners". A specific subset of "base learners" is chosen to enter the model at each iteration. In the classification context, the "base learners" are weak classifiers, so the final model is made up of a combination of weak classifiers. For the probit boosted model, the link function is the inverse CDF of the Gaussian distribution. This methodology is similar to the generalized tree-based boosting models described below, the difference being the link function and the fact that the probit boosted model uses standard regression "base learners" while the generalized tree-based boosting models use tree-based "base learners". The benefit of using boosting algorithms to fit regression models is that they provide an "automatic" input selection procedure, provide shrinkage estimates that take advantage of the bias-variance trade-off to improve predictive performance, deal effectively with multicollinearity of the inputs, and provide a solution for the low sample size, high input dimensionality problem.
Hyper-parameter estimation: The hyper-parameters for this model are the step length and the number of boosting iterations. Typically, the step length is set at a small value such as 0.1 and the number of boosting iterations is used as the adjustable hyper-parameter. For the current study, the optimal number of boosting iterations was estimated using ten-fold cross-validation.
Fitting process: Model parameters are estimated using sequential fitting and base learner selection based on the negative gradient of a specified loss function. For the current model, the loss function is the negative log-likelihood of the Bernoulli distribution.

Logit/Probit models with multivariate adaptive regression splines (MARS)

Specification: y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ⋯ + β_(R+3) b_(R+3)(x_i) + ε_i, which represents a cubic spline with K knots; the parameters β_0, β_1, β_2, ..., β_(R+3) are estimated over different regions of X (i.e., between knots), and b_1, b_2, ..., b_(R+3) are basis functions.

Description: The standard way to extend regression functions to nonlinear relationships is to replace the linear model with a polynomial function. MARS is a more general technique. It works by dividing the range of X into R distinct regions (separated by splines/knots). Within each region, a lower degree polynomial function can be fitted to the data, with the constraint that the functions join at the region boundaries (the knots). This can provide more stable parameter estimates and frequently better predictive performance than fitting a high degree polynomial over the full range of X. In estimating logit and probit models with a MARS feature, we followed the convention of placing knots uniformly and used cross-validation to determine the number of knots. A limitation is that regression splines can have high variance at the outer range of the predictors (when X takes on very small or large values). This can be rectified by imposing boundary constraints. A further limitation is the additivity condition; hence, the model is only partially nonlinear.
Hyper-parameter estimation: The main hyper-parameter for the MARS model is the cardinality of the additive model terms included within the model. There are some other hyper-parameters that could be adjusted; however, they were not considered for this study. The cardinality is determined using five-fold cross-validation with a binomial deviance loss function.
Fitting process: Model estimation involves a joint parameter estimation and term selection process. Terms are selected from a candidate set of hinge basis functions in a forward stage-wise additive process based on their contribution to the reduction in the binomial deviance. Once the model has been fitted up to a specified cardinality, terms are removed from the model in a stage-wise removal process that is also based on contribution to the reduction in the binomial deviance. The optimal cardinality is chosen based on the cross-validation process described under hyper-parameter estimation.


Mixed logit

Specification: A random parameter logit model (for panel data) is set out as follows (Train, 2003): P_(i,k,l) = exp(β_i′x_(i,k,l)) / Σ_j exp(β_i′x_(i,k,j)). For a given coefficient vector β_i, P_(i,k,l) is the probability that alternative l is chosen for the kth observation of firm i. Where the random parameter logit differs from the standard logit is with respect to β_i, which represents a vector of firm-level coefficients. The probability of the chosen alternative for the kth observation of firm i is given by P_(ik) = Π_l P_(ikl)^(y_(ikl)), and the joint probability for the k observations of firm i is given by P_i = Π_k Π_l P_(ikl)^(y_(ikl)).

Description: A highly restrictive assumption of the standard logit (and probit model) is the IID condition. It is assumed the error structure is independently and identically distributed across outcomes. The mixed logit model completely relaxes the IID condition and allows for correlated predictor variables. The key idea behind the mixed logit model is to partition the stochastic component (the error term) into two additive (i.e., uncorrelated) parts. One part is correlated over alternative outcomes and heteroscedastic, and another part is IID over alternative outcomes and firms. The main improvement is that mixed logit models include a number of additional parameters that capture observed and unobserved heterogeneity both within and between firms. Hyper-parameter estimation: The main hyper-parameter for the mixed logit model is the subset of explanatory variables that are included within the model. Fitting process: The model is estimated using simulation-based maximum likelihood.

Nested logit

Specification: The choice probabilities for the elemental alternatives are defined as (see Jones and Hensher, 2007): P(k | j, i) = exp[β′x(k | j, i)] / Σ_(l=1)^(K|j,i) exp[β′x(l | j, i)], where k|j,i denotes elemental alternative k in branch j of limb i, K|j,i is the number of elemental alternatives in branch j of limb i, and the inclusive value for branch j in limb i is IV(j | i) = log Σ_(k=1)^(K|j,i) exp[β′x(k | j, i)].

Description: Unlike the mixed logit model, the nested logit model has a closed form solution. A highly restrictive assumption of the standard logit (and probit) model is the IID condition: it is assumed that the error structure is independently and identically distributed across outcomes. The NL model recognises the possibility that each alternative may have information in the unobserved influences of that alternative, which in turn has a role to play in determining an outcome that is different across the alternative branches. This difference implies that the error variances might be different (i.e., specific alternatives (j = 1, ..., J) do not have the same distributions for the unobserved effects, or errors, denoted by ε_j). Differences might also imply that the information content could be similar among subsets of alternatives and hence some amount of correlation could exist among these subsets (i.e., non-zero and varying covariances for pairs of alternatives).
Hyper-parameter estimation: The main hyper-parameter for the nested logit model is the subset of explanatory variables included within the model.
Fitting process: The model is estimated using simulation-based maximum likelihood.


Logit/Probit – generalized additive model (GAM)

Specification: For a logistic GAM, log[p(X) / (1 − p(X))] = f_0 + f_1(X_1) + f_2(X_2) + ⋯ + f_p(X_p), and similarly for a probit GAM, Φ^(−1)(p(X)) = f_0 + f_1(X_1) + f_2(X_2) + ⋯ + f_p(X_p).

Description: The generalized additive model (GAM) is a non-parametric technique for extending the linear framework by allowing nonlinear smooth functions of each of the explanatory variables while maintaining the additivity condition. GAMs are estimated using a backfitting algorithm. Hence, the linear term β_1X_1 in a standard model can be replaced by a smooth nonlinear function f_1(X_1). The GAM is called an additive model because we calculate a separate f_j for each X_j and then sum their collective contributions. Hence, GAMs automatically capture nonlinear relationships not reflected in standard linear models. This flexibility to allow non-parametric fits, with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability. As with MARS and mixed logit, the major limitation of GAMs is the additivity condition, which can result in many important interactions being missed.
Hyper-parameter estimation: The main hyper-parameter for the logit and probit GAMs is the subset of explanatory variables included within the model. For the current study, the subset was chosen using the double penalty approach of Marra and Wood (2011).
Fitting process: The model is estimated using a doubly penalized, iteratively re-weighted least squares approach. The double penalization is due to there being two penalty terms, one for the smoothness parameters and the other for subset selection.
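The sketch below approximates an additive spline logit of this kind by expanding each predictor into a smooth spline basis. It is only an approximation under stated assumptions (hypothetical X and y) and does not implement the double-penalty selection of Marra and Wood (2011).

    # Sketch: an additive spline logit in the spirit of a GAM.
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import SplineTransformer

    gam_like = make_pipeline(
        SplineTransformer(n_knots=5, degree=3),   # smooth basis per predictor
        LogisticRegression(max_iter=5000, C=1.0))
    gam_like.fit(X, y)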

Neural networks

Specification: For a typical single hidden layer binary neural network classifier, there are inputs (X), one hidden layer (Z), and two output classes (Y). Derived features Z_m are created from linear combinations of the inputs, and the target Y_k is then modelled as a function of linear combinations of the Z_m:
Z_m = σ(α_(0m) + α_m′X), m = 1, ..., M,
T_k = β_(0k) + β_k′Z, k = 1, ..., K,
f_k(X) = g_k(T), k = 1, ..., K,
where Z = (Z_1, Z_2, Z_3, ..., Z_M) and T = (T_1, T_2, T_3, ..., T_K). The activation function σ(v) is typically the sigmoid σ(v) = 1 / (1 + e^(−v)). The output function g_k(T) allows a final transformation of the vector of outputs T. For K-class classification, g_k(T) is estimated using the softmax function g_k(T) = e^(T_k) / Σ_(ℓ=1)^(K) e^(T_ℓ).

Description: Neural networks are sometimes described as nonlinear discriminant models – essentially, neural networks are a two-stage regression or classification model. For a typical single hidden layer network, there are inputs (X), one hidden layer (Z), and two output classes (Y). Derived features Z_m are created from linear combinations of the inputs, and then the target Y_k is modelled as a function of the linear combinations of the Z_m. The softmax output is the same transformation as in the multinomial logit model and provides positive estimates that sum to 1. The units in the middle of the network computing the derived features Z_m are called hidden or latent units as they are not directly observable. Note that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence, a neural network can be thought of as a nonlinear generalization of a linear model, for both regression and classification. NNs are good at dealing with dynamic nonlinear relationships. The major limitation is that backpropagation neural networks are essentially "black boxes". Apart from defining the general architecture of a network, the researcher has little role to play other than to observe the classification performance (i.e., NNs provide no parameters or algebraic expressions defining a relationship (as in regression) beyond the classifier's own internal mathematics). Further, this classifier generally has less capacity to handle large numbers of irrelevant inputs, data of mixed type, outliers, and missing values. The computational scalability (in terms of sample size and number of predictors) is also a potential limitation.
Hyper-parameter estimation: The adjustable hyper-parameters used for the neural network analysis are a weight decay parameter and the number of units within the hidden layer. These were estimated using cross-validation over a grid of reasonable values.
Fitting process: For each model, a set of 20 single hidden layer networks was fitted using different random number seeds. The model scores for each of the 20 models are averaged and then translated to predicted classes.
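A minimal sketch of a single hidden layer network with the two hyper-parameters named above (weight decay and the number of hidden units) tuned by cross-validation is shown below; X and y are hypothetical inputs and the averaging over 20 random seeds is omitted.

    # Sketch: single hidden layer network with weight decay and hidden units
    # chosen by cross-validation.
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([("scale", StandardScaler()),
                     ("net", MLPClassifier(max_iter=2000, random_state=0))])
    grid = {"net__hidden_layer_sizes": [(4,), (8,), (16,)],
            "net__alpha": [1e-4, 1e-3, 1e-2]}   # alpha plays the weight-decay role
    nn = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5).fit(X, y)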


Support vector machines

Specification: Support vector machines are a solution to the optimization problem:
maximize M over β_0, β_1, ..., β_p, ε_1, ..., ε_n,
subject to Σ_(j=1)^(p) β_j² = 1,
y_i(β_0 + β_1 x_(i1) + β_2 x_(i2) + ⋯ + β_p x_(ip)) ≥ M(1 − ε_i),
ε_i ≥ 0, Σ_(i=1)^(n) ε_i ≤ C,
where C is a nonnegative tuning parameter. C bounds the sum of the ε_i and so sets the tolerance level for misclassification. M is the width of the margin, which we want to make as large as possible. ε_i is a slack variable that allows observations to be on the wrong side of the margin or hyperplane (i.e., the "soft classifier" approach).

Description: Support vector machines (SVM) differ from conventional classification techniques such as LDA and logit/probit through the use of a separating hyperplane. A hyperplane divides p-dimensional space into two halves, where a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (i.e., we would expect that the larger the margin, the lower the out-of-sample classification error). Classification is based on the sign of the test observation. If ε_i > 0, then the ith observation is on the wrong side of the margin; if ε_i > 1, the ith observation is on the wrong side of the hyperplane. The tuning parameter C bounds the sum of the ε_i and sets the tolerance level for misclassification. Larger values of C indicate larger tolerance. Importantly, C controls the bias-variance trade-off. A higher C implies a wider margin (i.e., many observations violate the margin) and so there are many support vectors. In this situation, many observations are involved in determining the hyperplane, leading to low variance but high bias (the reverse is true when C is set at lower values). Support vector machines enlarge the feature space to deal with nonlinear decision boundaries – they do this in an automatic way using various types of kernels (mainly for ease of computation). A widely used kernel is the radial kernel, which is used for this study. A major advantage of the SVM classifier is that the classification is quite robust to observations far away from the hyperplane – support vectors are based on a subset of the observations. Other techniques (such as LDA, logit, and probit) are more sensitive to outliers. SVM suffers many of the limitations of NNs, particularly in terms of computational scalability, lack of interpretability, and the ability to handle irrelevant inputs and data of mixed type.
Hyper-parameter estimation: The adjustable hyper-parameters used for the support vector machine with a Gaussian radial basis function kernel are an inverse width parameter and a constraint violation parameter. The inverse width parameter was set to be the median of ||x − x′||², and the cost of constraint violation parameter was estimated using five-fold cross-validation.
Fitting process: The fitting process is undertaken using the sequential minimal optimization (SMO) algorithm for solving quadratic programming problems.
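A brief sketch of a radial-kernel SVM with the cost parameter C chosen by five-fold cross-validation is given below. X and y are hypothetical inputs, and the kernel width uses scikit-learn's built-in "scale" heuristic rather than the median-distance rule described above.

    # Sketch: radial basis function SVM with C tuned by cross-validation.
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    svm = GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale",
                                            probability=True)),
        {"svc__C": [0.1, 1, 10, 100]}, scoring="roc_auc", cv=5).fit(X, y)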

Penalized SVM

Specification: The SVM objective function can be represented in a loss function plus penalty form as follows:
min_(b,w) (1/n) Σ_(i=1)^(n) [1 − y_i(b + w·x_i)]_+ + Σ_(j=1)^(p) p_λ(|w_j|).
The SCAD penalty is:
p_λ(|w|) = λ|w|, if |w| ≤ λ;
p_λ(|w|) = −(|w|² − 2aλ|w| + λ²) / (2(a − 1)), if λ < |w| ≤ aλ;
p_λ(|w|) = (a + 1)λ² / 2, if |w| > aλ.

Description: The general approach of support vector machines (SVM) is described above. The penalized SVM approach used here is obtained by replacing the standard SVM L2 penalty term with one that allows sparse solutions and hence can play a variable selection role. The analysis is restricted to linear SVM, in contrast to the radial basis function approach used in the SVM described above. The penalty function used is the smoothly clipped absolute deviation (SCAD) penalty. This penalty has the property of allowing sparse solutions while at the same time not applying a large penalty (equivalent to a large bias) to inputs with high coefficients.
Hyper-parameter estimation: The major adjustable hyper-parameter for the SCAD penalized SVM is λ (see the specification above); this parameter controls the extent of the parameter shrinkage. λ was estimated using five-fold cross-validation on the training set.
Fitting process: The penalized SVM is estimated by minimizing the penalized objective function set out in the specification above.

Generalized tree-based boosting/AdaBoost

Specification: The GBM classifier (and its variant, AdaBoost) is initiated through the following steps (see Schapire and Freund, 2012): (1) train a weak learner using distribution D_t; (2) get a weak hypothesis or classifier h_t : X → {−1, +1}; (3) select the weak classifier h_t to minimize the weighted error ε_t; (4) choose α_t = (1/2) ln[(1 − ε_t)/ε_t], where α_t is the parameter importance assigned to the weak classifier h_t; (5) update, for i = 1, ..., m:
D_(t+1)(i) = [D_t(i)/Z_t] × e^(−α_t) if h_t(x_i) = y_i, and [D_t(i)/Z_t] × e^(α_t) if h_t(x_i) ≠ y_i.
Output the final hypothesis or strong classifier H(x) = sign(Σ_(t=1)^(T) α_t h_t(x)), where H(x) is the linear combination of weak classifiers computed by generalized boosting or AdaBoost. AdaBoost differs from GBM only with respect to the loss function (AdaBoost uses the exponential loss function, whereas GBM uses the Bernoulli loss function).

Description: The idea behind boosting is to combine the outputs of many tree-based weak classifiers to produce a powerful overall "voting" committee. The weighted voting is based on the quality of the weak classifiers, and every additional weak classifier improves the prediction outcome. The first classifier is trained on the data where all observations receive equal weights. Some observations will be misclassified by the first weak classifier. A second classifier is developed to focus on the training errors of the first classifier. The second classifier is trained on the same dataset, but misclassified samples receive a higher weighting while correctly classified observations receive less weight. The re-weighting occurs such that the first classifier gives 50% error (random) on the new distribution. Iteratively, each new classifier focuses on ever more difficult samples. The algorithm keeps adding weak classifiers until some desired low error rate is achieved. More formally, the generalized boosting methodology, and its main variant AdaBoost, is set out in Schapire and Freund (2012). A number of attractive features have been associated with this classifier. For instance, this classifier has been shown to be resistant to over-fitting and has impressive computational scalability in terms of its capacity to handle many thousands of predictors. This classifier is also robust to outliers and monotone transformations of variables; has a high capacity to deal with irrelevant inputs; and is better at handling data of mixed (continuous and categorical) type. Another attractive feature is that the generalized boosting classifier has some level of interpretability, as the algorithm provides a ranking of variable influences and their marginal effect on the response variable.
Hyper-parameter estimation: The adjustable hyper-parameters used for the generalized boosting classification model are interaction depth, shrinkage, and the number of trees. For current purposes, we hold the shrinkage parameter constant at 0.001. The optimal interaction depth and number of trees were determined using five-fold cross-validation on the training set.
Fitting process: The model fitting is undertaken using Friedman's gradient boosting machine method. Both the exponential loss (AdaBoost) and Bernoulli loss functions are implemented in this book.
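A minimal sketch of the two boosting variants, with a small fixed shrinkage rate and the tree depth and number of trees tuned by cross-validation, is given below; X and y are hypothetical inputs, and the shrinkage value differs from the 0.001 used in the book.

    # Sketch: gradient boosting (Bernoulli loss) and AdaBoost (exponential loss)
    # with tree depth and number of trees tuned by cross-validation.
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    gbm = GridSearchCV(
        GradientBoostingClassifier(learning_rate=0.01, random_state=0),
        {"max_depth": [1, 2, 3], "n_estimators": [500, 1000, 2000]},
        scoring="roc_auc", cv=5).fit(X, y)

    ada = GridSearchCV(
        AdaBoostClassifier(random_state=0),       # exponential-loss counterpart
        {"n_estimators": [200, 500, 1000]},
        scoring="roc_auc", cv=5).fit(X, y)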

Random forests

Specification: (1) For b = 1 to B training sets: (a) draw a bootstrap sample of size N from the training data; (b) grow a random forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached: (i) select m variables at random from the p variables; (ii) pick the best variable/split point among the m; (iii) split the node into two daughter nodes. (2) Output the ensemble of trees {T_b, b = 1, ..., B}. For a discrete outcome variable, let Ĉ_b(x) be the class prediction of the bth random forest tree; then Ĉ_rf^B(x) = majority vote over {Ĉ_b(x), b = 1, ..., B}.

Description: Random forests are an improvement on the CART system (binary recursive partitioning) and bagged tree algorithms, which tend to suffer from high variance (i.e., if a training sample is randomly split into two halves, the fitted model can vary significantly across the samples) and weaker classification accuracy. Random forests maintain the advantages of CART and bagged tree methodology by de-correlating the trees and using the "ensemble" or maximum votes approach of generalized boosting. They do not require true pruning for generalization. As in bagging, random forests build a number of decision trees based on bootstrapped training samples. But when building these decision trees, each time a split in the tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is only allowed to use one of these m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p, so that at each split we consider roughly the square root of the total number of predictors (i.e., with 16 predictors, no more than 4 will be selected). By contrast, if a random forest is built where the predictor subset size m equals the number of predictors p, this simply amounts to bagging. The intuition behind random forests is clear. In a bagged tree process, a particularly strong predictor in the dataset (along with some moderately strong predictors) will be used by most if not all of the trees in the top split. Consequently, all the bagged trees will look quite similar to each other. Hence, the predictions from bagged trees will be highly correlated. But averaging many highly correlated quantities does not lead to as significant a reduction in error as averaging uncorrelated quantities. Random forests overcome this problem by forcing each split to consider only a subset of predictors. Therefore, on average, (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. By de-correlating the trees, the averaging process will be less variable and more reliable. If there is strong correlation among the predictors, m should be small. Furthermore, random forests typically do not over-fit if we increase B (the number of bootstrapped training sets), and in practice a sufficiently large B should be used for the test error rate to settle down. Both random forests and generalized boosting share the "ensemble" approach. Where the two methods differ is that boosting performs an exhaustive search over which trees to split on, whereas random forests choose a small subset. Generalized boosting grows trees in sequence (with the next tree dependent on the last); however, random forests grow trees in parallel, independently of each other. Hence, random forests can be computationally more attractive for very large datasets.
Hyper-parameter estimation: The major hyper-parameter for random forest classification is the number of features m, selected from the total number of features p, for the fitting of each node. This value was selected using N-fold cross-validation based on misclassification loss.
Fitting process: As set out in the specification above.
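A short sketch of a random forest in which m (the number of candidate features considered at each split) is tuned by cross-validation is given below; X and y are hypothetical inputs.

    # Sketch: random forest with the per-split feature subset size tuned by
    # cross-validation.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rf = GridSearchCV(
        RandomForestClassifier(n_estimators=1000, random_state=0),
        {"max_features": ["sqrt", 0.2, 0.5]},   # candidate values for m
        scoring="roc_auc", cv=5).fit(X, y)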


Oblique random forests

(1) For b = 1 to B training sets:
   (a) draw a bootstrap sample of size N from the training data;
   (b) grow a random forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps:
      (i) select m variables at random from the p variables,
      (ii) re-project the m variables onto a linear sub-space,
      (iii) obtain an optimal split point using the Gini impurity statistic,
      (iv) split the node into two daughter nodes.
(2) Output the ensemble of trees $\{T_b\}_1^B$.

For a discrete outcome variable, let $\hat{C}_b(x)$ be the class prediction of the bth random forest tree. Then $\hat{C}_{rf}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_1^B$.

Source: Adapted from Jones et al. (2015). Reprinted with permission from Elsevier

Description: Oblique random forests are similar to the standard random forest models described earlier. The standard random forest model uses tree base learners that constrain each split to be formed by a hyperplane orthogonal to a single feature axis. This geometric restriction may reduce the effectiveness of the base learners, particularly when the input variables are multicollinear. In oblique random forests, the randomly selected subset of m features at each node is re-projected onto a linear sub-space, which can split the input space more efficiently than splits subject to an orthogonality constraint. For the current volume, the re-projection is carried out using ridge regression and the splits are based on the Gini impurity statistic. Trees grown in this way are considered better adapted to many classification problems because they exploit both the relationship between the inputs and the class labels and the internal relationships among the inputs. Hyper-parameter estimation: The major hyper-parameter for oblique random forest classification is the number of features m, selected from the total number of features p for the fitting of each node. This value was selected using five-fold cross-validation based on misclassification loss. Fitting process: As set out in the algorithm above.
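A minimal sketch of a single oblique split is given below, assuming Python, a binary class label coded 0/1, and scikit-learn's Ridge estimator for the re-projection; the function and parameter names are illustrative, and this is a reconstruction of the general idea rather than the exact routine used for the results in this volume.

```python
import numpy as np
from sklearn.linear_model import Ridge

def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    if len(labels) == 0:
        return 0.0
    proportions = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(proportions ** 2)

def oblique_split(X_node, y_node, feature_idx, alpha=1.0):
    """One oblique split: ridge re-projection of the m sampled features,
    then the Gini-minimising threshold on the projected scores."""
    # (ii) re-project the sampled features onto a one-dimensional linear sub-space
    ridge = Ridge(alpha=alpha).fit(X_node[:, feature_idx], y_node)
    scores = X_node[:, feature_idx] @ ridge.coef_
    # (iii) choose the split point on the projection that minimises weighted Gini impurity
    best_threshold, best_impurity = None, np.inf
    for threshold in np.unique(scores)[:-1]:
        left, right = y_node[scores <= threshold], y_node[scores > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y_node)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return ridge.coef_, best_threshold
```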


REFERENCES

Agarwal, V. and Taffler, R., 2008. Comparing the performance of market-based and accounting-based bankruptcy prediction models. Journal of Banking  & Finance 32, pp. 1541–1551. Alam, N., Gao, J. and Jones, S., 2021. Corporate failure prediction: An evaluation of deep learning vs discrete hazard models. Journal of International Financial Markets, Institutions and Money, 75, p. 101455. Alfaro, E., García, N., Gámez, M. and Elizondo, D., 2008. Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45(1), pp. 110–122. Altman, E.I., 1968. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23, pp. 589–609. Altman, E.I., 1983. Multidimensional graphics and bankruptcy prediction: A  comment. Journal of Accounting Research, pp. 297–299. Altman, E.I., 2002. Bankruptcy, Credit Risk, and High Yield Junk Bonds. Malden: Blackwell Publishers. Altman, E.I., Haldeman, R.G. and Narayanan, P., 1977. ZETA analysis: A new model to identify bankruptcy risk of corporations. Journal of Banking & Finance, 1(1), pp. 29−54. Altman, E.I., Iwanicz-Drozdowska, M., Laitinen, E.K. and Suvas, A., 2017. Financial distress prediction in an international context: A review and empirical analysis of Altman’s Z-score model. Journal of International Financial Management & Accounting,  28(2), pp. 131–171. Altman, E.I., Marco, G. and Varetto, F., 1994. Corporate distress diagnosis: Comparisons using linear discriminant analysis and neural networks (the Italian experience). Journal of Banking & Finance, 18(3), pp. 505−529. Altman, E.I. and Rijken, H.A., 2004. How rating agencies achieve rating stability. Journal of Banking & Finance, 28(11), pp. 2679–2714. Altman, E.I., Sabato, G. and Wilson, N., 2010. The value of non-financial information in small and medium-sized enterprise risk management. Journal of Credit Risk, 6, pp. 95–127. Altman, E.I. and Saunders, A., 1997. Credit risk measurement: Developments over the last 20 years. Journal of Banking and Finance, 21(11–12), pp. 1721–1742.


Amankwah-Amoah, J., Khan, Z. and Wood, G., 2021. COVID-19 and business failures: The paradoxes of experience, scale, and scope for theory and practice. European Management Journal, 39(2), pp. 179–184. Amato, J.D. and Furfine, C.H., 2004. Are credit ratings procyclical?  Journal of Banking  & Finance, 28(11), pp. 2641–2677. Australian Securities Investment Commission (ASIC) Regulatory Guide (RG) 247, 2019. Retrieved from https://asic.gov.au/regulatory-resources/find-a-document/ regulatory-guides/rg-247-effective-disclosure-in-an-operating-and-financial-review. Aziz, A., Emanuel, D.C. and Lawson, G.H., 1988, September. Bankruptcy prediction – an investigation of cash flow based models. Journal of Management Studies, pp. 419–437. Aziz, A. and Lawson, G.H., 1989, Spring. Cash flow reporting and financial distress models: Testing of hypotheses. Financial Management, pp. 55–63. Bach, S. and Vesper, D., 2002. A crisis in finance and investment – local government finance needs fundamental reform. Economic Bulletin, 39(9), pp. 309–316. Bahnson, P. and Bartley, J., 1992. The sensitivity of failure prediction models. Advances in Accounting, 10, pp. 255–278. Bai, C.E., Liu, Q. and Song, F.M., 2004. Bad news is good news: Propping and tunneling evidence from China. Working paper (Business and Economics, University of Hong Kong). Balcaen, S. and Ooghe, H., 2006. 35 Years of studies on business failure: An overview of the classic statistical methodologies and their related problems.  The British Accounting Review, 38(1), pp. 63–93. Bangia, A., Diebold, F.X., Kronimus, A., Schagen, C. and Schuermann, T., 2002. Ratings migration and the business cycle, with application to credit portfolio stress testing. Journal of Banking and Finance, 26(2–3), pp. 445–474. Barniv, R., Agarwal, A. and Leach, R., 2002. Predicting bankruptcy resolution. Journal of Business Finance & Accounting, 29(3–4), pp. 497–520. Barth, M.E., Beaver, W.H. and Landsman, W.R., 1998. Relative valuation roles of equity book value and net income as a function of financial health.  Journal of Accounting and Economics, 25(1), pp. 1–34. Basel, I.I., 2009. Committee on Banking Supervision. (2006). In International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Bank for International Settlements Press and Communications. Basel, Switzerland. Beaver, W.H., 1966. Financial ratios as predictors of failure. Journal of Accounting Research, pp. 71–111. Beaver, W.H., 1968a. Alternative accounting measures as predictors of failure. The Accounting Review, 43(1), pp. 113–122. Beaver, W.H., 1968b. The information content of annual earnings announcements. Journal of Accounting Research, pp. 67–92. Beaver, W.H., McNichols, M.F. and Rhie, J.W., 2005. Have financial statements become less informative? Evidence from the ability of financial ratios to predict bankruptcy. Review of Accounting Studies, 10, pp. 93–122. Beckett-Camarata, J., 2004. Identifying and coping with fiscal emergencies in Ohio local governments. International Journal of Public Administration, 27(8–9), pp. 615–630. Begley, J., Ming, J. and Watts, S., 1997. Bankruptcy classification errors in the 1980s: An empirical analysis of Altman’s and Ohlson’s models.  Review of Accounting Studies,  1(4), pp. 267–284. Bharath, S.T. and Shumway, T., 2004. Forecasting default with the KMV–Merton model. Unpublished paper, University of Michigan.


Bharath, S.T. and Shumway, T., 2008. Forecasting default with the Merton distance to default model. The Review of Financial Studies, 21, pp. 1339–1369. Bhimani, A., Gulamhussen, M.A. and Lopes, S.D.R., 2010. Accounting and non-accounting determinants of default: An analysis of privately-held firms. Journal of Accounting and Public Policy, 29(6), pp. 517–532. Bhimani, A., Gulamhussen, M.A. and Lopes, S.D.R., 2013. The role of financial, macroeconomic, and non-financial information in bank loan default timing prediction. European Accounting Review, 22(4), pp. 739–763. Bhimani, A., Gulamhussen, M.A. and da Rocha Lopes, S., 2014. Owner liability and financial reporting information as predictors of firm default in bank loans. Review of Accounting Studies, 19(2), pp. 769–804. Billings, B.K., 1999. Revisiting the relation between the default risk of debt and the earnings response coefficient. The Accounting Review, 74(4), pp. 509–522. Black, F. and Scholes, M., 1973. The pricing of options and corporate liabilities. Journal of Political Economy, 81, pp. 637–654. Blinder, A.S., 2020. After the Music Stopped: The Financial Crisis, the Response, and the Work Ahead (No. 79). New York: Penguin Books. Blum, M., 1974. Failing company discriminant analysis.  Journal of Accounting Research, pp. 1–25. Blume, M.E., Lim, F. and MacKinlay, A.C., 1998. The declining credit quality of US corporate debt: Myth or reality? The Journal of Finance, 53(4), pp. 1389–1413. Boritz, J. and Kennedy, D., 1995. Effectiveness of neural network types for prediction of business failure. Expert Systems with Applications, 9, pp. 504–512. Bossi, M., 2020. Financial distress of the COVID-19 pandemic on not-for-profits. Retrieved from https://www.thompsoncoburn.com/insights/blogs/credit-report/ post/2020-04-02/financial-distress-of-the-covid-19-pandemic-on-not-for-profits Boudreau, J., 2003, June  27. Non-profit directors under new scrutiny. San Jose Mercury News. Breiman, L., 1996. Bagging predictors. Machine Learning, 26, pp. 123–140. Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., 1984. Classification and Regression Trees. Wadsworth Statistics/Probability Series. New York: CRC Press. Buehler, S., Kaiser, C. and Jaeger, F., 2012. The geographic determinants of bankruptcy: Evidence from Switzerland. Small Business Economics, 39(1), pp. 231–251. Burde, G., 2018. Improved methods for predicting the financial vulnerability of nonprofit organizations. Administrative Sciences, 8(1), p. 3. Cahill, A.G. and James, J.A., 1992. Responding to Municipal Fiscal Distress: An Emerging Issue for State Governments in the 1990s. Abingdon: Taylor & Francis. Campbell, J.Y., Hilscher, J. and Szilagyi, J., 2008. In search of distress risk. The Journal of Finance, 63, pp. 2899–2939. Carmeli, A., 2003. Introduction: Fiscal and financial crises of local governments. International Journal of Public Administration, 26(13), pp. 1423–1430. Carmeli, A. and Cohen, A., 2001. The financial crisis of the local authorities in Israel: A resource-based analysis. Public Administration, 79(4), pp. 893–913. Carson, E., Fargher, N.L., Geiger, M.A., Lennox, C.S., Raghunandan, K. and Willekens, M., 2013. Audit reporting for going-concern uncertainty: A  research synthesis. Auditing: A Journal of Practice & Theory American Accounting Association, 32(Supplement), pp. 353–384. Carson, M and Clark, J., 2013. Asian financial crisis: July 1997 – December 1998. Federal Reserve Bank of New York.


Casey, C.J. and Bartczak, N.J., 1984, July–August. Cash row – it’s not the bottom line. Harvard Business Review, pp. 61–66. Casey, C.J. and Bartczak, N.J., 1985, Spring. Using operating cash flow data to predict financial distress: Some extensions. Journal of Accounting Research, pp. 384–401. Chancharat, N., Tian, G., Davy, P., McCrae, M. and Lodh, S., 2010. Multiple states of financially distressed companies: Tests using a competing-risks model. Australasian Accounting, Business and Finance Journal, 4(4), pp. 27–44. Charitou, A., Lambertides, N. and Trigeorgis, L., 2007. Managerial discretion in distressed firms. The British Accounting Review, 39(4), pp. 323–346. Chava, S. and Jarrow, R.A., 2004. Bankruptcy prediction with industry effects. Review of Finance, 8, pp. 537–569. Cheung, Y.L., Jing, L., Lu, T., Rau, P.R. and Stouraitis, A., 2009. Tunneling and propping up: An analysis of related party transactions by Chinese listed companies. Pacific-Basin Finance Journal, 17, pp. 372–393. Ciampi, F., 2015. Corporate governance characteristics and default prediction modeling for small enterprises. An empirical analysis of Italian firms. Journal of Business Research, 68(5), pp. 1012–1025. Ciampi, F. and Gordini, N., 2013. Small enterprise default prediction modeling through artificial neural networks: An empirical analysis of Italian small enterprises.  Journal of Small Business Management, 51(1), pp. 23–45. Clark, T.N., 1977. Fiscal management of American cities: Funds flow indicators. Journal of Accounting Research, 15(Supplement), pp. 54–106. Clark, T.N. and Ferguson, L.C., 1983.  City Money. Political Processes, Fiscal Strain, and Retrenchment. New York: Columbia University Press. Clarke, J., Ferris, S.P., Jayaraman, N. and Lee, J., 2006. Are analyst recommendations biased? Evidence from corporate bankruptcies. Journal of Financial and Quantitative Analysis, 41, pp. 169–196. Coats, P.K. and Fant, L.F., 1993. Recognizing financial distress patterns using a neural network tool. Financial Management, pp. 142–155. Cooper, S.D., 1996. Local government budgeting responses to fiscal pressures. Public Administration Quarterly, pp. 305–319. Coronavirus (COVID-19): SME Policy Responses, 2020. Retrieved from www.oecd.org/ coronavirus/policy-responses/coronavirus-covid-19-sme-policy-responses-04440101. Corporate Law Reform Act, 1992. Retrieved from www.legislation.gov.au/Details/ C2004A04501. Cortés, E.A., Martínez, M.G. and Rubio, N.G., 2007. A boosting approach for corporate failure prediction. Applied Intelligence, 27(1), pp. 29−37. Cribb, R., 2002, April 26. Planet Aid claimed 350,000 in losses from 1998 to 2000. The Toronto Star, A01. Crosbie, P.J. and Bohn, R., 2003. Modeling Default Risk. San Francisco: KMV, LLC. CSRC, 2008. China securities regulatory commission (CSRC). Retrieved from www.csrc. gov.cn/pub/csrc_en/about/intro/200811/t20081130_67718.html. Daily, C.M. and Dalton, D.R., 1994. Bankruptcy and corporate governance: The impact of board composition and structure. Academy of Management Journal, 37(6), pp. 1603–1617. Damodaran, A., 2009. Valuing declining and distressed companies.  SSRN 1428022. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1428022 Darrat, A.F., Gray, S., Park, J.C. and Wu, Y., 2016. Corporate governance and bankruptcy risk. Journal of Accounting, Auditing & Finance, 31(2), pp. 163–202.


Deakin, E.B., 1972. A  discriminant analysis of predictors of business failure.  Journal of Accounting Research, pp. 167–179. DeAngelo, H. and DeAngelo, L., 1990. Dividend policy and financial distress: An empirical investigation of troubled NYSE firms. The Journal of Finance, 45(5), pp. 1415–1431. DeFond, M.L. and Jiambalvo, J., 1994. Debt covenant violation and manipulation of accruals. Journal of Accounting and Economics, 17(1–2), pp. 145–176. Delaney, K.J., 1999. Strategic Bankruptcy: How Corporations and Creditors Use Chapter 11 to Their Advantage. Berkeley, CA: University of California Press. Dhaliwal, D.S. and Reynolds, S.S., 1994. The effect of the default risk of debt on the earnings response coefficient. Accounting Review, pp. 412–419. Dichev, I.D., 1998, Is the risk of bankruptcy a systematic risk? The Journal of Finance, 53, pp. 1131–1147. Dimitras, A., Zanakis, S. and Zopounidis, C., 1996. A survey of business failures with an emphasis on prediction methods and industrial applications. European Journal of Operational Research, 90, pp. 487–513. Dodd – Frank Wall Street Reform and Consumer Protection Act, 2010. Retrieved from www.cftc.gov/sites/default/files/idc/groups/public/@swaps/documents/file/hr4173_ enrolledbill.pdf. Duarte, F.D., Gama, A.P.M. and Gulamhussen, M.A., 2018. Defaults in bank loans to SMEs during the financial crisis. Small Business Economics, 51, pp. 591–608. Duffie, D., Saita, L. and Wang, K., 2007. Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics, 83, pp. 635–665. Duffie, D. and Singleton, K.J., 2003. Credit risk, Pricing, Measurements, and Management. Princeton, NJ: Princeton University Press. Edmister, R.O., 1970.  Financial Ratios as Discriminant Predictors of Small Business Failure. Columbus, OH: The Ohio State University. Edmister, R.O., 1972. An empirical test of financial ratio analysis for small business failure prediction. Journal of Financial and Quantitative Analysis, 7(2), pp. 1477–1493. Everett, J. and Watson, J., 1998. Small business failure and external risk factors. Small Business Economics, 11(4), pp. 371–390. Falconer, M.K., 1990. Fiscal stress among local governments: Definition, measurement, and the states impact. Stetson Law Review, 20, p. 809. Falkenstein, E., Boral, A. and Carty, L.V., 2000. For Private Companies: Moody’s Default Model. Working Paper. New York: Moody’s Investors Service. Fama, E.F. and French, K., 1993. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, pp. 3–56. Fama, E.F. and French, K.R., 1996. Multifactor explanations of asset pricing anomalies. The Journal of Finance, 51(1), pp. 55–84. Fedorova, E., Gilenko, E. and Dovzhenko, S., 2013. Bankruptcy prediction for Russian companies: Application of combined classifiers. Expert Systems with Applications, 40(18), pp. 7285–7293. Fidrmuc, J. and Hainz, C., 2010. Default rates in the loan market for SMEs: Evidence from Slovakia. Economic Systems, 34(2), pp. 133–147. Figlewski, S., Frydman, H. and Liang, W., 2012. Modeling the effect of macroeconomic factors on corporate default and credit rating transitions. International Review of Economics & Finance, 21(1), pp. 87–105. Filipe, S.F., Grammatikos, T. and Michala, D., 2016. Forecasting distress in European SME portfolios. Journal of Banking and Finance, 64, pp. 112–135.


Financial Crisis Inquiry Commission Report, 2011. Final report of the national commission on the causes of the financial and economic crisis in the United States. Retrieved from www.govinfo.gov/content/pkg/GPO-FCIC/pdf/GPO-FCIC.pdf. Fisher, T.C., Gavious, I. and Martel, J., 2019. Earnings management in Chapter 11 bankruptcy. Abacus, 55(2), pp. 273–305. FitzPatrick, P.J., 1932. A comparison of the ratios of successful industrial enterprises with those of failed companies. The Certified Public Accountant (In three issues: October 1932, pp. 598–605; November 1932, pp. 656–662; December 1932, pp. 727–731). Foster, G., 1986. Financial Statement Analysis, 2nd ed. Prentice Hall. Franks, J.R., 1998. Predicting financial stress in farm businesses. European Review of Agricultural Economics, 25(1), pp. 30–52. Frecka, T.J. and Hopwood, W.S., 1983. The effects of outliers on the cross-sectional distributional properties of financial ratios. Accounting Review, pp. 115–128. Friedman, J.H., 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), pp. 1189–1232. Friedman, J.H. and Meulman, J.J., 2003. Multiple additive regression trees with application in epidemiology. Statistics in Medicine, 22(9), pp. 1365–1381. Gentry, J.A., Newbold, P. and Whitford, D.T., 1985, Spring. Classifying bankrupt firms with funds flow components. Journal of Accounting Research, pp. 146–160. Gentry, J.A., Newbold, P. and Whitford, D.T., 1987. Funds flow components, financial ratios, and bankruptcy. Journal of Business Finance & Accounting, 14(4), pp. 595–606. Gibelman, M., Gelman, S. and Pollack, D., 1997. The credibility of nonprofit boards: A view from the 1990s and beyond. Administration in Social Work, 21(2), pp. 21–41. Gilson, S.C., 1990. Bankruptcy, boards, banks, and blockholders: Evidence on changes in corporate ownership and control when firms default. Journal of Financial Economics, 27(2), pp. 355–387. Giroux, G.A. and Wiggins, C.E., 1984. An events approach to corporate bankruptcy. Journal of Bank Research, 15(3), pp. 179–187. Giving USA Foundation, 2019. Americans gave $427.71 billion to charity in 2018 amid complex year for charitable giving. Retrieved from https://givingusa.org/giving-usa-2019-amer icans-gave-427-71-billion-to-charity-in-2018-amid-complex-year-for-charitable-giving/ Glennon, D. and Nigro, P., 2005. Measuring the default risk of small business loans: A survival analysis approach. Journal of Money, Credit and Banking, pp. 923–947. Gordon, T.P., Greenlee, J.S. and Nitterhouse, D., 1999. Tax-exempt organization financial data: Availability and limitations. Accounting Horizons, 13(2), pp. 113–128. Gramlich, E.M., 1976. The New York City fiscal crisis: What happened and what is to be done? The American Economic Review, 66(2), pp. 415–429. Greene, W., 2008. A statistical model for credit scoring. In S. Jones and D.A. Hensher (Eds.) Advances in Credit Risk Modelling and Corporate Bankruptcy Prediction. Cambridge, UK and New York: Cambridge University Press, pp. 14–44. Green, W., Czernkowski, R. and Wang, Y., 2009. Special treatment regulation in China: Potential unintended consequences. Asian Review of Accounting, 17, pp. 198–211. Greenlee, J.S. and Trussel, J.M., 2000. Predicting the financial vulnerability of charitable organizations. Nonprofit Management and Leadership, 11(2), pp. 199–210. Grunert, J., Norden, L. and Weber, M., 2005. The role of non-financial factors in internal credit ratings. Journal of Banking & Finance, 29(2), pp. 509–531. 
Hager, M.A., 2001. Financial vulnerability among arts organizations: A test of the Tuckman-Chang measures. Nonprofit and Voluntary Sector Quarterly, 30(2), pp. 376–392. Hager, M.A., Galaskiewicz, J., Bielefeld, W. and Pins, J., 1996. Tales from the grave: Organizations’ accounts of their own demise. American Behavioral Scientist, 39(8), pp. 975–994.


Hall, G., 1994. Factors distinguishing survivors from failures amongst small firms in the UK construction sector. Journal of Management Studies, 31(5), pp. 737–760. Harhoff, D., Stahl, K. and Woywode, M., 1998. Legal form, growth and exit of West German firms – empirical results for manufacturing, construction, trade and service industries. The Journal of industrial economics, 46(4), pp. 453–488. Hastie, T., Tibshirani, R. and Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. New York: Springer. Hensher, D.A. and Greene, W., 2003. The mixed logit model: The state of practice. Transportation, 30, pp. 133–176. Hensher, D.A. and Jones, S., 2007. Forecasting corporate bankruptcy: Optimizing the performance of the mixed logit model. Abacus, 43(3), pp. 241–264. Hensher, D.A., Jones, S. and Greene, W.H., 2007. An error component logit analysis of corporate bankruptcy and insolvency risk in Australia.  Economic Record,  83(260), pp. 86–103. Heo, J. and Yang, J.Y., 2014. AdaBoost based bankruptcy forecasting of Korean construction companies. Applied Soft Computing, 24, pp. 494–499. Hill, N.T., Perry, S.E. and Andes, S., 1996. Evaluating firms in financial distress: An event history analysis. Journal of Applied Business Research ( JABR), 12(3), pp. 60–71. Hillegeist, S.A., Keating, E.K., Cram, D.P. and Lundstedt, K.G., 2004. Assessing the probability of bankruptcy. Review of Accounting Studies, 9, pp. 5–34. Hol, S., 2007. The influence of the business cycle on bankruptcy probability. International Transactions in Operational Research, 14(1), pp. 75–90. Honadle, B.W., 2003. The states’ role in US local government fiscal crises: A theoretical model and results of a national survey. International Journal of Public Administration, 26(13), pp. 1431–1472. Hopwood, W., McKeown, J. and Mutchler, J., 1988. The sensitivity of financial distress prediction models to departures from normality. Contemporary Accounting Research, 5(1), pp. 284–298. Hopwood, W., McKeown, J. and Mutchler, J., 1994, Spring. A reexamination of auditor versus modelaccuracy within the context of the going-concern opinion decision. Contemporary Accounting Research, 10(2), pp. 409–443. Hosaka, T., 2019. Bankruptcy prediction using imaged financial ratios and convolutional neural networks. Expert Systems with Applications, 117, pp. 287–299. Huang, Z., Wu, J. and Van Gool, L., 2018, April. Building deep networks on Grassmann manifolds. 32nd AAAI Conference on Artificial Intelligence. Hung, C. and Chen, J.H., 2009. A selective ensemble based on expected probabilities for bankruptcy prediction. Expert Systems with Applications, 36, pp. 5297–5303. Iskandar-Datta, M.E. and Emery, D.R., 1994. An empirical investigation of the role of indenture provisions in determining bond ratings. Journal of Banking and Finance, 18(1), pp. 93–111. James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An Introduction to Statistical Learning with Applications. New York: Springer Science and Business Media. Javvin Press, 2008. China Stock Market Handbook. Saratoga, CA: Javvin Press. Jiang, Y. and Jones, S., 2019. Corporate distress prediction in China: A machine learning approach. Accounting & Finance, 58, pp. 1063–1109. Johnsen, T. and Melicher, R.W., 1994. Predicting corporate bankruptcy and financial distress: Information value added by multinomial logit models. Journal of economics and business, 46(4), pp. 269–286. Jones, F.L., 1987. Current techniques in bankruptcy prediction. 
Journal of Accounting Literature, 6, pp. 131–164.


Jones, S., 2011. Does the capitalization of intangible assets increase the predictability of corporate failure? Accounting Horizons, 25(1), pp. 41–70. Jones, S., 2017. Corporate bankruptcy prediction: A high dimensional analysis. Review of Accounting Studies, 22(3), pp. 1366–422. Jones, S. and Hensher, D.A., 2004. Predicting firm financial distress: A  mixed logit model. The Accounting Review, 79(4), pp. 1011–1038. Jones, S. and Hensher, D.A., 2007. Modelling corporate failure: A multinomial nested logit analysis for unordered outcomes. The British Accounting Review, 39(1), pp. 89–107. Jones, S. and Hensher, D.A., 2008. Advances in Credit Risk Modelling and Corporate Bankruptcy Prediction. Cambridge, UK: Cambridge University Press. Jones, S. and Johnstone, D.A., 2012. Analyst recommendations, earnings forecasts and corporate bankruptcy: Recent evidence. Journal of Behavioral Finance, 13(4), pp. 281–298. Jones, S., Johnstone, D.A. and Wilson, R., 2015. An empirical evaluation of the performance of binary classifiers in the prediction of credit ratings changes. Journal of Banking and Finance, 56(7), pp. 72–85. Jones, S., Johnstone, D.A. and Wilson, R., 2017. Predicting corporate bankruptcy: An evaluation of alternative statistical frameworks. Journal of Business Finance & Accounting, 44(1–2), pp. 3–34. Jones, S. and Peat, M., 2008. Credit derivatives: Current practices and controversies. In S. Jones and D.A. Hensher (Eds.) Advances in Credit Risk Modelling and Corporate Bankruptcy Prediction, Cambridge, UK and New York: Cambridge University Press, pp. 207–242. Jones, S. and Walker, R.G., 2007. Explanators of local government distress. Abacus, 43(3), pp. 396–418. Jones, S. and Walker, R.G., 2008. Local government distress in Australia: A  latent class regression analysis. In S. Jones and D.A. Hensher (Eds.) Advances in Credit Risk Modelling and Corporate Bankruptcy Prediction. Cambridge, UK and New York: Cambridge University Press, pp. 242–269. Jones, S. and Wang, T., 2019. Predicting private company failure: A multi-class analysis. Journal of International Financial Markets, Institutions and Money, 61, pp. 161–188. Jorion, P., Shi, C. and Zhang, S., 2009. Tightening credit standards: The role of accounting quality. Review of Accounting Studies, 14(1), pp. 123–160. Joy, O.M. and Tollefson, J.O., 1975. On the financial applications of discriminant analysis. Journal of Financial and Quantitative Analysis, 10(5), pp. 723–739. Kalemli-Ozcan, S., Gourinchas, P.O., Penciakova, V. and Sander, N., 2020, September 25. COVID-19 and SME Failures, Working Paper No. 20/207. International Monetary Fund. Kane, G.D., Richardson, F.M. and Graybeal, P., 1996. Recession-induced stress and the prediction of corporate failure. Contemporary Accounting Research, 13(2), pp. 631–650. Kane, G.D., Richardson, F.M. and Meade, N.L., 1998. Rank transformations and the prediction of corporate failure. Contemporary Accounting Research, 15(2), pp. 145–166. Kaplan, R.S., 1977. Discussion of fiscal management of American cities: Funds flow indicators. Journal of Accounting Research, pp. 95–99. Karas, M. and Režňáková, R., 2014. A parametric or nonparametric approach for creating a new bankruptcy prediction model: The evidence from the Czech Republic. International Journal of Mathematical Models, 8, pp. 214–223. Karthik Chandra, D., Ravi, V. and Bose, I., 2009. Failure prediction of dotcom companies using hybrid intelligent techniques. Expert Systems with Applications, 36(3), pp. 4830–4837. Kealhofer, S., 2003a. 
Quantifying credit risk I: Default prediction. Financial Analysts Journal, 59(1), pp. 30–44.


Kealhofer, S., 2003b. Quantifying credit risk II: Debt valuation.  Financial Analysts Journal, 59(3), pp. 78–92. Keasey, K. and Watson, R., 1986. The prediction of small company failure: Some behavioural evidence for the UK. Accounting and Business Research, 17(65), pp. 49–57. Keasey, K. and Watson, R., 1987. Non-financial symptoms and the prediction of small company failure: A test of Argenti’s hypotheses. Journal of Business Finance & Accounting, 14(3), pp. 335–354. Keasey, K. and Watson, R., 1988. The non-submission of accounts and small company financial failure prediction. Accounting and Business Research, 19(73), pp. 47–54. Keating, E.K., Fischer, M., Gordon, T.P. and Greenlee, J.S., 2005. Assessing financial vulnerability in the nonprofit sector. SSRN 647662. Retrieved from https://papers.ssrn. com/sol3/papers.cfm?abstract_id=647662 Keenan, S.C., Sobehart, J.R. and Hamilton, D.T., 1999. Predicting default rates: A forecasting model for Moody’s issuer-based default rates. SSRN 1020303. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1020303 Kim, M.J. and Kang, D.K., 2010. Ensemble with neural networks for bankruptcy prediction. Expert Systems with Applications, 37, pp. 3373–3379. Kim, M.J. and Kang, D.K., 2015. Geometric mean based boosting algorithm with oversampling to resolve data imbalance problem for bankruptcy prediction. Expert Systems with Applications, 42, pp. 1074–1082. Kim, S.Y. and Upneja, A., 2014. Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models. Economic Modelling, 36, pp. 354–362. Kinney, W.R., 1973. Discussion of a prediction of business failure using accounting data. Journal of Accounting Research, 11, Empirical Research in Accounting: Selected Studies, pp. 183–187. Kleinbaum, D.G. and Klein, M., 2012. Survival Analysis: A Self-Learning Text, 3rd ed. New York, NY: Springer. Kleine, R., Kloha, P. and Weissert, C.S., 2003. Monitoring local government fiscal health: Michigan’s new 10-point scale of fiscal distress.  Government Finance Review,  19(3), pp. 18–24. Kloha, P., Weissert, C.S. and Kleine, R., 2005. Developing and testing a composite model to predict local fiscal distress. Public Administration Review, 65(3), pp. 313–323. Koopman, S.J., Kräussl, R., Lucas, A. and Monteiro, A.B., 2009. Credit cycles and macro fundamentals. Journal of Empirical Finance, 16(1), pp. 42–54. Koopman, S.J., Lucas, A. and Schwaab, B., 2011. Modeling frailty-correlated defaults using many macroeconomic covariates. Journal of Econometrics, 162(2), pp. 312–325. Kotu, V. and Deshpande, B., 2018. Data Science: Concepts and Practice. Burlington, MA: Morgan Kaufmann Publishers. Laitinen, E.K., 1992. Prediction of failure of a newly founded firm. Journal of Business Venturing, 7(4), pp. 323–340. Laitinen, E.K., 1995. The duality of bankruptcy process in Finland.  European Accounting Review, 4(3), pp. 433–454. Laitinen, T. and Kankaanpaa, M., 1999. Comparative analysis of failure prediction methods: The Finnish case. European Accounting Review, 8(1), pp. 67–92. Largay, J.A. and Stickney, C.P., 1980, July–August. Cash flows, ratio analysis and the W.T, grant company bankruptcy. Financial Analysts Journal, pp. 51–54. Lau, A.H.L., 1987. A  five-state financial distress prediction model.  Journal of Accounting Research, pp. 127–138.


Leclere, M., 2000. The occurrence and timing of events: Survival analysis applied to the study of financial distress. Journal of Accounting Literature, 19. Lee, T.A., Ingram, R.W. and Howard, T.P., 1999. The difference between earnings and operating cash flow as an indicator of financial reporting fraud. Contemporary Accounting Research, 16(4), pp. 749–786. Lennox, C., 1999. Identifying failing companies: A re-evaluation of the logit, probit and DA approaches. Journal of Economics and Business, 51(4), pp. 347–364. Lev, Baruch, 1971. Financial failure and informational decomposition measures. In R.R. Sterling and W.F. Bentz (Eds.) Accounting in Perspective: Contributions to Accounting Thoughts by Other Disciplines. Cincinnati: Southwestern Publishing Co., pp. 102–111. Lilien, S., Mellman, M. and Pastena, V., 1988. Accounting changes: Successful versus unsuccessful firms. Accounting Review, pp. 642–656. Lord, J., Landry, A., Savage, G.T. and Weech-Maldonado, R., 2020. Predicting nursing home financial distress using the Altman Z-Score.  Inquiry: The Journal of Health Care Organization, Provision, and Financing, 57. doi:10.1177/0046958020934946. Louviere, J., Hensher, D. and Swait, J., 2000. Conjoint preference elicitation methods in the broader context of random utility theory preference elicitation methods. In Conjoint Measurement (pp. 279–318). Berlin, Heidelberg: Springer. Magee, R.P., 1977. Discussion of financial distress in private colleges. Journal of Accounting Research, pp. 41–45. McNamara, R.P., Cocks, N.J. and Hamilton, D.F., 1988. Predicting private company failure. Accounting & Finance, 28(2), pp. 53–64. Mensah, Y.M., 1984. An examination of the stationarity of multivariate bankruptcy prediction models: A methodological study. Journal of Accounting Research, pp. 380–395. Merton, R.C., 1974. On the pricing of corporate debt: The risk structure of interest rates. The Journal of Finance, 29, pp. 449–470. Mitchell, J. and Roy, R.V., 2007. Failure Prediction Models: Performance, Disagreements, and Internal Rating Systems. Working Paper. Belgium: National Bank of Belgium. Moriarity, S., 1979. Communicating financial information through multidimensional graphics. Journal of Accounting Research, pp. 205–224. Moyer, R.C., 1977. Forecasting financial failure: A re-examination. Financial Management (pre-1986), 6(1), p. 11. Mramor, D. and Valentincic, A., 2003. Forecasting the liquidity of very small private companies. Journal of Business Venturing, 18(6), pp. 745–771. Murray, D. and Dollery, B., 2005. Local government performance monitoring in New South Wales: Are “at risk” councils really at risk? Economic Papers: A Journal of Applied Economics and Policy, 24(4), pp. 332–345. Never, B., 2013, October. Divergent patterns of nonprofit financial distress. In  Nonprofit Policy Forum (Vol. 5, No. 1, pp. 67–84). Nickell, P., Perraudin, W. and Varotto, S., 2000. Stability of rating transitions.  Journal of Banking & Finance, 24(1–2), pp. 203–227. O’Leary, D.E., 1998. Using neural networks to predict corporate failure. Intelligent Systems in Accounting, Finance & Management, 7(3), pp. 187–197. Ohlson, J.A., 1980. Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), pp. 109−131. Ohlson, J.A., 2015. Accounting research and common sense. Abacus, 51(4), pp. 525–535. Olson, D.L., Delen, D. and Mengm, Y., 2012. Comparative analysis of data mining methods for bankruptcy prediction. Decision Support Systems, 52, pp. 464–473.


Peat, M., 2008. Non-parametric methods for credit risk analysis: Neural networks and recursive partitioning techniques. In S. Jones and D.A. Hensher (Eds.) Advances in Credit Risk Modelling and Corporate Bankruptcy Prediction. Cambridge, UK and New York: Cambridge University Press, pp. 137–154. Peat, M. and Jones, S., 2012. Using neural nets to combine information sets in corporate bankruptcy prediction. Intelligent Systems in Accounting, Finance and Management, 19(2), pp. 90–101. Peel, M.J., 1987. Timeliness of private company accounts and predicting corporate failure. Investment Analysts, 83, 23–27. Peel, M.J. and Peel, D.A., 1987. Some further empirical evidence on predicting private company failure. Accounting and Business Research, 18(69), pp. 57–66. Perry, S.C., 2001. The relationship between written business plans and the failure of small businesses in the US. Journal of Small Business Management, 39(3), pp. 201–208. Pinches, G., Eubank, A., Mingo, K. and Caruthers, J., 1975. The hierarchical classification of financial ratios. Journal of Business Research, 3(4), pp. 295–310. Pindado, J. and Rodrigues, L.F., 2004. Parsimonious models of financial insolvency in small companies. Small Business Economics, 22(1), pp. 51–66. Pompe, P.P. and Bilderbeek, J., 2005. The prediction of bankruptcy of small-and mediumsized industrial firms. Journal of Business Venturing, 20(6), pp. 847–868. Read, W.J. and Yezegel, A., 2018. Going-concern opinion decisions on bankrupt clients: Evidence of long-lasting auditor conservatism? Advances in Accounting, 40, pp. 20–26. Rosner, R.L., 2003. Earnings manipulation in failing firms.  Contemporary Accounting Research, 20(2), pp. 361–408. Salford System, 2019. SPM users guide: Introducing TreeNet. Salford System. Retrieved from www.salford-systems.com. Santomero, A.M. and Vinso, J.D., 1977. Estimating the probability of failure for commercial banks and the banking system. Journal of Banking & Finance, 1(2), pp. 185–205. Schapire, R. and Freund, Y., 2012. Boosting: Foundations and Algorithms. MIT Press: Cambridge, MA. Schipper, K., 1977. Financial distress in private colleges. Journal of Accounting Research, pp. 1–40. Schwartz, E.S., 1977. The valuation of warrants: Implementing a new approach. Journal of Financial Economics, 4(1), pp. 79–93. Schwartz, K.B., 1982. Accounting changes by corporations facing possible insolvency. Journal of Accounting, Auditing and Finance, 6(1), pp. 32–43. Scott, J., 1981. The probability of bankruptcy: A comparison of empirical predictions and theoretical models. Journal of Banking & Finance, 5(3), pp. 317–344. Shumway, T., 2001. Forecasting bankruptcy more accurately: A simple hazard model. Journal of Business, 74(1), pp. 101−124. Slotemaker, R., 2008. Prediction of corporate bankruptcy of private firms in the Netherlands (A Master’s Thesis). Erasmus University, Rotterdam, Netherlands. Small Business Administration Office of Advocacy, 2018. Retrieved from https://advocacy. sba.gov/. Smith, R. and Winakor, A., 1935. Changes in Financial Structure of Unsuccessful Industrial Corporations, Bureau of Business Research, Bulletin No. 51. Urbana: University of Illinois Press. Stephens, J., 2004, March 4. Nature conservancy retools board to “tighten” oversight. Washington Post, A21. Stern, S., 1997. Simulation-based estimation.  Journal of Economic Literature,  35(4), pp. 2006–2039.


Strom, S., 2003, July 16. Fees and trustees: Paying the keepers of the cash. The New York Times. Sun, J., Jia, M. and Hui, L., 2011. AdaBoost ensemble for financial distress prediction: An empirical comparison with data from Chinese listed companies. Expert Systems with Applications, 38(8), pp. 9305–9312. Sweeney, A.P., 1994. Debt-covenant violations and managers’ accounting responses. Journal of Accounting and Economics, 17(3), pp. 281–308. Taleb, N.N., 2007. The Black Swan: The Impact of the Highly Improbable. Allen Lane, UK: Random House. Tan, L.H. and Wang, J., 2007. Modelling an effective corporate governance system for China’s listed state-owned enterprises: Issues and challenges in a transitional economy. Journal of Corporate Law Studies, 7, pp. 143–183. Train, K., 2003. Discrete Choice Methods with Simulation. Cambridge, UK: Cambridge University Press. Trussel, J.M., 2002. Revisiting the prediction of financial vulnerability. Nonprofit Management and Leadership, 13(1), pp. 17–31. Trussel, J.M. and Greenlee, J.S., 2004. A  financial rating system for charitable nonprofit organizations. Research in Governmental and Nonprofit Accounting, 11, pp. 93–116. Trussel, J.M. and Patrick, P.A., 2009, March  1. A  predictive model of fiscal distress in local governments. Journal of Public Budgeting, Accounting & Financial Management, 21(4), pp. 578–616. Trussel, J.M. and Patrick, P.A., 2013. Predicting fiscal distress in special district governments. Journal of Public Budgeting, Accounting & Financial Management, 25(4), pp. 589–616. Tsai, C.F., Hsu, Y.F. and Yen, D.C., 2014. A comparative study of classifier ensembles for bankruptcy prediction. Applied Soft Computing, 24, pp. 977–984. Tuckman, H.P. and Chang, C.F., 1991. A methodology for measuring the financial vulnerability of charitable nonprofit organizations. Nonprofit and Voluntary Sector Quarterly, 20(4), pp. 445–460. Vinso, J.D., 1979. A determination of the risk of ruin. Journal of Financial and Quantitative Analysis, 14(1), pp. 77–100. Virág, M. and Nyitrai, T., 2014. Is there a trade-off between the predictive power and the interpretability of bankruptcy models? The case of the first Hungarian bankruptcy prediction model. Acta Oeconomica, 64(4), pp. 419–440. Vuong, Q.H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, pp. 307–333. Wallace, W.A., 1985. Accounting policies and the measurement of urban fiscal strain. Research in Governmental and Non-Profit Accounting, 1, pp. 181–212. Wang, G., Ma, J. and Yang, S., 2014. An improved boosting based on feature selection for corporate bankruptcy prediction. Expert Systems with Applications, 41, pp. 2353–2361. Ward, T.J., 1994. An empirical study of the incremental predictive ability of Beaver’s naive operating flow measure using four-state ordinal models of financial distress.  Journal of Business Finance & Accounting, 21(4), pp. 547–561. West, D., Dellana, S. and Qian, J., 2005. Neural network ensemble strategies for financial decision applications. Computers & Operations Research, 32(10), pp. 2543–2559. Wheelock, D.C. and Wilson, P.W., 2000. Why do banks disappear? The determinants of US bank failures and acquisitions. Review of Economics and Statistics, 82(1), pp. 127–138. Wiatowski, T. and Bolcskei, H., 2018. A  mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3), pp. 1845–1866.


Wilcox, J.W., 1971. A  simple theory of financial ratios as predictors of failure.  Journal of Accounting Research, 9(2), pp. 389–395. Wilcox, J.W., 1973. A prediction of business failure using accounting data. Journal of Accounting Research, 11, pp. 163–179. Wilcox, J.W., 1976. The gambler’s ruin approach to business risk. Sloan Management Review (pre-1986), 18(1), p. 33. Wilson, N., Wright, M. and Altanlar, A., 2014. The survival of newly-incorporated companies and founding director characteristics.  International Small Business Journal,  32(7), pp. 733–758. Yu, X., 2006. Competing risk analysis of Japan’s small financial institutions institute for monetary and economic studies.  International Journal of Business and Management,  5(2), pp. 141–180. Zavgren, C.V., 1983. The prediction of corporate failure: The state of the art. Journal of Accounting Literature, 2, pp. 1–35. Zavgren, C.V. and Friedman, G.E., 1988. Are bankruptcy prediction models worthwhile? An application in securities analysis. Management International Review, pp. 34–44. Zhang, Z., 2016. Corporate reorganisation of China’s listed companies: Winners and losers. Journal of Corporate Law Studies, 16, pp. 101–143. Zheng, Q. and Yanhui, J., 2007. Financial distress prediction based on decision tree models. Paper presented at IEEE International Conference on Service Operations and Logistics, and Informatics, Philadelphia. Zhou, Y.A., Kim, M.H. and Ma, S., 2012. Survive or Die? An Empirical Study on Chinese ST Firms. Melbourne, SA: American Committee for Asian Economic Studies (ACAES). Zmijewski, M.E., 1984. Methodological issues related to the estimation of financial distress prediction models. Journal of Accounting Research, 22, pp. 59−82. Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), pp. 301–320.

INDEX

AdaBoost see adaptive boosting adaptive boosting 81, 83, 91 analyst forecasts 16, 156 Asian financial crisis 2 AUC or area under ROC curve 82, 98, 99; see also receiver operating characteristic (ROC curve) auditors 9 Australian Securities and Investment Commission (ASIC) 13 Australia Stock Exchange (ASX) Listing Rules 49 bagging 81; see also random forests balanced error rate 113 Basel Committee on Banking Supervision 17 big data 24, 196 Black-Scholes option pricing models 36, 57, 58; see also KMV model Black Swan events 1 bond ratings 18, 28, 75 Box-Cox power transformation 97, 98, 101 CART model 21, 80, 81, 83 cash flow see operating cash flow censoring 20, 54, 63 choice-based sample bias 43 closed form solution 53, 206 collateralized debt obligations (CDOs) 3 comparison of model performance 141 143 competing risk models 52, 62

confusion matrix 112, 114, 129 corporate bankruptcy 37, 56, 58, 59, 102 corporate failure 1, 2, 9; see also corporate bankruptcy corporate governance 2, 6, 27, 104, 195 196 corporate valuation 2, 15 cox proportional hazard model 20, 54, 62; see also extended Cox model credit ratings 96, 158, 159 credit ratings agencies 4, 6, 7 credit risk 9, 12, 17, 57 curse of dimensionality 101 data quality 22, 53, 140 decision trees 81, 82, 92, 94 deep learning model 97, 104 107 deep neural network structure 105 discounted cash flow 15 16 discrete choice theory 47 discriminant function 34 distance-to-default method 18, 21, 28, 58, 65, 69, 72; see also KMV model distress anomaly 33 34 distressed mergers and takeovers 23 Dodd Frank Wall Street Reform and Consumer Protection Act 6, 7, 16 earnings management 194 195 elastic net method 95, 96 elemental probabilities (nested logit) 51 Emergency Economic Stabilization Act 5


excess returns 115, 117, 120 extended Cox model 20, 54, 193 Failing Company Doctrine 38 failure and distress frequencies (for small and medium sized enterprises or SMEs) 150 151, 154 155; see also small and medium size enterprises (SMEs) Financial Accounting Standards Board (FASB) 59 Financial Crisis Inquiry Commission Report (US government) 4, 5 financial instruments 12 financial ratios 4, 11, 24, 28, 30 31, 32 33, 35, 44, 56, 59 Financial Stability Oversight Council (FSOC) 6 fixed parameter estimates 20 Gambler’s ruin model 21, 66 general additive model (GAM) 207 generalized lasso 22, 95 Gini index 82 global financial crisis 3 going concern assessments 9, 10 13, 18, 37, 42, 193; see also material uncertainties gradient boosting machines 91 93 Halton intelligent draws 20 hazard models 52 56 heterogeneity (unobserved) 47 48 high dimensional 101 holdout samples 44, 83, 98, 99, 145; see also test samples hyper-parameter estimation 98, 140, 198 identical and independently distributed errors (IID) 36 independence from irrelevant alternatives (IIA) 48 institutional ownership 62, 102, 115 interaction effects (in machine learning) 93, 111, 127 129 International Accounting Standards Board (IASB) 14 International Auditing and Assurance Board (IAASB) 11 International Financial Reporting Standard (IFRS) 12 KMV model 69 72

Lachenbruch estimator 44 learn rates 146 lending institutions 17 leverage 35, 36, 56, 58, 71 linear discriminant models 18, 19, 34 35, 199 liquidity 35, 37, 41, 64 loan default 9, 18, 28, 49 logistic regression 40 41, 159 loss functions (in machine learning) 91 93 machine learning methods see CART model; deep learning model; gradient boosting machines; random forests marginal effects 117; see also partial dependency plot market prices 56 57, 59 MARS model 95 96 matched pair designs 43 material uncertainties (and relation to auditing) 9, 13 misclassification error 82, 112 missing value imputation 97 98 mixed logit 46 50 model diagnostics 145 model stability tests 146 multi-class models 45 46 multicollinearity 98, 101, 108 multinomial logit model 47, 49 50 multivariate normality 18, 36 natural language processing (NLP) 23, 28, 196 nested logit models 50 52 neural networks 77 80 New York Federal Reserve 5 N-fold classification (in machine learning) 143 non-linear analysis 93, 95 97, 102 not-for-profit entities 163 167 oblique random forests 214 Ohlson model 41 operating cash flow 47, 49, 172 ordered logit 18, 45 Organization for Economic Cooperation and Development (OECD) 150 partial dependency plot 93 94; see also marginal effects probit boosted model 203 probit models 42 43, 198 Public Company Accounting Reform and Investor Protection Act (SOX) 3 public sector entities 168 172


quadratic discriminant analysis 37, 199 random forests 83, 94 95, 212 randomization processes (in gradient boosting) 140 141 random parameter estimates 48 49 random walk 66 67 receiver operating characteristic (ROC curve) 98 99, 111 112 recursive partitioning models 21, 28, 65, 80 81; see also CART model regularized regression 95 relative variable importances (in machine learning) 93, 110, 116 residual income valuation model 15 retained earnings 35, 104 Salford Predictive Modeler (SPM) 22, 110 Securities and Exchange Commission (SEC) 7 sensitivity (and specificity) in classification tests 111 112 simulated maximum likelihood 53 small and medium size enterprises (SMEs) (fail rates and distress) 8, 150 151, 154 155 solvency 11, 33

steepest gradient descent optimization 92 stochastic processes (in gradient boosting machines) 141 support vector machines (SVMs) 117, 209 test samples 44, 83, 98, 99, 145 time-varying covariates 20 21 tree depth 22, 111 TreeNet method (Salford Systems) 111 Troubled Asset Relief Program (TARP) 5 Type I and Type II classification errors 22, 32, 39, 42, 79 univariate approach 30 unordered logit model 50 51 users (of distress forecasts) 9 17 variable importance scores 93, 110, 116; see also relative variable importances Volcker rule 7 Weibull hazard model 54 weighted exogenous sample maximum likelihood (WESML) 43 working capital 29, 31, 35, 41, 49 Zeta model 37 38