May 12th, 2021 by inflectra
When people talk about 'mission-critical systems' or 'high assurance industries', they are normally referring to real-time control systems such as air traffic control, automotive, railways, hospitals, utilities, and flight-control software. However, when you look at the technology that underpins our economy and society, the definition of what is truly "mission critical" is actually much broader. In this article, I discuss some of the systems used in banking and finance, explain why they are mission-critical, and argue that if we treat them as such, we should rethink how we define, develop, test, and deploy them.
The textbook definition of a mission critical system is as follows:
A mission critical system is a system that is essential to the survival of a business or organization. When a mission critical system fails or is interrupted, business operations are significantly impacted.
However, in software development, when we talk about mission critical systems, we often think about systems that are either real-time (such as flight-control and process control systems) or are used in industries where failure can mean the difference between life and death (e.g., patient health records, clinical systems, etc.). In reality, the definition is much broader. During the past year, we've all been working from home and relying on platforms such as Amazon and DoorDash, so perhaps in the future we'll consider those to be mission critical as well!
When the discussion turns to financial systems, the term "business systems" is often used to distinguish such systems from "mission critical systems." The thesis is that if a mission critical system fails, lives are lost, whereas if a business system fails, the consequences are less serious. This difference in approach is fundamental to how the different systems are built and tested. We tolerate failures and unreliability in business systems that we'd never allow in an air-traffic-control system or hospital system. With many business systems, the assumption is that a failure can be tolerated for a "short time" and that it is cheaper to pay the customer or regulator compensation when an SLA is breached than to bear the cost of preventing the breach in the first place.
Although it is very common for people to talk about business systems as being somehow less important than real-time systems, I am reminded of a scene from the movie Too Big to Fail.
In the movie, the characters realize that the failure of certain banks and other interconnected entities in the financial system would have catastrophic effects on the world economy and society as a whole. When banks cannot function, people don't get paid, businesses cannot operate, and society can collapse just as quickly as if a system at an electrical utility were to fail.
Now in most cases, the failure of financial systems will not result in a collapse of society or a worldwide financial panic, but even localized system failures can have catastrophic consequences for those directly involved.
The background to the 2018 TSB migration failure was that during the 2008 financial crisis, several large UK banks were hurriedly merged together. Once the crisis had receded, the banks were forced to de-merge to reduce systemic risk. However, this resulted in two now separate banks - TSB and Lloyds Banking Group - using the same IT platform, which was controlled by Lloyds Banking Group. When TSB was sold to another bank - Banco Sabadell - the new group decided to migrate the TSB customers to the Proteo platform that was owned by Banco Sabadell. However, the migration was not implemented successfully: many customers could not access their accounts for days, companies could not pay their staff, and many people lost their life savings when other customers could improperly access their accounts. Beyond the financial impact (which was severe), people had weddings canceled and lost houses they were trying to purchase.
Ranking as perhaps the costliest computer “whoops” ever, a programming glitch at Knight Capital Group cost the firm $440 million and left it on the brink of bankruptcy.
Knight Capital’s computers were supposed to roll out multiple automatic orders over several days. Instead, the computers signaled their programs to make all the changes in one day, resulting in a massive number of shares being bought and sold immediately. Specifically, 150 stocks listed on the New York Stock Exchange were traded at extremely high speed. The computer error nearly put the company out of business. As it was, over 5% of its staff were laid off as part of the firm’s subsequent reorganization.
Horizon was a computer system introduced into the UK Post Office network in 1999. The system was used for tasks such as managing transactions, accounting, and inventory management (effectively a specialized ERP system for post offices). Sub-postmasters complained about bugs in the system after it reported shortfalls, some of which amounted to many thousands of pounds. Some sub-postmasters attempted to plug the gap with their own money, even remortgaging their homes in an (often fruitless) attempt to correct an error. Despite these complaints and reported bugs, the Post Office prosecuted over 700 sub-postmasters - an average of one per week - based on information generated by Horizon, without any external corroboration. Some went to prison following convictions for false accounting and theft; many were financially ruined and have described being shunned by their communities. Some have since died. After 20 years, campaigners won a legal battle to have their cases reconsidered after claiming that the computer system was flawed.
The stock market crash of October 19, 1987 (Black Monday) marked the end of the golden eighties in America, and the plunge has been linked to poor programming planning. In the midst of giant tumbles in market numbers, computers that had been programmed to protect against compounded losses went haywire trying to insure their portfolios. Once stocks began to plummet, the programs registered the losses and signaled further sales of stock, resulting in a positive feedback loop. The stocks being sold caused other programs to offset their own losses by selling into the declining market at high speed. The market crash cost some investors millions of dollars and led to outrage, sometimes culminating in shooting rampages and the death of some stockbrokers who were thought to be responsible for losses, according to stock-market-crash.net.
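The positive feedback loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of the actual 1987 trading programs: the function name and the `sensitivity` and `impact` parameters are hypothetical, and the rule "sell in proportion to the last price drop" is a deliberate simplification.

```python
def simulate_feedback(initial_price, shock, sensitivity, impact, steps):
    """Toy model of automated selling into a falling market.

    shock       -- initial fractional price drop (e.g. 0.01 for 1%)
    sensitivity -- hypothetical: how strongly programs sell per unit of drop
    impact      -- hypothetical: how much that selling moves the price
    Returns the list of prices, starting with initial_price.
    """
    prices = [initial_price, initial_price * (1 - shock)]
    for _ in range(steps):
        # Fractional fall observed over the previous step.
        last_drop = (prices[-2] - prices[-1]) / prices[-2]
        if last_drop <= 0:
            # No loss to react to, so the programs do nothing.
            prices.append(prices[-1])
            continue
        # Programs react to the loss by selling...
        sell_pressure = sensitivity * last_drop
        # ...and the selling itself pushes the price down again.
        next_drop = min(impact * sell_pressure, 0.99)
        prices.append(prices[-1] * (1 - next_drop))
    return prices
```

In this sketch, when `sensitivity * impact` is greater than 1, each drop is larger than the one before it and a small shock snowballs; when it is less than 1, the drops shrink and the market settles. That threshold is the kind of system-level behavior that testing each program in isolation would not reveal.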
So we can conclude from these examples that failures in financial IT systems can be every bit as catastrophic as failures in a real-time system. Yet, the processes and oversight used in financial IT programs are, in many cases, not fit for purpose. If we consider the approaches used in high-assurance industries and apply them to projects in the financial sector, maybe we can avoid some of these problems.
When you develop software or systems in the Life Sciences, you don't have the freedom to define the requirements, build the system, test it, and release it without any oversight. In such regulated industries, you have to meet the requirements of 21 CFR Part 11. This means:
So we can apply these approaches to developing IT systems in finance, banking, and insurance.
The key is to be able to deliver modern, secure IT systems to customers, partners, and internal users in a way that gives you the benefits of an agile approach (early user feedback, increments of functionality delivered on time, the ability to adapt quickly to industry changes such as mobile payments, blockchain, etc.) while at the same time complying with the various rules (FATCA, Sarbanes-Oxley, Patriot Act, GDPR, SDFR) and avoiding catastrophic failure.
Banks, insurance companies, stock exchanges, regulators, and other parties embarking on large, mission critical financial services projects need to design their DevOps toolchain with this in mind. Here are some key take-aways:
Treat financial systems as if they were mission critical / safety critical systems. They really are!
Understand the audit and compliance rules and make sure your agile / DevOps tools can generate that data for you. If not, change your systems before the project!
Document all your quality standards and processes and have an internal SOC 2 audit to ensure your compliance before you start development work.
Use automated testing and validation as much as possible, and make sure the results tie back to your requirements and documented risks.
Integrate functional, load, security, and other testing tools into your DevOps toolchain.
Have integrated real-time reporting and traceability back to mandatory requirements.
It is not enough to know that a build failed; you need to know which requirements were impacted by the failure.
Include risk management and risk techniques into your agile process.
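To make the traceability point concrete, here is a minimal sketch of how failed tests could be mapped back to the requirements they verify. The `CheckResult` type, the function name, and the shape of the traceability matrix are all assumptions for illustration; a real toolchain would pull this data from its test runner and requirements management system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one automated test (hypothetical structure)."""
    test_id: str
    passed: bool


def impacted_requirements(results, trace):
    """Map each requirement ID to the failed tests that cover it.

    results -- iterable of CheckResult
    trace   -- dict of test ID -> list of requirement IDs the test verifies
               (a hypothetical traceability matrix)
    Requirements whose tests all passed are omitted from the returned dict.
    """
    impacted = {}
    for result in results:
        if result.passed:
            continue
        for requirement_id in trace.get(result.test_id, []):
            impacted.setdefault(requirement_id, []).append(result.test_id)
    return impacted


def untraced_tests(results, trace):
    """Return sorted IDs of tests with no entry in the traceability matrix."""
    return sorted(r.test_id for r in results if r.test_id not in trace)
```

A build report built on something like this can say "requirement REQ-205 is at risk because two of its tests failed" rather than just "the build is red", and `untraced_tests` flags test results that an auditor could not tie back to any requirement.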