Cloud Resilience Testing
While many Financial Institutions are increasingly looking to reap the benefits of cloud computing, there are still several challenges and decisions that organizations wrestle with. One of these is how to test the resilience of the hybrid cloud as well as of their applications running on the cloud.
Resilience can be defined as the ability of a system to withstand failures and recover from them. Resilience directly contributes to the overall availability of the solution, measured through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and ultimately affects the usability of the solution and the SLAs (Service Level Agreements) that it provides.
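To make the link between these metrics and availability concrete, here is a minimal sketch using the commonly cited steady-state relation availability = MTBF / (MTBF + MTTR). The function names and the sample figures are illustrative assumptions and are not taken from the white paper.

```python
HOURS_PER_YEAR = 8760.0  # assumption: a non-leap year


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction, using MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(availability: float) -> float:
    """Return the expected hours of downtime per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical service: fails on average every 999 hours, recovers in 1 hour.
availability = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {availability:.4%}")
print(f"expected downtime per year: {annual_downtime_hours(availability):.2f} hours")
```

Under this relation, halving MTTR roughly halves the expected downtime when MTBF stays the same, which is one reason recovery procedures are worth exercising in resilience tests.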
While failures in systems may ultimately manifest as errors or unavailability of a component/system, the list of factors that may cause failures in a distributed, cloud-native system is significant.
When moving workloads to the public cloud, clients need to understand the resilience capabilities of the cloud provider as well as of their own applications.
Regulatory bodies around the world are also introducing more stringent rules and requirements for resilience testing. One example is DORA, the EU’s Digital Operational Resilience Act. Cloud providers must comply with these regulations and prove their resilience by testing resilience scenarios. Clients or solution teams can then build their own testing strategy based on the resilience capabilities the cloud provides to them.
We have outlined an approach, based on best practices, to test the resilience of the public cloud as well as of applications running on it, in a white paper we want to share with you. I would like to thank my colleagues who contributed to this paper: Boas Betzler, Fabio Benedetti, Ephraim P., John DeMarco, and Medha Adusumilli.
The best practices come from years of experience of the IBM Cloud services team, which provides, manages, and supports hundreds of resilient cloud services. The recommendations for resilience testing of cloud applications are based on the experience of working with clients that have successfully deployed mission-critical workloads on the IBM Cloud and have considerable experience in productive delivery.
We discuss the types of resilience that cloud providers must support and the different failure scenarios that must be tested regularly. We then share recommendations for client workloads and applications built on top of cloud provider services and how their resilience should be tested.
The paper includes:
- Executive Summary
- Cloud Resilience Concepts
- Types of Workload Resilience
- Layering of Cloud Services and Applications Workloads
- Resilience Test Objectives and Requirements
- Scenarios for Resilience Testing
- Approaches and Procedures for Resilience Testing
- Summary
For Financial Institutions to take advantage of the benefits of cloud computing, they must manage the risks associated with migrating sensitive workloads to the cloud. This resilience testing best practices model, which includes cloud provider testing requirements and client application resilience testing recommendations, gives clients a starting point that they can tailor and adopt when planning and reporting cloud-related resilience risks to executives, boards, regulators, and auditors. This can help Financial Institutions meet the challenges brought by new Operational Resilience regulations, like DORA.
To access this white paper, first log in here: https://developer.ibm.com/portals/fscc/, and then follow this link: https://developer.ibm.com/middleware/v1/cos/fscc/resiliency-testing-whitepaper-v25.pdf
What are your experiences and best practices for Resilience Testing?