In my previous blog post, I discussed how, despite the immense potential of the cloud, two-thirds of companies using it fail to derive the expected benefits. Post-migration performance issues often dog applications, especially when on-prem systems and applications are migrated without appropriate remediation and testing. Let’s examine a common one in detail – failures in applications with high-availability requirements – and see what causes it and how it can be avoided.
High availability promise of cloud
One of the most appreciated business benefits of cloud migration is high availability, typically defined as the ability of a system to continue providing uninterrupted service under all conditions. It is usually achieved through automated, continuous monitoring, clustering, load balancing, detection of impending failures and automated failover to a redundant secondary sub-system or component when the primary fails. All of this happens at the infrastructure layer.
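The failover mechanism described above can be sketched in a few lines. This is a minimal, illustrative model only – the names (`Endpoint`, `route`) are hypothetical and do not correspond to any specific cloud provider’s API:

```python
# Sketch of infrastructure-style failover: traffic is routed to a primary
# endpoint while it passes health checks, and to a redundant secondary
# endpoint when the primary is detected as failed.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    healthy: bool = True  # in practice, set by continuous health monitoring

def route(primary: Endpoint, secondary: Endpoint) -> Endpoint:
    """Return the endpoint that should receive traffic."""
    if primary.healthy:
        return primary
    # Automated failover: the primary failed its health check.
    return secondary
```

In a real deployment, the health flag would be driven by continuous monitoring probes rather than set manually, but the routing decision is the same.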
How application resilience is often ignored
Often, when an on-prem system is migrated to the cloud, customers assume high availability is a given, since cloud providers guarantee availability targets based on the subscribed service level. Unfortunately, this assumption is incorrect. While the cloud provider guarantees the availability of the underlying infrastructure through the strategies listed above, the cloud tenant remains responsible for architecting and designing the application layer for resiliency. If this is not done, applications may fail unexpectedly, affecting overall system availability.
But what is application resiliency? It can be defined simply as the ‘ability to provide and maintain acceptable levels of service in the face of various faults and challenges to normal operation’. Resilient applications adapt successfully to unforeseen, disruptive changes in their environment. This includes fault recovery and graceful degradation under extreme conditions.
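Graceful degradation in particular can be made concrete with a short sketch: when a dependency fails, the application serves a reduced but acceptable response instead of an error. The function and parameter names here (`get_recommendations`, `fetch_live`) are hypothetical:

```python
# Sketch of graceful degradation: serve a full response when a dependency
# is up, and fall back to cached data instead of failing outright.
def get_recommendations(user_id, fetch_live, cache):
    """Return live results, degrading to cached ones if the dependency fails."""
    try:
        result = fetch_live(user_id)
        cache[user_id] = result  # refresh the fallback copy on success
        return result, "live"
    except Exception:
        # Degraded mode: stale data is an acceptable level of service,
        # an error page is not.
        return cache.get(user_id, []), "degraded"
```

The key design choice is deciding, per feature, what “acceptable” looks like when dependencies are unavailable – stale data, defaults, or a reduced feature set.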
Some migration strategies lead to failures in production because the application layer was neither originally architected nor remediated for resiliency before migration. Even natively developed cloud applications must adopt resiliency best practices to realize the full promise of cloud.
Recently, a music streaming service provider suffered a severe service outage when a transient network problem triggered a cascading failure across multiple microservices. The system had not been architected and tested to detect and recover gracefully from such an isolated transient failure, so the fault cascaded downstream.
Resiliency testing is critical, but conventional testing cannot help
A key challenge is how to test, evaluate and characterize cloud application resiliency before going live such that system availability is maintained as per business goals. Conventional testing approaches cannot adequately uncover cloud application resiliency issues for a variety of reasons:
- Existing test strategies are driven by business use cases or requirements and cannot discover deep, hidden architectural faults.
- Heterogeneous, multi-layer architectures are prone to failures arising from the complexity of interactions between software entities.
- Production usage patterns are poorly deterministic, leading to unforeseen ‘emergent’ behavior of the cloud application architecture, especially in hybrid and multi-cloud deployments.
- Failures are often asymptomatic, remaining latent as internal system errors until specific conditions make them visible.
- Layers within the cloud may have different stakeholders and administrators, leading to configuration changes that were not anticipated during application design and that break interfaces.
Key strategies to test and evaluate cloud application resilience
In the cloud, architecting for application resiliency is even more critical because of the multi-tier, multi-technology and distributed nature of cloud systems. These factors can cause cloud applications to fail in unpredictable ways even though the cloud provider covers the underlying infrastructure resiliency. Cloud quality engineers should adopt the following strategies to test, evaluate and characterize application-layer resilience:
- Collaborate with cloud architects to define availability goals and derive application layer resilience attributes.
- Hypothesize failure modes based on expected or observed usage patterns and prioritize testing of these failure modes based on business impact.
- Inject specific faults that trigger internal errors to simulate failures during the development and testing phases. We call this the “fault kitchen with recipes” – recipes that include failure situations such as inordinate delays in response, resource hogging, network outages, transient conditions, extreme user actions and many more.
- Inject faults with varying severity and combination and monitor application layer behavior.
- Identify anomalous behavior and iterate the above steps to confirm criticality.
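The fault-injection steps above can be sketched as a wrapper that serves two of the “recipes” – inordinate delay and transient errors – with tunable severity. This is an illustrative sketch, not any particular chaos-engineering tool; the names and probabilities are assumptions:

```python
import random
import time

def inject_faults(fn, delay_prob=0.2, error_prob=0.1, max_delay=2.0, rng=None):
    """Wrap `fn` so that calls randomly experience added latency or a
    transient error, letting testers observe application-layer behavior
    under faults of varying severity and combination."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < delay_prob:
            time.sleep(rng.uniform(0, max_delay))  # recipe: inordinate delay
        if rng.random() < error_prob:
            raise TimeoutError("injected transient fault")  # recipe: outage
        return fn(*args, **kwargs)

    return wrapper
```

Sweeping `delay_prob` and `error_prob` upward while monitoring the application is one way to carry out the “varying severity and combination” step, and anomalies observed under injection feed the iteration step that follows.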
By adopting an architecture-driven testing approach, organizations can gain insights into cloud application resiliency well before going live and budget sufficient time for performance remediation activities.
Author Kishore Durg is lead, intelligent cloud & infrastructure, and co-author Mahesh Venkataraman is lead, quality engineering services – growth markets, at Accenture.