
With the rise of microservices and distributed system architectures, software applications are becoming increasingly complex. Given this new environment, high-performing organizations are now adopting chaos engineering as a practice to assess their software systems' resiliency against failure. According to principlesofchaos.org, "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

The complexity of distributed systems makes them vulnerable to unpredictable failure events, such as dependency failures, infrastructure failures, network failures, architectural defects, and application failures. The more interconnected components a system has, the higher the chance of something going wrong, which can lead to huge financial loss. According to an IDC study, unplanned downtime costs businesses on average some $1.25 billion to $2.5 billion each year.

Planned experiments are essential to build confidence in systems and mitigate failures. In this post, we’ll delve into the origins and principles of chaos engineering, some real-world use cases, best practices, and tools to help you along the way.

But first, let’s take a look at further benefits that come from following this discipline.

Why Chaos Engineering?

We already mentioned the need to build confidence and mitigate failures, but below we list additional reasons you should start your journey into chaos engineering to improve the resiliency in your systems and organization.

1. Understanding Risks & Impacts

Chaos engineering allows you to understand the impact of turbulent conditions on critical applications by letting you create an experiment and measure how it affects your business. When companies understand what’s at stake, they’re able to make informed decisions and proactively react in order to minimize or prevent losses.

2. Incident Response

The complex nature of distributed systems creates many opportunities for things to go wrong. For companies with highly regulated environments, like the financial industry, the concept of disaster recovery and business continuity is crucial, as a single moment of downtime can be damaging. Through chaos experiments, these industries can practice, prepare, and put processes in place for real incidents. Chaos engineering enables teams to have the right level of awareness, strategies, and visibility whenever an attack or incident does occur.

3. Application Observability & Security

Chaos experiments allow you to understand the gaps in your systems' monitoring and observability capabilities, as well as in your team's ability to respond to incidents. Chaos engineering will help you see areas for improvement and drive you to make your systems more observable, thus enhancing the quality of your telemetry data.

4. System Confidence

Chaos engineering enables organizations to develop reliable and fault-tolerant software systems, building your team’s confidence in them. The more stable your systems are, the more confident you can be that they will function properly.

Origin of Chaos Engineering

In 2008, one of Netflix's databases became corrupted, causing a three-day service outage that negatively affected millions of Netflix customers. In response, Netflix began migrating from its on-premises data center to the AWS cloud. Having migrated to AWS, Netflix's engineering team built a suite of open-source tools called the "Simian Army" for checking the resilience, reliability, and security of their AWS infrastructure against all kinds of failures. The Simian Army comprises tools like Chaos Monkey, Janitor Monkey, Chaos Kong, Doctor Monkey, Latency Monkey, and Security Monkey.

Here's how Netflix describes why they built these chaos tools:

The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link.

One of the goals of chaos engineering is to build resilience into systems. Over the years, many teams have followed suit and have successfully implemented this process by following a common set of principles, which we’ll discuss below.

Principles of Chaos Engineering

Let's assume you're building the next disruptive fintech app. For customer satisfaction, it's important that your app is resilient and available at all times. With resiliency as your desired outcome, you could verify that your system achieves it by running a set of planned experiments that follow a standardized process known as the principles of chaos engineering, as quoted from the site of that same name:

Start by defining 'steady state' as some measurable output of a system that indicates normal behavior.
Hypothesize that this steady state will continue in both the control group and the experimental group.
Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
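The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not a real chaos tool: the service, its 99% baseline success rate, the `fault_rate` parameter, and the 2% tolerance are all hypothetical values chosen for the example.

```python
import random


def simulate_request(fault_rate: float, rng: random.Random) -> bool:
    """Return True if a simulated request succeeds.

    Hypothetical model: the service itself fails 1% of the time, and an
    injected dependency fault fails an extra `fault_rate` share of requests.
    """
    if rng.random() < 0.01:
        return False
    if rng.random() < fault_rate:
        return False
    return True


def success_rate(fault_rate: float, requests: int, rng: random.Random) -> float:
    """Steady-state metric: the fraction of requests that succeed."""
    successes = sum(simulate_request(fault_rate, rng) for _ in range(requests))
    return successes / requests


def run_experiment(fault_rate: float, requests: int = 10_000,
                   tolerance: float = 0.02, seed: int = 42) -> dict:
    """Compare a control group with an experimental group.

    The hypothesis is that the steady state stays within `tolerance`
    even when the fault is injected into the experimental group.
    """
    rng = random.Random(seed)
    control = success_rate(0.0, requests, rng)
    experimental = success_rate(fault_rate, requests, rng)
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": abs(control - experimental) <= tolerance,
    }


if __name__ == "__main__":
    # A 20% dependency failure rate with no fallback disproves the hypothesis.
    print(run_experiment(fault_rate=0.20))
```

In a real experiment, the steady-state metric would come from your monitoring system rather than a simulation, and a disproved hypothesis is the useful result: it points at a weakness to fix.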

Of course, chaos engineering is pointless without fixing weaknesses it uncovers. So you’ll also need to prioritize and remediate any issues found. In addition to the principles above, the following points offer up some best practices you should keep in mind while running chaos experiments.

Vary Real-World Events

It's essential to experiment with real-world events that result from software vulnerabilities, hardware failures, and even incidents that don't necessarily cause failures, such as operational growth and traffic spikes. Make sure to induce a variety of real-world events for your experiments, not just failures.

Minimize Blast Radius

Minimizing an experiment's blast radius limits how many services and users it can affect if something goes wrong. Start with the smallest scope that can still test your hypothesis. As you gain more confidence in the system, you should increase the scope of testing beyond the initial magnitude and blast radius to cover more areas of your systems.
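One way to picture a growing blast radius is to target a small percentage of instances first and widen it in stages. The sketch below assumes a hypothetical fleet of instance names and hypothetical stage sizes of 1%, 5%, and 25%; how targets are actually selected depends on the tool you use.

```python
import random


def pick_targets(instances: list, percent: int, seed=None) -> list:
    """Pick a random subset of instances covering `percent` of the fleet.

    Always selects at least one instance so that the smallest stage
    still exercises the fault on something.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if not instances:
        return []
    # Integer ceiling division avoids floating-point rounding surprises.
    count = max(1, -(-len(instances) * percent // 100))
    return random.Random(seed).sample(instances, count)


if __name__ == "__main__":
    fleet = [f"web-{i:03d}" for i in range(200)]  # hypothetical instance names
    for percent in (1, 5, 25):  # widen the blast radius stage by stage
        targets = pick_targets(fleet, percent, seed=7)
        print(f"{percent}% stage targets {len(targets)} instances")
```

Each stage would only run after the previous one finished without disproving the steady-state hypothesis.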

Run Experiments in Production

Running your chaos experiments in production is a recommended practice as long as the fault injection can be contained and controlled. Make sure to have a rollback plan just in case the experiment goes awry.
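A contained experiment usually pairs the fault injection with an abort condition and a rollback that always runs. The following is a minimal sketch of that idea; the `inject`, `rollback`, and `read_error_rate` callables are hypothetical hooks you would wire to your own tooling and monitoring, and the threshold is an assumed value.

```python
import time
from typing import Callable


def run_guarded(inject: Callable[[], None],
                rollback: Callable[[], None],
                read_error_rate: Callable[[], float],
                abort_threshold: float,
                checks: int,
                interval_seconds: float = 0.0) -> str:
    """Inject a fault, watch a health metric, and always roll back.

    Returns "aborted" if the error rate crosses the threshold during the
    experiment, otherwise "completed".
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > abort_threshold:
                return "aborted"  # stop early; the finally block rolls back
            time.sleep(interval_seconds)
        return "completed"
    finally:
        # Runs on normal completion, early abort, and unexpected exceptions.
        rollback()


if __name__ == "__main__":
    readings = iter([0.01, 0.02, 0.20])  # hypothetical error-rate samples
    outcome = run_guarded(
        inject=lambda: print("fault injected"),
        rollback=lambda: print("fault rolled back"),
        read_error_rate=lambda: next(readings),
        abort_threshold=0.05,
        checks=3,
    )
    print(outcome)
```

Putting the rollback in a `finally` block is the key design choice here: the fault is removed even if the monitoring call itself fails.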

Before running an experiment in production, you should consult business stakeholders to confirm that the test won’t interfere with any important business activity, such as a customer demo.

Implementing Chaos Engineering Effectively

Chaos engineering is not about injecting failures randomly without an end goal in sight. It's more about running thoughtful, planned experiments in a controlled environment. It transcends building resiliency into software and can uncover weaknesses across your entire organization, from your network, infrastructure layers, data, and software architecture to processes, culture, and people—who can often be the greatest concern when it comes to security.

Many businesses still fail to unleash the full potential of chaos experiments for various reasons. The good news is, whether you're planning to implement chaos engineering for the first time or are looking to further optimize your experiments, the following points will help put you on the right track.

Start with Game Days

Once you’ve formed your hypothesis and created solid plans for running chaos experiments, you can start with game days. A game day exercise allows you to understand a system's behavior with respect to performance, cost, reliability, security, and operations. It's a way to examine how your system functions when it encounters a real-world failure. It's also an opportunity to strengthen your intuition about large-scale, live, and catastrophic failures.

To implement chaos engineering in an organization, especially a financial institution where there's a high need for business continuity and disaster recovery, you need to bring the entire software development team on board, share context, and simulate the right failures to learn more about your system. A game day exercise follows three steps: define the scenario you want to practice, execute the simulation, and then analyze the result.

Run Experiments Periodically

Errors are inevitable in production systems. Because they are bound to happen, you should reduce their impact by running chaos experiments periodically, not just once a year. When you run experiments on a regular basis, you're able to better test your system's ability to withstand failures and also discover new issues that need to be addressed. A one-off experiment is not enough to build confidence in your system.

Automate Your Experiments

Manual chaos experiments are unsustainable, labor-intensive, and counterproductive, impeding a proper analysis of your system. Through automated chaos experiments, you can build resilience into systems and achieve a high development velocity—especially in distributed systems—reliably and sustainably.

Stick to Solid Tools When Experimenting

Getting started with chaos experiments can be quite scary and complicated. But once you have the right set of tools in place, it becomes easy to begin running experiments. Currently, there are several tools available to help you. Notable ones include PowerfulSeal, LitmusChaos, Pumba, Simian Army, and Gremlin. You can learn more about existing and new chaos engineering tools by regularly reviewing this diagram designed by the Chaos Engineering Slack community.

Create a Blameless Post-Mortem Culture

Mistakes are inevitable. They will surely happen, and when they do, you should avoid finger-pointing. Instead, reflect on the errors to understand why they happened and how you can prevent them from happening again. When analyzing the cause of the failure, you need to be transparent and open about what went wrong and also avoid blaming individuals. This blameless post-mortem culture will help you learn from mistakes, cultivate a proper culture, and build systems that can withstand failures.

Set Up Proper Monitoring (Observability)

No system is 100% reliable, and you can't predict all the ways a system will fail. Whenever there's a system failure, you need to understand what's going on and discover its possible causes. One way to achieve this is by making your systems observable. Proper monitoring enables you to easily understand a system’s steady states and the changes that occur when there is an issue. You can use tools like Loki, Prometheus, and Jaeger to improve your system’s observability, and a tool like Chaos Toolkit to automate chaos experiments in containerized applications.

Some Popular Tools for Chaos Engineering

There are mature tools that can help you facilitate chaos experiments and better prepare for unexpected failures without disrupting system operations. Below are two key tools to get you going.

Gremlin

Gremlin is a Failure-as-a-Service (FaaS) platform that lets you build resilience to failure, improve incident response, prevent expensive outages and service disruptions, and maintain customer trust. With Gremlin, you can run chaos experiments securely and safely across every layer of your application’s infrastructure and platform.

PowerfulSeal

PowerfulSeal is an open-source, robust chaos engineering tool that enables you to inject failures into Kubernetes clusters to identify problems in your system as early as possible. Chaos Monkey served as the inspiration for PowerfulSeal, and it fully abides by the principles of chaos engineering.

Continue to Cause Chaos

Because of the unpredictable nature of distributed systems, chaos engineering is a powerful experimental testing procedure that every organization should add to its development arsenal. You don't need anything sophisticated to start your journey into chaos engineering. Basic experiments like injecting node failures, adding latency, and introducing intermittent request errors are a great starting point.
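As an illustration of how small such a starting point can be, here is a sketch of a decorator that adds latency and intermittent errors to a function call. The `chaos` decorator, the `InjectedFault` exception, and the `fetch_balance` function are hypothetical names invented for this example, not part of any tool mentioned in this post.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def chaos(latency_seconds: float = 0.0, error_rate: float = 0.0, rng=None):
    """Wrap a function so calls are delayed and fail intermittently.

    `error_rate` is the probability (0.0 to 1.0) that a call raises
    InjectedFault instead of running the wrapped function.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                time.sleep(latency_seconds)  # simulate a slow dependency
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(latency_seconds=0.05, error_rate=0.3)
def fetch_balance(account_id: str) -> int:
    """Hypothetical downstream call used only for this example."""
    return 100


if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(attempt, fetch_balance("acct-1"))
        except InjectedFault as exc:
            print(attempt, "failed:", exc)
```

Running the calling code against a wrapped dependency like this shows whether its timeouts, retries, and fallbacks behave the way you expect.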

You’ll be amazed at the kind of unexpected behaviors and insights you’ll learn from your system and how your team starts to think about system design, fault-tolerance, and reliability. You’ll quickly identify issues in your monitoring and observability tools even before you reach production, and your organization will eventually develop a culture of resilience engineering and game day exercises to optimize your production environment.