Introduction
As the name suggests, chaos engineering is the practice of deliberately introducing chaos into a system to make it more resilient and robust to failures that might happen in production. It's about embracing the inevitable chaos in our complex systems and finding solutions for it. You can also think of chaos engineering as a vaccine, and of engineers as the white blood cells (WBCs) that make the body (here, the system) more immune to potential dangers.
Why chaos engineering?
You might have come across news headlines like these:
Facebook lost $65 million due to hours-long outage: On 5 October 2021, Facebook was down for a mere 5 hours, yet it suffered a huge loss of revenue.
British Airways suffers a power outage leaving 75,000 passengers affected: The CEO of British Airways explained that this one failure, which stranded tens of thousands of British Airways (BA) passengers in May 2017, cost the company 80 million pounds ($102.19 million USD).
A few hours might not seem like a big deal to us, but an outage of that length deals a huge blow to large enterprises that run their business across the globe. Not only does it hurt them financially, it also stains their reputation as a company that cannot cater to the needs of its customers. As technology has progressed, systems have grown increasingly complex, and these failures have become much harder to predict. This is why companies need ways to tackle these problems.
How chaos engineering solves our problems
Chaos engineering works on the following principles:
Plan an experiment
This involves forming a hypothesis about what could go wrong under real conditions.
Run experiment
Run the test at the smallest scale that will still teach you something. Keep the blast radius, which is a measure of the total impact of the fault, to a minimum.
Verify
Increase or decrease the blast radius as the test requires, up to full scale. Analyze the behavior of the system at each step.
Improve
Make improvements or introduce fixes that make the system more robust and immune to failure.
Steady State
After the improvements are done and the system has reached a steady state, inject a new failure or keep repeating the same experiments.
When the test is over, we have a better understanding of our system's capabilities and flaws.
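The steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the fleet, the capacity and load numbers, the 5% error threshold, and every function name are hypothetical and not taken from any real tool. It states a hypothesis (the error rate stays below the threshold), injects a fault into a small part of the fleet, widens the blast radius step by step, and stops at the first radius where the hypothesis fails.

```python
import random

# Hypothetical toy model: every name and number here is an assumption
# made for illustration, not taken from any real system or tool.
CAPACITY_PER_INSTANCE = 100   # requests/second one healthy instance can serve
LOAD = 600                    # steady incoming requests/second
MAX_ERROR_RATE = 0.05         # hypothesis: error rate stays below 5%


def error_rate(healthy_count):
    """Fraction of requests dropped because capacity is below the load."""
    capacity = healthy_count * CAPACITY_PER_INSTANCE
    return max(0.0, (LOAD - capacity) / LOAD)


def run_experiment(fleet, blast_radius, rng):
    """Kill a fraction of the fleet, measure the error rate, then restore."""
    count = max(1, round(len(fleet) * blast_radius))
    targets = rng.sample(sorted(fleet), count)
    for name in targets:
        fleet[name] = False           # inject the fault
    try:
        healthy = sum(fleet.values())
        return error_rate(healthy)
    finally:
        for name in targets:
            fleet[name] = True        # always roll back, even on error


def find_breaking_point(fleet, radii, rng):
    """Grow the blast radius step by step; stop at the first failed hypothesis."""
    results = []
    for radius in radii:
        rate = run_experiment(fleet, radius, rng)
        passed = rate < MAX_ERROR_RATE
        results.append((radius, rate, passed))
        if not passed:
            break
    return results


fleet = {f"instance-{i}": True for i in range(10)}
results = find_breaking_point(fleet, [0.1, 0.2, 0.4, 0.5], random.Random(0))
for radius, rate, passed in results:
    status = "ok" if passed else "hypothesis failed"
    print(f"blast radius {radius:.0%}: error rate {rate:.1%} -> {status}")
```

In this model the fleet absorbs the loss of up to four instances, and the hypothesis first fails when half the fleet is down. That result is what the "Improve" step would act on, for example by adding capacity before repeating the experiment.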
History of chaos engineering
- In 2010, chaos engineering was first used by the Netflix Engineering Tools team, which created Chaos Monkey in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services, and to the need to be sure that the loss of an Amazon instance wouldn’t affect the Netflix streaming experience.
- In 2011, the Simian Army was born. The Simian Army added failure injection modes on top of Chaos Monkey, which allowed a more complete suite of failure states to be tested and resilience to be built against those as well. “The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system” (Netflix, 2011).
- In 2012, Netflix shared the source code for Chaos Monkey on GitHub, saying that they “have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient” (Netflix, 2012).
This paved the way for more organizations to use chaos engineering to strengthen the reliability of their systems.
Tools for chaos engineering
Various tools can help in conducting a chaos engineering experiment. A few of them are listed below:
Chaos Mesh
Chaos Mesh is an open-source, cloud-native tool designed specifically for chaos engineering. Using various fault simulations, Chaos Mesh helps organizations identify system abnormalities that may occur during development, testing, and production.
Chaos Monkey
Chaos Monkey is an open-source chaos tool originally created by Netflix developers. It was developed to help test their system's reliability and resiliency after the move to the AWS cloud. The software works by launching continuous, unpredictable attacks. Chaos Monkey takes the basic approach of terminating one or more virtual machine instances.
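The idea of unpredictable instance termination can be illustrated with a small simulation. This is a sketch only, under stated assumptions: the instance names, the `pick_victim` and `simulate` functions, and the stand-in for an auto-scaler are all hypothetical, and none of it reflects Chaos Monkey's actual code, configuration, or cloud API calls.

```python
import random

# Toy simulation only: these names are hypothetical and do not reflect
# Chaos Monkey's actual code, configuration, or cloud API calls.


def pick_victim(instances, probability, rng):
    """Return one random instance to terminate, or None on a quiet round."""
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(sorted(instances))


def simulate(rounds, probability, seed=0):
    """Run several rounds in which an instance may be terminated at random."""
    rng = random.Random(seed)
    instances = {"vm-a", "vm-b", "vm-c"}
    terminated = []
    next_id = 0
    for _ in range(rounds):
        victim = pick_victim(instances, probability, rng)
        if victim is None:
            continue
        instances.remove(victim)            # the "attack": kill the instance
        terminated.append(victim)
        instances.add(f"vm-new-{next_id}")  # stand-in for an auto-scaler
        next_id += 1
    return instances, terminated


instances, terminated = simulate(rounds=20, probability=0.3)
print(f"terminated {len(terminated)} instances; {len(instances)} still running")
```

Because every terminated instance is replaced in this model, the fleet size stays constant. A real system would only behave this way if it had been built to recover from such losses, which is the property this kind of attack is meant to verify.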
Gremlin
Gremlin is the first hosted Chaos Engineering service designed to improve web-based reliability. Offered as Software-as-a-Service (SaaS), Gremlin tests system resiliency using one of three attack modes. Users provide inputs about their system to determine which type of attack will yield the best results. Attacks can also be run in combination with one another for a comprehensive assessment of the infrastructure.
Litmus
Litmus is an open-source Chaos Engineering platform designed for cloud-native infrastructures and applications. It helps teams identify system weaknesses and potential outages by running controlled chaos tests. Litmus uses a cloud-native approach to closely control and manage chaos.