Chaos Engineering


As the name suggests, chaos engineering is the practice of deliberately introducing chaos into a system to make it more resilient and robust to the failures that might happen in production. It is about accepting that chaos is inevitable in complex systems and building solutions for it ahead of time. You can also think of chaos engineering as a vaccine and of engineers as the white blood cells (WBCs) that make the body (here, the system) more immune to potential dangers.

Why chaos engineering?


Most of us have come across news headlines about popular services going down for a few hours.

A few hours might not seem like a big deal to us, but an outage deals a heavy blow to large enterprises that run their business across the globe. It not only hurts them financially, it also stains their reputation as a company that cannot cater to the needs of its customers. As technology has progressed, the world has grown increasingly complex, and these failures have become much harder to predict. This is why companies need ways to tackle these problems.

How chaos engineering solves our problems

Chaos engineering works on the following principles:


  1. Plan an experiment

    This involves forming a hypothesis about what could go wrong under real conditions.

  2. Run experiment

    Execute the test at the smallest scale that will still teach you something. The blast radius, which measures the total impact of the fault, should be kept to a minimum.

  3. Verify

    Increase or decrease the blast radius as the test requires, up to full scale. Analyze the behavior of the system at each step.

  4. Improve

    Make improvements or introduce solutions that make the system more robust and immune to failure.

  5. Steady State

    Once the improvements are in place and the system has returned to a steady state, inject a new failure or keep repeating the same experiments.

When the test is over, we have a better understanding of our system's capabilities and flaws.
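The steps above can be sketched as a small simulation. This is a minimal sketch and not any real tool's API: the toy service `handle_request`, the helpers `run_experiment` and `escalate`, and the hypothesis threshold of a 0.1% error rate are all assumptions made up for illustration. The blast radius is modeled as the fraction of requests that receive the injected fault.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    blast_radius: float    # fraction of requests that received the fault
    error_rate: float      # observed fraction of failed requests
    hypothesis_held: bool  # True if the error rate stayed within the limit


def handle_request(fault_injected: bool, fallback_enabled: bool) -> bool:
    """Toy service: a faulted request succeeds only if a fallback exists."""
    if not fault_injected:
        return True
    return fallback_enabled


def run_experiment(blast_radius, fallback_enabled,
                   max_error_rate=0.001, requests=10_000, seed=0):
    """Inject the fault into a fraction of requests and check the hypothesis."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(requests):
        fault = rng.random() < blast_radius
        if not handle_request(fault, fallback_enabled):
            failures += 1
    error_rate = failures / requests
    return ExperimentResult(blast_radius, error_rate,
                            error_rate <= max_error_rate)


def escalate(fallback_enabled, radii=(0.01, 0.05, 0.25, 1.0)):
    """Widen the blast radius step by step; stop once the hypothesis fails."""
    results = []
    for radius in radii:
        result = run_experiment(radius, fallback_enabled)
        results.append(result)
        if not result.hypothesis_held:
            break  # fix the system before going any wider
    return results
```

Calling `escalate(fallback_enabled=False)` stops after the smallest blast radius, because the toy service has no way to absorb the fault. After the "improvement" (`fallback_enabled=True`), the same experiment runs through every radius up to full scale.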

History of chaos engineering


  • In 2010, chaos engineering was first used by the Netflix Engineering Tools team, which created Chaos Monkey in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services and the need to be sure that the loss of an Amazon instance wouldn’t affect the Netflix streaming experience.
  • In 2011, the Simian Army was born. The Simian Army added failure injection modes on top of Chaos Monkey that allowed testing of a more complete suite of failure states, and thus building resilience to those as well. “The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system” (Netflix, 2011).
  • In 2012, Netflix shared the source code for Chaos Monkey on GitHub, saying that they “have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient” (Netflix, 2012).

This paved the way for more organizations to use chaos engineering to strengthen their reliability.

Tools for chaos engineering

Various tools can help in conducting a chaos engineering experiment. A few of them are listed below:

  • Chaos Mesh

Chaos Mesh is an open-source, cloud-native tool designed specifically for Chaos Engineering. Using various fault simulations, Chaos Mesh helps organizations find system abnormalities that may occur at various points in the development, testing, and production stages.

  • Chaos Monkey

Chaos Monkey is an open-source chaos tool originally created by Netflix developers. It was developed to help test their system reliability and resiliency after moving to the AWS cloud. The software works by launching continuous, unpredictable attacks. Its basic approach is to terminate one or more virtual machine instances.

  • Gremlin

Gremlin is the first hosted Chaos Engineering service designed to improve web-based reliability. Offered as SaaS (Software-as-a-Service), Gremlin tests system resiliency using one of three attack modes. Users provide system inputs to determine which type of attack will give the best results. Tests can be run in conjunction with one another for a comprehensive assessment of the infrastructure.

  • Litmus

Litmus is an open-source Chaos Engineering platform designed for cloud-native infrastructures and applications. It helps teams identify system deficiencies and outages by performing controlled chaos tests. Litmus uses a cloud-native approach to closely control and manage chaos.
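To make the instance-termination approach described for Chaos Monkey more concrete, here is a toy simulation. It does not use Chaos Monkey's actual code or configuration: the `Cluster` class, the `unleash_monkey` function, the instance ids, and the `min_healthy` threshold are all hypothetical names chosen for this sketch.

```python
import random


class Cluster:
    """Hypothetical pool of identical instances behind a load balancer."""

    def __init__(self, instance_ids, min_healthy=1):
        self.running = set(instance_ids)
        self.min_healthy = min_healthy  # instances needed to serve traffic

    def terminate(self, instance_id):
        self.running.discard(instance_id)

    def is_available(self):
        return len(self.running) >= self.min_healthy


def unleash_monkey(cluster, kills=1, rng=None):
    """Terminate up to `kills` randomly chosen running instances."""
    rng = rng or random.Random()
    candidates = sorted(cluster.running)  # sorted for reproducible sampling
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        cluster.terminate(victim)
    return victims


cluster = Cluster(["i-1", "i-2", "i-3"], min_healthy=2)
unleash_monkey(cluster, kills=1, rng=random.Random(42))
print(cluster.is_available())  # True: two of the three instances remain
```

With three instances and two required, the cluster survives one random termination. A second call would leave it below `min_healthy`, which is the kind of weakness such an experiment is meant to surface before a real outage does.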
