Embrace the Chaos! Exploring the World of Chaos Engineering


In an age of ubiquitous cloud services, elastically scalable resources, and infrastructure that spreads across geographic zones for higher availability, chaos engineering offers a new approach to performance and availability testing.


We explored this modern approach in Gixer Labs, CoStrategix’s innovation lab. The goal of our experiment was to better understand how the discipline of chaos engineering differs from more traditional performance and availability testing. In our lab, we executed a series of tests on an existing application using the principles of chaos engineering.

What is Chaos Engineering?

Chaos engineering is a discipline that aims to proactively identify weaknesses in distributed systems by subjecting them to controlled experiments that simulate real-world failures in a progressive and/or targeted manner. In doing so, you can gain insights into how the application reacts to various conditions it may encounter in production.

Implementing chaos engineering requires a systematic approach and the right tools. You can use chaos engineering platforms such as Netflix Chaos Monkey, Gremlin, or Microsoft Azure Chaos Studio to automate experiments and inject controlled failures into your systems. By running GameDays and Chaos Days, teams can simulate catastrophic events and validate their system’s resilience in a safe environment. It’s essential to start small, gradually increase complexity, and iterate on experiments based on what you learn.

The core principles of chaos engineering include:

  • Defining a steady state
  • Introducing chaos
  • Measuring the impact
  • Automating experiments
  • Minimizing the “blast radius”
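To make these principles concrete, here is a minimal Python sketch of a single experiment: check the steady state, introduce a fault, measure the impact, and always remove the fault afterward to keep the blast radius small. Everything in it is an illustrative assumption rather than part of any chaos engineering platform: the `SteadyState` class, the `run_experiment` function, and the latency-based definition of "steady" are hypothetical names and choices.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class SteadyState:
    """Hypothetical steady-state definition based on mean latency."""
    baseline_ms: float
    tolerance: float  # 0.2 means "up to 20% above baseline is acceptable"

    def holds(self, samples_ms: Sequence[float]) -> bool:
        return mean(samples_ms) <= self.baseline_ms * (1 + self.tolerance)


def run_experiment(
    steady: SteadyState,
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    measure: Callable[[], Sequence[float]],
) -> str:
    """Run one experiment and report whether the steady state survived."""
    # Never start an experiment on a system that is already unhealthy.
    if not steady.holds(measure()):
        return "aborted: not in steady state"
    inject_fault()
    try:
        during = measure()
    finally:
        # Always roll the fault back, even if measuring fails.
        remove_fault()
    return "steady state held" if steady.holds(during) else "weakness found"
```

In a real setup, `inject_fault` and `remove_fault` would call whatever tool you use to start and stop a fault, and `measure` would read from your monitoring system.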

Benefits of Chaos Engineering

The benefits of chaos engineering are manifold. By embracing chaos engineering, you can:

  • Improve system reliability and fault tolerance
  • Reduce downtime and service disruptions
  • Enhance customer satisfaction and trust
  • Identify and mitigate vulnerabilities before they impact users or the organization
  • Foster a culture of resilience and continuous improvement

Our Chaos Engineering Case Study

For our case study, we chose to apply chaos engineering principles to an internal product, and we selected Azure Chaos Studio as our tool. With a very short learning curve, we were able to begin targeting resources, selecting from a list of simulation conditions, and creating chaos.

Because it’s best to start simple and work toward increasingly complex chaos, we started with a straightforward CPU pressure test on one of the VMs where our QObserver process resides. We didn’t impose any additional abnormal conditions on other areas of the architecture (e.g., DB or QCloud). By gradually ramping up the CPU utilization and monitoring system performance, we were able to determine the threshold at which system latency started to degrade. We used this information to adjust our autoscaling policies to obtain the optimal balance between cost efficiency and performance/availability.
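As an illustration of how such a threshold could be read off a ramp test, the Python sketch below scans (CPU %, latency) samples for the first CPU level at which latency exceeds the baseline by more than an allowed margin. The function name `find_latency_threshold`, the 25% margin, and the sample numbers are all hypothetical; they are not measurements from our lab.

```python
from typing import Iterable, Optional, Tuple


def find_latency_threshold(
    samples: Iterable[Tuple[float, float]],
    baseline_ms: float,
    max_increase: float = 0.25,
) -> Optional[float]:
    """Return the lowest CPU % whose latency exceeds the allowed limit.

    `samples` holds (cpu_percent, latency_ms) pairs from a ramp test.
    Returns None if latency never exceeds the limit.
    """
    limit_ms = baseline_ms * (1 + max_increase)
    for cpu_percent, latency_ms in sorted(samples):
        if latency_ms > limit_ms:
            return cpu_percent
    return None


# Made-up ramp data: latency stays flat, then climbs under CPU pressure.
ramp = [(10, 100), (30, 102), (50, 108), (70, 121), (80, 140), (90, 210)]
threshold = find_latency_threshold(ramp, baseline_ms=100)
```

A value like this is a natural candidate for an autoscaling trigger: scale out somewhat before the CPU level at which latency begins to degrade.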

Although we did not choose to do so for our case study, we could have also refactored areas of the code or architecture to determine if one-time changes (e.g., implementing caches) might allow us to get more processing headroom without adding infrastructure costs that would be incurred month over month, year after year.

Next, we targeted our client DB by slowly ramping up the number of connections until we started seeing errors related to the maximum DB connections. This allowed us to determine the maximum number of concurrent users that our DB could support with the currently allocated resources.
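The idea behind this ramp can be sketched in Python against a stand-in database. `FakeDatabase`, `ConnectionLimitError`, and `ramp_connections` are hypothetical names invented for this illustration; a real test would open connections through your actual database driver and catch its specific "too many connections" error.

```python
class ConnectionLimitError(Exception):
    """Raised by the stand-in database when its connection limit is hit."""


class FakeDatabase:
    """In-memory stand-in for a database with a fixed connection limit."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> None:
        if self.open_connections >= self.max_connections:
            raise ConnectionLimitError("too many connections")
        self.open_connections += 1

    def close_all(self) -> None:
        self.open_connections = 0


def ramp_connections(db: FakeDatabase, ceiling: int) -> int:
    """Open connections one at a time until the database refuses or
    `ceiling` is reached; return how many were opened successfully."""
    opened = 0
    try:
        while opened < ceiling:
            db.connect()
            opened += 1
    except ConnectionLimitError:
        pass  # This is the limit we were looking for.
    finally:
        db.close_all()  # Clean up so the experiment leaves no residue.
    return opened
```

The `ceiling` argument acts as a safety stop so the ramp cannot run unbounded if the limit is higher than expected.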

Knowing this, we can now make efficiency changes, such as identifying long-running queries that can be improved to free up DB connections more quickly, or adding small amounts of memory and/or CPU so that the size of the connection pool can be increased. We can tweak the various settings until we find the best “bang for the buck.” A better understanding of these limitations also allows our operations team to set monitoring and alerting thresholds more accurately.
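One way to reason about why faster queries free up connections is Little's Law: the average number of connections in use equals the query arrival rate multiplied by the average time each query holds a connection. The short Python sketch below applies it with made-up numbers; the function name `connections_in_use` is hypothetical, and the estimate assumes a steady load.

```python
def connections_in_use(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's Law: average concurrency = arrival rate x average hold time."""
    return queries_per_second * avg_query_seconds


# Made-up example: 200 queries/s that each hold a connection for 0.5 s.
before = connections_in_use(200, 0.5)
# Halving the average query time halves the connections needed.
after = connections_in_use(200, 0.25)
```

Under these assumptions, tuning queries can relieve connection pressure without adding any infrastructure.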

Our Findings

We found that chaos engineering differs from more traditional performance and availability testing primarily in the maturity of the tools and what these tools allow you to do.

Performance and scalability testing has always focused on finding potential stress and/or failure points before they cause problems. The advancement of the chaos engineering tools allows you to focus on more specific areas of your architecture, including very low-level services and/or communication scenarios.

Also, chaos engineering gives you the ability to create templates that can be utilized time and again on different projects. This is an important benefit that ensures key scalability and availability requirements are consistently tested.

Finally, when combined with an effective functional testing practice, chaos engineering helps ensure that your solution meets the functionality, performance, and availability expectations of your customers and users.

Chaos engineering holds tremendous promise for organizations looking to build resilient, reliable, and high-performing systems. By embracing chaos engineering principles and practices, companies can stay ahead of the curve, mitigate risks, and deliver exceptional user experiences. As we navigate an increasingly complex and unpredictable technological landscape, harnessing chaos may just be the key to unlocking innovation and driving success.

CoStrategix is a strategic technology consulting and implementation company that bridges the gap between technology and business teams to build value with digital and data solutions. If you are looking for guidance on data management strategies and how to mature your data analytics capabilities, we can help you leverage best practices to enhance the value of your data. Get in touch!
