Question 1

What is Chaos Engineering?

Accepted Answer

The discipline of injecting controlled faults into production systems to validate that the architecture and operational processes respond as expected. It was born at Netflix with Chaos Monkey and is now standard practice in enterprises with critical workloads. The premise: the only way to know whether your disaster recovery (DR) works is to test it.

Question 2

What is AWS Fault Injection Service (FIS)?

Accepted Answer

The AWS-native Chaos Engineering service. It lets you inject controlled faults (terminate EC2 instances, exhaust CPU or memory, add network latency, simulate an AZ outage, make API calls fail or throttle, fail Aurora or RDS components) with a controllable blast radius and automatic stop conditions tied to CloudWatch alarms. It can replace external tooling such as Gremlin or Chaos Monkey with a service integrated into IAM, CloudWatch, and the rest of AWS.
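As an illustration of what "controllable blast radius" and "automatic abort criteria" look like in practice, here is a minimal sketch that builds the request body for an FIS experiment template. The role ARN, alarm ARN, account ID, the `chaos-ready` tag, and the helper name `build_fis_template` are all hypothetical placeholders, not values from any real program. The dictionary keys follow the FIS CreateExperimentTemplate request shape as we understand it; check the current AWS documentation before relying on them.

```python
import json


def build_fis_template(role_arn: str, alarm_arn: str, percent: int = 25) -> dict:
    """Build a request body for an FIS experiment template (illustrative values)."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "description": "Terminate a share of tagged EC2 instances",
        "roleArn": role_arn,
        # Abort criterion: the experiment stops if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Blast radius: only instances with this (hypothetical) tag,
        # and only a percentage of them.
        "targets": {
            "app-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "app-instances"},
            }
        },
    }


# Placeholder ARNs for illustration only.
template = build_fis_template(
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-5xx-rate",
)
print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a body like this could presumably be passed as `boto3.client("fis").create_experiment_template(**template)`. The sketch stops at building the dictionary so that it runs without AWS access.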

Question 3

Is it safe to do Chaos Engineering in production?

Accepted Answer

Yes, when done well. The keys: a small, controllable blast radius at the start, automatic abort criteria in CloudWatch, real-time full-stack observability, and running each experiment in pre-prod before prod. Caleidos starts with small experiments and scales up as the team gains confidence.
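The four keys above can be encoded as a pre-flight check that runs before any experiment starts. This is only a sketch: the `ExperimentPlan` fields, the `preflight_problems` helper, and the 10% production cap are assumptions made for illustration, not a description of Caleidos tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    name: str
    environment: str                 # "pre-prod" or "prod"
    target_percent: int              # share of resources the fault may touch
    stop_alarm_arns: list = field(default_factory=list)
    dashboards_ready: bool = False   # real-time observability in place
    passed_in_preprod: bool = False  # same experiment already run in pre-prod


def preflight_problems(plan: ExperimentPlan, max_prod_percent: int = 10) -> list:
    """Return the reasons an experiment should not run yet (empty list = go)."""
    problems = []
    if not plan.stop_alarm_arns:
        problems.append("no automatic abort criteria configured")
    if not plan.dashboards_ready:
        problems.append("observability dashboards not ready")
    if plan.environment == "prod":
        if not plan.passed_in_preprod:
            problems.append("experiment has not passed in pre-prod")
        if plan.target_percent > max_prod_percent:
            problems.append(
                f"blast radius {plan.target_percent}% exceeds "
                f"the {max_prod_percent}% production cap"
            )
    return problems


# Hypothetical plan: too wide for production and never tried in pre-prod.
risky = ExperimentPlan(
    name="terminate-web-tier",
    environment="prod",
    target_percent=50,
    stop_alarm_arns=["arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-5xx-rate"],
    dashboards_ready=True,
)
for problem in preflight_problems(risky):
    print("BLOCKED:", problem)
```

Raising the cap gradually as experiments pass is one way to express "scale as the team gains confidence" in code.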

Question 4

What is a GameDay?

Accepted Answer

A simulated incident exercise with the full on-call team. The day and time are announced but the scenario is not; participants respond as if it were real (communication, escalation, runbook execution, failover). Response time, communication quality, and real RTO are measured. It ends with a blameless postmortem and an improvement plan.
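To make "real RTO" concrete, the measurement can be as simple as subtracting timestamps recorded during the exercise. The sketch below assumes a hypothetical timeline with four named events and a hypothetical 30-minute RTO target; the event names and the `measure_gameday` helper are illustrative, not part of any standard.

```python
from datetime import datetime, timedelta


def measure_gameday(events: dict) -> dict:
    """Derive response metrics from GameDay timestamps (all measured from fault injection)."""
    start = events["fault_injected"]
    return {
        "time_to_acknowledge": events["acknowledged"] - start,
        "time_to_failover_start": events["failover_started"] - start,
        "real_rto": events["service_restored"] - start,
    }


# Hypothetical timeline captured by the GameDay facilitator.
timeline = {
    "fault_injected": datetime(2024, 3, 14, 10, 0),
    "acknowledged": datetime(2024, 3, 14, 10, 6),
    "failover_started": datetime(2024, 3, 14, 10, 19),
    "service_restored": datetime(2024, 3, 14, 10, 42),
}
target_rto = timedelta(minutes=30)  # assumed objective, for illustration

metrics = measure_gameday(timeline)
rto_met = metrics["real_rto"] <= target_rto
print("Real RTO:", metrics["real_rto"], "| target met:", rto_met)
```

A gap between the measured value and the documented target is exactly the kind of finding the postmortem and improvement plan would address.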

Question 5

Do I need to have multi-region to do Chaos Engineering?

Accepted Answer

No. Chaos Engineering is valuable in any architecture, and a single-region, multi-AZ setup benefits too. It delivers the most value for clients with multi-region deployments because it validates the operational processes of failover, which is where DR plans most often fail in practice. Learn about the related service at /en/services/multi-region.

Question 6

How often is it done?

Accepted Answer

A typical Caleidos program: continuous automated experiments (resilience smoke tests on each deploy), quarterly GameDays with the full on-call team, and an annual full DR exercise. The frequency adapts to workload criticality.

Question 7

Accepted Answer

Caleidos Lens© 24×7 operates the Chaos Engineering program as part of continuous AIOps and SecOps. Findings from each GameDay feed continuous improvement of the operated platform.

The only way to know your DR works is to test it

What you get with Caleidos

AWS FIS implemented

Recurring GameDays

Real RTO and RPO metrics

Resilience culture

How we work

Resilience Assessment

Hypothesis and blast radius

Execution with AWS FIS

GameDay and postmortem

Quarterly iteration

Chaos Engineering Program

Tech stack

What we get asked the most
