The only way to know your DR works is to test it
A Chaos Engineering program built on AWS Fault Injection Service and quarterly GameDays. We run controlled experiments in production to validate that your multi-region architecture and operational processes respond when it matters.
Many enterprises that say they have disaster recovery discover on incident day that their runbooks are outdated, the on-call team has not been trained, or the downed region's control plane blocks the failover. Chaos Engineering is the discipline of validating resilience by injecting controlled faults in production, safely and observably. Caleidos operates the program with AWS Fault Injection Service (FIS), recurring GameDays, and real RTO/RPO metrics measured against defined SLOs. It is the natural complement to our multi-region service (/en/services/multi-region).
What you get with Caleidos
AWS FIS implemented
AWS Fault Injection Service is the official AWS service for injecting controlled faults: terminate instances, degrade the network, exhaust CPU or memory, simulate an AZ or Region outage. It can replace external tooling (Gremlin, Chaos Monkey) with a native AWS service.
Recurring GameDays
Quarterly exercises with the full on-call. We shut down critical components under controlled conditions and observe how the architecture and processes respond. Each GameDay leaves improved runbooks and a better-trained on-call.
Real RTO and RPO metrics
We validate declared SLOs against real exercise results. Continuity becomes an auditable number, not an architectural promise.
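As a rough illustration of how exercise results can become auditable numbers, the sketch below derives a measured RTO and RPO from three timestamps recorded during an exercise and compares them with declared targets. The field names, timestamps, and targets are hypothetical and are not taken from Caleidos tooling or client data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ExerciseRecord:
    """Timestamps captured during one exercise (hypothetical field names)."""
    last_good_data_at: datetime    # last write confirmed recoverable after failover
    fault_injected_at: datetime    # moment the fault was injected
    service_restored_at: datetime  # moment the service passed health checks again


def measured_rto(rec: ExerciseRecord) -> timedelta:
    # Recovery time: from fault injection until the service is healthy again.
    return rec.service_restored_at - rec.fault_injected_at


def measured_rpo(rec: ExerciseRecord) -> timedelta:
    # Data-loss window: from the last recoverable write until the fault.
    return max(rec.fault_injected_at - rec.last_good_data_at, timedelta(0))


def compare_with_slo(rec: ExerciseRecord, rto_target: timedelta,
                     rpo_target: timedelta) -> dict:
    rto, rpo = measured_rto(rec), measured_rpo(rec)
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_ok": rto <= rto_target,
        "rpo_ok": rpo <= rpo_target,
    }


# Illustrative values only.
record = ExerciseRecord(
    last_good_data_at=datetime(2024, 1, 10, 9, 59, 30),
    fault_injected_at=datetime(2024, 1, 10, 10, 0, 0),
    service_restored_at=datetime(2024, 1, 10, 10, 12, 0),
)
result = compare_with_slo(record,
                          rto_target=timedelta(minutes=15),
                          rpo_target=timedelta(seconds=60))
print(result)
```

Recording the same three timestamps in every exercise makes the results comparable from one quarter to the next.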
Resilience culture
We support the cultural change: blameless postmortems, living runbooks, architecture ready to fail gracefully. Resilience becomes the way of operating.
How we work
Resilience Assessment
We map critical workloads, dependencies, and declared SLOs (RTO, RPO, availability). We identify the priority chaos experiments.
Hypothesis and blast radius
For each experiment we define the hypothesis (what we expect to happen), the blast radius (how much it can affect), and the abort criteria. Experiment safety is the priority.
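One way to make the hypothesis, blast radius, and abort criteria explicit before anything runs is to write them down as data. The following is a minimal sketch under assumed names: `ExperimentPlan`, `AbortCriterion`, the metric names, and the 10% default ceiling are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AbortCriterion:
    metric: str       # hypothetical metric name, e.g. "error_rate_pct"
    threshold: float  # abort when the observed value exceeds this


@dataclass
class ExperimentPlan:
    name: str
    hypothesis: str           # what we expect to happen
    blast_radius_pct: float   # share of the fleet the fault may touch
    abort_criteria: list[AbortCriterion] = field(default_factory=list)

    def problems(self, max_blast_radius_pct: float = 10.0) -> list[str]:
        """Return reasons the plan is not ready to run (empty list = ready)."""
        found = []
        if not self.hypothesis.strip():
            found.append("hypothesis is empty")
        if not 0 < self.blast_radius_pct <= max_blast_radius_pct:
            found.append("blast radius outside allowed range")
        if not self.abort_criteria:
            found.append("no abort criteria defined")
        return found

    def should_abort(self, observed: dict[str, float]) -> bool:
        """Abort if any threshold is exceeded or a metric is missing (fail safe)."""
        for criterion in self.abort_criteria:
            value = observed.get(criterion.metric)
            if value is None or value > criterion.threshold:
                return True
        return False


plan = ExperimentPlan(
    name="stop-one-api-instance",
    hypothesis="Stopping one API instance keeps p99 latency under 800 ms.",
    blast_radius_pct=5.0,
    abort_criteria=[
        AbortCriterion("p99_latency_ms", 800.0),
        AbortCriterion("error_rate_pct", 2.0),
    ],
)
print(plan.problems())
print(plan.should_abort({"p99_latency_ms": 420.0, "error_rate_pct": 0.3}))
```

Treating a missing metric as a reason to abort is a deliberate choice in this sketch: if observability is lost mid-experiment, stopping is the safer default.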
Execution with AWS FIS
We run the experiments with AWS Fault Injection Service in pre-prod or controlled production environments. Full-stack observability in real time.
GameDay and postmortem
Sessions with the on-call to execute the full scenario (simulated incident, communication, failover, recovery). Blameless postmortem and improvement plan.
Quarterly iteration
The program is continuous. Each quarter we run new experiments, update runbooks, and improve operational processes.
Chaos Engineering Program
AWS FIS + quarterly GameDays
Implementation of the Chaos Engineering program for clients with critical multi-region architecture. Real DR validation, continuous runbook improvement, and on-call training with AWS FIS.
What we get asked the most
What is Chaos Engineering?
The discipline of injecting controlled faults in production systems to validate that the architecture and operational processes respond as expected. It originated at Netflix with Chaos Monkey and is now common practice in enterprises with critical workloads. The premise: the only way to know if your DR works is to test it.
What is AWS Fault Injection Service (FIS)?
The AWS-native Chaos Engineering service. It lets you inject controlled faults (stop or terminate EC2 instances, stress CPU or memory, add network latency, simulate an AZ outage, inject API errors or throttling, force an Aurora cluster failover or reboot RDS instances) with a controllable blast radius and automatic stop conditions tied to CloudWatch alarms. It can replace external tooling like Gremlin or Chaos Monkey with a service integrated into IAM, CloudWatch, and the rest of AWS.
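To make the blast radius and automatic abort more concrete, here is a hedged sketch of what an FIS experiment template can look like. It only builds the request as a Python dictionary and makes no AWS call. The ARNs, tag values, and names are hypothetical placeholders. It uses the `aws:ec2:stop-instances` action rather than termination so the fault is reversible, which is a choice of this example and not something the program prescribes.

```python
import json


def build_fis_template(role_arn: str, alarm_arn: str, tag_key: str,
                       tag_value: str, instance_count: int = 1) -> dict:
    """Build a request body shaped like FIS CreateExperimentTemplate input."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    target_name = "tagged-instances"  # hypothetical target label
    return {
        "description": "Stop a small number of tagged EC2 instances",
        "roleArn": role_arn,
        # Abort automatically when this CloudWatch alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Blast radius: only instances with this tag, and only N of them.
        "targets": {
            target_name: {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": f"COUNT({instance_count})",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instances automatically after five minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": target_name},
            }
        },
        "tags": {"Purpose": "chaos-experiment"},
    }


# Placeholder ARNs for illustration only.
template = build_fis_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:api-error-rate-high",
    tag_key="ChaosReady",
    tag_value="true",
)
print(json.dumps(template, indent=2))
```

With credentials and permissions in place, a dictionary like this could be passed to `boto3.client("fis").create_experiment_template(**template)`. That call is left out here so the sketch runs without an AWS account.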
Is it safe to do Chaos Engineering in production?
Yes, when done well. The keys: a small and controllable blast radius at the start, automatic stop conditions backed by CloudWatch alarms, full-stack observability in real time, and pre-prod before prod. Caleidos starts with small experiments and scales as the team gains confidence.
What is a GameDay?
A simulated incident exercise with the full on-call. The day and time are announced but not the scenario; participants respond as if it were real (communication, escalation, runbook execution, failover). Response time, communication quality, and real RTO are measured. It ends with a blameless postmortem and an improvement plan.
Do I need to have multi-region to do Chaos Engineering?
No. Chaos Engineering is valuable in any architecture, including a single region with multiple AZs. It delivers the most value for multi-region clients because it validates the operational failover processes, which is where DR plans most often break down in practice. Learn about the related service at /en/services/multi-region.
How often is it done?
A typical Caleidos program: continuous automated experiments (resilience smoke tests on each deploy), quarterly GameDays with the full on-call, and an annual full DR exercise. The frequency adapts to workload criticality.
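A resilience smoke test on each deploy could be wired in as a pipeline gate that waits for an experiment to finish and blocks the release unless it completed cleanly. The sketch below is one possible shape, not a description of the Caleidos pipeline: the function names are hypothetical, the status source is injected so the example runs without AWS, and the set of terminal statuses is an assumption to be checked against the FIS documentation.

```python
import time
from typing import Callable

# Assumed terminal experiment statuses; verify against the FIS API reference.
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(get_status: Callable[[], str],
                        poll_seconds: float = 10.0,
                        timeout_seconds: float = 900.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the experiment reaches a terminal status or the timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            return "timeout"
        sleep(poll_seconds)
        waited += poll_seconds


def deploy_gate(final_status: str) -> bool:
    """Let the deploy proceed only if the experiment ran to completion.

    "stopped" usually means a stop condition fired, so it blocks the deploy.
    """
    return final_status == "completed"


# Simulated status sequence standing in for real polling.
fake_statuses = iter(["initiating", "running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(fake_statuses),
                                   sleep=lambda seconds: None)
print(final_status, deploy_gate(final_status))
```

In a real pipeline, `get_status` could wrap `boto3.client("fis").get_experiment(id=...)` and read `["experiment"]["state"]["status"]` from the response.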
How does it relate to Caleidos Lens©?
Caleidos Lens© 24×7 operates the Chaos Engineering program as part of continuous AIOps and SecOps. Findings from each GameDay feed continuous improvement of the operated platform.
Ready to get started?
Tell us about your challenge. No pitch, no commitment. Just understanding.
Free resilience diagnostic