Chaos engineering is the discipline of experimenting on a system by deliberately introducing failures, so that you learn how it actually behaves under stress instead of assuming it is resilient. The goal is not to break things for fun; it is to build confidence that the system can withstand turbulent conditions in production.
AWS provides AWS Fault Injection Service (FIS) to run these experiments safely, with stop conditions wired to CloudWatch alarms so an experiment aborts automatically if it goes too far. Always start in a sandbox, keep the blast radius small, and graduate to production only once you trust the controls.
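As a rough illustration of a stop condition wired to a CloudWatch alarm, the sketch below builds an FIS experiment template as a plain Python dictionary and makes no AWS call. The function name, the `chaos-ready` tag, the alarm and role ARNs, and the five-minute duration are all hypothetical placeholders chosen for the example, not values from any real account.

```python
import json


def build_experiment_template(alarm_arn: str, role_arn: str) -> dict:
    """Build an FIS experiment template as a plain dict (no AWS call is made)."""
    return {
        "description": "Stop one tagged sandbox instance for five minutes",
        "roleArn": role_arn,
        # Stop condition: FIS halts the experiment if this alarm enters ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn},
        ],
        # Small blast radius: a single instance, picked by a hypothetical tag.
        "targets": {
            "sandboxInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},
                "selectionMode": "COUNT(1)",
            },
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instance automatically after five minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                # Points at the target defined above by its name.
                "targets": {"Instances": "sandboxInstance"},
            },
        },
        "tags": {"environment": "sandbox"},
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    template = build_experiment_template(
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate",
        role_arn="arn:aws:iam::111122223333:role/FisExperimentRole",
    )
    print(json.dumps(template, indent=2))
```

Assuming boto3 is installed and the role and alarm actually exist, a dictionary of this shape could be passed to `boto3.client("fis").create_experiment_template(**template)`. Keeping the template as data first makes it easy to review the target selection and stop conditions before anything runs.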
AWS Fault Injection Service (FIS): The managed AWS service for running controlled chaos engineering experiments. It injects faults such as stopping instances, adding latency, failing over databases, or isolating an Availability Zone, and it has built-in stop conditions.
Steady state: A measurable definition of a system operating normally (such as p99 latency, error rate, or throughput) that a chaos experiment uses as its baseline and watches to decide whether a fault caused real impact.
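A baseline check of this kind can be sketched in a few lines. The example below is a minimal illustration, not a prescribed method: the `SteadyState` class, the function names, and the thresholds in the usage note are assumptions made for the example, and the p99 uses the simple nearest-rank definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical baseline: upper bounds that define 'operating normally'."""
    max_p99_latency_ms: float
    max_error_rate: float


def p99(latencies_ms: list[float]) -> float:
    """Return the 99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, which avoids floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def within_steady_state(
    latencies_ms: list[float],
    errors: int,
    requests: int,
    baseline: SteadyState,
) -> bool:
    """Return True if the observed window stays inside the baseline bounds."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    error_rate = errors / requests
    return (
        p99(latencies_ms) <= baseline.max_p99_latency_ms
        and error_rate <= baseline.max_error_rate
    )
```

With a baseline such as `SteadyState(max_p99_latency_ms=250.0, max_error_rate=0.01)`, the same check can run before the fault (to confirm the system starts healthy) and during it (to decide whether the fault caused real impact).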
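Blast radius: The scope of impact when a security incident occurs, measured by how many resources, accounts, or users are affected. A smaller blast radius means a better security posture. In a chaos experiment, the same term describes how much of the system an injected fault can touch.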
Static stability: A resilience property where a workload keeps operating during a failure using resources it already has, instead of relying on launching or changing resources during the failure (which is when control planes are most likely to be impaired).
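One common way to reason about this property is capacity arithmetic: provision enough instances up front that losing a whole Availability Zone still leaves enough capacity for peak load, so nothing has to be launched during the failure. The sketch below assumes an evenly spread fleet and a single-AZ failure; the function name and parameters are hypothetical.

```python
def statically_stable_capacity(peak_instances: int, az_count: int) -> tuple[int, int]:
    """Return (instances per AZ, total instances) that survive the loss of one AZ.

    Assumes load is spread evenly and that only one AZ fails at a time.
    """
    if az_count < 2:
        raise ValueError("need at least two Availability Zones")
    if peak_instances < 1:
        raise ValueError("peak_instances must be at least 1")
    # The surviving (az_count - 1) zones must carry the full peak on their own.
    per_az = -(-peak_instances // (az_count - 1))  # integer ceiling division
    return per_az, per_az * az_count


if __name__ == "__main__":
    per_az, total = statically_stable_capacity(peak_instances=6, az_count=3)
    print(f"{per_az} per AZ, {total} in total")
```

For a peak of 6 instances across 3 zones this gives 3 per zone and 9 in total, which is 50% over peak. The extra capacity is the price of not depending on a control plane that may itself be impaired during the failure.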