AWS Fault Injection Service (FIS) is the managed service for running chaos engineering experiments on AWS. You define an experiment template with three parts: targets (the resources to affect), actions (the faults to inject), and stop conditions (CloudWatch alarms that abort the experiment if it crosses a safety threshold).
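As a rough illustration of how the three parts fit together, the sketch below builds the request body for the FIS CreateExperimentTemplate API as a plain Python dictionary. The target name, the `chaos-ready` tag, the role ARN, and the alarm ARN are hypothetical placeholders, not values from this article.

```python
import json


def build_experiment_template(role_arn: str, alarm_arn: str) -> dict:
    """Return keyword arguments shaped for the FIS CreateExperimentTemplate API."""
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,  # IAM role that FIS assumes to inject the fault
        # Targets: the resources the experiment is allowed to affect.
        "targets": {
            "one-tagged-instance": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        # Actions: the fault to inject, wired to a target by name.
        "actions": {
            "stop-instance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "one-tagged-instance"},
            }
        },
        # Stop conditions: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    template = build_experiment_template(
        role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-error-rate",
    )
    print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("fis").create_experiment_template(**template)`.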
Commonly used actions include:
- aws:ec2:stop-instances and aws:ec2:terminate-instances for compute failure
- aws:ssm:send-command with the AWSFIS-Run documents for CPU, memory, and network stress
- aws:network:disrupt-connectivity for Availability Zone and dependency isolation
- aws:rds:failover-db-cluster and aws:dynamodb:global-table-pause-replication for data-tier faults
- aws:ecs:stop-task and the Lambda invocation actions for containers and serverless

FIS has no native scheduler; recurring experiments are driven by Amazon EventBridge Scheduler or a CI/CD pipeline calling the StartExperiment API.
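A pipeline step that calls StartExperiment might look roughly like the sketch below. It assumes a boto3 FIS client is passed in and that the template already exists; the function name `run_experiment`, the polling interval, and the poll limit are illustrative choices, not part of FIS.

```python
import time
import uuid

# Experiment statuses after which FIS will not change the state again.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id: str,
                   poll_seconds: float = 15.0, max_polls: int = 240) -> str:
    """Start an experiment from a template and return its final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for the request
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    # Poll until the experiment reaches a terminal status or we give up.
    for _ in range(max_polls):
        current = fis_client.get_experiment(id=experiment_id)
        status = current["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```

In a CI/CD job this could be called as `run_experiment(boto3.client("fis"), template_id)`, failing the build unless the result is `"completed"`. A status of `"stopped"` typically means a stop condition fired or someone halted the run manually.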
Chaos engineering: The practice of deliberately injecting failures into a system to discover weaknesses before they cause outages, by forming a hypothesis about steady state and testing it under real-world fault conditions.
Steady state: A measurable definition of a system operating normally (such as p99 latency, error rate, or throughput) that a chaos experiment uses as its baseline and watches to decide whether a fault caused real impact.
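A steady-state check can be sketched in a few lines. The example below summarizes a window of request samples into p99 latency and error rate, then compares an observed window against a baseline. The tolerance values (20% latency headroom, 1% error rate) are illustrative assumptions; real thresholds depend on the workload.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SteadyState:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def measure(latencies_ms: Sequence[float], error_count: int) -> SteadyState:
    """Summarize a window of requests: nearest-rank p99 and error fraction."""
    if not latencies_ms:
        raise ValueError("need at least one request sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest-rank position
    return SteadyState(ordered[rank - 1], error_count / len(ordered))


def within_steady_state(baseline: SteadyState, observed: SteadyState,
                        latency_tolerance: float = 1.2,
                        max_error_rate: float = 0.01) -> bool:
    """True if the observed window still counts as normal operation."""
    return (observed.p99_latency_ms <= baseline.p99_latency_ms * latency_tolerance
            and observed.error_rate <= max_error_rate)
```

Measuring the baseline before the fault and the observed window during it turns the experiment's hypothesis into a pass or fail result.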
Static stability: A resilience property where a workload keeps operating during a failure using resources it already has, instead of relying on launching or changing resources during the failure (which is when control planes are most likely to be impaired).
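One common way to apply this idea, offered here as an illustration and not something the definition above prescribes, is to pre-provision enough capacity in each Availability Zone that the surviving zones can carry peak load without launching anything new. The helper name and the sizing rule below are assumptions for the sketch.

```python
import math


def instances_per_az(peak_instances_needed: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs still cover peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive the failure")
    # Peak load is spread over the surviving AZs only; round up to be safe.
    return math.ceil(peak_instances_needed / surviving)


if __name__ == "__main__":
    per_az = instances_per_az(peak_instances_needed=12, az_count=3)
    print(f"Run {per_az} instances in each of 3 AZs ({per_az * 3} total)")
```

For a peak need of 12 instances across 3 zones, this gives 6 per zone (18 total), so losing any one zone still leaves 12 running. An aws:network:disrupt-connectivity experiment is one way to test whether that headroom holds in practice.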