Tarek Cheikh
Founder & AWS Cloud Architect
This is Article 1 in the "Chaos Engineering on AWS" series. We deploy a product catalog and order API backed by Aurora, put real data through it, then kill a server and watch what happens to our customers' orders.
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. That definition comes from principlesofchaos.org, and it is worth reading carefully. The key word is confidence. You are not trying to break things for fun. You are running controlled experiments to find out whether your system behaves the way you think it does.
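The source lays out four advanced principles: build a hypothesis around steady-state behavior, vary real-world events, run experiments in production, and automate experiments to run continuously.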
Alongside those four principles, the same source stresses one operational guardrail above all: minimize the blast radius. Start small. Kill one instance, not the whole fleet. Have stop conditions. Be ready to abort. Chaos engineering is not about causing outages; it is about preventing them. This guardrail matters more than people realize. Chaos engineering without it is just breaking things. We will talk more about stop conditions and observability in Article 2 of this series.
Every architecture diagram shows arrows flowing smoothly between boxes. Reality is different. Load balancers take time to detect failed targets. Auto Scaling Groups take time to launch replacements. Health checks have intervals and thresholds that create detection windows. You do not know how long those windows are until you test them.
More importantly, architecture diagrams do not show what happens to in-flight work. If a customer is placing an order when a server dies, does the order go through? Does it get charged but never confirmed? Does the stock decrement without creating a record? These questions only have real answers when you run the experiment.
The point of chaos engineering is to close the gap between what you assume about your system and what is actually true. In this article, we will find a concrete example of that gap.
We are deploying a product catalog and order API called Chaos Shop. This is not a hello-world endpoint that returns instance metadata. It is a small but real application: 10 products in a database, stock tracking, order placement with transactional integrity. Every request that matters goes through Aurora.
Two EC2 instances sit behind an Application Load Balancer, backed by an Aurora Serverless v2 PostgreSQL database. AWS Fault Injection Service (FIS) is configured to stop one instance on demand.
Here is the architecture:
                  Internet
                      |
               +------+------+
               |     ALB     |
               | (us-east-1) |
               +------+------+
                      |
         +------------+------------+
         |                         |
+--------+--------+       +--------+--------+
|  EC2 Instance   |       |  EC2 Instance   |
|  (us-east-1a)   |       |  (us-east-1b)   |
| Chaos Shop API  |       | Chaos Shop API  |
+--------+--------+       +--------+--------+
         |                         |
         +------------+------------+
                      |
            +---------+---------+
            | Aurora Serverless |
            |  v2 (PostgreSQL)  |
            |  Writer instance  |
            +-------------------+

            +-------------------+
            |      AWS FIS      |
            | "Stop 1 instance" |
            +-------------------+
The ALB distributes HTTP traffic across both instances. Each instance runs the Chaos Shop Flask API, which serves product listings, accepts orders, and manages inventory, all backed by Aurora. The instances belong to an Auto Scaling Group (ASG) configured with a minimum of 2 instances, so when FIS stops one, the ASG should launch a replacement.
The database is central here. Unlike a typical chaos engineering tutorial where you kill a server returning static data, our application has state. Products have stock counts. Orders create records and decrement inventory inside a database transaction. When we kill a server, the interesting question is not just "do requests fail?" but "does the data stay consistent?"
This is not a production architecture. There is no HTTPS, no WAF, no private subnets for the ALB. It is a lab environment designed to teach chaos engineering concepts without running up a large bill.
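You will need the following installed and configured: git, Terraform, and the AWS CLI with a profile for the account you want to deploy into, plus curl and python3 for the verification commands used later.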
The full Terraform code is available at github.com/TocConsulting/chaos-on-aws. Clone it before continuing.
git clone https://github.com/TocConsulting/chaos-on-aws.git
cd chaos-on-aws/01-what-is-chaos-engineering/terraform
Each EC2 instance runs a Flask application deployed via userdata. The app has seven endpoints, and almost all of them talk to the database. This is deliberate. We want to test what happens to a stateful application during a failure, not just a stateless proxy.
Here is the application setup: imports, configuration, database connection, and initialization. The API endpoints follow below.
import os
import time

import pg8000
from flask import Flask, jsonify, request

app = Flask(__name__)

INSTANCE_ID = os.environ.get("INSTANCE_ID", "unknown")
AZ = os.environ.get("AZ", "unknown")
PRIVATE_IP = os.environ.get("PRIVATE_IP", "unknown")
DB_ENDPOINT = os.environ.get("DB_ENDPOINT", "")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "")
DB_USERNAME = os.environ.get("DB_USERNAME", "")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")

# Persistent connection per Gunicorn worker process.
# Each sync worker handles one request at a time, so a single
# connection per worker is safe and avoids per-request TCP overhead.
_worker_conn = None


def get_db_connection():
    """Return the worker's persistent DB connection, reconnecting if needed."""
    global _worker_conn
    if _worker_conn is not None:
        try:
            c = _worker_conn.cursor()
            c.execute("SELECT 1")
            c.fetchone()
            c.close()
            return _worker_conn
        except Exception:
            try:
                _worker_conn.close()
            except Exception:
                pass
            _worker_conn = None
    _worker_conn = pg8000.connect(
        host=DB_ENDPOINT,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USERNAME,
        password=DB_PASSWORD,
    )
    _worker_conn.autocommit = True
    return _worker_conn


def init_db():
    """Create tables and seed products if they do not exist."""
    conn = pg8000.connect(
        host=DB_ENDPOINT,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USERNAME,
        password=DB_PASSWORD,
    )
    try:
        conn.autocommit = True
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name VARCHAR(100) NOT NULL UNIQUE,
                price DECIMAL(10,2) NOT NULL,
                stock INTEGER NOT NULL DEFAULT 0
            )
        """)
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS orders (
                id SERIAL PRIMARY KEY,
                product_id INTEGER REFERENCES products(id),
                quantity INTEGER NOT NULL,
                total DECIMAL(10,2) NOT NULL,
                status VARCHAR(20) DEFAULT 'pending',
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        # Seed products with explicit IDs so concurrent instances do not create
        # non-sequential IDs from the SERIAL sequence.
        products = [
            (1, "Wireless Keyboard", 49.99, 100),
            (2, "USB-C Hub", 34.99, 150),
            (3, "Mechanical Keyboard", 129.99, 50),
            (4, "27-inch Monitor", 349.99, 30),
            (5, "Webcam HD", 79.99, 80),
            (6, "Laptop Stand", 44.99, 120),
            (7, "Noise-Canceling Headphones", 199.99, 40),
            (8, "Portable SSD 1TB", 89.99, 90),
            (9, "Ergonomic Mouse", 59.99, 110),
            (10, "USB Microphone", 69.99, 70),
        ]
        for pid, name, price, stock in products:
            cursor.execute(
                "INSERT INTO products (id, name, price, stock) VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (pid, name, price, stock),
            )
        cursor.execute("SELECT setval('products_id_seq', 10)")
        cursor.close()
    finally:
        conn.close()
Two things worth noting here. First, get_db_connection() maintains a persistent connection per Gunicorn worker process. Each sync worker handles one request at a time, so a single shared connection is safe and avoids the overhead of creating a new TCP connection on every request. The health check (SELECT 1) detects stale connections and reconnects automatically. Second, init_db() uses its own separate connection that it creates and closes, because initialization runs once at import time and should not interfere with the worker's persistent connection. Seeding with explicit primary-key IDs plus ON CONFLICT (id) DO NOTHING (and resetting the sequence with setval) makes concurrent initialization from multiple instances safe and keeps product IDs deterministic across the fleet.
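The idempotent seeding is easy to try locally. The sketch below is a stand-in under stated assumptions: it uses Python's built-in sqlite3 module instead of pg8000 and Aurora (so placeholders are ? rather than %s, and there is no sequence to reset), it seeds only three of the ten products, and the seed function name is hypothetical. It shows only that running the same seed twice leaves one row per product and does not overwrite stock that changed in between.

```python
import sqlite3

# Three of the article's products; SQLite stands in for Aurora here.
PRODUCTS = [
    (1, "Wireless Keyboard", 49.99, 100),
    (2, "USB-C Hub", 34.99, 150),
    (3, "Mechanical Keyboard", 129.99, 50),
]


def seed(conn):
    """Create the table if needed and insert products, skipping existing IDs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, "
        "price REAL NOT NULL, stock INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO products (id, name, price, stock) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        PRODUCTS,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed(conn)  # first "instance" boots and seeds

# Simulate an order changing stock before the second instance boots.
conn.execute("UPDATE products SET stock = 48 WHERE id = 3")
conn.commit()

seed(conn)  # second "instance" runs the same seed

row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
stock_3 = conn.execute("SELECT stock FROM products WHERE id = 3").fetchone()[0]
print(row_count, stock_3)
```

Because conflicting rows are skipped rather than updated, a late-booting instance cannot reset stock levels that earlier orders have already decremented.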
GET / returns service information: the instance ID, availability zone, and private IP. This is our routing indicator. When we kill a server, we can see exactly which instance handled the request.
@app.route("/")
def index():
    return jsonify({
        "service": "chaos-shop",
        "instance_id": INSTANCE_ID,
        "availability_zone": AZ,
        "private_ip": PRIVATE_IP,
        "status": "running",
    })
GET /health is the shallow health check. It returns 200 with no dependencies. This is what the ALB uses. A health check that queries the database would cause the ALB to mark instances unhealthy during a database blip, even though the instance itself is fine. Keep health checks simple.
@app.route("/health")
def health():
    return jsonify({"status": "healthy"}), 200
GET /health/deep is the deep health check. It takes the worker's connection to Aurora (reconnecting if needed) and runs SELECT 1. This is useful for operational debugging, not for the ALB. You do not want a database hiccup to cascade into instance replacements.
@app.route("/health/deep")
def health_deep():
    try:
        start = time.time()
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        cursor.fetchone()
        latency_ms = round((time.time() - start) * 1000, 2)
        cursor.close()
        return jsonify({
            "status": "healthy",
            "database": "connected",
            "latency_ms": latency_ms,
        })
    except Exception as e:
        return jsonify({
            "status": "unhealthy",
            "database": "error",
            "error": str(e),
        }), 500
GET /products lists all 10 products from Aurora, including their current stock levels and prices. This is a read operation, but it still goes through the database every time. No caching. That is intentional. We want to see what happens to database reads during a failure.
@app.route("/products")
def list_products():
    try:
        start = time.time()
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT id, name, price, stock FROM products ORDER BY id")
        rows = cursor.fetchall()
        latency_ms = round((time.time() - start) * 1000, 2)
        cursor.close()
        products = []
        for row in rows:
            products.append({
                "id": row[0],
                "name": row[1],
                "price": float(row[2]),
                "stock": row[3],
            })
        return jsonify({
            "products": products,
            "count": len(products),
            "db_latency_ms": latency_ms,
            "served_by": INSTANCE_ID,
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500
GET /products/{id} returns a single product by ID. Simple read.
POST /orders is the most important endpoint. This is where the real complexity lives. When a customer places an order, the app does the following inside a single database transaction: it locks the product row with SELECT ... FOR UPDATE to prevent concurrent modifications, checks that enough stock is available, decrements the stock, inserts the order record, and commits.
If anything fails, the transaction rolls back. No partial state. No stock decrement without an order. This is how real e-commerce works.
@app.route("/orders", methods=["POST"])
def create_order():
    try:
        data = request.get_json()
        if not data or "product_id" not in data or "quantity" not in data:
            return jsonify({"error": "product_id and quantity are required"}), 400
        product_id = int(data["product_id"])
        quantity = int(data["quantity"])
        if quantity <= 0:
            return jsonify({"error": "quantity must be positive"}), 400
        conn = get_db_connection()
        try:
            conn.autocommit = False
            cursor = conn.cursor()
            # Lock the product row and check stock
            cursor.execute(
                "SELECT id, name, price, stock FROM products WHERE id = %s FOR UPDATE",
                (product_id,),
            )
            row = cursor.fetchone()
            if row is None:
                conn.rollback()
                cursor.close()
                return jsonify({"error": "product not found"}), 404
            product_name = row[1]
            price = float(row[2])
            stock = row[3]
            if stock < quantity:
                conn.rollback()
                cursor.close()
                return jsonify({
                    "error": "insufficient stock",
                    "available": stock,
                    "requested": quantity,
                }), 409
            total = round(price * quantity, 2)
            # Decrement stock
            cursor.execute(
                "UPDATE products SET stock = stock - %s WHERE id = %s",
                (quantity, product_id),
            )
            # Create order
            cursor.execute(
                "INSERT INTO orders (product_id, quantity, total, status) "
                "VALUES (%s, %s, %s, 'confirmed') RETURNING id, created_at",
                (product_id, quantity, total),
            )
            order_row = cursor.fetchone()
            order_id = order_row[0]
            created_at = str(order_row[1])
            conn.commit()
            cursor.close()
            return jsonify({
                "order_id": order_id,
                "product_id": product_id,
                "product_name": product_name,
                "quantity": quantity,
                "total": total,
                "status": "confirmed",
                "created_at": created_at,
                "served_by": INSTANCE_ID,
            }), 201
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.autocommit = True
    except Exception as e:
        return jsonify({"error": str(e)}), 500
GET /orders/{id} retrieves an order with a JOIN to get the product name. This lets us verify after the experiment that our orders survived intact.
The important thing about this application is that the database is not optional. It is not a nice-to-have feature tacked onto a static endpoint. Every product listing, every order, every stock check goes through Aurora. When we kill a server mid-request, there are real consequences to think about: uncommitted transactions, in-flight writes, data consistency.
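To make the all-or-nothing behavior of POST /orders concrete, here is a minimal local sketch. It is not the Chaos Shop code: SQLite (via Python's built-in sqlite3) stands in for Aurora, SQLite has no SELECT ... FOR UPDATE so the row lock is omitted, and place_order and its fail_before_commit switch are hypothetical names introduced only to simulate a failure between the stock decrement and the order insert.

```python
import sqlite3


def place_order(conn, product_id, quantity, fail_before_commit=False):
    """Decrement stock and insert an order atomically; return the new order id."""
    try:
        row = conn.execute(
            "SELECT price, stock FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None or row[1] < quantity:
            raise ValueError("product missing or insufficient stock")
        conn.execute(
            "UPDATE products SET stock = stock - ? WHERE id = ?",
            (quantity, product_id),
        )
        if fail_before_commit:
            # Hypothetical switch: simulate a failure after the UPDATE.
            raise RuntimeError("simulated failure mid-transaction")
        cur = conn.execute(
            "INSERT INTO orders (product_id, quantity, total) VALUES (?, ?, ?)",
            (product_id, quantity, round(row[0] * quantity, 2)),
        )
        conn.commit()
        return cur.lastrowid
    except Exception:
        conn.rollback()  # undoes the stock decrement as well
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL, stock INTEGER)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER, "
    "quantity INTEGER, total REAL)"
)
conn.execute("INSERT INTO products VALUES (3, 129.99, 50)")
conn.commit()

order_id = place_order(conn, 3, 2)  # succeeds: stock 50 -> 48
try:
    place_order(conn, 3, 5, fail_before_commit=True)  # fails after the UPDATE
except RuntimeError:
    pass

stock = conn.execute("SELECT stock FROM products WHERE id = 3").fetchone()[0]
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_id, stock, orders)
```

The failed attempt leaves stock at 48 and the order count at 1: the decrement it made inside the transaction is rolled back, which is the property the real endpoint relies on.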
The target group health check configuration determines how quickly the ALB detects a dead instance. Pay attention to these numbers.
resource "aws_lb_target_group" "main" {
  name     = "chaos-lab-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    matcher             = "200"
  }
}
The ALB checks /health every 10 seconds. An instance must fail 3 consecutive checks before the ALB stops sending it traffic. That means the ALB needs up to 30 seconds (3 checks * 10 second interval) to detect a failed instance. Remember this number.
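The arithmetic behind that number is small enough to write down. This is a minimal sketch of the article's own rule of thumb (consecutive failed checks times the interval); the function name is hypothetical, and the second set of values is an illustrative tighter configuration, not something taken from the repository.

```python
def worst_case_detection_seconds(interval, unhealthy_threshold):
    """Upper bound used in this article: failed checks needed times the interval."""
    return interval * unhealthy_threshold


# Values from the target group above.
chaos_lab = worst_case_detection_seconds(interval=10, unhealthy_threshold=3)

# Hypothetical tighter settings, shown only to illustrate the effect of tuning.
tighter = worst_case_detection_seconds(interval=5, unhealthy_threshold=2)

print(chaos_lab, tighter)
```

Under this simple model the lab configuration gives 30 seconds, and the hypothetical tighter one gives 10.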
resource "aws_autoscaling_group" "app" {
  name                      = "chaos-lab-asg"
  min_size                  = 2
  max_size                  = 4
  desired_capacity          = 2
  vpc_zone_identifier       = aws_subnet.private[*].id
  target_group_arns         = [aws_lb_target_group.main.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}
Two settings matter here. health_check_type = "ELB" means the ASG uses the ALB's health check to decide whether an instance is healthy, not just EC2 status checks. And min_size = 2 guarantees the ASG will launch a replacement whenever an instance is terminated or stopped.
resource "aws_fis_experiment_template" "stop_instance" {
  description = "Stop one EC2 instance in the chaos-lab ASG"
  role_arn    = aws_iam_role.fis.arn

  action {
    name      = "stop-instance"
    action_id = "aws:ec2:stop-instances"

    target {
      key   = "Instances"
      value = "chaos-lab-instances"
    }
  }

  target {
    name           = "chaos-lab-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "COUNT(1)"

    resource_tag {
      key   = "Project"
      value = "chaos-lab"
    }
  }

  stop_condition {
    source = "none"
  }
}
The target uses selection_mode = "COUNT(1)", meaning FIS will randomly pick one instance tagged with Project = chaos-lab and stop it. Not terminate, stop. The instance is still registered with the ASG but is no longer running. The ASG treats it as unhealthy and launches a new one to replace it.
The stop_condition is set to none for this first experiment. In Article 2, we will replace this with a CloudWatch alarm that automatically aborts the experiment if error rates spike too high. For now, we are keeping it simple.
The FIS role is scoped tightly. It can only stop and start instances tagged with Project = chaos-lab. This is the blast-radius guardrail in action: scope the experiment so it cannot touch anything you did not intend.
resource "aws_iam_role_policy" "fis" {
  name = "chaos-lab-fis-policy"
  role = aws_iam_role.fis.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:StopInstances",
          "ec2:StartInstances"
        ]
        Resource = "arn:aws:ec2:${var.region}:${data.aws_caller_identity.current.account_id}:instance/*"
        Condition = {
          StringEquals = {
            "ec2:ResourceTag/Project" = "chaos-lab"
          }
        }
      },
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances"]
        Resource = "*"
      }
    ]
  })
}
If someone accidentally runs this in the wrong account, the tag condition prevents it from touching anything else. The EC2 instances also get an IAM role with SSM access for operational access, but nothing more.
The remaining Terraform files handle standard VPC networking (2 public subnets, 2 private subnets, one NAT gateway per AZ, and route tables), security groups (ALB accepts port 80 from the internet, EC2 accepts port 80 from the ALB only, Aurora accepts port 5432 from EC2 only), and the Aurora Serverless v2 cluster with a single writer instance. These are important for a real deployment but not where the interesting chaos engineering decisions live.
See the full code at github.com/TocConsulting/chaos-on-aws.
Update the aws_profile variable in variables.tf to match your AWS CLI profile, or pass it on the command line. Then:
terraform init
terraform plan
terraform apply
The apply creates 37 resources and takes roughly ten minutes. Most of that time is the Aurora cluster provisioning. When it finishes, Terraform outputs the values you need:
Outputs:
alb_dns_name = "chaos-lab-alb-600543899.us-east-1.elb.amazonaws.com"
aurora_endpoint = "chaos-lab-aurora.cluster-c1l11rt7ly8s.us-east-1.rds.amazonaws.com"
fis_experiment_template_id = "EXT3z4QT4WJN4Jwz"
Save the fis_experiment_template_id. You will need it to run the experiment.
Give the instances a couple of minutes after Terraform completes. They need to boot, install dependencies, initialize the database tables, seed the product data, and start the Flask app. Then start verifying.
Hit the ALB and confirm both instances are responding:
$ ALB="chaos-lab-alb-600543899.us-east-1.elb.amazonaws.com"
$ curl -s http://$ALB/ | python3 -m json.tool
{
"service": "chaos-shop",
"instance_id": "i-0ef097d855de7f9a9",
"availability_zone": "us-east-1a",
"private_ip": "10.0.10.x",
"status": "running"
}
Hit it again and you should see the other instance:
$ curl -s http://$ALB/ | python3 -m json.tool
{
"service": "chaos-shop",
"instance_id": "i-0a52ee26b5d49d79f",
"availability_zone": "us-east-1b",
"private_ip": "10.0.11.x",
"status": "running"
}
Both instances are live. The ALB is distributing requests round-robin across us-east-1a and us-east-1b.
$ curl -s http://$ALB/products | python3 -m json.tool
{
"products": [
{"id": 1, "name": "Wireless Keyboard", "price": 49.99, "stock": 100},
{"id": 2, "name": "USB-C Hub", "price": 34.99, "stock": 150},
{"id": 3, "name": "Mechanical Keyboard", "price": 129.99, "stock": 50},
{"id": 4, "name": "27-inch Monitor", "price": 349.99, "stock": 30},
{"id": 5, "name": "Webcam HD", "price": 79.99, "stock": 80},
{"id": 6, "name": "Laptop Stand", "price": 44.99, "stock": 120},
{"id": 7, "name": "Noise-Canceling Headphones", "price": 199.99, "stock": 40},
{"id": 8, "name": "Portable SSD 1TB", "price": 89.99, "stock": 90},
{"id": 9, "name": "Ergonomic Mouse", "price": 59.99, "stock": 110},
{"id": 10, "name": "USB Microphone", "price": 69.99, "stock": 70}
],
"count": 10,
"db_latency_ms": 126.11,
"served_by": "i-0ef097d855de7f9a9"
}
Ten products, all with stock. The database round-trip on the first request is 126.11ms, which includes the cost of establishing the persistent connection: TCP connect, TLS handshake, query execution, and result fetch. Subsequent requests reuse the connection and are faster. Normal for a t3.micro talking to Aurora Serverless v2.
$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
"id": 3,
"name": "Mechanical Keyboard",
"price": 129.99,
"stock": 50
}
This is the real test. We are going to place an order for 2 Mechanical Keyboards and verify the entire transaction works: stock decremented, order created, total calculated correctly.
$ curl -s -X POST http://$ALB/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 3, "quantity": 2}' | python3 -m json.tool
{
"order_id": 1,
"product_id": 3,
"product_name": "Mechanical Keyboard",
"quantity": 2,
"total": 259.98,
"status": "confirmed",
"created_at": "2026-06-17 21:17:01.029913",
"served_by": "i-0ef097d855de7f9a9"
}
Order confirmed. Total is correct (129.99 * 2 = 259.98). Served by the instance in us-east-1a. Now check that the stock actually decremented:
$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
"id": 3,
"name": "Mechanical Keyboard",
"price": 129.99,
"stock": 48
}
Stock went from 50 to 48. The transaction worked.
$ curl -s http://$ALB/orders/1 | python3 -m json.tool
{
"order_id": 1,
"product_id": 3,
"product_name": "Mechanical Keyboard",
"quantity": 2,
"total": 259.98,
"status": "confirmed",
"created_at": "2026-06-17 21:17:01.029913"
}
Try to order more than what is available:
$ curl -s -X POST http://$ALB/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 3, "quantity": 999}' | python3 -m json.tool
{
"error": "insufficient stock",
"available": 48,
"requested": 999
}
409 Conflict. The app correctly reports available stock as 48 (not the original 50, because our earlier order decremented it) and refuses the order. The row lock and stock check inside the transaction are working.
Let's put more data in the system before we break things. We want to verify after the experiment that none of it gets lost.
$ curl -s -X POST http://$ALB/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 1, "quantity": 1}' | python3 -m json.tool
{
"order_id": 2,
"product_name": "Wireless Keyboard",
"quantity": 1,
"total": 49.99,
"status": "confirmed",
...
}
$ curl -s -X POST http://$ALB/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 5, "quantity": 3}' | python3 -m json.tool
{
"order_id": 3,
"product_name": "Webcam HD",
"quantity": 3,
"total": 239.97,
"status": "confirmed",
...
}
We now have 3 confirmed orders in the database. Stock levels have been decremented for Mechanical Keyboard (50 to 48), Wireless Keyboard (100 to 99), and Webcam HD (80 to 77). This is our baseline.
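It helps to write the baseline down as data so it can be compared against the API after the experiment. This is a small sketch using only the seed stock and the three orders shown above; the names SEED_STOCK, BASELINE_ORDERS, and expected_stock are hypothetical and not part of the Chaos Shop code.

```python
# Seed stock for the three products we ordered, keyed by product id.
SEED_STOCK = {1: 100, 3: 50, 5: 80}

# The three confirmed orders placed so far: (product_id, quantity).
BASELINE_ORDERS = [(3, 2), (1, 1), (5, 3)]


def expected_stock(seed, orders):
    """Return the stock each product should have after the given orders."""
    remaining = dict(seed)
    for product_id, quantity in orders:
        remaining[product_id] -= quantity
    return remaining


baseline = expected_stock(SEED_STOCK, BASELINE_ORDERS)
print(baseline)
```

After the experiment, the stock values returned by GET /products/{id} should match this dictionary, plus any orders placed deliberately in the meantime.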
Before you touch anything, write down your hypothesis. This is not optional. Without a hypothesis, you are just clicking buttons. Here is ours:
Hypothesis: If we stop one EC2 instance, the ALB should route all traffic to the remaining healthy instance with no user-visible errors. Product listings should continue to work. Existing orders should remain readable. New orders should still be placeable. The ASG should detect the stopped instance and launch a replacement within a few minutes. Data in Aurora should not be affected because the database is independent of the compute layer.
This seems reasonable. We have two instances in two AZs. The ALB has health checks. The ASG has a minimum count of 2. Aurora is a managed database that does not care which EC2 instance talks to it. Everything should just work, right?
Let's find out.
Start the FIS experiment using the AWS CLI:
$ aws fis start-experiment \
--experiment-template-id EXT3z4QT4WJN4Jwz \
--query 'experiment.id' \
--output text
EXPEnJp6T52edQqh6J
The experiment picked instance i-0ef097d855de7f9a9 in us-east-1a and stopped it. FIS reports completion within seconds; the interesting part is what happens to traffic in the window before the ALB notices.
While the experiment runs, start monitoring the product listing endpoint:
$ while true; do
echo "$(date '+%H:%M:%S') - $(curl -s -o /dev/null -w '%{http_code}' \
--connect-timeout 3 --max-time 5 \
http://$ALB/products)"
sleep 2
done
Here is the actual output (the probe host's clock ran in local time, UTC+2, so these timestamps are two hours ahead of the database's UTC created_at values shown elsewhere):
23:17:27 - 200
23:17:29 - 502
23:17:31 - 502
23:17:33 - 200
23:17:35 - 200
23:17:38 - 000
23:17:45 - 000
23:17:52 - 200
23:17:54 - 200
23:17:56 - 200
...
Those 502s and 000s (curl's code for a request that ended without any HTTP response, here a timeout) are real failures. Our hypothesis said "no user-visible errors." That was wrong.
Let's walk through the timeline.
Before 23:17:29: both instances are healthy and every request returns 200. FIS then stops instance i-0ef097d855de7f9a9 in us-east-1a. The instance is shutting down, but the ALB does not know this yet. It still has two targets in its list.
23:17:29 to 23:17:31. The ALB routes two requests to the dead instance. Both come back as 502 Bad Gateway. The ALB tried to forward the request, the connection failed, and it returned a 502 to the caller.
23:17:33 to 23:17:35. Requests happen to land on the healthy instance in us-east-1b. We get 200s. This is luck, not design.
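23:17:38 and 23:17:45. Two requests routed to the dead instance time out entirely. curl reports 000, meaning it received no HTTP response before giving up. The 7-second gap after each of these probes (5 seconds of --max-time plus the 2-second sleep) shows the request hung until the overall timeout instead of failing fast. This is worse than a 502; the caller has no idea what happened.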
23:17:52 onward. All 200s. The ALB has detected the failed instance and stopped routing traffic to it.
The detection window (the time between the instance dying and the ALB removing it from rotation) is what caused the failures. Across the 60 probes we captured, 4 came back as failures (two 502s and two timeouts), all clustered between 23:17:29 and 23:17:45, a roughly 16-second window. Over the whole run that is a small minority of requests. Inside the window it is not: 4 of the 6 probes sent between 23:17:29 and 23:17:45 failed, because the ALB was still sending a share of traffic to the dead target, and the two that succeeded simply happened to land on the healthy instance. The failures were interspersed with 200s rather than forming one solid block of errors. A minority of requests failing is still a real outage for the customers who hit it.
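The window and the counts can be derived from the probe output directly. The sketch below parses only the ten lines shown above (the lines elided by "..." are not reconstructed), and the helper name parse is hypothetical.

```python
from datetime import datetime

# The probe lines shown in the article, copied verbatim.
PROBE_LOG = """\
23:17:27 - 200
23:17:29 - 502
23:17:31 - 502
23:17:33 - 200
23:17:35 - 200
23:17:38 - 000
23:17:45 - 000
23:17:52 - 200
23:17:54 - 200
23:17:56 - 200
"""


def parse(log):
    """Turn 'HH:MM:SS - code' lines into (datetime, code) tuples."""
    probes = []
    for line in log.splitlines():
        stamp, code = line.split(" - ")
        probes.append((datetime.strptime(stamp, "%H:%M:%S"), code))
    return probes


probes = parse(PROBE_LOG)
failures = [(t, c) for t, c in probes if c != "200"]

# Time from the first failed probe to the last failed probe.
window_seconds = (failures[-1][0] - failures[0][0]).total_seconds()

# Every probe sent between the first and last failure, successful or not.
in_window = [c for t, c in probes if failures[0][0] <= t <= failures[-1][0]]

print(len(failures), window_seconds, len(in_window))
```

For the lines shown, this gives 4 failures across a 16-second span containing 6 probes.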
Think about what this means for Chaos Shop. If a customer was browsing products during that window, their page load would have failed. If they were placing an order, the POST would have returned a 502 or timed out. The customer would not know whether their order went through. Did the money get charged? Did the stock get decremented? They would have to refresh and check, and that is a bad experience.
The failures during the detection window are concerning for request availability. But what about the data? Orders were placed before the experiment. Is the data still there?
$ curl -s http://$ALB/orders/1 | python3 -m json.tool
{
"order_id": 1,
"product_id": 3,
"product_name": "Mechanical Keyboard",
"quantity": 2,
"total": 259.98,
"status": "confirmed",
...
}
Order 1 is intact. The Mechanical Keyboard order we placed before the experiment survived the instance failure. This makes sense. The order data lives in Aurora, not on the EC2 instance. When the instance died, the database was unaffected.
$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
"id": 3,
"name": "Mechanical Keyboard",
"price": 129.99,
"stock": 48
}
Stock is still 48. No corruption, no phantom decrements, no lost updates. The transactional integrity held.
The ASG launched a replacement instance, i-0a9a795f314c94f1e, back in us-east-1a to restore the fleet to two healthy instances. Meanwhile the application keeps accepting orders. Let's place one:
$ curl -s -X POST http://$ALB/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 2, "quantity": 1}' | python3 -m json.tool
{
"order_id": 4,
"product_id": 2,
"product_name": "USB-C Hub",
"quantity": 1,
"total": 34.99,
"status": "confirmed",
"served_by": "i-0a52ee26b5d49d79f",
...
}
Order 4 went through, served by the instance that stayed healthy. The order ID continues the sequence from before the failure, and once the replacement instance passes its health checks it rejoins the pool. The database state is consistent across the old and new instances.
The architecture diagram shows a clean failover: instance dies, ALB routes around it, done. The reality has a detection delay. The ALB is not psychic. It learns about failures through health checks, and health checks have intervals.
Look at the health check configuration again: interval = 10 and unhealthy_threshold = 3. The ALB needs three consecutive failed health checks, ten seconds apart, before it marks a target as unhealthy and stops sending it traffic. In our experiment, the non-200 responses spanned roughly 16 seconds, from the first 502 at 23:17:29 to the last timeout at 23:17:45. We did not hit the 30-second worst case, but we still saw real failures.
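This is not a bug in the ALB or a misconfiguration. It is a fundamental property of health-check-based systems. You can tune it (shorter intervals, lower thresholds) but you cannot eliminate it entirely. Every choice is a trade-off: faster detection means fewer failed requests when an instance really dies, but also a greater chance that a brief hiccup takes a healthy instance out of rotation.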
Let's go back to our hypothesis and score it honestly.
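What we got right: the data in Aurora was unaffected, existing orders remained readable, new orders could still be placed, and the ASG launched a replacement instance.
What we got wrong: we predicted no user-visible errors, but four of our probes failed during the detection window.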
This is why you run the experiment instead of just reading the architecture diagram. The diagram says "ALB routes around failures." The experiment says "ALB routes around failures, after a detection window where some requests fail."
Knowing about this detection window, you have options. We are not going to implement fixes here. That is what the later articles in this series are for. But it is worth listing what you would consider:
None of these options appeared on the original architecture diagram. They only became apparent because we ran the experiment.
In this article, we tested what happens when a compute node dies. The database was fine because we did not touch it. But Aurora has its own failure modes, and those are more interesting, and more dangerous, than losing an EC2 instance.
In Article 2, we add observability: CloudWatch dashboards, alarms, and FIS stop conditions that automatically abort an experiment if things go wrong. We also cover how to structure chaos experiments as part of a regular testing practice instead of a one-off exercise.
In Article 3 and Article 4, we turn our attention to the database itself. What happens during an Aurora writer failover? How long does the application see errors? What happens to in-flight transactions? We will chaos test the database and implement resilience patterns: read replicas, RDS Proxy, write buffering, and circuit breakers. The database latency we measured today is our baseline, and we are going to see what happens when Aurora has a bad day.
This lab is not free. The main costs are the Aurora Serverless v2 cluster (starting at 0.5 ACU), the two NAT gateways (one per AZ, which roughly doubles that line item versus a single-NAT setup), and the ALB. Running the full stack for a few hours will cost roughly $2-4. If you leave it running for a day, expect about $8-10.
Tear it down when you are done:
terraform destroy
Confirm with yes when prompted. The destroy takes a few minutes as it drains the ALB and deletes the Aurora cluster.