Chaos Engineering on AWS: Your First FIS Experiment

Chaos Engineering on AWS - Your first experiment with AWS Fault Injection Service

This is Article 1 in the "Chaos Engineering on AWS" series. We deploy a product catalog and order API backed by Aurora, put real data through it, then kill a server and watch what happens to our customers' orders.

What Is Chaos Engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. That definition comes from principlesofchaos.org, and it is worth reading carefully. The key word is confidence. You are not trying to break things for fun. You are running controlled experiments to find out whether your system behaves the way you think it does.

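The source lists five advanced principles. The first four describe how to design and run experiments; the fifth, minimizing the blast radius, is covered right after the list: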

Build a hypothesis around steady-state behavior. Define what "normal" looks like in terms of measurable output: request success rate, latency percentiles, error rates. Not CPU usage or memory. Those are internals. Focus on what the user sees.
Vary real-world events. Inject failures that actually happen: servers die, networks partition, disks fill up, dependencies slow down. If you only test for things that never happen, you are wasting time.
Run experiments in production. This is the aspirational goal. Staging environments lie. They have different traffic patterns, different data, and different timing. The closer you get to production conditions, the more useful your results.
Automate experiments to run continuously. A one-time test proves a point. Continuous experiments catch regressions. Your system changes every week; your confidence should be validated just as often.
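The first principle becomes concrete once steady state is written down as numbers. The sketch below is a minimal illustration and not part of the Chaos Shop code: the Probe record, the steady_state helper, and the thresholds (99% success rate, 500 ms p95) are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    """One request as seen from the user's side (hypothetical record type)."""
    status: int         # HTTP status code, 0 if no response was received
    latency_ms: float   # how long the caller waited


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(probes: List[Probe], min_success_rate: float = 0.99,
                 max_p95_ms: float = 500.0) -> dict:
    """Summarize user-visible behavior and compare it with the assumed thresholds."""
    if not probes:
        raise ValueError("need at least one probe")
    ok = sum(1 for p in probes if 200 <= p.status < 300)
    success_rate = ok / len(probes)
    p95 = percentile([p.latency_ms for p in probes], 95)
    return {
        "success_rate": success_rate,
        "p95_ms": p95,
        "within_hypothesis": success_rate >= min_success_rate and p95 <= max_p95_ms,
    }


# Illustrative data only: 99 fast successes and one 502.
baseline = [Probe(200, 40.0)] * 99 + [Probe(502, 12.0)]
print(steady_state(baseline))
```

Note that nothing in the summary refers to CPU or memory. It only measures what a caller would observe, which is the point of the principle.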

Alongside those four principles, the same source stresses one operational guardrail above all: minimize the blast radius. Start small. Kill one instance, not the whole fleet. Have stop conditions. Be ready to abort. Chaos engineering is not about causing outages; it is about preventing them. This guardrail matters more than people realize. Chaos engineering without it is just breaking things. We will talk more about stop conditions and observability in Article 2 of this series.

Why It Matters

Every architecture diagram shows arrows flowing smoothly between boxes. Reality is different. Load balancers take time to detect failed targets. Auto Scaling Groups take time to launch replacements. Health checks have intervals and thresholds that create detection windows. You do not know how long those windows are until you test them.

More importantly, architecture diagrams do not show what happens to in-flight work. If a customer is placing an order when a server dies, does the order go through? Does it get charged but never confirmed? Does the stock decrement without creating a record? These questions only have real answers when you run the experiment.

The point of chaos engineering is to close the gap between what you assume about your system and what is actually true. In this article, we will find a concrete example of that gap.

What We Are Building: Chaos Shop

We are deploying a product catalog and order API called Chaos Shop. This is not a hello-world endpoint that returns instance metadata. It is a small but real application: 10 products in a database, stock tracking, order placement with transactional integrity. Every request that matters goes through Aurora.

Two EC2 instances sit behind an Application Load Balancer, backed by an Aurora Serverless v2 PostgreSQL database. AWS Fault Injection Service (FIS) is configured to stop one instance on demand.

Here is the architecture:

                        Internet
                           |
                    +------+------+
                    |     ALB     |
                    | (us-east-1) |
                    +------+------+
                           |
              +------------+------------+
              |                         |
     +--------+--------+       +--------+--------+
     |  EC2 Instance   |       |  EC2 Instance   |
     |  (us-east-1a)   |       |  (us-east-1b)   |
     |  Chaos Shop API |       |  Chaos Shop API |
     +--------+--------+       +--------+--------+
              |                         |
              +------------+------------+
                           |
                 +---------+---------+
                 | Aurora Serverless |
                 | v2 (PostgreSQL)   |
                 | Writer instance   |
                 +-------------------+

     +--------------------+
     |  AWS FIS           |
     |  "Stop 1 instance" |
     +--------------------+

The ALB distributes HTTP traffic across both instances. Each instance runs the Chaos Shop Flask API, which serves product listings, accepts orders, and manages inventory, all backed by Aurora. The ASG is configured with a minimum of 2 instances, so when FIS stops one, the ASG should launch a replacement.

The database is central here. Unlike a typical chaos engineering tutorial where you kill a server returning static data, our application has state. Products have stock counts. Orders create records and decrement inventory inside a database transaction. When we kill a server, the interesting question is not just "do requests fail?" but "does the data stay consistent?"

This is not a production architecture. There is no HTTPS, no WAF, no private subnets for the ALB. It is a lab environment designed to teach chaos engineering concepts without running up a large bill.

Prerequisites

You will need the following installed and configured:

An AWS account with permissions to create VPCs, EC2 instances, ALBs, Aurora clusters, IAM roles, and FIS experiments.
Terraform >= 1.5: install instructions.
AWS CLI v2: install instructions. Configure a profile or use environment variables.

The full Terraform code is available at github.com/TocConsulting/chaos-on-aws. Clone it before continuing.

git clone https://github.com/TocConsulting/chaos-on-aws.git
cd chaos-on-aws/01-what-is-chaos-engineering/terraform

The Application: Chaos Shop API

Each EC2 instance runs a Flask application deployed via userdata. The app has seven endpoints, and almost all of them talk to the database. This is deliberate. We want to test what happens to a stateful application during a failure, not just a stateless proxy.

Here is the application setup: imports, configuration, database connection, and initialization. The API endpoints follow below.

import os
import time
import pg8000

from flask import Flask, jsonify, request

app = Flask(__name__)

INSTANCE_ID = os.environ.get("INSTANCE_ID", "unknown")
AZ = os.environ.get("AZ", "unknown")
PRIVATE_IP = os.environ.get("PRIVATE_IP", "unknown")
DB_ENDPOINT = os.environ.get("DB_ENDPOINT", "")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "")
DB_USERNAME = os.environ.get("DB_USERNAME", "")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")


# Persistent connection per Gunicorn worker process.
# Each sync worker handles one request at a time, so a single
# connection per worker is safe and avoids per-request TCP overhead.
_worker_conn = None


def get_db_connection():
    """Return the worker's persistent DB connection, reconnecting if needed."""
    global _worker_conn
    if _worker_conn is not None:
        try:
            c = _worker_conn.cursor()
            c.execute("SELECT 1")
            c.fetchone()
            c.close()
            return _worker_conn
        except Exception:
            try:
                _worker_conn.close()
            except Exception:
                pass
            _worker_conn = None
    _worker_conn = pg8000.connect(
        host=DB_ENDPOINT,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USERNAME,
        password=DB_PASSWORD,
    )
    _worker_conn.autocommit = True
    return _worker_conn


def init_db():
    """Create tables and seed products if they do not exist."""
    conn = pg8000.connect(
        host=DB_ENDPOINT,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USERNAME,
        password=DB_PASSWORD,
    )
    try:
        conn.autocommit = True
        cursor = conn.cursor()

        cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name VARCHAR(100) NOT NULL UNIQUE,
                price DECIMAL(10,2) NOT NULL,
                stock INTEGER NOT NULL DEFAULT 0
            )
        """)

        cursor.execute("""
            CREATE TABLE IF NOT EXISTS orders (
                id SERIAL PRIMARY KEY,
                product_id INTEGER REFERENCES products(id),
                quantity INTEGER NOT NULL,
                total DECIMAL(10,2) NOT NULL,
                status VARCHAR(20) DEFAULT 'pending',
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

        # Seed products with explicit IDs so concurrent instances do not create
        # non-sequential IDs from the SERIAL sequence.
        products = [
            (1, "Wireless Keyboard", 49.99, 100),
            (2, "USB-C Hub", 34.99, 150),
            (3, "Mechanical Keyboard", 129.99, 50),
            (4, "27-inch Monitor", 349.99, 30),
            (5, "Webcam HD", 79.99, 80),
            (6, "Laptop Stand", 44.99, 120),
            (7, "Noise-Canceling Headphones", 199.99, 40),
            (8, "Portable SSD 1TB", 89.99, 90),
            (9, "Ergonomic Mouse", 59.99, 110),
            (10, "USB Microphone", 69.99, 70),
        ]
        for pid, name, price, stock in products:
            cursor.execute(
                "INSERT INTO products (id, name, price, stock) VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (pid, name, price, stock),
            )
        cursor.execute("SELECT setval('products_id_seq', 10)")

        cursor.close()
    finally:
        conn.close()

Three things are worth noting here. First, get_db_connection() maintains a persistent connection per Gunicorn worker process. Each sync worker handles one request at a time, so a single connection per worker is safe and avoids the overhead of creating a new TCP connection on every request. The SELECT 1 probe detects stale connections and reconnects automatically. Second, init_db() uses its own separate connection that it creates and closes, because initialization runs once at import time and should not interfere with the worker's persistent connection. Third, seeding with explicit primary-key IDs plus ON CONFLICT (id) DO NOTHING (and resetting the sequence with setval) makes concurrent initialization from multiple instances safe and keeps product IDs deterministic across the fleet.
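The idempotent seeding idea can be tried without an Aurora cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL, an assumption made so the example runs anywhere (it needs SQLite 3.24 or newer for ON CONFLICT ... DO NOTHING). The seed helper and the two-row product list are illustrative, and the example shows repeated seeding rather than truly concurrent seeding.

```python
import sqlite3

# Two of the article's products; the full list is omitted for brevity.
PRODUCTS = [
    (1, "Wireless Keyboard", 49.99, 100),
    (2, "USB-C Hub", 34.99, 150),
]


def seed(conn):
    """Create the table and insert seed rows, skipping IDs that already exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, "
        "price REAL NOT NULL, stock INTEGER NOT NULL DEFAULT 0)"
    )
    for row in PRODUCTS:
        conn.execute(
            "INSERT INTO products (id, name, price, stock) VALUES (?, ?, ?, ?) "
            "ON CONFLICT (id) DO NOTHING",
            row,
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed(conn)  # first instance boots and seeds
conn.execute("UPDATE products SET stock = stock - 2 WHERE id = 1")  # an order arrives
conn.commit()
seed(conn)  # a second instance boots later and seeds again

rows = conn.execute("SELECT id, stock FROM products ORDER BY id").fetchall()
print(rows)  # [(1, 98), (2, 150)]
```

The second seed call leaves the decremented stock alone, which is the property that lets a replacement instance run init_db() without resetting inventory.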

The Endpoints

GET / returns service information: the instance ID, availability zone, and private IP. This is our routing indicator. When we kill a server, we can see exactly which instance handled the request.

@app.route("/")
def index():
    return jsonify({
        "service": "chaos-shop",
        "instance_id": INSTANCE_ID,
        "availability_zone": AZ,
        "private_ip": PRIVATE_IP,
        "status": "running",
    })

GET /health is the shallow health check. It returns 200 with no dependencies. This is what the ALB uses. A health check that queries the database would cause the ALB to mark instances unhealthy during a database blip, even though the instance itself is fine. Keep health checks simple.

@app.route("/health")
def health():
    return jsonify({"status": "healthy"}), 200

GET /health/deep is the deep health check. It obtains the worker's database connection (reconnecting if needed) and runs SELECT 1 against Aurora. This is useful for operational debugging, not for the ALB. You do not want a database hiccup to cascade into instance replacements.

@app.route("/health/deep")
def health_deep():
    try:
        start = time.time()
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        cursor.fetchone()
        latency_ms = round((time.time() - start) * 1000, 2)
        cursor.close()
        return jsonify({
            "status": "healthy",
            "database": "connected",
            "latency_ms": latency_ms,
        })
    except Exception as e:
        return jsonify({
            "status": "unhealthy",
            "database": "error",
            "error": str(e),
        }), 500

GET /products lists all 10 products from Aurora, including their current stock levels and prices. This is a read operation, but it still goes through the database every time. No caching. That is intentional. We want to see what happens to database reads during a failure.

@app.route("/products")
def list_products():
    try:
        start = time.time()
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT id, name, price, stock FROM products ORDER BY id")
        rows = cursor.fetchall()
        latency_ms = round((time.time() - start) * 1000, 2)
        cursor.close()

        products = []
        for row in rows:
            products.append({
                "id": row[0],
                "name": row[1],
                "price": float(row[2]),
                "stock": row[3],
            })

        return jsonify({
            "products": products,
            "count": len(products),
            "db_latency_ms": latency_ms,
            "served_by": INSTANCE_ID,
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

GET /products/{id} returns a single product by ID. Simple read.

POST /orders is the most important endpoint. This is where the real complexity lives. When a customer places an order, the app does the following inside a single database transaction:

Locks the product row with SELECT ... FOR UPDATE to prevent concurrent modifications.
Checks if there is enough stock for the requested quantity.
Decrements the stock count.
Inserts a new order record with status "confirmed".
Commits the transaction.

If anything fails, the transaction rolls back. No partial state. No stock decrement without an order. This is how real e-commerce works.

@app.route("/orders", methods=["POST"])
def create_order():
    try:
        data = request.get_json()
        if not data or "product_id" not in data or "quantity" not in data:
            return jsonify({"error": "product_id and quantity are required"}), 400

        product_id = int(data["product_id"])
        quantity = int(data["quantity"])

        if quantity <= 0:
            return jsonify({"error": "quantity must be positive"}), 400

        conn = get_db_connection()
        try:
            conn.autocommit = False
            cursor = conn.cursor()

            # Lock the product row and check stock
            cursor.execute(
                "SELECT id, name, price, stock FROM products WHERE id = %s FOR UPDATE",
                (product_id,),
            )
            row = cursor.fetchone()

            if row is None:
                conn.rollback()
                cursor.close()
                return jsonify({"error": "product not found"}), 404

            product_name = row[1]
            price = float(row[2])
            stock = row[3]

            if stock < quantity:
                conn.rollback()
                cursor.close()
                return jsonify({
                    "error": "insufficient stock",
                    "available": stock,
                    "requested": quantity,
                }), 409

            total = round(price * quantity, 2)

            # Decrement stock
            cursor.execute(
                "UPDATE products SET stock = stock - %s WHERE id = %s",
                (quantity, product_id),
            )

            # Create order
            cursor.execute(
                "INSERT INTO orders (product_id, quantity, total, status) "
                "VALUES (%s, %s, %s, 'confirmed') RETURNING id, created_at",
                (product_id, quantity, total),
            )
            order_row = cursor.fetchone()
            order_id = order_row[0]
            created_at = str(order_row[1])

            conn.commit()
            cursor.close()

            return jsonify({
                "order_id": order_id,
                "product_id": product_id,
                "product_name": product_name,
                "quantity": quantity,
                "total": total,
                "status": "confirmed",
                "created_at": created_at,
                "served_by": INSTANCE_ID,
            }), 201

        except Exception:
            conn.rollback()
            raise
        finally:
            conn.autocommit = True

    except Exception as e:
        return jsonify({"error": str(e)}), 500
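The all-or-nothing behavior of this transaction can be observed in isolation. The sketch below is a simplified stand-in, not the Chaos Shop code: it uses Python's built-in sqlite3 instead of Aurora, it omits row locking because SQLite has no SELECT ... FOR UPDATE, and the place_order helper and its fail_before_commit flag are hypothetical. The flag simulates the process failing after the stock update but before the order insert.

```python
import sqlite3


def place_order(conn, product_id, quantity, fail_before_commit=False):
    """Decrement stock and insert an order atomically; return the new order id."""
    try:
        row = conn.execute(
            "SELECT price, stock FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None or row[1] < quantity:
            raise ValueError("product missing or insufficient stock")
        # The UPDATE opens an implicit transaction in sqlite3.
        conn.execute(
            "UPDATE products SET stock = stock - ? WHERE id = ?",
            (quantity, product_id),
        )
        if fail_before_commit:
            raise RuntimeError("simulated failure after UPDATE, before INSERT")
        cur = conn.execute(
            "INSERT INTO orders (product_id, quantity, total, status) "
            "VALUES (?, ?, ?, 'confirmed')",
            (product_id, quantity, round(row[0] * quantity, 2)),
        )
        conn.commit()
        return cur.lastrowid
    except Exception:
        conn.rollback()  # undoes the stock decrement as well
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL, stock INTEGER)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER, "
    "quantity INTEGER, total REAL, status TEXT)"
)
conn.execute("INSERT INTO products VALUES (3, 129.99, 50)")
conn.commit()

order_id = place_order(conn, 3, 2)  # succeeds
try:
    place_order(conn, 3, 5, fail_before_commit=True)  # fails midway
except RuntimeError:
    pass

stock = conn.execute("SELECT stock FROM products WHERE id = 3").fetchone()[0]
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_id, stock, orders)  # 1 48 1
```

The failed attempt leaves stock at 48 and creates no order row, so there is no decrement without a matching order.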

GET /orders/{id} retrieves an order with a JOIN to get the product name. This lets us verify after the experiment that our orders survived intact.

The important thing about this application is that the database is not optional. It is not a nice-to-have feature tacked onto a static endpoint. Every product listing, every order, every stock check goes through Aurora. When we kill a server mid-request, there are real consequences to think about: uncommitted transactions, in-flight writes, data consistency.

Key Terraform

Health Check and Target Group

The target group health check configuration determines how quickly the ALB detects a dead instance. Pay attention to these numbers.

resource "aws_lb_target_group" "main" {
  name     = "chaos-lab-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    matcher             = "200"
  }
}

The ALB checks /health every 10 seconds. An instance must fail 3 consecutive checks before the ALB stops sending it traffic. That means the ALB needs up to 30 seconds (3 checks * 10 second interval) to detect a failed instance. Remember this number.
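The arithmetic behind that number can be written down as a small sketch. Both helpers are hypothetical, and the model is deliberately simplified: it multiplies interval by threshold as the text does, ignores the per-check timeout and where the failure falls relative to the next check, and assumes requests are spread evenly across targets. The figure of 100 requests per second is an invented example load, not a measurement.

```python
def detection_window_seconds(interval: int, unhealthy_threshold: int) -> int:
    """Approximate time for the ALB to mark a dead target unhealthy."""
    return interval * unhealthy_threshold


def requests_at_risk(window_s: float, requests_per_second: float, targets: int) -> float:
    """Requests sent to one dead target during the window, assuming even distribution."""
    return window_s * requests_per_second / targets


window = detection_window_seconds(interval=10, unhealthy_threshold=3)
print(window)                             # 30
print(requests_at_risk(window, 100, 2))   # 1500.0
```

Even a short window translates into a large number of affected requests once traffic is non-trivial, which is why the number is worth remembering.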

Auto Scaling Group

resource "aws_autoscaling_group" "app" {
  name                      = "chaos-lab-asg"
  min_size                  = 2
  max_size                  = 4
  desired_capacity          = 2
  vpc_zone_identifier       = aws_subnet.private[*].id
  target_group_arns         = [aws_lb_target_group.main.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

Two settings matter here. health_check_type = "ELB" means the ASG uses the ALB's health check to decide whether an instance is healthy, not just EC2 status checks. And min_size = 2 guarantees the ASG will launch a replacement whenever an instance is terminated or stopped.

FIS Experiment Template

resource "aws_fis_experiment_template" "stop_instance" {
  description = "Stop one EC2 instance in the chaos-lab ASG"
  role_arn    = aws_iam_role.fis.arn

  action {
    name      = "stop-instance"
    action_id = "aws:ec2:stop-instances"

    target {
      key   = "Instances"
      value = "chaos-lab-instances"
    }
  }

  target {
    name           = "chaos-lab-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "COUNT(1)"

    resource_tag {
      key   = "Project"
      value = "chaos-lab"
    }
  }

  stop_condition {
    source = "none"
  }
}

The target uses selection_mode = "COUNT(1)", meaning FIS will randomly pick one instance tagged with Project = chaos-lab and stop it. Not terminate, stop. The stopped instance is still a member of the ASG, but it is no longer running and cannot pass health checks. The ASG marks it unhealthy, terminates it, and launches a replacement.

The stop_condition is set to none for this first experiment. In Article 2, we will replace this with a CloudWatch alarm that automatically aborts the experiment if error rates spike too high. For now, we are keeping it simple.

IAM: Scoped Blast Radius

The FIS role is scoped tightly. It can only stop and start instances tagged with Project = chaos-lab. This is the blast-radius guardrail in action: scope the experiment so it cannot touch anything you did not intend.

resource "aws_iam_role_policy" "fis" {
  name = "chaos-lab-fis-policy"
  role = aws_iam_role.fis.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:StopInstances",
          "ec2:StartInstances"
        ]
        Resource = "arn:aws:ec2:${var.region}:${data.aws_caller_identity.current.account_id}:instance/*"
        Condition = {
          StringEquals = {
            "ec2:ResourceTag/Project" = "chaos-lab"
          }
        }
      },
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances"]
        Resource = "*"
      }
    ]
  })
}

If someone accidentally runs this in the wrong account, the tag condition prevents it from touching any instance that does not carry the Project = chaos-lab tag. The EC2 instances also get an IAM role that grants SSM access for troubleshooting, but nothing more.

The Rest of the Code

The remaining Terraform files handle standard VPC networking (two public subnets, two private subnets, one NAT gateway per AZ, and route tables), security groups (ALB accepts port 80 from the internet, EC2 accepts port 80 from the ALB only, Aurora accepts port 5432 from EC2 only), and the Aurora Serverless v2 cluster with a single writer instance. These are important for a real deployment but not where the interesting chaos engineering decisions live.

See the full code at github.com/TocConsulting/chaos-on-aws.

Deploy

Update the aws_profile variable in variables.tf to match your AWS CLI profile, or pass it on the command line. Then:

terraform init
terraform plan
terraform apply

The apply creates 37 resources and takes roughly ten minutes. Most of that time is the Aurora cluster provisioning. When it finishes, Terraform outputs the values you need:

Outputs:

alb_dns_name = "chaos-lab-alb-600543899.us-east-1.elb.amazonaws.com"
aurora_endpoint = "chaos-lab-aurora.cluster-c1l11rt7ly8s.us-east-1.rds.amazonaws.com"
fis_experiment_template_id = "EXT3z4QT4WJN4Jwz"

Save the fis_experiment_template_id. You will need it to run the experiment.

Verify Everything Works

Give the instances a couple of minutes after Terraform completes. They need to boot, install dependencies, initialize the database tables, seed the product data, and start the Flask app. Then start verifying.

Check the service endpoint

Hit the ALB and confirm both instances are responding:

$ ALB="chaos-lab-alb-600543899.us-east-1.elb.amazonaws.com"

$ curl -s http://$ALB/ | python3 -m json.tool
{
    "service": "chaos-shop",
    "instance_id": "i-0ef097d855de7f9a9",
    "availability_zone": "us-east-1a",
    "private_ip": "10.0.10.x",
    "status": "running"
}

Hit it again and you should see the other instance:

$ curl -s http://$ALB/ | python3 -m json.tool
{
    "service": "chaos-shop",
    "instance_id": "i-0a52ee26b5d49d79f",
    "availability_zone": "us-east-1b",
    "private_ip": "10.0.11.x",
    "status": "running"
}

Both instances are live. The ALB is distributing requests round-robin across us-east-1a and us-east-1b.

List all products

$ curl -s http://$ALB/products | python3 -m json.tool
{
    "products": [
        {"id": 1, "name": "Wireless Keyboard", "price": 49.99, "stock": 100},
        {"id": 2, "name": "USB-C Hub", "price": 34.99, "stock": 150},
        {"id": 3, "name": "Mechanical Keyboard", "price": 129.99, "stock": 50},
        {"id": 4, "name": "27-inch Monitor", "price": 349.99, "stock": 30},
        {"id": 5, "name": "Webcam HD", "price": 79.99, "stock": 80},
        {"id": 6, "name": "Laptop Stand", "price": 44.99, "stock": 120},
        {"id": 7, "name": "Noise-Canceling Headphones", "price": 199.99, "stock": 40},
        {"id": 8, "name": "Portable SSD 1TB", "price": 89.99, "stock": 90},
        {"id": 9, "name": "Ergonomic Mouse", "price": 59.99, "stock": 110},
        {"id": 10, "name": "USB Microphone", "price": 69.99, "stock": 70}
    ],
    "count": 10,
    "db_latency_ms": 126.11,
    "served_by": "i-0ef097d855de7f9a9"
}

Ten products, all with stock. The database round-trip on the first request is 126.11 ms. That figure covers establishing the persistent connection (TCP connect and the PostgreSQL authentication exchange) as well as query execution and result fetch. Subsequent requests reuse the connection and are faster. This is normal for a t3.micro talking to Aurora Serverless v2.

Get a single product

$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
    "id": 3,
    "name": "Mechanical Keyboard",
    "price": 129.99,
    "stock": 50
}

Place an order

This is the real test. We are going to place an order for 2 Mechanical Keyboards and verify the entire transaction works: stock decremented, order created, total calculated correctly.

$ curl -s -X POST http://$ALB/orders \
    -H "Content-Type: application/json" \
    -d '{"product_id": 3, "quantity": 2}' | python3 -m json.tool
{
    "order_id": 1,
    "product_id": 3,
    "product_name": "Mechanical Keyboard",
    "quantity": 2,
    "total": 259.98,
    "status": "confirmed",
    "created_at": "2026-06-17 21:17:01.029913",
    "served_by": "i-0ef097d855de7f9a9"
}

Order confirmed. Total is correct (129.99 * 2 = 259.98). Served by the instance in us-east-1a. Now check that the stock actually decremented:

$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
    "id": 3,
    "name": "Mechanical Keyboard",
    "price": 129.99,
    "stock": 48
}

Stock went from 50 to 48. The transaction worked.

Verify the order is readable

$ curl -s http://$ALB/orders/1 | python3 -m json.tool
{
    "order_id": 1,
    "product_id": 3,
    "product_name": "Mechanical Keyboard",
    "quantity": 2,
    "total": 259.98,
    "status": "confirmed",
    "created_at": "2026-06-17 21:17:01.029913"
}

Test stock validation

Try to order more than what is available:

$ curl -s -X POST http://$ALB/orders \
    -H "Content-Type: application/json" \
    -d '{"product_id": 3, "quantity": 999}' | python3 -m json.tool
{
    "error": "insufficient stock",
    "available": 48,
    "requested": 999
}

409 Conflict. The app correctly reports available stock as 48 (not the original 50, because our earlier order decremented it) and refuses the order. The row lock and stock check inside the transaction are working.

Place a few more orders

Let's put more data in the system before we break things. We want to verify after the experiment that none of it gets lost.

$ curl -s -X POST http://$ALB/orders \
    -H "Content-Type: application/json" \
    -d '{"product_id": 1, "quantity": 1}' | python3 -m json.tool
{
    "order_id": 2,
    "product_name": "Wireless Keyboard",
    "quantity": 1,
    "total": 49.99,
    "status": "confirmed",
    ...
}

$ curl -s -X POST http://$ALB/orders \
    -H "Content-Type: application/json" \
    -d '{"product_id": 5, "quantity": 3}' | python3 -m json.tool
{
    "order_id": 3,
    "product_name": "Webcam HD",
    "quantity": 3,
    "total": 239.97,
    "status": "confirmed",
    ...
}

We now have 3 confirmed orders in the database. Stock levels have been decremented for Mechanical Keyboard (50 to 48), Wireless Keyboard (100 to 99), and Webcam HD (80 to 77). This is our baseline.

The Hypothesis

Before you touch anything, write down your hypothesis. This is not optional. Without a hypothesis, you are just clicking buttons. Here is ours:

Hypothesis: If we stop one EC2 instance, the ALB should route all traffic to the remaining healthy instance with no user-visible errors. Product listings should continue to work. Existing orders should remain readable. New orders should still be placeable. The ASG should detect the stopped instance and launch a replacement within a few minutes. Data in Aurora should not be affected because the database is independent of the compute layer.

This seems reasonable. We have two instances in two AZs. The ALB has health checks. The ASG has a minimum count of 2. Aurora is a managed database that does not care which EC2 instance talks to it. Everything should just work, right?

Let's find out.

Run the Experiment

Start the FIS experiment using the AWS CLI:

$ aws fis start-experiment \
    --experiment-template-id EXT3z4QT4WJN4Jwz \
    --query 'experiment.id' \
    --output text

EXPEnJp6T52edQqh6J

The experiment picked instance i-0ef097d855de7f9a9 in us-east-1a and stopped it. FIS reports completion within seconds; the interesting part is what happens to traffic in the window before the ALB notices.

While the experiment runs, start monitoring the product listing endpoint:

$ while true; do
    echo "$(date '+%H:%M:%S') - $(curl -s -o /dev/null -w '%{http_code}' \
      --connect-timeout 3 --max-time 5 \
      http://$ALB/products)"
    sleep 2
  done

Here is the actual output (the probe host's clock ran in local time, UTC+2, so these timestamps are two hours ahead of the database's UTC created_at values shown elsewhere):

23:17:27 - 200
23:17:29 - 502
23:17:31 - 502
23:17:33 - 200
23:17:35 - 200
23:17:38 - 000
23:17:45 - 000
23:17:52 - 200
23:17:54 - 200
23:17:56 - 200
...

Those 502s and 000s are real failures. (curl prints 000 when it gives up without receiving any HTTP response.) Our hypothesis said "no user-visible errors." That was wrong.

What Happened

Let's walk through the timeline.

Before 23:17:29. Both instances are healthy and every request returns 200. FIS then stops instance i-0ef097d855de7f9a9 in us-east-1a. The instance is shutting down, but the ALB does not know this yet. It still has two targets in its list.

23:17:29 to 23:17:31. The ALB routes two requests to the dead instance. Both come back as 502 Bad Gateway. The ALB tried to forward the request, the connection failed, and it returned a 502 to the caller.

23:17:33 to 23:17:35. Requests happen to land on the healthy instance in us-east-1b. We get 200s. This is luck, not design.

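23:17:38 and 23:17:45. Two requests routed to the dead instance get no response at all. curl reports 000, meaning it gave up before receiving any HTTP response. The 7-second gaps after these probes (the 5-second --max-time plus the 2-second sleep) suggest that curl hit its overall time limit rather than its 3-second connect timeout: the connection to the ALB was open, but the ALB was still waiting on the dead target. This is worse than a 502; the caller has no idea what happened.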

23:17:52 onward. All 200s. The ALB has detected the failed instance and stopped routing traffic to it.

The detection window (the time between the instance dying and the ALB removing it from rotation) is what caused the failures. Across the 60 probes we captured, 4 came back as failures (two 502s and two timeouts), all clustered between 23:17:29 and 23:17:45, a roughly 16-second window. That is about 7% of all probes, a minority, because the window was short relative to the whole run. Inside the window the picture is different: the ALB still considered both targets healthy and kept sending a share of requests to the dead one, so 4 of the 6 probes sent between 23:17:29 and 23:17:45 failed. The 2 that succeeded did so only because they happened to be routed to the healthy instance, which is why the failures were interspersed with 200s rather than forming one solid block of errors. A minority of requests failing is still a real outage for the customers who hit it.
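The window can also be derived mechanically from the probe output. The sketch below parses the ten lines printed above (the elided remainder of the run is not reproduced) and treats any code other than 200 as a failure. The parse and failure_window helpers are hypothetical names introduced for this example.

```python
from datetime import datetime

PROBE_LOG = """\
23:17:27 - 200
23:17:29 - 502
23:17:31 - 502
23:17:33 - 200
23:17:35 - 200
23:17:38 - 000
23:17:45 - 000
23:17:52 - 200
23:17:54 - 200
23:17:56 - 200
"""


def parse(log):
    """Turn 'HH:MM:SS - CODE' lines into (datetime, code) tuples."""
    probes = []
    for line in log.strip().splitlines():
        stamp, code = line.split(" - ")
        probes.append((datetime.strptime(stamp, "%H:%M:%S"), code))
    return probes


def failure_window(probes):
    """Return (failure count, seconds between the first and last failed probe)."""
    failed = [t for t, code in probes if code != "200"]
    if not failed:
        return 0, 0.0
    return len(failed), (failed[-1] - failed[0]).total_seconds()


probes = parse(PROBE_LOG)
count, span = failure_window(probes)
print(count, span)  # 4 16.0

# Seconds between consecutive probes; a slow or hung request shows up as a long gap.
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(probes, probes[1:])]
print(gaps)  # [2.0, 2.0, 2.0, 2.0, 3.0, 7.0, 7.0, 2.0, 2.0]
```

The two 7-second gaps line up with the two 000 results: each is the loop's 2-second sleep plus curl's 5-second --max-time.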

Think about what this means for Chaos Shop. If a customer was browsing products during that window, their page load would have failed. If they were placing an order, the POST would have returned a 502 or timed out. The customer would not know whether their order went through. Did the money get charged? Did the stock get decremented? They would have to refresh and check, and that is a bad experience.

Data Integrity Check

The failures during the detection window are concerning for request availability. But what about the data? Orders were placed before the experiment. Is the data still there?

$ curl -s http://$ALB/orders/1 | python3 -m json.tool
{
    "order_id": 1,
    "product_id": 3,
    "product_name": "Mechanical Keyboard",
    "quantity": 2,
    "total": 259.98,
    "status": "confirmed",
    ...
}

Order 1 is intact. The Mechanical Keyboard order we placed before the experiment survived the instance failure. This makes sense. The order data lives in Aurora, not on the EC2 instance. When the instance died, the database was unaffected.

$ curl -s http://$ALB/products/3 | python3 -m json.tool
{
    "id": 3,
    "name": "Mechanical Keyboard",
    "price": 129.99,
    "stock": 48
}

Stock is still 48. No corruption, no phantom decrements, no lost updates. The transactional integrity held.
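The two manual checks above could be turned into a repeatable assertion. The following is a sketch in Python, assuming the JSON responses have already been parsed into dictionaries; the helper name `find_mismatches` is hypothetical, and the values are copied from the responses shown above.

```python
import math

def find_mismatches(actual, expected):
    """Return human-readable differences for every key listed in `expected`."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key, "<missing>")
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems

# Parsed responses, copied from the curl output above (elided fields omitted).
order_after = {"order_id": 1, "product_id": 3, "product_name": "Mechanical Keyboard",
               "quantity": 2, "total": 259.98, "status": "confirmed"}
product_after = {"id": 3, "name": "Mechanical Keyboard", "price": 129.99, "stock": 48}

# What we recorded before the experiment.
expected_order = {"product_id": 3, "quantity": 2, "total": 259.98, "status": "confirmed"}
expected_product = {"price": 129.99, "stock": 48}

problems = (find_mismatches(order_after, expected_order)
            + find_mismatches(product_after, expected_product))

# Cross-check: the order total should equal price * quantity.
if not math.isclose(order_after["total"],
                    product_after["price"] * order_after["quantity"], abs_tol=0.005):
    problems.append("order total does not match price * quantity")

print("data intact" if not problems else "\n".join(problems))
```

Recording the expected values before the fault is injected is what makes the comparison meaningful; checking only after the fact proves little.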

The system recovers and keeps taking orders

The ASG launched a replacement instance, i-0a9a795f314c94f1e, back in us-east-1a to restore the fleet to two healthy instances. Meanwhile the application keeps accepting orders. Let's place one:

$ curl -s -X POST http://$ALB/orders \
    -H "Content-Type: application/json" \
    -d '{"product_id": 2, "quantity": 1}' | python3 -m json.tool
{
    "order_id": 4,
    "product_id": 2,
    "product_name": "USB-C Hub",
    "quantity": 1,
    "total": 34.99,
    "status": "confirmed",
    "served_by": "i-0a52ee26b5d49d79f",
    ...
}

Order 4 went through, served by the instance that stayed healthy. The order ID continues the sequence from before the failure, and once the replacement instance passes its health checks it rejoins the pool. The database state is consistent across the old and new instances.

The Gap in Our Thinking

The architecture diagram shows a clean failover: instance dies, ALB routes around it, done. The reality has a detection delay. The ALB is not psychic. It learns about failures through health checks, and health checks have intervals.

Look at the health check configuration again:

Health check interval: 10 seconds
Unhealthy threshold: 3 failed checks
Worst case detection time: 3 * 10 = 30 seconds

The ALB needs three consecutive failed health checks before it marks a target as unhealthy and stops sending it traffic. In our experiment, judging by the timestamps of the non-200 responses, the failures spanned roughly 16 seconds, from the first 502 at 23:17:29 to the last timeout at 23:17:45. We did not hit the 30-second worst case, but we still saw real failures.
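The arithmetic above is simple enough to keep next to the configuration. This Python sketch uses the same simplified formula as this article (interval times threshold); the function name is an assumption, and the formula deliberately ignores the per-check timeout and where in the interval the instance died.

```python
def worst_case_detection_seconds(interval_s, unhealthy_threshold):
    """Simplified upper bound: consecutive failed checks times the interval."""
    return interval_s * unhealthy_threshold

# Health check settings from this lab.
worst_case = worst_case_detection_seconds(interval_s=10, unhealthy_threshold=3)

# First 502 at 23:17:29 to last timeout at 23:17:45.
observed_window_s = 16

print(f"worst case: {worst_case} s, observed: {observed_window_s} s")
print(f"observed window used {observed_window_s / worst_case:.0%} of the worst case")
```

The observed window depends on where in the health check cycle the instance happened to die, so a rerun of the experiment may land anywhere below the bound.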

This is not a bug in the ALB or a misconfiguration. It is a fundamental property of health-check-based systems. You can tune it (shorter intervals, lower thresholds) but you cannot eliminate it entirely. Every choice is a trade-off:

Shorter health check intervals mean faster detection, but more load on your instances from health check traffic.
Lower unhealthy thresholds mean faster deregistration, but more risk of false positives (a single slow response marks the instance as dead).
Connection draining adds further time while in-flight requests complete.

What We Learned

Let's go back to our hypothesis and score it honestly.

What we got right:

The ASG did detect the failure and launched a replacement instance.
The system did self-heal and return to full capacity.
Traffic did rebalance across both AZs.
Aurora data was completely unaffected. All orders intact, all stock counts correct.
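The application kept taking orders after the failure. Order 4 went through on the surviving instance without issues, and the replacement rejoins the pool once it passes its health checks.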
Data integrity was maintained through the failure. No corruption, no lost writes.

What we got wrong:

There was a detection window where requests returned 502s and timeouts. Our hypothesis said "no user-visible errors." That was wrong.
The product listing was unreachable for some requests during the window. If a customer was placing an order at that exact moment, it would have failed.

This is why you run the experiment instead of just reading the architecture diagram. The diagram says "ALB routes around failures." The experiment says "ALB routes around failures, after a detection window where some requests fail."

What Would You Do in Production

Knowing about this detection window, you have options. We are not going to implement fixes here. That is what the later articles in this series are for. But it is worth listing what you would consider:

Tune health checks. Reduce the interval to 5 seconds and the unhealthy threshold to 2. That cuts worst-case detection time from 30 seconds (3 * 10) to 10 seconds (2 * 5). The trade-off is more sensitivity to transient issues.
Add more instances. With 4 instances instead of 2, the dead target receives only about 25% of routing attempts during the detection window instead of 50%. Fewer requests land on the failed instance, so the blast radius shrinks even if the window's duration is the same.
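Implement client-side retries. If the client retries on 502 with a brief backoff, it will likely hit the healthy instance on the next attempt. This masks the detection window from the end user. Retrying reads is safe. Retrying an order POST after a timeout is only safe if the API can detect duplicates, for example with an idempotency key, because the client cannot tell whether the first attempt went through.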
Use pre-warmed instances. The replacement took time to boot and initialize. A warm pool or pre-baked AMI could cut replacement time significantly.
Add observability. We monitored this experiment by watching curl output in a terminal. That does not scale. Dashboards, alarms, and automated stop conditions would make this a repeatable practice instead of a one-off exercise.
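As an illustration of the client-side retry option in the list above, here is a minimal Python sketch of retry with exponential backoff and jitter. It is not part of the lab. `send` is a hypothetical zero-argument callable that returns an HTTP status code or raises `TimeoutError` when no response arrives (the case curl reports as 000); the attempt count and delays are arbitrary assumptions.

```python
import random
import time

RETRYABLE = {502, 503, 504}

def call_with_retries(send, max_attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns the last status seen, or None if the final attempt timed out.
    Only use this for requests that are safe to repeat (for example GETs).
    """
    status = None
    for attempt in range(max_attempts):
        try:
            status = send()
        except TimeoutError:
            status = None
        if status is not None and status not in RETRYABLE:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return status

# Simulated responses: the first attempt hits the dead target, the second succeeds.
responses = iter([502, 200])
print(call_with_retries(lambda: next(responses), sleep=lambda s: None))  # 200
```

The `sleep` parameter is injectable only so the example can run without waiting; a real client would leave the default.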

None of these options appeared on the original architecture diagram. They only became apparent because we ran the experiment.

What Is Next

In this article, we tested what happens when a compute node dies. The database was fine because we did not touch it. But Aurora has its own failure modes, and those are more interesting, and more dangerous, than losing an EC2 instance.

In Article 2, we add observability: CloudWatch dashboards, alarms, and FIS stop conditions that automatically abort an experiment if things go wrong. We also cover how to structure chaos experiments as part of a regular testing practice instead of a one-off exercise.

In Article 3 and Article 4, we turn our attention to the database itself. What happens during an Aurora writer failover? How long does the application see errors? What happens to in-flight transactions? We will chaos test the database and implement resilience patterns: read replicas, RDS Proxy, write buffering, and circuit breakers. The database latency we measured today is our baseline, and we are going to see what happens when Aurora has a bad day.

Cleanup

This lab is not free. The main costs are the Aurora Serverless v2 cluster (starting at 0.5 ACU), the two NAT gateways (one per AZ, which roughly doubles that line item versus a single-NAT setup), and the ALB. Running the full stack for a few hours will cost roughly $2-4. If you leave it running for a day, expect about $8-10.

Tear it down when you are done:

terraform destroy

Confirm with yes when prompted. The destroy takes a few minutes as it drains the ALB and deletes the Aurora cluster.
