RTO (Recovery Time Objective) is the target maximum downtime for a system after a disruption: how long recovery is allowed to take. An RTO of five minutes means the service must be restored within five minutes of an outage.
RTO is about time to recover; RPO (Recovery Point Objective) is about how much data you can lose. A disaster recovery plan states both. They are hypotheses until you test them.
Chaos engineering turns RTO from a number on a runbook into a measured value: trigger the failure, time the actual recovery, and compare it to the objective.
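As a rough illustration of "trigger the failure, time the actual recovery, and compare it to the objective", the measurement loop might look like the sketch below. The names `inject_failure`, `is_healthy`, `measure_recovery_seconds`, and `meets_rto` are hypothetical placeholders rather than part of any chaos tool. The sketch also assumes that `inject_failure` returns only once the service is actually down.

```python
import time
from typing import Callable


def measure_recovery_seconds(
    inject_failure: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_seconds: float,
    poll_interval_seconds: float = 1.0,
) -> float:
    """Trigger a failure and return the seconds until the health check passes.

    Assumption: inject_failure() returns only after the service is down,
    so the first passing health check marks the end of the outage.
    """
    inject_failure()
    started = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
    while True:
        healthy = is_healthy()
        elapsed = time.monotonic() - started
        if healthy:
            return elapsed
        if elapsed >= timeout_seconds:
            raise TimeoutError(f"no recovery within {timeout_seconds} s")
        time.sleep(poll_interval_seconds)


def meets_rto(measured_seconds: float, rto_seconds: float) -> bool:
    """Compare a measured recovery time with the objective."""
    return measured_seconds <= rto_seconds


if __name__ == "__main__":
    # Simulated service that recovers roughly 0.2 s after the failure.
    state = {"down_until": 0.0}

    def simulated_failure() -> None:
        state["down_until"] = time.monotonic() + 0.2

    def simulated_health_check() -> bool:
        return time.monotonic() >= state["down_until"]

    measured = measure_recovery_seconds(
        simulated_failure,
        simulated_health_check,
        timeout_seconds=5.0,
        poll_interval_seconds=0.05,
    )
    # Five-minute RTO, as in the example above.
    print(f"recovered in {measured:.2f} s; meets RTO: {meets_rto(measured, 300.0)}")
```

In a real experiment the two callables would wrap whatever fault injection and health probe the team already uses. The polling interval bounds the measurement error, so it should be small relative to the RTO.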
RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured in time: for example, five minutes of writes. It defines how recent your recovered data must be.
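A small worked example may make the "measured in time" part concrete. The data-loss window is the gap between the last write that can be recovered and the moment of failure. The function names and timestamps below are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recoverable_write: datetime, failure_time: datetime) -> timedelta:
    """Return how much recent data would be lost after restoring."""
    if failure_time < last_recoverable_write:
        raise ValueError("failure_time precedes last_recoverable_write")
    return failure_time - last_recoverable_write


def meets_rpo(loss: timedelta, rpo: timedelta) -> bool:
    """Compare an observed data-loss window with the objective."""
    return loss <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: last replicated write, then the outage.
    last_write = datetime(2024, 1, 1, 12, 0, 0)
    failure = datetime(2024, 1, 1, 12, 3, 30)
    loss = data_loss_window(last_write, failure)
    print(loss, meets_rpo(loss, timedelta(minutes=5)))
```

Here three and a half minutes of writes are lost, which is within a five-minute objective. The same failure with a last recoverable write at 11:50 would not be.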
Static stability is a resilience property in which a workload keeps operating during a failure using resources it already has, instead of relying on launching or changing resources during the failure (which is when control planes are most likely to be impaired).
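One common way to rely only on "resources it already has" is to provision enough capacity ahead of time that the surviving failure domains can carry peak load on their own. The sizing rule below is an illustrative assumption rather than something the definition prescribes, and `instances_per_zone` is a hypothetical helper.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in each zone so that peak load still fits
    after `zones_lost` zones fail, without launching anything new."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


if __name__ == "__main__":
    # Hypothetical workload: 12 instances at peak, spread over 3 zones.
    per_zone = instances_per_zone(peak_instances=12, zones=3)
    print(per_zone, per_zone * 3)
```

With three zones and a tolerance of one lost zone, each zone runs 6 instances, 18 in total, or 150% of peak. The extra capacity is the price of not depending on a scaling action in the middle of an outage.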
Chaos engineering is the practice of deliberately injecting failures into a system to discover weaknesses before they cause outages, by forming a hypothesis about steady state and testing it under real-world fault conditions.