RPO (Recovery Point Objective) is the maximum amount of data, measured as a window of time, that a system can afford to lose in a failure. An RPO of zero means no data loss is acceptable; an RPO of five minutes means losing up to five minutes of recent writes is tolerable.
With asynchronous replication, your real RPO is whatever the replication lag is at the moment of failover. While replication is healthy, the lag is typically seconds; while it is impaired, the lag grows without bound. A failover while replication is impaired loses every write that has not yet replicated.
Chaos engineering measures real RPO by pausing replication, continuing to write, and counting exactly how many records the surviving copy cannot see.
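The measurement described above can be sketched with a toy in-memory model. This is a minimal illustration, not a real replication client: `AsyncReplicatedStore`, `measure_rpo`, the timestamps, and the `order-N` records are all hypothetical names and values chosen for the example. A real experiment would pause replication on the actual datastore and compare record counts across the two copies.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncReplicatedStore:
    """Toy primary/replica pair (hypothetical model, not a real database client)."""
    primary: list = field(default_factory=list)  # (timestamp, record) pairs
    replica: list = field(default_factory=list)
    replication_paused: bool = False

    def write(self, timestamp: float, record: str) -> None:
        # Writes always land on the primary first.
        self.primary.append((timestamp, record))
        if not self.replication_paused:
            self.replicate()

    def replicate(self) -> None:
        # Ship every record the replica has not seen yet.
        self.replica.extend(self.primary[len(self.replica):])


def measure_rpo(store: AsyncReplicatedStore) -> tuple:
    """Return (lost_records, lost_seconds) if failover happened right now."""
    missing = store.primary[len(store.replica):]
    if not missing:
        return 0, 0.0
    # The replica was last current as of its newest record
    # (or as of the first write, if it never received anything).
    last_synced_ts = store.replica[-1][0] if store.replica else store.primary[0][0]
    newest_ts = store.primary[-1][0]
    return len(missing), newest_ts - last_synced_ts


# Steady state: three writes, all replicated.
store = AsyncReplicatedStore()
for t in range(3):
    store.write(float(t), f"order-{t}")

# Inject the fault: pause replication, keep writing.
store.replication_paused = True
for t in range(3, 8):
    store.write(float(t), f"order-{t}")

# Count what the surviving copy cannot see.
lost_records, lost_seconds = measure_rpo(store)
print(f"lost {lost_records} records over a {lost_seconds:.0f}s window")

# Heal the fault and confirm the replica catches up.
store.replication_paused = False
store.replicate()
recovered = measure_rpo(store)
```

The function reports both a record count and a time window, because the RPO target is stated as time while the evidence from the experiment is a count of missing records.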
RTO (Recovery Time Objective) is the maximum length of time a system can be down after a failure before the impact becomes unacceptable. It defines how fast you must recover.
Chaos engineering is the practice of deliberately injecting failures into a system to discover weaknesses before they cause outages. You form a hypothesis about the system's steady state, then test it under real-world fault conditions.