Your AWS Recovery Plan Is Attacking the Wrong Layer
By Andrew Goifeld | 2026-02-27 | 27 min read
The AWS Status page reads "We are experiencing API latencies" and your application is down. Teams across the organisation are watching the dashboard, waiting for the all-clear. No one is taking action because the assumption is simple: When AWS has fixed their issue, that's when we will start fixing our issues.
That assumption is understandable — and nearly universal. But it rests on a premise that doesn't hold: that your recovery has to wait for theirs.
What if you didn't have to wait?
Teams who understand the control plane and data plane distinction don't experience AWS events the same way. The status page is still watched — but the question changes from "when will this be over?" to "which of our services are actually at risk, and which are already safe?"
AWS service impairments are just that — impairments to specific operations within specific fault isolation boundaries (Zonal, Regional or, in a select few cases, global).
The key insight is that most impairments affect control plane operations (creating, modifying, deleting resources) while the data plane (serving traffic through already-provisioned resources) continues to operate normally. If your architecture has been designed with high availability and resilient patterns — pre-provisioned capacity, health-checked endpoints, static stability — customer-facing outages can be avoided or significantly minimised during the vast majority of AWS service events.
This is not a theoretical claim. AWS publishes its fault isolation model, and the Well-Architected Framework explicitly recommends against relying on control plane operations during recovery.
The evidence is in AWS's own documentation — the Fault Isolation Boundaries whitepaper and the Well-Architected Framework both confirm this directly. The next sections walk through the mechanics, using EC2 as a concrete example. (If you want to verify these claims independently, AI prompts and documentation links are included at the end of this article.)
What the Well-Architected Framework Says About Control Plane Dependencies
The AWS Well-Architected Framework addresses this directly in the Reliability pillar, specifically in REL 11: "How do you design your workload to withstand component failures?"
One of the best practices under REL 11 is to avoid relying on the control plane during recovery. The reasoning is straightforward: if the control plane is impaired (which is the most common failure mode during AWS service events), any recovery procedure that depends on control plane operations will also be impaired.
The diagram below illustrates how REL 11 maps control plane dependencies to recovery risk:
Control Plane vs Data Plane: What Happens When You Launch an EC2 Instance
Before diving into the diagrams, here are crisp definitions:
- Control plane: The set of AWS APIs and orchestration systems that create, configure, modify and delete resources. Examples: RunInstances, CreateDBInstance. These appear as mutating (write) calls in CloudTrail.
- Data plane: The runtime path that serves traffic through already-provisioned resources. Examples: an EC2 instance processing HTTP requests, an RDS database serving queries, a Lambda function handling invocations.
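A quick way to internalise the split: in CloudTrail, control plane calls are the mutating ones. The heuristic below is a minimal sketch; the verb-prefix list is illustrative, not an official AWS taxonomy (CloudTrail's readOnly flag is the authoritative signal, covered later in this article).

```python
# Rough heuristic: control plane (mutating) APIs follow Verb+Noun naming,
# while data plane traffic (an instance serving HTTP) never appears in
# CloudTrail management events at all. Prefix list is illustrative only.
MUTATING_PREFIXES = ("Create", "Delete", "Modify", "Update", "Run",
                     "Terminate", "Put", "Attach", "Detach", "Start", "Stop")

def is_control_plane_mutation(event_name: str) -> bool:
    """True if the CloudTrail event name looks like a mutating (write) call."""
    return event_name.startswith(MUTATING_PREFIXES)

print(is_control_plane_mutation("RunInstances"))       # control plane
print(is_control_plane_mutation("DescribeInstances"))  # read-only
```

Note that a Describe or Get call is still a control plane *read*; only the data plane itself (the instance processing requests) sits outside the API layer entirely.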
Now consider the following "highly simplified" flow chart of what happens when you start an AWS EC2 instance. This is a control plane operation:
Every step in this process can fail: you may run out of IP addresses in the subnet, you may request an incorrect subnet, you may receive an InsufficientInstanceCapacity error, or you may exceed a service quota you have never encountered before. By design, the orchestration requires this level of complexity to successfully fulfil the API request.
Now consider what happens once the instance is online and serving traffic. This is the data plane:
The contrast is stark. The data plane path has far fewer dependencies and far fewer failure modes. The instance processes requests, reads and writes to its EBS volume, and communicates over the network — all without requiring any further orchestration from the control plane.
This is precisely why AWS recommends that you do not rely on control plane operations during recovery. If you can recover using only the data plane (for example, by failing over to pre-provisioned standby instances rather than launching new ones), your blast radius during a control plane impairment drops to near zero.
Concrete patterns for control plane independence:
- Pre-provisioned capacity: Maintain warm standby instances rather than relying on auto-scaling to launch new ones during an incident
- Static stability: Design systems that continue to operate with their current resource allocation even if no changes can be made
- Health-based routing: Use Route 53 health checks or ALB target group health to shift traffic away from impaired resources without modifying infrastructure
- Avoid mutating operations in recovery runbooks: Review your DR procedures and remove any step that calls a create, modify or delete API.
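The last bullet, reviewing runbooks for mutating calls, can be partially mechanised. A minimal sketch, assuming runbook steps are available as plain text; the verb list is illustrative and not exhaustive, so treat hits as candidates for review rather than a complete audit:

```python
import re

# Illustrative, not exhaustive: AWS mutating APIs follow Verb+Noun naming.
MUTATING_VERBS = ("Create", "Delete", "Modify", "Update", "Run",
                  "Terminate", "Put", "Attach", "Detach", "Reboot")
API_CALL = re.compile(r"\b(?:%s)[A-Z][A-Za-z]+" % "|".join(MUTATING_VERBS))

def flag_control_plane_steps(runbook_steps):
    """Map each step that mentions a mutating API to the API names found."""
    return {step: calls
            for step in runbook_steps
            if (calls := API_CALL.findall(step))}

steps = [
    "Shift traffic via Route 53 health check failover",  # data plane only
    "Call RunInstances to launch replacement capacity",  # control plane
    "ModifyDBInstance to promote the standby",           # control plane
]
print(flag_control_plane_steps(steps))  # flags the last two steps
```

Every flagged step deserves the question from the bullet above: is there a data-plane-only alternative?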
This is probably the hardest pattern to swallow, and it can catch people off guard. Pre-provisioned resources cost money to keep running, so careful judgement is needed to strike the right risk-management balance. It is not one-size-fits-all.
Why Most Teams Overlook This Foundational Concept
If this is already well understood in your organisation, that is genuinely excellent. This concept doesn't appear in most AWS certifications or training curricula. It emerges from operational experience — from teams who have been through a significant service event and came out the other side asking hard questions. At Resilera, I encounter this gap in almost every engagement, and the pattern is consistent across industries.
Three cognitive biases reinforce the pattern:
- Status quo bias: The existing architecture feels safe. But doing nothing is itself a decision — it's choosing to accept increased blast radius with each new service dependency.
- Diffusion of responsibility: Every team assumes another team owns the response. Name the person in your organisation who owns this question. If you can't, it defaults to you.
- Normalcy bias: The last three impairments were minor, so the next one will be too. Except the next one is statistically unlikely to look like the last — that's the nature of tail risk.
I believe understanding the control plane and data plane distinction is a prerequisite for any meaningful DR, HA or BC discussion. Without it, disaster recovery plans may include steps that depend on the very systems that are impaired, turning a service event into a prolonged outage.
For organisations subject to regulatory frameworks such as CPS 230 (APRA's operational resilience standard), this distinction is directly relevant. CPS 230 requires entities to maintain critical operations through severe disruption scenarios — exactly the kind of scenario where control plane availability cannot be assumed. See Article 44 under the Testing and Review section of Prudential Standard CPS 230 Operational Risk Management.
Organisations with a solid grasp of this concept can then take the next step: using AWS Fault Injection Service to simulate control plane impairments and validate that their recovery procedures work without mutating API calls. This is how you move from theoretical resilience to proven resilience.
What to do Monday morning
These three steps will tell you, within one working day, whether your organisation can recover from an AWS service event without waiting for AWS to fix it.
- Before your first meeting tomorrow (~20 min): Open CloudTrail and filter for readOnly = false over the last 14 days. Read-only access is enough — request it as a security audit task if you don't have it. Document every mutating call and tag it as "required for recovery" or "routine".
- Block 30 minutes this week with your risk lead: Map your DR runbook steps against control plane/data plane. Flag every step that calls a mutating API. For each, ask: is there a data-plane-only alternative? If raising this feels uncomfortable, frame it as a proactive audit prompted by an article — not a gap you found. That framing is accurate and removes the social cost.
- At your next team standup (~5 min): Share the one-sentence version: "During AWS impairments, the control plane is what breaks — the data plane keeps running. Our recovery should not depend on the part that breaks." Point them to this article and the AWS Fault Isolation Boundaries whitepaper. Sharing this article with your team is itself the action — it opens the conversation without requiring anyone to admit they didn't know.
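The CloudTrail pass can be scripted against exported events rather than clicked through. A sketch, assuming you have saved the output of `aws cloudtrail lookup-events` to JSON; the per-event record carries a nested CloudTrailEvent JSON string whose readOnly field is the filter the article describes. The tagging of each API as "required for recovery" or "routine" remains a human judgement call.

```python
import json

def mutating_calls(events):
    """Count CloudTrail events with readOnly=false per API name.

    This is the raw input for the 'required for recovery' vs 'routine'
    tagging exercise; read-only calls are skipped entirely.
    """
    counts = {}
    for ev in events:
        detail = json.loads(ev["CloudTrailEvent"])  # nested JSON string
        if not detail.get("readOnly", True):
            name = ev["EventName"]
            counts[name] = counts.get(name, 0) + 1
    return counts

# Sample shaped like `aws cloudtrail lookup-events` output:
sample = [
    {"EventName": "RunInstances",
     "CloudTrailEvent": json.dumps({"readOnly": False})},
    {"EventName": "DescribeInstances",
     "CloudTrailEvent": json.dumps({"readOnly": True})},
]
print(mutating_calls(sample))  # {'RunInstances': 1}
```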
If the CloudTrail audit raises questions about your architecture, we can help.
Verify the claims and audit your architecture
The claims in this article can be verified independently. Use the prompts below with any AI assistant, or go directly to the AWS documentation linked throughout.
Verify the core claim:
I'm evaluating whether AWS workloads can remain operational during
a service event reported on the AWS Health Dashboard. The claim is that
control plane impairments don't necessarily affect the data plane, and
architectures with static stability and pre-provisioned capacity can
avoid customer-facing outages during most events.
Please:
a) Confirm or refute this claim using AWS documentation
b) Explain control plane vs data plane for EC2, RDS, and Lambda
c) How would I find control plane actions in my account (CloudTrail)?
d) What is static stability and how does it apply here?
Audit your control plane usage in CloudTrail:
I am an AWS user. I want to evaluate my control plane utilisation for the last 2 weeks. My resources are tagged with "[YOUR_TAG_KEY=YOUR_TAG_VALUE]".
Please help me with:
1. The exact IAM permissions required to query CloudTrail (both console access and CloudTrail Lake)
2. AWS CLI commands to list mutating API calls (readOnly=false) filtered by my resource tags
3. Step-by-step Console instructions to navigate to CloudTrail and CloudTrail Lake
4. 10 sample CloudTrail Lake queries I can use to understand my control plane usage patterns, including:
- All RunInstances / TerminateInstances calls
- All CreateStack / UpdateStack / DeleteStack calls
- All mutating calls grouped by event source (service)
- All mutating calls made by auto-scaling
- All mutating calls during a specific time window (e.g. during an incident)
Review your DR runbook for control plane dependencies:
I have a disaster recovery runbook for my AWS workload. I want to identify every step that depends on the AWS control plane (i.e. any step that creates, modifies, or deletes a resource).
Here is my DR runbook:
[PASTE YOUR RUNBOOK HERE]
Please:
1. Identify every control plane dependency in the runbook (any step that calls a mutating AWS API)
2. For each dependency, explain the risk: what happens if this API call fails during a service impairment?
3. Suggest a data-plane-only alternative where possible (e.g. pre-provisioned standby instead of launching new instances)
4. Rate the overall runbook on a scale of 1-5 for control plane independence
5. Provide a revised version of the runbook that minimises control plane dependencies
Frequently asked questions
Q: What is the difference between the AWS control plane and data plane?
The control plane is the set of APIs and systems used to create, modify and delete AWS resources (e.g. launching an EC2 instance, creating a load balancer). The data plane is the runtime path that serves traffic after those resources exist (e.g. an already-running EC2 instance processing requests). During AWS service impairments, control plane operations are far more likely to be affected than data plane operations.
Q: Does this apply to managed services like RDS and Lambda or only EC2?
Every AWS service has both a control plane and a data plane. RDS provisioning a new database is a control plane operation; an existing RDS instance serving queries is data plane. Lambda creating a new function is control plane; an already-deployed function handling invocations is data plane. The principle applies across all services: reduce your dependency on control plane operations during recovery.
Q: How do I test whether my architecture depends on the control plane during recovery?
Start by reviewing your disaster recovery runbooks. Open AWS CloudTrail and filter for mutating API calls (readOnly = false) during your normal operating hours. If your recovery procedures require these calls to succeed, your architecture has a control plane dependency that could delay recovery during an impairment.
Q: What happens to running resources during an AWS service impairment?
More often than not, running resources continue to operate normally. The data plane (network traffic, storage I/O, compute processing) is independent of the control plane. The most probable impact is therefore an inability to process provisioning requests, such as modifying security groups, spinning up instances or making other configuration changes, until recovery is announced or observed.
Q: Focusing on EC2, does auto-scaling create a control plane dependency during recovery?
Yes. Auto-scaling policies that launch new instances in response to load or failure are control plane operations. During an EC2 control plane impairment, or during a workload failure, these scaling actions may fail or be delayed. The control plane does not even need to be impaired for failure to occur here: a common example is the InsufficientInstanceCapacity error. A request for 30 instances in a specific AZ is not guaranteed to be fulfilled 100% of the time.
Our advice for recovery here is to consider a number of mitigating strategies, such as pre-provisioning resources or active load-shedding. Which strategy is suitable must be tested and verified workload by workload. That said, the most important thing to know is that each carries real trade-offs (cost versus complexity), and only actively reasoning about them can prepare organisations for a rapid recovery.
Q: How does the control plane and data plane distinction relate to compliance frameworks like CPS 230?
In Australia, APRA's CPS 230 states: "The testing program must be tailored to the material risks of the APRA-regulated entity and include a range of severe but plausible scenarios, including disruptions to services provided by material service providers and scenarios where contingency arrangements are required."
An AWS control plane impairment is precisely such a scenario. If your recovery procedures depend on control plane operations, you may not be able to demonstrate continuity of critical operations as required by the standard. Mapping your control plane dependencies and implementing data-plane-only recovery alternatives directly supports CPS 230 compliance.
The decision you're already making
Every AWS service event forces a decision: wait for AWS, or act on your own architecture. Most organisations choose to wait — not because they've evaluated the options, but because they haven't examined the question.
The control plane and data plane distinction gives you the vocabulary to ask it. CloudTrail gives you the evidence. And the answer, more often than not, is that you have more control than you thought.
Resilera runs control plane dependency audits and DR runbook reviews for AWS workloads. Request a conversation — no obligation.
Andrew Goifeld is the founder of Resilera, an AWS resilience consultancy. He spent over a decade at Amazon, where he worked across infrastructure, operations and resilience engineering. Resilera helps organisations design architectures that survive AWS service events without waiting for AWS to fix them.