
Your AWS Recovery Plan Is Attacking the Wrong Layer

Andrew Goifeld 27 min read
aws control-plane data-plane high-availability disaster-recovery resilience well-architected cloudtrail fault-isolation

The AWS Status page reads "We are experiencing API latencies" and your application is down. Teams across the organisation are watching the dashboard, waiting for the all-clear. No one is taking action because the assumption is simple: when AWS has fixed its issue, we will start fixing ours.

That assumption is understandable and nearly universal. But it rests on a premise that doesn't hold: that your recovery has to wait for theirs.

What if you didn't have to wait?

Teams who understand the control plane and data plane distinction don't experience AWS events the same way. The status page is still watched, but the question changes from "when will this be over?" to "which of our services are actually at risk, and which are already safe?"

AWS service impairments are just that: impairments to specific operations within specific fault isolation boundaries. They may be zonal, regional or, in a few cases, global.

The key insight is that most impairments affect control plane operations (creating, modifying, deleting resources) while the data plane (serving traffic through already provisioned resources) continues to operate normally. If your architecture has been designed with high availability and resilient patterns, including capacity provisioned ahead of time, health checked endpoints and static stability, customer facing outages can be avoided or significantly minimised during the vast majority of AWS service events.

This is not a theoretical claim. AWS publishes its fault isolation model in the Fault Isolation Boundaries whitepaper, and the Well-Architected Framework explicitly recommends against relying on control plane operations during recovery. The next sections walk through the mechanics, using EC2 as a concrete example. (If you want to verify these claims independently, AI prompts and documentation links are included at the end of this article.)

What the Well-Architected Framework Says About Control Plane Dependencies

The AWS Well-Architected Framework addresses this directly in the Reliability pillar, specifically in REL 11: "How do you design your workload to withstand component failures?"

One of the best practices under REL 11 is to avoid relying on the control plane during recovery. The reasoning is straightforward: if the control plane is impaired (which is the most common failure mode during AWS service events), any recovery procedure that depends on control plane operations will also be impaired.

The diagram below illustrates how REL 11 maps control plane dependencies to recovery risk:

Control Plane vs Data Plane: What Happens When You Launch an EC2 Instance

Before diving into the diagrams, here are crisp definitions:

  • Control plane: The set of AWS APIs and orchestration systems that create, configure, modify and delete resources. Examples: RunInstances, CreateDBInstance. These appear as mutating (write) calls in CloudTrail.
  • Data plane: The runtime path that serves traffic through already provisioned resources. Examples: an EC2 instance processing HTTP requests, an RDS database serving queries, a Lambda function handling invocations.

Now consider the following simplified flow chart of what happens when you launch an Amazon EC2 instance. This is a control plane operation:

Flow chart (simplified):
User request -> Receive StartInstance API Request -> Validate Request -> Is Request Valid?
  No: Return Error Response -> Log Error -> End Process
  Yes: Provision Resources -> Allocate Elastic IP -> Configure Security Groups -> Launch EC2 Instance -> Send Confirmation -> End Process

Every step in this process can fail: you may run out of IP addresses in the subnet, request an incorrect subnet, receive an InsufficientInstanceCapacity error or exceed a service quota you have never encountered before. The orchestration is complex by design because the API request has to coordinate multiple systems successfully.
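A rough back-of-envelope model shows why a multi-step orchestration is more fragile than any single step suggests. The sketch below assumes each step succeeds independently with a probability you choose. The step names loosely mirror the simplified flow chart, and the probabilities are illustrative placeholders, not AWS figures.

```python
from math import prod

# Hypothetical per-step success probabilities during an impairment.
# These numbers are illustrative assumptions, not published AWS figures.
launch_steps = {
    "validate request": 0.99,
    "provision resources": 0.95,
    "allocate IP address": 0.98,
    "configure security groups": 0.98,
    "launch instance": 0.95,
}


def chain_success(step_probabilities):
    """Probability that every step in a serial chain succeeds,
    assuming the steps fail independently of each other."""
    return prod(step_probabilities)


overall = chain_success(launch_steps.values())
print(f"Chance the whole launch succeeds: {overall:.1%}")
```

With these placeholder values the chain succeeds roughly 86% of the time, even though no single step is below 95%. The independence assumption is a simplification: real failures are often correlated, which is another reason to keep recovery off this path.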

Now consider what happens once the instance is online and serving traffic. This is the data plane:

[Diagram: User/Application/EC2 <-> EC2: send data, receive data]

The contrast is stark. The data plane path has far fewer dependencies and far fewer failure modes. The instance processes requests, reads and writes to its EBS volume and communicates over the network without requiring further orchestration from the control plane.

This is precisely why AWS recommends that you do not rely on control plane operations during recovery. If you can recover using only the data plane (for example, by failing over to standby instances provisioned ahead of time rather than launching new ones), your blast radius during a control plane impairment drops to near zero.
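As a minimal sketch of what "recover using only the data plane" can look like, the example below picks traffic targets from a fixed pool of already-running instances. The `Target` type, the instance names and the fail-open choice are assumptions made for illustration. In practice this decision is made by a load balancer or DNS health check rather than your own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str      # hypothetical identifier for an already-running instance
    healthy: bool  # result of the most recent health check


def routable_targets(pool):
    """Return the targets that should receive traffic.

    The pool is fixed: nothing is launched, modified or deleted here,
    so the decision needs no control plane call.
    """
    healthy = [t for t in pool if t.healthy]
    # Fail open (an assumption of this sketch): if every check fails,
    # keep routing to the full pool rather than dropping all traffic.
    return healthy or list(pool)


pool = [
    Target("primary-a", healthy=False),
    Target("standby-b", healthy=True),
    Target("standby-c", healthy=True),
]
print([t.name for t in routable_targets(pool)])
```

The point of the design is that failover becomes a routing decision over resources that already exist, not a provisioning request that has to succeed under pressure.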

Concrete patterns for control plane independence:

  • Capacity provisioned ahead of time: Maintain warm standby instances rather than relying on Auto Scaling to launch new ones during an incident
  • Static stability: Design systems that continue to operate with their current resource allocation even if no changes can be made
  • Health based routing: Use Route 53 health checks or ALB target group health to shift traffic away from impaired resources without modifying infrastructure
  • Avoid mutating operations in recovery runbooks: Review your DR procedures and remove any step that calls a create, modify or delete API. This is often the hardest part to accept because provisioned standby resources carry cost. Teams need to test and reason through the right risk management approach for each workload. It is not one size fits all.
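The runbook review in the last pattern can be given a head start with a crude text scan. The sketch below flags runbook lines that name an API whose verb looks mutating. The prefix list is a heuristic assumption, not an AWS classification, and the sample runbook is invented. Expect false positives and misses, and treat CloudTrail as the source of truth.

```python
import re

# Verb prefixes treated as mutating. This list is a heuristic assumption;
# extend it for the services you use and confirm against CloudTrail.
MUTATING_PREFIXES = (
    "Create", "Run", "Start", "Stop", "Terminate", "Delete",
    "Modify", "Update", "Attach", "Detach", "Restore", "Promote",
)

# Matches CamelCase tokens such as RunInstances or CreateDBInstance.
API_NAME = re.compile(r"\b[A-Z][a-z]+[A-Z][A-Za-z0-9]*\b")


def find_mutating_steps(runbook_text):
    """Return (line_number, api_name) for each line naming a mutating API."""
    findings = []
    for number, line in enumerate(runbook_text.splitlines(), start=1):
        for name in API_NAME.findall(line):
            if name.startswith(MUTATING_PREFIXES):
                findings.append((number, name))
    return findings


# Invented sample runbook, for illustration only.
sample_runbook = """\
1. Check Route 53 health check status
2. Call RunInstances to launch replacement web servers
3. Use DescribeInstances to confirm state
4. Call PromoteReadReplica on the standby database"""

for line_number, api in find_mutating_steps(sample_runbook):
    print(f"line {line_number}: {api} depends on the control plane")
```

On the sample, steps 2 and 4 are flagged. Each flagged step is a prompt for the question this article keeps returning to: is there an alternative provisioned ahead of time?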

Why Most Teams Overlook This Foundational Concept

If this is already well understood in your organisation, that is genuinely excellent. This concept doesn't appear in most AWS certifications or training curricula. It emerges from operational experience, from teams who have been through a significant service event and come out the other side asking hard questions. At Resilera, I encounter this gap in almost every engagement, and the pattern is consistent across industries.

Three cognitive biases reinforce the pattern:

  • Status quo bias: The existing architecture feels safe. But doing nothing is itself a decision. It means accepting increased blast radius with each new service dependency.
  • Diffusion of responsibility: Every team assumes another team owns the response. Name the person in your organisation who owns this question. If you can't, it defaults to you.
  • Normalcy bias: The last three impairments were minor, so the next one will be too. Except the next one is statistically unlikely to look like the last. That is the nature of tail risk.

I believe understanding the control plane and data plane distinction is a prerequisite for any meaningful DR, HA or BC discussion. Without it, disaster recovery plans may include steps that depend on the very systems that are impaired, turning a service event into a prolonged outage.

For organisations subject to regulatory frameworks such as CPS 230 (APRA's operational resilience standard), this distinction is directly relevant. CPS 230 requires entities to maintain critical operations through severe disruption scenarios, exactly the kind of scenario where control plane availability cannot be assumed. See Article 44 under the Testing and Review section in the Prudential Standard CPS 230 Operational Risk Management.

Organisations that have a solid grasp of this concept can then take the next step: using AWS Fault Injection Service to simulate control plane impairments and validate that their recovery procedures work without mutating API calls. This is how you move from theoretical resilience to proven resilience.
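Before running a real experiment, the same idea can be rehearsed locally. The sketch below is not AWS Fault Injection Service and does not call it. It is a hypothetical test double that rejects any call that is not an obvious read, so a scripted runbook fails fast if it hides a mutating step. The class names, the stub client and the snake_case method naming (borrowed from boto3 conventions) are all assumptions.

```python
class ControlPlaneImpaired(Exception):
    """Raised when a runbook step attempts a mutating call."""


class ImpairedControlPlane:
    """Hypothetical test double wrapping any client object.

    Read-style methods pass through; anything else raises. Real
    impairments can affect read APIs too; this sketch only blocks
    mutating calls.
    """

    READ_PREFIXES = ("describe_", "get_", "list_")

    def __init__(self, client):
        self._client = client

    def __getattr__(self, name):
        if name.startswith(self.READ_PREFIXES):
            return getattr(self._client, name)

        def rejected(*args, **kwargs):
            raise ControlPlaneImpaired(f"{name} is unavailable")

        return rejected


class StubEc2:
    """Stand-in for a real client so the sketch runs without AWS access."""

    def describe_instances(self):
        return {"Reservations": []}

    def run_instances(self, **kwargs):
        return {"Instances": []}


def recovery_runbook(ec2):
    """Hypothetical runbook with a hidden control plane dependency."""
    ec2.describe_instances()
    ec2.run_instances(MinCount=1, MaxCount=1)


try:
    recovery_runbook(ImpairedControlPlane(StubEc2()))
except ControlPlaneImpaired as exc:
    print(f"Runbook depends on the control plane: {exc}")
```

A runbook that completes against the wrapper has at least no scripted mutating calls. It says nothing about manual console steps, which still need a human review.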

What to do Monday morning

These three steps will tell you, within one working day, whether your organisation can recover from an AWS service event without waiting for AWS to fix it.

  1. Before your first meeting tomorrow (~20 min): Open CloudTrail, filter readOnly = false and review the last 14 days. Read only access is enough. Request it as a security audit task if you don't have it. Document every mutating call and tag as "required for recovery" or "routine".

  2. Block 30 minutes this week with your risk lead: Map DR runbook steps against control plane/data plane. Flag every step that calls a mutating API. For each, ask: is there a data plane only alternative? If raising this feels uncomfortable, frame it as a proactive audit prompted by an article, not a gap you found. That framing is accurate and removes the social cost.

  3. At your next team standup (~5 min): Share the one sentence version: "During AWS impairments, the control plane is what breaks. The data plane keeps running. Our recovery should not depend on the part that breaks." Point them to this article and the AWS Fault Isolation Boundaries whitepaper. Sharing this article with your team is itself the action. It opens the conversation without requiring anyone to admit they didn't know.
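If you prefer a script to the console for step 1, the sketch below is one way to pull the same data. It assumes boto3 is installed, that your credentials allow cloudtrail:LookupEvents, and that CloudTrail event history (management events, per Region) covers the window you care about. The function names are my own.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def fetch_mutating_events(days=14, region_name=None):
    """Yield management events with readOnly = false from event history.

    Assumes boto3 is installed and credentials allow cloudtrail:LookupEvents.
    """
    import boto3  # imported lazily so summarise() works without it

    client = boto3.client("cloudtrail", region_name=region_name)
    end = datetime.now(timezone.utc)
    paginator = client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "ReadOnly", "AttributeValue": "false"}
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
    )
    for page in pages:
        yield from page["Events"]


def summarise(events):
    """Count mutating calls per (event source, event name)."""
    return Counter((e["EventSource"], e["EventName"]) for e in events)
```

Calling summarise(fetch_mutating_events()).most_common(20) would list the twenty most frequent mutating calls in the current Region. Tagging each one as "required for recovery" or "routine" remains a manual judgement, and the lookup is rate limited, so a busy account may take a while to page through.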

If the CloudTrail audit raises questions about your architecture, we can help.

Verify the claims and audit your architecture

The claims in this article can be verified independently. Use the prompts below with any AI assistant, or go directly to the AWS documentation linked throughout.

Verify the core claim:

I'm evaluating whether AWS workloads can remain operational during
a service event reported on the AWS Health Dashboard. The claim is that
control plane impairments don't necessarily affect the data plane, and
architectures with static stability and capacity provisioned ahead of time can
avoid customer facing outages during most events.

Please:
a) Confirm or refute this claim using AWS documentation
b) Explain control plane vs data plane for EC2, RDS, and Lambda
c) How would I find control plane actions in my account (CloudTrail)?
d) What is static stability and how does it apply here?

Audit your control plane usage in CloudTrail:

I am an AWS user. I want to evaluate my control plane utilisation for the last 2 weeks. My resources are tagged with "[YOUR_TAG_KEY=YOUR_TAG_VALUE]".

Please help me with:
1. The exact IAM permissions required to query CloudTrail (both console access and CloudTrail Lake)
2. AWS CLI commands to list mutating API calls (readOnly=false) filtered by my resource tags
3. Step by step console instructions to navigate to CloudTrail and CloudTrail Lake
4. 10 sample CloudTrail Lake queries I can use to understand my control plane usage patterns, including:
   - All RunInstances / TerminateInstances calls
   - All CreateStack / UpdateStack / DeleteStack calls
   - All mutating calls grouped by event source (service)
   - All mutating calls made by Auto Scaling
   - All mutating calls during a specific time window (e.g. during an incident)

Review your DR runbook for control plane dependencies:

I have a disaster recovery runbook for my AWS workload. I want to identify every step that depends on the AWS control plane (i.e. any step that creates, modifies, or deletes a resource).

Here is my DR runbook:
[PASTE YOUR RUNBOOK HERE]

Please:
1. Identify every control plane dependency in the runbook (any step that calls a mutating AWS API)
2. For each dependency, explain the risk: what happens if this API call fails during a service impairment?
3. Suggest a data plane only alternative where possible (e.g. standby provisioned ahead of time instead of launching new instances)
4. Rate the overall runbook on a scale of 1-5 for control plane independence
5. Provide a revised version of the runbook that minimises control plane dependencies

Frequently asked questions

Q: What is the difference between the AWS control plane and data plane?

The control plane is the set of APIs and systems used to create, modify and delete AWS resources (e.g. launching an EC2 instance, creating a load balancer). The data plane is the runtime path that serves traffic after those resources exist (e.g. an already running EC2 instance processing requests). During AWS service impairments, control plane operations are far more likely to be affected than data plane operations.

Q: Does this apply to managed services like RDS and Lambda or only EC2?

Every AWS service has both a control plane and a data plane. RDS provisioning a new database is a control plane operation; an existing RDS instance serving queries is data plane. Lambda creating a new function is control plane; an already deployed function handling invocations is data plane. The principle applies across all services: reduce your dependency on control plane operations during recovery.

Q: How do I test whether my architecture depends on the control plane during recovery?

Start by reviewing your disaster recovery runbooks. Open AWS CloudTrail and filter for mutating API calls (readOnly = false) during your normal operating hours. If your recovery procedures require these calls to succeed, your architecture has a control plane dependency that could delay recovery during an impairment.

Q: What happens to running resources during an AWS service impairment?

More often than not, running resources continue to operate normally. The data plane (network traffic, storage I/O, compute processing) is designed to operate independently of the control plane. The most probable outcome is therefore that change requests, such as modifying security groups, launching instances or making other configuration changes, cannot be processed until recovery is announced or observed.

Q: Focusing on EC2, does Auto Scaling create a control plane dependency during recovery?

Yes. Auto Scaling policies that launch new instances in response to load or failure are control plane operations. During an EC2 control plane impairment or a workload failure, these scaling actions may fail or be delayed. The control plane does not need to be impaired for this to matter. A common example is InsufficientInstanceCapacity. Requesting 30 instances in a specific AZ does not guarantee they will be provisioned every time.

Our advice for recovery is to consider mitigations such as capacity provisioned ahead of time or active load shedding. The right approach depends on workload testing and verification. The key point is that each option has trade offs between cost and complexity, and teams need to reason about those trade offs before recovery is under pressure.
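The cost side of capacity provisioned ahead of time is easy to estimate. The sketch below works out how many instances to keep running in each Availability Zone so that peak load is still covered after losing a zone, with no new launches. It assumes instances are interchangeable and load spreads evenly across zones. The function name and parameters are illustrative.

```python
from math import ceil


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to keep running in each AZ so that peak load is still
    covered after losing `az_failures_tolerated` zones, with no new launches."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    return ceil(peak_instances / surviving)


per_az = instances_per_az(peak_instances=30, az_count=3)
print(f"{per_az} per AZ, {per_az * 3} in total")
```

For a peak of 30 instances across three AZs, this gives 15 per AZ and 45 in total, which is 50% more than the steady-state need. That overhead is the trade off to weigh against the risk of a launch request failing when you need it most.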

Q: How does the control plane and data plane distinction relate to compliance frameworks like CPS 230?

In Australia, APRA's Prudential Standard CPS 230 states: "The testing program must be tailored to the material risks of the APRA-regulated entity and include a range of severe but plausible scenarios, including disruptions to services provided by material service providers and scenarios where contingency arrangements are required."

An AWS control plane impairment is precisely such a scenario. If your recovery procedures depend on control plane operations, you may not be able to demonstrate continuity of critical operations as required by the standard. Mapping your control plane dependencies and implementing data plane only recovery alternatives directly supports CPS 230 compliance.

The decision you're already making

Every AWS service event forces a decision: wait for AWS, or act on your own architecture. Most organisations choose to wait, not because they've evaluated the options, but because they haven't examined the question.

The control plane and data plane distinction gives you the vocabulary to ask it. CloudTrail gives you the evidence. And the answer, more often than not, is that you have more control than you thought.

Resilera runs control plane dependency audits and DR runbook reviews for AWS workloads. Request a conversation. No obligation.


Andrew Goifeld is the founder of Resilera, an AWS resilience platform and consulting practice. He spent over a decade at Amazon, where he worked across infrastructure, operations and resilience engineering. Resilera helps organisations design architectures that survive AWS service events without waiting for AWS to fix them.
