Disaster Recovery Runbook
Overview
This document provides step-by-step procedures for recovering from various disaster scenarios affecting the AWS Control Tower infrastructure.
Table of Contents
- Emergency Contacts
- RTO/RPO Definitions
- Disaster Scenarios
- Recovery Procedures
- Testing Schedule
- Post-Recovery Actions
Emergency Contacts
Primary Contacts
| Role | Name | Phone | Email | Availability |
|---|---|---|---|---|
| Infrastructure Lead | [NAME] | [PHONE] | [EMAIL] | 24/7 |
| Security Lead | [NAME] | [PHONE] | [EMAIL] | 24/7 |
| AWS TAM | [NAME] | [PHONE] | [EMAIL] | Business Hours |
| On-Call Engineer | [ROTATION] | [PHONE] | [EMAIL] | 24/7 |
Escalation Path
- On-Call Engineer (Response: 15 minutes)
- Infrastructure Lead (Response: 30 minutes)
- CTO/VP Engineering (Response: 1 hour)
External Contacts
- AWS Support: 1-877-AWS-SUPPORT
- AWS Premium Support Case: https://console.aws.amazon.com/support/
RTO/RPO Definitions
Recovery Time Objective (RTO)
Maximum acceptable time to restore service after a disaster.
| Component | RTO | Priority |
|---|---|---|
| Control Tower | 4 hours | Critical |
| Terraform State | 1 hour | Critical |
| Security Services | 2 hours | Critical |
| Networking | 2 hours | High |
| Logging | 4 hours | High |
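During an incident it helps to know how much of the RTO budget is left. The following is a minimal sketch, not part of any existing tooling: the function names `rto_minutes` and `rto_check` are hypothetical, and the values are copied from the RTO table above.

```shell
#!/usr/bin/env bash
# Sketch: compare elapsed recovery time against a component's RTO.
# Function names are illustrative; RTO values mirror the table above.

# Print the RTO in minutes for a component, or return 1 if it is unknown.
rto_minutes() {
  case "$1" in
    "Control Tower")     echo 240 ;;
    "Terraform State")   echo 60 ;;
    "Security Services") echo 120 ;;
    "Networking")        echo 120 ;;
    "Logging")           echo 240 ;;
    *) return 1 ;;
  esac
}

# Usage: rto_check <component> <incident-start-epoch> [now-epoch]
# Prints "WITHIN <remaining>m" or "BREACHED <overrun>m" (and returns 1).
rto_check() {
  local component=$1 start=$2 now=${3:-$(date +%s)}
  local rto elapsed
  rto=$(rto_minutes "$component") || {
    echo "unknown component: $component" >&2
    return 2
  }
  elapsed=$(( (now - start) / 60 ))
  if (( elapsed <= rto )); then
    echo "WITHIN $(( rto - elapsed ))m"
  else
    echo "BREACHED $(( elapsed - rto ))m"
    return 1
  fi
}
```

For example, `rto_check "Terraform State" "$INCIDENT_START"` reports the remaining budget, where `INCIDENT_START` is an assumed variable holding the failure time in epoch seconds.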
Recovery Point Objective (RPO)
Maximum acceptable data loss measured in time.
| Component | RPO | Backup Frequency |
|---|---|---|
| Terraform State | 1 hour | Continuous (S3 versioning) |
| CloudTrail Logs | 15 minutes | Real-time |
| Config Snapshots | 24 hours | Daily |
| Security Hub Findings | 1 hour | Continuous |
Disaster Scenarios
Scenario 1: Terraform State File Corruption
Severity: Critical
Impact: Cannot manage infrastructure
RTO: 1 hour
RPO: 1 hour
Scenario 2: Accidental Resource Deletion
Severity: High
Impact: Service disruption
RTO: 2-4 hours
RPO: Varies by resource
Scenario 3: AWS Account Compromise
Severity: Critical
Impact: Security breach, data loss
RTO: 4-8 hours
RPO: Varies
Scenario 4: Region Failure
Severity: Critical
Impact: Complete service outage
RTO: 8-12 hours
RPO: 1 hour
Scenario 5: Terraform State Lock Stuck
Severity: Medium
Impact: Cannot deploy changes
RTO: 30 minutes
RPO: N/A
Recovery Procedures
Procedure 1: Recover Terraform State File
Symptoms
- terraform plan fails with state errors
- State file corruption detected
- Cannot read state file
Prerequisites
- Access to AWS management account
- Terraform CLI installed
- AWS CLI configured
Steps
- Assess the Situation
# Check current state
terraform state list
# Verify state file exists
aws s3 ls s3://[STATE-BUCKET]/terraform.tfstate
- Retrieve Latest Backup
# List available backups
aws s3 ls s3://[STATE-BUCKET]/backups/ --recursive
# Download latest backup
aws s3 cp s3://[STATE-BUCKET]/backups/terraform.tfstate.[TIMESTAMP] \
./terraform.tfstate.backup
- Verify Backup Integrity
# Check backup is valid JSON
jq . terraform.tfstate.backup > /dev/null
# Verify terraform version
jq -r '.terraform_version' terraform.tfstate.backup
- Restore State File
# Push backup to the remote backend
# (if the push is rejected because the backup has an older serial, confirm it is the version you intend to restore and re-run with -force)
terraform state push terraform.tfstate.backup
# Verify restoration
terraform state list
- Validate Infrastructure
# Run plan to check for drift
terraform plan
# If drift detected, review and apply
terraform apply
- Document Recovery
  - Record timestamp of failure
  - Document root cause
  - Update runbook if needed
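Picking the newest backup by eye is error-prone under pressure. The sketch below selects it from the `aws s3 ls --recursive` listing by last-modified time. It assumes backup keys contain no spaces; the function names `latest_backup_key` and `fetch_latest_backup` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: select and download the most recently modified state backup.
# Assumption: object keys under backups/ contain no spaces.

# Read `aws s3 ls --recursive` output on stdin (date, time, size, key)
# and print the key of the most recently modified object.
latest_backup_key() {
  sort -k1,1 -k2,2 |
    awk 'NF >= 4 { key = $4 } END { if (key == "") exit 1; print key }'
}

# Usage: fetch_latest_backup <state-bucket>
# Downloads the newest object under backups/ to ./terraform.tfstate.backup.
fetch_latest_backup() {
  local bucket=$1 key
  key=$(aws s3 ls "s3://${bucket}/backups/" --recursive | latest_backup_key) || {
    echo "no backups found in s3://${bucket}/backups/" >&2
    return 1
  }
  aws s3 cp "s3://${bucket}/${key}" ./terraform.tfstate.backup
}
```

The downloaded file should still go through the integrity check before it is pushed.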
Rollback
If restoration fails:
# Restore previous version from S3 versioning
aws s3api list-object-versions \
--bucket [STATE-BUCKET] \
--prefix terraform.tfstate
# Download specific version
aws s3api get-object \
--bucket [STATE-BUCKET] \
--key terraform.tfstate \
--version-id [VERSION-ID] \
terraform.tfstate.restored
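The two rollback commands can be chained so that the version ID does not have to be copied by hand. This is a sketch under stated assumptions: versioning is enabled on the state bucket, the first non-current version is the one wanted (S3 lists versions of a key newest first), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: restore the newest non-current version of the state object.
# Assumptions: bucket versioning is enabled; function names are illustrative.

# Usage: previous_state_version <bucket> [key]
# Prints the VersionId of the most recent non-current version.
previous_state_version() {
  local bucket=$1 key=${2:-terraform.tfstate} vid
  vid=$(aws s3api list-object-versions \
    --bucket "$bucket" --prefix "$key" --no-paginate \
    --query "Versions[?Key=='$key' && IsLatest==\`false\`] | [0].VersionId" \
    --output text) || return 1
  # The CLI prints "None" when the query matches nothing.
  if [[ -z $vid || $vid == "None" ]]; then
    echo "no previous version of $key in $bucket" >&2
    return 1
  fi
  echo "$vid"
}

# Usage: restore_previous_state <bucket> [key]
# Downloads that version to ./terraform.tfstate.restored.
restore_previous_state() {
  local bucket=$1 key=${2:-terraform.tfstate} vid
  vid=$(previous_state_version "$bucket" "$key") || return 1
  aws s3api get-object --bucket "$bucket" --key "$key" \
    --version-id "$vid" terraform.tfstate.restored
}
```

Validate the restored file as JSON before pushing it, exactly as for a backup.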
Procedure 2: Recover from Accidental Deletion
Symptoms
- Resources missing from AWS console
- Terraform detects resources need to be created
- Alerts for missing resources
Steps
- Identify Deleted Resources
# Check CloudTrail for deletion events
# (lookup-events matches attribute values exactly and does not expand wildcards such as Delete*, so filter on the name prefix client-side)
aws cloudtrail lookup-events \
--start-time [INCIDENT-TIME] \
--query "Events[?starts_with(EventName, 'Delete')]"
# Run terraform plan to see what's missing
terraform plan
- Determine Recovery Method
Option A: Restore from Terraform
# If resources can be recreated
terraform apply
Option B: Import Existing Resources
# If resources still exist but state is wrong
terraform import [RESOURCE_ADDRESS] [RESOURCE_ID]
Option C: Restore from Backup
# For critical data (S3, databases)
# Follow AWS service-specific restore procedures
- Verify Recovery
# Check all resources exist
terraform state list
# Verify no drift
terraform plan
# Test functionality
# [Service-specific tests]
- Implement Preventive Measures
  - Review IAM permissions
  - Enable MFA delete on S3 buckets
  - Add resource deletion protection
  - Update SCPs if needed
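The "verify no drift" check can be made scriptable with the `-detailed-exitcode` flag of `terraform plan`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. A minimal sketch, with the hypothetical function name `plan_status`:

```shell
#!/usr/bin/env bash
# Sketch: turn `terraform plan -detailed-exitcode` into a one-word status.
# plan_status is an illustrative name, not an existing tool.

# Prints "clean", "drift", or "error"; returns 0, 2, or 1 respectively.
plan_status() {
  local rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case $rc in
    0) echo "clean" ;;
    2) echo "drift"; return 2 ;;
    *) echo "error"; return 1 ;;
  esac
}
```

A "drift" result after recovery means the plan output still needs a manual review before anything is applied.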
Procedure 3: Respond to Account Compromise
Symptoms
- Unauthorized API calls in CloudTrail
- GuardDuty findings
- Unexpected resource creation
- Unusual billing activity
Immediate Actions (First 15 Minutes)
- Isolate the Account
# Attach deny-all SCP to affected OU
aws organizations attach-policy \
--policy-id [DENY-ALL-POLICY-ID] \
--target-id [OU-ID]
- Rotate Credentials
# Force a password reset at next sign-in for every IAM user
# (this does not disable the user; users without a console login profile return an error that can be ignored)
for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
  aws iam update-login-profile --user-name "$user" --password-reset-required
done
# Delete access keys
for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
  for key in $(aws iam list-access-keys --user-name "$user" --query 'AccessKeyMetadata[].AccessKeyId' --output text); do
    aws iam delete-access-key --user-name "$user" --access-key-id "$key"
  done
done
- Enable CloudTrail Logging
# Ensure CloudTrail is enabled and logging (IsLogging should be true)
aws cloudtrail get-trail-status --name [TRAIL-NAME]
# If IsLogging is false, restart logging
aws cloudtrail start-logging --name [TRAIL-NAME]
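Deleting access keys is irreversible and removes key metadata that may be useful during the investigation. As an alternative to the deletion loop above, keys can be deactivated with `aws iam update-access-key --status Inactive`. The sketch below is an assumption about how a team might wrap that, not an established procedure: the function name and the `DRY_RUN` variable are hypothetical, and the loop also deactivates the responder's own keys if they belong to an IAM user in the account.

```shell
#!/usr/bin/env bash
# Sketch: deactivate (rather than delete) every IAM user access key.
# DRY_RUN defaults to 1 so nothing changes until it is explicitly set to 0.

deactivate_all_access_keys() {
  local user key
  for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
    for key in $(aws iam list-access-keys --user-name "$user" \
        --query 'AccessKeyMetadata[].AccessKeyId' --output text); do
      if [[ ${DRY_RUN:-1} == 1 ]]; then
        echo "would deactivate $key ($user)"
      else
        aws iam update-access-key --user-name "$user" \
          --access-key-id "$key" --status Inactive
      fi
    done
  done
}
```

Run it once as-is to review the list, then again as `DRY_RUN=0 deactivate_all_access_keys` to apply.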
Investigation (First Hour)
- Analyze CloudTrail Logs
# Find unauthorized activities
aws cloudtrail lookup-events \
--start-time [INCIDENT-TIME] \
--lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances
- Check for Backdoors
  - Review IAM users and roles
  - Check for unauthorized EC2 instances
  - Review security group rules
  - Check for unauthorized Lambda functions
- Document Evidence
  - Save CloudTrail logs
  - Screenshot GuardDuty findings
  - Record timeline of events
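Because `lookup-events` matches one exact event name per call, covering the backdoor checks above means repeating the lookup for each event of interest. The sketch below loops over a short list. The list itself is an assumption chosen to mirror those checks (to my understanding, CloudTrail records Lambda function creation as `CreateFunction20150331`), so extend it to fit the environment. `sweep_events` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: run one CloudTrail lookup per suspicious event name.
# The event list is illustrative, not exhaustive.

SUSPECT_EVENTS=(
  RunInstances
  CreateUser
  CreateAccessKey
  AuthorizeSecurityGroupIngress
  CreateFunction20150331
)

# Usage: sweep_events <start-time>
# Prints time, user, and event ID for each match, grouped by event name.
sweep_events() {
  local start=$1 name
  for name in "${SUSPECT_EVENTS[@]}"; do
    echo "== $name =="
    aws cloudtrail lookup-events \
      --start-time "$start" \
      --lookup-attributes "AttributeKey=EventName,AttributeValue=$name" \
      --query 'Events[].[EventTime,Username,EventId]' \
      --output text
  done
}
```

Redirecting the output to a file, for example `sweep_events [INCIDENT-TIME] > evidence.txt`, also serves the evidence-collection step.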
Recovery (Hours 2-4)
- Remove Malicious Resources
# Terminate unauthorized instances
aws ec2 terminate-instances --instance-ids [INSTANCE-IDS]
# Delete unauthorized IAM users
aws iam delete-user --user-name [MALICIOUS-USER]
- Restore Clean State
# Restore from known-good state backup
terraform state push backups/terraform.tfstate.[CLEAN-TIMESTAMP]
# Apply clean configuration
terraform apply
- Re-enable Access
# Remove deny-all SCP
aws organizations detach-policy \
--policy-id [DENY-ALL-POLICY-ID] \
--target-id [OU-ID]
Post-Incident (Days 1-7)
- Conduct Post-Mortem
  - Root cause analysis
  - Timeline of events
  - Lessons learned
  - Action items
- Implement Improvements
  - Update security policies
  - Enhance monitoring
  - Conduct security training
Procedure 4: Recover from Region Failure
Symptoms
- AWS region unavailable
- Services not responding
- AWS Health Dashboard shows region issues
Steps
- Verify Region Status
# Check AWS Health events (the Health API requires a Business or higher Support plan; us-east-1 is its primary endpoint)
aws health describe-events --region us-east-1 --filter eventTypeCategories=issue
# Check service status
curl https://status.aws.amazon.com/
- Assess Impact
  - Identify affected resources
  - Determine if failover needed
  - Check if data is replicated
- Failover to DR Region (if configured)
# Update backend configuration (add -reconfigure if this working directory was already initialized against the primary region)
terraform init -reconfigure -backend-config="region=ap-southeast-1"
# Restore state from replica
aws s3 cp s3://[REPLICA-BUCKET]/terraform.tfstate ./
# Deploy to DR region
terraform apply -var="home_region=ap-southeast-1"
- Update DNS/Routing
  - Update Route 53 records
  - Update load balancer targets
  - Notify users of region change
- Monitor Recovery
  - Watch AWS Health Dashboard
  - Monitor application metrics
  - Check for errors
- Failback (when primary region recovers)
# Sync data back to primary region
# Update DNS back to primary
# Verify functionality
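Before failing over, it is worth confirming that the replicated state is fresh enough to meet the 1 hour RPO for this scenario. The sketch below is one possible check. The function names are hypothetical, and `replica_modified_epoch` assumes GNU `date` (its `-d` option) to parse the timestamp returned by `head-object`.

```shell
#!/usr/bin/env bash
# Sketch: check the replica state object's age against the RPO.
# Function names are illustrative; GNU date is assumed for timestamp parsing.

# Usage: replica_within_rpo <last-modified-epoch> <rpo-minutes> [now-epoch]
# Prints the replica age and returns 0 only if it is within the RPO.
replica_within_rpo() {
  local modified=$1 rpo=$2 now=${3:-$(date +%s)}
  local age=$(( (now - modified) / 60 ))
  echo "replica age: ${age}m (RPO ${rpo}m)"
  (( age <= rpo ))
}

# Usage: replica_modified_epoch <replica-bucket> [key]
# Prints the replica's LastModified time as epoch seconds.
replica_modified_epoch() {
  local bucket=$1 key=${2:-terraform.tfstate} ts
  ts=$(aws s3api head-object --bucket "$bucket" --key "$key" \
    --query LastModified --output text) || return 1
  date -u -d "$ts" +%s
}
```

A typical call would be `replica_within_rpo "$(replica_modified_epoch [REPLICA-BUCKET])" 60`. Note that a stale replica may only mean that no changes were deployed recently, so treat a failure as a prompt to compare serials rather than as proof of data loss.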
Procedure 5: Clear Stuck Terraform Lock
Symptoms
- terraform plan fails with lock error
- Error message: “Error acquiring the state lock”
- Lock persists after process termination
Steps
- Verify Lock Status
# Check S3 for the lock object
# (S3 native locking, use_lockfile = true in Terraform 1.10+, writes <state key>.tflock; with DynamoDB locking, inspect the lock table instead)
aws s3 ls s3://[STATE-BUCKET]/terraform.tfstate.tflock
- Identify Lock Owner
# Download lock file
aws s3 cp s3://[STATE-BUCKET]/terraform.tfstate.tflock ./
# View lock details
jq . terraform.tfstate.tflock
- Verify Process is Dead
  - Check if Terraform process is still running
  - Check CI/CD pipeline status
  - Confirm no one else is deploying
- Force Unlock
# Get lock ID from error message or lock file
terraform force-unlock [LOCK-ID]
- Verify Unlock
# Try running plan
terraform plan
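The lock ID can be pulled out of the error text instead of being copied by hand. This sketch assumes the usual layout of the "Lock Info:" block that Terraform prints, with an `ID:` line followed by the identifier. `lock_id_from_error` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: extract the lock ID from Terraform's lock error output.
# Assumes the error contains a "Lock Info:" block with a line "ID: <id>".

# Reads Terraform output on stdin and prints the first lock ID found.
# Returns 1 if no ID line is present.
lock_id_from_error() {
  awk '$1 == "ID:" { print $2; found = 1; exit } END { if (!found) exit 1 }'
}
```

For example, `terraform plan -input=false 2>&1 | lock_id_from_error` prints the ID to pass to `terraform force-unlock`, once the checks under "Verify Process is Dead" have been completed.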
Prevention
- Use shorter lock timeouts
- Implement lock monitoring
- Add automatic unlock after timeout
Testing Schedule
Quarterly Tests
- State file restoration
- Backup integrity verification
- DR region failover (if configured)
Annual Tests
- Full disaster recovery drill
- Account compromise simulation
- Region failure simulation
After Each Test
- Update runbook with findings
- Document time to recover
- Identify improvements
Post-Recovery Actions
Immediate (Within 24 Hours)
- Verify all services operational
- Check for data loss
- Review monitoring alerts
- Notify stakeholders
Short-term (Within 1 Week)
- Conduct post-mortem meeting
- Document incident timeline
- Update runbook
- Implement quick fixes
Long-term (Within 1 Month)
- Implement preventive measures
- Update disaster recovery plan
- Conduct training
- Review and update RTO/RPO
Appendix
A. Useful Commands
# Check AWS account
aws sts get-caller-identity
# List all resources
aws resourcegroupstaggingapi get-resources
# Check CloudTrail events
aws cloudtrail lookup-events --max-results 50
# List S3 buckets
aws s3 ls
# Check Terraform version
terraform version
# Validate Terraform configuration
terraform validate
# Show Terraform state
terraform show
B. Important ARNs and IDs
Root OU ID: [OU-ID]
Management Account ID: [ACCOUNT-ID]
State Bucket: [BUCKET-NAME]
KMS Key ID: [KEY-ID]
CloudTrail Name: [TRAIL-NAME]
C. Backup Locations
Primary State: s3://[STATE-BUCKET]/terraform.tfstate
State Backups: s3://[STATE-BUCKET]/backups/
State Replica: s3://[REPLICA-BUCKET]/terraform.tfstate
CloudTrail Logs: s3://[LOGS-BUCKET]/cloudtrail/
Config Snapshots: s3://[LOGS-BUCKET]/config/
D. Recovery Time Tracking
| Date | Scenario | Time to Detect | Time to Recover | Notes |
|---|---|---|---|---|
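Rows for the tracking table can be generated from recorded timestamps so that durations are computed consistently. A minimal sketch, assuming both durations are measured from the moment of failure and expressed in whole minutes; `tracking_row` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: format one row of the recovery time tracking table.
# Assumption: detection and recovery times are both measured from the failure.

# Usage: tracking_row <date> <scenario> <failed-epoch> <detected-epoch> <recovered-epoch> [notes]
tracking_row() {
  local day=$1 scenario=$2 failed=$3 detected=$4 recovered=$5 notes=${6:-}
  local detect_min=$(( (detected - failed) / 60 ))
  local recover_min=$(( (recovered - failed) / 60 ))
  printf '| %s | %s | %dm | %dm | %s |\n' \
    "$day" "$scenario" "$detect_min" "$recover_min" "$notes"
}
```

The output line can be pasted directly under the table header.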
Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-01 | [NAME] | Initial version |
Last Reviewed: [DATE]
Next Review: [DATE + 6 months]
Owner: Infrastructure Team