Additional Best Practices for AWS Control Tower

This document outlines additional best practices that can be implemented to further enhance the Control Tower deployment’s security, reliability, maintainability, and operational excellence.

Table of Contents

  1. Pre-Commit Hooks
  2. Terraform State Management
  3. Secrets Management
  4. Monitoring and Observability
  5. Disaster Recovery
  6. Cost Optimization
  7. Security Hardening
  8. Compliance and Governance
  9. Documentation
  10. CI/CD Enhancements

1. Pre-Commit Hooks

Implementation

Add pre-commit hooks to automatically validate code before commits.

Benefits:

  • Catch errors early in development
  • Enforce code quality standards
  • Prevent committing sensitive data
  • Ensure consistent formatting

Recommended Tools:

  • pre-commit framework
  • terraform fmt - Format checking
  • terraform validate - Syntax validation
  • tfsec - Security scanning
  • tflint - Linting
  • detect-secrets - Prevent credential leaks
  • checkov - Policy as code

Files to Create:

  • .pre-commit-config.yaml
  • .pre-commit-hooks/ directory

Priority: HIGH


2. Terraform State Management

Current State

  • S3 backend with native state locking (Terraform 1.6+)
  • State stored in management account

Additional Best Practices

2.1 State File Encryption

Status: ✅ Implemented (KMS encryption)

2.2 State File Versioning

Status: ✅ Implemented (S3 versioning)

2.3 State File Backup Automation

Recommendation: Implement automated state backups

# Add to cron or GitHub Actions
#!/bin/bash
# scripts/backup-state-automated.sh
DATE=$(date +%Y%m%d_%H%M%S)
terraform state pull > "s3://backup-bucket/state-backups/terraform.tfstate.$DATE"

2.4 State File Access Logging

Recommendation: Enable CloudTrail logging for S3 state bucket

# Add to backend module
resource "aws_s3_bucket_logging" "state_bucket" {
  bucket = aws_s3_bucket.terraform_state.id
  
  target_bucket = aws_s3_bucket.logs.id
  target_prefix = "state-access-logs/"
}

2.5 State Locking Monitoring

Recommendation: Monitor state lock duration and failures

Priority: MEDIUM


3. Secrets Management

Current State

  • AWS credentials via environment variables or AWS CLI
  • No secrets in code (good practice)

Additional Best Practices

3.1 AWS Secrets Manager Integration

Recommendation: Store sensitive variables in AWS Secrets Manager

# Example: Retrieve notification emails from Secrets Manager
data "aws_secretsmanager_secret_version" "notification_emails" {
  secret_id = "control-tower/notification-emails"
}

locals {
  notification_emails = jsondecode(data.aws_secretsmanager_secret_version.notification_emails.secret_string)
}

3.2 HashiCorp Vault Integration

Recommendation: For multi-cloud or hybrid environments

3.3 SOPS (Secrets OPerationS)

Recommendation: Encrypt sensitive tfvars files

# Encrypt production variables
sops -e terraform.tfvars.production > terraform.tfvars.production.enc

# Decrypt during deployment
sops -d terraform.tfvars.production.enc > terraform.tfvars.production

3.4 Git-Secrets

Recommendation: Prevent committing secrets

# Install git-secrets
brew install git-secrets

# Setup
git secrets --install
git secrets --register-aws

Priority: HIGH


4. Monitoring and Observability

Current State

  • CloudWatch Logs for Control Tower
  • CloudWatch Alarms for security events
  • SNS notifications

Additional Best Practices

4.1 Terraform Deployment Metrics

Recommendation: Track deployment metrics

# Add to main.tf
resource "aws_cloudwatch_metric_alarm" "terraform_deployment_duration" {
  alarm_name          = "terraform-deployment-duration"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "DeploymentDuration"
  namespace           = "ControlTower/Terraform"
  period              = "300"
  statistic           = "Average"
  threshold           = "3600" # 1 hour
  alarm_description   = "Terraform deployment taking too long"
  alarm_actions       = [aws_sns_topic.operational_notifications.arn]
}

4.2 Infrastructure Drift Detection

Recommendation: Automated drift detection

# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform Plan
        run: terraform plan -detailed-exitcode
      - name: Notify on Drift
        if: failure()
        run: |
          # Send notification

4.3 Cost Monitoring

Recommendation: Track infrastructure costs

# Add AWS Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "control_tower" {
  name              = "control-tower-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "control_tower" {
  name      = "control-tower-cost-alerts"
  frequency = "DAILY"
  
  monitor_arn_list = [
    aws_ce_anomaly_monitor.control_tower.arn
  ]
  
  subscriber {
    type    = "SNS"
    address = aws_sns_topic.operational_notifications.arn
  }
  
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert on $100+ anomalies
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

4.4 Centralized Logging

Recommendation: Aggregate logs from all accounts

Priority: MEDIUM


5. Disaster Recovery

Current State

  • S3 versioning for state files
  • Manual state backups via Makefile

Additional Best Practices

5.1 Cross-Region State Replication

Recommendation: Replicate state bucket to DR region

# Add to backend module
resource "aws_s3_bucket_replication_configuration" "state_replication" {
  bucket = aws_s3_bucket.terraform_state.id
  role   = aws_iam_role.replication.arn
  
  rule {
    id     = "replicate-state"
    status = "Enabled"
    
    destination {
      bucket        = aws_s3_bucket.terraform_state_replica.arn
      storage_class = "STANDARD_IA"
      
      encryption_configuration {
        replica_kms_key_id = aws_kms_key.replica.arn
      }
    }
  }
}

5.2 Disaster Recovery Runbook

Recommendation: Create detailed DR procedures

File: docs/DISASTER_RECOVERY.md

Contents:

  • State file recovery procedures
  • Control Tower rebuild process
  • Account recovery procedures
  • RTO/RPO definitions
  • Contact information

5.3 Backup Testing

Recommendation: Regularly test state restoration

# scripts/test-state-restore.sh
#!/bin/bash
# Test state restoration in isolated environment

5.4 Multi-Region Deployment

Recommendation: Deploy critical infrastructure in multiple regions

Priority: MEDIUM


6. Cost Optimization

Current State

  • Instance type restrictions for non-prod
  • Optional expensive features (Macie, centralized networking)

Additional Best Practices

6.1 Resource Tagging for Cost Allocation

Recommendation: Enforce comprehensive tagging

# Add to variables.tf
variable "required_tags" {
  description = "Required tags for all resources"
  type        = list(string)
  default     = ["Environment", "Project", "CostCenter", "Owner"]
}

# Add validation in locals.tf
resource "null_resource" "validate_tags" {
  lifecycle {
    precondition {
      condition = alltrue([
        for tag in var.required_tags :
        contains(keys(var.default_tags), tag)
      ])
      error_message = "Missing required tags: ${join(", ", var.required_tags)}"
    }
  }
}

6.2 AWS Budgets

Recommendation: Set up budget alerts

resource "aws_budgets_budget" "control_tower" {
  name              = "control-tower-monthly-budget"
  budget_type       = "COST"
  limit_amount      = "5000"
  limit_unit        = "USD"
  time_period_start = "2024-01-01_00:00"
  time_unit         = "MONTHLY"
  
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = var.operational_notification_emails
  }
  
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = var.operational_notification_emails
  }
}

6.3 Resource Scheduling

Recommendation: Auto-stop non-prod resources

# Add Lambda function to stop non-prod resources after hours
resource "aws_lambda_function" "resource_scheduler" {
  filename      = "lambda/resource-scheduler.zip"
  function_name = "control-tower-resource-scheduler"
  role          = aws_iam_role.scheduler.arn
  handler       = "index.handler"
  runtime       = "python3.11"
  
  environment {
    variables = {
      STOP_SCHEDULE  = "0 19 * * MON-FRI"  # 7 PM weekdays
      START_SCHEDULE = "0 7 * * MON-FRI"   # 7 AM weekdays
    }
  }
}

6.4 S3 Intelligent-Tiering

Recommendation: Use S3 Intelligent-Tiering for log buckets

Priority: MEDIUM


7. Security Hardening

Current State

  • 35+ SCPs
  • KMS encryption
  • GuardDuty, Security Hub, Config
  • Network Firewall

Additional Best Practices

7.1 AWS Systems Manager Session Manager

Recommendation: Replace SSH/RDP with Session Manager

# Add to security module
resource "aws_ssm_document" "session_manager_prefs" {
  name            = "SSM-SessionManagerRunShell"
  document_type   = "Session"
  document_format = "JSON"
  
  content = jsonencode({
    schemaVersion = "1.0"
    description   = "Document to hold regional settings for Session Manager"
    sessionType   = "Standard_Stream"
    inputs = {
      s3BucketName                = aws_s3_bucket.session_logs.id
      s3EncryptionEnabled         = true
      cloudWatchLogGroupName      = aws_cloudwatch_log_group.session_logs.name
      cloudWatchEncryptionEnabled = true
      kmsKeyId                    = aws_kms_key.main.id
      runAsEnabled                = false
      runAsDefaultUser            = ""
    }
  })
}

7.2 AWS WAF for API Gateway

Recommendation: Add WAF protection

resource "aws_wafv2_web_acl" "api_protection" {
  name  = "control-tower-api-protection"
  scope = "REGIONAL"
  
  default_action {
    allow {}
  }
  
  rule {
    name     = "RateLimitRule"
    priority = 1
    
    action {
      block {}
    }
    
    statement {
      rate_based_statement {
        limit              = 2000
        aggregate_key_type = "IP"
      }
    }
    
    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "RateLimitRule"
      sampled_requests_enabled   = true
    }
  }
  
  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "ControlTowerWAF"
    sampled_requests_enabled   = true
  }
}

7.3 VPC Endpoints for AWS Services

Recommendation: Use VPC endpoints to avoid internet traffic

# Add to networking module
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.inspection.id
  service_name = "com.amazonaws.${var.region}.s3"
  
  route_table_ids = [
    aws_route_table.private.id
  ]
  
  tags = merge(var.tags, {
    Name = "s3-endpoint"
  })
}

7.4 AWS Shield Advanced

Recommendation: For DDoS protection (if needed)

7.5 AWS Inspector

Recommendation: Automated vulnerability scanning

resource "aws_inspector2_enabler" "control_tower" {
  account_ids    = [data.aws_caller_identity.current.account_id]
  resource_types = ["EC2", "ECR", "LAMBDA"]
}

Priority: HIGH


8. Compliance and Governance

Current State

  • AWS Config rules
  • Security Hub standards
  • SCP policies

Additional Best Practices

8.1 AWS Audit Manager

Recommendation: Automated compliance reporting

resource "aws_auditmanager_assessment" "control_tower" {
  name = "control-tower-compliance"
  
  assessment_reports_destination {
    destination      = "s3://${aws_s3_bucket.audit_reports.id}"
    destination_type = "S3"
  }
  
  framework_id = "arn:aws:auditmanager:${var.region}::framework/AWS-Well-Architected-Framework"
  
  roles {
    role_arn  = aws_iam_role.audit_manager.arn
    role_type = "PROCESS_OWNER"
  }
  
  scope {
    aws_accounts {
      id = data.aws_caller_identity.current.account_id
    }
    aws_services {
      service_name = "ec2"
    }
  }
}

8.2 Compliance Reporting

Recommendation: Automated compliance reports

# scripts/generate-compliance-report.sh
#!/bin/bash
# Generate compliance report from Security Hub, Config, etc.

8.3 Policy as Code Expansion

Recommendation: Add more OPA policies

New Policies:

  • Data residency validation
  • Tagging compliance
  • Cost threshold validation
  • Resource naming conventions
  • Backup policy compliance

8.4 Change Management

Recommendation: Implement change approval workflow

# .github/workflows/change-approval.yml
name: Change Approval
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  require-approval:
    runs-on: ubuntu-latest
    steps:
      - name: Check Approvals
        uses: actions/github-script@v7
        with:
          script: |
            const reviews = await github.rest.pulls.listReviews({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number
            });
            const approvals = reviews.data.filter(r => r.state === 'APPROVED');
            if (approvals.length < 2) {
              core.setFailed('Requires 2 approvals for infrastructure changes');
            }

Priority: MEDIUM


9. Documentation

Current State

  • Comprehensive inline comments
  • Multiple documentation files in docs/
  • README files in modules

Additional Best Practices

9.1 Architecture Diagrams

Recommendation: Create visual architecture diagrams

Tools:

  • draw.io
  • Lucidchart
  • CloudCraft
  • Terraform Graph

Diagrams Needed:

  • Overall architecture
  • Network topology
  • Security controls
  • Data flow
  • Deployment pipeline

9.2 API Documentation

Recommendation: Document module interfaces

# Generate module documentation
terraform-docs markdown table --output-file README.md modules/security/

9.3 Runbooks

Recommendation: Create operational runbooks

Runbooks Needed:

  • Incident response
  • Disaster recovery
  • Deployment procedures
  • Troubleshooting guide
  • Rollback procedures

9.4 Decision Records

Recommendation: Document architectural decisions

File: docs/adr/ (Architecture Decision Records)

# ADR-001: Use Terraform 1.6+ for Native S3 State Locking

## Status
Accepted

## Context
Need to manage Terraform state locking without DynamoDB.

## Decision
Use Terraform 1.6+ with native S3 state locking.

## Consequences
- Reduced cost (no DynamoDB)
- Simpler architecture
- Requires Terraform 1.6+

9.5 Video Tutorials

Recommendation: Create video walkthroughs

Priority: LOW


10. CI/CD Enhancements

Current State

  • GitHub Actions workflow
  • Validation, security scanning, OPA tests
  • Manual approval for production

Additional Best Practices

10.1 Multi-Environment Pipelines

Recommendation: Separate pipelines per environment

# .github/workflows/deploy-dev.yml
# .github/workflows/deploy-staging.yml
# .github/workflows/deploy-prod.yml

10.2 Automated Rollback

Recommendation: Implement automatic rollback on failure

- name: Apply Terraform
  id: apply
  run: terraform apply -auto-approve
  
- name: Rollback on Failure
  if: failure()
  run: |
    terraform state push backups/terraform.tfstate.backup
    terraform apply -auto-approve

10.3 Canary Deployments

Recommendation: Gradual rollout for changes

10.4 Integration Tests

Recommendation: Add integration tests post-deployment

- name: Integration Tests
  run: |
    # Test Control Tower is accessible
    aws organizations describe-organization
    
    # Test OUs exist
    aws organizations list-organizational-units-for-parent --parent-id $ROOT_ID
    
    # Test SCPs are attached
    aws organizations list-policies-for-target --target-id $OU_ID --filter SERVICE_CONTROL_POLICY

10.5 Deployment Notifications

Recommendation: Slack/Teams notifications

- name: Notify Slack
  uses: 8398a7/action-slack@v3
  with:
    status: $
    text: 'Control Tower deployment completed'
    webhook_url: $

10.6 Terraform Cloud/Enterprise

Recommendation: Consider Terraform Cloud for:

  • Remote execution
  • Policy as code (Sentinel)
  • Private module registry
  • Cost estimation
  • Team collaboration

Priority: MEDIUM


Implementation Priority Matrix

High Priority (Implement First)

  1. ✅ Pre-commit hooks
  2. ✅ Secrets management (AWS Secrets Manager)
  3. ✅ Security hardening (Session Manager, Inspector)
  4. ✅ Git-secrets

Medium Priority (Implement Next)

  1. ✅ Drift detection automation
  2. ✅ Cost monitoring and budgets
  3. ✅ Disaster recovery runbook
  4. ✅ Compliance reporting
  5. ✅ Architecture diagrams

Low Priority (Nice to Have)

  1. ⏳ Terraform Cloud migration
  2. ⏳ Video tutorials
  3. ⏳ Canary deployments
  4. ⏳ AWS Shield Advanced

Quick Wins (Easy to Implement)

  1. Add .pre-commit-config.yaml (30 minutes)
  2. Create DR runbook (2 hours)
  3. Add AWS Budgets (1 hour)
  4. Enable S3 Intelligent-Tiering (30 minutes)
  5. Add git-secrets (30 minutes)
  6. Create architecture diagram (2 hours)
  7. Add Slack notifications to CI/CD (1 hour)

Estimated Implementation Time

Category Time Required
Pre-commit Hooks 2 hours
Secrets Management 4 hours
Monitoring Enhancements 8 hours
Disaster Recovery 6 hours
Cost Optimization 4 hours
Security Hardening 8 hours
Compliance 6 hours
Documentation 12 hours
CI/CD Enhancements 8 hours
Total 58 hours

Next Steps

  1. Review this document with the team
  2. Prioritize based on business needs
  3. Create implementation tickets
  4. Assign owners
  5. Set deadlines
  6. Track progress

References


Back to top

Copyright © 2024. Distributed under the MIT License.

This site uses Just the Docs, a documentation theme for Jekyll.