Week 9 - Day 2: DR Strategies (4 Patterns)

Learning objectives

  • Understand the 4 DR patterns: Backup/Restore, Pilot Light, Warm Standby, Multi-Site
  • Distinguish the RTO/RPO trade-offs
  • Apply the right pattern to each business requirement

1. RTO and RPO

Definitions

  • RTO (Recovery Time Objective): max acceptable downtime (how long can we be down?)
  • RPO (Recovery Point Objective): max acceptable data loss (how much data can we lose?)

Example

  • Online banking: RTO = 1 hour, RPO = 0 (no data loss)
  • Internal blog: RTO = 24 hours, RPO = 1 day
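
To make the two definitions concrete, the sketch below derives the achieved downtime and data-loss window from three incident timestamps and compares them with the objectives. The `Objectives` class, the helper names, and the sample timestamps are illustrative assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data-loss window


def achieved(last_recovery_point: datetime, failure: datetime, restored: datetime):
    """Return (downtime, data_loss) for one incident."""
    downtime = restored - failure              # compared against RTO
    data_loss = failure - last_recovery_point  # compared against RPO
    return downtime, data_loss


def meets(objectives: Objectives, downtime: timedelta, data_loss: timedelta) -> bool:
    """True when both the downtime and the data loss are within the objectives."""
    return downtime <= objectives.rto and data_loss <= objectives.rpo


# Hypothetical incident for the internal blog (RTO = 24 hours, RPO = 1 day).
blog = Objectives(rto=timedelta(hours=24), rpo=timedelta(days=1))
downtime, data_loss = achieved(
    last_recovery_point=datetime(2024, 1, 1, 2, 0),  # last nightly backup
    failure=datetime(2024, 1, 1, 14, 0),
    restored=datetime(2024, 1, 1, 20, 0),
)
print(downtime, data_loss, meets(blog, downtime, data_loss))
```

The same incident would fail the online banking objectives (RTO = 1 hour, RPO = 0), which is why the two workloads call for different strategies.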

Trade-off

  • Lower RTO/RPO → higher cost
  • Balance: spend on DR in proportion to the business impact of downtime and data loss

2. The 4 DR Strategies

Strategy                  RTO            RPO        Cost
Backup & Restore          Hours to days  Hours      $
Pilot Light               10 min - 1 h   Minutes    $$
Warm Standby              < 30 min       Minutes    $$$
Multi-Site Active-Active  Seconds        Near-zero  $$$$

3. Strategy 1: Backup & Restore

Architecture

Primary Region (active):
  EC2, RDS, S3 (working)
  Backup S3 (with CRR to DR region)

DR Region:
  Empty (no running resources)
  Backups in S3 + AMIs

When disaster:
  1. Spin up infrastructure in DR region (from CloudFormation)
  2. Restore from backup
  3. Update DNS to point to DR region
  4. Test
  5. Resume operations
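
The five recovery steps can be encoded as an ordered runbook that stops at the first failure and reports how long recovery took. This is a minimal sketch: `run_runbook` and the `noop` placeholder actions are hypothetical, and a real runbook would call CloudFormation, restore jobs, and DNS updates instead.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_runbook(steps: List[Step]) -> float:
    """Run steps in order, stop at the first failure, return elapsed seconds."""
    started = time.monotonic()
    for number, (name, action) in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {name}")
        try:
            action()
        except Exception as exc:
            raise RuntimeError(f"runbook stopped at step {number}: {name}") from exc
    return time.monotonic() - started


def noop() -> None:
    """Placeholder for a real recovery action."""


backup_restore_runbook: List[Step] = [
    ("Spin up infrastructure in DR region", noop),
    ("Restore from backup", noop),
    ("Update DNS to point to DR region", noop),
    ("Test", noop),
    ("Resume operations", noop),
]

elapsed = run_runbook(backup_restore_runbook)
print(f"recovery took {elapsed:.1f} s")
```

Timing the whole run gives a measured recovery time that can be compared with the RTO during drills.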

RTO: Hours to days

RPO: Hours (depending on backup frequency)
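
Why RPO follows backup frequency: a failure just before the next backup loses a full interval of data, and the newest backup may not have reached the DR region yet. The sketch below computes that upper bound; the copy-lag figures are assumed values for illustration.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, copy_lag: timedelta) -> timedelta:
    """Upper bound on data loss: one full backup interval plus the time a
    finished backup needs to arrive in the DR region."""
    return backup_interval + copy_lag


daily = worst_case_rpo(timedelta(hours=24), timedelta(hours=1))
hourly = worst_case_rpo(timedelta(hours=1), timedelta(minutes=15))
print(daily, hourly)
```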

Setup

  • Daily backups to S3 with CRR
  • AMI snapshots cross-region
  • Database snapshots cross-region
  • CloudFormation templates ready in DR region
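
As one example of the setup, an EBS snapshot can be copied into the DR region with the EC2 `copy_snapshot` call, issued from a client bound to the destination region. The sketch takes the client as a parameter so it can be tried with a stub; the function name, region names, and snapshot ID are placeholders.

```python
def copy_snapshot_to_dr(dr_ec2_client, snapshot_id: str, source_region: str) -> str:
    """Copy an EBS snapshot into the region that `dr_ec2_client` is bound to.

    `dr_ec2_client` is expected to behave like
    boto3.client("ec2", region_name=<DR region>).
    """
    response = dr_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   dr_ec2 = boto3.client("ec2", region_name="us-west-2")
#   copy_snapshot_to_dr(dr_ec2, "snap-0123456789abcdef0", "us-east-1")
```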

Cost

  • Lowest: only pay for backup storage in DR region
  • No compute running in DR

Use case

  • Non-critical apps
  • Long RTO/RPO acceptable
  • Cost-sensitive

4. Strategy 2: Pilot Light

Architecture

Primary Region (active):
  Full production (EC2, RDS, etc.)
  Replicate to DR

DR Region (minimal):
  Critical infrastructure RUNNING:
    RDS Read Replica (or restored)
    Minimal EC2 (or just AMIs ready)
  Resources scaled DOWN

When disaster:
  1. Promote RDS replica to primary
  2. Scale up EC2 (or launch from AMIs)
  3. Update DNS
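
The three failover steps map onto three API calls. The sketch below assumes boto3-style RDS, Auto Scaling, and Route 53 clients are passed in; the function name and every identifier in the arguments are hypothetical, and an Aurora cluster would need a cluster-level call instead of `promote_read_replica`.

```python
def pilot_light_failover(rds, autoscaling, route53, *, replica_id, asg_name,
                         desired_capacity, hosted_zone_id, record_name, dr_endpoint):
    """Run the three failover steps with boto3-style clients for the DR region."""
    # 1. Promote the RDS read replica to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Scale the Auto Scaling group up from its pilot-light size.
    #    Min and max are pinned to the target here for simplicity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired_capacity,
        MaxSize=desired_capacity,
        DesiredCapacity=desired_capacity,
    )

    # 3. Point DNS at the DR endpoint.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "DR failover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": dr_endpoint}],
                },
            }],
        },
    )
```

Promotion returns before it completes, so a real runbook would wait for the database to become available before switching DNS.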

RTO: 10 minutes - 1 hour

RPO: Minutes (continuous replication)

Setup

  • Continuous DB replication (e.g., Aurora cross-region replica)
  • AMIs current in DR region
  • Network setup pre-configured
  • Auto Scaling templates ready

Cost

  • Medium: only the DB replica and minimal compute run in DR
  • Ongoing charges: DB replica + minimal EC2

Use case

  • Important apps but not mission-critical
  • Downtime of under 1 hour is acceptable

5. Strategy 3: Warm Standby

Architecture

Primary Region (active):
  Full production

DR Region (running, smaller scale):
  Full architecture running
  Smaller instance sizes (or fewer instances)
  DB replica continuously synced
  Receiving some test traffic (optional)

When disaster:
  1. Scale UP DR resources
  2. Update DNS
  3. Resume full operations

RTO: Minutes (< 30 min)

RPO: Minutes

Setup

  • Smaller version of production in DR
  • Continuous replication
  • Optionally serve a small percentage of traffic to keep the standby warm
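
The "smaller version of production" can be expressed as a fraction of production capacity, which also tells you how much must be launched at failover. A minimal sketch, with the 25% fraction and instance counts chosen only for illustration:

```python
import math


def standby_capacity(production_instances: int, standby_fraction: float) -> int:
    """Instances kept running in the DR region (at least one)."""
    return max(1, math.ceil(production_instances * standby_fraction))


def scale_up_needed(production_instances: int, standby_fraction: float) -> int:
    """Extra instances to launch in DR at failover to reach production size."""
    return production_instances - standby_capacity(production_instances, standby_fraction)


print(standby_capacity(12, 0.25), scale_up_needed(12, 0.25))
```

The larger the standby fraction, the less there is to launch during a disaster and the shorter the RTO, at a higher steady-state cost.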

Cost

  • Higher: full architecture running (smaller scale)

Use case

  • Critical apps
  • Need fast recovery
  • Budget for redundancy

6. Strategy 4: Multi-Site Active-Active

Architecture

Primary Region (active):
  Full production receiving traffic

DR Region (active):
  Full production receiving traffic

Both regions actively serve users

Route 53 latency-based routing:
  Users → nearest region
  Auto failover if 1 region down

Data:
  Bi-directional replication (DynamoDB Global Tables)
  Aurora Global Database (replicates from a single writer region)
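
The routing decision in this architecture can be modeled as "lowest latency among healthy regions". The sketch below is a toy model of that decision, not how Route 53 is implemented; the region names and latencies are assumed values.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if none is healthy."""
    candidates = {r: ms for r, ms in latency_ms.items() if healthy.get(r, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Latencies as seen by one hypothetical user.
latency = {"us-east-1": 20.0, "eu-west-1": 95.0}
print(pick_region(latency, {"us-east-1": True, "eu-west-1": True}))
print(pick_region(latency, {"us-east-1": False, "eu-west-1": True}))
```

When the nearer region is marked unhealthy, the same user is sent to the remaining region, which is the automatic failover shown in the diagram.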

RTO: Seconds (auto-failover)

RPO: Near-zero
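
How many seconds the failover takes depends mainly on how quickly a failed region is detected and how long clients cache DNS answers. A rough estimate, under the simplifying assumption that detection takes `failure_threshold` consecutive failed checks and that clients honor the TTL; the parameter values are examples only:

```python
def estimated_failover_seconds(check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Rough upper bound: time to mark the endpoint unhealthy plus time for
    cached DNS answers to expire."""
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s


print(estimated_failover_seconds(check_interval_s=10, failure_threshold=3, dns_ttl_s=20))
print(estimated_failover_seconds(check_interval_s=30, failure_threshold=3, dns_ttl_s=60))
```

Only users who were being routed to the failed region see this delay; users already served by the healthy region are unaffected.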

Setup

  • Both regions full production
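  • Cross-region data replication (bi-directional with DynamoDB Global Tables; Aurora Global Database replicates from a single writer region)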
  • Route 53 latency-based + health checks
  • Aurora Global Database / DynamoDB Global Tables

Cost

  • Highest: 2x+ infrastructure
  • 2x compute, 2x storage, cross-region transfer

Use case

  • Mission-critical (financial, healthcare)
  • Global users (low latency benefit too)
  • Compliance requires multi-region

7. Comparison Table

Strategy          RTO         RPO        Cost  Complexity
Backup & Restore  Hours-days  Hours      $     Low
Pilot Light       10 min-1h   Minutes    $$    Medium
Warm Standby      < 30 min    Minutes    $$$   Medium-High
Multi-Site A/A    Seconds     Near-zero  $$$$  High

8. Choosing Strategy

Decision matrix

Workload criticality?
  Low (internal tool, dev) → Backup & Restore
  Medium (business app, < 1 hour outage OK)
    Acceptable RTO 1 hour → Pilot Light
    RTO 15-30 min → Warm Standby
  Mission-critical (financial, healthcare, e-commerce) → Multi-Site Active-Active
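
The decision matrix translates directly into a small function. The criticality labels and the 30-minute cut-off between Pilot Light and Warm Standby are assumptions made for this sketch, taken from the RTO ranges in the matrix.

```python
def choose_strategy(criticality: str, rto_minutes: int = 60) -> str:
    """Map workload criticality (and, for medium workloads, the RTO target)
    to a DR strategy."""
    if criticality == "low":
        return "Backup & Restore"
    if criticality == "medium":
        # Medium workloads split by RTO: about 1 hour vs. 15-30 minutes.
        return "Warm Standby" if rto_minutes <= 30 else "Pilot Light"
    if criticality == "mission-critical":
        return "Multi-Site Active-Active"
    raise ValueError(f"unknown criticality: {criticality!r}")


print(choose_strategy("low"))
print(choose_strategy("medium", rto_minutes=60))
print(choose_strategy("medium", rto_minutes=20))
print(choose_strategy("mission-critical"))
```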

Cost vs benefit

  • Calculate cost of downtime per hour
  • Example: at $1,000/hour of downtime cost, $500/month of DR spend breaks even once it prevents 30 minutes of downtime per month
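
The cost comparison is a simple break-even calculation. A minimal sketch with hypothetical function names, using the figures from the bullet above:

```python
def break_even_downtime_hours(dr_cost_per_month: float, downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per month at which DR spend pays for itself."""
    return dr_cost_per_month / downtime_cost_per_hour


def worth_it(dr_cost_per_month: float, downtime_cost_per_hour: float,
             expected_hours_avoided_per_month: float) -> bool:
    """True when the expected downtime savings cover the DR spend."""
    return expected_hours_avoided_per_month * downtime_cost_per_hour >= dr_cost_per_month


print(break_even_downtime_hours(500, 1000))  # 0.5 hours, i.e. 30 minutes
print(worth_it(500, 1000, expected_hours_avoided_per_month=2))
```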

9. DR Testing

Regular drills required

  • Test DR procedure quarterly
  • Validate RTO/RPO actually meet objectives
  • Update runbooks after tests

Game days

  • Simulated outages
  • Practice for ops team
  • Find gaps before real incident

AWS Resilience Hub

  • Assesses applications against their RTO/RPO targets
  • Continuous validation
  • Recommendations based on Well-Architected Framework

10. Common AWS Services for DR

Compute

  • EC2 + AMI cross-region copy
  • ASG with multi-region patterns
  • Lambda (regional, easy redeploy)

Storage

  • S3 Cross-Region Replication (CRR)
  • EBS snapshot cross-region copy
  • AWS Backup cross-region

Databases

  • RDS cross-region read replica → promote
  • Aurora Global Database (sub-second cross-region)
  • DynamoDB Global Tables (multi-active: every replica accepts writes)
  • ElastiCache global datastore (Redis)

Network

  • Route 53 Failover routing
  • Route 53 health checks
  • Global Accelerator (anycast)
  • Transit Gateway peering
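
Failover routing needs a PRIMARY and a SECONDARY record for the same name, with a health check on the primary. The sketch below only builds the `ChangeBatch` dictionary that would be passed to Route 53 `change_resource_record_sets`; the helper name, domain names, and health check ID are placeholders.

```python
from typing import Optional


def failover_record(name: str, role: str, target: str,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one change entry for a failover-routed CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_record("app.example.com.", "PRIMARY", "primary.example.com",
                        "hc-primary-placeholder"),
        failover_record("app.example.com.", "SECONDARY", "dr.example.com"),
    ]
}
print(change_batch)
```

A short TTL keeps cached answers from pinning clients to the failed region for long.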

Orchestration

  • CloudFormation templates (rebuild)
  • AWS Backup (centralized backup management)
  • AWS Resilience Hub (DR validation)

11. Specific Service DR Capabilities

S3

  • Same-region replication (SRR)
  • Cross-region replication (CRR)
  • S3 Replication Time Control (RTC): designed to replicate 99.99% of new objects within 15 minutes

RDS

  • Multi-AZ: synchronous, intra-region
  • Read replicas: asynchronous, can be cross-region
  • Automated backups: PITR
  • Snapshots: manual, cross-region copy

Aurora

  • Aurora Global Database: typically < 1 min RTO, < 1 sec RPO

DynamoDB

  • Global Tables: multi-region active-active

Lambda

  • Stateless, regional
  • DR: deploy in DR region (CloudFormation/IaC)

12. Common Patterns

Pattern 1: SaaS web app

  • Multi-AZ in primary region
  • Warm Standby in DR region (smaller)
  • Aurora cross-region replica
  • S3 CRR for assets
  • Route 53 failover with health check

Pattern 2: E-commerce global

  • Multi-Site Active-Active
  • DynamoDB Global Tables
  • Aurora Global Database
  • Route 53 latency routing
  • CloudFront for static content

Pattern 3: Compliance archive

  • Backup & Restore
  • S3 Glacier with cross-region replication
  • Quarterly restore drills

Pattern 4: Critical internal tool

  • Pilot Light
  • Daily DB backups + CRR
  • AMI ready in DR
  • Documented manual failover procedure

Review questions

  1. What are the 4 DR strategies, from cheapest to most expensive?

    Answer: (1) Backup & Restore: the cheapest. Back up data to the DR region and restore when disaster strikes. RTO: hours to days. (2) Pilot Light: core components (DB replication) are running and compute is off; scale up when needed. RTO: minutes to an hour. (3) Warm Standby: a scaled-down copy of production is running; scale up on failover. RTO: minutes. (4) Multi-Site Active-Active: the most expensive. Both regions run at full capacity and share the traffic. RTO: near zero (seconds).

  2. Does Pilot Light have resources running in the DR region?

    Answer: Yes, but only the core components: database replication is running (RDS cross-region replica, DynamoDB Global Tables). Compute (EC2, ECS) is off or scaled to a minimum. When disaster strikes: (1) promote the DB replica, (2) scale up or launch compute, (3) update DNS. The name "Pilot Light" comes from gas heaters: a small flame that is always lit, ready to ignite the full burner. It is cheaper than Warm Standby because compute is not running.

  3. Roughly what is the RTO of Multi-Site Active-Active?

    Answer: Near zero, because both regions are already serving traffic. When one region fails, Route 53 or Global Accelerator automatically routes 100% of traffic to the remaining region, with no manual intervention. The actual RTO depends on the DNS TTL and health check intervals (usually < 1 minute). RPO is also near zero with low-lag replication (Aurora Global Database replicates asynchronously, typically with under one second of lag), or minutes with slower asynchronous replication.

  4. What is AWS Resilience Hub used for?

    Answer: AWS Resilience Hub assesses and helps improve an application's resilience. Import the application from CloudFormation/Terraform/AppRegistry → define RTO/RPO targets → the Hub analyzes the architecture → reports gaps with recommendations → generates a resiliency score. It can also run resilience drills (automated chaos engineering). It tells teams whether an application meets its DR objectives before a disaster, rather than during one.

  5. When should you use the Backup & Restore strategy?

    Answer: When (1) RTO/RPO requirements are relaxed (hours are acceptable), (2) the budget is limited and cannot justify running infrastructure in the DR region continuously, (3) the workloads are non-critical: dev/test, batch processing, archives, (4) data is fully backed up and restores are tested regularly, (5) compliance requires offsite backups but not fast failover. It is not suitable for production systems that require RTO < 1 hour.

Hands-on exercises

  • Calculate the RTO/RPO of each strategy for one web app
  • Set up Backup & Restore: back up EC2 + RDS cross-region
  • Set up Pilot Light: Aurora cross-region read replica
  • Test the Warm Standby failover procedure
  • (Optional) Run an AWS Resilience Hub assessment
