Week 9 - Day 2: DR Strategies (4 Patterns)
Learning objectives
- Understand the 4 DR patterns: Backup/Restore, Pilot Light, Warm Standby, Multi-Site
- Distinguish the RTO/RPO trade-offs
- Apply the right pattern to each business requirement
1. RTO and RPO
Definitions
- RTO (Recovery Time Objective): max acceptable downtime (how long can we be down?)
- RPO (Recovery Point Objective): max acceptable data loss (how much data can we lose?)
Example
- Online banking: RTO = 1 hour, RPO = 0 (no data loss)
- Internal blog: RTO = 24 hours, RPO = 1 day
Trade-off
- Lower RTO/RPO → higher cost
- Balance: spend $$ proportionally to business impact
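The two definitions can be made concrete with a small sketch. It assumes an incident is described by three timestamps (the last recoverable point, the failure, and the moment service was restored); the class and function names are illustrative only, and the targets are the ones from the examples above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Targets agreed with the business for one workload."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data-loss window


def evaluate_incident(objective, last_recoverable_point, failure_time, restored_time):
    """Return (downtime, data_loss, met) for a single incident."""
    downtime = restored_time - failure_time            # compared against RTO
    data_loss = failure_time - last_recoverable_point  # compared against RPO
    met = downtime <= objective.rto and data_loss <= objective.rpo
    return downtime, data_loss, met


# Targets taken from the examples above.
ONLINE_BANKING = RecoveryObjective(rto=timedelta(hours=1), rpo=timedelta(0))
INTERNAL_BLOG = RecoveryObjective(rto=timedelta(hours=24), rpo=timedelta(days=1))
```

An incident with 3 hours of downtime and a 6-hour-old backup would satisfy the internal blog's targets but not the online banking ones, since any data loss at all exceeds an RPO of 0.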
2. The 4 DR Strategies
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | 10 min-1h | Minutes | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Site Active/Active | Seconds | Near-zero | $$$$ |
3. Strategy 1: Backup & Restore
Architecture
RTO: Hours to days
RPO: Hours (depending on backup frequency)
Setup
- Daily backups to S3 with CRR
- AMI snapshots cross-region
- Database snapshots cross-region
- CloudFormation templates ready in DR region
Cost
- Lowest: only pay for backup storage in DR region
- No compute running in DR
Use case
- Non-critical apps
- Long RTO/RPO acceptable
- Cost-sensitive
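Why the RPO of Backup & Restore is measured in hours can be shown with a rough calculation. The sketch below assumes backups are taken at a fixed interval and need some time to be copied to the DR region; `worst_case_rpo` and its parameters are illustrative names, not part of any AWS API.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval, copy_lag=timedelta(0)):
    """Upper bound on data loss when restoring from the newest backup in the DR region.

    If the failure happens just before the next backup arrives in DR, the newest
    usable backup is one full interval old, plus the time it took to be copied.
    """
    return backup_interval + copy_lag
```

With daily backups and an assumed 2-hour cross-region copy, the worst case is 26 hours of lost data; shortening the backup interval is the main lever for improving it.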
4. Strategy 2: Pilot Light
Architecture
RTO: 10 minutes - 1 hour
RPO: Minutes (continuous replication)
Setup
- Continuous DB replication (e.g., Aurora cross-region replica)
- AMIs current in DR region
- Network setup pre-configured
- Auto Scaling templates ready
Cost
- Medium: minimal compute running, DB replica costs
- Still pay for DB replica + minimal EC2
Use case
- Important apps but not mission-critical
- Under 1 hour of downtime is acceptable
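A Pilot Light failover usually means promoting the DB replica, launching compute from the prepared AMIs and templates, and switching DNS. The sketch below estimates the resulting RTO by adding up those steps; the durations are hypothetical placeholders that should be replaced with numbers measured in drills, and the steps are assumed to run one after another.

```python
from datetime import timedelta

# Hypothetical step durations; real values should come from DR drills.
PILOT_LIGHT_RUNBOOK = [
    ("Promote the DB replica in the DR region", timedelta(minutes=10)),
    ("Launch compute from AMIs via Auto Scaling", timedelta(minutes=15)),
    ("Point DNS at the DR region", timedelta(minutes=5)),
]


def estimate_rto(runbook, detection_time=timedelta(0)):
    """Sum the step durations, assuming the steps run strictly in sequence."""
    return detection_time + sum((duration for _, duration in runbook), timedelta(0))
```

With these placeholder values the estimate is 30 minutes, or 35 minutes if detecting the outage takes another 5, which stays inside the 1-hour target for this strategy.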
5. Strategy 3: Warm Standby
Architecture
RTO: Minutes (< 30 min)
RPO: Minutes
Setup
- Smaller version of production in DR
- Continuous replication
- Optionally serve a small percentage of live traffic to keep the standby warm
Cost
- Higher: full architecture running (smaller scale)
Use case
- Critical apps
- Need fast recovery
- Budget for redundancy
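The "smaller version of production" idea can be sketched as simple capacity arithmetic. The function names and the fraction used are assumptions for illustration; the point is only that a warm standby always keeps at least one instance running and scales up the difference at failover.

```python
import math


def standby_instances(production_instances, standby_fraction):
    """Instances kept running in DR; never fewer than one, so the stack stays live."""
    return max(1, math.ceil(production_instances * standby_fraction))


def instances_to_add_on_failover(production_instances, standby_fraction):
    """Extra instances needed to bring the standby up to full production size."""
    running = standby_instances(production_instances, standby_fraction)
    return max(0, production_instances - running)
```

For a 20-instance production fleet and a standby sized at 25%, 5 instances run in DR all the time and 15 more are launched when failing over.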
6. Strategy 4: Multi-Site Active-Active
Architecture
RTO: Seconds (auto-failover)
RPO: Near-zero
Setup
- Both regions full production
- Bi-directional data replication
- Route 53 latency-based routing + health checks
- Aurora Global Database / DynamoDB Global Tables
Cost
- Highest: 2x+ infrastructure
- 2x compute, 2x storage, cross-region transfer
Use case
- Mission-critical (financial, healthcare)
- Global users (low latency benefit too)
- Compliance requires multi-region
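The routing behaviour in an active-active setup can be illustrated with a toy model: send each user to the lowest-latency region that is still passing health checks. This is a simulation of the idea only, not the Route 53 API; the region names and latency figures are made up.

```python
def pick_region(latency_ms, healthy):
    """Return the healthy region with the lowest latency, or None if all are down."""
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

When the nearest region fails its health check, the same call simply returns the next-best region, which is why no manual failover step is needed in this pattern.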
7. Comparison Table
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours-days | Hours | $ | Low |
| Pilot Light | 10 min-1h | Minutes | $$ | Medium |
| Warm Standby | < 30 min | Minutes | $$$ | Medium-High |
| Multi-Site A/A | Seconds | Near-zero | $$$$ | High |
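One way to read the table is as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets the required RTO and RPO. The sketch below does exactly that. The numeric values are illustrative readings of "Hours", "Minutes" and "Seconds", chosen as assumptions for this example rather than guarantees.

```python
from datetime import timedelta

# Ordered cheapest first. The numbers are illustrative readings of the table,
# not guarantees for any real workload.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=30), timedelta(minutes=15)),
    ("Multi-Site Active-Active", timedelta(minutes=1), timedelta(seconds=1)),
]


def cheapest_strategy(required_rto, required_rpo):
    """Return the first (cheapest) strategy whose assumed RTO and RPO fit, else None."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    return None
```

A result of `None` means no row of the table meets the targets under these assumed numbers, which is a signal to revisit either the targets or the architecture.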
8. Choosing Strategy
Decision matrix
Cost vs benefit
- Calculate cost of downtime per hour
- If downtime costs $1,000/hour, $500/month for HA pays for itself once it prevents more than 30 minutes of downtime per month on average
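The cost-versus-benefit comparison is plain arithmetic and can be sketched as follows. The function names are illustrative; the inputs are the hourly cost of downtime, the monthly cost of the DR setup, and an estimate of how many hours of downtime it is expected to prevent per month.

```python
def break_even_hours_per_month(downtime_cost_per_hour, dr_cost_per_month):
    """Hours of downtime per month the DR setup must prevent to pay for itself."""
    return dr_cost_per_month / downtime_cost_per_hour


def is_worth_it(downtime_cost_per_hour, dr_cost_per_month, expected_hours_avoided_per_month):
    """True when the expected avoided loss exceeds the monthly DR spend."""
    avoided_loss = expected_hours_avoided_per_month * downtime_cost_per_hour
    return avoided_loss > dr_cost_per_month
```

With $1,000/hour and $500/month the break-even point is 0.5 hours per month, so the spend is justified only if the setup is expected to prevent more downtime than that.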
9. DR Testing
Regular drills required
- Test DR procedure quarterly
- Validate RTO/RPO actually meet objectives
- Update runbooks after tests
Game days
- Simulated outages
- Practice for ops team
- Find gaps before real incident
AWS Resilience Hub
- Automated resilience assessment and testing tool
- Continuous validation
- Recommendations based on Well-Architected Framework
10. Common AWS Services for DR
Compute
- EC2 + AMI cross-region copy
- ASG with multi-region patterns
- Lambda (regional, easy redeploy)
Storage
- S3 Cross-Region Replication (CRR)
- EBS snapshot cross-region copy
- AWS Backup cross-region
Databases
- RDS cross-region read replica → promote
- Aurora Global Database (typically sub-second cross-region replication lag)
- DynamoDB Global Tables (multi-region, multi-active)
- ElastiCache global datastore (Redis)
Network
- Route 53 Failover routing
- Route 53 health checks
- Global Accelerator (anycast)
- Transit Gateway peering
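How quickly DNS-based failover takes effect depends mostly on the health check settings and the record TTL. The sketch below gives a rough estimate under the assumption that an endpoint is marked unhealthy after a number of consecutive failed checks and that clients may keep the old answer for one full TTL; the function name is illustrative and the model ignores resolvers that cache longer than the TTL.

```python
def estimated_dns_failover_seconds(check_interval_s, failure_threshold, record_ttl_s):
    """Rough time until clients reach the DR endpoint after the primary fails."""
    detection = check_interval_s * failure_threshold  # consecutive failed checks
    return detection + record_ttl_s                   # cached answers must expire
```

For example, a 30-second check interval with a threshold of 3 and a 60-second TTL gives about 150 seconds, while a 10-second interval with the same threshold and TTL gives about 90 seconds.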
Orchestration
- CloudFormation templates (rebuild)
- AWS Backup (centralized backup management)
- AWS Resilience Hub (DR validation)
11. Specific Service DR Capabilities
S3
- Same-region replication (SRR)
- Cross-region replication (CRR)
- S3 Replication Time Control (RTC): designed to replicate 99.99% of new objects within 15 min
RDS
- Multi-AZ: synchronous, intra-region
- Read replicas: async, same-region or cross-region
- Automated backups: PITR
- Snapshots: manual, cross-region copy
Aurora
- Aurora Global Database: typically < 1 min RTO, < 1 sec RPO
DynamoDB
- Global Tables: multi-region active-active
Lambda
- Stateless, regional
- DR: deploy in DR region (CloudFormation/IaC)
12. Common Patterns
Pattern 1: SaaS web app
- Multi-AZ in primary region
- Warm Standby in DR region (smaller)
- Aurora cross-region replica
- S3 CRR for assets
- Route 53 failover with health check
Pattern 2: E-commerce global
- Multi-Site Active-Active
- DynamoDB Global Tables
- Aurora Global Database
- Route 53 latency routing
- CloudFront for static content
Pattern 3: Compliance archive
- Backup & Restore
- S3 Glacier with cross-region replication
- Quarterly restore drills
Pattern 4: Critical internal tool
- Pilot Light
- Daily DB backups + CRR
- AMI ready in DR
- Documented manual failover procedure
Review questions
- What are the 4 DR strategies, from cheapest to most expensive?
  Answer: (1) Backup & Restore: the cheapest. Back up data to the DR Region and restore when disaster strikes. RTO: hours to days. (2) Pilot Light: core components (DB replication) keep running while compute is off; scale up when needed. RTO: minutes to hours. (3) Warm Standby: a scaled-down copy of production keeps running; scale up on failover. RTO: minutes. (4) Multi-Site Active-Active: the most expensive. Both Regions run at full capacity and traffic is split between them. RTO: near zero.
- Does Pilot Light keep resources running in the DR region?
  Answer: Yes, but only core components: database replication is running (RDS cross-region replica, DynamoDB Global Tables). Compute (EC2, ECS) is off or scaled to the minimum. When disaster strikes: (1) promote the DB replica, (2) scale up or launch compute, (3) update DNS. The name "Pilot Light" comes from gas heaters: a small flame that is always burning, ready to ignite the full burner. It is cheaper than Warm Standby because compute is not running.
- Roughly what RTO does Multi-Site Active-Active achieve?
  Answer: Near zero, because both Regions are already serving traffic. When one Region fails, Route 53 or Global Accelerator automatically routes 100% of traffic to the remaining Region, with no manual intervention. The actual RTO depends on the DNS TTL and health check intervals (usually < 1 minute). RPO is also near zero but not exactly zero, because cross-Region replication is asynchronous: Aurora Global Database typically lags by less than a second, while other asynchronous replication can lag by seconds to minutes.
- What is AWS Resilience Hub used for?
  Answer: AWS Resilience Hub assesses and helps improve application resiliency. Import the application from CloudFormation/Terraform/AppRegistry → define RTO/RPO targets → the Hub analyzes the architecture → reports gaps with recommendations → generates a resiliency score. It can also run resiliency drills (automated chaos engineering). It lets teams find out whether the application meets its DR objectives before a real disaster does.
- When should you use the Backup & Restore strategy?
  Answer: When: (1) RTO/RPO requirements are relaxed (hours are acceptable), (2) the budget is limited and cannot justify running infrastructure in the DR region continuously, (3) the workloads are non-critical: dev/test, batch processing, archives, (4) data is fully backed up and restores are tested regularly, (5) compliance requires offsite backups but not fast failover. Not suitable for production systems that require RTO < 1 hour.
Hands-on exercises
- Calculate the RTO/RPO of each strategy for a single web app
- Set up Backup & Restore: back up EC2 + RDS cross-region
- Set up Pilot Light: Aurora cross-region read replica
- Test the Warm Standby failover procedure
- (Optional) Run an AWS Resilience Hub assessment