Week 9 - Day 1: Multi-AZ vs Multi-Region
Learning objectives
- Distinguish Multi-AZ from Multi-Region architectures
- Understand the trade-offs: cost, complexity, RTO/RPO
- Apply the right pattern to each workload
1. High Availability Concepts
Definitions
- High Availability (HA): minimize downtime (typically 99.9-99.99% uptime)
- Fault Tolerance: zero downtime, system continues during failure
- Disaster Recovery (DR): recover from major disaster (region-wide)
Key metrics
- RTO (Recovery Time Objective): max acceptable downtime
- RPO (Recovery Point Objective): max acceptable data loss
- SLA (Service Level Agreement): uptime guarantee
Common SLA examples
- 99% = ~3.65 days downtime/year
- 99.9% = ~8.77 hours/year
- 99.99% = ~52.6 minutes/year
- 99.999% (5 nines) = ~5.26 minutes/year
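The downtime figures above follow directly from the availability percentage. A minimal sketch, assuming a 365-day year; the function name `downtime_per_year` is illustrative and not part of the course material:

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year (525,600 minutes)


def downtime_per_year(availability_pct: float) -> timedelta:
    """Return the maximum yearly downtime allowed by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(minutes=unavailable_fraction * MINUTES_PER_YEAR)


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_per_year(pct)}")
```

Using a 365.25-day year instead shifts the results slightly, which is why 99.9% is sometimes quoted as 8.76 hours and sometimes as 8.77 hours.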
2. AWS Failure Domains
Hierarchy
Region > Availability Zone (AZ) > host > instance
Failure scenarios
| Failure | Affected | Mitigation |
|---|---|---|
| Instance failure | 1 EC2 | Auto Scaling Group (auto-replace) |
| Hardware failure | 1 host | Spread placement group (instances on distinct hardware) |
| AZ failure (rare) | All resources in AZ | Multi-AZ deployment |
| Region failure (very rare) | All resources in region | Multi-Region deployment |
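
One way to reason about the table is to ask which instances survive when a given failure domain goes down. The sketch below is a toy model: the fleet, the instance names, and the host names are hypothetical, and `survivors` is an illustrative helper rather than an AWS API.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    region: str
    az: str
    host: str


def survivors(fleet: dict[str, Placement], level: str, failed: str) -> list[str]:
    """Instances still running after the named region, AZ, or host fails."""
    if level not in Placement._fields:
        raise ValueError(f"level must be one of {Placement._fields}")
    return sorted(
        name for name, placement in fleet.items()
        if getattr(placement, level) != failed
    )


# Hypothetical fleet: two instances share an AZ, one sits in a second AZ,
# and one runs in a second Region.
FLEET = {
    "web-1": Placement("us-east-1", "us-east-1a", "host-a1"),
    "web-2": Placement("us-east-1", "us-east-1a", "host-a2"),
    "web-3": Placement("us-east-1", "us-east-1b", "host-b1"),
    "web-4": Placement("eu-west-1", "eu-west-1a", "host-c1"),
}

if __name__ == "__main__":
    print(survivors(FLEET, "host", "host-a1"))
    print(survivors(FLEET, "az", "us-east-1a"))
    print(survivors(FLEET, "region", "us-east-1"))
```

Each step up the hierarchy removes more of the fleet, which is why the mitigation has to sit one level above the failure it protects against.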
3. Multi-AZ Architecture
Pattern
Components
- ALB: span multiple AZs (built-in HA)
- ASG: spread across AZs
- RDS Multi-AZ: primary + standby in different AZ
- DynamoDB: automatically replicates data across multiple AZs
- EFS: multi-AZ
- S3: multi-AZ (Standard storage class)
Benefits
- Tolerate AZ failure
- Low latency between AZs (< 10ms)
- No cross-region data transfer cost
- Simpler than multi-region
Trade-offs
- Does NOT protect against region failure
- Limited to 1 region's services
Typical SLA: 99.99%
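Tolerating an AZ failure only helps if the surviving AZs can carry the full load. A rough capacity sketch, assuming identical instances and evenly balanced traffic; the function and parameter names are hypothetical:

```python
import math


def instances_per_az(required_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the remaining AZs still cover the load."""
    surviving_azs = az_count - az_failures_tolerated
    if surviving_azs < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(required_instances / surviving_azs)


if __name__ == "__main__":
    for azs in (2, 3):
        per_az = instances_per_az(required_instances=6, az_count=azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

Under these assumptions, a peak load of 6 instances needs 12 instances across two AZs but only 9 across three, so spreading over more AZs lowers the overhead of surviving a single AZ loss.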
4. Multi-Region Architecture
Pattern
Benefits
- Tolerate region failure (major disaster)
- Lower latency for global users
- Compliance (data residency per region)
Trade-offs
- Higher cost (resources in multiple regions)
- Higher complexity (replication, failover)
- Eventually consistent between regions (lag)
- Cross-region data transfer cost
Typical SLA: 99.999%+
5. DR Strategies (recap; details tomorrow)
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | 10 min - 1 hour | Minutes | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Site (Active-Active) | Seconds | Near zero | $$$$ |
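
The table can be read as a lookup: pick the cheapest strategy whose RTO and RPO still meet the targets. The sketch below assumes concrete numbers for each row purely for illustration (for example, "Hours" is taken as 24 hours); they are rough readings of the table, not AWS guarantees, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrStrategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data loss window
    cost_tier: int      # 1 = "$" ... 4 = "$$$$"


# Illustrative numbers only: rough readings of the table above.
STRATEGIES = [
    DrStrategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, cost_tier=1),
    DrStrategy("Pilot Light", rto_minutes=60, rpo_minutes=10, cost_tier=2),
    DrStrategy("Warm Standby", rto_minutes=10, rpo_minutes=5, cost_tier=3),
    DrStrategy("Multi-Site (Active-Active)", rto_minutes=1, rpo_minutes=0.1, cost_tier=4),
]


def cheapest_strategy(rto_minutes: float, rpo_minutes: float) -> Optional[DrStrategy]:
    """Return the lowest-cost strategy whose assumed RTO and RPO meet the targets."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_minutes and s.rpo_minutes <= rpo_minutes
    ]
    return min(candidates, key=lambda s: s.cost_tier, default=None)


if __name__ == "__main__":
    choice = cheapest_strategy(rto_minutes=4 * 60, rpo_minutes=30)
    print(choice.name if choice else "no strategy meets the targets")
```

A result of `None` means the targets are stricter than any row in the table, which is a signal to revisit the requirements rather than the tooling.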
6. Multi-AZ Decision Matrix
Always Multi-AZ for production
- RDS, Aurora (Multi-AZ deployment)
- ElastiCache (replicas in other AZs with Multi-AZ automatic failover)
- EFS (Standard)
- Application servers (ALB + ASG across AZs)
Single-AZ OK for
- Dev/test environments
- Stateless workloads (can rebuild)
- Cache (data can be regenerated)
- Compute-only workloads with data stored elsewhere
7. Multi-Region Use Cases
When multi-region needed
- Global users with latency requirements
- Mission-critical (banks, healthcare)
- Regulatory compliance (data residency)
- Tolerate region failure (rare but possible)
When NOT needed
- Single-region users
- Cost-sensitive (overhead 2x+)
- App not designed for distributed state
- Acceptable to have hours of downtime
8. Service Availability Patterns
Stateless services
- Easy to multi-AZ/multi-region (deploy in each)
- Examples: EC2 web tier, Lambda
Stateful services
- Harder — need data replication
- Examples: databases, file systems
Strategies for stateful services
- Multi-AZ replication: RDS Multi-AZ, ElastiCache Redis Multi-AZ
- Cross-region replication: S3 CRR, DynamoDB Global Tables, Aurora Global DB
- Manual replication: backup + restore in DR region
9. Data Replication Options
Synchronous replication
- Write to primary + replica before ACK
- Zero RPO (no data loss)
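- Higher write latency; practical only when replicas are close (roughly under 10 ms round trip, i.e. within one region)
- Examples: RDS Multi-AZ, Aurora storage (six copies across three AZs, quorum writes)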
Asynchronous replication
- Write to primary and ACK immediately; the replica catches up later
- Non-zero RPO (data loss possible, typically seconds to minutes)
- Lower write latency (works across regions)
- Examples: RDS Read Replicas, S3 CRR, DynamoDB Global Tables
Combined
- Within region: sync (HA)
- Cross-region: async (DR)
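The RPO difference between the two modes can be seen in a toy simulation. This is a deliberately simplified model of a primary and one replica, not a model of any specific AWS service; the class and method names are made up for the example.

```python
class ReplicatedStore:
    """Toy primary/replica pair used to illustrate sync vs async replication."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.replica: list[str] = []
        self._pending: list[str] = []  # acknowledged but not yet replicated

    def write(self, record: str) -> None:
        """Acknowledge a write; sync mode copies to the replica before returning."""
        self.primary.append(record)
        if self.synchronous:
            self.replica.append(record)
        else:
            self._pending.append(record)

    def replicate(self) -> None:
        """Drain the async backlog (stands in for the replication stream)."""
        self.replica.extend(self._pending)
        self._pending.clear()

    def fail_over(self) -> list[str]:
        """Primary is lost; promote the replica and return the lost writes."""
        lost = list(self._pending)
        self.primary = list(self.replica)
        self._pending = []
        return lost


def run_scenario(synchronous: bool) -> list[str]:
    """Two writes, with the primary failing before the second is replicated."""
    store = ReplicatedStore(synchronous)
    store.write("order-1")
    store.replicate()
    store.write("order-2")
    return store.fail_over()


if __name__ == "__main__":
    print("sync lost:", run_scenario(synchronous=True))
    print("async lost:", run_scenario(synchronous=False))
```

In this scenario the synchronous store loses nothing, while the asynchronous store loses `order-2`, the write that was acknowledged but still waiting in the replication backlog when the primary failed.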
10. Common Architecture Patterns
Pattern 1: Multi-AZ web app
Pattern 2: Multi-Region active-passive
Pattern 3: Multi-Region active-active
11. Cost Considerations
Multi-AZ
- ~2x cost for stateful (RDS Multi-AZ = 2x)
- ~1x cost for stateless (already in multiple AZs)
Multi-Region
- ~2x+ cost (duplicate everything)
- Cross-region data transfer ($)
- Plus replication setup/maintenance
Optimize
- Use Multi-AZ for HA; add multi-region only when it is actually required
- Use Aurora Global Database instead of full duplication (cheaper)
- DynamoDB Global Tables (pay per region)
- Active-passive cheaper than active-active (DR region scaled down)
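The multipliers above can be combined into a back-of-the-envelope model. The sketch below assumes that Multi-AZ doubles only the stateful cost, that each extra region costs a fraction (`dr_scale`) of the primary, and that cross-region transfer is a flat monthly amount. All dollar figures are hypothetical.

```python
def monthly_cost(stateless: float, stateful: float, *, multi_az: bool = False,
                 regions: int = 1, dr_scale: float = 1.0,
                 transfer: float = 0.0) -> float:
    """Rough monthly cost under the simplifying assumptions in the lead-in."""
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # Multi-AZ roughly doubles stateful cost; stateless is already spread out.
    per_region = stateless + stateful * (2 if multi_az else 1)
    extra_regions = regions - 1
    total = per_region * (1 + extra_regions * dr_scale)
    if extra_regions:
        total += transfer  # cross-region replication traffic
    return total


if __name__ == "__main__":
    base = dict(stateless=1000.0, stateful=500.0)  # hypothetical monthly spend
    print("single-AZ:      ", monthly_cost(**base))
    print("Multi-AZ:       ", monthly_cost(**base, multi_az=True))
    print("active-active:  ", monthly_cost(**base, multi_az=True, regions=2, transfer=100.0))
    print("active-passive: ", monthly_cost(**base, multi_az=True, regions=2,
                                           dr_scale=0.25, transfer=100.0))
```

With these made-up inputs, a DR region scaled down to 25% adds far less than a full second region, which is the point of the active-passive item in the list above.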
Review questions
- What is the difference between Multi-AZ and Multi-Region?
  Answer: Multi-AZ is redundancy within a single Region; it protects against AZ failure (hardware, power, network), and failover is automatic and fast (seconds to minutes). Multi-Region is redundancy across two or more Regions; it protects against Region failure (rare) or serves global users with low latency. Multi-Region is more complex and more expensive, but it is needed for the highest availability requirements and for global applications.
- Do Region failures happen often?
  Answer: Very rarely. Region-wide AWS failures are extremely rare, while AZ failures happen occasionally. Multi-Region is usually adopted for (1) latency: serving users globally, (2) compliance: data residency requirements, and (3) business continuity under extreme availability requirements, not because Region failures are frequent. Most applications only need Multi-AZ; for them, Multi-Region is overkill and costly.
- When should you use Multi-Region?
  Answer: (1) A global user base that needs low latency on several continents, (2) near-zero RTO/RPO, where even failover downtime is unacceptable, (3) regulatory compliance that requires data to be replicated to a second Region, (4) critical business applications where every minute of downtime means significant lost revenue, (5) sovereignty or data residency requirements. Cost and complexity rise significantly, so Multi-Region is justified only when the business requirements are clear.
- Is RDS Multi-AZ synchronous or asynchronous replication?
  Answer: Synchronous. The primary does not acknowledge a write to the application until the standby confirms it has replicated the change. This gives zero data loss (RPO = 0) on failover, at the price of slightly higher write latency than single-AZ. Read Replicas, by contrast, use asynchronous replication: they can lag by a few seconds but have little impact on write performance.
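- What is the maximum downtime per year under a 99.99% SLA?
  Answer: About 52 minutes per year (52.6 minutes). For comparison: 99.9% = ~8.8 hours/year, 99.95% = ~4.4 hours/year, 99.999% (five nines) = ~5 minutes/year. Formula: downtime = (1 - SLA) × 365 × 24 × 60 minutes, with the SLA written as a fraction (0.9999 for 99.99%). Several AWS services, such as DynamoDB and ALB, publish SLAs of 99.99% or higher, and S3 Standard is designed for 99.99% availability. The composite SLA of components chained in series is lower than the lowest component SLA, so redundancy is needed to reach a high overall SLA.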
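The composite-SLA point can be made concrete with two small formulas: availabilities multiply for components in series, and redundancy raises availability as 1 - (1 - a)^n. The sketch assumes component failures are independent, which real systems only approximate; the function names are illustrative.

```python
import math


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(components)


def parallel_availability(component: float, copies: int) -> float:
    """Availability when any one of `copies` independent replicas is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component) ** copies


if __name__ == "__main__":
    # Hypothetical chain: load balancer -> app tier -> database.
    chain = serial_availability(0.9999, 0.9999, 0.9995)
    print(f"serial chain: {chain:.4%}")
    # Two independent copies of a 99% component.
    print(f"redundant pair: {parallel_availability(0.99, 2):.4%}")
```

The chain comes out near 99.93%, below its weakest link at 99.95%, while duplicating a 99% component lifts it to roughly 99.99% under the independence assumption.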
Hands-on exercises
- Draw the architecture: Multi-AZ web app with ALB, ASG, RDS Multi-AZ
- Set up S3 CRR to another Region
- Create Route 53 failover routing (primary/secondary)
- Compare cost for the same workload: single-AZ vs Multi-AZ vs Multi-Region
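As a possible starting point for the Route 53 exercise, the sketch below only builds the change batch for a primary/secondary failover pair; it does not call AWS. The record name, IP addresses (documentation ranges), and health check ID are placeholders. With boto3, a payload of this shape would be passed as `ChangeBatch` to `change_resource_record_sets` on a Route 53 client together with your hosted zone ID.

```python
from typing import Optional


def failover_change_batch(record_name: str, primary_ip: str, secondary_ip: str,
                          health_check_id: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch with PRIMARY and SECONDARY failover records."""

    def record(role: str, ip: str, set_id: str, check: Optional[str] = None) -> dict:
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": set_id,  # must be unique per record in the set
            "Failover": role,         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if check:
            # The health check decides when traffic leaves this record.
            rrset["HealthCheckId"] = check
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing: primary with health check, passive secondary",
        "Changes": [
            record("PRIMARY", primary_ip, "primary", health_check_id),
            record("SECONDARY", secondary_ip, "secondary"),
        ],
    }


if __name__ == "__main__":
    batch = failover_change_batch(
        record_name="app.example.com.",
        primary_ip="203.0.113.10",      # placeholder address
        secondary_ip="198.51.100.10",   # placeholder address
        health_check_id="hypothetical-health-check-id",
    )
    for change in batch["Changes"]:
        print(change["ResourceRecordSet"]["Failover"], change["ResourceRecordSet"]["ResourceRecords"])
```

A short TTL is used here on the assumption that clients should pick up a failover quickly; the right value depends on how much DNS query volume you are willing to pay for.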
Next: DR Strategies (4 patterns)