Question 1

What is the difference between RTO and RPO?

Accepted Answer

RTO (Recovery Time Objective): the maximum time allowed between a disaster and the service being restored, i.e. the "maximum acceptable downtime". RPO (Recovery Point Objective): the maximum amount of data loss that can be tolerated, expressed as a window of time, i.e. the "maximum acceptable data loss". Example: an RTO of 1 hour means the service must be back online within 1 hour; an RPO of 15 minutes means backing up at least every 15 minutes and accepting the loss of up to 15 minutes of data. A lower RTO costs more (Warm Standby or Active-Active); a lower RPO costs more because it requires more frequent or synchronous replication.
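
To make the two objectives concrete, here is a minimal sketch that checks a single incident against its targets. The function name `evaluate_recovery` and all timestamps are hypothetical, chosen only to mirror the "RTO 1 hour, RPO 15 minutes" example above.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_backup, disaster, restored, rto, rpo):
    """Compare one incident against its RTO and RPO targets."""
    downtime = restored - disaster             # measured against RTO
    data_loss_window = disaster - last_backup  # measured against RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: every timestamp here is made up for illustration.
result = evaluate_recovery(
    last_backup=datetime(2024, 1, 1, 11, 50),
    disaster=datetime(2024, 1, 1, 12, 0),
    restored=datetime(2024, 1, 1, 12, 45),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
)
print(result)
```

In this made-up incident the service was down for 45 minutes and the last backup was 10 minutes old, so both targets are met.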

Question 2

What are the 4 DR strategies, ordered by cost from lowest to highest?

Accepted Answer

From cheapest to most expensive: (1) Backup & Restore: back up to S3/Glacier and restore when needed, RTO of hours to days; (2) Pilot Light: core database running, compute switched off, RTO of minutes to hours; (3) Warm Standby: a scaled-down copy of production running, RTO of minutes; (4) Multi-Site Active-Active: full production in both regions, RTO near zero. The trade-off: lower cost means higher RTO/RPO.

Question 3

How does Pilot Light differ from Warm Standby?

Accepted Answer

Pilot Light: only the core components run in the DR region, typically just database replication (an RDS cross-region read replica, DynamoDB Global Tables). Compute (EC2, ECS) is switched off and must be launched and scaled at failover. Warm Standby: a scaled-down version of production is running, both compute and database, but with less capacity (for example 10% of production scale). Warm Standby fails over faster (it only needs to scale up); Pilot Light is cheaper (no paying for idle compute).

Question 4

What problem does the Circuit Breaker pattern solve?

Accepted Answer

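It prevents cascading failures in microservices. When service A's calls to service B (which is slow or down) keep failing, the circuit breaker "opens": A stops calling B and returns a fallback response immediately. After a timeout it moves to "half-open" and lets a few requests through to B. If they succeed, the circuit "closes" and traffic flows normally; if they still fail, it "opens" again. Resilience4j implements the pattern directly, and AWS App Mesh offers comparable protection through outlier detection; AWS SDK retry mechanisms are a complementary technique rather than a circuit breaker. This keeps B's slowness from slowing down A and exhausting A's thread pool.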
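
The closed, open, and half-open states described above can be sketched as a small state machine. This is an illustrative sketch, not the API of Resilience4j or any AWS service: the class name, the `failure_threshold` and `reset_timeout` parameters, and the `call_service_b` stand-in are all assumptions. As a simplification, the half-open state here lets through a single trial call rather than several.

```python
import time


class CircuitBreaker:
    """Minimal closed -> open -> half-open state machine (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()        # fail fast without calling B
            self.state = "half_open"     # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0


def call_service_b():
    # Hypothetical stand-in for a remote call to a service B that is down.
    raise ConnectionError("service B is unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
responses = [breaker.call(call_service_b, lambda: "fallback") for _ in range(3)]
print(responses, breaker.state)
```

After two failures the breaker opens, so the third call returns the fallback without reaching `call_service_b` at all. That is what protects A's threads from waiting on B.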

Question 5

How do you calculate the availability of 2 components running in parallel?

Accepted Answer

Parallel (redundant) components: Availability = 1 - (1 - A1) × (1 - A2). Example: 2 EC2 instances, each 99% available → 1 - (1 - 0.99)² = 1 - 0.0001 = 99.99%. Series components: A_total = A1 × A2 (every component must be working). This is why Multi-AZ raises availability: the probability of both AZs failing at the same time is very low, provided their failures are independent. Each additional AZ moves availability closer to 100%.
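
The two formulas above generalize to any number of components. The sketch below assumes independent failures, and the function names are hypothetical.

```python
from math import prod


def parallel_availability(availabilities):
    """Redundant components: the system is up if at least one component is up."""
    return 1 - prod(1 - a for a in availabilities)


def series_availability(availabilities):
    """Chained components: the system is up only if every component is up."""
    return prod(availabilities)


two_instances = parallel_availability([0.99, 0.99])          # about 0.9999
three_instances = parallel_availability([0.99, 0.99, 0.99])  # about 0.999999
two_in_series = series_availability([0.99, 0.99])            # about 0.9801

print(f"{two_instances:.6f} {three_instances:.6f} {two_in_series:.6f}")
```

The same two 99% components give roughly 99.99% in parallel but only about 98.01% in series, which is why redundancy matters on every hop of a request path.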

Strategy	RTO	RPO	Cost
Backup & Restore	Hours	Hours	$
Pilot Light	10s of minutes	Minutes	$$
Warm Standby	Minutes	Seconds	$$$
Multi-Site Active/Active	Near zero	Near zero	$$$$
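
One way to read the table is as a lookup: pick the cheapest strategy whose RTO and RPO still fit the business targets. The sketch below does that. The numeric bounds are illustrative assumptions that stand in for the table's qualitative ranges ("Hours", "Minutes", "Near zero"); they are not AWS-published figures, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window
    cost_rank: int      # 1 = "$" ... 4 = "$$$$"


# Assumed numeric bounds for illustration only.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 10, 2),
    Strategy("Warm Standby", 10, 1, 3),
    Strategy("Multi-Site Active/Active", 1, 0.1, 4),
]


def cheapest_strategy(rto_target_minutes, rpo_target_minutes):
    """Return the lowest-cost strategy that meets both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_target_minutes
        and s.rpo_minutes <= rpo_target_minutes
    ]
    return min(candidates, key=lambda s: s.cost_rank) if candidates else None


choice = cheapest_strategy(rto_target_minutes=60, rpo_target_minutes=15)
print(choice.name if choice else "no strategy meets the targets")
```

With the assumed bounds, a 1-hour RTO and 15-minute RPO select Pilot Light; tightening either target pushes the choice toward the more expensive rows.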

Service	Purpose
ELB	Distribute traffic across AZs
Auto Scaling	Maintain desired capacity
RDS Multi-AZ	Database failover
S3	Durable object storage (11 nines durability)
Route 53	Health checks and DNS failover

Service	Purpose
AWS Backup	Centralized backup management
S3 Cross-Region Replication	Data replication
Aurora Global Database	Multi-region database
DynamoDB Global Tables	Multi-region NoSQL
CloudFormation StackSets	Multi-region deployment

Availability	Downtime/Year	Downtime/Month
99% (2 nines)	3.65 days	7.3 hours
99.9% (3 nines)	8.76 hours	43.8 minutes
99.99% (4 nines)	52.6 minutes	4.38 minutes
99.999% (5 nines)	5.26 minutes	26.3 seconds
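
The downtime figures in the table follow from multiplying the unavailable fraction by the length of the period. A minimal sketch, assuming a 365-day year and a month of one twelfth of a year (730 hours); the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24               # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730


def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime permitted by an availability target over a period."""
    return (1 - availability_percent / 100) * period_hours


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(target, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(target, HOURS_PER_MONTH)
    print(f"{target}%: {per_year:.3f} h/year, {per_month * 60:.2f} min/month")
```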

Week 2 - Day 4: Reliability Pillar

Learning Objectives

1. Definition of the Reliability Pillar

4 Focus Areas

2. Design Principles

1. Automatically recover from failure

2. Test recovery procedures

3. Scale horizontally

4. Stop guessing capacity

5. Manage change through automation

3. Key Concepts: RTO and RPO

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

RTO/RPO Trade-offs

4. Disaster Recovery Strategies

1. Backup and Restore

2. Pilot Light

3. Warm Standby

4. Multi-Site Active/Active

5. Fault Tolerance Patterns

Circuit Breaker Pattern

Retry with Exponential Backoff

6. AWS Services for Reliability

High Availability

Disaster Recovery

7. Availability Calculations

SLA và Nines

Calculating Composite Availability

8. Review Questions

9. Hands-on Exercises

Official References