Question 1

What is the difference between Multi-AZ and Multi-Region?

Accepted Answer

Multi-AZ: redundancy within a single Region. It protects against the failure of one Availability Zone (hardware, power, network). Failover is automatic and fast (seconds to minutes). Multi-Region: redundancy across two or more Regions. It protects against a Region-wide failure (rare) or serves users globally with low latency. Multi-Region is more complex and more expensive, but it is needed for the highest availability requirements and for global applications.

Question 2

Do Region failures happen often?

Accepted Answer

Very rarely. Region-wide AWS failures are extremely rare, whereas AZ failures happen occasionally. Multi-Region is typically adopted for: (1) Latency: serving users globally, (2) Compliance: data residency requirements, (3) Business continuity for extreme availability requirements. It is not adopted because Regions fail often. Most applications only need Multi-AZ; for most use cases Multi-Region is overkill and costly.

Question 3

When should you use Multi-Region?

Accepted Answer

(1) A global user base that needs low latency from multiple continents, (2) Near-zero RTO/RPO, where even the downtime of a failover is unacceptable, (3) Regulatory compliance that requires data to be replicated to a second Region, (4) Critical business applications where every minute of downtime carries a high revenue loss, (5) Sovereignty or data residency requirements. Cost and complexity rise significantly, so Multi-Region is justified only when the business requirements are clear.

Question 4

Is RDS Multi-AZ replication synchronous or asynchronous?

Accepted Answer

Synchronous replication: the primary does not acknowledge a write to the application until the standby confirms it has replicated it. This ensures that no committed data is lost (RPO = 0) on failover. The trade-off is slightly higher write latency than single-AZ. Read Replicas, by contrast, use asynchronous replication: they may lag by a few seconds, but they have little impact on write performance.
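The difference between the two replication modes can be illustrated with a toy in-memory model. This is only a sketch: the `Primary` and `Replica` classes and the `ship_async` method are hypothetical names invented for illustration, not part of RDS or any AWS SDK, and the model ignores networking, durability, and failure handling.

```python
class Replica:
    """A toy standby or read replica that stores the writes it has received."""

    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)


class Primary:
    """A toy primary with one synchronous standby and one asynchronous replica."""

    def __init__(self, sync_standby, async_replica):
        self.log = []
        self.sync_standby = sync_standby
        self.async_replica = async_replica
        self.pending = []  # writes not yet shipped to the async replica

    def write(self, record):
        """Acknowledge the write only after the standby has the record."""
        self.log.append(record)
        self.sync_standby.apply(record)  # synchronous: happens before the ack
        self.pending.append(record)      # asynchronous: shipped some time later
        return "ack"

    def ship_async(self, max_records):
        """Simulate the async replica catching up on part of the backlog."""
        batch = self.pending[:max_records]
        self.pending = self.pending[max_records:]
        for record in batch:
            self.async_replica.apply(record)


if __name__ == "__main__":
    standby, replica = Replica(), Replica()
    primary = Primary(standby, replica)
    for i in range(5):
        primary.write(f"txn-{i}")
    primary.ship_async(3)  # the replica has only caught up on 3 of 5 writes

    print("standby has every acknowledged write:", standby.log == primary.log)
    print("writes the async replica is still missing:", primary.pending)
```

In this model, promoting the standby after the primary fails loses nothing that was acknowledged, while promoting the lagging replica would lose whatever is still in `pending`.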

Question 5

What is the maximum downtime per year under a 99.99% SLA?

Accepted Answer

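About 52.6 minutes per year. For comparison: 99.9% = ~8.8 hours/year, 99.95% = ~4.4 hours/year, 99.999% (five nines) = ~5.3 minutes/year. Formula: downtime = (1 - SLA) × 365 × 24 × 60 minutes, with the SLA written as a fraction (0.9999 for 99.99%). Several AWS services, such as DynamoDB and ALB, carry a 99.99% SLA or higher; S3 Standard is designed for 99.99% availability, although its SLA commitment is 99.9%. When components depend on each other in series, the composite availability is the product of the component availabilities, so it is lower than that of the weakest component. Redundancy is needed to reach a high overall SLA.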
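The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, a year is taken as 365 days, and the parallel formula assumes replicas fail independently, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def serial_availability(*components):
    """Availability (as a fraction) when every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(*replicas):
    """Availability (as a fraction) when any one redundant replica is enough.

    Assumes independent failures.
    """
    all_down = 1.0
    for availability in replicas:
        all_down *= 1 - availability
    return 1 - all_down


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99, 99.999):
        print(f"{sla}% -> {downtime_minutes_per_year(sla):.2f} minutes/year")

    # Two 99.99% components in series end up below 99.99%.
    print(f"serial:   {serial_availability(0.9999, 0.9999):.6f}")
    # Two independent 99% replicas in parallel end up above 99%.
    print(f"parallel: {parallel_availability(0.99, 0.99):.6f}")
```

The serial case shows why a chain of highly available services still falls short of each individual SLA, and the parallel case shows how redundancy recovers it.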

Failure type	Affected scope	Mitigation
Instance failure	1 EC2 instance	Auto Scaling Group (auto-replace)
Hardware failure	1 host	Place instances in a spread placement group
AZ failure (rare)	All resources in the AZ	Multi-AZ deployment
Region failure (very rare)	All resources in the Region	Multi-Region deployment

Strategy	RTO	RPO	Relative cost
Backup & Restore	Hours	Hours	$
Pilot Light	10 minutes to 1 hour	Minutes	$$
Warm Standby	Minutes	Minutes	$$$
Multi-Site (Active-Active)	Seconds	Near zero	$$$$
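One way to read the strategy table is as a lookup: pick the cheapest strategy whose RTO and RPO both fit the business requirement. The sketch below assumes numeric bounds for each row. Those numbers are illustrative stand-ins for the table's coarse "Hours", "Minutes", and "Seconds" labels, not AWS-published figures, and the `Strategy` class and `cheapest_strategy` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed upper bound on recovery time
    rpo_minutes: float  # assumed upper bound on the data-loss window
    cost_rank: int      # 1 = cheapest ($), 4 = most expensive ($$$$)


# Illustrative bounds only; real values depend on the workload and its design.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, cost_rank=1),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=10, cost_rank=2),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=10, cost_rank=3),
    Strategy("Multi-Site (Active-Active)", rto_minutes=1, rpo_minutes=0, cost_rank=4),
]


def cheapest_strategy(required_rto_minutes, required_rpo_minutes):
    """Return the lowest-cost strategy meeting both targets, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= required_rto_minutes
        and s.rpo_minutes <= required_rpo_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)


if __name__ == "__main__":
    # A workload that tolerates 4 hours of downtime and 1 hour of data loss.
    choice = cheapest_strategy(required_rto_minutes=4 * 60, required_rpo_minutes=60)
    print(choice.name if choice else "no strategy meets the targets")
```

Starting from the requirement and choosing the cheapest fit reflects the same point made in the answers above: the more expensive strategies are justified only when the RTO/RPO targets demand them.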