Week 9 - Day 1: Multi-AZ vs Multi-Region

Learning objectives

  • Distinguish Multi-AZ from Multi-Region architectures
  • Understand the trade-offs: cost, complexity, RTO/RPO
  • Apply the right pattern to each workload

1. High Availability Concepts

Definitions

  • High Availability (HA): minimizes downtime (typically 99.9% to 99.99% uptime)
  • Fault Tolerance: zero downtime; the system keeps running during a failure
  • Disaster Recovery (DR): recovery from a major disaster (for example, a region-wide outage)

Key metrics

  • RTO (Recovery Time Objective): max acceptable downtime
  • RPO (Recovery Point Objective): max acceptable data loss
  • SLA (Service Level Agreement): uptime guarantee
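
To make RTO and RPO concrete, here is a minimal sketch that measures one incident against both objectives. The timestamps and the one-hour objectives are invented for illustration; the data-loss window is simplified to the time since the last backup.

```python
from datetime import datetime, timedelta

def measure_incident(last_backup, failure, restored):
    """Return (data_loss_window, downtime) for a single incident."""
    data_loss_window = failure - last_backup  # compared against the RPO
    downtime = restored - failure             # compared against the RTO
    return data_loss_window, downtime

def meets_objectives(last_backup, failure, restored, rpo, rto):
    """True only if both the RPO and the RTO were respected."""
    data_loss_window, downtime = measure_incident(last_backup, failure, restored)
    return data_loss_window <= rpo and downtime <= rto

# Hypothetical incident: nightly backup at 02:00, failure at 09:30, service back at 10:15.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 10, 15)

data_loss_window, downtime = measure_incident(last_backup, failure, restored)
print(data_loss_window, downtime)  # 7:30:00 0:45:00
print(meets_objectives(last_backup, failure, restored,
                       rpo=timedelta(hours=1), rto=timedelta(hours=1)))  # False
```

In this example the 45-minute outage fits a one-hour RTO, but 7.5 hours of lost writes violates a one-hour RPO, so a nightly backup alone is not enough.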

Common SLA examples

  • 99% = ~3.65 days downtime/year
  • 99.9% = ~8.76 hours/year
  • 99.99% = ~52.6 minutes/year
  • 99.999% (5 nines) = ~5.26 minutes/year
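
The figures above follow from simple arithmetic. A short sketch, assuming a 365-day year (the function name is chosen here for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Maximum downtime per year, in minutes, that an uptime percentage permits."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR

for sla in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each extra nine divides the downtime budget by ten, which is why the cost of achieving it tends to rise sharply.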

2. AWS Failure Domains

Hierarchy

Region
  Availability Zone 1
    Data Center A
    Data Center B
  Availability Zone 2
    ...
  Availability Zone 3
    ...

Failure scenarios

Failure                      Affected                  Mitigation
Instance failure             1 EC2 instance            Auto Scaling Group (auto-replace)
Hardware failure             1 host                    Place instances in a spread placement group
AZ failure (rare)            All resources in AZ       Multi-AZ deployment
Region failure (very rare)   All resources in region   Multi-Region deployment

3. Multi-AZ Architecture

Pattern

Region us-east-1:
  AZ-1a: Web tier (EC2), RDS Primary
  AZ-1b: Web tier (EC2), RDS Standby (sync replication)
  AZ-1c: Web tier (EC2)
  Load Balancer (spans AZs)

Components

  • ALB: spans multiple AZs (built-in HA)
  • ASG: spreads instances across AZs
  • RDS Multi-AZ: primary + standby in different AZs
  • DynamoDB: replicates automatically across multiple AZs
  • EFS: multi-AZ
  • S3: multi-AZ (Standard storage class)

Benefits

  • Tolerate AZ failure
  • Low latency between AZs (< 10ms)
  • No cross-region data transfer cost
  • Simpler than multi-region

Trade-offs

  • Does NOT protect against region failure
  • Limited to 1 region's services

Typical availability target: 99.99%
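
Tolerating an AZ failure only helps if the surviving AZs can carry the full load. The sketch below works out the headroom needed; the function names and the six-instance workload are assumptions made for this example.

```python
import math

def instances_per_az(required: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing any one AZ still leaves `required`."""
    if az_count < 2:
        raise ValueError("at least 2 AZs are needed to survive an AZ failure")
    return math.ceil(required / (az_count - 1))

def surviving_capacity(per_az: int, az_count: int, failed_azs: int = 1) -> int:
    """Instances still running after `failed_azs` zones are lost."""
    return per_az * (az_count - failed_azs)

required = 6  # hypothetical: instances needed to serve peak load
for az_count in (2, 3):
    per_az = instances_per_az(required, az_count)
    total = per_az * az_count
    print(f"{az_count} AZs: {per_az} per AZ, {total} total, "
          f"{surviving_capacity(per_az, az_count)} left after one AZ fails")
```

With two AZs the fleet must be doubled (12 instances), while with three AZs 9 instances are enough, which is one reason three-AZ layouts are common.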

4. Multi-Region Architecture

Pattern

Region us-east-1 (Primary):      Region eu-west-1 (DR):
  Active workload                  Standby (or active)
  RDS Primary                      RDS Read Replica
  S3 bucket                        S3 CRR destination

Route 53 (global): Failover/Latency routing
  Users routed to the nearest healthy region

Benefits

  • Tolerate region failure (major disaster)
  • Lower latency for global users
  • Compliance (data residency per region)

Trade-offs

  • Higher cost (resources in multiple regions)
  • Higher complexity (replication, failover)
  • Eventually consistent between regions (lag)
  • Cross-region data transfer cost

Typical availability target: 99.999%+

5. DR Strategies (recap; details in the next lesson)

Strategy                     RTO                    RPO         Cost
Backup & Restore             Hours                  Hours       $
Pilot Light                  10 minutes to 1 hour   Minutes     $$
Warm Standby                 Minutes                Minutes     $$$
Multi-Site (Active-Active)   Seconds                Near zero   $$$$
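
One way to read the table is "pick the cheapest strategy whose RTO still meets the target". The sketch below encodes that idea. The numeric worst-case RTO values are illustrative assumptions derived loosely from the table's ranges, not official figures.

```python
# (name, ASSUMED worst-case RTO in seconds, relative cost tier), cheapest first
STRATEGIES = [
    ("Backup & Restore", 24 * 3600, 1),
    ("Pilot Light", 3600, 2),
    ("Warm Standby", 10 * 60, 3),
    ("Multi-Site (Active-Active)", 60, 4),
]

def cheapest_strategy(rto_target_seconds: int) -> str:
    """Return the cheapest strategy whose assumed worst-case RTO meets the target."""
    for name, worst_case_rto, _cost_tier in STRATEGIES:
        if worst_case_rto <= rto_target_seconds:
            return name
    # Nothing meets the target: fall back to the fastest (most expensive) option.
    return STRATEGIES[-1][0]

print(cheapest_strategy(48 * 3600))  # Backup & Restore
print(cheapest_strategy(4 * 3600))   # Pilot Light
print(cheapest_strategy(15 * 60))    # Warm Standby
print(cheapest_strategy(30))         # Multi-Site (Active-Active)
```

A real decision would also weigh RPO and budget; this only shows how a tighter RTO pushes the choice toward the more expensive rows.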

6. Multi-AZ Decision Matrix

Always Multi-AZ for production

  • RDS, Aurora (Multi-AZ deployment)
  • ElastiCache (Multi-AZ with automatic failover)
  • EFS (Standard)
  • Application servers (ALB + ASG across AZs)

Single-AZ OK for

  • Dev/test environments
  • Stateless workloads (can rebuild)
  • Cache (data can be regenerated)
  • Compute-only workloads whose data lives elsewhere

7. Multi-Region Use Cases

When multi-region needed

  • Global users with latency requirements
  • Mission-critical (banks, healthcare)
  • Regulatory compliance (data residency)
  • Tolerate region failure (rare but possible)

When NOT needed

  • Single-region users
  • Cost-sensitive (overhead 2x+)
  • App not designed for distributed state
  • Acceptable to have hours of downtime

8. Service Availability Patterns

Stateless services

  • Easy to multi-AZ/multi-region (deploy in each)
  • Examples: EC2 web tier, Lambda

Stateful services

  • Harder — need data replication
  • Examples: databases, file systems

Strategies for stateful services

  • Multi-AZ replication: RDS Multi-AZ, ElastiCache Redis Multi-AZ
  • Cross-region replication: S3 CRR, DynamoDB Global Tables, Aurora Global DB
  • Manual replication: backup + restore in DR region
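
As an example of cross-region replication setup, here is a sketch of an S3 CRR rule expressed as the dictionary that boto3's `put_bucket_replication` accepts as `ReplicationConfiguration`. The role ARN, bucket name, and rule ID are placeholders, and the code only builds the structure; it does not call AWS. Versioning must be enabled on both buckets before replication can be configured.

```python
def crr_configuration(role_arn: str, destination_bucket: str) -> dict:
    """Build a one-rule replication configuration that copies every object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-everything",     # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # empty prefix = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }

config = crr_configuration(
    role_arn="arn:aws:iam::123456789012:role/example-crr-role",  # placeholder
    destination_bucket="example-dr-bucket-eu-west-1",            # placeholder
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

Replication is asynchronous, so objects written just before a regional outage may not yet exist in the destination bucket.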

9. Data Replication Options

Synchronous replication

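  • The write is committed on both primary and replica before the ACK
  • Zero RPO (no data loss)
  • Higher write latency; practical only when the replica is a few milliseconds away (~10 ms, i.e. the same region)
  • Examples: RDS Multi-AZ, the Aurora storage layer (six copies across three AZs in one region)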

Asynchronous replication

  • The write is acknowledged by the primary; the replica catches up later
  • Some RPO (data loss possible, seconds to minutes)
  • Lower write latency, so cross-region replication is practical
  • Examples: RDS Read Replicas, S3 CRR, DynamoDB Global Tables

Combined

  • Within region: sync (HA)
  • Cross-region: async (DR)
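
The difference in RPO between the two modes can be shown with a toy in-memory model. This is a sketch of the concept only; the class and method names are invented here and do not correspond to any AWS API.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    log: list = field(default_factory=list)

@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # acknowledged but not yet shipped

    def write(self, record: str) -> str:
        self.log.append(record)
        if self.synchronous:
            self.replica.log.append(record)  # replicate first, then acknowledge
        else:
            self.pending.append(record)      # acknowledge now, replicate later
        return "ACK"

    def ship_pending(self) -> None:
        """Asynchronous catch-up: send queued writes to the replica."""
        self.replica.log.extend(self.pending)
        self.pending.clear()

def lost_on_failover(primary: Primary) -> list:
    """Writes the application saw acknowledged but the replica never received."""
    return [r for r in primary.log if r not in primary.replica.log]

sync_db = Primary(Replica(), synchronous=True)
for record in ("w1", "w2", "w3"):
    sync_db.write(record)
print(lost_on_failover(sync_db))   # []

async_db = Primary(Replica(), synchronous=False)
async_db.write("w1")
async_db.write("w2")
async_db.ship_pending()            # replica catches up
async_db.write("w3")               # primary fails before this is shipped
print(lost_on_failover(async_db))  # ['w3']
```

The synchronous primary pays for its zero RPO on every write, while the asynchronous one acknowledges immediately and risks losing whatever is still queued.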

10. Common Architecture Patterns

Pattern 1: Multi-AZ web app

Route 53
  ALB (Multi-AZ)
    ASG (Web tier, 3 AZs)
      ASG (App tier, 3 AZs)
        RDS Multi-AZ (Primary + Standby)

Pattern 2: Multi-Region active-passive

        Route 53 (Failover routing)
           /                \
      (active)            (passive)
         /                    \
   us-east-1               eu-west-1
   ALB + ASG               (resources spun
   RDS Primary              up only on failover)
              CRR
         DynamoDB Global
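
The failover routing in this pattern comes down to two DNS records with the same name, one marked PRIMARY and one SECONDARY. The sketch below builds them in the shape that boto3's `change_resource_record_sets` expects for its `ChangeBatch` argument. The domain, load balancer hostnames, and health check ID are placeholders, and nothing is sent to AWS.

```python
def failover_record(name, target, role, set_id, health_check_id=None):
    """Build one UPSERT change for a failover CNAME record."""
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,   # must be unique among records sharing this name
        "Failover": role,          # "PRIMARY" or "SECONDARY"
        "TTL": 60,                 # short TTL so clients notice a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

change_batch = {
    "Comment": "Active-passive failover between two regions",
    "Changes": [
        failover_record(
            "app.example.com.",
            "primary-alb.us-east-1.example.com",   # placeholder hostname
            "PRIMARY",
            "us-east-1",
            health_check_id="00000000-0000-0000-0000-000000000000",  # placeholder
        ),
        failover_record(
            "app.example.com.",
            "dr-alb.eu-west-1.example.com",        # placeholder hostname
            "SECONDARY",
            "eu-west-1",
        ),
    ],
}

for change in change_batch["Changes"]:
    rrset = change["ResourceRecordSet"]
    print(rrset["Failover"], "->", rrset["ResourceRecords"][0]["Value"])
```

Route 53 answers with the primary record while its health check passes and switches to the secondary when it fails, so the primary's health check is the piece that actually triggers failover.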

Pattern 3: Multi-Region active-active

        Route 53 (Latency-based)
           /              \
          /                \
   us-east-1            eu-west-1
   (active)             (active)
   ALB + ASG            ALB + ASG
   Aurora Global        Read replicas
   DynamoDB Global      DynamoDB

11. Cost Considerations

Multi-AZ

  • ~2x cost for stateful (RDS Multi-AZ = 2x)
  • ~1x cost for stateless (already in multiple AZs)

Multi-Region

  • ~2x+ cost (duplicate everything)
  • Cross-region data transfer ($)
  • Plus replication setup/maintenance

Optimize

  • Multi-AZ for HA, not multi-region if not needed
  • Use Aurora Global Database instead of full duplication (cheaper)
  • DynamoDB Global Tables (pay per region)
  • Active-passive cheaper than active-active (DR region scaled down)
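
The multipliers above can be combined into a rough comparison. This is a back-of-the-envelope model, not a pricing calculator: the dollar amounts, the `dr_scale` parameter, and the flat transfer charge are assumptions made for the example.

```python
def monthly_cost(stateless, stateful, topology, dr_scale=1.0, transfer=0.0):
    """Rough monthly cost under the ~1x stateless / ~2x stateful rule of thumb.

    dr_scale: size of the second region relative to the first
              (1.0 = full duplicate, 0.25 = scaled-down passive region).
    transfer: assumed flat cross-region data transfer charge.
    """
    if topology == "single-az":
        return stateless + stateful
    multi_az = stateless + 2 * stateful  # stateful tier doubles (e.g. RDS Multi-AZ)
    if topology == "multi-az":
        return multi_az
    if topology == "multi-region":
        return multi_az * (1 + dr_scale) + transfer
    raise ValueError(f"unknown topology: {topology}")

# Hypothetical workload: $400/month stateless, $300/month stateful.
print(monthly_cost(400, 300, "single-az"))                                   # 700
print(monthly_cost(400, 300, "multi-az"))                                    # 1000
print(monthly_cost(400, 300, "multi-region", dr_scale=1.0, transfer=50))     # 2050.0
print(monthly_cost(400, 300, "multi-region", dr_scale=0.25, transfer=50))    # 1300.0
```

Under these assumptions, scaling the passive region down to a quarter cuts the multi-region bill from about 2.9x to about 1.9x the single-AZ cost.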

Review questions

  1. What is the difference between Multi-AZ and Multi-Region?

    Answer: Multi-AZ provides redundancy within a single Region and protects against AZ failure (hardware, power, network). Failover is automatic and fast (seconds to minutes). Multi-Region provides redundancy across two or more Regions; it protects against Region failure (rare) or serves users globally with low latency. Multi-Region is more complex and more expensive, but it is needed for the highest availability requirements and for global applications.

  2. Do Region failures happen often?

    Answer: No. AWS Region-wide failures are extremely rare, while AZ failures happen occasionally. Multi-Region is usually adopted for (1) latency, to serve users globally, (2) compliance, to meet data residency requirements, and (3) business continuity under extreme availability requirements, not because Region failures are frequent. Most applications need only Multi-AZ; Multi-Region is overkill and costly for most use cases.

  3. When should you use Multi-Region?

    Answer: (1) A global user base that needs low latency from several continents, (2) near-zero RTO/RPO, where even failover downtime is unacceptable, (3) regulatory compliance that requires data to be replicated to a second Region, (4) critical business applications that lose significant revenue for every minute of downtime, (5) sovereignty or data residency requirements. Cost and complexity rise significantly, so Multi-Region is justified only when the business requirements are clear.

  4. Does RDS Multi-AZ use sync or async replication?

    Answer: Synchronous replication. The primary does not acknowledge a write to the application until the standby confirms it has been replicated. This ensures zero data loss (RPO = 0) on failover, at the cost of slightly higher write latency than Single-AZ. Read Replicas, by contrast, use asynchronous replication: they can lag by a few seconds but have little impact on write performance.

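  5. What is the maximum downtime per year under a 99.99% SLA?

    Answer: About 52 minutes per year (52.6 minutes). For comparison: 99.9% = ~8.76 hours/year, 99.95% = ~4.4 hours/year, 99.999% (five nines) = ~5 minutes/year. Formula: downtime = (1 - SLA) × 365 × 24 × 60 minutes, with the SLA written as a fraction (99.99% = 0.9999). Several AWS services, such as DynamoDB and ALB, carry an SLA of 99.99% or higher, and S3 Standard is designed for 99.99% availability. When components depend on each other in series, the composite availability is the product of the component availabilities and is therefore lower than the lowest component SLA, so redundancy is needed to reach a high overall SLA.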
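
    The composite availability point can be checked numerically. A minimal sketch, assuming independent failures (the component values are examples, not quoted SLAs for any specific architecture):

```python
from math import prod

def serial(*availabilities: float) -> float:
    """All components are required: availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """Redundant components: the system is down only if every copy is down."""
    return 1 - prod(1 - a for a in availabilities)

# Three components in series, e.g. load balancer -> app tier -> database.
chain = serial(0.9999, 0.9999, 0.9995)
print(f"serial chain: {chain:.4%}")     # lower than the weakest link (99.95%)

# Two redundant copies of a 99.9% component.
pair = parallel(0.999, 0.999)
print(f"redundant pair: {pair:.4%}")
```

    Chaining dependencies pulls availability down, while duplicating a component pushes it up, which is the arithmetic behind both Multi-AZ and Multi-Region designs.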

Hands-on exercises

  • Draw an architecture: a Multi-AZ web app with ALB, ASG, and RDS Multi-AZ
  • Set up S3 CRR to another region
  • Create Route 53 Failover routing with primary/secondary records
  • Compare cost: the same workload on single-AZ vs Multi-AZ vs Multi-Region

Next: DR Strategies (4 patterns)