Week 9 - Day 4: Route 53 Failover Patterns

Learning objectives

  • Understand Route 53 failover routing
  • Apply health checks for automatic failover
  • Distinguish Active-Passive from Active-Active
  • Understand how DNS TTL affects failover speed

1. Failover Routing Recap

Active-Passive

  • Primary + Secondary record
  • Health check on primary
  • If primary unhealthy → traffic to secondary

Architecture

Route 53 DNS: example.com
  Primary: ALB-us-east-1 (health check OK)
  Secondary: ALB-eu-west-1 (passive)

User DNS query → returns Primary IP
If Primary fails health check → returns Secondary IP

2. Health Checks

3 types

1. Endpoint Health Check

  • HTTP/HTTPS/TCP probe
  • ~15 AWS health checkers globally
  • Interval: 10s (fast) or 30s (standard)
  • Failure threshold: 1-10 (default 3)
  • Path, port, host configurable
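The endpoint settings above map fairly directly onto a Route 53 health check definition. The sketch below only builds the dictionary that could be passed as `HealthCheckConfig` to boto3's `create_health_check`; it does not call AWS. The domain `app.example.com` and the `/health` path are placeholders, not values from this lesson.

```python
def endpoint_health_check_config(domain, path="/health", port=443,
                                 fast=False, failure_threshold=3):
    """Build a HealthCheckConfig dict for an HTTPS endpoint health check."""
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure_threshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,  # placeholder host
        "Port": port,
        "ResourcePath": path,
        "RequestInterval": 10 if fast else 30,  # seconds: fast vs standard
        "FailureThreshold": failure_threshold,
    }


config = endpoint_health_check_config("app.example.com", fast=True)
print(config)

# With boto3 installed and credentials configured, this could be used as:
#   boto3.client("route53").create_health_check(
#       CallerReference="some-unique-string", HealthCheckConfig=config)
```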

2. Calculated Health Check

  • Combine 2-256 child health checks
  • AND, OR, NOT logic
  • Use case: "Healthy if 3+ regions healthy"
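As a rough illustration of the "healthy if 3+ regions healthy" use case, a calculated health check can be modeled as a threshold over its children. This is a local simulation, not Route 53 code; the region names and statuses are invented for the example.

```python
def calculated_health(child_statuses, health_threshold):
    """Return True if at least `health_threshold` child checks are healthy."""
    healthy_count = sum(1 for ok in child_statuses if ok)
    return healthy_count >= health_threshold


# Hypothetical child health check results, one per region.
children = {
    "us-east-1": True,
    "eu-west-1": True,
    "ap-southeast-1": False,
    "us-west-2": True,
}

statuses = list(children.values())
print(calculated_health(statuses, health_threshold=3))              # 3 of 4 healthy
print(calculated_health(statuses, health_threshold=len(statuses)))  # AND: all must pass
print(calculated_health(statuses, health_threshold=1))              # OR: any one passes
```

In this model, AND corresponds to a threshold equal to the number of children and OR to a threshold of 1.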

3. CloudWatch Alarm-based Health Check

  • Health status follows the state of a CloudWatch alarm (unhealthy when the alarm is in ALARM)
  • Use case: alarm fires on DynamoDB throttling → DNS failover

Cost

  • Endpoint within AWS: free for first 50, then $0.75 per check per month
  • Endpoint outside AWS: $0.75 per check per month
  • Calculated: $1 per check per month
  • CloudWatch alarm-based: $1 per check per month

3. Health Check Best Practices

Endpoint design

  • Dedicated /health endpoint
  • Returns 200 if healthy, 503 if not
  • Check dependencies (DB connection, cache, external API)
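A minimal sketch of such a `/health` endpoint, using only the Python standard library. The dependency probes in `check_dependencies` are hypothetical stubs that always report healthy; a real service would replace them with actual DB, cache, and external API checks.

```python
import json
from http.server import BaseHTTPRequestHandler


def check_dependencies():
    """Hypothetical dependency probes; replace the constants with real checks."""
    return {
        "database": True,
        "cache": True,
        "external_api": True,
    }


def health_response(checks):
    """Map dependency results to an HTTP status code and a JSON body."""
    healthy = all(checks.values())
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "unhealthy",
                       "checks": checks})
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health with 200 or 503 depending on dependency state."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler can be served with `http.server.HTTPServer(("", 8080), HealthHandler).serve_forever()`. Checking every dependency on each probe makes the endpoint more truthful but also means a flaky external API can trigger a failover, so which dependencies to include is a design decision.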

Frequency

  • 10-second interval for critical apps (fast failover)
  • 30-second for less critical (lower cost)

Failure threshold

  • Higher (e.g., 5): fewer false positives, slower failover
  • Lower (e.g., 2): faster failover, more false positives
  • Recommended: 3-5

Multi-region health check

  • Place health checkers in DIFFERENT regions than target
  • Detect region-specific issues

4. DNS TTL and Failover Speed

Why TTL matters

  • TTL = how long resolvers cache DNS response
  • After failover, resolvers may serve OLD record until TTL expires
  • → Effective failover time = health check detection + TTL expiry
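The relationship above can be turned into a back-of-the-envelope estimate. The sketch assumes detection takes roughly interval × failure threshold and that a resolver cached the old answer just before failover. It ignores browser and OS caches, so real numbers may be higher.

```python
def detection_seconds(request_interval, failure_threshold):
    """Approximate time for consecutive failed probes to mark the endpoint unhealthy."""
    return request_interval * failure_threshold


def worst_case_failover_seconds(request_interval, failure_threshold, ttl):
    """Detection time plus a full TTL for resolvers that just cached the old record."""
    return detection_seconds(request_interval, failure_threshold) + ttl


scenarios = [
    ("standard check (30s x 3), TTL 300s", 30, 3, 300),
    ("fast check (10s x 3), TTL 60s", 10, 3, 60),
]
for label, interval, threshold, ttl in scenarios:
    total = worst_case_failover_seconds(interval, threshold, ttl)
    print(f"{label}: about {total}s")
```

Under these assumptions the first scenario comes to about 390 seconds and the second to about 90 seconds.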

Trade-off

  • Low TTL (60 sec): fast failover, more DNS queries (cost)
  • High TTL (24 hour): cheap, slow failover

Recommendation

  • Stable apps: 300-3600s
  • HA apps: 60s
  • Before maintenance: lower the TTL ahead of time (e.g. 24h before), so answers cached with the old TTL have expired by then

ALIAS record

  • AWS-managed TTL (usually 60 sec)
  • Optimal for AWS resources
  • Recommended over CNAME

5. Active-Passive Failover Setup

Steps

  1. Create a health check on the primary ALB
  2. Create the primary record:
    • Routing policy: Failover
    • Failover type: Primary
    • Health check ID: from step 1
    • Value: ALB-us-east-1 alias
  3. Create the secondary record:
    • Routing policy: Failover
    • Failover type: Secondary
    • Health check optional
    • Value: ALB-eu-west-1 alias
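The steps above could be expressed as a Route 53 change batch. The sketch below only builds the dictionaries and does not call AWS. The record name, ALB DNS names, ALB hosted zone IDs, and the health check ID are all placeholders invented for illustration.

```python
def failover_alias_record(name, role, alb_dns_name, alb_zone_id,
                          health_check_id=None):
    """Build a failover alias A record; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-record",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # hosted zone of the ALB, not of your domain
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(primary, secondary):
    """Wrap both records in an UPSERT change batch."""
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ],
    }


# All identifiers below are placeholders.
primary = failover_alias_record(
    "example.com.", "PRIMARY",
    "alb-us-east-1.placeholder.elb.amazonaws.com.", "Z_PLACEHOLDER_US_EAST_1",
    health_check_id="hc-primary-placeholder")
secondary = failover_alias_record(
    "example.com.", "SECONDARY",
    "alb-eu-west-1.placeholder.elb.amazonaws.com.", "Z_PLACEHOLDER_EU_WEST_1")
batch = failover_change_batch(primary, secondary)

# With boto3 this could be applied as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Alias records carry no TTL field of their own, which is why none is set here.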

Behavior

  • DNS query → R53 returns Primary IF healthy
  • If unhealthy → returns Secondary
  • When primary recovers → R53 returns Primary again
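The behavior above, including failback, can be simulated locally. This is a simplified model rather than Route 53's actual implementation. The case where both records are unhealthy is not covered in this lesson, so the fallback to primary in that branch is an assumption of the sketch.

```python
def failover_answer(primary_healthy, secondary_healthy=True):
    """Return which record a failover policy would answer with."""
    if primary_healthy:
        return "primary"
    if secondary_healthy:
        return "secondary"
    # Assumption of this sketch: with nothing healthy, answer with primary
    # rather than returning no record at all.
    return "primary"


# Primary health over five consecutive evaluations: outage, then recovery.
primary_timeline = [True, True, False, False, True]
answers = [failover_answer(state) for state in primary_timeline]
print(answers)
```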

Combine with a DR strategy

  • Pilot Light / Warm Standby: passive resources in DR region
  • Health check fail → DNS failover + (optionally) auto-scale DR

6. Active-Active Failover

Pattern: Multivalue Answer or Weighted/Latency

  • Both regions actively serving
  • Each region has its own health check
  • If 1 region is unhealthy → DNS does not return its record

Weighted Routing with health checks

example.com:
  Region A (weight 50, health check)
  Region B (weight 50, health check)

Both healthy → 50/50 traffic
A unhealthy → 100% to B
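A small local simulation of weighted routing with health checks: unhealthy records are dropped and the answer is drawn in proportion to the remaining weights. Returning `None` when nothing is healthy is a simplification of this sketch, not a statement about how Route 53 behaves in that case.

```python
import random


def pick_weighted(records, rng=random):
    """Pick one healthy record name, proportionally to its weight."""
    healthy = [r for r in records if r["healthy"] and r["weight"] > 0]
    if not healthy:
        return None  # simplification: no healthy record to return
    weights = [r["weight"] for r in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]["name"]


records = [
    {"name": "Region A", "weight": 50, "healthy": True},
    {"name": "Region B", "weight": 50, "healthy": True},
]

# Both healthy: roughly half of the answers go to each region.
rng = random.Random(42)
sample = [pick_weighted(records, rng) for _ in range(1000)]
print("Region A share:", sample.count("Region A") / len(sample))

# Region A unhealthy: every answer is Region B.
records[0]["healthy"] = False
print({pick_weighted(records, rng) for _ in range(100)})
```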

Latency-based with health checks

  • Lowest-latency region returned
  • Skip unhealthy regions
  • Best for global apps
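The selection rule above can be sketched as "lowest latency among healthy regions". The latency figures are invented for the example. Route 53 uses its own latency measurements between the resolver's network and AWS regions, which this sketch does not attempt to reproduce.

```python
def pick_lowest_latency(latencies_ms, health):
    """Return the healthy region with the lowest latency, or None if none is healthy."""
    candidates = {
        region: ms
        for region, ms in latencies_ms.items()
        if health.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical latencies as seen from one user's resolver.
latencies_ms = {"us-east-1": 180, "eu-west-1": 35, "ap-southeast-1": 240}

all_healthy = {"us-east-1": True, "eu-west-1": True, "ap-southeast-1": True}
print(pick_lowest_latency(latencies_ms, all_healthy))

eu_down = {"us-east-1": True, "eu-west-1": False, "ap-southeast-1": True}
print(pick_lowest_latency(latencies_ms, eu_down))
```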

7. Multi-Region Failover Patterns

Pattern 1: Active-Passive with Aurora Global DB

Primary Region: ALB → App → Aurora Primary
DR Region (passive): ALB (idle) → App (idle) → Aurora Secondary (read-only)

Failover:
  1. Detect failure (health check)
  2. Route 53 switches DNS to the DR region
  3. Promote Aurora Secondary → Primary
  4. (Optionally) scale up DR resources

Pattern 2: Active-Active with DynamoDB Global Tables

Region A: ALB → App → DynamoDB Global Table A
Region B: ALB → App → DynamoDB Global Table B
  (bidirectional sync between A and B)

Route 53: Latency-based + health checks
Both regions active simultaneously

Pattern 3: Active-Active with Global Accelerator

Static IP (anycast) → Global Accelerator
  Endpoint Region A (weight 50)
  Endpoint Region B (weight 50)

GA built-in health checks
Auto failover at IP level (faster than DNS)

8. Route 53 vs Global Accelerator

Aspect        | Route 53 Failover           | Global Accelerator
Mechanism     | DNS (TTL-dependent)         | Anycast IP (TCP/UDP)
Failover time | Health check + TTL          | < 1 minute (no DNS dependency)
Protocol      | Any (DNS-resolved)          | TCP, UDP
Static IP     | No                          | Yes (2 anycast IPs)
Caching       | DNS resolver                | No
Cost          | $0.50/zone/month + queries  | $0.025/hour per accelerator + data transfer

When to use GA over Route 53

  • Need static IPs for whitelisting
  • Non-HTTP traffic (gaming, IoT)
  • Faster failover (< 1 min vs TTL-dependent)
  • Reduce DNS-related latency

9. Failover Testing

Test procedure

  1. Manually disable the primary (e.g. make the ALB's targets fail their health checks)
  2. Observe DNS failover (use dig or online DNS tool)
  3. Measure time: health check detection + TTL expiry
  4. Verify app works on secondary
  5. Re-enable primary, observe failback
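Steps 2 and 3 can be scripted by polling DNS until the answer leaves the set of primary addresses. This is a sketch under several assumptions: the primary's IPs are known (captured before the test), the hostname and addresses in the usage note are placeholders, and the machine's local resolver cache will influence what is measured.

```python
import socket
import time


def measure_failover(hostname, primary_ips, resolve=socket.gethostbyname,
                     poll_seconds=5.0, timeout_seconds=900.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll DNS until the answer is no longer a primary IP.

    Returns elapsed seconds, or None if the timeout is reached first.
    `resolve`, `clock`, and `sleep` are injectable so the loop can be tested.
    """
    start = clock()
    while clock() - start < timeout_seconds:
        try:
            answer = resolve(hostname)
        except OSError:
            answer = None  # resolution failed; keep polling
        if answer is not None and answer not in primary_ips:
            return clock() - start
        sleep(poll_seconds)
    return None
```

A possible invocation, started right after disabling the primary, would be `measure_failover("example.com", {"203.0.113.10"})`, where the address is a placeholder for the primary's current IP. The result is accurate only to within one polling interval.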

Tools

  • dig +short example.com (check current DNS)
  • AWS Console Route 53 health check status
  • CloudWatch metrics on health check

Game day exercise

  • Simulate region failure
  • Document actual RTO/RPO
  • Find gaps (e.g., manual steps that should be automated)

10. DNS Failover Limitations

Limitations

  • DNS resolvers may cache (TTL)
  • Browser DNS cache (separate from resolver, persistent across requests)
  • CDN edge caches DNS too
  • Some clients ignore TTL (rare)

Workarounds

  • Use Global Accelerator (no DNS)
  • Use CloudFront (origin failover at edge level)
  • Use client-side retry logic with multiple endpoints
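The last workaround, client-side retry across multiple endpoints, might look like the following. The regional URLs in the usage note are hypothetical, and the retry policy (two attempts per endpoint, no backoff) is deliberately simplistic.

```python
import urllib.request


def default_fetch(url, timeout=3.0):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(urls, fetch=default_fetch, attempts_per_endpoint=2):
    """Try each endpoint in order; return (url, body) from the first that works."""
    last_error = None
    for url in urls:
        for _ in range(attempts_per_endpoint):
            try:
                return url, fetch(url)
            except OSError as exc:  # URLError, timeouts, connection errors
                last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```

For example, `fetch_with_failover(["https://us.api.example.com/items", "https://eu.api.example.com/items"])` would only contact the second (placeholder) endpoint after the first has failed twice. Because the client holds both endpoints itself, this does not depend on DNS caches expiring.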

11. CloudFront Origin Failover

Definition

CloudFront Origin Failover = an origin group with a primary and a secondary origin inside a CloudFront distribution.

Setup

CloudFront Distribution:
  Origin Group: my-origins
    Primary: ALB-us-east-1
    Secondary: ALB-eu-west-1
    Failover criteria: 5xx errors or timeout

Behavior

  • Edge tries primary, if fails → secondary
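  • Failover happens per request, with no DNS change involved (near-immediate when the primary returns an error; bounded by the origin timeout when it hangs)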
  • Use case: HTTP/HTTPS apps with multi-region origins
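The sketch below shows an origin group definition in the shape used by the CloudFront distribution config (`OriginGroups` → `Items`), followed by a toy model of the edge behavior. The origin IDs reuse the names from this section. `fetch_origin` is a hypothetical stand-in for the edge's request to an origin, and the status code list is one possible choice of failover criteria.

```python
FAILOVER_STATUS_CODES = [500, 502, 503, 504]


def origin_group(group_id, primary_origin_id, secondary_origin_id,
                 status_codes=FAILOVER_STATUS_CODES):
    """Build one origin group entry: primary first, secondary second."""
    return {
        "Id": group_id,
        "FailoverCriteria": {
            "StatusCodes": {"Quantity": len(status_codes),
                            "Items": list(status_codes)},
        },
        "Members": {
            "Quantity": 2,
            "Items": [{"OriginId": primary_origin_id},
                      {"OriginId": secondary_origin_id}],
        },
    }


def serve_request(fetch_origin, group):
    """Toy model: try the primary, fall back to the secondary on failure.

    `fetch_origin(origin_id)` returns an HTTP status code or raises TimeoutError.
    Returns (origin_id_used, status).
    """
    primary, secondary = [m["OriginId"] for m in group["Members"]["Items"]]
    failover_codes = set(group["FailoverCriteria"]["StatusCodes"]["Items"])
    try:
        status = fetch_origin(primary)
    except TimeoutError:
        status = None
    if status is None or status in failover_codes:
        return secondary, fetch_origin(secondary)
    return primary, status


group = origin_group("my-origins", "ALB-us-east-1", "ALB-eu-west-1")
```

The decision is made for each request at the edge, which is why clients never see a DNS change.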

vs Route 53 Failover

  • CloudFront: faster failover, only HTTP/HTTPS, requires CloudFront
  • Route 53: any protocol, slower (TTL-dependent)

12. Common Patterns

Pattern 1: Web app multi-region failover

User → Route 53 (failover)
  Primary: CloudFront → ALB us-east-1
  Secondary: CloudFront → ALB eu-west-1

Health checks: HTTPS /health on each ALB
TTL: 60s

Pattern 2: API endpoint with static IP

API consumers (whitelisted IPs)
  → Global Accelerator (2 static IPs)
    → NLB us-east-1
    → NLB eu-west-1

Pattern 3: Database failover

App → Aurora Global Database

DR scenario:
  1. Promote secondary cluster → primary
  2. Update app DB connection
  3. (Or use Aurora endpoint that follows promotion)

Review questions

  1. What does Route 53 failover routing require for the primary record?

    Answer:

    The primary record must have an associated health check. Route 53 probes the endpoint (HTTP/HTTPS/TCP) periodically (every 30s by default, or every 10s with a "fast" health check). When the primary's health check fails (3 consecutive failures by default), Route 53 returns the secondary record instead. The secondary record is not required to have a health check, but attaching one is best practice. A health check can also track the state of a CloudWatch alarm.

  2. How does DNS TTL affect failover time?

    Answer:

    A higher TTL means slower failover. After Route 53 starts returning the new IP (post-failover), clients that still hold the old cached answer keep using the old IP until the TTL expires. For example, with a TTL of 300s, users may take up to 5 minutes to see the failover. A low TTL (60s or less) gives faster failover but generates more DNS queries, and therefore higher cost. Before planned maintenance: lower the TTL → perform the maintenance (failover is fast) → raise the TTL again.

  3. How does Global Accelerator failover differ from Route 53 failover?

    Answer:

    Global Accelerator: failover happens at the network layer. The anycast IPs are static and routing changes within seconds (< 1 minute, independent of DNS TTL), so clients do not need to switch IPs. Route 53: failover happens through DNS, so its speed depends on TTL and DNS propagation (minutes). Global Accelerator suits RTO requirements measured in seconds; Route 53 failover suits RTOs measured in minutes. GA is more expensive (~$18/accelerator/month + data fees).

  4. At which layer does CloudFront Origin Failover operate?

    Answer:

    Layer 7 (HTTP). CloudFront automatically retries the origin request against the secondary origin when the primary origin returns HTTP 500, 502, 503, or 504 (configurable). This is transparent to clients: they do not see the failover. Setup: an Origin Group with primary + secondary origins. CloudFront tries the primary and, if it fails, tries the secondary. Suitable for active-passive origins with automatic content failover.

  5. How does failover speed compare between Active-Active and Active-Passive?

    Answer:

    Active-Active: failover is nearly immediate. Both endpoints are already serving traffic, so the routing layer (e.g. the load balancer) only needs to stop routing to the failed endpoint. RTO: seconds. Active-Passive: the passive region must be promoted (launch instances, scale up, update DNS), so RTO ranges from minutes to hours depending on the strategy. Active-Active gives a better RTO but needs real-time data sync between regions (more complex and more expensive).

Hands-on exercises

  • Set up Failover routing: primary + secondary with a health check
  • Test: disable the primary, measure failover time
  • Set up Latency-based routing with health checks (active-active)
  • Set up Global Accelerator with 2 regions
  • Set up CloudFront Origin Failover

Next: Aurora Global Database