Week 10 - Day 3: Spot Instance Strategies

Learning objectives

  • Understand Spot pricing and the interruption mechanism
  • Apply Spot Fleet with allocation strategies
  • Build fault-tolerant Spot architectures
  • Combine Spot + On-Demand in an ASG

1. Spot Recap

Spot Instances = unused EC2 capacity at up to a 90% discount; AWS can reclaim it with a 2-minute warning.

When to use

  • Fault-tolerant workloads (can interrupt + retry)
  • Stateless or checkpoint-based
  • Big data, ML training, image rendering
  • CI/CD workers, dev/test environments

When NOT to use

  • Database (stateful)
  • Mission-critical apps
  • Workloads that cannot be replaced mid-execution

2. Spot Pricing

Pricing model

  • AWS sets price based on capacity supply/demand
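  • You can optionally set the max price you are willing to pay (it defaults to the On-Demand price)
  • If Spot price ≤ max → instance runs (as long as capacity is available)
  • If Spot price > max → instance interrupted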

Spot Price History

  • View past 90 days
  • Identify stable instance types/AZs
  • Avoid volatile pools
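One way to act on the price history is to rank pools by how much their price moves. The sketch below is only an illustration: the records in `SAMPLE_HISTORY` are made-up numbers, and using the coefficient of variation as the volatility measure is an assumption, not an AWS-defined metric.

```python
from statistics import mean, pstdev

# Hypothetical price records: (instance_type, availability_zone, spot_price_usd).
# In practice these would come from the Spot price history of the last 90 days.
SAMPLE_HISTORY = [
    ("m5.large", "us-east-1a", 0.030),
    ("m5.large", "us-east-1a", 0.031),
    ("m5.large", "us-east-1a", 0.030),
    ("c5.large", "us-east-1b", 0.025),
    ("c5.large", "us-east-1b", 0.060),
    ("c5.large", "us-east-1b", 0.032),
]


def rank_pools_by_volatility(records):
    """Return (pool, volatility) pairs sorted from most stable to most volatile.

    Volatility here is the coefficient of variation (std dev / mean) of the price.
    """
    prices_by_pool = {}
    for instance_type, zone, price in records:
        prices_by_pool.setdefault((instance_type, zone), []).append(price)

    ranked = []
    for pool, prices in prices_by_pool.items():
        ranked.append((pool, pstdev(prices) / mean(prices)))
    return sorted(ranked, key=lambda item: item[1])


for pool, volatility in rank_pools_by_volatility(SAMPLE_HISTORY):
    print(pool, round(volatility, 3))
```

Pools at the top of the ranking are the calmer candidates; price stability alone does not guarantee a low interruption rate, since most interruptions are capacity-driven.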

Best practice

  • Set max price = On-Demand price (makes the instance less likely to be reclaimed because of price)
  • Spot interruption due to capacity, not price (most common reason)

3. Spot Interruption Behaviors

3 options

  • Terminate (default): instance destroyed
  • Stop: stop instance, keep EBS, restart later if Spot available
  • Hibernate: save RAM to EBS, fast resume

Stop and Hibernate require

  • EBS-backed
  • Specific instance types

Use case for each

  • Terminate: stateless, can replace
  • Stop: large state, OK to pause
  • Hibernate: warm cache, fast resume needed
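The interruption behavior is chosen when the instance is requested. As a rough sketch, the helper below builds a dict in the shape of the `InstanceMarketOptions` parameter of the EC2 `RunInstances` API as I understand it; the function name `build_spot_market_options` is hypothetical, and the dict is only constructed here, not sent to AWS.

```python
VALID_BEHAVIORS = {"terminate", "stop", "hibernate"}


def build_spot_market_options(behavior="terminate", max_price=None):
    """Build an InstanceMarketOptions-style dict for a Spot launch request."""
    if behavior not in VALID_BEHAVIORS:
        raise ValueError(f"unknown interruption behavior: {behavior}")

    spot_options = {"InstanceInterruptionBehavior": behavior}
    # Stop and hibernate need a persistent request; terminate works with one-time.
    spot_options["SpotInstanceType"] = (
        "one-time" if behavior == "terminate" else "persistent"
    )
    if max_price is not None:
        # The API takes the max price as a string, e.g. "0.0960".
        spot_options["MaxPrice"] = f"{max_price:.4f}"
    return {"MarketType": "spot", "SpotOptions": spot_options}


print(build_spot_market_options("stop", max_price=0.096))
```

Leaving `max_price` unset keeps the default cap, which is the On-Demand price.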

4. Spot Interruption Notice

Workflow

  1. AWS decides to reclaim instance
  2. 2-minute warning sent via instance metadata
  3. App should:
    • Save state (S3, DynamoDB)
    • Drain connections (deregister from ELB)
    • Checkpoint progress
  4. After 2 min: instance terminated

Check warning

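# Inside the instance (IMDSv2: request a session token first)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
# Returns JSON with the action and its time if an interruption is coming; HTTP 404 otherwise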
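An application usually needs to turn that metadata response into a deadline. The sketch below assumes the response body is JSON with `action` and `time` fields (time in UTC, formatted like `2017-09-18T08:22:00Z`) and that HTTP 404 means no interruption is pending; the function names are hypothetical, and the HTTP call itself is left out so the logic can be run anywhere.

```python
import json
from datetime import datetime, timezone


def parse_instance_action(status_code, body):
    """Interpret a response from the spot/instance-action metadata path.

    Returns None when no interruption is scheduled (HTTP 404), otherwise a
    tuple (action, scheduled_time) with scheduled_time as an aware datetime.
    """
    if status_code == 404:
        return None
    payload = json.loads(body)
    when = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    return payload["action"], when.replace(tzinfo=timezone.utc)


def seconds_remaining(scheduled_time, now=None):
    """Seconds left before the scheduled action, never negative."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (scheduled_time - now).total_seconds())


notice = parse_instance_action(
    200, '{"action": "terminate", "time": "2024-01-01T00:02:00Z"}'
)
print(notice)
```

A polling loop would call the metadata path every few seconds, pass the status and body to `parse_instance_action`, and start saving state as soon as the result is not `None`.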

EventBridge integration

  • Spot interruption event → Lambda → handle
  • E.g., update database, notify monitoring
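The Lambda side of this flow can be sketched as follows. It assumes the EventBridge event has detail-type `EC2 Spot Instance Interruption Warning` with `instance-id` and `instance-action` inside `detail`; the handler body and the `sample_event` values are illustrative, and the follow-up action is left as a placeholder.

```python
def handler(event, context=None):
    """Handle an 'EC2 Spot Instance Interruption Warning' EventBridge event."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return {"handled": False}

    detail = event.get("detail", {})
    notice = {
        "handled": True,
        "instance_id": detail.get("instance-id"),
        "action": detail.get("instance-action"),
        "time": event.get("time"),
    }
    # Hypothetical follow-up: update a database or notify monitoring here.
    print(f"Spot interruption: {notice['instance_id']} -> {notice['action']}")
    return notice


sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "time": "2024-01-01T00:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}
print(handler(sample_event))
```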

5. Spot Allocation Strategies (Spot Fleet & ASG)

Why important

  • Pool = combination (instance type + AZ + OS + tenancy)
  • Spread across pools = less interruption

Strategies

lowestPrice

  • Cheapest pool
  • Higher interruption risk
  • Use: cost-sensitive batch

diversified

  • Spread evenly across all pools
  • Better availability
  • Use: stateless workloads

capacityOptimized

  • Pools with the largest available capacity
  • Lowest interruption rate
  • Use: most workloads, default recommendation

priceCapacityOptimized (newest, often best)

  • Balance lowest price + lowest interruption
  • Use: cost + reliability balance

Decision

  • capacity-optimized: priority is uptime
  • price-capacity-optimized: balance
  • lowest-price: priority is cost (more risk)
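To make the differences concrete, here is a toy model of how each strategy might place instances. The pool data is invented, and the scoring used for `price-capacity-optimized` (spare capacity per dollar) is purely an assumption for illustration, since AWS does not publish its formula.

```python
# Hypothetical pools: name -> (spot price in USD/hour, spare capacity score).
POOLS = {
    "m5.large/us-east-1a": (0.031, 70),
    "m5a.large/us-east-1b": (0.027, 15),
    "c5.large/us-east-1a": (0.050, 90),
}


def allocate(pools, target, strategy):
    """Toy model of how a strategy spreads `target` instances over pools."""
    if strategy == "lowest-price":
        chosen = min(pools, key=lambda name: pools[name][0])
        return {chosen: target}
    if strategy == "capacity-optimized":
        chosen = max(pools, key=lambda name: pools[name][1])
        return {chosen: target}
    if strategy == "price-capacity-optimized":
        # Assumed scoring: spare capacity per dollar.
        chosen = max(pools, key=lambda name: pools[name][1] / pools[name][0])
        return {chosen: target}
    if strategy == "diversified":
        names = sorted(pools)
        base, extra = divmod(target, len(names))
        return {name: base + (1 if i < extra else 0) for i, name in enumerate(names)}
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("lowest-price", "capacity-optimized", "price-capacity-optimized", "diversified"):
    print(name, allocate(POOLS, 10, name))
```

With this sample data the first three strategies each pick a different pool, which is the point of the comparison: cheapest, deepest, and a compromise between the two.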

6. Spot Fleet

Definition

Spot Fleet = managed group of Spot (+ optional On-Demand) instances meeting target capacity.

Features

  • Mix multiple instance types/families
  • Multiple Spot pools
  • Optional On-Demand baseline
  • Auto-replace terminated instances

Configuration

Spot Fleet:
  Target capacity: 100 instances (or vCPUs or memory units)
  On-Demand portion: 20 (always running)
  Spot portion: 80 (price-sensitive)
  Allocation strategy: capacity-optimized
  Instance types pool:
    - m5.large
    - m5.xlarge
    - m5a.large
    - c5.large
    - c5a.large
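The configuration above maps fairly directly onto a Spot Fleet request. The sketch below builds a dict in the shape of `SpotFleetRequestConfig` as I understand it; the role ARN and launch template ID are placeholders, the helper name is hypothetical, and nothing is sent to AWS.

```python
def build_spot_fleet_config(fleet_role_arn, launch_template_id, instance_types,
                            total=100, on_demand=20):
    """Build a SpotFleetRequestConfig-style dict (not sent to AWS here)."""
    if on_demand > total:
        raise ValueError("On-Demand portion cannot exceed target capacity")
    return {
        "IamFleetRole": fleet_role_arn,
        "Type": "maintain",  # keep target capacity by replacing interrupted instances
        "TargetCapacity": total,  # total, including the On-Demand portion
        "OnDemandTargetCapacity": on_demand,
        "AllocationStrategy": "capacityOptimized",
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": t} for t in instance_types],
        }],
    }


config = build_spot_fleet_config(
    "arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder ARN
    "lt-0123456789abcdef0",  # placeholder launch template ID
    ["m5.large", "m5.xlarge", "m5a.large", "c5.large", "c5a.large"],
)
print(config["TargetCapacity"] - config["OnDemandTargetCapacity"], "Spot units")
```

The Spot portion is not set explicitly: it is whatever remains of the target capacity after the On-Demand portion, 80 in this case.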

7. EC2 Auto Scaling with Mixed Instances

Mixed Instances Policy

  • The ASG launches multiple instance types
  • Mix Spot + On-Demand
  • Specify base capacity (On-Demand) + percentage (Spot)

Example

ASG:
  Min: 2, Desired: 10, Max: 50
  On-Demand base: 2 (always)
  On-Demand percentage above base: 20% (2 of every 10 above base)
  Spot percentage: 80%
  Instance types:
    - m5.large
    - m5a.large
    - m5n.large
    - m4.large
  Spot allocation strategy: capacity-optimized
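The example settings could be written as a `MixedInstancesPolicy` structure for an Auto Scaling group. This is a sketch of the dict shape as I understand it; the helper name and launch template ID are placeholders, and the dict is only built, not applied.

```python
def build_mixed_instances_policy(launch_template_id, instance_types,
                                 on_demand_base=2, on_demand_pct_above_base=20):
    """Build a MixedInstancesPolicy-style dict for an Auto Scaling group."""
    if not 0 <= on_demand_pct_above_base <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = build_mixed_instances_policy(
    "lt-0123456789abcdef0",  # placeholder launch template ID
    ["m5.large", "m5a.large", "m5n.large", "m4.large"],
)
print(policy["InstancesDistribution"])
```

There is no separate Spot percentage field: the Spot share is whatever the On-Demand percentage above base leaves over, 80% here.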

Result

  • 2 baseline On-Demand
  • Above 2: 20% On-Demand, 80% Spot
  • E.g., at desired=10: 8 instances sit above the base and 20% of 8 = 1.6, so about 2 more On-Demand, giving 4 On-Demand + 6 Spot
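The split can be checked with a few lines of arithmetic. The sketch below assumes that a fractional On-Demand share is rounded up, which is what reproduces the 4 + 6 result; treat the rounding rule as an assumption rather than documented behavior.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) counts for a desired capacity.

    Assumption: the On-Demand share above the base is rounded up.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Integer ceiling of above_base * pct / 100, avoiding float rounding.
    extra_on_demand = (above_base * on_demand_pct_above_base + 99) // 100
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


for desired in (2, 10, 50):
    print(desired, split_capacity(desired, 2, 20))
```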

8. Spot Best Practices

1. Design for interruption

  • Stateless or checkpoint state
  • Idempotent operations
  • Use SQS for work distribution (resume if interrupted)
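A minimal sketch of checkpoint-and-resume with idempotent writes follows. The `checkpoint` and `results` dicts stand in for durable storage such as S3 or DynamoDB, and the `Interrupted` exception simulates a Spot reclaim; all names are hypothetical.

```python
class Interrupted(Exception):
    """Raised to simulate a Spot interruption in the middle of a batch."""


def process_items(items, checkpoint, results, interrupt_at=None):
    """Process items in order, resuming from checkpoint["next_index"]."""
    start = checkpoint.get("next_index", 0)
    for index in range(start, len(items)):
        if interrupt_at is not None and index == interrupt_at:
            raise Interrupted(f"interrupted before item {index}")
        # Idempotent write: keyed by index, so a retry overwrites instead of duplicating.
        results[index] = items[index] * 2
        # Record progress only after the work for this item is stored.
        checkpoint["next_index"] = index + 1


items = [1, 2, 3, 4, 5]
checkpoint, results = {}, {}
try:
    process_items(items, checkpoint, results, interrupt_at=3)
except Interrupted:
    pass  # a replacement instance would pick the job up from the checkpoint

process_items(items, checkpoint, results)  # resumes at index 3
print(results)
```

Because the checkpoint is advanced after each stored result, an interruption costs at most one item of repeated work.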

2. Diversify

  • Multiple instance types, AZs
  • Reduce dependency on single pool

3. Use capacity-optimized

  • Lowest interruption
  • Often default best choice

4. Combine with On-Demand

  • Spot for elasticity (scale out/in)
  • On-Demand + Spot blend

5. Test interruption handling

  • Simulate via a mock metadata endpoint or AWS Fault Injection Service (formerly Fault Injection Simulator)
  • Verify app saves state, fails gracefully

9. Use Cases

Big Data (Spark, EMR)

  • EMR managed Spot + On-Demand cluster
  • Master + core: On-Demand (stable)
  • Task nodes: Spot (fault-tolerant compute)

CI/CD

  • Jenkins agents on Spot
  • Multiple instance types
  • Fallback to On-Demand if Spot unavailable

Batch processing

  • AWS Batch with Spot pricing
  • Auto-retry on interruption

Containerized workloads

  • ECS / EKS with Spot
  • Spot nodes for stateless services
  • Persistent state in RDS/EFS

Image/video rendering

  • Render farm on Spot
  • Each job checkpoints its progress
  • Resume if interrupted

10. Spot vs On-Demand Cost Example

Scenario: 100 m5.large for batch (4 hours/day)

Pure On-Demand

  • 100 × $0.096/hr × 4 hr × 30 days = $1,152/month

Spot (70% discount, ~10% interruption overhead)

  • 100 × $0.029/hr × 4 hr × 30 days × 1.1 ≈ $383/month
  • Saving: ~$769/month (67%)
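The comparison can be reproduced with a short calculation. The prices and the 10% interruption overhead are the figures from the scenario above; the function name is hypothetical.

```python
def monthly_cost(instances, hourly_price, hours_per_day, days=30, overhead=0.0):
    """Monthly cost in USD; overhead is extra run time as a fraction (0.10 = 10%)."""
    return instances * hourly_price * hours_per_day * days * (1 + overhead)


on_demand = monthly_cost(100, 0.096, 4)
spot = monthly_cost(100, 0.029, 4, overhead=0.10)
saving = on_demand - spot

print(f"On-Demand: ${on_demand:,.2f}")
print(f"Spot:      ${spot:,.2f}")
print(f"Saving:    ${saving:,.2f} ({saving / on_demand:.0%})")
```

The overhead term models work that has to be redone after interruptions, so a workload with better checkpointing would land closer to the raw 70% discount.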

11. Common Architecture Patterns

Pattern 1: Web app baseline + Spot scale

ASG:
- 5 On-Demand baseline (always)
- Auto-scale up to 50 with mix:
  - 20% On-Demand (reliability)
  - 80% Spot (cost)
- capacity-optimized strategy

Pattern 2: Big data pipeline

EMR cluster:
- Master + Core: On-Demand
- Task nodes: Spot Fleet (mix m5, c5, r5)
- Checkpoint to S3 frequently

Pattern 3: ML training

SageMaker Training:
- Use Managed Spot Training
- Save model checkpoints to S3
- Resume from checkpoint if interrupted
- 70-90% saving
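As a rough sketch, the settings for Managed Spot Training can be collected into keyword arguments for a SageMaker Python SDK estimator. The parameter names `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri` are the ones I understand the SDK to accept; the helper functions and the S3 URI are hypothetical, and no training job is started here.

```python
def managed_spot_settings(max_run_seconds, max_wait_seconds, checkpoint_s3_uri):
    """Return estimator keyword arguments for Managed Spot Training."""
    # max_wait covers training time plus time spent waiting for Spot capacity,
    # so it must not be shorter than max_run.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings relative to paying for every second of training time."""
    return round((1 - billable_seconds / training_seconds) * 100, 1)


settings = managed_spot_settings(3600, 7200, "s3://example-bucket/checkpoints/")
print(settings)
print(spot_savings_percent(1000, 250), "% saved")
```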

Pattern 4: CI/CD farm

Jenkins or GitHub Actions runners:
- Auto-launch Spot instance per job
- Self-host runner on Spot
- Reduce idle cost vs always-on

12. Spot Limits

Service quotas

  • Default: 256 vCPUs per region for Spot (the quota counts vCPUs, not instances, and the actual value can differ per account)
  • Request increase if needed
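Because the quota counts vCPUs, it helps to convert a planned fleet into vCPUs before launching. The sketch below uses the 256 vCPU figure from this section as the default and a small hand-written vCPU table; the function name is hypothetical.

```python
# vCPU counts for a few instance types used in this lesson.
VCPUS = {"m5.large": 2, "m5.xlarge": 4, "c5.large": 2}


def fits_quota(request, quota_vcpus=256, in_use_vcpus=0):
    """Return (vcpus_needed, fits) for a request of {instance_type: count}."""
    needed = sum(VCPUS[instance_type] * count for instance_type, count in request.items())
    return needed, in_use_vcpus + needed <= quota_vcpus


print(fits_quota({"m5.large": 100}))   # 100 instances, 200 vCPUs
print(fits_quota({"m5.xlarge": 100}))  # 100 instances, 400 vCPUs
```

The same instance count can fit or not fit depending on the size chosen, which is why a quota increase is often needed before moving to larger instance types.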

Region/AZ availability

  • Some instance types not in all regions
  • Availability changes (capacity)

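Cannot use Spot for

  • Dedicated Hosts
  • Capacity Reservations (they reserve On-Demand capacity)
  • Some specific instance types

Dedicated Instances are a different case: a Spot request can specify dedicated tenancy.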

Review questions

  1. How long before termination is the Spot interruption warning sent?

    Answer

    2 minutes (120 seconds). AWS sends the Spot Instance interruption notice through Instance Metadata (/latest/meta-data/spot/instance-action, or the older /latest/meta-data/spot/termination-time) and EventBridge. The application must checkpoint its work, drain connections, and save state within these 2 minutes. Best practice: poll the metadata every 5 seconds and configure ASG lifecycle hooks to drain gracefully from the load balancer before termination.

  2. Which allocation strategy gives the lowest interruption rate?

    Answer

    The capacity-optimized strategy: AWS chooses the pools with the most available capacity, which are therefore the least likely to be interrupted. This differs from lowest-price (the default), which chooses the cheapest pool. capacity-optimized-prioritized combines the two ideas: AWS favors the pools you rank highest but still optimizes for capacity. When avoiding interruption matters more than getting the lowest cost, for example batch ETL or HPC jobs, use capacity-optimized.

  3. Can a Spot Instance be stopped or hibernated?

    Answer

    Yes. Instead of being terminated, a Spot Instance can be stopped (EBS-backed only, since the instance state is kept on EBS) or hibernated (RAM is written to EBS, the instance is stopped, and it later resumes in the same state). The interruption behavior is configured in the request. Hibernate suits workloads that need their memory state preserved (for example, ML inference models already loaded). It requires an EBS volume at least as large as RAM, and an instance type and OS that support hibernation.

  4. When should you NOT use Spot?

    Answer

    Do not use Spot for: (1) databases (stateful; an interruption can cause data loss or corruption), (2) stateful web servers without proper drain/failover, (3) long-running jobs without checkpoints (all progress is lost on interruption), (4) compliance workloads that need guaranteed availability, (5) primary instances in HA architectures with no On-Demand fallback. Spot needs workloads that are fault-tolerant, stateless, or checkpoint-capable.

  5. What is a Mixed Instances Policy in an ASG?

    Answer

    A Mixed Instances Policy lets an ASG use multiple instance types on top of a single base Launch Template. It combines On-Demand and Spot in the same ASG: On-Demand for baseline capacity (stable), Spot for burst capacity (lower cost). Configuration: the On-Demand base capacity, the On-Demand percentage above base, the list of instance types (AWS picks those with the best availability), and the allocation strategy. Diversifying across many pools reduces Spot interruptions.

Hands-on exercises

  • Launch a Spot Instance with max price = On-Demand
  • Listen for the interruption notice in instance metadata
  • Create an ASG with a Mixed Instances Policy (50% Spot, 50% On-Demand)
  • Test a Spot Fleet with 3 instance types
  • Run a sample batch job on EMR with Spot task nodes
