Học Dev · SAA-C03 · Solutions Architect, Associate

Week 10 - Day 3: Spot Instance Strategies

Learning objectives

Understand Spot pricing and the interruption mechanism
Apply Spot Fleet with allocation strategies
Build fault-tolerant Spot architectures
Combine Spot and On-Demand in an Auto Scaling group (ASG)

1. Spot Recap

Spot Instances = unused EC2 capacity at up to a 90% discount compared with On-Demand; AWS can reclaim it with a 2-minute warning.

When to use

Fault-tolerant workloads (can interrupt + retry)
Stateless or checkpoint-based
Big data, ML training, image rendering
CI/CD workers, dev/test environments

When NOT to use

Database (stateful)
Mission-critical apps
Workloads that cannot be replaced mid-execution

2. Spot Pricing

Pricing model

AWS sets price based on capacity supply/demand
You set max price willing to pay
If Spot price ≤ max → instance runs
If Spot price > max → instance interrupted (terminated by default)

Spot Price History

View past 90 days
Identify stable instance types/AZs
Avoid volatile pools
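
One way to act on the price history is to rank pools by how much their price has moved. The sketch below assumes the history has already been fetched (for example through the EC2 DescribeSpotPriceHistory API) into (instance type, AZ, price) tuples. The function name and the sample values are invented for illustration and are not real AWS prices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_pools_by_stability(history):
    """Return (pool, variation) pairs sorted from most to least stable.

    history: iterable of (instance_type, az, price) samples.
    Variation is the coefficient of variation (stdev / mean) of the price.
    """
    prices = defaultdict(list)
    for instance_type, az, price in history:
        prices[(instance_type, az)].append(price)
    scored = [
        (pool, pstdev(values) / mean(values))
        for pool, values in prices.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical samples, not real AWS prices.
SAMPLE_HISTORY = [
    ("m5.large", "us-east-1a", 0.030),
    ("m5.large", "us-east-1a", 0.031),
    ("m5.large", "us-east-1a", 0.030),
    ("c5.large", "us-east-1b", 0.025),
    ("c5.large", "us-east-1b", 0.060),
    ("c5.large", "us-east-1b", 0.035),
]

if __name__ == "__main__":
    for pool, variation in rank_pools_by_stability(SAMPLE_HISTORY):
        print(pool, round(variation, 3))
```

Pools at the top of the list are the more stable candidates; a low price variation does not by itself guarantee a low interruption rate, since most interruptions are capacity-driven.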

Best practice

Set max price = On-Demand price (so the instance is less likely to be reclaimed because of price)
Most Spot interruptions are caused by capacity, not price

3. Spot Interruption Behaviors

3 options

Terminate (default): instance destroyed
Stop: stop instance, keep EBS, restart later if Spot available
Hibernate: save RAM to EBS, fast resume

Stop and Hibernate require

EBS-backed
Specific instance types

Use case for each

Terminate: stateless, can replace
Stop: large state, OK to pause
Hibernate: warm cache, fast resume needed
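
The interruption behavior is chosen when the instance is requested. As a rough sketch, the helper below builds the InstanceMarketOptions argument that the EC2 RunInstances API accepts; the helper name is hypothetical, and the rule that Stop and Hibernate need a persistent request is encoded as an assumption worth checking against the current EC2 documentation.

```python
def spot_market_options(behavior="terminate", max_price=None):
    """Build the InstanceMarketOptions argument for EC2 RunInstances."""
    if behavior not in ("terminate", "stop", "hibernate"):
        raise ValueError(f"unknown interruption behavior: {behavior}")
    spot_options = {
        # Assumption: stop and hibernate are only valid for persistent requests.
        "SpotInstanceType": "one-time" if behavior == "terminate" else "persistent",
        "InstanceInterruptionBehavior": behavior,
    }
    if max_price is not None:
        spot_options["MaxPrice"] = str(max_price)  # the API takes a string
    return {"MarketType": "spot", "SpotOptions": spot_options}


if __name__ == "__main__":
    # With boto3 this would be passed as:
    #   ec2.run_instances(..., InstanceMarketOptions=spot_market_options("stop"))
    print(spot_market_options("stop", max_price=0.096))
```

Leaving max_price unset lets the request default to the On-Demand price as its cap.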

4. Spot Interruption Notice

Workflow

AWS decides to reclaim instance
2-minute warning sent via instance metadata
App should:
- Save state (S3, DynamoDB)
- Drain connections (deregister from ELB)
- Checkpoint progress
After 2 min: instance terminated

Check warning

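# Inside the instance (IMDSv2: fetch a session token first)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/spot/instance-action
# Returns the action and its time as JSON if an interruption is coming; HTTP 404 otherwise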
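
The same check can be run from application code. Below is a minimal polling sketch using only the Python standard library. It assumes IMDSv2 is reachable from the instance and that the notice body is JSON with "action" and "time" fields; the function names and the on_notice callback are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the raw instance-action body, or None if no notice is pending."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def parse_instance_action(body):
    """Parse a notice such as {"action": "terminate", "time": "..."}."""
    notice = json.loads(body)
    return notice["action"], notice["time"]


def wait_for_interruption(on_notice, fetch=fetch_instance_action, interval=5):
    """Poll until a notice appears, then call on_notice(action, time)."""
    while True:
        body = fetch()
        if body is not None:
            return on_notice(*parse_instance_action(body))
        time.sleep(interval)
```

The fetch argument is injectable so the loop can be exercised off-instance with a fake; on a real instance, on_notice would save state, deregister from the load balancer, and checkpoint.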

EventBridge integration

EC2 Spot Instance Interruption Warning event → EventBridge rule → Lambda → handle
E.g., update a database, notify monitoring
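
A Lambda target for that flow could look roughly like the sketch below. The event pattern and the "instance-id" / "instance-action" detail fields follow the documented shape of the EC2 Spot Instance Interruption Warning event; the handler body, the return value, and the sample IDs are placeholders.

```python
# EventBridge rule pattern that matches Spot interruption warnings.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def handler(event, context=None):
    """Lambda entry point for Spot interruption warnings."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return {"handled": False}
    detail = event["detail"]
    instance_id = detail["instance-id"]
    action = detail["instance-action"]
    # Placeholder: deregister from the load balancer, update a database,
    # notify monitoring, and so on.
    print(f"Spot instance {instance_id} will be interrupted ({action})")
    return {"handled": True, "instance_id": instance_id, "action": action}


# Hypothetical sample event for local testing.
SAMPLE_EVENT = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "terminate",
    },
}

if __name__ == "__main__":
    print(handler(SAMPLE_EVENT))
```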

5. Spot Allocation Strategies (Spot Fleet & ASG)

Why important

Pool = combination (instance type + AZ + OS + tenancy)
Spread across pools = less interruption

Strategies

lowestPrice

Cheapest pool
Higher interruption risk
Use: cost-sensitive batch

diversified

Spread evenly across all pools
Better availability
Use: stateless workloads

capacityOptimized (recommended)

Pool with the largest available capacity
Lowest interruption rate
Use: most workloads, default recommendation

priceCapacityOptimized (newest, often best)

Balance lowest price + lowest interruption
Use: cost + reliability balance

Decision

capacity-optimized: priority is uptime
price-capacity-optimized: balance
lowest-price: priority is cost (more risk)
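
To make the difference between the strategies concrete, here is a toy model that picks pools from a small table. It is not the algorithm AWS runs: the spare_capacity score is a made-up stand-in (AWS does not publish such a number), and the price-capacity rule used here (keep the deeper half of the pools, then take the cheapest) is only an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    instance_type: str
    az: str
    price: float         # current Spot price, USD per hour (invented)
    spare_capacity: int  # hypothetical depth score


def choose_pools(pools, strategy):
    """Toy model of each allocation strategy."""
    if strategy == "lowest-price":
        return [min(pools, key=lambda p: p.price)]
    if strategy == "diversified":
        return list(pools)  # spread evenly across every pool
    if strategy == "capacity-optimized":
        return [max(pools, key=lambda p: p.spare_capacity)]
    if strategy == "price-capacity-optimized":
        by_depth = sorted(pools, key=lambda p: p.spare_capacity, reverse=True)
        deep = by_depth[: max(1, len(by_depth) // 2)]
        return [min(deep, key=lambda p: p.price)]
    raise ValueError(f"unknown strategy: {strategy}")


POOLS = [
    Pool("m5.large", "us-east-1a", 0.031, 40),
    Pool("c5.large", "us-east-1b", 0.025, 5),
    Pool("m5a.large", "us-east-1c", 0.028, 90),
    Pool("c5a.large", "us-east-1a", 0.027, 60),
]

if __name__ == "__main__":
    for name in ("lowest-price", "capacity-optimized", "price-capacity-optimized"):
        print(name, "->", choose_pools(POOLS, name)[0].instance_type)
```

In this table the cheapest pool is also the shallowest one, which is exactly the case where lowest-price carries the most interruption risk.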

6. Spot Fleet

Definition

Spot Fleet = managed group of Spot (+ optional On-Demand) instances meeting target capacity.

Features

Mix multiple instance types/families
Multiple Spot pools
Optional On-Demand baseline
Auto-replace terminated instances

Configuration

Spot Fleet:
  Target capacity: 100 instances (or vCPUs or memory units)
  On-Demand portion: 20 (always running)
  Spot portion: 80 (price-sensitive)
  Allocation strategy: capacity-optimized
  Instance types pool:
    - m5.large
    - m5.xlarge
    - m5a.large
    - c5.large
    - c5a.large
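
The outline above maps fairly directly onto the SpotFleetRequestConfig structure of the EC2 RequestSpotFleet API. The sketch below only builds the request dictionary; the helper name, role ARN, and launch template ID are hypothetical placeholders, and the call itself is left as a comment.

```python
INSTANCE_TYPES = ["m5.large", "m5.xlarge", "m5a.large", "c5.large", "c5a.large"]


def spot_fleet_config(fleet_role_arn, launch_template_id, instance_types,
                      target=100, on_demand=20):
    """Build SpotFleetRequestConfig for EC2 RequestSpotFleet."""
    return {
        "IamFleetRole": fleet_role_arn,
        "Type": "maintain",  # replace interrupted instances automatically
        "AllocationStrategy": "capacityOptimized",
        "TargetCapacity": target,             # total units, Spot + On-Demand
        "OnDemandTargetCapacity": on_demand,  # On-Demand share of the target
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": t} for t in instance_types],
        }],
    }


if __name__ == "__main__":
    config = spot_fleet_config(
        "arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder
        "lt-0123456789abcdef0",                               # placeholder
        INSTANCE_TYPES,
    )
    # With boto3: ec2.request_spot_fleet(SpotFleetRequestConfig=config)
    print(config["TargetCapacity"] - config["OnDemandTargetCapacity"], "Spot units")
```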

7. EC2 Auto Scaling with Mixed Instances

Mixed Instances Policy

ASG launches multiple instance types
Mixes Spot and On-Demand
Specify the On-Demand base capacity plus the On-Demand/Spot percentage split above it

Example

ASG:
  Min: 2, Desired: 10, Max: 50
  On-Demand base: 2 (always)
  On-Demand percentage above base: 20% (2 of every 10 above base)
  Spot percentage: 80%
  Instance types:
    - m5.large
    - m5a.large
    - m5n.large
    - m4.large
  Spot allocation strategy: capacity-optimized
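
In API terms, this example corresponds to the MixedInstancesPolicy argument of the Auto Scaling CreateAutoScalingGroup call. The sketch below builds only that dictionary; the helper name and the launch template name are hypothetical.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=2, on_demand_pct_above_base=20):
    """Build MixedInstancesPolicy for CreateAutoScalingGroup."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


if __name__ == "__main__":
    policy = mixed_instances_policy(
        "example-web-template",  # placeholder launch template name
        ["m5.large", "m5a.large", "m5n.large", "m4.large"],
    )
    # With boto3 (other required arguments such as subnets omitted):
    #   autoscaling.create_auto_scaling_group(
    #       AutoScalingGroupName="example-asg", MinSize=2, MaxSize=50,
    #       DesiredCapacity=10, MixedInstancesPolicy=policy, ...)
    print(policy["InstancesDistribution"])
```

The Spot percentage is not set directly: it is whatever remains after OnDemandPercentageAboveBaseCapacity, here 80%.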

Result

2 baseline On-Demand
Above 2: 20% On-Demand, 80% Spot
E.g., at desired=10: 8 instances sit above the base, 20% of 8 = 1.6, rounded up to 2 On-Demand → 4 On-Demand + 6 Spot in total
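
The split can be checked with a few lines of arithmetic. This sketch assumes fractional results are rounded up in favor of On-Demand; that rounding rule is an assumption here and should be confirmed against the Auto Scaling documentation. The function name is hypothetical.

```python
def purchase_split(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity."""
    base = min(desired, on_demand_base)
    above = desired - base
    # Ceiling division in integers; assumption: fractions round up
    # in favor of On-Demand.
    extra_on_demand = -(-above * on_demand_pct_above_base // 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired in (2, 10, 50):
        print(desired, purchase_split(desired, 2, 20))
```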

8. Spot Best Practices

1. Design for interruption

Stateless or checkpoint state
Idempotent operations
Use SQS for work distribution (resume if interrupted)
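
A checkpointed, resumable worker can be sketched as follows. For simplicity it checkpoints to a local file; on Spot the checkpoint would normally live in S3 or DynamoDB, and the work items would come from a queue such as SQS. All names here are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write atomically so a crash never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_job(items, process, checkpoint_path, should_stop=lambda: False):
    """Process items in order, resuming from the last checkpoint.

    Returns True when every item is done, False if stopped early.
    """
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        if should_stop():  # e.g. an interruption notice was seen
            return False
        process(items[index])  # must be idempotent: may rerun after a crash
        save_checkpoint(checkpoint_path, index + 1)
    return True
```

Because the checkpoint is written only after an item finishes, an interruption between the two steps means that item runs again on resume, which is why process needs to be idempotent.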

2. Diversify

Multiple instance types, AZs
Reduce dependency on single pool

3. Use capacity-optimized

Lowest interruption
Often default best choice

4. Combine with On-Demand

Spot for elasticity (scale out/in)
On-Demand + Spot blend

5. Test interruption handling

Simulate an interruption via a mock metadata endpoint or AWS Fault Injection Service (formerly Fault Injection Simulator)
Verify that the app saves state and fails gracefully

9. Use Cases

Big Data (Spark, EMR)

EMR managed Spot + On-Demand cluster
Master + core: On-Demand (stable)
Task nodes: Spot (fault-tolerant compute)

CI/CD

Jenkins agents on Spot
Multiple instance types
Fallback to On-Demand if Spot unavailable

Batch processing

AWS Batch with Spot compute environments
Auto-retry on interruption

Containerized workloads

ECS / EKS with Spot
Spot nodes for stateless services
Persistent state in RDS/EFS

Image/video rendering

Render farm on Spot
Each job checkpoints its progress
Resume if interrupted

10. Spot vs On-Demand Cost Example

Scenario: 100 m5.large for batch (4 hours/day)

Pure On-Demand

100 × $0.096/hr × 4 hr × 30 days = $1,152/month

Spot (70% discount, ~10% interruption overhead)

100 × $0.029/hr × 4 hr × 30 days × 1.1 = $382.80/month
Saving: ~$769/month (about 67%)
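
The arithmetic for this scenario can be reproduced with a short script. The prices and the 10% overhead are the figures used in this section, not live AWS prices; the function name is hypothetical.

```python
def monthly_cost(instances, hourly_price, hours_per_day, days=30, overhead=0.0):
    """Monthly cost in USD, with an optional fractional overhead for reruns."""
    return instances * hourly_price * hours_per_day * days * (1 + overhead)


on_demand = monthly_cost(100, 0.096, 4)
spot = monthly_cost(100, 0.029, 4, overhead=0.10)
saving = on_demand - spot

if __name__ == "__main__":
    print(f"On-Demand: ${on_demand:,.2f}/month")
    print(f"Spot:      ${spot:,.2f}/month")
    print(f"Saving:    ${saving:,.2f}/month ({saving / on_demand:.0%})")
```

Changing the overhead argument shows how sensitive the saving is to the interruption rate: even a 50% rerun overhead would still leave Spot well below the On-Demand cost in this scenario.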

11. Common Architecture Patterns

Pattern 1: Web app baseline + Spot scale

ASG:
- 5 On-Demand baseline (always)
- Auto-scale up to 50 with mix:
  - 20% On-Demand (reliability)
  - 80% Spot (cost)
- capacity-optimized strategy

Pattern 2: Big data pipeline

EMR cluster:
- Master + Core: On-Demand
- Task nodes: Spot Fleet (mix m5, c5, r5)
- Checkpoint to S3 frequently

Pattern 3: ML training

SageMaker Training:
- Use Managed Spot Training
- Save model checkpoints to S3
- Resume from checkpoint if interrupted
- 70-90% saving
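
For Pattern 3, Managed Spot Training is switched on through a handful of CreateTrainingJob parameters. The sketch below builds only that subset; the helper name, bucket URI, and default durations are hypothetical, and the remaining required parameters (algorithm, role, data, resources) are omitted.

```python
def managed_spot_training_settings(checkpoint_s3_uri,
                                   max_runtime_s=3600, max_wait_s=7200):
    """Subset of CreateTrainingJob parameters for Managed Spot Training."""
    if max_wait_s < max_runtime_s:
        raise ValueError("max wait must be at least the max runtime")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            # Total budget: training time plus time spent waiting for Spot.
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        # Checkpoints written here let training resume after an interruption.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }


if __name__ == "__main__":
    print(managed_spot_training_settings("s3://example-bucket/checkpoints/"))
```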

Pattern 4: CI/CD farm

Jenkins or GitHub Actions runners:
- Auto-launch Spot instance per job
- Self-host runner on Spot
- Reduce idle cost vs always-on

12. Spot Limits

Service quotas

Default: 256 vCPUs per Region for Spot (the actual default can vary by account and instance family)
Request increase if needed

Region/AZ availability

Some instance types not in all regions
Availability changes (capacity)

Cannot use Spot for

Dedicated Hosts
Capacity Reservations
Some specific instance types
Note: Dedicated Instances (dedicated tenancy) can run as Spot, but only on supported instance types

Review questions

How long is the Spot interruption warning before termination?

Answer

2 minutes (120 seconds). AWS sends the Spot Instance interruption notice through instance metadata (/latest/meta-data/spot/instance-action, or the older /latest/meta-data/spot/termination-time) and through EventBridge. The application must checkpoint its work, drain connections, and save state within those 2 minutes. Best practice: poll the metadata every 5 seconds, and configure ASG lifecycle hooks to drain gracefully from the load balancer before termination.

Which allocation strategy gives the lowest interruption rate?

Answer

capacity-optimized: AWS picks the pools with the most spare capacity, which are therefore the least likely to be interrupted. This differs from lowest-price (the default), which picks the cheapest pool. capacity-optimized-prioritized combines the two: AWS favors the pools you prioritize but still optimizes for capacity. Use it when avoiding interruptions matters more than getting the lowest possible cost; batch ETL or HPC jobs should use capacity-optimized.

Can a Spot Instance be stopped or hibernated?

Answer

Yes. Instead of being terminated, a Spot Instance can Stop (EBS-backed only; the instance state stays on EBS) or Hibernate (RAM is dumped to EBS, the instance stops, and it later resumes in the same state). The interruption behavior is configured when making the request. Hibernate suits workloads that need their memory state preserved (for example, ML inference with models already loaded). It requires an EBS volume at least as large as RAM, plus an instance type and OS that support hibernation.

When should you NOT use Spot?

Answer

Do not use Spot for: (1) databases (stateful; an interruption can cause data loss or corruption), (2) stateful web servers without proper drain/failover, (3) long-running jobs that do not checkpoint (all progress is lost on interruption), (4) compliance workloads that need guaranteed availability, (5) primary instances in HA architectures that have no On-Demand fallback. Spot needs workloads that are fault-tolerant, stateless, or able to checkpoint.

What is a Mixed Instances Policy in an ASG?

Answer

A Mixed Instances Policy lets an ASG use multiple instance types based on the same Launch Template. It combines On-Demand and Spot in one ASG: On-Demand for baseline capacity (stable), Spot for burst capacity (lower cost). Configuration: the On-Demand base capacity and the On-Demand percentage above base, the list of instance types (AWS picks the best available), and the allocation strategy. Diversifying across many pools reduces Spot interruptions.

Hands-on exercises

Launch a Spot Instance with max price = On-Demand price
Listen for the interruption notice in instance metadata
Create an ASG with a Mixed Instances Policy (50% Spot, 50% On-Demand)
Test Spot Fleet with 3 instance types
Run a sample batch job on EMR with Spot task nodes
