Week 5 - Day 5: Amazon Redshift Introduction | SAA-C03

Learning objectives

Understand that Redshift is a data warehouse for analytics (OLAP)
Distinguish RA3 nodes from DC2 nodes
Understand Redshift Spectrum and Concurrency Scaling
Know when to use Redshift vs Athena vs EMR

1. Redshift Overview

Amazon Redshift = fully managed, petabyte-scale data warehouse for OLAP (analytics).

Key characteristics

Columnar storage: columns compress well, so analytic queries read less data and run faster
MPP (Massively Parallel Processing): each query is split across nodes
SQL-based (PostgreSQL-compatible dialect)
Scale: GB → petabyte
Performance: often cited as up to 10x faster than traditional row-based databases for analytics; actual gains depend on the workload
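
To make the columnar storage point concrete, here is a minimal Python sketch. It is a toy model only, not how Redshift stores blocks on disk, and the `orders` data and field names are made up for illustration. It counts how many values each layout has to read to sum a single column.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table contents and field names are hypothetical.

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_store(table):
    """Sum one field, counting every value a row store would read."""
    values_read = 0
    total = 0.0
    for row in table:
        values_read += len(row)  # the whole row is fetched
        total += row["amount"]
    return total, values_read


def sum_amount_column_store(table):
    """Sum one field, reading only that column."""
    amounts = table["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = sum_amount_row_store(rows)
col_total, col_reads = sum_amount_column_store(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same total, but the column layout touches only the values of the one column the query needs. That gap grows with the number of columns in the table.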

Use cases

Business Intelligence (BI) dashboards
Reporting (sales, marketing analytics)
Log analysis
Data lakes (combined with S3 + Spectrum)

NOT for

OLTP (transactional) workload — use RDS/Aurora
Single-row lookups — use DynamoDB
Real-time data ingestion at high frequency — buffer first

2. OLAP vs OLTP

Aspect	OLTP (RDS, Aurora, DynamoDB)	OLAP (Redshift)
Workload	Transactions (insert, update)	Analytics (aggregation)
Query type	Few rows, indexed lookup	Many rows, full scan
Schema	Normalized	Denormalized (star/snowflake)
Data size	GB-TB	TB-PB
Storage	Row-based	Column-based
Performance metric	Latency (ms)	Throughput (GB/sec)
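
The two query shapes in the table can be sketched with Python's built-in `sqlite3` module. SQLite is row-based and only stands in for a generic SQL engine here, so this shows the difference in query shape, not Redshift performance. The `orders` table is hypothetical.

```python
import sqlite3

# In-memory SQLite is a stand-in for any SQL engine; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 50.0), (4, "US", 20.0)],
)

# OLTP shape: touch one row through the primary key.
oltp_row = conn.execute("SELECT amount FROM orders WHERE id = ?", (2,)).fetchone()

# OLAP shape: scan every row and aggregate by a dimension.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
conn.close()
```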

3. Redshift Architecture

Leader Node

Coordinates queries
Plans execution
Aggregates results from compute nodes

Compute Nodes

Actual query execution
Data stored on local disk (DC2) or managed storage (RA3)
Slices: parallel processing units per node
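
The split between leader and compute work can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the slice count is hypothetical, rows are assigned with a plain modulo on an integer key (Redshift uses its own hashing), and everything runs in one process rather than in parallel.

```python
from collections import defaultdict

NUM_SLICES = 4  # hypothetical; real slice counts depend on the node type


def distribute(rows, key, num_slices=NUM_SLICES):
    """Assign each row to a slice by its distribution key (modulo stand-in)."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row[key] % num_slices].append(row)
    return slices


def slice_partial_sum(slice_rows):
    """Work done on one slice: a partial SUM(amount) per region."""
    partial = defaultdict(float)
    for row in slice_rows:
        partial[row["region"]] += row["amount"]
    return dict(partial)


def leader_merge(partials):
    """Work done on the leader: combine partial results from all slices."""
    final = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            final[region] += amount
    return dict(final)


rows = [
    {"user_id": i, "region": "EU" if i % 3 else "US", "amount": 10.0 * i}
    for i in range(1, 9)
]
result = leader_merge(slice_partial_sum(s) for s in distribute(rows, "user_id"))
print(result)
```

Each slice only sees its own rows, and the leader only sees small partial results. This is why aggregations scale well as nodes are added.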

4. Node Types

RA3 (Recommended, 2019+)

Managed storage: separates compute from storage
Storage scales independently from compute
Uses S3-backed managed storage (hot data is tiered automatically onto local SSD)
Sizes: ra3.xlplus, ra3.4xlarge, ra3.16xlarge
Pay for compute + managed storage

DC2 (Dense Compute)

Local SSD storage attached
Compute + storage tied together
Smaller cluster (< 1 TB)
Cheaper for small datasets
Sizes: dc2.large, dc2.8xlarge

DS2 (Dense Storage) — Legacy

HDD storage
Deprecated, migrate to RA3

Recommendation

RA3 for every new production workload
DC2 for small dev/test

5. Redshift Spectrum

Run SQL queries against S3 data (Parquet, ORC, CSV, JSON)
Combine S3 data with local Redshift tables in one query (not the same as federated query, which targets RDS/Aurora databases)
Pay per data scanned ($5/TB scanned)
Use AWS Glue Data Catalog for schema

Use case

Data lake architecture (S3 = storage, Redshift = compute)
Query historical data in S3 (without loading)
Cost-effective for infrequently accessed data
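
Because Redshift Spectrum bills by data scanned, the file format has a direct effect on cost. The following is a rough arithmetic sketch: the $5/TB rate comes from the notes above, while the table size, column count, and compression ratio are hypothetical, and a binary terabyte is assumed.

```python
PRICE_PER_TB_USD = 5.0    # rate quoted in these notes; may differ by region
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query given how many bytes it scanned in S3."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical 2 TB table stored as CSV: every query scans all of it.
csv_cost = scan_cost_usd(2 * BYTES_PER_TB)

# Same data as Parquet, assuming the query reads 2 of 20 equally sized
# columns and compression shrinks the files to a quarter of the CSV size.
parquet_bytes = 2 * BYTES_PER_TB * (2 / 20) * (1 / 4)
parquet_cost = scan_cost_usd(parquet_bytes)

print(round(csv_cost, 2), round(parquet_cost, 2))
```

Under these assumptions the same query costs a small fraction as much on Parquet, which is the usual argument for converting data lake files to a columnar format.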

Example query

-- Query S3 data via Spectrum
SELECT *
FROM redshift_spectrum.s3_logs
WHERE date > '2024-01-01';

-- Join S3 data (external table) with a local Redshift table
SELECT u.name, COUNT(*) AS event_count
FROM users u
JOIN redshift_spectrum.s3_events e ON u.id = e.user_id
GROUP BY u.name;

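6. Concurrency Scaling

Adds transient cluster capacity automatically within seconds when concurrent queries queue up
Billed per second of use (each cluster earns up to one hour of free credits per day)
Use case: many concurrent BI users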
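
How free credits offset Concurrency Scaling usage can be sketched as simple arithmetic. The function name and the example figures are hypothetical. The credit balance is passed in as a parameter rather than modelled, since the accrual rules are not covered in these notes.

```python
def concurrency_scaling_usage(used_seconds, free_credit_seconds):
    """Return (billable_seconds, remaining_credit_seconds) for one period."""
    billable = max(0, used_seconds - free_credit_seconds)
    remaining = max(0, free_credit_seconds - used_seconds)
    return billable, remaining


# Hypothetical day: 90 minutes of scaling clusters against a 1-hour credit.
busy_day = concurrency_scaling_usage(used_seconds=5400, free_credit_seconds=3600)

# Hypothetical quiet day: 10 minutes of use, fully covered by the credit.
quiet_day = concurrency_scaling_usage(used_seconds=600, free_credit_seconds=3600)

print(busy_day, quiet_day)
```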

7. Redshift Data Loading

COPY command (primary method)

COPY my_table
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftS3Role'
DELIMITER ','
IGNOREHEADER 1;

Sources

S3 (most common, parallel load)
DynamoDB
EMR
Remote host (SSH)
AWS DMS (from RDS, on-prem DB)
Kinesis Data Firehose (streaming)

Best practice

Split files into multiple parts (~1 GB each) → parallel load across slices
Use columnar formats (Parquet/ORC), which compress well
Compress text files (gzip, bzip2)
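
The file-splitting guideline can be turned into a small planning helper. This sketch assumes one extra rule that is not in these notes: AWS loading guidance also suggests making the number of files a multiple of the number of slices, so that every slice gets an equal share. The function name and the example sizes are hypothetical.

```python
import math

MAX_FILE_BYTES = 1 * 1024 ** 3  # ~1 GB per file, per the guideline above


def plan_file_count(total_bytes, num_slices, max_file_bytes=MAX_FILE_BYTES):
    """Smallest file count that is a multiple of the slice count
    while keeping every file at or below max_file_bytes."""
    if total_bytes <= 0 or num_slices <= 0:
        raise ValueError("total_bytes and num_slices must be positive")
    min_files = math.ceil(total_bytes / max_file_bytes)
    return math.ceil(min_files / num_slices) * num_slices


# Hypothetical 10 GB export loaded into a cluster with 4 slices.
files = plan_file_count(10 * 1024 ** 3, num_slices=4)
print(files)
```

Ten files would satisfy the size limit, but twelve lets each of the four slices load exactly three files in parallel.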

8. Redshift Backup

Automated snapshots

Incremental snapshots stored in S3 (managed by Redshift)
Retention: 1-35 days (default 1 day)
Cross-region copy supported

Manual snapshots

User-initiated
Retain indefinitely

Restore

Restore to new cluster from snapshot

9. Redshift Security

Network

VPC deployment
Enhanced VPC Routing: force traffic through VPC (Spectrum, COPY)
Security groups control network access

Encryption

At rest: KMS or HSM
In transit: SSL/TLS
Best enabled at cluster launch; turning it on later makes Redshift migrate the data to an encrypted cluster

Access control

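IAM for AWS-level actions (cluster management)
Database users for SQL access
IAM-based database authentication (temporary credentials instead of stored passwords)
Audit logging: connection and user activity logs can be delivered to S3; STL_ and SVL_ system tables hold query history inside the cluster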

10. Redshift Serverless

Pay per RPU-second (RPU = Redshift Processing Unit; priced in RPU-hours)
Pauses automatically when idle (saves cost)
Auto-scales based on workload
Same SQL and largely the same features as provisioned clusters

Use case

Sporadic analytics workload
Dev/test environments
New customer experimentation

Cost

Free trial: $300 in credits for first-time Redshift Serverless users
Pay-per-use after
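
Redshift Serverless compute cost can be estimated from RPUs and active time. This is a sketch only: the price per RPU-hour is a made-up figure (check the pricing page for your region), and the 60-second minimum charge is an assumption built into the default parameter.

```python
def serverless_compute_cost_usd(
    rpus,
    active_seconds,
    price_per_rpu_hour,
    min_billed_seconds=60,  # assumption: minimum charge per active period
):
    """Estimate compute cost for one active period of a workgroup."""
    if active_seconds > 0:
        billed_seconds = max(active_seconds, min_billed_seconds)
    else:
        billed_seconds = 0  # idle: no compute charge
    rpu_hours = rpus * billed_seconds / 3600
    return rpu_hours * price_per_rpu_hour


# Hypothetical numbers: 8 RPUs busy for 15 minutes at $0.40 per RPU-hour.
cost = serverless_compute_cost_usd(rpus=8, active_seconds=900, price_per_rpu_hour=0.40)
print(round(cost, 2))
```

Storage is billed separately and is not part of this estimate.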

11. Redshift vs Athena vs EMR

Aspect	Redshift	Athena	EMR
Type	Data warehouse (cluster)	Serverless query	Managed Hadoop/Spark cluster
Performance	Fastest for complex SQL on TB	Good for ad-hoc	Variable, depends on cluster
Cost	Pay for cluster	Pay per query ($5/TB)	Pay for EC2 + EMR fee
Latency	Sub-sec to seconds	Seconds to minutes	Minutes
Setup	Provision cluster	None	Cluster management
Use case	BI dashboards, frequent queries	Ad-hoc, infrequent queries	Big data processing (ETL)

Decision

Frequent complex queries on TB-PB: Redshift
Ad-hoc queries on S3: Athena
Custom ETL with Spark/Hive: EMR
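
The decision rules above can be written as a small helper, which is handy for drilling exam scenarios. This is a deliberately simplified sketch. The function and its three boolean inputs are hypothetical, and real choices also depend on cost, data volume, and team skills.

```python
def choose_service(frequent_queries, needs_custom_etl, wants_no_cluster):
    """Map the decision rules above onto a service name."""
    if needs_custom_etl:
        return "EMR"        # custom ETL with Spark/Hive
    if frequent_queries and not wants_no_cluster:
        return "Redshift"   # frequent complex queries on TB-PB
    return "Athena"         # ad-hoc queries on S3, nothing to manage


print(choose_service(frequent_queries=True, needs_custom_etl=False, wants_no_cluster=False))
print(choose_service(frequent_queries=False, needs_custom_etl=False, wants_no_cluster=True))
print(choose_service(frequent_queries=False, needs_custom_etl=True, wants_no_cluster=False))
```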

12. Modern Data Lake Architecture

Review questions

Is Redshift used for OLAP or OLTP?

OLAP (Online Analytical Processing): data warehousing, analytics, reporting, BI. Columnar storage is optimized for aggregations and scans over specific columns in large datasets. It is not a good fit for OLTP (many small transactions, row-level updates); use RDS/Aurora for OLTP. Redshift is designed for complex queries over billions of rows, at data warehouse scale.

Where does Redshift Spectrum query data?

Redshift Spectrum queries data directly in Amazon S3, with no need to load it into the Redshift cluster. It uses external tables (schemas defined in the AWS Glue Data Catalog). The Spectrum layer can scale out to thousands of nodes that process S3 data in parallel. It fits the "data lake" pattern: cold/historical data on S3, hot data in Redshift. Billing is by bytes scanned ($5/TB).

How do RA3 and DC2 differ?

DC2: compute-optimized with local SSD storage. Storage is fixed by node type and cannot scale independently of compute. RA3: data lives in Redshift Managed Storage (S3-backed), so compute and storage scale independently. RA3 fits when the dataset is larger than the total SSD capacity of the DC2 nodes. AWS recommends RA3 for new workloads because it is more flexible.

When should you use Athena instead of Redshift?

Use Athena when: (1) queries on S3 data are ad-hoc and infrequent, (2) you do not want to manage a cluster, (3) the data's schema changes often (schema-on-read), (4) the budget is low (pay-per-query). Use Redshift when: queries are complex and frequent, high performance is required (seconds vs minutes), for data warehousing, for BI dashboards that need consistent query times, or when complex joins are needed.

How is Redshift Serverless billed?

In RPU-hours (RPU = Redshift Processing Unit), metered per second for the RPUs in use while queries run. There is no compute charge when no queries are running (scale to zero); storage is billed separately. Base capacity can be set from a minimum of 8 RPUs to a maximum of 512 RPUs. It suits sporadic, unpredictable workloads and development/test. A provisioned cluster is cheaper for steady, predictable workloads.

Hands-on exercises

Create a Redshift Serverless workgroup (free trial)
Load sample data from S3 with the COPY command
Run SELECT queries with joins
Set up Redshift Spectrum to query a Parquet file in S3
Compare query performance: the same query in Athena vs Redshift Spectrum
