Week 4 - Day 23: Logging, Audit, and Incident Response | Security

Learning objectives

Identify the security-relevant events that must be logged (auth, privileged action, data access, config change)
Keep PII and secrets out of logs: redaction and structured logging with a schema
Design centralized logging (ELK, Splunk, Datadog, CloudWatch Logs) around retention requirements
Understand SIEM: correlation rules, alerting, threat intel
Apply the NIST Incident Response lifecycle: Preparation → Detection → Containment → Eradication → Recovery → Lessons Learned
Prepare runbooks, an on-call rotation, and tabletop exercises
Forensics 101: preserve evidence, chain of custody, timeline reconstruction
GDPR 72-hour breach notification and similar regulations

1. What to log for security

Not every log line has security value. Focus on events that help you detect, investigate, and prove what happened:

Event type	Example	Why it matters
Authentication	Login success/fail, MFA challenge, password reset	Brute force, credential stuffing detection
Authorization	Access denied, role assumption, privilege escalation	Detects horizontal/vertical privilege abuse
Privileged action	`sudo`, admin user creating an IAM key, firewall rule modification	Insider threat, post-compromise activity
Data access	Database query, S3 GetObject on PII, report export	GDPR audit, data exfiltration detection
Configuration change	IAM policy modified, SG rule added, production deploy	Change tracking, blameless root cause analysis
Security tool event	WAF block, IDS alert, EDR detection, AV quarantine	Active attack indicator
Application error	5xx, exception stack trace with security context	Vulnerability exploitation evidence

Minimum log fields for every event

{
  "timestamp": "2025-06-05T10:23:45.123Z",     // ISO 8601 UTC + millisecond
  "event_id": "auth.login.success",
  "actor": {                                     // who
    "user_id": "u_12345",
    "session_id": "s_abc...",
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0..."
  },
  "action": "login",                             // what was done
  "resource": "user/u_12345",                    // on what
  "result": "success",                           // outcome
  "request_id": "req_xyz...",                    // to correlate with upstream/downstream
  "trace_id": "trace_abc...",                    // distributed tracing
  "metadata": {
    "mfa_used": true,
    "method": "password"
  }
}

Why structured (JSON)? SIEM and log platforms can query specific fields (event_id:auth.login.fail AND actor.ip:203.0.113.42). With plain-text logs you fall back to grep, which is slow and cannot be parsed reliably.
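
As a rough illustration of the minimum field set, the sketch below serializes one event as a single JSON line. The helper name `build_event` and the sample IDs are assumptions made for this example; they are not part of any logging library.

```python
import json
from datetime import datetime, timezone

def build_event(event_id, actor, action, resource, result, request_id, trace_id, **metadata) -> str:
    """Serialize one security event as a single JSON line with the minimum fields."""
    now = datetime.now(timezone.utc)
    event = {
        # ISO 8601, UTC, millisecond precision, with a trailing "Z"
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "event_id": event_id,
        "actor": actor,
        "action": action,
        "resource": resource,
        "result": result,
        "request_id": request_id,
        "trace_id": trace_id,
        "metadata": metadata,
    }
    return json.dumps(event, separators=(",", ":"))

line = build_event(
    "auth.login.success",
    actor={"user_id": "u_12345", "ip": "203.0.113.42"},
    action="login",
    resource="user/u_12345",
    result="success",
    request_id="req_1",    # hypothetical IDs for the example
    trace_id="trace_1",
    mfa_used=True,
    method="password",
)
parsed = json.loads(line)
```

Because every event is one self-describing JSON object, a log platform can index `actor.ip` or `event_id` directly instead of parsing free text.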

2. PII and secrets in logs: redact!

The problem

# ❌ BAD
logger.info(f"User signup: {request.json}")
# → the log now contains the full email, password, and credit card

2025-06-05T10:23:45Z INFO User signup: {"email":"alice@example.com","password":"hunter2","ssn":"123-45-6789","cc":"4111-1111-1111-1111"}

Consequence: a leaked log file (CloudWatch Logs IAM misconfiguration, public S3 export, compromised dev workstation) leaks the PII of thousands of users, and GDPR fines can reach 4% of global annual turnover.

Redact pattern

import logging

logger = logging.getLogger(__name__)

# Allow-list approach: log only fields that have been approved
ALLOWED_FIELDS = {"user_id", "email_hash", "role", "country"}

def safe_log(event, payload):
    # Unknown or newly added fields are dropped by default
    safe_payload = {k: v for k, v in payload.items() if k in ALLOWED_FIELDS}
    logger.info(event, extra=safe_payload)

# Or a deny-list: mask sensitive fields (top-level keys only)
SENSITIVE = {"password", "ssn", "credit_card", "api_key", "token"}

def redact(payload):
    return {k: ("***REDACTED***" if k in SENSITIVE else v) for k, v in payload.items()}
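
The `redact` function above only inspects top-level keys, so a password nested inside `user` would pass through. A minimal sketch of a recursive variant follows; the name `redact_deep` is an assumption for this example, and the key set is copied from the deny-list above.

```python
from typing import Any

SENSITIVE = {"password", "ssn", "credit_card", "api_key", "token"}
MASK = "***REDACTED***"

def redact_deep(value: Any) -> Any:
    """Return a copy of value with sensitive keys masked at any nesting depth."""
    if isinstance(value, dict):
        return {
            k: MASK if k in SENSITIVE else redact_deep(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact_deep(item) for item in value]
    return value

event = {
    "user": {"id": "u_1", "password": "hunter2"},
    "cards": [{"credit_card": "4111-1111-1111-1111"}],
}
clean = redact_deep(event)
```

The function builds a new structure rather than mutating the input, so the original payload can still be used by the request handler.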

Pino (Node) redact built-in

import pino from "pino";

const logger = pino({
  redact: {
    paths: [
      "password",
      "*.password",
      "req.headers.authorization",
      "req.headers.cookie",
      "user.ssn",
      "user.creditCard.number",
    ],
    censor: "[REDACTED]",
  },
});

logger.info({ user: { email: "a@x.com", ssn: "123-45-6789" } });
// → user.ssn = "[REDACTED]"

Email/identifier: hash instead of plain text

import hashlib

def email_hash(email):
    # Normalize first so "Alice@Example.com " and "alice@example.com" map to the same value
    return hashlib.sha256(email.lower().strip().encode()).hexdigest()[:16]

logger.info("user.login", extra={"email_hash": email_hash(user.email)})
# → requests from the same user can still be correlated without exposing the email
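
One caveat: a plain SHA-256 of an email can be reversed by anyone who hashes a list of candidate addresses. A possible hardening, sketched here under the assumption that a secret key is available to the app but not to the log platform, is a keyed hash (HMAC). The environment variable `LOG_HASH_KEY` and the function name `email_pseudonym` are hypothetical.

```python
import hashlib
import hmac
import os

# Hypothetical setting: a secret stored outside the log platform.
# The fallback value is for local experiments only.
LOG_HASH_KEY = os.environ.get("LOG_HASH_KEY", "dev-only-key").encode()

def email_pseudonym(email: str, key: bytes = LOG_HASH_KEY) -> str:
    """Keyed hash of a normalized email, truncated to 16 hex chars for log fields."""
    normalized = email.lower().strip().encode()
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]

a = email_pseudonym("Alice@Example.com ")
b = email_pseudonym("alice@example.com")
c = email_pseudonym("alice@example.com", key=b"another-key")
```

The same user still maps to the same value under one key, so correlation is preserved, while someone holding only the logs cannot test guesses without the key.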

Log filter / processor layer

Place a filter in the pipeline (Fluentd, Vector, Logstash) that strips sensitive fields before they reach the index. If the app logs something by mistake, the filter still catches it.

# Vector config: drop or mask fields
[transforms.redact]
type = "remap"
inputs = ["app-logs"]
source = '''
  del(.password)
  del(.credit_card)
  if exists(.ssn) { .ssn = "***" }
'''

3. Centralized logging and retention

Architecture

Choosing a platform

Platform	Characteristics	Use case
ELK / Elastic Stack	OSS, self-hosted, powerful	Teams with DevOps capacity
Splunk	Enterprise, strong SIEM/SOAR	Compliance, regulated industry
Datadog Logs	SaaS, integrated with APM/metrics	Cloud-native, no ops overhead wanted
AWS CloudWatch Logs	AWS-native, Logs Insights query	AWS service integration, low volume
Grafana Loki	OSS, label-based, cheap	Large log volume, metric-style queries
OpenSearch	AWS-initiated fork of Elasticsearch	Avoiding the Elastic license, AWS-native

Retention requirements

Standard	Retention
PCI-DSS	At least 12 months, with the most recent 3 months immediately available
HIPAA	6 years
SOX	7 years
GDPR	"Storage limitation": no longer than necessary, typically 6-24 months
ISO 27001	Defined in your policy (typically 1-3 years)

Hot/Warm/Cold tier

# AWS CloudWatch: log group with 365-day retention
aws logs put-retention-policy \
  --log-group-name /aws/lambda/api \
  --retention-in-days 365

# Export sang S3 (cheaper long-term)
aws logs create-export-task \
  --log-group-name /aws/lambda/api \
  --from $(date -d "30 days ago" +%s)000 \
  --to $(date +%s)000 \
  --destination security-logs-archive-bucket \
  --destination-prefix lambda-api/

Log integrity

Attackers may delete logs to cover their tracks. Protections:

Append-only / WORM (Write Once Read Many): S3 Object Lock, immutable index
Off-host: ship logs off the server before an attacker takes over the machine
CloudTrail Lake / log signing: hash chaining makes any modification detectable
Centralized + RBAC: run the log platform in an account fully separated from production
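
To make the hash-chain idea concrete, here is a minimal sketch in which each entry's hash covers its own content plus the previous entry's hash. The function names and the all-zero `GENESIS` value are assumptions for illustration; this is not how any particular product implements log signing.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def chain_entry(prev_hash: str, event: dict) -> dict:
    """Wrap event with a hash covering both its content and the previous hash."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    return {"event": event, "prev_hash": prev_hash, "hash": digest}

def append(log: list, event: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append(chain_entry(prev, event))

def verify(log: list) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain deleted entry fails."""
    prev = GENESIS
    for entry in log:
        expected = chain_entry(prev, entry["event"])["hash"]
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"event_id": "auth.login.fail", "ip": "203.0.113.42"})
append(log, {"event_id": "auth.login.success", "ip": "203.0.113.42"})
intact = verify(log)

log[0]["event"]["event_id"] = "auth.login.success"  # simulate tampering
tampered_ok = verify(log)
```

A chain like this does not detect truncation of the newest entries on its own; storing the latest hash off-host (or signing it) closes that gap.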

4. SIEM — Security Information & Event Management

SIEM = log aggregation + correlation + alerting + investigation UI for security.

Core capabilities

Tools

SIEM	Characteristics
Splunk Enterprise Security	Industry leader, expensive
Microsoft Sentinel	Cloud-native, Azure-integrated, pay-per-GB
Elastic Security	Free tier, OSS base
Datadog Cloud SIEM	SaaS, tied to the observability stack
AWS Security Lake + Athena	Cheap storage + ad-hoc query, no real-time alerting
Wazuh	OSS HIDS + SIEM-lite

Example correlation rules

-- Splunk SPL: brute force attempt
index=auth event_id="auth.login.fail"
| bin _time span=5m
| stats count by actor.ip, _time
| where count >= 10
| eval risk=case(count >= 50, "high", count >= 20, "medium", true(), "low")

# Elastic threshold rule: the KQL query below, with the threshold set in the rule configuration
event.category:authentication AND event.outcome:failure
# Threshold: group by source.ip, count >= 10, evaluated over a 5-minute window
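
The same brute-force logic can be sketched in plain Python to show what a correlation rule computes: count failed logins per IP per 5-minute bucket, then flag buckets at or above the threshold. The flat event shape (`event_id`, `ip`, `timestamp`) is a simplification assumed for this example.

```python
from collections import Counter
from datetime import datetime

WINDOW_SECONDS = 300  # 5-minute buckets
THRESHOLD = 10

def risk_level(count: int) -> str:
    if count >= 50:
        return "high"
    if count >= 20:
        return "medium"
    return "low"

def brute_force_alerts(events: list) -> list:
    """Count failed logins per (ip, 5-minute bucket) and flag buckets at or above THRESHOLD."""
    buckets = Counter()
    for e in events:
        if e["event_id"] != "auth.login.fail":
            continue
        # A trailing "Z" is rewritten so older Python versions can parse it
        ts = datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00")).timestamp()
        bucket_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[(e["ip"], bucket_start)] += 1
    return [
        {"ip": ip, "bucket_start": start, "count": n, "risk": risk_level(n)}
        for (ip, start), n in sorted(buckets.items())
        if n >= THRESHOLD
    ]

events = [
    {"event_id": "auth.login.fail", "ip": "203.0.113.42",
     "timestamp": f"2025-06-05T10:20:{s:02d}Z"}
    for s in range(12)
] + [
    {"event_id": "auth.login.fail", "ip": "198.51.100.7", "timestamp": "2025-06-05T10:21:00Z"},
    {"event_id": "auth.login.success", "ip": "203.0.113.42", "timestamp": "2025-06-05T10:22:00Z"},
]
alerts = brute_force_alerts(events)
```

Fixed buckets are simpler than a true sliding window but can miss a burst that straddles a bucket boundary; a production rule would usually account for that.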

Alert fatigue

The biggest SIEM risk: 100 alerts a day → analyst burnout → alerts get ignored → a real attack is missed.

Anti-pattern: enabling "kitchen sink" rules with arbitrary thresholds
            → 100 alerts/day, 95% benign

Best practice:
  1. Tune thresholds against a baseline (5σ above normal)
  2. Suppress known-benign sources (e.g. vulnerability scanner IPs)
  3. Severity-based routing (critical → PagerDuty, low → ticket queue)
  4. Auto-close resolved alerts that do not re-trigger within 24h
  5. Attach a runbook to every alert: who investigates, and what to do
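
Tuning a threshold against a baseline (item 1) can be sketched as mean plus k standard deviations of historical counts. The sample history and the function names below are assumptions for illustration; real baselines usually also account for time-of-day and weekly patterns.

```python
import statistics

def baseline_threshold(history: list, k: float = 5.0) -> float:
    """Alert threshold = mean + k * standard deviation of historical counts."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

def is_anomalous(current: float, history: list, k: float = 5.0) -> bool:
    return current > baseline_threshold(history, k)

# Hypothetical failed-login counts per 5-minute window during a quiet period
history = [4, 6, 5, 7, 3, 5, 6, 4]
threshold = baseline_threshold(history)
```

With this history the mean is 5 and the threshold lands a little above 11, so a window with 12 failures would alert while one with 9 would not.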

5. Incident Response Lifecycle (NIST SP 800-61)

Severity matrix (example)

Severity	Definition	Response SLA
SEV-1 Critical	Production down, data breach, ransomware active	< 15 minutes, all-hands
SEV-2 High	Partial outage, suspicious activity confirmed	< 1 hour
SEV-3 Medium	Single component degraded, possible threat	< 4 hours
SEV-4 Low	Cosmetic, FYI	Next business day

Incident roles

Role	Responsibility
Incident Commander (IC)	Makes decisions and coordinates; does not fix things personally
Tech Lead	Hands-on debug, propose fix
Communications Lead	Update status page, notify stakeholders and customers
Scribe	Records the timeline, decisions, and commands that were run
Subject Matter Experts	On-demand for the affected components

6. Runbook, on-call, tabletop exercise

Runbook

A document for each incident type. It is not a design doc and not a full architecture description, just "when alert X fires, do 1, 2, 3".

# Runbook: Suspicious S3 download spike

## Symptom
CloudTrail GetObject count > 5σ above baseline within 5 minutes from a single IAM user/role.

## Severity
SEV-2 by default. Bump to SEV-1 if the bucket contains PII (data-classification: confidential).

## Triage (first 5 minutes)
1. Identify actor: `aws cloudtrail lookup-events --lookup-attributes ...`
2. Check actor history: what does it normally do? Is this its first access to this bucket?
3. Check source IP: GeoIP, is it inside the VPN range?

## Containment
- If IAM user: disable the access key

aws iam update-access-key --access-key-id AKIA... --status Inactive --user-name <user>

- If IAM role: detach the permission policy or revoke sessions (the command below adds a deny-all inline policy)

aws iam put-role-policy --role-name <role> --policy-name DenyAll --policy-document file://deny-all.json

- Block the IP via WAF/SG once it is confirmed malicious

## Eradication
- Rotate compromised credential
- Audit other resources the actor accessed in the past 7 days

## Recovery
- Re-enable access after rotating and verifying the cause
- Monitor for similar behavior for 24h

## Escalation
- Bucket contains PII → notify the DPO within 12h (to prepare the GDPR 72h notification)
- Bucket contains PCI data → notify the card brands through the PCI process

On-call rotation

Primary on-call: receives the first page, responds in < 15 minutes (SEV-1)
Secondary: backup if the primary does not acknowledge within N minutes
Manager on-call: escalation point for business decisions (PR, customer comms)

Tools: PagerDuty, Opsgenie, Splunk On-Call
Pattern: weekly rotation, follow-the-sun for global teams
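
A weekly rotation with a primary-to-secondary handoff can be sketched in a few lines. The roster names, the start date, and the 10-minute acknowledgement timeout (the "N minutes" above) are all assumptions for this example; paging tools implement this for you.

```python
from datetime import date

def weekly_on_call(roster: list, rotation_start: date, today: date) -> str:
    """Return who is primary this week, rotating through roster every 7 days."""
    weeks_elapsed = (today - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]

def who_to_page(minutes_unacknowledged: int, ack_timeout_minutes: int = 10) -> str:
    """Page the secondary once the primary has not acknowledged within the timeout."""
    if minutes_unacknowledged < ack_timeout_minutes:
        return "primary"
    return "secondary"

roster = ["an", "binh", "chi"]   # hypothetical engineers
start = date(2025, 6, 2)         # a Monday, chosen as the rotation start

this_week = weekly_on_call(roster, start, date(2025, 6, 5))
next_week = weekly_on_call(roster, start, date(2025, 6, 9))
```

Anchoring the rotation to a fixed start date keeps the schedule deterministic, which makes it easy to answer "who was on call" for any past incident.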

Tabletop exercise

A paper exercise (nothing is actually run in production). Once a quarter. Example scenario:

Facilitator: "At 3 AM on Sunday, CloudTrail shows IAM user ci-deploy creating 50 EC2 instances in eu-central-1 (a region you do not use). You get paged. What do you do?"
The team discusses step by step: who does what, which commands to run, whom to escalate to
The facilitator injects new developments ("the EC2 instances are running a crypto miner", "the AWS bill is already up $5000")
After the tabletop: record the gaps (e.g. nobody knows who can revoke IAM access, there is no runbook for EC2 mass creation) → create action items

7. Forensics 101

In a serious incident (data breach, ransomware), evidence must be preserved in order to:

Analyze the root cause
Present it in court if there are legal implications
Report to regulators

Chain of custody principles

Every piece of evidence (disk image, log file, memory dump) must record:
  - Who collected: name, ID, timestamp
  - From where: hostname, IP, path
  - How: command used (e.g. dd, fmem)
  - Hash integrity: SHA-256 taken at collection time
  - Storage: who has accessed it from collection until now
  - Each transfer: signed log
# Collect EC2 disk image (snapshot EBS)
aws ec2 create-snapshot --volume-id vol-abc123 \
  --description "IR-2025-06-05 host-compromise"

# Record the snapshot metadata and its hash (this covers the API output, not the disk contents)
aws ec2 describe-snapshots --snapshot-ids snap-xyz | sha256sum > evidence.sha256

# Memory dump (Linux)
sudo ./avml memory.dump      # Microsoft AVML tool
sha256sum memory.dump >> evidence.sha256
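
The chain-of-custody fields can also be captured in a small record type. The sketch below is illustrative only: `EvidenceRecord`, its field names, and the sample values are assumptions, and an in-memory byte string stands in for a real dump file.

```python
import hashlib
import io
from dataclasses import dataclass, field
from datetime import datetime, timezone

def sha256_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()

@dataclass
class EvidenceRecord:
    evidence_id: str
    collected_by: str
    source: str      # hostname / IP / path
    method: str      # command used to collect
    sha256: str      # hash taken at collection time
    collected_at: str = field(default_factory=utc_now)
    access_log: list = field(default_factory=list)

    def record_access(self, who: str, reason: str) -> None:
        """Append an access entry; existing entries are left untouched."""
        self.access_log.append({"who": who, "reason": reason, "at": utc_now()})

    def verify(self, stream) -> bool:
        """Re-hash the evidence and compare with the hash taken at collection."""
        return sha256_stream(stream) == self.sha256

image = b"fake memory dump bytes"   # stand-in for the contents of memory.dump
record = EvidenceRecord(
    evidence_id="IR-2025-06-05-001",
    collected_by="analyst-1",
    source="host-compromise:/tmp/memory.dump",
    method="avml memory.dump",
    sha256=sha256_stream(io.BytesIO(image)),
)
record.record_access("analyst-2", "timeline analysis")
```

For a real file, the stream would come from `open(path, "rb")`; re-running `verify` before each analysis step shows the copy has not changed since collection.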

Timeline reconstruction

Tools: Plaso/log2timeline (super timeline), the Splunk SPL transaction command, Datadog timeline view.
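
At its core, timeline reconstruction means normalizing timestamps from different sources and sorting the events into one sequence. A minimal sketch, assuming each source already yields dictionaries with an ISO 8601 `timestamp` (the sample events are made up):

```python
from datetime import datetime

def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def build_timeline(*sources: list) -> list:
    """Merge events from several log sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: parse_ts(e["timestamp"]))

cloudtrail = [
    {"timestamp": "2025-06-05T10:25:10Z", "source": "cloudtrail", "what": "CreateAccessKey"},
    {"timestamp": "2025-06-05T10:31:00Z", "source": "cloudtrail", "what": "GetObject spike"},
]
app_logs = [
    {"timestamp": "2025-06-05T10:23:45Z", "source": "app", "what": "auth.login.success"},
    {"timestamp": "2025-06-05T12:26:00+02:00", "source": "app", "what": "role change"},
]
timeline = build_timeline(cloudtrail, app_logs)
order = [e["what"] for e in timeline]
```

The second app event is logged in a +02:00 offset; parsing to timezone-aware values puts it at 10:26 UTC, between the two CloudTrail events. This is why logging in UTC from the start saves effort during an investigation.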

Live vs dead forensics

Live (host running): capture memory, network connections, running processes. Lost on reboot. Risk: your own actions leave traces.
Dead (offline): image the disk, analyze offline. In-memory evidence is lost (decrypted secrets, network state).

IR best practice: snapshot first (preserve state) → isolate (revoke creds, block via SG/NSG) → investigate on a copy of the snapshot, not on prod.

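8. Breach notification: GDPR 72h and similar regulations

GDPR Article 33: when a personal data breach occurs, the controller must notify the Data Protection Authority (DPA) within 72 hours of becoming aware of it (not of when it happened), unless the breach is unlikely to result in a risk to the rights and freedoms of individuals.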

T+0    : Detection (e.g. alert or tip-off)
T+24h  : Initial assessment: scope, data type, count
T+48h  : Containment + impact analysis
T+72h  : Notify DPA + (if high risk) notify data subjects

         If information is still incomplete → send a "preliminary" notification
         with what is known and commit to supplementing it later.
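
Since the clock starts at awareness rather than at the time of the breach, it helps to compute the deadline explicitly during an incident. A small sketch with made-up dates; the function names are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone

NOTIFY_WINDOW = timedelta(hours=72)

def notification_deadline(aware_at: datetime) -> datetime:
    """The 72-hour clock starts when the controller becomes aware of the breach."""
    return aware_at + NOTIFY_WINDOW

def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600

# The breach itself happened days earlier, which does not move the deadline
breach_occurred = datetime(2025, 6, 1, 2, 0, tzinfo=timezone.utc)
aware_at = datetime(2025, 6, 5, 10, 23, tzinfo=timezone.utc)

deadline = notification_deadline(aware_at)
left = hours_remaining(aware_at, datetime(2025, 6, 7, 10, 23, tzinfo=timezone.utc))
```

Recording `aware_at` in the incident timeline as soon as the breach is confirmed gives the scribe and the DPO a single agreed reference point.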

The notification must include:

The nature of the breach, the categories of data, and the estimated number of people affected
Name and contact details of the DPO
Likely consequences
Measures taken or planned

Equivalents:

CCPA / California: similar in spirit to GDPR, but for California residents
HIPAA (US): notify HHS within 60 days of discovery (for breaches affecting 500 or more individuals)
PCI-DSS: notify the card brands "immediately", as defined by each brand
Vietnam Decree 13/2023: notify the Ministry of Public Security within 72h, similar to GDPR

Implication for developers: logs and audit trails must be ready to answer "which data was accessed, by whom, when, and how many records" in under 72 hours. This is why you log everything security-relevant and keep it for at least 1 year.

9. Review questions

Why is structured logging (JSON) better than plain text for security purposes?

SIEM and log platforms query specific fields: event.category:auth AND user.id:u_123 AND source.ip:203.0.113.42. Plain text forces you into regex grep, which is slow, unreliable (e.g. the email Alice@Example.com vs ALICE@example.com), and cannot aggregate (e.g. "count distinct IPs"). Structured logs also let you correlate across services via trace_id and request_id.

Standard schemas: ECS (Elastic Common Schema), OCSF (Open Cybersecurity Schema Framework), CIM (Splunk).

Why is PII in logs dangerous, and how do you redact it properly?

Logs are usually indexed in several places (CloudWatch, SIEM, archive) and many people access them for debugging. PII in logs means PII in many systems, so a single compromised log platform can leak millions of PII records. GDPR also requires that PII be kept only where it is needed.

Redaction: (1) allow-list the fields that may be logged (the safest option: a newly added field is dropped by default, so it cannot leak by accident), (2) deny-list SENSITIVE fields, (3) hash emails/identifiers to keep correlation without exposing raw values, (4) a log filter layer (Vector, Logstash) that strips fields before indexing, as a backup for when the app logs something by mistake.

In the NIST IR lifecycle, Containment comes before Eradication. What is the difference?

- Containment: stop the attack from spreading. Quick actions such as disabling an IAM key, isolating a host from the network, blocking an IP. No root cause yet; this only slows the bleeding.
- Eradication: remove the root cause once the attack is understood. Remove malware, patch the vulnerability, rotate the affected creds, close lateral movement paths.

Why they are separate: in a crisis the first goal is to stop the bleeding as fast as possible, while root cause analysis takes time. If you try to eradicate before you contain, the attacker can exfiltrate more data while you are still investigating.

How does a tabletop exercise differ from a live simulation, and why is it still needed?

Tabletop: discussion on paper/whiteboard, no real commands are run. Fast (2-3 hours), no risk of an outage, focused on decision-making and communication.

Live drill / chaos engineering: actually kill instances, simulate a breach. Takes more time and is risky if run in the wrong environment.

Tabletops are still needed because they: (1) test runbook coverage (is any scenario missing?), (2) expose communication gaps (who calls whom, the escalation path), (3) are low cost, so they can run every quarter. Live drills complement tabletops rather than replace them, and typically run 1-2 times a year for critical scenarios.

Why does "snapshot before you investigate" matter in forensics?

Investigative actions (logging in to the server, running ps, restarting a process) change the state of the evidence: file access timestamps change, memory is overwritten, logs gain new lines. If the attacker has set a trap (e.g. a shell script that deletes itself when it detects an admin login), you lose evidence.

Best practice: (1) snapshot the disk (EBS snapshot, VM snapshot) for an immutable copy, (2) dump memory before rebooting/isolating, (3) export logs off-host before the attacker can edit them, (4) investigate on the copy and leave prod untouched for the forensic team. If there are legal implications, the chain of custody must document every access from the moment of the snapshot.

Hands-on exercises

# 1. Set up structured logging in a Node app
npm install pino pino-pretty

cat > app.js <<'EOF'
const pino = require("pino");
const logger = pino({
  level: "info",
  redact: {
    paths: ["password", "*.password", "req.headers.authorization", "*.ssn"],
    censor: "[REDACTED]",
  },
});

logger.info({ user: { id: "u_1", email: "a@x.com", ssn: "123-45-6789" } }, "user.signup");
EOF
node app.js | jq

# 2. CloudWatch Logs Insights query
aws logs start-query \
  --log-group-name /aws/lambda/api \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, actor.ip, action | filter event_id="auth.login.fail" | stats count() by actor.ip | sort count() desc | limit 20'

# 3. Set 365-day retention for the log group
aws logs put-retention-policy \
  --log-group-name /aws/lambda/api \
  --retention-in-days 365

# 4. Create a sample runbook for a leaked AWS key (e.g. one flagged by gitleaks)
mkdir -p runbooks
cat > runbooks/aws-key-leak.md <<'EOF'
# Runbook: AWS Key Detected in Public Repo
## Triage (5 min)
1. Identify key: AKIA prefix → console
2. aws iam list-access-keys --user-name <user>
## Contain
1. aws iam update-access-key --status Inactive ...
2. Notify on-call + key owner
## Eradicate
1. Rotate key, delete old
2. Audit CloudTrail with old key for 24h
EOF

# 5. Tabletop scenario (read and discuss with your team)
cat <<'EOF'
SCENARIO: Monday 9 AM, customer support receives a ticket:
"I got a password reset email I did not request". 30 minutes later
there are 50 similar tickets. CloudTrail shows IAM user "cs-bot"
calling sts:AssumeRole into the prod role 200 times.

Q1: Severity?
Q2: Who gets paged first? Who is IC?
Q3: What are the actions in the first 5 minutes?
Q4: When do you notify customers? Legal? The DPO?
Q5: When does the GDPR clock start?
EOF
