healthz is a lightweight health-check sidecar that runs as a Native Sidecar (initContainer + restartPolicy: Always) inside a Kubernetes Pod. It observes the state of the main container and relays kubelet probes.
It addresses the limitations of plain kubelet probes:
| Plain kubelet probe | healthz sidecar |
|---|---|
| Can only check a single endpoint | Combined TCP + HTTP + Process checks |
| Result is binary (pass/fail) | Detailed diagnostics (/detail) |
| Probe logic must live in the main container | Main container unmodified; the sidecar observes it from outside |
| Cannot verify the number of processes | Worker count range validation (expected_min_count/expected_max_count) |

Process, network, system, and runtime information can be checked at a glance through /detail.
+--- Pod ------------------------------------------------------+
| shareProcessNamespace: true |
| terminationGracePeriodSeconds: 60 |
| |
| +-- initContainer (native sidecar) --+ +-- container ----+ |
| | healthz (UID 1001) | | app (UID vary) | |
| | CAP_SYS_PTRACE | | | |
| | readOnlyRootFilesystem: true | | :APP_PORT | |
| | | | | |
| | :9001 - Self API | | | |
| | GET /health (liveness) | | | |
| | GET /health/detail (diagnostic) | | | |
| | | | | |
| | :9000 - Target API | | | |
| | GET /healthz (relay) | | | |
| | GET /healthz/detail (diagnostic)| | | |
| | | | | |
| | Background Tasks: | | | |
| | TCP -> 127.0.0.1:APP_PORT | | | |
| | HTTP -> GET localhost:APP_PORT/ | | | |
| | Proc -> /proc scan (PID match) | | | |
| +------------------------------------+ +------------------+ |
| |
| kubelet probes: |
| app liveness/readiness -> :9000/healthz (via healthz) |
| healthz liveness/readiness -> :9001/health (self) |
+---------------------------------------------------------------+
app/
|-- main.py # Entry point, startup sequence, signal handling
|-- factory.py # FastAPI app factory (self:9001, target:9000)
|-- state.py # SharedState singleton (cross-cutting state)
|-- process_identifier.py # /proc PID matching (exact/contains/regex)
|
|-- config/
| |-- schema.py # Frozen dataclass config model
| |-- loader.py # YAML load -> typo fix -> coercion -> validation
| |-- coercion.py # Type coercion pipeline (str->bool/int/enum/...)
| `-- errors.py # ConfigFatalError, ConfigWarning, ConfigTypoFix
|
|-- checks/
| |-- base.py # BaseCheck ABC
| |-- tcp.py # TCP/UDP socket check
| |-- http.py # HTTP GET/HEAD check (httpx)
| |-- process.py # Process existence + worker count check
| `-- scheduler.py # asyncio task scheduler (backoff, restart)
|
|-- collectors/
| |-- base.py # safe_collect() wrapper
| |-- system_info.py # CPU/memory/storage/cgroup limits
| |-- network_info.py # interfaces/ports/DNS/routes
| |-- process_info.py # per-PID details (RSS, threads, JVM flags, ...)
| |-- language_runtimes.py # python/java/node/... version detection
| |-- os_info.py # OS release/kernel/hostname/uptime
| |-- self_metrics.py # healthz uptime/version/probe stats/percentiles
| |-- pod_info.py # Downward API/namespace sharing/siblings
| `-- config_info.py # config source/SHA256 drift detection
|
|-- endpoints/
| |-- health.py # /health, /health/detail (self, port 9001)
| |-- healthz.py # /healthz, /healthz/detail (target, port 9000)
| `-- response.py # pretty_json() with X-Request-Id, timing headers
|
`-- utils/
    |-- logging_config.py # structlog + stdlib logging setup
    |-- masking.py # sensitive key masking ($REDACTED)
    |-- proc.py # /proc parsing helpers (CANNOT sentinel)
    `-- time_utils.py # UTC ISO 8601 helpers
1. Parse CLI args (--config or env-var chain)
2. Load YAML config -> typo fix -> coercion -> validate -> fatal on failure
3. Initialize structured logging (JSON/text, stdout+stderr+file rotation)
4. Print startup banner (effective config, defaulted/truncated/typo fields)
5. Pre-warm: one-off probe cycle for each check (avoid first-request latency)
6. Start background check tasks (tcp/http/process) as asyncio tasks
7. Start two uvicorn servers (self:9001, target:9000) via asyncio.gather
8. Register SIGTERM/SIGINT handlers
9. On shutdown: 15s drain -> cancel tasks -> close servers -> exit
Two FastAPI apps run on different ports inside a single Python process:

| Port | App | Endpoints | Purpose |
|---|---|---|---|
| 9001 | self | /health, /health/detail | Liveness of healthz itself (kubelet -> healthz) |
| 9000 | target | /healthz, /healthz/detail | Relay of the main container's state (kubelet -> app via healthz) |

Both apps share the SharedState singleton, so they have immediate access to the background check results.
Each check (TCP, HTTP, Process) runs as an independent asyncio task:
+-----------------------------------------------------+
| check.run_once() |
| | |
| +- PASS -> stats.consecutive_pass++ |
| | stats.consecutive_fail = 0 |
| | |
| +- FAIL -> stats.consecutive_fail++ |
| | stats.consecutive_pass = 0 |
| | recent_failures.append(...) |
| | if consecutive_fail == 1: |
| | log "probe.transition pass->fail"|
| | |
| +- UNAVAILABLE -> stats.total_unavailable++ |
| |
| await asyncio.sleep(interval_ms / 1000) |
| |
| Exception handling: |
| uncaught -> backoff 5s, max 10 restarts -> DEAD |
| CancelledError -> clean exit |
+-----------------------------------------------------+
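The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the actual scheduler.py: ProbeStats, record() and run_check() are hypothetical names, run_once is assumed to return a (status, message) pair, and the reading of "max 10 restarts -> DEAD" as "give up on the 11th uncaught error" is an assumption.

```python
import asyncio
from dataclasses import dataclass, field

RESTART_BACKOFF_S = 5.0  # "backoff 5s" in the diagram
MAX_RESTARTS = 10        # "max 10 restarts -> DEAD" in the diagram


@dataclass
class ProbeStats:
    """Hypothetical stats holder; field names follow the diagram."""
    consecutive_pass: int = 0
    consecutive_fail: int = 0
    total_unavailable: int = 0
    recent_failures: list = field(default_factory=list)
    dead: bool = False


def record(stats: ProbeStats, status: str, message: str = "") -> None:
    """Apply one probe result (PASS, FAIL or UNAVAILABLE) to the counters."""
    if status == "PASS":
        stats.consecutive_pass += 1
        stats.consecutive_fail = 0
    elif status == "FAIL":
        stats.consecutive_fail += 1
        stats.consecutive_pass = 0
        stats.recent_failures.append(message)
        if stats.consecutive_fail == 1:
            print("probe.transition pass->fail")
    else:  # UNAVAILABLE leaves the consecutive counters untouched
        stats.total_unavailable += 1


async def run_check(run_once, stats, interval_ms, backoff_s=RESTART_BACKOFF_S):
    """Run one check forever; restart after uncaught errors, then give up."""
    restarts = 0
    while True:
        try:
            while True:
                status, message = await run_once()
                record(stats, status, message)
                await asyncio.sleep(interval_ms / 1000)
        except asyncio.CancelledError:
            raise  # cancellation ends the task cleanly
        except Exception:
            restarts += 1
            if restarts > MAX_RESTARTS:
                stats.dead = True  # assumption: DEAD once the restart budget is spent
                return
            await asyncio.sleep(backoff_s)
```

Each enabled check would get its own task, for example asyncio.create_task(run_check(tcp_check, stats, 10000)), so a crash in one check does not stop the others.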
TCP check:
- asyncio.open_connection() (TCP) or socket.SOCK_DGRAM (UDP)

HTTP check:
- GET/HEAD requests via httpx.AsyncClient
- Validates expected_status (default [200]) and expected_body (substring match)
- [...] (verify_tls)
- [...] (catch_warnings scope)

Process check:
- /proc scan -> find_target_pids() -> PID matching
- Match modes: exact | contains | regex
- process_names[0]: matched against the exe basename
- process_names[1..]: cmdline substrings (AND condition)

Worker count validation (new feature):
pids = find_target_pids(...)
count = len(pids)
if expected_min_count > 0 and count < expected_min_count:
-> FAIL "worker count N below expected_min_count M"
if expected_max_count > 0 and count > expected_max_count:
-> FAIL "worker count N exceeds expected_max_count M"
else:
-> PASS "found N matching process(es)"
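The pseudocode above translates fairly directly into a small function. The sketch below is illustrative only: check_worker_count is a hypothetical name, the messages are copied from the pseudocode, and the existence check for zero matching processes is not modeled.

```python
def check_worker_count(
    count: int,
    expected_min_count: int = 0,
    expected_max_count: int = 0,
) -> tuple[str, str]:
    """Return (status, message) for a given number of matching PIDs.

    A bound of 0 disables that side of the range, as in the pseudocode.
    """
    if expected_min_count > 0 and count < expected_min_count:
        return "FAIL", f"worker count {count} below expected_min_count {expected_min_count}"
    if expected_max_count > 0 and count > expected_max_count:
        return "FAIL", f"worker count {count} exceeds expected_max_count {expected_max_count}"
    return "PASS", f"found {count} matching process(es)"
```

For example, check_worker_count(1, 2, 10) yields a FAIL with the "below expected_min_count" message, while check_worker_count(4, 2, 10) yields a PASS.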
- expected_min_count: "0" + expected_max_count: "0" -> previous behavior (existence check only)
- min > max -> ConfigFatalError (startup fails)

/healthz endpoint:
any check over failure_threshold -> DOWN (503)
any PASS (and no threshold breach) -> UP (200)
all UNAVAILABLE -> UNAVAILABLE (503)
failure_threshold counts consecutive failures. A single transient FAIL does not cause DOWN.
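The aggregation rules can be sketched as a pure function. This is a hedged illustration rather than the real endpoint code: aggregate() is a hypothetical name, the breach test uses consecutive_fail >= threshold (inferred from the sample response in which consecutive_fail 3 with threshold 3 is reported as over_threshold), and treating a below-threshold FAIL with no PASS as UP is an assumption. At least one enabled check is assumed.

```python
def aggregate(checks: dict[str, dict]) -> tuple[str, int]:
    """Map per-check results to an overall (status, http_code).

    Each value is expected to look like
    {"status": "PASS" | "FAIL" | "UNAVAILABLE", "consecutive_fail": int, "threshold": int}.
    """
    results = list(checks.values())
    # Any check at or over its threshold takes the target DOWN.
    if any(c["consecutive_fail"] >= c["threshold"] for c in results):
        return "DOWN", 503
    # Nothing could be measured at all.
    if results and all(c["status"] == "UNAVAILABLE" for c in results):
        return "UNAVAILABLE", 503
    # Assumption: transient FAILs below the threshold still report UP.
    return "UP", 200
```

With failure_threshold "3" and a 10-second interval, a check has to fail on three consecutive runs before the relay answers 503.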
Calling /health/detail or /healthz/detail runs 8 collectors in parallel:

| Collector | Collected data | Source |
|---|---|---|
| system_info | CPU usage, memory, storage, cgroup limits | /proc/stat, meminfo, statvfs, /sys/fs/cgroup |
| network_info | Interfaces, listening ports, DNS, routes | /proc/net/*, /etc/resolv.conf |
| process_info | Per-PID details (RSS, threads, OOM, JVM flags, mounts, environ) | /proc/<pid>/* |
| language_runtimes | Versions of python/java/node, etc. | subprocess (absolute-path whitelist) |
| os_info | OS release, kernel, hostname, uptime, timezone | /etc/os-release, uname, /proc/uptime |
| self_metrics | healthz's own uptime, version, probe stats, latency p50/p95/p99 | SharedState |
| pod_info | Pod/container metadata (Downward API), namespace sharing, siblings | env vars, /proc |
| config_info | Config source path, SHA256, drift detection, defaulted fields | SharedState |

Each collector is wrapped in safe_collect(). If one collector fails, it returns {"status": "FAILED", "error": "..."} and the other collectors are unaffected.
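A minimal sketch of this isolation pattern, assuming each collector is an async callable that returns a dict. Only the safe_collect name and the FAILED payload shape come from the text; the signature, collect_all(), and the two demo collectors are hypothetical.

```python
import asyncio


async def safe_collect(name, collector):
    """Run one collector; convert any exception into a FAILED payload."""
    try:
        return name, await collector()
    except Exception as exc:
        return name, {"status": "FAILED", "error": str(exc)}


async def collect_all(collectors):
    """Run all collectors concurrently and merge the results by name."""
    pairs = await asyncio.gather(
        *(safe_collect(name, fn) for name, fn in collectors.items())
    )
    return dict(pairs)


async def os_info():
    # Demo collector that succeeds.
    return {"hostname": "demo"}


async def pod_info():
    # Demo collector that fails.
    raise OSError("permission denied")


result = asyncio.run(collect_all({"os_info": os_info, "pod_info": pod_info}))
print(result)
```

Because the exception is caught inside safe_collect(), asyncio.gather() never sees a failure and the remaining collectors run to completion.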
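- On each /detail call, the config is compared against the SHA-256 of the file currently on disk
- config_info.drift_detected: true is reported when they differ
- Config changes are applied with kubectl rollout restart

SIGTERM received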
-> state.is_shutting_down = True
-> /health, /healthz -> 503 SHUTTING_DOWN
-> 15s drain (wait for kubelet endpoint removal)
-> preStop: sleep 5s (grace period for endpoint removal)
-> cancel background tasks (5s timeout)
-> close uvicorn servers
-> exit 0
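The shutdown sequence above could look roughly like this in asyncio. It is a sketch under stated assumptions: State, health_response() and shutdown() are hypothetical names, closing the uvicorn servers is omitted, and in a real process the coroutine would be triggered from a SIGTERM handler (for example one registered with loop.add_signal_handler).

```python
import asyncio
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical stand-in for the SharedState singleton."""
    is_shutting_down: bool = False


def health_response(state: State) -> tuple[int, str]:
    """Status code and body that /health and /healthz would return."""
    if state.is_shutting_down:
        return 503, "SHUTTING_DOWN"
    return 200, "UP"


async def shutdown(state, tasks, drain_s=15.0, cancel_timeout_s=5.0):
    """Flip the flag, drain, then cancel background tasks with a timeout."""
    state.is_shutting_down = True   # probes start failing immediately
    await asyncio.sleep(drain_s)    # give the endpoint time to be removed
    for task in tasks:
        task.cancel()
    if tasks:                       # asyncio.wait() rejects an empty collection
        await asyncio.wait(tasks, timeout=cancel_timeout_s)
```

Setting the flag before the drain matters: readiness starts failing right away, while requests that are already in flight still get the drain period to finish.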
| Requirement | Reason | Minimum version |
|---|---|---|
| shareProcessNamespace: true | Access to target PIDs via /proc scan | K8s 1.17+ |
| Native Sidecar (initContainer + restartPolicy: Always) | The sidecar starts first in the Pod lifecycle | K8s 1.29+ (GA 1.33) |
| CAP_SYS_PTRACE | Cross-UID access to /proc/<pid>/{environ,root,...} | All versions |
| Pod Security: privileged | SYS_PTRACE is not allowed by the baseline PSS | - |
| Rule | Description |
|---|---|
| R-1 | All YAML values are quoted strings (native bool/int -> default + WARN) |
| R-2 | Field with a default: invalid -> default + WARN / field without a default: invalid -> CRITICAL exit 1 |
| R-3 | Fractional part in an integer field: truncated toward zero. "80.9" -> 80 |
| R-4 | Array de-duplication: first-occurrence order is preserved |
| M-1 | timeout/*_ms: already in milliseconds. Unit suffixes (5s, 5000ms) are not supported |
| M-2 | Most parse failures -> default (only overflow such as float("inf") is CRITICAL) |
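A few of these rules can be illustrated with a small coercion helper. This is not the actual coercion.py: coerce_int() and dedupe() are hypothetical names, ValueError stands in for the CRITICAL path, only fields that have a default are modeled, and treating NaN like infinity is an assumption.

```python
import math


def coerce_int(raw, default):
    """Coerce a quoted YAML string to int. Returns (value, warning_or_None)."""
    if not isinstance(raw, str):
        # R-1: native bool/int values are rejected in favor of the default.
        return default, f"native {type(raw).__name__} value, expected a quoted string"
    try:
        number = float(raw)
    except ValueError:
        # M-1/M-2: "5s" or "abc" cannot be parsed, so fall back to the default.
        return default, f"cannot parse {raw!r}"
    if not math.isfinite(number):
        # M-2: overflow is fatal (NaN handled the same way, an assumption).
        raise ValueError(f"non-finite value {raw!r}")
    # R-3: truncate toward zero, so "80.9" -> 80 and "-3.7" -> -3.
    return math.trunc(number), None


def dedupe(items):
    """R-4: drop duplicates while keeping first-occurrence order."""
    return list(dict.fromkeys(items))
```

For example, coerce_int("80.9", 10) returns (80, None), while coerce_int("5s", 5000) falls back to 5000 together with a warning string.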
The config file path is resolved through a chain of environment variables:
HEALTHZ_HOME=${HEALTHZ_HOME:-/opt/healthz}
HEALTHZ_CONF_DIR=${HEALTHZ_CONF_DIR:-$HEALTHZ_HOME/conf}
HEALTHZ_CONF_FILE_NAME=${HEALTHZ_CONF_FILE_NAME:-healthz.yaml}
# Final path: $HEALTHZ_CONF_DIR/$HEALTHZ_CONF_FILE_NAME
# Default: /opt/healthz/conf/healthz.yaml
Precedence: --config CLI > env var chain > Dockerfile defaults
| Scenario | Setting | Resulting path |
|---|---|---|
| Default | (none) | /opt/healthz/conf/healthz.yaml |
| HOME only | HEALTHZ_HOME=/data/hz | /data/hz/conf/healthz.yaml |
| CONF_DIR set | HEALTHZ_CONF_DIR=/etc/myapp | /etc/myapp/healthz.yaml |
| File name changed | HEALTHZ_CONF_FILE_NAME=custom.yaml | /opt/healthz/conf/custom.yaml |
| CLI override | --config /tmp/test.yaml | /tmp/test.yaml |
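The same resolution order can be expressed in Python. The sketch below mirrors the shell defaults and the scenarios in the table. resolve_config_path() is a hypothetical name and not necessarily how main.py implements it.

```python
import os
from typing import Mapping, Optional


def resolve_config_path(
    cli_config: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve the config path: --config first, then the env var chain."""
    if cli_config:
        return cli_config
    # Like ${VAR:-default}: unset and empty values both fall back.
    home = env.get("HEALTHZ_HOME") or "/opt/healthz"
    conf_dir = env.get("HEALTHZ_CONF_DIR") or f"{home}/conf"
    file_name = env.get("HEALTHZ_CONF_FILE_NAME") or "healthz.yaml"
    return f"{conf_dir}/{file_name}"


print(resolve_config_path(env={"HEALTHZ_HOME": "/data/hz"}))
```

Passing env explicitly keeps the function easy to test. In normal use it reads os.environ.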
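Because healthz (UID 1001) and the target (UID 0, etc.) run under different UIDs, access to some /proc information is restricted:

| Item | Restriction | Impact |
|---|---|---|
| /proc/<pid>/environ | PermissionError (cross-UID) | Environment variables cannot be collected; no effect on check verdicts |
| /proc/<pid>/root | PermissionError | language_runtimes detection fails; affects detail output only |
| /proc/<pid>/io | PermissionError | I/O statistics cannot be collected; informational only |

Workaround: run the target under the same UID as healthz, or run healthz as root (0), which is a security trade-off.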
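One way to keep these restrictions from breaking the detail output is to map unreadable /proc files to a sentinel value. The sketch below only illustrates that idea: the "$CANNOT" string, read_proc_file() and read_environ() are hypothetical and not taken from utils/proc.py.

```python
from pathlib import Path

CANNOT = "$CANNOT"  # hypothetical sentinel; the real value may differ


def read_proc_file(path):
    """Return the file's bytes, or CANNOT if it is unreadable or gone."""
    try:
        return Path(path).read_bytes()
    except (PermissionError, FileNotFoundError, ProcessLookupError):
        return CANNOT


def read_environ(pid, proc_root="/proc"):
    """Parse /proc/<pid>/environ into a dict, or return CANNOT."""
    raw = read_proc_file(Path(proc_root) / str(pid) / "environ")
    if raw is CANNOT:
        return CANNOT
    env = {}
    for entry in raw.split(b"\0"):  # entries are NUL-separated KEY=VALUE pairs
        if not entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env
```

A collector built this way can report the sentinel for the restricted field and still return everything else it was able to read.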
| Item | Default | Notes |
|---|---|---|
| CPU request/limit | 50m / 300m | ~5m when idle; bursts on /detail calls |
| Memory request/limit | 96Mi / 192Mi | Measured ~71MB RSS (55% of 128Mi) |
| Ephemeral storage | 64Mi / 256Mi | Includes log rotation |
| Log rotation | 100MB max, 7 backups | log_max_size / log_backup_count |
securityContext:
  runAsNonRoot: true
  runAsUser: 1001
  runAsGroup: 0 # OpenShift convention
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
    add: ["SYS_PTRACE"]
  seccompProfile:
    type: RuntimeDefault
automountServiceAccountToken: false (the K8s API is not needed)
apiVersion: v1
kind: ConfigMap
metadata:
name: app-healthz-config
data:
healthz.yaml: |
healthz:
server:
bind:
listen_address: "0.0.0.0"
self_port: "9001"
target_port: "9000"
log:
console_log_level: "INFO"
log_format: "json"
targets:
target_container: "myapp"
tcp:
enabled: "true"
host: "127.0.0.1"
ports: ["8080"]
timeout_ms: "5000"
interval_ms: "10000"
failure_threshold: "3"
http:
enabled: "true"
host: "localhost"
port: "8080"
scheme: "http"
path: "/"
method: "GET"
expected_status: ["200"]
expected_body: "ok"
timeout_ms: "5000"
interval_ms: "10000"
failure_threshold: "3"
process:
enabled: "true"
match_mode: "contains"
process_names: ["python"]
expected_min_count: "2"
expected_max_count: "10"
failure_threshold: "3"
A configuration that expects at least two uvicorn workers:
process:
  enabled: "true"
  match_mode: "contains"
  process_names: ["python"]
  expected_min_count: "2" # at least 2 workers
  expected_max_count: "10" # at most 10 workers
  failure_threshold: "3" # DOWN after 3 consecutive failures
Example response (healthy):
{
"status": "UP",
"checks": {
"process": {
"status": "PASS",
"message": "found 4 matching process(es)",
"consecutive_fail": 0,
"threshold": 3
}
}
}
Example response (too few workers):
{
"status": "DOWN",
"checks": {
"process": {
"status": "FAIL",
"message": "cannot get informations: worker count 1 below expected_min_count 2",
"consecutive_fail": 3,
"threshold": 3,
"over_threshold": true
}
}
}
spec:
  shareProcessNamespace: true
  terminationGracePeriodSeconds: 60
  initContainers:
    - name: healthz
      image: harbor.nova.office/library/healthz:latest
      restartPolicy: Always
      ports:
        - containerPort: 9001
          name: healthz-self
        - containerPort: 9000
          name: healthz-target
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CONTAINER_NAME
          value: "healthz"
        - name: HEALTHZ_HOME
          value: "/opt/healthz"
        - name: HEALTHZ_CONF_DIR
          value: "/opt/healthz/conf"
        - name: HEALTHZ_CONF_FILE_NAME
          value: "healthz.yaml"
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        capabilities:
          add: ["SYS_PTRACE"]
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: healthz-config
          mountPath: /opt/healthz/conf
          readOnly: true
        - name: healthz-log
          mountPath: /opt/healthz/log
      livenessProbe:
        httpGet:
          path: /health
          port: 9001
        initialDelaySeconds: 5
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health
          port: 9001
        initialDelaySeconds: 2
        periodSeconds: 5
  containers:
    - name: myapp
      image: myregistry/myapp:latest
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 9000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 12
      livenessProbe:
        httpGet:
          path: /healthz
          port: 9000
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /healthz
          port: 9000
        periodSeconds: 10
  volumes:
    - name: healthz-config
      configMap:
        name: app-healthz-config
        items:
          - key: healthz.yaml
            path: healthz.yaml
    - name: healthz-log
      emptyDir:
        sizeLimit: 1Gi
When the config lives at a different path, as with NiFi:
env:
  - name: HEALTHZ_CONF_DIR
    value: "/etc/nifi-healthz"
  - name: HEALTHZ_CONF_FILE_NAME
    value: "nifi-healthz.yaml"
volumeMounts:
  - name: healthz-config
    mountPath: /etc/nifi-healthz
    readOnly: true
helm install myapp-healthz manifests/helm/ \
--namespace myapp --create-namespace \
--set image.repository=harbor.nova.office/library/healthz \
--set image.tag=1.0.0 \
--set healthz.confDir="/opt/healthz/conf" \
--set healthz.confFileName="healthz.yaml"
{
"primary_focus": "target",
"status": "UP",
"timestamp": "2026-05-15T09:30:00.123Z",
"target_container": "pseudo",
"target": {
"checks": {
"tcp": { "status": "PASS", "latency_ms": 0, "consecutive_fail": 0, "threshold": 3 },
"http": { "status": "PASS", "latency_ms": 95, "consecutive_fail": 0, "threshold": 3 },
"process": { "status": "PASS", "message": "found 4 matching process(es)", "consecutive_fail": 0, "threshold": 3 }
},
"process_info": { "...per-PID details..." },
"language_runtimes": { "python": "3.13.1", "java": "$LANGUAGE_NOT_INSTALLED" }
},
"system": {
"system_info": {
"cpu": { "cores": 8, "usage_percent": 12.5, "load_avg": [0.5, 0.8, 0.6] },
"memory": { "total_bytes": 16777216000, "available_bytes": 12000000000 },
"cgroup_limits": { "memory_limit_bytes": 134217728, "cpu_limit": { "quota_us": 20000, "period_us": 100000 } }
},
"network_info": {
"interfaces": [ { "name": "eth0", "ipv4": "10.244.26.174", "rx_bytes": 123456 } ],
"listening_ports": [ { "port": 8080, "proto": "tcp" }, { "port": 9000, "proto": "tcp" }, { "port": 9001, "proto": "tcp" } ],
"dns": { "nameservers": ["10.96.0.10"], "search": ["myns.svc.cluster.local"] }
},
"os_info": { "distribution": "Debian GNU/Linux 12 (bookworm)", "kernel": "5.14.0-611.36.1.el9_7.x86_64" }
},
"healthz_self_summary": {
"status": "UP", "pid": 7, "uptime_seconds": 3600, "startup_complete": true, "prewarm_complete": true
},
"self_metrics": {
"probe_stats": {
"tcp": { "total_runs": 360, "total_pass": 360, "latency_p50_ms": 0, "latency_p99_ms": 1 },
"http": { "total_runs": 360, "total_pass": 358, "latency_p50_ms": 85, "latency_p99_ms": 250 }
},
"recent_failures": []
},
"config_info": {
"source_path": "/opt/healthz/conf/healthz.yaml",
"source_sha256": "a1b2c3d4...",
"drift_detected": false
}
}
# Check Pod status
kubectl -n myns get pods -l app=myapp
# healthz self check
kubectl -n myns exec deploy/myapp -c myapp -- \
curl -s http://localhost:9001/health | jq
# target check (relay)
kubectl -n myns exec deploy/myapp -c myapp -- \
curl -s http://localhost:9000/healthz | jq
# Detailed diagnostics
kubectl -n myns exec deploy/myapp -c myapp -- \
curl -s http://localhost:9000/healthz/detail | jq
# Check for config drift
kubectl -n myns exec deploy/myapp -c myapp -- \
curl -s http://localhost:9000/healthz/detail | jq '.config_info.drift_detected'
# Apply config changes
kubectl -n myns rollout restart statefulset/myapp
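The drift_detected flag queried above comes down to comparing two SHA-256 digests. The following sketch shows that comparison under assumptions: sha256_of() and config_info() are hypothetical names, and source_sha256 is assumed to hold the digest recorded when the config was loaded.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex SHA-256 digest of the file's current contents on disk."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def config_info(source_path, loaded_sha256: str) -> dict:
    """Compare the digest taken at load time with the file on disk now."""
    return {
        "source_path": str(source_path),
        "source_sha256": loaded_sha256,
        "drift_detected": sha256_of(source_path) != loaded_sha256,
    }
```

A mounted ConfigMap can change on disk while the process keeps running with the old values, which is why drift is only reported and a rollout restart is needed to apply the change.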
| Path | Description |
|---|---|
| Dockerfile | Multi-stage build (python:3.13-slim), non-root (UID 1001) |
| requirements.txt | fastapi, uvicorn, PyYAML, httpx, structlog, orjson |
| healthz/app/ | Python source (see the structure above) |
| manifests/samples/ | Raw manifests (Namespace, RBAC, ConfigMap, StatefulSet, Service, NetworkPolicy, PDB) |
| manifests/helm/ | Helm chart (NetworkPolicy not included) |
| scripts/validate.py | Manifest validator (Phase A-D) |