Observability — Prometheus + Grafana (Strang I)¶
Minimal in-cluster observability stack for the dashi PoC. Intentionally operator-free (no CRDs, no Prometheus Operator) so it slots into a small k3d cluster without 500 MB of overhead. Backs Phase-2 criteria "Monitoring Dashboard aktiv" and "Alert-Regeln definiert".
Components¶
| File | Purpose |
|---|---|
namespace.yaml |
dashi-monitoring namespace |
rbac.yaml |
Prometheus ServiceAccount + ClusterRole (node, pod, service, endpoint scrape) |
kube-state-metrics.yaml |
kube-state-metrics Deployment — emits K8s object metrics as Prometheus timeseries |
prometheus.yaml |
Prometheus Deployment + Service + ConfigMap (scrape config + alert rules) — retains 7 days in emptyDir |
grafana.yaml |
Grafana Deployment + Service + Secret + provisioned datasource + provisioned dashi · Platform Overview dashboard |
kustomization.yaml |
kubectl apply -k . |
Scrape jobs¶
- prometheus self-scrape
- kubernetes-apiservers — API-server metrics
- kubernetes-nodes — kubelet metrics via API proxy
- kubernetes-cadvisor — per-pod CPU/mem/network metrics
- kubernetes-pods — any pod annotated
prometheus.io/scrape: "true"withprometheus.io/port: "<port>"auto-discovered - kube-state-metrics — K8s object state
To add a custom service into Prometheus, annotate the pod:
yaml
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics" # optional; default /metrics
Alert rules (defined, delivery via Alertmanager deferred to Phase 3)¶
| Alert | Condition | Severity |
|---|---|---|
PodCrashLoop |
> 3 restarts in 5 min on any pod, 10 min persisting | warning |
DashiPodDown |
Pod in Failed / Unknown phase in a dashi-* namespace for 5 min |
critical |
PVCFull |
PVC < 20 % free for 10 min | warning |
DashiIngestFlowFailure |
Prefect flow-run entered FAILED in the last hour |
warning |
Rules render on Prometheus → Alerts tab but do not dispatch anywhere. Alertmanager integration (email/Slack/Teams) is Phase 3.
Apply¶
bash
cd poc
make monitoring-up
Port-forward Grafana:
```bash kubectl -n dashi-monitoring port-forward svc/grafana 13000:3000 &
Anonymous Viewer on; admin login: see Secret grafana-admin for rotated password¶
```
Grafana comes pre-configured with a Prometheus data source and the dashi · Platform Overview dashboard (4 stat panels + 3 timeseries — pods Running/CrashLooping, PVC fullness, namespace count, restarts, CPU, memory). Extend by dropping more JSON into grafana-dashboards ConfigMap or by creating dashboards via the UI (won't persist without a Grafana DB PVC, which is a Phase-2-hardening follow-up).
Production hardening deferred¶
- Durable PVC for Prometheus + Grafana (currently both
emptyDir) - Alertmanager Deployment + receiver config (Slack/Email/PagerDuty)
- Remote write to long-term storage (Thanos / Mimir) — only needed when cluster retention exceeds 30 days
- Prometheus Operator CRDs (
ServiceMonitor,PrometheusRule) if we scale past ~20 services - Log aggregation with Loki (part of Strang I.5 — audit logs)
- Exporters per data service:
postgres_exportersidecars forpgstac-db+prefect-dbrustfs_exporteror Prometheus-formatted/metricson the RustFS pod- Custom
/metricsendpoint ondashi-ingest,duckdb-endpoint,titiler-endpoint