Observability¶
Naftiko Skipper provides built-in observability for every capability that
declares a type: control expose. No code changes are required — the operator
wires everything automatically.
Three backends are supported out of the box:
- Prometheus + Grafana: metrics via scrape (ServiceMonitor)
- Datadog: traces (MCP + REST) and metrics via OTLP push
- Control port: direct access to metrics and health endpoints
The Control Port¶
Add a type: control expose to your capability spec to activate observability:
capability:
  exposes:
    - type: rest
      address: "0.0.0.0"
      port: 3001
      namespace: my-api
    - type: control
      address: "0.0.0.0"
      port: 9090
      observability:
        enabled: true
        metrics:
          local:
            enabled: true
        traces:
          sampling: 1.0 # 1.0 = 100%, 0.1 = 10%
          propagation: w3c
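The `propagation: w3c` setting refers to W3C Trace Context, where trace identity travels between services in a `traceparent` HTTP header. As a rough illustration of that header's layout (not the engine's own code), the Python sketch below parses a version `00` header. The function and class names are made up for this example, and the sample IDs are borrowed from the OTLP test request later on this page.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context "traceparent" layout for version 00:
#   version-traceid-parentid-flags, lowercase hex, 2/32/16/2 characters.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    header = "00-5b8efff798038103d269b633813fc60c-eee19b7ec3c1b174-01"
    print(parse_traceparent(header))
```

A header whose flags byte ends in `01` marks a trace that an upstream service already chose to sample, which is how a sampling decision can stay consistent across a chain of calls.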
The control port exposes:
| Endpoint | Description |
|---|---|
| /metrics | Prometheus text format (RED metrics) |
| /health/live | Liveness probe |
| /health/ready | Readiness probe |
| /status | Runtime capability status |
| /traces | Recent trace ring buffer (local dev) |
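For a quick smoke test of the endpoints listed above, a small script can request each path and report the HTTP status. The sketch below assumes the control port has been made reachable on `localhost:9090` (for example through a port-forward); the helper names `control_urls` and `probe` are illustrative and not part of Skipper.

```python
import urllib.error
import urllib.request
from typing import Dict

# Endpoint paths as listed in the control port table.
CONTROL_PATHS = ("/metrics", "/health/live", "/health/ready", "/status", "/traces")


def control_urls(host: str = "localhost", port: int = 9090) -> Dict[str, str]:
    """Map each control-port path to a full URL."""
    return {path: f"http://{host}:{port}{path}" for path in CONTROL_PATHS}


def probe(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


if __name__ == "__main__":
    for path, url in control_urls().items():
        print(f"{path:<14} {probe(url)}")
```

A status of `0` means nothing answered at all, which usually points at the port-forward or the Service rather than at the capability itself.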
What the Operator Does Automatically¶
When a type: control expose is present, Skipper:
- Adds a named control port to the Service and Deployment container spec
- Writes Prometheus pod annotations on the pod template
- Creates a ServiceMonitor for Prometheus Operator (if CRDs are installed)
- Injects OTEL env vars from the operator's own configuration into every capability pod, so no per-capability configuration is needed
Note: If Prometheus Operator is installed after the Skipper operator starts, restart the Skipper operator so Fabric8 re-discovers the monitoring.coreos.com API group.
Available Metrics¶
The ikanos engine exports RED (rate, errors, duration) metrics via the control port and OTLP:
| Metric | Type | Labels |
|---|---|---|
| ikanos_capability_active | Gauge | ikanos_capability |
| ikanos_request_total | Counter | ikanos_adapter_type, ikanos_operation_id, status |
| ikanos_request_duration_seconds | Histogram | ikanos_adapter_type, ikanos_operation_id, status |
| ikanos_request_errors | Counter | ikanos_adapter_type, ikanos_operation_id, error.type |
| ikanos_step_duration_seconds | Histogram | step_type, naftiko_namespace |
| ikanos_http_client_total | Counter | server_address, http_response_status_code |
| ikanos_http_client_duration_seconds | Histogram | server_address |
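To show how these counters can be read outside Prometheus, the sketch below parses a snippet of Prometheus text format and derives an overall error ratio from `ikanos_request_total` and `ikanos_request_errors`. The sample payload, its label values, and the helper names are invented for illustration; the names actually exported on `/metrics` may differ slightly (for example with a `_total` suffix or with `error.type` rendered as `error_type`), so check the real output first.

```python
import re
from collections import defaultdict
from typing import Dict

# One sample line: name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")

# Hypothetical /metrics excerpt, for illustration only.
EXAMPLE = """\
# TYPE ikanos_request_total counter
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="getUser",status="200"} 90
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="getUser",status="500"} 10
# TYPE ikanos_request_errors counter
ikanos_request_errors{ikanos_adapter_type="rest",ikanos_operation_id="getUser",error_type="timeout"} 10
"""


def sum_by_metric(exposition: str) -> Dict[str, float]:
    """Sum all samples of each metric name, ignoring labels."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def error_ratio(totals: Dict[str, float],
                requests: str = "ikanos_request_total",
                errors: str = "ikanos_request_errors") -> float:
    """Errors divided by requests, or 0.0 when no requests were counted."""
    total = totals.get(requests, 0.0)
    return totals.get(errors, 0.0) / total if total else 0.0


if __name__ == "__main__":
    print(error_ratio(sum_by_metric(EXAMPLE)))
```

Because counters are cumulative, this ratio covers the whole lifetime of the process. A dashboard would normally compute the same ratio over a time window instead.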
Prometheus & Grafana Setup¶
Install kube-prometheus-stack¶
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set kubeControllerManager.enabled=false \
--set kubeScheduler.enabled=false \
--set kubeProxy.enabled=false \
--set kubeEtcd.enabled=false \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
kubectl rollout status deployment/kube-prometheus-stack-grafana \
-n monitoring --timeout=120s
Access Prometheus and Grafana¶
Get the Grafana password:
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d ; echo
Any cluster — port-forward (simplest):
kubectl port-forward svc/kube-prometheus-stack-prometheus 9091:9090 -n monitoring &
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring &
# Prometheus: http://localhost:9091
# Grafana: http://localhost:3000 (admin / password above)
minikube:
kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
--type='json' \
-p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'
kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
--type='json' \
-p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'
MINIKUBE_IP=$(minikube ip)
docker run -d --name prometheus-bridge --restart=always \
-p 30090:30090 --network minikube \
alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${MINIKUBE_IP}:30090
docker run -d --name grafana-bridge --restart=always \
-p 30030:30030 --network minikube \
alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${MINIKUBE_IP}:30030
# Prometheus: http://localhost:30090
# Grafana: http://localhost:30030
kind:
kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
--type='json' \
-p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'
kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
--type='json' \
-p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'
NODE_IP=$(kubectl get nodes \
-o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NETWORK=$(docker network ls | grep kind | awk '{print $2}')
docker run -d --name prometheus-bridge --restart=always \
-p 30090:30090 --network ${NETWORK} \
alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${NODE_IP}:30090
docker run -d --name grafana-bridge --restart=always \
-p 30030:30030 --network ${NETWORK} \
alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${NODE_IP}:30030
Import the Naftiko Dashboard¶
GRAFANA_PASSWORD=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d)
# Replace <grafana-host> with localhost:30030 (NodePort) or localhost:3000 (port-forward)
curl -s -X POST http://<grafana-host>/api/dashboards/import \
-H "Content-Type: application/json" \
-u "admin:${GRAFANA_PASSWORD}" \
-d "{
\"dashboard\": $(cat config/observability/dashboards/ikanos-dashboard.json),
\"overwrite\": true,
\"inputs\": [{
\"name\": \"DS_PROMETHEUS\",
\"type\": \"datasource\",
\"pluginId\": \"prometheus\",
\"value\": \"Prometheus\"
}],
\"folderId\": 0
}"
The dashboard shows: Active Capabilities, Request Rate, Error Rate, P99 Latency, Request Duration (p50/p95/p99), Step Duration, HTTP Client metrics.
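The import request above embeds a JSON file inside a shell-escaped JSON string, which is easy to get wrong. As an alternative sketch, the same payload can be assembled in Python and posted with the standard library. The field values mirror the curl command; the function names are hypothetical, and the host, password, and file path are whatever you would have passed to curl.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_import_payload(dashboard: Dict[str, Any],
                         datasource: str = "Prometheus") -> Dict[str, Any]:
    """Wrap a dashboard definition in the import request body used above."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [{
            "name": "DS_PROMETHEUS",
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource,
        }],
        "folderId": 0,
    }


def import_dashboard(grafana_host: str, password: str, dashboard_path: str) -> int:
    """POST the dashboard file to Grafana and return the HTTP status code."""
    with open(dashboard_path, encoding="utf-8") as handle:
        payload = build_import_payload(json.load(handle))
    # HTTP basic auth for the admin user, as in the curl -u flag.
    token = base64.b64encode(f"admin:{password}".encode()).decode()
    request = urllib.request.Request(
        f"http://{grafana_host}/api/dashboards/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `import_dashboard("localhost:3000", password, "config/observability/dashboards/ikanos-dashboard.json")` would target the port-forwarded Grafana.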
Verify Prometheus is Scraping¶
# ServiceMonitor created automatically by the operator
kubectl get servicemonitor <capability-name> -n default
# Target is UP in Prometheus
curl -s "http://localhost:YOUR_PORT/api/v1/targets" | \
  python3 -c "
import sys, json
for t in json.load(sys.stdin)['data']['activeTargets']:
    if '<capability-name>' in str(t['labels']):
        print(t['labels'].get('job'), '->', t['health'])
"
Datadog Integration¶
Datadog receives both traces and metrics via OTLP from every capability
pod. The operator injects all required env vars automatically — users configure
nothing beyond declaring a type: control expose.
What is available in Datadog¶
| Signal | Adapter | Status |
|---|---|---|
| Traces — REST requests | REST | ✅ http.server.request spans |
| Traces — MCP tool calls | MCP | ✅ mcp.request spans |
| Log-trace correlation | All | ✅ traceId/spanId in every log line |
| Metrics — request counters | All | ✅ ikanos.request.total |
| Metrics — latency histograms | All | ✅ ikanos.request.duration |
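Log-trace correlation means every log line can be tied back to one trace. The sketch below groups log lines by trace ID, assuming the ID appears as `traceId=<32 hex characters>` (the key matches the grep pattern used in the Quick Reference; the 32-character length follows the W3C trace ID format). The sample log lines and the function name are invented for illustration and do not show the engine's real log layout.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed pattern: traceId= followed by a 32-character lowercase hex ID.
_TRACE_ID = re.compile(r"traceId=([a-f0-9]{32})")


def group_by_trace(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect log lines under the trace ID they mention."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = _TRACE_ID.search(line)
        if match:
            grouped[match.group(1)].append(line.rstrip("\n"))
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical log lines; the real format may differ.
    sample = [
        "INFO traceId=5b8efff798038103d269b633813fc60c spanId=eee19b7ec3c1b174 request started",
        "INFO traceId=5b8efff798038103d269b633813fc60c spanId=eee19b7ec3c1b174 request finished",
        "INFO startup complete",
    ]
    for trace_id, entries in group_by_trace(sample).items():
        print(trace_id, len(entries))
```

Piping `kubectl logs` output into a script like this gives a rough local view of what the Datadog Trace Explorer shows when it links a trace to its logs.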
1. Create a Datadog account¶
Sign up at https://app.datadoghq.eu (EU) or https://app.datadoghq.com (US).
Get your API key from Organization Settings → API Keys.
2. Install the Datadog Agent with OTLP receiver¶
kubectl create secret generic datadog-secret \
--from-literal=api-key=<YOUR_DD_API_KEY> \
-n monitoring
cat > datadog-values.yaml << 'EOF'
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.eu # change to datadoghq.com for US
  clusterName: my-cluster
  kubelet:
    tlsVerify: false
  otlp:
    receiver:
      protocols:
        http:
          enabled: true
        grpc:
          enabled: true
  apm:
    portEnabled: true
  systemProbe:
    enabled: false
  processAgent:
    enabled: false
  env:
    - name: DD_HOSTNAME
      value: my-cluster-node # required for kind/minikube
agents:
  useHostNetwork: false
EOF
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
--namespace monitoring -f datadog-values.yaml
kubectl rollout status daemonset/datadog-agent \
-n monitoring --timeout=120s
3. Configure the operator¶
Set these env vars on the Naftiko Skipper operator once — they propagate automatically to every capability pod in the cluster:
kubectl set env deployment/naftiko-skipper \
NAFTIKO_OTEL_ENDPOINT=http://datadog-agent.monitoring.svc.cluster.local:4318 \
NAFTIKO_OTEL_PROTOCOL=http/protobuf \
NAFTIKO_OTEL_SAMPLING_RATE=1.0 \
-n naftiko-system
4. Reconcile capabilities¶
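Force capabilities to pick up the new OTEL configuration by triggering a reconcile (the annotate command is listed under Quick Reference).
Important: If a capability pod was running before the Datadog Agent was installed, restart it (see the rollout restart command under Quick Reference), because the pod may have cached a failed DNS lookup.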
5. Verify the OTLP pipeline¶
Test connectivity from inside the cluster:
kubectl run otlp-test --image=curlimages/curl --rm -it \
--restart=Never -n default -- \
curl -s -X POST \
http://datadog-agent.monitoring.svc.cluster.local:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"naftiko-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"test","kind":1,"startTimeUnixNano":"1544712660000000000","endTimeUnixNano":"1544712661000000000","status":{}}]}]}]}' \
-w "\nHTTP: %{http_code}\n"
# → HTTP: 200 {"partialSuccess":{}}
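The test request above uses fixed IDs and a timestamp from 2018, so the resulting span may be hard to locate in a time-filtered view. As a variation, the sketch below builds the same OTLP/JSON structure with fresh random IDs and current timestamps. The function names are illustrative; the endpoint is the one used in the curl command and only resolves from inside the cluster.

```python
import json
import secrets
import time
import urllib.request
from typing import Any, Dict


def build_test_trace(service_name: str = "naftiko-test",
                     span_name: str = "test") -> Dict[str, Any]:
    """Build a one-span OTLP/JSON payload ending now and lasting one second."""
    end_ns = time.time_ns()
    start_ns = end_ns - 1_000_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": secrets.token_hex(16),  # 32 hex characters
                "spanId": secrets.token_hex(8),    # 16 hex characters
                "name": span_name,
                "kind": 1,
                "startTimeUnixNano": str(start_ns),
                "endTimeUnixNano": str(end_ns),
                "status": {},
            }]}],
        }]
    }


def send_trace(endpoint: str, payload: Dict[str, Any]) -> int:
    """POST the payload to <endpoint>/v1/traces and return the HTTP status."""
    request = urllib.request.Request(
        f"{endpoint.rstrip('/')}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Prints the payload only; call send_trace(...) from a pod in the cluster.
    print(json.dumps(build_test_trace(), indent=2))
```

From a pod with Python available, `send_trace("http://datadog-agent.monitoring.svc.cluster.local:4318", build_test_trace())` should return `200` if the Agent's OTLP HTTP receiver is reachable.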
6. View in Datadog¶
Navigate to app.datadoghq.eu (or app.datadoghq.com for US accounts):
| Where | What you see |
|---|---|
| APM → Services | ikanos service with MCP and REST endpoints |
| APM → Traces | mcp.request and GET spans with latency |
| APM → Trace Explorer | Individual traces correlated to logs via traceId |
| Metrics → Explorer | ikanos.request.total, ikanos.request.duration |
Operator-Level OTEL Configuration Reference¶
All OTEL env vars are configured once on the operator — every capability in the cluster inherits them automatically. No per-capability configuration is needed.
| Operator env var | Injected into pod as | Purpose |
|---|---|---|
| NAFTIKO_ENGINE_IMAGE | (image field) | ikanos image to run |
| NAFTIKO_OTEL_ENDPOINT | OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint |
| NAFTIKO_OTEL_PROTOCOL | OTEL_EXPORTER_OTLP_PROTOCOL | grpc or http/protobuf |
| NAFTIKO_OTEL_HEADERS | OTEL_EXPORTER_OTLP_HEADERS | e.g. DD-API-KEY=xxx |
| NAFTIKO_OTEL_SAMPLING_RATE | OTEL_TRACES_SAMPLER_ARG | sampling rate 0.0–1.0 |
OTEL_SERVICE_NAME is always set to naftiko-{capability-name} — no
configuration needed.
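The mapping in the table above can be summarized as a small translation step. The sketch below models it in Python to make the result concrete: given the operator's environment and a capability name, it returns the OTEL variables a pod would receive. This is an illustration of the documented mapping, not the operator's actual code; in particular, skipping unset operator variables is an assumption, and `pod_otel_env` is a made-up name.

```python
from typing import Dict, Mapping

# Operator env var -> env var injected into the capability pod (from the table).
OPERATOR_TO_POD = {
    "NAFTIKO_OTEL_ENDPOINT": "OTEL_EXPORTER_OTLP_ENDPOINT",
    "NAFTIKO_OTEL_PROTOCOL": "OTEL_EXPORTER_OTLP_PROTOCOL",
    "NAFTIKO_OTEL_HEADERS": "OTEL_EXPORTER_OTLP_HEADERS",
    "NAFTIKO_OTEL_SAMPLING_RATE": "OTEL_TRACES_SAMPLER_ARG",
}


def pod_otel_env(operator_env: Mapping[str, str], capability_name: str) -> Dict[str, str]:
    """Translate operator-level settings into per-pod OTEL variables."""
    env = {
        pod_name: operator_env[operator_name]
        for operator_name, pod_name in OPERATOR_TO_POD.items()
        if operator_env.get(operator_name)  # assumption: unset values are skipped
    }
    # The service name is always derived from the capability name.
    env["OTEL_SERVICE_NAME"] = f"naftiko-{capability_name}"
    return env


if __name__ == "__main__":
    operator_env = {
        "NAFTIKO_OTEL_ENDPOINT": "http://datadog-agent.monitoring.svc.cluster.local:4318",
        "NAFTIKO_OTEL_PROTOCOL": "http/protobuf",
        "NAFTIKO_OTEL_SAMPLING_RATE": "1.0",
    }
    for key, value in sorted(pod_otel_env(operator_env, "my-api").items()):
        print(f"{key}={value}")
```

Comparing this expected set with the output of `kubectl exec <pod> -- env` is one way to confirm that a pod was reconciled after the operator configuration changed.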
Quick Reference¶
# Check control port directly
kubectl port-forward svc/<capability> 9090:9090 -n default &
curl http://localhost:9090/metrics | grep ikanos
curl http://localhost:9090/health/live
curl http://localhost:9090/health/ready
# Check ServiceMonitor was created
kubectl get servicemonitor <capability> -n default
# Force reconcile after spec or operator config change
kubectl annotate capability <capability> \
reconcile-at=$(date +%s) --overwrite -n default
# Restart pod after Datadog Agent reinstall (clears DNS cache)
kubectl rollout restart deployment/<capability> -n default
# Check Datadog Agent is receiving traces
kubectl logs -n monitoring -l app=datadog-agent \
-c trace-agent | tail -5
# Verify log-trace correlation is active
kubectl logs -l naftiko.io/capability=<capability> -n default \
| grep "traceId=[a-f0-9]" | head -3