Observability

Naftiko Skipper provides built-in observability for every capability that declares a `type: control` expose. No code changes are required: the operator wires everything automatically.

Three backends are supported out of the box:

- Prometheus + Grafana: metrics via scrape (ServiceMonitor)
- Datadog: traces (MCP + REST) and metrics via OTLP push
- Control port: direct access to metrics and health endpoints

The Control Port

Add a `type: control` expose to your capability spec to activate observability:

capability:
  exposes:
    - type: rest
      address: "0.0.0.0"
      port: 3001
      namespace: my-api

    - type: control
      address: "0.0.0.0"
      port: 9090
      observability:
        enabled: true
        metrics:
          local:
            enabled: true
        traces:
          sampling: 1.0          # 1.0 = 100%, 0.1 = 10%
          propagation: w3c
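
With `propagation: w3c`, trace context is carried between services in the W3C Trace Context `traceparent` header. The sketch below decodes a version `00` header to show what it carries. The helper name `parse_traceparent` is made up for illustration and is not part of Naftiko or ikanos.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context (version 00): version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the decoded header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None or match["version"] == "ff":
        return None
    # All-zero trace or span IDs are invalid per the specification.
    if set(match["trace_id"]) == {"0"} or set(match["span_id"]) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["version"], match["trace_id"], match["span_id"], sampled)


if __name__ == "__main__":
    print(parse_traceparent("00-5b8efff798038103d269b633813fc60c-eee19b7ec3c1b174-01"))
```

A caller that sends this header on a request to the REST expose should see the same trace ID continue in downstream spans, assuming the engine honours the incoming context.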

The control port exposes:

Endpoint	Description
`/metrics`	Prometheus text format — RED metrics
`/health/live`	Liveness probe
`/health/ready`	Readiness probe
`/status`	Runtime capability status
`/traces`	Recent trace ring buffer (local dev)
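
To check the endpoints in the table from a script, a minimal sketch along these lines may help. It assumes the control port is reachable on `localhost:9090` (for example through a port-forward); the function names are illustrative only.

```python
import urllib.error
import urllib.request
from typing import List

# Paths taken from the endpoint table above.
CONTROL_PATHS = ["/health/live", "/health/ready", "/status", "/metrics", "/traces"]


def control_urls(host: str = "localhost", port: int = 9090) -> List[str]:
    """Build the full URL for every control-port endpoint."""
    return [f"http://{host}:{port}{path}" for path in CONTROL_PATHS]


def probe(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, or 0 when the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


if __name__ == "__main__":
    for url in control_urls():
        print(url, "->", probe(url))
```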

What the Operator Does Automatically

When a `type: control` expose is present, Skipper:

- Adds a named control port to the Service and Deployment container spec
- Writes Prometheus pod annotations on the pod template:

  prometheus.io/scrape: "true"
  prometheus.io/port: "9090"
  prometheus.io/path: "/metrics"

- Creates a ServiceMonitor for Prometheus Operator (if CRDs are installed)
- Injects OTEL env vars from the operator's own configuration into every capability pod, so no per-capability configuration is needed

Note: If Prometheus Operator is installed after Skipper has started, restart Skipper so Fabric8 re-discovers the monitoring.coreos.com API group, then force a reconcile of each capability:
kubectl rollout restart deployment/naftiko-skipper -n naftiko-system
kubectl annotate capability <name> reconcile-at=$(date +%s) --overwrite -n default
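
To make the annotation step in this section concrete, here is a rough model of how the Prometheus pod annotations could be derived from a capability's `exposes` list. This is a hypothetical helper, not the operator's actual code; in particular, deriving `prometheus.io/port` from the control expose's `port` is an assumption based on both being `9090` in the examples.

```python
from typing import Dict, List, Optional


def prometheus_annotations(exposes: List[dict]) -> Optional[Dict[str, str]]:
    """Return scrape annotations for the first control expose, or None."""
    for expose in exposes:
        if expose.get("type") == "control":
            return {
                "prometheus.io/scrape": "true",
                # Assumption: the annotation port mirrors the control expose port.
                "prometheus.io/port": str(expose["port"]),
                "prometheus.io/path": "/metrics",
            }
    # No control expose declared, so nothing to annotate.
    return None


if __name__ == "__main__":
    spec = [
        {"type": "rest", "address": "0.0.0.0", "port": 3001, "namespace": "my-api"},
        {"type": "control", "address": "0.0.0.0", "port": 9090},
    ]
    print(prometheus_annotations(spec))
```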

Available Metrics

The ikanos engine exports RED metrics via the control port and OTLP:

Metric	Type	Labels
`ikanos_capability_active`	Gauge	`ikanos_capability`
`ikanos_request_total`	Counter	`ikanos_adapter_type`, `ikanos_operation_id`, `status`
`ikanos_request_duration_seconds`	Histogram	`ikanos_adapter_type`, `ikanos_operation_id`, `status`
`ikanos_request_errors`	Counter	`ikanos_adapter_type`, `ikanos_operation_id`, `error.type`
`ikanos_step_duration_seconds`	Histogram	`step_type`, `naftiko_namespace`
`ikanos_http_client_total`	Counter	`server_address`, `http_response_status_code`
`ikanos_http_client_duration_seconds`	Histogram	`server_address`
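
As an illustration of how these counters combine into a RED-style error ratio, the sketch below sums samples from `/metrics` output by metric name. The parser is deliberately simplified (it ignores label values and timestamps), and the sample text, including its label values and counts, is invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict

# Metric name, optional {labels}, then the sample value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)")


def sum_by_metric(exposition: str) -> Dict[str, float]:
    """Sum all samples per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def error_ratio(totals: Dict[str, float]) -> float:
    """Errors divided by requests; 0.0 when there were no requests."""
    requests = totals.get("ikanos_request_total", 0.0)
    errors = totals.get("ikanos_request_errors", 0.0)
    return errors / requests if requests else 0.0


# Invented sample data in Prometheus text format.
SAMPLE_TEXT = """\
# TYPE ikanos_request_total counter
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="get-user",status="200"} 90
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="get-user",status="500"} 10
ikanos_request_errors{ikanos_adapter_type="rest",ikanos_operation_id="get-user"} 10
"""

if __name__ == "__main__":
    print(error_ratio(sum_by_metric(SAMPLE_TEXT)))
```

In Prometheus itself the same ratio would normally be computed over a time window with `rate()`, since raw counters only ever grow.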

Prometheus & Grafana Setup

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set kubeControllerManager.enabled=false \
  --set kubeScheduler.enabled=false \
  --set kubeProxy.enabled=false \
  --set kubeEtcd.enabled=false \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

kubectl rollout status deployment/kube-prometheus-stack-grafana \
  -n monitoring --timeout=120s

Access Prometheus and Grafana

Get the Grafana password:

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Any cluster — port-forward (simplest):

kubectl port-forward svc/kube-prometheus-stack-prometheus 9091:9090 -n monitoring &
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring &
# Prometheus: http://localhost:9091
# Grafana:    http://localhost:3000  (admin / password above)

minikube:

kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'

kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'

MINIKUBE_IP=$(minikube ip)
docker run -d --name prometheus-bridge --restart=always \
  -p 30090:30090 --network minikube \
  alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${MINIKUBE_IP}:30090

docker run -d --name grafana-bridge --restart=always \
  -p 30030:30030 --network minikube \
  alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${MINIKUBE_IP}:30030
# Prometheus: http://localhost:30090
# Grafana:    http://localhost:30030

kind:

kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'

kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'

NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NETWORK=$(docker network ls | grep kind | awk '{print $2}')

docker run -d --name prometheus-bridge --restart=always \
  -p 30090:30090 --network ${NETWORK} \
  alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${NODE_IP}:30090

docker run -d --name grafana-bridge --restart=always \
  -p 30030:30030 --network ${NETWORK} \
  alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${NODE_IP}:30030

Import the Naftiko Dashboard

GRAFANA_PASSWORD=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d)

# Replace <grafana-host> with localhost:30030 (NodePort) or localhost:3000 (port-forward)
curl -s -X POST http://<grafana-host>/api/dashboards/import \
  -H "Content-Type: application/json" \
  -u "admin:${GRAFANA_PASSWORD}" \
  -d "{
    \"dashboard\": $(cat config/observability/dashboards/ikanos-dashboard.json),
    \"overwrite\": true,
    \"inputs\": [{
      \"name\": \"DS_PROMETHEUS\",
      \"type\": \"datasource\",
      \"pluginId\": \"prometheus\",
      \"value\": \"Prometheus\"
    }],
    \"folderId\": 0
  }"

The dashboard shows: Active Capabilities, Request Rate, Error Rate, P99 Latency, Request Duration (p50/p95/p99), Step Duration, HTTP Client metrics.
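
The shell command above splices the dashboard JSON into a quoted string, which is easy to break with escaping. As an alternative sketch, the same request body can be built with Python's `json` module. The field names come from the curl call above; the function names and the default data source name are illustrative assumptions.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_import_payload(dashboard: Dict[str, Any],
                         datasource: str = "Prometheus") -> Dict[str, Any]:
    """Mirror the JSON body used by the curl command above."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [{
            "name": "DS_PROMETHEUS",
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource,
        }],
        "folderId": 0,
    }


def import_request(grafana_host: str, password: str,
                   payload: Dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) the POST request with basic auth."""
    token = base64.b64encode(f"admin:{password}".encode()).decode()
    return urllib.request.Request(
        f"http://{grafana_host}/api/dashboards/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    import os
    with open("config/observability/dashboards/ikanos-dashboard.json") as fh:
        body = build_import_payload(json.load(fh))
    # GRAFANA_PASSWORD as exported in the shell snippet above.
    req = import_request("localhost:3000", os.environ["GRAFANA_PASSWORD"], body)
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```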

Verify Prometheus is Scraping

# ServiceMonitor created automatically by the operator
kubectl get servicemonitor <capability-name> -n default

# Target is UP in Prometheus
# Replace <prometheus-port> with 9091 (port-forward) or 30090 (NodePort)
curl -s "http://localhost:<prometheus-port>/api/v1/targets" | \
  python3 -c "
import sys, json
for t in json.load(sys.stdin)['data']['activeTargets']:
    if '<capability-name>' in str(t['labels']):
        print(t['labels'].get('job'), '->', t['health'])
"

Datadog Integration

Datadog receives both traces and metrics via OTLP from every capability pod. Once the operator is configured (step 3), it injects all required env vars automatically, so individual capabilities need nothing beyond a `type: control` expose.

What is available in Datadog

Signal	Adapter	Status
Traces — REST requests	REST	✅ `http.server.request` spans
Traces — MCP tool calls	MCP	✅ `mcp.request` spans
Log-trace correlation	All	✅ `traceId`/`spanId` in every log line
Metrics — request counters	All	✅ `ikanos.request.total`
Metrics — latency histograms	All	✅ `ikanos.request.duration`

1. Create a Datadog account

Sign up at https://app.datadoghq.eu (EU) or https://app.datadoghq.com (US).

Get your API key from Organization Settings → API Keys.

2. Install the Datadog Agent with OTLP receiver

kubectl create secret generic datadog-secret \
  --from-literal=api-key=<YOUR_DD_API_KEY> \
  -n monitoring

cat > datadog-values.yaml << 'EOF'
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.eu          # change to datadoghq.com for US
  clusterName: my-cluster
  kubelet:
    tlsVerify: false
  otlp:
    receiver:
      protocols:
        http:
          enabled: true
        grpc:
          enabled: true
  apm:
    portEnabled: true
  systemProbe:
    enabled: false
  processAgent:
    enabled: false
  env:
    - name: DD_HOSTNAME
      value: my-cluster-node    # required for kind/minikube
agents:
  useHostNetwork: false
EOF

helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --namespace monitoring -f datadog-values.yaml

kubectl rollout status daemonset/datadog-agent \
  -n monitoring --timeout=120s

3. Configure the operator

Set these env vars on the Naftiko Skipper operator once — they propagate automatically to every capability pod in the cluster:

kubectl set env deployment/naftiko-skipper \
  NAFTIKO_OTEL_ENDPOINT=http://datadog-agent.monitoring.svc.cluster.local:4318 \
  NAFTIKO_OTEL_PROTOCOL=http/protobuf \
  NAFTIKO_OTEL_SAMPLING_RATE=1.0 \
  -n naftiko-system

4. Reconcile capabilities

Force capabilities to pick up the new OTEL configuration:

kubectl annotate capability <name> \
  reconcile-at=$(date +%s) --overwrite -n default

Important: If a capability pod was running before the Datadog Agent was installed, restart it — the pod may have cached a failed DNS lookup:
kubectl rollout restart deployment/<capability-name> -n default

5. Verify the OTLP pipeline

Test connectivity from inside the cluster:

kubectl run otlp-test --image=curlimages/curl --rm -it \
  --restart=Never -n default -- \
  curl -s -X POST \
  http://datadog-agent.monitoring.svc.cluster.local:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"naftiko-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"test","kind":1,"startTimeUnixNano":"1544712660000000000","endTimeUnixNano":"1544712661000000000","status":{}}]}]}]}' \
  -w "\nHTTP: %{http_code}\n"
# → HTTP: 200 {"partialSuccess":{}}
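
The curl test above uses fixed IDs and timestamps. If a fresh span per run is preferable (for instance, so it appears in a recent time window), a payload of the same shape can be generated as sketched below. The function name is hypothetical; the JSON structure copies the request body above.

```python
import json
import secrets
import time
from typing import Any, Dict


def build_test_span(service_name: str = "naftiko-test",
                    name: str = "test") -> Dict[str, Any]:
    """Build an OTLP/HTTP JSON trace payload containing one 1-second span."""
    end = time.time_ns()
    start = end - 1_000_000_000  # span started one second ago
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": secrets.token_hex(16),  # 16 random bytes, 32 hex chars
                "spanId": secrets.token_hex(8),    # 8 random bytes, 16 hex chars
                "name": name,
                "kind": 1,
                "startTimeUnixNano": str(start),
                "endTimeUnixNano": str(end),
                "status": {},
            }]}],
        }]
    }


if __name__ == "__main__":
    # Pipe into: curl -X POST .../v1/traces -H "Content-Type: application/json" -d @-
    print(json.dumps(build_test_span()))
```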

6. View in Datadog

Navigate to app.datadoghq.eu (or app.datadoghq.com for US accounts):

Where	What you see
APM → Services	`ikanos` service with MCP and REST endpoints
APM → Traces	`mcp.request` and `GET` spans with latency
APM → Trace Explorer	Individual traces correlated to logs via `traceId`
Metrics → Explorer	`ikanos.request.total`, `ikanos.request.duration`

Operator-Level OTEL Configuration Reference

All OTEL env vars are configured once on the operator — every capability in the cluster inherits them automatically. No per-capability configuration is needed.

Operator env var	Injected into pod as	Purpose
`NAFTIKO_ENGINE_IMAGE`	(image field)	ikanos image to run
`NAFTIKO_OTEL_ENDPOINT`	`OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP collector endpoint
`NAFTIKO_OTEL_PROTOCOL`	`OTEL_EXPORTER_OTLP_PROTOCOL`	`grpc` or `http/protobuf`
`NAFTIKO_OTEL_HEADERS`	`OTEL_EXPORTER_OTLP_HEADERS`	e.g. `DD-API-KEY=xxx`
`NAFTIKO_OTEL_SAMPLING_RATE`	`OTEL_TRACES_SAMPLER_ARG`	sampling rate 0.0–1.0

OTEL_SERVICE_NAME is always set to naftiko-{capability-name} — no configuration needed.
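
The mapping in the table can be modelled as a small lookup. This is only an illustration of the documented behaviour, not the operator's implementation; skipping unset operator variables is an assumption.

```python
from typing import Dict, Mapping

# Operator env var -> env var injected into the capability pod (from the table).
OTEL_ENV_MAP = {
    "NAFTIKO_OTEL_ENDPOINT": "OTEL_EXPORTER_OTLP_ENDPOINT",
    "NAFTIKO_OTEL_PROTOCOL": "OTEL_EXPORTER_OTLP_PROTOCOL",
    "NAFTIKO_OTEL_HEADERS": "OTEL_EXPORTER_OTLP_HEADERS",
    "NAFTIKO_OTEL_SAMPLING_RATE": "OTEL_TRACES_SAMPLER_ARG",
}


def pod_otel_env(operator_env: Mapping[str, str],
                 capability_name: str) -> Dict[str, str]:
    """Translate operator-level settings into per-pod OTEL variables."""
    pod_env = {
        pod_var: operator_env[op_var]
        for op_var, pod_var in OTEL_ENV_MAP.items()
        if operator_env.get(op_var)  # assumption: unset or empty values are skipped
    }
    # The service name is always derived from the capability name.
    pod_env["OTEL_SERVICE_NAME"] = f"naftiko-{capability_name}"
    return pod_env


if __name__ == "__main__":
    operator = {
        "NAFTIKO_OTEL_ENDPOINT": "http://datadog-agent.monitoring.svc.cluster.local:4318",
        "NAFTIKO_OTEL_PROTOCOL": "http/protobuf",
        "NAFTIKO_OTEL_SAMPLING_RATE": "1.0",
    }
    for key, value in sorted(pod_otel_env(operator, "my-api").items()):
        print(f"{key}={value}")
```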

Quick Reference

# Check control port directly
kubectl port-forward svc/<capability> 9090:9090 -n default &
curl http://localhost:9090/metrics | grep ikanos
curl http://localhost:9090/health/live
curl http://localhost:9090/health/ready

# Check ServiceMonitor was created
kubectl get servicemonitor <capability> -n default

# Force reconcile after spec or operator config change
kubectl annotate capability <capability> \
  reconcile-at=$(date +%s) --overwrite -n default

# Restart pod after Datadog Agent reinstall (clears DNS cache)
kubectl rollout restart deployment/<capability> -n default

# Check Datadog Agent is receiving traces
kubectl logs -n monitoring -l app=datadog-agent \
  -c trace-agent | tail -5

# Verify log-trace correlation is active
kubectl logs -l naftiko.io/capability=<capability> -n default \
  | grep "traceId=[a-f0-9]" | head -3
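
For a slightly richer check than the grep above, log lines can be grouped by trace ID. The sketch assumes only what the grep pattern implies, namely that each correlated line contains `traceId=` followed by lowercase hex; the rest of the log layout is unknown, and the script name in the usage line is hypothetical.

```python
import re
import sys
from collections import defaultdict
from typing import Dict, Iterable, List

# Same pattern as the grep above, extended to capture the whole ID.
_TRACE_ID = re.compile(r"traceId=([a-f0-9]+)")


def group_by_trace(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each trace ID to the log lines that mention it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = _TRACE_ID.search(line)
        if match:
            groups[match.group(1)].append(line.rstrip("\n"))
    return dict(groups)


if __name__ == "__main__":
    # Usage: kubectl logs -l naftiko.io/capability=<capability> -n default | python3 group_traces.py
    for trace_id, entries in group_by_trace(sys.stdin).items():
        print(trace_id, len(entries), "line(s)")
```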