Observability

Naftiko Skipper provides built-in observability for every capability that declares a type: control expose. No code changes are required — the operator wires everything automatically.

Three backends are supported out of the box:

- Prometheus + Grafana: metrics via scrape (ServiceMonitor)
- Datadog: traces (MCP + REST) and metrics via OTLP push
- Control port: direct access to metrics and health endpoints


The Control Port

Add a type: control expose to your capability spec to activate observability:

capability:
  exposes:
    - type: rest
      address: "0.0.0.0"
      port: 3001
      namespace: my-api

    - type: control
      address: "0.0.0.0"
      port: 9090
      observability:
        enabled: true
        metrics:
          local:
            enabled: true
        traces:
          sampling: 1.0          # 1.0 = 100%, 0.1 = 10%
          propagation: w3c

The control port exposes:

| Endpoint | Description |
|---|---|
| /metrics | Prometheus text format (RED metrics) |
| /health/live | Liveness probe |
| /health/ready | Readiness probe |
| /status | Runtime capability status |
| /traces | Recent trace ring buffer (local dev) |
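
As a rough illustration of how a script might poll these endpoints, the sketch below builds control port URLs and returns the HTTP status of each. It assumes the control port is reachable from where the script runs (for example through `kubectl port-forward`); `control_url` and `probe` are hypothetical helper names, not part of Naftiko.

```python
import urllib.error
import urllib.request

# Paths listed for the control port in this page.
CONTROL_PATHS = ("/metrics", "/health/live", "/health/ready", "/status", "/traces")


def control_url(host: str, port: int, path: str) -> str:
    """Build the URL of a control port endpoint, rejecting unknown paths."""
    if path not in CONTROL_PATHS:
        raise ValueError(f"unknown control port path: {path}")
    return f"http://{host}:{port}{path}"


def probe(host: str, port: int, path: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(control_url(host, port, path), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0
```

For example, `probe("localhost", 9090, "/health/ready")` could be called in a loop to wait for a capability to become ready before sending traffic.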

What the Operator Does Automatically

When a type: control expose is present, Skipper:

  1. Adds a named control port to the Service and Deployment container spec
  2. Writes Prometheus pod annotations on the pod template:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
    
  3. Creates a ServiceMonitor for Prometheus Operator (if CRDs are installed)
  4. Injects OTEL env vars from the operator's own configuration into every capability pod — no per-capability configuration needed

Note: If Prometheus Operator is installed after Skipper has started, restart Skipper so that Fabric8 re-discovers the monitoring.coreos.com API group, then force a reconcile of each capability:

kubectl rollout restart deployment/naftiko-skipper -n naftiko-system
kubectl annotate capability <name> reconcile-at=$(date +%s) --overwrite -n default


Available Metrics

The ikanos engine exports RED metrics via the control port and OTLP:

| Metric | Type | Labels |
|---|---|---|
| ikanos_capability_active | Gauge | ikanos_capability |
| ikanos_request_total | Counter | ikanos_adapter_type, ikanos_operation_id, status |
| ikanos_request_duration_seconds | Histogram | ikanos_adapter_type, ikanos_operation_id, status |
| ikanos_request_errors | Counter | ikanos_adapter_type, ikanos_operation_id, error.type |
| ikanos_step_duration_seconds | Histogram | step_type, naftiko_namespace |
| ikanos_http_client_total | Counter | server_address, http_response_status_code |
| ikanos_http_client_duration_seconds | Histogram | server_address |
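
To show how these counters combine into an error rate, the sketch below parses Prometheus text exposition output and divides errors by total requests per operation. The `SAMPLE` text is made up for illustration: the exact series names and label values a capability emits may differ (for instance, a `_total` suffix or `error_type` in place of `error.type`), so the metric names are parameters. `parse_samples` and `error_ratio` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Matches one label pair such as status="200" (escaped quotes allowed in values).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Invented sample of /metrics output; label values are illustrative only.
SAMPLE = """\
# TYPE ikanos_request_total counter
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="get-user",status="200"} 90
ikanos_request_total{ikanos_adapter_type="rest",ikanos_operation_id="get-user",status="500"} 10
ikanos_request_errors{ikanos_adapter_type="rest",ikanos_operation_id="get-user",error_type="timeout"} 10
"""


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name, _, rest = line.partition("{")
            label_part, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            label_part = ""
        # The first token after the labels is the value; a timestamp may follow.
        yield name, dict(_LABEL.findall(label_part)), float(tail.split()[0])


def error_ratio(text, total="ikanos_request_total", errors="ikanos_request_errors"):
    """Return {operation_id: errors / total} for operations with traffic."""
    totals, failed = defaultdict(float), defaultdict(float)
    for name, labels, value in parse_samples(text):
        operation = labels.get("ikanos_operation_id", "")
        if name == total:
            totals[operation] += value
        elif name == errors:
            failed[operation] += value
    return {op: failed[op] / count for op, count in totals.items() if count > 0}
```

This works on a single scrape of cumulative counters, so it gives the lifetime ratio; a dashboard would normally compute the same ratio over a time window with `rate()`.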

Prometheus & Grafana Setup

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set kubeControllerManager.enabled=false \
  --set kubeScheduler.enabled=false \
  --set kubeProxy.enabled=false \
  --set kubeEtcd.enabled=false \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

kubectl rollout status deployment/kube-prometheus-stack-grafana \
  -n monitoring --timeout=120s

Access Prometheus and Grafana

Get the Grafana password:

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Any cluster — port-forward (simplest):

kubectl port-forward svc/kube-prometheus-stack-prometheus 9091:9090 -n monitoring &
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring &
# Prometheus: http://localhost:9091
# Grafana:    http://localhost:3000  (admin / password above)

minikube:

kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'

kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'

MINIKUBE_IP=$(minikube ip)
docker run -d --name prometheus-bridge --restart=always \
  -p 30090:30090 --network minikube \
  alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${MINIKUBE_IP}:30090

docker run -d --name grafana-bridge --restart=always \
  -p 30030:30030 --network minikube \
  alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${MINIKUBE_IP}:30030
# Prometheus: http://localhost:30090
# Grafana:    http://localhost:30030

kind:

kubectl patch svc kube-prometheus-stack-prometheus -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30090}]'

kubectl patch svc kube-prometheus-stack-grafana -n monitoring \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"add","path":"/spec/ports/0/nodePort","value":30030}]'

NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NETWORK=$(docker network ls | grep kind | awk '{print $2}')

docker run -d --name prometheus-bridge --restart=always \
  -p 30090:30090 --network ${NETWORK} \
  alpine/socat TCP-LISTEN:30090,fork,reuseaddr TCP:${NODE_IP}:30090

docker run -d --name grafana-bridge --restart=always \
  -p 30030:30030 --network ${NETWORK} \
  alpine/socat TCP-LISTEN:30030,fork,reuseaddr TCP:${NODE_IP}:30030

Import the Naftiko Dashboard

GRAFANA_PASSWORD=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d)

# Replace <grafana-host> with localhost:30030 (NodePort) or localhost:3000 (port-forward)
curl -s -X POST http://<grafana-host>/api/dashboards/import \
  -H "Content-Type: application/json" \
  -u "admin:${GRAFANA_PASSWORD}" \
  -d "{
    \"dashboard\": $(cat config/observability/dashboards/ikanos-dashboard.json),
    \"overwrite\": true,
    \"inputs\": [{
      \"name\": \"DS_PROMETHEUS\",
      \"type\": \"datasource\",
      \"pluginId\": \"prometheus\",
      \"value\": \"Prometheus\"
    }],
    \"folderId\": 0
  }"

The dashboard shows: Active Capabilities, Request Rate, Error Rate, P99 Latency, Request Duration (p50/p95/p99), Step Duration, HTTP Client metrics.
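
The same import request can be built without shell string interpolation, which avoids quoting problems if the dashboard JSON contains characters the shell treats specially. The sketch below mirrors the curl call above using only the standard library; `build_import_payload` and `build_import_request` are hypothetical names, and the Grafana host and password are supplied by the caller.

```python
import base64
import json
import urllib.request


def build_import_payload(dashboard: dict, datasource: str = "Prometheus") -> bytes:
    """Wrap a dashboard definition in the body sent to /api/dashboards/import."""
    body = {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [{
            "name": "DS_PROMETHEUS",
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource,
        }],
        "folderId": 0,
    }
    return json.dumps(body).encode("utf-8")


def build_import_request(grafana_host: str, password: str, dashboard: dict,
                         user: str = "admin") -> urllib.request.Request:
    """Build (but do not send) the POST request with basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"http://{grafana_host}/api/dashboards/import",
        data=build_import_payload(dashboard),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
```

To send it, load `config/observability/dashboards/ikanos-dashboard.json` with `json.load` and pass the resulting request to `urllib.request.urlopen`.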

Verify Prometheus is Scraping

# ServiceMonitor created automatically by the operator
kubectl get servicemonitor <capability-name> -n default

# Target is UP in Prometheus
curl -s "http://localhost:YOUR_PORT/api/v1/targets" | \
  python3 -c "
import sys, json
for t in json.load(sys.stdin)['data']['activeTargets']:
    if '<capability-name>' in str(t['labels']):
        print(t['labels'].get('job'), '->', t['health'])
"

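Once the target is UP, the scraped data can be checked through the Prometheus HTTP query API. The sketch below builds an instant query for the per-operation request rate and flattens the response into a dictionary. The PromQL expression is an assumption about how one might query `ikanos_request_total`, and `query_url`, `vector_to_dict` and `request_rates` are hypothetical helpers.

```python
import json
import urllib.parse
import urllib.request

# Per-operation request rate over the last five minutes (illustrative query).
REQUEST_RATE = "sum by (ikanos_operation_id) (rate(ikanos_request_total[5m]))"


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def vector_to_dict(response: dict, label: str) -> dict:
    """Map an instant-vector response to {label value: sample value}."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in response["data"]["result"]
    }


def request_rates(base_url: str) -> dict:
    """Fetch the request rate per operation from a reachable Prometheus."""
    with urllib.request.urlopen(query_url(base_url, REQUEST_RATE), timeout=5) as resp:
        return vector_to_dict(json.load(resp), "ikanos_operation_id")
```

For example, `request_rates("http://localhost:9091")` would use the port-forward shown earlier; an empty result usually means no requests have reached the capability yet.
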
Datadog Integration

Datadog receives both traces and metrics via OTLP from every capability pod. Once the operator has been pointed at the Datadog Agent (step 3 below), it injects all required env vars automatically; capability authors configure nothing beyond declaring a type: control expose.

What is available in Datadog

| Signal | Adapter | Status |
|---|---|---|
| Traces: REST requests | REST | http.server.request spans |
| Traces: MCP tool calls | MCP | mcp.request spans |
| Log-trace correlation | All | traceId/spanId in every log line |
| Metrics: request counters | All | ikanos.request.total |
| Metrics: latency histograms | All | ikanos.request.duration |

1. Create a Datadog account

Sign up at https://app.datadoghq.eu (EU) or https://app.datadoghq.com (US).

Get your API key from Organization Settings → API Keys.

2. Install the Datadog Agent with OTLP receiver

kubectl create secret generic datadog-secret \
  --from-literal=api-key=<YOUR_DD_API_KEY> \
  -n monitoring

cat > datadog-values.yaml << 'EOF'
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.eu          # change to datadoghq.com for US
  clusterName: my-cluster
  kubelet:
    tlsVerify: false
  otlp:
    receiver:
      protocols:
        http:
          enabled: true
        grpc:
          enabled: true
  apm:
    portEnabled: true
  systemProbe:
    enabled: false
  processAgent:
    enabled: false
  env:
    - name: DD_HOSTNAME
      value: my-cluster-node    # required for kind/minikube
agents:
  useHostNetwork: false
EOF

helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --namespace monitoring -f datadog-values.yaml

kubectl rollout status daemonset/datadog-agent \
  -n monitoring --timeout=120s

3. Configure the operator

Set these env vars on the Naftiko Skipper operator once — they propagate automatically to every capability pod in the cluster:

kubectl set env deployment/naftiko-skipper \
  NAFTIKO_OTEL_ENDPOINT=http://datadog-agent.monitoring.svc.cluster.local:4318 \
  NAFTIKO_OTEL_PROTOCOL=http/protobuf \
  NAFTIKO_OTEL_SAMPLING_RATE=1.0 \
  -n naftiko-system
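
A common mistake at this step is pairing a protocol with the other protocol's port: by OTLP convention, gRPC listens on 4317 and HTTP on 4318. The sketch below is a hypothetical pre-flight check for the three values set above; `check_otlp_settings` is not part of Naftiko, and it only flags the cases it names.

```python
from urllib.parse import urlparse

# Conventional OTLP receiver ports per protocol.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def check_otlp_settings(endpoint: str, protocol: str, sampling_rate: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if protocol not in DEFAULT_PORTS:
        problems.append(f"unknown protocol {protocol!r}")

    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        problems.append(f"endpoint {endpoint!r} is not an http(s) URL")
    elif protocol in DEFAULT_PORTS and parsed.port in DEFAULT_PORTS.values():
        # Only flag the case where the port is the *other* protocol's default.
        if parsed.port != DEFAULT_PORTS[protocol]:
            problems.append(
                f"port {parsed.port} is the default for another protocol, "
                f"expected {DEFAULT_PORTS[protocol]} for {protocol}"
            )

    try:
        rate = float(sampling_rate)
    except ValueError:
        problems.append(f"sampling rate {sampling_rate!r} is not a number")
    else:
        if not 0.0 <= rate <= 1.0:
            problems.append(f"sampling rate {rate} is outside 0.0 to 1.0")
    return problems
```

Custom ports are left alone, since a collector may legitimately listen elsewhere.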

4. Reconcile capabilities

Force capabilities to pick up the new OTEL configuration:

kubectl annotate capability <name> \
  reconcile-at=$(date +%s) --overwrite -n default

Important: If a capability pod was running before the Datadog Agent was installed, restart it — the pod may have cached a failed DNS lookup:

kubectl rollout restart deployment/<capability-name> -n default

5. Verify the OTLP pipeline

Test connectivity from inside the cluster:

kubectl run otlp-test --image=curlimages/curl --rm -it \
  --restart=Never -n default -- \
  curl -s -X POST \
  http://datadog-agent.monitoring.svc.cluster.local:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"naftiko-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"test","kind":1,"startTimeUnixNano":"1544712660000000000","endTimeUnixNano":"1544712661000000000","status":{}}]}]}]}' \
  -w "\nHTTP: %{http_code}\n"
# → {"partialSuccess":{}}
# → HTTP: 200
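
The test span above uses fixed ids and timestamps from 2018, which can make it hard to find in the Datadog UI. The sketch below generates an equivalent OTLP/JSON body with fresh random ids and current timestamps; it follows the same structure as the curl payload, and `build_test_span` is a hypothetical helper name.

```python
import json
import secrets
import time


def build_test_span(service_name: str = "naftiko-test",
                    span_name: str = "test",
                    duration_s: float = 1.0) -> dict:
    """Build an OTLP/JSON trace payload containing one span ending now."""
    end_ns = time.time_ns()
    start_ns = end_ns - int(duration_s * 1_000_000_000)
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": secrets.token_hex(16),  # 32 hex characters
                "spanId": secrets.token_hex(8),    # 16 hex characters
                "name": span_name,
                "kind": 1,
                # OTLP/JSON encodes 64-bit integers as strings.
                "startTimeUnixNano": str(start_ns),
                "endTimeUnixNano": str(end_ns),
                "status": {},
            }]}],
        }]
    }


# The JSON body to POST to <agent>:4318/v1/traces with Content-Type: application/json.
body = json.dumps(build_test_span())
```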

6. View in Datadog

Navigate to app.datadoghq.eu (or app.datadoghq.com for US accounts):

| Where | What you see |
|---|---|
| APM → Services | ikanos service with MCP and REST endpoints |
| APM → Traces | mcp.request and GET spans with latency |
| APM → Trace Explorer | Individual traces correlated to logs via traceId |
| Metrics → Explorer | ikanos.request.total, ikanos.request.duration |

Operator-Level OTEL Configuration Reference

All OTEL env vars are configured once on the operator — every capability in the cluster inherits them automatically. No per-capability configuration is needed.

| Operator env var | Injected into pod as | Purpose |
|---|---|---|
| NAFTIKO_ENGINE_IMAGE | (image field) | ikanos image to run |
| NAFTIKO_OTEL_ENDPOINT | OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint |
| NAFTIKO_OTEL_PROTOCOL | OTEL_EXPORTER_OTLP_PROTOCOL | grpc or http/protobuf |
| NAFTIKO_OTEL_HEADERS | OTEL_EXPORTER_OTLP_HEADERS | e.g. DD-API-KEY=xxx |
| NAFTIKO_OTEL_SAMPLING_RATE | OTEL_TRACES_SAMPLER_ARG | sampling rate 0.0–1.0 |

OTEL_SERVICE_NAME is always set to naftiko-{capability-name} — no configuration needed.
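
The mapping in this reference can be expressed as a small translation function. The sketch below is a Python illustration of the documented behaviour only; the operator's actual implementation is not shown on this page, and `ENV_MAPPING` and `pod_env` are hypothetical names. It assumes unset operator variables are simply not injected.

```python
# Operator env var -> env var injected into each capability pod.
ENV_MAPPING = {
    "NAFTIKO_OTEL_ENDPOINT": "OTEL_EXPORTER_OTLP_ENDPOINT",
    "NAFTIKO_OTEL_PROTOCOL": "OTEL_EXPORTER_OTLP_PROTOCOL",
    "NAFTIKO_OTEL_HEADERS": "OTEL_EXPORTER_OTLP_HEADERS",
    "NAFTIKO_OTEL_SAMPLING_RATE": "OTEL_TRACES_SAMPLER_ARG",
}


def pod_env(operator_env: dict, capability_name: str) -> dict:
    """Derive the OTEL env vars for one capability pod."""
    env = {
        target: operator_env[source]
        for source, target in ENV_MAPPING.items()
        if operator_env.get(source)  # skip unset or empty values (assumption)
    }
    # The service name is derived from the capability, never configured.
    env["OTEL_SERVICE_NAME"] = f"naftiko-{capability_name}"
    return env


example = pod_env(
    {
        "NAFTIKO_OTEL_ENDPOINT": "http://datadog-agent.monitoring.svc.cluster.local:4318",
        "NAFTIKO_OTEL_PROTOCOL": "http/protobuf",
        "NAFTIKO_OTEL_SAMPLING_RATE": "1.0",
    },
    "my-api",
)
```

Here `example` contains four entries: the three translated variables plus `OTEL_SERVICE_NAME=naftiko-my-api`.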


Quick Reference

# Check control port directly
kubectl port-forward svc/<capability> 9090:9090 -n default &
curl http://localhost:9090/metrics | grep ikanos
curl http://localhost:9090/health/live
curl http://localhost:9090/health/ready

# Check ServiceMonitor was created
kubectl get servicemonitor <capability> -n default

# Force reconcile after spec or operator config change
kubectl annotate capability <capability> \
  reconcile-at=$(date +%s) --overwrite -n default

# Restart pod after Datadog Agent reinstall (clears DNS cache)
kubectl rollout restart deployment/<capability> -n default

# Check Datadog Agent is receiving traces
kubectl logs -n monitoring -l app=datadog-agent \
  -c trace-agent | tail -5

# Verify log-trace correlation is active
kubectl logs -l naftiko.io/capability=<capability> -n default \
  | grep "traceId=[a-f0-9]" | head -3
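
For more than a quick grep, log lines can be grouped by trace so that all output for one request is read together. The sketch below assumes log lines contain a `traceId=<32 hex characters>` token, matching the grep pattern above; the rest of the log layout is unknown, and `group_by_trace` is a hypothetical helper.

```python
import re
from collections import defaultdict

# A W3C trace id is 32 lowercase hex characters.
_TRACE_ID = re.compile(r"traceId=([a-f0-9]{32})")


def group_by_trace(lines):
    """Return {trace_id: [log lines]} for lines that carry a trace id."""
    groups = defaultdict(list)
    for line in lines:
        match = _TRACE_ID.search(line)
        if match:
            groups[match.group(1)].append(line.rstrip("\n"))
    return dict(groups)


# Invented log lines for illustration; real output will look different.
SAMPLE_LOGS = [
    "INFO traceId=5b8efff798038103d269b633813fc60c spanId=eee19b7ec3c1b174 request started\n",
    "INFO no trace context on this line\n",
    "WARN traceId=5b8efff798038103d269b633813fc60c spanId=aaa19b7ec3c1b175 upstream slow\n",
]
by_trace = group_by_trace(SAMPLE_LOGS)
```

Feeding it the output of `kubectl logs` (for example via `sys.stdin`) gives one list of lines per trace id, which can then be looked up in the Trace Explorer.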