Observability

JD.AI Gateway ships with built-in observability via OpenTelemetry distributed tracing and metrics, ASP.NET Core health checks, and a human-readable /doctor diagnostic command — all zero-config by default.

Quick start

By default the gateway writes traces and metrics to stdout (console exporter). No additional infrastructure is required to get started.

# Start the gateway — telemetry is on by default
dotnet run --project src/JD.AI.Gateway

# Check health
curl http://localhost:18789/health

# Run the diagnostic command (in a connected channel)
/doctor

OpenTelemetry Integration

JD.AI uses the standard .NET System.Diagnostics.ActivitySource (traces) and System.Diagnostics.Metrics (metrics) APIs, wired into OpenTelemetry through the JD.AI.Telemetry library.

ActivitySource names

Four named activity sources are registered automatically:

Source	Spans emitted
`JD.AI.Agent`	`jdai.agent.turn` — one span per conversational turn; attributes: `gen_ai.system`, `gen_ai.request.model`, `jdai.turn.index`, `jdai.agent.turn_count`
`JD.AI.Tools`	Tool invocations
`JD.AI.Providers`	`jdai.provider.chat_completion` — one span per provider API call; attributes include retry attempt number
`JD.AI.Sessions`	Session persistence operations

Span status semantics

Ok — operation completed successfully
Unset — operation was cancelled (client disconnect, graceful shutdown); not counted as an error
Error — unexpected exception; always accompanied by a non-zero jdai.providers.errors increment

Meter names and metrics

All instruments are in the JD.AI.Agent meter under the jdai.* namespace:

Instrument	Kind	Unit	Description
`jdai.agent.turns`	Counter	turns	Total agent turns completed
`jdai.agent.turn_duration`	Histogram	ms	Wall-clock time per turn
`jdai.tokens.total`	Counter	tokens	Prompt + completion tokens consumed
`jdai.tools.invocations`	Counter	calls	Tool invocations, tagged by tool name
`jdai.providers.errors`	Counter	errors	Errors after retry exhaustion (cancellations excluded)
`jdai.providers.latency`	Histogram	ms	Per-provider API call latency

All counters carry a gen_ai.system tag set to the provider name (e.g. claude-code, github-copilot, ollama), following the OpenTelemetry GenAI semantic conventions.

Telemetry Configuration

appsettings.json

Telemetry is configured under Gateway:Telemetry:

{
  "Gateway": {
    "Telemetry": {
      "Enabled": true,
      "ServiceName": "jdai",
      "Exporter": "console",
      "OtlpProtocol": "grpc",
      "Endpoint": null
    }
  }
}

Property	Type	Default	Description
`Enabled`	`bool`	`true`	Set `false` to disable all OTel instrumentation
`ServiceName`	`string`	`"jdai"`	Logical service name in traces and metrics
`Exporter`	`string`	`"console"`	Exporter type (see table below)
`OtlpProtocol`	`string`	`"grpc"`	OTLP transport: `"grpc"` (port 4317) or `"http"` (port 4318)
`Endpoint`	`string?`	`null`	Exporter endpoint URI; uses exporter default if absent

Exporters

`Exporter` value	Traces	Metrics	Notes
`"console"`	✔ stdout	✔ stdout	Default; useful for development
`"otlp"`	✔ OTLP	✔ OTLP	Connects to Jaeger, Grafana, Honeycomb, etc.
`"zipkin"`	✔ Zipkin HTTP	✔ console	Zipkin does not support metrics

Environment variables

Standard OpenTelemetry environment variables take precedence over appsettings.json:

Variable	Effect
`OTEL_SERVICE_NAME`	Overrides `Gateway:Telemetry:ServiceName`
`OTEL_EXPORTER_OTLP_ENDPOINT`	Activates OTLP mode and sets the endpoint

Sending to Jaeger (OTLP)

# Start Jaeger all-in-one
docker run -d --name jaeger \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Start gateway with OTLP export
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
dotnet run --project src/JD.AI.Gateway

Open the Jaeger UI at http://localhost:16686 and search for service jdai.

Prometheus / Grafana Integration

Export metrics to Prometheus using the OTLP exporter with an OpenTelemetry Collector:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Configure Prometheus to scrape the collector:

# prometheus.yml
scrape_configs:
  - job_name: jdai
    static_configs:
      - targets: ["otel-collector:8889"]

In Grafana, add Prometheus as a data source and query JD.AI metrics:

jdai_agent_turns_total — total agent turns
jdai_tokens_total — token consumption over time
jdai_providers_latency_bucket — provider latency distribution
jdai_providers_errors_total — error rate by provider

Health Check Endpoints

The gateway runs health checks automatically:

Check	Tag	Failure status	Condition
`gateway`	—	Degraded	Gateway service not operational
`providers`	`providers`	Degraded	No AI providers are reachable
`session_store`	`storage`	Unhealthy	SQLite database inaccessible
`disk_space`	`storage`	Degraded	Less than 100 MB free in data directory
`memory`	`memory`	Degraded	Managed heap exceeds 1 GB

Endpoints

Endpoint	Description	Status codes
`GET /health`	All checks — full JSON report	`200` always
`GET /health/ready`	Readiness probe — `200` for Healthy/Degraded, `503` for Unhealthy	`200` / `503`
`GET /health/live`	Liveness probe — always `200` while process is running	`200`

Full health response example

{
  "status": "Healthy",
  "totalDuration": "00:00:00.0423180",
  "entries": {
    "gateway": {
      "status": "Healthy",
      "description": "Gateway operational",
      "data": { "activeAgents": 2, "uptime": "00:14:22" }
    },
    "providers": {
      "status": "Healthy",
      "description": "2/3 providers reachable",
      "data": { "available": ["claude-code", "github-copilot"], "unavailable": ["ollama"] }
    },
    "session_store": {
      "status": "Healthy",
      "description": "SQLite OK (14 sessions)",
      "data": { "sessionCount": 14, "dbPath": "/home/user/.jdai/sessions.db" }
    }
  }
}

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health/live
    port: 18789
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 15

/doctor Command

The /doctor gateway command runs all registered health checks and renders a human-readable diagnostic report in the connected channel:

=== JD.AI Doctor ===
Version:  1.0.0
Runtime:  .NET 10.0.0
Health:   ✔ Healthy

Checks:
  ✔ Gateway      — Gateway operational
  ✔ Providers    — 2/3 providers reachable
  ⚠ Disk Space   — Low disk space: 0.4 GB free (minimum: 100 MB)
  ✔ Memory       — 142 MB managed heap
  ✔ Session Store — SQLite OK (14 sessions)

Icon	Meaning
`✔`	Healthy
`⚠`	Degraded — gateway is operational but running with reduced capability
`✘`	Unhealthy — a critical dependency is unavailable

Log Configuration

Configure logging in appsettings.json:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "JD.AI": "Information",
      "JD.AI.Providers": "Debug"
    }
  },
  "Gateway": {
    "Server": {
      "Verbose": false
    }
  }
}

Set Gateway:Server:Verbose to true for detailed request logging. Use structured logging sinks (Serilog, Seq, etc.) for production environments.

Table of Contents