Telemetry errors

Common error signatures

Telemetry errors typically fall into three categories: failures in CLI command execution, failures in agent coordination and heartbeat tracking, and failures in approval gate or event streaming operations.

Where errors originate

The following CLI entry points are the most common raise sites. Failures in the underlying coordination and streaming classes typically bubble up through these commands.

How to diagnose

  1. Check the CLI exit code first. All telemetry subcommands return 0 on success; any other value indicates a command-specific failure. Run the command directly in your shell so that you can see the error output before a calling script swallows it.

  2. Inspect help_queries.jsonl for format issues. The default telemetry log is a JSONL file. If a line is malformed, or was written with a log version other than 1.0, deserialization in from_dict() raises KeyError or ValueError. Open the file and check that each line is valid JSON containing the expected fields.

  3. Verify Redis connectivity for coordination and streaming failures. CoordinationSignals, HeartbeatCoordinator, ApprovalGate, and EventStreamer all depend on a memory backend (Redis). If the backend is unavailable, every method that reads or writes signals will fail. Confirm the Redis connection before debugging the telemetry logic itself.

  4. Check TTL expiry for None or False returns from wait_for_signal() and is_agent_alive(). When a TTL key has expired, these methods return None or False rather than raising an exception. If your workflow is silently skipping a coordination step, check whether the ttl_seconds on the relevant CoordinationSignal (default 60) is too short for your workload, and whether get_stale_agents(threshold_seconds=60.0) reports the affected agent.

  5. Audit approval responses before acting on them. When ApprovalGate.request_approval() returns, check response.approved explicitly. A False value can mean either a rejection or a timeout; response.reason and response.responder distinguish between the two.

  6. Enable DEBUG logging. Most telemetry classes log their state as they run. Set the log level to DEBUG and re-run the failing scenario. The state logged in the lead-up to the failure usually shows whether the root cause is a missing key, an expired TTL, or a backend connection error.
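
Steps 1, 2, and 6 above can be roughly sketched with the Python standard library alone. This is a minimal illustration, not part of the telemetry package: the helper names run_and_report, find_bad_lines, and enable_debug_logging are hypothetical, the list of required fields is something you supply yourself (this page does not enumerate them), and the sketch assumes the telemetry classes log through Python's standard logging module.

```python
import json
import logging
import subprocess
import sys
from pathlib import Path


def run_and_report(argv):
    """Step 1: run a command directly and return (exit_code, stderr_text).

    argv is the full command as a list of strings; nothing here is
    specific to the telemetry CLI.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stderr


def find_bad_lines(path, required_fields):
    """Step 2: return (line_number, problem) pairs for suspicious JSONL lines.

    required_fields is an assumption supplied by the caller; use the
    fields that your from_dict() actually reads.
    """
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored in this sketch
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            missing = sorted(set(required_fields) - set(record))
            if missing:
                problems.append((number, f"missing fields: {', '.join(missing)}"))
    return problems


def enable_debug_logging():
    """Step 6: send DEBUG-level records from all loggers to stderr.

    force=True (Python 3.8+) replaces any handlers configured earlier.
    """
    logging.basicConfig(level=logging.DEBUG, force=True)
```

A typical session would call enable_debug_logging() once, pass the failing command to run_and_report() to capture its exit code and error output, and then run find_bad_lines("help_queries.jsonl", required_fields=[...]) with the field names your log records are expected to carry. An empty result from find_bad_lines() only means that every line parsed and had those fields; it does not check the log version.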

Source files

Tags: telemetry, metrics