Telemetry errors
Common error signatures
Telemetry errors typically fall into three categories: failures in CLI command execution, failures in agent coordination and heartbeat tracking, and failures in approval gate or event streaming operations.
- CLI commands returning non-zero: main() and subcommands such as cmd_telemetry_show(), cmd_telemetry_savings(), and cmd_telemetry_cache_stats() all return int exit codes. A return value other than 0 indicates a failure in data retrieval or rendering.
- KeyError or ValueError in CoordinationSignal.from_dict() or AgentHeartbeat.from_dict(): these dataclass deserializers expect specific keys (signal_id, signal_type, source_agent, agent_id, status, progress, etc.). Missing or mistyped fields raise these exceptions when parsing stored Redis data.
- TimeoutError or None return from CoordinationSignals.wait_for_signal(): the method polls for up to timeout seconds (default 30.0) and returns None if no matching signal arrives. Callers that don't handle None may hit an AttributeError or silently skip work downstream.
- Stale or missing heartbeat: HeartbeatCoordinator.is_agent_alive() returns False when the TTL key has expired in Redis. If your workflow assumes an agent is running, this produces incorrect branching rather than a raised exception, making it easy to miss.
- ApprovalGate.request_approval() timing out: if no human responds within timeout_seconds, the gate returns an ApprovalResponse with approved=False. Workflows that don't check response.approved before proceeding will act on a rejected or timed-out request.
- EventStreamer publish/consume failures: publish_event() and consume_events() depend on a live Redis Streams connection. A connection error here produces an exception that propagates to the caller with no automatic retry.
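Several of the signatures above are silent (a None or False return) rather than raised exceptions. The sketch below shows one way a caller might make them loud. It does not import the telemetry package: FakeApprovalResponse is a hypothetical stand-in whose field names follow the text, the helper names are illustrative, and the split of key names between signals and heartbeats is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence

# Key names come from the list above; which keys belong to which dataclass
# is assumed here, and the real classes may require more fields.
SIGNAL_KEYS = ("signal_id", "signal_type", "source_agent")
HEARTBEAT_KEYS = ("agent_id", "status", "progress")


def missing_keys(payload: Mapping[str, Any], required: Sequence[str]) -> list[str]:
    """Return the required keys that are absent from a stored payload."""
    return [key for key in required if key not in payload]


@dataclass
class FakeApprovalResponse:
    """Hypothetical stand-in for ApprovalResponse (not the real class)."""
    approved: bool
    reason: str = ""
    responder: Optional[str] = None


def require_signal(signal: Optional[Any], description: str) -> Any:
    """Turn a silent None from a wait-style call into an explicit error."""
    if signal is None:
        raise TimeoutError(f"no signal received: {description}")
    return signal


def require_approval(response: Any) -> None:
    """Stop the workflow unless the response was explicitly approved."""
    if not response.approved:
        raise PermissionError(
            f"not approved (reason={response.reason!r}, "
            f"responder={response.responder!r})"
        )
```

Calling missing_keys() on a raw payload before handing it to a from_dict() deserializer gives a readable list of absent fields instead of a bare KeyError, and wrapping wait-style calls in require_signal() keeps a timeout from turning into an AttributeError several frames later.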
Where errors originate
The following CLI entry points are the most common raise sites. Failures in the underlying coordination and streaming classes typically bubble up through these commands.
- main(): top-level telemetry CLI dispatcher; catches unhandled exceptions from all subcommands.
- cmd_sonnet_opus_analysis(): reads Sonnet 4.5 → Opus 4.5 fallback data; fails if the underlying telemetry log is missing or malformed.
- cmd_file_test_status() and cmd_test_status(): query per-file and aggregate test status; fail if the telemetry store is unavailable.
- cmd_tier1_status() and cmd_task_routing_report(): aggregate automation metrics; sensitive to incomplete or corrupt telemetry records.
- cmd_agent_performance(), cmd_telemetry_savings(), cmd_telemetry_cache_stats(): all read from the telemetry log (help_queries.jsonl by default at _DEFAULT_FILE); fail if that file is absent or written in an incompatible format (log version _LOG_VERSION = '1.0').
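Because several of these commands fail on an absent or malformed log file, a quick pre-flight scan can narrow things down before running them. The following is a minimal sketch using only the standard library; scan_jsonl is a hypothetical helper, not part of the telemetry package, and the record fields to require are left to the caller because the log schema is not listed here.

```python
import json
from pathlib import Path


def scan_jsonl(path: Path, required_keys: tuple[str, ...] = ()) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for a JSONL file; empty means clean."""
    if not path.exists():
        # Line number 0 marks a file-level problem.
        return [(0, f"file not found: {path}")]
    problems: list[tuple[int, str]] = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped rather than reported
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "record is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((number, f"missing keys: {missing}"))
    return problems
```

Pointing this at the log path in use (help_queries.jsonl by default) and printing the returned pairs shows which lines a deserializer is likely to reject.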
How to diagnose
- Check the CLI exit code first. All telemetry subcommands return 0 on success. Any other value means a command-specific failure occurred. Run the command directly in your shell to see the error output before it gets swallowed by a calling script.
- Inspect help_queries.jsonl for format issues. The default telemetry log is a JSONL file. If a line is malformed or written by a different log version than 1.0, deserialization in from_dict() will raise KeyError or ValueError. Open the file and check that each line is valid JSON containing the expected fields.
- Verify Redis connectivity for coordination and streaming failures. CoordinationSignals, HeartbeatCoordinator, ApprovalGate, and EventStreamer all depend on a memory backend (Redis). If the backend is unavailable, every method that reads or writes signals will fail. Confirm the Redis connection before debugging the telemetry logic itself.
- Check TTL expiry for None or False returns from wait_for_signal() and is_agent_alive(). These methods return None or False, rather than raising, when a TTL key has expired. If your workflow is silently skipping a coordination step, check whether the ttl_seconds on the relevant CoordinationSignal (default 60) is too short for your workload, and whether get_stale_agents(threshold_seconds=60.0) reports the affected agent.
- Audit approval responses before acting on them. When ApprovalGate.request_approval() returns, check response.approved explicitly. A False value can mean either a rejection or a timeout; response.reason and response.responder distinguish between the two.
- Enable DEBUG logging. Most telemetry classes use logging. Set the log level to DEBUG and re-run the failing scenario. Logged state leading up to the failure usually identifies whether the root cause is a missing key, an expired TTL, or a backend connection error.
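The exit-code and logging steps can be combined in a calling script so that failures are not swallowed. This is a rough sketch under stated assumptions: run_and_report and the logger name are hypothetical, and the command to run is passed in by the caller because the exact telemetry CLI invocation is not given here.

```python
import logging
import subprocess
import sys

# Root logger at DEBUG so library loggers that use `logging` also emit detail.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("telemetry_diagnosis")  # hypothetical logger name


def run_and_report(argv: list[str]) -> int:
    """Run a command, log its stderr on failure, and return its exit code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        log.error("command %s exited with %d", argv, result.returncode)
        if result.stderr:
            log.error("stderr:\n%s", result.stderr.rstrip())
    else:
        log.debug("command %s succeeded", argv)
    return result.returncode
```

For a dry run without the telemetry CLI, a stand-in child process works: run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"]) returns the child's exit code and logs it as a failure.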
Source files
src/attune/telemetry/**
Tags: telemetry, metrics