Observability
Monitor agent activity, LLM traces, and platform health.
Overview
Molecule AI provides multiple layers of observability -- from real-time WebSocket events on the canvas to structured activity logs, LLM traces, Prometheus metrics, and admin health endpoints.
Activity Logs
Every significant action in the platform is recorded in the `activity_logs` table. Query logs for a specific workspace:
```
GET /workspaces/:id/activity
```

Activity types include:
- A2A communications -- request/response capture with duration and method
- Task updates -- agent-reported task status changes
- Agent logs -- structured log entries from workspace runtimes
- Errors -- failures with `error_detail` for debugging
Filter by source to separate user-agent chat (`source=canvas`) from agent-to-agent traffic (`source=agent`).
Activity logs are automatically cleaned up based on `ACTIVITY_RETENTION_DAYS` (default 7). The cleanup job runs every `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default 6).
LLM Traces
Molecule AI integrates with Langfuse for LLM observability. Langfuse runs as part of the infrastructure stack on port 3001, backed by ClickHouse for efficient trace storage.
View traces for a specific workspace:
```
GET /workspaces/:id/traces
```

The Langfuse UI at http://localhost:3001 provides:
- Token usage and cost tracking per workspace
- Latency breakdowns for LLM calls
- Prompt/completion pairs for debugging
- Trace timelines showing multi-step agent reasoning
Prometheus Metrics
The platform exposes Prometheus-format metrics at:
```
GET /metrics
```

This endpoint requires no authentication and is safe to scrape. Metrics are in Prometheus text format (v0.0.4) and include:
- Request counts by method, path, and status code
- Request latency histograms
- Active WebSocket connections
- Workspace status counts
Configure your Prometheus instance to scrape http://localhost:8080/metrics at your preferred interval.
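A minimal scrape-config fragment for `prometheus.yml`; the job name and 15-second interval are placeholders, not platform requirements:

```yaml
scrape_configs:
  - job_name: "molecule-ai"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]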
Admin Liveness
The liveness endpoint reports the health of every supervised subsystem:
```
GET /admin/liveness
```

This endpoint requires AdminAuth (bearer token). It returns a `supervised.Snapshot()` for each subsystem with ages -- how long since each subsystem last reported healthy. Use this to debug stuck schedulers, stalled heartbeat goroutines, or unresponsive health sweeps before diving into logs.
WebSocket Events
The canvas receives real-time updates via WebSocket at /ws. Every state change in the platform is broadcast to connected clients:
| Event | Trigger |
|---|---|
| `WORKSPACE_ONLINE` | Workspace registers successfully |
| `WORKSPACE_OFFLINE` | Heartbeat TTL expires or health sweep detects dead container |
| `WORKSPACE_DEGRADED` | Error rate exceeds threshold |
| `WORKSPACE_RECOVERED` | Error rate drops back to normal |
| `WORKSPACE_REMOVED` | Workspace deleted |
| `HEARTBEAT` | Periodic heartbeat from workspace |
| `A2A_RESPONSE` | Agent-to-agent message received |
| `AGENT_MESSAGE` | Agent pushes a message to the user |
Events flow through Redis pub/sub to ensure all platform instances broadcast consistently.
Structure Events
The `structure_events` table is an append-only audit log of every structural change in the platform. Each event is:

- Inserted into the database via `broadcaster.RecordAndBroadcast()`
- Published to Redis pub/sub
- Relayed to WebSocket clients
Query events for a specific workspace or globally:
```
GET /events/:workspaceId   # Workspace-specific
GET /events                # All events
```

Both endpoints require AdminAuth.
Session Search
Search through chat history for a workspace:
```
GET /workspaces/:id/session-search?q=deployment+error
```

This searches across both user-agent conversations and agent-to-agent A2A traffic stored in the activity logs.
Current Task Visibility
Each workspace reports its current task via heartbeat. This is visible in two places:
- Canvas node -- the workspace card on the canvas shows the current task text
- Heartbeat data -- `GET /registry/discover/:id` includes `current_task` in the workspace info

When `active_tasks` drops to zero, the current task field clears and the idle loop (if configured) begins its countdown.
Schedule Run History
For workspaces with cron schedules, inspect past runs:
```
GET /workspaces/:id/schedules/:scheduleId/history
```

Each history entry includes:
- Execution timestamp
- Status (`success`, `failed`, `skipped`)
- Duration
- `error_detail` when the run failed (populated by `scheduler.fireSchedule`)
A status of `skipped` means the workspace was busy (active tasks > 0) when the schedule fired and the concurrency-aware scheduler chose not to queue the prompt.