Molecule AI

Observability

Monitor agent activity, LLM traces, and platform health.

Overview

Molecule AI provides multiple layers of observability -- from real-time WebSocket events on the canvas to structured activity logs, LLM traces, Prometheus metrics, and admin health endpoints.

Activity Logs

Every significant action in the platform is recorded in the activity_logs table. Query logs for a specific workspace:

GET /workspaces/:id/activity

Activity types include:

  • A2A communications -- request/response capture with duration and method
  • Task updates -- agent-reported task status changes
  • Agent logs -- structured log entries from workspace runtimes
  • Errors -- failures with error_detail for debugging

Filter by source to separate user-agent chat (source=canvas) from agent-to-agent traffic (source=agent).

Activity logs are automatically cleaned up based on ACTIVITY_RETENTION_DAYS (default 7). The cleanup job runs every ACTIVITY_CLEANUP_INTERVAL_HOURS (default 6).
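The source filter and retention window can be sketched client-side. A minimal sketch, assuming hypothetical entry fields (`source`, `created_at`) that are illustrative rather than a documented schema:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical activity-log entries; field names are assumptions.
logs = [
    {"source": "canvas", "type": "a2a", "created_at": "2024-05-01T12:00:00+00:00"},
    {"source": "agent", "type": "task_update", "created_at": "2024-05-06T08:00:00+00:00"},
]

def filter_by_source(entries, source):
    """Separate user-agent chat (canvas) from agent-to-agent traffic (agent)."""
    return [e for e in entries if e["source"] == source]

def retention_cutoff(now, retention_days=7):
    """Entries older than ACTIVITY_RETENTION_DAYS are eligible for cleanup."""
    return now - timedelta(days=retention_days)

now = datetime(2024, 5, 7, tzinfo=timezone.utc)
cutoff = retention_cutoff(now)                 # 2024-04-30 with the default 7 days
canvas_logs = filter_by_source(logs, "canvas")
```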

LLM Traces

Molecule AI integrates with Langfuse for LLM observability. Langfuse runs as part of the infrastructure stack on port 3001, backed by ClickHouse for efficient trace storage.

View traces for a specific workspace:

GET /workspaces/:id/traces

The Langfuse UI at http://localhost:3001 provides:

  • Token usage and cost tracking per workspace
  • Latency breakdowns for LLM calls
  • Prompt/completion pairs for debugging
  • Trace timelines showing multi-step agent reasoning
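Per-workspace token accounting like the Langfuse UI's can be approximated from exported trace records. A sketch under assumed field names (`workspace_id`, `usage`) that are illustrative, not Langfuse's actual export schema:

```python
# Hypothetical trace records; field names are assumptions for illustration.
traces = [
    {"workspace_id": "ws-1", "usage": {"input_tokens": 900, "output_tokens": 100}},
    {"workspace_id": "ws-1", "usage": {"input_tokens": 300, "output_tokens": 200}},
    {"workspace_id": "ws-2", "usage": {"input_tokens": 50, "output_tokens": 50}},
]

def tokens_per_workspace(traces):
    """Sum input and output tokens per workspace across all traces."""
    totals = {}
    for t in traces:
        u = t["usage"]
        totals[t["workspace_id"]] = (
            totals.get(t["workspace_id"], 0) + u["input_tokens"] + u["output_tokens"]
        )
    return totals

# tokens_per_workspace(traces) → {"ws-1": 1500, "ws-2": 100}
```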

Prometheus Metrics

The platform exposes Prometheus-format metrics at:

GET /metrics

This endpoint requires no authentication and is safe to scrape. Metrics are in Prometheus text format (v0.0.4) and include:

  • Request counts by method, path, and status code
  • Request latency histograms
  • Active WebSocket connections
  • Workspace status counts

Configure your Prometheus instance to scrape http://localhost:8080/metrics at your preferred interval.
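The text format is line-oriented and easy to inspect by hand. A minimal parser sketch for the Prometheus text format (v0.0.4); the metric names in the sample are illustrative, not the platform's actual metric names:

```python
def parse_prometheus_text(body):
    """Parse Prometheus text-format exposition: skip comments and blank
    lines, return {metric_with_labels: value}."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples

# Example exposition; metric names are hypothetical.
body = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",code="200"} 42
websocket_connections_active 3
"""
metrics = parse_prometheus_text(body)
```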

Admin Liveness

The liveness endpoint reports the health of every supervised subsystem:

GET /admin/liveness

This endpoint requires AdminAuth (bearer token). It returns a supervised.Snapshot() for each subsystem with ages -- how long since each subsystem last reported healthy. Use this to debug stuck schedulers, stalled heartbeat goroutines, or unresponsive health sweeps before diving into logs.
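A staleness check over the reported ages can triage a liveness response quickly. This sketch assumes a simplified payload shape (subsystem name mapped to seconds since last healthy report); the real supervised.Snapshot() structure may differ:

```python
# Hypothetical liveness payload: subsystem → seconds since last healthy report.
snapshot = {
    "scheduler": 2.1,
    "heartbeat_sweeper": 341.0,
    "health_sweep": 5.4,
}

def stale_subsystems(snapshot, max_age_seconds=60):
    """Flag subsystems whose last healthy report is older than the threshold."""
    return sorted(name for name, age in snapshot.items() if age > max_age_seconds)

# stale_subsystems(snapshot) → ["heartbeat_sweeper"]
```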

WebSocket Events

The canvas receives real-time updates via WebSocket at /ws. Every state change in the platform is broadcast to connected clients:

  • WORKSPACE_ONLINE -- Workspace registers successfully
  • WORKSPACE_OFFLINE -- Heartbeat TTL expires or health sweep detects dead container
  • WORKSPACE_DEGRADED -- Error rate exceeds threshold
  • WORKSPACE_RECOVERED -- Error rate drops back to normal
  • WORKSPACE_REMOVED -- Workspace deleted
  • HEARTBEAT -- Periodic heartbeat from workspace
  • A2A_RESPONSE -- Agent-to-agent message received
  • AGENT_MESSAGE -- Agent pushes a message to the user

Events flow through Redis pub/sub to ensure all platform instances broadcast consistently.
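A canvas client typically dispatches on the event type to keep local workspace state current. A minimal dispatcher sketch; the payload shape (`{"type": ..., "workspace_id": ...}`) is an assumption for illustration:

```python
import json

def handle_event(raw, state):
    """Update a local workspace-status map from one WebSocket message."""
    event = json.loads(raw)
    ws = event["workspace_id"]
    kind = event["type"]
    if kind in ("WORKSPACE_ONLINE", "WORKSPACE_RECOVERED"):
        state[ws] = "online"
    elif kind == "WORKSPACE_OFFLINE":
        state[ws] = "offline"
    elif kind == "WORKSPACE_DEGRADED":
        state[ws] = "degraded"
    elif kind == "WORKSPACE_REMOVED":
        state.pop(ws, None)
    return state

state = {}
handle_event('{"type": "WORKSPACE_ONLINE", "workspace_id": "ws-1"}', state)
handle_event('{"type": "WORKSPACE_DEGRADED", "workspace_id": "ws-1"}', state)
# state → {"ws-1": "degraded"}
```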

Structure Events

The structure_events table is an append-only audit log of every structural change in the platform. Each event is:

  1. Inserted into the database via broadcaster.RecordAndBroadcast()
  2. Published to Redis pub/sub
  3. Relayed to WebSocket clients
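The three steps above can be sketched with in-memory stand-ins for the database, the Redis channel, and the connected WebSocket clients; the function name mirrors broadcaster.RecordAndBroadcast() but the implementation here is illustrative only:

```python
# In-memory stand-ins for the database, Redis pub/sub, and two WebSocket clients.
db_rows, channel, clients = [], [], [[], []]

def record_and_broadcast(event):
    db_rows.append(event)        # 1. insert into structure_events
    channel.append(event)        # 2. publish to Redis pub/sub
    for client in clients:       # 3. relay to connected WebSocket clients
        client.append(event)

record_and_broadcast({"type": "WORKSPACE_ONLINE", "workspace_id": "ws-1"})
```

Recording before broadcasting keeps the audit log authoritative: every event a client sees has already been durably stored.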

Query events for a specific workspace or globally:

GET /events/:workspaceId    # Workspace-specific
GET /events                 # All events

Both endpoints require AdminAuth.

Session Search

Search through chat history for a workspace:

GET /workspaces/:id/session-search?q=deployment+error

This searches across both user-agent conversations and agent-to-agent A2A traffic stored in the activity logs.
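Search terms must be percent-encoded in the query string; standard URL encoding turns "deployment error" into `deployment+error`, as shown in the endpoint above. A small helper sketch:

```python
from urllib.parse import urlencode

def session_search_url(base, workspace_id, query):
    """Build the session-search URL with a percent-encoded query."""
    return f"{base}/workspaces/{workspace_id}/session-search?" + urlencode({"q": query})

url = session_search_url("http://localhost:8080", "ws-1", "deployment error")
# url → "http://localhost:8080/workspaces/ws-1/session-search?q=deployment+error"
```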

Current Task Visibility

Each workspace reports its current task via heartbeat. This is visible in two places:

  • Canvas node -- the workspace card on the canvas shows the current task text
  • Heartbeat data -- GET /registry/discover/:id includes current_task in the workspace info

When active_tasks drops to zero, the current task field clears and the idle loop (if configured) begins its countdown.
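The heartbeat rule above can be sketched as a state update. Field names (`active_tasks`, `current_task`, `idle_countdown_started`) are illustrative, not the platform's exact schema:

```python
def apply_heartbeat(workspace, active_tasks, current_task):
    """When active_tasks drops to zero, the current task clears and the
    idle countdown (if an idle loop is configured) begins."""
    workspace["active_tasks"] = active_tasks
    if active_tasks == 0:
        workspace["current_task"] = None
        workspace["idle_countdown_started"] = True
    else:
        workspace["current_task"] = current_task
        workspace["idle_countdown_started"] = False
    return workspace

busy = apply_heartbeat({}, 2, "deploy api")   # task text shown on the canvas node
idle = apply_heartbeat({}, 0, "deploy api")   # task clears, countdown begins
```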

Schedule Run History

For workspaces with cron schedules, inspect past runs:

GET /workspaces/:id/schedules/:scheduleId/history

Each history entry includes:

  • Execution timestamp
  • Status (success, failed, skipped)
  • Duration
  • error_detail when the run failed (populated by scheduler.fireSchedule)

A status of skipped means the workspace was busy (active tasks > 0) when the schedule fired and the concurrency-aware scheduler chose not to queue the prompt.
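The concurrency-aware decision described above reduces to a small rule. A sketch, with the function name and `run_ok` parameter as illustrative assumptions rather than the scheduler's actual API:

```python
def schedule_outcome(active_tasks, run_ok=True):
    """A busy workspace (active_tasks > 0) yields 'skipped'; otherwise
    the run is recorded as 'success' or 'failed'."""
    if active_tasks > 0:
        return "skipped"
    return "success" if run_ok else "failed"

# schedule_outcome(2) → "skipped"
# schedule_outcome(0, run_ok=False) → "failed"
```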
