# Troubleshooting
Common issues and how to fix them.
## Workspace Stuck in "Provisioning"
A workspace that stays in `provisioning` for more than 30 seconds usually indicates a container startup failure.
Steps to diagnose:
- Check Docker logs for the workspace container: `docker logs <container-id>`
- Verify the workspace image exists locally: `docker images | grep workspace-template`
- Check tier resource limits -- the container may be OOM-killed on start. Review `TIER2_MEMORY_MB`/`TIER3_MEMORY_MB`/`TIER4_MEMORY_MB` values (see the sketch after this list).
- Ensure the platform can reach the Docker daemon (Docker Desktop must be running).
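As a concrete starting point, the checks above can be run in sequence. The `workspace` name filter below is an assumption about how the provisioner names its containers:

```bash
# Find the workspace container, including exited ones
# (the "workspace" name filter is an assumption about container naming)
docker ps -a --filter "name=workspace"

# Inspect why it stopped: a non-zero exit code or OOMKilled=true
# points at a startup failure or tier memory limits
docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}}'
docker logs <container-id> --tail 50
```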
## 401 Unauthorized on API Calls
Bearer tokens can expire or be revoked. Workspace tokens are also auto-revoked when a workspace is deleted.
Resolution:
- For workspace-scoped endpoints, mint a new token:

  ```bash
  # Development/staging only (hidden when MOLECULE_ENV=production)
  curl http://localhost:8080/admin/workspaces/:id/test-token
  ```

- For admin endpoints, verify your token is still valid against a known-good endpoint like `GET /health` (see the example after this list).
- Legacy workspaces (created before Phase 30.1) are grandfathered and do not require tokens on heartbeat/update-card routes.
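For example, a quick validity check with curl, assuming the same local base URL as the examples above and that `/health` accepts the bearer header:

```bash
# If this returns 200, the token is accepted; a 401 here means the
# token itself has expired or been revoked (token value is a placeholder)
curl -i -H "Authorization: Bearer $TOKEN" http://localhost:8080/health
```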
## WebSocket Shows "Reconnecting"
The canvas WebSocket connection (`/ws`) drops and retries.
Common causes:
- `CORS_ORIGINS` does not include your domain -- the WebSocket upgrade is rejected. Add your origin to the comma-separated list (see the example after this list).
- A reverse proxy or firewall is terminating the long-lived connection. Ensure WebSocket upgrade headers are forwarded.
- The platform process crashed or restarted. Check platform logs.
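For the first cause, adding your origin looks like this (the domains are illustrative):

```bash
# Comma-separated allowlist of origins (scheme + host + port)
export CORS_ORIGINS="http://localhost:3000,https://canvas.example.com"
```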
Verify connectivity:
```bash
# Quick check that the WS endpoint is reachable
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGVzdA==" \
  http://localhost:8080/ws
```

## Agent Not Responding to A2A
When one agent cannot reach another via the A2A proxy (`POST /workspaces/:id/a2a`), check communication rules.
The `CanCommunicate` access check allows:
- Same workspace (self-call)
- Siblings (same parent)
- Root-level siblings (both have no parent)
- Parent to child or child to parent
Everything else is denied. If two agents need to communicate, they must be in the same subtree.
Also verify:
- The target workspace is `online` (not `paused`, `offline`, or `provisioning`)
- The target's heartbeat is fresh (Redis TTL has not expired)
- The caller includes `X-Workspace-ID` and `Authorization: Bearer <token>` headers (see the example request after this list)
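A minimal request through the proxy might look like the following. The JSON body shape is an assumption, and the IDs and token are placeholders:

```bash
# Call the A2A proxy as the source workspace; both headers are required
curl -i -X POST http://localhost:8080/workspaces/<target-id>/a2a \
  -H "X-Workspace-ID: <caller-id>" \
  -H "Authorization: Bearer $WORKSPACE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "ping"}'
```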
## Schedule Not Firing
Cron schedules are managed by the platform scheduler subsystem.
Checklist:
- Verify the cron expression is valid (standard 5-field cron syntax)
- Confirm the workspace is `online` -- paused workspaces skip all schedules
- Check if the schedule was `skipped` due to concurrency: the scheduler skips when `active_tasks > 0`. Review schedule history: `GET /workspaces/:id/schedules/:scheduleId/history`
- Inspect `GET /admin/liveness` to ensure the scheduler subsystem is alive (age should be under 60 seconds; curl examples follow this list)
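Both checks can be made with curl. Whether these endpoints require a token depends on your deployment, so the auth header below is an assumption:

```bash
# Recent runs and skips for one schedule (IDs are placeholders)
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/workspaces/<id>/schedules/<scheduleId>/history

# Scheduler subsystem heartbeat; age should be under 60 seconds
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/admin/liveness
```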
## Channel Test Fails
Social channel integrations (Telegram, Slack, etc.) can fail for several reasons.
Diagnose:
- Verify the bot token is correct and has not been revoked by the external provider
- Check the allowlist config in the channel's JSONB settings -- messages from non-allowlisted chats are silently dropped
- Ensure the webhook URL is registered with the external platform: `POST /webhooks/:type` is the endpoint the external platform (Telegram, Slack) should send events to.
- Test the connection explicitly: `POST /workspaces/:id/channels/:channelId/test` (see the example below)
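For example (the IDs are placeholders, and the auth header is an assumption about your deployment):

```bash
# Run the channel's built-in connection test and inspect the response body
curl -i -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/workspaces/<id>/channels/<channelId>/test
```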
## Migration Crash on Boot
The platform runs all `*.up.sql` migrations on every startup (there is no `schema_migrations` tracking table yet).
Common issues:
- Migrations must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`). If a migration lacks this guard, the second boot fails.
- Before PR #212, the migration runner did not filter `.down.sql` files, causing tables to be dropped on every boot. Ensure you are running a platform version that includes this fix.
- If you see errors about duplicate columns or tables, the migration is not idempotent. Patch the `.up.sql` file to add `IF NOT EXISTS` guards.
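As a sketch, a migration guarded this way survives repeated boots. The file name and schema are purely illustrative:

```bash
# An idempotent migration: safe to run on every startup
cat > migrations/0042_add_notes.up.sql <<'SQL'
CREATE TABLE IF NOT EXISTS notes (
    id         BIGSERIAL PRIMARY KEY,
    body       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

ALTER TABLE notes ADD COLUMN IF NOT EXISTS pinned BOOLEAN NOT NULL DEFAULT FALSE;

CREATE INDEX IF NOT EXISTS idx_notes_created_at ON notes (created_at);
SQL
```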
## Canvas Blank or 502 on Tenant Deploy
In tenant mode (`platform/Dockerfile.tenant`), the Go server proxies canvas requests.
Verify:
- `CANVAS_PROXY_URL` is set and points to the running Next.js process inside the container
- Both the Go server and the Node.js process are running (check container logs for both)
- The Next.js build completed successfully during `docker build`
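A rough way to isolate which side is failing. The container name, the Next.js port, and the presence of curl inside the container are all assumptions about your deployment:

```bash
# Check that both processes logged a successful start
docker logs <tenant-container> 2>&1 | grep -Ei 'next|canvas|listen'

# Hit the Next.js process directly from inside the container,
# bypassing the Go proxy (port 3000 is an assumed default)
docker exec <tenant-container> curl -si http://localhost:3000/ | head -n 5
```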
## Plugin Install Timeout
Large plugins or slow network connections can exceed the default fetch deadline.
Adjust limits:
| Variable | Default | Description |
|---|---|---|
| `PLUGIN_INSTALL_FETCH_TIMEOUT` | `5m` | Increase for large or remote plugins |
| `PLUGIN_INSTALL_MAX_DIR_BYTES` | `104857600` (100 MiB) | Increase if the plugin tree exceeds 100 MiB |
| `PLUGIN_INSTALL_BODY_MAX_BYTES` | `65536` (64 KiB) | Increase if the install request body is large |
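For example, raising the fetch deadline and the directory cap for a large plugin (values are illustrative):

```bash
# Allow slower fetches and a plugin tree up to ~500 MiB
export PLUGIN_INSTALL_FETCH_TIMEOUT=15m
export PLUGIN_INSTALL_MAX_DIR_BYTES=524288000
```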
## Memory or Disk Usage Growing
Activity logs and structure events accumulate over time.
Tune retention:
- `ACTIVITY_RETENTION_DAYS` (default `7`) -- reduce to 3 or even 1 for high-traffic deployments
- `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`) -- reduce to run cleanup more frequently
- Monitor the `activity_logs` and `structure_events` tables directly if disk usage is a concern:

  ```sql
  SELECT pg_size_pretty(pg_total_relation_size('activity_logs'));
  SELECT pg_size_pretty(pg_total_relation_size('structure_events'));
  ```
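For a high-traffic deployment, a tighter configuration might look like this (values are illustrative):

```bash
# Keep 3 days of activity and run cleanup every 2 hours
export ACTIVITY_RETENTION_DAYS=3
export ACTIVITY_CLEANUP_INTERVAL_HOURS=2
```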
## Container Health Detection
If workspaces go offline unexpectedly (e.g., Docker Desktop crash), three layers detect the failure:
- Passive (Redis TTL): 60-second heartbeat key expires, liveness monitor triggers auto-restart
- Proactive (Health Sweep): Docker API polled every 15 seconds, catches dead containers faster than TTL expiry
- Reactive (A2A Proxy): On connection error to a workspace, checks `provisioner.IsRunning()` and triggers immediate offline + restart
If none of these are catching a dead container, check `GET /admin/liveness` to verify the health sweep and liveness monitor subsystems are running.
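To observe the passive layer directly, check the heartbeat key's remaining TTL in Redis. The key name below is an assumption about the platform's key scheme:

```bash
# A positive number means the heartbeat is alive; -2 means the key has
# expired and the liveness monitor should soon mark the workspace offline
redis-cli TTL "workspace:heartbeat:<id>"
```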