# Troubleshooting
Common issues and how to fix them.
## Workspace Stuck in "Provisioning"
A workspace that stays in `provisioning` for more than 30 seconds usually indicates a container startup failure.
Steps to diagnose:
- Check Docker logs for the workspace container: `docker logs <container-id>`
- Verify the workspace image exists locally: `docker images | grep workspace-template`
- Check tier resource limits -- the container may be OOM-killed on start. Review `TIER2_MEMORY_MB`/`TIER3_MEMORY_MB`/`TIER4_MEMORY_MB` values (see the sketch after this list).
- Ensure the platform can reach the Docker daemon (Docker Desktop must be running).
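As a concrete starting point, the checks above can be run in sequence. The `workspace` name filter below is an assumption about how the provisioner names its containers:

```bash
# Find the workspace container, including exited ones
# (the "workspace" name filter is an assumption about container naming)
docker ps -a --filter "name=workspace"

# Inspect why it stopped: a non-zero exit code or OOMKilled=true
# points at a startup failure or tier memory limits
docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}}'
docker logs <container-id> --tail 50
```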
## 401 Unauthorized on API Calls
Bearer tokens can expire or be revoked. Workspace tokens are also auto-revoked when a workspace is deleted.
Resolution:
- For workspace-scoped endpoints, mint a new token:

  ```bash
  # Development/staging only (hidden when MOLECULE_ENV=production)
  curl http://localhost:8080/admin/workspaces/:id/test-token
  ```

- For admin endpoints, verify your token is still valid against a known-good endpoint like `GET /health` (see the example after this list).
- Legacy workspaces (created before Phase 30.1) are grandfathered and do not require tokens on heartbeat/update-card routes.
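For example, a quick validity check with curl, assuming the same local base URL as the examples above and that `/health` accepts the bearer header:

```bash
# If this returns 200, the token is accepted; a 401 here means the
# token itself has expired or been revoked (token value is a placeholder)
curl -i -H "Authorization: Bearer $TOKEN" http://localhost:8080/health
```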
## WebSocket Shows "Reconnecting"
The canvas WebSocket connection (`/ws`) drops and retries.
Common causes:
- `CORS_ORIGINS` does not include your domain -- the WebSocket upgrade is rejected. Add your origin to the comma-separated list (see the example after this list).
- A reverse proxy or firewall is terminating the long-lived connection. Ensure WebSocket upgrade headers are forwarded.
- The platform process crashed or restarted. Check platform logs.
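For the first cause, adding your origin looks like this (the domains are illustrative):

```bash
# Comma-separated allowlist of origins (scheme + host + port)
export CORS_ORIGINS="http://localhost:3000,https://canvas.example.com"
```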
Verify connectivity:
```bash
# Quick check that the WS endpoint is reachable
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGVzdA==" \
  http://localhost:8080/ws
```

## Agent Not Responding to A2A
When one agent cannot reach another via the A2A proxy (`POST /workspaces/:id/a2a`), check communication rules.
The `CanCommunicate` access check allows:
- Same workspace (self-call)
- Siblings (same parent)
- Root-level siblings (both have no parent)
- Parent to child or child to parent
Everything else is denied. If two agents need to communicate, they must be in the same subtree.
Also verify:
- The target workspace is `online` (not `paused`, `offline`, or `provisioning`)
- The target's heartbeat is fresh (Redis TTL has not expired)
- The caller includes `X-Workspace-ID` and `Authorization: Bearer <token>` headers (see the example request after this list)
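A minimal request through the proxy might look like the following. The JSON body shape is an assumption, and the IDs and token are placeholders:

```bash
# Call the A2A proxy as the source workspace; both headers are required
curl -i -X POST http://localhost:8080/workspaces/<target-id>/a2a \
  -H "X-Workspace-ID: <caller-id>" \
  -H "Authorization: Bearer $WORKSPACE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "ping"}'
```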
## Schedule Not Firing
Cron schedules are managed by the platform scheduler subsystem.
Checklist:
- Verify the cron expression is valid (standard 5-field cron syntax)
- Confirm the workspace is `online` -- paused workspaces skip all schedules
- Check if the schedule was `skipped` due to concurrency: the scheduler skips when `active_tasks > 0`. Review schedule history: `GET /workspaces/:id/schedules/:scheduleId/history`
- Inspect `GET /admin/liveness` to ensure the scheduler subsystem is alive (age should be under 60 seconds; curl examples follow this list)
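Both checks can be made with curl. Whether these endpoints require a token depends on your deployment, so the auth header below is an assumption:

```bash
# Recent runs and skips for one schedule (IDs are placeholders)
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/workspaces/<id>/schedules/<scheduleId>/history

# Scheduler subsystem heartbeat; age should be under 60 seconds
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/admin/liveness
```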
## Channel Test Fails
Social channel integrations (Telegram, Slack, etc.) can fail for several reasons.
Diagnose:
- Verify the bot token is correct and has not been revoked by the external provider
- Check the allowlist config in the channel's JSONB settings -- messages from non-allowlisted chats are silently dropped
- Ensure the webhook URL is registered with the external platform: `POST /webhooks/:type` is the endpoint the external platform (Telegram, Slack) should send events to.
- Test the connection explicitly: `POST /workspaces/:id/channels/:channelId/test` (see the example below)
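For example (the IDs are placeholders, and the auth header is an assumption about your deployment):

```bash
# Run the channel's built-in connection test and inspect the response body
curl -i -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/workspaces/<id>/channels/<channelId>/test
```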
## Migration Crash on Boot
The platform runs all `*.up.sql` migrations on every startup (there is no `schema_migrations` tracking table yet).
Common issues:
- Migrations must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`). If a migration lacks this guard, the second boot fails.
- Before PR #212, the migration runner did not filter `.down.sql` files, causing tables to be dropped on every boot. Ensure you are running a platform version that includes this fix.
- If you see errors about duplicate columns or tables, the migration is not idempotent. Patch the `.up.sql` file to add `IF NOT EXISTS` guards.
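As a sketch, a migration guarded this way survives repeated boots. The file name and schema are purely illustrative:

```bash
# An idempotent migration: safe to run on every startup
cat > migrations/0042_add_notes.up.sql <<'SQL'
CREATE TABLE IF NOT EXISTS notes (
    id         BIGSERIAL PRIMARY KEY,
    body       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

ALTER TABLE notes ADD COLUMN IF NOT EXISTS pinned BOOLEAN NOT NULL DEFAULT FALSE;

CREATE INDEX IF NOT EXISTS idx_notes_created_at ON notes (created_at);
SQL
```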
## Canvas Blank or 502 on Tenant Deploy
In tenant mode (`platform/Dockerfile.tenant`), the Go server proxies canvas requests.
Verify:
- `CANVAS_PROXY_URL` is set and points to the running Next.js process inside the container
- Both the Go server and the Node.js process are running (check container logs for both)
- The Next.js build completed successfully during `docker build`
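A rough way to isolate which side is failing. The container name, the Next.js port, and the presence of curl inside the container are all assumptions about your deployment:

```bash
# Check that both processes logged a successful start
docker logs <tenant-container> 2>&1 | grep -Ei 'next|canvas|listen'

# Hit the Next.js process directly from inside the container,
# bypassing the Go proxy (port 3000 is an assumed default)
docker exec <tenant-container> curl -si http://localhost:3000/ | head -n 5
```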
## Plugin Install Timeout
Large plugins or slow network connections can exceed the default fetch deadline.
Adjust limits:
| Variable | Default | Description |
|---|---|---|
| `PLUGIN_INSTALL_FETCH_TIMEOUT` | `5m` | Increase for large or remote plugins |
| `PLUGIN_INSTALL_MAX_DIR_BYTES` | `104857600` (100 MiB) | Increase if the plugin tree exceeds 100 MiB |
| `PLUGIN_INSTALL_BODY_MAX_BYTES` | `65536` (64 KiB) | Increase if the install request body is large |
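For example, raising the fetch deadline and the directory cap for a large plugin (values are illustrative):

```bash
# Allow slower fetches and a plugin tree up to ~500 MiB
export PLUGIN_INSTALL_FETCH_TIMEOUT=15m
export PLUGIN_INSTALL_MAX_DIR_BYTES=524288000
```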
## Memory or Disk Usage Growing
Activity logs and structure events accumulate over time.
Tune retention:
- `ACTIVITY_RETENTION_DAYS` (default `7`) -- reduce to 3 or even 1 for high-traffic deployments
- `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`) -- reduce to run cleanup more frequently
- Monitor the `activity_logs` and `structure_events` tables directly if disk usage is a concern:

  ```sql
  SELECT pg_size_pretty(pg_total_relation_size('activity_logs'));
  SELECT pg_size_pretty(pg_total_relation_size('structure_events'));
  ```
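For a high-traffic deployment, a tighter configuration might look like this (values are illustrative):

```bash
# Keep 3 days of activity and run cleanup every 2 hours
export ACTIVITY_RETENTION_DAYS=3
export ACTIVITY_CLEANUP_INTERVAL_HOURS=2
```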
## Container Health Detection
If workspaces go offline unexpectedly (e.g., Docker Desktop crash), three layers detect the failure:
- Passive (Redis TTL): 60-second heartbeat key expires, liveness monitor triggers auto-restart
- Proactive (Health Sweep): Docker API polled every 15 seconds, catches dead containers faster than TTL expiry
- Reactive (A2A Proxy): On connection error to a workspace, checks `provisioner.IsRunning()` and triggers immediate offline + restart
If none of these are catching a dead container, check `GET /admin/liveness` to verify the health sweep and liveness monitor subsystems are running.
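To observe the passive layer directly, check the heartbeat key's remaining TTL in Redis. The key name below is an assumption about the platform's key scheme:

```bash
# A positive number means the heartbeat is alive; -2 means the key has
# expired and the liveness monitor should soon mark the workspace offline
redis-cli TTL "workspace:heartbeat:<id>"
```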