Troubleshooting¶
Common problems running Genswarms and how to fix them. Most issues fall into agent startup, message routing, backend setup, task delivery, or the API server.
Before digging in, two commands surface most problems:
genswarms status [name] # Swarm/agent lifecycle state
genswarms events --errors # Recent error events across all swarms
Agent not starting¶
- Confirm the `subzeroclaw` binary is reachable. The bwrap backend searches in this order: explicit config (`subzeroclaw_path`), `../subzeroclaw/subzeroclaw` (a sibling checkout), the `SUBZEROCLAW_PATH` env var, then `PATH`. If none resolve to a regular file, the agent fails to start.
- Verify your LLM provider key is set (`SUBZEROCLAW_API_KEY`), since agents need it to call the model. (If you are running without an LLM for testing, set `SUBZEROCLAW_MOCK_SCRIPT` instead so subzeroclaw returns canned responses.)
- Inspect the swarm and agent state with `genswarms status [name]`.
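To check by hand which binary would be picked up, the documented search order can be reproduced in a few lines of shell. This is only a sketch of the order described in the list, not the backend's actual lookup code; the function name `find_subzeroclaw` is hypothetical, and its single argument stands in for the `subzeroclaw_path` config value.

```shell
# Sketch of the documented search order; not the backend's real implementation.
# $1 is the explicit subzeroclaw_path config value (may be empty).
find_subzeroclaw() {
  local configured="$1" candidate
  for candidate in \
    "$configured" \
    "../subzeroclaw/subzeroclaw" \
    "${SUBZEROCLAW_PATH:-}" \
    "$(command -v subzeroclaw 2>/dev/null || true)"
  do
    # Skip empty entries; accept only regular files, as the text describes.
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "subzeroclaw not found in any documented location" >&2
  return 1
}

# Usage (run from the directory Genswarms is started in):
#   find_subzeroclaw ""    # prints the first match or fails
```

If the function fails, none of the four locations holds a regular file, which matches the condition under which the agent fails to start.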
Messages not routing¶
- Make sure the topology allows the edge `source -> target`. The `Router` only routes along configured topology edges (system objects `:metrics`, `:tick`, and `:gateway` are always allowed without an explicit edge).
- Check the agent is emitting the correct `@agent:` syntax, for example `@coder: please implement this`. Use `@all:` to broadcast to all connected agents.
- Review the message log (the `limit` query param defaults to 100):
curl http://localhost:4000/api/swarms/example-swarm/messages
curl "http://localhost:4000/api/swarms/example-swarm/messages?limit=20"
- As an alternative to `@agent:` syntax, agents can drop a JSON file (`{"to":"target","content":"msg"}`) into `{workspace}/.outbox/`; the LogWatcher polls that directory and routes it. Inside a container, the `swarm-msg send <target> <msg>` helper writes these files for you (it JSON-encodes the message and writes it into `/workspace/.outbox/`).
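To test outbox routing without an agent in the loop, a message file can be written by hand. The sketch below is a hypothetical stand-in for what `swarm-msg send` is described as doing. The function names and the file name pattern are assumptions (the source only specifies the JSON shape and the directory), and the escaping handles backslashes and double quotes only, so it suits single-line messages.

```shell
# Hypothetical helper: drop a message file into {workspace}/.outbox/.
# The file name pattern is an assumption, not taken from Genswarms.
json_escape() {
  # Escapes backslashes and double quotes only; enough for single-line text.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

outbox_send() {
  local workspace="$1" target="$2" content="$3" file
  mkdir -p "$workspace/.outbox"
  file="$workspace/.outbox/msg-$(date +%s)-$$.json"
  printf '{"to":"%s","content":"%s"}\n' \
    "$(json_escape "$target")" "$(json_escape "$content")" > "$file"
  # Print the path so the caller can confirm what was written.
  printf '%s\n' "$file"
}

# Usage:
#   outbox_send /path/to/workspace coder "please implement this"
```

If the message still does not arrive after the file is written, the topology edge from the sending agent to the target is the next thing to check.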
SSH backend fails¶
- Confirm key-based SSH works first: `ssh user@host` should connect without a password prompt.
- Verify the remote `subzeroclaw` path is correct on the target host. On NixOS machines the backend defaults to skills at `/var/lib/subzeroclaw/skills` and runs the agent as the `subzeroclaw` user (via `sudo -u`); for non-NixOS hosts set `nixos: false` in the backend opts so it uses `~/.subzeroclaw/skills` and runs as the login user.
- Ensure the remote skills/workspace directory is writable for the SSH user, since skills are copied over via SFTP at startup.
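The key-based login check can be made non-interactive, which is closer to how a backend would connect. A minimal sketch, assuming OpenSSH on the client; the function name `check_ssh_key_login` is hypothetical. `BatchMode=yes` makes `ssh` fail instead of falling back to a password prompt.

```shell
# Returns 0 only if key-based login works; with BatchMode, ssh fails
# rather than prompting for a password.
check_ssh_key_login() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage:
#   check_ssh_key_login user@host && echo "key-based SSH OK"
```

A non-zero exit here usually means the key is not authorized on the host, or the host is unreachable within the timeout.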
Docker backend fails¶
- Check the Docker daemon is up: `docker ps`.
- Confirm the agent image exists: `docker images`. Build images with `nix build .#agentContainer-<preset>` and `docker load < result` (presets: `base`, `web`, `code`, `data`, `python`, `node`, `full`). If the expected image is missing, the backend tries to build it via `nix` and otherwise falls back to `szc-agent-base:latest`.
- Inspect a container's logs directly with `docker logs`. Genswarms names containers `szc-{swarm}-{agent}`.
- Containers are run with `--rm`, so a crashed agent leaves no container behind. Catch the failure in the event log instead, for example with `genswarms events --errors`.
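Because container names are derived from the swarm and agent names, the log lookup is easy to script. The sketch below builds the name from the `szc-{swarm}-{agent}` pattern and passes it to `docker logs`; the function names are hypothetical, and `example-swarm` and `coder` are placeholder names borrowed from the examples elsewhere on this page.

```shell
# Container names follow the szc-{swarm}-{agent} pattern.
container_name() {
  printf 'szc-%s-%s\n' "$1" "$2"
}

# Show the last 100 log lines of a running agent container.
agent_logs() {
  docker logs --tail 100 "$(container_name "$1" "$2")"
}

# Usage:
#   agent_logs example-swarm coder
```

This only works while the container is still running; once it has exited, `--rm` has already removed it and the event log is the remaining source.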
Tasks not delivered to daemon swarms¶
Daemon swarms (started with genswarms start) receive tasks through a SQLite-backed queue, not directly. The daemon polls the queue every 500ms.
- Confirm the daemon is actually running: `genswarms status`.
- Look for queued/processed task activity in the event log with `genswarms events`.
- Inspect the queue itself in `.genswarms/swarms.db` (the `tasks` table) to confirm rows are inserted with status `pending` and later flipped to `processed`.
- Check for errors with `genswarms events --errors`.
Valid `--category` values: `backend`, `routing`, `agent`, `object`, `swarm`, `system`. Add `-s <swarm>` to scope to one swarm. `genswarms events` performs a one-shot query and prints the matching events (default limit 50); it does not continuously tail.
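A quick way to see whether tasks are stuck is to count queue rows per status with the `sqlite3` command-line tool. This is a sketch under one assumption: that the `tasks` table has a column named `status`, which is inferred from the `pending`/`processed` wording and should be checked against the real schema (for example with `.schema tasks`). The function name is hypothetical.

```shell
# Count queued tasks by status, opening the database read-only.
# ASSUMPTION: the tasks table has a column called "status".
task_status_counts() {
  local db="${1:-.genswarms/swarms.db}"
  sqlite3 -readonly "$db" "SELECT status, COUNT(*) FROM tasks GROUP BY status;"
}

# Usage (from the directory that contains .genswarms/):
#   task_status_counts
```

A `pending` count that keeps growing while the daemon reports as running suggests the daemon is not consuming the queue, which narrows the problem to the poll loop rather than task submission.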
API returns errors¶
- Confirm the API server is up (the root path returns API info): `curl http://localhost:4000/`.
- If a browser frontend is failing, CORS is already permissive on the API server (`origins: "*"`, all methods and headers allowed), so a CORS rejection usually points to a wrong URL or the server being down rather than a CORS policy.
- Read the server output for the detailed error; start it in the foreground with `mix phx.server` while debugging.
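The reachability check can be wrapped so that it yields a clear pass/fail exit status. A minimal sketch, assuming the server listens on `http://localhost:4000` as in the message-log examples; the function name `api_is_up` is hypothetical.

```shell
# Exit 0 if the API root answers with a success status within 5 seconds.
api_is_up() {
  local base="${1:-http://localhost:4000}"
  # -f: fail on HTTP errors; -sS: no progress bar, but keep error messages.
  curl -fsS --max-time 5 "$base/" > /dev/null
}

# Usage:
#   api_is_up || echo "API server is not reachable"
```

Running this before debugging a frontend separates "server down or wrong URL" from genuine request errors.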
Cleaning up stuck state¶
If swarms are left in a stopped or crashed state, or the database accumulates stale rows, clean them up via the mix task:
mix genswarms.clean # Remove stopped/crashed swarm entries and their files
mix genswarms.clean --all # Also clear the event log
The `clean` operation is not exposed as an escript subcommand: `genswarms clean` is not a recognized command and will error. Use the `mix genswarms.clean` task or the API route below.
Via the API, POST /api/swarms/clean removes stopped/crashed swarms (add ?all=true to also clear the event log). To remove a single swarm and all of its data, DELETE /api/swarms/:name?purge=true stops the swarm and deletes its files, events, and queued tasks.
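The two API routes can be called with `curl`. The sketch below assumes the server is at `http://localhost:4000`, as in the earlier examples; the `API_BASE` variable and the function names are hypothetical conveniences, not part of Genswarms.

```shell
# Base URL of the API server (assumed default; override via API_BASE).
API_BASE="${API_BASE:-http://localhost:4000}"

# Remove stopped/crashed swarms; pass "all" to also clear the event log.
clean_swarms() {
  local query=""
  if [ "${1:-}" = "all" ]; then
    query="?all=true"
  fi
  curl -fsS -X POST "$API_BASE/api/swarms/clean$query"
}

# Stop one swarm and delete its files, events, and queued tasks.
purge_swarm() {
  curl -fsS -X DELETE "$API_BASE/api/swarms/$1?purge=true"
}

# Usage:
#   clean_swarms            # stopped/crashed swarms only
#   clean_swarms all        # also clear the event log
#   purge_swarm example-swarm
```

Purging is destructive, so it is worth confirming the swarm name with `genswarms status` first.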