Observability¶
Genswarms exposes everything that happens in a swarm through a single event spine.
Every observable state transition emits a :telemetry event, a bridge funnels
those into a centralized store, and the store persists, streams, and serves them.
Understanding that spine is the key to building any dashboard, monitor, or alerting
on top of the framework.
The single spine¶
There is one rule: every observable state transition emits a :telemetry event,
and nothing else logs it. A telemetry bridge funnels those events into LogStore,
which both persists them (ETS + durable store) and streams them over PubSub /
WebSocket.
A transition is logged in exactly one place: its emit_telemetry/2,3 call. The
emitter never also calls LogStore.log for the same moment. LogStore.log is
reserved for diagnostics and I/O that have no transition event: backend container
ops, raw agent stdout, received messages, config-load failures. Those are
single-source, so they never double up with the bridge.
emit_telemetry(:agent_started, ...) # emitters (swarm_manager, agent_server, ...)
| [:genswarms, :agent, :agent_started]
v
Genswarms.Observability.TelemetryBridge # single :telemetry handler
| LogStore.log(:info, :agent, :agent_started, "agent fixer_1 started", ...)
v
Genswarms.Observability.LogStore
|-- ETS ring buffer -> LogStore.query / GET /api/events (fast, in-node)
|-- EventStore (durable) -> cross-process / `genswarms events` CLI
+-- PubSub {:log_event, e} -> SwarmChannel "event" / "log_entry" push (live)
To make a new transition observable, emit a telemetry event under
[:genswarms, <domain>, <event>] and add it to
Genswarms.Observability.TelemetryBridge known_events/0. Nothing else: no
controller, no broadcast, no LogStore call at the call site.
The bridge attaches once at application start
(Genswarms.Observability.TelemetryBridge.attach/0, called by
Genswarms.Application.start/2 after the supervision tree is up). It attaches to
the concrete [:genswarms, domain, event] triples in known_events/0 — not a
prefix — so unrelated :genswarms telemetry is never swept in. The handler is
wrapped in a rescue: if a translation fails it logs a warning and drops the event
rather than taking down the emitting process.
The bridge derives the log level from the event name. When the level depends on
the outcome (a partial swarm start, an unexpected agent exit), the emitter passes
level: in the telemetry metadata to set it explicitly; the bridge strips that key
before persisting, so it never leaks into the payload.
Event taxonomy¶
level is derived from the event name. The exact rules
(TelemetryBridge.level_for/1), checked in order against the event-name string:
| Substring in event name | Level |
|---|---|
contains error |
:error |
contains failed |
:error |
contains invalid |
:warning |
contains not_found |
:warning |
contains full |
:warning |
| otherwise | :info |
A level: key in the telemetry metadata overrides this derivation for that event.
category is the telemetry domain, with one normalization: the :router domain
is mapped to the :routing category. All other domains pass through unchanged
(:swarm, :agent, :object).
The full vocabulary the bridge knows
(Genswarms.Observability.TelemetryBridge.known_events/0):
| Domain (category) | Event | Level | Meaning |
|---|---|---|---|
swarm |
swarm_started |
info | swarm finished starting (emitters may pass level: for a partial start) |
swarm |
swarm_stopped |
info | swarm torn down |
agent |
agent_started |
info | agent process up |
agent |
agent_stopped |
info | agent process exited |
agent |
agent_error |
error | agent backend/runtime error |
agent |
agent_added |
info | agent added to a running swarm |
agent |
agent_removed |
info | agent removed from a running swarm |
agent |
task_sent |
info | task delivered to an agent |
agent |
message_delivered |
info | message delivered to target inbox |
object |
object_started |
info | object handler initialized |
object |
object_stopped |
info | object handler stopped |
object |
object_error |
error | object handler crashed/errored |
object |
object_added |
info | object added to a running swarm |
object |
object_removed |
info | object removed from a running swarm |
routing |
message_routed |
info | direct message routed (:from, :to) |
routing |
message_broadcast |
info | broadcast routed (:from) |
routing |
invalid_route |
warning | message rejected by topology |
Every event carries :swarm in its metadata. Agent events also carry :agent,
and object events carry :object. The bridge lifts the swarm name into the event's
:swarm field and lifts either the agent name or the object name into the
event's :agent field (agent: metadata[:agent] || metadata[:object]) — there is
no separate object column, so object events appear under the same agent field /
-a / ?agent= filter as agents. The bridge then drops :swarm, :agent, and
:object from the remaining metadata (and the :level override key) and keeps the
rest as a JSON-friendly metadata blob.
Category naming note. The
:objectcategory is real — the bridge emits it andgenswarms events --category objectfilters on it. For historical reasons theLogStore@type categorytypespec and theEventsController/ rest-api.mdcategorydocs enumerate onlybackend | routing | agent | swarm | systemand omitobject. The omission is documentation/typespec drift, not a runtime restriction: object-category events are persisted and queryable through every path. Use the taxonomy above as the authoritative list.
Telemetry events and metrics¶
Raw telemetry events are emitted under [:genswarms, <domain>, <event>] with
metadata that always includes :swarm (and :agent for agent events). The
Genswarms.Telemetry supervisor declares a set of Telemetry.Metrics definitions
(consumable by LiveDashboard or any reporter). Metric names follow
genswarms.<domain>.<event>:
| Metric | Type | Tags | Source event |
|---|---|---|---|
genswarms.swarm.swarm_started.count |
counter | :swarm |
[:genswarms, :swarm, :swarm_started] |
genswarms.swarm.swarm_stopped.count |
counter | :swarm |
[:genswarms, :swarm, :swarm_stopped] |
genswarms.swarm.agent_count |
last_value | :swarm |
[:genswarms, :swarm, :agent_count] |
genswarms.agent.agent_started.count |
counter | :swarm, :agent |
[:genswarms, :agent, :agent_started] |
genswarms.agent.agent_stopped.count |
counter | :swarm, :agent |
[:genswarms, :agent, :agent_stopped] |
genswarms.agent.agent_error.count |
counter | :swarm, :agent |
[:genswarms, :agent, :agent_error] |
genswarms.agent.task_sent.count |
counter | :swarm, :agent |
[:genswarms, :agent, :task_sent] |
genswarms.agent.message_delivered.count |
counter | :swarm, :agent |
[:genswarms, :agent, :message_delivered] |
genswarms.router.message_routed.count |
counter | :swarm |
[:genswarms, :router, :message_routed] |
genswarms.router.message_broadcast.count |
counter | :swarm |
[:genswarms, :router, :message_broadcast] |
genswarms.router.invalid_route.count |
counter | :swarm |
[:genswarms, :router, :invalid_route] |
The genswarms.swarm.agent_count last-value is produced by a periodic
:telemetry_poller measurement (period 10s) that calls
Genswarms.Telemetry.measure_swarms/0, which polls Genswarms.SwarmManager.list/0
and emits [:genswarms, :swarm, :agent_count] per swarm. (agent_count is a
metrics-only event — it is not in known_events/0, so the bridge does not turn
it into a LogStore event; it never appears in the queryable event stream.)
Standard Phoenix and BEAM VM metrics (phoenix.endpoint.*,
phoenix.router_dispatch.*, phoenix.live_view.mount.*, vm.memory.total,
vm.total_run_queue_lengths.*) are also registered.
The EventStore behaviour¶
The durable, cross-process log sits behind one swappable interface,
Genswarms.Observability.EventStore (a behaviour plus a facade). Everything that
persists, reads, or tails events goes through it (LogStore, EventRelay, the
controllers, the channel, and the CLI), never a concrete backend. The backend is a
single config knob.
The callbacks are: persist/1, query/1, events_since/2, max_event_id/0, an
optional persist_many/1 (bulk write), and an optional child_specs/0 (processes
the backend needs supervised; the app splices them into its tree at boot via
EventStore.child_specs/0).
@callback persist(event()) :: :ok
@callback persist_many([event()]) :: :ok # optional
@callback query(keyword()) :: [event()]
@callback events_since(since_id :: non_neg_integer(), limit :: pos_integer()) :: [event()]
@callback max_event_id() :: non_neg_integer()
@callback child_specs() :: [Supervisor.child_spec()] # optional
@optional_callbacks child_specs: 0, persist_many: 1
The facade provides safe defaults for the optional callbacks: if a backend does not
export persist_many/1, EventStore.persist_many/1 falls back to N× persist/1;
if it does not export child_specs/0, EventStore.child_specs/0 returns []. The
event shape passed to persist/1 is a map with :level, :category,
:event_type, :message and optional :swarm, :agent, :metadata; reads
return the same maps additionally carrying :id and :timestamp (assigned by the
backend).
Backends¶
| Backend | Role |
|---|---|
EventStore.Sqlite |
Thin adapter over the events table in .genswarms/swarms.db (managed by SwarmRegistry). Stateless and synchronous — each call opens a short-lived connection — so it declares no child_specs/0 (no supervised process). |
EventStore.Buffered |
Engine-independent write-batching decorator wrapping any inner backend. The default. |
The default in config/config.exs batches writes on top of SQLite:
config :genswarms, :event_store, Genswarms.Observability.EventStore.Buffered
config :genswarms, Genswarms.Observability.EventStore.Buffered,
inner: Genswarms.Observability.EventStore.Sqlite,
interval_ms: 100,
max_buffer: 1_000
EventStore.Buffered enqueues each persist/1 (and persist_many/1) into a
Writer GenServer and flushes the batch via the inner backend's persist_many/1
on a 100ms timer (or sooner when max_buffer is reached). It also declares the
Writer (plus any of the inner backend's own children) through its child_specs/0,
so the app supervises it. This keeps disk writes off LogStore's critical path and
lets the inner backend amortize them: with EventStore.Sqlite, one
open -> BEGIN -> inserts -> COMMIT -> close per flush instead of a connection per
event. The tradeoff is a small durability window (at most one flush interval) on a
hard crash; the live in-node path (ETS + PubSub) is synchronous and unaffected. The
buffer is flushed on graceful shutdown via the Writer's terminate/2.
Tests run with the plain synchronous EventStore.Sqlite backend so that
persist -> query is deterministic with no buffering (config/test.exs sets
config :genswarms, :event_store, Genswarms.Observability.EventStore.Sqlite).
To raise throughput under load, tune the buffer (fewer, larger commits in exchange for a slightly larger latency/durability window):
config :genswarms, Genswarms.Observability.EventStore.Buffered,
inner: Genswarms.Observability.EventStore.Sqlite,
interval_ms: 250,
max_buffer: 5_000
Because every caller goes through the facade, a future EventStore.Postgres or a
Redis/streaming backend can be swapped in transparently: only the backend module
changes, not the emitters, the bridge, the channel, or EventRelay.
Cross-process event stream¶
A BEAM's in-memory machinery (PubSub, the process Registry, the ETS LogStore)
is node-local and invisible from another OS process. Genswarms supports two
deployment shapes.
Co-located: the swarm runs in the same BEAM as the Phoenix endpoint (for example,
started in-process via POST /api/swarms). Live PubSub, the WebSocket stream, and
the live snapshot endpoints all work directly. Nothing special is needed.
Monitor + daemons: the usual shape at scale. Each swarm runs as its own daemon
(genswarms start, its own BEAM), and a separate monitor/API node observes all of
them. The only thing the processes share is the SQLite events table in
.genswarms/swarms.db.
daemon swarm A --+ emit_telemetry -> LogStore -> SQLite events --+
daemon swarm B --+ +--> shared .genswarms/swarms.db
daemon swarm C --+ --+
|
monitor / API node ---- EventRelay polls events_since --+
| |
REST /api/events -- reads SQLite ---------------+ v
WS swarm:<name> <- EventRelay re-broadcasts {:log_event} onto
the same LogStore PubSub topics -> SwarmChannel push
GET /api/events(and the swarm/agent variants) read SQLite, so they surface every swarm, daemon or in-process.Genswarms.Observability.EventRelayruns on the monitor node. It tails new SQLite rows every 500ms (the:intervaldefault; batches of 500 viaEventStore.events_since/2) and re-broadcasts them onto the in-node PubSub topics (log_store:eventsandlog_store:events:<swarm>), mirroringLogStore.broadcast_event/1exactly, so the existingSwarmChannelpushes them to WebSocket clients live, with no clustering required. It starts from the current tip on boot (relays only new events going forward; history comes from the subscribe-time snapshot). Latency is approximately the poll interval.- A WebSocket client gets recent history from the snapshot on subscribe (also read from SQLite) and the live tail from the relay.
The relay is started by Genswarms.Application.start_web_server/1
(maybe_start_event_relay/0) — and only there, so daemons (which never start the
web server) do not run it. Run it only on a monitor/API node that does not host
swarms in-process: there the in-node LogStore never broadcasts swarm events (they
happen in the daemons), so the relay is the sole live source and there is no
double-delivery. To disable it explicitly, set:
Still node-local (known limits)¶
- Live process-state pulls (
GET /objects/:name,/agents/:name) call into the in-nodeRegistry, so they only reach swarms in the same BEAM. For daemon swarms, rely on the event stream (andGET /swarms/:name, which has a SQLite fallback) instead of synchronous state pulls. - Object internal state changes do not emit events (only object lifecycle does), so
a live object-state feed is not available: poll
GET /objects/:namein a co-located setup, or have the object emit on change. - For true sub-second push or cross-host fan-out without polling, swap
Phoenix.PubSubto its Redis adapter; the single spine means only the transport changes, not the emitters, bridge, or channel.
Querying events¶
CLI¶
genswarms events reads the durable spine, so it surfaces daemon swarms running in
other BEAMs by reading the events table in .genswarms/swarms.db.
genswarms events # recent events across all swarms
genswarms events -s my-swarm # one swarm
genswarms events --category routing # filter by category (backend|routing|agent|object|swarm|system)
genswarms events --errors # errors only
genswarms events --follow # stream in real time
The --category values map 1:1 to the taxonomy above (and include object). See
cli.md for the full command reference.
REST API¶
History endpoints are backed by the same spine and read from SQLite, so they cover every swarm:
| Endpoint | Returns |
|---|---|
GET /api/events |
recent events, filterable by level/category/swarm/agent/event_type/minutes/limit |
GET /api/swarms/:name/events |
events for one swarm |
GET /api/swarms/:name/agents/:agent_name/events |
events for one agent |
Snapshot endpoints (current state) complement the history:
| Endpoint | Returns |
|---|---|
GET /api/swarms/:name |
swarm status, agents, objects, counts |
GET /api/swarms/:name/topology |
topology adjacency |
GET /api/swarms/:name/objects |
objects + lifecycle state |
GET /api/swarms/:name/objects/:object_name |
one object's live domain state |
GET /api/swarms/:name/agents/:agent_name |
one agent's status |
See rest-api.md for full details.
Real-time (WebSocket / PubSub)¶
On the swarm:<name> channel, subscribe and patch a view from the push stream.
The channel subscribes to the per-swarm PubSub topics on join (swarm:<name>,
plus :output, :routing, and :status sub-topics) and pushes these
server→client messages (all confirmed in SwarmChannel.handle_info/2):
| Push | Source |
|---|---|
event, log_entry |
LogStore (the whole taxonomy, after subscribe_events / subscribe_logs) |
agent_output |
agent stdout |
agent_status |
agent state transition |
message_routed, message_broadcast |
router |
swarm_started, swarm_stopped |
swarm lifecycle |
agent_added, agent_removed, topology_changed |
dynamic mutations |
Within a single BEAM, code can subscribe directly with
Genswarms.Observability.LogStore.subscribe/0 (all events as {:log_event, event})
or LogStore.subscribe/1 (one swarm). See websocket.md for the
channel protocol.
Building a dashboard¶
A dashboard is a consumer, not framework code (the project is API-first and headless by design; no HTML ships here):
- Bootstrap from the snapshot endpoints (status + topology + objects).
- Open the
swarm:<name>channel,subscribe_eventsfor the taxonomy, and patch the view from the push stream. - Domain-specific concepts (for example user "sessions" or conversation transcripts) live in the consumer and read from the consumer's own store; the framework stays generic and exposes only generic object state via the introspection endpoint above.
See also¶
- cli.md —
genswarms eventsand the full CLI reference - rest-api.md —
/api/eventsand snapshot endpoints - websocket.md — real-time channel protocol