Skip to content

Observability

Genswarms exposes everything that happens in a swarm through a single event spine. Every observable state transition emits a :telemetry event, a bridge funnels those into a centralized store, and the store persists, streams, and serves them. Understanding that spine is the key to building any dashboard, monitor, or alerting on top of the framework.

The single spine

There is one rule: every observable state transition emits a :telemetry event, and nothing else logs it. A telemetry bridge funnels those events into LogStore, which both persists them (ETS + durable store) and streams them over PubSub / WebSocket.

A transition is logged in exactly one place: its emit_telemetry/2,3 call. The emitter never also calls LogStore.log for the same moment. LogStore.log is reserved for diagnostics and I/O that have no transition event: backend container ops, raw agent stdout, received messages, config-load failures. Those are single-source, so they never double up with the bridge.

emit_telemetry(:agent_started, ...)              # emitters (swarm_manager, agent_server, ...)
        |  [:genswarms, :agent, :agent_started]
        v
Genswarms.Observability.TelemetryBridge            # single :telemetry handler
        |  LogStore.log(:info, :agent, :agent_started, "agent fixer_1 started", ...)
        v
Genswarms.Observability.LogStore
        |-- ETS ring buffer        -> LogStore.query / GET /api/events (fast, in-node)
        |-- EventStore (durable)   -> cross-process / `genswarms events` CLI
        +-- PubSub {:log_event, e} -> SwarmChannel "event" / "log_entry" push (live)

To make a new transition observable, emit a telemetry event under [:genswarms, <domain>, <event>] and add it to Genswarms.Observability.TelemetryBridge known_events/0. Nothing else: no controller, no broadcast, no LogStore call at the call site.

The bridge attaches once at application start (Genswarms.Observability.TelemetryBridge.attach/0, called by Genswarms.Application.start/2 after the supervision tree is up). It attaches to the concrete [:genswarms, domain, event] triples in known_events/0 — not a prefix — so unrelated :genswarms telemetry is never swept in. The handler is wrapped in a rescue: if a translation fails it logs a warning and drops the event rather than taking down the emitting process.

The bridge derives the log level from the event name. When the level depends on the outcome (a partial swarm start, an unexpected agent exit), the emitter passes level: in the telemetry metadata to set it explicitly; the bridge strips that key before persisting, so it never leaks into the payload.

Event taxonomy

level is derived from the event name. The exact rules (TelemetryBridge.level_for/1), checked in order against the event-name string:

Substring in event name Level
contains error :error
contains failed :error
contains invalid :warning
contains not_found :warning
contains full :warning
otherwise :info

A level: key in the telemetry metadata overrides this derivation for that event.

category is the telemetry domain, with one normalization: the :router domain is mapped to the :routing category. All other domains pass through unchanged (:swarm, :agent, :object).

The full vocabulary the bridge knows (Genswarms.Observability.TelemetryBridge.known_events/0):

Domain (category) Event Level Meaning
swarm swarm_started info swarm finished starting (emitters may pass level: for a partial start)
swarm swarm_stopped info swarm torn down
agent agent_started info agent process up
agent agent_stopped info agent process exited
agent agent_error error agent backend/runtime error
agent agent_added info agent added to a running swarm
agent agent_removed info agent removed from a running swarm
agent task_sent info task delivered to an agent
agent message_delivered info message delivered to target inbox
object object_started info object handler initialized
object object_stopped info object handler stopped
object object_error error object handler crashed/errored
object object_added info object added to a running swarm
object object_removed info object removed from a running swarm
routing message_routed info direct message routed (:from, :to)
routing message_broadcast info broadcast routed (:from)
routing invalid_route warning message rejected by topology

Every event carries :swarm in its metadata. Agent events also carry :agent, and object events carry :object. The bridge lifts the swarm name into the event's :swarm field and lifts either the agent name or the object name into the event's :agent field (agent: metadata[:agent] || metadata[:object]) — there is no separate object column, so object events appear under the same agent field / -a / ?agent= filter as agents. The bridge then drops :swarm, :agent, and :object from the remaining metadata (and the :level override key) and keeps the rest as a JSON-friendly metadata blob.

Category naming note. The :object category is real — the bridge emits it and genswarms events --category object filters on it. For historical reasons the LogStore @type category typespec and the EventsController / rest-api.md category docs enumerate only backend | routing | agent | swarm | system and omit object. The omission is documentation/typespec drift, not a runtime restriction: object-category events are persisted and queryable through every path. Use the taxonomy above as the authoritative list.

Telemetry events and metrics

Raw telemetry events are emitted under [:genswarms, <domain>, <event>] with metadata that always includes :swarm (and :agent for agent events). The Genswarms.Telemetry supervisor declares a set of Telemetry.Metrics definitions (consumable by LiveDashboard or any reporter). Metric names follow genswarms.<domain>.<event>:

Metric Type Tags Source event
genswarms.swarm.swarm_started.count counter :swarm [:genswarms, :swarm, :swarm_started]
genswarms.swarm.swarm_stopped.count counter :swarm [:genswarms, :swarm, :swarm_stopped]
genswarms.swarm.agent_count last_value :swarm [:genswarms, :swarm, :agent_count]
genswarms.agent.agent_started.count counter :swarm, :agent [:genswarms, :agent, :agent_started]
genswarms.agent.agent_stopped.count counter :swarm, :agent [:genswarms, :agent, :agent_stopped]
genswarms.agent.agent_error.count counter :swarm, :agent [:genswarms, :agent, :agent_error]
genswarms.agent.task_sent.count counter :swarm, :agent [:genswarms, :agent, :task_sent]
genswarms.agent.message_delivered.count counter :swarm, :agent [:genswarms, :agent, :message_delivered]
genswarms.router.message_routed.count counter :swarm [:genswarms, :router, :message_routed]
genswarms.router.message_broadcast.count counter :swarm [:genswarms, :router, :message_broadcast]
genswarms.router.invalid_route.count counter :swarm [:genswarms, :router, :invalid_route]

The genswarms.swarm.agent_count last-value is produced by a periodic :telemetry_poller measurement (period 10s) that calls Genswarms.Telemetry.measure_swarms/0, which polls Genswarms.SwarmManager.list/0 and emits [:genswarms, :swarm, :agent_count] per swarm. (agent_count is a metrics-only event — it is not in known_events/0, so the bridge does not turn it into a LogStore event; it never appears in the queryable event stream.) Standard Phoenix and BEAM VM metrics (phoenix.endpoint.*, phoenix.router_dispatch.*, phoenix.live_view.mount.*, vm.memory.total, vm.total_run_queue_lengths.*) are also registered.

The EventStore behaviour

The durable, cross-process log sits behind one swappable interface, Genswarms.Observability.EventStore (a behaviour plus a facade). Everything that persists, reads, or tails events goes through it (LogStore, EventRelay, the controllers, the channel, and the CLI), never a concrete backend. The backend is a single config knob.

The callbacks are: persist/1, query/1, events_since/2, max_event_id/0, an optional persist_many/1 (bulk write), and an optional child_specs/0 (processes the backend needs supervised; the app splices them into its tree at boot via EventStore.child_specs/0).

@callback persist(event()) :: :ok
@callback persist_many([event()]) :: :ok                              # optional
@callback query(keyword()) :: [event()]
@callback events_since(since_id :: non_neg_integer(), limit :: pos_integer()) :: [event()]
@callback max_event_id() :: non_neg_integer()
@callback child_specs() :: [Supervisor.child_spec()]                  # optional

@optional_callbacks child_specs: 0, persist_many: 1

The facade provides safe defaults for the optional callbacks: if a backend does not export persist_many/1, EventStore.persist_many/1 falls back to N× persist/1; if it does not export child_specs/0, EventStore.child_specs/0 returns []. The event shape passed to persist/1 is a map with :level, :category, :event_type, :message and optional :swarm, :agent, :metadata; reads return the same maps additionally carrying :id and :timestamp (assigned by the backend).

Backends

Backend Role
EventStore.Sqlite Thin adapter over the events table in .genswarms/swarms.db (managed by SwarmRegistry). Stateless and synchronous — each call opens a short-lived connection — so it declares no child_specs/0 (no supervised process).
EventStore.Buffered Engine-independent write-batching decorator wrapping any inner backend. The default.

The default in config/config.exs batches writes on top of SQLite:

config :genswarms, :event_store, Genswarms.Observability.EventStore.Buffered

config :genswarms, Genswarms.Observability.EventStore.Buffered,
  inner: Genswarms.Observability.EventStore.Sqlite,
  interval_ms: 100,
  max_buffer: 1_000

EventStore.Buffered enqueues each persist/1 (and persist_many/1) into a Writer GenServer and flushes the batch via the inner backend's persist_many/1 on a 100ms timer (or sooner when max_buffer is reached). It also declares the Writer (plus any of the inner backend's own children) through its child_specs/0, so the app supervises it. This keeps disk writes off LogStore's critical path and lets the inner backend amortize them: with EventStore.Sqlite, one open -> BEGIN -> inserts -> COMMIT -> close per flush instead of a connection per event. The tradeoff is a small durability window (at most one flush interval) on a hard crash; the live in-node path (ETS + PubSub) is synchronous and unaffected. The buffer is flushed on graceful shutdown via the Writer's terminate/2.

Tests run with the plain synchronous EventStore.Sqlite backend so that persist -> query is deterministic with no buffering (config/test.exs sets config :genswarms, :event_store, Genswarms.Observability.EventStore.Sqlite).

To raise throughput under load, tune the buffer (fewer, larger commits in exchange for a slightly larger latency/durability window):

config :genswarms, Genswarms.Observability.EventStore.Buffered,
  inner: Genswarms.Observability.EventStore.Sqlite,
  interval_ms: 250,
  max_buffer: 5_000

Because every caller goes through the facade, a future EventStore.Postgres or a Redis/streaming backend can be swapped in transparently: only the backend module changes, not the emitters, the bridge, the channel, or EventRelay.

Cross-process event stream

A BEAM's in-memory machinery (PubSub, the process Registry, the ETS LogStore) is node-local and invisible from another OS process. Genswarms supports two deployment shapes.

Co-located: the swarm runs in the same BEAM as the Phoenix endpoint (for example, started in-process via POST /api/swarms). Live PubSub, the WebSocket stream, and the live snapshot endpoints all work directly. Nothing special is needed.

Monitor + daemons: the usual shape at scale. Each swarm runs as its own daemon (genswarms start, its own BEAM), and a separate monitor/API node observes all of them. The only thing the processes share is the SQLite events table in .genswarms/swarms.db.

daemon swarm A --+  emit_telemetry -> LogStore -> SQLite events  --+
daemon swarm B --+                                                 +--> shared .genswarms/swarms.db
daemon swarm C --+                                                 --+
                                                                   |
                  monitor / API node ---- EventRelay polls events_since --+
                                                                   |       |
                  REST /api/events  -- reads SQLite ---------------+       v
                  WS swarm:<name>   <- EventRelay re-broadcasts {:log_event} onto
                                       the same LogStore PubSub topics -> SwarmChannel push
  • GET /api/events (and the swarm/agent variants) read SQLite, so they surface every swarm, daemon or in-process.
  • Genswarms.Observability.EventRelay runs on the monitor node. It tails new SQLite rows every 500ms (the :interval default; batches of 500 via EventStore.events_since/2) and re-broadcasts them onto the in-node PubSub topics (log_store:events and log_store:events:<swarm>), mirroring LogStore.broadcast_event/1 exactly, so the existing SwarmChannel pushes them to WebSocket clients live, with no clustering required. It starts from the current tip on boot (relays only new events going forward; history comes from the subscribe-time snapshot). Latency is approximately the poll interval.
  • A WebSocket client gets recent history from the snapshot on subscribe (also read from SQLite) and the live tail from the relay.

The relay is started by Genswarms.Application.start_web_server/1 (maybe_start_event_relay/0) — and only there, so daemons (which never start the web server) do not run it. Run it only on a monitor/API node that does not host swarms in-process: there the in-node LogStore never broadcasts swarm events (they happen in the daemons), so the relay is the sole live source and there is no double-delivery. To disable it explicitly, set:

config :genswarms, :event_relay, false

Still node-local (known limits)

  • Live process-state pulls (GET /objects/:name, /agents/:name) call into the in-node Registry, so they only reach swarms in the same BEAM. For daemon swarms, rely on the event stream (and GET /swarms/:name, which has a SQLite fallback) instead of synchronous state pulls.
  • Object internal state changes do not emit events (only object lifecycle does), so a live object-state feed is not available: poll GET /objects/:name in a co-located setup, or have the object emit on change.
  • For true sub-second push or cross-host fan-out without polling, swap Phoenix.PubSub to its Redis adapter; the single spine means only the transport changes, not the emitters, bridge, or channel.

Querying events

CLI

genswarms events reads the durable spine, so it surfaces daemon swarms running in other BEAMs by reading the events table in .genswarms/swarms.db.

genswarms events                      # recent events across all swarms
genswarms events -s my-swarm          # one swarm
genswarms events --category routing   # filter by category (backend|routing|agent|object|swarm|system)
genswarms events --errors             # errors only
genswarms events --follow             # stream in real time

The --category values map 1:1 to the taxonomy above (and include object). See cli.md for the full command reference.

REST API

History endpoints are backed by the same spine and read from SQLite, so they cover every swarm:

Endpoint Returns
GET /api/events recent events, filterable by level/category/swarm/agent/event_type/minutes/limit
GET /api/swarms/:name/events events for one swarm
GET /api/swarms/:name/agents/:agent_name/events events for one agent

Snapshot endpoints (current state) complement the history:

Endpoint Returns
GET /api/swarms/:name swarm status, agents, objects, counts
GET /api/swarms/:name/topology topology adjacency
GET /api/swarms/:name/objects objects + lifecycle state
GET /api/swarms/:name/objects/:object_name one object's live domain state
GET /api/swarms/:name/agents/:agent_name one agent's status

See rest-api.md for full details.

Real-time (WebSocket / PubSub)

On the swarm:<name> channel, subscribe and patch a view from the push stream. The channel subscribes to the per-swarm PubSub topics on join (swarm:<name>, plus :output, :routing, and :status sub-topics) and pushes these server→client messages (all confirmed in SwarmChannel.handle_info/2):

Push Source
event, log_entry LogStore (the whole taxonomy, after subscribe_events / subscribe_logs)
agent_output agent stdout
agent_status agent state transition
message_routed, message_broadcast router
swarm_started, swarm_stopped swarm lifecycle
agent_added, agent_removed, topology_changed dynamic mutations

Within a single BEAM, code can subscribe directly with Genswarms.Observability.LogStore.subscribe/0 (all events as {:log_event, event}) or LogStore.subscribe/1 (one swarm). See websocket.md for the channel protocol.

Building a dashboard

A dashboard is a consumer, not framework code (the project is API-first and headless by design; no HTML ships here):

  1. Bootstrap from the snapshot endpoints (status + topology + objects).
  2. Open the swarm:<name> channel, subscribe_events for the taxonomy, and patch the view from the push stream.
  3. Domain-specific concepts (for example user "sessions" or conversation transcripts) live in the consumer and read from the consumer's own store; the framework stays generic and exposes only generic object state via the introspection endpoint above.

See also

  • cli.mdgenswarms events and the full CLI reference
  • rest-api.md/api/events and snapshot endpoints
  • websocket.md — real-time channel protocol