AnthonyUtt/yenna

yenna

A self-hosted AI virtual assistant designed for consumer hardware running a local LLM. Every architectural decision is shaped by two constraints:

  1. Context discipline. Self-hosted models degrade as their context fills. Yenna caps the effective window (default 16k tokens), delegates work to short-lived subagents, and auto-compacts when budget is exceeded.
  2. Determinism beats agent autonomy. Small models loop, forget, and rarely take initiative. The harness — not the model — enforces invariants: consecutive-tool-call detection, memory recall/ingestion at lifecycle boundaries, scheduled check-ins via heartbeats.

Quickstart

You need Bun and an OpenAI-compatible LLM endpoint (Ollama, vLLM, llama.cpp server, LM Studio, etc.).

# 1. Install workspace deps.
bun install

# 2. Drop a config in place — copy the skeleton and edit the values.
mkdir -p ~/.yenna
cp config.example.yaml ~/.yenna/config.yaml
$EDITOR ~/.yenna/config.yaml

# 3. Start the daemon.
bun run start

# 4. In another terminal, open the TUI.
bun run tui

If you don't have an LLM endpoint yet, docker compose up ollama brings one up on localhost:11434; then docker exec ollama ollama pull qwen2.5:7b (or whichever model you configured).

Architecture

                    Unix socket: ~/.yenna/yenna.sock
                    HTTP + WebSocket via Bun.serve
                                 │
   ┌──────────────┐    ┌─────────┴──────────┐    ┌──────────────────────┐
   │  TUI (Ink)   │────│    yenna daemon    │────│ Telegram (in-daemon) │
   │ packages/tui │    │  packages/core     │    │  channels/telegram   │
   └──────────────┘    │                    │    └──────────────────────┘
                       │  - agent loop      │
                       │  - hooks registry  │
                       │  - tool registry   │
                       │  - skill registry  │
                       │  - scheduler       │
                       │  - SQLite store    │
                       │  - LLM adapter     │
                       └──────────┬─────────┘
                                  │ OpenAI-compatible /v1/chat/completions, /v1/embeddings
                                  ▼
                       Self-hosted inference (Ollama / vLLM / llama.cpp / …)

Repo layout

yenna/
├── config.example.yaml           # copy to ~/.yenna/config.yaml
├── packages/
│   ├── core/                     # the daemon
│   │   ├── src/
│   │   │   ├── agent/            # run-turn loop, ContextBuilder, types
│   │   │   ├── channels/         # Channel interface, TUI + Telegram
│   │   │   ├── config/           # YAML loader, env interpolation, Zod schema
│   │   │   ├── hooks/            # registry + built-in hooks
│   │   │   ├── llm/              # Vercel AI SDK adapter + MockChatModel
│   │   │   ├── memory/           # markdown-file memory store + tools
│   │   │   ├── persistence/      # bun:sqlite + DAOs
│   │   │   ├── protocol/         # wire-protocol types (shared with SDK)
│   │   │   ├── rag/              # vector storage + cosine search
│   │   │   ├── scheduled-jobs/   # heartbeat job, signal providers
│   │   │   ├── scheduler/        # interval scheduler
│   │   │   ├── shell-policy/     # tier classifier + composition detection
│   │   │   ├── skills/           # discovery, indexing, run_skill
│   │   │   ├── tools/            # tool registry + fs / shell built-ins
│   │   │   ├── transcript/       # JSONL pretty-printer + CLI
│   │   │   ├── transport/        # Bun.serve setup + ConversationHub
│   │   │   └── web/              # search providers + web_search/fetch
│   │   ├── skills/               # bundled skills (web-research, summarize-page)
│   │   └── test/
│   ├── sdk/                      # @yenna/sdk — typed daemon client
│   └── tui/                      # @yenna/tui — Ink chat UI
└── docker-compose.yml            # ollama service + optional yenna container

Configuration

All settings live in ~/.yenna/config.yaml (or $YENNA_CONFIG_PATH). The config.example.yaml at the repo root is a complete skeleton with every field documented inline.

Secrets reference environment variables: ${env:VAR_NAME} is replaced at load time. Useful for llm.api_key, channels.telegram.bot_token, web_search.brave.api_key.
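
A minimal sketch of how ${env:VAR_NAME} interpolation could work. The interpolateEnv helper and the choice to throw on a missing variable are assumptions for illustration; the actual loader in packages/core/src/config may behave differently.

```typescript
// Hypothetical helper: replaces ${env:VAR_NAME} placeholders in a string.
// Throwing on a missing variable is an assumption, not confirmed behavior.
type Env = Record<string, string | undefined>;

const ENV_REF = /\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}/g;

function interpolateEnv(value: string, env: Env): string {
  return value.replace(ENV_REF, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return resolved;
  });
}

// Example with an explicit env map; a real loader would likely pass process.env.
const token = interpolateEnv("${env:YENNA_TELEGRAM_BOT_TOKEN}", {
  YENNA_TELEGRAM_BOT_TOKEN: "123:abc",
});
console.log(token);
```

Passing the environment in as a parameter keeps the helper easy to test without touching the real process environment.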

Tools

Built-in tools available to the agent:

  • fs_read, fs_write, fs_list, fs_mkdir — workspace-scoped filesystem
  • shell — runs via bash -c with tier-based policy gating
  • web_search, web_fetch — pluggable search provider + HTML→markdown
  • search_skills, run_skill — skill discovery + sub-agent dispatch
  • recall_memory, save_memory — RAG over markdown memory store

The shell tool classifies commands into safe / mutating / dangerous tiers and prompts for approval when something exceeds auto_approve_tiers. Chained commands (|, &&, ;, redirections) are always elevated to dangerous.
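
The tiering rule above can be sketched roughly as follows. The command lists, the classify and needsApproval names, and the regex are hypothetical stand-ins, not yenna's actual policy tables in packages/core/src/shell-policy.

```typescript
// Illustrative only: the command lists and function names are assumptions.
type Tier = "safe" | "mutating" | "dangerous";

const SAFE = new Set(["ls", "cat", "pwd", "echo"]);
const MUTATING = new Set(["mkdir", "touch", "cp", "mv"]);

// Pipes, &&, ;, and redirections mean the command is composed.
const COMPOSITION = /[|;<>]|&&/;

function classify(command: string): Tier {
  if (COMPOSITION.test(command)) return "dangerous";
  const program = command.trim().split(/\s+/)[0] ?? "";
  if (SAFE.has(program)) return "safe";
  if (MUTATING.has(program)) return "mutating";
  return "dangerous"; // unknown programs default to the strictest tier
}

function needsApproval(command: string, autoApprove: Tier[]): boolean {
  return !autoApprove.includes(classify(command));
}

console.log(classify("ls -la"));
console.log(classify("ls | wc -l"));
console.log(needsApproval("mkdir out", ["safe"]));
```

This naive regex also flags operators that appear inside quoted arguments. For a policy gate that errs toward asking the user, a false positive of that kind is the safer failure.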

Skills

A skill is a directory containing a SKILL.md with YAML frontmatter:

---
name: my-skill
description: One-line description for the agent to decide whether to use this.
tools: [shell, fs_write]   # optional; restricts the sub-agent's tools
---

# Skill body

The full markdown content is loaded as the sub-agent's system prompt.

Skills live in:

  • packages/core/skills/ (bundled with yenna)
  • ~/.agents/skills/ (user-installed, using the skills.sh convention so existing skills work unchanged)

User skills override bundled ones with the same name. The bundled set currently ships web-research and summarize-page as templates.
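
One way to picture the override rule, as a sketch: the Skill shape and the resolveSkills function are illustrative names, not the real discovery code in packages/core/src/skills.

```typescript
// Hypothetical shapes: Skill and resolveSkills are illustrative names.
interface Skill {
  name: string;
  description: string;
  tools?: string[];
  source: "bundled" | "user";
}

function resolveSkills(bundled: Skill[], user: Skill[]): Map<string, Skill> {
  const byName = new Map<string, Skill>();
  for (const skill of bundled) byName.set(skill.name, skill);
  // Inserted second, so a user skill replaces a bundled one with the same name.
  for (const skill of user) byName.set(skill.name, skill);
  return byName;
}

const resolved = resolveSkills(
  [
    { name: "web-research", description: "bundled template", source: "bundled" },
    { name: "summarize-page", description: "bundled template", source: "bundled" },
  ],
  [{ name: "web-research", description: "my customized version", tools: ["web_search"], source: "user" }],
);
console.log(resolved.get("web-research")?.source);
console.log(resolved.get("summarize-page")?.source);
```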

Memory

Markdown files in ~/.yenna/memory/. The agent automatically:

  • Recalls relevant memories at the start of each user turn (MemoryRecaller hook).
  • Ingests new memorable facts asynchronously at turn end (MemoryIngester hook — fire-and-forget LLM extraction call).

You can also inspect and hand-edit the files directly — they're not a black box.

Transcripts

Every conversation has a JSONL transcript at ~/.yenna/logs/<conversation-id>.jsonl covering all lifecycle events. To read one:

bun run --cwd packages/core transcript <conversation-id>

Channels

A channel is a delivery target for assistant messages. The TUI channel delivers via the WS stream; Telegram delivers via the Bot API.

To enable Telegram:

channels:
  telegram:
    enabled: true
    bot_token: "${env:YENNA_TELEGRAM_BOT_TOKEN}"
    authorized_chat_id: 12345678    # your Telegram user id; single-user mode

The daemon long-polls the Telegram Bot API in-process. Each chat becomes a conversation with channel: "telegram", auto-created on first message.

Heartbeats

Scheduled proactive check-ins. The agent reaches out to the user — through the configured primary channel — when criteria suggest it should. Useful for upcoming calendar events, follow-ups, or reminders.

heartbeat:
  enabled: true
  interval_ms: 1800000       # 30 minutes
  primary_channel: "telegram"
  noop_token: "HEARTBEAT_OK"

The agent's response is suppressed (not persisted, not delivered) when it contains HEARTBEAT_OK; that's how the model says "nothing to say right now." Keeping suppressed responses out of the history matters because small models with short contexts pick up patterns very quickly. A couple of messages ending in HEARTBEAT_OK in the chat history is all it takes for every subsequent message to end in HEARTBEAT_OK. This is especially pertinent when the agent is left alone for a while: it will rapidly build its own echo chamber, a feedback loop that crowds out nearly all pre-existing context and history.
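
The suppression check itself is simple. A sketch, assuming a HeartbeatConfig type that mirrors the YAML keys above and a hypothetical shouldSuppress helper:

```typescript
// Assumed names: HeartbeatConfig mirrors the YAML keys; shouldSuppress is hypothetical.
interface HeartbeatConfig {
  enabled: boolean;
  interval_ms: number;
  primary_channel: string;
  noop_token: string;
}

// True when the response carries the no-op token and should be neither
// persisted nor delivered.
function shouldSuppress(response: string, config: HeartbeatConfig): boolean {
  return response.includes(config.noop_token);
}

const heartbeat: HeartbeatConfig = {
  enabled: true,
  interval_ms: 1_800_000, // 30 minutes
  primary_channel: "telegram",
  noop_token: "HEARTBEAT_OK",
};

console.log(shouldSuppress("HEARTBEAT_OK", heartbeat));
console.log(shouldSuppress("Your 3pm meeting starts in 20 minutes.", heartbeat));
```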

Design decisions

Why hooks instead of middleware or direct callbacks

The agent loop fires lifecycle events at well-known points (turn_start, pre_llm_call, pre_tool_call, post_tool_call, pre_compaction, turn_end, …) and a HookRegistry dispatches them in priority order. The registry gives us three properties that direct callbacks don't: ordering (memory recall has to run before the prompt is built, audit logging has to see the final outcome), short-circuit results (pre_tool_call can return abort or skip_tool to alter control flow — that's how duplicate detection works), and fire-and-forget observers (the audit logger watches every event without anyone wiring it in). Middleware was considered and rejected: it tangles cross-cutting concerns with the linear request/response shape, and the agent loop isn't request/response — it's a tool-call cycle with reentrant LLM calls and conditional dispatch. Hooks are now the only extension point for cross-cutting behavior; the agent loop itself has no knowledge of memory, duplicate detection, audit logging, or heartbeat suppression. Adding a new concern means writing a hook, not editing the loop.
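
A minimal sketch of priority-ordered dispatch with short-circuit results. The event names come from the paragraph above; the types, the "lower priority number runs first" convention, and the method names are assumptions rather than yenna's actual HookRegistry API.

```typescript
// Sketch only: types, priority convention, and method names are assumptions.
type HookEvent = "turn_start" | "pre_llm_call" | "pre_tool_call" | "post_tool_call" | "turn_end";
type HookResult = { action: "continue" } | { action: "abort" | "skip_tool"; reason: string };
type Hook = (payload: unknown) => HookResult | undefined;

class HookRegistry {
  private hooks = new Map<HookEvent, { priority: number; hook: Hook }[]>();

  register(event: HookEvent, priority: number, hook: Hook): void {
    const list = this.hooks.get(event) ?? [];
    list.push({ priority, hook });
    list.sort((a, b) => a.priority - b.priority);
    this.hooks.set(event, list);
  }

  // Runs hooks in priority order; the first abort/skip_tool short-circuits.
  dispatch(event: HookEvent, payload: unknown): HookResult {
    for (const { hook } of this.hooks.get(event) ?? []) {
      const result = hook(payload);
      if (result !== undefined && result.action !== "continue") return result;
    }
    return { action: "continue" };
  }
}

const registry = new HookRegistry();
const seen = new Set<string>();
const observed: string[] = [];

// Observer: records the event and never alters control flow.
registry.register("pre_tool_call", 0, (payload) => {
  observed.push(JSON.stringify(payload));
  return undefined;
});

// Duplicate detection: skips a tool call identical to one already seen.
registry.register("pre_tool_call", 10, (payload) => {
  const key = JSON.stringify(payload);
  if (seen.has(key)) return { action: "skip_tool", reason: "duplicate call" };
  seen.add(key);
  return undefined;
});

const call = { tool: "fs_read", args: { path: "notes.md" } };
const first = registry.dispatch("pre_tool_call", call);
const second = registry.dispatch("pre_tool_call", call);
console.log(first.action, second.action);
```

The loop in this sketch only ever sees a HookResult; it has no idea that duplicate detection exists, which is the decoupling the paragraph describes.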

Why subagents with bounded tool access

A single agent with every tool in scope drowns small models. The system prompt grows linearly with the tool catalog, every tool's JSON schema eats context tokens during planning, and irrelevant tools become attractive nuisances. The run_skill tool fans work out to short-lived subagents, each loaded with one skill's markdown as its system prompt and a restricted tool set declared in the skill's frontmatter (tools: [shell, fs_write] means only those tools are available). The parent agent returns to a clean context with just the subagent's final message. Skill authors can opt into wider tool sets when they need them, but the default is narrow. run_skill itself is excluded from subagent tool sets by default, which bounds recursion depth at 1: a subagent can't spawn further subagents unless the skill explicitly opts in. That cap is a deliberate choice over arbitrary nesting: deeper agent trees are hard to debug and amplify failure modes (one bad turn loops a whole tree).
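
The tool-scoping rule can be sketched as a small filter. The subagentTools function and its allowNesting flag are hypothetical; how a skill actually opts into nesting is not specified here.

```typescript
// Hypothetical helper: computes the tool set a subagent is allowed to use.
// `allowNesting` stands in for whatever opt-in mechanism a skill really uses.
function subagentTools(catalog: string[], declared: string[], allowNesting = false): string[] {
  // Keep only tools that exist in the catalog and are declared by the skill.
  const allowed = catalog.filter((tool) => declared.includes(tool));
  // Dropping run_skill by default caps recursion depth at 1.
  return allowNesting ? allowed : allowed.filter((tool) => tool !== "run_skill");
}

const catalog = ["fs_read", "fs_write", "shell", "web_search", "run_skill"];

console.log(subagentTools(catalog, ["shell", "fs_write"]));
console.log(subagentTools(catalog, ["shell", "run_skill"]));
console.log(subagentTools(catalog, ["shell", "run_skill"], true));
```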

Why synchronous, sequential tool execution

When an LLM response contains multiple tool calls, the loop executes them one at a time and waits for each to complete before starting the next. Parallel execution would be measurably faster for independent tool calls, but three concerns kept it out. First, permission gating: a denied shell command should inform whether the agent attempts the next one — running them in parallel races the user's decision against side effects. Second, shell ordering: fs_write followed by shell running a script that reads that file has to happen in order, and the agent's request order is the only ordering signal we have. Third, persistence: the messages table appends in monotonic order and tool results are interleaved into the conversation history; parallel execution would force either out-of-order inserts (which break the compaction-cutoff invariant) or a join barrier that defeats the latency win. We'd revisit this if tool latency became the dominant cost in real workloads — at that point a per-tool parallelSafe flag plus a join-barrier dispatcher would be the natural extension.
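
A sketch of the sequential dispatch described above. ToolCall, ToolResult, and the Executor signature are assumed shapes for illustration, not the types in packages/core.

```typescript
// Sketch: ToolCall, ToolResult, and the Executor signature are assumptions.
interface ToolCall { id: string; name: string; args: Record<string, unknown>; }
interface ToolResult { id: string; output: string; }
type Executor = (call: ToolCall) => Promise<ToolResult>;

async function runSequentially(calls: ToolCall[], execute: Executor): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    // Await each call before starting the next, so results land in request order.
    results.push(await execute(call));
  }
  return results;
}

// Records start/end events to show that calls never overlap.
async function demo(): Promise<string[]> {
  const log: string[] = [];
  const execute: Executor = async (call) => {
    log.push(`start:${call.id}`);
    await Promise.resolve(); // stand-in for real async work
    log.push(`end:${call.id}`);
    return { id: call.id, output: `${call.name} done` };
  };
  await runSequentially(
    [
      { id: "a", name: "fs_write", args: { path: "run.sh" } },
      { id: "b", name: "shell", args: { command: "bash run.sh" } },
    ],
    execute,
  );
  return log;
}

demo().then((log) => console.log(log.join(" ")));
```

With Promise.all in place of the loop, both start events would be logged before either end event, which is exactly the interleaving the three concerns above rule out.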

Why markdown files for memory instead of a vector database

Memory is single-user, small (dozens to low hundreds of entries in practice), and rarely the bottleneck. Brute-force cosine similarity over embeddings stored as SQLite BLOBs returns top-K in well under a millisecond at that scale, which lets us skip an entire dependency. The user-facing payoff matters more than the perf one: memories are human-readable markdown files in ~/.yenna/memory/, not opaque rows in a vector database. You can grep, cat, hand-edit, version-control, sync via Syncthing, or just rm a file that's wrong. The index rebuilds on the next startup. This is a deliberate scale tradeoff — it would not survive multi-user deployment or memory catalogs in the tens of thousands. At that point swapping in sqlite-vec, LanceDB, or a hosted vector store is a contained change because the MemoryEmbeddingsIndex interface is narrow.
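
For a sense of how little code brute-force search needs, here is a sketch of cosine top-K over in-memory embeddings. The Entry shape and function names are assumptions; in practice the vectors would first be decoded from the SQLite BLOBs into Float32Array values.

```typescript
// Illustrative shapes; not the MemoryEmbeddingsIndex interface itself.
interface Entry { id: string; embedding: Float32Array; }

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as unrelated
}

// Scores every entry against the query and keeps the k best.
function topK(query: Float32Array, entries: Entry[], k: number): { id: string; score: number }[] {
  return entries
    .map((entry) => ({ id: entry.id, score: cosine(query, entry.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const entries: Entry[] = [
  { id: "a", embedding: new Float32Array([1, 0]) },
  { id: "b", embedding: new Float32Array([0, 1]) },
  { id: "c", embedding: new Float32Array([1, 1]) },
];
console.log(topK(new Float32Array([1, 0]), entries, 2).map((hit) => hit.id));
```

The scan is linear in the number of memories, which is why it stops being attractive once the catalog reaches the tens of thousands.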

Why a single-process daemon with in-process channels

Telegram long-polling, the scheduler, the WebSocket server, the agent loop, and the SQLite store all live in one Bun process. No message broker, no inter-service queue, no shared cache. The deployment target is a single machine — usually the same machine the user is sitting at — and the operational overhead of a broker isn't earned by anything yenna needs. Channels are objects in a registry, not network endpoints; the heartbeat job calls hub.emit() synchronously rather than enqueueing a job for a worker to drain. This collapses an entire class of failure modes (broker down, queue backpressure, serialization mismatches) and keeps the mental model small enough to hold in your head. It would not hold up if the architecture had to scale horizontally — multiple daemon instances would need a real coordination layer for conversation locking, scheduler leadership, and channel routing. That's a bridge to cross if it ever becomes a problem, not a problem to design around today.

Testing

bun test                        # unit + integration (mocked LLM)
YENNA_E2E=1 bun test            # also runs E2E against a real LLM

E2E env vars:

  • YENNA_E2E_BASE_URL (default http://localhost:11434/v1)
  • YENNA_E2E_CHAT_MODEL (default qwen2.5:7b)
  • YENNA_E2E_EMBED_MODEL (default nomic-embed-text)

License

MIT — see LICENSE.

About

Self-hosted AI Virtual Assistant, designed for local LLMs (<30B parameters)
