refactor: knowledge-base schema (issue #98), phases 1a–1d #99
Merged
Add the new wb_config_*, wb_rag_*, and wb_agentic_* tables alongside
the legacy wb_workspaces / wb_catalog_* / wb_vector_store_* / docs /
saved_queries set. Phase 1a is purely additive — nothing reads or
writes the new tables yet, so all 558 existing tests still pass.
New tables (created idempotently at startup):
config:
wb_config_workspaces (replaces wb_workspaces)
wb_config_knowledge_bases_by_workspace (replaces wb_catalog_by_workspace)
wb_config_chunking_service_by_workspace
wb_config_embedding_service_by_workspace
wb_config_reranking_service_by_workspace
wb_config_llm_service_by_workspace (Stage 2)
wb_config_mcp_tools_by_workspace (Stage 2)
rag:
wb_rag_documents_by_knowledge_base
wb_rag_documents_by_knowledge_base_and_status
wb_rag_documents_by_content_hash
agentic (Stage 2):
wb_agentic_agents_by_workspace
wb_agentic_conversations_by_agent (clustered created_at DESC)
wb_agentic_messages_by_conversation
Decisions baked in (see issue thread):
- KB row carries vector_collection (auto-provisioned) plus lexical_*
fields. Lexical isn't a callable service; folding it onto the KB
avoids inventing five empty endpoint columns just to fit the shape.
- KB references services by id (embedding/chunking/reranking).
Embedding service id is intended to be immutable post-create —
enforcement lands in 1b with the route rewrite.
- Agent.reranking_service_id overrides KB.reranking_service_id at
query time; KB value is the default for non-agentic search.
- Conversations are partition-clustered (created_at DESC) so list
endpoints get newest-first without server-side sort.
- saved_queries does not appear in the new schema — drop in phase 1c.
- wb_api_key_* unchanged; orthogonal to the data model.
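The reranker precedence rule in the list above can be sketched as a small resolver. The record shapes and the function name `effectiveRerankingServiceId` are hypothetical; only the two `reranking_service_id` fields and the override order come from the decisions. Treating a null agent value as "fall back to the KB" is an assumption.

```typescript
// Hypothetical, trimmed-down record shapes; the real records carry more fields.
interface KnowledgeBaseLike {
  rerankingServiceId: string | null;
}

interface AgentLike {
  rerankingServiceId: string | null;
}

// Agent value wins when present; otherwise fall back to the KB default.
// Non-agentic search passes no agent and therefore always gets the KB value.
// Assumption: a null agent value means "no override".
function effectiveRerankingServiceId(
  kb: KnowledgeBaseLike,
  agent?: AgentLike,
): string | null {
  return agent?.rerankingServiceId ?? kb.rerankingServiceId;
}
```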
Phase 1b switches the routes to read/write these tables; phase 1c
drops the legacy ones.
Python and Java runtimes get SCHEMA_NOTES.md pointing at the
TypeScript source of truth — no implementation changes there.
Adds the new control-plane CRUD surface for the knowledge-base
schema, layered on top of the table set introduced in phase 1a.
Coexists with the legacy /catalogs and /vector-stores routes —
phase 1c removes those.
Endpoints (all under /api/v1/workspaces/{workspaceUid}/):
/knowledge-bases GET POST GET-by-id PUT DELETE
/chunking-services GET POST GET-by-id PUT DELETE
/embedding-services GET POST GET-by-id PUT DELETE
/reranking-services GET POST GET-by-id PUT DELETE
Store interface extension:
Twenty new methods on ControlPlaneStore (5 each for KB,
chunking, embedding, reranking). Implemented across all three
backends — memory, file, astra. Existing legacy methods
unchanged; phase 1c removes them.
Decisions baked into the routes:
- KB.embeddingServiceId / chunkingServiceId immutable post-create.
Update schema is `.strict()` so PUT bodies that include those
keys hit a 400 instead of silently overwriting. The workspace
owner gets the gap-#4 invariant for free.
- Service deletion is refused (409) when any KB still references
the service, mirroring the existing vector-store rule.
- Vector collection is auto-provisioned on KB create using
`wb_vectors_<kb_id>` (hyphen-stripped — Astra collection names
must match `[a-zA-Z][a-zA-Z0-9_]*`).
- supportedLanguages / supportedContent / tags exposed as
`readonly string[]` (sorted, deduplicated) instead of
`ReadonlySet<string>`. JSON-friendly and matches the wire shape
one-for-one. The Astra row layer keeps Sets — astra-db-ts maps
CQL `SET<TEXT>` to native Sets — and the converter normalises
at the boundary.
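A minimal sketch of the two data-shape rules above: the collection naming convention and the Set/array normalisation at the converter boundary. The helper names are hypothetical, and reading "hyphen-stripped" as "hyphens removed" (rather than replaced) is an assumption.

```typescript
// Pattern quoted in the decision above, anchored here for a full-string match.
const ASTRA_COLLECTION_NAME = /^[a-zA-Z][a-zA-Z0-9_]*$/;

// Build `wb_vectors_<kb_id>` with hyphens removed from the id.
function vectorCollectionName(kbId: string): string {
  const name = `wb_vectors_${kbId.replace(/-/g, "")}`;
  if (!ASTRA_COLLECTION_NAME.test(name)) {
    throw new Error(`invalid collection name: ${name}`);
  }
  return name;
}

// Row layer (Set) -> record layer (sorted, deduplicated readonly array).
function setToSortedArray(values: Iterable<string>): readonly string[] {
  return Array.from(new Set(values)).sort();
}

// Record layer -> row layer.
function arrayToSet(values: readonly string[]): Set<string> {
  return new Set(values);
}
```

Normalising on the way out of the row layer means two records with the same members always serialise to the same JSON, regardless of insertion order.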
Tests:
- 5 new contract tests run against memory + file + astra fakes
(15 total) — referential integrity, cascade delete, immutability,
array round-trip, and the auto-collection naming convention.
- 8 new route-level tests covering happy-path CRUD, validation
errors, 409 service-still-referenced, and pagination.
573 → 581 tests passing; typecheck clean.
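The two route-level invariants (400 on immutable keys, 409 on a referenced service) can be illustrated without any schema library. This is a dependency-free stand-in for the `.strict()` behaviour, not the actual route code; the allow-list contents, the `HttpError` shape, and both function names are assumptions.

```typescript
// Keys a PUT body may carry in this sketch (hypothetical subset).
const UPDATABLE_KB_KEYS = new Set(["name", "description", "rerankingServiceId"]);

interface HttpError {
  status: number;
  message: string;
}

// Stand-in for a strict schema: any key outside the allow-list, including
// embeddingServiceId / chunkingServiceId, is rejected with 400.
function validateKbUpdate(body: Record<string, unknown>): HttpError | null {
  const unknownKeys = Object.keys(body).filter((k) => !UPDATABLE_KB_KEYS.has(k));
  return unknownKeys.length > 0
    ? { status: 400, message: `unrecognized keys: ${unknownKeys.join(", ")}` }
    : null;
}

interface KbRefs {
  embeddingServiceId: string;
  chunkingServiceId: string | null;
  rerankingServiceId: string | null;
}

// Service deletion is refused with 409 while any KB still points at the service.
function checkServiceDeletable(
  serviceId: string,
  kbs: readonly KbRefs[],
): HttpError | null {
  const referenced = kbs.some(
    (kb) =>
      kb.embeddingServiceId === serviceId ||
      kb.chunkingServiceId === serviceId ||
      kb.rerankingServiceId === serviceId,
  );
  return referenced
    ? { status: 409, message: `service ${serviceId} is still referenced` }
    : null;
}
```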
The new schema does not include saved queries; the proposed agent
layer (Stage 2) replaces the use case. This commit removes the
surface end-to-end:
- DELETE route file + handlers
- DELETE store interface methods (list/get/create/update/delete)
- DELETE memory/file/astra implementations
- DELETE wb_saved_queries_by_catalog DDL + bundle entry + bootstrap
- DELETE SavedQueryRecord, SavedQueryRow, OpenAPI schemas
- DELETE saved-query-related tests (5 contract tests, 14 route tests)
- DELETE catalog-saved-queries conformance scenario + fixture
- DELETE structure-test reference to the saved-queries route file
Phase 1c continues with /catalogs, /vector-stores teardown +
documents/data-plane move to KB scope.
581 → 567 tests passing; typecheck clean.
Knowledge Bases now own their underlying vector collection
end-to-end. Three changes hang off this:
1. `resolveKb` (kb-descriptor.ts) materialises a `VectorStoreRecord`-
shaped descriptor on the fly from a KB + its bound embedding /
reranking services. The driver and dispatch layers don't need to
know KBs exist — they keep consuming the legacy descriptor shape.
2. KB CRUD now provisions / drops the collection:
- POST /knowledge-bases creates the row, then `driver.createCollection`
on the data plane. On provisioning failure the KB row is rolled
back so the two planes can't drift.
- DELETE drops the collection first, then the row.
3. New data-plane endpoints under `/knowledge-bases/{kb}`:
- POST .../records upsert
- DELETE .../records/{recordId} delete
- POST .../search vector / hybrid / rerank
Coexists with /vector-stores during 1c. UI migrates to KB endpoints
in 1d; legacy routes get retired in a follow-up cleanup.
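The create/delete ordering described in item 2 can be sketched as follows. `driver.createCollection` is named in the commit message; `dropCollection`, the `Store` interface, and the `KbRow` shape are hypothetical stand-ins.

```typescript
// Hypothetical minimal interfaces; the real store and driver expose more.
interface KbRow {
  id: string;
  vectorCollection: string;
}

interface Store {
  createKnowledgeBase(row: KbRow): Promise<void>;
  deleteKnowledgeBase(id: string): Promise<void>;
}

interface Driver {
  createCollection(name: string): Promise<void>;
  dropCollection(name: string): Promise<void>; // assumed name
}

// Create: control-plane row first, then the collection; undo the row on failure.
async function createKb(store: Store, driver: Driver, row: KbRow): Promise<KbRow> {
  await store.createKnowledgeBase(row);
  try {
    await driver.createCollection(row.vectorCollection);
  } catch (err) {
    // Roll back so the control plane never advertises a KB without a collection.
    await store.deleteKnowledgeBase(row.id);
    throw err;
  }
  return row;
}

// Delete: collection first, then the row. If the drop fails, the row is
// still there, so the caller can retry the delete.
async function deleteKb(store: Store, driver: Driver, row: KbRow): Promise<void> {
  await driver.dropCollection(row.vectorCollection);
  await store.deleteKnowledgeBase(row.id);
}
```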
569 tests passing; typecheck clean.
Adds the KB-scoped document and ingest surface, mirroring the legacy
catalog-scoped routes. Both stay live during 1c/1d so the UI migrates
without flag-flipping.
New endpoints under /api/v1/workspaces/{w}/knowledge-bases/{kb}/:
GET /documents list (paginated)
POST /documents register a doc
GET /documents/{d} fetch
PUT /documents/{d} patch metadata
DELETE /documents/{d} drop doc + cascade chunks
GET /documents/{d}/chunks list by index
POST /ingest sync chunk + embed + upsert
POST /ingest?async=true 202 + job pointer
Control plane:
Five new ControlPlaneStore methods for RAG documents (list, get,
create, update, delete) backed by `wb_rag_documents_by_knowledge_base`
in astra (already provisioned in 1a) and parallel maps in memory/
file. Astra writes also maintain the by-status secondary index;
by-content-hash gets written on create + content_hash changes.
deleteKnowledgeBase now cascades RAG document rows.
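To illustrate what "maintaining the by-status secondary index" involves, here is an in-memory sketch with two maps standing in for the primary table and the by-status table. The key layout, the `RagDoc` fields, and the status values are assumptions; the real astra implementation is not shown in this PR text.

```typescript
interface RagDoc {
  kbId: string;
  documentId: string;
  status: string;
}

// Two in-memory maps standing in for the primary table and the by-status table.
class RagDocTables {
  private readonly byKb = new Map<string, RagDoc>(); // key: kbId/documentId
  private readonly byStatus = new Map<string, RagDoc>(); // key: kbId/status/documentId

  upsert(doc: RagDoc): void {
    const pk = `${doc.kbId}/${doc.documentId}`;
    const prev = this.byKb.get(pk);
    // Status is part of the index key, so a status change is delete + insert.
    if (prev && prev.status !== doc.status) {
      this.byStatus.delete(`${prev.kbId}/${prev.status}/${prev.documentId}`);
    }
    this.byKb.set(pk, doc);
    this.byStatus.set(`${doc.kbId}/${doc.status}/${doc.documentId}`, doc);
  }

  listByStatus(kbId: string, status: string): RagDoc[] {
    const prefix = `${kbId}/${status}/`;
    return Array.from(this.byStatus.entries())
      .filter(([key]) => key.startsWith(prefix))
      .map(([, doc]) => doc);
  }
}
```

Skipping the delete step would leave the document visible under both its old and new status, which is the drift the dual write is meant to prevent.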
Job layer:
JobRecord gains `knowledgeBaseUid` alongside `catalogUid`. KB-scoped
ingest jobs leave catalogUid null and vice versa. wb_jobs_by_workspace
gets a `knowledge_base_uid` column (idempotent CREATE TABLE picks it
up on next boot).
New `runKbIngestJob` async worker resolves the KB descriptor on
every run so renames / service swaps can't drift mid-flight.
Pipeline:
`runKbIngest` is the KB-scoped sibling of `runIngest`. New
`KB_SCOPE_KEY = "knowledgeBaseUid"` payload key gets stamped on
every chunk so search filters can scope to a KB.
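A minimal sketch of the scope-key stamping, assuming chunks carry a free-form `payload` object. The `Chunk` shape and both helper names are hypothetical; only `KB_SCOPE_KEY` and its value come from the commit message.

```typescript
const KB_SCOPE_KEY = "knowledgeBaseUid";

interface Chunk {
  id: string;
  text: string;
  payload: Record<string, unknown>;
}

// Stamp the owning KB onto every chunk payload before upsert.
// Returns new objects; the input chunks are left untouched.
function stampKbScope(chunks: readonly Chunk[], knowledgeBaseUid: string): Chunk[] {
  return chunks.map((chunk) => ({
    ...chunk,
    payload: { ...chunk.payload, [KB_SCOPE_KEY]: knowledgeBaseUid },
  }));
}

// Build an equality filter that restricts a search to one KB.
function kbScopeFilter(knowledgeBaseUid: string): Record<string, unknown> {
  return { [KB_SCOPE_KEY]: knowledgeBaseUid };
}
```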
Tests:
- 5 new route tests (CRUD, sync ingest, async ingest 202, delete)
- 3 new contract tests × 3 backends = 9 (RAG CRUD, 404 on unknown
KB, deleteKnowledgeBase cascade)
- 1 conformance fixture updated for the new JobRecord shape
569 → 583 tests passing; typecheck clean.
Phase 1c (drop legacy /catalogs, /vector-stores) and the UI rewire
follow.
…1c.3)
Retires the catalog/vector-store/document surface in favour of the
KB-scoped equivalents shipped in 1d.
Routes deleted:
/api/v1/workspaces/{w}/catalogs
/api/v1/workspaces/{w}/catalogs/{c}/documents
/api/v1/workspaces/{w}/catalogs/{c}/documents/search
/api/v1/workspaces/{w}/catalogs/{c}/documents/{d}/chunks
/api/v1/workspaces/{w}/catalogs/{c}/ingest
/api/v1/workspaces/{w}/vector-stores
/api/v1/workspaces/{w}/vector-stores/{vs}/...
/api/v1/workspaces/{w}/vector-stores/discoverable
/api/v1/workspaces/{w}/vector-stores/adopt
Control plane:
- 15 ControlPlaneStore methods removed (catalog × 5, vector-store
× 5, document × 5).
- Memory / file / astra implementations stripped along with their
assertCatalog / assertVectorStore / assertVectorStoreNotReferenced
helpers.
- Workspace delete now resolves each KB into a driver descriptor
via `resolveKb` and drops the underlying collection — no longer
walks `wb_vector_store_by_workspace`.
- `assertVectorStorePatchIsEmpty` removed from defaults.
Schema:
- DDL dropped: `wb_catalog_by_workspace`, `wb_vector_store_by_workspace`,
`wb_documents_by_catalog`. Astra row types and converters for
them go too.
- `wb_jobs_by_workspace.catalog_uid` column dropped from the DDL.
Idempotent CREATE TABLE doesn't drop columns, so deployed schemas
keep the dead column; it is harmless and ignored on read/write.
- JobRecord loses `catalogUid`; `knowledgeBaseUid` now the only
parent pointer.
OpenAPI:
- `Catalog`, `VectorStore`, `Document`, `AdoptableCollection`,
`AdoptCollectionInput`, `Ingest*` (catalog-shaped) schemas
retired. KB-shaped equivalents (`KnowledgeBase`, `RagDocument`,
`KbIngestRequest`, `KbAsyncIngestResponse`) are the only document
surface.
Pipeline + worker:
- `runIngest` (catalog-scoped) and `runIngestJob` removed.
- `runKbIngest` + `runKbIngestJob` are the only ingest path. The
cross-replica orphan sweeper now resumes via `runKbIngestJob`.
- `CATALOG_SCOPE_KEY` removed from payload keys; `KB_SCOPE_KEY`
is the only chunk-payload scope key now.
Tests:
- app.test.ts trimmed from 3129 → 1023 lines: ~2100 lines of legacy
catalog / vector-store / document / ingest / chunk / adopt tests
deleted. Replacement coverage lives in `knowledge-bases.test.ts`
(15 KB-scoped route tests) and `control-plane/contract.ts`
(RAG document × 3 backends).
- Contract tests trimmed: `deleteWorkspace cascades to KBs and
api keys` replaces the catalog/vector-store cascade test.
- Converters test rewritten around `RagDocumentRecord`.
- Astra-fake bundle stripped of catalogs / vectorStores /
documents tables.
Conformance:
- 10 legacy fixtures deleted (catalog-*, vector-store-*,
document-crud-basic). 5 workspace-only scenarios remain.
- scenarios.md re-numbered; KB scenarios deferred to a follow-up.
466 tests passing; typecheck clean.
The legacy `wb_workspaces` and `wb_jobs_by_workspace` tables stay in
place for now — both are used unchanged by the KB-scoped surface.
Migrating to `wb_config_workspaces` is a separate refactor.
Closes the loop on phase 1c — the React UI now speaks the new
KB-scoped surface end-to-end.
Schemas + API:
apps/web/src/lib/schemas.ts and api.ts trimmed of every
catalog/vector-store/document/saved-query type and helper. Replaced
with KB-shaped schemas (KnowledgeBaseRecord, ChunkingService /
EmbeddingService / RerankingService records, RagDocumentRecord)
and matching create/update/delete API calls for each. JobRecord
loses `catalogUid`, gains `knowledgeBaseUid` to mirror the runtime.
Hooks:
Deleted: useCatalogs, useVectorStores, useSavedQueries.
Added: useKnowledgeBases (CRUD), useServices (chunking + embedding
+ reranking, factory-style to keep three near-identical surfaces in
one file).
Updated: useDocuments / useIngest / usePlaygroundSearch — all
KB-scoped now.
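The "factory-style" idea behind `useServices` can be shown without React: one function parameterised by service kind yields three near-identical surfaces. This sketch only builds URL paths (taken from the phase 1b endpoint list); the real hook presumably wraps data fetching, and `makeServicePaths` is a hypothetical name.

```typescript
type ServiceKind = "chunking" | "embedding" | "reranking";

// One factory instead of three hand-written copies.
function makeServicePaths(kind: ServiceKind) {
  const base = (workspaceUid: string): string =>
    `/api/v1/workspaces/${workspaceUid}/${kind}-services`;
  return {
    list: (workspaceUid: string): string => base(workspaceUid),
    byId: (workspaceUid: string, serviceId: string): string =>
      `${base(workspaceUid)}/${serviceId}`,
  };
}

const chunkingPaths = makeServicePaths("chunking");
const embeddingPaths = makeServicePaths("embedding");
const rerankingPaths = makeServicePaths("reranking");
```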
Pages:
- WorkspaceDetailPage swaps VectorStoresPanel + CatalogsPanel for
ServicesPanel + KnowledgeBasesPanel.
- CatalogExplorerPage → KnowledgeBaseExplorerPage. Route changes
from /workspaces/:wid/catalogs/:cid →
/workspaces/:wid/knowledge-bases/:kbid in App.tsx.
- PlaygroundPage picks a workspace + knowledge base instead of a
workspace + vector store. The query form now takes a
`QueryFormTarget` (vectorDimension + provider description +
lexical/rerank flags) so it doesn't need to know about
descriptors.
- OnboardingPage copy updated.
Components:
Deleted: CatalogsPanel, VectorStoresPanel, AdoptCollectionDialog,
CreateCatalogDialog, CreateVectorStoreDialog, SavedQueriesSection.
Added: KnowledgeBasesPanel, ServicesPanel,
CreateKnowledgeBaseDialog.
Updated: DocumentTable / DocumentDetailDialog / IngestQueueDialog
switched to RagDocumentRecord and KB-scoped props (`knowledgeBase`
/ `knowledgeBaseUid`, `documentId` instead of `documentUid`,
`contentHash` instead of `md5Hash`).
Tests:
- QueryForm test rewritten around the new `target` prop shape.
- DocumentTable test rewritten around RagDocumentRecord.
- IngestQueueDialog test rewritten around `knowledgeBase` +
`kbIngestAsync` mock.
- golden-path.spec.ts (Playwright) walks the new flow:
onboard → service creation (via API) → KB → upsert → playground.
UI typecheck clean; all 76 unit tests green; runtime tests still
466/466 green.
CI on PR #99 was red across two checks; this commit gets them both
back to green. (Java Runtime is a pre-existing failure on main,
package org.springframework.boot.test.autoconfigure.web.servlet
missing under Spring Boot 4.0.6, and is out of scope for this PR.)
Lint, Typecheck, Test, Build:
Biome flagged a mix of formatting drift, organize-imports, unused
imports, and noNonNullAssertion in the files touched by phases
1c-1d. `biome check --write` auto-fixed almost everything; the one
manual fix was `let resolved` in `runKbIngestJob`, where Biome
refuses implicit `any` at the declaration. Now annotated as
`Awaited<ReturnType<typeof resolveKb>>`.
Web E2E (Playwright):
The golden-path spec navigated to /playground via a SPA-style nav
link click after creating services + KB through the `request`
fixture. React Query had already populated
`useKnowledgeBases(workspaceUid)` with an empty list while
`WorkspaceDetailPage` was mounted, and that cached value rode along
into the playground: the KB select stayed disabled with "No
knowledge bases yet" because the freshly-created KB was invisible to
the page's QueryClient.
Fix: hard-load `/playground` via `page.goto`, which remounts the
React app and clears the cache. Comment in the spec explains why.
466/466 runtime tests, 76/76 web tests, 1/1 e2e green locally.
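For readers unfamiliar with the annotation used in the manual fix, this sketch shows the pattern in isolation. `resolveKb` here is a hypothetical stand-in with an invented return shape, and `runJob` is not the real `runKbIngestJob`; only the `Awaited<ReturnType<typeof resolveKb>>` annotation is taken from the commit message.

```typescript
interface KbDescriptor {
  collection: string;
  vectorDimension: number;
}

// Hypothetical stand-in for the real resolveKb.
async function resolveKb(kbId: string): Promise<KbDescriptor> {
  return { collection: `wb_vectors_${kbId}`, vectorDimension: 1536 };
}

async function runJob(kbId: string): Promise<string> {
  // Declared before the try block so it stays in scope afterwards. A bare
  // `let resolved;` would be an implicit `any`; deriving the type from the
  // function keeps the annotation in sync if resolveKb's return type changes.
  let resolved: Awaited<ReturnType<typeof resolveKb>>;
  try {
    resolved = await resolveKb(kbId);
  } catch (err) {
    throw new Error(`could not resolve KB ${kbId}: ${String(err)}`);
  }
  return resolved.collection;
}
```

`Awaited` is a built-in utility type available from TypeScript 4.5 onward.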
Reformat the chunking-service post-create response check —
`expect(chunkRes.ok(), `chunking-service create: ${await chunkRes.text()}`).toBe(true)`
overflowed Biome's line length and got auto-wrapped.
Summary
Long-lived feature branch for #98. Each phase landed here as its own commit; ready to merge.
Phase 1a: schema (commit 31d8488)
11 new tables created additively: `wb_config_*`, `wb_rag_*`, `wb_agentic_*`. Row types, control-plane records (camelCase), and converters added. Pure addition, no behaviour change.
Phase 1b: CRUD routes (commit 6e30717)
Four new route files under `/api/v1/workspaces/{workspaceUid}/`:
- `/knowledge-bases`: full CRUD; auto-provisions vector collection on create
- `/chunking-services`: CRUD
- `/embedding-services`: CRUD
- `/reranking-services`: CRUD
Twenty new `ControlPlaneStore` methods, implemented across memory, file, and astra backends.
Invariants enforced at the route layer:
- `embeddingServiceId` / `chunkingServiceId` immutable after KB create. Update schema is `.strict()` so PUT bodies including those keys → 400.
- `vector_collection` auto-named `wb_vectors_<kb_id>` (hyphen-stripped, per Astra naming rules).
Set vs array trade-off: the schema uses `SET<TEXT>` for `supported_languages` / `supported_content` / `tags`. Astra row layer keeps Sets; the in-memory record exposes them as sorted, deduplicated `readonly string[]`. Converters normalise at the boundary.
Phase 1c.1: drop saved-queries (commit 92a49ff)
The new schema does not include saved queries; the proposed agent layer (Stage 2) replaces the use case. Deleted route, store methods (5 each across 3 backends), DDL + bundle entry, OpenAPI schemas, and 19 tests.
Phase 1c.2: KB data plane (commit 9ddff50)
Knowledge bases now own their underlying vector collection end-to-end:
- `resolveKb` (kb-descriptor.ts) materialises a `VectorStoreRecord`-shaped descriptor on the fly from a KB + its bound embedding/reranking services. The driver and dispatch layers consume the descriptor unchanged.
- POST `…/knowledge-bases` creates the row then `driver.createCollection`, with rollback on provisioning failure; DELETE drops the collection first, then the row.
- New data-plane endpoints under `…/knowledge-bases/{kb}`: `POST records` (upsert), `DELETE records/{id}`, `POST search` (vector / hybrid / rerank).
Phase 1d: KB-scoped documents + ingest (commit e62ecbe)
New endpoints under `…/knowledge-bases/{kb}/`: document CRUD (`/documents`, `/documents/{d}`), chunk listing (`/documents/{d}/chunks`), and sync/async ingest (`/ingest`, `/ingest?async=true`).
Control plane: five new `ControlPlaneStore` methods for RAG documents (list/get/create/update/delete) backed by `wb_rag_documents_by_knowledge_base`. Astra writes maintain the by-status secondary index; by-content-hash gets written on create + content-hash changes. `deleteKnowledgeBase` cascades RAG document rows.
Job layer: `JobRecord` gains `knowledgeBaseUid`; `wb_jobs_by_workspace` gets a `knowledge_base_uid` column. New `runKbIngestJob` async worker resolves the KB descriptor on each call so renames / service swaps can't drift mid-flight.
Pipeline: `runKbIngest` is the KB-scoped sibling of `runIngest`. New `KB_SCOPE_KEY = "knowledgeBaseUid"` payload key gets stamped on every chunk for KB-scoped filtering.
Phase 1c.3: drop legacy `/catalogs` + `/vector-stores` (commit 031af39)
Retires the catalog/vector-store/document surface in favour of the KB-scoped equivalents.
Routes deleted: `/catalogs`, `/catalogs/{c}/documents`, `/catalogs/{c}/documents/search`, `/catalogs/{c}/documents/{d}/chunks`, `/catalogs/{c}/ingest`, `/vector-stores`, `/vector-stores/{vs}/...`, `/vector-stores/discoverable`, `/vector-stores/adopt`.
Control plane: 15 `ControlPlaneStore` methods removed (catalog × 5, vector-store × 5, document × 5). Memory/file/astra implementations stripped along with their assert helpers. Workspace delete now resolves each KB into a driver descriptor via `resolveKb` and drops the underlying collection; it no longer walks `wb_vector_store_by_workspace`.
Schema: DDL dropped: `wb_catalog_by_workspace`, `wb_vector_store_by_workspace`, `wb_documents_by_catalog`. Astra row types and converters for them go too. `wb_jobs_by_workspace.catalog_uid` column dropped (idempotent CREATE TABLE doesn't drop columns, so deployed schemas still have the dead column; harmless, ignored on read/write). `JobRecord.catalogUid` removed.
OpenAPI: `Catalog`, `VectorStore`, `Document`, `AdoptableCollection`, `AdoptCollectionInput`, `Ingest*` (catalog-shaped) schemas retired. KB-shaped equivalents (`KnowledgeBase`, `RagDocument`, `KbIngestRequest`, `KbAsyncIngestResponse`) are the only document surface.
Pipeline + worker: `runIngest` (catalog-scoped) and `runIngestJob` removed. `runKbIngest` + `runKbIngestJob` are the only ingest path. The cross-replica orphan sweeper now resumes via `runKbIngestJob`. `CATALOG_SCOPE_KEY` removed; `KB_SCOPE_KEY` is the only chunk-payload scope key.
Tests: `app.test.ts` trimmed from 3129 → 1023 lines: ~2100 lines of legacy catalog / vector-store / document / ingest / chunk / adopt tests deleted. Replacement coverage lives in `knowledge-bases.test.ts` (15 KB-scoped route tests) and `control-plane/contract.ts` (RAG document × 3 backends). Converters test rewritten around `RagDocumentRecord`. Astra-fake bundle stripped of catalogs / vectorStores / documents tables.
Conformance: 10 legacy fixtures deleted (catalog-*, vector-store-*, document-crud-basic). 5 workspace-only scenarios remain. KB scenarios deferred to a follow-up once the cross-runtime KB surface stabilises.
The legacy `wb_workspaces` and `wb_jobs_by_workspace` tables stay in place; both are used unchanged by the KB-scoped surface. Migrating to `wb_config_workspaces` is a separate refactor.
Phase 1c.4: web UI rewire (commit 8534c9a)
The React UI now speaks the new KB-scoped surface end-to-end.
Schemas + API (`apps/web/src/lib/schemas.ts`, `api.ts`): trimmed of every catalog/vector-store/document/saved-query type and helper. Replaced with KB-shaped schemas (`KnowledgeBaseRecord`, chunking/embedding/reranking service records, `RagDocumentRecord`) and matching create/update/delete API calls. `JobRecord.catalogUid` → `knowledgeBaseUid`.
Hooks:
- Deleted: `useCatalogs`, `useVectorStores`, `useSavedQueries`
- Added: `useKnowledgeBases`, `useServices` (factory-style for chunking/embedding/reranking)
- Updated: `useDocuments`, `useIngest`, `usePlaygroundSearch`, all KB-scoped
Pages:
- `WorkspaceDetailPage` swaps `VectorStoresPanel` + `CatalogsPanel` for `ServicesPanel` + `KnowledgeBasesPanel`
- `CatalogExplorerPage` → `KnowledgeBaseExplorerPage`. Route changes from `/workspaces/:wid/catalogs/:cid` → `/workspaces/:wid/knowledge-bases/:kbid`
- `PlaygroundPage` picks workspace + KB instead of workspace + vector store. The query form takes a `QueryFormTarget` (vectorDimension + provider description + lexical/rerank flags) so it doesn't need to know about descriptors
- `OnboardingPage` copy updated
Components:
- Deleted: `CatalogsPanel`, `VectorStoresPanel`, `AdoptCollectionDialog`, `CreateCatalogDialog`, `CreateVectorStoreDialog`, `SavedQueriesSection`
- Added: `KnowledgeBasesPanel`, `ServicesPanel`, `CreateKnowledgeBaseDialog`
- Updated: `DocumentTable`, `DocumentDetailDialog`, `IngestQueueDialog`, switched to `RagDocumentRecord` and KB-scoped props (`knowledgeBase` / `knowledgeBaseUid`, `documentId` instead of `documentUid`, `contentHash` instead of `md5Hash`)
Tests: `QueryForm.test`, `DocumentTable.test`, `IngestQueueDialog.test` rewritten for the new shapes. `golden-path.spec.ts` (Playwright) walks the new flow: onboard → service creation (via API) → KB → upsert → playground.
CI fixes (commits 36d4d71, 2377bc8)
Two issues surfaced after the initial push:
1. Biome formatting + organise-imports drift across the touched TS/TSX files. `biome check --write` was idempotent on the runtime side; the one manual fix was `let resolved` in `runKbIngestJob`: Biome forbids implicit-`any` declarations, so it is now annotated as `Awaited<ReturnType<typeof resolveKb>>`. A follow-up commit cleared the line-length wrap in `golden-path.spec.ts`.
2. Web E2E was racing the React Query cache. The original golden-path spec navigated to `/playground` via the SPA nav link. The `WorkspaceDetailPage` had already populated `useKnowledgeBases(workspaceUid)` with an empty list (the KB was created via the `request` fixture, invisible to the page's QueryClient). The cached empty list rode into the playground; the KB select stayed disabled. Fixed by hard-loading `/playground` via `page.goto`, which remounts the React app and clears the cache; a comment in the spec explains the rationale.
(Java Runtime is a pre-existing failure on `main`, caused by Spring Boot 4.0.6 missing `spring-boot-test-autoconfigure`, and is out of scope for this PR.)
Decisions baked in (resolved in the issue thread)
- KB row carries `vector_collection` (auto-provisioned) plus `lexical_*` fields. Lexical isn't a callable service.
- Agent `reranking_service_id` overrides KB's at query time.
- Conversations are clustered `created_at DESC` so list returns newest-first.
- `saved_queries` does not survive into the new schema.
- `wb_api_key_*` unchanged; orthogonal to the data model.
Test plan
🤖 Generated with Claude Code