feat: add namespace insert write path by jackye1995 · Pull Request #587 · lance-format/lance-spark

jackye1995 · 2026-06-06T18:35:48Z

Adds an opt-in namespace insert append path for namespace-backed Lance tables.

Expected user experience:

- Existing INSERT INTO and append writes keep using the default Spark writer unless users set use_namespace_insert=true.
- With namespace insert enabled, users still submit a normal DataFrame append; Spark runs executor-side writer tasks that send Arrow batches through the Lance namespace insert API.
- namespace_insert_parallelism lets users request the number of writer tasks; sharded tables use the sharding distribution, and unsharded tables repartition by the first output column.
- Directory and REST namespaces use the same API path, so REST namespaces can handle ingestion without changing Spark user code.
- Insert requests commit as they run, so users should use the default writer when they need Spark driver-side atomic commit behavior.
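
For illustration, the opt-in described above might look like this from PySpark. This is a sketch and is not taken from the PR: it assumes `use_namespace_insert` and `namespace_insert_parallelism` are accepted as per-write options on a `DataFrameWriterV2`, and the helper name and the table name in the usage line are hypothetical.

```python
def append_with_namespace_insert(df, table, parallelism=None):
    """Append df to table through the opt-in namespace insert path.

    Assumption: both settings are passed as write options. The PR text
    names the options but does not show where they are set.
    """
    writer = df.writeTo(table).option("use_namespace_insert", "true")
    if parallelism is not None:
        # Requested number of executor-side writer tasks.
        writer = writer.option("namespace_insert_parallelism", str(parallelism))
    # Still a normal append from the user's point of view.
    writer.append()
```

A call such as `append_with_namespace_insert(df, "lance.db.events", parallelism=8)` would request eight writer tasks; omitting the flag entirely keeps the default Spark writer.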

Includes local/rest-dir Docker CI coverage and documentation updates.

LuciferYang · 2026-06-09T13:33:43Z

+      byte[] requestData = serializeCurrentBatch();
+      InsertIntoTableRequest request = new InsertIntoTableRequest().id(tableId).mode("append");
+      try {
+        namespace.insertIntoTable(request, requestData);


This path commits visible data directly from executor tasks via namespace.insertIntoTable(...). If a task inserts one or more batches and Spark later retries or speculatively reruns the same partition (because a later batch fails, an executor is lost, or close() fails), the replacement attempt can insert the same rows again and the Spark job can still succeed with duplicates. Please do not describe the current path as safe or atomic. Either add first-class namespace idempotency with a stable cross-attempt key such as write_id + partition_id + batch_ordinal, gated by namespace capability/version discovery, or move visible writes behind a staged driver commit. Disabling retries/speculation only reduces exposure; it is not a correctness fix.
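
A minimal sketch of the cross-attempt key idea follows. It is not the namespace API: `BatchKey`, `FakeNamespace`, and `write_partition` are hypothetical names, and the in-memory namespace only stands in for a server that would track keys it has already applied.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BatchKey:
    """Stable across task attempts: holds no attempt number or task id."""
    write_id: str        # one value per Spark write job
    partition_id: int    # Spark partition being written
    batch_ordinal: int   # position of the batch within the partition


@dataclass
class FakeNamespace:
    """In-memory stand-in for a namespace with idempotent inserts."""
    rows: list = field(default_factory=list)
    seen: set = field(default_factory=set)

    def insert(self, key, batch):
        # A replayed key is acknowledged without appending again.
        if key in self.seen:
            return False
        self.seen.add(key)
        self.rows.extend(batch)
        return True


def write_partition(ns, write_id, partition_id, batches):
    """Insert each batch under a key that a retry would regenerate exactly."""
    for ordinal, batch in enumerate(batches):
        ns.insert(BatchKey(write_id, partition_id, ordinal), batch)
```

The key deduplicates correctly only if a retried attempt produces the same batches in the same order, which is an extra assumption of this sketch; a real design would also need the capability/version gate mentioned above.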

LuciferYang · 2026-06-09T13:35:23Z

+      rowsInserted += rowCount;
+      insertRequests++;
+      hasRowsInCurrentBatch = false;
+      allocateBatch();


After a successful insert, flush() allocates the next Arrow batch, and the final commit() flush uses the same path. If the last insert has already succeeded but a later allocateBatch() or close() cleanup fails, Spark treats the task as failed and may retry data that was already inserted. Please split normal flush from final flush so the last successful insert does not allocate another batch. Once any insert has happened, cleanup of the sharding evaluator, Arrow batch, blob resolver, and namespace should be best-effort: log failures but do not throw them back to Spark. Close failures before any insert can still fail the task.
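
The requested split can be sketched as below. This is a simplified Python model, not the Java writer in the PR: `InsertWriterSketch`, `insert_fn`, and `resources` are hypothetical, and a plain list stands in for the Arrow batch.

```python
import logging

log = logging.getLogger("namespace_insert_sketch")


class InsertWriterSketch:
    def __init__(self, insert_fn, resources):
        self.insert_fn = insert_fn    # sends one batch to the namespace
        self.resources = resources    # objects with close(): evaluator, batch, ...
        self.batch = []
        self.inserted_any = False

    def write(self, row, batch_size=2):
        self.batch.append(row)
        if len(self.batch) >= batch_size:
            self._flush(final=False)

    def _flush(self, final):
        if self.batch:
            self.insert_fn(self.batch)
            self.inserted_any = True
        # Only a non-final flush prepares another batch, so the final flush
        # does nothing that can fail after the last insert has succeeded.
        self.batch = None if final else []

    def commit(self):
        self._flush(final=True)
        self.close()

    def close(self):
        for res in self.resources:
            try:
                res.close()
            except Exception:
                if not self.inserted_any:
                    raise  # nothing is visible yet, so failing the task is safe
                log.warning("cleanup failed after insert; ignoring", exc_info=True)
```

Once `inserted_any` is set, a cleanup error is logged rather than raised, so Spark does not retry a task whose rows are already visible.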

LuciferYang · 2026-06-09T13:42:07Z

+    @Override
+    public WriterCommitMessage commit() throws IOException {
+      if (shardingKeyEvaluator != null) {
+        shardingKeyEvaluator.flush(this::writePartitionedRow);


This new namespace insert writer calls shardingKeyEvaluator.flush(this::writePartitionedRow) in commit(). Inside ShardingBatchKeyEvaluator.flush(...), it calls the row consumer in the loop, and that consumer may trigger namespace inserts. The no-extra-batch validation and the reset/allocation in finally happen after the consumer has run. If some rows have already been inserted and a later validation or reset fails, Spark can retry and duplicate those rows. Please finish evaluator validation and key extraction before invoking the consumer, including the no-extra-batch check. After the consumer may have forced namespace inserts, only cleanup/close failures should be best-effort.
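
One way to picture the ordering being asked for is a two-phase flush. The function below is a hypothetical sketch and does not reflect the actual ShardingBatchKeyEvaluator API: `extract_key` and `validate` stand in for key extraction and checks such as the no-extra-batch check.

```python
def flush_keys_then_consume(pending_rows, extract_key, validate, consumer):
    """Run every step that can reject the batch before any row is emitted."""
    # Phase 1: work with no visible side effects. Key extraction and
    # validation either succeed for the whole batch or raise here.
    keyed = [(extract_key(row), row) for row in pending_rows]
    validate(keyed)
    # Phase 2: side effects. The consumer may trigger namespace inserts,
    # so no validation is left to run after this point.
    for key, row in keyed:
        consumer(key, row)
    return len(keyed)
```

If `validate` raises, the consumer has not been called, so a Spark retry starts from a state where nothing from this flush has been inserted.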

LuciferYang · 2026-06-09T13:43:37Z



+@pytest.mark.rest_dir_compatible
+class TestDMLNamespaceInsert:


TestDMLNamespaceInsert currently covers basic append and fixed-size-list vector writes, but not the blob-reference resolver path that namespace insert depends on. Please add an integration test that reads from a source path producing blob references, then writes to a namespace-insert target with use_namespace_insert=true through a shuffled or joined DataFrameWriterV2 append, and verifies byte equality in the target table. The test also needs to prove that the namespace insert writer path was selected; byte equality alone can pass through the default writer. If the test is outside TestDMLNamespaceInsert, update NAMESPACE_INSERT_PYTEST_CMD and mark it @pytest.mark.rest_dir_compatible.

LuciferYang · 2026-06-09T13:44:49Z

+      - synchronize
+      - ready_for_review
+      - reopened
+    paths:


This workflow is intended to cover namespace insert, but the current paths mostly cover write/options/integration files. Namespace insert also depends on the Arrow IPC writer, blob-reference read/write path, source-context optimizer rule, session-extension injection, sharding utilities, namespace catalog construction, and runtime namespace wiring. Prefer source-area filters instead of enumerating individual dependencies, for example lance-spark-base_2.12/src/main/**, lance-spark-base_2.12/src/test/**, lance-spark-*/src/main/**, and lance-spark-*/src/test/**, so key path changes still run the local/rest-dir namespace insert tests.

LuciferYang · 2026-06-09T13:45:36Z

+| `batch_size`                   | Integer | `8192`  | Maximum rows per Arrow batch/request before flushing.                                                    |
+| `max_batch_bytes`              | Long    | `268435456` | Maximum approximate bytes per Arrow batch/request before flushing.                                  |
+
+Namespace insert writes are intended for append ingestion through a namespace implementation,


The docs currently say that rows may be visible after failures, but they do not say that namespace insert is currently an at-least-once path, nor that a Spark job can succeed with duplicate rows after task retry or speculation. Please update both user-facing docs to state that replay de-duplication can only be claimed when the connector verifies a first-class namespace idempotency capability with defined replay semantics. Otherwise users should treat this as at-least-once ingestion. Users who need Spark driver-side atomic commit semantics should use the default writer.

feat: add namespace insert write path

34a412d

github-actions Bot added the enhancement New feature or request label Jun 6, 2026

docs: clarify namespace insert user experience

35d421e

LuciferYang reviewed Jun 9, 2026

jackye1995 wants to merge 2 commits into lance-format:main from jackye1995:jack/namespace-insert-writes
