Skip to content

Track: native MERGE via lance-core MergeInsertBuilder (bypass Spark position-delta path) #8

@sezruby

Description

@sezruby

Tracking issue

Investigate adding a native MERGE strategy in lance-spark that dispatches
directly to lance-core's MergeInsertBuilder instead of going through
Spark's V2 row-level operations / position-delta path. Goal: measure
whether the native path is meaningfully faster on tables with many
fragments, then file upstream against lance-format/lance-spark if the
numbers justify it.

Today (lance-format/lance-spark @ HEAD)

MERGE INTO lance . ` SQL routes through:

  • LancePositionDeltaDataset implements SupportsRowLevelOperations
  • LancePositionDeltaOperation with representUpdateAsDeleteAndInsert = true
  • rowId() = [_rowaddr], requiredMetadataAttributes() = [_fragid]

Catalyst's RewriteMergeIntoTable lowers MERGE into a position-delta
plan: WriteDelta(... Project(deletes+inserts) Join(targetScan, source, ON merge-condition) ...). The join, optional shuffle, and per-_fragid
write are all Spark stages.

LanceScan.planInputPartitions() emits one InputPartition per Lance
fragment, so target-scan and position-delta-write stages are
fragment-aligned. With a 2500-fragment table, several fragment-aligned
stages × 2500 = thousands of Spark tasks per MERGE.

FilterPushDown.isFilterSupported accepts only basic V2 Filter types
(EqualTo, In, range, null). The MERGE ON condition is evaluated as
a join predicate, not a pushable filter — so pushedFilters is empty
for the target scan, zonemap pruning skips, and all fragments are
scanned regardless of indexes.

MergeInsertBuilder from lance-core is not invoked anywhere in
lance-spark today.

Proposal

A planner strategy that intercepts MergeIntoTable nodes targeting Lance
tables and routes them through MergeInsertBuilder:

dataset.mergeInsertBuilder(onColumns)
  .whenMatchedUpdateAll()                  // or: ...UpdateAllIf(condition)
  .whenMatchedDelete()                     // or: ...DeleteIf(condition)
  .whenNotMatchedInsertAll()               // or: ...InsertAllIf(condition)
  .whenNotMatchedBySourceDelete()          // or: ...DeleteIf(condition)
  .execute(sourceRecordBatches);

Pros:

  • Avoids Spark Join + shuffle + position-delta write (the per-_fragid
    rewrite stage).
  • Lance-core can use BTREE/Bitmap indexes on the merge ON column
    for fragment pruning during the match phase.
  • Per-arm predicates (whenMatchedUpdateAllIf etc.) map naturally onto
    the SQL WHEN MATCHED ... AND <cond> clauses.

Cons:

  • Single-process: MergeInsertBuilder.execute doesn't shard across
    Spark executors. Source has to be materialized as RecordBatches on
    the driver (or one executor).
  • Predicate translation: Spark Expression → Lance SQL string for the
    per-arm *If predicates.

Plan

  1. Stand up a benchmark in lance-spark-knn_2.12/.../benchmark/ (same
    harness pattern as IndexedNearestJoinBenchmark):

    • Build a synthetic Lance table at varying scales (100, 500, 2500
      fragments).
    • Run MERGE INTO SQL via the current path.
    • Run an equivalent direct MergeInsertBuilder.execute from
      Java/Scala.
    • Validate equivalence (oracle check on a sample of merge keys).
    • Time both, report wall-clock + Spark task counts.
  2. If MergeInsertBuilder wins meaningfully on the
    2500-fragment-with-small-source case (the workload shape seen in
    prod), file upstream as a feature request with the numbers attached.

  3. If it doesn't (e.g. one-process source materialization dominates),
    document why and close.

Out of scope for this tracking issue

  • Streaming source path (always materialize for the bench).
  • Auto-routing heuristic between native and position-delta paths (will
    manifest as a config flag in the PoC).
  • Full Catalyst MERGE rewrite — the PoC can call MergeInsertBuilder
    directly from a benchmark harness, no SQL parser changes needed
    for measurement.

Status

  • Branch + benchmark scaffold
  • Synthetic Lance table generator
  • Position-delta path baseline run
  • MergeInsertBuilder direct-call run
  • Equivalence validation
  • Numbers attached
  • Upstream issue (filed only if numbers justify)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions