feat: implement distributed VECTOR_SEARCH with parallel fragment/index scanning by summaryzb · Pull Request #608 · lance-format/lance-spark

summaryzb · 2026-06-10T10:14:07Z

Summary

Implements distributed execution for the VECTOR_SEARCH table function, enabling Spark-parallel vector similarity search across Lance datasets. When enabled via spark.sql.lance.search.distributed.enabled=true, the driver plans one Spark task per execution unit (an indexed segment or a fallback fragment). Each worker runs a local ANN scan on its indexed segment, or falls back to a KNN scan when its unit has no indexed segment. Results are merged with a global sort on _distance.

This provides horizontal scalability for vector search workloads on large datasets without requiring a centralized vector index server.
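
To illustrate the merge step described above, here is a minimal sketch in plain Python, not the PR's Spark code. It assumes each task returns at most k hits as hypothetical `(row_id, _distance)` pairs and that the final result is the k smallest distances across all tasks; the function names are invented for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

# (row_id, _distance) pairs; this shape is an assumption, not the PR's row type.
Hit = Tuple[int, float]


def local_top_k(hits: Iterable[Hit], k: int) -> List[Hit]:
    """Per-task step: keep the k hits with the smallest distance in one unit."""
    return heapq.nsmallest(k, hits, key=lambda h: h[1])


def global_top_k(per_task: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge step: combine per-task results and order them globally by distance."""
    merged = (hit for task_hits in per_task for hit in task_hits)
    return heapq.nsmallest(k, merged, key=lambda h: h[1])


# Two execution units, each scanned independently.
unit_a = local_top_k([(1, 0.30), (2, 0.10), (3, 0.90)], k=2)
unit_b = local_top_k([(4, 0.20), (5, 0.05), (6, 0.70)], k=2)
result = global_top_k([unit_a, unit_b], k=2)
```

For an exact (flat KNN) scan, taking the top k of the per-unit top-k lists gives the same answer as a single global scan. For ANN units the merged result is only as accurate as each local scan.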

Behavior

| Condition | Execution Path |
| --- | --- |
| `distributed.enabled=false` | Single-partition `namespace.queryTable()` |
| `distributed.enabled=true`, has vector index | One task per index segment, plus fallback tasks for unindexed fragments (unless `fastSearch=true`) |
| `distributed.enabled=true`, no index | One task per fragment (flat KNN) |
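
The three rows above can be read as a small planning function. The following is a sketch of that decision logic only; `ExecutionUnit`, `plan_units`, and the string identifiers are hypothetical names chosen for illustration and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExecutionUnit:
    kind: str    # "single", "index_segment", or "fallback_fragment"
    target: str  # hypothetical identifier of the segment or fragment


def plan_units(
    distributed_enabled: bool,
    index_segments: List[str],
    unindexed_fragments: List[str],
    fast_search: bool = False,
) -> List[ExecutionUnit]:
    """Return one unit per Spark task, following the Behavior table."""
    if not distributed_enabled:
        # Row 1: single-partition path through namespace.queryTable().
        return [ExecutionUnit("single", "queryTable")]
    if index_segments:
        # Row 2: one task per index segment ...
        units = [ExecutionUnit("index_segment", s) for s in index_segments]
        # ... plus fallback tasks for unindexed fragments unless fastSearch=true.
        if not fast_search:
            units += [ExecutionUnit("fallback_fragment", f) for f in unindexed_fragments]
        return units
    # Row 3: no index, so every fragment is unindexed (flat KNN per fragment).
    # Assumption: fast_search has no effect here; the table does not say.
    return [ExecutionUnit("fallback_fragment", f) for f in unindexed_fragments]


tasks = plan_units(True, ["seg-0", "seg-1"], ["frag-7"])
```

With two index segments and one unindexed fragment, this plan yields three tasks; setting `fast_search=True` drops the fallback task and leaves two.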

Notice

The indexed-unit path with IVF_PQ is implemented but not yet covered by E2E tests, pending CREATE INDEX ... USING IVF_PQ SQL support in the lance-spark extensions, which depends on feat: distributed vector index creation #605.
The prefilter=true flag is forced for fallback units because of a lance-core JNI limitation: Scanner::nearest rejects fragment-restricted scans without a prefilter expression.
Specifying indexSegments in ScanOptions is a prerequisite. It depends on feat(java): support segment-based distributed vector search lance#7169; without that change, compilation fails.

Testing

BaseSparkDistributedVectorSearchTest exercises fallback-only scenarios
Spark 3.4 and 3.5 modules have thin test subclasses

Change-Id: I94c3cd431bcf5ba4bee7906838fa2d7cd4f6769e

summaryzb · 2026-06-10T10:18:53Z

CI is expected to be red because this PR depends on lance-format/lance#7169.

summaryzb · 2026-06-10T10:19:09Z

@jackye1995 @Xuanwo @LuciferYang PTAL

sezruby · 2026-06-13T17:28:17Z

@summaryzb thanks for working on this. Quick data point in case it's useful:

Measured driver-side single-machine Dataset.newScan(...nearest(Query)), k=10, nprobes=16, IVF-PQ, ABFSS storage, single JVM, 100 timed queries on one open Lance handle:

| Dataset | \|R\| | dim | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| Cohere `wikipedia-2023-11-embed-multilingual-v3` | 10M | 1024 | 56 ms | 182 ms | 514 ms |
| Synthetic uniform-random (note: ~worst case for IVF clustering) | 100M | 128 | 157 ms | 776 ms | 1213 ms |
| Synthetic uniform-random | 250M | 128 | 1084 ms | 1528 ms | 2323 ms |

Wondering whether the distributed path can beat sub-100ms single-machine once you account for Spark task scheduling overhead — would you be able to share end-to-end latency numbers from your benchmark setup at a similar scale? Would help inform docs around when to enable spark.sql.lance.search.distributed.enabled=true.

Thanks!
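
For anyone producing comparable numbers, here is a minimal sketch of timing repeated queries and reporting p50/p90/p99. It is not the harness used for the table above; `run_query` is a hypothetical stand-in for whatever issues one search, and the nearest-rank percentile definition is an assumption.

```python
import math
import time
from typing import Callable, List


def time_queries(run_query: Callable[[], object], n: int) -> List[float]:
    """Run the query n times and return per-query wall-clock latency in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples_ms: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies of 1..100 ms stand in for 100 timed queries.
latencies = [float(ms) for ms in range(1, 101)]
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))
```

For the distributed path, the timed call would need to wrap the whole Spark action so that task scheduling and the final merge are included in each sample.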

feat: support distributed VECTOR_SEARCH

4a02bf4

Change-Id: I94c3cd431bcf5ba4bee7906838fa2d7cd4f6769e

github-actions bot added the enhancement label Jun 10, 2026

sezruby mentioned this pull request Jun 16, 2026

SPIP: Lance-backed approximate nearest-neighbor join for Apache Spark — three complementary paths sezruby/lance-spark#7

Open

summaryzb wants to merge 1 commit into lance-format:main from summaryzb:dis_vec_search
