Skip to content

feat: implement distributed VECTOR_SEARCH with parallel fragment/index scanning#608

Open
summaryzb wants to merge 1 commit into
lance-format:mainfrom
summaryzb:dis_vec_search
Open

feat: implement distributed VECTOR_SEARCH with parallel fragment/index scanning#608
summaryzb wants to merge 1 commit into
lance-format:mainfrom
summaryzb:dis_vec_search

Conversation

@summaryzb

@summaryzb summaryzb commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements distributed execution for the VECTOR_SEARCH table function, enabling Spark-parallel vector similarity search across Lance datasets. When enabled via spark.sql.lance.search.distributed.enabled=true, the driver plans one Spark task per execution unit
(indexed segment or fallback fragment), and each worker runs a local ANN scan or fallback to KNN scan without indexed segment. Results are merged with a global sort on _distance.

This provides horizontal scalability for vector search workloads on large datasets without requiring a centralized vector index server.

Behavior

Condition Execution Path
distributed.enabled=false Single-partition namespace.queryTable()
distributed.enabled=true, has vector index One task per index segment, plus fallback tasks for unindexed fragments (unless fastSearch=true)
distributed.enabled=true, no index One task per fragment (flat KNN)

Notice

Testing

  • BaseSparkDistributedVectorSearchTest exercises fallback-only scenarios
  • Spark 3.4 and 3.5 modules have thin test subclasses

Change-Id: I94c3cd431bcf5ba4bee7906838fa2d7cd4f6769e
@github-actions github-actions Bot added the enhancement New feature or request label Jun 10, 2026
@summaryzb

Copy link
Copy Markdown
Contributor Author

CI RED is expected since it rely on lance-format/lance#7169

@summaryzb

Copy link
Copy Markdown
Contributor Author

@jackye1995 @Xuanwo @LuciferYang PTAL

@sezruby

sezruby commented Jun 13, 2026

Copy link
Copy Markdown

@summaryzb thanks for working on this. Quick data point in case it's useful:

Measured driver-side single-machine Dataset.newScan(...nearest(Query)), k=10, nprobes=16, IVF-PQ, ABFSS storage, single JVM, 100 timed queries on one open Lance handle:

Dataset |R| dim p50 p90 p99
Cohere wikipedia-2023-11-embed-multilingual-v3 10M 1024 56 ms 182 ms 514 ms
Synthetic uniform-random (note: ~worst case for IVF clustering) 100M 128 157 ms 776 ms 1213 ms
Synthetic uniform-random 250M 128 1084 ms 1528 ms 2323 ms

Wondering whether the distributed path can beat sub-100ms single-machine once you account for Spark task scheduling overhead — would you be able to share end-to-end latency numbers from your benchmark setup at a similar scale? Would help inform docs around when to enable spark.sql.lance.search.distributed.enabled=true.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants