The lance-spark PR depends on the lance-core PR — lance.version is bumped from 6.0.0-beta.4 → 7.1.0-beta.1 to pull in the new external-index API. Both PRs need RFC review before becoming merge candidates.
Test status
✅ lance-core: 14 unit tests + 1 integration test (tests/external_index_phase1.rs) — build → open → search recall → fetchRows projection → RowFilter exclusion. All passing.
Highlights from #4
The headline finding from the cluster benchmark (full table + methodology in #4):
At wide-medium (|R|=1M, dim=128, 16 string payload columns, K=10, |L|=100), the external-index path's warm per-query latency is 570 ms vs the temp-Lance path's 18,128 ms — a 31.8× speedup. The external-index pays a one-time 30-second index build that amortizes after ~2 queries at this scale.
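The speedup and break-even figures follow from simple arithmetic on the numbers quoted above. The sketch below is illustrative only: it treats the 30-second build as 30,000 ms and assumes every temp-Lance query pays the full 18,128 ms warm latency. The function name is hypothetical and not part of either PR.

```python
import math


def break_even_queries(build_ms: float, baseline_ms: float, indexed_ms: float) -> int:
    """Smallest number of queries after which the one-time build has paid for itself."""
    saved_per_query = baseline_ms - indexed_ms
    if saved_per_query <= 0:
        raise ValueError("the indexed path must be faster than the baseline")
    return math.ceil(build_ms / saved_per_query)


# Figures quoted for the wide-medium configuration.
TEMP_LANCE_MS = 18_128      # warm per-query latency, temp-Lance path
EXTERNAL_INDEX_MS = 570     # warm per-query latency, external-index path
BUILD_MS = 30_000           # one-time index build (30 seconds)

speedup = TEMP_LANCE_MS / EXTERNAL_INDEX_MS
queries = break_even_queries(BUILD_MS, TEMP_LANCE_MS, EXTERNAL_INDEX_MS)

print(f"speedup: {speedup:.1f}x")          # speedup: 31.8x
print(f"break-even after {queries} queries")  # break-even after 2 queries
```

Each query saves about 17.6 seconds, so the 30-second build is recovered during the second query.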
External-index is the right answer when:
R is a stable parquet/Delta table on storage (sidecar pattern).
The same R is queried many times — index build amortizes immediately.
Wide payload columns — temp-Lance write cost grows linearly with width; external-index reads only top_K × |L| × projection_cols from source parquet.
Temp-Lance (#3) stays the right answer when R is an arbitrary subplan — joins, projections, computed columns. The two paths are additive, not competing.
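To make the wide-payload argument concrete, the sketch below counts payload values touched by each path at the wide-medium shape. It is a rough model under two stated assumptions: the temp-Lance path writes every row of R with all payload columns, and the projection covers all 16 payload columns. Both function names are hypothetical.

```python
def external_values_read(top_k: int, num_queries: int, projection_cols: int) -> int:
    """Payload values fetched from source parquet: top_K x |L| x projection_cols."""
    return top_k * num_queries * projection_cols


def temp_lance_values_written(num_rows: int, payload_cols: int) -> int:
    """Payload values written when all of R is materialized (assumed model)."""
    return num_rows * payload_cols


# wide-medium: |R| = 1M, 16 string payload columns, K = 10, |L| = 100
ext = external_values_read(top_k=10, num_queries=100, projection_cols=16)
tmp = temp_lance_values_written(num_rows=1_000_000, payload_cols=16)
ratio = tmp // ext

print(ext)    # 16000
print(tmp)    # 16000000
print(ratio)  # 1000
```

Under this model the external-index path touches three orders of magnitude fewer payload values. The gap does not shrink as columns are added, because both counts scale with width while only the temp-Lance count scales with |R|.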
Optimization layer on top of Phase 1.5. Same external-index API, same recall, abfss/cloud-storage path now ~75% faster (12s → 3s warm at wide-tiny on a 2-worker × 4-core DBR cluster). See #7 §7.4 and #4 "Cloud-storage (abfss) — Phase 1.6 follow-up" for the timeline + per-batch breakdown.
ExternalFusedStage.fusedPartition issues one batched probe call per Spark task instead of N per-query calls
ParquetMetaCache on OpenedExternalIndex reuses footer + page-index across queries within a task
CoalescingParquetReader wraps ParquetObjectReader with a tunable coalesce gap + parallelism
Per-batch timing diagnostics gated on LANCE_LOG=lance::index::vector::external=info
The headline 570 ms wide-medium / 8.6 sec mega-medium numbers above are local-fs uniform-pool measurements. Cloud-storage is a different operating point — see the linked sections for those numbers.
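The idea behind a coalescing reader with a tunable gap and parallelism can be sketched in a few lines. This is a generic illustration of range coalescing, not the CoalescingParquetReader implementation. The function names, the half-open `(start, end)` range convention, and the `fetch` callback are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def coalesce_ranges(ranges, max_gap):
    """Merge half-open (start, end) byte ranges separated by at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of issuing a new read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def fetch_coalesced(ranges, fetch, max_gap, parallelism):
    """Fetch the coalesced ranges concurrently; fetch(start, end) returns bytes."""
    merged = coalesce_ranges(ranges, max_gap)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        blobs = list(pool.map(lambda r: fetch(r[0], r[1]), merged))
    return dict(zip(merged, blobs))


# Three requested ranges; the first two are 50 bytes apart and get merged.
requests = [(0, 100), (150, 200), (5000, 5100)]
print(coalesce_ranges(requests, max_gap=64))  # [(0, 200), (5000, 5100)]
```

A larger gap trades a few wasted bytes for fewer round trips, which tends to matter more on object storage than on a local filesystem.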
Persistent index across Spark sessions (Phase 2). Phase 1 has an in-process driver-side cache; cross-application reuse needs manifest fingerprint validation (mtime/size/footer-hash) at open() to detect stale indexes after parquet rewrites.
SQL Catalyst integration (Phase 3 on the lance-spark side). Phase 1 ships the DataFrame API entry point only.
HNSW / IVF-FLAT external builds (Phase 2/3 lance-core). Same shape as IVF-PQ, separate work.
append() / compact() for incremental index updates (Phase 4 lance-core).
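The fingerprint validation mentioned for Phase 2 could look roughly like the sketch below. It is speculative, since that work is not part of Phase 1: the `Fingerprint` type, the manifest shape, and both function names are hypothetical. The only external fact it relies on is the parquet file layout, where the last 8 bytes are a 4-byte little-endian footer length followed by the `PAR1` magic.

```python
import hashlib
import os
from dataclasses import dataclass

PARQUET_MAGIC = b"PAR1"


@dataclass(frozen=True)
class Fingerprint:
    size: int
    mtime_ns: int
    footer_sha256: str


def fingerprint(path: str) -> Fingerprint:
    """Cheap identity for a parquet file: size, mtime, and a hash of its footer."""
    st = os.stat(path)
    if st.st_size < 8:
        raise ValueError(f"{path} is too small to be a parquet file")
    with open(path, "rb") as f:
        f.seek(st.st_size - 8)
        tail = f.read(8)
        if tail[4:] != PARQUET_MAGIC:
            raise ValueError(f"{path} does not end with the parquet magic bytes")
        footer_len = int.from_bytes(tail[:4], "little")
        f.seek(st.st_size - 8 - footer_len)
        footer = f.read(footer_len)
    return Fingerprint(st.st_size, st.st_mtime_ns, hashlib.sha256(footer).hexdigest())


def is_stale(manifest: dict[str, Fingerprint]) -> bool:
    """True if any source file no longer matches the fingerprint recorded at build time."""
    return any(fingerprint(path) != recorded for path, recorded in manifest.items())
```

A caller would record `fingerprint(path)` for every source file when the index is built and call `is_stale` at open time, rebuilding the index when it returns True. Hashing only the footer keeps the check cheap compared with hashing whole files.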
Cross-compile notes for contributors
Cluster runs require nativelib/linux-x86-64/liblance_jni.so. Local cargo build on macOS arm64 produces only darwin-aarch64/liblance_jni.dylib. Workflow that works:
# Assumes LANCE_REPO points at the lance checkout and VERSION is the lance-core version in ~/.m2.
brew install zig
cargo install cargo-zigbuild
rustup target add x86_64-unknown-linux-gnu
cd "$LANCE_REPO/java/lance-jni"
cargo zigbuild --target x86_64-unknown-linux-gnu --release
# Graft into the lance-core JAR before building the lance-spark fat JAR:
mkdir -p /tmp/jar-patch && cd /tmp/jar-patch
mkdir -p nativelib/linux-x86-64
cp "$LANCE_REPO/java/lance-jni/target/x86_64-unknown-linux-gnu/release/liblance_jni.so" \
  nativelib/linux-x86-64/
# jar uf stores the entry under the relative path given, so run it from /tmp/jar-patch.
jar uf ~/.m2/repository/org/lance/lance-core/$VERSION/lance-core-$VERSION.jar \
  nativelib/linux-x86-64/liblance_jni.so
cross (Docker) hits the aws-lc-sys GCC blocklist on its default image; the centos image has GCC 10, but its bundled Rust toolchain is too old for the workspace's edition = "2024". cargo zigbuild sidesteps both: it uses the host's rustup toolchain plus zig-supplied glibc-targeted clang.
Update (Phase 1.6): cross also works with the Ubuntu-based ghcr.io/cross-rs/x86_64-unknown-linux-gnu:main image plus a small Cross.toml that pre-installs protoc 25.1, plus a [patch.crates-io] to a vendored copy of aws-lc-sys 0.41.0 with the gcc-bug-95189 self-check assertion stripped (the assertion misfires under QEMU emulation on darwin-arm64; the actual crypto code is fine). Either path produces an equivalent linux .so.
Upstream lance-core CI cross-builds for all 3 architectures and publishes the multi-arch JAR; this workflow is only needed for in-flight branches not yet published.
Tracking: External Lance vector index — Phase 1 PRs filed
Tracker for the External Lance vector index delivery. RFC + design + cluster benchmark numbers + architecture deep-dive live in:
Phase 1 PRs (DRAFT, gated on RFC review)
Two parallel PRs implement Phase 1 end-to-end:
sezruby/lance: ExternalIvfPqIndex
sezruby/lance-spark: IndexedNearestJoinExternal, fused probe+materialize stage, benchmark vs temp-Lance / Lance-native-indexed
Test status
lance-spark: IndexedNearestJoinExternalTest end-to-end (16 left vectors × 2 parquet files of 320 rows). Passing locally.
Phase 1.6 — cloud-storage perf optimizations: ✅ shipped
Surface changes:
ExternalIvfPqIndex.search_batch (Rust core) / nativeSearchBatch (JNI) / searchBatch(float[][], ...) (Java) / probeBatch(Array[Array[Float]], ...) (Scala wrapper)
What's NOT in Phase 1
Tracked for follow-up; covered in #4: persistent index across Spark sessions (Phase 2), SQL Catalyst integration (Phase 3, lance-spark), HNSW / IVF-FLAT external builds (Phase 2/3 lance-core), and append() / compact() for incremental index updates (Phase 4 lance-core).