Wire calculatePartialSums to native SIMD via Panama FFI downcall #651
Conversation
* Replace the `icelake-server` gcc target with `skylake-avx512` in the build script
* Remove global mutable state: eliminate the `initialIndexRegister`, `indexIncrement`, `maskSeventhBit`, and `maskEighthBit` globals and their constructor initializer; move the mask constants (`maskSeventhBit`, `maskEighthBit`) to local scope inside `lookup_partial_sums`
* Add shared `reduce_add_128_ps` and `reduce_add_256_ps` helper functions using proper horizontal-add sequences instead of store-to-array loops
* Remove redundant `if (length >= N)` guards in all SIMD kernels — the loop body already handles the zero-iteration case correctly
* Replace the store-to-aligned-array horizontal reduction pattern with the new helpers across all 128- and 256-bit dot product and Euclidean distance functions
* Remove the `preferred_size` parameter from `dot_product_f32` and `euclidean_f32`; always dispatch to AVX-512 when `length >= 16`
* Standardize inline annotations: replace `__attribute__((always_inline)) inline` with `JV_FINLINE` / `JV_INLINE` macros throughout
…zes 4, 8 & 16 on AVX-512

Add SIMD fast paths in `calculate_partial_sums_dot_f32_512` and `calculate_partial_sums_euclidean_f32_512` for the most common PQ subvector sizes:

- `size == 4`: broadcast a 128-bit query fragment across all four 128-bit lanes of a ZMM register, load four consecutive centroids at once, and reduce each lane independently using two shuffle+add pairs. Produces 4 partial sums per loop iteration instead of 1.
- `size == 8`: broadcast a 256-bit query fragment across both 256-bit halves of a ZMM register, load two consecutive centroids at once, and reduce across 128-bit lanes followed by within-lane shuffles. Produces 2 partial sums per loop iteration instead of 1.
- `size == 16`: the query and each centroid fit exactly in a ZMM register; load the query into a ZMM register once, then loop over the centroids. Produces one partial sum per loop iteration, but avoids reloading the query on every iteration.

All three paths fall back to the default loop of `dot_product_f32` / `euclidean_f32` calls for any tail elements or unsupported sizes.
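For reference, the scalar semantics that these fast paths must reproduce can be modeled as below. This is a hypothetical standalone sketch, not the library's code; the parameter names follow the native header shown later in the diff.

```c
#include <stddef.h>

/* Scalar model of calculate_partial_sums_dot_f32_512: for each of
 * clusterCount centroids of `size` floats in the codebook, compute
 * the dot product with the query fragment starting at queryOffset.
 * The AVX-512 fast paths produce 4 (size==4), 2 (size==8), or
 * 1 (size==16) of these partial sums per loop iteration. */
static void partial_sums_dot_scalar(const float *codebook, int codebookBase,
                                    int size, int clusterCount,
                                    const float *query, int queryOffset,
                                    float *partialSums) {
    for (int c = 0; c < clusterCount; c++) {
        const float *centroid = codebook + codebookBase + (size_t)c * size;
        float sum = 0.0f;
        for (int j = 0; j < size; j++) {
            sum += centroid[j] * query[queryOffset + j];
        }
        partialSums[c] = sum;
    }
}
```

A model like this is also the natural oracle for the numerical tests requested in the review below: run the SIMD kernel and the scalar loop on the same inputs and compare within a tolerance.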
jshook left a comment:
I would like to see much more coverage of these with numerical tests. Are there some already which aren't seen here?
ashkrisk left a comment:
Looks like an excellent set of optimizations. Left a few comments.
+1 to @jshook's comment about numerical tests. This PR touches almost every single function in the native supporting library, and it would be good to have a set of tests accompanying it, perhaps also in C.
```diff
 if [ "$(printf '%s\n' "$MIN_GCC_VERSION" "$CURRENT_GCC_VERSION" | sort -V | head -n1)" = "$MIN_GCC_VERSION" ]; then
   rm -rf ../resources/libjvector.so
-  gcc -fPIC -O3 -march=icelake-server -c jvector_simd.c -o jvector_simd.o
+  gcc -fPIC -O3 -march=skylake-avx512 -c jvector_simd.c -o jvector_simd.o
```
Is there a strong reason to lower the target micro-architecture version?
```c
case 0:
    calculate_partial_sums_euclidean_f32_512(codebook, codebookIndex, size, clusterCount, query, queryOffset, partialSums);
    break;
case 1:
```
Can we use public enums here? Jextract should automatically make the enums available to the Java code as constants. Alternatively we could skip the parameter-based dispatch altogether and simply expose both versions of the function to Java code.
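The enum suggestion could look something like this in the public header (hypothetical names and values, matching the `0`/`1` cases in the switch above; jextract surfaces enum constants from the public header as Java constants, replacing the magic numbers on the Java side):

```c
/* Hypothetical public enum for the similarity-function dispatch
 * parameter. The values mirror the existing switch cases:
 * 0 dispatches to the Euclidean kernel, 1 to the dot-product kernel. */
typedef enum jv_similarity {
    JV_SIMILARITY_EUCLIDEAN   = 0,
    JV_SIMILARITY_DOT_PRODUCT = 1
} jv_similarity_t;
```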
```c
__m512 vaMagnitude = _mm512_setzero_ps();
int i = 0;
int limit = baseOffsetsLength - (baseOffsetsLength % 16);
const __m512i initialIndexRegister = _mm512_setr_epi32(-16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1);
```
It's good that this isn't a global variable anymore, but given that it's used in multiple places does it make sense to have it as a global constant?
```java
 * void calculate_partial_sums_best_euclidean_f32_512(const float *codebook, int codebookBase, int size, int clusterCount, const float *query, int queryOffset, float *partialSums, float *partialBestDistances)
 * }
 */
public static void calculate_partial_sums_best_euclidean_f32_512(MemorySegment codebook, int codebookBase, int size, int clusterCount, MemorySegment query, int queryOffset, MemorySegment partialSums, MemorySegment partialBestDistances) {
```
Looks like a lot of functions that are no longer in the public header are still declared here. Should fix itself on re-running jextract.
This change uses a native implementation of `calculatePartialSums` to accelerate PQ query scoring. On `ada002-100k` with FUSED_PQ (number of PQ subspaces M = 96, JDK build 23.0.1+11-39), it delivers 2–3× higher QPS and 40–65% lower mean latency across common overquery settings. Index build time, disk usage, and heap usage show no meaningful regression. The optimization is isolated to the PQ path; non-PQ queries are unaffected.

Combined QPS and Latency Results (FUSED_PQ)
topK = 10
topK = 100
Summary of changes in this PR:

* Wire `calculatePartialSums` in `NativeVectorUtilSupport` to a new Panama FFI downcall for the native `calculate_partial_sums_f32_512` SIMD implementation.