Add a de-duplicating flat vector format #15979

Open

kaivalnp wants to merge 4 commits into apache:main from kaivalnp:dedup-raw-vectors

Conversation

@kaivalnp (Contributor)

Description

Closes #14758

Add a new de-duplicating vector format that only stores unique vectors on disk.
De-duplication is done for vectors across all docs and fields indexed by the format.

Disclaimer: This was mostly written by an AI, with me refining the implementation through prompts -- although I think it did a pretty good job on its own!

Details about the format itself (layout of vectors on disk, de-duplication strategy during flush and merge, performance tradeoffs, etc.) are included in a markdown doc in the PR.

org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat
org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorsFormat
org.apache.lucene.codecs.lucene104.Lucene104HnswScalarQuantizedVectorsFormat
org.apache.lucene.codecs.dedup.DedupFlatVectorsFormat
@kaivalnp (Contributor Author)

This is not required -- the raw vector format will be wrapped in its own Lucene*HnswVectorsReader which will be exposed here (keeping this for later).

I had to add this here to demonstrate that tests are passing (the tests directly use it as the KNN vector format).

/**
 * @param ord the vector ordinal
 * @return the byte offset for the vector
 */
public long ordToOffset(int ord) {
  // ordinals no longer map 1:1 to on-disk positions, so resolve through the mapping
  return (long) ordToPosition[ord] * vectorByteSize;
}
@kaivalnp (Contributor Author)

IMO this is the biggest change with this format -- the offset of a vector in the raw file was simply ord * byteSize earlier -- but with a de-duplicating format the ordering is broken and multiple ordinals can "point" to the same vector, so we need an explicit function to resolve the offset.

This is probably more suitable for an interface like HasIndexSlice though.

@kaivalnp (Contributor Author)

luceneutil KNN benchmarks on Cohere v3 vectors, 1024d, DOT_PRODUCT similarity, 1M vectors, 10K queries, no quantization.

Important: filterStrategy = index-time-filter is used here (added in mikemccand/luceneutil#468), which simply creates a new vector field from a randomly selected filterSelectivity proportion of docs, and uses the smaller field for search.

main

recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
 0.988        2.987   2.982     8122    239.73       4171.41          418.68         6077.99               0.50
 0.991        2.398   2.397     7415    211.53       4727.37          296.96         4862.33               0.20

This PR

recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
 0.988        2.724   2.720     8129    248.46       4024.79          466.43         4130.17               0.50
 0.991        2.173   2.170     7414    214.40       4664.16          396.85         4085.20               0.20

As expected, if the same vector is indexed in a second field for 50% of docs, the index size increases by ~50% on main (roughly 4GB -> 6GB). However, with the de-duplicating vector format, the same vectors are re-used and there is (almost) no increase in index size!

Indexing is slightly slower (<5%) and merges are non-trivially slower (up to 33%; I'm currently looking into speeding this up).

@msokolov (Contributor)

msokolov commented Apr 24, 2026

It seems like we pay a small penalty for doing the indirect lookup. I wonder if we could save this in the case of the "main" field and only pay it for secondary fields if we change the API a bit so that users can specify a "primary" field that is the union of all the deduped fields?

Oh actually I misread the table! It looks as if the deduplicated vectors PR is actually a bit faster?!

@msokolov (Contributor)

How do you think this would combine with all the other vector formats, such as the quantizing ones? I guess we would want to avoid a lot of code copying... I guess it should be possible to somehow wrap an existing format so we can independently innovate on quantization? Also... do you see this as becoming the default flat vector format, or do you think users would select it in a custom codec?

@kaivalnp (Contributor Author)

> looks as if the deduplicated vectors PR is actually a bit faster?!

Yeah, this was surprising to me too -- it is ~10% faster, though I'm not sure why (there's one additional lookup from a vector's ord to its position in the data file, stored on-heap as an int[]).

> How do you think this would combine with all the other vector formats, such as the quantizing ones?

This is kind of tricky -- the main challenge is that quantization factors can depend on data distribution? (i.e. the same float[] vector may be quantized to a different byte[] for smaller graphs).

A slightly smaller challenge is that quantization formats today store the quantized byte[] vector followed by a float correction factor on disk -- and because both are read at the same time, this layout leads to fewer page cache misses.

For de-duplication to be effective, I guess we'd need to decouple the quantized byte[] vector from the float correction factor? This would lead to more page cache misses -- but FWIW the correction factors are just 4B per vector per field, and can be "hot" too (either explicitly: by keeping on-heap, or implicitly: like HNSW graph edges, they are an order of magnitude smaller).
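
To make the layout point concrete, here is a minimal sketch of the two layouts under discussion. The method names are illustrative (not from the PR); only IndexOutput is real Lucene API.

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Coupled layout (today's quantized formats): vector and correction factor are
// adjacent on disk, so one sequential read pulls both in together.
static void writeCoupled(IndexOutput out, byte[] quantized, float correction) throws IOException {
  out.writeBytes(quantized, quantized.length);
  out.writeInt(Float.floatToIntBits(correction));
}

// Decoupled layout (what de-duplication would need): shared quantized vectors in
// one file, per-field correction factors (4B per vector per field) in another.
static void writeDecoupled(IndexOutput vecOut, IndexOutput corrOut, byte[] quantized, float correction)
    throws IOException {
  vecOut.writeBytes(quantized, quantized.length);
  corrOut.writeInt(Float.floatToIntBits(correction));
}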

> do you see this as becoming the default flat vector format

IMO it could be the default if the overhead of de-duplication is "small" (personally: something like <10% slower indexing / merge sounds like it would be worth replacing the default).

Since it currently isn't: perhaps a separate HNSW format backed by the de-duplicating format in a non-core module?
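
For the custom-codec route, the wiring could look roughly like this. A sketch: Lucene101Codec stands in for whichever codec is current, and it assumes DedupFlatVectorsFormat is usable directly as a KnnVectorsFormat (as the tests in this PR do).

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.dedup.DedupFlatVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.index.IndexWriterConfig;

IndexWriterConfig iwc = new IndexWriterConfig();
// Route all vector fields through the de-duplicating flat format.
iwc.setCodec(new Lucene101Codec() {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    return new DedupFlatVectorsFormat();
  }
});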

@kaivalnp (Contributor Author)

Iterated with the AI to iron out some performance bottlenecks.

Disclaimer: There's generous use of AI involved in the code, so functions may be more verbose than required! I have reviewed most of it locally, and will do a refactor to make it cleaner / more human-friendly soon.

Current Approach

See dedup-vectors-format-design.md in this PR for the AI-generated summary, but the flow is roughly:

  1. During indexing, maintain a per-field List of vectors on-heap (same as the flat format).
  2. During flush (occurs field-wise):
    1. Raw vectors across all fields are stored in a single "vector dictionary".
    2. Iterate over the List of vectors in a field, compute a hash for each vector and store it in a map of hash -> (field number, ordinal in field).
    3. On hash collision, check for equality with the vector referenced in the map:
      1. IF vectors are equal, do not write a separate copy to the "vector dictionary"; point to the existing entry instead.
      2. ELSE use a hash probing strategy (move to hash + 1, then hash + 2, and so on; see the sketch after this list).
    4. At the end, only unique vectors are stored in the dictionary.
    5. Each field stores an additional ord -> position in dictionary mapping (also see "Query Time Cost").
  3. During merge (also occurs field-wise):
    1. Get a view of vectors being merged across segments, as an iterator of (docid, vector, original segment idx, ord in original segment).
    2. Compute hash of each vector, and maintain a map of hash -> (field number, original segment idx, ord in original segment).
    3. On hash collision, we have two paths:
      1. IF both vectors (the one being merged + the one that collided) come from the same segment, which is also de-duplicating: we re-use the vector equality that has been resolved earlier (i.e. whether both vectors point to the same location in the original segment's dictionary).
      2. ELSE load the other vector onto heap for an explicit equality check. This happens when:
        1. Hash collisions occur across segments, where equality has not been resolved earlier.
        2. A normal flat format was used earlier, and we switched to the de-duplicating one (e.g. during a Lucene upgrade).
    4. The logic to write to the dictionary is the same as flush.
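
To make flush steps 2.2-2.4 concrete, here is a minimal sketch of the hash-probing insert. The names and the on-heap "dictionary" are illustrative; the PR's map stores (field number, ordinal) tuples and writes unique vectors to disk instead.

import java.util.Arrays;

// Open-addressing table mapping vector hash -> position in the dictionary.
class DedupSketch {
  final float[][] dictionary; // unique vectors seen so far (stands in for the on-disk dictionary)
  final int[] table;          // hash slot -> dictionary position, or -1 if empty
  int size = 0;

  DedupSketch(int maxVectors) {
    dictionary = new float[maxVectors][];
    table = new int[Integer.highestOneBit(maxVectors) * 4]; // power-of-two size, low load factor
    Arrays.fill(table, -1);
  }

  // Returns the dictionary position for this vector, appending it only if unseen.
  int add(float[] vector) {
    int mask = table.length - 1;
    int slot = hash(vector) & mask;
    while (table[slot] != -1) {
      if (Arrays.equals(dictionary[table[slot]], vector)) {
        return table[slot]; // duplicate: point to the existing entry
      }
      slot = (slot + 1) & mask; // collision: probe hash + 1, hash + 2, ...
    }
    dictionary[size] = vector; // unique: append to the dictionary
    table[slot] = size;
    return size++;
  }

  static int hash(float[] vector) {
    int h = 1;
    for (float v : vector) {
      h = 31 * h + Float.floatToIntBits(v);
    }
    return h;
  }
}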

Indexing Time Complexity

  1. One additional hash of each vector during indexing and merge (roughly equivalent to one vector similarity computation during HNSW indexing?).
  2. Adding into the hash map (constant on average, but expansion of the map on reaching capacity can add overhead).
  3. Equality checks for all duplicated vectors and hash collisions.
    1. During indexing, these are all on-heap -- adding a fixed cost.
    2. During merging, this is not expensive when both vectors are present in the same segment (already resolved during indexing, which is re-used).
    3. However, cross-segment equality checks can become costly: loading vectors in a random-access fashion onto heap (page cache misses).
    4. Perhaps we can use a hash with more bits to reduce false positives (sketched after this list), or only guarantee de-duplication within a document?
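
On (3.4): a 64-bit hash over the raw float bits would make false positives (and hence cross-segment equality checks) far rarer. A sketch, using an arbitrary odd multiplier (not what the PR uses):

// 64-bit hash over a vector's raw float bits, to cut hash-collision false positives.
static long hash64(float[] vector) {
  long h = 1;
  for (float v : vector) {
    h = h * 0x9E3779B97F4A7C15L + Float.floatToIntBits(v);
  }
  return h;
}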

Query Time Cost

  1. One additional lookup per vector operation:
    1. Earlier: vectors were stored contiguously, so the offset in the flat file was a direct computation of ord * vectorByteSize.
    2. Now, vectors across all fields are stored in a single file, with multiple ordinals possibly referencing the same vector (i.e. no longer 1:1).
    3. The offset in the "vector dictionary" is something like ordToPosition[ord] * vectorByteSize instead.
    4. Currently, ordToPosition is an array stored on-heap; perhaps this can be index-backed too? (One possible shape is sketched below.)

All-in-all, it does not seem too expensive compared to the flat format?
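
On the index-backed idea in Query Time Cost (4): Lucene's existing packed-ints utilities could hold the mapping off-heap. A rough sketch (not what this PR does); the tradeoff vs. the on-heap int[] is a possible page cache miss per lookup.

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectReader;
import org.apache.lucene.util.packed.DirectWriter;

// Flush time: write the ord -> dictionary position mapping as packed ints.
static void writeOrdToPosition(IndexOutput out, int[] ordToPosition, int maxPosition)
    throws IOException {
  int bits = DirectWriter.bitsRequired(maxPosition);
  DirectWriter writer = DirectWriter.getInstance(out, ordToPosition.length, bits);
  for (int position : ordToPosition) {
    writer.add(position);
  }
  writer.finish();
}

// Search time: open once per segment, then resolve positions through the page cache.
static LongValues openOrdToPosition(IndexInput in, int bits) throws IOException {
  return DirectReader.getInstance(in.randomAccessSlice(0, in.length()), bits);
}
// ... offset = positions.get(ord) * vectorByteSize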

@kaivalnp (Contributor Author)

I observed a benchmark issue in the luceneutil KNN benchmark where force_merge(s) was noisy -- the indexing flow in that script is split into three phases:

  1. initial indexing
  2. wait for running merges to finish
  3. force merge

The index(s) column reports (1); force_merge(s) reports (3); but (2) is untracked!
Merges are selected by Lucene, and one run can do more or less work than another -- leading to incomparable force_merge(s)?

To work around this, I simply moved this line to after waitForMergesWithStatus to include both (1) and (2) in index(s).

Can we look at the sum of index(s) and force_merge(s) as a "total indexing time" proxy?

cc @mikemccand


Benchmarks

I ran a luceneutil KNN benchmark with Cohere v3 vectors, 1024d, DOT_PRODUCT similarity, 1M documents, 10K queries, no quantization.

To benchmark this PR, we need to replace the format here with DedupFlatVectorsFormat.

Note: Before the work-around listed above, index(s) was very close in main and this PR, but the untracked running merges were costlier in this PR, leading to artificially lower force_merge(s) (as visible below).

The numbers below do include the work-around that considers initial indexing + wait for running merges in index(s). For this benchmark I'm approximating index(s) + force_merge(s) as "total indexing time" for comparison.

main

recall  latency(ms)  netCPU  avgCpuCount  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)         filterStrategy  filterSelectivity
 0.973        6.065   6.058        0.999    16735    483.72       2067.31          178.37         4055.40  query-time-pre-filter               0.50
 0.968        5.805   5.804        1.000    16195    483.72       2067.31          178.37         4055.40  query-time-pre-filter               0.20
 0.971        6.120   6.119        1.000    12687    483.72       2067.31          178.37         4055.40  query-time-pre-filter               0.10
 0.988        3.103   3.101        0.999     8118    552.98       1808.40          392.23         6077.95      index-time-filter               0.50
 0.991        2.468   2.467        1.000     7412    485.77       2058.59          276.12         4862.36      index-time-filter               0.20
 0.993        2.218   2.217        1.000     6818    428.49       2333.77          281.16         4457.81      index-time-filter               0.10

This PR

recall  latency(ms)  netCPU  avgCpuCount  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)         filterStrategy  filterSelectivity
 0.973        6.240   6.238        1.000    16709    592.26       1688.44           80.70         4058.20  query-time-pre-filter               0.50
 0.968        6.094   6.092        1.000    16102    592.26       1688.44           80.70         4058.20  query-time-pre-filter               0.20
 0.971        6.369   6.367        1.000    12611    592.26       1688.44           80.70         4058.20  query-time-pre-filter               0.10
 0.988        2.815   2.811        0.999     8125    775.38       1289.70          239.68         4130.06      index-time-filter               0.50
 0.991        2.161   2.160        0.999     7413    441.66       2264.17          291.74         4085.13      index-time-filter               0.20
 0.993        2.417   2.416        1.000     6827    392.37       2548.60          315.08         4070.74      index-time-filter               0.10

Summary

  1. Sanity check: recall is exactly the same on main and this PR.
  2. latency(ms) is in the same range too: some runs are higher, some lower, but overall it appears to be within HNSW noise?
  3. As a recap of the options used:
    1. filterStrategy = query-time-pre-filter with filterSelectivity = F applies a filter during graph traversal that matches an F proportion of docs. Note that the reported latency does NOT include time spent in BitSet creation for the filter; it only reports graph search time.
    2. filterStrategy = index-time-filter (added in mikemccand/luceneutil#468, "Add option for index-time filtering to knnPerfTest.py") with filterSelectivity = F indexes the same docs (an F proportion) into a separate vector field, which is then used at search time. There is no separate query-time cost, apart from the user resolving to the correct field based on the filter themselves.
    3. The difference between query-time-pre-filter and index-time-filter is ~2-3x, and demonstrates the tradeoff nicely: more work during indexing to create a separate HNSW graph, for faster filtered search.
  4. "Total indexing time" (index(s) + force_merge(s)) change:
    1. 662.09 s -> 672.96 s (+1.6%) for query-time-pre-filter (i.e. no explicit duplicate vector field).
    2. 945.21 s -> 1015.06 s (+7.4%) for index-time-filter with 50% docs in the smaller field.
    3. 761.89 s -> 733.4 s (-3.7%) for index-time-filter with 20% docs in the smaller field.
    4. 709.65 s -> 707.45 s (-0.3%) for index-time-filter with 10% docs in the smaller field.
    5. All-in-all, there is <10% difference with this PR (slower in most cases, occasionally faster, likely due to indexing non-determinism).
  5. There is a sharp reduction in index_size(MB) with this PR as expected: only unique vectors are stored on disk (e.g. ~4GB index size without a separate vector field -> ~6GB with 50% duplicates on main, which is reclaimed with this PR; the main additional disk usage is for the HNSW graph).

As a next step, I'll increase the duplication factor: up to TK fields, each containing a (different) proportion P of the main field's docs (to simulate a practical scenario of the proposal in #14758).

Linked issue: #14758 (Support multiple HNSW graphs backed by the same vectors)