Add a de-duplicating flat vector format #15979
Conversation
```
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat
org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorsFormat
org.apache.lucene.codecs.lucene104.Lucene104HnswScalarQuantizedVectorsFormat
org.apache.lucene.codecs.dedup.DedupFlatVectorsFormat
```
This is not required -- the raw vector format will be wrapped in its own Lucene*HnswVectorsReader which will be exposed here (keeping this for later).
I had to add this here to demonstrate that tests are passing (directly uses it as the KNN vector format).
```java
 * @param ord the vector ordinal
 * @return the byte offset for the vector
 */
public long ordToOffset(int ord) {
```
IMO this is the biggest change with this format -- the offset of a vector in the raw file was simply `ord * byteSize` earlier, but with a de-duplicating format that invariant is broken and multiple ordinals can "point" to the same vector, so we need an explicit function to resolve the offset.
This is probably more suitable for an interface like HasIndexSlice though.
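For intuition, here is a minimal standalone sketch of such an indirection (hypothetical names, not the PR's actual implementation): each ordinal resolves through an explicit offset table, and duplicates simply share an entry.

```java
// Hypothetical sketch of the explicit ord -> offset mapping described
// above: with de-duplication, several ordinals can resolve to the same
// byte offset in the raw vector file, so offset = ord * byteSize no
// longer holds and an extra lookup is needed.
class OrdToOffset {
  private final long[] offsets; // one entry per ordinal; duplicates share an offset

  OrdToOffset(long[] offsets) {
    this.offsets = offsets;
  }

  long ordToOffset(int ord) {
    return offsets[ord]; // one extra array read vs. the old ord * byteSize
  }

  public static void main(String[] args) {
    // ordinals 0 and 2 "point" at the same unique vector (offset 0)
    OrdToOffset map = new OrdToOffset(new long[] {0L, 16L, 0L, 32L});
    System.out.println(map.ordToOffset(2)); // prints 0
  }
}
```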
Important:
As we can imagine, if the same vector is indexed in a second field for 50% of docs, the index size increases by ~50% on main. Indexing is slightly slower (<5%) and merges are non-trivially slower (up to 33%; currently looking into speeding this up).
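A quick back-of-envelope check of the ~50% figure (doc count and dimension below are made-up numbers, and the 8-byte-per-ordinal offset table is an assumption for illustration):

```java
// Back-of-envelope illustration: re-indexing the same vectors in a
// second field for 50% of docs grows a plain flat index by ~50%, while
// a de-duplicating format only pays for a per-ordinal offset table.
class SizeEstimate {
  public static void main(String[] args) {
    long docs = 1_000_000;
    long bytesPerVector = 768 * 4L; // 768-dim float32

    long field1 = docs * bytesPerVector;
    long field2 = (docs / 2) * bytesPerVector; // exact duplicates of field1 vectors

    long flat = field1 + field2;                          // duplicates stored twice
    long dedup = field1 + (docs + docs / 2) * Long.BYTES; // unique vectors + offsets

    // prints roughly: flat=+50%, dedup=+0%
    System.out.printf("flat=+%d%%, dedup=+%d%%%n",
        (flat - field1) * 100 / field1, (dedup - field1) * 100 / field1);
  }
}
```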
It seems like we pay a small penalty for doing the indirected lookup. I wonder if we could save this in the case of the "main" field and only pay it for secondary fields if we change the API a bit so that users can specify a "primary" field that is the union of all the deduped fields? Oh actually I misread the table! It looks as if the deduplicated vectors PR is actually a bit faster?!
How do you think this would combine with all the other vector formats, such as the quantizing ones? I guess we would want to avoid a lot of code copying... It should be possible to somehow wrap an existing format so we can independently innovate on quantization? Also, do you see this as becoming the default flat vector format, or do you think users would select it in a custom codec?
Yeah, this was surprising to me too -- it is ~10% faster, though I'm not sure why (there's one additional lookup of the offset per ordinal, so I expected it to be slightly slower).
This is kind of tricky -- the main challenge is that quantization factors can depend on data distribution (i.e. the same raw vector can quantize to different values across fields / segments). A slightly smaller challenge is that quantization formats today store the quantized vectors alongside the raw ones. For de-duplication to be effective, I guess we'd need to decouple the quantized vectors from the raw ones.
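A toy illustration of the distribution problem (the min/max scalar quantizer below is a generic example, not Lucene's actual quantization code): the same raw value maps to different bytes depending on the value range observed in each field or segment, so quantized forms of duplicate raw vectors need not be equal.

```java
// Hypothetical demo: a simple min/max scalar quantizer maps the *same*
// raw value to different bytes when the observed range differs, which
// is why de-duplicating quantized vectors is harder than raw ones.
class QuantizeDemo {
  static int quantize(float v, float min, float max) {
    return Math.round((v - min) / (max - min) * 255f);
  }

  public static void main(String[] args) {
    float v = 0.5f;
    System.out.println(quantize(v, 0f, 1f)); // range seen in field A -> 128
    System.out.println(quantize(v, 0f, 2f)); // range seen in field B -> 64
  }
}
```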
IMO it could be the default if the overhead of de-duplication is "small" (personally, something like <10% slower indexing / merging sounds like it would be worth replacing the default). Since it currently isn't: perhaps a separate HNSW format backed by the de-duplicating format in a non-core module?
Iterated with the AI to iron out some performance bottlenecks. Disclaimer: there's generous use of AI in the code, so functions may be more verbose than required! I have reviewed most of it locally, and will do a refactor to try and make it cleaner / more human-friendly soon.

Current Approach

See the markdown doc in the PR for details.
Indexing Time Complexity
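One way to reason about the indexing cost is a hash-based de-duplication pass at flush time. The sketch below uses hypothetical names and is not the PR's actual code: hashing each vector costs O(d), map operations are amortized O(1), so adding n vectors of dimension d is ~O(n · d) -- the same order as writing a plain flat file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hash-based de-duplication during flush: the
// first occurrence of a vector is appended to the unique list; later
// duplicates just resolve to the existing unique index.
class FlushDedup {
  // Wraps float[] so the HashMap compares by content, not identity.
  private static final class Key {
    final float[] v;
    Key(float[] v) { this.v = v; }
    @Override public int hashCode() { return Arrays.hashCode(v); }
    @Override public boolean equals(Object o) {
      return o instanceof Key && Arrays.equals(v, ((Key) o).v);
    }
  }

  private final Map<Key, Integer> seen = new HashMap<>();
  private final List<float[]> unique = new ArrayList<>();

  /** Returns the unique-vector index; duplicates map to an existing one. */
  int add(float[] vector) {
    return seen.computeIfAbsent(new Key(vector), k -> {
      unique.add(vector);
      return unique.size() - 1;
    });
  }

  int uniqueCount() {
    return unique.size();
  }

  public static void main(String[] args) {
    FlushDedup w = new FlushDedup();
    System.out.println(w.add(new float[] {1f, 2f})); // 0
    System.out.println(w.add(new float[] {3f, 4f})); // 1
    System.out.println(w.add(new float[] {1f, 2f})); // 0 (duplicate)
    System.out.println(w.uniqueCount());             // 2
  }
}
```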
Query Time Cost
All-in-all, it does not seem too expensive compared to the flat format?
I observed a benchmark issue in the
The
To work around this, I simply moved this line to after

Can we look at the sum of
cc @mikemccand

Benchmarks

I ran a
To benchmark this PR, we need to replace the format here with
Note: Before the work-around listed above,
The numbers below do include the work-around that considers initial indexing + wait for running merges in
Summary
As a next step, I'll increase the duplication factor: up to TK fields, each containing a (different) proportion of duplicated vectors.
Description
Closes #14758
Add a new de-duplicating vector format that only stores unique vectors on disk.
De-duplication is done for vectors across all docs and fields indexed by the format.
Disclaimer: This was mostly written by an AI, with me refining the implementation through prompts -- although I think it did a pretty good job on its own!
Details about the format itself (layout of vectors on disk, de-duplication strategy during flush and merge, performance tradeoffs, etc) are included in a markdown doc in the PR.