Fix HNSW InfoStream duplicate times and add per-chunk completion logging #15978
iprithv wants to merge 6 commits into apache:main
Conversation
Whoa, thank you for the quick PR @iprithv! I'll have a look soon. This would make segment traces more consumable, and then we can also more accurately measure effective HNSW concurrency for each merge. The reason I'm chasing this is I was worried about a possible concurrency bug where a merge would incorrectly remain single-threaded for its duration if, when the merge started, there were other merges temporarily consuming all HNSW worker threads. Not sure there really is a bug ... speculating by attempting to simulate code in my head ;)
mikemccand left a comment:
Thanks @iprithv! I left a few comments. There are many words but I think pretty simple changes maybe?
-  if ((node % 10000 == 0) && infoStream.isEnabled(HNSW_COMPONENT)) {
+  // Skip in-loop progress for ranges <= 10000 where it would fire at most once
+  // with identical incremental/total times (#15967).
+  if (numVectors > 10000 && (node % 10000 == 0) && infoStream.isEnabled(HNSW_COMPONENT)) {
I think maybe the original print (when this was single threaded, maybe?) is/was trying to print elapsed time since the merge kicked off, and then also relative time since the previous print?
Also, it'd be nice to keep the consistent printing every 10K vectors (whichever worker thread happens to get the lucky golden tickets). Imagine an adversarial-to-InfoStream world where a large merge kicks off, there are many threads so all chunks get pulled and worked on by a thread, but there are not enough cores, so the threads progress slowly, and we never see any logging until all chunks are done?
Maybe we could stop trying to log the delta-time (time since last log)? Just time since merge began, and, keep printing on the lucky golden ticket (node % 10000 == 0)? Then we will see regular prints, it's always since HNSW merge start, it kinda behaves similar w/ many chunks vs one chunk?
Sure, updated.
- Kept the node % 10000 print for all ranges (removed the numVectors > 10000 guard)
- Dropped the delta time; the message now just shows elapsed time since start: "built %d in %.2f ms"
- For concurrent merges, it'll now show time since merge start (not chunk start). Added a mergeStartTimeNS field that HnswConcurrentMergeBuilder.build() sets on all workers before launching them. For the non-concurrent build() path, it falls back to chunk-local start time (which is the same as build start).
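For reference, a minimal sketch of that timing pattern -- the Worker and launch shapes here are illustrative assumptions, not the actual Lucene classes:

import java.util.List;
import java.util.Locale;

class MergeTimingSketch {

  static class Worker {
    long mergeStartTimeNS; // 0 means "not launched by the concurrent builder"

    void buildChunk(int from, int to) {
      long chunkStartNS = System.nanoTime();
      // ... insert vectors [from, to) into the HNSW graph ...
      // Non-concurrent build() path: fall back to the chunk-local start time.
      long baseNS = mergeStartTimeNS != 0 ? mergeStartTimeNS : chunkStartNS;
      double elapsedMs = (System.nanoTime() - baseNS) / 1_000_000.0;
      // Progress line: elapsed is now always measured from baseNS.
      System.out.printf(Locale.ROOT, "built %d in %.2f ms%n", to, elapsedMs);
    }
  }

  static void launch(List<Worker> workers) {
    long mergeStartTimeNS = System.nanoTime();
    for (Worker w : workers) {
      w.mergeStartTimeNS = mergeStartTimeNS; // stamp the shared start before launching
    }
    // ... submit each worker to the HNSW merge executor ...
  }
}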
  }
  if (infoStream.isEnabled(HNSW_COMPONENT)) {
    long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    infoStream.message(
This is awesome! I wonder if, separately from this new InfoStream log line, we could also aggregate these times and then (when HNSW merge is completely done, upstairs) add another summary InfoStream log line stating the effective concurrency? Could maybe just be an AtomicLong that each chunk increments its elapsed time into, then upstairs at HNSW merge end just divide that value by elapsed wall clock time to get & InfoStream.message the implied concurrency?
Sure! Added a shared AtomicLong cumulativeWorkTimeNS that each chunk increments with its elapsed time. At the end of HnswConcurrentMergeBuilder.build(), after all workers finish, it logs:

  merge completed: 100000 vectors, 12500.00 ms wall clock, 42300.45 ms cumulative worker time, 3.38x effective concurrency

This should make it straightforward to detect the single-threaded merge scenario: it would show ~1.0x effective concurrency.
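A condensed sketch of that accounting -- only the arithmetic and the summary format mirror the PR; the surrounding structure is assumed:

import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

class ConcurrencySketch {
  final AtomicLong cumulativeWorkTimeNS = new AtomicLong();

  void onChunkDone(long chunkStartNS) {
    // Each finished chunk adds its own elapsed time into the shared counter.
    cumulativeWorkTimeNS.addAndGet(System.nanoTime() - chunkStartNS);
  }

  String summary(long mergeStartNS, int numVectors) {
    double wallMs = (System.nanoTime() - mergeStartNS) / 1_000_000.0;
    double workMs = cumulativeWorkTimeNS.get() / 1_000_000.0;
    return String.format(
        Locale.ROOT,
        "merge completed: %d vectors, %.2f ms wall clock, %.2f ms cumulative worker time, %.2fx effective concurrency",
        numVectors, wallMs, workMs, workMs / wallMs);
  }
}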
Yay, this is awesome! Can't wait for the nightly build to run on this, then I'll scrutinize the InfoStream -- first light for this new metric!
  HNSW_COMPONENT,
  String.format(
      Locale.ROOT,
      "addVectors [%d %d): %d vectors in %d ms",
Maybe I am overly optimistic, but maybe we could %.2f these ms instead of %d? (And also the one in printGraphBuildStatus)? How fast are 2048 ANN searches + HNSW insert on these modern CPUs with massive SIMD silicon and native optimized dot products and such ...
Yes, individual chunks could well complete in sub-millisecond times. Updated both printGraphBuildStatus and the per-chunk completion message to use %.2f with nanoTime / 1_000_000.0 for proper sub-millisecond precision. Thanks!
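As an aside, a sketch of why the fractional conversion matters (hypothetical class and parameter names, not the Lucene code):

import java.util.Locale;

class PrecisionSketch {
  // Integer millis (%d) can round a fast 2048-vector chunk down to 0 ms;
  // fractional millis (%.2f) keep the sub-millisecond signal.
  static String chunkMessage(int from, int to, long startNs) {
    double elapsedMs = (System.nanoTime() - startNs) / 1_000_000.0;
    return String.format(
        Locale.ROOT, "addVectors [%d %d): %d vectors in %.2f ms", from, to, to - from, elapsedMs);
  }
}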
…ing (apache#15967) Signed-off-by: prithvi <prithvisivasankar@gmail.com>
4494239 to 512d5b4
mikemccand left a comment:
Looks great -- I left a couple trivial comments, then I think this is ready! Can't wait to see what nightly benchy sees as effective HNSW concurrency building the Cohere vector indices...
-  private long printGraphBuildStatus(int node, long start, long t) {
-    long now = System.nanoTime();
+  private void printGraphBuildStatus(int node, long start) {
+    double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;
Can we rename start -> startNs? I think it's crucial to include units in variable names ... I've seen too many bugs over the years where person 1 thought it was ns and person 2 thought it was us and ...
And of course the Mars Climate Orbiter disaster.
Sure, done. Thanks!
  Bug Fixes
  ---------------------
* GITHUB#15967: Fix HNSW InfoStream progress to show elapsed time since merge start instead of
Maybe state that this is an IndexWriter thing? Fix IndexWriter's HNSW InfoStream...?
Yes, makes sense, added. Thanks!
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
mikemccand left a comment:
This looks great -- I just had one nitpick about camelCaseContainingAcronyms consistency. Thanks @iprithv.
-  private long printGraphBuildStatus(int node, long start, long t) {
-    long now = System.nanoTime();
+  private void printGraphBuildStatus(int node, long startNs) {
+    double elapsedMs = (System.nanoTime() - startNs) / 1_000_000.0;
Hmm, can we be consistent about startNs vs chunkElapsedNS ("treat it like a word" --> chunkElapsedNs)? Darned camel-casing rules intersecting with acronyms/abbreviations... Wikipedia's CamelCase page has a whole paragraph dedicated to the conundrum.
Sure, updated. Thanks!
Description:
Fixes #15967
The printGraphBuildStatus method in HnswGraphBuilder.addVectors() reports progress as "built %d in %d/%d ms", where the two times are meant to show incremental (since the last print) and total (since start) elapsed time. However, during concurrent HNSW merging, each worker processes chunks of 2048 vectors, and the progress print fires every 10,000 nodes. This means the print fires at most once per chunk, always on the first matching node where t == start, producing identical values.

Changes
1. Suppress the broken in-loop progress for small ranges
Added a numVectors > 10000 guard to skip the in-loop progress print for ranges ≤ 10,000 vectors. These ranges can trigger the print at most once, where t == start always produces duplicate incremental/total times. For large ranges (e.g., the non-concurrent build() path with millions of vectors), the in-loop progress continues to fire every 10K nodes with correct, diverging times.

2. Add per-chunk completion logging
Every addVectors() call now emits a completion message with the range, vector count, and wall-clock elapsed time. This enables deriving effective concurrency during HNSW merges by summing all chunk times that contributed to one merge and dividing by wall-clock elapsed time, as described in #15967.
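A minimal sketch of that derivation, with illustrative names; the example values are the ones from the summary line quoted earlier in the conversation:

class EffectiveConcurrency {
  // Sum of per-chunk elapsed times divided by the merge's wall-clock time.
  // E.g. 42300.45 ms of summed chunk work over 12500.00 ms of wall clock
  // implies ~3.38x effective concurrency; ~1.0x means single-threaded.
  static double of(double[] chunkElapsedMs, double wallClockMs) {
    double totalWorkMs = 0;
    for (double ms : chunkElapsedMs) {
      totalWorkMs += ms;
    }
    return totalWorkMs / wallClockMs;
  }
}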
Output comparison
Before (concurrent merge chunk crossing a 10K boundary):
After:
Before (non-concurrent build of 1M vectors):
After: