Partitioned testing by jshook · Pull Request #661 · datastax/jvector

jshook · 2026-04-23T21:00:50Z

This PR contains the changes needed to test partitioned data, for example PVS data organized into profiles by label.

Typical structure:

dataset
 profiles
  label_00
  ...
  label_NN

These are hosted in a way that is compatible with the new loader (containing the entries layout), using logical dataset names like "testdataset:testprofile", which preserve the relationship to the base dataset. (The base dataset may have common base vectors that are shared in some cases, or that are related as upstream data.)
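
As an illustration of the naming scheme, here is a minimal sketch of splitting a logical dataset name into its base dataset and profile. It assumes the first colon is the separator; `DatasetRef` and `parse_dataset_name` are hypothetical names, not part of this change set.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    base: str               # base dataset name, e.g. "testdataset"
    profile: Optional[str]  # profile (partition) name, or None for the base dataset


def parse_dataset_name(name: str) -> DatasetRef:
    """Split "base:profile" on the first colon (assumed separator)."""
    base, sep, profile = name.partition(":")
    return DatasetRef(base, profile if sep else None)
```

A plain name without a colon is treated as the base dataset itself.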
To facilitate custom entries.yaml paths, the base_url can now specify the full name of the entries file. Its parent path is still used, as before, for all relative lookups of facet entries.
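
A rough sketch of that resolution rule, assuming the entries file is recognized by a `.yaml`/`.yml` suffix and that the default file name is `entries.yaml`. The function name and the example URLs are hypothetical; the loader in this PR may decide differently.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit

DEFAULT_ENTRIES = "entries.yaml"  # assumed default file name


def resolve_entries(base_url: str) -> Tuple[str, str]:
    """Return (entries_url, parent_url) for a base_url.

    If base_url names a .yaml/.yml file, that file is the entries file and
    its parent directory is the root for relative lookups. Otherwise
    base_url is treated as the directory and the default name is appended.
    """
    parts = urlsplit(base_url)
    path = parts.path
    if path.endswith((".yaml", ".yml")):
        parent_path = posixpath.dirname(path)
        entries_path = path
    else:
        parent_path = path.rstrip("/")
        entries_path = posixpath.join(parent_path, DEFAULT_ENTRIES)
    parent = urlunsplit(parts._replace(path=parent_path + "/"))
    entries = urlunsplit(parts._replace(path=entries_path))
    return entries, parent
```

Either form of base_url yields the same parent, so relative facet paths resolve identically in both cases.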

In the case of these profiles, for PVS testing specifically, each partition is a pre-filtered set of predicate-matching base vectors, and the per-label ground truths are computed by brute force over these, just like any other kNN answer key.
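
To make the per-label ground truth concrete, here is a small sketch: filter the base vectors by label, then take the exact top-k by Euclidean distance within that partition. The function names and the in-memory list representation are assumptions for illustration only; the real datasets are precomputed and hosted.

```python
import heapq
import math
from typing import List, Sequence


def brute_force_knn(base: Sequence[Sequence[float]],
                    query: Sequence[float],
                    k: int) -> List[int]:
    """Return indices of the k nearest base vectors by Euclidean distance."""
    dists = ((math.dist(vec, query), i) for i, vec in enumerate(base))
    return [i for _, i in heapq.nsmallest(k, dists)]


def partition_ground_truth(vectors, labels, label, queries, k):
    """Pre-filter by label, then compute exact top-k within that partition.

    Returned indices are positions within the partition, not the full dataset.
    """
    partition = [v for v, l in zip(vectors, labels) if l == label]
    return [brute_force_knn(partition, q, k) for q in queries]
```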

To support multiple labeled partitions of test data, the various index metadata, dataset name, and other files can now be configured to apply to multiple similarly named datasets.

For dataset-metadata, dataset name keys can now be globs like this:

"testdataset:labelprefix_*":
  similarity_function: EUCLIDEAN
  load_behavior: NO_SCRUB
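
A minimal sketch of how such a glob key might be resolved for a concrete dataset name. It assumes exact keys win over glob keys and that the first matching glob is used; the actual precedence in the harness may differ, and `lookup_metadata` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Mirrors the YAML entry above as a plain dict.
DATASET_METADATA = {
    "testdataset:labelprefix_*": {
        "similarity_function": "EUCLIDEAN",
        "load_behavior": "NO_SCRUB",
    },
}


def lookup_metadata(dataset_name: str, metadata: dict) -> Optional[dict]:
    """Prefer an exact key; otherwise return the first glob key that matches."""
    if dataset_name in metadata:
        return metadata[dataset_name]
    for pattern, props in metadata.items():
        if fnmatchcase(dataset_name, pattern):
            return props
    return None
```

With this rule, "testdataset:labelprefix_07" picks up the EUCLIDEAN entry, while a name outside the pattern gets no metadata.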

For index-parameters/*.yml files, a configuration can be matched against similar patterns with an also_for key:

also_for:
 - "testdataset:labelprefix_*"

Note that these patterns must be quoted.
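
For illustration, one way to decide whether an index-parameters configuration applies to a dataset. This sketch assumes the configuration carries its own dataset name under a `dataset` key alongside `also_for`; that key and the `applies_to` helper are assumptions, not confirmed by this PR.

```python
from fnmatch import fnmatchcase


def applies_to(config: dict, dataset_name: str) -> bool:
    """True if the config's own dataset matches or any also_for pattern does."""
    if config.get("dataset") == dataset_name:
        return True
    return any(fnmatchcase(dataset_name, p) for p in config.get("also_for", []))
```

A configuration without an also_for key behaves exactly as before: it applies only to its own dataset.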

repetitions

There is also a new 'repetitions' parameter that drives testing. If provided, it takes the average of multiple runs in grid for a given dataset. An accompanying queryRuns parameter for the search section does the same within each main run, for queries only. This logic is bypassed by default if these parameters are not provided.

For cached indexes, the repetition enum is passed in to ensure that repetitions include new index builds. If there are multiple repetitions, a post-hoc summarizer shows their average.
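
The nesting of the two parameters can be sketched roughly as follows. The callables and the single float metric are simplifications for illustration; grid itself reports several metrics, and `run_with_repetitions` is a hypothetical name.

```python
from statistics import mean
from typing import Callable, List


def run_with_repetitions(build_index: Callable[[int], object],
                         run_queries: Callable[[object], float],
                         repetitions: int = 1,
                         query_runs: int = 1) -> float:
    """Average a metric over repetitions, each averaging over query runs.

    build_index receives the repetition number so that a cache keyed on it
    cannot hand back the index built by an earlier repetition.
    """
    per_repetition: List[float] = []
    for rep in range(repetitions):
        index = build_index(rep)            # fresh build for every repetition
        runs = [run_queries(index) for _ in range(query_runs)]
        per_repetition.append(mean(runs))   # average within one main run
    return mean(per_repetition)             # post-hoc average across repetitions
```

With both parameters left at 1, this reduces to a single build and a single query pass, which corresponds to the default bypass.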

compression parameters

Support has been added for a matrix of compression parameters as fed to grid. For now, this is done by parameterizing instances of compression parameters and letting grid do what it already does. A follow-up improvement that gets rid of the for loop cascade really needs to be done. I didn't include it in this change set because it is a bigger change, and I wanted to keep this one easier to review even if it is a bit uglier.
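
As a sketch of what the follow-up could look like (not what this change set does), a nested for loop cascade over parameter axes can be flattened into a single Cartesian product. The axis names in the usage note are hypothetical placeholders, not real compression parameter names.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_matrix(axes: Dict[str, List]) -> Iterator[dict]:
    """Yield one parameter dict per combination of the given axis values."""
    names = list(axes)
    for values in product(*(axes[n] for n in names)):
        yield dict(zip(names, values))
```

For example, `expand_matrix({"axis_a": [1, 2], "axis_b": ["x", "y", "z"]})` yields six combinations, varying the last axis fastest, which is the same order a nested loop would produce.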

github-actions · 2026-04-23T21:01:02Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jshook added 7 commits April 23, 2026 19:45

support named entries.yaml files

aeac2a4

scaffolding for additional partitioned testing

d6b52c9

multiple grid repetitions/query runs

e189e9f

glob support in dataset metadata

2a4bea0

allow index parameters to be also_for others by pattern

fbbc855

add alpha to index construction parameters

20db6eb

Fan-out on compression parameters

b28cfa0

jshook wants to merge 7 commits into main from partitioned_testing
