Skip to content

Improve clause ordering for multi-field DocValues range conjunctions #15981

@sgup432

Description

@sgup432

Description

After #15954, SortedNumericDocValuesRangeQuery uses SkipBlockRangeIterator as its two-phase approximation. I see that SkipBlockRangeIterator.cost() currently returns NO_MORE_DOCS, which means when multiple DV range queries are combined in a FILTER conjunction, both DenseConjunctionBulkScorer and ConjunctionDISI will sort the clauses in arbitrary order as they all report the same cost?

I wonder if we should have a better way to do this. That is, we choose the most selective field(the one which can eliminate most docs) as the lead iterator, as this will allow us to skip most of the docs. This might help in performance depending on the number of fields in range conjunctions, higher the no. of fields, better ordering will give better performance.

The DocValuesSkipper already has metadata that could help estimate selectivity ie global minValue()/maxValue(), per-block min/max, and docCount(). Some ideas to do this:

  • Use skipper.docCount() as cost. Though only differentiate b/w sparse vs dense fields.
  • Estimate selectivity from queryRange / fieldRange * docCount. Rough but differentiates narrow vs wide queries
  • Count NO blocks by walking the skip tree at scorer creation. This is more accurate, O(num_blocks) but requires extra work ahead.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions