Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior, weighted Log-OP, and parameter estimation by jaepil · Pull Request #15948 · apache/lucene

jaepil · 2026-04-10T04:19:29Z

Summary

Follow-up to #15827. This PR extends BayesianScoreQuery and LogOddsFusionQuery with three improvements:

BayesianScoreEstimator: Auto-estimates sigmoid calibration parameters (alpha, beta) and corpus-level base rate from score distributions via pseudo-query sampling
Base rate prior for BayesianScoreQuery: Optional corpus-level relevance prior that shifts the posterior in log-odds space: sigmoid(alpha * (score - beta) + logit(baseRate)), improving calibration for rare-relevance corpora
Weighted Logarithmic Opinion Pooling for LogOddsFusionQuery: Per-signal weights enabling weighted Log-OP where each signal's log-odds contribution is scaled by its reliability weight, plus optional logit normalization bounds

Algorithm Details

BayesianScoreEstimator

Estimates BayesianScoreQuery parameters from corpus statistics via pseudo-query sampling:

Sample N documents randomly from the index (Fisher-Yates partial shuffle)
For each document, create a pseudo-query from its first few tokens in the target field
Run each pseudo-query via BM25 and collect the score distribution
Estimate: beta = median(scores), alpha = 1 / std(scores)
Estimate base rate: mean fraction of documents scoring above the 95th percentile, clamped to [1e-6, 0.5]

Base Rate Prior

When a base rate r is set on BayesianScoreQuery, the posterior is computed as:

P = sigmoid(alpha * (score - beta) + logit(r))

where logit(r) = log(r / (1 - r)). This shifts scores down for rare-relevance corpora (e.g., r = 0.01 adds a -4.6 logit offset), improving calibration without changing ranking order within a single query.

Weighted Log-OP

When per-signal weights are provided to LogOddsFusionQuery, the scoring formula changes from uniform mean to weighted sum:

uniform:  sigmoid(n^alpha * mean(softplus(logit(p_i))))
weighted: sigmoid(n^alpha * sum(w_i * gated(logit(p_i))))

Weights must be non-negative and sum to 1. Optional per-signal logit normalization bounds (logitMin, logitMax) enable min-max normalization as an alternative to softplus gating, useful when learned signal scales differ significantly.

New Files

File	Description
`BayesianScoreEstimator.java`	Auto-estimates alpha, beta, base rate from corpus score distributions

Modified Files

File	Description
`BayesianScoreQuery.java`	Add base rate prior support with logit-space shifting
`LogOddsFusionQuery.java`	Add per-signal weights, logit normalization bounds, and weighted Log-OP
`LogOddsFusionScorer.java`	Implement weighted scoring and logit normalization gating
`TestBayesianScoreQuery.java`	11 new tests for base rate and estimator
`TestLogOddsFusionQuery.java`	12 new tests for weighted fusion and normalization

Test Coverage (23 new tests)

BayesianScoreQuery base rate (7 tests)

Base rate lowers scores compared to no base rate
Scores remain in (0, 1) range with base rate
Max score correctness with WAND optimization
Explanation includes base rate details
QueryUtils.check, equals/hashCode, illegal argument validation

BayesianScoreEstimator (4 tests)

Estimated parameters are finite and valid
Estimated parameters produce valid scores in (0, 1)
Max score correctness with estimated parameters
Reproducibility with same random seed

LogOddsFusionQuery weighted fusion (10 tests)

Weighted fusion produces valid scores
Weights affect ranking order
Explanation correctness for weighted variant
equals/hashCode, toString, rewrite, QueryUtils.check
Illegal weight validation (wrong length, negative, non-unit-sum)
Three-way weighted combination

LogOddsFusionQuery logit normalization (2 tests)

Normalized fusion produces valid scores in (0, 1)
Max score correctness with normalization bounds

Test plan

./gradlew tidy passes (google-java-format via Spotless)
./gradlew :lucene:core:compileJava :lucene:core:compileTestJava passes
All 57 tests pass in TestBayesianScoreQuery and TestLogOddsFusionQuery

…ybrid search - Add BayesianScoreEstimator for auto-estimating sigmoid calibration parameters - Add base rate prior support to BayesianScoreQuery for log-odds shifting - Add per-signal weights to LogOddsFusionQuery for weighted Logarithmic Opinion Pooling - Add logit normalization support to LogOddsFusionScorer - Add comprehensive tests for BayesianScoreQuery and LogOddsFusionQuery

benwtrent · 2026-04-23T15:02:26Z

+ *
+ * @lucene.experimental
+ */
+public class BayesianScoreEstimator {


So, I see these params are then used within BayesianScoreQuery

I wonder, could we have a constructor for BayesianScoreQuery (and have those internal parameters be nullable), that detects during rewrite if the parameters are null, and if they are, we provide the correct estimation?

Or we adjust the interface so that BayesianScoreQuery accepts an estimator in its constructor OR the parameters, and if its an estimator, it will handle it rewrite?

Is the main concern that the estimation should only ever happen once per the life time of the index? Or only periodically vs. on every query?

Great questions — let me take them in reverse order, since the lifecycle question (3) is the most fundamental and the API choice follows from it.

On lifecycle (3): The estimated parameters are corpus-level statistics. α and β are derived from the BM25 score distribution's center and spread, and the base rate is a global prior. None of them depend on the user query, so the natural lifecycle is per-IndexReader (per-commit), not per-query. Estimation runs ~50 pseudo-queries × top-K collection, which is fine once per reader but prohibitive on every query.

On putting estimation inside rewrite() (1 and 2): I'm a bit hesitant for a few reasons:

rewrite() is generally expected to be cheap and stats-driven, not to perform I/O of this magnitude (reading stored fields, running 50 inner searches, sorting score arrays).

Even with a fixed seed, lazy estimation in rewrite() would need a reader-keyed cache to avoid redoing the work — otherwise every rewrite() call repeats the sampling.

It blurs query identity: equals/hashCode of an unestimated query vs. its rewritten form needs careful handling, especially for the query-cache layer.

What I'd propose instead: keep the explicit Parameters constructor as the primary, deterministic API, and add a convenience factory:

public static Query BayesianScoreQuery.withAutoCalibration( IndexSearcher searcher, String field, Query inner) throws IOException;

Internally this memoizes Parameters keyed by IndexReader.CacheHelper#getKey(), so estimation runs once per reader and is cleaned up automatically when the reader closes. The user gets the "just works" ergonomics without overloading rewrite() with sampling I/O.

Structurally this follows the same precedent as KnnFloatVectorQuery / KnnByteVectorQuery: some queries inherently need a reader-bound resolution step, and Lucene already accommodates that. The difference here is that we resolve eagerly at construction time rather than lazily in rewrite(), since calibration parameters are reusable across many inner queries against the same reader (whereas a kNN result set is tied to a specific query vector and isn't).

Happy to push this as a follow-up commit if the direction makes sense.

The estimated parameters are corpus-level statistics. α and β are derived from the BM25 score distribution's center and spread, and the base rate is a global prior. None of them depend on the user query, so the natural lifecycle is per-IndexReader (per-commit), not per-query. Estimation runs ~50 pseudo-queries × top-K collection, which is fine once per reader but prohibitive on every query.

Ah, gotcha! I am better understanding. Thank you.

My concern is how do we know what a "typical user query" looks like. Doesn't this require knowledge of the query?

Or did y'alls empirical analysis show that just using random docs worked well enough?

Great question, and the answer is: calibration doesn't need to model the user query distribution — it only needs the score distribution to be representative of the corpus's BM25 dynamic range.

Here's why: α and β are derived from the BM25 score distribution's spread (alpha = 1/std) and center (beta = median). These are scale statistics. As long as the pseudo-queries exercise the same scoring code path that real user queries will hit (BM25Similarity over the same field's term frequencies and IDF table), the resulting α/β describe the scorer's calibration, which is invariant to which specific terms appear in the query. The base rate is similarly a corpus-level fraction, not query-conditional.

A useful sanity check: sigmoid is monotone, so α and β never change ranking — they only adjust where on the (0,1) curve scores land for downstream Log-OP fusion. Even substantial pseudo-query/real-query distribution mismatch only shifts the calibration curve, which is the same effect as picking a different α/β manually.

That said, the "random docs + first N tokens" approach in this PR does have a real weakness on corpora with shared boilerplate prefixes (license headers, structured templates), where pseudo-queries collapse into near-duplicates. I'm thinking about replacing the document-text path with reservoir sampling over the field's indexed vocabulary, which would give uniform random samples of unique terms instead — a more defensible "what does this scorer's distribution look like" probe than "what do the first 5 words of random documents look like."

We did test this calibration approach during the research phase across several corpora and didn't see issues, but I'd like to redo that validation directly against the Lucene implementation as a follow-up PR before this leaves @lucene.experimental status.

benwtrent · 2026-04-30T10:58:18Z

+      // Extract first N tokens as pseudo-query terms
+      String[] tokens = tokenize(fieldValue, tokensPerQuery);


I wonder if just the first N works well for all types of data. For example, legal or source code may have all docs with a very similar "header" and this would effectively eliminate any random distribution and have a pretty significant bias.

You're right — taking the first N tokens biases heavily toward boilerplate prefixes (license headers, legal preambles, structured templates), and on those corpora the pseudo-queries collapse into near-duplicates.

The cleaner fix, I think, is to drop the document-text path entirely and reservoir-sample over the field's indexed vocabulary via MultiTerms.getTerms(reader, field) + TermsEnum. Vocabulary-level sampling is uniform over unique terms, not over occurrences — a boilerplate term that appears in 100% of documents has the same selection probability as a rare content term, so shared-prefix corpora no longer dominate the sample.

If this direction sounds right, I'll prepare a follow-up commit with a regression test for the shared-prefix case.

benwtrent · 2026-04-30T11:18:22Z

+  private static String[] tokenize(String text, int maxTokens) {
+    // Simple whitespace tokenization with lowercasing
+    String[] parts = text.toLowerCase(java.util.Locale.ROOT).split("\\s+");
+    int n = Math.min(parts.length, maxTokens);
+    List<String> tokens = new ArrayList<>(n);
+    for (int i = 0; i < n; i++) {
+      String token = parts[i].replaceAll("[^a-z0-9]", "");
+      if (token.isEmpty() == false) {
+        tokens.add(token);
+      }
+    }
+    return tokens.toArray(new String[0]);
+  }


So, I think we should actually analyze with a provided analyzer or gather information from term vectors or something. I suspect for many corpuses that doing whitespace and trimming like this just doesn't reflect reality.

Agreed, I should have implemented better code here.

Rather than threading an Analyzer parameter through the public API, I think the cleaner fix is to side-step analysis entirely by sampling from the already-analyzed term dictionary via MultiTerms.getTerms(reader, field) + TermsEnum. Concretely: reservoir-sample nSamples * tokensPerQuery unique terms from the field's vocabulary, partition into pseudo-queries, and feed them directly into new Term(field, bytesRef). The bytes are identical to what's indexed, so:

No analyzer parameter needed on the public API.

No dependency on stored fields or term vectors.

Works correctly for any analyzer chain the user indexed with — Korean, Chinese, custom n-gram, anything.

If this direction sounds right, I'll prepare a follow-up commit.

github-actions Bot added the module:core/search label Apr 10, 2026

github-actions Bot added this to the 11.0.0 milestone Apr 10, 2026

Move CHANGES.txt entry from 11.0.0 to 10.5.0 Improvements section

1de219b

github-actions Bot modified the milestones: 11.0.0, 10.5.0 Apr 10, 2026

benwtrent reviewed Apr 23, 2026

View reviewed changes

benwtrent reviewed Apr 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior, weighted Log-OP, and parameter estimation#15948

Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior, weighted Log-OP, and parameter estimation#15948
jaepil wants to merge 2 commits intoapache:mainfrom
jaepil:bayesian-bm25

jaepil commented Apr 10, 2026

Uh oh!

benwtrent Apr 23, 2026

Uh oh!

jaepil Apr 30, 2026 •

edited

Loading

Uh oh!

benwtrent Apr 30, 2026

Uh oh!

jaepil Apr 30, 2026

Uh oh!

benwtrent Apr 30, 2026

Uh oh!

jaepil Apr 30, 2026 •

edited

Loading

Uh oh!

benwtrent Apr 30, 2026

Uh oh!

jaepil Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// Extract first N tokens as pseudo-query terms
		String[] tokens = tokenize(fieldValue, tokensPerQuery);

Conversation

jaepil commented Apr 10, 2026

Summary

Algorithm Details

BayesianScoreEstimator

Base Rate Prior

Weighted Log-OP

New Files

Modified Files

Test Coverage (23 new tests)

BayesianScoreQuery base rate (7 tests)

BayesianScoreEstimator (4 tests)

LogOddsFusionQuery weighted fusion (10 tests)

LogOddsFusionQuery logit normalization (2 tests)

Test plan

Uh oh!

benwtrent Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

jaepil Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benwtrent Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jaepil Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

benwtrent Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jaepil Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benwtrent Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jaepil Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jaepil Apr 30, 2026 •

edited

Loading

jaepil Apr 30, 2026 •

edited

Loading