feat(index): support Utf8View patterns in the ngram TextQueryParser #7310

Open
wombatu-kun wants to merge 2 commits into lance-format:main from wombatu-kun:feat/ngram-utf8view

@wombatu-kun

Contributor

Follow-up to #7139.

Problem

The ngram TextQueryParser extracts string patterns and regex flags by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the indexed column is Utf8View, safe_coerce_scalar (rust/lance-datafusion/src/expr.rs) coerces the predicate literal to ScalarValue::Utf8View(Some(..)), which the parser then fails to match. As a result, an ngram index on a Utf8View column silently fails to accelerate contains, regexp_like, or infix LIKE; these predicates fall back to a full scan.
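The gap described above can be reproduced in isolation. The sketch below is a simplified model, not the Lance code: `ScalarValue` is a local stand-in with only the three string variants, and `StringType`, `coerce_literal`, and `extract_pattern_before` are hypothetical names used for illustration.

```rust
// Local stand-in for the three string variants of DataFusion's ScalarValue.
// The real enum has many more variants; this is an assumption for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
}

// Hypothetical stand-in for the string type of the indexed column.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

// Simplified model of literal coercion: the predicate literal is re-wrapped
// in the variant that matches the column type.
pub fn coerce_literal(lit: &str, column: StringType) -> ScalarValue {
    let value = Some(lit.to_string());
    match column {
        StringType::Utf8 => ScalarValue::Utf8(value),
        StringType::LargeUtf8 => ScalarValue::LargeUtf8(value),
        StringType::Utf8View => ScalarValue::Utf8View(value),
    }
}

// Pre-change style extraction: only Utf8 and LargeUtf8 are matched, so a
// Utf8View literal falls through to the catch-all arm and yields None.
pub fn extract_pattern_before(scalar: &ScalarValue) -> Option<&str> {
    match scalar {
        ScalarValue::Utf8(Some(s)) | ScalarValue::LargeUtf8(Some(s)) => Some(s.as_str()),
        _ => None,
    }
}
```

In this model a literal coerced for a `Utf8` column is extracted, while the same literal coerced for a `Utf8View` column is not. A parser that gets no pattern back has nothing to hand to the index, which corresponds to the silent full-scan fallback.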

Change

Add the Utf8View arm to all three string-extraction sites in TextQueryParser: the contains / regexp_like pattern in visit_scalar_function, the infix-LIKE pattern in visit_like, and the regex-flags literal in apply_regex_flags. The bindings are unchanged because all three string ScalarValue variants carry an Option<String>.
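As a rough illustration of why the bindings do not change, here is a minimal sketch of an or-pattern with the extra arm. `ScalarValue` is again a local stand-in limited to the three string variants, and `extract_pattern` is a hypothetical helper, not the actual code in visit_scalar_function, visit_like, or apply_regex_flags.

```rust
// Local stand-in for the three string variants of DataFusion's ScalarValue
// (an assumption for illustration; the real enum is much larger).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
}

// Hypothetical extraction helper with the Utf8View arm added.
// All three variants carry Option<String>, so the single binding `s`
// has the same type in every alternative of the or-pattern and the
// match body stays exactly as it was.
pub fn extract_pattern(scalar: &ScalarValue) -> Option<&str> {
    match scalar {
        ScalarValue::Utf8(Some(s))
        | ScalarValue::LargeUtf8(Some(s))
        | ScalarValue::Utf8View(Some(s)) => Some(s.as_str()),
        // Null literals still produce no pattern.
        _ => None,
    }
}
```

Rust requires every alternative in an or-pattern to bind the same names with the same types, which is why adding the third variant is a one-line change at each site in this sketch.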

Tests

  • New test_text_query_parser_utf8view: asserts contains and regexp_like over a Utf8View-typed ngram index route to the index (StringContains / Regex queries), and that infix LIKE with a Utf8View pattern is accelerated, with a Utf8 parity control.
  • Extended test_apply_regex_flags with a Utf8View flags literal.

Each test fails on the pre-change code and passes after.

Out of scope

The identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) is left untouched. It is a separate feature that #7139 never modified, so folding it in here would widen this change beyond its scope. It can be addressed in a separate change if desired.

This addresses the non-blocking review comment #7139 (comment) from @wjones127.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the enhancement and A-index labels Jun 17, 2026
@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 95.38462% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/expression.rs | 95.38% | 2 Missing and 1 partial ⚠️ |