Skip to content

feat: implement semantic rule validation for DQ rules (#1169)#1203

Open
Vsatyam013 wants to merge 6 commits into
databrickslabs:mainfrom
Vsatyam013:feat/1169-semantic-validation
Open

feat: implement semantic rule validation for DQ rules (#1169)#1203
Vsatyam013 wants to merge 6 commits into
databrickslabs:mainfrom
Vsatyam013:feat/1169-semantic-validation

Conversation

@Vsatyam013

@Vsatyam013 Vsatyam013 commented May 27, 2026

Copy link
Copy Markdown
Contributor

Changes

Implements semantic validation for DQ rule expressions as part of issue #1169.
Adds a SemanticValidator class that checks SQL expressions for forbidden
destructive keywords (DROP, TRUNCATE, DELETE, UPDATE, INSERT, ALTER) before
rules are applied. Validation results are collected into ChecksValidationStatus
rather than raising exceptions, consistent with the existing validation contract.

Linked issues

Resolves #1169

Tests

  • added unit tests

Documentation and Demos

  • added/updated demos
  • added/updated docs
  • added/updated agent skills

@CLAassistant

CLAassistant commented May 27, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@github-actions

Copy link
Copy Markdown
Contributor

All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

@Vsatyam013 Vsatyam013 force-pushed the feat/1169-semantic-validation branch from d9d5ef7 to 379a5ba Compare May 27, 2026 16:44
@Vsatyam013 Vsatyam013 marked this pull request as ready for review May 28, 2026 03:08
@Vsatyam013 Vsatyam013 requested a review from a team as a code owner May 28, 2026 03:08
@Vsatyam013 Vsatyam013 requested review from grusin-db and removed request for a team May 28, 2026 03:08
@Vsatyam013 Vsatyam013 force-pushed the feat/1169-semantic-validation branch from ff01ddc to f84132b Compare May 29, 2026 05:41

@ghanse ghanse left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Vsatyam013 , the linked issue focuses more on validating the entire ruleset to detect the following issues:

  • Detect if a duplicate rule already exists with the same run config
  • Detect if a similar rule exists with different arguments (e.g. a narrower threshold of is_in_range)

Can you make some changes to address these more specifically? I think it should be simpler because the check function inputs are mostly structured. We can document any limitations when users execute checks with Spark expressions.

Comment thread CHANGELOG.md
Comment thread src/databricks/labs/dqx/semantic_validator.py
@ghanse ghanse added under-review This PR is currently being reviewed by one of DQX maintainers. needs-changes Changes required after review labels Jun 2, 2026
@mwojtyczka

mwojtyczka commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@Vsatyam013 do you plan to complete the PR, and address review comments? there was no update since 4 weeks.

@Vsatyam013 Vsatyam013 force-pushed the feat/1169-semantic-validation branch from ea0c092 to da2897a Compare June 25, 2026 03:24
@Vsatyam013

Vsatyam013 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Hi @mwojtyczka, sorry for the delay. I've reworked the implementation based on @ghanse's feedback. The SQL keyword validation approach has been removed and replaced with ruleset-level semantic checks, duplicate rule detection and conflicting rule detection (same function and column with different arguments). The validation is now integrated into validate_checks, save_checks, and load_checks with a configurable semantic_validation_mode parameter (warn, fail, or None). CHANGELOG changes have also been reverted.Can you review it?

@@ -53,10 +53,9 @@
from databricks.labs.dqx.errors import InvalidCheckError, InvalidConfigError, InvalidParameterError
from databricks.labs.dqx.utils import list_tables, safe_strip_file_from_path, resolve_variables, VariableValue
from databricks.labs.dqx.io import is_one_time_trigger

from .semantic_validator import SemanticValidator, SemanticValidationMode

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full notation, not relative import

self,
config: BaseChecksStorageConfig,
variables: dict[str, VariableValue] | None = None,
semantic_validation_mode: str | None = SemanticValidationMode.WARN, # <-- ADD THIS

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
semantic_validation_mode: str | None = SemanticValidationMode.WARN, # <-- ADD THIS
semantic_validation_mode: str | None = SemanticValidationMode.WARN,

Comment thread src/databricks/labs/dqx/engine.py Outdated
) -> list[dict]:
"""Load DQ rules (checks) from the storage backend described by *config*.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert these indentation changes pls

"""
handler = self._checks_handler_factory.create(config)
checks = handler.load(config)
merged_variables = self._merge_variables(variables)
return resolve_variables(checks=checks, variables=merged_variables)
resolved = resolve_variables(checks=checks, variables=merged_variables) # <-- RENAME from return value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these comments

arguments = SemanticValidator._get_arguments(check)
criticality = check.get("criticality", "error")
filter_expr = check.get("filter")
return (function, tuple(sorted(arguments.items())), criticality, filter_expr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detect_duplicates crashes on any list-valued argument (TypeError: unhashable type: 'list'). This key embeds tuple(sorted(arguments.items())), and detect_duplicates uses the result as a dict key (key in seen / seen[key] = idx). Many common checks carry list arguments — e.g. is_in_list allowed=["A","B","C"], range bounds, for_each_column lists — so the tuple contains a list and dict insertion raises TypeError: unhashable type: 'list'. I reproduced it with a single is_in_list rule. Because apply() runs by default (WARN) inside validate_checks/load_checks/save_checks, this would crash normal rulesets once the indentation error is fixed. Suggest serializing argument values to a hashable form, e.g. json.dumps(arguments, sort_keys=True, default=str), and adding a regression test that includes a list-valued argument.

Comment thread src/databricks/labs/dqx/engine.py Outdated

@mwojtyczka mwojtyczka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. I feft some comments. The engine.py does not compile due to identation issues. Please also update the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-changes Changes required after review under-review This PR is currently being reviewed by one of DQX maintainers.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add methods for semantic rule validation

4 participants