Skip to content

Fixes #15274: Protobuf schema parsing when message name differs from topic#27866

Open
jatinmasaram wants to merge 12 commits intoopen-metadata:mainfrom
jatinmasaram:fix-protobuf-clean
Open

Fixes #15274: Protobuf schema parsing when message name differs from topic#27866
jatinmasaram wants to merge 12 commits intoopen-metadata:mainfrom
jatinmasaram:fix-protobuf-clean

Conversation

@jatinmasaram
Copy link
Copy Markdown

@jatinmasaram jatinmasaram commented May 2, 2026

Describe your changes:

Fixes #15274

The Kafka Protobuf ingestion currently assumes that the top-level message name is derived from the topic name using PascalCase (e.g. loansLoans). This fails when the actual Protobuf message name differs, leading to getattr returning None and downstream NoneType.DESCRIPTOR errors.

This PR updates the message resolution logic to:

  • Retain the existing PascalCase lookup for backward compatibility
  • Add a fallback using pb2_module.DESCRIPTOR.message_types_by_name to dynamically discover declared top-level messages
  • Select the first available message if multiple exist (with debug logging for visibility)

This ensures schema parsing works even when naming conventions differ, without breaking existing behavior.

Testing:

  • Added unit tests covering:

    • Message name ≠ topic name (core issue)
    • Legacy behavior where names match
    • Edge cases (no descriptor, multiple messages)
  • All tests pass locally

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document.
  • My PR title is Fixes #15274: Protobuf schema parsing fails when message name differs from topic
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing.
    Uploading test_protobuf_parser_fix_logfile.png…

Summary by Gitar

  • Refactored Protobuf parser:
    • Cleaned up long-form formatting to improve readability across protobuf_parser.py logic.
    • Updated _resolve_message_class to streamline conditional logic while maintaining existing name-matching precedence.
  • Improved error handling:
    • Standardized logging statements throughout the parsing flow for consistent debug/warning output.
  • Code cleanup:
    • Removed redundant multiline assignments and simplified recursive child parsing logic in _get_column_fields and _get_field_models.

This will update automatically on new commits.

@jatinmasaram jatinmasaram requested a review from a team as a code owner May 2, 2026 19:07
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/parsers/protobuf_parser.py Outdated
Comment thread ingestion/tests/unit/test_protobuf_parser_fix.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@jatinmasaram
Copy link
Copy Markdown
Author

Hi, could a maintainer please add the 'safe to test' label so CI can run?

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label May 2, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

🟡 Playwright Results — all passed (16 flaky)

✅ 3983 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🟡 Shard 2 751 0 3 8
🟡 Shard 3 742 0 5 7
🟡 Shard 4 774 0 1 18
✅ Shard 5 687 0 0 41
🟡 Shard 6 730 0 7 8
🟡 16 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/SampleDataTableOperations.spec.ts › should open delete confirmation modal and require DELETE confirmation (shard 3, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › add-event-filter-button is enabled in OSS (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should allow Data Consumer to edit glossary terms for searchIndex (shard 6, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • Pages/Users.spec.ts › Create and Delete user (shard 6, 1 retry)
  • Pages/Users.spec.ts › Reset Password for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Comment thread ingestion/src/metadata/parsers/protobuf_parser.py
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

…n name differs from topic

- Add _resolve_message_class() with two-step fallback for issue open-metadata#15274
- Fix type errors: use DataType/DataTypeTopic enum members instead of strings
- Fix list invariance in get_protobuf_fields via list() cast
- Remove invalid FieldName import, use plain string for FieldModel.name
- Split get_protobuf_fields into typed helpers _get_column_fields/_get_field_models
- Fix worker_id fixture to work with and without pytest-xdist
- Rename ProtobufParserTests to TestProtobufParser for pytest discovery
Comment thread ingestion/src/metadata/parsers/protobuf_parser.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@jatinmasaram
Copy link
Copy Markdown
Author

The failing Playwright tests (e.g., GlossaryImportExport, LineageFilters, LineageRightPanel) appear unrelated to the changes in this PR. This PR only modifies the Protobuf parser in the ingestion layer, while the failures are occurring in UI E2E flows involving glossary import and lineage entity operations (HTTP 409 conflicts). The protobuf parsing logic and associated unit tests are passing locally and are isolated from these UI workflows. Requesting maintainer review to confirm whether these failures are pre-existing or can be overridden.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 5, 2026

Code Review ✅ Approved 9 resolved / 9 findings

Refactors Protobuf schema parsing to dynamically resolve message names via descriptor introspection, ensuring compatibility when names differ from topics. Addresses multiple issues regarding test path manipulation, type handling, and UnboundLocalError exceptions.

✅ 9 resolved
Edge Case: Returning [] instead of None changes children semantics

📄 ingestion/src/metadata/parsers/protobuf_parser.py:334 📄 ingestion/src/metadata/parsers/protobuf_parser.py:325
The change from return None if not field_models else field_models to return field_models means that when a RECORD field has no sub-fields, children will now be [] instead of None. This affects the recursive call at line 325 where the result of get_protobuf_fields is assigned to children. Downstream consumers or JSON serialization may treat an empty list differently from None (e.g., "children": [] vs field omission). The return type annotation Optional[List[...]] suggests None was the intended "no children" sentinel.

In practice this is unlikely to cause issues since both are falsy, but it's a subtle behavioral change worth noting. The test at line 412 (assert result == []) now codifies the new behavior.

Quality: Trailing blank lines left from removed code in tests

📄 ingestion/tests/unit/test_protobuf_parser_fix.py:416-417 📄 ingestion/tests/unit/test_protobuf_parser_fix.py:428-430 📄 ingestion/tests/unit/test_protobuf_parser_fix.py:444-445
Several test methods have residual blank lines where mock setup code was removed (lines 416-417, 428-430, 444-445). These leave awkward whitespace at the start of test functions.

Bug: UnboundLocalError when exception path is taken in parse_protobuf_schema

📄 ingestion/src/metadata/parsers/protobuf_parser.py:274 📄 ingestion/src/metadata/parsers/protobuf_parser.py:283-291
In parse_protobuf_schema, if the try block raises an exception (e.g. instance.DESCRIPTOR fails), the except clause catches it but field_models is never assigned. After the finally block, line 291 executes return field_models or None, which raises UnboundLocalError: local variable 'field_models' referenced before assignment.

This means any exception during schema parsing will produce an unhandled crash instead of the intended graceful None return with a warning log. The early return None paths at lines 260 and 272 are fine (they exit before reaching 291), but the except-path is broken.

Bug: UnboundLocalError on exception in parse_protobuf_schema

📄 ingestion/src/metadata/parsers/protobuf_parser.py:283-291
If an exception is raised during the field_models = [...] assignment (lines 274-282), the except block catches it but does not return or assign field_models. Control then falls through the finally block to line 291: return field_models or None. Because field_models was never bound, this raises UnboundLocalError, masking the original error and producing a confusing traceback.

The early return None statements at lines 260 and 272 correctly go through finally but never reach line 291, so they're fine. The problem is exclusively the exception path.

Quality: Tests manipulate sys.modules/sys.path instead of using pytest

📄 ingestion/tests/unit/test_protobuf_parser_fix.py:27-41
The test file manually inserts ~15 stub modules into sys.modules and manipulates sys.path (lines 28-121) to avoid real imports. This is extremely brittle: any change to the import chain in protobuf_parser.py will silently break the stubs. It also violates the project convention of preferring integration tests and using unittest.mock only at external boundaries.

Additionally, the test imports from protobuf_parser import ... (bare module name via sys.path hack) rather than the canonical from metadata.parsers.protobuf_parser import ..., meaning these tests won't run correctly from the standard pytest invocation at the repo root.

...and 4 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@jatinmasaram
Copy link
Copy Markdown
Author

looks like the shard-1 failure happened during the environment setup and datamodel generation, which is before the tests even started. My changes are strictly limited to the Protobuf parser, so I don't think they could have touched the build pipeline. Mind giving it a re-run? Pretty sure it’s just a transient CI hiccup.

@jatinmasaram
Copy link
Copy Markdown
Author

The failure in test_lineage.py seems to be a pyodbc.DataError related to SQL Server datetime handling. since this PR is strictly focused on the protobuf parser and its unit tests, I don't see how it could impact the SQL Server lineage logic. Is this a known flaky test or a pre existing issue? Would appreciate a re run or some guidance on how to proceed

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 5, 2026

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
E Security Review Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

@jatinmasaram
Copy link
Copy Markdown
Author

@open-metadata/ingestion

Hi maintainers ,

I’ve reviewed the remaining CI failures and they appear outside the scope of this PR, which is limited to Protobuf schema parsing (issue #15274).

SonarCloud (S5443)
The /tmp path in ProtobufParserConfig.base_file_path is used only for temporary protobuf compilation artifacts:

  • no user-controlled input
  • no sensitive data
  • directory is cleaned up after use (shutil.rmtree)

Requesting review of this hotspot to confirm it can be marked as safe.

MSSQL lineage test
test_lineage[MssqlScheme.mssql_pyodbc-english] fails with a pyodbc.DataError related to datetime conversion. This seems tied to SQL Server locale/date formatting rather than the protobuf parser.

All protobuf-related tests are passing, with coverage ~69% (≥20% required).

Happy to investigate further or open a separate issue/PR for the MSSQL test if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ingestion: Kafka protobuf schema parsing only works if payload type is named the same as the topic

2 participants