feat: [Hackathron] Docstring linter#1037
Conversation
Unit test report (Python 3.9.24, Pydantic 2.12.3)272 tests 272 ✅ 8s ⏱️ Results for commit 5edcc6e. ♻️ This comment has been updated with latest results. |
Unit test report (Python 3.9.24, Pydantic 1.10.22)272 tests 272 ✅ 8s ⏱️ Results for commit 5edcc6e. ♻️ This comment has been updated with latest results. |
Summary of ChangesHello @yuriy-encord, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new docstring linter designed to significantly improve the quality and consistency of the SDK's documentation. By enforcing a standardized style, validating Sphinx roles, and ensuring accurate cross-references, the linter streamlines the documentation generation process and enhances the overall developer experience when interacting with the SDK. It includes robust issue detection, auto-fixing capabilities, and flexible reporting options, making it a powerful tool for maintaining high documentation standards. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new docstring linter to enforce documentation quality, along with a large number of auto-generated fixes to existing docstrings. This is a fantastic initiative for improving code maintainability and documentation clarity. My review focuses on the implementation of the linter itself and the correctness of the applied fixes. I've identified a few areas where the linter's logic can be improved for robustness and maintainability, and several instances where the auto-generated fixes are incorrect. Please see the detailed comments for suggestions.
|
|
||
| Returns: | ||
| List[DataRow]: A list of DataRows object that match the filter | ||
| List[:class:`~encord.orm.dataset.DataRow`]: A list of :class:`~encord.orm.dataset.DataRow`s object that match the filter |
There was a problem hiding this comment.
The auto-generated docstring contains a grammatical error: '...a list of :class:~encord.orm.dataset.DataRows object...'. This should be '...a list of :class:~encord.orm.dataset.DataRow objects...'. This suggests a potential issue in the linter's pluralization logic.
| List[:class:`~encord.orm.dataset.DataRow`]: A list of :class:`~encord.orm.dataset.DataRow`s object that match the filter | |
| List[:class:`~encord.orm.dataset.DataRow`]: A list of :class:`~encord.orm.dataset.DataRow` objects that match the filter |
| Iterator[Union[:class:`~encord.storage.StorageItem`, :class:`~encord.storage.StorageItem`Inaccessible]]: An iterator containing both accessible | ||
| and inaccessible storage items in the collection. |
There was a problem hiding this comment.
The generated reference for StorageItemInaccessible appears to be broken. The linter seems to have incorrectly split the class name, resulting in :class:~encord.storage.StorageItemInaccessible. The correct syntax should be a single :class: role for the full class name, like :class:~encord.storage.StorageItemInaccessible``.
| Iterator[Union[:class:`~encord.storage.StorageItem`, :class:`~encord.storage.StorageItem`Inaccessible]]: An iterator containing both accessible | |
| and inaccessible storage items in the collection. | |
| Iterator[Union[:class:`~encord.storage.StorageItem`, :class:`~encord.storage.StorageItemInaccessible`]]: An iterator containing both accessible | |
| and inaccessible storage items in the collection. |
| Args: | ||
| k : str: The key under which the embedding will be stored in the schema. | ||
| size : int: The size of the embedding. |
There was a problem hiding this comment.
The linter has correctly reformatted the Args section, but there are a couple of issues with the output:
- The type annotations (
: str,: int) are redundant as they are already present in the function signature. According to the PR description, these should be removed. - The format
k : str:is not standard Google style. It should bek: ....
A better format would be:
Args:
k: The key under which the embedding will be stored in the schema.
size: The size of the embedding.| Args: | |
| k : str: The key under which the embedding will be stored in the schema. | |
| size : int: The size of the embedding. | |
| Args: | |
| k: The key under which the embedding will be stored in the schema. | |
| size: The size of the embedding. |
| def label_row_metadata_dto_to_label_row_metadata(label_row_metadata_dto: LabelRowMetadataDTO) -> LabelRowMetadata: | ||
| """Convert a :class:`LabelRowMetadataDTO` instance to a | ||
| :class:`LabelRowMetadata` dataclass. | ||
| :class:LabelRowMetadata dataclass. |
There was a problem hiding this comment.
The linter seems to have incorrectly removed the backticks from the :class: role. The syntax :class:LabelRowMetadata is invalid. It should be enclosed in backticks, like :class:~encord.orm.label_row.LabelRowMetadata``.
| :class:LabelRowMetadata dataclass. | |
| :class:`~encord.orm.label_row.LabelRowMetadata` dataclass. |
| allow_mirror_dataset_changes: bool = False, | ||
| ) -> None: | ||
| """Move items (list of `StorageItem` objects or UUIDs) to another folder (specify folder object or UUID). | ||
| """Move items (list of `:class:`~encord.storage.StorageItem`` objects or UUIDs) to another folder (specify folder object or UUID). |
There was a problem hiding this comment.
The auto-generated reference for StorageItem has extra backticks, resulting in :class:~encord.storage.StorageItem```. This is incorrect and will not render correctly. It should be :class:~encord.storage.StorageItem``.
| """Move items (list of `:class:`~encord.storage.StorageItem`` objects or UUIDs) to another folder (specify folder object or UUID). | |
| """Move items (list of :class:`~encord.storage.StorageItem` objects or UUIDs) to another folder (specify folder object or UUID). |
| SDK_CLASS_PATTERNS = [ | ||
| r"\bLabelRowV2\b", | ||
| r"\bDataset\b", | ||
| r"\bProject\b", | ||
| r"\bOntologyStructure\b", | ||
| r"\bWorkflow\b", | ||
| r"\bStorageItem\b", | ||
| r"\bStorageFolder\b", | ||
| r"\bCollection\b", | ||
| r"\bOntology\b", | ||
| r"\bProjectUser\b", | ||
| r"\bDataRow\b", | ||
| r"\bObjectInstance\b", | ||
| r"\bClassificationInstance\b", | ||
| ] | ||
|
|
||
| # Mapping of class names to their actual import paths | ||
| CLASS_TO_MODULE = { | ||
| "LabelRowV2": "encord.objects.LabelRowV2", | ||
| "Dataset": "encord.dataset.Dataset", | ||
| "Project": "encord.project.Project", | ||
| "OntologyStructure": "encord.objects.OntologyStructure", | ||
| "Workflow": "encord.workflow.Workflow", | ||
| "StorageItem": "encord.storage.StorageItem", | ||
| "StorageFolder": "encord.storage.StorageFolder", | ||
| "Collection": "encord.collection.Collection", | ||
| "Ontology": "encord.ontology.Ontology", | ||
| "ProjectUser": "encord.utilities.project_user.ProjectUser", | ||
| "DataRow": "encord.orm.dataset.DataRow", | ||
| "ObjectInstance": "encord.objects.ObjectInstance", | ||
| "ClassificationInstance": "encord.objects.ClassificationInstance", | ||
| } | ||
|
|
||
| # Known exception names - automatically extracted from exceptions.py | ||
| EXCEPTION_CLASSES = { | ||
| "EncordException", | ||
| "InitialisationError", | ||
| "AuthenticationError", | ||
| "AuthorisationError", | ||
| "ResourceNotFoundError", | ||
| "TimeOutError", | ||
| "RequestException", | ||
| "RateLimitExceededError", | ||
| "PayloadTooLargeError", | ||
| "UnknownException", | ||
| "InvalidDateFormatError", | ||
| "MethodNotAllowedError", | ||
| "OperationNotAllowed", | ||
| "AnswerDictionaryError", | ||
| "CorruptedLabelError", | ||
| "FileTypeNotSupportedError", | ||
| "FileSizeNotSupportedError", | ||
| "FeatureDoesNotExistError", | ||
| "ModelWeightsInconsistentError", | ||
| "ModelFeaturesInconsistentError", | ||
| "UploadOperationNotSupportedError", | ||
| "DetectionRangeInvalidError", | ||
| "InvalidAlgorithmError", | ||
| "ResourceExistsError", | ||
| "DuplicateSshKeyError", | ||
| "SshKeyNotFound", | ||
| "InvalidArgumentsError", | ||
| "GenericServerError", | ||
| "CloudUploadError", | ||
| "MultiLabelLimitError", | ||
| "LabelRowError", | ||
| "OntologyError", | ||
| "WrongProjectTypeError", | ||
| "BundledMoveWorkflowTasksPayloadError", | ||
| # Exceptions from other modules | ||
| "MetadataSchemaError", | ||
| } | ||
|
|
||
| # Build regex patterns for all exceptions | ||
| EXCEPTION_PATTERNS = [rf"\b{exc}\b" for exc in EXCEPTION_CLASSES] | ||
|
|
||
| # Normalize exception names to match actual class names | ||
| # (Some docstrings use American spelling but the class uses British spelling) | ||
| EXCEPTION_NAME_MAPPING = { | ||
| "AuthorizationError": "AuthorisationError", # Normalize to British spelling | ||
| "AuthorisationError": "AuthorisationError", # Already correct | ||
| } | ||
| # Default: map each exception to itself | ||
| for exc in EXCEPTION_CLASSES: | ||
| if exc not in EXCEPTION_NAME_MAPPING: | ||
| EXCEPTION_NAME_MAPPING[exc] = exc | ||
|
|
||
| # Exceptions not in encord.exceptions need special path mapping | ||
| EXCEPTION_MODULE_MAPPING = { | ||
| "MetadataSchemaError": "encord.metadata_schema.MetadataSchemaError", | ||
| } |
There was a problem hiding this comment.
The lists SDK_CLASS_PATTERNS, CLASS_TO_MODULE, EXCEPTION_CLASSES, and EXCEPTION_MODULE_MAPPING are hardcoded within the linter. The PR description and the new config file suggest that these should be configurable via tools/docstring_linter_config.json. Loading these from the config file would make the linter more maintainable and easier to update as the SDK evolves.
| def _is_within_docstring(self, content: str, pos: int) -> bool: | ||
| """Check if a position in content is within a docstring (between triple quotes).""" | ||
| # Count triple quotes before this position | ||
| before = content[:pos] | ||
| # Count both ''' and """ as docstring delimiters | ||
| triple_double = before.count('"""') | ||
| triple_single = before.count("'''") | ||
|
|
||
| # If odd number of triple quotes before, we're inside a docstring | ||
| # We check both types independently | ||
| in_double = (triple_double % 2) == 1 | ||
| in_single = (triple_single % 2) == 1 | ||
|
|
||
| return in_double or in_single |
There was a problem hiding this comment.
The _is_within_docstring function relies on counting triple quotes, which can be brittle. For instance, a code example within a docstring that contains a triple-quoted string could confuse the parser. A more robust approach would be to use the line and column information from the ast node for the docstring. This information is available from the ast.walk traversal and would allow for precise boundary checking.
SDK integration test report285 tests ±0 277 ✅ +1 13m 50s ⏱️ +33s For more details on these failures, see this check. Results for commit 5edcc6e. ± Comparison against base commit 6060568. |
Add Docstring Quality Linter for SDK Documentation
Overview
This PR introduces a comprehensive docstring linter to ensure consistent, high-quality documentation across the Encord SDK. The linter enforces Sphinx role usage, validates formatting, and maintains cross-references to improve auto-generated documentation quality.
What the Linter Does
🔍 Issue Detection
The linter checks for 13 different types of docstring issues, grouped into three severity levels:
Errors (Must Fix - Exit Code 1)
parse_error- Python syntax errors preventing file parsingmalformed_class_reference- Incorrect:class:syntax (e.g.,:class:\Name [path]`instead of:class:`~path``)malformed_method_reference- Incorrect:meth:syntax (e.g.,:meth:\name [path]`instead of:meth:`~path``)incomplete_class_reference- Missing tilde and full path (e.g.,:class:\Dataset`instead of:class:`~encord.dataset.Dataset``)Warnings (Should Fix - Exit Code 1 in check mode)
unlinked_exception- Exception names inRaises:sections without:class:roleinvalid_section_header- Markdown-style headers instead of Google-styledeprecated_section_header- Non-standard headers (e.g.,Parameters:instead ofArgs:)markdown_list_formatting- Markdown lists in Args/Returns sections instead of Google-styleinconsistent_indentation- Wrong indentation in docstring sections (should be 4 spaces for definitions, 8 for continuations)section_spacing- Missing or extra blank lines between sections (should be exactly 1 blank line)redundant_type_annotation- Type annotations in docstrings when function already has type hintsmissing_class_docstring- Public classes without docstringsInfo (Nice to Have - Exit Code 0)
unlinked_class- SDK class references without:class:role (improves cross-linking)returns_missing_crossref- Class names in Returns section without:class:role🛠️ Auto-Fix Capabilities
The linter can automatically fix these issues when run with
--fix:**Args**:→Args:)Parameters:→Args:)Note: Issues like
unlinked_exceptionandunlinked_classcan also be auto-fixed, but require manual review since the linter needs to determine the correct module path.📊 Output Modes
1. Summary Mode (Default)
Shows high-level statistics:
2. Verbose Mode (
--verboseor-v)Shows all issues grouped by type with clickable file locations:
Benefits of verbose mode:
file.py:line)3. JSON Report Mode (
--report report.json)Generates machine-readable JSON for CI/CD integration:
{ "files_checked": 130, "files_modified": 0, "statistics": { "total": 289, "errors": 0, "warnings": 207, "info": 82, "unlinked_exception": 15, "unlinked_class": 82 }, "issues": [ { "file": "encord/client.py", "line": 123, "function": "get_dataset", "type": "unlinked_exception", "severity": "warning", "message": "Exception \"AuthenticationError\" should use :class: role", "original": "AuthenticationError", "suggested_fix": ":class:`~encord.exceptions.AuthenticationError`" } ] }Usage
Command Line
Pre-commit Hook Integration
The linter is integrated as a pre-commit hook. It runs automatically on modified files before commit:
Current behavior: Warnings will block commits. Use
--no-verifyto skip if absolutely necessary.Configuration
The linter is configured via
tools/docstring_linter_config.json:{ "sdk_package_name": "encord", "sdk_classes": [ "LabelRowV2", "Dataset", "Project", "Workflow", ... ], "exceptions": [ "AuthenticationError", "ResourceNotFoundError", ... ], "auto_fix_rules": { "invalid_section_headers": true, "spelling_inconsistencies": true, "unlinked_exceptions": false, "unlinked_classes": false }, "ignore_patterns": [ "*/tests/*", "*/test_*.py", "*/__pycache__/*" ] }Add new classes/exceptions: Simply append to the respective arrays in the config file.
Benefits
1. Improved Documentation Quality
2. Automated Quality Gates
3. Developer Productivity
4. Type Safety
5. SDK Consistency
Current Status
As of this PR:
Top issue types:
missing_class_docstring: 207 (mostly internal classes)unlinked_class: 82 (can be auto-fixed)Migration Plan
Phase 1: Enable Pre-commit Hook (Current)
Phase 2: Fix Existing Warnings (Recommended)
Phase 3: Gradual Cleanup
Testing
The linter includes:
Documentation
Full documentation available in:
tools/README.md- Comprehensive usage guidetools/docstring_linter.py- Inline docstring with examplestools/docstring_linter_config.json- Configuration referenceBreaking Changes
None. This is a development tool that:
--no-verifyif neededFuture Enhancements
Potential improvements:
:attr:,:data:, etc.)Examples
Before
After
Impact