
[Feature] Subindo dados com IA na BD#1484

Merged
rdahis merged 6 commits into main from feat/ai_data_upload
Mar 31, 2026

Conversation

@rdahis
Member

@rdahis rdahis commented Mar 30, 2026

PR description:

This PR creates the workflow and commands that AI agents can use to upload data to BD from a configured local environment.

Related to:

Summary by CodeRabbit

  • New Features

    • Added a Dataset Onboarding agent to run a full, user-approved end-to-end onboarding flow from raw data to transformed BigQuery tables and metadata.
  • Documentation

    • Added detailed onboarding docs covering context collection, architecture sheet management, cleaning pipeline conventions and outputs, DBT model/test workflows, upload/verification steps, metadata registration, PR creation, and multilingual guidance with an explicit approval checkpoint.

@rdahis rdahis self-assigned this Mar 30, 2026
@rdahis rdahis added the enhancement New feature or request label Mar 30, 2026
@coderabbitai

coderabbitai Bot commented Mar 30, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Added comprehensive documentation and Claude agent specs for a 10-step dataset onboarding workflow: interactive context gathering, Drive architecture table fetch/create, cleaning to partitioned parquet, BigQuery uploads (dev → prod with approval), DBT model/test generation & runs, metadata registration, and PR creation. All backend calls must use the MCP server.

Changes

Cohort / File(s) / Summary

  • Agent Spec & Top-level doc (AGENTS.md, .claude/agents/onboarding.md): New dataset onboarding agent spec and top-level doc describing a mandatory ordered 10-step orchestration, dev-by-default behavior, explicit human approval gate before prod promotion, commit rules, and trilingual output requirement.
  • Context & Discovery (.claude/commands/onboarding-context.md, .claude/commands/onboarding-discover.md): Interactive context collection and ID discovery flows; standardized DATASET CONTEXT and DISCOVERED IDs output blocks for downstream consumption.
  • Architecture (.claude/commands/onboarding-architecture.md): Drive/Google Sheets architecture table schema and workflow to fetch or create per-table architecture spreadsheets (location, columns, translations), with Drive URLs returned for metadata steps.
  • Cleaning & Upload (.claude/commands/onboarding-clean.md, .claude/commands/onboarding-upload.md): Cleaning pipeline spec (out-of-repo layout, pandas/polars, safe casts, wide→long, partitioned parquet, dictionary CSV) and BigQuery upload procedure using basedosdados with staging/replace semantics and requester-pays handling.
  • DBT generation & runs (.claude/commands/onboarding-dbt.md, .claude/commands/onboarding-dbt-run.md): DBT model/schema templates, file layout, test patterns, and a dbt-run/test workflow including venv setup, failure classification, and interactive remediation prompts.
  • Metadata & PR (.claude/commands/onboarding-metadata.md, .claude/commands/onboarding-pr.md): End-to-end metadata registration steps (table/column/observation/cloud table/coverage/update flows, dry-run mode) and PR creation workflow with changelog approval and required labels/body format.
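
As a point of reference for the Cleaning & Upload cohort above, a minimal sketch of the kind of hive-partitioned parquet output the cleaning step targets; the column names, table slug, and output path are illustrative assumptions rather than values from the command files (pandas delegates the partitioned write to pyarrow):

```python
# Illustrative only: hive-partitioned parquet output of the kind the
# cleaning step produces. Columns and the output path are assumptions.
import pandas as pd

df = pd.DataFrame(
    {
        "ano": [2020, 2020, 2021],
        "sigla_uf": ["SP", "RJ", "SP"],
        "valor": [1.0, 2.5, 3.2],
    }
)

# Writes ano=<year>/sigla_uf=<uf>/part-*.parquet under the output directory
# (requires pyarrow installed).
df.to_parquet("/tmp/output/despesa", partition_cols=["ano", "sigla_uf"], index=False)
```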

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Agent as Onboarding Agent
    participant MCP as MCP Server
    participant Drive as Google Drive / Sheets
    participant Storage as GCS / BigQuery
    participant DBT as DBT Runner
    participant Meta as Metadata API

    User->>Agent: start onboarding (slug, path)
    Agent->>MCP: collect context & discover IDs
    MCP->>Drive: fetch/create architecture sheets
    MCP->>Storage: run cleaning → produce partitioned parquet
    MCP->>Storage: upload parquet to BigQuery (dev)
    MCP->>DBT: generate models & run dbt run/test
    DBT-->>MCP: test results
    MCP->>Meta: register/update dataset/table/columns
    MCP->>User: pause and request approval
    User->>Agent: approve promotion
    Agent->>MCP: promote to prod (upload, metadata, create PR)
    MCP->>Drive: create PR changelog & push branch
    MCP-->>User: return PR URL
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through sheets and parquet light,

Cleaned each row by day and night,
Tests green, metadata neat,
A PR blooms—how sweet, how sweet! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The PR description provides a clear objective and mentions related dependencies, but is minimal and lacks the structured sections (Contexto, Detalhes Técnicos, Testes, Riscos, Dependências) required by the repository template. Resolution: expand the description to follow the template structure: add detailed Contexto explaining the need for this AI-driven onboarding feature, Detalhes Técnicos describing the 10-step workflow, testing/validation status, and any known risks.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title '[Feature] Subindo dados com IA na BD' is related to the changeset, which adds new AI-driven dataset onboarding workflows and commands to the BD (Base dos Dados) system.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@rdahis rdahis marked this pull request as ready for review March 31, 2026 01:05

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (15)
.claude/commands/databasis-metadata.md (6)

142-142: Make the required field list explicit.

Line 142 states "All other fields must be re-passed as well (the API requires them)" but doesn't specify which fields. An AI agent should explicitly re-pass the fields from Step 1:

- slug, name_pt/en/es, description_pt/en/es
- dataset_id, status_id, id

Adding this list would eliminate uncertainty about which fields constitute "all other fields."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md at line 142, Update the documentation
sentence that reads "All other fields must be re-passed as well (the API
requires them)" in the databasis-metadata instructions (the Step 1 / deferred
call paragraph) to explicitly list the required fields to re-pass: slug,
name_pt, name_en, name_es, description_pt, description_en, description_es,
dataset_id, status_id, and id; place this explicit bullet-style list or inline
comma-separated list immediately after the sentence so agents know exactly which
fields are required when performing the deferred relationship write.

148-148: Add language identifier to output format code block.

The fenced code block at Line 148 should specify a language identifier for better rendering and accessibility:

-```
+```text
 === METADATA REGISTRATION COMPLETE (env=<env>) ===

As per static analysis hint MD040 (fenced-code-language).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md at line 148, The fenced code block
containing the line "=== METADATA REGISTRATION COMPLETE (env=<env>) ===" needs a
language identifier; replace the opening triple backticks for that block with a
language-tagged fence (e.g., ```text) so the block becomes ```text ... ``` to
satisfy MD040 and improve rendering/accessibility.

135-138: Clarify raw data source selection logic assumptions.

The conditional logic assumes:

  • Exactly two raw data sources exist (pre-2013 and post-2013)
  • All datasets follow this 2013 split pattern

An AI agent might encounter datasets with different source structures:

  • Only one raw data source spanning all years
  • Multiple sources not delineated by the 2013 boundary
  • Sources split by topic rather than time

Consider adding:

  • A note about when this logic applies vs. when to ask for user clarification
  • Guidance for datasets that don't fit the pre/post-2013 pattern
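
To make the proposed fallback order concrete, a rough sketch of the selection logic is shown below; the list-of-dicts input and its coverage keys are assumed shapes for illustration, not the actual get_raw_data_sources payload:

```python
# Rough sketch of the fallback order suggested above. The input is a list of
# dicts with assumed keys ("coverage_start", "coverage_end"); the real
# get_raw_data_sources payload may look different.
def pick_raw_data_sources(sources, table_start_year, table_end_year):
    if len(sources) == 1:
        # (a) a single source spanning all years: select it
        return sources

    pre_2013 = [s for s in sources if s.get("coverage_end", 9999) < 2013]
    post_2013 = [s for s in sources if s.get("coverage_start", 0) >= 2013]
    if len(sources) == 2 and pre_2013 and post_2013:
        # the documented 2013 rule only applies to this exact split
        selected = []
        if table_start_year < 2013:
            selected += pre_2013
        if table_end_year >= 2013:
            selected += post_2013
        return selected

    # (b) multiple sources not aligned to 2013: prefer those whose coverage
    # overlaps the table's temporal coverage
    overlapping = [
        s for s in sources
        if s.get("coverage_start", 0) <= table_end_year
        and s.get("coverage_end", 9999) >= table_start_year
    ]
    if overlapping:
        return overlapping

    # (c) topic-based splits or remaining ambiguity: ask the user, don't guess
    raise ValueError("Ambiguous raw data sources: ask the user which to link.")
```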
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md around lines 135 - 138, The
raw_data_source_ids selection logic (using get_raw_data_sources and start_year)
currently assumes a rigid pre-2013/post-2013 split and exactly two sources;
update the documentation to make that assumption explicit and add fallback
guidance: state that the 2013 rule applies only when get_raw_data_sources
returns distinct pre- and post-2013 sources, otherwise (a) if there is a single
source spanning all years, select that source; (b) if there are multiple sources
not aligned to 2013, prefer the source(s) whose temporal coverage overlaps the
table's temporal_coverage and include a note to consult raw source descriptions;
and (c) if topic-based splits exist or ambiguity remains, instruct the agent to
ask the user for clarification rather than guessing.

158-158: Consider consistent command invocation format.

Line 158 shows /databasis-metadata <slug> --env prod with a leading slash, suggesting a slash command invocation. However, the document title and earlier references use databasis-metadata without the slash.

If commands are invoked as slash commands in the AI agent interface, consider:

  • Using the slash prefix consistently throughout all three command files
  • Or removing the slash if commands are invoked without it
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md at line 158, The document shows an
inconsistent command invocation: a leading slash is used in the example
"/databasis-metadata <slug> --env prod" while the title and other references use
"databasis-metadata" without the slash; update the examples in all three command
files so the invocation format is consistent—either prefix every occurrence of
the command with a slash (e.g., "/databasis-metadata <slug> --env prod") if
these are true slash commands, or remove the slash everywhere (e.g.,
"databasis-metadata <slug> --env prod") if not; search for occurrences of
databasis-metadata across the three command files and make the same change to
each to ensure uniformity.

30-31: Clarify handling of source documentation format.

Line 30 recommends "direct copy-paste from raw data source documentation when it captures the content well", but Line 31 specifies "No bullet lists." If source documentation contains bullet lists, an AI agent might be uncertain whether to:

  • Copy the bullet content but convert to prose sentences, or
  • Avoid copy-pasting entirely in that case

Adding "convert bullet lists to prose" would resolve this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md around lines 30 - 31, The guidance is
ambiguous between "direct copy-paste from raw data source documentation" and "No
bullet lists"; update the rule text so it explicitly says that when the source
contains bullet lists you may copy the content but must convert bullets into 1–3
sentence prose per entity. Locate the phrases "direct copy-paste from raw data
source documentation when it captures the content well" and "No bullet lists."
and augment them with a short clarifying clause such as "If the source uses
bullet lists, transform those bullets into concise prose sentences rather than
preserving list formatting."

56-56: Consider moving detailed backend issue notes to a separate troubleshooting document.

The known issue note provides valuable context about M2M field behavior, but it includes implementation details (specific Python code, file paths, dates) that may become stale as the backend evolves.

Consider either:

  • Moving this to a separate "Known Issues" or "Troubleshooting" document referenced from here
  • Keeping a brief note here ("M2M fields require deferred update — see step 8") and moving details elsewhere

This would keep command specs focused on "what to do" rather than "why the backend behaves this way."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-metadata.md at line 56, The documentation
includes a detailed backend troubleshooting note about M2M fields
(raw_data_source_ids, published_by_ids, data_cleaned_by_ids) and even references
implementation details in perform_mutate (backend/custom/graphql_auto.py) and a
date; move the long technical explanation out of
.claude/commands/databasis-metadata.md into a separate "Known Issues" or
"Troubleshooting" document and replace it with a short actionable note (e.g.,
"M2M fields require deferred update — see Troubleshooting doc") that links to
the new doc so the command spec focuses on what to do not implementation
details.
.claude/commands/databasis-architecture.md (5)

41-41: Define "Brasil-level" terminology.

The term "Brasil-level" is used to distinguish federal/national data from state/municipality data, but it's not explicitly defined. An AI agent might misinterpret this.

Consider adding a brief parenthetical definition: "Brasil-level (federal/national level, not disaggregated by state or municipality)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-architecture.md at line 41, Update the wording
where "Brasil-level" appears in the Partition columns line (the line containing
"2. **Partition columns:** ... Brasil-level: `ano` only") to include a brief
parenthetical definition clarifying it means federal/national data not
disaggregated by state or municipality (e.g., add "(Brasil-level =
federal/national level, not disaggregated by state or municipality)") so any
agent or reader cannot misinterpret the term.

68-70: Consider specifying a structured output format.

The output summary is described informally ("listing all tables found/created and the Drive URLs"), which could lead to inconsistent formatting by an AI agent. Since downstream commands need these URLs, a structured format would improve reliability:

=== ARCHITECTURE TABLES ===
Found: <N> | Created: <N>

Tables:
  ✓ <table_slug> — <drive_url>
  ✓ <table_slug> — <drive_url>

This makes parsing more reliable for subsequent commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-architecture.md around lines 68 - 70, Update
"Step 6 — Output" to require a deterministic, machine-parsable summary for
downstream use (databasis-metadata). Replace the informal sentence with a
specified structured format that includes a clear header (e.g., "===
ARCHITECTURE TABLES ==="), counts ("Found: <N> | Created: <N>"), and a
line-per-table list of " <table_slug> — <drive_url>" so callers can reliably
parse table slugs and Drive URLs; explicitly state that the Drive URLs must be
stored for databasis-metadata consumption.

64-64: Clarify file format for architecture tables.

The path specifies .xlsx (Excel format), but Line 66 references create_spreadsheet which creates native Google Sheets. An AI agent might be confused about which format to use.

Consider either:

  • Removing the .xlsx extension if using native Google Sheets format
  • Clarifying that the sheet should be created in Google Sheets but exported as .xlsx
  • Or specifying just <table_slug> without extension
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-architecture.md at line 64, The file path example
uses an .xlsx extension but the code/workflow uses create_spreadsheet (Google
Sheets); update the spec to avoid ambiguity by either removing the .xlsx
extension from the path string
(`BD/Dados/Conjuntos/<dataset>/architecture/<table_slug>`), stating explicitly
that create_spreadsheet should create a native Google Sheet which may be
exported as .xlsx if needed, or documenting that `<table_slug>.xlsx` refers to
an exported Google Sheets file; modify the documentation where the path is
defined and any references around create_spreadsheet to reflect the chosen
option so agents know whether to create a native Google Sheet or an exported
Excel file.

21-21: Consider documenting edge cases for the directory_column FK format.

The format <dataset>.<table>:<column> is clear for simple cases, but an AI agent might encounter:

  • Columns that could reference multiple directory tables
  • Composite foreign keys
  • Cases where the reference column name differs from the local column name

Adding a sentence about these scenarios would reduce ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-architecture.md at line 21, Update the
`directory_column` documentation line to explicitly note edge cases for the
`<dataset>.<table>:<column>` FK format: mention how to handle columns that may
reference multiple directory tables, how composite foreign keys should be
represented, and what to do when the referenced column name differs from the
local column name (e.g., indicate multiple entries or a clear tuple/notation and
mapping convention). Reference the `directory_column` field and the
`<dataset>.<table>:<column>` format when adding this single clarifying sentence
so readers/agents know how to encode these special cases.

27-27: Clarify interval notation for non-annual data.

The temporal coverage notation only shows annual examples with (1). For an AI agent processing monthly or quarterly data, the interval semantic is ambiguous:

  • Does (12) mean monthly (12 observations per year)?
  • Does (3) mean quarterly (every 3 months)?
  • How to express irregular intervals?

Adding examples like 2020(12)2022 for monthly data would eliminate this ambiguity.
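
For illustration, a small sketch of parsing the notation syntactically; it deliberately assigns no meaning to INTERVAL, since that is exactly what the comment asks the docs to pin down:

```python
import re

# Splits START(INTERVAL)END into its pieces without assuming what INTERVAL
# means. Open-ended forms like "2013(1)" leave END as None.
_COVERAGE = re.compile(r"^(\d{4})\((\d+)\)(\d{4})?$")

def parse_coverage(text):
    match = _COVERAGE.match(text.strip())
    if not match:
        raise ValueError(f"Unrecognized temporal coverage: {text!r}")
    start, interval, end = match.groups()
    return int(start), int(interval), int(end) if end else None

# parse_coverage("2020(12)2022") -> (2020, 12, 2022)
# parse_coverage("2013(1)")      -> (2013, 1, None)
```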

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-architecture.md at line 27, Clarify the interval
semantics in the "Temporal coverage notation: START(INTERVAL)END" line: state
that the number inside parentheses is observations per year (e.g., (1)=annual,
(12)=monthly, (4)=quarterly) and add explicit examples such as `2020(12)2022`
for monthly and `2019(4)2021` for quarterly; also specify how to denote ongoing
ranges (keep `2013(1)` meaning from 2013 onward) and provide a convention for
irregular data (e.g., use `IRR` or leave interval empty with a note) so agents
can unambiguously parse `START(INTERVAL)END`.
.claude/commands/databasis-clean.md (4)

48-49: Clarify column order vs. column presence for subset validation.

Line 48 instructs starting with a small subset, but Line 49 requires "output column order must match architecture exactly." An AI agent might be uncertain whether:

  • All architecture columns must be present in the subset (even as null columns), or
  • Only columns present in the subset data need to match the architecture order

Adding "Output columns present in the subset must match architecture order" would eliminate ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-clean.md around lines 48 - 49, Clarify that the
"output column order must match architecture exactly" requirement applies only
to the columns present in the chosen subset by updating the second bullet (the
line that currently reads "Output column order must match architecture exactly")
to say something like "Output columns present in the subset must match
architecture order exactly (missing architecture columns may be omitted or
null)"; reference the two bullets "Start with a small subset (1 year or the
smallest available partition)" and the "Output column order..." sentence so
readers know the subset validation rule is about ordering of present columns,
not forcing all architecture columns to appear.

74-74: Consider adding an example row for the dictionary schema.

The dictionary schema uses Portuguese column names (id_tabela, nome_coluna, chave, valor) while architecture tables use English (name, bigquery_type, etc.). An AI agent might also be uncertain how to populate cobertura_temporal for dictionary entries.

Adding a brief example would clarify:

id_tabela | nome_coluna | chave | cobertura_temporal | valor
despesa   | estagio     | 1     | 2013(1)2022       | Empenhado
despesa   | estagio     | 2     | 2013(1)2022       | Liquidado
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-clean.md at line 74, Add a concrete example row
block for the dictionary schema to clarify Portuguese column usage and how to
populate cobertura_temporal; update the section that currently shows "Schema:
`id_tabela | nome_coluna | chave | cobertura_temporal | valor`" to include
sample rows (e.g., for the table/column combination and temporal coverage) so
agents know how to fill id_tabela, nome_coluna, chave, cobertura_temporal and
valor consistently.

45-59: Provide polars examples or recommend pandas by default.

Line 45 instructs the AI agent to "choose whichever fits the data better (polars for large files or complex transformations; pandas otherwise)", but Lines 56-59 provide only pandas-specific type casting patterns.

If an agent selects polars, it has no reference implementation for safe type casting. Consider either:

  • Recommending pandas as the default and using polars only for specific large-file scenarios
  • Adding equivalent polars type casting examples alongside pandas
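
For illustration only, a minimal sketch of what matching "safe cast to INT64" snippets could look like in the two libraries; the column name is hypothetical and neither snippet is taken from the command file:

```python
# Hedged sketch: null-on-failure integer casts in pandas and polars.
import pandas as pd
import polars as pl

# pandas: coerce unparsable values to NaN, then use the nullable Int64 dtype
pdf = pd.DataFrame({"valor": ["1", "2", "x", None]})
pdf["valor"] = pd.to_numeric(pdf["valor"], errors="coerce").astype("Int64")

# polars: strict=False yields null instead of raising on unparsable values
plf = pl.DataFrame({"valor": ["1", "2", "x", None]})
plf = plf.with_columns(pl.col("valor").cast(pl.Int64, strict=False))
```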
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-clean.md around lines 45 - 59, The guidance
currently lists pandas-specific safe casting examples (INT64, FLOAT64, STRING,
DATE) but also tells agents they can choose polars, leaving polars users without
equivalents; update the doc so either pandas is recommended as the default
choice (and note polars only for large/complex cases) or add matching polars
safe_cast patterns alongside the pandas ones. Specifically, add polars
equivalents for the safe_cast logic and the standard column types (INT64,
FLOAT64, STRING, DATE) and mention pivot-to-long and pyarrow partitioning
applicability in polars, or explicitly state "prefer pandas by default" near the
existing examples and mark polars as an advanced option. Ensure the terms
safe_cast, INT64, FLOAT64, STRING, DATE, pivot-to-long, and pyarrow are
referenced so readers can find and apply the fix.

79-81: Consider specifying output format for validation and handoff.

Step 6 instructs to "report final row counts per partition" but doesn't specify the format. For reliable handoff to downstream commands (e.g., databasis-upload), a structured summary would help:

=== CLEANING COMPLETE ===
Dataset: <dataset_slug>

Tables:
  ✓ <table_slug> — <total_rows> rows across <N> partitions
    └─ ano=2020: <rows>, ano=2021: <rows>, ...
  
Output path: <output_path>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-clean.md around lines 79 - 81, Update the "Step 6
— Scale to full data" instructions to require emitting a structured, machine-
and human-friendly summary (for downstream validation/hand-off to commands like
databasis-upload) by adding a concrete output template under the step (include
Dataset slug, a Tables section with per-table total rows and per-partition row
counts, and the Output path). Reference the existing Step 6 heading ("Step 6 —
Scale to full data") and ensure the documentation explicitly shows the exact
fields required (Dataset, Tables → <table_slug> total rows and partition
breakdown, Output path) and a short marker like "=== CLEANING COMPLETE ===" so
downstream tooling can reliably parse it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/agents/onboarding-pipeline.md:
- Around line 18-20: Add language identifiers to the unlabeled fenced code
blocks so they pass MD040: change the opening backticks for the example,
sequence, and checklist fences to include a language token (e.g., ```text or
```md) instead of plain ```; locate the three fenced blocks named "example",
"sequence" and "checklist" in the onboarding-pipeline content and update their
opening fences to ```text (and do the same for the other similar
example/sequence/checklist fenced blocks in this file).

In @.claude/commands/databasis-context.md:
- Around line 37-47: Update the fenced code block that begins with "=== DATASET
CONTEXT: <slug> ===" so the opening fence includes a language specifier; replace
the existing "```" with "```text" to satisfy the MD040 rule and mark the block
as plain text. Ensure only the opening fence is changed and the closing "```"
remains the same so the block renders correctly.

In @.claude/commands/databasis-dbt-run.md:
- Around line 15-18: Replace the hardcoded pyenv path used to create the
virtualenv with a generic python invocation: change the command using
"~/.pyenv/versions/3.11.6/bin/python -m venv /tmp/dbt_env" to use "python -m
venv /tmp/dbt_env" so the venv honors the repo's Python constraint; similarly
update any subsequent references that assume that specific pyenv path (e.g.,
"/tmp/dbt_env/bin/pip install dbt-bigquery") to remain the same but ensure the
initial venv creation uses the generic python executable.
- Around line 26-27: The documented dbt run/test command uses --profiles-dir
~/.dbt which doesn't match this repo's profiles.yml location; update the command
examples in .claude/commands/databasis-dbt-run.md to point --profiles-dir to the
repository root (or remove the flag so dbt picks up ./profiles.yml) for both
occurrences (the block containing "--profiles-dir ~/.dbt \ --target <target>"
and the similar lines at 37-38), ensuring the --target flag remains unchanged.

In @.claude/commands/databasis-dbt.md:
- Line 2: Replace all occurrences of the alternative schema filename (e.g.,
"schema.yaml") with the canonical "schema.yml" throughout the document and any
template strings so the generator and lookup logic are consistent; specifically
update the description line and the later reference that currently uses
"schema.yaml"/"schema.yml" mismatch and ensure any examples, templates or
variables that emit the filename use "schema.yml" only.

In @.claude/commands/databasis-discover.md:
- Around line 16-18: Label the unlabeled fenced code blocks that contain the
snippet "discover_ids(env=<env>)" (and the other block around lines 41-74) with
an appropriate language tag (e.g., use ```text for plain output or ```bash for
command-like snippets) so the markdown linter MD040 is satisfied; update the two
fenced blocks containing the shown snippet and the longer block later to start
with a language identifier instead of a bare ```.

In @.claude/commands/databasis-pr.md:
- Around line 38-55: The fenced code block that begins with the "## Dataset"
header is missing a language identifier and triggers MD040; update the opening
fence to include a markdown language tag (e.g., use ```md or ```markdown) so the
snippet is properly labeled and linting passes, leaving the rest of the block
content unchanged.
- Around line 31-35: Update the gh pr create invocation to open the PR as a
draft, use the required title pattern starting with a bracketed keyword (e.g.,
"[Data] {dataset_slug} {table names}"), and build the PR body from the
repository template: generate /tmp/pr_body.md by loading
.github/pull_request_template.md and inserting the dataset changelog into the
"Detalhes Técnicos" section (keep Descrição do PR, Teste e Validações, Riscos e
Mitigações, Dependências intact), then call gh pr create --title
"<bracketed-keyword> ${dataset_slug} <table names>" --body-file /tmp/pr_body.md
--label "test-dev,table-approve,metadata-test" --draft so the template structure
and draft status are preserved.

In @.claude/commands/databasis-upload.md:
- Around line 15-17: The fenced code block containing the message "Missing
~/.basedosdados/config.toml — run `basedosdados config init` first." is
unlabeled and triggers MD040; update the opening backticks to include a language
such as text or bash (e.g., ```text or ```bash) for the prerequisites snippet so
the fenced block is explicitly labeled.
- Around line 37-41: The monkey-patch references BILLING_PROJECT but it is not
defined, causing a NameError at runtime; define and validate BILLING_PROJECT
before assigning gcs.Client.bucket so _patched_bucket can reference it. Add a
top-level definition for BILLING_PROJECT (e.g., read from configuration or
os.environ and fail fast with a clear error if missing), then apply the
monkey-patch that replaces gcs.Client.bucket with _patched_bucket; ensure
references to _patched_bucket and gcs.Client.bucket occur after BILLING_PROJECT
is set.

In `@AGENTS.md`:
- Around line 242-244: The fenced code block containing "Onboard dataset <slug>.
Raw files at <path>. Drive folder: BD/Dados/Conjuntos/<slug>/." is missing a
language label; update the opening ``` to include a language token (e.g.,
```text or ```md) so the block passes MD040 and is properly labeled in
AGENTS.md.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f8dc0cf-906d-47ab-b612-4c07517ebfb3

📥 Commits

Reviewing files that changed from the base of the PR and between a47889e and 6d018fc.

📒 Files selected for processing (11)
  • .claude/agents/onboarding-pipeline.md
  • .claude/commands/databasis-architecture.md
  • .claude/commands/databasis-clean.md
  • .claude/commands/databasis-context.md
  • .claude/commands/databasis-dbt-run.md
  • .claude/commands/databasis-dbt.md
  • .claude/commands/databasis-discover.md
  • .claude/commands/databasis-metadata.md
  • .claude/commands/databasis-pr.md
  • .claude/commands/databasis-upload.md
  • AGENTS.md

Comment thread .claude/agents/onboarding.md Outdated
Comment thread .claude/commands/onboarding-context.md Outdated
Comment thread .claude/commands/onboarding-dbt-run.md
Comment on lines +26 to +27
--profiles-dir ~/.dbt \
--target <target>


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Repo-level profiles file:"
fd -a '^profiles\.yml$' .

echo
echo "Command currently expects ~/.dbt/profiles.yml:"
if [ -f "$HOME/.dbt/profiles.yml" ]; then
  echo "Found ~/.dbt/profiles.yml"
else
  echo "Missing ~/.dbt/profiles.yml"
fi

Repository: basedosdados/pipelines

Length of output: 199


🏁 Script executed:

# Check the exact location of profiles.yml
ls -la ./profiles.yml
echo "---"
# Read the command file at the specified lines
echo "Lines 20-35 of the command file:"
sed -n '20,35p' ./.claude/commands/databasis-dbt-run.md
echo "---"
echo "Lines 35-45 of the command file:"
sed -n '35,45p' ./.claude/commands/databasis-dbt-run.md

Repository: basedosdados/pipelines

Length of output: 787


--profiles-dir ~/.dbt is misaligned with this repo's profiles.yml location.

The repo provides profiles.yml at the project root, but the command points to ~/.dbt. This will cause runs and tests to fail unless ~/.dbt/profiles.yml is pre-seeded externally.

Suggested update
 /tmp/dbt_env/bin/dbt run \
   --select <dataset_slug> \
-  --profiles-dir ~/.dbt \
+  --profiles-dir <pipelines_root> \
   --target <target>
 /tmp/dbt_env/bin/dbt test \
   --select <dataset_slug> \
-  --profiles-dir ~/.dbt \
+  --profiles-dir <pipelines_root> \
   --target <target>

Also applies to: lines 37-38

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-dbt-run.md around lines 26 - 27, The documented
dbt run/test command uses --profiles-dir ~/.dbt which doesn't match this repo's
profiles.yml location; update the command examples in
.claude/commands/databasis-dbt-run.md to point --profiles-dir to the repository
root (or remove the flag so dbt picks up ./profiles.yml) for both occurrences
(the block containing "--profiles-dir ~/.dbt \ --target <target>" and the
similar lines at 37-38), ensuring the --target flag remains unchanged.

@@ -0,0 +1,99 @@
---
description: Write DBT .sql and schema.yaml files for a Data Basis dataset


⚠️ Potential issue | 🟡 Minor

Use one schema filename convention (schema.yml) consistently.

Line 2 says schema.yaml, while Line 47 uses schema.yml. Keep one canonical extension to avoid generator drift and missing file lookups. Based on learnings: Every dbt model must have a schema.yml entry inside its dataset directory declaring model metadata and all data quality tests.

Also applies to: 47-48

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-dbt.md at line 2, Replace all occurrences of the
alternative schema filename (e.g., "schema.yaml") with the canonical
"schema.yml" throughout the document and any template strings so the generator
and lookup logic are consistent; specifically update the description line and
the later reference that currently uses "schema.yaml"/"schema.yml" mismatch and
ensure any examples, templates or variables that emit the filename use
"schema.yml" only.

Comment on lines +31 to +35

```bash
gh pr create \
  --title "[$dataset_slug] <table names>" \
  --body "<changelog>" \
  --label "test-dev,table-approve,metadata-test"
```


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does gh pr create --body "" replace the repository pull request template content instead of auto-populating it?

💡 Result:

Yes, gh pr create --body "" replaces the repository pull request template content instead of auto-populating it. The --body flag supplies the literal body text for the pull request, skipping any interactive prompts or automatic template population. Repository templates are only used as a starting point in interactive mode (when no --body or --title is provided) or explicitly via --template, where they are loaded into the editor. Official documentation states that --body takes precedence over other autofill mechanisms like --fill, overwriting them, and issues confirm --body or --body-file uses the provided content directly without incorporating templates unless manually combined.

Citations:


🏁 Script executed:

# Check if the PR template exists
find . -name "pull_request_template.md" -o -name "*pull_request*" | head -20

Repository: basedosdados/pipelines

Length of output: 100


🏁 Script executed:

# Read the databasis-pr.md file to see the actual content and context
if [ -f ".claude/commands/databasis-pr.md" ]; then
  echo "=== File exists, reading content ===" 
  wc -l ".claude/commands/databasis-pr.md"
  echo ""
  echo "=== Lines 25-60 ===" 
  sed -n '25,60p' ".claude/commands/databasis-pr.md"
fi

Repository: basedosdados/pipelines

Length of output: 918


🏁 Script executed:

# Check directory structure for .github and related files
ls -la .github/ 2>/dev/null | head -20

Repository: basedosdados/pipelines

Length of output: 378


🏁 Script executed:

# Read the PR template to see what is actually required
cat ".github/pull_request_template.md"

Repository: basedosdados/pipelines

Length of output: 2945


Add --draft flag and build PR body from template structure; fix title pattern to match repository requirements.

The --body "<changelog>" parameter bypasses the repository's pull request template entirely instead of populating it. Additionally, the title pattern [$dataset_slug] <table names> does not match the required format and the --draft flag is missing.

Per .github/pull_request_template.md, PR titles must start with a keyword in brackets ([Feature], [Data], [Bugfix], etc.), PRs must open as draft, and body must include specific sections: Descrição do PR, Detalhes Técnicos, Teste e Validações, Riscos e Mitigações, and Dependências.

Suggested update
 gh pr create \
+  --draft \
-  --title "[$dataset_slug] <table names>" \
+  --title "[Data] $dataset_slug" \
-  --body "<changelog>" \
+  --body-file /tmp/pr_body.md \
   --label "test-dev,table-approve,metadata-test"

Build /tmp/pr_body.md to include all sections from .github/pull_request_template.md, embedding the dataset changelog in the "Detalhes Técnicos" section instead of replacing the template structure.

Applies to lines 31–35 and the suggested PR body format in lines 37–55.
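
A rough sketch of that flow is below; the section header match, the /tmp/changelog.md path, and the literal placeholder title are assumptions for illustration, while the labels come from the command file quoted above:

```python
# Sketch: fill the repo PR template, inserting the changelog under the
# "Detalhes Técnicos" section, then open a draft PR with gh.
import subprocess
from pathlib import Path

template = Path(".github/pull_request_template.md").read_text()
changelog = Path("/tmp/changelog.md").read_text()  # assumed location

body_lines = []
for line in template.splitlines():
    body_lines.append(line)
    # assumed: the template's section header contains "Detalhes Técnicos"
    if line.lstrip().startswith("#") and "Detalhes Técnicos" in line:
        body_lines.append("")
        body_lines.append(changelog)

Path("/tmp/pr_body.md").write_text("\n".join(body_lines) + "\n")

subprocess.run(
    [
        "gh", "pr", "create",
        "--draft",
        "--title", "[Data] <dataset_slug> <table names>",  # placeholders
        "--body-file", "/tmp/pr_body.md",
        "--label", "test-dev,table-approve,metadata-test",
    ],
    check=True,
)
```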

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-pr.md around lines 31 - 35, Update the gh pr
create invocation to open the PR as a draft, use the required title pattern
starting with a bracketed keyword (e.g., "[Data] {dataset_slug} {table names}"),
and build the PR body from the repository template: generate /tmp/pr_body.md by
loading .github/pull_request_template.md and inserting the dataset changelog
into the "Detalhes Técnicos" section (keep Descrição do PR, Teste e Validações,
Riscos e Mitigações, Dependências intact), then call gh pr create --title
"<bracketed-keyword> ${dataset_slug} <table names>" --body-file /tmp/pr_body.md
--label "test-dev,table-approve,metadata-test" --draft so the template structure
and draft status are preserved.

Comment thread .claude/commands/onboarding-pr.md Outdated
Comment on lines +15 to +17
```
Missing ~/.basedosdados/config.toml — run `basedosdados config init` first.
```


⚠️ Potential issue | 🟡 Minor

Add fence language in prerequisites snippet.

The fenced block at Line 15 is unlabeled (MD040). Use text or bash.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 15-15: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-upload.md around lines 15 - 17, The fenced code
block containing the message "Missing ~/.basedosdados/config.toml — run
`basedosdados config init` first." is unlabeled and triggers MD040; update the
opening backticks to include a language such as text or bash (e.g., ```text or
```bash) for the prerequisites snippet so the fenced block is explicitly
labeled.

Comment on lines +37 to +41

# Monkey-patch for requester-pays bucket
_orig_bucket = gcs.Client.bucket
def _patched_bucket(self, bucket_name, user_project=None):
    return _orig_bucket(self, bucket_name, user_project=BILLING_PROJECT)
gcs.Client.bucket = _patched_bucket


⚠️ Potential issue | 🔴 Critical

Define BILLING_PROJECT before monkey-patching GCS client.

Line 40 references BILLING_PROJECT, but the snippet never defines it. This will fail at runtime with NameError and abort upload flow.

Suggested fix
 import basedosdados as bd
 import google.cloud.storage as gcs
 from pathlib import Path
 import argparse
+import os

+# Resolve billing project before patch
+BILLING_PROJECT = os.getenv("BILLING_PROJECT")
+if not BILLING_PROJECT:
+    raise ValueError("BILLING_PROJECT must be set")
+
 # Monkey-patch for requester-pays bucket
 _orig_bucket = gcs.Client.bucket
 def _patched_bucket(self, bucket_name, user_project=None):
     return _orig_bucket(self, bucket_name, user_project=BILLING_PROJECT)
 gcs.Client.bucket = _patched_bucket
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/databasis-upload.md around lines 37 - 41, The monkey-patch
references BILLING_PROJECT but it is not defined, causing a NameError at
runtime; define and validate BILLING_PROJECT before assigning gcs.Client.bucket
so _patched_bucket can reference it. Add a top-level definition for
BILLING_PROJECT (e.g., read from configuration or os.environ and fail fast with
a clear error if missing), then apply the monkey-patch that replaces
gcs.Client.bucket with _patched_bucket; ensure references to _patched_bucket and
gcs.Client.bucket occur after BILLING_PROJECT is set.

Comment thread AGENTS.md Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (7)
.claude/commands/onboarding-pr.md (2)

38-55: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The PR body template fenced block is missing a language specifier, triggering MD040. Use md or text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-pr.md around lines 38 - 55, The fenced code
block in .claude/commands/onboarding-pr.md is missing a language specifier and
triggers MD040; update the triple-backtick opening fence to include a language
identifier (for example change ``` to ```md or ```text) in the PR body template
so the block is marked as markdown/plain text and the linter warning is
resolved.

31-35: ⚠️ Potential issue | 🟠 Major

Add --draft flag and build PR body from template structure.

The --body "<changelog>" parameter bypasses the repository's pull request template. Additionally, the --draft flag is missing, and the title pattern should start with a bracketed keyword (e.g., [Data]).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-pr.md around lines 31 - 35, The gh pr create
command is bypassing the repo PR template and missing a draft flag; replace the
literal --body "<changelog>" with either --fill (to use the repository PR
template) or --body-file <generated_body_path> (if you build the body from your
template structure), add the --draft flag, and change the title pattern
"[$dataset_slug] <table names>" to start with a bracketed keyword (e.g., "[Data]
${dataset_slug} <table names>"). Ensure you update the invocation of gh pr
create to include --draft and use --fill or --body-file instead of --body
"<changelog>" so the template is respected.
AGENTS.md (1)

242-244: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The command example fenced block is missing a language specifier, triggering MD040. Use text or bash.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 242 - 244, The fenced code block that contains the
example command "Onboard dataset <slug>. Raw files at <path>. Drive folder:
BD/Dados/Conjuntos/<slug>/." is missing a language specifier and should be
updated to include one (e.g., add ```text or ```bash before the block) so the
markdown linter (MD040) stops flagging it; locate the block in AGENTS.md and
prepend the appropriate language identifier.
.claude/commands/onboarding-context.md (1)

37-47: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The fenced block is missing a language specifier, triggering MD040. Use text for this structured output example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-context.md around lines 37 - 47, The fenced code
block starting with "=== DATASET CONTEXT: <slug> ===" is missing a language
specifier which triggers MD040; update the opening fence from ``` to ```text so
the block is marked as plain text (i.e., change ``` to ```text immediately
before the DATASET CONTEXT block) to satisfy the linter.
.claude/commands/onboarding-upload.md (2)

15-17: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The error message fenced block is missing a language specifier, triggering MD040. Use text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-upload.md around lines 15 - 17, Update the
fenced code block that currently contains "Missing ~/.basedosdados/config.toml —
run `basedosdados config init` first." so the opening fence includes the
language identifier text (i.e., change ``` to ```text) to satisfy the MD040 lint
rule; no other changes to the block content are needed.

37-41: ⚠️ Potential issue | 🔴 Critical

Define BILLING_PROJECT before using it in the monkey-patch.

Line 40 references BILLING_PROJECT, but it's never defined in the snippet. This will cause a NameError at runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-upload.md around lines 37 - 41, The monkey-patch
uses BILLING_PROJECT but never defines it; declare and initialize
BILLING_PROJECT before the patch (e.g., read from configuration or environment)
and ensure it's available to the closure used by _patched_bucket; update
references around _orig_bucket, _patched_bucket and gcs.Client.bucket to use
that defined value (or validate and raise a clear error if missing) so the
patched function can call _orig_bucket(self, bucket_name,
user_project=BILLING_PROJECT) without NameError.
.claude/agents/onboarding.md (1)

18-20: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

Multiple fenced blocks throughout the file are missing language specifiers, triggering MD040. Use text for the example invocation (Line 18), step sequence (Line 28), and verification checklist (Line 46).

Also applies to lines 28-40 and 46-55.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/agents/onboarding.md around lines 18 - 20, Add language identifiers
"text" to the three fenced code blocks that currently lack them: the example
invocation block containing "Onboard dataset <slug>. Raw files at <path>. Drive
folder: BD/Dados/Conjuntos/<slug>/.", the step sequence block that lists the
onboarding steps, and the verification checklist block; update the opening
backticks for each of these fenced blocks to "```text" so the markdown linter
(MD040) no longer flags them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/commands/onboarding-clean.md:
- Around line 14-24: The fenced code block showing the dataset folder tree in
.claude/commands/onboarding-clean.md lacks a language identifier (causing
MD040); edit that block's opening backticks to include the language "text"
(i.e., change ``` to ```text) so the diagram is properly labeled while leaving
the internal directory content unchanged.

In @.claude/commands/onboarding-metadata.md:
- Around line 148-159: Update the fenced code block that begins with "===
METADATA REGISTRATION COMPLETE (env=<env>) ===" to include a language identifier
to satisfy MD040; add "text" immediately after the opening triple backticks so
the block becomes a labeled text code block; ensure the rest of the block
content (Dataset/Files/Tables/Next step lines) remains unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10ed3e49-8100-4690-8587-e8136b06e249

📥 Commits

Reviewing files that changed from the base of the PR and between 6d018fc and 73c9d0e.

📒 Files selected for processing (11)
  • .claude/agents/onboarding.md
  • .claude/commands/onboarding-architecture.md
  • .claude/commands/onboarding-clean.md
  • .claude/commands/onboarding-context.md
  • .claude/commands/onboarding-dbt-run.md
  • .claude/commands/onboarding-dbt.md
  • .claude/commands/onboarding-discover.md
  • .claude/commands/onboarding-metadata.md
  • .claude/commands/onboarding-pr.md
  • .claude/commands/onboarding-upload.md
  • AGENTS.md
✅ Files skipped from review due to trivial changes (4)
  • .claude/commands/onboarding-dbt-run.md
  • .claude/commands/onboarding-dbt.md
  • .claude/commands/onboarding-architecture.md
  • .claude/commands/onboarding-discover.md

Comment thread .claude/commands/onboarding-clean.md Outdated
Comment thread .claude/commands/onboarding-metadata.md Outdated
@rdahis rdahis merged commit aa960fd into main Mar 31, 2026
3 of 4 checks passed
@rdahis rdahis deleted the feat/ai_data_upload branch March 31, 2026 01:32
@github-project-automation github-project-automation Bot moved this to ✅ Finalizada! in Roadmap de dados Mar 31, 2026

Labels

enhancement New feature or request

Projects

Status: ✅ Finalizada!
