
[Data] br_mma_cnuc — Cadastro Nacional de Unidades de Conservação #1488

Open
rdahis wants to merge 5 commits into main from data/br_mma_cnuc

Conversation

@rdahis
Member

@rdahis rdahis commented Mar 31, 2026

Description

Adds the br_mma_cnuc dataset with the unidades_conservacao table.

  • Source: Ministério do Meio Ambiente e Mudança do Clima (MMA) — CNUC API
  • Temporal coverage: 2018–2026 (biannual), 13 partitions
  • Rows: 38,065
  • Partitioning: ano + semestre
  • Geometry: geometria column of type GEOGRAPHY (WKT → ST_GEOGFROMTEXT), available for 2024-S2 (2,927 polygons)

Files

  • models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
  • models/br_mma_cnuc/schema.yaml
  • models/br_mma_cnuc/code/clean.py — CSV cleaning and shapefile merge
  • models/br_mma_cnuc/code/upload.py — upload to GCS/BQ staging

Tests

  • dbt run and dbt test passing (PASS=4, WARN=0, ERROR=0)
  • Uniqueness on (ano, semestre, codigo_uc)
  • not_null_proportion >= 0.05 (excluding 6 columns that are empty at the source)

Summary by CodeRabbit

  • New Features
    • Conservation units dataset now available with validated geographic boundaries and biannual partitions.
    • Automated pipeline publishes consistent Parquet partitions for each year/semester to simplify consumption.
  • Data Quality / Tests
    • Uniqueness and completeness checks added to ensure primary-key uniqueness and minimum not-null proportions.
  • Documentation
    • Onboarding guides updated with geometry handling and parquet-schema requirements.

Adds br_mma_cnuc.unidades_conservacao: 13 biannual snapshots (2018–2026),
38,065 rows, partitioned by ano+semestre, with WKT geometry (GEOGRAPHY)
for 2024-S2. Source: MMA/CNUC API.
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

📝 Walkthrough

Adds a complete ingest pipeline for Brazil CNUC conservation units: a Python cleaner that normalizes CSV snapshots and merges geometries into partitioned Parquet, a dbt model that casts types and converts WKT to GEOGRAPHY, schema/tests, and a small upload script to push Parquet to basedosdados.

Changes

  • Cleaning script — models/br_mma_cnuc/code/clean.py
    New CLI script to parse cnuc_*.csv snapshots, normalize column names, coerce types (ints, floats, dates, strings), merge WKT geometry from shapefiles (reproject → EPSG:4674), enforce a fixed PyArrow schema, and write partitioned Parquet at output/unidades_conservacao/ano=<year>/semestre=<n>/data.parquet. (A sketch of the geometry lookup follows this list.)
  • DBT model — models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
    New dbt model (materialized as table) selecting from staging, applying safe_cast to typed columns and converting geometria with st_geogfromtext(safe_cast(geometria as string), make_valid => true).
  • Schema & tests — models/br_mma_cnuc/schema.yaml
    New model metadata, column docs, dbt_utils.unique_combination_of_columns on (ano, semestre, codigo_uc), and a not_null_proportion_multiple_columns test with specified ignore_values.
  • Upload helper — models/br_mma_cnuc/code/upload.py
    New script to instantiate basedosdados.Table and create/replace the destination br_mma_cnuc.unidades_conservacao table by uploading a local Parquet directory.
  • Onboarding docs — .claude/commands/onboarding-clean.md, .claude/commands/onboarding-dbt.md, .claude/commands/onboarding-discover.md
    Documentation updates: cleaning guidelines (encoding fallbacks, Brazilian numeric normalization, explicit pyarrow schema, geometry WKT→GEOGRAPHY guidance), dbt schema template and geometry notes, and discover_ids guidance extended to include theme and stronger ID lookup rules.
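For orientation, a minimal sketch of the geometry-lookup step the clean.py summary above describes (cd_cnuc keys, WKT values, reprojection to EPSG:4674). The shapefile path and exact logic are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: build a cd_cnuc → WKT lookup from a polygon shapefile,
# reprojected to EPSG:4674 (SIRGAS 2000) so it matches the GEOGRAPHY column.
import geopandas as gpd


def load_geometry(shp_path: str) -> dict[str, str]:
    """Return a {cd_cnuc: WKT} mapping for all non-null polygons."""
    gdf = gpd.read_file(shp_path).to_crs(epsg=4674)
    return {
        str(code): geom.wkt
        for code, geom in zip(gdf["cd_cnuc"], gdf["geometry"])
        if geom is not None
    }


# Illustrative usage in the cleaning step ("shp_2024_2.shp" is a hypothetical path):
# geo_lookup = load_geometry("shp_2024_2.shp")
# df["geometria"] = df["codigo_uc"].astype(str).map(geo_lookup)
```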

Sequence Diagram

sequenceDiagram
    participant CSV as CSV Snapshots
    participant Clean as clean.py
    participant Shp as Shapefile\n(geometry)
    participant Parquet as Parquet\nPartitions
    participant dbt as dbt Model
    participant Upload as upload.py
    participant BDD as basedosdados

    CSV->>Clean: read cnuc_*.csv\n(utf-8-sig → latin1 fallback)
    Clean->>Clean: normalize columns,\ncoerce types, attach metadata
    Shp->>Clean: provide WKT geometry\n(join by codigo_uc / cd_cnuc)
    Clean->>Parquet: write partitioned\nParquet (ano/semestre)
    Parquet->>dbt: staging select
    dbt->>dbt: safe_cast fields\nst_geogfromtext(..., make_valid => true)
    dbt->>Upload: materialized table ready
    Upload->>BDD: upload/replace table\nwith Parquet data

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested labels

enhancement

Poem

🐰
I hopped through rows of CSV light,
Cleaned the names and fixed each byte,
WKT stitched to shapes so neat,
Parquets sleeping in year/semester seat,
Off to basedosdados — hop, delight! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning
    Explanation: Docstring coverage is 25.00%, below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ❓ Inconclusive
    Explanation: The description provides clear context, but is missing several required template sections: no explicit Motivação/Contexto, incomplete Technical Details (lacks performance impact), missing Testing & Validation checkboxes, no Risk/Rollback discussion, and an incomplete Dependencies section.
    Resolution: Complete all required template sections: add Motivação/Contexto explaining why this dataset was needed, detail performance impact, check off Testing & Validation items, document rollback procedures, and explicitly mark the Dependencies status.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed
    Explanation: The title clearly identifies this as a data-addition PR for the br_mma_cnuc dataset, with a descriptive subtitle specifying the Conservation Units table, directly matching the main changeset focus.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
models/br_mma_cnuc/code/clean.py (3)

231-239: Type annotation mismatch for geometry lookup dictionaries.

There are inconsistencies in the type hints for geometry-related dictionaries:

  1. load_geometry returns dict[str, str] (line 237-238: keys are cd_cnuc strings)
  2. clean_file parameter geo_lookup is typed as dict[int, str] (line 242) but should be dict[str, str]
  3. geo_cache is typed as dict[tuple[int, int], dict[int, str]] (line 359) but the inner dict should be dict[str, str]
🔧 Proposed fix for type annotations
-def clean_file(path: Path, geo_lookup: dict[int, str] | None) -> pd.DataFrame:
+def clean_file(path: Path, geo_lookup: dict[str, str] | None) -> pd.DataFrame:
-    geo_cache: dict[tuple[int, int], dict[int, str]] = {}
+    geo_cache: dict[tuple[int, int], dict[str, str]] = {}

Also applies to: 242-242, 359-359

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 231 - 239, The type hints for
geometry lookup dicts are inconsistent: update the annotations so they match the
actual keys returned by load_geometry (strings). Change clean_file's geo_lookup
parameter from dict[int, str] to dict[str, str], update geo_cache's inner dict
type from dict[int, str] to dict[str, str], and ensure load_geometry stays
declared as dict[str, str]; adjust any references to these symbols
(load_geometry, clean_file, geo_cache) accordingly so the key types are
consistent across the module.

270-283: Integer conversion could be simplified.

The integer conversion logic is a bit convoluted. Consider using pandas' built-in nullable integer conversion more directly.

♻️ Simplified integer conversion
     # Type casts — int
     for col in INT_COLS:
         if col in df.columns:
-            cleaned = (
-                df[col]
-                .astype(str)
-                .str.replace(".", "", regex=False)
-                .str.strip()
-            )
-            s = pd.to_numeric(cleaned, errors="coerce")
-            mask = s.isna()
-            arr = s.fillna(0).astype(int).astype("Int64")
-            arr[mask] = pd.NA
-            df[col] = arr
+            df[col] = pd.to_numeric(
+                df[col]
+                .astype(str)
+                .str.replace(".", "", regex=False)
+                .str.strip(),
+                errors="coerce",
+            ).astype("Int64")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 270 - 283, The integer
conversion loop for INT_COLS is overly complex; replace the mask/fillna/astype
dance by converting the cleaned string series directly to pandas nullable
integers: for each col in INT_COLS (when present in df) produce cleaned =
df[col].astype(str).str.replace(".", "", regex=False).str.strip(), then set
df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64") so NaNs become
pd.NA via the nullable Int64 dtype and you can remove the manual mask/fill
steps.

205-209: Missing docstrings for several functions.

Per coding guidelines, Python functions should have docstrings following Google Style. The following functions are missing docstrings: parse_filename, read_csv, clean_string, clean_file, write_partition, and main.

📝 Example docstring additions
 def parse_filename(path: Path) -> tuple[int, int]:
+    """Extract ano and semestre from CSV filename.
+
+    Args:
+        path: Path to the CSV file with pattern cnuc_YYYY_S.csv.
+
+    Returns:
+        Tuple of (ano, semestre).
+
+    Raises:
+        ValueError: If filename doesn't match expected pattern.
+    """
     m = re.search(r"cnuc_(\d{4})_(\d)", path.name)
 def read_csv(path: Path) -> pd.DataFrame:
+    """Read CSV with fallback encodings (utf-8-sig, latin1).
+
+    Args:
+        path: Path to the CSV file.
+
+    Returns:
+        DataFrame with all columns as strings.
+
+    Raises:
+        ValueError: If file cannot be decoded with any encoding.
+    """
     for enc in ("utf-8-sig", "latin1"):

As per coding guidelines: "Add type hints and docstrings for Python functions following Google Style".

Also applies to: 212-220, 223-228, 242-322, 345-350, 353-380

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 205 - 209, Add Google-style
docstrings to each missing function (parse_filename, read_csv, clean_string,
clean_file, write_partition, main and any others in the ranges noted) describing
purpose, Args with types, Returns with types, and Raises (e.g., ValueError for
parse_filename) where applicable; ensure the docstrings match existing type
hints and include brief examples or notes if behavior is non-obvious (e.g.,
regex format in parse_filename, expected CSV encoding/columns in read_csv,
normalization rules in clean_string, file I/O and partitioning behavior in
write_partition/clean_file, and command-line/entry semantics for main).
models/br_mma_cnuc/code/upload.py (1)

1-16: Missing module docstring and if __name__ == "__main__" guard.

The script lacks documentation and executes at import time, which can cause issues when the module is imported elsewhere.

♻️ Proposed structure with guard and docstring
+"""
+Upload cleaned CNUC parquet data to basedosdados staging.
+
+Usage:
+    python upload.py
+"""
+
+from pathlib import Path
+
 import basedosdados as bd
 
 DATASET_ID = "br_mma_cnuc"
 TABLE_ID = "unidades_conservacao"
 BILLING_PROJECT = "basedosdados-dev"
 
-tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
+ROOT = Path(__file__).resolve().parent.parent
+OUTPUT_DIR = ROOT / "output" / "unidades_conservacao"
 
-path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
 
-tb.create(
-    path=path_to_data,
-    if_storage_data_exists="replace",
-    if_table_exists="replace",
-    source_format="parquet",
-)
+def main() -> None:
+    """Upload parquet data to GCS and create BigQuery table."""
+    tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
+    tb.create(
+        path=str(OUTPUT_DIR),
+        if_storage_data_exists="replace",
+        if_table_exists="replace",
+        source_format="parquet",
+    )
+
+
+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/upload.py` around lines 1 - 16, The module executes
table upload at import-time and lacks documentation; add a top-level module
docstring describing its purpose, and move the creation logic into a guarded
main block: wrap the path_to_data, tb = bd.Table(...) initialization and the
tb.create(...) call inside an if __name__ == "__main__": block (or a main()
function invoked by that guard), so importing this module won't trigger
immediate execution; keep constants DATASET_ID, TABLE_ID, BILLING_PROJECT at
module scope and reference them from the main function.
models/br_mma_cnuc/schema.yaml (1)

10-21: Consider scoping tests to most recent data using __most_recent_year__.

For large datasets with historical partitions, the uniqueness test could be scoped to recent data to improve test performance. Based on learnings, you can use __most_recent_year__ in a where config.

📝 Optional: Scope uniqueness test to recent year
tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [ano, semestre, codigo_uc]
      config:
        where: __most_recent_year__

Based on learnings: "Use most_recent_year keyword in dbt test where config to scope tests to most recent rows (uses ano column)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/schema.yaml` around lines 10 - 21, Scope the uniqueness
test to the most recent year to improve performance: update the dbt test using
dbt_utils.unique_combination_of_columns (the test that uses
combination_of_columns: [ano, semestre, codigo_uc]) to add a config with a where
clause set to __most_recent_year__; keep the same combination_of_columns and
ensure the where uses the ano-based keyword so the uniqueness check only runs
against the most recent partition.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 411fa375-7570-4fa0-8c2c-5b9b9e2d9035

📥 Commits

Reviewing files that changed from the base of the PR and between a43a7f2 and f10acb7.

📒 Files selected for processing (4)
  • models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
  • models/br_mma_cnuc/code/clean.py
  • models/br_mma_cnuc/code/upload.py
  • models/br_mma_cnuc/schema.yaml


tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)

path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"

⚠️ Potential issue | 🟠 Major

Hardcoded local path will break on other machines.

The path /Users/rdahis/Downloads/CNUC/output/unidades_conservacao is specific to the author's machine. This script will fail for any other developer or CI environment.

Consider using a relative path from the script location (consistent with clean.py), or accept the path as a CLI argument.

🔧 Proposed fix using relative path
+from pathlib import Path
+
 import basedosdados as bd
 
 DATASET_ID = "br_mma_cnuc"
 TABLE_ID = "unidades_conservacao"
 BILLING_PROJECT = "basedosdados-dev"
 
+ROOT = Path(__file__).resolve().parent.parent
+OUTPUT_DIR = ROOT / "output" / "unidades_conservacao"
+
 tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
 
-path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
+path_to_data = str(OUTPUT_DIR)
 
 tb.create(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/upload.py` at line 9, The hardcoded path assigned to
path_to_data in upload.py will break on other machines; change it to derive a
path relative to the script (use pathlib.Path(__file__).resolve().parent /
"output/unidades_conservacao") or accept a CLI argument (use argparse to parse a
--data-path) and default to the relative path for parity with clean.py; update
any references to path_to_data accordingly so the script works on CI/other
developer machines.

@rdahis rdahis requested a review from a team March 31, 2026 22:33
- clean: code lives in pipelines/models/<dataset>/code/ from the start;
  add explicit pyarrow schema, encoding strategy, Brazilian number format,
  geometry/shapefile join key guidance, and post-write schema validation step
- dbt: add geometry/GEOGRAPHY pattern with make_valid, YAML block scalar rule,
  ignore_values parameter documentation for not_null_proportion test, and
  uniqueness test guidance (prefer stable string keys over nullable int IDs)
- discover: mandate discover_ids for all reference lookups; add theme to
  output block; remove license from example block
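The cleaning guidance listed above mentions Brazilian number formats and an explicit pyarrow schema; a minimal sketch of what those two rules can look like in practice (the column names and schema fields below are illustrative assumptions, not the actual clean.py code):

```python
# Illustrative sketch: normalize pt-BR numbers and enforce an explicit
# pyarrow schema before writing a Parquet partition. Column names are examples.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def to_float_br(series: pd.Series) -> pd.Series:
    """Convert strings like '1.234,56' to floats (thousands dot, decimal comma)."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separator
        .str.replace(",", ".", regex=False)  # decimal comma → decimal point
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


SCHEMA = pa.schema(
    [
        ("codigo_uc", pa.string()),
        ("area_ha", pa.float64()),     # hypothetical numeric column
        ("geometria", pa.string()),    # WKT kept as string until the dbt cast
    ]
)


def write_partition(df: pd.DataFrame, path: str) -> None:
    """Fail loudly if the frame does not fit the declared schema."""
    table = pa.Table.from_pandas(df, schema=SCHEMA, preserve_index=False)
    pq.write_table(table, path)
```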

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
.claude/commands/onboarding-dbt.md (1)

70-71: Clarify the uniqueness test template for multi-column partitions.

The template shows:

combination_of_columns: [<partition_col>, <primary_key_col>]

But many tables (including br_mma_cnuc) use multiple partition columns (e.g., ano + semestre). The singular placeholder <partition_col> might lead developers to include only one partition column.

📝 Proposed clarification
       - dbt_utils.unique_combination_of_columns:
-          combination_of_columns: [<partition_col>, <primary_key_col>]
+          combination_of_columns: [<partition_col_1>, <partition_col_2>, ..., <primary_key_col>]
+          # Include ALL partition columns plus the primary key

Or use a concrete example:

# Example for a table partitioned by ano + semestre with primary key codigo_uc:
combination_of_columns: [ano, semestre, codigo_uc]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-dbt.md around lines 70 - 71, Update the dbt
uniqueness test template (dbt_utils.unique_combination_of_columns) to make it
explicit that combination_of_columns can include multiple partition columns and
the primary key, not just a single <partition_col> placeholder; change the
placeholder text and/or add an inline concrete example (e.g., for a table
partitioned by ano + semestre with primary key codigo_uc show
combination_of_columns: [ano, semestre, codigo_uc]) so developers know to list
all partition columns followed by the primary key.
.claude/commands/onboarding-clean.md (2)

99-109: Strengthen the join-key validation guidance.

The current guidance says "Verify the join key" but doesn't specify how to verify or what action to take when keys mismatch. The actual implementation in models/br_mma_cnuc/code/clean.py:314-321 uses .map() which silently produces NA for unmatched keys without raising an error.

Consider adding explicit verification steps:

# After geometry merge
missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()]
if len(missing_geo) > 0:
    logger.warning(f"{len(missing_geo)} rows missing geometry after merge")
    # Optionally: print sample IDs or raise if coverage is too low
📋 Proposed documentation enhancement
 - **Verify the join key** between the shapefile and tabular data — shapefile IDs
   and tabular IDs are often different systems (e.g. `cd_cnuc` vs `id_uc`).
   Inspect both before joining.
+- After the merge, assert that geometry coverage meets expectations:
+  ```python
+  missing = df[df["geometria"].isna() & df["<id_col>"].notna()]
+  assert len(missing) / len(df) < 0.01, "more than 1% of rows missing geometry"
+  ```
 - In the DBT model, cast with `ST_GEOGFROMTEXT(col, make_valid => true)` and
   type the column as GEOGRAPHY, not STRING.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-clean.md around lines 99 - 109, The merge that
populates the geometry in models/br_mma_cnuc/code/clean.py currently uses .map()
(around the block using .map() at lines ~314-321) which silently yields NA for
unmatched keys; after that assignment add explicit validation: compute
missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()], log a
warning with len(missing_geo) and a small sample of codigo_uc values (e.g.
missing_geo["codigo_uc"].unique()[:10]), and either assert a coverage threshold
(e.g. len(missing_geo)/len(df) < 0.01) or raise if too many are missing;
additionally consider switching the .map() step to an explicit merge/join to
make mismatches clearer and preserve unmatched-key diagnostics.

113-117: Add geometry-specific validation to Step 4.

The validation checklist should explicitly mention checking geometry column completeness, especially since the join-key mapping (lines 104-106) can silently produce NA values.

📋 Proposed addition
 After running on the subset:
 1. Check the parquet schema with `pq.read_schema(path)` — verify all column types
    match the architecture before uploading.
 2. Verify column names match architecture exactly.
 3. Check for unexpected nulls in primary key columns.
+4. If geometry is present, verify coverage: print % of rows with non-null geometry.
-4. Print row counts and a sample.
+5. Print row counts and a sample.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-clean.md around lines 113 - 117, Update the
onboarding checklist by extending Step 4 to include geometry-specific
validation: after printing row counts and a sample, explicitly verify the
geometry column(s) for completeness (no NA/null values), correct type(s) and
CRS, and validity (no corrupt/empty geometries) to catch silent NA results from
the join-key mapping; refer to the geometry column name(s) used by the join-key
mapping and add a short guideline to fail or flag the upload if any geometry
rows are missing or invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4249d7d5-bafa-421a-a783-f4b08c7cdd70

📥 Commits

Reviewing files that changed from the base of the PR and between f10acb7 and 3216a27.

📒 Files selected for processing (3)
  • .claude/commands/onboarding-clean.md
  • .claude/commands/onboarding-dbt.md
  • .claude/commands/onboarding-discover.md

Comment on lines +20 to +24
This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.

**Never search the web, hardcode IDs, or guess slugs.** All reference IDs (themes,
organizations, licenses, tags, entities, statuses) must come from `discover_ids`
or `lookup_area`. IDs differ between dev and prod environments.

⚠️ Potential issue | 🟡 Minor

Documentation inconsistency: "area" and "tags" not listed in discover_ids output.

Line 20 lists what discover_ids returns but omits "area" and "tags," yet:

  • Line 23 mentions "tags" as a reference ID type
  • Line 54 in the example shows area.br

Since line 24 clarifies that lookup_area is a separate tool for areas, consider either:

  1. Adding "area" and "tags" to line 20 if discover_ids returns them, OR
  2. Clarifying in the example (around line 54) that area.br comes from lookup_area (Step 1 mentions only discover_ids)
📝 Suggested clarification

Option 1: If discover_ids does return area and tags, update line 20:

-This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.
+This returns IDs for: status, bigquery_type, entity, area, license, availability, organization, theme, tags.

Option 2: If areas come from lookup_area, clarify in the example section by adding a comment or separate subsection showing the lookup_area call result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-discover.md around lines 20 - 24, The docs list
what discover_ids returns but omit "area" and "tags" while the example uses
area.br and mentions tags; update the documentation so it's unambiguous: either
add "area" and "tags" to the discover_ids return list if discover_ids actually
returns them, or explicitly state in the example that area.br (and any
area-related values) come from lookup_area (and show lookup_area usage) and that
tags are obtained via discover_ids or another lookup; reference the
functions/values discover_ids, lookup_area, area.br, and tags when making the
clarification.

safe_cast(razao_diferenca_area as float64) razao_diferenca_area,
safe_cast(data_publicacao_cnuc as date) data_publicacao_cnuc,
safe_cast(data_ultima_certificacao as date) data_ultima_certificacao,
st_geogfromtext(safe_cast(geometria as string), make_valid => true) geometria,
Collaborator


geometria should be typed as geography. Updating the style manual and integrating it with the MCP would be a good next step.

Collaborator


Another point: to ensure quality, it is good to validate the geometries against an approximate bounding box (BBOX) of Brazil. The goal is to check whether the polygons fall within BR territory.
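A minimal sketch of such a check, assuming geometria holds WKT strings; the bounding-box values below are a rough approximation of Brazil's extent, not official limits:

```python
# Rough sanity check: flag rows whose polygon is not contained in an
# approximate Brazil bounding box (values are assumptions, adjust as needed).
import pandas as pd
from shapely import wkt
from shapely.geometry import box

BR_BBOX = box(-74.1, -34.0, -28.6, 5.4)  # lon_min, lat_min, lon_max, lat_max


def outside_brazil(df: pd.DataFrame, geom_col: str = "geometria") -> pd.DataFrame:
    """Return rows with a non-null geometry that falls outside BR_BBOX."""
    has_geom = df[geom_col].notna()
    is_outside = df.loc[has_geom, geom_col].map(
        lambda s: not wkt.loads(s).within(BR_BBOX)
    )
    return df.loc[is_outside[is_outside].index]
```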

Collaborator


Of the ~36k rows, only 2,927 have non-null geometries. That seems odd to me.

Collaborator


There are null geometries whose rows still carry area values (in hectares) in the other columns.

safe_cast(esfera_administrativa as string) esfera_administrativa,
safe_cast(categoria_manejo as string) categoria_manejo,
safe_cast(categoria_iucn as string) categoria_iucn,
safe_cast(grupo as string) grupo,
Collaborator


[screenshot of placeholder values omitted]

Convert these to NULL.

Collaborator


The same happens with the orgao_gestor and informacoes_gerais variables.

Collaborator


This behavior occurs across several columns.
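One way to address this is in the cleaning step, sketched below; the PLACEHOLDER_VALUES tokens are hypothetical examples standing in for the values shown in the screenshot, and the same effect could also be achieved with nullif() in the dbt model:

```python
# Sketch: map placeholder strings to NA across free-text columns.
# The token set is hypothetical — replace it with the values actually
# observed in the source data.
import pandas as pd

PLACEHOLDER_VALUES = {"", "-", "nan", "null"}  # hypothetical examples
TEXT_COLS = [
    "esfera_administrativa",
    "categoria_manejo",
    "orgao_gestor",
    "informacoes_gerais",
]


def placeholders_to_na(df: pd.DataFrame) -> pd.DataFrame:
    """Replace placeholder tokens with NA in the listed string columns."""
    tokens = {v.lower() for v in PLACEHOLDER_VALUES}
    for col in TEXT_COLS:
        if col in df.columns:
            stripped = df[col].astype("string").str.strip()
            df[col] = stripped.mask(stripped.str.lower().isin(tokens))
    return df
```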


select
safe_cast(ano as int64) ano,
safe_cast(semestre as int64) semestre,
Collaborator


String variables with UC descriptions and similar content are sometimes in ALL CAPS and sometimes in Title Case.
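If the team decides to standardize these, a small sketch of one option applied during cleaning; the particle list and the example column name are assumptions, and whether to normalize at all is a style decision rather than something this PR prescribes:

```python
# Sketch: normalize mixed ALL CAPS / Title Case text to one consistent style,
# keeping common Portuguese particles lowercase (heuristic, adjust as needed).
import pandas as pd

LOWERCASE_PARTICLES = {"da", "de", "do", "das", "dos", "e"}


def normalize_case(value: str) -> str:
    words = value.strip().split()
    out = []
    for i, word in enumerate(words):
        lower = word.lower()
        out.append(lower if i > 0 and lower in LOWERCASE_PARTICLES else lower.capitalize())
    return " ".join(out)


# Illustrative usage ("nome_uc" is a hypothetical column name):
# df["nome_uc"] = df["nome_uc"].map(normalize_case, na_action="ignore")
```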

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Shapefile sources: (ano, semestre) → polygon shapefile path ────────────
# Points-only files (shp_2024_1) are excluded; 2025 shapefiles not yet available.

@folhesgabriel folhesgabriel self-requested a review April 1, 2026 10:58
@mergify
Contributor

mergify Bot commented Apr 7, 2026

@rdahis this pull request has conflicts 😩

@mergify mergify Bot added the conflict label Apr 7, 2026