
[Data] br_mma_cnuc — Cadastro Nacional de Unidades de Conservação #1488

Open
rdahis wants to merge 5 commits into main from data/br_mma_cnuc

Conversation

@rdahis
Member

@rdahis rdahis commented Mar 31, 2026

Description

Adds the br_mma_cnuc dataset with the unidades_conservacao table.

  • Source: Ministério do Meio Ambiente e Mudança do Clima (MMA) — CNUC API
  • Temporal coverage: 2018–2026 (biannual), 13 partitions
  • Rows: 38,065
  • Partitioning: ano + semestre
  • Geometry: geometria column of type GEOGRAPHY (WKT → ST_GEOGFROMTEXT), available for 2024-S2 (2,927 polygons)

Files

  • models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
  • models/br_mma_cnuc/schema.yaml
  • models/br_mma_cnuc/code/clean.py — CSV cleaning and shapefile merge
  • models/br_mma_cnuc/code/upload.py — upload to GCS/BQ staging

Tests

  • dbt run and dbt test passing (PASS=4, WARN=0, ERROR=0)
  • Uniqueness on (ano, semestre, codigo_uc)
  • not_null_proportion >= 0.05 (excluding 6 columns that are empty at the source)

Summary by CodeRabbit

  • New Features
    • Conservation units dataset now available with validated geographic boundaries and biannual partitions.
    • Automated pipeline publishes consistent Parquet partitions for each year/semester to simplify consumption.
  • Data Quality / Tests
    • Uniqueness and completeness checks added to ensure primary-key uniqueness and minimum not-null proportions.
  • Documentation
    • Onboarding guides updated with geometry handling and parquet-schema requirements.

Adds br_mma_cnuc.unidades_conservacao: 13 biannual snapshots (2018–2026),
38,065 rows, partitioned by ano+semestre, with WKT geometry (GEOGRAPHY)
for 2024-S2. Source: MMA/CNUC API.
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

📝 Walkthrough

Adds a complete ingest pipeline for Brazil CNUC conservation units: a Python cleaner that normalizes CSV snapshots and merges geometries into partitioned Parquet, a dbt model that casts types and converts WKT to GEOGRAPHY, schema/tests, and a small upload script to push Parquet to basedosdados.

Changes

  • Cleaning script — models/br_mma_cnuc/code/clean.py
    New CLI script to parse cnuc_*.csv snapshots, normalize column names, coerce types (ints, floats, dates, strings), merge WKT geometry from shapefiles (reproject → EPSG:4674), enforce a fixed PyArrow schema, and write partitioned Parquet at output/unidades_conservacao/ano=<year>/semestre=<n>/data.parquet. (A sketch of the geometry lookup follows this list.)
  • DBT model — models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
    New dbt model (materialized as table) selecting from staging, applying safe_cast to typed columns and converting geometria with st_geogfromtext(safe_cast(geometria as string), make_valid => true).
  • Schema & tests — models/br_mma_cnuc/schema.yaml
    New model metadata, column docs, dbt_utils.unique_combination_of_columns on (ano, semestre, codigo_uc), and a not_null_proportion_multiple_columns test with specified ignore_values.
  • Upload helper — models/br_mma_cnuc/code/upload.py
    New script to instantiate basedosdados.Table and create/replace the destination br_mma_cnuc.unidades_conservacao table by uploading a local Parquet directory.
  • Onboarding docs — .claude/commands/onboarding-clean.md, .claude/commands/onboarding-dbt.md, .claude/commands/onboarding-discover.md
    Documentation updates: cleaning guidelines (encoding fallbacks, Brazilian numeric normalization, explicit pyarrow schema, geometry WKT→GEOGRAPHY guidance), dbt schema template and geometry notes, and discover_ids guidance extended to include theme and stronger ID lookup rules.
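For orientation, a minimal sketch of the geometry-lookup step the clean.py summary above describes (cd_cnuc keys, WKT values, reprojection to EPSG:4674). The shapefile path and exact logic are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: build a cd_cnuc → WKT lookup from a polygon shapefile,
# reprojected to EPSG:4674 (SIRGAS 2000) so it matches the GEOGRAPHY column.
import geopandas as gpd


def load_geometry(shp_path: str) -> dict[str, str]:
    """Return a {cd_cnuc: WKT} mapping for all non-null polygons."""
    gdf = gpd.read_file(shp_path).to_crs(epsg=4674)
    return {
        str(code): geom.wkt
        for code, geom in zip(gdf["cd_cnuc"], gdf["geometry"])
        if geom is not None
    }


# Illustrative usage in the cleaning step ("shp_2024_2.shp" is a hypothetical path):
# geo_lookup = load_geometry("shp_2024_2.shp")
# df["geometria"] = df["codigo_uc"].astype(str).map(geo_lookup)
```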

Sequence Diagram

sequenceDiagram
    participant CSV as CSV Snapshots
    participant Clean as clean.py
    participant Shp as Shapefile\n(geometry)
    participant Parquet as Parquet\nPartitions
    participant dbt as dbt Model
    participant Upload as upload.py
    participant BDD as basedosdados

    CSV->>Clean: read cnuc_*.csv\n(utf-8-sig → latin1 fallback)
    Clean->>Clean: normalize columns,\ncoerce types, attach metadata
    Shp->>Clean: provide WKT geometry\n(join by codigo_uc / cd_cnuc)
    Clean->>Parquet: write partitioned\nParquet (ano/semestre)
    Parquet->>dbt: staging select
    dbt->>dbt: safe_cast fields\nst_geogfromtext(..., make_valid => true)
    dbt->>Upload: materialized table ready
    Upload->>BDD: upload/replace table\nwith Parquet data

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested labels

enhancement

Poem

🐰
I hopped through rows of CSV light,
Cleaned the names and fixed each byte,
WKT stitched to shapes so neat,
Parquets sleeping in year/semester seat,
Off to basedosdados — hop, delight! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning
    Explanation: Docstring coverage is 25.00%, below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ❓ Inconclusive
    Explanation: The description provides clear context, but is missing several required template sections: no explicit Motivação/Contexto, incomplete Technical Details (lacks performance impact), missing Testing & Validation checkboxes, no Risk/Rollback discussion, and an incomplete Dependencies section.
    Resolution: Complete all required template sections: add Motivação/Contexto explaining why this dataset was needed, detail performance impact, check off Testing & Validation items, document rollback procedures, and explicitly mark the Dependencies status.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed
    Explanation: The title clearly identifies this as a data-addition PR for the br_mma_cnuc dataset, with a descriptive subtitle specifying the Conservation Units table, directly matching the main changeset focus.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
models/br_mma_cnuc/code/clean.py (3)

231-239: Type annotation mismatch for geometry lookup dictionaries.

There are inconsistencies in the type hints for geometry-related dictionaries:

  1. load_geometry returns dict[str, str] (line 237-238: keys are cd_cnuc strings)
  2. clean_file parameter geo_lookup is typed as dict[int, str] (line 242) but should be dict[str, str]
  3. geo_cache is typed as dict[tuple[int, int], dict[int, str]] (line 359) but the inner dict should be dict[str, str]
🔧 Proposed fix for type annotations
-def clean_file(path: Path, geo_lookup: dict[int, str] | None) -> pd.DataFrame:
+def clean_file(path: Path, geo_lookup: dict[str, str] | None) -> pd.DataFrame:
-    geo_cache: dict[tuple[int, int], dict[int, str]] = {}
+    geo_cache: dict[tuple[int, int], dict[str, str]] = {}

Also applies to: 242-242, 359-359

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 231 - 239, The type hints for
geometry lookup dicts are inconsistent: update the annotations so they match the
actual keys returned by load_geometry (strings). Change clean_file's geo_lookup
parameter from dict[int, str] to dict[str, str], update geo_cache's inner dict
type from dict[int, str] to dict[str, str], and ensure load_geometry stays
declared as dict[str, str]; adjust any references to these symbols
(load_geometry, clean_file, geo_cache) accordingly so the key types are
consistent across the module.

270-283: Integer conversion could be simplified.

The integer conversion logic is a bit convoluted. Consider using pandas' built-in nullable integer conversion more directly.

♻️ Simplified integer conversion
     # Type casts — int
     for col in INT_COLS:
         if col in df.columns:
-            cleaned = (
-                df[col]
-                .astype(str)
-                .str.replace(".", "", regex=False)
-                .str.strip()
-            )
-            s = pd.to_numeric(cleaned, errors="coerce")
-            mask = s.isna()
-            arr = s.fillna(0).astype(int).astype("Int64")
-            arr[mask] = pd.NA
-            df[col] = arr
+            df[col] = pd.to_numeric(
+                df[col]
+                .astype(str)
+                .str.replace(".", "", regex=False)
+                .str.strip(),
+                errors="coerce",
+            ).astype("Int64")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 270 - 283, The integer
conversion loop for INT_COLS is overly complex; replace the mask/fillna/astype
dance by converting the cleaned string series directly to pandas nullable
integers: for each col in INT_COLS (when present in df) produce cleaned =
df[col].astype(str).str.replace(".", "", regex=False).str.strip(), then set
df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64") so NaNs become
pd.NA via the nullable Int64 dtype and you can remove the manual mask/fill
steps.

205-209: Missing docstrings for several functions.

Per coding guidelines, Python functions should have docstrings following Google Style. The following functions are missing docstrings: parse_filename, read_csv, clean_string, clean_file, write_partition, and main.

📝 Example docstring additions
 def parse_filename(path: Path) -> tuple[int, int]:
+    """Extract ano and semestre from CSV filename.
+
+    Args:
+        path: Path to the CSV file with pattern cnuc_YYYY_S.csv.
+
+    Returns:
+        Tuple of (ano, semestre).
+
+    Raises:
+        ValueError: If filename doesn't match expected pattern.
+    """
     m = re.search(r"cnuc_(\d{4})_(\d)", path.name)
 def read_csv(path: Path) -> pd.DataFrame:
+    """Read CSV with fallback encodings (utf-8-sig, latin1).
+
+    Args:
+        path: Path to the CSV file.
+
+    Returns:
+        DataFrame with all columns as strings.
+
+    Raises:
+        ValueError: If file cannot be decoded with any encoding.
+    """
     for enc in ("utf-8-sig", "latin1"):

As per coding guidelines: "Add type hints and docstrings for Python functions following Google Style".

Also applies to: 212-220, 223-228, 242-322, 345-350, 353-380

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/clean.py` around lines 205 - 209, Add Google-style
docstrings to each missing function (parse_filename, read_csv, clean_string,
clean_file, write_partition, main and any others in the ranges noted) describing
purpose, Args with types, Returns with types, and Raises (e.g., ValueError for
parse_filename) where applicable; ensure the docstrings match existing type
hints and include brief examples or notes if behavior is non-obvious (e.g.,
regex format in parse_filename, expected CSV encoding/columns in read_csv,
normalization rules in clean_string, file I/O and partitioning behavior in
write_partition/clean_file, and command-line/entry semantics for main).
models/br_mma_cnuc/code/upload.py (1)

1-16: Missing module docstring and if __name__ == "__main__" guard.

The script lacks documentation and executes at import time, which can cause issues when the module is imported elsewhere.

♻️ Proposed structure with guard and docstring
+"""
+Upload cleaned CNUC parquet data to basedosdados staging.
+
+Usage:
+    python upload.py
+"""
+
+from pathlib import Path
+
 import basedosdados as bd
 
 DATASET_ID = "br_mma_cnuc"
 TABLE_ID = "unidades_conservacao"
 BILLING_PROJECT = "basedosdados-dev"
 
-tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
+ROOT = Path(__file__).resolve().parent.parent
+OUTPUT_DIR = ROOT / "output" / "unidades_conservacao"
 
-path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
 
-tb.create(
-    path=path_to_data,
-    if_storage_data_exists="replace",
-    if_table_exists="replace",
-    source_format="parquet",
-)
+def main() -> None:
+    """Upload parquet data to GCS and create BigQuery table."""
+    tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
+    tb.create(
+        path=str(OUTPUT_DIR),
+        if_storage_data_exists="replace",
+        if_table_exists="replace",
+        source_format="parquet",
+    )
+
+
+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/upload.py` around lines 1 - 16, The module executes
table upload at import-time and lacks documentation; add a top-level module
docstring describing its purpose, and move the creation logic into a guarded
main block: wrap the path_to_data, tb = bd.Table(...) initialization and the
tb.create(...) call inside an if __name__ == "__main__": block (or a main()
function invoked by that guard), so importing this module won't trigger
immediate execution; keep constants DATASET_ID, TABLE_ID, BILLING_PROJECT at
module scope and reference them from the main function.
models/br_mma_cnuc/schema.yaml (1)

10-21: Consider scoping tests to most recent data using __most_recent_year__.

For large datasets with historical partitions, the uniqueness test could be scoped to recent data to improve test performance. Based on learnings, you can use __most_recent_year__ in a where config.

📝 Optional: Scope uniqueness test to recent year
tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [ano, semestre, codigo_uc]
      config:
        where: __most_recent_year__

Based on learnings: "Use most_recent_year keyword in dbt test where config to scope tests to most recent rows (uses ano column)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/schema.yaml` around lines 10 - 21, Scope the uniqueness
test to the most recent year to improve performance: update the dbt test using
dbt_utils.unique_combination_of_columns (the test that uses
combination_of_columns: [ano, semestre, codigo_uc]) to add a config with a where
clause set to __most_recent_year__; keep the same combination_of_columns and
ensure the where uses the ano-based keyword so the uniqueness check only runs
against the most recent partition.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 411fa375-7570-4fa0-8c2c-5b9b9e2d9035

📥 Commits

Reviewing files that changed from the base of the PR and between a43a7f2 and f10acb7.

📒 Files selected for processing (4)
  • models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
  • models/br_mma_cnuc/code/clean.py
  • models/br_mma_cnuc/code/upload.py
  • models/br_mma_cnuc/schema.yaml


tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)

path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"

⚠️ Potential issue | 🟠 Major

Hardcoded local path will break on other machines.

The path /Users/rdahis/Downloads/CNUC/output/unidades_conservacao is specific to the author's machine. This script will fail for any other developer or CI environment.

Consider using a relative path from the script location (consistent with clean.py), or accept the path as a CLI argument.

🔧 Proposed fix using relative path
+from pathlib import Path
+
 import basedosdados as bd
 
 DATASET_ID = "br_mma_cnuc"
 TABLE_ID = "unidades_conservacao"
 BILLING_PROJECT = "basedosdados-dev"
 
+ROOT = Path(__file__).resolve().parent.parent
+OUTPUT_DIR = ROOT / "output" / "unidades_conservacao"
+
 tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)
 
-path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
+path_to_data = str(OUTPUT_DIR)
 
 tb.create(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/br_mma_cnuc/code/upload.py` at line 9, The hardcoded path assigned to
path_to_data in upload.py will break on other machines; change it to derive a
path relative to the script (use pathlib.Path(__file__).resolve().parent /
"output/unidades_conservacao") or accept a CLI argument (use argparse to parse a
--data-path) and default to the relative path for parity with clean.py; update
any references to path_to_data accordingly so the script works on CI/other
developer machines.

@rdahis rdahis requested a review from a team March 31, 2026 22:33
- clean: code lives in pipelines/models/<dataset>/code/ from the start;
  add explicit pyarrow schema, encoding strategy, Brazilian number format,
  geometry/shapefile join key guidance, and post-write schema validation step
- dbt: add geometry/GEOGRAPHY pattern with make_valid, YAML block scalar rule,
  ignore_values parameter documentation for not_null_proportion test, and
  uniqueness test guidance (prefer stable string keys over nullable int IDs)
- discover: mandate discover_ids for all reference lookups; add theme to
  output block; remove license from example block
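The cleaning guidance listed above mentions Brazilian number formats and an explicit pyarrow schema; a minimal sketch of what those two rules can look like in practice (the column names and schema fields below are illustrative assumptions, not the actual clean.py code):

```python
# Illustrative sketch: normalize pt-BR numbers and enforce an explicit
# pyarrow schema before writing a Parquet partition. Column names are examples.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def to_float_br(series: pd.Series) -> pd.Series:
    """Convert strings like '1.234,56' to floats (thousands dot, decimal comma)."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separator
        .str.replace(",", ".", regex=False)  # decimal comma → decimal point
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


SCHEMA = pa.schema(
    [
        ("codigo_uc", pa.string()),
        ("area_ha", pa.float64()),     # hypothetical numeric column
        ("geometria", pa.string()),    # WKT kept as string until the dbt cast
    ]
)


def write_partition(df: pd.DataFrame, path: str) -> None:
    """Fail loudly if the frame does not fit the declared schema."""
    table = pa.Table.from_pandas(df, schema=SCHEMA, preserve_index=False)
    pq.write_table(table, path)
```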

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
.claude/commands/onboarding-dbt.md (1)

70-71: Clarify the uniqueness test template for multi-column partitions.

The template shows:

combination_of_columns: [<partition_col>, <primary_key_col>]

But many tables (including br_mma_cnuc) use multiple partition columns (e.g., ano + semestre). The singular placeholder <partition_col> might lead developers to include only one partition column.

📝 Proposed clarification
       - dbt_utils.unique_combination_of_columns:
-          combination_of_columns: [<partition_col>, <primary_key_col>]
+          combination_of_columns: [<partition_col_1>, <partition_col_2>, ..., <primary_key_col>]
+          # Include ALL partition columns plus the primary key

Or use a concrete example:

# Example for a table partitioned by ano + semestre with primary key codigo_uc:
combination_of_columns: [ano, semestre, codigo_uc]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-dbt.md around lines 70 - 71, Update the dbt
uniqueness test template (dbt_utils.unique_combination_of_columns) to make it
explicit that combination_of_columns can include multiple partition columns and
the primary key, not just a single <partition_col> placeholder; change the
placeholder text and/or add an inline concrete example (e.g., for a table
partitioned by ano + semestre with primary key codigo_uc show
combination_of_columns: [ano, semestre, codigo_uc]) so developers know to list
all partition columns followed by the primary key.
.claude/commands/onboarding-clean.md (2)

99-109: Strengthen the join-key validation guidance.

The current guidance says "Verify the join key" but doesn't specify how to verify or what action to take when keys mismatch. The actual implementation in models/br_mma_cnuc/code/clean.py:314-321 uses .map() which silently produces NA for unmatched keys without raising an error.

Consider adding explicit verification steps:

# After geometry merge
missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()]
if len(missing_geo) > 0:
    logger.warning(f"{len(missing_geo)} rows missing geometry after merge")
    # Optionally: print sample IDs or raise if coverage is too low
📋 Proposed documentation enhancement
 - **Verify the join key** between the shapefile and tabular data — shapefile IDs
   and tabular IDs are often different systems (e.g. `cd_cnuc` vs `id_uc`).
   Inspect both before joining.
+- After the merge, assert that geometry coverage meets expectations:
+  ```python
+  missing = df[df["geometria"].isna() & df["<id_col>"].notna()]
+  assert len(missing) / len(df) < 0.01, "more than 1% of rows missing geometry"
+  ```
 - In the DBT model, cast with `ST_GEOGFROMTEXT(col, make_valid => true)` and
   type the column as GEOGRAPHY, not STRING.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-clean.md around lines 99 - 109, The merge that
populates the geometry in models/br_mma_cnuc/code/clean.py currently uses .map()
(around the block using .map() at lines ~314-321) which silently yields NA for
unmatched keys; after that assignment add explicit validation: compute
missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()], log a
warning with len(missing_geo) and a small sample of codigo_uc values (e.g.
missing_geo["codigo_uc"].unique()[:10]), and either assert a coverage threshold
(e.g. len(missing_geo)/len(df) < 0.01) or raise if too many are missing;
additionally consider switching the .map() step to an explicit merge/join to
make mismatches clearer and preserve unmatched-key diagnostics.

113-117: Add geometry-specific validation to Step 4.

The validation checklist should explicitly mention checking geometry column completeness, especially since the join-key mapping (lines 104-106) can silently produce NA values.

📋 Proposed addition
 After running on the subset:
 1. Check the parquet schema with `pq.read_schema(path)` — verify all column types
    match the architecture before uploading.
 2. Verify column names match architecture exactly.
 3. Check for unexpected nulls in primary key columns.
+4. If geometry is present, verify coverage: print % of rows with non-null geometry.
-4. Print row counts and a sample.
+5. Print row counts and a sample.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-clean.md around lines 113 - 117, Update the
onboarding checklist by extending Step 4 to include geometry-specific
validation: after printing row counts and a sample, explicitly verify the
geometry column(s) for completeness (no NA/null values), correct type(s) and
CRS, and validity (no corrupt/empty geometries) to catch silent NA results from
the join-key mapping; refer to the geometry column name(s) used by the join-key
mapping and add a short guideline to fail or flag the upload if any geometry
rows are missing or invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4249d7d5-bafa-421a-a783-f4b08c7cdd70

📥 Commits

Reviewing files that changed from the base of the PR and between f10acb7 and 3216a27.

📒 Files selected for processing (3)
  • .claude/commands/onboarding-clean.md
  • .claude/commands/onboarding-dbt.md
  • .claude/commands/onboarding-discover.md

Comment on lines +20 to +24
This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.

**Never search the web, hardcode IDs, or guess slugs.** All reference IDs (themes,
organizations, licenses, tags, entities, statuses) must come from `discover_ids`
or `lookup_area`. IDs differ between dev and prod environments.

⚠️ Potential issue | 🟡 Minor

Documentation inconsistency: "area" and "tags" not listed in discover_ids output.

Line 20 lists what discover_ids returns but omits "area" and "tags," yet:

  • Line 23 mentions "tags" as a reference ID type
  • Line 54 in the example shows area.br

Since line 24 clarifies that lookup_area is a separate tool for areas, consider either:

  1. Adding "area" and "tags" to line 20 if discover_ids returns them, OR
  2. Clarifying in the example (around line 54) that area.br comes from lookup_area (Step 1 mentions only discover_ids)
📝 Suggested clarification

Option 1: If discover_ids does return area and tags, update line 20:

-This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.
+This returns IDs for: status, bigquery_type, entity, area, license, availability, organization, theme, tags.

Option 2: If areas come from lookup_area, clarify in the example section by adding a comment or separate subsection showing the lookup_area call result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/onboarding-discover.md around lines 20 - 24, The docs list
what discover_ids returns but omit "area" and "tags" while the example uses
area.br and mentions tags; update the documentation so it's unambiguous: either
add "area" and "tags" to the discover_ids return list if discover_ids actually
returns them, or explicitly state in the example that area.br (and any
area-related values) come from lookup_area (and show lookup_area usage) and that
tags are obtained via discover_ids or another lookup; reference the
functions/values discover_ids, lookup_area, area.br, and tags when making the
clarification.

safe_cast(razao_diferenca_area as float64) razao_diferenca_area,
safe_cast(data_publicacao_cnuc as date) data_publicacao_cnuc,
safe_cast(data_ultima_certificacao as date) data_ultima_certificacao,
st_geogfromtext(safe_cast(geometria as string), make_valid => true) geometria,
Collaborator


geometria should be typed as geography. Updating the style manual and integrating it with the MCP would be a good next step.

Collaborator


Another point: to ensure quality, it is good to validate the geometries against an approximate bounding box (BBOX) of Brazil. The goal is to check whether the polygons fall within BR territory.
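A minimal sketch of such a check, assuming geometria holds WKT strings; the bounding-box values below are a rough approximation of Brazil's extent, not official limits:

```python
# Rough sanity check: flag rows whose polygon is not contained in an
# approximate Brazil bounding box (values are assumptions, adjust as needed).
import pandas as pd
from shapely import wkt
from shapely.geometry import box

BR_BBOX = box(-74.1, -34.0, -28.6, 5.4)  # lon_min, lat_min, lon_max, lat_max


def outside_brazil(df: pd.DataFrame, geom_col: str = "geometria") -> pd.DataFrame:
    """Return rows with a non-null geometry that falls outside BR_BBOX."""
    has_geom = df[geom_col].notna()
    is_outside = df.loc[has_geom, geom_col].map(
        lambda s: not wkt.loads(s).within(BR_BBOX)
    )
    return df.loc[is_outside[is_outside].index]
```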

Collaborator


Of the ~36k rows, only 2,927 have non-null geometries. That seems odd to me.

Collaborator


There are null geometries whose rows still carry area values (in hectares) in the other columns.

safe_cast(esfera_administrativa as string) esfera_administrativa,
safe_cast(categoria_manejo as string) categoria_manejo,
safe_cast(categoria_iucn as string) categoria_iucn,
safe_cast(grupo as string) grupo,
Collaborator


[screenshot of placeholder values omitted]

Convert these to NULL.

Collaborator


The same happens with the orgao_gestor and informacoes_gerais variables.

Collaborator


This behavior occurs across several columns.
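One way to address this is in the cleaning step, sketched below; the PLACEHOLDER_VALUES tokens are hypothetical examples standing in for the values shown in the screenshot, and the same effect could also be achieved with nullif() in the dbt model:

```python
# Sketch: map placeholder strings to NA across free-text columns.
# The token set is hypothetical — replace it with the values actually
# observed in the source data.
import pandas as pd

PLACEHOLDER_VALUES = {"", "-", "nan", "null"}  # hypothetical examples
TEXT_COLS = [
    "esfera_administrativa",
    "categoria_manejo",
    "orgao_gestor",
    "informacoes_gerais",
]


def placeholders_to_na(df: pd.DataFrame) -> pd.DataFrame:
    """Replace placeholder tokens with NA in the listed string columns."""
    tokens = {v.lower() for v in PLACEHOLDER_VALUES}
    for col in TEXT_COLS:
        if col in df.columns:
            stripped = df[col].astype("string").str.strip()
            df[col] = stripped.mask(stripped.str.lower().isin(tokens))
    return df
```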


select
safe_cast(ano as int64) ano,
safe_cast(semestre as int64) semestre,
Collaborator


String variables with UC descriptions and similar content are sometimes in ALL CAPS and sometimes in Title Case.
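If the team decides to standardize these, a small sketch of one option applied during cleaning; the particle list and the example column name are assumptions, and whether to normalize at all is a style decision rather than something this PR prescribes:

```python
# Sketch: normalize mixed ALL CAPS / Title Case text to one consistent style,
# keeping common Portuguese particles lowercase (heuristic, adjust as needed).
import pandas as pd

LOWERCASE_PARTICLES = {"da", "de", "do", "das", "dos", "e"}


def normalize_case(value: str) -> str:
    words = value.strip().split()
    out = []
    for i, word in enumerate(words):
        lower = word.lower()
        out.append(lower if i > 0 and lower in LOWERCASE_PARTICLES else lower.capitalize())
    return " ".join(out)


# Illustrative usage ("nome_uc" is a hypothetical column name):
# df["nome_uc"] = df["nome_uc"].map(normalize_case, na_action="ignore")
```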

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Shapefile sources: (ano, semestre) → polygon shapefile path ────────────
# Points-only files (shp_2024_1) are excluded; 2025 shapefiles not yet available.

@folhesgabriel folhesgabriel self-requested a review April 1, 2026 10:58
@mergify
Contributor

mergify Bot commented Apr 7, 2026

@rdahis this pull request has conflicts 😩

@mergify mergify Bot added the conflict label Apr 7, 2026