[Data] br_mma_cnuc — Cadastro Nacional de Unidades de Conservação#1488
Conversation
Adds br_mma_cnuc.unidades_conservacao: 13 biannual snapshots (2018–2026), 38,065 rows, partitioned by ano+semestre, with WKT geometry (GEOGRAPHY) for 2024-S2. Source: MMA/CNUC API.
📝 Walkthrough

Adds a complete ingest pipeline for Brazil CNUC conservation units: a Python cleaner that normalizes CSV snapshots and merges geometries into partitioned Parquet, a dbt model that casts types and converts WKT to GEOGRAPHY, schema/tests, and a small upload script to push Parquet to basedosdados.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CSV as CSV Snapshots
    participant Clean as clean.py
    participant Shp as Shapefile<br/>(geometry)
    participant Parquet as Parquet<br/>Partitions
    participant dbt as dbt Model
    participant Upload as upload.py
    participant BDD as basedosdados
    CSV->>Clean: read cnuc_*.csv<br/>(utf-8-sig → latin1 fallback)
    Clean->>Clean: normalize columns,<br/>coerce types, attach metadata
    Shp->>Clean: provide WKT geometry<br/>(join by codigo_uc / cd_cnuc)
    Clean->>Parquet: write partitioned<br/>Parquet (ano/semestre)
    Parquet->>dbt: staging select
    dbt->>dbt: safe_cast fields<br/>st_geogfromtext(..., make_valid => true)
    dbt->>Upload: materialized table ready
    Upload->>BDD: upload/replace table<br/>with Parquet data
```
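The utf-8-sig → latin1 fallback in the first step of the diagram can be sketched with the standard library alone. This is an illustrative reader, not the actual `read_csv` in clean.py (which returns a pandas DataFrame with all columns as strings):

```python
import csv
import io

def read_rows(raw: bytes) -> list[dict[str, str]]:
    """Decode CSV bytes, trying UTF-8 (with BOM) first, then Latin-1."""
    for enc in ("utf-8-sig", "latin1"):
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next encoding
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError("file could not be decoded with any known encoding")

# Latin-1 bytes like 'ç' (0xE7) are invalid UTF-8, so the fallback kicks in.
sample = "codigo_uc,nome\n123,Estação Ecológica\n".encode("latin1")
rows = read_rows(sample)
print(rows[0]["nome"])  # → Estação Ecológica
```

`utf-8-sig` also handles BOM-free UTF-8 input, so trying it first is safe for both variants.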
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 1
🧹 Nitpick comments (5)
models/br_mma_cnuc/code/clean.py (3)
231-239: Type annotation mismatch for geometry lookup dictionaries.

There are inconsistencies in the type hints for geometry-related dictionaries:

- `load_geometry` returns `dict[str, str]` (lines 237-238: keys are `cd_cnuc` strings)
- `clean_file` parameter `geo_lookup` is typed as `dict[int, str]` (line 242) but should be `dict[str, str]`
- `geo_cache` is typed as `dict[tuple[int, int], dict[int, str]]` (line 359) but the inner dict should be `dict[str, str]`

🔧 Proposed fix for type annotations

```diff
-def clean_file(path: Path, geo_lookup: dict[int, str] | None) -> pd.DataFrame:
+def clean_file(path: Path, geo_lookup: dict[str, str] | None) -> pd.DataFrame:

-    geo_cache: dict[tuple[int, int], dict[int, str]] = {}
+    geo_cache: dict[tuple[int, int], dict[str, str]] = {}
```

Also applies to: 242, 359
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/br_mma_cnuc/code/clean.py` around lines 231 - 239, The type hints for geometry lookup dicts are inconsistent: update the annotations so they match the actual keys returned by load_geometry (strings). Change clean_file's geo_lookup parameter from dict[int, str] to dict[str, str], update geo_cache's inner dict type from dict[int, str] to dict[str, str], and ensure load_geometry stays declared as dict[str, str]; adjust any references to these symbols (load_geometry, clean_file, geo_cache) accordingly so the key types are consistent across the module.
270-283: Integer conversion could be simplified.

The integer conversion logic is a bit convoluted. Consider using pandas' built-in nullable integer conversion more directly.
♻️ Simplified integer conversion
```diff
 # Type casts — int
 for col in INT_COLS:
     if col in df.columns:
-        cleaned = (
-            df[col]
-            .astype(str)
-            .str.replace(".", "", regex=False)
-            .str.strip()
-        )
-        s = pd.to_numeric(cleaned, errors="coerce")
-        mask = s.isna()
-        arr = s.fillna(0).astype(int).astype("Int64")
-        arr[mask] = pd.NA
-        df[col] = arr
+        df[col] = pd.to_numeric(
+            df[col]
+            .astype(str)
+            .str.replace(".", "", regex=False)
+            .str.strip(),
+            errors="coerce",
+        ).astype("Int64")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/br_mma_cnuc/code/clean.py` around lines 270 - 283, The integer conversion loop for INT_COLS is overly complex; replace the mask/fillna/astype dance by converting the cleaned string series directly to pandas nullable integers: for each col in INT_COLS (when present in df) produce cleaned = df[col].astype(str).str.replace(".", "", regex=False).str.strip(), then set df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64") so NaNs become pd.NA via the nullable Int64 dtype and you can remove the manual mask/fill steps.
205-209: Missing docstrings for several functions.

Per coding guidelines, Python functions should have docstrings following Google Style. The following functions are missing docstrings: `parse_filename`, `read_csv`, `clean_string`, `clean_file`, `write_partition`, and `main`.

📝 Example docstring additions

```diff
 def parse_filename(path: Path) -> tuple[int, int]:
+    """Extract ano and semestre from CSV filename.
+
+    Args:
+        path: Path to the CSV file with pattern cnuc_YYYY_S.csv.
+
+    Returns:
+        Tuple of (ano, semestre).
+
+    Raises:
+        ValueError: If filename doesn't match expected pattern.
+    """
     m = re.search(r"cnuc_(\d{4})_(\d)", path.name)

 def read_csv(path: Path) -> pd.DataFrame:
+    """Read CSV with fallback encodings (utf-8-sig, latin1).
+
+    Args:
+        path: Path to the CSV file.
+
+    Returns:
+        DataFrame with all columns as strings.
+
+    Raises:
+        ValueError: If file cannot be decoded with any encoding.
+    """
     for enc in ("utf-8-sig", "latin1"):
```

As per coding guidelines: "Add type hints and docstrings for Python functions following Google Style".
Also applies to: 212-220, 223-228, 242-322, 345-350, 353-380
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/br_mma_cnuc/code/clean.py` around lines 205 - 209, Add Google-style docstrings to each missing function (parse_filename, read_csv, clean_string, clean_file, write_partition, main and any others in the ranges noted) describing purpose, Args with types, Returns with types, and Raises (e.g., ValueError for parse_filename) where applicable; ensure the docstrings match existing type hints and include brief examples or notes if behavior is non-obvious (e.g., regex format in parse_filename, expected CSV encoding/columns in read_csv, normalization rules in clean_string, file I/O and partitioning behavior in write_partition/clean_file, and command-line/entry semantics for main).

models/br_mma_cnuc/code/upload.py (1)
1-16: Missing module docstring and `if __name__ == "__main__"` guard.

The script lacks documentation and executes at import time, which can cause issues when the module is imported elsewhere.
♻️ Proposed structure with guard and docstring
+""" +Upload cleaned CNUC parquet data to basedosdados staging. + +Usage: + python upload.py +""" + +from pathlib import Path + import basedosdados as bd DATASET_ID = "br_mma_cnuc" TABLE_ID = "unidades_conservacao" BILLING_PROJECT = "basedosdados-dev" -tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID) +ROOT = Path(__file__).resolve().parent.parent +OUTPUT_DIR = ROOT / "output" / "unidades_conservacao" -path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao" -tb.create( - path=path_to_data, - if_storage_data_exists="replace", - if_table_exists="replace", - source_format="parquet", -) +def main() -> None: + """Upload parquet data to GCS and create BigQuery table.""" + tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID) + tb.create( + path=str(OUTPUT_DIR), + if_storage_data_exists="replace", + if_table_exists="replace", + source_format="parquet", + ) + + +if __name__ == "__main__": + main()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/br_mma_cnuc/code/upload.py` around lines 1 - 16, The module executes table upload at import-time and lacks documentation; add a top-level module docstring describing its purpose, and move the creation logic into a guarded main block: wrap the path_to_data, tb = bd.Table(...) initialization and the tb.create(...) call inside an if __name__ == "__main__": block (or a main() function invoked by that guard), so importing this module won't trigger immediate execution; keep constants DATASET_ID, TABLE_ID, BILLING_PROJECT at module scope and reference them from the main function.

models/br_mma_cnuc/schema.yaml (1)
10-21: Consider scoping tests to most recent data using `__most_recent_year__`.

For large datasets with historical partitions, the uniqueness test could be scoped to recent data to improve test performance. Based on learnings, you can use `__most_recent_year__` in a `where` config.
__most_recent_year__in awhereconfig.📝 Optional: Scope uniqueness test to recent year
```yaml
tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [ano, semestre, codigo_uc]
      config:
        where: __most_recent_year__
```

Based on learnings: "Use most_recent_year keyword in dbt test where config to scope tests to most recent rows (uses ano column)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/br_mma_cnuc/schema.yaml` around lines 10 - 21, Scope the uniqueness test to the most recent year to improve performance: update the dbt test using dbt_utils.unique_combination_of_columns (the test that uses combination_of_columns: [ano, semestre, codigo_uc]) to add a config with a where clause set to __most_recent_year__; keep the same combination_of_columns and ensure the where uses the ano-based keyword so the uniqueness check only runs against the most recent partition.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 411fa375-7570-4fa0-8c2c-5b9b9e2d9035
📒 Files selected for processing (4)

- models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
- models/br_mma_cnuc/code/clean.py
- models/br_mma_cnuc/code/upload.py
- models/br_mma_cnuc/schema.yaml
```python
tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)

path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
```
Hardcoded local path will break on other machines.
The path /Users/rdahis/Downloads/CNUC/output/unidades_conservacao is specific to the author's machine. This script will fail for any other developer or CI environment.
Consider using a relative path from the script location (consistent with clean.py), or accept the path as a CLI argument.
🔧 Proposed fix using relative path
```diff
+from pathlib import Path
+
 import basedosdados as bd

 DATASET_ID = "br_mma_cnuc"
 TABLE_ID = "unidades_conservacao"
 BILLING_PROJECT = "basedosdados-dev"

+ROOT = Path(__file__).resolve().parent.parent
+OUTPUT_DIR = ROOT / "output" / "unidades_conservacao"
+
 tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)

-path_to_data = "/Users/rdahis/Downloads/CNUC/output/unidades_conservacao"
+path_to_data = str(OUTPUT_DIR)
 tb.create(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/br_mma_cnuc/code/upload.py` at line 9, The hardcoded path assigned to
path_to_data in upload.py will break on other machines; change it to derive a
path relative to the script (use pathlib.Path(__file__).resolve().parent /
"output/unidades_conservacao") or accept a CLI argument (use argparse to parse a
--data-path) and default to the relative path for parity with clean.py; update
any references to path_to_data accordingly so the script works on CI/other
developer machines.
- clean: code lives in pipelines/models/&lt;dataset&gt;/code/ from the start; add explicit pyarrow schema, encoding strategy, Brazilian number format, geometry/shapefile join key guidance, and post-write schema validation step
- dbt: add geometry/GEOGRAPHY pattern with make_valid, YAML block scalar rule, ignore_values parameter documentation for not_null_proportion test, and uniqueness test guidance (prefer stable string keys over nullable int IDs)
- discover: mandate discover_ids for all reference lookups; add theme to output block; remove license from example block
Actionable comments posted: 1
🧹 Nitpick comments (3)
.claude/commands/onboarding-dbt.md (1)
70-71: Clarify the uniqueness test template for multi-column partitions.

The template shows:

```yaml
combination_of_columns: [<partition_col>, <primary_key_col>]
```

But many tables (including `br_mma_cnuc`) use multiple partition columns (e.g., `ano` + `semestre`). The singular placeholder `<partition_col>` might lead developers to include only one partition column.

📝 Proposed clarification

```diff
 - dbt_utils.unique_combination_of_columns:
-    combination_of_columns: [<partition_col>, <primary_key_col>]
+    combination_of_columns: [<partition_col_1>, <partition_col_2>, ..., <primary_key_col>]
+    # Include ALL partition columns plus the primary key
```

Or use a concrete example:

```yaml
# Example for a table partitioned by ano + semestre with primary key codigo_uc:
combination_of_columns: [ano, semestre, codigo_uc]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/commands/onboarding-dbt.md around lines 70 - 71, Update the dbt uniqueness test template (dbt_utils.unique_combination_of_columns) to make it explicit that combination_of_columns can include multiple partition columns and the primary key, not just a single <partition_col> placeholder; change the placeholder text and/or add an inline concrete example (e.g., for a table partitioned by ano + semestre with primary key codigo_uc show combination_of_columns: [ano, semestre, codigo_uc]) so developers know to list all partition columns followed by the primary key.

.claude/commands/onboarding-clean.md (2)
99-109: Strengthen the join-key validation guidance.

The current guidance says "Verify the join key" but doesn't specify how to verify or what action to take when keys mismatch. The actual implementation in `models/br_mma_cnuc/code/clean.py:314-321` uses `.map()` which silently produces `NA` for unmatched keys without raising an error.

Consider adding explicit verification steps:

```python
# After geometry merge
missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()]
if len(missing_geo) > 0:
    logger.warning(f"{len(missing_geo)} rows missing geometry after merge")
    # Optionally: print sample IDs or raise if coverage is too low
```

📋 Proposed documentation enhancement

```diff
 - **Verify the join key** between the shapefile and tabular data — shapefile IDs and tabular IDs are often different systems (e.g. `cd_cnuc` vs `id_uc`). Inspect both before joining.
+- After the merge, assert that geometry coverage meets expectations:
+  missing = df[df["geometria"].isna() & df["<id_col>"].notna()]
+  assert len(missing) / len(df) < 0.01, "more than 1% of rows missing geometry"
 - In the DBT model, cast with `ST_GEOGFROMTEXT(col, make_valid => true)` and type the column as GEOGRAPHY, not STRING.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/commands/onboarding-clean.md around lines 99 - 109, The merge that populates the geometry in models/br_mma_cnuc/code/clean.py currently uses .map() (around the block using .map() at lines ~314-321) which silently yields NA for unmatched keys; after that assignment add explicit validation: compute missing_geo = df[df["geometria"].isna() & df["codigo_uc"].notna()], log a warning with len(missing_geo) and a small sample of codigo_uc values (e.g. missing_geo["codigo_uc"].unique()[:10]), and either assert a coverage threshold (e.g. len(missing_geo)/len(df) < 0.01) or raise if too many are missing; additionally consider switching the .map() step to an explicit merge/join to make mismatches clearer and preserve unmatched-key diagnostics.
113-117: Add geometry-specific validation to Step 4.

The validation checklist should explicitly mention checking geometry column completeness, especially since the join-key mapping (lines 104-106) can silently produce `NA` values.

📋 Proposed addition

```diff
 After running on the subset:
 1. Check the parquet schema with `pq.read_schema(path)` — verify all column types match the architecture before uploading.
 2. Verify column names match architecture exactly.
 3. Check for unexpected nulls in primary key columns.
+4. If geometry is present, verify coverage: print % of rows with non-null geometry.
-4. Print row counts and a sample.
+5. Print row counts and a sample.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/commands/onboarding-clean.md around lines 113 - 117, Update the onboarding checklist by extending Step 4 to include geometry-specific validation: after printing row counts and a sample, explicitly verify the geometry column(s) for completeness (no NA/null values), correct type(s) and CRS, and validity (no corrupt/empty geometries) to catch silent NA results from the join-key mapping; refer to the geometry column name(s) used by the join-key mapping and add a short guideline to fail or flag the upload if any geometry rows are missing or invalid.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4249d7d5-bafa-421a-a783-f4b08c7cdd70
📒 Files selected for processing (3)

- .claude/commands/onboarding-clean.md
- .claude/commands/onboarding-dbt.md
- .claude/commands/onboarding-discover.md
> This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.
>
> **Never search the web, hardcode IDs, or guess slugs.** All reference IDs (themes, organizations, licenses, tags, entities, statuses) must come from `discover_ids` or `lookup_area`. IDs differ between dev and prod environments.
Documentation inconsistency: "area" and "tags" not listed in discover_ids output.
Line 20 lists what discover_ids returns but omits "area" and "tags," yet:
- Line 23 mentions "tags" as a reference ID type
- Line 54 in the example shows `area.br`

Since line 24 clarifies that lookup_area is a separate tool for areas, consider either:

- Adding "area" and "tags" to line 20 if `discover_ids` returns them, OR
- Clarifying in the example (around line 54) that `area.br` comes from `lookup_area` (Step 1 mentions only `discover_ids`)

📝 Suggested clarification

Option 1: If discover_ids does return area and tags, update line 20:

```diff
-This returns IDs for: status, bigquery_type, entity, license, availability, organization, theme.
+This returns IDs for: status, bigquery_type, entity, area, license, availability, organization, theme, tags.
```

Option 2: If areas come from lookup_area, clarify in the example section by adding a comment or separate subsection showing the lookup_area call result.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/commands/onboarding-discover.md around lines 20 - 24, The docs list
what discover_ids returns but omit "area" and "tags" while the example uses
area.br and mentions tags; update the documentation so it's unambiguous: either
add "area" and "tags" to the discover_ids return list if discover_ids actually
returns them, or explicitly state in the example that area.br (and any
area-related values) come from lookup_area (and show lookup_area usage) and that
tags are obtained via discover_ids or another lookup; reference the
functions/values discover_ids, lookup_area, area.br, and tags when making the
clarification.
```sql
safe_cast(razao_diferenca_area as float64) razao_diferenca_area,
safe_cast(data_publicacao_cnuc as date) data_publicacao_cnuc,
safe_cast(data_ultima_certificacao as date) data_ultima_certificacao,
st_geogfromtext(safe_cast(geometria as string), make_valid => true) geometria,
```
`geometria` should be typed as GEOGRAPHY. Updating the style manual and integrating this with the MCP would be a good next step.
Another point: to ensure quality, it would be good to validate the geometries against an approximate BBOX of Brazil. The goal is to check whether the polygons fall within Brazilian territory.
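A minimal stdlib-only sketch of that check. The bounding-box numbers are approximate assumptions, not an official extent, and a real validation would use GeoPandas/shapely bounds or BigQuery `ST_*` functions instead of a regex over WKT:

```python
import re

# Approximate WGS84 bounding box of Brazil (lon/lat); illustrative values only.
BR_BBOX = (-74.1, -34.0, -28.8, 5.3)  # min_lon, min_lat, max_lon, max_lat

def coords_from_wkt(geometry_wkt: str) -> list[tuple[float, float]]:
    """Extract (lon, lat) pairs from a WKT string with a simple regex."""
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", geometry_wkt)
    return [(float(lon), float(lat)) for lon, lat in pairs]

def within_brazil_bbox(geometry_wkt: str) -> bool:
    """True if every vertex of the geometry falls inside Brazil's bbox."""
    coords = coords_from_wkt(geometry_wkt)
    if not coords:
        return False
    min_lon, min_lat, max_lon, max_lat = BR_BBOX
    return all(
        min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
        for lon, lat in coords
    )

# A point near Brasília passes; one in Europe fails.
print(within_brazil_bbox("POINT (-47.9 -15.8)"))  # → True
print(within_brazil_bbox("POINT (2.35 48.85)"))   # → False
```

Flagging rows that fail this check (rather than dropping them) would surface CRS mistakes or corrupted polygons before upload.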
Of the ~36k rows, only 2,927 have non-null geometries. That seems odd to me.
There are rows with null geometries whose other columns still carry area values in hectares.
```sql
safe_cast(esfera_administrativa as string) esfera_administrativa,
safe_cast(categoria_manejo as string) categoria_manejo,
safe_cast(categoria_iucn as string) categoria_iucn,
safe_cast(grupo as string) grupo,
```
The same happens with the `orgao_gestor` and `informacoes_gerais` variables.
This behavior occurs across several columns.
```sql
select
    safe_cast(ano as int64) ano,
    safe_cast(semestre as int64) semestre,
```
String variables with descriptions of UCs and the like are sometimes in ALL CAPS and sometimes in Title Case.
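One way to normalize that mixed casing, sketched in pure Python. The connective list and the title-case rule are assumptions for illustration, not an agreed convention for this dataset; in the cleaner it would be applied with `df[col].map(normalize_case)`:

```python
def normalize_case(text: str) -> str:
    """Lowercase, then title-case each word, keeping short Portuguese
    connectives (de, da, do, ...) lowercase after the first word."""
    small = {"de", "da", "do", "das", "dos", "e"}
    words = text.lower().split()
    return " ".join(
        w if (i > 0 and w in small) else w.capitalize()
        for i, w in enumerate(words)
    )

print(normalize_case("PARQUE NACIONAL"))             # → Parque Nacional
print(normalize_case("área de proteção ambiental"))  # → Área de Proteção Ambiental
```

Whichever rule is chosen, applying it in clean.py keeps the dbt model a plain `safe_cast` and makes the output deterministic across snapshots.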
```python
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Shapefile sources: (ano, semestre) → polygon shapefile path ────────────
# Points-only files (shp_2024_1) are excluded; 2025 shapefiles not yet available.
```
Dahis, data through 2026 is available: https://dados.gov.br/dados/conjuntos-dados/unidadesdeconservacao
@rdahis this pull request has conflicts 😩

Description

Adds the `br_mma_cnuc` dataset with the `unidades_conservacao` table.

- Partitioned by `ano` + `semestre`
- `geometria` column of type GEOGRAPHY (WKT → ST_GEOGFROMTEXT), available for 2024-S2 (2,927 polygons)

Files

- models/br_mma_cnuc/br_mma_cnuc__unidades_conservacao.sql
- models/br_mma_cnuc/schema.yaml
- models/br_mma_cnuc/code/clean.py: CSV cleaning and shapefile merge
- models/br_mma_cnuc/code/upload.py: upload to GCS/BQ staging

Tests

- `dbt run` and `dbt test` passing (PASS=4, WARN=0, ERROR=0)
- Uniqueness on `(ano, semestre, codigo_uc)`
- `not_null_proportion >= 0.05` (excluding 6 columns that are empty at the source)
dbt runedbt testpassando (PASS=4, WARN=0, ERROR=0)(ano, semestre, codigo_uc)not_null_proportion >= 0.05(excluindo 6 colunas vazias na fonte)Summary by CodeRabbit