Merged
82 changes: 82 additions & 0 deletions .claude/agents/onboarding.md
---
name: onboarding
description: Orchestrates the full dataset onboarding workflow for Data Basis — from raw data → clean data → BigQuery → metadata in the backend. Spawn this agent when the user asks to onboard a dataset.
---

# Data Basis Pipeline Agent

This agent orchestrates the full dataset onboarding workflow for Data Basis: from raw data → clean data → BigQuery → metadata in the backend.

## MCP server

All Data Basis backend API calls go through the `mcp` MCP server. Always use the tools for backend operations — never write raw HTTP requests inline.

## How to invoke

The agent responds to natural-language onboarding requests. Canonical form:

```
Onboard dataset <slug>. Raw files at <path>. Drive folder: BD/Dados/Conjuntos/<slug>/.
```

Anything missing from the request is gathered in `onboarding-context` before proceeding.

## Step sequence

Work through these steps in order. Do not skip steps. Use the corresponding skill for each step.

```
1. /onboarding-context gather raw source URLs, docs, org, license, coverage
2. /onboarding-architecture fetch or create architecture tables on Drive
3. /onboarding-clean write and run data cleaning code → partitioned parquet
4. /onboarding-upload upload parquet to BigQuery (dev first)
5. /onboarding-dbt write .sql and schema.yaml files
6. /onboarding-dbt-run run DBT tests; fix or flag errors
7. /onboarding-discover resolve all reference IDs from backend
8. /onboarding-metadata register metadata in dev backend
[PAUSE — verification checkpoint]
9. /onboarding-metadata --env prod (only after human approval)
10. /onboarding-pr open PR with changelog
```

## Verification checkpoint (between steps 8 and 9)

After step 8 succeeds, output a verification checklist and wait for explicit approval before promoting to prod:

```
✓ Dataset registered in dev: <slug>
✓ Tables: <list>
✓ Columns: <counts>
✓ Coverage: <start>–<end>
✓ Cloud tables: OK
✓ Verify at: https://development.basedosdados.org/dataset/<slug>

Reply "approved" to promote to prod, or describe what needs fixing.
```

Do not proceed to step 9 or step 10 without the user replying "approved" (or equivalent).

## Environment

- Default: dev (`development.backend.basedosdados.org`, `basedosdados-dev` GCP project)
- Prod: only after explicit user approval

## Commit discipline

Commit after each logical unit completes:
- After cleaning code is verified
- After DBT files are written
- After metadata is registered in dev
- After metadata is promoted to prod

Use conventional commits: `feat(br_me_siconfi): add architecture tables and cleaning code`

Never commit data files (parquet, CSV). Ensure `.gitignore` covers the output path.

## Translation

All descriptions and names must be provided in Portuguese, English, and Spanish. When only one language is available (typically Portuguese from architecture tables), translate to the other two using domain knowledge of Brazilian public administration and statistics. Apply consistent terminology.

## Architecture table is the source of truth

When there is a conflict between raw data column names, DBT file conventions, and the architecture table, the architecture table wins. Update the other artifacts to match.
70 changes: 70 additions & 0 deletions .claude/commands/onboarding-architecture.md
---
description: Fetch or create architecture tables for a Data Basis dataset
argument-hint: <dataset_slug> [drive_folder_path]
---

Fetch or create architecture tables for a Data Basis dataset. Each table in the dataset gets one Google Sheets file.

**Dataset:** $ARGUMENTS

## Architecture table schema

Each file has these columns (in order):

| Column | Description |
|--------|-------------|
| `name` | BQ column name (snake_case) |
| `bigquery_type` | BQ type: INT64, STRING, FLOAT64, DATE, etc. |
| `description` | Description in Portuguese |
| `temporal_coverage` | Empty = same as table; different = explicit, e.g. `2013(1)` or `2013(1)2022` |
| `covered_by_dictionary` | `yes` / `no` |
| `directory_column` | BD directories FK: `<dataset>.<table>:<column>`, e.g. `br_bd_diretorios_brasil.municipio:id_municipio` |
| `measurement_unit` | e.g. `year`, `BRL`, `hectare` — blank if none |
| `has_sensitive_data` | `yes` / `no` |
| `observations` | Free text notes |
| `original_name` | Column name in the raw source |

**Temporal coverage notation:** `START(INTERVAL)END` — e.g. `2004(1)2022` = annual from 2004 to 2022. `2013(1)` = from 2013, ongoing. Empty = same as the table's coverage.
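
A minimal sketch of parsing this notation (the function name and return shape are illustrative, not part of the spec):

```python
import re

def parse_temporal_coverage(value: str):
    """Parse START(INTERVAL)END, e.g. '2004(1)2022' or '2013(1)'.

    Returns (start, interval, end); end is None for ongoing coverage,
    and the whole result is None for an empty value (inherit from table).
    """
    if not value:
        return None  # same as the table's coverage
    m = re.fullmatch(r"(\d{4})\((\d+)\)(\d{4})?", value)
    if not m:
        raise ValueError(f"invalid temporal coverage: {value!r}")
    start, interval, end = m.groups()
    return int(start), int(interval), int(end) if end else None
```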

## Step 1 — Check if architecture files already exist

Use the `databasis-workspace` Google Drive MCP to check the Drive folder:
`BD/Dados/Conjuntos/<dataset>/architecture/`

If files exist: read and validate them. Report any missing required columns or schema mismatches.

## Step 2 — Ask design questions (if creating new tables)

Before creating architecture tables, ask the user:

1. **Format:** Should tables be in long format? (Data Basis default: yes — each row = one observation)
2. **Partition columns:** What columns partition the data? (default: `ano`, `sigla_uf`; Brasil-level: `ano` only)
3. **Unit of observation:** What does one row represent? (e.g. "one municipality-year-account")
4. **Categorical columns:** Which columns have a finite set of categories that need a dictionary?
5. **Directory columns:** Which columns link to BD standard directories (municipalities, states, time)?

## Step 3 — Infer or create schema

If creating new tables:
1. Read the first 20 rows of each raw data file
2. Infer column names, types, and candidate partition columns
3. Apply long-format transformation if the data is wide (one column per year → pivot to long)
4. Map known standard columns to BD directories:
- `ano` → `br_bd_diretorios_data_tempo.ano:ano`
- `sigla_uf` → `br_bd_diretorios_brasil.uf:sigla_uf`
- `id_municipio` → `br_bd_diretorios_brasil.municipio:id_municipio`

## Step 4 — Translate descriptions

Architecture files have descriptions in Portuguese. Translate all descriptions to English and Spanish. Translations must be accurate and use the correct technical terminology for the dataset's domain (e.g. Brazilian public finance).

## Step 5 — Save to Drive

Save each architecture table as a Google Sheet in:
`BD/Dados/Conjuntos/<dataset>/architecture/<table_slug>.xlsx`

Use `mcp__databasis-workspace__create_spreadsheet` and `mcp__databasis-workspace__modify_sheet_values`.

## Step 6 — Output

Return a summary listing all tables found/created and the Drive URLs for each architecture file. Store these URLs — they are needed by `onboarding-metadata`.
81 changes: 81 additions & 0 deletions .claude/commands/onboarding-clean.md
---
description: Write and run data cleaning code to produce partitioned parquet output
argument-hint: <dataset_slug> <raw_data_path> [output_path]
---

Write and run data cleaning code for a Data Basis dataset.

**Dataset / paths:** $ARGUMENTS

## Folder structure

Work in a folder **external to the `pipelines/` repo**:

```
<dataset_root>/
├── input/                    ← raw files (CSV, Excel, JSON, etc.) — do not modify
├── output/
│   └── <table_slug>/
│       ├── ano=<year>/sigla_uf=<uf>/   (municipio/UF tables)
│       └── ano=<year>/                 (Brasil-level tables)
└── code/
    ├── clean.py              (one script per dataset if tables share raw source)
    └── clean_<table>.py      (one per table if they don't)
```

## Step 1 — Read architecture tables

Read the architecture tables from Drive (use the URLs from the `onboarding-architecture` output, or ask the user). These define:
- Exact column names and types
- Partition columns
- Which columns are categorical (need dictionary)
- Directory column mappings

## Step 2 — Inspect raw data

Read the first 20 rows of each raw file to understand structure. Check:
- File format (CSV, Excel, JSON, fixed-width, etc.)
- Encoding (UTF-8, ISO-8859-1, etc.)
- Column names and their mapping to architecture names
- Any header rows, footer rows, or skip rows
- Date formats

## Step 3 — Write cleaning code

Write Python cleaning code. Use pandas or polars — choose whichever fits the data better (polars for large files or complex transformations; pandas otherwise).

Rules:
- **Start with a small subset** (1 year or the smallest available partition) before scaling
- Output column order must match architecture exactly
- Use `safe_cast` logic: coerce types with error handling, not hard casts
- For wide data: pivot to long format
- Partition outputs using `pyarrow` with the partition columns from the architecture
- Never modify input files

Standard column types:
- INT64: `pd.to_numeric(col, errors='coerce').astype('Int64')`
- FLOAT64: `pd.to_numeric(col, errors='coerce')`
- STRING: `.astype(str).str.strip().replace('nan', pd.NA)`
- DATE: `pd.to_datetime(col, errors='coerce').dt.date`
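
The four coercions above can be wrapped into one `safe_cast`-style helper; this is a sketch, not an existing utility:

```python
import pandas as pd

def safe_cast(col: pd.Series, bigquery_type: str) -> pd.Series:
    """Coerce a column to its architecture type, turning bad values
    into NA instead of raising (the safe_cast rule above)."""
    if bigquery_type == "INT64":
        return pd.to_numeric(col, errors="coerce").astype("Int64")
    if bigquery_type == "FLOAT64":
        return pd.to_numeric(col, errors="coerce")
    if bigquery_type == "DATE":
        return pd.to_datetime(col, errors="coerce").dt.date
    # default: STRING
    return col.astype(str).str.strip().replace("nan", pd.NA)
```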

## Step 4 — Validate subset output

After running on the subset:
1. Verify column names match architecture exactly
2. Verify types are correct
3. Check for unexpected nulls in primary key columns
4. Print row counts and a sample

Only proceed to full data after subset is verified. Ask the user to confirm.
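
Checks 1-3 can be automated with a sketch like this (the `name` column and function signature are assumptions about how the architecture sheet is read into a DataFrame):

```python
import pandas as pd

def validate_subset(df: pd.DataFrame, architecture: pd.DataFrame, primary_keys: list[str]) -> list[str]:
    """Return a list of problems found; empty list means the subset passes."""
    problems = []
    # 1. Column names and order must match the architecture exactly
    expected = architecture["name"].tolist()
    if list(df.columns) != expected:
        problems.append(f"column order mismatch: {list(df.columns)} vs {expected}")
    # 3. No nulls allowed in primary key columns
    for pk in primary_keys:
        nulls = int(df[pk].isna().sum())
        if nulls:
            problems.append(f"{nulls} nulls in primary key column {pk!r}")
    return problems
```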

## Step 5 — Dictionary table

If any column has `covered_by_dictionary: yes`, create a `dicionario` table:
- Schema: `id_tabela | nome_coluna | chave | cobertura_temporal | valor`
- One row per (table, column, key) combination
- All tables and columns in a single dictionary file
- Output to: `output/dicionario/dicionario.csv` (not partitioned)
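
A sketch of building the dictionary skeleton from the categorical columns (names are illustrative; `cobertura_temporal` and `valor` are left for the curator to fill in):

```python
import pandas as pd

def build_dictionary(tables: dict[str, pd.DataFrame], categorical: dict[str, list[str]]) -> pd.DataFrame:
    """One row per (table, column, key); `categorical` maps table slug to
    the columns marked covered_by_dictionary: yes."""
    rows = []
    for table_slug, df in tables.items():
        for col in categorical.get(table_slug, []):
            for key in sorted(df[col].dropna().unique()):
                rows.append({
                    "id_tabela": table_slug,
                    "nome_coluna": col,
                    "chave": key,
                    "cobertura_temporal": "",
                    "valor": "",
                })
    return pd.DataFrame(rows, columns=["id_tabela", "nome_coluna", "chave", "cobertura_temporal", "valor"])
```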

## Step 6 — Scale to full data

Run on all years/partitions and report final row counts per partition.
49 changes: 49 additions & 0 deletions .claude/commands/onboarding-context.md
---
description: Gather context for a Data Basis dataset onboarding (raw sources, docs, coverage, org)
argument-hint: <dataset_slug> [raw_data_path] [drive_folder_path]
---

Gather all context needed to onboard a dataset to Data Basis. Work through the following steps:

**Dataset:** $ARGUMENTS

## Step 1 — Search for official documentation

Search online for the dataset's official documentation, raw data source URLs, and any existing Data Basis presence. Look for:
- Official government or institution page hosting the raw data
- Download URL or API endpoint
- Data dictionary or codebook
- License information (look for open data licenses: CC-BY, CC0, OGL, etc.)
- Update frequency (monthly, annual, etc.)
- Responsible organization

Also read any local README, documentation, or metadata files present in the dataset folder if a path was provided.

## Step 2 — Ask the user to confirm or supplement

Present your findings and ask the user to confirm or fill in:
1. Raw data source URL(s) — where to download the raw files
2. Responsible organization slug on Data Basis (e.g. `ministerio-da-economia`)
3. License slug (e.g. `open-database`)
4. Update frequency and lag (e.g. annual, 1-year lag)
5. Geographic coverage (e.g. Brazil — municipalities, states, federal)
6. Temporal coverage (start year, end year or "ongoing")
7. Drive folder path for architecture tables: `BD/Dados/Conjuntos/<dataset>/`

## Step 3 — Output a structured context block

Once all information is gathered, output a context block in this format for use by downstream skills:

```
=== DATASET CONTEXT: <slug> ===
Organization: <org_slug>
License: <license_slug>
Raw source URL(s): <url(s)>
Update frequency: <e.g. annual>
Update lag: <e.g. 1 year>
Coverage: <start_year>–<end_year or "present">, <geography>
Drive folder: <path>
Notes: <any other relevant info>
```

Keep the context block in the conversation for subsequent skills to reference.
61 changes: 61 additions & 0 deletions .claude/commands/onboarding-dbt-run.md
---
description: Run DBT tests and materialize tables; fix or flag errors
argument-hint: <dataset_slug> [--tables table1,table2] [--target dev|prod]
---

Run DBT tests and materialize tables for a Data Basis dataset.

**Dataset:** $ARGUMENTS

Parse `--target` (default: dev) and `--tables` (default: all tables in the dataset) from arguments.

## Step 1 — Ensure DBT environment

Check for `/tmp/dbt_env/bin/dbt`. If missing, create it:
```bash
~/.pyenv/versions/3.11.6/bin/python -m venv /tmp/dbt_env
/tmp/dbt_env/bin/pip install dbt-bigquery
```

## Step 2 — Run models

```bash
cd <pipelines_root>
/tmp/dbt_env/bin/dbt run \
  --select <dataset_slug> \
  --profiles-dir <pipelines_root> \
  --target <target>
```

Report which models compiled and ran successfully.

## Step 3 — Run tests

```bash
/tmp/dbt_env/bin/dbt test \
  --select <dataset_slug> \
  --profiles-dir <pipelines_root> \
  --target <target>
```

## Step 4 — Handle test failures

For each failing test:

1. **`not_null` on a nullable column**: Add the column to a `not_null_proportion_multiple_columns` block instead. Remove the strict `not_null` test.

2. **`relationships` test fails** (FK violation): Report the failing values and ask the user whether to (a) remove the test, (b) relax to a proportion test, or (c) fix the data.

3. **`not_null_proportion_multiple_columns` fails** (below 5%): The data may genuinely be sparse. Report the actual proportion and ask the user to confirm the threshold is acceptable.

4. **Other failures**: Report the full test output and SQL. Propose a fix but ask before implementing.

## Step 5 — Report

Output a summary:
```
Models: X passed, Y failed
Tests: X passed, Y failed
```

List any remaining failures with proposed next steps.