diff --git a/.claude/agents/onboarding.md b/.claude/agents/onboarding.md
new file mode 100644
index 000000000..1e6f74794
--- /dev/null
+++ b/.claude/agents/onboarding.md
@@ -0,0 +1,82 @@
+---
+name: onboarding
+description: Orchestrates the full dataset onboarding workflow for Data Basis — from raw data → clean data → BigQuery → metadata in the backend. Spawn this agent when the user asks to onboard a dataset.
+---
+
+# Data Basis Pipeline Agent
+
+This agent orchestrates the full dataset onboarding workflow for Data Basis: from raw data → clean data → BigQuery → metadata in the backend.
+
+## MCP server
+
+All Data Basis backend API calls go through the `mcp` MCP server. Always use the tools for backend operations — never write raw HTTP requests inline.
+
+## How to invoke
+
+The agent responds to natural-language onboarding requests. Canonical form:
+
+```text
+Onboard dataset <dataset_slug>. Raw files at <path>. Drive folder: BD/Dados/Conjuntos/<dataset_name>/.
+```
+
+Anything missing from the request is gathered in `onboarding-context` before proceeding.
+
+## Step sequence
+
+Work through these steps in order. Do not skip steps. Use the corresponding skill for each step.
+
+```text
+1. /onboarding-context gather raw source URLs, docs, org, license, coverage
+2. /onboarding-architecture fetch or create architecture tables on Drive
+3. /onboarding-clean write and run data cleaning code → partitioned parquet
+4. /onboarding-upload upload parquet to BigQuery (dev first)
+5. /onboarding-dbt write .sql and schema.yaml files
+6. /onboarding-dbt-run run DBT tests; fix or flag errors
+7. /onboarding-discover resolve all reference IDs from backend
+8. /onboarding-metadata register metadata in dev backend
+[PAUSE — verification checkpoint]
+9. /onboarding-metadata --env prod (only after human approval)
+10. /onboarding-pr open PR with changelog
+```
+
+## Verification checkpoint (between steps 8 and 9)
+
+After step 8 succeeds, output a verification checklist and wait for explicit approval before promoting to prod:
+
+```text
+✓ Dataset registered in dev: <dataset_slug>
+✓ Tables: <table slugs>
+✓ Columns: <column count per table>
+✓ Coverage: <temporal coverage>
+✓ Cloud tables: OK
+✓ Verify at: https://development.basedosdados.org/dataset/<dataset_id>
+
+Reply "approved" to promote to prod, or describe what needs fixing.
+```
+
+Do not proceed to step 9 or step 10 without the user replying "approved" (or equivalent).
+
+## Environment
+
+- Default: dev (`development.backend.basedosdados.org`, `basedosdados-dev` GCP project)
+- Prod: only after explicit user approval
+
+## Commit discipline
+
+Commit after each logical unit completes:
+- After cleaning code is verified
+- After DBT files are written
+- After metadata is registered in dev
+- After metadata is promoted to prod
+
+Use conventional commits: `feat(br_me_siconfi): add architecture tables and cleaning code`
+
+Never commit data files (parquet, CSV). Ensure `.gitignore` covers the output path.
+
+## Translation
+
+All descriptions and names must be provided in Portuguese, English, and Spanish. When only one language is available (typically Portuguese from architecture tables), translate to the other two using domain knowledge of Brazilian public administration and statistics. Apply consistent terminology.
+
+## Architecture table is the source of truth
+
+When there is a conflict between raw data column names, DBT file conventions, and the architecture table, the architecture table wins. Update the other artifacts to match.
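As a rough sketch of the environment contract and the approval gate described in the agent file above — illustrative only; the dev host and GCP projects are the values listed in this diff, the prod backend host is not stated here, and the constant and function names are hypothetical:

```python
# Illustrative sketch, not repository code. Dev values come from this diff;
# the prod backend host is left unset because it is not specified here.
ENVIRONMENTS = {
    "dev": {"backend": "development.backend.basedosdados.org", "gcp_project": "basedosdados-dev"},
    "prod": {"backend": None, "gcp_project": "basedosdados"},  # prod host: confirm before use
}

def may_promote_to_prod(user_reply: str) -> bool:
    """Steps 9-10 run only after the user explicitly approves the dev checklist."""
    return user_reply.strip().lower() == "approved"
```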
diff --git a/.claude/commands/onboarding-architecture.md b/.claude/commands/onboarding-architecture.md
new file mode 100644
index 000000000..236d58e84
--- /dev/null
+++ b/.claude/commands/onboarding-architecture.md
@@ -0,0 +1,70 @@
+---
+description: Fetch or create architecture tables for a Data Basis dataset
+argument-hint: [drive_folder_path]
+---
+
+Fetch or create architecture tables for a Data Basis dataset. Each table in the dataset gets one Google Sheets file.
+
+**Dataset:** $ARGUMENTS
+
+## Architecture table schema
+
+Each file has these columns (in order):
+
+| Column | Description |
+|--------|-------------|
+| `name` | BQ column name (snake_case) |
+| `bigquery_type` | BQ type: INT64, STRING, FLOAT64, DATE, etc. |
+| `description` | Description in Portuguese |
+| `temporal_coverage` | Empty = same as table; different = explicit, e.g. `2013(1)` or `2013(1)2022` |
+| `covered_by_dictionary` | `yes` / `no` |
+| `directory_column` | BD directories FK: `<dataset>.<table>:<column>`, e.g. `br_bd_diretorios_brasil.municipio:id_municipio` |
+| `measurement_unit` | e.g. `year`, `BRL`, `hectare` — blank if none |
+| `has_sensitive_data` | `yes` / `no` |
+| `observations` | Free text notes |
+| `original_name` | Column name in the raw source |
+
+**Temporal coverage notation:** `START(INTERVAL)END` — e.g. `2004(1)2022` = annual from 2004 to 2022. `2013(1)` = from 2013, ongoing. Empty = same as the table's coverage.
+
+## Step 1 — Check if architecture files already exist
+
+Use the `databasis-workspace` Google Drive MCP to check the Drive folder:
+`BD/Dados/Conjuntos/<dataset_name>/architecture/`
+
+If files exist: read and validate them. Report any missing required columns or schema mismatches.
+
+## Step 2 — Ask design questions (if creating new tables)
+
+Before creating architecture tables, ask the user:
+
+1. **Format:** Should tables be in long format? (Data Basis default: yes — each row = one observation)
+2. **Partition columns:** What columns partition the data? (default: `ano`, `sigla_uf`; Brasil-level: `ano` only)
+3. **Unit of observation:** What does one row represent? (e.g. "one municipality-year-account")
+4. **Categorical columns:** Which columns have a finite set of categories that need a dictionary?
+5. **Directory columns:** Which columns link to BD standard directories (municipalities, states, time)?
+
+## Step 3 — Infer or create schema
+
+If creating new tables:
+1. Read the first 20 rows of each raw data file
+2. Infer column names, types, and candidate partition columns
+3. Apply long-format transformation if the data is wide (one column per year → pivot to long)
+4. Map known standard columns to BD directories:
+   - `ano` → `br_bd_diretorios_data_tempo.ano:ano`
+   - `sigla_uf` → `br_bd_diretorios_brasil.uf:sigla_uf`
+   - `id_municipio` → `br_bd_diretorios_brasil.municipio:id_municipio`
+
+## Step 4 — Translate descriptions
+
+Architecture files have descriptions in Portuguese. Translate all descriptions to English and Spanish. The translation should be accurate and use the correct technical terminology for Brazilian public finance / the relevant domain.
+
+## Step 5 — Save to Drive
+
+Save each architecture table as a Google Sheet in:
+`BD/Dados/Conjuntos/<dataset_name>/architecture/<table_name>.xlsx`
+
+Use `mcp__databasis-workspace__create_spreadsheet` and `mcp__databasis-workspace__modify_sheet_values`.
+
+## Step 6 — Output
+
+Return a summary listing all tables found/created and the Drive URLs for each architecture file. Store these URLs — they are needed by `/onboarding-metadata`.
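As a rough illustration of the Step 3 inference above, a minimal pandas sketch — assuming a local CSV sample; the helper name, type mapping, and snake_case normalization are illustrative, not pipeline code:

```python
# Illustrative sketch of schema inference from the first 20 rows of a raw file.
import pandas as pd

BQ_TYPES = {"int64": "INT64", "Int64": "INT64", "float64": "FLOAT64",
            "bool": "BOOL", "datetime64[ns]": "DATE", "object": "STRING"}
DIRECTORY_COLUMNS = {
    "ano": "br_bd_diretorios_data_tempo.ano:ano",
    "sigla_uf": "br_bd_diretorios_brasil.uf:sigla_uf",
    "id_municipio": "br_bd_diretorios_brasil.municipio:id_municipio",
}

def draft_architecture(raw_path: str) -> pd.DataFrame:
    """Draft an architecture table (one row per column) from a raw CSV sample."""
    sample = pd.read_csv(raw_path, nrows=20)
    rows = []
    for original_name in sample.columns:
        name = original_name.strip().lower().replace(" ", "_")  # naive snake_case candidate
        rows.append({
            "name": name,
            "bigquery_type": BQ_TYPES.get(str(sample[original_name].dtype), "STRING"),
            "description": "",            # filled in Portuguese later
            "temporal_coverage": "",      # empty = same as table
            "covered_by_dictionary": "no",
            "directory_column": DIRECTORY_COLUMNS.get(name, ""),
            "measurement_unit": "",
            "has_sensitive_data": "no",
            "observations": "",
            "original_name": original_name,
        })
    return pd.DataFrame(rows)
```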
diff --git a/.claude/commands/onboarding-clean.md b/.claude/commands/onboarding-clean.md new file mode 100644 index 000000000..36ee2408e --- /dev/null +++ b/.claude/commands/onboarding-clean.md @@ -0,0 +1,81 @@ +--- +description: Write and run data cleaning code to produce partitioned parquet output +argument-hint: [output_path] +--- + +Write and run data cleaning code for a Data Basis dataset. + +**Dataset / paths:** $ARGUMENTS + +## Folder structure + +Work in a folder **external to the `pipelines/` repo**: + +```text +/ +├── input/ ← raw files (CSV, Excel, JSON, etc.) — do not modify +├── output/ +│ └── / +│ └── ano=/sigla_uf=/ (municipio/UF tables) +│ └── ano=/ (Brasil-level tables) +└── code/ + └── clean.py (one script per dataset if tables share raw source) + └── clean_
.py (one per table if they don't) +``` + +## Step 1 — Read architecture tables + +Read the architecture tables from Drive (URLs from `databasis-architecture` output or ask the user). These define: +- Exact column names and types +- Partition columns +- Which columns are categorical (need dictionary) +- Directory column mappings + +## Step 2 — Inspect raw data + +Read the first 20 rows of each raw file to understand structure. Check: +- File format (CSV, Excel, JSON, fixed-width, etc.) +- Encoding (UTF-8, ISO-8859-1, etc.) +- Column names and their mapping to architecture names +- Any header rows, footer rows, or skip rows +- Date formats + +## Step 3 — Write cleaning code + +Write Python cleaning code. Use pandas or polars — choose whichever fits the data better (polars for large files or complex transformations; pandas otherwise). + +Rules: +- **Start with a small subset** (1 year or the smallest available partition) before scaling +- Output column order must match architecture exactly +- Use `safe_cast` logic: coerce types with error handling, not hard casts +- For wide data: pivot to long format +- Partition outputs using `pyarrow` with the partition columns from the architecture +- Never modify input files + +Standard column types: +- INT64: `pd.to_numeric(col, errors='coerce').astype('Int64')` +- FLOAT64: `pd.to_numeric(col, errors='coerce')` +- STRING: `.astype(str).str.strip().replace('nan', pd.NA)` +- DATE: `pd.to_datetime(col, errors='coerce').dt.date` + +## Step 4 — Validate subset output + +After running on the subset: +1. Verify column names match architecture exactly +2. Verify types are correct +3. Check for unexpected nulls in primary key columns +4. Print row counts and a sample + +Only proceed to full data after subset is verified. Ask the user to confirm. + +## Step 5 — Dictionary table + +If any column has `covered_by_dictionary: yes`, create a `dicionario` table: +- Schema: `id_tabela | nome_coluna | chave | cobertura_temporal | valor` +- One row per (table, column, key) combination +- All tables and columns in a single dictionary file +- Output to: `output/dicionario/dicionario.csv` (not partitioned) + +## Step 6 — Scale to full data + +Run on all years/partitions and report final row counts per partition. diff --git a/.claude/commands/onboarding-context.md b/.claude/commands/onboarding-context.md new file mode 100644 index 000000000..d12b48fdc --- /dev/null +++ b/.claude/commands/onboarding-context.md @@ -0,0 +1,49 @@ +--- +description: Gather context for a Data Basis dataset onboarding (raw sources, docs, coverage, org) +argument-hint: [raw_data_path] [drive_folder_path] +--- + +Gather all context needed to onboard a dataset to Data Basis. Work through the following steps: + +**Dataset:** $ARGUMENTS + +## Step 1 — Search for official documentation + +Search online for the dataset's official documentation, raw data source URLs, and any existing Data Basis presence. Look for: +- Official government or institution page hosting the raw data +- Download URL or API endpoint +- Data dictionary or codebook +- License information (look for open data licenses: CC-BY, CC0, OGL, etc.) +- Update frequency (monthly, annual, etc.) +- Responsible organization + +Also read any local README, documentation, or metadata files present in the dataset folder if a path was provided. + +## Step 2 — Ask the user to confirm or supplement + +Present your findings and ask the user to confirm or fill in: +1. Raw data source URL(s) — where to download the raw files +2. 
Responsible organization slug on Data Basis (e.g. `ministerio-da-economia`) +3. License slug (e.g. `open-database`) +4. Update frequency and lag (e.g. annual, 1-year lag) +5. Geographic coverage (e.g. Brazil — municipalities, states, federal) +6. Temporal coverage (start year, end year or "ongoing") +7. Drive folder path for architecture tables: `BD/Dados/Conjuntos//` + +## Step 3 — Output a structured context block + +Once all information is gathered, output a context block in this format for use by downstream skills: + +```text +=== DATASET CONTEXT: === +Organization: +License: +Raw source URL(s): +Update frequency: +Update lag: +Coverage: , +Drive folder: +Notes: +``` + +Keep the context block in the conversation for subsequent skills to reference. diff --git a/.claude/commands/onboarding-dbt-run.md b/.claude/commands/onboarding-dbt-run.md new file mode 100644 index 000000000..f4d0293dc --- /dev/null +++ b/.claude/commands/onboarding-dbt-run.md @@ -0,0 +1,61 @@ +--- +description: Run DBT tests and materialize tables; fix or flag errors +argument-hint: [--tables table1,table2] [--target dev|prod] +--- + +Run DBT tests and materialize tables for a Data Basis dataset. + +**Dataset:** $ARGUMENTS + +Parse `--target` (default: dev) and `--tables` (default: all tables in the dataset) from arguments. + +## Step 1 — Ensure DBT environment + +Check for `/tmp/dbt_env/bin/dbt`. If missing, create it: +```bash +python -m venv /tmp/dbt_env +/tmp/dbt_env/bin/pip install dbt-bigquery +``` + +## Step 2 — Run models + +```bash +cd +/tmp/dbt_env/bin/dbt run \ + --select \ + --profiles-dir ~/.dbt \ + --target +``` + +Report which models compiled and ran successfully. + +## Step 3 — Run tests + +```bash +/tmp/dbt_env/bin/dbt test \ + --select \ + --profiles-dir ~/.dbt \ + --target +``` + +## Step 4 — Handle test failures + +For each failing test: + +1. **`not_null` on a nullable column**: Add the column to a `not_null_proportion_multiple_columns` block instead. Remove the strict `not_null` test. + +2. **`relationships` test fails** (FK violation): Report the failing values and ask the user whether to (a) remove the test, (b) relax to a proportion test, or (c) fix the data. + +3. **`not_null_proportion_multiple_columns` fails** (below 5%): The data may genuinely be sparse. Report the actual proportion and ask the user to confirm the threshold is acceptable. + +4. **Other failures**: Report the full test output and SQL. Propose a fix but ask before implementing. + +## Step 5 — Report + +Output a summary: +```text +Models: X passed, Y failed +Tests: X passed, Y failed +``` + +List any remaining failures with proposed next steps. diff --git a/.claude/commands/onboarding-dbt.md b/.claude/commands/onboarding-dbt.md new file mode 100644 index 000000000..ac7991dce --- /dev/null +++ b/.claude/commands/onboarding-dbt.md @@ -0,0 +1,99 @@ +--- +description: Write DBT .sql and schema.yaml files for a Data Basis dataset +argument-hint: [--tables table1,table2] +--- + +Write DBT SQL models and schema.yaml for a Data Basis dataset following `pipelines/` conventions. + +**Dataset:** $ARGUMENTS + +## Step 1 — Read conventions from a neighboring dataset + +Read 1–2 existing DBT files from a similar dataset as style reference. Pick a dataset with a similar structure (e.g. for a municipal finance dataset, read `br_cgu_orcamento_publico/`). 
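A minimal sketch of this step, assuming it runs from the `pipelines/` repo root; the reference dataset and helper name are illustrative:

```python
# Illustrative sketch — list a neighboring dataset's DBT files to read as a style reference.
from pathlib import Path

def reference_dbt_files(reference_dataset: str = "br_cgu_orcamento_publico",
                        models_root: str = "models") -> list[Path]:
    """Return the .sql models and schema.yml of a similar dataset."""
    ref_dir = Path(models_root) / reference_dataset
    return sorted(ref_dir.glob("*.sql")) + list(ref_dir.glob("schema.yml"))

for path in reference_dbt_files()[:2]:
    print(f"--- {path} ---")
    print(path.read_text()[:2000])  # skim the config block and first columns for conventions
```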
+
+## Step 2 — Write SQL model files
+
+One file per table: `models/<dataset_id>/<dataset_id>__<table_id>.sql`
+
+Template:
+```sql
+{{
+    config(
+        schema="<dataset_id>",
+        alias="<table_id>",
+        materialized="table",
+        partition_by={
+            "field": "<partition_column>",
+            "data_type": "<data_type>",
+            "range": {"start": <start_year>, "end": <end_year>, "interval": 1},
+        },
+    )
+}}
+
+select
+    safe_cast(<original_name> as <bigquery_type>) <column_name>,
+    safe_cast(<original_name> as <bigquery_type>) <column_name>,
+    ...
+from
+    {{ set_datalake_project("<dataset_id>_staging.<table_id>") }}
+    as t
+```
+
+Column order must match the architecture table exactly.
+
+## Step 3 — Write schema.yaml
+
+One file: `models/<dataset_id>/schema.yml`
+
+Template:
+```yaml
+---
+version: 2
+models:
+  - name: <dataset_id>__<table_id>
+    description: <PT description>
+    tests:
+      - not_null_proportion_multiple_columns:
+          at_least: 0.05
+    columns:
+      - name: <partition_or_primary_key_column>
+        description: <PT description>
+        tests: [not_null]  # for primary key / partition columns
+      - name: <directory_column>
+        description: ...
+        tests:
+          - relationships:
+              to: ref('<directory_dataset>__<directory_table>')
+              field: <directory_key_column>
+```
+
+Rules:
+- Add `not_null` test to partition columns and primary keys
+- Add `relationships` test to any column with a `directory_column` in the architecture
+- Add `not_null_proportion_multiple_columns` at 0.05 to every model
+- Use Portuguese descriptions from architecture
+
+## Step 4 — Check dbt_project.yml
+
+Verify the dataset has an entry in `dbt_project.yml`. If not, add:
+```yaml
+<dataset_id>:
+  +materialized: table
+  +schema: <dataset_id>
+```
+
+## Step 5 — Dictionary model (if needed)
+
+If a `dicionario` table exists, add its SQL model and schema entry. Pattern:
+```sql
+select
+    safe_cast(id_tabela as string) id_tabela,
+    safe_cast(nome_coluna as string) nome_coluna,
+    safe_cast(chave as string) chave,
+    safe_cast(cobertura_temporal as string) cobertura_temporal,
+    safe_cast(valor as string) valor
+from
+    {{ set_datalake_project("<dataset_id>_staging.dicionario") }}
+    as t
+```
diff --git a/.claude/commands/onboarding-discover.md b/.claude/commands/onboarding-discover.md
new file mode 100644
index 000000000..5dc774e1b
--- /dev/null
+++ b/.claude/commands/onboarding-discover.md
@@ -0,0 +1,76 @@
+---
+description: Resolve all reference IDs from the Data Basis backend needed for metadata creation
+argument-hint: [--env dev|prod]
+---
+
+Resolve all reference IDs from the Data Basis backend for a dataset.
+
+**Dataset:** $ARGUMENTS
+
+Parse `--env` (default: dev) from arguments.
+
+## Step 1 — Fetch all reference IDs
+
+Use the `discover_ids` MCP tool (env from argument):
+
+```text
+discover_ids(env=<env>)
+```
+
+This returns IDs for: status, bigquery_type, entity, area, license, availability, organization.
+
+## Step 2 — Fetch dataset state
+
+Use `get_dataset(slug=<dataset_slug>, env=<env>)` to get:
+- Dataset ID (None if it doesn't exist yet)
+- All existing tables and their IDs
+- Existing columns, observation levels, cloud tables, coverages, updates per table
+
+## Step 3 — Fetch authenticated account
+
+Use `get_authenticated_account(env=<env>)` to get the current user's ID. This will be used as `dataCleanedBy` and `publishedBy`.
+
+## Step 4 — Fetch raw data sources
+
+Use `get_raw_data_sources(dataset_slug=<dataset_slug>, env=<env>)` to find any registered raw data source IDs.
+
+## Step 5 — Output structured IDs block
+
+Output a block in this format:
+
+```text
+=== DISCOVERED IDs (env=<env>) ===
+
+Reference IDs:
+  status.em_processamento: <id>
+  status.editando: <id>
+  entity.year: <id>
+  entity.state: <id>
+  entity.municipality: <id>
+  entity.financing_phase: <id>
+  entity.financing_account: <id>
+  area.br: <id>
+  bigquery_type.INT64: <id>
+  ...
+
+Dataset:
+  id: <id or None>
+  slug: <dataset_slug>
+
+Tables (existing):
+  <table_slug>: <id>
+    columns: [<name>: <id>, ...]
+    observation_levels: [<entity>: <id>, ...]
+    cloud_tables: [<id>]
+    coverages: [<id> (area=<area>)]
+    updates: [<id> (entity=<entity>)]
+
+Account:
+  id: <id>
+  email: <email>
+
+Raw data sources:
+  <name>: <id>
+```
+
+Keep this block in the conversation — `/onboarding-metadata` needs all these IDs.
diff --git a/.claude/commands/onboarding-metadata.md b/.claude/commands/onboarding-metadata.md
new file mode 100644
index 000000000..8baaf1231
--- /dev/null
+++ b/.claude/commands/onboarding-metadata.md
@@ -0,0 +1,159 @@
+---
+description: Register or update all metadata in the Data Basis backend (dataset, tables, columns, OLs, coverage, updates)
+argument-hint: [--env dev|prod] [--dry-run] [--tables table1,table2]
+---
+
+Register or update all metadata for a dataset in the Data Basis backend.
+
+**Dataset:** $ARGUMENTS
+
+Parse `--env` (default: dev), `--dry-run`, and `--tables` from arguments.
+
+In dry-run mode: print all operations without executing them.
+
+## Prerequisites
+
+1. Run `/onboarding-discover` first (or have its output in context).
+2. Have architecture table URLs in context (from `/onboarding-architecture`).
+3. All IDs from the discover step must be available.
+
+## Dataset-level step (once, before per-table loop)
+
+### 0. Write dataset and table descriptions
+
+Before registering tables, draft descriptions in Portuguese, English, and Spanish for:
+1. The **dataset** itself
+2. Each **table** being registered
+
+Guidelines:
+- Be cogent and technical: state what the data contains, the source system, the entity level, and the time period.
+- Prefer direct copy-paste from raw data source documentation when it captures the content well — then translate.
+- Format: 1–3 sentences per entity. No bullet lists.
+- Pass descriptions to `create_update_dataset` and `create_update_table` via `description_pt/en/es`.
+
+Example dataset description (PT): "Dados contábeis e fiscais do setor público brasileiro publicados pelo SICONFI (Sistema de Informações Contábeis e Fiscais do Setor Público Brasileiro) do Ministério da Fazenda. Abrange receitas, despesas, balanços patrimoniais e variações patrimoniais de municípios, estados e governo federal a partir de 1989."
+
+---
+
+## Per-table steps
+
+For each table (in order):
+
+### 1. Create/update table record
+
+Call `create_update_table` with:
+- `slug`: table slug
+- `name_pt / name_en / name_es`: from architecture or context
+- `description_pt / en / es`: from step 0
+- `dataset_id`: from discover step (dataset must exist first — create if needed via `create_update_dataset`)
+- `status_id`: use `status.published` from discovered IDs
+- `published_by_ids`: authenticated account ID (from `get_authenticated_account`)
+- `data_cleaned_by_ids`: authenticated account ID (from `get_authenticated_account`)
+- `id`: existing table ID if updating
+
+Do **not** pass `raw_data_source_ids` here — see step 8.
+
+> **Known issue (M2M fields in `CreateUpdateTable`):** `raw_data_source_ids`, `published_by_ids`, and `data_cleaned_by_ids` are Django `ManyToManyField`s. A fix was applied to `perform_mutate` in `backend/custom/graphql_auto.py` (using `form.save(commit=False)` + `form.save_m2m()`) and confirmed working locally, but as of 2026-03-30 the behavior on the dev backend is inconsistent — mutations succeed without error but M2M assignments are sometimes silently dropped. Pass these fields and verify in the admin; if they don't appear, it is a backend deployment/environment issue rather than a code problem.
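For reference, the `form.save(commit=False)` + `form.save_m2m()` idiom the note points at is the standard Django ModelForm pattern — sketched below for illustration only; this is not the actual `backend/custom/graphql_auto.py` code:

```python
# Illustrative Django ModelForm pattern, not the actual backend code.
# save(commit=False) writes no M2M data, so save_m2m() must be called
# explicitly after the instance itself has been saved.
def perform_mutate_sketch(form):
    instance = form.save(commit=False)  # build the instance without touching the DB
    instance.save()                     # persist scalar and FK fields
    form.save_m2m()                     # persist M2M fields (published_by, data_cleaned_by, raw_data_sources)
    return instance
```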
+ +**Name translation:** If only Portuguese names are available, translate to English and Spanish: +- Translate accurately using domain knowledge of Brazilian public finance / the relevant domain +- Use the same terminology conventions as existing tables in the dataset + +### 2. Create/update observation levels + +For each entity in the table's observation level list (from architecture design or context): +- Call `create_update_observation_level` +- Look up entity_id from discover IDs using the entity slug +- Pass existing OL id if updating + +Track returned OL IDs — needed for column updates. + +### 3. Upload columns from architecture table + +Call `upload_columns_from_sheet` with: +- `table_id`: from step 1 +- `architecture_url`: Google Sheets URL for this table's architecture +- `observation_levels`: JSON string mapping column name → bare OL ID from step 2 + (e.g. `{"ano": "", "sigla_uf": "", "estagio": ""}`) +- `env`: current env + +This creates all columns with descriptions, BQ types, OL links, status, and directory primary keys in one call. Note: `directoryPrimaryKey` requires the referenced directory dataset to exist in the target environment — it will be silently skipped in dev if the directory dataset is absent. + +### 4. Update individual columns + +For each column in the architecture table, call `update_column` to set fields not handled by the sheet upload: + +- `column_id`: from the `upload_columns_from_sheet` response +- `column_name`: column name +- `table_id`: table ID +- `is_partition`: True for partition columns (e.g. `ano`, `sigla_uf`) +- `is_primary_key`: True for primary key columns from architecture +- `description_en` / `description_es`: translate from Portuguese if only PT available + +**Do not re-set** `observation_level_id`, `description_pt`, `measurement_unit`, `has_sensitive_data`, `covered_by_dictionary`, or `temporal_coverage` — those were already set in step 3. + +**Translation rule:** Auto-translate Portuguese descriptions to English and Spanish using domain knowledge of Brazilian public finance. Apply consistent terminology. + +### 5. Create/update cloud table + +Call `create_update_cloud_table` with: +- `table_id`: table ID +- `gcp_project_id`: `basedosdados-dev` (dev) or `basedosdados` (prod) +- `gcp_dataset_id`: dataset slug (e.g. `br_me_siconfi`) +- `gcp_table_id`: table slug +- `id`: existing cloud table ID if updating + +### 6. Create/update coverage and datetime range + +Call `create_update_coverage` with table_id and area_id for "br". + +Then call `create_update_datetime_range` with: +- `coverage_id`: from coverage step +- `start_year`: from context +- `end_year`: from context +- `interval`: 1 (annual) or appropriate value +- `is_closed`: False (unless the series has ended) +- `id`: existing DTR ID if updating + +Wrap in try/except — log failures but continue to next step. + +### 7. Create/update update record + +Call `create_update_update` with: +- `table_id`: table ID +- `entity_id`: year entity ID +- `frequency`: 1 +- `lag`: 1 +- `latest`: current datetime as ISO string (use Python `datetime.now().isoformat()`) +- `id`: existing update ID if updating + +Wrap in try/except — log failures but continue. + +### 8. Link raw data sources (deferred update) + +Call `create_update_table` again with the same `id` and `slug`, passing only: +- `raw_data_source_ids`: select from `get_raw_data_sources` using the table's temporal coverage and broader context (e.g. 
what the raw source actually contains, whether the table topic existed before 2013): + - `start_year >= 2013` → include only the post-2013 source + - `start_year < 2013` → include both sources (pre-2013 and post-2013) + - When in doubt, cross-check with the raw source descriptions and the dataset documentation +- `published_by_ids`: authenticated account ID (from `get_authenticated_account`) +- `data_cleaned_by_ids`: authenticated account ID (from `get_authenticated_account`) + +All other fields must be re-passed as well (the API requires them). This deferred call ensures the table record is fully persisted before the relationship is written. + +## Summary output + +After processing all tables, output: + +```text +=== METADATA REGISTRATION COMPLETE (env=) === + +Dataset: (id=) + +Tables: + ✓ — table, OLs, columns, cloud table, coverage, update + ✗ — FAILED: + +Next step: verify at https://development.basedosdados.org/dataset/ + then run `/databasis-metadata --env prod` to promote to prod. +``` diff --git a/.claude/commands/onboarding-pr.md b/.claude/commands/onboarding-pr.md new file mode 100644 index 000000000..76dd4298d --- /dev/null +++ b/.claude/commands/onboarding-pr.md @@ -0,0 +1,59 @@ +--- +description: Open a pull request for a Data Basis dataset onboarding +argument-hint: +--- + +Open a pull request on `basedosdados/pipelines` for a dataset onboarding. + +**Dataset:** $ARGUMENTS + +## Step 1 — Confirm all files are committed + +Run `git status` and `git diff`. If there are uncommitted changes, list them and ask the user whether to commit them first. + +Ensure data files are excluded (check `.gitignore` covers the output parquet path). + +## Step 2 — Draft changelog + +Draft a changelog that includes: +- Dataset slug and full name +- Tables added or updated (list each) +- Coverage years +- Data source and organization +- Any known limitations or open issues + +Present the draft to the user. **Do not open the PR until the user approves the changelog.** + +## Step 3 — Open the PR + +Once approved, open the PR: +```bash +gh pr create \ + --title "[$dataset_slug]
" \ + --body "" \ + --label "test-dev,table-approve,metadata-test" +``` + +PR body format: +```text +## Dataset +**Slug:** +**Tables:** +**Coverage:** +**Source:** + +## Changes + + +## Checklist +- [ ] DBT models run successfully in dev +- [ ] DBT tests pass +- [ ] Metadata registered in dev backend +- [ ] Verify at: https://development.basedosdados.org/dataset/ + +🤖 Generated with [Claude Code](https://claude.ai/claude-code) +``` + +## Step 4 — Return PR URL + +Return the PR URL to the user. diff --git a/.claude/commands/onboarding-upload.md b/.claude/commands/onboarding-upload.md new file mode 100644 index 000000000..2e800436b --- /dev/null +++ b/.claude/commands/onboarding-upload.md @@ -0,0 +1,52 @@ +--- +description: Upload cleaned parquet files to BigQuery via the basedosdados package +argument-hint: [--tables table1,table2] [--env dev|prod] +--- + +Upload cleaned parquet files to BigQuery for a Data Basis dataset. + +**Args:** $ARGUMENTS + +Parse `--env` (default: dev) and `--tables` (default: all tables) from arguments. + +## Prerequisites + +1. Verify `~/.basedosdados/config.toml` exists. If missing, fail with: + ``` + Missing ~/.basedosdados/config.toml — run `basedosdados config init` first. + ``` +2. Verify the output parquet files exist at the expected path. + +## GCP project mapping + +| env | GCP project | +|-----|-------------| +| dev | `basedosdados-dev` | +| prod | `basedosdados` | + +## Upload script + +Write a Python script at `/code/upload_to_db.py`: + +```python +import basedosdados as bd +import google.cloud.storage as gcs +from pathlib import Path +import argparse + +# Monkey-patch for requester-pays bucket +_orig_bucket = gcs.Client.bucket +def _patched_bucket(self, bucket_name, user_project=None): + return _orig_bucket(self, bucket_name, user_project=BILLING_PROJECT) +gcs.Client.bucket = _patched_bucket +``` + +Key steps per table: +1. Instantiate `bd.Table(table_id=, dataset_id=)` +2. Delete stale GCS staging prefix before upload (avoids BQ partition key conflicts) +3. Upload: `table.create(if_exists="replace")` then `table.upload(path, if_exists="replace")` +4. Report success with row count + +## Run + +Run the upload script and report success/failure per table. If any table fails, report the error and stop — do not continue to subsequent tables without user confirmation. diff --git a/AGENTS.md b/AGENTS.md index 550e51238..345476a86 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -235,6 +235,16 @@ dbt test --select models/ - Never bypass hooks with `--no-verify`. - Add type hints and docstrings for python functions following Google Style. +## Dataset Onboarding + +To onboard a new dataset (raw data → BigQuery → metadata), spawn the `onboarding` agent: + +```text +Onboard dataset . Raw files at . Drive folder: BD/Dados/Conjuntos//. +``` + +The agent will run the full 10-step sequence (context → architecture → clean → upload → dbt → tests → discover → metadata → prod → PR), pausing for human approval before promoting to production. + ## Key Rules for Agents 1. **Never hardcode credentials or secrets.** Use environment variables or Vault.