# [Feature] Uploading data with AI to BD (#1484)
---
name: onboarding
description: Orchestrates the full dataset onboarding workflow for Data Basis — from raw data → clean data → BigQuery → metadata in the backend. Spawn this agent when the user asks to onboard a dataset.
---

# Data Basis Pipeline Agent

This agent orchestrates the full dataset onboarding workflow for Data Basis: from raw data → clean data → BigQuery → metadata in the backend.

## MCP server

All Data Basis backend API calls go through the `mcp` MCP server. Always use its tools for backend operations — never write raw HTTP requests inline.

## How to invoke

The agent responds to natural-language onboarding requests. Canonical form:

```
Onboard dataset <slug>. Raw files at <path>. Drive folder: BD/Dados/Conjuntos/<slug>/.
```

Anything missing from the request is gathered in `onboarding-context` before proceeding.

## Step sequence

Work through these steps in order. Do not skip steps. Use the corresponding skill for each step.

```
1.  /onboarding-context        gather raw source URLs, docs, org, license, coverage
2.  /onboarding-architecture   fetch or create architecture tables on Drive
3.  /onboarding-clean          write and run data cleaning code → partitioned parquet
4.  /onboarding-upload         upload parquet to BigQuery (dev first)
5.  /onboarding-dbt            write .sql and schema.yaml files
6.  /onboarding-dbt-run        run DBT tests; fix or flag errors
7.  /onboarding-discover       resolve all reference IDs from backend
8.  /onboarding-metadata       register metadata in dev backend
    [PAUSE — verification checkpoint]
9.  /onboarding-metadata --env prod   (only after human approval)
10. /onboarding-pr             open PR with changelog
```

## Verification checkpoint (between steps 8 and 9)

After step 8 succeeds, output a verification checklist and wait for explicit approval before promoting to prod:

```
✓ Dataset registered in dev: <slug>
✓ Tables: <list>
✓ Columns: <counts>
✓ Coverage: <start>–<end>
✓ Cloud tables: OK
✓ Verify at: https://development.basedosdados.org/dataset/<slug>

Reply "approved" to promote to prod, or describe what needs fixing.
```

Do not proceed to step 9 or step 10 without the user replying "approved" (or equivalent).

## Environment

- Default: dev (`development.backend.basedosdados.org`, `basedosdados-dev` GCP project)
- Prod: only after explicit user approval

## Commit discipline

Commit after each logical unit completes:

- After cleaning code is verified
- After DBT files are written
- After metadata is registered in dev
- After metadata is promoted to prod

Use conventional commits: `feat(br_me_siconfi): add architecture tables and cleaning code`

Never commit data files (parquet, CSV). Ensure `.gitignore` covers the output path (a minimal sketch follows).
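A minimal `.gitignore` sketch for the dataset root, assuming the `input/`/`output/` layout used by the cleaning step; adjust the paths to the actual output location:

```
# never commit raw or cleaned data
input/
output/
*.parquet
*.csv
*.xlsx
```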
## Translation

All descriptions and names must be provided in Portuguese, English, and Spanish. When only one language is available (typically Portuguese from architecture tables), translate to the other two using domain knowledge of Brazilian public administration and statistics. Apply consistent terminology.

## Architecture table is the source of truth

When there is a conflict between raw data column names, DBT file conventions, and the architecture table, the architecture table wins. Update the other artifacts to match.
---
description: Fetch or create architecture tables for a Data Basis dataset
argument-hint: <dataset_slug> [drive_folder_path]
---

Fetch or create architecture tables for a Data Basis dataset. Each table in the dataset gets one Google Sheets file.

**Dataset:** $ARGUMENTS

## Architecture table schema

Each file has these columns (in order):

| Column | Description |
|--------|-------------|
| `name` | BQ column name (snake_case) |
| `bigquery_type` | BQ type: INT64, STRING, FLOAT64, DATE, etc. |
| `description` | Description in Portuguese |
| `temporal_coverage` | Empty = same as table; different = explicit, e.g. `2013(1)` or `2013(1)2022` |
| `covered_by_dictionary` | `yes` / `no` |
| `directory_column` | BD directories FK: `<dataset>.<table>:<column>`, e.g. `br_bd_diretorios_brasil.municipio:id_municipio` |
| `measurement_unit` | e.g. `year`, `BRL`, `hectare` — blank if none |
| `has_sensitive_data` | `yes` / `no` |
| `observations` | Free-text notes |
| `original_name` | Column name in the raw source |

**Temporal coverage notation:** `START(INTERVAL)END` — e.g. `2004(1)2022` = annual from 2004 to 2022. `2013(1)` = from 2013, ongoing. Empty = same as the table's coverage.
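To make the notation concrete, here is a minimal parsing sketch in Python; `parse_temporal_coverage` is a hypothetical helper for illustration, not part of any BD tooling:

```python
import re
from typing import Optional, Tuple

def parse_temporal_coverage(value: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Parse START(INTERVAL)END notation into (start, interval, end).

    end is None for ongoing coverage; the whole result is None for an
    empty cell (coverage is the same as the table's).
    """
    value = value.strip()
    if not value:
        return None
    match = re.fullmatch(r"(\d{4})\((\d+)\)(\d{4})?", value)
    if not match:
        raise ValueError(f"invalid temporal coverage: {value!r}")
    start, interval, end = match.groups()
    return int(start), int(interval), int(end) if end else None

assert parse_temporal_coverage("2004(1)2022") == (2004, 1, 2022)  # annual, 2004 to 2022
assert parse_temporal_coverage("2013(1)") == (2013, 1, None)      # from 2013, ongoing
assert parse_temporal_coverage("") is None                        # same as the table
```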
## Step 1 — Check if architecture files already exist

Use the `databasis-workspace` Google Drive MCP to check the Drive folder:
`BD/Dados/Conjuntos/<dataset>/architecture/`

If files exist: read and validate them. Report any missing required columns or schema mismatches.

## Step 2 — Ask design questions (if creating new tables)

Before creating architecture tables, ask the user:

1. **Format:** Should tables be in long format? (Data Basis default: yes — each row = one observation)
2. **Partition columns:** What columns partition the data? (default: `ano`, `sigla_uf`; Brasil-level: `ano` only)
3. **Unit of observation:** What does one row represent? (e.g. "one municipality-year-account")
4. **Categorical columns:** Which columns have a finite set of categories that need a dictionary?
5. **Directory columns:** Which columns link to BD standard directories (municipalities, states, time)?

## Step 3 — Infer or create schema

If creating new tables (a minimal inference sketch follows this list):

1. Read the first 20 rows of each raw data file
2. Infer column names, types, and candidate partition columns
3. Apply long-format transformation if the data is wide (one column per year → pivot to long)
4. Map known standard columns to BD directories:
   - `ano` → `br_bd_diretorios_data_tempo.ano:ano`
   - `sigla_uf` → `br_bd_diretorios_brasil.uf:sigla_uf`
   - `id_municipio` → `br_bd_diretorios_brasil.municipio:id_municipio`
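A minimal inference sketch in pandas, assuming a CSV raw file; the path, the wide year columns, and the `valor` name are illustrative:

```python
import pandas as pd

# Peek at the raw file to infer names, types, and partition candidates
sample = pd.read_csv("input/raw_table.csv", nrows=20)
print(sample.dtypes)  # rough mapping: int64 -> INT64, object -> STRING, etc.

# Wide data (one column per year) is pivoted to long format
year_cols = [c for c in sample.columns if str(c).isdigit()]
long = sample.melt(
    id_vars=[c for c in sample.columns if c not in year_cols],
    value_vars=year_cols,
    var_name="ano",
    value_name="valor",
)

# Standard columns map straight to BD directories in `directory_column`
directory_map = {
    "ano": "br_bd_diretorios_data_tempo.ano:ano",
    "sigla_uf": "br_bd_diretorios_brasil.uf:sigla_uf",
    "id_municipio": "br_bd_diretorios_brasil.municipio:id_municipio",
}
```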
## Step 4 — Translate descriptions

Architecture files have descriptions in Portuguese. Translate all descriptions to English and Spanish. Translations should be accurate and use the correct technical terminology for Brazilian public finance or whichever domain applies.

## Step 5 — Save to Drive

Save each architecture table as a Google Sheet in:
`BD/Dados/Conjuntos/<dataset>/architecture/<table_slug>.xlsx`

Use `mcp__databasis-workspace__create_spreadsheet` and `mcp__databasis-workspace__modify_sheet_values`.

## Step 6 — Output

Return a summary listing all tables found/created and the Drive URLs for each architecture file. Store these URLs — they are needed by `databasis-metadata`.
---
description: Write and run data cleaning code to produce partitioned parquet output
argument-hint: <dataset_slug> <raw_data_path> [output_path]
---

Write and run data cleaning code for a Data Basis dataset.

**Dataset / paths:** $ARGUMENTS

## Folder structure

Work in a folder **external to the `pipelines/` repo**:

```
<dataset_root>/
├── input/                 ← raw files (CSV, Excel, JSON, etc.) — do not modify
├── output/
│   └── <table_slug>/
│       └── ano=<year>/sigla_uf=<uf>/   (municipio/UF tables)
│       └── ano=<year>/                 (Brasil-level tables)
└── code/
    └── clean.py            (one script per dataset if tables share raw source)
    └── clean_<table>.py    (one per table if they don't)
```

## Step 1 — Read architecture tables

Read the architecture tables from Drive (URLs from `databasis-architecture` output, or ask the user). These define:

- Exact column names and types
- Partition columns
- Which columns are categorical (need dictionary)
- Directory column mappings

## Step 2 — Inspect raw data

Read the first 20 rows of each raw file to understand structure (a minimal sketch follows this list). Check:

- File format (CSV, Excel, JSON, fixed-width, etc.)
- Encoding (UTF-8, ISO-8859-1, etc.)
- Column names and their mapping to architecture names
- Any header rows, footer rows, or skip rows
- Date formats
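A minimal inspection sketch; the file name is a placeholder and the encoding list covers the common cases for Brazilian public data:

```python
import pandas as pd

path = "input/raw_file.csv"
sample = None
for encoding in ("utf-8", "iso-8859-1"):
    try:
        # sep=None with engine="python" sniffs the delimiter
        sample = pd.read_csv(path, nrows=20, encoding=encoding, sep=None, engine="python")
        print(f"decoded with {encoding}")
        break
    except UnicodeDecodeError:
        continue

print(sample.head())          # header/footer junk shows up as odd values here
print(list(sample.columns))   # compare against `original_name` in the architecture
```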
## Step 3 — Write cleaning code

Write Python cleaning code. Use pandas or polars — choose whichever fits the data better (polars for large files or complex transformations; pandas otherwise).

Rules:

- **Start with a small subset** (1 year or the smallest available partition) before scaling
- Output column order must match the architecture exactly
- Use `safe_cast` logic: coerce types with error handling, not hard casts
- For wide data: pivot to long format
- Partition outputs using `pyarrow` with the partition columns from the architecture
- Never modify input files

Standard column types (an end-to-end sketch follows this list):

- INT64: `pd.to_numeric(col, errors='coerce').astype('Int64')`
- FLOAT64: `pd.to_numeric(col, errors='coerce')`
- STRING: `.astype(str).str.strip().replace('nan', pd.NA)`
- DATE: `pd.to_datetime(col, errors='coerce').dt.date`
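A minimal end-to-end sketch under these rules; the raw path, the rename map, and the three-column schema are illustrative, not the real dataset's:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("input/raw_2013.csv", encoding="iso-8859-1")

# Map raw names to the architecture's `name` column
df = df.rename(columns={"Ano": "ano", "UF": "sigla_uf", "Valor": "valor"})

# safe_cast logic: coerce with error handling, never hard-cast
df["ano"] = pd.to_numeric(df["ano"], errors="coerce").astype("Int64")
df["sigla_uf"] = df["sigla_uf"].astype(str).str.strip().replace("nan", pd.NA)
df["valor"] = pd.to_numeric(df["valor"], errors="coerce")

# Column order must match the architecture exactly
df = df[["ano", "sigla_uf", "valor"]]

# Partitioned parquet via pyarrow, using the architecture's partition columns
pq.write_to_dataset(
    pa.Table.from_pandas(df, preserve_index=False),
    root_path="output/<table_slug>",
    partition_cols=["ano", "sigla_uf"],
)
```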
## Step 4 — Validate subset output

After running on the subset (a minimal check sketch follows this list):

1. Verify column names match the architecture exactly
2. Verify types are correct
3. Check for unexpected nulls in primary key columns
4. Print row counts and a sample
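A check sketch run on the cleaned subset before the partitioned write; the architecture export and the example key columns are placeholders:

```python
import pandas as pd

def validate_subset(df: pd.DataFrame, arch: pd.DataFrame) -> None:
    """Run the four checks; `arch` is the architecture table as a DataFrame."""
    # 1. Column names and order must match the architecture exactly
    assert list(df.columns) == list(arch["name"]), "columns differ from architecture"

    # 2. Types (compare against the architecture's bigquery_type)
    print(df.dtypes)

    # 3. Unexpected nulls in primary key columns (example key)
    pk = [c for c in ("ano", "sigla_uf", "id_municipio") if c in df.columns]
    print(df[pk].isna().sum())

    # 4. Row counts and a sample
    print(len(df), "rows")
    print(df.sample(min(5, len(df))))
```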
Only proceed to the full data after the subset is verified. Ask the user to confirm.

## Step 5 — Dictionary table

If any column has `covered_by_dictionary: yes`, create a `dicionario` table (a build sketch follows this list):

- Schema: `id_tabela | nome_coluna | chave | cobertura_temporal | valor`
- One row per (table, column, key) combination
- All tables and columns in a single dictionary file
- Output to: `output/dicionario/dicionario.csv` (not partitioned)
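A build sketch for the dictionary; the table slug, column name, and category mapping are hypothetical examples of one covered column:

```python
from pathlib import Path

import pandas as pd

categories = {"1": "Receita", "2": "Despesa"}  # example keys for one column

rows = [
    {
        "id_tabela": "<table_slug>",
        "nome_coluna": "tipo_conta",   # a column with covered_by_dictionary: yes
        "chave": chave,
        "cobertura_temporal": "",      # empty: same as the table's coverage
        "valor": valor,
    }
    for chave, valor in categories.items()
]

dicionario = pd.DataFrame(
    rows, columns=["id_tabela", "nome_coluna", "chave", "cobertura_temporal", "valor"]
)
Path("output/dicionario").mkdir(parents=True, exist_ok=True)
dicionario.to_csv("output/dicionario/dicionario.csv", index=False)
```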
## Step 6 — Scale to full data

Run on all years/partitions and report final row counts per partition.
---
description: Gather context for a Data Basis dataset onboarding (raw sources, docs, coverage, org)
argument-hint: <dataset_slug> [raw_data_path] [drive_folder_path]
---

Gather all context needed to onboard a dataset to Data Basis. Work through the following steps:

**Dataset:** $ARGUMENTS

## Step 1 — Search for official documentation

Search online for the dataset's official documentation, raw data source URLs, and any existing Data Basis presence. Look for:

- Official government or institution page hosting the raw data
- Download URL or API endpoint
- Data dictionary or codebook
- License information (look for open data licenses: CC-BY, CC0, OGL, etc.)
- Update frequency (monthly, annual, etc.)
- Responsible organization

Also read any local README, documentation, or metadata files present in the dataset folder if a path was provided.

## Step 2 — Ask the user to confirm or supplement

Present your findings and ask the user to confirm or fill in:

1. Raw data source URL(s) — where to download the raw files
2. Responsible organization slug on Data Basis (e.g. `ministerio-da-economia`)
3. License slug (e.g. `open-database`)
4. Update frequency and lag (e.g. annual, 1-year lag)
5. Geographic coverage (e.g. Brazil — municipalities, states, federal)
6. Temporal coverage (start year, end year or "ongoing")
7. Drive folder path for architecture tables: `BD/Dados/Conjuntos/<dataset>/`

## Step 3 — Output a structured context block

Once all information is gathered, output a context block in this format for use by downstream skills:

```
=== DATASET CONTEXT: <slug> ===
Organization: <org_slug>
License: <license_slug>
Raw source URL(s): <url(s)>
Update frequency: <e.g. annual>
Update lag: <e.g. 1 year>
Coverage: <start_year>–<end_year or "present">, <geography>
Drive folder: <path>
Notes: <any other relevant info>
```
Keep the context block in the conversation for subsequent skills to reference.
---
description: Run DBT tests and materialize tables; fix or flag errors
argument-hint: <dataset_slug> [--tables table1,table2] [--target dev|prod]
---

Run DBT tests and materialize tables for a Data Basis dataset.

**Dataset:** $ARGUMENTS

Parse `--target` (default: dev) and `--tables` (default: all tables in the dataset) from arguments.

## Step 1 — Ensure DBT environment

Check for `/tmp/dbt_env/bin/dbt`. If missing, create it:

```bash
~/.pyenv/versions/3.11.6/bin/python -m venv /tmp/dbt_env
/tmp/dbt_env/bin/pip install dbt-bigquery
```
## Step 2 — Run models

```bash
cd <pipelines_root>
/tmp/dbt_env/bin/dbt run \
  --select <dataset_slug> \
  --profiles-dir ~/.dbt \
  --target <target>
```

> **Review note (coderabbitai[bot], resolved):** the repo ships a `profiles.yml` at the repository root, while these commands expect `~/.dbt/profiles.yml`. Suggested update for both the `dbt run` and `dbt test` invocations: replace `--profiles-dir ~/.dbt` with `--profiles-dir <pipelines_root>`.
Report which models compiled and ran successfully.

## Step 3 — Run tests

```bash
/tmp/dbt_env/bin/dbt test \
  --select <dataset_slug> \
  --profiles-dir ~/.dbt \
  --target <target>
```
## Step 4 — Handle test failures

For each failing test:

1. **`not_null` on a nullable column:** add the column to a `not_null_proportion_multiple_columns` block instead, and remove the strict `not_null` test (see the sketch after this list).
2. **`relationships` test fails** (FK violation): report the failing values and ask the user whether to (a) remove the test, (b) relax it to a proportion test, or (c) fix the data.
3. **`not_null_proportion_multiple_columns` fails** (proportion below the threshold, e.g. 5%): the data may genuinely be sparse. Report the actual proportion and ask the user to confirm the threshold is acceptable.
4. **Other failures:** report the full test output and SQL. Propose a fix, but ask before implementing.
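A schema.yaml sketch of the relaxation in item 1; the exact arguments of the `not_null_proportion_multiple_columns` macro are an assumption here, so mirror whatever the project's existing models use:

```yaml
models:
  - name: <table_slug>
    tests:
      - not_null_proportion_multiple_columns:
          at_least: 0.05        # assumed threshold parameter (5%)
          columns:
            - valor             # was: a strict not_null column test
```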
## Step 5 — Report

Output a summary:

```
Models: X passed, Y failed
Tests:  X passed, Y failed
```

List any remaining failures with proposed next steps.