UMDB (Urban Microbiome Database) is a GitHub-hosted bioinformatics resource for discovering, organizing, and publishing urban environmental sequencing metadata from public repositories.
The repository has two jobs:
- Harvest and normalize public metadata from NCBI-linked resources.
- Publish those records as a static website and downloadable database under docs/.
The current public site is available at:
https://aglucaci.github.io/UMDB/
UMDB is designed as scientific infrastructure for urban metagenomics and related environmental sequencing studies. It focuses on public sequencing records associated with urban, built-environment, wastewater, air, surface, and similar city-linked contexts.
Rather than operating as a traditional server-backed application, the repository stores harvested records directly in versioned files and publishes a static web interface through GitHub Pages. This keeps the project easy to inspect, easy to archive, and inexpensive to maintain.
The published site includes:
- a searchable BioProject explorer
- expandable SRR-level detail views
- filter controls for geography, assay class, center, and year
- search-derived dataset bundling for matched SRR cohorts
- exportable SRR accession lists and FASTQ download scripts
- downloadable JSON artifacts under docs/db/
- reviewer-facing pages such as About, Methodology, and Data Access
- summary analytics across countries, cities, years, centers, and assay classes
- a global map view of country-level coverage
At a high level, UMDB follows this flow:
- Discover candidate public records from NCBI-linked resources.
- Retrieve run-level metadata for SRR entries.
- Join associated BioProject and BioSample context when available.
- Derive practical annotations such as assay class and geographic labels.
- Store the resulting records in local append-only or chunked artifacts.
- Export static JSON files that power the website.
- Publish the docs/ directory through GitHub Pages.
This means the repository itself is both the processing workspace and the published database release.
UMDB uses public repository metadata. The main sources currently represented in the code and outputs are:
- NCBI SRA for run-level sequencing metadata
- NCBI BioProject for project-level metadata
- NCBI BioSample for sample and location context
- NCBI E-utilities for public metadata retrieval and cross-linking
All records remain subject to the quality and completeness of the original public submissions.
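As an illustration of how run-level discovery against these sources can start, the sketch below builds an NCBI E-utilities ESearch URL for the SRA database. It is a minimal example and not the harvester's actual request code: the function name `build_esearch_url` and the sample search term are assumptions, while the endpoint and the `db`, `term`, `retstart`, `retmax`, `retmode`, `api_key`, `email`, and `tool` parameters are standard E-utilities parameters.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retstart=0, retmax=500,
                      api_key=None, email=None, tool=None):
    """Return an ESearch URL that lists SRA UIDs matching `term`.

    Hypothetical helper for illustration; it only builds the URL and
    performs no network request.
    """
    params = {
        "db": "sra",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    # Optional identification parameters, added only when provided.
    for key, value in (("api_key", api_key), ("email", email), ("tool", tool)):
        if value:
            params[key] = value
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The search term here is an illustrative assumption, not a UMDB query profile.
url = build_esearch_url("urban metagenome", retmax=100, tool="umdb-srr-harvester")
print(url)
```

Keeping URL construction separate from the request itself makes paging (`retstart`, `retmax`) easy to test without contacting NCBI.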
UMDB no longer relies on a single search string for study discovery. The harvester now supports a multi-profile query strategy intended to improve recall across the different ways urban microbiome studies are described in public repositories.
The default discovery pass combines several query profiles:
- urban shotgun and metagenomic studies
- urban amplicon and marker-gene studies
- wastewater and sewage surveillance studies
- transit and surface microbiome studies
- urban air, aerosol, and dust studies
This improves coverage for studies that do not all use the same vocabulary in titles, abstracts, or repository metadata.
Even with broader discovery, UMDB should still be treated as a high-recall public index rather than a guaranteed complete census of every urban microbiome study ever performed. Repository metadata are inconsistent, and some studies are only discoverable through follow-up curation.
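One way to picture the multi-profile pass is as a merge of per-profile hit lists into a single de-duplicated set. The sketch below assumes each profile has already returned a list of UIDs; the profile labels, the UIDs, and the function `merge_profile_hits` are hypothetical and are not taken from the harvester.

```python
def merge_profile_hits(hits_by_profile):
    """Merge per-profile UID lists into one de-duplicated mapping.

    Returns {uid: [profiles that matched]}, with UIDs in first-seen order.
    """
    merged = {}
    for profile, uids in hits_by_profile.items():
        for uid in uids:
            matched = merged.setdefault(uid, [])
            if profile not in matched:
                matched.append(profile)
    return merged


# Hypothetical profile names and UIDs, for illustration only.
hits = {
    "shotgun": ["101", "102"],
    "wastewater": ["102", "103"],
    "air": ["103", "101", "104"],
}
merged = merge_profile_hits(hits)
for uid, profiles in merged.items():
    print(uid, profiles)
```

Recording which profiles matched each UID keeps the discovery path auditable, which helps when judging why a record was included.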
UMDB can now add an AI-assisted curation layer on top of the original public metadata.
This curation layer is designed to:
- judge whether the available metadata are sufficient for city-level interpretation
- decide whether the sample appears to come from a truly urban or city-associated context
- inspect BioSample attributes, titles, descriptions, and run metadata together
- propose corrected country, city, or assay annotations when the evidence supports it
- preserve the original annotations while publishing AI-reviewed final values alongside them
AI curation is stored separately in cache and merged into the published dataset at export time as an ai_curation object on each record. The original repository metadata remain unchanged.
When present, ai_curation can include fields such as:
- metadata_sufficiency
- metadata_sufficient
- urban_origin
- confidence
- environment_type
- ai_fixed
- original_country
- original_city
- original_assay_class
- final_country
- final_city
- final_assay_class
- reasoning_summary
- evidence
The website can then distinguish between original-only records and AI-fixed records.
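The export-time merge described above could look roughly like the following sketch. It assumes the AI cache is a dictionary keyed by SRR accession and that `geo` carries a `city` key; both are assumptions, as are the helper names `attach_ai_curation` and `display_city`. The `ai_curation`, `ai_fixed`, `original_city`, and `final_city` field names come from the list above.

```python
import copy


def attach_ai_curation(record, ai_cache):
    """Return a copy of `record` with an `ai_curation` object when one is cached.

    The input record is never mutated, so the original metadata stay intact.
    """
    merged = copy.deepcopy(record)
    review = ai_cache.get(record["srr"])
    if review is not None:
        merged["ai_curation"] = copy.deepcopy(review)
    return merged


def display_city(record):
    """Prefer the AI-reviewed city only when the record was AI-fixed."""
    review = record.get("ai_curation") or {}
    if review.get("ai_fixed") and review.get("final_city"):
        return review["final_city"]
    return (record.get("geo") or {}).get("city")


# Hypothetical record and cache entry, for illustration only.
record = {"srr": "SRR0000001", "geo": {"country": "USA", "city": None}}
cache = {
    "SRR0000001": {
        "ai_fixed": True,
        "original_city": None,
        "final_city": "New York",
    }
}
published = attach_ai_curation(record, cache)
print(display_city(record), display_city(published))
```

Because the review lives in its own object, a consumer can always fall back to the original annotations by ignoring `ai_curation`.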
The repository is organized around three main areas: harvesting code, intermediate data, and published web artifacts.
UMDB/
├── data/
│   ├── cache/
│   │   ├── bioproject.json
│   │   ├── bioproject_uid.json
│   │   └── biosample.json
│   ├── seen_sra_uids.txt
│   ├── seen_srr_runs.txt
│   └── srr_catalog_2026*.jsonl
├── docs/
│   ├── index.html
│   ├── about.html
│   ├── methodology.html
│   ├── data.html
│   ├── analytics.html
│   ├── global-map.html
│   ├── latest_srr.json
│   ├── assets/
│   │   ├── db.js
│   │   └── site.css
│   └── db/
│       ├── srr_records_manifest.json
│       ├── srr_records_part000.json
│       ├── ...
│       ├── srr_index.json
│       ├── bioprojects.json
│       └── biosamples.json
├── logo/
├── scripts/
│   └── urbanscope_harvester/
│       ├── cli.py
│       ├── ingest.py
│       ├── exports.py
│       ├── ncbi.py
│       ├── bioproject.py
│       ├── biosample.py
│       ├── assay.py
│       ├── utils.py
│       └── config.py
└── README.md
The Python package under scripts/urbanscope_harvester/ is the operational core of UMDB. The package path still uses the older internal module name, but it powers the current UMDB pipeline and published site.
- cli.py defines command-line entry behavior.
- ingest.py coordinates harvesting and record assembly.
- ncbi.py contains public NCBI request logic.
- bioproject.py and biosample.py handle project and sample enrichment.
- assay.py derives higher-level assay labels from run metadata.
- exports.py writes the static JSON artifacts used by the website.
- config.py centralizes path and output settings.
- utils.py provides common helpers for JSON, timestamps, and iteration.
Together, these modules take public metadata and turn it into the chunked release files under docs/db/.
The data/ directory is the working data store for harvesting and enrichment.
- seen_sra_uids.txt and seen_srr_runs.txt track identifiers that have already been processed.
- srr_catalog_*.jsonl files store accumulated run-level records.
- cache/ stores locally reused metadata lookups such as BioProject and BioSample responses.
This directory is important because it represents the stateful layer of the pipeline. It is where UMDB remembers what it has seen and where it stages enriched metadata before export.
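The "seen" files can be thought of as a simple append-only ledger. The sketch below shows one way such a ledger might work, assuming one identifier per line; the helpers `load_seen` and `append_new` are hypothetical and do not reproduce the harvester's own code.

```python
import tempfile
from pathlib import Path


def load_seen(path):
    """Read one identifier per line; a missing file means nothing was seen yet."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def append_new(path, candidates):
    """Append unseen identifiers to `path` and return them in input order."""
    seen = load_seen(path)
    new = []
    for item in candidates:
        if item not in seen:
            seen.add(item)
            new.append(item)
    if new:
        # Append mode keeps earlier entries untouched.
        with open(path, "a", encoding="utf-8") as fh:
            fh.writelines(f"{item}\n" for item in new)
    return new


# Demonstration in a temporary directory with hypothetical accessions.
with tempfile.TemporaryDirectory() as tmp:
    seen_file = Path(tmp) / "seen_srr_runs.txt"
    first = append_new(seen_file, ["SRR1", "SRR2", "SRR1"])
    second = append_new(seen_file, ["SRR2", "SRR3"])
    stored = seen_file.read_text(encoding="utf-8").split()

print(first, second, stored)
```

Only identifiers returned by `append_new` would need fresh metadata retrieval, which is what lets repeated crawls skip work that was already done.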
The docs/ directory is the public release of the project.
- HTML pages provide the user-facing web interface.
- docs/db/ contains machine-readable database artifacts.
- latest_srr.json provides a compact recent-record feed.
- assets/ contains shared JavaScript and CSS used across the site.
Anything under docs/ should be treated as published output, not just internal project files.
UMDB currently publishes a manifest-plus-parts layout for SRR-level records.
docs/db/srr_records_manifest.json describes:
- when the release was generated
- how many total records exist
- which part files belong to the release
- which years are represented in the export summary
docs/db/srr_records_partXXX.json files contain arrays of SRR-level records. A typical record can include:
- srr for the run accession
- runinfo_row for SRA RunInfo fields
- bioproject for project-level metadata
- geo for parsed geographic context
- assay for high-level assay classification
- ncbi for source URLs
The website loads these files directly in the browser and aggregates them into BioProject-level summaries for exploration.
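To make the manifest-plus-parts layout concrete, the sketch below writes and then reads a small release in Python. The file names follow the pattern shown in the repository tree, but the manifest keys `generated_at`, `total_records`, and `parts` are assumptions chosen for illustration; the real manifest schema may differ, and the website itself loads these files with JavaScript.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_release(records, out_dir, part_size):
    """Split `records` into part files and write a manifest listing them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    part_names = []
    for start in range(0, len(records), part_size):
        name = f"srr_records_part{len(part_names):03d}.json"
        chunk = records[start:start + part_size]
        (out_dir / name).write_text(json.dumps(chunk), encoding="utf-8")
        part_names.append(name)
    manifest = {  # hypothetical key names
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "total_records": len(records),
        "parts": part_names,
    }
    (out_dir / "srr_records_manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


def read_release(out_dir):
    """Load every part named in the manifest and concatenate the records."""
    out_dir = Path(out_dir)
    manifest_path = out_dir / "srr_records_manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    records = []
    for name in manifest["parts"]:
        records.extend(json.loads((out_dir / name).read_text(encoding="utf-8")))
    return records


# Round-trip five hypothetical records through a temporary release directory.
records = [{"srr": f"SRR{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    manifest = write_release(records, tmp, part_size=2)
    loaded = read_release(tmp)

print(manifest["parts"], len(loaded))
```

Chunking keeps each file small enough for static hosting, while the manifest gives clients a single entry point for finding every part.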
Many users want to discover studies, not just individual runs. UMDB therefore groups records by BioProject in the main explorer while preserving the underlying SRR records for inspection.
This gives the site two layers:
- a study-facing summary layer for browsing
- a run-facing provenance layer for detail and reuse
That structure is useful for both casual exploration and manuscript review.
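The two layers can be sketched as a grouping step over SRR-level records. The example assumes each record stores its project accession at `record["bioproject"]["accession"]`, which is a hypothetical path; the function `group_by_bioproject` and the sample accessions are also illustrative only.

```python
from collections import defaultdict


def group_by_bioproject(records):
    """Build study-level summaries while keeping the run accessions per study."""
    groups = defaultdict(list)
    for rec in records:
        # Records without project context are collected under "UNKNOWN".
        accession = (rec.get("bioproject") or {}).get("accession") or "UNKNOWN"
        groups[accession].append(rec["srr"])
    return [
        {"bioproject": acc, "run_count": len(srrs), "srrs": sorted(srrs)}
        for acc, srrs in sorted(groups.items())
    ]


# Hypothetical records, for illustration only.
records = [
    {"srr": "SRR3", "bioproject": {"accession": "PRJNA2"}},
    {"srr": "SRR1", "bioproject": {"accession": "PRJNA1"}},
    {"srr": "SRR2", "bioproject": {"accession": "PRJNA1"}},
    {"srr": "SRR4"},
]
summaries = group_by_bioproject(records)
for summary in summaries:
    print(summary)
```

Each summary row serves the study-facing layer, and its `srrs` list preserves the link back to the run-facing layer.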
The site now includes pages that help present the repository as a maintained scientific database:
- about.html explains the scope and intended use of the resource
- methodology.html describes derivation logic, provenance, and caveats
- data.html explains downloads and schema expectations
- analytics.html summarizes global counts and distributions
- global-map.html visualizes country-level coverage
These pages are meant to reduce ambiguity for collaborators, manuscript reviewers, and future users of the database.
UMDB follows a few simple design principles:
- static publication over server complexity
- auditable files over opaque backend state
- public metadata reuse over proprietary infrastructure
- browser access plus machine-readable exports
- low-cost maintenance with transparent artifacts
Although the exact command flow may evolve, a normal update cycle looks like this:
- Harvest new or targeted records into data/.
- Update caches and enrichments.
- Rebuild exported JSON release artifacts in docs/db/.
- Refresh site pages that consume the latest summaries.
- Commit the updated data and published files.
In other words, a repository update is also a database release.
To use AI-assisted metadata curation locally, set your environment first:
export OPENAI_API_KEY="your_openai_api_key"
export OPENAI_MODEL="gpt-4o-mini"
export NCBI_API_KEY="your_ncbi_api_key"
export NCBI_EMAIL="you@example.org"
export NCBI_TOOL="umdb-srr-harvester"
You can curate newly harvested records during a deep crawl:
python3 -m scripts.urbanscope_harvester.cli crawl \
--fetch-biosample \
--fetch-bioproject \
--page-size 500 \
--sort date \
--max-total 0 \
--stop-after-new-srr 0 \
--ai-curate
You can also run AI curation against existing harvested records:
python3 -m scripts.urbanscope_harvester.cli curate-ai \
--model gpt-4o-mini \
--max-records 500
To overwrite previously cached AI reviews:
python3 -m scripts.urbanscope_harvester.cli curate-ai \
--model gpt-4o-mini \
--overwrite \
--max-records 500
To target a single year:
python3 -m scripts.urbanscope_harvester.cli curate-ai \
--year 2024 \
--max-records 500
The repository also includes a dedicated full-dataset AI curation workflow:
- Workflow file: .github/workflows/urbanscope_full_ai_curation.yml
- Workflow name in GitHub Actions: UMDB Full AI Curation
- Required secret: OPENAI_API_KEY
- Optional but recommended secrets for NCBI-linked context: NCBI_API_KEY, NCBI_EMAIL
To run it in GitHub:
- Open the repository Actions tab.
- Select UMDB Full AI Curation.
- Click Run workflow.
- Leave max_records as 0 to process the entire dataset.
- Leave year blank to process all years, or set a single year for a targeted pass.
- Turn on overwrite only when you want to replace cached AI reviews.
This workflow runs the harvester CLI, rebuilds the exported database artifacts, and commits updated data/ and docs/ outputs back to the repository.
UMDB publishes several output styles because different users need different levels of structure:
- JSONL for accumulation-oriented internal catalogs in data/
- JSON for the public website and machine-readable exports in docs/
- HTML for the user-facing static interface
- shell scripts and accession lists generated from browser searches for raw FASTQ retrieval
At the moment, the public website is centered on JSON-driven delivery rather than a live backend database.
UMDB is primarily a metadata and discovery resource, but the web explorer now supports bundling matched search results into download-ready dataset cohorts.
From the main explorer, a user can:
- export a JSON bundle of the currently matched runs
- export a plain-text list of SRR accessions
- export a shell script that uses prefetch and fasterq-dump from SRA Toolkit
This makes it possible to go from a filtered UMDB search to raw FASTQ retrieval without manually collecting accessions one by one.
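The exported script is generated in the browser, but the idea can be sketched in Python. The example below turns a list of accessions into a shell script that calls `prefetch` and `fasterq-dump`; the function `build_fastq_script`, the accession pattern, and the default output directory are assumptions, and the real export may use different options.

```python
import re
import shlex

# Run accessions from SRA, ENA, and DDBJ share this general shape.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def build_fastq_script(accessions, outdir="fastq"):
    """Return a bash script that downloads each valid, unique run accession."""
    valid = []
    for acc in accessions:
        acc = acc.strip()
        if RUN_PATTERN.match(acc) and acc not in valid:
            valid.append(acc)
    quoted_outdir = shlex.quote(outdir)
    lines = [
        "#!/usr/bin/env bash",
        "set -euo pipefail",
        f"mkdir -p {quoted_outdir}",
    ]
    for acc in valid:
        # prefetch downloads the run; fasterq-dump converts it to FASTQ.
        lines.append(f"prefetch {acc}")
        lines.append(f"fasterq-dump {acc} --split-files --outdir {quoted_outdir}")
    return "\n".join(lines) + "\n"


# Hypothetical accessions, including a duplicate and an invalid entry.
script = build_fastq_script(
    ["SRR0000001", "not-an-accession", "SRR0000001", " SRR0000002 "]
)
print(script)
```

Validating and de-duplicating accessions before writing the script avoids passing unexpected strings to the shell and prevents repeated downloads of the same run.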
UMDB is useful, but it should be interpreted carefully.
- Geographic metadata may be incomplete, ambiguous, or inconsistently formatted.
- Assay labels are practical summary categories, not formal reannotation of every study.
- Inclusion reflects what is publicly available and discoverable through the current harvesting logic.
- Presence in UMDB does not imply endorsement, quality ranking, or uniform study design.
These limitations are expected for public metadata integration projects and should be stated clearly in any manuscript built around the resource.
The project is intentionally lightweight.
| Component | Provider | Cost |
|---|---|---|
| Compute | Local runs or GitHub-based workflows | Low to free |
| Storage | Git repository artifacts | Low to free |
| Hosting | GitHub Pages | Free |
| Metadata access | NCBI public utilities | Free within usage guidance |
UMDB is provided for research and informational purposes only. It does not constitute medical advice, clinical guidance, or public health policy.
Alexander G. Lucaci, PhD
Computational Evolutionary Biology
Urban Metagenomics
Genomic Surveillance