A FastAPI service that generates, versions, and serves ML datasets as NPZ artifacts.
juniper-data turns a catalogue of dataset generators into a REST service: you ask it for a dataset
by name and parameters, and it returns a versioned, NPZ-formatted train/test/full split. The
catalogue spans synthetic classification problems (two-spiral, concentric circles, XOR, Gaussian
mixtures, moons, checkerboard), image sets (MNIST), the ARC-AGI visual-reasoning families, a CSV/JSON
import path, and a family of time-series and irregularly-sampled sequence generators
(autoregressive, Mackey-Glass, multi-sine, delay-product, equities, and the irregular-Δt equities_seq
contract). A named-version registry, tag filtering, batch creation, and per-dataset preview round out
the surface. Call GET /v1/generators for the live catalogue.
It is the foundational data layer of the platform: the dataset identifiers it returns are the
substrate juniper-cascor trains on and juniper-canopy visualises.
Part of the Juniper platform. juniper-data is the dataset-generation service of Juniper — a multi-package ML research platform built around constructive (Cascade-Correlation) and recurrent neural networks. It runs standalone; the rest of the platform consumes it over HTTP (see
juniper-data-client).
pip install juniper-data # from PyPI

For development from a clone (the optional extras are api, arc-agi, equities, observability,
test, dev, all):
git clone https://github.com/pcalnon/juniper-data.git && cd juniper-data
pip install -e ".[all]"

uvicorn --factory juniper_data.api.app:get_app --reload --port 8100 # binds 127.0.0.1:8100
curl http://localhost:8100/v1/health/ready
curl http://localhost:8100/v1/generators # the live generator catalogue

Create a dataset over the REST API:
curl -sX POST localhost:8100/v1/datasets \
-H 'Content-Type: application/json' \
-d '{"generator": "spiral", "name": "demo", "params": {"n_spirals": 2, "noise": 0.1}}'

Or generate one in-process, without the service:
from juniper_data.generators import SpiralGenerator, SpiralParams
dataset = SpiralGenerator.generate(SpiralParams(n_spirals=2, n_points_per_spiral=100, noise=0.1))
# dataset: dict of float32 arrays keyed X_train, y_train, X_test, y_test, X_full, y_full

Datasets are NPZ archives with the keys X_train, y_train, X_test, y_test, X_full, y_full,
all float32. This is the contract every Juniper consumer reads.
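
A consumer may want to check an archive against this contract before training on it. The following is a minimal sketch using NumPy only; the helper name `validate_npz` is hypothetical and not part of juniper-data, the stand-in archive is built in memory rather than fetched from the service, and the matching-row-count check is an extra sanity test that the contract above does not state.

```python
import io

import numpy as np

# The six keys named by the NPZ contract above.
EXPECTED_KEYS = ("X_train", "y_train", "X_test", "y_test", "X_full", "y_full")


def validate_npz(source):
    """Load an NPZ archive and check it against the six-key float32 contract.

    `source` can be a path or a binary file-like object.
    Hypothetical helper: not part of the juniper-data API.
    """
    with np.load(source) as archive:
        arrays = {key: archive[key] for key in archive.files}

    missing = [key for key in EXPECTED_KEYS if key not in arrays]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    wrong_dtype = [key for key in EXPECTED_KEYS if arrays[key].dtype != np.float32]
    if wrong_dtype:
        raise ValueError(f"non-float32 arrays: {wrong_dtype}")

    # Assumption (not stated in the contract): X and y have matching row counts.
    for split in ("train", "test", "full"):
        if len(arrays[f"X_{split}"]) != len(arrays[f"y_{split}"]):
            raise ValueError(f"row mismatch in split {split!r}")
    return arrays


# Stand-in archive built in memory; a real one would come from the service.
rng = np.random.default_rng(0)
X = rng.random((10, 2), dtype=np.float32)
y = rng.random((10, 1), dtype=np.float32)
buffer = io.BytesIO()
np.savez(buffer, X_train=X[:8], y_train=y[:8], X_test=X[8:], y_test=y[8:], X_full=X, y_full=y)
buffer.seek(0)

arrays = validate_npz(buffer)
```

Passing a file path instead of the in-memory buffer works the same way, since `np.load` accepts both.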
Settings load from the JUNIPER_DATA_ environment namespace (juniper_data/api/settings.py) and honor
the Docker _FILE secret convention. The most common knobs (full surface in
docs/REFERENCE.md):
| Variable | Default | Purpose |
|---|---|---|
| JUNIPER_DATA_HOST / JUNIPER_DATA_PORT | 127.0.0.1 / 8100 | Bind address / port (0.0.0.0 under Docker). |
| JUNIPER_DATA_STORAGE_PATH | ./data/datasets | Where persisted dataset artifacts live. |
| JUNIPER_DATA_API_KEYS | (unset) | CSV / JSON-array of X-API-Key values; auth is disabled when unset. |
| JUNIPER_DATA_LOG_LEVEL / _LOG_FORMAT | INFO / text | Verbosity / text or json. |
| JUNIPER_DATA_METRICS_ENABLED | false | Expose /metrics for Prometheus (IP-gated). |
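
To make the `_FILE` secret convention and the two accepted formats for JUNIPER_DATA_API_KEYS concrete, here is a rough sketch of how such values could be resolved. It is not the code in juniper_data/api/settings.py: the helpers `read_setting` and `parse_api_keys` are hypothetical, and the precedence (the `_FILE` variant wins over the plain variable) and the whitespace stripping are assumptions.

```python
import json
import os
from pathlib import Path

PREFIX = "JUNIPER_DATA_"


def read_setting(name, default=None, env=None):
    """Resolve JUNIPER_DATA_<name>, preferring JUNIPER_DATA_<name>_FILE.

    Hypothetical helper. Assumption: when the _FILE variant is set, the
    value is read from that file and takes precedence.
    """
    env = os.environ if env is None else env
    file_path = env.get(f"{PREFIX}{name}_FILE")
    if file_path:
        # Secrets mounted as files usually end with a trailing newline.
        return Path(file_path).read_text(encoding="utf-8").strip()
    return env.get(f"{PREFIX}{name}", default)


def parse_api_keys(raw):
    """Parse a CSV string or a JSON array into a list of keys.

    An empty result stands for "auth disabled", matching the table above.
    """
    if raw is None or not raw.strip():
        return []
    text = raw.strip()
    if text.startswith("["):
        return [str(key) for key in json.loads(text)]
    return [key.strip() for key in text.split(",") if key.strip()]


# Example with an explicit mapping instead of the real process environment.
env = {
    "JUNIPER_DATA_PORT": "9000",
    "JUNIPER_DATA_API_KEYS": '["alpha", "beta"]',
}
host = read_setting("HOST", "127.0.0.1", env)
port = int(read_setting("PORT", "8100", env))
keys = parse_api_keys(read_setting("API_KEYS", env=env))
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without touching real secrets.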
docker build -t juniper-data:latest .
docker run --rm -p 8100:8100 -e JUNIPER_DATA_HOST=0.0.0.0 juniper-data:latest

Multi-stage build (Python 3.14-slim); health is probed at /v1/health/ready. For the full stack, see
juniper-deploy.
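
When scripting against a freshly started container, it can help to wait for the readiness endpoint before sending requests. The sketch below polls /v1/health/ready with the standard library. The function names are hypothetical, and treating HTTP 200 as "ready" is an assumption about the endpoint, not something stated here. The demonstration uses a stand-in probe so that it runs without a live service.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_probe, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"{url} not ready after {attempts} attempts")


# Demonstration with a stand-in probe that succeeds on its third call.
outcomes = iter([False, False, True])
attempts_used = wait_until_ready(
    "http://localhost:8100/v1/health/ready",
    probe=lambda url: next(outcomes),
    sleep=lambda seconds: None,
)
```

Against a running container, `wait_until_ready("http://localhost:8100/v1/health/ready")` uses the real HTTP probe and the default one-second delay.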
Live on PyPI. See CHANGELOG.md for the current version and release history.
Consumed by juniper-cascor and juniper-canopy via JUNIPER_DATA_URL, and by
juniper-data-client programmatically.
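
For a consumer that does not use juniper-data-client, a request can be assembled from JUNIPER_DATA_URL directly. This is a minimal standard-library sketch that mirrors the curl payload from the quick start. The helper `build_create_request` is hypothetical, the fallback base URL is an assumption, and the shape of the response is not covered here (see docs/api/JUNIPER_DATA_API.md).

```python
import json
import os
import urllib.request

# Assumption: fall back to the local quick-start address when the variable is unset.
DEFAULT_URL = "http://localhost:8100"


def build_create_request(generator, name, params, api_key=None, env=None):
    """Build (but do not send) a POST /v1/datasets request.

    Hypothetical helper; juniper-data-client is the supported client.
    """
    env = os.environ if env is None else env
    base = env.get("JUNIPER_DATA_URL", DEFAULT_URL).rstrip("/")
    body = json.dumps({"generator": generator, "name": name, "params": params})
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the service has JUNIPER_DATA_API_KEYS configured.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base}/v1/datasets",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_create_request(
    "spiral",
    "demo",
    {"n_spirals": 2, "noise": 0.1},
    env={"JUNIPER_DATA_URL": "http://data:8100/"},
)
# To send it against a running service: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the URL and header logic easy to inspect without a running service.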
- docs/QUICK_START.md: get running in five minutes
- docs/USER_MANUAL.md: comprehensive usage guide
- docs/api/JUNIPER_DATA_API.md: full REST reference (filtering, batch, tagging, versioning)
- docs/REFERENCE.md: configuration and environment-variable reference
- docs/DOCUMENTATION_OVERVIEW.md: index of all juniper-data docs
MIT — see LICENSE.