libRandomizer is a portable training-data generator for simple prediction
networks. You define the datatype or schema for the input, define the datatype
or schema for the output, choose a fixed record count, and set a seed. The SDK
then produces a reproducible list of input/output training pairs.
The Python package is the reference implementation. Its schema contract is JSON-native so the same dataset definition can be carried across language targets without depending on opaque Python objects.
```shell
python -m pip install .
```

```python
from librandomizer import TrainingDataGenerator, choice, integer

generator = TrainingDataGenerator(
    input_schema=integer(0, 99),
    output_schema=choice(["low", "medium", "high"]),
    count=100,
    seed=42,
)
pairs = generator.generate()
```

Each generated record has the same language-neutral shape:

```json
{
  "input": 81,
  "output": "low"
}
```

Calling the same generator again with the same schema, count, and seed produces the same records in the same order.
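The reproducibility guarantee can be illustrated without the SDK. The sketch below is not libRandomizer's implementation: it assumes a generator backed by a seeded `random.Random` instance from the Python standard library, and `generate_pairs` is a hypothetical helper introduced only for this example.

```python
import random


def generate_pairs(count, seed):
    """Hypothetical stand-in for a seeded generator; not the SDK's algorithm."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    labels = ["low", "medium", "high"]
    pairs = []
    for _ in range(count):
        pairs.append({"input": rng.randint(0, 99), "output": rng.choice(labels)})
    return pairs


first = generate_pairs(100, seed=42)
second = generate_pairs(100, seed=42)

# Same count and seed: same records in the same order.
assert first == second
```

Keeping the random source private to the generator, rather than seeding the global `random` module, is what lets two generators with the same seed stay in step regardless of other code in the process.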
Most datasets can be generated from separate input and output schemas. When a
target should be calculated from the input, pass a `transform` callback and keep
an `output_schema` so the result can be validated and serialized consistently.
```python
from librandomizer import TrainingDataGenerator, integer, number

generator = TrainingDataGenerator(
    input_schema=integer(0, 10),
    output_schema=number(0, 20),
    count=100,
    seed=42,
    transform=lambda value: value * 2,
)
pairs = generator.generate()
```

This is useful for labels, thresholds, regression targets, boolean decisions, and other predictable outputs for supervised learning examples.
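To make the "transform, then validate" flow concrete, here is a minimal standard-library sketch. It is an assumption about how such a pipeline could work, not the SDK's code: `generate_with_transform` and its `out_min`/`out_max` parameters are hypothetical names that stand in for the input schema, the callback, and the output schema's bounds.

```python
import random


def generate_with_transform(count, seed, transform, out_min, out_max):
    """Hypothetical sketch: derive each output from its input, then validate it."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        value = rng.randint(0, 10)  # plays the role of integer(0, 10)
        result = transform(value)
        # Plays the role of number(0, 20): reject results outside the bounds.
        if not (out_min <= result <= out_max):
            raise ValueError(
                f"transform result {result!r} is outside [{out_min}, {out_max}]"
            )
        pairs.append({"input": value, "output": result})
    return pairs


pairs = generate_with_transform(100, 42, lambda value: value * 2, 0, 20)
assert all(pair["output"] == pair["input"] * 2 for pair in pairs)
```

Validating the computed output against declared bounds catches a transform that drifts outside the range the downstream model expects.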
The v1 schema layer focuses on portable JSON-native datatypes:
| Helper | Purpose |
|---|---|
| `integer(min, max)` | Bounded integer values |
| `number(min, max, precision=None)` | Bounded floating point values |
| `boolean()` | true or false values |
| `string(length=8, alphabet=None)` | Fixed-length strings |
| `choice(values)` | One value from a finite set |
| `array_schema(items, length)` | Fixed-length arrays |
| `object_schema(properties)` | Nested JSON objects |
| `null()` | Explicit null values |
| `literal(value)` | A fixed serializable value |
| `one_of(schemas)` | A deterministic choice among schema variants |
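One way to picture a JSON-native schema is as plain data walked by a recursive sampler. The sketch below is illustrative only: the actual schema contract is defined by the project's spec, and the dictionary layout used here (`type`, `min`, `max`, `values`, `items`, `length`, `properties`) and the `sample` function are assumptions made for this example.

```python
import json
import random


def sample(schema, rng):
    """Draw one value from a hypothetical dict-based schema."""
    kind = schema["type"]
    if kind == "integer":
        return rng.randint(schema["min"], schema["max"])
    if kind == "boolean":
        return rng.random() < 0.5
    if kind == "choice":
        return rng.choice(schema["values"])
    if kind == "array":
        return [sample(schema["items"], rng) for _ in range(schema["length"])]
    if kind == "object":
        # Visit keys in sorted order so the draw order never depends on
        # how the schema dict happened to be built.
        return {
            key: sample(schema["properties"][key], rng)
            for key in sorted(schema["properties"])
        }
    raise ValueError(f"unsupported schema type: {kind!r}")


schema = {
    "type": "object",
    "properties": {
        "age": {"type": "integer", "min": 18, "max": 65},
        "active": {"type": "boolean"},
        "scores": {
            "type": "array",
            "length": 3,
            "items": {"type": "integer", "min": 0, "max": 9},
        },
    },
}

# The schema is plain data, so it survives a JSON round trip unchanged.
assert json.loads(json.dumps(schema)) == schema

record = sample(schema, random.Random(42))
assert 18 <= record["age"] <= 65 and len(record["scores"]) == 3
```

Because nothing in the schema is a Python object, the same definition could be read by an SDK in any other language.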
Schemas can be nested:
```python
from librandomizer import boolean, integer, object_schema, string

input_schema = object_schema({
    "profile": object_schema({
        "age": integer(18, 65),
        "active": boolean(),
    }),
    "plan": string(length=6),
})
```

The constructor signature is:

```python
TrainingDataGenerator(
    input_schema,
    output_schema,
    *,
    count=None,
    seed=42,
    transform=None,
    transform_spec=None,
)
```

- `input_schema` describes the generated input side of each pair.
- `output_schema` describes the generated output side, or validates transform results when a transform is supplied.
- `count` is the default number of pairs produced by `generate()` and export methods.
- `seed` controls deterministic generation.
- `transform` is optional and receives one generated input value.
- `transform_spec` is optional serializable metadata for cross-language specs.
```python
generator.write_json("train.json")
generator.write_jsonl("train.jsonl")
generator.write_csv("train.csv")
```

JSON preserves the full nested record structure. JSONL is convenient for
streaming and line-oriented tooling. CSV flattens nested records into stable
columns such as `input.profile.age`, `input.features[0]`, and `output.score`.
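The column naming described above can be reproduced with a short recursive function. This is a sketch of one plausible flattening rule (dots for object keys, bracketed indexes for array items), not the SDK's exporter; `flatten` is a hypothetical helper.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {column_name: scalar} pairs."""
    if isinstance(value, dict):
        columns = {}
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else key
            columns.update(flatten(child, name))
        return columns
    if isinstance(value, list):
        columns = {}
        for index, child in enumerate(value):
            columns.update(flatten(child, f"{prefix}[{index}]"))
        return columns
    return {prefix: value}


record = {
    "input": {"profile": {"age": 30}, "features": [0.1, 0.2]},
    "output": {"score": 0.5},
}
row = flatten(record)

assert list(row) == [
    "input.profile.age",
    "input.features[0]",
    "input.features[1]",
    "output.score",
]
```

Because arrays in the schema have a fixed length, every record flattens to the same set of columns, which is what keeps the CSV header stable.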
Generators can be serialized as a spec:
```python
spec = generator.to_spec()
restored = TrainingDataGenerator.from_spec(spec)
```

Specs include the seed, count, input schema, output schema, and optional
transform metadata. Host-language callback code is intentionally not serialized;
portable transforms should be represented by a named `transform_spec` and bound
to native code in each SDK.
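Binding a named transform to native code could look like a simple registry lookup. The sketch below is hypothetical throughout: the `TRANSFORMS` table, the `bind_transform` function, and the `{"name": ...}` layout of the transform metadata are assumptions for illustration, since the source does not show the spec's actual field names.

```python
# Hypothetical registry: transform names mapped to native callables.
TRANSFORMS = {
    "double": lambda value: value * 2,
    "is_even": lambda value: value % 2 == 0,
}


def bind_transform(transform_spec):
    """Resolve serializable transform metadata to a native callable."""
    name = transform_spec["name"]
    try:
        return TRANSFORMS[name]
    except KeyError:
        raise LookupError(f"no native transform registered for {name!r}") from None


# Assumed spec fragment; only the name travels with the spec, never the code.
spec = {"seed": 42, "count": 100, "transform_spec": {"name": "double"}}
transform = bind_transform(spec["transform_spec"])
assert transform(4) == 8
```

Each SDK would keep its own registry with the same names, so a spec written by one language can be executed by another as long as both implement the named transform identically.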
The same implementation must produce identical datasets for the same:
- input schema
- output schema
- seed
- count
- transform behavior, when a transform is used
Different seeds should change generated inputs while preserving schema validity. Exports are deterministic so generated JSON, JSONL, and CSV files can be used in tests, demos, examples, and repeatable training experiments.
The original OS-backed random primitive APIs remain available as compatibility
shims while the training-data generator becomes the primary product. New code
should use TrainingDataGenerator and the portable schema helpers.
Python/CLI is the full behavior-complete reference implementation. Other language SDKs are currently generated API surfaces and are advancing through parity hardening in this order: JavaScript, TypeScript, Go, C#, Java, Rust, C/C++, PHP, Ruby, Kotlin, Swift, Dart, R.
Current language status is tracked in docs/SDK_PARITY_STATUS.md.
The GitHub Pages site lives in docs/. Start with docs/index.html for the
developer-facing overview, then see spec/training/README.md for the portable
schema contract.
Run the test suite with:

```shell
python -m unittest discover tests
```