fm-bench

Benchmark Apple's fm command on macOS 27+.

Measure latency, throughput, streaming smoothness, stability, and goodput across Apple Foundation Models — with repeatable prompt suites and JSON/CSV reports for automation.

Why fm-bench exists

Apple's Foundation Models can run on-device, through Private Cloud Compute, and through model adapters. That makes raw model quality only half the story.

For real apps, the important questions are:

How fast does the first token arrive?
Does streaming stay smooth?
How stable is latency over repeated runs?
What happens under concurrency?
Which model/hardware pair meets an interactive SLO?

fm-bench answers those questions with repeatable local benchmarks. Think of it as Geekbench for Apple Foundation Models: run it, get numbers, and compare across hardware, models, and macOS updates.

Quick Start

npm install -g fm-bench
fm-bench

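One command discovers your models, runs the standard prompt suite, and prints a full benchmark report. The sample below comes from a larger run (5 prompts, 3 runs each, concurrency 1 and 2, with SLO budgets set), so a run with the defaults listed under Options Reference will show smaller counts: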

fm-bench 0.5.0 | darwin/arm64 | fm
prompts 5 | runs 3 | concurrency 1,2 | stream on | measured 30 | failed 0 | skipped 0 | elapsed 42.10s | SLO TTFT<=750ms,E2E<=4.00s

┌───┬────────┬────────┬─────────┬──────┬──────┬──────────┬──────┬──────────┬─────┬─────┐
│ C │ MODEL  │ STATUS │ OK/RUNS │ SUCC │ GOOD │ GOOD RPS │ TTFT │ E2E P95  │ SYS │ CV  │
├───┼────────┼────────┼─────────┼──────┼──────┼──────────┼──────┼──────────┼─────┼─────┤
│ 1 │ system │ ok     │   15/15 │ 100% │  93% │      0.4 │ 318ms│    3.20s │  42 │ 12% │
│ 2 │ system │ ok     │   15/15 │ 100% │  80% │      0.7 │ 501ms│    4.40s │  68 │ 21% │
└───┴────────┴────────┴─────────┴──────┴──────┴──────────┴──────┴──────────┴─────┴─────┘

Wide terminals add TTFT P95, TPOT, decode/prefill throughput, chunk-gap smoothness, and 95% CI columns. Narrow terminals switch to compact model cards automatically.

Install

npm install -g fm-bench

Install directly from GitHub (always latest):

npm install -g --install-links git+https://github.com/devinoldenburg/fm-bench.git

Local development:

npm install && npm link
fm-bench doctor   # verify your setup

Requirements: macOS 27+, Node.js 20+, Apple Intelligence enabled.

Commands

| Command | What it does |
| --- | --- |
| `fm-bench` | Run the full benchmark (default) |
| `fm-bench models` | List discovered models, availability, and quota |
| `fm-bench compare <a.json> <b.json>` | Regression diff: before/after metrics with color-coded deltas |
| `fm-bench history [dir]` | Chronological trend table from a directory of saved reports |
| `fm-bench legend` | Definitions for every table column and color rule |
| `fm-bench doctor` | Environment check: Node, macOS, `fm`, CPU, memory, thermals, battery |

Common Recipes

# Quick smoke test
fm-bench --profile quick

# Standard 5-run benchmark with SLO budgets
fm-bench --runs 5 --slo-ttft-ms 750 --slo-e2e-ms 4000

# Sweep concurrency to find your throughput ceiling
fm-bench --sweep-concurrency 1,2,4 --runs 3

# Stress both on-device and PCC models
fm-bench --models system,pcc --runs 3 --profile stress

# Reasoning and coding workloads
fm-bench --profile reasoning --runs 5
fm-bench --profile coding --runs 3 --histogram

# Archive runs and compare before/after a macOS update
fm-bench --output-dir reports/ --tag before-update
fm-bench --output-dir reports/ --tag after-update
fm-bench compare reports/fm-bench_*before*.json reports/fm-bench_*after*.json

# Fail CI when SLOs regress
fm-bench --ci --slo-ttft-ms 750 --slo-e2e-ms 4000 --runs 5

# Save JSON or CSV for automation
fm-bench --json --out bench.json
fm-bench --format csv --out bench.csv

Options Reference

Workload

| Flag | Default | Description |
| --- | --- | --- |
| `-m, --models <list>` | all | Comma-separated or repeated model names |
| `-r, --runs <n>` | 1 | Measured runs per prompt/model |
| `--warmup <n>` | 0 | Warmup runs per model before measurement |
| `-c, --concurrency <n>` | 1 | Parallel `fm` processes |
| `--sweep-concurrency <list>` | none | Separate operating points, e.g. `1,2,4` |
| `--request-rate <rps>` | none | Pace request starts at a target rate |
| `--ramp-up-ms <n>` | 0 | Gradually ramp pacing over `n` ms |
| `--timeout-ms <n>` | 60000 | Timeout per `fm` call |
| `--retry <n>` | 0 | Retry failed calls with exponential backoff (500ms–4s) |
| `--profile <name>` | standard | Built-in prompt suite (see Profiles) |
| `-p, --prompt <text>` | none | Custom prompt, repeatable |
| `--prompt-file <file>` | none | JSON, JSONL, or blank-line separated prompts |
| `-i, --instructions <text>` | none | Passed to `fm respond` |
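
The `--retry` range of 500ms to 4s suggests a doubling schedule with a cap. The Python sketch below shows one such schedule. The doubling factor and the name `backoff_delays_ms` are assumptions for illustration and may not match how fm-bench computes its delays.

```python
def backoff_delays_ms(retries: int, base_ms: int = 500, cap_ms: int = 4000) -> list[int]:
    """Delay before each retry: starts at base_ms, doubles, and never exceeds cap_ms.

    Assumption: doubling between attempts; only the 500ms and 4s bounds
    come from the option description.
    """
    return [min(base_ms * (2 ** attempt), cap_ms) for attempt in range(retries)]


# Delays for --retry 5 under these assumptions.
print(backoff_delays_ms(5))
```

Under these assumptions the fourth retry already reaches the cap, so higher `--retry` values only add more 4s waits.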

Quality Gates

| Flag | Description |
| --- | --- |
| `--slo-ttft-ms <n>` | Count a run as good only if TTFT ≤ n ms |
| `--slo-e2e-ms <n>` | Count a run as good only if E2E latency ≤ n ms |
| `--slo-tpot-ms <n>` | Count a run as good only if TPOT ≤ n ms |
| `--ci` | Exit 1 if any run fails or any SLO is violated (for pipelines) |
| `--fail-fast` | Stop after the first failed run |
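
One way to read these gates: a run counts as good only if it succeeded and met every budget that was set. The Python sketch below illustrates that rule and the `--ci` exit code. The `Run` record, its field names, and the choice to skip the TPOT check when no TPOT was measured are assumptions, not the tool's report schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    ok: bool                         # did the call succeed?
    ttft_ms: float                   # time to first token
    e2e_ms: float                    # end-to-end latency
    tpot_ms: Optional[float] = None  # time per output token, if measured


def is_good(run: Run, slo_ttft_ms=None, slo_e2e_ms=None, slo_tpot_ms=None) -> bool:
    """A run is good if it succeeded and stayed within every budget that is set."""
    if not run.ok:
        return False
    if slo_ttft_ms is not None and run.ttft_ms > slo_ttft_ms:
        return False
    if slo_e2e_ms is not None and run.e2e_ms > slo_e2e_ms:
        return False
    # Assumption: a run without a TPOT measurement is not checked against the TPOT budget.
    if slo_tpot_ms is not None and run.tpot_ms is not None and run.tpot_ms > slo_tpot_ms:
        return False
    return True


def goodput_rate(runs: Sequence[Run], **slo) -> float:
    """Fraction of all runs that are good; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(is_good(r, **slo) for r in runs) / len(runs)


def ci_exit_code(runs: Sequence[Run], **slo) -> int:
    """Mirror of the --ci rule: 1 if any run failed or broke an SLO, else 0."""
    return 0 if all(is_good(r, **slo) for r in runs) else 1
```

For example, with `slo_ttft_ms=750` and `slo_e2e_ms=4000`, a successful run with an E2E latency of 4400 ms still lowers goodput even though it counts toward the success rate.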

Output

| Flag | Description |
| --- | --- |
| `--json` / `--csv` | Output format (also `--format table\|json\|csv`) |
| `-o, --out <file>` | Save a report to a file |
| `--output-dir <dir>` | Auto-save a timestamped JSON report to a directory |
| `--tag <name>` | Label this run; repeatable; appears in payload and header |
| `--note <text>` | Freeform annotation in payload and header |
| `--histogram` | Print ASCII latency distribution chart after the report |
| `--capture-output` | Include raw model output in JSON reports |
| `-v, --verbose` | Append per-run CSV after the summary table |

Display

| Flag | Description |
| --- | --- |
| `--color` / `--no-color` | Force or disable ANSI colors (auto on TTYs) |
| `--ascii` | Plain ASCII table borders instead of Unicode |
| `--compact` | Force narrow terminal layout |
| `--width <n>` | Render as if the terminal is `n` columns wide |
| `--progress` / `--no-progress` | Force or disable the live progress line |

Prompt Profiles

Nine built-in suites are available. Choose the one that matches your use case:

| Profile | Prompts | Best for |
| --- | --- | --- |
| `quick` | 1 | Smoke test, fast health check |
| `standard` | 3 | Default: short chat, JSON generation, medium output |
| `interactive` | 3 | Conversational latency (TTFT-heavy) |
| `throughput` | 3 | Longer generation, token throughput signal |
| `client` | 5 | Real-world mix: chat, content, extraction, summarization, code |
| `stress` | 5 | High-load mix with math and reasoning |
| `reasoning` | 5 | Multi-step logic, estimation, debugging; capability plus speed |
| `coding` | 5 | Code review, refactoring, algorithms, system design |
| `creative` | 5 | Product copy, analogies, commit messages, docs |

Regression Tracking

Track performance across macOS updates, model changes, or hardware swaps:

# Before
fm-bench --profile coding --runs 5 --output-dir reports/ --tag before

# After the change
fm-bench --profile coding --runs 5 --output-dir reports/ --tag after

# See what changed
fm-bench compare reports/fm-bench_*before*.json reports/fm-bench_*after*.json

The compare output lists each model/concurrency row with the before value, a color-coded percent delta (green = improvement, red = regression), and the after value. It covers TTFT, E2E, TPOT, tokens/s, RPS, success rate, and CV.
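
Whether a delta is an improvement depends on the metric's direction: a negative delta is good for latency and CV, while a positive delta is good for throughput and success rate. The Python sketch below shows that logic. The metric key names and the `neutral` label are hypothetical and are not taken from fm-bench's report format.

```python
from typing import Optional

# Hypothetical metric keys, grouped by which direction counts as better.
LOWER_IS_BETTER = {"ttft_ms", "e2e_ms", "tpot_ms", "cv"}
HIGHER_IS_BETTER = {"tokens_per_s", "rps", "success_rate"}


def percent_delta(before: float, after: float) -> Optional[float]:
    """Relative change from before to after, in percent; None if before is 0."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


def delta_color(metric: str, before: float, after: float) -> str:
    """Return 'green' for an improvement, 'red' for a regression, 'neutral' otherwise."""
    if metric not in LOWER_IS_BETTER and metric not in HIGHER_IS_BETTER:
        raise ValueError(f"unknown metric: {metric}")
    delta = percent_delta(before, after)
    if delta is None or delta == 0:
        return "neutral"
    improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    return "green" if improved else "red"
```

With these rules, a TTFT that drops from 400 ms to 300 ms is a -25% delta shown in green, while an RPS that drops from 0.5 to 0.4 is shown in red.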

# View the full trend over time
fm-bench history reports/

CI Integration

Gate deployments or model updates on benchmark quality:

# Fails with exit code 1 if TTFT > 750ms or E2E > 4s on any run
fm-bench --ci --slo-ttft-ms 750 --slo-e2e-ms 4000 --runs 5

Prints `fm-bench ci: PASS` or `fm-bench ci: FAIL — <reason>` to stderr. It is designed for GitHub Actions, Buildkite, or any shell-based pipeline.

Prompt Files

JSON array:

[
  { "id": "tiny", "prompt": "Reply with exactly: ok" },
  { "id": "json", "prompt": "Convert alpha, beta, gamma into JSON." }
]

JSONL:

{"id":"tiny","prompt":"Reply with exactly: ok"}
{"id":"latency","prompt":"Explain p95 latency in one sentence."}

Plain text files are split on blank lines.
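
The three formats above can be told apart from the file content alone. The Python sketch below is one way to do that. Detecting the format by content (rather than by file extension) and the generated `prompt-N` ids for plain-text prompts are assumptions, not documented fm-bench behavior.

```python
import json
import re


def parse_prompts(text: str) -> list:
    """Parse prompt-file content as a JSON array, JSONL, or blank-line separated text."""
    stripped = text.strip()
    if not stripped:
        return []

    # 1) JSON array of {"id": ..., "prompt": ...} objects.
    if stripped.startswith("["):
        try:
            data = json.loads(stripped)
            if isinstance(data, list):
                return data
        except json.JSONDecodeError:
            pass  # not valid JSON; fall through to the other formats

    # 2) JSONL: every non-empty line is a JSON object.
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    try:
        records = [json.loads(ln) for ln in lines]
        if all(isinstance(r, dict) for r in records):
            return records
    except json.JSONDecodeError:
        pass  # at least one line is not JSON; treat the file as plain text

    # 3) Plain text: one prompt per block, blocks separated by blank lines.
    blocks = re.split(r"\n\s*\n", stripped)
    return [
        {"id": f"prompt-{i}", "prompt": block.strip()}  # ids are hypothetical
        for i, block in enumerate(blocks, start=1)
    ]
```

A multi-line prompt in a plain-text file stays in one piece as long as it contains no blank line.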

Metrics

Latency — TTFT (p50/p95), E2E (p50/p95/p99), TPOT (p50/p95), 95% confidence interval, coefficient of variation (CV).
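
For readers who want to reproduce the latency summaries from per-run values, here is a minimal Python sketch. It assumes nearest-rank percentiles and a normal-approximation confidence interval around the mean; fm-bench may use a different interpolation or interval method (see docs/methodology.md).

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def mean_ci95(values):
    """95% confidence interval for the mean, using the normal approximation (z = 1.96)."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return (mean, mean)  # no spread can be estimated from a single run
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return (mean - half_width, mean + half_width)
```

With only a handful of runs, p95 and p99 often land on the same (largest) sample, which is one reason to raise `--runs` before trusting tail latencies.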

Throughput — prefill tokens/s, decode tokens/s, output tokens/s per request, aggregate system tokens/s, requests per second.
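
The per-request rates can be derived from four numbers: TTFT, E2E latency, and the input and output token counts. The Python sketch below uses definitions common in LLM serving benchmarks (the first token is attributed to prefill, the rest to decode). These formulas are assumptions here; docs/methodology.md is the reference for what fm-bench actually computes.

```python
def request_metrics(ttft_ms: float, e2e_ms: float, input_tokens: int, output_tokens: int) -> dict:
    """Derive TPOT and token rates for one request; None where a value is undefined."""
    decode_ms = e2e_ms - ttft_ms
    decode_tokens = output_tokens - 1  # assumption: the first token belongs to prefill/TTFT
    return {
        # Average time per output token after the first one.
        "tpot_ms": decode_ms / decode_tokens if decode_tokens > 0 else None,
        # Input tokens processed per second before the first token appears.
        "prefill_tokens_per_s": input_tokens * 1000 / ttft_ms if ttft_ms > 0 else None,
        # Tokens generated per second during the decode phase.
        "decode_tokens_per_s": (
            decode_tokens * 1000 / decode_ms if decode_tokens > 0 and decode_ms > 0 else None
        ),
        # Output tokens per second over the whole request.
        "output_tokens_per_s": output_tokens * 1000 / e2e_ms if e2e_ms > 0 else None,
    }
```

For a request with a 400 ms TTFT, 3400 ms E2E, 80 input tokens, and 101 output tokens, this gives a TPOT of 30 ms and a prefill rate of 200 tokens/s.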

Streaming quality — second-chunk delay, chunk-gap p95. Captured from stdout chunks during streaming runs.
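
Both streaming metrics can be computed from the arrival times of stdout chunks. The Python sketch below assumes timestamps in milliseconds relative to the request start and a nearest-rank p95; the function and key names are hypothetical.

```python
import math


def streaming_smoothness(arrivals_ms):
    """Second-chunk delay and p95 of the gaps between consecutive chunks."""
    # Gap i is the time between chunk i and chunk i + 1.
    gaps = [later - earlier for earlier, later in zip(arrivals_ms, arrivals_ms[1:])]
    if not gaps:
        # Fewer than two chunks: there is nothing to measure.
        return {"second_chunk_delay_ms": None, "chunk_gap_p95_ms": None}
    ordered = sorted(gaps)
    rank = max(1, math.ceil(95 * len(ordered) / 100))  # nearest-rank p95 (assumption)
    return {
        "second_chunk_delay_ms": gaps[0],
        "chunk_gap_p95_ms": ordered[rank - 1],
    }
```

A single long stall dominates the p95 gap in short responses, so a high value here points to visible stutter even when TTFT looks healthy.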

Reliability — success rate, goodput rate and RPS against SLO budgets, repeatability (most common output hash frequency across repeated runs).
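
Repeatability, as described above, is the share of runs that produced the most common output. A minimal Python sketch follows; the use of SHA-256 and hashing the raw output without normalization are assumptions about details the description leaves open.

```python
import hashlib
from collections import Counter


def repeatability(outputs):
    """Frequency of the most common output hash across repeated runs (0.0 to 1.0)."""
    if not outputs:
        return 0.0
    # Hash each raw output; identical text yields identical hashes.
    hashes = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in outputs]
    _, top_count = Counter(hashes).most_common(1)[0]
    return top_count / len(hashes)
```

Because the raw text is hashed, outputs that differ only in case or whitespace count as different in this sketch.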

Stability — CV (stddev/mean for E2E latency); green ≤10%, yellow ≤25%, red >25%.
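
The CV thresholds above translate directly into code. In the Python sketch below, the use of the sample standard deviation (rather than the population one) is an assumption; the color boundaries come from the text.

```python
import statistics


def coefficient_of_variation(e2e_ms):
    """CV = stddev / mean of E2E latency; 0.0 when it cannot be estimated."""
    if len(e2e_ms) < 2:
        return 0.0
    mean = statistics.fmean(e2e_ms)
    if mean == 0:
        return 0.0
    return statistics.stdev(e2e_ms) / mean  # assumption: sample standard deviation


def cv_color(cv):
    """Map a CV (as a fraction, e.g. 0.12 for 12%) to the documented color bands."""
    if cv <= 0.10:
        return "green"
    if cv <= 0.25:
        return "yellow"
    return "red"
```

For example, a CV of 12% falls in the yellow band and 21% is still yellow, while anything above 25% is red.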

Token counts come from `fm token-count --quiet`. If `fm` cannot count tokens, those fields are left blank, but character throughput is still reported.

Terminal layout is responsive: wide → full scoreboard + detail tables, medium → tighter single table, narrow → compact model cards. Use --width to preview any layout and --ascii for log-friendly output.

Colors and Legend

Table output is color-coded on interactive terminals — green is better/passing, yellow is marginal/partial, red is failing/unstable. Fixed thresholds apply to success rate, goodput, CV, and repeatability. Latency uses SLO thresholds when set, otherwise lower-is-better relative ranking. Throughput uses higher-is-better relative ranking.

fm-bench legend          # full column definitions and color rules
fm-bench legend --json   # machine-readable

`NO_COLOR=1` disables color; `FORCE_COLOR=1` or `--color` enables it. Use `--ascii` to switch to plain ASCII borders for log systems.

A live single-line progress indicator runs on stderr during interactive sessions. The final report always goes to stdout, so `--json`, `--csv`, and `--out` stay automation-friendly.

Requirements

macOS 27 or newer (Apple's fm CLI is preinstalled).
Node.js 20 or newer.
Apple Intelligence enabled on the device.

`pcc` (Private Cloud Compute) availability depends on Apple's current eligibility rules. fm-bench shows it as skipped if `fm available --model pcc` reports it as unavailable.

See docs/methodology.md for benchmark methodology and metric references.

Development

npm install
npm test     # node --test
npm run lint

No runtime npm dependencies.

License

MIT
