79 changes: 79 additions & 0 deletions python/benchmarks/non_asv/bench_col_stats.py
@@ -0,0 +1,79 @@
import argparse
import atexit
import shutil
import signal
import time

import numpy as np
import pandas as pd
import psutil

from arcticdb import Arctic

LMDB_PATH = "/tmp/arcticdb_bench_col_stats"


def _cleanup():
    shutil.rmtree(LMDB_PATH, ignore_errors=True)


atexit.register(_cleanup)
signal.signal(signal.SIGINT, lambda *_: exit(130))
Collaborator

You shouldn't do cleanup like this. This file should end with:

if __name__ == "__main__":
  try:
    run()
  finally:
    cleanup()

or you could put the same idea in your run(). There's no need for exit handlers.
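A minimal sketch of why the `try`/`finally` form is enough, assuming the default Python SIGINT handling (Ctrl-C raises `KeyboardInterrupt` in the main thread). The names `workdir` and `main`, and the temporary directory itself, are hypothetical stand-ins for the benchmark body and `LMDB_PATH`:

```python
import os
import shutil
import tempfile


def run(workdir: str) -> None:
    # Stand-in for the benchmark body: create a file, then simulate Ctrl-C.
    open(os.path.join(workdir, "data.bin"), "wb").close()
    raise KeyboardInterrupt


def cleanup(workdir: str) -> None:
    shutil.rmtree(workdir, ignore_errors=True)


def main() -> str:
    workdir = tempfile.mkdtemp(prefix="bench_cleanup_demo_")
    try:
        try:
            run(workdir)
        finally:
            # Runs on normal exit, on exceptions, and on KeyboardInterrupt.
            cleanup(workdir)
    except KeyboardInterrupt:
        pass  # swallowed only so the demo can report the result
    return workdir


if __name__ == "__main__":
    leftover = main()
    print("directory still exists:", os.path.exists(leftover))
```

The `finally` block runs before the interrupt propagates, so neither `atexit` nor a custom `signal.signal` handler is needed for this case.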

Collaborator Author

done

Collaborator

You don't need this??

Collaborator Author

done


SCENARIOS = [
    # (rows, cols)
    (10, 10),
    (1_000, 1_000),
    (100_000, 1_000),
    (100_000, 10_000),
    (1_000_000, 1_000),
    (1_000_000, 10_000),
    (10_000_000, 1_000),
]

parser = argparse.ArgumentParser()
parser.add_argument("--simple_tests", action="store_true", help="Run only the first 4 scenarios")
args = parser.parse_args()

scenarios = SCENARIOS[:2] if args.simple_tests else SCENARIOS
Contributor

Bug: this slices the first 2 scenarios, but the --simple_tests help text on line 35 advertises "Run only the first 4 scenarios". Either update the help string or change the slice (e.g. SCENARIOS[:4]) so the flag actually does what it claims.
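One way to keep the help string and the slice from drifting apart is to derive both from a single constant. This is only a sketch: `N_SIMPLE`, `build_parser` and `select_scenarios` are hypothetical names not present in the PR, and the value 2 is an assumption about which behaviour is intended:

```python
import argparse

SCENARIOS = [
    # (rows, cols)
    (10, 10),
    (1_000, 1_000),
    (100_000, 1_000),
    (100_000, 10_000),
    (1_000_000, 1_000),
    (1_000_000, 10_000),
    (10_000_000, 1_000),
]

# Hypothetical constant: single source of truth for the subset size.
N_SIMPLE = 2


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--simple_tests",
        action="store_true",
        # The help text is generated from the same constant as the slice.
        help=f"Run only the first {N_SIMPLE} scenarios",
    )
    return parser


def select_scenarios(argv):
    args = build_parser().parse_args(argv)
    return SCENARIOS[:N_SIMPLE] if args.simple_tests else SCENARIOS
```

Changing `N_SIMPLE` then updates the flag's behaviour and its documentation together.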


ac = Arctic(f"lmdb://{LMDB_PATH}")
if not ac.has_library("bench"):
    ac.create_library("bench")
lib = ac.get_library("bench")
nvs = lib._nvs

results = []

for rows, cols in scenarios:
    sym = f"r{rows}_c{cols}"

    df = pd.DataFrame(
        np.random.rand(rows, cols).astype(np.float64),
        columns=[f"col_{i}" for i in range(cols)],
    )

    t0 = time.time()
    lib.write(sym, df)
    write_time = time.time() - t0

    col_stats = {f"col_{i}": {"MINMAX"} for i in range(cols)}

    mem_before = psutil.Process().memory_info().rss / 1e6
Collaborator

@poodlewars poodlewars May 13, 2026

These memory measurements are wrong: we care about peak RSS during the function calls, not RSS after they're done. That means each experiment must run in its own process. We also care about peak RSS for the write call. So each run of the script should do just one write or one stats creation for a single example; you can then write a bash script to run your code across the different examples and summarize the results in a readable way.
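A rough sketch of the per-process peak measurement, assuming a Unix host. It swaps `psutil` for the standard-library `resource` module, whose `ru_maxrss` field reports the high-water mark of the process rather than its current RSS. `peak_rss_mb`, `run_single_case` and the stand-in workload are hypothetical names for illustration; in the real script the operation would be a single `lib.write` or `create_column_stats` call:

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1e6


def run_single_case(operation) -> float:
    """Run exactly one operation, then return the process-wide peak RSS.

    The peak covers the whole process lifetime, which is why each case
    needs a fresh process: an earlier, larger case would mask a later one.
    """
    operation()
    return peak_rss_mb()


if __name__ == "__main__":
    # Stand-in workload: build a ~50 MiB bytes object, then discard it.
    def workload():
        buf = b"x" * (50 * 1024 * 1024)
        del buf

    print(f"peak_rss_mb={run_single_case(workload):.1f}")
```

Because the buffer is freed before the measurement, current RSS would miss it, while the peak still records it. A driver script can then launch one process per (operation, rows, cols) combination and collect the printed values.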

    t0 = time.time()
    nvs.create_column_stats(sym, col_stats)
    stats_time = time.time() - t0
Collaborator

Unless test cases are very slow, it's always best to run timings with a couple of un-measured warmup iterations, then N measured iterations, to get more stable results. Then report the mean, median and max across the measured iterations.
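The warmup-then-measure pattern described above could look roughly like this. `time_call` and its `setup` hook are hypothetical helpers, not part of the PR; `setup` is an assumption to cover cases such as dropping existing column stats before each timed `create_column_stats` call:

```python
import statistics
import time


def time_call(fn, warmup=2, iterations=5, setup=None):
    """Time fn() over several iterations after un-measured warmup runs.

    setup, if given, runs before every call and is excluded from the timing.
    """
    for _ in range(warmup):
        if setup is not None:
            setup()
        fn()  # result and duration deliberately ignored

    samples = []
    for _ in range(iterations):
        if setup is not None:
            setup()
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)

    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    stats = time_call(lambda: sum(range(100_000)))
    print({name: f"{seconds:.6f}s" for name, seconds in stats.items()})
```

`time.perf_counter` is used here instead of `time.time` because it is a monotonic, high-resolution clock intended for measuring short durations.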

Collaborator Author

done

    mem_after = psutil.Process().memory_info().rss / 1e6

    nvs.drop_column_stats(sym)
    lib.delete(sym)

    results.append((rows, cols, write_time, stats_time, mem_after - mem_before))
    print(f"rows={rows:>10,} cols={cols:>6,} write={write_time:6.2f}s stats={stats_time:6.2f}s mem_delta={mem_after - mem_before:+.1f} MB")

print()
print(f"{'rows':>12} {'cols':>6} {'write_symbol_time':>8} {'stats_create_time':>8} {'consumed_memory_mb':>14}")
print("-" * 60)

for rows, cols, write_time, stats_time, mem_delta in results:
    print(f"{rows:>12,} {cols:>6,} {write_time:>8.2f} {stats_time:>8.2f} {mem_delta:>+14.1f}")
3 changes: 3 additions & 0 deletions python/benchmarks/non_asv/bench_col_stats.sh
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python "$(dirname "$0")/bench_col_stats.py" "$@"