____ ____ _ _ ____ _ _ _ _
/ ___| | _ \ | | | | | _ \ | | | | | | | |
| | _ | |_) | | | | | | |_) | | |_| | | |_| |
| |_| | | __/ | |_| | | _ < | _ | | _ |
\____| |_| \___/ |_| \_\ |_| |_| |_| |_|
P o o a
U b o s
i d h
n i
n
g
A header-only CUDA library implementing a parallel hash table for NVIDIA GPUs, using Robin Hood open-addressing with bucketed (cache-line-aligned), sub-warp-cooperative probing. The design target is bandwidth-bound performance on large tables.
Check out my recent blog post about this codebase. This includes performance benchmarks that show strong performance against cuCollections and WarpCore baselines for retrieval at sufficient load factors, along with information about how it was built.
Insert and lookup are implemented, tested, and benchmarked against cuCollections and WarpCore — on a 4090, lookups perform favorably at sufficiently high load factors. See docs/ for design rationale and scope.
- NVIDIA GPU, sm_70 or newer (64-bit slots run natively on all supported arches; 128-bit slots run natively on sm_90+ and via libcu++ lock-table fallback otherwise).
- CUDA Toolkit 13.x (for the C++20 + libcu++
bit_castcombination this library relies on). - A C++20 host compiler compatible with that nvcc.
include/gpurhh/— the library headers. Add this to your include path; everything ships from here.tests/— per-topic test executables built against the public header.examples/— small standalone programs.examples/basic.cuis a self-contained construct + bulk insert + bulk lookup walkthrough.benchmarks/— apples-to-apples timing benchmarks (with optional cuCollections and WarpCore baselines) and a bandwidth experiment.docs/— design documentation (the canonical record of why the implementation looks the way it does).
#include <gpurhh/hash_table.cuh>
using Table = gpurhh::HashTable<std::uint32_t, std::uint32_t>;
Table table(1u << 20); // host-side; capacity rounded up to next pow-2
__global__ void insert_kernel(Table::View view, const std::uint32_t* keys,
const std::uint32_t* values, std::size_t n) {
namespace cg = cooperative_groups;
auto block = cg::this_thread_block();
auto tile = cg::tiled_partition<Table::tile_size>(block);
const std::size_t tile_id = (blockIdx.x * blockDim.x + threadIdx.x) / Table::tile_size;
const std::size_t total_tiles = (gridDim.x * blockDim.x) / Table::tile_size;
for (std::size_t i = tile_id; i < n; i += total_tiles) {
view.insert(tile, keys[i], values[i]);
}
}See examples/basic.cu for the complete walkthrough and docs/usage.md for the API surface and the View pattern.
The repository ships a single Makefile:
make— build tests, examples, and benchmarks.make tests/make examples/make benchmarks— build one group.make test— build and run all tests.make clean— wipebuild/.
Overrides via env vars: make ARCH=sm_90 CXX_STD=c++20 OPT=-O3.
Benchmark binaries always build at -O3; unoptimized throughput numbers are meaningless.
Optional baselines under external/include/ are picked up automatically; populate them with scripts/setup-baselines.sh if you want the cuCollections / WarpCore comparisons.
scripts/benchmark.sh --timing
scripts/benchmark.sh --memory-bandwidth
scripts/benchmark.sh --all # everything availableOutput lands under output/<timestamp>/, with run_info.txt capturing nvcc --version, uname, and nvidia-smi for each run.
Design and rationale live under docs/, starting with docs/index.md. The C++ API reference is produced separately by Doxygen from the headers themselves.
MIT.