[http_server] Simplify and improve perf of DRT node and table#132

Open
cong1920 wants to merge 1 commit into matt-42:master from cong1920:simplify_drt_node

Conversation

Contributor

@cong1920 cong1920 commented Dec 31, 2024

[http_server] Perf and readability improvements of DRT
Introduce a hybrid_children_map that uses a flat vector<> with linear search
for nodes with < 8 children, auto-upgrading to an unordered_map<> when a node
gains more children, replacing the linear search with fast hashmap access.

This change pays off because most TRIE nodes have 1-5 children, so they
benefit from contiguous, cache-friendly access with a linear search.
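For illustration, the idea above can be sketched roughly as follows. This is a minimal, hypothetical sketch, not the PR's actual code: the class name hybrid_children_map and the threshold of 8 come from the description, but the member names and exact upgrade point are assumptions.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of a hybrid children map: a flat vector with linear search while
// the fan-out stays small (< 8), upgraded to an unordered_map afterwards.
template <typename V> class hybrid_children_map {
  static constexpr std::size_t upgrade_threshold = 8;
  std::vector<std::pair<std::string, V>> flat_; // small fan-out: linear scan
  std::unordered_map<std::string, V> map_;      // large fan-out: hash lookup
  bool upgraded_ = false;

public:
  V* find(const std::string& key) {
    if (!upgraded_) {
      for (auto& [k, v] : flat_) // contiguous, cache-friendly scan
        if (k == key) return &v;
      return nullptr;
    }
    auto it = map_.find(key);
    return it == map_.end() ? nullptr : &it->second;
  }

  V& insert(const std::string& key, V value) {
    if (V* existing = find(key)) return *existing = std::move(value);
    // Upgrade once the flat vector would reach the threshold.
    if (!upgraded_ && flat_.size() + 1 >= upgrade_threshold) {
      for (auto& [k, v] : flat_) map_.emplace(std::move(k), std::move(v));
      flat_.clear();
      upgraded_ = true;
    }
    if (!upgraded_) {
      flat_.emplace_back(key, std::move(value));
      return flat_.back().second;
    }
    return map_.emplace(key, std::move(value)).first->second;
  }

  std::size_t size() const { return upgraded_ ? map_.size() : flat_.size(); }
};
```

Both representations keep the same find/insert interface, so callers never see the upgrade happen.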

Benchmark results (median, 25 iterations, WSL2 g++ -O2):

Standard scenario (152 routes) vs master:
  insert:  274.3 -> 211.7 ns/op  -22.8%
  hit:      80.9 ->  56.0 ns/op  -30.8%
  miss:     45.2 ->  26.7 ns/op  -40.9%

Wide scenario (112 routes, 20-24 children/node) vs master:
  insert:  268.1 -> 212.9 ns/op  -20.6%
  hit:      83.2 ->  62.1 ns/op  -25.4%
  miss:     59.6 ->  33.8 ns/op  -43.3%

Besides the perf improvement, this PR also improves code readability by using
purpose-built wrapper structures instead of `std::vector<std::shared_ptr<T>>`
and other multi-layer nestings of standard containers. These wrappers also
enable optimizations such as bulk allocation.
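To make the bulk-allocation point concrete, here is a hypothetical sketch of the general technique, not the PR's code: instead of a `std::vector<std::shared_ptr<node>>` where each node is heap-allocated and reference-counted separately, the table can own all nodes in one arena. The names `node`, `node_arena`, and `make` are invented for this example; `std::deque` is used because pushing to its ends never invalidates references to existing elements, so plain `node*` links stay valid as the trie grows.

```cpp
#include <cassert>
#include <deque>
#include <string>

// A trie node linked by raw pointers instead of shared_ptr.
struct node {
  std::string key;
  node* first_child = nullptr;
  node* next_sibling = nullptr;
};

// Table-owned arena: nodes are allocated in bulk (deque grows in chunks),
// and addresses of existing nodes are stable across further allocation.
class node_arena {
  std::deque<node> storage_;

public:
  node* make(std::string key) {
    storage_.push_back(node{std::move(key)});
    return &storage_.back();
  }
  std::size_t size() const { return storage_.size(); }
};
```

The arena's lifetime is the table's lifetime, so no per-node reference counting is needed; copying the table can then share one arena rather than many small control blocks.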

=================

It started when I saw std::vector<std::shared_ptr<drt_node>> and std::vector<std::shared_ptr<std::string>> as members of drt_node and dynamic_routing_table respectively, and I couldn't help wondering why the shared_ptr<T> was needed. Later I realized that instances of the DRT table are allowed to be copied, so its objects need to remain shared across all instances. Still, it felt wasteful to have a shared_ptr<T> manage each small object. So I made my first attempt years ago, on New Year's Eve during a family trip, which introduced something new and simplified something else.

However, I wasn't very confident without a perf measurement. I planned to do one but just couldn't sit down for a few hours to design and code some perf tests. Thanks to LLMs, OpenClaw, and all the fancy tooling nowadays: one day this project came back to mind, and I drove all the AI tools at hand to quickly come up with benchmarks/drt.cc and measure. The perf results were mixed: routing node hits and misses were slightly faster, but inserting got much slower.

That might be okay, because it would be rare to add new nodes once the server is up. And with AI assistance, it is not hard to diagnose and try different approaches to improve. After a few iterations, we get this simplified dynamic_routing_table.hh with significant perf improvements across insert, hit, and miss. @matt-42

@cong1920 cong1920 marked this pull request as ready for review December 31, 2024 08:06
@cong1920 cong1920 marked this pull request as draft December 31, 2024 08:07
@cong1920 cong1920 force-pushed the simplify_drt_node branch from 1966842 to 7a5aa97 Compare March 16, 2025 07:58
@cong1920 cong1920 force-pushed the simplify_drt_node branch from 7a5aa97 to 41b5f88 Compare March 6, 2026 06:00
@cong1920 cong1920 force-pushed the simplify_drt_node branch from 41b5f88 to 5dd4518 Compare March 16, 2026 05:36
Contributor Author

cong1920 commented Mar 16, 2026

Please ignore this. See my latest reply below.

=================

With the benchmark test added in commit ac446b1, I found that this refactoring actually makes route insert into the DRT slower, while URL hit/miss is almost unchanged. Thanks to LLM-assisted coding nowadays.

                          master   simplify_drt_node   delta
   Insert (ns/op)          295.8        315.6          +6.7% (slower)
   Lookup hit (ns/op)       83.3         82.6          -0.8% (same)
   Lookup miss (ns/op)      46.2         44.9          -2.8% (faster)

Considering that route inserts are uncommon once the server is set up and running, maybe this perf "regression" is okay? @matt-42

Sorry for this long-delayed PR. I always wanted to add some perf tests with it but kept being too lazy to do so. Nowadays every developer is driving LLMs to code, so I decided to give it a try tonight :)

If the route insert "regression" is not okay, I can continue driving my AI agent to improve it.

@cong1920 cong1920 force-pushed the simplify_drt_node branch from 0254001 to 8863e98 Compare March 28, 2026 05:19
@cong1920 cong1920 changed the title [http_server] Simplify DRT node and table [http_server] Simplify and improve perf of DRT node and table Mar 28, 2026
@cong1920 cong1920 marked this pull request as ready for review March 28, 2026 05:29
@cong1920
Contributor Author

I decided to let LLMs help me again on how to improve, and the outcome is impressive. No more regressions; improvements across the board.

Benchmark results (median, 25 iterations, WSL2 g++ -O2):

Standard scenario (152 routes) vs master:
  insert:  274.3 -> 211.7 ns/op  -22.8%
  hit:      80.9 ->  56.0 ns/op  -30.8%
  miss:     45.2 ->  26.7 ns/op  -40.9%

Wide scenario (112 routes, 20-24 children/node) vs master:
  insert:  268.1 -> 212.9 ns/op  -20.6%
  hit:      83.2 ->  62.1 ns/op  -25.4%
  miss:     59.6 ->  33.8 ns/op  -43.3%

