feat(p2p): latency-aware peer selection for faster sync and better propagation#16
Conversation
- Add SuccessCount, LastLatency, LastTopoHeight, LastMeasured to Peer - Replace greedy peer selection with weighted random reservoir sampling - Capture latency from ping path via Peer_UpdateLatency - Sort seed nodes and sync partners by latency - Fix Peer_SetSuccess race: move after Peer_Add in dispatch_test_handshake - Enhance PeerList_Print with success count, latency, age columns - Add 14 table-driven peerScore tests Verified: testnet (2-node, both whitelisted), mainnet full fastsync (2.5h to tip, 33/45 peers with data, 75ms-1012ms, zero crashes).
|
P2P_PEER_PRIORITIZATION.md Additional documentation detailing the PR. |
DHEBP
left a comment
There was a problem hiding this comment.
Overall: sound and mergeable. Verified locally — whole module builds clean, go vet ./p2p/ passes, and the 14 tests pass under -race. The core latency-scoring logic is correct, the reservoir sampling is textbook (A-Res weighted + Algorithm R uniform), and the latency unit math (ns → ms) checks out against rtt_micro. The Peer_SetSuccess move is a genuine fix — it previously lived in the server-side Handshake handler gated by if !c.Incoming, which almost never fired, so SuccessCount stayed 0.
A few minor cleanups below before merge — only the gofmt one I'd consider a real gate. The two design notes (trigger_sync load concentration, Global_Random concurrency) are judgment calls for the maintainers, not defects.
Nice work — the weight floor to prevent peer starvation and the TTL decay on the latency bonus are thoughtful touches.
| ConnectAfter uint64 `json:"connectafter"` // we should connect when the following timestamp passes | ||
| BlacklistBefore uint64 `json:"blacklistbefore"` // peer blacklisted till epoch , priority nodes are never blacklisted, 0 if not blacklist | ||
| GoodCount uint64 `json:"goodcount"` // how many times peer has been shared with us | ||
| SuccessCount uint64 `json:"successcount"` // outbound connection successes (for scoring) |
There was a problem hiding this comment.
Doc nit: documented as "outbound connection successes", but since dispatch_test_handshake runs for both incoming (controller.go:589) and outgoing (controller.go:741) connections, Peer_SetSuccess now increments on inbound handshakes too. That matches the "both sides whitelist each other" intent — just worth syncing the comment so the next reader isn't misled.
| SuccessCount uint64 `json:"successcount"` // outbound connection successes (for scoring) | |
| SuccessCount uint64 `json:"successcount"` // successful handshakes, in or out (for scoring) |
| p.FailCount = 0 // fail count is zero again | ||
| p.ConnectAfter = 0 | ||
| p.Whitelist = true | ||
| p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected | ||
| p.SuccessCount++ | ||
|
|
||
| // logger.Infof("Setting peer as white listed") | ||
| // logger.Infof("Setting peer as white listed") |
There was a problem hiding this comment.
gofmt: this block is over-indented by one tab level. gofmt -l also flags p2p/peer_pool_test.go (struct field alignment). A single gofmt -w p2p/ cleans both. (Heads-up: controller.go also shows a gofmt diff, but that's a pre-existing misalignment in P2P_Shutdown you didn't touch — ignore it.)
| p.FailCount = 0 // fail count is zero again | |
| p.ConnectAfter = 0 | |
| p.Whitelist = true | |
| p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected | |
| p.SuccessCount++ | |
| // logger.Infof("Setting peer as white listed") | |
| // logger.Infof("Setting peer as white listed") | |
| p.FailCount = 0 // fail count is zero again | |
| p.ConnectAfter = 0 | |
| p.Whitelist = true | |
| p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected | |
| p.SuccessCount++ | |
| // logger.Infof("Setting peer as white listed") |
| // sort by height descending (furthest ahead first), then latency ascending (fastest among equal) | ||
| sort.SliceStable(clist, func(i, j int) bool { | ||
| hi := atomic.LoadInt64(&clist[i].Height) | ||
| hj := atomic.LoadInt64(&clist[j].Height) | ||
| if hi != hj { | ||
| return hi > hj | ||
| } | ||
| li := atomic.LoadInt64(&clist[i].Latency) | ||
| lj := atomic.LoadInt64(&clist[j].Latency) | ||
| return li < lj | ||
| }) |
There was a problem hiding this comment.
Design note (non-blocking): the old code randomly shuffled sync partners; this deterministically sorts by height-desc → latency-asc and breaks on the first lagging peer. Net effect: many nodes will preferentially pull from the same highest+fastest peer, concentrating sync load on the best-connected nodes. It's bounded (one sync at a time per node) and is arguably the intended win, but it's a real behavioral shift from "spread load randomly." Flagging for a conscious sign-off — no change requested.
| w = 1 // minimum weight 1 so all eligible peers have a chance | ||
| } | ||
| totalWeight += w | ||
| if globals.Global_Random.Float64()*totalWeight < w { |
There was a problem hiding this comment.
Pre-existing concurrency note (non-blocking): globals.Global_Random is a single *math/rand.Rand, which isn't safe for concurrent use. These Float64() calls are fine — they run under peer_mutex. But trigger_sync and the seed-node code call Global_Random.Shuffle without that lock, concurrently. This race already existed before this PR (the old trigger_sync/seed code used .Shuffle the same way); you're adding call sites but no new class of bug. The crypto/rand-backed source makes real harm unlikely, and the -race suite won't catch it since the tests don't exercise concurrent Global_Random access. Worth a maintainer ticket someday, not this PR's job.
Summary
Implements latency-aware peer selection to prefer low-latency peers for outgoing connections and chain sync, improving fastsync speed and mini-block propagation times.
Changes
New peer scoring system (
p2p/peer_pool.go):SuccessCount,LastLatency,LastTopoHeight,LastMeasuredtoPeerstructpeerScore(): weighted random scoring based on success count, latency, and agefind_peer_to_connect()rewritten with weighted random reservoir samplingPeerList_Print()enhanced with success count, latency, age columnsLatency capture from ping path (
p2p/connection_pool.go):Peer_UpdateLatency()called fromping_loopafter each successful pingSeed node latency sort (
p2p/controller.go):Race condition fix (
p2p/rpc_handshake.go):Peer_SetSuccessmoved fromHandshakehandler todispatch_test_handshake(afterPeer_Add)successcountwas always 0; both sides now whitelist each otherSync partner selection (
p2p/connection_pool.go):trigger_sync()rewritten with atomic reads; sorts by height desc then latency ascVerification
peerScore(), all pass with-raceFiles Changed