Skip to content

feat(p2p): latency-aware peer selection for faster sync and better propagation#16

Open
moralpriest wants to merge 1 commit into
DEROFDN:community-devfrom
moralpriest:feat/p2p-latency-prioritization
Open

feat(p2p): latency-aware peer selection for faster sync and better propagation#16
moralpriest wants to merge 1 commit into
DEROFDN:community-devfrom
moralpriest:feat/p2p-latency-prioritization

Conversation

@moralpriest
Copy link
Copy Markdown

Summary

Implements latency-aware peer selection to prefer low-latency peers for outgoing connections and chain sync, improving fastsync speed and mini-block propagation times.

Changes

New peer scoring system (p2p/peer_pool.go):

  • Added SuccessCount, LastLatency, LastTopoHeight, LastMeasured to Peer struct
  • New peerScore(): weighted random scoring based on success count, latency, and age
  • find_peer_to_connect() rewritten with weighted random reservoir sampling
  • PeerList_Print() enhanced with success count, latency, age columns

Latency capture from ping path (p2p/connection_pool.go):

  • New Peer_UpdateLatency() called from ping_loop after each successful ping

Seed node latency sort (p2p/controller.go):

  • Known seeds sorted by latency (fastest first); unknown seeds shuffled for diversity

Race condition fix (p2p/rpc_handshake.go):

  • Peer_SetSuccess moved from Handshake handler to dispatch_test_handshake (after Peer_Add)
  • Fixes bug where successcount was always 0; both sides now whitelist each other

Sync partner selection (p2p/connection_pool.go):

  • trigger_sync() rewritten with atomic reads; sorts by height desc then latency asc

Verification

  • Unit tests: 14 table-driven tests for peerScore(), all pass with -race
  • Testnet: 2-node, both sides whitelist each other, successcount/latency populated
  • Mainnet full fastsync: 2.5h to tip (height 7,149,293), 33/45 peers with data, latency range 75ms-1012ms (median 160ms), zero crashes
  • Safety: Zero consensus changes, no wire protocol changes, worst-case degrades to random selection (current behavior)

Files Changed

p2p/connection_pool.go    |  19 +-
p2p/controller.go         |  65 +++-
p2p/peer_pool.go          | 137 +++++++--
p2p/peer_pool_test.go     | 203 +++++++++++++
p2p/rpc_handshake.go      |   4 +-
5 files changed, 389 insertions(+), 39 deletions(-)

- Add SuccessCount, LastLatency, LastTopoHeight, LastMeasured to Peer
- Replace greedy peer selection with weighted random reservoir sampling
- Capture latency from ping path via Peer_UpdateLatency
- Sort seed nodes and sync partners by latency
- Fix Peer_SetSuccess race: move after Peer_Add in dispatch_test_handshake
- Enhance PeerList_Print with success count, latency, age columns
- Add 14 table-driven peerScore tests

Verified: testnet (2-node, both whitelisted), mainnet full fastsync
(2.5h to tip, 33/45 peers with data, 75ms-1012ms, zero crashes).
@moralpriest
Copy link
Copy Markdown
Author

P2P_PEER_PRIORITIZATION.md
FASTSYNC_PR_PLAN.md

Additional documentation detailing the PR.

Copy link
Copy Markdown

@DHEBP DHEBP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: sound and mergeable. Verified locally — whole module builds clean, go vet ./p2p/ passes, and the 14 tests pass under -race. The core latency-scoring logic is correct, the reservoir sampling is textbook (A-Res weighted + Algorithm R uniform), and the latency unit math (ns → ms) checks out against rtt_micro. The Peer_SetSuccess move is a genuine fix — it previously lived in the server-side Handshake handler gated by if !c.Incoming, which almost never fired, so SuccessCount stayed 0.

A few minor cleanups below before merge — only the gofmt one I'd consider a real gate. The two design notes (trigger_sync load concentration, Global_Random concurrency) are judgment calls for the maintainers, not defects.

Nice work — the weight floor to prevent peer starvation and the TTL decay on the latency bonus are thoughtful touches.

Comment thread p2p/peer_pool.go
ConnectAfter uint64 `json:"connectafter"` // we should connect when the following timestamp passes
BlacklistBefore uint64 `json:"blacklistbefore"` // peer blacklisted till epoch , priority nodes are never blacklisted, 0 if not blacklist
GoodCount uint64 `json:"goodcount"` // how many times peer has been shared with us
SuccessCount uint64 `json:"successcount"` // outbound connection successes (for scoring)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc nit: documented as "outbound connection successes", but since dispatch_test_handshake runs for both incoming (controller.go:589) and outgoing (controller.go:741) connections, Peer_SetSuccess now increments on inbound handshakes too. That matches the "both sides whitelist each other" intent — just worth syncing the comment so the next reader isn't misled.

Suggested change
SuccessCount uint64 `json:"successcount"` // outbound connection successes (for scoring)
SuccessCount uint64 `json:"successcount"` // successful handshakes, in or out (for scoring)

Comment thread p2p/peer_pool.go
Comment on lines +209 to +215
p.FailCount = 0 // fail count is zero again
p.ConnectAfter = 0
p.Whitelist = true
p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected
p.SuccessCount++

// logger.Infof("Setting peer as white listed")
// logger.Infof("Setting peer as white listed")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofmt: this block is over-indented by one tab level. gofmt -l also flags p2p/peer_pool_test.go (struct field alignment). A single gofmt -w p2p/ cleans both. (Heads-up: controller.go also shows a gofmt diff, but that's a pre-existing misalignment in P2P_Shutdown you didn't touch — ignore it.)

Suggested change
p.FailCount = 0 // fail count is zero again
p.ConnectAfter = 0
p.Whitelist = true
p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected
p.SuccessCount++
// logger.Infof("Setting peer as white listed")
// logger.Infof("Setting peer as white listed")
p.FailCount = 0 // fail count is zero again
p.ConnectAfter = 0
p.Whitelist = true
p.LastConnected = uint64(time.Now().UTC().Unix()) // set time when last connected
p.SuccessCount++
// logger.Infof("Setting peer as white listed")

Comment thread p2p/connection_pool.go
Comment on lines +674 to 684
// sort by height descending (furthest ahead first), then latency ascending (fastest among equal)
sort.SliceStable(clist, func(i, j int) bool {
hi := atomic.LoadInt64(&clist[i].Height)
hj := atomic.LoadInt64(&clist[j].Height)
if hi != hj {
return hi > hj
}
li := atomic.LoadInt64(&clist[i].Latency)
lj := atomic.LoadInt64(&clist[j].Latency)
return li < lj
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design note (non-blocking): the old code randomly shuffled sync partners; this deterministically sorts by height-desc → latency-asc and breaks on the first lagging peer. Net effect: many nodes will preferentially pull from the same highest+fastest peer, concentrating sync load on the best-connected nodes. It's bounded (one sync at a time per node) and is arguably the intended win, but it's a real behavioral shift from "spread load randomly." Flagging for a conscious sign-off — no change requested.

Comment thread p2p/peer_pool.go
w = 1 // minimum weight 1 so all eligible peers have a chance
}
totalWeight += w
if globals.Global_Random.Float64()*totalWeight < w {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing concurrency note (non-blocking): globals.Global_Random is a single *math/rand.Rand, which isn't safe for concurrent use. These Float64() calls are fine — they run under peer_mutex. But trigger_sync and the seed-node code call Global_Random.Shuffle without that lock, concurrently. This race already existed before this PR (the old trigger_sync/seed code used .Shuffle the same way); you're adding call sites but no new class of bug. The crypto/rand-backed source makes real harm unlikely, and the -race suite won't catch it since the tests don't exercise concurrent Global_Random access. Worth a maintainer ticket someday, not this PR's job.

@8lecramm 8lecramm self-assigned this Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants