Fix unsafe pickle deserialization in gRPC PolicyServer (CVE-2026-26210) #1944

Open
Chocapikk wants to merge 2 commits into kvcache-ai:main from Chocapikk:fix/cve-2026-26210-unsafe-deserialization

Conversation

Chocapikk commented Apr 23, 2026

Summary

This PR addresses CVE-2026-26210, an unauthenticated remote code execution vulnerability in the balance_serve scheduler RPC module caused by unsafe pickle.loads() on attacker-controlled data received over unauthenticated ZMQ channels.

Changes

  • Bind to 127.0.0.1 by default instead of 0.0.0.0 (configurable via sched_bind)
  • HMAC-SHA256 message authentication on all ZMQ frames using a shared secret (KTRANSFORMERS_RPC_SECRET env var or auto-generated at startup and shared via environment)
  • RestrictedUnpickler with explicit allowlist for C++ scheduler extension types (QueryAdd, QueryUpdate, BatchQueryTodo, etc.) - blocks arbitrary code execution while preserving compatibility with sched_ext objects
  • Safetensors for KV cache tensor serialization instead of pickle.dumps()/mp.reductions.reduce_tensor()
  • ZMQ multipart protocol: [signature, payload, tensor_data] with consistent HMAC verification on both sides
  • Config file loading uses RestrictedUnpickler (compatible with existing pickle.dump in balance_serve.py)
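The allowlist approach described in the list above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: `ALLOWED_GLOBALS` and `restricted_loads` are hypothetical names, and the single allowlisted entry is a stand-in for the `sched_ext` types, whose module path is not shown here.

```python
import io
import pickle

# Hypothetical allowlist of (module, name) pairs that may be resolved while
# unpickling. A real deployment would list the sched_ext types instead.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global not on the allowlist."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream references a global
        # (a class or function). Plain containers, ints, strings and bytes
        # never reach this method.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is not allowed"
        )


def restricted_loads(data: bytes):
    """Replacement for pickle.loads() that enforces the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

A payload whose `__reduce__` points at `os.system` fails in `find_class` before anything is called, while dicts, lists and allowlisted types load as usual.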

Security Impact

Before: Any unauthenticated network-reachable attacker could achieve arbitrary code execution by sending a crafted pickle payload to the ZMQ port.

After:

  • Localhost-only by default (no remote access)
  • HMAC-SHA256 authentication prevents unauthorized message injection
  • RestrictedUnpickler blocks dangerous modules (os, subprocess, etc.) - only explicitly allowed types can be deserialized
  • Tensor data uses safetensors (no code execution possible)
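As a rough sketch of the authentication step, the `[signature, payload, tensor_data]` layout can be signed and checked as below. The function names are hypothetical and the socket layer is left out; only the standard `hmac` and `hashlib` modules are used, and the secret is assumed to be the same bytes on both sides.

```python
import hashlib
import hmac
from typing import List, Tuple


def sign(secret: bytes, payload: bytes, tensor_data: bytes = b"") -> bytes:
    """HMAC-SHA256 over payload followed by tensor_data."""
    return hmac.new(secret, payload + tensor_data, hashlib.sha256).digest()


def make_frames(secret: bytes, payload: bytes,
                tensor_data: bytes = b"") -> List[bytes]:
    """Build the three-part message: [signature, payload, tensor_data]."""
    return [sign(secret, payload, tensor_data), payload, tensor_data]


def verify_frames(secret: bytes, parts: List[bytes]) -> Tuple[bytes, bytes]:
    """Check the signature before anything is deserialized.

    Raises ValueError on a malformed or unauthenticated message.
    """
    if len(parts) != 3:
        raise ValueError("expected [signature, payload, tensor_data]")
    sig, payload, tensor_data = parts
    expected = sign(secret, payload, tensor_data)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading bytes of the signature matched.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("HMAC verification failed")
    return payload, tensor_data
```

Signing the plain concatenation follows the frame layout in the PR description. It does not bind the boundary between the two frames, so a stricter variant could also sign the payload length.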

References

gemini-code-assist (bot) left a comment

Code Review

This pull request replaces pickle with json and safetensors for RPC serialization and introduces HMAC-based message authentication. Several critical issues were identified: the switch to JSON breaks the serialization of complex scheduler objects like QueryAdd, and the random generation of the RPC secret at import time prevents communication between separate processes. There is also a format mismatch with the configuration file writer, and inconsistencies in the HMAC signing and verification logic between the client and server need to be corrected.

Comment on lines +35 to +40
def _serialize_msg(obj: dict) -> bytes:
    """Serialize an RPC message to JSON bytes. Non-JSON-serializable values
    are dropped with a placeholder so the frame always round-trips."""
    def _default(o):
        return f"<non-serializable:{type(o).__name__}>"
    return json.dumps(obj, default=_default).encode()

critical

The switch from pickle to json for RPC message serialization is a critical breaking change. The _default function drops non-serializable objects, but the RPC protocol relies on passing complex objects like QueryAdd, QueryUpdate, and BatchQueryTodo (from the sched_ext extension). These objects are not JSON-serializable and will be replaced by placeholder strings (e.g., "<non-serializable:QueryAdd>"), causing the scheduler to fail when it receives these strings instead of the expected objects. You must implement a proper to_dict/from_dict mechanism for these types or use a serialization format that supports them.
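One way the suggested to_dict/from_dict mechanism could look is a tagged-dict encoding, sketched below. `QueryAddStandIn` and its fields are hypothetical placeholders, since the real `QueryAdd` is a C++ extension type whose attributes are not shown in this review; the registry and helper names are likewise assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QueryAddStandIn:
    """Hypothetical stand-in for a scheduler type; fields are invented."""
    query_id: int
    token_ids: List[int]


# Only types registered here may be encoded or rebuilt.
_REGISTRY = {"QueryAddStandIn": QueryAddStandIn}


def _encode(o):
    # Called by json.dumps for objects it cannot serialize natively.
    name = type(o).__name__
    if name in _REGISTRY:
        return {"__type__": name, "fields": asdict(o)}
    # Fail loudly instead of substituting a placeholder string.
    raise TypeError(f"cannot serialize object of type {name}")


def _decode(d: dict):
    # Called by json.loads for every decoded JSON object, innermost first.
    tag = d.get("__type__")
    if isinstance(tag, str) and tag in _REGISTRY:
        return _REGISTRY[tag](**d["fields"])
    return d


def serialize_msg(obj: dict) -> bytes:
    return json.dumps(obj, default=_encode).encode()


def deserialize_msg(data: bytes) -> dict:
    return json.loads(data.decode(), object_hook=_decode)
```

Raising `TypeError` for unregistered types surfaces a missing converter at send time rather than as a scheduler failure on the receiving side.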

Comment on lines +22 to +24
_RPC_SECRET = os.environ.get(
    "KTRANSFORMERS_RPC_SECRET", ""
).encode() or secrets.token_bytes(32)

high

Generating a random _RPC_SECRET at import time using secrets.token_bytes(32) will cause HMAC verification to fail if the client and server run in separate processes (which is the case here, as balance_serve.py spawns sched_rpc.py via subprocess.Popen). Each process will generate its own unique secret. To fix this, ensure the secret is either provided via the KTRANSFORMERS_RPC_SECRET environment variable or that the parent process generates it and explicitly sets it in the environment before spawning the child process.
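The second option, where the parent generates the secret and exports it before spawning the child, might look like the sketch below. `ensure_rpc_secret` and `spawn_with_secret` are hypothetical helpers; the secret is hex-encoded here because environment variables must be text.

```python
import os
import secrets
import subprocess
import sys

ENV_VAR = "KTRANSFORMERS_RPC_SECRET"


def ensure_rpc_secret() -> str:
    """Return the shared secret, generating and exporting it if unset.

    This must run before any module that reads the variable at import time
    is imported, otherwise that module would pick its own random value.
    """
    secret = os.environ.get(ENV_VAR)
    if not secret:
        secret = secrets.token_hex(32)
        os.environ[ENV_VAR] = secret
    return secret


def spawn_with_secret(cmd):
    """Start a child process that inherits the parent's secret."""
    secret = ensure_rpc_secret()
    proc = subprocess.Popen(cmd, env=os.environ.copy(),
                            stdout=subprocess.PIPE)
    return proc, secret
```

Because both processes read the same environment variable, the `os.environ.get(...).encode()` branch in the snippet above yields identical bytes on each side and the random fallback is never reached.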

Comment on lines +271 to +272
with open(args.config, "r") as f:
    main_args = json.load(f)

high

There is a mismatch in configuration file formats. While this file has been updated to use json.load, the calling code in kt-sft/ktransformers/server/backend/interfaces/balance_serve.py (line 322) still uses pickle.dump(args, temp_file). This will result in a json.decoder.JSONDecodeError when the scheduler attempts to start. Both sides must be updated to use JSON.
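A matched writer and reader pair for the JSON route could be sketched as follows. The helper names are hypothetical, and the sketch assumes every value in the argument namespace is JSON-serializable, which may not hold for the real `args` object.

```python
import argparse
import json
import tempfile


def write_config(args: argparse.Namespace) -> str:
    """Writer side: dump the arguments as JSON and return the file path."""
    with tempfile.NamedTemporaryFile("w", suffix=".json",
                                     delete=False) as f:
        json.dump(vars(args), f)
        return f.name


def load_config(path: str) -> argparse.Namespace:
    """Reader side: rebuild the namespace from the JSON file."""
    with open(path, "r") as f:
        return argparse.Namespace(**json.load(f))
```

Keeping both helpers in one shared module makes it harder for the two sides to drift apart in format again.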

Comment on lines +98 to +99
sig, payload, _ = parts
if not _verify(payload, sig):

medium

The HMAC verification in _recv is inconsistent with the signing logic in _send. _send signs payload + tensor_data, but _recv verifies only the payload and ignores the third part of the multipart message. Verification will therefore fail whenever a client sends non-empty tensor data, and tampering with the tensor data frame would not be detected by this check.

Suggested change
- sig, payload, _ = parts
- if not _verify(payload, sig):
+ sig, payload, tensor_data = parts
+ if not _verify(payload + tensor_data, sig):

Comment on lines +203 to +204
sig = _sign(payload)
self.socket.send_multipart([sig, payload, b""])

medium

For protocol consistency, the client should sign the message including the empty tensor data frame, matching the server's expectation that the signature covers all data parts.

Suggested change
- sig = _sign(payload)
- self.socket.send_multipart([sig, payload, b""])
+ sig = _sign(payload + b"")
+ self.socket.send_multipart([sig, payload, b""])
