Fix unsafe pickle deserialization in gRPC PolicyServer (CVE-2026-26210) #1944
Chocapikk wants to merge 2 commits into kvcache-ai:main
Code Review
This pull request replaces pickle with json and safetensors for RPC serialization and introduces HMAC-based message authentication. Several critical issues were identified: the switch to JSON breaks the serialization of complex scheduler objects like QueryAdd, and the random generation of the RPC secret at import time prevents communication between separate processes. There is also a format mismatch with the configuration file writer, and inconsistencies in the HMAC signing and verification logic between the client and server need to be corrected.
| def _serialize_msg(obj: dict) -> bytes: | ||
| """Serialize an RPC message to JSON bytes. Non-JSON-serializable values | ||
| are dropped with a placeholder so the frame always round-trips.""" | ||
| def _default(o): | ||
| return f"<non-serializable:{type(o).__name__}>" | ||
| return json.dumps(obj, default=_default).encode() |
The switch from pickle to json for RPC message serialization is a critical breaking change. The _default function drops non-serializable objects, but the RPC protocol relies on passing complex objects like QueryAdd, QueryUpdate, and BatchQueryTodo (from the sched_ext extension). These objects are not JSON-serializable and will be replaced by placeholder strings (e.g., "<non-serializable:QueryAdd>"), causing the scheduler to fail when it receives these strings instead of the expected objects. You must implement a proper to_dict/from_dict mechanism for these types or use a serialization format that supports them.
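One way to read the to_dict/from_dict suggestion is sketched below: known types are tagged on the way out and rebuilt from an allow-list on the way in, and anything unregistered raises instead of degrading to a placeholder string. The `QueryAdd` dataclass here is a hypothetical stand-in (the real sched_ext type and its fields are not shown in this PR), and `_RPC_TYPES`, `serialize_msg`, `deserialize_msg`, and the `add_query` method name are illustrative names only.

```python
import json
from dataclasses import asdict, dataclass, field


# Hypothetical stand-in for sched_ext.QueryAdd; the real fields are unknown.
@dataclass
class QueryAdd:
    query_id: int
    query_length: int
    stop_criteria: list = field(default_factory=list)


# Allow-list of types that may cross the RPC boundary.
_RPC_TYPES = {"QueryAdd": QueryAdd}


def _encode(o):
    """json.dumps `default` hook: tag registered types, reject the rest."""
    name = type(o).__name__
    if _RPC_TYPES.get(name) is type(o):
        return {"__type__": name, "fields": asdict(o)}
    raise TypeError(f"{name} is not registered for RPC serialization")


def _decode(d):
    """json.loads `object_hook`: rebuild tagged objects from the allow-list."""
    if "__type__" not in d:
        return d
    tag = d["__type__"]
    cls = _RPC_TYPES.get(tag) if isinstance(tag, str) else None
    if cls is None:
        raise ValueError(f"unknown RPC type {tag!r}")
    return cls(**d["fields"])


def serialize_msg(obj: dict) -> bytes:
    return json.dumps(obj, default=_encode).encode()


def deserialize_msg(data: bytes) -> dict:
    return json.loads(data.decode(), object_hook=_decode)


msg = {"method": "add_query", "params": {"query": QueryAdd(1, 128, [2])}}
restored = deserialize_msg(serialize_msg(msg))
```

Failing loudly on unregistered types is a deliberate choice in this sketch: a silent placeholder would only surface later as a confusing scheduler error.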
| _RPC_SECRET = os.environ.get( | ||
| "KTRANSFORMERS_RPC_SECRET", "" | ||
| ).encode() or secrets.token_bytes(32) |
Generating a random _RPC_SECRET at import time using secrets.token_bytes(32) will cause HMAC verification to fail if the client and server run in separate processes (which is the case here, as balance_serve.py spawns sched_rpc.py via subprocess.Popen). Each process will generate its own unique secret. To fix this, ensure the secret is either provided via the KTRANSFORMERS_RPC_SECRET environment variable or that the parent process generates it and explicitly sets it in the environment before spawning the child process.
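A minimal sketch of the second option, assuming the parent is free to set the variable before calling `subprocess.Popen`. The helper names `ensure_rpc_secret`, `load_rpc_secret`, and `spawn_scheduler` are hypothetical and do not come from the PR; only the `KTRANSFORMERS_RPC_SECRET` variable name does.

```python
import os
import secrets
import subprocess

ENV_NAME = "KTRANSFORMERS_RPC_SECRET"


def ensure_rpc_secret() -> bytes:
    """Parent side: reuse the configured secret or create one, then export
    it so that child processes inherit the same value."""
    value = os.environ.get(ENV_NAME)
    if not value:
        value = secrets.token_hex(32)  # hex text is safe to keep in an env var
        os.environ[ENV_NAME] = value
    return value.encode()


def load_rpc_secret() -> bytes:
    """Child side: fail loudly rather than silently inventing a new secret."""
    value = os.environ.get(ENV_NAME)
    if not value:
        raise RuntimeError(f"{ENV_NAME} is not set; refusing to start")
    return value.encode()


def spawn_scheduler(cmd: list) -> subprocess.Popen:
    """Spawn the scheduler only after the secret is in the environment."""
    ensure_rpc_secret()
    return subprocess.Popen(cmd, env=os.environ.copy())
```

With this split, the child never falls back to `secrets.token_bytes(32)`, so a missing variable shows up as a startup error instead of as unexplained HMAC failures.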
| with open(args.config, "r") as f: | ||
| main_args = json.load(f) |
There is a mismatch in configuration file formats. While this file has been updated to use json.load, the calling code in kt-sft/ktransformers/server/backend/interfaces/balance_serve.py (line 322) still uses pickle.dump(args, temp_file). As a result, json.load will fail when the scheduler attempts to start: it raises json.decoder.JSONDecodeError, or UnicodeDecodeError if the pickle bytes are not valid text. Both sides must be updated to use JSON.
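A sketch of a matched writer and reader, assuming the arguments can be reduced to a plain dict of JSON-compatible values (for an `argparse.Namespace`, `vars(args)` would do that). The function names and the config keys in the usage lines are hypothetical.

```python
import json
import tempfile


def write_config(args: dict) -> str:
    """Writer side (the process that spawns the scheduler)."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as f:
        json.dump(args, f)
        return f.name


def read_config(path: str) -> dict:
    """Reader side (the scheduler process)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical keys, for illustration only.
path = write_config({"sched_port": 5555, "page_size": 256})
loaded = read_config(path)
```

Any value in `args` that is not JSON-serializable makes `json.dump` raise a `TypeError` on the writer side, which is easier to diagnose than a failure in the child process.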
| sig, payload, _ = parts | ||
| if not _verify(payload, sig): |
The HMAC verification in _recv is inconsistent with the signing logic in _send. While _send signs payload + tensor_data, _recv only verifies the payload, ignoring the third part of the multipart message. This could lead to verification failures if a client sends non-empty tensor data or security issues if the tensor data is tampered with.
| sig, payload, _ = parts | |
| if not _verify(payload, sig): | |
| sig, payload, tensor_data = parts | |
| if not _verify(payload + tensor_data, sig): |
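The symmetric scheme the suggestion aims for can be sketched as follows, with both sides computing the MAC over `payload + tensor_data`. HMAC-SHA256 is an assumption here (the excerpt does not show which digest `_sign` uses), and `sign`, `verify`, `pack`, and `unpack` are illustrative names rather than the PR's functions.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes, tensor_data: bytes = b"") -> bytes:
    """MAC over every data frame, not just the JSON payload."""
    return hmac.new(secret, payload + tensor_data, hashlib.sha256).digest()


def verify(secret: bytes, sig: bytes, payload: bytes,
           tensor_data: bytes = b"") -> bool:
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(sign(secret, payload, tensor_data), sig)


def pack(secret: bytes, payload: bytes, tensor_data: bytes = b"") -> list:
    """Build the [signature, payload, tensor_data] multipart frames."""
    return [sign(secret, payload, tensor_data), payload, tensor_data]


def unpack(secret: bytes, parts: list) -> tuple:
    """Check frame count and MAC before any frame is parsed."""
    if len(parts) != 3:
        raise ValueError("expected exactly 3 frames")
    sig, payload, tensor_data = parts
    if not verify(secret, sig, payload, tensor_data):
        raise ValueError("HMAC verification failed")
    return payload, tensor_data
```

The resulting list is in the shape `send_multipart` expects. When the tensor frame is empty, the signed bytes are identical to the payload alone.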
| sig = _sign(payload) | ||
| self.socket.send_multipart([sig, payload, b""]) |
For protocol consistency, the client should sign the message including the empty tensor data frame, matching the server's expectation that the signature covers all data parts.
| sig = _sign(payload) | |
| self.socket.send_multipart([sig, payload, b""]) | |
| sig = _sign(payload + b"") | |
| self.socket.send_multipart([sig, payload, b""]) |
…C consistency, fix secret sharing
Summary
This PR addresses CVE-2026-26210, an unauthenticated remote code execution vulnerability in the balance_serve scheduler RPC module caused by unsafe pickle.loads() on attacker-controlled data received over unauthenticated ZMQ channels.
Changes
… 127.0.0.1 by default instead of 0.0.0.0 (configurable via sched_bind)
… KTRANSFORMERS_RPC_SECRET env var or auto-generated at startup and shared via environment)
… pickle.dumps() / mp.reductions.reduce_tensor() …
… [signature, payload, tensor_data] with consistent HMAC verification on both sides
Security Impact
Before: Any unauthenticated network-reachable attacker could achieve arbitrary code execution by sending a crafted pickle payload to the ZMQ port.
After:
References