grpcproxy: fix CAS writes and AC reads failing against strict gRPC backends#886
grpcproxy: fix CAS writes and AC reads failing against strict gRPC backends#886lance-tan-ai wants to merge 5 commits intobuchgr:masterfrom
Conversation
The ByteStream Write protocol requires the last WriteRequest to have finish_write=true to signal that the upload is complete. Without it, servers that strictly enforce the spec (e.g. EngFlow) reject the write with: "Request completed before all data was sent: finished_writing=false". The previous loop only called CloseAndRecv() after reading n==0, which closes the gRPC stream client-side but never sends a message with finish_write=true. io.Reader is also permitted to return n>0 alongside io.EOF in the same call, so the last data chunk could be sent without the finish flag and the n==0 branch might never be reached. Fix: track whether the most recent Read returned io.EOF (finishWrite), set FinishWrite on the WriteRequest for that chunk, and call CloseAndRecv immediately after — handling both the "EOF with data" and "EOF alone" cases. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@buchgr Mind taking a look at this? We are hitting this issue when trying to write CAS blobs through bazel-remote's gRPC proxy. fwiw, our setup is the basic bazel-remote setup and just trying to hit the gRPC proxy backend. We are seeing this error appear on every proxy write. |
The previous fix set FinishWrite=true on the data chunk when Read returned (n>0, io.EOF) simultaneously, but missed the case where Read returns (n>0, nil) for the last chunk and then (0, io.EOF) on the following call — which is the more common io.Reader behaviour. In that case the data was fully sent across previous iterations but no message with FinishWrite=true was ever transmitted; CloseAndRecv just closed the stream client-side, leaving the server reporting finished_writing=false. Fix: when finishWrite is true and n==0, send a zero-data WriteRequest with FinishWrite=true before calling CloseAndRecv so the server always receives an explicit completion signal. Co-authored-by: Cursor <cursoragent@cursor.com>
The ByteStream Write protocol requires each WriteRequest to carry the byte offset of its data relative to the start of the resource. Without it the server rejects the terminal finish_write=true message with: write_offset: Current write_offset (0) does not match expected one (<size>) Track a running writeOffset, increment it after each data chunk, and pass it on every WriteRequest — including the zero-data terminal message sent when io.Reader returns (0, io.EOF) separately from the last chunk. Co-authored-by: Cursor <cursoragent@cursor.com>
Get() received the action digest size as the size parameter but the AC branch ignored it, always sending SizeBytes=-1 to the upstream. Servers that require a valid digest (e.g. EngFlow) reject this with: action_digest: invalid or missing action digest Fix: pass size through directly; it comes from the client's original GetActionResult request and is always the correct action digest size. Co-authored-by: Cursor <cursoragent@cursor.com>
grpc_ac.go passes -1 for the ActionResult size (unknown at request time), but the same -1 was also being used as the action digest size_bytes in the proxy's GetActionResult call, which gRPC remote caches reject as invalid. Thread the action digest size_bytes via context so the proxy can construct a correct Digest without changing the Proxy.Get interface. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Hi, which server are you seeing these errors with? |
@mostynb We are seeing this with EngFlow servers. |
|
For some context, this is our |
|
I refreshed my memory a bit, bazel-remote's grpc proxy backend is a relatively recent addtion, and should probably still be marked as experimental. It has the best chance of working with another bazel-remote instance as the grpc proxy backend. The reason for this is historical- bazel-remote originally only supported bazel's older http cache protocol, which does not provide full digest information to the server as the newer grpc protocol does. Some non-trivial refactoring is required to improve the situation, and I suspect that your changes to GetActionResult would not be enough to make the current setup work well. I would prefer to take some time to fix this instead of sneaking extra data through with the context. However the grpcproxy bytestream changes look OK. Would you be interested in reducing the scope of this PR to focus on this? |
Will descope the PR. I will create an issue to outline all of the issues I was seeing with migrating to a gRPC backend since there were additional patches that I needed to apply for it to work out-of-the-box. |
Errors
When using the gRPC proxy backend, both CAS write-throughs and AC read-throughs fail against gRPC cache backends that strictly enforce the Remote Execution API spec.
CAS writes fail with
finished_writing='false':Then after partially fixing that, with
write_offset=0:AC reads fail with an invalid digest:
Root causes
1.
finish_writenever set on the final ByteStreamWriteRequestThe ByteStream protocol requires the last
WriteRequestto havefinish_write = trueto signal upload completion. The previous loop never set it — callingCloseAndRecv()alone is not sufficient for backends that require an explicit terminal message.2.
write_offsetnot tracked across chunksEach
WriteRequestmust carry the cumulative byte offset of data sent so far. The previous implementation always sentwrite_offset = 0, causing backends to reject multi-chunk uploads.3. Action digest
size_bytessent as-1inGetActionResultgrpc_ac.gopassesunknownActionResultSize(-1) tocache.Get()because the ActionResult size is not known at request time. The proxy was forwarding this same-1assize_bytesin theGetActionResultDigest, which backends reject.The correct value is
ActionDigest.SizeBytesfrom the client request (the size of the Action, not the ActionResult). Since theProxy.Getinterface doesn't carry the action digest size, it is threaded viacontext.WithValueusing a package-level typed key.Fix
finishWritefromio.EOFand setFinishWrite: trueon the correspondingWriteRequest, handling both EOF-with-data (n > 0) and EOF-only (n == 0).writeOffsetand include it in everyWriteRequest.grpc_ac.go, storereq.ActionDigest.SizeBytesin context before callingcache.Get(). In the proxy's AC path, read it back to construct a validDigestforGetActionResult.Validation
After deploying with these fixes and clearing the local disk cache to force proxy fallbacks, all CAS writes and AC reads succeed: