Replace CAS with CS and CLS by meroton-benjamin · Pull Request #347 · buildbarn/bb-storage

meroton-benjamin · 2026-06-22T07:14:14Z

This commit builds on top of our split and splice blob support to make
it a mandatory first-class feature in Buildbarn. With this commit the
Content Addressable Storage (CAS) is created from two Storage
configurations that work in tandem: a Chunk Storage (CS), which is
content addressed and contains chunks of blobs, and a Chunk List Storage
(CLS), which is addressed by a blob digest and contains a manifest
describing the chunks that make up the blob.

All API calls are automatically translated to use Chunk Lists created
with RepMaxCDC. Effectively this means that large blobs no longer exist
in the storage layer. Individual chunks of the large blobs are in turn
deduplicated so that each chunk is stored only once.

The automatic translation ensures that clients that are not CDC aware
can continue to use the storage backend without any changes. Clients
that support RepMaxCDC also get a significant reduction in the amount
of data to transfer, as they only need to transfer modified chunks
rather than the entire blob.
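
To make the CS/CLS split concrete, here is a minimal in-memory sketch. It is not the code from this PR: `casStore`, `splitFixed`, `Put` and `Get` are hypothetical names, SHA-256 is assumed as the digest function, and fixed-size splitting is used purely as a stand-in for RepMaxCDC. It only illustrates how a chunk list keyed by blob digest and a content-addressed chunk map let two similar blobs share chunks.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digestOf returns the hex-encoded SHA-256 digest of data.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// casStore is a hypothetical in-memory stand-in for the CS/CLS pair.
type casStore struct {
	chunks     map[string][]byte   // CS: chunk digest -> chunk contents
	chunkLists map[string][]string // CLS: blob digest -> ordered chunk digests
}

func newCASStore() *casStore {
	return &casStore{chunks: map[string][]byte{}, chunkLists: map[string][]string{}}
}

// splitFixed cuts data into pieces of at most size bytes. This is NOT
// RepMaxCDC; it is a placeholder so the example stays self-contained.
func splitFixed(data []byte, size int) [][]byte {
	if size <= 0 {
		size = len(data)
	}
	var out [][]byte
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		out = append(out, data[:n])
		data = data[n:]
	}
	return out
}

// Put stores the blob as chunks plus a chunk list and returns the blob
// digest. The blob itself is never stored as a whole.
func (s *casStore) Put(blob []byte, chunkSize int) string {
	var list []string
	for _, c := range splitFixed(blob, chunkSize) {
		d := digestOf(c)
		// Identical chunks map to the same key, so they are kept once.
		s.chunks[d] = append([]byte(nil), c...)
		list = append(list, d)
	}
	blobDigest := digestOf(blob)
	s.chunkLists[blobDigest] = list
	return blobDigest
}

// Get reassembles a blob from its chunk list, as would be done for a
// client that is not CDC aware.
func (s *casStore) Get(blobDigest string) ([]byte, bool) {
	list, ok := s.chunkLists[blobDigest]
	if !ok {
		return nil, false
	}
	var buf bytes.Buffer
	for _, d := range list {
		c, ok := s.chunks[d]
		if !ok {
			return nil, false
		}
		buf.Write(c)
	}
	return buf.Bytes(), true
}

func main() {
	s := newCASStore()
	a := s.Put([]byte("aaaabbbbcccc"), 4)
	s.Put([]byte("aaaabbbbdddd"), 4)
	blob, _ := s.Get(a)
	// Two blobs of three chunks each share two chunks.
	fmt.Println(string(blob), len(s.chunks))
}
```

In this toy run the two blobs share their first two chunks, so only four chunks are stored for six chunk references. With a real content-defined chunker the same effect would be expected to survive insertions and deletions, which fixed-size splitting does not handle.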

This commit adds support for the SplitBlob and SpliceBlob methods from the Remote Execution v2 (REv2) API. SplitBlob and SpliceBlob can be used to facilitate uploads and downloads of large files, but a naïve implementation like this one has some major drawbacks. Blobs must exist in both their chunked and non-chunked form, which may significantly increase storage requirements for large blobs. The protocol also gives no guarantee that a large blob stored in the CAS exists in its chunked form, which forces a fairly heavy Split call that loads the entire blob in order to decompose it into its chunks. This implementation mostly exists as a stepping stone toward a different implementation in which Buildbarn internally manages all blobs as chunked blobs.


EdSchouten · 2026-06-22T11:30:01Z

+
+			var parameterCache *cdc.TTLCache[cdc.Parameters]
+			if casConfiguration.ContentDefinedChunkingParameterCache != nil {
+				parameterCacheConfiguraiton := casConfiguration.ContentDefinedChunkingParameterCache


Typo: Configuraiton

EdSchouten · 2026-06-22T11:38:46Z

+				if err != nil {
+					return err
+				}
+				parameterCache = cdc.NewTTLCache[cdc.Parameters](


So this is a cache that maps instance names to CDC parameters, right? I don't think it's correct to create such a cache globally. If you use DemultiplexingBlobAccess, you can rewrite instance names. Wouldn't that lead to potential collisions?

Also in the case of centralized storage nodes I don't think it makes sense to have any caching of CDC parameters. There's nothing to cache, as the storage node would just have a single configuration globally. It's only the gRPC client backend that needs a cache.

Maybe better to just extend BlobAccess to have a new GetCDCParameters() method or something? Then let individual storage backends be responsible for caching this information (or not).
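
A rough sketch of what that suggestion could look like, under heavy assumptions: `ParametersProvider`, `Parameters` and its fields, `staticProvider` and `cachingProvider` are all hypothetical names, not existing bb-storage APIs, and the parameter values are arbitrary. The point is only that a storage node can answer from its single configuration, while a client-side backend owns its own TTL cache keyed by the instance names it sees itself, so a global cache is not needed.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Parameters is a hypothetical stand-in for cdc.Parameters.
type Parameters struct {
	MinSizeBytes     int
	HorizonSizeBytes int
}

// ParametersProvider is the hypothetical method that BlobAccess could gain.
type ParametersProvider interface {
	GetCDCParameters(ctx context.Context, instanceName string) (Parameters, error)
}

// staticProvider models a storage node with one global configuration:
// there is nothing to cache.
type staticProvider struct{ params Parameters }

func (p staticProvider) GetCDCParameters(context.Context, string) (Parameters, error) {
	return p.params, nil
}

type cacheEntry struct {
	params  Parameters
	expires time.Time
}

// cachingProvider models a client-side backend (e.g. a gRPC client) that
// caches the answers of a remote backend for a limited time.
type cachingProvider struct {
	backend ParametersProvider
	ttl     time.Duration
	now     func() time.Time

	mu      sync.Mutex
	entries map[string]cacheEntry
}

func newCachingProvider(backend ParametersProvider, ttl time.Duration, now func() time.Time) *cachingProvider {
	return &cachingProvider{backend: backend, ttl: ttl, now: now, entries: map[string]cacheEntry{}}
}

func (p *cachingProvider) GetCDCParameters(ctx context.Context, instanceName string) (Parameters, error) {
	// The lock is held across the backend call to keep the sketch short.
	p.mu.Lock()
	defer p.mu.Unlock()
	if e, ok := p.entries[instanceName]; ok && p.now().Before(e.expires) {
		return e.params, nil
	}
	params, err := p.backend.GetCDCParameters(ctx, instanceName)
	if err != nil {
		return Parameters{}, err
	}
	p.entries[instanceName] = cacheEntry{params: params, expires: p.now().Add(p.ttl)}
	return params, nil
}

// countingBackend records how often the underlying backend is consulted.
type countingBackend struct {
	staticProvider
	calls int
}

func (b *countingBackend) GetCDCParameters(ctx context.Context, name string) (Parameters, error) {
	b.calls++
	return b.staticProvider.GetCDCParameters(ctx, name)
}

// demo performs two lookups within the TTL and one after expiry, using a
// fake clock, and reports the backend call count at both points.
func demo() (withinTTL, afterExpiry int, err error) {
	current := time.Unix(0, 0)
	backend := &countingBackend{staticProvider: staticProvider{
		params: Parameters{MinSizeBytes: 1 << 16, HorizonSizeBytes: 1 << 18}, // arbitrary values
	}}
	cache := newCachingProvider(backend, time.Minute, func() time.Time { return current })
	ctx := context.Background()
	for i := 0; i < 2; i++ {
		if _, err = cache.GetCDCParameters(ctx, "example"); err != nil {
			return 0, 0, err
		}
	}
	withinTTL = backend.calls
	current = current.Add(2 * time.Minute)
	if _, err = cache.GetCDCParameters(ctx, "example"); err != nil {
		return 0, 0, err
	}
	return withinTTL, backend.calls, nil
}

func main() {
	fmt.Println(demo())
}
```

Because each `cachingProvider` wraps exactly one backend, any instance name rewriting performed higher up in the stack happens before the lookup, which would avoid the collision concern raised above.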

EdSchouten · 2026-06-22T11:39:52Z

+					clock.SystemClock,
+					evictionSet,
+					int(parameterCacheConfiguraiton.GetCacheSize()),
+					parameterCacheConfiguraiton.CacheDuration.AsDuration(),


Missing .CacheDuration.CheckValid()?

EdSchouten · 2026-06-22T11:40:12Z

 			)
 			if err != nil {
-				return util.StatusWrap(err, "Failed to create Content Addressable Storage")
+				return util.StatusWrap(err, "Failed to create Content Addressable Storage: Failed to create Chunk Storage")


"Failed to create Chunk Storage for Content Addressable Storage"?

EdSchouten · 2026-06-22T11:46:20Z

+
+	// Chunk list is marked for validation bypass, push it directy to
+	// downstream blob store.
+	if cdc.ChunkListValidationBypassed(ctx) {


To me the notion of whether a chunk list is known to be valid isn't really a property of the calling context. It's more of a property of the chunk list that's passed in. Maybe better to either change the signature of BlobAccess.Put(), or add this to buffer.Buffer?

I take it that this logic is added to make sure that if a client performs a legacy Write() for a large object, that the frontend doesn't re-read the objects just to make sure that the ChunkList is valid, right? If so, how does this actually relate to buffer.Source? Maybe we should just treat ChunkLists created by bb-frontend itself as being buffer.BackendProvided(), and use that to skip validation?
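
To illustrate the idea of carrying provenance on the chunk list itself rather than on the context, here is a small sketch. All types here (`Source`, `ChunkList`, `store`) are hypothetical and only loosely modeled on the user-provided versus backend-provided distinction mentioned above; they are not the actual `buffer` package API, and SHA-256 is assumed as the digest function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Source records where a chunk list came from (hypothetical type).
type Source int

const (
	UserProvided    Source = iota // received from a client; must be validated
	BackendProvided               // produced by the frontend or storage itself
)

// ChunkList carries its provenance with it instead of relying on the
// calling context.
type ChunkList struct {
	BlobDigest   string
	ChunkDigests []string
	Source       Source
}

type store struct {
	chunks      map[string][]byte
	chunkLists  map[string][]string
	validations int // how many times chunks had to be re-read
}

func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// newStore creates a store that already holds the given chunks.
func newStore(chunks ...string) *store {
	s := &store{chunks: map[string][]byte{}, chunkLists: map[string][]string{}}
	for _, c := range chunks {
		s.chunks[digestOf([]byte(c))] = []byte(c)
	}
	return s
}

// validate re-reads every chunk and checks that their concatenation
// hashes to the blob digest.
func (s *store) validate(cl ChunkList) error {
	s.validations++
	h := sha256.New()
	for _, d := range cl.ChunkDigests {
		c, ok := s.chunks[d]
		if !ok {
			return errors.New("chunk list references missing chunk " + d)
		}
		h.Write(c)
	}
	if hex.EncodeToString(h.Sum(nil)) != cl.BlobDigest {
		return errors.New("chunks do not compose into blob " + cl.BlobDigest)
	}
	return nil
}

// PutChunkList validates only chunk lists that are not already trusted.
func (s *store) PutChunkList(cl ChunkList) error {
	if cl.Source != BackendProvided {
		if err := s.validate(cl); err != nil {
			return err
		}
	}
	s.chunkLists[cl.BlobDigest] = cl.ChunkDigests
	return nil
}

func main() {
	s := newStore("aaaa", "bbbb")
	cl := ChunkList{
		BlobDigest:   digestOf([]byte("aaaabbbb")),
		ChunkDigests: []string{digestOf([]byte("aaaa")), digestOf([]byte("bbbb"))},
		Source:       BackendProvided,
	}
	fmt.Println(s.PutChunkList(cl), s.validations)
}
```

With this shape, a chunk list that the frontend builds while handling a legacy Write() would be marked as backend provided and stored without re-reading the chunks, while anything received through SpliceBlob would still be validated.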

EdSchouten · 2026-06-22T11:49:36Z

+    // demands on the state of the Content Addressable Storage (CAS)
+    // after those methods have been called.
+    //
+    // SplitBlob requires that that the blob as well as all chunks of


s/that that/that/

EdSchouten · 2026-06-22T11:50:26Z

+    // that the supplied chunk list composes into the blob. Notably it
+    // does not require the chunks to follow any particular chunking
+    // algorithm but our implementation ensures that after any call a
+    // proper rep max cdc chunk list is verified even if the caller


s/rep max cdc/RepMaxCDC/

EdSchouten · 2026-06-22T11:50:53Z

-  // Storage (CAS).
-  ScannableBlobAccessConfiguration content_addressable_storage = 17;
+  // Optional: Blobstore configurations for the Content
+  // AddressableContentAddressa Storage (CAS).


EdSchouten · 2026-06-22T11:51:10Z

+
+  // Was 'chunk_list_storage'. Has been moved into the
+  // content_addressable_storage.
+  reserved 22;


This can be removed, right?

EdSchouten · 2026-06-22T11:53:32Z

+			if casConfiguration.ChunkStorage == nil {
+				return status.Error(codes.InvalidArgument, "The Chunk Storage is a mandatory part of the Content Addressable Storage.")
+			}
+			if casConfiguration.ChunkListStorage == nil {
+				return status.Error(codes.InvalidArgument, "The Chunk List Storage is a mandatory part of the Content Addressable Storage.")
+			}


Is this really a necessary requirement? I can imagine that for frontends that implement the full REv2 API it's necessary that both are provided. But for individual shards of our storage backends there is no requirement that each node provides both a CS and CLS.
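
One way to express such a relaxed check is sketched below. Everything here is hypothetical: the `role` type, the struct names, and in particular the rule that a shard must provide at least one of the two stores, which is an assumption made for the example rather than something stated in this review. Plain `errors.New` is used instead of gRPC status errors to keep the example free of dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// storageConfiguration is a hypothetical placeholder for a blobstore
// configuration message.
type storageConfiguration struct{ Backend string }

// casConfiguration mirrors the two optional fields discussed above.
type casConfiguration struct {
	ChunkStorage     *storageConfiguration
	ChunkListStorage *storageConfiguration
}

// role distinguishes processes that serve the full REv2 API from
// individual storage shards (hypothetical distinction).
type role int

const (
	roleFrontend role = iota
	roleStorageShard
)

// validateCASConfiguration requires both stores for frontends, but lets
// a shard provide only one of them.
func validateCASConfiguration(cfg casConfiguration, r role) error {
	hasCS := cfg.ChunkStorage != nil
	hasCLS := cfg.ChunkListStorage != nil
	switch r {
	case roleFrontend:
		if !hasCS {
			return errors.New("frontends require a Chunk Storage")
		}
		if !hasCLS {
			return errors.New("frontends require a Chunk List Storage")
		}
	case roleStorageShard:
		// Assumption: a shard with neither store has nothing to serve.
		if !hasCS && !hasCLS {
			return errors.New("shards require a Chunk Storage or a Chunk List Storage")
		}
	default:
		return errors.New("unknown role")
	}
	return nil
}

func main() {
	csOnly := casConfiguration{ChunkStorage: &storageConfiguration{Backend: "local"}}
	fmt.Println(validateCASConfiguration(csOnly, roleStorageShard))
	fmt.Println(validateCASConfiguration(csOnly, roleFrontend))
}
```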

meroton-benjamin force-pushed the cdc-support-step-2 branch from f9c925b to a6748b0 Compare June 22, 2026 09:55

meroton-benjamin force-pushed the cdc-support-step-2 branch from a6748b0 to c2f0bf7 Compare June 22, 2026 10:21

meroton-benjamin wants to merge 2 commits into buildbarn:main from meroton:cdc-support-step-2
