Add S3 blob storage with cashier billing to ic-gateway#193

Open
shilingwang wants to merge 25 commits into main from shiling/blob-storage

Conversation

@shilingwang
Contributor

@shilingwang shilingwang commented Apr 17, 2026

#NODE-1941

Summary

  • Adds a full blob storage API to ic-gateway, enabling upload/download of content-addressed blobs backed by a single AWS S3 bucket with per-owner billing through the cashier canister.
  • Introduces new /v1/ HTTP endpoints for blob metadata, chunk operations, and owner data management, gated behind --s3-endpoint and --cashier-canister-id CLI flags.
  • Integrates billing (budget checks, usage reporting) via a CashierConnector that caches budgets locally and flushes usage counters periodically, wired into ic-gateway's existing TaskManager and HealthManager.

New modules

  • src/s3/ — S3 client abstraction (BucketLike trait, AWSBucket impl, RamFakeBucket for dev), config
  • src/cashier/ — CashierClient (4 canister calls: whoami, pricelist, budget, usage reporting), CashierConnector (local billing cache + periodic flush)
  • src/storage/ — Shared types (blob metadata, hash tree, chunk constants), S3 key paths, IC egress certificate auth
  • src/routing/storage/ — Axum handlers + router for all /v1/ endpoints

HTTP endpoints (under /v1/)

  • HEAD /v1/blob — Blob metadata headers (size, content type)
  • GET /v1/blob — Download blob with Range header support
  • GET /v1/blob-tree — Raw blob metadata JSON
  • PUT /v1/blob-tree — Upload blob metadata (with IC egress cert auth)
  • GET /v1/chunk — Download a single chunk
  • PUT /v1/chunk — Upload a single chunk (SHA-256 verified)
  • DELETE /v1/owner — Delete all data for an owner (host-gated)

Design decisions

  • Single S3 bucket: One bucket configured via CLI, no multi-bucket routing. Simpler than the multi-instance model in object-storage.
  • Billing gated: Storage routes are only mounted when both --s3-endpoint and --cashier-canister-id are provided. Without either, ic-gateway serves only normal IC traffic.
  • Budget caching: Per-owner budgets cached for 30s to avoid hitting the cashier canister on every request. Usage counters flushed every 10s.
  • IC egress auth: PUT /blob-tree verifies an OwnerEgressSignature certificate from the request body. Bypassable with --fake-ingress-auth for local dev.

@shilingwang shilingwang marked this pull request as ready for review April 17, 2026 14:53
@shilingwang shilingwang requested a review from a team as a code owner April 17, 2026 14:53

@frankdavid frankdavid left a comment


Are there any plans for testing?

@shilingwang shilingwang requested a review from frankdavid April 28, 2026 08:26
) -> impl Stream<Item = Result<bytes::Bytes, std::io::Error>> + Send + 'static {
stream::iter(parts)
.map(move |part| fetch_chunk(state.clone(), owner, part))
.buffered(CHUNK_DOWNLOAD_PARALLELISM)

I thought about it again, and this may be exploited by malicious clients. A malicious client could request a file and read just a single byte of it (e.g. by terminating the connection early), yet we'd load up to 8 MB every time regardless. Also, if clients are slow (or act slow by intentionally dropping packets / delaying ACKs), memory usage can blow up: 10,000 connections would require roughly 80 GB of RAM.
It'd be nice to do some benchmarking, e.g. how fast the connection to AWS is. If it's as fast as the connection between the client and the gateway, buffering may not even be necessary.

.blob_tree
.root_hash()
.ok_or_else(|| StorageError::Forbidden("blob tree has no root hash".into()))?
.to_string();
Collaborator

Do we really need to stringify it here? Can't we check it directly? I guess it's because OwnerEgressSignature has it as a string?

/// Errors from S3 storage operations.
#[derive(Debug)]
pub enum StorageError {
AwsS3(String),
Collaborator

Why do we have only AWS specific errors?

Client::from_conf(s3_config)
}

/// Ensure the bucket exists, creating it if necessary. Probe intelligent tiering.
Collaborator

Does it really probe tiering?

HeadBucketError::NotFound(_) => false,
other => return Err(StorageError::AwsS3(other.to_string())),
},
Err(e) => return Err(StorageError::AwsS3(format!("{}", DisplayErrorContext(e)))),
Collaborator

Here and everywhere: just DisplayErrorContext(e).into() might be nicer?

.body
.collect()
.await
.map(|b| Some(b.to_vec()))
Collaborator

Maybe we should change the trait to work with Bytes everywhere? That way we won't have to convert between the Bytes the S3 client returns and Vec, and vice versa, which would save us allocations and CPU cycles.

request: &GetBudgetRequestV1,
) -> Result<GetBudgetResult, Error> {
let encoded_args =
candid::encode_args((request,)).context("failed to encode budget_get_v1 args")?;
Collaborator

nit: I think we can drop the budget_get_v1 from the context here, and everywhere else the same way. Where the error happens is inferred anyway.

client: Arc<CashierClient>,
gateway_id: GatewayId,
pricelist: Pricelist,
budgets: RwLock<HashMap<Principal, CachedBudget>>,
Collaborator

I think it's best to use DashMap here instead. Or even better - Moka cache with a TTL.

CashierUnavailable(String),
}

impl std::fmt::Display for BillingError {
Collaborator

Use thiserror instead of manual impl


impl fmt::Debug for CashierConnector {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("CashierConnector")
Collaborator

nit: maybe just use normal write! with some formatting?

client: Arc<CashierClient>,
gateway_name: Option<String>,
) -> Result<Self, Error> {
let principal = client.principal()?;
Collaborator

add error context here and below?

@shilingwang
Contributor Author

I lost access, so I cannot easily push any changes. I'll create a new PR. @blind-oracle @frankdavid

@shilingwang
Contributor Author

Thank you for contributing! Unfortunately this repository does not accept external contributions yet.

We are working on enabling this by aligning our internal processes and our CI setup to handle external contributions. However this will take some time to set up so in the meantime we unfortunately have to close this Pull Request.

We hope you understand and will come back once we accept external PRs.

— The DFINITY Foundation

@r-birkner could you help me change the permission of this Repo?

let mut budgets = self.budgets.write().await;
let cached = budgets
.get_mut(owner)
.expect("cache entry exists after refresh");
Collaborator

Avoid expect since it would panic

.is_none_or(|c| c.fetched_at.elapsed() >= BUDGET_TTL)
};

if needs_refresh {
Collaborator

Hmm, I'm not sure, but don't we have a race condition here with flush_usage? We run it periodically, and if we haven't flushed the usage yet (e.g. we had some debits), don't we just overwrite the budget with a fresh copy from the cashier?

if divisor == 0 {
return 0;
}
// Promote to `i128` so `quantity (u64) * cost (i64)` cannot overflow
Collaborator

Why don't we just error out when we get absurd inputs, like a negative cost?


fn int_to_i64(v: &Int) -> i64 {
// Int is arbitrary precision; clamp to i64 range for local budget math.
v.0.to_string().parse::<i64>().unwrap_or(i64::MAX)
Collaborator

Stringifying the BigInt for every conversion is a big overhead. Why don't we just operate on BigInts directly, without converting to i64? That would probably make life easier (e.g. no need to worry about overflows below).


type S = Arc<StorageState>;

const BODY_READ_TIMEOUT: Duration = Duration::from_secs(60);
Collaborator

Make configurable or use some already present CLI option

/// download. `buffered(N)` preserves source order, so the response body stays
/// strictly sequential while we prefetch ahead. Bounds peak per-download
/// memory at `CHUNK_DOWNLOAD_PARALLELISM * 1 MiB` (~8 MiB).
const CHUNK_DOWNLOAD_PARALLELISM: usize = 8;
Collaborator

Also make configurable
