wip weights by michaeldwan · Pull Request #2974 · replicate/cog

michaeldwan · 2026-04-28T22:43:44Z

No description provided.

ask-bonk · 2026-04-28T22:49:52Z

LGTM

ask-bonk

I've reviewed the managed weights PR. The code is generally well-structured with good documentation. I've identified a few actionable issues around security and correctness.

ask-bonk · 2026-04-29T02:39:07Z

+	// Inventory paths use forward slashes; filepath.Join on a path
+	// containing "/" works correctly on POSIX and is normalized on
+	// Windows.
+	abs := filepath.Join(s.dir, filepath.FromSlash(path))


This function relies on path validation happening elsewhere (//nolint:gosec // path is under the configured source dir), but this creates a maintenance risk. If validation in computeInventory is bypassed or has a bug, this becomes a path traversal vulnerability. Consider adding explicit validation here:

Suggested change

abs := filepath.Join(s.dir, filepath.FromSlash(path))

abs := filepath.Join(s.dir, filepath.FromSlash(path))

abs = filepath.Clean(abs)

if !strings.HasPrefix(abs, s.dir+string(filepath.Separator)) && abs != s.dir {

return nil, fmt.Errorf("path escapes source directory: %s", path)

}

ask-bonk · 2026-04-29T02:39:07Z

+// io.Discard and nil is returned. This matters because Pull streams a
+// whole layer tar and may encounter files already stored from a
+// previous pull — we need those to succeed without desyncing the tar.
+func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {


The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:

Validating the size matches after writing:

Suggested change

func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {

func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error {

...

n, err := io.Copy(tmp, reader)

if err != nil {

_ = tmp.Close()

return fmt.Errorf("write %s: %w", expectedDigest, err)

}

if n != expectedSize {

_ = tmp.Close()

return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n)

}

Or remove the unused parameter to avoid confusion.

ask-bonk · 2026-04-29T02:39:07Z

+// resolveRef calls the Hub API to resolve a ref (branch/tag/sha) to the
+// full 40-char commit sha.
+func (s *HFSource) resolveRef(ctx context.Context) (string, error) {
+	url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref)


URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:

Suggested change

url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref)

func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {

escapedPath := url.PathEscape(path)

url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath)

return s.doGet(ctx, url)

}

ask-bonk · 2026-04-29T02:39:07Z

+// file, followed by the files themselves in supplied order. File
+// bytes come from the content-addressed store, keyed by digest.
+func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error {
+	for _, dir := range collectDirs(files) {


The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:

Suggested change

for _, dir := range collectDirs(files) {

for _, dir := range collectDirs(files) {

if err := ctx.Err(); err != nil {

return err

}

if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil {

return fmt.Errorf("write dir header %s: %w", dir, err)

}

}

ask-bonk · 2026-04-29T02:39:14Z

Posted review with 4 suggestions targeting security and correctness issues:

Path traversal validation in FileSource.Open - adds explicit validation to prevent directory escape
Size validation in PutFile - validates the size parameter instead of ignoring it
URL path escaping in HuggingFace source - escapes path components when constructing URLs
Context cancellation in writeLayer - adds consistent context checking for directory headers

github run

ask-bonk

I'm Bonk, and I've done a quick review of your PR. This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

I've identified 4 actionable issues around security and correctness that should be addressed:

ask-bonk · 2026-04-29T16:09:04Z

+// io.Discard and nil is returned. This matters because Pull streams a
+// whole layer tar and may encounter files already stored from a
+// previous pull — we need those to succeed without desyncing the tar.
+func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {


The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:

Validating the size matches after writing:

Suggested change

func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {

func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error {

...

n, err := io.Copy(tmp, reader)

if err != nil {

_ = tmp.Close()

return fmt.Errorf("write %s: %w", expectedDigest, err)

}

if n != expectedSize {

_ = tmp.Close()

return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n)

}

Or remove the unused parameter to avoid confusion.

ask-bonk · 2026-04-29T16:09:04Z

+
+// fetchFile streams one file from the resolve endpoint. The endpoint
+// issues a 302 to the appropriate backend (LFS CDN, xet, or inline).
+func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {


URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:

Suggested change

func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {

func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {

escapedPath := url.PathEscape(path)

url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath)

return s.doGet(ctx, url)

}

ask-bonk · 2026-04-29T16:09:04Z

+// file, followed by the files themselves in supplied order. File
+// bytes come from the content-addressed store, keyed by digest.
+func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error {
+	for _, dir := range collectDirs(files) {


The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:

Suggested change

for _, dir := range collectDirs(files) {

for _, dir := range collectDirs(files) {

if err := ctx.Err(); err != nil {

return err

}

if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil {

return fmt.Errorf("write dir header %s: %w", dir, err)

}

}

ask-bonk · 2026-04-29T16:09:27Z

Posted a review on PR #2974 with 4 suggestions addressing security and correctness issues:

Path traversal validation in FileSource.Open - adds explicit validation to prevent directory escape
Size validation in PutFile - validates the size parameter instead of ignoring it
URL path escaping in HuggingFace source - escapes path components when constructing URLs
Context cancellation in writeLayer - adds consistent context checking for directory headers

The review is available at: #2974 (review)

github run

ask-bonk

I'm Bonk, and I've done a quick review of your PR. This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

I've identified 1 actionable issue that should be addressed:

ask-bonk · 2026-04-29T17:53:23Z

+// buildURLWithQuery is like buildURL but appends a raw query string.
+func (s *HFSource) buildURLWithQuery(query string, segments ...string) string {
+	u := *s.baseURL // shallow copy
+	u.Path = path.Join(u.Path, path.Join(segments...))


URLs are constructed using path.Join which cleans paths but does not URL-escape special characters. The filePath segment (passed through from Open) could contain characters like #, %, or spaces that need escaping in URL paths. While url.URL.String() does some escaping, it doesn't handle all edge cases properly.

Consider using url.PathEscape on the filePath segment:

Suggested change

u.Path = path.Join(u.Path, path.Join(segments...))

func (s *HFSource) buildURLWithQuery(query string, segments ...string) string {

u := *s.baseURL // shallow copy

// Escape the last segment (filePath) which may contain special chars

if len(segments) > 0 {

segments[len(segments)-1] = url.PathEscape(segments[len(segments)-1])

}

u.Path = path.Join(u.Path, path.Join(segments...))

u.RawQuery = query

return u.String()

}

ask-bonk · 2026-04-29T17:53:35Z

I've completed the review of PR #2974.

Summary: This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

Issues Found: 1 actionable issue

From the 4 issues identified in previous reviews:

✅ Path traversal validation - Secure by design (uses fs.FS which prevents traversal)
✅ Size validation in PutFile - Fixed (lines 157-159 in file.go)
✅ Context cancellation in writeLayer - Fixed
⚠️ URL path escaping in HuggingFace - Still needs attention

I posted a review with a suggestion for the URL path escaping issue in pkg/model/weightsource/huggingface.go:398. The filePath parameter passed through from Open() may contain special characters (like #, %, spaces) that need URL escaping, but path.Join only cleans paths without escaping them.

Review URL: #2974 (review)

github run

ask-bonk · 2026-04-29T22:32:54Z

All tests pass and all issues from previous reviews have been addressed:

Path traversal validation - Fixed using os.DirFS
Size validation - Fixed in PutFile
Context cancellation - Fixed in writeLayer
URL path escaping - Working correctly (verified by TestHFSource_Open_EscapesPathComponents)

The new drift detection feature is well-tested and cleanly implemented.

LGTM

github run

ask-bonk · 2026-04-29T22:52:03Z

All previously identified issues have been verified as fixed:

Path traversal ✅ - Uses os.DirFS at line 56
Size validation ✅ - Validated at lines 157-159
Context cancellation ✅ - Checked at lines 422, 431
URL path escaping ✅ - Test passes

All tests pass and no linting issues.

LGTM

github run

…CLI) Adds the v1 managed-weights feature: - pkg/config: `weights` stanza in cog.yaml + JSON schema validation - pkg/model/weightsource: pluggable Source interface with file:// and hf:// scheme implementations, include/exclude filters, content-addressed fingerprinting - pkg/model: weight manifest v1 format, OCI bundle index assembly, packer plan/execute split, weight pusher with progress reporting - pkg/weights: WeightStore interface, FileStore (hardlinked extraction cache), WeightManager, parallel layer puller, mount helper for predict - pkg/weights/lockfile: weights.lock parsing, generation, source fingerprint - pkg/paths: cache-dir resolver - pkg/cli/weights: hidden `cog weights` command group with import, pull, status subcommands - pkg/cli/predict: auto-pull and mount managed weights at predict time - pkg/cli/{build,push}: emit weight manifests as part of the OCI bundle - pkg/predict: wire WeightManager into the predictor Tooling additions: - go.mod: hashicorp/go-retryablehttp for resilient pulls - mise.toml: goimports + rust-analyzer - .gitignore: drop legacy weights-gen ignore lines The `cog weights` command group is hidden, and the bundle format only activates when cog.yaml declares a `weights` stanza, so existing models are unaffected.

- integration-tests/tests/weights_*.txtar: end-to-end coverage for cog weights import, pull, predict, and include/exclude filters - integration-tests/tests/oci_bundle_*.txtar: bundle build/push/inspect flows updated for the v1 weight manifest format - integration-tests/harness: helpers for the new tests - examples/managed-weights: working reference model with cog.yaml, predict.py, README, and a checked-in weights.lock for verifying the pull/predict round trip

specs/draft-weights.md is an internal design document for the OCI weight bundle format. Lives under specs/ rather than docs/ so it is not part of the published documentation site. `draft-` prefix signals it is not yet a stable reference.

- Parallelize computeLayerDigests and computeInventory file hashing with errgroup (bounded by GOMAXPROCS) - Type WeightStatus/LayerStatus as string enums for compile-time safety - Deduplicate formatDigestShort by delegating to model.ShortDigest - Remove HasProblems() wrapper, inline !AllReady() at call site

The hidden cog inspect command was added during early managed-weights development to debug bundle pushes. It was never designed for general release: no docs, no tests beyond a single integration script, and the output format (text and JSON variants, with a separate raw streaming mode) accreted features without a coherent target audience. Removing it lets us also drop pkg/model/index.go (the Index/IndexManifest parse types only inspect consumed) and a small block in resolver.go that populated those types when loading bundles. The push-side IndexBuilder in index_factory.go is unrelated and stays. PlatformUnknown moves next to Platform in artifact_image.go, where it belongs structurally. TestModel_IsBundle moves to model_test.go since index_test.go went away.

Three integration tests asserted on the v0 cog weights {build,push,inspect} verbs and v0 lockfile schema (dest, digestOriginal fields). Those commands and that schema are gone in the current managed-weights design; the tests have been failing in CI on this branch since the v1 rewrite landed. - weights_build.txtar: tested cog weights build standalone. v1 has no separate build step (the lockfile is generated as part of import); the schema fields it asserted on no longer exist. - weights_push_inspect.txtar: tested the build → push → inspect lifecycle end-to-end. v1 collapses build+push into cog weights import (covered by oci_bundle_push.txtar) and replaces inspect with status (covered by weights_status unit tests). - oci_bundle_build.txtar: tested cog predict against a pre-built bundled image. predict.go:193-195 documents that pre-built images are now opaque to Cog at predict time — that path is explicitly out of scope for v1. One real coverage gap surfaces: parseRepoOnly's rejection of tagged references in cog weights import is no longer asserted anywhere. Worth adding a small test for it later, but unrelated to this cleanup.

cmdMockWeights generated N random files plus a synthetic weights.lock, and was registered as the mock-weights testscript command. Originally useful when weights were single random files, but the v1 design treats weights more like Hugging Face repos: one or two large shards alongside several small JSON/text files. A flat 'N random files of size X' fixture no longer represents the shape we want to test against, and no existing .txtar invokes mock-weights — every weights-related test seeds files inline with dd/printf/echo. If we later need realistic fixtures, the right shape is a templated hf-style directory with configurable padding for the large files, not a refactor of this helper. Dropping the dead code now to keep the harness honest about what it actually provides.

Apply gofmt across files that drifted out of canonical formatting. Mechanical changes only (whitespace, struct tag spacing, table-driven test alignment). No behavior change. CI's Format Go job was rejecting these on the branch.

Five integration tests were failing in CI for substantive reasons (not just the dead commands cleaned up earlier). Each one tripped on a specific v1 design choice: - weights_import_predict, weights_pull_predict, weights_filter: the cog.yaml templates carried a placeholder image: line, with the test also appending the real image: via printf at runtime. Two image: keys breaks YAML parsing. Drop the placeholder; the printf writes the only one. - weights_pull, weights_pull_predict: assumed cog weights import leaves the local cache cold so the subsequent cog weights pull has work to do. v1's import warms the cache as a side effect (the cog-i12u guarantee), so pull becomes a no-op. Purge the cache between import and pull to simulate the realistic 'lockfile checked in, fresh clone, cold cache' scenario these tests want to cover. - oci_bundle_push: three issues. (1) source: weights-alpha used the v0 string schema; v1 requires source.uri with a file:// URI. (2) cog push no longer runs the weight builder implicitly — it requires weights.lock to already exist, so add an explicit cog weights import step. (3) the run.cog.reference.type annotation assertion was for an annotation v1 doesn't emit; replace with run.cog.weight.set-digest, which v1 does emit on each weight descriptor. All five verified passing locally with the just-built dist binary against a real local registry. None of these failures were introduced by recent slices; they have been failing on the branch since the v1 rewrite landed in 07db257.

- FileSource.Open: use os.DirFS to prevent path traversal at the FS boundary instead of relying on upstream validation - FileStore.PutFile: validate expectedSize against bytes written to detect truncation/corruption - HFSource: build URLs via path.Join + url.URL instead of fmt.Sprintf string interpolation; parse baseURL once at construction - writeLayer: check context cancellation before directory header writes

Removing a weight from cog.yaml left its entry in weights.lock. Since the lockfile is projected into /.cog/weights.json (the runtime manifest), coglet expected weights that no longer exist. Add Retain() and PruneLockfile() to the lockfile package. Prune runs during cog weights import (always, using all config names) and in Resolver.Build() before the image build so writeRuntimeWeightsManifest sees a clean lockfile. Also adds config.WeightNames() helper to avoid duplicating the name-extraction loop across call sites.

…mple The qwen3.6-27b-fp8 weight was commented out in cog.yaml but its entry (~30GB of layer metadata) persisted in weights.lock. This was the bug that motivated the pruning fix. Clean up the example to reflect the actual config.

build, push, predict, and train now compare cog.yaml weight declarations against weights.lock before invoking resolver.Build(). If the lockfile is stale, missing, or has orphaned entries the command fails immediately with a clear message listing each mismatch and directing the user to run 'cog weights import'. The drift detection is split into two layers: - lockfile.CheckDrift: pure comparison, no I/O, lives in pkg/weights/lockfile - weights.CheckDrift: loads lockfile, normalizes config, formats errors

Resolver.Build() no longer opens the weight store, ingresses files, or writes weights.lock. It loads lockfile entries into a new model.Weight type -- the model's lightweight representation of a managed weight. BundlePusher.Push() no longer uploads weight layers. It HEAD-checks each weight manifest by tag (pushed earlier by 'cog weights import'), gets the descriptor, and assembles the OCI index. Image and weight HEAD-checks run concurrently. WeightArtifact/WeightBuilder/WeightPusher remain for the import path. Also removes dead fields: PushOptions.WeightProgressFn, BuildOptions.WeightsLockPath.

ask-bonk · 2026-04-30T02:11:10Z

All issues from previous reviews have been verified as fixed:

Path traversal validation ✅ - Uses os.DirFS(path) at line 56 in file.go
Size validation ✅ - Validated at lines 157-159 in file.go
Context cancellation ✅ - Checked at lines 421-424 and 431 in packer.go
URL path escaping ✅ - Working correctly (verified by passing test TestHFSource_Open_EscapesPathComponents)

All tests pass successfully.

LGTM

github run

buildURLWithQuery now percent-encodes each path component individually so filenames containing #, %, spaces, or other URL-special characters produce valid URLs. Uses url.URL.RawPath to avoid double-encoding. Resolves ask-bonk review feedback on PR #2974.

ask-bonk · 2026-04-30T04:07:47Z

All previously identified issues have been verified as fixed:

Path traversal validation ✅ - Uses os.DirFS(path) at line 56 in pkg/model/weightsource/file.go
Size validation ✅ - Validated at lines 157-159 in pkg/weights/store/file.go
Context cancellation ✅ - Checked at lines 421-424 and 431 in pkg/model/packer.go
URL path escaping ✅ - Working correctly (verified by passing test TestHFSource_Open_EscapesPathComponents)

All tests pass and no linting issues detected.

LGTM

github run

ask-bonk · 2026-04-30T04:11:07Z

LGTM

github run

ask-bonk Bot reviewed Apr 29, 2026

View reviewed changes

michaeldwan added 15 commits April 29, 2026 17:22

style: gofmt fixes

4593416

Apply gofmt across files that drifted out of canonical formatting. Mechanical changes only (whitespace, struct tag spacing, table-driven test alignment). No behavior change. CI's Format Go job was rejecting these on the branch.

chore(examples): add directory listing to predict.py setup for debugging

e15c11f

michaeldwan force-pushed the md/managed-weights-v1 branch from d80a200 to e15c11f Compare April 30, 2026 02:09

michaeldwan added 2 commits April 29, 2026 22:04

chore: fix formatting

95cb315

-	abs := filepath.Join(s.dir, filepath.FromSlash(path))
+	abs := filepath.Join(s.dir, filepath.FromSlash(path))
+	abs = filepath.Clean(abs)
+	if !strings.HasPrefix(abs, s.dir+string(filepath.Separator)) && abs != s.dir {
+		return nil, fmt.Errorf("path escapes source directory: %s", path)
+	}

-func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {
+func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error {
+	...
+	n, err := io.Copy(tmp, reader)
+	if err != nil {
+		_ = tmp.Close()
+		return fmt.Errorf("write %s: %w", expectedDigest, err)
+	}
+	if n != expectedSize {
+		_ = tmp.Close()
+		return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n)
+	}

-	url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref)
+func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {
+	escapedPath := url.PathEscape(path)
+	url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath)
+	return s.doGet(ctx, url)
+}

-	for _, dir := range collectDirs(files) {
+	for _, dir := range collectDirs(files) {
+		if err := ctx.Err(); err != nil {
+			return err
+		}
+		if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil {
+			return fmt.Errorf("write dir header %s: %w", dir, err)
+		}
+	}

-	u.Path = path.Join(u.Path, path.Join(segments...))
+func (s *HFSource) buildURLWithQuery(query string, segments ...string) string {
+	u := *s.baseURL // shallow copy
+	// Escape the last segment (filePath) which may contain special chars
+	if len(segments) > 0 {
+		segments[len(segments)-1] = url.PathEscape(segments[len(segments)-1])
+	}
+	u.Path = path.Join(u.Path, path.Join(segments...))
+	u.RawQuery = query
+	return u.String()
+}

Conversation

michaeldwan commented Apr 28, 2026

Uh oh!

ask-bonk Bot commented Apr 28, 2026

Uh oh!

ask-bonk Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot commented Apr 29, 2026

Uh oh!

ask-bonk Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot commented Apr 29, 2026

Uh oh!

ask-bonk Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot commented Apr 29, 2026

Uh oh!

ask-bonk Bot commented Apr 29, 2026

Uh oh!

ask-bonk Bot commented Apr 29, 2026

Uh oh!

ask-bonk Bot commented Apr 30, 2026

Uh oh!

ask-bonk Bot commented Apr 30, 2026

Uh oh!

ask-bonk Bot commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant