Skip to content

wip weights#2974

Draft
michaeldwan wants to merge 17 commits intomainfrom
md/managed-weights-v1
Draft

wip weights#2974
michaeldwan wants to merge 17 commits intomainfrom
md/managed-weights-v1

Conversation

@michaeldwan
Copy link
Copy Markdown
Member

No description provided.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 28, 2026

LGTM

github run

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the managed weights PR. The code is generally well-structured with good documentation. I've identified a few actionable issues around security and correctness.

Comment thread pkg/model/weightsource/file.go Outdated
// Inventory paths use forward slashes; filepath.Join on a path
// containing "/" works correctly on POSIX and is normalized on
// Windows.
abs := filepath.Join(s.dir, filepath.FromSlash(path))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function relies on path validation happening elsewhere (//nolint:gosec // path is under the configured source dir), but this creates a maintenance risk. If validation in computeInventory is bypassed or has a bug, this becomes a path traversal vulnerability. Consider adding explicit validation here:

Suggested change
abs := filepath.Join(s.dir, filepath.FromSlash(path))
abs := filepath.Join(s.dir, filepath.FromSlash(path))
abs = filepath.Clean(abs)
if !strings.HasPrefix(abs, s.dir+string(filepath.Separator)) && abs != s.dir {
return nil, fmt.Errorf("path escapes source directory: %s", path)
}

Comment thread pkg/weights/store/file.go Outdated
// io.Discard and nil is returned. This matters because Pull streams a
// whole layer tar and may encounter files already stored from a
// previous pull — we need those to succeed without desyncing the tar.
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:

  1. Validating the size matches after writing:
Suggested change
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error {
...
n, err := io.Copy(tmp, reader)
if err != nil {
_ = tmp.Close()
return fmt.Errorf("write %s: %w", expectedDigest, err)
}
if n != expectedSize {
_ = tmp.Close()
return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n)
}
  1. Or remove the unused parameter to avoid confusion.

Comment thread pkg/model/weightsource/huggingface.go Outdated
// resolveRef calls the Hub API to resolve a ref (branch/tag/sha) to the
// full 40-char commit sha.
func (s *HFSource) resolveRef(ctx context.Context) (string, error) {
url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:

Suggested change
url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref)
func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {
escapedPath := url.PathEscape(path)
url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath)
return s.doGet(ctx, url)
}

Comment thread pkg/model/packer.go
// file, followed by the files themselves in supplied order. File
// bytes come from the content-addressed store, keyed by digest.
func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error {
for _, dir := range collectDirs(files) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:

Suggested change
for _, dir := range collectDirs(files) {
for _, dir := range collectDirs(files) {
if err := ctx.Err(); err != nil {
return err
}
if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil {
return fmt.Errorf("write dir header %s: %w", dir, err)
}
}

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 29, 2026

Posted review with 4 suggestions targeting security and correctness issues:

  1. Path traversal validation in FileSource.Open - adds explicit validation to prevent directory escape
  2. Size validation in PutFile - validates the size parameter instead of ignoring it
  3. URL path escaping in HuggingFace source - escapes path components when constructing URLs
  4. Context cancellation in writeLayer - adds consistent context checking for directory headers

github run

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR. This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

I've identified 4 actionable issues around security and correctness that should be addressed:

Comment thread pkg/model/weightsource/file.go Outdated
Comment thread pkg/weights/store/file.go Outdated
// io.Discard and nil is returned. This matters because Pull streams a
// whole layer tar and may encounter files already stored from a
// previous pull — we need those to succeed without desyncing the tar.
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:

  1. Validating the size matches after writing:
Suggested change
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error {
func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error {
...
n, err := io.Copy(tmp, reader)
if err != nil {
_ = tmp.Close()
return fmt.Errorf("write %s: %w", expectedDigest, err)
}
if n != expectedSize {
_ = tmp.Close()
return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n)
}
  1. Or remove the unused parameter to avoid confusion.

Comment thread pkg/model/weightsource/huggingface.go Outdated

// fetchFile streams one file from the resolve endpoint. The endpoint
// issues a 302 to the appropriate backend (LFS CDN, xet, or inline).
func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:

Suggested change
func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {
func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) {
escapedPath := url.PathEscape(path)
url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath)
return s.doGet(ctx, url)
}

Comment thread pkg/model/packer.go
// file, followed by the files themselves in supplied order. File
// bytes come from the content-addressed store, keyed by digest.
func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error {
for _, dir := range collectDirs(files) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:

Suggested change
for _, dir := range collectDirs(files) {
for _, dir := range collectDirs(files) {
if err := ctx.Err(); err != nil {
return err
}
if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil {
return fmt.Errorf("write dir header %s: %w", dir, err)
}
}

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 29, 2026

Posted a review on PR #2974 with 4 suggestions addressing security and correctness issues:

  1. Path traversal validation in FileSource.Open - adds explicit validation to prevent directory escape
  2. Size validation in PutFile - validates the size parameter instead of ignoring it
  3. URL path escaping in HuggingFace source - escapes path components when constructing URLs
  4. Context cancellation in writeLayer - adds consistent context checking for directory headers

The review is available at: #2974 (review)

github run

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR. This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

I've identified 1 actionable issue that should be addressed:

Comment thread pkg/model/weightsource/huggingface.go Outdated
// buildURLWithQuery is like buildURL but appends a raw query string.
func (s *HFSource) buildURLWithQuery(query string, segments ...string) string {
u := *s.baseURL // shallow copy
u.Path = path.Join(u.Path, path.Join(segments...))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URLs are constructed using path.Join which cleans paths but does not URL-escape special characters. The filePath segment (passed through from Open) could contain characters like #, %, or spaces that need escaping in URL paths. While url.URL.String() does some escaping, it doesn't handle all edge cases properly.

Consider using url.PathEscape on the filePath segment:

Suggested change
u.Path = path.Join(u.Path, path.Join(segments...))
func (s *HFSource) buildURLWithQuery(query string, segments ...string) string {
u := *s.baseURL // shallow copy
// Escape the last segment (filePath) which may contain special chars
if len(segments) > 0 {
segments[len(segments)-1] = url.PathEscape(segments[len(segments)-1])
}
u.Path = path.Join(u.Path, path.Join(segments...))
u.RawQuery = query
return u.String()
}

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 29, 2026

I've completed the review of PR #2974.

Summary: This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.

Issues Found: 1 actionable issue

From the 4 issues identified in previous reviews:

  • Path traversal validation - Secure by design (uses fs.FS which prevents traversal)
  • Size validation in PutFile - Fixed (lines 157-159 in file.go)
  • Context cancellation in writeLayer - Fixed
  • ⚠️ URL path escaping in HuggingFace - Still needs attention

I posted a review with a suggestion for the URL path escaping issue in pkg/model/weightsource/huggingface.go:398. The filePath parameter passed through from Open() may contain special characters (like #, %, spaces) that need URL escaping, but path.Join only cleans paths without escaping them.

Review URL: #2974 (review)

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 29, 2026

All tests pass and all issues from previous reviews have been addressed:

  1. Path traversal validation - Fixed using os.DirFS
  2. Size validation - Fixed in PutFile
  3. Context cancellation - Fixed in writeLayer
  4. URL path escaping - Working correctly (verified by TestHFSource_Open_EscapesPathComponents)

The new drift detection feature is well-tested and cleanly implemented.

LGTM

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 29, 2026

All previously identified issues have been verified as fixed:

  1. Path traversal ✅ - Uses os.DirFS at line 56
  2. Size validation ✅ - Validated at lines 157-159
  3. Context cancellation ✅ - Checked at lines 422, 431
  4. URL path escaping ✅ - Test passes

All tests pass and no linting issues.

LGTM

github run

…CLI)

Adds the v1 managed-weights feature:

- pkg/config: `weights` stanza in cog.yaml + JSON schema validation
- pkg/model/weightsource: pluggable Source interface with file:// and hf://
  scheme implementations, include/exclude filters, content-addressed
  fingerprinting
- pkg/model: weight manifest v1 format, OCI bundle index assembly,
  packer plan/execute split, weight pusher with progress reporting
- pkg/weights: WeightStore interface, FileStore (hardlinked extraction
  cache), WeightManager, parallel layer puller, mount helper for predict
- pkg/weights/lockfile: weights.lock parsing, generation, source fingerprint
- pkg/paths: cache-dir resolver
- pkg/cli/weights: hidden `cog weights` command group with import, pull,
  status subcommands
- pkg/cli/predict: auto-pull and mount managed weights at predict time
- pkg/cli/{build,push}: emit weight manifests as part of the OCI bundle
- pkg/predict: wire WeightManager into the predictor

Tooling additions:
- go.mod: hashicorp/go-retryablehttp for resilient pulls
- mise.toml: goimports + rust-analyzer
- .gitignore: drop legacy weights-gen ignore lines

The `cog weights` command group is hidden, and the bundle format only
activates when cog.yaml declares a `weights` stanza, so existing models
are unaffected.
- integration-tests/tests/weights_*.txtar: end-to-end coverage for
  cog weights import, pull, predict, and include/exclude filters
- integration-tests/tests/oci_bundle_*.txtar: bundle build/push/inspect
  flows updated for the v1 weight manifest format
- integration-tests/harness: helpers for the new tests
- examples/managed-weights: working reference model with cog.yaml,
  predict.py, README, and a checked-in weights.lock for verifying the
  pull/predict round trip
specs/draft-weights.md is an internal design document for the OCI
weight bundle format. Lives under specs/ rather than docs/ so it is
not part of the published documentation site. `draft-` prefix signals
it is not yet a stable reference.
- Parallelize computeLayerDigests and computeInventory file hashing
  with errgroup (bounded by GOMAXPROCS)
- Type WeightStatus/LayerStatus as string enums for compile-time safety
- Deduplicate formatDigestShort by delegating to model.ShortDigest
- Remove HasProblems() wrapper, inline !AllReady() at call site
The hidden cog inspect command was added during early managed-weights
development to debug bundle pushes. It was never designed for general
release: no docs, no tests beyond a single integration script, and the
output format (text and JSON variants, with a separate raw streaming
mode) accreted features without a coherent target audience.

Removing it lets us also drop pkg/model/index.go (the Index/IndexManifest
parse types only inspect consumed) and a small block in resolver.go that
populated those types when loading bundles. The push-side IndexBuilder
in index_factory.go is unrelated and stays.

PlatformUnknown moves next to Platform in artifact_image.go, where it
belongs structurally. TestModel_IsBundle moves to model_test.go since
index_test.go went away.
Three integration tests asserted on the v0 cog weights {build,push,inspect}
verbs and v0 lockfile schema (dest, digestOriginal fields). Those commands
and that schema are gone in the current managed-weights design; the tests
have been failing in CI on this branch since the v1 rewrite landed.

- weights_build.txtar: tested cog weights build standalone. v1 has no
  separate build step (the lockfile is generated as part of import); the
  schema fields it asserted on no longer exist.
- weights_push_inspect.txtar: tested the build → push → inspect lifecycle
  end-to-end. v1 collapses build+push into cog weights import (covered by
  oci_bundle_push.txtar) and replaces inspect with status (covered by
  weights_status unit tests).
- oci_bundle_build.txtar: tested cog predict against a pre-built bundled
  image. predict.go:193-195 documents that pre-built images are now
  opaque to Cog at predict time — that path is explicitly out of scope
  for v1.

One real coverage gap surfaces: parseRepoOnly's rejection of tagged
references in cog weights import is no longer asserted anywhere. Worth
adding a small test for it later, but unrelated to this cleanup.
cmdMockWeights generated N random files plus a synthetic weights.lock,
and was registered as the mock-weights testscript command. Originally
useful when weights were single random files, but the v1 design treats
weights more like Hugging Face repos: one or two large shards alongside
several small JSON/text files. A flat 'N random files of size X'
fixture no longer represents the shape we want to test against, and no
existing .txtar invokes mock-weights — every weights-related test seeds
files inline with dd/printf/echo.

If we later need realistic fixtures, the right shape is a templated
hf-style directory with configurable padding for the large files, not
a refactor of this helper. Dropping the dead code now to keep the
harness honest about what it actually provides.
Apply gofmt across files that drifted out of canonical formatting.
Mechanical changes only (whitespace, struct tag spacing, table-driven
test alignment). No behavior change. CI's Format Go job was rejecting
these on the branch.
Five integration tests were failing in CI for substantive reasons (not
just the dead commands cleaned up earlier). Each one tripped on a
specific v1 design choice:

- weights_import_predict, weights_pull_predict, weights_filter: the
  cog.yaml templates carried a placeholder image: line, with the test
  also appending the real image: via printf at runtime. Two image:
  keys breaks YAML parsing. Drop the placeholder; the printf writes
  the only one.

- weights_pull, weights_pull_predict: assumed cog weights import
  leaves the local cache cold so the subsequent cog weights pull has
  work to do. v1's import warms the cache as a side effect (the
  cog-i12u guarantee), so pull becomes a no-op. Purge the cache
  between import and pull to simulate the realistic 'lockfile checked
  in, fresh clone, cold cache' scenario these tests want to cover.

- oci_bundle_push: three issues. (1) source: weights-alpha used the v0
  string schema; v1 requires source.uri with a file:// URI. (2) cog
  push no longer runs the weight builder implicitly — it requires
  weights.lock to already exist, so add an explicit cog weights import
  step. (3) the run.cog.reference.type annotation assertion was for an
  annotation v1 doesn't emit; replace with run.cog.weight.set-digest,
  which v1 does emit on each weight descriptor.

All five verified passing locally with the just-built dist binary
against a real local registry. None of these failures were introduced
by recent slices; they have been failing on the branch since the v1
rewrite landed in 07db257.
- FileSource.Open: use os.DirFS to prevent path traversal at the FS
  boundary instead of relying on upstream validation
- FileStore.PutFile: validate expectedSize against bytes written to
  detect truncation/corruption
- HFSource: build URLs via path.Join + url.URL instead of fmt.Sprintf
  string interpolation; parse baseURL once at construction
- writeLayer: check context cancellation before directory header writes
Removing a weight from cog.yaml left its entry in weights.lock.
Since the lockfile is projected into /.cog/weights.json (the runtime
manifest), coglet expected weights that no longer exist.

Add Retain() and PruneLockfile() to the lockfile package. Prune runs
during cog weights import (always, using all config names) and in
Resolver.Build() before the image build so writeRuntimeWeightsManifest
sees a clean lockfile.

Also adds config.WeightNames() helper to avoid duplicating the
name-extraction loop across call sites.
…mple

The qwen3.6-27b-fp8 weight was commented out in cog.yaml but its
entry (~30GB of layer metadata) persisted in weights.lock. This was
the bug that motivated the pruning fix. Clean up the example to
reflect the actual config.
build, push, predict, and train now compare cog.yaml weight declarations
against weights.lock before invoking resolver.Build(). If the lockfile is
stale, missing, or has orphaned entries the command fails immediately with
a clear message listing each mismatch and directing the user to run
'cog weights import'.

The drift detection is split into two layers:
- lockfile.CheckDrift: pure comparison, no I/O, lives in pkg/weights/lockfile
- weights.CheckDrift: loads lockfile, normalizes config, formats errors
Resolver.Build() no longer opens the weight store, ingresses files, or
writes weights.lock. It loads lockfile entries into a new model.Weight
type -- the model's lightweight representation of a managed weight.

BundlePusher.Push() no longer uploads weight layers. It HEAD-checks
each weight manifest by tag (pushed earlier by 'cog weights import'),
gets the descriptor, and assembles the OCI index. Image and weight
HEAD-checks run concurrently.

WeightArtifact/WeightBuilder/WeightPusher remain for the import path.

Also removes dead fields: PushOptions.WeightProgressFn,
BuildOptions.WeightsLockPath.
@michaeldwan michaeldwan force-pushed the md/managed-weights-v1 branch from d80a200 to e15c11f Compare April 30, 2026 02:09
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 30, 2026

All issues from previous reviews have been verified as fixed:

  1. Path traversal validation ✅ - Uses os.DirFS(path) at line 56 in file.go
  2. Size validation ✅ - Validated at lines 157-159 in file.go
  3. Context cancellation ✅ - Checked at lines 421-424 and 431 in packer.go
  4. URL path escaping ✅ - Working correctly (verified by passing test TestHFSource_Open_EscapesPathComponents)

All tests pass successfully.

LGTM

github run

buildURLWithQuery now percent-encodes each path component individually
so filenames containing #, %, spaces, or other URL-special characters
produce valid URLs. Uses url.URL.RawPath to avoid double-encoding.

Resolves ask-bonk review feedback on PR #2974.
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 30, 2026

All previously identified issues have been verified as fixed:

  1. Path traversal validation ✅ - Uses os.DirFS(path) at line 56 in pkg/model/weightsource/file.go
  2. Size validation ✅ - Validated at lines 157-159 in pkg/weights/store/file.go
  3. Context cancellation ✅ - Checked at lines 421-424 and 431 in pkg/model/packer.go
  4. URL path escaping ✅ - Working correctly (verified by passing test TestHFSource_Open_EscapesPathComponents)

All tests pass and no linting issues detected.

LGTM

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 30, 2026

LGTM

github run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant