wip weights#2974
Conversation
|
LGTM |
| // Inventory paths use forward slashes; filepath.Join on a path | ||
| // containing "/" works correctly on POSIX and is normalized on | ||
| // Windows. | ||
| abs := filepath.Join(s.dir, filepath.FromSlash(path)) |
There was a problem hiding this comment.
This function relies on path validation happening elsewhere (//nolint:gosec // path is under the configured source dir), but this creates a maintenance risk. If validation in computeInventory is bypassed or has a bug, this becomes a path traversal vulnerability. Consider adding explicit validation here:
| abs := filepath.Join(s.dir, filepath.FromSlash(path)) | |
| abs := filepath.Join(s.dir, filepath.FromSlash(path)) | |
| abs = filepath.Clean(abs) | |
| if !strings.HasPrefix(abs, s.dir+string(filepath.Separator)) && abs != s.dir { | |
| return nil, fmt.Errorf("path escapes source directory: %s", path) | |
| } |
| // io.Discard and nil is returned. This matters because Pull streams a | ||
| // whole layer tar and may encounter files already stored from a | ||
| // previous pull — we need those to succeed without desyncing the tar. | ||
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error { |
There was a problem hiding this comment.
The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:
- Validating the size matches after writing:
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error { | |
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error { | |
| ... | |
| n, err := io.Copy(tmp, reader) | |
| if err != nil { | |
| _ = tmp.Close() | |
| return fmt.Errorf("write %s: %w", expectedDigest, err) | |
| } | |
| if n != expectedSize { | |
| _ = tmp.Close() | |
| return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n) | |
| } |
- Or remove the unused parameter to avoid confusion.
| // resolveRef calls the Hub API to resolve a ref (branch/tag/sha) to the | ||
| // full 40-char commit sha. | ||
| func (s *HFSource) resolveRef(ctx context.Context) (string, error) { | ||
| url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref) |
There was a problem hiding this comment.
URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:
| url := fmt.Sprintf("%s/api/models/%s/revision/%s", s.baseURL, s.repo, s.ref) | |
| func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) { | |
| escapedPath := url.PathEscape(path) | |
| url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath) | |
| return s.doGet(ctx, url) | |
| } |
| // file, followed by the files themselves in supplied order. File | ||
| // bytes come from the content-addressed store, keyed by digest. | ||
| func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error { | ||
| for _, dir := range collectDirs(files) { |
There was a problem hiding this comment.
The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:
| for _, dir := range collectDirs(files) { | |
| for _, dir := range collectDirs(files) { | |
| if err := ctx.Err(); err != nil { | |
| return err | |
| } | |
| if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil { | |
| return fmt.Errorf("write dir header %s: %w", dir, err) | |
| } | |
| } |
|
Posted review with 4 suggestions targeting security and correctness issues:
|
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR. This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts.
I've identified 4 actionable issues around security and correctness that should be addressed:
| // io.Discard and nil is returned. This matters because Pull streams a | ||
| // whole layer tar and may encounter files already stored from a | ||
| // previous pull — we need those to succeed without desyncing the tar. | ||
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error { |
There was a problem hiding this comment.
The size parameter is ignored (underscore _). This parameter should be validated against the actual bytes written to detect truncation or corruption. Consider either:
- Validating the size matches after writing:
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, _ int64, r io.Reader) error { | |
| func (s *FileStore) PutFile(ctx context.Context, expectedDigest string, expectedSize int64, r io.Reader) error { | |
| ... | |
| n, err := io.Copy(tmp, reader) | |
| if err != nil { | |
| _ = tmp.Close() | |
| return fmt.Errorf("write %s: %w", expectedDigest, err) | |
| } | |
| if n != expectedSize { | |
| _ = tmp.Close() | |
| return fmt.Errorf("size mismatch for %s: expected %d, got %d", expectedDigest, expectedSize, n) | |
| } |
- Or remove the unused parameter to avoid confusion.
|
|
||
| // fetchFile streams one file from the resolve endpoint. The endpoint | ||
| // issues a 302 to the appropriate backend (LFS CDN, xet, or inline). | ||
| func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) { |
There was a problem hiding this comment.
URLs are constructed using fmt.Sprintf with user-controlled path components. While s.repo and s.ref are validated, the path argument in fetchFile could contain special characters that require URL escaping. Consider using url.PathEscape:
| func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) { | |
| func (s *HFSource) fetchFile(ctx context.Context, ref, path string) (io.ReadCloser, error) { | |
| escapedPath := url.PathEscape(path) | |
| url := fmt.Sprintf("%s/%s/resolve/%s/%s", s.baseURL, s.repo, ref, escapedPath) | |
| return s.doGet(ctx, url) | |
| } |
| // file, followed by the files themselves in supplied order. File | ||
| // bytes come from the content-addressed store, keyed by digest. | ||
| func writeLayer(ctx context.Context, st store.Store, tw *tar.Writer, files []weightsource.InventoryFile) error { | ||
| for _, dir := range collectDirs(files) { |
There was a problem hiding this comment.
The writeLayer function checks ctx.Err() when writing files (line 428), but not when writing directory headers. For consistency and correctness in highly concurrent scenarios, consider adding a context check before each directory header write:
| for _, dir := range collectDirs(files) { | |
| for _, dir := range collectDirs(files) { | |
| if err := ctx.Err(); err != nil { | |
| return err | |
| } | |
| if err := tw.WriteHeader(deterministicDirHeader(dir)); err != nil { | |
| return fmt.Errorf("write dir header %s: %w", dir, err) | |
| } | |
| } |
|
Posted a review on PR #2974 with 4 suggestions addressing security and correctness issues:
The review is available at: #2974 (review) |
| // buildURLWithQuery is like buildURL but appends a raw query string. | ||
| func (s *HFSource) buildURLWithQuery(query string, segments ...string) string { | ||
| u := *s.baseURL // shallow copy | ||
| u.Path = path.Join(u.Path, path.Join(segments...)) |
There was a problem hiding this comment.
URLs are constructed using path.Join which cleans paths but does not URL-escape special characters. The filePath segment (passed through from Open) could contain characters like #, %, or spaces that need escaping in URL paths. While url.URL.String() does some escaping, it doesn't handle all edge cases properly.
Consider using url.PathEscape on the filePath segment:
| u.Path = path.Join(u.Path, path.Join(segments...)) | |
| func (s *HFSource) buildURLWithQuery(query string, segments ...string) string { | |
| u := *s.baseURL // shallow copy | |
| // Escape the last segment (filePath) which may contain special chars | |
| if len(segments) > 0 { | |
| segments[len(segments)-1] = url.PathEscape(segments[len(segments)-1]) | |
| } | |
| u.Path = path.Join(u.Path, path.Join(segments...)) | |
| u.RawQuery = query | |
| return u.String() | |
| } |
|
I've completed the review of PR #2974. Summary: This PR introduces managed weights functionality for Cog, implementing a weight import/push/pull system using OCI artifacts. Issues Found: 1 actionable issue From the 4 issues identified in previous reviews:
I posted a review with a suggestion for the URL path escaping issue in Review URL: #2974 (review) |
|
All tests pass and all issues from previous reviews have been addressed:
The new drift detection feature is well-tested and cleanly implemented. LGTM |
|
All previously identified issues have been verified as fixed:
All tests pass and no linting issues. LGTM |
…CLI)
Adds the v1 managed-weights feature:
- pkg/config: `weights` stanza in cog.yaml + JSON schema validation
- pkg/model/weightsource: pluggable Source interface with file:// and hf://
scheme implementations, include/exclude filters, content-addressed
fingerprinting
- pkg/model: weight manifest v1 format, OCI bundle index assembly,
packer plan/execute split, weight pusher with progress reporting
- pkg/weights: WeightStore interface, FileStore (hardlinked extraction
cache), WeightManager, parallel layer puller, mount helper for predict
- pkg/weights/lockfile: weights.lock parsing, generation, source fingerprint
- pkg/paths: cache-dir resolver
- pkg/cli/weights: hidden `cog weights` command group with import, pull,
status subcommands
- pkg/cli/predict: auto-pull and mount managed weights at predict time
- pkg/cli/{build,push}: emit weight manifests as part of the OCI bundle
- pkg/predict: wire WeightManager into the predictor
Tooling additions:
- go.mod: hashicorp/go-retryablehttp for resilient pulls
- mise.toml: goimports + rust-analyzer
- .gitignore: drop legacy weights-gen ignore lines
The `cog weights` command group is hidden, and the bundle format only
activates when cog.yaml declares a `weights` stanza, so existing models
are unaffected.
- integration-tests/tests/weights_*.txtar: end-to-end coverage for cog weights import, pull, predict, and include/exclude filters - integration-tests/tests/oci_bundle_*.txtar: bundle build/push/inspect flows updated for the v1 weight manifest format - integration-tests/harness: helpers for the new tests - examples/managed-weights: working reference model with cog.yaml, predict.py, README, and a checked-in weights.lock for verifying the pull/predict round trip
specs/draft-weights.md is an internal design document for the OCI weight bundle format. Lives under specs/ rather than docs/ so it is not part of the published documentation site. `draft-` prefix signals it is not yet a stable reference.
- Parallelize computeLayerDigests and computeInventory file hashing with errgroup (bounded by GOMAXPROCS) - Type WeightStatus/LayerStatus as string enums for compile-time safety - Deduplicate formatDigestShort by delegating to model.ShortDigest - Remove HasProblems() wrapper, inline !AllReady() at call site
The hidden cog inspect command was added during early managed-weights development to debug bundle pushes. It was never designed for general release: no docs, no tests beyond a single integration script, and the output format (text and JSON variants, with a separate raw streaming mode) accreted features without a coherent target audience. Removing it lets us also drop pkg/model/index.go (the Index/IndexManifest parse types only inspect consumed) and a small block in resolver.go that populated those types when loading bundles. The push-side IndexBuilder in index_factory.go is unrelated and stays. PlatformUnknown moves next to Platform in artifact_image.go, where it belongs structurally. TestModel_IsBundle moves to model_test.go since index_test.go went away.
Three integration tests asserted on the v0 cog weights {build,push,inspect}
verbs and v0 lockfile schema (dest, digestOriginal fields). Those commands
and that schema are gone in the current managed-weights design; the tests
have been failing in CI on this branch since the v1 rewrite landed.
- weights_build.txtar: tested cog weights build standalone. v1 has no
separate build step (the lockfile is generated as part of import); the
schema fields it asserted on no longer exist.
- weights_push_inspect.txtar: tested the build → push → inspect lifecycle
end-to-end. v1 collapses build+push into cog weights import (covered by
oci_bundle_push.txtar) and replaces inspect with status (covered by
weights_status unit tests).
- oci_bundle_build.txtar: tested cog predict against a pre-built bundled
image. predict.go:193-195 documents that pre-built images are now
opaque to Cog at predict time — that path is explicitly out of scope
for v1.
One real coverage gap surfaces: parseRepoOnly's rejection of tagged
references in cog weights import is no longer asserted anywhere. Worth
adding a small test for it later, but unrelated to this cleanup.
cmdMockWeights generated N random files plus a synthetic weights.lock, and was registered as the mock-weights testscript command. Originally useful when weights were single random files, but the v1 design treats weights more like Hugging Face repos: one or two large shards alongside several small JSON/text files. A flat 'N random files of size X' fixture no longer represents the shape we want to test against, and no existing .txtar invokes mock-weights — every weights-related test seeds files inline with dd/printf/echo. If we later need realistic fixtures, the right shape is a templated hf-style directory with configurable padding for the large files, not a refactor of this helper. Dropping the dead code now to keep the harness honest about what it actually provides.
Apply gofmt across files that drifted out of canonical formatting. Mechanical changes only (whitespace, struct tag spacing, table-driven test alignment). No behavior change. CI's Format Go job was rejecting these on the branch.
Five integration tests were failing in CI for substantive reasons (not just the dead commands cleaned up earlier). Each one tripped on a specific v1 design choice: - weights_import_predict, weights_pull_predict, weights_filter: the cog.yaml templates carried a placeholder image: line, with the test also appending the real image: via printf at runtime. Two image: keys breaks YAML parsing. Drop the placeholder; the printf writes the only one. - weights_pull, weights_pull_predict: assumed cog weights import leaves the local cache cold so the subsequent cog weights pull has work to do. v1's import warms the cache as a side effect (the cog-i12u guarantee), so pull becomes a no-op. Purge the cache between import and pull to simulate the realistic 'lockfile checked in, fresh clone, cold cache' scenario these tests want to cover. - oci_bundle_push: three issues. (1) source: weights-alpha used the v0 string schema; v1 requires source.uri with a file:// URI. (2) cog push no longer runs the weight builder implicitly — it requires weights.lock to already exist, so add an explicit cog weights import step. (3) the run.cog.reference.type annotation assertion was for an annotation v1 doesn't emit; replace with run.cog.weight.set-digest, which v1 does emit on each weight descriptor. All five verified passing locally with the just-built dist binary against a real local registry. None of these failures were introduced by recent slices; they have been failing on the branch since the v1 rewrite landed in 07db257.
- FileSource.Open: use os.DirFS to prevent path traversal at the FS boundary instead of relying on upstream validation - FileStore.PutFile: validate expectedSize against bytes written to detect truncation/corruption - HFSource: build URLs via path.Join + url.URL instead of fmt.Sprintf string interpolation; parse baseURL once at construction - writeLayer: check context cancellation before directory header writes
Removing a weight from cog.yaml left its entry in weights.lock. Since the lockfile is projected into /.cog/weights.json (the runtime manifest), coglet expected weights that no longer exist. Add Retain() and PruneLockfile() to the lockfile package. Prune runs during cog weights import (always, using all config names) and in Resolver.Build() before the image build so writeRuntimeWeightsManifest sees a clean lockfile. Also adds config.WeightNames() helper to avoid duplicating the name-extraction loop across call sites.
…mple The qwen3.6-27b-fp8 weight was commented out in cog.yaml but its entry (~30GB of layer metadata) persisted in weights.lock. This was the bug that motivated the pruning fix. Clean up the example to reflect the actual config.
build, push, predict, and train now compare cog.yaml weight declarations against weights.lock before invoking resolver.Build(). If the lockfile is stale, missing, or has orphaned entries the command fails immediately with a clear message listing each mismatch and directing the user to run 'cog weights import'. The drift detection is split into two layers: - lockfile.CheckDrift: pure comparison, no I/O, lives in pkg/weights/lockfile - weights.CheckDrift: loads lockfile, normalizes config, formats errors
Resolver.Build() no longer opens the weight store, ingresses files, or writes weights.lock. It loads lockfile entries into a new model.Weight type -- the model's lightweight representation of a managed weight. BundlePusher.Push() no longer uploads weight layers. It HEAD-checks each weight manifest by tag (pushed earlier by 'cog weights import'), gets the descriptor, and assembles the OCI index. Image and weight HEAD-checks run concurrently. WeightArtifact/WeightBuilder/WeightPusher remain for the import path. Also removes dead fields: PushOptions.WeightProgressFn, BuildOptions.WeightsLockPath.
d80a200 to
e15c11f
Compare
|
All issues from previous reviews have been verified as fixed:
All tests pass successfully. LGTM |
buildURLWithQuery now percent-encodes each path component individually so filenames containing #, %, spaces, or other URL-special characters produce valid URLs. Uses url.URL.RawPath to avoid double-encoding. Resolves ask-bonk review feedback on PR #2974.
|
All previously identified issues have been verified as fixed:
All tests pass and no linting issues detected. LGTM |
|
LGTM |
No description provided.