152 commits
88508a0
fix(scripts): replace mapfile with bash 3.2-compatible read loop in h…
mesutoezdil May 26, 2026
3460e5f
docs: add macOS compiler troubleshooting (#1569)
amfred May 26, 2026
fa84e43
fix(gateway): configure local dev auth (#1575)
krishicks May 26, 2026
9e5aee4
docs: add Pi as supported sandbox (#1572)
vegarsti May 26, 2026
7174983
fix(sandbox): add mechanistic smoke test for L4 deny and document the…
mesutoezdil May 26, 2026
47d208c
docs(readme): whitespace (#1578)
krishicks May 26, 2026
2e03faf
fix(cli): replace outdated name reference (#1582)
krishicks May 26, 2026
a3ed421
fix(sandbox): probe Landlock before build, skip on unsupported kernel…
dims May 27, 2026
c9056bb
fix(sandbox): decouple GPU baseline from network policy (#1524)
elezar May 27, 2026
b2f0f22
docs(kubernetes): note that Sandbox volumeClaimTemplates is immutable…
mesutoezdil May 27, 2026
db40831
fix(sandbox): use succinct endpoint denial reason (#1584)
krishicks May 27, 2026
ee637e1
feat(docker): add provisioning progress events (#1567)
drew May 27, 2026
fafde3e
docs(kubernetes): add RBAC section to setup page (#1540)
mesutoezdil May 27, 2026
9bfcad4
fix(sandbox): delegate PID limits to runtimes (#1497)
mjamiv May 27, 2026
2bdc968
fix(gateway): make readiness health checks dependency-aware (#1328)
alangou May 27, 2026
d8010ef
fix(vm): scope rootfs cache by openshell version (#1587)
drew May 27, 2026
b6d5825
fix(cli): preserve symlinks during sandbox upload (#1595)
johntmyers May 27, 2026
dc1f098
fix(core): preserve SSH gateway default ports (#1602)
TaylorMutch May 27, 2026
3f520dd
feat(server): declare gRPC auth (mode + scope + role) at the handler,…
mrunalp May 27, 2026
6c7950d
ci(snap): add snap release pipeline (#1600)
drew May 27, 2026
63e3a8f
docs: refresh landing terminal demo and apply NVIDIA fern theme (#1615)
aschilling-nv May 28, 2026
5bcc462
build(macos): remove unused import of tracing::warn (#1619)
Cali0707 May 29, 2026
9b95281
chore: align .python-version with mise.toml (#1618)
Cali0707 May 29, 2026
5007042
feat(helm): add optional PostgreSQL backing store (#1579)
sauagarwa May 29, 2026
188b355
docs(config): update gateway config reference (#1624)
TaylorMutch May 29, 2026
7873f61
feat(flake): add Nix development shell (#1592)
SDAChess May 29, 2026
d01d106
refactor(proto): move phase and current_policy_version into status (#…
derekwaynecarr May 29, 2026
fb03e38
feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)
mrunalp May 29, 2026
7d32bf9
fix(helm): vendor chart dependencies before release packaging (#1627)
TaylorMutch May 29, 2026
f1ed347
fix(driver-podman): bind gateway to 0.0.0.0 in rootless mode (#1623)
jewzaam May 29, 2026
f6d0fd1
docs(providers): note that ANTHROPIC_API_KEY requires an API account,…
mesutoezdil May 29, 2026
0f73d11
fix(podman): avoid host-gateway on macOS machines (#1637)
TaylorMutch May 29, 2026
7036dcf
chore(vm): generalize crate for multi-device PCIe passthrough (#1573)
cheese-head May 29, 2026
f1fc87e
fix(sandbox): trust exact declared private endpoints (#1560)
mjamiv May 29, 2026
e98ea3e
feat(policy): add agentic approval loop (#1528)
zredlined May 30, 2026
28ee296
fix(e2e): clean up temp files in sandbox-runner on exit (#1647)
mesutoezdil Jun 1, 2026
269dbc6
ci(kubernetes): add HA e2e workflow (#1598)
TaylorMutch Jun 1, 2026
5045b9c
ci(release): use bundled Z3 for macOS gateway build (#1658)
pimlock Jun 1, 2026
7cea9d9
fix(gateway): align package TLS bootstrap path (#1601)
TaylorMutch Jun 1, 2026
eb97fb3
feat(tui): add PageUp/PageDown scrolling to all panes (#1656)
major Jun 1, 2026
c63ac76
feat(telemetry): add anonymous opt-out OpenShell usage telemetry (#1433)
kirit93 Jun 1, 2026
2d78503
ci(release): gate helm/oci artifact publishing on release (#1662)
krishicks Jun 1, 2026
99ca85a
ci(kubernetes): stabilize HA e2e setup (#1659)
TaylorMutch Jun 1, 2026
019a986
fix(gateway): place supervisor_image under podman driver TOML table (…
jhjaggars Jun 1, 2026
29e2539
refactor: deduplicate shared utilities across driver crates (#1660)
ericcurtin Jun 1, 2026
3d441e7
fix(config): reject unknown fields in nested gateway config tables (#…
pimlock Jun 2, 2026
d990822
feat(kubernetes): support sandbox image pull secrets (#1671)
TaylorMutch Jun 2, 2026
79aa355
refactor(driver): trim compute capability response (#1402)
elezar Jun 2, 2026
f061b1d
feat(providers): add Google Vertex AI inference provider (#1568)
maxamillion Jun 2, 2026
ae5127f
fix: correct example paths in local-inference README (#1676)
mesutoezdil Jun 2, 2026
1d2d8c3
ci(release): bring Fedora RPM canary to parity (#1688)
krishicks Jun 2, 2026
8bf667f
fix: update RFC link in agent-driven-policy-management README (#1677)
mesutoezdil Jun 2, 2026
62c421b
feat(providers): add profile-backed policy visibility (#1640)
johntmyers Jun 3, 2026
61b33ea
ci(release): fix Ubuntu Snap canary install and registration (#1699)
krishicks Jun 3, 2026
19be568
feat(snap): add openshell.term desktop app (#1693)
zyga Jun 3, 2026
5102cb9
fix(sandbox): restore GPU procfs baseline (#1522)
elezar Jun 3, 2026
1f07bf0
fix(gateway): try harder to detect Podman (#1536)
krishicks Jun 3, 2026
427dacb
chore(mise): refresh tool lockfile (#1712)
krishicks Jun 3, 2026
b7ce0be
ci(release): authenticate snap canary artifact download (#1711)
krishicks Jun 3, 2026
1c8417c
docs(container-gateway): fix Docker driver setup for containerized ga…
ericcurtin Jun 3, 2026
d5b79e5
refactor(server): deduplicate test helpers and grpc utilities (#1708)
ericcurtin Jun 3, 2026
e4bcfdf
fix(gateway): allow local sandbox jwt to not expire (#1721)
TaylorMutch Jun 3, 2026
5f58cb0
fix(helm): create sandbox JWT secret when cert-manager is enabled (#1…
TaylorMutch Jun 3, 2026
5e32403
feat(k8s-driver): add default_runtime_class_name config for sandbox p…
sjenning Jun 3, 2026
b41e0df
docs: add Hermes Agent to supported agents (#1735)
shannonsands Jun 4, 2026
76d7453
fix(cli): roll back gateway registration when auth fails during gatew…
zanetworker Jun 4, 2026
69764d8
refactor: deduplicate shared driver and TUI helpers (#1741)
ericcurtin Jun 4, 2026
eea9751
feat(cli): support multiple --upload flags on sandbox create (#1635) …
feloy Jun 4, 2026
c26d4e8
fix(grpc): allow credential rotation when legacy provider.type exceed…
latenighthackathon Jun 4, 2026
a4014f7
fix(cli): respect gateway name for mTLS lookup (#1626)
alexclewontin Jun 4, 2026
79b77ca
chore(deps): bump actions/checkout from 6.0.2 to 6.0.3 (#1739)
dependabot[bot] Jun 4, 2026
586c385
chore(k8s): use upstream agent-sandbox manifest in CI/e2e (#1657)
rmalani-nv Jun 4, 2026
884d4ed
fix(bootstrap): set docker build platform args (#1761)
shiju-nv Jun 4, 2026
e26a1b1
fix(kubernetes): configure sandbox apparmor profile (#1767)
TaylorMutch Jun 5, 2026
97986d9
fix(server): resume unspecified sandbox phase (#1765)
shiju-nv Jun 5, 2026
35afcf8
refactor(tui): extract shared setting edit overlay (#1776)
ericcurtin Jun 5, 2026
c3964a6
feat(kubernetes): support driver config passthrough (#1744)
elezar Jun 5, 2026
13e8318
fix(sandbox): stop log push after auth failure (#1787)
johntmyers Jun 5, 2026
b392b2e
feat(providersv2): add path auth_style (#1622)
Cali0707 Jun 6, 2026
3558888
refactor(tui): extract shared draw_text_field and draw_confirm_popup …
ericcurtin Jun 7, 2026
25abc9e
feat(inference): allow local embeddings route (#1774)
shiju-nv Jun 7, 2026
88b5f3d
test(cli): avoid browser launch in auth rollback test (#1808)
elezar Jun 8, 2026
f236279
docs: document DCO commit sign-off requirement (#1811)
elezar Jun 8, 2026
1399f37
refactor: deduplicate OCSF builder setters and persistence helpers (#…
ericcurtin Jun 8, 2026
1f5e123
feat(vm): add vm life cycle extensions (#1583)
cheese-head Jun 8, 2026
4da07f6
feat(cli): add generic output formatter to eliminate --output flag du…
jeffmaury Jun 8, 2026
7274a6b
feat(cli): add --env flag to sandbox create/exec and fix env var pass…
russellb Jun 8, 2026
4025894
chore(snap): remove early snap packaging (#1648)
zyga Jun 9, 2026
3a4463e
fix(cli): fall back to regular upload when git filtering excludes all…
russellb Jun 9, 2026
3aba30c
feat(telemetry): add build-time option to compile out telemetry (#1845)
russellb Jun 9, 2026
70acbaf
refactor(driver-utils): centralize container mount path constants (#1…
ericcurtin Jun 9, 2026
c4ca283
refactor(helm): require external postgres for ha (#1844)
TaylorMutch Jun 9, 2026
d2a522d
feat(snap): switch to prebuilt binaries shared with other packages (#…
zyga Jun 10, 2026
713d46c
feat(snap): expand snap description with setup instructions (#1695)
zyga Jun 10, 2026
27fd31c
fix(cli)!: require explicit gpu sandbox flag (#1835)
elezar Jun 10, 2026
84c24a0
fix(ocsf): widen the shorthand [reason:] budget so denial endpoints s…
latenighthackathon Jun 10, 2026
c1d3b43
fix(policy): classify advisory private-IP notes with the canonical is…
latenighthackathon Jun 10, 2026
702cbc4
feat(providers): support SPIFFE-backed token grants (#1784)
TaylorMutch Jun 10, 2026
9e805dc
fix(build): use zigbuild for musl supervisor staging (#1850)
elezar Jun 10, 2026
530aaf1
feat(drivers): support docker and podman config mounts (#1785)
drew Jun 10, 2026
d8e0ef5
fix(ci): pin snap artifact downloads to valid action (#1855)
drew Jun 10, 2026
4a7f8e7
fix(ci): use existing snap gateway wrapper (#1859)
elezar Jun 10, 2026
c5ce3ed
AGENTS.md: Add more detailed signoff guidance (#1852)
russellb Jun 10, 2026
42e7b80
feat(podman): make container health check interval configurable (#1833)
sshnaidm Jun 10, 2026
4b44d62
fix(helm): use stable gateway container name (#1864)
TaylorMutch Jun 10, 2026
7dab612
feat(helm): support Deployment kind in HA gateway workloads (#1867)
TaylorMutch Jun 10, 2026
1dc5985
feat(gpu): move device selection to driver config (#1815)
elezar Jun 11, 2026
b6c87a7
feat(server): add grpc rate limiting gateway-wide (#1566)
alangou Jun 11, 2026
58a3777
fix(drivers): filter bind-backed named volumes (#1861)
elezar Jun 11, 2026
f33fd02
fix(server): use public tonic body type in gRPC rate limiter (#1872)
alangou Jun 11, 2026
e73745f
feat(gateway): add reconciler lease for HA multi-replica deployments …
derekwaynecarr Jun 11, 2026
fb83d1a
feat(gateway): add system registry support and source indicators (#1625)
alexclewontin Jun 11, 2026
21ff5db
ci(stale): add stale issue and PR workflow (#1890)
TaylorMutch Jun 12, 2026
ec197a4
fix(e2e): correct return type of _stub_with_token (#1897)
mesutoezdil Jun 13, 2026
6c8cf38
ci(docs): add docs website automation (#1788)
pimlock Jun 15, 2026
8c01534
test(e2e): add GPU workload image artifacts (#1484)
elezar Jun 15, 2026
62aa5e3
ci(branch-checks): align Python checks with pre-commit (#1908)
TaylorMutch Jun 15, 2026
ac3bb63
docs(rfc): improve template and add creation skill (#1889)
krishicks Jun 15, 2026
1ca23bc
refactor(openshell-sandbox): Split `sandbox` into `process` and `netw…
rrhubenov Jun 15, 2026
ed65bfd
feat(cli): add JSON/YAML output format to provider list command (#1830)
jeffmaury Jun 15, 2026
ec71b1a
fix(sandbox): apply initial OCSF JSON setting (#1921)
TaylorMutch Jun 16, 2026
f4a5005
chore(deps): bump astral-sh/setup-uv from 8.0.0 to 8.2.0 (#1926)
dependabot[bot] Jun 16, 2026
294c64e
fix(gpu): prefer single CDI devices for local runtimes (#1675)
elezar Jun 16, 2026
fd6cbf6
fix(server): retry sandbox delete phase conflicts (#1905)
TaylorMutch Jun 16, 2026
ff028ce
feat(server): support TLS certificate hot-reload (#1870)
lunarwhite Jun 16, 2026
36bb9e3
feat(providers): add DeepInfra as a built-in inference provider (#1902)
mmilutinovic371 Jun 16, 2026
5ca39b0
docs(rfc): require issues before RFCs (#1918)
drew Jun 16, 2026
f1245a3
test(e2e): retry transient forward proxy stale policy responses (#1929)
drew Jun 17, 2026
4c75b85
fix(server): share gateway shutdown channel (#1945)
elezar Jun 17, 2026
234e69d
fix(e2e): refresh latest sandbox image for docker runs (#1928)
krishicks Jun 17, 2026
f5e109a
feat: build CLI during pull request (#1491)
jeffmaury Jun 17, 2026
70fed04
fix(helm): build chart dependencies before lint (#1947)
elezar Jun 17, 2026
f23c2c8
test(e2e): remove python gpu smoke test (#1948)
elezar Jun 17, 2026
0e35ce2
feat: add openlock cred-inject + trust stack
vessux May 8, 2026
f273e19
ci: add openlock release workflow for fork binaries
vessux May 5, 2026
b4fd29a
ci: use system z3 instead of bundled-z3 for openshell-cli
vessux May 5, 2026
feaa90b
ci(release): mark -rc/-beta/-alpha tags as GitHub prerelease
vessux May 8, 2026
d00d560
fix(policy): populate fork allowed_secrets in test fixtures and silen…
vessux May 20, 2026
079af91
fix(sandbox): populate fork fields in test fixtures and clean dead code
vessux May 20, 2026
008767a
feat: --volume bind mounts with auto userns-remap on rootless podman …
vessux May 21, 2026
12d3cb3
chore(ci): drop inherited NVIDIA mirror plumbing (#4)
vessux May 26, 2026
6fb605d
feat(sandbox): expose Stop/Start RPCs in gateway API + CLI (#3)
vessux May 26, 2026
fbd7781
feat(sandbox): SandboxPhase::Stopped distinguishes intentional stop f…
vessux May 26, 2026
815a7ad
refactor(proto): move fork-added field/enum numbers into 9000+ range
vessux May 26, 2026
690dcb8
build(release): static-link z3 via bundled-z3 (no runtime libz3)
vessux Jun 11, 2026
8b9a774
ci(release): authenticate z3-sys bundled fetch via READ_ONLY_GITHUB_T…
vessux Jun 11, 2026
359de8e
ci(release): macOS uses system z3 (brew), only Linux static-links bun…
vessux Jun 11, 2026
deb7fb3
ci(release): build macOS CLI with static bundled-z3 via zig (#7)
vessux Jun 11, 2026
22304da
feat(cred_inject): add value_prefix to CredInjectHeader (#8)
vessux Jun 12, 2026
89fb211
feat(sandbox): debug-gated L7 egress request/response header logging …
vessux Jun 17, 2026
0e0985e
chore(sync): adapt fork delta to upstream after 133-commit catch-up
vessux Jun 18, 2026
9ae1ad9
fix(ci): build gateway with bundled-z3 (upstream gave openshell-serve…
vessux Jun 19, 2026
10 changes: 9 additions & 1 deletion .agents/skills/build-from-issue/SKILL.md
@@ -148,7 +148,8 @@ In the prompt, instruct the reviewer to:
- **Medium**: Multiple files/components, some design decisions, but well-scoped
- **High**: Cross-cutting changes, architectural decisions needed, significant unknowns
8. Call out risks, unknowns, and decisions that need stakeholder input.
9. Assess **LSM compatibility** — if the change touches process identity, `/proc` filesystem access, binary execution, or inter-process visibility, flag whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. In particular, tests that fork+exec into system binaries will fail on SELinux-enforcing hosts due to cross-label `/proc/<pid>/exe` access restrictions.
9. Assess **gateway config documentation impact** — if the change adds, removes, renames, or changes defaults for gateway TOML keys or driver-specific config options, the plan must include an update to `docs/reference/gateway-config.mdx`. If the change is surfaced through Helm or a compute-driver overview, also include `docs/reference/sandbox-compute-drivers.mdx` or the relevant deployment docs.
10. Assess **LSM compatibility** — if the change touches process identity, `/proc` filesystem access, binary execution, or inter-process visibility, flag whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. In particular, tests that fork+exec into system binaries will fail on SELinux-enforcing hosts due to cross-label `/proc/<pid>/exe` access restrictions.

### A2: Post the Plan Comment

@@ -436,6 +437,13 @@ Review the documentation requirements in `AGENTS.md` and update any affected
docs as part of the implementation. Keep documentation changes scoped to the
behavior or subsystem that changed.

If the implementation changes gateway TOML parsing, `[openshell.gateway]`
fields, `[openshell.drivers.<name>]` fields, driver config defaults, or Helm
rendering of `gateway.toml`, update `docs/reference/gateway-config.mdx` in the
same branch. If the change affects user-facing compute-driver setup, also
update `docs/reference/sandbox-compute-drivers.mdx` or the relevant deployment
page.

### Step 12: Commit and Push

Commit all changes using conventional commit format. The `<type>` comes from the issue type in the plan:
9 changes: 9 additions & 0 deletions .agents/skills/create-github-pr/SKILL.md
@@ -15,6 +15,15 @@ Create pull requests on GitHub using the `gh` CLI.

## Before Creating a PR

### Check Config Documentation

If the branch changes gateway TOML parsing, `[openshell.gateway]` fields,
`[openshell.drivers.<name>]` fields, driver config defaults, or Helm rendering
of `gateway.toml`, verify that `docs/reference/gateway-config.mdx` is updated
in the same branch. If the change affects user-facing compute-driver setup,
also update `docs/reference/sandbox-compute-drivers.mdx` or the relevant
deployment docs.
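
The check above could be automated roughly as follows. This is a minimal sketch, not part of the skill: the function name `check_gateway_config_docs` and the default path pattern are hypothetical, and the pattern would need to be adapted to wherever gateway config parsing and Helm rendering actually live in the repository.

```shell
# Hypothetical helper: read changed file paths on stdin (one per line) and
# fail when config-related sources changed without the reference doc.
check_gateway_config_docs() {
  changed="$(cat)"
  doc='docs/reference/gateway-config.mdx'
  # ASSUMPTION: this pattern is illustrative only; override it with
  # CONFIG_PATH_PATTERN to match the real config-related source paths.
  pattern="${CONFIG_PATH_PATTERN:-gateway.*config|deploy/helm/}"
  if printf '%s\n' "$changed" | grep -Eq "$pattern"; then
    if ! printf '%s\n' "$changed" | grep -Fxq "$doc"; then
      echo "config-related change without an update to $doc" >&2
      return 1
    fi
  fi
  return 0
}
```

A possible invocation is `git diff --name-only main...HEAD | check_gateway_config_docs`, assuming `main` is the base branch. A failure is only a prompt to look closer; the decision about whether the docs need an update still rests with the author.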

### Run Pre-commit Checks

Run the local pre-commit task before opening a PR:
51 changes: 51 additions & 0 deletions .agents/skills/create-rfc/SKILL.md
@@ -0,0 +1,51 @@
---
name: create-rfc
description: Create OpenShell RFC proposals in rfc/ from a design request. Use when the user asks to write, draft, start, create, or update an RFC, Request for Comments, architecture proposal, API proposal, process proposal, or cross-cutting design proposal that should follow the OpenShell RFC process and template.
---

# Create RFC

## Workflow

Create RFCs by following `rfc/README.md` and `rfc/0000-template/README.md`.
Keep the template as the source of truth for section guidance.

1. Read `rfc/README.md` to confirm when an RFC is appropriate, how to choose the
RFC number, and how the lifecycle works.
2. Read `rfc/0000-template/README.md` before drafting. Follow its section
guidance, including scope, expected detail, and suggested section length.
3. Choose the next available `NNNN` from the existing `rfc/NNNN-*` directories
unless the user provided a specific number.
4. Create `rfc/NNNN-short-title/README.md` by copying the template and replacing
placeholders. Use a short hyphenated folder title.
5. Fill in front matter with the RFC author, `state: draft`, and any related
links the user provided. If the author is unknown, use the requesting user's
GitHub handle when available or leave the template placeholder.
6. Draft each section from the user's design context. Keep Summary concise,
Motivation readable by anyone, Non-goals explicit, Proposal focused on what
is being proposed, and Alternatives focused on credible competing approaches.
7. Preserve uncertainty in Open questions instead of silently deciding unknowns.
If a missing decision blocks a coherent RFC, ask the user for that decision.
8. Check the completed RFC against the template once more before finishing.
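
Step 3 above (choosing the next available `NNNN`) can be sketched as a small shell function. The function name `next_rfc_number` is hypothetical, and the sketch assumes RFC folders are named with exactly four leading digits followed by a hyphen; `rfc/README.md` remains the authority on numbering.

```shell
# Hypothetical helper: print the next free four-digit RFC number.
# ASSUMPTION: RFC directories look like <dir>/NNNN-short-title.
next_rfc_number() {
  rfc_dir="${1:-rfc}"
  max=0
  for path in "$rfc_dir"/[0-9][0-9][0-9][0-9]-*; do
    # Skip the unexpanded glob and any non-directory matches.
    [ -d "$path" ] || continue
    name="${path##*/}"
    num="${name%%-*}"
    # Strip leading zeros so the value is not parsed as octal.
    num="$(printf '%s' "$num" | sed 's/^0*//')"
    num="${num:-0}"
    if [ "$num" -gt "$max" ]; then
      max="$num"
    fi
  done
  printf '%04d\n' "$((max + 1))"
}
```

Because `0000-template` counts as number zero, a repository containing only the template yields `0001`.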

## Writing Standards

- Prefer concrete design statements over placeholder language.
- Link to relevant issues, prior RFCs, and architecture docs when they provide
needed context.
- Keep rejected or left-out designs in Alternatives, not Proposal.
- Use Mermaid diagrams for architecture or data flow when a diagram would make
the proposal easier to review.
- Do not update `architecture/` or published docs just because an RFC was
drafted. Those updates belong with implementation or with an accepted RFC when
the user asks for them.

## Validation

Before handing the RFC back to the user:

- Verify the folder name and RFC number match the process in `rfc/README.md`.
- Verify every template section is present or intentionally marked as not
applicable.
- Run a Markdown formatting or lint check only if the repo already provides one
for Markdown-only changes.
4 changes: 4 additions & 0 deletions .agents/skills/create-rfc/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
display_name: "Create RFC"
short_description: "Create OpenShell RFC proposals"
default_prompt: "Use $create-rfc to draft an OpenShell RFC from this design."
6 changes: 4 additions & 2 deletions .agents/skills/create-spike/SKILL.md
@@ -91,9 +91,11 @@ The prompt to the reviewer **must** instruct it to:

9. **Check architecture docs** in the `architecture/` directory for relevant documentation about the affected subsystems.

10. **Assess Linux Security Module (LSM) impact.** If the change involves process identity, `/proc` filesystem access, file labeling, binary execution, or inter-process visibility, call out whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. For example: reading `/proc/<pid>/exe` across an SELinux domain boundary returns ENOENT, not EACCES. Tests that fork+exec into system binaries (different SELinux label) will fail on enforcing hosts. Flag any LSM-sensitive code paths and recommend mitigations.
10. **Assess gateway config documentation impact.** If the change would add, remove, rename, or change defaults for gateway TOML keys or driver-specific config options, call out that `docs/reference/gateway-config.mdx` must be updated. If the change is surfaced through Helm or compute-driver setup docs, call out the relevant deployment or compute-driver docs too.

11. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.
11. **Assess Linux Security Module (LSM) impact.** If the change involves process identity, `/proc` filesystem access, file labeling, binary execution, or inter-process visibility, call out whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. For example: reading `/proc/<pid>/exe` across an SELinux domain boundary returns ENOENT, not EACCES. Tests that fork+exec into system binaries (different SELinux label) will fail on enforcing hosts. Flag any LSM-sensitive code paths and recommend mitigations.

12. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.

### What makes a good investigation prompt

60 changes: 52 additions & 8 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -54,7 +54,7 @@ Use gateway metadata, deployment values, or the user's setup notes to identify t
|---|---|
| Docker | Gateway process logs, Docker daemon health, sandbox containers, image pulls. |
| Podman | Podman socket, rootless networking, sandbox containers, image pulls. |
| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. |
| Kubernetes | Helm release, gateway workload, service, secrets, sandbox pods, events. |
| VM | VM driver logs, rootfs availability, host virtualization support. |

### Step 3: Check Docker-Backed Gateways
@@ -131,12 +131,27 @@ Common findings:
```bash
helm -n openshell status openshell
helm -n openshell get values openshell
kubectl -n openshell get statefulset,pod,svc,pvc
kubectl -n openshell logs statefulset/openshell --tail=200
kubectl -n openshell get deployment,statefulset,pod,svc,pvc
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
kubectl -n openshell rollout status deployment/openshell
kubectl -n openshell rollout status statefulset/openshell
```

Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.
Use the log and rollout commands for the workload kind that exists in the
release. Look for failed installs, unexpected values, missing namespace, wrong
image tag, TLS settings that do not match the registered endpoint, and
scheduling failures.
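
One way to pick the workload kind that exists before running the log and rollout commands is sketched below. The helper name `gateway_workload` is hypothetical; the namespace and release name default to the `openshell` values used in the commands above.

```shell
# Hypothetical helper: print "deployment/<name>" or "statefulset/<name>",
# whichever exists in the namespace; return 1 if neither is found.
gateway_workload() {
  ns="${1:-openshell}"
  name="${2:-openshell}"
  for kind in deployment statefulset; do
    if kubectl -n "$ns" get "$kind" "$name" >/dev/null 2>&1; then
      printf '%s/%s\n' "$kind" "$name"
      return 0
    fi
  done
  return 1
}
```

The result can then feed the existing commands, for example `kubectl -n openshell logs "$(gateway_workload)" -c openshell-gateway --tail=200`.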

For HA or PostgreSQL-backed installs, also check the external database Secret
referenced by `server.externalDbSecret` and the PostgreSQL workload if the test
or operator deployed one in-cluster:

```bash
kubectl -n openshell get secret openshell-ha-pg -o yaml
kubectl -n openshell get deployment,service,pod -l app.kubernetes.io/name=openshell-e2e-postgres
kubectl -n openshell logs deployment/openshell-e2e-postgres --tail=200
```

Check required Helm deployment secrets:

@@ -148,17 +163,45 @@ kubectl -n openshell get secret \
openshell-jwt-keys
```

In cert-manager installs, `certManager.enabled=true` makes cert-manager own TLS
generation. The Helm chart should still render the `openshell-certgen`
pre-install/pre-upgrade hook in JWT-only mode to create `openshell-jwt-keys`,
even if `pkiInitJob.enabled` remains true.
If the gateway pod is pending with `MountVolume.SetUp failed for volume
"sandbox-jwt"` and `openshell-jwt-keys` is absent, inspect the rendered
`templates/certgen.yaml` output and the hook Job logs; cert-manager creates TLS
Secrets but does not create the sandbox JWT signing Secret.

If the gateway exits with `failed to read sandbox JWT signing key from
/etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains
`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the
`signing.pem`, `public.pem`, and `kid`, and that the gateway workload mounts the
`sandbox-jwt` secret at `/etc/openshell-jwt`. The sandbox JWT mount is required
even when local Helm values disable TLS.
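
The key check described above can be sketched as a filter over the Secret's key names. The function name `missing_jwt_keys` is hypothetical; the three required keys are the ones listed in the paragraph above.

```shell
# Hypothetical helper: read the Secret's key names on stdin (one per line)
# and print each required key that is absent.
missing_jwt_keys() {
  present="$(cat)"
  for key in signing.pem public.pem kid; do
    printf '%s\n' "$present" | grep -Fxq "$key" || printf '%s\n' "$key"
  done
}
```

Assuming the Secret lives in the `openshell` namespace, the key names can be listed with a go-template and piped in:

```shell
kubectl -n openshell get secret openshell-jwt-keys \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | missing_jwt_keys
```

Empty output means all three keys are present. It does not confirm that the gateway workload mounts the Secret at `/etc/openshell-jwt`.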

If `server.providerTokenGrants.spiffe.enabled=true`, the gateway should still
render `[openshell.gateway.gateway_jwt]` and mount the `sandbox-jwt` Secret.
SPIRE is used only by sandbox pods for dynamic provider token grants. Verify
that SPIRE is installed, the CSI driver is available, and the Kubernetes driver
config includes `provider_spiffe_workload_api_socket_path`:

```bash
helm -n openshell get values openshell | grep -E 'providerTokenGrants|workloadApiSocketPath'
kubectl get pods -A | grep -E 'spire|spiffe'
kubectl -n openshell get configmap openshell-config -o yaml | grep provider_spiffe_workload_api_socket_path
```

Sandbox pods using provider token grants should have an
`openshell.io/sandbox-id` annotation, an `openshell.ai/managed-by=openshell`
label, supervisor env vars `OPENSHELL_K8S_SA_TOKEN_FILE` and
`OPENSHELL_PROVIDER_SPIFFE_WORKLOAD_API_SOCKET`, plus both the projected
`openshell-sa-token` volume and the `spiffe-workload-api` CSI volume.

Check the image references currently used by the gateway deployment:

```bash
kubectl -n openshell get deployment openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage|workload'
```

The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
@@ -201,7 +244,8 @@ If the gateway is healthy but sandbox creation fails:
```bash
kubectl -n openshell get pods
kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
kubectl -n openshell logs statefulset/openshell --tail=200
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
```

Check the configured sandbox namespace:
@@ -249,7 +293,7 @@ openshell logs <sandbox-name>
| Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
| Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs deployment/openshell -c openshell-gateway` or `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
| Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
| `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled |
34 changes: 30 additions & 4 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -1,6 +1,6 @@
---
name: helm-dev-environment
description: Start up, tear down, and configure the local Kubernetes development environment for OpenShell. Uses k3d (Docker-backed k3s) + Skaffold + Helm. Covers cluster lifecycle, optional add-ons (Keycloak OIDC, Envoy Gateway), and port mappings. Trigger keywords - local k8s, local cluster, k3d, skaffold, helm dev, start cluster, stop cluster, tear down cluster, delete cluster, create cluster, helm:k3s, helm:skaffold, local dev environment, dev cluster, k8s dev, envoy gateway local, keycloak local.
description: Start up, tear down, and configure the local Kubernetes development environment for OpenShell. Uses k3d (Docker-backed k3s) + Skaffold + Helm. Covers cluster lifecycle, optional add-ons (Keycloak OIDC, Envoy Gateway), HA testing, and port mappings. Trigger keywords - local k8s, local cluster, k3d, skaffold, helm dev, start cluster, stop cluster, tear down cluster, delete cluster, create cluster, helm:k3s, helm:skaffold, local dev environment, dev cluster, k8s dev, envoy gateway local, keycloak local, high availability, HA.
---

# Helm Dev Environment
@@ -26,9 +26,10 @@ mise run helm:k3s:create
```

Creates a k3d cluster and merges its kubeconfig into the worktree-local `kubeconfig` file.
Also applies base manifests (`deploy/kube/manifests/agent-sandbox.yaml`) and preloads the
default community sandbox image into k3d so the first sandbox create does not wait on a
large registry pull. Traefik is disabled at cluster creation time.
Also applies the upstream agent-sandbox CRDs/controller (pinned via `AGENT_SANDBOX_VERSION`
in `tasks/scripts/helm-k3s-local.sh`, fetched from `github.com/kubernetes-sigs/agent-sandbox`
releases) and preloads the default community sandbox image into k3d so the first sandbox
create does not wait on a large registry pull. Traefik is disabled at cluster creation time.

**Multi-worktree support:** the cluster name is derived from the last component of the
current git branch (e.g. branch `kube-support/local-dev/tmutch` → cluster
@@ -65,6 +66,11 @@ generates mTLS secrets on first install. Envoy Gateway opt-in; see the Optional

The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or `kubectl port-forward`.

**HA test deploy** (two gateway replicas + external PostgreSQL Secret): uncomment
`#- ci/values-high-availability.yaml` in `deploy/helm/openshell/skaffold.yaml`,
create the Secret named `openshell-ha-pg` with a `uri` key, then run
`mise run helm:skaffold:run` or `mise run helm:skaffold:dev`.
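
Creating the Secret for the HA test deploy might look like the sketch below. The wrapper name `create_ha_pg_secret` is hypothetical, the `openshell` namespace is assumed from the debugging commands used elsewhere in this change, and the connection URI has to come from your own PostgreSQL instance.

```shell
# Hypothetical wrapper: create the external PostgreSQL Secret with a `uri` key.
# ASSUMPTION: the gateway release is installed in the `openshell` namespace.
create_ha_pg_secret() {
  uri="$1"
  ns="${2:-openshell}"
  if [ -z "$uri" ]; then
    echo "usage: create_ha_pg_secret <postgres-uri> [namespace]" >&2
    return 2
  fi
  kubectl -n "$ns" create secret generic openshell-ha-pg \
    --from-literal=uri="$uri"
}
```

Passing the URI as a literal leaves it in shell history; `kubectl create secret generic` also accepts `--from-file=uri=<path>` when that is a concern.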

### TLS behaviour

`ci/values-skaffold.yaml` sets `server.disableTls: true`, so Skaffold-based deploys run
@@ -172,6 +178,23 @@ To remove Keycloak:
mise run keycloak:k8s:teardown
```

### SPIRE / SPIFFE Provider Token Grants

Skaffold can install SPIRE with the SPIFFE hardened Helm charts. To activate
SPIFFE JWT-SVIDs for dynamic provider token grants:

1. Uncomment the `spire-crds` and `spire` releases in `deploy/helm/openshell/skaffold.yaml`
2. Uncomment `#- ci/values-spire.yaml` in the OpenShell release values files
3. Redeploy: `mise run helm:skaffold:run`

`ci/values-spire-stack.yaml` configures the local SPIRE trust domain as
`openshell.local` and adds a `ClusterSPIFFEID` that maps sandbox pod
annotations to `spiffe://openshell.local/openshell/sandbox/<sandbox-id>`.
OpenShell mounts the SPIFFE CSI Workload API socket at
`/spiffe-workload-api/spire-agent.sock` into sandbox pods for provider token
grants. Supervisor-to-gateway authentication remains on the Kubernetes
ServiceAccount bootstrap and gateway-minted sandbox JWT path.

---

## Cluster Lifecycle (suspend/resume)
@@ -198,7 +221,10 @@ mise run helm:k3s:status
| `deploy/helm/openshell/ci/values-skaffold.yaml` | Dev overrides (image pull policy, TLS disabled for local Skaffold) |
| `deploy/helm/openshell/ci/values-cert-manager.yaml` | cert-manager PKI overlay (opt-in; disables pkiInitJob) |
| `deploy/helm/openshell/ci/values-gateway.yaml` | Envoy Gateway GRPCRoute + Gateway overlay |
| `deploy/helm/openshell/ci/values-high-availability.yaml` | HA test overlay (`replicaCount: 2` with external PostgreSQL Secret) |
| `deploy/helm/openshell/ci/values-keycloak.yaml` | Keycloak OIDC overlay |
| `deploy/helm/openshell/ci/values-spire.yaml` | SPIFFE/SPIRE provider token grant overlay |
| `deploy/helm/openshell/ci/values-spire-stack.yaml` | SPIRE hardened chart values for local dev |
| `deploy/helm/openshell/ci/values-tls-disabled.yaml` | Lint-only: TLS + auth disabled (reverse-proxy edge termination) |
| `deploy/kube/manifests/envoy-gateway-openshell.yaml` | GatewayClass for Envoy Gateway (`mise run helm:gateway:apply`) |
| `tasks/scripts/helm-k3s-local.sh` | k3d cluster create/delete/start/stop/status |
5 changes: 3 additions & 2 deletions .agents/skills/openshell-cli/SKILL.md
@@ -495,8 +495,9 @@ openshell gateway remove local # Remove local registrati
```bash
# Inspect a Kubernetes Helm release and gateway pod
helm -n openshell status openshell
kubectl -n openshell get pods,svc
kubectl -n openshell logs statefulset/openshell --tail=100
kubectl -n openshell get deployment,statefulset,pods,svc
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=100
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=100
```

For Docker, Podman, and VM-backed gateways, inspect the gateway process or container logs and the selected runtime directly.