Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -821,8 +821,16 @@ jobs:
echo "╔════════════════════════════════════════════════════════════╗"
echo "║ Juniper Canopy - Docker Smoke Test ║"
echo "╚════════════════════════════════════════════════════════════╝"
# SEC-F22 / D2 (PR #420 follow-up): the startup loopback bind-guard
# (src/security.py -> enforce_loopback_bind_guard, run in main.lifespan)
# refuses to serve on the container's internal non-loopback bind
# (the Dockerfile sets JUNIPER_CANOPY_SERVER__HOST=0.0.0.0) unless the
# operator attests a fronting authenticating proxy. This smoke test has
# no such proxy and only needs the container to boot for the health
# probe, so attest here (and ONLY in this ephemeral CI container).
docker run -d --name juniper-canopy-test \
-p 8050:8050 \
-e JUNIPER_CANOPY_FRONTING_AUTH_ATTESTED=true \
juniper-canopy:ci-${{ github.sha }}

echo "Waiting for service to become healthy..."
Expand Down
6 changes: 5 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -395,6 +395,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **CI DOCKER SMOKE — attest the ephemeral smoke container past the new bind-guard so the image can boot**: PR #420's startup loopback bind-guard (SEC-F22 / D2) refuses to serve on a non-loopback interface without `JUNIPER_CANOPY_FRONTING_AUTH_ATTESTED=true`, but the Dockerfile bakes `JUNIPER_CANOPY_SERVER__HOST=0.0.0.0` (required for the container to be reachable through the published port), so canopy's `docker-build` "Verify Container Starts" smoke step (`.github/workflows/ci.yml`) would abort at startup (`NonLoopbackBindError`, raised in `main.lifespan`) and never reach `healthy`. The smoke step now passes `-e JUNIPER_CANOPY_FRONTING_AUTH_ATTESTED=true` to `docker run` — scoped to that ephemeral CI container only (no fronting proxy is present; the container just needs to boot for the `/v1/health` probe). Independent security review of #420, §9.

- **SEC-F19 / D4 WS cap rollback — globally rejected sockets no longer leak reserved per-IP/per-session slots**: the endpoints reserve the per-IP/session counters before awaiting `WebSocketManager.connect()`, but the new stack-absolute global cap can reject later inside `connect()` when `max_connections` is already full. That rejection path closes the socket before it enters `active_connections`, so the endpoint's normal `disconnect()` cleanup no-ops and the reserved counters stayed inflated, letting repeated over-global attempts strand a browser session/IP at its cap until process restart. `connect()` now returns whether it actually registered the socket, and all WS endpoints release the reserved cap slots when registration is rejected or fails before activation. Regression coverage: `src/tests/unit/test_ws_connection_caps.py::TestGlobalConnectionCap::test_global_cap_rejection_releases_reserved_session_slots`.

- **STALE-VERSION SHADOW — `.dockerignore` now excludes *nested* `**/*.egg-info/` so the image stops reporting a stale package version**: `importlib.metadata.version("juniper-canopy")` (the source of `APP_VERSION` → `/v1/health` `version`, the Prometheus `juniper_canopy_build` metric, and the Sentry release) resolved to **0.4.0** while `pyproject.toml` was **0.5.0**. Root cause: a stale, git-untracked `src/juniper_canopy.egg-info` build artifact was COPYed into the image by the Dockerfile's `COPY src/ ./src/`; with `ENV PYTHONPATH=/app/src` ahead of site-packages, `importlib.metadata` resolved that egg-info's `PKG-INFO` (0.4.0) instead of the freshly-installed `juniper_canopy-0.5.0.dist-info`. The existing `.dockerignore` carried `*.egg-info/`, but that pattern only matches the **context root** and silently missed the nested `src/*.egg-info`. Fix: add the `**/`-prefixed `**/*.egg-info/` and `**/*.dist-info/` forms so nested build-metadata dirs are excluded from the build context at any depth (verified against a context containing `src/X.egg-info` — excluded, while real source is kept). Surfaced by the build-provenance `make doctor` work (juniper-ml [`notes/BUILD_PROVENANCE_DESIGN_2026-06-14.md`](https://github.com/pcalnon/juniper-ml/blob/main/notes/JUNIPER_2026-06-14_JUNIPER-ECOSYSTEM_BUILD-PROVENANCE-DESIGN.md)): doctor correctly reported the image **FRESH** (`git_sha` == source HEAD) while the version string lied — exactly why `git_sha` is the reliable staleness signal. Takes effect on the next canopy image rebuild. Regression coverage: `src/tests/regression/test_dockerignore_egg_info.py` pins the nested-exclusion patterns.
Expand All @@ -411,7 +413,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Security

- **SEC-F22 / D2 — startup loopback bind-guard (the loopback bind is now an enforced invariant)**: canopy's browser training-control gate (`/api/train/*`, `/ws/control`) authenticates the same-origin browser by `Origin` + CSRF, both of which are forgeable by an in-network **non-browser** client (spoofable `Origin`, anonymously-mintable CSRF token — audit HO-6), so the **only** effective control is the loopback bind. That bind was an implicit default, not an enforced invariant — flipping `BIND_HOST=0.0.0.0` silently made the control surface in-network- (or internet-) reachable. A new startup guard (`src/security.py`: `is_loopback_host` / `enforce_loopback_bind_guard` / `NonLoopbackBindError`, called from `main.lifespan`, mirroring the E-8 `enforce_dependency_floors` fail-loud idiom) now **refuses to start** (CRITICAL log + raise; fail-closed) when `settings.server.host` (`JUNIPER_CANOPY_SERVER__HOST`) is a non-loopback interface (anything not in `127.0.0.0/8`, `::1`, or `localhost`) **unless** the new `settings.fronting_auth_attested` (`JUNIPER_CANOPY_FRONTING_AUTH_ATTESTED`, default `False`) is `True` — an operator attestation that a fronting authenticating proxy is present (the attested non-loopback path logs a loud WARNING). Loopback binds (the default) start normally, so this is zero-UX for the shipped posture. Implemented **inline in canopy** (no new dependency). Regression coverage: `src/tests/unit/test_bind_guard.py`. Design-of-record: juniper-ml [`notes/JUNIPER_CANOPY_CONTROL_SURFACE_AUTH_AND_NAT_DESIGN_2026-07-03.md`](https://github.com/pcalnon/juniper-ml/blob/main/notes/JUNIPER_2026-07-03_JUNIPER-CANOPY_CONTROL-SURFACE-AUTH-AND-NAT-DESIGN.md) §4 / §8 (D2); implementation note: [`notes/JUNIPER_CANOPY_CONTROL-SURFACE-HARDENING_SEC-F22-F19_NOTE_2026-07-04.md`](notes/JUNIPER_CANOPY_CONTROL-SURFACE-HARDENING_SEC-F22-F19_NOTE_2026-07-04.md).
- **SEC-F19 log hygiene — hash the `canopy_session` cookie before logging (never log the raw value)**: `WebSocketManager.check_per_session_limit` (`src/communication/websocket_manager.py`) logged a raw 8-char prefix of the anonymous `canopy_session` cookie (`session_key[:8]`) when the per-session cap tripped. That cookie is a signed Starlette session token, so even a prefix in a log line is an avoidable session-identifier leak. A new `_hash_session_key_for_log` helper now emits a short, non-reversible tag instead — keyed HMAC-SHA256 over the raw cookie with a per-process random secret (`_LOG_HASH_KEY`), truncated to 12 hex chars — so the logged digest is not an offline-computable function of the cookie and does not correlate across process restarts, mirroring the cascor sibling that hashes its identity before logging (`juniper-cascor src/api/workers/security.py`). Regression coverage: `src/tests/unit/test_ws_connection_caps.py::TestPerSessionLogHygiene`. Independent security review of #420, §9.

- **SEC-F22 / D2 — startup loopback bind-guard (the loopback bind is now an enforced invariant)**: canopy's browser training-control gate (`/api/train/*`, `/ws/control`) authenticates the same-origin browser by `Origin` + CSRF, both of which are forgeable by an in-network **non-browser** client (spoofable `Origin`, anonymously-mintable CSRF token — audit HO-6), so the **only** effective control is the loopback bind. That bind was an implicit default, not an enforced invariant — flipping `BIND_HOST=0.0.0.0` silently made the control surface in-network- (or internet-) reachable. A new startup guard (`src/security.py`: `is_loopback_host` / `enforce_loopback_bind_guard` / `NonLoopbackBindError`, called from `main.lifespan`, mirroring the E-8 `enforce_dependency_floors` fail-loud idiom) now **refuses to start** (CRITICAL log + raise; fail-closed) when `settings.server.host` (`JUNIPER_CANOPY_SERVER__HOST`) is a non-loopback interface (anything not in `127.0.0.0/8`, `::1`, or `localhost`) **unless** the new `settings.fronting_auth_attested` (`JUNIPER_CANOPY_FRONTING_AUTH_ATTESTED`, default `False`) is `True` — an operator attestation that a fronting authenticating proxy is present (the attested non-loopback path logs a loud WARNING). Loopback binds (the default) start normally, so this is zero-UX for the shipped posture. Implemented **inline in canopy** (no new dependency). Regression coverage: `src/tests/unit/test_bind_guard.py`. Design-of-record: juniper-ml [`notes/JUNIPER_CANOPY_CONTROL_SURFACE_AUTH_AND_NAT_DESIGN_2026-07-03.md`](https://github.com/pcalnon/juniper-ml/blob/main/notes/JUNIPER_CANOPY_CONTROL_SURFACE_AUTH_AND_NAT_DESIGN_2026-07-03.md) §4 / §8 (D2); implementation note: [`notes/JUNIPER_CANOPY_CONTROL-SURFACE-HARDENING_SEC-F22-F19_NOTE_2026-07-04.md`](notes/JUNIPER_CANOPY_CONTROL-SURFACE-HARDENING_SEC-F22-F19_NOTE_2026-07-04.md).

- **SEC-F19 / D4 — global + per-session WebSocket connection caps (kills the shared-NAT self-DoS)**: Docker NAT collapses every WS client to the bridge-gateway IP (audit HO-3), so the existing per-IP cap (`max_connections_per_ip=5`) is shared across all users behind the gateway — one client's five sockets exhaust the cap for everyone (a live self-DoS). `src/communication/websocket_manager.py` now adds, alongside the per-IP cap: (a) the stack-absolute **global** cap `max_connections` (=50) enforced in `connect()` — the single admission choke point shared by `/ws/training`, `/ws/control`, `/ws` — rejecting the N+1th connection stack-wide with close code `1013`; and (b) a **per-session** cap `max_connections_per_session` (=5, new `WebSocketSettings` field) keyed on the anonymous `canopy_session` cookie read from the WS handshake, restoring per-client fairness where the per-IP cap is inert (one session can no longer starve another behind the same gateway). A cookieless first connection is allowed and left to the global cap as the backstop. The three endpoints call a new `check_connection_limits()` (per-IP then per-session, rolling back the per-IP slot on a per-session rejection so a rejected attempt can't leak the per-IP counter); each endpoint keeps its existing close-reason (`/ws/control` stays opaque per M-SEC-06). The per-IP cap is retained but re-scoped honestly (a code comment + this note): it is **inert behind NAT** — DoS-dampening, **not** authentication. Regression coverage: `src/tests/unit/test_ws_connection_caps.py`. Design-of-record §5 / §8 (D4). **Deferred (Phase 4, owner-gated, NOT in this PR):** X-Forwarded-For-from-trusted-proxy (D6) and a real dashboard login / fronting proxy (D7) — the only mechanisms that restore genuine per-client identity / close SEC-F22 for the remote/multi-user case.

Expand Down
32 changes: 31 additions & 1 deletion src/communication/websocket_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -132,9 +132,12 @@

import asyncio
import contextlib
import hashlib
import hmac

# import json
import logging
import secrets
import threading
import time
from datetime import datetime
Expand All @@ -157,6 +160,33 @@
_SESSION_COOKIE_NAME = "canopy_session"


# SEC-F19 log hygiene (PR #420 independent-review follow-up): never log the raw
# ``canopy_session`` cookie value. That cookie is a signed Starlette session
# token (``main.py`` -> ``SessionMiddleware(session_cookie="canopy_session")``);
# even a short raw prefix in a log line is an avoidable identifier leak that aids
# cross-log correlation of a browser session. ``_hash_session_key_for_log``
# returns a short, non-reversible tag instead -- keyed HMAC-SHA256 over the raw
# cookie with a per-process random secret, so the digest is NOT an offline-
# computable function of the cookie (an attacker who obtains the logs cannot
# confirm a stolen/guessed cookie by re-hashing it) and does not correlate across
# process restarts. Mirrors the cascor sibling, which hashes its identity before
# logging (juniper-cascor ``src/api/workers/security.py``).
_LOG_HASH_KEY = secrets.token_bytes(32)


def _hash_session_key_for_log(session_key: str) -> str:
"""Return a short, non-reversible tag for ``session_key`` that is safe to log.

Keyed HMAC-SHA256 over the raw ``canopy_session`` cookie value with a
per-process secret (:data:`_LOG_HASH_KEY`); only the first 12 hex characters
are returned -- enough to correlate log lines within one process lifetime
without being reversible. Guarantees the raw cookie value never reaches a log
line (SEC-F19 log hygiene).
"""
digest = hmac.new(_LOG_HASH_KEY, session_key.encode("utf-8"), hashlib.sha256).hexdigest()
return digest[:12]


class WebSocketManager:
"""
- Active WebSocket connections with metadata
Expand Down Expand Up @@ -469,7 +499,7 @@ def check_per_session_limit(self, websocket: WebSocket, max_per_session: int) ->
with self._session_lock:
current = self._per_session_counts.get(session_key, 0)
if current >= max_per_session:
self.logger.warning(f"Per-session limit reached for session {session_key[:8]}... ({current}/{max_per_session})")
self.logger.warning(f"Per-session limit reached for session hash={_hash_session_key_for_log(session_key)} ({current}/{max_per_session})")
return False
self._per_session_counts[session_key] = current + 1
return True
Expand Down
46 changes: 44 additions & 2 deletions src/tests/unit/test_ws_connection_caps.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,11 @@
§5 (Option B) / §9 (R2, testing).
"""

from unittest.mock import AsyncMock
from unittest.mock import AsyncMock, MagicMock

import pytest

from communication.websocket_manager import WebSocketManager
from communication.websocket_manager import WebSocketManager, _hash_session_key_for_log


def _make_ws(ip="127.0.0.1", session=None):
Expand Down Expand Up @@ -220,3 +220,45 @@ async def test_global_cap_rejection_releases_reserved_session_slots(self):
mgr.disconnect(ws1)
assert ip not in mgr._per_ip_counts
assert session not in mgr._per_session_counts


@pytest.mark.unit
class TestPerSessionLogHygiene:
"""SEC-F19 log hygiene: the per-session cap must never log the raw cookie.

PR #420 independent-review follow-up. ``check_per_session_limit`` previously
logged ``session_key[:8]`` -- a raw prefix of the signed ``canopy_session``
cookie. It must log a non-reversible keyed hash instead so the raw cookie
value never reaches a log line.
"""

def test_over_cap_warning_hashes_session_and_omits_raw_cookie(self):
raw_cookie = "RAW-canopy-session-COOKIE-9f8e7d6c5b4a"
mgr = WebSocketManager()
# Spy on the (possibly project-SystemLogger) logger directly so the
# assertion is independent of how the logger routes/propagates records.
mgr.logger = MagicMock()

# Fill the per-session cap, then trip it once more to force the warning.
for _ in range(5):
assert mgr.check_per_session_limit(_make_ws(session=raw_cookie), max_per_session=5) is True
assert mgr.check_per_session_limit(_make_ws(session=raw_cookie), max_per_session=5) is False

mgr.logger.warning.assert_called_once()
logged = " ".join(str(arg) for arg in mgr.logger.warning.call_args.args)
# The raw cookie -- and its first-8 prefix, the exact pre-fix leak -- must
# be absent from the emitted log line...
assert raw_cookie not in logged
assert raw_cookie[:8] not in logged
# ...and the keyed hash of the cookie must be present in its place.
assert _hash_session_key_for_log(raw_cookie) in logged

def test_hash_is_reversible_free_deterministic_and_distinct(self):
tag_a = _hash_session_key_for_log("session-A")
tag_b = _hash_session_key_for_log("session-B")
# Deterministic within a process; a compact hex prefix; not the raw value.
assert tag_a == _hash_session_key_for_log("session-A")
assert tag_a != tag_b
assert len(tag_a) == 12
assert all(ch in "0123456789abcdef" for ch in tag_a)
assert "session-A" not in tag_a
Loading