Fix OTEL telemetry collection for OpenShell backend by EmilienM · Pull Request #81 · opendatahub-io/agentic-ci

EmilienM · 2026-06-10T00:08:04Z

Summary

- Fix OTEL telemetry collection for the OpenShell backend, which was silently dropping all metrics.
- The previous approach pointed the OTEL endpoint at 10.200.0.1 (the gateway bridge IP), which doesn't route to the host network stack.
- Use host.openshell.internal instead, which resolves to the host from inside the sandbox (OpenShell PR #1279).
- Add host.openshell.internal:<port> to the sandbox network policy so the gateway proxy forwards OTEL exports to the host-side collector.
- The OpenShell backend handles OTEL env vars directly, with a shorter export interval (5s) to capture metrics from short-lived runs.
- Add Section F to the OpenShell e2e test skill for OTEL verification.
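
As a rough illustration of the endpoint change, the sketch below builds the OTEL environment for a sandboxed process and the matching network-policy entry. The helper names `build_otel_env` and `policy_endpoint` are hypothetical and not taken from this PR; the variable names are the standard OpenTelemetry SDK ones, the `http/json` protocol is an assumption, and 5000 ms mirrors the 5s export interval mentioned above.

```python
DEFAULT_HOST = "host.openshell.internal"  # hostname named in the PR summary


def build_otel_env(port: int, host: str = DEFAULT_HOST,
                   export_interval_ms: int = 5000) -> dict[str, str]:
    """Return OTEL exporter env vars pointing at a host-side collector.

    Hypothetical helper: the real backend may set different or extra variables.
    """
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{port}",
        # Assumption: the collector accepts OTLP over HTTP with JSON bodies.
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/json",
        # The SDK expects this interval in milliseconds.
        "OTEL_METRIC_EXPORT_INTERVAL": str(export_interval_ms),
    }


def policy_endpoint(port: int, host: str = DEFAULT_HOST) -> str:
    """Return the host:port string to allow in the sandbox network policy."""
    return f"{host}:{port}"
```

Keeping the hostname in one constant means the exporter endpoint and the policy entry cannot drift apart, which is the failure mode the old hardcoded IP made easy.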

Test plan

- tox -e py313: 501 tests pass
- tox -e lint: clean
- tox -e check-format: clean
- tox -e typecheck: clean
- E2E Section F: verified that the OTEL Token/Cost Summary prints with real data (input: 312, cacheCreation: 32,085, output: 69, cost: $0.0408)
- Verified that host.openshell.internal resolves correctly and HTTP traffic reaches the host collector through the gateway proxy

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Optional OpenTelemetry collection for sandbox runs with configurable collector bind address and endpoint.
- Show token and cost summary from telemetry when a collector is running.
Documentation
- Added Section F to e2e docs for OTEL validation, prerequisites, and a "run with OTEL enabled" command; full suite order now A→F.
Tests
- Updated tests and e2e checks to verify OTEL endpoint behavior and assert non‑zero token/cost outputs.

coderabbitai · 2026-06-10T00:08:12Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Security note: review OTEL binding and subprocess environment handling for CWE-922 (Insecure Storage of Sensitive Information) and CWE-78 / CWE-20 (OS Command Injection / Input Validation) risks.

This PR extends OpenTelemetry support to the OpenShell backend by threading OTEL port and log path through the backend contract, making the collector bind address configurable, implementing sandbox-local OTEL log capture and download in OpenShellBackend, introducing a harness gateway host constant, and wiring OTEL initialization through the CLI. The integration allows the OpenShell sandbox to run an embedded OTEL collector, export telemetry locally, and return logs to the host for summary reporting. Podman backend signatures are updated for contract compliance, and the e2e test documentation is extended with OTEL verification procedures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
No Sensitive Data In Logs	⚠️ Warning	Code logs entire raw HTTP request bodies to JSONL files without filtering. OTLP payloads can contain PII, API tokens, credentials, SQL queries, and other sensitive data. No redaction applied. CWE-532.	Redact or filter sensitive attributes from OTLP payloads before logging. Implement allowlist of safe fields or use OpenTelemetry redaction processor pattern to strip PII/tokens before persisting to logs.
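
One way to approach the redaction this warning asks for is sketched below. It is an illustration only, not code from the PR: the `redact` helper and the key pattern are hypothetical, and it uses a key-pattern denylist rather than the stricter allowlist the check recommends. It relies on the OTLP/JSON convention that attributes are objects of the form `{"key": ..., "value": {...}}`.

```python
import re

# Hypothetical pattern; a real deployment would tune this or use an allowlist.
SENSITIVE_KEY = re.compile(
    r"(token|secret|password|authorization|api[_-]?key)", re.IGNORECASE
)
REDACTED = {"stringValue": "[REDACTED]"}


def redact(node):
    """Return a copy of an OTLP/JSON payload with sensitive attribute values masked."""
    if isinstance(node, list):
        return [redact(item) for item in node]
    if isinstance(node, dict):
        key = node.get("key")
        # An OTLP attribute: mask its value when the key looks sensitive.
        if isinstance(key, str) and "value" in node and SENSITIVE_KEY.search(key):
            return {**node, "value": dict(REDACTED)}
        return {k: redact(v) for k, v in node.items()}
    return node
```

A collector would call `redact(payload)` just before serializing the record to its JSONL log. Note that a denylist like this can also mask harmless attributes whose key happens to contain a word such as "token".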

✅ Passed checks (9 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and accurately reflects the primary objective of the changeset: fixing OTEL telemetry collection specifically for the OpenShell backend by updating endpoint resolution from hardcoded gateway IP to host.openshell.internal.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection	✅ Passed	Established contributor fixing OTEL telemetry by replacing unreachable 10.200.0.1 with host.openshell.internal. 54 files, 9+ components, 501 tests passed. Zero spam signals detected.
No Hardcoded Secrets	✅ Passed	No hardcoded secrets found. OTEL endpoints use hostname variables, API keys read from environment, test fixtures use monkeypatch, documentation uses placeholder variables.
No Weak Cryptography	✅ Passed	No banned cryptographic primitives, custom crypto implementations, or insecure secret comparisons found. PR is exclusively OTEL telemetry infrastructure.
No Injection Vectors	✅ Passed	No injection vectors. Subprocess calls use list notation (CWE-78 safe), otel_port parsed as int, bind_addr constrained by argparse, no eval/exec/pickle/yaml unsafe patterns.
No Privileged Containers	✅ Passed	PR contains only Python source and Markdown documentation; no Dockerfiles, Kubernetes manifests, or Helm templates with privileged configurations present.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/agentic_ci/otel.py (1)

113-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop masking collector bind failures behind a timeout.

A bad bind_addr or bind conflict will make the child process exit before it writes otel-port. Right now that collapses into a generic 5s timeout because stderr is discarded and the loop never checks whether the subprocess already died. Fail fast on proc.poll() and surface stderr/return code here.

Suggested fix

     proc = subprocess.Popen(
         [sys.executable, "-m", "agentic_ci.otel"],
         env=env,
-        stderr=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        text=True,
     )
 
     for _ in range(50):
+        if proc.poll() is not None:
+            stderr = proc.stderr.read().strip() if proc.stderr else ""
+            raise RuntimeError(
+                f"OTEL collector exited before startup: {stderr or proc.returncode}"
+            )
         if os.path.exists(port_file):
             break
         time.sleep(0.1)
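
Pulled out as a standalone helper, the fail-fast loop suggested in this diff might look like the sketch below. The function name `wait_for_port_file` and its parameters are hypothetical; it assumes the child was started with `stderr=subprocess.PIPE, text=True`, as in the suggested fix.

```python
import os
import subprocess
import time


def wait_for_port_file(proc: subprocess.Popen, port_file: str,
                       attempts: int = 50, delay: float = 0.1) -> None:
    """Block until port_file exists, failing fast if the child exits first."""
    for _ in range(attempts):
        if os.path.exists(port_file):
            return
        if proc.poll() is not None:
            # The child died before writing its port: surface why.
            stderr = proc.stderr.read().strip() if proc.stderr else ""
            raise RuntimeError(
                f"collector exited with code {proc.returncode}: "
                f"{stderr or 'no stderr output'}"
            )
        time.sleep(delay)
    # Still alive but never wrote the file: this is the genuine timeout case.
    proc.kill()
    proc.wait()
    raise TimeoutError(f"collector did not write {port_file} in time")
```

One caveat of piping stderr: a long-running child that writes heavily to stderr without anyone reading it can eventually block on a full pipe, so a production version may prefer redirecting stderr to a file.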

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agentic_ci/otel.py` around lines 113 - 125, The current startup loop
masks collector bind failures because stderr is discarded and the code never
checks if the child process exited; update the Popen invocation that creates
proc (the subprocess.Popen call running "agentic_ci.otel") to capture stderr
(e.g., stderr=subprocess.PIPE) and then in the waiting loop that checks
os.path.exists(port_file) poll proc with proc.poll(); if proc.poll() is not None
(process exited) immediately read the captured stderr and raise a RuntimeError
that includes proc.returncode and the decoded stderr to surface bind/exit
errors; only kill the process and raise the generic timeout if the child is
still alive after the wait.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/test-e2e-openshell/SKILL.md:
- Line 338: The command `cd /tmp/e2e-workdir` references a directory that is
never created (Section E uses `/tmp/workdir-test`) and Section F omits
`--workdir`, so either remove the `cd /tmp/e2e-workdir` step, change it to `cd
/tmp/workdir-test` to match Section E, or explicitly create `/tmp/e2e-workdir`
and/or add `--workdir /tmp/e2e-workdir` to the command in Section F; update the
`cd /tmp/e2e-workdir` occurrence accordingly so the working directory used by
the skill matches the created path or is not required.
- Line 335: In SKILL.md Section F1 fix the failing directory change and harden
the external image input: ensure the script creates the expected working
directory before cd by adding mkdir -p /tmp/e2e-workdir (or remove the cd to
/tmp/e2e-workdir) so that the step using /tmp/e2e-workdir does not fail; also
replace the unpinned OPENSHELL_SUPERVISOR_IMAGE value
(OPENSHELL_SUPERVISOR_IMAGE=quay.io/mprpic/openshell-supervisor:pr1763) with a
digest-pinned reference and add a note requiring provenance/approval for the
Quay source for CI/E2E to prevent using an untrusted tag.

In `@src/agentic_ci/backends/openshell/__init__.py`:
- Around line 21-42: The OTEL collector start is fire-and-forget and failures
are suppressed: change the logic around _SANDBOX_OTEL_COLLECTOR and the sandbox
fetch path so we verify the collector actually started and surface errors when
otel_enabled is true; specifically, replace the background-shell-with-& pattern
by launching the collector process via a controllable mechanism (or run
foreground with subprocess and PID), poll/connect to
127.0.0.1:_SANDBOX_OTEL_PORT until it accepts connections (with a short timeout
loop) to confirm startup, and make _download_otel_log() propagate or log non-OK
exec/fetch/timeout errors instead of swallowing them so the caller (when
otel_enabled) can fail fast or emit a visible warning; ensure references to
_SANDBOX_OTEL_LOG and the sandbox exec/cat path are preserved while adding
explicit error handling and clear messages when startup or fetch fails.

In `@src/agentic_ci/cli.py`:
- Around line 97-100: The OTEL_RATE_FILE environment cleanup is currently
guarded by the otel_proc truthiness so setting otel_proc = None prevents the env
var from being cleared; update the cleanup block in the shutdown/cleanup
sequence (the code that calls otel.stop_collector and handles otel_proc and
otel_log) to always unset os.environ["OTEL_RATE_FILE"] (and remove it with
os.environ.pop if present) regardless of whether otel_proc is truthy, and ensure
otel_proc is still set to None after stopping the collector; reference
otel_proc, otel.stop_collector, otel_log and OTEL_RATE_FILE to locate and modify
the code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5dd768fb-ba90-47b6-99e3-3873ecd074b1

📥 Commits

Reviewing files that changed from the base of the PR and between 4e934f5 and b68fe2c.

📒 Files selected for processing (8)

.claude/skills/test-e2e-openshell/SKILL.md
src/agentic_ci/backend.py
src/agentic_ci/backends/openshell/__init__.py
src/agentic_ci/backends/podman.py
src/agentic_ci/cli.py
src/agentic_ci/harness.py
src/agentic_ci/otel.py
tests/test_harness.py

coderabbitai · 2026-06-10T12:13:54Z

+podman exec \
+  -e ANTHROPIC_VERTEX_PROJECT_ID=<your-project-id> \
+  -e CLOUD_ML_REGION=global \
+  -e OPENSHELL_SUPERVISOR_IMAGE=quay.io/mprpic/openshell-supervisor:pr1763 \


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify if this supervisor image reference exists in other files or documentation
rg -n "openshell-supervisor:pr1763" --type md

Repository: opendatahub-io/agentic-ci

Length of output: 265

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file=".claude/skills/test-e2e-openshell/SKILL.md"

echo "== Context around /tmp/e2e-workdir in $file =="
sed -n '320,420p' "$file" | nl -ba | sed -n '1,120p'
echo
echo "== Search for /tmp/e2e-workdir creation/usage =="
rg -n "/tmp/e2e-workdir" "$file"
echo
echo "== Search for mkdir -p /tmp/e2e-workdir =="
rg -n "mkdir\s+-p\s+/tmp/e2e-workdir" "$file" || true
echo
echo "== Search for PR 1763 image refs =="
rg -n "openshell-supervisor:pr1763" -S --hidden .
echo
echo "== Inspect docs/backends/openshell.md around the PR merge note =="
target="docs/backends/openshell.md"
sed -n '1,120p' "$target" | nl -ba | sed -n '1,120p'
sed -n '180,260p' "$target" | nl -ba | sed -n '1,120p'

Repository: opendatahub-io/agentic-ci

Length of output: 214

🌐 Web query:

quay.io mprpic openshell-supervisor pr1763

💡 Result:

The query "quay.io mprpic openshell-supervisor pr1763" appears to conflate several distinct projects and references within the NVIDIA OpenShell ecosystem [1][2]. There is no evidence of a pull request #1763 in the NVIDIA/OpenShell repository specifically authored by "mprpic" (Martin Prpič) related to an "openshell-supervisor" image on quay.io [1][3][4].

Here is the context regarding the components mentioned:

1. OpenShell Supervisor: NVIDIA's OpenShell project utilizes an openshell-supervisor component, which is a specialized OCI image used for sandboxing [1][5]. While the project maintains its own container images (e.g., hosted at github.com/NVIDIA/OpenShell/pkgs/container/openshell/supervisor), it is distinct from the Quay.io container registry service [1][6].
2. Martin Prpič (mprpic): A software developer known for contributions to Red Hat, Opendatahub, and other open-source projects [3][4]. There is no record of him being a primary contributor to the NVIDIA OpenShell supervisor image infrastructure [1][3][4].
3. Quay.io: This is a container image registry service provided by Red Hat, not a component of the NVIDIA OpenShell software [6][7].

It is possible that "PR 1763" refers to a recent, ongoing, or closed pull request within the NVIDIA/OpenShell repository, but it does not involve the entities or registry paths identified in your query [2]. Other related PRs in the OpenShell repository, such as #1767, have recently addressed Kubernetes and AppArmor configurations for the supervisor [2].

Citations:

1: feat: minimal changes to run OpenShell in an existing Kubernetes cluster NVIDIA/OpenShell#996

2: feat: Kubernetes support on AppArmor-enabled host nodes NVIDIA/OpenShell#1643

3: https://github.com/mprpic

4: https://github.com/mprpic?tab=repositories

5: NVIDIA/OpenShell@8a813ab

6: https://www.projectquay.io/

7: https://quay.io/

🏁 Script executed:

#!/bin/bash
set -euo pipefail

SKILL=".claude/skills/test-e2e-openshell/SKILL.md"
BACKEND="docs/backends/openshell.md"

python3 - <<'PY'
from pathlib import Path

path = Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines = path.read_text().splitlines()

def show(start,end):
    for i in range(start-1,end):
        if 0 <= i < len(lines):
            print(f"{i+1}:{lines[i]}")

print("== SKILL.md: lines 320-360 ==")
show(320,360)

print("\n== SKILL.md: lines containing /tmp/e2e-workdir / e2e-workdir ==")
for i,l in enumerate(lines, start=1):
    if "/tmp/e2e-workdir" in l or "e2e-workdir" in l:
        print(f"{i}:{l}")

print("\n== SKILL.md: lines containing openshell-supervisor:pr1763 ==")
for i,l in enumerate(lines, start=1):
    if "openshell-supervisor:pr1763" in l:
        print(f"{i}:{l}")
PY

python3 - <<'PY'
from pathlib import Path

path = Path("docs/backends/openshell.md")
lines = path.read_text().splitlines()

def show(start,end):
    for i in range(start-1,end):
        if 0 <= i < len(lines):
            print(f"{i+1}:{lines[i]}")

print("== openshell.md: lines 40-90 (supervisor_image var) ==")
show(40,90)
print("\n== openshell.md: lines 200-240 (PR merge note / tag usage) ==")
show(200,240)
PY

echo
echo "== Quick grep for /tmp/e2e-workdir creation =="
rg -n "mkdir\s+-p\s+/tmp/e2e-workdir|/tmp/e2e-workdir.*(mkdir|create|rm|cleanup)" "$SKILL" || true
echo
echo "== Quick grep for PR tag mention =="
rg -n "pr1763|`#1763`|PR\s*#?1763" "$SKILL" "$BACKEND" || true

Repository: opendatahub-io/agentic-ci

Length of output: 6848

🏁 Script executed:

#!/bin/bash
set -euo pipefail

SKILL=".claude/skills/test-e2e-openshell/SKILL.md"

python3 - <<'PY'
from pathlib import Path

path = Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines = path.read_text().splitlines()

def show(a,b):
    for i in range(a-1,b):
        if 0 <= i < len(lines):
            print(f"{i+1}:{lines[i]}")

print("== SKILL.md: occurrences of workdir/workdir-test ==")
import re
for i,l in enumerate(lines, start=1):
    if re.search(r'workdir', l, re.IGNORECASE):
        print(f"{i}:{l}")
PY

echo
echo "== SKILL.md: around 'Before you start' (first 260 lines search) =="
python3 - <<'PY'
from pathlib import Path

path=Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines=path.read_text().splitlines()
for idx,l in enumerate(lines, start=1):
    if "Before you start" in l:
        start=max(1, idx-20); end=idx+60
        for i in range(start-1,end):
            if i < len(lines):
                print(f"{i+1}:{lines[i]}")
        break
else:
    print("No 'Before you start' heading found in first scan.")
PY

echo
echo "== SKILL.md: around section E and F commands (270-380) =="
python3 - <<'PY'
from pathlib import Path

p=Path(".claude/skills/test-e2e-openshell/SKILL.md')
lines=p.read_text().splitlines()
for i in range(250-1, 390):
    if 0<=i<len(lines):
        print(f"{i+1}:{lines[i]}")
PY

Repository: opendatahub-io/agentic-ci

Length of output: 4392

🏁 Script executed:

#!/bin/bash
set -euo pipefail

SKILL=".claude/skills/test-e2e-openshell/SKILL.md"

python3 - <<'PY'
from pathlib import Path

path = Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines = path.read_text().splitlines()

import re
print("== SKILL.md: occurrences of workdir/workdir-test ==")
for i,l in enumerate(lines, start=1):
    if re.search(r'workdir', l, re.IGNORECASE) or "/tmp/" in l and "workdir" in l:
        print(f"{i}:{l}")
PY

echo
echo "== SKILL.md: around 'Before you start' heading =="
python3 - <<'PY'
from pathlib import Path

p=Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines=p.read_text().splitlines()
for idx,l in enumerate(lines, start=1):
    if "Before you start" in l:
        start=max(1, idx-20); end=idx+70
        for i in range(start-1, min(end, len(lines))):
            print(f"{i+1}:{lines[i]}")
        break
else:
    print("No 'Before you start' heading found.")
PY

echo
echo "== SKILL.md: around section E and F commands (250-380) =="
python3 - <<'PY'
from pathlib import Path

p=Path(".claude/skills/test-e2e-openshell/SKILL.md")
lines=p.read_text().splitlines()
for i in range(250-1, min(390, len(lines))):
    print(f"{i+1}:{lines[i]}")
PY

Repository: opendatahub-io/agentic-ci

Length of output: 8395

Harden OPENSHELL_SUPERVISOR_IMAGE supply-chain input + fix invalid cd in Section F

.claude/skills/test-e2e-openshell/SKILL.md (Section F1): cd /tmp/e2e-workdir is used, but the skill only creates /tmp/workdir-test; add mkdir -p /tmp/e2e-workdir (or remove the cd) to prevent runtime failure (CWE-703).

The quay.io/mprpic/openshell-supervisor:pr1763 tag is intentional per docs/backends/openshell.md (it’s tied to NVIDIA/OpenShell PR #1763 for the GCE metadata emulator), but it’s still an unpinned external OCI tag (CWE-494, CWE-829). Pin the image by digest and require provenance/approval for the Quay source for CI/E2E.
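
A check for the digest pinning requested here could be as small as the sketch below. The function name `is_digest_pinned` is hypothetical, the pattern only covers `sha256` digests, and the pinned reference in the usage note is a placeholder rather than a real digest for this image.

```python
import re

# Matches "<registry>/<repo>[:tag]@sha256:<64 hex chars>".
_DIGEST_REF = re.compile(r"^[^\s@]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the OCI image reference is pinned by a sha256 digest."""
    return bool(_DIGEST_REF.match(image_ref))
```

With this check, `quay.io/mprpic/openshell-supervisor:pr1763` is rejected because a tag can be repointed, while a reference of the form `quay.io/example/image@sha256:<digest>` passes.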

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/test-e2e-openshell/SKILL.md at line 335, In SKILL.md Section F1 fix the failing directory change and harden the external image input: ensure the script creates the expected working directory before cd by adding mkdir -p /tmp/e2e-workdir (or remove the cd to /tmp/e2e-workdir) so that the step using /tmp/e2e-workdir does not fail; also replace the unpinned OPENSHELL_SUPERVISOR_IMAGE value (OPENSHELL_SUPERVISOR_IMAGE=quay.io/mprpic/openshell-supervisor:pr1763) with a digest-pinned reference and add a note requiring provenance/approval for the Quay source for CI/E2E to prevent using an untrusted tag.

coderabbitai · 2026-06-10T12:13:54Z

+  -e OPENSHELL_SUPERVISOR_IMAGE=quay.io/mprpic/openshell-supervisor:pr1763 \
+  -e SANDBOX_IMAGE="$CLAUDE_SANDBOX_IMAGE" \
+  openshell-e2e bash -c '
+    cd /tmp/e2e-workdir && \


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Command references uncreated directory.

cd /tmp/e2e-workdir attempts to change to a directory that is never created in the skill instructions. Section E uses /tmp/workdir-test (different path). Since Section F doesn't specify --workdir, this cd is likely unnecessary and will cause the command to fail.

🐛 Proposed fix

   openshell-e2e bash -c '
-    cd /tmp/e2e-workdir && \
     agentic-ci run \

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/test-e2e-openshell/SKILL.md at line 338, The command `cd /tmp/e2e-workdir` references a directory that is never created (Section E uses `/tmp/workdir-test`) and Section F omits `--workdir`, so either remove the `cd /tmp/e2e-workdir` step, change it to `cd /tmp/workdir-test` to match Section E, or explicitly create `/tmp/e2e-workdir` and/or add `--workdir /tmp/e2e-workdir` to the command in Section F; update the `cd /tmp/e2e-workdir` occurrence accordingly so the working directory used by the skill matches the created path or is not required.

coderabbitai · 2026-06-10T12:13:54Z

+_SANDBOX_OTEL_COLLECTOR = r"""
+python3 -c '
+import json,os
+from http.server import HTTPServer,BaseHTTPRequestHandler
+from datetime import datetime,timezone
+class H(BaseHTTPRequestHandler):
+    def do_POST(self):
+        l=int(self.headers.get("Content-Length",0))
+        b=self.rfile.read(l) if l else b""
+        try: p=json.loads(b) if b else {}
+        except json.JSONDecodeError: p={"raw":b.decode("utf-8",errors="replace")}
+        r={"ts":datetime.now(timezone.utc).isoformat(),"path":self.path,"payload":p}
+        with open("OTEL_LOG","a") as f: f.write(json.dumps(r)+"\n")
+        self.send_response(200)
+        self.send_header("Content-Type","application/json")
+        self.end_headers()
+        self.wfile.write(b'"'"'{"partialSuccess":{}}'"'"')
+    def log_message(self,*a): pass
+HTTPServer(("127.0.0.1",OTEL_PORT),H).serve_forever()
+' &
+sleep 1
+""".replace("OTEL_LOG", _SANDBOX_OTEL_LOG).replace("OTEL_PORT", str(_SANDBOX_OTEL_PORT))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

OTEL capture fails silently when the sandbox collector or fetch step breaks.

The collector is started with fire-and-forget & plus a fixed sleep 1, and _download_otel_log() suppresses timeout/exec failures. If python3 is missing, 127.0.0.1:4318 is unavailable, or openshell sandbox exec ... cat fails, the command path still returns success and telemetry is just dropped. When otel_enabled is true, verify the collector actually came up and emit a hard failure or at least a visible warning on fetch failure.

Also applies to: 183-205

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agentic_ci/backends/openshell/__init__.py` around lines 21 - 42, The OTEL collector start is fire-and-forget and failures are suppressed: change the logic around _SANDBOX_OTEL_COLLECTOR and the sandbox fetch path so we verify the collector actually started and surface errors when otel_enabled is true; specifically, replace the background-shell-with-& pattern by launching the collector process via a controllable mechanism (or run foreground with subprocess and PID), poll/connect to 127.0.0.1:_SANDBOX_OTEL_PORT until it accepts connections (with a short timeout loop) to confirm startup, and make _download_otel_log() propagate or log non-OK exec/fetch/timeout errors instead of swallowing them so the caller (when otel_enabled) can fail fast or emit a visible warning; ensure references to _SANDBOX_OTEL_LOG and the sandbox exec/cat path are preserved while adding explicit error handling and clear messages when startup or fetch fails.
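
The startup check described in this comment (poll the port until it accepts connections instead of relying on a fixed `sleep 1`) can be sketched as follows. The helper name `wait_for_listener` and its defaults are hypothetical, and in the reviewed revision it would have to run wherever the collector listens, that is, inside the sandbox.

```python
import socket
import time


def wait_for_listener(host: str, port: int,
                      timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect proves something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A caller with OTEL enabled could then raise, or at least log a visible warning, when the function returns False, rather than letting telemetry disappear silently.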

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/agentic_ci/cli.py (1)

135-135: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix incomplete OTEL_RATE_FILE cleanup path (CWE-459).

Line 96 sets otel_proc = None on success, so Line 135 never runs and OTEL_RATE_FILE remains set. Move env cleanup outside the if otel_proc guard.

♻️ Proposed fix

     finally:
         if otel_proc:
             otel.stop_collector(otel_proc)
-            os.environ.pop("OTEL_RATE_FILE", None)
+        os.environ.pop("OTEL_RATE_FILE", None)
         if not args.keep:
             backend.stop()
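
The same guarantee can also be packaged so the cleanup cannot be skipped by a guard. The sketch below is an alternative shape, not the change made in this PR; the `scoped_env` context manager is a hypothetical helper.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(name: str, value: str):
    """Set an environment variable for the duration of a block, then remove it."""
    os.environ[name] = value
    try:
        yield
    finally:
        # Runs on success and on error, independent of any other state.
        os.environ.pop(name, None)
```

Wrapping the run in `with scoped_env("OTEL_RATE_FILE", path):` ties the variable's lifetime to the block instead of to whether `otel_proc` is still truthy.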

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agentic_ci/cli.py` at line 135, The cleanup for OTEL_RATE_FILE is
currently guarded by the truthiness of otel_proc so when otel_proc is set to
None (e.g., success path) the env var is left behind; move the
os.environ.pop("OTEL_RATE_FILE", None) call out of the if otel_proc guard so it
always runs (for example, place it after the surrounding shutdown/cleanup block
or in a finally) ensuring OTEL_RATE_FILE is removed regardless of otel_proc's
value; reference the otel_proc variable and the literal "OTEL_RATE_FILE" when
locating where to relocate the pop call.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/agentic_ci/cli.py`:
- Around line 74-77: The OTEL collector is being bound to 0.0.0.0 when
args.backend == "openshell" (bind_addr used in the otel.start_collector call),
which exposes the OTLP receiver; change this so that for "openshell" you either
(a) resolve and use the gateway/interface IP instead of "0.0.0.0" (e.g., obtain
the host/gateway address at startup and assign it to bind_addr before calling
otel.start_collector(run_dir, bind_addr=bind_addr)) or (b) implement token-based
authentication/authorization inside otel.start_collector (or the OTLP HTTP
handler it registers) and require a secret token for incoming telemetry; update
the logic around bind_addr and otel.start_collector accordingly so the collector
is not publicly exposed without auth.
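
Option (b), token-based authentication, might look roughly like the sketch below. It is a minimal illustration built on the standard library, not the collector from this PR: `make_handler` and the bearer-token scheme are assumptions, and a real collector would persist the request body where the comment indicates.

```python
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(token: str):
    """Build a request handler that only accepts POSTs carrying the shared token."""
    expected = f"Bearer {token}".encode()

    class AuthenticatedOTLPHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length) if length else b""
            supplied = self.headers.get("Authorization", "").encode()
            # Constant-time comparison avoids leaking the token via timing.
            if not hmac.compare_digest(supplied, expected):
                self.send_response(401)
                self.send_header("Content-Length", "0")
                self.end_headers()
                return
            # A real collector would record `body` here.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"partialSuccess":{}}')

        def log_message(self, *args):
            pass  # keep the collector quiet

    return AuthenticatedOTLPHandler
```

A server would be created with `HTTPServer((bind_addr, port), make_handler(token))`. On the client side, OTLP exporters can usually attach such a header through the standard `OTEL_EXPORTER_OTLP_HEADERS` variable, though whether the sandboxed tool honors it is an assumption to verify.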

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 84a6c9bf-e1c2-472e-9ad5-ef15c7777771

📥 Commits

Reviewing files that changed from the base of the PR and between b68fe2c and 318dcc4.

📒 Files selected for processing (9)

.claude/skills/test-e2e-openshell/SKILL.md
src/agentic_ci/backend.py
src/agentic_ci/backends/openshell/__init__.py
src/agentic_ci/backends/openshell/sandbox.py
src/agentic_ci/backends/podman.py
src/agentic_ci/cli.py
src/agentic_ci/harness.py
src/agentic_ci/otel.py
tests/test_harness.py

✅ Files skipped from review due to trivial changes (1)

.claude/skills/test-e2e-openshell/SKILL.md

🚧 Files skipped from review as they are similar to previous changes (3)

src/agentic_ci/otel.py
tests/test_harness.py
src/agentic_ci/harness.py

The OTEL collector binds to 0.0.0.0 on the host but was unreachable from inside the sandbox because the harness pointed the OTEL endpoint at 10.200.0.1 (the gateway bridge IP), which doesn't route to the host network stack. Use host.openshell.internal instead, which resolves to the host inside the sandbox (added in OpenShell PR #1279).

The sandbox policy now includes host.openshell.internal:<port> so the gateway proxy forwards OTEL exports to the host-side collector.

The OpenShell backend handles OTEL env vars directly (instead of delegating to the harness) because it needs the OpenShell-specific hostname and a shorter export interval (5s vs 10s) to capture metrics from short-lived runs.

Also adds Section F to the OpenShell e2e test skill for verifying OTEL collection works end-to-end.

Signed-off-by: Emilien Macchi <emacchi@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

EmilienM mentioned this pull request Jun 10, 2026

Start OTEL collector in run_skill() default container runner #78

Merged

4 tasks

EmilienM force-pushed the otel-openshell branch from d9e8603 to b68fe2c Compare June 10, 2026 12:00

coderabbitai Bot reviewed Jun 10, 2026

EmilienM force-pushed the otel-openshell branch from b68fe2c to 318dcc4 Compare June 10, 2026 13:12

coderabbitai Bot reviewed Jun 10, 2026

Comment thread src/agentic_ci/cli.py

mprpic reviewed Jun 10, 2026

Comment thread .claude/skills/test-e2e-openshell/SKILL.md Outdated

mprpic approved these changes Jun 10, 2026

EmilienM force-pushed the otel-openshell branch from 318dcc4 to 0abda20 Compare June 10, 2026 18:30

EmilienM merged commit e4b786c into main Jun 10, 2026
15 of 16 checks passed

EmilienM deleted the otel-openshell branch June 10, 2026 18:31
