Add private-network VM install modes (DNS-01, self-signed) #1106
Draft
RAVEENSR wants to merge 7 commits into
The advanced VM installer assumed a publicly reachable VM: public letsencrypt issues via TLS-ALPN-01, which needs the ACME CA to connect inbound on :443. Companies that keep Agent Manager on a private network could not use it.

Add two TLS modes to install-advanced.sh for a private VM with internet egress but no public ingress:

- letsencrypt-dns: public-trusted certs via the ACME DNS-01 challenge. The CA validates a DNS TXT record and never connects to the VM, so a firewalled VM still gets trusted certs (split-horizon A records are fine). Issued by a dockerized lego; a daily systemd timer renews it.
- selfsigned: a generated local CA + leaf for fully offline installs with no public zone and no internal CA; the CA is emitted for client trust-store distribution.

Both modes only produce a cert and reuse the existing byoc serving path (Caddyfile, cert mounts, validate_cert), so lib-vm.sh is unchanged. New cert production lives in lib-tls.sh. validate_config learns both modes, the --init template and on-a-vm.mdx document them (with a DNS-01 flow diagram), and --dry-run previews issuance without doing it.

Related to wso2#1017
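As an illustration of the per-mode validation this commit describes, here is a minimal sketch. It is not the PR's validate_config: the function name validate_tls_mode and its positional arguments are hypothetical. The only rules taken from the PR description are that letsencrypt-dns requires DNS_PROVIDER and ACME_EMAIL and that selfsigned requires nothing extra.

```shell
# Hypothetical sketch, not the real validate_config.
# Usage: validate_tls_mode <TLS_MODE> <DNS_PROVIDER> <ACME_EMAIL>
validate_tls_mode() {
  local mode="$1" dns_provider="$2" acme_email="$3"
  case "$mode" in
    letsencrypt-dns)
      # DNS-01 needs a provider to write the TXT record and an ACME account email.
      [ -n "$dns_provider" ] || { echo "letsencrypt-dns requires DNS_PROVIDER" >&2; return 1; }
      [ -n "$acme_email" ]   || { echo "letsencrypt-dns requires ACME_EMAIL" >&2; return 1; }
      ;;
    selfsigned)
      # Fully offline: no extra keys required.
      ;;
    letsencrypt|byoc|upstream)
      # Existing modes; their own checks are omitted from this sketch.
      ;;
    *)
      echo "unknown TLS_MODE: $mode" >&2
      return 1
      ;;
  esac
  return 0
}
```

A caller would pass the values read from the config file, for example `validate_tls_mode "$TLS_MODE" "$DNS_PROVIDER" "$ACME_EMAIL"`, and abort the install on a non-zero status.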
The Advanced tab interleaved public and private guidance and its label
("custom domain / BYOC / load balancer") no longer fit once private
modes were added, making it hard to follow one coherent path.
Rename the tab to "Advanced" and split its body with nested
Public network | Private network tabs (distinct groupId) that hold the
network-specific bits: which TLS modes apply, the DNS records, and a
copy-paste example config per mode. Shared scaffolding (prerequisites,
configure, TLS-modes reference, install, teardown) stays common, so
there is a single source of truth.
The standalone DNS and Private-network sections are dissolved into the
tabs, the flow diagram moves into the Private tab, the TLS-modes table
gains a Network column, and orphaned #adv-dns / #adv-private anchors
are repointed. Docs site builds clean.
In advisory modes (byoc, upstream, selfsigned, letsencrypt-dns) validate_dns returns 0 even when some hostnames did not resolve, so preflight_dns logged "all hostnames resolve to this VM" right after listing the failures. This happens on nearly every run for the private modes, whose internal/private domains rarely resolve at dry-run time. Gate the success message on an empty DNS_ERRORS, and otherwise tell the operator to point their private DNS or client hosts entries at the VM. Reword the advisory note so it fits all advisory modes, not just an LB.
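The gating described above can be sketched as follows. This is an assumption-laden illustration, not the PR's preflight_dns: the function name preflight_dns_summary and the wording of the failure message are hypothetical. Only the success message and the idea of checking for an empty DNS_ERRORS come from the commit message.

```shell
# Hypothetical sketch: print the success line only when nothing failed to resolve.
# $1 stands in for DNS_ERRORS (a list of unresolved hostnames, empty when all resolved).
preflight_dns_summary() {
  local dns_errors="$1"
  if [ -z "$dns_errors" ]; then
    echo "all hostnames resolve to this VM"
  else
    # Advisory modes do not fail here; they only tell the operator what to fix.
    echo "unresolved: ${dns_errors}; point your private DNS or client hosts entries at the VM"
  fi
}
```

Keeping the decision in a function that only prints makes it easy to cover in a hermetic test suite without real DNS lookups.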
Configure is trimmed to just the template-generation step. Each Public and Private network tab now carries its own TLS modes table, a complete config-key table (common plus network-specific keys), BYOC certificate requirements, DNS records, and example configs (including a new byoc example per tab). The shared mixed config-key table and the standalone TLS modes section (with BYOC requirements and upstream topology) were removed; their content was redistributed into the tabs. Links to the old #adv-tls-modes and #adv-byoc-san anchors are repointed to the in-tab subsections.
goacme/lego:latest now resolves to v5.x, whose CLI dropped the global flags build_lego_args emits (e.g. --accept-tos), so letsencrypt-dns issuance failed immediately with "flag provided but not defined". Pin to goacme/lego:v4.35.2 (override via the LEGO_IMAGE env var) in both the initial issuance and the renewal-timer unit, and add a unit-test guard so the image can't regress to :latest.
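A minimal sketch of the pin-with-override pattern and a guard against floating tags follows. The tag goacme/lego:v4.35.2 and the LEGO_IMAGE variable come from the commit message; the function names resolve_lego_image and lego_image_is_pinned are hypothetical and do not claim to match the PR's code or its unit test.

```shell
# Default to the pinned tag; a LEGO_IMAGE value from the environment wins when set.
resolve_lego_image() {
  echo "${LEGO_IMAGE:-goacme/lego:v4.35.2}"
}

# Return 0 only when the image reference carries an explicit, non-latest tag or digest.
lego_image_is_pinned() {
  # Drop the registry/namespace so a registry port is not mistaken for a tag.
  local name="${1##*/}"
  case "$name" in
    *:latest) return 1 ;;  # floating tag
    *:?*)     return 0 ;;  # explicit tag or digest
    *)        return 1 ;;  # no tag at all also resolves to :latest
  esac
}
```

A test suite could assert `lego_image_is_pinned "$(resolve_lego_image)"` so that a future edit back to :latest fails fast.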
A clean single-user advanced install on a fresh VM exhausts the default fs.inotify.max_user_instances ceiling once k3d is up: systemd logs "Failed to allocate directory watch: Too many open files" while enabling the cert-renewal timer, and k3d configmap/secret watches can silently go stale. Add ensure_inotify_limits to the shared bootstrap phase (both the standalone and advanced installers) to raise instances/watches to k3d-friendly floors and persist them under /etc/sysctl.d. Only keys below their floor are touched, so a host already tuned higher is never lowered. The decision is factored into a pure inotify_bump_target helper with unit tests.
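The "only raise, never lower" decision can be sketched as a pure function. The name inotify_bump_target is taken from the commit message, but the signature and body below are assumptions, and the floor value in the usage comment is illustrative because the commit message does not state the actual floors.

```shell
# Sketch under assumptions: print the value to set, or nothing when no change is needed.
# $1: current sysctl value, $2: desired floor.
inotify_bump_target() {
  local current="$1" floor="$2"
  if [ "$current" -lt "$floor" ]; then
    echo "$floor"
  fi
}

# Illustrative use (not executed here; 512 is a made-up floor):
#   target="$(inotify_bump_target "$(sysctl -n fs.inotify.max_user_instances)" 512)"
#   [ -n "$target" ] && sysctl -w "fs.inotify.max_user_instances=$target"
```

Because the helper only compares two integers and prints, it can be unit-tested without root or a real sysctl.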
Purpose
The VM installer has two entry points today: the simple installer (install-vm.sh, sslip.io + Let's Encrypt on :443) and the advanced installer (install-advanced.sh with letsencrypt | byoc | upstream). Both assume the VM is reachable from the public internet. Many companies will not expose the console/API publicly and want Agent Manager reachable only from their private network. The blocker is TLS: public letsencrypt (TLS-ALPN-01) needs the ACME CA to connect inbound to the VM on :443, which a private VM cannot accept.
Related to #1017 (VM standalone install).
Resolves #1081
Goals
Add two TLS modes to the advanced installer so a VM with private ingress and internet egress can run Agent Manager without ever being publicly reachable:
- letsencrypt-dns (recommended): public-trusted certificates via the ACME DNS-01 challenge. The CA validates a DNS TXT record and never connects to the VM, so it works behind a firewall. A records may point at the VM's private IP (split-horizon).
- selfsigned: fully offline, a generated local CA + leaf for teams with neither a public DNS zone nor an internal CA.
byoc (internal-CA cert) and upstream (internal LB) already cover the remaining private variants and are unchanged.
Approach
Both new modes only produce a cert/key on disk and then reuse the existing byoc serving path (Caddyfile rendering, container cert mounts, and validate_cert's SAN/wildcard coverage checks), so lib-vm.sh is untouched and the serving logic is shared.
- deployments/vm/lib-tls.sh: pure helpers (tls_san_list, build_lego_args, render_renewal_units) plus side-effecting issuance (issue_dns01_cert via the dockerized goacme/lego image, generate_selfsigned_ca via openssl, install_renewal_timer).
- letsencrypt-dns issues a cert covering every service host + the *.<AGENTS_BASE> wildcard, copies it to /opt/amp/certs/, and installs a daily systemd timer (lego renew + caddy reload).
- selfsigned generates a local CA + leaf and writes the CA to /opt/amp/certs/ca.crt for distribution to client trust stores.
- validate_config learns the two modes (letsencrypt-dns requires DNS_PROVIDER + ACME_EMAIL; selfsigned requires nothing extra), the --init template documents them, and --dry-run previews the issuance/generation plan without doing it.
User stories
As a platform operator at a security-conscious company, I can install Agent Manager on a private-network-only VM with trustworthy TLS — either public-trusted certificates via DNS-01 (no public ingress) or a generated local CA when fully offline.
Release note
The advanced VM installer now supports private-network-only deployments via two new TLS modes: letsencrypt-dns (public-trusted certificates through the ACME DNS-01 challenge, with automatic renewal) and selfsigned (a generated local CA for fully offline installs).
Documentation
Updated documentation/docs/getting-started/on-a-vm.mdx: new "Private network (no public exposure)" section, the two new modes added to the TLS-modes and config-key tables, DNS split-horizon guidance for DNS-01, and troubleshooting entries.
Training
N/A — no training content impact.
Certification
N/A — no impact on certification exams.
Marketing
N/A — no marketing content for this change.
Automation tests
Extended deployments/vm/tests/run.sh (SAN list, lego arg builder, renewal-unit rendering, validate_config for both modes, --init template, and --dry-run for both modes) and deployments/vm/tests/preflight.sh (local-CA generation produces a cert whose SANs cover every host + the agent wildcard, verified via validate_cert + openssl). Both suites pass; the changed scripts are shellcheck-clean.
The suites are hermetic (no Docker/cluster). A live DNS-01 issuance against a real zone and a selfsigned browser-trust check on a GCP VM are planned as an out-of-band smoke test, mirroring the earlier VM-install smoke tests (this PR is a draft pending that).
Security checks
amp-config.env at install time and the renewal config copy is written 0600; none are committed.
Samples
N/A — no new samples. The docs include a DNS-01 (AWS Route 53) walkthrough.
Related PRs
Builds on the VM standalone (#1017) and advanced-install work in deployments/vm/.
Migrations (if applicable)
N/A: additive feature; no migration required. Switching TLS_MODE on an existing install only re-renders Caddy and re-issues/regenerates the certificate.
Test environment
Unit + preflight suites run locally on macOS (bash 3.2). The installer targets Linux VMs (Ubuntu/Debian, bash 4+); the real-VM smoke test is pending (see Automation tests).
Learning
Key insight: the ACME DNS-01 challenge validates a DNS TXT record rather than connecting to the host, so it can issue public-trusted (and wildcard) certificates for a server with no public ingress. Used lego (single-binary ACME client, ~150 DNS providers) to keep the stock caddy:2 image and reuse the existing byoc serving path rather than building a custom Caddy image with DNS plugins.
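To make the DNS-01 issuance concrete, here is a rough sketch of how a lego v4 argument list for one cert covering the service hosts plus the agent wildcard might be assembled. It is not the PR's build_lego_args: the function name build_lego_args_sketch, its argument order, and the example hostnames are hypothetical. The flags used (--accept-tos, --email, --dns, --domains, and the run subcommand) are lego v4 CLI options; other options a real invocation would need, such as the storage path and provider credentials, are left out.

```shell
# Hypothetical sketch: print lego v4 arguments, one per line.
# Usage: build_lego_args_sketch <acme_email> <dns_provider> <agents_base> <host>...
build_lego_args_sketch() {
  local email="$1" provider="$2" agents_base="$3"
  shift 3
  # Global flags first (lego v4 accepts these before the subcommand).
  printf '%s\n' --accept-tos --email "$email" --dns "$provider"
  # One --domains flag per service host.
  local host
  for host in "$@"; do
    printf '%s\n' --domains "$host"
  done
  # The agent wildcard, then the subcommand that performs first-time issuance.
  printf '%s\n' --domains "*.${agents_base}" run
}
```

For example, `build_lego_args_sketch ops@example.com route53 agents.corp.example console.corp.example api.corp.example` prints the flags for a single certificate whose SANs include both hosts and `*.agents.corp.example`. Emitting one argument per line keeps the builder pure and easy to assert on in a hermetic test.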