feat(nextjs): standalone output, deterministic builds, and scoped Fastly purge #3364

Open: blarghmatey wants to merge 5 commits into main from tmacey/nextjs-deployment-simplification

@blarghmatey blarghmatey commented May 20, 2026

What are the relevant tickets?

N/A

Description (What does it do?)

Addresses the root causes of missing-chunk 404 errors in the MIT Learn Next.js deployment by removing the EFS/blue-green architecture and adding runtime env injection so a single Docker image works across all environments.

1. Scoped Fastly cache invalidation
Adds a Surrogate-Key: html-pages response header to HTML routes and sitemaps. The Concourse pipeline now calls /service/b3/purge/html-pages instead of purge_all — so immutable /_next/static/ content-addressed chunks are never evicted during a deploy.
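The tagging rule can be sketched as a small pure function. This is an illustration only: the helper name `surrogateKeyFor`, the sitemap pattern, and the file-extension regex are assumptions, not the actual `next.config.js` rules. The `/healthcheck` exclusion reflects a later commit in this PR.

```typescript
// Hypothetical helper: decides which request paths receive the html-pages key.
const HTML_PAGES_KEY = "html-pages"

// Assumed patterns; the real config relies on an existing file-extension regex.
const SITEMAP_PATH = /sitemap[^/]*\.xml$/i
const FILE_EXTENSION = /\.[a-z0-9]+$/i

function surrogateKeyFor(pathname: string): string | null {
  // Content-addressed chunks are immutable and must never be purged.
  if (pathname.startsWith("/_next/static/")) return null
  // The healthcheck returns JSON, so it is not tagged as an HTML page.
  if (pathname === "/healthcheck") return null
  // Sitemaps are tagged together with HTML routes.
  if (SITEMAP_PATH.test(pathname)) return HTML_PAGES_KEY
  // Any other path ending in a file extension is treated as a static asset.
  if (FILE_EXTENSION.test(pathname)) return null
  return HTML_PAGES_KEY
}
```

With this rule, a purge of `html-pages` touches rendered pages and sitemaps while leaving everything under `/_next/static/` cached.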

2. Deterministic builds

  • generateBuildId returns NEXT_PUBLIC_VERSION || GIT_REF || 'dev' — stabilises the build manifest filename across identical builds.
  • Webpack output.filename / output.chunkFilename overridden to use [contenthash] instead of [chunkhash], making chunk filenames depend only on file content rather than module graph ordering.
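A minimal sketch of both changes, assuming the filename templates are plain strings (webpack also allows functions, which this sketch leaves untouched). The helper names `resolveBuildId`, `toContentHash`, and `applyContentHash` are hypothetical and not taken from the PR.

```typescript
type EnvLike = Record<string, string | undefined>

// Mirrors the fallback chain NEXT_PUBLIC_VERSION || GIT_REF || 'dev'.
function resolveBuildId(env: EnvLike): string {
  return env["NEXT_PUBLIC_VERSION"] || env["GIT_REF"] || "dev"
}

// Swap every [chunkhash] placeholder for [contenthash].
function toContentHash(template: string): string {
  return template.split("[chunkhash]").join("[contenthash]")
}

interface OutputLike {
  filename?: unknown
  chunkFilename?: unknown
}

// Returns a copy of the webpack output options with string templates rewritten.
function applyContentHash(output: OutputLike): OutputLike {
  const next: OutputLike = { ...output }
  if (typeof next.filename === "string") {
    next.filename = toContentHash(next.filename)
  }
  if (typeof next.chunkFilename === "string") {
    next.chunkFilename = toContentHash(next.chunkFilename)
  }
  return next
}

// Possible wiring in next.config.js (an assumption, not the PR's code):
//   generateBuildId: async () => resolveBuildId(process.env),
//   webpack: (config) => {
//     config.output = applyContentHash(config.output)
//     return config
//   },
```

An empty `NEXT_PUBLIC_VERSION` falls through to `GIT_REF` because `||` treats the empty string as falsy.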

3. Standalone Docker image (replaces EFS + Kubernetes build Job)

  • Adds output: 'standalone' to next.config.js so Next.js emits a self-contained server bundle.
  • Removes the build_skip_yarn Docker stage; adds a new slim runner stage that copies .next/standalone/, .next/static/, and public/ from the build stage into a clean node:24-alpine image.
  • Adds ARG GIT_REF / ENV GIT_REF to the build stage so the git SHA passed by Concourse as BUILD_ARG_GIT_REF is available to next build.
  • The build is now fully baked into the image at CI time — no EFS volume or Kubernetes Job needed at deploy time.

4. Runtime env injection via PublicEnvScript + env() helper
Fixes a critical issue with standalone builds: webpack's DefinePlugin inlines all process.env.NEXT_PUBLIC_* references as literal strings at build time. Since yarn build runs in CI without per-environment values, all NEXT_PUBLIC_* vars would be empty strings in the bundle — even though Kubernetes sets the correct values.

The solution decouples env var reads from the build:

  • src/env.ts: env("NEXT_PUBLIC_FOO") helper. In the browser, reads window.__ENV[key] (set before any JS loads). On the server, uses dynamic bracket access (process.env[key]) which is not replaced by DefinePlugin. Falls back to process.env[key] in the browser so existing vi.stubEnv() / direct process.env assignments in tests continue to work without modification.

  • src/app/components/PublicEnvScript.tsx — a Next.js Server Component that calls connection() (opts the route out of SSG so env is read at request time) and renders a synchronous inline <script>window.__ENV={...}</script> in <head> with all NEXT_PUBLIC_* env vars serialized as HTML-safe JSON. Because it's a plain <script> tag (not next/script), it executes synchronously before any JS module — including Sentry's instrumentation-client.ts.

  • 56 source files migrated: all process.env.NEXT_PUBLIC_* references replaced with env() calls; no test files changed.

  • PostHog bootstrap feature flags (NEXT_PUBLIC_FEATURE_*) are now computed at runtime from window.__ENV inside useEffect in ConfiguredPostHogProvider, rather than being baked via processFeatureFlags() in next.config.js.

  • Server startup validation: next.config.js skips validateEnv() during Docker builds (NEXT_BUILD_CI=1). Runtime validation was moved to instrumentation-node.ts (runs via instrumentation.ts register() at server startup) — if required env vars are missing, the pod exits with code 1 and crash-loops to make the misconfiguration immediately visible.
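The lookup order of the `env()` helper can be sketched as follows. This is an assumption about its shape, not the contents of `src/env.ts`. The real helper would presumably read `process.env[key]` directly, since the dynamic bracket access is what keeps DefinePlugin from inlining the value; the `globalThis` indirection here only keeps the sketch free of Node and DOM type dependencies.

```typescript
type PublicEnv = Record<string, string | undefined>

// Minimal view of the globals the sketch touches.
interface RuntimeGlobals {
  window?: { __ENV?: PublicEnv }
  process?: { env: PublicEnv }
}

function env(key: string): string | undefined {
  const g = globalThis as unknown as RuntimeGlobals
  // Browser: prefer the values injected by the inline script in <head>.
  const injected = g.window?.__ENV?.[key]
  if (injected !== undefined) return injected
  // Server, or jsdom tests that assign process.env directly.
  return g.process?.env[key]
}
```

Because the fallback reads `process.env` at call time, tests that stub variables before calling `env()` see the stubbed value without any changes.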

Companion infrastructure PR: mitodl/ol-infrastructure#4652 — removes the blue/green EFS deployment and Kubernetes build Job from Pulumi. These two PRs must be merged and deployed together.

How can this be tested?

Standalone image + env injection:

docker build -f frontends/main/Dockerfile.web --target runner -t mit-learn-nextjs:test .
docker run --rm -p 3000:3000 \
  -e NEXT_PUBLIC_ORIGIN=http://localhost:3000 \
  -e NEXT_PUBLIC_MITOL_API_BASE_URL=https://api.example.com \
  -e NEXT_PUBLIC_SITE_NAME="MIT Learn" \
  -e NEXT_PUBLIC_MITOL_SUPPORT_EMAIL=mitlearn-support@mit.edu \
  -e NEXT_PUBLIC_CSRF_COOKIE_NAME=csrftoken \
  mit-learn-nextjs:test
  • Visit http://localhost:3000 — app should load with correct env values
  • curl -s http://localhost:3000/ | grep 'window.__ENV' — should show the injected env vars in the HTML <head>
  • Start the container without the required env vars. It should exit immediately with code 1 and a clear error message.

Fastly surrogate key:

curl -I http://localhost:3000/
# → Surrogate-Key: html-pages

curl -I http://localhost:3000/_next/static/chunks/main.js
# → no Surrogate-Key header

Deterministic builds:
Build twice from the same source; confirm chunk filenames are identical.

Additional Context

Deployment coordination: this PR and mitodl/ol-infrastructure#4652 must be deployed as a single pulumi up — deploying the new Dockerfile without removing the Kubernetes build Job (or vice versa) will break the app.

One-time post-deploy: the first deployment after merging will have no existing html-pages-tagged objects in Fastly (nothing was tagged before). A one-time manual purge_all should be run immediately after the first deploy to ensure stale pages are cleared, after which the scoped purge takes over.
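For comparison, the two purge requests differ only in their path. The sketch below builds them against the public Fastly purge endpoints; the function name, the placeholder service ID and token, and the use of `fetch` are assumptions, since the PR does not show how the Concourse pipeline issues the call.

```typescript
interface PurgeRequest {
  url: string
  method: "POST"
  headers: Record<string, string>
}

// Builds a scoped purge when a surrogate key is given, otherwise a purge_all.
function buildPurgeRequest(
  serviceId: string,
  apiToken: string,
  surrogateKey?: string,
): PurgeRequest {
  const base = `https://api.fastly.com/service/${encodeURIComponent(serviceId)}`
  const url = surrogateKey
    ? `${base}/purge/${encodeURIComponent(surrogateKey)}`
    : `${base}/purge_all`
  return {
    url,
    method: "POST",
    headers: { "Fastly-Key": apiToken, Accept: "application/json" },
  }
}

// Example usage with placeholder credentials:
//   const req = buildPurgeRequest("SERVICE_ID", "API_TOKEN", "html-pages")
//   await fetch(req.url, { method: req.method, headers: req.headers })
```

Omitting the key yields the `purge_all` request suggested for the one-time post-deploy cleanup.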

…tly purge

Three coordinated improvements to the Next.js deployment pipeline:

1. Scoped Fastly cache invalidation
   - Add Surrogate-Key: html-pages response header to HTML routes and
     sitemaps (excludes /_next/static/ via the existing file-extension
     regex), so purge/html-pages at deploy time no longer invalidates
     immutable content-addressed static chunks.

2. Deterministic builds
   - Add generateBuildId returning NEXT_PUBLIC_VERSION || GIT_REF || 'dev'
     so the build manifest filename is stable across identical builds.
   - Override webpack output.filename / output.chunkFilename to use
     [contenthash] instead of [chunkhash], making chunk filenames depend
     only on content rather than module graph ordering.

3. Standalone output + slimmed Docker image
   - Add output: 'standalone' to next.config.js so Next.js emits a
     self-contained server with a minimal node_modules tree.
   - Remove the build_skip_yarn Docker stage; add a new slim runner stage
     that copies .next/standalone/, .next/static/, and public/ from the
     build stage into a clean node:24-alpine image. The build is now fully
     baked in at image build time — no EFS volume or Kubernetes Job needed
     at deploy time.
   - Add ARG/ENV GIT_REF to the build stage so the git SHA passed by
     Concourse as BUILD_ARG_GIT_REF is available to next build.

BREAKING: this Dockerfile change must be deployed together with the
corresponding ol-infrastructure Pulumi change that removes the blue/green
EFS deployment and Kubernetes build Job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 20:05
github-actions Bot commented May 20, 2026

OpenAPI Changes

No changes detected

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot AI left a comment

Pull request overview

This PR updates the MIT Learn Next.js frontend build/deploy approach to reduce missing-chunk 404s by scoping CDN purges, making build outputs more deterministic, and shipping a standalone Next.js server bundle in the Docker image.

Changes:

  • Add Surrogate-Key: html-pages to sitemap and “page” responses to enable targeted Fastly purges without touching immutable /_next/static/* assets.
  • Make builds more deterministic via a stable generateBuildId and replacing webpack [chunkhash] with [contenthash] in output filenames.
  • Switch the Docker image to a standalone runner stage that runs the emitted server via node and copies only the standalone output + static/public assets.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| frontends/main/next.config.js | Adds standalone output, surrogate-key tagging, stable build IDs, and deterministic chunk naming. |
| frontends/main/Dockerfile.web | Reworks the container build into build + slim runner stage that runs the standalone server bundle. |

Comment thread frontends/main/next.config.js
Comment thread frontends/main/Dockerfile.web
Comment thread frontends/main/Dockerfile.web
blarghmatey and others added 2 commits May 20, 2026 16:17
…tern

- Fix misleading comment: standalone bundle includes a minimal node_modules,
  not zero node_modules
- Add NODE_ENV=production to runner stage; it was previously inherited from
  the base stage but the runner uses a fresh FROM so must be set explicitly
- Exclude /healthcheck from the html-pages Surrogate-Key pattern; healthcheck
  returns JSON and should not be tagged as an HTML page for Fastly purges

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace webpack build-time NEXT_PUBLIC_* inlining with a runtime injection
pattern that allows a single Docker image to be deployed across environments.

Problem: DefinePlugin bakes all process.env.NEXT_PUBLIC_* references as
literal values at build time. When yarn build runs in CI (without per-env
values), all NEXT_PUBLIC_* vars are empty strings in the bundle—even though
the Kubernetes pod has the correct values set.

Solution:
- PublicEnvScript: a Server Component that calls connection() (opts out of
  SSG) and renders a synchronous inline <script> setting window.__ENV to
  all NEXT_PUBLIC_* values from process.env at request time. Placed in root
  <head> so it executes before any JS bundle loads.
- env(): a helper that reads window.__ENV in the browser and uses dynamic
  process.env[key] access on the server (dynamic bracket access is NOT
  replaced by DefinePlugin). Falls back to process.env[key] in the browser
  to support test environments (jsdom) that set process.env directly.
- Migrate all 143 process.env.NEXT_PUBLIC_* usages across 56 source files
  to env(). Test files unchanged—existing vi.stubEnv / direct process.env
  assignment continues to work via the fallback.
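The serialization step in `PublicEnvScript` might look like the sketch below. It covers only the string-building part: the real component is a Server Component that also calls `connection()`, and the function names here are hypothetical. Escaping `<`, `>`, `&`, U+2028, and U+2029 is one common way to make JSON safe inside an inline script; the PR does not say which characters it escapes.

```typescript
type EnvSource = Record<string, string | undefined>

// Keep only defined NEXT_PUBLIC_* entries.
function collectPublicEnv(source: EnvSource): Record<string, string> {
  const out: Record<string, string> = {}
  for (const [key, value] of Object.entries(source)) {
    if (key.startsWith("NEXT_PUBLIC_") && value !== undefined) {
      out[key] = value
    }
  }
  return out
}

// Characters that could end the script element or break older JS parsers.
const HTML_UNSAFE: Record<string, string> = {
  "<": "\\u003c",
  ">": "\\u003e",
  "&": "\\u0026",
  "\u2028": "\\u2028",
  "\u2029": "\\u2029",
}

function toHtmlSafeJson(value: unknown): string {
  return JSON.stringify(value).replace(
    /[<>&\u2028\u2029]/g,
    (ch) => HTML_UNSAFE[ch] ?? ch,
  )
}

// Produces the inline script body that sets window.__ENV.
function renderEnvScript(source: EnvSource): string {
  return `<script>window.__ENV=${toHtmlSafeJson(collectPublicEnv(source))}</script>`
}
```

The escaped output is still valid JSON and valid JavaScript, so the browser reconstructs the original strings when the script runs.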

Additional changes:
- next.config.js: gate validateEnv() with NEXT_BUILD_CI to skip at build
  time; runtime validation added to instrumentation-node.ts (runs at server
  startup via instrumentation.ts register()).
- Dockerfile: set ENV NEXT_BUILD_CI=1 in build stage.
- ConfiguredPostHogProvider: compute PostHog bootstrap feature flags from
  window.__ENV at runtime (previously baked via processFeatureFlags() in
  next.config.js which returns {} in CI builds). Remove build-time
  FEATURE_FLAGS from next.config.js env config.
- instrumentation-node.ts: use dynamic process.env[key] access for
  NEXT_PUBLIC_SENTRY_* and NEXT_PUBLIC_VERSION to avoid DefinePlugin
  inlining; add required env var check at startup.
- otel-utils.ts: same dynamic access for APP_VERSION.
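The split between build-time and startup validation can be sketched like this. The required-variable list, the function names, and the injected `exit` and `log` callbacks are assumptions made for the sake of a testable example; the PR does not enumerate which variables `instrumentation-node.ts` checks.

```typescript
type EnvSource = Record<string, string | undefined>

// Hypothetical list; the actual required variables are not shown in the PR.
const REQUIRED_ENV: readonly string[] = [
  "NEXT_PUBLIC_ORIGIN",
  "NEXT_PUBLIC_MITOL_API_BASE_URL",
]

// Build time: skip validation when the Docker build sets NEXT_BUILD_CI=1.
function shouldValidateAtBuild(source: EnvSource): boolean {
  return source["NEXT_BUILD_CI"] !== "1"
}

// Unset and empty values both count as missing.
function findMissingEnv(required: readonly string[], source: EnvSource): string[] {
  return required.filter((key) => !source[key])
}

// Startup: log the missing names and exit non-zero so the pod crash-loops.
function validateEnvOrExit(
  required: readonly string[],
  source: EnvSource,
  exit: (code: number) => void,
  log: (message: string) => void,
): void {
  const missing = findMissingEnv(required, source)
  if (missing.length === 0) return
  log(`Missing required environment variables: ${missing.join(", ")}`)
  exit(1)
}

// Possible call at server startup (an assumption):
//   validateEnvOrExit(REQUIRED_ENV, process.env, process.exit, console.error)
```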

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@blarghmatey

@copilot address the test failures

Copilot AI and others added 2 commits May 20, 2026 22:06
Agent-Logs-Url: https://github.com/mitodl/mit-learn/sessions/03f963fb-4658-4987-87e4-3a7610eee518

Co-authored-by: blarghmatey <479088+blarghmatey@users.noreply.github.com>
Agent-Logs-Url: https://github.com/mitodl/mit-learn/sessions/03f963fb-4658-4987-87e4-3a7610eee518

Co-authored-by: blarghmatey <479088+blarghmatey@users.noreply.github.com>
Copilot AI commented May 20, 2026

> @copilot address the test failures

Addressed in 9e82cc7, with a small cleanup in c7c0888. I traced the failing CI run to the javascript-tests job’s Next.js build step, added a typed requiredEnv(...) helper, and switched the robots/sitemap modules to use it so NEXT_PUBLIC_ORIGIN narrows to string during build. I also reran the production build locally with the CI env values and it now passes, and the final review/CodeQL checks are clean.
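A typed helper of this kind might look like the sketch below. The comment only shows the call shape `requiredEnv(...)`, so the second parameter, the error message, and the treatment of empty strings are assumptions.

```typescript
type EnvSource = Record<string, string | undefined>

// Returns the value as a plain string, or throws if it is unset or empty.
function requiredEnv(key: string, source: EnvSource): string {
  const value = source[key]
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${key}`)
  }
  return value
}

// The return type is string, so callers need no undefined check:
const origin: string = requiredEnv("NEXT_PUBLIC_ORIGIN", {
  NEXT_PUBLIC_ORIGIN: "http://localhost:3000",
})
const sitemapUrl = new URL("/sitemap.xml", origin).toString()
```

Throwing instead of returning `undefined` is what lets the type narrow to `string` in modules such as robots and sitemap generation.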

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • o1.ingest.sentry.io
    • Triggering command: /usr/local/bin/node /usr/local/bin/node /home/REDACTED/work/mit-learn/mit-learn/node_modules/next/dist/bin/next build (dns block)
