MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via sidecar #130

Merged
seongsu-dev merged 10 commits into main from MAF-19750-mif-alert-provisioning on May 19, 2026

seongsu-dev commented May 14, 2026

Summary

  • Adds Helm-managed Grafana Unified Alerting provisioning to the MIF chart, covering all four alerting resources (rules, notification templates, routing policies, and the Heimdall Slack contact point) as ConfigMaps labelled grafana_alert=1. The grafana-sc-alerts sidecar mounts them into /etc/grafana/provisioning/alerting/ and Grafana reloads automatically — mirroring the existing templates/grafana/dashboard-configmap.yaml pattern.
  • Clickable links inside Slack messages (alert rule view, Grafana Explore) are built from Grafana's own externalURL, which is in turn driven by prometheus-stack.grafana.grafana.ini.server.root_url. Operators only need to set that single value to expose the cluster's public Grafana URL — the chart does not carry a separate URL knob anymore.
  • The Slack webhook URL is supplied either by alerts.heimdall.slack.existingSecret + slack.secretKeys.webhookUrlKey (resolved via Helm lookup) or directly via the inline alerts.heimdall.slack.webhookUrl, following the Bitnami secretKeys.<role>Key secret-reference convention. The receiver name is hardcoded to heimdall-slack so the routing policy and the contact-points ConfigMap line up automatically.
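
The webhook resolution described in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the merged `heimdall-slack-configmap.yaml`: the ConfigMap name prefix, the data key, and the receiver `uid` are assumptions, and the sidecar gate is left out for brevity.

```yaml
{{- /* Sketch only: resolve the webhook URL, preferring existingSecret. */}}
{{- $slack := .Values.alerts.heimdall.slack }}
{{- $url := "" }}
{{- if $slack.existingSecret }}
  {{- /* lookup returns an empty map under `helm template` (no cluster). */}}
  {{- $secret := lookup "v1" "Secret" .Release.Namespace $slack.existingSecret }}
  {{- if $secret }}
    {{- $url = index $secret.data $slack.secretKeys.webhookUrlKey | default "" | b64dec }}
  {{- end }}
{{- else }}
  {{- $url = $slack.webhookUrl | default "" }}
{{- end }}
{{- /* Strip stray whitespace, e.g. a trailing newline from --set-file. */}}
{{- $url = trim $url }}
{{- if and .Values.alerts.heimdall.enabled $url }}
apiVersion: v1
kind: ConfigMap
metadata:
  # Name prefix is an assumption for this sketch.
  name: {{ .Release.Name }}-alert-heimdall-slack-contact-points
  labels:
    grafana_alert: "1"
data:
  # Data key is hypothetical.
  heimdall-slack-contact-points.yaml: |-
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack   # assumed uid
            type: slack
            settings:
              url: {{ $url | quote }}
{{- end }}
```

With neither source producing a URL, `$url` stays empty and nothing is rendered, which matches the "ConfigMap is skipped" behaviour described in this PR.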

Background

This is the productisation of the MAF-19750 PoC that was validated on the p-cluster mif release. Three ConfigMaps (rules / templates / policies) were originally applied by hand to deliver: Heimdall error logs → Loki → Grafana Unified Alerting → Slack. This PR moves those resources into the chart, generalises the PoC-specific bits (namespace="seongsu", owner: seongsu, environment: p-cluster labels, hardcoded grafana.product.moreh.dev URLs), and adds the Slack contact point as a chart-managed resource as well so the chart is self-contained.

The LogQL selector {app="heimdall-inference-scheduler", level="error"} is chart-name based, so a single alert rule covers every Heimdall release in the cluster without per-release duplication.

Files

  • templates/grafana/alert-configmap.yaml — new template, generates one ConfigMap per file under files/alerts/*.yaml using Files.Get as a pure pass-through (no chart-side string substitution).
  • templates/grafana/heimdall-slack-configmap.yaml — new template, generates the Slack contact-points provisioning ConfigMap when alerts.heimdall.enabled is true and a webhook URL is available (via existingSecret + secretKeys.webhookUrlKey, or inline webhookUrl).
  • templates/grafana/datasource-loki.yaml — pins the Loki datasource UID to loki so the alert rule's datasourceUid: loki reference can resolve reliably.
  • files/alerts/heimdall-rules.yaml — alert rule. Annotations reference {{ externalURL }} directly (Grafana substitutes it at evaluation time from server.root_url).
  • files/alerts/heimdall-templates.yaml — Slack notification template (heimdall-slack.title, heimdall-slack.body).
  • files/alerts/heimdall-policies.yaml — routing policy: matches component=heimdall → heimdall-slack receiver.
  • values.yaml — enables prometheus-stack.grafana.sidecar.alerts and adds the top-level alerts.heimdall.{enabled, slack.{webhookUrl, existingSecret, secretKeys.webhookUrlKey}} section.
  • deploy/helm/AGENTS.md — adds an "Alert provisioning" rule capturing the ConfigMap-only layout, the operator contract around server.root_url, and conventions for adding new alert files (no tpl, secret values resolved via lookup). Existing rules condensed in the same pass.
  • deploy/helm/moai-inference-framework/README.md — regenerated by make helm-docs.
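
For orientation, a pass-through template of the kind described for `templates/grafana/alert-configmap.yaml` might look like the sketch below. It is an assumption-laden illustration rather than the actual file: the ConfigMap naming scheme and the exact form of the gate are guesses based on the rendered names and toggles listed in this description.

```yaml
{{- /* Sketch only: one ConfigMap per files/alerts/*.yaml, no tpl, no replace. */}}
{{- if and (index .Values "prometheus-stack" "grafana" "sidecar" "alerts" "enabled") .Values.alerts.heimdall.enabled }}
{{- range $path, $_ := .Files.Glob "files/alerts/*.yaml" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  # e.g. files/alerts/heimdall-rules.yaml -> <release>-alert-heimdall-rules (assumed naming)
  name: {{ printf "%s-alert-%s" $.Release.Name (base $path | trimSuffix ".yaml") }}
  labels:
    grafana_alert: "1"
data:
  {{ base $path }}: |-
    {{- $.Files.Get $path | nindent 4 }}
{{- end }}
{{- end }}
```

Because `Files.Get` returns the file as a plain string and the result is never passed through `tpl`, any Grafana Go template syntax inside the alert files reaches the ConfigMap unchanged.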

Operator contract

| Setting | Purpose |
| --- | --- |
| alerts.heimdall.enabled (default false) | Toggle the whole pipeline. Disabled by default because a Slack webhook URL must be supplied first. |
| prometheus-stack.grafana.grafana.ini.server.root_url | Cluster's public Grafana URL. Used for all clickable links surfaced in Slack messages (Grafana's own title link + the Explore / rule-view links emitted by the notification template). Without it, Grafana falls back to http://localhost:3000 and every link becomes unreachable. |
| alerts.heimdall.slack.existingSecret + slack.secretKeys.webhookUrlKey | Reference an externally managed Secret containing the webhook URL (Bitnami secret-reference shape). Takes precedence over the inline webhookUrl. |
| alerts.heimdall.slack.webhookUrl | Inline webhook URL. Used only when existingSecret is empty. SECRET — pass via --set-file / sealed-secrets / SOPS / external secrets operator; never commit. |
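
As an illustration of the contract above, an operator override file might look like this. The hostname and Secret name are hypothetical placeholders; the key names follow the settings listed in the table.

```yaml
# values-override.yaml (sketch; hostname and Secret name are placeholders)
prometheus-stack:
  grafana:
    grafana.ini:
      server:
        # Public Grafana URL; every Slack link is built from this value.
        root_url: https://grafana.example.com
    sidecar:
      alerts:
        enabled: true

alerts:
  heimdall:
    enabled: true
    slack:
      # Leave the inline value empty when referencing a Secret.
      webhookUrl: ""
      # Externally managed Secret holding the webhook URL (hypothetical name).
      existingSecret: heimdall-slack-webhook
      secretKeys:
        webhookUrlKey: webhook-url
```

Since `existingSecret` takes precedence, the inline `webhookUrl` is ignored in this configuration.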

Out of scope (follow-up)

  • moreh-iac integration. A later change to moreh-iac/SNUSHC/p-cluster/mif/mif.tf should bump the chart version, set prometheus-stack.grafana.grafana.ini.server.root_url=https://grafana.product.moreh.dev, and either point alerts.heimdall.slack.existingSecret (+ slack.secretKeys.webhookUrlKey) at an externally managed Secret or pass webhookUrl from a CI secret. Once that lands, the PoC's hand-applied ConfigMaps can be removed.
  • Additional alert rules. Heimdall panic detection, gRPC 5xx burst, responses-store backend errors, etc. would land as additional files/alerts/heimdall-<category>.yaml files in follow-up PRs.

Test plan

  • helm lint deploy/helm/moai-inference-framework — passes.
  • helm template with --set alerts.heimdall.enabled=true --set-file alerts.heimdall.slack.webhookUrl=<webhook> renders four ConfigMaps (*-alert-heimdall-{rules,templates,policies,slack-contact-points}) all labelled grafana_alert: "1". Render also verified for the existingSecret + secretKeys.webhookUrlKey path (empty URL under helm template because lookup requires a cluster, ConfigMap correctly skipped).
  • kubectl apply --dry-run=client on the rendered ConfigMaps — created (dry run) for all four.
  • ConfigMap data parses as valid YAML and matches the Grafana provisioning schema (apiVersion: 1, groups/templates/policies/contactPoints; alert rule uid/title/condition/data/noDataState/execErrState; object_matchers 3-tuple form; receiver name hardcoded to heimdall-slack).
  • Grafana's own Go template syntax ({{ printf "%.180s" .message }}, {{ externalURL }}, {{ define "heimdall-slack.title" }}, {{ if .Annotations.exploreURL }}) is preserved as raw text in the rendered ConfigMap — Helm does not evaluate it.
  • alerts.heimdall.enabled=false and prometheus-stack.grafana.sidecar.alerts.enabled=false each gate the ConfigMaps off (0 rendered). With enabled=true but no webhook supplied, the contact-points ConfigMap is skipped and Slack delivery is silently off (alert rules still fire).
  • The rendered Grafana Deployment includes the grafana-sc-alerts sidecar with LABEL=grafana_alert, FOLDER=/etc/grafana/provisioning/alerting, and the correct reload URL.
  • make helm-docs regenerates deploy/helm/moai-inference-framework/README.md to include the new alerts.heimdall.* and prometheus-stack.grafana.sidecar.alerts.enabled keys.
  • End-to-end on a local kind cluster, both secret-reference paths exercised: (a) inline slack.webhookUrl and (b) slack.existingSecret + slack.secretKeys.webhookUrlKey with a separately-applied Secret. Mock pod emitting JSON error logs → Vector → Loki → LogQL rule evaluation → routing policy match (component=heimdall) → Slack contact point → channel message delivered. All three links (Grafana-emitted title link, body View error logs in Grafana, body View alert rule) resolve to the server.root_url prefix.
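
For readers unfamiliar with the `object_matchers` 3-tuple form checked above, a routing policy file in Grafana's provisioning format could look like the sketch below. The use of `heimdall-slack` as the root receiver and the nesting under `routes` are assumptions; the actual `heimdall-policies.yaml` may be laid out differently.

```yaml
# Sketch of a notification policy provisioning file (layout assumed).
apiVersion: 1
policies:
  - orgId: 1
    # Root receiver; assumed here to be the chart-managed contact point.
    receiver: heimdall-slack
    routes:
      - receiver: heimdall-slack
        # Each matcher is a [label, operator, value] 3-tuple.
        object_matchers:
          - ["component", "=", "heimdall"]
```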

Bugs found during validation and fixed

| Bug | Fix commit |
| --- | --- |
| Webhook URL passed through --set-file carried a trailing newline and Grafana rejected it as an invalid URL | 4edf1df — added trim to the resolved webhook URL |
| Loki datasource was assigned a random UID by Grafana, so the alert rule's datasourceUid: loki reference could not resolve | 4edf1df — pinned the Loki datasource UID to loki |
| {{ externalURL }} always carries a trailing slash, producing //path when concatenated with a leading-slash path | 5c6064a — dropped the leading slash from the annotation paths so the final URL has exactly one slash |
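
The trailing-slash fix in the last row amounts to writing annotation paths without a leading slash. A fragment of an alert rule in that style might read as follows; the query string is elided here exactly as it is in the commit message, and the surrounding rule fields are omitted.

```yaml
# Fragment of an alert rule (sketch; other rule fields omitted).
annotations:
  # externalURL is rendered with a trailing slash, so the path that
  # follows it starts without one. Writing "/explore" would yield "//explore".
  exploreURL: '{{ externalURL }}explore?...'
```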

🤖 Generated with Claude Code

…decar

Add file-based Grafana Unified Alerting provisioning to the MIF chart so that
installing MIF immediately wires up the Heimdall error-log alert pipeline that
was validated as a PoC in the p-cluster `mif` release.

The new `templates/grafana/alert-configmap.yaml` mirrors the existing
`dashboard-configmap.yaml`: it iterates over `files/alerts/*.yaml` and emits one
ConfigMap per file with the `grafana_alert` label, which the
`grafana-sc-alerts` sidecar picks up and mounts into
`/etc/grafana/provisioning/alerting/`. Cluster-specific values
(`__GRAFANA_URL__` for Slack deep links, `__RECEIVER__` for the contact point
name) are substituted from chart values via `replace` rather than `tpl`, so
that Grafana's own Go template syntax embedded in alert rules (e.g.
`{{ printf "%.180s" .message }}` inside LogQL) is preserved as raw text.

Contact points (Slack webhook URLs) are intentionally out of scope because the
webhook is a secret. Operators must create the contact point named by
`alerts.heimdall.receiver` separately via the Grafana UI or a Secret-backed
provisioning file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 07:41
seongsu-dev requested a review from a team as a code owner May 14, 2026 07:41

Copilot AI left a comment

Pull request overview

Adds Helm-managed file-based Grafana Unified Alerting provisioning to the MIF chart, mirroring the existing dashboard ConfigMap pattern so that Heimdall error-log alerts (LogQL → Loki → Slack) are installed automatically alongside the chart. Cluster-specific values are passed via __GRAFANA_URL__ / __RECEIVER__ placeholders that are replaced literally (rather than via tpl) so Grafana's own Go template syntax inside the alert YAML survives Helm rendering.

Changes:

  • New templates/grafana/alert-configmap.yaml that generates one ConfigMap per file under files/alerts/*.yaml, gated by prometheus-stack.grafana.sidecar.alerts.enabled and alerts.heimdall.enabled.
  • New alert content: heimdall-rules.yaml (error-log-burst rule), heimdall-templates.yaml (Slack message templates), and heimdall-policies.yaml (component=heimdall routing policy).
  • values.yaml, README.md, and AGENTS.md updated to expose the new alerts.heimdall.* keys, enable the alerts sidecar, and document the no-tpl constraint and out-of-scope contact-point boundary.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml | New template rendering one ConfigMap per alert file with placeholder substitution. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-rules.yaml | Error-log-burst alert rule with Explore/rule deep links using __GRAFANA_URL__. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-templates.yaml | Slack notification templates for Heimdall alerts. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-policies.yaml | Routing policy mapping component=heimdall to __RECEIVER__. |
| deploy/helm/moai-inference-framework/values.yaml | Enables prometheus-stack.grafana.sidecar.alerts and adds alerts.heimdall.{enabled,grafanaURL,receiver}. |
| deploy/helm/moai-inference-framework/README.md | Regenerated docs for the new values keys. |
| deploy/helm/AGENTS.md | Documents the alert-provisioning pattern and the tpl prohibition. |

Comment thread deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml Outdated
…fanaURL

Prevents double-slash URLs in Slack notification links when operators configure
`alerts.heimdall.grafanaURL` with a trailing slash (e.g.
`https://grafana.example.com/`). Without trimming, the alert rule annotations
would render `https://grafana.example.com//explore?...` and
`https://grafana.example.com//alerting/...`, which most browsers tolerate but
reverse proxies and OAuth redirect path matchers may reject.

Apply `trimSuffix "/"` to the value before substituting `__GRAFANA_URL__`, so
both `https://grafana.example.com` and `https://grafana.example.com/` produce
the same single-slash result. Also document the trimming behavior in the
values.yaml comment and add a "Placeholder conventions" subsection to
deploy/helm/AGENTS.md so authors of future alert files use the same pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

seongsu-dev and others added 2 commits May 19, 2026 12:40
…nfigMap

Add chart-managed provisioning for the Heimdall Slack contact point as a
ConfigMap labelled `grafana_alert=1`, picked up by the existing
`grafana-sc-alerts` sidecar alongside the rules / templates / policies.
Removes the need for operators to create the contact point through the
Grafana UI (which requires `grafana.persistence.enabled=true` to survive
pod restarts) or via a separate Secret-backed provisioning file.

The webhook URL is sourced either from `alerts.heimdall.slack.existingSecret`
(resolved through Helm `lookup`, following the same convention as the
sibling MongoDB and Redis Sentinel charts) or from `alerts.heimdall.slack.webhookUrl`
when no external Secret is referenced. With neither set, the contact-points
ConfigMap is skipped and Slack delivery is silently off — alert rules
still fire but do not route anywhere.

Other adjustments:

- Default `alerts.heimdall.enabled` to false; the chart cannot deliver
  Slack messages without a webhook URL, so the operator must opt in
  explicitly after providing one.
- Hardcode the receiver name to `heimdall-slack` inside both the policy
  routing file and the contact-points ConfigMap, and drop the now-unused
  `alerts.heimdall.receiver` value and `__RECEIVER__` placeholder.
- Trim duplication in `deploy/helm/AGENTS.md` Alert Provisioning section
  and document the new ConfigMap-only layout plus the `helm template`
  limitation when using `existingSecret`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ontact point

Replace the implicit `webhook-url` data key with an explicit
`alerts.heimdall.slack.secretKey` value (default `SLACK_WEBHOOK_URL`), and
rename `slack.webhookUrl` to `slack.secretValue` so operators can see both
the key and the value contract in the same shape used elsewhere (e.g.
mongodb `existingSecret` pattern in the heimdall-inference-scheduler repo).

`existingSecret` retains precedence over `secretValue`: when set, the chart
reads the URL from `existingSecret.data[secretKey]` via Helm `lookup`. When
`existingSecret` is empty the chart embeds `secretValue` directly into the
contact-points ConfigMap. With neither producing a URL the ConfigMap is
skipped and Slack delivery is silently off — alert rules still fire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 03:51

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

seongsu-dev and others added 2 commits May 19, 2026 14:24
…e UID

Two issues surfaced while validating the alert pipeline end-to-end on a
local kind cluster against a real Slack workspace.

1. Operators who pass the Slack webhook URL via `helm install --set-file`
   pick up a trailing newline from the source file, which Grafana then
   rejects when parsing the contact-points provisioning YAML:

       invalid URL "https://hooks.slack.com/services/.../...
"

   Trim leading/trailing whitespace from the resolved URL (both the
   `secretValue` path and the `existingSecret` `lookup` path) so the
   chart is robust regardless of how the URL is supplied.

2. The Heimdall alert rule hardcodes `datasourceUid: loki`, but the
   `datasource-loki.yaml` ConfigMap left `uid` unset, so Grafana assigned
   a random UID instead and rule evaluation failed with:

       failed to build query 'A': data source not found

   Pin the Loki datasource UID to `loki` so the alert rule can resolve
   it reliably.
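
A datasource ConfigMap with the UID pinned, as described in item 2, might look like the sketch below. The ConfigMap name and the Loki URL are hypothetical, and the `grafana_datasource` label is assumed to be the one the Grafana datasource sidecar watches in this chart.

```yaml
# Sketch only: name and url are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-loki-datasource
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |-
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        # Fixed UID so that `datasourceUid: loki` in the alert rule resolves.
        uid: loki
        access: proxy
        url: http://loki-gateway
```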

With both fixes, the kind cluster e2e flow succeeds: mock JSON error logs
→ Vector → Loki → LogQL rule evaluation → routing policy match
(`component=heimdall`) → Slack contact point → channel message delivered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-side grafanaURL

Drop the `alerts.heimdall.grafanaURL` chart value and the `__GRAFANA_URL__`
placeholder substitution in `alert-configmap.yaml`. The alert rule annotations
now reference `{{ externalURL }}` directly, which Grafana resolves at
alert-evaluation time from `server.root_url`. Operators only need to set
`prometheus-stack.grafana.grafana.ini.server.root_url` to the cluster's
public Grafana URL; every link surfaced in Slack messages (Grafana's own
title link, plus the Explore and rule-view links emitted by the
notification template) is then built from that single source.

The substituted `externalURL` always carries a trailing slash, so the path
fragments in `heimdall-rules.yaml` are now written without a leading slash
(`{{ externalURL }}explore?...` instead of `{{ externalURL }}/explore?...`)
to avoid `//path` in the final URL — verified end-to-end on a kind cluster
where the first Slack message rendered with `//explore` and the
post-fix message rendered with `/explore` against the same `root_url`.

`alert-configmap.yaml` is now a pure pass-through (`Files.Get | nindent`)
with no replace step, which also removes the chart-side `trimSuffix "/"`
that was previously needed to normalise the chart-provided URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 06:15

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

…nvention

Reshape `alerts.heimdall.slack` to follow the Bitnami values convention used
across `bitnami/postgresql`, `bitnami/redis`, and `bitnami/common._secrets`:

  slack:
    webhookUrl: ""                 # inline (intent in the name)
    existingSecret: ""             # externally-managed Secret reference
    secretKeys:
      webhookUrlKey: webhook-url   # data key inside `existingSecret`

This drops the generalised `secretKey`/`secretValue` names that obscured the
intent of the inline value, and matches the nested `secretKeys.<role>Key`
shape Bitnami uses even for single-value cases so future credentials can be
added without restructuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
seongsu-dev requested a review from hhk7734 May 19, 2026 07:12
Comment thread deploy/helm/moai-inference-framework/values.yaml
Collapse the multi-paragraph descriptions on `alerts.heimdall.enabled`,
`slack.webhookUrl`, `slack.existingSecret`, `slack.secretKeys`, and the
section header to 2-3 lines each, keeping only load-bearing facts
(required keys, precedence, dry-run caveat, Bitnami parity, secret
warning). README values table follows via helm-docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 08:32

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread deploy/helm/moai-inference-framework/values.yaml
Trim AGENTS.md from 223 to 70 lines by removing dated anecdotes,
collapsing multi-paragraph rules into single sentences, replacing
the responsibility-boundary 4-unit lists with one paragraph, and
shrinking the Reserved labels table by dropping descriptions.

All load-bearing rules are preserved: YAGNI, verification commands,
sub-chart enablement convention, naming/refs, MinIO pattern, alert
provisioning (Bitnami secret-reference shape), Odin preset
responsibility split, PD decode proxy debug flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hhk7734 previously approved these changes May 19, 2026
…ity split

The condensation in e3fa898 dropped the "Utils define" unit from the
responsibility split. Re-add a single-line Utils bullet noting the
offline HF cache env (HF_HOME / HF_HUB_OFFLINE / HF_MODULES_CACHE) lives
in *-hf-hub-offline templates and is shared by runtime bases and
presets, so new presets don't accidentally redefine or omit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 08:47

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

seongsu-dev merged commit f37fe5f into main May 19, 2026
4 checks passed
seongsu-dev deleted the MAF-19750-mif-alert-provisioning branch May 19, 2026 09:08

3 participants