MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via sidecar #130
Merged
…decar
Add file-based Grafana Unified Alerting provisioning to the MIF chart so that
installing MIF immediately wires up the Heimdall error-log alert pipeline that
was validated as a PoC in the p-cluster `mif` release.
The new `templates/grafana/alert-configmap.yaml` mirrors the existing
`dashboard-configmap.yaml`: it iterates over `files/alerts/*.yaml` and emits one
ConfigMap per file with the `grafana_alert` label, which the
`grafana-sc-alerts` sidecar picks up and mounts into
`/etc/grafana/provisioning/alerting/`. Cluster-specific values
(`__GRAFANA_URL__` for Slack deep links, `__RECEIVER__` for the contact point
name) are substituted from chart values via `replace` rather than `tpl`, so
that Grafana's own Go template syntax embedded in alert rules (e.g.
`{{ printf "%.180s" .message }}` inside LogQL) is preserved as raw text.
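The literal-substitution approach described above can be sketched as a Helm template. This is a minimal illustration, not the chart's actual file: the ConfigMap naming scheme and the exact value paths are assumptions, and only `replace`, `.Files.Glob`, `.Files.Get`, `base`, `trimSuffix`, and `nindent` (standard Helm/Sprig functions) are relied on.

```yaml
{{- /* Sketch only: resource naming and value paths are assumptions. */ -}}
{{- $sidecarEnabled := index .Values "prometheus-stack" "grafana" "sidecar" "alerts" "enabled" }}
{{- if and .Values.alerts.heimdall.enabled $sidecarEnabled }}
{{- range $path, $_ := .Files.Glob "files/alerts/*.yaml" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical naming scheme: <release>-alert-<file name without extension>
  name: {{ printf "%s-alert-%s" $.Release.Name (base $path | trimSuffix ".yaml") }}
  labels:
    grafana_alert: "1"   # label watched by the grafana-sc-alerts sidecar
data:
  {{ base $path }}: |-
    {{- $.Files.Get $path | replace "__GRAFANA_URL__" $.Values.alerts.heimdall.grafanaURL | replace "__RECEIVER__" $.Values.alerts.heimdall.receiver | nindent 4 }}
{{- end }}
{{- end }}
```

Because the file content passes only through `replace`, Helm never parses it as a template, so any `{{ ... }}` inside the alert YAML reaches Grafana unchanged. Running the same content through `tpl` would make Helm try to evaluate those expressions itself.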
Contact points (Slack webhook URLs) are intentionally out of scope because the
webhook is a secret. Operators must create the contact point named by
`alerts.heimdall.receiver` separately via the Grafana UI or a Secret-backed
provisioning file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds Helm-managed file-based Grafana Unified Alerting provisioning to the MIF chart, mirroring the existing dashboard ConfigMap pattern so that Heimdall error-log alerts (LogQL → Loki → Slack) are installed automatically alongside the chart. Cluster-specific values are passed via __GRAFANA_URL__ / __RECEIVER__ placeholders that are replaced literally (rather than via tpl) so Grafana's own Go template syntax inside the alert YAML survives Helm rendering.
Changes:
- New `templates/grafana/alert-configmap.yaml` that generates one ConfigMap per file under `files/alerts/*.yaml`, gated by `prometheus-stack.grafana.sidecar.alerts.enabled` and `alerts.heimdall.enabled`.
- New alert content: `heimdall-rules.yaml` (error-log-burst rule), `heimdall-templates.yaml` (Slack message templates), and `heimdall-policies.yaml` (`component=heimdall` routing policy).
- `values.yaml`, `README.md`, and `AGENTS.md` updated to expose the new `alerts.heimdall.*` keys, enable the alerts sidecar, and document the no-`tpl` constraint and out-of-scope contact-point boundary.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml | New template rendering one ConfigMap per alert file with placeholder substitution. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-rules.yaml | Error-log-burst alert rule with Explore/rule deep links using __GRAFANA_URL__. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-templates.yaml | Slack notification templates for Heimdall alerts. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-policies.yaml | Routing policy mapping component=heimdall to __RECEIVER__. |
| deploy/helm/moai-inference-framework/values.yaml | Enables prometheus-stack.grafana.sidecar.alerts and adds alerts.heimdall.{enabled,grafanaURL,receiver}. |
| deploy/helm/moai-inference-framework/README.md | Regenerated docs for the new values keys. |
| deploy/helm/AGENTS.md | Documents the alert-provisioning pattern and the tpl prohibition. |
…fanaURL
Prevents double-slash URLs in Slack notification links when operators
configure `alerts.heimdall.grafanaURL` with a trailing slash (e.g.
`https://grafana.example.com/`). Without trimming, the alert rule annotations
would render `https://grafana.example.com//explore?...` and
`https://grafana.example.com//alerting/...`, which most browsers tolerate but
reverse proxies and OAuth redirect path matchers may reject.
Apply `trimSuffix "/"` to the value before substituting `__GRAFANA_URL__`, so
both `https://grafana.example.com` and `https://grafana.example.com/` produce
the same single-slash result.
Also document the trimming behavior in the values.yaml comment and add a
"Placeholder conventions" subsection to deploy/helm/AGENTS.md so authors of
future alert files use the same pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
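A minimal sketch of the trimming step, assuming the `alerts.heimdall.grafanaURL` value and the `__GRAFANA_URL__` placeholder named in this commit. The ConfigMap name and the single hard-coded file are illustrative only.

```yaml
{{- /* Sketch only: normalise the URL once, then substitute it literally. */ -}}
{{- $grafanaURL := .Values.alerts.heimdall.grafanaURL | trimSuffix "/" }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-alert-heimdall-rules   # hypothetical name
  labels:
    grafana_alert: "1"
data:
  heimdall-rules.yaml: |-
    {{- .Files.Get "files/alerts/heimdall-rules.yaml" | replace "__GRAFANA_URL__" $grafanaURL | nindent 4 }}
```

Note that `trimSuffix "/"` removes a single trailing slash, which covers the case described here. A value ending in several slashes would still keep all but one of them.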
…nfigMap
Add chart-managed provisioning for the Heimdall Slack contact point as a
ConfigMap labelled `grafana_alert=1`, picked up by the existing
`grafana-sc-alerts` sidecar alongside the rules / templates / policies.
Removes the need for operators to create the contact point through the
Grafana UI (which requires `grafana.persistence.enabled=true` to survive pod
restarts) or via a separate Secret-backed provisioning file.
The webhook URL is sourced either from `alerts.heimdall.slack.existingSecret`
(resolved through Helm `lookup`, following the same convention as the sibling
MongoDB and Redis Sentinel charts) or from `alerts.heimdall.slack.webhookUrl`
when no external Secret is referenced. With neither set, the contact-points
ConfigMap is skipped and Slack delivery is silently off: alert rules still
fire but do not route anywhere.
Other adjustments:
- Default `alerts.heimdall.enabled` to false; the chart cannot deliver Slack
  messages without a webhook URL, so the operator must opt in explicitly
  after providing one.
- Hardcode the receiver name to `heimdall-slack` inside both the policy
  routing file and the contact-points ConfigMap, and drop the now-unused
  `alerts.heimdall.receiver` value and `__RECEIVER__` placeholder.
- Trim duplication in `deploy/helm/AGENTS.md` Alert Provisioning section and
  document the new ConfigMap-only layout plus the `helm template` limitation
  when using `existingSecret`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
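For orientation, a routing policy file with the hardcoded receiver might look roughly like the following. This is a sketch based on Grafana's file-provisioning format, not the contents of `heimdall-policies.yaml`: the use of `heimdall-slack` as the root receiver and the single child route are assumptions.

```yaml
# Sketch of a notification policy provisioning file (assumed layout).
apiVersion: 1
policies:
  - orgId: 1
    # Root receiver; assumed here to be the same Slack contact point.
    receiver: heimdall-slack
    routes:
      - receiver: heimdall-slack
        # 3-tuple matcher form: [label, operator, value]
        object_matchers:
          - ["component", "=", "heimdall"]
```

Since both this file and the contact-points ConfigMap use the literal name `heimdall-slack`, there is no value that could drift between the two.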
…ontact point
Replace the implicit `webhook-url` data key with an explicit
`alerts.heimdall.slack.secretKey` value (default `SLACK_WEBHOOK_URL`), and
rename `slack.webhookUrl` to `slack.secretValue` so operators can see both the
key and the value contract in the same shape used elsewhere (e.g. mongodb
`existingSecret` pattern in the heimdall-inference-scheduler repo).
`existingSecret` retains precedence over `secretValue`: when set, the chart
reads the URL from `existingSecret.data[secretKey]` via Helm `lookup`. When
`existingSecret` is empty the chart embeds `secretValue` directly into the
contact-points ConfigMap. With neither producing a URL the ConfigMap is
skipped and Slack delivery is silently off; alert rules still fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e UID
Two issues surfaced while validating the alert pipeline end-to-end on a
local kind cluster against a real Slack workspace.
1. Operators who pass the Slack webhook URL via `helm install --set-file`
   pick up a trailing newline from the source file, which Grafana then
   rejects when parsing the contact-points provisioning YAML:
       invalid URL "https://hooks.slack.com/services/.../...
       "
   Trim leading/trailing whitespace from the resolved URL (both the
   `secretValue` path and the `existingSecret` `lookup` path) so the
   chart is robust regardless of how the URL is supplied.
2. The Heimdall alert rule hardcodes `datasourceUid: loki`, but the
   `datasource-loki.yaml` ConfigMap left `uid` unset, so Grafana assigned
   a random UID instead and rule evaluation failed with:
       failed to build query 'A': data source not found
   Pin the Loki datasource UID to `loki` so the alert rule can resolve
   it reliably.
With both fixes, the kind cluster e2e flow succeeds: mock JSON error logs
→ Vector → Loki → LogQL rule evaluation → routing policy match
(`component=heimdall`) → Slack contact point → channel message delivered.
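The datasource fix can be illustrated with a provisioning ConfigMap along these lines. It is a sketch under stated assumptions: the ConfigMap name, the `grafana_datasource` sidecar label, and the Loki URL are placeholders, and only the pinned `uid: loki` comes from the commit message.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datasource-loki            # hypothetical name
  labels:
    grafana_datasource: "1"        # assumed datasource sidecar label
data:
  datasource-loki.yaml: |-
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        uid: loki                  # pinned; must match `datasourceUid: loki` in the alert rule
        access: proxy
        url: http://loki:3100      # hypothetical in-cluster address
```

With `uid` left out, Grafana generates one at provisioning time, so a rule that refers to a fixed UID has nothing to resolve against.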
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-side grafanaURL
Drop the `alerts.heimdall.grafanaURL` chart value and the `__GRAFANA_URL__`
placeholder substitution in `alert-configmap.yaml`. The alert rule annotations
now reference `{{ externalURL }}` directly, which Grafana resolves at
alert-evaluation time from `server.root_url`. Operators only need to set
`prometheus-stack.grafana.grafana.ini.server.root_url` to the cluster's
public Grafana URL; every link surfaced in Slack messages (Grafana's own
title link, plus the Explore and rule-view links emitted by the
notification template) is then built from that single source.
The substituted `externalURL` always carries a trailing slash, so the path
fragments in `heimdall-rules.yaml` are now written without a leading slash
(`{{ externalURL }}explore?...` instead of `{{ externalURL }}/explore?...`)
to avoid `//path` in the final URL — verified end-to-end on a kind cluster
where the first Slack message rendered with `//explore` and the
post-fix message rendered with `/explore` against the same `root_url`.
`alert-configmap.yaml` is now a pure pass-through (`Files.Get | nindent`)
with no replace step, which also removes the chart-side `trimSuffix "/"`
that was previously needed to normalise the chart-provided URL.
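A sketch of the resulting operator contract and annotation shape. The hostname is a placeholder, the query string is left elided as in the commit message, and `exploreURL` is used as the annotation name because the notification template refers to `.Annotations.exploreURL`. The two documents below stand for two different files.

```yaml
# values.yaml override: the single operator-side setting (placeholder hostname)
prometheus-stack:
  grafana:
    grafana.ini:
      server:
        root_url: https://grafana.example.com/
---
# Fragment of one rule entry in files/alerts/heimdall-rules.yaml, not a complete rule.
# No leading slash after the template call, because externalURL already ends in "/".
annotations:
  exploreURL: '{{ externalURL }}explore?...'
```

Under these assumptions the annotation would render as `https://grafana.example.com/explore?...`, whereas the earlier `{{ externalURL }}/explore?...` form would produce `//explore`.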
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nvention
Reshape `alerts.heimdall.slack` to follow the Bitnami values convention used
across `bitnami/postgresql`, `bitnami/redis`, and `bitnami/common._secrets`:
    slack:
      webhookUrl: ""                 # inline (intent in the name)
      existingSecret: ""             # externally-managed Secret reference
      secretKeys:
        webhookUrlKey: webhook-url   # data key inside `existingSecret`
This drops the generalised `secretKey`/`secretValue` names that obscured the
intent of the inline value, and matches the nested `secretKeys.<role>Key`
shape Bitnami uses even for single-value cases so future credentials can be
added without restructuring.
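Putting the value shape together with the precedence and trimming rules from the earlier commits, the contact-points template could be sketched as follows. This is an illustration rather than the chart's actual `heimdall-slack-configmap.yaml`: the data file name and the receiver `uid` are hypothetical, while `lookup`, `b64dec`, `trim`, `default`, and `quote` are standard Helm/Sprig functions.

```yaml
{{- /* Sketch only: file key and receiver uid are hypothetical. */ -}}
{{- $slack := .Values.alerts.heimdall.slack }}
{{- $url := $slack.webhookUrl | default "" }}
{{- if $slack.existingSecret }}
  {{- /* existingSecret takes precedence; the inline value is ignored. */ -}}
  {{- $url = "" }}
  {{- $secret := lookup "v1" "Secret" .Release.Namespace $slack.existingSecret }}
  {{- if $secret }}
    {{- $url = index $secret.data $slack.secretKeys.webhookUrlKey | default "" | b64dec }}
  {{- end }}
{{- end }}
{{- /* Strip stray whitespace, e.g. the trailing newline from --set-file. */ -}}
{{- $url = trim $url }}
{{- if and .Values.alerts.heimdall.enabled $url }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-alert-heimdall-slack-contact-points
  labels:
    grafana_alert: "1"
data:
  heimdall-contact-points.yaml: |-
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack
            type: slack
            settings:
              url: {{ $url | quote }}
{{- end }}
```

Under `helm template` without cluster access, `lookup` returns an empty result, so the `existingSecret` path yields an empty URL and the ConfigMap is skipped. Embedding the URL in a ConfigMap also means it is stored in plain text in the cluster, which is worth keeping in mind when choosing between the two inputs.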
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hhk7734
requested changes
May 19, 2026
Collapse the multi-paragraph descriptions on `alerts.heimdall.enabled`,
`slack.webhookUrl`, `slack.existingSecret`, `slack.secretKeys`, and the
section header to 2-3 lines each, keeping only load-bearing facts (required
keys, precedence, dry-run caveat, Bitnami parity, secret warning). README
values table follows via helm-docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim AGENTS.md from 223 to 70 lines by removing dated anecdotes, collapsing
multi-paragraph rules into single sentences, replacing the
responsibility-boundary 4-unit lists with one paragraph, and shrinking the
Reserved labels table by dropping descriptions. All load-bearing rules are
preserved: YAGNI, verification commands, sub-chart enablement convention,
naming/refs, MinIO pattern, alert provisioning (Bitnami secret-reference
shape), Odin preset responsibility split, PD decode proxy debug flag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hhk7734
previously approved these changes
May 19, 2026
…ity split
The condensation in e3fa898 dropped the "Utils define" unit from the
responsibility split. Re-add a single-line Utils bullet noting the offline HF
cache env (HF_HOME / HF_HUB_OFFLINE / HF_MODULES_CACHE) lives in
*-hf-hub-offline templates and is shared by runtime bases and presets, so new
presets don't accidentally redefine or omit it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hhk7734
approved these changes
May 19, 2026
Summary
- Alert provisioning files are rendered as ConfigMaps labelled `grafana_alert=1`. The `grafana-sc-alerts` sidecar mounts them into `/etc/grafana/provisioning/alerting/` and Grafana reloads automatically, mirroring the existing `templates/grafana/dashboard-configmap.yaml` pattern.
- Links in Slack messages are built from Grafana's `externalURL`, which is in turn driven by `prometheus-stack.grafana.grafana.ini.server.root_url`. Operators only need to set that single value to expose the cluster's public Grafana URL; the chart no longer carries a separate URL knob.
- The Slack webhook URL is supplied either via `alerts.heimdall.slack.existingSecret` + `slack.secretKeys.webhookUrlKey` (resolved via Helm `lookup`) or directly via the inline `alerts.heimdall.slack.webhookUrl`, following the Bitnami `secretKeys.<role>Key` secret-reference convention. The receiver name is hardcoded to `heimdall-slack` so the routing policy and the contact-points ConfigMap line up automatically.
Background
This is the productisation of the MAF-19750 PoC that was validated on the p-cluster `mif` release. Three ConfigMaps (rules / templates / policies) were originally applied by hand to deliver: Heimdall error logs → Loki → Grafana Unified Alerting → Slack. This PR moves those resources into the chart, generalises the PoC-specific bits (`namespace="seongsu"`, `owner: seongsu`, `environment: p-cluster` labels, hardcoded `grafana.product.moreh.dev` URLs), and adds the Slack contact point as a chart-managed resource as well so the chart is self-contained.
The LogQL selector `{app="heimdall-inference-scheduler", level="error"}` is chart-name based, so a single alert rule covers every Heimdall release in the cluster without per-release duplication.
Files
- `templates/grafana/alert-configmap.yaml`: new template, generates one ConfigMap per file under `files/alerts/*.yaml` using `Files.Get` as a pure pass-through (no chart-side string substitution).
- `templates/grafana/heimdall-slack-configmap.yaml`: new template, generates the Slack contact-points provisioning ConfigMap when `alerts.heimdall.enabled` is true and a webhook URL is available (via `existingSecret` + `secretKeys.webhookUrlKey`, or inline `webhookUrl`).
- `templates/grafana/datasource-loki.yaml`: pins the Loki datasource UID to `loki` so the alert rule's `datasourceUid: loki` reference can resolve reliably.
- `files/alerts/heimdall-rules.yaml`: alert rule. Annotations reference `{{ externalURL }}` directly (Grafana substitutes it at evaluation time from `server.root_url`).
- `files/alerts/heimdall-templates.yaml`: Slack notification template (`heimdall-slack.title`, `heimdall-slack.body`).
- `files/alerts/heimdall-policies.yaml`: routing policy, matches `component=heimdall` → `heimdall-slack` receiver.
- `values.yaml`: enables `prometheus-stack.grafana.sidecar.alerts` and adds the top-level `alerts.heimdall.{enabled, slack.{webhookUrl, existingSecret, secretKeys.webhookUrlKey}}` section.
- `deploy/helm/AGENTS.md`: adds an "Alert provisioning" rule capturing the ConfigMap-only layout, the operator contract around `server.root_url`, and conventions for adding new alert files (no `tpl`, secret values resolved via `lookup`). Existing rules condensed in the same pass.
- `deploy/helm/moai-inference-framework/README.md`: regenerated by `make helm-docs`.
Operator contract
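| Value | Notes |
|---|---|
| `alerts.heimdall.enabled` (default `false`) | Set to `true` to render the alert ConfigMaps; opt in after providing a webhook URL. |
| `prometheus-stack.grafana.grafana.ini.server.root_url` | Set to the cluster's public Grafana URL. If unset, Grafana falls back to `http://localhost:3000` and every link becomes unreachable. |
| `alerts.heimdall.slack.existingSecret` + `slack.secretKeys.webhookUrlKey` | Externally managed Secret and the data key holding the webhook URL. Takes precedence over `webhookUrl`. |
| `alerts.heimdall.slack.webhookUrl` | Inline webhook URL, used when `existingSecret` is empty. SECRET: pass via `--set-file` / sealed-secrets / SOPS / external secrets operator; never commit. |
Out of scope (follow-up)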
- `moreh-iac/SNUSHC/p-cluster/mif/mif.tf` should bump the chart version, set `prometheus-stack.grafana.grafana.ini.server.root_url=https://grafana.product.moreh.dev`, and either point `alerts.heimdall.slack.existingSecret` (+ `slack.secretKeys.webhookUrlKey`) at an externally managed Secret or pass `webhookUrl` from a CI secret. Once that lands, the PoC's hand-applied ConfigMaps can be removed.
- Additional alert categories can be added as `files/alerts/heimdall-<category>.yaml` files in follow-up PRs.
Test plan
- `helm lint deploy/helm/moai-inference-framework` passes.
- `helm template` with `--set alerts.heimdall.enabled=true --set-file alerts.heimdall.slack.webhookUrl=<webhook>` renders four ConfigMaps (`*-alert-heimdall-{rules,templates,policies,slack-contact-points}`), all labelled `grafana_alert: "1"`. Render also verified for the `existingSecret` + `secretKeys.webhookUrlKey` path (empty URL under `helm template` because `lookup` requires a cluster, ConfigMap correctly skipped).
- `kubectl apply --dry-run=client` on the rendered ConfigMaps reports `created (dry run)` for all four.
- Each ConfigMap's `data` parses as valid YAML and matches the Grafana provisioning schema (`apiVersion: 1`, `groups` / `templates` / `policies` / `contactPoints`; alert rule `uid` / `title` / `condition` / `data` / `noDataState` / `execErrState`; `object_matchers` 3-tuple form; receiver name hardcoded to `heimdall-slack`).
- Grafana Go template syntax (`{{ printf "%.180s" .message }}`, `{{ externalURL }}`, `{{ define "heimdall-slack.title" }}`, `{{ if .Annotations.exploreURL }}`) is preserved as raw text in the rendered ConfigMap; Helm does not evaluate it.
- `alerts.heimdall.enabled=false` and `prometheus-stack.grafana.sidecar.alerts.enabled=false` each gate the ConfigMaps off (0 rendered). With `enabled=true` but no webhook supplied, the contact-points ConfigMap is skipped and Slack delivery is silently off (alert rules still fire).
- The rendered Grafana Deployment includes the `grafana-sc-alerts` sidecar with `LABEL=grafana_alert`, `FOLDER=/etc/grafana/provisioning/alerting`, and the correct reload URL.
- `make helm-docs` regenerates `deploy/helm/moai-inference-framework/README.md` to include the new `alerts.heimdall.*` and `prometheus-stack.grafana.sidecar.alerts.enabled` keys.
- End-to-end validation on a local kind cluster against a real Slack workspace, with (a) inline `slack.webhookUrl` and (b) `slack.existingSecret` + `slack.secretKeys.webhookUrlKey` with a separately-applied Secret. Mock pod emitting JSON error logs → Vector → Loki → LogQL rule evaluation → routing policy match (`component=heimdall`) → Slack contact point → channel message delivered. All three links (Grafana-emitted title link, body "View error logs in Grafana", body "View alert rule") resolve to the `server.root_url` prefix.
Bugs found during validation and fixed
| Symptom | Fix |
|---|---|
| Webhook URL passed via `--set-file` carried a trailing newline and Grafana rejected it as an invalid URL | `4edf1df`: added `trim` to the resolved webhook URL |
| Loki datasource `uid` was unset, so the rule's `datasourceUid: loki` reference could not resolve | `4edf1df`: pinned the Loki datasource UID to `loki` |
| `{{ externalURL }}` always carries a trailing slash, producing `//path` when concatenated with a leading-slash path | `5c6064a`: dropped the leading slash from the annotation paths so the final URL has exactly one slash |
🤖 Generated with Claude Code