MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via sidecar #130
Merged
Changes from 6 commits
10 commits
c7568ad MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via si… (seongsu-dev)
89435b4 MAF-19750: fix(deploy): strip trailing slash from alerts.heimdall.gra… (seongsu-dev)
44b09db MAF-19750: feat(deploy): provision Heimdall Slack contact point as Co… (seongsu-dev)
9a98bac MAF-19750: refactor(deploy): expose secretKey/secretValue for slack c… (seongsu-dev)
4edf1df MAF-19750: fix(deploy): trim slack webhook URL and pin Loki datasourc… (seongsu-dev)
5c6064a MAF-19750: refactor(deploy): use Grafana externalURL instead of chart… (seongsu-dev)
748a5f8 MAF-19750: refactor(deploy): align slack secret block with Bitnami co… (seongsu-dev)
6f9defd MAF-19750: docs(deploy): tighten alerts.heimdall comments to 2-3 lines (seongsu-dev)
e3fa898 MAF-19750: docs(deploy): condense deploy/helm/AGENTS.md (seongsu-dev)
27719bb MAF-19750: docs(deploy): restore Utils row to Odin preset responsibil… (seongsu-dev)
17 changes: 17 additions & 0 deletions
deploy/helm/moai-inference-framework/files/alerts/heimdall-policies.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| apiVersion: 1 | ||
| policies: | ||
| - orgId: 1 | ||
| receiver: grafana-default-email | ||
| group_by: | ||
| - grafana_folder | ||
| - alertname | ||
| routes: | ||
| - receiver: heimdall-slack | ||
| object_matchers: | ||
| - - component | ||
| - "=" | ||
| - heimdall | ||
| group_wait: 30s | ||
| group_interval: 5m | ||
| repeat_interval: 4h | ||
| continue: false |
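
Reviewer sketch (not part of the PR): the policy tree above sends alerts labelled `component=heimdall` to `heimdall-slack` and everything else to the default receiver. The Python below is a simplified model of that matching, assuming only `=` and `!=` matchers and depth-first evaluation; the `Route` class and `resolve_receivers` function are hypothetical names, and Grafana's real router handles more cases (regex matchers, mute timings, grouping).

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    # Each matcher is (label, operator, value); only "=" and "!=" are modelled.
    matchers: list = field(default_factory=list)
    routes: list = field(default_factory=list)
    continue_: bool = False


def matches(route, labels):
    """Return True when every matcher on the route accepts the alert labels."""
    for name, op, value in route.matchers:
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True


def resolve_receivers(root, labels):
    """Walk child routes in order; fall back to the parent's receiver."""
    receivers = []
    for child in root.routes:
        if matches(child, labels):
            receivers.extend(resolve_receivers(child, labels))
            if not child.continue_:
                break  # mirrors `continue: false` in the policy file
    return receivers or [root.receiver]


# Mirrors heimdall-policies.yaml: one child route under the default receiver.
POLICY = Route(
    receiver="grafana-default-email",
    routes=[
        Route(receiver="heimdall-slack", matchers=[("component", "=", "heimdall")]),
    ],
)
```

With this model, an alert carrying `component: heimdall` resolves to `heimdall-slack` only, because the child route does not continue to later siblings.
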
79 changes: 79 additions & 0 deletions
deploy/helm/moai-inference-framework/files/alerts/heimdall-rules.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| apiVersion: 1 | ||
| groups: | ||
| - orgId: 1 | ||
| name: Heimdall Error Alerts | ||
| folder: Heimdall | ||
| interval: 1m | ||
| rules: | ||
| - uid: heimdall-error-log-burst | ||
| title: Heimdall Error Log Burst | ||
| condition: B | ||
| data: | ||
| # LogQL: group by instance/namespace and use label_format to extract the error message into an error_summary label. | ||
| # The resulting time series carries instance, namespace, and error_summary labels, | ||
| # which propagate to alert labels so Slack messages can render them. | ||
| - refId: A | ||
| relativeTimeRange: | ||
| from: 300 | ||
| to: 0 | ||
| datasourceUid: loki | ||
| model: | ||
| refId: A | ||
| datasource: | ||
| type: loki | ||
| uid: loki | ||
| expr: | | ||
| sum by (instance, namespace, error_summary) ( | ||
| count_over_time( | ||
| {app="heimdall-inference-scheduler", level="error"} | ||
| | json | ||
| | label_format error_summary=`{{ if .error }}{{ printf "%.180s" .message }}: {{ printf "%.180s" .error }}{{ else }}{{ printf "%.300s" .message }}{{ end }}` | ||
| [5m] | ||
| ) | ||
| ) | ||
| queryType: instant | ||
| intervalMs: 1000 | ||
| maxDataPoints: 43200 | ||
| - refId: B | ||
| datasourceUid: __expr__ | ||
| model: | ||
| refId: B | ||
| datasource: | ||
| type: __expr__ | ||
| uid: __expr__ | ||
| type: threshold | ||
| expression: A | ||
| conditions: | ||
| - evaluator: | ||
| type: gt | ||
| params: [0] | ||
| operator: | ||
| type: and | ||
| query: | ||
| params: [] | ||
| reducer: | ||
| type: last | ||
| params: [] | ||
| type: query | ||
| intervalMs: 1000 | ||
| maxDataPoints: 43200 | ||
| for: 1m | ||
| noDataState: OK | ||
| execErrState: Error | ||
| annotations: | ||
| summary: Heimdall error logs detected | ||
| description: '{{ $values.A.Value }} error log entries detected in the last 5 minutes.' | ||
| # Grafana Explore deep link — pre-filled LogQL filtered to Heimdall error logs. | ||
| # `{{ externalURL }}` is substituted by Grafana at alert-evaluation time | ||
| # from the configured `server.root_url`, so the link resolves to whichever | ||
| # Grafana instance is fronting the cluster (no chart-side override needed). | ||
| # Note: Grafana's substituted `externalURL` always carries a trailing slash, | ||
| # so the path here must NOT start with one to avoid `//path` in the final URL. | ||
| # Slack template references this annotation as a clickable link. | ||
| exploreURL: '{{ externalURL }}explore?schemaVersion=1&orgId=1&panes=%7B%22h1%22%3A%7B%22datasource%22%3A%22loki%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22loki%22%2C%22uid%22%3A%22loki%22%7D%2C%22expr%22%3A%22%7Bapp%3D%5C%22heimdall-inference-scheduler%5C%22%2Clevel%3D%5C%22error%5C%22%7D%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D' | ||
| # Direct link to this alert rule view (same `externalURL` source). | ||
| ruleURL: '{{ externalURL }}alerting/grafana/heimdall-error-log-burst/view' | ||
| labels: | ||
| severity: warning | ||
| component: heimdall | ||
| isPaused: false |
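
Reviewer sketch (not part of the PR): the `exploreURL` annotation is hard to read because the `panes` query parameter is URL-encoded JSON. The Python below shows one way such a link could be generated and decoded again. The function names and the `grafana.example.com` host are hypothetical; the JSON shape is copied from the annotation above. The sketch normalizes the base URL to exactly one trailing slash, which is the condition the rule's comment relies on when it omits the leading slash from the path.

```python
import json
from urllib.parse import quote, unquote


def build_explore_url(external_url: str, logql: str) -> str:
    """Build a Grafana Explore deep link with a pre-filled Loki query."""
    panes = {
        "h1": {
            "datasource": "loki",
            "queries": [
                {
                    "refId": "A",
                    "datasource": {"type": "loki", "uid": "loki"},
                    "expr": logql,
                }
            ],
            "range": {"from": "now-30m", "to": "now"},
        }
    }
    # Compact separators keep the encoded value free of spaces.
    encoded = quote(json.dumps(panes, separators=(",", ":")), safe="")
    # Guarantee a single trailing slash so the path never becomes `//explore`.
    base = external_url.rstrip("/") + "/"
    return f"{base}explore?schemaVersion=1&orgId=1&panes={encoded}"


def decode_panes(url: str) -> dict:
    """Recover the JSON document stored in the `panes` parameter."""
    encoded = url.split("panes=", 1)[1]
    return json.loads(unquote(encoded))


LOGQL = '{app="heimdall-inference-scheduler",level="error"}'
url = build_explore_url("https://grafana.example.com/", LOGQL)
```

Decoding the link with `decode_panes(url)` is a quick way to confirm that the embedded query and the `now-30m` range are what the Slack message is expected to open.
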
28 changes: 28 additions & 0 deletions
deploy/helm/moai-inference-framework/files/alerts/heimdall-templates.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| apiVersion: 1 | ||
| templates: | ||
| - orgId: 1 | ||
| name: heimdall-slack-templates | ||
| template: | | ||
| {{ define "heimdall-slack.title" }} | ||
| [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} | ||
| {{ end }} | ||
| | ||
| {{ define "heimdall-slack.body" }} | ||
| {{ range .Alerts }} | ||
| :pushpin: *Summary*: {{ .Annotations.summary }} | ||
| :memo: *Description*: {{ .Annotations.description }} | ||
| | ||
| {{ if .Labels.error_summary }}:warning: *Error message*: | ||
| ```{{ .Labels.error_summary }}```{{ end }} | ||
| | ||
| *Context* | ||
| {{ if .Labels.namespace }}• Namespace: `{{ .Labels.namespace }}`{{ end }} | ||
| {{ if .Labels.instance }}• Instance: `{{ .Labels.instance }}`{{ end }} | ||
| • Severity: `{{ .Labels.severity }}` | ||
| • Component: `{{ .Labels.component }}` | ||
| {{ if .Labels.environment }}• Environment: `{{ .Labels.environment }}`{{ end }} | ||
| | ||
| {{ if .Annotations.exploreURL }}:mag: <{{ .Annotations.exploreURL }}|View error logs in Grafana>{{ end }} | ||
| {{ if .Annotations.ruleURL }}:bell: <{{ .Annotations.ruleURL }}|View alert rule>{{ end }} | ||
| {{ end }} | ||
| {{ end }} |
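
Reviewer sketch (not part of the PR): to make the expected Slack output easier to reason about, the Python below approximates what the `heimdall-slack.title` template and the *Context* bullets produce. It is an illustration only: `render_title` and `render_context` are hypothetical helpers, and the real Go template also emits surrounding whitespace that is not modelled here.

```python
def render_title(status: str, firing_count: int, alertname: str) -> str:
    """Approximate the title: [STATUS] or [FIRING:<n>] followed by the alert name."""
    prefix = status.upper()
    if status == "firing":
        prefix += f":{firing_count}"
    return f"[{prefix}] {alertname}"


def render_context(labels: dict) -> list:
    """Approximate the Context bullets; optional labels are skipped when absent."""
    lines = []
    if labels.get("namespace"):
        lines.append(f"• Namespace: `{labels['namespace']}`")
    if labels.get("instance"):
        lines.append(f"• Instance: `{labels['instance']}`")
    # Severity and component are rendered unconditionally in the template.
    lines.append(f"• Severity: `{labels.get('severity', '')}`")
    lines.append(f"• Component: `{labels.get('component', '')}`")
    if labels.get("environment"):
        lines.append(f"• Environment: `{labels['environment']}`")
    return lines
```

For example, `render_title("firing", 2, "Heimdall Error Log Burst")` gives `[FIRING:2] Heimdall Error Log Burst`, while a resolved notification drops the count.
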
25 changes: 25 additions & 0 deletions
deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| {{- $ps := index .Values "prometheus-stack" }} | ||
| {{- if and $ps.enabled $ps.grafana.enabled $ps.grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }} | ||
| {{- $files := .Files.Glob "files/alerts/*.yaml" }} | ||
| {{- if $files }} | ||
| {{- range $path, $_ := $files }} | ||
| {{- $alertName := base $path | trimSuffix ".yaml" }} | ||
| --- | ||
| apiVersion: v1 | ||
| kind: ConfigMap | ||
| metadata: | ||
| namespace: {{ include "common.names.namespace" $ }} | ||
| name: {{ include "common.names.name" $ }}-alert-{{ $alertName }} | ||
| annotations: | ||
| {{- with $ps.grafana.sidecar.alerts.annotations }} | ||
| {{- toYaml . | nindent 4 }} | ||
| {{- end }} | ||
| labels: | ||
| {{ tpl $ps.grafana.sidecar.alerts.label $ }}: {{ ((tpl $ps.grafana.sidecar.alerts.labelValue $) | default 1) | quote }} | ||
| {{- include "mif.labels" $ | nindent 4 }} | ||
| data: | ||
| {{ base $path }}: |- | ||
| {{- $.Files.Get $path | nindent 4 }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{- end }} |
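
Reviewer sketch (not part of the PR): the template above emits one ConfigMap per file under `files/alerts/`, named after the file. The Python below mirrors that naming and the sidecar label default, assuming a chart name of `mif` (hypothetical; the real value comes from `common.names.name`) and the label key `grafana_alert` (also an assumption; the real key comes from `prometheus-stack.grafana.sidecar.alerts.label`).

```python
import posixpath


def alert_configmap_name(chart_name: str, path: str) -> str:
    """Mirror `base $path | trimSuffix ".yaml"` and the ConfigMap name pattern."""
    file_name = posixpath.basename(path)
    suffix = ".yaml"
    alert_name = file_name[: -len(suffix)] if file_name.endswith(suffix) else file_name
    return f"{chart_name}-alert-{alert_name}"


def sidecar_label(label: str, label_value: str) -> dict:
    """Mirror `labelValue | default 1 | quote`: an empty value falls back to "1"."""
    return {label: label_value or "1"}


ALERT_FILES = [
    "files/alerts/heimdall-policies.yaml",
    "files/alerts/heimdall-rules.yaml",
    "files/alerts/heimdall-templates.yaml",
]
names = [alert_configmap_name("mif", p) for p in ALERT_FILES]
```

Under those assumptions, the three alert files map to `mif-alert-heimdall-policies`, `mif-alert-heimdall-rules`, and `mif-alert-heimdall-templates`.
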
56 changes: 56 additions & 0 deletions
deploy/helm/moai-inference-framework/templates/grafana/heimdall-slack-configmap.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| {{- $ps := index .Values "prometheus-stack" }} | ||
| {{- if and $ps.enabled $ps.grafana.enabled $ps.grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }} | ||
| {{- $slack := .Values.alerts.heimdall.slack }} | ||
| {{- $webhookUrl := "" }} | ||
| {{- if $slack.existingSecret }} | ||
| {{- /* | ||
| Resolve the URL from an externally-managed Secret at key `secretKey`. | ||
| `lookup` returns an empty map during `helm template` and `helm install --dry-run` | ||
| (no cluster access), so the rendered URL is empty there — the ConfigMap | ||
| is then skipped and alerts will not deliver until applied against a real | ||
| cluster. Takes precedence over `secretValue`. | ||
| */ -}} | ||
| {{- $existing := lookup "v1" "Secret" (include "common.names.namespace" .) $slack.existingSecret }} | ||
| {{- if and $existing $existing.data (index $existing.data $slack.secretKey) }} | ||
| {{- $webhookUrl = index $existing.data $slack.secretKey | b64dec }} | ||
| {{- end }} | ||
| {{- else }} | ||
| {{- $webhookUrl = $slack.secretValue }} | ||
| {{- end }} | ||
| {{- /* | ||
| Trim surrounding whitespace, including trailing newlines that creep in | ||
| when operators load the URL with `--set-file` or from a Secret whose | ||
| data was stored from a file. Grafana's contact-point provisioning | ||
| rejects the URL otherwise (treats `https://...\n` as an invalid URL). | ||
| */ -}} | ||
| {{- $webhookUrl = trim $webhookUrl }} | ||
| {{- if $webhookUrl }} | ||
| --- | ||
| apiVersion: v1 | ||
| kind: ConfigMap | ||
| metadata: | ||
| namespace: {{ include "common.names.namespace" . }} | ||
| name: {{ include "common.names.name" . }}-alert-heimdall-slack-contact-points | ||
| annotations: | ||
| {{- with $ps.grafana.sidecar.alerts.annotations }} | ||
| {{- toYaml . | nindent 4 }} | ||
| {{- end }} | ||
| labels: | ||
| {{ tpl $ps.grafana.sidecar.alerts.label . }}: {{ ((tpl $ps.grafana.sidecar.alerts.labelValue .) | default 1) | quote }} | ||
| {{- include "mif.labels" . | nindent 4 }} | ||
| data: | ||
| heimdall-slack-contact-points.yaml: | | ||
| apiVersion: 1 | ||
| contactPoints: | ||
| - orgId: 1 | ||
| name: heimdall-slack | ||
| receivers: | ||
| - uid: heimdall-slack | ||
| type: slack | ||
| disableResolveMessage: false | ||
| settings: | ||
| url: {{ $webhookUrl | quote }} | ||
| title: '{{`{{ template "heimdall-slack.title" . }}`}}' | ||
| text: '{{`{{ template "heimdall-slack.body" . }}`}}' | ||
| {{- end }} | ||
| {{- end }} | ||
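
Reviewer sketch (not part of the PR): the webhook resolution in this template has three behaviours worth checking together: `existingSecret` takes precedence over `secretValue`, the Secret value is base64-decoded, and surrounding whitespace is trimmed before the emptiness check that decides whether the ConfigMap is rendered. The Python below models that logic. `resolve_webhook_url` is a hypothetical name, `secret_data` stands in for the `data` map that `lookup` would return, and the URL is a placeholder.

```python
import base64
from typing import Optional


def resolve_webhook_url(slack: dict, secret_data: Optional[dict]) -> str:
    """Return the trimmed webhook URL, or "" when the ConfigMap should be skipped."""
    url = ""
    if slack.get("existingSecret"):
        # An unresolved Secret (for example during a client-side dry run)
        # leaves the URL empty; secretValue is NOT used as a fallback.
        encoded = (secret_data or {}).get(slack.get("secretKey", ""))
        if encoded:
            url = base64.b64decode(encoded).decode("utf-8")
    else:
        url = slack.get("secretValue") or ""
    # Equivalent in spirit to Sprig's `trim`: drop leading/trailing whitespace.
    return url.strip()


PLACEHOLDER = "https://hooks.example.invalid/services/placeholder"
```

A Secret created from a file usually ends with a newline, so the trimmed result is the value that should be compared against the expected URL.
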