Merged
Changes from 4 commits
16 changes: 16 additions & 0 deletions deploy/helm/AGENTS.md
@@ -130,6 +130,22 @@ Rules specific to the `deploy/helm/` directory. General contribution guidelines

- **Do not use YAML anchors at the root level of `values.yaml`** (e.g., `_defaults: &defaults`). Helm treats unknown root-level keys as invalid and may emit warnings or errors. Instead, duplicate shared configuration explicitly for each component.

### Alert Provisioning

The chart provisions Grafana Unified Alerting entirely through ConfigMaps labelled `grafana_alert=1`. The `grafana-sc-alerts` sidecar writes them into `/etc/grafana/provisioning/alerting/` and Grafana reloads automatically. Resources fall into two groups:

- **Rules / templates / policies** — one ConfigMap per file under `moai-inference-framework/files/alerts/*.yaml`, generated by `alert-configmap.yaml` (mirroring the `files/dashboards/*.json` pattern). The receiver name is hardcoded to `heimdall-slack` and the routing policy is wired to it automatically.
- **Heimdall Slack contact point** — a separate ConfigMap rendered by `heimdall-slack-configmap.yaml` when `alerts.heimdall.enabled` is true and a webhook URL is available. The chart resolves the URL from `alerts.heimdall.slack.existingSecret` (an externally-managed Secret whose data key is named by `alerts.heimdall.slack.secretKey`, looked up via Helm `lookup`); when `existingSecret` is empty it falls back to `alerts.heimdall.slack.secretValue`. When neither produces a URL the ConfigMap is skipped and Slack delivery is silently off.

`alerts.heimdall.enabled` defaults to `false` because a webhook URL must be supplied for delivery to work. `helm template` and `helm install --dry-run` cannot reach the cluster, so `existingSecret` resolves to an empty URL there — verify against a real cluster (`helm install` / `helm upgrade`). Cluster-specific links use `alerts.heimdall.grafanaURL` (trailing slash auto-trimmed).
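
As a minimal sketch of the `existingSecret` path described above, the following values override enables Heimdall alerts and points the chart at a pre-created Secret. The Secret name `heimdall-slack-webhook` and the Grafana host are assumptions for illustration; only the value keys come from the chart.

```yaml
# Hypothetical override file, e.g. passed with `-f` to `helm upgrade --install`.
alerts:
  heimdall:
    enabled: true
    # Trailing slash is optional; the chart trims it before substitution.
    grafanaURL: "https://grafana.example.com/"
    slack:
      # Assumed name of a Secret that already exists in the namespace the chart renders into.
      existingSecret: heimdall-slack-webhook
      # Data key inside that Secret (chart default shown).
      secretKey: SLACK_WEBHOOK_URL
```

Because the URL is resolved with `lookup`, the Secret has to exist before `helm install` / `helm upgrade` runs; if it is missing, the contact-point ConfigMap is skipped.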

**Conventions when adding new alert files**:

- Read files as raw bytes via `Files.Get` and substitute cluster-specific values with `replace`. Do **not** wrap the result with `tpl`: alert rule YAML embeds Grafana's own Go template syntax (e.g. `{{ printf "%.180s" .message }}`), which `tpl` would try to evaluate at render time, either failing the render or silently mangling the expression.
- Use `__UPPER_SNAKE__` placeholders so they cannot collide with Grafana's `{{ ... }}` syntax. Add a matching `replace` call to `alert-configmap.yaml` and verify with `helm template ... | grep -E '__[A-Z][A-Z0-9_]*__'` that no tokens survive rendering (a plain `grep '__'` would also match the legitimate `__expr__` datasource UID).
- When concatenating a placeholder with a path segment, write the path with a leading slash (e.g. `__GRAFANA_URL__/explore?...`); the template strips trailing slashes from `grafanaURL` before substitution.
- For values stored in a Secret, resolve them via Helm `lookup` at install/upgrade time and write the plaintext into the rendered ConfigMap, following the Heimdall Slack contact point pattern. The value is then readable by anyone who can read that ConfigMap.
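
The placeholder convention above could be extended roughly as follows. This is a sketch only: `__CLUSTER_NAME__` and `.Values.alerts.heimdall.clusterName` are hypothetical names that do not exist in the chart, and only the `__GRAFANA_URL__` substitution is taken from the current `alert-configmap.yaml`.

```yaml
# Sketch of the data section in alert-configmap.yaml with one extra placeholder.
# `__CLUSTER_NAME__` and `.Values.alerts.heimdall.clusterName` are hypothetical.
data:
  {{ base $path }}: |-
    {{- $.Files.Get $path | replace "__GRAFANA_URL__" (trimSuffix "/" ($.Values.alerts.heimdall.grafanaURL | default "")) | replace "__CLUSTER_NAME__" ($.Values.alerts.heimdall.clusterName | default "") | nindent 4 }}
```

An alert file would then reference the token as plain text, for example `cluster: __CLUSTER_NAME__` under `labels`, and the file is still never passed through `tpl`.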

## Odin Presets (`moai-inference-preset`)

An Odin preset is a pair of Odin `InferenceServiceTemplate` resources — a **base template** (runtime base) and a **preset-specific template** — that together define how to deploy a Moreh vLLM pod. The base template defines how vLLM servers are launched and is shared across presets. The preset-specific template adds model-specific arguments, environment variables, resource requests, and disaggregation settings.
6 changes: 6 additions & 0 deletions deploy/helm/moai-inference-framework/README.md
@@ -33,6 +33,11 @@ Moreh Inference Framework

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| alerts.heimdall.enabled | bool | `false` | Enable provisioning of Heimdall alert resources. Disabled by default because a Slack webhook URL must be supplied (via `slack.existingSecret` or `slack.secretValue`) for alerts to actually deliver; flip to true once the webhook URL is in place. |
| alerts.heimdall.grafanaURL | string | `""` | Base URL of this cluster's Grafana, used for clickable links in Slack messages (Explore + alert rule view). A trailing slash is allowed and is stripped before substitution, so both `https://grafana.example.com` and `https://grafana.example.com/` are accepted. Leave empty to disable the link prefix; resulting links will be relative paths. |
| alerts.heimdall.slack.existingSecret | string | `""` | Externally-managed Secret holding the webhook URL at key `secretKey`. When set, the chart resolves the URL by `lookup` at install/upgrade time and embeds it into the contact-points ConfigMap. Takes precedence over `secretValue`. Note: `helm template` and `helm install --dry-run` cannot read cluster state, so the URL resolves to empty and the contact-points ConfigMap is not rendered by those commands. |
| alerts.heimdall.slack.secretKey | string | `"SLACK_WEBHOOK_URL"` | Name of the key inside `existingSecret` that holds the webhook URL. Not used when the URL is supplied via `secretValue`. Default mirrors the common env-var convention to make the contract obvious. |
| alerts.heimdall.slack.secretValue | string | `""` | Slack incoming webhook URL value. Used only when `existingSecret` is empty. The chart writes this value verbatim into the contact-points ConfigMap labelled `grafana_alert=1`; required for Slack delivery. SECRET — pass via `--set`, `--set-file`, sealed-secrets, SOPS, or an external secrets operator; never commit to git. |
| commonLabels | object | `{}` | Labels applied to all resources. |
| ecrTokenRefresher.aws.accessKeyId | string | `""` | AWS_ACCESS_KEY_ID |
| ecrTokenRefresher.aws.region | string | `"ap-northeast-2"` | AWS Region. |
@@ -131,6 +136,7 @@ Moreh Inference Framework
| prometheus-stack.defaultRules.create | bool | `false` | |
| prometheus-stack.enabled | bool | `true` | Enable prometheus-community/kube-prometheus-stack. Set to false if already deployed. |
| prometheus-stack.grafana.enabled | bool | `true` | |
| prometheus-stack.grafana.sidecar.alerts.enabled | bool | `true` | |
| prometheus-stack.grafana.sidecar.dashboards.enabled | bool | `true` | |
| prometheus-stack.kubeApiServer.enabled | bool | `false` | |
| prometheus-stack.kubeControllerManager.enabled | bool | `false` | |
@@ -0,0 +1,17 @@
apiVersion: 1
policies:
  - orgId: 1
    receiver: grafana-default-email
    group_by:
      - grafana_folder
      - alertname
    routes:
      - receiver: heimdall-slack
        object_matchers:
          - - component
            - "="
            - heimdall
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        continue: false
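
The route above only matches alerts that carry `component=heimdall`. As an illustration, a rule added under `files/alerts/` would need a `labels` block along these lines to be delivered to `heimdall-slack`; the `severity` value is an arbitrary example.

```yaml
# Excerpt of a single rule; all other rule fields are omitted.
labels:
  severity: critical    # example value, not required by the route
  component: heimdall   # matched by `object_matchers` in the routing policy
```

Alerts without this label stay with the root receiver, `grafana-default-email`.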
@@ -0,0 +1,74 @@
apiVersion: 1
groups:
  - orgId: 1
    name: Heimdall Error Alerts
    folder: Heimdall
    interval: 1m
    rules:
      - uid: heimdall-error-log-burst
        title: Heimdall Error Log Burst
        condition: B
        data:
          # LogQL: group by instance/namespace and copy the error message into an
          # `error_summary` label via `label_format`. The resulting time series carries
          # instance, namespace, and error_summary labels, which propagate to alert
          # labels so Slack messages can render them.
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: loki
            model:
              refId: A
              datasource:
                type: loki
                uid: loki
              expr: |
                sum by (instance, namespace, error_summary) (
                  count_over_time(
                    {app="heimdall-inference-scheduler", level="error"}
                      | json
                      | label_format error_summary=`{{ if .error }}{{ printf "%.180s" .message }}: {{ printf "%.180s" .error }}{{ else }}{{ printf "%.300s" .message }}{{ end }}`
                    [5m]
                  )
                )
              queryType: instant
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              datasource:
                type: __expr__
                uid: __expr__
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0]
                  operator:
                    type: and
                  query:
                    params: []
                  reducer:
                    type: last
                    params: []
                  type: query
              intervalMs: 1000
              maxDataPoints: 43200
        for: 1m
        noDataState: OK
        execErrState: Error
        annotations:
          summary: Heimdall error logs detected
          description: '{{ $values.A.Value }} error log entries detected in the last 5 minutes.'
          # Grafana Explore deep link: pre-filled LogQL filtered to Heimdall error logs.
          # Slack template references this annotation as a clickable link.
          exploreURL: '__GRAFANA_URL__/explore?schemaVersion=1&orgId=1&panes=%7B%22h1%22%3A%7B%22datasource%22%3A%22loki%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22loki%22%2C%22uid%22%3A%22loki%22%7D%2C%22expr%22%3A%22%7Bapp%3D%5C%22heimdall-inference-scheduler%5C%22%2Clevel%3D%5C%22error%5C%22%7D%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D'
          # Direct link to this alert rule view.
          ruleURL: '__GRAFANA_URL__/alerting/grafana/heimdall-error-log-burst/view'
        labels:
          severity: warning
          component: heimdall
        isPaused: false
@@ -0,0 +1,28 @@
apiVersion: 1
templates:
  - orgId: 1
    name: heimdall-slack-templates
    template: |
      {{ define "heimdall-slack.title" }}
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
      {{ end }}

      {{ define "heimdall-slack.body" }}
      {{ range .Alerts }}
      :pushpin: *Summary*: {{ .Annotations.summary }}
      :memo: *Description*: {{ .Annotations.description }}

      {{ if .Labels.error_summary }}:warning: *Error message*:
      ```{{ .Labels.error_summary }}```{{ end }}

      *Context*
      {{ if .Labels.namespace }}• Namespace: `{{ .Labels.namespace }}`{{ end }}
      {{ if .Labels.instance }}• Instance: `{{ .Labels.instance }}`{{ end }}
      • Severity: `{{ .Labels.severity }}`
      • Component: `{{ .Labels.component }}`
      {{ if .Labels.environment }}• Environment: `{{ .Labels.environment }}`{{ end }}

      {{ if .Annotations.exploreURL }}:mag: <{{ .Annotations.exploreURL }}|View error logs in Grafana>{{ end }}
      {{ if .Annotations.ruleURL }}:bell: <{{ .Annotations.ruleURL }}|View alert rule>{{ end }}
      {{ end }}
      {{ end }}
@@ -0,0 +1,25 @@
{{- $ps := index .Values "prometheus-stack" }}
{{- if and $ps.enabled $ps.grafana.enabled $ps.grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }}
{{- $files := .Files.Glob "files/alerts/*.yaml" }}
{{- if $files }}
{{- range $path, $_ := $files }}
{{- $alertName := base $path | trimSuffix ".yaml" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: {{ include "common.names.namespace" $ }}
  name: {{ include "common.names.name" $ }}-alert-{{ $alertName }}
  annotations:
    {{- with $ps.grafana.sidecar.alerts.annotations }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
  labels:
    {{ tpl $ps.grafana.sidecar.alerts.label $ }}: {{ ((tpl $ps.grafana.sidecar.alerts.labelValue $) | default 1) | quote }}
    {{- include "mif.labels" $ | nindent 4 }}
data:
  {{ base $path }}: |-
    {{- $.Files.Get $path | replace "__GRAFANA_URL__" (trimSuffix "/" ($.Values.alerts.heimdall.grafanaURL | default "")) | nindent 4 }}
{{- end }}
{{- end }}
{{- end }}
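
To make the trailing-slash handling in the `replace` call concrete, this is roughly what the `ruleURL` annotation looks like after rendering. The host `https://grafana.example.com/` is an assumed example value for `alerts.heimdall.grafanaURL`.

```yaml
# Source line in the alert file:
#   ruleURL: '__GRAFANA_URL__/alerting/grafana/heimdall-error-log-burst/view'
# Rendered with grafanaURL: "https://grafana.example.com/" (slash trimmed first):
ruleURL: 'https://grafana.example.com/alerting/grafana/heimdall-error-log-burst/view'
# Rendered with grafanaURL: "" (the default), the link becomes a relative path:
# ruleURL: '/alerting/grafana/heimdall-error-log-burst/view'
```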
@@ -0,0 +1,49 @@
{{- $ps := index .Values "prometheus-stack" }}
{{- if and $ps.enabled $ps.grafana.enabled $ps.grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }}
{{- $slack := .Values.alerts.heimdall.slack }}
{{- $webhookUrl := "" }}
{{- if $slack.existingSecret }}
{{- /*
Resolve the URL from an externally-managed Secret at key `secretKey`.
`lookup` returns an empty result during `helm template` and
`helm install --dry-run` (no cluster access), so the rendered URL is empty
there; the ConfigMap is then skipped and alerts will not deliver until
applied against a real cluster. Takes precedence over `secretValue`.
*/ -}}
{{- $existing := lookup "v1" "Secret" (include "common.names.namespace" .) $slack.existingSecret }}
{{- if and $existing $existing.data (index $existing.data $slack.secretKey) }}
{{- $webhookUrl = index $existing.data $slack.secretKey | b64dec }}
{{- end }}
{{- else }}
{{- $webhookUrl = $slack.secretValue }}
{{- end }}
{{- if $webhookUrl }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: {{ include "common.names.namespace" . }}
  name: {{ include "common.names.name" . }}-alert-heimdall-slack-contact-points
  annotations:
    {{- with $ps.grafana.sidecar.alerts.annotations }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
  labels:
    {{ tpl $ps.grafana.sidecar.alerts.label . }}: {{ ((tpl $ps.grafana.sidecar.alerts.labelValue .) | default 1) | quote }}
    {{- include "mif.labels" . | nindent 4 }}
data:
  heimdall-slack-contact-points.yaml: |
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack
            type: slack
            disableResolveMessage: false
            settings:
              url: {{ $webhookUrl | quote }}
              title: '{{`{{ template "heimdall-slack.title" . }}`}}'
              text: '{{`{{ template "heimdall-slack.body" . }}`}}'
{{- end }}
{{- end }}
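
For orientation, an abbreviated sketch of what this template might render to when a webhook URL is available. The namespace, the `mif` name prefix, and the webhook URL are assumed placeholder values; annotations and the labels produced by the `mif.labels` helper are left out. The backtick-quoted expressions come out as literal Grafana template calls.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: mif   # assumed namespace
  name: mif-alert-heimdall-slack-contact-points   # assumes common.names.name renders "mif"
  labels:
    grafana_alert: "1"
data:
  heimdall-slack-contact-points.yaml: |
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack
            type: slack
            disableResolveMessage: false
            settings:
              url: "https://hooks.slack.com/services/EXAMPLE"
              title: '{{ template "heimdall-slack.title" . }}'
              text: '{{ template "heimdall-slack.body" . }}'
```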
43 changes: 43 additions & 0 deletions deploy/helm/moai-inference-framework/values.yaml
@@ -30,6 +30,8 @@ prometheus-stack:
    sidecar:
      dashboards:
        enabled: true
      alerts:
        enabled: true
  kubernetesServiceMonitors:
    enabled: true
  kubeApiServer:
@@ -59,6 +61,47 @@ prometheus-stack:
  thanosRuler:
    enabled: false

# Heimdall alert provisioning consumed by the grafana-sc-alerts sidecar.
# All resources (alert rules, notification templates, routing policies, and
# the `heimdall-slack` contact point) are delivered as ConfigMaps labelled
# `grafana_alert=1`. The receiver name is fixed to `heimdall-slack` and the
# routing policy is wired to it automatically.
alerts:
  heimdall:
    # -- Enable provisioning of Heimdall alert resources. Disabled by default
    # because a Slack webhook URL must be supplied (via `slack.existingSecret`
    # or `slack.secretValue`) for alerts to actually deliver; flip to true
    # once the webhook URL is in place.
    enabled: false

    # -- Base URL of this cluster's Grafana, used for clickable links in Slack
    # messages (Explore + alert rule view). A trailing slash is allowed and is
    # stripped before substitution, so both `https://grafana.example.com` and
    # `https://grafana.example.com/` are accepted. Leave empty to disable the
    # link prefix; resulting links will be relative paths.
    grafanaURL: ""

    slack:
      # -- Externally-managed Secret holding the webhook URL at key
      # `secretKey`. When set, the chart resolves the URL by `lookup` at
      # install/upgrade time and embeds it into the contact-points
      # ConfigMap. Takes precedence over `secretValue`. Note: `helm template`
      # and `helm install --dry-run` cannot read cluster state, so the URL
      # resolves to empty and the contact-points ConfigMap is not rendered
      # by those commands.
      existingSecret: ""
      # -- Name of the key inside `existingSecret` that holds the webhook
      # URL. Not used when the URL is supplied via `secretValue`. Default
      # mirrors the common env-var convention to make the contract obvious.
      secretKey: SLACK_WEBHOOK_URL
      # -- Slack incoming webhook URL value. Used only when `existingSecret`
      # is empty. The chart writes this value verbatim into the contact-points
      # ConfigMap labelled `grafana_alert=1`; required for Slack delivery.
      # SECRET: pass via `--set`, `--set-file`, sealed-secrets, SOPS, or
      # an external secrets operator; never commit to git.
      secretValue: ""

lws:
  # -- Enable kubernetes-sigs/lws. Set to false if already deployed.
  enabled: true