10 changes: 10 additions & 0 deletions deploy/helm/AGENTS.md
@@ -130,6 +130,16 @@ Rules specific to the `deploy/helm/` directory. General contribution guidelines

- **Do not use YAML anchors at the root level of `values.yaml`** (e.g., `_defaults: &defaults`). Helm treats unknown root-level keys as invalid and may emit warnings or errors. Instead, duplicate shared configuration explicitly for each component.

### Alert Provisioning

The chart provisions Grafana Unified Alerting via ConfigMaps loaded by the `grafana-sc-alerts` sidecar. Each file under `moai-inference-framework/files/alerts/*.yaml` becomes an individual ConfigMap labelled `grafana_alert=1`; the sidecar mounts them into `/etc/grafana/provisioning/alerting/` and Grafana reloads automatically. The pattern mirrors `files/dashboards/*.json` so that adding a new alert file requires no template changes.
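
As a rough illustration of the result, a file named `files/alerts/example-rules.yaml` would render to something like the ConfigMap below. The file name and the `<chart-name>` prefix are hypothetical, the label key and value depend on the sidecar settings in `values.yaml`, and the namespace and common labels are omitted for brevity.

```yaml
# Sketch of one rendered ConfigMap (hypothetical file name: example-rules.yaml).
apiVersion: v1
kind: ConfigMap
metadata:
  name: <chart-name>-alert-example-rules
  labels:
    # Label the grafana-sc-alerts sidecar watches for.
    grafana_alert: "1"
data:
  example-rules.yaml: |-
    apiVersion: 1
    groups: []
```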

**Out of scope for this chart**: contact points. Slack webhook URLs (and any other receiver credentials) are secrets and are not managed here. Operators must create the contact point named in `alerts.heimdall.receiver` through the Grafana UI or via a separate Secret-backed provisioning file.
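
A minimal sketch of such a provisioning file, kept outside this chart, could look like the following. It assumes the webhook URL reaches the Grafana container as an environment variable named `SLACK_WEBHOOK_URL` (a hypothetical name, for example injected from a Kubernetes Secret); the `uid` is likewise illustrative. The `name` must match `alerts.heimdall.receiver`, and the template names are the ones defined in the Heimdall notification template file.

```yaml
# Sketch only: contact point provisioning maintained outside this chart.
apiVersion: 1
contactPoints:
  - orgId: 1
    # Must match alerts.heimdall.receiver.
    name: heimdall-slack
    receivers:
      - uid: heimdall-slack          # hypothetical uid
        type: slack
        settings:
          # Assumed to be injected from a Secret as an environment variable.
          url: $SLACK_WEBHOOK_URL
          title: '{{ template "heimdall-slack.title" . }}'
          text: '{{ template "heimdall-slack.body" . }}'
```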

**Do NOT wrap alert ConfigMap data with `tpl`.** Alert files embed Go template syntax that Grafana and Loki evaluate at runtime (e.g. `{{ printf "%.180s" .message }}` and `{{ if .error }}` in the LogQL `label_format` stage, `{{ $values.A.Value }}` in annotations). `tpl` would try to evaluate those expressions at Helm render time and either fail or corrupt the output. The template (`templates/grafana/alert-configmap.yaml`) instead reads each file verbatim via `Files.Get` and performs explicit `replace` substitutions for `__GRAFANA_URL__` and `__RECEIVER__`. New placeholders should follow the same `__NAME__` convention.

To customise cluster-specific links and the routing target, override `alerts.heimdall.grafanaURL` (used to compose Grafana Explore and rule view links surfaced in Slack messages) and `alerts.heimdall.receiver` (the contact point name).
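
For example, a per-cluster override file might look like the sketch below. The file name and the Grafana hostname are placeholders, not values taken from this repository.

```yaml
# values-prod.yaml (hypothetical override file)
alerts:
  heimdall:
    enabled: true
    # Placeholder URL. Omit the trailing slash: links are composed as
    # "<grafanaURL>/explore?..." and "<grafanaURL>/alerting/...".
    grafanaURL: "https://grafana.example.com"
    # Must match the name of a contact point that already exists in Grafana.
    receiver: "heimdall-slack"
```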

## Odin Presets (`moai-inference-preset`)

An Odin preset is a pair of Odin `InferenceServiceTemplate` resources — a **base template** (runtime base) and a **preset-specific template** — that together define how to deploy a Moreh vLLM pod. The base template defines how vLLM servers are launched and is shared across presets. The preset-specific template adds model-specific arguments, environment variables, resource requests, and disaggregation settings.
4 changes: 4 additions & 0 deletions deploy/helm/moai-inference-framework/README.md
@@ -33,6 +33,9 @@ Moreh Inference Framework

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| alerts.heimdall.enabled | bool | `true` | Enable provisioning of Heimdall alert rules, notification templates, and routing policies. Requires `prometheus-stack.grafana.sidecar.alerts.enabled`. |
| alerts.heimdall.grafanaURL | string | `""` | Base URL of this cluster's Grafana, used for clickable links in Slack messages (Explore + alert rule view). Leave empty to omit the link prefix. |
| alerts.heimdall.receiver | string | `"heimdall-slack"` | Name of the Grafana contact point that Heimdall alerts route to. The contact point itself must be created out-of-band (UI or Secret-backed provisioning) because the Slack webhook URL is a secret. |
| commonLabels | object | `{}` | Labels applied to all resources. |
| ecrTokenRefresher.aws.accessKeyId | string | `""` | AWS_ACCESS_KEY_ID |
| ecrTokenRefresher.aws.region | string | `"ap-northeast-2"` | AWS Region. |
@@ -131,6 +134,7 @@ Moreh Inference Framework
| prometheus-stack.defaultRules.create | bool | `false` | |
| prometheus-stack.enabled | bool | `true` | Enable prometheus-community/kube-prometheus-stack. Set to false if already deployed. |
| prometheus-stack.grafana.enabled | bool | `true` | |
| prometheus-stack.grafana.sidecar.alerts.enabled | bool | `true` | |
| prometheus-stack.grafana.sidecar.dashboards.enabled | bool | `true` | |
| prometheus-stack.kubeApiServer.enabled | bool | `false` | |
| prometheus-stack.kubeControllerManager.enabled | bool | `false` | |
@@ -0,0 +1,17 @@
apiVersion: 1
policies:
  - orgId: 1
    receiver: grafana-default-email
    group_by:
      - grafana_folder
      - alertname
    routes:
      - receiver: __RECEIVER__
        object_matchers:
          - - component
            - "="
            - heimdall
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        continue: false
@@ -0,0 +1,74 @@
apiVersion: 1
groups:
  - orgId: 1
    name: Heimdall Error Alerts
    folder: Heimdall
    interval: 1m
    rules:
      - uid: heimdall-error-log-burst
        title: Heimdall Error Log Burst
        condition: B
        data:
          # LogQL: group by instance/namespace and extract the error message into the
          # error_summary label via label_format. The resulting time series carries the
          # instance, namespace, and error_summary labels, which propagate to alert
          # labels so Slack messages can render them.
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: loki
            model:
              refId: A
              datasource:
                type: loki
                uid: loki
              expr: |
                sum by (instance, namespace, error_summary) (
                  count_over_time(
                    {app="heimdall-inference-scheduler", level="error"}
                      | json
                      | label_format error_summary=`{{ if .error }}{{ printf "%.180s" .message }}: {{ printf "%.180s" .error }}{{ else }}{{ printf "%.300s" .message }}{{ end }}`
                    [5m]
                  )
                )
              queryType: instant
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              datasource:
                type: __expr__
                uid: __expr__
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0]
                  operator:
                    type: and
                  query:
                    params: []
                  reducer:
                    type: last
                    params: []
                  type: query
              intervalMs: 1000
              maxDataPoints: 43200
        for: 1m
        noDataState: OK
        execErrState: Error
        annotations:
          summary: Heimdall error logs detected
          description: '{{ $values.A.Value }} error log entries detected in the last 5 minutes.'
          # Grafana Explore deep link: pre-filled LogQL filtered to Heimdall error logs.
          # The Slack template renders this annotation as a clickable link.
          exploreURL: '__GRAFANA_URL__/explore?schemaVersion=1&orgId=1&panes=%7B%22h1%22%3A%7B%22datasource%22%3A%22loki%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22loki%22%2C%22uid%22%3A%22loki%22%7D%2C%22expr%22%3A%22%7Bapp%3D%5C%22heimdall-inference-scheduler%5C%22%2Clevel%3D%5C%22error%5C%22%7D%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D'
          # Direct link to this alert rule view.
          ruleURL: '__GRAFANA_URL__/alerting/grafana/heimdall-error-log-burst/view'
        labels:
          severity: warning
          component: heimdall
        isPaused: false
@@ -0,0 +1,28 @@
apiVersion: 1
templates:
  - orgId: 1
    name: heimdall-slack-templates
    template: |
      {{ define "heimdall-slack.title" }}
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
      {{ end }}

      {{ define "heimdall-slack.body" }}
      {{ range .Alerts }}
      :pushpin: *Summary*: {{ .Annotations.summary }}
      :memo: *Description*: {{ .Annotations.description }}

      {{ if .Labels.error_summary }}:warning: *Error message*:
      ```{{ .Labels.error_summary }}```{{ end }}

      *Context*
      {{ if .Labels.namespace }}• Namespace: `{{ .Labels.namespace }}`{{ end }}
      {{ if .Labels.instance }}• Instance: `{{ .Labels.instance }}`{{ end }}
      • Severity: `{{ .Labels.severity }}`
      • Component: `{{ .Labels.component }}`
      {{ if .Labels.environment }}• Environment: `{{ .Labels.environment }}`{{ end }}

      {{ if .Annotations.exploreURL }}:mag: <{{ .Annotations.exploreURL }}|View error logs in Grafana>{{ end }}
      {{ if .Annotations.ruleURL }}:bell: <{{ .Annotations.ruleURL }}|View alert rule>{{ end }}
      {{ end }}
      {{ end }}
@@ -0,0 +1,25 @@
{{- $ps := index .Values "prometheus-stack" }}
{{- if and $ps.enabled $ps.grafana.enabled $ps.grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }}
{{- $files := .Files.Glob "files/alerts/*.yaml" }}
{{- if $files }}
{{- range $path, $_ := $files }}
{{- $alertName := base $path | trimSuffix ".yaml" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: {{ include "common.names.namespace" $ }}
  name: {{ include "common.names.name" $ }}-alert-{{ $alertName }}
  annotations:
    {{- with $ps.grafana.sidecar.alerts.annotations }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
  labels:
    {{ tpl $ps.grafana.sidecar.alerts.label $ }}: {{ ((tpl $ps.grafana.sidecar.alerts.labelValue $) | default 1) | quote }}
    {{- include "mif.labels" $ | nindent 4 }}
data:
  {{ base $path }}: |-
    {{- $.Files.Get $path | replace "__GRAFANA_URL__" ($.Values.alerts.heimdall.grafanaURL | default "") | replace "__RECEIVER__" ($.Values.alerts.heimdall.receiver | default "heimdall-slack") | nindent 4 }}
{{- end }}
{{- end }}
{{- end }}
22 changes: 22 additions & 0 deletions deploy/helm/moai-inference-framework/values.yaml
@@ -30,6 +30,8 @@ prometheus-stack:
    sidecar:
      dashboards:
        enabled: true
      alerts:
        enabled: true
  kubernetesServiceMonitors:
    enabled: true
  kubeApiServer:
@@ -59,6 +61,26 @@ prometheus-stack:
  thanosRuler:
    enabled: false

# Heimdall alert provisioning consumed by the grafana-sc-alerts sidecar.
# Contact points (Slack webhook URLs) are intentionally not managed here;
# operators must create the contact point referenced by `receiver` separately
# (via Grafana UI or a Secret-backed provisioning file) because webhook URLs
# are secrets.
alerts:
  heimdall:
    # -- Enable provisioning of Heimdall alert rules, notification templates,
    # and routing policies. Requires `prometheus-stack.grafana.sidecar.alerts.enabled`.
    enabled: true

    # -- Base URL of this cluster's Grafana, used for clickable links in Slack
    # messages (Explore + alert rule view). Leave empty to omit the link prefix.
    grafanaURL: ""

    # -- Name of the Grafana contact point that Heimdall alerts route to.
    # The contact point itself must be created out-of-band (UI or
    # Secret-backed provisioning) because the Slack webhook URL is a secret.
    receiver: "heimdall-slack"

lws:
  # -- Enable kubernetes-sigs/lws. Set to false if already deployed.
  enabled: true