diff --git a/.claude/skills/perf-diagnostics-deploy/SKILL.md b/.claude/skills/perf-diagnostics-deploy/SKILL.md new file mode 100644 index 0000000..1a4e819 --- /dev/null +++ b/.claude/skills/perf-diagnostics-deploy/SKILL.md @@ -0,0 +1,233 @@ +--- +name: perf-diagnostics-deploy +description: Use when making changes to Telegraf or InfluxDB config files in the performance-diagnostics repo and wanting to test those changes on a live Delphix engine via SSH +--- + +# Performance Diagnostics Deploy & Verify + +## Overview + +Workflow for making config changes to the performance-diagnostics repo, deploying them to a Delphix test engine over SSH, and verifying the changes are working correctly via InfluxDB queries. + +## Workflow + +```dot +digraph deploy { + "Make changes" -> "Ask: test on engine?"; + "Ask: test on engine?" -> "Done" [label="no"]; + "Ask: test on engine?" -> "Ask: which engine hostname?" [label="yes"]; + "Ask: which engine hostname?" -> "SSH as delphix, ask for password"; + "SSH as delphix, ask for password" -> "Deploy changed files"; + "Deploy changed files" -> "Restart services"; + "Restart services" -> "Wait 5 min"; + "Wait 5 min" -> "Query InfluxDB, verify changes"; +} +``` + +## Step 1 — Make Changes + +Make the requested code/config changes. Summarise what was changed and why before asking about testing. + +## Step 2 — Ask to Test + +``` +Changes done. Do you want to test these on a Delphix engine? +``` + +If no → stop. If yes → proceed. + +## Step 3 — Get Engine + Password + +``` +Which Delphix engine hostname should I deploy to? +``` + +Then SSH using `sshpass`: +```bash +sshpass -p "$PASSWORD" ssh -o StrictHostKeyChecking=no delphix@$HOST "..." +``` + +Ask for the SSH password if not already known. Default user is always `delphix`. 
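A minimal sketch of the Step 3 SSH invocation, assuming Python 3 is available on the machine that drives the deployment. It passes the password through the `SSHPASS` environment variable using `sshpass -e` instead of `-p`, so the password does not appear in the process list. The helper names `build_ssh_command` and `run_on_engine` are hypothetical and not part of the repo.

```python
import os
import subprocess


def build_ssh_command(host, remote_cmd, user="delphix"):
    """Return the argv list for a non-interactive SSH call via sshpass."""
    return [
        "sshpass", "-e",  # -e: take the password from the SSHPASS env var
        "ssh", "-o", "StrictHostKeyChecking=no",
        f"{user}@{host}",
        remote_cmd,
    ]


def run_on_engine(host, password, remote_cmd):
    """Run remote_cmd on the engine and return its stdout.

    Raises subprocess.CalledProcessError if ssh exits non-zero.
    """
    env = dict(os.environ, SSHPASS=password)
    result = subprocess.run(
        build_ssh_command(host, remote_cmd),
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Building the argument list separately from running it keeps the command easy to inspect before anything touches the engine.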
+ +## Step 4 — Deploy Changed Files + +**File locations on engine:** + +| Repo path | Engine path | +|---|---| +| `telegraf/telegraf.base` | `/etc/telegraf/telegraf.base` | +| `telegraf/telegraf.inputs.*` | `/etc/telegraf/telegraf.inputs.*` | +| `telegraf/connstat-stats.sh` | `/etc/telegraf/connstat-stats.sh` | +| `telegraf/nfs-threads.sh` | `/etc/telegraf/nfs-threads.sh` | +| `telegraf/zcache-stats.sh` | `/etc/telegraf/zcache-stats.sh` | +| `telegraf/zpool-iostat-o.sh` | `/etc/telegraf/zpool-iostat-o.sh` | +| `telegraf/delphix-telegraf-service` | `/usr/bin/delphix-telegraf-service` | +| `telegraf/perf_playbook` | `/usr/bin/perf_playbook` | +| `telegraf/delphix-telegraf.service` | `/lib/systemd/system/delphix-telegraf.service` | +| `influxdb/delphix-influxdb-init` | `/usr/bin/delphix-influxdb-init` | +| `influxdb/delphix-influxdb-service` | `/usr/bin/delphix-influxdb-service` | +| `influxdb/perf_influxdb` | `/usr/bin/perf_influxdb` | +| `influxdb/influxdb.toml` | `/etc/influxdb/influxdb.toml` | +| `influxdb/influxdb-init.conf` | `/etc/influxdb/influxdb-init.conf` | +| `influxdb/delphix-influxdb.service` | `/lib/systemd/system/delphix-influxdb.service` | +| `influxdb/influxdb-nginx.conf` | `/opt/delphix/server/etc/nginx/conf.d/influxdb.conf` | + +Only copy files that were actually changed. Use `scp` to `/tmp/` first, then `sudo cp` to destination. Set `chmod +x` on any shell scripts. + +**InfluxDB data directory:** `/var/lib/influxdb/engine` + +## Step 5 — Restart Services + +```bash +sudo systemctl restart delphix-influxdb +sleep 5 +sudo systemctl restart delphix-telegraf + +# Confirm both are active +systemctl is-active delphix-influxdb +systemctl is-active delphix-telegraf +``` + +## Step 6 — Wait 5 Minutes + +Wait for data to flow into InfluxDB. Use `ScheduleWakeup` with `delaySeconds: 270` (within cache window). 
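The repo-to-engine mapping from Step 4 can be sketched as a small lookup, assuming the table above is complete. The names `PATH_RULES`, `engine_path`, and `needs_exec_bit` are hypothetical. The `telegraf/*.sh` glob is an assumption that generalises the four scripts listed in the table.

```python
from fnmatch import fnmatchcase

# (repo path or glob, engine target). A target ending in "/" is a directory
# and keeps the repo file name; any other target is a full destination path.
PATH_RULES = [
    ("telegraf/telegraf.base", "/etc/telegraf/"),
    ("telegraf/telegraf.inputs.*", "/etc/telegraf/"),
    ("telegraf/*.sh", "/etc/telegraf/"),  # assumption: all scripts go here
    ("telegraf/delphix-telegraf-service", "/usr/bin/"),
    ("telegraf/perf_playbook", "/usr/bin/"),
    ("telegraf/delphix-telegraf.service", "/lib/systemd/system/"),
    ("influxdb/delphix-influxdb-init", "/usr/bin/"),
    ("influxdb/delphix-influxdb-service", "/usr/bin/"),
    ("influxdb/perf_influxdb", "/usr/bin/"),
    ("influxdb/influxdb.toml", "/etc/influxdb/"),
    ("influxdb/influxdb-init.conf", "/etc/influxdb/"),
    ("influxdb/delphix-influxdb.service", "/lib/systemd/system/"),
    ("influxdb/influxdb-nginx.conf",
     "/opt/delphix/server/etc/nginx/conf.d/influxdb.conf"),
]


def engine_path(repo_path):
    """Return the destination on the engine, or None if the file is unmapped."""
    for pattern, target in PATH_RULES:
        if fnmatchcase(repo_path, pattern):
            if target.endswith("/"):
                return target + repo_path.rsplit("/", 1)[-1]
            return target
    return None


def needs_exec_bit(repo_path):
    """Shell scripts need chmod +x after being copied into place."""
    return repo_path.endswith(".sh")
```

Returning `None` for unmapped files makes it obvious when a changed file has no known destination, rather than guessing a path.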
+ +## Step 7 — Verify via InfluxDB Query + +Get the InfluxDB credentials from the engine: +```bash +sudo cat /etc/influxdb/influxdb_meta +``` + +Query InfluxDB using the Flux API to verify the changes for the **last 5 minutes** of data. Tailor the query to what was changed: + +| Change type | What to verify | +|---|---| +| New measurement added | `from(bucket:"default") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "new_measurement") |> count()` | +| Tag removed (e.g. `wwid`) | Check tag keys don't include the removed tag | +| Histogram processors | Verify `hist_estat_*` measurements exist with bucket fields | +| `microseconds` field | Check field exists in relevant measurements | +| `connstat` aggregation | Verify `tcp_stats` has the `service` tag and the `connections` field | + +Query via curl: +```bash +curl -s -X POST "http://localhost:8086/api/v2/query?org=delphix" \ + -H "Authorization: Token $INFLUXDB_READ_TOKEN" \ + -H "Content-Type: application/vnd.flux" \ + -d 'from(bucket:"default") |> range(start: -5m) |> filter(fn:(r) => r._measurement == "MEASUREMENT") |> limit(n:5)' +``` + +Report results clearly: what measurements exist, what fields/tags are present, and whether the change is confirmed working. + +Then ask: + +``` +Verification done. Do you want to commit these changes? +``` + +If no → stop. If yes → proceed to Step 8. + +## Step 8 — Commit Changes + +Show the latest commit: +```bash +git log -1 --oneline +``` + +Ask: +``` +Latest commit: " " +Do you want to (1) amend that commit or (2) create a new commit? +``` + +**If amend:** +```bash +git add -A +git commit --amend --no-edit +``` + +**If new commit:** +Ask: +``` +What should the commit message be? (include the Jira ID, e.g. "DLPX-12345 Fix xyz") +``` + +Then: +```bash +git add -A +git commit -m "" +``` + +Then ask: +``` +Commit done. Do you want to push and update/raise a PR? +``` + +If no → stop. If yes → proceed to Step 9. 
+ +## Step 9 — Push and PR + +`git review` is a Delphix tool that pushes the branch and creates/updates the PR in one command. Use it instead of `git push` + `gh pr create`. + +First check for an existing open PR on the current branch: +```bash +gh pr list --head "$(git branch --show-current)" --state open +``` + +**If PR exists → update it:** + +```bash +git review -r +``` + +Then fetch the current PR description and update it to reflect the new changes: +```bash +gh pr view --json body +gh pr edit --body "..." +``` +Keep the existing structure but add/update the relevant sections. + +**If no PR exists → raise a new one:** + +Ask: +``` +What is the Jira ticket number for this PR? (e.g. DLPX-12345) +``` + +Fetch the Jira issue using the Jira MCP tool (`mcp__jira__jira_get_issue`) to understand the problem context. Then run: + +```bash +git review +``` + +This creates the PR as a draft. Get the PR URL from the output, then set a full description: + +```bash +gh pr edit --title ": " --body "$(cat <<'EOF' +## Summary +- + +## Problem + + +## Solution + + +## Testing +- [ ] Deployed to test engine +- [ ] InfluxDB queries confirmed data flowing correctly +- [ ] + +Jira: +EOF +)" +``` + +Return the PR URL to the user. 
+ +## Common Mistakes + +- Forgetting `chmod +x` on shell scripts → Telegraf fails with `EXEC` error +- Restarting Telegraf before InfluxDB is ready → Telegraf starts with `[[outputs.discard]]` +- Querying before 5 minutes pass → no data in range, looks broken but isn't +- Copying the wrong file path (influxdb vs telegraf directories) diff --git a/debian/control b/debian/control index 173d013..1cbf646 100644 --- a/debian/control +++ b/debian/control @@ -13,6 +13,6 @@ Standards-Version: 4.1.2 Package: performance-diagnostics Architecture: any -Depends: python3-bpfcc, python3-minimal, python3-psutil, telegraf, docker-ce +Depends: python3-bpfcc, python3-minimal, python3-psutil, telegraf, docker-ce, influxdb2, curl Description: eBPF-based Performance Diagnostic Tools A collection of eBPF-based tools for diagnosing performance issues. diff --git a/debian/postinst b/debian/postinst index ea9a0ce..44224e3 100644 --- a/debian/postinst +++ b/debian/postinst @@ -24,6 +24,14 @@ if ! groups "$USER" | grep -q "\b$GROUP\b"; then fi fi +# Remove the influxdb2 package default config — we use influxdb.toml exclusively. +rm -f /etc/influxdb/config.toml + +# Reload nginx to pick up the InfluxDB proxy location block. +if nginx -t -c /etc/nginx/nginx.conf &>/dev/null && systemctl is-active --quiet nginx; then + nginx -s reload +fi + #DEBHELPER# exit 0 \ No newline at end of file diff --git a/debian/rules b/debian/rules index d6f4f00..c84f85c 100755 --- a/debian/rules +++ b/debian/rules @@ -13,11 +13,12 @@ # need to rename a couple files, so do that here. 
# override_dh_auto_build: - mkdir -p build/cmd/ + mkdir -p build/cmd/ build/influxdb/ cp cmd/estat.py build/cmd/estat cp cmd/stbtrace.py build/cmd/stbtrace cp cmd/nfs_threads.py build/cmd/nfs_threads cp cmd/dsp.py build/cmd/dsp + cp influxdb/influxdb-nginx.conf build/influxdb/influxdb.conf override_dh_auto_install: dh_install build/cmd/* /usr/bin @@ -26,3 +27,7 @@ override_dh_auto_install: dh_install telegraf/delphix-telegraf-service telegraf/perf_playbook /usr/bin dh_install telegraf/delphix-telegraf.service /lib/systemd/system dh_install telegraf/telegraf* telegraf/*.sh /etc/telegraf + dh_install influxdb/delphix-influxdb-service influxdb/delphix-influxdb-init influxdb/perf_influxdb /usr/bin + dh_install influxdb/delphix-influxdb.service /lib/systemd/system + dh_install influxdb/influxdb.toml influxdb/influxdb-init.conf /etc/influxdb + dh_install build/influxdb/influxdb.conf /opt/delphix/server/etc/nginx/conf.d diff --git a/influxdb/delphix-influxdb-init b/influxdb/delphix-influxdb-init new file mode 100644 index 0000000..3501c50 --- /dev/null +++ b/influxdb/delphix-influxdb-init @@ -0,0 +1,274 @@ +#!/bin/bash -eu +# +# Copyright (c) 2026 by Delphix. All rights reserved. +# +# One-time InfluxDB initialization: creates org, bucket, admin token, +# a read-only token for DCT Smart Proxy, and writes the +# [[outputs.influxdb_v2]] stanza to /etc/telegraf/telegraf.outputs.influxdb, +# which is included by delphix-telegraf-service when INFLUXDB_ENABLED flag exists. +# Skips setup if InfluxDB is already initialized. +# + +INFLUXDB_URL="http://127.0.0.1:8086" +INFLUXDB_CONFIG_DIR="/etc/influxdb" +INFLUXDB_META_FILE="$INFLUXDB_CONFIG_DIR/influxdb_meta" +# State file written immediately after /api/v2/setup so the script can resume +# if it is interrupted before the metadata file is fully written. 
+INFLUXDB_SETUP_STATE_FILE="$INFLUXDB_CONFIG_DIR/influxdb_setup_state" +INFLUXDB_FLAG=/etc/telegraf/INFLUXDB_ENABLED +INFLUXDB_OUTPUT=/etc/telegraf/telegraf.outputs.influxdb +INFLUXDB_INIT_CONF="$INFLUXDB_CONFIG_DIR/influxdb-init.conf" + +# Load tunable configuration (org, bucket, retention, wait parameters). +# shellcheck source=/etc/influxdb/influxdb-init.conf +# shellcheck disable=SC1091 +source "$INFLUXDB_INIT_CONF" + +INFLUXDB_ADMIN_USER="admin" +INFLUXDB_ADMIN_PASSWORD="" + +# +# Log a message to stderr with a timestamp. +# +log() { + echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" >&2 +} + +# +# Extract a field from a JSON string using python3. +# +json_field() { + local json="$1" + local field="$2" + echo "$json" | python3 -c "import json,sys; print(json.loads(sys.stdin.read())$field)" || + { log "ERROR: Failed to parse field '$field' from JSON response."; return 1; } +} + +# +# POST to the InfluxDB HTTP API. Exits with an error if the request fails. +# +influx_post() { + local endpoint="$1" + local data="$2" + local auth_header="${3:-}" + + local curl_args=(-sf -X POST "$INFLUXDB_URL$endpoint" -H 'Content-Type: application/json' -d "$data") + [[ -n "$auth_header" ]] && curl_args+=(-H "Authorization: Token $auth_header") + + local response + response=$(curl "${curl_args[@]}") || + { log "ERROR: HTTP POST to '$endpoint' failed."; return 1; } + echo "$response" +} + +mkdir -p "$INFLUXDB_CONFIG_DIR" + +# Skip if already fully initialized. +if [[ -f "$INFLUXDB_META_FILE" ]]; then + log "InfluxDB already initialized, skipping." + exit 0 +fi + +# +# Wait for InfluxDB to be ready. +# +ready=false +for i in $(seq 1 "$INFLUXDB_WAIT_RETRIES"); do + if curl -sf "$INFLUXDB_URL/health" &>/dev/null; then + ready=true + break + fi + sleep "$INFLUXDB_WAIT_INTERVAL" +done + +if [[ "$ready" != "true" ]]; then + log "ERROR: InfluxDB did not become ready after $((INFLUXDB_WAIT_RETRIES * INFLUXDB_WAIT_INTERVAL))s." 
+ exit 1 +fi + +# +# Initial setup — creates org, bucket, and returns admin token + IDs. +# /api/v2/setup is a one-shot operation; if the script is interrupted after +# this point and re-run, the state file lets us skip setup and reuse the +# already-created admin token. +# +ADMIN_TOKEN="" +ORG_ID="" +BUCKET_ID="" +SUPPORT_BUCKET_ID="" + +if [[ -f "$INFLUXDB_SETUP_STATE_FILE" ]]; then + while IFS= read -r line; do + key="${line%%=*}" + value="${line#*=}" + case "$key" in + ADMIN_TOKEN) ADMIN_TOKEN="$value" ;; + ORG_ID) ORG_ID="$value" ;; + BUCKET_ID) BUCKET_ID="$value" ;; + SUPPORT_BUCKET_ID) SUPPORT_BUCKET_ID="$value" ;; + INFLUXDB_ADMIN_PASSWORD) INFLUXDB_ADMIN_PASSWORD="$value" ;; + WRITE_TOKEN) WRITE_TOKEN="$value" ;; + READ_TOKEN) READ_TOKEN="$value" ;; + SUPPORT_WRITE_TOKEN) SUPPORT_WRITE_TOKEN="$value" ;; + esac + done <"$INFLUXDB_SETUP_STATE_FILE" +else + # Generate password only when actually running setup for the first time. + INFLUXDB_ADMIN_PASSWORD="$(openssl rand -hex 16)" + SETUP_RESPONSE=$(influx_post "/api/v2/setup" "{ + \"username\": \"$INFLUXDB_ADMIN_USER\", + \"password\": \"$INFLUXDB_ADMIN_PASSWORD\", + \"org\": \"$INFLUXDB_ORG\", + \"bucket\": \"$INFLUXDB_BUCKET\", + \"retentionPeriodSeconds\": $INFLUXDB_RETENTION_SECONDS + }") || exit 1 + + ADMIN_TOKEN=$(json_field "$SETUP_RESPONSE" "['auth']['token']") || exit 1 + ORG_ID=$(json_field "$SETUP_RESPONSE" "['org']['id']") || exit 1 + BUCKET_ID=$(json_field "$SETUP_RESPONSE" "['bucket']['id']") || exit 1 + + # Persist admin token + IDs + password immediately so a subsequent re-run + # can resume without repeating the one-shot setup call, and so the password + # stored in influxdb_meta always matches what InfluxDB was initialised with. 
+ old_umask="$(umask)" + umask 077 + tmp_state="$(mktemp "${INFLUXDB_SETUP_STATE_FILE}.XXXXXX")" + printf 'ADMIN_TOKEN=%s\nORG_ID=%s\nBUCKET_ID=%s\nINFLUXDB_ADMIN_PASSWORD=%s\n' \ + "$ADMIN_TOKEN" "$ORG_ID" "$BUCKET_ID" "$INFLUXDB_ADMIN_PASSWORD" >"$tmp_state" + chmod 600 "$tmp_state" + mv "$tmp_state" "$INFLUXDB_SETUP_STATE_FILE" + umask "$old_umask" +fi + +# +# Create the support_metrics bucket (skipped if already persisted in state). +# +if [[ -z "$SUPPORT_BUCKET_ID" ]]; then + SUPPORT_BUCKET_RESPONSE=$(influx_post "/api/v2/buckets" "{ + \"orgID\": \"$ORG_ID\", + \"name\": \"$INFLUXDB_SUPPORT_BUCKET\", + \"retentionRules\": [{\"type\": \"expire\", \"everySeconds\": $INFLUXDB_SUPPORT_RETENTION_SECONDS}] + }" "$ADMIN_TOKEN") || exit 1 + SUPPORT_BUCKET_ID=$(json_field "$SUPPORT_BUCKET_RESPONSE" "['id']") || exit 1 + printf 'SUPPORT_BUCKET_ID=%s\n' "$SUPPORT_BUCKET_ID" >>"$INFLUXDB_SETUP_STATE_FILE" +fi + +# Token creation is guarded so that on crash-resume (setup state exists but +# meta file not yet written), we reuse already-created tokens rather than +# creating orphaned duplicates in InfluxDB on each retry. +WRITE_TOKEN="${WRITE_TOKEN:-}" +READ_TOKEN="${READ_TOKEN:-}" +SUPPORT_WRITE_TOKEN="${SUPPORT_WRITE_TOKEN:-}" + +# +# Create a write-only token for Telegraf (skipped if already persisted in state). +# +if [[ -z "$WRITE_TOKEN" ]]; then + WRITE_TOKEN_RESPONSE=$(influx_post "/api/v2/authorizations" "{ + \"orgID\": \"$ORG_ID\", + \"description\": \"telegraf-write-token\", + \"permissions\": [ + {\"action\": \"write\", \"resource\": {\"type\": \"buckets\", \"id\": \"$BUCKET_ID\", \"orgID\": \"$ORG_ID\"}} + ] + }" "$ADMIN_TOKEN") || exit 1 + WRITE_TOKEN=$(json_field "$WRITE_TOKEN_RESPONSE" "['token']") || exit 1 + printf 'WRITE_TOKEN=%s\n' "$WRITE_TOKEN" >>"$INFLUXDB_SETUP_STATE_FILE" +fi + +# +# Create a read-only token for DCT Smart Proxy (skipped if already persisted in state). 
+# +if [[ -z "$READ_TOKEN" ]]; then + READ_TOKEN_RESPONSE=$(influx_post "/api/v2/authorizations" "{ + \"orgID\": \"$ORG_ID\", + \"description\": \"dct-read-token\", + \"permissions\": [ + {\"action\": \"read\", \"resource\": {\"type\": \"buckets\", \"id\": \"$BUCKET_ID\", \"orgID\": \"$ORG_ID\"}} + ] + }" "$ADMIN_TOKEN") || exit 1 + READ_TOKEN=$(json_field "$READ_TOKEN_RESPONSE" "['token']") || exit 1 + printf 'READ_TOKEN=%s\n' "$READ_TOKEN" >>"$INFLUXDB_SETUP_STATE_FILE" +fi + +# +# Create a write-only token for the support_metrics bucket (skipped if already persisted). +# +if [[ -z "$SUPPORT_WRITE_TOKEN" ]]; then + SUPPORT_WRITE_TOKEN_RESPONSE=$(influx_post "/api/v2/authorizations" "{ + \"orgID\": \"$ORG_ID\", + \"description\": \"telegraf-support-write-token\", + \"permissions\": [ + {\"action\": \"write\", \"resource\": {\"type\": \"buckets\", \"id\": \"$SUPPORT_BUCKET_ID\", \"orgID\": \"$ORG_ID\"}} + ] + }" "$ADMIN_TOKEN") || exit 1 + SUPPORT_WRITE_TOKEN=$(json_field "$SUPPORT_WRITE_TOKEN_RESPONSE" "['token']") || exit 1 + printf 'SUPPORT_WRITE_TOKEN=%s\n' "$SUPPORT_WRITE_TOKEN" >>"$INFLUXDB_SETUP_STATE_FILE" +fi + +# +# Write three [[outputs.influxdb_v2]] stanzas to a dedicated telegraf output file: +# - default bucket: Grafana-facing measurements (cpu, mem, disk, net, zfs, estat_*, hist_estat_*) +# - default bucket (second stanza): tcp_stats slim — only the 4 fields needed by Grafana +# dashboards (connections, inbytes, outbytes, retranssegs) +# - support_metrics bucket: operational measurements + full tcp_stats with all TCP internals +# (tcp_stats, processes, system, procstat, agg_*) +# The flag is read by delphix-telegraf-service to conditionally include this output. 
+# +cat >"$INFLUXDB_OUTPUT" <"$tmp_meta" </dev/null || true diff --git a/influxdb/delphix-influxdb-service b/influxdb/delphix-influxdb-service new file mode 100644 index 0000000..eac68a4 --- /dev/null +++ b/influxdb/delphix-influxdb-service @@ -0,0 +1,23 @@ +#!/bin/bash +# +# Copyright (c) 2026 by Delphix. All rights reserved. +# +# Wrapper script to start InfluxDB 2.x and run first-time initialization. +# + +INFLUXDB_CONFIG=/etc/influxdb/influxdb.toml +INFLUXDB_INIT=/usr/bin/delphix-influxdb-init + +# Start influxd in the background. +# influxd does not support a --config-path flag; config file is passed via env var. +INFLUXD_CONFIG_PATH="$INFLUXDB_CONFIG" /usr/bin/influxd & +INFLUXDB_PID=$! + +# Run initialization (the init script handles waiting for InfluxDB to be ready) +if ! $INFLUXDB_INIT; then + echo "ERROR: delphix-influxdb-init failed, stopping influxd" >&2 + kill "$INFLUXDB_PID" 2>/dev/null + exit 1 +fi + +wait "$INFLUXDB_PID" diff --git a/influxdb/delphix-influxdb.service b/influxdb/delphix-influxdb.service new file mode 100644 index 0000000..ec69c0b --- /dev/null +++ b/influxdb/delphix-influxdb.service @@ -0,0 +1,16 @@ +[Unit] +Description=Delphix InfluxDB Time Series Database +Documentation=https://docs.influxdata.com/influxdb/v2/ +PartOf=delphix.target +After=delphix-platform.service +PartOf=delphix-platform.service + +[Service] +User=root +ExecStart=/usr/bin/delphix-influxdb-service +Restart=on-failure +RestartForceExitStatus=SIGPIPE +KillMode=control-group + +[Install] +WantedBy=delphix.target diff --git a/influxdb/influxdb-init.conf b/influxdb/influxdb-init.conf new file mode 100644 index 0000000..6093f98 --- /dev/null +++ b/influxdb/influxdb-init.conf @@ -0,0 +1,14 @@ +# +# Copyright (c) 2026 by Delphix. All rights reserved. +# +# Configuration for delphix-influxdb-init. +# Sourced by /usr/bin/delphix-influxdb-init at runtime. 
+# + +INFLUXDB_ORG="delphix" +INFLUXDB_BUCKET="default" +INFLUXDB_RETENTION_SECONDS=2592000 # 30 days (720h) +INFLUXDB_SUPPORT_BUCKET="support_metrics" +INFLUXDB_SUPPORT_RETENTION_SECONDS=604800 # 7 days +INFLUXDB_WAIT_RETRIES=30 +INFLUXDB_WAIT_INTERVAL=2 diff --git a/influxdb/influxdb-nginx.conf b/influxdb/influxdb-nginx.conf new file mode 100644 index 0000000..ba2a74a --- /dev/null +++ b/influxdb/influxdb-nginx.conf @@ -0,0 +1,17 @@ +# +# Copyright (c) 2026 by Delphix. All rights reserved. +# +# Proxy InfluxDB 2.x API through nginx so external clients (DCT, Grafana) +# can reach it over HTTPS using the engine's existing TLS certificate. +# InfluxDB itself binds to 127.0.0.1:8086 (HTTP, localhost only). +# +location /influxdb/ { + proxy_pass http://127.0.0.1:8086/; + proxy_set_header Host $http_host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + proxy_http_version 1.1; + proxy_read_timeout 999d; + proxy_buffering off; +} diff --git a/influxdb/influxdb.toml b/influxdb/influxdb.toml new file mode 100644 index 0000000..49b7e3f --- /dev/null +++ b/influxdb/influxdb.toml @@ -0,0 +1,10 @@ +# +# Copyright 2026 Delphix. All rights reserved. +# +# InfluxDB 2.x Configuration +# + +bolt-path = "/var/lib/influxdb/influxd.bolt" +engine-path = "/var/lib/influxdb/engine" +http-bind-address = "127.0.0.1:8086" +log-level = "warn" diff --git a/influxdb/perf_influxdb b/influxdb/perf_influxdb new file mode 100644 index 0000000..00baa6f --- /dev/null +++ b/influxdb/perf_influxdb @@ -0,0 +1,47 @@ +#!/bin/bash +# +# Copyright (c) 2026 by Delphix. All rights reserved. +# +# Script that enables and disables InfluxDB metric output for Telegraf. 
+# + +INFLUXDB_FLAG=/etc/telegraf/INFLUXDB_ENABLED +INFLUXDB_OUTPUT=/etc/telegraf/telegraf.outputs.influxdb + +function die() { + echo -e "$(date +%T:%N:%z): $(basename $0): $*" >&2 + exit 1 +} + +[[ $EUID -ne 0 ]] && die "must be run as root" + +function usage() { + echo "$(basename $0): $*" >&2 + echo "Usage: $(basename $0) [enable|disable]" + exit 2 +} + +function enable_influxdb() { + date + [[ ! -f $INFLUXDB_OUTPUT ]] && die "$INFLUXDB_OUTPUT not found. Run delphix-influxdb-init first." + echo "Enabling InfluxDB Metric Output" + touch $INFLUXDB_FLAG + systemctl restart delphix-telegraf +} + +function disable_influxdb() { + date + echo "Disabling InfluxDB Metric Output" + rm -f $INFLUXDB_FLAG + systemctl restart delphix-telegraf +} + +if [[ $# -ne 1 ]]; then + usage +fi + +case "$1" in +enable) enable_influxdb ;; +disable) disable_influxdb ;; +*) usage ;; +esac diff --git a/telegraf/connstat-stats.sh b/telegraf/connstat-stats.sh new file mode 100755 index 0000000..eec32ef --- /dev/null +++ b/telegraf/connstat-stats.sh @@ -0,0 +1,106 @@ +#!/usr/bin/env python3 +# +# Collect per-connection TCP stats from connstat and aggregate by remote +# endpoint (laddr:raddr:service) to bound cardinality on engines with many +# connections — e.g. Oracle dNFS (hundreds of connections per VDB host) or +# Elastic Data (many connections per object storage endpoint IP). +# Mirrors the aggregation done by LocalTCPStatsCollector in the mgmt stack. +# +# Service name lookup reads from /etc/services, matching LocalTCPStatsCollector +# exactly. lport is checked before rport so that listening services (where the +# engine is the server) are identified correctly. Falls back to "unknown". 
+# +# Output fields per aggregated endpoint: +# laddr, raddr, service +# inbytes, outbytes, retranssegs, suna, unsent (summed across connections) +# swnd, cwnd, rwnd, rtt (averaged across connections) +# connections (count of aggregated conns) +# +import subprocess +import sys + +# Load port->service mapping from /etc/services, same as LocalTCPStatsCollector. +svc = {} +try: + with open('/etc/services') as f: + for line in f: + line = line.strip() + if not line or line.startswith('#'): + continue + parts = line.split() + if len(parts) >= 2 and '/tcp' in parts[1]: + try: + port = int(parts[1].split('/')[0]) + if port not in svc: + svc[port] = parts[0] + except ValueError: + pass +except OSError: + pass + +# Delphix-specific ports not present in /etc/services. +# Matches LocalTCPStatsCollector.getService() special-cases exactly. +svc[8415] = 'dlpx-sp' +svc[50001] = 'network-throughput-test' +svc[8341] = 'oracle-logsync' +svc[9100] = 'dlpx-connector' + +proc = subprocess.Popen( + ['/usr/bin/connstat', '-PLe', '-i', '10', '-T', 'u', + '-o', 'laddr,lport,raddr,rport,inbytes,outbytes,retranssegs,' + 'suna,unsent,swnd,cwnd,rwnd,rtt'], + stdout=subprocess.PIPE, + text=True, + bufsize=1, +) + +cnt = {} +inb = {} +outb = {} +ret_ = {} +sun = {} +uns = {} +sw = {} +cw = {} +rw = {} +rt = {} + +for raw in proc.stdout: + line = raw.rstrip('\n') + if line.startswith('='): + for key, n in cnt.items(): + la, ra, sv = key + sys.stdout.write( + f"{la},{ra},{sv}," + f"{inb[key]},{outb[key]},{ret_[key]},{sun[key]},{uns[key]}," + f"{sw[key]//n},{cw[key]//n},{rw[key]//n},{rt[key]//n},{n}\n" + ) + sys.stdout.flush() + cnt.clear(); inb.clear(); outb.clear(); ret_.clear() + sun.clear(); uns.clear(); sw.clear(); cw.clear(); rw.clear(); rt.clear() + continue + + fields = line.split(',') + if len(fields) != 13: + continue + la, lp, ra, rp = fields[0], fields[1], fields[2], fields[3] + lp_i = int(lp) if lp.isdigit() else 0 + rp_i = int(rp) if rp.isdigit() else 0 + if lp_i in svc: + sv = 
svc[lp_i] + elif rp_i in svc: + sv = svc[rp_i] + else: + sv = 'unknown' + + key = (la, ra, sv) + cnt[key] = cnt.get(key, 0) + 1 + inb[key] = inb.get(key, 0) + int(fields[4]) + outb[key] = outb.get(key, 0) + int(fields[5]) + ret_[key] = ret_.get(key, 0) + int(fields[6]) + sun[key] = sun.get(key, 0) + int(fields[7]) + uns[key] = uns.get(key, 0) + int(fields[8]) + sw[key] = sw.get(key, 0) + int(fields[9]) + cw[key] = cw.get(key, 0) + int(fields[10]) + rw[key] = rw.get(key, 0) + int(fields[11]) + rt[key] = rt.get(key, 0) + int(fields[12]) diff --git a/telegraf/delphix-telegraf-service b/telegraf/delphix-telegraf-service index 72df797..935465c 100755 --- a/telegraf/delphix-telegraf-service +++ b/telegraf/delphix-telegraf-service @@ -3,7 +3,10 @@ BASE_CONFIG=/etc/telegraf/telegraf.base DOSE_INPUTS=/etc/telegraf/telegraf.inputs.dose DCT_INPUTS=/etc/telegraf/telegraf.inputs.dct PLAYBOOK_INPUTS=/etc/telegraf/telegraf.inputs.playbook +STORAGE_IO_INPUTS=/etc/telegraf/telegraf.inputs.storage_io +INFLUXDB_OUTPUT=/etc/telegraf/telegraf.outputs.influxdb PLAYBOOK_FLAG=/etc/telegraf/PLAYBOOK_ENABLED +INFLUXDB_FLAG=/etc/telegraf/INFLUXDB_ENABLED TELEGRAF_CONFIG=/etc/telegraf/telegraf.conf @@ -21,6 +24,10 @@ function playbook_is_enabled() { [[ -f $PLAYBOOK_FLAG ]] } +function influxdb_is_enabled() { + [[ -f $INFLUXDB_FLAG ]] +} + rm -f $TELEGRAF_CONFIG if engine_is_object_based; then @@ -43,4 +50,21 @@ else fi fi +if influxdb_is_enabled && [[ -f $INFLUXDB_OUTPUT ]]; then + if [[ -f $STORAGE_IO_INPUTS ]]; then + cat $STORAGE_IO_INPUTS >> $TELEGRAF_CONFIG + fi + cat $INFLUXDB_OUTPUT >> $TELEGRAF_CONFIG +else + if influxdb_is_enabled; then + logger -t delphix-telegraf "WARNING: INFLUXDB_ENABLED is set but $INFLUXDB_OUTPUT is missing — metrics will be discarded. Run delphix-influxdb-init to restore InfluxDB output." + fi + # No InfluxDB output configured. Add discard so Telegraf can start — + # it requires at least one output plugin. 
+ echo "[[outputs.discard]]" >> $TELEGRAF_CONFIG +fi + +# Restrict permissions so the InfluxDB write token is not world-readable. +chmod 640 $TELEGRAF_CONFIG + /usr/bin/telegraf -config $TELEGRAF_CONFIG diff --git a/telegraf/metaslab-alloc-stats.sh b/telegraf/metaslab-alloc-stats.sh new file mode 100755 index 0000000..aaee3fc --- /dev/null +++ b/telegraf/metaslab-alloc-stats.sh @@ -0,0 +1,9 @@ +#!/bin/sh +# +# Wrapper around "estat metaslab-alloc -jm 10" that filters out metrics whose +# "name" tag contains garbage characters (DLPX-88427). A kernel bug causes +# estat to occasionally emit stat names containing raw memory bytes or C macro +# strings. Only names consisting of printable ASCII letters, digits, spaces, +# and common punctuation are passed through. +# +estat metaslab-alloc -jm 10 | grep -E '"name":"[A-Za-z0-9 ,_()/.-]+"' diff --git a/telegraf/telegraf.base b/telegraf/telegraf.base index 7abd9a4..5abc3e2 100644 --- a/telegraf/telegraf.base +++ b/telegraf/telegraf.base @@ -11,53 +11,20 @@ ############################################################################### # OUTPUT PLUGINS # ############################################################################### -# Define the main metric output file, excluding aggregated stats and -# Performance Playbook (estat) data. 
-[[outputs.file]] - files = ["/var/log/telegraf/metrics.json"] - rotation_max_size = "50MB" - rotation_max_archives = 9 - data_format = "json" - namedrop = ["*estat_*", "agg_*", "zfs", "zpool*", "zcache*", "docker*"] - -# Define output file for ZFS related metrics -[[outputs.file]] - files = ["/var/log/telegraf/metrics_zfs.json"] - rotation_max_size = "30MB" - rotation_max_archives = 5 - data_format = "json" - namepass = ["zpool*", "zcache*", "zfs"] - -# Define output file for Performance Playbook (estat) metrics -[[outputs.file]] - files = ["/var/log/telegraf/metrics_estat.json"] - rotation_max_size = "30MB" - rotation_max_archives = 5 - data_format = "json" - namepass = ["*estat_*"] - -# Define output file for aggregate statistics -[[outputs.file]] - files = ["/var/log/telegraf/metric_aggregates.json"] - rotation_max_size = "30MB" - rotation_max_archives = 5 - data_format = "json" - namepass = ["agg_*"] - -# Enable Live Monitoring, intended for internal Delphix use only: -#[[outputs.influxdb]] -# urls = ["http://dbsvr.company.com:8086"] -# database = "live_metrics" -# skip_database_creation = true -# data_format = "influx" +# All metrics are ingested into InfluxDB. The output stanza is written by +# delphix-influxdb-init to /etc/telegraf/telegraf.outputs.influxdb and +# appended here by delphix-telegraf-service when InfluxDB is enabled. +# Use 'perf_influxdb enable|disable' to toggle and restart Telegraf. ############################################################################### # INPUT PLUGINS # ############################################################################### -# Get CPU usage +# Get CPU usage — only cpu-total, not per-core (reduces data volume on +# many-CPU engines; agg_cpu automatically inherits this restriction). +# percpu defaults to true so must be explicitly set to false. 
[[inputs.cpu]] - percpu = true + percpu = false totalcpu = true collect_cpu_time = false report_active = false @@ -65,31 +32,58 @@ # Get mount point stats [[inputs.disk]] + interval = "60s" mount_points = ["/","/domain0"] - -# Get disk I/O stats + tagexclude = ["fstype", "mode"] + fieldpass = ["used", "free", "total"] + +# Get disk I/O stats for whole disks only — partitions add cardinality without +# diagnostic value and account for ~30% of diskio/agg_diskio line volume. +# Excluded: +# zd* — ZFS zvol internal block devices +# *p[0-9]* — NVMe partitions (nvme0n1p1, nvme0n1p9, etc.) +# sd*[0-9]* — SCSI/SATA partitions (sda1, sdb2, etc.) +# wwid is a redundant 100+ char tag; the short-form name tag is sufficient. [[inputs.diskio]] - -# Track stats for the current metric files -[[inputs.filestat]] - files = ["/var/log/telegraf/metrics.json", - "/var/log/telegraf/metrics_estat.json", - "/var/log/telegraf/metrics_zfs.json", - "/var/log/telegraf/metric_aggregates.json"] + interval = "60s" + tagdrop = {name = ["zd*", "*p[0-9]*", "sd*[0-9]*"]} + tagexclude = ["wwid"] + fieldpass = ["reads", "writes", "read_bytes", "write_bytes", "read_time", "write_time", "iops_in_progress"] # Get Memory stats [[inputs.mem]] + fieldpass = ["used", "available", "total", "free", "cached", "buffered", "dirty", "slab"] # Get some network interface stats [[inputs.net]] fieldpass = ["tcp*","bytes*","packets*","err*","drop*"] +# Per-endpoint TCP stats (bytes, RTT, window sizes) via connstat. +# Aggregated by remote endpoint (laddr:raddr:service) to mirror the aggregation +# in LocalTCPStatsCollector — avoids cardinality explosion on Oracle dNFS +# engines (hundreds of connections per VDB host) and Elastic Data engines +# (many connections per object storage endpoint IP). +# Cumulative fields (inbytes, outbytes, etc.) are summed; window/RTT fields +# are averaged; connections = number of TCP connections aggregated. 
+[[inputs.execd]] + command = ["/etc/telegraf/connstat-stats.sh"] + name_override = "tcp_stats" + signal = "none" + restart_delay = "30s" + data_format = "csv" + csv_delimiter = "," + csv_trim_space = true + csv_column_names = ["laddr", "raddr", "service", "inbytes", "outbytes", "retranssegs", "suna", "unsent", "swnd", "cwnd", "rwnd", "rtt", "connections"] + csv_column_types = ["string", "string", "string", "int", "int", "int", "int", "int", "int", "int", "int", "int", "int"] + csv_tag_columns = ["laddr", "raddr", "service"] + # Track CPU and Memory for the "delphix-mgmt" service (and children). [[inputs.procstat]] systemd_unit = "delphix-mgmt.service" include_systemd_children = true namedrop = ["procstat_lookup"] fieldpass = ["memory_usage", "cpu_usage", "memory_rss"] + tagexclude = ["cgroup_full"] # Track CPU and Memory for the "zfs-object-agent" service (and children). [[inputs.procstat]] @@ -97,19 +91,43 @@ include_systemd_children = true namedrop = ["procstat_lookup"] fieldpass = ["memory_usage", "cpu_usage", "memory_rss"] + tagexclude = ["cgroup_full"] # Get process counts [[inputs.processes]] -# Get swap memory usage -[[inputs.swap]] - # Get misc 'other' stats (load and uptime) [[inputs.system]] # ZFS kstats (arcstat, abdstat, zfetch, etc) +# arcstats_l2_* fields are L2ARC stats — unused on all appliances (no L2ARC). 
[[inputs.zfs]] interval = "1m" + fieldpass = [ + "arcstats_anon_data", "arcstats_anon_evictable_data", + "arcstats_anon_evictable_metadata", "arcstats_anon_metadata", + "arcstats_arc_need_free", "arcstats_arc_no_grow", "arcstats_arc_prune", + "arcstats_arc_sys_free", "arcstats_async_upgrade_sync", + "arcstats_c", "arcstats_data_size", + "arcstats_demand_data_hits", "arcstats_demand_data_misses", + "arcstats_demand_hit_predictive_prefetch", + "arcstats_evict_not_enough", "arcstats_evict_skip", + "arcstats_hits", "arcstats_misses", + "arcstats_memory_available_bytes", "arcstats_memory_direct_count", + "arcstats_memory_free_bytes", "arcstats_memory_indirect_count", + "arcstats_metadata_size", + "arcstats_mfu_data", "arcstats_mfu_evictable_data", + "arcstats_mfu_evictable_metadata", "arcstats_mfu_ghost_hits", + "arcstats_mfu_hits", "arcstats_mfu_metadata", + "arcstats_mru_data", "arcstats_mru_evictable_data", + "arcstats_mru_evictable_metadata", "arcstats_mru_ghost_hits", + "arcstats_mru_hits", "arcstats_mru_metadata", + "arcstats_prefetch_data_hits", "arcstats_prefetch_data_misses", + "arcstats_size", + "zil_commit_count", "zil_itx_count", "zil_commit_stall_count", + "zfetchstats_hits", "zfetchstats_misses", + "dmu_tx_dirty_throttle", "dmu_tx_delay" + ] # Detailed ZFS pool metrics from "zpool_influxdb" (noisy) #[[inputs.exec]] @@ -127,5 +145,5 @@ drop_original = false stats = ["min", "max", "mean", "stdev"] name_prefix = "agg_" - namepass = ["cpu","disk","diskio","mem","net","processes","system","swap"] + namepass = ["cpu","disk","diskio","mem","net","processes","system"] diff --git a/telegraf/telegraf.inputs.dct b/telegraf/telegraf.inputs.dct index 07ceb4d..07dc47f 100644 --- a/telegraf/telegraf.inputs.dct +++ b/telegraf/telegraf.inputs.dct @@ -11,12 +11,5 @@ ] docker_label_exclude = ["com.docker.compose.*", "resty*"] - [[outputs.file]] - files = ["/var/log/telegraf/metrics_docker.json"] - rotation_max_size = "30MB" - rotation_max_archives = 5 - data_format = "json" 
- namepass = ["docker*"] - ####################### End of Docker/DCT services Metrics ####################### diff --git a/telegraf/telegraf.inputs.playbook b/telegraf/telegraf.inputs.playbook index 5ed7e21..d478ab2 100644 --- a/telegraf/telegraf.inputs.playbook +++ b/telegraf/telegraf.inputs.playbook @@ -1,31 +1,8 @@ ############################################################################## -# Performance Playbook (estat, nfs_threads) collection - -# Collect output from "estat nfs -jm 10" -[[inputs.execd]] - command = ["estat", "nfs", "-jm", "10"] - name_override = "estat_nfs" - signal = "none" - restart_delay = "30s" - data_format = "json" - tag_keys = [ - "name", - "axis" - ] - json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] - -# Collect output from "estat iscsi -jm 10" -[[inputs.execd]] - command = ["estat", "iscsi", "-jm", "10"] - name_override = "estat_iscsi" - signal = "none" - restart_delay = "30s" - data_format = "json" - tag_keys = [ - "name", - "axis" - ] - json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] +# Performance Playbook (estat, nfs_threads) collection +# Note: estat_nfs, estat_iscsi, and estat_backend-io live in +# telegraf.inputs.storage_io and are always collected when InfluxDB is +# enabled, independent of playbook state. 
# Collect output from "estat zpl -jm 10" [[inputs.execd]] @@ -40,19 +17,6 @@ ] json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] -# Collect output from "estat backend-io -jm 10" -[[inputs.execd]] - command = ["estat", "backend-io", "-jm", "10"] - name_override = "estat_backend-io" - signal = "none" - restart_delay = "30s" - data_format = "json" - tag_keys = [ - "name", - "axis" - ] - json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] - # Collect output from "estat zvol -jm 10" [[inputs.execd]] command = ["estat", "zvol", "-jm", "10"] @@ -91,9 +55,10 @@ ] json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] -# Collect output from "estat metaslab-alloc -jm 10" +# Collect output from "estat metaslab-alloc -jm 10" via wrapper script. +# The wrapper filters out metrics with garbage "name" tags (DLPX-88427). [[inputs.execd]] - command = ["estat", "metaslab-alloc", "-jm", "10"] + command = ["/etc/telegraf/metaslab-alloc-stats.sh"] name_override = "estat_metaslab-alloc" signal = "none" restart_delay = "30s" @@ -123,47 +88,17 @@ ############################################################################### # PROCESSOR PLUGINS # ############################################################################### -# Convert strings from estat into integer values so they don't get dropped +# Convert strings from estat into integer values so they don't get dropped. +# Scoped to playbook-only metrics; estat_nfs/iscsi/backend-io have their own +# converter in telegraf.inputs.storage_io. 
[[processors.converter]] + namepass = ["estat_zpl", "estat_zio", "estat_zvol", "estat_zio-queue", "estat_metaslab-alloc"] [processors.converter.fields] integer = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)"] -# The estat output contains a nested latency histogram, so we need to -# parse that out as a new array metric rather than a non-JSON string. -# -# From this: -# "microseconds":"{20000,5},{30000,15},{40000,3},{50000,24}" -# to this: -# "microseconds":"{20000:5,30000:15,40000:3,50000:24}" -# -# Clone the original so we have a "new" metric with a "hist_" name prefix -[[processors.clone]] - order = 1 - name_prefix = "hist_" - namepass = ["estat_*"] - -# Rewrite the histograms for the "hist_estat_*" metrics as JSON objects -[[processors.regex]] - order = 2 - namepass = ["hist_estat_*"] - [[processors.regex.fields]] - key = "microseconds" - pattern = "{(\\d+),(\\d+)}" - replacement = "\"${1}\":${2}" - [[processors.regex.fields]] - key = "microseconds" - pattern = ".*" - replacement = "{$0}" - -# Now parse out the arrays for "hist_estat_*" metrics -[[processors.parser]] - order = 3 - merge = "override" - parse_fields = ["microseconds"] - drop_original = false - data_format = "json" - namepass = ["hist_estat_*"] - fieldpass = ["microseconds"] +# Note: histogram processors (clone/regex/parser) for all estat_* measurements +# live in telegraf.inputs.storage_io, which is always included when InfluxDB +# is enabled. No duplication needed here. # End of Processor section ############################################################################## diff --git a/telegraf/telegraf.inputs.storage_io b/telegraf/telegraf.inputs.storage_io new file mode 100644 index 0000000..2341147 --- /dev/null +++ b/telegraf/telegraf.inputs.storage_io @@ -0,0 +1,99 @@ +############################################################################## +# Storage I/O collection: NFS server, iSCSI target, and backend disk I/O. 
+# Always included when InfluxDB is enabled, independent of playbook state. +# NFS/iSCSI/backend-IO are also collected continuously by delphix-stat into +# analytics_datapoint (the source for Support Grafana), so full histogram +# capture here mirrors that existing always-on precedent. + +# Collect output from "estat nfs -jm 10" +[[inputs.execd]] + command = ["estat", "nfs", "-jm", "10"] + name_override = "estat_nfs" + signal = "none" + restart_delay = "30s" + data_format = "json" + tag_keys = [ + "name", + "axis" + ] + json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] + +# Collect output from "estat iscsi -jm 10" +[[inputs.execd]] + command = ["estat", "iscsi", "-jm", "10"] + name_override = "estat_iscsi" + signal = "none" + restart_delay = "30s" + data_format = "json" + tag_keys = [ + "name", + "axis" + ] + json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] + +# Collect output from "estat backend-io -jm 10" (stbtrace io equivalent) +[[inputs.execd]] + command = ["estat", "backend-io", "-jm", "10"] + name_override = "estat_backend-io" + signal = "none" + restart_delay = "30s" + data_format = "json" + tag_keys = [ + "name", + "axis" + ] + json_string_fields = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)", "microseconds"] + +# Convert estat string fields to integers so they are not dropped by Telegraf. +[[processors.converter]] + namepass = ["estat_nfs", "estat_iscsi", "estat_backend-io"] + [processors.converter.fields] + integer = ["iops(/s)", "avg latency(us)", "stddev(us)", "throughput(k/s)"] + +# Clone estat_* measurements as hist_estat_* to hold histogram data only. +# microseconds is removed from the originals (order 2 below) so it lives in +# hist_estat_* exclusively — no duplication. +# Keeps the original format "{20000,5},{30000,15}" compatible with import code. 
+[[processors.clone]] + order = 1 + name_prefix = "hist_" + namepass = ["estat_*"] + +# Drop microseconds from all original estat_* measurements after cloning. +# Covers both storage_io (estat_nfs/iscsi/backend-io) and playbook +# (estat_zpl/zvol/zio/etc) measurements in one place. +[[processors.strings]] + order = 2 + namepass = ["estat_*"] + fieldexclude = ["microseconds"] + +# Expand hist_estat_* microseconds histogram strings into per-bucket rows for +# Grafana heatmap support. Each "{upper_bound_us,count}" pair becomes a +# separate metric with an le tag (the bucket's upper bound) and a count +# field, replacing the opaque string. Runs after clone (order=1) and strings +# (order=2) so the microseconds field is still present in hist_estat_* at +# this point. +[[processors.starlark]] + order = 3 + namepass = ["hist_estat_*"] + source = ''' +def apply(metric): + ms = metric.fields.get("microseconds") + if ms == None: + return [metric] + + result = [] + for pair in ms[1:-1].split("},{"): + parts = pair.split(",") + if len(parts) == 2: + m = deepcopy(metric) + m.tags["le"] = parts[0] + for k in list(m.fields.keys()): + m.fields.pop(k) + m.fields["count"] = int(parts[1]) + result.append(m) + + return result if result else [metric] +''' + +# End of Storage I/O section +##############################################################################
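
Review note: the Starlark `apply()` in `telegraf.inputs.storage_io` only runs inside Telegraf, so the bucket-expansion logic is easier to check in isolation. The sketch below is a plain-Python approximation of that logic, not the processor itself. The function name `expand_histogram` and the `(name, tags, fields)` tuple standing in for a Telegraf metric are assumptions made for illustration. The sample string is the one quoted in the removed playbook comment.

```python
def expand_histogram(name, tags, fields):
    """Approximate the Starlark apply(): one output row per histogram bucket.

    A metric is modelled as a (name, tags, fields) tuple; this shape is an
    assumption for illustration, not Telegraf's metric type.
    """
    ms = fields.get("microseconds")
    if ms is None:
        # No histogram field: pass the metric through unchanged.
        return [(name, tags, fields)]

    rows = []
    # Strip the outer "{" and "}" and split on the "},{" between pairs,
    # so "{20000,5},{30000,15}" yields "20000,5" and "30000,15".
    for pair in ms[1:-1].split("},{"):
        parts = pair.split(",")
        if len(parts) == 2:
            row_tags = dict(tags)        # copy so rows do not share tags
            row_tags["le"] = parts[0]    # bucket upper bound, kept as a string
            rows.append((name, row_tags, {"count": int(parts[1])}))

    # If nothing parsed, fall back to the original metric, as the Starlark does.
    return rows if rows else [(name, tags, fields)]


sample = expand_histogram(
    "hist_estat_nfs",
    {"name": "read", "axis": "latency"},
    {"microseconds": "{20000,5},{30000,15},{40000,3},{50000,24}"},
)
for row in sample:
    print(row)
```

As in the Starlark version, every field other than `count` is dropped from the per-bucket rows, and a string that yields no valid pairs leaves the metric untouched.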