Closed
110 commits
cb2d6bc
Merged PR 52: add script to update DCGM when CUDA is newer than 12.8
RuiGaoMS Jun 16, 2025
ef14c84
Merged PR 64: Merge BugFix and SecurityFix from release/1.1 to main
Jun 23, 2025
acf0056
Merged PR 66: Alert-manager - fix validation result parse issue and i…
Jul 2, 2025
001abee
Merged PR 68: enable VC admin for users so the vc admin can stop the …
RuiGaoMS Jul 3, 2025
b12ed4f
Merged PR 45: Add feature: Dashboard
quge009 Jul 8, 2025
bb0738d
Merged PR 67: [New feature] add reverse proxy client container into p…
Jul 9, 2025
6670e5d
Merged PR 78: Update Dashboard: MTBF and MTBI tables
quge009 Jul 16, 2025
5e96404
Merged PR 82: add dind runtimeplugin in job protocol
Jul 17, 2025
7bfeafe
Merged PR 81: disable non-acr image
Jul 21, 2025
5773488
Merged PR 71: Node Issue Classifier - limit the triaged_hardware node…
Jul 21, 2025
cbcdb14
Merged PR 80: Add Feature: Copilot Frontend Plugin
quge009 Jul 23, 2025
991fca6
Merged PR 79: Add Feature: Dashboard - service for backup the Dashboa…
quge009 Jul 23, 2025
e451754
Merged PR 74: clean the virtualClusters which resourcesTotal is empty
RuiGaoMS Jul 23, 2025
27836f6
Merged PR 76: Remove internal storage service and replace PostgreSql …
RuiGaoMS Jul 23, 2025
9d32d5f
Merged PR 72: Log Manager - support node logs
Jul 24, 2025
42d4da4
Merged PR 83: add job summary data recorder to kusto
Jul 24, 2025
0c85cee
Merged PR 73: [New feature] Prometheus: add deployment of disk and bl…
Jul 25, 2025
b62d636
Merged PR 69: Fix the user's group list when changing group list
RuiGaoMS Jul 30, 2025
bfca3ad
Merged PR 92: enable a new field in plugin to set whether it can be v…
Jul 30, 2025
4de7ef2
Merged PR 90: increase the timeout of job server
Jul 30, 2025
bdd7f29
Merged PR 87: [New Feature] Add cert-expiration-checker in alert-manager
Jul 30, 2025
4860df7
Merged PR 91: [NewFeature] Add chat history clean button and request …
Jul 30, 2025
e135210
Merged PR 94: Revert 'Remove internal storage service and replace Pos…
Jul 30, 2025
d52f8ff
Merged PR 88: Add available_nodata status in Kusto SDK
abuccts Jul 30, 2025
84752f3
Merged PR 96: Fix bug: Copilot URL Alignment
quge009 Aug 1, 2025
c35028c
Merged PR 86: Automatic Failure Detector - monitor, detector and redi…
Aug 1, 2025
c69b8b2
Merged PR 95: Add deployment for cluster-local storage service
abuccts Aug 2, 2025
b63de3d
Merged PR 77: Add cluster-local storage service
abuccts Aug 2, 2025
6d4ea68
Merged PR 98: fix the certification missing when deploying without no…
RuiGaoMS Aug 2, 2025
f52746c
Merged PR 99: AKS update scripts
Aug 4, 2025
ccd3d82
Merged PR 97: fix vulnerabilities in several imeges including node.js…
RuiGaoMS Aug 4, 2025
7f3c0a9
Merged PR 93: enable managed identity to access blob in worker pod
RuiGaoMS Aug 4, 2025
e0b93c8
Merged PR 89: Add Feature: Copilot Backend
quge009 Aug 4, 2025
64918b2
Merged PR 101: Bug fix - deployment issues in copilot-chat
Aug 7, 2025
5d10997
Merged PR 102: Bug fix - fix dashboard-data-update deployment failures
Aug 7, 2025
4bcd05e
Merged PR 103: Bug fix - fix schedule interva, incompatible datetype,…
Aug 8, 2025
63b2625
Merged PR 105: adapt alert-parser to available no data status after …
Aug 10, 2025
9719a8f
Merged PR 104: Fix deployment in cluster local storage
abuccts Aug 10, 2025
d70c921
Merged PR 107: Fix bugs in cluster local storage
abuccts Aug 11, 2025
3b1f9ce
Merged PR 112: Alert-parser: fix bug from available_nodata to cordoned
Aug 12, 2025
e53fc4e
Merged PR 110: fix the deploy problems in device-plugin
RuiGaoMS Aug 12, 2025
13711a5
Merged PR 108: Merged PR 76: Remove internal storage service and repl…
RuiGaoMS Aug 12, 2025
9718b95
Merged PR 111: support prometheus disk pvc configurable
Aug 12, 2025
330c421
Merged PR 114: Configure ipoib when starting cluster local storage
abuccts Aug 13, 2025
7187f23
Merged PR 113: Bug fix - local storage ssh rsync service init and pyl…
Aug 13, 2025
5fc91ca
fix bug of alert parser race condition due to prometheus delay
yukirora Sep 1, 2025
a4ee13b
Merged PR 134: Fix syntax error in cluster local storage
abuccts Aug 28, 2025
880f6d1
revert raid setup due to vm boot failure
yukirora Sep 1, 2025
a8929d3
Merge pull request #49 from microsoft/yutji/revert-raid
yukirora Sep 1, 2025
4298321
Merge pull request #48 from microsoft/yutji/cherrypick-pr134
yukirora Sep 1, 2025
94d93cc
Merge pull request #47 from microsoft/yutji/fix-alert-parser
yukirora Sep 1, 2025
ef6da8d
Merged PR 121: fix bug: dependency alerts
quge009 Aug 14, 2025
2e2d948
Merged PR 124: Fix Security Alert: replace rye with pip
quge009 Aug 18, 2025
acc275a
Merged PR 106: Update: Dashboard - Job table src db
quge009 Aug 18, 2025
37407ff
Merged PR 120: Add Feature: Copilot - Support for SGLANG endpoint
quge009 Aug 20, 2025
39914eb
Merged PR 118: Feature: Support streaming chat in chat plugin
Aug 21, 2025
80fefb1
Fix bug: Dashboard data backup (re-submit) (#54)
quge009 Sep 5, 2025
059605a
Move pai-runtime, framework controller and hivedscheduler to PAI repo…
hippogr Sep 5, 2025
cd7f54d
Copilot - code refactor (#57)
quge009 Sep 5, 2025
a6d5c6e
License - Add license headers from Microsoft for src (#67)
yzygitzh Sep 9, 2025
2f7ad7a
Bootstrap - add H200 entry for VMSS provision scripts
hippogr Sep 10, 2025
6804c9f
Copilot: Add feature: Supporting Dashboard Metric Inquiry (#59)
quge009 Sep 10, 2025
139f28f
Feature: Support folding reasoning content for thinking model (#60)
zhogu Sep 10, 2025
1af64a8
Feature: Make message showed as markdown in chat-plugin and copied as…
zhogu Sep 10, 2025
a3ea170
Copilot: Update: Add User group membership authenticate (#68)
quge009 Sep 10, 2025
600ab84
License - Add license headers from Microsoft for contrib (#65)
yzygitzh Sep 11, 2025
46f48f0
Copilot: Add feature: collect Question/Answer/Feedback for analytic p…
quge009 Sep 11, 2025
1fbd44d
fix docker files for several services to upgrade system and keep pack…
hippogr Sep 11, 2025
e08773d
License - Add license headers from Microsoft for examples (#64)
yzygitzh Sep 11, 2025
b04eb69
License - Add license headers from Microsoft for deployment (#66)
yzygitzh Sep 11, 2025
3991054
Add webportal-dind to run webportal in a container (#42)
hippogr Sep 11, 2025
fb52875
Copilot: Data Cleanup: remove unnecessary data (#73)
quge009 Sep 12, 2025
84d8e51
Add missing ssh-proxy and utilization reporter service (#55)
hippogr Sep 12, 2025
a05308c
Tools - add common tools for LTP administration (#62)
hippogr Sep 12, 2025
d57dbc4
remove the default azure disk value so we can use local disk without …
hippogr Sep 12, 2025
a054f16
remove public-key checking to avoid checking failure when there is no…
hippogr Sep 12, 2025
fd9fee2
Feature: Add a new service "model-proxy" to support to redirect reque…
zhogu Sep 15, 2025
9d63d7f
Update k8s deployment scripts for Kubespray (#72)
abuccts Sep 15, 2025
2d7b405
Release - Lucia Training Platform v1.3 (#93)
yukirora Oct 10, 2025
b56c29e
fix copy error on docker 28 (#97)
zhogu Oct 16, 2025
4cfbd3b
Bootstrap - Fix race condition of dpkg lock in vmss extension (#94)
hippogr Oct 20, 2025
73f0efc
fix security alert: remove legacy dependency that generates a securit…
quge009 Oct 20, 2025
3ee2327
Improve: copilot: test experience (#100)
quge009 Oct 27, 2025
ffc2e39
Contrib - Add webportal plugin for cluster local storage (#69)
abuccts Nov 4, 2025
2165ad2
Update chat-plugin to support model-proxy (#101)
zhogu Nov 5, 2025
bcca646
Security update for docker images in Oct. 2025 (#96)
hippogr Nov 5, 2025
b162d22
Add Doc: manual, user (#111)
quge009 Nov 5, 2025
8c2c69f
add externalTrafficPolicy to constrain the traffic to paimaster inste…
hippogr Nov 6, 2025
8365ac4
fix deployment template issue (#106)
yukirora Nov 6, 2025
64ab8d0
CI/CD update: add a new workflow to test building all images; skip pu…
zhogu Nov 7, 2025
8d611a4
Add tolerations and priority class for cluster local storage (#119)
abuccts Nov 7, 2025
35b2a3c
enable imagelist argument for image build script (#95)
hippogr Nov 7, 2025
56f2a32
CICD FIX: disable building all image during PR. prefer to trigger it …
zhogu Nov 7, 2025
6c49415
Webportal support jobType in job protocol and dedicated parameters in…
zhogu Nov 7, 2025
4eca0c5
RestServer - support inference/training jobType in job protocal (#105)
zhogu Nov 7, 2025
adfb217
RestServer - add image regex to limit the valid image (#107)
yukirora Nov 7, 2025
fc2d2dc
Update Cilium version to v1.17.5 (#56)
hippogr Nov 7, 2025
7e1a263
Fix deployment on arm64 architecture (#109)
abuccts Nov 8, 2025
a3ba466
Database - create postgresql backend and support backend switch with …
yukirora Nov 8, 2025
bf4eef9
ModelProxy supports the "jobType" field of inference jobs (#118)
zhogu Nov 8, 2025
0c24e48
Release - Lucia Training Platform v1.4 (#131)
yukirora Dec 2, 2025
ad0056a
Fix job exporter compatibility issue on ARM (#132)
abuccts Jan 12, 2026
eac9678
support assign job name (#135)
zhogu Jan 15, 2026
a9cc4de
Dec. 2025 security update (#133)
hippogr Jan 15, 2026
c977d35
Fix the module "logger" missing when running DCGM (#137)
hippogr Jan 16, 2026
381cf37
list disappeared vc jobs (#112)
hippogr Jan 20, 2026
f367fc4
Mount ssh key pairs for cluster local storage (#136)
abuccts Jan 20, 2026
681cd4e
Release - Lucia Training Platform v1.5 (#149)
hippogr Feb 10, 2026
6f585c5
Update webportal to node.js 24 with necessary packages updating
Mar 24, 2026
74ad562
Bump brace-expansion from 1.1.12 to 1.1.13 in /src/rest-server
dependabot[bot] Mar 31, 2026
20 changes: 14 additions & 6 deletions .azure-pipelines/cluster-update-all.yml
@@ -31,6 +31,14 @@ jobs:
pool:
vmImage: 'ubuntu-24.04'
steps:
- checkout: self
submodules: false
persistCredentials : true
- powershell: |
$header = "AUTHORIZATION: bearer $(System.AccessToken)"
git -c http.extraheader="$header" submodule sync
git -c http.extraheader="$header" submodule update --init --force --depth=1
displayName: Checkout Submodule
- script: python -m pip install pyyaml jinja2 paramiko etcd3 protobuf==3.20.3 kubernetes gitpython
displayName: Install python libs
- script: |
@@ -48,17 +56,17 @@ jobs:
destinationFolder: '$(Pipeline.Workspace)/config'
cleanDestinationFolder: true
- script: |
mv $(Pipeline.Workspace)/config/pylon-configuration /tmp/
mv $(Pipeline.Workspace)/config/auth-configuration /tmp/
ls -l /tmp/auth-configuration
displayName: Arrange Config Files
- script: |
# Build all services # Skip "frameworkcontroller" "hivedscheduler" "openpai-runtime" "k8s-dashboard" "marketplace-db" "node-exporter" "prometheus"
$(Build.Repository.LocalPath)/build/pai_build.py build -c $(Pipeline.Workspace)/config/cluster-configuration -s base-image cleaning-image cluster-configuration device-plugin job-exporter log-manager grafana alert-manager watchdog internal-storage postgresql database-controller fluentd pylon rest-server webportal
# Build all services # Skip "k8s-dashboard" "marketplace-db" "node-exporter" "prometheus"
$(Build.Repository.LocalPath)/build/pai_build.py build -c $(Pipeline.Workspace)/config/cluster-configuration -s base-image cleaning-image cluster-configuration device-plugin job-exporter log-manager grafana alert-manager frameworkcontroller hivedscheduler openpai-runtime watchdog internal-storage postgresql database-controller fluentd pylon rest-server webportal
displayName: 'Build all Services images'
condition: or(eq(${{ parameters.rebuildImage }}, 'true'), eq(${{ parameters.repushImage }}, 'true'))
- script: |
# Build all services # Skip "frameworkcontroller" "hivedscheduler" "openpai-runtime" "k8s-dashboard" "marketplace-db" "node-exporter" "prometheus"
$(Build.Repository.LocalPath)/build/pai_build.py push -c $(Pipeline.Workspace)/config/cluster-configuration -s base-image cleaning-image cluster-configuration device-plugin job-exporter log-manager grafana alert-manager watchdog internal-storage postgresql database-controller fluentd pylon rest-server webportal
# Push all services # Skip "k8s-dashboard" "marketplace-db" "node-exporter" "prometheus"
$(Build.Repository.LocalPath)/build/pai_build.py push -c $(Pipeline.Workspace)/config/cluster-configuration -s base-image cleaning-image cluster-configuration device-plugin job-exporter log-manager grafana alert-manager frameworkcontroller hivedscheduler openpai-runtime watchdog internal-storage postgresql database-controller fluentd pylon rest-server webportal
displayName: 'Push all Services'
condition: eq(${{ parameters.repushImage }}, 'true')
- task: AzureCLI@2
@@ -97,7 +105,7 @@ jobs:
echo "Testing rest-server $(paiWebUrl)/rest-server/api/v2/info"
curl $(paiWebUrl)/rest-server/api/v2/info
echo "Checking virtual cluster status..."
vc_info=$(curl -H "Authorization: Bearer $(paiWebToken)" -s $(paiWebUrl)/rest-server/api/v2/virtualclusters)
vc_info=$(curl -H "Authorization: Bearer $(paiWebToken)" -s $(paiWebUrl)/rest-server/api/v2/virtual-clusters)
if [ $? -ne 0 ]; then
echo "Failed to access virtual cluster API"
exit 1
52 changes: 35 additions & 17 deletions .azure-pipelines/cluster-update-changes.yml
@@ -12,12 +12,35 @@ variables:
value: '$(Build.BuildId)'
- group: 'pai-cicd-cluster'

parameters:
- name: "rebuildImage"
type: boolean
default: true
displayName: "Rebuild images"
- name: "repushImage"
type: boolean
default: true
displayName: "Repush images"
- name: "redeployService"
type: boolean
default: true
displayName: "Redeploy services"

jobs:
- job: Build
displayName: 'Build and Deploy'
timeoutInMinutes: 120
pool:
vmImage: 'ubuntu-24.04'
steps:
- checkout: self
submodules: false
persistCredentials : true
- powershell: |
$header = "AUTHORIZATION: bearer $(System.AccessToken)"
git -c http.extraheader="$header" submodule sync
git -c http.extraheader="$header" submodule update --init --force --depth=1
displayName: Checkout Submodule
- script: |
if [ "$(Build.Reason)" == "PullRequest" ]; then
# Fetch the target branch
@@ -32,14 +55,7 @@ jobs:
fi

folders=$(echo "$changed_files" | grep '^src/' | awk -F'/' '{print $2}' | sort -u)

# Check if "hivedscheduler" is in the folder list
if echo "$folders" | grep -q "hivedscheduler"; then
folders=$(echo "$folders" | grep -v "cluster-configuration" | grep -v "rest-server")
# Add "cluster-configuration" to the head and "rest-server" to the end
folders="cluster-configuration $folders rest-server"
fi

folders=$(echo "$folders" | tr '\n' ' ')
# Store the folder list in a pipeline variable
echo "Changed folders: $folders"
echo "##vso[task.setvariable variable=changed_folders]$folders"
@@ -72,27 +88,29 @@ jobs:
destinationFolder: '$(Pipeline.Workspace)/config'
cleanDestinationFolder: true
- script: |
mv $(Pipeline.Workspace)/config/pylon-configuration /tmp/
mv $(Pipeline.Workspace)/config/auth-configuration /tmp/
ls -l /tmp/auth-configuration
displayName: Arrange Config Files
condition: eq(variables['has_changed'], 'true')
- script: |
# Build the changed services
# Skip "openpai-runtime" due to the image built by other repo
changed_services=$(echo $(changed_folders) | tr ' ' '\n' | grep -v "openpai-runtime" | tr '\n' ' ')
echo "Building folders" $(changed_folders)
changed_services=$(echo $(changed_folders) | tr ' ' '\n' )
echo "Building: " $changed_services
$(Build.Repository.LocalPath)/build/pai_build.py build -c $(Pipeline.Workspace)/config/cluster-configuration -s $changed_services
displayName: 'Build Changed Services'
condition: eq(variables['has_changed'], 'true')
condition: and( eq( variables['has_changed'], 'true'), or(eq(${{ parameters.rebuildImage }}, 'true'), eq(${{ parameters.repushImage }}, 'true')) )
- script: |
# Push the changed services
# Skip "openpai-runtime" due to the image built by other repo
changed_services=$(echo $(changed_folders) | tr ' ' '\n' | grep -v "openpai-runtime" | tr '\n' ' ')

changed_services=$(echo $(changed_folders) | tr ' ' '\n' )
echo "Pushing: " $changed_services
$(Build.Repository.LocalPath)/build/pai_build.py push -c $(Pipeline.Workspace)/config/cluster-configuration -s $changed_services
displayName: 'Push Changed Services'
condition: eq(variables['has_changed'], 'true')
condition: and(eq(${{ parameters.repushImage }}, 'true'), eq(variables['has_changed'], 'true') )
- task: AzureCLI@2
displayName: 'Azure CLI get credentials of aks and deploy the pai services'
condition: eq(variables['has_changed'], 'true')
condition: and(eq(${{ parameters.redeployService }}, 'true'), eq(variables['has_changed'], 'true'))
inputs:
azureSubscription: $(azureSubscriptionEndpoint)
scriptType: bash
@@ -128,7 +146,7 @@ jobs:
echo "Testing rest-server $(paiWebUrl)/rest-server/api/v2/info"
curl $(paiWebUrl)/rest-server/api/v2/info
echo "Checking virtual cluster status..."
vc_info=$(curl -H "Authorization: Bearer $(paiWebToken)" -s $(paiWebUrl)/rest-server/api/v2/virtualclusters)
vc_info=$(curl -H "Authorization: Bearer $(paiWebToken)" -s $(paiWebUrl)/rest-server/api/v2/virtual-clusters)
if [ $? -ne 0 ]; then
echo "Failed to access virtual cluster API"
exit 1
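
Reviewer's sketch (not part of the PR): the changed-folder detection and the parameter-gated `condition:` expressions in this pipeline can be modelled in plain shell for local reasoning. The function names `changed_services` and `should_run` and the sample file paths are hypothetical, and the sketch assumes the parameters arrive as the strings `true` or `false`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read changed file paths on stdin and print the unique
# top-level folders under src/ as one space-separated line.
changed_services() {
  grep '^src/' | awk -F'/' '{print $2}' | sort -u | tr '\n' ' ' | sed 's/ *$//'
}

# Hypothetical helper: mirror the stage conditions from the diff.
# Arguments: stage has_changed rebuildImage repushImage redeployService
should_run() {
  local stage=$1 has_changed=$2 rebuild=$3 repush=$4 redeploy=$5
  # Every gated stage also requires has_changed to be true.
  [ "$has_changed" = "true" ] || return 1
  case "$stage" in
    build)  [ "$rebuild" = "true" ] || [ "$repush" = "true" ] ;;
    push)   [ "$repush" = "true" ] ;;
    deploy) [ "$redeploy" = "true" ] ;;
    *)      return 1 ;;
  esac
}

# Sample input: two files in one service, one in another, one outside src/.
services=$(printf '%s\n' \
  src/rest-server/index.js README.md src/webportal/app.js src/rest-server/x.js \
  | changed_services)
echo "Changed: $services"   # prints "Changed: rest-server webportal"

# Repush alone is enough to trigger the build stage.
if should_run build true false true false; then
  echo "build stage would run"
fi
```

Unlike the pipeline step, this sketch trims the trailing space that `tr '\n' ' '` leaves behind; the pipeline itself does not appear to need that, since the list is word-split when passed to `pai_build.py`.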
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,2 @@
. @microsoft/ltpadmin
.github @microsoft/ltpadmin
121 changes: 121 additions & 0 deletions .github/workflows/build-all.yaml
@@ -0,0 +1,121 @@
name: Build All Services

permissions:
contents: read

on:
pull_request:
types: [opened, reopened, closed]
branches: ["release/*"]
release:
types: [published]
workflow_dispatch:
inputs:
branch:
description: 'The branch name or tag to run the workflow on'
required: true
default: 'dev'
type: string

env:
TAG: ${{ github.run_number }}

jobs:
build:
name: Build All
runs-on: [self-hosted, paicicd]
timeout-minutes: 120
environment: auto-test
if: github.event_name != 'pull_request' || ( github.event.action == 'opened' || github.event.action == 'reopened' || github.event.pull_request.merged == true)
container:
image: ubuntu:latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
steps:
- name: Install git
run: |
DEBIAN_FRONTEND=noninteractive apt update
DEBIAN_FRONTEND=noninteractive apt install -y git

- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
submodules: false
ref: ${{ github.event.inputs.branch || github.ref }}

- name: Get All Services
id: all
run: |
services=$(ls -1d src/* | awk -F'/' '{print $2}' | tr '\n' ' ')
skipped_services="base-image cleaning-image dev-box marketplace-db marketplace-restserver marketplace-webportal utilization-reporter"
for skip in $skipped_services; do
services=$(echo $services | sed "s/\b$skip\b//g")
done
echo "All services: $services"
echo "services=$services" >> $GITHUB_OUTPUT

- name: Install Package
if: steps.all.outputs.services != ''
run: |
DEBIAN_FRONTEND=noninteractive apt install -y python3 python-is-python3 pip git unzip ca-certificates curl apt-transport-https lsb-release gnupg parallel
curl -sL https://aka.ms/InstallAzureCLIDeb | bash
curl -fsSL https://get.docker.com | sh

- name: Install python libs
if: steps.all.outputs.services != ''
run: python -m pip install --break-system-packages pyyaml jinja2 paramiko etcd3 protobuf==3.20.3 kubernetes gitpython

- name: Decode and unzip config file
if: steps.all.outputs.services != ''
run: |
echo "${{ secrets.CONFIG_FILE_B64 }}" | base64 -d > config.zip
mkdir -p $GITHUB_WORKSPACE/config
unzip -o config.zip -d $GITHUB_WORKSPACE/config
ls -l $GITHUB_WORKSPACE/config

- name: Arrange Config Files
if: steps.all.outputs.services != ''
run: |
rm -rf /tmp/auth-configuration
mv $GITHUB_WORKSPACE/config/auth-configuration /tmp/
ls -l /tmp/auth-configuration

- name: Log in to GHCR
run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin

- name: Build Images of Services
if: steps.all.outputs.services != ''
run: |
all_services="${{ steps.all.outputs.services }}"
echo "Building: $all_services"
echo "--------------------------------"
failed_services=""
for service in $all_services; do
if echo "$service" | grep -q "alert-manager"; then
echo "alert-manager is in the changed services"
# Build specific images in alert-manager
echo "Building specific alert-manager images"
$GITHUB_WORKSPACE/build/pai_build.py build \
-c $GITHUB_WORKSPACE/config/cluster-configuration \
-s alert-manager \
-i abnormal-detector,alert-handler,alert-parser,cert-expiration-checker,cluster-utilization,job-data-recorder,job-status-change-notification,node-failure-detection,node-issue-classifier,nvidia-gpu-low-perf-fixer,redis-monitoring
fi
echo "Building service: $service"
if python3 $GITHUB_WORKSPACE/build/pai_build.py build \
-c $GITHUB_WORKSPACE/config/cluster-configuration \
-s $service; then
echo "✓ Successfully built: $service"
else
echo "✗ Failed to build: $service"
failed_services="$failed_services $service"
fi
done

if [ -n "$failed_services" ]; then
echo "::error::Failed to build services:$failed_services"
echo "FAILED_SERVICES=$failed_services"
exit 1
else
echo "All services built successfully"
fi
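
Reviewer's sketch (not part of the PR): the skip-list filter and the "build everything, report failures at the end" loop in this workflow, reduced to two shell functions. One assumption worth flagging: in GNU sed, `\b` treats `-` as a word boundary, so `s/\bbase-image\b//g` would also match inside a hypothetical name such as `my-base-image`. The sketch uses exact string comparison instead. `filter_services`, `build_all`, and `fake_build` are hypothetical names; `fake_build` is only a stand-in for `build/pai_build.py`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the services from $1 that are not listed in $2.
# Both arguments are space-separated lists; matching is exact, per name.
filter_services() {
  local result="" svc skip keep
  for svc in $1; do
    keep=1
    for skip in $2; do
      if [ "$svc" = "$skip" ]; then keep=0; break; fi
    done
    if [ "$keep" -eq 1 ]; then result="${result:+$result }$svc"; fi
  done
  printf '%s\n' "$result"
}

# Stand-in for the real build command: fails only for the name "broken".
fake_build() { [ "$1" != "broken" ]; }

# Try every service, keep going after a failure, and fail once at the end.
build_all() {
  local failed="" svc
  for svc in $1; do
    if fake_build "$svc"; then
      echo "built: $svc"
    else
      failed="${failed:+$failed }$svc"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed: $failed"
    return 1
  fi
}

filter_services "alert-manager base-image rest-server dev-box" "base-image dev-box"
# prints "alert-manager rest-server"
build_all "rest-server webportal"
```

Collecting failures instead of exiting on the first one matches the intent of the workflow step: a single run reports every service that does not build.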