Skill

SkillsSoftware Development › Systems & low-level

exec-slurm-compile

Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.

Freerisk: low
execslurmcompiledockerpytorch

The full skill

— name: exec-slurm-compile description: Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node. license: Apache-2.0 metadata: author: NVIDIA Corporation — # Compile TensorRT-LLM on SLURM Cluster Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers. ## When to Use | Scenario | Use This Skill? | |———-|—————-| | User wants to compile TRT-LLM on a SLURM cluster | Yes | | User is already on a compute node and wants to compile | No — use `exec-local-compile` skill instead | ## Finding the Docker Image The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself: “` <repo_dir>/jenkins/current_image_tags.properties “` Read this file to find the current image URL (e.g., `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901`). ## Pre-dumping the Container Image (enroot import) SLURM clusters using enroot/pyxis require a `.sqsh` container image. To avoid download overhead at compile time, **pre-dump the image in advance** using the `enroot-import` companion script: “`bash # Basic usage — submits a SLURM job on a CPU partition to import the image enroot-import –partition cpu_datamover –debug <docker_image_url> “` The script submits an `sbatch` job that runs `enroot import docker://<image_url>` and produces a `.sqsh` file in the current directory. The output on stdout is the SLURM job ID. ### enroot-import flags | Flag | Description | |——|————-| | `-p, –partition` | SLURM partition for the import job (use a CPU partition like `cpu_datamover`) | | `-d, –debug` | Enable debug output and preserve the SLURM log (recommended) | | `-o, –output` | Custom output path for the `.sqsh` file | | `-A, –account` | SLURM account (defaults to user's first account) | | `-t, –time` | Time limit for the import job (default: 1 hour) | | `-n, –just-print` | Print the sbatch command without executing | | `-J, –job-name` | Custom job name | ### enroot-import workflow 1. Read the image tag from `jenkins/current_image_tags.properties` in the TRT-LLM repo. 2. Run `enroot-import` to submit the import job: “`bash cd <directory_where_sqsh_should_be_stored> <path_to>/enroot-import –partition cpu_datamover –debug <image_url> “` **IMPORTANT:** Convert `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx` to `urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx` to avoid credential issues. 3. Wait for the import job to complete (`squeue -j <job_id>`). 4. The resulting `.sqsh` file is the `container_image` used in the compile step. ## Prerequisites The user must provide (or you must ask for) these values: | Parameter | Description | Example | |———–|————-|———| | `container_image` | Path to `.sqsh` container image (see enroot import above) | `/path/to/pytorch.sqsh` | | `repo_dir` | Path to the TensorRT-LLM repository | `/path/to/TensorRT-LLM` | | `mount_dir` | Top-level directory to bind-mount into the container | `/shared/users` | | `partition` | SLURM partition | `batch` | | `account` | SLURM account | `my_account` | Optional parameters: | Parameter | Description | Default | |———–|————-|———| | `jobname` | SLURM job name | `trtllm-compile.<username>` | | `gpu_count` | Number of GPUs to request | `4` | | `time_limit` | Job time limit | `02:00:00` | | `arch` | GPU architecture(s) for `-a` flag | `100-real` | | `extra_build_args` | Extra flags for `build_wheel.py` | (none) | ## Companion Scripts This skill includes three companion scripts in `scripts/`: | Script | Purpose | |——–|———| | `enroot-import` | Pre-dump a Docker image to `.sqsh` via a SLURM batch job | | `submit_compile.sh` | Template for submitting the SLURM job — copy and customize | | `compile.slurm` | SLURM batch script — launches the container and calls `compile.sh` | | `compile.sh` | Runs inside the container — executes `build_wheel.py` | Scripts directory: `skills/exec-slurm-compile/scripts/` ## Instructions Follow these steps in order: ### Step 0: Resolve the Container Image (if needed) If the user does not already have a `.sqsh` container image: 1. Read the Docker image tag from `<repo_dir>/jenkins/current_image_tags.properties`. 2. Use `enroot-import` to pre-dump it: “`bash cd <directory_for_sqsh_files> <scripts_dir>/enroot-import –partition cpu_datamover –debug <image_url> “` 3. Monitor the import job with `squeue -j <job_id>`. 4. Once complete, the `.sqsh` file path becomes the `container_image` parameter. If the user already has a `.sqsh` file, skip this step. ### Step 1: Gather Information Ask the user for any missing prerequisite values listed above. At minimum you need: – `container_image` (or the Docker image URL — then run Step 0 first) – `repo_dir` – `mount_dir` – `partition` and `account` If the user has used this workflow before, check if previous values are stored in memory files. ### Step 2: Prepare the Scripts Directory The compile scripts must be accessible from inside the container (i.e., under `mount_dir`). Either: **Option A** — Copy companion scripts to a location under `mount_dir`: “`bash scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts mkdir -p ${scripts_dir}/log cp skills/exec-slurm-compile/scripts/compile.sh ${scripts_dir}/ cp skills/exec-slurm-compile/scripts/compile.slurm ${scripts_dir}/ chmod +x ${scripts_dir}/compile.sh ${scripts_dir}/compile.slurm “` **Option B** — If the user already has scripts at a known location, use those directly. ### Step 3: Submit the Job Run `sbatch` from the login node (or a node with SLURM client access): “`bash sbatch \ –nodes=1 –ntasks=1 –ntasks-per-node=1 \ –gres=gpu:<gpu_count> \ –partition=<partition> \ –account=<account> \ –job-name=<jobname> \ –time=<time_limit> \ <scripts_dir>/compile.slurm \ <container_image> <mount_dir> <scripts_dir> <repo_dir> “` Capture and report the job ID from the `sbatch` output. ### Step 4: Monitor the Job (Proactive — Do NOT Wait for User) **You MUST actively poll the job until it completes.** Do not submit and walk away. “`bash # Check job status (repeat every 30-60 seconds) squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R" # Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead) tail -30 <scripts_dir>/log/compile_<job_id>.srun.log “` **Monitoring loop:** 1. Poll `squeue -j <job_id>` to check state 2. If `PD` (pending) — report the reason, keep polling every 30-60s 3. If `R` (running) — tail the build log every 30-60s; look for `[XX%] Building`, errors, or completion 4. If the job disappears from `squeue`, it has finished — proceed to Step 5 5. If `F` (failed) — immediately read the full log and report the error **Progress indicators to look for in the log:** – `[XX%] Building CXX object…` — compilation progress – `Linking CXX…` — link phase – `FAILED:`, `error:`, `fatal error:` — build failure – `Successfully built` — success ### Step 5: Verify the Build Once the job completes, check for success: “`bash # Check SLURM exit code sacct -j <job_id> –format=JobID,State,ExitCode,Elapsed # Check the build log for errors tail -50 <scripts_dir>/log/compile_<job_id>.srun.log “` A successful build ends with a message like `Successfully built tensorrt_llm` or completes without error. ## Common Build Flags Reference | Flag | Description | |——|————-| | `–trt_root /usr/local/tensorrt` | TensorRT installation path (standard in NVIDIA containers) | | `–benchmarks` | Build the C++ benchmarks | | `-a "100-real"` | Target architecture — `100` for Blackwell, `90` for Hopper, etc. | | `–nvtx` | Enable NVTX markers for profiling | | `–no-venv` | Skip virtual environment creation | | `-ccache` | Use ccache to speed up recompilation | | `–skip_building_wheel` | Build in-place without creating a wheel file | | `-f` | Fast build — skip some kernels for faster dev compilation | | `-c` | Clean build — wipe build directory before building | Common architecture values: – `"100-real"` — Blackwell (B200, GB200) – `"90-real"` — Hopper (H100, H200) – `"89-real"` — Ada Lovelace (L40S) – `"80-real"` — Ampere (A100) – `"90;100-real"` — Multiple architectures ## Troubleshooting | Issue | Solution | |——-|———-| | `sbatch: error: invalid partition` | Verify partition name with `sinfo -s` | | `sbatch: error: invalid account` | Check available accounts with `sacctmgr show assoc user=$USER` | | Container image not found | Verify the `.sqsh` path exists and is readable | | Build fails with missing TensorRT | Ensure `–trt_root` points to the correct path inside the container | | Build OOM (out of memory) | Reduce parallelism with `-j <N>` flag to `build_wheel.py` | | `srun: error: Unable to create step` | The node may lack enroot/pyxis — check with cluster admin | | Job stuck in `PD` state | Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority) | | `enroot import` fails with auth error | Check `~/.config/enroot/.credentials` has the correct registry credentials | | `enroot import` produces empty/corrupt `.sqsh` | Re-run with `–debug` and check the SLURM log; verify the image URL has no `https://` prefix | | Weird compile issues | Retry with a clean build (`-c` flag) | | `QOSGrpNodeLimit` shown in `NODELIST(REASON)` | Not a blocker, just wait for the job to get scheduled | ## Example Interaction **User**: "Compile TRT-LLM on the OCI cluster" **Agent actions**: 1. Ask for container image path, repo path, mount dir (if not known) 2. Confirm partition/account for OCI cluster 3. Copy scripts to accessible location under mount_dir 4. Submit with `sbatch` 5. Report job ID 6. Monitor with `squeue` until complete 7. Check logs and report success/failure