A lightweight job submission tool that supports both local execution and SLURM cluster submissions. It uses Jinja2 templates and YAML configuration to manage job parameters and execution.
For automated setup in your Python repository (on the MLCloud):
- Clone submit into your project:
  git clone <submit-repo-url> submit/
- Run automated setup: Start an interactive session with
  srun --partition=2080-galvani --gres=gpu:1 --pty bash
  and then run
  python3 -m submit.init
  Consider the following additional arguments:
  - --non-interactive: run setup without prompts (yolo mode).
  - --force: overwrite existing configuration files.
  - --run-yaml-only: only rebuild the run.yaml file.
  - --singularity-only: only rebuild the Singularity.def and build_container.sh files.
  - --verbose: enable verbose logging for debugging script discovery issues.
- Build container (if prompted or manually):
  ./build_container.sh # Run in SLURM interactive session for cloud usage
Before setup:
your-project/
├── submit/ # Cloned submit repository
│ ├── submit.py
│ ├── init.py
│ ├── templates/
│ └── examples/ # Example configuration files
│ ├── run.yaml
│ └── Singularity.def
├── scripts/ # Your Python scripts (optional)
│ ├── train.py
│ └── evaluate.py
├── pyproject.toml # Or setup.py, requirements.txt
└── src/ # Your source code
After setup:
your-project/
├── submit/
│ ├── submit.py
│ ├── init.py
│ ├── run.yaml # Generated configuration
│ ├── templates/
│ └── examples/ # Example files (unchanged)
├── scripts/ # Your scripts (discovered automatically)
├── Singularity.def # Generated container definition
├── build_container.sh # Generated build script
├── python.sif # Built container (after running build script)
├── logs/ # Generated logs directory
└── pyproject.toml
The automated setup will:
- Discover Python scripts in **/scripts/*.py directories
- Generate container definition based on your pyproject.toml/setup.py
- Create run configuration with found scripts
- Optionally build the Singularity container
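The discovery step can be pictured with a few lines of Python. This is a minimal sketch, not the code in init.py: the function name discover_scripts and the choice to key scripts by file stem are assumptions made for illustration.

```python
from pathlib import Path


def discover_scripts(repo_root):
    """Map script names to repo-relative paths for every scripts/*.py file.

    Hypothetical helper: names are taken from the file stem, so two files
    with the same stem in different scripts/ directories would collide
    (the later one wins in this sketch).
    """
    root = Path(repo_root)
    found = {}
    # "**/scripts/*.py" matches a scripts/ directory at any depth,
    # including one directly under the repository root.
    for path in sorted(root.glob("**/scripts/*.py")):
        found[path.stem] = path.relative_to(root).as_posix()
    return found
```

For the layout shown above, this would return entries such as train and evaluate pointing at scripts/train.py and scripts/evaluate.py.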
My current workflow is as follows:
- Start by cloning submit in your working repository (e.g. $WORK/repos/my-repo/submit)
- Next, copy submit/examples/* to submit/*, e.g. via
  cp -rf ./submit/examples/* ./submit/
- Container/Environment setup:
  - Option A (Singularity): I prefer running jobs on the ML Cloud using a Singularity container (cf. https://portal.mlcloud.uni-tuebingen.de/user-guide/tutorials/singularity/). To do so, copy and modify the submit/Singularity.def file in your working repository. Then, start an interactive session (using e.g. srun ...) and build the Singularity container with the following command:
# Set cache and tmp directories for the Singularity build - (not optimal)
export SINGULARITY_CACHEDIR="/scratch_local/$USER-$SLURM_JOBID"
export SINGULARITY_TMPDIR="/scratch_local/$USER-$SLURM_JOBID"
# Build the singularity containers
singularity build --fakeroot --force --bind /mnt:/mnt --nv python.sif submit/Singularity.def
  - Option B (Conda): Alternatively, you can run jobs using a Conda environment. You don't need to build a container. Instead, define a runtime entry with python_cmd: "python" and a pre_command that activates your Conda environment before the script runs.
- Next, open and modify the submit/run.yaml file in the main submit directory. It is important to change the entries below scripts, regression... to the scripts you want to run and have in the repo. (Note: default_args is optional, so in most cases this section can be removed).
- From the working repository, run jobs with the following command structure:
  python3 submit/submit.py --mode [local|cloud_local|slurm] --script <script_name> [--slurm_args <slurm_args>] [--script_args <script_args>]
You can find some more details below.
Remarks:
- If you want to use JAX, remember to install it with CUDA dependencies (e.g. pip install -U "jax[cuda12]")
- All examples/* files are excluded when placed in the main submit directory, so they do not mess with this repository.
- Python 3.6+
- Jinja2
- PyYAML
This tool is designed to run on the ML Cloud without the need for a python environment or package installations.
The tool uses a YAML configuration file (run.yaml) to define:
- Execution modes (local, cloud_local, slurm)
- Runtime definitions (venv, Conda, Singularity, or plain Python)
- Template paths
- Script configurations
- Default arguments
Example configuration structure:
mode:
  local:
    template: "./submit/templates/local_job_cmd.j2"
    runtime: "venv"
    shell_executable: "bash"
  cloud_local:
    template: "./submit/templates/cloud_local_job_cmd.j2"
    runtime: "conda"
    shell_executable: "bash"
  slurm:
    template: "./submit/templates/slurm_job.sh.j2"
    runtime: "singularity"
    slurm_defaults:
      nodes: 1
      ntasks: 1
      slurm_log_dir: "./logs"
    partition_defaults:
      2080-galvani:
        gres: "gpu:1"
        cpus_per_task: 12
        mem_per_cpu: "12G"
        time_limit: "3-00:00:00"
    partition_aliases:
      2080: "2080-galvani"
runtime:
  venv:
    python_cmd: "./.venv/bin/python"
  conda:
    python_cmd: "python"
    pre_command: |
      eval "$(conda shell.bash hook)"
      conda activate myenv
  singularity:
    python_cmd: "python"
    command_wrapper: "singularity exec --bind /mnt:/mnt --nv python.sif bash -lc"
scripts:
  my_script:
    path: "path/to/script.py"
    default_args:
      param1: [1.0, 2.0]
      param2: ["value1", "value2"]
By default, such a run.yaml file is created in the main submit directory.
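To make the roles of python_cmd, pre_command, and command_wrapper concrete, here is a minimal sketch of how a runtime entry could be turned into a shell command. It is an illustration only: the real command is rendered by the Jinja2 templates, and the helper names resolve_runtime and build_command, as well as the exact composition order, are assumptions. The override parameter mirrors the --runtime option, which falls back to mode.<name>.runtime when omitted.

```python
import shlex


def resolve_runtime(config, mode, runtime_override=None):
    """Return the runtime entry for a mode unless one is named explicitly."""
    name = runtime_override or config["mode"][mode]["runtime"]
    return config["runtime"][name]


def build_command(runtime, script_path, script_args):
    """Compose one shell command from a runtime entry (hypothetical logic)."""
    parts = [runtime["python_cmd"], script_path]
    for key, value in script_args.items():
        parts += [f"--{key}", str(value)]
    inner = " ".join(shlex.quote(p) for p in parts)
    # A pre_command (e.g. Conda activation) runs first in the same shell.
    pre = runtime.get("pre_command", "").strip()
    if pre:
        inner = f"{pre}\n{inner}"
    # A command_wrapper (e.g. "singularity exec ... bash -lc") receives the
    # whole inner command as one quoted argument.
    wrapper = runtime.get("command_wrapper")
    if wrapper:
        return f"{wrapper} {shlex.quote(inner)}"
    return inner


# Trimmed-down version of the configuration above, as a plain dict.
config = {
    "mode": {"local": {"runtime": "venv"}, "slurm": {"runtime": "singularity"}},
    "runtime": {
        "venv": {"python_cmd": "./.venv/bin/python"},
        "singularity": {
            "python_cmd": "python",
            "command_wrapper": "singularity exec --bind /mnt:/mnt --nv python.sif bash -lc",
        },
    },
}
print(build_command(resolve_runtime(config, "local"), "path/to/script.py", {"param1": 1.0}))
print(build_command(resolve_runtime(config, "slurm"), "path/to/script.py", {"param1": 1.0}))
```

The local mode yields a plain interpreter call, while the slurm mode wraps the same call so that it runs inside the container.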
Basic usage:
python submit/submit.py --mode [local|cloud_local|slurm] --runtime <runtime_name> --script <script_name> [--slurm_args <slurm_args>] [--script_args <script_args>]
Required arguments:
- --script: Name of the script configuration from run.yaml
- --config_file: Path to YAML config file (default: ./submit/run.yaml)
Optional arguments:
- --mode: Execution mode (local, cloud_local, or slurm; default: local)
- --runtime: Runtime name from the YAML runtime: section. If omitted, mode.<name>.runtime is used.
SLURM-specific arguments:
- --partition: SLURM partition
- --nodes: Number of nodes
- --cpus-per-task: CPUs per task
- --mem-per-cpu: Memory per CPU (e.g. 4G)
- --gres: Generic resources (e.g. gpu:1)
- --time: Time limit (e.g. 3-00:00:00)
Additional arguments:
- Any --key value1 value2 ... pairs will be passed to the script. If multiple values are provided, the script will be run with all combinations of the script values.
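"All combinations" here means the Cartesian product of the value lists. A minimal sketch of that expansion, assuming a hypothetical helper expand_args (not necessarily how submit.py implements it):

```python
from itertools import product


def expand_args(args):
    """Yield one dict per combination of the listed values."""
    keys = list(args)
    # Treat a scalar as a one-element list so it appears in every combination.
    value_lists = [v if isinstance(v, list) else [v] for v in args.values()]
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


runs = list(expand_args({"param1": [1.0, 2.0], "param2": ["value1", "value2"]}))
print(len(runs))  # 4 runs: 2 values of param1 x 2 values of param2
```

Adding a third parameter with three values would raise the count to 12, so the number of runs grows multiplicatively with each multi-valued argument.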
Run a local job with default parameters:
python submit/submit.py --mode local --script my_script
Run a SLURM job with custom parameters:
python submit/submit.py --mode slurm --script my_script --partition gpu --cpus-per-task 4 --mem-per-cpu 4G
If your config defines mode.slurm.partition_defaults, then choosing a partition can auto-fill the other SLURM resources:
python submit/submit.py --mode slurm --partition 2080-galvani --script my_script
python submit/submit.py --mode slurm --partition 2080 --script my_script
The second form uses mode.slurm.partition_aliases.
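The alias and auto-fill behavior can be sketched as a simple layered merge. The helper name resolve_slurm_resources and the precedence (slurm_defaults, then partition defaults, then explicit CLI values) are assumptions for illustration. One detail worth knowing: an unquoted YAML key such as 2080 is loaded by PyYAML as an integer, so the sketch normalizes alias keys to strings before the lookup.

```python
def resolve_slurm_resources(slurm_cfg, partition=None, cli_overrides=None):
    """Merge slurm_defaults, per-partition defaults, and explicit CLI values."""
    resources = dict(slurm_cfg.get("slurm_defaults", {}))
    if partition is not None:
        # Normalize alias keys: YAML turns an unquoted 2080 into an int.
        aliases = {str(k): v for k, v in slurm_cfg.get("partition_aliases", {}).items()}
        partition = aliases.get(str(partition), str(partition))
        resources["partition"] = partition
        resources.update(slurm_cfg.get("partition_defaults", {}).get(partition, {}))
    # Explicit values win over anything auto-filled (assumed precedence).
    resources.update(cli_overrides or {})
    return resources


slurm_cfg = {
    "slurm_defaults": {"nodes": 1, "ntasks": 1},
    "partition_defaults": {"2080-galvani": {"gres": "gpu:1", "cpus_per_task": 12}},
    "partition_aliases": {2080: "2080-galvani"},
}
print(resolve_slurm_resources(slurm_cfg, "2080", {"cpus_per_task": 4}))
```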
To group runs sequentially and reduce the number of individual jobs queued on the cluster, the submit.py script supports multi-dimensional batching. You can pool multiple execution configurations into a single SLURM job submission.
To prevent collision with any --batch_size arguments parsed by your underlying Python scripts, use the slurm_batch_size argument in your configuration:
scripts:
  my_script:
    path: "path/to/script.py"
    default_args:
      param1: ["value1", "value2", "value3"]
      batch_size: 32 # This is cleanly passed to python script args
    # Group parameter combos sequentially inside a single #SBATCH job
    slurm_batch_size: 3 # submit 3 sequential evaluations per 1 SLURM sbatch command
If you want specific parameter dimensions to vary within one logical job, mark them with iter: true:
scripts:
  my_script:
    default_args:
      param1:
        values: ["value1", "value2", "value3"]
        iter: true
      seed: [0, 1, 2]
With that configuration, submit creates three logical jobs, one per seed, and
each logical job runs the three param1 values sequentially. If
slurm_batch_size is also set, it batches whole logical jobs on top of that.
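The interaction between iterated parameters and slurm_batch_size can be sketched as two nested groupings. This is an illustrative model under stated assumptions, not the planner in submit.py: plan_jobs and its arguments are hypothetical names, and every value in default_args is assumed to be a plain list.

```python
from itertools import product


def plan_jobs(default_args, iter_keys=(), slurm_batch_size=1):
    """Return submissions; each holds logical jobs, each a list of runs."""
    def combos(keys):
        return [dict(zip(keys, c)) for c in product(*(default_args[k] for k in keys))]

    outer_keys = [k for k in default_args if k not in iter_keys]
    inner_keys = [k for k in default_args if k in iter_keys]
    # One logical job per combination of the non-iterated parameters;
    # inside it, the iterated parameters are swept sequentially.
    logical_jobs = [
        [{**outer, **inner} for inner in combos(inner_keys)]
        for outer in combos(outer_keys)
    ]
    # Pack whole logical jobs into submissions of at most slurm_batch_size.
    return [
        logical_jobs[i:i + slurm_batch_size]
        for i in range(0, len(logical_jobs), slurm_batch_size)
    ]


args = {"param1": ["value1", "value2", "value3"], "seed": [0, 1, 2]}
plan = plan_jobs(args, iter_keys=["param1"], slurm_batch_size=2)
print(len(plan))  # 2 submissions: one with two logical jobs, one with one
```

With slurm_batch_size left at 1, the same arguments give three submissions, one per seed, each running the three param1 values in sequence.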
You can also use a top-level shorthand if you prefer to keep the parameter values as plain lists:
scripts:
  my_script:
    default_args:
      param1: ["value1", "value2", "value3"]
      seed: [0, 1, 2]
    iter: [param1]
Both YAML list styles work there, so this is equivalent:
iter:
  - param1
For fast SLURM emulation, this repository ships a fake-sbatch test path that
runs on Linux in Docker without starting a real cluster. It covers:
- batch-script rendering
- sbatch invocation
- CLI-level SLURM planning with iter: true
Run it with:
docker compose -f docker-compose.test.yml build
docker compose -f docker-compose.test.yml run --rm submit-tests
The Docker test runner targets only the submit-focused tests:
- tests/test_submit_framework.py
- tests/test_slurm_executor.py
- tests/test_submit_cli_slurm.py
For a real multi-container SLURM environment, a vendored cluster setup is
available under submit/slurm-docker-cluster/. That is the next step when the
fake-sbatch path is green and you want to validate sbatch, squeue, and
cluster behavior more end-to-end.
When submitting FSP jobs, you can let submit read calibrated prior hashes from
models/gp/d=<data_name>_gp_prior/prior_registry.json and the on-disk prior
folders automatically:
scripts:
  train_fsp:
    default_args:
      data_name: ["pos_2", "base_2"]
      prior_name: ["auto"]
submit resolves ["auto"] separately for each data_name, only creates valid
(data_name, prior_name) combinations, and passes hashes to the Python script
in quoted form such as --prior_name 'a74704d4'.
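The per-dataset resolution can be pictured as follows. This sketch makes several assumptions: the helper name resolve_prior_combinations is hypothetical, the registry is modeled as an in-memory dict rather than the actual prior_registry.json layout (which is not described here), "auto" is taken to expand to every known hash for a dataset, and the hash b1c2d3e4 is made up for the example.

```python
def resolve_prior_combinations(data_names, prior_names, priors_by_data):
    """Pair each data_name only with prior hashes registered for it."""
    pairs = []
    for data_name in data_names:
        known = priors_by_data.get(data_name, [])
        for prior in prior_names:
            if prior == "auto":
                # Assumption: "auto" expands to every hash known for this dataset.
                selected = list(known)
            else:
                # An explicit hash is kept only if it is valid for this dataset.
                selected = [prior] if prior in known else []
            pairs += [(data_name, p) for p in selected]
    return pairs


# Stand-in for the registry contents; b1c2d3e4 is a made-up hash.
registry = {"pos_2": ["a74704d4"], "base_2": ["b1c2d3e4"]}
print(resolve_prior_combinations(["pos_2", "base_2"], ["auto"], registry))
```

The point of resolving per data_name is that a plain Cartesian product would also pair pos_2 with the hash of base_2, which is not a valid combination.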
If you think features are missing or issues occur, please reach out.