Minimal Anemoi Training on LUMI-G
This tutorial shows the shortest path to a working Anemoi training run on LUMI-G using a LUMI AI Factory container and a small virtual environment layered on top.
Before you start, set your project account, replacing the example ID below with your own LUMI project:
export PROJECT_ACCOUNT=project_462000131
1. Create the workspace
export LUMI_USER="${LUMI_USER:-$USER}"
export ANEMOI_ROOT="/scratch/${PROJECT_ACCOUNT}/${LUMI_USER}/anemoi"
mkdir -p "${ANEMOI_ROOT}"/{configs,jobs}
cd "${ANEMOI_ROOT}"
Create env.sh:
cat > env.sh <<'EOF'
#!/usr/bin/env bash
# No "set -euo pipefail" here: this file is sourced, so strict mode would leak into your interactive shell.
export PROJECT_ACCOUNT="${PROJECT_ACCOUNT:-project_462000131}"
export LUMI_USER="${LUMI_USER:-${USER}}"
export ANEMOI_ROOT="${ANEMOI_ROOT:-/scratch/${PROJECT_ACCOUNT}/${LUMI_USER}/anemoi}"
export CONTAINER="${CONTAINER:-/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260225_144743/lumi-multitorch-full-u24r64f21m43t29-20260225_144743.sif}"
export ANEMOI_DATA_ROOT="${ANEMOI_DATA_ROOT:-${ANEMOI_ROOT}/data}"
export ANEMOI_GRAPH_ROOT="${ANEMOI_GRAPH_ROOT:-${ANEMOI_ROOT}/graphs}"
export ANEMOI_OUTPUT_ROOT="${ANEMOI_OUTPUT_ROOT:-${ANEMOI_ROOT}/logs}"
export ANEMOI_VENV="${ANEMOI_VENV:-${ANEMOI_ROOT}/.venv}"
EOF
chmod +x env.sh
source env.sh
ls -lh "$CONTAINER"
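Before moving on, it can help to confirm that env.sh exported everything the later steps rely on. The following is a minimal sketch, not part of Anemoi or the LUMI tooling; the helper name anemoi_check_env is hypothetical, and it only knows about the variables defined in env.sh above.

```shell
# Hypothetical helper: report any env.sh variable that is missing or empty.
# The body runs in a subshell (parentheses), so it cannot alter the caller.
anemoi_check_env() (
    missing=0
    for name in PROJECT_ACCOUNT LUMI_USER ANEMOI_ROOT CONTAINER \
                ANEMOI_DATA_ROOT ANEMOI_GRAPH_ROOT ANEMOI_OUTPUT_ROOT ANEMOI_VENV; do
        # printenv only sees exported variables, which is what child
        # processes such as singularity will inherit.
        if [ -z "$(printenv "$name")" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # The container image must also be a readable file.
    if [ -n "${CONTAINER:-}" ] && [ ! -r "${CONTAINER}" ]; then
        echo "container not readable: ${CONTAINER}" >&2
        missing=1
    fi
    return "$missing"
)

anemoi_check_env && echo "environment looks complete" || echo "fix env.sh before continuing"
```

Run it right after source env.sh; any name it prints on stderr needs attention before you continue with step 2.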
2. Install Anemoi into a small venv
Create requirements.txt:
cat > requirements.txt <<'EOF'
anemoi-training==0.7.0
anemoi-models==0.10.0
anemoi-graphs==0.7.2
zarr<3
trimesh
pyshtools
EOF
Create install_venv.sh:
cat > install_venv.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${ROOT_DIR}/env.sh"
module purge
module use /appl/local/laifs/modules
module load lumi-aif-singularity-bindings
mkdir -p \
  "${ANEMOI_DATA_ROOT}" \
  "${ANEMOI_GRAPH_ROOT}" \
  "${ANEMOI_OUTPUT_ROOT}" \
  "$(dirname "${ANEMOI_VENV}")"
singularity exec "${CONTAINER}" bash -lc "
set -euo pipefail
python3 -m venv '${ANEMOI_VENV}' --system-site-packages
'${ANEMOI_VENV}/bin/python' -m pip install --upgrade pip setuptools wheel
'${ANEMOI_VENV}/bin/python' -m pip install -r '${ROOT_DIR}/requirements.txt'
"
EOF
chmod +x install_venv.sh
The tutorial pins anemoi-training, anemoi-models, and anemoi-graphs
together so the package defaults and internal APIs stay aligned. If
requirements.txt changes, remove ${ANEMOI_VENV} and recreate it instead of
trying to upgrade the existing environment in place.
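One way to notice that requirements.txt has changed since the venv was built is to keep a checksum next to the environment. This is a sketch under stated assumptions: the helper names and the stamp file name .requirements.sha256 are hypothetical, and nothing in Anemoi or pip creates or reads this file.

```shell
# Hypothetical helper: succeed (exit 0) when the venv should be rebuilt,
# i.e. when no checksum stamp exists or requirements.txt has changed.
# The stamp file name ".requirements.sha256" is an assumption of this sketch.
venv_is_stale() (
    req_file=$1
    venv_dir=$2
    stamp="${venv_dir}/.requirements.sha256"
    [ -f "$stamp" ] || return 0
    current=$(sha256sum "$req_file" | cut -d' ' -f1)
    [ "$current" != "$(cat "$stamp")" ]
)

# Hypothetical helper: record the checksum after a successful install.
venv_write_stamp() (
    sha256sum "$1" | cut -d' ' -f1 > "$2/.requirements.sha256"
)
```

A possible workflow is to call venv_write_stamp requirements.txt "${ANEMOI_VENV}" once install_venv.sh has finished, and later to remove and recreate the venv whenever venv_is_stale requirements.txt "${ANEMOI_VENV}" succeeds.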
Install from a short dev-g allocation:
salloc \
  --account="${PROJECT_ACCOUNT}" \
  --partition=dev-g \
  --nodes=1 \
  --gpus-per-node=1 \
  --ntasks=1 \
  --cpus-per-task=7 \
  --mem-per-gpu=60G \
  --time=00:30:00
Inside the allocation:
cd "${ANEMOI_ROOT}"
rm -rf "${ANEMOI_VENV}"
./install_venv.sh
exit
3. Download the sample dataset
cd "${ANEMOI_ROOT}"
source env.sh
curl --fail --location \
  https://data.ecmwf.int/anemoi-datasets/era5-o48-2020-2021-6h-v1.zip \
  --output "${ANEMOI_DATA_ROOT}/era5-o48-2020-2021-6h-v1.zip"
ls -lh "${ANEMOI_DATA_ROOT}/era5-o48-2020-2021-6h-v1.zip"
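A file listing only shows that something was written, not that it is a zip archive; an interrupted or redirected download can leave a short error page behind. The sketch below is a cheap first check, assuming only that zip archives begin with the two bytes "PK". The helper name looks_like_zip is hypothetical, and passing this check does not prove the archive is complete.

```shell
# Hypothetical helper: succeed only if the file is non-empty and starts with
# the two-byte "PK" signature that zip archives begin with.
looks_like_zip() (
    file=$1
    [ -s "$file" ] || return 1
    [ "$(head -c 2 "$file")" = "PK" ]
)

# Falls back to the current directory when ANEMOI_DATA_ROOT is not set.
dataset="${ANEMOI_DATA_ROOT:-.}/era5-o48-2020-2021-6h-v1.zip"
looks_like_zip "$dataset" || echo "not a valid-looking zip: $dataset" >&2
```

If unzip is available on your system, unzip -t on the same file performs a fuller integrity test.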
4. Create the minimal config
Create configs/training-minimal.yaml:
cat > configs/training-minimal.yaml <<'EOF'
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: evaluation
- hardware: example
- graph: multi_scale
- model: gnn
- training: default
- _self_
config_validation: true
data:
  resolution: o48
hardware:
  num_gpus_per_node: 1
  paths:
    data: ${oc.env:ANEMOI_DATA_ROOT}
    graph: ${oc.env:ANEMOI_GRAPH_ROOT}
    output: ${oc.env:ANEMOI_OUTPUT_ROOT}
  files:
    dataset: era5-o48-2020-2021-6h-v1.zip
    graph: first_graph_o48.pt
dataloader:
  num_workers:
    training: 1
    validation: 1
    test: 1
  batch_size:
    training: 1
    validation: 1
    test: 1
  limit_batches:
    training: 8
    validation: 2
    test: 2
training:
  max_epochs: 4
  lr:
    rate: 1.0e-4
diagnostics:
  plot:
    callbacks: []
EOF
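The config reads its paths from the environment through OmegaConf's ${oc.env:NAME} resolver, so a variable that is unset at job time only surfaces as an error once the config is resolved. The sketch below lists the variables a config file refers to and checks that each is exported. Both helper names are hypothetical, and the scan is a plain text match rather than a YAML parse, so it also reports references that carry a default value.

```shell
# Hypothetical helper: list the environment variables that a config reads
# through ${oc.env:NAME}, one per line, without duplicates.
config_env_refs() (
    grep -o 'oc\.env:[A-Za-z_][A-Za-z0-9_]*' "$1" | cut -d: -f2 | sort -u
)

# Hypothetical helper: fail if any referenced variable is not exported.
config_env_check() (
    status=0
    for name in $(config_env_refs "$1"); do
        if [ -z "$(printenv "$name")" ]; then
            echo "config needs \$${name}, but it is not set" >&2
            status=1
        fi
    done
    return "$status"
)
```

After source env.sh, running config_env_check configs/training-minimal.yaml should print nothing and return success.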
The tutorial pins anemoi-training==0.7.0 on purpose. Installing from
ecmwf/anemoi-core main caused the tutorial to drift as the config schema
and internal APIs changed.
If you already created the venv from an older version of this tutorial, remove it and recreate it from a dev-g allocation, as in step 2:
rm -rf "${ANEMOI_VENV}"
./install_venv.sh
With this pin, dataset and graph locations are configured under the hardware
key (hardware.paths and hardware.files). If you previously created
configs/training-minimal.yaml from an older version of this page, replace it
with the block above.
5. Create the training job
Create jobs/train_minimal.sh:
cat > jobs/train_minimal.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=anemoi-train
#SBATCH --account=project_462000131
#SBATCH --partition=small-g
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --cpus-per-task=7
#SBATCH --time=00:30:00
#SBATCH --output=%x-%j.out
set -euo pipefail
ROOT_DIR="${SLURM_SUBMIT_DIR:-$(cd "$(dirname "$0")"/.. && pwd)}"
cd "${ROOT_DIR}"
source "${ROOT_DIR}/env.sh"
module purge
module use /appl/local/laifs/modules
module load lumi-aif-singularity-bindings
exec singularity exec "${CONTAINER}" bash -lc "
set -euo pipefail
VENV_SITE=\$('${ANEMOI_VENV}/bin/python' -c 'import site; print(site.getsitepackages()[0])')
export PYTHONNOUSERSITE=1
export PYTHONPATH=\"\${VENV_SITE}\${PYTHONPATH:+:\${PYTHONPATH}}\"
cd '${ROOT_DIR}/configs'
exec '${ANEMOI_VENV}/bin/anemoi-training' train --config-name=training-minimal.yaml
"
EOF
sed -i "s/project_462000131/${PROJECT_ACCOUNT}/g" jobs/train_minimal.sh
chmod +x jobs/train_minimal.sh
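Because the account is written into the script by sed, it is worth confirming that the substitution produced the value you expect before submitting. This is a small sketch; the helper name sbatch_account is hypothetical and it only reads the first --account directive it finds.

```shell
# Hypothetical helper: print the value of the first "#SBATCH --account="
# directive in a job script.
sbatch_account() (
    sed -n 's/^#SBATCH[[:space:]]*--account=\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
)
```

Comparing "$(sbatch_account jobs/train_minimal.sh)" with "${PROJECT_ACCOUNT}" catches a missed substitution. Slurm itself can also validate the script without submitting it: sbatch --test-only jobs/train_minimal.sh.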
6. Submit and check
cd "${ANEMOI_ROOT}"
sbatch jobs/train_minimal.sh
Check status:
squeue -u "$USER"
Check the latest log:
tail -n 100 "$(ls -1t anemoi-train-*.out | head -n 1)"
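For repeated runs, the log lookup above can be wrapped in two small helpers. This is a sketch only: the helper names are hypothetical, and the pattern list is a guess at strings that often accompany a failed run, not an exhaustive or Anemoi-specific set.

```shell
# Hypothetical helper: print the newest Slurm log for this job name, if any.
# Takes the directory to search; defaults to the current directory.
latest_train_log() (
    dir=${1:-.}
    ls -1t "$dir"/anemoi-train-*.out 2>/dev/null | head -n 1
)

# Hypothetical helper: show lines that commonly indicate a failed run.
# Exits non-zero when no such line is found.
scan_train_log() (
    grep -nE 'Traceback|Error|Killed|CANCELLED' "$1"
)
```

For example, scan_train_log "$(latest_train_log "${ANEMOI_ROOT}")" prints matching lines with their line numbers. Once the job has left the queue, sacct -j with the job ID reports its final state.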