Multi-GPU Training Pitfalls

01 // THE PROBLEM

You have 8 GPUs and 13 independent models to train. Each model takes ~1 hour on a single GPU. Sequentially, that's 13 hours. With 8 GPUs in parallel, it should be ~2 hours. Simple, right?

It took four increasingly desperate attempts before we got a solution that actually worked. This post documents each attempt, why it failed, and the fundamental lesson: PyTorch multi-GPU parallelism must use OS processes, not Python threads.

"Everything that can go wrong with threading will go wrong with threading." — every systems engineer, eventually

Note: this is not about data-parallel training (one model across multiple GPUs, like torchrun / DDP). This is about running many independent models, one per GPU — an embarrassingly parallel workload that turns out to be not so embarrassing.

02 // ATTEMPT 1: ThreadPoolExecutor

The simplest approach. Python's concurrent.futures.ThreadPoolExecutor with one thread per GPU:

from concurrent.futures import ThreadPoolExecutor, as_completed

import torch
from torch.utils.data import DataLoader

def train_on_gpu(config, gpu_id):
    device = torch.device(f'cuda:{gpu_id}')
    model = MyModel().to(device)
    loader = DataLoader(dataset, batch_size=256, num_workers=2)
    # ... training loop ...
    return model.state_dict()

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {}
    for i, cfg in enumerate(configs):
        f = pool.submit(train_on_gpu, cfg, gpu_id=i % 8)
        futures[f] = cfg['name']

    for f in as_completed(futures):
        result = f.result()  # blocks until done

What happened: the program hung after computing FID scores. GPU utilization dropped to 0%. Classic deadlock.

Root cause: the FID computation library (clean-fid) internally loads an InceptionV3 model and is not thread-safe. Multiple threads calling it simultaneously caused resource contention and eventual deadlock.

Lesson: third-party libraries are often not thread-safe, and there's no way to know without testing or reading their source code.

03 // ATTEMPT 2: Threading + Lock

We added a threading.Lock to serialize the FID calls:

import threading

from cleanfid import fid

fid_lock = threading.Lock()

def compute_fid(real_dir, gen_dir):
    # Serialize FID: only one thread may run clean-fid at a time
    with fid_lock:
        return fid.compute_fid(real_dir, gen_dir)

What happened: GPU utilization was erratic — jumping between 0% and ~30%, with some GPUs completely idle. Training that should take 1 hour per model was barely progressing.

Root cause: the Global Interpreter Lock (GIL). This is the fundamental issue with Python threading for CPU-bound work:

Python's GIL allows only one thread to execute Python bytecode at a time
PyTorch releases the GIL inside many C++ and CUDA operations, but the Python-level work around them still needs it: DataLoader iteration in the training thread, the Python overhead of optimizer.step() and loss.backward(), and any other pure-Python code
With 8 threads competing for the GIL, each thread gets roughly 1/8 of the interpreter time to feed its GPU
Result: GPUs starve waiting for data, utilization crashes

Lesson: Python threads are concurrent but not parallel for CPU-bound work. The GIL serializes all Python execution.
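
To see the effect in isolation, here is a minimal CPU-only sketch (no PyTorch, no GPU). The names busy_sum, run_sequential, and run_threaded are ours for illustration, and the loop sizes are arbitrary. On a standard CPython build with the GIL, the threaded timing usually comes out close to the sequential one; a free-threaded build would behave differently.

```python
import threading
import time

def busy_sum(n):
    """Pure-Python, CPU-bound loop: the running thread needs the GIL throughout."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_sequential(n_tasks, n):
    """Run the tasks one after another; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [busy_sum(n) for _ in range(n_tasks)]
    return results, time.perf_counter() - start

def run_threaded(n_tasks, n):
    """Run the same tasks in one thread each; return (results, elapsed seconds)."""
    results = [None] * n_tasks

    def work(slot):
        results[slot] = busy_sum(n)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_tasks)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

if __name__ == "__main__":
    _, t_seq = run_sequential(4, 2_000_000)
    _, t_thr = run_threaded(4, 2_000_000)
    print(f"sequential: {t_seq:.2f}s  threaded: {t_thr:.2f}s")
```

Both variants compute the same results; only the wall-clock time is of interest, and it depends on your machine and interpreter build.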

04 // ATTEMPT 3: torch.multiprocessing

Since threads don't work, use torch.multiprocessing with spawn context. Each GPU gets its own process with its own GIL:

import queue

import torch
import torch.multiprocessing as mp

def train_worker(gpu_id, task_queue, save_dir):
    device = torch.device(f'cuda:{gpu_id}')
    torch.cuda.set_device(device)
    dataset = load_dataset()  # each process loads its own copy
    while True:
        try:
            # Queue.empty() is unreliable across processes; rely on the timeout
            config = task_queue.get(timeout=2)
        except queue.Empty:
            break  # queue drained, worker exits
        model = train(config, device, dataset)
        torch.save(model, f'{save_dir}/{config["name"]}.pt')

if __name__ == '__main__':  # required with spawn: children re-import this module
    ctx = mp.get_context('spawn')
    task_queue = ctx.Queue()
    for cfg in configs:
        task_queue.put(cfg)

    processes = []
    for gpu_id in range(8):
        p = ctx.Process(target=train_worker, args=(gpu_id, task_queue, save_dir))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

What happened: the program hung during model initialization. Some GPU processes never started training.

Root cause: two issues compounded:

torch.compile + multi-process: we were using torch.compile(model) in each worker process. The dynamo compiler performs extensive Python-level analysis during the first forward pass. With spawn, each process imports the module fresh and triggers compilation — this compilation is CPU-heavy, and when 8 processes compile simultaneously, they saturate CPU cores and thrash the cache, causing extreme slowdown or apparent hangs.
Pickling closures: the training step functions were closures that captured sigma_min, sigma_max, etc. Closures can't be pickled, and spawn requires all arguments to be picklable (since it starts a fresh Python interpreter). We had to refactor to pass serializable config dicts instead.

Lesson: torch.compile adds significant hidden complexity. Disable it when running multiple processes. And spawn requires every argument passed to a worker to be picklable: no closures, no lambda functions, nothing that holds unpicklable state.
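
The pickling constraint is easy to reproduce without GPUs. The sketch below uses a made-up clamp function as a stand-in for our training step (make_step_closure and can_pickle are illustrative names, and 80.0 is an arbitrary value): the closure fails to pickle, while the plain dict it was built from goes through fine.

```python
import pickle

def make_step_closure(sigma_min, sigma_max):
    """Return a nested function that captures its hyperparameters."""
    def step(x):
        # Placeholder body; the real training step is irrelevant here
        return min(max(x, sigma_min), sigma_max)
    return step

def can_pickle(obj):
    """True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

closure = make_step_closure(0.01, 80.0)
config = {"sigma_min": 0.01, "sigma_max": 80.0}

if __name__ == "__main__":
    print("closure picklable:", can_pickle(closure))
    print("config picklable: ", can_pickle(config))
```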

05 // ATTEMPT 4: subprocess.Popen (SUCCESS)

The solution that finally worked is the simplest conceptually: launch each training job as a completely independent OS process using subprocess.Popen.

import argparse, json, os, subprocess, sys, time

def run_experiment():
    # ... setup, warmup, define configs ...
    save_dir = '/tmp/models'
    this_script = os.path.abspath(__file__)
    gpu_procs = {}  # gpu_id -> (Popen, name)
    pending = list(configs)

    while pending or gpu_procs:
        # Launch on free GPUs
        for gpu_id in range(n_gpus):
            if gpu_id not in gpu_procs and pending:
                cfg = pending.pop(0)
                env = {**os.environ, 'CUDA_VISIBLE_DEVICES': str(gpu_id)}
                p = subprocess.Popen(
                    [sys.executable, this_script,
                     '--train-single', json.dumps(cfg),
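                     # Physical GPU id, for logging only: inside the worker
                     # the single visible device is always cuda:0
                     '--gpu', str(gpu_id),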
                     '--save-dir', save_dir],
                    env=env, stdout=sys.stdout, stderr=sys.stderr
                )
                gpu_procs[gpu_id] = (p, cfg['name'])

        # Poll for completion
        time.sleep(5)
        for gpu_id in list(gpu_procs):
            p, name = gpu_procs[gpu_id]
            if p.poll() is not None:
                print(f'{name} done (exit {p.returncode})')
                del gpu_procs[gpu_id]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-single', type=str, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--save-dir', type=str, default='/tmp')
    args = parser.parse_args()

    if args.train_single:
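        # Worker mode: train one model. CUDA_VISIBLE_DEVICES already pins
        # the GPU, so the worker should use cuda:0; args.gpu is for logging.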
        cfg = json.loads(args.train_single)
        train_single_model(cfg, args.gpu, args.save_dir)
    else:
        # Main mode: orchestrate all training
        run_experiment()

Why this works:

Each subprocess is a fresh Python interpreter with its own GIL, memory space, and CUDA context
CUDA_VISIBLE_DEVICES ensures each process sees only one GPU (as cuda:0), eliminating cross-GPU conflicts
Config is passed as a JSON string — trivially serializable, no pickling issues
Models are saved to disk as .pt files — no inter-process communication needed
The main process just polls p.poll() every few seconds — negligible overhead
No deadlocks of the kind seen in the earlier attempts, because the processes share no locks, queues, or memory; the only common resources are the disk and the terminal

This is essentially what submitit (from FAIR) does for SLURM clusters, and what you'd do manually by opening 8 terminal windows. The pattern is ancient and battle-tested.
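
One thing the launcher loop above does not do is remember which jobs failed. Here is a minimal sketch of the same polling pattern that collects exit codes. The function name run_jobs and the two stand-in worker commands are hypothetical; they exist only so the example runs on a machine without GPUs.

```python
import os
import subprocess
import sys
import time

def run_jobs(jobs, n_gpus, poll_interval=0.1):
    """Run (name, argv) jobs as separate processes, at most one per GPU slot.

    Returns {name: exit_code} so the caller can see which jobs failed.
    """
    pending = list(jobs)
    running = {}     # gpu_id -> (Popen, name)
    exit_codes = {}  # name -> return code
    while pending or running:
        # Fill every free slot; the child only ever sees its own GPU
        for gpu_id in range(n_gpus):
            if gpu_id not in running and pending:
                name, argv = pending.pop(0)
                env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
                running[gpu_id] = (subprocess.Popen(argv, env=env), name)
        time.sleep(poll_interval)
        # Reap finished children and free their slots
        for gpu_id in list(running):
            proc, name = running[gpu_id]
            if proc.poll() is not None:
                exit_codes[name] = proc.returncode
                del running[gpu_id]
    return exit_codes

# Stand-in workers (hypothetical, not our trainer): the first exits 0 only if
# CUDA_VISIBLE_DEVICES holds a single numeric id, the second always fails.
OK_WORKER = "import os, sys; sys.exit(0 if os.environ['CUDA_VISIBLE_DEVICES'].isdigit() else 1)"
BAD_WORKER = "import sys; sys.exit(3)"

demo_jobs = [
    ("model_a", [sys.executable, "-c", OK_WORKER]),
    ("model_b", [sys.executable, "-c", BAD_WORKER]),
    ("model_c", [sys.executable, "-c", OK_WORKER]),
]

if __name__ == "__main__":
    codes = run_jobs(demo_jobs, n_gpus=2)
    failed = [name for name, code in codes.items() if code != 0]
    print("failed:", failed)
```

Returning the exit codes lets the evaluation phase skip models whose training process crashed, instead of discovering a missing .pt file later.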

06 // THE ARCHITECTURE

The final system has two phases:

Main Process
  |
  |-- Phase 1: Training (parallel)
  |     |-- subprocess GPU 0: python script.py --train-single '{"name":"model_a",...}' --gpu 0
  |     |-- subprocess GPU 1: python script.py --train-single '{"name":"model_b",...}' --gpu 1
  |     |-- ...
  |     |-- subprocess GPU 7: python script.py --train-single '{"name":"model_h",...}' --gpu 7
  |     |   (when one finishes, next model starts on that GPU)
  |     |
  |     `-- All 13 .pt files saved to disk
  |
  `-- Phase 2: Evaluation (sequential)
        |-- Load each .pt, generate samples, compute FID
        `-- (sequential because FID library is not thread-safe)

With 8 GPUs and 13 models:

First batch: 8 models train in parallel (~1h each)
As each finishes, the next model from the queue starts on the freed GPU
Total training time: ~2h (vs 13h sequential)
Evaluation: ~3h sequential on 1 GPU
Total: ~5h vs ~16h, roughly a 3.2x speedup
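
The numbers above can be checked with a small scheduling sketch. It assumes every model takes exactly 1 hour and evaluation takes a flat 3 hours, which are the rounded figures from this post; the function name makespan is ours.

```python
import heapq

def makespan(n_jobs, n_gpus, hours_per_job):
    """Greedy schedule: each job goes to whichever GPU frees up first."""
    free_at = [0.0] * n_gpus  # time at which each GPU becomes free
    heapq.heapify(free_at)
    for _ in range(n_jobs):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + hours_per_job)
    return max(free_at)

train_parallel = makespan(13, 8, 1.0)    # 8 GPUs
train_sequential = makespan(13, 1, 1.0)  # 1 GPU
eval_hours = 3.0                         # sequential in both cases

speedup = (train_sequential + eval_hours) / (train_parallel + eval_hours)

if __name__ == "__main__":
    print(f"training: {train_parallel:.0f}h parallel vs {train_sequential:.0f}h sequential")
    print(f"end-to-end speedup: {speedup:.1f}x")
```

Under these assumptions the second hour of training uses only 5 of the 8 GPUs, and the sequential evaluation phase becomes the larger share of the total.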

07 // KEY DESIGN DECISIONS

Use CUDA_VISIBLE_DEVICES per subprocess, not torch.device('cuda:N'). Setting CUDA_VISIBLE_DEVICES=3 makes the process see only physical GPU 3, exposed as cuda:0. This is safer than addressing GPUs by index within a single process, because code that implicitly defaults to cuda:0 (in our case, paths involving torch.compile and the InceptionV3 model used for FID) then lands on the intended GPU instead of piling onto GPU 0.

Save models to disk, not shared memory. A 2M-parameter model's state dict is ~8MB. Saving 13 of them to an SSD takes milliseconds. This is far simpler and more reliable than mp.Queue, shared memory, or pipe-based IPC.
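
The size estimate is simple arithmetic, assuming float32 weights (4 bytes per parameter) and ignoring non-parameter buffers and file-format overhead:

```python
n_params = 2_000_000
bytes_per_param = 4  # assumption: float32 weights
n_models = 13

state_dict_mb = n_params * bytes_per_param / 1e6  # size of one state dict in MB
total_mb = n_models * state_dict_mb               # all checkpoints together

if __name__ == "__main__":
    print(f"{state_dict_mb:.0f} MB per model, {total_mb:.0f} MB in total")
```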

Pass config as JSON strings, not Python objects. Training step functions are closures that capture hyperparameters. Closures can't be pickled. Instead, pass a serializable dict like {"type": "edm", "sigma_min": 0.01, ...} and reconstruct the function inside each subprocess.
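
A minimal sketch of that round trip, under stated assumptions: build_step_fn is a hypothetical factory, the clamp inside it is a placeholder rather than a real EDM training step, and sigma_max=80.0 is an arbitrary value.

```python
import json

def build_step_fn(cfg):
    """Rebuild a step function from a plain, JSON-compatible dict."""
    if cfg["type"] == "edm":
        sigma_min, sigma_max = cfg["sigma_min"], cfg["sigma_max"]

        def step(x):
            # Placeholder maths: clamp x into [sigma_min, sigma_max]
            return min(max(x, sigma_min), sigma_max)

        return step
    raise ValueError(f"unknown config type: {cfg['type']!r}")

# Launcher side: only a string crosses the process boundary
cfg = {"name": "model_a", "type": "edm", "sigma_min": 0.01, "sigma_max": 80.0}
payload = json.dumps(cfg)

# Worker side: parse the string and rebuild the closure locally
step = build_step_fn(json.loads(payload))
```

The closure still exists, but it is created inside the worker, so nothing unpicklable ever has to travel between processes.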

Separate training from evaluation. If your evaluation code has threading issues (as ours did with clean-fid), don't try to fix it; just run it sequentially after all training is complete. Per model, evaluation is typically much faster than training anyway (here about 14 minutes versus about 1 hour).

Don't use torch.compile in parallel settings. torch.compile with dynamo does heavy Python-level analysis on the first forward pass. In multi-process settings, 8 simultaneous compilations can saturate your CPU. For small models where compile overhead exceeds training time savings, skip it entirely.

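Add flush=True to all print statements. Python block-buffers stdout when it is not attached to a terminal (for example when redirected to a file or pipe), so a worker's output may not appear until the buffer fills or the process exits. Use print(..., flush=True), or run workers with python -u or PYTHONUNBUFFERED=1, so you can monitor progress in real time.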
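
If you would rather not touch every print call, the launcher can turn buffering off for the child through its environment. PYTHONUNBUFFERED is a standard CPython environment variable; the helper name launch_unbuffered and the one-line child program are illustrative only.

```python
import os
import subprocess
import sys

def launch_unbuffered(argv):
    """Run a child Python process with stdout/stderr buffering disabled."""
    env = {**os.environ, "PYTHONUNBUFFERED": "1"}
    return subprocess.run(argv, env=env, capture_output=True, text=True)

# Stand-in for a training worker: prints a progress line and echoes the variable
child = "import os; print('step 1 done'); print(os.environ['PYTHONUNBUFFERED'])"
result = launch_unbuffered([sys.executable, "-c", child])

if __name__ == "__main__":
    print(result.stdout, end="")
```

In the real launcher the same key would simply be added to the env dict that already carries CUDA_VISIBLE_DEVICES.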

08 // DECISION TREE

When you need to run multiple PyTorch jobs on multiple GPUs, here's how to choose:

Scenario	Solution	Why
One model, multiple GPUs	`torchrun` / DDP	Standard data parallelism
Many models, SLURM cluster	`submitit`	Job array with proper scheduling
Many models, single machine	`subprocess.Popen`	Zero shared state, cannot deadlock
Many models, need IPC	`mp.Process` with `spawn`	Only if you need shared queues
Any GPU workload	`ThreadPoolExecutor`	Never. GIL kills performance.

09 // THE TAKEAWAY

The fundamental lesson: Python threads do not provide parallelism for CPU-bound code. PyTorch training has significant CPU-bound components (data loading, gradient computation orchestration, optimizer steps), so threading gives you concurrency without parallelism — the worst of both worlds, because you pay the complexity cost of thread safety without the performance benefit.

The subprocess approach is unglamorous. It feels like "cheating" compared to elegant shared-memory designs. But it works on the first try, every time, because OS process isolation eliminates entire classes of bugs (GIL contention, thread-unsafe libraries, CUDA context conflicts, pickle failures).

In distributed systems, the safest abstraction is no shared state. When your parallel tasks are independent (as they often are in hyperparameter sweeps, ablation studies, and method comparisons), embrace the simplicity: separate processes, config as JSON, results on disk.

"Make it work, make it right, make it fast. Most parallelism bugs happen because people skip step one." — Kent Beck (paraphrased)
