ADR-006: No GPU Acceleration on Consumer Hardware

Status: Accepted
Date: 2026-03
Deciders: Core team

Context

ADR-001 established a CPU-only backend to match the synchrotron beamline deployment environment. A separate question arose: would consumer GPU acceleration (specifically an NVIDIA RTX 4090 Laptop, 16 GB VRAM, CUDA 13.1) improve performance for local development and pre-beamline analysis?

This ADR documents the quantitative assessment.

Decision

Do not implement GPU acceleration on consumer-class GPUs. The CPU-only configuration (JAX_PLATFORMS=cpu) is enforced in device/cpu.py and remains the only supported backend. No changes to pyproject.toml, the Makefile, or device configuration are required.
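As a minimal sketch of what CPU-only enforcement can look like, the snippet below sets the relevant environment variables before JAX is imported. The contents of device/cpu.py are not shown in this ADR, so the helper name `configure_cpu_backend` and its default of four virtual devices are assumptions for illustration; `JAX_PLATFORMS`, `JAX_ENABLE_X64`, and the `--xla_force_host_platform_device_count` XLA flag are existing JAX/XLA settings.

```python
import os

DEVICE_COUNT_FLAG = "--xla_force_host_platform_device_count="


def configure_cpu_backend(n_virtual_devices: int = 4) -> None:
    """Hypothetical helper: pin JAX to the CPU backend with float64 enabled.

    Call this before the first `import jax`, since JAX reads these
    environment variables when it is imported.
    """
    os.environ["JAX_PLATFORMS"] = "cpu"    # only the CPU backend is initialized
    os.environ["JAX_ENABLE_X64"] = "True"  # default to float64 arrays

    # Replace any existing device-count flag, keep all other XLA flags.
    existing = os.environ.get("XLA_FLAGS", "").split()
    kept = [flag for flag in existing if not flag.startswith(DEVICE_COUNT_FLAG)]
    kept.append(f"{DEVICE_COUNT_FLAG}{n_virtual_devices}")
    os.environ["XLA_FLAGS"] = " ".join(kept)


configure_cpu_backend()
print(os.environ["JAX_PLATFORMS"], os.environ["XLA_FLAGS"])
```

Setting the variables in one place, ahead of any JAX import, keeps the backend choice independent of whatever GPU drivers happen to be installed on a developer machine.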

Rationale

1. Float64 throughput penalty on consumer GPUs

Homodyne parameters span 7 orders of magnitude (\(D_0 \sim 10^4\), \(\dot\gamma_0 \sim 10^{-3}\)). The positivity floor epsilon_abs = 1e-12 is far below float32 machine epsilon (\(\sim 1.2 \times 10^{-7}\)), so in float32 it is rounded away whenever it is added to an O(1) quantity. Float32 would therefore cause NUTS leapfrog divergence and NLSQ Jacobian collapse; float64 is non-negotiable.
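The precision argument can be checked without any GPU. The sketch below uses only the standard library: the `to_f32` helper (a name introduced here for illustration) rounds a Python float to the nearest float32 via `struct`, and the script then tests whether the 1e-12 floor survives being added to 1.0 in each precision.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


EPS_ABS = 1e-12          # positivity floor quoted in this ADR
F32_EPS = 2.0 ** -23     # float32 machine epsilon, about 1.19e-7
value = 1.0              # a representative O(1) quantity

# float64: the floor is well above the float64 epsilon (~2.2e-16).
f64_sum = value + EPS_ABS
floor_survives_f64 = f64_sum > value

# float32: round both operands and the result to float32.
f32_sum = to_f32(to_f32(value) + to_f32(EPS_ABS))
floor_survives_f32 = f32_sum > value

print(f"float32 epsilon: {F32_EPS:.3e}")
print(f"floor survives in float64: {floor_survives_f64}")
print(f"floor survives in float32: {floor_survives_f32}")
```

In float32 the sum rounds back to exactly 1.0, so a floor of this size offers no protection against zero or negative values there.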

Consumer GPUs penalize float64 severely:

Hardware	float64 throughput (TFLOPS)	Speedup vs 20-core CPU	float64 : float32 throughput ratio
RTX 4090 Laptop	~1.3	1.3–2.6x	1 : 64
A100 SXM4	~19.5	20–40x	1 : 2
H100 SXM5	~67	67–130x	1 : 2

The generic “20–100x” speedup claim in GPU acceleration guides assumes float32 workloads. For float64 physics on consumer hardware the net advantage is 1.3–2.6x before transfer overhead.
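The speedup column can be reproduced from the peak float64 figures. The sketch below assumes the 20-core CPU delivers roughly 0.5–1.0 TFLOPS in float64; that range is not stated in this ADR and is inferred here from the ratios in the table.

```python
# Peak float64 throughput from the table above, in TFLOPS.
GPU_FP64_TFLOPS = {
    "RTX 4090 Laptop": 1.3,
    "A100 SXM4": 19.5,
    "H100 SXM5": 67.0,
}

# Assumption: float64 throughput of the 20-core CPU, inferred from the table.
CPU_FP64_TFLOPS_RANGE = (0.5, 1.0)


def speedup_range(gpu_tflops: float, cpu_range=CPU_FP64_TFLOPS_RANGE):
    """Return (low, high) peak-throughput speedup of a GPU over the CPU."""
    cpu_low, cpu_high = cpu_range
    return gpu_tflops / cpu_high, gpu_tflops / cpu_low


speedups = {name: speedup_range(tflops) for name, tflops in GPU_FP64_TFLOPS.items()}

for name, (low, high) in speedups.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x")
```

These are ratios of peak throughput only; they ignore memory bandwidth, kernel launch cost, and host-device transfers, which the next section accounts for.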

2. NLSQ path: PCIe overhead exceeds compute savings

The external nlsq C extension forces a CPU round-trip every Levenberg–Marquardt iteration:

GPU kernel -> block_until_ready -> np.asarray (D2H)
  -> NLSQ optimizer step (CPU)
  -> jnp.asarray (H2D) -> GPU kernel

For a typical dataset (n_time=1000, n_phi=23):

Jacobian size: \((23, 1000, 1000, 7)\) float64 = 1.29 GB (1.20 GiB).
PCIe transfer per iteration: ~70 ms at ~20 GB/s.
Optimistic kernel speedup: 2x (967 ms to ~484 ms).
Net with transfer: ~554 ms GPU vs 967 ms CPU = 1.7x, reduced to ~1.3–1.5x after synchronization barriers.
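The per-iteration budget can be rebuilt from the stated shape, bandwidth, and timings. The sketch below is a back-of-the-envelope model that counts one Jacobian transfer per iteration and treats the parameter upload as negligible; both simplifications are assumptions made here. Computed this way, the Jacobian is about 1.29 GB and the transfer about 64 ms, within a few percent of the figures quoted above.

```python
# Inputs taken from this ADR.
JACOBIAN_SHAPE = (23, 1000, 1000, 7)   # (n_phi, n_time, n_time, n_params)
BYTES_PER_FLOAT64 = 8
PCIE_BYTES_PER_S = 20e9                # ~20 GB/s effective PCIe bandwidth
CPU_KERNEL_MS = 967.0                  # measured CPU time per iteration
KERNEL_SPEEDUP = 2.0                   # optimistic GPU kernel speedup

# Size of the Jacobian that crosses the PCIe bus each iteration.
n_elements = 1
for dim in JACOBIAN_SHAPE:
    n_elements *= dim
jacobian_bytes = n_elements * BYTES_PER_FLOAT64

# Time per iteration on the GPU path: faster kernel plus the transfer.
transfer_ms = jacobian_bytes / PCIE_BYTES_PER_S * 1e3
gpu_kernel_ms = CPU_KERNEL_MS / KERNEL_SPEEDUP
gpu_total_ms = gpu_kernel_ms + transfer_ms
net_speedup = CPU_KERNEL_MS / gpu_total_ms

print(f"Jacobian: {jacobian_bytes / 1e9:.2f} GB")
print(f"Transfer: {transfer_ms:.0f} ms per iteration")
print(f"GPU total: {gpu_total_ms:.0f} ms vs CPU {CPU_KERNEL_MS:.0f} ms")
print(f"Net speedup before sync barriers: {net_speedup:.2f}x")
```

Even with a generous 2x kernel speedup, the transfer term caps the net gain well below 2x, and synchronization barriers reduce it further.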

For 10M+ point datasets the expanded Jacobian (~22 GB) exceeds 16 GB VRAM, forcing a fallback to the CPU out-of-core solver anyway.

3. CMC path: architectural incompatibility

The CMC backend (backends/multiprocessing.py) is structurally incompatible with GPU execution:

Virtual CPU devices: --xla_force_host_platform_device_count=4 only creates virtual host (CPU) devices; it has no effect on the CUDA backend.
CUDA context overhead: Each spawned worker creates an independent CUDA context (300–800 MB). With 9 workers: 2.7–7.2 GB consumed before any computation.
Shared memory: SharedDataManager uses POSIX shared memory (CPU RAM only); each worker would need to re-transfer shard data to VRAM.
Single-process alternative: Eliminating spawn-based parallelism to run sequential shards on GPU would sacrifice 9-way concurrency, resulting in a net 3–6x slowdown.
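The memory and concurrency figures above follow from simple arithmetic. The sketch below uses an idealized model with two assumptions made here: the 9 CPU workers scale perfectly, and the per-shard GPU speedup equals the raw 1.3–2.6x float64 range from section 1. Under those assumptions the sequential-GPU slowdown comes out at roughly 3.5x to 7x, the same ballpark as the 3–6x estimate.

```python
N_WORKERS = 9
CONTEXT_MB_RANGE = (300, 800)       # per-worker CUDA context size from this ADR
VRAM_GB = 16.0                      # RTX 4090 Laptop
GPU_SPEEDUP_RANGE = (1.3, 2.6)      # per-shard float64 speedup (section 1)

# VRAM consumed by CUDA contexts alone, before any data is loaded.
context_gb = tuple(N_WORKERS * mb / 1000 for mb in CONTEXT_MB_RANGE)
free_vram_gb = tuple(VRAM_GB - used for used in context_gb)


def sequential_slowdown(n_workers: int, gpu_speedup: float) -> float:
    """Slowdown from replacing n parallel CPU workers with one GPU
    that processes shards one after another (idealized model)."""
    return n_workers / gpu_speedup


slowdown_best = sequential_slowdown(N_WORKERS, GPU_SPEEDUP_RANGE[1])
slowdown_worst = sequential_slowdown(N_WORKERS, GPU_SPEEDUP_RANGE[0])

print(f"CUDA contexts: {context_gb[0]:.1f} to {context_gb[1]:.1f} GB")
print(f"VRAM left for data: {free_vram_gb[1]:.1f} to {free_vram_gb[0]:.1f} GB")
print(f"Sequential GPU slowdown: {slowdown_best:.1f}x to {slowdown_worst:.1f}x")
```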

Consequences

Positive:

No additional complexity in the device configuration layer.
NLSQ and CMC paths remain unchanged and fully tested.
No CUDA version management or GPU driver dependencies.
The existing 2718-test suite runs unmodified.

Negative / Accepted trade-offs:

Users with consumer GPUs cannot offload computation to the GPU.
Local development cannot exploit GPU parallelism for faster iteration.

When to Revisit

GPU acceleration becomes viable when all three conditions are met:

Datacenter GPU with \(\geq\) 1 : 2 float64 ratio (A100 / H100).
NLSQ library boundary eliminated – migrate from nlsq.curve_fit to a pure-JAX optimizer (e.g., jaxopt.LevenbergMarquardt), removing the per-iteration CPU round-trip.
CMC refactored to single-process jax.vmap-over-chains, replacing spawn-based multiprocessing.
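The three conditions can be read as a single conjunctive check. The sketch below encodes them as a small predicate; the `GpuPlan` class and `gpu_viable` function are hypothetical names introduced here for illustration and are not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuPlan:
    """Hypothetical description of a candidate GPU deployment."""
    fp64_per_fp32: float        # float64 throughput as a fraction of float32
    pure_jax_nlsq: bool         # NLSQ runs without a per-iteration CPU round-trip
    single_process_cmc: bool    # CMC uses vmap-over-chains, not spawned workers


def gpu_viable(plan: GpuPlan) -> bool:
    """True only when all three revisit conditions hold."""
    return (
        plan.fp64_per_fp32 >= 0.5   # a 1 : 2 ratio or better
        and plan.pure_jax_nlsq
        and plan.single_process_cmc
    )


laptop = GpuPlan(fp64_per_fp32=1 / 64, pure_jax_nlsq=False, single_process_cmc=False)
a100_today = GpuPlan(fp64_per_fp32=0.5, pure_jax_nlsq=False, single_process_cmc=False)
a100_rewritten = GpuPlan(fp64_per_fp32=0.5, pure_jax_nlsq=True, single_process_cmc=True)

for name, plan in [("laptop", laptop), ("A100 today", a100_today),
                   ("A100 after rewrites", a100_rewritten)]:
    print(f"{name}: viable={gpu_viable(plan)}")
```

Datacenter hardware alone does not satisfy the check; both software rewrites are also required.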

Upgrade path	Estimated speedup	Engineering effort
A100 + current code	2–4x (NLSQ boundary limited)	Low
A100/H100 + jaxopt LM rewrite	10–30x (NLSQ)	High
A100/H100 + CMC vmap refactor	5–15x (CMC)	Medium
A100/H100 + both rewrites	10–30x (end-to-end)	High