GPU & numerics
C+ does not have a kernel language, a GPU dialect, or a built-in tensor type. It does not need one. The language core stays small, and heavy numerics come from packages that bind the vendor's own SDK through plain C FFI. You get the same pre-tuned matmul the platform ships, called from typed C+ with no shim layer in between.
Two properties make this pleasant rather than painful:
- It is just FFI. Each binding is extern fn over the real library (Metal, cuBLAS, a CBLAS), so there is nothing to reimplement and nothing to fall behind the vendor. C+ is a consumer of these SDKs.
- Resources are Drop-managed. Device buffers, BLAS handles, and Metal objects each free themselves at scope exit (cudaFree, cublasDestroy, objc_release), so the ownership model that protects host memory protects device resources too.
The backend matrix
Pick the package for the hardware you are on:
| Platform | GPU | CPU |
|---|---|---|
| Apple | metal — Metal + MPS | accelerate — Accelerate (BLAS, vDSP) |
| NVIDIA | cuda — CUDA Runtime + cuBLAS | — |
| Cross-platform | — | cblas — OpenBLAS / Netlib / MKL |
For CPU-side vector math that does not need a BLAS at all, the
simd package gives portable f32x4 and integer-lane
kernels (see also the SIMD types language page).
A typical shape
GPU numerics in C+ follows the same rhythm regardless of backend: get a handle or device, move data to where the compute happens, call the pre-tuned routine, read the result back. Every handle in that chain frees itself.
import "cuda/cublas" as cublas;
guard let cublas::Handle::new() else { return 1; } // Drop = cublasDestroy
// ... upload inputs into DeviceBuffers (Drop = cudaFree), run sgemm, copy out ...
The matmul itself is the vendor's sgemm, column-major, exactly as the SDK
defines it. See each package page for the concrete API:
- cuda — NVIDIA GPU compute and cuBLAS.
- cblas — the cross-platform CPU BLAS path.
- accelerate — Apple's CPU numerics and the reference path for checking GPU results.
- metal — Apple GPU compute and MPS.
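The column-major convention is the part that most often trips people up, so here is a minimal sketch in plain C of what a no-transpose sgemm computes. It is not any vendor's implementation and not part of the C+ packages; the name sgemm_colmajor_ref is hypothetical. It assumes the standard BLAS layout, where element (i, j) of a matrix with leading dimension ld sits at index i + j * ld.

```c
#include <assert.h>
#include <stddef.h>

/* Reference SGEMM for column-major storage, no transposes:
 *     C = alpha * A * B + beta * C
 * A is m x k, B is k x n, C is m x n.
 * Hypothetical helper for illustration; a vendor sgemm computes the same
 * result far faster. */
void sgemm_colmajor_ref(int m, int n, int k,
                        float alpha, const float *a, int lda,
                        const float *b, int ldb,
                        float beta, float *c, int ldc)
{
    /* Leading dimensions must cover at least one full column. */
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {          /* column of C and B */
        for (int i = 0; i < m; ++i) {      /* row of C and A */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* A(i, p) * B(p, j), both indexed column-major */
                acc += a[(size_t)i + (size_t)p * (size_t)lda] *
                       b[(size_t)p + (size_t)j * (size_t)ldb];
            }
            size_t idx = (size_t)i + (size_t)j * (size_t)ldc;
            /* As in BLAS, C is not read when beta is zero. */
            c[idx] = (beta == 0.0f) ? alpha * acc
                                    : alpha * acc + beta * c[idx];
        }
    }
}
```

With A = [[1, 2], [3, 4]] stored as {1, 3, 2, 4} and B = [[5, 6], [7, 8]] stored as {5, 7, 6, 8}, calling it with alpha = 1 and beta = 0 fills C with {19, 43, 22, 50}, which is [[19, 22], [43, 50]] read column by column. A slow reference like this can be handy for spot-checking a GPU result on small inputs.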
Linking a library outside the default path
A vendor library often lives somewhere the linker does not search by default
(CUDA's lib64, a custom OpenBLAS prefix). The manifest [link] table takes
search directories so the build resolves them at both link and run time without
LD_LIBRARY_PATH:
[link]
search-paths = ["/usr/local/cuda/lib64"]
Each entry becomes both -L<dir> and -Wl,-rpath,<dir>; relative entries
resolve against the manifest directory. See
Modules & packages for the full manifest.
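To illustrate the expansion rule, the sketch below shows in C how one search-paths entry could turn into the two linker flags. This is not the build tool's actual code: the function expand_search_path and its signature are hypothetical, and it assumes POSIX-style paths where a leading / marks an absolute entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: expand one [link] search-paths entry into
 * "-L<dir> -Wl,-rpath,<dir>". Relative entries are resolved against
 * manifest_dir. Returns 0 on success, -1 on an empty entry or if a
 * buffer is too small. */
int expand_search_path(const char *manifest_dir, const char *entry,
                       char *out, size_t out_size)
{
    assert(manifest_dir != NULL && entry != NULL && out != NULL);
    if (strlen(entry) == 0) {
        return -1;
    }

    char dir[1024];
    int n;
    if (entry[0] == '/') {
        /* Absolute entry: use as-is. */
        n = snprintf(dir, sizeof dir, "%s", entry);
    } else {
        /* Relative entry: resolve against the manifest directory. */
        n = snprintf(dir, sizeof dir, "%s/%s", manifest_dir, entry);
    }
    if (n < 0 || (size_t)n >= sizeof dir) {
        return -1;
    }

    /* The same directory feeds both the link-time and run-time search. */
    n = snprintf(out, out_size, "-L%s -Wl,-rpath,%s", dir, dir);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return 0;
}
```

Under these assumptions, the entry /usr/local/cuda/lib64 from the manifest above would expand to -L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64, and a relative entry such as vendor/lib in a manifest at /home/me/proj would use /home/me/proj/vendor/lib for both flags.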