GPU & numerics

C+ does not have a kernel language, a GPU dialect, or a built-in tensor type. It does not need any of them. The language core stays small, and heavy numerics come from packages that bind the vendor's own SDK through plain C FFI. You get the same pre-tuned matmul the platform ships, called from typed C+ with no shim layer in between.

For a macOS-only reference recipe, see Metal compute. It compiles a Metal shader, embeds the .metallib, dispatches it through Objective-C FFI, and validates GPU readback.

Two properties make this pleasant rather than painful:

It is just FFI. Each binding is an extern fn over the real library (Metal, cuBLAS, a CBLAS implementation), so there is nothing to reimplement and nothing to fall behind the vendor. C+ is a consumer of these SDKs.
Resources are Drop-managed. Device buffers, BLAS handles, and Metal objects each free themselves at scope exit (cudaFree, cublasDestroy, objc_release), so the ownership model that protects host memory protects device resources too.

The backend matrix

Pick the package for the hardware you are on:

Platform	GPU	CPU
Apple	metal — Metal + MPS	accelerate — Accelerate (BLAS, vDSP)
NVIDIA	cuda — CUDA Runtime + cuBLAS	—
Cross-platform	—	cblas — OpenBLAS / Netlib / MKL

For CPU-side vector math that does not need a BLAS at all, the simd package gives portable f32x4 and integer-lane kernels (see also the SIMD types language page).

A typical shape

GPU numerics in C+ follows the same rhythm regardless of backend: get a handle or device, move data to where the compute happens, call the pre-tuned routine, read the result back. Every handle in that chain frees itself.

import "cuda/cublas" as cublas;

guard let cublas::Handle::new() else { return 1; }   // Drop = cublasDestroy
// ... upload inputs into DeviceBuffers (Drop = cudaFree), run sgemm, copy out ...

The matmul itself is the vendor's sgemm, column-major, exactly as the SDK defines it. See each package page for the concrete API:

cuda — NVIDIA GPU compute and cuBLAS.
cblas — the cross-platform CPU BLAS path.
accelerate — Apple's CPU numerics and the reference path for checking GPU results.
metal — Apple GPU compute and MPS.
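
To make the column-major convention concrete, here is a minimal sketch in plain C of what a no-transpose sgemm computes: C = alpha * A * B + beta * C, with element (i, j) of each matrix stored at offset j * ld + i. It is written in C rather than C+ so as not to guess at any package's API. The names ref_sgemm_colmajor and CM are hypothetical and exist only for this illustration. A naive loop like this is only useful as a readable reference for checking results; it is not how the vendor routine is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Element (row, col) of a column-major matrix with leading dimension ld. */
#define CM(a, ld, row, col) ((a)[(size_t)(col) * (size_t)(ld) + (size_t)(row)])

/*
 * Naive reference for C = alpha * A * B + beta * C, no transposes.
 * A is m x k, B is k x n, C is m x n, all column-major.
 * Hypothetical helper for illustration; not part of any C+ package.
 */
static void ref_sgemm_colmajor(int m, int n, int k, float alpha,
                               const float *a, int lda,
                               const float *b, int ldb,
                               float beta, float *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    /* A leading dimension must be at least the matrix's row count. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {          /* walk C one column at a time */
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += CM(a, lda, i, p) * CM(b, ldb, p, j);
            }
            /* As in BLAS, beta == 0 means the old C is not read at all. */
            float prev = (beta == 0.0f) ? 0.0f : beta * CM(c, ldc, i, j);
            CM(c, ldc, i, j) = alpha * acc + prev;
        }
    }
}
```

For example, the 2 x 2 matrix with rows (1, 2) and (3, 4) is stored as {1, 3, 2, 4}, and the matrix with rows (5, 6) and (7, 8) as {5, 7, 6, 8}. Calling ref_sgemm_colmajor(2, 2, 2, 1.0f, a, 2, b, 2, 0.0f, c, 2) on those arrays fills c with {19, 43, 22, 50}, which is the product with rows (19, 22) and (43, 50) in the same column-major layout. Passing row-major data without transposing it is the usual source of wrong results when checking a GPU path against a CPU reference.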

Linking a library outside the default path

A vendor library often lives somewhere the linker does not search by default (CUDA's lib64, a custom OpenBLAS prefix). The manifest [link] table takes search directories, so the libraries in them resolve at both link time and run time without setting LD_LIBRARY_PATH:

[link]
search-paths = ["/usr/local/cuda/lib64"]

Each entry becomes both -L<dir> and -Wl,-rpath,<dir>; relative entries resolve against the manifest directory. See Modules & packages for the full manifest.