# GPU & numerics

C+ does not have a kernel language, a GPU dialect, or a built-in tensor type, and
it does not need any of them. The language core stays small, and heavy numerics
come from **packages that bind the vendor's own SDK** through plain C FFI. You
get the same pre-tuned matmul the platform ships, called from typed C+ with no
shim layer in between.

Two properties make this pleasant rather than painful:

- **It is just FFI.** Each binding is `extern fn` over the real library
  (`Metal`, `cuBLAS`, a CBLAS), so there is nothing to reimplement and nothing
  that can fall behind the vendor's releases. C+ is a *consumer* of these SDKs.
- **Resources are `Drop`-managed.** Device buffers, BLAS handles, and Metal
  objects each free themselves at scope exit (`cudaFree`, `cublasDestroy`,
  `objc_release`), so the ownership model that protects host memory protects
  device resources too.

## The backend matrix

Pick the package for the hardware you are on:

| Platform | GPU | CPU |
|---|---|---|
| **Apple** | [metal](/docs/packages/metal) — Metal + MPS | [accelerate](/docs/packages/accelerate) — Accelerate (BLAS, vDSP) |
| **NVIDIA** | [cuda](/docs/packages/cuda) — CUDA Runtime + cuBLAS | — |
| **Cross-platform** | — | [cblas](/docs/packages/cblas) — OpenBLAS / Netlib / MKL |

For CPU-side vector math that does not need a BLAS at all, the
[simd](/docs/packages/simd) package gives portable `f32x4` and integer-lane
kernels (see also the [SIMD types](/docs/simd) language page).

## A typical shape

GPU numerics in C+ follows the same rhythm regardless of backend: get a handle
or device, move data to where the compute happens, call the pre-tuned routine,
read the result back. Every handle in that chain frees itself.

```cplus
import "cuda/cublas" as cublas;

guard let handle = cublas::Handle::new() else { return 1; }   // Drop = cublasDestroy
// ... upload inputs into DeviceBuffers (Drop = cudaFree), run sgemm, copy out ...
```

The matmul itself is the vendor's `sgemm`, column-major, exactly as the SDK
defines it. See each package page for the concrete API:

- **[cuda](/docs/packages/cuda)** — NVIDIA GPU compute and cuBLAS.
- **[cblas](/docs/packages/cblas)** — the cross-platform CPU BLAS path.
- **[accelerate](/docs/packages/accelerate)** — Apple's CPU numerics and the
  reference path for checking GPU results.
- **[metal](/docs/packages/metal)** — Apple GPU compute and MPS.
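
To make the column-major convention concrete, here is a minimal plain-C sketch of what an `sgemm` call computes when no operand is transposed: `C = alpha * A * B + beta * C`. It is a naive reference loop, not the vendor's implementation and not part of any C+ package; the names `ref_sgemm` and `cm` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) in a column-major matrix whose columns are
 * ld elements apart (the "leading dimension"). */
static size_t cm(size_t row, size_t col, size_t ld) { return row + col * ld; }

/* Hypothetical reference routine: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n; all three are column-major. */
static void ref_sgemm(size_t m, size_t n, size_t k, float alpha,
                      const float *a, size_t lda,
                      const float *b, size_t ldb,
                      float beta, float *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[cm(i, p, lda)] * b[cm(p, j, ldb)];
            /* As in BLAS, C is not read when beta is zero. */
            size_t idx = cm(i, j, ldc);
            c[idx] = (beta == 0.0f) ? alpha * acc
                                    : alpha * acc + beta * c[idx];
        }
    }
}
```

For example, the matrix with rows `[1 2]` and `[3 4]` is stored as `{1, 3, 2, 4}`, and the matrix with rows `[5 6]` and `[7 8]` as `{5, 7, 6, 8}`. Multiplying them with `alpha = 1` and `beta = 0` leaves `{19, 43, 22, 50}` in the output buffer, which is the product with rows `[19 22]` and `[43 50]` read column by column. A loop like this is also a handy host-side check for small GPU results.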

## Linking a library outside the default path

A vendor library often lives somewhere the linker does not search by default
(CUDA's `lib64`, a custom OpenBLAS prefix). The manifest `[link]` table takes
search directories, so the library is found at both link time and run time
without setting `LD_LIBRARY_PATH`:

```toml
[link]
search-paths = ["/usr/local/cuda/lib64"]
```

Each entry becomes both `-L<dir>` and `-Wl,-rpath,<dir>`; relative entries
resolve against the manifest directory. See
[Modules & packages](/docs/modules-and-packages) for the full manifest.
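
The expansion rule above can be sketched in a few lines of C. This is only an illustration of the mapping described here, not the build tool's actual code: the helper name `link_flags` is hypothetical, and it assumes POSIX-style paths and does no normalization of `.` or `..` segments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: expand one search-paths entry into the two linker
 * flags it stands for. Relative entries are joined onto manifest_dir.
 * Returns 0 on success, -1 if a buffer is too small. */
static int link_flags(const char *manifest_dir, const char *entry,
                      char *out, size_t out_len)
{
    char dir[512];
    int n;

    assert(manifest_dir && entry && out);
    if (entry[0] == '/')
        n = snprintf(dir, sizeof dir, "%s", entry);                   /* absolute */
    else
        n = snprintf(dir, sizeof dir, "%s/%s", manifest_dir, entry);  /* relative */
    if (n < 0 || (size_t)n >= sizeof dir)
        return -1;

    /* One entry yields a link-time search path and a run-time rpath. */
    n = snprintf(out, out_len, "-L%s -Wl,-rpath,%s", dir, dir);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With a manifest in `/proj`, the entry `/usr/local/cuda/lib64` expands to `-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64`, while a relative entry such as `vendor/openblas/lib` would expand against `/proj/vendor/openblas/lib`.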
