# cuda

Typed bindings to the **CUDA Runtime** and **cuBLAS** for NVIDIA GPU compute.
This is plain C FFI over the vendor SDK: C+ stays a *consumer* of CUDA, with no
kernel language and nothing to reimplement. For the wider GPU/numerics picture
and the backend matrix, see [GPU & numerics](/docs/gpu-numerics).

Two sub-modules:

- `cuda/runtime` — device management and memory. A `DeviceBuffer` owns device
  memory and frees it on `Drop` (`cudaFree`), so device allocations follow the
  same ownership rules as host memory.
- `cuda/cublas` — a cuBLAS `Handle` (created once, freed on `Drop` via
  `cublasDestroy`) exposing the dense Level-3 / Level-2 routines `sgemm` and
  `sgemv`, **column-major**, matching the cuBLAS ABI exactly.

```cplus
import "cuda/runtime" as cuda;
import "cuda/cublas" as cublas;

guard let cublas::Handle::new() as h else { return 1; }   // Drop = cublasDestroy

// Inputs live in DeviceBuffers; each frees its device memory on scope exit.
let dA: cuda::DeviceBuffer = cuda::DeviceBuffer::from_host(a_host);
let dB: cuda::DeviceBuffer = cuda::DeviceBuffer::from_host(b_host);
let dC: cuda::DeviceBuffer = cuda::DeviceBuffer::zeros(m * n);

// Column-major C = alpha*A*B + beta*C, where A is m x k, B is k x n, C is m x n.
h.sgemm(m, n, k, 1.0f32, dA, dB, 0.0f32, dC);

dC.copy_to_host(c_host);
```

Because cuBLAS is column-major, lay out matrices column-major (or pass the
transpose flags) exactly as you would from C. The `Handle` and every
`DeviceBuffer` release their resources deterministically at scope exit; there is
no explicit teardown to forget.
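
To make the column-major convention concrete, here is a minimal CPU-side sketch in plain C. It shows how element `(row, col)` maps to a flat offset and how a naive `C = alpha*A*B + beta*C` reads that layout. The names `idx_colmajor` and `ref_sgemm_colmajor` are hypothetical helpers invented for this illustration; they are not part of `cuda/cublas` or cuBLAS, and the leading-dimension parameters are an assumption borrowed from the usual BLAS calling convention.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) in a column-major matrix whose columns are
   `ld` elements apart (ld is the leading dimension, at least the row count). */
static size_t idx_colmajor(size_t row, size_t col, size_t ld) {
    return row + col * ld;
}

/* Hypothetical naive reference: C = alpha*A*B + beta*C, no transposes.
   A is m x k, B is k x n, C is m x n, all stored column-major. */
static void ref_sgemm_colmajor(size_t m, size_t n, size_t k,
                               float alpha,
                               const float *a, size_t lda,
                               const float *b, size_t ldb,
                               float beta,
                               float *c, size_t ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; ++j) {          /* column of C */
        for (size_t i = 0; i < m; ++i) {      /* row of C */
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += a[idx_colmajor(i, p, lda)] * b[idx_colmajor(p, j, ldb)];
            }
            float *out = &c[idx_colmajor(i, j, ldc)];
            /* When beta is zero, do not read the old value of C, so
               uninitialized or NaN contents cannot leak into the result. */
            *out = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *out;
        }
    }
}
```

With this layout, the 2 x 2 matrix with rows `[1 2]` and `[3 4]` is stored as `{1, 3, 2, 4}`: columns are contiguous, not rows. A loop like this is slow, but it can serve as a small ground truth when checking GPU results on tiny inputs.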

## Linking

On Linux, the CUDA libraries usually live in the toolkit's `lib64` directory,
which is outside the linker's default search path. List that directory in your
manifest so the libraries resolve at both link time and run time without
setting `LD_LIBRARY_PATH`:

```toml
[link]
search-paths = ["/usr/local/cuda/lib64"]
```

For the CPU fallback and a result-checking reference, see
[cblas](/docs/packages/cblas) and [accelerate](/docs/packages/accelerate).
