
cuda

Typed bindings to the CUDA Runtime and cuBLAS for NVIDIA GPU compute. This is plain C FFI over the vendor SDK: C+ stays a consumer of CUDA, with no kernel language and nothing to reimplement. For the wider GPU/numerics picture and the backend matrix, see GPU & numerics.

Three sub-modules, plus a cuda/cuda facade that re-exports the types:

cuda/runtime — device management (device_count, set_device, synchronize) and the CudaError type (.code(), .message()).
cuda/buffer — a DeviceBuffer owns device memory and frees it on Drop (cudaFree), so device allocations follow the same ownership rules as host memory. Allocate with alloc(bytes:); move bytes with .write(from:, bytes:) / .read(to:, bytes:).
cuda/cublas — a cuBLAS Handle (created via handle(), freed on Drop via cublasDestroy_v2) exposing the dense Level-3 / Level-2 routines sgemm and sgemv, column-major, matching the cuBLAS ABI exactly.

import "cuda/runtime"  as rt;
import "cuda/buffer"   as buf;
import "cuda/cublas"   as blas;
import "stdlib/result" as result;

// hostA, hostB, hostC are host *u8 buffers laid out column-major, each
// holding a 2x2 f32 matrix (4 elements * 4 bytes = 16 bytes).

// A cuBLAS context, created once; Drop = cublasDestroy_v2.
guard let result::Result[blas::Handle, rt::CudaError]::Ok(h) = blas::handle() else { return 1 as i32; };

// `alloc` returns a value-or-reason Result; each DeviceBuffer frees its
// device memory on scope exit (Drop = cudaFree).
guard let result::Result[buf::DeviceBuffer, rt::CudaError]::Ok(dA) = buf::alloc(bytes: 16 as usize) else { return 2 as i32; };
guard let result::Result[buf::DeviceBuffer, rt::CudaError]::Ok(dB) = buf::alloc(bytes: 16 as usize) else { return 3 as i32; };
guard let result::Result[buf::DeviceBuffer, rt::CudaError]::Ok(dC) = buf::alloc(bytes: 16 as usize) else { return 4 as i32; };

// Stage inputs on the device. write/read return Option[CudaError] (None = ok).
let _ = dA.write(from: hostA, bytes: 16 as usize);
let _ = dB.write(from: hostB, bytes: 16 as usize);

// Column-major C = alpha*op(A)*op(B) + beta*C. Operands are device pointers;
// the call returns Option[CudaError] instead of trapping.
let _ = h.sgemm(
    blas::Op::N, blas::Op::N,
    2 as i32, 2 as i32, 2 as i32,
    1.0f32, dA.device_ptr(), 2 as i32, dB.device_ptr(), 2 as i32,
    0.0f32, dC.device_ptr(), 2 as i32);
let _ = rt::synchronize();

let _ = dC.read(to: hostC, bytes: 16 as usize);

Because cuBLAS is column-major, lay out matrices column-major (or pass the transpose flags) exactly as you would from C. The Handle and every DeviceBuffer release their resources deterministically at scope exit; there is no explicit teardown to forget.
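The layout rule can be checked on the host without a GPU. The following is a minimal C sketch and not part of the cuda package: `IDX`, `matrix_bytes`, `ref_sgemm`, and `ref_sgemm_row_major` are hypothetical helper names chosen for illustration, and the reference routine covers only the no-transpose case. It shows how element (i, j) maps to a column-major offset, how the 16-byte buffer size above follows from a 2x2 f32 matrix, and one common alternative to the transpose flags for row-major data: swapping the operands so that the column-major routine computes the transposed product.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (row i, column j) with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Bytes needed for a column-major f32 matrix with `cols` columns. */
static size_t matrix_bytes(int ld, int cols)
{
    return (size_t)ld * (size_t)cols * sizeof(float);
}

/* Hypothetical CPU reference, no-transpose case only:
   C = alpha * A * B + beta * C, with A m x k, B k x n, C m x n. */
static void ref_sgemm(int m, int n, int k, float alpha,
                      const float *a, int lda,
                      const float *b, int ldb,
                      float beta, float *c, int ldc)
{
    /* Each leading dimension must cover the rows of its matrix. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += a[IDX(i, p, lda)] * b[IDX(p, j, ldb)];
            }
            /* As in BLAS, C is not read when beta is zero. */
            c[IDX(i, j, ldc)] = (beta == 0.0f)
                ? alpha * acc
                : alpha * acc + beta * c[IDX(i, j, ldc)];
        }
    }
}

/* Row-major inputs without transpose flags (assumed helper).
   A row-major m x k buffer has the same bytes as the column-major
   k x m transpose, so computing C^T = B^T * A^T in column-major
   terms leaves row-major C in the output buffer. */
static void ref_sgemm_row_major(int m, int n, int k,
                                const float *a, const float *b, float *c)
{
    ref_sgemm(n, m, k, 1.0f, b, n, a, k, 0.0f, c, n);
}
```

With A = [1 2; 3 4] and B = [5 6; 7 8], the column-major buffers are {1, 3, 2, 4} and {5, 7, 6, 8}, and `ref_sgemm(2, 2, 2, 1.0f, a, 2, b, 2, 0.0f, c, 2)` fills `c` with {19, 43, 22, 50}, which is [19 22; 43 50] read column by column. Passing the row-major buffers {1, 2, 3, 4} and {5, 6, 7, 8} to `ref_sgemm_row_major` gives {19, 22, 43, 50}, the same product read row by row.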

Linking

The CUDA libraries usually live in a lib64 directory such as /usr/local/cuda/lib64, which is outside the linker's default search path. List that directory in your manifest so the libraries resolve at both link time and run time without setting LD_LIBRARY_PATH:

[link]
search-paths = ["/usr/local/cuda/lib64"]

For the CPU fallback and a result-checking reference, see cblas and accelerate.