Buffer Registry

CuBIE centralises GPU memory management through the buffer_registry module. Every CUDA-generating component registers its memory requirements, and the registry computes a layout that is allocated once at kernel launch time.

CUDABuffer Descriptors

Each buffer is described by a CUDABuffer:

name

Unique identifier (scoped to its parent component).

size

Number of elements.

location

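"shared" (on-chip memory visible to every thread in a block; fast, but limited to ~48 KB per block by default) or "local" (private to each thread; held in registers where possible, otherwise spilled to cached local memory).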

persistent

If True, the buffer survives across steps (e.g. state arrays). Non-persistent buffers can be aliased.

aliases

Name of another buffer that this one can share storage with, provided their lifetimes do not overlap.

precision

Element dtype (np.float32 or np.float64).
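The fields above can be pictured as a small record type. The sketch below is a hypothetical stand-in, not CuBIE's CUDABuffer class: the name BufferSpec, the default values for persistent and aliases, the validation rules, and the nbytes helper are all assumptions made for illustration. Precision is written as a dtype name string so that the sketch needs no third-party imports.

```python
from dataclasses import dataclass
from typing import Optional

# Bytes per element for the two precisions named in the text.
ITEMSIZE = {"float32": 4, "float64": 8}


@dataclass(frozen=True)
class BufferSpec:
    """Hypothetical descriptor mirroring the fields listed above."""

    name: str
    size: int                      # number of elements, not bytes
    location: str                  # "shared" or "local"
    precision: str                 # "float32" or "float64" (assumed encoding)
    persistent: bool = False       # assumed default
    aliases: Optional[str] = None  # assumed default

    def __post_init__(self):
        # Reject values the text does not describe.
        if self.location not in ("shared", "local"):
            raise ValueError(f"unknown location: {self.location!r}")
        if self.precision not in ITEMSIZE:
            raise ValueError(f"unknown precision: {self.precision!r}")
        if self.size < 0:
            raise ValueError("size must be non-negative")

    @property
    def nbytes(self) -> int:
        # Footprint in bytes: element count times element width.
        return self.size * ITEMSIZE[self.precision]


stage_k = BufferSpec(name="stage_k", size=12, location="shared", precision="float64")
print(stage_k.nbytes)  # 12 elements * 8 bytes each
```

Keeping size in elements and deriving bytes from the precision makes it easy to see how switching from float64 to float32 halves a buffer's footprint.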

Registering Buffers

Components register buffers in their build() method:

from cubie.buffer_registry import buffer_registry

buffer_registry.register(
    name="stage_k",
    parent=self,
    size=self.n_states * self.n_stages,
    location="shared",
    precision=self.precision,
)

Allocators

After all components have registered, the registry computes offsets and produces allocator callables:

get_allocator(name, parent)

Returns a function that, given a base pointer and thread index, returns a typed array slice for the named buffer.

get_toplevel_allocators(kernel)

Returns (shared_allocator, local_allocator) for the top-level kernel launch.

get_child_allocators(parent, child)

Delegates a region of the parent’s allocation to a child component.
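To make the idea of "computing offsets and producing allocator callables" concrete, here is a minimal sketch in plain Python. It is not CuBIE's implementation: compute_offsets and make_allocator are hypothetical names, a Python list stands in for device memory, and the layout (each thread owning one contiguous region, buffers packed back to back inside it) is an assumption.

```python
def compute_offsets(sizes):
    """Pack buffers back to back; return ({name: (start, stop)}, total elements)."""
    offsets = {}
    cursor = 0
    for name, size in sizes.items():
        offsets[name] = (cursor, cursor + size)
        cursor += size
    return offsets, cursor


def make_allocator(offsets, stride, name):
    """Return a callable that slices the named buffer out of a base array.

    `stride` is the number of elements reserved per thread (assumed layout).
    """
    start, stop = offsets[name]

    def allocate(base, thread_index):
        origin = thread_index * stride  # start of this thread's region
        return base[origin + start:origin + stop]

    return allocate


sizes = {"state": 3, "stage_k": 12, "error": 3}
offsets, per_thread = compute_offsets(sizes)

base = list(range(2 * per_thread))  # stand-in for memory shared by 2 threads
alloc_stage_k = make_allocator(offsets, per_thread, "stage_k")
chunk = alloc_stage_k(base, thread_index=1)
print(offsets["stage_k"], chunk[0], len(chunk))
```

Because the offsets are fixed before the allocator is created, the closure only does index arithmetic at call time, which is the property that lets a layout be computed once and reused by every thread.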

Shared Memory Budget

Shared memory is limited per block (~48 KB by default, configurable up to roughly 100 KB on some architectures). When a kernel's total shared memory requirement exceeds the budget, CuBIE reduces the block size (fewer threads per block) until it fits.
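The arithmetic behind shrinking the block can be sketched as follows. This is an illustration only, not CuBIE's sizing logic: the function name, the assumption that shared memory use grows linearly with the thread count, and the rounding down to a whole number of 32-thread warps are all assumptions.

```python
def max_threads_per_block(bytes_per_thread, requested=256,
                          budget_bytes=48 * 1024, warp_size=32):
    """Largest warp multiple, at most `requested`, whose shared memory fits."""
    if bytes_per_thread <= 0:
        return requested  # no shared memory needed, nothing to shrink
    fit = budget_bytes // bytes_per_thread  # threads the budget can hold
    threads = min(requested, fit)
    # Round down to a whole number of warps; 0 means not even one warp fits.
    return threads - threads % warp_size


# 60 float64 elements per thread = 480 bytes per thread.
threads = max_threads_per_block(bytes_per_thread=60 * 8)
print(threads, threads * 60 * 8)
```

With 480 bytes per thread, a 256-thread block would need 122,880 bytes, so the sketch falls back to 96 threads (46,080 bytes), the largest warp multiple that fits under 48 KB.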

Aliasing buffers with non-overlapping lifetimes is the main tool for reducing shared memory pressure. For example, intermediate stage vectors that are only needed during the step computation can alias the output writeback buffer.
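The saving from aliasing can be illustrated with a toy calculation. The sketch assumes a hypothetical representation, {name: (size, alias_target)}, and assumes an aliased buffer occupies the same region as its target, with the region sized to the larger of the two. It does not check lifetimes; it simply trusts that aliased buffers are never live at the same time.

```python
def total_elements(buffers):
    """Total elements needed when aliased buffers share their target's region.

    buffers: {name: (size, alias_target or None)}
    """
    # Buffers without an alias each get their own region.
    regions = {name: size for name, (size, target) in buffers.items()
               if target is None}
    # An aliased buffer reuses its target's region, growing it if needed.
    for name, (size, target) in buffers.items():
        if target is not None:
            if target not in regions:
                raise KeyError(f"{name!r} aliases unknown buffer {target!r}")
            regions[target] = max(regions[target], size)
    return sum(regions.values())


separate = {"state": (3, None), "writeback": (12, None), "stage_k": (12, None)}
aliased = {"state": (3, None), "writeback": (12, None), "stage_k": (12, "writeback")}

print(total_elements(separate), total_elements(aliased))
```

Here the stage vectors reuse the writeback region, so the per-thread requirement drops from 27 elements to 15.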

Layout Queries

buffer_registry.shared_buffer_size(parent)

Total shared memory bytes for a component and its children.

buffer_registry.local_buffer_size(parent)

Total local memory bytes for non-persistent buffers.

buffer_registry.persistent_local_buffer_size(parent)

Total local memory bytes for persistent buffers.

Use these to verify that your component’s memory footprint is reasonable.
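What "a component and its children" means for such a total can be sketched with a small recursive sum. The nested-dictionary component tree, the shared_bytes helper, and the example sizes below are hypothetical; the real queries are the buffer_registry methods listed above.

```python
# Bytes per element for the two precisions named in the text.
ITEMSIZE = {"float32": 4, "float64": 8}


def shared_bytes(component):
    """Sum shared-buffer bytes for a component and, recursively, its children."""
    own = sum(size * ITEMSIZE[precision]
              for size, precision in component["shared"])
    return own + sum(shared_bytes(child)
                     for child in component.get("children", []))


# Hypothetical tree: an integrator that owns a solver child.
solver = {"shared": [(6, "float64"), (4, "float32")]}
integrator = {"shared": [(12, "float64")], "children": [solver]}

footprint = shared_bytes(integrator)
print(footprint)  # 96 + 48 + 16 bytes
```

Comparing a per-thread figure like this against the shared memory budget gives a quick sense of how many threads a block can hold before the block size has to shrink.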