BatchSolverKernel

class cubie.batchsolving.BatchSolverKernel.BatchSolverKernel(system: SymbolicODE, loop_settings: Dict[str, Any] | None = None, evaluate_driver_at_t: Callable | None = None, driver_del_t: Callable | None = None, profileCUDA: bool = False, step_control_settings: Dict[str, Any] | None = None, algorithm_settings: Dict[str, Any] | None = None, output_settings: Dict[str, Any] | None = None, memory_settings: Dict[str, Any] | None = None, cache_settings: Dict[str, Any] | None = None, cache: bool | str | Path = True)

Bases: CUDAFactory

Factory for the CUDA kernel that coordinates a batch integration.

Parameters:
  • system – ODE system describing the problem to integrate.

  • loop_settings – Mapping of loop configuration forwarded to cubie.integrators.SingleIntegratorRun. Recognised keys include "save_every" and "summarise_every".

  • evaluate_driver_at_t – Optional evaluation function for an interpolated forcing term.

  • driver_del_t – Optional companion callable evaluating the time derivative of the interpolated forcing term.

  • profileCUDA – Flag enabling CUDA profiling hooks.

  • step_control_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for controller configuration.

  • algorithm_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for algorithm configuration.

  • output_settings – Mapping of output configuration forwarded to the integrator. See cubie.outputhandling.OutputFunctions for recognised keys.

  • memory_settings – Mapping of memory configuration forwarded to the memory manager, typically via cubie.memory.

  • cache_settings – Mapping of cache configuration options used to build the kernel's CacheConfig.

  • cache – Flag enabling or disabling kernel caching, or a path to a custom cache directory.

Notes

The kernel delegates integration logic to SingleIntegratorRun instances and expects upstream APIs to perform batch construction. It executes the compiled loop function against kernel-managed memory slices and distributes work across GPU threads for each input batch.
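
Examples

A minimal construction sketch; building the SymbolicODE instance and any setting keys beyond "save_every" and "summarise_every" are assumptions:

>>> from cubie.batchsolving.BatchSolverKernel import BatchSolverKernel
>>> kernel = BatchSolverKernel(
...     system=my_system,  # a SymbolicODE built elsewhere
...     loop_settings={"save_every": 0.1, "summarise_every": 1.0},
... )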

_get_chunk_events(chunk_idx: int) → Tuple

Get the three CUDA events for a specific chunk.

Parameters:

chunk_idx (int) – Chunk index (0-based).

Returns:

The (h2d_event, kernel_event, d2h_event) events for the chunk.

Return type:

tuple

_invalidate_cache() → None

Mark cached outputs as invalid, flushing the cache when the cache handler is in “flush on change” mode.

_on_allocation(response: ArrayResponse) → None

Update run parameters with chunking metadata from allocation.

_setup_cuda_events(chunks: int) → None

Create CUDA events for timing instrumentation.

Parameters:

chunks (int) – Number of chunks to process.

Notes

Creates one GPU workload event and three events per chunk (h2d_transfer, kernel, d2h_transfer). Events are created regardless of verbosity; they become no-ops internally when verbosity is None.
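
Examples

A hedged sketch of the event layout described above, using numba.cuda events; the helper name and structure are illustrative rather than the actual implementation:

>>> from numba import cuda
>>> def setup_events(chunks):
...     workload_event = cuda.event()  # spans the full GPU workload
...     chunk_events = [
...         (cuda.event(), cuda.event(), cuda.event())  # h2d, kernel, d2h
...         for _ in range(chunks)
...     ]
...     return workload_event, chunk_events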

_setup_memory_manager(settings: Dict[str, Any]) → MemoryManager

Register the kernel with a memory manager instance.

Parameters:

settings – Mapping of memory configuration options recognised by the memory manager.

Returns:

Memory manager configured for solver allocations.

Return type:

MemoryManager

_validate_timing_parameters(duration: float) → None

Validate timing parameters to prevent invalid array accesses.

Parameters:

duration – Integration duration in time units.

Raises:

ValueError – When timing parameters would result in no outputs or invalid sampling.

Notes

Uses dt_min as an absolute tolerance when comparing floating-point timing parameters, adding dt_min to the requested duration. In-loop timing oversteps smaller than dt_min are treated as valid and do not trigger validation errors.
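
Examples

The tolerance rule can be illustrated as follows; this is an assumed formulation of the check, not the actual implementation:

>>> def covers_one_sample(duration, save_every, dt_min):
...     # dt_min provides absolute slack on the requested duration
...     return duration + dt_min >= save_every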

property active_outputs: ActiveOutputs

Active output array flags derived from compile_flags.

property algorithm: str

Identifier of the selected integration algorithm.

property atol: float

Absolute error tolerance applied during adaptive stepping.

build() → BatchSolverCache

Compile the integration kernel and return the resulting BatchSolverCache.

build_kernel() → None

Build and compile the CUDA integration kernel.

property cache_config: CacheConfig

Cache configuration for the kernel, parsed on demand.

property chunks

Number of chunks in the most recent run.

property compile_flags: OutputCompileFlags

Boolean compile-time controls for which output features are enabled.

property d_statuscodes: Any

Device buffer storing integration status codes.

property device_driver_coefficients: ndarray[Any, dtype[floating]] | None

Device-resident driver coefficients.

property device_function

Return the compiled CUDA device function.

Returns:

Compiled CUDA device function.

Return type:

callable

property device_observable_summaries_array: Any

Device buffer storing observable summary reductions.

property device_observables_array: Any

Device buffer storing saved observable trajectories.

property device_state_array: Any

Device buffer storing saved state trajectories.

property device_state_summaries_array: Any

Device buffer storing state summary reductions.

disable_profiling() → None

Disable CUDA profiling hooks for subsequent launches.

property driver_coefficients: ndarray[Any, dtype[floating]] | None

Horner-ordered driver coefficients on the host.

property dt: float | None

Current integrator step size when available.

property dt_max: float

Maximum allowable step size from the controller.

property dt_min: float

Minimum allowable step size from the controller.

property duration: float

Requested integration duration.

enable_profiling() → None

Enable CUDA profiling hooks for subsequent launches.

property initial_values: Any

Host view of initial state values.

property iteration_counters: Any

Host view of iteration counters at each save point.

property kernel: Callable

Compiled integration kernel callable.

limit_blocksize(blocksize: int, dynamic_sharedmem: int, bytes_per_run: int, numruns: int) → tuple[int, int]

Reduce block size until dynamic shared memory fits within limits.

Parameters:
  • blocksize – Requested CUDA block size.

  • dynamic_sharedmem – Shared-memory footprint per block at the current block size.

  • bytes_per_run – Shared-memory requirement per run.

  • numruns – Total number of runs queued for the launch.

Returns:

Adjusted block size and shared-memory footprint per block.

Return type:

tuple[int, int]

Notes

The shared-memory ceiling is 32 KiB so that three blocks can reside per SM on compute capability 7.x hardware. Larger requests reduce per-thread L1 cache availability.
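
Examples

A sketch of the reduction strategy implied by these notes; the halving loop and the assumption that each thread services one run are illustrative:

>>> SHARED_CEILING = 32 * 1024  # bytes; allows three blocks per SM on CC 7.x
>>> def limit_blocksize(blocksize, dynamic_sharedmem, bytes_per_run, numruns):
...     while dynamic_sharedmem > SHARED_CEILING and blocksize > 32:
...         blocksize //= 2
...         dynamic_sharedmem = min(blocksize, numruns) * bytes_per_run
...     return blocksize, dynamic_sharedmem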

property local_memory_elements: int

Number of precision elements required in local memory per run.

property mem_proportion: float | None

Fraction of managed memory reserved for this kernel.

property memory_manager: MemoryManager

Registered memory manager for this kernel.

property num_runs: int

Number of runs scheduled for the batch integration.

property observable_summaries: Any

Host view of observable summary reductions.

property observables: Any

Host view of saved observable trajectories.

property ouput_array_sizes_2d: SingleRunOutputSizes

Two-dimensional output sizes for individual runs.

property output_array_heights: Any

Height metadata for the batched output arrays.

property output_array_sizes_3d: BatchOutputSizes

Three-dimensional output sizes for batched runs.

property output_heights: Any

Height metadata for each host output array.

property output_length: int

Number of saved trajectory samples in the main run.

Delegates to SingleIntegratorRun.output_length() with the current duration.

property output_types: Any

Active output type identifiers configured for the run.

property parameters: Any

Host view of parameter tables.

property profileCUDA: bool

Indicate whether CUDA profiling hooks are enabled.

property rtol: float

Relative error tolerance applied during adaptive stepping.

run(inits: ndarray[Any, dtype[floating]], params: ndarray[Any, dtype[floating]], driver_coefficients: ndarray[Any, dtype[floating]] | None, duration: float, blocksize: int = 256, stream: Any | None = None, warmup: float = 0.0, t0: float = 0.0) → None

Execute the solver kernel for batch integration.

Chunking is performed along the run axis when memory constraints require splitting the batch.

Parameters:
  • inits – Initial conditions with shape (n_runs, n_states).

  • params – Parameter table with shape (n_runs, n_params).

  • driver_coefficients – Optional Horner-ordered driver interpolation coefficients with shape (num_segments, num_drivers, order + 1).

  • duration – Duration of the simulation window.

  • blocksize – CUDA block size for kernel execution.

  • stream – CUDA stream assigned to the batch launch.

  • warmup – Warmup time before the main simulation.

  • t0 – Initial integration time.

Returns:

This method performs the integration for its side effects.

Return type:

None

Notes

The kernel prepares array views, queues allocations, and executes the device loop on each chunked workload. Shared-memory demand may reduce the block size automatically, and a warning is emitted when the block size drops below a warp (32 threads).
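
Examples

Launching a batch of 64 runs for a hypothetical three-state, two-parameter system; the shapes are illustrative:

>>> import numpy as np
>>> inits = np.zeros((64, 3), dtype=np.float64)
>>> params = np.ones((64, 2), dtype=np.float64)
>>> kernel.run(inits, params, driver_coefficients=None, duration=10.0)
>>> states = kernel.state  # host view populated as a side effect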

property sample_summaries_every: float

Interval between summary metric samples from the loop.

property save_every: float | None

Interval between saved samples from the loop, or None in save_last-only mode.

property save_time: float

Elapsed time spent saving outputs during integration.

property saved_observable_indices: Any

Indices of saved observable variables.

property saved_state_indices: Any

Indices of saved state variables.

set_cache_dir(path: str | Path) → None

Set a custom cache directory for compiled kernels.

Parameters:

path – New cache directory path. Can be absolute or relative.
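
Examples

Pointing the cache at a project-local directory; the path shown is hypothetical:

>>> from pathlib import Path
>>> kernel.set_cache_dir(Path("build") / "kernel_cache")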

property shared_memory_bytes: int

Shared-memory footprint per run for the compiled kernel.

property shared_memory_elements: int

Number of precision elements required in shared memory per run.

property shared_memory_needs_padding: bool

Indicate whether shared-memory padding is required.

Returns:

True when a four-byte skew reduces bank conflicts for single precision.

Return type:

bool

Notes

Shared memory load instructions for float64 require eight-byte alignment. Padding in that scenario would misalign alternate runs and trigger misaligned-access faults, so padding only applies to single precision workloads where the skew preserves alignment.
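
Examples

The padding rule can be sketched as follows; the one-element skew for float32 is an assumed reading of the four-byte skew described above:

>>> import numpy as np
>>> def padded_stride(elements_per_run, precision):
...     if precision == np.float32:
...         return elements_per_run + 1  # four-byte skew staggers banks
...     return elements_per_run  # float64 keeps eight-byte alignment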

property state: Any

Host view of saved state trajectories.

property state_summaries: Any

Host view of state summary reductions.

property status_codes: Any

Host view of integration status codes.

property stream: Any

CUDA stream used for kernel launches.

property stream_group: str

Stream group label assigned by the memory manager.

property summaries_length: int

Number of complete summary intervals across the integration window.

Delegates to SingleIntegratorRun.summaries_length() with the current duration.

property summarise_every: float | None

Interval between summary reductions from the loop.

property summarised_observable_indices: Any

Indices of summarised observable variables.

property summarised_state_indices: Any

Indices of summarised state variables.

property summary_legend_per_variable: Any

Legend entries describing each summarised variable.

property summary_unit_modifications: Any

Unit modifications for each summarised variable.

property system: BaseODE

Underlying ODE system handled by the kernel.

property system_sizes: Any

Structured size metadata for the system.

property t0: float

Configured initial integration time.

property threads_per_loop: int

CUDA threads consumed by each run in the loop.

property total_runs: int

Total number of runs in the full batch.

update(updates_dict: Dict[str, Any] | None = None, silent: bool = False, **kwargs: Any) → set[str]

Update solver configuration parameters.

Parameters:
  • updates_dict – Mapping of parameter updates forwarded to the single integrator and compile settings.

  • silent – Flag suppressing errors when unrecognised parameters remain.

  • **kwargs – Additional parameter overrides merged into updates_dict.

Returns:

Names of parameters successfully applied.

Return type:

set[str]

Raises:

KeyError – Raised when unknown parameters persist and silent is False.

Notes

The method applies updates to the single integrator before refreshing compile-critical settings so the kernel rebuild picks up new metadata.
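
Examples

Tightening tolerances on an existing kernel; acceptance of "atol" and "rtol" as update keys is assumed from the matching properties:

>>> applied = kernel.update({"atol": 1e-8}, rtol=1e-6)
>>> applied  # names successfully applied, e.g. {'atol', 'rtol'}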

wait_for_writeback()

Wait for asynchronous writebacks into host arrays after chunked runs.

property warmup: float

Configured warmup duration.