BatchSolverKernel

class cubie.batchsolving.BatchSolverKernel.BatchSolverKernel(system: SymbolicODE, loop_settings: Dict[str, Any] | None = None, evaluate_driver_at_t: Callable | None = None, driver_del_t: Callable | None = None, profileCUDA: bool = False, step_control_settings: Dict[str, Any] | None = None, algorithm_settings: Dict[str, Any] | None = None, output_settings: Dict[str, Any] | None = None, memory_settings: Dict[str, Any] | None = None, cache_settings: Dict[str, Any] | None = None, cache: bool | str | Path = True)[source]

Bases: CUDAFactory

Factory for the CUDA kernel that coordinates a batch integration.

Parameters:
  • system – ODE system describing the problem to integrate.

  • loop_settings – Mapping of loop configuration forwarded to cubie.integrators.SingleIntegratorRun. Recognised keys include "save_every" and "summarise_every".

  • evaluate_driver_at_t – Optional evaluation function for an interpolated forcing term.

  • profileCUDA – Flag enabling CUDA profiling hooks.

  • step_control_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for controller configuration.

  • algorithm_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for algorithm configuration.

  • output_settings – Mapping of output configuration forwarded to the integrator. See cubie.outputhandling.OutputFunctions for recognised keys.

  • memory_settings – Mapping of memory configuration forwarded to the memory manager, typically via cubie.memory.

  • cache_settings – Mapping of cache configuration forwarded to cubie.cubie_cache.CubieCacheHandler.

  • cache – Cache mode control. True enables default caching, False disables caching, or a string/Path sets a custom cache directory.

Notes

The kernel delegates integration logic to SingleIntegratorRun instances and expects upstream APIs to perform batch construction. It executes the compiled loop function against kernel-managed memory slices and distributes work across GPU threads for each input batch.
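The settings mappings are plain dictionaries. A minimal sketch of the loop_settings mapping, using only the keys documented above ("save_every" and "summarise_every"); the values are illustrative placeholders:

```python
# Illustrative loop_settings mapping for BatchSolverKernel; only the
# documented keys are shown, and the values are placeholders.
loop_settings = {
    "save_every": 0.1,       # interval between saved trajectory samples
    "summarise_every": 1.0,  # interval between summary reductions
}

# Construction would then look roughly like the following, assuming
# `system` is an existing SymbolicODE instance:
# kernel = BatchSolverKernel(system, loop_settings=loop_settings)
```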

_get_chunk_events(chunk_idx: int) Tuple[source]

Get the three CUDA events for a specific chunk.

Parameters:

chunk_idx (int) – Chunk index (0-based)

Returns:

(h2d_event, kernel_event, d2h_event) for the chunk

Return type:

tuple

_invalidate_cache() None[source]

Mark cached outputs as invalid, flushing the cache if the cache handler is in "flush on change" mode.

_on_allocation(response: ArrayResponse) None[source]

Update run parameters with chunking metadata from allocation.

_setup_cuda_events(chunks: int) None[source]

Create CUDA events for timing instrumentation.

Parameters:

chunks (int) – Number of chunks to process

Notes

Creates one GPU workload event and three events per chunk (h2d_transfer, kernel, d2h_transfer). Events are created regardless of verbosity; they become no-ops internally when verbosity is None.

_setup_memory_manager(settings: Dict[str, Any]) MemoryManager[source]

Register the kernel with a memory manager instance.

Parameters:

settings – Mapping of memory configuration options recognised by the memory manager.

Returns:

Memory manager configured for solver allocations.

Return type:

MemoryManager

_validate_timing_parameters(duration: float) None[source]

Validate timing parameters to prevent invalid array accesses.

Parameters:

duration – Integration duration in time units.

Raises:

ValueError – When timing parameters would result in no outputs or invalid sampling.

Notes

Uses dt_min as an absolute tolerance when comparing floating-point timing parameters by adding dt_min to the requested duration. In-loop timing oversteps smaller than dt_min are treated as valid and do not trigger validation errors.
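The dt_min tolerance can be sketched as a standalone check; `validate_duration` and its parameters are illustrative names, and the real method reads these values from the kernel's configuration:

```python
def validate_duration(duration, save_every, dt_min):
    """Raise if the requested window cannot produce a single output.

    dt_min is added to the duration as an absolute tolerance, so a
    shortfall smaller than dt_min does not trigger a validation error.
    """
    if duration + dt_min < save_every:
        raise ValueError("duration too short: no samples would be saved")


validate_duration(1.0, 0.1, 1e-6)        # fine: many samples fit
validate_duration(0.0999999, 0.1, 1e-6)  # fine: within dt_min tolerance
```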

property active_outputs: ActiveOutputs

Active output array flags derived from compile_flags.

property algorithm: str

Identifier of the selected integration algorithm.

property atol: float

Absolute error tolerance applied during adaptive stepping.

build() BatchSolverCache[source]

Compile the integration kernel and return it.

build_kernel() None[source]

Build and compile the CUDA integration kernel.

property cache_config: CacheConfig

Cache configuration for the kernel, parsed on demand.

property chunks

Number of chunks in the most recent run.

property compile_flags: OutputCompileFlags

Boolean compile-time controls for which output features are enabled.

property device_driver_coefficients: ndarray[Any, dtype[floating]] | None

Device-resident driver coefficients.

property device_function

Return the compiled CUDA device function.

Returns:

Compiled CUDA device function.

Return type:

callable

disable_profiling() None[source]

Disable CUDA profiling hooks for subsequent launches.

property driver_coefficients: ndarray[Any, dtype[floating]] | None

Horner-ordered driver coefficients on the host.

property dt: float | None

Current integrator step size when available.

property dt_max: float

Maximum allowable step size from the controller.

property dt_min: float

Minimum allowable step size from the controller.

property duration: float

Requested integration duration.

enable_profiling() None[source]

Enable CUDA profiling hooks for subsequent launches.

property initial_values: Any

Host view of initial state values.

property iteration_counters: Any

Host view of iteration counters at each save point.

property kernel: Callable

Compiled integration kernel callable.

limit_blocksize(blocksize: int, dynamic_sharedmem: int, bytes_per_run: int, numruns: int) tuple[int, int][source]

Reduce block size until dynamic shared memory fits within limits.

Parameters:
  • blocksize – Requested CUDA block size.

  • dynamic_sharedmem – Shared-memory footprint per block at the current block size.

  • bytes_per_run – Shared-memory requirement per run.

  • numruns – Total number of runs queued for the launch.

Returns:

Adjusted block size and shared-memory footprint per block.

Return type:

tuple[int, int]

Notes

The shared-memory ceiling uses 32 KiB so that three blocks can reside per SM on compute capability 7.x hardware. Larger requests reduce per-thread L1 availability.

property local_memory_elements: int

Number of precision elements required in local memory per run.

property mem_proportion: float | None

Fraction of managed memory reserved for this kernel.

property memory_manager: MemoryManager

Registered memory manager for this kernel.

property n_drivers: int

Number of interpolated driver inputs for the system.

property num_runs: int

Number of runs scheduled for the batch integration.

property observable_summaries: Any

Host view of observable summary reductions.

property observables: Any

Host view of saved observable trajectories.

property output_array_heights: Any

Height metadata for the batched output arrays.

property output_heights: Any

Height metadata for each host output array.

property output_length: int

Number of saved trajectory samples in the main run.

Delegates to SingleIntegratorRun.output_length() with the current duration.

property output_types: Any

Active output type identifiers configured for the run.

property parameters: Any

Host view of parameter tables.

property profileCUDA: bool

Indicate whether CUDA profiling hooks are enabled.

property rtol: float

Relative error tolerance applied during adaptive stepping.

run(inits: ndarray[Any, dtype[floating]], params: ndarray[Any, dtype[floating]], driver_coefficients: ndarray[Any, dtype[floating]] | None, duration: float, blocksize: int = 256, stream: Any | None = None, warmup: float = 0.0, t0: float = 0.0) None[source]

Execute the solver kernel for batch integration.

Chunking is performed along the run axis when memory constraints require splitting the batch.

Parameters:
  • inits – Initial conditions with shape (n_states, n_runs).

  • params – Parameter table with shape (n_params, n_runs).

  • driver_coefficients – Optional Horner-ordered driver interpolation coefficients with shape (num_segments, num_drivers, order + 1).

  • duration – Duration of the simulation window.

  • blocksize – CUDA block size for kernel execution.

  • stream – CUDA stream assigned to the batch launch.

  • warmup – Warmup time before the main simulation.

  • t0 – Initial integration time.

Notes

The kernel prepares array views, queues allocations, and executes the device loop on each chunked workload. Shared-memory demand may reduce the block size automatically, emitting a warning when the limit drops below a warp.
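The run-axis chunking can be sketched with plain NumPy slicing. Here `chunk_size` and the loop structure are illustrative; the real kernel derives chunk boundaries from the memory manager's allocation response:

```python
import numpy as np

n_states, n_params, n_runs = 3, 2, 10
inits = np.zeros((n_states, n_runs))   # shape (n_states, n_runs)
params = np.zeros((n_params, n_runs))  # shape (n_params, n_runs)

chunk_size = 4  # illustrative; chosen by the memory manager in practice
chunk_widths = []
for start in range(0, n_runs, chunk_size):
    stop = min(start + chunk_size, n_runs)
    inits_chunk = inits[:, start:stop]    # views along the run axis, no copies
    params_chunk = params[:, start:stop]
    chunk_widths.append(stop - start)
    # ...queue h2d transfer, launch kernel on the chunk, queue d2h transfer...
```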

property sample_summaries_every: float

Interval between summary metric samples from the loop.

property save_every: float | None

Interval between saved samples from the loop, or None if save_last only.

property save_time: float

Elapsed time spent saving outputs during integration.

property saved_observable_indices: Any

Indices of saved observable variables.

property saved_state_indices: Any

Indices of saved state variables.

set_cache_dir(path: str | Path) None[source]

Set a custom cache directory for compiled kernels.

Parameters:

path – New cache directory path. Can be absolute or relative.

property shared_memory_bytes: int

Shared-memory footprint per run for the compiled kernel.

property shared_memory_elements: int

Number of precision elements required in shared memory per run.

property shared_memory_needs_padding: bool

Indicate whether shared-memory padding is required.

Returns:

True when a four-byte skew reduces bank conflicts for single precision.

Return type:

bool

Notes

Shared memory load instructions for float64 require eight-byte alignment. Padding in that scenario would misalign alternate runs and trigger misaligned-access faults, so padding only applies to single precision workloads where the skew preserves alignment.
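The alignment argument can be checked with a little arithmetic; the `run_bytes` value below is illustrative:

```python
def run_offsets(run_bytes, skew_bytes, n_runs):
    """Byte offset of each run's shared-memory slice with a per-run skew."""
    return [r * (run_bytes + skew_bytes) for r in range(n_runs)]


# With an 8-byte-aligned per-run footprint and a 4-byte skew, every odd
# run starts at offset % 8 == 4: misaligned for float64 loads.
offsets = run_offsets(64, 4, 4)  # [0, 68, 136, 204]

# float32 loads only need 4-byte alignment, which the skew preserves,
# so the padding is safe for single precision only.
```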

property state: Any

Host view of saved state trajectories.

property state_summaries: Any

Host view of state summary reductions.

property status_codes: Any

Host view of integration status codes.

property stream: Any

CUDA stream used for kernel launches.

property stream_group: str

Stream group label assigned by the memory manager.

property summaries_length: int

Number of complete summary intervals across the integration window.

Delegates to SingleIntegratorRun.summaries_length() with the current duration.

property summarise_every: float | None

Interval between summary reductions from the loop.

property summarised_observable_indices: Any

Indices of summarised observable variables.

property summarised_state_indices: Any

Indices of summarised state variables.

property summary_legend_per_variable: Any

Legend entries describing each summarised variable.

property summary_unit_modifications: Any

Unit modifications for each summarised variable.

property system: BaseODE

Underlying ODE system handled by the kernel.

property system_sizes: Any

Structured size metadata for the system.

property t0: float

Configured initial integration time.

property threads_per_loop: int

CUDA threads consumed by each run in the loop.

update(updates_dict: Dict[str, Any] | None = None, silent: bool = False, **kwargs: Any) set[str][source]

Update solver configuration parameters.

Parameters:
  • updates_dict – Mapping of parameter updates forwarded to the single integrator and compile settings.

  • silent – Flag suppressing errors when unrecognised parameters remain.

  • **kwargs – Additional parameter overrides merged into updates_dict.

Returns:

Names of parameters successfully applied.

Return type:

set[str]

Raises:

KeyError – Raised when unknown parameters persist and silent is False.

Notes

The method applies updates to the single integrator before refreshing compile-critical settings so the kernel rebuild picks up new metadata.
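The recognised/unrecognised bookkeeping can be sketched as follows; `apply_updates` and the `known` store are illustrative stand-ins for the real integrator and compile-settings targets:

```python
def apply_updates(known, updates_dict=None, silent=False, **kwargs):
    """Apply updates to `known`, returning the set of recognised names.

    Unrecognised names raise KeyError unless `silent` is True,
    mirroring the behaviour documented above.
    """
    updates = dict(updates_dict or {})
    updates.update(kwargs)  # **kwargs merge into updates_dict
    recognised = {name for name in updates if name in known}
    unrecognised = set(updates) - recognised
    if unrecognised and not silent:
        raise KeyError(f"Unknown parameters: {sorted(unrecognised)}")
    for name in recognised:
        known[name] = updates[name]
    return recognised
```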

wait_for_writeback()[source]

Wait for asynchronous writebacks into host arrays after chunked runs.

property warmup: float

Configured warmup duration.