BatchSolverKernel

class cubie.batchsolving.BatchSolverKernel.BatchSolverKernel(system: SymbolicODE, loop_settings: Dict[str, Any] | None = None, evaluate_driver_at_t: Callable | None = None, driver_del_t: Callable | None = None, profileCUDA: bool = False, step_control_settings: Dict[str, Any] | None = None, algorithm_settings: Dict[str, Any] | None = None, output_settings: Dict[str, Any] | None = None, memory_settings: Dict[str, Any] | None = None, cache_settings: Dict[str, Any] | None = None, cache: bool | str | Path = True)[source]

Bases: CUDAFactory

Factory for the CUDA kernel that coordinates a batch integration.

Parameters:
  • system – ODE system describing the problem to integrate.

  • loop_settings – Mapping of loop configuration forwarded to cubie.integrators.SingleIntegratorRun. Recognised keys include "save_every" and "summarise_every".

  • evaluate_driver_at_t – Optional evaluation function for an interpolated forcing term.

  • profileCUDA – Flag enabling CUDA profiling hooks.

  • step_control_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for controller configuration.

  • algorithm_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for algorithm configuration.

  • output_settings – Mapping of output configuration forwarded to the integrator. See cubie.outputhandling.OutputFunctions for recognised keys.

  • memory_settings – Mapping of memory configuration forwarded to the memory manager, typically via cubie.memory.

  • cache_settings – Mapping of cache configuration forwarded to cubie.cubie_cache.CubieCacheHandler.

  • cache – Cache mode control. True enables default caching, False disables caching, or a string/Path sets a custom cache directory.

Notes

The kernel delegates integration logic to SingleIntegratorRun instances and expects upstream APIs to perform batch construction. It executes the compiled loop function against kernel-managed memory slices and distributes work across GPU threads for each input batch.
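The settings mappings are plain dictionaries. A minimal sketch of the loop_settings mapping, using only the keys documented above ("save_every" and "summarise_every"); the values are illustrative placeholders:

```python
# Illustrative loop_settings mapping for BatchSolverKernel; only the
# documented keys are shown, and the values are placeholders.
loop_settings = {
    "save_every": 0.1,       # interval between saved trajectory samples
    "summarise_every": 1.0,  # interval between summary reductions
}

# Construction would then look roughly like the following, assuming
# `system` is an existing SymbolicODE instance:
# kernel = BatchSolverKernel(system, loop_settings=loop_settings)
```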

_get_chunk_events(chunk_idx: int) Tuple[source]

Get the three CUDA events for a specific chunk.

Parameters:

chunk_idx (int) – Chunk index (0-based)

Returns:

(h2d_event, kernel_event, d2h_event) for the chunk

Return type:

tuple

_invalidate_cache() None[source]

Mark cached outputs as invalid, flushing the cache if the cache handler is in "flush on change" mode.

_on_allocation(response: ArrayResponse) None[source]

Update run parameters with chunking metadata from allocation.

_setup_cuda_events(chunks: int) None[source]

Create CUDA events for timing instrumentation.

Parameters:

chunks (int) – Number of chunks to process

Notes

Creates one GPU workload event and three events per chunk (h2d_transfer, kernel, d2h_transfer). Events are created regardless of verbosity; they become no-ops internally when verbosity is None.

_setup_memory_manager(settings: Dict[str, Any]) MemoryManager[source]

Register the kernel with a memory manager instance.

Parameters:

settings – Mapping of memory configuration options recognised by the memory manager.

Returns:

Memory manager configured for solver allocations.

Return type:

MemoryManager

_validate_timing_parameters(duration: float) None[source]

Validate timing parameters to prevent invalid array accesses.

Parameters:

duration – Integration duration in time units.

Raises:

ValueError – When timing parameters would result in no outputs or invalid sampling.

Notes

Uses dt_min as an absolute tolerance when comparing floating-point timing parameters by adding dt_min to the requested duration. In-loop timing oversteps smaller than dt_min are treated as valid and do not trigger validation errors.
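The dt_min tolerance can be sketched as a standalone check; `validate_duration` and its parameters are illustrative names, and the real method reads these values from the kernel's configuration:

```python
def validate_duration(duration, save_every, dt_min):
    """Raise if the requested window cannot produce a single output.

    dt_min is added to the duration as an absolute tolerance, so a
    shortfall smaller than dt_min does not trigger a validation error.
    """
    if duration + dt_min < save_every:
        raise ValueError("duration too short: no samples would be saved")


validate_duration(1.0, 0.1, 1e-6)        # fine: many samples fit
validate_duration(0.0999999, 0.1, 1e-6)  # fine: within dt_min tolerance
```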

property active_outputs: ActiveOutputs

Active output array flags derived from compile_flags.

property algorithm: str

Identifier of the selected integration algorithm.

property atol: float

Absolute error tolerance applied during adaptive stepping.

build() BatchSolverCache[source]

Compile the integration kernel and return it.

build_kernel() None[source]

Build and compile the CUDA integration kernel.

property cache_config: CacheConfig

Cache configuration for the kernel, parsed on demand.

property chunks

Number of chunks in the most recent run.

property compile_flags: OutputCompileFlags

Boolean compile-time controls for which output features are enabled.

property device_driver_coefficients: ndarray[Any, dtype[floating]] | None

Device-resident driver coefficients.

property device_function

Return the compiled CUDA device function.

Returns:

Compiled CUDA device function.

Return type:

callable

disable_profiling() None[source]

Disable CUDA profiling hooks for subsequent launches.

property driver_coefficients: ndarray[Any, dtype[floating]] | None

Horner-ordered driver coefficients on the host.

property dt: float | None

Current integrator step size when available.

property dt_max: float

Maximum allowable step size from the controller.

property dt_min: float

Minimum allowable step size from the controller.

property duration: float

Requested integration duration.

enable_profiling() None[source]

Enable CUDA profiling hooks for subsequent launches.

property initial_values: Any

Host view of initial state values.

property iteration_counters: Any

Host view of iteration counters at each save point.

property kernel: Callable

Compiled integration kernel callable.

limit_blocksize(blocksize: int, dynamic_sharedmem: int, bytes_per_run: int, numruns: int) tuple[int, int][source]

Reduce block size until dynamic shared memory fits within limits.

Parameters:
  • blocksize – Requested CUDA block size.

  • dynamic_sharedmem – Shared-memory footprint per block at the current block size.

  • bytes_per_run – Shared-memory requirement per run.

  • numruns – Total number of runs queued for the launch.

Returns:

Adjusted block size and shared-memory footprint per block.

Return type:

tuple[int, int]

Notes

The shared-memory ceiling uses 32 KiB so that three blocks can reside per SM on compute capability 7.x hardware. Larger requests reduce per-thread L1 availability.

property local_memory_elements: int

Number of precision elements required in local memory per run.

property mem_proportion: float | None

Fraction of managed memory reserved for this kernel.

property memory_manager: MemoryManager

Registered memory manager for this kernel.

property n_drivers: int

Number of interpolated driver inputs for the system.

property num_runs: int

Number of runs scheduled for the batch integration.

property observable_summaries: Any

Host view of observable summary reductions.

property observables: Any

Host view of saved observable trajectories.

property output_array_heights: Any

Height metadata for the batched output arrays.

property output_heights: Any

Height metadata for each host output array.

property output_length: int

Number of saved trajectory samples in the main run.

Delegates to SingleIntegratorRun.output_length() with the current duration.

property output_types: Any

Active output type identifiers configured for the run.

property parameters: Any

Host view of parameter tables.

property profileCUDA: bool

Indicate whether CUDA profiling hooks are enabled.

property rtol: float

Relative error tolerance applied during adaptive stepping.

run(inits: ndarray[Any, dtype[floating]], params: ndarray[Any, dtype[floating]], driver_coefficients: ndarray[Any, dtype[floating]] | None, duration: float, blocksize: int = 256, stream: Any | None = None, warmup: float = 0.0, t0: float = 0.0) None[source]

Execute the solver kernel for batch integration.

Chunking is performed along the run axis when memory constraints require splitting the batch.

Parameters:
  • inits – Initial conditions with shape (n_states, n_runs).

  • params – Parameter table with shape (n_params, n_runs).

  • driver_coefficients – Optional Horner-ordered driver interpolation coefficients with shape (num_segments, num_drivers, order + 1).

  • duration – Duration of the simulation window.

  • blocksize – CUDA block size for kernel execution.

  • stream – CUDA stream assigned to the batch launch.

  • warmup – Warmup time before the main simulation.

  • t0 – Initial integration time.

Notes

The kernel prepares array views, queues allocations, and executes the device loop on each chunked workload. Shared-memory demand may reduce the block size automatically, emitting a warning when the limit drops below a warp.
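The run-axis chunking can be sketched with plain NumPy slicing. Here `chunk_size` and the loop structure are illustrative; the real kernel derives chunk boundaries from the memory manager's allocation response:

```python
import numpy as np

n_states, n_params, n_runs = 3, 2, 10
inits = np.zeros((n_states, n_runs))   # shape (n_states, n_runs)
params = np.zeros((n_params, n_runs))  # shape (n_params, n_runs)

chunk_size = 4  # illustrative; chosen by the memory manager in practice
chunk_widths = []
for start in range(0, n_runs, chunk_size):
    stop = min(start + chunk_size, n_runs)
    inits_chunk = inits[:, start:stop]    # views along the run axis, no copies
    params_chunk = params[:, start:stop]
    chunk_widths.append(stop - start)
    # ...queue h2d transfer, launch kernel on the chunk, queue d2h transfer...
```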

property sample_summaries_every: float

Interval between summary metric samples from the loop.

property save_every: float | None

Interval between saved samples from the loop, or None if save_last only.

property save_time: float

Elapsed time spent saving outputs during integration.

property saved_observable_indices: Any

Indices of saved observable variables.

property saved_state_indices: Any

Indices of saved state variables.

set_cache_dir(path: str | Path) None[source]

Set a custom cache directory for compiled kernels.

Parameters:

path – New cache directory path. Can be absolute or relative.

property shared_memory_bytes: int

Shared-memory footprint per run for the compiled kernel.

property shared_memory_elements: int

Number of precision elements required in shared memory per run.

property shared_memory_needs_padding: bool

Indicate whether shared-memory padding is required.

Returns:

True when a four-byte skew reduces bank conflicts for single precision.

Return type:

bool

Notes

Shared memory load instructions for float64 require eight-byte alignment. Padding in that scenario would misalign alternate runs and trigger misaligned-access faults, so padding only applies to single precision workloads where the skew preserves alignment.
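The alignment argument can be checked with a little arithmetic; the `run_bytes` value below is illustrative:

```python
def run_offsets(run_bytes, skew_bytes, n_runs):
    """Byte offset of each run's shared-memory slice with a per-run skew."""
    return [r * (run_bytes + skew_bytes) for r in range(n_runs)]


# With an 8-byte-aligned per-run footprint and a 4-byte skew, every odd
# run starts at offset % 8 == 4: misaligned for float64 loads.
offsets = run_offsets(64, 4, 4)  # [0, 68, 136, 204]

# float32 loads only need 4-byte alignment, which the skew preserves,
# so the padding is safe for single precision only.
```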

property state: Any

Host view of saved state trajectories.

property state_summaries: Any

Host view of state summary reductions.

property status_codes: Any

Host view of integration status codes.

property stream: Any

CUDA stream used for kernel launches.

property stream_group: str

Stream group label assigned by the memory manager.

property summaries_length: int

Number of complete summary intervals across the integration window.

Delegates to SingleIntegratorRun.summaries_length() with the current duration.

property summarise_every: float | None

Interval between summary reductions from the loop.

property summarised_observable_indices: Any

Indices of summarised observable variables.

property summarised_state_indices: Any

Indices of summarised state variables.

property summary_legend_per_variable: Any

Legend entries describing each summarised variable.

property summary_unit_modifications: Any

Unit modifications for each summarised variable.

property system: BaseODE

Underlying ODE system handled by the kernel.

property system_sizes: Any

Structured size metadata for the system.

property t0: float

Configured initial integration time.

property threads_per_loop: int

CUDA threads consumed by each run in the loop.

update(updates_dict: Dict[str, Any] | None = None, silent: bool = False, **kwargs: Any) set[str][source]

Update solver configuration parameters.

Parameters:
  • updates_dict – Mapping of parameter updates forwarded to the single integrator and compile settings.

  • silent – Flag suppressing errors when unrecognised parameters remain.

  • **kwargs – Additional parameter overrides merged into updates_dict.

Returns:

Names of parameters successfully applied.

Return type:

set[str]

Raises:

KeyError – Raised when unknown parameters persist and silent is False.

Notes

The method applies updates to the single integrator before refreshing compile-critical settings so the kernel rebuild picks up new metadata.
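The recognised/unrecognised bookkeeping can be sketched as follows; `apply_updates` and the `known` store are illustrative stand-ins for the real integrator and compile-settings targets:

```python
def apply_updates(known, updates_dict=None, silent=False, **kwargs):
    """Apply updates to `known`, returning the set of recognised names.

    Unrecognised names raise KeyError unless `silent` is True,
    mirroring the behaviour documented above.
    """
    updates = dict(updates_dict or {})
    updates.update(kwargs)  # **kwargs merge into updates_dict
    recognised = {name for name in updates if name in known}
    unrecognised = set(updates) - recognised
    if unrecognised and not silent:
        raise KeyError(f"Unknown parameters: {sorted(unrecognised)}")
    for name in recognised:
        known[name] = updates[name]
    return recognised
```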

wait_for_writeback()[source]

Wait for asynchronous writebacks into host arrays after chunked runs.

property warmup: float

Configured warmup duration.