BatchSolverKernel
- class cubie.batchsolving.BatchSolverKernel.BatchSolverKernel(system: SymbolicODE, loop_settings: Dict[str, Any] | None = None, evaluate_driver_at_t: Callable | None = None, driver_del_t: Callable | None = None, profileCUDA: bool = False, step_control_settings: Dict[str, Any] | None = None, algorithm_settings: Dict[str, Any] | None = None, output_settings: Dict[str, Any] | None = None, memory_settings: Dict[str, Any] | None = None, cache_settings: Dict[str, Any] | None = None, cache: bool | str | Path = True)[source]
Bases:
CUDAFactory
Factory for a CUDA kernel that coordinates a batch integration.
- Parameters:
system – ODE system describing the problem to integrate.
loop_settings – Mapping of loop configuration forwarded to cubie.integrators.SingleIntegratorRun. Recognised keys include "save_every" and "summarise_every".
evaluate_driver_at_t – Optional evaluation function for an interpolated forcing term.
profileCUDA – Flag enabling CUDA profiling hooks.
step_control_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for controller configuration.
algorithm_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for algorithm configuration.
output_settings – Mapping of output configuration forwarded to the integrator. See cubie.outputhandling.OutputFunctions for recognised keys.
memory_settings – Mapping of memory configuration forwarded to the memory manager, typically via cubie.memory.
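Example
A minimal construction sketch. Here system is assumed to be an existing SymbolicODE instance, and the output_settings key shown is illustrative rather than a verified option.

```python
from cubie.batchsolving.BatchSolverKernel import BatchSolverKernel

# `system` is assumed to be a previously constructed SymbolicODE.
kernel = BatchSolverKernel(
    system,
    loop_settings={"save_every": 0.1, "summarise_every": 1.0},
    output_settings={"output_types": ["state"]},  # illustrative key/value
    cache=True,
)
```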
Notes
The kernel delegates integration logic to SingleIntegratorRun instances and expects upstream APIs to perform batch construction. It executes the compiled loop function against kernel-managed memory slices and distributes work across GPU threads for each input batch.
- _invalidate_cache() None[source]
Mark cached outputs as invalid, flushing the cache if cache_handler is in “flush on change” mode.
- _on_allocation(response: ArrayResponse) None[source]
Update run parameters with chunking metadata from allocation.
- _setup_cuda_events(chunks: int) None[source]
Create CUDA events for timing instrumentation.
- Parameters:
chunks (int) – Number of chunks to process
Notes
Creates one GPU workload event and three events per chunk (h2d_transfer, kernel, d2h_transfer). Events are created regardless of verbosity; they become no-ops internally when verbosity is None.
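A conceptual sketch of the event layout described above, not the actual implementation:

```python
# Conceptual sketch only: enumerates the events described above.
def event_labels(chunks: int) -> list[str]:
    labels = ["gpu_workload"]
    for i in range(chunks):
        labels += [
            f"chunk{i}_h2d_transfer",
            f"chunk{i}_kernel",
            f"chunk{i}_d2h_transfer",
        ]
    return labels

assert len(event_labels(4)) == 1 + 3 * 4  # 13 events for four chunks
```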
- _setup_memory_manager(settings: Dict[str, Any]) MemoryManager[source]
Register the kernel with a memory manager instance.
- Parameters:
settings – Mapping of memory configuration options recognised by the memory manager.
- Returns:
Memory manager configured for solver allocations.
- Return type:
MemoryManager
- _validate_timing_parameters(duration: float) None[source]
Validate timing parameters to prevent invalid array accesses.
- Parameters:
duration – Integration duration in time units.
- Raises:
ValueError – When timing parameters would result in no outputs or invalid sampling.
Notes
Uses dt_min as an absolute tolerance when comparing floating-point timing parameters by adding dt_min to the requested duration. In-loop timing oversteps smaller than dt_min are treated as valid and do not trigger validation errors.
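A sketch of the tolerance rule using a hypothetical helper; the real check is internal to this method and may differ in detail.

```python
# Hypothetical illustration of the dt_min tolerance described above.
def passes_duration_check(duration: float, save_every: float,
                          dt_min: float) -> bool:
    # dt_min is added to the requested duration as an absolute tolerance,
    # so an overstep smaller than dt_min still yields a valid sample count.
    return duration + dt_min >= save_every

assert passes_duration_check(duration=1.0, save_every=1.0, dt_min=1e-9)
```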
- property active_outputs: ActiveOutputs
Active output array flags derived from compile_flags.
- property cache_config: CacheConfig
Cache configuration for the kernel, parsed on demand.
- property chunks
Number of chunks in the most recent run.
- property compile_flags: OutputCompileFlags
Boolean compile-time controls for which output features are enabled.
- property device_driver_coefficients: ndarray[Any, dtype[floating]] | None
Device-resident driver coefficients.
- property device_function
Return the compiled CUDA device function.
- Returns:
Compiled CUDA device function.
- Return type:
callable
- property device_observable_summaries_array: Any
Device buffer storing observable summary reductions.
- property driver_coefficients: ndarray[Any, dtype[floating]] | None
Horner-ordered driver coefficients on the host.
- limit_blocksize(blocksize: int, dynamic_sharedmem: int, bytes_per_run: int, numruns: int) tuple[int, int][source]
Reduce block size until dynamic shared memory fits within limits.
- Parameters:
blocksize – Requested CUDA block size.
dynamic_sharedmem – Shared-memory footprint per block at the current block size.
bytes_per_run – Shared-memory requirement per run.
numruns – Total number of runs queued for the launch.
- Returns:
Adjusted block size and shared-memory footprint per block.
- Return type:
tuple[int, int]
Notes
The shared-memory ceiling is 32 KiB so that three blocks can reside per SM on compute capability 7.x hardware. Larger requests reduce per-thread L1 availability.
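A conceptual sketch of the reduction loop, assuming one run per thread so the per-block footprint scales with the block size; the actual method may account for occupancy differently.

```python
SHARED_MEM_CEILING = 32 * 1024  # 32 KiB, per the note above

# Conceptual sketch only, assuming one run per thread.
def limit_blocksize_sketch(blocksize: int, bytes_per_run: int,
                           numruns: int) -> tuple[int, int]:
    runs_per_block = min(blocksize, numruns)
    sharedmem = runs_per_block * bytes_per_run
    while sharedmem > SHARED_MEM_CEILING and blocksize > 1:
        blocksize //= 2  # halve until the footprint fits
        runs_per_block = min(blocksize, numruns)
        sharedmem = runs_per_block * bytes_per_run
    return blocksize, sharedmem
```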
- property memory_manager: MemoryManager
Registered memory manager for this kernel.
- property ouput_array_sizes_2d: SingleRunOutputSizes
Two-dimensional output sizes for individual runs.
- property output_array_sizes_3d: BatchOutputSizes
Three-dimensional output sizes for batched runs.
- property output_length: int
Number of saved trajectory samples in the main run.
Delegates to SingleIntegratorRun.output_length() with the current duration.
- run(inits: ndarray[Any, dtype[floating]], params: ndarray[Any, dtype[floating]], driver_coefficients: ndarray[Any, dtype[floating]] | None, duration: float, blocksize: int = 256, stream: Any | None = None, warmup: float = 0.0, t0: float = 0.0) None[source]
Execute the solver kernel for batch integration.
Chunking is performed along the run axis when memory constraints require splitting the batch.
- Parameters:
inits – Initial conditions with shape (n_runs, n_states).
params – Parameter table with shape (n_runs, n_params).
driver_coefficients – Optional Horner-ordered driver interpolation coefficients with shape (num_segments, num_drivers, order + 1).
duration – Duration of the simulation window.
blocksize – CUDA block size for kernel execution.
stream – CUDA stream assigned to the batch launch.
warmup – Warmup time before the main simulation.
t0 – Initial integration time.
- Returns:
This method performs the integration for its side effects.
- Return type:
None
Notes
The kernel prepares array views, queues allocations, and executes the device loop on each chunked workload. Shared-memory demand may reduce the block size automatically, emitting a warning when the limit drops below a warp.
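Example
A minimal launch sketch; kernel is assumed to be a BatchSolverKernel built for a system with three states and two parameters.

```python
import numpy as np

inits = np.zeros((8, 3))   # 8 runs, 3 states each
params = np.ones((8, 2))   # 8 runs, 2 parameters each

kernel.run(
    inits,
    params,
    driver_coefficients=None,  # no interpolated forcing term
    duration=10.0,
    blocksize=256,
    warmup=1.0,  # settle transients before the saved window
    t0=0.0,
)
```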
- property save_every: float | None
Interval between saved samples from the loop, or None if save_last only.
- set_cache_dir(path: str | Path) None[source]
Set a custom cache directory for compiled kernels.
- Parameters:
path – New cache directory path. Can be absolute or relative.
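Usage sketch:

```python
from pathlib import Path

# Redirect compiled-kernel caching to a project-local directory.
kernel.set_cache_dir(Path(".cubie_cache"))
```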
Shared-memory footprint per run for the compiled kernel.
Number of precision elements required in shared memory per run.
Indicate whether shared-memory padding is required.
- Returns:
True when a four-byte skew reduces bank conflicts for single precision.
- Return type:
bool
Notes
Shared memory load instructions for float64 require eight-byte alignment. Padding in that scenario would misalign alternate runs and trigger misaligned-access faults, so padding only applies to single-precision workloads where the skew preserves alignment.
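A sketch of the resulting rule, assuming the precision is supplied as a NumPy dtype:

```python
import numpy as np

# Padding (a four-byte skew) is only safe for 4-byte elements; float64
# loads require eight-byte alignment, as noted above.
def padding_applies(precision) -> bool:
    return np.dtype(precision).itemsize == 4

assert padding_applies(np.float32) and not padding_applies(np.float64)
```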
- property summaries_length: int
Number of complete summary intervals across the integration window.
Delegates to SingleIntegratorRun.summaries_length() with the current duration.
- update(updates_dict: Dict[str, Any] | None = None, silent: bool = False, **kwargs: Any) set[str][source]
Update solver configuration parameters.
- Parameters:
updates_dict – Mapping of parameter updates forwarded to the single integrator and compile settings.
silent – Flag suppressing errors when unrecognised parameters remain.
**kwargs – Additional parameter overrides merged into updates_dict.
- Returns:
Names of parameters successfully applied.
- Return type:
set[str]
- Raises:
KeyError – Raised when unknown parameters persist and silent is False.
Notes
The method applies updates to the single integrator before refreshing compile-critical settings so the kernel rebuild picks up new metadata.
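Example
A usage sketch; dt_min here is an illustrative controller key, not a verified one.

```python
# Update the save interval and a controller setting in one call; the
# returned set names the parameters that were recognised and applied.
applied = kernel.update({"save_every": 0.05}, dt_min=1e-7)
print(sorted(applied))
```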