BatchSolverKernel
- class cubie.batchsolving.BatchSolverKernel.BatchSolverKernel(system: SymbolicODE, loop_settings: Dict[str, Any] | None = None, evaluate_driver_at_t: Callable | None = None, driver_del_t: Callable | None = None, profileCUDA: bool = False, step_control_settings: Dict[str, Any] | None = None, algorithm_settings: Dict[str, Any] | None = None, output_settings: Dict[str, Any] | None = None, memory_settings: Dict[str, Any] | None = None, cache_settings: Dict[str, Any] | None = None, cache: bool | str | Path = True)[source]
Bases:
CUDAFactory
Factory for a CUDA kernel that coordinates a batch integration.
- Parameters:
system – ODE system describing the problem to integrate.
loop_settings – Mapping of loop configuration forwarded to cubie.integrators.SingleIntegratorRun. Recognised keys include "save_every" and "summarise_every".
evaluate_driver_at_t – Optional evaluation function for an interpolated forcing term.
profileCUDA – Flag enabling CUDA profiling hooks.
step_control_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for controller configuration.
algorithm_settings – Mapping of overrides forwarded to cubie.integrators.SingleIntegratorRun for algorithm configuration.
output_settings – Mapping of output configuration forwarded to the integrator. See cubie.outputhandling.OutputFunctions for recognised keys.
memory_settings – Mapping of memory configuration forwarded to the memory manager, typically via cubie.memory.
cache_settings – Mapping of cache configuration forwarded to cubie.cubie_cache.CubieCacheHandler.
cache – Cache mode control. True enables default caching, False disables caching, and a string/Path sets a custom cache directory.
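A minimal construction sketch, assuming an existing SymbolicODE instance; the specific keys shown ("save_every", "summarise_every") come from the parameter list above, while "output_types" and the cache directory name are illustrative assumptions:

```python
from pathlib import Path

# Recognised loop keys per the parameter list above.
loop_settings = {"save_every": 0.1, "summarise_every": 1.0}
# Hypothetical output key; see cubie.outputhandling.OutputFunctions
# for the actual recognised keys.
output_settings = {"output_types": ["state"]}

# Assuming `my_system` is a SymbolicODE describing the problem:
# kernel = BatchSolverKernel(
#     system=my_system,
#     loop_settings=loop_settings,
#     output_settings=output_settings,
#     cache=Path("./kernel_cache"),  # str/Path selects a custom cache dir
# )
```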
Notes
The kernel delegates integration logic to SingleIntegratorRun instances and expects upstream APIs to perform batch construction. It executes the compiled loop function against kernel-managed memory slices and distributes work across GPU threads for each input batch.
- _invalidate_cache() None[source]
Mark cached outputs as invalid, flushing the cache when the cache handler is in "flush on change" mode.
- _on_allocation(response: ArrayResponse) None[source]
Update run parameters with chunking metadata from allocation.
- _setup_cuda_events(chunks: int) None[source]
Create CUDA events for timing instrumentation.
- Parameters:
chunks (int) – Number of chunks to process
Notes
Creates one GPU workload event and three events per chunk (h2d_transfer, kernel, d2h_transfer). Events are created regardless of verbosity; they become no-ops internally when verbosity is None.
- _setup_memory_manager(settings: Dict[str, Any]) MemoryManager[source]
Register the kernel with a memory manager instance.
- Parameters:
settings – Mapping of memory configuration options recognised by the memory manager.
- Returns:
Memory manager configured for solver allocations.
- Return type:
MemoryManager
- _validate_timing_parameters(duration: float) None[source]
Validate timing parameters to prevent invalid array accesses.
- Parameters:
duration – Integration duration in time units.
- Raises:
ValueError – When timing parameters would result in no outputs or invalid sampling.
Notes
Uses dt_min as an absolute tolerance when comparing floating point timing parameters by adding dt_min to the requested duration. Small in-loop timing oversteps smaller than dt_min are treated as valid and do not trigger validation errors.
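The tolerance rule above can be sketched in plain Python. This is an illustrative reconstruction, not the library's implementation; the function name and the single save_every check are assumptions:

```python
def validate_duration(duration, save_every, dt_min):
    """Raise ValueError when no outputs would be produced.

    dt_min acts as an absolute tolerance: it is added to the requested
    duration, so oversteps smaller than dt_min are still valid.
    """
    if save_every is not None and save_every > duration + dt_min:
        raise ValueError(
            "save_every exceeds duration: no outputs would be saved"
        )

validate_duration(10.0, 0.1, 1e-6)       # many samples: valid
validate_duration(1.0 - 5e-7, 1.0, 1e-6) # short by less than dt_min: valid
```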
- property active_outputs: ActiveOutputs
Active output array flags derived from compile_flags.
- property cache_config: CacheConfig
Cache configuration for the kernel, parsed on demand.
- property chunks
Number of chunks in the most recent run.
- property compile_flags: OutputCompileFlags
Boolean compile-time controls for which output features are enabled.
- property device_driver_coefficients: ndarray[Any, dtype[floating]] | None
Device-resident driver coefficients.
- property device_function
Return the compiled CUDA device function.
- Returns:
Compiled CUDA device function.
- Return type:
callable
- property driver_coefficients: ndarray[Any, dtype[floating]] | None
Horner-ordered driver coefficients on the host.
- limit_blocksize(blocksize: int, dynamic_sharedmem: int, bytes_per_run: int, numruns: int) tuple[int, int][source]
Reduce block size until dynamic shared memory fits within limits.
- Parameters:
blocksize – Requested CUDA block size.
dynamic_sharedmem – Shared-memory footprint per block at the current block size.
bytes_per_run – Shared-memory requirement per run.
numruns – Total number of runs queued for the launch.
- Returns:
Adjusted block size and shared-memory footprint per block.
- Return type:
tuple[int, int]
Notes
The shared-memory ceiling uses 32 kiB so three blocks can reside per SM on compute capability 7.x hardware. Larger requests reduce per-thread L1 availability.
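A pure-Python sketch of the reduction strategy, assuming the block size is halved until the per-block footprint fits under the 32 kiB ceiling (the exact reduction policy is internal to cubie):

```python
SHARED_LIMIT = 32 * 1024  # 32 kiB ceiling -> three blocks per SM

def limit_blocksize(blocksize, dynamic_sharedmem, bytes_per_run, numruns):
    """Halve the block size until dynamic shared memory fits.

    Returns the adjusted block size and the per-block shared-memory
    footprint, mirroring the signature documented above.
    """
    while dynamic_sharedmem > SHARED_LIMIT and blocksize > 1:
        blocksize //= 2
        # One run per thread: footprint scales with occupied threads.
        dynamic_sharedmem = min(blocksize, numruns) * bytes_per_run
    return blocksize, dynamic_sharedmem
```

For example, 256 threads at 256 bytes per run (64 kiB) would be halved once to 128 threads (32 kiB), which fits exactly.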
- property memory_manager: MemoryManager
Registered memory manager for this kernel.
- property output_length: int
Number of saved trajectory samples in the main run.
Delegates to SingleIntegratorRun.output_length() with the current duration.
- run(inits: ndarray[Any, dtype[floating]], params: ndarray[Any, dtype[floating]], driver_coefficients: ndarray[Any, dtype[floating]] | None, duration: float, blocksize: int = 256, stream: Any | None = None, warmup: float = 0.0, t0: float = 0.0) None[source]
Execute the solver kernel for batch integration.
Chunking is performed along the run axis when memory constraints require splitting the batch.
- Parameters:
inits – Initial conditions with shape (n_states, n_runs).
params – Parameter table with shape (n_params, n_runs).
driver_coefficients – Optional Horner-ordered driver interpolation coefficients with shape (num_segments, num_drivers, order + 1).
duration – Duration of the simulation window.
blocksize – CUDA block size for kernel execution.
stream – CUDA stream assigned to the batch launch.
warmup – Warmup time before the main simulation.
t0 – Initial integration time.
Notes
The kernel prepares array views, queues allocations, and executes the device loop on each chunked workload. Shared-memory demand may reduce the block size automatically, emitting a warning when the limit drops below a warp.
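The shape conventions from the parameter list can be illustrated with NumPy; the array contents and dimension sizes here are placeholders, and the run() call itself is shown commented because it requires a constructed kernel and a GPU:

```python
import numpy as np

n_states, n_params, n_runs = 3, 2, 64
inits = np.zeros((n_states, n_runs), dtype=np.float64)   # (n_states, n_runs)
params = np.ones((n_params, n_runs), dtype=np.float64)   # (n_params, n_runs)

# Optional Horner-ordered driver coefficients:
# (num_segments, num_drivers, order + 1)
num_segments, num_drivers, order = 10, 1, 3
coeffs = np.zeros((num_segments, num_drivers, order + 1))

# Assuming `kernel` is a constructed BatchSolverKernel:
# kernel.run(inits, params, coeffs, duration=5.0, warmup=0.5, t0=0.0)
```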
- property save_every: float | None
Interval between saved samples from the loop, or None when only the last sample is saved (save_last).
- set_cache_dir(path: str | Path) None[source]
Set a custom cache directory for compiled kernels.
- Parameters:
path – New cache directory path. Can be absolute or relative.
Shared-memory footprint per run for the compiled kernel.
Number of precision elements required in shared memory per run.
Indicate whether shared-memory padding is required.
- Returns:
True when a four-byte skew reduces bank conflicts for single precision.
- Return type:
bool
Notes
Shared memory load instructions for float64 require eight-byte alignment. Padding in that scenario would misalign alternate runs and trigger misaligned-access faults, so padding only applies to single-precision workloads where the skew preserves alignment.
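The alignment rule above can be sketched as a per-run stride computation. This is an illustrative reconstruction (function name assumed): a one-bank (four-byte) skew is applied only for single precision, while float64 strides stay unpadded to preserve eight-byte alignment:

```python
import numpy as np

def padded_stride_bytes(elems_per_run, dtype):
    """Per-run shared-memory stride in bytes, with optional bank skew."""
    itemsize = np.dtype(dtype).itemsize
    stride = elems_per_run * itemsize
    if itemsize == 4:
        # Single precision: skew by one 4-byte bank to reduce conflicts.
        stride += 4
    # float64: no skew, so alternate runs stay 8-byte aligned.
    return stride

padded_stride_bytes(8, np.float32)  # 36 bytes: padded
padded_stride_bytes(8, np.float64)  # 64 bytes: unpadded
```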
- property summaries_length: int
Number of complete summary intervals across the integration window.
Delegates to SingleIntegratorRun.summaries_length() with the current duration.
- update(updates_dict: Dict[str, Any] | None = None, silent: bool = False, **kwargs: Any) set[str][source]
Update solver configuration parameters.
- Parameters:
updates_dict – Mapping of parameter updates forwarded to the single integrator and compile settings.
silent – Flag suppressing errors when unrecognised parameters remain.
**kwargs – Additional parameter overrides merged into
updates_dict.
- Returns:
Names of parameters successfully applied.
- Return type:
set[str]
- Raises:
KeyError – Raised when unknown parameters persist and silent is False.
Notes
The method applies updates to the single integrator before refreshing compile-critical settings so the kernel rebuild picks up new metadata.
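The merge-and-apply semantics of update() can be sketched in pure Python. This is an assumed reconstruction for illustration, not cubie's implementation; it shows kwargs merging into updates_dict, the set of recognised names being returned, and unknown keys raising KeyError unless silent is True:

```python
def update(known, updates_dict=None, silent=False, **kwargs):
    """Apply recognised updates and return the set of applied names."""
    updates = dict(updates_dict or {})
    updates.update(kwargs)  # kwargs override / extend updates_dict
    applied = {key for key in updates if key in known}
    unknown = set(updates) - applied
    if unknown and not silent:
        raise KeyError(f"Unrecognised parameters: {sorted(unknown)}")
    return applied
```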