krotov.parallelization module¶
Support routines for running the optimization in parallel across the objectives.
The time-propagation that is the main numerical effort in an optimization with
Krotov’s method can naturally be performed in parallel for the different
objectives. There are three time-propagations that happen inside
optimize_pulses():

1. A forward propagation of the initial_state of each objective under the
   initial guess pulse.

2. A backward propagation of the states \(\ket{\chi_k}\) constructed by the
   chi_constructor routine that is passed to optimize_pulses(), where the
   number of states is the same as the number of objectives.

3. A forward propagation of the initial_state of each objective under the
   optimized pulse in each iteration. This can only be parallelized per time
   step, as the propagated states from each time step collectively determine
   the pulse update for the next time step, which is then used for the next
   propagation step. (In this sense, Krotov's method is "sequential".)
The optimize_pulses() routine has a parameter parallel_map that can receive a
tuple of three "map" functions to enable parallelization, corresponding to the
three propagations listed above. If not given, qutip.parallel.serial_map() is
used for all three propagations, running in serial. Any alternative "map" must
have the same interface as qutip.parallel.serial_map().
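For illustration, here is a minimal sketch of a conforming "map" (assuming the
QuTiP 4.x serial_map interface; my_map is a hypothetical name): it receives
the task, the list of values, and optional extra arguments for the task, and
returns the list of results in order.

def my_map(task, values, task_args=None, task_kwargs=None, **kwargs):
    """Apply `task` to each element of `values`, in serial.

    This matches the interface of qutip.parallel.serial_map; `kwargs` may
    contain further options (e.g. a progress_bar) that this sketch ignores.
    """
    if task_args is None:
        task_args = ()
    if task_kwargs is None:
        task_kwargs = {}
    return [task(value, *task_args, **task_kwargs) for value in values]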
It would be natural to assume that qutip.parallel.parallel_map(), or the
slightly improved parallel_map() provided in this module, would be a good
choice for parallel execution, using multiple CPUs on the same machine.
However, these functions are only a good choice for propagations (1) and (2):
these run in parallel over the entire time grid without any communication, and
thus with minimal overhead. This is not true for propagation (3), which must
synchronize after each time step. In that case, the "naive" use of
qutip.parallel.parallel_map() results in a communication overhead that
completely dominates the propagation, and actually makes the optimization
slower (potentially by more than an order of magnitude).
The function parallel_map_fw_prop_step()
provided in this module is an
appropriate alternative implementation that uses long-running processes,
internal caching, and minimal inter-process communication to eliminate the
communication overhead as much as possible. However, the internal caching is
valid only under the assumption that the propagate function does not have
side effects.
In general,

parallel_map=(
    krotov.parallelization.parallel_map,
    krotov.parallelization.parallel_map,
    krotov.parallelization.parallel_map_fw_prop_step,
)
is a decent choice for enabling parallelization in a typical multi-objective
optimization (but don't expect wonders: general pure-Python parallelization is
an unsolved problem).
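As a sketch of how this tuple is passed to the optimization (the surrounding
setup is assumed: objectives, pulse_options, tlist, and chi_constructor
defined as in any typical krotov script):

import krotov

oct_result = krotov.optimize_pulses(
    objectives,                       # assumed: list of krotov.Objective
    pulse_options=pulse_options,      # assumed: options for each control
    tlist=tlist,                      # assumed: the time grid
    propagator=krotov.propagators.expm,
    chi_constructor=chi_constructor,  # assumed: e.g. krotov.functionals.chis_re
    parallel_map=(
        krotov.parallelization.parallel_map,
        krotov.parallelization.parallel_map,
        krotov.parallelization.parallel_map_fw_prop_step,
    ),
)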
You may implement your own "map" functions to exploit parallelization
paradigms other than Python's built-in multiprocessing, which is used here.
This includes distributed propagation, e.g. through ipyparallel clusters. To
write your own parallel_map functions, review the source code of
optimize_pulses() in detail.
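For instance, a hypothetical "map" that distributes tasks over an ipyparallel
cluster might look like the following sketch (ipyparallel_map and _apply are
illustrative names, not part of this module; as discussed above, such a map is
only suitable for propagations (1) and (2)):

import functools
import ipyparallel

def _apply(task, task_args, task_kwargs, value):
    """Helper applying `task` to a single value."""
    return task(value, *task_args, **task_kwargs)

def ipyparallel_map(task, values, task_args=None, task_kwargs=None, **kwargs):
    """Distribute `task` over the engines of a running ipyparallel cluster."""
    if task_args is None:
        task_args = ()
    if task_kwargs is None:
        task_kwargs = {}
    client = ipyparallel.Client()  # connect to a running cluster
    view = client.load_balanced_view()
    # map_sync blocks until all engines have returned, preserving order
    return view.map_sync(
        functools.partial(_apply, task, task_args, task_kwargs), values
    )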
In most cases, it will be difficult to obtain a linear speedup from parallelization: even with carefully tuned manual interprocess communication, the communication overhead can be substantial. For best results, it would be necessary to use parallel_map functions implemented in Cython, where the GIL can be released and the entire propagation (and storage of propagated states) can be done in shared-memory with no overhead.
Also note that the overhead of multi-process parallelization is
platform-dependent. On Linux, subprocesses are "forked", which causes them to
inherit the current state of the parent process without any explicit (and
expensive) inter-process communication (IPC). On other platforms, most notably
Windows and the combination of macOS with Python 3.8, subprocesses are
"spawned" instead of "forked": the subprocesses start from a clean slate, and
all objects must be transferred from the parent process via IPC. This is very
slow, and you should not expect to be able to achieve any speedup from
parallelization on such platforms.
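To check which start method applies in your environment (a quick diagnostic,
using only the standard library):

import multiprocessing

# e.g. 'fork' on Linux, 'spawn' on Windows and on macOS with Python >= 3.8
print(multiprocessing.get_start_method())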
Another caveat on platforms using "spawn" is that certain objects by default
cannot be transferred via IPC, due to limitations of the pickle protocol. This
affects lambdas and functions defined in Jupyter notebooks, in particular. The
third-party loky library provides an alternative implementation for
multi-process parallelization that does not have these restrictions, but
causes even more overhead.
You may attempt to use the various options to set_parallelization() in order
to find a combination of settings that minimizes the runtime in your
particular environment.
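For instance (a sketch; which settings are available and fastest depends on
your platform):

import krotov

# At the very top of the script/notebook:
krotov.parallelization.set_parallelization(start_method='forkserver')
# ... run the optimization and record the runtime; then, in a fresh session,
# try other settings, e.g.:
# krotov.parallelization.set_parallelization(use_loky=True)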
Summary¶
Classes:

Consumer – A process-based task consumer.

FwPropStepTask – A task that performs a single forward-propagation step.

Functions:

parallel_map – Map function task onto values, in parallel.

parallel_map_fw_prop_step – parallel_map function for the forward-propagation
by one time step.

set_parallelization – Configure multi-process parallelization.
__all__: Consumer, FwPropStepTask, parallel_map, parallel_map_fw_prop_step,
serial_map, set_parallelization
Reference¶
krotov.parallelization.set_parallelization(use_loky=False, start_method=None, loky_pickler=None, use_threadpool_limits=True)¶

Configure multi-process parallelization.
Parameters:

use_loky (bool) – Value for USE_LOKY.

start_method (None or str) – One of 'fork', 'spawn', and 'forkserver', see
multiprocessing.set_start_method(). If use_loky=True, also 'loky' and
'loky_int_main', see loky. If None, a platform- and version-dependent default
will be chosen automatically (e.g., 'fork' on Linux, 'spawn' on Windows,
'loky' if use_loky=True).

loky_pickler (None or str) – Serialization module to use for loky. One of
'cloudpickle', 'pickle'. This forces the serialization for all objects. The
default value None chooses the serialization automatically, depending on the
type of object. Using 'cloudpickle' is significantly slower than 'pickle'
(but 'pickle' cannot serialize all objects, such as lambda functions or
functions defined in a Jupyter notebook).

use_threadpool_limits (bool) – Value for USE_THREADPOOL_LIMITS.
Raises:

ImportError – if use_loky=True but loky is not installed.
Note

When working in Jupyter notebooks on systems that use the 'spawn'
start_method (Windows, or macOS with Python >= 3.8), you may have to use loky
(use_loky=True). This will incur a significant increase in multi-processing
overhead. Use Linux if you can.

Warning

This function should only be called once per script/notebook, at its very
beginning. The USE_LOKY and USE_THREADPOOL_LIMITS variables may be set at any
time.
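As a usage sketch for the situation described in the Note above:

import krotov

# First statement in a Jupyter notebook on a 'spawn' platform (e.g. Windows):
krotov.parallelization.set_parallelization(use_loky=True)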
krotov.parallelization.parallel_map(task, values, task_args=None, task_kwargs=None, num_cpus=None, progress_bar=None)¶

Map function task onto values, in parallel.

This function's interface is identical to qutip.parallel.parallel_map() as of
QuTiP 4.5.0, but it has the option of using loky as a backend (see
set_parallelization()). It also eliminates internal threads, according to
USE_THREADPOOL_LIMITS.
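A minimal usage sketch (square is an arbitrary example task; with a
multiprocessing backend, the task must be defined at module level so that it
can be pickled):

from krotov.parallelization import parallel_map

def square(x, exponent=2):
    """Example task: raise `x` to the given power."""
    return x ** exponent

# Results are returned in the same order as `values`:
results = parallel_map(square, values=range(10), task_kwargs={'exponent': 2})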
class krotov.parallelization.Consumer(task_queue, result_queue, data)¶

Bases: multiprocessing.context.Process

A process-based task consumer.

This is for internal use in parallel_map_fw_prop_step().

Parameters:

task_queue (multiprocessing.JoinableQueue) – A queue from which to read
tasks.

result_queue (multiprocessing.Queue) – A queue in which to put the results of
a task.

data – Cached (in-process) data that will be passed to each task.
class krotov.parallelization.FwPropStepTask(i_state, pulse_vals, time_index)¶

Bases: object

A task that performs a single forward-propagation step.

The task object is a callable, receiving a single tuple of the same form as
task_args in parallel_map_fw_prop_step() as input. This data is internally
cached by the Consumer that will execute the task.

This is for internal use in parallel_map_fw_prop_step().

Parameters:

i_state (int) – The index of the state to propagate, that is, the index of
the objective from whose initial_state the propagation started.

pulse_vals (list[float]) – The values of the pulses at time_index to use.

time_index (int) – The index of the interval on the time grid covered by the
propagation step.

The passed arguments update the internal state (data) of the Consumer
executing the task; they are the minimal information that must be passed via
inter-process communication to enable the forward propagation (assuming
propagate in optimize_pulses() has no side effects).
krotov.parallelization.parallel_map_fw_prop_step(shared, values, task_args)¶

parallel_map function for the forward-propagation by one time step.

Parameters:

shared – A global object to which we can attach attributes for sharing data
between different calls to parallel_map_fw_prop_step(), allowing us to have
long-running Consumer processes, avoiding process-management overhead. This
happens to be a callable (the original internal routine for performing a
forward propagation), but here, it is (ab-)used as a storage object only.

values (list) – A list 0..(N-1) where N is the number of objectives.

task_args (tuple) – A tuple of 7 components:

1. A list of states to propagate, one for each objective.
2. The list of objectives.
3. The list of optimized pulses (updated up to time_index).
4. The "pulses mapping", cf. extract_controls_mapping().
5. The list of time grid points.
6. The index of the interval on the time grid over which to propagate.
7. A list of propagate callables, as passed to optimize_pulses(). The
   propagators must not have side effects in order for
   parallel_map_fw_prop_step() to work correctly.
krotov.parallelization.USE_LOKY = False¶

Whether to use loky instead of multiprocessing.

Set by set_parallelization().
krotov.parallelization.USE_THREADPOOL_LIMITS = True¶

Whether to limit the number of low-level BLAS/OpenMP threads.

When using multi-process parallelization, nested parallelization must be
avoided. That is, low-level numerical routines, e.g. in numpy, should not be
allowed to use multiple threads. This would lead to over-subscribing the CPUs
and can slow down the entire program by orders of magnitude.

If True, threadpoolctl will be used internally to attempt to eliminate any
nested threads.

Set by set_parallelization().

Note

Alternatively (or in addition), you may want to consider setting the
following environment variables in your shell:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
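Programmatically, the same limits can be applied with the threadpoolctl
package, which is what USE_THREADPOOL_LIMITS enables internally (a sketch;
run_optimization is a stand-in for your own code):

from threadpoolctl import threadpool_limits

# Limit all BLAS/OpenMP thread pools to a single thread within this block:
with threadpool_limits(limits=1):
    run_optimization()  # stand-in for the actual optimization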