krotov.parallelization module¶
Support routines for running the optimization in parallel across the objectives.
The time-propagation that is the main numerical effort in an optimization with
Krotov’s method can naturally be performed in parallel for the different
objectives. There are three time-propagations that happen inside
optimize_pulses():

1. A forward propagation of the initial_state of each objective under the
   initial guess pulse.

2. A backward propagation of the states \(\ket{\chi_k}\) constructed by the
   chi_constructor routine that is passed to optimize_pulses(), where the
   number of states is the same as the number of objectives.

3. A forward propagation of the initial_state of each objective under the
   optimized pulse in each iteration. This can only be parallelized per time
   step, as the propagated states from each time step collectively determine
   the pulse update for the next time step, which is then used for the next
   propagation step. (In this sense, Krotov's method is "sequential".)
The optimize_pulses() routine has a parameter parallel_map that can receive a
tuple of three "map" functions to enable parallelization, corresponding to the
three propagations listed above. If not given, qutip.parallel.serial_map() is
used for all three propagations, running in serial. Any alternative "map" must
have the same interface as qutip.parallel.serial_map().
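For illustration, here is a minimal sketch of a conforming "map" (assuming the
QuTiP 4.x serial_map interface; my_map is a hypothetical name): it receives
the task, the list of values, and optional extra arguments for the task, and
returns the list of results in order.

def my_map(task, values, task_args=None, task_kwargs=None, **kwargs):
    """Apply `task` to each element of `values`, in serial.

    This matches the interface of qutip.parallel.serial_map; `kwargs` may
    contain further options (e.g. a progress_bar) that this sketch ignores.
    """
    if task_args is None:
        task_args = ()
    if task_kwargs is None:
        task_kwargs = {}
    return [task(value, *task_args, **task_kwargs) for value in values]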
It would be natural to assume that qutip.parallel.parallel_map(), or the
slightly improved parallel_map() provided in this module, would be a good
choice for parallel execution, using multiple CPUs on the same machine.
However, these functions are only a good choice for propagations (1) and (2):
these run in parallel over the entire time grid without any communication, and
thus with minimal overhead. This is not true for propagation (3), which must
synchronize after each time step. In that case, the "naive" use of
qutip.parallel.parallel_map() results in a communication overhead that
completely dominates the propagation, and actually makes the optimization
slower (potentially by more than an order of magnitude).
The function parallel_map_fw_prop_step()
provided in this module is an
appropriate alternative implementation that uses long-running processes,
internal caching, and minimal inter-process communication to eliminate the
communication overhead as much as possible. However, the internal caching is
valid only under the assumption that the propagate function does not have
side effects.
In general,

parallel_map=(
    krotov.parallelization.parallel_map,
    krotov.parallelization.parallel_map,
    krotov.parallelization.parallel_map_fw_prop_step,
)
is a decent choice for enabling parallelization in a typical multi-objective
optimization (but don't expect wonders: general pure-Python parallelization is
an unsolved problem).
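As a sketch of how this tuple is passed to the optimization (the surrounding
setup is assumed: objectives, pulse_options, tlist, and chi_constructor
defined as in any typical krotov script):

import krotov

oct_result = krotov.optimize_pulses(
    objectives,                       # assumed: list of krotov.Objective
    pulse_options=pulse_options,      # assumed: options for each control
    tlist=tlist,                      # assumed: the time grid
    propagator=krotov.propagators.expm,
    chi_constructor=chi_constructor,  # assumed: e.g. krotov.functionals.chis_re
    parallel_map=(
        krotov.parallelization.parallel_map,
        krotov.parallelization.parallel_map,
        krotov.parallelization.parallel_map_fw_prop_step,
    ),
)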
You may implement your own "map" functions to exploit parallelization
paradigms other than Python's built-in multiprocessing, which is used here.
This includes distributed propagation, e.g. through ipyparallel clusters. To
write your own parallel_map functions, review the source code of
optimize_pulses() in detail.
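For instance, a hypothetical "map" that distributes tasks over an ipyparallel
cluster might look like the following sketch (ipyparallel_map and _apply are
illustrative names, not part of this module; as discussed above, such a map is
only suitable for propagations (1) and (2)):

import functools
import ipyparallel

def _apply(task, task_args, task_kwargs, value):
    """Helper applying `task` to a single value."""
    return task(value, *task_args, **task_kwargs)

def ipyparallel_map(task, values, task_args=None, task_kwargs=None, **kwargs):
    """Distribute `task` over the engines of a running ipyparallel cluster."""
    if task_args is None:
        task_args = ()
    if task_kwargs is None:
        task_kwargs = {}
    client = ipyparallel.Client()  # connect to a running cluster
    view = client.load_balanced_view()
    # map_sync blocks until all engines have returned, preserving order
    return view.map_sync(
        functools.partial(_apply, task, task_args, task_kwargs), values
    )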
In most cases, it will be difficult to obtain a linear speedup from parallelization: even with carefully tuned manual interprocess communication, the communication overhead can be substantial. For best results, it would be necessary to use parallel_map functions implemented in Cython, where the GIL can be released and the entire propagation (and storage of propagated states) can be done in shared-memory with no overhead.
Also note that the overhead of multi-process parallelization is
platform-dependent. On Linux, subprocesses are "forked", which causes them to
inherit the current state of the parent process without any explicit (and
expensive) inter-process communication (IPC). On other platforms, most notably
Windows and the combination of macOS with Python 3.8, subprocesses are
"spawned" instead of "forked": the subprocesses start from a clean slate, and
all objects must be transferred from the parent process via IPC. This is very
slow, and you should not expect to be able to achieve any speedup from
parallelization on such platforms.
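To check which start method applies in your environment (a quick diagnostic,
using only the standard library):

import multiprocessing

# e.g. 'fork' on Linux, 'spawn' on Windows and on macOS with Python >= 3.8
print(multiprocessing.get_start_method())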
Another caveat on platforms using "spawn" is that certain objects by default
cannot be transferred via IPC, due to limitations of the pickle protocol. This
affects lambdas and functions defined in Jupyter notebooks, in particular. The
third-party loky library provides an alternative implementation for
multi-process parallelization that does not have these restrictions, but
causes even more overhead.
You may attempt to use the various options to set_parallelization() in order
to find a combination of settings that minimizes the runtime in your
particular environment.
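For instance (a sketch; which settings are available and fastest depends on
your platform):

import krotov

# At the very top of the script/notebook:
krotov.parallelization.set_parallelization(start_method='forkserver')
# ... run the optimization and record the runtime; then, in a fresh session,
# try other settings, e.g.:
# krotov.parallelization.set_parallelization(use_loky=True)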
Summary¶
Classes:

Consumer – A process-based task consumer.

FwPropStepTask – A task that performs a single forward-propagation step.

Functions:

parallel_map – Map function task onto values, in parallel.

parallel_map_fw_prop_step – parallel_map function for the forward-propagation
by one time step.

set_parallelization – Configure multi-process parallelization.
__all__: Consumer, FwPropStepTask, parallel_map, parallel_map_fw_prop_step,
serial_map, set_parallelization
Reference¶
krotov.parallelization.set_parallelization(use_loky=False, start_method=None, loky_pickler=None, use_threadpool_limits=True)¶

Configure multi-process parallelization.
Parameters:

use_loky (bool) – Value for USE_LOKY.

start_method (None or str) – One of 'fork', 'spawn', and 'forkserver', see
multiprocessing.set_start_method(). If use_loky=True, also 'loky' and
'loky_int_main', see loky. If None, a platform- and version-dependent default
will be chosen automatically (e.g., 'fork' on Linux, 'spawn' on Windows,
'loky' if use_loky=True).

loky_pickler (None or str) – Serialization module to use for loky. One of
'cloudpickle', 'pickle'. This forces the serialization for all objects. The
default value None chooses the serialization automatically, depending on the
type of object. Using 'cloudpickle' is significantly slower than 'pickle'
(but 'pickle' cannot serialize all objects, such as lambda functions or
functions defined in a Jupyter notebook).

use_threadpool_limits (bool) – Value for USE_THREADPOOL_LIMITS.
Raises:

ImportError – if use_loky=True but loky is not installed.
Note

When working in Jupyter notebooks on systems that use the 'spawn'
start_method (Windows, or macOS with Python >= 3.8), you may have to use loky
(use_loky=True). This will incur a significant increase in multi-processing
overhead. Use Linux if you can.

Warning

This function should only be called once per script/notebook, at its very
beginning. The USE_LOKY and USE_THREADPOOL_LIMITS variables may be set at any
time.
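As a usage sketch for the situation described in the Note above:

import krotov

# First statement in a Jupyter notebook on a 'spawn' platform (e.g. Windows):
krotov.parallelization.set_parallelization(use_loky=True)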
krotov.parallelization.parallel_map(task, values, task_args=None, task_kwargs=None, num_cpus=None, progress_bar=None)¶

Map function task onto values, in parallel.

This function's interface is identical to qutip.parallel.parallel_map() as of
QuTiP 4.5.0, but it has the option of using loky as a backend (see
set_parallelization()). It also eliminates internal threads, according to
USE_THREADPOOL_LIMITS.
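A minimal usage sketch (square is an arbitrary example task; with a
multiprocessing backend, the task must be defined at module level so that it
can be pickled):

from krotov.parallelization import parallel_map

def square(x, exponent=2):
    """Example task: raise `x` to the given power."""
    return x ** exponent

# Results are returned in the same order as `values`:
results = parallel_map(square, values=range(10), task_kwargs={'exponent': 2})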
class krotov.parallelization.Consumer(task_queue, result_queue, data)¶

Bases: multiprocessing.context.Process

A process-based task consumer.

This is for internal use in parallel_map_fw_prop_step().

Parameters:

task_queue (multiprocessing.JoinableQueue) – A queue from which to read
tasks.

result_queue (multiprocessing.Queue) – A queue in which to put the results of
a task.

data – Cached (in-process) data that will be passed to each task.
class krotov.parallelization.FwPropStepTask(i_state, pulse_vals, time_index)¶

Bases: object

A task that performs a single forward-propagation step.

The task object is a callable, receiving a single tuple of the same form as
task_args in parallel_map_fw_prop_step() as input. This data is internally
cached by the Consumer that will execute the task.

This is for internal use in parallel_map_fw_prop_step().

Parameters:

i_state (int) – The index of the state to propagate, that is, the index of
the objective from whose initial_state the propagation started.

pulse_vals (list[float]) – The values of the pulses at time_index to use.

time_index (int) – The index of the interval on the time grid covered by the
propagation step.

The passed arguments update the internal state (data) of the Consumer
executing the task; they are the minimal information that must be passed via
inter-process communication to enable the forward propagation (assuming
propagate in optimize_pulses() has no side effects).
krotov.parallelization.parallel_map_fw_prop_step(shared, values, task_args)¶

parallel_map function for the forward-propagation by one time step.

Parameters:

shared – A global object to which we can attach attributes for sharing data
between different calls to parallel_map_fw_prop_step(), allowing us to have
long-running Consumer processes, avoiding process-management overhead. This
happens to be a callable (the original internal routine for performing a
forward propagation), but here, it is (ab-)used as a storage object only.

values (list) – A list 0..(N-1) where N is the number of objectives.

task_args (tuple) – A tuple of 7 components:

1. A list of states to propagate, one for each objective.
2. The list of objectives.
3. The list of optimized pulses (updated up to time_index).
4. The "pulses mapping", cf. extract_controls_mapping().
5. The list of time grid points.
6. The index of the interval on the time grid over which to propagate.
7. A list of propagate callables, as passed to optimize_pulses(). The
   propagators must not have side effects in order for
   parallel_map_fw_prop_step() to work correctly.
krotov.parallelization.USE_LOKY = False¶

Whether to use loky instead of multiprocessing.

Set by set_parallelization().
krotov.parallelization.USE_THREADPOOL_LIMITS = True¶

Whether to limit the number of low-level BLAS/OpenMP threads.

When using multi-process parallelization, nested parallelization must be
avoided. That is, low-level numerical routines, e.g. in numpy, should not be
allowed to use multiple threads. This would lead to over-subscribing the CPUs
and can slow down the entire program by orders of magnitude.

If True, threadpoolctl will be used internally to attempt to eliminate any
nested threads.

Set by set_parallelization().

Note

Alternatively (or in addition), you may want to consider setting the
following environment variables in your shell:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
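Programmatically, the same limits can be applied with the threadpoolctl
package, which is what USE_THREADPOOL_LIMITS enables internally (a sketch;
run_optimization is a stand-in for your own code):

from threadpoolctl import threadpool_limits

# Limit all BLAS/OpenMP thread pools to a single thread within this block:
with threadpool_limits(limits=1):
    run_optimization()  # stand-in for the actual optimization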