Better MP, multi-gpu, atom graphs, replay, GPS, LS-GFN, and fixes #141

Open · wants to merge 35 commits into trunk
Commits
7dbca12
first throw at refactoring SamplingIterator
bengioe Feb 28, 2024
939cb56
Merge branch 'trunk' into bengioe-better-iterators
bengioe Feb 28, 2024
dfba1ca
changed all iterators to DataSource
bengioe Feb 29, 2024
e5239fb
lots of little fixes, tested all tasks, better device management
bengioe Feb 29, 2024
43dfc2b
style
bengioe Mar 1, 2024
279ecfc
change batch size hyperparameters + fix nested dataclasses
bengioe Mar 7, 2024
2ba251a
Merge branch 'trunk' into bengioe-better-iterators
bengioe Mar 7, 2024
282bbfb
move things around & prevent circular import
bengioe Mar 7, 2024
c3bc6d0
tox
bengioe Mar 7, 2024
b1c5630
fix imports
bengioe Mar 7, 2024
a64a639
replace device references with get_worker_device
bengioe Mar 7, 2024
28bcc59
little fixes
bengioe Mar 7, 2024
4811e7c
a few more stragglers
bengioe Mar 7, 2024
7d32ac1
proof of concept of using shared pinned buffers
bengioe Feb 23, 2024
d4a2a7d
32mb buffer
bengioe Feb 23, 2024
27dfc23
add to DataSource
bengioe Mar 7, 2024
e9f1dc1
various fixes
bengioe Mar 8, 2024
c048e77
major simplification by reusing pickling mechanisms
bengioe Mar 8, 2024
acfe070
memory copy + fixes and doc
bengioe Mar 11, 2024
9454da8
Merge branch 'trunk' into bengioe-mp-with-batch-buffers
bengioe Mar 11, 2024
2b9da70
Merge branch 'trunk' into bengioe-mp-with-batch-buffers
bengioe May 8, 2024
907ffcd
fix global_cfg + opt_Z when there's no Z
bengioe May 8, 2024
60722a7
fix entropy when masks are used
bengioe May 9, 2024
f859640
small fixes
bengioe May 9, 2024
d536233
removing timing prints
bengioe May 9, 2024
6c3beba
C graphs, DDP, logit scaling
bengioe Aug 21, 2024
67f4b62
C mol valence fix, mask-backwards sample, MLE in TB, priority replay,…
bengioe Aug 28, 2024
a1534be
first (bad) attempt
bengioe Aug 29, 2024
4491f6b
working local search, cond_info dict pass, allow no log dir, fix Pad …
bengioe Aug 30, 2024
c5373cf
lstb file
bengioe Aug 30, 2024
7e623bd
yield_only_accepted in LS + load_model_state flag
bengioe Sep 5, 2024
32d4caf
many fixes, frag env options
bengioe Oct 8, 2024
30bd2e3
tox
bengioe Oct 8, 2024
ccefd86
ruff & mypy
bengioe Oct 8, 2024
1669c28
bandit
bengioe Oct 8, 2024
16 changes: 15 additions & 1 deletion docs/implementation_notes.md
@@ -51,4 +51,18 @@ The data used for training GFlowNets can come from a variety of sources. `DataSource`

`DataSource` also covers validation sets, including cases such as:
- Generating new trajectories (w.r.t a fixed dataset of conditioning goals)
- Evaluating the model's likelihood on trajectories from a fixed, offline dataset

## Multiprocessing

We use the multiprocessing features of torch's `DataLoader` to parallelize data generation and featurization. This is done by setting the `DataLoader`'s `num_workers` parameter (via `cfg.num_workers`) to a value greater than 0. Because workers cannot (easily) hold a CUDA context, we have to resort to a number of tricks.
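
As a minimal sketch of how this is wired up (not the trainer's exact call; `make_loader` is an illustrative name), a `DataSource` yields ready-made batches, so automatic batching is disabled:

```python
from torch.utils.data import DataLoader

def make_loader(data_source, num_workers: int) -> DataLoader:
    # batch_size=None disables automatic batching: the DataSource already
    # yields whole batches, and num_workers > 0 moves its iteration into
    # separate worker processes.
    return DataLoader(data_source, batch_size=None, num_workers=num_workers)
```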

Because training models involves sampling from them, the worker processes need to be able to call the models. This is done by passing a wrapped model (and possibly a wrapped replay buffer) to the workers, using `gflownet.utils.multiprocessing_proxy`. These wrappers ensure that model calls are routed to the main process, where the model lives (e.g. on CUDA), and that the returned values are properly serialized and sent back to the worker process. The wrappers are also designed to be API-compatible with models, e.g. `model(input)` or `model.method(input)` will work as expected, regardless of whether `model` is a torch module or a wrapper. Note that it is only possible to call methods on these wrappers; direct attribute access is not supported.
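
A minimal sketch of the idea, assuming one request/result queue pair per worker (`ModelProxy` and `serve_model` are illustrative names, not the actual `gflownet.utils.multiprocessing_proxy` API):

```python
import torch
import torch.multiprocessing as mp

class ModelProxy:
    """Worker-side handle that forwards calls to the process owning the model."""

    def __init__(self, request_queue, result_queue):
        self._requests = request_queue
        self._results = result_queue

    def __call__(self, *args, **kwargs):
        return self._invoke("__call__", args, kwargs)

    def __getattr__(self, name):
        # Attribute access returns a callable, which is why only method
        # calls (not plain attribute reads) are supported.
        return lambda *args, **kwargs: self._invoke(name, args, kwargs)

    def _invoke(self, method, args, kwargs):
        self._requests.put((method, args, kwargs))
        return self._results.get()


def serve_model(model: torch.nn.Module, request_queue, result_queue):
    """Main-process loop: execute requests where the model (and CUDA) lives."""
    device = next(model.parameters()).device
    while True:
        method, args, kwargs = request_queue.get()
        args = [a.to(device) if torch.is_tensor(a) else a for a in args]
        fn = model if method == "__call__" else getattr(model, method)
        out = fn(*args, **kwargs)
        # Move results back to CPU so workers never need a CUDA context.
        result_queue.put(out.cpu() if torch.is_tensor(out) else out)
```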

Note that the workers do not use CUDA and therefore have to work entirely on CPU, but the code is designed to be somewhat agnostic to this fact. By using `get_worker_device`, code can be written without assuming which device it runs on; again, calls such as `model(input)` will work as expected.
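
For example, sampling code can remain device-agnostic (sketch only; the import path of `get_worker_device` and the `encoder`/`graphs` stand-ins are assumptions for illustration):

```python
from gflownet.utils.misc import get_worker_device  # import path assumed

def score_graphs(model, encoder, graphs):
    # get_worker_device returns cpu inside DataLoader workers and the
    # training device (e.g. cuda) in the main process.
    device = get_worker_device()
    x = encoder(graphs).to(device)
    # `model` may be a real torch module or a multiprocessing wrapper;
    # the call looks the same either way.
    return model(x)
```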

On message serialization: naively sending batches of data and results (`Batch` and `GraphActionCategorical`) through multiprocessing queues is fairly inefficient. Torch tries to be smart and uses shared memory for tensors sent through queues, which is unfortunately very slow: creating each shared-memory file has a high fixed cost, and `Data` `Batch`es tend to contain lots of small tensors, a poor fit for shared memory.

We implement two solutions to this problem, in order of preference (a configuration sketch follows this list):
- using `SharedPinnedBuffer`s, which are shared tensors of fixed size (`cfg.mp_buffer_size`), but initialized once and pinned. This is the fastest solution, but requires that the size of the largest possible batch/return value is known in advance. This should work for any message, but has only been tested with `Batch` and `GraphActionCategorical` messages.
- using `cfg.pickle_mp_messages`, which simply serializes messages with `pickle`. This prevents the creation of lots of shared memory files, but is slower than the `SharedPinnedBuffer` solution. This should work for any message that `pickle` can handle.
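
As a rough usage sketch (assuming the dataclass config described in these notes; the exact import path and defaults may differ):

```python
from gflownet.config import Config  # import path assumed

cfg = Config()
cfg.num_workers = 4  # parallelize data generation across 4 workers

# Preferred: a fixed-size shared pinned buffer, here 32 MB; it must be at
# least as large as the largest Batch/GraphActionCategorical message.
cfg.mp_buffer_size = 32 * 1024**2

# Fallback: serialize messages with pickle instead.
# cfg.pickle_mp_messages = True
```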
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -8,7 +8,8 @@ universal = "true"
[tool.bandit]
# B101 tests the use of assert
# B301 and B403 test the use of pickle
skips = ["B101", "B301", "B403"]
# B614 tests the use of torch.load/save
skips = ["B101", "B301", "B403", "B614"]
exclude_dirs = ["tests", ".tox", ".venv"]

[tool.pytest.ini_options]
18 changes: 16 additions & 2 deletions setup.py
@@ -2,7 +2,7 @@
from ast import literal_eval
from subprocess import check_output # nosec - command is hard-coded, no possibility of injection

from setuptools import setup
from setuptools import Extension, setup


def _get_next_version():
@@ -25,4 +25,18 @@ def _get_next_version():
return f"{major}.{minor}.{latest_patch+1}"


setup(name="gflownet", version=_get_next_version())
ext = [
Extension(
name="gflownet._C",
sources=[
"src/C/main.c",
"src/C/data.c",
"src/C/graph_def.c",
"src/C/node_view.c",
"src/C/edge_view.c",
"src/C/degree_view.c",
"src/C/mol_graph_to_Data.c",
],
)
]
setup(name="gflownet", version=_get_next_version(), ext_modules=ext)