Add network docs (#35)
Fix #24 

- Add documentation about the network tracking and how to use it in the
benchmarks
- Update the ``NetworkTracker`` class to track the total time directly,
rather than having the ``network_activity_tracker`` track the time
separately. This change addresses the following issues: a)
``NetworkTracker`` was missing the total time, so its results differed
from those of ``network_activity_tracker``, and b)
``NetworkTracker.asv_network_statistics`` was not being updated in the
``network_activity_tracker``, so the timing result was not being
recorded.
- Update ``NetworkTracker`` and ``network_activity_tracker`` to allow
the user to optionally set the process ID to track. This will be useful
if/when we need to run code we want to profile in a separate process
(e.g., when running in node.js)

---------

Co-authored-by: Cody Baker <[email protected]>
oruebel and CodyCBakerPhD authored Mar 5, 2024
1 parent a497d8d commit baf8115
Showing 6 changed files with 102 additions and 21 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -17,6 +17,5 @@ pip install -r docs/requirements-rtd.txt
then build the docs by executing the command...

```
mkdir -p docs/build/html
sphinx-build -M html docs docs/build/html
sphinx-build -M html docs docs/build
```
5 changes: 1 addition & 4 deletions docs/development.rst
@@ -112,7 +112,4 @@ which is also indented for improved human readability and line-by-line GitHub tr
If this ``results`` folder eventually becomes too large for Git to reasonably handle, we will explore options to share via other data storage services.


Network Tracking
----------------

Stay tuned https://github.com/NeurodataWithoutBorders/nwb_benchmarks/issues/24
.. include:: network_tracking.rst
24 changes: 24 additions & 0 deletions docs/network_tracking.rst
@@ -0,0 +1,24 @@
.. _network-tracking:

Network Tracking
----------------

Network tracking is implemented in the ``nwb_benchmarks.core`` module and consists of the following main components:

* ``CaptureConnections`` : This class uses the ``psutil`` library to capture network connections and map them to process IDs (PIDs). This information is then used downstream to filter network traffic packets by PID, which lets us distinguish network traffic generated by us from traffic generated by other processes running on the same system. See `core/_capture_connections.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_capture_connections.py>`_
* ``NetworkProfiler`` : This class uses the ``tshark`` command line tool (and ``pyshark`` package) to capture the network traffic (packets) generated by all processes on the system. In combination with ``CaptureConnections`` we can then filter the captured packets to retrieve the packets generated by a particular PID via the ``get_packets_for_connections`` function. See `core/_network_profiler.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_network_profiler.py>`_
* ``NetworkStatistics`` : This class provides functions for processing the network packets captured by the ``NetworkProfiler`` to compute basic network statistics, such as the number of packets sent/received or the size of the data uploaded/downloaded. The ``get_statistics`` function provides a convenient way to retrieve all the metrics via a single function call. See `core/_network_statistics.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_network_statistics.py>`_
* ``NetworkTracker`` and ``network_activity_tracker`` : The ``NetworkTracker`` class and the corresponding ``network_activity_tracker`` context manager build on the functionality implemented in the above modules to make it easy to track and compute network statistics for a given period during the execution of a piece of code.
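
The division of labor between these components can be illustrated with a small, self-contained sketch. It does not use the ``nwb_benchmarks`` API: the ``Connection`` and ``Packet`` records, the three helper functions, and all ports and sizes are hypothetical stand-ins. They are chosen only to show how mapping connections to PIDs allows captured packets to be filtered per process before statistics are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Hypothetical record: a local port and the PID of the process that owns it."""
    pid: int
    local_port: int


@dataclass(frozen=True)
class Packet:
    """Hypothetical record: a captured packet with its ports and its size in bytes."""
    src_port: int
    dst_port: int
    size: int


def local_ports_for_pid(connections, pid):
    """Collect the local ports of all connections owned by ``pid``."""
    return {c.local_port for c in connections if c.pid == pid}


def packets_for_ports(packets, ports):
    """Keep only the packets that start or end at one of the given local ports."""
    return [p for p in packets if p.src_port in ports or p.dst_port in ports]


def basic_statistics(packets, ports):
    """Count packets and bytes sent from, and received at, the given local ports."""
    sent = [p for p in packets if p.src_port in ports]
    received = [p for p in packets if p.dst_port in ports]
    return {
        "num_packets_sent": len(sent),
        "num_packets_received": len(received),
        "bytes_uploaded": sum(p.size for p in sent),
        "bytes_downloaded": sum(p.size for p in received),
    }


# Made-up capture: two processes talking to a remote service on port 443.
connections = [Connection(pid=100, local_port=50000), Connection(pid=200, local_port=50001)]
packets = [
    Packet(src_port=50000, dst_port=443, size=120),   # sent by PID 100
    Packet(src_port=443, dst_port=50000, size=1500),  # received by PID 100
    Packet(src_port=50001, dst_port=443, size=80),    # sent by PID 200
]

ports = local_ports_for_pid(connections, pid=100)
stats = basic_statistics(packets_for_ports(packets, ports), ports)
```

Here only the two packets belonging to PID 100 contribute to ``stats``; the packet sent by PID 200 is filtered out before any statistic is computed.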

.. note::

    ``CaptureConnections`` uses `psutil.net_connections() <https://psutil.readthedocs.io/en/latest/#psutil.net_connections>`_, which requires sudo/root access on macOS and AIX.

.. note::

    Running the network tracking generates additional threads/processes in order to capture traffic while the main code is running: **1)** ``NetworkProfiler.start_capture`` starts a ``subprocess`` that runs the ``tshark`` command line tool and is terminated when ``NetworkProfiler.stop_capture`` is called, and **2)** ``CaptureConnections`` implements a ``Thread`` that runs in the background. The ``NetworkTracker`` automatically starts and terminates these processes/threads, so a user typically does not need to manage them directly.
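
The background-thread pattern described in the note can be sketched with the standard library alone. This is a generic illustration and not the ``CaptureConnections`` implementation: the ``PollingThread`` class, its ``poll`` callable, and the polling interval are assumptions made for this example.

```python
import threading
import time


class PollingThread(threading.Thread):
    """Hypothetical helper: call ``poll`` repeatedly until ``stop`` is requested."""

    def __init__(self, poll, interval=0.01):
        super().__init__(daemon=True)
        self._poll = poll
        self._interval = interval
        self._stop_event = threading.Event()
        self.samples = []

    def run(self):
        # Event.wait returns True once the stop event is set, which ends the loop;
        # otherwise it times out after ``interval`` seconds and we poll again.
        while not self._stop_event.wait(self._interval):
            self.samples.append(self._poll())

    def stop(self):
        # Signal the loop to finish, then block until the thread has exited.
        self._stop_event.set()
        self.join()


# Sample the clock in the background while the "main code" (here: a sleep) runs.
worker = PollingThread(poll=time.time)
worker.start()
time.sleep(0.1)
worker.stop()
```

Wrapping the start and stop calls in a single owner object, as ``NetworkTracker`` does for its thread and subprocess, keeps this lifecycle management out of the benchmark code.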

Typical usage
^^^^^^^^^^^^^

In most cases, users will use the ``NetworkTracker`` or ``network_activity_tracker`` to track network traffic and statistics as illustrated in :ref:`network-tracking-benchmarks`.
4 changes: 3 additions & 1 deletion docs/running_benchmarks.rst
@@ -14,7 +14,9 @@ use `psutil net_connections <https://psutil.readthedocs.io/en/latest/#psutil.net
sudo nwb_benchmarks run
Or drop the ``sudo`` if on Windows. Running on Windows may also require you to set the ``TSHARK_PATH`` environment variable beforehand, which should be the absolute path to the ``tshark.exe`` on your system.
Or drop the ``sudo`` if on Windows.

When running on Windows, or if ``tshark`` is not on your ``PATH``, you may also need to set the ``TSHARK_PATH`` environment variable beforehand to the absolute path of the ``tshark`` executable (e.g., ``tshark.exe``) on your system.
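
As a rough illustration of this lookup order, the sketch below prefers an explicit ``TSHARK_PATH`` and otherwise searches the ``PATH``. The ``resolve_tshark_path`` helper is hypothetical and written only for this example; how ``nwb_benchmarks`` itself resolves ``TSHARK_PATH`` is not shown in this document.

```python
import os
import pathlib
import shutil


def resolve_tshark_path(environ=None):
    """Hypothetical helper: return the tshark location from TSHARK_PATH or the PATH.

    Returns None if neither source yields an executable.
    """
    environ = os.environ if environ is None else environ

    # An explicitly configured location takes precedence.
    explicit = environ.get("TSHARK_PATH")
    if explicit:
        return pathlib.Path(explicit)

    # Otherwise search the PATH (shutil.which falls back to os.environ if path is None).
    found = shutil.which("tshark", path=environ.get("PATH"))
    return pathlib.Path(found) if found is not None else None


# Example with a made-up installation directory.
example = resolve_tshark_path({"TSHARK_PATH": "/opt/wireshark/bin/tshark"})
```

A check like this can be run before starting a long benchmark session to confirm that ``tshark`` can be located.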

Many of the current tests can take several minutes to complete; the entire suite will take many times that. Grab some coffee, read a book, or better yet (when the suite becomes larger) just leave it to run overnight.

45 changes: 44 additions & 1 deletion docs/writing_benchmarks.rst
@@ -116,4 +116,47 @@ Notice how the ``read_hdf5_nwbfile_remfile`` function (which reads an HDF5-backe
nwbfile = io.read()
return (nwbfile, io, file, byte_stream)
and so we managed to save ~5 lines of code for every occurence of this logic in the benchmarks. Good choices of function names are critical to effectively communicating the actions being undertaken. Thorough annotation of signatures is likewise critical to understanding input/output relationships for these functions.
and so we managed to save ~5 lines of code for every occurrence of this logic in the benchmarks. Good choices of function names are critical to effectively communicating the actions being undertaken. Thorough annotation of signatures is likewise critical to understanding input/output relationships for these functions.


.. _network-tracking-benchmarks:


Writing a network tracking benchmark
------------------------------------

Functions that require network access (such as reading a file from S3) are often a black box, with functions in other libraries (e.g., ``h5py``, ``fsspec``) managing access to the remote resources. The runtime performance of such functions is often driven by how they use the network to access those resources. It is therefore important to profile the network traffic they generate in order to understand, for example, the amount of data downloaded and uploaded and the number of requests sent and received.

To simplify the implementation of benchmarks that track network statistics, the ``nwb_benchmarks.core`` module provides various helper classes and functions. The network tracking functionality is designed to track the network traffic generated by the main Python process running our tests during a user-defined period of time. The ``network_activity_tracker`` context manager tracks the network traffic generated by the code within its context. A basic network benchmark then looks as follows:

.. code-block:: python

    from nwb_benchmarks import TSHARK_PATH
    from nwb_benchmarks.core import network_activity_tracker
    import requests  # Only used here for illustration purposes

    class SimpleNetworkBenchmark:
        def track_network_activity_uri_request(self):
            with network_activity_tracker(tshark_path=TSHARK_PATH) as network_tracker:
                x = requests.get('https://nwb-benchmarks.readthedocs.io/en/latest/setup.html')
            # The statistics are computed when the context exits, so read them afterwards
            return network_tracker.asv_network_statistics

In cases where a context manager may not be sufficient, we can alternatively use the ``NetworkTracker`` class directly to explicitly control when to start and stop the tracking.

.. code-block:: python

    from nwb_benchmarks import TSHARK_PATH
    from nwb_benchmarks.core import NetworkTracker
    import requests  # Only used here for illustration purposes

    class SimpleNetworkBenchmark:
        def track_network_activity_uri_request(self):
            tracker = NetworkTracker()
            tracker.start_network_capture(tshark_path=TSHARK_PATH)
            x = requests.get('https://nwb-benchmarks.readthedocs.io/en/latest/setup.html')
            tracker.stop_network_capture()
            return tracker.asv_network_statistics

By default, the ``NetworkTracker`` and ``network_activity_tracker`` track the network activity of the current process ID (i.e., ``os.getpid()``), but the PID to track can also be set explicitly if a different process needs to be monitored.
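
As a sketch of the second case, the PID of a child process can be obtained from ``subprocess.Popen`` and then supplied as the ``pid`` argument. The ``resolve_pid`` helper and the child command below are hypothetical and only mirror the documented default. The tracker calls are left as comments because they need ``nwb_benchmarks``, ``tshark`` and, on some systems, elevated privileges; whether the connections of a short-lived child are captured in practice is not verified here.

```python
import os
import subprocess
import sys


def resolve_pid(pid=None):
    """Hypothetical helper mirroring the documented default: None means the current process."""
    return os.getpid() if pid is None else pid


# With nwb_benchmarks installed, tracking would start before the child is launched:
#     tracker = NetworkTracker()
#     tracker.start_network_capture(tshark_path=TSHARK_PATH)

# Launch a short-lived child process; its PID is the one we would want to track.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.DEVNULL,
)
child_pid = child.pid
child.wait()

# ... and the child's PID would be passed when stopping the capture:
#     tracker.stop_network_capture(pid=child_pid)
#     statistics = tracker.asv_network_statistics
```

Because ``stop_network_capture`` accepts the ``pid`` at stop time, using ``NetworkTracker`` directly suits cases where the PID is only known after the capture has started, whereas ``network_activity_tracker`` needs the ``pid`` when the context is created.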
42 changes: 29 additions & 13 deletions src/nwb_benchmarks/core/_network_tracker.py
@@ -12,22 +12,21 @@


@contextlib.contextmanager
def network_activity_tracker(tshark_path: Union[pathlib.Path, None] = None):
"""Context manager for tracking network activity and statistics for the code executed in the context"""
def network_activity_tracker(tshark_path: Union[pathlib.Path, None] = None, pid: int = None):
"""
Context manager for tracking network activity and statistics for the code executed in the context
:param tshark_path: Path to the tshark CLI command to use for tracking network traffic
:param pid: The id of the process to compute the network statistics for. If set to None, then the
PID of the current process will be used.
"""
network_tracker = NetworkTracker()

try:
network_tracker.start_network_capture(tshark_path=tshark_path)
time.sleep(0.3)

t0 = time.time()
yield network_tracker
finally:
network_tracker.stop_network_capture()

t1 = time.time()
network_total_time = t1 - t0
network_tracker.network_statistics["network_total_time_in_seconds"] = network_total_time
network_tracker.stop_network_capture(pid=pid)


class NetworkTracker:
@@ -52,11 +51,14 @@ def __init__(self):
self.pid_packets = None
self.network_statistics = None
self.asv_network_statistics = None
self.__start_capture_time = None

def start_network_capture(self, tshark_path: Union[pathlib.Path, None] = None):
"""
Start capturing the connections on this machine as well as all network packets
:param tshark_path: Path to the tshark CLI command to use for tracking network traffic
Side effects: This functions sets the following instance variables:
* self.connections_thread
* self.network_profile
@@ -69,10 +71,16 @@ def start_network_capture(self, tshark_path: Union[pathlib.Path, None] = None):
self.network_profiler = NetworkProfiler()
self.network_profiler.start_capture(tshark_path=tshark_path)

def stop_network_capture(self):
# start the main timer
self.__start_capture_time = time.time()

def stop_network_capture(self, pid: int = None):
"""
Stop capturing network packets and connections.
:param pid: The id of the process to compute the network statistics for. If set to None, then the
PID of the current process (i.e., os.getpid()) will be used.
Note: This function will fail if `start_network_capture` was not called first.
Side effects: This functions sets the following instance variables:
@@ -81,15 +89,23 @@ def stop_network_capture(self):
* self.network_statistics
* self.asv_network_statistics
"""
# stop capturing the network
self.network_profiler.stop_capture()
self.connections_thread.stop()

# get the connections for the PID of this process
self.pid_connections = self.connections_thread.get_connections_for_pid(os.getpid())
# compute the total time
stop_capture_time = time.time()
network_total_time = stop_capture_time - self.__start_capture_time

# get the connections for the PID of this process or the PID set by the user
if pid is None:
pid = os.getpid()
self.pid_connections = self.connections_thread.get_connections_for_pid(pid)
# Parse packets and filter out all the packets for this process pid by matching with the pid_connections
self.pid_packets = self.network_profiler.get_packets_for_connections(self.pid_connections)
# Compute all the network statistics
self.network_statistics = NetworkStatistics.get_statistics(packets=self.pid_packets)
self.network_statistics["network_total_time_in_seconds"] = network_total_time

# Very special structure required by ASV
# 'samples' is the value tracked in our results