This repository contains the code for the paper "Website Fingerprinting in the Age of QUIC" (PETS 2021).
The code is divided into "workflows" with each workflow responsible for one or a few related experiments. Each workflow contains the scripts for collecting and processing data, performing machine learning classification, and generating plots for the paper.
The data used to support this paper are provided in two sets:
- `quic-wf-core.tgz` (831 MB):
  - The domains used for scanning and the scan results in CSV format with headers.
  - The open-world dataset in HDF5 format with class, protocol, and VPN-location labels; arrays of packet sizes and timestamps; and packets below 175 bytes removed.
- `quic-wf-raw.tar` (28 GB):
  - The raw QUIC and TCP fetch traces and their associated metadata.
  - Each file is a JSON stream of objects with the following, possibly null-valued, keys:
    - `url`, `final_url`: requested and final redirected URLs
    - `status`: HTTP status code of the fetch
    - `protocol`: protocol used to request the main page, "quic" or "tcp"
    - `packets`: base64-encoded PCAP for the request
    - `http_trace`: Chromium DevTools performance log
Note: A QUIC trace in the dataset refers to a trace of a WireGuard tunnel containing a mixture of QUIC and TCP packets associated with requesting the web page over a QUIC connection. This means that at least the initial connection to the web server is over QUIC.
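As a sketch of how the raw trace files could be consumed: the field names below follow the key list above, but the function name and the stream framing (whitespace-separated JSON objects) are assumptions, so adjust to the actual files as needed.

```python
import base64
import json


def parse_fetch_stream(text):
    """Yield fetch records from a JSON stream of concatenated objects.

    Sketch only: the framing (objects separated by whitespace) is an
    assumption; adapt this if the files are newline-delimited JSON.
    """
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip whitespace between consecutive JSON objects.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            return
        record, idx = decoder.raw_decode(text, idx)
        # "packets" holds a base64-encoded PCAP and may be null.
        if record.get("packets") is not None:
            record["packets"] = base64.b64decode(record["packets"])
        yield record
```

Because `raw_decode` consumes one object at a time and intervening newlines are skipped as whitespace, this also handles newline-delimited JSON.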
- Bash
- Git and Git-LFS 2.17
- Python 3.7
- Dependencies listed in `requirements.txt`
- The wf-tools library and its dependencies
- Optional:
  - Docker 19.03.12
  - WireGuard v1.0.20200513
  - docker-machine v0.16.2
The code was last run on a computing cluster, with each experiment using 2–4 cores (2.4 GHz each) and 6 GB of main memory per core. Machine-learning training and testing used an additional 0–2 GPUs, depending on the classifier. These jobs were run in parallel to reduce runtime, with each train-test split requiring 1–3 hours.
Note: The current `requirements.txt` specifies `tensorflow-cpu`. If you have access to a GPU, install `tensorflow-gpu` instead.
A VirtualBox image with the code and data already downloaded is currently available at this link for quick access.
The instructions below describe how to set up and run the workflows from scratch.
```bash
# Clone the repository
git clone https://github.com/jpcsmith/wf-in-the-age-of-quic.git
# Change to the code directory
cd wf-in-the-age-of-quic/
# Download the Git LFS files
git lfs pull

# Create and activate a virtual environment
python3 -m venv env
source env/bin/activate

# Ensure that pip is the latest version
python3 -m pip install --upgrade pip
# Install the requirements using pip
python3 -m pip install --no-cache-dir -r requirements.txt
```
If the installation fails, ensure that the Python development libraries are installed and retry the above. On Ubuntu 18.04, these are the `python3.7-dev` and `python3-venv` packages.
```bash
wget https://polybox.ethz.ch/index.php/s/u10mAN6NCcDP39U/download -O quic-wf-core.tgz
tar -xzvf quic-wf-core.tgz
```
If planning to run trace collection, i.e. the Fetch QUIC Traces workflow, install Docker (19.03.12) and WireGuard (v1.0.20200513).
Change to the desired workflow's directory and follow the instructions for running the workflow.
The workflows responsible for the various sections of the paper are mapped below.
| Paper Section | Workflows | Directories |
|---|---|---|
| 4. Combined QUIC-TCP Dataset | Identify QUIC Sites<br>Fetch QUIC Traces | `workflows/identify-quic-sites`<br>`workflows/fetch-any-quic` |
| 6. From TCP to QUIC | Generalisability Analysis<br>Single and Mixed Analyses | `workflows/generalisability-analysis`<br>`workflows/single-and-mixed-analyses` |
| 7. Joint Classification of QUIC and TCP | Single and Mixed Analyses<br>Distinguish Protocol | `workflows/single-and-mixed-analyses`<br>`workflows/distinguish-protocol` |
| 8. Remove Control Packets | Removing Control Packets | `workflows/removing-control-packets` |
The following table lists the programs and files responsible for the various tables and figures found in the paper. Notebooks are located in the `notebooks/` directory and outputs in the `results/plots` directory, relative to the associated workflow.
The code and associated data are released under an MIT licence, as found in the LICENCE file.