Skip to content

Latest commit

 

History

History
57 lines (44 loc) · 1.52 KB

REPRODUCE.md

File metadata and controls

57 lines (44 loc) · 1.52 KB

AWS configuration

  • Ubuntu 22.04
  • c5.24xlarge (96 vCPUs, 192 GiB of RAM)
  • 400GB SSD

Reproducing the environment

sudo apt-get update
git clone https://github.com/ADAPT-uiuc/compare-dataframe-libraries

Install Python

# We use version 3.10.6
sudo apt install python3.10

Install pip, venv

sudo apt install python3-pip
sudo apt install python3.10-venv

Create environment (named env) and activate it

python3 -m venv env
source env/bin/activate

Install dependencies

cd compare-dataframe-libraries
pip install -r requirements.txt

Download datasets

cd datasets
./download_datasets.sh

Make sure that you see both iris.csv and yellow_tripdata_2015-01.csv.

Running notebooks

You should be able to run any of the notebooks. However, running them with ipython through the terminal will not be useful. You need to run it through a notebook environment. Below are some alternatives:

  • Launch Jupyter remotely (tutorial)
  • Run notebooks through VSCode directly.

Reproducing numbers

You should run a notebook multiple times because Jupyter virtualizes many things, including e.g., the disk. Note that the numbers may be quite different from run to run and from the paper. We could not avoid this variability. However, the conclusions should be the same, namely the orders of magnitude of slowdowns and speedups mentioned in the paper should be the same.