This example uses Deephaven to perform real-time predictions of whether an Amazon review was generated by ChatGPT. The data comes from the Amazon Reviews Dataset, collected by Julian McAuley's lab and hosted on Hugging Face.
The model used for bot prediction comes from Vidhi Kishor Waghela's entry in a ChatGPT-generated text detection Kaggle competition. The detector training data, script, and resulting PyTorch model are stored in the `detector` directory.
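If you want to poke at the detector outside of the example itself, a minimal loading sketch with PyTorch might look like the one below. The file name `detector/model.pt` and the assumption that the model was saved as a full `torch.nn.Module` are illustrative guesses, not the repository's actual layout; the training script in the `detector` directory is the source of truth.

```python
# Minimal sketch: load the trained detector with PyTorch.
# The file name and the full-module assumption below are illustrative --
# see the training script in the detector directory for the real details.
import torch

MODEL_PATH = "detector/model.pt"  # hypothetical path to the LFS-tracked model

model = torch.load(MODEL_PATH, map_location="cpu")  # assumes a pickled nn.Module
model.eval()  # switch to inference mode before scoring any text
```

Scoring a review also requires the same text preprocessing used during training, which the training script in `detector` defines.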
This Deephaven example can be run in Jupyter using Deephaven's Python package, or inside a Docker container. We've provided scripts, notebooks, and instructions for both Jupyter and Docker, so pick the path that feels most comfortable to you.
The trained PyTorch model used in this project is stored with Git LFS. To access this model, you need to install Git LFS and enable it for this repository.
- Install Git LFS by following the instructions here.
- Configure Git LFS for this repo and use it to pull the PyTorch model:

  ```bash
  git lfs install
  git lfs fetch
  git lfs pull
  ```
Now that the PyTorch model is available, continue to the Jupyter or Docker section to start working with this example.
Deephaven's Python package requires Java 17 or higher to be installed on your machine. See this page for OS-specific instructions on installing Java.
- Navigate to the `jupyter` subdirectory:

  ```bash
  cd jupyter
  ```

- Then, execute a script to set up the environment:

  ```bash
  chmod +x create-venv.sh
  ./create-venv.sh
  ```

  This creates a Python virtual environment called `dh-amazon-venv` and installs all of the required Python packages into that environment.

- Next, activate the environment and start Jupyter:

  ```bash
  source dh-amazon-venv/bin/activate
  jupyter notebook
  ```
Once you've started Jupyter, you're ready to go!
This step only needs to be done once and can take quite a while, depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on a MacBook Pro M2 with 8 cores.
- Open the `download_data.ipynb` notebook and select the `dh-amazon-venv` kernel.
- Set the `NUM_PROC` variable at the top of the second cell equal to the number of processors available to you. This has a significant impact on the download speed.
- Run the whole notebook. This will download the Amazon data, filter it for 2023, and write it to the `amazon-data` directory in Parquet format (a rough sketch of this flow appears after the list).
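For context, the core of the notebook follows roughly the pattern sketched below: pull the reviews from Hugging Face with the `datasets` library, filter to 2023, and write Parquet. The dataset repository name, subset, timestamp column, and output file name are assumptions for illustration; `download_data.ipynb` is the source of truth.

```python
# Rough sketch of the download-and-filter flow using the Hugging Face
# `datasets` library. The repository name, subset, timestamp column, and
# output path are illustrative -- defer to download_data.ipynb for the
# actual values.
from datasets import load_dataset

NUM_PROC = 8  # set to the number of processors available to you

reviews = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",  # assumed Hugging Face repo name
    "raw_review_All_Beauty",            # assumed review subset
    split="full",
    num_proc=NUM_PROC,                  # parallelizes download and preparation
    trust_remote_code=True,             # needed if the dataset uses a loading script
)

# Keep only 2023 reviews, assuming millisecond timestamps.
START_2023, START_2024 = 1_672_531_200_000, 1_704_067_200_000
reviews_2023 = reviews.filter(
    lambda r: START_2023 <= r["timestamp"] < START_2024, num_proc=NUM_PROC
)

# Write the filtered reviews to the amazon-data directory as Parquet.
reviews_2023.to_parquet("amazon-data/reviews_2023.parquet")
```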
Finally, navigate to the `detect_bots.ipynb` notebook and select the `dh-amazon-venv` kernel. This notebook walks you through the whole example and gives you the opportunity to play with Deephaven. We hope you learn something new!
To run this example with Docker, you must have Docker installed on your machine. See this guide for OS-specific instructions.
- Navigate to the `docker` subdirectory:

  ```bash
  cd docker
  ```

- Build and run the Docker image using Docker Compose:

  ```bash
  docker compose up
  ```

- Once the image is built, navigate to the Deephaven IDE at http://localhost:10000/ide/.
The Deephaven IDE contains all of the scripts associated with this example. Let's get started!
This step only needs to be done once and can take quite a while, depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on a MacBook Pro M2 with 8 cores. You may need to allocate more resources to the Docker engine to access the full capabilities of your machine. This can be done using Docker Desktop. See this guide for more details.
- In the right-hand sidebar, open the `download_data.py` script.
- Set the `NUM_PROC` variable on line 8 equal to the number of processors available to you. This has a significant impact on the download speed.
- Run the script using the "play" button at the top of the screen. This will download the Amazon data, filter it for 2023, and write it to the `amazon-data` directory in Parquet format.
Once you've downloaded the data, you're ready to start working with the example. The code is divided between two scripts, `stream_data.py` and `detect_bots.py`. Running `detect_bots.py` will also execute `stream_data.py`, so you can start with `detect_bots.py` if you'd like. We hope you enjoy this example!
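If you're curious what the real-time scoring looks like before opening the scripts, the general pattern is sketched below: a Python prediction function is called from a Deephaven query string, so each new row of a ticking table is scored as it arrives. The table, column names, and `predict_bot` stand-in are hypothetical; the actual model integration lives in `stream_data.py` and `detect_bots.py`.

```python
# Sketch of scoring a ticking Deephaven table with a Python function.
# The table, column names, and predict_bot stand-in are hypothetical --
# see stream_data.py and detect_bots.py for the real implementation.
from deephaven import time_table


def predict_bot(text: str) -> bool:
    """Stand-in for the PyTorch detector; flags obviously bot-like phrasing."""
    return "as an ai language model" in text.lower()


# A ticking table standing in for the streamed Amazon reviews.
reviews = time_table("PT1S").update(["ReviewText = `placeholder review text`"])

# Deephaven query strings can call Python functions directly, so each new
# row is scored as soon as it ticks in.
scored = reviews.update(["IsBot = (boolean) predict_bot(ReviewText)"])
```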