Finding the right Python package on PyPI can be difficult, since PyPI isn't really designed for package discovery. For example, searching for the word "plot" returns hundreds of packages that contain the word "plot", in seemingly random order.
Inspired by this blog post about finding arXiv articles using vector embeddings, I decided to build a small application that helps you find Python packages with a similar approach. For example, you can ask it "I want to make nice plots and visualizations", and it will provide you with a short list of packages that can help you with that.
The project works by collecting project summaries and descriptions for all packages on PyPI with more than 100 weekly downloads. These are converted into vector representations using Sentence Transformers. When a user enters a query, it is converted into a vector representation as well, and the most similar package descriptions are fetched from the vector database. Additional weight is given to the number of weekly downloads before the results are presented to the user in a dashboard.
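To make the pipeline concrete, here is a minimal sketch of the embed-and-search step using Sentence Transformers. The model name, the in-memory similarity search (standing in for the vector database), and the log-downloads weighting are assumptions for illustration:

```python
import math

from sentence_transformers import SentenceTransformer, util

# Hypothetical sample of the dataset: (name, summary, weekly downloads)
packages = [
    ("matplotlib", "Python plotting package", 30_000_000),
    ("requests", "Python HTTP for Humans.", 50_000_000),
    ("seaborn", "Statistical data visualization", 10_000_000),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
corpus_embeddings = model.encode([summary for _, summary, _ in packages])

query = "I want to make nice plots and visualizations"
query_embedding = model.encode(query)

# Cosine similarity between the query and every package summary
similarities = util.cos_sim(query_embedding, corpus_embeddings)[0]

# Give additional weight to packages with many downloads (scheme is assumed)
scores = [
    float(sim) * math.log10(downloads)
    for sim, (_, _, downloads) in zip(similarities, packages)
]

for (name, _, _), score in sorted(zip(packages, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.2f}")
```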
The project uses the following technologies:
- FastAPI for the API backend
- NextJS and TailwindCSS for the frontend
- Sentence Transformers for vector embeddings
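As a rough sketch of how these pieces fit together, a search endpoint in the FastAPI backend could look like the following. The route, the request model, and the `search_packages` helper are hypothetical and not taken from the actual codebase:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    query: str
    top_k: int = 10


def search_packages(query: str, top_k: int) -> list[dict]:
    # Placeholder for the embed-and-search logic described above; the real
    # backend would query the vector database here.
    return [{"name": "matplotlib", "summary": "Python plotting package"}][:top_k]


@app.post("/search")  # hypothetical route name
def search(request: QueryRequest) -> list[dict]:
    return search_packages(request.query, request.top_k)
```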
By default, all data will be stored on your local machine. It is also possible to store the data for the API on Azure Blob Storage and have the API read from there. To do so, create a `.env` file:

```bash
cp .env.template .env
```

and fill in the required fields.
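For example, a filled-in `.env` for blob storage could look like the snippet below. Apart from `STORAGE_BACKEND`, the variable names are assumptions; use the names from `.env.template`:

```bash
# Read and write API data on Azure Blob Storage instead of the local machine
STORAGE_BACKEND=BLOB

# Hypothetical Azure settings; check .env.template for the actual field names
AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
AZURE_STORAGE_CONTAINER_NAME=pypi-scout
AZURE_STORAGE_ACCOUNT_KEY=<your-account-key>
```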
The setup script will:

- Download and process the PyPI dataset and store the results in the `data` directory.
- Create vector embeddings for the PyPI dataset.
- If the `STORAGE_BACKEND` environment variable is set to `BLOB`: upload the datasets to blob storage.
There are three methods to run the setup script, depending on whether you have an NVIDIA GPU and the NVIDIA Container Toolkit installed. Please run the setup script using the method that applies to you; indicative example commands follow the list:
- Option 1: Using Poetry
- Option 2: Using Docker with NVIDIA GPU and NVIDIA Container Toolkit
- Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit
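The commands below sketch what each option might look like. The module path `pypi_scout.setup` and the image name `pypi-scout` are assumptions for illustration; check the repository's own instructions for the exact invocations:

```bash
# Option 1: run the setup script through Poetry (module path is hypothetical)
poetry install
poetry run python -m pypi_scout.setup

# Option 2: Docker with GPU acceleration via the NVIDIA Container Toolkit
docker build -t pypi-scout .
docker run --gpus all pypi-scout

# Option 3: the same image without GPU acceleration (embedding will be slower)
docker run pypi-scout
```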
Note

The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development, you can reduce the number of packages that are processed locally by lowering the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
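As a rough illustration, the relevant setting in `pypi_scout/config.py` might look something like the snippet below; only the name `FRAC_DATA_TO_INCLUDE` comes from the project, the surrounding structure is an assumption:

```python
# pypi_scout/config.py -- structure is hypothetical
from dataclasses import dataclass


@dataclass
class Config:
    # Fraction of the PyPI dataset to process. Lower this (e.g. to 0.1)
    # to speed up local development runs.
    FRAC_DATA_TO_INCLUDE: float = 1.0
```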
Start the application using Docker Compose:
```bash
docker-compose up
```
After a short while, your application will be live at http://localhost:3000.
The dataset for this project is created using the PyPI dataset on Google BigQuery. The SQL query used can be found in `pypi_bigquery.sql`. The resulting dataset is available as a CSV file on Google Drive.