Finding the right Python package on PyPI can be difficult, since PyPI isn't really designed for package discovery. For example, searching for the word "plot" returns hundreds of packages that contain the word "plot", in seemingly random order.
Inspired by this blog post about finding arXiv articles using vector embeddings, I decided to build a small application that helps you find Python packages with a similar approach. For example, you can ask it "I want to make nice plots and visualizations", and it will provide you with a short list of packages that can help you with that.
The project works by collecting project summaries and descriptions for all packages on PyPI with more than 100 weekly downloads. These are converted into vector representations using Sentence Transformers. When a user enters a query, it is converted into a vector representation as well, and the most similar package descriptions are fetched from the vector database. Additional weight is given to the number of weekly downloads before the results are presented to the user in a dashboard.
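To make the pipeline concrete, here is a minimal sketch of the embed-and-search step using Sentence Transformers. The model name, the in-memory similarity search (standing in for the vector database), and the log-downloads weighting are assumptions for illustration:

```python
import math

from sentence_transformers import SentenceTransformer, util

# Hypothetical sample of the dataset: (name, summary, weekly downloads)
packages = [
    ("matplotlib", "Python plotting package", 30_000_000),
    ("requests", "Python HTTP for Humans.", 50_000_000),
    ("seaborn", "Statistical data visualization", 10_000_000),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
corpus_embeddings = model.encode([summary for _, summary, _ in packages])

query = "I want to make nice plots and visualizations"
query_embedding = model.encode(query)

# Cosine similarity between the query and every package summary
similarities = util.cos_sim(query_embedding, corpus_embeddings)[0]

# Give additional weight to packages with many downloads (scheme is assumed)
scores = [
    float(sim) * math.log10(downloads)
    for sim, (_, _, downloads) in zip(similarities, packages)
]

for (name, _, _), score in sorted(zip(packages, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.2f}")
```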
The project uses the following technologies:
- FastAPI for the API backend
- NextJS and TailwindCSS for the frontend
- Sentence Transformers for vector embeddings
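As a rough sketch of how these pieces fit together, a search endpoint in the FastAPI backend could look like the following. The route, the request model, and the `search_packages` helper are hypothetical and not taken from the actual codebase:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    query: str
    top_k: int = 10


def search_packages(query: str, top_k: int) -> list[dict]:
    # Placeholder for the embed-and-search logic described above; the real
    # backend would query the vector database here.
    return [{"name": "matplotlib", "summary": "Python plotting package"}][:top_k]


@app.post("/search")  # hypothetical route name
def search(request: QueryRequest) -> list[dict]:
    return search_packages(request.query, request.top_k)
```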
By default, all data will be stored on your local machine. It is also possible to store the data for the API on Azure Blob Storage and have the API read from there. To do so, create a `.env` file:

```bash
cp .env.template .env
```

and fill in the required fields.
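For example, a filled-in `.env` for blob storage could look like the snippet below. Apart from `STORAGE_BACKEND`, the variable names are assumptions; use the names from `.env.template`:

```bash
# Read and write API data on Azure Blob Storage instead of the local machine
STORAGE_BACKEND=BLOB

# Hypothetical Azure settings; check .env.template for the actual field names
AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
AZURE_STORAGE_CONTAINER_NAME=pypi-scout
AZURE_STORAGE_ACCOUNT_KEY=<your-account-key>
```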
The setup script will:

- Download and process the PyPI dataset and store the results in the `data` directory.
- Create vector embeddings for the PyPI dataset.
- If the `STORAGE_BACKEND` environment variable is set to `BLOB`: upload the datasets to blob storage.
There are three methods to run the setup script, depending on whether you have an NVIDIA GPU and the NVIDIA Container Toolkit installed. Please run the setup script using the method that applies to you; indicative example commands follow the list:
- Option 1: Using Poetry
- Option 2: Using Docker with NVIDIA GPU and NVIDIA Container Toolkit
- Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit
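The commands below sketch what each option might look like. The module path `pypi_scout.setup` and the image name `pypi-scout` are assumptions for illustration; check the repository's own instructions for the exact invocations:

```bash
# Option 1: run the setup script through Poetry (module path is hypothetical)
poetry install
poetry run python -m pypi_scout.setup

# Option 2: Docker with GPU acceleration via the NVIDIA Container Toolkit
docker build -t pypi-scout .
docker run --gpus all pypi-scout

# Option 3: the same image without GPU acceleration (embedding will be slower)
docker run pypi-scout
```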
Note

The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development, you can reduce the number of packages that are processed locally by lowering the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
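As a rough illustration, the relevant setting in `pypi_scout/config.py` might look something like the snippet below; only the name `FRAC_DATA_TO_INCLUDE` comes from the project, the surrounding structure is an assumption:

```python
# pypi_scout/config.py -- structure is hypothetical
from dataclasses import dataclass


@dataclass
class Config:
    # Fraction of the PyPI dataset to process. Lower this (e.g. to 0.1)
    # to speed up local development runs.
    FRAC_DATA_TO_INCLUDE: float = 1.0
```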
Start the application using Docker Compose:
```bash
docker-compose up
```
After a short while, your application will be live at http://localhost:3000.
The dataset for this project is created using the PyPI dataset on Google BigQuery. The SQL query used can be found in `pypi_bigquery.sql`. The resulting dataset is available as a CSV file on Google Drive.