This project is a PoC to create a vector database from content scraped from web pages. It uses ChromaDB as the vector database technology and the Amazon Titan embedding model, but it also supports BAAI/bge-large-zh-v1.5, available on the HuggingFace website.
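As a rough sketch of the core idea (not this project's actual code), the pipeline embeds scraped text with Titan via AWS Bedrock and stores the vector in ChromaDB. The model ID, collection name, host, and port below are assumptions:

```python
import json

import boto3
import chromadb

# Assumed names: the model ID, collection name, and ChromaDB host/port are
# illustrative, not taken from this project's source.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_or_create_collection("blog_pages")

def embed(text: str) -> list[float]:
    """Request a Titan embedding for one chunk of scraped text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

page_text = "Example paragraph scraped from a blog page."
collection.add(
    ids=["page-1"],
    embeddings=[embed(page_text)],
    documents=[page_text],
    metadatas=[{"url": "https://example.com/blog/post"}],
)
```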
To install the project, it is recommended to use Python 3.11 and a virtual environment:
```
virtualenv -p python3.11 venv
source venv/bin/activate
make setup-dev
```
If you are using a computer with an ARM architecture, it is important to set the architecture during dependency installation:
```
make setup-dev-arm64
```
First, set up the environment variables by copying the `config/.env.example` file and creating `config/.env.development`. The following variables are crucial for connecting to the AWS Bedrock service when running with the AWS embedding model:
```
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_SESSION_TOKEN=
```
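If you need to load these variables in Python, a minimal sketch with python-dotenv looks like the following; the assumption that the app reads `config/.env.development` explicitly is illustrative:

```python
import os

from dotenv import load_dotenv

# Assumption: the app loads config/.env.development directly; adjust the
# path if your setup resolves the environment file differently.
load_dotenv("config/.env.development")

aws_key = os.environ["AWS_ACCESS_KEY_ID"]  # raises KeyError if unset
```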
To run the scraper app inside a Docker container, just run the following command:
```
make docker-start
```
To run the app locally, you still need to run ChromaDB in a container before starting the scraper app:
```
make docker-start-db
python main.py -verbose -store_on_database -continue_scrapping
```
The `main.py` script has three important flags that help you understand what is happening (a sketch of how they might be defined follows the list):

- `-verbose`: (default is False) Print information about the process of scraping pages, creating embeddings, and storing data in the database.
- `-store_on_database`: (default is False) Force the scraper to store content and embeddings in the database.
- `-continue_scrapping`: (default is False) Force the scraper to visit new blog page links found inside the scraped page. This is a recursive process.
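The single-dash flag style above can be reproduced with argparse. This is a hypothetical sketch of how `main.py` might define them, not its actual code:

```python
import argparse

# Hypothetical flag definitions mirroring the README; the real parser in
# main.py may differ.
parser = argparse.ArgumentParser(description="Web page scraper")
parser.add_argument("-verbose", action="store_true",
                    help="print progress while scraping and embedding")
parser.add_argument("-store_on_database", action="store_true",
                    help="persist content and embeddings to the database")
parser.add_argument("-continue_scrapping", action="store_true",
                    help="recursively follow new blog links found on pages")
args = parser.parse_args()
```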
After running the main script and hydrating the database, you can check the data in two ways:

- Run `test_queries.py` and check the related answers (a query sketch follows this list).
- Use the TensorFlow Projector to analyse the vectors and try to identify inconsistencies. Run the script `tsv-formatter.py` and load the files `embedding.tsv` and `metadata.tsv` in the TensorFlow Projector page.
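As an illustration of the first option, a query against the hydrated collection could look like the sketch below. The model ID, collection name, and query text are assumptions; `test_queries.py` may embed and query differently:

```python
import json

import boto3
import chromadb

# Assumed names, consistent with the earlier sketch; align them with your
# local setup before running.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
collection = chromadb.HttpClient(host="localhost", port=8000) \
    .get_or_create_collection("blog_pages")

# Embed the question with the same Titan model used for the stored vectors,
# so the query lands in the same embedding space.
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": "What is a vector database?"}),
    accept="application/json",
    contentType="application/json",
)
query_vector = json.loads(response["body"].read())["embedding"]

results = collection.query(query_embeddings=[query_vector], n_results=3)
for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"{distance:.4f}  {doc[:80]}")
```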
TODO:
- Fix scraper logs
- Make scraper container run automatically