Cheesecake Blog Scraper

This project is a proof of concept (PoC) that builds a vector database from content scraped from web pages. It uses ChromaDB as the vector database and the Amazon Titan embedding model; it also supports BAAI/bge-large-zh-v1.5, which is available on Hugging Face.

How to install

We recommend installing the project with Python 3.11 inside a virtual environment:

virtualenv -p python3.11 venv
source venv/bin/activate
make setup-dev

On a computer with an ARM architecture, it is important to set the architecture when installing the dependencies:

make setup-dev-arm64

How to run

First, set up the environment variables by copying config/.env.example to config/.env.development. The following variables are needed to connect to the AWS Bedrock service when running with the AWS embedding model:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_SESSION_TOKEN=
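
As a quick sanity check before starting the scraper, the variables above can be verified with a few lines of Python. This is a minimal sketch and not part of the project: the helper names `load_env_file` and `missing_aws_vars` are hypothetical, and the parser only understands simple KEY=VALUE lines.

```python
from pathlib import Path

# The three variables listed above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_aws_vars(env):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    env_path = Path("config/.env.development")
    if env_path.exists():
        missing = missing_aws_vars(load_env_file(env_path))
        print("Missing variables:", ", ".join(missing) if missing else "none")
    else:
        print(f"{env_path} not found; copy config/.env.example first")
```

Running it from the repository root reports which of the three variables still need a value.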

On Docker container

To run the scraper app inside a Docker container, run the following command:

make docker-start

Locally

To run locally, ChromaDB must still be running in a container before the scraper app starts:

make docker-start-db
python main.py -verbose -store_on_database -continue_scrapping

The main.py script has three important flags that control what it does and how much it reports:

-verbose: (default is False) Print information about the process of scraping pages, creating embeddings, and storing data in the database.
-store_on_database: (default is False) Force the scraper to store content and embeddings in the database.
-continue_scrapping: (default is False) Force the scraper to visit new blog page links found inside the scraped page. This is a recursive process.
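
To make the flag semantics concrete, here is a minimal sketch of how the three flags could be parsed and how -continue_scrapping might drive the link-following behaviour. It is an illustration only, not the project's actual main.py or scraper.py: `build_parser`, `LinkCollector`, `extract_links`, `crawl`, and the `fetch` callable are hypothetical names, and the pages come from an in-memory dictionary instead of the network. The sketch uses an explicit queue and a visited list rather than recursion, so pages that link back to each other do not loop forever.

```python
import argparse
from html.parser import HTMLParser


def build_parser():
    """Declare the three boolean flags; each one defaults to False."""
    parser = argparse.ArgumentParser(description="Blog scraper (sketch)")
    for flag in ("-verbose", "-store_on_database", "-continue_scrapping"):
        parser.add_argument(flag, action="store_true")
    return parser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def crawl(start_url, fetch, continue_scrapping=False, verbose=False):
    """Visit start_url and, if requested, every page reachable from it."""
    visited = []
    pending = [start_url]
    while pending:
        url = pending.pop(0)
        if url in visited:
            continue  # already scraped, avoids infinite loops
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        visited.append(url)
        if verbose:
            print(f"scraped {url}")
        if continue_scrapping:
            pending.extend(extract_links(html))
    return visited


if __name__ == "__main__":
    # Stand-in for real HTTP requests: three pages, one of them linking back.
    pages = {
        "/blog/a": '<a href="/blog/b">b</a> <a href="/blog/c">c</a>',
        "/blog/b": '<a href="/blog/a">a</a>',
        "/blog/c": "no links here",
    }
    args = build_parser().parse_args(["-verbose", "-continue_scrapping"])
    crawl("/blog/a", pages.get, args.continue_scrapping, args.verbose)
```

Without -continue_scrapping only the starting page is visited; with it, every page reachable from the starting page is visited once.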

Analysing the database

After running the main script and populating the database, we can check the data in two ways:

Run test_queries.py and check the related answers.
Use the TensorFlow Embedding Projector to analyse the vectors and try to identify inconsistencies. Run the script tsv-formatter.py and load the files embedding.tsv and metadata.tsv on the Embedding Projector page.
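
The export step can be pictured as follows. This is a minimal sketch, assuming the input format the TensorFlow Embedding Projector usually expects: one tab-separated vector per line with no header row, and a metadata file with one row per vector (plus a header row when it has more than one column). It is not the project's tsv-formatter.py; the function name `write_projector_files`, the default output directory, and the metadata columns `title` and `url` are hypothetical.

```python
import csv
from pathlib import Path


def write_projector_files(embeddings, metadata, out_dir="tsv"):
    """Write embedding.tsv and metadata.tsv for the same list of items.

    embeddings: list of equal-length lists of floats.
    metadata:   list of dicts with identical keys, one per embedding.
    """
    if len(embeddings) != len(metadata):
        raise ValueError("each embedding needs exactly one metadata row")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Vectors: no header, one vector per line, values separated by tabs.
    with open(out / "embedding.tsv", "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t", lineterminator="\n").writerows(embeddings)

    # Metadata: rows in the same order as the vectors.
    columns = list(metadata[0]) if metadata else []
    with open(out / "metadata.tsv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        if len(columns) > 1:
            writer.writeheader()  # header only for multi-column metadata
        writer.writerows(metadata)
    return out / "embedding.tsv", out / "metadata.tsv"


if __name__ == "__main__":
    # Two made-up 3-dimensional vectors, purely for illustration.
    write_projector_files(
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        [{"title": "Post A", "url": "/a"}, {"title": "Post B", "url": "/b"}],
    )
```

Keeping the two files row-aligned is what lets the Projector label each point with its metadata.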

TODO:

Fix scraper logs
Make scraper container run automatically

