This project extracts document data from the Verra Verified Carbon Standard (VCS) Registry, an open database widely used by carbon credit traders. Written in Python, it provides a simple command-line interface for scraping summary data, metadata, and PDF document links from the detailed VCS project pages.
Explore the Verra VCS Registry: Verra VCS Registry
- Automated Scraping: Effortlessly retrieves summary data, metadata, and direct PDF document links.
- User-Friendly: Simple command-line interface for easy operation.
- Flexible: Options to customize the scraping process according to your needs.
- Prerequisites: Make sure Taskfile is installed on your system. Taskfile is a task runner and build tool that simplifies running predefined tasks within a project.
- Set Up Environment: To configure your environment and install all necessary dependencies, execute the following command:
task req-install
- Launch the Scraper: Start scraping by running the main script.
python3 src/main.py
This initiates the scraping of both summary data and PDF document links.
- Customize Your Scraping:
  - To scrape only the summary data, use:
python3 src/main.py --disable-document
  - To obtain only PDF links, run:
python3 src/main.py --disable-summary
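The two flags above could be wired up roughly as follows. This is a minimal sketch using argparse from the Python standard library; the project's actual src/main.py is not shown here, so the function names build_parser and plan_steps and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented flags."""
    parser = argparse.ArgumentParser(
        description="Scrape the Verra VCS Registry (illustrative sketch)."
    )
    # Both flags default to False, so a plain run scrapes everything.
    parser.add_argument(
        "--disable-document",
        action="store_true",
        help="skip PDF link scraping and collect summary data only",
    )
    parser.add_argument(
        "--disable-summary",
        action="store_true",
        help="skip summary scraping and collect PDF links only",
    )
    return parser


def plan_steps(argv: list) -> list:
    """Return the scraping steps that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = []
    if not args.disable_summary:
        steps.append("summary")
    if not args.disable_document:
        steps.append("documents")
    return steps


print(plan_steps([]))                      # ['summary', 'documents']
print(plan_steps(["--disable-document"]))  # ['summary']
print(plan_steps(["--disable-summary"]))   # ['documents']
```

Modeling each option as an opt-out switch keeps the default invocation (no flags) equivalent to a full scrape, which matches the behavior described above.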
The scraper categorizes the data into two main types: Summary Data and PDF Links.
- Summary Data: Extracted summary information is saved in .txt format in the results/summary directory. For instance, data for project ID 33 will be stored in 33.txt.
- PDF Links: Links to PDF documents, along with metadata such as the last update date and file names, are compiled into a CSV file named pdf_links.csv located in the results/ directory. Access or download the documents directly via these links.
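Once a run has finished, the output files can be read back with the Python standard library. The sketch below is illustrative only: the helper names summary_file and load_pdf_links are hypothetical, and it assumes that pdf_links.csv starts with a header row, which the description above does not confirm. It does not assume any particular column names.

```python
from __future__ import annotations

import csv
from pathlib import Path


def summary_file(project_id: int, results_dir: str = "results") -> Path:
    """Build the expected summary path, e.g. results/summary/33.txt."""
    return Path(results_dir) / "summary" / f"{project_id}.txt"


def load_pdf_links(results_dir: str = "results") -> list[dict[str, str]]:
    """Read pdf_links.csv into row dicts keyed by the file's own header.

    Assumption: the first line of the CSV is a header row.
    """
    csv_path = Path(results_dir) / "pdf_links.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Only touch the files if a previous scraper run has produced them.
if summary_file(33).exists():
    print(summary_file(33).read_text(encoding="utf-8"))

if (Path("results") / "pdf_links.csv").exists():
    for row in load_pdf_links():
        print(row)
```

Because the rows are keyed by whatever header the file contains, the sketch keeps working if the scraper's column names differ from what a reader might guess.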
To better understand what parts of the webpage are scraped for summary data and where the PDF documents are located, refer to the image below:
This image illustrates the specific sections of the Verra VCS Registry that the scraper targets for data extraction. The highlighted areas show where the summary information is located and which parts contain links to PDF documents.