This project facilitates the extraction of document data from the Verra Verified Carbon Standard (VCS) Registry, an open database widely utilized by carbon credit traders.

yc-wang00/verra-scaper

Verra VCS Document Scraper

About

This project extracts document data from the Verra Verified Carbon Standard (VCS) Registry, an open database widely used by carbon credit traders. Written in Python, it provides a command-line interface for scraping summary data, metadata, and PDF document links from the detailed VCS project pages.

Explore the Verra VCS Registry: Verra VCS Registry

Features

  • Automated Scraping: Retrieves summary data, metadata, and direct PDF document links.
  • User-Friendly: A simple command-line interface runs the whole scrape with one command.
  • Flexible: Command-line flags let you scrape only summary data or only PDF links.

Getting Started

Installation

  1. Prerequisites: Install Taskfile on your system. Taskfile is a task runner / build tool that simplifies running the predefined tasks in a project.

  2. Set Up Environment: To configure your environment and install all necessary dependencies, execute the following command:

    task req-install
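Before running the install task, it can help to confirm that the Task executable is on your PATH. The snippet below is a minimal sketch and not part of this repository. It assumes the binary is called `task`, or `go-task` as some package managers name it, and the helper name `find_task_binary` is hypothetical.

```python
import shutil


def find_task_binary(candidates=("task", "go-task")):
    """Return the path of the first candidate executable found on PATH, or None.

    The candidate names are assumptions; adjust them to match your installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_task_binary()
    if found:
        print(f"Task found at {found}; you can run: task req-install")
    else:
        print("Task not found on PATH; install it before running: task req-install")
```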

Usage

Quick Start

  1. Launch the Scraper: Start scraping by running the main script.

    python3 src/main.py

    This initiates the scraping of both summary data and PDF document links.

  2. Customize Your Scraping:

    • To scrape only the summary data, use:

      python3 src/main.py --disable-document

    • To obtain only PDF links, run:

      python3 src/main.py --disable-summary
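To make the effect of the two flags concrete, here is a minimal sketch of how they could be parsed with `argparse`. Only the flag names `--disable-summary` and `--disable-document` come from this README. The functions `build_parser` and `plan_steps` are hypothetical, and the actual logic in src/main.py may differ.

```python
import argparse


def build_parser():
    """Build a parser exposing the two flags documented above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Verra VCS document scraper (illustrative sketch)"
    )
    parser.add_argument(
        "--disable-summary",
        action="store_true",
        help="skip summary data and collect PDF links only",
    )
    parser.add_argument(
        "--disable-document",
        action="store_true",
        help="skip PDF links and collect summary data only",
    )
    return parser


def plan_steps(argv):
    """Return the scraping steps that remain enabled for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = []
    if not args.disable_summary:
        steps.append("summary")
    if not args.disable_document:
        steps.append("document")
    return steps


print(plan_steps([]))                      # ['summary', 'document']
print(plan_steps(["--disable-document"]))  # ['summary']
print(plan_steps(["--disable-summary"]))   # ['document']
```

In this sketch, passing both flags leaves no steps enabled. How the real script treats that combination is not documented here.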

Output

The scraper writes two kinds of output: Summary Data and PDF Links.

  • Summary Data: Summary information is saved as .txt files in the results/summary directory. For instance, data for project ID 33 is stored in results/summary/33.txt.

  • PDF Links: Links to PDF documents, along with metadata such as the last update date and file names, are compiled into a single CSV file, results/pdf_links.csv. You can open or download the documents directly via these links.
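Once a scrape has finished, the output files can be loaded with the standard library alone. The following is a rough sketch, not code from this repository. It assumes that pdf_links.csv starts with a header row and that all files are UTF-8 encoded, and the helper names `load_pdf_links` and `load_summary` are hypothetical. Because the CSV column names are not listed in this README, the sketch reads whatever headers the file contains.

```python
import csv
from pathlib import Path


def load_pdf_links(results_dir="results"):
    """Read results/pdf_links.csv into a list of dicts keyed by the header row.

    Assumes the CSV begins with a header row; column names are taken from the file.
    """
    csv_path = Path(results_dir) / "pdf_links.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_summary(project_id, results_dir="results"):
    """Return the text of results/summary/<project_id>.txt."""
    summary_path = Path(results_dir) / "summary" / f"{project_id}.txt"
    return summary_path.read_text(encoding="utf-8")
```

For example, `load_summary(33)` would return the contents of results/summary/33.txt, and `load_pdf_links()` would return one dict per row of the CSV.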

To better understand what parts of the webpage are scraped for summary data and where the PDF documents are located, refer to the image below:

Verra VCS Registry Data Scraping Illustration

This image illustrates the specific sections of the Verra VCS Registry that the scraper targets for data extraction. The highlighted areas show where the summary information is located and which parts contain links to PDF documents.

