Verra VCS Document Scraper

About

This project extracts document data from the Verra Verified Carbon Standard (VCS) Registry, an open database widely used by carbon credit traders. Written in Python, it provides a simple command-line interface for scraping summary data, metadata, and PDF document links from the detailed VCS project pages.

Explore the Verra VCS Registry: Verra VCS Registry

Features

- Automated Scraping: Retrieves summary data, metadata, and direct PDF document links.
- User-Friendly: Simple command-line interface for easy operation.
- Flexible: Options to customize the scraping process according to your needs.

Getting Started

Installation

Prerequisites: Install Task, the runner for Taskfile.yml, before you begin. Task is a task runner and build tool that executes the tasks predefined in a project.
Set Up Environment: To configure your environment and install all necessary dependencies, run:
```
task req-install
```

Usage

Quick Start

Launch the Scraper: Start scraping by running the main script.
```
python3 src/main.py
```
This initiates the scraping of both summary data and PDF document links.
Customize Your Scraping:
- To scrape only the summary data, use:
```
python3 src/main.py --disable-document
```
- To obtain only PDF links, run:
```
python3 src/main.py --disable-summary
```
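
The flag behavior described above can be sketched with `argparse`. This is only an illustration of how two opt-out flags combine, not the actual contents of src/main.py. The function names `build_parser` and `plan_steps` are hypothetical; only the flag names come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two documented flags (hypothetical wiring)."""
    parser = argparse.ArgumentParser(description="Verra VCS scraper (sketch)")
    parser.add_argument(
        "--disable-document",
        action="store_true",
        help="skip PDF link scraping; collect summary data only",
    )
    parser.add_argument(
        "--disable-summary",
        action="store_true",
        help="skip summary scraping; collect PDF links only",
    )
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the scraping steps that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = []
    if not args.disable_summary:
        steps.append("summary")
    if not args.disable_document:
        steps.append("document")
    return steps
```

With no flags, both steps are planned. Each flag removes exactly one step, so passing both would leave nothing to do. How the real script handles that case is not stated here.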

Output

The scraper categorizes the data into two main types: Summary Data and PDF Links.

- Summary Data: Extracted summary information is saved as .txt files in the results/summary directory, one file per project. For instance, data for project ID 33 is stored in 33.txt.
- PDF Links: Links to PDF documents, along with metadata such as the last update date and file names, are compiled into a single CSV file, pdf_links.csv, in the results/ directory. Use these links to open or download the documents directly.
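
Once a run has finished, the two outputs can be loaded for further processing. The following is a minimal sketch, assuming that pdf_links.csv starts with a header row. The column names are not documented here, so the code takes them from the file itself. The helpers `read_summary` and `read_pdf_links` are hypothetical and not part of the project.

```python
import csv
from pathlib import Path


def read_summary(results_dir: Path, project_id: int) -> str:
    """Return the saved summary text for one project, e.g. results/summary/33.txt."""
    path = results_dir / "summary" / f"{project_id}.txt"
    return path.read_text(encoding="utf-8")


def read_pdf_links(results_dir: Path) -> list[dict[str, str]]:
    """Load results/pdf_links.csv as rows keyed by the file's own header.

    Assumption: the first line of the CSV is a header row.
    """
    path = results_dir / "pdf_links.csv"
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `read_summary(Path("results"), 33)` would return the text stored for project ID 33, and `read_pdf_links(Path("results"))` would return one dictionary per listed PDF.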

To better understand what parts of the webpage are scraped for summary data and where the PDF documents are located, refer to the image below:

This image illustrates the specific sections of the Verra VCS Registry that the scraper targets for data extraction. The highlighted areas show where the summary information is located and which parts contain links to PDF documents.

Repository: yc-wang00/verra-scaper
