Site Metadata Scraper

Scrape ecommerce site metadata for classification and keyword analysis.

Dependencies

Methodology

  • Iterate over a list of validated URLs (each input can be an origin, hostname, or domain name)
  • Batch the input (~5 per batch) and load each page in its own concurrent browser instance
  • For each page, extract basic site metadata:
    1. HTML lang attribute value, to guide any subsequent analysis
    2. Document title
    3. Meta information: keywords and description, if available
    4. Social media handle anchors for major platforms
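
The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names (toOrigin, toBatches, extractMetadata, scrapeAll), the https:// default, and the list of platforms are all hypothetical. The scrape callback is injected so the batching logic stays independent of the browser.

```javascript
// Hypothetical helper: accept an origin, hostname, or bare domain name and
// return a loadable origin. Defaulting to https:// is an assumption.
function toOrigin(input) {
  const trimmed = input.trim();
  const withScheme = /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
  return new URL(withScheme).origin;
}

// Split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Intended to run in the page context, so it only touches the DOM.
// The platform list is illustrative, not taken from the project.
function extractMetadata() {
  const platforms = ['facebook.com', 'instagram.com', 'twitter.com', 'linkedin.com', 'youtube.com'];
  const meta = (name) =>
    document.querySelector(`meta[name="${name}"]`)?.getAttribute('content') ?? null;
  const host = (href) => {
    try {
      return new URL(href).hostname.replace(/^www\./, '');
    } catch {
      return '';
    }
  };
  const social = [...document.querySelectorAll('a[href]')]
    .map((a) => a.href)
    .filter((href) => platforms.some((p) => host(href) === p || host(href).endsWith(`.${p}`)));
  return {
    lang: document.documentElement.lang || null,
    title: document.title,
    keywords: meta('keywords'),
    description: meta('description'),
    social: [...new Set(social)], // de-duplicate repeated anchors
  };
}

// Batches run one after another; URLs within a batch run concurrently.
// A failure for one URL is recorded and does not abort the batch.
async function scrapeAll(inputs, batchSize, scrape) {
  const results = [];
  for (const batch of toBatches(inputs.map(toOrigin), batchSize)) {
    const settled = await Promise.allSettled(batch.map((url) => scrape(url)));
    settled.forEach((outcome, i) => {
      results.push(
        outcome.status === 'fulfilled'
          ? { url: batch[i], ...outcome.value }
          : { url: batch[i], error: String(outcome.reason) }
      );
    });
  }
  return results;
}
```

With Puppeteer, the scrape callback might launch a browser, open a page, call page.goto(url), return page.evaluate(extractMetadata), and close the browser afterwards. Promise.allSettled is used here so that one unreachable site does not discard the rest of its batch; whether the project handles failures this way is not stated.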

The service is intended to run infrequently (e.g. monthly), built and run from the repository source via e.g. AWS CodeBuild...

Run

Export variables to the environment:

<path>: endpoint for the index of URL data to iterate over
<size>: a reasonable batch size for concurrent browser instances (~5)

export INPUT_PATH=<path>
export BATCH_SIZE=<size>
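
As an illustration of how these two variables might be read and validated at startup, here is a minimal sketch. The readConfig name, the fallback batch size of 5, and the error messages are assumptions, not taken from the project.

```javascript
// Hypothetical startup helper: read INPUT_PATH and BATCH_SIZE from the
// environment. Falling back to a batch size of 5 is an assumption based
// on the "~5" guidance above.
function readConfig(env = process.env) {
  const inputPath = env.INPUT_PATH;
  if (!inputPath) {
    throw new Error('INPUT_PATH is required');
  }
  const batchSize = Number.parseInt(env.BATCH_SIZE ?? '5', 10);
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('BATCH_SIZE must be a positive integer');
  }
  return { inputPath, batchSize };
}
```

Failing fast on a missing or malformed value avoids launching browser instances with an unusable configuration.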

Run the service:

npm run start

TODO

Dockerise to run headful Puppeteer in a container with Xvfb.