Skip to content
/ parsera Public
forked from raznem/parsera

Lightweight library for scraping web-sites with LLMs

License

Notifications You must be signed in to change notification settings

ajitmp/parsera

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📦 Parsera

Website Downloads

Lightweight Python library for scraping websites with LLMs. You can test it on Parsera website.

Why Parsera?

Because it's simple and lightweight, with minimal token use which boosts speed and reduces expenses.

Installation

pip install parsera
playwright install

Basic usage

If you want to use OpenAI, remember to set up OPENAI_API_KEY env variable. You can do this from python with:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"

Next, you can run a basic version that uses gpt-4o-mini

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scrapper = Parsera()
result = scrapper.run(url=url, elements=elements)

result variable will contain a json with a list of records:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]

There is also arun async method available:

result = await scrapper.arun(url=url, elements=elements)

Running with Jupyter Notebook:

Either place this code at the beginning of your notebook:

import nest_asyncio
nest_asyncio.apply()

Or instead of calling run method use async arun.

Run with custom model

You can instantiate Parsera with any chat model supported by LangChain, for example, to run the model from Azure:

import os
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_GPT_BASE_URL"),
    openai_api_version="2023-05-15",
    deployment_name=os.getenv("AZURE_GPT_DEPLOYMENT_NAME"),
    openai_api_key=os.getenv("AZURE_GPT_API_KEY"),
    openai_api_type="azure",
    temperature=0.0,
)

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}
scrapper = Parsera(model=llm)
result = scrapper.run(url=url, elements=elements)

About

Lightweight library for scraping web-sites with LLMs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%