Lightweight Python library for scraping websites with LLMs. You can test it on Parsera website.
Because it's simple and lightweight, with minimal token use which boosts speed and reduces expenses.
pip install parsera
playwright install
If you want to use OpenAI, remember to set up OPENAI_API_KEY
env variable.
You can do this from python with:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"
Next, you can run a basic version that uses gpt-4o-mini
from parsera import Parsera
url = "https://news.ycombinator.com/"
elements = {
"Title": "News title",
"Points": "Number of points",
"Comments": "Number of comments",
}
scrapper = Parsera()
result = scrapper.run(url=url, elements=elements)
result
variable will contain a json with a list of records:
[
{
"Title":"Hacking the largest airline and hotel rewards platform (2023)",
"Points":"104",
"Comments":"24"
},
...
]
There is also arun
async method available:
result = await scrapper.arun(url=url, elements=elements)
Either place this code at the beginning of your notebook:
import nest_asyncio
nest_asyncio.apply()
Or instead of calling run
method use async arun
.
You can instantiate Parsera
with any chat model supported by LangChain, for example, to run the model from Azure:
import os
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
azure_endpoint=os.getenv("AZURE_GPT_BASE_URL"),
openai_api_version="2023-05-15",
deployment_name=os.getenv("AZURE_GPT_DEPLOYMENT_NAME"),
openai_api_key=os.getenv("AZURE_GPT_API_KEY"),
openai_api_type="azure",
temperature=0.0,
)
url = "https://news.ycombinator.com/"
elements = {
"Title": "News title",
"Points": "Number of points",
"Comments": "Number of comments",
}
scrapper = Parsera(model=llm)
result = scrapper.run(url=url, elements=elements)