Support pagination with a --next option #105
Here's a prototype I built to help me scrape through all of https://news.ycombinator.com/from?site=simonwillison.net following the "More" links:

```diff
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 9bc48aa..eb3a80e 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -524,6 +524,21 @@ def accessibility(url, auth, output, javascript, timeout, log_console, skip, fai
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    'next_',
+    "--next",
+    help="JavaScript to run to find next page",
+)
+@click.option(
+    "--next-delay",
+    type=int,
+    help="Milliseconds to wait before following --next",
+)
+@click.option(
+    "--next-limit",
+    type=int,
+    help="Maximum number of --next pages",
+)
 @browser_option
 @user_agent_option
 @reduced_motion_option
@@ -536,6 +551,9 @@ def javascript(
     auth,
     output,
     raw,
+    next_,
+    next_delay,
+    next_limit,
     browser,
     user_agent,
     reduced_motion,
@@ -571,6 +589,7 @@ def javascript(
     if not javascript:
         javascript = input.read()
     url = url_or_file_path(url, _check_and_absolutize)
+    next_count = 0
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -582,9 +601,27 @@ def javascript(
         page = context.new_page()
         if log_console:
             page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        result = []
+        while url:
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            evaluated = _evaluate_js(page, javascript)
+            if next_:
+                result.extend(evaluated)
+            else:
+                result = evaluated
+            next_count += 1
+            if next_:
+                if next_limit is not None and next_count >= next_limit:
+                    raise click.ClickException(
+                        f"Reached --next-limit of {next_limit} pages"
+                    )
+                url = _evaluate_js(page, next_)
+                print(url)
+                if next_delay:
+                    time.sleep(next_delay / 1000)
+            else:
+                url = None
         browser_obj.close()
         if raw:
             output.write(str(result))
```

I ran it like this and it worked!

```bash
shot-scraper javascript \
  'https://news.ycombinator.com/from?site=simonwillison.net' \
  -i /tmp/scrape.js \
  --next '() => {
    let el = document.querySelector(".morelink[rel=next]");
    if (el) {
      return el.href;
    }
  }' -o /tmp/all.json --next-delay 1000
```
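The contents of `/tmp/scrape.js` aren't shown in the issue. As a purely hypothetical sketch (not the script actually used), a version that returns one object per story row on a Hacker News listing page might look roughly like this; the selectors are guesses and may need adjusting:

```javascript
// Hypothetical scrape.js - not the actual script from the issue.
// Collects one object per story row on a Hacker News listing page.
() => {
  return Array.from(document.querySelectorAll("tr.athing")).map((row) => {
    const link = row.querySelector(".titleline a");
    const subtext = row.nextElementSibling; // the row with points/comments
    return {
      id: row.id,
      title: link ? link.textContent : null,
      url: link ? link.href : null,
      points: subtext?.querySelector(".score")?.textContent || null,
    };
  });
}
```

If the snippet returns an array per page, the prototype's `result.extend(evaluated)` yields one flat list across all of the pages it visits.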
Needs more thought about how things like concatenating together results from multiple pages should work. It would also be neat if this could return a …
I was trying to scrape some Google Maps lists of places, but didn't manage: the first page that loads is a cookie notice, which triggers a navigation event when it's accepted / rejected, and that results in …

To your question: maybe it could just return JSON-LD and leave the concatenation to downstream?
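Returning JSON-LD is something a `shot-scraper javascript` snippet can already express. A minimal sketch, assuming the target pages embed `application/ld+json` blocks (illustrative, not from this thread):

```javascript
// Collect every JSON-LD block on the current page.
// Returns an array of parsed objects; malformed blocks are skipped.
() => {
  return Array.from(
    document.querySelectorAll('script[type="application/ld+json"]')
  ).flatMap((el) => {
    try {
      return [JSON.parse(el.textContent)];
    } catch (e) {
      return []; // ignore invalid JSON-LD
    }
  });
}
```

With a `--next` option like the prototype above, concatenation across pages would then happen on the shot-scraper side, leaving any further merging to downstream tools.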
Pagination is difficult to wrap your head around. I scrape 1,000s of pages on a daily basis, and pagination is something no scraper can get right. From the script above: in a nutshell, websites consist of list pages and single pages. List pages "list" the pages a website has, and single pages are the "final" page.

### How devs think about scraping

For this type of scraping, think of any list page (IMDb genre pages, Amazon shoes pages); then a "next" is fine. The list page is the final page.

```mermaid
flowchart LR;
  start-url --> list-page-1
  start-url --> list-page-2
  start-url --> list-page-3
```
### What scrapers actually want

But in reality, list pages have a very different purpose. A list is a "summary" of a page, not the actual data scrapers want. List pages are designed to "entice" users to click; they don't have the actual data a scraper wants (see the cases below).

```mermaid
flowchart LR;
  start-url --> list-page-1-->single-pages-11[single page 1]
  list-page-1-->single-pages-12[single page 2]
  list-page-1-->single-pages-13[single page 3]
  start-url --> list-page-2-->single-pages-21[single page 1]
  list-page-2-->single-pages-22[single page 2]
  list-page-2-->single-pages-23[single page 3]
```
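To make the flowchart concrete, here's a rough, illustrative snippet (not from this thread, with placeholder selectors) showing what a single pass over a list page could hand back to shot-scraper: the single-page URLs the scraper actually wants, plus the next list page to visit:

```javascript
// Placeholder selectors - adjust for the site being scraped.
() => {
  return {
    // the "single pages" that hold the real data
    items: Array.from(document.querySelectorAll("article a.title")).map(
      (a) => a.href
    ),
    // the next list page, or null when there are no more
    next: document.querySelector("a[rel=next]")?.href || null,
  };
}
```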
### Summary

To sum it up: to allow shot-scraper to "follow" links, one has to think about 2 types of links to be followed: pagination links (1, 2, 3, next, etc.) and list items (card, article, col, etc.). It also helps to actually call them that:
### A case for pagination + follow

#### Example: huggingface.co

On the list pages you get: name, category, last update, number of downloads rounded to the nearest 1,000, and favorites rounded to the nearest 1,000.

Let's say you want the growth rate. On the list page it's listed as 227K, but when you click through to the actual page it says 226,828. The difference between scraping the list page and the actual page is that it takes 1,000 downloads before you notice a change. In real life, that means you won't be able to catch "trending" AI models.

Another example: you want to know the sentiment about an AI model. On list pages you have favorites. A "favorite" doesn't really say much about a model: a person can favorite it to get updates, to view it later, because they like the idea, because they're interested in how it works, etc. On the actual page you have a community tab, which reveals far more about sentiment: the ratio between open and closed issues, for example. 800 open issues and 1 closed one tells a different story than 800 open / 1,000 closed, 0 open / 800 closed, or even 800 closed with a last update in 1980.

#### Example: rottentomatoes.com

Another example of a list page not having everything you need is rottentomatoes.com. On the list page you get title, Tomatometer, audience score, and opening date. On the actual page you get MPA rating (G, PG, PG-13), genre, duration, critics consensus, recommended/similar movies, where to watch, language, synopsis, and cast. Even if you don't require anything complicated (genre, for example), shot-scraper still needs to visit the actual page to get the info, since the list page lacks pretty much everything.

### Pagination resources

Most commonly used pagination types
Would be neat if you could do pagination when running `shot-scraper javascript` - by running extra JavaScript that returns the URL of the next page to visit.
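For illustration only (a sketch of the idea, not an agreed interface), the "extra JavaScript" could be as small as a function that finds a `rel=next` link, assuming the target site exposes one:

```javascript
// Return the next page URL, or undefined to stop paginating.
() => document.querySelector("a[rel=next]")?.href
```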