Wayback Machine links completely break crawling #3

Open
NinCollin opened this issue Oct 8, 2023 · 3 comments
Comments

NinCollin commented Oct 8, 2023

I'm having an issue where Wayback Machine links break crawling on completely unrelated pages.
This page contains two Wayback Machine links, this one and this one.

After crawling the page with those links, subsequent websites fail to be crawled, with an error message that refers to the two Wayback Machine links, even though the sites the error occurs on are completely unrelated and not even on the same domain. SOSSE also fails to cache them.

Below are some screenshots showing that the error is unrelated to the pages that failed to crawl:
image
image

Owner

biolds commented Oct 9, 2023

It seems the crawler has reached a broken state because a previously crawled page had bogus links (most likely the Wayback Machine page indeed).
As a work-around, you could probably recrawl the Wayback Machine pages using Python Requests instead of Chromium. As for the tilde.town pages, they can most likely be recrawled as is after restarting the crawler.
Otherwise, I'll have a look tonight to fix the root of the issue.

Owner

biolds commented Oct 9, 2023

It looks like a bug in Selenium; I have opened SeleniumHQ/selenium#12906. I'll implement a work-around in the meantime.

Edit:
The bug is actually in Chromedriver; I have opened another issue there: https://bugs.chromium.org/p/chromedriver/issues/detail?id=4589

Owner

biolds commented Oct 15, 2023

@NinCollin I have released a new version that adds support for crawling with Firefox, so Wayback Machine pages can now be crawled!
