Heritrix3 Crawling JS Archiving Rich Media
WebRecorder Tools pywb warc tool
WebRecorder / Brozzler Speedruns
New Save Page Now Uses Brozzler
Batch archive URLs from google sheets
StackExchange Answer
Save Page Now 2 Change Log
Save Page Now 2 Public API Docs Draft
Unofficial Wayback Save API
Unofficial Wayback Save Gem
Director of Wayback
Mark Graham Presentation Video
[01:25] "There are many of our partners that are using us, specifically, to archive the news. This is one particular crawl that I set up, archiving North Korean news. I am capturing something like forty-some North Korean sources every day."
I am especially passionate about this archive of web content from and about North Korea: https://archive-it.org/collections/6777
- Mark
Archive-It North Korea Collection
kcna.kp collections: GDELT appears to be prominent.
North Korea Governments/North Korea
Could set up an Archive Team project following this guide.
Wayback SPN appears broken for kcna.kp and others.
Example of SPN2 not working. SPN also crashes for rodong.rep.kp.
The example site from the SPN2 link above now appears on Wayback and mostly works: Link. Perhaps SPN captures take a few days to show up on Wayback.
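To test the "takes a few days to show up" guess, one option is to poll the public Wayback Availability API after submitting a capture. This is a minimal sketch, assuming the `https://archive.org/wayback/available` endpoint and its usual `archived_snapshots.closest` response shape. The function names are hypothetical and not part of any tool linked above.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Availability API endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return the closest snapshot dict from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check(page_url, timestamp=None):
    """Query the API once (network call) and return the closest snapshot or None."""
    query = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("kcna.kp")` once a day after an SPN submission and logging the returned `timestamp` would show how long a capture takes to become visible, if it appears at all.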
Some images are not being saved properly. Some captures show 403 Access Denied even though I can load the same images from my browser. One example of this strange behavior is this photo, which was captured successfully but failed with a 403 a few days later here.
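The captured-then-403 pattern can be checked per resource by listing its captures and their HTTP status codes from the Wayback CDX server. The sketch below assumes the public `https://web.archive.org/cdx/search/cdx` endpoint with `output=json`, where the first returned row is a header. The helper names are hypothetical, and the image URL is left as a parameter because the notes do not record it.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(resource_url, limit=50):
    """Build a CDX query that returns only the timestamp and status code fields."""
    params = {
        "url": resource_url,
        "output": "json",
        "fl": "timestamp,statuscode",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def rows_to_captures(rows):
    """Convert CDX JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def status_changes(captures):
    """Return (timestamp, statuscode) pairs where the status differs from the previous capture."""
    changes = []
    previous = None
    for capture in captures:
        if capture["statuscode"] != previous:
            changes.append((capture["timestamp"], capture["statuscode"]))
            previous = capture["statuscode"]
    return changes


def capture_history(resource_url, limit=50):
    """Fetch capture rows for one resource (network call) and summarize status changes."""
    with urllib.request.urlopen(build_cdx_url(resource_url, limit), timeout=30) as resp:
        return status_changes(rows_to_captures(json.load(resp)))
```

A history that goes from `200` to `403` for the same image would be consistent with the origin server blocking the crawler on later visits rather than with the first capture having failed.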
User Agent probably doesn't matter. Most of these sites appear to have extreme rate limiting.
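If rate limiting really is the cause, a crawler would need to slow down and retry rather than change its User-Agent. The following is a rough sketch of capped exponential backoff using only the standard library. The status codes treated as "rate limited" (403, 429, 503) and the delay values are assumptions, not measured behavior of these sites, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def backoff_delays(attempts, base=5.0, cap=120.0):
    """Return the wait (in seconds) after each attempt: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def fetch_with_backoff(url, attempts=4, base=5.0, cap=120.0):
    """Fetch url, sleeping with capped exponential backoff after a rate-limit-like response.

    Returns the response body as bytes, or None if every attempt was refused.
    """
    for delay in backoff_delays(attempts, base, cap):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Assumed rate-limit signals; anything else is re-raised unchanged.
            if err.code not in (403, 429, 503):
                raise
        time.sleep(delay)
    return None
```

With the defaults, a refused request waits 5, 10, 20, and then 40 seconds between tries, which keeps the total load on a slow site low.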
I'm also able to capture these sites well using Webrecorder's ArchiveWeb.page extension.
Research Projects #467 Webscraping README War Dialing README