Research

Tooling

WebArchiving Tools Index

Heritrix3 Crawling JS Archiving Rich Media

Brozzler

WebRecorder Tools pywb warc tool

WebRecorder / Brozzler Speedruns

Wayback Machine

www.kcna.kp Wayback Sitemap

New Save Page Now Uses Brozzler

Batch archive URLs from google sheets

Official Save Page Now API

StackExchange Answer Save Page Now 2 Change Log Save Page Now 2 Public API Docs Draft
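As a rough sketch of what a Save Page Now 2 capture request looks like, the snippet below builds (but does not send) an authenticated POST. The endpoint, the `url` form field, and the `LOW accesskey:secret` authorization format are my reading of the public API docs draft linked above and should be checked against it; `build_spn2_request` and the placeholder keys are hypothetical names for illustration.

```python
import urllib.parse
import urllib.request

# Assumed SPN2 endpoint, per the public API docs draft; verify before use.
SPN2_ENDPOINT = "https://web.archive.org/save"


def build_spn2_request(target_url, access_key, secret_key):
    """Build (but do not send) a Save Page Now 2 capture request."""
    # The target is passed as a form-encoded `url` field in the POST body.
    body = urllib.parse.urlencode({"url": target_url}).encode("ascii")
    return urllib.request.Request(
        SPN2_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            # Assumed S3-style key pair from an archive.org account.
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
    )


if __name__ == "__main__":
    # Placeholder credentials; nothing is sent over the network here.
    req = build_spn2_request("http://www.kcna.kp/", "ACCESS_KEY", "SECRET_KEY")
    print(req.get_method(), req.full_url)
```

Sending the request is left to the caller (for example with `urllib.request.urlopen(req)`), so the sketch can be inspected without touching the network.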

Unofficial Wayback Save API Unofficial Wayback Save Gem

Mark Graham

Director of Wayback

Mark Graham Presentation Video

[01:25] "There are many of our partners that are using us, specifically, to archive the news. This is one particular crawl that I set up, archiving North Korean news. I am capturing something like forty-some North Korean sources every day."

IA Reddit AMA

I am especially passionate about this archive of web content from and about North Korea: https://archive-it.org/collections/6777

  • Mark

Archive-It North Korea Collection

Other IA Wayback Collections

kcna.kp collections (GDELT appears to be prominent)

IA Whole Earth Web Archiving

WEWA NK Page IA Webservices

Archive Team

Archive Bot

ArchiveBot

Archive Bot Job 8veu3

North Korea Governments/North Korea

Could set up Archive Team project following this guide.

DNS Leak

NK DNS Leak

Behavior

Wayback SPN appears broken for kcna.kp and others.

Example SPN2 not working Crashes SPN for rodong.rep.kp

The example site from the "SPN2 not working" link above now appears on Wayback and mostly works. Link Perhaps SPN captures take a few days to show up on Wayback.
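One way to test the "takes a few days" guess would be to poll the Wayback availability endpoint after a capture and note when a snapshot first appears. The sketch below only builds the query URL and parses a response; the endpoint and JSON shape are as I understand the public availability API, and `SAMPLE_RESPONSE` is a made-up example, not real capture data.

```python
import json
import urllib.parse

# Wayback availability endpoint (assumed; check the IA web services docs).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

# Hypothetical response body, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
})


def availability_url(target_url):
    """Return the availability query URL for a target site."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": target_url})


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) for the closest capture, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


if __name__ == "__main__":
    print(availability_url("rodong.rep.kp"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

Running the query once a day after a Save Page Now submission and logging the first non-`None` result would give a rough measure of the delay.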

Some images are not being saved properly. Some return 403 Access Denied, even though I can access them from my browser. One example of this strange behavior is this photo, which was captured successfully but failed with a 403 a few days later here.
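To see which captures of an image recorded a 403, the Wayback CDX index can be queried for the status code of each capture. This is a minimal sketch, assuming the CDX server's `output=json` format (a header row followed by data rows) and the `fl` and `limit` parameters; `captures_by_status` is a hypothetical helper, and the sample rows in the usage block are invented.

```python
import json
import urllib.parse

# Wayback CDX server endpoint (assumed; check the CDX server docs).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(target_url, limit=50):
    """Build a CDX query returning timestamp and status code per capture."""
    params = {
        "url": target_url,
        "output": "json",
        "fl": "timestamp,statuscode",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def captures_by_status(cdx_json_text):
    """Group capture timestamps by recorded HTTP status code."""
    if not cdx_json_text.strip():
        return {}
    rows = json.loads(cdx_json_text)
    if not rows:
        return {}
    # First row is the header naming each column.
    header, records = rows[0], rows[1:]
    ts_col = header.index("timestamp")
    status_col = header.index("statuscode")
    grouped = {}
    for record in records:
        grouped.setdefault(record[status_col], []).append(record[ts_col])
    return grouped


if __name__ == "__main__":
    # Invented rows standing in for a real CDX response.
    sample = json.dumps([
        ["timestamp", "statuscode"],
        ["20200101000000", "200"],
        ["20200104000000", "403"],
    ])
    print(captures_by_status(sample))
```

A capture that is listed with status 200 and later ones with 403 would suggest the site started refusing the crawler, rather than the original capture being lost.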

User Agent probably doesn't matter. Most of these sites appear to have extreme rate limiting.
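If rate limiting is the real constraint, any scripted capture should space out its requests. Below is a minimal sketch of a capped exponential backoff schedule; the 5 second base and 300 second cap are arbitrary assumptions, not values measured against these sites, and `backoff_delays` is a hypothetical helper.

```python
import random


def backoff_delays(attempts, base=5.0, cap=300.0, jitter=0.0, rng=None):
    """Return a list of wait times (seconds) for successive retries.

    Each delay doubles from `base` until it reaches `cap`. If `jitter` is
    non-zero, up to `jitter * delay` extra seconds are added at random.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # With the defaults and no jitter this prints [5.0, 10.0, 20.0, 40.0].
    print(backoff_delays(4))
```

A caller would sleep for each delay in turn (for example with `time.sleep`) between failed or throttled requests, and give up once the list is exhausted.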

I'm also able to capture well using Webrecorder's ArchiveWeb.page extension.

How to connect and other resources

Research Projects #467 Webscraping README War Dialing README