Carefully curated list of awesome digital preservation resources.
This Awesome List is one of a suite of community-owned resources for digital preservation. See digipres.org or the digipres.org discussion forum for more information.
Contributions are welcome. Please add links through pull requests, or create an issue to start a discussion. Please refer to CONTRIBUTING.md for detailed guidance. And if obsolescence claims something awesome, there's always the Archive.
The text of an annual reminder email about these resources is also held here, in reminder.md. It is sent to various mailing lists once per year, ahead of World Digital Preservation Day.
- Get Started
- Store Digital Content
- Create Preservation Metadata
- Find Test Files
- Find More Tools
- Build Workflows
- Improve The Tools
Spotted digital data at risk, but don't know who can save it?
- Save web pages via:
- Internet Archive Nominations
- archive.is - Also known as archive.today.
- perma.cc
- webcitation.org
- UK Web Archive Site Nomination - Suggest URLs for the UK Web Archive. Note that UKWA is offline at present.
- Alert the Archive Team, and help them save digital stuff
- The Getting Started chapter of the Digital Preservation Handbook is a great place to start.
- The Digital Preservation Handbook Glossary - Introduces a lot of the core terminology.
- For material that describes the broader issues, you can refer to Digital Preservation on Wikipedia, and consider contributing to the Digital Preservation Wikipedia Project.
- Build your roadmap, guided by the NDSA Levels of Digital Preservation
- Use the Digital Preservation Business Case Toolkit to help get funding.
- Understanding your costs can help to plan your preservation work more effectively. The Curation Costs Exchange allows you to compare your costing data with that of many other organisations.
- Learn about Preserving digital Objects with Restricted Resources
- Learn how to Authenticate, Manage, and Preserve Video -- WITNESS trains activists to archive and preserve their video so that human rights abuses cannot be denied or forgotten over time.
- Explore and contribute to the DP Requirements and Solutions wiki
- Brainscape Digital Preservation Flash Cards - See https://github.com/ross-spencer/brainscape-digital-preservation#readme for more information.
We need to understand the file formats of the resources we care for, and the software they depend on.
- Search across format registries
- Find or add formats to the File Formats Wiki
- Understand file format risks (e.g. JP2)
- Game File Format Central (archived version) - Community project documenting over 1300 game-related file formats.
- Just Solve It - File Formats Wiki - Community project documenting a wide variety of file formats.
If you have good examples of digital resources and their risks, please consider adding them to a test corpus.
There are a lot of tools out there (see the tools section below), but some tools are particularly great for early experimentation. These tools can be used right in your web browser, so you can get started without installing software locally.
These tools are accessed using your browser, and work by sending a copy of your files to a remote server.
- Siegfried - You can use the side bar to upload a file for Siegfried to identify the format.
- Online TrID File Identifier - A web service that identifies files using TrID.
These tools run entirely in your web browser, so no data is sent anywhere.
- Siegfried JS - This runs the Siegfried format identification tool on your files in your browser.
- CyberChef - The Cyber Swiss Army Knife. Capable of running lots of basic data operations on text or files, including computing things like MD5 or SHA hashes.
- warc-analyser - Proof-of-concept that analyses WARC files in your browser. See https://github.com/edsu/warc-analyzer for more information.
- Demystify Lite - This runs Siegfried WASM on your files in your browser and outputs a Demystify-formatted report that profiles your collection. The report highlights files that might need specific attention during appraisal, such as duplicates, and during later preservation activities, such as file names encoded using specific character encodings.
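The checksum operations mentioned above for CyberChef can also be run locally, which may suit files that are too large for a browser tab. The following is a minimal sketch, assuming Python 3 and only its standard library; `file_digest` is a hypothetical helper name and not part of any tool listed here.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Chunked reading keeps memory use flat even for very large files.
    `algorithm` is any name accepted by hashlib.new, e.g. "md5" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_digest("example.tif")` returns a SHA-256 hex string that can be stored and compared later to check that the file has not changed; the file name here is only a placeholder.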
- Visual examples of digital preservation challenges, such as graphic corruption, can be incredibly useful in communicating the digital preservation message. That's why we built the Atlas of Digital Damages Gallery and website. Please add your own images of a digital preservation challenge, failed rendering, encoding damage, corrupt data, or other visual evidence to the Atlas of Digital Damages.
- Use the POWRR One Pagers to educate stakeholders about the issues.
- Working with your IT department (some responses arising from this question on Twitter):
- Backup versus preservation: "I had useful discussions about 'for the long-term' and what issues that might throw up as a starting point."
- "often it comes down to language - avoid the word archiving. means 1 thing to IT & another DP professionals"
- Ten IT skills you need to have to work with digital preservation - written by Dave Thompson
- "Am going to go all Bruce Lee 'adjust to the object'. What drives IT? Delivery, efficiency, user needs etc."
Advance digital preservation by pooling our experience, sharing our stories and finding the answers to the big questions.
- Q&A:
- Ask and answer digital preservation questions
- We tried to run a Digital Preservation Stack Exchange, but it didn't work out. The content is available here
- Forums
- Discussion forums and active blogs provide the opportunity to share informal advice and war stories, get recommendations and discuss the finer points of digital preservation. By sharing both your intentions for digital preservation work and your results, you can ensure your work benefits from a wealth of community experience.
- Discuss preservation issues on the Digital Curation forum
- Share war stories on OPF blogs
- Mastodon - Join these federations with a digital preservation or general GLAM focus:
- Twitter - Use these lists to find people to follow:
- r/DataHoarder - "We are digital librarians."
- r/Archiveteam - "Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."
- Join the Digital POWRR Slack
- Face-to-Face communities/support groups:
- Collaborations (inc. groups that build things together):
- Association of Moving Image Archivists (AMIA) Open Source Committee on GitHub
- Zenodo Digital Preservation Community - Building a comprehensive bibliography of publications, presentations, instructions and data sets related to digital preservation.
- Models, Standards & Certification:
- The Reference Model for an Open Archival Information System (OAIS)
- CoreTrustSeal - CoreTrustSeal offers to any interested data repository a core level certification based on the Core Trustworthy Data Repositories Requirements.
- Conferences:
- Membership organizations:
- A history of storage media
- File system conventions:
- The PREMIS Data Dictionary for Preservation Metadata
- Metadata Encoding & Transmission Standard (METS)
- Portland Common Data Model (PCDM)
To improve our digital preservation tools, we need to be able to test them and evaluate their performance. Publicly available sample files make this much easier. Tool developers can use them to test their work, discover bugs, and hone their tools ready for others to use. A test corpus can contain real digital objects from a collection, or be created specifically to exhibit certain characteristics for testing purposes. Real data, especially examples of broken, badly formed or corrupted files, can be particularly useful.
- The OPF Format Corpus
- The iPres System Showcase Test Suite - Hosted by the UK Web Archive. Note that UKWA is offline at present.
- The Encyclopedia of Graphics File Formats Companion CD-ROM contains lots of test files for image formats:
- EDRM Data Set Files (archived version)
- digitalcorpora.org's corpora - including govdocs1.
- Open Preservation Foundation had a corpora page (archived version).
- OPF govdocs here
- OPF also created a by-format subset of govdocs1.
- digicam corpus - Contains a corpus of Digital Camera files collected by Tyler Thorsted.
- The Skeleton Test Suite - Builds test files from PRONOM binary and container signatures. These can be used to test DROID and other (compatible) identification tools.
- Fine Free File Test Suite - Set up for Fedora testing.
- JHOVE's test files
- JHOVE2's test files
- The disktype test files
- The Metadata Working Group specifications (archived version) and embedded image metadata test corpus (archived version)
- Apache Tika issue about setting up a nightly test corpus - See also tika-parsers/src/test/resources/test-documents
- The Chemical MIME Home Page
- Online-convert.com example files (use this link to browse the folder structure)
- RDSS Archivematica Test Data Corpus - A collection of research dataset files used for testing Archivematica integration and functionality in the JISC Research Data Shared Service (RDSS).
- Archivematica Sample Data - Includes OPF format corpus, as well as other test material.
- ExifTool test files
- PREFORMA Ground Truth Classes - Instructions on how to reproduce validation-failing files for Matroska, FFV1, LPCM, TIFF, and PDF formats.
- "Small" - Collection of "the smallest possible syntactically valid files in different programming/scripting/markup languages."
- MediaArea-RegressionTestingFiles - Public regression testing files for MediaArea. Contains AVI, FLV, MPEG Audio, MOV, MPEG-4, MPEG-PS, and Matroska files.
- TechSlides sample files for web development (archived version) - Sample files for various image formats, video files, data structures, fonts, and web development files.
- Internet File Formats - Companion CD-ROM to Internet File Formats, contains Sample Files and some File Format Specifications for a variety of common file formats circa 1995.
- Apache Tika's regression corpus - Millions of files collected largely from govdocs1 and Common Crawl with oversampling on binary formats.
- Apache Tika's Bugtracker corpora - Dense set of problematic files -- attachments from bug trackers for open source parsers.
- Adobe Acrobat Engineering (archived version) - Site has lots of useful test documents (archived version).
- Isartor PDF/A Test Suite
- veraPDF Corpus - For PDF/A.
- Synthetic PDF Testset for File Format Validation - Test set for well formedness validation in JHOVE - see associated paper.
- PDF Differences - Targeted test files that highlight specific portability and interoperability issues by the PDF Association.
- PDF Cabinet of Horrors
- DARPA SafeDocs - 8 million non-truncated PDFs from a month of Common Crawl
- See also: The PDF Association's list of PDF-focused corpora
- The IDPF ePub test suite
- KBNLresearch/epubPolicyTests - Some #epub samples with encryption, DTBook content and foreign resources, with corresponding #epubcheck output.
- The libtiff TIFF Test Images
- OPF JP2k test corpus
- NITF version 2.1 JPEG 2000 Sample Imagery (archived version)
- JPEG 2000 Part 4 Conformance Test Files (archived version) (v.1.5 with earlier versions also available in the archive history)
- openjpeg-data - Test files for OpenJPEG.
- jpylyzer-test-files - Test files for Jpylyzer.
If the existing corpora aren't cutting it, perhaps you can contribute to the OPF Format Corpus hosted on GitHub. There's a guide here on how to contribute (archived version) or you can contact OPF for help on how to get involved.
Web archives can provide a useful source of files of particular formats. For example, search via the UKWA interface. Note that UKWA is offline at present.
Software tools give us the means to interrogate, manipulate, understand and ultimately preserve our digital data. The Community Owned digital Preservation Tool Registry (COPTR) has unified five isolated tool registries. It provides an easy-to-edit wiki interface where we can share our knowledge about, and experiences with, tools used for digital preservation purposes.
- Find tools to solve your challenges with the POWRR Tools Grid, generated from the COPTR wiki.
- Find tools by function.
- Contribute your experiences of using tools to the COPTR wiki.
- If you find or create new tools, please add them to COPTR.
Resources to help build up preservation workflows, e.g. templates for how to use command-line tools, and how to chain things together.
- ffmprovisr 'Making FFmpeg Easier' (example of how to use ffmpeg to perform specific tasks)
- AMIA Open Source: List of open workflows for A/V resources
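As an illustration of the kind of command-line template these workflow resources describe, the sketch below builds an ffmpeg command in Python and runs it as one step of a scripted workflow. The codec options (lossless FFV1 video, audio copied as-is) are an assumption chosen for illustration and are not quoted from any of the resources above; `ffv1_command` and `run_step` are hypothetical helper names.

```python
import shutil
import subprocess


def ffv1_command(source, target):
    """Build an ffmpeg command that re-encodes video as FFV1 and copies audio."""
    return [
        "ffmpeg",
        "-i", source,    # input file
        "-c:v", "ffv1",  # encode video with the lossless FFV1 codec
        "-level", "3",   # use FFV1 version 3
        "-c:a", "copy",  # copy the audio stream without re-encoding
        target,          # output file, e.g. a Matroska (.mkv) container
    ]


def run_step(command):
    """Run one workflow step, failing loudly if the tool is missing or errors."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution means each step can be logged or reviewed before it runs, and further steps can be chained by calling `run_step` again with a different command list.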
Contributing to the development and improvement of tools is easy, even if you're not technical. Check out this guide to making small documentation edits or raising issues on GitHub.
Identifying file formats is the bread and butter of digital preservation characterisation and assessment. Identification tool coverage and accuracy could be much better, and this primarily comes down to the signatures, or file format "magic", used to identify each format. You can help contribute and make our identification tools more effective here:
- A basic guide for writing format signatures - Covers Apache Tika and DROID.
- DROID/PRONOM also has this official guide
- Contribute a file format signature to FILE - See this guide.
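To make the idea of a format signature concrete, here is a minimal sketch of "magic" matching against the first bytes of a file. It is a deliberate simplification: the signatures used by DROID/PRONOM, Apache Tika and FILE can also specify offsets, wildcards and container contents, none of which is modelled here. The `SIGNATURES` table and the `identify` function are hypothetical names for illustration.

```python
# Each entry pairs a byte sequence expected at offset 0 with a format label.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data):
    """Return the label of the first signature matching the start of `data`.

    Returns None when no signature matches, which is how gaps in
    signature coverage show up in practice.
    """
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

A file that returns `None` here is exactly the case where a newly contributed signature would improve coverage.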
Deep file characterisation enables validation, identification of preservation risks and extraction of metadata. In developing a new characterisation capability, begin with thorough research to identify existing code to re-use or build on, develop a focused command line tool, then consider turning it into a JHOVE module.
- Develop a new file characterisation capability and turn it into a JHOVE module, or an Apache Tika module.
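The "focused command line tool" step can start very small. The sketch below, assuming Python 3 and its standard library, extracts a single property (pixel dimensions) from a PNG file by reading its IHDR chunk. It does not verify chunk CRCs and is not a JHOVE or Apache Tika module; `png_dimensions` and `main` are hypothetical names.

```python
import struct
import sys

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG: signature missing")
    if len(data) < 24:
        raise ValueError("truncated PNG: IHDR chunk incomplete")
    # A chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("malformed PNG: first chunk is not a valid IHDR")
    # IHDR data begins with width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def main(argv):
    """Print the dimensions of each PNG file named in argv."""
    for path in argv:
        with open(path, "rb") as handle:
            header = handle.read(24)  # signature (8) + chunk header (8) + w/h (8)
        try:
            width, height = png_dimensions(header)
            print(f"{path}\t{width}x{height}")
        except ValueError as error:
            print(f"{path}\t{error}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Reporting a clear error for malformed input, rather than crashing, is the part of such a tool that tends to matter most when it is later wrapped as a module in a larger framework.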