Commit

Merge branch 'master' into issue-843
machawk1 committed Oct 16, 2024
2 parents 4e1a44f + 79359a5 commit 54a8d4e
Showing 8 changed files with 16 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -15,10 +15,10 @@ jobs:
- macos-latest
# - windows-latest
python:
- "3.8"
- "3.9"
- "3.10"
- "3.11"
- "3.12"
ipfs:
- "0.28"
- "0.29"
8 changes: 4 additions & 4 deletions README.md
@@ -6,7 +6,7 @@

[![pypi](https://img.shields.io/pypi/v/ipwb.svg)](https://pypi.org/project/ipwb) [![codecov](https://codecov.io/gh/oduwsdl/ipwb/branch/master/graph/badge.svg)](https://codecov.io/gh/oduwsdl/ipwb)

-InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717) files into the IPFS network. [IPFS](https://ipfs.io/) is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a [CDXJ index](https://github.com/oduwsdl/ORS/wiki/CDXJ) with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
+InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717) files into the IPFS network. [IPFS](https://ipfs.io/) is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a [CDXJ index](https://github.com/oduwsdl/ORS/wiki/CDXJ) with references to the IPFS hashes that are returned, and combines the header and payload from IPFS at the time of replay.

InterPlanetary Wayback primarily consists of two scripts:

@@ -90,7 +90,7 @@ $ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG

Once started, the replay system's web interface can be accessed through a web browser, e.g., <http://localhost:2016/> by default.

-To run it under a domain name other than `localhost`, the easiest approach is to use a reverse proxy that supports HTTPS. The replay system utilizes [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for URL rerouting/rewriting to prevent [live leakage (zombies)](http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html). However, for security reason many web browsers have mandated HTTPS for the Service Worker API with only exception if the domain is `localhost`. [Caddy Server](https://caddyserver.com/) and [Traefik](https://traefik.io/) can be used as a reverse-proxy server and are very easy to setup. They come with built-in HTTPS support and manage (install and update) TLS certificates transparently and automatically from [Let's Encrypt](https://letsencrypt.org/). However, any web server proxy that has HTTPS support on the front-end will work. To make ipwb replay aware of the proxy, use `--proxy` or `-P` flag to supply the proxy URL. This way the replay will yield the supplied proxy URL as a prefix when generating various fully qualified domain name (FQDN) URIs or absolute URIs (for example, those in the TimeMap or Link header) instead of the default `http://localhost:2016`. This can be necessary when the service is running in a private network or a container and only exposed via a reverse-proxy. Suppose a reverse-proxy server is running and ready to forward all traffic on the `https://ipwb.example.com` to the ipwb replay server then the replay can be started as following:
+To run it under a domain name other than `localhost`, the easiest approach is to use a reverse proxy that supports HTTPS. The replay system utilizes [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for URL rerouting/rewriting to prevent [live leakage (zombies)](http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html). However, for security reason many web browsers have mandated HTTPS for the Service Worker API with only exception if the domain is `localhost`. [Caddy Server](https://caddyserver.com/) and [Traefik](https://traefik.io/) can be used as a reverse-proxy server and are very easy to setup. They come with built-in HTTPS support and manage (install and update) TLS certificates transparently and automatically from [Let's Encrypt](https://letsencrypt.org/). However, any web server proxy that has HTTPS support on the front-end will work. To make ipwb replay aware of the proxy, use `--proxy` or `-P` flag to supply the proxy URL. This way the replay will yield the supplied proxy URL as a prefix when generating various fully qualified domain name (FQDN) URIs or absolute URIs (for example, those in the TimeMap or Link header) instead of the default `http://localhost:2016`. This can be necessary when the service is running in a private network or a container, and only exposed via a reverse-proxy. Suppose a reverse-proxy server is running and ready to forward all traffic on the `https://ipwb.example.com` to the ipwb replay server then the replay can be started as following:

```
$ ipwb replay --proxy=https://ipwb.example.com <path/to/cdxj>
@@ -121,7 +121,7 @@ To build an image from the source, run the following command from the directory
$ docker image build -t oduwsdl/ipwb .
```

-By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg `--build-arg SKIPTEST=true` as illustrated below:
+By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg `--build-arg SKIPTEST=true` as shown below:

```
$ docker image build --build-arg SKIPTEST=true -t oduwsdl/ipwb .
@@ -201,7 +201,7 @@ This repo contains the code for integrating [WARC](http://www.iso.org/iso/catalo

### Citing Project

-We have numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. ([Read the PDF](https://matkelly.com/papers/2016_tpdl_ipwb.pdf))
+There are numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. ([Read the PDF](https://matkelly.com/papers/2016_tpdl_ipwb.pdf))

> Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. __InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives__. In _Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries_, pages 411–416, Hamburg, Germany, June 2016.
6 changes: 3 additions & 3 deletions ipwb/error_handler.py
@@ -7,7 +7,7 @@

def exception_logger(catch=True, exception_class=Exception):
"""
-Decorator which catches exceptions in the function and logs them.
+Decorator that catches exceptions in the function and logs them.
Usage:
@@ -17,11 +17,11 @@ def decorated_function(foo, bar):
do_something
```
-`exception_logger()` will catch any exception which happens in
+`exception_logger()` will catch any exception that happens in
`decorated_function()` while it is being executed, and log an error using
Python built in `logging` library.
-Unless `catch` argument is `False` - in which case the exception will be
+Unless `catch` argument is `False` - in which case, the exception will be
reraised.
"""
def decorator(f: Callable):
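
The behavior described in the docstring can be sketched roughly as follows. This is a minimal illustration, assuming a module-level `logger` and the `logger.critical` call that the test file patches. The `wrapper` name, the use of `functools.wraps`, and the `risky` function are assumptions for the example, not necessarily the project's exact implementation.

```python
import functools
import logging
from typing import Callable

# Assumed module-level logger; the tests patch `logger.critical`.
logger = logging.getLogger(__name__)


def exception_logger(catch=True, exception_class=Exception):
    """Log exceptions raised by the decorated function."""
    def decorator(f: Callable):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            try:
                return f(*args, **kwargs)
            except exception_class as err:
                if not catch:
                    # Caller asked for the exception to propagate.
                    raise
                # Swallow the exception after logging its message.
                logger.critical(str(err))
        return wrapper
    return decorator


@exception_logger()
def risky(value):
    # Hypothetical function used only to trigger the decorator.
    raise ValueError(value)


result = risky("boo")  # exception is logged, nothing propagates
```

With the default `catch=True`, the call returns `None` after logging. With `catch=False`, the same call would raise `ValueError`.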
6 changes: 3 additions & 3 deletions ipwb/replay.py
@@ -11,10 +11,10 @@

import sys
import os
+import importlib.resources
import ipfshttpclient as ipfsapi
import json
import subprocess
-import pkg_resources
import surt
import re
import traceback
@@ -1018,8 +1018,8 @@ def get_index_file_full_path(cdxj_file_path=INDEX_FILE):
     if os.path.isfile(cdxj_file_path):
         return cdxj_file_path

-    index_file_name = pkg_resources.resource_filename(
-        __name__, index_file_path)
+    index_file_name = importlib.resources.files(
+        __name__).joinpath(index_file_path)
     return index_file_name
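
A rough sketch of the replacement pattern, using the standard-library `json` package and its `decoder.py` file as stand-ins (both are illustrative choices, not names from ipwb). `importlib.resources.files()` returns a `Traversable` object rather than a plain string, so callers that need a filesystem path may have to convert it with `str()`. Whether ipwb's callers need that conversion is not visible in this diff.

```python
import importlib.resources
import os

# "json" and "decoder.py" are placeholders for a package and a file
# bundled inside it; substitute the real package and resource names.
resource = importlib.resources.files("json").joinpath("decoder.py")

# Traversable objects support is_file(); for packages installed on disk,
# str() yields an ordinary filesystem path.
resource_exists = resource.is_file()
resource_path = str(resource)

print(resource_exists, os.path.basename(resource_path))
```

According to the Python documentation, `files()` accepted only packages before Python 3.12 and accepts non-package modules from 3.12 onward, which may matter when a module's own `__name__` is passed.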


2 changes: 1 addition & 1 deletion ipwb/util.py
@@ -24,7 +24,7 @@

from ipfshttpclient.exceptions import ConnectionError, AddressError
from multiaddr.exceptions import StringParseError
-from pkg_resources import parse_version
+from packaging.version import parse as parse_version

from .exceptions import IPFSDaemonNotAvailable

1 change: 1 addition & 0 deletions requirements.txt
@@ -6,3 +6,4 @@ requests>=2.19.1
beautifulsoup4>=4.6.3
surt>=0.3.0
multiaddr >= 0.0.9
+packaging==23.0
1 change: 1 addition & 0 deletions test-requirements.txt
@@ -2,3 +2,4 @@ flake8>=3.7.9
pytest>=5.3.5
pytest-cov
pytest-flake8
+setuptools
4 changes: 2 additions & 2 deletions tests/test_error_handler.py
@@ -1,5 +1,5 @@
import pytest
-from unittest.mock import MagicMock, patch
+from unittest.mock import MagicMock, patch, ANY

from ipwb.error_handler import exception_logger

@@ -24,4 +24,4 @@ def test_catch():
     with patch('ipwb.error_handler.logger.critical', mock_logger):
         caught_error('boo')

-    assert mock_logger.called_once_with(('boo', ))
+    mock_logger.assert_called_once_with('boo')
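
The reason for this change can be illustrated with a small sketch. On a `MagicMock`, an attribute such as `called_once_with` is not an assertion method. On older Python versions it is auto-created as a child mock whose return value is truthy, so `assert mock.called_once_with(...)` passes regardless of the arguments, while Python 3.12 and later raise `AttributeError` for it. The variable names below are illustrative.

```python
from unittest.mock import MagicMock

mock_logger = MagicMock()
mock_logger("boo")

# The real assertion method passes only when the recorded call matches.
mock_logger.assert_called_once_with("boo")

# With different arguments it raises AssertionError.
try:
    mock_logger.assert_called_once_with("something else")
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

# The look-alike attribute never compares arguments at all.
try:
    lookalike_result = bool(mock_logger.called_once_with("something else"))
except AttributeError:
    # Python 3.12+ refuses the misspelled assertion name outright.
    lookalike_result = None

print(mismatch_detected, lookalike_result)
```

Only the `assert_called_once_with` form checks the arguments, which is why the old test line could not fail on a mismatch.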
