feat: initial tg25/tgno crawling config #2

Draft
wants to merge 1 commit into
base: main
1 change: 0 additions & 1 deletion browsertrix-crawler/configs/tg24.yaml
@@ -1,4 +1,3 @@
# TODO: Adjust for new TG24 (and beyond) site url structure
seeds:
# Crawl content available via navigation and frontpage
- url: https://www.gathering.org
34 changes: 34 additions & 0 deletions browsertrix-crawler/configs/tgno.yaml
@@ -0,0 +1,34 @@
# Config intended to be used on new tg.no once launched. This page differs from
# previous iterations (in practice, even if not in theory) by being a single
# site gradually updated with new content and styling, rather than a new site
# each year.
seeds:
# Crawl content available via navigation and frontpage
- url: https://www.gathering.org
include:
# Basic pages
- www.gathering.org

# Block calls to our tracking service
blockRules:
- url: matomo.gathering.org

collection: tgno

behaviors: autoscroll,autoplay,autofetch,siteSpecific
waitUntil: load,networkidle0
generateCDX: true
combineWARCs: true
saveState: always
workers: 4
# TODO: Remove if not needed; hopefully we won't need the consent flow on the new site
# Minimal profile that includes consent answers
# profile: /crawls/profiles/tg24.tar.gz

# Make "live" crawling view available at 9037
newContext: window
screencastPort: 9037

warcinfo:
operator: The Gathering
hostname: tg.no
2 changes: 2 additions & 0 deletions wayback/startup.sh
@@ -16,6 +16,7 @@ git clone https://github.com/gathering/go-archive-tg21 || (cd go-archive-tg21 ;
git clone https://github.com/gathering/go-archive-tg22 || (cd go-archive-tg22 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tg23 || (cd go-archive-tg23 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tg24 || (cd go-archive-tg24 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tg25 || (cd go-archive-tg25 ; git pull ; git lfs pull ; cd ..)

cd "$WORKDIR"

@@ -25,5 +26,6 @@ cp -r "$SOURCES/go-archive-tg21/browsertrix-crawler/crawls/collections/tg21/" "$
cp -r "$SOURCES/go-archive-tg22/browsertrix-crawler/tg22/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tg23/tg23/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tg24/tg24/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tg25/tg25/" "$COLLECTIONS/"

exec /docker-entrypoint.sh "$@"
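
Each new archive year currently adds another near-identical `git clone ... || (cd ... ; git pull ; git lfs pull ; cd ..)` line to `wayback/startup.sh`. As a possible follow-up, and not part of this change, the clone-or-update step could be driven by a list of years. The sketch below is only an illustration: the helper names `archive_repos` and `sync_repo` are hypothetical, and it checks for an existing `.git` directory instead of relying on `git clone` failing, which is a slightly different behaviour from the current script. The `cp -r` lines are left out on purpose, because the collection path differs between years.

```shell
# Hypothetical helper: print the archive repository name for each year given.
archive_repos() {
  for year in "$@"; do
    printf 'go-archive-tg%s\n' "$year"
  done
}

# Hypothetical helper: clone a repository, or update it (including LFS
# objects) if a clone already exists in the current directory.
# Assumption: an existing clone is detected by its .git directory.
sync_repo() {
  repo="$1"
  if [ -d "$repo/.git" ]; then
    (cd "$repo" && git pull && git lfs pull)
  else
    git clone "https://github.com/gathering/$repo"
  fi
}

# Example usage, commented out so the sketch has no side effects:
# for repo in $(archive_repos 21 22 23 24 25); do
#   sync_repo "$repo"
# done
```

With this shape, adding a new year would mean appending one number to the list rather than copying a whole line.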