This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Write a Scrapy spider that fetches a tag (#vacancy, others?) and extracts toots/updates found for that tag. #41

Open
berkes opened this issue Mar 3, 2021 · 1 comment
Labels
fedifind Issues related to the intermediate "Fedi Find" project. scrapy task
Milestone

Comments

@berkes
Contributor

berkes commented Mar 3, 2021

Write a Scrapy spider that fetches a tag (#vacancy, others?) and extracts toots/updates found for that tag.

Details

The spider should start from a list of seed instances and follow references across instances to fetch toots/updates for a given hashtag (e.g. #vacancy, #job).
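A minimal sketch of the seeding and cross-instance discovery described above, assuming the instances expose Mastodon's public tag timeline at `/api/v1/timelines/tag/:hashtag` and that each returned status carries a `uri` field. The helper names `tag_timeline_url` and `discover_instances` are hypothetical, and some instances may require authentication for this endpoint. The resulting list could serve as the `start_urls` of a Scrapy spider.

```python
from urllib.parse import quote, urlsplit


def tag_timeline_url(instance: str, tag: str) -> str:
    """Build the tag-timeline URL for one instance (hypothetical helper)."""
    # Accept both "#vacancy" and "vacancy".
    tag = tag.lstrip("#")
    return f"https://{instance}/api/v1/timelines/tag/{quote(tag, safe='')}"


def discover_instances(statuses, known):
    """Return hosts found in status URIs that are not in `known` yet."""
    found = set()
    for status in statuses:
        # The `uri` of a federated status points at its home instance.
        host = urlsplit(status.get("uri") or "").hostname
        if host and host not in known:
            found.add(host)
    return found


# Seed instances are placeholders, not real crawl targets.
seeds = ["example.com", "example.org"]
start_urls = [tag_timeline_url(host, "#vacancy") for host in seeds]
```

Hosts returned by `discover_instances` can be added to the crawl queue, which is one way to "follow across instances" without a fixed list.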

Deliverable

  • It should try to deduplicate toots. When instance "example.com" has a toot by "@[email protected]" and "example.org" has the same toot, it should appear only once in the data file.
  • If an update is manually re-tooted (i.e. the text is copied into a new update), it may appear multiple times. Deduplicating based on the content of an update is not important.
  • Boosts and replies should be ignored (for now).
  • If tooling is required to set up the environment (pipenv etc.), a command should be provided that gets this running for devs and CI.
  • It should be one command, so that integration is easy. Preferably a command that runs and then stops, rather than a daemon.
  • Scrapy is preferred, as other parts of this project already use it.
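
The deduplication and filtering rules above could be sketched as follows. This assumes statuses are dicts shaped like the Mastodon API's Status entity, where `uri` identifies a toot across instances, `reblog` is set on boosts, and `in_reply_to_id` is set on replies. The function names and the sample data are illustrative only.

```python
def is_original(status: dict) -> bool:
    """True for a plain toot: neither a boost nor a reply."""
    return status.get("reblog") is None and status.get("in_reply_to_id") is None


def deduplicate(statuses):
    """Keep the first copy of each original toot, keyed on its `uri`."""
    seen = set()
    unique = []
    for status in statuses:
        uri = status.get("uri")
        if not uri or uri in seen or not is_original(status):
            continue
        seen.add(uri)
        unique.append(status)
    return unique


# Made-up sample: the same toot as seen by two instances (different local
# ids, same uri), followed by a boost and a reply.
toot_uri = "https://example.com/users/alice/statuses/100"
sample = [
    {"id": "1", "uri": toot_uri, "reblog": None, "in_reply_to_id": None},
    {"id": "987", "uri": toot_uri, "reblog": None, "in_reply_to_id": None},
    {"id": "2", "uri": "https://example.org/users/bob/statuses/200/activity",
     "reblog": {"uri": toot_uri}, "in_reply_to_id": None},
    {"id": "3", "uri": "https://example.org/users/carol/statuses/300",
     "reblog": None, "in_reply_to_id": "1"},
]
result = deduplicate(sample)
```

Because the key is the `uri` and not the text, a manually copied toot gets its own `uri` and is kept as a separate entry, which matches the second bullet.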
@berkes berkes added task fedifind Issues related to the intermediate "Fedi Find" project. labels Mar 3, 2021
@berkes berkes added this to the Search milestone Mar 3, 2021
@berkes berkes added the scrapy label Mar 3, 2021
@berkes
Contributor Author

berkes commented Mar 17, 2021

I've experimented with the Mastodon API through elefren, a Rust client library.

The preliminary result is a project called hunter2.

Usage: target/debug/hunter2 [options]

Options:
    -h, --help          print this help menu
    -r, --register      register hunter2 with your instance.
    -f, --follow        follow live updates.
    -p, --past          fetch past updates.

Using this, I've filled an initial MeiliSearch index. It currently runs on 178.62.220.231. This address is temporary: it will change or go down, and will be replaced by a proper instance with a domain name and HTTPS.
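
For illustration only, here is a rough sketch of how a fetched status might be turned into a MeiliSearch document. It is not how hunter2 does it: the field selection and the `to_document` helper are assumptions. MeiliSearch document ids may only contain alphanumerics, hyphens and underscores, so the sketch hashes the status `uri` to get a stable id.

```python
import hashlib
import json


def to_document(status: dict) -> dict:
    """Map a Mastodon-style status dict to a flat search document."""
    uri = status["uri"]
    return {
        # A URI contains ':' and '/', which are not valid in a document id,
        # so a SHA-256 hex digest of the URI is used instead.
        "id": hashlib.sha256(uri.encode("utf-8")).hexdigest(),
        "uri": uri,
        "content": status.get("content", ""),  # HTML as returned by the API
        "created_at": status.get("created_at"),
        "tags": [tag["name"] for tag in status.get("tags", [])],
    }


# Made-up status for demonstration.
status = {
    "uri": "https://example.com/users/alice/statuses/100",
    "content": "<p>We are hiring! #vacancy</p>",
    "created_at": "2021-03-17T10:00:00.000Z",
    "tags": [{"name": "vacancy", "url": "https://example.com/tags/vacancy"}],
}
doc = to_document(status)

# JSON array body, as accepted by MeiliSearch's add-documents route
# (POST /indexes/{index_uid}/documents).
payload = json.dumps([doc])
```

Hashing the `uri` also means the same toot fetched from two instances maps to the same document id, so re-indexing overwrites the document instead of duplicating it.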

[image]
