This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

RFC Intermediate Search Product #37

Open
berkes opened this issue Mar 3, 2021 · 0 comments
Labels
rfc Feature Requests, Proposals, ideas and concepts

Summary

A search engine that helps fedizens (fediverse users) to search for:

  • Job postings shared on the fediverse.
  • Candidates that have marked themselves as being for hire on their fediverse profile(s).

Basic example

As a user who is looking for work
When I go to a Flockingbird Work website
And I go to "job openings and announcements"
Then I get a list of all recent toots/updates that mention job openings/vacancies on the known fediverse.
And when I use a keyword in the search
Then the list of those toots/updates is limited to those that mention this keyword, ranked by relevance to that word.
So that I can search for job openings advertised/shared on the fediverse.

As a user who is looking for work
When I add the tag #forhire (and possibly some often used synonyms) to my profile description (bio)
And my profile is public and marked indexable
Then my profile will show up to people searching for candidates on all Flockingbird Work websites
So that I can advertise the fact that I'm looking for work, on the fediverse.

As an actor looking for candidates
When I go to a Flockingbird Work website
And I go to "candidates on the fediverse"
Then I get a list of all candidates who marked their profile as "for hire".
And when I use a keyword in the search
Then the list of those profiles is limited to those that mention this keyword, ranked by relevance to that word.
So that I can look for candidates on the fediverse.

As an actor looking for candidates
When I share my job posting with the relevant tags to the fediverse
Then it gets indexed on all Flockingbird Work websites
So that my job posting is searchable for interested candidates

Motivation

This would be an intermediate product. Its goal is threefold:

  1. Offer a valuable service to job-seekers and candidate-seekers on the existing fediverse.
  2. Prove there is a demand for a place to advertise yourself as a job-seeker and possible candidate on the fediverse.
  3. Prove there is a demand for a place to advertise vacancies (job postings) on the fediverse.

The latter two give insight into how urgently these features need to be introduced in Flockingbird. The first offers a valid, free (as in freedom) space to advertise jobs and candidacy without first building all of the Flockingbird software.

Detailed design

Using Scrapy, we build a simple and naive spider that can crawl public profiles on the fediverse. It extracts data from the HTML in the profiles and serializes it into one or more data files (JSON).
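As a rough illustration of the extraction step, the sketch below reads a profile's name and bio from Open Graph `<meta>` tags and serializes the result as one JSON line. It uses only the standard library so it can run without Scrapy; in a Scrapy spider the same function could be called from a callback with `response.text` and `response.url`. That profile pages expose `og:title` and `og:description`, and the names `MetaTagParser`, `extract_profile` and `to_json_line`, are assumptions made for this example, not part of the RFC.

```python
import json
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta property=... content=...> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        key = attr_map.get("property") or attr_map.get("name")
        if key and attr_map.get("content") is not None:
            self.meta[key] = attr_map["content"]


def extract_profile(html, url):
    """Build a flat profile record from the page's meta tags.

    Assumption: the instance software exposes og:title and og:description.
    """
    parser = MetaTagParser()
    parser.feed(html)
    return {
        "url": url,
        "name": parser.meta.get("og:title", ""),
        "bio": parser.meta.get("og:description", ""),
    }


def to_json_line(record):
    """Serialize one record as a single line, suitable for a JSON Lines file."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Each crawled profile then becomes one line in a data file, which keeps the hand-off to the indexing process simple.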

This spider can start out naïve and incrementally be made smarter and more resilient. Initial iterations can limit themselves to the (latest stable) Mastodon HTML structure. Further refinement can add Mastodon with different frontends. Later iterations can include other Mastodon-like software such as Friendica, Misskey, Pleroma et al., and then expand to crawl the longer tail of software such as Pixelfed, PeerTube, diaspora, Hubzilla etc.

This spider does not need incredible performance. It can index on a schedule; e.g. once a week and therefore may take that entire window to crawl the space.

This spider will be highly focused on this one task. If adversaries grab the open source spiders in order to index profiles in a less friendly way, they should not be helped in this by a generic and easy to extend spider setup. This also allows us to focus on a simple delivery.

The spider must announce itself so instance admins can investigate and get in contact, or block it entirely. We must adhere to noindex and robots.txt to give instance owners an easy way to block us.
A list of domains on which the crawler is not allowed should be published, to offer transparency to users whose admins block us without informing their users.
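A minimal sketch of these politeness checks, using Python's standard `urllib.robotparser`. The user agent string `FlockingbirdWorkBot` and the helpers `may_crawl` and `has_noindex` are hypothetical names for this example. When Scrapy is used, its `ROBOTSTXT_OBEY` and `USER_AGENT` settings may already cover the robots.txt part.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; the real one should point admins to a contact page.
USER_AGENT = "FlockingbirdWorkBot"


def may_crawl(url, robots_txt, blocked_domains):
    """Return True only if neither our own block list nor robots.txt forbids the URL."""
    host = urlsplit(url).hostname or ""
    if host in blocked_domains:
        return False
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(USER_AGENT, url)


def has_noindex(robots_meta_content):
    """Check the content of a <meta name="robots"> tag for a noindex directive."""
    directives = {part.strip().lower() for part in robots_meta_content.split(",")}
    return "noindex" in directives
```

The same `blocked_domains` set could be the source for the published list of domains, so the list users see is the list the crawler actually applies.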

Another spider needs to index toots/updates. It should do this by crawling and following all updates that carry a certain tag (or one of several tags). It extracts data from the toot/update HTML and serializes it into one or more data files (JSON).
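Deciding whether a toot/update carries one of the wanted tags could look roughly like the sketch below. The tag set is only an example built from tags mentioned in this RFC, and the regular expression is a simplification: it works on plain text and does not follow any particular instance's hashtag rules.

```python
import re

# Example tag set; the real list of tags and synonyms is still to be decided.
WANTED_TAGS = {"forhire", "for-hire", "vacancy"}

# A '#' followed by word characters, optionally joined by single hyphens.
HASHTAG = re.compile(r"#(\w+(?:-\w+)*)")


def hashtags(text):
    """Return the lowercased hashtags found in plain text (HTML already removed)."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def matches_any_tag(text, wanted=WANTED_TAGS):
    """True if the text carries at least one of the wanted tags."""
    return bool(hashtags(text) & wanted)
```

The same check can be reused for profile bios, so both spiders agree on what counts as tagged.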

Another process picks those data files up and pushes them into a Meilisearch instance, possibly into two separate indexes: candidates (or profiles) and job postings.
This requires that the fields in the data are simple and flat (not deeply nested), contain the text verbatim but with HTML removed, and include a unique ID, e.g. generated by hashing the account handle or by re-using the toot/update ID.
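The sketch below shows one way to produce such flat documents and prepare the upload. It hashes the handle into a hex string (which fits the characters Meilisearch accepts in document IDs), strips HTML with a deliberately naive regular expression, and builds, without sending, a `POST /indexes/{index}/documents` request. The function names, the index name and the `Authorization: Bearer` header (used by recent Meilisearch versions) are assumptions for this example.

```python
import hashlib
import json
import re
import urllib.request
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(fragment):
    """Naively drop tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub(" ", fragment)
    return " ".join(unescape(text).split())


def document_id(handle):
    """Stable ID: SHA-256 hex digest of the normalized handle."""
    return hashlib.sha256(handle.strip().lower().encode("utf-8")).hexdigest()


def to_document(handle, url, bio_html):
    """Flat document with plain-text fields only."""
    return {
        "id": document_id(handle),
        "handle": handle,
        "url": url,
        "bio": strip_html(bio_html),
    }


def build_upsert_request(base_url, index, documents, api_key):
    """Prepare (but do not send) the add-or-replace documents request."""
    return urllib.request.Request(
        f"{base_url}/indexes/{index}/documents",
        data=json.dumps(documents).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: recent Meilisearch
        },
        method="POST",
    )
```

Because the ID is derived from the handle, re-crawling the same profile replaces the existing document instead of creating a duplicate.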

A frontend, in HTML, CSS and some JavaScript, sends search queries to the Meilisearch instance. The results are then presented in clean and simple HTML using JavaScript.

Drawbacks

Why should we not do this? Please consider:

  • Fediverse users or instance admins may be averse to indexing and crawlers.
  • Having a public, open source crawler that rogue players can use to mine data may do the community a disservice. Writing such crawlers, however, is not that hard, so whether or not ours exists should not stop others from writing their own anyway.
  • It should (initially) be separate software from flockingbird itself. Anyone who wants to host such a search instance should be familiar with crawlers, and hosting all this.
  • Using fedisearch instead of writing our own would save time and research. However, fedisearch is not open source, not self-hostable, and not hackable to search only "forhire" profiles and job postings. It also doesn't work: searches give a 500 error.

Alternatives

  • Meilisearch is chosen over Elasticsearch for three reasons: better open source licensing, a far simpler model to host and deploy, and being lightweight and simple.
  • Using the API to fetch structured profiles and toots. The benefit is that this is already structured data. The downside is that it is easy to include data that a user did not expect to be indexable, if an API offers details that the public HTML does not show. An HTML crawler cannot accidentally include such data: it will only ever see what is publicly visible. Another upside of the API is that it removes the need to process JavaScript in the crawler for instances such as Pleroma.
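If the API route were taken, the risk of indexing unexpected data could be reduced with an explicit allow-list of fields, sketched below. The field names follow the Mastodon account entity as commonly documented (`acct`, `display_name`, `note`, `url`, `discoverable`); treating `discoverable` as the "marked indexable" signal is an assumption of this example, and other fediverse software may differ.

```python
# Only these fields are ever copied into the index; everything else is dropped.
ALLOWED_FIELDS = ("acct", "display_name", "note", "url")


def public_subset(account):
    """Return the allow-listed fields of an API account object, or None to skip it.

    Assumption: a truthy "discoverable" flag means the user opted in to indexing.
    """
    if not account.get("discoverable"):
        return None
    return {field: account.get(field, "") for field in ALLOWED_FIELDS}
```

With an allow-list, a new field added to the API in a later release is ignored by default rather than indexed by accident.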

Adoption strategy

We offer a hosted version on work.flockingbird.social. This would be advertised on the fediverse in toots and blogposts.

By mentioning a certain user (e.g. flockingbirdbot or so), toots could be indexed faster. Additional work on such a bot would be needed, which is left out of this RFC. This would make the account and service more visible by being mentioned often in people's timelines.

Something similar can be achieved with proper tags. E.g. not just "#for-hire" but also "#flockingbird" as requirement to be indexed. This should be implemented later, if at all, to allow more indexable content now: people already use '#for-hire' or '#vacancy' but not '#flockingbird'.

How we teach this

A simple static /about page on the service explaining how to:

  • Get your profile included.
  • Get a job-posting listed.
  • Have something removed.

Unresolved questions

  • Removal policy. We only have upserts and no 'deletes' yet. Manually deleting should be possible, but then the account should not be re-added on a new crawl run. Could a blacklist solve this?
  • Maybe a garbage collector that prunes old accounts. Or a checker that runs daily/weekly on all indexed toots and accounts to verify that they are still online and still carry the required tags.
  • Automating the removal through a form on the service: how to deal with proof of identity, e.g. to avoid people requesting removal of other users' content.
  • GDPR requests: we only have public data, but still have to provide data when a user requests this. Can be done manually for now, but should probably be scripted or automated if done more often.
  • How to deal with javascript-heavy instances such as pleroma: the crawler would need to run a (headless) browser to index those, probably.
  • How to deal with changing HTML: Scrapy crawlers rely on the DOM and if that changes (when it changes), the crawler will break.
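To make the removal and pruning questions more concrete, here is one possible shape for a sync step that combines a blacklist (called `blocklist` in the code) with pruning of entries that no longer show up in a crawl. It is only a sketch of the idea, not a decided policy; `plan_sync` and its arguments are hypothetical names.

```python
def plan_sync(crawled, indexed_ids, blocklist):
    """Return (documents to upsert, IDs to delete) for one crawl run.

    crawled:     documents from the latest crawl, each with an "id" key
    indexed_ids: IDs currently present in the search index
    blocklist:   IDs that were removed on request and must never return
    """
    # Never re-add anything that was removed on request.
    upserts = [doc for doc in crawled if doc["id"] not in blocklist]
    seen = {doc["id"] for doc in upserts}
    # Delete what is indexed but was not seen again: gone offline,
    # lost the required tag, or blocklisted since the last run.
    deletes = sorted(set(indexed_ids) - seen)
    return upserts, deletes
```

One caveat with this approach: an instance that is temporarily unreachable during a crawl would have its entries pruned, so a real checker would probably want a grace period before deleting.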

Footnotes and references


This RFC template is modified from the React RFC template