
Rethinking canonify #278

Open
sopoforic opened this issue Mar 6, 2022 · 2 comments
Labels: cannon (Related to URL normalisation)

Comments

sopoforic (Contributor) commented Mar 6, 2022

Right now, the URL normalizer in cannon.py is a kind of monolithic, complicated thing with special cases for certain kinds of URLs. It would make more sense to me to encapsulate all the per-site behavior cleanly, so I took a stab at it:

https://github.com/sopoforic/promnesia/tree/better_specs

This branch is a first attempt at cleaning up cannon.py. It does two things, as a demonstration:

  1. It handles wayback URLs by extracting the original site URL and recursing into canonify--which is essentially what canonify already does, but this way that behavior is pulled out of the main function.
  2. It handles steam store pages, as a demonstration of how to write a 'standard' normalizer. (A rough sketch of both is included right after this list.)
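
Roughly, the shape is something like this -- a simplified, self-contained sketch rather than the exact code from the branch (the Site/supports/normalize names follow the example later in this post, and canonify is a stand-in for the real entry point):

```python
# Simplified sketch -- not the exact code in the branch.
import re


class Site:
    """Base class for per-site URL normalizers."""

    @staticmethod
    def supports(url: str) -> bool:
        raise NotImplementedError

    @staticmethod
    def normalize(url: str) -> str:
        raise NotImplementedError


class Wayback(Site):
    # e.g. https://web.archive.org/web/20220306123456/https://example.com/page
    _RE = re.compile(r'web\.archive\.org/web/\d+\w*/(?P<orig>.+)')

    @staticmethod
    def supports(url: str) -> bool:
        return 'web.archive.org/web/' in url

    @staticmethod
    def normalize(url: str) -> str:
        m = Wayback._RE.search(url)
        # pull out the archived URL and canonify *that* instead
        return canonify(m.group('orig')) if m else url


class SteamStore(Site):
    # e.g. https://store.steampowered.com/app/440/Team_Fortress_2/?curator_clanid=...
    _RE = re.compile(r'store\.steampowered\.com/app/(?P<appid>\d+)')

    @staticmethod
    def supports(url: str) -> bool:
        return 'store.steampowered.com/app/' in url

    @staticmethod
    def normalize(url: str) -> str:
        m = SteamStore._RE.search(url)
        # only the app id matters; title slugs and query params are noise
        return f'store.steampowered.com/app/{m.group("appid")}' if m else url


def canonify(url: str) -> str:
    # stand-in for the real entry point: try site-specific normalizers first,
    # fall back to the generic normalisation otherwise
    for site in Site.__subclasses__():
        if site.supports(url):
            return site.normalize(url)
    return url
```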

It also has a couple of nice features:

First, the normalizers are in a namespace package, and are loaded dynamically, so as to support users making their own private additions. It should also be easier to take PRs for better site support with this kind of structure.

Second, because it uses __subclasses__ to discover normalizers, users can just add something like

```python
from promnesia.normalizers.Site import Site

class Foo(Site):
    @staticmethod
    def supports(url):
        return False
```

to their my.config and it'll get picked up automatically.
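
Under the hood, the loading side can be about this small (again just a sketch, assuming the layout from the snippet above):

```python
# Sketch of the discovery step, assuming promnesia/normalizers/ is a
# namespace package and every normalizer subclasses Site.
import importlib
import pkgutil

import promnesia.normalizers
from promnesia.normalizers.Site import Site


def load_normalizers():
    # importing a module is enough for its Site subclasses to exist,
    # and therefore to show up in Site.__subclasses__()
    for mod in pkgutil.iter_modules(promnesia.normalizers.__path__,
                                    prefix='promnesia.normalizers.'):
        importlib.import_module(mod.name)
    # classes defined in my.config were imported when the config was read,
    # so they get picked up here as well
    return Site.__subclasses__()
```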

This is just an experiment, but it's working fine for me, so I am in fact using it already. If this direction seems fruitful to you, I'll put a little more effort into this and make a PR.

karlicoss (Owner) commented:

Hi @sopoforic, sorry for the late reply, just got around to processing my email! Thanks so much, that looks interesting.

> First, the normalizers are in a namespace package, and are loaded dynamically, so as to support users making their own private additions. It should also be easier to take PRs for better site support with this kind of structure.

Right, that makes sense. On the other hand, keeping each site in a separate file makes refactoring, maintaining, and spotting common patterns much more annoying. I'd say it makes sense to keep most normalisers in the same file, but also have a mechanism so users can load their own files with normalisers as well.
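
Very roughly, that mechanism could be as small as a config option listing extra modules to import (hypothetical names, just to illustrate the idea):

```python
# hypothetical -- not an existing promnesia config option
import importlib

# e.g. in my.config:
# EXTRA_NORMALIZERS = ['my.promnesia_normalizers']

def load_user_normalizers(module_names):
    for name in module_names:
        # importing is enough: any Site subclasses defined in the module
        # become visible via Site.__subclasses__()
        importlib.import_module(name)
```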

> This is just an experiment, but it's working fine for me, so I am in fact using it already. If this direction seems fruitful to you, I'll put a little more effort into this and make a PR.

Yep, very much appreciated! It's hard to tell whether it's a suitable approach this early on (with just a couple of URLs). I'd say, if that's okay, keep it in your branch for now -- as you add more sites it will become clearer whether it's a good approach.

Skimmed the diff -- one comment is that imo it's unavoidable that the URL will first need to be split into protocol/domain/path/query parts; we can only get so far with a regex approach. But again, that will probably become more evident as you add more site-specific normalisers.
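
For concreteness, the standard library already gives you those parts:

```python
from urllib.parse import urlsplit

parts = urlsplit('https://store.steampowered.com/app/440/Team_Fortress_2/?utm_source=x')
# parts.scheme -> 'https'
# parts.netloc -> 'store.steampowered.com'
# parts.path   -> '/app/440/Team_Fortress_2/'
# parts.query  -> 'utm_source=x'
# site-specific logic can then look only at the piece it cares about
```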

By the way, you'd be very welcome in https://memex.zulipchat.com -- there are spaces there for discussing Promnesia in particular, and you might get some input from other people as well (you can log in with GitHub, so you won't need to create a new account!)

karlicoss added the cannon (Related to URL normalisation) label on Mar 17, 2022
sopoforic (Contributor, Author) commented:

> On the other hand, keeping each site in a separate file makes refactoring, maintaining, and spotting common patterns much more annoying. I'd say it makes sense to keep most normalisers in the same file, but also have a mechanism so users can load their own files with normalisers as well.

Sure, I think it makes sense to keep related things in a single file--one file per site wasn't exactly a design requirement, just how it worked out in this little sample. Though I do think that having one giant file with a bunch of unrelated things is a recipe for trouble, so there should be a balance there.

> Skimmed the diff -- one comment is that imo it's unavoidable that the URL will first need to be split into protocol/domain/path/query parts; we can only get so far with a regex approach.

I agree. I typically use urlparse for this kind of thing, but for the steam/wayback examples I was going to need a regex either way, so I just went with regexes throughout. The nice thing about encapsulating per-site behavior like this is that you're free to use whichever tools make the most sense.
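
For instance, the steam normalizer from the sketch above could just as well be written on top of urlsplit instead of a regex (same hypothetical Site base class):

```python
from urllib.parse import urlsplit

from promnesia.normalizers.Site import Site  # hypothetical, as in the sketch above


class SteamStoreSplit(Site):
    @staticmethod
    def supports(url: str) -> bool:
        # assumes a full URL with a scheme, e.g. 'https://store.steampowered.com/...'
        return urlsplit(url).netloc == 'store.steampowered.com'

    @staticmethod
    def normalize(url: str) -> str:
        segments = urlsplit(url).path.split('/')  # ['', 'app', '440', 'Team_Fortress_2', '']
        if len(segments) > 2 and segments[1] == 'app':
            return f'store.steampowered.com/app/{segments[2]}'
        return url
```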

> I'd say, if that's okay, keep it in your branch for now -- as you add more sites it will become clearer whether it's a good approach.

Okay, will do. I don't often encounter sites that both have problems and exist in my notes, but it won't be hard to maintain the branch as things accumulate.
