
Rethinking canonify #278

Open
sopoforic opened this issue Mar 6, 2022 · 2 comments
Labels: cannon (Related to URL normalisation)

Comments

sopoforic (Contributor) commented Mar 6, 2022

Right now, the URL normalizer in cannon.py is a kind of monolithic, complicated thing with special cases for certain kinds of URLs. It would make more sense to me to encapsulate all the per-site behavior cleanly, so I took a stab at it:

https://github.com/sopoforic/promnesia/tree/better_specs

This branch is a first attempt at cleaning up cannon.py. It does two things, as a demonstration:

  1. It handles wayback URLs by extracting the original site URL and recursing into canonify--which is essentially what canonify already does, but this way that behavior is pulled out of the main function.
  2. It handles steam store pages, as a demonstration of how to write a 'standard' normalizer. (A rough sketch of both is included right after this list.)
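
Roughly, the shape is something like this -- a simplified, self-contained sketch rather than the exact code from the branch (the Site/supports/normalize names follow the example later in this post, and canonify is a stand-in for the real entry point):

```python
# Simplified sketch -- not the exact code in the branch.
import re


class Site:
    """Base class for per-site URL normalizers."""

    @staticmethod
    def supports(url: str) -> bool:
        raise NotImplementedError

    @staticmethod
    def normalize(url: str) -> str:
        raise NotImplementedError


class Wayback(Site):
    # e.g. https://web.archive.org/web/20220306123456/https://example.com/page
    _RE = re.compile(r'web\.archive\.org/web/\d+\w*/(?P<orig>.+)')

    @staticmethod
    def supports(url: str) -> bool:
        return 'web.archive.org/web/' in url

    @staticmethod
    def normalize(url: str) -> str:
        m = Wayback._RE.search(url)
        # pull out the archived URL and canonify *that* instead
        return canonify(m.group('orig')) if m else url


class SteamStore(Site):
    # e.g. https://store.steampowered.com/app/440/Team_Fortress_2/?curator_clanid=...
    _RE = re.compile(r'store\.steampowered\.com/app/(?P<appid>\d+)')

    @staticmethod
    def supports(url: str) -> bool:
        return 'store.steampowered.com/app/' in url

    @staticmethod
    def normalize(url: str) -> str:
        m = SteamStore._RE.search(url)
        # only the app id matters; title slugs and query params are noise
        return f'store.steampowered.com/app/{m.group("appid")}' if m else url


def canonify(url: str) -> str:
    # stand-in for the real entry point: try site-specific normalizers first,
    # fall back to the generic normalisation otherwise
    for site in Site.__subclasses__():
        if site.supports(url):
            return site.normalize(url)
    return url
```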

It also has a couple of nice features:

First, the normalizers are in a namespace package, and are loaded dynamically, so as to support users making their own private additions. It should also be easier to take PRs for better site support with this kind of structure.

Second, because it uses __subclasses__ to discover normalizers, users can just add something like

```python
from promnesia.normalizers.Site import Site

class Foo(Site):
    @staticmethod
    def supports(url):
        return False
```

to their my.config and it'll get picked up automatically.
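
Under the hood, the loading side can be about this small (again just a sketch, assuming the layout from the snippet above):

```python
# Sketch of the discovery step, assuming promnesia/normalizers/ is a
# namespace package and every normalizer subclasses Site.
import importlib
import pkgutil

import promnesia.normalizers
from promnesia.normalizers.Site import Site


def load_normalizers():
    # importing a module is enough for its Site subclasses to exist,
    # and therefore to show up in Site.__subclasses__()
    for mod in pkgutil.iter_modules(promnesia.normalizers.__path__,
                                    prefix='promnesia.normalizers.'):
        importlib.import_module(mod.name)
    # classes defined in my.config were imported when the config was read,
    # so they get picked up here as well
    return Site.__subclasses__()
```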

This is just an experiment, but it's working fine for me, so I am in fact using it already. If this direction seems fruitful to you, I'll put a little more effort into this and make a PR.

karlicoss (Owner) commented:

Hi @sopoforic, sorry for the late reply, just got around to processing my email! Thanks so much, that looks interesting.

> First, the normalizers are in a namespace package, and are loaded dynamically, so as to support users making their own private additions. It should also be easier to take PRs for better site support with this kind of structure.

Right, that makes sense. On the other hand, keeping each site in a separate file makes refactoring, maintaining, and spotting common patterns much more annoying. I'd say it makes sense to keep most normalisers in the same file, but also have a mechanism so users can load their own files with normalisers as well.
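
Very roughly, that mechanism could be as small as a config option listing extra modules to import (hypothetical names, just to illustrate the idea):

```python
# hypothetical -- not an existing promnesia config option
import importlib

# e.g. in my.config:
# EXTRA_NORMALIZERS = ['my.promnesia_normalizers']

def load_user_normalizers(module_names):
    for name in module_names:
        # importing is enough: any Site subclasses defined in the module
        # become visible via Site.__subclasses__()
        importlib.import_module(name)
```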

> This is just an experiment, but it's working fine for me, so I am in fact using it already. If this direction seems fruitful to you, I'll put a little more effort into this and make a PR.

Yep, very much appreciated! It's hard to tell whether it's a suitable approach this early on (with just a couple of URLs). I'd say, if that's okay, keep it in your branch for now -- as you add more sites it will become clearer whether it's a good approach.

Skimmed the diff -- one comment is that imo it's unavoidable that the URL will first need to be split into protocol/domain/path/query parts; we can only get so far with a regex approach. But again, that will probably become more evident as you add more site-specific normalisers.
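
For concreteness, the standard library already gives you those parts:

```python
from urllib.parse import urlsplit

parts = urlsplit('https://store.steampowered.com/app/440/Team_Fortress_2/?utm_source=x')
# parts.scheme -> 'https'
# parts.netloc -> 'store.steampowered.com'
# parts.path   -> '/app/440/Team_Fortress_2/'
# parts.query  -> 'utm_source=x'
# site-specific logic can then look only at the piece it cares about
```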

By the way, you'd be very welcome in https://memex.zulipchat.com -- there are spaces there for discussing Promnesia in particular, and you might get some input from other people as well (you can log in with GitHub, so you won't need to create a new account!)

karlicoss added the cannon (Related to URL normalisation) label on Mar 17, 2022
sopoforic (Contributor, Author) commented:

> On the other hand, keeping each site in a separate file makes refactoring, maintaining, and spotting common patterns much more annoying. I'd say it makes sense to keep most normalisers in the same file, but also have a mechanism so users can load their own files with normalisers as well.

Sure, I think it makes sense to keep related things in a single file--one file per site wasn't exactly a design requirement, just how it worked out in this little sample. Though I do think that having one giant file with a bunch of unrelated things is a recipe for trouble, so there should be a balance there.

> Skimmed the diff -- one comment is that imo it's unavoidable that the URL will first need to be split into protocol/domain/path/query parts; we can only get so far with a regex approach.

I agree. I typically use urlparse for this kind of thing, but for the steam/wayback examples I was going to need a regex either way, so I just went with regexes throughout. The nice thing about encapsulating per-site behavior like this is that you're free to use whichever tools make the most sense.
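
For instance, the steam normalizer from the sketch above could just as well be written on top of urlsplit instead of a regex (same hypothetical Site base class):

```python
from urllib.parse import urlsplit

from promnesia.normalizers.Site import Site  # hypothetical, as in the sketch above


class SteamStoreSplit(Site):
    @staticmethod
    def supports(url: str) -> bool:
        # assumes a full URL with a scheme, e.g. 'https://store.steampowered.com/...'
        return urlsplit(url).netloc == 'store.steampowered.com'

    @staticmethod
    def normalize(url: str) -> str:
        segments = urlsplit(url).path.split('/')  # ['', 'app', '440', 'Team_Fortress_2', '']
        if len(segments) > 2 and segments[1] == 'app':
            return f'store.steampowered.com/app/{segments[2]}'
        return url
```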

> I'd say, if that's okay, keep it in your branch for now -- as you add more sites it will become clearer whether it's a good approach.

Okay, will do. I don't often encounter sites that both have problems and exist in my notes, but it won't be hard to maintain the branch as things accumulate.
