Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate codepoints on read and write #268

Closed
wants to merge 4 commits into from
Closed

Conversation

madig
Copy link
Collaborator

@madig madig commented Jul 18, 2022

Closes #263

Copy link
Member

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry about this, slipped by me somehow. Might want to rebase on #274 when that gets in, but overall this makes sense. I have some notes on the implementation inline. :)

src/glyph/parse.rs Outdated Show resolved Hide resolved
@@ -57,8 +60,11 @@ impl Glyph {
start.push_attribute(("format", "2"));
writer.write_event(Event::Start(start)).map_err(GlifWriteError::Xml)?;

let mut seen_codepoints = HashSet::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we just enforce deduplication whenever the user modifies the codepoints? As in we always keep them sorted, and we don't insert dupes (we can use slice::binary_search to find the insertion point, or whether the item already exists)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... see my note above, not sure if it's just easier to do it at read/write time?

@madig madig changed the base branch from master to maintenance November 10, 2022 22:27
Base automatically changed from maintenance to master November 10, 2022 23:20
@madig
Copy link
Collaborator Author

madig commented Nov 11, 2022

Oops, need to rebase.

Copy link
Member

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 🚀

@madig
Copy link
Collaborator Author

madig commented Nov 14, 2022

The serialization code path is ugly though. Know of something better?

@RickyDaMa
Copy link
Collaborator

Question: out of interest, why do we need to preserve order in the codepoints?

Also, it should be trivial to add a borrowing iterator for Codepoints:

fn iter(&self) -> impl Iterator<Item = char> {
    self.codepoints.iter().copied()
}

Also, I don't know if itertools is already pulled in as a transitive dependency, but if so you could use that for it's .unique() filter (link)

@madig
Copy link
Collaborator Author

madig commented Nov 28, 2022

Codepoints are ordered lists in the UFO spec. In defcon and ufoLib2, a glyph has the unicode and unicodes properties, where the former returns the first item of the latter.

I think the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing, and itertools uses a HashSet :)

@RickyDaMa
Copy link
Collaborator

itertools uses a HashSet

I know, and that's no different to what you've done here, just would be cleaner / more readable

the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing

Yeah so you don't deserialize into Codepoints directly, but can transform a list if you need to? Why is that preferable - I guess in cases where the user doesn't do anything involving the unicode(s)?

@madig
Copy link
Collaborator Author

madig commented Nov 28, 2022

I did a list transform before, but @cmyr said he'd prefer a type 😬 I mean, I haven't settled on anything and would gladly take the simplest solution (custom containers are a pain to bring up to the API they're wrapping).

@cmyr
Copy link
Member

cmyr commented Nov 29, 2022

I understand that the order of the codepoints is important insofar as the first codepoint is special, but does the order of subsequent codepoints also matter?

@madig
Copy link
Collaborator Author

madig commented Nov 29, 2022

I don't know.

@anthrotype
Copy link
Collaborator

does the order of subsequent codepoints also matter?

it doesn't. I probably only does if this order is carried over when roundtripping from xml to xml, e.g. if one wishes to reduce diff noise

by the way, as you all well know, the whole notion of "primary unicode codepoint" is flawed because it doesn't exist in opentype cmap tables, which is a map from codepoints to glyphs and not the other way around, as font editors make it seem.

@madig
Copy link
Collaborator Author

madig commented Nov 29, 2022

We could go full-on set then, and sort on serialization, if we didn't care about git noise with other libraries.

@RickyDaMa
Copy link
Collaborator

RickyDaMa commented Nov 29, 2022

Alternatively to reduce noise, just retain a sorted, deduplicated list. Deterministic order, without duplicates. With this approach you could also preserve the first glyph if you wanted to

@cmyr
Copy link
Member

cmyr commented Nov 29, 2022

so my original thinking with the Codepoints struct is that we would separate out the 'primary' codepoint, like:

Codepoints {
    primary: char,
   others: Vec<char>,
}

but it might be nice if internally we do use a single Vec, like in this patch, because then we can have AsRef<[char]> and get all the iterators etc for free.

In this case, I do think that we should preserve the position of the first item in the array, but otherwise ensure that it is always deduplicated. To do this at construction time, we could have something like:

use std::cmp::Ordering;

struct Codepoints(Vec<char>);

impl Codepoints {
    fn new(mut raw: Vec<char>) -> Self {
        if raw.len() <= 1 {
            return Self(raw);
        }
        let first = *raw.first().unwrap();
        
        // custom sorting that always ranks first item lowest
        raw.sort_unstable_by(|a, b| match (a, b) {
            (a, b) if *a == first && *b == first => Ordering::Equal,
            (a, _) if *a == first => Ordering::Less,
            (_, b) if *b == first => Ordering::Greater,
            (a, b) => a.cmp(b),
        });
        raw.dedup();
        Self(raw)
    }

    fn insert(&mut self, codepoint: char) {
        if !self.0.contains(&codepoint) {
            self.0.push(codepoint);
            // don't include first codepoint in sort
            self.0[1..].sort_unstable();
        }
    }
}

And then we would want AsRef and Deref (but not AsMut or DerefMut) impls, and maybe a few more methods? for instance do we want a method for changing the primary codepoint? what other tasks are common?

@anthrotype
Copy link
Collaborator

do we want a method for changing the primary codepoint?

No! We should instead deprecate that in existing APIs like defcon or ufoLib2 and nudge users to ever only use the unicodes list, in the plural. There's no use case for a "primary unicode".

@RickyDaMa
Copy link
Collaborator

Just a note for implementation details: we will still need to use a newtype over the Vec<char> otherwise a user would be able to modify it without upholding its invariants (being ordered except the first element and having no duplicates)

@madig
Copy link
Collaborator Author

madig commented Nov 29, 2022

I can get behind using a set (and sorting on serialization) and declaring that people should use unicodes in Py libs.

@anthrotype
Copy link
Collaborator

anthrotype commented Nov 29, 2022

that's what it really is, an unsorted set of unique codepoints that map to a given glyph, with no intrinsic priority over one another except for the (semantically meaningless) order in which the elements appear in the xml document. E.g. in a unicase font, if both 0x0041 and 0x0061 characters map to the same glyph (named "A" or "a" or "Colin", doesn't really matter), that doesn't make either 0x0041 or 0x0061 more "primary".

@cmyr
Copy link
Member

cmyr commented Nov 29, 2022

Okay, given all this input I vote for just using indexset. This will ensure that we maintain the original order, but also ensure that we do not contain duplicates.

@cmyr cmyr deleted the dedupl-codepoints branch July 25, 2023 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GLIF load/save: Filter duplicate codepoints
4 participants