Deduplicate codepoints on read and write #268

madig · 2022-07-18T17:36:10Z

Closes #263

cmyr

Oops sorry about this, slipped by me somehow. Might want to rebase on #274 when that gets in, but overall this makes sense. I have some notes on the implementation inline. :)

src/glyph/parse.rs

cmyr · 2022-11-10T16:35:53Z

src/glyph/serialize.rs

@@ -57,8 +60,11 @@ impl Glyph {
        start.push_attribute(("format", "2"));
        writer.write_event(Event::Start(start)).map_err(GlifWriteError::Xml)?;

+        let mut seen_codepoints = HashSet::new();


maybe we just enforce deduplication whenever the user modifies the codepoints? As in we always keep them sorted, and we don't insert dupes (we can use slice::binary_search to find the insertion point, or whether the item already exists)

Hm... see my note above, not sure if it's just easier to do it at read/write time?
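
For reference, the slice::binary_search idea could look roughly like the sketch below. It assumes a plain Vec<char> that is already sorted and free of duplicates; the helper name insert_sorted is made up for illustration and is not part of this patch.

```rust
/// Hypothetical helper: insert `cp` into an already sorted, duplicate-free
/// Vec, keeping it sorted. Returns `true` if the value was newly inserted.
fn insert_sorted(codepoints: &mut Vec<char>, cp: char) -> bool {
    match codepoints.binary_search(&cp) {
        // Already present: nothing to do.
        Ok(_) => false,
        // Not present: `pos` is the index that keeps the Vec sorted.
        Err(pos) => {
            codepoints.insert(pos, cp);
            true
        }
    }
}
```

Note that keeping the whole list sorted discards the order in which the codepoints appeared in the file, which is the trade-off discussed further down.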

madig · 2022-11-11T09:53:38Z

Oops, need to rebase.

Closes #263

cmyr

lgtm 🚀

madig · 2022-11-14T18:58:03Z

The serialization code path is ugly though. Know of something better?

RickyDaMa · 2022-11-28T10:37:05Z

Question: out of interest, why do we need to preserve order in the codepoints?

Also, it should be trivial to add a borrowing iterator for Codepoints:

fn iter(&self) -> impl Iterator<Item = char> + '_ {
    self.codepoints.iter().copied()
}

Also, I don't know if itertools is already pulled in as a transitive dependency, but if so you could use that for its .unique() filter.

madig · 2022-11-28T10:54:14Z

Codepoints are ordered lists in the UFO spec. In defcon and ufoLib2, a glyph has the unicode and unicodes properties, where the former returns the first item of the latter.

I think the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing, and itertools uses a HashSet :)

RickyDaMa · 2022-11-28T11:26:11Z

> itertools uses a HashSet

I know, and that's no different to what you've done here, just would be cleaner / more readable
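
For readers following along, a HashSet-based, order-preserving dedup of the kind being compared here can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this PR and not itertools' implementation; the function name is hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drop repeated codepoints in place, keeping the
/// first occurrence of each and the original order of the survivors.
fn dedup_preserving_order(codepoints: &mut Vec<char>) {
    let mut seen = HashSet::with_capacity(codepoints.len());
    // `HashSet::insert` returns false for a value that is already present,
    // so `retain` keeps only the first occurrence of each codepoint.
    codepoints.retain(|cp| seen.insert(*cp));
}
```

The set is allocated on every call, which is the cost the custom type is meant to avoid.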

> the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing

Yeah so you don't deserialize into Codepoints directly, but can transform a list if you need to? Why is that preferable - I guess in cases where the user doesn't do anything involving the unicode(s)?

madig · 2022-11-28T11:56:47Z

I did a list transform before, but @cmyr said he'd prefer a type 😬 I mean, I haven't settled on anything and would gladly take the simplest solution (custom containers are a pain to bring up to the API they're wrapping).

cmyr · 2022-11-29T14:15:59Z

I understand that the order of the codepoints is important insofar as the first codepoint is special, but does the order of subsequent codepoints also matter?

madig · 2022-11-29T14:24:01Z

I don't know.

anthrotype · 2022-11-29T14:50:11Z

> does the order of subsequent codepoints also matter?

It doesn't. It probably only does if this order is carried over when roundtripping from XML to XML, e.g. if one wishes to reduce diff noise.

by the way, as you all well know, the whole notion of "primary unicode codepoint" is flawed because it doesn't exist in opentype cmap tables, which is a map from codepoints to glyphs and not the other way around, as font editors make it seem.

madig · 2022-11-29T15:19:55Z

We could go full-on set then, and sort on serialization, if we didn't care about git noise with other libraries.
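
A rough sketch of the "set, sorted on serialization" option, assuming a BTreeSet<char> as the container. The helper name and the exact output formatting are assumptions modelled on the GLIF unicode element; this is not the crate's serializer.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: render one `<unicode>` element per codepoint.
/// Iterating a BTreeSet yields its items in ascending order, so the output
/// is deterministic no matter how the set was built.
fn unicode_elements(codepoints: &BTreeSet<char>) -> Vec<String> {
    codepoints
        .iter()
        .map(|cp| format!("<unicode hex=\"{:04X}\"/>", *cp as u32))
        .collect()
}
```

Because the order always comes out ascending, a file written by another library with a different order would show up as a diff after a round trip.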

RickyDaMa · 2022-11-29T15:44:17Z

Alternatively, to reduce noise, just retain a sorted, deduplicated list. Deterministic order, without duplicates. With this approach you could also preserve the first codepoint if you wanted to.

cmyr · 2022-11-29T16:09:26Z

so my original thinking with the Codepoints struct is that we would separate out the 'primary' codepoint, like:

struct Codepoints {
    primary: char,
    others: Vec<char>,
}

but it might be nice if internally we do use a single Vec, like in this patch, because then we can have AsRef<[char]> and get all the iterators etc for free.

In this case, I do think that we should preserve the position of the first item in the array, but otherwise ensure that it is always deduplicated. To do this at construction time, we could have something like:

use std::cmp::Ordering;

struct Codepoints(Vec<char>);

impl Codepoints {
    fn new(mut raw: Vec<char>) -> Self {
        if raw.len() <= 1 {
            return Self(raw);
        }
        let first = *raw.first().unwrap();
        
        // custom sorting that always ranks first item lowest
        raw.sort_unstable_by(|a, b| match (a, b) {
            (a, b) if *a == first && *b == first => Ordering::Equal,
            (a, _) if *a == first => Ordering::Less,
            (_, b) if *b == first => Ordering::Greater,
            (a, b) => a.cmp(b),
        });
        raw.dedup();
        Self(raw)
    }

    fn insert(&mut self, codepoint: char) {
        if !self.0.contains(&codepoint) {
            self.0.push(codepoint);
            // don't include first codepoint in sort
            self.0[1..].sort_unstable();
        }
    }
}

And then we would want AsRef and Deref (but not AsMut or DerefMut) impls, and maybe a few more methods? for instance do we want a method for changing the primary codepoint? what other tasks are common?
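
The read-only access described here might be sketched as follows. The struct is repeated only so the example stands alone, and the impls are an illustration of the idea rather than the final API.

```rust
use std::ops::Deref;

/// Stand-in for the newtype above, repeated so the sketch is self-contained.
struct Codepoints(Vec<char>);

// Read-only slice access: callers get `iter()`, `len()`, `first()`,
// indexing and so on through auto-deref.
impl Deref for Codepoints {
    type Target = [char];

    fn deref(&self) -> &[char] {
        &self.0
    }
}

impl AsRef<[char]> for Codepoints {
    fn as_ref(&self) -> &[char] {
        &self.0
    }
}

// Deliberately no DerefMut / AsMut: a `&mut [char]` would let callers
// reorder items or write duplicates behind the type's back.
```

With this in place, code that only reads codepoints can take a &[char] and stay unaware of the wrapper.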

anthrotype · 2022-11-29T16:15:31Z

> do we want a method for changing the primary codepoint?

No! We should instead deprecate that in existing APIs like defcon or ufoLib2 and nudge users to only ever use the unicodes list, in the plural. There's no use case for a "primary unicode".

RickyDaMa · 2022-11-29T16:20:50Z

Just a note for implementation details: we will still need to use a newtype over the Vec<char> otherwise a user would be able to modify it without upholding its invariants (being ordered except the first element and having no duplicates)

madig · 2022-11-29T16:22:38Z

I can get behind using a set (and sorting on serialization) and declaring that people should use unicodes in Py libs.

anthrotype · 2022-11-29T16:27:01Z

that's what it really is, an unsorted set of unique codepoints that map to a given glyph, with no intrinsic priority over one another except for the (semantically meaningless) order in which the elements appear in the xml document. E.g. in a unicase font, if both 0x0041 and 0x0061 characters map to the same glyph (named "A" or "a" or "Colin", doesn't really matter), that doesn't make either 0x0041 or 0x0061 more "primary".

cmyr · 2022-11-29T20:47:39Z

Okay, given all this input I vote for just using IndexSet. This will ensure that we maintain the original order, but also ensure that we do not contain duplicates.

cmyr approved these changes Nov 10, 2022

madig force-pushed the dedupl-codepoints branch from cab1657 to f21fd5f Compare November 10, 2022 22:22

madig changed the base branch from master to maintenance November 10, 2022 22:27

Base automatically changed from maintenance to master November 10, 2022 23:20

Deduplicate codepoints on read and write

1760f8d

Closes #263

madig force-pushed the dedupl-codepoints branch from f21fd5f to 1760f8d Compare November 11, 2022 11:42

madig added 2 commits November 12, 2022 13:47

Only deduplicate codepoints if there's more than one

7348aa7

Simplify code

0f5b5c1

cmyr approved these changes Nov 14, 2022

cmyr mentioned this pull request Nov 24, 2022

Update version to 0.8.0, bump kurbo #278

Merged

WIP: Implement codepoints as a type instead

0ff9762

cmyr mentioned this pull request Nov 30, 2022

Use IndexSet instead of Vec for codepoints #283

Merged

cmyr closed this Dec 5, 2022

cmyr deleted the dedupl-codepoints branch July 25, 2023 19:15
