Deduplicate codepoints on read and write #268
Conversation
Oops sorry about this, slipped by me somehow. Might want to rebase on #274 when that gets in, but overall this makes sense. I have some notes on the implementation inline. :)
src/glyph/serialize.rs (Outdated)

```diff
@@ -57,8 +60,11 @@ impl Glyph {
         start.push_attribute(("format", "2"));
         writer.write_event(Event::Start(start)).map_err(GlifWriteError::Xml)?;

+        let mut seen_codepoints = HashSet::new();
```
Maybe we just enforce deduplication whenever the user modifies the codepoints? As in, we always keep them sorted, and we don't insert dupes (we can use `slice::binary_search` to find the insertion point, or to check whether the item already exists).
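The `slice::binary_search` idea above can be sketched as follows. This is a minimal, hypothetical sketch (the free function `insert_codepoint` is not norad's API): the vec is kept sorted, so a failed search yields exactly the insertion index that preserves order, and a successful search means the value is already present.

```rust
/// Insert `value` into a sorted, deduplicated `Vec<char>`, keeping it sorted.
/// (Hypothetical helper for illustration; not part of norad's API.)
fn insert_codepoint(codepoints: &mut Vec<char>, value: char) {
    match codepoints.binary_search(&value) {
        // Already present: do nothing, so no duplicates can enter.
        Ok(_) => {}
        // Not present: `pos` is the index that keeps the slice sorted.
        Err(pos) => codepoints.insert(pos, value),
    }
}

fn main() {
    let mut cps = vec!['A', 'a'];
    insert_codepoint(&mut cps, 'a'); // duplicate, ignored
    insert_codepoint(&mut cps, 'B'); // inserted in sorted position
    assert_eq!(cps, vec!['A', 'B', 'a']);
}
```

Both lookup and duplicate detection are O(log n) this way, at the cost of an O(n) shift on insertion.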
Hm... see my note above, not sure if it's just easier to do it at read/write time?
Force-pushed from cab1657 to f21fd5f
Oops, need to rebase.
Force-pushed from f21fd5f to 1760f8d
lgtm 🚀
The serialization code path is ugly though. Know of something better?
Question: out of interest, why do we need to preserve order in the codepoints?

Also, it should be trivial to add a borrowing iterator:

```rust
fn iter(&self) -> impl Iterator<Item = char> {
    self.codepoints.iter().copied()
}
```

Also, I don't know if …
Codepoints are ordered lists in the UFO spec. In defcon and ufoLib2, a glyph has the …

I think the point of a custom data struct here is to avoid/reduce memory allocations on reading and writing, and itertools uses a HashSet :)
I know, and that's no different to what you've done here, just would be cleaner / more readable
Yeah, so you don't deserialize into …
I did a list transform before, but @cmyr said he'd prefer a type 😬

I mean, I haven't settled on anything and would gladly take the simplest solution (custom containers are a pain to bring up to the API they're wrapping).
I understand that the order of the codepoints is important insofar as the first codepoint is special, but does the order of subsequent codepoints also matter?
I don't know.
It doesn't. It probably only matters if this order is carried over when roundtripping from XML to XML, e.g. if one wishes to reduce diff noise.

By the way, as you all well know, the whole notion of a "primary unicode codepoint" is flawed, because it doesn't exist in OpenType cmap tables, which are a map from codepoints to glyphs and not the other way around, as font editors make it seem.
We could go full-on set then, and sort on serialization, if we didn't care about git noise with other libraries.
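The "full-on set, sort on serialization" idea can be sketched with the standard library's `BTreeSet`, which deduplicates on insert and iterates in ascending order, so the serialized order is deterministic regardless of insertion order. A minimal sketch only; the variable names are illustrative, not norad's API:

```rust
use std::collections::BTreeSet;

fn main() {
    let mut codepoints = BTreeSet::new();
    // Inserting a duplicate is a no-op, so dedup happens automatically.
    for c in ['a', 'A', 'a', 'b'] {
        codepoints.insert(c);
    }
    // Iteration is ascending by scalar value ('A' < 'a' < 'b'),
    // giving a deterministic order on serialization.
    let serialized: String = codepoints.iter().collect();
    assert_eq!(serialized, "Aab");
}
```

The trade-off noted in the thread applies: any "primary" position from the source file is lost, and the output order may differ from what other UFO libraries write.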
Alternatively, to reduce noise, just retain a sorted, deduplicated list: deterministic order, without duplicates. With this approach you could also preserve the first codepoint if you wanted to.
So my original thinking was something like:

```rust
Codepoints {
    primary: char,
    others: Vec<char>,
}
```

but it might be nice if internally we do use a single … ?

In this case, I do think that we should preserve the position of the first item in the array, but otherwise ensure that it is always deduplicated. To do this at construction time, we could have something like:

```rust
use std::cmp::Ordering;

struct Codepoints(Vec<char>);

impl Codepoints {
    fn new(mut raw: Vec<char>) -> Self {
        if raw.len() <= 1 {
            return Self(raw);
        }
        let first = *raw.first().unwrap();
        // custom sorting that always ranks first item lowest
        raw.sort_unstable_by(|a, b| match (a, b) {
            (a, b) if *a == first && *b == first => Ordering::Equal,
            (a, _) if *a == first => Ordering::Less,
            (_, b) if *b == first => Ordering::Greater,
            (a, b) => a.cmp(b),
        });
        raw.dedup();
        Self(raw)
    }

    fn insert(&mut self, codepoint: char) {
        if !self.0.contains(&codepoint) {
            self.0.push(codepoint);
            // don't include first codepoint in sort
            self.0[1..].sort_unstable();
        }
    }
}
```

And then we would want `AsRef` and `Deref` (but not `AsMut` or `DerefMut`) impls, and maybe a few more methods? For instance, do we want a method for changing the primary codepoint? What other tasks are common?
No! We should instead deprecate that in existing APIs like defcon or ufoLib2 and nudge users to only ever use the unicodes list, in the plural. There's no use case for a "primary unicode".
Just a note for implementation details: we will still need to use a newtype over the …
I can get behind using a set (and sorting on serialization) and declaring that people should use …
That's what it really is: an unsorted set of unique codepoints that map to a given glyph, with no intrinsic priority over one another, except for the (semantically meaningless) order in which the elements appear in the XML document. E.g. in a unicase font, if both the 0x0041 and 0x0061 characters map to the same glyph (named "A" or "a" or "Colin", doesn't really matter), that doesn't make either 0x0041 or 0x0061 more "primary".
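The cmap point above can be illustrated with a toy mapping, assuming a `HashMap` stands in for the cmap table and the glyph name is purely illustrative: the table maps codepoint to glyph, so when two codepoints map to the same glyph, nothing in the data marks either one as "primary".

```rust
use std::collections::HashMap;

fn main() {
    // A cmap-style table maps codepoint -> glyph name, not the other
    // way around, so there is no slot for a "primary" codepoint.
    // (Toy example; glyph name and map type are illustrative only.)
    let mut cmap: HashMap<char, &str> = HashMap::new();
    cmap.insert('\u{0041}', "a"); // 'A'
    cmap.insert('\u{0061}', "a"); // 'a'

    // Both codepoints resolve to the same glyph, symmetrically.
    assert_eq!(cmap[&'A'], cmap[&'a']);
}
```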
Okay, given all this input I vote for just using …
Closes #263