Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading RegionProcessor fails on Windows for regions with non-ASCII UTF8 characters #406

Open
korsbakken opened this issue Oct 3, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@korsbakken
Copy link
Collaborator

When instantiating a RegionProcessor object from RegionProcessor.from_directory, pydantic raises a ValidationError for regions with non-ASCII characters, stating they are not in the RegionCodeList even though they in fact are defined with the exact same characters in the DSD region codelist.

Screenshot of example error message attached (from another user, sorry, I can't test this on Windows myself).
image

I think the problem is in the file open statement in RegionAggregationMapping.from_yaml here:

with open(file, "r") as f:

It does not specify the encoding. Contrast that with CodeList.._parse_codelist_dir here:

with open(yaml_file, "r", encoding="utf-8") as stream:

The latter explicitly specifies utf8 encoding, which creates an inconsistency if the system doesn't use utf8 as the default encoding.

I would suggest to explicitly set utf8 encoding in RegionAggregationMapping.from_yaml (and maybe we should check that it's crystal clear from the documentation that yaml files must be in utf8). Or, if there is some reason not to, instead remove it from elsewhere for consistency.

I can submit a PR, but would be great to get feedback on any constraints, and whether this might touch on broader issues that I'm not aware of.

@korsbakken
Copy link
Collaborator Author

Just an update: There are numerous places in the code where text files are read both with and without explicit encoding. I think the bigger question here is: Do we want to 1) force utf-8 as the encoding for all yaml files, or should we 2) let the user (or codelist/mapping creators) decide freely and live with the consequences?

Option 1 would make the code more robust and behavior more consistent across platforms, but forces Windows users to check that their editor saves yaml files in UTF8 rather than latin1 or whatever their language settings prefer.

Option 2 makes it easier for Windows users to create and edit yaml files with their text editor of choice, but can lead to unpredictable behavior for non-ASCII region names, especially if run on other platforms.

I assume option 1 is the only one that really makes sense, especially considering the chaos that option 2 can cause when mixing yaml files from different sources. In which case we should specify encoding="utf8" in all open statements for text files and text streams.

@korsbakken
Copy link
Collaborator Author

I have submitted a PR that addresses the issue by forcing utf-8 encoding in all open statements that read from or write to yaml files or yaml text streams, #407 .

@korsbakken korsbakken added the bug Something isn't working label Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant