Loading RegionProcessor fails on Windows for regions with non-ASCII UTF8 characters #406

korsbakken · 2024-10-03T09:26:32Z

When instantiating a RegionProcessor object from RegionProcessor.from_directory, pydantic raises a ValidationError for regions with non-ASCII characters, stating they are not in the RegionCodeList even though they in fact are defined with the exact same characters in the DSD region codelist.

Screenshot of example error message attached (from another user, sorry, I can't test this on Windows myself).

I think the problem is in the file open statement in RegionAggregationMapping.from_yaml here:

nomenclature/nomenclature/processor/region.py

Line 271 in 30fb5b1

with open(file, "r") as f:

It does not specify the encoding. Contrast that with CodeList._parse_codelist_dir here:

nomenclature/nomenclature/codelist.py

Line 294 in 30fb5b1

with open(yaml_file, "r", encoding="utf-8") as stream:

The latter explicitly specifies utf8 encoding, which creates an inconsistency if the system doesn't use utf8 as the default encoding.
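To illustrate the kind of mismatch I suspect, here is a minimal sketch that does not use nomenclature at all. It assumes the Windows default encoding is cp1252, which depends on the user's locale, and the region name is just an example I picked:

```python
# Example region name with a non-ASCII character (chosen for illustration only).
region = "Côte d'Ivoire"

# Bytes as they would be stored in a yaml file saved as UTF-8.
raw = region.encode("utf-8")

# Read with an explicit UTF-8 encoding, as in the codelist parser.
as_utf8 = raw.decode("utf-8")

# Read with cp1252, ASSUMED here to be the platform default on Windows.
as_cp1252 = raw.decode("cp1252")

print(as_utf8 == region)    # the name round-trips unchanged
print(as_cp1252 == region)  # the name is mangled, so a lookup by name would fail
print(repr(as_cp1252))
```

If the mapping file is decoded the second way and the codelist the first way, the two names no longer compare equal, which would be consistent with the ValidationError described above.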

I would suggest explicitly setting utf8 encoding in RegionAggregationMapping.from_yaml (and we should probably check that the documentation makes it clear that yaml files must be in utf8). Or, if there is some reason not to, remove it elsewhere instead for consistency.

I can submit a PR, but would be great to get feedback on any constraints, and whether this might touch on broader issues that I'm not aware of.


korsbakken · 2024-10-03T13:17:43Z

Just an update: There are numerous places in the code where text files are read both with and without explicit encoding. I think the bigger question here is: Do we want to 1) force utf-8 as the encoding for all yaml files, or should we 2) let the user (or codelist/mapping creators) decide freely and live with the consequences?

Option 1 would make the code more robust and behavior more consistent across platforms, but forces Windows users to check that their editor saves yaml files in UTF8 rather than latin1 or whatever their language settings prefer.

Option 2 makes it easier for Windows users to create and edit yaml files with their text editor of choice, but can lead to unpredictable behavior for non-ASCII region names, especially if run on other platforms.

I assume option 1 is the only one that really makes sense, especially considering the chaos that option 2 can cause when mixing yaml files from different sources. In which case we should specify encoding="utf8" in all open statements for text files and text streams.
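A rough sketch of what option 1 looks like in practice, using only the standard library. The helper name `read_yaml_text` and the file contents are hypothetical and only for illustration, not taken from the nomenclature code base:

```python
import tempfile
from pathlib import Path


def read_yaml_text(path):
    """Hypothetical helper: read a text file as UTF-8 regardless of platform."""
    # Passing encoding explicitly avoids falling back to the locale default.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Made-up mapping file containing a non-ASCII region name.
    mapping_file = Path(tmp) / "mapping.yaml"
    mapping_file.write_text(
        "common_regions:\n  - Côte d'Ivoire\n", encoding="utf-8"
    )
    text = read_yaml_text(mapping_file)

print(text)
```

Writing and reading both pass the encoding explicitly, so the result should not depend on the system's language settings.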

korsbakken · 2024-10-03T16:20:51Z

I have submitted a PR, #407, that addresses the issue by forcing utf-8 encoding in all open statements that read from or write to yaml files or yaml text streams.

This was referenced Oct 3, 2024

Added explicit encoding to all open statements. ciceroOslo/nomenclature#1

Open

Added explicit encoding to all open statements. #407

Open

korsbakken added the bug Something isn't working label Oct 3, 2024
