When instantiating a RegionProcessor object from RegionProcessor.from_directory, pydantic raises a ValidationError for regions with non-ASCII characters, stating that they are not in the RegionCodeList even though they are in fact defined, with the exact same characters, in the DSD region codelist.
Screenshot of example error message attached (from another user, sorry, I can't test this on Windows myself).
I think the problem is the file open statement in RegionAggregationMapping.from_yaml, here: nomenclature/processor/region.py, line 271 in 30fb5b1. It does not specify the encoding. Contrast that with CodeList._parse_codelist_dir, here: nomenclature/codelist.py, line 294 in 30fb5b1. The latter explicitly specifies utf8 encoding, which creates an inconsistency whenever the system does not use utf8 as its default encoding.
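A minimal sketch (not the actual nomenclature code) of how this mismatch surfaces: a yaml file written as UTF-8 but read back with a locale-dependent default such as cp1252 (common on Windows) silently garbles non-ASCII region names, so a later lookup against the codelist fails:

```python
import os
import tempfile

# Write a region name with a non-ASCII character as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "regions.yaml")
with open(path, "w", encoding="utf-8") as f:
    f.write("- Öresund\n")

# Reading with an explicit single-byte encoding (standing in for a
# Windows locale default such as cp1252) silently garbles the name ...
with open(path, encoding="cp1252") as f:
    garbled = f.read()

# ... while an explicit utf-8 read round-trips correctly.
with open(path, encoding="utf-8") as f:
    correct = f.read()

print(repr(garbled))  # mojibake, not "- Öresund"
print(repr(correct))
```

The garbled string no longer matches the region name in the codelist, which is exactly the "not in the RegionCodeList" error reported above.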
I would suggest explicitly setting utf8 encoding in RegionAggregationMapping.from_yaml as well (and maybe we should make it crystal clear in the documentation that yaml files must be utf8-encoded). Or, if there is some reason not to, remove it elsewhere instead, for consistency.
I can submit a PR, but it would be great to get feedback on any constraints, and on whether this touches broader issues that I'm not aware of.
Just an update: There are numerous places in the code where text files are read both with and without explicit encoding. I think the bigger question here is: Do we want to 1) force utf-8 as the encoding for all yaml files, or should we 2) let the user (or codelist/mapping creators) decide freely and live with the consequences?
Option 1 would make the code more robust and behavior more consistent across platforms, but forces Windows users to check that their editor saves yaml files in UTF8 rather than latin1 or whatever their language settings prefer.
Option 2 makes it easier for Windows users to create and edit yaml files with their text editor of choice, but can lead to unpredictable behavior for non-ASCII region names, especially if run on other platforms.
I assume option 1 is the only one that really makes sense, especially considering the chaos that option 2 can cause when mixing yaml files from different sources. In that case, we should specify encoding="utf8" in all open statements for text files and text streams.
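In that spirit, a minimal sketch of the pattern (the helper names here are hypothetical, not the actual nomenclature API): every open call that touches yaml text passes encoding="utf-8" explicitly, on both the read and the write side, so behavior no longer depends on the platform's locale default.

```python
import tempfile
from pathlib import Path

def read_yaml_text(path: Path) -> str:
    # Option 1: always decode yaml files as UTF-8, regardless of the
    # platform's locale-dependent default encoding.
    with open(path, encoding="utf-8") as f:
        return f.read()

def write_yaml_text(path: Path, text: str) -> None:
    # Symmetric write side, so files created on Windows round-trip
    # correctly when read on Linux/macOS (e.g. in CI).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Round-trip a non-ASCII region name.
path = Path(tempfile.mkdtemp()) / "regions.yaml"
write_yaml_text(path, "- Côte d'Ivoire\n")
roundtrip = read_yaml_text(path)
```

The same argument applies when wrapping byte streams in text streams (io.TextIOWrapper also accepts an encoding argument).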
I have submitted a PR that addresses the issue by forcing utf-8 encoding in all open statements that read from or write to yaml files or yaml text streams: #407.