New version for testing #3

Closed
duncandewhurst opened this issue Aug 9, 2023 · 38 comments

@duncandewhurst
Collaborator

I've pushed an updated version of the template to this repository and I've updated the Google Docs copy too.

@matamadio please use this version for testing. The issues you flagged in the old version should be resolved now. Let me know if you spot any other problems.

In addition to fixing those issues, I also updated the formatting and implemented data validation for id fields so you need only enter each identifier once, in its 'parent' worksheet (e.g. datasets). When referring to an identifier from another worksheet (e.g. resources), you can select the identifier from a drop-down list.
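
The drop-down validation described above amounts to a referential-integrity check between worksheets. A minimal sketch of that check in Python, assuming each worksheet has already been read into a list of dicts; the column name `dataset_id` and the function name are hypothetical and are not taken from the template:

```python
def find_dangling_references(parent_rows, child_rows, parent_key="id", child_key="dataset_id"):
    """Return child rows whose reference does not match any parent identifier."""
    known_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row.get(child_key) not in known_ids]


# Hypothetical worksheet contents, one dict per spreadsheet row.
datasets = [{"id": "1"}, {"id": "2"}]
resources = [
    {"id": "1", "dataset_id": "1"},
    {"id": "2", "dataset_id": "3"},  # refers to a dataset that was never entered
]
dangling = find_dangling_references(datasets, resources)
```

In the spreadsheet itself the drop-down list prevents such rows from being entered; a check like this would only be useful for data prepared outside the template.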

The updated version of the template is based on the schema from GFDRR/rdl-standard#181, which includes a few fixes for issues that I noticed whilst working on the template.

@matamadio
Collaborator

matamadio commented Aug 9, 2023

Thanks, I'm testing it.

One preliminary question: how are the enums loaded now? I don't see the enum tab anymore, but can't figure out how the magic works.

In addition to fixing those issues, I also updated the formatting and implemented data validation for id fields so you need only enter each identifier once, in its 'parent' worksheet (e.g. datasets). When referring to an identifier from another worksheet (e.g. resources), you can select the identifier from a drop-down list.

This is nice; yet looks like the parent id field gets duplicated. I'm removing it manually.
[image]

@matamadio
Collaborator

Some more comments:

primary id description

  • suggesting to use unique code such as project id rather than a URL

spatial/gazetteerEntries/0/scheme

  • duplicated ISO a2 in codelist, should include ISO a2 and ISO a3

resources/0/coordinate_system

  • should it include open epsg codelist?

hazard/event_sets/0/frequency_distribution

  • add GEV (Generalised Extreme Values) to codelist
links/0/rel - Link relation type
The relationship with this related resource, from the IANA Link Relationship Types.  The 'describedby' code must only be used in the first item in the `links` array to link to the schema that describes the structure of the data.

@matamadio
Collaborator

matamadio commented Aug 9, 2023

Example for HAZARD: Fathom country dataset (Thailand)

Hazard template on Gdocs

Note:

  • non-hazard tabs and fields have been dropped
  • not all tabs and attributes are filled; required + some others
  • license is the only required field left unfilled, as "commercial" attribution is not listed in the codelist
  • not quite sure if I got the id branching right:
    • I have one dataset made of 3 resources (id 1,2,3).
    • Each of these resources is described by a corresponding hazard_events_set (id 1,2,3).
      • They are identified by two types of process:
        • hazard_events_set 1 and 2 are fluvial flood, 3 is pluvial.
        • then hazard/event_sets/0/hazards/0/id is (1;1;2)
    • Within each hazard_events_set, we have 10 different probabilistic scenarios each (hazard/event_sets/0/events/0/id 1 to 10)
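
The column paths above can be read as addresses into the nested JSON. A rough sketch of that mapping in Python, assuming the numeric path segment is a literal list index; the actual conversion is done by Flatten Tool, which builds the arrays from spreadsheet rows, so `set_path` here is a hypothetical helper for illustration only:

```python
def set_path(document, path, value):
    """Store value in a nested dict/list structure at a slash-separated path."""
    keys = path.split("/")
    node = document
    for key, next_key in zip(keys, keys[1:]):
        # A numeric next segment means the child container is a list.
        default = [] if next_key.isdigit() else {}
        if isinstance(node, list):
            index = int(key)
            while len(node) <= index:
                node.append(None)
            if node[index] is None:
                node[index] = default
            node = node[index]
        else:
            node = node.setdefault(key, default)
    last = keys[-1]
    if isinstance(node, list):
        index = int(last)
        while len(node) <= index:
            node.append(None)
        node[index] = value
    else:
        node[last] = value
    return document


record = {}
# Event sets 1 and 2 are fluvial flood (hazard id 1); event set 3 is pluvial (hazard id 2).
for position, hazard_id in enumerate(["1", "1", "2"]):
    set_path(record, f"hazard/event_sets/{position}/id", str(position + 1))
    set_path(record, f"hazard/event_sets/{position}/hazards/0/id", hazard_id)
    # Ten probabilistic scenarios per event set, with ids 1 to 10.
    for event_position in range(10):
        set_path(
            record,
            f"hazard/event_sets/{position}/events/{event_position}/id",
            str(event_position + 1),
        )
```

The result has three event sets, each with one hazard and ten events, which is the branching described in the list above.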

@matamadio
Collaborator

matamadio commented Aug 9, 2023

Example for EXPOSURE: GHS builtup dataset (Thailand)

Exposure template on Gdocs

Note:

  • non-exposure tabs and fields have been dropped
  • for IANA codelist indicating related publication, I used
    "describedby": Refers to a resource providing information about the link's context.
  • exposure/cost/0/unit should not be Required
    • e.g. building layers having just area values, not an economic attribute

@matamadio matamadio self-assigned this Aug 9, 2023
This was referenced Aug 9, 2023
@duncandewhurst
Collaborator Author

Thanks for the feedback! I'll respond to the points that are specific to the template here.

@odscjen please can you:

  1. Create issues on https://github.com/GFDRR/rdl-standard for the points that are related to the schema.
  2. Convert and validate the examples provided by Mat (per Conversion to JSON and validation #4) and share your feedback

One preliminary question: how are the enums loaded now? I don't see the enum tab anymore, but can't figure out how the magic works.

The enum tab is there but hidden to prevent accidental edits. You should be able to unhide it if you need to. Let me know if you run into problems.

This is nice; yet looks like the parent id field gets duplicated. I'm removing it manually.

Ah, thanks for flagging. I will fix it in the next iteration. I've created an issue: #6

@duncandewhurst
Collaborator Author

  • for IANA codelist indicating related publication, I used
    "describedby": Refers to a resource providing information about the link's context.

'describedby' is used to link to the schema that describes the RDLS metadata so it probably isn't the right code here. What is the relationship between the related publication and the dataset? Does it fit the semantics of referencedBy or sources?

@matamadio
Collaborator

matamadio commented Aug 10, 2023

'describedby' is used to link to the schema that describes the RDLS metadata so it probably isn't the right code here. What is the relationship between the related publication and the dataset? Does it fit the semantics of referencedBy or sources?

It is a publication describing the methodology to produce the dataset.
sources is used to point to the main model or dataset from which this specific dataset was derived.
I used referenced_by instead. File has been updated, same link.

@odscjen
Collaborator

odscjen commented Aug 10, 2023

@matamadio I'm in the process of checking and validating your examples, could you give me comment access to the spreadsheets (at the moment I've only got view access) as I think it might be easiest if I can leave the feedback comments on the relevant columns in the spreadsheets rather than attempt to write them out in this issue?

@matamadio
Collaborator

@odscjen you should have edit access to the folder.

@odscjen
Collaborator

odscjen commented Aug 10, 2023

great, thanks :)

@odscjen
Collaborator

odscjen commented Aug 10, 2023

@duncandewhurst an odd thing in the flatten-tool converted version of the Hazard example

referenced_by.author_names is an array within an array, as is spatial.bbox

"referenced_by": [
        {
            "id": "1",
            "name": "A high-resolution global flood hazard model",
            "authorNames": [
                [
                    "Christopher C. Sampson",
                    " Andrew M. Smith",
                    " Paul D. Bates",
                    " Jeffrey C. Neal",
                    " Lorenzo Alfieri",
                    " Jim E. Freer"
                ]
            ],
            "datePublished": "2015-08-18",
            "url": "https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2015WR016954",
            "doi": "10.1002/2015WR016954"
        }
    ],

and

"bbox": [
            [
                96.72037,
                5.407344,
                105.72785,
                20.718323
            ]
        ],

but it looks right in the spreadsheet?

@odscjen
Collaborator

odscjen commented Aug 10, 2023

@matamadio re comment above

not quite sure if I got the id branching right:

It looks like you did get the branching right :)

@duncandewhurst
Collaborator Author

@duncandewhurst an odd thing in the flatten-tool converted version of the Hazard example

Yep, that's due to the Flatten Tool bug I mentioned in #4. I've opened an issue on Flatten Tool to get it fixed. Looks like the data has been entered correctly in the spreadsheet, albeit with extra whitespace before each author's name.

@odscjen
Collaborator

odscjen commented Aug 11, 2023

ah, cool, I ignored all the other errors reported due to that mentioned bug but hadn't picked up that this was part of the same thing.

@duncandewhurst
Collaborator Author

@duncandewhurst an odd thing in the flatten-tool converted version of the Hazard example

Yep, that's due to the Flatten Tool bug I mentioned in #4. I've opened an issue on Flatten Tool to get it fixed. Looks like the data has been entered correctly in the spreadsheet, albeit with extra whitespace before each author's name.

I was mistaken, it's not a bug. The correct syntax is a semicolon-separated list, rather than a comma-separated list.

@matamadio I've just pushed a new version of the template (XLSX, GSheets) with the following additions and improvements:

  • Add a basic README sheet
  • Add codelist hyperlinks
  • Add data input guidance
  • Update error messages for array fields
  • Auto-populate and hide the links sheet
  • Permit additional values for open codelists, add error messages
  • Remove duplicate id fields

Please can you test it. I think this is all we can add in terms of features, but we can fix any bugs and improve the documentation.

@odscjen
Collaborator

odscjen commented Aug 14, 2023

EDIT: see below comment for the fix to this confusion.

entering as a semicolon-separated list looks like this in the flatten-tool output (from Mat's Exposure example last week):

"referenced_by": [
                {
                    "id": "1",
                    "name": " GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030)",
                    "authorNames": [
                        [
                            "Pesaresi",
                            " Martino"
                        ],
                        [
                            " Politis",
                            " Panagiotis"
                        ]
                    ],
                    "datePublished": "2023",
                    "url": "http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea",
                    "doi": "10.2905/9F06F36F-4B11-47EC-ABB0-4F8B7B1D72EA"
                }
            ]

so it's still creating arrays within arrays, which can't be right?

@odscjen
Collaborator

odscjen commented Aug 14, 2023

Realised what's wrong above: it's the combination of commas and semicolons. It's best then to give the author names without a comma, so "Jen Harris" or "J Harris" or "J. Harris" rather than "Harris, Jen".
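
The splitting behaviour can be reproduced with a short Python sketch. It is inferred only from the outputs quoted in this thread and is not Flatten Tool's actual code; `split_array_cell` is a hypothetical name:

```python
def split_array_cell(cell):
    """Split a spreadsheet cell the way the converted output in this thread suggests."""
    if "," in cell:
        # Any comma in the cell turns the result into an array of arrays:
        # semicolons separate the outer items, commas the inner ones.
        return [part.split(",") for part in cell.split(";")]
    # Without commas, semicolons produce a flat array.
    return cell.split(";")


nested = split_array_cell("Pesaresi, Martino; Politis, Panagiotis")
flat = split_array_cell("Jen Harris;J. Harris")
```

Under this reading, names written as "Surname, Forename" always end up nested, while names without commas give the flat list that `authorNames` expects. Whitespace around the separators is kept, which matches the leading spaces seen in the converted examples.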

@matamadio
Collaborator

matamadio commented Aug 14, 2023

Realised what's wrong above: it's the combination of commas and semicolons. It's best then to give the author names without a comma, so "Jen Harris" or "J Harris" or "J. Harris" rather than "Harris, Jen".

We'll need to be extremely clear on the instructions, or possibly have this auto-validated and corrected (i.e. commas and semicolons are auto-removed from the field?).

@duncandewhurst
Collaborator Author

I've asked the devs what the correct syntax is in OpenDataServices/flatten-tool#427. If it isn't possible to include commas or semi-colons within the values, we can update the input guidance for array fields and add a data validation warning if the cell's value includes commas or semi-colons.

@matamadio
Collaborator

Latest split templates and examples for hazard and exposure (real data).

https://drive.google.com/drive/folders/1V33k5YmYjcvjFnYpx7chOx-PeSivYRWm?usp=sharing

@duncandewhurst
Collaborator Author

@matamadio, a couple of questions on the Global Human Settlement layer (Thailand) example:

  1. What is the reason for listing the Thailand data as a dataset, rather than just listing the whole GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) dataset in line with the listing in the EU Joint Research Centre Data Catalogue?
  2. Why is the exposure_cost sheet populated? If I understood correctly, the dataset doesn't describe the cost of buildings, it only describes their area.

@matamadio
Collaborator

matamadio commented Aug 16, 2023

  1. What is the reason for listing the Thailand data as a dataset, rather than just listing the whole GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) dataset in line with the listing in the EU Joint Research Centre Data Catalogue?

The reason is that the global dataset is made of a large number of tiles; here, those covering the country were downloaded and merged together. In other cases there could be more processing compared to the source (e.g. change of resolution, value classification, clipping, others). Thus the download will point to the derived dataset, and not the original source.

  2. Why is the exposure_cost sheet populated? If I understood correctly, the dataset doesn't describe the cost of buildings, it only describes their area.

At the moment, the "cost" of exposure is limited to monetary currencies.
But the value represented by an exposure dataset might be intangible, or just a proxy to later calculate the economic value; in my experience it is actually pretty uncommon to use an exposure dataset that already comes in economic terms.
In this specific case it is a value of built-up area over total pixel area. In other cases, the value could be building height, or volume, population density or others. A range of different metrics could be represented by exposure, in order to measure the cost.

I see two options:

  1. Put cost field as optional, use it only if actually a currency value. Don't specify exposure metric.
  2. Add exposure metric field as open codelist

@duncandewhurst
Collaborator Author

  1. What is the reason for listing the Thailand data as a dataset, rather than just listing the whole GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) dataset in line with the listing in the EU Joint Research Centre Data Catalogue?

The reason is that the global dataset is made of a large number of tiles; here, those covering the country were downloaded and merged together. In other cases there could be more processing compared to the source (e.g. change of resolution, value classification, clipping, others). Thus the download will point to the derived dataset, and not the original source.

Ah, I see. So the actual processed/merged dataset is not linked in the example? If so, the value of resources/0/url should be left blank. Currently, it links to https://ghsl.jrc.ec.europa.eu/download.php?ds=bu, which is the download page for the whole dataset rather than the Thailand subset. The global dataset should be linked in sources.

  2. Why is the exposure_cost sheet populated? If I understood correctly, the dataset doesn't describe the cost of buildings, it only describes their area.

At the moment, the "cost" of exposure is limited to monetary currencies. But the value represented by an exposure dataset might be intangible, or just a proxy to later calculate the economic value; in my experience it is actually pretty uncommon to use an exposure dataset that already comes in economic terms. In this specific case it is a value of built-up area over total pixel area. In other cases, the value could be building height, or volume, population density or others. A range of different metrics could be represented by exposure, in order to measure the cost.

I see two options:

1. Put `cost` field as optional, use it only if actually a currency value. Don't specify exposure metric.

2. Add exposure `metric` field as open codelist

I've created an issue on the standard repo, let's follow up there: GFDRR/rdl-standard#194

@odscjen
Collaborator

odscjen commented Aug 17, 2023

If so, the value of resources/0/url should be left blank.

Unfortunately leaving this blank will cause the data to be reported as invalid, resource.url is a required field. We don't want to provide invalid examples, but equally we don't want to be providing semantically incorrect examples either. I see that we have a couple of options:

  1. we create dummy URLs that don't point to anywhere, e.g. http://example.com/GHS_THA. This option may require some additional explanation in the documentation as to what exactly resource.url should be pointing to.
  2. we only create examples where there is already existing data with a URL for the exact resource being described.
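
To make the first option concrete, here is a small Python sketch of the kind of check that a blank url would fail. It only covers resource.url, which is the one required field confirmed in this thread; real validation should be run against the RDLS schema, and the function name and example data are hypothetical:

```python
def resources_missing_url(dataset):
    """Return the ids of resources that have no non-empty url."""
    return [
        resource.get("id")
        for resource in dataset.get("resources", [])
        if not str(resource.get("url") or "").strip()
    ]


example = {
    "resources": [
        {"id": "1", "url": "http://example.com/GHS_THA"},  # placeholder URL, clearly a dummy
        {"id": "2", "url": ""},  # left blank in the spreadsheet
        {"id": "3"},  # url column missing altogether
    ]
}
missing = resources_missing_url(example)
```

A placeholder on example.com passes the check while still being recognisable as a dummy link.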

@odscjen
Collaborator

odscjen commented Aug 17, 2023

@duncandewhurst the auto-population of the links sheet might not be right. When I converted the FTH-snip example it gave

"links": [
    {
      "href": "https://raw.githubusercontent.com/GFDRR/rdl-standard/0__2__0/schema/rdls_schema.json",
      "rel": "describedby"
    }
  ]

but describedby should be describedBy
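
A check along these lines could catch the mismatch automatically. This is only a sketch: it assumes, as stated above, that the expected code is `describedBy`, and the spelling should be confirmed against the codelist in the schema. The function name is hypothetical, and the rule that the code belongs only on the first link comes from the field description quoted earlier in this thread:

```python
def check_schema_link(metadata, expected_rel="describedBy"):
    """Return a list of problems with the schema link in `links`."""
    problems = []
    links = metadata.get("links", [])
    # The first link is expected to point to the schema; the comparison is case-sensitive.
    if not links or links[0].get("rel") != expected_rel:
        problems.append(f"first link must have rel '{expected_rel}'")
    # The schema link code must not be reused by later links.
    for position, link in enumerate(links[1:], start=1):
        if link.get("rel") == expected_rel:
            problems.append(f"links/{position} must not use rel '{expected_rel}'")
    return problems


converted = {
    "links": [
        {
            "href": "https://raw.githubusercontent.com/GFDRR/rdl-standard/0__2__0/schema/rdls_schema.json",
            "rel": "describedby",
        }
    ]
}
problems = check_schema_link(converted)
```

Passing `expected_rel="describedby"` instead makes the same record pass, so the outcome depends entirely on which spelling the codelist defines.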

@matamadio
Collaborator

Unfortunately leaving this blank will cause the data to be reported as invalid, resource.url is a required field. We don't want to provide invalid examples, but equally we don't want to be providing semantically incorrect examples either. I see that we have a couple of options:

1. we create dummy URLs that don't point to anywhere, e.g. http://example.com/GHS_THA. This option may require some additional explanation in the documentation as to what exactly `resource.url` should be pointing to.

2. we only create examples where there is already existing data with a URL for the exact resource being described.

I had the exposure example file uploaded on sharepoint, though the link is accessible only for WB at the moment.

THA_GHS-BU2020

@odscjen
Collaborator

odscjen commented Aug 17, 2023

I had the exposure example file uploaded on sharepoint, though the link is accessible only for WB at the moment.

Ah, ok. That will be fine for the actual data being uploaded to the data catalogue, but I think that for the case of examples it'd be best not to use non-open data. As in this case it's obviously useful to create the RDLS record for this data as it will exist, I'd recommend using a dummy url for the example. That way when you do upload it to the data catalogue you've already got the RDLS file and you'll just need to update the url to the real Sharepoint url

EDIT: I've just seen you've created GFDRR/rdl-standard#195 so we can wait for the resolution of this and hopefully there will be real URLs to use soon :)

@odscjen
Collaborator

odscjen commented Aug 17, 2023

@matamadio regarding the rdls_hzd-FTH-snip example. Converting this to JSON and validating against the schema returns one major error:

  • There are no resources provided, which means the example is invalid as resources is a required object. Consequently there is no link to the actual datasets, as this would go into resources.

The way that resource has been written it defines a single resource, i.e. "An individual file or other form of data." But in this case you are describing a larger product which contains multiple resources that you cannot link to as they are behind a paywall so it's not a dataset that you're going to be able to publish. For this reason I don't think this is an appropriate example.

(I'll create an additional comment in one of the documentation issues (GFDRR/rdl-standard#149) around how much detail we may want to include based on how the dataset being described will be available, e.g. in an open catalogue or behind some sort of access restriction.)

The appropriate example to use if you want to include the FATHOM global flood map is the one you have in rdls_hzd-FTH-THA where it has been used to create this specific datasource.

The rdls_hzd-FTH-THA however suffers from the same issues as @duncandewhurst describes for the Global Human Settlement layer (Thailand) example, namely that the dataset fields appear to be describing the source dataset rather than the dataset that the resources cover. The FATHOM map in this case should be in sources and the fields in the dataset sheet should be describing this Thailand specific version that has been created out of the FATHOM map rather than the FATHOM map in general.

Some additional errors in rdls_hzd-FTH-THA:

  • the values in bbox are separated by commas not semi-colons (I've corrected this one in the spreadsheet)
  • no resource.url values
  • no event.hazard values (but this is already being discussed in #188 and #192)

@matamadio
Collaborator

Ah, ok. That will be fine for the actual data being uploaded to the data catalogue, but I think that for the case of examples it'd be best not to use non-open data. As in this case it's obviously useful to create the RDLS record for this data as it will exist, I'd recommend using a dummy url for the example. That way when you do upload it to the data catalogue you've already got the RDLS file and you'll just need to update the url to the real Sharepoint url

EDIT: I've just seen you've created GFDRR/rdl-standard#195 so we can wait for the resolution of this and hopefully there will be real URLs to use soon :)

Could we host the demo data here on the GH?

@matamadio
Collaborator

matamadio commented Aug 17, 2023

The idea for the "snippet" file was to provide a quick json example of just key metadata in reference to the example figure, without actual data download. For schema reference / Example tabs only, replacing the current ones.

The rdls_hzd-FTH-THA however suffers from the same issues as @duncandewhurst describes for the Global Human Settlement layer (Thailand) example, namely that the dataset fields appear to be describing the source dataset rather than the dataset that the resources cover. The FATHOM map in this case should be in sources and the fields in the dataset sheet should be describing this Thailand specific version that has been created out of the FATHOM map rather than the FATHOM map in general.

I thought the only difference in metadata would be the subset coverage: Thailand instead of global. But yes, the resource should point to the download in our folder instead of source, but this is WIP.
In the case of Fathom data, there is no change or transformation either from source; it already comes as country datasets.

@odscjen
Collaborator

odscjen commented Aug 17, 2023

I thought the only difference in metadata would be the subset coverage: Thailand instead of global.

Yes, in general this would be true and given your other clarification about there being no change from the source I think actually the only field that doesn't quite make sense is details: you've put "The FATHOM flood-hazard model (previously known as SSBN), is a global gridded dataset of flood hazard produced at the global scale." but not mentioned in this field that the actual resource isn't global, i.e. you've described the FATHOM flood-hazard model rather than this specific subset of it. The fix is as simple as rewording this slightly to add a bit at the beginning:

"The Thailand country level dataset taken from the FATHOM flood-hazard model (previously known as SSBN). The FATHOM flood-hazard model is a global gridded dataset of flood hazard produced at the global scale. ..."

This just makes it a bit clearer that this isn't the entire FATHOM map.

In the case of Fathom data, there is no change or transformation either from source; it already comes as country datasets.

Ah, okay so in that case you're right to put it in as the dataset fields rather than the source, but is this a dataset that can actually be made available? It looks like FATHOM is only available for a fee?

@odscjen
Collaborator

odscjen commented Aug 17, 2023

The idea for the "snippet" file was to provide a quick json example of just key metadata in reference to the example figure, without actual data download.

I can see the logic in this but my worry is that it makes it seem as though you can use RDLS without providing the data, which is not something you'd want to imply.

@matamadio
Collaborator

Agree on the need to specify a subset in the details.

Ah, okay so in that case you're right to put it in as the dataset fields rather than the source, but is this a dataset that can actually be made available? It looks like FATHOM is only available for a fee?

Yes it is a commercial product, I chose it because it is the most frequently used for flood analysis across the bank, so it's relevant to show.
But yes, I can build a similar example with a similar open layer.

I can see the logic in this but my worry is that it makes it seem as though you can use RDLS without providing the data, which is not something you'd want to imply.

Does it mean that examples will also need to have the actual resource download? Could it be just a dummy (empty resource links)?

@odscjen
Collaborator

odscjen commented Aug 17, 2023

Does it mean that GFDRR/rdl-standard#135 will also need to have the actual resource download? Could it be just a dummy (empty resource links)?

A dummy link would be fine but I'd recommend making sure it looks like a dummy link, e.g. use http://example.com/YOUR_EXAMPLE

@matamadio
Collaborator

matamadio commented Aug 22, 2023

The readme is ok, we don't want to replicate the information found on docs;
but it should give reference to RDL project at the beginning with links:

A template for entering RDLS metadata in spreadsheet format.
More information on the Risk Data Library Schema: https://riskdatalibrary.org/
Guidelines for filling this metadata spreadsheet and using the conversion tool: https://rdl-standard.readthedocs.io/

Please put the field descriptions in multiple rows instead of one cell for easier reading, like in rdls_hzd_AQD_docsample.
[image]

@duncandewhurst
Collaborator Author

I've asked the devs what the correct syntax is in OpenDataServices/flatten-tool#427. If it isn't possible to include commas or semi-colons within the values, we can update the input guidance for array fields and add a data validation warning if the cell's value includes commas or semi-colons.

It isn't possible to include commas or semicolons within array values so I've updated the input guidance for array fields.
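
For data prepared outside the spreadsheet, the same rule could be checked before conversion. A minimal Python sketch, assuming the individual array values are already available as a list of strings; the function name and the sample names are hypothetical:

```python
FORBIDDEN_CHARACTERS = (",", ";")


def invalid_array_items(items):
    """Return the items that contain a comma or a semicolon."""
    return [
        item
        for item in items
        if any(character in item for character in FORBIDDEN_CHARACTERS)
    ]


authors = ["Jen Harris", "Harris, Jen", "J. Harris"]
flagged = invalid_array_items(authors)
```

Flagged values need rewording by hand (for example "Jen Harris" instead of "Harris, Jen"), since stripping the characters automatically could change the meaning of a value.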

@duncandewhurst
Collaborator Author

The readme is ok, we don't want to replicate the information found on docs; but it should give reference to RDL project at the beginning with links:

A template for entering RDLS metadata in spreadsheet format.
More information on the Risk Data Library Schema: https://riskdatalibrary.org/
Guidelines for filling this metadata spreadsheet and using the conversion tool: https://rdl-standard.readthedocs.io/

Please put the field descriptions in multiple rows instead of one cell for easier reading, like in rdls_hzd_AQD_docsample.

I've made those updates. At the same time, I've moved the readme content from the spreadsheet template to the README.md file in this repository and linked to it from the readme sheet in the template. I did that for three reasons:

  • There are many limitations on formatting when creating the readme sheet programmatically, not least that it is only possible to have whole-cell hyperlinks.
  • It is very time-consuming to maintain and update the code for creating the readme sheet programmatically, so it's better to have it in a Markdown file which anyone can easily edit.
  • When users follow the readme link, they will always get the latest version.

@duncandewhurst
Collaborator Author

@odscjen @matamadio I'm going to close this issue as I think that everything related to the template in this issue is now done. If there is anything outstanding relating to the example data or schema, please open an issue on the main rdl-standard repo. If there is anything outstanding relating to the spreadsheet template, please feel free to reopen this issue :-)
