identifier optionally including time #17

Open
crotwell opened this issue Jul 17, 2020 · 9 comments

Comments

@crotwell

Related to #16, would it be useful to allow the addition of a time to form a URI that actually uniquely identifies a source? I think in most cases this would not be used as the time would be implicit, but it might be useful in some cases where identifying an actual channel is important, perhaps in forming requests or connections between a channel and something else?

A separate delimiter is probably needed. Maybe use the '@' symbol and an ISO time, something like:

FDSN:XA_ABC_00_B_H_Z@20200607T12:34:56

which would mean the BHZ active at that time? Or perhaps the time should always be the start time?

If #11, then networks and stations could also have the time addition, like:

FDSN:XA@20200607T12:34:56
FDSN:XA_ABC@20200607T12:34:56
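
A minimal sketch of how the '@' form above might be split apart, assuming the time always uses the compact layout shown in the examples (date without separators, then `T`, then `HH:MM:SS`). The function name `split_sid_time` is hypothetical and not part of any FDSN specification.

```python
from datetime import datetime
from typing import Optional, Tuple


def split_sid_time(identifier: str) -> Tuple[str, Optional[datetime]]:
    """Split 'FDSN:...@time' into the bare identifier and an optional time."""
    sid, sep, time_text = identifier.partition("@")
    if not sep:
        # No '@' present: the time is implicit.
        return sid, None
    # Assumed layout, taken from the examples in this comment.
    return sid, datetime.strptime(time_text, "%Y%m%dT%H:%M:%S")


sid, when = split_sid_time("FDSN:XA_ABC_00_B_H_Z@20200607T12:34:56")
print(sid, when)
```

The same function works unchanged for the network and station forms, since it only looks at what follows the '@'.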

Alternatively, using more http style url query params is possible, like:

FDSN:XA_ABC_00_B_H_Z?active=20200607T12:34:56

This gets a bit more verbose, but also allows for other sub-identifying information in a future version of the spec.
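
As a rough illustration of the query-parameter style, the sketch below splits the identifier at the first `?` and hands the rest to Python's standard `urllib.parse.parse_qs`. The helper name `split_sid_query` is an assumption made for this example, and `active` is only the key used in the example above.

```python
from urllib.parse import parse_qs


def split_sid_query(identifier: str):
    """Return (bare identifier, dict of query parameters)."""
    sid, _, query = identifier.partition("?")
    # parse_qs maps each key to a list of values; an empty query gives {}.
    return sid, parse_qs(query)


print(split_sid_query("FDSN:XA_ABC_00_B_H_Z?active=20200607T12:34:56"))
```

Additional keys could be added later without changing the parsing code, which is the extensibility argument made above.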

As I said, not really sure this is needed or even a good idea at this point, but thought it was worth thinking about.

@chad-earthscope

chad-earthscope commented Sep 2, 2020

I think the general idea is very powerful. I don't think we ought to target just a time, but rather some conventions for adding attributes to the identifier for start and end time, data version, etc. I'm not sure @ is the right separator; I've seen # used more for URIs. It would take more digging to figure out if there are patterns used elsewhere to decorate identifiers.

To sum up, don't do this just for a time value, but instead add a whole section on adding attributes.

@crotwell
Author

crotwell commented Sep 2, 2020

Agree, a more general scheme would be a good idea. I can also imagine using this for time ranges, like in data requests. Also maybe for identifying things like derived, synthetic or other non-data but still "channel like" objects?

One advantage of following the style of http query params is that there are existing parsing systems that do not have to be reinvented. Having the system follow the nameA=valueA&nameB=valueB style means the rules, especially for escaping chars and other gotchas are already decided.
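
To illustrate the point about reusing existing parsers, here is a small sketch using Python's standard `urllib.parse`. The keys `timerange` and `note` and their values are made up for the example; the point is only that reserved characters such as `&`, `=` and `/` survive a round trip without any custom escaping rules.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical attributes; one value deliberately contains '&' and '='.
attributes = [
    ("timerange", "2020-08-01T00:00:00/2020-08-02T00:00:00"),
    ("note", "a&b=c"),
]

encoded = urlencode(attributes)  # percent-escapes reserved characters
decoded = parse_qsl(encoded)     # reverses the escaping

print(encoded)
print(decoded == attributes)
```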

Question, is this issue important enough to resolve before submission of the review for approval, or delay until a future revision?

@chad-earthscope

> Question, is this issue important enough to resolve before submission of the review for approval, or delay until a future revision?

Probably the best approach would be to add a section describing the basics of a pattern for adding attributes to an SID at a high level and note that the WG may wish to consider refinement of the details/conventions.
I would be grateful if someone could come up with a draft of the section text and suggested location.

@WayneCrawford

This seems like a great idea to me, but beyond my knowledge of URIs!

@crotwell
Author

crotwell commented Sep 3, 2020

I did some reading of the URN spec, and it has 3 types of optional components: r-components (introduced by ?+, used for routing), q-components (introduced by ?=), and f-components (introduced by #). Either the q-component or the f-component sounds like it could be used for this, but for all three it says:

... SHALL NOT be taken into account when determining URN-equivalence.

so if we want to follow the URN proposed spec, I think there is the problem of deciding if two identifiers are the same or not if they contain fragments (f-components).
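
A simplified sketch of the problem: if equivalence ignores everything from the first `?+`, `?=` or `#` onward, two identifiers that differ only in their fragment compare as equal. This is not a full implementation of URN-equivalence (it skips case normalization, for instance), and the function names are hypothetical.

```python
import re

# The first "?+", "?=" or "#" marks the start of the optional components.
_COMPONENT_START = re.compile(r"\?\+|\?=|#")


def strip_components(identifier: str) -> str:
    """Drop r-, q- and f-components, keeping only the leading part."""
    match = _COMPONENT_START.search(identifier)
    return identifier if match is None else identifier[: match.start()]


def equivalent_ignoring_components(a: str, b: str) -> bool:
    """Compare two identifiers the way the quoted rule suggests."""
    return strip_components(a) == strip_components(b)


print(equivalent_ignoring_components(
    "FDSN:IU_COLA_00_B_H_Z#version=3",
    "FDSN:IU_COLA_00_B_H_Z#version=4",
))
```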

That said, I kind of feel like the practical use of this should be ok, and feel like a fragment with some limited rules could do this.

Here is a start. I would suggest adding it to the Identifiers as a URN section, immediately before the Temporary network codes convention heading.

Fragment identifiers

An FDSN Source Identifier may contain an optional fragment identifier, as defined by RFC 3986. The start of the fragment is indicated by a hash or number sign ("#", ascii 35) character, and the fragment is terminated by the end of the URI. The form of the fragment within an FDSN Source Identifier is a sequence of key=value pairs, with each key separated from its value by the equals sign ("=", ascii 61) and each subsequent pair separated by an ampersand ("&", ascii 38), following all character escaping rules for URIs in general.

Keys may be any valid combination of characters, but keys composed of only capital letters, ascii 65 to 90, and digits 0-9, ascii 48 to 57, are reserved for future definition by the FDSN. Keys defined outside of the FDSN by users must contain at least one lower case letter, ascii 97 to 122.
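
The key rule in this draft could be checked along the following lines. This is only a sketch: the function name `classify_key` and the label `"unspecified"` are assumptions, the latter covering keys such as `A_B` that are neither all capitals and digits nor contain a lower case letter, a case the draft text does not address.

```python
import string

# Characters allowed in keys reserved for the FDSN: A-Z and 0-9.
_RESERVED_CHARS = set(string.ascii_uppercase + string.digits)


def classify_key(key: str) -> str:
    """Return 'reserved', 'user' or 'unspecified' for a fragment key."""
    if not key:
        raise ValueError("empty key")
    if set(key) <= _RESERVED_CHARS:
        return "reserved"      # only capital letters and digits
    if any(c in string.ascii_lowercase for c in key):
        return "user"          # at least one lower case letter
    return "unspecified"       # not covered by the draft wording


for key in ("VERSION", "V2", "timerange", "A_B"):
    print(key, classify_key(key))
```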

An FDSN Source Identifier that contains a fragment identifier implies a relationship to the source identifier without the fragment, but in the absence of external information about the keys and values in the fragment, does not imply equivalence. Moreover, the order of the key-value pairs within the fragment, and any meaning thereof, is undefined. Equivalence between identifiers, with or without fragments, is only guaranteed in the case of an exact string match for the entire URI without reordering. Existence of a fragment may or may not imply numeric differences in recorded data values from the source with the fragment removed.

Sources identified with fragments should respect all other rules relating to data source naming, including band, source and subsource codes. In particular, a fragment should NOT be used to create a derived source that is fundamentally different from the original source. For example, the latency of a seismometer channel should NOT be formed by appending a fragment of #latency=datacenter, as the data would no longer be broadband high gain seismometer data.

Possible uses of the fragment could be to identify subsets of data from a source, e.g. by time range, to identify derived or processed versions of data from a source, or to indicate levels of quality control.

Example:

FDSN:IU_COLA_00_B_H_Z#foo=bar&doo=wop&n=7

could imply a data source that is somehow derived from or related to the FDSN:IU_COLA_00_B_H_Z data source.
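
A sketch of how the example identifier could be taken apart with standard tools, assuming the fragment follows the key=value&key=value form described in the draft text. The helper name `split_fragment` is hypothetical. Pairs are kept as an ordered list, and the last lines show that under the exact-string-match rule a reordered fragment is a different identifier.

```python
from urllib.parse import parse_qsl


def split_fragment(identifier: str):
    """Return (identifier without fragment, list of (key, value) pairs)."""
    sid, _, fragment = identifier.partition("#")
    return sid, parse_qsl(fragment, keep_blank_values=True)


sid, pairs = split_fragment("FDSN:IU_COLA_00_B_H_Z#foo=bar&doo=wop&n=7")
print(sid, pairs)

# Exact string match only: reordering the pairs breaks equivalence.
a = "FDSN:IU_COLA_00_B_H_Z#foo=bar&doo=wop"
b = "FDSN:IU_COLA_00_B_H_Z#doo=wop&foo=bar"
print(a == b)
```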

@chad-earthscope

> Also maybe for identifying things like derived, synthetic or other non-data but still "channel like" objects?

I was not thinking anything derived or synthetic, as that is a pretty limited way to denote data processing or generating provenance, which is a big can of worms to open.

Instead I suggest that we limit this, for now, to attributes that further define, i.e. narrow the scope of, the data source identified. As in time range, data version, and other characteristics, perhaps quality related.

> ... SHALL NOT be taken into account when determining URN-equivalence.
>
> so if we want to follow the URN proposed spec, I think there is the problem of deciding if two identifiers are the same or not if they contain fragments (f-components).

Seems OK in our case if we interpret URN-equivalence not to mean the exact same data; I don't think we can get to that point without including time and version as fundamental components anyway.

So I'm thinking things like:
FDSN:IU_COLA_00_B_H_Z#version=3
FDSN:IU_COLA_00_B_H_Z#timerange=2020-8-1T00:00:00/2020-8-2T00:00:00

Here's an alternate idea to include the fundamentals in the URI in a way that takes part in comparison: add any "defining" characteristics to the path portion, e.g.:

FDSN:IU_COLA_00_B_H_Z/<version>/<starttime>/<endtime>#clockquality=100
like
FDSN:IU_COLA_00_B_H_Z/3/2020-8-1T00:00:00/2020-8-2T00:00:00#clockquality=100

Then the strict URN part uniquely identifies data. Hmm, not totally convinced we need this and it's pretty rigid.
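
A minimal sketch of reading the rigid path form above, assuming exactly three path segments (version, start time, end time) always follow the source identifier. The type and function names are hypothetical, and the times are left as strings rather than interpreted.

```python
from typing import NamedTuple


class PathStyleId(NamedTuple):
    sid: str
    version: str
    start: str
    end: str
    fragment: str


def parse_path_style(identifier: str) -> PathStyleId:
    """Parse 'SID/<version>/<starttime>/<endtime>#fragment'."""
    body, _, fragment = identifier.partition("#")
    parts = body.split("/")
    if len(parts) != 4:
        raise ValueError("expected SID/version/start/end")
    return PathStyleId(*parts, fragment)


print(parse_path_style(
    "FDSN:IU_COLA_00_B_H_Z/3/2020-8-1T00:00:00/2020-8-2T00:00:00#clockquality=100"
))
```

The fixed segment count is what makes this form rigid: an identifier without a version or an end time would need placeholder segments or a different convention.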

@crotwell
Author

crotwell commented Sep 4, 2020

After sleeping on this, I am having doubts about this idea.

I think your can of worms thought is right. The reason I put that in was just to be very general wrt the future uses of the fragment, and also because I always feel a tiny bit of guilt whenever I apply some processing step to waveform data and then write it back out using the original channel code. But you are right that the source identifier is not the right place for provenance.

The time range "subset" idea feels the most natural, and so may be worth doing, but I worry about uses like in a future miniseed3 or stationxml where the time is provided elsewhere. If someone sets the channel id to be

FDSN:IU_COLA_00_B_H_Z#timerange=2020-8-1T00:00:00/2020-8-2T00:00:00

but the header says it is data actually from 2019, then we have a problem. I suppose we could say that fragments are not allowed in places where the time is available via other means, but perhaps just not having it at all is safer. Similar issue of course if the fragment says #version=3 and the header says the version is 4.
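
The conflict described here could be detected with a check like the sketch below. It assumes the `timerange=start/end` form from the example, a header start time supplied separately by the caller, and a half-open interval; the function names are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def timerange_from_fragment(identifier: str):
    """Return (start, end) from a 'timerange' fragment key, or None."""
    _, _, fragment = identifier.partition("#")
    for pair in fragment.split("&"):
        key, _, value = pair.partition("=")
        if key == "timerange":
            start, _, end = value.partition("/")
            return (datetime.strptime(start, TIME_FORMAT),
                    datetime.strptime(end, TIME_FORMAT))
    return None


def header_time_consistent(identifier: str, header_start: datetime) -> bool:
    """True if there is no timerange, or the header time falls inside it."""
    span = timerange_from_fragment(identifier)
    return span is None or span[0] <= header_start < span[1]


sid = "FDSN:IU_COLA_00_B_H_Z#timerange=2020-8-1T00:00:00/2020-8-2T00:00:00"
print(header_time_consistent(sid, datetime(2019, 8, 1)))
```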

Still not sure how I feel about this...

@chad-earthscope

chad-earthscope commented Sep 4, 2020

You raise a good point: we wouldn't want information in these attributes that overlaps whatever is in the formats where they are used (StationXML, mseed3, web service requests, etc.). Which raises the question of where these would be used at all.

The use case we have internally at our data center is for inter-service communication, so one service can provide a parsable "token" to another service, an identifier for some data with enough information/context needed to do some work. The example of FDSN:IU_COLA_00_B_H_Z#timerange=2020-8-1T00:00:00/2020-8-2T00:00:00, with some defaults, would be enough for a service to know what needs to be extracted, or plotted, or to direct to a work queue. But this may not rise to a sufficient level of need for a pattern in the FDSN specification.

Small detail: it was pointed out that using a #, while following standards for URIs, makes it more difficult to use as a token in a URI. If we ever get back to this discussion and use as values/tokens in URIs is needed, we may just want to stick with a simple path-separator pattern, e.g. FDSN:IU_COLA_00_B_H_Z/<key1>=<value1>/<key2>=<value2>
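
A sketch of parsing the path-separator pattern, with a hypothetical function name and example keys. One consequence worth noting: a value that itself contains `/`, such as a start/end time range, would have to be percent-escaped (or split into two keys), which the sketch handles by unquoting each value.

```python
from urllib.parse import unquote


def parse_path_attributes(identifier: str):
    """Parse 'SID/<key1>=<value1>/<key2>=<value2>' into (SID, dict)."""
    sid, *segments = identifier.split("/")
    attributes = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value segment: {segment!r}")
        attributes[key] = unquote(value)  # undo any percent-escaping
    return sid, attributes


print(parse_path_attributes(
    "FDSN:IU_COLA_00_B_H_Z/version=3/timerange=2020-8-1T00:00:00%2F2020-8-2T00:00:00"
))
```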

@crotwell
Author

crotwell commented Sep 5, 2020

Just my $0.02, but when embedding one URI, or really a string of any kind, in another URI, you had better do a %-escape on reserved chars or you are just asking for things to blow up in your face. So I don't buy the # is a bad char argument.
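
To make the escaping point concrete, here is a sketch using Python's standard `urllib.parse`. The outer service URL and the `id` parameter are invented for the example. Without escaping, the `#` in the identifier is read as the start of the outer URL's fragment; with escaping, the whole identifier comes back intact.

```python
from urllib.parse import parse_qs, quote, urlsplit

sid = "FDSN:IU_COLA_00_B_H_Z#timerange=2020-8-1T00:00:00/2020-8-2T00:00:00"
base = "https://service.example.org/plot?id="  # hypothetical endpoint

# Unescaped: everything after '#' is lost to the outer fragment.
raw = urlsplit(base + sid)
print(parse_qs(raw.query)["id"][0], "|", raw.fragment)

# Escaped: safe="" also escapes '/', so nothing leaks into the outer URL.
escaped = urlsplit(base + quote(sid, safe=""))
print(parse_qs(escaped.query)["id"][0] == sid, "|", repr(escaped.fragment))
```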

That said, I, too, am struggling to find the use case that really motivates this. Unless there is one, I would defer this whole idea I think.


3 participants