custom statistics proposal #876

james-rms · 2023-04-03T00:32:05Z

Public-Facing Changes

Description

wkalt · 2023-04-03T16:36:10Z

docs/specification/README.md

+| `metadatas` | a statistic about all metadata records |
+| `attachments` | a statistic about all attachment records |
+| `chunks` | a statistic about all chunks |
+| `this_channel` | a statistic about the channel record immediately preceding |


A few thoughts on this:

I do not like that this introduces a new naming scheme with "this_*". I think we can avoid that.

I don't get how these will get recorded. I would expect the custom statistics records to get recorded into a custom statistics portion of the summary section (i.e., these are not intermixed with the other portions of the summary section; there is a summary offset record for custom statistics, pointing at a contiguous run of these).

So what's the "immediately preceding" channel? Likewise for metadata/attachment/chunk.

Is it necessary to have statistics for both "this channel" and "channels"? If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"? And would tooling be able to do generic handling based on types (e.g., for numeric types, compute the averages, distributions, etc.) to do a good enough job of covering everything?

If we can do that, we avoid forcing users to record both "this_chunk: message_count" and "chunks: average message count".
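To illustrate the generic handling suggested above, here is a minimal sketch in Python. The `ChunkStat` tuple and the `summarize` helper are hypothetical names, not part of the MCAP spec or any MCAP library; the sketch only shows that tooling could derive file-wide numbers from per-chunk records when the values are numeric.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class ChunkStat(NamedTuple):
    """Hypothetical per-chunk statistic: one named numeric value per chunk."""
    name: str
    chunk_offset: int
    value: float


def summarize(stats):
    """Group per-chunk values by name and derive generic aggregates."""
    by_name = defaultdict(list)
    for s in stats:
        by_name[s.name].append(s.value)
    return {
        name: {"count": len(v), "min": min(v), "max": max(v), "mean": mean(v)}
        for name, v in by_name.items()
    }


stats = [
    ChunkStat("message_count", 0, 10.0),
    ChunkStat("message_count", 4096, 30.0),
]
summary = summarize(stats)
```

Under this assumption a writer records only the per-chunk `message_count`, and a reader computes the average itself.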

A couple of alternatives to throw out:

Single record. Downside: "id" and opcode are abused.

    Custom Statistic:
      name string
      value float64
      op OpCode
      id uint64
      // Legal OpCodes are Channel and Chunk, as well as 0x00 which denotes a full-file statistic.
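As a rough sketch of how the single-record layout might be serialized, assuming the usual MCAP conventions (little-endian integers, strings with a uint32 byte-length prefix) and the existing Channel (0x04) and Chunk (0x06) opcodes. The function names are hypothetical, and the record's own opcode and framing are left out because the proposal does not assign them.

```python
import struct

# Subject opcodes that the single-record variant would accept.
OP_FILE = 0x00     # per the comment above: denotes a full-file statistic
OP_CHANNEL = 0x04  # existing MCAP Channel opcode
OP_CHUNK = 0x06    # existing MCAP Chunk opcode


def encode_custom_statistic(name: str, value: float, op: int, id_: int) -> bytes:
    """Serialize the fields in declaration order, little-endian."""
    if op not in (OP_FILE, OP_CHANNEL, OP_CHUNK):
        raise ValueError(f"illegal subject opcode: {op:#04x}")
    raw = name.encode("utf-8")
    # uint32 length prefix + UTF-8 bytes, then float64, uint8, uint64.
    return struct.pack("<I", len(raw)) + raw + struct.pack("<dBQ", value, op, id_)


def decode_custom_statistic(buf: bytes):
    """Inverse of encode_custom_statistic; returns (name, value, op, id)."""
    (n,) = struct.unpack_from("<I", buf, 0)
    name = buf[4:4 + n].decode("utf-8")
    value, op, id_ = struct.unpack_from("<dBQ", buf, 4 + n)
    return name, value, op, id_


encoded = encode_custom_statistic("size_bytes", 1024.0, OP_CHANNEL, 7)
decoded = decode_custom_statistic(encoded)
```

The sketch makes the downside visible: the meaning of `id` depends entirely on `op`, and for a full-file statistic it carries no information at all.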

Separate records

    FileStatistic
      name string
      value float64

    ChannelStatistic
      name string
      value float64
      channelID uint16

    ChunkStatistic
      name string
      value float64
      offset uint64

In either case, we'd use the existing index structure. So for each record type getting introduced, there would be a summary offset record with that ID, pointing up at a contiguous run of those records in the summary section.
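The grouping described above can be sketched as follows. The three opcodes are placeholders chosen for illustration (the spec assigns none for these records), and `layout_groups` is a hypothetical helper; only the `group_opcode`, `group_start` and `group_length` field names come from the existing Summary Offset record.

```python
from dataclasses import dataclass
from itertools import groupby

# Placeholder opcodes for the three proposed record types (assumption).
OP_FILE_STAT, OP_CHANNEL_STAT, OP_CHUNK_STAT = 0x80, 0x81, 0x82


@dataclass
class SummaryOffset:
    group_opcode: int
    group_start: int   # byte offset of the first record in the group
    group_length: int  # total byte length of the group


def layout_groups(records, start):
    """Lay out (opcode, encoded_bytes) pairs as one contiguous run per opcode.

    Returns the concatenated bytes and one SummaryOffset per record type.
    """
    ordered = sorted(records, key=lambda r: r[0])  # stable: keeps order within a type
    out = bytearray()
    offsets = []
    for opcode, group in groupby(ordered, key=lambda r: r[0]):
        group_start = start + len(out)
        for _, payload in group:
            out += payload
        offsets.append(SummaryOffset(opcode, group_start, start + len(out) - group_start))
    return bytes(out), offsets


records = [(OP_CHANNEL_STAT, b"aa"), (OP_FILE_STAT, b"b"), (OP_CHANNEL_STAT, b"ccc")]
data, offsets = layout_groups(records, start=100)
```

A reader interested only in channel statistics would then seek to the matching `group_start` and read `group_length` bytes, without scanning the rest of the summary section.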

We can include a display format on these as you have. I haven't thought that through, but it seems like a reasonable idea to me.

james-rms · 2023-04-04

> If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"?

I considered allowing the Custom Statistic record to specify how the reader should aggregate when more than one is concerned. That would be necessary for making "all channels" stats from individual stats, as you say. It would also be necessary to preserve custom stats across merges. The problem is that there are several ways for this to work (weighted average, min, max, sum, etc.), which is hard to specify ahead of time. I also considered embedding a little DSL to do this math (something like bc notation or APL). In the end I don't think it's worth the API surface and implementation burden for the reader.
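A small sketch of why the aggregation mode is hard to fix ahead of time. The `Aggregation` enum and `merge` function are hypothetical and exist only to illustrate the point; nothing like them is in the proposal.

```python
from enum import Enum


class Aggregation(Enum):
    """Hypothetical aggregation hints a record could carry."""
    SUM = "sum"
    MIN = "min"
    MAX = "max"
    WEIGHTED_MEAN = "weighted_mean"


def merge(agg, a, b, weight_a=1.0, weight_b=1.0):
    """Combine the same statistic from two inputs under a given hint."""
    if agg is Aggregation.SUM:
        return a + b
    if agg is Aggregation.MIN:
        return min(a, b)
    if agg is Aggregation.MAX:
        return max(a, b)
    if agg is Aggregation.WEIGHTED_MEAN:
        # The weights are extra state that a bare name/value record does not carry.
        return (a * weight_a + b * weight_b) / (weight_a + weight_b)
    raise ValueError(agg)


total = merge(Aggregation.SUM, 2.0, 3.0)
average = merge(Aggregation.WEIGHTED_MEAN, 10.0, 20.0, weight_a=1.0, weight_b=3.0)
```

Sum, min and max merge from the values alone, but a weighted average needs weights stored alongside each value, and every additional mode enlarges what each reader has to implement.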

wkalt · 2023-04-03T16:46:23Z

docs/specification/README.md

+
+Custom statistics include a `display_format` string which indicates how the statistic should be presented to human readers. SI units should be used where applicable.
+
+TODO fully specify format syntax


I'm not sure if we need to specify a syntax. Another option would be to define an enum for regular kinds of types, and use that single byte value in the field. Then implementations can render the units in whatever way makes sense. This would probably lower the amount of custom parsing required for implementers.
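For illustration, the enum approach might look like the sketch below. The `DisplayKind` members and their byte values are invented for this example and are not proposed values; `render` is one possible way an implementation could choose to present each kind.

```python
from enum import IntEnum


class DisplayKind(IntEnum):
    """Hypothetical one-byte display kinds; values are illustrative only."""
    COUNT = 0
    BYTES = 1
    SECONDS = 2
    RATIO = 3


def render(value: float, kind: DisplayKind) -> str:
    """Format a statistic for human readers based on its display kind."""
    if kind is DisplayKind.BYTES:
        # Scale by SI (decimal) prefixes, stopping at the largest unit listed.
        for unit in ("B", "kB", "MB", "GB"):
            if abs(value) < 1000 or unit == "GB":
                return f"{value:.1f} {unit}"
            value /= 1000
    if kind is DisplayKind.SECONDS:
        return f"{value:.3f} s"
    if kind is DisplayKind.RATIO:
        return f"{value * 100:.1f}%"
    return f"{value:g}"


size_text = render(1500, DisplayKind(1))  # the kind is decoded from a single byte
ratio_text = render(0.25, DisplayKind.RATIO)
count_text = render(42, DisplayKind.COUNT)
```

The file only stores the byte; the unit strings and rounding stay an implementation detail, so no format syntax has to be parsed.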

wkalt · 2023-04-03T16:47:24Z

docs/specification/README.md

+
+#### Custom statistic subjects
+
+Custom Statistics specify a subject, which indicates what this statistic is about. Some subjects are well-known:


Specifying a subject is very similar to what summary offset records do with their opcode field. It would be ideal if we could use the same mechanism here, IMO.

jtbandes · 2023-04-03T17:25:00Z

Where can I read more about the motivation/backstory behind this proposal?

wkalt · 2023-04-03T17:26:51Z

@jtbandes context here: #723

jtbandes · 2023-04-03T17:32:53Z

I don't see any examples there of what this might be used for. Do we have a list of examples we or our users are thinking about using this for?

wkalt · 2023-04-03T17:54:57Z

There's a ticket linked in the linked ticket with one example: #384, which is about per-topic size statistics. We have also gotten requests for statistics on per-topic compression ratios (which I doubt would go into the official writers but could be implemented with this proposal). A separate thread recently raised per-topic coverage.

In general the idea with this record type would be to head off future requests for spec extensions with new statistics, by instead supporting them generically.

james-rms · 2023-04-06T01:49:00Z

Closing this for now as I'm not planning to think about this for the next few weeks. May reopen with more interest.

custom statistics proposal

3b0ad0e

wkalt reviewed Apr 3, 2023


james-rms closed this Apr 6, 2023
