-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
custom statistics proposal #876
Conversation
| `metadatas` | a statistic about all metadata records | | ||
| `attachments` | a statistic about all attachment records | | ||
| `chunks` | a statistic about all chunks | | ||
| `this_channel` | a statistic about the channel record immediately preceding | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few thoughts on this -
-
I do not like that this introduces a new naming scheme with "this_*". I think we can avoid that.
-
I don't get how these will get recorded. I would expect the custom statistics records to get recorded into a custom statistics portion of the summary section (i.e these are not intermixed with the other portions of the summary section - there is a summary offset record for custom statistic, pointing at a contiguous run of these).
So what's the "immediately preceding" channel? Likewise for metadata/attachment/chunk.
- Is it necessary to have statistics for both "this channel" and "channels"? If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"? And would tooling be able to do generic handling based on types (e.g for numeric types, compute the averages, distributions, etc maybe) to do a good enough job of covering everything?
If we can do that, we avoid forcing users to record both "this_chunk: message_count" and "chunks: average message count".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple alternatives to throw out -
Single record: downside - "id" and opcode are abused.
Custom Statistic:
name string
value float64
op OpCode
id uint64
// Legal OpCodes are Channel and Chunk, as well as 0x00 which denotes a full-file statistic.
Separate records
FileStatistic
name string
value float64
ChannelStatistic
name string
value float64
channelID uint16
ChunkStatistic
name string
value float64
offset uint64
In either case, we'd use the existing index structure. So for each record type getting introduced, there would be a summary offset record with that ID, pointing up at a contiguous run of those records in the summary section.
we can include display format on these as you have - haven't thought that through but seems like a reasonable idea to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"?
I considered allowing the Custom Specification record to specify how the reader should aggregate when more than one is concerned - would be neccessary for making "all channels" stats from individual stats as you say. It would also be neccessary to preserve custom stats across merges. The problem is that there are several ways for this to work (weighted average, min, max, sum, etc.) which is hard to specify ahead of time. i also considered embedding a little DSL to do this maths (something like bc
notation or APL). In the end I don't think it's worth the API surface and implementation burden for the reader.
|
||
Custom statistics include a `display_format` string which indicates how the statistic should be presented to human readers. SI units should be used where applicable. | ||
|
||
TODO fully specify format syntax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we need to specify a syntax. Another option would be to define an enum for regular kinds of types, and use that single byte value in the field. Then implementations can render the units in whatever way makes sense. This would probably lower the amount of custom parsing required for implementers.
|
||
#### Custom statistic subjects | ||
|
||
Custom Statistics specify a subject, which indicates what this statistic is about. Some subjects are well-known: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specification of a subject is very similar to what summary offset records do, with their opcode field. it would be ideal if we could use the same mechanism here IMO
Where can I read more about the motivation/backstory behind this proposal? |
I don't see any examples there of what this might be used for. Do we have a list of examples we or our users are thinking about using this for? |
There's a ticket linked in the linked ticket with one example: #384, which is about per-topic size statistics. We have also gotten requests for statistics on per-topic compression ratios (which I doubt would go into the official writers but could be implemented with this proposal). Separate thread recently raised per-topic coverage. In general the idea with this record type would be to head off future requests for spec extensions with new statistics, by instead supporting them generically. |
Closing this for now as i'm not planning to think about this for the next few weeks. May reopen with more interest. |
Public-Facing Changes
Description