Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom statistics proposal #876

Closed
wants to merge 1 commit into from
Closed

Conversation

james-rms
Copy link
Collaborator

Public-Facing Changes

Description

| `metadatas` | a statistic about all metadata records |
| `attachments` | a statistic about all attachment records |
| `chunks` | a statistic about all chunks |
| `this_channel` | a statistic about the channel record immediately preceding |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few thoughts on this -

  1. I do not like that this introduces a new naming scheme with "this_*". I think we can avoid that.

  2. I don't get how these will get recorded. I would expect the custom statistics records to get recorded into a custom statistics portion of the summary section (i.e these are not intermixed with the other portions of the summary section - there is a summary offset record for custom statistic, pointing at a contiguous run of these).

So what's the "immediately preceding" channel? Likewise for metadata/attachment/chunk.

  1. Is it necessary to have statistics for both "this channel" and "channels"? If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"? And would tooling be able to do generic handling based on types (e.g for numeric types, compute the averages, distributions, etc maybe) to do a good enough job of covering everything?

If we can do that, we avoid forcing users to record both "this_chunk: message_count" and "chunks: average message count".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple alternatives to throw out -

Single record: downside - "id" and opcode are abused.

Custom Statistic:
name string
value float64
op OpCode
id uint64

// Legal OpCodes are Channel and Chunk, as well as 0x00 which denotes a full-file statistic. 

Separate records

FileStatistic
name string
value float64

ChannelStatistic
name string
value float64
channelID uint16

ChunkStatistic
name string
value float64
offset uint64

In either case, we'd use the existing index structure. So for each record type getting introduced, there would be a summary offset record with that ID, pointing up at a contiguous run of those records in the summary section.

we can include display format on these as you have - haven't thought that through but seems like a reasonable idea to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are recording the "this channel" stats, would you be able to synthesize what you want for "channels"?

I considered allowing the Custom Specification record to specify how the reader should aggregate when more than one is concerned - would be neccessary for making "all channels" stats from individual stats as you say. It would also be neccessary to preserve custom stats across merges. The problem is that there are several ways for this to work (weighted average, min, max, sum, etc.) which is hard to specify ahead of time. i also considered embedding a little DSL to do this maths (something like bc notation or APL). In the end I don't think it's worth the API surface and implementation burden for the reader.


Custom statistics include a `display_format` string which indicates how the statistic should be presented to human readers. SI units should be used where applicable.

TODO fully specify format syntax
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need to specify a syntax. Another option would be to define an enum for regular kinds of types, and use that single byte value in the field. Then implementations can render the units in whatever way makes sense. This would probably lower the amount of custom parsing required for implementers.


#### Custom statistic subjects

Custom Statistics specify a subject, which indicates what this statistic is about. Some subjects are well-known:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specification of a subject is very similar to what summary offset records do, with their opcode field. it would be ideal if we could use the same mechanism here IMO

@jtbandes
Copy link
Member

jtbandes commented Apr 3, 2023

Where can I read more about the motivation/backstory behind this proposal?

@wkalt
Copy link
Contributor

wkalt commented Apr 3, 2023

@jtbandes context here: #723

@jtbandes
Copy link
Member

jtbandes commented Apr 3, 2023

I don't see any examples there of what this might be used for. Do we have a list of examples we or our users are thinking about using this for?

@wkalt
Copy link
Contributor

wkalt commented Apr 3, 2023

There's a ticket linked in the linked ticket with one example: #384, which is about per-topic size statistics. We have also gotten requests for statistics on per-topic compression ratios (which I doubt would go into the official writers but could be implemented with this proposal). Separate thread recently raised per-topic coverage.

In general the idea with this record type would be to head off future requests for spec extensions with new statistics, by instead supporting them generically.

@james-rms
Copy link
Collaborator Author

Closing this for now as i'm not planning to think about this for the next few weeks. May reopen with more interest.

@james-rms james-rms closed this Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants