specification: user-defined statistics #723
Comments
Regarding 2 - I think we would probably just introduce one. The representation in use in the file formats I have encountered is IEEE 754. About 3, I agree - however, I'm not sure the alternatives are better. I can imagine two ways to handle it. We could go with a scheme that looks like this:
or split out different record types as the OP suggests. With the former scheme, I don't know how we would encode "chunk channel statistics" if that is something we want to provide, and the implied dispatch on user-supplied strings to interpret the "reference" field seems kind of brittle to me. But at the end of the day, I think you could make either approach work.
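The inline example above did not survive, but a minimal sketch of the single-generic-record scheme the comment describes might look like the following. All field names here are illustrative assumptions, not part of the MCAP spec or this thread:

```python
from dataclasses import dataclass

# One generic record type covering every statistic. The meaning of
# "reference" depends on "reference_type", which is exactly the
# string-dispatch the comment calls brittle: readers must interpret
# "channel" -> channel ID, "chunk" -> record offset, and so on.
@dataclass
class CustomStatistic:
    name: str            # statistic name, e.g. "message_size_p99"
    value: bytes         # serialized value, e.g. an IEEE 754 float64
    reference_type: str  # "file" | "channel" | "chunk" | "attachment" | "metadata"
    reference: str       # interpreted per reference_type (ID, offset, name, ...)


# A channel-scoped statistic under this scheme (hypothetical usage):
stat = CustomStatistic(
    name="message_count",
    value=(1234).to_bytes(8, "little"),
    reference_type="channel",
    reference="3",
)
```

Note there is no obvious way under this scheme to scope a statistic to a (chunk, channel) pair, since "reference" holds a single value.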
Another interesting application for custom statistics would be accelerating text search using trigrams. Consider that in typical robotics data, text values within specific fields are extremely conserved: they generally derive from a finite number of error strings that exist in code somewhere. Typical searches are for rare strings like "ERROR", not common strings like "success".

Imagine a post-processing step that creates a chunk- or file-level statistic for each text field contained within the chunk or file. The value of the statistic is a bit vector, maybe 8 or 12 bytes long. During post-processing, string values for each string field are decomposed into trigrams. The trigrams are hashed into the vector and combined with a bitwise OR. To execute a search, the search term is decomposed into trigrams and hashed into a vector in the same manner. This vector is then checked for overlap with the index vector. If all (or sufficiently many) flipped bits in the search vector are flipped in the index vector, the file/chunk must be examined. Otherwise it can be skipped. The same technique could be used for generic text search (without specification of fields at all), but a longer vector would be required.
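The trigram scheme above can be sketched in a few lines. This is a toy illustration, not an implementation proposal; the vector width, hash function, and field names are all assumptions:

```python
import hashlib

VECTOR_BITS = 96  # a 12-byte vector, per the size suggested above

def trigrams(text):
    """Decompose a string into its overlapping 3-character substrings."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def to_bitvector(text):
    """Hash each trigram to a bit position and OR the bits together."""
    vec = 0
    for tri in trigrams(text):
        h = int.from_bytes(hashlib.sha1(tri.encode()).digest()[:4], "big")
        vec |= 1 << (h % VECTOR_BITS)
    return vec

def may_contain(index_vec, query):
    """True if every bit set for the query is also set in the index.
    False positives are possible; false negatives are not."""
    q = to_bitvector(query)
    return index_vec & q == q

# Post-processing: build a chunk-level index from the string values
# observed in that chunk (hypothetical example data).
index = 0
for value in ["success", "success", "ERROR: motor stalled"]:
    index |= to_bitvector(value)

# Query time: a chunk whose index lacks the query's bits can be skipped.
print(may_contain(index, "ERROR"))     # True - chunk must be examined
print(may_contain(index, "watchdog"))  # very likely False - chunk skipped
```

Because the query's trigrams are a subset of an indexed value's trigrams, the "ERROR" check is guaranteed to pass; the "watchdog" check can only fail to skip if every one of its trigrams happens to collide with an indexed bit.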
Much of that idea is lifted from https://www.postgresql.org/docs/current/pgtrgm.html
Related to #384
It would be nice if there were a way for writers to define and use custom statistics in their MCAP files and have them surfaced by the "info" subcommand. I think this could be implemented with a new record type like "custom statistic", which could minimally contain "name" and "value" fields.

Better IMO would be to provide a little more structure, like supporting "channel statistic" and "file statistic" variants, or even "channel statistic", "file statistic", "attachment statistic", "metadata statistic", and "chunk statistic". The channel statistics would have channel IDs associated with them, file statistics would be whole-file, and metadata/attachment/chunk statistics could contain references to the relevant record offset. I am imagining all these records would be written to the summary section somewhere, I guess in a new statistics section. The "info" command could then display file and channel statistics, and the list attachments/chunks/metadata/channels commands could show the relevant type of statistic as well.
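A rough sketch of the split-variant approach, contrasting with the single generic record. Field names and layouts are hypothetical, chosen only to illustrate how each variant carries its own typed scope:

```python
from dataclasses import dataclass

# Each variant makes its scope explicit in its fields, so readers need
# no string dispatch to interpret a reference (hypothetical layouts).

@dataclass
class FileStatistic:
    name: str    # statistic name, e.g. "gps_dropout_seconds"
    value: bytes # serialized value

@dataclass
class ChannelStatistic:
    channel_id: int  # the channel this statistic describes
    name: str
    value: bytes

@dataclass
class ChunkStatistic:
    chunk_offset: int  # file offset of the Chunk record described
    name: str
    value: bytes


# Hypothetical usage: records destined for a new summary-section group.
stats = [
    FileStatistic(name="total_distance_m", value=b"\x00" * 8),
    ChannelStatistic(channel_id=3, name="message_count", value=b"\x10" * 8),
    ChunkStatistic(chunk_offset=4096, name="max_latency_ns", value=b"\x01" * 8),
]
```

Under this layout, the "info" subcommand could group statistics by variant, and the list commands could filter on `channel_id` or `chunk_offset` without parsing a free-form reference string.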
Some additional things to think about: