-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mcap du
command
#1021
Add mcap du
command
#1021
Conversation
0850810
to
bf25d21
Compare
// total message size by topic name | ||
topicMessageSize map[string]uint64 | ||
|
||
totalSize uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally pedantic, but if we're going to come up with totalSize by summing up all the bytes that come out of the lexer, we'll need to add 8 bytes each for the header and trailer magic.
go/cli/mcap/cmd/du.go
Outdated
} | ||
|
||
printTable(os.Stdout, rows, []string{ | ||
"record kind", "sum bytes", "% of total file bytes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"record kind", "sum bytes", "% of total file bytes", | |
"record kind", "sum bytes", "% of total file bytes (after chunk decompression)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this suggestion is correct. These sizes are only the top-level record sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The help text for this subcommand should remind the user that this reads the entire file.
Regarding showing per-record-type usage
We've had questions around this before, sometimes non-message records can take up a surprising amount of room. For example when @amacneil was coming up with plot test data, his example file with millions of tiny messages had almost half the space taken up with MessageIndex records. We've also previously had writers who re-wrote schema records for every new channel, which can also add up. I think this is valuable to know.
Adjusting for compression
There's some nuance missing here around compression.
If i'm looking for the total % contribution of some record type or topic to the file's size on disk, I need to see the effect that compression is having on that size. Right now you display percentages where the denominator is a sum of all record bytes after decompression. this will de-emphasise the impact of uncompressed records, particularly MessageIndex records.
Given:
- sum of uncompressed record bytes
ru
- sum of uncompressed record bytes in chunk n
rc[n]
- the compressed data size of chunk n
tc[n]
- the uncompressed data size of chunk n
tu[n]
- the total size of the file
f
I could imagine synthesizing a record impact estimate as:
impact % = (ru + sum(rc[n] * (tc[n] / tu[n]) for all n)) / f
which assumes all bytes in a compressed chunk contribute equally to the compressed output size, which isn't true but there's no easy way to know the right answer.
If this feels too complicated to present, you could consider presenting some of these values separately and allowing the user to do their own mental arithmetic.
Schemas
IMO it's worth parsing schema records and including the schema names and encodings when presenting per-topic statistics.
Human-readable bytes
We have a humanBytes
function used in info.go
, you could re-use that to make the byte counts more presentable (perhaps under a flag).
Are there plans to finish this PR and merge? An |
I got notified by John's comment. I don't want to kick a hornet's nest but my feedback on the concept is,
Accordingly, my vote would be to close this and instead approach it as a side-effect of custom statistics support (there are multiple tickets/discussions on that around already). Once a better model for statistics exists, du can be implemented trivially with great performance, or even folded into "mcap info" (possibly just another column in the output?). Writers would then need to update to get the benefit, but that's no different from writers needing to opt in to various features to get the full data out of the current "info" command. Some previous custom stats ideas - edit: minor added detail |
Another thought: lots of tools allow you to write custom plugins for operations that are convenient but don't quite rise to the standard of inclusion in the tool itself. Two examples are git and cargo. One possible way to split the difference on this would instead be to add a plugin extension mechanism, and point users who want this kind of feature either toward that or toward a plugin in some Foxglove repo. This would keep the door open for the required format changes to support this within the proper tool, without being a blocker for users who would accept the poor performance. Two approaches I have seen for this:
|
My perspective is that this patch is a simple improvement that provides a level of valuable information. We can caveat it that different topics compress differently but there's no particular reason to hold up the feature on some idealized changes or optimizations. If later someone proposes alternatives or wants to implement proposals - we can evaluate those separately. Custom stats puts the requirements on writers to add support for features. This allows a reader to provide insight into a file without writers having to be updated. |
+1 to moving this specific patch forward. My own personal interest in this is for MCAPs that already exist, not a future version of the spec for files that haven’t been serialized yet. |
Well, I think I have clearly expressed why this is not a good feature - it can't be made to play well with remote or large files without physical changes to the format, which subjects users to a poor UX. The plugin mechanism I suggested also handles the backward compatibility issue, and unlike this approach is fully generic (plugin writers can implement whatever functionality they want, including making use of their own internal MCAP conventions -- something not possible in the tool today). In contrast with this approach we get a single expensive operation that does one thing. Anyway, my stake in this is not too big so I'll appeal to @james-rms and bow out. |
@james-rms I've updated the PR to use @wkalt The PR description does note that this implementation is "slower" by reading the entire fire rather than using message index records. That would be a good follow-on improvement that doesn't need to hold up the v1 of this command. |
|
||
var duCmd = &cobra.Command{ | ||
Use: "du <file>", | ||
Short: "Report space usage within an MCAP file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also add a Long
description which includes that notes that this scans the entire file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I agree with the sentiment that generic statistics are the correct future solution to this problem, but disagree that this should block shipping a tool that many people want to use on their existing data.
Regarding a fast implementation using message indexes, there's no way to make it exact. Message indexes don't say how long the messages are, and you can't assume the only records in chunks are messages, so you can't just measure up until the start of the next message index.
Problem: When looking at mcap files I want to understand how much space particular topics take up. I might want to know if I have some heavyweight topics I need to trim or if some debug topic is taking a large amount of space relative to its worth. Having a command that shows me how much "space" is used by each topic can help me answer such questions.
This change introduces a new
mcap du
command. This command reads and mcap file and outputs "disk usage" statistics about the mcap file.Below I show invocations of this command on some sample mcap files. The output shows the space taken by each kind of record and then the space per topic (relative to all topics). For me, the primary use-case for this command to show the "topic" information but wanted to offer both sets of "usage" for discussion. I would probably hide the "record" breakdown under a flag or remove it entirely.
Implementation notes: