
Add mcap du command #1021

Merged: 8 commits merged into main from roman/add-du-command on Nov 1, 2024
Conversation

@defunctzombie (Contributor) commented Nov 19, 2023

Problem: When looking at mcap files I want to understand how much space particular topics take up. I might want to know if I have some heavyweight topics I need to trim or if some debug topic is taking a large amount of space relative to its worth. Having a command that shows me how much "space" is used by each topic can help me answer such questions.

This change introduces a new mcap du command. This command reads an mcap file and outputs "disk usage" statistics about the file.

Below I show invocations of this command on some sample mcap files. The output shows the space taken by each kind of record and then the space per topic (relative to all topics). For me, the primary use case for this command is to show the "topic" information, but I wanted to offer both sets of "usage" output for discussion. I would probably hide the "record" breakdown under a flag or remove it entirely.

$ mcap du ~/Downloads/nuScenes-v1.0-mini-scene-0061.mcap
Top level record stats:

record          sum bytes       % of total file bytes
------          ---------       ---------------------
schema          119442          0.025821             
channel         5100            0.001103             
statistics      446             0.000096             
header          28              0.000006             
message index   514284          0.111177             
data end        4               0.000001             
footer          20              0.000004             
chunk           461903442       99.853394            
chunk index     38784           0.008384             
summary offset  68              0.000015             

Message size stats:

topic                                           sum bytes (uncompressed)        % of total message bytes
-----                                           ------------------------        ------------------------
/LIDAR_TOP                                      253.02 MiB                      37.325157               
/CAM_BACK/image_markers_lidar                   36.65 MiB                       5.405929                
/CAM_FRONT_LEFT/image_rect_compressed           34.88 MiB                       5.145901                
/CAM_BACK_LEFT/image_rect_compressed            34.37 MiB                       5.070285                
/CAM_BACK_LEFT/image_markers_lidar              34.11 MiB                       5.031229                
/map                                            33.84 MiB                       4.991500                
/CAM_BACK_RIGHT/image_rect_compressed           32.08 MiB                       4.732708                
/CAM_FRONT/image_rect_compressed                31.26 MiB                       4.611454                
/CAM_FRONT_RIGHT/image_rect_compressed          29.69 MiB                       4.380036                
/CAM_BACK_RIGHT/image_markers_lidar             29.09 MiB                       4.290726                
/CAM_FRONT_LEFT/image_markers_lidar             28.44 MiB                       4.195673                
/CAM_BACK/image_rect_compressed                 28.44 MiB                       4.195309                
/CAM_FRONT_RIGHT/image_markers_lidar            26.71 MiB                       3.939883                
/CAM_FRONT/image_markers_lidar                  25.05 MiB                       3.695423                
/drivable_area                                  3.81 MiB                        0.562412                
/diagnostics                                    3.35 MiB                        0.494162                
/RADAR_BACK_LEFT                                1.39 MiB                        0.205230                
/RADAR_FRONT                                    1.36 MiB                        0.200218                
/CAM_BACK/image_markers_annotations             1.36 MiB                        0.200176                
/RADAR_BACK_RIGHT                               1.34 MiB                        0.197152                
/CAM_FRONT/image_markers_annotations            1.06 MiB                        0.156522                
/RADAR_FRONT_RIGHT                              978.25 KiB                      0.140927                
/CAM_FRONT_RIGHT/image_markers_annotations      841.36 KiB                      0.121206                
/markers/annotations                            738.96 KiB                      0.106455                
/CAM_BACK_RIGHT/image_markers_annotations       710.51 KiB                      0.102356                
/semantic_map                                   680.68 KiB                      0.098059                
/odom                                           651.51 KiB                      0.093857                
/RADAR_FRONT_LEFT                               643.55 KiB                      0.092710                
/imu                                            594.98 KiB                      0.085713                
/CAM_FRONT_LEFT/image_markers_annotations       273.38 KiB                      0.039383                
/CAM_BACK_LEFT/image_markers_annotations        139.28 KiB                      0.020065                
/tf                                             95.27 KiB                       0.013725                
/CAM_FRONT_LEFT/camera_info                     68.03 KiB                       0.009801                
/CAM_FRONT/camera_info                          66.94 KiB                       0.009643                
/CAM_BACK_LEFT/camera_info                      66.30 KiB                       0.009551                
/CAM_BACK_RIGHT/camera_info                     66.21 KiB                       0.009538                
/CAM_FRONT_RIGHT/camera_info                    66.12 KiB                       0.009525                
/CAM_BACK/camera_info                           64.34 KiB                       0.009268                
/gps                                            4.76 KiB                        0.000686                
/pose                                           3.08 KiB                        0.000444 

Implementation notes:

  • This implementation reads the entire file. There is an opportunity for "optimization" by using the MessageIndex records to figure out the size of records without decompressing chunks. I view this as out of scope for v1. A simplified sketch of the full-scan counting loop follows.
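The sketch below shows roughly what the full-scan counting loop looks like. It is illustrative only, not the exact code in du.go, and it assumes the Go library's lexer API (mcap.NewLexer, Lexer.Next, ParseChannel, ParseMessage) with default options, where the lexer yields the records inside chunks already decompressed.

```go
package main

import (
	"fmt"
	"os"

	"github.com/foxglove/mcap/go/mcap"
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	lexer, err := mcap.NewLexer(f)
	if err != nil {
		panic(err)
	}

	topics := map[uint16]string{}           // channel ID -> topic name
	topicMessageSize := map[string]uint64{} // topic name -> uncompressed message bytes

	for {
		token, data, err := lexer.Next(nil)
		if err != nil {
			break // io.EOF ends the scan; a real tool would distinguish other errors
		}
		switch token {
		case mcap.TokenChannel:
			if ch, err := mcap.ParseChannel(data); err == nil {
				topics[ch.ID] = ch.Topic
			}
		case mcap.TokenMessage:
			if msg, err := mcap.ParseMessage(data); err == nil {
				topicMessageSize[topics[msg.ChannelID]] += uint64(len(msg.Data))
			}
		}
	}

	for topic, size := range topicMessageSize {
		fmt.Printf("%-50s %12d bytes\n", topic, size)
	}
}
```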

Review thread on mcap.code-workspace (outdated, resolved)
// total message size by topic name
topicMessageSize map[string]uint64

totalSize uint64
@james-rms (Collaborator) commented Nov 19, 2023

totally pedantic, but if we're going to come up with totalSize by summing up all the bytes that come out of the lexer, we'll need to add 8 bytes each for the header and trailer magic.

}

	printTable(os.Stdout, rows, []string{
		"record kind", "sum bytes", "% of total file bytes",
Collaborator:
Suggested change:
- "record kind", "sum bytes", "% of total file bytes",
+ "record kind", "sum bytes", "% of total file bytes (after chunk decompression)",

@defunctzombie (Contributor, Author) replied:

I am not sure this suggestion is correct. These sizes are only the top-level record sizes.

Review thread on go/cli/mcap/cmd/du.go (outdated, resolved)
@james-rms (Collaborator) left a review comment:

The help text for this subcommand should remind the user that this reads the entire file.

Regarding showing per-record-type usage

We've had questions around this before, sometimes non-message records can take up a surprising amount of room. For example when @amacneil was coming up with plot test data, his example file with millions of tiny messages had almost half the space taken up with MessageIndex records. We've also previously had writers who re-wrote schema records for every new channel, which can also add up. I think this is valuable to know.

Adjusting for compression

There's some nuance missing here around compression.

If I'm looking for the total % contribution of some record type or topic to the file's size on disk, I need to see the effect that compression is having on that size. Right now you display percentages where the denominator is a sum of all record bytes after decompression. This will de-emphasise the impact of uncompressed records, particularly MessageIndex records.

Given:

  1. ru: the sum of uncompressed record bytes
  2. rc[n]: the sum of uncompressed record bytes in chunk n
  3. tc[n]: the compressed data size of chunk n
  4. tu[n]: the uncompressed data size of chunk n
  5. f: the total size of the file

I could imagine synthesizing a record impact estimate as:

impact % = (ru + sum(rc[n] * (tc[n] / tu[n]) for all n)) / f

which assumes all bytes in a compressed chunk contribute equally to the compressed output size, which isn't true but there's no easy way to know the right answer.

If this feels too complicated to present, you could consider presenting some of these values separately and allowing the user to do their own mental arithmetic.
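For concreteness, that estimate boils down to plain arithmetic. The sketch below uses the names defined above (interpreting ru as the record bytes stored outside chunks, since the in-chunk bytes are covered by the rc[n] terms); it is illustrative and not code from the PR.

```go
// impactFraction estimates the fraction of the file's on-disk size
// attributable to one record type or topic; multiply by 100 for a percentage.
func impactFraction(ru uint64, rc, tc, tu []uint64, f uint64) float64 {
	estimated := float64(ru) // record bytes stored outside chunks
	for n := range rc {
		// scale each chunk's contribution by that chunk's compression ratio tc[n]/tu[n]
		estimated += float64(rc[n]) * float64(tc[n]) / float64(tu[n])
	}
	return estimated / float64(f)
}
```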

Schemas

IMO it's worth parsing schema records and including the schema names and encodings when presenting per-topic statistics.

Human-readable bytes

We have a humanBytes function used in info.go, you could re-use that to make the byte counts more presentable (perhaps under a flag).

@jhurliman (Contributor) commented:
Are there plans to finish this PR and merge? An mcap du command would be very useful.

@wkalt (Contributor) commented Oct 28, 2024

I got notified by John's comment. I don't want to kick a hornet's nest, but my feedback on the concept is:

  • The principled way to handle this (consistent with MCAP design goals) is with a backward-compatible evolution of the statistics concept (probably via new records) to allow writers to record this information to the summary section. This would make "du" operations cheap on remote files, which is a goal met by all the other summarization operations supported by the format and tool. I think supporting "du" with a full scan doesn't play to the strengths of the format, and I think it's usually unwise to introduce features that can't be well-supported without additional design/work.
  • If extending the statistics concept feels like too much, I think this patch can still be improved by basing the statistics on the message index records and skipping all decompression. The timestamps/offsets in those indexes should be sufficient, and will require a lot less data to be inspected. However, it will still require a seek per chunk, which will make it untenably slow on NFS (and possibly worse on S3 etc) - unlike "mcap info" and the other related commands. It would suffer from the same issues as "rosbag info" in this article: https://foxglove.dev/blog/mcap-vs-ros1-bag-index-performance. For that reason I don't think this optimization is a "good" approach, for the same reason a full scan isn't.

Accordingly, my vote would be to close this and instead approach it as a side-effect of custom statistics support (there are multiple tickets/discussions on that around already). Once a better model for statistics exists, du can be implemented trivially with great performance, or even folded into "mcap info" (possibly just another column in the output?).

Writers would then need to update to get the benefit, but that's no different from writers needing to opt in to various features to get the full data out of the current "info" command.

Some previous custom stats ideas -
#723
#384

edit: minor added detail

@wkalt (Contributor) commented Oct 28, 2024

Another thought: lots of tools allow you to write custom plugins for operations that are convenient but don't quite rise to the standard of inclusion in the tool itself. Two examples are git and cargo.

One possible way to split the difference on this would instead be to add a plugin extension mechanism, and point users who want this kind of feature either toward that or toward a plugin in some Foxglove repo. This would keep the door open for the required format changes to support this within the proper tool, without being a blocker for users who would accept the poor performance.

Two approaches I have seen for this:

  • scan the user's path for executables with a known prefix, and treat these as subcommands. IIRC this is how cargo and git both work. (A sketch of this approach follows the list.)
  • for better integration with the cobra lib and help tooling, take a look at Add Support for Plugins spf13/cobra#691 and related discussions. This is how I handle plugins for the dp3 binary. Getting this working cross-language is probably more of a pain.
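As a minimal sketch of the first approach: scan for an executable named with a known prefix and exec it. The "mcap-<subcommand>" naming convention here is hypothetical, and this is not part of the mcap CLI.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runPlugin looks for an executable named "mcap-<subcommand>" on $PATH and
// runs it with the remaining arguments, git/cargo style.
func runPlugin(subcommand string, args []string) error {
	path, err := exec.LookPath("mcap-" + subcommand)
	if err != nil {
		return fmt.Errorf("no plugin found for %q: %w", subcommand, err)
	}
	cmd := exec.Command(path, args...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: mcap <subcommand> [args...]")
		os.Exit(1)
	}
	// A real CLI would only fall back to plugins for unknown built-in subcommands.
	if err := runPlugin(os.Args[1], os.Args[2:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```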

@defunctzombie (Contributor, Author) commented Oct 28, 2024

My perspective is that this patch is a simple improvement that provides valuable information. We can caveat that different topics compress differently, but there's no particular reason to hold up the feature on some idealized changes or optimizations. If someone later proposes alternatives or wants to implement them, we can evaluate those separately.

Custom stats put the requirement on writers to add support for new features. This approach allows a reader to provide insight into a file without writers having to be updated.

@jhurliman (Contributor) commented:
+1 to moving this specific patch forward. My own personal interest in this is for MCAPs that already exist, not a future version of the spec for files that haven’t been serialized yet.

@wkalt (Contributor) commented Oct 28, 2024

Well, I think I have clearly expressed why this is not a good feature - it can't be made to play well with remote or large files without physical changes to the format, which subjects users to a poor UX.

The plugin mechanism I suggested also handles the backward compatibility issue, and unlike this approach is fully generic (plugin writers can implement whatever functionality they want, including making use of their own internal MCAP conventions -- something not possible in the tool today). In contrast with this approach we get a single expensive operation that does one thing.

Anyway, my stake in this is not too big so I'll appeal to @james-rms and bow out.

@defunctzombie (Contributor, Author) commented:
@james-rms I've updated the PR to use utils.FormatTable and humanBytes as suggested. I've also made it clearer that the message size is the uncompressed size by adding (uncompressed) to the "sum bytes" heading.

@wkalt The PR description does note that this implementation is "slower" because it reads the entire file rather than using message index records. That would be a good follow-on improvement that doesn't need to hold up the v1 of this command.


var duCmd = &cobra.Command{
	Use:   "du <file>",
	Short: "Report space usage within an MCAP file",
Collaborator:
please also add a Long description which notes that this scans the entire file.
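One possible wording, as a hedged sketch (the merged code may differ):

```go
var duCmd = &cobra.Command{
	Use:   "du <file>",
	Short: "Report space usage within an MCAP file",
	Long: `Report space usage within an MCAP file.

Note: this command reads and decompresses the entire file to compute
per-record and per-topic byte counts, so it may be slow on large or
remote files.`,
}
```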

@james-rms (Collaborator) left a review comment:

This looks good to me. I agree with the sentiment that generic statistics are the correct future solution to this problem, but disagree that this should block shipping a tool that many people want to use on their existing data.

Regarding a fast implementation using message indexes, there's no way to make it exact. Message indexes don't say how long the messages are, and you can't assume the only records in chunks are messages, so you can't just measure from one indexed message offset to the start of the next.

@defunctzombie merged commit 693030f into main on Nov 1, 2024
23 checks passed
@defunctzombie deleted the roman/add-du-command branch on November 1, 2024