Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: secondary indexes #918

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

proposal: secondary indexes #918

wants to merge 1 commit into from

Conversation

james-rms
Copy link
Collaborator

@james-rms james-rms commented Jul 3, 2023

Adds new record types which facilitate fast message lookup by timestamps other than log_time. These closely mirror the existing log_time index structure, but can be based on any user-defined timestamp value.

This PR is currently only for review. Before merging, we would:

  • Implement secondary index writing in the Go client
  • Implement secondary indexed reading in the Go client
  • Add an option to add a secondary index by publish time or ROS header stamp to the mcap CLI

Background

Roboticists need detailed time information in order to understand exactly what happened when for their robot. For a given frame or message, one or more of these may be recorded:

  1. Log time: when the message was received by the recorder
  2. Publish time: when the message was published by the originator
  3. Event timestamp: for messages about events that happened before publishing, they may include a timestamp that records when the event happened (which can be a significant time before publishing the message).
  4. Window start and/or end time: Some messages contain information about a window of time (point cloud scans, for example).

Since most robotics frameworks are designed to allow for distributed compute and recording, we assume that some messages are out-of-order for every timestamp ordering within an MCAP file.

Justification

Right now the MCAP format includes an optional index on log_time. However, a roboticist may want to use any of these other timestamp categories to view messages within a given time window, or in a particular time order. Some example use-cases are:

  1. Locating all messages derived from a single camera frame. The camera frame's event timestamp would be the same as the event timestamp for all downstream data messages. (searches by event timestamp)
  2. Locating all point cloud scan frame(s) from multiple LiDARs that overlap with a given event. (searches by event timestamp, window start and window end time)
  3. Analyzing jitter in the publish rate from a given node (reads in publish time order)

Without index information in the file, the above use-cases can be served either by performing a windowed sort (which is how Studio currently supports ordering by header timestamp) or by reading the entire file. A windowed sort has a chance to produce incorrect results, if an out-of-order message appears outside the window.

Paths not taken

There were a few options I considered and rejected during development, listed here for discussion.

  • index arbitrary key types, not just timestamps: My initial stab at this allowed any type to be an index key, not just timestamps. However, the index structure we use here and for log_time assumes that chunk key ranges are almost-disjoint. We do a top-level search through chunk indexes to find the few chunks that overlap with our time range, then a bottom-level search through message indices to find specific messages. If the indexed keys had a random distribution across the file, all chunk ranges would overlap completely, and the top-level search would lose effectiveness. Message timestamps of all kinds tend to be roughly in increasing order across a file, which provides the needed almost-disjoint chunk ranges. I may still add advice to the specification regarding index effectiveness for random timestamps.
    • we could try to support keys with arbitrary types and distributions, but we'd need to build a more sophisticated search index, and I don't see an obvious way to do that without requiring that the writer store the entire index in memory until the summary section is written.
    • we may still want to extend this proposal to allow indexing by message sequence count. That's not a timestamp, but it should still produce almost-disjoint chunk ranges.
  • merging Secondary Chunk Index information into the existing Chunk Index record: Right now we don't have a great story for extending the specification of existing records, so I chose to add all new data into new record types.
  • include secondary key start/end in (or close to) the Chunk record: It might be nice for completeness to include information about secondary key start/end in the chunk record, or in a separate record just before the chunk. This would allow linear readers to skip parsing chunks they don't need. This is still a path we could go down, but I consider it a niche use-case that would be more complex to implement than it's worth.

@james-rms
Copy link
Collaborator Author

cc. @jhurliman

@@ -179,6 +183,30 @@ The message encoding and schema must match that of the Channel record correspond
| 8 | publish_time | Timestamp | Time at which the message was published. If not available, must be set to the log time. |
| N | data | Bytes | Message data, to be decoded according to the schema of the channel. |

### Secondary Index Key (op=0x10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should come up with a more future-proof word than "secondary", in case we decide in a V2 to merge the primary indexes into this same record type.

> before reading into the Data section. This means that the Secondary Index Key in the Data section
> is not normally used. However, if a MCAP is truncated and the summary section is lost, having the
> Secondary Index Key appear before any Secondary Message Index records allows the MCAP to be fully
> recovered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to have this record type, vs the parser just inferring from the existence of the other two record types that the index is in use? I can see dropping this would require moving the "name" field into the secondary chunk index record but that doesn't seem like the biggest thing we stick in those records anyway.

Today the parser knows a file is indexed via the presence of the index records - we don't need a third record type for that right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I should follow up with a concrete suggestion - what about

SecondaryMessageIndex
name
channel_id
records

SecondaryChunkIndex
name
chunk_start_offset
first_key
last_key
message_index_offsets
metadata (??? - will elaborate in another comment)

then in the summary offset section, we'd have a new group pointing at "SecondaryChunkIndex". The "name" key is stored in both locations to allow a partially-written file to still have index data recovered, which is a purpose your third record type also supplies. The cost is the duplication of the "name" field in the SecondaryMessageIndex records. It would be good to get some data on how much this costs us - my assumption is that the effect would generally be swamped by the size of "records", this it doesn't justify the extra record type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forget the "metadata" part - I think it would be better implemented with a "chunk info" record as described in another comment.

| ----- | ------------------ | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| 2 | channel_id | uint16 | Channel ID. |
| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 4 + N | records | `Array<Tuple<Timestamp, uint64>>` | Array of timestamp and offset for each record. Offset is relative to the start of the uncompressed chunk data. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be key, offset I think

@wkalt
Copy link
Contributor

wkalt commented Jul 10, 2023

I think rather than conceptualizing this as "secondary indexes", it would be better to think about it as "custom index support" with the intention of it eventually absorbing the existing chunk/message index scheme as well.

I think once we have this "secondary index scheme", what we'd probably want to do is split a "chunk info" record out of the existing "chunk index" record, to contain stuff like compression stats and chunk start/end that aren't really relevant to message indexes. Then your "secondary chunk index" can I think be used as the only kind of chunk index.

If we want to we could also add to "chunk info" some "metadata" or "typed statistics" kind of field to allow custom index authors to display custom chunk statistics from the info command.

[Summary Offset 0x01]
[Summary Offset 0x05]
[Summary Offset 0x07]
[Summary Offset 0x08]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these summary offsets seem wrong, 0x01 is the header and that's not in the summary section.

I think they should be 0x03, 0x04, 0x10, 0x08, 0x12, 0x0A, 0x0B

@wkalt
Copy link
Contributor

wkalt commented Jul 10, 2023

Adds new record types which facilitate fast message lookup by timestamps other than log_time. These closely mirror the existing log_time index structure, but can be based on any user-defined timestamp value.

This PR is currently only for review. Before merging, we would:

* [ ]  Implement secondary index writing in the Go client

* [ ]  Implement secondary indexed reading in the Go client

* [ ]  Add an option to add a secondary index by publish time or ROS header stamp to the `mcap` CLI

I think we'll also want to slot the integrations into whatever existing info & summarization subcommands would touch it (probably at least "mcap info" and "mcap list chunks", "mcap cat"). That would provide a good demonstration mechanism.

Background

4. Window start and/or end time: Some messages contain information about a window of time (point cloud scans, for example).

For this, would you be including two secondary indexes under the proposed scheme? You would get sort by start/end for free but additional work/first-class treatment would probably be required if supporting search by overlap (which seems more niche for this case but I'm not sure what else people may need).

Since most robotics frameworks are designed to allow for distributed compute and recording, we assume that some messages are out-of-order for every timestamp ordering within an MCAP file.

Justification

Right now the MCAP format includes an optional index on log_time. However, a roboticist may want to use any of these other timestamp categories to view messages within a given time window, or in a particular time order. Some example use-cases are:

1. Locating all messages derived from a single camera frame. The camera frame's event timestamp would be the same as the event timestamp for all downstream data messages. (searches by event timestamp)

2. Locating all point cloud scan frame(s) from multiple LiDARs that overlap with a given event. (searches by event timestamp, window start and window end time)

3. Analyzing jitter in the publish rate from a given node (reads in publish time order)

Without index information in the file, the above use-cases can be served either by performing a windowed sort (which is how Studio currently supports ordering by header timestamp) or by reading the entire file. A windowed sort has a chance to produce incorrect results, if an out-of-order message appears outside the window.

Paths not taken

There were a few options I considered and rejected during development, listed here for discussion.

* **index arbitrary key types, not just timestamps**: My initial stab at this allowed any type to be an index key, not just timestamps. However, the index structure we use here and for `log_time` assumes that chunk key ranges are almost-disjoint. We do a top-level search through chunk indexes to find the few chunks that overlap with our time range, then a bottom-level search through message indices to find specific messages. If the indexed keys had a random distribution across the file, all chunk ranges would overlap completely, and the top-level search would lose effectiveness. Message timestamps of all kinds tend to be roughly in increasing order across a file, which provides the needed almost-disjoint chunk ranges. I may still add advice to the specification regarding index effectiveness for random timestamps.
  
  * we could try to support keys with arbitrary types and distributions, but we'd need to build a more sophisticated search index, and I don't see an obvious way to do that without requiring that the writer store the entire index in memory until the summary section is written.

I think even constraining to a uint64 rather than a "Timestamp" would probably be a win. If the user records a massively disordered file based on their desired ordering, it seems ok to me to let tooling fall over or handle as desired, rather than prohibiting via spec. That's the same approach we have taken with the existing timestamps. That would knock out the sequence ID feature as well.

Operations like "get 50 messages ordered on $secondary starting at t" would still be quicker than if the file had no index.

* **merging Secondary Chunk Index information into the existing Chunk Index record**: Right now we don't have a great story for extending the specification of existing records, so I chose to add all new data into new record types.

+1

* **include secondary key start/end in (or close to) the Chunk record**: It might be nice for completeness to include information about secondary key start/end in the chunk record, or in a separate record just before the chunk. This would allow linear readers to skip parsing chunks they don't need. This is still a path we could go down, but I consider it a niche use-case that would be more complex to implement than it's worth.

+1. I think the "record before chunk" idea is interesting but easily splittable from this.


## Secondary index keys

The Secondary Index Key `name` field may contain the following options:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "may" indicate I can put my own stuff in there and expect some tooling support? I think this would be good to shoot for. Rather than having studio or whatever hard code "header.stamp", "publish_time", etc, would it be viable to dynamically show a list of sort options based on the file's index section?

And likewise with the info command, CLI, reader support etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i feel like you should expect some tooling support for any key, but tooling can make extra assumptions about well-known keys.

| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 8 | chunk_start_offset | uint64 | Offset to the chunk record from the start of the file. |
| 8 | earliest_key | Timestamp | Earliest key in the chunk. Zero if the chunk contains no messages with this key. |
| 8 | latest_key | Timestamp | Latest key in the chunk. Zero if the chunk contains no messages with this key. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a zero timestamp as a sentinal is something we do elsewhere but not strictly correct, since 0 is a valid timestamp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about saying chunk indexes should be omitted when they would apply to no messages?

@wkalt
Copy link
Contributor

wkalt commented Aug 28, 2023

another thing to think about (I don't remember if it's part of this patch) is indexes that are chunk-level only. Since remote readers need to download the full chunk to make use of an index on it, and can probably decompress/scan the chunk faster than they can download it (not guaranteed but common) the benefits of the message level index can be pretty marginal, however the chunk-level index at the end of the file may still be very useful. Similar to postgres BRIN indexes https://www.postgresql.org/docs/15/brin-intro.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants