
Image rectification node becomes a performance bottleneck after 20 minutes of runtime #1126

@mbed92

Description


For reference: the issue started in apriltag_ros, but the problem seems to be more closely related to RectifyNode. Here is a link to the original issue and its description: christianrauch/apriltag_ros#33 (comment)

Hi @christianrauch. I also encountered the problem mentioned in the issue. I tested multiple setups, which I summarize below. I also cannot see how the approximate sync policy could help; it seems to be a workaround without understanding the root cause, rather than a fix. Below I give more details about my setup, so please forgive me for also talking about other nodes (which, in the end, do not seem to be the issue). I wanted to dig deep into this problem because it is important for me.

I use an OAK-D-PRO-W-PoE camera with only the RGB stream enabled. My setup is simply:

oak_driver -> image_proc::RectifyNode -> apriltag_node

My environment:

  • ROS 2 Jazzy

  • apriltag_ros

  • OAK camera driver publishing:

    • /oak/oak/rgb/image_raw @ ~20 Hz
    • /oak/oak/rgb/camera_info @ ~20 Hz
  • Rectification via image_proc::RectifyNode

    • publishes /oak/oak/rgb/image_rect

In some tests the oak driver and rectify_node run in the same component_container as apriltag_node; in others, apriltag_node is a separate process. More details below.
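
For context, the composed variant is launched roughly like this. This is a minimal sketch, not my actual launch file: the depthai_ros_driver package/plugin names, the AprilTagNode plugin name, and the remappings are assumptions and may differ from the real wiring.

```python
# pipeline.launch.py -- hypothetical, simplified launch for the composed setup
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    container = ComposableNodeContainer(
        name='oak_container',
        namespace='',
        package='rclcpp_components',
        # single-threaded executor; 'component_container_mt' would use a multi-threaded one
        executable='component_container',
        composable_node_descriptions=[
            # OAK-D PoE driver (package/plugin names assumed)
            ComposableNode(
                package='depthai_ros_driver',
                plugin='depthai_ros_driver::Camera',
                name='oak'),
            # Rectification of the RGB stream; remappings are illustrative,
            # the real topic wiring depends on namespaces
            ComposableNode(
                package='image_proc',
                plugin='image_proc::RectifyNode',
                name='rectify_color_node',
                remappings=[('image', '/oak/oak/rgb/image_raw'),
                            ('image_rect', '/oak/oak/rgb/image_rect')]),
            # AprilTag detection on the rectified stream (plugin name assumed)
            ComposableNode(
                package='apriltag_ros',
                plugin='AprilTagNode',
                name='apriltag_node',
                remappings=[('image_rect', '/oak/oak/rgb/image_rect'),
                            ('camera_info', '/oak/oak/rgb/camera_info')]),
        ],
        output='screen')
    return LaunchDescription([container])
```

In the split-process variant, only the driver and rectify node go into the container and apriltag_node is started as a standalone executable.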

Bug description

Simple as that: apriltag_node stops matching image_rect + camera_info after ~20–22 minutes.

Summary

After ~20 minutes of runtime, apriltag_node starts printing:

[WARN ...] [apriltag_node]: [image_transport] Topics '/oak/oak/rgb/image_rect' and '/oak/oak/rgb/camera_info' do not appear to be synchronized. In the last 10s:
Image messages received: 1
CameraInfo messages received: 20
Synchronized pairs: 0
  • apriltag_node prints repeated synchronization warnings
  • The rate of images arriving at apriltag_node drops (observed in logs / ros2 topic hz)
  • No tags detected
  • Restarting the camera container temporarily resets the issue

In some runs (when apriltag_node runs in the same component container as the rectify node + driver), both streams dropped to 1:

Image messages received: 1
CameraInfo messages received: 1
Synchronized pairs: 0

Investigation Summary

1) Frequency observations

Before failure:

  • /oak/oak/rgb/image_raw ~20 Hz ✅
  • /oak/oak/rgb/camera_info ~20 Hz ✅
  • /oak/oak/rgb/image_rect ~20 Hz ✅

After failure:

  • /oak/oak/rgb/image_raw ~20 Hz ✅
  • /oak/oak/rgb/camera_info ~20 Hz ✅
  • /oak/oak/rgb/image_rect drops to ~5 Hz → ~1 Hz ❌ so, obviously, I suspect the rectify node is the problem

2) Timestamp verification

  • image_raw and camera_info from the camera have identical timestamps for long periods (I sampled ~16 minutes spanning before and after the failure).
  • Therefore, the camera seems to publish synchronized data correctly.
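
The check itself is simple; below is a minimal sketch of the kind of comparison I ran (not my exact script). It collects stamps from both raw topics and counts exact matches per 10 s window:

```python
# stamp_check.py -- sketch of the timestamp comparison (not the exact script used)
import rclpy
from rclpy.node import Node
from rclpy.qos import qos_profile_sensor_data
from sensor_msgs.msg import CameraInfo, Image


class StampCheck(Node):
    def __init__(self):
        super().__init__('stamp_check')
        self.image_stamps = set()
        self.info_stamps = set()
        self.create_subscription(Image, '/oak/oak/rgb/image_raw',
                                 self.on_image, qos_profile_sensor_data)
        self.create_subscription(CameraInfo, '/oak/oak/rgb/camera_info',
                                 self.on_info, qos_profile_sensor_data)
        self.create_timer(10.0, self.report)

    def on_image(self, msg):
        self.image_stamps.add((msg.header.stamp.sec, msg.header.stamp.nanosec))

    def on_info(self, msg):
        self.info_stamps.add((msg.header.stamp.sec, msg.header.stamp.nanosec))

    def report(self):
        # With a healthy camera, every image stamp should have an identical
        # camera_info stamp in the same 10 s window.
        matched = len(self.image_stamps & self.info_stamps)
        self.get_logger().info(
            f'last 10 s: images={len(self.image_stamps)} '
            f'infos={len(self.info_stamps)} identical stamps={matched}')
        self.image_stamps.clear()
        self.info_stamps.clear()


def main():
    rclpy.init()
    rclpy.spin(StampCheck())


if __name__ == '__main__':
    main()
```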

3) QoS verification

Initial hypothesis was QoS mismatch.

Observed initially:

  • Camera publishes RELIABLE / KEEP_LAST(10)
  • rectify_color_node subscribed with BEST_EFFORT / KEEP_LAST(5) (later forced to RELIABLE / KEEP_LAST(50) to test)
  • rectify_color_node publishes image_rect as RELIABLE / KEEP_LAST(10)
  • apriltag_node originally subscribed RELIABLE / KEEP_LAST(10)

Changing QoS on rectify subscriptions to RELIABLE did not remove the failure.
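
The QoS observations above come from the equivalent of ros2 topic info --verbose, which can also be scripted. A minimal rclpy sketch (topic names from my setup; give discovery a moment before reading the graph):

```python
# qos_inspect.py -- print QoS offered/requested on the camera topics
# (equivalent to `ros2 topic info --verbose`)
import time

import rclpy
from rclpy.node import Node


def main():
    rclpy.init()
    node = Node('qos_inspect')
    time.sleep(1.0)  # allow discovery to populate the graph cache
    for topic in ('/oak/oak/rgb/image_raw',
                  '/oak/oak/rgb/camera_info',
                  '/oak/oak/rgb/image_rect'):
        for info in node.get_publishers_info_by_topic(topic):
            qos = info.qos_profile
            node.get_logger().info(
                f'{topic} publisher {info.node_name}: '
                f'{qos.reliability.name}, {qos.history.name}({qos.depth})')
        for info in node.get_subscriptions_info_by_topic(topic):
            qos = info.qos_profile
            node.get_logger().info(
                f'{topic} subscriber {info.node_name}: '
                f'{qos.reliability.name}, {qos.history.name}({qos.depth})')
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```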

4) RectifyNode callback counters

I added counters to the RectifyNode callbacks; it reports:

  • images_received == camera_infos_received == pairs_matched
  • meaning it does not lose sync internally (it only processes matched pairs)
  • BUT after the failure begins, it receives fewer and fewer pairs over time (its callback rate declines),
    even though upstream raw topics remain at ~20 Hz.

Example:

RectifyNode received 27571 images, 27571 camera infos, 27571 pairs

When the bug happens, RectifyNode still shows equal counts, but the callback rate decreases.

I have no more ideas for debugging the rectify node, so at this point I assume it works OK and move forward. It just passes on processed data and might be blocked by downstream packages (apriltag_node).

5) apriltag_node behavior during failure

During the failure:

  • apriltag_node receives camera_info at full rate (often still ~20 Hz)
  • apriltag_node receives image_rect increasingly rarely (20 → 5 → 1 Hz)
  • Therefore, ExactTime in image_transport::CameraSubscriber cannot form pairs and prints warnings.

This suggests the problem is not purely a timestamp mismatch, but rather that image delivery/throughput seems to collapse.

Why this points to apriltag_node

apriltag_node uses image_transport::CameraSubscriber, which uses message_filters::TimeSynchronizer (ExactTime policy). It subscribes to image_rect, then derives the sibling camera_info topic automatically.

When the node falls behind (CPU spikes, processing delays, executor contention), image delivery can get throttled, and ExactTime pair matching drops to zero even if camera_info still arrives.
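
To make the matching behavior concrete, this is roughly what the CameraSubscriber does internally, sketched with the Python message_filters API (apriltag_node of course uses the C++ equivalent, so this is an illustration, not its actual code). With an exact-time policy, an image that never arrives means the matching camera_info can never be paired:

```python
# exact_sync_sketch.py -- illustration of ExactTime pairing on the rectified stream
import message_filters
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import CameraInfo, Image


class ExactSyncSketch(Node):
    def __init__(self):
        super().__init__('exact_sync_sketch')
        image_sub = message_filters.Subscriber(self, Image, '/oak/oak/rgb/image_rect')
        info_sub = message_filters.Subscriber(self, CameraInfo, '/oak/oak/rgb/camera_info')
        # TimeSynchronizer is the ExactTime policy: a pair is emitted only when
        # both messages carry identical header stamps.
        self.sync = message_filters.TimeSynchronizer([image_sub, info_sub], queue_size=10)
        self.sync.registerCallback(self.on_pair)
        self.pairs = 0

    def on_pair(self, image, info):
        self.pairs += 1
        self.get_logger().info(
            f'pair #{self.pairs} at stamp '
            f'{image.header.stamp.sec}.{image.header.stamp.nanosec:09d}')


def main():
    rclpy.init()
    rclpy.spin(ExactSyncSketch())


if __name__ == '__main__':
    main()
```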

Additionally, in component-container setups (single-threaded executor), starvation could make both streams appear to stop; I don't know why.

TLDR

  • Moving apriltag_node to a separate process improved behavior, but the issue could still reappear (~22 minutes)
  • Verified that image_raw and camera_info stamps stay identical, so upstream seems fine
  • Verified that QoS changes on the rectify subscriber (RELIABLE / KEEP_LAST(50)) did not fix the issue
  • Set the apriltag_node subscriber to sensor_data QoS - no effect
  • Reduced the processing cost of apriltag_node (decimation, more threads) - no effect

For now I'm testing a setup in which I modified RectifyNode to (the idea is sketched after this list):

  • always subscribe to images as BEST_EFFORT, to try to avoid clogging the pipeline
  • re-publish camera_info whenever image_rect is published, to avoid any de-synchronization issues in downstream packages
  • make my downstream nodes consume the re-published camera_info topic
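
The actual change lives in the C++ RectifyNode, but the re-publishing idea can be sketched as a standalone relay node. This is only an illustration under my assumptions; the output topic name /oak/oak/rgb/camera_info_rect is hypothetical:

```python
# camera_info_relay.py -- sketch of the re-publish idea (the real change is inside RectifyNode)
import message_filters
import rclpy
from rclpy.node import Node
from rclpy.qos import qos_profile_sensor_data
from sensor_msgs.msg import CameraInfo, Image


class CameraInfoRelay(Node):
    def __init__(self):
        super().__init__('camera_info_relay')
        # BEST_EFFORT (sensor_data) subscriptions to avoid back-pressure upstream
        image_sub = message_filters.Subscriber(
            self, Image, '/oak/oak/rgb/image_rect', qos_profile=qos_profile_sensor_data)
        info_sub = message_filters.Subscriber(
            self, CameraInfo, '/oak/oak/rgb/camera_info', qos_profile=qos_profile_sensor_data)
        self.sync = message_filters.TimeSynchronizer([image_sub, info_sub], queue_size=10)
        self.sync.registerCallback(self.on_pair)
        # Hypothetical topic name for the re-published camera_info
        self.info_pub = self.create_publisher(CameraInfo, '/oak/oak/rgb/camera_info_rect', 10)

    def on_pair(self, image, info):
        # Re-publish camera_info only when a matching rectified image exists,
        # so downstream exact-time consumers always see 1:1 streams.
        self.info_pub.publish(info)


def main():
    rclpy.init()
    rclpy.spin(CameraInfoRelay())


if __name__ == '__main__':
    main()
```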

I would appreciate any comments or suggestions to recover from this issue.
