Skip to content

issue: 4409403 Fix heap corruption since c73d96a #337

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: vNext
Choose a base branch
from

Conversation

tomerdbz
Copy link
Collaborator

@tomerdbz tomerdbz commented Apr 22, 2025

Description

This PR fixes a critical heap corruption issue in TCP socket timer management
that was introduced in commit c73d96a.

What

Fix race condition between timer thread and socket deletion in TCP socket timers.

Why ?

A race condition was introduced in commit c73d96a that allowed the timer thread
to access socket objects after they had been deleted by the event handler thread,
causing heap corruption and crashes.

How ?

The fix improves synchronization between threads:

  1. Removes sockets from timer collections while holding the socket lock
  2. Creates a new deletion path that doesn't attempt to access timer collections
    after socket cleanup
  3. Additionally fixes a lock leak in the early return path of clean_socket_obj()

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

Performance impact

We now hold the socket lock a bit longer during cleanup to remove it from timer collections
This is a one-time cost during socket closure, not during normal operation
Only affects the cleanup path, not data transmission or reception.

It can affect CPS tests, but preventing heap corruption far outweighs that minor performance impact imho.

How this bug was found?

  1. sockperf UDP ping-pong mode recreated this issue - it was found during new config STP.
  2. git bisect found the commit
  3. Cursor helped with explaining the commit issues
  4. bug found
  5. sockperf didn't crash anymore :)

attaching a pic for a sequence diagram explaining the faulty flow that was fixed:
image

This commit fixes a critical race condition in timer management for TCP
sockets that was introduced in commit c73d96a.

The heap corruption was caused by a race condition between the timer
thread and socket destruction. Sockets could be deleted by the event
handler thread while still being referenced by the timer thread in the
timer collections, resulting in heap corruption when the timer thread
attempted to access the deleted memory.

In the original implementation, sockets were removed from timer
collections and deleted asynchronously without proper synchronization
with the timer processing thread.

Fix:
- Remove sockets from timer collections while still holding the socket
  lock, guaranteeing the timer thread cannot access sockets marked for
  deletion
- Create a simplified deletion path that doesn't attempt to access timer
  collections again after socket cleanup

Additionally, as an unrelated improvement, this patch fixes a lock leak
in the early return path of sockinfo_tcp::clean_socket_obj() where a
lock was acquired but not released when a socket was already marked as
cleaned.

The heap corruption stemmed from a fundamental architectural change that
separated socket objects from their timer management without providing
proper synchronization for the distributed socket lifecycle.

Signed-off-by: Tomer Cabouly <[email protected]>
@tomerdbz tomerdbz changed the base branch from master to vNext April 22, 2025 12:20
@tomerdbz tomerdbz requested review from pasis and galnoam April 23, 2025 10:51
@pasis
Copy link
Member

pasis commented Apr 23, 2025

As we discussed offline,
The "timer thread" shouldn't exist and running timers outside the internal thread (event handler thread) is problematic indeed. In the past, we removed such a flow when the tcp timer could run as part of sockinfo_tcp::tcp_con_unlock() causing issues.

First, we need to identify the flow to run tcp timer outside of the internal thread context and this flow will like need to be avoided.

Note, this comment assumes that delegated timers feature is off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants