Fix macOS segfault on close after hardware error #211

MrDOS · 2021-02-01T12:16:27Z

This fixes the JVM crash experienced on macOS after a hardware error occurs. The crash was introduced along with the core hardware-error-detection functionality in #172.

In order to generate the OUTPUT_BUFFER_EMPTY event, the monitor thread event loop periodically checks the status of the serial port output buffer, just as it checks for other state changes. On platforms which have it, and on Windows where it's emulated, it uses the TIOCSERGETLSR ioctl to get the information. On platforms which don't have that ioctl (macOS and FreeBSD <10), the monitor thread starts a separate “drain thread” underneath itself which loops on tcdrain(3) and generates the event whenever that call unblocks.

That drain thread is usually terminated by RXTXPort.interruptEventLoop(). However, because the monitor thread event loop now terminates itself internally upon encountering a hardware error, the drain thread cleanup in that interruption method wasn't being performed. The fix is to have the event loop call the interruption method when it detects a hardware failure, which ensures all of the normal monitor thread cleanup happens.
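To illustrate the shape of that fix, here is a minimal Java model of it. The real event loop and drain thread live in the native code, so everything here is a stand-in: the class name, `simulateHardwareError()`, and the sleep that takes the place of tcdrain(3) are all hypothetical. The only point is that the loop's self-termination goes through the same interruption method an external caller would use.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical Java model of the native monitor/drain thread pair.
class EventLoopCleanupSketch {
    private final AtomicBoolean stopping = new AtomicBoolean(false);
    private volatile boolean hardwareError = false;

    // Stand-in for the native drain thread that loops on tcdrain(3).
    private final Thread drainThread = new Thread(() -> {
        try {
            while (true) {
                Thread.sleep(10); // pretend to block until the output buffer drains
            }
        } catch (InterruptedException e) {
            // Interrupted by interruptEventLoop(): exit quietly.
        }
    }, "drain");

    EventLoopCleanupSketch() {
        drainThread.setDaemon(true);
    }

    // Single cleanup path, shared by external callers and by the loop itself.
    void interruptEventLoop() {
        if (!stopping.compareAndSet(false, true)) {
            return; // cleanup already started by someone else
        }
        drainThread.interrupt();
        try {
            drainThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Body of the monitor thread.
    void eventLoop() {
        drainThread.start();
        while (!stopping.get()) {
            if (hardwareError) {
                // Route self-termination through the normal cleanup path
                // instead of simply returning and orphaning the drain thread.
                interruptEventLoop();
                return;
            }
            Thread.yield();
        }
    }

    void simulateHardwareError() { hardwareError = true; }

    boolean drainThreadAlive() { return drainThread.isAlive(); }

    // Runs the failure scenario; true if both threads have stopped.
    static boolean demo() {
        EventLoopCleanupSketch port = new EventLoopCleanupSketch();
        Thread monitor = new Thread(port::eventLoop, "monitor");
        monitor.start();
        port.simulateHardwareError();
        try {
            monitor.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !monitor.isAlive() && !port.drainThreadAlive();
    }

    public static void main(String[] args) {
        System.out.println("all threads stopped: " + demo());
    }
}
```

Making the cleanup method idempotent (the `compareAndSet` guard) is an assumption of this sketch rather than something taken from the patch.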

However, doing that revealed another, rare race condition where the application thread could encounter the port failure and invoke RXTXPort.close() before the monitor thread began its shutdown (e.g., if the application is sitting on a tight loop around available(), as in the case of my DisconnectTest test case). This manifested in another segfault. The obvious solution was some locking around the internal state of the monitor thread. This already kind of existed in the form of the RXTXPort.MonitorThreadLock boolean and the RXTXPort.waitForTheNativeCodeSilly() method which waited for the flag to become unset in a 5-second sleep loop. I took this as an opportunity to replace that nonsense with a real lock. I chose a read/write lock rather than just synchronized blocks around an object because it more closely matches the semantics of the old behaviour: all of the I/O functions checked that the flag was clear, which is akin to sharing a read lock, and all of the notification reconfiguration methods set/cleared the flag, which is akin to an exclusive write lock.
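As a rough sketch of the locking semantics described above (not the code in this PR): I/O methods share the read lock, while anything that reconfigures or tears down the monitor thread takes the write lock exclusively. `monitorThreadState` is the lock field name used in the patch; the `monitorAlive` flag, the method names other than `available()`, and the placeholder return value are assumptions made for illustration.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model of read/write locking around monitor thread state.
class MonitorStateLockSketch {
    private final ReentrantReadWriteLock monitorThreadState = new ReentrantReadWriteLock();
    private boolean monitorAlive = true; // guarded by monitorThreadState

    // I/O path: many callers may hold the read lock at the same time.
    int available() {
        monitorThreadState.readLock().lock();
        try {
            if (!monitorAlive) {
                throw new IllegalStateException("port is closed");
            }
            return 0; // placeholder for the native call
        } finally {
            monitorThreadState.readLock().unlock();
        }
    }

    // Teardown path: exclusive, so the monitor thread and the application
    // thread cannot both run the cleanup. Returns true only for the caller
    // which actually performed it.
    boolean shutDownMonitor() {
        monitorThreadState.writeLock().lock();
        try {
            if (!monitorAlive) {
                return false; // someone else already cleaned up
            }
            monitorAlive = false;
            return true;
        } finally {
            monitorThreadState.writeLock().unlock();
        }
    }
}
```

Compared with polling a plain boolean in a sleep loop, a blocked caller here wakes as soon as the lock is released, and the check-then-act on the state happens atomically under the lock.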

This should fix the issue @d5smith raised in #197. It probably doesn't fix the original hang-on-close()/disconnect() issue @mvalla is facing, unless the new locking has accidentally cleaned up another concurrency issue.

Just cleaning up the event info struct isn't enough: on some platforms, there's a “drain loop” thread which runs underneath the monitor thread to watch for an empty output buffer. This thread needs to be stopped, too, or it'll segfault and bring down the whole JVM after the event info struct is freed.

The old `MonitorThreadLock` boolean field was only checked at a very slow interval (5s!), and, itself not being synchronized, was prone to race conditions. More importantly, it wasn't set during the monitor thread's self-cleanup after hardware failure, so under typical access patterns, the monitor thread and the application thread would both try to clean up the monitor thread simultaneously. This race condition could occasionally lead to a segfault (only reproduced on macOS, but I've no doubt it could happen elsewhere). I've also attempted to clean up some redundant flag fields, and consolidate setting the remaining fields to further avoid concurrency/reentrancy issues.

d5smith · 2021-02-01T15:01:43Z

I can confirm this fixes the behaviour I was seeing on macOS. Thanks!

MrDOS · 2021-02-01T15:07:50Z

Great! I have not yet tried this build on any platform other than macOS, nor have I tested behaviour other than failures (have not even run any data through it), so I want to do some more QC before merging. But thank you for trying it so quickly.

claui · 2024-02-08T17:18:13Z

@MrDOS According to preliminary tests, this appears to fix the crashes in interruptEventLoop on our x86_64 Linux boxes.

If combined with PR #249, all the crashes we’ve been getting at disconnect time no longer occur.

claui · 2025-01-08T15:21:42Z

src/main/java/gnu/io/RXTXPort.java

-			waitForTheNativeCodeSilly();
-			MonitorThreadAlive=true;
+			try {
+				this.monitorThreadReady.await();


@MrDOS Any error that occurs during initialise_event_info_struct causes the monitor thread to bail. As the thread dies without ever releasing the lock or sending the ready signal, the main thread stays blocked inside this await() call indefinitely.

We've been testing this PR for a couple of weeks on a monolithic JVM with several webapps and decent network traffic, and we keep encountering this hang once in a while. My theory is that a burst of filesystem/network activity temporarily pushes the number of open file descriptors past FD_SETSIZE, so the kernel hands out fd numbers over that limit, which (rightly) causes initialise_event_info_struct to fail, locking up the whole JVM.

claui · 2025-01-08T15:41:00Z

src/main/java/gnu/io/RXTXPort.java

+			} catch (InterruptedException e) {
+				z.reportln("Interrupted while waiting for the monitor thread to start!");
+			}
+			this.monitorThreadState.writeLock().unlock();


This might be a good place to check whether the monitoring thread's initialization phase was successful.
(If it wasn't, throw an exception and clean up the state so the client has a chance to retry.)
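A minimal sketch of what that check could look like, under stated assumptions: it uses a `CountDownLatch` plus a success flag rather than whatever primitive `monitorThreadReady` actually is, and the `Initialiser` interface is a hypothetical stand-in for the native initialise_event_info_struct call. The two ideas are that the monitor thread releases the waiter in a `finally` block, so a failed initialisation can't leave the caller blocked, and that the caller inspects the outcome before carrying on.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical startup handshake between the calling thread and the monitor thread.
class MonitorStartupSketch {
    // Stand-in for initialise_event_info_struct; returns false on failure.
    interface Initialiser {
        boolean init();
    }

    static Thread startMonitor(Initialiser initialiser) {
        CountDownLatch ready = new CountDownLatch(1);
        AtomicBoolean initialised = new AtomicBoolean(false);

        Thread monitor = new Thread(() -> {
            try {
                initialised.set(initialiser.init());
            } finally {
                // Always release the waiter, even if initialisation failed or threw.
                ready.countDown();
            }
            if (!initialised.get()) {
                return; // bail out without entering the event loop
            }
            // The event loop would run here.
        }, "monitor");
        monitor.setDaemon(true);
        monitor.start();

        try {
            ready.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the monitor thread to start", e);
        }
        if (!initialised.get()) {
            // Surface the failure so the client has a chance to clean up and retry.
            throw new IllegalStateException("Monitor thread failed to initialise");
        }
        return monitor;
    }
}
```

Which exception type to throw, and what state needs resetting before a retry, would depend on the surrounding code; `IllegalStateException` is only a placeholder here.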

MrDOS added 2 commits February 1, 2021 11:10

This was referenced Feb 1, 2021

In case of HARDWARE_ERROR event serial.disconnect() never returns #197

Open

5.2.1 crash in multi-thread context #188

Open

MrDOS mentioned this pull request Mar 1, 2021

After unplug USB cable, java program crashed. #212

Closed

MrDOS mentioned this pull request May 4, 2021

Fix FD leak when checking lock dir permissions #216

Merged

MrDOS added this to the v5.2.2 milestone May 4, 2021

wborn mentioned this pull request May 7, 2021

Workaround causes serial port discovery issues openhab/org.openhab.binding.zigbee#577

Open

MrDOS mentioned this pull request Aug 25, 2023

Update to Gradle 8; resuscitate CI #238

Open

MrDOS mentioned this pull request Sep 5, 2023

Two years without a release, any plan for a new release? #243

Closed

claui mentioned this pull request Feb 8, 2024

Fix segfault if file descriptor unavailable #249

Merged

claui mentioned this pull request Feb 9, 2024

Crash in interruptEventLoop trying to dereference garbage pointer #251

Open

claui suggested changes Jan 8, 2025
