Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Software caused connection abort on network disconnection #173

Open
5 of 8 tasks
Shusek opened this issue Dec 13, 2021 · 19 comments
Open
5 of 8 tasks

Software caused connection abort on network disconnection #173

Shusek opened this issue Dec 13, 2021 · 19 comments
Labels
bug Something isn't working

Comments

@Shusek
Copy link

Shusek commented Dec 13, 2021

Describe the bug

During a websocket connection when user turn off wifi connection he got "Software caused connection abort" exception. This exception is uncatchable so it cause application crash. I'm debugging which component is causing this error but it's pretty hard if anyone could confirm its occurrence it was great.

Reproduction and additional details

  1. Connect to stomp by secure protocol
  2. Turn off wifi
  3. Crash occured

Context

  • Krossbow version: 2.7.0
  • Kotlin version: 1.6.0
  • Kotlin target(s): Android
  • org.conscrypt:conscrypt-android:2.5.2
  • okio 3.0.0
  • ktor 1.6.7

Artifacts used:

  • krossbow-stomp-core
  • krossbow-stomp-jackson
  • krossbow-stomp-kxserialization
  • krossbow-websocket-core
  • krossbow-websocket-sockjs
  • krossbow-websocket-spring
  • krossbow-websocket-okhttp
  • krossbow-websocket-ktor
@Shusek Shusek added the bug Something isn't working label Dec 13, 2021
@Shusek Shusek changed the title Software caused connection abort on network disconnection Software caused connection abort on network disconnection Dec 13, 2021
@Shusek
Copy link
Author

Shusek commented Dec 13, 2021

After debugging, I discovered the culprit.
Inside KtorWebSocketConnectionAdapter when anyone tries to use outgoing the channel (eg. sendText method) when network connection is closing this will raise an exception.

Overall, this is consistent with the documentation:

Enqueue frame, may suspend if outgoing queue is full. May throw an exception if outgoing channel is already closed so it is impossible to transfer any message. Frames that were sent after close frame could be silently ignored. Please note that close frame could be sent automatically in reply to a peer close frame unless it is raw websocket session.

Summarizing the Krossbow should catch such errors or not allow their occurrence.

@joffrey-bion
Copy link
Owner

Thanks a lot for the report, and for the investigation.

Inside KtorWebSocketConnectionAdapter when anyone tries to use outgoing the channel (eg. sendText method) when network connection is closing this will raise an exception.

This seems appropriate.

Summarizing the Krossbow should catch such errors or not allow their occurrence.

I'm not sure I'm following. It seems legitimate to me that trying to send frames on a closed session throws an exception. Do you mean that an exception can occur due to automatically sent frames without programmer error?

@Shusek
Copy link
Author

Shusek commented Dec 14, 2021

I include sample code:


    override fun onViewCreated(view: View, savedInstanceState: Bundle?) {
        viewLifecycleOwner.lifecycleScope.launch {
            try {
                provideStop().connect(
                    BuildConfig.SOCKET_URL,
                    host = null
                ).withJsonConversions(json = JsonParser)

                // disconnect wifi connection

                delay(30000)
            } catch (throwable: Throwable) {
                // the error is not caught 
            }
        }
    }

The reason is sending hearthbeat after socket is aborted and it causes the error:

java.net.SocketException: Software caused connection abort
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:119)
    at java.net.SocketInputStream.read(SocketInputStream.java:176)
    at java.net.SocketInputStream.read(SocketInputStream.java:144)
    at okio.InputStreamSource.read(JvmOkio.kt:93)
    at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:125)
    at okio.RealBufferedSource.request(RealBufferedSource.kt:206)
    at okio.RealBufferedSource.require(RealBufferedSource.kt:199)
    at okio.RealBufferedSource.readByte(RealBufferedSource.kt:209)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.kt:119)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.kt:102)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.kt:293)
    at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:195)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.lang.Thread.run(Thread.java:923)

I/Process: Sending signal. PID: 23274 SIG: 9

In my opinion, heartbeat should not be sent when socket is corrupted. To catch such exceptions for this moment i can use CoroutineExceptionHandler inside launch method.

@joffrey-bion
Copy link
Owner

joffrey-bion commented Dec 14, 2021

Got your point, thanks a lot for the sample code. It is indeed incorrect for Krossbow to try to send heart beats when the session is already closed. I will look into that.

The try/catch around the connect call only works for catching exceptions during the connection. Once the session is connected, other exceptions will be thrown through subscription flows, or when calling session methods like send. They can also be accessed via an instrumentation if one is defined in the Krossbow config, if you need a global way of getting "informed". So this part is kind of expected.

@Shusek
Copy link
Author

Shusek commented Jan 7, 2022

The worst of this bug is it that reconnect mechanism does not recognize such a case. Even when you call withAutoReconnect and something will cut your wifi connection AutoReconnect it will not work.

@joffrey-bion
Copy link
Owner

joffrey-bion commented Jan 7, 2022

I'm confused, does this happen with Ktor or OkHttp? The stack trace points to OkHttp (without any Krossbow methods), but you mention Ktor's outgoing channel.

Also, it'd be great to be able to reproduce this outside of Android's environment, so we could write a test case for it.

Last but not least, do you reproduce the issue in Krossbow 3.1.0?

@Shusek
Copy link
Author

Shusek commented Jan 7, 2022

Yes, it is reproducible on 3.1.0.
KTOR library works like a proxy and has a lot net engine inside (OkHttp, Jetty etc.). In my case i use okHttp like that

val ktorClient = HttpClient(OkHttp) {
            install(WebSockets)
}

But for me it shouldn't matter, probably it can be reproduced on the plain OkHTTP without ktor.
sendText is called by HearthBeat action sendHeartBeat and exception invokes in this method:

    override suspend fun sendText(frameText: String) {
        wsSession.outgoing.send(Frame.Text(frameText))
    }

If I disable HearthBeat action SocketException in this case it does not appear anymore.

Simplest hack-workaround for that is global handler can also be added and ignore this exception like that:

         val webSocketClient =
            KtorWebSocketClient(httpClient).withAutoReconnect(
                ReconnectConfig(
                    maxAttempts = Int.MAX_VALUE
)
            )
        val client = StompClient(webSocketClient, StompConfig().apply {
            heartBeat = HeartBeat(5.seconds, 13.seconds)
        })
        val errorHandler = CoroutineExceptionHandler { _, exception ->
            ErrorLogger.logError(exception)
        }
        return client.connect(
            url,
            host = null,
            sessionCoroutineContext = EmptyCoroutineContext + errorHandler
        ).withJsonConversions(json = JsonParser)

@joffrey-bion
Copy link
Owner

joffrey-bion commented Jan 7, 2022

Thanks a lot for the details. Ok so you're using krossbow-werbsocket-ktor with Ktor client using OkHttp engine, thanks.

sendText is called by HearthBeat action sendHeartBeat and exception invokes in this method

If you confirmed this by debugging then that's ok, thanks. I just didn't get why the stacktrace you provided wouldn't include that. Indeed if an error on the socket occurs, the heart beat shouldn't be sent. I can think of 2 possibilities where this could happen in 3.1.0 (maybe there are more):

  1. a heart beat tick happens between the actual websocket error/closure and the cancellation of the heartbeat job
  2. the websocket implementation doesn't bubble up the error at all (Ktor's incoming frames channel doesn't throw) - IMO this would be a bug in the websocket implementation

To prevent the first one, we could cancel the heart beat job early, but that would still be racing with the error bubbling. So it would still be possible that the heart beat kicks in while the error callbacks haven't reached the point where the session is aborted and all internal coroutines cancelled.

In order to prevent heart beat failures, we could check if the websocket is still open before attempting to send a heart beat (canSend). In that case we probably also want to shutdown the session if it can't accept heart beats, in order to fix the problem 2 as well.

As a side note, both StompClient's constructor and withReconnectConfig accept lambdas for the configuration, so you don't have to write Config().apply { ... } yourself:

val webSocketClient = KtorWebSocketClient(httpClient).withAutoReconnect {
    maxAttempts = Int.MAX_VALUE
}

val client = StompClient(webSocketClient) {
    heartBeat = HeartBeat(5.seconds, 13.seconds)
}

@mahmed1987
Copy link

hello @joffrey-bion .

I am facing this exact same issue in version 7.0.0

This was resolved ?

I have my heartbeats set and I was testing a few things in my android application. My backend is using spring boot
I took my application to the background for a while and put my cellphone to sleep mode.

I rightfully show on the logs of my spring boot backend that the heartbeats stopped appearing and the server destroyed my session.

When I bought my application back to live , the application crashed with the "Software caused connection abort".

I was able to hit this breakpoint before the app crashed
/Users/ahmed/.gradle/caches/modules-2/files-2.1/org.hildan.krossbow/krossbow-stomp-core-iosarm64/7.0.0/aa21a035596090a27ec803e96eb75a6a72dd104a/krossbow-stomp-core-iosarm64-7.0.0-sources.jar!/commonMain/org/hildan/krossbow/stomp/BaseStompSession.kt:83

private suspend fun shutdown(message: String, cause: Throwable? = null) {
        // cancel heartbeats immediately to limit the chances of sending a heartbeat to a closed socket
        heartBeaterJob?.cancel()
        // let other subscribers handle the error/closure before cancelling the scope
        awaitSubscriptionsCompletion()
        scope.cancel(message, cause = cause)
    }

The app control was at " heartBeaterJob?.cancel()" after which it crashed

image image

@joffrey-bion joffrey-bion reopened this Nov 21, 2024
@joffrey-bion
Copy link
Owner

Yes this should have been fixed a long time ago. Thanks for reporting that it happens again, and for the details to help reproduce it. I reopened this issue to have another look.

Could you please share a little more details about which artifacts you were using? Are you using Ktor with the OkHttp engine? Or OkHttp directly?

@mahmed1987
Copy link

Oh yes , I am actually using Ktor engine .

@joffrey-bion
Copy link
Owner

Thanks! And did you have a chance to try this experiment with the latest version of Krossbow?

@mahmed1987
Copy link

yes indeed, my krossbow version is krossbow-version = "7.0.0"

@joffrey-bion
Copy link
Owner

joffrey-bion commented Nov 21, 2024

If it's not too much hassle, could you please confirm that you can still reproduce with Krossbow 8.0.0? I don't think the changes in Krossbow 8 should change this specific behaviour, but I'd just like to be sure to avoid investigating a ghost :)

@mahmed1987
Copy link

@joffrey-bion that is correct. Let me test this out with version 8 and revert back to you

@mahmed1987
Copy link

mahmed1987 commented Nov 21, 2024

@joffrey-bion surprisingly , that issue is not appearing in 8.0.0 despite your comment that you didn't change anything that was relevant to this issue .

Lets chalk this one up to some random act of God :D . You can close the issue while I monitor more closely and would revert back to you if i see some unexpected behavior .

Thanks again

@joffrey-bion
Copy link
Owner

Thank you so much for testing this! Ok I'll close for now, and we'll reopen if it pops up again.

@joffrey-bion joffrey-bion added this to the 3.1.1 milestone Nov 21, 2024
@mahmed1987
Copy link

hey @joffrey-bion so I was able to consistently throw this exception .

It maybe occurs when the coroutine scope has been already cancelled (for example because of network disconnection), yet we try to access it via calling some of the stompSessions functions (this is just a hypothesis)

I have a code snippet which can always throw this crash. Keep in mind this code is just to illustrate the situation and we do not do this in production. However that been said I still feel that this exception should somehow be caught .

Here is the code

val messages: Flow<String> = subscriptionPayload.flatMapLatest { payload ->
        callbackFlow {
            val job = launch {
                confess("Socket: Subscribing to room ${payload.roomId}")
                stompSession
                    ?.subscribe(StompSubscribeHeaders(destination = "/topic/room/${payload.roomId}"))
                    ?.collect {
                        send(it.bodyAsText)
                    }
            }
            awaitClose {
                confess("Socket: Unsubscribing from room ${payload.roomId}")
                job.cancel()
            }
        }
            .catch { e ->
                stompSession?.disconnect() // this would crash with the Software Abort exception
            }

    }

This is how we connect to stomp

suspend fun connect() {
        try {
            val coroutineContext = createSessionContext() + errorHandler
            stompSession = stompClient.connect(
                connectionSettings.stompBaseUrl,
                customStompConnectHeaders = mapOf(
                    "Authorization" to RequestHeader.getBearerToken()
                ),
                sessionCoroutineContext = EmptyCoroutineContext + coroutineContext
            )
            confess("Connection Succeed")
            updateConnectionState(SharedSocketState.Connected)
        } catch (ex: Exception) {
            handle exception
            stompSession = null
        }
    }

So what I have understood so far is that if a catch has invoked , there needn't any reason to call disconnect , since an abnormal termination is indicative that a disconnection has already taken place.

That been said maybe you can make more sense of why this occurs given the above ?

I just posted here because last time we spoke we were a bit unsure as to how to reproduce this .

Thanks

@joffrey-bion
Copy link
Owner

Thank you so much for sharing this. I'll reopen the issue so I can take a look when I have some free time.

@joffrey-bion joffrey-bion reopened this Dec 29, 2024
@joffrey-bion joffrey-bion removed this from the 3.1.1 milestone Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants