
[THIN Mode] High CPU Usage and Node.js Event Loop Stuck #1700

Open
Fiorello opened this issue Oct 25, 2024 · 17 comments

@Fiorello

1. What versions are you using?
   • node-oracledb 6.6.0
   • Node.js v21.7.3
   • Oracle Database 12c Standard Edition Release 12.2.0.1.0 - 64bit Production
   • NestJS ^10.3.10
   • TypeORM 0.3.20

2. Is it an error or a hang or a crash?
   No error is thrown by node-oracledb, but the process becomes unresponsive.

3. What error(s) or behavior are you seeing?
   The process reaches 100% CPU usage and no further operations are executed.

We recently upgraded node-oracledb to version 6.6.0.
The Thin driver is awesome: it uses less memory and less CPU than Thick.
But after a day or two (at random) the process stops executing operations: the web server stops receiving requests, and the background tasks are no longer executed either.
So I started a debug session with the Chrome performance tool and took a screenshot (attached).

What I see is that the process is stuck inside a while loop of bgThreadFunc in lib/thin/pool.js.
For the moment we have switched to the Thick driver and it works, but of course I'd like to use the Thin version.
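If I read the profile correctly, the shape of the problem is something like this toy sketch (my illustration, not the driver's actual code): an async loop that keeps taking a branch with no await never yields back to the event loop.

```js
async function createConnections(n) {
  // hypothetical expansion work, stands in for real pool growth
}

// Toy sketch (not the real lib/thin/pool.js): when every iteration takes
// a branch without an `await`, the async function never yields, so timers,
// I/O and HTTP handling all stop while the CPU sits at 100%.
async function bgThreadFunc() {
  for (;;) {
    const numToCreate = 0;            // nothing to expand this round
    if (numToCreate === 0) continue;  // no await on this path: never yields
    await createConnections(numToCreate);
  }
}

// If bgThreadFunc() is started, a probe like
//   setInterval(() => console.log('tick'), 1000)
// never fires again.
```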

Does anybody have similar problems?
CPU-20241017T222332.cpuprofile

Fiorello added the bug label Oct 25, 2024
@sudarshan12s

sudarshan12s commented Oct 28, 2024

Hi @Fiorello,
Do you start receiving the error NJS-040: connection request timeout when the process stops? I see around 24 seconds consumed by bgThreadFunc. Does it never return to a normal state?

If you have statistics enabled (enableStatistics: true) and stats (getStatistics()) collected at regular intervals, please share them.
I want to check whether the busy connections are completing (requestsDequeued) and hence releasing connections back to the pool to serve new incoming requests.
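For reference, the collection could look something like this sketch (the connection details are placeholders and the interval is arbitrary):

```js
const oracledb = require('oracledb');

async function initPool() {
  const pool = await oracledb.createPool({
    user: 'app_user',                     // placeholder
    password: process.env.DB_PASSWORD,    // placeholder
    connectString: 'dbhost/servicename',  // placeholder
    enableStatistics: true,
  });

  // Log pool statistics every 30s; if requestsDequeued keeps rising,
  // queued requests are being served and connections returned to the pool.
  setInterval(() => {
    console.log(JSON.stringify(pool.getStatistics()));
  }, 30000).unref();

  return pool;
}
```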

Can you also let us know what poolMin, poolMax, and poolIncrement values are configured, and whether a large number of incoming requests was received before you saw the process stop? It would help if you could share high-level code showing the calls made by the application.

On a separate note:

bgThreadFunc could be expanding the pool in small poolIncrement steps while the number of incoming requests is too large. Increasing poolIncrement gets the expansion done in fewer, larger steps, and keeping poolMin high enough means most of a storm of incoming connection requests can be served from the free list; raising poolMin therefore reduces the expansion work during the storm. How to tune these parameters depends on the application's incoming traffic.
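As an illustration only (the numbers are made up, not recommendations):

```js
const oracledb = require('oracledb');

// Illustrative sizing: a high poolMin serves most bursts from the free
// list, and a larger poolIncrement grows the pool in fewer, bigger steps.
async function initPool() {
  return await oracledb.createPool({
    user: 'app_user',                     // placeholder
    password: process.env.DB_PASSWORD,    // placeholder
    connectString: 'dbhost/servicename',  // placeholder
    poolMin: 50,
    poolMax: 100,
    poolIncrement: 10,
  });
}
```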

@Fiorello

Hi Sudarshan, thanks for your time.
Let me give you some information about the context of the problem.

The problem was initially found in a production build, compiled with pkg and therefore impossible to inspect.
So I tried to reproduce it locally using a tool called k6, with 10000 virtual users making 5 requests each.
Given the large number of requests, the connection pool was not big enough and I often received the error NJS-040: connection request timeout.
But when I stopped k6, I expected the process to return to normal operation; instead it stayed blocked inside the while loop of bgThreadFunc.

I tried this debug session with different pool configurations: first my production configuration, with poolMax 10 and all other options at their defaults; then, after reading the node-oracledb docs, with poolMax 10, poolMin 10 and poolIncrement 0.
In both cases the problem was the same.

In the CPU profile you only see 24 seconds because I stopped the profiler, but the process remained blocked for several minutes without executing any other Node task.
I also added a setInterval callback that printed a console.log message every 5 seconds; when the process was stuck, the console.log was never executed again.
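The probe was essentially this (reconstructed from memory):

```js
// Liveness probe: if the event loop is blocked, the message stops appearing.
const started = Date.now();
setInterval(() => {
  const seconds = Math.round((Date.now() - started) / 1000);
  console.log(`event loop alive after ${seconds}s`);
}, 5000);
```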

About statistics: I didn't enable them during that debug session. I will enable them and come back to you with the results.

@sudarshan12s

sudarshan12s commented Oct 29, 2024

Thanks for reproducing it locally. Can you add a console.log inside bgThreadFunc after the this._poolCloseWaiter check, as shown below, and confirm whether you see the print 'Nothing to expand to serve _pendingRequests' in your local tests? I am still working on a test case and a fix; I just want to confirm that this is a scenario that can actually be hit.

Please note: the patch below is not a fix.


```diff
      // when pool is closing, break from while loop
      if (this._poolCloseWaiter) {
        break;
      }
+     if (this._pendingRequests.length > 0 && this._freeConnectionList.length === 0 && numToCreate === 0) {
+       console.log(' Nothing to expand to serve _pendingRequests');
+     }
```

@Fiorello

Fiorello commented Oct 29, 2024

It seems it doesn't enter the for loop because the numToCreate value is 0.

[screenshot]

@sudarshan12s

There was a small typo in the code above (this._freeConnectionList.length); I've edited it.

@Fiorello

Fiorello commented Oct 29, 2024

At the moment I have the profiler paused in that state.
As you can see from the attached screenshot, this._pendingRequests.length === 3 and this._freeConnectionList.length === 0, so it should print your console.log.
I'm going to test it anyway; I'll let you know.

[screenshot]

@sudarshan12s


Thanks. I can see numToCreate is also 0 at this point; I think it is in a tight loop there. I will try to simulate it with tests.

@Fiorello

I've applied your changes.
At the beginning the console.log is executed, but after a while the CPU grows to 100%; at that point I see one item in freeConnectionList, so the console.log is not executed anymore.

[screenshot]

@sudarshan12s

Can you please remove lines 488 and 489, modify it like this, and try again?

```js
if (this._pendingRequests.length > 0 && this._freeConnectionList.length === 0 && numToCreate === 0) {
  console.log(' Nothing to expand to serve _pendingRequests');
}
```

@Fiorello

The behaviour has changed: it ran for longer than before (about 8 minutes) before getting stuck.
This happened because this._freeConnectionList.length === 3, so it does not pass the check.

[screenshot]

@sudarshan12s

I mean you can remove the await and the continue, and keep only the console.log:

```js
if (this._pendingRequests.length > 0 && this._freeConnectionList.length === 0 && numToCreate === 0) {
  console.log(' Nothing to expand to serve _pendingRequests');
}
```

@Fiorello

I tried with only the console.log inside the check.
When the process hangs, only the console.log is executed; everything else is not.

@sudarshan12s

sudarshan12s commented Oct 30, 2024

Thanks for confirming. I was trying to reach this state with a test but was not successful.
Is it possible to provide a high-level test reproducing the issue, including the k6 options used? For example:

```js
const vus = 10000;          // virtual users, similar to `-c` in ab

export const options = {
  vus: vus,
  iterations: vus * 5,      // 5 requests per virtual user
  // duration: '10s',       // alternative: fixed test duration
};
```
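For completeness, a minimal runnable k6 script around these options might look like this (the URL is a placeholder for whatever route hits the database):

```js
import http from 'k6/http';

const vus = 10000;
export const options = {
  vus: vus,
  iterations: vus * 5, // 5 requests per virtual user
};

export default function () {
  // Placeholder endpoint: point this at a route that uses the Oracle pool.
  http.get('http://localhost:3000/api/items');
}
```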

If I can reproduce it locally, I can try out a few fixes. If possible, can you try the fix below with 6.6.0?

```diff
diff --git a/lib/pool.js b/lib/pool.js
index 566189849..1d7436ed6 100644
--- a/lib/pool.js
+++ b/lib/pool.js
@@ -423,7 +423,7 @@ class Pool extends EventEmitter {
       throw err;
     }

-    if (this._connectionsOut >= poolMax ||
+    while (this._connectionsOut >= poolMax ||
         this.status === constants.POOL_STATUS_RECONFIGURING) {

       // when the queue is huge, throw error early without waiting for queue
```
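The intent, illustrated with a toy pool (simplified, not the actual driver internals): a woken waiter must re-check its condition, because another caller may have taken the freed slot first.

```js
class TinyPool {
  constructor(max) {
    this.max = max;       // cap on checked-out connections (like poolMax)
    this.out = 0;
    this.waiters = [];
  }

  async acquire() {
    // With `if` instead of `while`, two waiters woken close together
    // could both proceed past the check and exceed the cap.
    while (this.out >= this.max) {
      await new Promise((resolve) => this.waiters.push(resolve));
    }
    this.out++;
  }

  release() {
    this.out--;
    const waiter = this.waiters.shift();
    if (waiter) waiter();  // wake one waiter; it re-checks the condition
  }
}
```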

@sudarshan12s

sudarshan12s commented Oct 30, 2024

I was able to replicate a somewhat similar issue, but I'm not sure it is the same one you are facing.
The above patch is not a complete one; can you try the above change and let us know if it works for your tests? Thanks.

@Fiorello

I ran two debugging sessions, one for 30 minutes and one for 20, without problems.
It seems that the patch actually works.

@sudarshan12s


Thanks @Fiorello for the confirmation. We will share details on the patch.

@Fiorello

Fiorello commented Nov 4, 2024

Thank you!
