Performance degradation in multi-threaded application on RHEL 8 #603
Before assigning RHEL 8 the blame, can you do an apples-to-apples comparison of Python 2 vs Python 3 on the same OS release?
With cx_Oracle 8.3.0? Does that still work with Python 2.7? (On PyPI it says tested with 3.6+.)
No. cx_Oracle 8.3 requires a minimum of Python 3.6. Can you clarify the comparison? I thought you were just running cx_Oracle 8.3 on Python 3.10 on both RHEL 7.2 and RHEL 8.5. Is that not the case? If not, please run the same code (whether Python 2 with cx_Oracle 7.3 or Python 3 with cx_Oracle 8.3) on both machines. Thanks!
The RHEL 7.2 vs RHEL 8.5 comparison used the same Python version (3.10) and cx_Oracle (8.3.0).
So that would suggest that the OS is the culprit here (or some change in the OS that interacts badly with threaded Oracle clients). I don't know what that might be, unfortunately! @cjbj might be able to help further, hopefully.
Do you see any variation if you try different versions of the Oracle Client libraries? Are you using session pooling? If not, have you set the …
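(For context, a minimal cx_Oracle session-pool sketch; the connection details, pool sizes, and query here are illustrative, not taken from the issue:)

```python
import cx_Oracle

# Illustrative values only; real credentials/DSN would come from the app.
pool = cx_Oracle.SessionPool(
    user="user", password="password", dsn="dbhost/service",
    min=2, max=10, increment=1, threaded=True,
)

conn = pool.acquire()          # borrow a pooled connection
try:
    cur = conn.cursor()
    cur.execute("select 1 from dual")
    print(cur.fetchone())
    cur.close()
finally:
    pool.release(conn)         # return the connection to the pool
```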
When I asked around, it was pointed out that neither of those OS versions has a supported Python 3.10 package, and that you should verify your Python binaries are compiled properly for each distro; I wasn't given any tips on how to do that!
We have investigated this further in the context of our application. The old version of our application runs on Python 2.7.10 / cx_Oracle 7.3.0; the new version runs on Python 3.10.0 / cx_Oracle 8.3.0. However, we found another major difference: the old version runs the Python process with CPU affinity, whereas the new version does not set CPU affinity. After setting CPU affinity for the Python 3 process of the new version (affining it to a single CPU), we started to get performance similar to the old version.

It seems that calls to cx_Oracle Cursor.close() are much faster with CPU affinity under heavy load (a large number of threads, e.g. 50). Without CPU affinity these calls are really expensive, especially when the cursor was used to fetch a large number of columns (e.g. 15-20). Maybe this is something that can be mentioned in the cx_Oracle docs, as in our case the performance difference is quite dramatic. For example, with 50 threads that each execute 365 queries and close the cursor after each query, the accumulated time spent in Cursor.close() per thread is:

with CPU affinity: 0.0065s
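(For reference, a minimal sketch of pinning a process to a single CPU on Linux; os.sched_setaffinity is the standard-library call, and the choice of CPU 0 is arbitrary. Launching the process under "taskset -c 0" achieves the same thing:)

```python
import os

# Pin the current process (pid 0 means "this process") to CPU 0 only.
# Threads created by this process inherit the affinity mask.
os.sched_setaffinity(0, {0})

print("running on CPUs:", os.sched_getaffinity(0))  # -> {0}
```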
That's great - thanks for digging into it and for sharing the results. It jogs a distant memory of a discussion about thread stickiness during development of our Node.js API. We can certainly make a note about CPU affinity in our tuning doc. It also gives us an opportunity to do an internal analysis of the behavior. |
Before my colleague, @mariustantareanu, discovered that affining the python3 application to a single CPU restores performance back to what we see in python2, I looked deeply into this apparent performance degradation from cxo73/python2 to cxo83/python3. My colleague recommended that I post these findings here in case they are of interest. I'll attach a copy of the "cx.py" test harness; perhaps it might motivate code and/or documentation changes. What I found was that there is a general degradation of throughput performance from python2 to python3 involving the following cofactors:
With the above cofactors in place, the key difference between python2 and python3 is the significant change to the GIL implementation, from being a simple mutex to involving multiple mutexes with a goal of better "fairness". This enhanced GIL implementation is described in the comments of "https://github.com/python/cpython/blob/v3.6.8/Python/ceval_gil.h". In the specific case of cx_Oracle, a high amount of GIL chatter is produced during the closing of cxoCursor objects, which eventually causes all of the allocated cxoVar objects to be freed. Below is a typical stack trace of one of these cxoVar_free functions trying to regrab the GIL in python3:
And below is the part of the cxoVar_free function in play:
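(The snippet did not survive the copy; a sketch of the pattern in question, paraphrased from cx_Oracle's cxoVar.c, so the exact code may differ. The relevant part is the GIL release/re-acquire wrapped around dpiVar_release:)

```c
/* Paraphrased from cx_Oracle's cxoVar.c; exact details may differ.
   The GIL is released and re-acquired around each dpiVar_release call,
   once per variable, while the cursor's variables are being freed. */
static void cxoVar_free(cxoVar *var)
{
    if (var->handle) {
        Py_BEGIN_ALLOW_THREADS        /* release the GIL ...          */
        dpiVar_release(var->handle);  /* ... free the ODPI-C variable */
        Py_END_ALLOW_THREADS          /* ... then regrab the GIL      */
        var->handle = NULL;
    }
    Py_CLEAR(var->inConverter);
    Py_CLEAR(var->outConverter);
    Py_TYPE(var)->tp_free((PyObject*) var);
}
```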
Normally we might look at something like this and say it is normal GIL contention behavior. But the contention is much worse under python3's new, more sophisticated and fairer GIL: releasing the GIL when it isn't really necessary can be more costly in a multi-threaded, high-load python3 context. The specific table we noticed this issue on has ~70 columns, so the cxoCursor object would have ~70 cxoVar objects allocated, which then involves ~70 GIL release/regrab cycles per cursor close. This performance degradation can also be reproduced with multiple threads looping around "time.sleep(0)", which releases/regrabs the GIL without doing any real I/O.

I further investigated whether this GIL release around dpiVar_release was really necessary. For example, it appears that the GIL is not released for the "cxoVar_externalSetValue -> dpiVar_release" code path. Also, from a cursory review of the dpiVar_release implementation, no obvious signs of I/O or other lock negotiations were observed that might warrant releasing the GIL.

For testing, a binary patch of the cxo shared object was prepared that replaces the calls to release and grab the GIL around this call to dpiVar_release in cxoVar_free with NOPs. Below are python commands to replace the GIL grab/release with NOPs for cxo73 and cxo83 respectively (addresses obtained with "$ objdump -d <cxo.so>|grep -A13 cxoVar_free.:"):
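(The original commands with the real offsets were attached to the issue and are not reproduced here; as a generic sketch of the technique, with placeholder offsets that must not be used as-is:)

```python
# Sketch only: the offsets/sizes below are placeholders, NOT the real values
# from the issue. The actual call sites are the GIL save/restore calls
# (PyEval_SaveThread / PyEval_RestoreThread, which Py_BEGIN/END_ALLOW_THREADS
# expand to) located in the objdump disassembly of cxoVar_free.
NOP = b"\x90"                              # x86-64 one-byte NOP
patches = [(0x000000, 5), (0x000000, 5)]   # (file offset, byte count) -- hypothetical

with open("path/to/cx_Oracle.so", "r+b") as so:   # path is a placeholder
    for offset, size in patches:
        so.seek(offset)
        so.write(NOP * size)               # overwrite the call with NOPs
```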
This "nogil" version showed improved performance in both python2 and python3, but especially python3. Below are the raw results from the test harness for all eight combinations of {py2|py3},{gil|nogil},{affined|not-affined} executing 500 iterations in each of 20 threads on a Linux VM with 8 CPU cores:
Thanks for the detailed explanation. Looking at the implementation of …
After we upgraded various software as part of a switch from Python 2.7 to 3.x, we noticed that under heavy load our multi-threaded application had a lot more active connections and increased execution time.
Taking our application out of the equation, a simple Python script (below) still shows a slowdown when going from single-threaded to multi-threaded.
The following are some rough timings on RHEL 8.5 with Python 3.10 and cx_Oracle 8.3.0 when performing 100k queries on a single thread vs distributed across 10 threads (10k per thread):
The table used here isn't anything special; it has only about 200 entries, so most of the queries will return None anyway.
If I run the script on a RHEL 7.2 server (targeting the same DB) I get better results than on RHEL 8.
Before this, we saw no problems with Python 2.7 / cx_Oracle 7.3.0 under RHEL 6.x or 7.x.
If you need other data from our environment to help narrow things down, let me know (@mariustantareanu can also respond to questions if I am not available).
As I said, the table itself doesn't really matter. I used the following Python script (with some strings for custom paths and users/passwords removed):
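(The script body did not survive the copy; a minimal sketch of the kind of test described, with placeholder connection details and table/column names:)

```python
import threading
import time

import cx_Oracle

# Placeholders: the real DSN, credentials, and table were removed in the original.
USER, PASSWORD, DSN = "user", "password", "dbhost/service"
NUM_THREADS = 10
QUERIES_PER_THREAD = 10_000

def worker():
    # Each thread gets its own connection and runs its share of the queries.
    conn = cx_Oracle.connect(USER, PASSWORD, DSN)
    for i in range(QUERIES_PER_THREAD):
        cur = conn.cursor()
        cur.execute("select value from lookup_table where id = :1", [i])
        cur.fetchone()  # most lookups miss, so this is usually None
        cur.close()
    conn.close()

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{NUM_THREADS * QUERIES_PER_THREAD} queries in {time.time() - start:.1f}s")
```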