
Performance degradation in multi-threaded application on RHEL 8 #603

Open
fv-mf opened this issue Feb 10, 2022 · 12 comments

Comments

@fv-mf

fv-mf commented Feb 10, 2022

  1. What versions are you using?
DB: Oracle Database 12c Standard Edition Release 12.1.0.2.0
platform.platform: Linux-4.18.0-348.12.2.el8_5.x86_64-x86_64-with-glibc2.28
sys.maxsize > 2**32: True
platform.python_version: 3.10.0
cx_Oracle.version: 8.3.0
cx_Oracle.clientversion: (18, 5, 0, 0, 0)
  2. Describe the problem

After upgrading various software as part of our switch from Python 2.7 to 3.x, we noticed that under heavy load our multi-threaded application had many more active connections and increased execution times.

Taking our application out of the equation, a simple Python script (below) still shows the slowdown when going from single-threaded to multi-threaded.

The following are some rough timings on RHEL 8.5 with Python 3.10 and cx_Oracle 8.3.0 when performing 100k queries on a single thread vs. distributed across 10 threads (10k per thread):

time /tmp/cx.py -t 1 -r 100000

real    1m35.427s
user    0m59.392s
sys     0m7.678s

time /tmp/cx.py -t 10 -r 10000

real    4m10.947s
user    3m26.608s
sys     5m30.610s

The table used here isn't anything special; it only has about 200 rows, so most of the queries return None anyway.

If I run the script on a RHEL 7.2 server (targeting the same DB) I get better results than on RHEL 8:

time /tmp/cx.py -t 10 -r 10000

real    1m46.939s
user    1m18.926s
sys     1m26.848s

Before this, we saw no problems with Python 2.7 / cx_Oracle 7.3.0 under RHEL 6.x or 7.x.

If you need other data from our environment to help narrow things down, let me know (@mariustantareanu can also respond to questions if I am not available).

  3. Include a runnable Python script that shows the problem.

As I said, the table itself doesn't really matter. I used the following Python script (with custom paths and user/passwords replaced by <>):

import os
import threading
import sys
import argparse


os.environ['TNS_ADMIN'] = '<>'
os.environ['ORA_TZFILE'] = '<>'
os.environ['ORA_NLS10'] = '<>'


import cx_Oracle

def f(con, start_id, num_reqs):
  # Worker run by each thread: issue num_reqs queries on its own
  # connection, opening and closing a cursor for every query.
  for i in range(num_reqs):
    x = start_id + i
    cursor = con.cursor()
    try:
      cursor.execute('select * from devices where dvc_id = :dvcId', dvcId=x)
      d = cursor.fetchone()
    finally:
      cursor.close()


parser = argparse.ArgumentParser()
parser.add_argument('-t', '--threads', help='Number of threads', type=int, required=True)
parser.add_argument('-r', '--requests', help='Number of requests per thread', type=int, required=True)
parser.add_argument('-u', '--username', default='<>')
parser.add_argument('-p', '--password', default='<>')
parser.add_argument('-d', '--dbname', default='<>')

args = parser.parse_args()
connect_string = '%s/%s@%s' % (args.username, args.password, args.dbname)

print("pid=%d" % os.getpid())

con_ts = []
for i in range(args.threads):
  # One dedicated connection per thread; threaded=True makes the Oracle
  # client libraries safe to call from multiple threads.
  con = cx_Oracle.Connection(connect_string, threaded=True)
  t = threading.Thread(target=f, args=(con, i * args.requests, args.requests))
  con_ts.append((con, t))

for _, t in con_ts:
  t.start()

for _, t in con_ts:
  t.join()
@fv-mf fv-mf added the question label Feb 10, 2022
@cjbj
Member

cjbj commented Feb 10, 2022

Before assigning RHEL 8 the blame, can you do an apples-to-apples comparison of Python 2 vs Python 3 on the same OS release?

@fv-mf
Author

fv-mf commented Feb 10, 2022

With cx_Oracle 8.3.0? Does that still work with Python 2.7? (On PyPI it says it's tested with 3.6+.)

@anthony-tuininga
Member

No. cx_Oracle 8.3 requires a minimum of Python 3.6. Can you clarify the comparison? I thought you were just running cx_Oracle 8.3 on Python 3.10 on both RHEL 7.2 and RHEL 8.5. Is that not the case? If not, please run the same code (whether Python 2 with cx_Oracle 7.3 or Python 3 with cx_Oracle 8.3) on both machines. Thanks!

@fv-mf
Author

fv-mf commented Feb 11, 2022

RHEL 7.2 vs. RHEL 8.5 were both tested with the same Python version (3.10) and cx_Oracle (8.3.0).

@anthony-tuininga
Member

So that would suggest that the OS is the culprit here (or some change in the OS that interacts badly with threaded Oracle clients). I don't know what that might be, unfortunately! @cjbj might be able to help further, hopefully.

@cjbj
Member

cjbj commented Feb 14, 2022

Are there any variances if you try different versions of the Oracle Client libraries? Are you using session pooling? If not, have you set the threaded parameter?
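For reference, the pooled variant cjbj is asking about might look like the following sketch. The `make_pool`/`run_query` helpers and their parameters are placeholders (not part of the original script); only the `SessionPool`/`acquire`/`release` calls are cx_Oracle API.

```python
# Hedged sketch: a session pool shared by all threads, instead of one
# dedicated standalone connection per thread. Credentials/DSN are
# placeholders to be supplied by the caller.
def make_pool(user, password, dsn, threads):
    import cx_Oracle  # requires the cx_Oracle package and Oracle client libs
    return cx_Oracle.SessionPool(
        user, password, dsn,
        min=threads, max=threads, increment=0,
        threaded=True,  # same flag the standalone connections use
    )

def run_query(pool, dvc_id):
    con = pool.acquire()            # check a session out of the pool
    try:
        with con.cursor() as cursor:
            cursor.execute('select * from devices where dvc_id = :dvcId',
                           dvcId=dvc_id)
            return cursor.fetchone()
    finally:
        pool.release(con)           # return the session to the pool
```

Each worker thread would call `run_query(pool, x)` instead of holding its own connection.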

@cjbj
Member

cjbj commented Feb 14, 2022

When I asked around, it was pointed out that neither of those OS versions has a supported Python 3.10 package, and that you should verify your Python binaries are compiled properly for each distro; I wasn't given any tips on how to do that!

@mariustantareanu

We have investigated this further in the context of our application. The old version of our application runs on Python 2.7.10 / cx_Oracle 7.3.0; the new version runs on Python 3.10.0 / cx_Oracle 8.3.0. However, we found another major difference: the old version runs the Python process with CPU affinity set, whereas the new version does not set CPU affinity.

After setting CPU affinity for the new version's Python 3 process (affining it to a single CPU), we started to get performance similar to the old version.
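For anyone wanting to try the same workaround, pinning the process can be done from within Python itself; `pin_to_cpu` is a hypothetical helper, but `os.sched_setaffinity` is the standard (Linux-only) call.

```python
import os

def pin_to_cpu(cpu=0):
    """Restrict the calling process to a single CPU; return the new
    affinity set, or None where the call is unavailable (Linux-only)."""
    if not hasattr(os, 'sched_setaffinity'):
        return None  # e.g. macOS or Windows
    os.sched_setaffinity(0, {cpu})   # pid 0 == the current process
    return os.sched_getaffinity(0)
```

The same effect can be had externally with `taskset -c 0 python3 app.py`.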

It seems that calls to cx_Oracle's Cursor.close() are much faster with CPU affinity under heavy load (a large number of threads, e.g. 50). Without CPU affinity these calls are really expensive, especially when the cursor was used to fetch a large number of columns (e.g. 15-20). Maybe this is something that could be mentioned in the cx_Oracle docs, as in our case the performance difference is quite dramatic.

For example, with 50 threads that each execute 365 queries and close the cursor after each query, the accumulated time spent in Cursor.close() per thread is:

with CPU affinity: 0.0065s
without CPU affinity: 17.8674s
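The per-thread accumulation above can be gathered with a small harness like the one below. The hot call here is a stand-in no-op rather than a real Cursor.close() (which needs a live connection); the `worker`/`measure` names are illustrative, not from the original test.

```python
# Hedged sketch of the measurement: accumulate time spent in a hot call,
# per thread, across many iterations.
import threading
import time

def worker(n_calls, totals, idx, hot_call):
    total = 0.0
    for _ in range(n_calls):
        t0 = time.perf_counter()
        hot_call()                      # Cursor.close() in the real test
        total += time.perf_counter() - t0
    totals[idx] = total                 # accumulated seconds for this thread

def measure(n_threads=50, n_calls=365, hot_call=lambda: None):
    totals = [0.0] * n_threads
    threads = [threading.Thread(target=worker,
                                args=(n_calls, totals, i, hot_call))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

print(max(measure()))   # worst per-thread accumulated time
```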

@cjbj
Member

cjbj commented Mar 4, 2022

That's great - thanks for digging into it and for sharing the results. It jogs a distant memory of a discussion about thread stickiness during development of our Node.js API. We can certainly make a note about CPU affinity in our tuning doc. It also gives us an opportunity to do an internal analysis of the behavior.

@cjbj cjbj added enhancement and removed question labels Mar 4, 2022
@david-west-at-micro-focus

Before my colleague @mariustantareanu discovered that affining the Python 3 application to a single CPU restores performance to what we see in Python 2, I looked deeply into this apparent performance degradation from cx_Oracle 7.3/Python 2 to cx_Oracle 8.3/Python 3.

My colleague recommended that I post these findings here in case they are of interest. I've attached a copy of the "cx.py" test harness. Perhaps it might motivate code and/or documentation changes.

What I found is that there is a general degradation of throughput from Python 2 to Python 3 involving the following cofactors:

  • A highly multithreaded workload (10-100 or more active threads).
  • Multiple CPUs, with the process affinity spanning multiple CPUs.
  • A high amount of GIL chatter (frequent release/grab cycles without any significant I/O in between).

With the above cofactors in place, the key difference between python2 and python3 is the significant change to the GIL implementation from being a simple mutex to involving multiple mutexes with a goal of better "fairness".

This enhanced GIL implementation is described in the comments of "https://github.com/python/cpython/blob/v3.6.8/Python/ceval_gil.h".

In the specific case of cx_Oracle, a high amount of GIL chatter is produced during the closing of cxoCursor objects which eventually causes all of the allocated cxoVar objects to be freed.

Below is a typical stack trace of one of these cxoVar_free functions trying to regrab the GIL in python3:

Thread 18 (Thread 0x7f51fd7fa700 (LWP 249884)):
#0  0x00007f523bc4c66d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f523bc478f7 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x00007f523bc4950c in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00007f523bf53c44 in take_gil () from /lib64/libpython3.6m.so.1.0
#4  0x00007f523bf53e7d in PyEval_RestoreThread () from /lib64/libpython3.6m.so.1.0
#5  0x00007f523a45a829 in cxoVar_free (var=0x7f5239d75490) at src/cxoVar.c:90
#6  0x00007f523bf45985 in list_dealloc () from /lib64/libpython3.6m.so.1.0
#7  0x00007f523a44a7d5 in _Py_DECREF (op=<optimized out>) at /opt/python/cp310-cp310/include/python3.10/object.h:500
#8  cxoCursor_close (cursor=0x7f5239d732d0, args=<optimized out>) at src/cxoCursor.c:546
#9  0x00007f523bff82e1 in call_function () from /lib64/libpython3.6m.so.1.0

And below is the part of the cxoVar_free function in play:

    85 static void cxoVar_free(cxoVar *var)
    86 {
    87    if (var->handle) {
    88        Py_BEGIN_ALLOW_THREADS
    89        dpiVar_release(var->handle);
    90        Py_END_ALLOW_THREADS
    91        var->handle = NULL;
    92    }

Normally we might look at something like this and call it normal GIL contention. But the contention is much worse under Python 3's newer, more sophisticated and fairer GIL: releasing the GIL when it isn't really necessary can be far more costly in a multi-threaded, high-load Python 3 context.

The specific table we noticed this issue on has ~70 columns, and as a result the cxoCursor object would have ~70 cxoVar objects allocated, which will then involve ~70 GIL release/regrab cycles per cursor close.

This performance degradation can also be reproduced with multiple threads looping around "time.sleep(0)" which will release/regrab the GIL without doing any real I/O.
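That reproduction can be sketched as follows; thread and iteration counts are illustrative, and on an unpinned multi-core box the multi-thread run is typically much slower than a single thread doing the same total number of sleeps.

```python
# Hedged reproduction of the GIL-chatter pattern described above:
# each time.sleep(0) releases and re-acquires the GIL without doing I/O.
import threading
import time

def chatter(n):
    for _ in range(n):
        time.sleep(0)   # GIL release/regrab, no real work

def run(n_threads, n_iters):
    start = time.perf_counter()
    threads = [threading.Thread(target=chatter, args=(n_iters,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print('1 thread  :', run(1, 20000))    # 20k sleeps total
print('20 threads:', run(20, 1000))    # same 20k sleeps, spread out
```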

I further investigated whether this GIL release around dpiVar_release is really necessary. For example, it appears that the GIL is not released on the "cxoVar_externalSetValue -> dpiVar_release" code path.

Also, from a cursory review of the dpiVar_release implementation, no obvious signs of I/O or other lock negotiations were observed that might warrant releasing the GIL.

For testing, a binary patch of the cxo shared object was prepared that replaces the calls to release and grab the GIL around this call to dpiVar_release in cxoVar_free with NOPs.

Below are Python commands to replace the GIL grab/release with NOPs for cxo73 and cxo83 respectively (addresses obtained with "$ objdump -d <cxo.so> | grep -A13 cxoVar_free.:"):

$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo73nogil/cx_Oracle.so 2af30
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo73nogil/cx_Oracle.so 2af44

$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo83/cx_Oracle.cpython-310-x86_64-linux-gnu.so 5b810
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo83/cx_Oracle.cpython-310-x86_64-linux-gnu.so 5b824
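For readability, the one-liners above amount to the following small script. `nop_patch` is a hypothetical name; the offsets still come from objdump as described.

```python
# Hedged, readable equivalent of the one-liners: overwrite a 5-byte call
# instruction at a given file offset with x86 NOPs (0x90).
import sys

def nop_patch(path, offset, length=5):
    with open(path, 'rb+') as f:
        f.seek(offset)
        f.write(b'\x90' * length)   # 0x90 == single-byte x86 NOP

if __name__ == '__main__' and len(sys.argv) == 3:
    # usage: nop_patch.py <shared-object> <hex-offset>
    nop_patch(sys.argv[1], int(sys.argv[2], base=16))
```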

This "nogil" version showed improved performance in both python2 and python3, but especially python3.

Below are the raw results from the test harness for all eight combinations of {py2|py3},{gil|nogil},{affined|not-affined} executing 500 iterations in each of 20 threads on a Linux VM with 8 CPU cores:

# py3
$ python3 cx.py -t 20 -r 500
pid:75216
Time Taken: 28.37s
 
# py3,nogil
$ python3 cx.py -n -t 20 -r 500
pid:75339
Time Taken: 5.83s
 
# py3,affinity
$ python3 cx.py -c 0 -t 20 -r 500
pid:75387
pid 75387's current affinity list: 0-7
pid 75387's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.91s
 
# py3,affinity,nogil
$ python3 cx.py -c 0 -n -t 20 -r 500
pid:75420
pid 75420's current affinity list: 0-7
pid 75420's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.91s
 
# py2
$ python2 cx.py -t 20 -r 500
pid:75505
Time Taken: 7.73s
 
# py2,nogil
$ python2 cx.py -n -t 20 -r 500
pid:75543
Time Taken: 5.61s
 
# py2,affinity
$ python2 cx.py -c 0  -t 20 -r 500
pid:75588
pid 75588's current affinity list: 0-7
pid 75588's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.97s
 
# py2,affinity,nogil
$ python2 cx.py -c 0 -n -t 20 -r 500
pid:75618
pid 75618's current affinity list: 0-7
pid 75618's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.89s

cx.py.gz

@anthony-tuininga
Member

Thanks for the detailed explanation. Looking at the implementation of cxoVar_free(), the release of the GIL was intended to cover the possibility of a round trip occurring due to the call to dpiVar_release(). I believe that should only happen, however, if the connection is being released. If so, the fact that the variable holds a reference to the connection should mean that scenario never takes place! And if that is in fact confirmed, then those lines can safely be removed. We will look into that further. Thanks!

@cjbj
Member

cjbj commented Mar 31, 2022
