-
Notifications
You must be signed in to change notification settings - Fork 361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plan to support Oracle direct-path api for faster bulk inserts? #369
Comments
Data loading is something that SQL Loader already does well. We have no immediate plans to investigate using the Oracle Call Interface Direct Path Load API for cx_Oracle, but I know it would be nice to have. |
Just to be clear: usign the |
Hi, sorry for the confusion. I am referring to the latter - using the
separate OCI API with reduced overhead. With Simple Compression turned on
for the table (offered for free from Oracle) considerable space can be
saved, 40-60%. This only kicks in for Direct Path loaded blocks.
Additionally we have found that insert times can be significantly impacted,
up to 50% faster load times.
To me this is more than a "nice to have" requirement. All production
quality ETL tools support Oracle Direct Path insertion e.g. SAP Data
Services, Oracle Integrator, and Informatica. We are trying to displace an
ETL tool with Python data pipelines and we cannot accomplish this without
Direct Path support for these reasons.
If this were to be implemented would the developer have to work through the
C OCI calls? May I assume cx_oracle calls C which then calls OCI? I am
thinking this is pretty difficult development do you have a sense for it?
I may be willing to pitch in but I am concerned it may be very involved.
Thanks for your thoughts and advice. Appreciate your time! Mike.
…On Thu, Nov 14, 2019 at 5:41 PM Anthony Tuininga ***@***.***> wrote:
Just to be clear: usign the /*+ append */ hint should work with cx_Oracle
just as well as it does in SQL*Plus or any other client. The other method
that Chris mentioned is a different API altogether which bypasses some of
the other overhead associated with binding data.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZPL45ORA6HICXYVQT3QTXHXTA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEDSCIA#issuecomment-554115360>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZILV23PUUPBOCUXZADQTXHXTANCNFSM4JNMNEOA>
.
|
50% faster than The Oracle Call Interface direct path load API is not small, so learning & implementing & testing would be a solid effort. It would need to be added to ODPI-C first, which @anthony-tuininga would have to do (we don't accept PR's for ODPI-C for various reasons). Then this would be wrapped in cx_Oracle calls. Typically Anthony would do the cx_Oracle API while doing the ODPI-C interface since it makes testing easier. Can you post a brain-storm cx_Oracle API and show how you would use it? |
Up to 50 percent faster using an etl tool using direct path vs regular
insert. Data Services has settings for this, the improvement is
environment and data dependent obviously.
Yes I will propose an API I think generally the API would be very similar
to executemany with an addition or two to indicate a desired direct path
load. I will make a proposal. Thanks for your help. Mike
…On Thu, Nov 14, 2019, 8:29 PM Christopher Jones ***@***.***> wrote:
50% faster than executeMany()? Do you have Python benchmark data?
The Oracle Call Interface direct path load API is not small, so learning &
implementing & testing would be a solid effort. It would need to be added
to ODPI-C first, which @anthony-tuininga
<https://github.com/anthony-tuininga> would have to do (we don't accept
PR's for ODPI-C for various reasons). Then this would be wrapped in
cx_Oracle calls. Typically Anthony would do the cx_Oracle API while doing
the ODPI-C interface since it makes testing easier.
Can you post a brain-storm cx_Oracle API and show how you would use it?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZLB4N3INQNQYSW664LQTX3N5A5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEED62BI#issuecomment-554167557>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZKYHSEURXNVJWTNV5TQTX3N5ANCNFSM4JNMNEOA>
.
|
Can you do a quick benchmark with |
I will. We are loading millions of rows. I have copied the benefits
below from a white paper from Oracle.
https://docs.oracle.com/cd/B10501_01/server.920/a96652/ch09.htm
The paper lists the multitude of reasons why the direct path API is faster,
and please don't forget about the compression benefit as well. Direct
loaded blocks are compressed with Basic Compression turned on for the table.
The bottom line is that an Oracle direct path load is a totally different
mechanism for loading data into a table. No inserts are executed
internally by Oracle. Data is preformatted into blocks and appended to the
end of the table offering many performance benefits. The internals are
totally different.
The below information is from Oracle.
Advantages of a Direct Path Load
A direct path load is faster than the conventional path for the following
reasons:
- Partial blocks are not used, so no reads are needed to find them, and
fewer writes are performed.
- SQL*Loader need not execute any SQL INSERT statements; therefore, the
processing load on the Oracle database is reduced.
- A direct path load calls on Oracle to lock tables and indexes at the
start of the load and releases them when the load is finished. A
conventional path load calls Oracle once for each array of rows to process
a SQL INSERT statement.
- A direct path load uses multiblock asynchronous I/O for writes to the
database files.
- During a direct path load, processes perform their own write I/O,
instead of using Oracle's buffer cache. This minimizes contention with
other Oracle users.
- The sorted indexes option available during direct path loads allows
you to presort data using high-performance sort routines that are native to
your system or installation.
- When a table to be loaded is empty, the presorting option eliminates
the sort and merge phases of index-building. The index is filled in as data
arrives.
- Protection against instance failure does not require redo log file
entries during direct path loads. Therefore, no time is required to log the
load when:
- Oracle is operating in NOARCHIVELOG mode
- The UNRECOVERABLE parameter is set to Y
- The object being loaded has the NOLOG attribute set
…On Thu, Nov 14, 2019, 9:57 PM Christopher Jones ***@***.***> wrote:
Can you do a quick benchmark with executeMany()
<https://cx-oracle.readthedocs.io/en/latest/user_guide/batch_statement.html>
in Python and compare that with your ETL direct path? I am very curious.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZKBHPAPQLSKHDJVJMLQTYFZHA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEED4HI#issuecomment-554188317>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZNPE4VAVAIGUQGJAU3QTYFZHANCNFSM4JNMNEOA>
.
|
@gitpickle how did your benchmark go? |
Hi Chris! Sorry about taking so long to get back.
I have not been able to construct a benchmark due mainly to being busy but
I wanted to make one more plea for this request. I don't know if I will
be able to find time to construct a benchmark.
The functionality from Oracle exists for a reason. When using /*+ APPEND
*/, which uses the direct path loading mechanism, rows are inserted in a
fundamentally different way from a traditional path load. The buffer cache
is bypassed and blocks are appended above the high water mark in the
table. For inserting large numbers of rows (e.g. millions, tens of
millions, hundreds of millions, etc.) performance can be much improved.
How much? It depends on environment specifics!
Contention on the redo logs can also be resolved if tables are being loaded
in parallel. If the table is set to NOLOGGING, when bulk loading Oracle
does not write data to the redo logs. Another performance benefit!
And finally, as I mentioned previously as well Oracle supports BASIC
compression which compresses blocks written with the direct path API. This
type of compression is free and saves space and further improves
performance by writing less data to disk because it is compressed (based on
repeated values - they are only written once per block). Obviously its
effectiveness is dependent on the characteristics of the actual data. We
have saved up to 50% space by using Basic compression. This only kicks in
during a direct-path insert.
I absolutely don't mean to be argumentative at all and I greatly appreciate
all the work of you and your team!!! The differences in how Oracle loads
tables with a direct path load is pretty well documented. Given these
architectural differences I am not really seeing the reason for running
benchmarks. I could easily contrive something that works the way I want it
to for instance based on my environment, which may be different than most
other environments, who knows. We may have a slow SAN, whereas others may
have a fast one where they have other bottlenecks.
What I am simply saying is that Oracle documents how the direct-path insert
mechanism is fundamentally different. With our SAP Data Services tool, we
have had to use the bulk loading mechanism (APPEND mode in DS) extensively
at Fox Chase Cancer Center to save space and to speed up our "Initial"
loads. Our tool will not make a round trip for every row regardless of
whether or not we use the bulk loader. Our performance improvements are
due to the increased efficiency of the insert mechanism Oracle employs in a
direct path load, rather than reducing round trips. I am aware that
executemany avoids the round trips. What I am saying is needed to compete
with ETL tools is support for the direct path API (append based insertion).
I am hoping this makes sense and I 100% understand if you simply don't have
time to implement the feature. I get it. But at the same time I also feel
it is important.
One other thing I will mention, is that SQLAlchemy should turn around and
leverage the feature if you implement it. We are using SQLAlchemy, which
in turn can obviously use cx_oracle.
Hope this helps,
Mike
…On Wed, Dec 4, 2019 at 4:01 AM Christopher Jones ***@***.***> wrote:
@gitpickle <https://github.com/gitpickle> how did your benchmark go?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZKML5IVQYM64Z2ZDBLQW5WXJA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF4HPTI#issuecomment-561543117>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZIWN474QPN4FJOAD6TQW5WXJANCNFSM4JNMNEOA>
.
|
The decision factors are standard ones (i) implementation complexity and maintenance costs (ii) performance benefits (iii) usability benefits (iv) whether leaving the feature to existing Oracle tools that do it is a wiser allocation of responsibilities. |
gotcha, makes perfect sense Chris thank you. Mike
…On Wed, Dec 11, 2019 at 5:37 PM Christopher Jones ***@***.***> wrote:
The decision factors are standard ones (i) implementation complexity and
maintenance costs (ii) performance benefits (iii) usability benefits (iv)
whether leaving the feature to existing Oracle tools that do it is a wiser
allocation of responsibilities.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZLUTBZCTW6PC4YRNK3QYFTRZA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGU2B6Y#issuecomment-564764923>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZNZYNXMPBP3U5H6WHTQYFTRZANCNFSM4JNMNEOA>
.
|
@gitpickle since you have data, you could help with (ii). |
I will try to get numbers
…On Wed, Dec 11, 2019, 5:41 PM Christopher Jones ***@***.***> wrote:
@gitpickle <https://github.com/gitpickle> since you have data, you could
help with (ii).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#369?email_source=notifications&email_token=ANYVZZKSTKYUCSLIY3NV3MTQYFUALA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGU2LWI#issuecomment-564766169>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZK5JO25CJAG2MFNE7DQYFUALANCNFSM4JNMNEOA>
.
|
I also hope to have the function of direct path loading without giving up cx_Oracler |
yes it would be very nice to support this important Oracle functionality
…On Fri, May 7, 2021 at 8:54 PM 李道然 ***@***.***> wrote:
I also hope to have the function of direct path loading without giving up
cx_Oracler
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#369 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ANYVZZP5TUBDN4EV7C3GMEDTMSDSXANCNFSM4JNMNEOA>
.
|
I think so too. It's on my long wish-list, but it's a big project. API suggestions, testcases etc welcome |
Will there be compressed data when cx oracle reads data transmission and writes to a remote server to increase the transmission speed? |
@qianxuanyon the feature hasn't been investigated yet. |
I currently have a scenario where a csv.gz file needs to be written to a remote database
The second method plus the data upload time is much faster than the first method So is it possible for us to compress data during the transmission of cx_Oracle to increase the transmission speed? |
@qianxuanyon that is a different topic to this enhancement request. Can you start a new issue? |
Oracle provides a well-known mechanism commonly refereed to as the "Direct Path API" which appends rows to the end of a table for faster bulk insertion. Additionally, Oracle offers (for free) the ability to compress blocks written in this manner based on repeating values. The current cx_oracle executemany code does not seem to leverage or offer a way to leverage the Direct Path API. The API is accessed commonly through the /*+ append */ hint. Are there plans for cx_oracle to natively support direct path insertion? Thanks for your help! Mike
The text was updated successfully, but these errors were encountered: