Commit c1a539b

Implement parallel preads
Let clients issue concurrent pread calls without blocking each other or having to wait for all the writes and fsync calls.

Even though at the POSIX level pread calls are thread-safe [1], the Erlang/OTP file backend forces a single controlling process for raw file handles. So, all our reads were always funnelled through the couch_file gen_server, having to queue up behind potentially slower writes. This is particularly problematic with remote file systems, where fsyncs and writes may take a lot longer while preads can hit the cache and return quicker.

Parallel pread calls are implemented via a NIF which copies some of the file functions from OTP's prim_file NIF [2]. The original OTP handle is dup-ed and then closed, after which our NIF takes control of the new duplicated file descriptor. This is necessary in order to allow multiple reader access via reader/writer locks, and also to carefully manage the closing state.

In order to keep things simple, the new handles created by couch_cfile implement the `#file_descriptor{module = $Module, data = $Data}` protocol, such that once opened, the regular `file` module in OTP will "know" how to dispatch calls with this handle to our couch_cfile.erl functions. In this way most of `couch_file` stays the same, with all the same `file:` calls in the main data path.

The `couch_cfile` bypass is also opportunistic: if it is not available (on Windows) or not enabled, things proceed as before.

The reason we need a new dup()-ed file descriptor is to manage closing very carefully. Since on POSIX systems file descriptors are just integers, it's very easy to accidentally read from an already closed and re-opened (by something else) file descriptor. That's why there are locks and a whole new file descriptor which our NIF controls. As long as we control the file descriptor with our resource "handle", we can be sure it will stay open and won't be re-used by any other process.

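The dup-then-lock idea described above can be illustrated with a minimal C sketch. This is not the code from couch_cfile.c: the names `handle_t`, `handle_open`, `handle_pread`, `handle_close` and `parallel_pread_demo` are hypothetical, and the real NIF manages its handle as an Erlang resource rather than a plain struct. The sketch only assumes POSIX `dup`, `pread` and `pthread_rwlock_*`.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical handle: a private descriptor guarded by a reader/writer lock. */
typedef struct {
    pthread_rwlock_t lock;
    int fd; /* set to -1 once closed */
} handle_t;

/* Duplicate orig_fd, keep the duplicate, and close the original.
 * On failure the caller still owns orig_fd. */
static int handle_open(handle_t *h, int orig_fd) {
    if (pthread_rwlock_init(&h->lock, NULL) != 0) return -1;
    h->fd = dup(orig_fd);
    if (h->fd < 0) { pthread_rwlock_destroy(&h->lock); return -1; }
    close(orig_fd);
    return 0;
}

/* Readers take the shared lock, so any number of preads may overlap. */
static ssize_t handle_pread(handle_t *h, void *buf, size_t n, off_t off) {
    ssize_t res = -1;
    pthread_rwlock_rdlock(&h->lock);
    if (h->fd >= 0) res = pread(h->fd, buf, n, off);
    pthread_rwlock_unlock(&h->lock);
    return res;
}

/* Close takes the exclusive lock: it waits for in-flight reads, and later
 * reads see fd == -1 instead of a recycled descriptor number. */
static int handle_close(handle_t *h) {
    int res = -1;
    pthread_rwlock_wrlock(&h->lock);
    if (h->fd >= 0) { res = close(h->fd); h->fd = -1; }
    pthread_rwlock_unlock(&h->lock);
    return res;
}

typedef struct { handle_t *h; off_t off; char out[4]; ssize_t n; } job_t;

static void *reader(void *arg) {
    job_t *j = arg;
    j->n = handle_pread(j->h, j->out, sizeof j->out, j->off);
    return NULL;
}

/* Returns 0 if four concurrent readers each saw the expected bytes and a
 * read attempted after close was rejected. */
int parallel_pread_demo(void) {
    char path[] = "/tmp/cfile_demo_XXXXXX";
    const char data[] = "aaaabbbbccccdddd";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    if (write(fd, data, 16) != 16) { close(fd); return -1; }

    handle_t h;
    if (handle_open(&h, fd) != 0) { close(fd); return -1; }

    pthread_t tids[4];
    job_t jobs[4];
    for (int i = 0; i < 4; i++) {
        jobs[i] = (job_t){ .h = &h, .off = i * 4 };
        int rc = pthread_create(&tids[i], NULL, reader, &jobs[i]);
        assert(rc == 0);
        (void)rc;
    }
    int ok = 1;
    for (int i = 0; i < 4; i++) {
        pthread_join(tids[i], NULL);
        ok = ok && jobs[i].n == 4 && memcmp(jobs[i].out, data + i * 4, 4) == 0;
    }
    handle_close(&h);
    char c;
    ok = ok && handle_pread(&h, &c, 1, 0) == -1;
    pthread_rwlock_destroy(&h.lock);
    return ok ? 0 : -1;
}
```

Calling `parallel_pread_demo()` from a `main` function and compiling with `-pthread` exercises the sketch. The point of the design is that the integer in `h->fd` is never shared outside the handle, so a close elsewhere in the process cannot cause a read to land on a descriptor that was recycled for a different file.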
To gain confidence that the new couch_cfile behaves the same way as the Erlang/OTP one, there is a property test which asserts that for any pair of {Raw, CFile} handles, every supported file operation returns exactly the same result. The test itself was validated by modifying some of the couch_cfile.c arguments and confirming that the property tests started to fail.

Since none of the three compatible IOQ systems currently knows how to call a simple MFA, and instead they only send a `$gen_call` message to a gen_server, parallel cfile reads are only available if we bypass the IOQ. By default, requests that are already configured to bypass the IOQ will use the parallel preads. To enable parallel preads for all requests, toggle the `[couchdb] cfile_skip_ioq` setting to `true`.

A simple sequential benchmark was run initially to show that even in the most unfavorable case, all sequential operations, we haven't gotten worse:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
 *** Parameters
 * batch_size       : 1000
 * doc_size         : small
 * docs             : 100000
 * individual_docs  : 1000
 * n                : 1
 * q                : 1

 *** Environment
 * Nodes        : 1
 * Bench ver.   : 1
 * N            : 1
 * Q            : 1
 * OS           : unix/linux
```

Each case was run 5 times and the best rate in ops/sec was picked, so higher is better:

```
                                                  Default    CFile
* Add 100000 docs, ok:100/accepted:0     (Hz):     16000     16000
* Get random doc 100000X                 (Hz):      4900      5800
* All docs                               (Hz):    120000    140000
* All docs w/ include_docs               (Hz):     24000     31000
* Changes                                (Hz):     49000     51000
* Single doc updates 1000X               (Hz):       380       410
```

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html

1 parent e137b72 commit c1a539b

11 files changed: +1930 -39 lines changed

LICENSE

+20-1
@@ -2385,4 +2385,23 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
+THE SOFTWARE.
+
+couch_cfile
+
+couch_cfile.c NIF has parts from Erlang/OTP's prim_file NIF to ensure we
+have the exact pread behavior as Erlang/OTP
+
+Copyright Ericsson 2017-2022. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

rel/overlay/etc/default.ini

+16
@@ -103,6 +103,22 @@ view_index_dir = {{view_index_dir}}
 ; Javascript engine. The choices are: spidermonkey and quickjs
 ;js_engine = spidermonkey
 
+; Use cfile. This is a C-based file I/O module that can execute parallel file
+; read calls. The regular Erlang VM file module, at least as of OTP 28 forces
+; all file operations to go through a single controlling process which can
+; become a bottleneck sometimes. cfile is enabled by default on supported
+; systems (currently Linux, MacOS and FreeBSD). However, it is a new feature,
+; so there any issues with it is possible to disable by setting the value to
+; "false".
+;use_cfile = true
+
+; When enabled, use cfile parallel reads for all the requests. By default the
+; setting is "false", so only requests which are configured to bypass the IOQ
+; would use the cfile parallel reads. If there is enough RAM available for a
+; large file cache and the disks have enough IO bandwith, consider enabling
+; this setting.
+;cfile_skip_ioq = false
+
 [purge]
 ; Allowed maximum number of documents in one purge request
 ;max_document_id_number = 100

src/couch/.gitignore

+1
@@ -5,6 +5,7 @@ ebin/
 priv/couch_js/config.h
 priv/couchjs
 priv/couchspawnkillable
+priv/couch_cfile/*.d
 priv/*.exp
 priv/*.lib
 priv/*.dll
