Skip to content

fsync completed files, and directories before renames #7033

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

eddyashton
Copy link
Member

@eddyashton eddyashton commented Jun 3, 2025

Follow-up to #7029 (will rebase once that's in done), implementing the additional calls to fsync() described in #7030.

Opening as a draft so we can discuss the perf impact and implementation.

TODO:

  • Mention this in the CHANGELOG, since it may have user-visible perf impact
  • Add TimeBoundLogger around these fsync calls, so we can track when they're slow
  • Move all of these fsync()s to occur off the main thread

As a simple test, I've run this locally with strace attached to the primary during the modified test_parse_snapshot_file. Sample output attached:
strace_output.txt

The time of each syscall is in the <angle brackets> at the end of the line, in seconds, at us precision. For comparison, the write() calls appear to take 10s, and occasionally 100s, of us. The fsync() calls are much more expensive, around 3.5ms. A brief bit of Python stats for this data shown below:

>>> ns = [2.5e-05, 0.0022, 0.002452, 0.002463, 0.002474, 0.002487, 0.002511, 0.002562, 0.002636, 0.002654, 0.002684, 0.002713, 0.002716, 0.002728, 0.00273, 0.002732, 0.002732, 0.002736, 0.002748, 0.002751, 0.002763, 0.002786, 0.002801, 0.002805, 0.002813, 0.002823, 0.002844, 0.002855, 0.002856, 0.00286, 0.002873, 0.002874, 0.002892, 0.002894, 0.002902, 0.002912, 0.002918, 0.002918, 0.002928, 0.002934, 0.002937, 0.002939, 0.002945, 0.002957, 0.002996, 0.003003, 0.003015, 0.003021, 0.003036, 0.003039, 0.00304, 0.003043, 0.003064, 0.003068, 0.003083, 0.003104, 0.003105, 0.003107, 0.003109, 0.003111, 0.003133, 0.003139, 0.003146, 0.003154, 0.00317, 0.003193, 0.003199, 0.003208, 0.003223, 0.00323, 0.003251, 0.003252, 0.003254, 0.003265, 0.003278, 0.00328, 0.00328, 0.003285, 0.003287, 0.003288, 0.003306, 0.003324, 0.003369, 0.003402, 0.003406, 0.00342, 0.003444, 0.003453, 0.003453, 0.003467, 0.00347, 0.003472, 0.003494, 0.00351, 0.003522, 0.003525, 0.003528, 0.003543, 0.003578, 0.003579, 0.003592, 0.003595, 0.003613, 0.003613, 0.003617, 0.003626, 0.003627, 0.00363, 0.003636, 0.00365, 0.003653, 0.003669, 0.00367, 0.003713, 0.003746, 0.00375, 0.003752, 0.003764, 0.003768, 0.003782, 0.00381, 0.003847, 0.003854, 0.003894, 0.003903, 0.003911, 0.003914, 0.003921, 0.003923, 0.003928, 0.003938, 0.003942, 0.003959, 0.003976, 0.003987, 0.00401, 0.004039, 0.004045, 0.004061, 0.004065, 0.004068, 0.004069, 0.004073, 0.004077, 0.004114, 0.00417, 0.00421, 0.004211, 0.004252, 0.004326, 0.004334, 0.004354, 0.004366, 0.004384, 0.004411, 0.004427, 0.004434, 0.004437, 0.004536, 0.00454, 0.004551, 0.004695, 0.004699, 0.004785, 0.004786, 0.004817, 0.004914, 0.005012, 0.005068, 0.005093, 0.005195, 0.0052, 0.005331, 0.005351, 0.005506, 0.00568, 0.006049, 0.006065, 0.006288, 0.006744, 0.006755, 0.007001, 0.007613, 0.007702, 0.007898, 0.008211, 0.008569, 0.008604, 0.008827, 0.009733, 0.010683, 0.011178, 0.015102, 0.02056, 0.022335]
>>> import statistics
>>> [round(q, 4) for q in statistics.quantiles(ns, n=20)]
[0.0027, 0.0027, 0.0029, 0.0029, 0.003, 0.0031, 0.0032, 0.0033, 0.0035, 0.0035, 0.0036, 0.0038, 0.0039, 0.004, 0.0042, 0.0044, 0.0049, 0.0058, 0.0083]

One low outlier taking just 25us, a few high outliers taking over 10ms up to 22ms, but 75% in the 2.7-4ms range. This is definitely slow enough to be a worry, especially as we expect the distribution and outlier tails to be much higher on some production instances, but not a blocker. Other thoughts on this welcome.

@eddyashton
Copy link
Member Author

More data, as the test above was using consistently tiny snapshots. How does fsync() time increase for larger snapshots?

2.02 MB => 10.294ms
8.86 MB => 23.132ms
14.72 MB => 32.279ms
20.58 MB => 67.817ms
26.44 MB => 120.001ms
32.30 MB => 166.814ms
38.16 MB => 237.068ms
44.02 MB => 244.987ms
49.88 MB => 285.758ms
55.74 MB => 266.124ms
60.62 MB => 358.746ms
66.48 MB => 451.756ms
72.34 MB => 499.287ms
78.20 MB => 598.253ms
84.06 MB => 711.104ms
89.92 MB => 690.123ms
95.78 MB => 644.684ms
101.64 MB => 876.336ms
107.50 MB => 935.2ms
113.36 MB => 967.401ms
119.22 MB => 1002.278ms
125.08 MB => 1165.025ms
130.94 MB => 1199.173ms
136.80 MB => 1190.197ms
142.66 MB => 1217.634ms
148.51 MB => 1329.358ms
154.37 MB => 3662.77ms
159.26 MB => 1524.369ms
166.09 MB => 1364.311ms
171.95 MB => 2823.497ms
177.81 MB => 1543.736ms
183.67 MB => 1431.434ms
189.53 MB => 1470.677ms
195.39 MB => 1375.85ms
201.25 MB => 1537.724ms
206.13 MB => 1850.312ms
211.99 MB => 1565.881ms
217.85 MB => 1915.014ms
223.71 MB => 1630.499ms
229.57 MB => 2026.603ms
235.43 MB => 1710.456ms
241.29 MB => 3201.681ms
247.15 MB => 4502.611ms
253.01 MB => 4296.261ms
258.87 MB => 2162.861ms
264.73 MB => 1985.533ms
270.59 MB => 2136.738ms
276.45 MB => 2045.077ms
282.31 MB => 2100.265ms
288.17 MB => 2157.976ms
294.03 MB => 2173.375ms
299.89 MB => 2134.625ms
305.75 MB => 2241.53ms
311.61 MB => 3398.21ms
317.47 MB => 4931.418ms
323.33 MB => 2541.655ms

Yes, significantly. So this definitely cannot block the main thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant