
iotune: Add warmup period to measurements #2810


Open

StephanDollberg wants to merge 1 commit into master from stephan/iotune-warmup

Conversation

StephanDollberg (Contributor)

This adds a warmup period to each test during which measurements are
ignored. Disks are often fairly bursty in the initial part of the test.
This skews the measurements and makes us produce io-properties that are
potentially way too large.

It also heavily affects the reported deviation, which can look a lot worse
than it is.

The warmup period is the larger of 3 seconds and ten percent of the test
duration.

warmup_period(std::chrono::duration<double> duration) {
    auto min_warmup = 3s;
    auto ten_percent = duration / 10;
    return min_warmup.count() > ten_percent.count() ? min_warmup : ten_percent;
Member

No need to compare counts, durations can be compared directly.

Also, std::max().

StephanDollberg (Contributor, Author)

Fixed, needed some explicit casting on ten_percent as we use double durations there for whatever reason.
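
For reference, here is a minimal, self-contained sketch of what the fixed helper plausibly looks like after this exchange (std::max plus an explicit cast; it matches the excerpt quoted further down in the thread, where the ratio is still 10% before being reduced to 5% later):

#include <algorithm>
#include <chrono>

using namespace std::chrono_literals;

// Warmup window: the larger of a 3-second floor and 10% of the test duration.
// The explicit duration_cast brings both operands to the same type
// (std::chrono::seconds) so std::max can deduce its template argument;
// the test duration itself is a double-based duration.
std::chrono::duration<double>
warmup_period(std::chrono::duration<double> duration) {
    auto min_warmup = 3s;
    auto ten_percent = std::chrono::duration_cast<std::chrono::seconds>(duration) / 10;
    return std::max(min_warmup, ten_percent);
}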

StephanDollberg force-pushed the stephan/iotune-warmup branch from 66d8ee2 to c4c3a50 on June 19, 2025 10:32
StephanDollberg (Contributor, Author)

ping @xemul @robertbindar

robertbindar (Contributor) left a comment

Thanks for the patch @StephanDollberg 🚀! I left some inline comments; overall I think the code can get simpler, as rates are more basic than I think you assumed. Let me know if I got it wrong.
I'm guessing the patch comes from some practical real-world tests; I'd like to see some examples of such runs where you noticed higher deviation and higher io properties, and an improvement of these symptoms with a warmup period.
It would also be nice to determine whether 10% is needed or we can work with less.

// `issue_request` itself so the first bucket might not be a full
// measurement and cause skew otherwise.
size_t samples_to_drop = warmup_period(_duration) / period + 1;
_rates.erase(_rates.begin(), _rates.begin() + std::min(samples_to_drop, _rates.size()));
Contributor

Rates here are one number per shard, that is, the number of requests executed by one worker in the duration time; they have no relation to the period. I don't think you have anything else to do here other than adjusting the start and end measurement points in io_worker.

StephanDollberg (Contributor, Author)

I don't think this is true; elements are inserted into this array every second, based on the timer a few lines above.

This is (unfortunately) decoupled from the other measurement logic in io_worker::issue_request.

Contributor

Can you point me to the line that does that? I see the rates being updated only in the destructor of the meter. I also double checked yesterday that there's only one rate per shard by printing them in all workloads.

StephanDollberg (Contributor, Author), Jul 31, 2025

The timer is configured here: https://github.com/scylladb/seastar/blob/master/apps/iotune/iotune.cc#L335-L338

armed here: https://github.com/scylladb/seastar/blob/master/apps/iotune/iotune.cc#L342

E.g.: with the following debug print added (on top of my patch):

diff --git a/apps/iotune/iotune.cc b/apps/iotune/iotune.cc
index 2694d124..016b001b 100644
--- a/apps/iotune/iotune.cc
+++ b/apps/iotune/iotune.cc
@@ -363,6 +363,7 @@ class io_worker {
                 _rates.push_back(_requests);
             }

+           fmt::print("rates: {}\n", fmt::join(_rates, ", "));
             // Drop samples that got measured during the warm up period.
             // We drop an extra sample. This is because the one second timer
             // logic is really independent of when we start counting requests in

I get the following output:

./build/release/apps/iotune/iotune -c 4 -m 2G --evaluation-directory /mnt/xfs
...
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: rates: 0, 0, 0, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163
rates: 0, 0, 0, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162
rates: 0, 0, 0, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163
rates: 0, 0, 0, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162, 1163, 1162, 1163, 1163, 1162, 1163, 1162
1162 MiB/s
Measuring sequential read bandwidth: rates: 0, 0, 0, 2442, 2441, 2442, 2441, 2442, 2441, 2441, 2442, 2441, 2442, 2441, 2442
2441 MiB/s
Measuring random write IOPS: rates: 0, 0, 5, 67966, 67673, 67206, 67812, 67738, 67536, 67564, 67694, 67965, 67495, 67595, 67682
rates: 0, 0, 0, 67486, 67963, 67547, 67570, 67911, 67751, 67630, 67897, 68015, 67716, 67665, 67808
rates: 0, 0, 5, 67573, 67461, 67850, 67646, 67614, 67596, 68169, 67632, 67667, 67750, 67628, 67549
rates: 0, 0, 0, 65796, 65719, 66224, 65789, 65553, 65941, 65451, 65587, 65170, 65850, 65939, 65721
268823 IOPS
Measuring random read IOPS: rates: 0, 0, 0, 129175, 129428, 129530, 129675, 129593, 129331, 129487, 129446, 129433, 129327, 129282, 128691
rates: 0, 0, 0, 129233, 129748, 129772, 129672, 129609, 129427, 129688, 129731, 129712, 129483, 129266, 128756
rates: 0, 0, 0, 124915, 125636, 125422, 125735, 125496, 125070, 125390, 125508, 125507, 125154, 125201, 124977
rates: 0, 0, 0, 129388, 129950, 129887, 129651, 129806, 129341, 129827, 129788, 129825, 129263, 129335, 128951
513794 IOPS

One print per shard, each containing one measurement per second.
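
To make the mechanism concrete, here is a condensed, stand-alone sketch of what the one-second ticker and the warmup drop amount to. It is not the actual io_worker code; the free function and parameter names are illustrative only:

#include <algorithm>
#include <cstddef>
#include <vector>

// Each tick of a one-second timer appends the number of requests completed
// since the previous tick. Before aggregating, the samples that fall inside
// the warmup window are dropped, plus one extra: the ticker runs
// independently of when request counting starts, so the first bucket may be
// only a partial measurement.
std::vector<unsigned> drop_warmup_samples(std::vector<unsigned> rates,
                                          double warmup_seconds,
                                          double tick_seconds = 1.0) {
    size_t samples_to_drop = static_cast<size_t>(warmup_seconds / tick_seconds) + 1;
    rates.erase(rates.begin(),
                rates.begin() + std::min(samples_to_drop, rates.size()));
    return rates;
}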

Contributor

Nice and funny! I didn't spot the _rates.push_back(_requests - _prev_requests); in the constructor, and the ticker was running for a number of seconds equal to the number of shards, so my print was misleading 🤦

@@ -376,7 +398,7 @@ class io_worker {

     io_worker(size_t buffer_size, std::chrono::duration<double> duration, std::unique_ptr<request_issuer> reqs, std::unique_ptr<position_generator> pos, std::vector<unsigned>& rates)
         : _buffer_size(buffer_size)
-        , _start_measuring(iotune_clock::now() + std::chrono::duration<double>(10ms))
+        , _start_measuring(iotune_clock::now() + std::chrono::duration<double>(warmup_period(duration)))
         , _end_measuring(_start_measuring + duration)
Contributor

You need to adjust the end measuring point as well, to respect the proportions and the total duration passed by the user.

StephanDollberg (Contributor, Author)

_end_measuring is already implicitly moved as it adds on to _start_measuring.

Or are you saying that we should subtract the warmup period here? That seems unneeded to me. I think if someone passes --duration then we should measure for that period of time. The warmup is independent of that.

Note: I think there is maybe an implicit assumption in the code that wTP (70%) + rTP (10%) + wIOPS (10%) + rIOPS (10%) adds up to 100% of the duration, but it is already somewhat arbitrary that the latter tests run for shorter. It's really just needed this way so that the wTP test has actually written the full file, so it is fully extended for the read tests.

In the default config the wIOPS test, for example, only runs for 12 seconds, so if you subtract the warmup from that it's even shorter.
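
A small illustration of the timing relationship under discussion; the values are made up and the variable names mirror the diff above:

#include <chrono>
#include <fmt/core.h>

int main() {
    using namespace std::chrono_literals;
    // Hypothetical run: --duration 120 with a 12s warmup (10% of 120s).
    std::chrono::duration<double> duration = 120s;
    std::chrono::duration<double> warmup = 12s;
    // Relative to "now": the warmup shifts both endpoints by the same amount,
    // so the measured window is still exactly `duration`.
    auto start_measuring = warmup;
    auto end_measuring = start_measuring + duration;
    fmt::print("measured window: {}s, total wall time: {}s\n",
               (end_measuring - start_measuring).count(), end_measuring.count());
    // prints: measured window: 120s, total wall time: 132s
}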

Contributor

> _end_measuring is already implicitly moved as it adds on to _start_measuring.

you're correct, sorry.

> Or are you saying that we should subtract the warmup period here? That seems unneeded to me. I think if someone passes --duration then we should measure for that period of time. The warmup is independent of that.

> Note: I think there is maybe an implicit assumption in the code that wTP (70%) + rTP (10%) + wIOPS (10%) + rIOPS (10%) adds up to 100% of the duration, but it is already somewhat arbitrary that the latter tests run for shorter. It's really just needed this way so that the wTP test has actually written the full file, so it is fully extended for the read tests.

The first sequential write workload is responsible for creating the dataset on which the other tests run; that's why it runs for 70% of the duration: you need a big enough file to minimize cache hits. Unfortunately I don't know the reasoning for choosing 70% and not less; maybe measurements were done. I conducted experiments a few days ago where I didn't see IOPS artificially increasing even with files generated in 10% of the duration (IIRC the total file size generated in 10% was ~3GB).

> In the default config the wIOPS test, for example, only runs for 12 seconds, so if you subtract the warmup from that it's even shorter.

StephanDollberg (Contributor, Author)

Here is an example from m7gd.4xl (wTP: 577MiB/s, rTP: 1201MiB/s, wIOPS: 134168, rIOPS: 268336)

master:

Measuring sequential write bandwidth: 584 MiB/s (deviation 12%)
Measuring sequential read bandwidth: 1362 MiB/s (deviation 39%)
Measuring random write IOPS: 145383 IOPS (deviation 25%)
Measuring random read IOPS: 296471 IOPS (deviation 23%)

this PR:

Measuring sequential write bandwidth: 576 MiB/s
Measuring sequential read bandwidth: 1200 MiB/s
Measuring random write IOPS: 135082 IOPS
Measuring random read IOPS: 268818 IOPS

std::chrono::duration<double>
warmup_period(std::chrono::duration<double> duration) {
    auto min_warmup = 3s;
    auto ten_percent = std::chrono::duration_cast<std::chrono::seconds>(duration) / 10;
Contributor

Can you please run some experiments and see if we can work with less than 10%?

StephanDollberg (Contributor, Author)

I had previously looked into this on a few AWS instances and they all converge pretty fast, which is why I went with the 3-second minimum. This was true both for instance types with shared disks and for those with unshared disks.

My crappy personal consumer disk took a bit longer, which is why I went with the 10% component; I didn't consider 10% to be bad in terms of how much time it adds. But if you don't like it, we could get rid of it or do 5% or something.

Contributor

5% would be alright, I'll add an arg in the future if it becomes annoying to anyone.

StephanDollberg (Contributor, Author)

Cool, I have changed it to 5%.

Make iotune ignore the first three seconds of measurements. Disks are
often fairly bursty in the initial part of the test. This skews the
measurements and makes us produce io-properties that are
potentially way too large.

It also heavily affects the shown deviation, which can look a lot worse
than it is.
robertbindar (Contributor) left a comment

looks good, thanks @StephanDollberg.

robertbindar (Contributor)

@scylladb/scylla-maint please review and merge

mykaul commented Aug 4, 2025

> @scylladb/scylla-maint please review and merge

@scylladb/seastar-maint please review and merge
