RIAK monitoring in 2021 #1060

marcoshaw · 2021-03-16T16:53:16Z

I'm faced with RIAK 2.0 nodes that have started showing high CPU usage. I realize we don't have any application monitoring or trending other than the usual CPU and memory of the VM itself.

I'm not quite sure where to start, but I guess my first thing would be to figure out how to gather some kind of trending data for RIAK itself.

What is a good tool/script these days to monitor cluster health?

Otherwise, should I investigate whether I have some kind of problem, or is high CPU on a busy cluster normal?

martinsumner · 2021-03-16T17:54:15Z

There is the standard stuff recommended in the docs - https://docs.riak.com/riak/kv/2.2.3/using/reference/statistics-monitoring.1.html.
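As a starting point for trending, the stats described in those docs can be polled over HTTP and logged on a timer. The sketch below is only an illustration: it assumes the node's HTTP listener is on the default 127.0.0.1:8098 and serves JSON at /stats, and the watched stat names are examples that may differ between versions.

```python
import json
import time
from urllib.request import urlopen

# Assumption: default HTTP listener; adjust host/port to match riak.conf.
STATS_URL = "http://127.0.0.1:8098/stats"

# Example stat names; check your own /stats output for what is available.
WATCHED = (
    "node_gets",
    "node_puts",
    "node_get_fsm_time_mean",
    "node_put_fsm_time_mean",
    "cpu_avg1",
)


def fetch_stats(url=STATS_URL, timeout=5):
    """Fetch the full stats document from one node as a dict."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def pick(stats, names=WATCHED):
    """Keep only the watched stats that are actually present."""
    return {name: stats[name] for name in names if name in stats}


def format_line(timestamp, picked):
    """Render one log line: a timestamp followed by key=value pairs."""
    pairs = " ".join(f"{k}={v}" for k, v in picked.items())
    return f"{int(timestamp)} {pairs}".rstrip()


if __name__ == "__main__":
    # Poll once a minute and print a line that can be appended to a file.
    while True:
        print(format_line(time.time(), pick(fetch_stats())), flush=True)
        time.sleep(60)
```

Redirecting the output to a file gives a crude time series to graph until a proper monitoring stack is in place.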

If you have high CPU, tracking where the CPU is being spent within Riak can be hard. You may be able to glean something by looking at reductions in etop. Looking at the msg_q may also indicate where the bottlenecks are.

If you have more CPU cores than vnodes on each node, I would generally not expect to see CPU maxing out. The things that can cause higher CPU are:

Excessive AAE rebuild/repair activity: https://docs.riak.com/riak/kv/2.2.3/using/admin/riak-admin/index.html#aae-status
Full-sync activity with large deltas between clusters: https://docs.riak.com/riak/kv/latest/using/reference/multi-datacenter/statistics/index.html
Excessive use of 2i queries (especially with back-tracking regular expressions employed). There's some help in stats for tracking this, but normally this is easier to find from application logging.
Large CRDT objects. You can see object merge times in stats to help track this.
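To narrow a stats dump down to the entries relevant to the causes above, filtering by name fragment is usually enough. This is a rough sketch: the fragments `merge_time` and `index_fsm` are assumptions about how the relevant stats are named, so check them against what your own node reports.

```python
def stats_matching(stats, fragments):
    """Return the stats whose names contain any of the given fragments,
    ordered by name so successive dumps are easy to compare."""
    return {
        name: value
        for name, value in sorted(stats.items())
        if any(fragment in name for fragment in fragments)
    }


# Assumed name fragments for CRDT merge timings and 2i query activity.
SUSPECT_FRAGMENTS = ("merge_time", "index_fsm")

if __name__ == "__main__":
    # Hypothetical values, only to show the shape of the input and output.
    example = {
        "object_map_merge_time_mean": 0,
        "index_fsm_create": 0,
        "node_gets": 0,
    }
    for name, value in stats_matching(example, SUSPECT_FRAGMENTS).items():
        print(name, value)
```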

This isn't an exhaustive list. Most of the load testing done post-Basho, as part of releases, tends to involve workloads that prove disk-bound, so high CPU only really occurs due to increasing I/O wait times. We saw increased sys CPU times after the Meltdown/Spectre fixes. We tend to see reduced CPU utilisation with Riak 3.0.

There are some OS tuning guides in the docs, but most of the recommendations don't make a huge difference on their own other than disabling transparent huge pages.

There's a lot of "it depends" as well. Hardware choices, access profiles, number and size of objects, backend choice etc.

Normally, the default answer is to expand the cluster first, as that should usually provide CPU relief while you try and work out what is happening.

3 replies

marcoshaw Mar 16, 2021
Author

Thank you. I will dig into this further.

You mentioned the cores vs vnodes. I'll investigate that. riak-admin status tells me:
riak_kv_vnodes_running=26

Is that the vnode value you're mentioning?
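For pulling single values like this out of the command's output in a script, a small parser is enough. The sketch below is an assumption-laden illustration: it accepts both `key=value` lines as pasted above and `key : value` lines, skips anything else, and only works where `riak-admin` is on the PATH.

```python
import re
import subprocess

# Matches "name=value" or "name : value"; other lines are ignored.
LINE_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*[=:]\s*(.*?)\s*$")


def parse_status(text):
    """Turn status output into a dict of stat name -> raw string value."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stats[match.group(1)] = match.group(2)
    return stats


def read_status():
    """Run `riak-admin status` locally and parse what it prints."""
    result = subprocess.run(
        ["riak-admin", "status"], capture_output=True, text=True, check=True
    )
    return parse_status(result.stdout)


if __name__ == "__main__":
    print(read_status().get("riak_kv_vnodes_running"))
```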

This is a VM, and I just recently increased from 2 vCPUs to 4.

I've noticed my disk growing (/etc/riak/data), so I'll be resizing the disk soon also. I'm hoping symbolic links aren't a problem as I'll probably add a 2nd disk for the data. I'll think about it...

martinsumner Mar 16, 2021

With 4 vCPUs and 25 vnodes, you could definitely max out your CPU.

In our volume/performance Riak test environment we have approximately 1 vCPU per vnode. Such a ratio is not a requirement; it is just an illustration that in your setup very high CPU would not necessarily be unusual, and so you may need to scale.

marcoshaw Mar 16, 2021
Author

I'm reading up a bit... Our ring/partition size is 128 with 5 nodes, so that comes out to 25.6 vnodes/node. I might consider going to 6 CPUs/node, but no higher. I'm not aware of any actual issues even though the CPU is over 90% all the time, but I don't like seeing a CPU in that state in production.
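The arithmetic behind those numbers, as a quick sketch. The figures are the ones from this thread; the result is an average, since each node actually holds a whole number of vnodes (which fits the 26 reported earlier).

```python
def vnodes_per_node(ring_size, nodes):
    """Average number of vnodes each node hosts."""
    return ring_size / nodes


def vnodes_per_vcpu(ring_size, nodes, vcpus):
    """Average number of vnodes sharing each vCPU on a node."""
    return vnodes_per_node(ring_size, nodes) / vcpus


if __name__ == "__main__":
    ring_size, nodes = 128, 5
    print(f"vnodes per node: {vnodes_per_node(ring_size, nodes):.1f}")
    # Compare the current 4 vCPUs with the 6 being considered.
    for vcpus in (4, 6):
        ratio = vnodes_per_vcpu(ring_size, nodes, vcpus)
        print(f"{vcpus} vCPUs: {ratio:.1f} vnodes per vCPU")
```

Going from 4 to 6 vCPUs still leaves several vnodes per vCPU, well short of the roughly 1:1 ratio mentioned for the test environment.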
