
Conversation

@neutralalice

I am interested in collecting interrupt request metrics within Talos. I believe this is the only kernel config definition needed to do so; it usually generates a /proc/pressure/irq file which can be ingested into a TSDB.
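
For context on the ingestion side, here is a minimal sketch of what a scraper could do with this file, assuming the standard pressure-stall (PSI) line format the kernel uses for /proc/pressure/* files; the names here (psiLine, parsePSI) are illustrative and not part of any existing exporter.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// psiLine holds one parsed pressure line, e.g.
// "full avg10=0.00 avg60=0.00 avg300=0.00 total=0".
type psiLine struct {
	Kind    string  // "some" or "full"
	Avg10   float64 // % of wall time stalled, 10s window
	Avg60   float64
	Avg300  float64
	TotalUS uint64 // cumulative stall time in microseconds
}

// parsePSI reads a /proc/pressure/* file and returns its lines.
func parsePSI(path string) ([]psiLine, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var lines []psiLine
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 5 {
			continue // not a standard PSI line; skip defensively
		}
		l := psiLine{Kind: fields[0]}
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				continue
			}
			switch key {
			case "avg10":
				l.Avg10, _ = strconv.ParseFloat(val, 64)
			case "avg60":
				l.Avg60, _ = strconv.ParseFloat(val, 64)
			case "avg300":
				l.Avg300, _ = strconv.ParseFloat(val, 64)
			case "total":
				l.TotalUS, _ = strconv.ParseUint(val, 10, 64)
			}
		}
		lines = append(lines, l)
	}
	return lines, sc.Err()
}

func main() {
	lines, err := parsePSI("/proc/pressure/irq")
	if err != nil {
		fmt.Fprintln(os.Stderr, "irq pressure not available:", err)
		os.Exit(1)
	}
	for _, l := range lines {
		// "total" is the monotonic counter a scraper would export and rate().
		fmt.Printf("irq %s: avg10=%.2f%% avg60=%.2f%% total=%dus\n",
			l.Kind, l.Avg10, l.Avg60, l.TotalUS)
	}
}
```

In practice the `total` counter (cumulative stall time in microseconds) is what you would ship to the TSDB and rate over time; the avg10/avg60/avg300 fields are already windowed percentages.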

Considerations: there is a small performance hit according to sources online; I am unsure whether Talos has already trialed this config option and evaluated the hit. Some other distributions do build with this option (notably RHEL).

Enables fine-grained interrupt request time accounting.

Signed-off-by: arita <[email protected]>
@github-project-automation github-project-automation bot moved this to To Do in Planning Jan 17, 2026
@talos-bot talos-bot moved this from To Do to In Review in Planning Jan 17, 2026
@frezbo frezbo requested a review from dsseng January 19, 2026 07:26
@dsseng dsseng requested a review from smira January 20, 2026 16:13
Member

@dsseng dsseng left a comment


This change is okay; however, it would be interesting to hear the use cases.

Measuring interrupt time is going to have the most impact on things like random I/O, and is said to have a small yet measurable overhead, so it would be nice to know what advantages this brings for various use cases.

@smira smira moved this from In Review to On Hold in Planning Jan 20, 2026
@neutralalice
Author

neutralalice commented Jan 20, 2026

This change is okay; however, it would be interesting to hear the use cases.

Measuring interrupt time is going to have the most impact on things like random I/O, and is said to have a small yet measurable overhead, so it would be nice to know what advantages this brings for various use cases.

Context-wise: generally this is coming from an HPC-centered approach, where we are often interested both in performance and in looking for indicators of possible hardware issues. There's a balance, but for us it's not necessarily about chasing numbers/benchmarks.

We've got three clusters (generally serving scientific workloads for climate/hazards modeling):

  1. A traditional HPC cluster (no k8s) backed by Slurm, where we make heavy use of MPI; often we are solving the inverse problem. The workloads here vary hugely: some are network- (InfiniBand) and storage- (GPFS) intensive, and others just need large amounts of compute/memory. Usually we are memory-bound, but some workloads are CPU-bound, and it is difficult at times to tell whether the CPU is genuinely not keeping up or is losing time to context switching. We have IRQ monitoring here and have at times used it to recommend changes to code.

  2. A typical k8s cluster on RHEL, serving Kubeflow. We don't tend to have a lot of issues with this cluster, but it has InfiniBand/storage attachments as above, and we do have IRQ monitoring here. The same people as above are often writing new workflows (heavy batch-scheduled jobs and reflex jobs) to run here. Overall our main problem is with provisioning/pulling nodes, which is where cluster 3 hopefully changes some things for us.

  3. A skunkworks Talos cluster. This is really in an evaluation stage; having IRQ metrics available wouldn't make or break us using Talos, it's just there to help us guide possible optimizations for the end scientists to look at.

@neutralalice
Author

It's also worth noting and evaluating the ongoing refactor in this space: https://lore.kernel.org/all/[email protected]/
