[deployer] Add Profiling #629

Draft
wants to merge 27 commits into
base: main

patrick-ogrady (Contributor) commented Mar 18, 2025

TODO

  • Add runtime::tokio support for collect_system_metrics(), expose_metrics(), and push_profiles()
  • Fix README (push not pull)
  • Download flame graph file directly instead of git clone
  • Convert folds to microseconds rather than sample count for Pyroscope compatibility
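
The last TODO item could look roughly like the sketch below. It assumes each sample taken at F Hz stands for about 1/F seconds of CPU time, and that the input is the usual folded format (`frame;frame;... count`). The function name is hypothetical and not part of the deployer.

```rust
/// Rewrite folded stacks so each trailing value is an estimated duration in
/// microseconds instead of a raw sample count.
/// Hypothetical helper for illustration; not part of the deployer.
pub fn folded_counts_to_micros(folded: &str, sample_hz: u64) -> String {
    assert!(sample_hz > 0, "sample rate must be positive");
    let mut out = String::new();
    for line in folded.lines() {
        // Each folded line is "frame;frame;... <count>"; split on the last space.
        let Some((stack, count)) = line.trim_end().rsplit_once(' ') else {
            continue; // skip blank or malformed lines
        };
        let Ok(samples) = count.parse::<u64>() else {
            continue; // skip lines whose trailing field is not a count
        };
        // Assumption: one sample at F Hz represents roughly 1/F seconds.
        let micros = samples.saturating_mul(1_000_000) / sample_hz;
        out.push_str(stack);
        out.push(' ');
        out.push_str(&micros.to_string());
        out.push('\n');
    }
    out
}
```

With `perf record -F 99`, `folded_counts_to_micros(&stacks, 99)` would turn a stack seen 99 times into 1000000 microseconds. Integer division truncates, so very small counts lose a fraction of a microsecond.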

patrick-ogrady requested a review from Copilot on March 18, 2025 22:58

Pull Request Overview

This PR adds Pyroscope profiling support to the deployer, updating monitoring configuration, AWS security groups, and documentation to incorporate the new profiling service.

  • Adds Pyroscope version constants, service configuration, and installation command updates in Rust code.
  • Updates AWS security groups and instance documentation to expose the new profiling port.
  • Includes Pyroscope dependencies in the example project configuration.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| deployer/src/ec2/services.rs | Added Pyroscope constants, service file content, and configuration generation. |
| examples/flood/Cargo.toml | Added Pyroscope dependencies. |
| deployer/src/ec2/aws.rs | Updated port mapping to reference new profiling port constants. |
| deployer/src/ec2/mod.rs | Updated documentation to include Pyroscope and added new port definitions. |
| deployer/src/ec2/create.rs | Integrated Pyroscope service and configuration file handling during deployment. |

Comments suppressed due to low confidence (2)

examples/flood/Cargo.toml:32

  • Verify if 'pyroscope_pprofrs' is correctly spelled. It might be intended to be 'pyroscope_pprof' or another similar name.
pyroscope_pprofrs = "0.2.8"

deployer/src/ec2/services.rs:165

  • [nitpick] For consistency with generate_prometheus_config where the IP is named 'ip', consider renaming 'private_ip' to 'ip'.
for (name, private_ip, region) in binary_instances {

patrick-ogrady (Contributor Author) commented Mar 19, 2025

This code regularly segfaults on Ubuntu (as far as I can tell, because of a bug in pprof).

My guess is that pprof2 is out of date and suffers from this bug: tikv/pprof-rs#255

(It may have been fixed here but not released yet: grafana/pyroscope-rs#192)

If that is the issue, it should be possible to avoid it by writing our own "backend": https://github.com/grafana/pyroscope-rs/blob/main/pyroscope_backends/pyroscope_pprofrs/src/lib.rs

patrick-ogrady (Contributor Author) commented Mar 19, 2025

Didn't help:

Mar 19 14:54:44 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: binary.service: Consumed 2min 42.575s CPU time.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 1.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: binary.service: Consumed 2min 42.575s CPU time.
Mar 19 14:56:21 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: binary.service: Consumed 58.670s CPU time.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 2.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: binary.service: Consumed 58.670s CPU time.
Mar 19 14:56:58 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 47.018s CPU time.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 3.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 47.018s CPU time.
Mar 19 14:58:00 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:58:10 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:58:10 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:58:10 ip-10-2-1-142 systemd[1]: binary.service: Consumed 16.659s CPU time.
Mar 19 14:58:11 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 4.
Mar 19 14:58:11 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:58:11 ip-10-2-1-142 systemd[1]: binary.service: Consumed 16.659s CPU time.
Mar 19 14:58:11 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 42.811s CPU time.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 5.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 42.811s CPU time.
Mar 19 14:59:11 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: binary.service: Consumed 28.984s CPU time.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 6.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: binary.service: Consumed 28.984s CPU time.
Mar 19 14:59:28 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: binary.service: Consumed 38.131s CPU time.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 7.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: binary.service: Consumed 38.131s CPU time.
Mar 19 14:59:52 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.
Mar 19 15:00:40 ip-10-2-1-142 systemd[1]: binary.service: Main process exited, code=dumped, status=11/SEGV
Mar 19 15:00:40 ip-10-2-1-142 systemd[1]: binary.service: Failed with result 'core-dump'.
Mar 19 15:00:40 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 15.173s CPU time.
Mar 19 15:00:41 ip-10-2-1-142 systemd[1]: binary.service: Scheduled restart job, restart counter is at 8.
Mar 19 15:00:41 ip-10-2-1-142 systemd[1]: Stopped Deployed Binary Service.
Mar 19 15:00:41 ip-10-2-1-142 systemd[1]: binary.service: Consumed 1min 15.173s CPU time.
Mar 19 15:00:41 ip-10-2-1-142 systemd[1]: Started Deployed Binary Service.

Could this be related to running on ARM?

patrick-ogrady (Contributor Author) commented Mar 20, 2025

Debugging with gdb:

gdb --args /home/ubuntu/binary --peers=/home/ubuntu/peers.yaml --config=/home/ubuntu/config.conf
Thread 2 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffff5a0ee80 (LWP 5868)]
0x0000fffff7eed4c8 in ?? () from /lib/aarch64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x0000fffff7eed4c8 in ?? () from /lib/aarch64-linux-gnu/libgcc_s.so.1
#1  0x0000fffff7eeec18 in _Unwind_Backtrace () from /lib/aarch64-linux-gnu/libgcc_s.so.1
#2  0x0000aaaaaae0bebc in perf_signal_handler ()
#3  <signal handler called>
#4  0x0000aaaaaaffcc9c in tokio::runtime::scheduler::multi_thread::worker::<impl tokio::runtime::task::Schedule for alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::schedule ()
#5  0xfffffffffffffff0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

patrick-ogrady (Contributor Author) commented

Even when the profiler runs in a dedicated thread, it still segfaults:

Thread 4 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffff620ee60 (LWP 3818)]
0x0000fffff7eed4c8 in ?? () from /lib/aarch64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x0000fffff7eed4c8 in ?? () from /lib/aarch64-linux-gnu/libgcc_s.so.1
#1  0x0000fffff7eeec18 in _Unwind_Backtrace () from /lib/aarch64-linux-gnu/libgcc_s.so.1
#2  0x0000aaaaaad7bfb8 in perf_signal_handler ()
#3  <signal handler called>
#4  0x0000aaaaaab98fd8 in core::hash::BuildHasher::hash_one ()
#5  0x0000ffff00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

patrick-ogrady (Contributor Author) commented Mar 20, 2025

Consider adding (if we just use perf):

-C force-frame-pointers=y

patrick-ogrady (Contributor Author) commented Mar 20, 2025

It works with plain perf (but the timestamps are wrong):

PID=$(systemctl show --property MainPID binary.service | cut -d= -f2)
sudo perf record -F 99 -p $PID -g -- sleep 60
git clone --depth 1 https://github.com/brendangregg/FlameGraph.git /tmp/flamegraph
sudo cp /tmp/flamegraph/stackcollapse-perf.pl /home/ubuntu/
sudo chmod +x /home/ubuntu/stackcollapse-perf.pl
sudo perf script -i perf.data | /home/ubuntu/stackcollapse-perf.pl > /tmp/perf.stack
curl -X POST "10.1.1.209:4040/ingest?name=flood&format=folded&sampleRate=99&units=samples&aggregationType=sum&from=1742484667&until=1742484727" --data-binary @/tmp/perf.stack --header "Content-Type: text/plain" -v

(The units param is probably wrong; we may need to convert sample counts to time.)
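
On the timestamps: the `from` and `until` values in the curl command are hard-coded, so one possible fix is to derive the window from the moment the capture ends. The sketch below is only an illustration. The function names are hypothetical, and the query parameters are copied unchanged from the command above (including `units=samples`, which may itself need to change).

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current Unix time in whole seconds (0 if the clock is before the epoch).
pub fn now_unix_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

/// Build an ingest URL whose time window ends at `until_secs` and spans
/// `duration_secs`. Hypothetical helper; the parameters mirror the curl
/// command in this comment and are not checked against the Pyroscope docs.
pub fn ingest_url(
    host: &str,
    app: &str,
    sample_hz: u32,
    until_secs: u64,
    duration_secs: u64,
) -> String {
    // The window start is the capture end minus the capture length.
    let from_secs = until_secs.saturating_sub(duration_secs);
    format!(
        "http://{host}/ingest?name={app}&format=folded&sampleRate={sample_hz}\
         &units=samples&aggregationType=sum&from={from_secs}&until={until_secs}"
    )
}
```

Calling `ingest_url("10.1.1.209:4040", "flood", 99, now_unix_secs(), 60)` right after `perf record ... sleep 60` returns would line the window up with the 60-second capture instead of a fixed pair of timestamps.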
