Grafana Integration #401

Open
dwffls opened this issue Oct 25, 2024 · 17 comments
Labels: enhancement, needs more work, PR welcome 💞, ros2

Comments

@dwffls

dwffls commented Oct 25, 2024

During @ct2034's talk at ROSCon 2024 the idea was started to visualize /diagnostics messages in a Grafana dashboard.
I wanted to restart this conversation here in the form of a feature request.
As I'd like to contribute to this feature I guess the first thing to tackle is the structure of this integration.

My own suggestion is to implement this in the diagnostics_aggregator and piggyback on the publishing of the /diagnostics_agg topic. This data would then be sent to Telegraf (taking inspiration from another talk at ROSCon) to be used later in Grafana.

Happy to hear feedback!

@ct2034
Collaborator

ct2034 commented Oct 25, 2024

Hi
I think it is an interesting idea. Especially if you fine-tune your diagnostics information to contain all necessary status information, this could be a powerful tool for fleets.
And I also think that an aggregator is the right place to implement it. Then people can use the aggregator matching to choose the info to be piped to Grafana. I have not worked with Telegraf before, but it seems to be designed for these kinds of use cases.

@nnarain

nnarain commented Oct 26, 2024

Hey guys. I'm also interested in seeing what can be done here to improve diagnostics.

My company has used an approach like this for many years (not with Telegraf/Grafana, but with a similar stack), and we can visualize diagnostics metrics.

It might be worth discussing the future of rosdiagnostics and figuring out what the scope is. I'd personally think this could be implemented generically to handle any stack.

@dwffls
Author

dwffls commented Oct 26, 2024

@nnarain Could you please explain more about how you would set it up to handle any stack? I guess any implementation (be that Prometheus, Telegraf, or straight to InfluxDB) would need its own configuration.

@nnarain

nnarain commented Oct 26, 2024

So my take on it would be a new composable node that consumes the aggregated diagnostics topic and forwards it to the desired endpoint (Telegraf, Elastic, a network socket, etc.).

I personally wouldn't do this in the aggregator node, so as not to add a new dependency for those who don't want to use a particular metrics stack.

Maybe something like "diagnostics_telegraf".

@dwffls
Author

dwffls commented Oct 30, 2024

Sending data to either InfluxDB itself or Telegraf works by sending a small HTTP request with the data formatted in a special text format, like this:

home,room=Living\ Room temp=21.1,hum=35.9,co=0i 1641024000
home,room=Kitchen temp=21.0,hum=35.9,co=0i 1641024000
home,room=Living\ Room temp=21.4,hum=35.9,co=0i 1641027600
home,room=Kitchen temp=23.0,hum=36.2,co=0i 1641027600
home,room=Living\ Room temp=21.8,hum=36.0,co=0i 1641031200

The only extra dependency we have to add to the aggregator node is curl. Personally I do not see this as a problem to include. @ct2034 What do you think?
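
A minimal sketch of how such lines can be assembled, assuming only numeric and boolean fields as in the sample above. `escape_key`, `format_field`, and `to_line` are hypothetical helper names, not part of any InfluxDB client library. The escaping covers the comma, equals sign, and space, which the format treats as separators in tag keys, tag values, and field keys.

```python
def escape_key(text: str) -> str:
    """Escape separator characters in tag keys, tag values, and field keys."""
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def format_field(value) -> str:
    """Render a numeric or boolean field value; integers carry the 'i' suffix."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line(measurement: str, tags: dict, fields: dict, timestamp: int) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("a line needs at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{escape_key(k)}={escape_key(v)}" for k, v in tags.items())
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp}"


# Rebuilds the first sample line from the comment above.
print(to_line("home", {"room": "Living Room"}, {"temp": 21.1, "hum": 35.9, "co": 0}, 1641024000))
```

String-valued fields are left out here on purpose; they need quoting, which this sketch does not handle.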

@nnarain

nnarain commented Oct 30, 2024

Ya I'd imagine a lot of these tools just use JSON.

So along the lines of what I mentioned earlier it might be a composable node that converts the DiagnosticArray into a JSON payload and sends it to an endpoint.

It sounds like a good use of composition to me. But it depends on what is and is not in scope of the aggregator
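
To make the JSON idea concrete, here is a rough sketch of the conversion step only. The dataclasses are stand-ins that mirror the fields of diagnostic_msgs/DiagnosticArray so the example runs without ROS; the JSON layout and the name `to_json_payload` are assumptions for illustration, not the schema of any particular backend.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class KeyValue:
    key: str
    value: str


@dataclass
class DiagnosticStatus:
    level: int  # 0=OK, 1=WARN, 2=ERROR, 3=STALE
    name: str
    message: str
    hardware_id: str = ""
    values: List[KeyValue] = field(default_factory=list)


@dataclass
class DiagnosticArray:
    stamp_sec: int
    stamp_nanosec: int
    status: List[DiagnosticStatus] = field(default_factory=list)


def to_json_payload(array: DiagnosticArray) -> str:
    """Serialize the whole array into one compact JSON document."""
    return json.dumps(asdict(array), separators=(",", ":"))


# Hypothetical sample data in the shape of an aggregated diagnostics message.
example = DiagnosticArray(
    stamp_sec=1734975372,
    stamp_nanosec=26533256,
    status=[
        DiagnosticStatus(
            level=0,
            name="/Aggregation/Arms",
            message="OK",
            values=[KeyValue("/arms/left/motor", "OK")],
        )
    ],
)
print(to_json_payload(example))
```

In a real composable node the stand-in classes would be replaced by the actual message type, and the resulting string would be handed to whatever transport the endpoint expects.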

@ct2034
Collaborator

ct2034 commented Dec 3, 2024

I have thought about this again. Yes, it is only a dependency on curl. But I think it should be a separate package, just to separate the concerns more clearly. Then we would also be able to support other backends down the line. It is also functionality that I don't think belongs to the default feature set one expects from diagnostics, so it should be in its own package.

ct2034 self-assigned this Dec 3, 2024
ct2034 added the enhancement, ros2, needs more work, and PR welcome 💞 labels Dec 3, 2024

@dwffls
Author

dwffls commented Dec 4, 2024

All right, that seals it. I have some time on my hands to start work on this and will post the fork here when I have something up and running.

To start with, I'll name the package "diagnostics_remote" and the node "telegraf". Any input on this naming is appreciated.

@ct2034
Collaborator

ct2034 commented Dec 4, 2024

Sounds good. :) Looking forward to seeing what you come up with.

For the package naming, I am thinking about something like:

  • diagnostics_remote_bridge
  • diagnostics_remote_logging
  • diagnostics_remote_export

I wanted to find something a little more descriptive.

The node naming sounds good. Then we could have other node names for other backends. I think that makes sense.

@dwffls
Author

dwffls commented Dec 4, 2024

I'll start with diagnostics_remote_logging; if anything better comes up in this thread, I'll change it.

@dwffls
Author

dwffls commented Dec 4, 2024

I've prepared a working version of the diagnostics code, available at https://github.com/dwffls/diagnostics.

The conversion logic for diagnostics messages to the InfluxDB line protocol is in a separate header file for reusability, such as in nodes sending data directly to InfluxDB.

Testing

Set up InfluxDB (e.g., InfluxDB Cloud) and a local Telegraf instance. I've followed this guide.

Finally, add this to /etc/telegraf/telegraf.conf:

[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"

Once set up, data should appear in the InfluxDB UI.
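
For anyone who wants to poke at the listener without the node, the HTTP side can be sketched with the Python standard library. This is only an illustration, not the code in the fork: `build_write_request` and `send` are hypothetical names, and the URL assumes Telegraf runs on localhost with the listener config above.

```python
import urllib.error
import urllib.request

# Assumption: local Telegraf using the http_listener_v2 config shown above.
TELEGRAF_URL = "http://localhost:8186/telegraf"


def build_write_request(lines, url=TELEGRAF_URL):
    """Join line-protocol lines with newlines and wrap them in a POST request."""
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def send(request, timeout=2.0):
    """Send the request; return (status code, response body on HTTP errors)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, ""
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise; the body usually says what failed to parse.
        return error.code, error.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    request = build_write_request(["home,room=Kitchen temp=21.0,hum=35.9,co=0i 1641024000"])
    print(send(request))
```

Returning the error body instead of discarding it makes a rejected payload much easier to diagnose.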

Feedback on the code and/or its structure is welcome!

@avanmalleghem

I'm really interested in this topic.

Here is the ROSCon talk @dwffls mentioned: https://vimeo.com/1024971769
There is also a GitHub repository related to it: https://github.com/bonsairobotics/ros_health_components

You can see the telegraf_bridge package for example.

@dwffls, I will definitely have a look at your repo 👍

@avanmalleghem

@dwffls I tried your repo on Humble and ran into the following issue:

  • I started Telegraf with Docker: docker run -p 8186:8186 -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf with the following config file:
[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"
[[outputs.file]]
  • I started your node : ros2 run diagnostic_remote_logging telegraf

And I receive {"error":"http: bad request"} whenever your node tries to send data to Telegraf. I tried a dummy command, curl -i -XPOST 'http://localhost:8186/telegraf' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000', and it works, so I guess something is missing in your node?

In addition, the documentation of http_listener_v2 recommends using the [influxdb_v2_listener](https://github.com/influxdata/telegraf/blob/release-1.32/plugins/inputs/influxdb_v2_listener/README.md) instead of http_listener_v2 (but I guess that won't solve the issue).

@dwffls
Author

dwffls commented Dec 18, 2024

Could you pull the repository again? I've added some more error handling for when it posts to Telegraf.
It should now output the whole influx line when a bad request happens. It will probably still error out with the new code, but now it shows what it tries to post so that I can debug it. There is probably a problem in the conversion to the influx line protocol. So when it errors out, could you send me the new output?

As to the whole http_listener_v2 vs influxdb_v2_listener question, I think you are right: we should be using the new influxdb_v2_listener. I've changed the default URL to reflect this. telegraf.conf should now look like this:

[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
[[outputs.file]]

As we are now using the full influxdb input we could change the node to be a full "influxdb" node with an example in the readme to use telegraf as a proxy. Kind of on the fence about this one...

Edit: I've started the "rewrite" on a separate branch to send data directly to InfluxDB as an option. A README will follow with instructions for both Telegraf and InfluxDB itself.

Let me know if anything else doesn't work.

@dwffls
Author

dwffls commented Dec 18, 2024

I've switched to the influx_db branch for development. Please check it out, and also see the README for examples of how to run it.

@avanmalleghem

@dwffls I still run into the same issue, but at least I have a more verbose log:

  • Terminal 1 : ros2 run diagnostic_updater example --ros-args -r diagnostics:=diagnostics_agg
  • Terminal 2 : docker run -p 8086:8086 -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf with the following file :
[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
[[outputs.file]]
  • Terminal 3 : ros2 run diagnostic_remote_logging influx. This terminal gives the following output :
[ERROR] [1734970053.916198230] [influxdb]: Error (400) when sending to telegraf:
Device-27-46,ns=none level=2,message="Buckle your seat belt. Launch in 0.000000 seconds!",Diagnostic\ Name=dummy,Time\ to\ Launch=0,Geeky\ thing\ to\ say=The\ square\ of\ the\ time\ to\ launch\ 0.000000\ is\ 0.000000 1734970053913356304
Device-27-46,ns=none level=1,message="This is a silly updater.",Stupidicity\ of\ this\ updater=1000 1734970053913356304
Device-27-46,ns=none level=2,message="Too low",Events\ in\ window=20,Events\ since\ startup=1622,Duration\ of\ window\ (s)=9.999948,Actual\ frequency\ (Hz)=2.000010,Minimum\ acceptable\ frequency\ (Hz)=0.450000,Maximum\ acceptable\ frequency\ (Hz)=2.200000,Low-Side\ Margin=-5 1734970053913356304

[ERROR] [1734970053.916443130] [influxdb]: Failed to send /diagnostics_agg to telegraf
^C[INFO] [1734970054.450720740] [rclcpp]: signal_handler(signum=2)
quare\\\\ of\\\\ the\\\\ time\\\\ to\\\\ launch\\\\ 0.000000\\\\ is\\\\ 0.000000 1734970052913400411\"","op":""}{"code":"invalid","err":"metric parse error: expected field at 1:108: \"Device-27-46,ns=none level=2,message=\\\"Buckle your seat belt. Launch in 0.000000 seconds!\\\",Diagnostic\\\\ Name=dummy,Time\\\\ to\\\\ Launch=0,Geeky\\\\ thing\\\\ to\\\\ say=The\\\\ square\\\\ of\\\\ the\\\\ time\\\\ to\\\\ launch\\\\ 0.000000\\\\ is\\\\ 0.000000 1734970053913356304\"","message":"metric parse error: expected field at 1:108: \"Device-27-46,ns=none level=2,message=\\\"Buckle your seat belt. Launch in 0.000000 seconds!\\\",Diagnostic\\\\ Name=dummy,Time\\\\ to\\\\ Launch=0,Geeky\\\\ thing\\\\ to\\\\ say=The\\\\ square\\\\ of\\\\ the\\\\ time\\\\ to\\\\ launch\\\\ 0.000000\\\\ is\\\\ 0.000000 1734970053913356304\"","op":""}

I'm not sure about what I'm doing in terminal 1. I don't know whether I can just remap diagnostics to diagnostics_agg (both topics use the same message type, so I guess it is OK?).
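
One reading of the Telegraf response: by my count, position 1:108 in the first rejected line falls on the unquoted value dummy, the first field value that is neither a number nor a quoted string. Line protocol requires string field values to be wrapped in double quotes (as message already is), so the bare text values are a likely cause of the 400. A minimal sketch of value formatting under that assumption; `format_field_value` is a hypothetical helper, not taken from the repository:

```python
import math


def format_field_value(text: str) -> str:
    """Emit finite numbers bare and everything else as a quoted string.

    Diagnostic KeyValue values are always strings, so the type is guessed
    here: anything that parses as a finite float is written as a float.
    """
    try:
        number = float(text)
    except ValueError:
        number = None
    if number is not None and math.isfinite(number):
        return repr(number)
    # Inside a quoted string, only backslash and double quote need escaping.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


for sample in ("dummy", "1000", "9.999948", "The square of the time to launch"):
    print(format_field_value(sample))
```

Writing every numeric value as a float is a deliberate simplification: it avoids a field being stored as an integer in one message and as a float in the next.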

@avanmalleghem

@dwffls I dug deeper into it:

  • Terminal 1 : ros2 launch diagnostic_aggregator example.launch.py so that I'm sure the aggregated data is correct on the topic.
  • Terminal 2 : same command
  • Terminal 3 : same command, and there are no error logs anymore (I don't provide a configuration file because the default setup seems OK, if I read your documentation correctly)

Unfortunately, in terminal 2 (telegraf), I receive only toplevel_state data :

toplevel_state,host=c1f317f8c0a2 level=0 1734975170050394773
toplevel_state,host=c1f317f8c0a2 level=0 1734975171047563314
toplevel_state,host=c1f317f8c0a2 level=0 1734975172050791978
toplevel_state,host=c1f317f8c0a2 level=1 1734975173050238132
toplevel_state,host=c1f317f8c0a2 level=2 1734975174050949573
toplevel_state,host=c1f317f8c0a2 level=1 1734975175051484584
toplevel_state,host=c1f317f8c0a2 level=0 1734975176051512479
toplevel_state,host=c1f317f8c0a2 level=0 1734975177051590442
toplevel_state,host=c1f317f8c0a2 level=1 1734975178050374746
toplevel_state,host=c1f317f8c0a2 level=0 1734975179051080703
toplevel_state,host=c1f317f8c0a2 level=1 1734975180050639867
toplevel_state,host=c1f317f8c0a2 level=1 1734975181051602363
toplevel_state,host=c1f317f8c0a2 level=0 1734975182052508831
toplevel_state,host=c1f317f8c0a2 level=0 1734975183052811622
toplevel_state,host=c1f317f8c0a2 level=1 1734975184048020362
toplevel_state,host=c1f317f8c0a2 level=1 1734975185049937486
toplevel_state,host=c1f317f8c0a2 level=1 1734975186051164865
toplevel_state,host=c1f317f8c0a2 level=1 1734975187049219369

So it looks like I don't receive diagnostics_agg data (the topic exists and data is published on it). Here is an example message :

header:
  stamp:
    sec: 1734975372
    nanosec: 26533256
  frame_id: ''
status:
- level: "\0"
  name: /Aggregation/Arms
  message: OK
  hardware_id: ''
  values:
  - key: /arms/left/motor
    value: OK
  - key: /arms/right/motor
    value: OK
- level: "\0"
  name: /Aggregation/Arms/ arms left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Arms/ arms right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Legs
  message: OK
  hardware_id: ''
  values:
  - key: /legs/left/motor
    value: OK
  - key: /legs/right/motor
    value: OK
- level: "\0"
  name: /Aggregation/Legs/ legs left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Legs/ legs right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Motors
  message: OK
  hardware_id: ''
  values:
  - key: /arms/left/motor
    value: OK
  - key: /arms/right/motor
    value: OK
  - key: /legs/left/motor
    value: OK
  - key: /legs/right/motor
    value: OK
- level: "\0"
  name: /Aggregation/Motors/ arms left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Motors/ arms right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Motors/ legs left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Motors/ legs right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Optional
  message: OK
  hardware_id: ''
  values:
  - key: /optional/runtime/analyzer
    value: OK
- level: "\0"
  name: /Aggregation/Optional/ optional runtime analyzer
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Sensors
  message: OK
  hardware_id: ''
  values:
  - key: /sensors/front/cam
    value: OK
  - key: /sensors/left/cam
    value: OK
  - key: /sensors/rear/cam
    value: OK
  - key: /sensors/right/cam
    value: OK
- level: "\0"
  name: /Aggregation/Sensors/ sensors front cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Sensors/ sensors left cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Sensors/ sensors rear cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Sensors/ sensors right cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Left
  message: OK
  hardware_id: ''
  values:
  - key: /arms/left/motor
    value: OK
  - key: /legs/left/motor
    value: OK
  - key: /sensors/left/cam
    value: OK
- level: "\0"
  name: /Aggregation/Topology/Left/ arms left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Left/ legs left motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Left/ sensors left cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Right
  message: OK
  hardware_id: ''
  values:
  - key: /arms/right/motor
    value: OK
  - key: /legs/right/motor
    value: OK
  - key: /sensors/right/cam
    value: OK
- level: "\0"
  name: /Aggregation/Topology/Right/ arms right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Right/ legs right motor
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology/Right/ sensors right cam
  message: OK
  hardware_id: ''
  values: []
- level: "\0"
  name: /Aggregation/Topology
  message: OK
  hardware_id: ''
  values:
  - key: Left
    value: OK
  - key: Right
    value: OK
- level: "\0"
  name: /Aggregation
  message: OK
  hardware_id: ''
  values:
  - key: Arms
    value: OK
  - key: Legs
    value: OK
  - key: Motors
    value: OK
  - key: Optional
    value: OK
  - key: Sensors
    value: OK
  - key: /Aggregation/Topology
    value: OK
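
A note for reading the two outputs above: DiagnosticStatus defines its levels as OK=0, WARN=1, ERROR=2, and STALE=3, and level is a byte field, which is why the echoed message shows "\0" for OK. The small sketch below decodes both representations. `decode_level` and `level_from_line` are hypothetical helpers, and the line parser only handles simple lines like the toplevel_state output, with no escaped spaces.

```python
LEVEL_NAMES = {0: "OK", 1: "WARN", 2: "ERROR", 3: "STALE"}


def decode_level(raw) -> str:
    """Accept an int, a digit string, a one-byte bytes value, or a char like '\\0'."""
    if isinstance(raw, (bytes, bytearray)):
        number = raw[0]
    elif isinstance(raw, str) and len(raw) == 1 and not raw.isdigit():
        number = ord(raw)  # e.g. "\0" as printed by the topic echo
    else:
        number = int(raw)
    return LEVEL_NAMES.get(number, f"UNKNOWN({number})")


def level_from_line(line: str) -> str:
    """Extract and decode the level field from a simple line-protocol line."""
    _series, fields, _timestamp = line.split(" ")
    values = dict(pair.split("=", 1) for pair in fields.split(","))
    return decode_level(values["level"])


print(decode_level("\0"))
print(level_from_line("toplevel_state,host=c1f317f8c0a2 level=2 1734975174050949573"))
```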


4 participants