Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(WIP) Improve logging and OpenMetrics docs #21089

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mosabua
Copy link
Member

@mosabua mosabua commented Mar 14, 2024

Description

Additional context and related issues

Related to TCB 57

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Mar 14, 2024
@github-actions github-actions bot added the docs label Mar 14, 2024
@mosabua mosabua force-pushed the openmetrics branch 2 times, most recently from 5d58530 to 3f80d5e Compare March 19, 2024 18:27
@lozbrown
Copy link
Contributor

The slack channel recently swiftly educated me that for some metrics I need pull metrics from all nodes including workers and not just the coordinator, Prometheus supports k8s service discovery as standard and most implementations have a standard "prometheus-anotations" job configured, however in general that's unauthenticated.

I created two copies of that, one for the coordinator and one for the workers because in my case:

  • workers require the username (but not password) of a user with system information
  • coordinator requires both username and password and this can only be passed over https so must come in via the ingress (ALB in my case)

@mattstep requested I share some Prometheus configuration required to sour

Firstly in the helm for trino add the annotations

coordinator:
  annotations:
    prometheus.io/trino_scrape: "true"
worker:
  annotations:
    prometheus.io/trino_scrape: "true"

Then in the prometheus config

    - job_name: trino-metrics-worker
      scrape_interval: 10s
      scrape_timeout: 10s
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
        action: keep # scrape only pods with the trino scrape anotation
        regex: true
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep # dont try to scrape non trino container
        regex: trino-worker
      - action: hashmod
        modulus: $(SHARDS)
        source_labels:
        - __address__
        target_label: __tmp_hash
      - action: keep
        regex: $(SHARD)
        source_labels:
        - __tmp_hash
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      metric_relabel_configs:
          - source_labels: [__name__]
            regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
            action: drop # droping some highly granular metrics 
          - source_labels: [__meta_kubernetes_pod_name]
            regex: ".+"
            target_label: pod
            action: replace 
          - source_labels: [__meta_kubernetes_pod_container_name]
            regex: ".+"
            target_label: container
            action: replace 
            
      scheme: http
      tls_config:
        insecure_skip_verify: true
      basic_auth:
        username: mysuer # replace with a user with system information permission 
        # DO NOT ADD PASSWORD
    - job_name: trino-metrics-coordinator
      scrape_interval: 10s
      scrape_timeout: 10s
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
        action: keep # scrape only pods with the trino scrape anotation
        regex: true
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep # dont try to scrape non trino container
        regex: trino-coordinator
      - action: hashmod
        modulus: $(SHARDS)
        source_labels:
        - __address__
        target_label: __tmp_hash
      - action: keep
        regex: $(SHARD)
        source_labels:
        - __tmp_hash
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      - action: replace  # overide the address to the https ingress address 
        target_label: __address__
        replacement: {{ .Values.trinourl }} 
      metric_relabel_configs:
          - source_labels: [__name__]
            regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
            action: drop # droping some highly granular metrics 
          - source_labels: [__meta_kubernetes_pod_name]
            regex: ".+"
            target_label: pod
            action: replace 
          - source_labels: [__meta_kubernetes_pod_container_name]
            regex: ".+"
            target_label: container
            action: replace 
            
      scheme: https
      tls_config:
        insecure_skip_verify: true
      basic_auth:
        username: mysuer # replace with a user with system information permission 
        password_file: /some/password/file

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Apr 11, 2024
@mattstep
Copy link
Contributor

This looks good to me, and I think would benefit from the more complex example from @lozbrown , this gives a more realistic configuration for a production usecase.

@mosabua
Copy link
Member Author

mosabua commented Apr 11, 2024

I have a whole bunch of further info and details from @lozbrown and others that I will add .. just have to get back to working on this.

@lozbrown
Copy link
Contributor

I think it would be good to include a list of the things you typically monitor in your dashboard as experts.

Possibly including promql for these

@mosabua
Copy link
Member Author

mosabua commented Apr 11, 2024

Agreed @lozbrown .. I will work the ones you supplied into this PR .. and if @mattstep or others have more examples.. they can be added too. We can also do more updates after this PR gets merged.

@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Apr 11, 2024
@lozbrown
Copy link
Contributor

@mosabua can we move along some sort of first attempt here on the basis something is better than completely undocumented features

@mosabua
Copy link
Member Author

mosabua commented May 24, 2024

Its on my backlog but I am flat out busy .. I will try to get a minimal PR merged first and expand later at this stage.

Trino also includes a [](/connector/prometheus) that allows you to query
Prometheus data using SQL.

## Example use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i really think that running Prometheus server is out of scope for Trino Documentation


```shell
curl -H X-Trino-User:foo localhost:8080/metrics
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this works for any cluster the localhost should not be necessary (most users will never run trino on their PC)

it's necessary to point out users need system-information read permission

Its necessary to point out for complete metrics all workers will need to be scraped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

3 participants