
[Bug]? prometheus-exporter-plugin-for-opensearch shows the cluster status as red, but the cluster status viewed directly via the OpenSearch API is always yellow. #278

Open
east4ming opened this issue May 29, 2024 · 6 comments

@east4ming

Background

We switched from Elasticsearch (ES) to OpenSearch not long ago.

Recently, we ran into the following case:

The prometheus-exporter-plugin-for-opensearch shows the cluster status as red, but querying the cluster status directly (via the OpenSearch API) always returns yellow (no transition to red).

See the following figure for details:
[Screenshot: opensearch_cluster_status from prometheus-exporter-plugin-for-opensearch in the Prometheus UI]

[Screenshot: OpenSearch cluster status retrieved with curl from the OpenSearch API]

Note: 07:00-07:05 UTC corresponds to 15:00-15:05 UTC+8. The two screenshots above cover the same time window.

Details

  1. Our OpenSearch cluster is a single node, so its state should always be yellow;
  2. OpenSearch version: 2.12.0
  3. prometheus-exporter version: 2.12.0.0
  4. Command used to view the OpenSearch API status (see the example below): curl -XGET -u username:password localhost:9200/_cat/health?v
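A rough sketch of that check, assuming placeholder credentials (the second call is just an alternative JSON view, not part of the original report):

# Hedged example of the health check from item 4 (username/password are placeholders).
# On a single-node cluster whose indices have replicas configured, the "status" column
# is expected to read "yellow", because replica shards can never be assigned.
curl -XGET -u username:password "localhost:9200/_cat/health?v"

# Optional JSON view of the same information:
curl -XGET -u username:password "localhost:9200/_cluster/health?pretty"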

Other

If you need more details, please reply and I'll attach them.

Thank you, sir.

@lukas-vlcek lukas-vlcek self-assigned this May 29, 2024
@lukas-vlcek
Collaborator

lukas-vlcek commented May 29, 2024

@east4ming Hi, thanks for reporting. Before I investigate further I have some questions:

  • What action caused the cluster health state to change? Was it index creation?
  • And what is the Prometheus scraping interval?

In other words, is it possible that Prometheus scraped the metrics right after the index was created but before even the primary shards were allocated (thus the state would be red 🔴)? This can happen for a very short period of time, but if that is the moment Prometheus scrapes the metrics, then the next update of the metric will come with the next scraping cycle.
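One way to narrow this down (a rough sketch; host and credentials are placeholders) is to read both views back to back and compare what the exporter reports with what the cluster health API reports:

# Rough diagnostic sketch; host and credentials are placeholders.
# Read the status the exporter exposes and the status the cluster API reports, back to back.
curl -s -u username:password "localhost:9200/_prometheus/metrics" | grep "^opensearch_cluster_status"
curl -s -u username:password "localhost:9200/_cat/health?v"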

@east4ming
Author

Thank you for your quick reply. Answers:

Q: What action caused the cluster health state to change? Was it index creation?
A: You're right. The status goes red while the index close is in progress, and then the index delete follows.

Q: And what is the Prometheus scraping interval?
A: See the YAML below: scrape_interval: 1m and scrape_timeout: 30s. (Most of this configuration follows our old ES configuration, e.g. scrape_interval, with a few changes to fit OpenSearch, and ES never produced a red-status false positive.)

global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
- job_name: 'log_opensearch'
  scrape_timeout: 30s
  static_configs:
      - targets:
        - 192.168.1.1:9200
  metrics_path: "/_prometheus/metrics"
  basic_auth:
    username: 'xxxx'
    password: 'xxxxxxxx'
  metric_relabel_configs:
  - action: keep
    regex: opensearch_circuitbreaker_tripped_count|opensearch_cluster_datanodes_number|opensearch_cluster_nodes_number|opensearch_cluster_pending_tasks_number|opensearch_cluster_shards_active_percent|opensearch_cluster_shards_number|opensearch_cluster_status|opensearch_cluster_task_max_waiting_time_seconds|opensearch_fs_io_total_read_bytes|opensearch_fs_io_total_write_bytes|opensearch_fs_path_free_bytes|opensearch_fs_path_total_bytes|opensearch_index_fielddata_evictions_count|opensearch_index_flush_total_count|opensearch_index_flush_total_time_seconds|opensearch_index_indexing_delete_current_number|opensearch_index_indexing_index_count|opensearch_index_indexing_index_current_number|opensearch_index_indexing_index_failed_count|opensearch_index_indexing_index_time_seconds|opensearch_index_merges_current_size_bytes|opensearch_index_merges_total_docs_count|opensearch_index_merges_total_stopped_time_seconds|opensearch_index_merges_total_throttled_time_seconds|opensearch_index_merges_total_time_seconds|opensearch_index_querycache_evictions_count|opensearch_index_querycache_hit_count|opensearch_index_querycache_memory_size_bytes|opensearch_index_querycache_miss_number|opensearch_index_refresh_total_count|opensearch_index_refresh_total_time_seconds|opensearch_index_requestcache_evictions_count|opensearch_index_requestcache_hit_count|opensearch_index_requestcache_memory_size_bytes|opensearch_index_requestcache_miss_count|opensearch_index_search_fetch_count|opensearch_index_search_fetch_current_number|opensearch_index_search_fetch_time_seconds|opensearch_index_search_query_count|opensearch_index_search_query_current_number|opensearch_index_search_query_time_seconds|opensearch_index_search_scroll_count|opensearch_index_search_scroll_current_number|opensearch_index_search_scroll_time_seconds|opensearch_index_segments_memory_bytes|opensearch_index_segments_number|opensearch_index_shards_number|opensearch_index_store_size_bytes|opensearch_index_translog_operations_number|opensearch_indices_indexing_index_count|opensearch_indices_store_size_bytes|opensearch_ingest_total_count|opensearch_ingest_total_failed_count|opensearch_ingest_total_time_seconds|opensearch_jvm_bufferpool_number|opensearch_jvm_bufferpool_total_capacity_bytes|opensearch_jvm_bufferpool_used_bytes|opensearch_jvm_gc_collection_count|opensearch_jvm_gc_collection_time_seconds|opensearch_jvm_mem_heap_committed_bytes|opensearch_jvm_mem_heap_used_bytes|opensearch_jvm_mem_nonheap_committed_bytes|opensearch_jvm_mem_nonheap_used_bytes|opensearch_jvm_threads_number|opensearch_jvm_uptime_seconds|opensearch_os_cpu_percent|opensearch_os_mem_used_percent|opensearch_os_swap_free_bytes|opensearch_os_swap_used_bytes|opensearch_threadpool_tasks_number|opensearch_threadpool_threads_number|opensearch_transport_rx_bytes_count|opensearch_transport_server_open_number|opensearch_transport_tx_bytes_count|up|opensearch_os_cpu_percent

And my alert rule is below (I thought you might want to know):

- alert: opensearchClusterRed
  expr: opensearch_cluster_status == 2
  for: 0m
  labels:
    severity: emergency
  annotations:
    summary: opensearch Cluster Red (instance {{ $labels.instance }}, node {{ $labels.node }})
    description: "Elastic Cluster Red status\n  VALUE = {{ $value }}"
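For reference, assuming the exporter encodes the status as 0 = green, 1 = yellow, 2 = red (which is what the alert expression above relies on), a quick way to check how long Prometheus actually stored the red value is to query it directly; prometheus-host below is a placeholder:

# Hedged sketch; prometheus-host is a placeholder.
# max_over_time shows whether the stored samples really stayed at 2 (red) across the
# window, i.e. across more than a single scrape.
curl -sG "http://prometheus-host:9090/api/v1/query" \
  --data-urlencode "query=max_over_time(opensearch_cluster_status[10m])"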

@lukas-vlcek
Collaborator

lukas-vlcek commented May 29, 2024

Thanks for more details.

Status red when the index close is in progress, and then index delete.

Would you mind sharing a bit more information about this process please?

  • Are you closing indices using their full IDs, or are you using wildcard patterns as well?
  • Following that, are you deleting the indices that were closed right before, or are the close and delete operations more independent?
  • Based on your scrape interval (1m) it seems like Prometheus scraped the target several times and still got the red status. So what exactly happened during the 4 minutes we can see in the chart above? Did you close a single index and then delete it?

I am trying to see if we can recreate the sequence of steps to reliably replicate this issue; that is why I'm asking all these questions.

Thanks a lot!
Lukáš

@east4ming
Author

Hi Lukáš,

Answers:
• Are you closing indices using their full IDs, or are you using wildcard patterns as well?
A: We use patterns.
• Following that, are you deleting the indices that were closed right before, or are the close and delete operations more independent?
A: We delete the closed indices.
• Based on your scrape interval (1m) it seems like Prometheus scraped the target several times and still got the red status. So what exactly happened during the 4 minutes we can see in the chart above? Did you close a single index and then delete it?
A: We needed to close 194 indices and then delete them, so it (the red status) went on for a while (the sketch below shows the raw API equivalent).
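A rough sketch of that close-then-delete sequence, with a hypothetical index pattern and placeholder credentials:

# Rough sketch of the close-then-delete sequence; the index pattern and credentials are placeholders.
# Close all indices matching the pattern first...
curl -XPOST -u username:password "localhost:9200/logs-2024.04.*/_close"
# ...then delete the (now closed) indices matching the same pattern.
# Depending on cluster settings, wildcard deletes may require
# action.destructive_requires_name to be set to false.
curl -XDELETE -u username:password "localhost:9200/logs-2024.04.*"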

Thanks

@lukas-vlcek
Collaborator

@east4ming Thanks for the details.

I know that a red state can happen (for a short period) when a new index is created, but based on your explanation this is not the case here, because you are actually closing and then deleting indices. I do not think there is any reason why the cluster should become red in this scenario.

Q: Just for clarity: you do make sure the index close operation finishes (i.e. yields a success/ack response, not a timeout or an error) before the delete operation is called, right?
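A sketch of that check, with a hypothetical pattern and placeholder credentials (the exact response shape may vary between versions):

# A close call that completes normally returns an acknowledged JSON response rather than a
# timeout, e.g. something like {"acknowledged": true, "shards_acknowledged": true, ...}.
curl -XPOST -u username:password "localhost:9200/logs-2024.04.*/_close?pretty"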

@east4ming
Author

Yes, once the close operation has finished, the delete operation starts. These operations are driven by the opensearch-curator tool.
