---
title: Troubleshooting PCF Log Search
owner: PCF Metrics
---
<strong><%= modified_date %></strong>
This topic describes how to resolve common issues with missing logs in the Pivotal Cloud Foundry (PCF) Log Search tile.
If the Log Search Kibana UI does not receive and display all expected logs,
validate the following items as part of the troubleshooting process.
## <a id='enable_shard_allocation'></a> Confirm that Shard Allocation is Enabled
* Disabling the **Enable Shard Allocation** errand and applying changes that affect
Elasticsearch nodes prevents new daily indexes from being created.
When updating stemcells, it is tempting to disable the **Enable Shard Allocation**
errand to speed up the deployment. Unfortunately, doing this leaves Elasticsearch
in a broken state; specifically, it cannot create new daily indexes at midnight UTC.
* If you suspect this has happened, check whether the cluster's
`cluster.routing.allocation.enable` setting has been set to something other than **all** by
querying:
`http://logsearch.<system_domain>/elasticsearch/_cluster/settings?pretty`<br/>
If it has, correct this by re-enabling the errand in Ops Manager on the
`Log Search > Errands` screen and clicking **Apply Changes**.
## <a id='installation_steps'></a>Confirm Correct Installation
* Log Search (v1.0.2 and later) requires two important installation steps in order to receive all expected logs and metrics:
    * Ensure that the NATS credentials from the **Ops Manager** tile were [entered](https://docs.pivotal.io/logsearch/installing.html#install) in the Log Search tile.
    * Ensure that **System Logging** was correctly [configured](https://docs.pivotal.io/logsearch/installing.html#elastic-runtime) in Elastic Runtime with the IP information from the Log Search tile.
## <a id='resource_scale'></a>Confirm that Resource Scale Matches Log Volume
* The default resource configuration for the PCF Log Search tile is targeted primarily at consuming platform logs. If you also use the experimental Firehose feature to consume application logs, high application logging volume can strain the Log Parser and Elasticsearch Data nodes.
* To determine whether you need to scale your clusters to support application log consumption, [Find the Rate of Incoming Messages](https://docs.pivotal.io/logsearch/installing.html#firehose) and then consult the table in [Scaling](https://docs.pivotal.io/logsearch/scaling.html#scale-up) for a suggested amount to scale up your resources. A quick way to approximate the indexing rate directly from Elasticsearch is sketched after the note below.
<p class="note warning">
<strong>WARNING: </strong> If you recently upgraded to PCF v1.10 or later,
you likely need to <a href="./scaling.html#scale-up">scale up</a> your Log Search cluster to accommodate higher Loggregator message throughput. For more information, see <a href="https://docs.pivotal.io/pivotalcf/1-10/pcf-release-notes/runtime-rn.html#loggregator">gRPC in Loggregator</a>.
</p>
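The linked documentation describes the supported way to find the rate of incoming messages. As a rough cross-check, you can also estimate the indexing rate from the Elasticsearch stats API. This is a minimal sketch under two assumptions: the `/elasticsearch/` proxy forwards the `_stats` path (only the `_cluster/settings` path is shown in this topic), and `your-system-domain.example.com` is a placeholder for your actual system domain.

```python
import json
import time
import urllib.request

# Placeholder: replace with your actual PCF system domain. Assumes the
# /elasticsearch/ proxy forwards the _stats path.
STATS_URL = ("http://logsearch.your-system-domain.example.com"
             "/elasticsearch/_stats/indexing")
SAMPLE_SECONDS = 60


def primary_index_total():
    """Total indexing operations on primary shards across all indexes."""
    with urllib.request.urlopen(STATS_URL) as response:
        stats = json.load(response)
    # Counting primaries only keeps replica writes from inflating the rate.
    return stats["_all"]["primaries"]["indexing"]["index_total"]


first = primary_index_total()
time.sleep(SAMPLE_SECONDS)
second = primary_index_total()

rate = (second - first) / SAMPLE_SECONDS
print(f"Approximate indexing rate: {rate:.0f} documents/second")
```

Compare the result against the message rates shown in the [Scaling](https://docs.pivotal.io/logsearch/scaling.html#scale-up) table when deciding how far to scale up.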
## <a id='cluster_health'></a>Review Current Cluster Health
* Log Search includes a visualization of current cluster health,
accessible at `logsearch_monitor.<system_domain>`. You can also check cluster health programmatically, as sketched after this list.
* Clusters in a **red** state are broken.
    * The best way to troubleshoot cluster health is to review the logs for the **Elasticsearch Master** and
    **Elasticsearch Data** nodes, found in `Log Search > Status > Logs`.
* Clusters in a **yellow** state are not replicating.
    * Verify that there are multiple data nodes. If there are multiple data nodes and the cluster is still in a **yellow** state,
    review the logs for the **Elasticsearch Master** and **Elasticsearch Data** nodes in
    `Log Search > Status > Logs`.
    * If there is only a single data node, scale up the number of **Elasticsearch Data** nodes in
    `Log Search > Resource Config`.
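A minimal sketch for checking cluster health from a script, assuming the `/elasticsearch/` proxy forwards the `_cluster/health` path and with the system domain placeholder replaced:

```python
import json
import urllib.request

# Placeholder: replace with your actual PCF system domain. Assumes the
# /elasticsearch/ proxy forwards the _cluster/health path.
HEALTH_URL = ("http://logsearch.your-system-domain.example.com"
              "/elasticsearch/_cluster/health?pretty")

with urllib.request.urlopen(HEALTH_URL) as response:
    health = json.load(response)

status = health["status"]  # "green", "yellow", or "red"
print(f"Cluster status:    {status}")
print(f"Data nodes:        {health['number_of_data_nodes']}")
print(f"Unassigned shards: {health['unassigned_shards']}")

if status == "red":
    print("Cluster is broken: review the Elasticsearch Master and Data node "
          "logs under Log Search > Status > Logs.")
elif status == "yellow":
    print("Replicas are unassigned: confirm there is more than one "
          "Elasticsearch Data node in Log Search > Resource Config.")
```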
## <a id='data_nodes_at_maximum'></a>Check if Elasticsearch Data Nodes Are at Maximum Disk
* This issue is most commonly caused by a Redis queuing problem, which is often evident in the product logs.
* Example log message:
`Redis key size has hit a congestion threshold 1000000 suspending output for 10 seconds`
1. Restart the ingestor and wait a few minutes.
1. If the cluster health remains unhealthy, scale up the size of the clusters.
* Reference: [Log Search Scaling Guide](https://docs.pivotal.io/logsearch/scaling.html#need)
* Example: If the Elasticsearch Data nodes are running near their persistent disk maximum, consider increasing the persistent disk size or the instance count to add capacity. A sketch for checking per-node disk usage follows this list.
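To see how close each data node is to its disk limit, the `_cat/allocation` API reports per-node disk usage. This minimal sketch assumes the `/elasticsearch/` proxy forwards the `_cat/allocation` path and that the system domain placeholder is replaced; exact column names vary slightly between Elasticsearch versions.

```python
import urllib.request

# Placeholder: replace with your actual PCF system domain. Assumes the
# /elasticsearch/ proxy forwards the _cat/allocation path.
ALLOCATION_URL = ("http://logsearch.your-system-domain.example.com"
                  "/elasticsearch/_cat/allocation?v")

# _cat/allocation returns a plain-text table with one row per data node,
# including columns for disk used, disk available, and disk percent.
with urllib.request.urlopen(ALLOCATION_URL) as response:
    print(response.read().decode("utf-8"))

# Data nodes with disk usage approaching 100% need larger persistent disks
# or additional Elasticsearch Data node instances.
```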
## <a id='parser_at_maximum'></a>Check if Parser VMs Are at Maximum CPU
* As a general rule, use more Log Parser nodes than Elasticsearch Data nodes.
* Scale up the Log Parser nodes until their average CPU utilization drops.
* Reference: [Log Search Scaling Guide](https://docs.pivotal.io/logsearch/scaling.html#need)
## <a id='elasticsearch_master'></a>Check for Multiple Elasticsearch Master VMs (v1.0.1 and earlier)
* Ensure there is only one Elasticsearch Master VM. The product does not support running more than one Elasticsearch Master VM. The ability to accidentally scale this instance count was removed in v1.0.2. To see which nodes the cluster currently recognizes, see the sketch below.
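If you are on v1.0.1 or earlier and suspect the Master instance count was scaled above one, listing the nodes the cluster currently sees can help. This minimal sketch assumes the `/elasticsearch/` proxy forwards the `_cat/nodes` path and that the system domain placeholder is replaced; the exact columns vary by Elasticsearch version, but the master-related column marks the elected master (`*`) and, in some versions, other master-eligible nodes.

```python
import urllib.request

# Placeholder: replace with your actual PCF system domain. Assumes the
# /elasticsearch/ proxy forwards the _cat/nodes path.
NODES_URL = ("http://logsearch.your-system-domain.example.com"
             "/elasticsearch/_cat/nodes?v")

# _cat/nodes returns a plain-text table with one row per node in the
# cluster; the master column identifies the elected master. More than one
# master-eligible row suggests the Elasticsearch Master instance count was
# scaled above one.
with urllib.request.urlopen(NODES_URL) as response:
    print(response.read().decode("utf-8"))
```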