I have a database setup with 3 InfluxDB nodes in a primary/replica configuration. The primary node receives all reads and writes, and the two replica nodes are maintained as backups. Each node has the same four buckets, so my total replication stream count is 4 (buckets) * 2 (replicas) = 8.
I have a writer sending ~300,000 points/second across all the buckets to the primary node (the points are of variable size, some a few KB each). When the primary starts replicating data to the replicas, I start seeing the following error:
This error generally starts showing up after around 20-30 seconds.
From what I know, this is happening because too many TCP connections are being opened from the primary node to the replicas, which exhausts all available local ephemeral ports. One fix I have tried on my end is setting the following kernel parameters:
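Roughly along these lines (an illustrative sketch of typical Linux sysctl tunables for ephemeral port exhaustion; these values are assumptions, not necessarily the exact ones I set):

```sh
# Illustrative only: widen the ephemeral port range and recycle
# TIME_WAIT sockets faster (assumed values).
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1      # reuse TIME_WAIT sockets for new outbound connections
sysctl -w net.ipv4.tcp_fin_timeout=15  # release closing sockets sooner
```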
I have also increased the ulimits in my docker compose:
```yaml
ulimits:
  nofile:
    soft: 400000
    hard: 800000
```
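To verify the limits are actually in effect inside the container, something like this should work (assuming a service named influxdb):

```sh
docker exec influxdb sh -c 'ulimit -n'   # should print the soft nofile limit (400000)
```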
With these fixes I still see the error after a while, and then, when one of the TCP connections becomes available again, the retry count resets back to 0.
I have a hunch that this can be resolved by tuning the HTTP client parameters, such as MaxIdleConnsPerHost, but the replication writeAPI client parameters are not configurable from outside the codebase.
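For illustration, the kind of transport tuning I have in mind looks like this in Go's net/http (a hypothetical sketch, not the actual replication client code):

```go
package main

import (
	"net/http"
	"time"
)

// newReplicationHTTPClient is a hypothetical example of an HTTP client
// tuned to reuse TCP connections to a small set of hosts (the replicas)
// instead of dialing a fresh connection for every replicated write.
func newReplicationHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // idle connections kept across all hosts
		MaxIdleConnsPerHost: 50,               // net/http default is 2, which causes dial/close churn
		IdleConnTimeout:     90 * time.Second, // how long idle connections stay pooled
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}
```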
For further clarification on the setup: all of the InfluxDB nodes are on separate VMs, but they are on the same local network with < 0.5 ms ping.
Steps to reproduce:
List the minimal actions needed to reproduce the behavior.
1. Set up a primary InfluxDB node and two replica nodes on three separate servers.
2. On the primary, set up replications of the 4 buckets to each of the 2 replicas (see the CLI sketch after this list).
3. Write a large amount of variable-sized data to all the buckets.
4. Wait about a minute for the errors to start appearing.
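For reference, the replication setup on the primary can be done along these lines (a sketch using the influx CLI's remote/replication commands; names and IDs are placeholders):

```sh
# Register a replica as a remote (repeat per replica).
influx remote create \
  --name replica-1 \
  --remote-url http://replica-1:8086 \
  --remote-api-token <token> \
  --remote-org-id <org-id>

# Create one replication stream per bucket (repeat for each of the 4 buckets x 2 replicas).
influx replication create \
  --name bucket1-to-replica-1 \
  --remote-id <remote-id> \
  --local-bucket-id <local-bucket-id> \
  --remote-bucket-id <remote-bucket-id>
```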
Expected behavior:
Expected behavior is that the data is transported to the replicas over a shared connection pool, or that the replication client is configured to reuse TCP connections.
Actual behavior:
A large number of short-lived TCP connections exhausts the local ephemeral ports, leaving no open port available for binding a connection to the target address.
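The port exhaustion is visible on the primary with something like the following (illustrative diagnostics; assumes the replicas listen on port 8086):

```sh
# Count sockets stuck in TIME_WAIT toward the replicas; a count near the
# size of the ephemeral port range means new connections cannot bind.
ss -tan state time-wait '( dport = :8086 )' | wc -l
cat /proc/sys/net/ipv4/ip_local_port_range
```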
Environment info:
System info: Linux 5.15.0-71-generic x86_64
InfluxDB version: Run influxd version and copy the output here
Other relevant environment details: running under Docker Compose (see ulimits above)
Config:
Copy any non-default config values here or attach the full config as a gist or file.
Logs:
Include snippet of errors in log.
Performance:
Generate profiles with the following commands for bugs related to performance, locking, out of memory (OOM), etc.
```sh
# Commands should be run when the bug is actively happening.
# Note: This command will run for ~30 seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt
# Attach the `profiles.tar.gz` and `iostat.txt` output files.
```