Data prepper fails when sending traces from different EKS (multiple data prepper singletons) to ES #479
This is what the pipeline looks like:
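(The attached config is not preserved here; the sketch below is a representative trace-analytics pipeline based on the Data Prepper documentation, with placeholder hosts and no credentials, not the exact config from this cluster.)

```yaml
entry-pipeline:
  source:
    otel_trace_source:
      ssl: false
  sink:
    - pipeline:
        name: "raw-pipeline"
    - pipeline:
        name: "service-map-pipeline"

raw-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:                     # named "processor" in 0.7.x
    - otel_trace_raw_prepper:
  sink:
    - elasticsearch:
        hosts: ["https://elasticsearch.example.com:443"]   # placeholder endpoint
        trace_analytics_raw: true                          # writes spans to the otel-v1-apm-span index/alias

service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - service_map_stateful:
  sink:
    - elasticsearch:
        hosts: ["https://elasticsearch.example.com:443"]   # placeholder endpoint
        trace_analytics_service_map: true
```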
@kuberkaul This is a race condition which we fixed in the next version (0.8.0). @wrijeff is reproducing your scenario and will get back to you. Quick question: why are you running Data Prepper in two different clusters? Are the applications you run in the two clusters the same or different?
Thanks @kowshikn, I'll move to the next version. We have about 10 different clusters (10 AWS accounts) which house very different applications. The idea would be to keep two ES clusters (one in Virginia and one in Oregon) to serve as our primary and DR regions for tracing data; all meshes within the primary would send data to the Virginia ES and DR meshes would send to the Oregon ES. Curious though: in case it was the same app but running in different meshes (we also divide our EKS clusters per environment), would you recommend not using the same ES?
Thanks @kuberkaul for the bug report. I haven't been able to reproduce the issue yet (trying with a single EKS cluster and multiple Data Prepper pods); however, as Kowshik mentioned, try moving to the 0.8.0-beta release. We fixed a race condition here, but I'm not 100% sure it's the same issue you're encountering (which is why I'm trying to reproduce it myself). When moving to the next version, I'd suggest starting with the k8s deployment template, because there were a few backwards-incompatible changes from 0.7 to 0.8. Specifically, the term "processor" has been replaced with "prepper", and a required argument was added to the
@wrijeff Yup, I am on it now. This was the plan anyway, to move to a scalable data-prepper cluster, so this just speeds things up for us. I'll report back when this is done.
Yes, if it's the same app, the trace data from the different environments is published to the same Elasticsearch index. We also do not support filtering by additional/custom attributes in our dashboards, which means all latency etc. will be reported together; for example, region 1 and region 2 latency will be displayed as a single latency.
So, moved to the
Both data preppers fail, by the way. I could delete all indexes and try again to see if it changes. Here is the data prepper yaml:
Thanks for the stacktrace and yaml, very helpful. Checking the error message again, it looks like an index named otel-v1-apm-span exists. I'm not sure how it could get to that state, but can you check what the indices look like for that cluster? There should be multiple
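If it helps, one way to check (the names below assume the default otel-v1-apm-span naming):

```
GET _cat/indices/otel-v1-apm-span*?v
GET _cat/aliases/otel-v1-apm-span?v
```

On a healthy cluster the alias query should show otel-v1-apm-span pointing at one or more rollover-style backing indices, rather than otel-v1-apm-span existing as a plain index, which is what the invalid_alias_name_exception in the original report suggests happened.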
I'll delete the otel-v1-apm-span index.
@kuberkaul Just FYI, the API we adopted for ingesting bulk records into ES is close to the following DSL
where
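(Not the exact snippet from this comment, but a rough sketch of the Bulk API shape being described; the span fields are illustrative and the target assumes the default otel-v1-apm-span write alias.)

```
POST /otel-v1-apm-span/_bulk
{ "index": {} }
{ "traceId": "a1b2c3", "spanId": "d4e5f6", "serviceName": "frontend", "durationInNanos": 1200000 }
{ "index": {} }
{ "traceId": "a1b2c3", "spanId": "0f9e8d", "serviceName": "backend", "durationInNanos": 800000 }
```

Because the request targets the write alias rather than a dated index, the documents land in whichever backing index the alias currently points to.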
This was working fine in 0.7.0 but seems to happen now that I'm on 0.8.0-beta. I have scaled the data prepper cluster to 2 within the EKS and am using the headless service for pod discovery. Do you recommend a certain cpu/mem for data prepper pods? Also, should I be increasing the workers in the data prepper configmap? Right now it's all default.
Regarding configuration - we have an initial tuning guide but it's a bit lacking in terms of recommended settings per workload (currently working on that). The stacktrace with the mapdb error was fixed in #320, but it looks like it wasn't merged into the 0.8.x branch 😥. So until that fix is actually released, please keep the worker count set to 1 for the pipeline containing the affected prepper. For the service map issue, I'm not sure what's going on there yet. I have seen blank results when the time range at the top filters out all results (when it's set to something small like 1 minute), but from the screenshot it looks like yours is set to a reasonable time. I wiped my ES cluster and tested with 0.8.0/EKS, single pod, and wasn't able to reproduce. I will try scaling it up and wiping again. Thanks for testing it out and providing feedback, super useful.
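For reference, the worker count is a per-pipeline setting inside the pipeline definition itself; a minimal sketch (pipeline and prepper names are the documented defaults, host is a placeholder):

```yaml
raw-pipeline:
  workers: 1       # processing threads for this pipeline; set to 1 for the pipeline called out above
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - otel_trace_raw_prepper:
  sink:
    - elasticsearch:
        hosts: ["https://elasticsearch.example.com:443"]   # placeholder endpoint
        trace_analytics_raw: true
```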
Thank you for the guide, I've tuned it and given it:
resources to start off with. Moving the workers for
I know there's data though, as the graph is still getting created and
@kuberkaul - sorry for the delay in updating this issue. I ended up creating multiple follow-up issues based off your feedback; they're linked above. Could you check if the service map is still having JS errors? A record in that cluster which was missing appears to now be present; hopefully that will resolve the immediate error. What we think happened is described in #514: essentially, an "edge" in the service map appeared to have been missing and the frontend didn't know how to handle it. I've created an issue on the Kibana side to handle missing edges instead of completely erroring. I've also got an issue on our side to look into what might've caused it to be missing, and what we can do to prevent it in the future. Thanks again for your patience.
@wrijeff: That's great. I'm looking at the service map though, and I still see the same error. Everything else (traces/dashboard) continues to work and gives me traces, but services errors out. Also pasting the errors I get (same):
Recreated the entire Elasticsearch cluster and the error persists. In fact, now both service map and services turn up blank.
Using data prepper to send traces from EKS clusters (3) to the same Elasticsearch. This works when sending traces from just one EKS, but with multiple EKS clusters (and therefore multiple data-prepper instances), it fails with:
{"error":{"root_cause":[{"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"}],"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"},"status":400}
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Traces should be sent from both EKS clusters to ES. The alias should be configurable within the data prepper configmap if need be.
Screenshots
Am I missing something in the config? I don't see a way to set the alias in the data prepper configmap.