
Data prepper fails when sending traces from different EKS (multiple data prepper singletons) to ES #479

Open
kuberkaul opened this issue Apr 2, 2021 · 21 comments
Labels
bug Something isn't working

Comments

@kuberkaul

We are using Data Prepper to send traces from three EKS clusters to the same Elasticsearch domain. This works when sending traces from just one EKS cluster, but with multiple EKS clusters (and therefore multiple Data Prepper instances) it fails with:

{"error":{"root_cause":[{"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"}],"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"},"status":400}

To Reproduce
Steps to reproduce the behavior:

  1. Run Data Prepper as a single-replica deployment in 2 different EKS clusters, sending traces to the same ES domain
  2. Data Prepper image: amazon/opendistro-for-elasticsearch-data-prepper:0.7.1-alpha
  3. Observe the error in Data Prepper as it tries to send traces

Expected behavior
Traces should be sent from both EKS clusters to ES. The alias should be configurable within the Data Prepper ConfigMap if need be.

Screenshots
Screen Shot 2021-04-02 at 2 08 30 PM

Am I missing something in the config? I don't see a way to set the alias in the Data Prepper ConfigMap.

@kuberkaul kuberkaul added the bug Something isn't working label Apr 2, 2021
@kuberkaul
Author

kuberkaul commented Apr 2, 2021

This is what the pipeline looks like

    raw-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"
      processor:
        - otel_trace_raw_processor:
      sink:
        - elasticsearch:
            hosts: [ 'https://xxx.us-east-1.es.amazonaws.com' ]
            insecure: true
            aws_region: "us-east-1"
            trace_analytics_raw: true
    service-map-pipeline:
      delay: "100"
      source:
        pipeline:
          name: "entry-pipeline"
      processor:
        - service_map_stateful:
      sink:
        - elasticsearch:
            hosts: ['https://xxx.us-east-1.es.amazonaws.com']
            insecure: true
            aws_region: "us-east-1"
            trace_analytics_service_map: true

@kuberkaul
Author

@kowshikn ^

@kowshikn
Contributor

kowshikn commented Apr 2, 2021

@kuberkaul This is a race condition which we fixed in the next version (0.8.0). @wrijeff is reproducing your scenario and will get back to you.

Quick question: why are you running Data Prepper in two different clusters? Are the applications you run in the two clusters the same or different?

@kuberkaul
Author

Thanks @kowshikn, I'll move to the next version. We have about 10 different clusters (10 AWS accounts) which house very different applications. The idea is to keep two ES domains (one in Virginia and one in Oregon) to serve as our primary and DR regions for tracing data; all meshes within the primary region would send data to the Virginia ES and DR meshes would send to the Oregon ES.

Curious though: if it were the same app but running in different meshes (we also divide our EKS clusters per environment), would you recommend not using the same ES?

@wrijeff
Contributor

wrijeff commented Apr 2, 2021

Thanks @kuberkaul for the bug report. I haven't been able to reproduce the issue yet (trying with a single EKS cluster and multiple Data Prepper pods); however, as Kowshik mentioned, try moving to the 0.8.0-beta release. We fixed a race condition there, but I'm not 100% sure it's the same issue you're encountering (which is why I'm trying to reproduce it myself).

When moving to the next version, I'd suggest starting with the k8s deployment template because there were a few backwards-incompatible changes from 0.7 to 0.8. Specifically, the term "processor" has been replaced with "prepper", and a required argument was added to the java -jar command.

@kuberkaul
Author

@wrijeff yup, I am on it now. Moving to a scalable Data Prepper cluster was the plan anyway, so this just speeds things up for us. I'll report back when this is done.

@kowshikn
Contributor

kowshikn commented Apr 2, 2021

> Thanks @kowshikn, I'll move to the next version. We have about 10 different clusters (10 AWS accounts) which house very different applications. The idea is to keep two ES domains (one in Virginia and one in Oregon) to serve as our primary and DR regions for tracing data; all meshes within the primary region would send data to the Virginia ES and DR meshes would send to the Oregon ES.
>
> Curious though: if it were the same app but running in different meshes (we also divide our EKS clusters per environment), would you recommend not using the same ES?

Yes. If it's the same app, the trace data from the different environments is published to the same Elasticsearch index. We also do not support filtering by additional/custom attributes in our dashboards, which means all latency etc. will be reported together; for example, region 1 and region 2 latency will be displayed as a single latency.

@kuberkaul
Author

kuberkaul commented Apr 2, 2021

So I moved to 0.8.0-beta but the error remains the same (running in 2 EKS clusters, AWS accounts A and B, sending traces to ES in AWS account C; same region, us-east-1, for all three). Here's the error trace:


{"error":{"root_cause":[{"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"}],"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"},"status":400}
		at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1611) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1596) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1563) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.IndicesClient.create(IndicesClient.java:139) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.checkAndCreateIndex(ElasticsearchSink.java:183) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.start(ElasticsearchSink.java:88) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.<init>(ElasticsearchSink.java:71) ~[data-prepper.jar:0.8.0-beta]
		at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
		at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:64) ~[?:?]
		at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
		at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
		at java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
		at com.amazon.dataprepper.plugins.PluginFactory.newPlugin(PluginFactory.java:24) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.SinkFactory.newSink(SinkFactory.java:12) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.parser.PipelineParser.buildSinkOrConnector(PipelineParser.java:146) ~[data-prepper.jar:0.8.0-beta]
		at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) [?:?]
		at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) [?:?]
		at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) [?:?]
		at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) [?:?]
		at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) [?:?]
		at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [?:?]
		at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) [?:?]
		at com.amazon.dataprepper.parser.PipelineParser.buildPipelineFromConfiguration(PipelineParser.java:95) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.parser.PipelineParser.parseConfiguration(PipelineParser.java:61) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.DataPrepper.execute(DataPrepper.java:97) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.DataPrepperExecute.main(DataPrepperExecute.java:20) [data-prepper.jar:0.8.0-beta]
2021-04-02T21:03:31,951 [main] ERROR com.amazon.dataprepper.parser.PipelineParser - Construction of pipeline components failed, skipping building of pipeline [raw-pipeline] and its connected pipelines
com.amazon.dataprepper.plugins.PluginException: Encountered exception while instantiating the plugin ElasticsearchSink
	at com.amazon.dataprepper.plugins.PluginFactory.newPlugin(PluginFactory.java:34) ~[data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.plugins.sink.SinkFactory.newSink(SinkFactory.java:12) ~[data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.parser.PipelineParser.buildSinkOrConnector(PipelineParser.java:146) ~[data-prepper.jar:0.8.0-beta]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
	at com.amazon.dataprepper.parser.PipelineParser.buildPipelineFromConfiguration(PipelineParser.java:95) [data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.parser.PipelineParser.parseConfiguration(PipelineParser.java:61) [data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.DataPrepper.execute(DataPrepper.java:97) [data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.DataPrepperExecute.main(DataPrepperExecute.java:20) [data-prepper.jar:0.8.0-beta]
Caused by: java.lang.reflect.InvocationTargetException
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:64) ~[?:?]
	at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
	at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
	at com.amazon.dataprepper.plugins.PluginFactory.newPlugin(PluginFactory.java:24) ~[data-prepper.jar:0.8.0-beta]
	... 13 more
Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=invalid_alias_name_exception, reason=Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias]
	at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1897) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1867) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1624) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1596) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1563) ~[data-prepper.jar:0.8.0-beta]
	at org.elasticsearch.client.IndicesClient.create(IndicesClient.java:139) ~[data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.checkAndCreateIndex(ElasticsearchSink.java:183) ~[data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.start(ElasticsearchSink.java:88) ~[data-prepper.jar:0.8.0-beta]
	at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.<init>(ElasticsearchSink.java:71) ~[data-prepper.jar:0.8.0-beta]
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:64) ~[?:?]
	at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
	at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
	at com.amazon.dataprepper.plugins.PluginFactory.newPlugin(PluginFactory.java:24) ~[data-prepper.jar:0.8.0-beta]
	... 13 more
	Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [https://vpc-tracing-es-hubco4qcbsjy2lrhpefcsalih4.us-east-1.es.amazonaws.com], URI [/otel-v1-apm-span-000001?master_timeout=30s&timeout=30s], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"}],"type":"invalid_alias_name_exception","reason":"Invalid alias name [otel-v1-apm-span], an index exists with the same name as the alias","index_uuid":"0-ZhKKdyRHeJjw4QKjF-Yg","index":"otel-v1-apm-span"},"status":400}
		at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1611) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1596) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1563) ~[data-prepper.jar:0.8.0-beta]
		at org.elasticsearch.client.IndicesClient.create(IndicesClient.java:139) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.checkAndCreateIndex(ElasticsearchSink.java:183) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.start(ElasticsearchSink.java:88) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.<init>(ElasticsearchSink.java:71) ~[data-prepper.jar:0.8.0-beta]
		at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
		at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:64) ~[?:?]
		at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
		at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
		at java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
		at com.amazon.dataprepper.plugins.PluginFactory.newPlugin(PluginFactory.java:24) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.plugins.sink.SinkFactory.newSink(SinkFactory.java:12) ~[data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.parser.PipelineParser.buildSinkOrConnector(PipelineParser.java:146) ~[data-prepper.jar:0.8.0-beta]
		at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
		at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
		at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
		at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
		at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
		at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
		at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
		at com.amazon.dataprepper.parser.PipelineParser.buildPipelineFromConfiguration(PipelineParser.java:95) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.parser.PipelineParser.parseConfiguration(PipelineParser.java:61) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.DataPrepper.execute(DataPrepper.java:97) [data-prepper.jar:0.8.0-beta]
		at com.amazon.dataprepper.DataPrepperExecute.main(DataPrepperExecute.java:20) [data-prepper.jar:0.8.0-beta]
2021-04-02T21:03:31,953 [main] ERROR       com.amazon.dataprepper.DataPrepper - No valid pipeline is available for execution, exiting

@kuberkaul
Author

Both Data Preppers fail, by the way. I could delete all indices and try again to see if it changes.

Here is the Data Prepper YAML:

    entry-pipeline:
      delay: "100"
      source:
        otel_trace_source:
          health_check_service: true
          ssl: false
      prepper:
        - peer_forwarder:
            discovery_mode: "dns"
            domain_name: "data-prepper"
            ssl: false
      sink:
        - pipeline:
            name: "raw-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    raw-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"
      prepper:
        - otel_trace_raw_prepper:
      sink:
        - elasticsearch:
            hosts: 
              - "https://vpc-tracing-es-1234.us-east-1.es.amazonaws.com"
            insecure: true
            aws_region: "us-east-1"
            trace_analytics_raw: true
    service-map-pipeline:
      delay: "100"
      source:
        pipeline:
          name: "entry-pipeline"
      prepper:
        - service_map_stateful:
      sink:
        - elasticsearch:
            hosts: 
              - "https://vpc-tracing-es-1234.us-east-1.es.amazonaws.com"
            insecure: true
            aws_region: "us-east-1"
            trace_analytics_service_map: true

@wrijeff
Contributor

wrijeff commented Apr 2, 2021

Thanks for the stacktrace and YAML, very helpful. Checking the error message again, it looks like an index named otel-v1-apm-span was somehow created even though it was supposed to be an alias - not a race condition but an unexpected state. So the fix appears to be deleting the index so the alias can be created :-/

I'm not sure how it could get to that state, but can you check what the indices look like for that cluster? There should be multiple otel-v1-apm-span-000* indices and a single otel-v1-apm-span alias. Will keep poking around for more info.
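If it helps, something like this in Kibana Dev Tools should show the current state (a rough sketch, assuming the default otel-v1-apm-span index/alias names):

    # List any concrete indices matching the span index pattern
    GET _cat/indices/otel-v1-apm-span*?v

    # List the alias; ideally otel-v1-apm-span shows up here as an alias
    # over the otel-v1-apm-span-000* indices
    GET _cat/aliases/otel-v1-apm-span?v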


@kuberkaul
Author

Yup, here's the screengrab:
Screen Shot 2021-04-03 at 1 30 48 PM

@kuberkaul
Author

I'll delete the otel-v1-apm-span index. From an ES perspective, I ran vanilla Terraform to create the ES domain and then ran Data Prepper against its endpoint, so I can confirm I didn't create this index manually or through some other source. Let me kill this and try again though to see if that resolves the situation for good.
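For reference, this is roughly what I plan to run in Dev Tools (assuming nothing else needs the data that ended up in the stray index):

    # Remove the stray concrete index so the name is free to be created as an alias
    # (this discards whatever documents were written into it)
    DELETE otel-v1-apm-span

    # After restarting Data Prepper, confirm the name now resolves to an alias
    GET _cat/aliases/otel-v1-apm-span?v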

@chenqi0805
Contributor

@kuberkaul Just FYI, the API we adopted for ingesting bulk records into ES is close to the following DSL:

POST otel-v1-apm-span/_bulk
{ "index" : { } }
{ "field1" : "value1", ... }
...

where otel-v1-apm-span is the alias of the otel-v1-apm-span-000* indices. So if the otel-v1-apm-span-000* indices somehow get deleted by accident while Data Prepper is still running, it might create a new concrete index named otel-v1-apm-span by mistake and write records under that index.

@kuberkaul
Author

kuberkaul commented Apr 5, 2021

That might be what happened, as I was experimenting with the setup. Deleting the index and starting over worked. Moving to the new version, though, has its own issues. I have 3 different Data Preppers (3 EKS clusters) sending traces to ES. While I can see the traces under Discover and under Trace Analytics > Traces, I don't see anything appear under Services. What is weird is that the service map is actually getting populated with relevant info.

Able to see the traces here:
Screen Shot 2021-04-05 at 12 54 47 PM

But once I go to the Services view, it is blank, even though the graph shows them.

Screen Shot 2021-04-05 at 2 29 02 PM

Unfortunately the logs don't say much or give any error, so I'm assuming everything is good here:
Screen Shot 2021-04-05 at 12 55 15 PM

Edit 1
Updated the workers to 8 and the recommended buffer size. No change, but I get this error, though not very frequently:

Screen Shot 2021-04-05 at 5 21 52 PM

@kuberkaul
Author

kuberkaul commented Apr 5, 2021

This was working fine in 0.7.0 but seems to happen now that I'm on 0.8.0-beta. I have scaled the Data Prepper cluster to 2 replicas within EKS and am using the headless service for pod discovery.

Do you recommend a certain CPU/memory allocation for Data Prepper pods? Also, should I be increasing the workers in the Data Prepper ConfigMap? Right now it's all default.

@wrijeff
Contributor

wrijeff commented Apr 5, 2021

Regarding configuration - we have an initial tuning guide but it's a bit lacking in terms of recommended settings per workload (currently working on that).

The stacktrace with the mapdb error was fixed in #320 but it looks like it wasn't merged into the 0.8.x branch 😥. So until that fix actually is released, please keep the worker count set to 1 for the pipeline containing the service_map_stateful prepper.

For the service map issue, I'm not sure what's going on there yet. I have seen blank results when the time range at the top filters out all results (when it's set to something small like 1 minute), but from the screenshot it looks like yours is set to a reasonable time. I wiped my ES cluster and tested with 0.8.0/EKS, single pod, wasn't able to reproduce. Will try scaling it up and wiping again.

Thanks for testing it out and providing feedback, super useful.

@kuberkaul
Author

kuberkaul commented Apr 6, 2021

Thank you for the guide. I've tuned it and given it:

            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi

resources to start off with. Moving the workers for the service_map_stateful prepper to 1 resolved the error. Everything is working fine except the Services page under Trace Analytics showing blank. What is weird is that services did start displaying a couple of hours ago (the timeframe is not a problem, since I am now asking for anything in the past 30 days), but it went back to blank again now.

I know there's data, though, as the graph is still getting created and the dashboard shows a bunch of services. This makes me think it's not really a Data Prepper issue, and I will open an AWS ticket for this.
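As a sanity check, something like this in Dev Tools should confirm the service-map data is there (assuming the default otel-v1-apm-service-map index name):

    # Count the service-map relationship documents
    GET otel-v1-apm-service-map/_count

    # Sample a few of them
    GET otel-v1-apm-service-map/_search
    {
      "size": 5
    }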

Screen Shot 2021-04-06 at 1 52 16 PM

Screen Shot 2021-04-06 at 1 52 29 PM

@kuberkaul
Author

kuberkaul commented Apr 6, 2021

So, after some more debugging, here's what I actually see happening: these services keep populating and disappearing.
Screen Shot 2021-04-06 at 1 52 16 PM
Screen Shot 2021-04-06 at 2 56 24 PM

Even though nothing changes. But when I actually inspect it in the browser, I do see an error:
Screen Shot 2021-04-06 at 2 45 06 PM

Going through a sample trace using Dev Tools in ES, what I find is that, for some reason, the trace group for one of the spans shows up as null, and the trace group value is actually ending up in the name field:

{
  "took" : 978,
  "timed_out" : false,
  "_shards" : {
    "total" : 32,
    "successful" : 32,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 7.7403736,
    "hits" : [
      {
        "_index" : "otel-v1-apm-span-000004",
        "_type" : "_doc",
        "_id" : "d5968f7aa6040afa",
        "_score" : 7.7403736,
        "_source" : {
          "traceId" : "00013fa8fb91db97f22c189681ac8727",
          "spanId" : "d5968f7aa6040afa",
          "traceState" : "",
          "parentSpanId" : "f22c189681ac8727",
          "name" : "djin-mplatformassets-itemsapi.djin-platform.svc.cluster.local:5000/djin-mplatformassets-itemsapi/assets/author-lists*",
          "kind" : "SPAN_KIND_SERVER",
          "startTime" : "2021-04-06T18:40:05.924338Z",
          "endTime" : "2021-04-06T18:40:05.970057Z",
          "durationInNanos" : 45719000,
          "serviceName" : "djin-mplatformassets-itemsapi.djin-platform",
          "events" : [ ],
          "links" : [ ],
          "droppedAttributesCount" : 0,
          "droppedEventsCount" : 0,
          "droppedLinksCount" : 0,
          "traceGroup" : null,
          "span.attributes.request_size" : "0",
          "span.attributes.response_size" : "423",
          "resource.attributes.cluster@region" : "us-east-1",
          "span.attributes.user_agent" : "PostmanRuntime/7.25.0",
          "span.attributes.downstream_cluster" : "-",
          "span.attributes.peer@address" : "10.147.168.188",
          "span.attributes.guid:x-request-id" : "7c2e8549-ce46-93db-a0e9-6f326e5af7e0",
          "resource.attributes.cluster@name" : "vir-int-usr",
          "span.attributes.net@host@ip" : "10.147.164.191",
          }
      },
      {
        "_index" : "otel-v1-apm-span-000004",
        "_type" : "_doc",
        "_id" : "f22c189681ac8727",
        "_score" : 7.6892185,
        "_source" : {
          "traceId" : "00013fa8fb91db97f22c189681ac8727",
          "spanId" : "f22c189681ac8727",
          "traceState" : "",
          "parentSpanId" : "",
          "name" : "djin-mplatformassets-itemsapi.djin-platform.svc.cluster.local:5000/djin-mplatformassets-itemsapi/assets/author-lists*",
          "kind" : "SPAN_KIND_CLIENT",
          "startTime" : "2021-04-06T18:40:05.923588Z",
          "endTime" : "2021-04-06T18:40:05.970863Z",
          "durationInNanos" : 47275000,
          "serviceName" : "istio-ingressgateway",
          "events" : [ ],
          "links" : [ ],
          "droppedAttributesCount" : 0,
          "droppedEventsCount" : 0,
          "droppedLinksCount" : 0,
          "traceGroup" : "djin-mplatformassets-itemsapi.djin-platform.svc.cluster.local:5000/djin-mplatformassets-itemsapi/assets/author-lists*",
          "resource.attributes.service@name" : "istio-ingressgateway",
          "span.attributes.component" : "proxy",
          "resource.attributes.cluster@lz" : "usr",
          "status.code" : 0,
          "span.attributes.http@method" : "GET",
          "resource.attributes.cluster@region" : "us-east-1",
          "span.attributes.user_agent" : "PostmanRuntime/7.25.0",
          "span.attributes.downstream_cluster" : "-",
          "span.attributes.peer@address" : "10.147.167.1",
          "span.attributes.guid:x-request-id" : "7c2e8549-ce46-93db-a0e9-6f326e5af7e0",
          "resource.attributes.cluster@name" : "vir-int-usr",
          "span.attributes.zone" : "us-east-1d",
          "span.attributes.net@host@ip" : "10.147.168.188",
          "span.attributes.http@status_code" : "200",
          "resource.attributes.cluster@env" : "int",
          "span.attributes.node_id" : "router~10.147.168.188~istio-ingressgateway-77c9d7fbf5-qp5fk.istio-system~istio-system.svc.cluster.local"
        }
      }
    ]
  }
}

I believe this could be causing the mapping to not work and the Services page in Trace Analytics to show up as blank, even though all the data exists. Any idea why the trace group would show up as null and its value end up in name?
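In case it's useful, a rough Dev Tools query to gauge how widespread this is (field name taken from the document above):

    # Count spans where traceGroup is missing/null
    GET otel-v1-apm-span/_count
    {
      "query": {
        "bool": {
          "must_not": { "exists": { "field": "traceGroup" } }
        }
      }
    }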

@wrijeff
Contributor

wrijeff commented Apr 14, 2021

@kuberkaul - sorry for the delay in updating this issue. I ended up creating multiple follow-up issues based on your feedback; they're linked above. Could you check if the service map is still having JS errors? A record in that cluster which was missing appears to now be present; hopefully that will resolve the immediate error.

What we think happened is described in #514, essentially an "edge" in the service map appeared to have been missing and the frontend didn't know how to handle it. I've created an issue on the Kibana side to handle missing edges instead of completely erroring. I've also got an issue on our side to look into what might've caused it to have been missing, and what we can do to prevent it in the future. Thanks again for your patience.
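If you want to poke at it yourself, a rough Dev Tools sketch for spotting relationship documents without a destination edge (field names assumed from Data Prepper's service-map index template):

    # Relationship documents that have no destination
    GET otel-v1-apm-service-map/_search
    {
      "query": {
        "bool": {
          "must_not": { "exists": { "field": "destination.domain" } }
        }
      }
    }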

@kuberkaul
Author

kuberkaul commented Apr 15, 2021

@wrijeff: That's great. I'm looking at the service map, though, and I still see the same error. Everything else (traces/dashboard) continues to work and gives me traces, but Services errors out. Also pasting the errors I get (same as before):
Screen Shot 2021-04-15 at 2 51 06 PM
Screen Shot 2021-04-15 at 2 53 07 PM

@kuberkaul
Author

Recreated the entire Elasticsearch domain and the error persists. In fact, now both the service map and Services turn up blank.
