Add goals for a OpenTelemetry Desktop Viewer/Development Tool #230

Open: 11 commits to be merged into base branch main

austinlparker
Member

Per the discussion in open-telemetry/community#1515, I have created this OTEP as a way to gather requirements and build agreement towards what a 'desktop viewer' for OpenTelemetry should be.

@austinlparker austinlparker requested a review from a team June 6, 2023 15:58
@tigrannajaryan
Member

There is a prototype that may be close to what this OTEP looks for: open-telemetry/opentelemetry-collector-contrib#19634

@reyang
Member

reyang commented Jun 19, 2023

This might change the charter and scope of OpenTelemetry (although "tools" can be extended to cover almost everything):

https://opentelemetry.io/

image

What do we do if later users ask for a query language support in the desktop viewer?

@austinlparker
Member Author

This might change the charter and scope of OpenTelemetry (although "tools" can be extended to cover almost everything):

As you point out, 'tools' is deliberately broad. The collector is a foundational component for a wide variety of tooling, such as tail-based sampling. Providing an open source utility for this purpose doesn't invalidate the existing and new commercial/proprietary solutions; similarly, a diagnostic tool for local query and display of telemetry doesn't prevent ecosystem development of the same. As an example, all databases have some sort of command-line utility (or even a GUI-based one), but that doesn't prevent the development of alternatives.

What do we do if later users ask for a query language support in the desktop viewer?

I think it would depend on exactly what they were asking for. I personally don't see how this utility could be created without some sort of simple query dialect (either SQL or a variant, or basic predicate matching like "attribute.name = 'foo' && latency > 500ms"). Perhaps this dialect could resemble the OpenTelemetry Transformation Language (OTTL) in its syntax and semantics?

That said, the goal of this OTEP is to narrowly scope the use case for this viewing tool, which also necessarily scopes the query functionality of it to what is needed to fulfill the purpose of the OTEP.
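
To make the "basic predicate matching" option concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Span` record is a simplified stand-in rather than the OTLP data model, and the helper names (`attr_equals`, `latency_over`, `all_of`, `filter_stream`) are hypothetical, not part of any OpenTelemetry API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OTLP data model)."""
    name: str
    latency_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)


# A predicate is any function that accepts a span and returns True to keep it.
Predicate = Callable[[Span], bool]


def attr_equals(key: str, value: str) -> Predicate:
    """Match spans whose attribute `key` equals `value`."""
    return lambda span: span.attributes.get(key) == value


def latency_over(threshold_ms: float) -> Predicate:
    """Match spans slower than `threshold_ms` milliseconds."""
    return lambda span: span.latency_ms > threshold_ms


def all_of(*predicates: Predicate) -> Predicate:
    """Logical AND over several predicates."""
    return lambda span: all(p(span) for p in predicates)


def filter_stream(spans: Iterable[Span], predicate: Predicate) -> Iterator[Span]:
    """Lazily yield only the spans that satisfy the predicate."""
    return (span for span in spans if predicate(span))


# Equivalent of: attribute.name = 'foo' && latency > 500ms
slow_foo = all_of(attr_equals("attribute.name", "foo"), latency_over(500))
```

Composable predicates like these cover simple filtering without committing the tool to a full query language, which is the narrow scope argued for above.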

@gillg

gillg commented Jun 21, 2023

I'm really sceptical about this topic. Maybe I don't see exactly what you want to achieve, because OpenTelemetry is a set of concepts.
I could imagine your tool as a kind of ephemeral database for exploring fresh data received during the development phase? But performance and long-term storage should not be a priority.
For a generic tool, maybe the very new websocket processor in otel collector contrib could help. That could avoid having to implement an OTLP receiver layer in the application.

But creating a "basic" logs/traces/metrics explorer seems very challenging...
About the language, OTTL seems a good candidate to extract data and experiment with transformations.

@svrnm
Member

svrnm commented Sep 13, 2023

Following up on this issue, since it became relevant in a discussion for docs (see open-telemetry/opentelemetry.io#3266 & open-telemetry/opentelemetry.io#3144 (comment)): I want to emphasize that such a tool would be greatly beneficial for the quality of our documentation and how people can get started with OpenTelemetry:

  • We want to promote the use of OTLP throughout our documentation, but right now we can only easily suggest the visualization of traces, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or translating them to Prometheus, etc. to have them visualized
  • Feedback I get a lot from end-users is that if they do not have an observability backend at hand, setting one up is rather painful if ALL you want is to look at a handful of logs, metrics & traces for your "let me try out otel" experience
  • Having an integration with the collector would help us to create a well-rounded story in the documentation: Set up otel in your app with console exporters, switch to OTLP exporters, send data to the collector, look at telemetry with OTel desktop viewer, update collector config to send telemetry to your backend of choice, everyone is happy!

@yurishkuro
Member

yurishkuro commented Sep 13, 2023

We want to promote the use of OTLP throughout our documentation, but right now we can only suggest the visualization of traces easily, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or to translate to prometheus, etc to have them visualized

so here's a radical idea - why not focus the effort that would be needed to develop these extra capabilities by building them within Jaeger, by extending its scope from "just traces"?

@svrnm
Member

svrnm commented Sep 13, 2023

We want to promote the use of OTLP throughout our documentation, but right now we can only suggest the visualization of traces easily, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or to translate to prometheus, etc to have them visualized

so here's a radical idea - why not focus the effort that would be needed to develop these extra capabilities by building them within Jaeger, by extending its scope from "just traces"?

It's more obvious than radical, looks like I didn't see the forest for the trees ... for the docs use case I outlined above this would definitely be a solution we can work with 👍

@samsp-msft

I started building a tool very similar to what is being talked about here, to aid .NET developers who are adopting OTEL with diagnosing what is happening. I was pointed to this thread by @reyang, and would be happy to contribute based on what I found during this investigation.

The tool I created uses the OTEL proto files to create an endpoint, listen to the data being sent, and provide basic visualizations. I happened to write this using Blazor as that was simplest for me, but it could be done using almost any stack.

The ideal usage experience is as an exe and/or container that can be run locally and uses a browser as the UI. It provides a default OTLP endpoint on https://localhost:5317 that you can configure your app to point to via the standard env variables. Ideally it's one app: it should not require installing a web server, database, etc. It shouldn't need a database; it can keep a buffer in memory of the most recent telemetry it has captured, and dispose of older data as needed.
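
As a rough illustration of the "buffer in memory of the most recent telemetry" idea, here is a minimal Python sketch. It is not the code of the tool described here (which was written in Blazor); the class name `RecentTelemetry` and its methods are hypothetical, and it simply relies on the standard library's bounded `deque`.

```python
from collections import deque
from typing import Any, Deque, List


class RecentTelemetry:
    """Keeps only the most recent `capacity` records; older ones are dropped."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen discards items from the opposite end when full,
        # so appending on the right evicts the oldest record on the left.
        self._items: Deque[Any] = deque(maxlen=capacity)

    def add(self, record: Any) -> None:
        """Store a newly received record (log, metric point, or span)."""
        self._items.append(record)

    def snapshot(self) -> List[Any]:
        """Return the retained records, oldest first."""
        return list(self._items)


buffer = RecentTelemetry(capacity=3)
for record_id in range(5):
    buffer.add(record_id)
# Only the three newest records (2, 3 and 4) are still retained here.
```

A fixed capacity keeps memory use predictable with no database involved, at the cost of losing anything older than the last `capacity` records.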

Somebody I showed it to described it as Wireshark for OTLP, which I think is an apt description.

Here are a couple of screenshots of the UI. It's ugly, as I am not a designer, but I hope it gets the point across:

Logging

logs
logs2

This shows the log messages that have been collected via OTLP. It has basic filtering capability. Clicking an entry will show all the parameters. (I originally showed them all in the table, but the columns got out of hand.)

Metrics

metrics

This shows the metrics that have been sent to the OTLP endpoint. You don't need to fish for them, they are shown in a list.
When you select a metric, you see the dimension combinations that have been emitted, a list of recent values, and a basic graph. Nobody should confuse this with a dashboard system, but if you want to know what platform metrics you are getting, and/or verify your own metrics, it's ideal.

metrics2
This second view is for a metric with multiple dimension combinations, so there is a graph for each combination.

metrics3
This 3rd view is for a histogram metric, again with multiple dimension combinations. Histograms are shown with a bar chart based on the buckets.

Tracing

traces

There is no significant new ground trod here: you get a Gantt chart view similar to Jaeger or Zipkin. The difference is that all traces are listed as they are seen, with no need to query for them. Trace properties are shown for the selected span.

In this case the color choice is a bit odd, as the test app is calling back to itself. The idea is to pick a color for each process, and use a gradient for each operation that occurs. The test app shown is recursive, which is why those particular bands got used.

Some spans include little diamonds, those represent the events that occurred during the span - the details are shown in the pane on the right.

@samsp-msft

What do we do if later users ask for a query language support in the desktop viewer?

Say no - a query language probably isn't required - simple filtering based on existing values is probably sufficient. Depending on implementation tech, something like https://dynamic-linq.net/ for .NET can do that for you, but it should be a low-priority feature.

@austinlparker
Member Author

Wanted to resurrect this OTEP a bit after some discussions at KubeCon. I'm going to briefly summarize the top-level changes/thoughts in this comment for the sake of people subscribed to this thread.

In general, I still think this is a good idea, and I also think it's a good idea to have this be a separate component owned by OpenTelemetry rather than trying to get Jaeger or some other project to build it for us. Existing tools are, more or less, good at what they do today. I wouldn't necessarily want to mix the functionality of what this can be with what other tools already are. Moreover, I don't want an OpenTelemetry component to be managed by a different project. Finally, I think one of the reasons OpenTelemetry works as well as it does is because we very explicitly do not prefer any data store, query language, etc. Having Jaeger become 'the default' would change this, even if it was just for a local-only dev experience.

The best analogue to what I have in mind is from the past, specifically, the spate of Installers and Dashboards that popped up in the early days of k8s. There were multiple competing installer scripts and dashboards that, by and large, simply don't exist any more. The Kubernetes Dashboard still exists, but all of its functionality has been absorbed into other ecosystem tools (e.g. console viewers like k9s or UIs in managed k8s environments), same with installation (such as Cluster API and kubeadm).

I view this OTEP as, conceptually, having a similar path. I want us to be able to say as a project, "hey, here is a starting point and a good default for developers and operators who are end-users to be able to see what's going on in a nice UI". You should be able to use it to get real-time feedback on OTTL transforms, on changes to environment variables, and on new attributes you're adding to code. It doesn't need persistence, it doesn't need a query language, it really should just be a filterable stream. You should also be able to use this component with OpAMP, to view/read/write changes to configs. We can't rely solely on the community or vendors to create tools here -- if we do, those tools will almost certainly not be licensed in favorable ways, or perhaps will not be as vendor-agnostic as we'd like.

Anyway, please review the updated OTEP and let me know what you think.

Member

@trask trask left a comment


are there any production use cases you are thinking of, or can we explicitly say it's "not for production"?

@austinlparker
Member Author

are there any production use cases you are thinking of, or can we explicitly say it's "not for production"?

By itself, I can't really think of a production use case, but I think it's worthwhile to consider that 'production' means a lot of things. Like, if I'm running a homelab with k8s that serves some public services (so it's production to me) then I could see using this to manage OpAMP configs, for example. Would I run a business on it? No.

Could other vendors or community partners come along and make this (or parts of it) into a 'production' system? Sure, in the same way that (conceptually) people took the k8s dashboard and incorporated it into their managed k8s solutions.

@austinlparker
Member Author

I would say that it's not designed for production use.

@samsp-msft

FYI - We added something very similar to .NET Aspire in the form of a developer dashboard for exactly these scenarios.
https://devblogs.microsoft.com/dotnet/introducing-dotnet-aspire-simplifying-cloud-native-development-with-dotnet-8/#dashboard-your-central-hub-for-app-monitoring-and-inspection
This has proved to be a very popular part of Aspire.

@austinlparker
Member Author

FYI - We added something very similar to .NET Aspire in the form of a developer dashboard for exactly these scenarios. https://devblogs.microsoft.com/dotnet/introducing-dotnet-aspire-simplifying-cloud-native-development-with-dotnet-8/#dashboard-your-central-hub-for-app-monitoring-and-inspection This has proved to be a very popular part of Aspire.

I believe I call out Aspire explicitly in the updated OTEP :) It's good stuff.

@martinjt
Member

I would, very explicitly, say this is not something we should promote for production. To the point that we explicitly say it's not for production.

The in-memory element of this makes it practically impossible, and very resource/cost intensive to run at a level that provides real benefit to production.

I've spent time with the Aspire Dashboard (the thing that @samsp-msft mentioned), specifically looking at the production use-cases, and although they're promoting that use-case, I can say with some confidence that a decent sized site isn't going to get use out of it. I've run it with the Otel demo and it was unusable, purely down to the size of telemetry generated by even such a small site, with a small amount of load.

To go further, the majority of collector installations, based on the survey and my own experience with customers and developers, involve multiple instances of the collector, which means that without a distributed datastore it isn't going to work. Given that we don't want to push for a datastore connector for it, that wouldn't make sense. At best it won't be useful; at worst it will end up causing people to think that there's a problem with their traces and logs.

I 100% support the idea behind this OTEP, I think it will be a great addition to the toolkit for Local Development use-cases for debugging, and also for thinking about how to debug telemetry in general.

@austinlparker
Member Author

I would, very explicitly, say this is not something we should promote for production. To the point that we explicitly say it's not for production.

I mean, I think it's useful to spell out some specific use cases here and see if we agree on what 'production' means.

In-Scope:

  • I'm a solo/hobbyist developer with a home lab. I have a k8s cluster with a few applications deployed. I want to set up the Collector with some data transformations, so I deploy this viewer and connect it to my Collectors in order to view the log stream and transformed data.
  • I'm a professional developer writing code to add OpenTelemetry instrumentation to an existing or new service. I'm running a local Collector that I'm sending metrics/logs/traces to from my service. I want to quickly see adjustments to the span attributes and new spans that I'm creating, excluding other telemetry that my service may be sending.
  • I'm an operator that has a self-managed production cluster or deployment of Collectors. I want a drop-in tool that can show me the data stream on an individual collector via a UI.

Out of Scope:

  • I'm a developer trying to monitor the performance of an application by analyzing telemetry through a dashboard.
  • I'm an operator trying to manage a fleet of Collector configurations via OpAMP or get their health on a long-term basis.
  • I'm an OpenTelemetry user trying to record data from my service and visualize it over a long period of time and make this available to other users in my organization.

I would suggest that the out of scope actions are clearly 'production' use cases, but I just want to make sure that we're ok with the in scope items being in scope and being "non-production".

@samsp-msft

Phrasing it as production / non-production may not be the best way to talk about it, as it's not about the type of workload that it's used with, but about the purpose and type of analysis that the tool will perform:

  • It is for instantaneous sniffing of the data to aid developers to see what is being sent.
  • It is not for doing any kind of analysis that involves history - such as trends, search, alerting, comparison, auditing.

The work that we (Microsoft) are doing with the Aspire dashboard is not intended to replace Azure Monitor/Application Insights as the Azure APM solution. The "production scenarios" for using the Aspire dashboard are to aid developers in diagnosing post-deployment teething problems or bug repro scenarios. For day-to-day monitoring, alerting, problem detection, trend detection, etc., you should use an APM such as Application Insights, Grafana, etc.

@martinjt
Member

the Aspire dashboard are to aid developers in diagnosing post-deployment teething problems or bug repro scenarios

That's the issue, with a real production site, a 10k circular buffer is just unusable. For a low volume hobby site, it's probably fine, but anything more than that and it's basically not useful for those scenarios. I suppose unless your site literally stops and doesn't actually serve the traffic, but there are better ways to solve that.

I say that as someone who loves the dashboard, and has spent a lot of time using it so far. Saying it's for a production deployment is a mistake in my opinion, and it's going to have more of a negative impact than positive on the whole telemetry movement.

I'm a solo/hobbyist developer with a home lab. I have a k8s cluster with a few applications deployed. I want to set up the Collector with some data transformations, so I deploy this viewer and connect it to my Collectors in order to view the log stream and transformed data.

I can see that being an OK use case, but I don't think it should be a goal, or something that is actively catered for. Those people will likely do it anyway.

I'm an operator that has a self-managed production cluster or deployment of Collectors. I want a drop-in tool that can show me the data stream on an individual collector via a UI.

My issue here is that it's a stream, a fast stream; that circular buffer won't be enough to actually catch anything, as you can't scroll back. It's the same issue as with the Aspire production scenario. The window between when that use case is valid and when it's unusable/not useful is so narrow that I don't think it's a use case that should be a goal for the project.
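
To put rough numbers on the circular-buffer concern: the time a fixed-size buffer can look back is its capacity divided by the arrival rate. A small Python sketch follows. The 10k capacity is taken from the comment above; the function name `retention_seconds` and the two arrival rates are made up purely for illustration and are not measurements of any real system.

```python
def retention_seconds(capacity: int, records_per_second: float) -> float:
    """How far back (in seconds) a full circular buffer can see.

    Once the buffer is full, each new record evicts the oldest one, so the
    oldest retained record is roughly capacity / rate seconds old.
    """
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return capacity / records_per_second


BUFFER_CAPACITY = 10_000  # the 10k circular buffer discussed above

# Assumed arrival rates, for illustration only.
hobby_site = retention_seconds(BUFFER_CAPACITY, records_per_second=5)
busy_site = retention_seconds(BUFFER_CAPACITY, records_per_second=2_000)

print(f"hobby site: {hobby_site:.0f} s of history")  # 2000 s, about 33 minutes
print(f"busy site:  {busy_site:.0f} s of history")   # 5 s
```

Under these assumed rates the same buffer holds about half an hour of history at low volume but only a few seconds at high volume, which is the gap between the hobby and production scenarios being argued about.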

@trask
Member

trask commented Mar 26, 2024

Those people will likely do it anyway.

👍

@austinlparker
Member Author

I think you're underestimating how many values can be stored in memory, but whatever, the lack of persistent storage makes it "not for production use" by default. I'd rather we focus on things that we can definitively state rather than what the definition of is, is, vis a vis "what is production"

1. While Prometheus and Jaeger do provide powerful analysis tools and can both
be run fairly easily in a local environment, they are not necessarily built
for the purpose of this proposal as-is.
2. There is no CNCF tooling available for Dashboard or Log Storage/Querying;

For Dashboards, there is Perses, in candidate status. But it's currently a few widgets short of what I imagine would come out of this OTEP.

@mmanciop

mmanciop commented Apr 9, 2024

I love the idea, it's a great problem to solve. I'd join in helping with the user flows and overall UX.

Relations with other CNCF projects

For Dashboards, one could consider Perses, in candidate status. It's currently a few widgets short of what this OTEP seems to need (like heatmaps and timelines), but IMO it would be a net benefit for the community if we ended up contributing widgets to round out the usual observability visualizations.

On how to communicate about (lack of) production-readiness

To make sure that experienced people will not use it in production, NOT supporting the following usually does the trick:

  1. Persistence of data (already covered as of bcf3fd5)
  2. Alerting (currently not spelled out as out-of-scope AFAICT)

On the storage

About the data storage, I am wondering if we could also store the data browser-side: the browser would retain a part of the data (a moving window with a memory cap?) as it is streamed by the collector, reducing the amount of buffering needed in the collector itself.

The main side-effect would be that different user sessions would see different subsets of data, but that seems to me like an acceptable tradeoff, and it likely moves most of the complexity into the frontend, where it's (in my experience) cheaper and faster to develop and iterate.
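
The "moving window with memory cap" can be sketched as follows. This is only an illustration of the eviction logic, written in Python for brevity even though a real browser-side store would live in the frontend. The class name `CappedWindow`, the byte-based cap, and the policy of dropping a payload larger than the whole cap are all assumptions, not part of the proposal.

```python
from collections import deque
from typing import Deque, List


class CappedWindow:
    """Moving window of payloads bounded by total size in bytes."""

    def __init__(self, max_bytes: int) -> None:
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        self._max_bytes = max_bytes
        self._items: Deque[bytes] = deque()
        self._used = 0

    @property
    def used_bytes(self) -> int:
        """Total size of the payloads currently retained."""
        return self._used

    def push(self, payload: bytes) -> None:
        """Add a payload, evicting the oldest ones until the cap is respected."""
        size = len(payload)
        if size > self._max_bytes:
            # Assumed policy: a payload that can never fit is dropped outright.
            return
        self._items.append(payload)
        self._used += size
        while self._used > self._max_bytes:
            evicted = self._items.popleft()
            self._used -= len(evicted)

    def contents(self) -> List[bytes]:
        """Return the retained payloads, oldest first."""
        return list(self._items)


window = CappedWindow(max_bytes=10)
for chunk in (b"aaaa", b"bbbb", b"cccc"):
    window.push(chunk)
# The third push exceeds the 10-byte cap, so the oldest chunk is evicted.
```

Because each session evicts independently based on what it has received, two sessions opened at different times will hold different subsets, which matches the side-effect described above.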

@jaronoff97

Alright so I've been hacking on a project for a while that I'm finally ready to release a bit more broadly. It fits the following criteria from Austin's comments:

  • gives developers and operators a way to view what's going on in the collector in real time
  • does some minimal filtering on resource and attributes on a live stream of telemetry (works for metrics, traces and logs)
  • optionally connects to a collector via OpAMP (right now it just lets you view config, view identifying/non identifying attrs)
  • wholly vendor neutral!
  • very minimal server overhead: I have it running as a sidecar to running collectors, and it passively uses around 25-50m of memory!
  • no storage needed, everything is streamed directly to the client!

You can view the repo here and my blog post that goes over a quick demo of finding some logs missing an attribute, updating the collector config, and viewing it worked!

Here are some screenshots:
home
clicked
filters
config

I'd love to answer any questions or take any feedback anyone has to make it suit this issue.

@austinlparker
Member Author

@jaronoff97 Thanks! This is pretty much exactly what I had in mind, yeah.

@rogeralsing

rogeralsing commented Apr 29, 2024

Not trying to sell anything here; I heard about this topic in the OpenTelemetry Slack channel.
I've just released my own tool for the very same purpose here: https://tracelens.io/

The focus is on visualization and on helping developers better understand what is going on in a distributed system,
e.g. what is actually happening in an IoT solution or in a game, etc.

Maybe it's of value to someone else here. Or delete my comment if it's too unrelated.

image

image

@devurandom

I started building a tool very similar to what is being talked about here, to aid .NET developers who are adopting OTEL with diagnosing what is happening.

I could not find a link and first thought it might be a MS internal tool, but it seems to be on GitHub (unlicensed, though): https://github.com/samsp-msft/OTLPView

@samsp-msft

I started building a tool very similar to what is being talked about here, to aid .NET developers who are adopting OTEL with diagnosing what is happening.

I could not find a link and first thought it might be a MS internal tool, but it seems to be on GitHub (unlicensed, though): https://github.com/samsp-msft/OTLPView

@devurandom That prototype evolved into the Aspire dashboard. It is now available as a standalone Docker container for monitoring OTLP data. https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/dashboard/standalone

@devurandom

I started building a tool very similar to what is being talked about here, to aid .NET developers who are adopting OTEL with diagnosing what is happening.

I could not find a link and first thought it might be a MS internal tool, but it seems to be on GitHub (unlicensed, though): https://github.com/samsp-msft/OTLPView

@devurandom That prototype evolved into the Aspire dashboard. It is now available as standalone docker container for monitoring OTLP data. https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/dashboard/standalone

Thanks! Is it correct that the code is open and MIT-licensed, living at https://github.com/dotnet/aspire/tree/main/src/Aspire.Dashboard ?
The docs are also MIT-licensed and live at https://github.com/dotnet/docs-aspire/blob/main/docs/fundamentals/dashboard/standalone.md.

@pellared
Member

Regarding the query language, there is a Query Standardization Working Group in CNCF TAG Observability. Shouldn't the efforts be combined?

@samsp-msft

@devurandom That prototype evolved into the Aspire dashboard. It is now available as standalone docker container for monitoring OTLP data. https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/dashboard/standalone

Thanks! Is it correct that the code is open and MIT-licensed, living at https://github.com/dotnet/aspire/tree/main/src/Aspire.Dashboard ? The docs are also MIT-licensed and live at https://github.com/dotnet/docs-aspire/blob/main/docs/fundamentals/dashboard/standalone.md.

Yes, it's MIT-licensed. You are free to fork it and do with it what you want. We also welcome suggestions and PRs.
The dashboard has become one of the most popular features of the Aspire project.
