Add goals for an OpenTelemetry Desktop Viewer/Development Tool #230
base: main
Conversation
Co-authored-by: Severin Neumann <[email protected]>
There is a prototype that may be close to what this OTEP is looking for: open-telemetry/opentelemetry-collector-contrib#19634
This might change the charter and scope of OpenTelemetry (although "tools" can be extended to cover almost everything): what do we do if users later ask for query language support in the desktop viewer?
As you point out, 'tools' is deliberately broad. The collector is a foundational component for a wide variety of tooling, such as tail-based sampling. Providing an open source utility for this purpose doesn't invalidate existing or new commercial/proprietary solutions; similarly, a diagnostic tool for local query and display of telemetry doesn't prevent ecosystem development of the same. As an example, every database ships some sort of command-line utility (or even a GUI-based one), but that doesn't prevent the development of alternatives.
I think it would depend on exactly what they were asking for. I personally don't see how this utility could be created without some sort of simple query dialect (either SQL or a variant, or basic predicate matching like 'attribute.name = 'foo' && latency > 500ms'). Perhaps this dialect could resemble the OpenTelemetry Transformation Language (OTTL) in its syntax and semantics? That said, the goal of this OTEP is to narrowly scope the use case for this viewing tool, which also necessarily scopes its query functionality to what is needed to fulfill the purpose of the OTEP.
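Not part of the OTEP itself, but as a rough sketch of how far basic predicate matching can go without a full query language (the type, field names, and sample data below are invented for illustration):

```typescript
// Hypothetical in-memory span shape; field names are illustrative only.
interface SpanRecord {
  name: string;
  latencyMs: number;
  attributes: Record<string, string>;
}

// A predicate is just a function over a span; a tiny dialect like
// "attribute.name = 'foo' && latency > 500ms" could compile down to this.
type SpanPredicate = (span: SpanRecord) => boolean;

const hasAttribute = (key: string, value: string): SpanPredicate =>
  (span) => span.attributes[key] === value;

const latencyAbove = (ms: number): SpanPredicate =>
  (span) => span.latencyMs > ms;

const and = (...preds: SpanPredicate[]): SpanPredicate =>
  (span) => preds.every((p) => p(span));

// Filter the in-memory buffer directly, with no query engine involved.
const buffer: SpanRecord[] = [
  { name: "GET /checkout", latencyMs: 742, attributes: { name: "foo" } },
  { name: "GET /health", latencyMs: 3, attributes: {} },
];
const matches = buffer.filter(and(hasAttribute("name", "foo"), latencyAbove(500)));
console.log(matches.map((s) => s.name)); // ["GET /checkout"]
```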
I'm really sceptical about this topic. Maybe I don't exactly see what you want to achieve, because OpenTelemetry is a set of concepts. But creating a "basic" logs/traces/metrics explorer seems very challenging...
Following up on this issue, since it became relevant in a discussion for docs (see open-telemetry/opentelemetry.io#3266 & open-telemetry/opentelemetry.io#3144 (comment)): I want to emphasize that such a tool would be greatly beneficial for the quality of our documentation and how people can get started with OpenTelemetry:
So here's a radical idea: why not focus the effort that would be needed to develop these extra capabilities on building them within Jaeger, extending its scope beyond "just traces"?
It's more obvious than radical; looks like I didn't see the forest for the trees ... For the docs use case I outlined above, this would definitely be a solution we can work with 👍
I started building a tool very similar to what is being talked about here, to help .NET developers who are adopting OTEL diagnose what is happening. I was pointed to this thread by @reyang, and would be happy to contribute based on what I found during this investigation.

The tool I created uses the OTEL proto files to create an endpoint, listen to the data being sent, and provide basic visualizations. I happened to write this using Blazor as that was simplest for me, but it could be done using almost any stack. The ideal usage experience is as an exe and/or container that can be run locally, and uses a browser as the UI. It provides a default OTLP endpoint on https://localhost:5317 that you can configure your app to point to via the standard env variables. Ideally it's one app: it should not require installing a web server, database, etc. It shouldn't need a database; it can keep a buffer in memory of the most recent telemetry it has captured, and dispose of older data as needed. Somebody I showed it to described it as Wireshark for OTLP, which I think is an apt description.

Here are a couple of screenshots of the UI (it's ugly, as I am not a designer), but I hope they get the point across:

Logging
This shows the log messages that have been collected via OTLP. It has basic filtering capability. Clicking an entry will show all the parameters. (I originally showed them all in the table, but the columns got out of hand.)

Metrics
This shows the metrics that have been sent to the OTLP endpoint. You don't need to fish for them; they are shown in a list.

Tracing
There is no significant new ground trodden here: you get a Gantt chart view similar to Jaeger or Zipkin. The difference is that all traces are listed as they are seen, with no need to query for them. Trace properties are shown for the selected span. In this case the color choice is a bit odd because the test app is calling back to itself. The idea is to pick a color for each process and use a gradient for each operation that occurs. The test app shown is recursive, which is why these particular bands got used. Some spans include little diamonds; those represent the events that occurred during the span, and their details are shown in the pane on the right.
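As an aside, here is a minimal sketch of the "no database, just a bounded in-memory buffer" idea described above; the class name and capacity are hypothetical and not taken from the actual tool:

```typescript
// A fixed-capacity buffer: once full, each new record evicts the oldest one,
// so memory use stays bounded without any external storage.
class RingBuffer<T> {
  private items: T[] = [];
  constructor(private readonly capacity: number) {}

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // dispose of the oldest entry
    }
  }

  // Snapshot in arrival order, oldest first.
  toArray(): T[] {
    return [...this.items];
  }
}

// Example: keep only the most recent 10,000 log records received over OTLP.
const recentLogs = new RingBuffer<{ body: string; severity: string }>(10_000);
recentLogs.push({ body: "user signed in", severity: "INFO" });
```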
Say no: a query language probably isn't required; simple filtering based on existing values is probably sufficient. Depending on implementation tech, something like https://dynamic-linq.net/ for .NET can do that for you, but it should be a low-priority feature.
Wanted to resurrect this OTEP a bit after some discussions at KubeCon. I'm going to briefly summarize the top-level changes/thoughts in this comment for the sake of people subscribed to this thread.

In general, I still think this is a good idea, and I also think it's a good idea to have this be a separate component owned by OpenTelemetry rather than trying to get Jaeger or some other project to build it for us. Existing tools are, more or less, good at what they do today. I wouldn't necessarily want to mix the functionality of what this can be with what other tools already are. Moreover, I don't want an OpenTelemetry component to be managed by a different project. Finally, I think one of the reasons OpenTelemetry works as well as it does is because we very explicitly do not prefer any data store, query language, etc. Having Jaeger become 'the default' would change this, even if it was just for a local-only dev experience.

The best analogue to what I have in mind is from the past: specifically, the spate of installers and dashboards that popped up in the early days of k8s. There were multiple competing installer scripts and dashboards that, by and large, simply don't exist any more. The Kubernetes Dashboard still exists, but all of its functionality has been absorbed into other ecosystem tools (e.g. console viewers like k9s or UIs in managed k8s environments); the same goes for installation (e.g. Cluster API and kubeadm). I view this OTEP as, conceptually, having a similar path.

I want us to be able to say as a project, "hey, here is a starting point and a good default for developers and operators who are end-users to be able to see what's going on in a nice UI". You should be able to use it to get real-time feedback on OTTL transforms, on changes to environment variables, and on new attributes you're adding to code. It doesn't need persistence, it doesn't need a query language, it really should just be a filterable stream. You should also be able to use this component with OpAMP, to view/read/write changes to configs.

We can't rely solely on the community or vendors to create tools here -- if we do, those tools will almost certainly not be licensed in favorable ways, or perhaps will not be as vendor-agnostic as we'd like.

Anyway, please review the updated OTEP and let me know what you think.
Are there any production use cases you are thinking of, or can we explicitly say it's "not for production"?
Co-authored-by: Trask Stalnaker <[email protected]>
By itself, I can't really think of a production use case, but I think it's worthwhile to consider that 'production' means a lot of things. Like, if I'm running a homelab with k8s that serves some public services (so it's production to me), then I could see using this to manage OpAMP configs, for example. Would I run a business on it? No. Could other vendors or community partners come along and make this (or parts of it) into a 'production' system? Sure, in the same way that (conceptually) people took the k8s dashboard and incorporated it into their managed k8s solutions.
I would say that it's not designed for production use.
FYI - we added something very similar to .NET Aspire in the form of a developer dashboard for exactly these scenarios.
I believe I call out Aspire explicitly in the updated OTEP :) It's good stuff.
I would, very explicitly, say this is not something we should promote for production, to the point that we explicitly say it's not for production. The in-memory element of this makes it practically impossible, and very resource/cost intensive, to run at a level that provides real benefit in production.

I've spent time with the Aspire Dashboard (the thing that @samsp-msft mentioned), specifically looking at the production use cases, and although they're promoting that use case, I can say with some confidence that a decent-sized site isn't going to get use out of it. I've run it with the OTel demo and it was unusable, purely down to the volume of telemetry generated by even such a small site with a small amount of load.

To go further, the majority of collector installations, based on the survey and my own experience with customers and developers, involve multiple instances of the collector, which means that without a distributed datastore it isn't going to work. Given that we don't want to push for a datastore connector for it, that wouldn't make sense. At best it won't be useful; at worst it will end up causing people to think that there's a problem with their traces and logs.

I 100% support the idea behind this OTEP. I think it will be a great addition to the toolkit for local development use cases, for debugging, and also for thinking about how to debug telemetry in general.
I mean, I think it's useful to spell out some specific use cases here and see if we agree on what 'production' means. In-Scope:
Out of Scope:
I would suggest that the out-of-scope actions are clearly 'production' use cases, but I just want to make sure that we're OK with the in-scope items being in scope and being "non-production".
Phrasing it as production / non-production may not be the best way to talk about it, as it's not about the type of workload that it's used with, but about the purpose and type of analysis that the tool will perform:
The work that we (Microsoft) are doing with the Aspire dashboard is not intended to replace Azure Monitor/Application Insights as the Azure APM solution. The "production scenarios" for using the Aspire dashboard are to aid developers in diagnosing post-deployment teething problems or bug repro scenarios. Day-to-day monitoring, alerting, problem detection, trend detection, etc. should use an APM such as Application Insights, Grafana, etc.
That's the issue: with a real production site, a 10k circular buffer is just unusable. For a low-volume hobby site it's probably fine, but anything more than that and it's basically not useful for those scenarios. Unless, I suppose, your site literally stops and doesn't actually serve traffic, but there are better ways to solve that. I say that as someone who loves the dashboard and has spent a lot of time using it so far. Saying it's for a production deployment is a mistake in my opinion, and it's going to have more of a negative impact than a positive one on the whole telemetry movement.
I can see that being an OK use case, but in that scenario I don't think it should be a goal, or something that is actively catered for. Those people will likely do it anyway.
My issue here is that it's a stream, a fast stream, and that circular buffer won't be enough to actually catch anything, as you can't scroll back. It's the same issue as with the Aspire production scenario. The window in which that use case is valid, versus unusable or not useful, is so narrow that I don't think it should be a goal for the project.
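To make that concern concrete with a rough, purely illustrative calculation (the rates below are assumptions, not measurements): a fixed-size buffer holds capacity / ingest-rate seconds of history, which works out to minutes of context for local development but only seconds for a busy production site.

```typescript
// Back-of-the-envelope retention for a fixed-size circular buffer.
// All rates are illustrative assumptions, not measurements.
const bufferCapacity = 10_000; // records the buffer can hold

const retentionSeconds = (recordsPerSecond: number) =>
  bufferCapacity / recordsPerSecond;

console.log(retentionSeconds(2_000)); // busy production site: 5   (~5 seconds of history)
console.log(retentionSeconds(20));    // local development:    500 (~8 minutes of history)
```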
👍
I think you're underestimating how many values can be stored in memory, but whatever; the lack of persistent storage makes it "not for production use" by default. I'd rather we focus on things that we can definitively state rather than on what the definition of "is" is, vis-à-vis "what is production".
Co-authored-by: Nathan Lincoln <[email protected]>
1. While Prometheus and Jaeger do provide powerful analysis tools and can both be run fairly easily in a local environment, they are not necessarily built for the purpose of this proposal as-is.
2. There is no CNCF tooling available for Dashboard or Log Storage/Querying;
For Dashboards, there is Perses, in candidate status. But it's currently a few widgets short of what I can imagine as the result of this OTEP.
I love the idea, it's a great problem to solve. I'd join in helping with the user flows and overall UX.

Relations with other CNCF projects
For Dashboards, one could consider Perses, in candidate status. It's currently a few widgets short of what this OTEP seems to need (like heatmaps and timelines), but IMO it would be a net benefit for the community if we ended up contributing widgets to round out the usual observability visualizations.

On how to communicate about (lack of) production-readiness
To reduce the risk that experienced people will use it in production, usually NOT supporting the following does the trick:
On the storage
About the data storage, I am wondering if we could not also store the data browser-side: the browser would retain a part of the data (a moving window with a memory cap?) as it is streamed by the collector, reducing the amount of buffering needed in the collector itself. The main side-effect would be that different user sessions would see different subsets of data, but that seems to me like an acceptable tradeoff, and it likely moves most of the complexity into the frontend, where it's (in my experience) cheaper and faster to develop and iterate.
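A quick sketch of what that browser-side window could look like, assuming records arrive one at a time from a stream; the class name, record shape, and 50 MB budget are all invented for illustration:

```typescript
// Keep a sliding window of recent telemetry in the browser, capped by an
// approximate memory budget rather than a fixed record count.
interface StoredRecord {
  receivedAt: number;
  approxBytes: number;
  payload: unknown;
}

class BrowserTelemetryWindow {
  private records: StoredRecord[] = [];
  private totalBytes = 0;

  constructor(private readonly maxBytes = 50 * 1024 * 1024) {} // ~50 MB cap

  add(payload: unknown): void {
    const approxBytes = JSON.stringify(payload).length; // rough size estimate
    this.records.push({ receivedAt: Date.now(), approxBytes, payload });
    this.totalBytes += approxBytes;
    // Evict the oldest records until we are back under the memory budget,
    // always keeping at least the newest record.
    while (this.totalBytes > this.maxBytes && this.records.length > 1) {
      this.totalBytes -= this.records.shift()!.approxBytes;
    }
  }

  // Everything currently retained, oldest first.
  snapshot(): StoredRecord[] {
    return [...this.records];
  }
}
```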
Alright so I've been hacking on a project for a while that I'm finally ready to release a bit more broadly. It fits the following criteria from Austin's comments:
You can view the repo here, along with my blog post that goes over a quick demo of finding some logs missing an attribute, updating the collector config, and seeing that it worked! I'd love to answer any questions or take any feedback anyone has to make it suit this issue.
@jaronoff97 Thanks! This is pretty much exactly what I had in mind, yeah.
Not trying to sell anything here; I was informed about this topic via the OpenTelemetry Slack channel. The focus is on visualization and helping developers better understand what is going on in a distributed system. Maybe it's of value to someone else here, or delete my comment if it's too unrelated.
I could not find a link and first thought it might be an MS-internal tool, but it seems to be on GitHub (unlicensed, though): https://github.com/samsp-msft/OTLPView
@devurandom That prototype evolved into the Aspire dashboard. It is now available as a standalone Docker container for monitoring OTLP data. https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/dashboard/standalone
Thanks! Is it correct that the code is open and MIT-licensed, living at https://github.com/dotnet/aspire/tree/main/src/Aspire.Dashboard?
Regarding query languages, there is a Query Standardization Working Group in CNCF TAG Observability. Shouldn't the efforts be combined?
Yes, it's MIT-licensed. You are free to fork it and do with it what you want. We also welcome suggestions and PRs.
Per the discussion in open-telemetry/community#1515, I have created this OTEP as a way to gather requirements and build agreement towards what a 'desktop viewer' for OpenTelemetry should be.