Skip to content

NRE when calling TraceEventProviders.GetProviderGuidByName #2177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

desdesdes
Copy link

Then TraceEventSession.ProviderNameToGuid is called code assumes this never results in null. See TraceEventProviders.GetProviderGuidByName. Then TraceEventNativeMethods.TdhEnumerateProviders results return a hr != 0 or providersDesc == null then Trace.WriteLine will be called and the function returns null. This pull request changes the behavior to return an incomplete list instead of crashing with a NRE. A point can me made to change the Trace.WriteLine to a throw new Exception(.

…eturns incomplete list instead of crashing with a NRE.
@desdesdes desdesdes requested a review from brianrob as a code owner March 27, 2025 12:16
@cincuranet
Copy link
Collaborator

I think the most common way TdhEnumerateProviders would fail is when the account does not have permissions. I assume you found this when using TraceEvent package directly, right?

My gut feeling is telling me that throwing an exception would be more correct solution to this. But it is also very rough. Looking into the code around, I don't see a pattern we use for such cases. I'll defer to @brianrob for final call about this.

@brianrob
Copy link
Member

@desdesdes do you have a scenario where you're seeing this fail? I've seen a couple of reports, but haven't been able to get more information on what is actually happening.

@desdesdes
Copy link
Author

@desdesdes do you have a scenario where you're seeing this fail? I've seen a couple of reports, but haven't been able to get more information on what is actually happening.

We have an application which runs as a windows service. This service sometimes seems to fail to startup while booting windows. The windows infrastructure automatically retries and after a some retries it succeeds to start. We have a few thousand servers running and we only experience this issue on some servers. We have not found a pattern yet.

We pinpointed the problem to this location in code. It could be that the TdhEnumerateProviders is returning 122 and TdhEnumerateProviders should be called in a loop. It could also be that TdhEnumerateProviders is dependent on another windows service or another part of windows which is still initializing, but I don't know.

It would be nice if you throw and exception to add the HR to the exception so we can analyse this better.

@brianrob
Copy link
Member

brianrob commented Apr 2, 2025

Thank you for clarifying the scenario. Do you have a callstack of when this happens?

@desdesdes
Copy link
Author

desdesdes commented Apr 2, 2025

Thank you for clarifying the scenario. Do you have a callstack of when this happens?

Hope this helps.

Application: ProfitPerformanceMonitor.exe
CoreCLR Version: 8.0.1425.11118
.NET Version: 8.0.14

Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Diagnostics.Tracing.Session.TraceEventProviders.GetProviderGuidByName(String name)
at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.EnableProvider(String providerName, TraceEventLevel providerLevel, UInt64 matchAnyKeywords, TraceEventProviderOptions options)
at Afas.Profit.ProfitTraceWriter.TraceWatcher..ctor(String traceSessionName, UInt64 flags, TraceEventLevel level) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Trace Writer\ProfitTraceWriter\TraceWatcher.cs:line 26
at ProfitPerformanceMonitor.Engine.Start() in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Engine.cs:line 83
at ProfitPerformanceMonitor.Worker.ExecuteAsync(CancellationToken stoppingToken) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Worker.cs:line 15
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>g__LogAndRethrow|15_3(<>c__DisplayClass15_0&)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Run(IHost host)
at ProfitPerformanceMonitor.Program.Main(String[] args) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Program.cs:line 38

@brianrob
Copy link
Member

brianrob commented Apr 2, 2025

Yes, this is exactly what I am looking for. I think this is worth fixing, but I'm afraid that the fix is more complicated than this. I think that @cincuranet is right that we should be throwing an exception. The documentation states that there are only two error codes that should come back from the call to TdhEnumerateProviders, but I suspect that you're right - there might be something else happening under the covers and a different error code is returned.

A more appropriate fix will be a bit more involved, as I think we'll want to throw an exception when an unexpected error code is returned, but, we'll want to modify callers of TraceEventSession.ProviderNameToGuid to handle this exception properly. For example, it 'GetProviderGuidByName`, we should catch the exception, then do the check to see if the name is actually a GUID, and if it is, return it. Otherwise, re-throw the exception.

As for the exception itself, I think we'll want to create an exception that describes that the call to TdhEnumerateProviders failed with error code (code).

Ultimately, this won't make your unhandled exception go away, but it will tell us when something goes wrong that is outside of our control, and maybe will give us an opportunity to report a bug.

@cincuranet
Copy link
Collaborator

@desdesdes Can you maybe try capture the trace log? The call Trace.WriteLine has the error code in the message. So at least we would see what the number is.

@brianrob If the name is already GUID, can't we skip calling GetProviderGuidByName altogether?

@brianrob
Copy link
Member

brianrob commented Apr 3, 2025

@brianrob If the name is already GUID, can't we skip calling GetProviderGuidByName altogether?

Presumably, yes. But I would not be surprised if we ran into a situation where someone had registered a provider whose name is a GUID but not the same GUID as the ID of the provider. :( That would be the risk of the change, but honestly, I think I'd be OK with that. I think I'm more curious what the transient failure is, but having looked at the code for TdhEnumerateProviders, it does look like things can fail if other subsystems haven't come up yet or are otherwise unavailable.

@desdesdes
Copy link
Author

@cincuranet The issue only occurs in our production environment where this only occur on less then 1% of our >6700 servers and it occurs on different servers every time. So it is very hard for us to capture the trace logs. It is hard for us to change anything because on our production environment and it has to pass the Security department/checks. If you could change the Trace.WriteLine to an exception with the hr in the exception message then we can deploy it as a hotfix and let you know which hr it is.

@cincuranet
Copy link
Collaborator

@desdesdes Is it OK if I give you just the nupkg?

@desdesdes
Copy link
Author

@cincuranet Unfortunately we still use 3.1.9 our production code. The issue is self resolving, the service tries to restart after the NRE an after a few attempts it succeeds, so it is not a critical issue to resolve as a hotfix in production for us. We can not rollout a hotfix to eventtrace 3.1.20 to resolve/test a non-critical issue.

So I can do two things.

  • I can wait the newly released package (eventtrace 3.1.21?) where you change the behavior to in include the hr in the exception. Include that as a normal package upgrade in the next major product cycle. It might take a few months before I can give you any feedback on that.
  • I can also get 3.1.9 locally, change the behavior so to it throws an exception with the hr in this local build. Create a hotfix with this change and push this as a partial fix for the non-critical issue. This will only take a about two weeks to rollout. Would you be interested in that?

Best regards,
Bart

@cincuranet
Copy link
Collaborator

@desdesdes Yes, if you can hotpatch 3.1.9 locally, that would be great. Knowing the hr would help us make more educated decision how to proceed next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants