NRE when calling TraceEventProviders.GetProviderGuidByName #2177

desdesdes · 2025-03-27T12:16:04Z

Then TraceEventSession.ProviderNameToGuid is called code assumes this never results in null. See TraceEventProviders.GetProviderGuidByName. Then TraceEventNativeMethods.TdhEnumerateProviders results return a hr != 0 or providersDesc == null then Trace.WriteLine will be called and the function returns null. This pull request changes the behavior to return an incomplete list instead of crashing with a NRE. A point can me made to change the Trace.WriteLine to a throw new Exception(.

cincuranet · 2025-03-31T09:03:30Z

I think the most common way TdhEnumerateProviders would fail is when the account does not have permissions. I assume you found this when using TraceEvent package directly, right?

My gut feeling is telling me that throwing an exception would be more correct solution to this. But it is also very rough. Looking into the code around, I don't see a pattern we use for such cases. I'll defer to @brianrob for final call about this.

brianrob · 2025-03-31T19:42:00Z

@desdesdes do you have a scenario where you're seeing this fail? I've seen a couple of reports, but haven't been able to get more information on what is actually happening.

desdesdes · 2025-04-01T08:57:03Z

@desdesdes do you have a scenario where you're seeing this fail? I've seen a couple of reports, but haven't been able to get more information on what is actually happening.

We have an application which runs as a windows service. This service sometimes seems to fail to startup while booting windows. The windows infrastructure automatically retries and after a some retries it succeeds to start. We have a few thousand servers running and we only experience this issue on some servers. We have not found a pattern yet.

We pinpointed the problem to this location in code. It could be that the TdhEnumerateProviders is returning 122 and TdhEnumerateProviders should be called in a loop. It could also be that TdhEnumerateProviders is dependent on another windows service or another part of windows which is still initializing, but I don't know.

It would be nice if you throw and exception to add the HR to the exception so we can analyse this better.

brianrob · 2025-04-02T01:18:47Z

Thank you for clarifying the scenario. Do you have a callstack of when this happens?

desdesdes · 2025-04-02T09:12:25Z

Thank you for clarifying the scenario. Do you have a callstack of when this happens?

Hope this helps.

Application: ProfitPerformanceMonitor.exe
CoreCLR Version: 8.0.1425.11118
.NET Version: 8.0.14

Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Diagnostics.Tracing.Session.TraceEventProviders.GetProviderGuidByName(String name)
at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.EnableProvider(String providerName, TraceEventLevel providerLevel, UInt64 matchAnyKeywords, TraceEventProviderOptions options)
at Afas.Profit.ProfitTraceWriter.TraceWatcher..ctor(String traceSessionName, UInt64 flags, TraceEventLevel level) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Trace Writer\ProfitTraceWriter\TraceWatcher.cs:line 26
at ProfitPerformanceMonitor.Engine.Start() in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Engine.cs:line 83
at ProfitPerformanceMonitor.Worker.ExecuteAsync(CancellationToken stoppingToken) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Worker.cs:line 15
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>g__LogAndRethrow|15_3(<>c__DisplayClass15_0&)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Run(IHost host)
at ProfitPerformanceMonitor.Program.Main(String[] args) in C:\Dev\profit-release\08\src\Anta\Projects\Tools Projects\Profit Perf Mon\ProfitPerformanceMonitor\Program.cs:line 38

brianrob · 2025-04-02T16:46:34Z

Yes, this is exactly what I am looking for. I think this is worth fixing, but I'm afraid that the fix is more complicated than this. I think that @cincuranet is right that we should be throwing an exception. The documentation states that there are only two error codes that should come back from the call to TdhEnumerateProviders, but I suspect that you're right - there might be something else happening under the covers and a different error code is returned.

A more appropriate fix will be a bit more involved, as I think we'll want to throw an exception when an unexpected error code is returned, but, we'll want to modify callers of TraceEventSession.ProviderNameToGuid to handle this exception properly. For example, it 'GetProviderGuidByName`, we should catch the exception, then do the check to see if the name is actually a GUID, and if it is, return it. Otherwise, re-throw the exception.

As for the exception itself, I think we'll want to create an exception that describes that the call to TdhEnumerateProviders failed with error code (code).

Ultimately, this won't make your unhandled exception go away, but it will tell us when something goes wrong that is outside of our control, and maybe will give us an opportunity to report a bug.

cincuranet · 2025-04-03T11:14:34Z

@desdesdes Can you maybe try capture the trace log? The call Trace.WriteLine has the error code in the message. So at least we would see what the number is.

@brianrob If the name is already GUID, can't we skip calling GetProviderGuidByName altogether?

brianrob · 2025-04-03T17:02:21Z

@brianrob If the name is already GUID, can't we skip calling GetProviderGuidByName altogether?

Presumably, yes. But I would not be surprised if we ran into a situation where someone had registered a provider whose name is a GUID but not the same GUID as the ID of the provider. :( That would be the risk of the change, but honestly, I think I'd be OK with that. I think I'm more curious what the transient failure is, but having looked at the code for TdhEnumerateProviders, it does look like things can fail if other subsystems haven't come up yet or are otherwise unavailable.

desdesdes · 2025-04-14T12:49:10Z

@cincuranet The issue only occurs in our production environment where this only occur on less then 1% of our >6700 servers and it occurs on different servers every time. So it is very hard for us to capture the trace logs. It is hard for us to change anything because on our production environment and it has to pass the Security department/checks. If you could change the Trace.WriteLine to an exception with the hr in the exception message then we can deploy it as a hotfix and let you know which hr it is.

cincuranet · 2025-04-15T08:35:56Z

@desdesdes Is it OK if I give you just the nupkg?

desdesdes · 2025-04-17T07:52:14Z

@cincuranet Unfortunately we still use 3.1.9 our production code. The issue is self resolving, the service tries to restart after the NRE an after a few attempts it succeeds, so it is not a critical issue to resolve as a hotfix in production for us. We can not rollout a hotfix to eventtrace 3.1.20 to resolve/test a non-critical issue.

So I can do two things.

I can wait the newly released package (eventtrace 3.1.21?) where you change the behavior to in include the hr in the exception. Include that as a normal package upgrade in the next major product cycle. It might take a few months before I can give you any feedback on that.
I can also get 3.1.9 locally, change the behavior so to it throws an exception with the hr in this local build. Create a hotfix with this change and push this as a partial fix for the non-critical issue. This will only take a about two weeks to rollout. Would you be interested in that?

Best regards,
Bart

cincuranet · 2025-04-25T10:13:00Z

@desdesdes Yes, if you can hotpatch 3.1.9 locally, that would be great. Knowing the hr would help us make more educated decision how to proceed next.

…g at the hr instead of causing an NRE which masks the exception.

desdesdes · 2025-05-27T10:23:13Z

@cincuranet I updated the code and pushed it out, but by the time the hotfix was rolled out system management had already mitigated the NRE by not auto-starting the tracing service and starting it after 3 minuten. They are not willing to rollback the change in production, because it solved their issue. They have rolled it out to a few testing clusters, but these are relatively small (10 machines per cluster) and not very busy so I have not been able to reproduce the issue there.

I have updated the pull request so now it throws an exception and i am hoping you merge the change so the hr is actually written part of the throw exception instead of an NRE.

cincuranet · 2025-06-11T12:47:59Z

Merging. Thanks.

desdesdes requested a review from brianrob as a code owner March 27, 2025 12:16

desdesdes requested review from cincuranet, leculver, mconnew and marklio as code owners May 27, 2025 10:12

Throw exception then TdhEnumerateProviders fails, which allows lookin…

d6156c0

…g at the hr instead of causing an NRE which masks the exception.

desdesdes force-pushed the fix-NRE-EnableProvider branch from 6138f80 to d6156c0 Compare May 27, 2025 10:20

cincuranet approved these changes Jun 11, 2025

View reviewed changes

cincuranet merged commit ee7ef76 into microsoft:main Jun 11, 2025
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

NRE when calling TraceEventProviders.GetProviderGuidByName #2177

NRE when calling TraceEventProviders.GetProviderGuidByName #2177

Uh oh!

desdesdes commented Mar 27, 2025

Uh oh!

cincuranet commented Mar 31, 2025

Uh oh!

brianrob commented Mar 31, 2025

Uh oh!

desdesdes commented Apr 1, 2025

Uh oh!

brianrob commented Apr 2, 2025

Uh oh!

desdesdes commented Apr 2, 2025 •

edited

Loading

Uh oh!

brianrob commented Apr 2, 2025 •

edited

Loading

Uh oh!

cincuranet commented Apr 3, 2025

Uh oh!

brianrob commented Apr 3, 2025 •

edited

Loading

Uh oh!

desdesdes commented Apr 14, 2025

Uh oh!

cincuranet commented Apr 15, 2025

Uh oh!

desdesdes commented Apr 17, 2025

Uh oh!

cincuranet commented Apr 25, 2025

Uh oh!

desdesdes commented May 27, 2025

Uh oh!

cincuranet commented Jun 11, 2025

Uh oh!

Uh oh!

Uh oh!

NRE when calling TraceEventProviders.GetProviderGuidByName #2177

NRE when calling TraceEventProviders.GetProviderGuidByName #2177

Uh oh!

Conversation

desdesdes commented Mar 27, 2025

Uh oh!

cincuranet commented Mar 31, 2025

Uh oh!

brianrob commented Mar 31, 2025

Uh oh!

desdesdes commented Apr 1, 2025

Uh oh!

brianrob commented Apr 2, 2025

Uh oh!

desdesdes commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brianrob commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cincuranet commented Apr 3, 2025

Uh oh!

brianrob commented Apr 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

desdesdes commented Apr 14, 2025

Uh oh!

cincuranet commented Apr 15, 2025

Uh oh!

desdesdes commented Apr 17, 2025

Uh oh!

cincuranet commented Apr 25, 2025

Uh oh!

desdesdes commented May 27, 2025

Uh oh!

cincuranet commented Jun 11, 2025

Uh oh!

Uh oh!

Uh oh!

desdesdes commented Apr 2, 2025 •

edited

Loading

brianrob commented Apr 2, 2025 •

edited

Loading

brianrob commented Apr 3, 2025 •

edited

Loading