-
Notifications
You must be signed in to change notification settings - Fork 276
Improve error reporting / debugging UX with the OTLP default/HTTP exporters #1936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Improve error reporting / debugging UX with the OTLP default/HTTP exporters #1936
Conversation
3096a1c to
0dbb1a8
Compare
|
Random passerby here ~ just want to say thank you @chen-anders! I am knee-deep debugging errors between my ruby app and my OTLP collector and the improvements in this PR would vastly help my efforts. |
kaylareopelle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR! This is a problem I've run into myself and I'm glad to see work toward improving the situation.
I'm a little worried about the cost of adding all of these log messages to our existing exporters. I believe the previous approach was taken for performance reasons. We may need to find something in the middle of your current design and the old system to craft a solution.
One small adjustment in the name of performance could be to update all log messages with interpolation or other method calls to be passed in blocks rather than strings. This will delay evaluation of the strings until the message is logged, rather than running the interpolation regardless of log level. See this post for details.
I need to think about this a little more and can do so next week. Just wanted to let you know we're taking a look.
robertlaurin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm blocking this change based on the amount of duplicate logs alone.
| OpenTelemetry.logger.debug("BatchSpanProcessor#export_batch: exporter=#{@exporter.class.name}") | ||
| OpenTelemetry.logger.debug("BatchSpanProcessor#export_batch: Exporting batch of #{batch.size} spans with timeout #{timeout}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there's a lot of instances where you're emitting two lines when it could be a single log line invocation, string allocation.
Directly after these logs are emitted, when the export function is called this log line is emitted with the exact same information.
OpenTelemetry.logger.debug("OTLP::HTTP::TraceExporter#export: Called with #{span_data&.size || 0} spans, timeout=#{timeout.inspect}")| return OpenTelemetry::SDK::Trace::Export.failure(message: 'send_bytes called with nil bytes') if bytes.nil? | ||
|
|
||
| @metrics_reporter.record_value('otel.otlp_exporter.message.uncompressed_size', value: bytes.bytesize) | ||
| OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Uncompressed size=#{bytes.bytesize} bytes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information is already being reported in the line above.
| request.add_field('Content-Encoding', 'gzip') | ||
| body = Zlib.gzip(bytes) | ||
| @metrics_reporter.record_value('otel.otlp_exporter.message.compressed_size', value: body.bytesize) | ||
| OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Compressed size=#{body.bytesize} bytes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reported in line above
| OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Compressed size=#{body.bytesize} bytes") | ||
| else | ||
| body = bytes | ||
| OpenTelemetry.logger.debug('OTLP::Exporter#send_bytes: No compression applied') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the logger need to be invoked every single time we export without compression? It is configured during initialization and isn't mutable.
|
|
||
| case response | ||
| when Net::HTTPOK | ||
| OpenTelemetry.logger.debug('OTLP::Exporter#send_bytes: SUCCESS - HTTP 200 OK') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few lines above this you log
OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Received response code=#{response.code}, message=#{response.message}")Then on this line you log the response code, and in most of the subsequent logs you log the response code again.
fbogsany
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of unnecessary code here. While this may be valuable to add during a debugging session, it does not belong in production code. I've provided some concrete feedback, but there is a lot more to address as well.
| def initialize(spans) | ||
| super("Unable to export #{spans.size} spans") | ||
| @spans = spans | ||
| @error = error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does nothing AFAICT. The only error accessible at this point is the method created by attr_reader and it'll return @error, which is nil.
| # | ||
| # @return [Array<OpenTelemetry::SDK::Trace::Span>] | ||
| attr_reader :spans | ||
| attr_reader :spans, :error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume error is intended to hold a wrapped error of some sort. This is already exposed by StandardError#cause, which will be populated automatically in cases like:
rescue FooError
raise ExportError| # Factory method for creating a success result | ||
| # @return [ExportResult] | ||
| def self.success | ||
| ExportResult.new(SUCCESS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unnecessary allocation on every successful export call.
| case other | ||
| when Integer | ||
| @code == other | ||
| when ExportResult | ||
| @code == other.code | ||
| else | ||
| super | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is sufficient:
| case other | |
| when Integer | |
| @code == other | |
| when ExportResult | |
| @code == other.code | |
| else | |
| super | |
| end | |
| other.to_i == @code |
| # Log detailed error information if available | ||
| if error | ||
| OpenTelemetry.logger.error("BatchSpanProcessor: export failed due to #{error.class}: #{error.message}") | ||
| elsif message | ||
| OpenTelemetry.logger.error("BatchSpanProcessor: export failed: #{message}") | ||
| else | ||
| OpenTelemetry.logger.error('BatchSpanProcessor: export failed (no error details available)') | ||
| OpenTelemetry.logger.error("BatchSpanProcessor: call stack:\n#{caller.join("\n")}") | ||
| end | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this can be implemented effectively in a custom error handler (i.e. via the OpenTelemetry.handle_error call below).
addresses: #1931
This PR significantly enhances the debugging experience for OTLP exporters by:
These changes ended up helping me debug a really gnarly issue where a slightly old version of the sentry-ruby SDK was causing issues with how the OpenTelemetry ruby SDK was bubbling up errors due to incorrect IPv6 parsing - causing all my traces to be dropped with an one-line error
Unable to export X spans.Reviewer's Note
Significant AI assistance was used in the process of getting this PR working.
Motivation
Previously, when OTLP exports failed, developers had minimal information to diagnose the root cause. The exporters simply returned a
FAILUREconstant without any context about:This made troubleshooting production issues extremely difficult, especially for:
Changes
1. Enhanced Export Result Type (
sdk/lib/opentelemetry/sdk/trace/export.rb)Introduced a new
ExportResultclass that wraps result codes with optional error context:Backwards Compatibility: The
ExportResultclass overloads the==operator and providesto_ito ensure existing code comparing results toSUCCESS,FAILURE, orTIMEOUTconstants continues to work seamlessly.2. Comprehensive Debug Logging
Added detailed debug-level logging at key points in the export pipeline:
Entry/Exit Points
HTTP Request Flow
Exception Handling
3. Rich Failure Context
All failure scenarios now return detailed context via
Export.failure():HTTP Error Responses
Network Exceptions
Timeout Scenarios
4. Enhanced BatchSpanProcessor Error Reporting
Updated
BatchSpanProcessorto extract and log error context:5. Updated Exporters
Applied consistent changes to both:
exporter/otlp/lib/opentelemetry/exporter/otlp/exporter.rb)exporter/otlp-http/lib/opentelemetry/exporter/otlp/http/trace_exporter.rb)Both now capture exception objects and maintain the error context through the entire export pipeline.
Example Scenarios
Before
After (with debug logging enabled)