
Conversation

@chen-anders chen-anders commented Oct 12, 2025

addresses: #1931

This PR significantly enhances the debugging experience for OTLP exporters by:

  1. Adding rich context to export failure results
  2. Introducing comprehensive debug-level logging throughout the export pipeline
  3. Maintaining full backwards compatibility with existing exporter implementations

These changes already helped me debug a really gnarly issue where a slightly outdated version of the sentry-ruby SDK was interfering with how the OpenTelemetry Ruby SDK bubbled up errors (incorrect IPv6 parsing), causing all my traces to be dropped with a one-line error: Unable to export X spans.

Reviewer's Note

Significant AI assistance was used in the process of getting this PR working.

Motivation

Previously, when OTLP exports failed, developers had minimal information to diagnose the root cause. The exporters simply returned a FAILURE constant without any context about:

  • What type of error occurred
  • HTTP response codes and messages
  • Response bodies from the collector
  • Retry attempts and their outcomes
  • Exception details

This made troubleshooting production issues extremely difficult, especially for:

  • Network connectivity problems
  • SSL/TLS certificate issues
  • Collector endpoint configuration errors
  • HTTP timeout scenarios
  • Server-side errors (4xx/5xx responses)

Changes

1. Enhanced Export Result Type (sdk/lib/opentelemetry/sdk/trace/export.rb)

Introduced a new ExportResult class that wraps result codes with optional error context:

class ExportResult
  attr_reader :code, :error, :message

  # Factory methods
  def self.success
  def self.failure(error: nil, message: nil)
  def self.timeout
end

Backwards Compatibility: The ExportResult class overloads the == operator and provides to_i to ensure existing code comparing results to SUCCESS, FAILURE, or TIMEOUT constants continues to work seamlessly.
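
For illustration, a minimal sketch of that compatibility shim (method bodies here are illustrative; the actual implementation lives in export.rb):

class ExportResult
  attr_reader :code, :error, :message

  def initialize(code, error: nil, message: nil)
    @code = code
    @error = error
    @message = message
  end

  # Comparing against the integer constants (SUCCESS, FAILURE, TIMEOUT) or
  # against another ExportResult keeps existing call sites working unchanged.
  def ==(other)
    case other
    when Integer then @code == other
    when ExportResult then @code == other.code
    else super
    end
  end

  # Callers that expect an integer result code can still get one.
  def to_i
    @code
  end
end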

2. Comprehensive Debug Logging

Added detailed debug-level logging at key points in the export pipeline:

Entry/Exit Points

  • Function entry with parameters (span count, timeout values)
  • Function exit with return values
  • Byte sizes (compressed vs uncompressed)

HTTP Request Flow

  • Request preparation and compression
  • Timeout calculations and retry counts
  • HTTP response codes and messages
  • Response bodies for error cases

Exception Handling

  • Exception type and message for all caught exceptions
  • Retry attempt tracking
  • Max retry exceeded scenarios

3. Rich Failure Context

All failure scenarios now return detailed context via Export.failure():

HTTP Error Responses

OpenTelemetry::SDK::Trace::Export.failure(
  message: "export failed with HTTP #{response.code} (#{response.message}) after #{retry_count} retries: #{body}"
)

Network Exceptions

OpenTelemetry::SDK::Trace::Export.failure(
  error: e,
  message: "export failed due to SocketError after #{retry_count} retries: #{e.message}"
)

Timeout Scenarios

OpenTelemetry::SDK::Trace::Export.failure(
  message: 'timeout exceeded before sending request'
)

4. Enhanced BatchSpanProcessor Error Reporting

Updated BatchSpanProcessor to extract and log error context:

def report_result(result_code, span_array, error: nil, message: nil)
  if result_code == SUCCESS
    # ... metrics ...
  else
    error_message = if error
                      "BatchSpanProcessor: export failed due to #{error.class}: #{error.message}"
                    elsif message
                      "BatchSpanProcessor: export failed: #{message}"
                    else
                      "BatchSpanProcessor: export failed (no error details available)\nCall stack: #{caller.join("\n")}"
                    end

    OpenTelemetry.handle_error(exception: ExportError.new(span_array), message: error_message)
  end
end

5. Updated Exporters

Applied consistent changes to both:

  • OTLP default Exporter (exporter/otlp/lib/opentelemetry/exporter/otlp/exporter.rb)
  • OTLP HTTP Exporter (exporter/otlp-http/lib/opentelemetry/exporter/otlp/http/trace_exporter.rb)

Both now capture exception objects and maintain the error context through the entire export pipeline.

Example Scenarios

Before

ERROR -- : OpenTelemetry error: Unable to export 10 spans

After (with debug logging enabled)

DEBUG -- : OTLP::Exporter#export: Called with 10 spans, timeout=30.0
DEBUG -- : OTLP::Exporter#export: Calling encode for 10 spans
DEBUG -- : OTLP::Exporter#send_bytes: Sending HTTP request
DEBUG -- : OTLP::Exporter#send_bytes: Caught SocketError: Connection refused, retry_count=1
DEBUG -- : OTLP::Exporter#send_bytes: Max retries exceeded for SocketError
ERROR -- : BatchSpanProcessor: export failed due to SocketError: Connection refused - connect(2) for "localhost" port 4318
ERROR -- : OpenTelemetry error: Unable to export 10 spans
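
For context, debug logging can be enabled either by setting OTEL_LOG_LEVEL=debug before the SDK is configured (honored by recent SDK versions) or by assigning a DEBUG-level logger directly; a minimal sketch under those assumptions:

require 'logger'
require 'opentelemetry/sdk'

# Option 1: environment variable, read when the SDK is configured
# (assumes an SDK version that honors OTEL_LOG_LEVEL).
ENV['OTEL_LOG_LEVEL'] = 'debug'
OpenTelemetry::SDK.configure

# Option 2: assign a DEBUG-level logger explicitly.
OpenTelemetry.logger = Logger.new($stdout, level: Logger::DEBUG)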

@chen-anders chen-anders force-pushed the anders/improve-debugging-ux branch from 3096a1c to 0dbb1a8 on October 12, 2025 13:38
@tkling tkling commented Oct 16, 2025

Random passerby here ~ just want to say thank you @chen-anders! I am knee-deep in debugging errors between my Ruby app and my OTLP collector, and the improvements in this PR would vastly help my efforts.

Contributor

@kaylareopelle kaylareopelle left a comment

Thanks for opening this PR! This is a problem I've run into myself and I'm glad to see work toward improving the situation.

I'm a little worried about the cost of adding all of these log messages to our existing exporters. I believe the previous approach was taken for performance reasons. We may need to find a middle ground between your current design and the old system to craft a solution.


One small adjustment in the name of performance would be to pass any log message that uses interpolation or other method calls as a block rather than a string. This delays evaluation of the string until the message is actually logged, rather than running the interpolation regardless of log level. See this post for details.
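
As a quick standalone illustration of the suggestion (not code from this PR), the block form defers interpolation until the logger knows the message will be emitted:

require 'logger'

logger = Logger.new($stdout, level: Logger::INFO)
batch  = Array.new(10)

# Eager: the interpolated string is allocated even though DEBUG is filtered out.
logger.debug("Exporting batch of #{batch.size} spans")

# Lazy: the block only runs when a DEBUG message would actually be emitted.
logger.debug { "Exporting batch of #{batch.size} spans" }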

I need to think about this a little more and can do so next week. Just wanted to let you know we're taking a look.

Contributor

@robertlaurin robertlaurin left a comment

I'm blocking this change based on the amount of duplicate logs alone.

Comment on lines 189 to 190
OpenTelemetry.logger.debug("BatchSpanProcessor#export_batch: exporter=#{@exporter.class.name}")
OpenTelemetry.logger.debug("BatchSpanProcessor#export_batch: Exporting batch of #{batch.size} spans with timeout #{timeout}")
Contributor

So there are a lot of instances where you're emitting two lines when a single log invocation (and a single string allocation) would do.

Directly after these logs are emitted, when the export function is called, this log line is emitted with the exact same information:

 OpenTelemetry.logger.debug("OTLP::HTTP::TraceExporter#export: Called with #{span_data&.size || 0} spans, timeout=#{timeout.inspect}")

return OpenTelemetry::SDK::Trace::Export.failure(message: 'send_bytes called with nil bytes') if bytes.nil?

@metrics_reporter.record_value('otel.otlp_exporter.message.uncompressed_size', value: bytes.bytesize)
OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Uncompressed size=#{bytes.bytesize} bytes")
Contributor

This information is already being reported in the line above.

request.add_field('Content-Encoding', 'gzip')
body = Zlib.gzip(bytes)
@metrics_reporter.record_value('otel.otlp_exporter.message.compressed_size', value: body.bytesize)
OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Compressed size=#{body.bytesize} bytes")
Contributor

Reported in line above

OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Compressed size=#{body.bytesize} bytes")
else
body = bytes
OpenTelemetry.logger.debug('OTLP::Exporter#send_bytes: No compression applied')
Contributor

@robertlaurin robertlaurin Oct 23, 2025

Does the logger need to be invoked every single time we export without compression? It is configured during initialization and isn't mutable.


case response
when Net::HTTPOK
OpenTelemetry.logger.debug('OTLP::Exporter#send_bytes: SUCCESS - HTTP 200 OK')
Contributor

A few lines above this you log

OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Received response code=#{response.code}, message=#{response.message}")

Then on this line you log the response code, and in most of the subsequent logs you log the response code again.

Contributor

@fbogsany fbogsany left a comment

There's a lot of unnecessary code here. While this may be valuable to add during a debugging session, it does not belong in production code. I've provided some concrete feedback, but there is a lot more to address as well.

def initialize(spans)
super("Unable to export #{spans.size} spans")
@spans = spans
@error = error
Contributor

This does nothing AFAICT. The only error accessible at this point is the method created by attr_reader and it'll return @error, which is nil.

#
# @return [Array<OpenTelemetry::SDK::Trace::Span>]
attr_reader :spans
attr_reader :spans, :error
Contributor

I assume error is intended to hold a wrapped error of some sort. This is already exposed by StandardError#cause, which will be populated automatically in cases like:

rescue FooError
  raise ExportError
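
As a standalone illustration (FooError and ExportError are defined here only for the example), Ruby populates #cause automatically when an exception is raised from inside a rescue block:

class FooError < StandardError; end
class ExportError < StandardError; end

begin
  begin
    raise FooError, 'Connection refused - connect(2)'
  rescue FooError
    # Re-raising inside the rescue sets ExportError#cause automatically.
    raise ExportError, 'Unable to export 10 spans'
  end
rescue ExportError => e
  e.message # => "Unable to export 10 spans"
  e.cause   # => #<FooError: Connection refused - connect(2)>
end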

# Factory method for creating a success result
# @return [ExportResult]
def self.success
ExportResult.new(SUCCESS)
Contributor

This is an unnecessary allocation on every successful export call.
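
If the wrapper type is kept, one way to avoid the per-call allocation (just one option, not something in the diff) would be to reuse a frozen instance for the no-context case:

# Hypothetical tweak: a single frozen result shared by all successful exports.
SUCCESS_RESULT = ExportResult.new(SUCCESS).freeze

def self.success
  SUCCESS_RESULT
end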

Comment on lines 46 to 53
case other
when Integer
  @code == other
when ExportResult
  @code == other.code
else
  super
end
Contributor

I think this is sufficient:

Suggested change:

other.to_i == @code

Comment on lines 216 to 225
# Log detailed error information if available
if error
  OpenTelemetry.logger.error("BatchSpanProcessor: export failed due to #{error.class}: #{error.message}")
elsif message
  OpenTelemetry.logger.error("BatchSpanProcessor: export failed: #{message}")
else
  OpenTelemetry.logger.error('BatchSpanProcessor: export failed (no error details available)')
  OpenTelemetry.logger.error("BatchSpanProcessor: call stack:\n#{caller.join("\n")}")
end
end

Contributor

All of this can be implemented effectively in a custom error handler (i.e. via the OpenTelemetry.handle_error call below).
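
For example, assuming OpenTelemetry.error_handler is assignable and receives the same exception:/message: keywords that OpenTelemetry.handle_error forwards, a custom handler could surface the details globally (sketch only):

# Hypothetical custom error handler; handle_error delegates to whatever
# callable is assigned here.
OpenTelemetry.error_handler = lambda do |exception: nil, message: nil|
  OpenTelemetry.logger.error("export error: #{message}") if message
  OpenTelemetry.logger.error("#{exception.class}: #{exception.message}") if exception
  OpenTelemetry.logger.error(exception.backtrace.join("\n")) if exception&.backtrace
end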
