@carlesarnal
Member

@carlesarnal carlesarnal commented Nov 27, 2025

Summary

This PR migrates Apicurio Registry's protobuf implementation from Square's Wire-Schema library to protobuf4j, a library that runs the official protoc compiler via WebAssembly (WASM) on the JVM.

Key Benefits:

  • 79% dependency size reduction (~19 MB → ~4 MB) by eliminating wire-schema, kotlin-stdlib, and icu4j (icu4j alone accounts for ~14 MB)
  • 73% code reduction in FileDescriptorUtils.java (1,605 → 432 lines)
  • 100% protoc compatibility by using the actual protoc binary compiled to WASM
  • Improved canonicalization with guaranteed byte-for-byte consistency for semantically equivalent schemas
  • Performance optimizations including ZeroFs filesystem pooling and WASM instance caching

Root Cause / Motivation

Wire-Schema presented several challenges:

  1. Heavy dependency tree: Required wire-schema-jvm, okio, kotlin-stdlib, and icu4j (~14 MB alone)
  2. Semantic inconsistencies: Wire's interpretation occasionally differed from official protoc behavior
  3. Canonicalization limitations: Couldn't guarantee exact semantic equivalence without protoc's internal normalization
  4. Maintenance overhead: Required ongoing effort to keep aligned with latest protobuf specification

Changes

Core Migration:

  • Replaced Wire-Schema parsing with protobuf4j's WASM-based protoc compilation
  • Updated ProtobufSchemaLoader, ProtobufSchemaUtils, and FileDescriptorUtils to use protobuf4j APIs
  • Migrated canonicalization to use Protobuf.normalizeSchemaToText() for consistent output
  • Updated syntax validation to use Protobuf.validateSyntax() without dependency resolution
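
To illustrate why canonicalization matters, here is a minimal, JDK-only sketch. It does not call protobuf4j: `toyNormalize` is a hypothetical stand-in that only strips `//` comments and collapses whitespace, which is far weaker than protoc's normalization (it would, for example, mangle a `//` inside a string literal). The class and method names are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Toy illustration only; not the normalization performed by protoc or protobuf4j. */
final class CanonicalHashSketch {

    private CanonicalHashSketch() {}

    /** Strips line comments and collapses all whitespace runs to a single space. */
    static String toyNormalize(String protoText) {
        String withoutComments = protoText.replaceAll("//[^\\n]*", "");
        return withoutComments.replaceAll("\\s+", " ").trim();
    }

    /** Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the text. */
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Two schema texts that differ only in comments and layout hash differently as raw text but identically after normalization, which is the property a canonical form is meant to provide.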

New Utilities:

  • ProtobufWellKnownTypes: Centralized well-known type detection (google/protobuf/, google/type/, etc.)
  • ProtobufCompilationContext: Filesystem pooling and WASM instance caching for improved performance
  • ProtobufSchemaUtils: New utility class replacing Wire-specific helpers
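
As a rough sketch of what centralized well-known type detection could look like, the class below checks an import path against a fixed list of prefixes. The class name, method name, and exact prefix list are assumptions for illustration; only the `google/protobuf/` and `google/type/` prefixes come from the description above, and the actual ProtobufWellKnownTypes implementation may differ.

```java
import java.util.List;

/** Hypothetical sketch; not the PR's ProtobufWellKnownTypes class. */
final class WellKnownTypesSketch {

    // Only the two prefixes named in the PR description; the real list may be longer.
    private static final List<String> PREFIXES =
            List.of("google/protobuf/", "google/type/");

    private WellKnownTypesSketch() {}

    /** Returns true when the import path starts with one of the known prefixes. */
    static boolean isWellKnownImport(String importPath) {
        if (importPath == null) {
            return false;
        }
        for (String prefix : PREFIXES) {
            if (importPath.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the prefix list in one place means a new well-known package needs a single change rather than edits at every call site.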

Serdes Updates:

  • Migrated Kafka, Pulsar, and NATS serializers/deserializers to use FileDescriptor directly
  • Improved caching to avoid repeated WASM cold-starts in hot paths
  • Maintained full backward compatibility with existing clients
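
The caching idea can be sketched with a plain `ConcurrentHashMap`. This is not the serdes code from the PR: the class name is hypothetical, and the `compiler` function stands in for whatever expensive step (such as a WASM-backed protoc invocation) turns schema text into a compiled descriptor.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch of compile-once caching keyed by schema text. */
final class CompiledSchemaCache<V> {

    private final ConcurrentHashMap<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> compiler;
    private final AtomicInteger compilations = new AtomicInteger();

    CompiledSchemaCache(Function<String, V> compiler) {
        this.compiler = compiler;
    }

    /** Compiles the schema on first use and returns the cached result afterwards. */
    V get(String schemaText) {
        return cache.computeIfAbsent(schemaText, text -> {
            compilations.incrementAndGet();
            return compiler.apply(text);
        });
    }

    /** Number of times the underlying compiler was actually invoked. */
    int compilationCount() {
        return compilations.get();
    }
}
```

`computeIfAbsent` runs the mapping function at most once per key, so concurrent serializer threads asking for the same schema share a single compilation.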

Deleted Files:

  • DynamicSchema.java (427 lines)
  • EnumDefinition.java (69 lines)
  • MessageDefinition.java (206 lines)
  • ProtobufMessage.java (147 lines)

Testing Infrastructure:

  • Added ProtobufBackwardCompatibilityIT - shades old wire-schema serializer (v3.1.2) to verify interoperability
  • Added ProtobufSerdesPerformanceIT - compares new vs old vs Confluent serializer performance
  • Added ProtobufParsingBenchmark for compilation performance profiling
  • Comprehensive unit tests for new utility classes (39 tests for well-known types)

Test Plan

  • All 144 protobuf unit tests pass (98 in protobuf-schema-utilities + 46 in schema-util/protobuf)
  • ProtobufBackwardCompatibilityIT: Verifies messages serialized with old wire-schema serializer can be deserialized by new protobuf4j serializer and vice versa
  • ProtobufSerdesPerformanceIT: Confirms no performance regression in serialization/deserialization hot paths
  • ContentCanonicalizerTest: Verifies canonical form consistency
  • Integration tests with all storage variants (H2, PostgreSQL, KafkaSQL)
  • Native image build verification (see the "fix native image build" commit)

Migration Considerations

For Existing Users:

  • Server-first upgrade recommended: Upgrade registry server before clients
  • Content ID stability: Canonical forms may produce different content IDs for edge cases; existing schemas remain accessible
  • Client compatibility: Old clients continue to work with new server; gradual client upgrade supported

Breaking Changes:

  • Removed Wire-Schema AST types from public API (ProtoFileElement, MessageElement, etc.)
  • Code using these types should migrate to FileDescriptor-based APIs

@carlesarnal
Member Author

This work is complete except for the migration path. The one failing test checks that the old serializer and the new one both produce the same contentId.

@carlesarnal carlesarnal force-pushed the protobuf4j-migration branch 8 times, most recently from 6c2ff4a to 45bd3f3 on December 15, 2025 09:22
  Introduce compilation context pooling and centralize well-known type
  detection to reduce code duplication and improve performance through
  resource reuse.

  Key changes:
  - Add ProtobufCompilationContext with pooled filesystem and WASM instances
  - Create ProtobufWellKnownTypes utility to centralize type detection logic
  - Add toNormalizedProtoText() and hasOriginalProtoText() to ProtobufSchema
  - Implement rewriteReferences() in ProtobufDereferencer for import rewriting
  - Add pre-parsed schema support to ProtobufCompatibilityChecker
  - Simplify validateSyntaxOnly() to use pooled compilation context
  - Add extractProtocError() for enhanced error messages

  This reduces code duplication from 6+ locations to 1 utility class and
  enables WASM instance reuse across compilations.

  Note: The pooling implementation has known issues for long-running servers
  that are documented in docs/protobuf4j-improvements-plan.md and will be
  addressed in follow-up work (shutdown hooks, idle eviction, configurable
  pooling mode for server vs serdes use cases).
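
For readers unfamiliar with the pooling pattern mentioned in this commit message, a minimal bounded pool might look like the sketch below. It is not the PR's ProtobufCompilationContext: the class name, the `factory` and `disposer` parameters, and the bounded-queue policy are all assumptions. It has no idle eviction; the `close()` method only shows the kind of cleanup a shutdown hook would need to trigger.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical bounded pool; idle eviction is deliberately left out. */
final class ResourcePoolSketch<T> implements AutoCloseable {

    private final BlockingQueue<T> idle;
    private final Supplier<T> factory;
    private final Consumer<T> disposer;

    ResourcePoolSketch(int maxIdle, Supplier<T> factory, Consumer<T> disposer) {
        this.idle = new ArrayBlockingQueue<>(maxIdle);
        this.factory = factory;
        this.disposer = disposer;
    }

    /** Reuses an idle resource when one is available, otherwise creates a new one. */
    T borrow() {
        T pooled = idle.poll();
        return pooled != null ? pooled : factory.get();
    }

    /** Returns a resource to the pool, disposing of it if the pool is already full. */
    void release(T resource) {
        if (!idle.offer(resource)) {
            disposer.accept(resource);
        }
    }

    /** Disposes of every idle resource; intended to be called at shutdown. */
    @Override
    public void close() {
        T resource;
        while ((resource = idle.poll()) != null) {
            disposer.accept(resource);
        }
    }
}
```

A long-running server would typically register `close()` with a shutdown hook, while a short-lived serdes client might skip pooling altogether, which is the trade-off the note above defers to follow-up work.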

Development

Successfully merging this pull request may close these issues.

Proof of Concept: Replace wire-schema library with grpc-zero
