[Bug]: BigQueryIO write inconsistent implicit int->long conversion for nullable long and for different write methods #36735

@Abacn

Description

What happened?

  • Writing an int to a target field of type int64 succeeds.

  • Writing an int to a target field of type nullable(int64) using storage_write_api succeeds.

  • Writing an int to a target field of type nullable(int64) using file_load with Avro fails with:

Caused by: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 123 (field=nullableLong)

A minimal reproduction (not using Beam):

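import java.io.File;
import java.io.IOException;
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroTest {
  // Record schema with one required long field and one nullable long field.
  private static final String SCHEMA_JSON = "{\n" +
      "  \"type\": \"record\",\n" +
      "  \"name\": \"UserEvent\",\n" +
      "  \"namespace\": \"com.example.avro\",\n" +
      "  \"fields\": [\n" +
      "    {\"name\": \"userId\", \"type\": \"string\"},\n" +
      "    {\"name\": \"nonNullLong\", \"type\": \"long\"},\n" +
      "    {\"name\": \"nullableLong\", \"type\": [\"null\", \"long\"], \"default\": null}\n" +
      "  ]\n" +
      "}";

  public static void main(String[] argv) throws AvroRuntimeException, IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    GenericRecord eventWithTimestamp = new GenericData.Record(schema);
    eventWithTimestamp.put("userId", "user-123");
    // An Integer is accepted for the plain "long" field.
    eventWithTimestamp.put("nonNullLong", 123);
    // An Integer is rejected for the ["null","long"] union when the record is appended below.
    eventWithTimestamp.put("nullableLong", 123); // fail

    File avroOutputFile = new File("user-events.avro");
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
      dataFileWriter.create(schema, avroOutputFile);
      dataFileWriter.append(eventWithTimestamp); // throws UnresolvedUnionException
    }
  }
}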

This is a known Avro issue: https://stackoverflow.com/questions/35963285/org-apache-avro-unresolvedunionexception-not-in-union-long-null
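
A possible caller-side workaround, sketched under the assumption that row values can be adjusted before they are placed into the Avro record: widen integral values to `Long` explicitly, so the runtime type of the value matches the `long` branch of the union. `LongWidening`, `widen`, and `widenFields` are hypothetical names used for illustration; they are not Beam or Avro APIs, and this is not necessarily how a fix in BigQueryIO would look.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Beam or Avro.
final class LongWidening {
  private LongWidening() {}

  /**
   * Returns a Long for Integer, Short, and Byte inputs.
   * Every other value (including null and values that are already Long) is returned unchanged.
   */
  static Object widen(Object value) {
    if (value instanceof Integer || value instanceof Short || value instanceof Byte) {
      return ((Number) value).longValue();
    }
    return value;
  }

  /**
   * Returns a copy of the row in which the named fields have been widened.
   * The input map is not modified; fields absent from the row are skipped.
   */
  static Map<String, Object> widenFields(Map<String, Object> row, String... longFields) {
    Map<String, Object> out = new LinkedHashMap<>(row);
    for (String field : longFields) {
      if (out.containsKey(field)) {
        out.put(field, widen(out.get(field)));
      }
    }
    return out;
  }
}
```

In the reproduction above, passing `LongWidening.widen(123)` (or simply the literal `123L`) as the `nullableLong` value gives the writer a `Long`, which should match the `long` branch of the union.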

However, this caused a breaking change in Beam YAML 2.69.0, which switched the batch BigQueryIO write from storage_write_api to Managed IO (backed by file_load).

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
