Writes Fail due to Column Mismatch

When writing out new ConceptMaps to an existing table, there can be an exception on column data-type mismatches:

```
An error occurred while calling o214.writeToDatabase.
: org.apache.spark.sql.AnalysisException: cannot resolve '`useContext`' due to data type mismatch: cannot cast array<struct<id:string,code:struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>,valueQuantity:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>,valueRange:struct<id:string,low:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>,high:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>>,valueCodeableConcept:struct<id:string,coding:array<struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>>,text:string>>> to array<struct<id:string,code:struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>,valueCodeableConcept:struct<id:string,coding:array<struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>>,text:string>,valueQuantity:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>,valueRange:struct<id:string,low:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>,high:struct<id:string,value:decimal(12,4),comparator:string,unit:string,system:string,code:string>>>>;;
'InsertIntoHadoopFsRelationCommand location/warehouse/ontologies.db/conceptmaps, false, [timestamp#5475], Parquet, Map(serialization.format -> 1, path -> location/warehouse/ontologies.db/conceptmaps), Append, CatalogTable(
Database: ontologies
Table: conceptmaps
Owner: hadoop
Created Time: Mon Aug 06 20:33:27 UTC 2018
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.3.0
Type: MANAGED
Provider: parquet
Table Properties: [transient_lastDdlTime=1533587608]
Location: location/warehouse/ontologies.db/conceptmaps
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`timestamp`]
```

On analysis it appears the Hive table schema can drift from the dynamically created Spark schema; columns can be in different orders. We can compare gists of the schemas:

[Schema](https://gist.github.com/bdrillard/4fb4da9a534084d7267724f831954b50) for a `Dataset<ConceptMap>` using Bunsen 0.4.5, using HAPI 3.3.0.

[Schema](https://gist.github.com/bdrillard/63706af3af360fe071db8a39d3129f95) for an existing ConceptMap table, built on a previous version of Bunsen. This schema differs from the first in column order of SourceUri/SourceReference, TargetUri/TargetReference, and useContext.valueQuantity fields (valueQuantity being in a different position is what is conveyed by the error message at the top).

[Schema](https://gist.github.com/bdrillard/53a60c7fb9276c15d45d7b802783a683) for a new ConceptMap table, built from the Dataset. This schema matches the first.

Even if we load the original table using Bunsen APIs

```
ontologies_maps = get_concept_maps(spark, "ontologies")

ontologies_maps.get_maps().printSchema()
```

as opposed to Spark APIs

```
spark.table("ontologies.conceptmaps").printSchema()
```

the result is still a mismatch to the `Dataset<ConceptMap>` we'd intend to write.

I don't _think_ this is related to issues we've seen with Spark in the past, where we have to explicitly `SELECT` columns in a particular order to avoid _data_ being written under the wrong column.

I think this is an issue related to the order the a RuntimeElement returns information about its children in the `EncoderBuilder`. Digging into `ConceptMap.useContext.value`, comparing the `Encoder` schema for different versions of Bunsen again, we see the differences seen at the table/dataset schema level, and if we dig more deeply into the `EncoderBuilder` runtime, we find that depending on the HAPI version, we get different orders in the ChoiceType children for `ConceptMap.useContext.value`, and those orders match the differences we see in the Dataset and table schemas. 

This amounts to tables for a given STU release being subject to non-passive changes (even though updates within HAPI for an STU release should be purely passive in regards to the resources).

The simplest thing to do is to just drop/archive the tables and rebuild them with the latest Bunsen version, but this requirement might be unexpected to users while consuming Bunsen over a HAPI version on the same FHIR STU release.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Writes Fail due to Column Mismatch #43

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Writes Fail due to Column Mismatch #43

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions