Writes Fail due to Column Mismatch #43
Comments
I think the solution here is based on the append vs. insert behavior of Spark. We make use of DataFrameWriter#insertInto throughout the codebase, which inserts based on strict column ordering; DataFrameWriter#saveAsTable, by contrast, resolves columns against an existing table by name when appending. We specify the insertion mode as "ErrorIfExists", but that may be redundant given the duplicate partition discovery we apply. Also, given that FHIR will never have duplicate field names at the same level of a structure, we can guarantee a unique set of column names for each Dataset struct.
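For illustration, a minimal sketch of that distinction, assuming a Hive-backed SparkSession and purely illustrative table names (none of these identifiers come from the Bunsen codebase):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class WriteModesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("write-modes-sketch")
        .enableHiveSupport()
        .getOrCreate();

    // Rows to append; the staging table name is purely illustrative.
    Dataset<Row> maps = spark.table("concept_maps_staging");

    // insertInto matches the Dataset's columns to the existing table strictly
    // by position, so any drift in column order misplaces data or fails the write.
    maps.write().insertInto("concept_maps");

    // saveAsTable in Append mode resolves columns against the existing table's
    // schema by name, which tolerates reordered (but identically named) columns.
    maps.write().mode(SaveMode.Append).saveAsTable("concept_maps");
  }
}
```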
The existing code assumes the HAPI API preserves order when it does not, but I don't think this will be a problem once we bring forward the 0.5.0-dev branch (and adopt its ordering). That branch uses the FHIR resource definition, which unambiguously specifies field ordering, and the same resource definition will produce the same field ordering independently of the version of HAPI.
This is also a problem when performing a union on two resource datasets. unionByName does not help either: while it aligns the top-level columns by name, it does nothing about the column types, so if we have two struct columns whose fields are in different orders, the union will fail.
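As a hedged illustration of that failure mode (assuming the Spark 2.x versions Bunsen targeted; newer Spark releases reconcile nested struct fields differently), something like the following throws an AnalysisException because the two struct types are considered incompatible even though unionByName aligns the top-level column:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionByNameSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("union-by-name-sketch")
        .master("local[*]")
        .getOrCreate();

    // Two one-row datasets with the same top-level column name, but the nested
    // struct fields are declared in a different order.
    Dataset<Row> left = spark.sql(
        "SELECT named_struct('sourceUri', 'uri-1', "
            + "'sourceReference', CAST(NULL AS STRING)) AS source");
    Dataset<Row> right = spark.sql(
        "SELECT named_struct('sourceReference', CAST(NULL AS STRING), "
            + "'sourceUri', 'uri-2') AS source");

    // unionByName matches the `source` column by name, but the struct types
    // differ in field order, so the union is rejected as incompatible.
    left.unionByName(right).show();
  }
}
```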
I'm having this issue too. Any update on when the work done in the 0.5.0-dev branch will be merged in, so that the FHIR resource definition is used to ensure ordering?
Any update about this issue? |
When writing out new ConceptMaps to an existing table, there can be an exception on column data-type mismatches:
On analysis it appears the Hive table schema can drift from the dynamically created Spark schema; columns can be in different orders. We can compare gists of the schemas:
Schema for a Dataset<ConceptMap> using Bunsen 0.4.5 and HAPI 3.3.0.

Schema for an existing ConceptMap table, built on a previous version of Bunsen. This schema differs from the first in the column order of the SourceUri/SourceReference, TargetUri/TargetReference, and useContext.valueQuantity fields (valueQuantity being in a different position is what is conveyed by the error message at the top).

Schema for a new ConceptMap table, built from the Dataset. This schema matches the first.
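To reproduce the comparison without the gists, one can dump both schemas and diff the output. A minimal sketch, assuming Bunsen's STU3 encoders and an illustrative table name:

```java
import com.cerner.bunsen.FhirEncoders;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
import org.hl7.fhir.dstu3.model.ConceptMap;

public class SchemaCompareSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("schema-compare-sketch")
        .enableHiveSupport()
        .getOrCreate();

    // Schema Spark derives for a Dataset<ConceptMap> under the current Bunsen/HAPI pair.
    Dataset<ConceptMap> maps = spark.emptyDataset(
        FhirEncoders.forStu3().getOrCreate().of(ConceptMap.class));
    maps.printSchema();

    // Schema of the existing Hive table (the table name here is illustrative);
    // diffing the two outputs shows the reordered columns.
    spark.table("ontologies.conceptmaps").printSchema();
  }
}
```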
Even if we load the original table using Bunsen APIs, as opposed to Spark APIs, the result is still a mismatch to the Dataset<ConceptMap> we'd intend to write.

I don't think this is related to issues we've seen with Spark in the past, where we have to explicitly SELECT columns in a particular order to avoid data being written under the wrong column. I think this is an issue related to the order in which a RuntimeElement returns information about its children in the EncoderBuilder. Digging into ConceptMap.useContext.value and comparing the Encoder schema for different versions of Bunsen, we again see the differences seen at the table/dataset schema level, and if we dig more deeply into the EncoderBuilder runtime, we find that, depending on the HAPI version, we get different orders in the ChoiceType children for ConceptMap.useContext.value; those orders match the differences we see in the Dataset and table schemas.

This amounts to tables for a given STU release being subject to non-passive changes (even though updates within HAPI for an STU release should be purely passive with regard to the resources).
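For anyone wanting to inspect that ordering directly, a rough sketch of pulling the useContext element out of the encoder's schema (assuming the STU3 encoder API and that repeated elements are encoded as arrays of structs) is:

```java
import com.cerner.bunsen.FhirEncoders;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.StructType;
import org.hl7.fhir.dstu3.model.ConceptMap;

public class UseContextSchemaSketch {
  public static void main(String[] args) {
    // Schema the encoder produces for ConceptMap under the current Bunsen/HAPI pair.
    StructType conceptMapSchema =
        FhirEncoders.forStu3().getOrCreate().of(ConceptMap.class).schema();

    // useContext is repeated, so it appears as an array of structs; printing the
    // element struct shows the relative position of the value[x] choice columns
    // (valueQuantity, valueCodeableConcept, ...) for this HAPI version.
    ArrayType useContext =
        (ArrayType) conceptMapSchema.apply("useContext").dataType();
    ((StructType) useContext.elementType()).printTreeString();
  }
}
```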
The simplest thing to do is to just drop/archive the tables and rebuild them with the latest Bunsen version, but this requirement might be unexpected to users who are consuming Bunsen over a different HAPI version on the same FHIR STU release.
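One way that archive-and-rebuild step could look, sketched with Spark SQL and the Bunsen encoders (the table names and the loadSourceConceptMaps helper are hypothetical, standing in for however the original resources are re-ingested):

```java
import com.cerner.bunsen.FhirEncoders;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.hl7.fhir.dstu3.model.ConceptMap;

public class RebuildTableSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("rebuild-table-sketch")
        .enableHiveSupport()
        .getOrCreate();

    // Keep the old data around rather than dropping it outright.
    spark.sql("ALTER TABLE ontologies.conceptmaps RENAME TO ontologies.conceptmaps_archive");

    // Re-encode the source resources with the current Bunsen/HAPI pair so the new
    // table takes on the current column ordering, then write it fresh.
    List<ConceptMap> sourceMaps = loadSourceConceptMaps(); // hypothetical re-ingestion step
    Dataset<ConceptMap> rebuilt = spark.createDataset(
        sourceMaps, FhirEncoders.forStu3().getOrCreate().of(ConceptMap.class));
    rebuilt.write().mode(SaveMode.ErrorIfExists).saveAsTable("ontologies.conceptmaps");
  }

  // Placeholder for however the original ConceptMap resources are loaded.
  private static List<ConceptMap> loadSourceConceptMaps() {
    throw new UnsupportedOperationException("supply the original resources here");
  }
}
```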