Console application to compare two xlsx files containing the same data with different schema.
This was built to support data migrations, comparing xlsx exports from source and destination systems to identify bugs in the process.
The files are referred to as the "left" and "right" files, based on the arguments passed.
For each row in the left file, we find the matching row in right file, and then compare the two rows with configurable matching rules.
The comparison results are then written to xlsx for further analysis, one row for each cell that didn't match.
You can download a pre-built binaries from the github releases page.
$ ./XlsxCompare config.json left.xlsx right.xlsx
[20:54:44] info: XlsxCompare.Driver[0]
Starting
[20:54:44] info: XlsxCompare.Driver[0]
Reading config from config.json
[20:54:44] info: XlsxCompare.XlsxComparer[0]
Comparing left.xlsx to right.xlsx
[20:54:48] info: XlsxCompare.XlsxComparer[0]
Comparing [XlsxFacade Sheet1, A1:BX9011] to [XlsxFacade Sheet1, A1:BQ9011]
[20:55:09] info: XlsxCompare.XlsxComparer[0]
Compared
[20:55:09] info: XlsxCompare.ResultsWriter[0]
Writing results to results.xlsx
[20:55:10] info: XlsxCompare.ResultsWriter[0]
Written
[20:55:10] info: XlsxCompare.Driver[0]
Finished
...
A complicated json file controls how the two files will be compared. Here's a small example:
{
"leftKeyColumn": "Id",
"rightKeyColumn": "OLD_ID",
"resultOptions": {
"leftValueHeader": "my value",
"rightValueHeader": "your value",
"leftColumnNames": [
"Batch #"
],
"rightColumnNames": [
"NEW_ID"
]
},
"assertions": [
{
"leftColumnName": "Name",
"rightColumnName": "CUSTOMER_NAME"
},
{
"leftColumnName": "DateAdded",
"rightColumnName": "DT_CREATE",
"matchBy": "Date"
}
]
}
This config will:
- join the two xlsx files on
Id == OLD_ID
- checks for
Name == CUSTOMER_NAME
, ignoring case and normalizing nulls and whitespace (e.g."Ryan "
will match"RYAN"
andnull
will match""
) - checks for
DateAdded == DT_CREATE
, parsing both into dates before comparing (e.g."2021-04-02"
will match"4/2/2021 3:45PM"
) - the output
results.xlsx
will have a row for each mismatch, with these columns:Batch #
- value from the left file'sBatch #
columnNEW_ID
- value from the right file'sNEW_ID
columnMismatched field
- eitherName
orDateAdded
, depending on which assertion failedmy value
- mismatched value from the left fileyour value
- mismatched value from the right file
key | meaning | default |
---|---|---|
leftKeyColumn |
column name in the "left" file that contains a primary key | |
rightKeyColumn |
column name in the "right" file that matches the primary key | |
resultOptions |
object configuring the result file, see below | |
assertions |
an array of assertion objects configuring how we want columns to match, see below | |
ignoreMissingRows |
allow rows to exist in the "left" without with a match in the "right" file, useful for checking partial output | false |
key | meaning | default |
---|---|---|
path |
name of the xlsx file to write | results.xlsx |
leftValueHeader |
header to use over leftKeyColumn values |
left value |
rightValueHeader |
header to use over rightKeyColumn values |
right value |
leftColumnNames |
additional data to include from the left file. This is useful for adding context to help analyse the mismatches. | null |
rightColumnNames |
additional data to include from the right file | null |
key | meaning | default |
---|---|---|
leftColumnName |
column to compare from the left file | |
rightColumnName |
column to compare from the right file | |
matchBy |
how to compare the two values, see matchBy below |
string |
remove |
if present: before comparison, remove this string from both values | null |
zeroRepresentsEmpty |
if true: before comparison, convert any zero values (e.g. 0 , 0.0 ) to empty string |
false |
synonyms |
if present: this is a nested list of terms that should be treated as identical. For example, with these synonyms: [ ["street", "st"], ["n", "north"] ] , then 123 North Main St will match 123 N Main Street |
null |
matchBy |
rule | examples |
---|---|---|
string |
strings must match, ignoring case and leading/trailing whitespace | test matches test, TEST and Test , but not testing |
stringIgnoreMissingLeft |
same as string , but treat a missing "left" value as a match |
same as string , but an empty string matches test |
integer |
parse to integers before comparison | 0123 matches 123 |
decimal |
parse to decimals before comparison | 0.123 matches .123000 |
date |
parse as dates before comparison | 2021-04-02 matches 20210402 and 4/2/2021 3:45PM , but not 2021-04-03 |
stringLeftStartsWithRight |
the left value must start with the right value | testing matches test , but not testing with suffix |
stringRightStartsWithLeft |
the right value must start with the left value | test matches testing |
tokens |
each value must contain the same tokens, after splitting on whitespace. This is useful for addresses that have the same components in a different order, or other data where the ordering is insignificant | 1234 S Main St matches 1234 Main St S |
- uses
net5
- recommend using Visual Studio Code, or another IDE that supports EditorConfig
- releases are based on git tags: make a tag like
v${SEMVER}
and CI will create a github release with standalone binaries