CVV_COLUMNS table integration test #16

Open · wants to merge 1 commit into base: dev

Conversation

Author

@dmorokov

  1. Added JdbcUtils and methods
  2. Added svv_columns.sql for retrieving table data
  3. Added the SvvColumnsTest test
  4. Added the SvvColumnsRow POJO
  5. Added CsvUtil and methods
  6. Added the Gradle reporting plugin

Collaborator

@paolomorandini left a comment

Looks good, I added a few comments and asked a few questions.

implementation 'commons-io:commons-io:2.11.0'
implementation 'org.apache.commons:commons-collections4:4.4'
implementation 'com.google.guava:guava:31.0.1-jre'
implementation 'com.opencsv:opencsv:5.6'
Collaborator

Dumper uses commons-csv; using the same library might rule out some issues due to library incompatibility in special/corner cases?

Author

Quite the opposite: if lib A has any bug or unexpected behavior, then using lib B in the tests allows us to catch that behavior, doesn't it?

Collaborator

It will tell us where A and B differ in behavior, but both libraries might just be using different, although valid, conventions.

Author

Discussed in a 1-to-1 chat. No alterations required here. Consider leaving it as is.

Collaborator

Can we please share the answer here for other reviewers?

Author

Sure,

Paolo Morandini, Fri 11:22 AM
I'm going to suggest to stick with opencsv and reconsider only if we get some test failures / problems due to incompatibility with commons-csv

Collaborator

The assumption is that this would favor coherence with, and re-use of, the already existing test suite for the extraction tool.

Collaborator

It's valid for a server to return a different column set at a different time, for example, a different server release.

This PR binds the columns into a hard-written POJO, so when that (valid) act happens, this test case will fail. So the test case is enforcing something which is not within the contract of the dumper. That concerns me a little.

If the use of OpenCSV is to do with the beans binding and POJOs, then we may need to raise this discussion by one level, to work out what contract we are actually enforcing. I would have thought that the contract is that:

  1. If we create a dump, and
  2. Run the loader over it, then
  3. We obtain a particular set of semantic information.

I would avoid enforcing the presence of columns which may or may not be present, and which we may or may not care about.

Author

@dmorokov Jun 6, 2022

I think that a test should do exactly what you are saying: enforce the contract.
When a test is 'flexible' and 'adjusts on the fly', it is very likely to miss a corner case.

Obviously, a test has a bunch of traits, but I think a good test should be simple, specific, and understandable. So, here we are talking about a test being specific with its data, structure, etc.

I read this as no changes being required here.
Please confirm, or resolve the conversation.

Collaborator

The contract we aim to enforce is that the loader ends up with a particular set of semantic information, not that a test which selects from the same table obtains the same rows. I think this is the key issue about this entire pull-request chain, and I think this thread is not resolved.

@dmorokov force-pushed the dev-cvv-columns-test branch 2 times, most recently from e49fd30 to a7987ee on May 26, 2022 17:28
@dmorokov requested a review from paolomorandini on May 26, 2022 17:34
@dmorokov force-pushed the dev-cvv-columns-test branch 4 times, most recently from 8d6aeaa to 38729b7 on May 28, 2022 00:23
Collaborator

@shevek-google left a comment

Can you please update the description to explain the purpose of this code?

* @param dbMultiset List of items extracted from the DB
* @param csvMultiset List of items loaded from Avro
*/
public static void assertMultisetsEqual(
Collaborator

Can we use Maps.difference() on a Map<T, Integer>?

Author

Sure we can, but what's the reason behind it?
We used this approach for testing the Teradata dwh-assessment-extraction-tool and faced zero issues with it, even on large amounts of data.

Collaborator

Every line of code we have under maintenance costs us dollars and hours. Using pre-existing, tested code saves us money both now and in the long run. The objective of programming is to write code. The objective of good engineering is to achieve results, preferably while not writing code. So the less code we have, the better.

Collaborator

I think you want something like

    [Immutable]Map<K, Integer> m = Maps.toMap(multiset, multiset::count)

and then make assertions on Maps.difference(m0, m1).

The code here, in addition to having maintenance cost, is very expensive in terms of remove() calls, and the error messages it finally produces are not as clear as those which can easily be generated from a MapDifference object.
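A minimal sketch of that suggestion, assuming Guava's Maps/MapDifference and Truth; the helper below is illustrative rather than the PR's actual assertMultisetsEqual:

    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.MapDifference;
    import com.google.common.collect.Maps;
    import com.google.common.collect.Multiset;

    import static com.google.common.truth.Truth.assertWithMessage;

    public class MultisetAssertSketch {

        /** Compares two multisets by turning each into an element -> count map. */
        public static <T> void assertMultisetsEqual(Multiset<T> dbMultiset, Multiset<T> csvMultiset) {
            ImmutableMap<T, Integer> dbCounts =
                Maps.toMap(dbMultiset.elementSet(), dbMultiset::count);
            ImmutableMap<T, Integer> csvCounts =
                Maps.toMap(csvMultiset.elementSet(), csvMultiset::count);

            // MapDifference separates keys present on one side only from keys whose counts differ.
            MapDifference<T, Integer> diff = Maps.difference(dbCounts, csvCounts);
            assertWithMessage(
                    "rows only in DB: %s; rows only in CSV: %s; rows with differing counts: %s",
                    diff.entriesOnlyOnLeft(), diff.entriesOnlyOnRight(), diff.entriesDiffering())
                .that(diff.areEqual())
                .isTrue();
        }
    }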

Author

@dmorokov Jun 6, 2022

Conclusion
I've tried this approach with Maps.difference() and it doesn't give me the desired result.
Because it keeps custom objects as keys and instance counters as values, this algorithm misses pretty obvious test cases, so the test runs and reports a false-positive result.

I had lists A and B; I removed A[0], A[1], and B[last], and the map-based algorithm said there is a diff in only 1 element, when in fact there are 3 mutually exclusive elements.

The original code isn't a piece of art, but it's fine-tuned and it does its job. Given that we are not going to do any sort of performance tests using these test case scenarios, I don't see any potential risk for this piece of code, even though it's not perfect.

Point
I think that 1 custom method, effectively 8 lines of code, 4 of which are pretty much logging and reporting to the CLI, would not require too much attention, and the costs wouldn't be so dramatically high.

@dmorokov force-pushed the dev-cvv-columns-test branch from 38729b7 to 6be870a on May 31, 2022 15:13
@dmorokov requested a review from shevek-google on May 31, 2022 15:15
Author

@dmorokov

Can you please update the description to explain the purpose of this code?

@shevek-google I don't understand what I should update in the commit message. I think that all the main points of this PR are already there. Please let me know if any specific point should be included.

@dmorokov force-pushed the dev-cvv-columns-test branch 2 times, most recently from 775e185 to 3889805 on June 2, 2022 02:52
Collaborator

@shevek-google left a comment

Hi, this is definitely moving in the right direction, I feel like we're into second-layer comments now, which is great, thank you.

implementation 'com.google.protobuf:protobuf-gradle-plugin:0.8.18'
implementation 'com.google.protobuf:protobuf-java:3.20.1'
annotationProcessor 'com.google.auto.value:auto-value:1.9'
compileOnly 'com.google.auto.value:auto-value-annotations:1.9'
Collaborator

We have typically set dependency versions globally using a Gradle build convention. Can we please follow that convention for these submodule(s)?
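For illustration, one common shape of such a convention is a shared version map declared once in the root build.gradle; the names and versions below are hypothetical, not this repository's actual convention:

    // root build.gradle
    ext {
        versions = [
            commonsCsv: '1.9.0',
            guava     : '31.0.1-jre',
        ]
    }

    // a submodule's build.gradle inherits the map via ext
    dependencies {
        implementation "org.apache.commons:commons-csv:${versions.commonsCsv}"
        implementation "com.google.guava:guava:${versions.guava}"
    }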

Author

As of right now, this test automation framework isn't included as a submodule in the main project.
There are several reasons for that:

  1. We don't want to overlap in dependencies or cross-dependencies
  2. We use different libs (OpenCSV and Commons-CSV)
  3. We may have different core-level test frameworks - TestNG and JUnit
  4. We may have different reporting and logging implementations
  5. Integration tests, in accordance with industry best practices, are considered a separate project, and they don't necessarily have to be integrated with the code of the product. If these were component or unit tests, then yes, submoduling and bundling would be a valid scenario.

Collaborator

(2) and (3) are review comments elsewhere; our convention is to use commons-csv and JUnit.
(4) should be a review comment; we would like this code to follow the conventions of the repository.
(5) Gradle would normally represent an integration test as a TestSet, and this practice is sufficiently standard that it is now part of Gradle core (see the sketch below).
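For point (5), a minimal sketch of such a suite using Gradle's built-in jvm-test-suite plugin; the suite name integrationTest is illustrative:

    plugins {
        id 'java'
        id 'jvm-test-suite'
    }

    testing {
        suites {
            // A dedicated suite keeps integration tests out of the unit-test
            // source set while staying inside the same project.
            integrationTest(JvmTestSuite) {
                useJUnit()
            }
        }
    }

    // Optionally wire the suite into the standard verification lifecycle.
    tasks.named('check') {
        dependsOn testing.suites.integrationTest
    }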

@dmorokov force-pushed the dev-cvv-columns-test branch 4 times, most recently from 5661f9c to d5e6e7e on June 6, 2022 20:26
@dmorokov requested a review from shevek-google on June 6, 2022 20:29
@dmorokov force-pushed the dev-cvv-columns-test branch from d5e6e7e to 32e7f6e on June 14, 2022 00:23
@dmorokov force-pushed the dev-cvv-columns-test branch from 4e7e799 to 7b499ed on June 19, 2022 17:19
@dmorokov requested a review from paolomorandini on June 19, 2022 17:20
dumper/lib-ext-hive-metastore/build.gradle (outdated thread, resolved)

assertThat(dbColumnHeaders.size()).isEqualTo(csvColumnHeaders.size());
assertThat(dbColumnHeaders).containsExactlyElementsIn(csvColumnHeaders);
assertDbCsvDataEqual(dbMultiset, csvMultiset);
Collaborator

@paolomorandini Jun 20, 2022

I was wondering if this is equal to

    Truth.assertThat(Multisets.difference(dbMultiset, csvMultiset))
            .containsExactlyElementsIn(Multisets.difference(csvMultiset, dbMultiset));

Author

@dmorokov Jun 20, 2022

OK, let's run an experiment.
There are 4 possible cases for this assert:

  1. A and B are equal
  2. A has plus or minus some entries
  3. B has plus or minus some entries
  4. A and B have mutually exclusive entries

Now, with this output, can you clearly understand what happened in the test?

value of: multiset.onlyElement()
  expected: {numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}
  but was : {numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}

and can you with this one?

DB view and CSV file have mutually exclusive row(s)
DB view has 1 different row(s): 
{numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}
CSV file has 1 different row(s): 
{numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}

Conclusion
Does the usage of this built-in comparison method provide any benefit for tests?
I don't think so.
We save 20 lines of code, sure, but we lose a ton of output detail and visibility.
I completely cannot understand why we are fighting so hard over a 20-line code fragment and are willing to sacrifice output readability for the sake of the line count.

So, yes, the method does work.
However, this is a push-back from my side. As a QA engineer, dealing with unreadable test output is worse than having a 20-line code block.

Collaborator

@paolomorandini Jun 21, 2022

The following considerations apply:

  • widely used libraries are less likely to have bugs
  • the same applies for performance
  • it's easier for new contributors to understand what's going on (under the assumption that they know the libraries in use)
  • the output format expected: {} \n but was: {} is very common in test failure reports (some tools might even look for this pattern)

If we want to keep the custom failure reporting, we could at least use Multisets.difference instead of doing it manually by iterate-and-remove, as sketched below.
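A sketch of that middle ground, keeping the custom messages but letting Guava compute the two differences; the method and message wording below are illustrative, not the PR's actual code, and JUnit's Assert.fail is assumed:

    import com.google.common.collect.Multiset;
    import com.google.common.collect.Multisets;

    import static org.junit.Assert.fail;

    public class DbCsvAssertSketch {

        /** Custom failure reporting on top of Multisets.difference(). */
        public static <T> void assertDbCsvDataEqual(Multiset<T> dbMultiset, Multiset<T> csvMultiset) {
            // Rows (with multiplicity) present on one side but not on the other.
            Multiset<T> dbOnly = Multisets.difference(dbMultiset, csvMultiset);
            Multiset<T> csvOnly = Multisets.difference(csvMultiset, dbMultiset);

            if (dbOnly.isEmpty() && csvOnly.isEmpty()) {
                return;
            }
            fail(String.format(
                "DB view and CSV file have mutually exclusive row(s)%n"
                    + "DB view has %d different row(s): %s%n"
                    + "CSV file has %d different row(s): %s",
                dbOnly.size(), dbOnly, csvOnly.size(), csvOnly));
        }
    }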

Author

@dmorokov Jun 21, 2022

  • libs are more likely to have bugs compared to Java native methods
  • I tried both methods, this custom one and the one you provided above - no performance difference on 20k entries
  • agreed, it should be easier. The problem in this case is that easier doesn't mean better. As Shevek said, we develop code that does what we need; it solves some computations or problems, and here is an example of solving a problem with the understandability of long log outputs.
  • again, I agree that it's very common, but we are talking about different code properties: common vs convenient-and-readable. A test must do what it's supposed to do - verify, and provide a report that's easy to read.

@paolomorandini Still, from the first log output above, can you recognize and distinguish whether the DB or the CSV has more or fewer entries, or both?

Collaborator

So Truth.assertThat(dbMultiset).containsExactlyElementsIn(csvMultiset) has been excluded because of too much output being produced (it would've been possible to solve this with an override if Multiset.toString wasn't final).

Should assertDbCsvDataEqual be refactored to use Multisets.difference(a, b) instead of manually removing elements with b.forEach(a::remove)?

Author

@dmorokov Jun 22, 2022

Yes, the reduction itself can be refactored.
Although, keep in mind that Multisets.difference() is marked as a Beta method:

    @Beta
    public static <E extends @Nullable Object> Multiset<E> difference(
        Multiset<E> multiset1, Multiset<?> multiset2) {

which stands for:

Signifies that a public API (public class, method or field) is subject to incompatible changes, or even removal, in a future release. An API bearing this annotation is exempt from any compatibility guarantees made by its containing library. Note that the presence of this annotation implies nothing about the quality or performance of the API in question, only the fact that it is not "API-frozen."
It is generally safe for applications to depend on beta APIs, at the cost of some extra work during upgrades. However it is generally inadvisable for libraries (which get included on users' CLASSPATHs, outside the library developers' control) to do so.

@paolomorandini, I can't understand why we would implement beta functionality of the library, which is subject to at least some alterations. It doesn't look like a step that adds stability to our framework. Quite the opposite, it may require some maintenance.
However, I see that there is a strong opinion on using libraries, so I'll implement the corresponding changes, but I've already highlighted the possible risks as well.

Done.

@dmorokov force-pushed the dev-cvv-columns-test branch 3 times, most recently from 42a0b6e to 663d9db on June 22, 2022 01:48
Collaborator

@shevek-google left a comment

Can we please include some instructions on how to run these tests and what contracts they aim to enforce?

My major concern now is that these tests are enforcing an unstable contract, and that we should in fact be enforcing a contract on the output of the loader, given a dump.

The Kokoro build script invokes the Dumper, and the Dumper extracts metadata to a local folder on a build worker;
the path to that folder is stored in the EXPORT_PATH env var.
After that, the integration tests are run and pointed at EXPORT_PATH to read the CSV files from there.
The logic of the integration tests is to make an SQL request to the DB, extract the data, and keep it in memory,
then read the CSV files, deserialize them, and compare them with the extraction from the DB, checking that both data sets
are equal field-to-field.
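In code, that flow amounts to something like the sketch below; JdbcUtils, CsvUtil, and SvvColumnsRow come from this PR, but the method and file names used here are hypothetical:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import com.google.common.collect.Multiset;

    public class SvvColumnsFlowSketch {

        public static void main(String[] args) throws Exception {
            // The Dumper has already run; Kokoro exposes its output folder via EXPORT_PATH.
            Path exportPath = Paths.get(System.getenv("EXPORT_PATH"));

            // 1. Query the DB (svv_columns.sql) and keep the rows in memory.
            //    JdbcUtils.selectRows is a hypothetical helper name.
            Multiset<SvvColumnsRow> dbRows = JdbcUtils.selectRows("svv_columns.sql");

            // 2. Read and deserialize the CSV file produced by the Dumper.
            //    CsvUtil.readRows is likewise hypothetical.
            Multiset<SvvColumnsRow> csvRows = CsvUtil.readRows(exportPath.resolve("svv_columns.csv"));

            // 3. Field-to-field comparison of the two data sets (the PR's assertion helper).
            assertMultisetsEqual(dbRows, csvRows);
        }
    }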
@dmorokov force-pushed the dev-cvv-columns-test branch from 663d9db to 03e12df on June 28, 2022 23:22