CVV_COLUMNS table integration test #16

Open · wants to merge 1 commit into base: dev

Conversation

Author

@dmorokov

  1. Added JdbcUtils and methods
  2. Added svv_columns.sql for retrieving table data
  3. Added the SvvColumnsTest test
  4. Added the SvvColumnsRow POJO
  5. Added CsvUtil and methods
  6. Added the Gradle reporting plugin

Collaborator

@paolomorandini left a comment

Looks good, I added a few comments and asked a few questions.

implementation 'commons-io:commons-io:2.11.0'
implementation 'org.apache.commons:commons-collections4:4.4'
implementation 'com.google.guava:guava:31.0.1-jre'
implementation 'com.opencsv:opencsv:5.6'
Collaborator

Dumper uses commons-csv; using the same library might rule out some issues due to library incompatibility in special/corner cases?

Author

Quite the opposite: if lib A has any bug or unexpected behavior, then using lib B in the tests allows us to catch that behavior, doesn't it?

Collaborator

It will tell us where A and B differ in behavior, but both libraries might just be using different, although valid, conventions.

Author

Discussed in a 1-to-1 chat. No alterations required here. Consider leaving it as is.

Collaborator

Can we please share the answer here for other reviewers?

Author

Sure,

Paolo Morandini, Fri 11:22 AM
I'm going to suggest to stick with opencsv and reconsider only if we get some test failures / problems due to incompatibility with commons-csv

Collaborator

The assumption is that this would favor coherence with, and re-use of, the already existing test suite for the extraction tool.

Collaborator

It's valid for a server to return a different column set at a different time, for example, a different server release.

This PR binds the columns into a hard-written POJO, so when that (valid) act happens, this test case will fail. So the test case is enforcing something which is not within the contract of the dumper. That concerns me a little.

If the use of OpenCSV is to do with the beans binding and POJOs, then we may need to raise this discussion by one level, to work out what contract we are actually enforcing. I would have thought that the contract is that:

  1. If we create a dump, and
  2. Run the loader over it, then
  3. We obtain a particular set of semantic information.

I would avoid enforcing the presence of columns which may or may not be present, and which we may or may not care about.

Author

@dmorokov Jun 6, 2022

I think that a test should do exactly what you are saying: enforce the contract.
When a test is 'flexible' and 'adjusts on the fly', it is very likely to miss a corner case.

Obviously, a test has a bunch of traits, but I think a good test should be simple, specific, and understandable. So, here we are talking about a test being specific with its data, structure, etc.

I read this as no changes being required here.
Please confirm, or resolve the conversation.

Collaborator

The contract we aim to enforce is that the loader ends up with a particular set of semantic information, not that a test which selects from the same table obtains the same rows. I think this is the key issue about this entire pull-request chain, and I think this thread is not resolved.

@dmorokov force-pushed the dev-cvv-columns-test branch 2 times, most recently from e49fd30 to a7987ee on May 26, 2022 17:28
@dmorokov requested a review from paolomorandini on May 26, 2022 17:34
@dmorokov force-pushed the dev-cvv-columns-test branch 4 times, most recently from 8d6aeaa to 38729b7 on May 28, 2022 00:23
Collaborator

@shevek-google left a comment

Can you please update the description to explain the purpose of this code?

* @param dbMultiset List of items extracted from the DB
* @param csvMultiset List of items loaded from Avro
*/
public static void assertMultisetsEqual(
Collaborator

Can we use Maps.difference() on a Map<T, Integer>?

Author

Sure we can, but what's the reason behind it?
We used this approach for testing the Teradata dwh-assessment-extraction-tool and faced zero issues with it, even on large amounts of data.

Collaborator

Every line of code we have under maintenance costs us dollars and hours. Using pre-existing, tested code saves us money both now and in the long run. The objective of programming is to write code. The objective of good engineering is to achieve results, preferably while not writing code. So the less code we have, the better.

Collaborator

I think you want something like

    [Immutable]Map<K, Integer> m = Maps.toMap(multiset, multiset::count)

and then make assertions on Maps.difference(m0, m1).

The code here, in addition to having maintenance cost, is very expensive in terms of remove() calls, and the error messages it finally produces are not as clear as those which can easily be generated from a MapDifference object.
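A minimal sketch of that suggestion, assuming Guava's Maps/MapDifference and Truth; the helper below is illustrative rather than the PR's actual assertMultisetsEqual:

    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.MapDifference;
    import com.google.common.collect.Maps;
    import com.google.common.collect.Multiset;

    import static com.google.common.truth.Truth.assertWithMessage;

    public class MultisetAssertSketch {

        /** Compares two multisets by turning each into an element -> count map. */
        public static <T> void assertMultisetsEqual(Multiset<T> dbMultiset, Multiset<T> csvMultiset) {
            ImmutableMap<T, Integer> dbCounts =
                Maps.toMap(dbMultiset.elementSet(), dbMultiset::count);
            ImmutableMap<T, Integer> csvCounts =
                Maps.toMap(csvMultiset.elementSet(), csvMultiset::count);

            // MapDifference separates keys present on one side only from keys whose counts differ.
            MapDifference<T, Integer> diff = Maps.difference(dbCounts, csvCounts);
            assertWithMessage(
                    "rows only in DB: %s; rows only in CSV: %s; rows with differing counts: %s",
                    diff.entriesOnlyOnLeft(), diff.entriesOnlyOnRight(), diff.entriesDiffering())
                .that(diff.areEqual())
                .isTrue();
        }
    }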

Author

@dmorokov Jun 6, 2022

Conclusion
I've tried this approach with Maps.difference() and it doesn't give me the desired result.
Because it keeps custom objects as keys and instance counters as values, this algorithm misses pretty obvious test cases, so the test runs and reports a false-positive result.

I had lists A and B; I removed A[0], A[1], and B[last], and the map-based algorithm said there is a diff in only 1 element, when in fact there are 3 mutually exclusive elements.

The original code isn't a piece of art, but it's fine-tuned and it does its job. Given that we are not going to do any sort of performance tests using these test case scenarios, I don't see any potential risk for this piece of code, even though it's not perfect.

Point
I think that 1 custom method, effectively 8 lines of code, 4 of which are pretty much logging and reporting to the CLI, would not require too much attention, and the costs wouldn't be so dramatically high.

@dmorokov force-pushed the dev-cvv-columns-test branch from 38729b7 to 6be870a on May 31, 2022 15:13
@dmorokov requested a review from shevek-google on May 31, 2022 15:15
Author

@dmorokov

Can you please update the description to explain the purpose of this code?

@shevek-google I don't understand what I should update in the commit message. I think that all the main points of this PR are already there. Please let me know if any specific point should be included.

@dmorokov force-pushed the dev-cvv-columns-test branch 2 times, most recently from 775e185 to 3889805 on June 2, 2022 02:52
Collaborator

@shevek-google left a comment

Hi, this is definitely moving in the right direction, I feel like we're into second-layer comments now, which is great, thank you.

implementation 'com.google.protobuf:protobuf-gradle-plugin:0.8.18'
implementation 'com.google.protobuf:protobuf-java:3.20.1'
annotationProcessor 'com.google.auto.value:auto-value:1.9'
compileOnly 'com.google.auto.value:auto-value-annotations:1.9'
Collaborator

We have typically set dependency versions globally using a Gradle build convention. Can we please follow that convention for these submodule(s)?
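For illustration, one common shape of such a convention is a shared version map declared once in the root build.gradle; the names and versions below are hypothetical, not this repository's actual convention:

    // root build.gradle
    ext {
        versions = [
            commonsCsv: '1.9.0',
            guava     : '31.0.1-jre',
        ]
    }

    // a submodule's build.gradle inherits the map via ext
    dependencies {
        implementation "org.apache.commons:commons-csv:${versions.commonsCsv}"
        implementation "com.google.guava:guava:${versions.guava}"
    }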

Author

As of right now, this test automation framework isn't included as a submodule in the main project.
There are several reasons for that:

  1. We don't want to overlap in dependencies or cross-dependencies
  2. We use different libs (OpenCSV and Commons-CSV)
  3. We may have different core-level test frameworks - TestNG and JUnit
  4. We may have different reporting and logging implementations
  5. Integration tests, in accordance with industry best practices, are considered a separate project, and they don't necessarily have to be integrated with the code of the product. If these were component or unit tests, then yes, submoduling and bundling would be a valid scenario.

Collaborator

(2) and (3) are review comments elsewhere; our convention is to use commons-csv and JUnit.
(4) should be a review comment; we would like this code to follow the conventions of the repository.
(5) Gradle would normally represent an integration test as a TestSet, and this practice is sufficiently standard that it is now part of Gradle core (see the sketch below).
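For point (5), a minimal sketch of such a suite using Gradle's built-in jvm-test-suite plugin; the suite name integrationTest is illustrative:

    plugins {
        id 'java'
        id 'jvm-test-suite'
    }

    testing {
        suites {
            // A dedicated suite keeps integration tests out of the unit-test
            // source set while staying inside the same project.
            integrationTest(JvmTestSuite) {
                useJUnit()
            }
        }
    }

    // Optionally wire the suite into the standard verification lifecycle.
    tasks.named('check') {
        dependsOn testing.suites.integrationTest
    }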

@dmorokov force-pushed the dev-cvv-columns-test branch 4 times, most recently from 5661f9c to d5e6e7e on June 6, 2022 20:26
@dmorokov requested a review from shevek-google on June 6, 2022 20:29
@dmorokov force-pushed the dev-cvv-columns-test branch from d5e6e7e to 32e7f6e on June 14, 2022 00:23
@dmorokov force-pushed the dev-cvv-columns-test branch from 4e7e799 to 7b499ed on June 19, 2022 17:19
@dmorokov requested a review from paolomorandini on June 19, 2022 17:20
dumper/lib-ext-hive-metastore/build.gradle (outdated thread, resolved)

assertThat(dbColumnHeaders.size()).isEqualTo(csvColumnHeaders.size());
assertThat(dbColumnHeaders).containsExactlyElementsIn(csvColumnHeaders);
assertDbCsvDataEqual(dbMultiset, csvMultiset);
Collaborator

@paolomorandini Jun 20, 2022

I was wondering if this is equal to

    Truth.assertThat(Multisets.difference(dbMultiset, csvMultiset))
            .containsExactlyElementsIn(Multisets.difference(csvMultiset, dbMultiset));

Author

@dmorokov Jun 20, 2022

OK, let's run an experiment.
There are 4 possible cases for this assert:

  1. A and B are equal
  2. A has plus or minus some entries
  3. B has plus or minus some entries
  4. A and B have mutually exclusive entries

Now, with this output, can you clearly understand what happened in the test?

value of: multiset.onlyElement()
  expected: {numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}
  but was : {numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}

and can you with this one?

DB view and CSV file have mutually exclusive row(s)
DB view has 1 different row(s): 
{numeric_precision_radix=, datetime_precision=, is_nullable=YES, column_default=, collation_schema=, column_name=as_locator, character_maximum_length=, collation_name=, numeric_scale=, character_set_catalog=, table_catalog=dev, table_name=routines, interval_precision=, character_set_schema=, collation_catalog=, ordinal_position=54, domain_name=character_data, character_set_name=, table_schema=information_schema, data_type=character varying, numeric_precision=, interval_type=, remarks=}
CSV file has 1 different row(s): 
{numeric_precision_radix=, is_nullable=NO, column_default=, numeric_scale=, table_catalog=dev, table_name=pg_tablespace, collation_catalog=, domain_name=, character_set_name=, interval_type=, datetime_precision=, collation_schema=, column_name=spcname, character_maximum_length=, collation_name=case_sensitive, character_set_catalog=, interval_precision=, character_set_schema=, ordinal_position=1, table_schema=pg_catalog, data_type=name, numeric_precision=, remarks=}

Conclusion
Does the usage of this built-in comparison method provide any benefit for tests?
I don't think so.
We save 20 lines of code, sure, but we lose a ton of output detail and visibility.
I completely cannot understand why we are fighting so hard over a 20-line code fragment and are willing to sacrifice output readability for the sake of the line count.

So, yes, the method does work.
However, this is a push-back from my side. As a QA engineer, dealing with unreadable test output is worse than having a 20-line code block.

Collaborator

@paolomorandini Jun 21, 2022

The following considerations apply:

  • widely used libraries are less likely to have bugs
  • the same applies for performance
  • it's easier for new contributors to understand what's going on (under the assumption that they know the libraries in use)
  • the output format expected: {} \n but was: {} is very common in test failure reports (some tools might even look for this pattern)

If we want to keep the custom failure reporting, we could at least use Multisets.difference instead of doing it manually by iterate-and-remove, as sketched below.
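A sketch of that middle ground, keeping the custom messages but letting Guava compute the two differences; the method and message wording below are illustrative, not the PR's actual code, and JUnit's Assert.fail is assumed:

    import com.google.common.collect.Multiset;
    import com.google.common.collect.Multisets;

    import static org.junit.Assert.fail;

    public class DbCsvAssertSketch {

        /** Custom failure reporting on top of Multisets.difference(). */
        public static <T> void assertDbCsvDataEqual(Multiset<T> dbMultiset, Multiset<T> csvMultiset) {
            // Rows (with multiplicity) present on one side but not on the other.
            Multiset<T> dbOnly = Multisets.difference(dbMultiset, csvMultiset);
            Multiset<T> csvOnly = Multisets.difference(csvMultiset, dbMultiset);

            if (dbOnly.isEmpty() && csvOnly.isEmpty()) {
                return;
            }
            fail(String.format(
                "DB view and CSV file have mutually exclusive row(s)%n"
                    + "DB view has %d different row(s): %s%n"
                    + "CSV file has %d different row(s): %s",
                dbOnly.size(), dbOnly, csvOnly.size(), csvOnly));
        }
    }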

Author

@dmorokov Jun 21, 2022

  • libs are more likely to have bugs compared to Java native methods
  • I tried both methods, this custom one and the one you provided above - no performance difference on 20k entries
  • agreed, it should be easier. The problem in this case is that easier doesn't mean better. As Shevek said, we develop code that does what we need; it solves some computations or problems, and here is an example of solving a problem with the understandability of long log outputs.
  • again, I agree that it's very common, but we are talking about different code properties: common vs convenient-and-readable. A test must do what it's supposed to do - verify, and provide a report that's easy to read.

@paolomorandini Still, from the first log output above, can you recognize and distinguish whether the DB or the CSV has more or fewer entries, or both?

Collaborator

So Truth.assertThat(dbMultiset).containsExactlyElementsIn(csvMultiset) has been excluded because of too much output being produced (it would've been possible to solve this with an override if Multiset.toString wasn't final).

Should assertDbCsvDataEqual be refactored to use Multisets.difference(a, b) instead of manually removing elements with b.forEach(a::remove)?

Author

@dmorokov Jun 22, 2022

Yes, the reduction itself can be refactored.
Although, keep in mind that Multisets.difference() is marked as a Beta method:

    @Beta
    public static <E extends @Nullable Object> Multiset<E> difference(
        Multiset<E> multiset1, Multiset<?> multiset2) {

which stands for:

Signifies that a public API (public class, method or field) is subject to incompatible changes, or even removal, in a future release. An API bearing this annotation is exempt from any compatibility guarantees made by its containing library. Note that the presence of this annotation implies nothing about the quality or performance of the API in question, only the fact that it is not "API-frozen."
It is generally safe for applications to depend on beta APIs, at the cost of some extra work during upgrades. However it is generally inadvisable for libraries (which get included on users' CLASSPATHs, outside the library developers' control) to do so.

@paolomorandini, I can't understand why we would implement beta functionality of the library, which is subject to at least some alterations. It doesn't look like a step that adds stability to our framework. Quite the opposite, it may require some maintenance.
However, I see that there is a strong opinion on using libraries, so I'll implement the corresponding changes, but I've already highlighted the possible risks as well.

Done.

@dmorokov force-pushed the dev-cvv-columns-test branch 3 times, most recently from 42a0b6e to 663d9db on June 22, 2022 01:48
Collaborator

@shevek-google left a comment

Can we please include some instructions on how to run these tests and what contracts they aim to enforce?

My major concern now is that these tests are enforcing an unstable contract, and that we should in fact be enforcing a contract on the output of the loader, given a dump.

The Kokoro build script invokes the Dumper, and the Dumper extracts metadata to a local folder on a build worker;
the path to that folder is stored in the EXPORT_PATH env var.
After that, the integration tests are run and pointed at EXPORT_PATH to read the CSV files from there.
The logic of the integration tests is to make an SQL request to the DB, extract the data, and keep it in memory,
then read the CSV files, deserialize them, and compare them with the extraction from the DB, checking that both data sets
are equal field-to-field.
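In code, that flow amounts to something like the sketch below; JdbcUtils, CsvUtil, and SvvColumnsRow come from this PR, but the method and file names used here are hypothetical:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import com.google.common.collect.Multiset;

    public class SvvColumnsFlowSketch {

        public static void main(String[] args) throws Exception {
            // The Dumper has already run; Kokoro exposes its output folder via EXPORT_PATH.
            Path exportPath = Paths.get(System.getenv("EXPORT_PATH"));

            // 1. Query the DB (svv_columns.sql) and keep the rows in memory.
            //    JdbcUtils.selectRows is a hypothetical helper name.
            Multiset<SvvColumnsRow> dbRows = JdbcUtils.selectRows("svv_columns.sql");

            // 2. Read and deserialize the CSV file produced by the Dumper.
            //    CsvUtil.readRows is likewise hypothetical.
            Multiset<SvvColumnsRow> csvRows = CsvUtil.readRows(exportPath.resolve("svv_columns.csv"));

            // 3. Field-to-field comparison of the two data sets (the PR's assertion helper).
            assertMultisetsEqual(dbRows, csvRows);
        }
    }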
@dmorokov force-pushed the dev-cvv-columns-test branch from 663d9db to 03e12df on June 28, 2022 23:22