Update_batch additional tests #2437


Open: wants to merge 16 commits into master

Conversation

grusev (Collaborator) commented Jul 2, 2025

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Included are 3 integration tests covering update_batch with all supported dataframe types and different types of updates over a symbol. They cover both static and dynamic schema, with checks for type promotion. Updates on static schema are also tested against symbols spread across several segments.

All tests also run against real storages.

Several utilities have also been added to support future integration-test scenarios for update and update_batch.

A small set of ASV benchmarks was also developed to keep track of time and memory usage.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev grusev added the patch Small change, should increase patch version label Jul 2, 2025
@grusev grusev requested a review from IvoDD July 2, 2025 08:44
@grusev grusev marked this pull request as ready for review July 4, 2025 08:27
Comment on lines +424 to +427
if "uint" not in str(dtype):
platform_int_info = np.iinfo("int_")
else:
platform_int_info = iinfo
Collaborator

I don't see why this is needed. Why is iinfo = np.iinfo(dtype) not enough?

Collaborator Author

That is a common function which was not working on uints. In order to make it work for uints the else branch was added; the other branch is preserved so it behaves as it used to and no unexpected problems arise. So I cannot comment on exactly why np.iinfo("int_") was chosen.

Collaborator

This works in both numpy 1 and numpy 2:

>>> import numpy as np
>>> np.iinfo("uint32")
iinfo(min=0, max=4294967295, dtype=uint32)
>>> np.iinfo("uint16")
iinfo(min=0, max=65535, dtype=uint16)

Collaborator

Also, avoid string checks on dtypes. Prefer the built-in functions of either numpy or pandas to check the properties of the dtype.
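
For example, a minimal sketch of the kind of dtype-aware check meant here (illustrative, not code from this PR):

import numpy as np
import pandas as pd

dtype = np.dtype("uint32")
# Instead of checking whether "uint" appears in str(dtype), ask the dtype directly:
print(np.issubdtype(dtype, np.unsignedinteger))       # True
print(pd.api.types.is_unsigned_integer_dtype(dtype))  # True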

for i in range(cols):
    dtype = upgradable_dtypes[i % len(upgradable_dtypes)]
    desired_type = self._resolve(dtype)
    if 'int' in str(desired_type):
Collaborator

numpy has an issubdtype function. Use that instead of comparing strings.
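
A short sketch of the suggestion, assuming desired_type is a numpy type (illustrative only):

import numpy as np

# One check covers both signed and unsigned integers, so no extra elif is needed:
for dt in (np.int32, np.uint16, np.float64):
    print(dt.__name__, np.issubdtype(dt, np.integer))
# int32 True, uint16 True, float64 False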

Collaborator Author

Not sure what the benefit of doing that is. This works for int and uint; if I make the change there will be at least one more elif.

Comment on lines 204 to 207
read_data = dict()
for result in read_results:
    read_data[result.symbol] = result
return read_data
Collaborator

A dict comprehension would make this a one-liner.
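
Something along these lines (a sketch of the reviewer's suggestion):

return {result.symbol: result for result in read_results}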

return update_df


def read_batch(lib: Library, symbol_names: List[str]) -> Dict[str, VersionedItem]:
Collaborator

The type hint is wrong. The result of a batch read is either VersionedItem or DataError. I'd also change the name so that it's not exactly the same as in the library API; I believe it's easier to read that way.

Collaborator Author (Jul 14, 2025)

It is now: def read_batch_as_dict(lib: Library, symbol_names: List[str]) -> Dict[str, Union[VersionedItem, DataError]]:

    self._type_conversion_dict: Dict[type, type] = self.no_upgrade_types()
    self.seed = seed

def define_upgrade_types(self, mappings: Dict[type, type]):
Collaborator

I think that either both define_upgrade_types and no_upgrade_types should return a value, or neither of them should.

I haven't seen type used as a type hint before. What does it do?
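
For reference, type as a hint simply means that the keys and values of the dict are classes themselves; a tiny illustrative example (the mapping values here are hypothetical):

import numpy as np
from typing import Dict

# A promotion map from a narrower to a wider numpy type; keys and values are classes.
upgrade_map: Dict[type, type] = {np.int8: np.int16, np.uint8: np.int32}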

Comment on lines 41 to 42
BEFORE = 0
RIGHT_BEFORE = 1
Collaborator

I think BEFORE and RIGHT_BEFORE need some comment or a different name. It looks to me that both of those will generate data that does not overlap with the dataframe on disk.

I don't see any options that have some data outside of the on-disk dataframe and then overlap with either its beginning or its end, e.g.
on disk: [5, 10]
update: [3, 7]
update: [7, 13]
I also don't see anything that's larger than the dataframe on disk:
update: [3, 13]

Is this intentional?

Collaborator Author

Great catch! You are correct, those three are indeed missing.

class UpdatePositionType(Enum):
    '''Enum specifying the position of an update's start or end time relative to
    the original dataframe's start and end time'''
    BEFORE = 0                    # End of the update is well before the start of the original DF
    RIGHT_BEFORE = 1              # End of the update is exactly before the start of the original DF
    BEFORE_OVERLAP_START = 2      # End of the update overlaps with the start of the original DF
    INSIDE_OVERLAP_BEGINNING = 3  # Start of the update overlaps with the start of the original DF, no restriction on the end
    TOTAL_OVERLAP = 4             # Exact same start, end and number of rows as the original DF
    OVERSHADOW_ORIGINAL = 5       # Starts before the start of the original DF and ends after it
    INSIDE = 6                    # Entirely inside the original DF; the update overlaps neither its start nor its end
    INSIDE_OVERLAP_END = 7        # End of the update and end of the original overlap, no restriction on the start
    AFTER_OVERLAP_END = 8         # End of the original overlaps with the start of the update
    RIGHT_AFTER = 9               # Start of the update is immediately after the end of the original
    AFTER = 104                   # Start of the update is at least one duration after the end of the original

Added:
- BEFORE_OVERLAP_START
- OVERSHADOW_ORIGINAL
- AFTER_OVERLAP_END

The most critical overlap case is always a one-row overlap at the beginning and at the end, and all of that is now covered.

Collaborator

I think some examples would be useful; IMO RIGHT_BEFORE and BEFORE_OVERLAP_START are a bit ambiguous. Is this correct, assuming a df with index [5, 10]?

  • RIGHT_BEFORE -> [2, 4]
  • BEFORE_OVERLAP_START -> [2, 5]

elif position_type == UpdatePositionType.INSIDE_OVERLAP_BEGINNING:
    update_df = self.generator.get_dataframe(number_cols, number_rows, start)
elif position_type == UpdatePositionType.INSIDE:
    to_update = number_rows if (number_rows + 2) <= rows else rows - 2
Collaborator

Why do we check if number_rows + 2 <= rows? I think this needs a comment.

Collaborator Author

I think I have documented that now. The idea is that the generated DF must physically fit inside the original one, so the returned dataframe may have a different number of rows than the number requested.
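
A short worked sketch of that clamping, using hypothetical numbers:

rows = 10          # rows in the original DF on disk
number_rows = 12   # rows requested for the update DF

# Keep at least one original row before and after the update range,
# so the update stays strictly inside the original DF.
to_update = number_rows if (number_rows + 2) <= rows else rows - 2
print(to_update)   # 8, not the requested 12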

for index, result in enumerate(update_result):
    symbol = symbol_names[index]
    if not hasattr(result, "version"):
        # To catch rare problem on 3.8
Collaborator

Elaborate on this. Does this happen only on Python 3.8? Is it reproducible or flaky? Is it a known issue? Have we done any investigation on that? It seems pretty odd to have an attribute randomly missing from an object. Is this related to update_batch, or can it happen on any call?

Collaborator Author

This is a bug escape that I am currently investigating. I will log an issue once all details are revealed, but it looks like we do not handle one case on 3.8:

ERROR:tests.integration.arcticdb.test_update:Expected Version UNKNOWN of symbol 'some heck of a symbol!.!0' unable to be retrieved: Error Category: 'UNKNOWN' Exception String: 'arcticdb::ArcticCategorizedException<(arcticdb::ErrorCategory)4>: E_DESCRIPTOR_MISMATCH No valid common type between TD<type=FLOAT64, dim=0> and TD<type=UTF_DYNAMIC64, dim=0> for column col_2' with version attribute <arcticdb_ext.version_store.DataError object at 0x7f8594b7cef0>

Collaborator Author

There is an issue logged now, and the part that was failing is now xfailed. It is Linux 3.8 specific but underlines a logic weakness.

""" Simulates partially dynamic update, when update df has new columns.
The scenario where there are type promotions of existing columns is not covered yet
"""
def add_missing_ending_columns(to_df: pd.DataFrame, from_df: pd.DataFrame) -> pd.DataFrame:
Collaborator

Why is this a nested function? Also, why is the return value ignored?

Collaborator Author

It is private to the dataframe_simulate_arcticdb_update_dynamic function and not intended to be reused. The plan is for dataframe_simulate_arcticdb_update_dynamic to eventually become a full simulator of ArcticDB's update logic.
I removed the returned value and added documentation. Basically, to_df is modified in place, so there is no reason to return a value.

Comment on lines +50 to +55
class DataFrameGenerator(ABC):

    @abstractmethod
    def get_dataframe(self, number_columns: int, number_rows: int,
                      start_time: Union[Timestamp, TimestampNumber], **kwargs) -> pd.DataFrame:
        pass
Collaborator

I don't see how we benefit from such an abstraction. Python lists are heterogeneous, this doesn't add any functionality that can be reused by the derived classes, and the check that get_dataframe exists only happens at runtime. It looks like Java translated to Python.

Collaborator Author

This is a standard OOP pattern for an interface; as Python supports OOP, it should be fine here too. Moreover, this exact approach is already used in the ASV real-storage tests, so I used it here as well. What would then be the more pythonic way to express that a certain class supports this specific method, which is the basis of update generation?

Collaborator

The pythonic way is not to have an abstract class at all and to use duck typing. Python is a multi-paradigm language, but just because it supports multiple paradigms doesn't mean we have to use all of them.
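
A minimal sketch of the duck-typing alternative described here (names and data are illustrative, not taken from the PR):

import pandas as pd

# No abstract base class: any object exposing get_dataframe works.
class RandomDataFrameGenerator:
    def get_dataframe(self, number_columns, number_rows, start_time, **kwargs):
        index = pd.date_range(start_time, periods=number_rows, freq="s")
        data = {f"col_{i}": range(number_rows) for i in range(number_columns)}
        return pd.DataFrame(data, index=index)

def build_update(generator, number_columns, number_rows, start_time):
    # Duck typing: we only rely on the method existing at call time.
    return generator.get_dataframe(number_columns, number_rows, start_time)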
