Update_batch additional tests #2437


Open: wants to merge 16 commits into master

Conversation

grusev (Collaborator) commented Jul 2, 2025

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Included are 3 integration tests covering update_batch with all supported dataframe types and different types of updates over a symbol. They cover both static and dynamic schema, with checks for type promotion. Updates on static schema are also tested against symbols spread across several segments.

All tests also run against real storages.

Several utilities have also been added to support future integration-test scenarios for update and update_batch.

A small set of ASV benchmarks was also developed to keep track of time and memory usage.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev grusev added the patch Small change, should increase patch version label Jul 2, 2025
@grusev grusev requested a review from IvoDD July 2, 2025 08:44
@grusev grusev marked this pull request as ready for review July 4, 2025 08:27
Comment on lines +424 to +427
if "uint" not in str(dtype):
platform_int_info = np.iinfo("int_")
else:
platform_int_info = iinfo
Collaborator

I don't see why this is needed. Why is iinfo = np.iinfo(dtype) not enough?

Collaborator Author

That is a common function which was not working on uints. In order to make it work for uints the else branch was added; the other branch is preserved so it behaves as it used to and no unexpected problems arise. So I cannot comment on exactly why np.iinfo("int_") was chosen.

Collaborator

This works in both numpy 1 and numpy 2:

>>> import numpy as np
>>> np.iinfo("uint32")
iinfo(min=0, max=4294967295, dtype=uint32)
>>> np.iinfo("uint16")
iinfo(min=0, max=65535, dtype=uint16)

Collaborator

Also, avoid string checks on dtypes. Prefer the built-in functions of either numpy or pandas to check the properties of the dtype.
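
For example, a minimal sketch of the kind of dtype-aware check meant here (illustrative, not code from this PR):

import numpy as np
import pandas as pd

dtype = np.dtype("uint32")
# Instead of checking whether "uint" appears in str(dtype), ask the dtype directly:
print(np.issubdtype(dtype, np.unsignedinteger))       # True
print(pd.api.types.is_unsigned_integer_dtype(dtype))  # True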

for i in range(cols):
    dtype = upgradable_dtypes[i % len(upgradable_dtypes)]
    desired_type = self._resolve(dtype)
    if 'int' in str(desired_type):
Collaborator

numpy has an issubdtype function. Use that instead of comparing strings.
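
A short sketch of the suggestion, assuming desired_type is a numpy type (illustrative only):

import numpy as np

# One check covers both signed and unsigned integers, so no extra elif is needed:
for dt in (np.int32, np.uint16, np.float64):
    print(dt.__name__, np.issubdtype(dt, np.integer))
# int32 True, uint16 True, float64 False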

Collaborator Author

Not sure what the benefit of doing that is. This works for int and uint; if I make the change there will be at least one more elif.

Comment on lines 204 to 207
read_data = dict()
for result in read_results:
    read_data[result.symbol] = result
return read_data
Collaborator

A dict comprehension would make this a one-liner.
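
Something along these lines (a sketch of the reviewer's suggestion):

return {result.symbol: result for result in read_results}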

return update_df


def read_batch(lib: Library, symbol_names: List[str]) -> Dict[str, VersionedItem]:
Collaborator

The type hint is wrong. The result of a batch read is either VersionedItem or DataError. I'd also change the name so that it's not exactly the same as in the library API; I believe it's easier to read that way.

Collaborator Author (Jul 14, 2025)

It is now: def read_batch_as_dict(lib: Library, symbol_names: List[str]) -> Dict[str, Union[VersionedItem, DataError]]:

    self._type_conversion_dict: Dict[type, type] = self.no_upgrade_types()
    self.seed = seed

def define_upgrade_types(self, mappings: Dict[type, type]):
Collaborator

I think that either both define_upgrade_types and no_upgrade_types should return a value, or neither of them should.

I haven't seen type used as a type hint before. What does it do?
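
For reference, type as a hint simply means that the keys and values of the dict are classes themselves; a tiny illustrative example (the mapping values here are hypothetical):

import numpy as np
from typing import Dict

# A promotion map from a narrower to a wider numpy type; keys and values are classes.
upgrade_map: Dict[type, type] = {np.int8: np.int16, np.uint8: np.int32}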

Comment on lines 41 to 42
BEFORE = 0
RIGHT_BEFORE = 1
Collaborator

I think BEFORE and RIGHT_BEFORE need some comment or a different name. It looks to me that both of those will generate data that does not overlap with the dataframe on disk.

I don't see any options that have some data outside of the on-disk dataframe and then overlap with either its beginning or its end, e.g.
on disk: [5, 10]
update: [3, 7]
update: [7, 13]
I also don't see anything that's larger than the dataframe on disk:
update: [3, 13]

Is this intentional?

Collaborator Author

Great catch! You are correct, those three are indeed missing.

class UpdatePositionType(Enum):
    '''Enum specifying the position of an update's start or end time relative to
    the original dataframe's start and end time'''
    BEFORE = 0                    # End of the update is well before the start of the original DF
    RIGHT_BEFORE = 1              # End of the update is exactly before the start of the original DF
    BEFORE_OVERLAP_START = 2      # End of the update overlaps with the start of the original DF
    INSIDE_OVERLAP_BEGINNING = 3  # Start of the update overlaps with the start of the original DF, no restriction on the end
    TOTAL_OVERLAP = 4             # Exact same start, end and number of rows as the original DF
    OVERSHADOW_ORIGINAL = 5       # Starts before the start of the original DF and ends after it
    INSIDE = 6                    # Entirely inside the original DF; the update overlaps neither its start nor its end
    INSIDE_OVERLAP_END = 7        # End of the update and end of the original overlap, no restriction on the start
    AFTER_OVERLAP_END = 8         # End of the original overlaps with the start of the update
    RIGHT_AFTER = 9               # Start of the update is immediately after the end of the original
    AFTER = 104                   # Start of the update is at least one duration after the end of the original

Added:
- BEFORE_OVERLAP_START
- OVERSHADOW_ORIGINAL
- AFTER_OVERLAP_END

The most critical overlap case is always a one-row overlap at the beginning and at the end, and all of that is now covered.

Collaborator

I think some examples would be useful; IMO RIGHT_BEFORE and BEFORE_OVERLAP_START are a bit ambiguous. Is this correct, assuming a df with index [5, 10]?

  • RIGHT_BEFORE -> [2, 4]
  • BEFORE_OVERLAP_START -> [2, 5]

elif position_type == UpdatePositionType.INSIDE_OVERLAP_BEGINNING:
    update_df = self.generator.get_dataframe(number_cols, number_rows, start)
elif position_type == UpdatePositionType.INSIDE:
    to_update = number_rows if (number_rows + 2) <= rows else rows - 2
Collaborator

Why do we check if number_rows + 2 <= rows? I think this needs a comment.

Collaborator Author

I think I have documented that now. The idea is that the generated DF must physically fit inside the original one, so the returned dataframe may have a different number of rows than the number requested.
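
A short worked sketch of that clamping, using hypothetical numbers:

rows = 10          # rows in the original DF on disk
number_rows = 12   # rows requested for the update DF

# Keep at least one original row before and after the update range,
# so the update stays strictly inside the original DF.
to_update = number_rows if (number_rows + 2) <= rows else rows - 2
print(to_update)   # 8, not the requested 12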

for index, result in enumerate(update_result):
    symbol = symbol_names[index]
    if not hasattr(result, "version"):
        # To catch rare problem on 3.8
Collaborator

Elaborate on this. Does this happen only on Python 3.8? Is it reproducible or flaky? Is it a known issue? Have we done any investigation on that? It seems pretty odd to have an attribute randomly missing from an object. Is this related to update_batch, or can it happen on any call?

Collaborator Author

This is a bug escape that I am currently investigating. I will log an issue once all details are revealed, but it looks like we do not handle one case on 3.8:

ERROR:tests.integration.arcticdb.test_update:Expected Version UNKNOWN of symbol 'some heck of a symbol!.!0' unable to be retrieved: Error Category: 'UNKNOWN' Exception String: 'arcticdb::ArcticCategorizedException<(arcticdb::ErrorCategory)4>: E_DESCRIPTOR_MISMATCH No valid common type between TD<type=FLOAT64, dim=0> and TD<type=UTF_DYNAMIC64, dim=0> for column col_2' with version attribute <arcticdb_ext.version_store.DataError object at 0x7f8594b7cef0>

Collaborator Author

There is an issue logged now, and the part that was failing is now xfailed. It is Linux 3.8 specific but underlines a logic weakness.

""" Simulates partially dynamic update, when update df has new columns.
The scenario where there are type promotions of existing columns is not covered yet
"""
def add_missing_ending_columns(to_df: pd.DataFrame, from_df: pd.DataFrame) -> pd.DataFrame:
Collaborator

Why is this a nested function? Also, why is the return value ignored?

Collaborator Author

It is private to the dataframe_simulate_arcticdb_update_dynamic function and not intended to be reused. The plan is for dataframe_simulate_arcticdb_update_dynamic to eventually become a full simulator of ArcticDB's update logic.
I removed the returned value and added documentation. Basically, to_df is modified in place, so there is no reason to return a value.

Comment on lines +50 to +55
class DataFrameGenerator(ABC):

    @abstractmethod
    def get_dataframe(self, number_columns: int, number_rows: int,
                      start_time: Union[Timestamp, TimestampNumber], **kwargs) -> pd.DataFrame:
        pass
Collaborator

I don't see how we benefit from such an abstraction. Python lists are heterogeneous, this doesn't add any functionality that can be reused by the derived classes, and the check that get_dataframe exists only happens at runtime. It looks like Java translated to Python.

Collaborator Author

This is a standard OOP pattern for an interface; as Python supports OOP, it should be fine here too. Moreover, this exact approach is already used in the ASV real-storage tests, so I used it here as well. What would then be the more pythonic way to express that a certain class supports this specific method, which is the basis of update generation?

Collaborator

The pythonic way is not to have an abstract class at all and to use duck typing. Python is a multi-paradigm language, but just because it supports multiple paradigms doesn't mean we have to use all of them.
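
A minimal sketch of the duck-typing alternative described here (names and data are illustrative, not taken from the PR):

import pandas as pd

# No abstract base class: any object exposing get_dataframe works.
class RandomDataFrameGenerator:
    def get_dataframe(self, number_columns, number_rows, start_time, **kwargs):
        index = pd.date_range(start_time, periods=number_rows, freq="s")
        data = {f"col_{i}": range(number_rows) for i in range(number_columns)}
        return pd.DataFrame(data, index=index)

def build_update(generator, number_columns, number_rows, start_time):
    # Duck typing: we only rely on the method existing at call time.
    return generator.get_dataframe(number_columns, number_rows, start_time)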
