Update_batch additional tests #2437
Conversation
if "uint" not in str(dtype): | ||
platform_int_info = np.iinfo("int_") | ||
else: | ||
platform_int_info = iinfo |
I don't see why this is needed. Why is iinfo = np.iinfo(dtype) not enough?
That is a common function which was not working on uints. The else branch was added to make it work for uints; the other part is preserved so it behaves as it used to and no unexpected problems arise. So I cannot say exactly why np.iinfo("int_") was chosen.
This works in both numpy 1 and numpy 2:
>>> import numpy as np
>>> np.iinfo("uint32")
iinfo(min=0, max=4294967295, dtype=uint32)
>>> np.iinfo("uint16")
iinfo(min=0, max=65535, dtype=uint16)
Also, avoid string checks on dtypes. Prefer the built-in functions of numpy or pandas to check the properties of a dtype.
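A minimal sketch of the suggested predicate-based check, reusing the variable names from the snippet above (is_unsigned_integer_dtype is the public pandas API; the example dtype is hypothetical):

import numpy as np
import pandas as pd

dtype = np.dtype("uint16")  # hypothetical example input

if pd.api.types.is_unsigned_integer_dtype(dtype):
    platform_int_info = np.iinfo(dtype)   # np.iinfo handles uints directly
else:
    platform_int_info = np.iinfo("int_")  # preserves the original fallback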
for i in range(cols):
    dtype = upgradable_dtypes[i % len(upgradable_dtypes)]
    desired_type = self._resolve(dtype)
    if 'int' in str(desired_type):
numpy has an issubdtype function. Use that instead of comparing strings.
I'm not sure what the benefit of doing that is. This works for both int and uint; if I make the change there will be at least one more elif.
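For reference, np.issubdtype treats both signed and unsigned integer dtypes as subtypes of np.integer, so a single check covers int and uint without an extra elif (a minimal sketch):

import numpy as np

# np.integer is the common parent of signed and unsigned integer dtypes.
assert np.issubdtype(np.dtype("int32"), np.integer)
assert np.issubdtype(np.dtype("uint16"), np.integer)
assert not np.issubdtype(np.dtype("float64"), np.integer)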
read_data = dict()
for result in read_results:
    read_data[result.symbol] = result
return read_data
A dict comprehension would make this one line.
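The suggested one-liner, reusing the names from the snippet above:

return {result.symbol: result for result in read_results}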
return update_df

def read_batch(lib: Library, symbol_names: List[str]) -> Dict[str, VersionedItem]:
The type hint is wrong. The result of a batch read is either VersionedItem or DataError. I'd also change the name so that it's not exactly the same as in the library API; I believe it's easier to read that way.
It is now:
def read_batch_as_dict(lib: Library, symbol_names: List[str]) -> Dict[str, Union[VersionedItem, DataError]]:
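Putting the two review points together, the helper might look like this (a sketch; the import paths are assumptions based on the error log further down, not verified against the PR):

from typing import Dict, List, Union

from arcticdb.version_store.library import Library
from arcticdb.version_store._store import VersionedItem
from arcticdb_ext.version_store import DataError


def read_batch_as_dict(lib: Library, symbol_names: List[str]) -> Dict[str, Union[VersionedItem, DataError]]:
    # Both VersionedItem and DataError expose .symbol, so one comprehension
    # keys the results whether a read succeeded or failed.
    return {result.symbol: result for result in lib.read_batch(symbol_names)}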
self._type_conversion_dict: Dict[type, type] = self.no_upgrade_types()
self.seed = seed

def define_upgrade_types(self, mappings: Dict[type, type]):
I think that either both define_upgrade_types and no_upgrade_types should return a value, or neither should.
I haven't seen type used as a type hint before. What does it do?
BEFORE = 0
RIGHT_BEFORE = 1
I think BEFORE and RIGHT_BEFORE need a comment or a different name. It looks to me like both of those will generate data that does not overlap with the dataframe on disk.
I don't see any options that have some data outside of the on-disk range and then overlap with either the beginning or the end, e.g.
on disk: [5, 10]
update: [3, 7]
update: [7, 13]
I also don't see anything that's larger than the dataframe on disk:
update: [3, 13]
Is this intentional?
Great catch! You are correct, those three were indeed missing.
class UpdatePositionType(Enum):
    '''Enum specifying the position of an update's start and end times relative to the
    original dataframe's start and end times.'''
    BEFORE = 0                    # End of the update is well before the start of the original DF
    RIGHT_BEFORE = 1              # End of the update is exactly before the start of the original DF
    BEFORE_OVERLAP_START = 2      # End of the update overlaps with the start of the original DF
    INSIDE_OVERLAP_BEGINNING = 3  # Start of the update overlaps with the start of the original DF, no restrictions on the end
    TOTAL_OVERLAP = 4             # Exact same start, end and number of rows as the original DF
    OVERSHADOW_ORIGINAL = 5       # Starts before the start of the original DF and ends after it
    INSIDE = 6                    # Strictly inside the original DF; the update overlaps neither its start nor its end
    INSIDE_OVERLAP_END = 7        # End of the update overlaps with the end of the original DF, no restrictions on the start
    AFTER_OVERLAP_END = 8         # End of the original DF overlaps with the start of the update
    RIGHT_AFTER = 9               # Start of the update is exactly next to the end of the original DF
    AFTER = 10                    # Start of the update is at least one period after the end of the original DF
Added:
- BEFORE_OVERLAP_START
- OVERSHADOW_ORIGINAL
- AFTER_OVERLAP_END
The most critical overlap case is always an overlap of one row at the beginning and one at the end, and all of that is covered now.
I think some examples would be useful; IMO RIGHT_BEFORE and BEFORE_OVERLAP_START are a bit ambiguous. Is this correct, assuming a df with index [5, 10]?
- RIGHT_BEFORE -> [2, 4]
- BEFORE_OVERLAP_START -> [2, 5]
elif position_type == UpdatePositionType.INSIDE_OVERLAP_BEGINNING:
    update_df = self.generator.get_dataframe(number_cols, number_rows, start)
elif position_type == UpdatePositionType.INSIDE:
    to_update = number_rows if (number_rows + 2) <= rows else rows - 2
Why do we check if number_rows + 2 <= rows? I think this needs a comment.
I think I have documented that now. The idea is that the generated DF must physically sit inside the original one, so the returned dataframe may have a different number of rows than requested.
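A worked example of the clamp, using the names from the snippet above: with 10 rows on disk, a 20-row update request is trimmed to 8 rows so at least one original row survives on each side:

rows = 10         # rows in the original dataframe on disk
number_rows = 20  # rows requested for the update

# The update must sit strictly inside the original range, leaving at
# least one untouched row before it and one after it.
to_update = number_rows if (number_rows + 2) <= rows else rows - 2
assert to_update == 8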
for index, result in enumerate(update_result):
    symbol = symbol_names[index]
    if not hasattr(result, "version"):
        # To catch rare problem on 3.8
Elaborate on this. Does this happen only on Python 3.8? Is it reproducible or flaky? Is it a known issue? Have we done any investigation into it? It seems pretty odd to have an attribute randomly missing from an object. Is this related to batch_update, or can it happen on any call?
This is a bug escape I am currently investigating. I will log an issue once all the details are revealed, but it looks like there is one case we do not handle on 3.8:
ERROR:tests.integration.arcticdb.test_update:Expected Version UNKNOWN of symbol 'some heck of a symbol!.!0' unable to be retrieved: Error Category: 'UNKNOWN' Exception String: 'arcticdb::ArcticCategorizedException<(arcticdb::ErrorCategory)4>: E_DESCRIPTOR_MISMATCH No valid common type between TD<type=FLOAT64, dim=0> and TD<type=UTF_DYNAMIC64, dim=0> for column col_2' with version attribute <arcticdb_ext.version_store.DataError object at 0x7f8594b7cef0>
There is an issue logged now, and the part that was failing is now xfailed. It is Linux 3.8 specific but points to a logic weakness.
""" Simulates partially dynamic update, when update df has new columns. | ||
The scenario where there are type promotions of existing columns is not covered yet | ||
""" | ||
def add_missing_ending_columns(to_df: pd.DataFrame, from_df: pd.DataFrame) -> pd.DataFrame: |
Why is this a nested function? Also, why is the return value ignored?
It is private to the dataframe_simulate_arcticdb_update_dynamic function and not intended to be reused.
The plan is for dataframe_simulate_arcticdb_update_dynamic to eventually become a full simulator of Arctic's update logic.
I removed the returned value and added a doc comment. Basically, to_df is modified in place, so there is no reason to return a value.
class DataFrameGenerator(ABC):

    @abstractmethod
    def get_dataframe(self, number_columns: int, number_rows: int,
                      start_time: Union[Timestamp, TimestampNumber], **kwargs) -> pd.DataFrame:
        pass
I don't see how we benefit from such an abstraction. Python lists are heterogeneous, this doesn't add any functionality that can be reused by the derived classes, and the check that get_dataframe exists happens only at runtime. It looks like Java translated to Python.
This is a standard OOP pattern for an interface, and since Python supports OOP it should be fine here as well. Moreover, this exact approach is already used in the ASV real-storage tests, so I used it here too. What would be the more pythonic way to express that a certain class supports this specific method, which is the basis of update generation?
The pythonic way is not to have an abstract class at all and to use duck typing. Python is a multi-paradigm language, but just because it supports multiple paradigms doesn't mean we have to use all of them.
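For comparison, a structural-typing alternative using typing.Protocol (available since Python 3.8) keeps a static contract without forcing inheritance. The names mirror the snippet above, but make_update_df and the Timestamp-only signature are illustrative assumptions:

from typing import Protocol

import pandas as pd
from pandas import Timestamp


class SupportsGetDataframe(Protocol):
    def get_dataframe(self, number_columns: int, number_rows: int,
                      start_time: Timestamp, **kwargs) -> pd.DataFrame:
        ...


def make_update_df(gen: SupportsGetDataframe) -> pd.DataFrame:
    # Any object with a structurally matching get_dataframe is accepted;
    # no subclassing of an ABC is required.
    return gen.get_dataframe(3, 5, Timestamp("2024-01-01"))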
Reference Issues/PRs
What does this implement or fix?
Any other comments?
Included are 3 integration tests covering update_batch with all supported types of dataframes and different types of updates over a symbol. They cover both static and dynamic schema, with checks for type promotion. Updates on static schema are also tested against symbols spread across several segments.
All tests also cover real storages.
Several utilities have also been added to help future integration-scenario testing with update and update_batch.
A small set of ASV tests was also developed to keep track of time and memory usage.