18 changes: 17 additions & 1 deletion breadbox/breadbox/io/data_validation.py
@@ -27,6 +27,7 @@
AnnotationValidationError,
)
from breadbox.schemas.dataset import ColumnMetadata
import pyarrow.parquet as pq

pd.set_option("mode.use_inf_as_na", True)

@@ -83,6 +84,20 @@ def annotation_type_to_pandas_column_type(annotation_type: AnnotationType):
return dtype


def _read_parquet_with_less_memory(filename):
# it seems like pd.read_parquet() should be the right thing to do
# but in practice, when reading a file with 20k columns, the memory usage balloons
# to > 30GB and would take down breadbox. Reading it in as a table, and then
# processing it column by column seems to avoid this problem. Not 100% sure that this
# results in identical behavior, but it seems valid for numerical matrices
table = pq.read_table(filename)
Member:
If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).

But I am curious... I found this GitHub issue suggesting the problem is the pyarrow engine that pandas uses. I wonder if you could use a simpler solution of just doing:

pandas.read_parquet(engine='fastparquet')

Contributor (Author):

The memory leak described there looks fairly different from what I've observed. However, that's a good point that the issue may stem from pyarrow rather than pandas.

I tried

pandas.read_parquet(engine='fastparquet')

and yes, that works fine. We'll have to add fastparquet as a dependency, but that sounds fine.

I'll switch the code to use that instead.

df_dict = {
column_name: column
for column_name, column in zip(table.column_names, table.columns)
}
return pd.DataFrame(df_dict).convert_dtypes()
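
For reference (not part of this diff): the fastparquet-based replacement agreed on in the thread above might look roughly like the sketch below. The function name is hypothetical; it assumes fastparquet has been added as a dependency and that convert_dtypes() is still wanted to preserve the nullable-dtype behavior of the original use_nullable_dtypes=True call.

import pandas as pd

def _read_parquet_with_fastparquet(file):
    # Hypothetical sketch: let the fastparquet engine do the read, sidestepping
    # the memory blow-up observed with the default pyarrow engine on ~20k columns.
    # convert_dtypes() converts columns to nullable extension dtypes, mirroring
    # the original use_nullable_dtypes=True behavior.
    return pd.read_parquet(file, engine="fastparquet").convert_dtypes()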


def _validate_dimension_type_metadata_file(
metadata_file: Optional[UploadFile],
annotation_type_mapping: Dict[str, AnnotationType],
@@ -178,7 +193,8 @@ def _validate_data_value_type(


def _read_parquet(file: BinaryIO, value_type: ValueType) -> pd.DataFrame:
df = pd.read_parquet(file, use_nullable_dtypes=True) # pyright: ignore
# df = pd.read_parquet(file, use_nullable_dtypes=True) # pyright: ignore
df = _read_parquet_with_less_memory(file)

# the first column will be treated as the index. Make sure it's of type string
df[df.columns[0]] = df[df.columns[0]].astype("string")
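
As an aside on the dtype handling (illustrative only, not part of this diff): the commented-out call relied on use_nullable_dtypes=True, while the new path relies on convert_dtypes() inside _read_parquet_with_less_memory to get nullable extension dtypes. A minimal sketch of what that conversion does for a float column with missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
print(df.dtypes)                   # x: float64
print(df.convert_dtypes().dtypes)  # x: Float64 (nullable; NaN becomes pd.NA)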