Describe the bug
When reading and slicing a subset of a large DataFrame:
- The entire DataFrame appears to be loaded into memory.
- A slice is taken and returned, likely as a view retaining a reference to the original.
- If this operation is repeated in a loop and each slice is stored (e.g., in a list), the original large DataFrames are never deallocated.
This causes cumulative memory usage to increase continuously, eventually leading to an out-of-memory crash.
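For context, here is a minimal pure-numpy sketch of the suspected mechanism (an illustration only, not ArcticDB code): a tiny view keeps the full parent buffer alive until the view is dropped or copied.
import numpy as np

# A large array and a small view into it.
big = np.random.rand(100_000, 1_000)   # ~0.8 GB of float64
small = big[:3]                        # 3-row view, not a copy

# The view still references the parent buffer, so the ~0.8 GB cannot be freed
# while `small` is alive, even after `del big`.
print(small.base is big)               # True

# An explicit copy detaches the data and lets the parent buffer be released.
small_copy = small.copy()
print(small_copy.base is None)         # True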
Steps/Code to Reproduce
1. Create demo data
import pandas as pd
import numpy as np
from datetime import datetime


def generate_random_dataframe(n_rows=25, n_cols=10, start_date="2000-01-01", freq="D"):
    """
    Generate a random DataFrame with a datetime index.

    Args:
        n_rows (int): Number of rows in the DataFrame.
        n_cols (int): Number of columns in the DataFrame.
        start_date (str): Start date for the datetime index.
        freq (str): Frequency for the datetime index (e.g., 'D' for daily, 'H' for hourly).

    Returns:
        pd.DataFrame: A random DataFrame with datetime as the index.
    """
    # Generate column names
    cols = [f"COL_{i}" for i in range(n_cols)]
    # Generate random data
    data = np.random.randint(0, 100, size=(n_rows, n_cols))
    # Create a datetime index
    index = pd.date_range(start=start_date, periods=n_rows, freq=freq)
    # Create the DataFrame
    df = pd.DataFrame(data, columns=cols, index=index)
    return df
Write a DataFrame with 20 years of daily rows and 10,000 columns of random data to Arctic:
df = generate_random_dataframe(
    n_rows=255 * 20, n_cols=10_000, start_date="1990-01-01", freq="D"
)
df.head()
import arcticdb as adb
uri = "lmdb://tmp/arcticdb_leak"
ac = adb.Arctic(uri)
library = ac.get_library("demo_lib", create_if_missing=True)
library.write("test_frame", df)
2. Read Data
Helper function to get memory usage of a list of DataFrames:
def get_total_dataframe_size_gb(df_list):
    """
    Calculate total memory usage of a list of DataFrames in gigabytes.

    Parameters:
        df_list (list of pd.DataFrame): List of DataFrames.

    Returns:
        float: Total size in GB.
    """
    total_bytes = sum(df.memory_usage(deep=True).sum() for df in df_list)
    return total_bytes / (1024**3)
Read the full dataframe for reference:
from_storage_df = library.read("test_frame").data
print(f"Shape of data: {from_storage_df.shape}")
print(f"Size of data: {get_total_dataframe_size_gb([from_storage_df]):.2f} GB")
Read only a slice:
n_rows_to_read = 3
from_date = from_storage_df.index[1]
to_date = from_storage_df.index[n_rows_to_read]
small_df = library.read(
    "test_frame",
    date_range=(from_date, to_date),
).data
print(f"Shape of fetched subset of data: {small_df.shape}")
print(
    f"Size of fetched subset of data: {get_total_dataframe_size_gb([small_df]):.4f} GB"
)
Now read the small slice in a loop and save results in a list:
retrieved_data = []
n_times_to_fetch = 100

for i in range(n_times_to_fetch):
    small_df = library.read(
        "test_frame",
        date_range=(
            from_date,
            to_date,
        ),
    ).data
    retrieved_data.append(small_df)
    if i % 10 == 0:
        print(f"Fetched small subset {i} times")
        print(
            f"  Total size of retrieved data so far: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
        )
        print()

print(
    f"Total size of retrieved data: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
)
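To watch the actual process footprint rather than the pandas-reported DataFrame sizes, one can also print RSS inside the loop (an optional sketch; assumes psutil is installed):
import os
import psutil

# Resident set size of the current process, in GB.
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
print(f"Process RSS: {rss_gb:.2f} GB")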
Expected Results
Memory usage should increase in total by only ca. 0.0002 GB per fetched slice (ca. 0.02 GB over the 100 fetches); instead, it increases by several hundred MB per iteration, eventually leading to an out-of-memory crash.
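For reference, a rough back-of-the-envelope calculation of the per-slice size (int64 values, ignoring the index):
# 3 rows x 10,000 columns of int64 values:
rows, cols, bytes_per_value = 3, 10_000, 8
per_slice_gb = rows * cols * bytes_per_value / 1024**3
print(f"{per_slice_gb:.5f} GB per slice")             # ~0.00022 GB
print(f"{per_slice_gb * 100:.3f} GB for 100 reads")   # ~0.02 GB total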
OS, Python Version and ArcticDB Version
- Linux
- Python 3.10.12
- ArcticDB 5.2.3
Backend storage used
LMDB
Additional Context
We're able to bypass the problem by passing adb.QueryBuilder().date_range((from_date, to_date)) to library.read instead of the date_range argument, but it's not clear whether this is the intended way to read a slice, or why the most obvious way of doing it causes this memory leak.
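For reference, the workaround looks roughly like this (a sketch using the same library, from_date and to_date as above):
import arcticdb as adb

# Filter via a QueryBuilder instead of the date_range= argument of read().
q = adb.QueryBuilder().date_range((from_date, to_date))
small_df = library.read("test_frame", query_builder=q).data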